Neural network motion controller
By training a physics-based neural network and combining low-level motion actuators and high-level motion schedulers, the problem of insufficient adaptability of neural networks in animated motion is solved, achieving realistic animation and environmental adaptability of virtual objects, which is suitable for real-time interactive applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2021-09-14
- Publication Date
- 2026-06-23
AI Technical Summary
Existing neural networks struggle to adapt to various types of animated motion during training, resulting in a lack of flexibility and adaptability, and thus failing to effectively promote computer-generated graphics animation.
By training a physics-based neural network, combining low-level motion actuators and high-level motion schedulers, and using constrained multi-objective reward optimization and policy variance control, the animation of virtual objects is achieved.
It achieves robustness and adaptability to various motion types, can generate realistic animation effects in real-time interactive applications, adapts to environmental changes and disturbances, and supports multiple interaction modes.
Figure CN114186673B_ABST
Abstract
Description
Technical Field
[0001] At least one embodiment relates to a processor and computing system for training neural networks to facilitate the animation of computer-generated graphics.
Background Technology
[0002] Training neural networks to facilitate the animation of computer-generated graphics can result in inflexible networks or networks that are not well-suited to facilitating various types of movement in animations. The training of neural networks used to facilitate the animation of computer-generated graphics can be improved.
Attached Figure Description
[0003] Figure 1 shows an example of a system for animating virtual objects according to at least one embodiment;
[0004] Figure 2 shows an example of a motion balancer for training a system for animating virtual objects, according to at least one embodiment;
[0005] Figure 3 shows an example of a video stream scheduler according to at least one embodiment;
[0006] Figure 4 shows an example of a command stream scheduler according to at least one embodiment;
[0007] Figure 5 shows an example of a motion stitching scheduler according to at least one embodiment;
[0008] Figure 6 shows an example of a reactive state initialization scheme according to at least one embodiment;
[0009] Figure 7 shows an example of policy variance control according to at least one embodiment;
[0010] Figure 8 shows an example of animating virtual objects using a training system according to at least one embodiment;
[0011] Figure 9A illustrates inference and/or training logic according to at least one embodiment;
[0012] Figure 9B illustrates inference and/or training logic according to at least one embodiment;
[0013] Figure 10 illustrates the training and deployment of a neural network according to at least one embodiment;
[0014] Figure 11 shows an example data center system according to at least one embodiment;
[0015] Figure 12A shows an example of an autonomous vehicle according to at least one embodiment;
[0016] Figure 12B illustrates an example of camera positions and fields of view for the autonomous vehicle of Figure 12A, according to at least one embodiment;
[0017] Figure 12C is a block diagram illustrating an example system architecture for the autonomous vehicle of Figure 12A, according to at least one embodiment;
[0018] Figure 12D is a diagram illustrating a system for communication between one or more cloud-based servers and the autonomous vehicle of Figure 12A, according to at least one embodiment;
[0019] Figure 13 is a block diagram illustrating a computer system according to at least one embodiment;
[0020] Figure 14 is a block diagram illustrating a computer system according to at least one embodiment;
[0021] Figure 15 shows a computer system according to at least one embodiment;
[0022] Figure 16 shows a computer system according to at least one embodiment;
[0023] Figure 17A shows a computer system according to at least one embodiment;
[0024] Figure 17B shows a computer system according to at least one embodiment;
[0025] Figure 17C shows a computer system according to at least one embodiment;
[0026] Figure 17D shows a computer system according to at least one embodiment;
[0027] Figure 17E and Figure 17F show a shared programming model according to at least one embodiment;
[0028] Figure 18 shows an exemplary integrated circuit and a related graphics processor according to at least one embodiment;
[0029] Figure 19A and Figure 19B show an exemplary integrated circuit and an associated graphics processor according to at least one embodiment;
[0030] Figure 20A and Figure 20B show additional exemplary graphics processor logic according to at least one embodiment;
[0031] Figure 21 shows a computer system according to at least one embodiment;
[0032] Figure 22A shows a parallel processor according to at least one embodiment;
[0033] Figure 22B shows a partitioning unit according to at least one embodiment;
[0034] Figure 22C shows a processing cluster according to at least one embodiment;
[0035] Figure 22D shows a graphics multiprocessor according to at least one embodiment;
[0036] Figure 23 illustrates a multi-graphics processing unit (GPU) system according to at least one embodiment;
[0037] Figure 24 shows a graphics processor according to at least one embodiment;
[0038] Figure 25 is a block diagram illustrating a processor microarchitecture for a processor according to at least one embodiment;
[0039] Figure 26 shows a deep learning application processor according to at least one embodiment;
[0040] Figure 27 is a block diagram of an example neuromorphic processor according to at least one embodiment;
[0041] Figure 28 shows at least a portion of a graphics processor according to one or more embodiments;
[0042] Figure 29 shows at least a portion of a graphics processor according to one or more embodiments;
[0043] Figure 30 shows at least a portion of a graphics processor according to one or more embodiments;
[0044] Figure 31 is a block diagram of a graphics processing engine of a graphics processor according to at least one embodiment;
[0045] Figure 32 is a block diagram of at least a portion of a graphics processor core according to at least one embodiment;
[0046] Figure 33A and Figure 33B illustrate thread execution logic according to at least one embodiment, which includes an array of processing elements of a graphics processor core;
[0047] Figure 34 shows a parallel processing unit (“PPU”) according to at least one embodiment;
[0048] Figure 35 illustrates a general-purpose processing cluster (“GPC”) according to at least one embodiment;
[0049] Figure 36 shows a memory partition unit of a parallel processing unit (“PPU”) according to at least one embodiment;
[0050] Figure 37 illustrates a streaming multiprocessor according to at least one embodiment;
[0051] Figure 38 is an example data flow diagram of an advanced computing pipeline according to at least one embodiment;
[0052] Figure 39 is a system diagram of an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, according to at least one embodiment;
[0053] Figure 40 includes an example illustration of an advanced computing pipeline for processing imaging data according to at least one embodiment;
[0054] Figure 41A includes an example data flow diagram of a virtual instrument supporting an ultrasound device according to at least one embodiment;
[0055] Figure 41B includes an example data flow diagram of a virtual instrument supporting a CT scanner according to at least one embodiment;
[0056] Figure 42A shows a data flow diagram illustrating a process for training a machine learning model according to at least one embodiment; and
[0057] Figure 42B is an example illustration of a client-server architecture that uses a pre-trained annotation model to enhance an annotation tool, according to at least one embodiment.
Detailed Implementation
[0058] Figure 1 shows an example of a system for animating one or more virtual objects according to at least one embodiment. In at least one embodiment, the one or more virtual objects represent characters (such as human figures, animal figures, robots, etc.) or devices (such as robotic arms, manufacturing equipment, etc.). In at least one embodiment, the character or device represented by the one or more virtual objects is animated within a virtual environment (such as a video game or simulation). In at least one embodiment, such an environment is dynamic in that it may have changing characteristics that the animated character or device can interact with, react to, or overcome.
[0059] In at least one embodiment, physics-based animation is used to provide realistic motion and rich interaction with the virtual environment. In at least one embodiment, physics-based animation is used to support a wide variety of motion types with improved performance and efficiency.
[0060] In at least one embodiment, one or more neural networks are trained to determine the amount of force to be applied to the joints of one or more virtual objects. In at least one embodiment, the joints of the corresponding real-world objects are modeled, such as the ankle or knee joints of a human subject. In at least one embodiment, a force, such as torque, is applied to such a model, and a physics-based determination is performed to resolve character movement. In at least one embodiment, the neural network is trained based on aspects of motion demonstrated in a training dataset.
[0061] In at least one embodiment, the data-driven technique scales to a wide variety of motions by learning directly from representations of the movement. In at least one embodiment, a compatible learning technique uses a reward signal based on the distance or similarity between the generated motion and the target motion.
[0062] In at least one embodiment, a policy network maps the character's current state to torques applied to each joint, and is then optimized to minimize this distance. The policy, trained using imitation learning, can successfully drive the virtual character to naturally follow the motion of a reference target in a physically realistic manner. In at least one embodiment, to increase the range of supported motion and allow for better user interaction, a physics-based general neural controller is implemented, enabling a wide range of real-time interactive applications.
[0063] In at least one embodiment, the system 100 includes: a low-level motion actuator 104 for generating physics-based control signals to drive a character to follow a target reference motion; and a high-level motion scheduler 102 for translating various high-level inputs (such as keyboard or joystick commands) into target reference motion. In at least one embodiment, to make the system 100 robust and capable, techniques are used that allow the motion actuator 104 to be trained with reinforcement learning on large-scale motion datasets containing different motion styles. In at least one embodiment, constrained multi-objective reward optimization is used. In at least one embodiment, a motion balancer is used. In at least one embodiment, a policy variance controller is used.
[0064] In at least one embodiment, once the low-level motion actuator 104 is trained, the system 100 can use different motion schedulers for real-time interactive applications. For example, in at least one embodiment, the system 100 can be used to perform keyboard or joystick-driven control. In at least one embodiment, the system 100 facilitates the synthesis of user-specified motion sequences. In at least one embodiment, the system 100 supports virtual teleportation of a person captured on video into a virtual environment, where the person is represented by a physics-based avatar.
[0065] In at least one embodiment, system 100 can be used to mimic movements that were not seen during training. In at least one embodiment, system 100 learns to perform natural transitions between movements automatically, without requiring specific training samples of those transitions in its training dataset.
[0066] In at least one embodiment, system 100 produces robust control even when the source motion quality is poor. For example, in at least one embodiment, system 100 exhibits zero-shot robustness to environmental obstacles (such as projectiles or other characters) not seen during training. In at least one embodiment, system 100 is capable of adapting to characters with widely varying weights or other characteristics. In at least one embodiment, system 100 is capable of adapting to motions that are slower or faster than those seen during training.
[0067] In at least one embodiment, system 100 is used for interactive applications, such as video games. In at least one embodiment, versatility and robustness allow for a variety of modes of interactive character control, including keyboard commands, noisy pose tracking from video capture, and sequences of user-specified positional motion or cardiac motion. In at least one embodiment, these applications can be used without having to retrain or fine-tune separate models for each application.
[0068] In at least one embodiment, the low-level motion actuator takes target character states as input. These states may be derived from motion capture data or generated by a high-level motion scheduler. The low-level controller is implemented as a policy neural network designed to output physics-based control signals that drive the character to optimally track the target state.
[0069] In at least one embodiment, the character state is described by the root position $p_r$, the root rotation quaternion $q_r$, the joint positions $p_j$, and the joint rotation quaternions $q_j$, where $J$ is the number of joints. In at least one embodiment, the joint positions $p_j$ can be inferred from $p_r$, $q_r$, and $q_j$. In at least one embodiment, first-order information for these four quantities is also considered, including the root translational velocity, the root angular velocity, the joint translational velocity, and the joint angular velocity. In at least one embodiment, the state can be expressed as:
[0070]
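Read literally, the state collects the eight quantities listed above. One way to write it, assuming the symbol $X$ for the state and dot notation for the first-order terms (both are notational assumptions, not taken from the text), is:

```latex
X = \left[\, p_r,\; q_r,\; p_j,\; q_j,\; \dot{p}_r,\; \dot{q}_r,\; \dot{p}_j,\; \dot{q}_j \,\right]
```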
[0071] In at least one embodiment, the time step in the physics engine is denoted $t$, and the character state is indexed by time step $t$. In at least one embodiment, the target states input to the low-level actuator 104 cover $\tau$ future frames, where $\tau$ is the length of the target frame window. In at least one embodiment, the target state 120 is taken from the training dataset or generated by the high-level motion scheduler 102 from the control signal $c_t$.
[0072] In at least one embodiment, the observation function encodes information from both the current state and the target future states. In at least one embodiment, an agent-centric state encoding operator is used, which transforms quaternions, translations, and the corresponding velocities so that they are expressed relative to the root $p_r$, $q_r$. This agent-centric local state observation function can be written as:
[0073]
[0074] In at least one embodiment, the generality of the observation function is increased by handling the actual state of the character in an agent-centric manner. In at least one embodiment, the second part of the observation function is a relative information function. This relative information function extracts the relative root information between the future target state and the current actual state, such as:
[0075]
[0076] In at least one embodiment, by combining the local state information with the relative information between the states, the observation vector $s_t$ takes the following form:
[0077]
[0078] In at least one embodiment, the low-level actuator 104 includes a controller based on torque or other force values. In at least one embodiment, a force-based approach is used, rather than techniques such as proportional-derivative (“PD”) controllers, to provide advantages such as reducing or preventing overfitting and avoiding policies that rely on unforeseen features.
[0079] In at least one embodiment, the torque-based controller is represented as $\pi(a_t \mid s_t)$, where $a_t$ is the set of torques applied to each joint. In at least one embodiment, a fully connected neural network, such as a multilayer perceptron (“MLP”), is used, where the network weights are denoted $\theta_\pi$. In at least one embodiment, the controller uses a neural network with three hidden layers and 1024 units.
[0080] In at least one embodiment, constrained multi-objective reward optimization is used. In at least one embodiment, the reward is defined as the sum of several terms that measure the difference between the target state and the actual state in different statistics, i.e.
[0081]
[0082] In at least one embodiment, the weighting coefficients take the values (0.2, 0.2, 0.1, 0.4, 0.1). In at least one embodiment, the same reward function is used for the joint quaternion deviation, the joint angular velocity deviation, and the root position deviation (center of mass). In at least one embodiment, instead of penalizing only the misalignment of the end effectors (such as the hands and feet), all joints are penalized, because an accurate representation of many movements requires attention to all joints, not just the hands and feet. In at least one embodiment, similar to the penalty for joint rotation, root rotation is penalized separately.
[0083] In at least one embodiment, directly optimizing the sum of rewards is problematic because it is essentially a multi-objective reward function, and mathematically, each individual reward term competes with others during training. In at least one embodiment, this competition is not significant when training on a very small number of movements, but it can occur when the rewards are dominated by certain reward terms. In at least one embodiment, these problems are addressed by a constrained optimization objective:
[0084]
[0085]
[0086] where $\alpha_i$ is the tolerance coefficient that prevents any one reward term from dominating the other reward terms.
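Under the definitions above, the constrained objective can be sketched as follows. The weight symbol $w_i$, the per-term reward symbol $r_i$, and the expectation over summed time steps are notational assumptions made for this sketch; the text itself only names the tolerance $\alpha_i$ and the policy weights $\theta_\pi$.

```latex
\max_{\theta_\pi} \; \mathbb{E}\!\left[\, \sum_{t} \sum_{i} w_i \, r_i(s_t) \right]
\quad \text{subject to} \quad r_i(s_t) > \alpha_i \quad \text{for all } i,\, t
```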
[0087] In at least one embodiment, directly optimizing this equation using existing reinforcement learning algorithms is difficult or impossible. In at least one embodiment, a soft version of the constraint $r_i(s_t) > \alpha_i$ is maintained by enforcing early termination individually for each term. In at least one embodiment, if any reward term drops below its tolerance threshold, the episode is terminated. In at least one embodiment, the tolerance value is chosen empirically.
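The early-termination rule described above can be sketched as a small per-step function. This is a minimal illustration, not the patented implementation: the function name `step_reward`, the ordering of the five weighted terms, and the representation of tolerances as a plain sequence are all assumptions made here, and no tolerance value is implied.

```python
from typing import Sequence, Tuple

# Weights quoted in the text; which reward term each weight belongs to
# is an assumption of this sketch.
WEIGHTS = (0.2, 0.2, 0.1, 0.4, 0.1)


def step_reward(
    terms: Sequence[float],
    tolerances: Sequence[float],
    weights: Sequence[float] = WEIGHTS,
) -> Tuple[float, bool]:
    """Return (weighted reward, terminate_episode) for one time step.

    terms[i] is the i-th reward term r_i(s_t) and tolerances[i] is alpha_i.
    The episode is flagged for termination as soon as any term fails the
    constraint r_i(s_t) > alpha_i, which enforces the constraint softly.
    """
    if not (len(terms) == len(tolerances) == len(weights)):
        raise ValueError("terms, tolerances and weights must have equal length")
    total = sum(w * r for w, r in zip(weights, terms))
    terminate = any(r <= a for r, a in zip(terms, tolerances))
    return total, terminate
```

Because termination is checked per term, a high score on one term cannot compensate for a collapse in another, which is the behavior the constraint is meant to produce.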
[0088] In at least one embodiment, proximal policy optimization (“PPO”) is used to optimize the controller of motion actuator 104. In at least one embodiment, either PPO or policy gradient (“PG”) methods are used. In at least one embodiment, a PPO surrogate objective is optimized, wherein the objective is:
[0089]
[0090] where $A_t$ is the estimated advantage function, $\pi_{\text{old}}$ denotes the policy with weights held fixed during the update, and $\beta$ is the weight of the Kullback-Leibler divergence penalty that prevents overconfident updates. In at least one embodiment, the iterative update of the equation is sample-based, consistent with how motion is sampled, as described with respect to the various embodiments disclosed herein. In at least one embodiment, 4096 workers are used simultaneously to generate training samples during training. In at least one embodiment, the number of samples per worker per iteration is 64.
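For reference, the KL-penalized surrogate objective from the original PPO formulation, written with the symbols used here, has the form below. Whether the embodiment uses exactly this variant is an assumption; the text only states that the objective involves $A_t$, $\pi_{\text{old}}$, and a KL penalty weighted by $\beta$. The expression is maximized, so a loss would be its negative.

```latex
L^{\text{PPO}}(\theta_\pi) = \hat{\mathbb{E}}_t \left[
  \frac{\pi_{\theta_\pi}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \, A_t
  \;-\; \beta \, \mathrm{KL}\!\left[ \pi_{\text{old}}(\cdot \mid s_t),\; \pi_{\theta_\pi}(\cdot \mid s_t) \right]
\right]
```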
[0091] In at least one embodiment, the available training motion dataset contains samples that are imbalanced with respect to motion type, causing certain types of motion to dominate. In at least one embodiment, random sampling of such a dataset during training may lead to a policy dominated by specific skills, but utilizing such datasets may still be advantageous because they are likely to be readily available. In at least one embodiment, the category labels of the motions present an additional challenge: they may mix coarse and fine-grained labels, which makes it harder to maintain a valid balance of training samples.
[0092] In at least one embodiment, a motion balancer is used. In at least one embodiment, the motion balancer establishes a hierarchical tree structure for category labels. In at least one embodiment, for each movement, starting from the category root node, the motion balancer labels a higher-level or more general category, such as walking, and then moves down the tree structure to lower-level or more specialized categories, such as walking forward or walking backward. In at least one embodiment, the hierarchical depth is unlimited, allowing for the generation of more fine-grained category labels for complex movements. For example, in at least one embodiment, a sample of a specific "zombie" walking style could be labeled as root-walking-forward zombie or root-walking-backward zombie.
[0093] In at least one embodiment, during training, motion is sampled by traversing down the hierarchical tree and uniformly sampling from all sub-node (child node) categories of the current node. In at least one embodiment, each node is represented as v, and its sub-nodes or child nodes are represented as C(v). The sampling process can be described as...
[0094]
[0095] In at least one embodiment, the sampling probability of each motion can be calculated offline once for the entire dataset.
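The traversal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the label tree is represented as nested dictionaries whose innermost values are lists of motion clip names, clips are treated as the children of their most specific category, and the names `motion_probabilities` and `sample_motion` are hypothetical.

```python
import random
from typing import Dict, List, Union

# A node is either a dict of named sub-categories or a list of motion clip ids.
Tree = Union[Dict[str, "Tree"], List[str]]


def motion_probabilities(node: Tree, prob: float = 1.0) -> Dict[str, float]:
    """Probability of reaching each clip when every child is equally likely."""
    children = list(node.values()) if isinstance(node, dict) else list(node)
    out: Dict[str, float] = {}
    for child in children:
        share = prob / len(children)
        if isinstance(child, str):  # leaf: a motion clip id
            out[child] = out.get(child, 0.0) + share
        else:  # internal node: distribute the share further down
            for name, p in motion_probabilities(child, share).items():
                out[name] = out.get(name, 0.0) + p
    return out


def sample_motion(node: Tree, rng: random.Random) -> str:
    """Walk down the tree, choosing uniformly among the current node's children."""
    while not isinstance(node, str):
        children = list(node.values()) if isinstance(node, dict) else list(node)
        node = rng.choice(children)
    return node


tree = {
    "walking": {
        "forward": ["walk_fwd_01", "walk_fwd_02", "walk_fwd_03"],
        "backward": ["walk_back_01"],
    },
    "dancing": ["dance_01"],
}
probs = motion_probabilities(tree)
```

In this toy tree the single dancing clip is drawn half of the time even though walking clips outnumber it four to one, which illustrates how category-level uniform sampling counteracts dataset imbalance. The probabilities depend only on the tree, so they can be computed once offline.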
[0096] In at least one embodiment, Reference State Initialization (“RSI”) can be performed to sample the state of a particular frame as an initial state. In at least one embodiment, initialization is instead performed using a Reactive State Initialization Scheme (“RSIS”), where the agent is initialized to the state of a frame k time steps away from the actual target frame it should be tracking. In at least one embodiment, RSIS includes a significant amount of noise added to the initial state for velocity and translation. In at least one embodiment, using RSIS enables the agent to learn how to self-adjust and catch up to the target state. In at least one embodiment, this recovery skill can be learned automatically without training a separate recovery network or adding recovery actions to the training dataset. In at least one embodiment, RSIS can be used in conjunction with a motion stitching scheduler, enabling the agent to generate natural transitions between stitched motions that may have large discontinuities. In at least one embodiment, RSIS improves robustness with respect to recovery from perturbations in a moving environment.
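A minimal sketch of the initialization step follows. Several details are assumptions made only for illustration: each frame is a dictionary with `"translation"` and `"velocity"` entries, the offset is taken as k frames before the target frame, the noise is Gaussian, and the function name `reactive_state_init` is hypothetical.

```python
import random
from typing import Dict, List, Sequence

Frame = Dict[str, List[float]]


def reactive_state_init(
    motion: Sequence[Frame],
    target_index: int,
    k: int,
    noise_std: float,
    rng: random.Random,
) -> Frame:
    """Start from the frame k steps before the target frame and perturb it.

    Gaussian noise is added to the "translation" and "velocity" entries only;
    every other entry (for example rotations) is copied unchanged. The source
    motion is never modified.
    """
    start = min(max(target_index - k, 0), len(motion) - 1)
    state = {key: list(values) for key, values in motion[start].items()}
    for key in ("translation", "velocity"):
        state[key] = [v + rng.gauss(0.0, noise_std) for v in state[key]]
    return state
```

Because the agent starts away from the frame it is asked to track, each episode begins with a recovery problem, which is how the catch-up behavior is learned without dedicated recovery data.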
[0097] In at least one embodiment, additional steps are taken during training to avoid undesirable local minima. In at least one embodiment, the variance of the stochastic policy $\pi(a_t \mid s_t)$ is considered. In at least one embodiment, with respect to PPO, a trainable vector is used to represent the standard deviation of a diagonal Gaussian policy. In at least one embodiment, when training on a single movement or a small number of movements, this vector can be learned automatically by optimizing the PPO surrogate objective, as provided in Equation 7 above; however, for large-scale datasets, the variance may be more fragile during training. In at least one embodiment, exponential annealing of the policy variance is used to set the initial values. In at least one embodiment, different joints of the character require variances at different scales. For example, in at least one embodiment, the control variance of the character's toes is smaller than that of the character's legs. In at least one embodiment, to preserve the learned differences between joints, an adaptive variance update scheme is used, as follows:
[0098]
[0099]
[0100] where $L_{\text{PPO}}$ is the loss defined in Equation 7, $\alpha_{lr}$ is the learning rate, and $l$ is the current PPO training iteration. In at least one embodiment, the control iteration range is defined as from 0 to $L$, during which the target mean value is linearly annealed from the hyperparameter $\text{logstd}_0$ to the hyperparameter $\text{logstd}_F$. In at least one embodiment, each component of the log standard deviation is increased or decreased by the same computed amount, such that the mean matches the linearly annealed target value. In at least one embodiment, this preserves the learned variance structure.
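The two-step update described above (a gradient step followed by a uniform shift) can be sketched as follows. This is an illustration under assumptions: the gradient of the PPO loss with respect to the log standard deviation is supplied by the caller, the target is held at its final value once the iteration count passes the end of the range, and the function names are hypothetical.

```python
from typing import List, Sequence


def annealed_target(
    iteration: int, total_iterations: int, logstd_0: float, logstd_f: float
) -> float:
    """Linearly annealed target mean of the log standard deviation."""
    # Clamping outside [0, total_iterations] is an assumption of this sketch.
    frac = min(max(iteration / total_iterations, 0.0), 1.0)
    return logstd_0 + frac * (logstd_f - logstd_0)


def update_logstd(
    logstd: Sequence[float],
    grad: Sequence[float],
    lr: float,
    iteration: int,
    total_iterations: int,
    logstd_0: float,
    logstd_f: float,
) -> List[float]:
    """Gradient step on the PPO loss, then shift every component equally.

    The shift makes the mean equal the annealed target while leaving the
    differences between components (the learned per-joint structure) intact.
    """
    stepped = [s - lr * g for s, g in zip(logstd, grad)]
    target = annealed_target(iteration, total_iterations, logstd_0, logstd_f)
    shift = target - sum(stepped) / len(stepped)
    return [s + shift for s in stepped]
```

Adding the same constant to every log standard deviation multiplies every standard deviation by the same factor, so the overall exploration level follows the schedule while the relative scale between joints is whatever training has learned.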
[0101] In at least one embodiment, the high-level motion scheduler 102 outputs a target reference state from interactive control signals or from a motion dataset for use in conjunction with the low-level motion actuator 104. In at least one embodiment, the motion scheduler 102 may include one or more specialized scheduler components 108-114 to implement different control schemes, such as keyboard input, joystick input, replication of an observed subject's movements, etc. Note, however, that although Figure 1 depicts the different schedulers 108-114 as components of the high-level motion scheduler 102, in at least one embodiment the schedulers 108-114 can operate independently of each other. Therefore, in at least one embodiment, one of the schedulers 108-114 can replace the high-level motion scheduler 102.
[0102] In at least one embodiment, the high-level motion scheduler 102 is represented as φ, and outputs the states of τ future frames as follows:
[0103]
[0104] where $\tau_c$ and $\tau_x$ represent the history lengths of the control signal and the character state, respectively, and $\theta$ represents the parameters or configuration of the scheduler 102. In at least one embodiment, the planning length $\tau$ is carefully chosen so that it is not too long, as a long planning length can make the actuator 104 difficult to train and more prone to overfitting. Furthermore, in at least one embodiment, for real-time applications (such as animation from video), generating frames for the unpredictable future is impossible or impractical. In at least one embodiment, a relatively short output length $\tau$ is therefore chosen, for example, a value of 1 or 2.
[0105] In at least one embodiment, a general and transferable framework for different control input sources is established by unifying input observations to a low-level motion actuator 104. In at least one embodiment, the low-level actuator 104 trained with a large-scale motion dataset can be used directly with any of the different types of schedulers 102 for a specific application without the need for retraining.
[0106] In at least one embodiment, the motion training scheduler 108 is used to train the low-level actuator 104 using a motion dataset. In at least one embodiment, during training, a motion $m$ and a frame ID $j$ are randomly sampled, and the state of this frame is set as the agent's initial state. For time step $t$, the training scheduler outputs the states from motion $m$ as:
[0107] $\phi_{\text{MocapData}}(t) = [\,m(t+j+1),\ m(t+j+2),\ \ldots,\ m(t+j+\tau)\,]$ (11)
[0108] In at least one embodiment, the motion training scheduler 108 terminates and reschedules a new motion when the current motion has reached its end or when the target state and the actual state have deviated significantly.
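Equation 11 and the end-of-motion check can be sketched as two small functions. The handling of the clip boundary (repeating the last frame when the window runs past the end) is an assumption of this sketch, as are the function names; the deviation-based termination mentioned above is not modeled here.

```python
from typing import List, Sequence, TypeVar

State = TypeVar("State")


def mocap_targets(motion: Sequence[State], j: int, t: int, tau: int) -> List[State]:
    """Return [m(t+j+1), ..., m(t+j+tau)], clamped to the clip's last frame."""
    last = len(motion) - 1
    return [motion[min(t + j + k, last)] for k in range(1, tau + 1)]


def motion_finished(motion: Sequence[State], j: int, t: int) -> bool:
    """True once the next target frame would fall past the end of the clip."""
    return t + j + 1 > len(motion) - 1
```

With a short window such as $\tau = 2$, each call returns only the next two frames of the clip, offset by the sampled start frame $j$.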
[0109] In at least one embodiment, a motion capture (“MoCap”) dataset is obtained and used to generate training and testing datasets. In at least one embodiment, smaller datasets are created by grouping by style and task. For example, in at least one embodiment, a dataset of “household chores” motions may include things like sweeping the floor, washing dishes, etc., an “animal” dataset may contain movements of humans attempting to imitate animals (such as cats and dogs), and a “miscellaneous” dataset may contain all remaining movements after filtering out infeasible movements or movements highly dependent on external objects (such as stairs).
[0110] In at least one embodiment, the dataset is divided into a training set with 80% of the frames and a test set with 20%. In at least one embodiment, the available dataset has imbalanced motion examples. For example, in a given, readily available dataset, it is possible that 35.4% of the examples are walking motions, while 25.5% of all motion examples are forward walking motions. Similarly, in at least one embodiment, the dataset may have a large number of motion categories with very few examples representing them.
[0111] In at least one embodiment, classes with a small number of examples are first assigned such that they are equally distributed in the test and training sets. In at least one embodiment, if a class exists with only one movement, it is placed in the test set. In at least one embodiment, large classes such as forward walking are assigned to populate the remaining training and test sets. In at least one embodiment, although the resulting training dataset may be imbalanced across classes, embodiments of the motion balancer described herein facilitate efficient training of the motion actuator 104.
[0112] In at least one embodiment, the video stream scheduler 114 is used as an interactive control input. In at least one embodiment, the motion of the human subject is captured by a camera and reconstructed via an avatar in a virtual environment. In at least one embodiment, a real-time pose estimator is used to estimate the 3D pose of the subject from the video. In at least one embodiment, the pose estimator is a convolutional neural network parameterized by weights $\theta_{\text{CNN}}$. In at least one embodiment, the video frame at time step $t+1$ is denoted $I_{t+1}$. In at least one embodiment, the estimator generates the following prediction:
[0113]
[0114] In at least one embodiment, the animation engine is connected to the pose estimator so that data is sent in real time. In at least one embodiment, a pose buffer of length $\tau_p$ is maintained, and first-order information (such as linear velocity and angular velocity) is interpolated from the previous poses.
[0115] In at least one embodiment, the video stream scheduler 114 can be written as:
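One simple way to obtain first-order information from a pose buffer is a finite difference across the buffered poses. The sketch below uses that technique as a stand-in; the text does not specify the interpolation method. It handles only the linear velocity of a single 3D position, and the class name `PoseBuffer` and the fixed frame interval `dt` are assumptions.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class PoseBuffer:
    """Keeps the most recent positions and estimates a linear velocity."""

    def __init__(self, length: int, dt: float) -> None:
        if length < 2:
            raise ValueError("need at least two poses to estimate a velocity")
        self.dt = dt
        # deque(maxlen=...) discards the oldest pose automatically.
        self.positions: Deque[List[float]] = deque(maxlen=length)

    def push(self, position: Sequence[float]) -> None:
        self.positions.append(list(position))

    def linear_velocity(self) -> Optional[List[float]]:
        """Average velocity across the buffer, or None with fewer than two poses."""
        if len(self.positions) < 2:
            return None
        first, last = self.positions[0], self.positions[-1]
        span = (len(self.positions) - 1) * self.dt
        return [(b - a) / span for a, b in zip(first, last)]
```

Averaging over the whole buffer rather than the last two poses trades responsiveness for some smoothing, which may help when the estimated poses are noisy.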
[0116]
[0117] In at least one embodiment, the pose estimation is imperfect and may contain a significant amount of noise. In at least one embodiment, the low-level motion actuator 104 can handle noisy estimated poses with reasonable accuracy.
[0118] In at least one embodiment, the command stream scheduler 112 is trained to drive the low-level motion actuator 104 based on user input (such as keyboard or joystick commands). In at least one embodiment, instead of training a hierarchical interactive scheduler from scratch, a phase-functioned neural network (“PFNN”) is used to process commands and generate future states. In at least one embodiment, the PFNN can control the agent's movement direction and select a movement style from walking, jogging, squatting, etc. In at least one embodiment, the target state generation of the PFNN is written as:
[0119]
[0120] In at least one embodiment, the PFNN is an autoregressive method, where previously generated states are also used as input for generating subsequent states. In at least one embodiment, experiments have shown that it is not necessary to feed the actual state back to the PFNN; instead, the low-level actuator 104 can automatically correct accumulated errors in the PFNN. In at least one embodiment, another keyframe-based animation system is used.
[0121] In at least one embodiment, the user can also interactively specify the motion of a character using the motion stitching scheduler 110, where motion stitching refers to directly concatenating new motions one after another without regard to appropriate transitions. In at least one embodiment, a target motion buffer $B$ is maintained for stitching, and another motion $m$ with $|m|$ frames is interactively added to the buffer before the current buffer runs out of states:
[0122]
[0123] In at least one embodiment, spherical linear interpolation is used to add several transition target frames. In at least one embodiment, at each time step, the motion splicing scheduler 110 generates the subsequent target state by popping a state from a FIFO buffer.
[0124]
[0125] In at least one embodiment, the motion splicing scheduler 110 can be viewed as a simplified version of a motion graph, where movements are animated one by one. However, in at least one embodiment, the motion splicing scheduler 110 can animate a wide variety of highly difficult acrobatic movements, some of which are not seen during training. In at least one embodiment, this allows characters to be animated with automatic, smooth transitions between movements, even if the training set does not include transitional skills.
[0126] In at least one embodiment, the character model 106 incorporates physics to model the character's movement. In at least one embodiment, the character model 106 is used in conjunction with a physics engine (such as a GPU-accelerated physics engine) as a core backend for physics simulation. In at least one embodiment, a CUDA-based Newton preconditioned conjugate residual (PCR) solver for rigid bodies is provided by the physics engine. In at least one embodiment, gravity is set to 9.8 m/s² in the downward direction, and a value of 1.0 is used for both the static and dynamic friction coefficients.
[0127] In at least one embodiment, the character model 106 is designed as a humanoid model whose basic topology of rigid bodies is modeled after the human body. In at least one embodiment, the character model 106 includes 20 rigid bodies and 35 degrees of freedom, wherein each degree of freedom is assigned an effort factor in the range of 50 to 600 to simulate the differences in joint strength in the human body. In at least one embodiment, this effort factor is taken into account when torque control is applied. In at least one embodiment, the height and mass of the character model 106 are 1.8 m and 70 kg, respectively, approximating realistic human proportions. In at least one embodiment, the mass of each rigid body in the character model 106 is distributed proportionally based on a rough estimate of the human body's mass distribution.
[0128] In at least one embodiment, the agent can withstand perturbations from projectiles, or retargeting to new models whose weight distributions differ from those used in training. In at least one embodiment, the embodiments described herein can achieve zero-shot robustness, wherein the agent never sees perturbation or retargeting information during training and is nevertheless required to perform the task under perturbation or using different humanoid models with different masses.
[0129] In at least one embodiment, the embodiment of system 100 disclosed herein may have the ability to resist unseen perturbations and handle unseen retargeting problems. In at least one embodiment, zero-shot robustness includes zero-shot perturbation robustness, zero-shot velocity robustness, and zero-shot model retargeting robustness.
[0130] In at least one embodiment, zero-shot perturbation robustness relates to training the agent in the absence of projectiles or other obstacles or hindrances. In at least one embodiment, during testing, the agent is required to perform tasks while subjected to such projectiles or other obstacles.
[0131] In at least one embodiment, zero-shot velocity robustness involves training an agent with motion at the original velocity, but requiring the agent to reproduce the motion at different velocities during testing.
[0132] In at least one embodiment, zero-shot model retargeting robustness involves an agent that is trained on one or more models but uses a model not seen during training to perform the task during testing. For example, in at least one embodiment, models that are 25% heavier and 25% lighter are used during testing.
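The proportional mass distribution and the heavier and lighter test models can be illustrated with a minimal sketch. The body names and mass fractions below are hypothetical placeholders chosen for illustration; only the 70 kg total and the 25% variation come from the description.

```python
def distribute_mass(total_mass, fractions):
    """Split a total mass across rigid bodies in proportion to `fractions`."""
    norm = sum(fractions.values())
    return {body: total_mass * f / norm for body, f in fractions.items()}


def retarget(masses, scale):
    """Return a copy of the model's masses uniformly scaled by `scale`."""
    return {body: m * scale for body, m in masses.items()}


# Hypothetical fractions; a real model would list all of its rigid bodies.
fractions = {"torso": 0.5, "legs": 0.3, "arms": 0.2}
trained = distribute_mass(70.0, fractions)
heavier = retarget(trained, 1.25)  # test model that is 25% heavier
lighter = retarget(trained, 0.75)  # test model that is 25% lighter
```

In a zero-shot retargeting test of this kind, only the `trained` model would be used during training, and the scaled models would appear only at test time.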
[0133] In at least one embodiment, a robotic device is trained to perform one or more movements based at least in part on a simulation produced using system 100. In at least one embodiment, the simulation includes a simulated environment, simulated sensors, and a set of assets. In at least one embodiment, these assets include, but are not limited to, assets related to characters, terrain, obstacles, etc. In at least one embodiment, system 100, or other embodiments described herein, is used to generate signals to control the movement of a virtual character and to generate video data depicting said movement. In at least one embodiment, the robotic device is trained to perform movements based on this video data.
[0134] In at least one embodiment, system 100 is used for interactive simulations, such as interactive games, simulated factories, simulated driving environments, etc. In at least one embodiment, video data depicting the one or more simulations is used to train one or more other neural networks to perform additional tasks.
[0135] In at least one embodiment, these tasks include:
[0136] Figure 2 shows an example of motion balancing for training a system to animate virtual objects, according to at least one embodiment. In at least one embodiment, example 200 of motion balancing includes a motion balancer 204 that takes a training dataset 202 as input and outputs a hierarchical training dataset 206.
[0137] In at least one embodiment, the virtual object corresponds to a character, such as a virtual human or animal. In at least one embodiment, the virtual object corresponds to a mechanism simulated within a virtual environment, such as a robotic arm or other robotic device. In at least one embodiment, the model of such device simulates the physical characteristics of a corresponding physical robotic arm or other robotic device.
[0138] In at least one embodiment, the training dataset 202 includes multiple motion examples. In at least one embodiment, the training dataset 202 contains video data or motion capture data relating to human subjects participating in different types of motion. Examples of such motions include walking, running, jumping, etc. In at least one embodiment, the training dataset 202 is over-weighted toward specific types of motion (such as walking or running), while lacking sufficient examples of variants of those motions (such as brisk walking) and of other motion types (such as jumping).
[0139] In at least one embodiment, the motion balancer 204 constructs a hierarchical tree structure of sample or category labels. In at least one embodiment, for each example in the training dataset 202, the motion balancer 204 starts from the root node and places higher-level or more general categories (such as general examples of walking) in the first level 210 of the hierarchy 206, then moves down the hierarchical tree structure to more specialized categories, such as walking forward or backward, which are placed in the second level 212. Other levels of the hierarchy 206 can be used for even more specialized motion examples, such as walking with a particular or unique style, which are placed in the third level 214.
[0140] In at least one embodiment, during training, motion is sampled by going down the hierarchical training dataset 206 and sampling uniformly from each level of the hierarchical training dataset 206. Using this method helps teach fundamental aspects of motion physics while avoiding over-specialization for specific types of motion.
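The level-by-level sampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tree representation, the function names `build_hierarchy` and `sample_clip`, and the label paths are assumptions made for the example.

```python
import random


def build_hierarchy(examples):
    """Group clips into a tree keyed by label path (general -> specific).

    `examples` is a list of (clip_id, labels) pairs, e.g.
    ("clip_7", ["walk", "forward"]).
    """
    root = {"children": {}, "clips": []}
    for clip_id, labels in examples:
        node = root
        for label in labels:
            node = node["children"].setdefault(
                label, {"children": {}, "clips": []}
            )
        node["clips"].append(clip_id)
    return root


def sample_clip(node, rng):
    """Walk down the tree, choosing uniformly among the options at each level."""
    while True:
        options = list(node["children"].values())
        if node["clips"]:
            options.append(None)  # None means: stop and draw a clip stored here
        choice = rng.choice(options)
        if choice is None:
            return rng.choice(node["clips"])
        node = choice


# An imbalanced dataset: 98 walking clips and only 2 jumping clips.
examples = [(f"walk_fwd_{i}", ["walk", "forward"]) for i in range(98)]
examples += [("jump_0", ["jump"]), ("jump_1", ["jump"])]
tree = build_hierarchy(examples)
rng = random.Random(0)
draws = [sample_clip(tree, rng) for _ in range(2000)]
jump_fraction = sum(d.startswith("jump") for d in draws) / len(draws)
```

Because the choice at the top level is uniform over categories rather than over clips, jumping is drawn about half the time in this example even though it makes up only 2% of the clips.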
[0141] Figure 3 shows an example of a video stream scheduler according to at least one embodiment. In at least one embodiment, the video stream scheduler 300 includes software or circuitry, individually or in combination, that receives and stores video stream 302 for subsequent analysis. In at least one embodiment, the video stream scheduler 300 also includes a neural pose estimator 304 capable of performing pose estimation at near real-time speeds.
[0142] In at least one embodiment, the video stream scheduler 300 serves as an interactive control input. For example, in at least one embodiment, the video stream 302 includes video data of a human subject whose motion is to be analyzed and reproduced, using a system such as system 100 depicted in Figure 1, in order to animate a virtual avatar.
[0143] In at least one embodiment, the neural pose estimator 304 is a convolutional neural network. In at least one embodiment, the neural pose estimator 304 estimates the pose of a subject in frames of video stream 302. In at least one embodiment, the estimated pose is then used as the target state of the corresponding virtual avatar. In at least one embodiment, the target state is derived by predicting future states based on the estimated pose. For example, in at least one embodiment, first-order information such as linear velocity and angular velocity may allow interpolation or prediction of the state of the observed subject in the near future, and the interpolated or predicted state may be used as the target state.
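The use of first-order information to predict a near-future target state can be sketched as below. This is a simplified illustration under stated assumptions: the pose is treated as a flat vector of scalars (for example, root position and joint angles), velocities are obtained by finite differences across a fixed-length buffer, and the class name `PoseBuffer` and its methods are hypothetical. Rotations stored as quaternions would need dedicated handling.

```python
from collections import deque


class PoseBuffer:
    """Fixed-length buffer of (timestamp, pose) pairs from a pose estimator."""

    def __init__(self, length):
        self.frames = deque(maxlen=length)

    def push(self, t, pose):
        self.frames.append((t, list(pose)))

    def velocity(self):
        """Finite-difference velocity between the oldest and newest poses."""
        if len(self.frames) < 2:
            return [0.0] * len(self.frames[-1][1])
        (t0, p0), (t1, p1) = self.frames[0], self.frames[-1]
        dt = t1 - t0  # assumed non-zero: timestamps strictly increase
        return [(b - a) / dt for a, b in zip(p0, p1)]

    def predict(self, horizon):
        """Extrapolate the newest pose `horizon` seconds into the future."""
        _, latest = self.frames[-1]
        return [p + v * horizon for p, v in zip(latest, self.velocity())]


buf = PoseBuffer(length=4)
buf.push(0.0, [0.0, 0.0])
buf.push(0.1, [0.1, 0.2])
target_state = buf.predict(0.05)
```

Differencing across the whole buffer rather than only the last two frames gives some smoothing of estimator noise, at the cost of a slower response to sudden changes.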
[0144] In at least one embodiment, the target state is output as a target state stream 306. In at least one embodiment, the target state stream 306 includes multiple time-ordered target states that can be tracked by a connected motion actuator, such as the one shown in Figure 1.
[0145] Figure 4 shows an example of a command stream scheduler 400 according to at least one embodiment. In at least one embodiment, the command stream scheduler 400 includes software or circuitry, individually or in combination, that receives and stores a command stream 402 for subsequent analysis. In at least one embodiment, the command stream 402 includes input from one or more of a keyboard, joystick, trackball, or other mechanism capable of generating commands. Examples of such commands in at least one embodiment include directional commands such as “forward” or “backward”; speed modification commands such as “increase speed” or “decrease speed”; and position commands such as “crouch” or “stand”. It should be understood that these examples are intended to be illustrative rather than limiting, and thus, the disclosed examples should not be interpreted in a way that limits the scope of the embodiments to include only those examples provided.
[0146] In at least one embodiment, the command stream scheduler 400 is trained to drive, based on commands such as those from the command stream 402, a low-level motion actuator such as the one described with respect to Figure 1. In at least one embodiment, the command-driven planner 404 includes a phase-function neural network (“PFNN”) for processing commands and generating future states. In at least one embodiment, the PFNN can control the agent's direction of motion and select a motion style such as walking, jogging, or squatting. In at least one embodiment, the PFNN is autoregressive, such that previously generated states are also inputs for subsequently generated states. The output from the command-driven planner 404 provides the states that make up a target state stream 406. In at least one embodiment, the states from the target state stream 406 enable a system (such as system 100 depicted in Figure 1) to animate a virtual object.
[0147] Figure 5 shows an example of a motion splicing scheduler according to at least one embodiment. In at least one embodiment, the motion splicing scheduler 500 includes software or circuitry, individually or in combination, that receives and stores a motion stream 502. In at least one embodiment, the motion stream 502 includes a series of movements to be performed by a character. For example, in at least one embodiment, a user can interactively specify the movements of a character to be spliced together. In at least one embodiment, motion splicing refers to directly joining new movements in sequence without particular concern, at least at this stage, for appropriate transitions between movements. In at least one embodiment, a target motion buffer 504 is maintained for splicing, which includes a sequence of target states to be realized by a motion actuator, such as the one described with respect to Figure 1. In at least one embodiment, additional frames representing an additional motion are added to buffer 504 before the contents of buffer 504 are exhausted.
[0148] In at least one embodiment, spherical linear interpolation is used to add one or more transition target frames to buffer 504 to improve the transition between motions. In at least one embodiment, motion splicing scheduler 500 outputs target state stream 506, which is used by, for example, the low-level motion actuator described with respect to Figure 1 to animate a variety of limb movements, even those not seen during training. In at least one embodiment, this enables characters to be animated with automatic, smooth transitions between movements, even if no transition examples were included in the training.
[0149] Figure 6 shows an example of a reactive state initialization scheme according to at least one embodiment.
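The splicing of a new motion into the target buffer with interpolated transition frames can be sketched as follows. This is an illustrative sketch under assumptions: each target frame is represented as a list of per-joint unit quaternions in (w, x, y, z) order, the buffer is a `collections.deque` used as a FIFO, and the function names `slerp` and `splice` are hypothetical. The description does not specify the frame representation or the number of transition frames.

```python
import math
from collections import deque


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion to interpolate along the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:  # nearly parallel: linear interpolation is stable here
        out = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in out))
    return [c / norm for c in out]


def splice(buffer, motion, n_transition):
    """Append interpolated transition frames, then `motion`, to the FIFO buffer."""
    if buffer and n_transition > 0:
        last, first = buffer[-1], motion[0]
        for i in range(1, n_transition + 1):
            t = i / (n_transition + 1)
            buffer.append([slerp(a, b, t) for a, b in zip(last, first)])
    buffer.extend(motion)


identity = [1.0, 0.0, 0.0, 0.0]
half_turn_z = [0.0, 0.0, 0.0, 1.0]  # 180 degree rotation about z
target_buffer = deque([[identity]])  # one buffered frame with a single joint
splice(target_buffer, [[half_turn_z]], n_transition=1)
```

At each time step the scheduler would then take the next target state with `target_buffer.popleft()`, and new motions would be spliced in before the buffer runs empty.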
[0150] Although the example process 600 is depicted as a sequence of operations, it will be appreciated that in embodiments, the depicted operations may be modified in different ways, and some operations may be omitted, reordered, or performed in parallel with other operations, except where the order is explicitly stated or logically implied, such as when the input from one operation depends on the output of another operation.
[0151] The operations depicted in Figure 6 can be performed by a system (such as system 100 depicted in Figure 1), the system including at least one processor and a memory having stored instructions that, in response to execution by the at least one processor, cause the system to perform the depicted operations. In at least one embodiment, the depicted operations are performed by a combination of hardware and software, wherein the hardware includes one or more APUs, CPUs, GPUs, PPUs, GPGPUs, parallel processors, processing clusters, graphics processors, multiprocessors, etc., as depicted in the figures herein. In at least one embodiment, the software includes libraries such as any one of CUDA, OpenGL, OpenCL, and ROCm, and may also include operating system software.
[0152] At 602, the system identifies the target frame. In at least one embodiment, the target frame refers to a frame representing the state that the agent intends to track during training. In at least one embodiment, the agent refers to software and / or circuitry that coordinates the animation of a virtual object using, for example, the system described with respect to Figure 1.
[0153] At 604, the system identifies a frame k steps away from the target frame. In at least one embodiment, k is selected based on a random process such that the frame used to initialize the agent is a certain number of time steps before or after the expected target state.
[0154] At 606, the system initializes the agent based on a state associated with the frame, which is k steps away from the target frame. In at least one embodiment, the initialization of the agent refers to the starting state from which movement will begin. For example, in at least one embodiment, the starting state of walking movement is such that both legs are perpendicular to the ground. In at least one embodiment, this can be used as the target state. However, in at least one embodiment, the initialization state is instead k steps away, which can reflect some other deviation such as one leg being angled forward or backward, the knee bent, or standing upright. It should be understood that this example is intended to be illustrative and not restrictive, and therefore should not be interpreted in a way that limits potential embodiments to those conforming to this example.
[0155] At point 608, the system trains its motion actuators to recover from the difference between the target frame and the starting frame (i.e., the frame that is k steps away). In at least one embodiment, the system is not specifically trained for recovery, but is trained to include compensating forces or movements learned based on various motion types.
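Operations 602 through 606 can be sketched as a small helper that picks the initialization frame. This is a sketch under assumptions: the description says only that k is chosen by a random process, so the uniform integer distribution, the clamping to the clip boundaries, and the name `reactive_init_index` are illustrative choices.

```python
import random


def reactive_init_index(target_index, num_frames, max_offset, rng):
    """Pick the frame used to initialize the agent, k steps from the target frame.

    Returns (start_index, k), where k may be negative (before the target)
    or positive (after it). The index is clamped to the clip boundaries.
    """
    k = rng.randint(-max_offset, max_offset)
    start = min(max(target_index + k, 0), num_frames - 1)
    return start, start - target_index


rng = random.Random(0)
start_index, k = reactive_init_index(
    target_index=50, num_frames=100, max_offset=10, rng=rng
)
```

The agent would be initialized from the state at `start_index` while being asked to track the state at `target_index`, so that each training segment begins with a deliberate mismatch for the motion actuator to recover from.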
[0156] Figure 7 shows an example of policy variance control according to at least one embodiment.
[0157] Although the example process 700 is depicted as a sequence of operations, it will be appreciated that in embodiments, the depicted operations may be modified in different ways, and some operations may be omitted, reordered, or performed in parallel with other operations, except where the order is explicitly stated or logically implied, such as when the input from one operation depends on the output of another operation.
[0158] The operations depicted in Figure 7 can be performed by a system (such as system 100 depicted in Figure 1), the system including at least one processor and a memory having stored instructions that, in response to execution by the at least one processor, cause the system to perform the depicted operations. In at least one embodiment, the depicted operations are performed by a combination of hardware and software, wherein the hardware includes one or more APUs, CPUs, GPUs, PPUs, GPGPUs, parallel processors, processing clusters, graphics processors, multiprocessors, etc., as depicted in the figures herein. In at least one embodiment, the software includes libraries such as any one of CUDA, OpenGL, OpenCL, and ROCm, and may also include operating system software.
[0159] At 702, the system initializes the allowable variance for each joint. In at least one embodiment, each joint of the character model or other object model may have a variance that reflects the different physical characteristics of the subject it represents. For example, in at least one embodiment, the control variance of the subject's toes may be significantly smaller than the control variance of the subject's knee joint.
[0160] At 704, the system schedules variance decay. In at least one embodiment, the variance decreases over time during training, resulting in an initial large variance that then decreases over time. In at least one embodiment, the variance decay is based on exponential annealing, although it will be recognized that a variety of other techniques can be employed. In at least one embodiment, an adaptive variance update scheme is deployed, as described above.
[0161] At point 706, the system is trained according to a predetermined variance. In at least one embodiment, this is accomplished according to the process described above for training a scheduler on a motion dataset.
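Operations 702 and 704 can be illustrated with a per-joint variance schedule based on exponential annealing. This is a minimal sketch, not the disclosed scheme: the joint names, the initial and final variances, the decay rate, and the function name `annealed_variance` are all hypothetical values chosen for illustration.

```python
import math


def annealed_variance(initial, final, step, decay_rate):
    """Exponentially anneal each joint's control variance toward its floor."""
    scale = math.exp(-decay_rate * step)
    return {
        joint: final[joint] + (initial[joint] - final[joint]) * scale
        for joint in initial
    }


# Hypothetical per-joint values: the toe is given less variance than the knee.
initial = {"knee": 0.20, "toe": 0.05}
final = {"knee": 0.02, "toe": 0.005}

early = annealed_variance(initial, final, step=0, decay_rate=1e-3)
later = annealed_variance(initial, final, step=1000, decay_rate=1e-3)
late = annealed_variance(initial, final, step=10000, decay_rate=1e-3)
```

A schedule of this shape keeps exploration broad at the start of training and narrows it as the policy improves, while preserving the relative ordering between joints.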
[0162] Figure 8 shows an example of training a system to animate virtual objects according to at least one embodiment. In at least one embodiment, process 800 depicted in Figure 8 trains a neural network in accordance with the various embodiments described herein (including the embodiments described with respect to Figure 1).
[0163] Although the example process 800 is depicted as a sequence of operations, it will be appreciated that in embodiments, the depicted operations may be modified in different ways, and some operations may be omitted, reordered, or performed in parallel with other operations, except where the order is explicitly stated or logically implied, such as when the input from one operation depends on the output of another operation.
[0164] The operations depicted in Figure 8 can be performed by a system (such as system 100 depicted in Figure 1), the system including at least one processor and a memory having stored instructions that, in response to execution by the at least one processor, cause the system to perform the depicted operations. In at least one embodiment, the operations are performed by a combination of hardware and software, wherein the hardware includes one or more APUs, CPUs, GPUs, PPUs, GPGPUs, parallel processors, processing clusters, graphics processors, multiprocessors, etc., as depicted in the figures herein. In at least one embodiment, the software includes libraries such as any one of CUDA, OpenGL, OpenCL, and ROCm, and may also include operating system software.
[0165] In at least one embodiment, the system includes one or more processors. In at least one embodiment, at least one of the one or more processors includes circuitry for training one or more neural networks to identify one or more forces to be applied to one or more objects, at least in part, based on training data corresponding to two or more aspects of motion of the one or more objects. In at least one embodiment, the aspects of motion are motions depicted by examples provided in the training data, such as walking, running, jumping, etc. In at least one embodiment, the training data includes video data.
[0166] In at least one embodiment, the one or more objects are models of figures or characters. In at least one embodiment, the one or more objects are models of human, animal, or robotic figures or characters. In at least one embodiment, the one or more objects include virtual joints representing points where bending or stretching can occur. In at least one embodiment, the one or more forces are applied to the one or more joints. In at least one embodiment, the application of the forces is simulated such that the subsequent state of the one or more objects can be determined based on the amount of force applied to move the object. A stream of such states can then be used to generate animated graphics including the objects.
[0167] In at least one embodiment, the one or more neural networks include motion actuators trained to generate one or more forces to be applied to one or more joints of one or more objects, based at least in part on target states of one or more objects provided as input to the motion actuators.
[0168] At 802, the system hierarchically organizes the training data. In at least one embodiment, aspects of motion in the training data are hierarchically organized according to their degree of specialization. Examples of such hierarchical organization are provided herein, for example with respect to Figure 2.
[0169] At 804, the system determines control variance decay. In at least one embodiment, the variance of joints associated with one or more objects is decayed during training according to a scheduled variance decay. In at least one embodiment, the control variance is decayed according to one or more techniques described herein. In at least one embodiment, each joint associated with one or more objects has a specific variance based on the physical properties of the corresponding real-world joint. For example, in at least one embodiment, objects representing human figures may have different control variances for the ankle and knee joints to reflect the corresponding different ranges of motion.
[0170] At point 806, the system selects examples of motion from the levels of the hierarchy. In at least one embodiment, one or more neural networks are trained by randomly selecting aspects of motion from a first level of the hierarchy, and then randomly selecting aspects of motion from a second level of the hierarchy below the first level. In at least one embodiment, a certain number of N1 samples are selected from the top level of the hierarchy, and then a certain number of N2 samples are selected from the second level below the top level, and so on.
[0171] At point 808, the system uses a reactive state initialization scheme to establish a training segment. In at least one embodiment, the training segment is initialized based on the state of a motion frame that is offset by one or more frames from a starting frame of video data that includes the aspect of motion.
[0172] At point 810, when any individual reward item drops below a threshold level, the system can determine to terminate the training segment. In at least one embodiment, this is accomplished using one or more techniques for terminating the training segment, for example those described with respect to Figure 1.
[0173] At point 812, the system determines whether further training is needed. If not, training is completed at point 814; otherwise, training continues using an additional example of the motion aspect.
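The termination rule at 810 can be sketched as a check applied after every simulation step of a training segment. This is an illustrative sketch: the reward term names, the threshold values, and the helper names `should_terminate` and `run_segment` are assumptions, and `step_fn` stands in for whatever advances the simulation and returns the individual reward terms.

```python
def should_terminate(reward_terms, thresholds):
    """Return True if any individual reward term is below its threshold."""
    return any(reward_terms[name] < thresholds[name] for name in thresholds)


def run_segment(step_fn, thresholds, max_steps):
    """Roll out one training segment, stopping early on a failed reward term."""
    history = []
    for t in range(max_steps):
        reward_terms = step_fn(t)
        history.append(reward_terms)
        if should_terminate(reward_terms, thresholds):
            break
    return history


# Hypothetical rollout in which the pose reward degrades over time.
def fake_step(t):
    return {"pose": 1.0 - 0.2 * t, "velocity": 0.9}


thresholds = {"pose": 0.5, "velocity": 0.3}
history = run_segment(fake_step, thresholds, max_steps=100)
```

Checking each term separately, rather than a weighted sum, means that a high score on one objective cannot mask a failure on another; here the segment ends as soon as the pose term falls below its threshold, even though the velocity term remains high.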
[0174] Inference and training logic
[0175] Figure 9A shows inference and / or training logic 915 for performing inference and / or training operations associated with one or more embodiments. Details about inference and / or training logic 915 are provided below in conjunction with Figure 9A and / or Figure 9B.
[0176] In at least one embodiment, inference and / or training logic 915 may include, but is not limited to, code and / or data storage 901 for storing forward and / or output weights and / or input / output data, and / or other parameters configuring neurons or layers of a neural network trained for and / or used for inference in one or more embodiments. In at least one embodiment, training logic 915 may include or be coupled to code and / or data storage 901 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic, including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, code and / or data storage 901 stores weight parameters and / or input / output data of each layer of a neural network trained or used in one or more embodiments during forward propagation of input / output data and / or weight parameters during training and / or inference using one or more embodiments. In at least one embodiment, any portion of the code and / or data storage 901 may be included within other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0177] In at least one embodiment, any portion of the code and / or data storage 901 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 901 may be a cache memory, dynamic random-addressable memory (“DRAM”), static random-addressable memory (“SRAM”), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 901 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage space, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0178] In at least one embodiment, the inference and / or training logic 915 may include, but is not limited to, code and / or data storage 905 to store backward and / or output weights and / or input / output data corresponding to neurons or layers of a neural network trained and / or used for inference in one or more embodiments. In at least one embodiment, during training and / or inference using one or more embodiments, the code and / or data storage 905 stores weight parameters and / or input / output data for each layer of a neural network trained or used in one or more embodiments during backpropagation of input / output data and / or weight parameters. In at least one embodiment, the training logic 915 may include or be coupled to code and / or data storage 905 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)).
[0179] In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, any portion of the code and / or data storage 905 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of the code and / or data storage 905 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 905 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 905 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0180] In at least one embodiment, code and / or data storage 901 and code and / or data storage 905 may be separate storage structures. In at least one embodiment, code and / or data storage 901 and code and / or data storage 905 may be the same storage structure. In at least one embodiment, code and / or data storage 901 and code and / or data storage 905 may be partially combined and partially separated. In at least one embodiment, any portion of code and / or data storage 901 and code and / or data storage 905 may be included with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0181] In at least one embodiment, the inference and / or training logic 915 may include, but is not limited to, one or more arithmetic logic units (“ALUs”) 910 (including integer and / or floating-point units) for performing logical and / or mathematical operations at least in part based on or instructed by training and / or inference code (e.g., graph code), the results of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 920, which are functions of input / output and / or weight parameter data stored in code and / or data storage 901 and / or code and / or data storage 905. In at least one embodiment, the activations stored in activation storage 920 are generated according to linear algebraic and / or matrix-based mathematics performed by ALUs 910 in response to executing instructions or other code, wherein weight values stored in code and / or data storage 905 and / or code and / or data storage 901 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in code and / or data storage 905 or code and / or data storage 901 or other on-chip or off-chip storage.
[0182] In at least one embodiment, one or more processors or other hardware logic devices or circuits include one or more ALUs 910, while in another embodiment, one or more ALUs 910 may be located outside the processor or other hardware logic device or the circuitry using them (e.g., a coprocessor). In at least one embodiment, one or more ALUs 910 may be included within an execution unit of a processor, or otherwise included in a group of ALUs accessible by the execution unit of the processor, which may be within the same processor or distributed among different processors of different types (e.g., a central processing unit, a graphics processing unit, a fixed-function unit, etc.). In at least one embodiment, code and / or data storage 901, code and / or data storage 905, and activation storage 920 may share a processor or other hardware logic device or circuitry, while in another embodiment, they may be located in different processors or other hardware logic devices or circuitry, or in some combination of the same and different processors or other hardware logic devices or circuitry. In at least one embodiment, any portion of activation storage 920 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. Furthermore, inference and / or training code may be stored together with other code accessible to the processor or other hardware logic or circuitry, and may be retrieved and / or processed using the processor’s fetch, decode, schedule, execute, exit, and / or other logic circuitry.
[0183] In at least one embodiment, the activation storage 920 may be a cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other memory. In at least one embodiment, the activation storage 920 may be wholly or partially located inside or outside one or more processors or other logic circuits. In at least one embodiment, the choice of whether the activation storage 920 is internal or external to the processor, and of which memory type it comprises (e.g., DRAM, SRAM, flash memory, or another type), may depend on the availability of on-chip or off-chip storage, the latency requirements of the training and/or inference functions being performed, the batch size of data used in inference and/or training of the neural network, or some combination of these factors.
[0184] In at least one embodiment, the inference and/or training logic 915 shown in Figure 9A can be used in conjunction with an application-specific integrated circuit (“ASIC”), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel Corp. In at least one embodiment, the inference and/or training logic 915 shown in Figure 9A can be used in conjunction with central processing unit (“CPU”) hardware, graphics processing unit (“GPU”) hardware, or other hardware, such as a field-programmable gate array (“FPGA”).
[0185] Figure 9B illustrates inference and/or training logic 915 according to at least one embodiment. In at least one embodiment, the inference and/or training logic 915 may include, but is not limited to, hardware logic in which computational resources are dedicated or otherwise used exclusively in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and/or training logic 915 shown in Figure 9B can be used in conjunction with an application-specific integrated circuit (ASIC), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel Corp. In at least one embodiment, the inference and/or training logic 915 shown in Figure 9B can be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware, such as a field-programmable gate array (FPGA). In at least one embodiment, the inference and/or training logic 915 includes, but is not limited to, code and/or data storage 901 and code and/or data storage 905, which can be used to store code (e.g., graph code), weight values, and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment shown in Figure 9B, each of code and/or data storage 901 and code and/or data storage 905 is associated with a dedicated computing resource (e.g., computing hardware 902 and computing hardware 906, respectively). In at least one embodiment, each of computing hardware 902 and computing hardware 906 includes one or more ALUs that perform mathematical functions (e.g., linear algebraic functions) only on the information stored in code and/or data storage 901 and code and/or data storage 905, respectively, and the results of these functions are stored in activation storage 920.
[0186] In at least one embodiment, each of the code and/or data storage 901 and 905 and the corresponding computing hardware 902 and 906 corresponds to a different layer of the neural network, such that the activation obtained from one “storage/computation pair 901/902” of code and/or data storage 901 and computing hardware 902 is provided as input to the next “storage/computation pair 905/906” of code and/or data storage 905 and computing hardware 906, in order to mirror the conceptual organization of the neural network. In at least one embodiment, each of the storage/computation pairs 901/902 and 905/906 may correspond to more than one neural network layer. In at least one embodiment, additional storage/computation pairs (not shown) may be included in the inference and/or training logic 915 after, or in parallel with, the storage/computation pairs 901/902 and 905/906.
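The flow of activations between storage/computation pairs can be illustrated with a minimal sketch. It is not the described hardware: the class name `StorageComputePair`, the function `run_pairs`, the ReLU activation, and all weight values are assumptions chosen only for illustration. Each object holds its own weights and biases (standing in for code and/or data storage) and performs the arithmetic only on that stored data (standing in for computing hardware), and the output of one pair is fed to the next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageComputePair:
    """Hypothetical stand-in for one storage/computation pair (e.g., 901/902)."""
    weights: List[List[float]]  # one row of weights per output neuron
    biases: List[float]

    def compute(self, inputs: List[float]) -> List[float]:
        # Weighted sum plus bias, followed by ReLU. The activation function
        # is an assumption; the description does not name one.
        outputs = []
        for row, bias in zip(self.weights, self.biases):
            total = sum(w * x for w, x in zip(row, inputs)) + bias
            outputs.append(max(0.0, total))
        return outputs


def run_pairs(pairs: List[StorageComputePair], inputs: List[float]) -> List[float]:
    # The activation produced by each pair becomes the input of the next pair.
    activation = inputs
    for pair in pairs:
        activation = pair.compute(activation)
    return activation


# Illustrative values only: two pairs, each modelling one layer.
pair_901_902 = StorageComputePair(weights=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, 1.0])
pair_905_906 = StorageComputePair(weights=[[2.0, 1.0]], biases=[-1.0])
result = run_pairs([pair_901_902, pair_905_906], [3.0, 1.0])
```

In this sketch, additional pairs would simply be appended to the list passed to `run_pairs`; parallel pairs are not modelled.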
[0187] Neural network training and deployment
[0188] Figure 10 illustrates the training and deployment of a deep neural network according to at least one embodiment. In at least one embodiment, an untrained neural network 1006 is trained using a training dataset 1002. In at least one embodiment, the training framework 1004 is the PyTorch framework, while in other embodiments, the training framework 1004 is TensorFlow, Boost, Caffe, Microsoft Cognitive Toolkit/CNTK, MXNet, Chainer, Keras, Deeplearning4j, or another training framework. In at least one embodiment, the training framework 1004 trains the untrained neural network 1006 and enables it to be trained using the processing resources described herein to generate a trained neural network 1008. In at least one embodiment, the weights may be randomly selected or pre-trained using a deep belief network. In at least one embodiment, training may be performed in a supervised, partially supervised, or unsupervised manner.
[0189] In at least one embodiment, supervised learning is used to train the untrained neural network 1006, wherein the training dataset 1002 includes inputs paired with their desired outputs, or wherein the training dataset 1002 includes inputs with known outputs and the outputs of the neural network 1006 are graded manually. In at least one embodiment, the untrained neural network 1006 is trained in a supervised manner: inputs from the training dataset 1002 are processed, and the resulting outputs are compared with a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through the untrained neural network 1006. In at least one embodiment, the training framework 1004 adjusts the weights that control the untrained neural network 1006. In at least one embodiment, the training framework 1004 includes tools for monitoring how well the untrained neural network 1006 is converging toward a model (e.g., a trained neural network 1008) suited to generating correct answers (e.g., result 1014) based on input data (e.g., a new dataset 1012). In at least one embodiment, the training framework 1004 trains the untrained neural network 1006 repeatedly while adjusting the weights to refine the output of the untrained neural network 1006 using a loss function and an adjustment algorithm (e.g., stochastic gradient descent). In at least one embodiment, the training framework 1004 trains the untrained neural network 1006 until it reaches the desired accuracy. In at least one embodiment, the trained neural network 1008 can then be deployed to implement any number of machine learning operations.
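The supervised loop described above (process an input, compare the output with the desired output, propagate the error, adjust the weights) can be sketched for the simplest possible model, a single linear neuron. This is an illustrative sketch rather than the training framework 1004: the function names, the learning rate, the epoch count, and the synthetic dataset are all assumptions.

```python
import random


def train_linear_model(dataset, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), 0.0  # randomly selected initial weight
    samples = list(dataset)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a new random order
        for x, target in samples:
            prediction = w * x + b
            error = prediction - target     # compare output with desired output
            w -= learning_rate * error * x  # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error      # gradient of 0.5 * error**2 w.r.t. b
    return w, b


def mean_squared_error(dataset, w, b):
    """Loss used here to monitor how closely the model fits the dataset."""
    return sum((w * x + b - y) ** 2 for x, y in dataset) / len(dataset)


# Hypothetical training data: inputs paired with desired outputs y = 2x + 1.
training_dataset = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_linear_model(training_dataset)
loss = mean_squared_error(training_dataset, w, b)
```

A deep network replaces the two hand-written gradient lines with backpropagation through every layer, but the structure of the loop is the same.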
[0190] In at least one embodiment, unsupervised learning is used to train the untrained neural network 1006, wherein the untrained neural network 1006 attempts to train itself using unlabeled data. In at least one embodiment, the unsupervised learning training dataset 1002 includes input data without any associated output data or "ground truth" data. In at least one embodiment, the untrained neural network 1006 can learn groupings within the training dataset 1002 and can determine how each input relates to the training dataset 1002 as a whole. In at least one embodiment, unsupervised training can be used to generate a self-organizing map in the trained neural network 1008, which is capable of performing operations useful for reducing the dimensionality of the new dataset 1012. In at least one embodiment, unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in the new dataset 1012 that deviate from the normal patterns of the new dataset 1012.
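Anomaly detection from unlabeled data can be illustrated with a deliberately simple statistical stand-in, not a neural network. The sketch below summarizes unlabeled values by their mean and standard deviation and flags new values that lie far from that normal pattern. The function names, the three-standard-deviation threshold, and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_normal_pattern(unlabeled_data):
    """Summarize unlabeled data by its mean and sample standard deviation."""
    return mean(unlabeled_data), stdev(unlabeled_data)


def find_anomalies(new_data, center, spread, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the center."""
    return [x for x in new_data if abs(x - center) > threshold * spread]


# Hypothetical unlabeled measurements with no associated "ground truth".
training_values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
center, spread = fit_normal_pattern(training_values)
anomalies = find_anomalies([10.1, 9.9, 14.5, 10.0, 5.2], center, spread)
```

A trained network would replace the mean and standard deviation with a learned representation of normal data, but the decision rule (flag what deviates from the learned pattern) is the same idea.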
[0191] In at least one embodiment, semi-supervised learning can be used, a technique in which the training dataset 1002 includes a mixture of labeled and unlabeled data. In at least one embodiment, the training framework 1004 can be used to perform incremental learning, for example, through transfer learning techniques. In at least one embodiment, incremental learning enables the trained neural network 1008 to adapt to a new dataset 1012 without forgetting the knowledge instilled in the trained neural network 1008 during initial training.
[0192] Data Center
[0193] Figure 11 shows an example data center 1100 that can be used with at least one embodiment. In at least one embodiment, the data center 1100 includes a data center infrastructure layer 1110, a framework layer 1120, a software layer 1130, and an application layer 1140.
[0194] In at least one embodiment, as shown in Figure 11, the data center infrastructure layer 1110 may include a resource coordinator 1112, grouped computing resources 1114, and node computing resources (“node CRs”) 1116(1)-1116(N), where “N” represents a positive integer (which may be a different integer “N” from that used in other figures). In at least one embodiment, node CRs 1116(1)-1116(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), memory storage devices 1118(1)-1118(N) (e.g., dynamic read-only memory, solid-state drives, or disk drives), network input/output (“NW I/O”) devices, network switches, virtual machines (“VMs”), power modules, cooling modules, etc. In at least one embodiment, one or more of node CRs 1116(1)-1116(N) may be servers having one or more of the aforementioned computing resources.
[0195] In at least one embodiment, the grouped computing resource 1114 may include individual groups (not shown) of node CRs housed within one or more racks, or a plurality of racks (also not shown) housed within data centers in various geographical locations. In at least one embodiment, the individual groups of node CRs within the grouped computing resource 1114 may include computing, networking, memory, or storage resources that can be configured or allocated to support groups of one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.
[0196] In at least one embodiment, resource coordinator 1112 may configure or otherwise control one or more node CRs 1116(1)-1116(N) and/or grouped computing resources 1114. In at least one embodiment, resource coordinator 1112 may include a software design infrastructure (“SDI”) management entity for data center 1100. In at least one embodiment, resource coordinator 1112 may include hardware, software, or some combination thereof.
[0197] In at least one embodiment, as shown in Figure 11, framework layer 1120 includes a job scheduler 1122, a configuration manager 1124, a resource manager 1126, and a distributed file system 1128. In at least one embodiment, framework layer 1120 may include a framework to support software 1132 of software layer 1130 and/or one or more applications 1142 of application layer 1140. In at least one embodiment, software 1132 or applications 1142 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 1120 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark™ (hereinafter “Spark”), which can utilize distributed file system 1128 for large-scale data processing (e.g., “big data”). In at least one embodiment, the job scheduler 1122 may include a Spark driver to facilitate the scheduling of workloads supported by the various layers of data center 1100. In at least one embodiment, the configuration manager 1124 may be able to configure different layers, such as software layer 1130 and framework layer 1120, including Spark and the distributed file system 1128, to support large-scale data processing. In at least one embodiment, the resource manager 1126 is able to manage clustered or grouped computing resources mapped to or allocated for the support of distributed file system 1128 and job scheduler 1122. In at least one embodiment, the clustered or grouped computing resources may include grouped computing resources 1114 at data center infrastructure layer 1110. In at least one embodiment, the resource manager 1126 may coordinate with resource coordinator 1112 to manage these mapped or allocated computing resources.
[0198] In at least one embodiment, the software 1132 included in software layer 1130 may include software used by at least a portion of nodes CR1116(1)-1116(N), grouped computing resources 1114, and / or the distributed file system 1128 of framework layer 1120. In at least one embodiment, one or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.
[0199] In at least one embodiment, one or more applications 1142 included in application layer 1140 may include one or more types of applications used by at least a portion of node CRs 1116(1)-1116(N), grouped computing resources 1114, and/or the distributed file system 1128 of framework layer 1120. In at least one embodiment, the one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.
[0200] In at least one embodiment, any of the configuration manager 1124, resource manager 1126, and resource coordinator 1112 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 1100 and can prevent underutilization and / or poor performance of the data center.
[0201] In at least one embodiment, data center 1100 may include tools, services, software, or other resources to train one or more machine learning models or to use one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above with respect to data center 1100. In at least one embodiment, information can be inferred or predicted using trained machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 1100, by using weight parameters calculated through one or more training techniques described herein.
[0202] In at least one embodiment, the data center may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, or other hardware to utilize the aforementioned resources to perform training and / or inference. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured as a service to allow a user to train or perform information inference, such as image recognition, speech recognition, or other artificial intelligence services.
[0203] Inference and/or training logic 915 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 915 are provided herein in conjunction with Figure 9A and/or Figure 9B. In at least one embodiment, the inference and/or training logic 915 can be used in the system of Figure 11 for inference or prediction operations based at least in part on weight parameters computed using the neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0204] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the preceding figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0205] Autonomous vehicles
[0206] Figure 12A shows an example of an autonomous vehicle 1200 according to at least one embodiment. In at least one embodiment, the autonomous vehicle 1200 (which may alternatively be referred to herein as "vehicle 1200") may be, but is not limited to, a passenger vehicle, such as a car, truck, bus, and/or another type of vehicle capable of accommodating one or more passengers. In at least one embodiment, vehicle 1200 may be a semi-tractor-trailer for hauling goods. In at least one embodiment, vehicle 1200 may be an aircraft, robotic vehicle, or other type of vehicle.
[0207] Autonomous vehicles can be described according to the levels of automation defined by the National Highway Traffic Safety Administration (“NHTSA”), a division of the U.S. Department of Transportation, and by the Society of Automotive Engineers (“SAE”) in its standard “Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles” (e.g., standard number J3016-201806, published June 15, 2018; standard number J3016-201609, published September 30, 2016; and previous and future versions of this standard). In at least one embodiment, vehicle 1200 may be capable of functioning according to one or more of Level 1 through Level 5 of the autonomous driving levels. For example, in at least one embodiment, vehicle 1200 may be capable of conditional automation (Level 3), high automation (Level 4), and/or full automation (Level 5).
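The level taxonomy lends itself to a small enumeration. The sketch below is illustrative only: the names for Levels 3 to 5 follow the paragraph above, the names for Levels 1 and 2 follow the common SAE J3016 designations rather than this description, and the set `vehicle_1200_levels` and the helper `can_operate_at` are hypothetical.

```python
from enum import IntEnum


class DrivingAutomationLevel(IntEnum):
    """Automation levels 1 to 5; Level 1 and 2 names are taken from SAE J3016."""
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


def can_operate_at(level, supported_levels):
    """Report whether a vehicle supports the requested automation level."""
    return DrivingAutomationLevel(level) in supported_levels


# Hypothetical capability set matching the example in the text (Levels 3 to 5).
vehicle_1200_levels = {
    DrivingAutomationLevel.CONDITIONAL_AUTOMATION,
    DrivingAutomationLevel.HIGH_AUTOMATION,
    DrivingAutomationLevel.FULL_AUTOMATION,
}
```

Using an `IntEnum` keeps the levels ordered, so a comparison such as `level >= DrivingAutomationLevel.HIGH_AUTOMATION` works directly.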
[0208] In at least one embodiment, vehicle 1200 may include, but is not limited to, components such as chassis, body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other vehicle components. In at least one embodiment, vehicle 1200 may include, but is not limited to, propulsion system 1250, such as an internal combustion engine, a hybrid powertrain, an all-electric motor, and / or another type of propulsion system. In at least one embodiment, propulsion system 1250 may be connected to the drivetrain of vehicle 1200, which may include, but is not limited to, a transmission, to enable propulsion of vehicle 1200. In at least one embodiment, propulsion system 1250 may be controlled in response to receiving a signal from throttle / accelerator 1252.
[0209] In at least one embodiment, when the propulsion system 1250 is operating (e.g., when the vehicle 1200 is traveling), the steering system 1254 (which may include, but is not limited to, a steering wheel) is used to steer the vehicle 1200 (e.g., along a desired path or route). In at least one embodiment, the steering system 1254 may receive signals from the steering actuator 1256. In at least one embodiment, the steering wheel may be optional for fully automated (Level 5) functionality. In at least one embodiment, the brake sensor system 1246 may be used to operate the vehicle brakes in response to signals received from the brake actuator 1248 and / or brake sensors.
[0210] In at least one embodiment, one or more controllers 1236, which may include, but are not limited to, one or more system-on-chips (“SoCs”) (not shown in Figure 12A) and/or graphics processing units (“GPUs”), provide signals (e.g., representing commands) to one or more components and/or systems of vehicle 1200. For example, in at least one embodiment, the one or more controllers 1236 may send signals to operate the vehicle brakes via brake actuator 1248, to operate steering system 1254 via one or more steering actuators 1256, and to operate propulsion system 1250 via one or more throttles/accelerators 1252. In at least one embodiment, one or more controllers 1236 may include one or more onboard (e.g., integrated) computing devices that process sensor signals and output operating commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a driver in driving vehicle 1200. In at least one embodiment, one or more controllers 1236 may include a first controller for autonomous driving functions, a second controller for functional safety functions, a third controller for artificial intelligence functions (e.g., computer vision), a fourth controller for infotainment functions, a fifth controller for redundancy in emergency situations, and/or other controllers. In at least one embodiment, a single controller may handle two or more of the functions described above, two or more controllers may handle a single function, and/or any combination thereof.
[0211] In at least one embodiment, one or more controllers 1236 provide signals for controlling one or more components and/or systems of vehicle 1200 in response to sensor data (e.g., sensor inputs) received from one or more sensors. In at least one embodiment, sensor data can be received from sensors including, but not limited to, one or more Global Navigation Satellite System (“GNSS”) sensors 1258 (e.g., one or more Global Positioning System sensors), one or more RADAR sensors 1260, one or more ultrasonic sensors 1262, one or more LIDAR sensors 1264, one or more inertial measurement unit (“IMU”) sensors 1266 (e.g., one or more accelerometers, one or more gyroscopes, one or more magnetic compasses, one or more magnetometers, etc.), one or more microphones 1296, one or more stereo cameras 1268, one or more wide-angle cameras 1270 (e.g., fisheye cameras), one or more infrared cameras 1272, one or more surround cameras 1274 (e.g., 360-degree cameras), long-range cameras (not shown in Figure 12A), mid-range cameras (not shown in Figure 12A), one or more speed sensors 1244 (e.g., for measuring the speed of vehicle 1200), one or more vibration sensors 1242, one or more steering sensors 1240, one or more brake sensors (e.g., as part of brake sensor system 1246), and/or other sensor types.
[0212] In at least one embodiment, one or more controllers 1236 may receive input (e.g., represented by input data) from the dashboard 1232 of the vehicle 1200 and provide output (e.g., represented by output data, display data, etc.) via a human-machine interface (“HMI”) display 1234, an audible annunciator, a speaker, and/or other components of the vehicle 1200. In at least one embodiment, the output may include information such as vehicle velocity, speed, time, and map data (e.g., a high-definition map, not shown in Figure 12A). For example, in at least one embodiment, the HMI display 1234 may display information about the presence of one or more objects (e.g., road signs, warning signs, traffic light changes, etc.) and/or information about driving maneuvers that the vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).
[0213] In at least one embodiment, vehicle 1200 further includes a network interface 1224 that can communicate over one or more networks using one or more wireless antennas 1226 and / or one or more modems. For example, in at least one embodiment, network interface 1224 may be able to communicate over Long Term Evolution (“LTE”), Wideband Code Division Multiple Access (“WCDMA”), Universal Mobile Telecommunications System (“UMTS”), Global System for Mobile Communications (“GSM”), IMT-CDMA Multicarrier (“CDMA2000”) networks, etc. In at least one embodiment, one or more wireless antennas 1226 may also enable communication between objects in the environment (e.g., vehicles, mobile devices) using one or more local area networks (e.g., Bluetooth, Bluetooth Low Energy (LE), Z-Wave, ZigBee, etc.) and / or one or more low-power wide area networks (hereinafter “LPWAN”) (e.g., LoRaWAN, SigFox, etc. protocols).
[0214] Inference and/or training logic 915 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 915 are provided herein in conjunction with Figure 9A and/or Figure 9B. In at least one embodiment, the inference and/or training logic 915 can be used in the system of Figure 12A for inference or prediction operations based at least in part on weight parameters computed using the neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0215] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the preceding figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0216] Figure 12B shows an example of camera positions and fields of view for the autonomous vehicle 1200 of Figure 12A, according to at least one embodiment. In at least one embodiment, the cameras and their respective fields of view are one example embodiment and are not intended to be limiting. For example, in at least one embodiment, additional and/or alternative cameras may be included, and/or the cameras may be located at different positions on the vehicle 1200.
[0217] In at least one embodiment, the camera types used for the cameras may include, but are not limited to, digital cameras suitable for use with components and/or systems of vehicle 1200. In at least one embodiment, one or more cameras may operate at Automotive Safety Integrity Level (“ASIL”) B and/or another ASIL. In at least one embodiment, the camera types may be capable of any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc. In at least one embodiment, the cameras may be able to use a rolling shutter, a global shutter, another type of shutter, or a combination thereof. In at least one embodiment, the color filter array may include a red clear clear clear (“RCCC”) color filter array, a red clear clear blue (“RCCB”) color filter array, a red blue green clear (“RBGC”) color filter array, a Foveon X3 color filter array, a Bayer sensor (“RGGB”) color filter array, a monochrome sensor color filter array, and/or another type of color filter array. In at least one embodiment, clear-pixel cameras, such as cameras with RCCC, RCCB, and/or RBGC color filter arrays, may be used to increase light sensitivity.
[0218] In at least one embodiment, one or more cameras may be used to perform advanced driver assistance system (“ADAS”) functions (e.g., as part of a redundancy or fail-safe design). For example, in at least one embodiment, a multi-function mono camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlight control. In at least one embodiment, one or more cameras (e.g., all cameras) may simultaneously record and provide image data (e.g., video).
[0219] In at least one embodiment, one or more cameras may be mounted in a mounting assembly, such as a custom-designed (3D-printed) assembly, to cut out stray light and reflections from within the vehicle 1200 (e.g., reflections from the dashboard in the windshield mirror) that may interfere with the camera's ability to capture image data. With regard to rearview mirror mounting assemblies, in at least one embodiment, the rearview mirror assembly may be custom 3D-printed such that the camera mounting plate matches the shape of the rearview mirror. In at least one embodiment, one or more cameras may be integrated into the rearview mirror. In at least one embodiment, for side-view cameras, one or more cameras may also be integrated within the four pillars at each corner of the cabin.
[0220] In at least one embodiment, a camera (e.g., a forward-facing camera) having a field of view including a portion of the environment in front of the vehicle 1200 can be used for surround view and, with the assistance of one or more controllers 1236 and / or control SoCs, to help identify forward paths and obstacles, thereby providing information crucial for generating an occupancy grid and / or determining a preferred vehicle path. In at least one embodiment, the forward-facing camera can be used to perform many ADAS functions similar to LIDAR, including but not limited to emergency braking, pedestrian detection, and collision avoidance. In at least one embodiment, the forward-facing camera can also be used for ADAS functions and systems, including but not limited to lane departure warning (“LDW”), adaptive cruise control (“ACC”), and / or other functions (e.g., traffic sign recognition).
[0221] In at least one embodiment, various cameras can be used in a forward-facing configuration, including, for example, a monocular camera platform including a CMOS (“complementary metal-oxide-semiconductor”) color imager. In at least one embodiment, a wide-angle camera 1270 can be used to sense objects entering from the periphery (e.g., pedestrians, crosswalkers, or bicycles). Although in Figure 12BOnly one wide-angle camera 1270 is shown; however, in other embodiments, the vehicle 1200 may have any number (including zero) of wide-angle cameras. In at least one embodiment, any number of remote cameras 1298 (e.g., a pair of remote stereo cameras) can be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. In at least one embodiment, the remote camera 1298 can also be used for object detection and classification, as well as basic object tracking.
[0222] In at least one embodiment, any number of stereo cameras 1268 may also be included in a forward configuration. In at least one embodiment, one or more stereo cameras 1268 may include an integrated control unit comprising a scalable processing unit that may provide programmable logic (“FPGA”) and a multi-core microprocessor with a controller area network (“CAN”) or Ethernet interface integrated on a single chip. In at least one embodiment, such a unit may be used to generate a 3D map of the environment of the vehicle 1200, including distance estimates for all points in the image. In at least one embodiment, one or more stereo cameras 1268 may include, but are not limited to, a compact stereo vision sensor, which may include, but is not limited to, two camera samples (one on the left and one on the right) and an image processing chip that can measure the distance from the vehicle 1200 to a target object and use the generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo cameras 1268 may also be used in addition to those described herein.
[0223] In at least one embodiment, a camera (e.g., a side-view camera) having a field of view including a portion of the environment on the side of the vehicle 1200 can be used for surround viewing, thereby providing information for creating and updating an occupancy grid and for generating side collision warnings. For example, in at least one embodiment, surround cameras 1274 (e.g., the four surround cameras shown in Figure 12B) can be positioned on vehicle 1200. In at least one embodiment, one or more surround cameras 1274 may include, but are not limited to, any number and combination of wide-angle cameras, one or more fisheye cameras, one or more 360-degree cameras, and / or similar cameras. For example, in at least one embodiment, four fisheye cameras may be located at the front, rear, and sides of vehicle 1200. In at least one embodiment, vehicle 1200 may use three surround cameras 1274 (e.g., left, right, and rear) and may utilize one or more other cameras (e.g., forward-facing cameras) as a fourth surround-view camera.
[0224] In at least one embodiment, a camera (e.g., a rear-view camera) having a field of view including a portion of the environment behind the vehicle 1200 can be used for parking assistance, surround view, rear collision warning, and creating and updating an occupancy grid. In at least one embodiment, a wide variety of cameras can be used, including but not limited to cameras that are also suitable as one or more forward-facing cameras (e.g., long-range camera 1298 and / or one or more mid-range cameras 1276, one or more stereo cameras 1268, one or more infrared cameras 1272, etc.), as described herein.
[0225] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 can be used in the system of Figure 12B for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0226] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0227] Figure 12C is a block diagram illustrating an example system architecture for the autonomous vehicle 1200 of Figure 12A, according to at least one embodiment. In at least one embodiment, each of one or more components, one or more features, and one or more systems of vehicle 1200 in Figure 12C is shown as connected via bus 1202. In at least one embodiment, bus 1202 may include, but is not limited to, a CAN data interface (which may alternatively be referred to herein as a "CAN bus"). In at least one embodiment, CAN may be a network within vehicle 1200 used to help control various features and functions of vehicle 1200, such as brake actuation, acceleration, braking, steering, windshield wipers, etc. In at least one embodiment, bus 1202 may be configured to have dozens or even hundreds of nodes, each node having its own unique identifier (e.g., a CAN ID). In at least one embodiment, bus 1202 can be read to find steering wheel angle, ground speed, engine revolutions per minute ("RPM"), button positions, and / or other vehicle status indicators. In at least one embodiment, bus 1202 may be an ASIL B compliant CAN bus.
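As an illustrative sketch of reading vehicle status indicators from a bus such as bus 1202, the following Python snippet decodes CAN-style frames by identifier. The message identifiers, scaling factors, and payload layouts are hypothetical and chosen only for illustration; in practice they depend on the signal definitions of a particular vehicle, which are not specified here.

```python
import struct
from typing import NamedTuple


class CanFrame(NamedTuple):
    can_id: int   # identifier of the message on the bus
    data: bytes   # payload (up to 8 bytes for classic CAN)


# Hypothetical message IDs and encodings (assumptions, not from the source).
STEERING_ANGLE_ID = 0x025   # signed 16-bit big-endian, 0.1 degree per bit
GROUND_SPEED_ID = 0x0B4     # unsigned 16-bit big-endian, 0.01 km/h per bit
ENGINE_RPM_ID = 0x1C4       # unsigned 16-bit big-endian, 1 RPM per bit


def decode_status(frames):
    """Collect vehicle status indicators from an iterable of CAN frames."""
    status = {}
    for frame in frames:
        if frame.can_id == STEERING_ANGLE_ID:
            (raw,) = struct.unpack_from(">h", frame.data)
            status["steering_angle_deg"] = raw * 0.1
        elif frame.can_id == GROUND_SPEED_ID:
            (raw,) = struct.unpack_from(">H", frame.data)
            status["ground_speed_kmh"] = raw * 0.01
        elif frame.can_id == ENGINE_RPM_ID:
            (raw,) = struct.unpack_from(">H", frame.data)
            status["engine_rpm"] = raw
        # Frames with unknown identifiers are ignored.
    return status


frames = [
    CanFrame(STEERING_ANGLE_ID, struct.pack(">h", -125)),
    CanFrame(GROUND_SPEED_ID, struct.pack(">H", 5000)),
    CanFrame(ENGINE_RPM_ID, struct.pack(">H", 2200)),
    CanFrame(0x7FF, b"\x00"),  # unrelated frame, skipped by the decoder
]
status = decode_status(frames)
```

Dispatching on the identifier mirrors the idea that each node on the bus has its own unique identifier, so a reader can pick out only the indicators it needs.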
[0228] In at least one embodiment, FlexRay and / or Ethernet protocols may be used in addition to, or as an alternative to, CAN. In at least one embodiment, any number of buses may form bus 1202, which may include, but is not limited to, zero or more CAN buses, zero or more FlexRay buses, zero or more Ethernet buses, and / or zero or more other types of buses using other protocols. In at least one embodiment, two or more buses may be used to perform different functions and / or may be used for redundancy. For example, a first bus may be used for a collision avoidance function, and a second bus may be used for actuation control. In at least one embodiment, each of any number of Systems-on-Chip (“SoCs”) 1204 (e.g., SoC 1204(A) and SoC 1204(B)), each of one or more controllers 1236, and / or each computer within the vehicle may access the same input data (e.g., input from sensors of the vehicle 1200) and may be connected to a common bus, such as a CAN bus.
[0229] In at least one embodiment, vehicle 1200 may include one or more controllers 1236, such as those described herein with respect to Figure 12A. In at least one embodiment, controller 1236 can be used for a variety of functions. In at least one embodiment, controller 1236 can be coupled to any of various other components and systems of vehicle 1200 and can be used to control vehicle 1200, artificial intelligence of vehicle 1200, infotainment and / or other functions of vehicle 1200.
[0230] In at least one embodiment, vehicle 1200 may include any number of SoCs 1204. In at least one embodiment, each of the SoCs 1204 may include, but is not limited to, one or more central processing units (“CPUs”) 1206, one or more graphics processing units (“GPUs”) 1208, one or more processors 1210, one or more caches 1212, one or more accelerators 1214, one or more data storage 1216, and / or other components and features not shown. In at least one embodiment, one or more SoCs 1204 may be used to control vehicle 1200 in a variety of platforms and systems. For example, in at least one embodiment, one or more SoCs 1204 may be combined in a system (e.g., the system of vehicle 1200) with a high-definition (“HD”) map 1222, which may obtain map refreshes and / or updates via a network interface 1224 from one or more servers (not shown in Figure 12C).
[0231] In at least one embodiment, one or more CPUs 1206 may include CPU clusters or CPU complexes (which may alternatively be referred to herein as “CCPLEX”). In at least one embodiment, one or more CPUs 1206 may include multiple cores and / or a level-two (“L2”) cache. For example, in at least one embodiment, one or more CPUs 1206 may include eight cores in a coherent multiprocessor configuration. In at least one embodiment, one or more CPUs 1206 may include four dual-core clusters, each cluster having a dedicated L2 cache (e.g., a 2MB L2 cache). In at least one embodiment, one or more CPUs 1206 (e.g., CCPLEX) may be configured to support simultaneous cluster operation, such that any combination of clusters of one or more CPUs 1206 can be active at any given time.
[0232] In at least one embodiment, one or more CPUs 1206 may implement power management functions, including but not limited to one or more of the following features: automatic clock gating of individual hardware modules when idle to conserve dynamic power; clock gating of each core when the core is not actively executing instructions due to execution of Wait for Interrupt (“WFI”) / Wait for Event (“WFE”) instructions; independent power gating of each core; independent clock gating of each core cluster when all cores are clock-gated or power-gated; and / or independent power gating of each core cluster when all cores are power-gated. In at least one embodiment, one or more CPUs 1206 may further implement an enhanced algorithm for managing power states, wherein allowed power states and expected wake-up times are specified, and the hardware / microcode determines the best power state to enter for a core, a cluster, and the CCPLEX. In at least one embodiment, the processing cores may support simplified power state entry sequences in software, with the work offloaded to the microcode.
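One way to picture the power state selection described above is the minimal Python sketch below. It assumes a simple policy that is not stated in the source: among the allowed states, pick the one with the lowest power draw whose wake-up latency still fits within the expected wake-up time. The state names, power figures, and latencies are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    power_mw: float         # hypothetical power draw while in this state
    wake_latency_us: float  # hypothetical time needed to resume execution


# Hypothetical state table (assumption for illustration only).
STATES = [
    PowerState("active", 1000.0, 0.0),
    PowerState("clock_gated", 300.0, 5.0),
    PowerState("power_gated", 20.0, 200.0),
]


def select_power_state(allowed, expected_wake_us):
    """Pick the lowest-power allowed state that can wake up in time."""
    candidates = [
        s for s in STATES
        if s.name in allowed and s.wake_latency_us <= expected_wake_us
    ]
    if not candidates:
        raise ValueError("no allowed power state meets the wake-up deadline")
    return min(candidates, key=lambda s: s.power_mw)


all_states = {"active", "clock_gated", "power_gated"}
short_idle = select_power_state(all_states, expected_wake_us=50.0)
long_idle = select_power_state(all_states, expected_wake_us=1000.0)
```

With these assumed numbers, a short idle period selects clock gating, while a long idle period permits power gating.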
[0233] In at least one embodiment, one or more GPUs 1208 may include integrated GPUs (which may alternatively be referred to herein as "iGPUs"). In at least one embodiment, one or more GPUs 1208 may be programmable and efficient for parallel workloads. In at least one embodiment, one or more GPUs 1208 may use an enhanced tensor instruction set. In at least one embodiment, one or more GPUs 1208 may include one or more streaming microprocessors, wherein each streaming microprocessor may include a Level 1 ("L1") cache (e.g., an L1 cache with at least 96KB of storage capacity), and two or more streaming microprocessors may share an L2 cache (e.g., an L2 cache with 512KB of storage capacity). In at least one embodiment, one or more GPUs 1208 may include at least eight streaming microprocessors. In at least one embodiment, one or more GPUs 1208 may use one or more compute application programming interfaces ("APIs"). In at least one embodiment, one or more GPUs 1208 may use one or more parallel computing platforms and / or programming models (e.g., NVIDIA's CUDA model).
[0234] In at least one embodiment, one or more GPUs 1208 may be power-optimized for best performance in automotive and embedded use cases. For example, in at least one embodiment, one or more GPUs 1208 may be fabricated on fin field-effect transistor (“FinFET”) circuitry. In at least one embodiment, each streaming microprocessor may include multiple mixed-precision processing cores divided into multiple blocks. For example, but not limited to, 64 FP32 cores and 32 FP64 cores may be divided into four processing blocks. In at least one embodiment, each processing block may be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA Tensor cores for deep learning matrix arithmetic, a level-zero (“L0”) instruction cache, a warp scheduler, a dispatch unit, and / or a 64KB register file. In at least one embodiment, the streaming microprocessor may include independent parallel integer and floating-point data paths to provide efficient execution of workloads that mix computation and addressing operations. In at least one embodiment, the streaming microprocessor may include independent thread scheduling capabilities to enable finer-grained synchronization and cooperation between parallel threads. In at least one embodiment, the streaming microprocessor may include a combined L1 data cache and shared memory unit to improve performance while simplifying programming.
[0235] In at least one embodiment, one or more GPUs 1208 may include high-bandwidth memory (“HBM”) and / or a 16GB HBM2 memory subsystem to provide, in some examples, a peak memory bandwidth of approximately 900GB / s. In at least one embodiment, in addition to or instead of HBM memory, synchronous graphics random access memory (“SGRAM”) may be used, such as graphics double data rate type five synchronous random access memory (“GDDR5”).
[0236] In at least one embodiment, one or more GPUs 1208 may include unified memory technology. In at least one embodiment, address translation service (“ATS”) support may be used to allow one or more GPUs 1208 to directly access the page tables of one or more CPUs 1206. In at least one embodiment, when a memory management unit (“MMU”) of one or more GPUs 1208 experiences a miss, an address translation request may be sent to one or more CPUs 1206. In response, in at least one embodiment, two CPUs of one or more CPUs 1206 may look up the virtual-physical mapping of the address in their page tables and transfer the translation back to one or more GPUs 1208. In at least one embodiment, unified memory technology may allow a single unified virtual address space to be used for the memory of both one or more CPUs 1206 and one or more GPUs 1208, thereby simplifying the programming of one or more GPUs 1208 and the porting of applications to one or more GPUs 1208.
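The address translation flow described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class names `CpuPageTable` and `GpuMmu`, the 4 KB page size, and the dictionary-based translation cache are all hypothetical and only illustrate the miss, request, and reply sequence.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class CpuPageTable:
    """Hypothetical CPU-side page table: virtual page -> physical page."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def translate(self, vpn):
        return self.mapping[vpn]


class GpuMmu:
    """Hypothetical GPU MMU that asks the CPU for a translation on a miss."""

    def __init__(self, cpu_page_table):
        self.cpu_page_table = cpu_page_table
        self.tlb = {}      # cached translations held by the GPU
        self.misses = 0    # number of translation requests sent to the CPU

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:
            # Miss: send a translation request and cache the reply.
            self.misses += 1
            self.tlb[vpn] = self.cpu_page_table.translate(vpn)
        return self.tlb[vpn] * PAGE_SIZE + offset


cpu_table = CpuPageTable({2: 7})
mmu = GpuMmu(cpu_table)
first = mmu.translate(2 * PAGE_SIZE + 16)   # miss, resolved via the CPU
second = mmu.translate(2 * PAGE_SIZE + 32)  # hit, served from the GPU cache
```

Because both processors resolve addresses through the same page tables, the same virtual address refers to the same physical location on either side, which is the property a unified virtual address space relies on.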
[0237] In at least one embodiment, one or more GPUs 1208 may include any number of access counters that can track the frequency with which one or more GPUs 1208 access the memory of other processors. In at least one embodiment, one or more access counters can help ensure that memory pages are moved to the physical memory of the processor that accesses the pages most frequently, thereby improving the efficiency of memory ranges shared between processors.
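A minimal sketch of counter-driven page placement follows. The `AccessCounters` class, its method names, and the rebalancing policy (move each page to whichever processor accessed it most often) are assumptions made for illustration; the source does not specify how counters are stored or when migration occurs.

```python
from collections import Counter, defaultdict


class AccessCounters:
    """Hypothetical per-page access counters used to guide page migration."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # page -> {processor: accesses}
        self.location = {}                  # page -> processor holding the page

    def place(self, page, processor):
        self.location[page] = processor

    def record(self, page, processor):
        self.counts[page][processor] += 1

    def rebalance(self):
        """Move each page to its most frequent accessor; return moved pages."""
        moved = []
        for page, counter in self.counts.items():
            hottest, _ = counter.most_common(1)[0]
            if self.location.get(page) != hottest:
                self.location[page] = hottest
                moved.append(page)
        return moved


counters = AccessCounters()
counters.place(0, "cpu")
counters.place(1, "cpu")
for _ in range(3):
    counters.record(0, "gpu")   # page 0 is mostly touched by the GPU
counters.record(0, "cpu")
counters.record(1, "cpu")       # page 1 is only touched by the CPU
moved_pages = counters.rebalance()
```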
[0238] In at least one embodiment, one or more SoCs 1204 may include any number of caches 1212, including those described herein. For example, in at least one embodiment, one or more caches 1212 may include a Level 3 (“L3”) cache available to both one or more CPUs 1206 and one or more GPUs 1208 (e.g., connected to both CPUs 1206 and GPUs 1208). In at least one embodiment, one or more caches 1212 may include a write-back cache that can track the state of lines, for example, by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). In at least one embodiment, the L3 cache may include 4 MB of memory or more, depending on the embodiment, although smaller cache sizes may be used.
[0239] In at least one embodiment, one or more SoCs 1204 may include one or more accelerators 1214 (e.g., hardware accelerators, software accelerators, or a combination thereof). In at least one embodiment, one or more SoCs 1204 may include a hardware acceleration cluster, which may include optimized hardware accelerators and / or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4MB of SRAM) enables the hardware acceleration cluster to accelerate neural networks and other computations. In at least one embodiment, the hardware acceleration cluster may be used to supplement one or more GPUs 1208 and offload some tasks from one or more GPUs 1208 (e.g., freeing up more cycles of one or more GPUs 1208 to perform other tasks). In at least one embodiment, one or more accelerators 1214 may be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. In at least one embodiment, the CNNs may include region-based or regional convolutional neural networks (“RCNNs”) and Fast RCNNs (e.g., for object detection) or other types of CNNs.
[0240] In at least one embodiment, one or more accelerators 1214 (e.g., a hardware acceleration cluster) may include one or more deep learning accelerators (“DLAs”). In at least one embodiment, one or more DLAs may include, but are not limited to, one or more Tensor Processing Units (“TPUs”), which may be configured to provide an additional 10 trillion operations per second for deep learning applications and inference. In at least one embodiment, a TPU may be an accelerator configured and optimized for performing image processing functions (e.g., for CNNs, RCNNs, etc.). In at least one embodiment, one or more DLAs may be further optimized for a specific set of neural network types and floating-point operations, as well as inference. In at least one embodiment, one or more DLAs are designed to provide higher performance per millimeter than typical general-purpose GPUs and typically significantly outperform CPUs. In at least one embodiment, one or more TPUs may perform several functions, including single-instance convolution functions supporting, for example, INT8, INT16, and FP16 data types for features and weights, as well as post-processor functions. In at least one embodiment, one or more DLAs can execute neural networks, particularly CNNs, quickly and efficiently on processed or unprocessed data for any of various functions, including, but not limited to: CNNs for object recognition and detection using data from camera sensors; CNNs for distance estimation using data from camera sensors; CNNs for emergency vehicle detection, recognition, and identification using data from microphones; CNNs for face recognition and vehicle owner recognition using data from camera sensors; and / or CNNs for security and / or safety-related events.
[0241] In at least one embodiment, the DLA can perform any function of one or more GPUs 1208, and by using an inference accelerator, for example, the designer can target one or more DLAs or one or more GPUs 1208 for any function. For example, in at least one embodiment, the designer can concentrate the CNN processing and floating-point operations on one or more DLAs, leaving other functions to one or more GPUs 1208 and / or one or more accelerators 1214.
[0242] In at least one embodiment, one or more accelerators 1214 may include programmable vision accelerators (“PVAs”), which may alternatively be referred to herein as computer vision accelerators. In at least one embodiment, one or more PVAs may be designed and configured to accelerate computer vision algorithms for advanced driver assistance systems (“ADAS”) 1238, autonomous driving, augmented reality (“AR”) applications, and / or virtual reality (“VR”) applications. In at least one embodiment, one or more PVAs may strike a balance between performance and flexibility. For example, in at least one embodiment, each of one or more PVAs may include, for example, but not limited to, any number of reduced instruction set computer (“RISC”) cores, direct memory access (“DMA”), and / or any number of vector processors.
[0243] In at least one embodiment, the RISC core can interact with an image sensor (e.g., the image sensor of any camera described herein), an image signal processor, etc. In at least one embodiment, each RISC core may include any number of memories. In at least one embodiment, the RISC core may use any of a variety of protocols, depending on the embodiment. In at least one embodiment, the RISC core may execute a real-time operating system (“RTOS”). In at least one embodiment, the RISC core may be implemented using one or more integrated circuit devices, application-specific integrated circuits (“ASICs”), and / or storage devices. For example, in at least one embodiment, the RISC core may include an instruction cache and / or tightly coupled RAM.
[0244] In at least one embodiment, DMA enables components of the PVA to access system memory independently of one or more CPUs 1206. In at least one embodiment, DMA can support any number of features for providing optimization to the PVA, including but not limited to, support for multidimensional addressing and / or circular addressing. In at least one embodiment, DMA can support up to six or more addressing dimensions, which may include, but are not limited to, block width, block height, block depth, horizontal block step, vertical block step, and / or depth step.
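The multidimensional and circular addressing modes mentioned above can be illustrated with the Python sketch below. It rests on an assumed interpretation: block width, height, and depth are treated as extents, and the horizontal, vertical, and depth steps as address strides. The function names are hypothetical and do not correspond to any actual DMA programming interface.

```python
def block_addresses(base, width, height, depth, h_step, v_step, d_step):
    """Yield linear addresses for a 3D block described by six parameters.

    width/height/depth are extents in elements; h_step/v_step/d_step are
    the address increments along each dimension (assumed interpretation).
    """
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                yield base + x * h_step + y * v_step + z * d_step


def circular_addresses(base, size, start, count):
    """Yield addresses that wrap around within a buffer of `size` elements."""
    for i in range(count):
        yield base + (start + i) % size


# A 2x2 tile taken from a row-major image with 10 elements per row.
tile = list(block_addresses(base=100, width=2, height=2, depth=1,
                            h_step=1, v_step=10, d_step=100))

# Five accesses into a 4-element ring buffer, starting at offset 2.
ring = list(circular_addresses(base=0, size=4, start=2, count=5))
```

Describing a transfer by extents and strides lets a single descriptor fetch a tile from a larger image without the processor computing each address itself.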
[0245] In at least one embodiment, the vector processor may be a programmable processor designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In at least one embodiment, the PVA may include a PVA core and two vector processing subsystem partitions. In at least one embodiment, the PVA core may include a processor subsystem, one or more DMA engines (e.g., two DMA engines), and / or other peripherals. In at least one embodiment, the vector processing subsystem may serve as the main processing engine of the PVA and may include a vector processing unit (“VPU”), an instruction cache, and / or a vector memory (e.g., “VMEM”). In at least one embodiment, the VPU core may include a digital signal processor, such as a single instruction, multiple data (“SIMD”), very long instruction word (“VLIW”) digital signal processor. In at least one embodiment, the combination of SIMD and VLIW can improve throughput and speed.
[0246] In at least one embodiment, each vector processor may include an instruction cache and may be coupled to dedicated memory. As a result, in at least one embodiment, each vector processor may be configured to execute independently of other vector processors. In at least one embodiment, the vector processors included in a particular PVA may be configured to employ data parallelism. For example, in at least one embodiment, multiple vector processors included in a single PVA may execute the same computer vision algorithm, but on different regions of an image. In at least one embodiment, vector processors included in a particular PVA may execute different computer vision algorithms simultaneously on the same image, or even execute different algorithms on a sequence of images or portions of an image. In at least one embodiment, among other things, any number of PVAs may be included in the hardware acceleration cluster, and any number of vector processors may be included in each PVA. In at least one embodiment, the PVA may include additional error-correcting code (“ECC”) memory to enhance overall system safety.
[0247] In at least one embodiment, one or more accelerators 1214 may include an on-chip computer vision network and static random access memory (“SRAM”) for providing high-bandwidth, low-latency SRAM to one or more accelerators 1214. In at least one embodiment, the on-chip memory may include at least 4 MB of SRAM, comprising, for example, but not limited to, eight field-configurable memory blocks accessible to both the PVA and DLA. In at least one embodiment, each pair of memory blocks may include an advanced peripheral bus (“APB”) interface, configuration circuitry, a controller, and a multiplexer. In at least one embodiment, any type of memory may be used. In at least one embodiment, the PVA and DLA may access the memory via a backbone providing high-speed access to the memory for both the PVA and DLA. In at least one embodiment, the backbone may include an on-chip computer vision network that interconnects the PVA and DLA to the memory (e.g., using an APB).
[0248] In at least one embodiment, the on-chip computer vision network may include an interface that determines, before transmitting any control signals / addresses / data, that both the PVA and DLA provide ready and valid signals. In at least one embodiment, the interface may provide separate phases and separate channels for transmitting control signals / addresses / data, as well as burst-type communication for continuous data transmission. In at least one embodiment, the interface may conform to the International Organization for Standardization (“ISO”) 26262 or the International Electrotechnical Commission (“IEC”) 61508 standard, although other standards and protocols may be used.
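The gating behavior of such an interface can be sketched as follows. This is a toy software model, not a description of the actual hardware protocol: the `Endpoint` class, the `try_transfer` function, and the use of a list as the transmission log are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    ready: bool = False  # endpoint can take part in a transfer
    valid: bool = False  # endpoint's signals are currently meaningful


def try_transfer(pva, dla, payload, log):
    """Transmit only when both endpoints assert ready and valid."""
    if all(e.ready and e.valid for e in (pva, dla)):
        log.append(payload)
        return True
    return False


pva = Endpoint("PVA", ready=True, valid=True)
dla = Endpoint("DLA", ready=True, valid=False)
sent = []

blocked = try_transfer(pva, dla, "burst-0", sent)   # DLA not valid yet
dla.valid = True
accepted = try_transfer(pva, dla, "burst-0", sent)  # both sides now agree
```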
[0249] In at least one embodiment, one or more SoCs 1204 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, the real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine the location and extent of objects (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and / or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison with LIDAR data for localization and / or other functions, and / or for other purposes.
[0250] In at least one embodiment, one or more accelerators 1214 have a wide range of uses for autonomous driving. In at least one embodiment, the PVA can be used for key processing stages in ADAS and autonomous vehicles. In at least one embodiment, the capabilities of the PVA are well matched to algorithmic domains requiring predictable processing at low power and low latency. In other words, the PVA performs well on semi-dense or dense regular computation, even on small datasets, which may require predictable run times with low latency and low power. In at least one embodiment, such as in vehicle 1200, the PVA may be designed to run classic computer vision algorithms, as they can be efficient at object detection and operate on integer math.
[0251] For example, according to at least one embodiment of the technology, the PVA is used to perform computer stereo vision. In at least one embodiment, a semi-global matching-based algorithm may be used in some examples, although this is not intended to be limiting. In at least one embodiment, applications for Level 3-5 autonomous driving use motion estimation / stereo matching on the fly (e.g., structure from motion, pedestrian recognition, lane detection, etc.). In at least one embodiment, the PVA can perform computer stereo vision functions on input from two monocular cameras.
[0252] In at least one embodiment, the PVA can be used to perform dense optical flow. For example, in at least one embodiment, the PVA can process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In at least one embodiment, the PVA is used for time-of-flight depth processing, for example, by processing raw time-of-flight data to provide processed time-of-flight data.
[0253] In at least one embodiment, the DLA can be used to run any type of network to enhance control and driving safety, including, but not limited to, a neural network that outputs a confidence score for each object detection. In at least one embodiment, the confidence score can be represented or interpreted as a probability, or as providing a relative “weight” for each detection relative to other detections. In at least one embodiment, the confidence score enables the system to make further decisions about which detections should be considered true positives rather than false positives. In at least one embodiment, the system can set a threshold for the confidence score and only consider detections exceeding the threshold as true positives. In embodiments using an Automatic Emergency Braking (“AEB”) system, false positives would cause the vehicle to automatically perform emergency braking, which is obviously undesirable. In at least one embodiment, only highly confident detections may be considered triggers for AEB. In at least one embodiment, the DLA can run a neural network for regressing the confidence score value. In at least one embodiment, the neural network may take at least a subset of parameters as its input, such as bounding box dimensions, a ground plane estimate obtained (e.g., from another subsystem), output of one or more IMU sensors 1266 that correlates with the orientation of the vehicle 1200, distance, and 3D position estimates of the object obtained from the neural network and / or other sensors (e.g., one or more LiDAR sensors 1264 or one or more RADAR sensors 1260).
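The thresholding step described above can be sketched in Python as follows. The `Detection` type, the function names, and the threshold value of 0.9 are hypothetical; the source does not give a specific threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float  # interpreted here as a probability in [0, 1]


AEB_THRESHOLD = 0.9  # hypothetical threshold, not taken from the source


def true_positives(detections, threshold):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.confidence > threshold]


def should_trigger_aeb(detections, threshold=AEB_THRESHOLD):
    """Trigger braking only if at least one detection is highly confident."""
    return bool(true_positives(detections, threshold))


detections = [
    Detection("pedestrian", 0.97),
    Detection("shadow", 0.35),
    Detection("vehicle", 0.88),
]
kept = true_positives(detections, AEB_THRESHOLD)
trigger = should_trigger_aeb(detections)
```

Raising the threshold trades missed detections for fewer false positives, which matters for AEB because an unnecessary emergency stop is itself undesirable.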
[0254] In at least one embodiment, one or more SoCs 1204 may include one or more data storage devices 1216 (e.g., memory). In at least one embodiment, one or more data storage devices 1216 may be on-chip memory of one or more SoCs 1204, which may store neural networks to be executed on one or more GPUs 1208 and / or DLAs. In at least one embodiment, one or more data storage devices 1216 may have a sufficiently large capacity to store multiple instances of the neural networks for redundancy and safety. In at least one embodiment, one or more data storage devices 1216 may include L2 or L3 caches.
[0255] In at least one embodiment, one or more SoCs 1204 may include any number of processors 1210 (e.g., embedded processors). In at least one embodiment, one or more processors 1210 may include a startup and power management processor, which may be a dedicated processor and subsystem for handling startup and power management functions, as well as associated security enforcement. In at least one embodiment, the startup and power management processor may be part of a startup sequence of one or more SoCs 1204 and may provide runtime power management services. In at least one embodiment, the startup and power management processor may provide clock and voltage programming, assistance with system low-power state transitions, management of thermal and temperature sensors of one or more SoCs 1204, and / or management of power states of one or more SoCs 1204. In at least one embodiment, each temperature sensor may be implemented as a ring oscillator whose output frequency is proportional to temperature, and one or more SoCs 1204 may use the ring oscillators to detect the temperature of one or more CPUs 1206, one or more GPUs 1208, and / or one or more accelerators 1214. In at least one embodiment, if it is determined that the temperature exceeds a threshold, the startup and power management processor may enter a temperature fault routine and place one or more SoCs 1204 into a lower power state and / or place the vehicle 1200 into a chauffeur-to-safe-stop mode (e.g., bring the vehicle 1200 to a safe stop).
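The thermal fault handling described above can be sketched as follows. The conversion constant, the temperature limit, and the dictionary of returned states are hypothetical values chosen for illustration; the only property taken from the text is that sensor output frequency is proportional to temperature and that exceeding a threshold leads to a lower power state and / or a safe stop.

```python
HZ_PER_DEGREE_C = 1000.0  # hypothetical proportionality constant
TEMP_LIMIT_C = 105.0      # hypothetical fault threshold


def temperature_from_frequency(freq_hz):
    """Convert a sensor output frequency to temperature (proportional model)."""
    return freq_hz / HZ_PER_DEGREE_C


def check_thermal(freqs_hz, limit_c=TEMP_LIMIT_C):
    """Return the action a temperature fault routine might take."""
    temps = {name: temperature_from_frequency(f) for name, f in freqs_hz.items()}
    overheated = sorted(name for name, t in temps.items() if t > limit_c)
    if overheated:
        return {
            "soc_state": "low_power",
            "vehicle_mode": "safe_stop",
            "overheated": overheated,
        }
    return {"soc_state": "normal", "vehicle_mode": "unchanged", "overheated": []}


hot = check_thermal({"cpu": 90_000.0, "gpu": 110_000.0})
cool = check_thermal({"cpu": 90_000.0, "gpu": 80_000.0})
```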
[0256] In at least one embodiment, one or more processors 1210 may further include a set of embedded processors that can serve as an audio processing engine. The audio processing engine may be an audio subsystem capable of providing full hardware support for multi-channel audio through multiple interfaces and a wide and flexible range of audio I / O interfaces. In at least one embodiment, the audio processing engine is a dedicated processor core with a digital signal processor having dedicated RAM.
[0257] In at least one embodiment, one or more processors 1210 may further include an always-on processor engine that can provide the necessary hardware features to support low-power sensor management and wake-up use cases. In at least one embodiment, the processor on the always-on processor engine may include, but is not limited to, a processor core, tightly coupled RAM, peripheral support (e.g., timers and interrupt controllers), various I / O controller peripherals, and routing logic.
[0258] In at least one embodiment, one or more processors 1210 may further include a safety cluster engine, which includes, but is not limited to, a dedicated processor subsystem for handling safety management of automotive applications. In at least one embodiment, the safety cluster engine may include, but is not limited to, two or more processor cores, tightly coupled RAM, supporting peripherals (e.g., timers, interrupt controllers, etc.) and / or routing logic. In a safety mode, in at least one embodiment, the two or more cores may operate in lockstep mode and may be used as a single core with comparison logic for detecting any differences between their operations. In at least one embodiment, one or more processors 1210 may further include a real-time camera engine, which may include, but is not limited to, a dedicated processor subsystem for handling real-time camera management. In at least one embodiment, one or more processors 1210 may further include a high dynamic range signal processor, which may include, but is not limited to, an image signal processor, which is a hardware engine that is part of the camera processing pipeline.
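The lockstep comparison described above can be modeled in software as in the following sketch, in which two "cores" are ordinary Python callables fed the same inputs. The function and exception names are hypothetical, and real lockstep hardware compares at a much finer granularity than whole function results.

```python
class LockstepMismatch(Exception):
    """Raised when the two cores disagree on a result."""


def run_lockstep(core_a, core_b, inputs):
    """Run two cores on identical inputs and compare every result."""
    outputs = []
    for step, value in enumerate(inputs):
        result_a = core_a(value)
        result_b = core_b(value)
        if result_a != result_b:
            raise LockstepMismatch(f"step {step}: {result_a!r} != {result_b!r}")
        outputs.append(result_a)  # the pair behaves like a single core
    return outputs


def healthy_core(x):
    return x * 2


def faulty_core(x):
    # Simulated fault: wrong result for one particular input.
    return x * 2 + 1 if x == 2 else x * 2


ok_outputs = run_lockstep(healthy_core, healthy_core, [1, 2, 3])

try:
    run_lockstep(healthy_core, faulty_core, [1, 2, 3])
    fault_detected = False
except LockstepMismatch:
    fault_detected = True
```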
[0259] In at least one embodiment, one or more processors 1210 may include a video image compositor, which may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions required by a video playback application to produce the final image for the player window. In at least one embodiment, the video image compositor may perform lens distortion correction on one or more wide-angle cameras 1270, one or more surround cameras 1274, and / or one or more cabin monitoring camera sensors. In at least one embodiment, the cabin monitoring camera sensors are preferably monitored by a neural network running on another instance of SoC 1204, the neural network being configured to recognize cabin events and respond accordingly. In at least one embodiment, the cabin system may perform, but is not limited to, lip reading to activate cellular service and make phone calls, dictate emails, change the vehicle's destination, activate or change the vehicle's infotainment system and settings, or provide voice-activated web browsing. In at least one embodiment, certain functions are available to the driver when the vehicle is operating in autonomous mode, and are otherwise disabled.
[0260] In at least one embodiment, the video image compositor may include enhanced temporal denoising for both spatial and temporal denoising. For example, in at least one embodiment, when motion occurs in the video, denoising weights spatial information appropriately, reducing the weight of information provided by adjacent frames. In at least one embodiment, when the image or a portion of the image does not contain motion, temporal denoising performed by the video image compositor may use information from previous images to reduce noise in the current image.
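The motion-adaptive behavior described above can be sketched on a single row of pixels. This is a simplified illustration under stated assumptions: motion is detected per pixel by an absolute-difference threshold, the spatial filter is a 3-tap box filter, and the threshold and blend weight are hypothetical values.

```python
def spatial_filter(row):
    """3-tap box filter over one row of pixels, clamping at the edges."""
    n = len(row)
    return [
        (row[max(i - 1, 0)] + row[i] + row[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def denoise(current, previous, motion_threshold=10.0, temporal_weight=0.5):
    """Blend with the previous frame where static, filter spatially where moving."""
    spatial = spatial_filter(current)
    output = []
    for cur, prev, sp in zip(current, previous, spatial):
        if abs(cur - prev) > motion_threshold:
            # Motion: the previous frame is unreliable, rely on spatial data.
            output.append(sp)
        else:
            # No motion: reuse information from the previous frame.
            output.append((1.0 - temporal_weight) * cur + temporal_weight * prev)
    return output


static_result = denoise([10, 10, 10], [12, 12, 12])
moving_result = denoise([100, 100, 100], [0, 0, 0])
```

In the static case each output pixel is the average of the current and previous values, while in the moving case the previous frame is ignored entirely.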
[0261] In at least one embodiment, the video image compositor can also be configured to perform stereoscopic correction on the input stereo lens frames. In at least one embodiment, when using an operating system desktop, the video image compositor can also be used for user interface compositing and does not require one or more GPUs 1208 to continuously render new surfaces. In at least one embodiment, when one or more GPUs 1208 are powered and actively performing 3D rendering, the video image compositor can be used to offload one or more GPUs 1208 to improve performance and responsiveness.
[0262] In at least one embodiment, one or more SoCs 1204 may further include a Mobile Industry Processor Interface (“MIPI”) camera serial interface for receiving video and input from cameras, a high-speed interface, and / or a video input block that can be used for camera and related pixel input functions. In at least one embodiment, one or more SoCs 1204 may further include an input / output controller that can be controlled by software and can be used to receive I / O signals not committed to a specific role.
[0263] In at least one embodiment, one or more SoCs 1204 may further include a wide range of peripheral interfaces to enable communication with peripheral devices, audio encoders / decoders (“codecs”), power management and / or other devices. In at least one embodiment, one or more SoCs 1204 may be used to process data from cameras (e.g., connected via gigabit multimedia serial links and Ethernet channels), data from sensors (e.g., one or more LiDAR sensors 1264, one or more RADAR sensors 1260, etc., which may be connected via Ethernet channels), data from bus 1202 (e.g., vehicle 1200 speed, steering wheel position, etc.), data from one or more GNSS sensors 1258 (e.g., connected via Ethernet bus or CAN bus), etc. In at least one embodiment, one or more SoCs 1204 may further include a dedicated high-performance mass storage controller, which may include its own DMA engine and may be used to free one or more CPUs 1206 from routine data management tasks.
[0264] In at least one embodiment, one or more SoCs 1204 can be an end-to-end platform with a flexible architecture spanning automation levels 3-5, providing a comprehensive functional safety architecture that leverages and effectively utilizes computer vision and ADAS technologies to achieve diversity and redundancy. This provides a platform offering a flexible and reliable driving software stack as well as deep learning tools. In at least one embodiment, one or more SoCs 1204 can be faster, more reliable, and even more energy and space efficient than conventional systems. For example, in at least one embodiment, one or more accelerators 1214, when combined with one or more CPUs 1206, one or more GPUs 1208, and one or more data storage devices 1216, can provide a fast and efficient platform for Level 3-5 autonomous vehicles.
[0265] In at least one embodiment, the computer vision algorithm can be executed on a CPU, which can be configured using a high-level programming language (e.g., C) to execute multiple processing algorithms on a variety of visual data. However, in at least one embodiment, the CPU typically cannot meet the performance requirements of many computer vision applications, such as performance requirements related to execution time and power consumption. In at least one embodiment, many CPUs cannot execute complex object detection algorithms in real time, which are used in automotive ADAS applications and practical Level 3-5 autonomous vehicles.
[0266] The embodiments described herein allow multiple neural networks to be executed simultaneously and / or sequentially, and allow the results to be combined to achieve Level 3-5 autonomous driving capabilities. For example, in at least one embodiment, a CNN executed on a DLA or discrete GPU (e.g., one or more GPUs 1220) may include text and word recognition, thereby allowing a supercomputer to read and understand traffic signs, including signs for which the neural network has not yet been specifically trained. In at least one embodiment, the DLA may also include a neural network capable of recognizing, interpreting, and providing semantic understanding of symbols, and passing this semantic understanding to a path planning module running on a CPU Complex.
[0267] In at least one embodiment, for Level 3, 4, or 5 driving, multiple neural networks can run simultaneously. For example, in at least one embodiment, a warning sign reading “Caution: flashing lights indicate icy conditions”, accompanied by an electric light, can be interpreted independently or jointly by multiple neural networks. In at least one embodiment, the warning sign itself can be recognized as a traffic sign by a first deployed neural network (e.g., a trained neural network), and the text “flashing lights indicate icy conditions” can be interpreted by a second deployed neural network, which informs the vehicle’s path planning software (preferably executed on the CPU Complex) that icy conditions exist when flashing lights are detected. In at least one embodiment, flashing lights can be identified by operating a third deployed neural network across multiple frames, which informs the vehicle’s path planning software of the presence (or absence) of flashing lights. In at least one embodiment, all three neural networks can run simultaneously, for example within the DLA and / or on one or more GPUs 1208.
[0268] In at least one embodiment, the CNN for facial recognition and vehicle owner identification can use data from camera sensors to identify the presence of an authorized driver and / or the owner of vehicle 1200. In at least one embodiment, an always-on sensor processing engine can be used to unlock the vehicle and turn on the lights when the owner approaches the driver's door, and, in security mode, can be used to disable the vehicle when the owner leaves it. In this way, one or more SoCs 1204 provide protection against theft and / or carjacking.
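The way the three network outputs in the warning-sign example combine can be sketched as follows. The sketch replaces each neural network with a plain input value (a flag, a text string, and a per-frame light state), so every function and parameter name here is a hypothetical stand-in rather than part of the described system.

```python
def is_flashing(light_on_per_frame, min_transitions=2):
    """Stand-in for the third network: detect flashing across frames.

    A light counts as flashing if its on/off state changes at least
    min_transitions times over the frame sequence (assumed criterion).
    """
    transitions = sum(
        1 for a, b in zip(light_on_per_frame, light_on_per_frame[1:]) if a != b
    )
    return transitions >= min_transitions


def icy_conditions_warning(is_traffic_sign, sign_text, light_on_per_frame):
    """Combine the three outputs into one signal for path planning.

    is_traffic_sign:     output of the first network (sign recognized)
    sign_text:           text read from the sign for the second network
    light_on_per_frame:  per-frame light state used by the third network
    """
    text = sign_text.lower()
    # Stand-in for the second network: the text ties flashing lights to ice.
    text_links_flashing_to_ice = "flashing" in text and "icy" in text
    return (
        is_traffic_sign
        and text_links_flashing_to_ice
        and is_flashing(light_on_per_frame)
    )
```

For example, `icy_conditions_warning(True, "Caution: flashing lights indicate icy conditions", [True, False, True, False])` reports icy conditions, whereas a steadily lit light does not.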
[0269] In at least one embodiment, the CNN for emergency vehicle detection and identification can use data from microphone 1296 to detect and identify emergency vehicle sirens. In at least one embodiment, one or more SoCs 1204 use the CNN to classify environmental and urban sounds, as well as visual data. In at least one embodiment, the CNN running on DLA is trained to identify the relative approach speed of emergency vehicles (e.g., by using the Doppler effect). In at least one embodiment, the CNN can also be trained to identify emergency vehicles in the area where the vehicle is operating, as identified by one or more GNSS sensors 1258. In at least one embodiment, when operating in Europe, the CNN will seek to detect European sirens, while in North America, the CNN will seek to identify only North American sirens. In at least one embodiment, once an emergency vehicle is detected, a control program can be used, with the assistance of one or more ultrasonic sensors 1262, to execute emergency vehicle safety routines, slow down the vehicle, pull the vehicle to the side of the road, stop, and / or leave the vehicle idle until the emergency vehicle passes.
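For reference, the Doppler relationship that such a CNN would implicitly exploit can be written in closed form. The sketch below is the textbook formula for a stationary listener and a moving source, not the trained network described above; it assumes the emitted siren frequency is known, and the function name and the default speed of sound are illustrative.

```python
def closing_speed_from_doppler(f_emitted_hz, f_observed_hz, speed_of_sound_mps=343.0):
    """Estimate the speed at which a sound source approaches a listener.

    Uses f_observed = f_emitted * c / (c - v) for a stationary listener,
    solved for v. A positive result means the source is approaching.
    """
    return speed_of_sound_mps * (1.0 - f_emitted_hz / f_observed_hz)
```

An observed frequency higher than the emitted one yields a positive closing speed, and equal frequencies yield zero.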
[0270] In at least one embodiment, vehicle 1200 may include one or more CPUs 1218 (e.g., one or more discrete CPUs, or dCPUs) that may be coupled to one or more SoCs 1204 via high-speed interconnects (e.g., PCIe). In at least one embodiment, one or more CPUs 1218 may include x86 processors. For example, one or more CPUs 1218 may be used to perform any of a variety of functions, such as arbitrating potentially inconsistent results between ADAS sensors and one or more SoCs 1204, and / or monitoring the status and health of one or more controllers 1236 and / or an infotainment system on chip (“infotainment SoC”) 1230.
[0271] In at least one embodiment, vehicle 1200 may include one or more GPUs 1220 (e.g., one or more discrete GPUs or one or more dGPUs) coupled to one or more SoCs 1204 via high-speed interconnects (e.g., NVIDIA's NVLINK channels). In at least one embodiment, one or more GPUs 1220 may provide additional artificial intelligence capabilities, such as by executing redundant and / or different neural networks, and may be used to train and / or update the neural networks based at least in part on inputs from sensors of vehicle 1200 (e.g., sensor data).
[0272] In at least one embodiment, vehicle 1200 may further include a network interface 1224, which may include, but is not limited to, one or more wireless antennas 1226 (e.g., one or more wireless antennas for different communication protocols, such as cellular antennas, Bluetooth antennas, etc.). In at least one embodiment, network interface 1224 may be used to enable wireless connectivity with other vehicles and / or computing devices (e.g., passenger client devices) via Internet cloud services (e.g., using servers and / or other network devices). In at least one embodiment, for communication with other vehicles, a direct link and / or an indirect link (e.g., via a network and the Internet) may be established between vehicle 1200 and another vehicle. In at least one embodiment, a vehicle-to-vehicle communication link may be used to provide a direct link. In at least one embodiment, the vehicle-to-vehicle communication link may provide vehicle 1200 with information about vehicles near vehicle 1200 (e.g., vehicles in front, to the side, and / or behind vehicle 1200). In at least one embodiment, the foregoing functionality may be part of a cooperative adaptive cruise control function of vehicle 1200.
[0273] In at least one embodiment, network interface 1224 may include a System-on-Chip (SoC) that provides modulation and demodulation functions and enables one or more controllers 1236 to communicate over a wireless network. In at least one embodiment, network interface 1224 may include a radio frequency (RF) front-end for up-conversion from baseband to RF and down-conversion from RF to baseband. In at least one embodiment, frequency conversion may be performed in any technically feasible manner. For example, frequency conversion may be performed using known processes and / or using a superheterodyne process. In at least one embodiment, the RF front-end functionality may be provided by a separate chip. In at least one embodiment, the network interface may include wireless functions for communication via LTE, WCDMA, UMTS, GSM, CDMA2000, Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and / or other wireless protocols.
[0274] In at least one embodiment, vehicle 1200 may further include one or more data storage units 1228, which may include, but are not limited to, off-chip storage (e.g., storage external to one or more SoCs 1204). In at least one embodiment, one or more data storage units 1228 may include, but are not limited to, one or more storage elements, including RAM, SRAM, dynamic random access memory (“DRAM”), video random access memory (“VRAM”), flash memory, hard disk and / or other components and / or devices capable of storing at least one bit of data.
[0275] In at least one embodiment, the vehicle 1200 may further include one or more GNSS sensors 1258 (e.g., GPS and / or assisted GPS sensors) to assist in mapping, perception, occupancy grid generation, and / or path planning functions. In at least one embodiment, any number of GNSS sensors 1258 may be used, including, for example, but not limited to, a GPS sensor that uses a USB connector with an Ethernet-to-serial (e.g., RS-232) bridge.
[0276] In at least one embodiment, vehicle 1200 may further include one or more RADAR sensors 1260. In at least one embodiment, one or more RADAR sensors 1260 may be used by vehicle 1200 for long-range vehicle detection, even in dark and / or inclement weather conditions. In at least one embodiment, the RADAR functional safety level may be ASIL B. In at least one embodiment, one or more RADAR sensors 1260 may use a CAN bus and / or bus 1202 (e.g., to transmit data generated by one or more RADAR sensors 1260) for control and for access to object tracking data, and in some examples, may use an Ethernet channel to access raw data. In at least one embodiment, a wide variety of RADAR sensor types may be used. For example, but not limited to, one or more of the RADAR sensors 1260 may be suitable for front, rear, and side RADAR use. In at least one embodiment, one or more RADAR sensors 1260 are pulse Doppler RADAR sensors.
[0277] In at least one embodiment, one or more RADAR sensors 1260 may include different configurations, such as long-range with a narrow field of view, short-range with a wide field of view, short-range side coverage, etc. In at least one embodiment, long-range RADAR can be used for adaptive cruise control functions. In at least one embodiment, a long-range RADAR system can provide a wide field of view achieved through two or more independent scans (e.g., within a 250m range). In at least one embodiment, one or more RADAR sensors 1260 can help distinguish between stationary and moving objects and can be used by the ADAS system 1238 for emergency braking assistance and forward collision warning. In at least one embodiment, one or more RADAR sensors 1260 included in the long-range RADAR system may include, but are not limited to, a monostatic multimode RADAR with multiple (e.g., six or more) fixed RADAR antennas and high-speed CAN and FlexRay interfaces. In at least one embodiment with six antennas, the four central antennas can create a focused beam pattern designed to record the surroundings of vehicle 1200 at higher speeds with minimal interference from traffic in adjacent lanes. In at least one embodiment, the other two antennas can expand the field of view, enabling rapid detection of vehicles entering or leaving the lane of vehicle 1200.
[0278] In at least one embodiment, a mid-range RADAR system may have, for example, a range of up to 160m (front) or 80m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). In at least one embodiment, a short-range RADAR system may include, but is not limited to, any number of RADAR sensors 1260 designed to be mounted at both ends of the rear bumper. In at least one embodiment, when mounted at both ends of the rear bumper, the RADAR sensor system may generate two beams that continuously monitor the blind spots to the rear of and beside the vehicle. In at least one embodiment, the short-range RADAR system may be used in ADAS system 1238 for blind spot detection and / or lane change assistance.
[0279] In at least one embodiment, the vehicle 1200 may further include one or more ultrasonic sensors 1262. In at least one embodiment, one or more ultrasonic sensors 1262, which may be positioned at the front, rear, and / or sides of the vehicle 1200, may be used for parking assistance and / or for creating and updating occupancy grids. In at least one embodiment, a wide variety of ultrasonic sensors 1262 may be used, and different ultrasonic sensors 1262 may be used for different detection ranges (e.g., 2.5m, 4m). In at least one embodiment, the ultrasonic sensors 1262 may operate at the ASIL B functional safety level.
[0280] In at least one embodiment, vehicle 1200 may include one or more LiDAR sensors 1264. In at least one embodiment, one or more LiDAR sensors 1264 may be used for object and pedestrian detection, emergency braking, collision avoidance, and / or other functions. In at least one embodiment, one or more LiDAR sensors 1264 may operate at functional safety level ASIL B. In at least one embodiment, vehicle 1200 may include multiple (e.g., two, four, six, etc.) LiDAR sensors 1264 that can use Ethernet channels (e.g., providing data to a Gigabit Ethernet switch).
[0281] In at least one embodiment, one or more LiDAR sensors 1264 may be able to provide a list of objects and their distances for a 360-degree field of view. In at least one embodiment, one or more commercially available LiDAR sensors 1264 may, for example, have an advertised range of approximately 100m, an accuracy of 2cm-3cm, and support for a 100Mbps Ethernet connection. In at least one embodiment, one or more non-protruding LiDAR sensors may be used. In such an embodiment, one or more LiDAR sensors 1264 may include small devices that can be embedded in the front, rear, side, and / or corner locations of vehicle 1200. In such an embodiment, one or more LiDAR sensors 1264 can provide a horizontal field of view of up to 120 degrees and a vertical field of view of 35 degrees, with a range of 200m even for objects with low reflectivity. In at least one embodiment, one or more forward-facing LiDAR sensors 1264 may be configured for a horizontal field of view between 45 degrees and 135 degrees.
[0282] In at least one embodiment, LiDAR technology (such as 3D flash LiDAR) may also be used. In at least one embodiment, 3D flash LiDAR uses a laser flash as a transmission source to illuminate approximately 200m around vehicle 1200. In at least one embodiment, the flash LiDAR unit includes, but is not limited to, a receiver that records the laser pulse transit time and the reflected light at each pixel, which in turn corresponds to the range from vehicle 1200 to the object. In at least one embodiment, flash LiDAR can allow highly accurate and distortion-free images of the surrounding environment to be generated with each laser flash. In at least one embodiment, four flash LiDAR sensors may be deployed, one on each side of vehicle 1200. In at least one embodiment, the 3D flash LiDAR system includes, but is not limited to, a solid-state 3D staring array LiDAR camera with no moving parts other than a fan (e.g., a non-scanning LiDAR device). In at least one embodiment, the flash LiDAR device can use a 5-nanosecond Class I (eye-safe) laser pulse per frame and can capture reflected laser light as a 3D ranging point cloud and co-registered intensity data.
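The correspondence between recorded pulse transit time and range follows the standard time-of-flight relation: the pulse travels to the object and back, so the range is half the round-trip distance. The following minimal sketch illustrates this; the function and variable names are illustrative and do not describe any particular sensor interface.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum


def range_from_time_of_flight(round_trip_time_s):
    """Convert a round-trip laser pulse transit time to a one-way range."""
    return SPEED_OF_LIGHT_M_S * round_trip_time_s / 2.0


def ranges_for_pixels(round_trip_times_s):
    """Apply the conversion to the transit time recorded at each pixel."""
    return [range_from_time_of_flight(t) for t in round_trip_times_s]
```

Under this relation, an object at roughly 200m returns the pulse after about 1.33 microseconds.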
[0283] In at least one embodiment, vehicle 1200 may further include one or more IMU sensors 1266. In at least one embodiment, one or more IMU sensors 1266 may be located at the center of the rear axle of vehicle 1200. In at least one embodiment, one or more IMU sensors 1266 may include, for example, but not limited to, one or more accelerometers, one or more magnetometers, one or more gyroscopes, a magnetic compass, multiple magnetic compasses, and / or other sensor types. In at least one embodiment, for example in a six-axis application, one or more IMU sensors 1266 may include, but are not limited to, accelerometers and gyroscopes. In at least one embodiment, for example in a nine-axis application, one or more IMU sensors 1266 may include, but are not limited to, accelerometers, gyroscopes, and magnetometers.
[0284] In at least one embodiment, one or more IMU sensors 1266 may be implemented as a miniature, high-performance GPS-aided inertial navigation system (“GPS / INS”) that combines microelectromechanical system (“MEMS”) inertial sensors, a high-sensitivity GPS receiver, and advanced Kalman filtering algorithms to provide estimates of position, velocity, and attitude. In at least one embodiment, one or more IMU sensors 1266 may enable vehicle 1200 to estimate heading without requiring input from a magnetic sensor, by directly observing and correlating changes in velocity from GPS to the one or more IMU sensors 1266. In at least one embodiment, one or more IMU sensors 1266 and one or more GNSS sensors 1258 may be combined in a single integrated unit.
[0285] In at least one embodiment, vehicle 1200 may include one or more microphones 1296 placed inside and / or around vehicle 1200. In at least one embodiment, one or more microphones 1296 may be used for emergency vehicle detection and identification.
[0286] In at least one embodiment, vehicle 1200 may further include any number of camera types, including one or more stereo cameras 1268, one or more wide-angle cameras 1270, one or more infrared cameras 1272, one or more surround cameras 1274, one or more long-range cameras 1298, one or more mid-range cameras 1276, and / or other camera types. In at least one embodiment, the cameras can be used to capture image data around the entire perimeter of vehicle 1200. In at least one embodiment, the type of camera used depends on vehicle 1200. In at least one embodiment, any combination of camera types can be used to provide the necessary coverage around vehicle 1200. In at least one embodiment, the number of cameras deployed may vary depending on the embodiment. For example, in at least one embodiment, vehicle 1200 may include six cameras, seven cameras, ten cameras, twelve cameras, or another number of cameras. In at least one embodiment, the cameras may support, by way of example but not limitation, gigabit multimedia serial link (“GMSL”) and / or gigabit Ethernet communication. In at least one embodiment, each camera is described in more detail previously herein with reference to Figure 12A and Figure 12B.
[0287] In at least one embodiment, the vehicle 1200 may further include one or more vibration sensors 1242. In at least one embodiment, the one or more vibration sensors 1242 may measure vibrations of components of the vehicle 1200 (e.g., axles). For example, in at least one embodiment, changes in vibration may indicate changes in road surface conditions. In at least one embodiment, when two or more vibration sensors 1242 are used, differences between vibrations may be used to determine road surface friction or slippage (e.g., when there is a vibration difference between a power drive axle and a free-rotating axle).
[0288] In at least one embodiment, vehicle 1200 may include ADAS system 1238. In at least one embodiment, ADAS system 1238 may include, but is not limited to, an SoC. In at least one embodiment, ADAS system 1238 may include, but is not limited to, any number of autonomous / adaptive / automatic cruise control (“ACC”) systems, cooperative adaptive cruise control (“CACC”) systems, forward collision warning (“FCW”) systems, automatic emergency braking (“AEB”) systems, lane departure warning (“LDW”) systems, lane keeping assist (“LKA”) systems, blind spot warning (“BSW”) systems, rear cross traffic warning (“RCTW”) systems, collision warning (“CW”) systems, lane centering (“LC”) systems, and / or other systems, features, and / or functions, and combinations thereof.
[0289] In at least one embodiment, the ACC system may use one or more RADAR sensors 1260, one or more LiDAR sensors 1264, and / or any number of cameras. In at least one embodiment, the ACC system may include a longitudinal ACC system and / or a lateral ACC system. In at least one embodiment, the longitudinal ACC system monitors and controls the distance to the vehicle immediately ahead of vehicle 1200 and automatically adjusts the speed of vehicle 1200 to maintain a safe distance from the vehicle ahead. In at least one embodiment, the lateral ACC system performs distance keeping and advises vehicle 1200 to change lanes when necessary. In at least one embodiment, the lateral ACC is related to other ADAS applications, such as LC and CW.
[0290] In at least one embodiment, the CACC system uses information from other vehicles, which may be received via network interface 1224 and / or one or more wireless antennas 1226, either directly from other vehicles via a wireless link or indirectly via a network connection (e.g., via the Internet). In at least one embodiment, the direct link may be provided by a vehicle-to-vehicle (“V2V”) communication link, while the indirect link may be provided by an infrastructure-to-vehicle (“I2V”) communication link. Typically, V2V communication provides information about the immediately preceding vehicles (e.g., vehicles immediately in front of vehicle 1200 and in the same lane), while I2V communication provides information about traffic further ahead. In at least one embodiment, the CACC system may include one or both of the I2V and V2V information sources. In at least one embodiment, given information about vehicles ahead of vehicle 1200, the CACC system can be more reliable and has the potential to improve traffic flow smoothness and reduce road congestion.
[0291] In at least one embodiment, the FCW system is designed to warn the driver of a hazard so that the driver can take corrective action. In at least one embodiment, the FCW system uses a forward-facing camera and / or one or more RADAR sensors 1260, coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibrating component. In at least one embodiment, the FCW system can provide a warning, for example, in the form of a sound, a visual warning, a vibration, and / or a quick brake pulse.
[0292] In at least one embodiment, the AEB system detects an impending forward collision with another vehicle or other object and can automatically apply the brakes if the driver does not take corrective action within a specified time or distance parameter. In at least one embodiment, the AEB system may use one or more forward-facing cameras and / or one or more RADAR sensors 1260 coupled to a dedicated processor, DSP, FPGA, and / or ASIC. In at least one embodiment, when the AEB system detects a hazard, it typically first warns the driver to take corrective action to avoid a collision, and if the driver does not take corrective action, the AEB system may automatically apply the brakes to attempt to prevent, or at least mitigate the effects of, the predicted collision. In at least one embodiment, the AEB system may include techniques such as dynamic brake support and / or crash-imminent braking.
[0293] In at least one embodiment, when vehicle 1200 crosses lane markings, the LDW system provides visual, auditory, and / or tactile warnings, such as steering wheel or seat vibrations, to alert the driver. In at least one embodiment, the LDW system does not activate when the driver indicates an intentional lane departure, such as by activating a turn signal. In at least one embodiment, the LDW system may use a front-facing camera coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibrating component. In at least one embodiment, the LKA system is a variant of the LDW system. In at least one embodiment, if vehicle 1200 begins to leave the lane, the LKA system provides steering input or braking to correct vehicle 1200.
[0294] In at least one embodiment, the BSW system detects vehicles in the blind spot and warns the driver of them. In at least one embodiment, the BSW system can provide visual, auditory, and / or tactile alerts to indicate that merging or changing lanes is unsafe. In at least one embodiment, the BSW system can provide additional warnings when the driver uses the turn signal. In at least one embodiment, the BSW system can use one or more rear-facing cameras and / or one or more RADAR sensors 1260 coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibrating component.
[0295] In at least one embodiment, the RCTW system can provide visual, auditory, and / or tactile notifications when an object is detected outside the range of the rear camera while the vehicle 1200 is reversing. In at least one embodiment, the RCTW system includes an AEB system to ensure that the vehicle brakes are applied to avoid a collision. In at least one embodiment, the RCTW system may use one or more rear-facing RADAR sensors 1260 coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibrating component.
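The warn-first, brake-second behavior of the AEB system can be sketched as a simple decision function. The sketch assumes that the "specified time parameter" is expressed as a time-to-collision (distance divided by closing speed); the threshold values, the function name, and the returned labels are hypothetical and chosen only for illustration.

```python
def aeb_action(distance_m, closing_speed_mps, driver_braking,
               warn_ttc_s=2.5, brake_ttc_s=1.2):
    """Return "none", "warn", or "brake" for the current situation.

    distance_m:         distance to the object ahead
    closing_speed_mps:  rate at which that distance is shrinking
    driver_braking:     True if the driver is already taking corrective action
    warn_ttc_s / brake_ttc_s: assumed time-to-collision thresholds
    """
    if closing_speed_mps <= 0:
        # The gap is constant or growing, so no collision is predicted.
        return "none"
    time_to_collision_s = distance_m / closing_speed_mps
    if time_to_collision_s <= brake_ttc_s and not driver_braking:
        # The driver did not react in time: apply the brakes automatically.
        return "brake"
    if time_to_collision_s <= warn_ttc_s:
        # Hazard detected: warn the driver first.
        return "warn"
    return "none"
```

For instance, at 20m and a closing speed of 10 m/s the sketch issues a warning, and at 10m with no driver reaction it escalates to automatic braking.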
[0296] In at least one embodiment, conventional ADAS systems may be prone to generating false alarms, which can be annoying and distracting to the driver, but are generally not catastrophic because conventional ADAS systems alert the driver and allow the driver to determine whether a safe situation truly exists and take appropriate action. In at least one embodiment, in the event of conflicting results, the vehicle 1200 itself decides whether to follow the result of the main computer or the auxiliary computer (e.g., the first or second controller of controller 1236). For example, in at least one embodiment, ADAS system 1238 may be a backup and / or auxiliary computer for providing perception information to a backup computer rationality module. In at least one embodiment, the backup computer rationality monitor may run redundant software on hardware components to detect faults in perception and dynamic driving tasks. In at least one embodiment, the output from ADAS system 1238 may be provided to a monitoring MCU. In at least one embodiment, if the output from the main computer and the output from the auxiliary computer conflict, the monitoring MCU decides how to reconcile the conflict to ensure safe operation.
[0297] In at least one embodiment, the master computer may be configured to provide a confidence score to the supervisory MCU to indicate the master computer's confidence in the selected result. In at least one embodiment, if the confidence score exceeds a threshold, the supervisory MCU may follow the master computer's instructions regardless of whether the auxiliary computer provides conflicting or inconsistent results. In at least one embodiment, if the confidence score does not meet the threshold, and if the master computer and the auxiliary computer indicate different results (e.g., conflicting), the supervisory MCU may arbitrate between the computers to determine the appropriate result.
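The confidence-threshold rule described above can be sketched as follows. The sketch leaves the arbitration step as a caller-supplied function, because the paragraph does not specify how arbitration is performed; the function name, parameter names, and the default threshold are hypothetical.

```python
def supervise(primary_result, primary_confidence, auxiliary_result,
              arbitrate, threshold=0.9):
    """Select the result that the supervisory MCU acts on.

    arbitrate: callable taking (primary_result, auxiliary_result) and
               returning the reconciled result (assumed interface).
    threshold: assumed confidence level above which the first
               computer's result is followed unconditionally.
    """
    if primary_confidence > threshold:
        # High confidence: follow the first computer even on conflict.
        return primary_result
    if primary_result != auxiliary_result:
        # Low confidence and conflicting results: arbitrate.
        return arbitrate(primary_result, auxiliary_result)
    # Low confidence but both computers agree.
    return primary_result
```

As a usage example, a conservative arbitration policy could be passed as `lambda a, b: "brake" if "brake" in (a, b) else a`, so that a braking request from either computer wins whenever confidence is low.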
[0298] In at least one embodiment, the supervisory MCU may be configured to run a neural network trained and configured to determine, at least in part, the conditions under which the auxiliary computer provides a false alarm based on outputs from a host computer and an auxiliary computer. In at least one embodiment, the neural network in the supervisory MCU may learn when the outputs of the auxiliary computer can be trusted and when they cannot. For example, in at least one embodiment, when the auxiliary computer is a RADAR-based FCW system, the neural network in the supervisory MCU may learn when the FCW system recognizes a metallic object that is not actually dangerous, such as a drain grating or manhole cover that would trigger an alarm. In at least one embodiment, when the auxiliary computer is a camera-based LDW system, the neural network in the supervisory MCU may learn to override the LDW when a cyclist or pedestrian is present and lane departure is actually the safest operation. In at least one embodiment, the supervisory MCU may include at least one of a DLA or GPU suitable for running a neural network with associated memory. In at least one embodiment, the supervisory MCU may include and / or be included as a component of one or more SoC 1204s.
[0299] In at least one embodiment, the ADAS system 1238 may include an auxiliary computer that performs ADAS functions using conventional computer vision rules. In at least one embodiment, the auxiliary computer may use classic computer vision rules (if-then), and the presence of a neural network in the supervisory MCU can improve reliability, security, and performance. For example, in at least one embodiment, diverse implementations and intentional non-identity make the entire system more fault-tolerant, especially for failures caused by software (or software-hardware interface) functionality. For example, in at least one embodiment, if a software vulnerability or bug exists in the software running on the host computer, and different software code running on the auxiliary computer provides consistent overall results, the supervisory MCU can more confidently assume that the overall result is correct and that the vulnerability in the software or hardware on the host computer will not lead to a significant error.
[0300] In at least one embodiment, the output of the ADAS system 1238 can be input to the perception module and / or the dynamic driving task module of the host computer. For example, in at least one embodiment, if the ADAS system 1238 indicates a forward collision warning due to an object directly ahead, the perception module can use this information when identifying the object. In at least one embodiment, as described herein, the auxiliary computer can have its own neural network that is trained to reduce the risk of false alarms.
[0301] In at least one embodiment, vehicle 1200 may further include an infotainment SoC 1230 (e.g., an in-vehicle infotainment system (IVI)). Although shown and described as an SoC, in at least one embodiment, the infotainment SoC 1230 may not be an SoC and may include, but is not limited to, two or more discrete components. In at least one embodiment, the infotainment SoC 1230 may include, but is not limited to, a combination of hardware and software that can be used to provide audio (e.g., music, personal digital assistant, navigation instructions, news, radio, etc.), video (e.g., television, movies, streaming media, etc.), telephone (e.g., hands-free calling), network connectivity (e.g., LTE, WiFi, etc.) and / or information services (e.g., navigation system, rear parking assist, radio data system, vehicle-related information such as fuel level, total distance covered, brake fluid level, oil level, door opening / closing, air filter information, etc.) to vehicle 1200. For example, the infotainment SoC 1230 may include a radio, disk player, navigation system, video player, USB and Bluetooth connectivity, in-vehicle computer, in-vehicle entertainment system, WiFi, steering wheel audio controls, hands-free voice control, head-up display (“HUD”), HMI display 1234, telematics device, control panel (e.g., for controlling and / or interacting with various components, features and / or systems) and / or other components. In at least one embodiment, the infotainment SoC 1230 may further be used to provide information (e.g., visual and / or auditory) to a user of vehicle 1200, such as information from ADAS system 1238, autonomous driving information (such as planned vehicle maneuvers), trajectory, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.) and / or other information.
[0302] In at least one embodiment, the infotainment SoC 1230 may include any number and type of GPU functionality. In at least one embodiment, the infotainment SoC 1230 may communicate with other devices, systems, and / or components of the vehicle 1200 via bus 1202. In at least one embodiment, the infotainment SoC 1230 may be coupled to the supervisory MCU, so that the GPU of the infotainment system can perform some autonomous driving functions in the event of a failure of the main controller 1236 (e.g., the main computer and / or backup computer of the vehicle 1200). In at least one embodiment, the infotainment SoC 1230 may cause the vehicle 1200 to enter a driver-to-safe-stop mode, as described herein.
[0303] In at least one embodiment, vehicle 1200 may further include instrument panel 1232 (e.g., digital instrument panel, electronic instrument panel, digital instrument control panel, etc.). In at least one embodiment, instrument panel 1232 may include, but is not limited to, controllers and / or supercomputers (e.g., discrete controllers or supercomputers). In at least one embodiment, instrument panel 1232 may include, but is not limited to, any number and combination of a set of instruments, such as speedometer, fuel level, oil pressure, tachometer, odometer, turn indicator, shift position indicator, one or more seatbelt warning lights, one or more parking brake warning lights, one or more engine malfunction lights, auxiliary restraint system (e.g., airbag) information, lighting controls, safety system controls, navigation information, etc. In some examples, information may be displayed and / or shared between infotainment SoC 1230 and instrument panel 1232. In at least one embodiment, instrument panel 1232 may be included as part of infotainment SoC 1230, or vice versa.
[0304] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 can be used in the system of Figure 12C for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0305] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0306] Figure 12D is a diagram of a system 1276 for communication between one or more cloud-based servers and the autonomous vehicle 1200 of Figure 12A, according to at least one embodiment. In at least one embodiment, system 1276 may include, but is not limited to, one or more servers 1278, one or more networks 1290, and any number and type of vehicles, including vehicle 1200. In at least one embodiment, one or more servers 1278 may include, but are not limited to, multiple GPUs 1284(A)-1284(H) (collectively referred to herein as GPUs 1284), PCIe switches 1282(A)-1282(D) (collectively referred to herein as PCIe switches 1282), and / or CPUs 1280(A)-1280(B) (collectively referred to herein as CPUs 1280). GPUs 1284, CPUs 1280, and PCIe switches 1282 may be interconnected with high-speed interconnects, such as, but not limited to, NVLink interface 1288 developed by NVIDIA and / or PCIe connection 1286. In at least one embodiment, the GPUs 1284 are connected via NVLink and / or an NVSwitch SoC, and the GPUs 1284 and PCIe switches 1282 are connected via a PCIe interconnect. Although eight GPUs 1284, two CPUs 1280, and four PCIe switches 1282 are shown, this is not intended to be limiting. In at least one embodiment, each of one or more servers 1278 may include, but is not limited to, any combination of any number of GPUs 1284, CPUs 1280, and / or PCIe switches 1282. For example, in at least one embodiment, one or more servers 1278 may each include eight, sixteen, thirty-two, and / or more GPUs 1284.
[0307] In at least one embodiment, one or more servers 1278 may receive image data representing images from vehicles via one or more networks 1290, the images showing unexpected or changed road conditions, such as recently commenced roadworks. In at least one embodiment, one or more servers 1278 may transmit updated neural network 1292 and / or map information 1294, including but not limited to information about traffic and road conditions, to vehicles via one or more networks 1290. In at least one embodiment, updates to map information 1294 may include, but are not limited to, updates to HD map 1222, such as information about construction sites, potholes, sidewalks, floods, and / or other obstacles. In at least one embodiment, neural network 1292 and / or map information 1294 may be generated from new training and / or experience represented by data received from any number of vehicles in the environment, and / or at least based on training performed in a data center (e.g., using one or more servers 1278 and / or other servers).
[0308] In at least one embodiment, one or more servers 1278 may be used to train a machine learning model (e.g., a neural network) at least in part based on training data. In at least one embodiment, the training data may be generated by the vehicle, and / or may be generated in a simulation (e.g., using a game engine). In at least one embodiment, any amount of training data is labeled (e.g., where the associated neural network benefits from supervised learning) and / or undergoes other preprocessing. In at least one embodiment, no amount of training data is labeled and / or preprocessed (e.g., where the associated neural network does not require supervised learning). In at least one embodiment, once the machine learning model is trained, the machine learning model may be used by the vehicle (e.g., transmitted to the vehicle via one or more networks 1290), and / or the machine learning model may be used by one or more servers 1278 to remotely monitor the vehicle.
[0309] In at least one embodiment, one or more servers 1278 may receive data from the vehicle and apply the data to state-of-the-art real-time neural networks for real-time intelligent inference. In at least one embodiment, one or more servers 1278 may include a deep learning supercomputer and / or a dedicated AI computer powered by one or more GPUs 1284, such as the DGX and DGX Station machines developed by NVIDIA. However, in at least one embodiment, one or more servers 1278 may include a deep learning infrastructure in a data center using CPU power.
[0310] In at least one embodiment, the deep learning infrastructure of one or more servers 1278 may be capable of fast, real-time inference and can use this capability to assess and verify the health of the processor, software, and / or associated hardware in vehicle 1200. For example, in at least one embodiment, the deep learning infrastructure may receive periodic updates from vehicle 1200, such as image sequences and / or objects located by vehicle 1200 in the image sequence (e.g., via computer vision and / or other machine learning object classification techniques). In at least one embodiment, the deep learning infrastructure may run its own neural network to identify objects and compare them with objects identified by vehicle 1200, and if the results do not match and the deep learning infrastructure determines that the AI in vehicle 1200 is malfunctioning, one or more servers 1278 may signal to vehicle 1200 to instruct the fail-safe computer of vehicle 1200 to take control, notify passengers, and complete a safe stopping operation.
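The cross-check in paragraph [0310] can be illustrated with a minimal sketch. It is not the implementation of any embodiment: the function names `detections_match` and `health_check`, the representation of detections as sets of labels, the overlap threshold, and the returned signal strings are all assumptions introduced here, and the sketch collapses "results do not match" and "the AI is malfunctioning" into a single test.

```python
from typing import Iterable


def detections_match(vehicle_objects: Iterable[str],
                     server_objects: Iterable[str],
                     min_overlap: float = 0.8) -> bool:
    """Return True when the two label sets overlap enough (Jaccard index).

    The 0.8 threshold is an illustrative assumption, not a value from the text.
    """
    vehicle, server = set(vehicle_objects), set(server_objects)
    if not vehicle and not server:
        return True  # neither side saw anything: treat as agreement
    overlap = len(vehicle & server) / len(vehicle | server)
    return overlap >= min_overlap


def health_check(vehicle_objects: Iterable[str],
                 server_objects: Iterable[str]) -> str:
    """Pick the (hypothetical) signal the server sends back to the vehicle."""
    if detections_match(vehicle_objects, server_objects):
        return "ok"
    # Mismatch: instruct the fail-safe computer to take control,
    # notify passengers, and complete a safe stop.
    return "failsafe_takeover"
```

A real deployment would compare object positions and classes per frame rather than bare label sets; the set comparison is used here only to keep the example short.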
[0311] In at least one embodiment, one or more servers 1278 may include one or more GPUs 1284 and one or more programmable inference accelerators (e.g., NVIDIA's TensorRT 3 devices). In at least one embodiment, the combination of GPU-driven servers and inference acceleration enables real-time response. In at least one embodiment, for example, where performance is less critical, servers driven by CPUs, FPGAs, and other processors may be used for inference. In at least one embodiment, hardware architecture 915 is used to execute one or more embodiments. Details regarding the hardware architecture 915 are provided herein in conjunction with Figure 9A and / or Figure 9B.
[0312] Computer System
[0313] Figure 13 is a block diagram illustrating an exemplary computer system according to at least one embodiment. The exemplary computer system may be a system of interconnected devices and components, a system-on-a-chip (SOC), or some combination thereof, formed with a processor that may include execution units to execute instructions. In at least one embodiment, in accordance with this disclosure, such as the embodiments described herein, computer system 1300 may include, but is not limited to, components such as processor 1302, which employs execution units including logic to execute algorithms for processing data. In at least one embodiment, computer system 1300 may include a processor, such as a Xeon™, XScale™, StrongARM™, Core™, or Nervana™ microprocessor available from Intel Corporation of Santa Clara, California, although other systems (including PCs, engineering workstations, set-top boxes, etc.) with other microprocessors may also be used. In at least one embodiment, computer system 1300 may execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and / or graphical user interfaces may also be used.
[0314] The embodiments can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system that can execute one or more instructions according to at least one embodiment.
[0315] In at least one embodiment, the computer system 1300 may include, but is not limited to, a processor 1302, which may include, but is not limited to, one or more execution units 1308, to perform machine learning model training and / or inference according to the techniques described herein. In at least one embodiment, the computer system 1300 is a single-processor desktop or server system, but in another embodiment, the computer system 1300 may be a multiprocessor system. In at least one embodiment, the processor 1302 may include, but is not limited to, a Complex Instruction Set Computer (“CISC”) microprocessor, a Reduced Instruction Set Computing (“RISC”) microprocessor, a Very Long Instruction Word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In at least one embodiment, the processor 1302 may be coupled to a processor bus 1310, which can transmit data signals between the processor 1302 and other components in the computer system 1300.
[0316] In at least one embodiment, processor 1302 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 1304. In at least one embodiment, processor 1302 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory may reside external to processor 1302. Depending on specific implementation and requirements, other embodiments may also include a combination of internal and external caches. In at least one embodiment, register file 1306 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.
[0317] In at least one embodiment, an execution unit 1308, including but not limited to logic for performing integer and floating-point operations, is also located within the processor 1302. In at least one embodiment, the processor 1302 may also include a microcode (“ucode”) read-only memory (“ROM”) that stores microcode for certain macro instructions. In at least one embodiment, the execution unit 1308 may include logic for processing a packed instruction set 1309. In at least one embodiment, by including the packed instruction set 1309 in the instruction set of a general-purpose processor, along with the associated circuitry for executing the instructions, operations used by many multimedia applications can be performed using packed data in the processor 1302. In at least one embodiment, many multimedia applications can be executed more quickly and efficiently by using the full width of the processor’s data bus to perform operations on packed data, which may eliminate the need to transfer smaller units of data across the processor’s data bus to perform one or more operations on one data element at a time.
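As a rough illustration of what operating on packed data means, the sketch below packs four 8-bit values into one 32-bit word and adds all four lanes with a single pass of integer arithmetic instead of one element at a time. It emulates the idea in software and does not represent the packed instruction set 1309 itself; the lane width, the lane count, and the helper names `pack`, `unpack`, and `packed_add` are assumptions chosen for the example.

```python
LANES = 4                 # four 8-bit lanes per 32-bit word (illustrative choice)
HIGH_BITS = 0x80808080    # top bit of every lane
LOW_BITS = 0x7F7F7F7F     # low seven bits of every lane


def pack(values):
    """Pack four integers in 0..255 into one 32-bit word, lane 0 in the low byte."""
    assert len(values) == LANES and all(0 <= v <= 0xFF for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add(a, b):
    """Add all four lanes at once; each lane wraps modulo 256."""
    # Adding only the low seven bits of each lane cannot carry into the
    # neighbouring lane; the top bit of each lane is then restored with XOR.
    return ((a & LOW_BITS) + (b & LOW_BITS)) ^ ((a ^ b) & HIGH_BITS)
```

For example, `unpack(packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40])))` yields the four lane sums from one word-wide addition, which is the effect a hardware packed-add instruction achieves across the full width of the data path.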
[0318] In at least one embodiment, execution unit 1308 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, computer system 1300 may include, but is not limited to, memory 1320. In at least one embodiment, memory 1320 may be a dynamic random access memory (“DRAM”) device, a static random access memory (“SRAM”) device, a flash memory device, or another storage device. In at least one embodiment, memory 1320 may store instructions 1319 and / or data 1321 represented by data signals that can be executed by processor 1302.
[0319] In at least one embodiment, a system logic chip may be coupled to the processor bus 1310 and the memory 1320. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 1316, and the processor 1302 may communicate with the MCH 1316 via the processor bus 1310. In at least one embodiment, the MCH 1316 may provide a high-bandwidth memory path 1318 to the memory 1320 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 1316 may direct data signals between the processor 1302, the memory 1320, and other components in the computer system 1300, and bridge data signals between the processor bus 1310, the memory 1320, and the system I / O interface 1322. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1316 may be coupled to memory 1320 via high-bandwidth memory path 1318, and graphics / video card 1312 may be coupled to MCH 1316 via Accelerated Graphics Port (“AGP”) interconnect 1314.
[0320] In at least one embodiment, the computer system 1300 may use the system I / O interface 1322 as a proprietary hub interface bus to couple the MCH 1316 to the I / O controller hub (“ICH”) 1330. In at least one embodiment, the ICH 1330 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to the memory 1320, chipset, and processor 1302. Examples may include, but are not limited to, an audio controller 1329, a firmware hub (“Flash BIOS”) 1328, a wireless transceiver 1326, a data storage 1324, a conventional I / O controller 1323 including a user input and keyboard interface 1325, a serial expansion port 1327 (e.g., a Universal Serial Bus (USB) port), and a network controller 1334. In at least one embodiment, the data storage 1324 may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.
[0321] In at least one embodiment, Figure 13 shows a system that includes interconnected hardware devices or "chips," while in other embodiments, Figure 13 may show an exemplary SoC. In at least one embodiment, the devices shown in Figure 13 can be interconnected with dedicated interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of the computer system 1300 are interconnected using a Compute Express Link (CXL) interconnect.
[0322] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 can be used in the system of Figure 13 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0323] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0324] Figure 14 is a block diagram illustrating an electronic device 1400 that utilizes a processor 1410, according to at least one embodiment. In at least one embodiment, the electronic device 1400 may be, for example, but not limited to, a notebook, tower server, rack server, blade server, laptop computer, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.
[0325] In at least one embodiment, the electronic device 1400 may include, but is not limited to, a processor 1410 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, the processor 1410 is coupled using a bus or interface, such as an I²C bus, System Management Bus (“SMBus”), Low Pin Count (“LPC”) bus, Serial Peripheral Interface (“SPI”), High Definition Audio (“HDA”) bus, Serial Advanced Technology Attachment (“SATA”) bus, Universal Serial Bus (“USB”) (versions 1, 2, 3, etc.) or Universal Asynchronous Receiver / Transmitter (“UART”) bus. In at least one embodiment, Figure 14 shows a system comprising interconnected hardware devices or "chips," while in other embodiments, Figure 14 may show an exemplary SoC. In at least one embodiment, the devices shown in Figure 14 can be interconnected with dedicated interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of Figure 14 are interconnected using a Compute Express Link (CXL) interconnect.
[0326] In at least one embodiment, the system of Figure 14 may include a display 1424, a touch screen 1425, a touchpad 1430, a near field communication unit (“NFC”) 1445, a sensor hub 1440, a thermal sensor 1446, an embedded controller (“EC”) 1435, a trusted platform module (“TPM”) 1438, a BIOS / firmware / flash memory (“BIOS, FW flash”) 1422, a DSP 1460, a drive 1420 (such as a solid-state drive (“SSD”) or a hard disk drive (“HDD”)), a wireless local area network unit (“WLAN”) 1450, a Bluetooth unit 1452, a wireless wide area network unit (“WWAN”) 1456, a global positioning system (“GPS”) unit 1455, a camera (“USB 3.0 camera”) 1454 (such as a USB 3.0 camera) and / or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 1415 implemented in, for example, the LPDDR3 standard. These components can each be implemented in any suitable way.
[0327] In at least one embodiment, other components may be communicatively coupled to processor 1410 via the components described herein. In at least one embodiment, accelerometer 1441, ambient light sensor (“ALS”) 1442, compass 1443, and gyroscope 1444 may be communicatively coupled to sensor hub 1440. In at least one embodiment, thermal sensor 1439, fan 1437, keyboard 1436, and touchpad 1430 may be communicatively coupled to EC 1435. In at least one embodiment, speaker 1463, earphone 1464, and microphone (“mic”) 1465 may be communicatively coupled to audio unit (“audio codec and Class D amplifier”) 1462, which in turn may be communicatively coupled to DSP 1460. In at least one embodiment, audio unit 1462 may include, for example, but not limited to, audio encoder / decoder (“codec”) and Class D amplifier. In at least one embodiment, SIM card (“SIM”) 1457 may be communicatively coupled to WWAN unit 1456. In at least one embodiment, components such as WLAN unit 1450, Bluetooth unit 1452, and WWAN unit 1456 can be implemented in a next-generation form factor (“NGFF”).
[0328] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 can be used in the system of Figure 14 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0329] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0330] Figure 15 shows a computer system 1500 according to at least one embodiment. In at least one embodiment, the computer system 1500 is configured to implement the various processes and methods described throughout this disclosure.
[0331] In at least one embodiment, the computer system 1500 includes, but is not limited to, at least one central processing unit (“CPU”) 1502 connected to a communication bus 1510 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), Peripheral Component Interconnect Express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol. In at least one embodiment, the computer system 1500 includes, but is not limited to, main memory 1504, and control logic (e.g., implemented in hardware, software, or a combination thereof) and data are stored in main memory 1504, which may take the form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 1522 provides an interface to other computing devices and networks, through which the computer system 1500 receives data from and transmits data to other systems.
[0332] In at least one embodiment, the computer system 1500 includes, but is not limited to, an input device 1508, a parallel processing system 1512, and a display device 1506 that can be implemented using conventional cathode ray tube (“CRT”), liquid crystal display (“LCD”), light-emitting diode (“LED”) display, plasma display, or other suitable display technologies. In at least one embodiment, user input is received from the input device 1508, such as a keyboard, mouse, touchpad, microphone, etc. In at least one embodiment, each module described herein may reside on a single semiconductor platform to form the processing system.
[0333] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 can be used in the system of Figure 15 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0334] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0335] Figure 16 illustrates a computer system 1600 according to at least one embodiment. In at least one embodiment, the computer system 1600 includes, but is not limited to, a computer 1610 and a USB stick 1620. In at least one embodiment, the computer 1610 may include, but is not limited to, any number and type of processors (not shown) and memory (not shown). In at least one embodiment, the computer 1610 may be, but is not limited to, a server, a cloud instance, a laptop computer, or a desktop computer.
[0336] In at least one embodiment, the USB stick 1620 includes, but is not limited to, a processing unit 1630, a USB interface 1640, and USB interface logic 1650. In at least one embodiment, the processing unit 1630 can be any instruction execution system, apparatus, or device capable of executing instructions. In at least one embodiment, the processing unit 1630 can include, but is not limited to, any number and type of processing cores (not shown). In at least one embodiment, the processing unit 1630 includes an application-specific integrated circuit (“ASIC”) optimized to perform any amount and type of operations associated with machine learning. For example, in at least one embodiment, the processing unit 1630 is a tensor processing unit (“TPC”) optimized to perform machine learning inference operations. In at least one embodiment, the processing unit 1630 is a vision processing unit (“VPU”) optimized to perform machine vision and machine learning inference operations.
[0337] In at least one embodiment, the USB interface 1640 can be any type of USB connector or USB receptacle. For example, in at least one embodiment, the USB interface 1640 is a USB 3.0 Type-C receptacle for data and power. In at least one embodiment, the USB interface 1640 is a USB 3.0 Type-A connector. In at least one embodiment, the USB interface logic 1650 may include any amount and type of logic that enables the processing unit 1630 to interface with a device (e.g., computer 1610) via the USB interface 1640.
[0338] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 may be used in the system of Figure 16 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0339] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects, based at least in part on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by the embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0340] Figure 17AAn exemplary architecture is shown in which multiple GPUs 1710(1)-1710(N) are communicatively coupled to multiple multi-core processors 1705(1)-1705(M) via high-speed links 1740(1)-1740(N) (e.g., bus / point-to-point interconnect, etc.). In at least one embodiment, the high-speed links 1740(1)-1740(N) support communication throughput of 4GB / s, 30GB / s, 80GB / s, or higher. In at least one embodiment, various interconnect protocols may be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. In the various figures, “N” and “M” represent positive integers, the values of which may vary from figure to figure.
[0341] Furthermore, in at least one embodiment, two or more GPUs 1710 are interconnected via high-speed links 1729(1)-1729(2), which can be implemented using a protocol / link similar to or different from that used for high-speed links 1740(1)-1740(N). Similarly, two or more multi-core processors 1705 can be connected via high-speed link 1728, which can be a symmetric multiprocessor (SMP) bus operating at speeds of 20GB / s, 30GB / s, 120GB / s, or higher. Alternatively, all communication between the various system components shown in Figure 17A can be accomplished using similar protocols / links (e.g., via a common interconnect fabric).
[0342] In at least one embodiment, each multi-core processor 1705 is communicatively coupled to processor memories 1701(1)-1701(M) via memory interconnects 1726(1)-1726(M), and each GPU 1710(1)-1710(N) is communicatively coupled to GPU memories 1720(1)-1720(N) via GPU memory interconnects 1750(1)-1750(N). In at least one embodiment, memory interconnects 1726 and 1750 may utilize similar or different memory access technologies. By way of example and not limitation, processor memories 1701(1)-1701(M) and GPU memories 1720 may be volatile memories, such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high bandwidth memory (HBM), and / or may be non-volatile memories, such as 3D XPoint or Nano-RAM. In at least one embodiment, some portions of the processor memory 1701 may be volatile memory, while other portions may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).
[0343] As described herein, various multi-core processors 1705 and GPUs 1710 can be physically coupled to specific memories 1701 and 1720, respectively, and / or a unified memory architecture can be implemented in which the virtual system address space (also known as the “effective address” space) is distributed among the various physical memories. For example, processor memories 1701(1)-1701(M) can each contain 64GB of system memory address space, and GPU memories 1720(1)-1720(N) can each contain 32GB of system memory address space, resulting in a total addressable memory size of 256GB when M=2 and N=4. N and M may also take other values.
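The address-space arithmetic in the example above can be checked with a short sketch. It is illustrative only: the contiguous layout (processor memories first, then GPU memories), the region names, and the helper functions `build_address_map` and `locate` are assumptions made for this example and are not taken from the described embodiments.

```python
GB = 1 << 30  # bytes per gigabyte (binary)


def build_address_map(m, n, cpu_gb=64, gpu_gb=32):
    """Lay out M processor memories, then N GPU memories, in one address space.

    The contiguous ordering is an assumption for illustration only.
    Returns (regions, total_size_in_bytes); each region is (name, base, size).
    """
    regions = []
    base = 0
    for i in range(1, m + 1):
        size = cpu_gb * GB
        regions.append((f"processor_memory_1701({i})", base, size))
        base += size
    for j in range(1, n + 1):
        size = gpu_gb * GB
        regions.append((f"gpu_memory_1720({j})", base, size))
        base += size
    return regions, base


def locate(regions, address):
    """Return (region name, offset within region) for a unified address."""
    for name, start, size in regions:
        if start <= address < start + size:
            return name, address - start
    raise ValueError("address is outside the unified address space")


regions, total_bytes = build_address_map(m=2, n=4)
print(total_bytes // GB)  # 2 * 64 + 4 * 32 = 256
```

Under this assumed layout, an address 130GB into the space falls 2GB into the first GPU memory, since the two processor memories occupy the first 128GB.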
[0344] Figure 17B shows additional details of the interconnection between a multi-core processor 1707 and a graphics acceleration module 1746 according to an exemplary embodiment. In at least one embodiment, the graphics acceleration module 1746 may include one or more GPU chips integrated on a line card coupled to the processor 1707 via a high-speed link 1740 (e.g., PCIe bus, NVLink, etc.). In at least one embodiment, the graphics acceleration module 1746 may optionally be integrated on the same package or chip as the processor 1707.
[0345] In at least one embodiment, the processor 1707 includes multiple cores 1760A-1760D, each core having a translation lookaside buffer (“TLB”) 1761A-1761D and one or more caches 1762A-1762D. In at least one embodiment, the cores 1760A-1760D may include various other components (not shown) for executing instructions and processing data. In at least one embodiment, the caches 1762A-1762D may include level 1 (L1) and level 2 (L2) caches. Furthermore, one or more shared caches 1756 may be included in the caches 1762A-1762D and shared by groups of the cores 1760A-1760D. For example, one embodiment of the processor 1707 includes 24 cores, each core having its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, two adjacent cores share one or more L2 and L3 caches. In at least one embodiment, the processor 1707 and the graphics acceleration module 1746 are connected to a system memory 1714, which may include the processor memories 1701(1)-1701(M) of Figure 17A.
[0346] In at least one embodiment, coherency of data and instructions stored in the various caches 1762A-1762D, 1756 and system memory 1714 is maintained via inter-core communication over the coherence bus 1764. In at least one embodiment, for example, each cache may have associated cache coherency logic / circuitry to communicate over the coherence bus 1764 in response to detected reads or writes to particular cache lines. In at least one embodiment, a cache snooping protocol is implemented over the coherence bus 1764 to snoop cache accesses.
[0347] In at least one embodiment, proxy circuitry 1725 communicatively couples graphics acceleration module 1746 to coherence bus 1764, thereby allowing graphics acceleration module 1746 to participate in cache coherence protocols as a peer of cores 1760A-1760D. Specifically, in at least one embodiment, interface 1735 provides connectivity to proxy circuitry 1725 via high-speed link 1740, and interface 1737 connects graphics acceleration module 1746 to high-speed link 1740.
[0348] In at least one embodiment, the accelerator integrated circuit 1736 provides cache management, memory access, context management, and interrupt management services for a plurality of graphics processing engines 1731(1)-1731(N) of the graphics acceleration module. In at least one embodiment, the graphics processing engines 1731(1)-1731(N) may each include a separate graphics processing unit (GPU). In at least one embodiment, the graphics processing engines 1731(1)-1731(N) may optionally include different types of graphics processing engines within the GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and blit engines. In at least one embodiment, the graphics acceleration module 1746 may be a GPU having a plurality of graphics processing engines 1731(1)-1731(N), or the graphics processing engines 1731(1)-1731(N) may be individual GPUs integrated on a common package, line card, or chip.
[0349] In at least one embodiment, the accelerator integrated circuit 1736 includes a memory management unit (MMU) 1739 for performing various memory management functions, such as virtual-to-physical memory translation (also known as effective-to-real memory translation), and memory access protocols for accessing system memory 1714. In at least one embodiment, the MMU 1739 may also include a translation lookaside buffer (“TLB”) (not shown) for caching virtual / effective-to-physical / real address translations. In at least one embodiment, cache 1738 may store commands and data for efficient access by graphics processing engines 1731(1)-1731(N). In at least one embodiment, a fetch unit 1744 may be used to keep data stored in cache 1738 and graphics memory 1733(1)-1733(M) coherent with core caches 1762A-1762D, 1756 and system memory 1714. As previously mentioned, this can be accomplished via proxy circuitry 1725 acting on behalf of cache 1738 and graphics memory 1733(1)-1733(M) (e.g., sending updates to cache 1738 related to the modification / access of cache lines in processor caches 1762A-1762D, 1756, and receiving updates from cache 1738).
[0350] In at least one embodiment, a set of registers 1745 stores context data of threads executed by graphics processing engines 1731(1)-1731(N), and context management circuitry 1748 manages the thread contexts. For example, context management circuitry 1748 can perform save and restore operations to save and restore the contexts of individual threads during context switching (e.g., saving the context of a first thread and restoring the context of a second thread so that the second thread can be executed by the graphics processing engine). For example, during context switching, context management circuitry 1748 can store the current register values to a designated area in memory (e.g., identified by a context pointer). The register values can then be restored when returning to the context. In at least one embodiment, interrupt management circuitry 1747 receives and processes interrupts received from system devices.
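The save-and-restore behavior described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the circuitry itself: the `ContextManager` class, its method names, the dictionary standing in for the designated memory areas, and the choice to zero the registers for a context that was never saved are all hypothetical.

```python
class ContextManager:
    """Toy model of context save/restore keyed by a context pointer."""

    def __init__(self, register_names):
        # Live register set, analogous to the registers holding thread context.
        self.registers = {name: 0 for name in register_names}
        # Stand-in for memory areas identified by context pointers.
        self._saved = {}

    def save(self, context_pointer):
        """Copy the current register values to the area for this pointer."""
        self._saved[context_pointer] = dict(self.registers)

    def restore(self, context_pointer):
        """Load register values previously saved for this pointer."""
        saved = self._saved.get(context_pointer)
        if saved is None:
            # Assumption: a context that was never saved starts from zeros.
            self.registers = {name: 0 for name in self.registers}
        else:
            self.registers = dict(saved)

    def switch(self, outgoing_pointer, incoming_pointer):
        """Save the outgoing thread's context, then restore the incoming one."""
        self.save(outgoing_pointer)
        self.restore(incoming_pointer)


manager = ContextManager(["r0", "r1"])
manager.registers["r0"] = 7     # first thread writes a register
manager.switch(0x1000, 0x2000)  # switch to the second thread
manager.registers["r0"] = 99    # second thread writes the same register
manager.switch(0x2000, 0x1000)  # return to the first thread
print(manager.registers["r0"])  # the first thread's value is back
```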
[0351] In at least one embodiment, MMU 1739 translates virtual / effective addresses from graphics processing engine 1731 into real / physical addresses in system memory 1714. In at least one embodiment, accelerator integrated circuit 1736 supports multiple (e.g., 4, 8, 16) graphics acceleration modules 1746 and / or other accelerator devices. In at least one embodiment, graphics acceleration module 1746 may be dedicated to a single application executing on processor 1707, or may be shared among multiple applications. In at least one embodiment, a virtualized graphics execution environment is presented, wherein resources of graphics processing engines 1731(1)-1731(N) are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into “slices” that are allocated to different VMs and / or applications based on the processing requirements and priorities associated with those VMs and / or applications.
[0352] In at least one embodiment, the accelerator integrated circuit 1736 acts as a bridge to the system for the graphics acceleration module 1746 and provides address translation and system memory caching services. Additionally, in at least one embodiment, the accelerator integrated circuit 1736 can provide virtualization facilities for the host processor to manage virtualization, interrupts, and memory management of the graphics processing engines 1731(1)-1731(N).
[0353] In at least one embodiment, since the hardware resources of the graphics processing engines 1731(1)-1731(N) are explicitly mapped to the real address space seen by the host processor 1707, any host processor can directly address these resources using effective address values. In at least one embodiment, one function of the accelerator integrated circuit 1736 is the physical separation of the graphics processing engines 1731(1)-1731(N) so that they appear to the system as independent units.
[0354] In at least one embodiment, one or more graphics memories 1733(1)-1733(M) are coupled to each graphics processing engine 1731(1)-1731(N), and N = M. In at least one embodiment, the graphics memories 1733(1)-1733(M) store instructions and data processed by each graphics processing engine 1731(1)-1731(N). In at least one embodiment, the graphics memories 1733(1)-1733(M) may be volatile memory, such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or may be non-volatile memory, such as 3D XPoint or Nano-RAM.
[0355] In at least one embodiment, to reduce data traffic on the high-speed link 1740, a biasing technique is used to ensure that the data stored in the graphics memories 1733(1)-1733(M) is the data most frequently used by the graphics processing engines 1731(1)-1731(N), and preferably data that the cores 1760A-1760D do not use (or at least do not use frequently). Similarly, in at least one embodiment, the biasing mechanism attempts to keep the data needed by the cores (and preferably not by the graphics processing engines 1731(1)-1731(N)) in the caches 1762A-1762D, 1756 and system memory 1714.
[0356] Figure 17C shows another exemplary embodiment, in which the accelerator integrated circuit 1736 is integrated within the processor 1707. In this embodiment, the graphics processing engines 1731(1)-1731(N) communicate directly with the accelerator integrated circuit 1736 over the high-speed link 1740 via interfaces 1737 and 1735 (which, again, can use any form of bus or interface protocol). In at least one embodiment, the accelerator integrated circuit 1736 can perform operations similar to those described with respect to Figure 17B, but potentially at higher throughput given its close proximity to the coherence bus 1764 and caches 1762A-1762D, 1756. In at least one embodiment, the accelerator integrated circuit supports different programming models, including a dedicated process programming model (without graphics acceleration module virtualization) and a shared programming model (with virtualization), which may include a programming model controlled by the accelerator integrated circuit 1736 and a programming model controlled by the graphics acceleration module 1746.
[0357] In at least one embodiment, graphics processing engines 1731(1)-1731(N) are dedicated to a single application or process under a single operating system. In at least one embodiment, a single application can funnel requests from other applications to graphics processing engines 1731(1)-1731(N), thereby providing virtualization within a VM / partition.
[0358] In at least one embodiment, graphics processing engines 1731(1)-1731(N) can be shared by multiple VM / application partitions. In at least one embodiment, the shared model can use a hypervisor to virtualize graphics processing engines 1731(1)-1731(N) to allow each operating system to access them. In at least one embodiment, for a single-partition system without a hypervisor, the operating system owns graphics processing engines 1731(1)-1731(N). In at least one embodiment, the operating system can virtualize graphics processing engines 1731(1)-1731(N) to provide access to each process or application.
[0359] In at least one embodiment, the graphics acceleration module 1746 or the individual graphics processing engine 1731(1)-1731(N) uses a process handle to select a process element. In at least one embodiment, the process element is stored in system memory 1714 and can be addressed using the effective address to real address translation techniques described herein. In at least one embodiment, the process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 1731(1)-1731(N) (i.e., invoking system software to add the process element to the process element linked list). In at least one embodiment, the lower 16 bits of the process handle may be the offset of the process element in the process element linked list.
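As a small illustration of the process-handle convention described above, the following sketch extracts the lower 16 bits of a handle as the offset of a process element. The function names, the example handle value, and the dictionary keyed by offset (standing in for the process element linked list) are assumptions for illustration; the upper bits of a handle are implementation-specific and are simply ignored here.

```python
PROCESS_ELEMENT_OFFSET_MASK = 0xFFFF  # lower 16 bits of the process handle


def process_element_offset(process_handle):
    """Return the offset encoded in the lower 16 bits of the handle."""
    return process_handle & PROCESS_ELEMENT_OFFSET_MASK


def select_process_element(process_elements, process_handle):
    """Look up a process element by the offset carried in the handle.

    `process_elements` is a hypothetical mapping from offset to element,
    used here in place of a real linked list in system memory.
    """
    return process_elements[process_element_offset(process_handle)]


elements = {0x0040: "element for process A", 0x0080: "element for process B"}
handle = 0xBEEF0080  # hypothetical handle; upper bits are implementation-specific
print(hex(process_element_offset(handle)))
print(select_process_element(elements, handle))
```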
[0360] Figure 17D illustrates an exemplary accelerator integration slice 1790. In at least one embodiment, a "slice" includes a designated portion of the processing resources of the accelerator integrated circuit 1736. In at least one embodiment, an application effective address space 1782 in system memory 1714 stores process element 1783. In at least one embodiment, process element 1783 is stored in response to a GPU call 1781 from an application 1780 executing on processor 1707. In at least one embodiment, process element 1783 contains the process state of the corresponding application 1780. In one embodiment, a work descriptor (WD) 1784 contained in process element 1783 may be a single job requested by the application, or it may contain a pointer to a job queue. In at least one embodiment, WD 1784 is a pointer to a job request queue in the effective address space 1782 of the application.
[0361] In at least one embodiment, the graphics acceleration module 1746 and / or the various graphics processing engines 1731(1)-1731(N) may be shared by all processes or a subset of processes in the system. In at least one embodiment, infrastructure may be included for setting process states and sending WD 1784 to the graphics acceleration module 1746 to begin operations in a virtualized environment.
[0362] In at least one embodiment, the dedicated process programming model is implementation-specific. In at least one embodiment, in this model, a single process owns either the graphics acceleration module 1746 or an individual graphics processing engine 1731. In at least one embodiment, when the graphics acceleration module 1746 is owned by a single process, the hypervisor initializes the accelerator integrated circuit 1736 for the owning partition, and the operating system initializes the accelerator integrated circuit 1736 for the owning process when the graphics acceleration module 1746 is assigned.
[0363] In at least one embodiment, during operation, the WD fetch unit 1791 in the accelerator integration slice 1790 fetches the next WD 1784, which includes an indication of the work to be performed by one or more graphics processing engines of the graphics acceleration module 1746. In at least one embodiment, data from the WD 1784 may be stored in registers 1745 and used by the MMU 1739, interrupt management circuitry 1747, and / or context management circuitry 1748, as shown. For example, one embodiment of the MMU 1739 includes segment / page walk circuitry for accessing segment / page tables 1786 within the OS virtual address space 1785. In at least one embodiment, the interrupt management circuitry 1747 may process an interrupt event 1792 received from the graphics acceleration module 1746. In at least one embodiment, when performing graphics operations, an effective address 1793 generated by graphics processing engines 1731(1)-1731(N) is translated into a real address by the MMU 1739.
[0364] In at least one embodiment, the same set of registers 1745 is duplicated for each graphics processing engine 1731(1)-1731(N) and / or graphics acceleration module 1746, and these registers may be initialized by a hypervisor or operating system. In at least one embodiment, each of these duplicated registers may be included in accelerator integration slice 1790. Exemplary registers that may be initialized by a hypervisor are shown in Table 1.
[0365]
[0366] Table 2 shows exemplary registers that can be initialized by the operating system.
[0367]
[0368]
[0369] In at least one embodiment, each WD 1784 is specific to a particular graphics acceleration module 1746 and / or graphics processing engine 1731(1)-1731(N). In at least one embodiment, it contains all the information required for the graphics processing engine 1731(1)-1731(N) to complete its work, or it may be a pointer to a memory location where the application has set up a command queue for the work to be completed.
[0370] Figure 17E shows additional details of an exemplary embodiment of the shared model. This embodiment includes a hypervisor real address space 1798, in which a list of process elements 1799 is stored. In at least one embodiment, the hypervisor real address space 1798 can be accessed via a hypervisor 1796, which virtualizes the graphics acceleration module engine for operating system 1795.
[0371] In at least one embodiment, the shared programming model allows all processes or subsets of processes from all partitions or subsets of partitions in the system to use the graphics acceleration module 1746. In at least one embodiment, there are two programming models in which the graphics acceleration module 1746 is shared by multiple processes and partitions, namely, time-sliced sharing and graphics-directed sharing.
[0372] In at least one embodiment, in this model, the hypervisor 1796 owns the graphics acceleration module 1746 and makes its functionality available to all operating systems 1795. In at least one embodiment, for the graphics acceleration module 1746 to support virtualization through the hypervisor 1796, the graphics acceleration module 1746 may comply with certain requirements, such as (1) the job requests of the application must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 1746 must provide a context save and restore mechanism, (2) the graphics acceleration module 1746 guarantees that the job requests of the application are completed within a specified amount of time, including any translation faults, or the graphics acceleration module 1746 provides the ability to preempt job processing, and (3) when operating in the directed shared programming model, the graphics acceleration module 1746 must guarantee fairness among processes.
[0373] In at least one embodiment, application 1780 is required to make an operating system 1795 system call with a graphics acceleration module type, a work descriptor (WD), an authority mask register (AMR) value, and a context save / restore area pointer (CSRP). In at least one embodiment, the graphics acceleration module type describes the target acceleration function for the system call. In at least one embodiment, the graphics acceleration module type can be a system-specific value. In at least one embodiment, the WD is specifically formatted for graphics acceleration module 1746 and can take the form of a graphics acceleration module 1746 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by graphics acceleration module 1746.
[0374] In at least one embodiment, the AMR value is the AMR state to use for the current process. In at least one embodiment, the value passed to the operating system is similar to an application setting the AMR. In at least one embodiment, if the implementations of the accelerator integrated circuit 1736 (not shown) and the graphics acceleration module 1746 do not support a User Authority Mask Override Register (UAMOR), the operating system may apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. In at least one embodiment, the hypervisor 1796 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element 1783. In at least one embodiment, the CSRP is one of the registers 1745 and contains the effective address of an area in the effective address space 1782 of the application for the graphics acceleration module 1746 to save and restore the context state. In at least one embodiment, this pointer is optional if no state needs to be saved between jobs or when a job is preempted. In at least one embodiment, the context save / restore area may be pinned system memory.
[0375] Upon receiving a system call, operating system 1795 can verify that application 1780 has been registered and granted permission to use graphics acceleration module 1746. Then, in at least one embodiment, operating system 1795 uses the information shown in Table 3 to invoke hypervisor 1796.
[0376]
[0377] In at least one embodiment, upon receiving a hypervisor call, hypervisor 1796 verifies that operating system 1795 has been registered and granted permission to use graphics acceleration module 1746. Then, in at least one embodiment, hypervisor 1796 adds process element 1783 to a linked list of process elements of the corresponding graphics acceleration module 1746 type. In at least one embodiment, the process element may include the information shown in Table 4.
[0378]
[0379] In at least one embodiment, the hypervisor initializes multiple accelerator integration slice 1790 registers 1745.
[0380] As shown in Figure 17F, in at least one embodiment, a unified memory is used, which is addressable via a common virtual memory address space for accessing physical processor memories 1701(1)-1701(M) and GPU memories 1720(1)-1720(N). In this implementation, operations performed on GPUs 1710(1)-1710(N) utilize the same virtual / effective memory address space to access processor memories 1701(1)-1701(M) and vice versa, thereby simplifying programmability. In at least one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 1701(1), a second portion to second processor memory 1701(N), a third portion to GPU memory 1720(1), and so on. In at least one embodiment, the entire virtual / effective memory space (sometimes referred to as the effective address space) is thus distributed across each of processor memory 1701 and GPU memory 1720, thereby allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
[0381] In at least one embodiment, the bias / coherence management circuitry 1794A-1794E within one or more MMUs 1739A-1739E ensures cache coherence between the caches of one or more host processors (e.g., 1705) and the GPUs 1710, and implements biasing techniques that indicate the physical memory in which certain types of data should be stored. In at least one embodiment, although several instances of the bias / coherence management circuitry 1794A-1794E are shown in Figure 17F, the bias / coherence circuitry can be implemented within the MMU of one or more host processors 1705 and / or within the accelerator integrated circuit 1736.
[0382] One embodiment allows GPU memory 1720 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology without suffering the performance drawbacks associated with full system cache coherence. In at least one embodiment, the ability to access GPU memory 1720 as system memory without the heavy overhead of cache coherence provides a favorable operating environment for GPU offloading. In at least one embodiment, this arrangement allows software on the host processor 1705 to set up operands and access computation results without the overhead of conventional I / O DMA data copies. In at least one embodiment, such conventional copies involve driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are inefficient relative to simple memory accesses. In at least one embodiment, the ability to access GPU memory 1720 without cache coherence overhead can be critical to the execution time of offloaded computations. In at least one embodiment, for example, in cases with high streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPU 1710. In at least one embodiment, the efficiency of operand setup, the efficiency of result access, and the efficiency of GPU computation can each play a role in determining the effectiveness of GPU offloading.
[0383] In at least one embodiment, the selection of GPU bias and host processor bias is driven by a bias tracker data structure. In at least one embodiment, for example, a bias table can be used, which may be a page-granular structure (e.g., controlled at the granularity of a memory page) comprising one or two bits per GPU-attached memory page. In at least one embodiment, the bias table can be implemented in a stolen memory range of one or more GPU memories 1720, with or without a bias cache in GPU 1710 (e.g., for caching frequently / recently used entries of the bias table). Alternatively, in at least one embodiment, the entire bias table can be maintained within the GPU.
[0384] In at least one embodiment, the bias table entry associated with each access to GPU-attached memory 1720 is consulted before the actual access to GPU memory, resulting in the following operations. In at least one embodiment, local requests from GPU 1710 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 1720. In at least one embodiment, local requests from the GPU that find their page in host bias are forwarded to processor 1705 (e.g., via the high-speed link described herein). In at least one embodiment, requests from processor 1705 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to GPU 1710. In at least one embodiment, the GPU may then transition the page to host processor bias if it is not currently using the page. In at least one embodiment, the bias state of a page can be changed through a software-based mechanism, a hardware-assisted software mechanism, or, in limited cases, a purely hardware-based mechanism.
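The request routing described above can be sketched as a lookup in a page-granular bias table followed by a routing decision. This is a simplified illustration, not the described hardware: the 4096-byte page size, the default of host bias for pages with no entry, the `BiasTable` class, and the string labels returned by `route` are all assumptions made for this example.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Bias(Enum):
    HOST = 0
    GPU = 1


class BiasTable:
    """Page-granular bias table: one entry per GPU-attached memory page."""

    def __init__(self):
        self._entries = {}  # page number -> Bias

    def set_bias(self, address, bias):
        self._entries[address // PAGE_SIZE] = bias

    def lookup(self, address):
        # Assumption: pages without an entry are treated as host-biased.
        return self._entries.get(address // PAGE_SIZE, Bias.HOST)


def route(requester, address, table):
    """Decide where a request to GPU-attached memory is sent.

    `requester` is "gpu" or "host"; the returned labels are hypothetical.
    """
    bias = table.lookup(address)
    if requester == "gpu":
        # GPU-biased pages go straight to GPU memory; host-biased pages
        # are forwarded to the host processor.
        return "gpu_memory" if bias is Bias.GPU else "host_processor"
    if requester == "host":
        # Host-biased pages complete like a normal memory read; GPU-biased
        # pages are forwarded to the GPU.
        return "normal_memory_read" if bias is Bias.HOST else "gpu"
    raise ValueError(f"unknown requester: {requester!r}")


table = BiasTable()
table.set_bias(0x2000, Bias.GPU)
print(route("gpu", 0x2010, table))   # same page as 0x2000, GPU-biased
print(route("host", 0x2010, table))  # host request to a GPU-biased page
```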
[0385] In at least one embodiment, a mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver. The device driver then sends a message (or enqueues a command descriptor) to the GPU, instructing the GPU to change the bias state and, for some transitions, to perform a cache flush operation in the host. In at least one embodiment, the cache flush operation is used for a transition from host processor 1705 bias to GPU bias, but not for the reverse transition.
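The asymmetry in the bias transition described above (a host cache operation when moving from host processor bias to GPU bias, but not in the reverse direction) can be captured in a few lines. The sketch below is hypothetical throughout: `change_bias`, the string bias labels, the default of host bias, and the `flush_host_cache` callback are illustrative stand-ins and do not correspond to any real driver or OpenCL API.

```python
def change_bias(bias_table, page, new_bias, flush_host_cache):
    """Change the bias of one page; flush host caches only for host -> GPU.

    `bias_table` maps page numbers to "host" or "gpu" (default "host" is an
    assumption). `flush_host_cache` is a hypothetical callback invoked with
    the page number. Returns True if the bias actually changed.
    """
    if new_bias not in ("host", "gpu"):
        raise ValueError(f"unknown bias: {new_bias!r}")
    old_bias = bias_table.get(page, "host")
    if old_bias == new_bias:
        return False
    if old_bias == "host" and new_bias == "gpu":
        # Host-cached copies of the page are flushed before the GPU takes over.
        flush_host_cache(page)
    bias_table[page] = new_bias
    return True


flushed_pages = []
biases = {}
change_bias(biases, 5, "gpu", flushed_pages.append)   # host -> GPU: flush
change_bias(biases, 5, "host", flushed_pages.append)  # GPU -> host: no flush
print(flushed_pages)
```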
[0386] In at least one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 1705. In at least one embodiment, to access these pages, the processor 1705 may request access from the GPU 1710, which may or may not grant access right away. Therefore, in at least one embodiment, to reduce communication between the processor 1705 and the GPU 1710, it is beneficial to ensure that GPU-biased pages are those needed by the GPU rather than by the host processor 1705, and vice versa.
[0387] One or more hardware structures 915 are used to execute one or more embodiments. Details regarding the one or more hardware structures 915 may be provided herein in conjunction with Figure 9A and / or Figure 9B.
[0388] Figure 18 illustrates exemplary integrated circuits and associated graphics processors according to various embodiments described herein, which may be manufactured using one or more IP cores. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0389] Figure 18 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 1800 that can be fabricated using one or more IP cores according to at least one embodiment. In at least one embodiment, the integrated circuit 1800 includes one or more application processors 1805 (e.g., CPU), at least one graphics processor 1810, and may additionally include an image processor 1815 and / or a video processor 1820, any of which may be a modular IP core. In at least one embodiment, the integrated circuit 1800 includes peripheral or bus logic, which includes a USB controller 1825, a UART controller 1830, an SPI / SDIO controller 1835, and an I2S / I2C controller 1840. In at least one embodiment, integrated circuit 1800 may include a display device 1845 coupled to one or more of a High-Definition Multimedia Interface (HDMI) controller 1850 and a Mobile Industry Processor Interface (MIPI) display interface 1855. In at least one embodiment, storage may be provided by a flash memory subsystem 1860, including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via memory controller 1865 for accessing SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits also include an embedded security engine 1870.
[0390] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 may be used in integrated circuit 1800 for inferencing or predicting operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0391] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies described with reference to the figures are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is implemented by embodiments of the figures in accordance with the techniques and embodiments described above in relation to Figures 1-8.
[0392] Figure 19A and Figure 19B illustrate exemplary integrated circuits and associated graphics processors according to various embodiments described herein, which may be manufactured using one or more IP cores. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0393] Figures 19A-19B are block diagrams illustrating exemplary graphics processors for use within an SoC according to embodiments described herein. Figure 19A illustrates an exemplary graphics processor 1910 of a system-on-a-chip integrated circuit that can be manufactured using one or more IP cores according to at least one embodiment. Figure 19B illustrates an additional exemplary graphics processor 1940 of a system-on-a-chip integrated circuit that can be manufactured using one or more IP cores according to at least one embodiment. In at least one embodiment, the graphics processor 1910 of Figure 19A is a low-power graphics processor core. In at least one embodiment, the graphics processor 1940 of Figure 19B is a higher-performance graphics processor core. In at least one embodiment, each of the graphics processors 1910, 1940 may be a variant of the graphics processor 1810 of Figure 18.
[0394] In at least one embodiment, the graphics processor 1910 includes a vertex processor 1905 and one or more fragment processors 1915A-1915N (e.g., 1915A, 1915B, 1915C, 1915D, through 1915N-1 and 1915N). In at least one embodiment, the graphics processor 1910 can execute different shader programs via separate logic, such that the vertex processor 1905 is optimized to perform operations for vertex shader programs, while one or more fragment processors 1915A-1915N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, the vertex processor 1905 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, one or more fragment processors 1915A-1915N use the primitive and vertex data generated by the vertex processor 1905 to produce a frame buffer that is displayed on a display device. In at least one embodiment, one or more fragment processors 1915A-1915N are optimized to execute fragment shader programs as provided in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs provided in the Direct 3D API.
[0395] In at least one embodiment, the graphics processor 1910 additionally includes one or more memory management units (MMUs) 1920A-1920B, one or more caches 1925A-1925B, and one or more circuit interconnects 1930A-1930B. In at least one embodiment, one or more MMUs 1920A-1920B provide a virtual-to-physical address mapping for the graphics processor 1910, including for the vertex processor 1905 and / or fragment processors 1915A-1915N, which can reference vertex or image / texture data stored in memory, in addition to vertex or image / texture data stored in one or more caches 1925A-1925B. In at least one embodiment, one or more MMUs 1920A-1920B can be synchronized with other MMUs within the system, including one or more MMUs associated with one or more application processors 1805, image processors 1815, and / or video processors 1820 of Figure 18, enabling each processor 1805-1820 to participate in a shared or unified virtual memory system. In at least one embodiment, one or more circuit interconnects 1930A-1930B enable the graphics processor 1910 to connect to other IP cores within the SoC via the SoC's internal bus or via a direct connection.
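As a purely illustrative sketch, and not part of the described embodiments, the following Python fragment models how several MMUs that consult the same page table resolve a virtual address to the same physical address, which is the property a shared or unified virtual memory system relies on. The page size, the single-level table, and the names `PageTable` and `Mmu` are assumptions introduced only for this example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class PageTable:
    """Toy single-level page table that several processors can share."""

    def __init__(self):
        self._frames = {}  # virtual page number -> physical frame number

    def map_page(self, vpn, pfn):
        self._frames[vpn] = pfn

    def translate(self, vaddr):
        # Split the address into a page number and an offset within the page.
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self._frames:
            raise LookupError(f"page fault at virtual address {vaddr:#x}")
        return self._frames[vpn] * PAGE_SIZE + offset


class Mmu:
    """Hypothetical MMU front end; each processor owns one instance."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table

    def translate(self, vaddr):
        return self.page_table.translate(vaddr)


# Two MMUs kept in sync by referencing the same page table.
shared = PageTable()
shared.map_page(vpn=2, pfn=7)
gpu_mmu = Mmu("graphics", shared)
cpu_mmu = Mmu("application", shared)
```

Because both MMU objects reference one table, a mapping installed once is visible to every processor, so a pointer produced by one processor remains valid on another.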
[0396] In at least one embodiment, the graphics processor 1940 includes one or more shader cores 1955A-1955N (e.g., 1955A, 1955B, 1955C, 1955D, 1955E, 1955F, through 1955N-1 and 1955N), as shown in Figure 19B, which provide a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and / or compute shaders. In at least one embodiment, the number of shader cores can vary. In at least one embodiment, the graphics processor 1940 includes an inter-core task manager 1945, which acts as a thread dispatcher to dispatch execution threads to one or more shader cores 1955A-1955N, and a tiling unit 1958 to accelerate tile-based rendering operations, in which rendering operations for a scene are subdivided in image space, for example, to exploit local spatial coherence within the scene or to optimize the use of internal caches.
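The image-space subdivision used by tile-based rendering can be sketched as follows. This is a minimal illustration under assumed parameters, not the behavior of any particular tiling unit; the function name `split_into_tiles` and the square tile size are hypothetical.

```python
def split_into_tiles(width, height, tile):
    """Subdivide a width x height image into tiles of at most tile x tile pixels.

    Returns a list of (x, y, tile_width, tile_height) rectangles in row-major
    order. Tiles on the right and bottom edges are clipped to the image.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            tiles.append((x, y, min(tile, width - x), min(tile, height - y)))
    return tiles


# A 70 x 40 image with 32-pixel tiles yields 3 columns and 2 rows of tiles.
tiles = split_into_tiles(70, 40, 32)
```

Each rectangle can then be rendered independently, so the pixels touched by one unit of work stay close together and are more likely to fit in an on-chip cache.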
[0397] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 may be used in the integrated circuits of Figure 19A and / or Figure 19B for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0398] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies, with reference to the figures, are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by the embodiments of the figures according to the techniques and embodiments described above with respect to Figures 1-8.
[0399] Figures 20A-20B illustrate additional exemplary graphics processor logic according to embodiments described herein. In at least one embodiment, Figure 20A illustrates a graphics core 2000 that may be included within the graphics processor 1810 of Figure 18 and, in at least one embodiment, may be a unified shader core 1955A-1955N as shown in Figure 19B. Figure 20B illustrates a highly parallel general-purpose graphics processing unit ("GPGPU") 2030 suitable for deployment on a multi-chip module in at least one embodiment.
[0400] In at least one embodiment, the graphics core 2000 includes a shared instruction cache 2002, texture units 2018, and cache / shared memory 2020, which are common to the execution resources within the graphics core 2000. In at least one embodiment, the graphics core 2000 may include multiple slices 2001A-2001N or partitions of each core, and the graphics processor may include multiple instances of the graphics core 2000. In at least one embodiment, slices 2001A-2001N may include supporting logic, including local instruction caches 2004A-2004N, thread schedulers 2006A-2006N, thread dispatchers 2008A-2008N, and a set of registers 2010A-2010N. In at least one embodiment, slices 2001A-2001N may include a set of additional functional units (AFU 2012A-2012N), floating-point units (FPU 2014A-2014N), integer arithmetic logic units (ALU 2016A-2016N), address calculation units (ACU 2013A-2013N), double-precision floating-point units (DPFPU 2015A-2015N), and matrix processing units (MPU 2017A-2017N).
[0401] In at least one embodiment, the FPUs 2014A-2014N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPUs 2015A-2015N perform double-precision (64-bit) floating-point operations. In at least one embodiment, the ALUs 2016A-2016N can perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. In at least one embodiment, the MPUs 2017A-2017N can also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, the MPUs 2017A-2017N can perform various matrix operations to accelerate machine learning application frameworks, including enabling support for accelerated general matrix-to-matrix multiplication (GEMM). In at least one embodiment, the AFUs 2012A-2012N can perform additional logical operations not supported by the floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
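One common form of mixed-precision matrix multiplication rounds the inputs to half precision and accumulates the products at single precision. The sketch below emulates that arithmetic in plain Python using the standard `struct` module's `e` (16-bit) and `f` (32-bit) formats. It is an assumption-laden illustration of the numerical idea only; the function names are hypothetical, and it does not describe how any matrix processing unit is actually implemented.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_gemm(a, b):
    """Multiply matrices a (m x k) and b (k x n), given as lists of rows.

    Inputs are rounded to half precision; partial sums are kept at single
    precision, emulating a half-input, single-accumulate GEMM.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                product = to_half(a[i][k]) * to_half(b[k][j])
                acc = to_single(acc + product)  # accumulate at 32-bit precision
            out[i][j] = acc
    return out


result = mixed_precision_gemm([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
```

Small integers are exactly representable in half precision, so the example product is exact, whereas a value such as 0.1 is perturbed by the initial rounding to 16 bits.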
[0402] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 may be used in the graphics core 2000 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0403] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies, with reference to the figures, are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by the embodiments of the figures according to the techniques and embodiments described above with respect to Figures 1-8.
[0404] Figure 20B illustrates a general-purpose processing unit (GPGPU) 2030 that, in at least one embodiment, can be configured to enable highly parallel computational operations to be performed by an array of graphics processing units. In at least one embodiment, the GPGPU 2030 can be directly linked to other instances of the GPGPU 2030 to create a multi-GPU cluster to improve the training speed for deep neural networks. In at least one embodiment, the GPGPU 2030 includes a host interface 2032 for connection to a host processor. In at least one embodiment, the host interface 2032 is a PCI Express interface. In at least one embodiment, the host interface 2032 may be a vendor-specific communication interface or communication fabric. In at least one embodiment, the GPGPU 2030 receives commands from the host processor and uses a global scheduler 2034 to distribute execution threads associated with those commands to a set of compute clusters 2036A-2036H. In at least one embodiment, the compute clusters 2036A-2036H share a cache memory 2038. In at least one embodiment, the cache memory 2038 can serve as a higher-level cache for the cache memories within the compute clusters 2036A-2036H.
[0405] In at least one embodiment, the GPGPU 2030 includes memories 2044A-2044B, which are coupled to computing clusters 2036A-2036H via a set of memory controllers 2042A-2042B. In at least one embodiment, memories 2044A-2044B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), which includes graphics double data rate (GDDR) memory.
[0406] In at least one embodiment, each of the computing clusters 2036A-2036H includes a set of graphics cores, such as the graphics core 2000 of Figure 20A, which may include various types of integer and floating-point logic units that can perform computational operations across a range of precisions, including precisions suitable for machine learning computations. For example, in at least one embodiment, at least a subset of the floating-point units in each computing cluster 2036A-2036H may be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units may be configured to perform 64-bit floating-point operations.
[0407] In at least one embodiment, multiple instances of the GPGPU 2030 can be configured to operate as a computing cluster. In at least one embodiment, the communication used by the computing clusters 2036A-2036H for synchronization and data exchange varies between embodiments. In at least one embodiment, the multiple instances of the GPGPU 2030 communicate via the host interface 2032. In at least one embodiment, the GPGPU 2030 includes an I / O hub 2039 that couples the GPGPU 2030 to a GPU link 2040, enabling direct connection to other instances of the GPGPU 2030. In at least one embodiment, the GPU link 2040 is coupled to a dedicated GPU-to-GPU bridge, which enables communication and synchronization between the multiple instances of the GPGPU 2030. In at least one embodiment, the GPU link 2040 is coupled to a high-speed interconnect for sending and receiving data to and from other GPGPUs or parallel processors. In at least one embodiment, the multiple instances of the GPGPU 2030 reside in separate data processing systems and communicate via network devices accessible through the host interface 2032. In at least one embodiment, the GPU link 2040 may be configured to enable connection to a host processor in addition to, or as an alternative to, the host interface 2032.
[0408] In at least one embodiment, the GPGPU 2030 can be configured to train a neural network. In at least one embodiment, the GPGPU 2030 can be used within an inference platform. In at least one embodiment, when the GPGPU 2030 is used for inference, the GPGPU 2030 may include fewer compute clusters 2036A-2036H compared to when the GPGPU 2030 is used to train a neural network. In at least one embodiment, the memory technology associated with the memories 2044A-2044B can differ between inference and training configurations, wherein a higher bandwidth memory technology is dedicated to the training configuration. In at least one embodiment, the inference configuration of the GPGPU 2030 can support inference-specific instructions. For example, in at least one embodiment, the inference configuration can provide support for one or more 8-bit integer dot product instructions, which can be used during the inference operation of the deployed neural network.
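To illustrate what an 8-bit integer dot product computes during inference, the sketch below multiplies two four-element vectors of signed 8-bit values and adds the result to a 32-bit accumulator. The four-lane width, the wraparound behavior, and the names `int8_dot4` and `wrap_int32` are assumptions chosen for the example; the text above does not specify the form of the instruction.

```python
INT8_MIN, INT8_MAX = -128, 127


def wrap_int32(value):
    """Reduce an integer to signed 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def int8_dot4(a, b, acc=0):
    """Dot product of two 4-element signed 8-bit vectors plus a 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("both operands must have exactly four lanes")
    for lane in (*a, *b):
        if not INT8_MIN <= lane <= INT8_MAX:
            raise ValueError(f"{lane} does not fit in a signed 8-bit lane")
    return wrap_int32(acc + sum(x * y for x, y in zip(a, b)))


# Quantized weights and activations, accumulated on top of a running sum of 10.
total = int8_dot4([1, 2, 3, 4], [5, 6, 7, 8], acc=10)
```

Accumulating at 32 bits matters because even four products of 8-bit values can exceed the 8-bit and 16-bit ranges.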
[0409] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 may be used in the GPGPU 2030 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0410] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies, with reference to the figures, are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by the embodiments of the figures according to the techniques and embodiments described above with respect to Figures 1-8.
[0411] Figure 21 is a block diagram illustrating a computer system 2100 according to at least one embodiment. In at least one embodiment, the computer system 2100 includes a processing subsystem 2101 having one or more processors 2102 and a system memory 2104 communicating via an interconnect path that may include a memory hub 2105. In at least one embodiment, the memory hub 2105 may be a separate component within a chipset assembly or may be integrated within one or more processors 2102. In at least one embodiment, the memory hub 2105 is coupled to an I / O subsystem 2111 via a communication link 2106. In at least one embodiment, the I / O subsystem 2111 includes an I / O hub 2107 that enables the computer system 2100 to receive input from one or more input devices 2108. In at least one embodiment, the I / O hub 2107 enables a display controller, which may be included in one or more processors 2102, to provide output to one or more display devices 2110A. In at least one embodiment, one or more display devices 2110A coupled to the I / O hub 2107 may include local, internal, or embedded display devices.
[0412] In at least one embodiment, the processing subsystem 2101 includes one or more parallel processors 2112 coupled to the memory hub 2105 via a bus or other communication link 2113. In at least one embodiment, the communication link 2113 may use any of a number of standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or may be a vendor-specific communication interface or communication fabric. In at least one embodiment, one or more parallel processors 2112 form a computationally focused parallel or vector processing system, which may include a large number of processing cores and / or processing clusters, such as a many integrated core (MIC) processor. In at least one embodiment, one or more parallel processors 2112 form a graphics processing subsystem that can output pixels to one of the one or more display devices 2110A coupled via the I / O hub 2107. In at least one embodiment, the parallel processors 2112 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 2110B.
[0413] In at least one embodiment, a system storage unit 2114 may be connected to the I / O hub 2107 to provide a storage mechanism for the computer system 2100. In at least one embodiment, an I / O switch 2116 may be used to provide an interface mechanism to enable connections between the I / O hub 2107 and other components, such as a network adapter 2118 and / or a wireless network adapter 2119 that may be integrated into the platform, and various other devices that can be added via one or more add-in devices 2120. In at least one embodiment, the network adapter 2118 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, the wireless network adapter 2119 may include one or more of Wi-Fi, Bluetooth, Near Field Communication (NFC), or other network devices that include one or more wireless radios.
[0414] In at least one embodiment, the computer system 2100 may include other components not explicitly shown, such as USB or other port connections, optical storage drives, video capture devices, etc., which may also be connected to the I / O hub 2107. In at least one embodiment, the communication paths interconnecting the various components in Figure 21 may be implemented using any suitable protocols, such as PCI-based protocols (e.g., PCI-Express) or other bus or point-to-point communication interfaces and / or protocols, such as the NV-Link high-speed interconnect or other interconnect protocols.
[0415] In at least one embodiment, one or more parallel processors 2112 include circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In at least one embodiment, the parallel processors 2112 include circuitry optimized for general-purpose processing. In at least one embodiment, components of the computer system 2100 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, the parallel processor 2112, memory hub 2105, processor 2102, and I / O hub 2107 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computer system 2100 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computer system 2100 may be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules to form a modular computer system.
[0416] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 may be used in the system 2100 of Figure 21 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0417] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or technologies, with reference to the figures, are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by the embodiments of the figures according to the techniques and embodiments described above with respect to Figures 1-8.
[0418] Processor
[0419] Figure 22A illustrates a parallel processor 2200 according to at least one embodiment. In at least one embodiment, various components of the parallel processor 2200 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). In at least one embodiment, the illustrated parallel processor 2200 is a variant of the one or more parallel processors 2112 shown in Figure 21 according to an exemplary embodiment.
[0420] In at least one embodiment, the parallel processor 2200 includes a parallel processing unit 2202. In at least one embodiment, the parallel processing unit 2202 includes an I / O unit 2204 that enables communication with other devices, including other instances of the parallel processing unit 2202. In at least one embodiment, the I / O unit 2204 can be directly connected to other devices. In at least one embodiment, the I / O unit 2204 is connected to other devices using a hub or switch interface (e.g., the memory hub 2105). In at least one embodiment, the connection between the memory hub 2105 and the I / O unit 2204 forms the communication link 2113. In at least one embodiment, the I / O unit 2204 is connected to a host interface 2206 and a memory crossbar switch 2216, wherein the host interface 2206 receives commands for performing processing operations, and the memory crossbar switch 2216 receives commands for performing memory operations.
[0421] In at least one embodiment, when the host interface 2206 receives a command buffer via the I / O unit 2204, the host interface 2206 can direct work operations for performing those commands to a front end 2208. In at least one embodiment, the front end 2208 is coupled to a scheduler 2210, which is configured to distribute commands or other work items to a processing cluster array 2212. In at least one embodiment, the scheduler 2210 ensures that the processing cluster array 2212 is correctly configured and in a valid state before distributing tasks to the processing cluster array 2212. In at least one embodiment, the scheduler 2210 is implemented via firmware logic executed on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 2210 can be configured to perform complex scheduling and work distribution operations at both coarse and fine granularity, thereby enabling fast preemption and context switching of threads executing on the processing cluster array 2212. In at least one embodiment, host software can submit workloads for scheduling on the processing cluster array 2212 via one of multiple graphics processing paths. In at least one embodiment, the workloads can then be automatically distributed across the processing cluster array 2212 by the scheduler 2210 logic within the microcontroller that includes the scheduler 2210.
[0422] In at least one embodiment, the processing cluster array 2212 may include up to "N" processing clusters (e.g., clusters 2214A, 2214B, through 2214N), where "N" represents a positive integer (which may be an integer different from the integer "N" used in other figures). In at least one embodiment, each cluster 2214A-2214N of the processing cluster array 2212 can execute a large number of concurrent threads. In at least one embodiment, the scheduler 2210 may use various scheduling and / or work distribution algorithms to allocate work to the clusters 2214A-2214N of the processing cluster array 2212, and the algorithms may vary depending on the workload arising for each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 2210, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing cluster array 2212. In at least one embodiment, the different clusters 2214A-2214N of the processing cluster array 2212 may be allocated to process different types of programs or to perform different types of computations.
[0423] In at least one embodiment, the processing cluster array 2212 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing cluster array 2212 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing cluster array 2212 may include logic for performing processing tasks, including filtering video and / or audio data, performing modeling operations, including physics operations, and performing data transformations.
[0424] In at least one embodiment, the processing cluster array 2212 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing cluster array 2212 may include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing cluster array 2212 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 2202 may transfer data from system memory via I / O unit 2204 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 2222) and then written back to system memory.
[0425] In at least one embodiment, when the parallel processing unit 2202 is used to perform graphics processing, the scheduler 2210 may be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among the multiple clusters 2214A-2214N of the processing cluster array 2212. In at least one embodiment, portions of the processing cluster array 2212 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 2214A-2214N may be stored in a buffer to allow intermediate data to be transferred between the clusters 2214A-2214N for further processing.
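The idea of dividing a workload into approximately equal tasks can be sketched with a simple contiguous split. This is one possible policy shown for illustration only; the scheduler described above is not stated to use it, and the function name `split_evenly` is hypothetical.

```python
def split_evenly(num_items, num_clusters):
    """Split num_items work items into num_clusters contiguous ranges.

    Range sizes differ by at most one item; the first (num_items % num_clusters)
    clusters receive the extra item.
    """
    if num_items < 0 or num_clusters <= 0:
        raise ValueError("num_items must be >= 0 and num_clusters must be > 0")
    base, extra = divmod(num_items, num_clusters)
    ranges = []
    start = 0
    for cluster in range(num_clusters):
        size = base + (1 if cluster < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Ten work items (for example, primitives) spread over four clusters.
assignment = split_evenly(10, 4)
```

Keeping the sizes within one item of each other bounds the idle time of the least-loaded cluster when all items cost roughly the same to process.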
[0426] In at least one embodiment, the processing cluster array 2212 may receive processing tasks to be executed via the scheduler 2210, which receives commands defining the processing tasks from the front end 2208. In at least one embodiment, a processing task may include indices of data to be processed, such as surface (patch) data, primitive data, vertex data, and / or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). In at least one embodiment, the scheduler 2210 may be configured to fetch the indices corresponding to a task, or may receive the indices from the front end 2208. In at least one embodiment, the front end 2208 may be configured to ensure that the processing cluster array 2212 is configured to a valid state before the workload specified by an incoming command buffer (e.g., a batch buffer, push buffer, etc.) is initiated.
[0427] In at least one embodiment, each of one or more instances of the parallel processing unit 2202 may be coupled to the parallel processor memory 2222. In at least one embodiment, the parallel processor memory 2222 may be accessed via the memory crossbar switch 2216, which may receive memory requests from the processing cluster array 2212 and the I / O unit 2204. In at least one embodiment, the memory crossbar switch 2216 may access the parallel processor memory 2222 via a memory interface 2218. In at least one embodiment, the memory interface 2218 may include a plurality of partition units (e.g., partition units 2220A, 2220B, through 2220N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 2222. In at least one embodiment, the number of partition units 2220A-2220N is configured to equal the number of memory units, such that the first partition unit 2220A has a corresponding first memory unit 2224A, the second partition unit 2220B has a corresponding second memory unit 2224B, and the Nth partition unit 2220N has a corresponding Nth memory unit 2224N. In at least one embodiment, the number of partition units 2220A-2220N may not be equal to the number of memory units.
[0428] In at least one embodiment, the memory units 2224A-2224N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, the memory units 2224A-2224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). In at least one embodiment, render targets such as frame buffers or texture maps may be stored across the memory units 2224A-2224N, allowing the partition units 2220A-2220N to write portions of each render target in parallel to efficiently use the available bandwidth of the parallel processor memory 2222. In at least one embodiment, a local instance of the parallel processor memory 2222 may be excluded in favor of a unified memory design that uses system memory in conjunction with local cache memory.
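One way a render target can be spread across several memory units so that partition units write to it in parallel is address interleaving, sketched below. The text does not state which mapping is used; the round-robin scheme, the 256-byte stride, and both function names are assumptions made for illustration.

```python
def partition_for_address(address, num_partitions, stride=256):
    """Return the index of the partition that owns a byte address.

    Consecutive stride-sized chunks are assigned to partitions round-robin.
    """
    if address < 0 or num_partitions <= 0 or stride <= 0:
        raise ValueError("address must be >= 0; num_partitions and stride > 0")
    return (address // stride) % num_partitions


def bytes_per_partition(size, num_partitions, stride=256):
    """Count how many bytes of a buffer of the given size land in each partition."""
    counts = [0] * num_partitions
    for start in range(0, size, stride):
        chunk = min(stride, size - start)
        counts[partition_for_address(start, num_partitions, stride)] += chunk
    return counts


# A 4 KiB render target interleaved over four partitions.
load = bytes_per_partition(4096, 4)
```

With this mapping, a large sequential write touches every partition in turn, so no single memory unit becomes the bottleneck.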
[0429] In at least one embodiment, any of the clusters 2214A-2214N of the processing cluster array 2212 can process data to be written to any of the memory units 2224A-2224N within the parallel processor memory 2222. In at least one embodiment, the memory crossbar switch 2216 can be configured to transfer the output of each cluster 2214A-2214N to any partition unit 2220A-2220N or to another cluster 2214A-2214N, which can perform further processing operations on the output. In at least one embodiment, each cluster 2214A-2214N can communicate with the memory interface 2218 via the memory crossbar switch 2216 to read from or write to various external storage devices. In at least one embodiment, the memory crossbar switch 2216 has a connection to the memory interface 2218 for communication with the I / O unit 2204, and a connection to a local instance of the parallel processor memory 2222, thereby enabling processing units within different processing clusters 2214A-2214N to communicate with system memory or other memory that is not local to the parallel processing unit 2202. In at least one embodiment, the memory crossbar switch 2216 may use virtual channels to separate traffic flows between the clusters 2214A-2214N and the partition units 2220A-2220N.
[0430] In at least one embodiment, multiple instances of the parallel processing unit 2202 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 2202 may be configured to interoperate, even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and / or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 2202 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system incorporating one or more instances of the parallel processing unit 2202 or the parallel processor 2200 may be implemented in various configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and / or embedded systems.
[0431] Figure 22B is a block diagram of a partition unit 2220 according to at least one embodiment. In at least one embodiment, the partition unit 2220 is an instance of one of the partition units 2220A-2220N of Figure 22A. In at least one embodiment, the partition unit 2220 includes an L2 cache 2221, a frame buffer interface 2225, and a ROP 2226 (raster operation unit). In at least one embodiment, the L2 cache 2221 is a read / write cache configured to perform load and store operations received from the memory crossbar switch 2216 and the ROP 2226. In at least one embodiment, the L2 cache 2221 outputs read misses and urgent write-back requests to the frame buffer interface 2225 for processing. In at least one embodiment, updates can also be sent to the frame buffer for processing via the frame buffer interface 2225. In at least one embodiment, the frame buffer interface 2225 interfaces with one of the memory units in the parallel processor memory, such as the memory units 2224A-2224N of Figure 22A (e.g., within the parallel processor memory 2222).
[0432] In at least one embodiment, ROP 2226 is a processing unit that performs raster operations such as stenciling, z-testing, blending, etc. In at least one embodiment, ROP 2226 then outputs processed graphics data, which is stored in graphics memory. In at least one embodiment, ROP 2226 includes compression logic to compress depth or color data written to memory and decompress depth or color data read from memory. In at least one embodiment, the compression logic may be lossless compression logic utilizing one or more of a variety of compression algorithms. In at least one embodiment, the type of compression performed by ROP 2226 may vary based on the statistical characteristics of the data to be compressed. For example, in at least one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
[0433] In at least one embodiment, ROP 2226 is included within each processing cluster (e.g., clusters 2214A-2214N of Figure 22A) instead of within the partition unit 2220. In at least one embodiment, read and write requests for pixel data, rather than pixel fragment data, are transmitted over the memory crossbar switch 2216. In at least one embodiment, the processed graphics data can be displayed on a display device (such as one of the one or more display devices 2110 of Figure 21), routed for further processing by the processor 2102, or routed for further processing by one of the processing entities within the parallel processor 2200 of Figure 22A.
[0434] Figure 22C is a block diagram of a processing cluster 2214 within a parallel processing unit according to at least one embodiment. In at least one embodiment, the processing cluster is an instance of one of the processing clusters 2214A-2214N of Figure 22A. In at least one embodiment, processing cluster 2214 can be configured to execute a number of threads in parallel, where a "thread" refers to an instance of a specific program executing on a particular set of input data. In at least one embodiment, Single Instruction Multiple Data (SIMD) instruction issue techniques are used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, Single Instruction Multiple Thread (SIMT) techniques are used to support the parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster.
[0435] In at least one embodiment, the operation of the processing cluster 2214 can be controlled by a pipeline manager 2232 that assigns processing tasks to the SIMT parallel processors. In at least one embodiment, the pipeline manager 2232 receives instructions from the scheduler 2210 of Figure 22A and manages the execution of these instructions via the graphics multiprocessor 2234 and / or texture unit 2236. In at least one embodiment, the graphics multiprocessor 2234 is an exemplary instance of a SIMT parallel processor. However, in at least one embodiment, the processing cluster 2214 may include various types of SIMT parallel processors with different architectures. In at least one embodiment, the processing cluster 2214 may include one or more instances of the graphics multiprocessor 2234. In at least one embodiment, the graphics multiprocessor 2234 can process data, and the data crossbar switch 2240 can be used to distribute the processed data to one of a number of possible destinations (including other shader units). In at least one embodiment, the pipeline manager 2232 can facilitate the distribution of processed data by specifying the destination of the processed data to be distributed via the data crossbar switch 2240.
[0436] In at least one embodiment, each graphics multiprocessor 2234 within the processing cluster 2214 may include the same set of functional execution logic (e.g., arithmetic logic units, load / store units, etc.). In at least one embodiment, the functional execution logic may be configured in a pipelined manner, wherein new instructions may be issued before previous instructions complete. In at least one embodiment, the functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, shift operations, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be used to perform different operations, and any combination of functional units may exist.
[0437] In at least one embodiment, instructions sent to the processing cluster 2214 constitute threads. In at least one embodiment, a group of threads executed across a set of parallel processing engines is a thread group. In at least one embodiment, the thread group executes a common program on different input data. In at least one embodiment, each thread within the thread group may be assigned to a different processing engine within the graphics multiprocessor 2234. In at least one embodiment, the thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 2234. In at least one embodiment, when the number of threads included in the thread group is less than the number of processing engines, one or more processing engines may be idle during the cycles in which the thread group is being processed. In at least one embodiment, the thread group may also include more threads than the number of processing engines within the graphics multiprocessor 2234. In at least one embodiment, when the thread group includes more threads than the number of processing engines within the graphics multiprocessor 2234, processing can be performed over consecutive clock cycles. In at least one embodiment, multiple thread groups can be executed simultaneously on the graphics multiprocessor 2234.
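The relationship between thread group size and the number of processing engines can be illustrated with a minimal Python sketch. The function name `schedule_thread_group` and the simplification that each engine handles one thread per pass are assumptions made for illustration only; the description does not specify how passes map to clock cycles.

```python
import math


def schedule_thread_group(num_threads: int, num_engines: int) -> dict:
    """Estimate passes and idle engine slots for one thread group.

    Hypothetical model: each processing engine handles one thread per pass.
    """
    if num_threads < 1 or num_engines < 1:
        raise ValueError("thread and engine counts must be positive")
    # More threads than engines: work spills into consecutive passes.
    passes = math.ceil(num_threads / num_engines)
    # Fewer threads than engines in a pass: the leftover engines sit idle.
    idle_slots = passes * num_engines - num_threads
    return {"passes": passes, "idle_slots": idle_slots}


if __name__ == "__main__":
    print(schedule_thread_group(24, 32))  # smaller group: some engines idle
    print(schedule_thread_group(40, 32))  # larger group: a second pass is needed
```

In this model, a group that is smaller than the engine count finishes in one pass with idle engines, while a larger group needs additional consecutive passes.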
[0438] In at least one embodiment, the graphics multiprocessor 2234 includes an internal cache memory for performing load and store operations. In at least one embodiment, the graphics multiprocessor 2234 may forgo the internal cache and use a cache memory within the processing cluster 2214 (e.g., L1 cache 2248). In at least one embodiment, each graphics multiprocessor 2234 may also access L2 caches within partition units (e.g., partition units 2220A-2220N of Figure 22A), which are shared among all processing clusters 2214 and can be used to transfer data between threads. In at least one embodiment, the graphics multiprocessor 2234 can also access off-chip global memory, which may include one or more of local parallel processor memory and / or system memory. In at least one embodiment, any memory outside of the parallel processing unit 2202 can be used as global memory. In at least one embodiment, the processing cluster 2214 includes multiple instances of the graphics multiprocessor 2234, which can share common instructions and data that can be stored in the L1 cache 2248.
[0439] In at least one embodiment, each processing cluster 2214 may include a memory management unit (“MMU”) 2245 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of the MMU 2245 may reside within the memory interface 2218 of Figure 22A. In at least one embodiment, the MMU 2245 includes a set of page table entries (PTEs) for mapping virtual addresses to physical addresses of tiles and optionally to cache line indices. In at least one embodiment, the MMU 2245 may include an address translation lookaside buffer (TLB) or a cache that may reside within the graphics multiprocessor 2234, the L1 cache 2248, or the processing cluster 2214. In at least one embodiment, physical addresses are processed to distribute surface data access locality so as to allow efficient request interleaving among partition units. In at least one embodiment, cache line indices may be used to determine whether a request for a cache line is a hit or a miss.
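The mapping of virtual addresses to physical addresses through page table entries, with a translation buffer caching recent translations, can be sketched as follows. This is an illustrative model only: the class name `SimpleMMU`, the 4096-byte page size, the least-recently-used replacement policy, and the buffer capacity are all assumptions and are not taken from the description of the MMU 2245.

```python
from collections import OrderedDict
from typing import Dict

PAGE_SIZE = 4096  # assumed page size; the description does not specify one


class SimpleMMU:
    """Hypothetical MMU model: a page table plus a small LRU translation buffer."""

    def __init__(self, page_table: Dict[int, int], tlb_entries: int = 4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb = OrderedDict()       # cache of recently used translations
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)          # mark as most recently used
            frame = self.tlb[vpn]
        else:
            self.misses += 1
            frame = self.page_table[vpn]       # KeyError stands in for a page fault
            self.tlb[vpn] = frame
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)   # evict the least recently used entry
        return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    mmu = SimpleMMU({0: 5, 1: 9})
    print(hex(mmu.translate(0x10)), mmu.hits, mmu.misses)
    print(hex(mmu.translate(0x20)), mmu.hits, mmu.misses)
```

The second lookup in the usage example touches the same virtual page as the first, so it is served from the translation buffer rather than the page table.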
[0440] In at least one embodiment, the processing cluster 2214 can be configured such that each graphics multiprocessor 2234 is coupled to a texture unit 2236 to perform texture mapping operations, such as determining texture sample locations, reading texture data, and filtering texture data. In at least one embodiment, texture data is read from an internal texture L1 cache (not shown) or from an L1 cache within the graphics multiprocessor 2234, and is fetched as needed from an L2 cache, local parallel processor memory, or system memory. In at least one embodiment, each graphics multiprocessor 2234 outputs a processed task to the data crossbar switch 2240 to provide the processed task to another processing cluster 2214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar switch 2216. In at least one embodiment, a preROP 2242 (pre-raster operation unit) is configured to receive data from the graphics multiprocessor 2234 and direct the data to a ROP unit, which may be located together with a partition unit (e.g., partition units 2220A-2220N of Figure 22A). In at least one embodiment, the preROP 2242 unit can perform optimizations for color blending, organize pixel color data, and perform address translation.
[0441] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 may be used in graphics processing cluster 2214 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0442] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or techniques are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0443] Figure 22D illustrates a graphics multiprocessor 2234 according to at least one embodiment. In at least one embodiment, the graphics multiprocessor 2234 is coupled to a pipeline manager 2232 of a processing cluster 2214. In at least one embodiment, the graphics multiprocessor 2234 has an execution pipeline including, but not limited to, an instruction cache 2252, an instruction unit 2254, an address mapping unit 2256, a register file 2258, one or more general-purpose graphics processing unit (GPGPU) cores 2262, and one or more load / store units 2266. In at least one embodiment, the GPGPU cores 2262 and the load / store units 2266 are coupled to a cache memory 2272 and a shared memory 2270 via a memory and cache interconnect 2268.
[0444] In at least one embodiment, instruction cache 2252 receives a stream of instructions to be executed from pipeline manager 2232. In at least one embodiment, instructions are cached in instruction cache 2252 and dispatched by instruction unit 2254 for execution. In at least one embodiment, instruction unit 2254 may dispatch instructions as thread groups (e.g., warps), assigning each thread of the thread group to a different execution unit within GPGPU core 2262. In at least one embodiment, instructions can access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 2256 may be used to translate addresses in the unified address space into distinct memory addresses that can be accessed by load / store unit 2266.
[0445] In at least one embodiment, register file 2258 provides a set of registers for functional units of graphics multiprocessor 2234. In at least one embodiment, register file 2258 provides temporary storage for operands connected to the data paths of functional units of graphics multiprocessor 2234 (e.g., GPGPU core 2262, load / store unit 2266). In at least one embodiment, register file 2258 is divided among the functional units, such that a dedicated portion of register file 2258 is allocated to each functional unit. In at least one embodiment, register file 2258 is divided among the different warps being executed by graphics multiprocessor 2234.
[0446] In at least one embodiment, each of the GPGPU cores 2262 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 2234. In at least one embodiment, the GPGPU cores 2262 may be architecturally similar or may differ in architecture. In at least one embodiment, a first portion of the GPGPU cores 2262 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. In at least one embodiment, the FPU may implement the IEEE 754-2008 standard for floating-point arithmetic or enable variable-precision floating-point arithmetic. In at least one embodiment, the graphics multiprocessor 2234 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores 2262 may also include fixed-function or special-function logic.
[0447] In at least one embodiment, the GPGPU core 2262 includes SIMD logic capable of executing a single instruction on multiple sets of data. In at least one embodiment, the GPGPU core 2262 can physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU core can be generated by a shader compiler at compile time or automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. In at least one embodiment, multiple threads of a program configured for a SIMT execution model can be executed using a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads performing the same or similar operations can be executed in parallel using a single SIMD8 logic unit.
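One way to picture the difference between physical and logical SIMD widths is the following Python sketch, in which a logical SIMD32 operation is carried out as several passes over an eight-lane unit. The description does not state how logical execution is implemented, so splitting into passes is an assumption, as are the function names `simd_execute` and `logical_simd_execute` and the constant `PHYSICAL_WIDTH`.

```python
import operator
from typing import Callable, List, Sequence, Tuple

PHYSICAL_WIDTH = 8  # assumed physical lane count, modeling a SIMD8 logic unit


def simd_execute(op: Callable, a: Sequence, b: Sequence,
                 width: int = PHYSICAL_WIDTH) -> List:
    """Apply one instruction to every lane of a single physical pass."""
    if len(a) != len(b) or len(a) > width:
        raise ValueError("operands must match and fit the physical width")
    return [op(x, y) for x, y in zip(a, b)]


def logical_simd_execute(op: Callable, a: Sequence, b: Sequence,
                         width: int = PHYSICAL_WIDTH) -> Tuple[List, int]:
    """Execute a wider logical SIMD instruction as several physical passes."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result: List = []
    passes = 0
    for start in range(0, len(a), width):
        chunk = slice(start, start + width)
        result.extend(simd_execute(op, a[chunk], b[chunk], width))
        passes += 1
    return result, passes


if __name__ == "__main__":
    # Eight threads running the same operation fit one SIMD8 pass.
    print(logical_simd_execute(operator.add, list(range(8)), [1] * 8))
    # A logical SIMD32 operation takes four passes in this model.
    _, passes = logical_simd_execute(operator.mul, list(range(32)), [2] * 32)
    print(passes)
```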
[0448] In at least one embodiment, the memory and cache interconnect 2268 is an interconnect network connecting each functional unit of the graphics multiprocessor 2234 to the register file 2258 and the shared memory 2270. In at least one embodiment, the memory and cache interconnect 2268 is a crossbar interconnect that allows the load / store unit 2266 to perform load and store operations between the shared memory 2270 and the register file 2258. In at least one embodiment, the register file 2258 can operate at the same frequency as the GPGPU core 2262, resulting in very low latency for data transfer between the GPGPU core 2262 and the register file 2258. In at least one embodiment, the shared memory 2270 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 2234. In at least one embodiment, the cache memory 2272 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 2236. In at least one embodiment, the shared memory 2270 can also be used as a program-managed cache. In at least one embodiment, in addition to the automatically cached data stored in cache memory 2272, threads executing on GPGPU core 2262 can also programmatically store data in shared memory.
[0449] In at least one embodiment, a parallel processor or GPGPU, as described herein, is communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated with the core on a package or chip and communicatively coupled to the core via an internal processor bus / interconnect (i.e., inside the package or chip). In at least one embodiment, regardless of how the GPU is connected, the processor core may assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. In at least one embodiment, the GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.
[0450] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 may be used in the graphics multiprocessor 2234 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0451] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or techniques are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0452] Figure 23 illustrates a multi-GPU computing system 2300 according to at least one embodiment. In at least one embodiment, the multi-GPU computing system 2300 may include a processor 2302 coupled to a plurality of general-purpose graphics processing units (GPGPUs) 2306A-D via a host interface switch 2304. In at least one embodiment, the host interface switch 2304 is a PCI Express switch device that couples the processor 2302 to a PCI Express bus through which the processor 2302 can communicate with the GPGPUs 2306A-D. In at least one embodiment, the GPGPUs 2306A-D may be interconnected via a set of high-speed point-to-point (P2P) GPU-to-GPU links 2316. In at least one embodiment, the GPU-to-GPU links 2316 are connected to each of the GPGPUs 2306A-D via dedicated GPU links. In at least one embodiment, the P2P GPU links 2316 enable direct communication between each of the GPGPUs 2306A-D without requiring communication on the host interface bus 2304 to which the processor 2302 is connected. In at least one embodiment, with GPU-to-GPU traffic directed to the P2P GPU links 2316, the host interface bus 2304 remains available for system memory access or for communication with other instances of the multi-GPU computing system 2300, for example, via one or more network devices. While in at least one embodiment the GPGPUs 2306A-D are connected to the processor 2302 via the host interface switch 2304, in at least one embodiment the processor 2302 includes direct support for the P2P GPU links 2316 and can be directly connected to the GPGPUs 2306A-D.
[0453] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, the inference and / or training logic 915 may be used in the multi-GPU computing system 2300 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0454] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or techniques are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0455] Figure 24 is a block diagram of a graphics processor 2400 according to at least one embodiment. In at least one embodiment, the graphics processor 2400 includes a ring interconnect 2402, a pipeline front end 2404, a media engine 2437, and graphics cores 2480A-2480N. In at least one embodiment, the ring interconnect 2402 couples the graphics processor 2400 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, the graphics processor 2400 is one of many processors integrated within a multi-core processing system.
[0456] In at least one embodiment, the graphics processor 2400 receives multiple batches of commands via a ring interconnect 2402. In at least one embodiment, the incoming commands are interpreted by a command stream converter 2403 in a pipeline front-end 2404. In at least one embodiment, the graphics processor 2400 includes scalable execution logic for performing 3D geometry processing and media processing via one or more graphics cores 2480A-2480N. In at least one embodiment, for 3D geometry processing commands, the command stream converter 2403 provides commands to the geometry pipeline 2436. In at least one embodiment, for at least some media processing commands, the command stream converter 2403 provides commands to a video front-end 2434 coupled to a media engine 2437. In at least one embodiment, the media engine 2437 includes a video quality engine (VQE) 2430 for video and image post-processing and a multi-format encoding / decoding (MFX) engine 2433 for providing hardware-accelerated media data encoding and decoding. In at least one embodiment, the geometry pipeline 2436 and the media engine 2437 each generate execution threads for thread execution resources provided by at least one graphics core 2480.
[0457] In at least one embodiment, the graphics processor 2400 includes scalable thread execution resources featuring graphics cores 2480A-2480N (which may be modular and are sometimes referred to as core slices), each graphics core having multiple sub-cores 2450A-2450N, 2460A-2460N (sometimes referred to as core sub-slices). In at least one embodiment, the graphics processor 2400 may have any number of graphics cores 2480A-2480N. In at least one embodiment, the graphics processor 2400 includes a graphics core 2480A having at least a first sub-core 2450A and a second sub-core 2460A. In at least one embodiment, the graphics processor 2400 is a low-power processor having a single sub-core (e.g., 2450A). In at least one embodiment, the graphics processor 2400 includes multiple graphics cores 2480A-2480N, each graphics core including a set of first sub-cores 2450A-2450N and a set of second sub-cores 2460A-2460N. In at least one embodiment, each of the first sub-cores 2450A-2450N includes at least a first set of execution units 2452A-2452N and media / texture samplers 2454A-2454N. In at least one embodiment, each of the second sub-cores 2460A-2460N includes at least a second set of execution units 2462A-2462N and samplers 2464A-2464N. In at least one embodiment, each sub-core 2450A-2450N and 2460A-2460N shares a set of shared resources 2470A-2470N. In at least one embodiment, the shared resources include a shared cache memory and pixel operation logic.
[0458] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, inference and / or training logic 915 may be used in graphics processor 2400 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0459] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or techniques are adapted, with reference to the figures, to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by embodiments of the figures in accordance with the techniques and embodiments described above with respect to Figures 1-8.
[0460] Figure 25 is a block diagram illustrating the microarchitecture of a processor 2500 that may include logic circuitry for executing instructions according to at least one embodiment. In at least one embodiment, the processor 2500 can execute instructions, including x86 instructions, ARM instructions, and special-purpose instructions for application-specific integrated circuits (ASICs). In at least one embodiment, the processor 2500 may include registers for storing packed data, such as the 64-bit wide MMX™ registers in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. In at least one embodiment, MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany Single Instruction Multiple Data (“SIMD”) and Streaming SIMD Extensions (“SSE”) instructions. In at least one embodiment, 128-bit wide XMM registers associated with SSE2, SSE3, SSE4, AVX, or later (generally referred to as “SSEx”) technologies can hold such packed data operands. In at least one embodiment, processor 2500 can execute instructions to accelerate machine learning or deep learning algorithms, training, or inference.
[0461] In at least one embodiment, processor 2500 includes an in-order front end (“front end”) 2501 to fetch instructions to be executed and prepare instructions for later use in the processor pipeline. In at least one embodiment, front end 2501 may include several units. In at least one embodiment, instruction prefetcher 2526 fetches instructions from memory and provides the instructions to instruction decoder 2528, which in turn decodes or interprets the instructions. For example, in at least one embodiment, instruction decoder 2528 decodes a received instruction into one or more machine-executable operations called “micro-instructions” or “micro-operations” (also called “micro-ops” or “uops”). In at least one embodiment, instruction decoder 2528 parses the instruction into an opcode and corresponding data and control fields, which can be used by the microarchitecture to perform operations. In at least one embodiment, trace cache 2530 may assemble the decoded micro-instructions into program-ordered sequences or traces in micro-instruction queue 2534 for execution. In at least one embodiment, when the trace cache 2530 encounters a complex instruction, the microcode ROM 2532 provides the micro-instructions required to complete the operation.
[0462] In at least one embodiment, some instructions may be converted into a single micro-operation, while others require several micro-operations to complete the entire operation. In at least one embodiment, if more than four micro-instructions are required to complete an instruction, the instruction decoder 2528 may access the microcode ROM 2532 to execute the instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-instructions for processing at the instruction decoder 2528. In at least one embodiment, if multiple micro-instructions are required to complete the operation, the instruction may be stored in the microcode ROM 2532. In at least one embodiment, the trace cache 2530 references an entry point programmable logic array (“PLA”) to determine the correct micro-instruction pointer for reading a microcode sequence from the microcode ROM 2532 to complete one or more instructions. In at least one embodiment, after the microcode ROM 2532 finishes sequencing the micro-operations for an instruction, the front end 2501 of the machine may resume fetching micro-operations from the trace cache 2530.
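The choice between handling an instruction at the instruction decoder and fetching its sequence from the microcode ROM can be summarized in a short Python sketch. The threshold of four comes from the paragraph above; the function name `uop_source`, the returned labels, and the example instruction names and micro-operation counts are hypothetical.

```python
DECODER_UOP_LIMIT = 4  # from the text: more than four micro-operations use the microcode ROM


def uop_source(uop_count: int) -> str:
    """Return which front-end unit supplies the micro-operations for an instruction."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-operation")
    if uop_count > DECODER_UOP_LIMIT:
        return "microcode_rom"
    return "instruction_decoder"


if __name__ == "__main__":
    # Hypothetical instructions with made-up micro-operation counts.
    for name, count in [("simple_add", 1), ("load_op", 2), ("string_move", 9)]:
        print(name, "->", uop_source(count))
```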
[0463] In at least one embodiment, the out-of-order execution engine (“out-of-order engine”) 2503 can prepare instructions for execution. In at least one embodiment, the out-of-order execution logic has multiple buffers to smooth and reorder the instruction flow to optimize performance as instructions descend the pipeline and are scheduled for execution. In at least one embodiment, the out-of-order execution engine 2503 includes, but is not limited to, an allocator / register renamer 2540, a memory microinstruction queue 2542, an integer / floating-point microinstruction queue 2544, a memory scheduler 2546, a fast scheduler 2502, a slow / general-purpose floating-point scheduler (“slow / general-purpose FP scheduler”) 2504, and a simple floating-point scheduler (“simple FP scheduler”) 2506. In at least one embodiment, the fast scheduler 2502, the slow / general-purpose floating-point scheduler 2504, and the simple floating-point scheduler 2506 are also collectively referred to as “microinstruction schedulers 2502, 2504, 2506”. In at least one embodiment, the allocator / register renamer 2540 allocates the machine buffers and resources required for the sequential execution of each microinstruction. In at least one embodiment, the allocator / register renamer 2540 renames logical registers to entries in a register file. In at least one embodiment, the allocator / register renamer 2540 also allocates entries for each microinstruction in one of two microinstruction queues, a memory microinstruction queue 2542 for memory operations and an integer / floating-point microinstruction queue 2544 for non-memory operations, preceding the memory scheduler 2546 and microinstruction schedulers 2502, 2504, and 2506. 
In at least one embodiment, the microinstruction schedulers 2502, 2504, and 2506 determine when a microinstruction is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the microinstruction needs to complete its operation. In at least one embodiment, the fast scheduler 2502 can schedule on each half of the main clock cycle, while the slow / general-purpose floating-point scheduler 2504 and the simple floating-point scheduler 2506 can schedule once per main processor clock cycle. In at least one embodiment, microinstruction schedulers 2502, 2504, and 2506 arbitrate for dispatch ports to schedule microinstructions for execution.
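The readiness rule described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the scheduler logic of processor 2500: the `MicroOp` record, the register names, and the execution-resource labels such as `"fast_alu"` are hypothetical, and each resource is assumed to accept one microinstruction per scheduling slot.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple  # names of the source registers (hypothetical)
    unit: str       # hypothetical execution-resource label, e.g. "fast_alu"


def is_ready(uop, ready_regs, free_units):
    """A microinstruction is ready once every source operand is available
    and the execution resource it needs is free."""
    return all(src in ready_regs for src in uop.sources) and uop.unit in free_units


def select_ready(queue, ready_regs, free_units):
    """Walk the queue in order and pick ready microinstructions.
    Each picked microinstruction claims its execution resource for this slot."""
    free = set(free_units)
    picked = []
    for uop in queue:
        if is_ready(uop, ready_regs, free):
            picked.append(uop)
            free.discard(uop.unit)
    return picked


queue = [
    MicroOp("add", ("r1", "r2"), "fast_alu"),
    MicroOp("mul", ("r3",), "slow_alu"),   # r3 is not ready yet
    MicroOp("sub", ("r1",), "fast_alu"),   # operands ready, resource taken by "add"
]
picked = select_ready(queue, ready_regs={"r1", "r2"}, free_units={"fast_alu", "slow_alu"})
print([u.name for u in picked])
```

In this model "mul" waits on an operand and "sub" waits on a resource, which are the two conditions the paragraph names.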
[0464] In at least one embodiment, execution block 2511 includes, but is not limited to, integer register file / bypass network 2508, floating-point register file / bypass network (“FP register file / bypass network”) 2510, address generation units (“AGU”) 2512 and 2514, fast arithmetic logic units (“fast ALU”) 2516 and 2518, slow arithmetic logic unit (“slow ALU”) 2520, floating-point ALU (“FP”) 2522, and floating-point move unit (“FP move”) 2524. In at least one embodiment, integer register file / bypass network 2508 and floating-point register file / bypass network 2510 are also referred to herein as “register files 2508, 2510”. In at least one embodiment, AGUs 2512 and 2514, fast ALUs 2516 and 2518, slow ALU 2520, floating-point ALU 2522, and floating-point move unit 2524 are also referred to herein as “execution units 2512, 2514, 2516, 2518, 2520, 2522, and 2524”. In at least one embodiment, execution block 2511 may include, but is not limited to, any number (including zero) and type of register files, bypass networks, address generation units, and execution units (in any combination).
[0465] In at least one embodiment, register files 2508, 2510 may be arranged between microinstruction schedulers 2502, 2504, 2506 and execution units 2512, 2514, 2516, 2518, 2520, 2522, and 2524. In at least one embodiment, integer register file / bypass network 2508 performs integer operations. In at least one embodiment, floating-point register file / bypass network 2510 performs floating-point operations. In at least one embodiment, each of register files 2508, 2510 may include, but is not limited to, a bypass network that can bypass or forward recently completed results that have not yet been written to the register file to new dependent microinstructions. In at least one embodiment, register files 2508, 2510 may communicate data with each other. In at least one embodiment, integer register file / bypass network 2508 may include, but is not limited to, two separate register files, one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. In at least one embodiment, the floating-point register file / bypass network 2510 may include, but is not limited to, entries with a width of 128 bits, since floating-point instructions typically have operands with a width of 64 to 128 bits.
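As a rough illustration of result forwarding and of the split low-order / high-order register files, consider the following Python sketch. It is a simplified software model and not a description of the hardware: the class `RegisterFileWithBypass`, its methods, and the helpers `split64` and `join64` are hypothetical names introduced here.

```python
MASK32 = 0xFFFFFFFF


class RegisterFileWithBypass:
    """Toy model: completed results sit on a bypass path until written back."""

    def __init__(self):
        self.regs = {}    # values already written to the register file
        self.bypass = {}  # recently completed results not yet written back

    def complete(self, reg, value):
        """An execution unit finished; the result becomes visible for forwarding."""
        self.bypass[reg] = value

    def write_back(self):
        """Commit all pending results to the register file."""
        self.regs.update(self.bypass)
        self.bypass.clear()

    def read(self, reg):
        """A dependent operation reads the forwarded value when one exists."""
        if reg in self.bypass:
            return self.bypass[reg]
        return self.regs[reg]


def split64(value):
    """Split a 64-bit value into (low-order 32 bits, high-order 32 bits)."""
    return value & MASK32, (value >> 32) & MASK32


def join64(low, high):
    """Recombine the two 32-bit halves into one 64-bit value."""
    return ((high & MASK32) << 32) | (low & MASK32)


rf = RegisterFileWithBypass()
rf.regs["r1"] = 1
rf.complete("r1", 5)          # new result, not yet written back
forwarded = rf.read("r1")     # dependent read sees the forwarded result
stale = rf.regs["r1"]         # register file still holds the old value
rf.write_back()
low, high = split64(0x1122334455667788)
```

The dependent read returns the forwarded value before write-back occurs, which is the behavior the paragraph attributes to the bypass network.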
[0466] In at least one embodiment, execution units 2512, 2514, 2516, 2518, 2520, 2522, and 2524 can execute instructions. In at least one embodiment, register files 2508 and 2510 store the integer and floating-point data operand values that the microinstructions need to execute. In at least one embodiment, processor 2500 may include, but is not limited to, any number and combination of execution units 2512, 2514, 2516, 2518, 2520, 2522, and 2524. In at least one embodiment, floating-point ALU 2522 and floating-point move unit 2524 can perform floating-point, MMX, SIMD, AVX, and SSE or other operations, including specialized machine learning instructions. In at least one embodiment, floating-point ALU 2522 may include, but is not limited to, a 64-bit by 64-bit floating-point divider to perform division, square root, and remainder micro-operations. In at least one embodiment, instructions involving floating-point values can be processed by floating-point hardware. In at least one embodiment, ALU operations can be passed to fast ALUs 2516 and 2518. In at least one embodiment, fast ALUs 2516 and 2518 can perform fast operations with an effective latency of half a clock cycle. In at least one embodiment, most complex integer operations are routed to slow ALU 2520, because slow ALU 2520 can include, but is not limited to, integer execution hardware for long-latency operations, such as multipliers, shifters, flag logic, and branch processing. In at least one embodiment, memory load / store operations can be performed by AGUs 2512 and 2514. In at least one embodiment, fast ALU 2516, fast ALU 2518, and slow ALU 2520 can perform integer operations on 64-bit data operands. In at least one embodiment, fast ALU 2516, fast ALU 2518, and slow ALU 2520 can be implemented to support various data bit sizes, including 16, 32, 128, 256, etc. In at least one embodiment, the floating-point ALU 2522 and the floating-point move unit 2524 can be implemented to support a range of operands with various bit widths; for example, they can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.
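To make the notion of a 128-bit packed data operand concrete, the following Python sketch treats a 16-byte value as four unsigned 32-bit lanes and adds two such operands lane by lane. The lane width, the little-endian layout, and the function name `packed_add_u32` are assumptions chosen for illustration; the text above does not specify them.

```python
import struct


def packed_add_u32(a: bytes, b: bytes) -> bytes:
    """Lane-wise add of two 128-bit operands viewed as four unsigned
    32-bit lanes (assumed layout). Each lane wraps around on overflow."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("operands must be 128 bits (16 bytes)")
    lanes_a = struct.unpack("<4I", a)
    lanes_b = struct.unpack("<4I", b)
    summed = [(x + y) & 0xFFFFFFFF for x, y in zip(lanes_a, lanes_b)]
    return struct.pack("<4I", *summed)


a = struct.pack("<4I", 1, 2, 3, 0xFFFFFFFF)
b = struct.pack("<4I", 10, 20, 30, 1)
result = struct.unpack("<4I", packed_add_u32(a, b))
```

A single packed operation here stands in for four independent scalar additions, which is the general idea behind operating on packed operands with SIMD instructions.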
[0467] In at least one embodiment, microinstruction schedulers 2502, 2504, and 2506 schedule dependent operations before the parent load completes execution. In at least one embodiment, since microinstructions can be speculatively scheduled and executed within processor 2500, processor 2500 may also include logic for handling memory misses. In at least one embodiment, if a data load misses in the data cache, there may be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations may need to be replayed, while independent operations may be allowed to complete. In at least one embodiment, the schedulers and replay mechanism of at least one embodiment of the processor may also be designed to capture instruction sequences for text string comparison operations.
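The distinction between operations that must be replayed and operations that may complete can be sketched as a simple dependency walk. The Python below is a hedged illustration only: the tuple layout `(name, dest, sources)`, the register names, and the function `ops_to_replay` are hypothetical, and real replay logic tracks considerably more state.

```python
def ops_to_replay(uops, missed_dest):
    """Given operations issued after a load that missed the cache, in program
    order, return the names of those that consumed incorrect data.

    uops: list of (name, dest_register, source_registers) tuples (assumed form).
    missed_dest: destination register of the load that missed.
    """
    tainted = {missed_dest}  # registers currently holding incorrect data
    replay = []
    for name, dest, sources in uops:
        if tainted.intersection(sources):
            replay.append(name)    # dependent operation: must be replayed
            tainted.add(dest)      # its result is incorrect as well
        else:
            tainted.discard(dest)  # independent operation overwrote dest validly
    return replay


uops = [
    ("add", "r2", ("r1", "r3")),  # reads the missed load's result
    ("mul", "r5", ("r4", "r4")),  # independent, may complete
    ("sub", "r6", ("r2",)),       # depends on "add", so also replayed
    ("mov", "r2", ("r4",)),       # independent, overwrites r2 with valid data
    ("or", "r7", ("r2",)),        # reads the valid r2, may complete
]
replayed = ops_to_replay(uops, missed_dest="r1")
```

Only "add" and "sub" are selected for replay in this example; the remaining operations never touched the incorrect value.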
[0468] In at least one embodiment, "register" can refer to an on-board processor storage location that can be used as part of an instruction to identify an operand. In at least one embodiment, registers may be those that are usable from outside the processor (from a programmer's perspective). In at least one embodiment, a register may not be limited to a particular type of circuit. Rather, in at least one embodiment, a register can store data, provide data, and perform the functions described herein. In at least one embodiment, the registers described herein can be implemented by circuitry within the processor using a variety of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, a combination of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, an integer register stores 32-bit integer data. The register file of at least one embodiment also includes eight multimedia SIMD registers for packed data.
[0469] Inference and / or training logic 915 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 915 are provided herein in conjunction with Figure 9A and / or Figure 9B. In at least one embodiment, some or all of the inference and / or training logic 915 may be incorporated into execution block 2511 and other memories or registers, whether shown or not shown. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs shown in execution block 2511. Furthermore, weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown) that configure the ALUs of execution block 2511 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0470] In at least one embodiment, one or more circuits, processors, computing systems, or other devices or techniques described with reference to the figure are adapted to train one or more neural networks to identify one or more forces to be applied to one or more objects based, at least in part, on training data corresponding to two or more aspects of the motion of the one or more objects. In at least one embodiment, this is performed by embodiments of the figure in accordance with the techniques and embodiments described above in relation to Figures 1-8.
[0471] Figure 26 illustrates a deep learning application processor 2600 according to at least one embodiment. In at least one embodiment, the deep learning application processor 2600 uses instructions, which, if executed by the deep learning application processor 2600, cause the deep learning application processor 2600 to perform some or all of the processes and techniques described herein. In at least one embodiment, the deep learning application processor 2600 is an application-specific integrated circuit (ASIC). In at least one embodiment, the application processor 2600 performs matrix multiplication operations that are "hardwired" into hardware, performed as a result of executing one or more instructions, or both. In at least one embodiment, the deep learning application processor 2600 includes...
Claims
1. A processor, comprising: one or more circuits to proportionally balance training data based, at least in part, on an accuracy metric of one or more neural networks to be trained using two or more sets of proportionally imbalanced neural network training data.
2. The processor of claim 1, wherein the one or more neural networks are trained to apply one or more forces to one or more joints of one or more objects, such that the one or more objects move according to a physics-based simulation.
3. The processor of claim 1, wherein the training data is used to train the one or more neural networks according to a hierarchy of the training data, the training data being organized according to two or more aspects of motion.
4. The processor of claim 3, wherein the one or more neural networks are trained by the following steps: randomly selecting aspects of motion from a first level of the hierarchy, and subsequently randomly selecting aspects of motion from a second level of the hierarchy below the first level.
5. The processor of claim 1, wherein a segment of training the one or more neural networks is terminated when any one of a plurality of reward items drops below a threshold level.
6. The processor of claim 1, wherein a training segment is initialized to a state based on motion frames, the motion frames being shifted by one or more frames from the starting frame of video data including an example of an aspect of motion.
7. The processor of claim 1, wherein the variance of joints associated with models of one or more objects is decayed during training of the one or more neural networks according to a predetermined variance decay.
8. The processor of claim 1, wherein the one or more neural networks include motion actuators trained to generate amounts of one or more forces to be applied to one or more joints of the one or more objects based, at least in part, on target states of the one or more objects provided as input to the motion actuators.
9. The processor of claim 1, wherein the training data comprises hierarchical motion levels.
10. The processor of claim 1, wherein the one or more neural networks are trained, at least in part, based on selected neural network training data to identify one or more forces to be applied to one or more objects, wherein the neural network training data is selected at least in part based on the type of motion depicted in the neural network training data.
11. A system comprising: one or more processors configured to proportionally balance training data based, at least in part, on an accuracy metric of one or more neural networks to be trained using two or more sets of proportionally imbalanced neural network training data.
12. The system of claim 11, wherein the one or more neural networks are trained to apply one or more forces to one or more joints of one or more objects.
13. The system of claim 11, wherein the one or more processors are configured to train the one or more neural networks according to a hierarchy using aspects of motion obtained from the training data, the training data being organized according to two or more aspects of motion.
14. The system of claim 13, wherein the one or more processors are configured to train the one or more neural networks by means of the following steps: randomly selecting aspects of motion from a first level of the hierarchy, and subsequently randomly selecting aspects of motion from a second level of the hierarchy below the first level.
15. The system of claim 11, wherein the one or more processors are configured to terminate a segment of training the one or more neural networks based, at least in part, on any one of a plurality of reward items falling below a threshold level during the segment.
16. The system of claim 11, wherein the one or more processors are configured to initialize a training segment to a state based on motion frames, the motion frames being shifted by one or more frames from the starting frame of an example of an aspect of motion.
17. The system of claim 11, wherein the variance of the joints of one or more objects is decayed during the training of the one or more neural networks according to a predetermined decay of control variance.
18. The system of claim 17, wherein a range of control variance of a first joint of at least part of the one or more objects is initially set to be smaller than the variance of a second joint, and wherein the one or more neural networks are trained to apply one or more forces to the one or more objects.
19. A method comprising: proportionally balancing training data based, at least in part, on an accuracy metric of one or more neural networks to be trained using two or more sets of proportionally imbalanced neural network training data.
20. The method of claim 19, wherein the one or more neural networks are trained to infer a force including a quantity of torque to be applied to one or more joints of one or more objects.
21. The method of claim 19, further comprising: organizing the training data into a hierarchy based at least in part on specifications of aspects of motion indicated in the training data; and training the one or more neural networks using one or more randomly selected examples of an aspect of motion corresponding to a first level of the hierarchy, and subsequently training the one or more neural networks using one or more randomly selected examples of an aspect of motion corresponding to a second level of the hierarchy, where the second level is lower than the first level.
22. The method of claim 19, further comprising: terminating a segment of training the one or more neural networks based, at least in part, on any one of a plurality of reward items falling below a threshold level during the segment.
23. The method of claim 19, further comprising: initializing a training segment to a state based on motion frames, the motion frames being shifted by one or more frames from the starting frame of an example of an aspect of motion.
24. The method of claim 19, wherein the control variance of the joints of one or more objects decays during training, at least in part, and the one or more neural networks are trained to infer the forces to be applied to the one or more objects.
25. The method of claim 19, wherein a range of control variance of a first joint of at least part of one or more objects is initially set to be less than the variance of a second joint, and the one or more neural networks are trained to infer the force to be applied to the one or more objects.
26. A machine-readable medium having instructions stored thereon, said instructions, if executed by one or more processors, causing said one or more processors to at least: proportionally balance training data based, at least in part, on an accuracy metric of one or more neural networks to be trained using two or more sets of proportionally imbalanced neural network training data.
27. The machine-readable medium of claim 26, having further instructions stored thereon, which, if executed by one or more processors, cause the one or more processors to at least: terminate a segment of training the one or more neural networks based, at least in part, on any one of a plurality of reward items falling below a threshold level during the segment.
28. The machine-readable medium of claim 26, having further instructions stored thereon, which, if executed by one or more processors, cause the one or more processors to at least: initialize a training segment to a state based on motion frames, the motion frames being shifted by one or more frames from the starting frame of an example of an aspect of motion.
29. The machine-readable medium of claim 26, wherein the one or more neural networks include motion actuators.
30. The machine-readable medium of claim 29, wherein the motion actuator is driven by a target state generated by a video stream scheduler, wherein the video stream scheduler generates the target state at least in part based on video of the moving subject.
31. The machine-readable medium of claim 29, wherein the motion actuator is driven by a target state generated by a command stream scheduler, the command stream scheduler generating the target state based on user input.
32. The machine-readable medium of claim 29, wherein the motion actuator is driven by a target state generated by a motion splicing scheduler.
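By way of non-limiting illustration, several of the training operations recited above (balancing training data by an accuracy metric, selecting aspects of motion level by level from a hierarchy, terminating a segment when a reward item falls below a threshold, initializing a segment at a shifted frame, and decaying control variance) might be sketched in Python as follows. Every function name, the inverse-accuracy weighting rule, and the geometric decay schedule are assumptions made for this sketch; the claims do not prescribe these particular formulas.

```python
import random


def balanced_weights(accuracy):
    """Sampling weight per training-data set, assuming accuracies in [0, 1].
    Assumed rule: weight is proportional to (1 - accuracy), so sets the
    network handles poorly are sampled more often."""
    deficits = {name: 1.0 - acc for name, acc in accuracy.items()}
    total = sum(deficits.values())
    if total == 0:
        return {name: 1.0 / len(accuracy) for name in accuracy}
    return {name: d / total for name, d in deficits.items()}


def sample_hierarchy(tree, rng):
    """Randomly pick an aspect of motion at the first level, then randomly
    pick one from the level below it."""
    top = rng.choice(sorted(tree))
    return top, rng.choice(tree[top])


def should_terminate(reward_items, threshold):
    """End the training segment when any single reward item drops below the threshold."""
    return any(value < threshold for value in reward_items.values())


def initial_frame(num_frames, rng):
    """Choose a start frame shifted from frame 0 by one or more frames."""
    if num_frames < 2:
        raise ValueError("need at least two frames to shift the start")
    return rng.randrange(1, num_frames)


def decayed_variance(initial, decay, step, floor=0.0):
    """Assumed geometric schedule: variance shrinks by `decay` per step."""
    return max(floor, initial * decay ** step)


rng = random.Random(0)
weights = balanced_weights({"walk": 0.9, "jump": 0.5})
tree = {"locomotion": ["walk", "run"], "acrobatics": ["flip"]}
top, leaf = sample_hierarchy(tree, rng)
start = initial_frame(10, rng)
```

In this sketch the less accurate "jump" set receives the larger sampling weight, and a segment whose reward items are, for example, `{"pose": 0.8, "root": 0.1}` would be terminated at a threshold of 0.2.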