Depth estimation using neural networks
By using multi-camera image warping and neural network training, the problem of inaccurate object depth in existing depth estimation methods is solved, enabling real-time and accurate depth estimation in autonomous vehicles and computer vision applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2020-12-09
- Publication Date
- 2026-06-23
AI Technical Summary
Existing depth estimation methods cannot effectively capture the precise distance information of objects in the field of view, especially in autonomous vehicles and computer vision applications, where it is difficult to calculate the depth plane of potential moving objects in real time.
By using multiple cameras or image capture devices, combined with neural network training and image warping techniques, the depth of an object in the field of view is estimated. Specific steps include: capturing images using multiple cameras, warping the images to a uniform space using homography transformation, performing depth estimation using a neural network training framework, and performing refined depth calculations using binary search and zero-crossing networks.
It achieves accurate estimation of object depth, enabling real-time and accurate determination of the distance of objects in the field of view in autonomous vehicles and computer vision applications, thus improving the accuracy and efficiency of depth estimation.
Figure CN114761996B_ABST
Abstract
Description
[0001] Cross-references to related applications
[0002] This application claims priority to U.S. Patent Application No. 16/714,359, filed December 13, 2019, entitled “DEPTH ESTIMATION USING A NEURAL NETWORK”, the entire contents of which are incorporated herein by reference for all purposes.
Technical Field
[0003] At least one embodiment relates to processing resources for performing and facilitating computer vision through artificial intelligence. For example, at least one embodiment relates to a processor or computing system used to train neural networks, according to various novel techniques described herein, to estimate the depth of objects in a field of view and/or to infer the estimated depth of objects using the trained neural network.
Background Technology
[0004] Estimating the depth of objects within a field of view is a complex step in many emerging technologies, such as autonomous vehicles and other computer vision applications. Modern depth estimation methods provide limited approximations of object depth but fail to capture crucial information. In autonomous vehicles and other applications, it is valuable to know how far an object lies in front of or beyond a depth plane, and that relationship must be computed in real time for objects that may be moving. Given the increasing availability of multiple image capture devices in consumer devices (e.g., mobile phones and vehicles), a method that provides better estimates of object depth relative to a plane is both feasible and valuable.
Attached Figure Description
[0005] Figure 1 shows an example environment for object depth estimation according to at least one embodiment;
[0006] Figure 2 shows image warping in an example environment to facilitate object depth estimation according to at least one embodiment;
[0007] Figure 3 illustrates the movement of objects in warped images from one or more cameras in an example environment according to at least one embodiment;
[0008] Figure 4 shows a system for training and inference to estimate the depth of an object using one or more neural networks according to at least one embodiment;
[0009] Figure 5 illustrates a method for continuous, adaptive object depth estimation using two image capture devices in an example environment according to at least one embodiment;
[0010] Figure 6 shows an example environment for object depth estimation using more than two image capture devices according to at least one embodiment;
[0011] Figure 7 illustrates a process for continuous, adaptive object depth estimation according to at least one embodiment;
[0012] Figure 8A illustrates inference and/or training logic according to at least one embodiment;
[0013] Figure 8B illustrates inference and/or training logic according to at least one embodiment;
[0014] Figure 9 illustrates the training and deployment of a neural network according to at least one embodiment;
[0015] Figure 10 shows an example data center system according to at least one embodiment;
[0016] Figure 11A shows an example of an autonomous vehicle according to at least one embodiment;
[0017] Figure 11B shows an example of camera positions and fields of view for the autonomous vehicle of Figure 11A according to at least one embodiment;
[0018] Figure 11C is a block diagram showing an example system architecture for the autonomous vehicle of Figure 11A according to at least one embodiment;
[0019] Figure 11D is a diagram of a system for communication between one or more cloud-based servers and the autonomous vehicle of Figure 11A according to at least one embodiment;
[0020] Figure 12 is a block diagram illustrating a computer system according to at least one embodiment;
[0021] Figure 13 is a block diagram illustrating a computer system according to at least one embodiment;
[0022] Figure 14 shows a computer system according to at least one embodiment;
[0023] Figure 15 shows a computer system according to at least one embodiment;
[0024] Figure 16A shows a computer system according to at least one embodiment;
[0025] Figure 16B shows a computer system according to at least one embodiment;
[0026] Figure 16C shows a computer system according to at least one embodiment;
[0027] Figure 16D shows a computer system according to at least one embodiment;
[0028] Figure 16E and Figure 16F show a shared programming model according to at least one embodiment;
[0029] Figure 17 shows an exemplary integrated circuit and an associated graphics processor according to at least one embodiment;
[0030] Figure 18A and Figure 18B show an exemplary integrated circuit and an associated graphics processor according to at least one embodiment;
[0031] Figure 19A and Figure 19B show additional exemplary graphics processor logic according to at least one embodiment;
[0032] Figure 20 shows a computer system according to at least one embodiment;
[0033] Figure 21A shows a parallel processor according to at least one embodiment;
[0034] Figure 21B shows a partitioning unit according to at least one embodiment;
[0035] Figure 21C shows a processing cluster according to at least one embodiment;
[0036] Figure 21D shows a graphics multiprocessor according to at least one embodiment;
[0037] Figure 22 illustrates a multi-graphics processing unit (GPU) system according to at least one embodiment;
[0038] Figure 23 shows a graphics processor according to at least one embodiment;
[0039] Figure 24 is a block diagram illustrating a processor microarchitecture for a processor according to at least one embodiment;
[0040] Figure 25 shows a deep learning application processor according to at least one embodiment;
[0041] Figure 26 is a block diagram of an example neuromorphic processor according to at least one embodiment;
[0042] Figure 27 shows at least a portion of a graphics processor according to one or more embodiments;
[0043] Figure 28 shows at least a portion of a graphics processor according to one or more embodiments;
[0044] Figure 29 shows at least a portion of a graphics processor according to one or more embodiments;
[0045] Figure 30 is a block diagram of a graphics processing engine 3010 of a graphics processor according to at least one embodiment;
[0046] Figure 31 is a block diagram of at least a portion of a graphics processor core according to at least one embodiment;
[0047] Figure 32A and Figure 32B show thread execution logic 3200, which includes an array of processing elements of a graphics processor core, according to at least one embodiment;
[0048] Figure 33 shows a parallel processing unit (“PPU”) according to at least one embodiment;
[0049] Figure 34 illustrates a general-purpose processing cluster (“GPC”) according to at least one embodiment;
[0050] Figure 35 shows a memory partition unit of a parallel processing unit (“PPU”) according to at least one embodiment; and
[0051] Figure 36 shows a streaming multiprocessor according to at least one embodiment.
Detailed Implementation
[0052] Figure 1 illustrates an example environment for object depth estimation. In at least one embodiment, one or more cameras 104, 106 lie in an initial depth plane d0 116, wherein depth planes 116, 118, 120, 122, 124 may be planes parallel to a plane containing camera 104 or connecting two or more cameras 104, 106, such as a surface on which the cameras are mounted. In at least one embodiment, depth planes 116, 118, 120, 122, 124 may be spaced at regular intervals. In at least one embodiment, depth planes 116, 118, 120, 122, 124 may be spaced at irregular intervals, for example, with larger spacing between depth planes farther from the one or more cameras 104, 106 and finer granularity, with smaller spacing, between depth planes closer to the one or more cameras 104, 106. In at least one embodiment, a depth plane may be a surface containing points equidistant from a reference camera 104 of the one or more cameras 104, 106. In at least one embodiment, one or more depth planes 116, 118, 120, 122, 124 may be separated from the one or more cameras 104, 106 by one or more predetermined distances or depths. In at least one embodiment, the depth planes 116, 118, 120, 122, 124 may be located at any distance from the initial depth plane d0 116 and may be determined in real time to adapt to changing distance estimation requirements.
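The regular and irregular spacings described above can be sketched as follows. This is a minimal illustration only: the function names, the geometric growth ratio, and the example distances are assumptions and do not come from the patent text.

```python
def regular_planes(max_depth, count):
    """Return `count` plane depths evenly spaced from 0 (d0) to max_depth."""
    step = max_depth / (count - 1)
    return [i * step for i in range(count)]


def irregular_planes(max_depth, count, ratio=2.0):
    """Return plane depths whose spacing grows by `ratio` with distance.

    d0 sits at depth 0. Each gap is `ratio` times the previous gap, so planes
    are dense near the cameras and sparse far away (an assumed scheme).
    """
    gaps = [ratio ** i for i in range(count - 1)]
    scale = max_depth / sum(gaps)
    depths = [0.0]
    for gap in gaps:
        depths.append(depths[-1] + gap * scale)
    return depths


print(regular_planes(40.0, 5))    # [0.0, 10.0, 20.0, 30.0, 40.0]
print(irregular_planes(30.0, 5))  # [0.0, 2.0, 6.0, 14.0, 30.0]
```

Any monotonic spacing rule would serve equally well here; the geometric rule is used only because it makes the near-field planes visibly denser.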
[0053] In at least one embodiment, one or more cameras 104, 106 are separated by a known distance 108. In at least one embodiment, one or more cameras 104, 106 may be fixed to surface 102 at depth d0116. In at least one embodiment, one or more cameras or other image capturing devices 104, 106 capture a set of images, each camera 104, 106 or other image capturing device capturing at least one image. In at least one embodiment, if only one camera 104 is used, the camera must capture two images, wherein a first image is captured from an initial position 104 and a second image is captured from a secondary position 106 at a known distance 108 from the initial position. In at least one embodiment, if only one camera 104 is used, the camera may capture images from a static position via one or more reflective devices, wherein the reflective devices capture different perspectives of objects 110, 112, 114 in the depth plane.
[0054] In at least one embodiment, one or more objects 110, 112, 114 may exist in the visual space observed by one or more cameras or other image capturing devices 104, 106. In at least one embodiment, one or more objects 110, 112, 114 may be observed at one or more depths in the visual space. In at least one embodiment, one or more predetermined depth planes 116, 118, 120, 122, 124 may be used for depth estimation, as described herein.
[0055] In at least one embodiment, the number of depth planes 116, 118, 120, 122, 124 can be determined based on the computing resources available for depth estimation. In at least one embodiment, the number of depth planes 116, 118, 120, 122, 124 can be determined based on other factors, such as camera availability, the required speed of depth estimation, or the desired accuracy of depth estimation.
[0056] Figure 2 illustrates image warping in an example environment to facilitate object depth estimation. In at least one embodiment, one or more objects 210, 212, 214 exist in a visual space that can be observed by a plurality of cameras 204, 206 spaced a known distance 208 apart on depth plane d0 224. In at least one embodiment, the visual space containing one or more objects 210, 212, 214 may contain predetermined depth planes d0...d4 216, 218, 220, 222, 224, which define distance thresholds in the visual space. In at least one embodiment, one or more objects will be in front of all predetermined depth planes 222, between respective predetermined depth planes 216, 218, 220, 222, 224, or beyond all predetermined depth planes 216. In at least one embodiment, the techniques described herein determine whether an object is in front of all predetermined depth planes 222 or beyond all predetermined depth planes 216, or else locate two depth planes 216, 218, 220, 222, 224 that approximate the nearest and farthest possible distances of one or more objects 210, 212, 214.
[0057] In at least one embodiment, multiple cameras 204, 206 each capture images of objects in the visual space. In at least one embodiment, the fields of view captured by each camera (or by a single camera from multiple locations) are the same or at least overlap. In at least one embodiment, a single camera may be used, but it must capture a first image, move a predetermined distance 208, and capture a second image. To determine the relationship of one or more objects 210, 212, 214 to individual depth planes 216, 218, 220, 222, 224, in at least one embodiment, the system implementing the techniques described herein warps or transforms 228 the image captured by the right-hand camera or other image capture device 206 to the space captured by the left-hand camera or other image capture device 204. In at least one embodiment, the image captured by the left-hand camera or other image capture device 204 is instead warped or transformed 228 to the space captured by the right-hand camera or other image capture device 206.
[0058] In at least one embodiment, any general depth plane 216, 218, 220, 222, 224 in the visual space, for example as shown in Figure 2, induces a homography, transformation, or mapping between the two cameras 204, 206. In at least one embodiment, an image captured by the right-hand camera or other image capture device 206 can be warped or transformed 228 from the right-hand camera or other image capture device 206 to the left-hand camera or other image capture device 204 by an affine transformation (such as, but not limited to, a homography). In at least one embodiment, a reference plane can be obtained for each camera 204, 206, wherein the reference plane is between the left-hand camera 204 and the right-hand camera 206, and the image captured from each camera 204, 206 is warped by a homography or other available affine transformation. In at least one embodiment, an image captured by the right-hand camera or other image capture device 206 can be warped or transformed 228 onto the left-hand camera or other image capture device 204 via the homography x2 = H_p x1, where x1 is the two-dimensional image of the three-dimensional point X 226 in the right-hand camera or other image capture device 206, H_p is the homography induced by the depth plane p 220, and x2 is the two-dimensional image of the same point 226 in the left-hand camera 204. In at least one embodiment, given a reference image I_L captured by the left-hand camera or other image capture device 204 and a reference image I_R captured by the right-hand camera or other image capture device 206, the projection matrix for I_L can be defined as P_L = [I | 0] and the projection matrix for I_R can be defined as P_R = [R | t]. In at least one embodiment, the homography matrix H_p for plane p = [n^T d] 220 can be calculated as [formula missing from the source text]. In at least one embodiment, the warped or transformed image I_R^p can be calculated relative to the plane p 220 as I_R^p = H_p I_L. In at least one embodiment, if an image is warped or transformed according to the homography I_R^p = H_p I_L or any other affine transformation, the image will substantially match the other image. In at least one embodiment, the images will substantially match when the set of objects in the images moves according to the homography transformation in visual space.
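As a hedged illustration of a plane-induced homography, the sketch below uses the standard textbook form H = R - t n^T / d for projection matrices [I | 0] and [R | t] and a plane written as n.X + d = 0. That formula, its sign convention, the mapping direction (left-camera points to right-camera points, in normalized image coordinates), and all names and numbers are assumptions of this example, not statements of the patent's own formula.

```python
def plane_homography(R, t, n, d):
    """Return H = R - t n^T / d for the plane n.X + d = 0 (assumed convention)."""
    return [[R[i][j] - t[i] * n[j] / d for j in range(3)] for i in range(3)]


def apply_homography(H, x, y):
    """Map the normalized image point (x, y) through H and dehomogenize."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
BASELINE = 0.5       # hypothetical camera separation, metres
PLANE_DEPTH = 10.0   # hypothetical depth of plane p, metres

# The right camera sits BASELINE to the right of the left camera, so a point
# X in left-camera coordinates becomes X + t in right-camera coordinates.
t = [-BASELINE, 0.0, 0.0]
# Fronto-parallel plane z = PLANE_DEPTH, written as n.X + d = 0.
n = [0.0, 0.0, -1.0]
H = plane_homography(IDENTITY, t, n, PLANE_DEPTH)

# A 3D point that lies on the plane, projected into each camera directly.
X, Y, Z = 2.0, 1.0, PLANE_DEPTH
left = (X / Z, Y / Z)
right_expected = ((X - BASELINE) / Z, Y / Z)

# For a point on the plane, the homography reproduces the right-camera image.
print(apply_homography(H, *left), right_expected)
```

Points that do not lie on the plane will not land on their true position after the mapping, which is the residual motion the later paragraphs rely on.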
[0059] In at least one embodiment, the reference image I_L captured by the left-hand camera or other image capture device 204 and the image I_R^p warped or transformed relative to plane p 220 are provided to a segmentation network, where the segmentation target used to train the neural network described herein is calculated from the depth map D_L using plane p 220.
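One plausible way to derive a segmentation target from a depth map and a plane is a per-pixel binary label. The sketch below assumes a fronto-parallel plane at a single depth and a 0/1 labelling ("in front" versus "beyond"); the function name, label encoding, and sample values are illustrative assumptions.

```python
def segmentation_target(depth_map, plane_depth):
    """Label each pixel 1 if its depth lies beyond the plane, else 0."""
    return [[1 if depth > plane_depth else 0 for depth in row]
            for row in depth_map]


# Hypothetical 2x3 ground-truth depth map for the left image, in metres.
depth_map_left = [
    [4.0, 12.0, 30.0],
    [9.5, 10.5, 25.0],
]
print(segmentation_target(depth_map_left, plane_depth=10.0))
# [[0, 1, 1], [0, 1, 1]]
```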
[0060] Figure 3 illustrates object movement in warped images from one or more cameras in an example environment. In at least one embodiment, one or more cameras or other image capturing devices 304, 306 are attached to object 302 at depth plane d0 326. In at least one embodiment, one or more cameras capture images of one or more objects 314, 318, 332 in a visual space containing one or more depth planes d0...d3 320, 322, 324, 326. In at least one embodiment, the image from the right-hand camera or other image capturing device 306 is warped or transformed 308 to the image space captured by the left-hand camera or other image capturing device 304 via a homography, as described above.
[0061] In at least one embodiment, an image from a right-hand camera or other image capture device 306 is warped or transformed 308, using the techniques described above, to the image space captured by a left-hand camera or other image capture device 304, wherein the warp or transformation 308 is performed relative to a point A 310 located on depth plane d1 324. In at least one embodiment, the warp or transformation 308 relative to point A 310 on depth plane d1 324 causes observed objects 314, 318, 332 to move in the warped or transformed image 328 according to their relationship to depth plane d1 324. In at least one embodiment, if objects 314, 318, 332 are at a depth beyond the depth plane d1 324 relative to which the image is warped or transformed 308, then objects 314, 318, 332 captured in the warped or transformed image 328 move to the right. In at least one embodiment, objects 314, 318, 332 captured in the warped or transformed image 328 move further to the right the farther they lie beyond the depth plane 324. In at least one embodiment, the object 314, which is closer to the depth plane 324, moves less far to the right than the object 318, which is farther from the depth plane 324.
[0062] In at least one embodiment, an image from a right-hand camera or other image capture device 306 is warped or transformed 308, using the techniques described above, to the image space captured by a left-hand camera or other image capture device 304, wherein the warp or transformation 308 is performed relative to a point B 312 located on depth plane d2 322. In at least one embodiment, the warp or transformation 308 relative to point B 312 on depth plane d2 322 causes observed objects 314, 318, 332 to move in the warped or transformed image 330 according to their relationship to depth plane d2 322. In at least one embodiment, if objects 314, 318, 332 are at a depth beyond the depth plane d2 322 relative to which the image is warped or transformed 308, then objects 314, 318, 332 captured in the warped or transformed image 330 move to the right. In at least one embodiment, if objects 314, 318, 332 are at a depth in front of the depth plane d2 322 relative to which the image is warped or transformed 308, then objects 314, 318, 332 captured in the warped or transformed image 330 move to the left. In at least one embodiment, objects 314, 318, 332 captured in the warped or transformed image 330 move further in either direction the farther they lie in front of or beyond the depth plane 322.
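The left/right motion described above can be checked with a small calculation. The sketch assumes a rectified camera pair, image x increasing to the right, and the right image warped into the left image's space using a fronto-parallel plane; under those assumptions the residual shift of a point is f * b * (1 / plane_depth - 1 / object_depth). The function name and the numbers are hypothetical.

```python
def residual_shift(focal_px, baseline, plane_depth, object_depth):
    """Horizontal offset (pixels) of an object after warping by a plane.

    Positive values mean the object appears shifted to the right, negative
    values to the left, and zero means the object lies on the plane.
    """
    return focal_px * baseline * (1.0 / plane_depth - 1.0 / object_depth)


FOCAL_PX = 1000.0   # hypothetical focal length in pixels
BASELINE = 0.5      # hypothetical camera separation in metres
PLANE = 10.0        # hypothetical depth of the warping plane in metres

for depth in (5.0, 10.0, 20.0, 40.0):
    shift = residual_shift(FOCAL_PX, BASELINE, PLANE, depth)
    side = "right" if shift > 0 else "left" if shift < 0 else "none"
    print(f"object at {depth:5.1f} m: shift {shift:+7.2f} px ({side})")
```

Under these assumptions the shift grows with distance from the plane in either direction, which matches the behaviour the paragraphs describe for objects in front of and beyond the plane.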
[0063] Figure 4 illustrates a system for training and inference to estimate the depth of an object using one or more neural networks. In at least one embodiment, the depth estimation environment described above is used to generate segmented training data 406, as described herein, for determining whether an object is in front of or beyond a distance plane as described above. In at least one embodiment, baseline data 404 may include information about the depth plane and the object, such as reference or ground truth information, or any baseline data required to train an untrained neural network 412 using the techniques described herein. In at least one embodiment, training data 406, such as segmented data output from example environment 402, may be combined with baseline data 404 and provided as input to training framework 410, as described herein, to train the untrained neural network 412.
[0064] In at least one embodiment, the new data 408 may include reference and distorted or transformed images, as described above. In at least one embodiment, the new data 408 may be collected in real time by a camera or other image capture device configured according to an environment such as described above. In at least one embodiment, the training framework 410 may be used to train an untrained neural network 412 as described herein to determine whether an object is in front of or beyond a depth plane in the visual space captured by one or more cameras or other image capture devices. In at least one embodiment, the training framework 410 may generate a trained neural network 414 for inference using the techniques described herein. In at least one embodiment, the new data 408 obtained from the example environment, as described above, may be used by the trained neural network 414 to infer a result 416 using the techniques described below. In at least one embodiment, the result may include information about pixels in one or more images, or objects comprising multiple pixels in one or more images, being in front of or beyond a predetermined distance plane.
[0065] Figure 5 A method for continuous, adaptive object depth estimation is illustrated in an example environment using two image capture devices. In at least one embodiment, the visual space includes depth planes d0...d4 516, 518, 520, 522, 524. In at least one embodiment, a neural network is trained as described above to determine whether pixels in a set of input images captured from one or more cameras 504, 506 are in front of or beyond the depth planes 516, 518, 520, 522, 524. In at least one embodiment, images are captured by one or more cameras 504, 506 such that the images contain objects 510, 512, 514 in the visual space.
[0066] In at least one embodiment, a binary search is performed to locate one or more depth planes 516, 518, 520, 522, 524, wherein if one depth plane is located, the target object 512 is in front of or beyond that depth plane. In at least one embodiment, if two depth planes are located, the target object 512 is at a depth between the two located depth planes. For example, in one embodiment, the target object 512 is captured in images by one or more cameras 504, 506. In at least one embodiment, an image captured by a right-hand camera or other image capture device 506 is warped or transformed 528 relative to an initial point A 526 at depth plane d2 520 using the techniques described above. In at least one embodiment, the initial point A 526 is selected using a binary search 532 of the predetermined depth planes 516, 518, 520, 522, 524, wherein A 526 on plane d2 520 lies at the midpoint of the set of predetermined depth planes 516, 518, 520, 522, 524. In at least one embodiment, a neural network trained to identify whether a pixel, or an object comprising a set of pixels, is in front of or beyond depth planes 516, 518, 520, 522, 524 determines that the target object 512 is beyond depth plane d2 520.
[0067] In at least one embodiment, a search algorithm other than binary search is performed. In at least one embodiment, one or more distant planes are used as thresholds to modify a set of depth planes for searching. In at least one embodiment, if a distant depth plane is crossed, the set of depth planes for searching is adjusted to include a different set of depth planes within a range different from each camera or other image capture device. In at least one embodiment, adjusting the set of depth planes for searching allows for refinement of accuracy based on the total distance to the object. For example, in at least one embodiment, the object is detected by a single depth plane that determines whether the object is within the range of interest of one or more cameras or other image capture devices. In at least one embodiment, if the object enters the region of interest of one or more cameras or image capture devices, and the available depth planes are focused, for example, on a range that does not include the entire region of interest of the one or more cameras, the available depth planes can be adjusted to include a subregion of the region of interest (containing the object). In at least one embodiment, adjusting the depth planes allows for variable accuracy in depth estimation during changing conditions, where new or adjusted depth planes are used to temporarily determine the object distance. In at least one embodiment, the set of depth planes to be searched can be adjusted based on a triggering event, such as a single depth plane being in front of or beyond the object. In at least one embodiment, a truncation search is performed based on the object's relationship to one or more depth planes.
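Adjusting the set of depth planes to a subregion that contains the object might look like the sketch below. It is only one possible reading: the coarse-to-fine scheme, the function names, and the `object_is_beyond` callback (a stand-in for the trained network's decision at a given plane) are all assumptions of this example.

```python
def refine_planes(near, far, count):
    """Return `count` evenly spaced plane depths covering [near, far]."""
    step = (far - near) / (count - 1)
    return [near + i * step for i in range(count)]


def adjust_plane_set(coarse_planes, object_is_beyond, count=5):
    """Zoom the plane set into the coarse interval that contains the object.

    `object_is_beyond(depth)` is a hypothetical callback returning True when
    the object lies beyond the plane at `depth`.
    """
    for near, far in zip(coarse_planes, coarse_planes[1:]):
        if object_is_beyond(near) and not object_is_beyond(far):
            return refine_planes(near, far, count)
    return coarse_planes  # object outside every interval: keep the current set


coarse = [0.0, 20.0, 40.0, 80.0]
true_depth = 27.0  # used only to simulate the network's answers
print(adjust_plane_set(coarse, lambda depth: true_depth > depth))
# [20.0, 25.0, 30.0, 35.0, 40.0]
```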
[0068] In at least one embodiment, using the techniques described above, an image captured by a right-hand camera or other image capture device 506 is warped or transformed 528 relative to a second point C 530 at depth plane d3 518, wherein the next point and depth plane are determined based on a search of available depth planes 534, such as a binary search. In at least one embodiment, available depth planes include depth planes that can still be used as candidate boundaries for objects. In at least one embodiment, available depth planes include all depth planes between the depth plane currently used for searching and the object being searched for. In at least one embodiment, point B 536 at depth plane d4 516 is not selected because it is not the next plane according to the binary search algorithm. In at least one embodiment, a neural network trained to identify whether a pixel, or an object comprising a set of pixels, is in front of or beyond depth planes 516, 518, 520, 522, 524 determines that target object 512 is in front of depth plane d3 518. In at least one embodiment, when there are no further depth planes 516, 518, 520, 522, 524 between the current depth plane 518 and the previously checked depth plane 520, the target object 512 is determined to be between the two located depth planes 518, 520, and an approximate value of the depth or distance of the target object is determined.
[0069] In at least one embodiment, a binary search or any other search algorithm, as described above, can be performed on any object 510, 512, 514 in the visual space captured by one or more cameras 504, 506. In at least one embodiment, the camera or other image capturing device 504, 506 can be fixed to an object moving through the visual space. In at least one embodiment, as described above, a binary search can be performed at specific time intervals when new images are captured by one or more cameras or other image capturing devices 504, 506, providing a continuous estimate of the depth of object 510, 512, 514 as said one or more cameras or other image capturing devices 504, 506 move through the visual space. In at least one embodiment, when object 510, 512, 514 crosses depth planes 516, 518, 520, 522, 524, a new estimate must be performed using the techniques described above to determine a new depth plane to approximate the depth or distance of object 510, 512, 514 in the visual space.
[0070] In at least one embodiment, a zero-crossing network is used to perform intermediate estimates of object distances relative to depth planes 516, 518, 520, 522, 524. In at least one embodiment, the zero-crossing network is trained to determine which of the two bounding depth planes an object is closer to when it lies between two depth planes 516, 518, 520, 522, 524. In at least one embodiment, the zero-crossing network is trained to resolve intermediate depth values where objects 510, 512, 514 lie between two depth planes, to estimate more accurately which depth plane objects 510, 512, 514 are closest to, and to filter uncertainty and noise from the depth estimates produced by the methods described above. In at least one embodiment, the zero-crossing network is trained to determine when an object crosses a specific depth plane 516, 518, 520, 522, 524, thereby providing a finer depth estimate between the depth planes 516, 518, 520, 522, 524. In at least one embodiment, the zero-crossing network is trained to determine the proximity of an object to a given depth plane 516, 518, 520, 522, 524. In at least one embodiment, the zero-crossing network is trained to provide a more accurate estimate of the depth of objects 510, 512, 514 as they approach a particular depth plane 516, 518, 520, 522, 524, where it becomes more difficult to determine whether objects 510, 512, 514 are in front of or behind that depth plane.
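The text does not specify how the zero-crossing network produces an intermediate estimate. Purely to illustrate the idea of locating a sign change between two bracketing planes, the sketch below swaps in plain linear interpolation of a signed score; the score convention (positive for "object beyond this plane", negative for "object in front of this plane"), the function name, and the numbers are assumptions, and this is not the trained network described above.

```python
def zero_crossing_depth(near_depth, near_score, far_depth, far_score):
    """Linearly interpolate the depth at which a signed score crosses zero.

    The near plane must report a positive score (object beyond it) and the
    far plane a negative score (object in front of it).
    """
    if near_score <= 0 or far_score >= 0:
        raise ValueError("the two planes do not bracket the object")
    fraction = near_score / (near_score - far_score)
    return near_depth + fraction * (far_depth - near_depth)


# Hypothetical scores: the object is well beyond the 10 m plane and only
# slightly in front of the 20 m plane, so the estimate lands nearer 20 m.
estimate = zero_crossing_depth(10.0, 0.6, 20.0, -0.2)
print(round(estimate, 3))
```

A score that is close to zero at one plane pulls the estimate toward that plane, which mirrors the stated goal of deciding which bounding plane the object is closer to.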
[0071] Figure 6 illustrates an example environment for object depth estimation using more than two image capture devices. In at least one embodiment, the techniques described above, in which images from two cameras or other image capture devices 604, 606 are warped or transformed to determine object distance or depth, can be extended to environments that use more than two cameras for depth or distance estimation with the search and calculation techniques described above. In at least one embodiment, the visual space can be captured by three or more cameras or other image capture devices 604, 606, 608 fixed to object 602 at an initial depth plane d0. In at least one embodiment, one or more cameras or other image capture devices 604, 606 can be used to estimate the depth or distance of an object 614 at a nearby distance, while an additional camera or other image capture device 608 can be used together with camera or other image capture device 606 to determine the depth or distance of objects located at depth planes 630, 632 at a greater distance from the cameras 604, 606, 608.
[0072] In at least one embodiment, for one or more objects 614 at a closer distance or depth in the visual space, one or more cameras or other image capturing devices 604, 606 located at a specific offset 610 from each other may be used to capture images. In at least one embodiment, the images captured to determine the closer distance or depth are warped or transformed relative to points 616, 618 located on depth planes 626, 628 at shorter distances to calculate a finer-grained depth approximation of the object 614 at the shorter distance. In at least one embodiment, objects at a greater distance are captured by one or more cameras or other image capturing devices 606, 608 located at a different offset 612 from each other, and the images are warped or transformed as described above relative to points 620, 624 located on more widely spaced depth planes 630, 632. In at least one embodiment, predetermining depth planes 630, 632 at coarser intervals for greater distances requires less computation time until the object comes closer to the cameras or other image capturing devices 604, 606, 608.
[0073] Figure 7 A process for continuous, adaptive object depth estimation is illustrated. In at least one embodiment, the process for continuous, adaptive object depth estimation begins 702 with a system implementing the techniques described herein, such as a computer vision system, to determine the distance to an object in a car or mobile phone, wherein the system begins 704 by generating or capturing a new image from another camera, as described above. In at least one embodiment, a depth plane is selected 706 according to a binary search of a predetermined depth plane, as described above. In at least one embodiment, a captured image is distorted or transformed onto another captured image 708 using the techniques described herein based on the selected depth plane.
[0074] In at least one embodiment, a system implementing the techniques described herein, such as a computer vision system for determining the distance to an object in a car or mobile phone, utilizes a neural network 714 to infer whether the object is behind 710 or in front of 712 the selected depth plane. In at least one embodiment, if the object is behind the selected depth plane 710, a new maximum depth plane is set 716. In at least one embodiment, if the object is in front of the selected depth plane 712, a new minimum depth plane is set 718. In at least one embodiment, if no more depth planes exist between the calculated minimum and maximum depth planes 720, a result 722 is returned. In at least one embodiment, if more depth planes remain to be searched, a new depth plane is selected 706 according to the binary search algorithm. In at least one embodiment, once the minimum and maximum depth planes have been determined and no more depth planes exist between them 722, the process for continuous, adaptive object depth estimation is completed 724.
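The search loop of Figure 7 can be sketched as an ordinary binary search over a sorted list of depth planes. In this sketch, the callable is_behind is a hypothetical stand-in for the inference performed by neural network 714, the planes are assumed to be sorted by increasing distance, and the object is assumed to lie between the first and last plane. The sketch tracks a near bound and a far bound; how these map onto the "minimum" and "maximum" depth planes of the text depends on an ordering convention that is assumed here.

```python
def search_depth_interval(planes, is_behind):
    """Return the pair of adjacent planes that brackets the object.

    planes:    depths sorted by increasing distance (assumed ordering).
    is_behind: hypothetical oracle; True if the object is farther away
               than the given plane depth, False if it is in front.
    """
    if len(planes) < 2:
        raise ValueError("need at least two depth planes")
    near, far = 0, len(planes) - 1
    # Stop when no plane remains between the two bounds.
    while far - near > 1:
        mid = (near + far) // 2
        if is_behind(planes[mid]):
            near = mid  # object lies beyond the selected plane
        else:
            far = mid   # object lies in front of the selected plane
    return planes[near], planes[far]


true_depth = 5.0  # illustrative value only
bracket = search_depth_interval(
    [1.0, 2.0, 4.0, 8.0, 16.0, 32.0],
    lambda plane_depth: true_depth > plane_depth,
)
```

Each iteration halves the number of candidate planes, so the number of warps and network evaluations grows only logarithmically with the number of predetermined planes.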
[0075] Inference and training logic
[0076] Figure 8A illustrates inference and / or training logic 815 for performing inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 815 are provided below in conjunction with Figure 8A and / or Figure 8B.
[0077] In at least one embodiment, inference and / or training logic 815 may include, but is not limited to, code and / or data storage 801 for storing forward and / or output weights and / or input / output data, and / or other parameters configuring neurons or layers of a neural network trained for and / or used for inference in one or more embodiments. In at least one embodiment, training logic 815 may include or be coupled to code and / or data storage 801 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic, including integer and / or floating-point units (collectively referred to as Arithmetic Logic Units (ALUs)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, code and / or data storage 801 stores weight parameters and / or input / output data of each layer of a neural network trained or used in one or more embodiments during forward propagation of input / output data and / or weight parameters during training and / or inference using one or more embodiments. In at least one embodiment, any portion of the code and / or data storage 801 may be included within other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0078] In at least one embodiment, any portion of the code and / or data storage 801 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 801 may be a cache memory, dynamic random-addressable memory (“DRAM”), static random-addressable memory (“SRAM”), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 801 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage space, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0079] In at least one embodiment, inference and / or training logic 815 may include, but is not limited to, code and / or data storage 805 for storing backward and / or output weights and / or input / output data corresponding to neurons or layers of a neural network trained and / or used for inference in one or more embodiments. In at least one embodiment, code and / or data storage 805 stores weight parameters and / or input / output data of each layer of a neural network trained or used in one or more embodiments during backpropagation of input / output data and / or weight parameters during training and / or inference using one or more embodiments. In at least one embodiment, training logic 815 may include or be coupled to code and / or data storage 805 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code (such as graph code) causes weight or other parameter information to be loaded into a processor ALU based on the architecture of the neural network corresponding to the code. In at least one embodiment, any portion of the code and / or data storage 805 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of the code and / or data storage 805 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 805 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage.
In at least one embodiment, the choice of whether the code and / or data storage 805 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0080] In at least one embodiment, code and / or data storage 801 and code and / or data storage 805 may be separate storage structures. In at least one embodiment, code and / or data storage 801 and code and / or data storage 805 may be the same storage structure. In at least one embodiment, code and / or data storage 801 and code and / or data storage 805 may be partially the same storage structure and partially separate storage structures. In at least one embodiment, any portion of code and / or data storage 801 and code and / or data storage 805 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0081] In at least one embodiment, the inference and / or training logic 815 may include, but is not limited to, one or more arithmetic logic units (“ALUs”) 810 (including integer and / or floating-point units) for performing logical and / or mathematical operations based at least in part on, or as instructed by, training and / or inference code (e.g., graph code), the results of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 820, which are functions of input / output and / or weight parameter data stored in code and / or data storage 801 and / or code and / or data storage 805. In at least one embodiment, the activations stored in activation storage 820 are generated according to linear algebraic and / or matrix-based mathematics performed by ALU 810 in response to executing instructions or other code, wherein weight values stored in code and / or data storage 805 and / or code and / or data storage 801 are used as operands together with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which can be stored in code and / or data storage 805 or code and / or data storage 801 or other on-chip or off-chip storage.
[0082] In at least one embodiment, one or more ALUs 810 are included in one or more processors or other hardware logic devices or circuits, while in another embodiment, one or more ALUs 810 may be external to the processor or other hardware logic device or circuit that uses them (e.g., a coprocessor). In at least one embodiment, one or more ALUs 810 may be included within an execution unit of a processor, or otherwise included in a group of ALUs accessible by the execution unit of the processor, which may be within the same processor or distributed among different processors of different types (e.g., central processing units, graphics processing units, fixed-function units, etc.). In at least one embodiment, code and / or data storage 801, code and / or data storage 805, and activation storage 820 may be on the same processor or other hardware logic device or circuit, while in another embodiment, they may be on different processors or other hardware logic devices or circuits, or in some combination of the same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 820 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. Furthermore, inference and / or training code may be stored together with other code accessible to the processor or other hardware logic or circuitry, and may be fetched and / or processed using the processor’s fetch, decode, schedule, execute, retire, and / or other logic circuitry.
[0083] In at least one embodiment, the activation storage 820 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the activation storage 820 may be wholly or partially located inside or outside one or more processors or other logic circuits. In at least one embodiment, the choice of whether the activation storage 820 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of data used in inference and / or training of the neural network, or some combination of these factors. In at least one embodiment, the inference and / or training logic 815 shown in Figure 8A can be used in conjunction with an application-specific integrated circuit (“ASIC”), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel Corp. In at least one embodiment, the inference and / or training logic 815 shown in Figure 8A can be used in conjunction with central processing unit (“CPU”) hardware, graphics processing unit (“GPU”) hardware, or other hardware, such as a field-programmable gate array (“FPGA”).
[0084] Figure 8B illustrates inference and / or training logic 815 according to at least one embodiment. In at least one embodiment, the inference and / or training logic 815 may include, but is not limited to, hardware logic in which computational resources are dedicated or otherwise used exclusively in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and / or training logic 815 shown in Figure 8B can be used in conjunction with an application-specific integrated circuit (ASIC), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel Corp. In at least one embodiment, the inference and / or training logic 815 shown in Figure 8B can be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware (e.g., a field-programmable gate array (FPGA)). In at least one embodiment, the inference and / or training logic 815 includes, but is not limited to, code and / or data storage 801 and code and / or data storage 805, which can be used to store code (e.g., graph code), weight values, and / or other information, including bias values, gradient information, momentum values, and / or other parameter or hyperparameter information. In at least one embodiment shown in Figure 8B, each of code and / or data storage 801 and code and / or data storage 805 is associated with dedicated computing resources (e.g., computing hardware 802 and computing hardware 806), respectively. In at least one embodiment, each of computing hardware 802 and computing hardware 806 includes one or more ALUs that perform mathematical functions (e.g., linear algebraic functions) only on the information stored in code and / or data storage 801 and code and / or data storage 805, respectively, and the results of these functions are stored in activation storage 820.
[0085] In at least one embodiment, each of the code and / or data storage 801 and 805 and the corresponding computing hardware 802 and 806 corresponds to a different layer of the neural network, such that the activation produced by one “storage / computation pair 801 / 802” of code and / or data storage 801 and computing hardware 802 is provided as input to the next “storage / computation pair 805 / 806” of code and / or data storage 805 and computing hardware 806, in order to mirror the conceptual organization of the neural network. In at least one embodiment, each storage / computation pair 801 / 802 and 805 / 806 may correspond to more than one neural network layer. In at least one embodiment, additional storage / computation pairs (not shown) may be included in the inference and / or training logic 815 after or in parallel with the storage / computation pairs 801 / 802 and 805 / 806.
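The chaining of storage / computation pairs can be pictured with the following minimal software sketch, in which each pair holds its own weights and biases and computes only on them, and the activation of one pair becomes the input of the next. The class StorageComputePair, the function run_pairs, the ReLU non-linearity, and the numeric values are all assumptions for illustration; the described logic is hardware, and its data layout is not specified.

```python
from dataclasses import dataclass


@dataclass
class StorageComputePair:
    """Hypothetical software analogue of one storage / computation pair."""
    weights: list  # one row of weights per output neuron (assumed layout)
    biases: list   # one bias per output neuron

    def compute(self, inputs):
        # Weighted sum plus bias, followed by a ReLU (assumed activation).
        return [
            max(0.0, sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(self.weights, self.biases)
        ]


def run_pairs(pairs, inputs):
    """Feed the activation of each pair into the next pair in order."""
    activations = inputs
    for pair in pairs:
        activations = pair.compute(activations)
    return activations


first = StorageComputePair(weights=[[1.0, 1.0]], biases=[0.0])
second = StorageComputePair(weights=[[2.0], [-1.0]], biases=[1.0, 0.0])
outputs = run_pairs([first, second], [1.0, 2.0])
```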
[0086] Neural network training and deployment
[0087] Figure 9 illustrates training and deployment of a deep neural network according to at least one embodiment. In at least one embodiment, an untrained neural network 906 is trained using a training dataset 902. In at least one embodiment, the training framework 904 is the PyTorch framework, while in other embodiments, the training framework 904 is TensorFlow, Boost, Caffe, Microsoft Cognitive Toolkit / CNTK, MXNet, Chainer, Keras, Deeplearning4j, or another training framework. In at least one embodiment, the training framework 904 trains the untrained neural network 906, using the processing resources described herein, to generate a trained neural network 908. In at least one embodiment, the initial weights may be randomly selected or obtained by pre-training using a deep belief network. In at least one embodiment, training may be performed in a supervised, partially supervised, or unsupervised manner.
[0088] In at least one embodiment, supervised learning is used to train the untrained neural network 906, wherein training dataset 902 includes inputs paired with desired outputs, or wherein training dataset 902 includes inputs with known outputs and the outputs of neural network 906 are manually graded. In at least one embodiment, the untrained neural network 906 is trained in a supervised manner: inputs from training dataset 902 are processed, and the resulting outputs are compared with a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through the untrained neural network 906. In at least one embodiment, training framework 904 adjusts the weights controlling the untrained neural network 906. In at least one embodiment, training framework 904 includes tools for monitoring how well the untrained neural network 906 is converging toward a model (e.g., a trained neural network 908) suited to generating correct answers (e.g., result 914) based on known input data (e.g., a new dataset 912). In at least one embodiment, training framework 904 repeatedly trains the untrained neural network 906 while adjusting the weights to refine the output of the untrained neural network 906 using a loss function and an adjustment algorithm (e.g., stochastic gradient descent). In at least one embodiment, the training framework 904 trains the untrained neural network 906 until the untrained neural network 906 reaches the desired accuracy. In at least one embodiment, the trained neural network 908 can then be deployed to perform any number of machine learning operations.
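The supervised loop described above (process an input, compare with the desired output, propagate the error, adjust the weights) can be sketched on a deliberately tiny model. The example below fits a single weight and bias with stochastic gradient descent on a squared-error loss; the function train_linear, the learning rate, the epoch count, and the toy dataset are assumptions for illustration and do not represent the depth estimation network or training framework 904.

```python
import random


def train_linear(pairs, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b to (input, desired output) pairs with SGD."""
    rng = random.Random(seed)
    # Randomly selected initial weights, as one option named in the text.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    data = list(pairs)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, target in data:
            error = (w * x + b) - target  # compare output with desired output
            # Gradients of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Toy dataset generated from y = 2x + 1 (illustrative values only).
dataset = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train_linear(dataset)
```

On this noise-free toy data the learned parameters approach w = 2 and b = 1; a real network would replace the single weight with many layers and compute the gradients by backpropagation.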
[0089] In at least one embodiment, unsupervised learning is used to train the untrained neural network 906, wherein the untrained neural network 906 attempts to train itself using unlabeled data. In at least one embodiment, the training dataset 902 for unsupervised learning includes input data without any associated output data or "ground truth" data. In at least one embodiment, the untrained neural network 906 can learn groupings within the training dataset 902 and can determine how each input relates to the training dataset 902. In at least one embodiment, unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 908 capable of performing operations useful for reducing the dimensionality of the new dataset 912. In at least one embodiment, unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in the new dataset 912 that deviate from the normal patterns of the new dataset 912.
[0090] In at least one embodiment, semi-supervised learning can be used, a technique in which a mixture of labeled and unlabeled data is included in the training dataset 902. In at least one embodiment, the training framework 904 can be used to perform incremental learning, for example, through transfer learning techniques. In at least one embodiment, incremental learning enables the trained neural network 908 to adapt to a new dataset 912 without forgetting the knowledge injected into the network during initial training.
[0091] Data Center
[0092] Figure 10 illustrates an example data center 1000 that can be used in at least one embodiment. In at least one embodiment, the data center 1000 includes a data center infrastructure layer 1010, a framework layer 1020, a software layer 1030, and an application layer 1040.
[0093] In at least one embodiment, as shown in Figure 10, the data center infrastructure layer 1010 may include a resource coordinator 1012, grouped computing resources 1014, and node computing resources (“node CRs”) 1016(1)-1016(N), where “N” represents any positive integer. In at least one embodiment, node CRs 1016(1)-1016(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state drives or disk drives), network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power modules, cooling modules, etc. In at least one embodiment, one or more of node CRs 1016(1)-1016(N) may be a server having one or more of the aforementioned computing resources.
[0094] In at least one embodiment, the grouped computing resource 1014 may include individual groups (not shown) of node CRs housed within one or more racks, or a plurality of racks (also not shown) housed within data centers in various geographical locations. The individual groups of node CRs within the grouped computing resource 1014 may include computing, networking, memory, or storage resources that can be configured or allocated to support groups of one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.
[0095] In at least one embodiment, resource coordinator 1012 may configure or otherwise control one or more node CRs 1016(1)-1016(N) and / or grouped computing resources 1014. In at least one embodiment, resource coordinator 1012 may include a Software Design Infrastructure (“SDI”) management entity for data center 1000. In at least one embodiment, resource coordinator 1012 may include hardware, software, or some combination thereof.
[0096] In at least one embodiment, as shown in Figure 10, framework layer 1020 includes a job scheduler 1032, a configuration manager 1034, a resource manager 1036, and a distributed file system 1038. In at least one embodiment, framework layer 1020 may include a framework to support software 1032 of software layer 1030 and / or one or more applications 1042 of application layer 1040. In at least one embodiment, software 1032 or application 1042 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 1020 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark™ (hereinafter referred to as "Spark"), which can utilize distributed file system 1038 for large-scale data processing (e.g., "big data"). In at least one embodiment, the job scheduler 1032 may include a Spark driver to facilitate the scheduling of workloads supported by various layers of data center 1000. In at least one embodiment, the configuration manager 1034 may be able to configure different layers, such as software layer 1030 and framework layer 1020, including Spark and distributed file system 1038 for supporting large-scale data processing. In at least one embodiment, the resource manager 1036 is able to manage clustered or grouped computing resources mapped to or allocated to support distributed file system 1038 and job scheduler 1032. In at least one embodiment, clustered or grouped computing resources may include grouped computing resources 1014 on data center infrastructure layer 1010. In at least one embodiment, the resource manager 1036 may coordinate with resource coordinator 1012 to manage these mapped or allocated computing resources.
[0097] In at least one embodiment, the software 1032 included in the software layer 1030 may include software used by at least a portion of node CRs 1016(1)-1016(N), the grouped computing resources 1014, and / or the distributed file system 1038 of the framework layer 1020. One or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.
[0098] In at least one embodiment, one or more applications 1042 included in application layer 1040 may include one or more types of applications used by at least a portion of node CRs 1016(1)-1016(N), grouped computing resources 1014, and / or the distributed file system 1038 of framework layer 1020. One or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.
[0099] In at least one embodiment, any of the configuration manager 1034, resource manager 1036, and resource coordinator 1012 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 1000 and can prevent underutilization and / or poor performance of the data center.
[0100] In at least one embodiment, data center 1000 may include tools, services, software, or other resources to train one or more machine learning models or to use one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above with respect to data center 1000. In at least one embodiment, information can be inferred or predicted using trained machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 1000, by using weight parameters calculated through one or more training techniques described herein.
[0101] In at least one embodiment, the data center may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, or other hardware to utilize the aforementioned resources to perform training and / or inference. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured as a service to allow a user to train or perform information inference, such as image recognition, speech recognition, or other artificial intelligence services.
[0102] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 can be used in the system of Figure 10 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0103] In at least one embodiment, the inference and / or training logic 412, 414 can be used in the system of Figure 10 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0104] Autonomous vehicles
[0105] Figure 11A illustrates an example of an autonomous vehicle 1100 according to at least one embodiment. In at least one embodiment, the autonomous vehicle 1100 (which may alternatively be referred to herein as "vehicle 1100") may be, but is not limited to, a passenger vehicle, such as a car, truck, bus, and / or another type of vehicle capable of accommodating one or more passengers. In at least one embodiment, vehicle 1100 may be a semi-tractor-trailer for hauling goods. In at least one embodiment, vehicle 1100 may be an aircraft, a robotic vehicle, or another type of vehicle.
[0106] Autonomous vehicles can be described according to the levels of automation defined by the National Highway Traffic Safety Administration (“NHTSA”), a division of the U.S. Department of Transportation, and the Society of Automotive Engineers (“SAE”) in “Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles” (e.g., standard number J3016-201806, published June 15, 2018; standard number J3016-201609, published September 30, 2016; and previous and future versions of this standard). In one or more embodiments, vehicle 1100 may be able to function according to one or more of the levels of autonomous driving from Level 1 to Level 5. For example, in at least one embodiment, vehicle 1100 may be able to perform conditional automation (Level 3), high automation (Level 4), and / or full automation (Level 5).
[0107] In at least one embodiment, vehicle 1100 may include, but is not limited to, components such as chassis, body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other vehicle components. In at least one embodiment, vehicle 1100 may include, but is not limited to, propulsion system 1150, such as an internal combustion engine, a hybrid powertrain, an all-electric motor, and / or another type of propulsion system. In at least one embodiment, propulsion system 1150 may be connected to the drivetrain of vehicle 1100, which may include, but is not limited to, a transmission, to enable propulsion of vehicle 1100. In at least one embodiment, propulsion system 1150 may be controlled in response to receiving a signal from throttle / accelerator 1152.
[0108] In at least one embodiment, when the propulsion system 1150 is operating (e.g., when the vehicle 1100 is traveling), the steering system 1154 (which may include, but is not limited to, a steering wheel) is used to steer the vehicle 1100 (e.g., along a desired path or route). In at least one embodiment, the steering system 1154 may receive signals from the steering actuator 1156. The steering wheel may be optional for fully automated (Level 5) functionality. In at least one embodiment, the brake sensor system 1146 may be used to operate the vehicle brakes in response to signals received from the brake actuator 1148 and / or brake sensors.
[0109] In at least one embodiment, one or more controllers 1136, which may include, but are not limited to, one or more system-on-chips (“SoCs”) (not shown in Figure 11A) and / or graphics processing units (“GPUs”), provide signals (e.g., representing commands) to one or more components and / or systems of vehicle 1100. For example, in at least one embodiment, controller 1136 may send signals to operate vehicle braking via brake actuator 1148, to operate steering system 1154 via one or more steering actuators 1156, and to operate propulsion system 1150 via one or more throttles / accelerators 1152. One or more controllers 1136 may include one or more onboard (e.g., integrated) computing devices (e.g., supercomputers) that process sensor signals and output operating commands (e.g., signals representing commands) to enable autonomous driving and / or assist a driver in driving vehicle 1100. In at least one embodiment, one or more controllers 1136 may include a first controller 1136 for autonomous driving functions, a second controller 1136 for functional safety functions, a third controller 1136 for artificial intelligence functions (e.g., computer vision), a fourth controller 1136 for infotainment functions, a fifth controller 1136 for redundancy in emergency situations, and / or other controllers. In at least one embodiment, a single controller 1136 may handle two or more of the above functions, two or more controllers 1136 may handle a single function, and / or any combination thereof.
[0110] In at least one embodiment, one or more controllers 1136 provide signals for controlling one or more components and / or systems of vehicle 1100 in response to sensor data (e.g., sensor inputs) received from one or more sensors. In at least one embodiment, sensor data can be received from sensors including, but not limited to, one or more Global Navigation Satellite System (“GNSS”) sensors 1158 (e.g., one or more Global Positioning System sensors), one or more RADAR sensors 1160, one or more ultrasonic sensors 1162, one or more LiDAR sensors 1164, one or more Inertial Measurement Unit (IMU) sensors 1166 (e.g., one or more accelerometers, one or more gyroscopes, one or more magnetic compasses, one or more magnetometers, etc.), one or more microphones 1196, one or more stereo cameras 1168, one or more wide-angle cameras 1170 (e.g., fisheye cameras), one or more infrared cameras 1172, one or more surround cameras 1174 (e.g., 360-degree cameras), long-range cameras (not shown in Figure 11A), mid-range cameras (not shown in Figure 11A), one or more speed sensors 1144 (e.g., for measuring the speed of vehicle 1100), one or more vibration sensors 1142, one or more steering sensors 1140, one or more brake sensors (e.g., as part of brake sensor system 1146), and / or other sensor types.
[0111] In at least one embodiment, one or more controllers 1136 may receive input (e.g., represented by input data) from the dashboard 1132 of the vehicle 1100 and provide output (e.g., represented by output data, display data, etc.) via a human-machine interface (“HMI”) display 1134, an audible annunciator, a speaker, and / or other components of the vehicle 1100. In at least one embodiment, the output may include information such as vehicle speed, velocity, time, and map data (e.g., a high-definition map, not shown in Figure 11A). For example, in at least one embodiment, the HMI display 1134 may display information about the presence of one or more objects (e.g., road signs, warning signs, traffic light changes, etc.) and / or information about driving maneuvers that the vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).
[0112] In at least one embodiment, vehicle 1100 further includes a network interface 1124 that can communicate over one or more networks using one or more wireless antennas 1126 and / or one or more modems. For example, in at least one embodiment, network interface 1124 may be capable of communicating via Long Term Evolution (“LTE”), Wideband Code Division Multiple Access (“WCDMA”), Universal Mobile Telecommunications System (“UMTS”), Global System for Mobile Communications (“GSM”), IMT-CDMA Multicarrier (“CDMA2000”), etc. In at least one embodiment, one or more wireless antennas 1126 may also enable communication between objects in the environment (e.g., vehicles, mobile devices) using one or more local area networks (e.g., Bluetooth, Bluetooth Low Energy (“LE”), Z-Wave, ZigBee, etc.) and / or one or more low-power wide area networks (hereinafter “LPWAN”) (e.g., LoRaWAN, SigFox, etc.).
[0113] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 can be used in the system of Figure 11A for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0114] In at least one embodiment, the inference and / or training logic 412, 414 can be used in the system of Figure 11A for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0115] Figure 11B illustrates examples of camera positions and fields of view for the autonomous vehicle 1100 of Figure 11A, according to at least one embodiment. In at least one embodiment, the cameras and their respective fields of view are one example embodiment and are not intended to be limiting. For example, in at least one embodiment, additional and / or alternative cameras may be included and / or the cameras may be located at different positions on the vehicle 1100.
[0116] In at least one embodiment, the camera types used for the cameras may include, but are not limited to, digital cameras adapted for use with components and / or systems of vehicle 1100. The cameras may operate at Automotive Safety Integrity Level (“ASIL”) B and / or another ASIL. In at least one embodiment, the camera types may support any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc. In at least one embodiment, the cameras may be capable of using a rolling shutter, a global shutter, another type of shutter, or a combination thereof. In at least one embodiment, the color filter array may include a red clear clear clear (“RCCC”) color filter array, a red clear clear blue (“RCCB”) color filter array, a red blue green clear (“RBGC”) color filter array, a Foveon X3 color filter array, a Bayer sensor (“RGGB”) color filter array, a monochrome sensor color filter array, and / or other types of color filter arrays. In at least one embodiment, clear-pixel cameras, such as cameras with RCCC, RCCB, and / or RBGC color filter arrays, may be used to increase light sensitivity.
[0117] In at least one embodiment, one or more cameras may be used to perform advanced driver assistance system (“ADAS”) functions (e.g., as part of a redundancy or fail-safe design). For example, in at least one embodiment, a multi-function mono camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlight control. In at least one embodiment, one or more cameras (e.g., all cameras) may simultaneously record and provide image data (e.g., video).
[0118] In at least one embodiment, one or more cameras may be mounted in a mounting assembly, such as a custom-designed (3D-printed) assembly, to cut out stray light and reflections from within the vehicle (e.g., dashboard reflections in the windshield mirror) that may interfere with the camera's image data capture capabilities. With regard to rearview mirror mounting assemblies, in at least one embodiment, the rearview mirror assembly may be custom 3D-printed so that the camera mounting plate matches the shape of the rearview mirror. In at least one embodiment, one or more cameras may be integrated into the rearview mirror. In at least one embodiment, for side-view cameras, one or more cameras may also be integrated within the four pillars at each corner of the cabin.
[0119] In at least one embodiment, a camera (e.g., a forward-facing camera) having a field of view that includes a portion of the environment in front of the vehicle 1100 can be used for surround view and, with the assistance of one or more controllers 1136 and / or control SoCs, to help identify the forward path and obstacles, thereby providing information critical to generating an occupancy grid and / or determining a preferred vehicle path. In at least one embodiment, the forward-facing camera can be used to perform many of the same ADAS functions as LiDAR, including but not limited to emergency braking, pedestrian detection, and collision avoidance. In at least one embodiment, the forward-facing camera can also be used for ADAS functions and systems, including but not limited to lane departure warning (“LDW”), adaptive cruise control (“ACC”), and / or other functions (e.g., traffic sign recognition).
[0120] In at least one embodiment, various cameras can be used in a forward-facing configuration, including, for example, a monocular camera platform that includes a complementary metal-oxide-semiconductor (“CMOS”) color imager. In at least one embodiment, a wide-angle camera 1170 can be used to perceive objects entering the field of view from the periphery (e.g., pedestrians, crossing traffic, or bicycles). Although only one wide-angle camera 1170 is shown in Figure 11B, in other embodiments the vehicle 1100 may have any number (including zero) of wide-angle cameras 1170. In at least one embodiment, any number of long-range cameras 1198 (e.g., a long-range stereo camera pair) can be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. In at least one embodiment, the long-range cameras 1198 can also be used for object detection and classification, as well as basic object tracking.
[0121] In at least one embodiment, any number of stereo cameras 1168 may also be included in a forward-facing configuration. In at least one embodiment, one or more stereo cameras 1168 may include an integrated control unit comprising a scalable processing unit that may provide programmable logic (“FPGA”) and a multi-core microprocessor with an integrated controller area network (“CAN”) or Ethernet interface on a single chip. In at least one embodiment, such a unit may be used to generate a 3D map of the environment of vehicle 1100, including distance estimates for all points in the image. In at least one embodiment, one or more stereo cameras 1168 may include, but are not limited to, a compact stereo vision sensor, which may include, but is not limited to, two camera lenses (one on the left and one on the right) and an image processing chip that can measure the distance from vehicle 1100 to a target object and use the generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo cameras 1168 may be used in addition to, or as an alternative to, those described herein.
[0122] In at least one embodiment, a camera (e.g., a side-view camera) having a field of view that includes a portion of the environment to the side of vehicle 1100 can be used for surround view, thereby providing information used to create and update the occupancy grid and to generate side impact collision warnings. For example, in at least one embodiment, one or more surround cameras 1174 (e.g., the four surround cameras 1174 shown in Figure 11B) can be positioned on vehicle 1100. One or more surround cameras 1174 can include, but are not limited to, any number and combination of wide-angle cameras 1170, one or more fisheye cameras, one or more 360-degree cameras, and / or the like. For example, in at least one embodiment, four fisheye cameras can be located at the front, rear, and sides of vehicle 1100. In at least one embodiment, vehicle 1100 can use three surround cameras 1174 (e.g., left, right, and rear) and can utilize one or more other cameras (e.g., a forward-facing camera) as a fourth surround-view camera.
[0123] In at least one embodiment, a camera (e.g., a rear-view camera) having a field of view that includes a portion of the environment behind vehicle 1100 can be used for parking assistance, surround view, rear collision warning, and creating and updating the occupancy grid. In at least one embodiment, a wide variety of cameras can be used, including but not limited to cameras that are also suitable as one or more forward-facing cameras (e.g., one or more long-range cameras 1198 and / or one or more mid-range cameras 1176, one or more stereo cameras 1168, one or more infrared cameras 1172, etc.), as described herein.
[0124] The inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the system of Figure 11B for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0125] In at least one embodiment, the inference and / or training logic 412, 414 can be used in the system of Figure 11B for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0126] Figure 11C is a block diagram illustrating an example system architecture for the autonomous vehicle 1100 of Figure 11A, according to at least one embodiment. In at least one embodiment, each of the one or more components, features, and systems of vehicle 1100 in Figure 11C is shown as connected via bus 1102. In at least one embodiment, bus 1102 may include, but is not limited to, a CAN data interface (which may alternatively be referred to herein as a “CAN bus”). In at least one embodiment, CAN may be a network within vehicle 1100 used to help control various features and functions of vehicle 1100, such as brake actuation, acceleration, braking, steering, windshield wipers, etc. In one embodiment, bus 1102 may be configured to have dozens or even hundreds of nodes, each node having its own unique identifier (e.g., a CAN ID). In at least one embodiment, bus 1102 can be read to find steering wheel angle, ground speed, engine revolutions per minute (“RPM”), button positions, and / or other vehicle status indicators. In at least one embodiment, bus 1102 may be an ASIL B compliant CAN bus.
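As a rough illustration of the addressing idea in this paragraph, the following sketch models a bus whose nodes are keyed by unique identifiers and can be read for status values. It is a toy model only: the class name `ToyBus`, the identifier values, and the signal names are hypothetical and are not taken from the CAN specification or from this description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBus:
    """Toy stand-in for bus 1102: nodes are keyed by a unique identifier."""
    nodes: dict = field(default_factory=dict)

    def attach(self, node_id: int, name: str) -> None:
        # Each node must have its own unique identifier.
        if node_id in self.nodes:
            raise ValueError(f"identifier {node_id:#x} already in use")
        self.nodes[node_id] = {"name": name, "value": None}

    def publish(self, node_id: int, value: float) -> None:
        # A node updates the value it reports on the bus.
        self.nodes[node_id]["value"] = value

    def read(self, name: str):
        # Look up a vehicle status indicator by (hypothetical) signal name.
        for node in self.nodes.values():
            if node["name"] == name:
                return node["value"]
        raise KeyError(name)


bus = ToyBus()
bus.attach(0x101, "steering_wheel_angle_deg")
bus.attach(0x102, "ground_speed_mps")
bus.attach(0x103, "engine_rpm")
bus.publish(0x101, -3.5)
bus.publish(0x102, 13.9)
bus.publish(0x103, 2100.0)
```

Reading `bus.read("engine_rpm")` returns the last published value, and attaching a second node with an identifier already in use raises an error, which mirrors the requirement that each node have its own identifier.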
[0127] In at least one embodiment, FlexRay and / or Ethernet may be used in addition to, or as an alternative to, CAN. In at least one embodiment, there may be any number of buses 1102, which may include, but are not limited to, zero or more CAN buses, zero or more FlexRay buses, zero or more Ethernet buses, and / or zero or more other types of buses using other protocols. In at least one embodiment, two or more buses 1102 may be used to perform different functions and / or may be used for redundancy. For example, a first bus 1102 may be used for a collision avoidance function, and a second bus 1102 may be used for actuation control. In at least one embodiment, each bus 1102 may communicate with any component of vehicle 1100, and two or more buses 1102 may communicate with the same component. In at least one embodiment, each of any number of system-on-chips (“SoCs”) 1104, each of one or more controllers 1136, and / or each computer within the vehicle may access the same input data (e.g., input from sensors of vehicle 1100) and may be connected to a common bus, such as a CAN bus.
[0128] In at least one embodiment, vehicle 1100 may include one or more controllers 1136, such as those described herein with respect to Figure 11A. The controller 1136 can be used for a variety of functions. In at least one embodiment, the controller 1136 can be coupled to any of the various other components and systems of the vehicle 1100, and can be used to control the vehicle 1100, the artificial intelligence of the vehicle 1100, the infotainment of the vehicle 1100, and / or the like.
[0129] In at least one embodiment, vehicle 1100 may include any number of SoCs 1104. Each of the SoCs 1104 may include, but is not limited to, one or more central processing units (“CPUs”) 1106, one or more graphics processing units (“GPUs”) 1108, one or more processors 1110, one or more caches 1112, one or more accelerators 1114, one or more data stores 1116, and / or other components and features not shown. In at least one embodiment, one or more SoCs 1104 may be used to control vehicle 1100 in a variety of platforms and systems. For example, in at least one embodiment, one or more SoCs 1104 may be combined in a system (e.g., the system of vehicle 1100) with a high-definition (“HD”) map 1122, which may obtain map refreshes and / or updates via network interface 1124 from one or more servers (not shown in Figure 11C).
[0130] In at least one embodiment, one or more CPUs 1106 may include CPU clusters or CPU complexes (which may alternatively be referred to herein as “CCPLEX”). In at least one embodiment, one or more CPUs 1106 may include multiple cores and / or a secondary (“L2”) cache. For example, in at least one embodiment, one or more CPUs 1106 may include eight cores in an intercoupled multiprocessor configuration. In at least one embodiment, one or more CPUs 1106 may include four dual-core clusters, each cluster having a dedicated L2 cache (e.g., 2MB L2 cache). In at least one embodiment, one or more CPUs 1106 (e.g., CCPLEX) may be configured to support simultaneous cluster operation, such that any combination of clusters of one or more CPUs 1106 can be active at any given time.
[0131] In at least one embodiment, one or more CPUs 1106 may implement power management functions, including but not limited to one or more of the following features: individual hardware blocks may be clock-gated automatically when idle to save dynamic power; each core clock may be gated when the core is not actively executing instructions due to execution of Wait for Interrupt (“WFI”) / Wait for Event (“WFE”) instructions; each core may be independently power-gated; each core cluster may be independently clock-gated when all cores are clock-gated or power-gated; and / or each core cluster may be independently power-gated when all cores are power-gated. In at least one embodiment, one or more CPUs 1106 may further implement an enhanced algorithm for managing power states, in which allowed power states and expected wake-up times are specified, and the hardware / microcode determines the best power state to enter for the core, cluster, and CCPLEX. In at least one embodiment, the processing cores may support simplified power state entry sequences in software, with the work offloaded to microcode.
[0132] In at least one embodiment, one or more GPUs 1108 may include integrated GPUs (or "iGPUs" herein). In at least one embodiment, one or more GPUs 1108 may be programmable and efficient for parallel workloads. In at least one embodiment, one or more GPUs 1108 may use an enhanced tensor instruction set. In one embodiment, one or more GPUs 1108 may include one or more streaming microprocessors, wherein each streaming microprocessor may include a Level 1 ("L1") cache (e.g., an L1 cache with at least 96KB of storage capacity), and two or more streaming microprocessors may share an L2 cache (e.g., an L2 cache with 512KB of storage capacity). In at least one embodiment, one or more GPUs 1108 may include at least eight streaming microprocessors. In at least one embodiment, one or more GPUs 1108 may use a computation application programming interface (API). In at least one embodiment, one or more GPUs 1108 may use one or more parallel computing platforms and / or programming models (e.g., NVIDIA's CUDA).
[0133] In at least one embodiment, one or more GPUs 1108 may be power-optimized for best performance in automotive and embedded use cases. For example, in one embodiment, one or more GPUs 1108 may be fabricated on a Fin field-effect transistor (“FinFET”) process. In at least one embodiment, each streaming microprocessor may include multiple mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores may be partitioned into four processing blocks. In at least one embodiment, each processing block may be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA Tensor cores for deep learning matrix arithmetic, a level-zero (“L0”) instruction cache, a warp scheduler, a dispatch unit, and / or a 64KB register file. In at least one embodiment, the streaming microprocessor may include independent parallel integer and floating-point data paths to provide efficient execution of workloads that mix computation and addressing operations. In at least one embodiment, the streaming microprocessor may include independent thread scheduling capabilities to enable finer-grained synchronization and cooperation between parallel threads. In at least one embodiment, the streaming microprocessor may include a combined L1 data cache and shared memory unit to improve performance while simplifying programming.
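The per-block core counts in this paragraph follow from dividing the per-microprocessor totals evenly across the four processing blocks. The short check below makes that arithmetic explicit; the function name `cores_per_block` is a hypothetical helper introduced only for illustration.

```python
# Per-streaming-microprocessor totals stated in the paragraph above.
FP32_CORES = 64
FP64_CORES = 32
PROCESSING_BLOCKS = 4


def cores_per_block(total_cores: int, blocks: int) -> int:
    """Split cores evenly across processing blocks; reject uneven splits."""
    if total_cores % blocks != 0:
        raise ValueError("cores do not divide evenly across blocks")
    return total_cores // blocks


fp32_per_block = cores_per_block(FP32_CORES, PROCESSING_BLOCKS)  # 16
fp64_per_block = cores_per_block(FP64_CORES, PROCESSING_BLOCKS)  # 8
```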
[0134] In at least one embodiment, one or more GPUs 1108 may include high-bandwidth memory (“HBM”) and / or a 16GB HBM2 memory subsystem to provide, in some examples, a peak memory bandwidth of approximately 900GB / s. In at least one embodiment, in addition to or instead of HBM memory, synchronous graphics random access memory (“SGRAM”) may be used, such as graphics double data rate type five synchronous random access memory (“GDDR5”).
[0135] In at least one embodiment, one or more GPUs 1108 may include unified memory technology. In at least one embodiment, address translation service (“ATS”) support may be used to allow one or more GPUs 1108 to directly access the page tables of one or more CPUs 1106. In at least one embodiment, when a memory management unit (“MMU”) of one or more GPUs 1108 experiences a miss, an address translation request may be sent to one or more CPUs 1106. In response, in at least one embodiment, one or more CPUs 1106 may look up the virtual-physical mapping of the address in their page tables and transfer the translation back to one or more GPUs 1108. In at least one embodiment, unified memory technology may allow a single unified virtual address space to be used for the memory of both one or more CPUs 1106 and one or more GPUs 1108, thereby simplifying the programming of one or more GPUs 1108 and the porting of applications to one or more GPUs 1108.
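The miss-and-translate flow described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual ATS protocol: the classes `CpuPageTable` and `GpuMmu`, the page numbers, and the dictionary-based translation cache are all hypothetical.

```python
class CpuPageTable:
    """Hypothetical CPU-side page table: virtual page -> physical page."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def translate(self, virtual_page):
        # A missing entry raises KeyError, standing in for a page fault.
        return self.mapping[virtual_page]


class GpuMmu:
    """Hypothetical GPU MMU with a small translation cache."""

    def __init__(self, cpu_page_table):
        self.cpu_page_table = cpu_page_table
        self.tlb = {}
        self.misses = 0

    def lookup(self, virtual_page):
        if virtual_page not in self.tlb:
            # Miss: send a translation request to the CPU and cache the reply.
            self.misses += 1
            self.tlb[virtual_page] = self.cpu_page_table.translate(virtual_page)
        return self.tlb[virtual_page]


page_table = CpuPageTable({0x10: 0xA0, 0x11: 0xB4})
mmu = GpuMmu(page_table)
first = mmu.lookup(0x10)   # miss, resolved through the CPU page table
second = mmu.lookup(0x10)  # hit, served from the GPU-side cache
```

Because both processors resolve addresses through the same page table, the same virtual page maps to the same physical page on either side, which is the property that allows a single virtual address space to be shared.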
[0136] In at least one embodiment, one or more GPUs 1108 may include any number of access counters that can track how frequently one or more GPUs 1108 access the memory of other processors. In at least one embodiment, one or more access counters can help ensure that memory pages are moved to the physical memory of the processor that accesses the pages most frequently, thereby improving the efficiency of memory ranges shared between processors.
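A minimal sketch of the counting idea, assuming a simple policy in which each page is homed on whichever processor has accessed it most often. The class `AccessCounters`, the processor labels, and the policy itself are illustrative assumptions; the description does not specify how the counters are implemented or when pages are migrated.

```python
from collections import Counter, defaultdict


class AccessCounters:
    """Counts accesses per (page, processor) and suggests a home for each page."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def record(self, page: int, processor: str) -> None:
        self.counts[page][processor] += 1

    def preferred_home(self, page: int) -> str:
        # The processor with the most recorded accesses to this page.
        return self.counts[page].most_common(1)[0][0]


counters = AccessCounters()
for processor in ["gpu", "gpu", "cpu", "gpu"]:
    counters.record(page=7, processor=processor)
counters.record(page=8, processor="cpu")
```

Here page 7 would be placed in GPU memory and page 8 in CPU memory.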
[0137] In at least one embodiment, one or more SoCs 1104 may include any number of caches 1112, including those described herein. For example, in at least one embodiment, one or more caches 1112 may include a Level 3 (“L3”) cache that is available to both one or more CPUs 1106 and one or more GPUs 1108 (e.g., that is connected to both the CPUs 1106 and the GPUs 1108). In at least one embodiment, one or more caches 1112 may include a write-back cache that can track the state of lines, for example by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). In at least one embodiment, the L3 cache may include 4 MB or more, depending on the embodiment, although smaller cache sizes may be used.
[0138] In at least one embodiment, one or more SoCs 1104 may include one or more accelerators 1114 (e.g., hardware accelerators, software accelerators, or combinations thereof). In at least one embodiment, one or more SoCs 1104 may include a hardware acceleration cluster, which may include optimized hardware accelerators and / or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4MB of SRAM) enables the hardware acceleration cluster to accelerate neural networks and other computations. In at least one embodiment, the hardware acceleration cluster may be used to supplement one or more GPUs 1108 and offload some tasks from one or more GPUs 1108 (e.g., freeing up more cycles of one or more GPUs 1108 to perform other tasks). In at least one embodiment, one or more accelerators 1114 may be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. In at least one embodiment, the CNNs may include region-based or regional convolutional neural networks (“RCNNs”) and Fast RCNNs (e.g., for object detection) or other types of CNNs.
[0139] In at least one embodiment, one or more accelerators 1114 (e.g., a hardware acceleration cluster) may include one or more deep learning accelerators (“DLAs”). One or more DLAs may include, but are not limited to, one or more Tensor Processing Units (“TPUs”), which may be configured to provide an additional 10 trillion operations per second for deep learning applications and inference. In at least one embodiment, the TPUs may be accelerators configured and optimized for performing image processing functions (e.g., for CNNs, RCNNs, etc.). One or more DLAs may be further optimized for a specific set of neural network types and floating-point operations, as well as for inference. In at least one embodiment, the design of one or more DLAs may provide higher performance per millimeter than a typical general-purpose GPU and typically far exceeds the performance of a CPU. In at least one embodiment, one or more TPUs may perform several functions, including a single-instance convolution function supporting, for example, INT8, INT16, and FP16 data types for both features and weights, as well as post-processor functions. In at least one embodiment, one or more DLAs can execute neural networks, particularly CNNs, quickly and efficiently on processed or unprocessed data for any of a variety of functions, including, but not limited to: CNNs for object recognition and detection using data from camera sensors; CNNs for distance estimation using data from camera sensors; CNNs for emergency vehicle detection, recognition, and identification using data from one or more microphones 1196; CNNs for face recognition and vehicle owner recognition using data from camera sensors; and / or CNNs for security and / or safety-related events.
[0140] In at least one embodiment, the DLA can perform any function of one or more GPUs 1108, and by using inference accelerators, for example, the designer can target one or more DLAs or one or more GPUs 1108 for any function. For example, in at least one embodiment, the designer can focus the CNN processing and floating-point operations on one or more DLAs, leaving other functions to one or more GPUs 1108 and / or one or more other accelerators 1114.
[0141] In at least one embodiment, one or more accelerators 1114 (e.g., a hardware acceleration cluster) may include programmable vision accelerators (“PVAs”), which may alternatively be referred to herein as computer vision accelerators. In at least one embodiment, one or more PVAs may be designed and configured to accelerate computer vision algorithms for advanced driver assistance systems (“ADAS”) 1138, autonomous driving, augmented reality (“AR”) applications, and / or virtual reality (“VR”) applications. In at least one embodiment, one or more PVAs may strike a balance between performance and flexibility. For example, in at least one embodiment, each of one or more PVAs may include, for example, but not limited to, any number of reduced instruction set computer (“RISC”) cores, direct memory access (“DMA”), and / or any number of vector processors.
[0142] In at least one embodiment, the RISC core can interact with an image sensor (e.g., the image sensor of any camera described herein), an image signal processor, and / or the like. In at least one embodiment, each RISC core may include any number of memories. In at least one embodiment, the RISC core may use any of a variety of protocols, depending on the embodiment. In at least one embodiment, the RISC core may execute a real-time operating system (“RTOS”). In at least one embodiment, the RISC core may be implemented using one or more integrated circuit devices, application-specific integrated circuits (“ASICs”), and / or storage devices. For example, in at least one embodiment, the RISC core may include an instruction cache and / or tightly coupled RAM.
[0143] In at least one embodiment, DMA enables components of the PVA to access system memory independently of one or more CPUs 1106. In at least one embodiment, DMA can support any number of features used to optimize the PVA, including but not limited to support for multidimensional addressing and / or circular addressing. In at least one embodiment, DMA can support up to six or more addressing dimensions, which may include, but are not limited to, block width, block height, block depth, horizontal block stepping, vertical block stepping, and / or depth stepping.
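To make the two addressing modes concrete, the sketch below computes the linear addresses touched by a two-dimensional block transfer and wraps an offset around a fixed-length buffer. It covers only block width and block height, not all of the dimensions listed above, and the function names and parameters are hypothetical rather than an actual DMA descriptor format.

```python
def block_addresses(base, width, height, row_stride):
    """Linear addresses of a width x height block inside a row-major buffer."""
    return [base + y * row_stride + x
            for y in range(height)
            for x in range(width)]


def circular_address(base, offset, buffer_len):
    """Wrap an offset around a fixed-length buffer (circular addressing)."""
    return base + (offset % buffer_len)


# A 3-wide, 2-high block starting at address 100 in a buffer with 16-element rows.
addrs = block_addresses(base=100, width=3, height=2, row_stride=16)
```

The block transfer visits addresses 100 to 102 on the first row and 116 to 118 on the second, so a tile of an image can be fetched without processor-side address computation. With circular addressing, an offset of 10 into an 8-element buffer wraps to element 2.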
[0144] In at least one embodiment, the vector processor may be a programmable processor designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In at least one embodiment, the PVA may include a PVA core and two vector processing subsystem partitions. In at least one embodiment, the PVA core may include a processor subsystem, a DMA engine (e.g., two DMA engines), and / or other peripherals. In at least one embodiment, the vector processing subsystem may serve as the main processing engine of the PVA and may include a vector processing unit (“VPU”), an instruction cache, and / or a vector memory (e.g., “VMEM”). In at least one embodiment, the VPU core may include a digital signal processor, such as a Single Instruction Multiple Data (“SIMD”) or Very Long Instruction Word (“VLIW”) digital signal processor. In at least one embodiment, the combination of SIMD and VLIW can improve throughput and speed.
[0145] In at least one embodiment, each vector processor may include an instruction cache and may be coupled to dedicated memory. As a result, in at least one embodiment, each vector processor may be configured to execute independently of other vector processors. In at least one embodiment, vector processors included in a particular PVA may be configured to employ data parallelism. For example, in at least one embodiment, multiple vector processors included in a single PVA may execute the same computer vision algorithm, but on different regions of an image. In at least one embodiment, vector processors included in a particular PVA may execute different computer vision algorithms simultaneously on the same image, or even execute different algorithms on a sequence of images or portions of images. In at least one embodiment, any number of PVAs may be included in the hardware acceleration cluster, and any number of vector processors may be included in each PVA. In at least one embodiment, the PVA may include additional error correction code (“ECC”) memory to enhance overall system safety.
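The data-parallel case, in which the same algorithm runs on different regions of one image, can be sketched as below. Worker threads stand in for vector processors purely for illustration; the thresholding step, the band-splitting scheme, and all function names are assumptions and do not describe how a PVA schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(image, parts):
    """Split a list-of-rows image into `parts` contiguous horizontal bands."""
    step = -(-len(image) // parts)  # ceiling division
    return [image[i:i + step] for i in range(0, len(image), step)]


def threshold(band, level=128):
    """The per-region algorithm: binarize pixel values."""
    return [[1 if px >= level else 0 for px in row] for row in band]


def run_data_parallel(image, workers=2):
    """Run the same algorithm on each band, then stitch the bands together."""
    bands = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(threshold, bands))  # map preserves band order
    return [row for band in results for row in band]


image = [[0, 200], [130, 90], [255, 10], [127, 128]]
binary = run_data_parallel(image)
```

Because each band is processed independently, the stitched result matches what a single pass over the whole image would produce.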
[0146] In at least one embodiment, one or more accelerators 1114 (e.g., a hardware acceleration cluster) may include an on-chip computer vision network and static random access memory (“SRAM”) for providing high-bandwidth, low-latency SRAM to one or more accelerators 1114. In at least one embodiment, the on-chip memory may include at least 4 MB of SRAM, comprising, for example, but not limited to, eight field-configurable memory blocks accessible to both the PVA and DLA. In at least one embodiment, each pair of memory blocks may include an Advanced Peripheral Bus (“APB”) interface, configuration circuitry, a controller, and a multiplexer. In at least one embodiment, any type of memory may be used. In at least one embodiment, the PVA and DLA may access the memory via a backbone providing high-speed access to the memory for both the PVA and DLA. In at least one embodiment, the backbone may include an on-chip computer vision network that interconnects the PVA and DLA to the memory (e.g., using an APB).
[0147] In at least one embodiment, the on-chip computer vision network may include an interface that determines that both the PVA and DLA provide ready and valid signals before transmitting any control signals / addresses / data. In at least one embodiment, the interface may provide separate phases and separate channels for transmitting control signals / addresses / data, as well as bursty communication for continuous data transmission. In at least one embodiment, although other standards and protocols may be used, the interface may conform to the International Organization for Standardization (“ISO”) 26262 or the International Electrotechnical Commission (“IEC”) 61508 standard.
[0148] In at least one embodiment, one or more SoCs 1104 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, the real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine the positions and extents of objects (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and / or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison with LiDAR data for purposes of localization and / or other functions, and / or for other uses.
[0149] In at least one embodiment, one or more accelerators 1114 (e.g., a hardware acceleration cluster) have a wide range of uses for autonomous driving. In at least one embodiment, the PVA can be a programmable vision accelerator that can be used for key processing stages in ADAS and autonomous vehicles. In at least one embodiment, the capabilities of the PVA are a good match for algorithmic domains that need predictable processing at low power and low latency. In other words, the PVA performs well on semi-dense or dense regular computation, even on small data sets, which may need predictable run times with low latency and low power consumption. In at least one embodiment, for autonomous vehicles (such as vehicle 1100), the PVA may be designed to run classical computer vision algorithms, as these can be efficient at object detection and operate on integer math.
[0150] For example, according to at least one embodiment of the technology, the PVA is used to perform computer stereo vision. In at least one embodiment, a semi-global matching-based algorithm may be used in some examples, although this is not intended to be limiting. In at least one embodiment, applications for Level 3-5 autonomous driving use motion estimation / stereo matching on the fly (e.g., structure from motion, pedestrian recognition, lane detection, etc.). In at least one embodiment, the PVA can perform computer stereo vision functions on inputs from two monocular cameras.
[0151] In at least one embodiment, the PVA can be used to perform dense optical flow. For example, in at least one embodiment, the PVA can process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In at least one embodiment, the PVA is used for time-of-flight depth processing, for example, by processing raw time-of-flight data to provide processed time-of-flight data.
[0152] In at least one embodiment, the DLA can be used to run any type of network to enhance control and driving safety, including, but not limited to, a neural network that outputs a confidence score for each object detection. In at least one embodiment, the confidence score can be represented or interpreted as a probability, or as providing a relative “weight” for each detection relative to other detections. In at least one embodiment, the confidence score enables the system to make further decisions about which detections should be considered true positives rather than false positives. For example, in at least one embodiment, the system can set a threshold for the confidence score and only consider detections exceeding the threshold as true positives. In embodiments using an Automatic Emergency Braking (“AEB”) system, a false positive would cause the vehicle to automatically perform emergency braking, which is obviously undesirable. In at least one embodiment, only a highly confident detection is therefore considered a trigger for AEB. In at least one embodiment, the DLA can run a neural network for regressing the confidence score value. In at least one embodiment, the neural network may take at least a subset of parameters as its input, such as bounding box dimensions, a ground plane estimate obtained (e.g., from another subsystem), output of one or more IMU sensors 1166 that correlates with the orientation of vehicle 1100, distance, and 3D position estimates of the object obtained from the neural network and / or other sensors (e.g., one or more LiDAR sensors 1164 or one or more RADAR sensors 1160).
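The thresholding step described above can be sketched as follows. This is a minimal illustration only: the `Detection` type, the function names, and the threshold value 0.9 are assumptions made for the example and are not taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # hypothetical class name of the detected object
    confidence: float   # confidence score in [0, 1], e.g. regressed by a network


def filter_true_positives(detections, threshold=0.9):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.confidence > threshold]


def should_trigger_aeb(detections, threshold=0.9):
    """Treat AEB as triggered only if at least one detection passes the threshold."""
    return len(filter_true_positives(detections, threshold)) > 0


if __name__ == "__main__":
    frame = [Detection("pedestrian", 0.95), Detection("shadow", 0.40)]
    print([d.label for d in filter_true_positives(frame)])
    print(should_trigger_aeb(frame))
```

A stricter threshold reduces false positives (and therefore spurious braking) at the cost of possibly discarding some true detections.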
[0153] In at least one embodiment, one or more SoCs 1104 may include one or more data storage devices 1116 (e.g., memory). In at least one embodiment, one or more data storage devices 1116 may be on-chip memory of one or more SoCs 1104, which may store neural networks to be executed on one or more GPUs 1108 and / or DLAs. In at least one embodiment, one or more data storage devices 1116 may have a sufficiently large capacity to store multiple instances of the neural network for redundancy and security. In at least one embodiment, one or more data storage devices 1112 may include L2 or L3 caches.
[0154] In at least one embodiment, one or more SoCs 1104 may include any number of processors 1110 (e.g., embedded processors). One or more processors 1110 may include a startup and power management processor, which may be a dedicated processor and subsystem for handling startup power and management functions, as well as associated security enforcement. In at least one embodiment, the startup and power management processor may be part of a startup sequence of one or more SoCs 1104 and may provide runtime power management services. In at least one embodiment, the startup and power management processor may provide clock and voltage programming, assistance with system low-power state transitions, management of thermal and temperature sensors of one or more SoCs 1104, and / or management of power states of one or more SoCs 1104. In at least one embodiment, each temperature sensor may be implemented as a ring oscillator whose output frequency is proportional to temperature, and one or more SoCs 1104 may use the ring oscillators to detect the temperature of one or more CPUs 1106, one or more GPUs 1108, and / or one or more accelerators 1114. In at least one embodiment, if it is determined that the temperature exceeds a threshold, the startup and power management processor may enter a temperature fault routine and place one or more SoCs 1104 into a lower power state and / or place the vehicle 1100 into a chauffeur-to-safe-stop mode (e.g., bring the vehicle 1100 to a safe stop).
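As a rough illustration of the temperature check described above, the sketch below converts a ring-oscillator frequency to a temperature and chooses an action. The calibration constant, the threshold, and the function names are hypothetical; the description states only that the output frequency is proportional to temperature and that a fault routine is entered above a threshold.

```python
# Hypothetical calibration: oscillator frequency per kelvin (not from the description).
HZ_PER_KELVIN = 1.0e4
# Hypothetical fault threshold in kelvin (roughly 105 degrees Celsius).
FAULT_THRESHOLD_K = 378.15


def temperature_from_frequency(freq_hz, hz_per_kelvin=HZ_PER_KELVIN):
    """Invert the proportional relation f = k * T to recover the temperature."""
    return freq_hz / hz_per_kelvin


def power_action(freq_hz, threshold_k=FAULT_THRESHOLD_K):
    """Return the action a power management processor might take for this reading."""
    temperature_k = temperature_from_frequency(freq_hz)
    if temperature_k > threshold_k:
        # Corresponds to entering the temperature fault routine.
        return "enter_fault_routine"
    return "normal"


if __name__ == "__main__":
    for reading_hz in (3.0e6, 4.0e6):
        print(reading_hz, temperature_from_frequency(reading_hz), power_action(reading_hz))
```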
[0155] In at least one embodiment, one or more processors 1110 may further include a set of embedded processors that can be used as an audio processing engine. In at least one embodiment, the audio processing engine may be an audio subsystem capable of providing full hardware support for multi-channel audio through multiple interfaces and a wide and flexible range of audio I / O interfaces. In at least one embodiment, the audio processing engine is a dedicated processor core with a digital signal processor with dedicated RAM.
[0156] In at least one embodiment, one or more processors 1110 may further include an always-on processor engine that can provide the necessary hardware features to support low-power sensor management and wake-up use cases. In at least one embodiment, the processor on the always-on processor engine may include, but is not limited to, a processor core, tightly coupled RAM, peripheral support (e.g., timers and interrupt controllers), various I / O controller peripherals, and routing logic.
[0157] In at least one embodiment, one or more processors 1110 may further include a safety cluster engine, which includes, but is not limited to, a dedicated processor subsystem for handling safety management of automotive applications. In at least one embodiment, the safety cluster engine may include, but is not limited to, two or more processor cores, tightly coupled RAM, supporting peripherals (e.g., timers, interrupt controllers, etc.) and / or routing logic. In at least one embodiment, in safety mode, the two or more cores may operate in lockstep mode and may function as a single core with comparison logic for detecting any differences between their operations. In at least one embodiment, one or more processors 1110 may further include a real-time camera engine, which may include, but is not limited to, a dedicated processor subsystem for handling real-time camera management. In at least one embodiment, one or more processors 1110 may further include a high dynamic range signal processor, which may include, but is not limited to, an image signal processor, which is a hardware engine that is part of the camera processing pipeline.
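The lockstep idea can be illustrated with a small software analogy. This is only a sketch: in the described hardware the comparison is performed by dedicated logic, whereas here two hypothetical callables stand in for the two cores, and the `LockstepMismatch` exception is an invented name.

```python
class LockstepMismatch(Exception):
    """Raised when the two redundant computations disagree."""


def lockstep_execute(core_a, core_b, value):
    """Run the same operation on two 'cores' and compare the results.

    The pair behaves like a single core: one result is returned,
    and any divergence is reported instead of being silently used.
    """
    result_a = core_a(value)
    result_b = core_b(value)
    if result_a != result_b:
        raise LockstepMismatch(f"core A produced {result_a!r}, core B produced {result_b!r}")
    return result_a


if __name__ == "__main__":
    def healthy(x):
        return x * 2

    def faulty(x):
        return x * 2 + 1  # simulated fault on one core

    print(lockstep_execute(healthy, healthy, 21))
    try:
        lockstep_execute(healthy, faulty, 21)
    except LockstepMismatch as err:
        print("fault detected:", err)
```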
[0158] In at least one embodiment, one or more processors 1110 may include a video image compositor, which may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions required by the video playback application to produce the final image for the player window. In at least one embodiment, the video image compositor may perform lens distortion correction on one or more wide-angle cameras 1170, one or more surround cameras 1174, and / or one or more cabin monitoring camera sensors. In at least one embodiment, the cabin monitoring camera sensors are preferably monitored by a neural network running on another instance of SoC 1104, the neural network being configured to recognize cabin events and respond accordingly. In at least one embodiment, the cabin system may perform, but is not limited to, lip reading to activate cellular service and place phone calls, dictate emails, change the vehicle's destination, activate or change the vehicle's infotainment system and settings, or provide voice-activated web browsing. In at least one embodiment, certain functions are available to the driver when the vehicle is operating in autonomous mode, and are otherwise disabled.
[0159] In at least one embodiment, the video image compositor may include enhanced temporal noise reduction for both spatial and temporal noise reduction. For example, in at least one embodiment, when motion occurs in the video, the noise reduction weights spatial information appropriately, reducing the weight of information provided by adjacent frames. In at least one embodiment, when the image or a portion of the image does not contain motion, temporal noise reduction performed by the video image compositor may use information from previous images to reduce noise in the current image.
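One way to picture the motion-dependent weighting is the sketch below, which blends each pixel with the co-located pixel of the previous frame and lowers the temporal weight as the frame difference grows. It covers only the temporal part (no spatial filter is applied), and the linear weighting rule, the parameter names, and their default values are assumptions for illustration, not the method of the description.

```python
def denoise_pixel(current, previous, motion_threshold=10.0, max_temporal_weight=0.5):
    """Blend in the previous frame's pixel, less so when motion is likely.

    The absolute frame difference is used as a crude motion indicator:
    the temporal weight falls linearly from max_temporal_weight (no difference)
    to zero (difference at or above motion_threshold).
    """
    difference = abs(current - previous)
    weight = max_temporal_weight * max(0.0, 1.0 - difference / motion_threshold)
    return (1.0 - weight) * current + weight * previous


def denoise_frame(current, previous, **kwargs):
    """Apply denoise_pixel to every pixel of two equally sized 2D frames."""
    return [
        [denoise_pixel(c, p, **kwargs) for c, p in zip(current_row, previous_row)]
        for current_row, previous_row in zip(current, previous)
    ]


if __name__ == "__main__":
    previous_frame = [[100.0, 104.0, 200.0]]
    current_frame = [[100.0, 100.0, 100.0]]
    # Static pixel, slightly noisy pixel, and a pixel with strong motion.
    print(denoise_frame(current_frame, previous_frame))
```

With these defaults a static pixel is averaged with its predecessor, while a pixel whose value changed by more than the threshold is left untouched, so moving content is not smeared across frames.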
[0160] In at least one embodiment, the video image compositor can also be configured to perform stereo rectification on the input stereo lens frames. In at least one embodiment, when an operating system desktop is in use, the video image compositor can also be used for user interface compositing, so that one or more GPUs 1108 are not required to continuously render new surfaces. In at least one embodiment, when one or more GPUs 1108 are powered and actively performing 3D rendering, the video image compositor can be used to offload one or more GPUs 1108 to improve performance and responsiveness.
[0161] In at least one embodiment, one or more SoCs 1104 may further include a Mobile Industry Processor Interface (“MIPI”) camera serial interface, a high-speed interface, and / or a video input block that can be used for camera video input and associated pixel input functions. In at least one embodiment, one or more SoCs 1104 may further include an input / output controller that can be software controlled and can be used to receive I / O signals not assigned to a specific role.
[0162] In at least one embodiment, one or more SoCs 1104 may further include a wide range of peripheral interfaces to enable communication with peripheral devices, audio encoders / decoders (“codecs”), power management and / or other devices. One or more SoCs 1104 may be used to process data from cameras (e.g., connected via Gigabit Multimedia Serial Link and Ethernet), data from sensors (e.g., one or more LiDAR sensors 1164, one or more RADAR sensors 1160, etc., which may be connected via Ethernet), data from bus 1102 (e.g., vehicle 1100 speed, steering wheel position, etc.), data from one or more GNSS sensors 1158 (e.g., connected via Ethernet or CAN bus), etc. In at least one embodiment, one or more SoCs 1104 may further include a dedicated high-performance mass storage controller, which may include its own DMA engine and may be used to free one or more CPUs 1106 from routine data management tasks.
[0163] In at least one embodiment, one or more SoCs 1104 can be an end-to-end platform with a flexible architecture spanning automation levels 3-5, providing a comprehensive functional safety architecture that leverages and effectively utilizes computer vision and ADAS technologies to achieve diversity and redundancy. This provides a platform offering a flexible and reliable driving software stack as well as deep learning tools. In at least one embodiment, one or more SoCs 1104 can be faster, more reliable, and even more energy and space efficient than conventional systems. For example, in at least one embodiment, one or more accelerators 1114, when combined with one or more CPUs 1106, one or more GPUs 1108, and one or more data storage devices 1116, can provide a fast and efficient platform for Level 3-5 autonomous vehicles.
[0164] In at least one embodiment, the computer vision algorithm can be executed on a CPU, which can be configured using a high-level programming language (e.g., C) to execute multiple processing algorithms on a variety of visual data. However, in at least one embodiment, the CPU typically cannot meet the performance requirements of many computer vision applications, such as performance requirements related to execution time and power consumption. In at least one embodiment, many CPUs cannot execute complex object detection algorithms in real time, which are used in automotive ADAS applications and practical Level 3-5 autonomous vehicles.
[0165] The embodiments described herein allow for the simultaneous and / or sequential execution of multiple neural networks and allow for the combination of results to achieve Level 3-5 autonomous driving capabilities. For example, in at least one embodiment, a CNN executed on a DLA or discrete GPU (e.g., one or more GPUs 1120) may include text and word recognition, thereby allowing a supercomputer to read and understand traffic signs, including signs for which the neural network has not yet been specifically trained. In at least one embodiment, the DLA may also include a neural network capable of recognizing, interpreting, and providing semantic understanding of symbols, and passing this semantic understanding to a path planning module running on a CPU Complex.
[0166] In at least one embodiment, multiple neural networks can run simultaneously for Level 3, 4, or 5 driving. For example, in at least one embodiment, a warning sign reading “Caution: flashing lights indicate icy conditions,” accompanied by an electric light, can be interpreted independently or jointly by multiple neural networks. In at least one embodiment, the sign itself can be identified as a traffic sign by a first deployed neural network (e.g., a trained neural network), and the text “flashing lights indicate icy conditions” can be interpreted by a second deployed neural network, which informs the vehicle’s path planning software (preferably executed on the CPU Complex) that icy conditions exist when flashing lights are detected. In at least one embodiment, flashing lights can be identified by operating a third deployed neural network across multiple frames, informing the vehicle’s path planning software of the presence (or absence) of flashing lights. In at least one embodiment, all three neural networks can run simultaneously, for example within the DLA and / or on one or more GPUs 1108.
[0167] In at least one embodiment, the CNN for facial recognition and vehicle owner identification can use data from camera sensors to identify the presence of an authorized driver and / or the owner of vehicle 1100. In at least one embodiment, an always-on sensor processing engine can be used to unlock the vehicle and turn on the lights when the owner approaches the driver's door, and, in security mode, to disable the vehicle when the owner leaves it. In this way, one or more SoCs 1104 provide protection against theft and / or carjacking.
[0168] In at least one embodiment, the CNN for emergency vehicle detection and identification can use data from microphone 1196 to detect and identify emergency vehicle sirens. In at least one embodiment, one or more SoCs 1104 use the CNN to classify environmental and urban sounds, as well as visual data. In at least one embodiment, the CNN running on the DLA is trained to identify the relative approach speed of emergency vehicles (e.g., by using the Doppler effect). In at least one embodiment, the CNN can also be trained to identify emergency vehicles specific to the area in which the vehicle is operating, as identified by one or more GNSS sensors 1158. In at least one embodiment, when operating in Europe, the CNN will seek to detect European sirens, while in North America, the CNN will seek to identify only North American sirens. In at least one embodiment, once an emergency vehicle is detected, a control program can be used, with the assistance of one or more ultrasonic sensors 1162, to execute an emergency vehicle safety routine, slow the vehicle, pull the vehicle to the side of the road, stop, and / or leave the vehicle idling until one or more emergency vehicles pass.
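The physical relation behind the Doppler cue mentioned above can be written in closed form. The sketch below is not the trained CNN of the description; it only shows the acoustic Doppler formula under simplifying assumptions that are not in the source: the emitted siren frequency is known, the listener is stationary, and the speed of sound is 343 m/s.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def approach_speed(emitted_hz, observed_hz, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Estimate the source's closing speed from the Doppler shift.

    For a stationary listener and a source moving straight toward it,
    f_observed = f_emitted * c / (c - v), so v = c * (1 - f_emitted / f_observed).
    A positive result means the source is approaching; negative means receding.
    """
    return speed_of_sound * (1.0 - emitted_hz / observed_hz)


if __name__ == "__main__":
    # A 900 Hz siren tone heard at 1000 Hz implies an approaching source.
    print(round(approach_speed(900.0, 1000.0), 2), "m/s")
```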
[0169] In at least one embodiment, vehicle 1100 may include one or more CPUs 1118 (e.g., one or more discrete CPUs or one or more dCPUs) that may be coupled to one or more SoCs 1104 via high-speed interconnects (e.g., PCIe). In at least one embodiment, one or more CPUs 1118 may include x86 processors, and one or more CPUs 1118 may be used to perform any of a variety of functions, such as arbitrating potentially inconsistent results between ADAS sensors and one or more SoCs 1104, and / or monitoring the status and health of one or more controllers 1136 and / or an infotainment system on a chip (“infotainment SoC”) 1130.
[0170] In at least one embodiment, vehicle 1100 may include one or more GPUs 1120 (e.g., one or more discrete GPUs or one or more dGPUs) that may be coupled to one or more SoCs 1104 via a high-speed interconnect (e.g., NVIDIA's NVLINK). In at least one embodiment, one or more GPUs 1120 may provide additional artificial intelligence capabilities, such as by executing redundant and / or different neural networks, and may be used to train and / or update the neural networks based at least in part on inputs from sensors of vehicle 1100 (e.g., sensor data).
[0171] In at least one embodiment, vehicle 1100 may further include a network interface 1124, which may include, but is not limited to, one or more wireless antennas 1126 (e.g., one or more wireless antennas 1126 for different communication protocols, such as cellular antennas, Bluetooth antennas, etc.). In at least one embodiment, network interface 1124 may be used to enable wireless connectivity with other vehicles and / or computing devices (e.g., passenger client devices) via an internet cloud (e.g., employing servers and / or other network devices). In at least one embodiment, for communication with other vehicles, a direct link and / or an indirect link (e.g., via a network and the internet) may be established between vehicle 1100 and another vehicle. In at least one embodiment, a vehicle-to-vehicle communication link may be used to provide a direct link. The vehicle-to-vehicle communication link may provide vehicle 1100 with information about vehicles near vehicle 1100 (e.g., vehicles in front, to the side, and / or behind vehicle 1100). In at least one embodiment, the foregoing functionality may be part of a cooperative adaptive cruise control function of vehicle 1100.
[0172] In at least one embodiment, network interface 1124 may include a System-on-Chip (SoC) that provides modulation and demodulation functions and enables one or more controllers 1136 to communicate over a wireless network. In at least one embodiment, network interface 1124 may include a radio frequency (RF) front-end for up-conversion from baseband to RF and down-conversion from RF to baseband. In at least one embodiment, frequency conversion may be performed in any technically feasible manner. For example, frequency conversion may be performed using known processes and / or using a superheterodyne process. In at least one embodiment, the RF front-end functionality may be provided by a separate chip. In at least one embodiment, the network interface may include wireless functions for communication over LTE, WCDMA, UMTS, GSM, CDMA2000, Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and / or other wireless protocols.
[0173] In at least one embodiment, vehicle 1100 may further include one or more data storage units 1128, which may include, but are not limited to, off-chip (e.g., off one or more SoCs 1104) storage. In at least one embodiment, one or more data storage units 1128 may include, but are not limited to, one or more storage elements, including RAM, SRAM, dynamic random access memory (“DRAM”), video random access memory (“VRAM”), flash memory, hard disks and / or other components and / or devices capable of storing at least one bit of data.
[0174] In at least one embodiment, vehicle 1100 may further include one or more GNSS sensors 1158 (e.g., GPS and / or assisted GPS sensors) to assist in mapping, perception, occupancy grid generation, and / or path planning functions. In at least one embodiment, any number of GNSS sensors 1158 may be used, including, for example, but not limited to, a GPS using a USB connector with an Ethernet-to-serial (e.g., RS-232) bridge.
[0175] In at least one embodiment, vehicle 1100 may further include one or more RADAR sensors 1160. One or more RADAR sensors 1160 can be used by vehicle 1100 for long-range vehicle detection, even in darkness and / or inclement weather conditions. In at least one embodiment, the RADAR functional safety level may be ASIL B. One or more RADAR sensors 1160 may use CAN and / or bus 1102 (e.g., to transmit data generated by one or more RADAR sensors 1160) for control and to access object tracking data, and in some examples may use Ethernet to access raw data. In at least one embodiment, a wide variety of RADAR sensor types can be used. For example, but not limited to, one or more of the RADAR sensors 1160 may be suitable for front, rear, and side RADAR use. In at least one embodiment, one or more RADAR sensors 1160 are one or more pulse Doppler RADAR sensors.
[0176] In at least one embodiment, one or more RADAR sensors 1160 may include different configurations, such as long-range with a narrow field of view, short-range with a wide field of view, short-range side coverage, etc. In at least one embodiment, the long-range RADAR can be used for adaptive cruise control functions. In at least one embodiment, the long-range RADAR system can provide a wide field of view achieved through two or more independent scans (e.g., within a 250m range). In at least one embodiment, one or more RADAR sensors 1160 can help distinguish between stationary and moving objects and can be used by the ADAS system 1138 for emergency braking assistance and forward collision warning. One or more RADAR sensors 1160 included in the long-range RADAR system may include, but are not limited to, a monostatic multimode RADAR with multiple (e.g., six or more) fixed RADAR antennas and high-speed CAN and FlexRay interfaces. In at least one embodiment, with six antennas, the four central antennas can create a focused beam pattern designed to record the surroundings of vehicle 1100 at higher speeds with minimal interference from traffic in adjacent lanes. In at least one embodiment, the other two antennas can expand the field of view, thereby enabling rapid detection of vehicles entering or leaving the lane of vehicle 1100.
[0177] In at least one embodiment, as an example, a mid-range RADAR system may include, for example, a range of up to 160m (front) or 80m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). In at least one embodiment, a short-range RADAR system may include, but is not limited to, any number of RADAR sensors 1160 designed to be mounted at both ends of the rear bumper. When mounted at both ends of the rear bumper, in at least one embodiment, the RADAR sensor system may generate two beams that continuously monitor the rearward direction of the vehicle and nearby blind spots. In at least one embodiment, the short-range RADAR system may be used in ADAS system 1138 for blind spot detection and / or lane change assistance.
[0178] In at least one embodiment, vehicle 1100 may further include one or more ultrasonic sensors 1162. One or more ultrasonic sensors 1162, which may be positioned at the front, rear, and / or sides of vehicle 1100, can be used for parking assistance and / or for creating and updating occupancy grids. In at least one embodiment, a wide variety of ultrasonic sensors 1162 can be used, and different ultrasonic sensors 1162 can be used for different detection ranges (e.g., 2.5m, 4m). In at least one embodiment, the ultrasonic sensors 1162 can operate at ASIL B functional safety level.
[0179] In at least one embodiment, vehicle 1100 may include one or more LiDAR sensors 1164. The one or more LiDAR sensors 1164 may be used for object and pedestrian detection, emergency braking, collision avoidance, and / or other functions. In at least one embodiment, the one or more LiDAR sensors 1164 may be of functional safety level ASIL B. In at least one embodiment, vehicle 1100 may include multiple (e.g., two, four, six, etc.) LiDAR sensors 1164 that can use Ethernet (e.g., providing data to a Gigabit Ethernet switch).
[0180] In at least one embodiment, one or more LiDAR sensors 1164 may be able to provide a list of objects and their distances for a 360-degree field of view. In at least one embodiment, one or more commercially available LiDAR sensors 1164 may, for example, have an advertised range of approximately 100m, an accuracy of 2cm-3cm, and support for a 100Mbps Ethernet connection. In at least one embodiment, one or more non-protruding LiDAR sensors 1164 may be used. In such embodiments, one or more LiDAR sensors 1164 may be implemented as small devices that can be embedded in the front, rear, sides, and / or corners of vehicle 1100. In such embodiments, one or more LiDAR sensors 1164 can provide a horizontal field of view of up to 120 degrees and a vertical field of view of 35 degrees, with a range of 200m even for objects with low reflectivity. In at least one embodiment, one or more forward-facing LiDAR sensors 1164 may be configured for a horizontal field of view between 45 degrees and 135 degrees.
[0181] In at least one embodiment, LIDAR technology such as 3D flash LIDAR may also be used. 3D flash LIDAR uses a laser flash as a transmission source to illuminate the surroundings of vehicle 1100 up to approximately 200m. In at least one embodiment, the flash LIDAR unit includes, but is not limited to, a receiver that records the laser pulse transit time and the reflected light on each pixel, which in turn corresponds to the range from the vehicle 1100 to the object. In at least one embodiment, flash LIDAR can allow the generation of highly accurate and distortion-free images of the surroundings with each laser flash. In at least one embodiment, four flash LIDAR sensors may be deployed, one on each side of the vehicle 1100.
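The relation between per-pixel pulse transit time and range is the standard time-of-flight formula, range = c · t / 2. The sketch below assumes that the recorded time is the round-trip time (out to the object and back); the function names and the sample values are illustrative only.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum


def range_from_round_trip(time_s):
    """Convert a round-trip pulse time to a one-way range in meters."""
    return SPEED_OF_LIGHT_M_S * time_s / 2.0


def range_image(transit_times):
    """Convert a 2D grid of per-pixel round-trip times into a grid of ranges."""
    return [[range_from_round_trip(t) for t in row] for row in transit_times]


if __name__ == "__main__":
    # Two pixels: one return after about 0.67 microseconds, one after about 1.33.
    times = [[0.667e-6, 1.334e-6]]
    print([[round(r, 1) for r in row] for row in range_image(times)])
```

At the roughly 200m range mentioned above, the round trip takes a little over a microsecond, which is why the receiver needs nanosecond-scale timing to resolve distances finely.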
[0182] In at least one embodiment, the 3D flash LiDAR system includes, but is not limited to, a solid-state 3D staring-array LiDAR camera with no moving parts other than a fan (e.g., a non-scanning LiDAR device). In at least one embodiment, the flash LiDAR device can use a 5 nanosecond Class I (eye-safe) laser pulse per frame and can capture reflected laser light in the form of a 3D ranging point cloud and co-registered intensity data.
[0183] In at least one embodiment, the vehicle may further include one or more IMU sensors 1166. In at least one embodiment, one or more IMU sensors 1166 may be located at the center of the rear axle of the vehicle 1100. In at least one embodiment, one or more IMU sensors 1166 may include, for example, but not limited to, one or more accelerometers, one or more magnetometers, one or more gyroscopes, one or more magnetic compasses, and / or other sensor types. In at least one embodiment, for example in a six-axis application, one or more IMU sensors 1166 may include, but are not limited to, accelerometers and gyroscopes. In at least one embodiment, for example in a nine-axis application, one or more IMU sensors 1166 may include, but are not limited to, accelerometers, gyroscopes, and magnetometers.
[0184] In at least one embodiment, one or more IMU sensors 1166 may be implemented as a miniature, high-performance GPS-assisted inertial navigation system (“GPS / INS”) combining a microelectromechanical system (“MEMS”) inertial sensor, a high-sensitivity GPS receiver, and an advanced Kalman filtering algorithm to provide position, velocity, and attitude estimates. In at least one embodiment, one or more IMU sensors 1166 may enable vehicle 1100 to estimate heading without requiring input from a magnetic sensor, by directly observing and correlating velocity changes from GPS with one or more IMU sensors 1166. In at least one embodiment, one or more IMU sensors 1166 and one or more GNSS sensors 1158 may be combined in a single integrated unit.
[0185] In at least one embodiment, vehicle 1100 may include one or more microphones 1196 placed inside and / or around vehicle 1100. In at least one embodiment, in addition, one or more microphones 1196 may be used for emergency vehicle detection and identification.
[0186] In at least one embodiment, vehicle 1100 may further include any number of camera types, including one or more stereo cameras 1168, one or more wide-angle cameras 1170, one or more infrared cameras 1172, one or more surround cameras 1174, one or more long-range cameras 1198, one or more mid-range cameras 1176, and / or other camera types. In at least one embodiment, the cameras can be used to capture image data around the entire perimeter of vehicle 1100. In at least one embodiment, the type of camera used depends on vehicle 1100. In at least one embodiment, any combination of camera types can be used to provide the necessary coverage around vehicle 1100. In at least one embodiment, the number of cameras can vary depending on the embodiment. For example, in at least one embodiment, vehicle 1100 may include six cameras, seven cameras, ten cameras, twelve cameras, or another number of cameras. The cameras may support, for example and without limitation, Gigabit Multimedia Serial Link (“GMSL”) and / or Gigabit Ethernet. In at least one embodiment, each camera may be as described in more detail previously herein with reference to Figure 11A and Figure 11B.
[0187] In at least one embodiment, vehicle 1100 may further include one or more vibration sensors 1142. One or more vibration sensors 1142 can measure vibrations of components of vehicle 1100 (e.g., axles). For example, in at least one embodiment, changes in vibration can indicate changes in road surface conditions. In at least one embodiment, when two or more vibration sensors 1142 are used, differences between vibrations can be used to determine road surface friction or slippage (e.g., when there is a vibration difference between a power drive axle and a free-rotating axle).
[0188] In at least one embodiment, vehicle 1100 may include ADAS system 1138. ADAS system 1138 may include, but is not limited to, SoC. In at least one embodiment, ADAS system 1138 may include, but is not limited to, any number of autonomous / adaptive / automatic cruise control (“ACC”) systems, cooperative adaptive cruise control (“CACC”) systems, forward collision warning (“FCW”) systems, automatic emergency braking (“AEB”) systems, lane departure warning (“LDW”) systems, lane keeping assist (“LKA”) systems, blind spot warning (“BSW”) systems, rear cross traffic warning (“RCTW”) systems, collision warning (“CW”) systems, lane centering (“LC”) systems, and / or other systems, features, and / or functions, and combinations thereof.
[0189] In at least one embodiment, the ACC system may use one or more RADAR sensors 1160, one or more LiDAR sensors 1164, and / or any number of cameras. In at least one embodiment, the ACC system may include a longitudinal ACC system and / or a lateral ACC system. In at least one embodiment, the longitudinal ACC system monitors and controls the distance to the vehicle immediately ahead of vehicle 1100 and automatically adjusts the speed of vehicle 1100 to maintain a safe distance from the vehicle ahead. In at least one embodiment, the lateral ACC system performs distance keeping and advises vehicle 1100 to change lanes when necessary. In at least one embodiment, lateral ACC is related to other ADAS applications, such as LC and CW.
[0190] In at least one embodiment, the CACC system uses information from other vehicles, which may be received via network interface 1124 and / or one or more wireless antennas 1126, either directly from other vehicles over a wireless link or indirectly over a network connection (e.g., via the Internet). In at least one embodiment, the direct link may be provided by a vehicle-to-vehicle (“V2V”) communication link, while the indirect link may be provided by an infrastructure-to-vehicle (“I2V”) communication link. Typically, the V2V communication concept provides information about the immediately preceding vehicle (e.g., a vehicle directly ahead of vehicle 1100 and in the same lane), while the I2V communication concept provides information about traffic further ahead. In at least one embodiment, the CACC system may include one or both of the I2V and V2V information sources. In at least one embodiment, given information about vehicles ahead of vehicle 1100, the CACC system can be more reliable and has the potential to improve traffic flow smoothness and reduce road congestion.
[0191] In at least one embodiment, the FCW system is designed to warn the driver of danger so that the driver can take corrective action. In at least one embodiment, the FCW system uses a forward-facing camera and / or one or more RADAR sensors 1160 coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to driver feedback, such as a display, speaker, and / or vibration components. In at least one embodiment, the FCW system can provide warnings, for example, in the form of sounds, visual warnings, vibrations, and / or rapid braking pulses.
[0192] In at least one embodiment, the AEB system detects an impending forward collision with another vehicle or other object and can automatically apply the brakes if the driver does not take corrective action within a specified time or distance parameter. In at least one embodiment, the AEB system may use one or more forward-facing cameras and / or one or more RADAR sensors 1160 coupled to a dedicated processor, DSP, FPGA, and / or ASIC. In at least one embodiment, when the AEB system detects a hazard, it typically first warns the driver to take corrective action to avoid a collision, and if the driver does not take corrective action, the AEB system may automatically apply the brakes to attempt to prevent or at least mitigate the effects of the predicted collision. In at least one embodiment, the AEB system may include techniques such as dynamic brake support and / or crash-imminent braking.
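The warn-then-brake sequence described above can be sketched as a staged decision on time to collision. This is a minimal illustration, assuming that the "specified time parameter" is expressed as two time-to-collision thresholds; the thresholds, the `AebAction` enum, and the function names are hypothetical and do not come from this disclosure.

```python
import math
from enum import Enum


class AebAction(Enum):
    NONE = "none"
    WARN = "warn"
    BRAKE = "brake"


def time_to_collision(range_m: float, closing_speed_mps: float) -> float:
    """Seconds until impact at the current closing speed (inf if not closing)."""
    if closing_speed_mps <= 0.0:
        return math.inf
    return range_m / closing_speed_mps


def aeb_decide(range_m: float,
               closing_speed_mps: float,
               driver_reacted: bool,
               warn_ttc_s: float = 2.5,    # assumed warning threshold
               brake_ttc_s: float = 1.2    # assumed braking threshold
               ) -> AebAction:
    """Warn first; brake automatically only if the driver has not reacted."""
    ttc = time_to_collision(range_m, closing_speed_mps)
    if ttc <= brake_ttc_s and not driver_reacted:
        return AebAction.BRAKE
    if ttc <= warn_ttc_s:
        return AebAction.WARN
    return AebAction.NONE
```

For example, an object 20 m ahead closing at 10 m/s gives a time to collision of 2 s, so this sketch returns `AebAction.WARN`; at 10 m with no driver reaction it returns `AebAction.BRAKE`.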
[0193] In at least one embodiment, when vehicle 1100 crosses lane markings, the LDW system provides visual, auditory, and / or tactile warnings, such as steering wheel or seat vibrations, to alert the driver. In at least one embodiment, the LDW system is inactive when the driver indicates intentional lane departure, such as by activating turn signals. In at least one embodiment, the LDW system may use a front-facing camera coupled to a dedicated processor, DSP, FPGA, and / or ASIC, which is electrically coupled to driver feedback such as a display, speaker, and / or vibration components. In at least one embodiment, the LKA system is a variant of the LDW system. If vehicle 1100 begins to leave the lane, the LKA system provides steering input or braking to correct vehicle 1100.
[0194] In at least one embodiment, the BSW system detects and warns the driver of a vehicle in the blind spot. In at least one embodiment, the BSW system can provide visual, auditory, and / or tactile alerts to indicate that merging or changing lanes is unsafe. In at least one embodiment, the BSW system can provide additional warnings when the driver uses the turn signal. In at least one embodiment, the BSW system can use one or more rear-facing cameras and / or one or more RADAR sensors 1160 coupled to a dedicated processor, DSP, FPGA, and / or ASIC, electrically coupled to driver feedback, such as a display, speaker, and / or vibration assembly.
[0195] In at least one embodiment, the RCTW system can provide visual, auditory, and / or tactile notifications when an object is detected outside the range of the rear camera while the vehicle 1100 is reversing. In at least one embodiment, the RCTW system includes an AEB system to ensure that the vehicle brakes are applied to avoid a collision. In at least one embodiment, the RCTW system may use one or more rear-facing RADAR sensors 1160 coupled to a dedicated processor, DSP, FPGA, and / or ASIC that is electrically coupled to driver feedback, such as a display, speaker, and / or vibration assembly.
[0196] In at least one embodiment, conventional ADAS systems may be prone to generating false alarms, which can be annoying and distracting to the driver, but are generally not catastrophic because conventional ADAS systems warn the driver and allow the driver to determine whether a safe situation truly exists and take appropriate action. In at least one embodiment, in the event of conflicting results, vehicle 1100 itself decides whether to follow the result of the primary computer or the secondary computer (e.g., the first controller 1136 or the second controller 1136). For example, in at least one embodiment, ADAS system 1138 may be a backup and / or auxiliary computer for providing perception information to a backup computer rationality module. In at least one embodiment, the backup computer rationality monitor may run redundant software on hardware components to detect faults in perception and dynamic driving tasks. In at least one embodiment, the output from ADAS system 1138 may be provided to a monitoring MCU. In at least one embodiment, if the outputs from the primary computer and the auxiliary computer conflict, the monitoring MCU decides how to reconcile the conflict to ensure safe operation.
[0197] In at least one embodiment, the primary computer may be configured to provide a confidence score to the supervisory MCU to indicate the primary computer's confidence in the selected result. In at least one embodiment, if the confidence score exceeds a threshold, the supervisory MCU may follow the primary computer's instructions regardless of whether the auxiliary computer provides conflicting or inconsistent results. In at least one embodiment, if the confidence score does not meet the threshold, and if the primary computer and the auxiliary computer indicate different results (e.g., conflicting), the supervisory MCU may arbitrate between the computers to determine the appropriate result.
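The threshold rule above can be expressed as a short sketch. The disclosure does not specify how the supervisory MCU arbitrates, so the `prefer_cautious` policy, the result labels, and the names `ComputerOutput` and `supervise` are hypothetical stand-ins; `primary` denotes the computer that reports the confidence score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ComputerOutput:
    result: str               # e.g. "continue", "warn", "brake" (assumed labels)
    confidence: float = 1.0   # only the primary computer's score is consulted


Arbiter = Callable[[ComputerOutput, ComputerOutput], str]

# Assumed ordering from least to most cautious action.
CAUTION = {"continue": 0, "warn": 1, "brake": 2}


def prefer_cautious(primary: ComputerOutput, auxiliary: ComputerOutput) -> str:
    """Hypothetical arbitration policy: pick the more cautious of the two results."""
    return max(primary.result, auxiliary.result, key=CAUTION.__getitem__)


def supervise(primary: ComputerOutput,
              auxiliary: ComputerOutput,
              threshold: float,
              arbitrate: Arbiter = prefer_cautious) -> str:
    # Confidence exceeds the threshold: follow the primary computer,
    # whether or not the auxiliary computer disagrees.
    if primary.confidence > threshold:
        return primary.result
    # Confidence too low and the results differ: arbitrate.
    if primary.result != auxiliary.result:
        return arbitrate(primary, auxiliary)
    # Confidence too low but both computers agree.
    return primary.result
```

Passing the arbitration policy as a parameter keeps the threshold rule separate from the policy itself, which in the paragraphs that follow may be learned by a neural network rather than hard-coded.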
[0198] In at least one embodiment, the supervisory MCU may be configured to run a neural network trained and configured to determine, based at least in part on outputs from both the primary computer and the auxiliary computer, the conditions under which the auxiliary computer provides a false alarm. In at least one embodiment, the neural network in the supervisory MCU may learn when the output of the auxiliary computer can be trusted and when it cannot. For example, in at least one embodiment, when the auxiliary computer is a RADAR-based FCW system, the neural network in the supervisory MCU may learn when the FCW system is identifying a metallic object that is not actually dangerous, such as a drainage grate or manhole cover that triggers an alarm. In at least one embodiment, when the auxiliary computer is a camera-based LDW system, the neural network in the supervisory MCU may learn to override the LDW when a cyclist or pedestrian is present and lane departure is actually the safest maneuver. In at least one embodiment, the supervisory MCU may include at least one of a DLA or GPU suitable for running a neural network with associated memory. In at least one embodiment, the supervisory MCU may include and / or be included as a component of one or more SoCs 1104.
[0199] In at least one embodiment, the ADAS system 1138 may include an auxiliary computer that performs ADAS functions using conventional computer vision rules. In at least one embodiment, the auxiliary computer may use classic computer vision rules (if-then), and the presence of a neural network in the supervisory MCU can improve reliability, safety, and performance. For example, in at least one embodiment, diverse implementations and intentional non-identity make the entire system more fault-tolerant, especially for failures caused by software (or software-hardware interface) functionality. For example, in at least one embodiment, if a software bug or error exists in the software running on the primary computer, and different software code running on the auxiliary computer provides the same overall result, the supervisory MCU can be more confident that the overall result is correct and that the bug in the software or hardware on the primary computer is not causing a significant error.
[0200] In at least one embodiment, the output of the ADAS system 1138 may be input to the perception module and / or the dynamic driving task module of the primary computer. For example, in at least one embodiment, if the ADAS system 1138 indicates a forward collision warning due to an object directly ahead, the perception module may use this information when identifying the object. In at least one embodiment, as described herein, the auxiliary computer may have its own neural network trained to reduce the risk of false alarms.
[0201] In at least one embodiment, vehicle 1100 may further include an infotainment SoC 1130 (e.g., an in-vehicle infotainment system (IVI)). Although shown and described as an SoC, in at least one embodiment, the infotainment system 1130 may not be an SoC and may include, but is not limited to, two or more discrete components. In at least one embodiment, the infotainment SoC 1130 may include, but is not limited to, a combination of hardware and software that can be used to provide audio (e.g., music, personal digital assistant, navigation instructions, news, radio, etc.), video (e.g., television, movies, streaming media, etc.), telephone (e.g., hands-free calling), network connectivity (e.g., LTE, WiFi, etc.) and / or information services (e.g., navigation system, rear parking assist, radio data system, vehicle-related information such as fuel level, total distance covered, brake fluid level, oil level, door opening / closing, air filter information, etc.) to vehicle 1100. For example, the infotainment SoC 1130 may include a radio, disk player, navigation system, video player, USB and Bluetooth connectivity, in-vehicle computer, in-vehicle entertainment system, WiFi, steering wheel audio controls, hands-free voice control, head-up display (“HUD”), HMI display 1134, telematics device, control panel (e.g., for controlling and / or interacting with various components, features and / or systems) and / or other components. In at least one embodiment, the infotainment SoC 1130 may further be used to provide information (e.g., visual and / or auditory) to users of the vehicle, such as information from ADAS system 1138, autonomous driving information (such as planned vehicle maneuvers), trajectories, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.) and / or other information.
[0202] In at least one embodiment, the infotainment SoC 1130 may include any number and type of GPU functionality. In at least one embodiment, the infotainment SoC 1130 may communicate with other devices, systems, and / or components of the vehicle 1100 via bus 1102 (e.g., CAN bus, Ethernet, etc.). In at least one embodiment, the infotainment SoC 1130 may be coupled to a supervisory MCU, enabling the GPU of the infotainment system to perform some autonomous driving functions in the event of a failure of the primary controller 1136 (e.g., the primary computer and / or backup computer of the vehicle 1100). In at least one embodiment, the infotainment SoC 1130 may cause the vehicle 1100 to enter a driver-to-safe-stop mode, as described herein.
[0203] In at least one embodiment, vehicle 1100 may further include instrument panel 1132 (e.g., digital instrument panel, electronic instrument panel, digital instrument control panel, etc.). Instrument panel 1132 may include, but is not limited to, controllers and / or supercomputers (e.g., discrete controllers or supercomputers). In at least one embodiment, instrument panel 1132 may include, but is not limited to, any number and combination of a set of instruments, such as speedometer, fuel level, oil pressure, tachometer, odometer, turn indicator, shift position indicator, one or more seatbelt warning lights, one or more parking brake warning lights, one or more engine malfunction lights, auxiliary restraint system (e.g., airbag) information, lighting controls, safety system controls, navigation information, etc. In some examples, information may be displayed and / or shared between infotainment SoC 1130 and instrument panel 1132. In at least one embodiment, instrument panel 1132 may be included as part of infotainment SoC 1130, or vice versa.
[0204] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, inference and / or training logic 815 may be used in the system of Figure 11C for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0205] In at least one embodiment, the inference and / or training logic 412, 414 may be used in the system of Figure 11C for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0206] Figure 11D is a diagram of a system 1176 for communication between one or more cloud-based servers and the autonomous vehicle 1100 of Figure 11A, according to at least one embodiment. In at least one embodiment, system 1176 may include, but is not limited to, one or more servers 1178, one or more networks 1190, and any number and type of vehicles, including vehicle 1100. One or more servers 1178 may include, but are not limited to, multiple GPUs 1184(A)-1184(H) (collectively referred to herein as GPUs 1184), PCIe switches 1182(A)-1182(H) (collectively referred to herein as PCIe switches 1182), and / or CPUs 1180(A)-1180(B) (collectively referred to herein as CPUs 1180). GPUs 1184, CPUs 1180, and PCIe switches 1182 may be interconnected with high-speed interconnects, such as, but not limited to, NVLink interface 1188 developed by NVIDIA and / or PCIe connection 1186. In at least one embodiment, GPUs 1184 are connected via NVLink and / or an NVSwitch SoC, and GPUs 1184 and PCIe switches 1182 are connected via PCIe interconnects. In at least one embodiment, although eight GPUs 1184, two CPUs 1180, and four PCIe switches 1182 are shown, this is not intended to be limiting. In at least one embodiment, each of one or more servers 1178 may include, but is not limited to, any combination of any number of GPUs 1184, CPUs 1180, and / or PCIe switches 1182. For example, in at least one embodiment, one or more servers 1178 may each include eight, sixteen, thirty-two, and / or more GPUs 1184.
[0207] In at least one embodiment, one or more servers 1178 may receive, from vehicles and via one or more networks 1190, image data representing images that show unexpected or changed road conditions, such as recently commenced roadworks. In at least one embodiment, one or more servers 1178 may transmit, to vehicles and via one or more networks 1190, neural networks 1192, updated neural networks 1192, and / or map information 1194, including but not limited to information about traffic and road conditions. In at least one embodiment, updates to map information 1194 may include, but are not limited to, updates to HD map 1122, such as information about construction sites, potholes, sidewalks, floods, and / or other obstacles. In at least one embodiment, neural networks 1192, updated neural networks 1192, and / or map information 1194 may result from new training and / or experience represented in data received from any number of vehicles in the environment, and / or may be based at least in part on training performed in a data center (e.g., using one or more servers 1178 and / or other servers).
[0208] In at least one embodiment, one or more servers 1178 may be used to train a machine learning model (e.g., a neural network) at least in part based on training data. The training data may be generated by the vehicle and / or may be generated in a simulation (e.g., using a game engine). In at least one embodiment, any amount of training data is labeled (e.g., where the associated neural network benefits from supervised learning) and / or undergoes other preprocessing. In at least one embodiment, no amount of training data is labeled and / or preprocessed (e.g., where the associated neural network does not require supervised learning). In at least one embodiment, once the machine learning model is trained, the machine learning model may be used by the vehicle (e.g., transmitted to the vehicle via one or more networks 1190), and / or the machine learning model may be used by one or more servers 1178 to remotely monitor the vehicle.
[0209] In at least one embodiment, one or more servers 1178 may receive data from the vehicle and apply the data to state-of-the-art real-time neural networks for real-time intelligent inference. In at least one embodiment, one or more servers 1178 may include a deep learning supercomputer and / or a dedicated AI computer powered by one or more GPUs 1184, such as the DGX and DGX Station machines developed by NVIDIA. However, in at least one embodiment, one or more servers 1178 may include a deep learning infrastructure in a data center using CPU power.
[0210] In at least one embodiment, the deep learning infrastructure of one or more servers 1178 may be capable of fast, real-time inference and can use this capability to assess and verify the health of the processor, software, and / or associated hardware in vehicle 1100. For example, in at least one embodiment, the deep learning infrastructure may receive periodic updates from vehicle 1100, such as image sequences and / or objects located by vehicle 1100 in the image sequence (e.g., via computer vision and / or other machine learning object classification techniques). In at least one embodiment, the deep learning infrastructure may run its own neural network to identify objects and compare them with objects identified by vehicle 1100, and if the results do not match and the deep learning infrastructure determines that the AI in vehicle 1100 is malfunctioning, one or more servers 1178 may signal to vehicle 1100 to instruct the fail-safe computer of vehicle 1100 to take control, notify passengers, and complete a safe stopping operation.
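The comparison step described above, in which objects identified on the server are checked against objects reported by vehicle 1100, can be sketched as follows. The disclosure does not say how the two sets of objects are compared; this sketch assumes labeled axis-aligned bounding boxes matched one-to-one by intersection over union (IoU), and the names `iou` and `detections_match` and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, Box]               # (class label, bounding box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def detections_match(vehicle: List[Detection],
                     server: List[Detection],
                     iou_threshold: float = 0.5) -> bool:
    """True if every vehicle detection pairs with a distinct server detection."""
    if len(vehicle) != len(server):
        return False
    unmatched = list(server)
    for label, box in vehicle:
        # Greedy matching: take the first server detection with the same
        # label and sufficient overlap that has not been used yet.
        hit = next((s for s in unmatched
                    if s[0] == label and iou(box, s[1]) >= iou_threshold), None)
        if hit is None:
            return False
        unmatched.remove(hit)
    return True
```

In this sketch, a `False` result would correspond to the case where the results do not match; whether that indicates a malfunction, and whether to signal the fail-safe computer, remains a separate decision as described above.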
[0211] In at least one embodiment, one or more servers 1178 may include one or more GPUs 1184 and one or more programmable inference accelerators (e.g., NVIDIA's TensorRT 3). In at least one embodiment, the combination of GPU-driven servers and inference acceleration enables real-time response. In at least one embodiment, for example, where performance is less critical, servers driven by CPUs, FPGAs, and other processors may be used for inference. In at least one embodiment, hardware architecture 815 is used to execute one or more embodiments. Details regarding hardware architecture 815 are provided herein in conjunction with Figure 8A and / or Figure 8B.
[0212] Computer System
[0213] Figure 12 is a block diagram illustrating an exemplary computer system according to at least one embodiment. The exemplary computer system 1200 may be a system of interconnected devices and components, a system-on-a-chip (“SoC”), or some combination thereof, and includes a processor that may include an execution unit to execute instructions. In at least one embodiment, according to this disclosure, such as the embodiments described herein, the computer system 1200 may include, but is not limited to, components such as processor 1202, whose execution unit includes logic to execute algorithms for processing data. In at least one embodiment, the computer system 1200 may include a processor, such as a processor from a processor family, or a Xeon™, XScale™ and / or StrongARM™, Core™, or Nervana™ microprocessor, available from Intel Corporation of Santa Clara, California, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, etc.) may also be used. In at least one embodiment, computer system 1200 may execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and / or graphical user interfaces may also be used.
[0214] The embodiments can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system that can execute one or more instructions according to at least one embodiment.
[0215] In at least one embodiment, computer system 1200 may include, but is not limited to, processor 1202, which may include, but is not limited to, one or more execution units 1208, to perform machine learning model training and / or inference according to the techniques described herein. In at least one embodiment, computer system 1200 is a single-processor desktop or server system, but in another embodiment, computer system 1200 may be a multiprocessor system. In at least one embodiment, processor 1202 may include, but is not limited to, a Complex Instruction Set Computer (“CISC”) microprocessor, a Reduced Instruction Set Computing (“RISC”) microprocessor, a Very Long Instruction Word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In at least one embodiment, processor 1202 may be coupled to processor bus 1210, which can transmit data signals between processor 1202 and other components in computer system 1200.
[0216] In at least one embodiment, processor 1202 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 1204. In at least one embodiment, processor 1202 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory may reside external to processor 1202. Depending on specific implementation and requirements, other embodiments may also include a combination of internal and external caches. In at least one embodiment, register file 1206 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.
[0217] In at least one embodiment, an execution unit 1208, including but not limited to logic for performing integer and floating-point operations, is also located within the processor 1202. The processor 1202 may also include a microcode (“ucode”) read-only memory (“ROM”) for storing microcode for certain macro instructions. In at least one embodiment, the execution unit 1208 may include logic for handling a packed instruction set 1209. In at least one embodiment, by including the packed instruction set 1209 in the instruction set of the general-purpose processor 1202, along with the associated circuitry for executing the instructions, operations used by many multimedia applications can be performed using packed data in the general-purpose processor 1202. In one or more embodiments, many multimedia applications can be executed more quickly and efficiently by using the full width of the processor's data bus to perform operations on the packed data, which may eliminate the need to transfer smaller data units across the processor's data bus to perform one or more operations on one data element at a time.
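The benefit described above, operating on several small data elements held together in one wide word rather than one element at a time, can be illustrated in software. The sketch below is not instruction set 1209 and does not model any particular processor; it emulates a lane-wise 8-bit addition on four values stored in a single 32-bit integer, using a well-known masking technique, and the function names are hypothetical.

```python
from typing import List

HIGH_BITS = 0x80808080   # most significant bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F    # remaining seven bits of each lane


def pack_u8x4(values: List[int]) -> int:
    """Pack four 8-bit values into one 32-bit word (element 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values[:4]):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_u8x4(word: int) -> List[int]:
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add_u8x4(a: int, b: int) -> int:
    """Add four 8-bit lanes at once, each wrapping modulo 256.

    The low seven bits of every lane are added in a single integer addition;
    because the top bit of each lane is masked off first, a carry can never
    spill into the neighboring lane. The top bits are then combined with XOR.
    """
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF
```

For example, `unpack_u8x4(add_u8x4(pack_u8x4([1, 200, 255, 16]), pack_u8x4([2, 100, 1, 240])))` yields `[3, 44, 0, 0]`: all four additions are carried out by one pass over the 32-bit word, which is the idea behind hardware packed-data instructions.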
[0218] In at least one embodiment, execution unit 1208 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, computer system 1200 may include, but is not limited to, memory 1220. In at least one embodiment, memory 1220 may be implemented as a dynamic random access memory (“DRAM”) device, a static random access memory (“SRAM”) device, a flash memory device, or another storage device. Memory 1220 may store instructions 1219 and / or data 1221 represented by data signals that can be executed by processor 1202.
[0219] In at least one embodiment, the system logic chip may be coupled to the processor bus 1210 and the memory 1220. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 1216, and the processor 1202 may communicate with the MCH 1216 via the processor bus 1210. In at least one embodiment, the MCH 1216 may provide a high-bandwidth memory path 1218 to the memory 1220 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 1216 may direct data signals between the processor 1202, the memory 1220, and other components in the computer system 1200, and bridge data signals between the processor bus 1210, the memory 1220, and the system I / O 1222. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1216 may be coupled to memory 1220 via high-bandwidth memory path 1218, and graphics / video card 1212 may be coupled to MCH 1216 via Accelerated Graphics Port (“AGP”) interconnect 1214.
[0220] In at least one embodiment, computer system 1200 may use system I / O 1222 as a proprietary hub interface bus to couple MCH 1216 to I / O controller hub (“ICH”) 1230. In at least one embodiment, ICH 1230 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to memory 1220, chipset, and processor 1202. Examples may include, but are not limited to, audio controller 1229, firmware hub (“Flash BIOS”) 1228, wireless transceiver 1226, data storage 1224, a conventional I / O controller 1223 including user input and keyboard interfaces, serial expansion port 1227 (e.g., Universal Serial Bus (USB)), and network controller 1234. Data storage 1224 may include hard disk drives, floppy disk drives, CD-ROM devices, flash memory devices, or other mass storage devices.
[0221] In at least one embodiment, Figure 12 shows a system comprising interconnected hardware devices or “chips,” while in other embodiments, Figure 12 may show an exemplary system-on-a-chip (“SoC”). In at least one embodiment, the devices shown in Figure 12 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of system 1200 are interconnected using a Compute Express Link (“CXL”) interconnect.
[0222] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, inference and / or training logic 815 may be used in the system of Figure 12 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0223] In at least one embodiment, the inference and / or training logic 412, 414 may be used in the system of Figure 12 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0224] Figure 13 is a block diagram illustrating an electronic device 1300 that uses a processor 1310 according to at least one embodiment. In at least one embodiment, the electronic device 1300 may be, for example, but not limited to, a notebook computer, tower server, rack server, blade server, laptop computer, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.
[0225] In at least one embodiment, system 1300 may include, but is not limited to, processor 1310 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, processor 1310 is coupled using a bus or interface, such as an I²C bus, a System Management Bus (“SMBus”), a Low Pin Count (“LPC”) bus, a Serial Peripheral Interface (“SPI”), a High Definition Audio (“HDA”) bus, a Serial Advanced Technology Attachment (“SATA”) bus, a Universal Serial Bus (“USB”) (versions 1, 2, and 3), or a Universal Asynchronous Receiver / Transmitter (“UART”) bus.
[0226] In at least one embodiment, Figure 13 shows a system that includes interconnected hardware devices or “chips,” while in other embodiments, Figure 13 may show an exemplary system-on-chip (“SoC”).
[0227] In at least one embodiment, the devices shown in Figure 13 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of Figure 13 are interconnected using Compute Express Link (“CXL”) interconnects.
[0228] In at least one embodiment, the system of Figure 13 may include a display 1324, a touch screen 1325, a touchpad 1330, a near field communication unit (“NFC”) 1345, a sensor hub 1340, a thermal sensor 1346, an Express Chipset (“EC”) 1335, a trusted platform module (“TPM”) 1338, a BIOS / firmware / flash memory (“BIOS, FW Flash”) 1322, a DSP 1360, a drive (“SSD” or “HDD”) 1320 (e.g., a solid-state drive (“SSD”) or a hard disk drive (“HDD”)), a wireless LAN unit (“WLAN”) 1350, a Bluetooth unit 1352, a wireless wide area network unit (“WWAN”) 1356, a global positioning system (“GPS”) unit 1355, a camera (“USB 3.0 camera”) 1354 (e.g., a USB 3.0 camera), and / or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 1315 implemented in, for example, the LPDDR3 standard. These components may each be implemented in any suitable manner.
[0229] In at least one embodiment, other components may be communicatively coupled to processor 1310 via the components described herein. In at least one embodiment, accelerometer 1341, ambient light sensor (“ALS”) 1342, compass 1343, and gyroscope 1344 may be communicatively coupled to sensor hub 1340. In at least one embodiment, thermal sensor 1339, fan 1337, keyboard 1346, and touchpad 1330 may be communicatively coupled to EC 1335. In at least one embodiment, speaker 1363, earphone 1364, and microphone (“mic”) 1365 may be communicatively coupled to audio unit (“audio codec and Class D amplifier”) 1364, which in turn may be communicatively coupled to DSP 1360. In at least one embodiment, audio unit 1364 may include, for example, but not limited to, an audio encoder / decoder (“codec”) and a Class D amplifier. In at least one embodiment, SIM card (“SIM”) 1357 may be communicatively coupled to WWAN unit 1356. In at least one embodiment, components such as WLAN unit 1350, Bluetooth unit 1352, and WWAN unit 1356 may be implemented in a Next Generation Form Factor (“NGFF”).
[0230] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, inference and / or training logic 815 may be used in the system of Figure 13 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0231] In at least one embodiment, the inference and / or training logic 412, 414 may be used in the system of Figure 13 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0232] Figure 14 shows a computer system 1400 according to at least one embodiment. In at least one embodiment, the computer system 1400 is configured to implement various processes and methods described throughout this disclosure.
[0233] In at least one embodiment, the computer system 1400 includes, but is not limited to, at least one central processing unit (“CPU”) 1402 connected to a communication bus 1410 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), Peripheral Component Interconnect Express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol. In at least one embodiment, the computer system 1400 includes, but is not limited to, main memory 1404 and control logic (e.g., implemented in hardware, software, or a combination thereof), and data may be stored in main memory 1404, which may take the form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 1422 provides an interface to other computing devices and networks for receiving data from other systems and transmitting data from the computer system 1400 to other systems.
[0234] In at least one embodiment, the computer system 1400 includes, but is not limited to, an input device 1408, a parallel processing system 1412, and a display device 1406, which may be implemented using conventional cathode ray tube (“CRT”), liquid crystal display (“LCD”), light-emitting diode (“LED”) display, plasma display, or other suitable display technologies. In at least one embodiment, user input is received from the input device 1408 (such as a keyboard, mouse, touchpad, microphone, etc.). In at least one embodiment, each of the foregoing modules may reside on a single semiconductor platform to form the processing system.
[0235] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the system of Figure 14 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0236] In at least one embodiment, the inference and / or training logic 412, 414 may be used in the system of Figure 14 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0237] Figure 15 illustrates a computer system 1500 according to at least one embodiment. In at least one embodiment, the computer system 1500 includes, but is not limited to, a computer 1510 and a USB flash drive 1520. In at least one embodiment, the computer 1510 may include, but is not limited to, any number and type of processors (not shown) and memory (not shown). In at least one embodiment, the computer 1510 includes, but is not limited to, a server, a cloud instance, a laptop computer, and a desktop computer.
[0238] In at least one embodiment, the USB flash drive 1520 includes, but is not limited to, a processing unit 1530, a USB interface 1540, and USB interface logic 1550. In at least one embodiment, the processing unit 1530 can be any instruction execution system, apparatus, or device capable of executing instructions. In at least one embodiment, the processing unit 1530 can include, but is not limited to, any number and type of processing cores (not shown). In at least one embodiment, the processing unit 1530 includes an application-specific integrated circuit (“ASIC”) optimized to perform any number and type of operations associated with machine learning. For example, in at least one embodiment, the processing unit 1530 is a tensor processing unit (“TPC”) optimized to perform machine learning inference operations. In at least one embodiment, the processing unit 1530 is a vision processing unit (“VPU”) optimized to perform machine vision and machine learning inference operations.
[0239] In at least one embodiment, the USB interface 1540 can be any type of USB connector or USB receptacle. For example, in at least one embodiment, the USB interface 1540 is a USB 3.0 Type-C receptacle for data and power. In at least one embodiment, the USB interface 1540 is a USB 3.0 Type-A connector. In at least one embodiment, the USB interface logic 1550 may include any number and type of logic that enables the processing unit 1530 to interface with devices (e.g., computer 1510) via the USB connector 1540.
[0240] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the system of Figure 15 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0241] In at least one embodiment, the inference and / or training logic 412, 414 may be used in the system of Figure 15 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0242] Figure 16A illustrates an exemplary architecture in which multiple GPUs 1610-1613 are communicatively coupled to multiple multi-core processors 1605-1606 via high-speed links 1640-1643 (e.g., bus / point-to-point interconnect, etc.). In at least one embodiment, the high-speed links 1640-1643 support communication throughput of 4GB / s, 30GB / s, 80GB / s, or higher. Various interconnect protocols can be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0.
[0243] Furthermore, in one embodiment, two or more GPUs 1610-1613 are interconnected via high-speed links 1629-1630, which may use the same or different protocols / links as those used for high-speed links 1640-1643. Similarly, two or more multi-core processors 1605-1606 may be connected via high-speed link 1628, which may be a symmetric multiprocessor (SMP) bus operating at speeds of 20GB / s, 30GB / s, 120GB / s, or higher. Alternatively, all communication between the various system components shown in Figure 16A may use the same protocol / link (e.g., via a common interconnect fabric).
[0244] In one embodiment, each multi-core processor 1605-1606 is communicatively coupled to processor memories 1601-1602 via memory interconnects 1626-1627, and each GPU 1610-1613 is communicatively coupled to GPU memories 1620-1623 via GPU memory interconnects 1650-1653. Memory interconnects 1626-1627 and 1650-1653 may utilize the same or different memory access technologies. By way of example and not limitation, processor memories 1601-1602 and GPU memories 1620-1623 may be volatile memories, such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high-bandwidth memory (HBM), and / or may be non-volatile memories, such as 3D XPoint or Nano-RAM. In one embodiment, some portions of the processor memories 1601-1602 may be volatile memory, while other portions may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).
[0245] As described herein, although the various multi-core processors 1605-1606 and GPUs 1610-1613 may be physically coupled to specific memories 1601-1602, 1620-1623, respectively, a unified memory architecture may be implemented in which the same virtual system address space (also known as the “effective address” space) is distributed across the various physical memories. For example, processor memories 1601-1602 can each contain 64GB of system memory address space, and GPU memories 1620-1623 can each contain 32GB of system memory address space (resulting in a total addressable memory size of 256GB in this example).
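By way of illustration only, the address-space arithmetic of this example can be sketched as follows. The sketch assumes the six memories are laid out back to back in the order listed; that ordering, the `Region` type, and the function names are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass

GiB = 1 << 30


@dataclass(frozen=True)
class Region:
    name: str   # label for the physical memory (reference numeral as text)
    base: int   # first effective address mapped to this memory
    size: int   # number of bytes mapped to this memory


def build_address_map(sizes):
    """Lay the memories out contiguously (an assumed layout) and return
    the list of regions together with the total addressable size."""
    regions, base = [], 0
    for name, size in sizes:
        regions.append(Region(name, base, size))
        base += size
    return regions, base


def locate(regions, address):
    """Return (memory name, offset inside that memory) for an address."""
    for region in regions:
        if region.base <= address < region.base + region.size:
            return region.name, address - region.base
    raise ValueError(f"address {address:#x} is outside the address space")


# Two 64GB processor memories and four 32GB GPU memories, as in the example.
SIZES = [
    ("processor memory 1601", 64 * GiB),
    ("processor memory 1602", 64 * GiB),
] + [(f"GPU memory {n}", 32 * GiB) for n in range(1620, 1624)]

REGIONS, TOTAL = build_address_map(SIZES)
```

Under these assumptions, `TOTAL` equals 2 × 64GB + 4 × 32GB = 256GB, and `locate` resolves any address in that range to exactly one physical memory.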
[0246] Figure 16B shows additional details regarding the interconnection between a multi-core processor 1607 and a graphics acceleration module 1646 according to an exemplary embodiment. The graphics acceleration module 1646 may include one or more GPU chips integrated on a line card coupled to the processor 1607 via a high-speed link 1640. Alternatively, the graphics acceleration module 1646 may be integrated on the same package or chip as the processor 1607.
[0247] In at least one embodiment, the processor 1607 shown includes multiple cores 1660A-1660D, each core having a translation lookaside buffer 1661A-1661D and one or more caches 1662A-1662D. In at least one embodiment, cores 1660A-1660D may include various other components (not shown) for executing instructions and processing data. Caches 1662A-1662D may include level 1 (L1) and level 2 (L2) caches. Furthermore, one or more shared caches 1656 may be included in caches 1662A-1662D and shared by the respective groups of cores 1660A-1660D. For example, one embodiment of the processor 1607 includes 24 cores, each core having its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, two adjacent cores share one or more L2 and L3 caches. Processor 1607 and graphics acceleration module 1646 are connected to system memory 1614, which may include the processor memories 1601-1602 of Figure 16A.
[0248] The coherence bus 1664 maintains coherence for data and instructions stored in the various caches 1662A-1662D, 1656 and system memory 1614 via inter-core communication. For example, each cache may have associated cache coherence logic / circuitry to communicate via the coherence bus 1664 in response to detected reads or writes to specific cache lines. In one implementation, a cache snooping protocol is implemented over the coherence bus 1664 to snoop cache accesses.
[0249] In at least one embodiment, proxy circuitry 1625 communicatively couples graphics acceleration module 1646 to coherence bus 1664, thereby allowing graphics acceleration module 1646 to participate in cache coherence protocols as a peer of cores 1660A-1660D. Specifically, interface 1635 provides connectivity to proxy circuitry 1625 via high-speed link 1640 (e.g., PCIe bus, NVLink, etc.), and interface 1637 connects graphics acceleration module 1646 to link 1640.
[0250] In one implementation, the accelerator integrated circuit 1636 provides cache management, memory access, context management, and interrupt management services for multiple graphics processing engines 1631, 1632, N of the graphics acceleration module 1646. Each of the graphics processing engines 1631, 1632, N may include a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 1631, 1632, N may include different types of graphics processing engines within a GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and blit engines. In at least one embodiment, the graphics acceleration module 1646 may be a GPU having multiple graphics processing engines 1631-1632, N, or the graphics processing engines 1631-1632, N may be individual GPUs integrated on a common package, line card, or chip.
[0251] In one embodiment, the accelerator integrated circuit 1636 includes a memory management unit (MMU) 1639 for performing various memory management functions, such as virtual-to-physical memory translation (also known as effective-to-real memory translation), and memory access protocols for accessing system memory 1614. The MMU 1639 may also include a translation lookaside buffer (“TLB”) (not shown) for caching virtual / effective-to-physical / real address translations. In one implementation, cache 1638 may store commands and data for efficient access by graphics processing engines 1631-1632, N. In one embodiment, data stored in cache 1638 and graphics memories 1633-1634, M is kept coherent with core caches 1662A-1662D, 1656 and system memory 1614. As previously mentioned, this can be accomplished via proxy circuitry 1625 acting on behalf of cache 1638 and graphics memories 1633-1634, M (e.g., sending updates to cache 1638 related to modifications / accesses of cache lines in processor caches 1662A-1662D, 1656, and receiving updates from cache 1638).
[0252] A set of registers 1645 stores the context data of the threads executed by graphics processing engines 1631-1632, N, and context management circuitry 1648 manages the thread context. For example, context management circuitry 1648 can perform save and restore operations to save and restore the context of individual threads during context switching (e.g., saving the first thread and storing the second thread so that the second thread can be executed by the graphics processing engine). For example, during context switching, context management circuitry 1648 can store the current register value to a designated area in memory (e.g., identified by a context pointer). The register value can then be restored when returning to the context. In one embodiment, interrupt management circuitry 1647 receives and processes interrupts received from system devices.
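The save and restore behavior described above can be modeled, purely as an illustrative sketch, with a small software class. The class name, the dictionary-based register file, and the use of integers as context pointers are assumptions made for the example; the embodiments describe hardware circuitry, not this code.

```python
class ContextManager:
    """Toy software model of context save / restore on a context switch."""

    def __init__(self):
        self.registers = {}    # live register values of the running thread
        self.save_areas = {}   # context pointer -> saved register snapshot

    def save(self, context_pointer):
        # Copy the current register values to the area named by the pointer.
        self.save_areas[context_pointer] = dict(self.registers)

    def restore(self, context_pointer):
        # Reload a previously saved snapshot; a thread that has never run
        # starts with an empty register set (an assumption of this sketch).
        self.registers = dict(self.save_areas.get(context_pointer, {}))

    def switch(self, current_pointer, next_pointer):
        self.save(current_pointer)
        self.restore(next_pointer)


manager = ContextManager()
manager.registers = {"r0": 1, "r1": 2}   # first thread is running
manager.switch(0x1000, 0x2000)           # save first thread, start second
state_after_first_switch = dict(manager.registers)
manager.registers["r0"] = 7              # second thread does some work
manager.switch(0x2000, 0x1000)           # save second thread, resume first
```

After the second switch, the register values of the first thread are back in place and the values of the second thread remain in its own save area.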
[0253] In one implementation, MMU 1639 translates virtual / effective addresses from graphics processing engine 1631 into real / physical addresses in system memory 1614. One embodiment of accelerator integrated circuit 1636 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 1646 and / or other accelerator devices. Graphics accelerator module 1646 may be dedicated to a single application executing on processor 1607, or may be shared among multiple applications. In one embodiment, a virtualized graphics execution context is presented, where resources of graphics processing engines 1631-1632, N are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into “slices” based on processing requirements and priorities associated with VMs and / or applications, which are allocated to different VMs and / or applications.
[0254] In at least one embodiment, the accelerator integrated circuit 1636 acts as a bridge to the system for the graphics acceleration module 1646, providing address translation and system memory caching services. Additionally, the accelerator integrated circuit 1636 can provide virtualization facilities for the host processor to manage virtualization, interrupts, and memory management of the graphics processing engines 1631-1632, N.
[0255] Because the hardware resources of the graphics processing engines 1631-1632, N are explicitly mapped to the real address space seen by the host processor 1607, any host processor can directly address these resources using effective address values. One function of the accelerator integrated circuit 1636 is to physically separate the graphics processing engines 1631-1632, N, so that they appear to the system as independent units.
[0256] In at least one embodiment, one or more graphics memories 1633-1634, M are coupled to each graphics processing engine 1631-1632, N, respectively. Graphics memories 1633-1634, M store instructions and data, which are processed by each graphics processing engine 1631-1632, N. Graphics memories 1633-1634, M may be volatile memories, such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or may be non-volatile memories, such as 3D XPoint or Nano-RAM.
[0257] In one embodiment, to reduce data traffic on link 1640, a biasing technique is used to ensure that the data stored in graphics memories 1633-1634, M is the data most frequently used by graphics processing engines 1631-1632, N, and preferably not used (or at least infrequently used) by cores 1660A-1660D. Similarly, the biasing mechanism attempts to keep the data needed by the cores (and preferably not graphics processing engines 1631-1632, N) in the core caches 1662A-1662D, 1656 and system memory 1614.
[0258] Figure 16C shows another exemplary embodiment, in which the accelerator integrated circuit 1636 is integrated within the processor 1607. In this embodiment, graphics processing engines 1631-1632, N communicate directly with the accelerator integrated circuit 1636 via a high-speed link 1640 through interfaces 1637 and 1635 (which, again, can utilize any form of bus or interface protocol). The accelerator integrated circuit 1636 can perform the same operations as those described with respect to Figure 16B, but potentially at higher throughput given its close proximity to the coherence bus 1664 and caches 1662A-1662D, 1656. One embodiment supports different programming models, including a dedicated process programming model (without graphics acceleration module virtualization) and a shared programming model (with virtualization), which may include a programming model controlled by accelerator integrated circuit 1636 and a programming model controlled by graphics acceleration module 1646.
[0259] In at least one embodiment, graphics processing engines 1631-1632, N are dedicated to a single application or process within a single operating system. In at least one embodiment, a single application can funnel requests from other applications to graphics processing engines 1631-1632, N, thereby providing virtualization within a VM / partition.
[0260] In at least one embodiment, graphics processing engines 1631-1632, N can be shared by multiple VM / application partitions. In at least one embodiment, the shared model can use a hypervisor to virtualize graphics processing engines 1631-1632, N to allow each operating system to access them. For a single-partition system without a hypervisor, the operating system owns graphics processing engines 1631-1632, N. In at least one embodiment, the operating system can virtualize graphics processing engines 1631-1632, N to provide access to each process or application.
[0261] In at least one embodiment, the graphics acceleration module 1646 or the individual graphics processing engines 1631-1632, N uses a process handle to select a process element. In one embodiment, the process element is stored in system memory 1614 and can be addressed using the effective address to physical address translation techniques described herein. In at least one embodiment, the process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engines 1631-1632, N (i.e., invoking system software to add the process element to the process element linked list). In at least one embodiment, the lower 16 bits of the process handle may be the offset of the process element in the process element linked list.
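As an illustrative sketch of the handle layout mentioned above, the lower 16 bits of a process handle can be extracted with a bit mask. The upper bits are implementation-specific in the text, so the helper `make_process_handle` and the sample values below are hypothetical.

```python
OFFSET_BITS = 16
OFFSET_MASK = (1 << OFFSET_BITS) - 1   # 0xFFFF


def process_element_offset(process_handle: int) -> int:
    """Return the lower 16 bits of the handle, i.e. the offset of the
    process element within the process element linked list."""
    return process_handle & OFFSET_MASK


def make_process_handle(implementation_bits: int, offset: int) -> int:
    """Hypothetical helper: pack implementation-specific upper bits
    together with a 16-bit offset into one handle value."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 16 bits")
    return (implementation_bits << OFFSET_BITS) | offset


handle = make_process_handle(0xABCD, 0x0040)
```

Whatever the upper bits contain, `process_element_offset` recovers only the offset portion of the handle.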
[0262] Figure 16D shows an exemplary accelerator integration slice 1690. As used herein, a “slice” includes a designated portion of the processing resources of the accelerator integrated circuit 1636. An application effective address space 1682 in system memory 1614 stores process element 1683. In one embodiment, process element 1683 is stored in response to a GPU call 1681 from an application 1680 executing on processor 1607. Process element 1683 contains the process state of the corresponding application 1680. A work descriptor (WD) 1684 contained in process element 1683 may be a single job requested by the application, or it may contain a pointer to a job queue. In at least one embodiment, WD 1684 is a pointer to a job request queue in the application’s effective address space 1682.
[0263] The graphics acceleration module 1646 and / or the various graphics processing engines 1631-1632, N can be shared by all processes or a subset of processes in the system. In at least one embodiment, infrastructure may be included for setting process states and sending WD 1684 to the graphics acceleration module 1646 to start a job in a virtualization context.
[0264] In at least one embodiment, the dedicated process programming model is implementation-specific. In this model, a single process owns either the graphics acceleration module 1646 or an individual graphics processing engine 1631. Because the graphics acceleration module 1646 is owned by a single process, the hypervisor initializes the accelerator integrated circuit 1636 for the owning partition, and the operating system initializes the accelerator integrated circuit 1636 for the owning process when the graphics acceleration module 1646 is assigned.
[0265] In operation, the WD fetch unit 1691 in the accelerator integration slice 1690 fetches the next WD 1684, which includes instructions for the work to be performed by one or more graphics processing engines of the graphics acceleration module 1646. Data from the WD 1684 can be stored in registers 1645 and used by the MMU 1639, interrupt management circuitry 1647, and / or context management circuitry 1648, as shown. For example, one embodiment of the MMU 1639 includes segment / page walk circuitry for accessing segment / page tables 1686 within the OS virtual address space 1685. The interrupt management circuitry 1647 can handle interrupt events 1692 received from the graphics acceleration module 1646. When performing graphics operations, the effective address 1693 generated by the graphics processing engines 1631-1632, N is translated into a real address by the MMU 1639.
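The effective-to-real address translation performed by the MMU can be illustrated with a deliberately simplified sketch. It assumes a single-level page table held in a dictionary and a 4096-byte page size; neither is specified in the text, and an actual segment / page table structure would be considerably more involved.

```python
PAGE_SIZE = 4096  # assumed page size; the text does not specify one


class TranslationFault(Exception):
    """Raised when no mapping exists for an effective address."""


def translate(page_table: dict, effective_address: int) -> int:
    """Translate an effective address into a real address using a
    one-level page table that maps page numbers to frame numbers."""
    page_number, offset = divmod(effective_address, PAGE_SIZE)
    try:
        frame_number = page_table[page_number]
    except KeyError:
        raise TranslationFault(hex(effective_address)) from None
    return frame_number * PAGE_SIZE + offset


# Hypothetical mappings, chosen only for the example.
PAGE_TABLE = {0x10: 0x80, 0x11: 0x2A}
real_address = translate(PAGE_TABLE, 0x10123)
```

The page offset is carried over unchanged, while the page number is replaced by the frame number found in the table. An unmapped page raises `TranslationFault`.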
[0266] In one embodiment, the same set of registers 1645 is copied for each graphics processing engine 1631-1632, N, and / or graphics acceleration module 1646, and these registers 1645 can be initialized by a hypervisor or the operating system. Each of these copied registers can be included in the accelerator integration slice 1690. Exemplary registers that can be initialized by a hypervisor are shown in Table 1.
[0267]
[0268]
[0269] Table 2 shows exemplary registers that can be initialized by the operating system.
[0270]
[0271] In one embodiment, each WD 1684 is specific to a particular graphics acceleration module 1646 and / or graphics processing engine 1631-1632, N. It contains all the information required for the graphics processing engine 1631-1632, N to complete its work, or it may be a pointer to a memory location where the application has set up a command queue for the work to be completed.
[0272] Figure 16E shows additional details of an exemplary embodiment of the shared model. This embodiment includes a hypervisor real address space 1698, in which a list of process elements 1699 is stored. The hypervisor real address space 1698 can be accessed via a hypervisor 1696, which virtualizes the graphics acceleration module engines for operating system 1695.
[0273] In at least one embodiment, the shared programming model allows all processes or subsets of processes from all partitions or subsets of partitions in the system to use the graphics acceleration module 1646. Two programming models exist in which the graphics acceleration module 1646 is shared by multiple processes and partitions: time-sliced sharing and graphics-directed sharing.
[0274] In this model, the hypervisor 1696 owns the graphics acceleration module 1646 and makes its functionality available to all operating systems 1695. For the graphics acceleration module 1646 to support virtualization through the hypervisor 1696, the graphics acceleration module 1646 may comply with the following: 1) application job requests must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 1646 must provide a context save and restore mechanism; 2) the graphics acceleration module 1646 guarantees that an application's job request completes within a specified amount of time, including any translation faults, or the graphics acceleration module 1646 provides the ability to preempt job processing; 3) when operating in the directed shared programming model, the graphics acceleration module 1646 must ensure fairness among processes.
[0275] In at least one embodiment, application 1680 is required to make a system call to operating system 1695 with a graphics acceleration module 1646 type, a work descriptor (WD), an authority mask register (AMR) value, and a context save / restore region pointer (CSRP). In at least one embodiment, the graphics acceleration module 1646 type describes the target acceleration function for the system call. In at least one embodiment, the graphics acceleration module 1646 type can be a system-specific value. In at least one embodiment, the WD is formatted specifically for graphics acceleration module 1646 and can take the form of a graphics acceleration module 1646 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by graphics acceleration module 1646. In one embodiment, the AMR value is the AMR state to use for the current process. In at least one embodiment, the value passed to the operating system is similar to an application setting the AMR. If the implementation of accelerator integrated circuit 1636 and graphics acceleration module 1646 does not support a User Authority Mask Override Register (UAMOR), the operating system can apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. Hypervisor 1696 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into process element 1683. In at least one embodiment, CSRP is one of registers 1645 containing the effective address of a region in the application's effective address space 1682 for the graphics acceleration module 1646 to save and restore context state. This pointer is optional if no state needs to be saved between jobs or when a job is preempted. In at least one embodiment, the context save / restore region may be pinned system memory.
[0276] Upon receiving a system call, operating system 1695 can verify that application 1680 has been registered and granted permission to use graphics acceleration module 1646.
[0277] Operating system 1695 then invokes hypervisor 1696 using the information shown in Table 3.
[0278]
[0279] Upon receiving a hypervisor call, hypervisor 1696 verifies that operating system 1695 has been registered and granted permission to use graphics acceleration module 1646. Then, hypervisor 1696 adds process element 1683 to the linked list of process elements of the corresponding graphics acceleration module 1646 type. The process element may include the information shown in Table 4.
[0280]
[0281] In at least one embodiment, the hypervisor initializes multiple accelerator integration slice 1690 registers 1645.
[0282] As shown in Figure 16F, in at least one embodiment, a unified memory is used, which is addressable via a common virtual memory address space for accessing physical processor memories 1601-1602 and GPU memories 1620-1623. In this implementation, operations performed on GPUs 1610-1613 utilize the same virtual / effective memory address space to access processor memories 1601-1602, and vice versa, thereby simplifying programmability. In at least one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 1601, a second portion to second processor memory 1602, a third portion to GPU memory 1620, and so on. In at least one embodiment, the entire virtual / effective memory space (sometimes referred to as the effective address space) is thus distributed across each of processor memories 1601-1602 and GPU memories 1620-1623, thereby allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
[0283] In one embodiment, the bias / coherence management circuitry 1694A-1694E within one or more MMUs 1639A-1639E ensures cache coherence between the caches of one or more host processors (e.g., 1605) and the GPUs 1610-1613, and implements biasing techniques that indicate the physical memory in which certain types of data should be stored. Although several instances of the bias / coherence management circuitry 1694A-1694E are shown in Figure 16F, the bias / coherence circuitry can be implemented within the MMU of one or more host processors 1605 and / or within the accelerator integrated circuit 1636.
[0284] One embodiment allows GPU-attached memory 1620-1623 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology without suffering the performance drawbacks associated with full system cache coherence. In at least one embodiment, the ability to access GPU-attached memory 1620-1623 as system memory without the heavy overhead of cache coherence provides a favorable operating environment for GPU offloading. This arrangement allows software on the host processor 1605 to set up operands and access computation results without the overhead of conventional I / O DMA data copying. Such conventional copying involves driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are less efficient than simple memory accesses. In at least one embodiment, the ability to access GPU-attached memory 1620-1623 without cache coherence overhead can be critical to the execution time of offloaded computations. For example, in cases with high volumes of streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPUs 1610-1613. In at least one embodiment, the efficiency of operand setup, the efficiency of result access, and the efficiency of GPU computation may play a role in determining the effectiveness of GPU offloading.
[0285] In at least one embodiment, the selection of GPU bias and host processor bias is driven by a bias tracker data structure. For example, a bias table can be used, which may be a page-granular structure (i.e., controlled at the granularity of a memory page) that includes 1 or 2 bits per GPU-attached memory page. In at least one embodiment, the bias table can be implemented in a stolen memory range of one or more GPU-attached memories 1620-1623, with or without a bias cache in GPUs 1610-1613 (e.g., for caching frequently / recently used entries of the bias table). Alternatively, the entire bias table can be maintained within the GPU.
[0286] In at least one embodiment, the bias table entry associated with each access to GPU-attached memory 1620-1623 is accessed prior to the actual access to GPU memory, causing the following operations. First, local requests from GPUs 1610-1613 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 1620-1623. Local requests from a GPU that find their page in host bias are forwarded to processor 1605 (e.g., via the high-speed link described above). In one embodiment, requests from processor 1605 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to GPUs 1610-1613. In at least one embodiment, the GPU may then transition the page to host processor bias if it is not currently using the page. In at least one embodiment, the bias state of a page can be changed through a software-based mechanism, a hardware-assisted software-based mechanism, or, in a limited set of cases, a purely hardware-based mechanism.
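The bias table lookup and the resulting request routing can be sketched as follows. This is a minimal software model under stated assumptions: one bias bit per page, a 4096-byte page, pages starting in host bias, and string labels for requesters and destinations. None of these details are specified by the embodiments.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page granularity


class Bias(Enum):
    HOST = 0
    GPU = 1


class BiasTable:
    """One bias bit per GPU-attached memory page, packed into bytes."""

    def __init__(self, num_pages: int):
        # All bits start at 0, so every page begins host-biased (assumption).
        self.bits = bytearray((num_pages + 7) // 8)

    def get(self, page: int) -> Bias:
        return Bias((self.bits[page // 8] >> (page % 8)) & 1)

    def set(self, page: int, bias: Bias) -> None:
        mask = 1 << (page % 8)
        if bias is Bias.GPU:
            self.bits[page // 8] |= mask
        else:
            self.bits[page // 8] &= ~mask & 0xFF


def route(table: BiasTable, requester: str, address: int) -> str:
    """Consult the bias table before the access and pick a destination."""
    bias = table.get(address // PAGE_SIZE)
    if requester == "gpu":
        # GPU-biased pages go straight to GPU memory; otherwise to the host.
        return "gpu_memory" if bias is Bias.GPU else "host_processor"
    if requester == "host":
        # Host-biased pages complete like a normal read; otherwise forward.
        return "normal_memory_read" if bias is Bias.HOST else "gpu"
    raise ValueError(f"unknown requester: {requester}")


table = BiasTable(num_pages=16)
table.set(3, Bias.GPU)
```

With page 3 marked as GPU-biased, a GPU request to that page is served from GPU memory, while a host request to the same page is forwarded to the GPU.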
[0287] One mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver. The device driver then sends a message (or enqueues a command descriptor) to the GPU, directing it to change the bias state and, for some transitions, to perform a cache flush operation on the host. In at least one embodiment, the cache flush operation is used for a transition from host processor 1605 bias to GPU bias, but not for the opposite transition.
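The asymmetry between the two transition directions can be illustrated with a short sketch. The function `change_bias`, the dictionary holding per-page bias, and the callback standing in for the host-side cache operation are all hypothetical; no real driver or OpenCL call is modeled here.

```python
def change_bias(page_bias: dict, page: int, new_bias: str, flush_host_cache) -> bool:
    """Change the bias of one page and report whether a host cache
    flush was performed. Pages absent from the dictionary are treated
    as host-biased (an assumption of this sketch)."""
    if new_bias not in ("host", "gpu"):
        raise ValueError(f"unknown bias: {new_bias}")
    old_bias = page_bias.get(page, "host")
    # The host cache is flushed only when moving from host bias to GPU bias.
    needs_flush = old_bias == "host" and new_bias == "gpu"
    if needs_flush:
        flush_host_cache(page)
    page_bias[page] = new_bias
    return needs_flush


flushed_pages = []   # records every page whose host cache lines were flushed
page_bias = {}
first = change_bias(page_bias, 5, "gpu", flushed_pages.append)    # host -> GPU
second = change_bias(page_bias, 5, "host", flushed_pages.append)  # GPU -> host
```

Only the first call triggers the callback, which matches the rule that the host cache operation applies to the host-to-GPU direction and not to the reverse.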
[0288] In one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 1605. To access these pages, the processor 1605 may request access from the GPU 1610, which may or may not grant access immediately. Therefore, to reduce communication between the processor 1605 and the GPU 1610, it is beneficial to ensure that GPU-biased pages are those required by the GPU but not by the host processor 1605, and vice versa.
[0289] One or more hardware structures 815 are used to perform one or more embodiments. Details regarding the one or more hardware structures 815 are provided herein in conjunction with Figure 8A and / or Figure 8B.
[0290] Figure 17 illustrates exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0291] Figure 17 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 1700 that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the integrated circuit 1700 includes one or more application processors 1705 (e.g., CPUs) and at least one graphics processor 1710, and may additionally include an image processor 1715 and / or a video processor 1720, any of which may be a modular IP core. In at least one embodiment, the integrated circuit 1700 includes peripheral or bus logic, including a USB controller 1725, a UART controller 1730, an SPI / SDIO controller 1735, and an I2S / I2C controller 1740. In at least one embodiment, the integrated circuit 1700 may include a display device 1745 coupled to one or more of a high-definition multimedia interface (HDMI) controller 1750 and a mobile industry processor interface (MIPI) display interface 1755. In at least one embodiment, storage may be provided by a flash memory subsystem 1760, including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 1765 for access to SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits additionally include an embedded security engine 1770.
[0292] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in integrated circuit 1700 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0293] In at least one embodiment, inference and / or training logic 412, 414 may be used in integrated circuit 1700 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0294] Figures 18A-18B illustrate exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0295] Figures 18A-18B are block diagrams illustrating exemplary graphics processors for use within an SoC, according to embodiments described herein. Figure 18A illustrates an exemplary graphics processor 1810 of a system-on-a-chip that can be fabricated using one or more IP cores, according to at least one embodiment. Figure 18B illustrates a further exemplary graphics processor 1840 of a system-on-a-chip that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the graphics processor 1810 of Figure 18A is a low-power graphics processor core. In at least one embodiment, the graphics processor 1840 of Figure 18B is a higher-performance graphics processor core. In at least one embodiment, each of the graphics processors 1810, 1840 may be a variant of the graphics processor 1710 of Figure 17.
[0296] In at least one embodiment, the graphics processor 1810 includes a vertex processor 1805 and one or more fragment processors 1815A-1815N (e.g., 1815A, 1815B, 1815C, 1815D to 1815N-1 and 1815N). In at least one embodiment, the graphics processor 1810 may execute different shader programs via separate logic, such that the vertex processor 1805 is optimized to perform operations for vertex shader programs, while one or more fragment processors 1815A-1815N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, the vertex processor 1805 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, one or more fragment processors 1815A-1815N use the primitive and vertex data generated by the vertex processor 1805 to produce a framebuffer that is displayed on a display device. In at least one embodiment, one or more fragment processors 1815A-1815N are optimized to execute fragment shader programs as provided in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs provided in the Direct 3D API.
[0297] In at least one embodiment, the graphics processor 1810 additionally includes one or more memory management units (MMUs) 1820A-1820B, one or more caches 1825A-1825B, and one or more circuit interconnects 1830A-1830B. In at least one embodiment, one or more MMUs 1820A-1820B provide virtual-to-physical address mapping for the graphics processor 1810, including for the vertex processor 1805 and / or fragment processors 1815A-1815N, which can reference vertex or image / texture data stored in memory, in addition to vertex or image / texture data stored in one or more caches 1825A-1825B. In at least one embodiment, one or more MMUs 1820A-1820B can be synchronized with other MMUs within the system, including one or more MMUs associated with the one or more application processors 1705, image processor 1715, and / or video processor 1720 of Figure 17, such that each processor 1705-1720 can participate in a shared or unified virtual memory system. In at least one embodiment, one or more circuit interconnects 1830A-1830B enable the graphics processor 1810 to interface with other IP cores within the SoC, either via the SoC's internal bus or via a direct connection.
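The virtual-to-physical mapping that an MMU provides can be sketched in a few lines of Python. This is a simplified single-level page table under assumed parameters: the 4096-byte `PAGE_SIZE`, the dictionary-based `page_table`, and the names `translate` and `PageFault` are illustrative and do not come from the source.

```python
PAGE_SIZE = 4096  # assumed page size; not specified in the text


class PageFault(Exception):
    """Raised when a virtual page has no mapping."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical address.

    page_table maps virtual page numbers to physical frame numbers.
    """
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise PageFault(f"no mapping for virtual page {page_number}")
    # The offset within the page is preserved by the translation.
    return page_table[page_number] * PAGE_SIZE + offset
```

If several processors consult the same `page_table`, a given virtual address resolves to the same physical location for each of them, which is the property a shared or unified virtual memory system relies on.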
[0298] In at least one embodiment, the graphics processor 1840 includes the one or more MMUs 1820A-1820B, caches 1825A-1825B, and circuit interconnects 1830A-1830B of the graphics processor 1810 of Figure 18A. In at least one embodiment, the graphics processor 1840 includes one or more shader cores 1855A-1855N (e.g., 1855A, 1855B, 1855C, 1855D, 1855E, 1855F to 1855N-1 and 1855N) that provide a unified shader core architecture, in which a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and / or compute shaders. In at least one embodiment, the number of shader cores may vary. In at least one embodiment, the graphics processor 1840 includes an inter-core task manager 1845, which acts as a thread dispatcher to dispatch execution threads to one or more shader cores 1855A-1855N, and a tiling unit 1858 to accelerate tile-based rendering operations, in which rendering operations for a scene are subdivided in image space, for example to exploit local spatial coherence within the scene or to optimize the use of internal caches.
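To make the idea of subdividing rendering work in image space concrete, the sketch below splits a render area into fixed-size tiles. It is a simplified illustration only: the function name `split_into_tiles` and the square tile size are assumptions, and a real tiling unit would also bin geometry per tile.

```python
def split_into_tiles(width, height, tile_size):
    """Return (x, y, w, h) tiles covering a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if width <= 0 or height <= 0 or tile_size <= 0:
        raise ValueError("width, height, and tile_size must be positive")
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            w = min(tile_size, width - x)
            h = min(tile_size, height - y)
            tiles.append((x, y, w, h))
    return tiles
```

Each tile touches a small, contiguous region of the image, so its working set is more likely to fit in an internal cache than the full frame would be.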
[0299] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the integrated circuits of Figure 18A and / or Figure 18B for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions or architectures, or neural network use cases described herein.
[0300] In at least one embodiment, inference and / or training logic 412, 414 may be used in the integrated circuits of Figure 18A and / or Figure 18B for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0301] Figures 19A-19B illustrate additional exemplary graphics processor logic according to embodiments described herein. In at least one embodiment, Figure 19A illustrates a graphics core 1900 that may be included within the graphics processor 1710 of Figure 17 and that, in at least one embodiment, may be a unified shader core 1855A-1855N as in Figure 18B. Figure 19B illustrates a highly parallel general-purpose graphics processing unit 1930 suitable for deployment on a multi-chip module, in at least one embodiment.
[0302] In at least one embodiment, the graphics core 1900 includes a shared instruction cache 1902, texture units 1918, and cache / shared memory 1920, which are common to the execution resources within the graphics core 1900. In at least one embodiment, the graphics core 1900 may include multiple slices 1901A-1901N, or a partition for each core, and a graphics processor may include multiple instances of the graphics core 1900. Slices 1901A-1901N may include supporting logic, including local instruction caches 1904A-1904N, thread schedulers 1906A-1906N, thread dispatchers 1908A-1908N, and a set of registers 1910A-1910N. In at least one embodiment, slices 1901A-1901N may include a set of additional functional units (AFU 1912A-1912N), floating-point units (FPU 1914A-1914N), integer arithmetic logic units (ALU 1916A-1916N), address calculation units (ACU 1913A-1913N), double-precision floating-point units (DPFPU 1915A-1915N), and matrix processing units (MPU 1917A-1917N).
[0303] In at least one embodiment, the FPU 1914A-1914N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPU 1915A-1915N performs double-precision (64-bit) floating-point operations. In at least one embodiment, the ALU 1916A-1916N can perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. In at least one embodiment, the MPU 1917A-1917N can also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, the MPU 1917A-1917N can perform various matrix operations to accelerate machine learning application frameworks, including enabling support for accelerated general matrix-to-matrix multiplication (GEMM). In at least one embodiment, the AFU 1912A-1912N can perform additional logical operations not supported by the floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
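A mixed-precision GEMM can be sketched in plain Python as follows. The sketch assumes one common arrangement, in which inputs are rounded to half precision and products are accumulated at higher precision. The source does not state which accumulation precision the MPU uses, and the names `to_half` and `gemm_mixed` are illustrative.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def gemm_mixed(a, b):
    """Compute C = A x B with half-precision inputs.

    Accumulation uses Python floats (double precision); this choice is an
    assumption made for the sketch, not something the text specifies.
    """
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    cols = len(b[0])
    c = [[0.0] * cols for _ in a]
    for i, row in enumerate(a):
        for j in range(cols):
            acc = 0.0  # wide accumulator
            for k in range(inner):
                acc += to_half(row[k]) * to_half(b[k][j])
            c[i][j] = acc
    return c
```

Keeping the accumulator wider than the inputs limits the rounding error that would otherwise build up across the inner loop.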
[0304] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the graphics core 1900 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0305] In at least one embodiment, inference and / or training logic 412, 414 may be used in the graphics core 1900 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0306] Figure 19B illustrates a general-purpose graphics processing unit (GPGPU) 1930 that, in at least one embodiment, can be configured to enable highly parallel compute operations to be performed by a set of graphics processing units. In at least one embodiment, the GPGPU 1930 can be directly linked to other instances of the GPGPU 1930 to create a multi-GPU cluster that improves training speed for deep neural networks. In at least one embodiment, the GPGPU 1930 includes a host interface 1932 for connection to a host processor. In at least one embodiment, the host interface 1932 is a PCI Express interface. In at least one embodiment, the host interface 1932 can be a vendor-specific communication interface or communication fabric. In at least one embodiment, the GPGPU 1930 receives commands from the host processor and uses a global scheduler 1934 to distribute execution threads associated with those commands to a set of compute clusters 1936A-1936H. In at least one embodiment, compute clusters 1936A-1936H share a cache memory 1938. In at least one embodiment, cache memory 1938 can serve as a higher-level cache for the cache memories within compute clusters 1936A-1936H.
[0307] In at least one embodiment, the GPGPU 1930 includes memories 1944A-1944B, which are coupled to computing clusters 1936A-1936H via a set of memory controllers 1942A-1942B. In at least one embodiment, memories 1944A-1944B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), which includes graphics double data rate (GDDR) memory.
[0308] In at least one embodiment, each of the computing clusters 1936A-1936H includes a set of graphics cores, such as the graphics core 1900 of Figure 19A, which may include multiple types of integer and floating-point logic units that can perform computational operations at a range of precisions, including precisions suited to machine learning computations. For example, in at least one embodiment, at least a subset of the floating-point units in each computing cluster 1936A-1936H may be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units may be configured to perform 64-bit floating-point operations.
[0309] In at least one embodiment, multiple instances of the GPGPU 1930 can be configured to operate as a computing cluster. In at least one embodiment, the communication used by the computing clusters 1936A-1936H for synchronization and data exchange varies across embodiments. In at least one embodiment, the multiple instances of the GPGPU 1930 communicate via the host interface 1932. In at least one embodiment, the GPGPU 1930 includes an I / O hub 1939 that couples the GPGPU 1930 to a GPU link 1940, enabling direct connection to other instances of the GPGPU 1930. In at least one embodiment, the GPU link 1940 is coupled to a dedicated GPU-to-GPU bridge, which enables communication and synchronization between the multiple instances of the GPGPU 1930. In at least one embodiment, the GPU link 1940 is coupled to a high-speed interconnect for sending and receiving data to and from other GPGPUs or parallel processors. In at least one embodiment, the multiple instances of the GPGPU 1930 reside in separate data processing systems and communicate via a network device accessible through the host interface 1932. In at least one embodiment, the GPU link 1940 may be configured to enable a connection to a host processor in addition to, or as an alternative to, the host interface 1932.
[0310] In at least one embodiment, the GPGPU 1930 can be configured to train a neural network. In at least one embodiment, the GPGPU 1930 can be used within an inference platform. In at least one embodiment, when using the GPGPU 1930 for inference, the GPGPU may include fewer compute clusters 1936A-1936H compared to when using the GPGPU to train a neural network. In at least one embodiment, the memory technology associated with the memories 1944A-1944B can differ between inference and training configurations, with higher bandwidth memory technology dedicated to the training configuration. In at least one embodiment, the inference configuration of the GPGPU 1930 can support inference-specific instructions. For example, in at least one embodiment, the inference configuration can provide support for one or more 8-bit integer dot product instructions, which can be used during the inference operation of the deployed neural network.
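The arithmetic behind an 8-bit integer dot product can be illustrated with a short Python sketch. It models only the numeric behaviour (int8 operands, wide accumulator), not any specific hardware instruction. The function names `int8_dot` and `quantize` and the simple scale-based quantization scheme are assumptions made for illustration.

```python
def int8_dot(a, b):
    """Dot product of two equal-length int8 vectors with a wide accumulator."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError(f"{value} does not fit in a signed 8-bit integer")
    # Python integers do not overflow, standing in for a wider accumulator.
    return sum(x * y for x, y in zip(a, b))


def quantize(values, scale):
    """Map floats to int8 by dividing by scale, rounding, and clamping."""
    return [max(-128, min(127, round(v / scale))) for v in values]
```

A deployed network would typically quantize weights and activations first, compute with `int8_dot`-style operations, and then rescale the integer result by the product of the two scales.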
[0311] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the GPGPU 1930 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0312] In at least one embodiment, inference and / or training logic 412, 414 may be used in the GPGPU 1930 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0313] Figure 20 is a block diagram illustrating a computer system 2000 according to at least one embodiment. In at least one embodiment, the computer system 2000 includes a processing subsystem 2001 having one or more processors 2002 and a system memory 2004 communicating via an interconnect path that may include a memory hub 2005. In at least one embodiment, the memory hub 2005 may be a separate component within a chipset component or may be integrated within one or more processors 2002. In at least one embodiment, the memory hub 2005 is coupled to an I / O subsystem 2011 via a communication link 2006. In one embodiment, the I / O subsystem 2011 includes an I / O hub 2007 that enables the computer system 2000 to receive input from one or more input devices 2008. In at least one embodiment, the I / O hub 2007 can enable a display controller, which may be included in one or more processors 2002, to provide output to one or more display devices 2010A. In at least one embodiment, one or more display devices 2010A coupled to the I / O hub 2007 may include local, internal, or embedded display devices.
[0314] In at least one embodiment, the processing subsystem 2001 includes one or more parallel processors 2012 coupled to the memory hub 2005 via a bus or other communication link 2013. In at least one embodiment, the communication link 2013 can be any of a number of standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or can be a vendor-specific communication interface or communication fabric. In at least one embodiment, one or more parallel processors 2012 form a computationally focused parallel or vector processing system, which may include a large number of processing cores and / or processing clusters, such as a many integrated core (MIC) processor. In at least one embodiment, one or more parallel processors 2012 form a graphics processing subsystem that can output pixels to one of the one or more display devices 2010A coupled via the I / O hub 2007. In at least one embodiment, the parallel processors 2012 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 2010B.
[0315] In at least one embodiment, a system storage unit 2014 may be connected to the I / O hub 2007 to provide a storage mechanism for the computer system 2000. In at least one embodiment, an I / O switch 2016 may be used to provide an interface mechanism that enables connections between the I / O hub 2007 and other components, such as a network adapter 2018 and / or a wireless network adapter 2019 that may be integrated into the platform, and various other devices that can be added via one or more add-in devices 2020. In at least one embodiment, the network adapter 2018 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, the wireless network adapter 2019 may include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.
[0316] In at least one embodiment, the computer system 2000 may include other components not explicitly shown, such as USB or other port connections, optical storage drives, video capture devices, etc., which may also be connected to the I / O hub 2007. In at least one embodiment, the communication paths interconnecting the various components in Figure 20 can be implemented using any suitable protocols, such as PCI-based protocols (e.g., PCI-Express) or other bus or point-to-point communication interfaces and / or protocols, such as the NV-Link high-speed interconnect or other interconnect protocols.
[0317] In at least one embodiment, one or more parallel processors 2012 include circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In at least one embodiment, one or more parallel processors 2012 include circuitry optimized for general-purpose processing. In at least one embodiment, components of the computer system 2000 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, one or more parallel processors 2012, memory hub 2005, processor 2002, and I / O hub 2007 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computer system 2000 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computer system 2000 may be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computer system.
[0318] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the computer system 2000 of Figure 20 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0319] In at least one embodiment, inference and / or training logic 412, 414 may be used in the computer system 2000 of Figure 20 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0320] Processors
[0321] Figure 21A illustrates a parallel processor 2100 according to at least one embodiment. In at least one embodiment, various components of the parallel processor 2100 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). In at least one embodiment, the illustrated parallel processor 2100 is a variant of the one or more parallel processors 2012 shown in Figure 20, according to an exemplary embodiment.
[0322] In at least one embodiment, the parallel processor 2100 includes a parallel processing unit 2102. In at least one embodiment, the parallel processing unit 2102 includes an I / O unit 2104 that enables communication with other devices, including other instances of the parallel processing unit 2102. In at least one embodiment, the I / O unit 2104 can be directly connected to other devices. In at least one embodiment, the I / O unit 2104 is connected to other devices using a hub or switch interface (e.g., a memory hub 2005). In at least one embodiment, the connection between the memory hub 2005 and the I / O unit 2104 forms a communication link 2013. In at least one embodiment, the I / O unit 2104 is connected to a host interface 2106 and a memory crossbar switch 2116, wherein the host interface 2106 receives commands for performing processing operations, and the memory crossbar switch 2116 receives commands for performing memory operations.
[0323] In at least one embodiment, when host interface 2106 receives a command buffer via I / O unit 2104, host interface 2106 can direct the work operations for performing those commands to front end 2108. In at least one embodiment, front end 2108 is coupled to scheduler 2110, which is configured to distribute commands or other work items to processing cluster array 2112. In at least one embodiment, scheduler 2110 ensures that processing cluster array 2112 is properly configured and in a valid state before tasks are distributed to processing cluster array 2112. In at least one embodiment, scheduler 2110 is implemented via firmware logic executing on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 2110 can be configured to perform complex scheduling and work distribution operations at both coarse and fine granularity, thereby enabling fast preemption and context switching of threads executing on processing array 2112. In at least one embodiment, host software can submit workloads for scheduling on processing array 2112 via one of a plurality of graphics processing doorbells. In at least one embodiment, the workloads can then be automatically distributed across the processing array 2112 by the scheduler 2110 logic within the microcontroller that includes the scheduler 2110.
[0324] In at least one embodiment, the processing cluster array 2112 may include up to "N" processing clusters (e.g., clusters 2114A, 2114B to 2114N). In at least one embodiment, each cluster 2114A-2114N of the processing cluster array 2112 may execute a large number of concurrent threads. In at least one embodiment, the scheduler 2110 may use various scheduling and / or work allocation algorithms to allocate work to the clusters 2114A-2114N of the processing cluster array 2112, which may vary depending on the workload generated by each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 2110, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing cluster array 2112. In at least one embodiment, different clusters 2114A-2114N of the processing cluster array 2112 may be assigned to process different types of programs or to perform different types of computations.
[0325] In at least one embodiment, the processing cluster array 2112 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing cluster array 2112 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing cluster array 2112 may include logic for performing processing tasks that include filtering video and / or audio data, performing modeling operations (including physics operations), and performing data transformations.
[0326] In at least one embodiment, the processing cluster array 2112 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing cluster array 2112 may include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing cluster array 2112 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 2102 may transfer data from system memory via I / O unit 2104 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 2122) and then written back to system memory.
[0327] In at least one embodiment, when the parallel processing unit 2102 is used to perform graphics processing, the scheduler 2110 may be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among the multiple clusters 2114A-2114N of the processing cluster array 2112. In at least one embodiment, portions of the processing cluster array 2112 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 2114A-2114N may be stored in a buffer to allow intermediate data to be transferred between the clusters 2114A-2114N for further processing.
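The division of a workload into approximately equal-sized tasks can be sketched as a simple range partition. This is one straightforward way to do it, not necessarily the scheduler's actual algorithm. The function name `split_workload` and the representation of work as a count of items are assumptions made for illustration.

```python
def split_workload(num_items, num_clusters):
    """Return one (start, stop) index range per cluster.

    Range sizes differ by at most one item, so the load is near-uniform.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    if num_items < 0:
        raise ValueError("num_items must be non-negative")
    base, extra = divmod(num_items, num_clusters)
    ranges = []
    start = 0
    for cluster in range(num_clusters):
        # The first `extra` clusters each take one additional item.
        size = base + (1 if cluster < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, ten work items split across three clusters yield ranges of sizes 4, 3, and 3.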
[0328] In at least one embodiment, the processing cluster array 2112 may receive processing tasks to be executed via a scheduler 2110, which receives commands defining the processing tasks from the front end 2108.
[0329] In at least one embodiment, a processing task may include indices of data to be processed, such as surface (patch) data, primitive data, vertex data, and / or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). In at least one embodiment, the scheduler 2110 may be configured to fetch the indices corresponding to the tasks, or may receive the indices from the front end 2108. In at least one embodiment, the front end 2108 may be configured to ensure that the processing cluster array 2112 is configured to a valid state before the workload specified by incoming command buffers (e.g., batch buffers, push buffers, etc.) is initiated.
[0330] In at least one embodiment, each of one or more instances of the parallel processing unit 2102 may be coupled to the parallel processor memory 2122. In at least one embodiment, the parallel processor memory 2122 may be accessed via a memory crossbar switch 2116, which may receive memory requests from the processing cluster array 2112 and the I / O unit 2104. In at least one embodiment, the memory crossbar switch 2116 may access the parallel processor memory 2122 via a memory interface 2118. In at least one embodiment, the memory interface 2118 may include a plurality of partition units (e.g., partition units 2120A, 2120B to 2120N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 2122. In at least one embodiment, the number of partition units 2120A-2120N is configured to equal the number of memory units, such that the first partition unit 2120A has a corresponding first memory unit 2124A, the second partition unit 2120B has a corresponding second memory unit 2124B, and the Nth partition unit 2120N has a corresponding Nth memory unit 2124N. In at least one embodiment, the number of partition units 2120A-2120N may not equal the number of memory units.
[0331] In at least one embodiment, memory units 2124A-2124N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, memory units 2124A-2124N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). In at least one embodiment, render targets such as frame buffers or texture maps may be stored across memory units 2124A-2124N, allowing partition units 2120A-2120N to write portions of each render target in parallel to efficiently utilize the available bandwidth of the parallel processor memory 2122. In at least one embodiment, local instances of the parallel processor memory 2122 may be excluded in favor of a unified memory design that uses system memory in conjunction with local cache memory.
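As a rough illustration of how a render target stored across several memory units lets the partition units write in parallel, the following minimal sketch distributes tiles of a render target round-robin over partition units. The round-robin policy and the function name `assign_tiles_to_partitions` are assumptions made for illustration; the description does not specify an interleaving scheme.

```python
from typing import Dict, List


def assign_tiles_to_partitions(num_tiles: int, num_partitions: int) -> Dict[int, List[int]]:
    """Distribute render-target tiles over partition units.

    Assumption: simple round-robin interleaving, chosen only for illustration.
    """
    assignment: Dict[int, List[int]] = {p: [] for p in range(num_partitions)}
    for tile in range(num_tiles):
        # Consecutive tiles land on different partition units, so they
        # can be written in parallel.
        assignment[tile % num_partitions].append(tile)
    return assignment


layout = assign_tiles_to_partitions(num_tiles=8, num_partitions=4)
```

With eight tiles and four partition units, each unit receives two tiles, so the write bandwidth of all four memory units is used evenly.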
[0332] In at least one embodiment, any of the clusters 2114A-2114N of the processing cluster array 2112 can process data to be written to any of the memory units 2124A-2124N within the parallel processor memory 2122. In at least one embodiment, the memory crossbar switch 2116 can be configured to transfer the output of each cluster 2114A-2114N to any partition unit 2120A-2120N or to another cluster 2114A-2114N, which can perform further processing operations on the output. In at least one embodiment, each cluster 2114A-2114N can communicate with the memory interface 2118 via the memory crossbar switch 2116 to read from or write to various external storage devices. In at least one embodiment, the memory crossbar switch 2116 has a connection to the memory interface 2118 for communication with the I / O unit 2104, and a connection to a local instance of the parallel processor memory 2122, thereby enabling processing units within different processing clusters 2114A-2114N to communicate with system memory or other memory not local to the parallel processing unit 2102. In at least one embodiment, the memory crossbar switch 2116 may use virtual channels to separate traffic flows between clusters 2114A-2114N and partition units 2120A-2120N.
[0333] In at least one embodiment, multiple instances of the parallel processing unit 2102 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 2102 may be configured to interoperate, even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and / or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 2102 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system incorporating one or more instances of the parallel processing unit 2102 or the parallel processor 2100 may be implemented in various configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and / or embedded systems.
[0334] FIG. 21B is a block diagram of a partition unit 2120 according to at least one embodiment. In at least one embodiment, the partition unit 2120 is an instance of one of the partition units 2120A-2120N of FIG. 21A. In at least one embodiment, the partition unit 2120 includes an L2 cache 2121, a frame buffer interface 2125, and a ROP 2126 (raster operation unit). The L2 cache 2121 is a read / write cache configured to perform load and store operations received from the memory crossbar switch 2116 and the ROP 2126. In at least one embodiment, the L2 cache 2121 outputs read misses and urgent write-back requests to the frame buffer interface 2125 for processing. In at least one embodiment, updates can also be sent to the frame buffer for processing via the frame buffer interface 2125. In at least one embodiment, the frame buffer interface 2125 interfaces with one of the memory units in the parallel processor memory, such as the memory units 2124A-2124N of FIG. 21A (e.g., within the parallel processor memory 2122).
[0335] In at least one embodiment, ROP 2126 is a processing unit that performs raster operations such as stenciling, z-testing, blending, etc. In at least one embodiment, ROP 2126 then outputs processed graphics data that is stored in graphics memory. In at least one embodiment, ROP 2126 includes compression logic to compress depth or color data written to memory and decompress depth or color data read from memory. In at least one embodiment, the compression logic may be lossless compression logic utilizing one or more of a variety of compression algorithms. The type of compression performed by ROP 2126 may vary based on the statistical characteristics of the data to be compressed. For example, in at least one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
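The per-tile lossless compression mentioned above can be pictured with a very small difference-based (delta) encoder. This is only a sketch under assumptions: the description does not give the algorithm used by ROP 2126, and the function names `compress_tile` and `decompress_tile` are hypothetical.

```python
def compress_tile(pixels):
    """Encode a tile as an anchor value followed by successive differences.

    Assumption: a plain 1-D delta scheme, used here only to show why
    smooth tiles compress well (most deltas are small).
    """
    if not pixels:
        return []
    deltas = [pixels[0]]  # anchor value stored as-is
    for prev, cur in zip(pixels, pixels[1:]):
        deltas.append(cur - prev)
    return deltas


def decompress_tile(deltas):
    """Invert compress_tile exactly, so the scheme is lossless."""
    pixels = []
    running = 0
    for i, d in enumerate(deltas):
        running = d if i == 0 else running + d
        pixels.append(running)
    return pixels


tile = [100, 101, 101, 103]
encoded = compress_tile(tile)
restored = decompress_tile(encoded)
```

Because decoding reproduces the original values exactly, such a scheme is lossless; how the small deltas are then packed into fewer bits is left out of the sketch.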
[0336] In at least one embodiment, ROP 2126 is included within each processing cluster (e.g., clusters 2114A-2114N of FIG. 21), rather than within partition unit 2120. In at least one embodiment, read and write requests for pixel data, rather than pixel fragment data, are transmitted via the memory crossbar switch 2116. In at least one embodiment, the processed graphics data may be displayed on a display device (such as one of the one or more display devices 2210 of FIG. 22), routed for further processing by the processor 2202, or routed for further processing by one of the processing entities within the parallel processor 2100 of FIG. 21A.
[0337] FIG. 21C is a block diagram of a processing cluster 2114 within a parallel processing unit according to at least one embodiment. In at least one embodiment, the processing cluster is an instance of one of the processing clusters 2114A-2114N of FIG. 21. In at least one embodiment, the processing cluster 2114 may be configured to execute many threads in parallel, wherein the term "thread" refers to an instance of a specific program executing on a particular set of input data. In at least one embodiment, a Single Instruction Multiple Data (SIMD) instruction issue technique is used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, a Single Instruction Multiple Thread (SIMT) technique is used to support the parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster.
[0338] In at least one embodiment, the operation of the processing cluster 2114 can be controlled by a pipeline manager 2132 that assigns processing tasks to SIMT parallel processors. In at least one embodiment, the pipeline manager 2132 receives instructions from the scheduler 2110 of FIG. 21 and manages the execution of these instructions via the graphics multiprocessor 2134 and / or texture unit 2136. In at least one embodiment, the graphics multiprocessor 2134 is an exemplary instance of a SIMT parallel processor. However, in at least one embodiment, the processing cluster 2114 may include various types of SIMT parallel processors with different architectures. In at least one embodiment, the processing cluster 2114 may include one or more instances of the graphics multiprocessor 2134. In at least one embodiment, the graphics multiprocessor 2134 can process data, and the data crossbar switch 2140 can be used to distribute the processed data to one of several possible destinations (including other shader units). In at least one embodiment, the pipeline manager 2132 can facilitate the distribution of processed data by specifying the destination of the processed data to be distributed via the data crossbar switch 2140.
[0339] In at least one embodiment, each graphics multiprocessor 2134 within the processing cluster 2114 may include the same set of functional execution logic (e.g., arithmetic logic units, load / store units, etc.). In at least one embodiment, the functional execution logic may be configured in a pipelined manner, wherein new instructions may be issued before previous instructions complete. In at least one embodiment, the functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, shift operations, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be used to perform different operations, and any combination of functional units may exist.
[0340] In at least one embodiment, instructions sent to the processing cluster 2114 constitute threads. In at least one embodiment, a group of threads executed across a set of parallel processing engines is a thread group. In at least one embodiment, the thread group executes a program on different input data. In at least one embodiment, each thread within the thread group may be assigned to a different processing engine within the graphics multiprocessor 2134. In at least one embodiment, the thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 2134. In at least one embodiment, when the thread group includes fewer threads than the number of processing engines, one or more processing engines may be idle during the cycles in which that thread group is being processed. In at least one embodiment, the thread group may also include more threads than the number of processing engines within the graphics multiprocessor 2134. In at least one embodiment, when the thread group includes more threads than the number of processing engines within the graphics multiprocessor 2134, processing can be performed over consecutive clock cycles. In at least one embodiment, multiple thread groups can be executed simultaneously on the graphics multiprocessor 2134.
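The relationship between thread-group size and the number of processing engines can be worked through numerically. The sketch below is a simplified model, assuming each pass assigns at most one thread per engine; the function name `schedule_thread_group` is hypothetical.

```python
import math


def schedule_thread_group(num_threads: int, num_engines: int):
    """Return (passes, idle_engines_in_last_pass) for one thread group.

    Assumption: each pass runs at most one thread per processing engine.
    """
    passes = math.ceil(num_threads / num_engines)
    idle_in_last_pass = passes * num_engines - num_threads
    return passes, idle_in_last_pass


small_group = schedule_thread_group(num_threads=24, num_engines=32)
large_group = schedule_thread_group(num_threads=64, num_engines=32)
uneven_group = schedule_thread_group(num_threads=40, num_engines=32)
```

A group of 24 threads on 32 engines completes in one pass with 8 engines idle, while a group of 64 threads needs two consecutive passes with no idle engines.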
[0341] In at least one embodiment, the graphics multiprocessor 2134 includes an internal cache memory for performing load and store operations. In at least one embodiment, the graphics multiprocessor 2134 may forgo the internal cache and use a cache memory (e.g., L1 cache 2148) within the processing cluster 2114. In at least one embodiment, each graphics multiprocessor 2134 may also access an L2 cache within partition units (e.g., partition units 2120A-2120N of FIG. 21), which are shared among all processing clusters 2114 and can be used to transfer data between threads. In at least one embodiment, the graphics multiprocessor 2134 may also access off-chip global memory, which may include one or more of local parallel processor memory and / or system memory. In at least one embodiment, any memory outside of the parallel processing unit 2102 may be used as global memory. In at least one embodiment, the processing cluster 2114 includes multiple instances of the graphics multiprocessor 2134, which may share common instructions and data that may be stored in the L1 cache 2148.
[0342] In at least one embodiment, each processing cluster 2114 may include a memory management unit (“MMU”) 2145 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of MMU 2145 may reside within the memory interface 2118 of FIG. 21. In at least one embodiment, MMU 2145 includes a set of page table entries (PTEs) for mapping virtual addresses to physical addresses of tiles and optionally to cache line indices. In at least one embodiment, MMU 2145 may include an address translation lookaside buffer (TLB) or a cache that may reside within the graphics multiprocessor 2134, the L1 cache, or the processing cluster 2114. In at least one embodiment, physical addresses are processed to distribute surface data access locality, allowing efficient request interleaving among partition units. In at least one embodiment, cache line indices may be used to determine whether a request for a cache line is a hit or a miss.
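The virtual-to-physical mapping performed with page table entries and a TLB can be sketched as follows. This is a minimal software model, not the hardware design: the 4096-byte page size, the class name `SimpleMMU`, and the unbounded TLB are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; not stated in the description


class SimpleMMU:
    """Toy MMU: a page table plus a TLB that caches recent translations."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page number -> physical frame
        self.tlb = {}                       # cached subset of the page table
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
        else:
            self.tlb_misses += 1
            # A missing entry raises KeyError, standing in for a page fault.
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset


mmu = SimpleMMU({0: 7, 1: 3})
first = mmu.translate(4100)   # virtual page 1, offset 4 (TLB miss)
second = mmu.translate(4200)  # virtual page 1 again (TLB hit)
```

The second access to the same page is served from the TLB without consulting the page table, which is the purpose of caching translations close to the multiprocessor.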
[0343] In at least one embodiment, the processing cluster 2114 may be configured such that each graphics multiprocessor 2134 is coupled to a texture unit 2136 to perform texture mapping operations that determine texture sample locations, read texture data, and filter texture data. In at least one embodiment, texture data is read from an internal texture L1 cache (not shown) or from an L1 cache within the graphics multiprocessor 2134, and is fetched as needed from an L2 cache, local parallel processor memory, or system memory. In at least one embodiment, each graphics multiprocessor 2134 outputs a processed task to a data crossbar switch 2140 to provide the processed task to another processing cluster 2114 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via a memory crossbar switch 2116. In at least one embodiment, the preROP 2142 (pre-raster operation unit) is configured to receive data from the graphics multiprocessor 2134 and direct the data to a ROP unit that may be located with the partition units described herein (e.g., partition units 2120A-2120N of FIG. 21). In at least one embodiment, the preROP 2142 unit may perform optimizations for color blending, organize pixel color data, and perform address translation.
[0344] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with FIG. 8A and / or FIG. 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the graphics processing cluster 2114 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0345] In at least one embodiment, inference and / or training logic 412, 414 may be used in the graphics processing cluster 2114 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0346] FIG. 21D illustrates a graphics multiprocessor 2134 according to at least one embodiment. In at least one embodiment, the graphics multiprocessor 2134 is coupled to a pipeline manager 2132 of a processing cluster 2114. In at least one embodiment, the graphics multiprocessor 2134 has an execution pipeline including, but not limited to, an instruction cache 2152, an instruction unit 2154, an address mapping unit 2156, a register file 2158, one or more general-purpose graphics processing unit (GPGPU) cores 2162, and one or more load / store units 2166. The GPGPU cores 2162 and the load / store units 2166 are coupled to a cache memory 2172 and a shared memory 2170 via a memory and cache interconnect 2168.
[0347] In at least one embodiment, instruction cache 2152 receives a stream of instructions to be executed from pipeline manager 2132. In at least one embodiment, instructions are cached in instruction cache 2152 and dispatched for execution by instruction unit 2154. In one embodiment, instruction unit 2154 may dispatch instructions as thread groups (e.g., warps), assigning each thread of the thread group to a different execution unit within GPGPU core 2162. In at least one embodiment, instructions can access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 2156 may be used to translate addresses in the unified address space into different memory addresses that can be accessed by load / store unit 2166.
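One way to picture the translation from a unified address space into local, shared, or global addresses is a lookup over address windows. The window boundaries below and the function name `map_unified_address` are purely hypothetical; the description does not state how the address mapping unit 2156 partitions the unified space.

```python
# Hypothetical windows inside the unified address space: (space, start, end).
WINDOWS = [
    ("local", 0x0000, 0x1000),
    ("shared", 0x1000, 0x2000),
]


def map_unified_address(addr: int):
    """Return (address_space, address_within_that_space).

    Assumption: anything outside the local and shared windows is treated
    as a global address and passed through unchanged.
    """
    for space, start, end in WINDOWS:
        if start <= addr < end:
            return space, addr - start
    return "global", addr


local_hit = map_unified_address(0x0010)
shared_hit = map_unified_address(0x1004)
global_hit = map_unified_address(0x5000)
```

An instruction only supplies a single unified address; the mapping step decides which memory the load / store unit actually accesses.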
[0348] In at least one embodiment, register file 2158 provides a set of registers for functional units of graphics multiprocessor 2134. In at least one embodiment, register file 2158 provides temporary storage for operands connected to the data paths of functional units of graphics multiprocessor 2134 (e.g., GPGPU core 2162, load / store unit 2166). In at least one embodiment, register file 2158 is divided among the functional units, such that a dedicated portion of register file 2158 is allocated to each functional unit. In at least one embodiment, register file 2158 is divided among the different warps being executed by graphics multiprocessor 2134.
[0349] In at least one embodiment, each of the GPGPU cores 2162 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 2134. The GPGPU cores 2162 may be architecturally similar or may differ in architecture. In at least one embodiment, a first portion of the GPGPU cores 2162 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. In at least one embodiment, the FPU may implement the IEEE 754-2008 standard for floating-point arithmetic or enable variable-precision floating-point arithmetic. In at least one embodiment, the graphics multiprocessor 2134 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores may also include fixed-function or special-function logic.
[0350] In at least one embodiment, the GPGPU core 2162 includes SIMD logic capable of executing a single instruction on multiple sets of data. In one embodiment, the GPGPU core 2162 can physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU core can be generated by a shader compiler at compile time or automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. In at least one embodiment, multiple threads of a program configured for a SIMT execution model can be executed using a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads performing the same or similar operations can be executed in parallel using a single SIMD8 logic unit.
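The distinction between a logical SIMD width and the physical width of the hardware can be sketched by splitting a wide operation into physical-width issues. This is a simplified model under assumptions: the description does not say how a logical SIMD32 instruction is realized, and `execute_simd` is a hypothetical helper.

```python
def execute_simd(op, operands, physical_width: int):
    """Apply `op` to every lane, counting how many physical issues are needed.

    Assumption: a logical operation wider than the hardware is split into
    consecutive groups of at most `physical_width` lanes.
    """
    results = []
    issues = 0
    for start in range(0, len(operands), physical_width):
        lane_group = operands[start:start + physical_width]
        results.extend(op(x) for x in lane_group)  # one physical SIMD issue
        issues += 1
    return results, issues


# A logical SIMD32 operation executed on SIMD16 hardware.
values, issues = execute_simd(lambda x: x * 2, list(range(32)), physical_width=16)
```

In this model the 32-lane operation takes two issues on 16-lane hardware, whereas eight threads doing the same operation fit into a single SIMD8 issue.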
[0351] In at least one embodiment, the memory and cache interconnect 2168 is an interconnect network connecting each functional unit of the graphics multiprocessor 2134 to the register file 2158 and the shared memory 2170. In at least one embodiment, the memory and cache interconnect 2168 is a crossbar interconnect that allows the load / store unit 2166 to perform load and store operations between the shared memory 2170 and the register file 2158. In at least one embodiment, the register file 2158 can operate at the same frequency as the GPGPU core 2162, resulting in very low latency for data transfer between the GPGPU core 2162 and the register file 2158. In at least one embodiment, the shared memory 2170 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 2134. In at least one embodiment, the cache memory 2172 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 2136. In at least one embodiment, the shared memory 2170 can also be used as a program-managed cache. In at least one embodiment, in addition to the automatically cached data stored in cache memory 2172, threads executing on GPGPU core 2162 can also programmatically store data in shared memory.
[0352] In at least one embodiment, a parallel processor or GPGPU, as described herein, is communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., high-speed interconnects such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated with the core on the same package or chip and communicatively coupled to the core via an internal processor bus / interconnect (i.e., within the package or chip). In at least one embodiment, regardless of how the GPU is connected, the processor core may assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. In at least one embodiment, the GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.
[0353] The inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with FIG. 8A and / or FIG. 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the graphics multiprocessor 2134 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0354] In at least one embodiment, inference and / or training logic 412, 414 may be used in graphics multiprocessor 2134 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0355] FIG. 22 illustrates a multi-GPU computing system 2200 according to at least one embodiment. In at least one embodiment, the multi-GPU computing system 2200 may include a processor 2202 coupled to a plurality of general-purpose graphics processing units (GPGPUs) 2206A-D via a host interface switch 2204. In at least one embodiment, the host interface switch 2204 is a PCI Express switch device that couples the processor 2202 to a PCI Express bus, through which the processor 2202 communicates with the GPGPUs 2206A-D. The GPGPUs 2206A-D may be interconnected via a set of high-speed P2P GPU-to-GPU links 2216. In at least one embodiment, the GPU-to-GPU links 2216 are connected to each of the GPGPUs 2206A-D via dedicated GPU links. In at least one embodiment, the P2P GPU links 2216 enable direct communication between each of the GPGPUs 2206A-D without communication via the host interface bus 2204 to which the processor 2202 is connected. In at least one embodiment, when GPU-to-GPU traffic is directed to the P2P GPU link 2216, the host interface bus 2204 remains available for system memory access or, for example, communication with other instances of the multi-GPU computing system 2200 via one or more network devices. While in at least one embodiment, the GPGPUs 2206A-D are connected to the processor 2202 via the host interface switch 2204, in at least one embodiment, the processor 2202 includes direct support for the P2P GPU link 2216 and can be directly connected to the GPGPUs 2206A-D.
[0356] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with FIG. 8A and / or FIG. 8B. In at least one embodiment, inference and / or training logic 815 may be used in a multi-GPU computing system 2200 for performing inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0357] In at least one embodiment, inference and / or training logic 412, 414 may be used in a multi-GPU computing system 2200 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0358] FIG. 23 is a block diagram of a graphics processor 2300 according to at least one embodiment. In at least one embodiment, the graphics processor 2300 includes a ring interconnect 2302, a pipeline front end 2304, a media engine 2337, and graphics cores 2380A-2380N. In at least one embodiment, the ring interconnect 2302 couples the graphics processor 2300 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, the graphics processor 2300 is one of many processors integrated within a multi-core processing system.
[0359] In at least one embodiment, the graphics processor 2300 receives multiple batches of commands via a ring interconnect 2302. In at least one embodiment, the input commands are interpreted by a command streamer 2303 in a pipeline front-end 2304. In at least one embodiment, the graphics processor 2300 includes scalable execution logic for performing 3D geometry processing and media processing via graphics cores 2380A-2380N. In at least one embodiment, for 3D geometry processing commands, the command streamer 2303 provides the commands to the geometry pipeline 2336. In at least one embodiment, for at least some media processing commands, the command streamer 2303 provides the commands to a video front-end 2334, which is coupled to a media engine 2337. In at least one embodiment, the media engine 2337 includes a video quality engine (VQE) 2330 for video and image post-processing, and a multi-format encoding / decoding (MFX) engine 2333 for providing hardware-accelerated media data encoding and decoding. In at least one embodiment, the geometry pipeline 2336 and the media engine 2337 each generate an execution thread for the thread execution resources provided by at least one graphics core 2380A.
[0360] In at least one embodiment, the graphics processor 2300 includes modular cores 2380A-2380N (sometimes referred to as core slices) with scalable thread execution resource features, each graphics core having multiple sub-cores 2350A-2350N, 2360A-2360N (sometimes referred to as core sub-slices). In at least one embodiment, the graphics processor 2300 may have any number of graphics cores 2380A to 2380N. In at least one embodiment, the graphics processor 2300 includes a graphics core 2380A having at least a first sub-core 2350A and a second sub-core 2360A. In at least one embodiment, the graphics processor 2300 is a low-power processor with a single sub-core (e.g., 2350A). In at least one embodiment, the graphics processor 2300 includes multiple graphics cores 2380A-2380N, each graphics core including a set of first sub-cores 2350A-2350N and a set of second sub-cores 2360A-2360N. In at least one embodiment, each of the first sub-cores 2350A-2350N includes at least a first set of execution units 2352A-2352N and media / texture samplers 2354A-2354N. In at least one embodiment, each of the second sub-cores 2360A-2360N includes at least a second set of execution units 2362A-2362N and samplers 2364A-2364N. In at least one embodiment, each of the sub-cores 2350A-2350N and 2360A-2360N shares a set of shared resources 2370A-2370N. In at least one embodiment, the shared resources include a shared cache memory and pixel operation logic.
[0361] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with FIG. 8A and / or FIG. 8B. In at least one embodiment, the inference and / or training logic 815 may be used in the graphics processor 2300 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0362] In at least one embodiment, inference and / or training logic 412, 414 may be used in the graphics processor 2300 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0363] FIG. 24 is a block diagram illustrating a microarchitecture for a processor 2400 according to at least one embodiment, the processor 2400 including logic circuitry for executing instructions. In at least one embodiment, the processor 2400 can execute instructions, including x86 instructions, ARM instructions, special-purpose instructions for application-specific integrated circuits (ASICs), etc. In at least one embodiment, the processor 2410 may include registers for storing packed data, such as the 64-bit wide MMX™ registers in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. In at least one embodiment, the MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany Single Instruction Multiple Data (“SIMD”) and Streaming SIMD Extensions (“SSE”) instructions. In at least one embodiment, 128-bit wide XMM registers associated with SSE2, SSE3, SSE4, AVX, or later (generally referred to as “SSEx”) technologies can hold such packed data operands. In at least one embodiment, the processor 2410 can execute instructions to accelerate machine learning or deep learning algorithms, training, or inference.
[0364] In at least one embodiment, processor 2400 includes an in-order front end (“front end”) 2401 to fetch instructions to be executed and prepare instructions for later use in the processor pipeline. In at least one embodiment, front end 2401 may include several units. In at least one embodiment, instruction prefetcher 2426 fetches instructions from memory and provides the instructions to instruction decoder 2428, which in turn decodes or interprets the instructions. For example, in at least one embodiment, instruction decoder 2428 decodes a received instruction into one or more machine-executable operations called “micro-instructions” or “micro-operations” (also referred to as “micro-ops” or “uops”). In at least one embodiment, instruction decoder 2428 parses the instructions into opcodes and corresponding data and control fields, which can be used by the microarchitecture to perform operations according to at least one embodiment. In at least one embodiment, trace cache 2430 may assemble the decoded micro-instructions into a program-ordered sequence or trace in micro-instruction queue 2434 for execution. In at least one embodiment, when the trace cache 2430 encounters a complex instruction, the microcode ROM 2432 provides the micro-instructions required to complete the operation.
[0365] In at least one embodiment, some instructions may be converted into a single micro-operation, while others require several micro-operations to complete the entire operation. In at least one embodiment, if more than four micro-instructions are required to complete an instruction, the instruction decoder 2428 may access the microcode ROM 2432 to execute the instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-instructions for processing at the instruction decoder 2428. In at least one embodiment, if multiple micro-instructions are required to complete the operation, the instruction may be stored in the microcode ROM 2432. In at least one embodiment, the trace cache 2430 references an entry point programmable logic array (“PLA”) to determine the correct micro-instruction pointer for reading a microcode sequence from the microcode ROM 2432 to complete one or more instructions. In at least one embodiment, after the microcode ROM 2432 has finished sequencing the micro-operations for an instruction, the front end 2401 of the machine may resume fetching micro-operations from the trace cache 2430.
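The choice between decoding at the instruction decoder and fetching a sequence from the microcode ROM can be expressed as a threshold test on the number of micro-instructions. The sketch below encodes only the "more than four" rule stated above; the names `MAX_DECODER_UOPS` and `select_decode_path` are hypothetical.

```python
MAX_DECODER_UOPS = 4  # threshold taken from the "more than four" rule above


def select_decode_path(uop_count: int) -> str:
    """Pick where the micro-instructions for one instruction come from."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-instruction")
    if uop_count > MAX_DECODER_UOPS:
        return "microcode_rom"  # complex instruction: sequence read from ROM
    return "decoder"            # simple instruction: handled at the decoder


simple = select_decode_path(1)
boundary = select_decode_path(4)
complex_case = select_decode_path(5)
```

An instruction needing exactly four micro-instructions is still handled at the decoder; only the fifth pushes it onto the microcode ROM path.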
[0366] In at least one embodiment, the out-of-order execution engine (“out-of-order engine”) 2403 can prepare instructions for execution. In at least one embodiment, the out-of-order execution logic has multiple buffers to smooth and reorder the instruction flow to optimize performance as instructions descend the pipeline and are scheduled for execution. In at least one embodiment, the out-of-order execution engine 2403 includes, but is not limited to, an allocator / register renamer 2440, a memory microinstruction queue 2442, an integer / floating-point microinstruction queue 2444, a memory scheduler 2446, a fast scheduler 2402, a slow / general-purpose floating-point scheduler (“slow / general-purpose FP scheduler”) 2404, and a simple floating-point scheduler (“simple FP scheduler”) 2406. In at least one embodiment, the fast scheduler 2402, the slow / general-purpose floating-point scheduler 2404, and the simple floating-point scheduler 2406 are also collectively referred to as “microinstruction schedulers 2402, 2404, 2406”. In at least one embodiment, the allocator / register renamer 2440 allocates the machine buffers and resources that each microinstruction needs in order to execute. In at least one embodiment, the allocator / register renamer 2440 renames logical registers to entries in a register file. In at least one embodiment, the allocator / register renamer 2440 also allocates an entry for each microinstruction in one of two microinstruction queues, memory microinstruction queue 2442 for memory operations and integer / floating-point microinstruction queue 2444 for non-memory operations, ahead of the memory scheduler 2446 and microinstruction schedulers 2402, 2404, and 2406. In at least one embodiment, the microinstruction schedulers 2402, 2404, and 2406 determine when a microinstruction is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the microinstruction needs to complete its operation. In at least one embodiment, the fast scheduler 2402 can schedule on each half of the main clock cycle, while the slow / general-purpose floating-point scheduler 2404 and the simple floating-point scheduler 2406 can schedule once per main processor clock cycle. In at least one embodiment, microinstruction schedulers 2402, 2404, and 2406 arbitrate for scheduling ports to schedule microinstructions for execution.
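The readiness test applied by a microinstruction scheduler can be sketched as follows. This is a simplified model under stated assumptions: the `MicroOp` type, the register and unit names, and the rule that each execution unit accepts at most one micro-operation per cycle are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple  # names of the input registers this micro-op reads
    unit: str       # execution resource this micro-op needs


def ready_to_issue(uop, ready_registers, free_units):
    """A micro-op may issue when every source operand is ready and its unit is free."""
    return all(src in ready_registers for src in uop.sources) and uop.unit in free_units


def schedule(queue, ready_registers, free_units):
    """Select, in queue order, the micro-ops that can issue in this cycle.

    Assumption for the sketch: each execution unit accepts one micro-op per cycle.
    """
    issued = []
    available = set(free_units)
    for uop in queue:
        if ready_to_issue(uop, ready_registers, available):
            issued.append(uop.name)
            available.discard(uop.unit)  # the unit is now busy for this cycle
    return issued
```

A micro-operation whose operand is still pending, or whose execution unit has already been claimed in the current cycle, simply stays in the queue until a later cycle.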
[0367] In at least one embodiment, execution block 2411 includes, but is not limited to, integer register file / bypass network 2408, floating-point register file / bypass network (“FP register file / bypass network”) 2410, address generation units (“AGU”) 2412 and 2414, fast arithmetic logic units (“fast ALU”) 2416 and 2418, slow arithmetic logic unit (“slow ALU”) 2420, floating-point ALU (“FP”) 2422, and floating-point move unit (“FP move”) 2424. In at least one embodiment, integer register file / bypass network 2408 and floating-point register file / bypass network 2410 are also referred to herein as “register files 2408, 2410”. In at least one embodiment, AGUs 2412 and 2414, fast ALUs 2416 and 2418, slow ALU 2420, floating-point ALU 2422, and floating-point move unit 2424 are also referred to herein as “execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424”. In at least one embodiment, execution block 2411 may include, but is not limited to, any number (including zero) and type of register files, bypass networks, address generation units, and execution units (in any combination).
[0368] In at least one embodiment, register files 2408, 2410 may be arranged between microinstruction schedulers 2402, 2404, 2406 and execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424. In at least one embodiment, integer register file / bypass network 2408 performs integer operations. In at least one embodiment, floating-point register file / bypass network 2410 performs floating-point operations. In at least one embodiment, each of register files 2408, 2410 may include, but is not limited to, a bypass network that can bypass or forward just-completed results that have not yet been written to the register file to new dependent micro-operations. In at least one embodiment, register files 2408, 2410 may communicate data with each other. In at least one embodiment, integer register file / bypass network 2408 may include, but is not limited to, two separate register files, one register file for low-order 32-bit data and a second register file for high-order 32-bit data. In at least one embodiment, the floating-point register file / bypass network 2410 may include, but is not limited to, entries with a width of 128 bits, since floating-point instructions typically have operands with a width of 64 to 128 bits.
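The forwarding behavior of a bypass network can be illustrated with a small model. This is a sketch under stated assumptions rather than the disclosed circuit: the class name, its methods, and the use of dictionaries to stand in for register storage are all hypothetical.

```python
class RegisterFileWithBypass:
    """Toy register file whose reads see results that are not yet written back."""

    def __init__(self):
        self.registers = {}  # values already committed to the register file
        self.in_flight = {}  # completed results still waiting for write-back

    def complete(self, reg, value):
        """An execution unit finished; the result becomes visible on the bypass path."""
        self.in_flight[reg] = value

    def write_back(self):
        """Commit every in-flight result to the register file."""
        self.registers.update(self.in_flight)
        self.in_flight.clear()

    def read(self, reg):
        """Prefer a forwarded result over the committed register value."""
        if reg in self.in_flight:
            return self.in_flight[reg]
        return self.registers[reg]
```

The point of the sketch is that a dependent operation calling `read` obtains the newest value immediately, without waiting for `write_back`.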
[0369] In at least one embodiment, execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424 can execute instructions. In at least one embodiment, register files 2408 and 2410 store the integer and floating-point data operand values that the microinstructions need to execute. In at least one embodiment, processor 2400 may include, but is not limited to, any number and combination of execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424. In at least one embodiment, floating-point ALU 2422 and floating-point move unit 2424 can perform floating-point, MMX, SIMD, AVX, and SSE or other operations, including specialized machine learning instructions. In at least one embodiment, floating-point ALU 2422 may include, but is not limited to, a 64-bit by 64-bit floating-point divider to perform division, square root, and remainder micro-operations. In at least one embodiment, instructions involving floating-point values can be handled by floating-point hardware. In at least one embodiment, ALU operations can be passed to fast ALUs 2416 and 2418. In at least one embodiment, fast ALUs 2416 and 2418 can perform fast operations with an effective latency of half a clock cycle. In at least one embodiment, most complex integer operations are routed to slow ALU 2420, because slow ALU 2420 can include, but is not limited to, integer execution hardware for long-latency operations, such as multipliers, shifters, flag logic, and branch processing. In at least one embodiment, memory load / store operations can be performed by AGUs 2412 and 2414. In at least one embodiment, fast ALU 2416, fast ALU 2418, and slow ALU 2420 can perform integer operations on 64-bit data operands. In at least one embodiment, fast ALU 2416, fast ALU 2418, and slow ALU 2420 can be implemented to support various data bit sizes, including 16, 32, 128, 256, etc. In at least one embodiment, the floating-point ALU 2422 and the floating-point move unit 2424 can be implemented to support a range of operands with various bit widths. In at least one embodiment, the floating-point ALU 2422 and the floating-point move unit 2424 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.
[0370] In at least one embodiment, microinstruction schedulers 2402, 2404, and 2406 schedule dependent operations before the parent load completes execution. In at least one embodiment, since microinstructions can be speculatively scheduled and executed within processor 2400, processor 2400 may also include logic for handling memory misses. In at least one embodiment, if a data load misses in the data cache, there may be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations may need to be replayed, while independent operations may be allowed to complete. In at least one embodiment, the scheduler and replay mechanism of at least one embodiment of the processor may also be designed to capture instruction sequences used for text string comparison operations.
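Which operations a replay mechanism would need to re-execute after a load miss can be sketched as a dependency walk. This is an illustrative model only: the `ops_to_replay` helper, the tuple layout, and the register names are hypothetical, and real replay logic tracks this in hardware rather than by scanning a list.

```python
def ops_to_replay(uops, missed_register):
    """Return the names of micro-ops that consumed incorrect data.

    uops: list of (name, destination, sources) tuples in program order.
    missed_register: destination register of the load that missed in the cache.
    """
    tainted = {missed_register}  # registers currently holding incorrect data
    replay = []
    for name, dest, sources in uops:
        if tainted & set(sources):
            replay.append(name)
            tainted.add(dest)      # consumers of this result are wrong as well
        else:
            tainted.discard(dest)  # an independent op overwrites with correct data
    return replay
```

Operations that never read a tainted register are left out of the returned list, which corresponds to independent operations being allowed to complete.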
[0371] In at least one embodiment, the term “register” may refer to an on-board processor storage location that can be used as part of an instruction to identify operands. In at least one embodiment, registers may be those that are usable from outside the processor (from a programmer's perspective). In at least one embodiment, a register may not be limited to a particular type of circuit. Rather, in at least one embodiment, a register may store data, provide data, and perform the functions described herein. In at least one embodiment, the registers described herein may be implemented by circuitry within the processor using a variety of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, a combination of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, an integer register stores 32-bit integer data. The register file of at least one embodiment also includes eight multimedia SIMD registers for packed data.
[0372] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, some or all of the inference and / or training logic 815 may be incorporated into the execution block (EXE block) 2411 and other memories or registers shown or not shown. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs shown in the execution block 2411. Furthermore, weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown) that configure the ALUs of the execution block 2411 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0373] Details regarding inference and / or training logic 412, 414 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, some or all of the inference and / or training logic 412, 414 may be incorporated into EXE block 2411, as well as other memory or registers shown or not shown.
[0374] Figure 25 illustrates a deep learning application processor 2500 according to at least one embodiment. In at least one embodiment, the deep learning application processor 2500 uses instructions that, if executed by the deep learning application processor 2500, cause the deep learning application processor 2500 to perform some or all of the processes and techniques described herein. In at least one embodiment, the deep learning application processor 2500 is an application-specific integrated circuit (ASIC). In at least one embodiment, the application processor 2500 performs matrix multiplication operations that are “hardwired” into hardware, performed as a result of executing one or more instructions, or both. In at least one embodiment, the deep learning application processor 2500 includes, but is not limited to, processing clusters 2510(1)-2510(12), inter-chip links (“ICL”) 2520(1)-2520(12), inter-chip controllers (“ICC”) 2530(1)-2530(2), second-generation high-bandwidth memory (“HBM2”) 2540(1)-2540(4), memory controllers (“MemCtrlr”) 2542(1)-2542(4), high-bandwidth memory physical layers (“HBM PHY”) 2544(1)-2544(4), a management controller central processing unit (“management controller CPU”) 2550, a serial peripheral interface, inter-integrated circuit, and general-purpose input / output block (“SPI, I2C, GPIO”) 2560, a peripheral component interconnect express controller and direct memory access block (“PCIe controller and DMA”) 2570, and a sixteen-lane peripheral component interconnect express port (“PCI Express x16”) 2580.
[0375] In at least one embodiment, processing clusters 2510 can perform deep learning operations, including inference or prediction operations based on weight parameters computed using one or more training techniques, including those described herein. In at least one embodiment, each processing cluster 2510 can include, but is not limited to, any number and type of processors. In at least one embodiment, deep learning application processor 2500 can include any number and type of processing clusters 2510. In at least one embodiment, the inter-chip links 2520 are bidirectional. In at least one embodiment, the inter-chip links 2520 and the inter-chip controllers 2530 enable multiple deep learning application processors 2500 to exchange information, including activation information generated from executing one or more machine learning algorithms embodied in one or more neural networks. In at least one embodiment, deep learning application processor 2500 can include any number (including zero) and type of ICLs 2520 and ICCs 2530.
[0376] In at least one embodiment, HBM2 2540 provides a total of 32 GB of memory. In at least one embodiment, HBM2 2540(i) is associated with both memory controller 2542(i) and HBM PHY 2544(i). In at least one embodiment, any number of HBM2 2540 can provide any type and total amount of high-bandwidth memory and can be associated with any number (including zero) and type of memory controllers 2542 and HBM PHYs 2544. In at least one embodiment, any number and type of blocks can replace SPI, I2C, GPIO 2560, PCIe controller and DMA 2570, and / or PCIe 2580 to implement any number and type of communication standards in any technically feasible manner.
[0377] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the deep learning application processor is used to train a machine learning model (e.g., a neural network) to predict or infer information provided to the deep learning application processor 2500. In at least one embodiment, the deep learning application processor 2500 is used to infer or predict information based on a trained machine learning model (e.g., a neural network) that has been trained by another processor or system or by the deep learning application processor 2500. In at least one embodiment, the processor 2500 may be used to perform one or more neural network use cases described herein.
[0378] Details regarding inference and / or training logic 412, 414 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, the deep learning application processor is used to train a machine learning model (e.g., a neural network) to predict or infer information provided to the deep learning application processor 2500.
[0379] Figure 26 is a block diagram of a neuromorphic processor 2600 according to at least one embodiment. In at least one embodiment, the neuromorphic processor 2600 may receive one or more inputs from a source external to the neuromorphic processor 2600. In at least one embodiment, these inputs may be transmitted to one or more neurons 2602 within the neuromorphic processor 2600. In at least one embodiment, the neurons 2602 and their components may be implemented using circuitry or logic including one or more arithmetic logic units (ALUs). In at least one embodiment, the neuromorphic processor 2600 may include, but is not limited to, many thousands of instances of neurons 2602, but any suitable number of neurons 2602 may be used. In at least one embodiment, each instance of a neuron 2602 may include a neuron input 2604 and a neuron output 2606. In at least one embodiment, a neuron 2602 may generate an output that can be transmitted to the inputs of other instances of the neuron 2602. In at least one embodiment, the neuron input 2604 and the neuron output 2606 may be interconnected via synapses 2608.
[0380] In at least one embodiment, neurons 2602 and synapses 2608 may be interconnected such that neuromorphic processor 2600 operates to process or analyze information received by neuromorphic processor 2600. In at least one embodiment, neuron 2602 may send an output pulse (or “fire” or “spike”) when the input received through neuron input 2604 exceeds a threshold. In at least one embodiment, neuron 2602 may sum or integrate the signals received at neuron input 2604. For example, in at least one embodiment, neuron 2602 may be implemented as a leaky integrate-and-fire neuron, wherein if the sum (referred to as the “membrane potential”) exceeds a threshold, neuron 2602 may use a transfer function such as a sigmoid or threshold function to generate an output (or “fire”). In at least one embodiment, the leaky integrate-and-fire neuron may sum the signals received at neuron input 2604 into a membrane potential and may apply a decay factor (or leak) to reduce the membrane potential. In at least one embodiment, a leaky integrate-and-fire neuron may fire if multiple input signals are received at neuron input 2604 quickly enough to exceed the threshold (i.e., before the membrane potential decays too low to fire). In at least one embodiment, neuron 2602 may be implemented using circuitry or logic that receives input, integrates the input into the membrane potential, and decays the membrane potential. In at least one embodiment, the inputs may be averaged, or any other suitable transfer function may be used. Furthermore, in at least one embodiment, neuron 2602 may include, but is not limited to, comparator circuitry or logic that generates an output spike at neuron output 2606 when the result of applying the transfer function to neuron input 2604 exceeds a threshold. In at least one embodiment, once neuron 2602 fires, it can disregard previously received input information by, for example, resetting the membrane potential to 0 or another suitable default value. In at least one embodiment, once the membrane potential is reset to 0, neuron 2602 may resume normal operation after a suitable period of time (or refractory period).
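The neuron behavior described in this paragraph (summing inputs into a membrane potential, applying a decay factor, comparing against a threshold, resetting to 0, and pausing for a recovery or refractory period) can be sketched in a few lines. The sketch is illustrative only: the class name, the discrete time step, and the parameter values (`threshold`, `decay`, `refractory_steps`) are assumptions and do not come from the disclosure.

```python
class LeakyIntegrateAndFireNeuron:
    """Discrete-time toy model; all parameter values are illustrative assumptions."""

    def __init__(self, threshold=1.0, decay=0.5, refractory_steps=1):
        self.threshold = threshold
        self.decay = decay                      # leak applied each time step
        self.refractory_steps = refractory_steps
        self.potential = 0.0                    # membrane potential
        self.cooldown = 0                       # remaining refractory steps

    def step(self, input_signal):
        """Advance one time step and return True if the neuron produces a spike."""
        if self.cooldown > 0:
            # Recovery period after a spike: input is ignored.
            self.cooldown -= 1
            return False
        # Leak the old potential, then integrate the new input.
        self.potential = self.potential * self.decay + input_signal
        if self.potential > self.threshold:
            self.potential = 0.0                # reset after the spike
            self.cooldown = self.refractory_steps
            return True
        return False
```

With these example parameters, three inputs of 0.6 arriving on consecutive steps accumulate past the threshold on the third step, whereas the same inputs spaced further apart would decay away before the threshold is reached.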
[0381] In at least one embodiment, neurons 2602 can be interconnected via synapses 2608. In at least one embodiment, synapses 2608 can be operated to transmit signals from the output of a first neuron 2602 to the input of a second neuron 2602. In at least one embodiment, neurons 2602 can transmit information on more than one instance of synapse 2608. In at least one embodiment, one or more instances of neuron output 2606 can be connected via instances of synapses 2608 to instances of neuron input 2604 in the same neuron 2602. In at least one embodiment, an instance of neuron 2602 that produces an output to be transmitted on the instance of synapse 2608 can be referred to as a "presynaptic neuron". In at least one embodiment, an instance of neuron 2602 that receives input transmitted via an instance of synapse 2608 can be referred to as a "postsynaptic neuron". In at least one embodiment, regarding various instances of synapse 2608, since an instance of neuron 2602 can receive input from one or more instances of synapse 2608 and can also transmit output through one or more instances of synapse 2608, a single instance of neuron 2602 can be both a "presynaptic neuron" and a "postsynaptic neuron".
[0382] In at least one embodiment, neurons 2602 may be organized into one or more layers. In at least one embodiment, each instance of neuron 2602 may have a neuron output 2606, which may fan out to one or more neuron inputs 2604 via one or more synapses 2608. In at least one embodiment, the neuron outputs 2606 of neurons 2602 in the first layer 2610 may be connected to the neuron inputs 2604 of neurons 2602 in the second layer 2612. In at least one embodiment, layer 2610 may be referred to as a “feedforward layer.” In at least one embodiment, each instance of neuron 2602 in an instance of the first layer 2610 may fan out to each instance of neuron 2602 in the second layer 2612. In at least one embodiment, the first layer 2610 may be referred to as a “fully connected feedforward layer.” In at least one embodiment, each instance of neuron 2602 in an instance of the second layer 2612 may fan out to fewer than all instances of neuron 2602 in the third layer 2614. In at least one embodiment, the second layer 2612 may be referred to as a “sparsely connected feedforward layer.” In at least one embodiment, neurons 2602 in the second layer 2612 may fan out to neurons 2602 in a plurality of other layers, including to neurons 2602 in the (same) second layer 2612.
[0383] In at least one embodiment, the second layer 2612 may be referred to as a “recurrent layer”. The neuromorphic processor 2600 may include, but is not limited to, any suitable combination of recurrent layers and feedforward layers, including, but not limited to, sparsely connected feedforward layers and fully connected feedforward layers.
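The three kinds of layer named above can be told apart from the fan-out of each neuron. The sketch below is a simplified classification under stated assumptions: the `classify_layer` helper, the dictionary representation of synapses, and the convention that a feedforward layer drives the layer with the next index are hypothetical and chosen only for illustration.

```python
def classify_layer(fan_out, layer_of, source_layer):
    """Classify one layer by where the outputs of its neurons go.

    fan_out:  dict mapping a neuron id to the set of neuron ids it drives
    layer_of: dict mapping a neuron id to the index of its layer
    """
    sources = [n for n, layer in layer_of.items() if layer == source_layer]
    # Recurrent: some neuron drives a neuron in its own layer.
    for n in sources:
        if any(layer_of[t] == source_layer for t in fan_out.get(n, set())):
            return "recurrent"
    next_layer = {n for n, layer in layer_of.items() if layer == source_layer + 1}
    # Fully connected: every source neuron drives every neuron of the next layer.
    fully = all(next_layer <= set(fan_out.get(n, set())) for n in sources)
    if fully:
        return "fully connected feedforward"
    return "sparsely connected feedforward"
```

For example, a layer in which every neuron drives every neuron of the following layer is reported as fully connected, and a layer in which any neuron feeds a neuron of the same layer is reported as recurrent.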
[0384] In at least one embodiment, the neuromorphic processor 2600 may include, but is not limited to, a reconfigurable interconnect architecture or dedicated hardwired interconnect to connect the synapse 2608 to the neuron 2602.
[0385] In at least one embodiment, the neuromorphic processor 2600 may include, but is not limited to, circuitry or logic that allows synapses to be assigned to different neurons 2602 as needed, based on the neural network topology and neuron fan-in / fan-out. For example, in at least one embodiment, synapses 2608 may be connected to neurons 2602 using interconnect structures (such as on-chip networks) or via dedicated connections. In at least one embodiment, synaptic interconnects and their components may be implemented using circuitry or logic.
[0386] Figure 27 illustrates a processing system according to at least one embodiment. In at least one embodiment, system 2700 includes one or more processors 2702 and one or more graphics processors 2708, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 2702 or processor cores 2707. In at least one embodiment, system 2700 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.
[0387] In at least one embodiment, system 2700 may include, or be incorporated within, a server-based gaming platform or a game console, including a game and media console, a mobile game console, a handheld game console, or an online game console. In at least one embodiment, system 2700 is a mobile phone, smartphone, tablet computing device, or mobile internet device. In at least one embodiment, processing system 2700 may also include, be coupled to, or be integrated within a wearable device, such as a smartwatch wearable device, smart glasses device, augmented reality device, or virtual reality device. In at least one embodiment, processing system 2700 is a television or set-top box device having one or more processors 2702 and a graphical interface generated by one or more graphics processors 2708.
[0388] In at least one embodiment, each of the one or more processors 2702 includes one or more processor cores 2707 for processing instructions that, when executed, perform operations for system and user software. In at least one embodiment, each of the one or more processor cores 2707 is configured to process a particular instruction set 2709. In at least one embodiment, the instruction set 2709 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). In at least one embodiment, each processor core 2707 may process a different instruction set 2709, which may include instructions that facilitate the emulation of other instruction sets. In at least one embodiment, the processor core 2707 may also include other processing devices, such as a digital signal processor (DSP).
[0389] In at least one embodiment, processor 2702 includes cache memory 2704. In at least one embodiment, processor 2702 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory is shared among various components of processor 2702. In at least one embodiment, processor 2702 also uses an external cache (e.g., a Level 3 (L3) cache or a last-level cache (LLC)) (not shown), which can be shared among processor cores 2707 using known cache coherence techniques. In at least one embodiment, processor 2702 further includes a register file 2706, which may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers). In at least one embodiment, register file 2706 may include general-purpose registers or other registers.
[0390] In at least one embodiment, one or more processors 2702 are coupled to one or more interface buses 2710 to transmit communication signals, such as address, data, or control signals, between the processors 2702 and other components in the system 2700. In at least one embodiment, the interface bus 2710 may be a processor bus, such as a version of the Direct Media Interface (DMI) bus. In at least one embodiment, the interface bus 2710 is not limited to the DMI bus and may include one or more peripheral component interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In at least one embodiment, the processor 2702 includes an integrated memory controller 2716 and a platform controller hub 2730. In at least one embodiment, the memory controller 2716 facilitates communication between memory devices and other components of the processing system 2700, while the platform controller hub (PCH) 2730 provides connectivity to input / output (I / O) devices via a local I / O bus.
[0391] In at least one embodiment, memory device 2720 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or a device with suitable performance for use as processor memory. In at least one embodiment, memory device 2720 may be used as system memory of processing system 2700 to store data 2722 and instructions 2721 for use when one or more processors 2702 execute an application or process. In at least one embodiment, memory controller 2716 is also coupled to an optional external graphics processor 2712, which may communicate with one or more graphics processors 2708 of processor 2702 to perform graphics and media operations. In at least one embodiment, display device 2711 may be connected to processor 2702. In at least one embodiment, display device 2711 may include one or more internal display devices, such as in mobile electronic devices or laptop devices, or external display devices connected via a display interface (e.g., DisplayPort). In at least one embodiment, the display device 2711 may include a head-mounted display (HMD), such as a stereoscopic display device for virtual reality (VR) or augmented reality (AR) applications.
[0392] In at least one embodiment, the platform controller hub 2730 enables peripheral devices to connect to the memory device 2720 and the processor 2702 via a high-speed I / O bus. In at least one embodiment, the I / O peripheral devices include, but are not limited to, an audio controller 2746, a network controller 2734, a firmware interface 2728, a wireless transceiver 2726, a touch sensor 2725, and a data storage device 2724 (e.g., a hard disk drive, flash memory, etc.). In at least one embodiment, the data storage device 2724 may be connected via a storage interface (e.g., SATA) or via a peripheral bus, such as a peripheral component interconnect bus (e.g., PCI, PCIe). In at least one embodiment, the touch sensor 2725 may include a touchscreen sensor, a pressure sensor, or a fingerprint sensor. In at least one embodiment, the wireless transceiver 2726 may be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver, such as a 3G, 4G, or LTE transceiver. In at least one embodiment, the firmware interface 2728 enables communication with the system firmware and may be, for example, a Unified Extensible Firmware Interface (UEFI). In at least one embodiment, network controller 2734 may enable network connectivity to a wired network. In at least one embodiment, a high-performance network controller (not shown) is coupled to interface bus 2710. In at least one embodiment, audio controller 2746 is a multi-channel high-definition audio controller. In at least one embodiment, processing system 2700 includes an optional legacy I / O controller 2740 for coupling legacy (e.g., Personal System 2 (PS / 2)) devices to the system. In at least one embodiment, platform controller hub 2730 may also be connected to one or more Universal Serial Bus (USB) controllers 2742 that connect input devices, such as a keyboard and mouse combination 2743, a camera 2744, or other USB input devices.
[0393] In at least one embodiment, instances of the memory controller 2716 and platform controller hub 2730 may be integrated into a discrete external graphics processor, such as external graphics processor 2712. In at least one embodiment, the platform controller hub 2730 and / or the memory controller 2716 may be external to one or more processors 2702. For example, in at least one embodiment, system 2700 may include an external memory controller 2716 and platform controller hub 2730, which may be configured as a memory controller hub and a peripheral controller hub in a system chipset communicating with processor 2702.
[0394] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, some or all of the inference and / or training logic 815 may be incorporated into the graphics processor 2700. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in the 3D pipeline 2712. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than the logic shown in Figure 8A or Figure 8B. In at least one embodiment, the weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown), which configure the ALU of the graphics processor 2700 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0395] Details regarding inference and / or training logic 412, 414 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, some or all of the inference and / or training logic 412, 414 may be integrated into the graphics processor 2700.
[0396] Figure 28 is a block diagram of a processor 2800 having one or more processor cores 2802A-2802N, an integrated memory controller 2814, and an integrated graphics processor 2808 according to at least one embodiment. In at least one embodiment, the processor 2800 may include additional cores, up to and including additional core 2802N, indicated by dashed boxes. In at least one embodiment, each processor core 2802A-2802N includes one or more internal cache units 2804A-2804N. In at least one embodiment, each processor core may also access one or more shared cache units 2806.
[0397] In at least one embodiment, internal cache units 2804A-2804N and shared cache unit 2806 represent a cache memory hierarchy within processor 2800. In at least one embodiment, cache memory units 2804A-2804N may include at least one level of instruction and data cache within each processor core and one or more levels of cache in a shared intermediate cache, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, wherein the highest level of cache preceding external memory is classified as LLC. In at least one embodiment, cache coherence logic maintains coherence between the various cache units 2806 and 2804A-2804N.
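A lookup through such a hierarchy can be sketched as a search of the per-core internal cache, then the shared cache, then external memory. This is a deliberately simplified model: the `lookup` helper and the dictionary-based caches are hypothetical, and eviction and the coherence logic mentioned above are omitted.

```python
def lookup(address, internal_cache, shared_cache, memory):
    """Return (value, level) where level names the place the value was found.

    Caches are plain dicts here; eviction and coherence are left out of the sketch.
    """
    if address in internal_cache:
        return internal_cache[address], "internal"
    if address in shared_cache:
        value = shared_cache[address]
        level = "shared"
    else:
        value = memory[address]
        shared_cache[address] = value  # fill the shared cache on the way back
        level = "memory"
    internal_cache[address] = value    # fill the requesting core's internal cache
    return value, level
```

In this model, a second core that requests an address already fetched by the first core misses in its own internal cache but hits in the shared cache.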
[0398] In at least one embodiment, the processor 2800 may further include a set of one or more bus controller units 2816 and a system agent core 2810. In at least one embodiment, the one or more bus controller units 2816 manage a set of peripheral buses, such as one or more PCI or PCIe buses. In at least one embodiment, the system agent core 2810 provides management functions for various processor components. In at least one embodiment, the system agent core 2810 includes one or more integrated memory controllers 2814 to manage access to various external memory devices (not shown).
[0399] In at least one embodiment, one or more processor cores 2802A-2802N include support for multi-threaded concurrent processing. In at least one embodiment, system agent core 2810 includes components for coordinating and operating cores 2802A-2802N during multi-threaded processing. In at least one embodiment, system agent core 2810 may additionally include a power control unit (PCU) including logic and components for regulating one or more power states of processor cores 2802A-2802N and graphics processor 2808.
[0400] In at least one embodiment, processor 2800 further includes a graphics processor 2808 for performing graphics processing operations. In at least one embodiment, graphics processor 2808 is coupled to the shared cache units 2806 and to the system agent core 2810, which includes one or more integrated memory controllers 2814. In at least one embodiment, system agent core 2810 further includes a display controller 2811 for driving graphics processor output to one or more coupled displays. In at least one embodiment, display controller 2811 may also be a separate module coupled to graphics processor 2808 via at least one interconnect, or it may be integrated within graphics processor 2808.
[0401] In at least one embodiment, ring-based interconnect unit 2812 is used to couple internal components of processor 2800. In at least one embodiment, alternative interconnect units, such as point-to-point interconnects, switched interconnects, or other technologies, may be used. In at least one embodiment, graphics processor 2808 is coupled to ring interconnect 2812 via I / O link 2813.
[0402] In at least one embodiment, I / O link 2813 represents at least one of a variety of I / O interconnects, including an on-package I / O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 2818 (e.g., an eDRAM module). In at least one embodiment, each of processor cores 2802A-2802N and graphics processor 2808 uses embedded memory module 2818 as a shared last-level cache.
[0403] In at least one embodiment, processor cores 2802A-2802N are homogeneous cores executing a common instruction set architecture. In at least one embodiment, processor cores 2802A-2802N are heterogeneous in terms of instruction set architecture (ISA), with one or more processor cores 2802A-2802N executing a common instruction set, while one or more other processor cores 2802A-2802N execute a subset of the common instruction set or a different instruction set. In at least one embodiment, processor cores 2802A-2802N are heterogeneous in terms of microarchitecture, with one or more cores having relatively higher power consumption coupled to one or more cores having lower power consumption. In at least one embodiment, processor 2800 may be implemented on one or more chips or implemented as a SoC integrated circuit.
[0404] Inference and / or training logic 815 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 815 are provided herein in conjunction with Figure 8A and / or Figure 8B. In at least one embodiment, some or all of the inference and / or training logic 815 may be incorporated into the graphics processor 2810. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in the 3D pipeline 2712, graphics core 2815A, shared function logic 2816, one or more graphics processing cores 2815B, shared function logic 2820, or other logic in Figure 28. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than the logic shown in Figure 8A or Figure 8B. In at least one embodiment, the weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown) that configure the ALUs of the graphics processor 2810 to execute one or more of the machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0405] In at least one embodiment, some or all of the inference and / or training logic 412, 414 may be integrated into the graphics processor 2810. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in the 3D pipeline 2712, one or more graphics cores 2815A, shared function logic 2816, one or more graphics cores 2815B, shared function logic 2820, or other logic in Figure 28.
[0406] Figure 29 is a block diagram of a graphics processor 2900, which may be a discrete graphics processing unit or a graphics processor integrated with multiple processing cores. In at least one embodiment, the graphics processor 2900 communicates via a memory-mapped I / O interface with registers on the graphics processor 2900 and with commands placed in memory. In at least one embodiment, the graphics processor 2900 includes a memory interface 2914 for accessing memory. In at least one embodiment, the memory interface 2914 is an interface to local memory, one or more internal caches, one or more shared external caches, and / or system memory.
[0407] In at least one embodiment, the graphics processor 2900 further includes a display controller 2902 for driving display output data to the display device 2920. In at least one embodiment, the display controller 2902 includes hardware for one or more overlay planes of the display device 2920 and for compositing multiple layers of video or user interface elements. In at least one embodiment, the display device 2920 may be an internal or external display device. In at least one embodiment, the display device 2920 is a head-mounted display device, such as a virtual reality (VR) display device or an augmented reality (AR) display device. In at least one embodiment, the graphics processor 2900 includes a video codec engine 2906 for encoding, decoding, or transcoding media into, from, or between one or more media encoding formats, including but not limited to Moving Picture Experts Group (MPEG) formats (e.g., MPEG-2), Advanced Video Coding (AVC) formats (e.g., H.264 / MPEG-4 AVC), SMPTE 421M / VC-1, and Joint Photographic Experts Group (JPEG) formats (e.g., JPEG and Motion JPEG (MJPEG)).
[0408] In at least one embodiment, the graphics processor 2900 includes a block image transfer (BLIT) engine 2904 to perform two-dimensional (2D) rasterizer operations, including, for example, bit boundary block transfer. However, in at least one embodiment, one or more components of a graphics processing engine (GPE) 2910 are used to perform 2D graphics operations. In at least one embodiment, the GPE 2910 is a computational engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.
[0409] In at least one embodiment, GPE 2910 includes a 3D pipeline 2912 for performing 3D operations, such as rendering 3D images and scenes using processing functions that manipulate 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 2912 includes programmable and fixed function elements that perform various tasks and / or generate execution threads to the 3D / media subsystem 2915. While the 3D pipeline 2912 can be used to perform media operations, in at least one embodiment, GPE 2910 also includes a media pipeline 2916 for performing media operations such as video post-processing and image enhancement.
[0410] In at least one embodiment, the media pipeline 2916 includes fixed-function or programmable logic units for performing one or more specialized media operations, such as video decoding acceleration, video deinterlacing, and video encoding acceleration, in place of, or on behalf of, the video codec engine 2906. In at least one embodiment, the media pipeline 2916 also includes a thread generation unit for generating threads to execute on the 3D / media subsystem 2915. In at least one embodiment, the generated threads perform computations for media operations on one or more graphics execution units included in the 3D / media subsystem 2915.
[0411] In at least one embodiment, the 3D / media subsystem 2915 includes logic for executing threads generated by the 3D pipeline 2912 and the media pipeline 2916. In at least one embodiment, the 3D pipeline 2912 and the media pipeline 2916 send thread execution requests to the 3D / media subsystem 2915, which includes thread dispatch logic for arbitrating various requests and dispatching them to available thread execution resources. In at least one embodiment, the execution resources include an array of graphics execution units for processing 3D and media threads. In at least one embodiment, the 3D...
Claims
1. A processor, comprising: one or more circuits to use one or more neural networks to estimate a distance from an object to a camera based, at least in part, on a first image of the object being modified to substantially match a second image of the object, wherein the distance is estimated by: modifying the first image, based on points shared by the first image and the second image, to substantially match the second image; and determining, based on the modification of the first image, whether the object is in front of or beyond a depth plane among a plurality of available depth planes.
2. The processor of claim 1, wherein the distance is further estimated by: if an additional depth plane is available, selecting a new depth plane based on a binary search of the plurality of available depth planes; and modifying the first image to substantially match the second image based on new points shared by the first image and the second image, the new points being located on the new depth plane.
3. The processor of claim 2, wherein the plurality of available depth planes includes a depth plane at a variable distance from the object, the variable distance from the object depending on the distance to the object.
4. The processor of claim 1, wherein the first image and the second image are captured by a plurality of image capture devices.
5. The processor of claim 1, wherein the one or more circuits modify the first image by transforming points in the first image into the second image through homography.
6. The processor of claim 1, wherein the first image is captured from an initial camera position and the second image is captured from a secondary camera position.
7. The processor of claim 1, wherein the one or more neural networks further estimate the distance by determining the direction in which the object moves in the first image after the first image is changed.
8. A system, comprising: one or more processors to use one or more neural networks to estimate a distance from an object to one or more image capture devices based, at least in part, on a first image of the object being altered to substantially match a second image of the object; and a memory containing instructions that, when executed, further cause the system to: select a depth plane from a plurality of available depth planes; use the one or more neural networks to determine whether the object is in front of or beyond the depth plane; and indicate, based on whether the object is in front of or beyond the depth plane, that the depth plane defines a boundary of the object.
9. The system of claim 8, wherein the one or more processors are further to: if an additional depth plane is available, select a new depth plane from the plurality of available depth planes; and modify the first image to substantially match the second image based on new points shared by the first image and the second image, the new points being located on the new depth plane.
10. The system of claim 9, wherein the new depth plane is selected based on a binary search of the plurality of available depth planes.
11. The system of claim 8, wherein the plurality of available depth planes includes one or more depth planes defining one or more available boundaries of the object.
12. The system of claim 8, wherein the first image and the second image are captured by a plurality of image capture devices.
13. The system of claim 8, wherein the one or more processors modify the first image by transforming all points in the first image into the second image via homography, the homography relating to two images of a planar surface captured at least partially by the first image and the second image.
14. A processor, comprising: one or more circuits to help train one or more neural networks to estimate a distance from an object to a camera based, at least in part, on a first image of the object being modified to substantially match a second image of the object, wherein the distance is estimated by: modifying the first image to substantially match the second image based on points shared by the first image and the second image, the points being on a depth plane, among a plurality of available depth planes, that is shown in each of the first image and the second image; determining, based on the modification of the first image, whether the object is in front of or beyond the depth plane; and indicating, based on whether the object is in front of or beyond the depth plane, that the depth plane defines a boundary of the object.
15. The processor of claim 14, wherein the distance is further determined by: if an additional depth plane is available among the plurality of available depth planes, selecting a new depth plane based on a search of the plurality of available depth planes; and modifying the first image to substantially match the second image based on new points shared by the first image and the second image, the new points being located on the new depth plane.
16. The processor of claim 14, wherein the plurality of available depth planes are predetermined.
17. The processor of claim 14, wherein the first image and the second image are captured by a plurality of cameras.
18. The processor of claim 14, wherein the one or more circuits modify the first image by transforming the first image to match the second image.
19. The processor of claim 14, wherein the first image is captured from an initial camera position and the second image is captured from a secondary camera position.
20. The processor of claim 14, wherein the one or more neural networks further estimate the distance by determining the direction in which the object moves in the first image after the first image is changed.
21. A method, comprising: training one or more neural networks to estimate a distance from an object to a camera based, at least in part, on a first image of the object being modified to substantially match a second image of the object, wherein the distance is estimated by: capturing two or more images from one or more image capture devices; selecting a depth plane from a plurality of available depth planes; determining whether the object is in front of the depth plane; and if the object is in front of the depth plane, generating an indication that the depth plane is an outer boundary of the object.
22. The method of claim 21, further comprising: if the object is beyond the depth plane, generating an indication that the depth plane is an inner boundary of the object.
23. The method of claim 21, further comprising: selecting a new depth plane from the plurality of available depth planes; and determining the position of the object based on the new depth plane.
24. The method of claim 21, wherein the depth plane is selected from the plurality of available depth planes according to a binary search.
25. The method of claim 21, wherein the first image and the second image are captured by a plurality of cameras.
26. The method of claim 21, wherein the first image is modified by transforming all points in the first image into the second image through an affine transformation.
27. The method of claim 21, wherein the first image is captured from an initial camera position and the second image is captured from a secondary camera position.
28. The method of claim 21, wherein the camera comprises a single camera.
29. The method of claim 28, wherein the single camera captures the first image and is moved to capture the second image.
30. The method of claim 21, wherein the one or more neural networks further estimate the distance by determining the direction in which the object moves in the first image after the first image is changed.