Image label generation using neural networks and annotated images

By generating pseudo-labels using weak supervision techniques and combining them with a lightweight convolutional network, the problem of insufficient supervised data in medical image classification is solved, achieving efficient and automated image label generation and improving the accuracy of medical image classification.

CN115004197B · Active · Publication Date: 2026-06-23 · NVIDIA CORP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NVIDIA CORP
Filing Date
2021-07-26
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies in medical image classification struggle to effectively utilize neural networks for image label generation due to the labor-intensive data annotation requirements, especially in the medical field where efficient supervised data annotation methods are lacking.

Method used

Weakly supervised techniques, such as region growing and random walk, are used to generate pseudo-labels. A training framework combines the partial or pseudo-labels with input image information to generate labels. Meta-label fusion is performed using a lightweight convolutional network to generate high-quality medical image labels.

Benefits of technology

It enables efficient generation of medical image labels in the absence of supervised data, reduces reliance on professionals, and improves the automation and accuracy of image classification.

✦ Generated by Eureka AI based on patent content.


Abstract

Devices, systems, and techniques for training one or more neural networks to generate labels for unsupervised or partially supervised data. In at least one embodiment, one or more pseudo-labels are generated by a training framework based on available weak annotations of input medical images and combined with feature information generated by one or more neural networks about the input medical images to generate labels about the input medical images.

Description

[0001] Cross-references to related applications

[0002] This application claims priority to U.S. Patent Application No. 16/940,241, filed July 27, 2020, entitled “Label Generation Using Neural Networks,” the entire contents of which are incorporated herein by reference for all purposes.

Technical Field

[0003] At least one embodiment relates to processing resources for generating labels using neural networks on unsupervised or partially supervised data. For example, at least one embodiment relates to a processor or computing system for generating one or more partial labels or pseudo-labels and combining the partial or pseudo-labels with information about features in an input image using one or more neural networks to generate labels, according to various new techniques described herein.

Background Technology

[0004] The increased availability of deep learning techniques for image classification has led to a need to apply deep learning to advance various fields, such as medicine, particularly those related to data annotation. Data annotation is a resource-intensive but crucial step in developing supervised machine learning algorithms that use annotated data to train neural networks to perform object recognition in images. This is especially challenging in certain domains, such as medicine, where a high level of medical expertise is typically required to annotate the image data used to train the neural network. Unfortunately, because annotation is labor-intensive and requires input from medical professionals, obtaining annotated training data containing complete descriptions of objects may not always be feasible.

Attached Figure Description

[0005] Figure 1 is a block diagram illustrating an architecture for generating labels for input medical images using one or more neural networks for training and inference, according to at least one embodiment;

[0006] Figure 2 is a block diagram illustrating an architecture in which one or more weak supervision techniques generate pseudo-labels to be fused into labels based on information about an input image, according to at least one embodiment;

[0007] Figure 3 is a block diagram illustrating a weak supervision technique for generating one or more pseudo-labels in a pseudo-label group, according to at least one embodiment;

[0008] Figure 4 is a block diagram illustrating updating a group of pseudo-labels to one or more updated feature maps based on information about an input image, according to at least one embodiment;

[0009] Figure 5 is a block diagram illustrating the fusion of one or more updated feature maps into a fused label, according to at least one embodiment;

[0010] Figure 6 illustrates a process for training one or more neural networks to generate labels based on an input image, according to at least one embodiment;

[0011] Figure 7A illustrates inference and/or training logic, according to at least one embodiment;

[0012] Figure 7B illustrates inference and/or training logic, according to at least one embodiment;

[0013] Figure 8 illustrates the training and deployment of a neural network, according to at least one embodiment;

[0014] Figure 9 illustrates an example data center system, according to at least one embodiment;

[0015] Figure 10A illustrates an example of an autonomous vehicle, according to at least one embodiment;

[0016] Figure 10B illustrates examples of camera positions and fields of view for the autonomous vehicle of Figure 10A, according to at least one embodiment;

[0017] Figure 10C is a block diagram illustrating an example system architecture for the autonomous vehicle of Figure 10A, according to at least one embodiment;

[0018] Figure 10D is a diagram illustrating a system for communication between one or more cloud-based servers and the autonomous vehicle of Figure 10A, according to at least one embodiment;

[0019] Figure 11 is a block diagram illustrating a computer system, according to at least one embodiment;

[0020] Figure 12 is a block diagram illustrating a computer system, according to at least one embodiment;

[0021] Figure 13 illustrates a computer system, according to at least one embodiment;

[0022] Figure 14 illustrates a computer system, according to at least one embodiment;

[0023] Figure 15A illustrates a computer system, according to at least one embodiment;

[0024] Figure 15B illustrates a computer system, according to at least one embodiment;

[0025] Figure 15C illustrates a computer system, according to at least one embodiment;

[0026] Figure 15D illustrates a computer system, according to at least one embodiment;

[0027] Figure 15E and Figure 15F illustrate a shared programming model, according to at least one embodiment;

[0028] Figure 16 illustrates an exemplary integrated circuit and associated graphics processor, according to at least one embodiment;

[0029] Figures 17A-17B illustrate an exemplary integrated circuit and associated graphics processor, according to at least one embodiment;

[0030] Figure 18A and Figure 18B illustrate additional exemplary graphics processor logic, according to at least one embodiment;

[0031] Figure 19 illustrates a computer system, according to at least one embodiment;

[0032] Figure 20A illustrates a parallel processor, according to at least one embodiment;

[0033] Figure 20B illustrates a partition unit, according to at least one embodiment;

[0034] Figure 20C illustrates a processing cluster, according to at least one embodiment;

[0035] Figure 20D illustrates a graphics multiprocessor, according to at least one embodiment;

[0036] Figure 21 illustrates a multi-graphics processing unit (GPU) system, according to at least one embodiment;

[0037] Figure 22 illustrates a graphics processor, according to at least one embodiment;

[0038] Figure 23 is a block diagram illustrating a processor microarchitecture for a processor, according to at least one embodiment;

[0039] Figure 24 illustrates a deep learning application processor, according to at least one embodiment;

[0040] Figure 25 is a block diagram illustrating an example neuromorphic processor, according to at least one embodiment;

[0041] Figure 26 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0042] Figure 27 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0043] Figure 28 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0044] Figure 29 is a block diagram illustrating a graphics processing engine of a graphics processor, according to at least one embodiment;

[0045] Figure 30 is a block diagram illustrating at least portions of a graphics processor core, according to at least one embodiment;

[0046] Figure 31A and Figure 31B illustrate thread execution logic including an array of processing elements of a graphics processor core, according to at least one embodiment;

[0047] Figure 32 illustrates a parallel processing unit (“PPU”), according to at least one embodiment;

[0048] Figure 33 illustrates a general-purpose processing cluster (“GPC”), according to at least one embodiment;

[0049] Figure 34 illustrates a memory partition unit of a parallel processing unit (“PPU”), according to at least one embodiment;

[0050] Figure 35 illustrates a streaming multiprocessor, according to at least one embodiment;

[0051] Figure 36 is an example data flow diagram of an advanced computing pipeline, according to at least one embodiment;

[0052] Figure 37 is a system diagram of an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, according to at least one embodiment;

[0053] Figure 38 includes an example illustration of an advanced computing pipeline 3710A for processing imaging data, according to at least one embodiment;

[0054] Figure 39A includes an example data flow diagram of a virtual instrument supporting an ultrasound device, according to at least one embodiment;

[0055] Figure 39B includes an example data flow diagram of a virtual instrument supporting a CT scanner, according to at least one embodiment;

[0056] Figure 40A illustrates a data flow diagram of a process for training a machine learning model, according to at least one embodiment; and

[0057] Figure 40B is an example illustration of a client-server architecture that utilizes a pre-trained annotation model to enhance an annotation tool, according to at least one embodiment.

Detailed Implementation

[0058] Figure 1 is a block diagram illustrating an architecture for generating labels 120 for input medical images 112 using one or more neural networks 108, 114 for training 102 and inference 110, according to at least one embodiment. In at least one embodiment, training data 104 is used as input to a training framework 106 to train 102 one or more untrained neural networks 108. In at least one embodiment, training data 104 is a set of images or image data. In at least one embodiment, training data 104 includes one or more complete or partial labels, or classifications containing information about the training data. In at least one embodiment, training data 104 does not include additional information, such as complete or partial labels or classifications. In at least one embodiment, training data 104 includes partial labels such as those described below in conjunction with weak supervision. In at least one embodiment, optional labels or classifications in training data 104 provide a set of examples on which one or more untrained neural networks 108 learn to perform functions, such as generating labels 120 in conjunction with medical images 118.

[0059] In at least one embodiment, training data 104 is a set of data, such as image data, on which one or more untrained neural networks 108 will be trained to operate via training framework 106. In at least one embodiment, training data 104 includes a collection of images. In at least one embodiment, training data 104 includes a set of medical images. In at least one embodiment, medical images are image data related to medical imaging. In at least one embodiment, training data 104 includes a set of images with partial labels or classifications, such as those used in conjunction with weak supervision as described below. In at least one embodiment, training data 104 is one or more other types of data on which training framework 106 trains one or more untrained neural networks 108 to perform operations such as pseudo-labeling and label generation, as described below in conjunction with Figures 2 to 6.

[0060] In at least one embodiment, the training framework 106 is software instructions that, when executed on one or more computing devices, use training data 104 to manage the training 102 of one or more untrained neural networks 108. In at least one embodiment, the one or more untrained neural networks 108 are trained by the training framework 106, which facilitates learning by the one or more untrained neural networks 108 based on the training data 104. In at least one embodiment, the training framework 106 uses a GAN or any other type of neural network training method to train the one or more untrained neural networks 108.

[0061] In at least one embodiment, training framework 106 trains one or more untrained neural networks 108 without supervision. In at least one embodiment, training framework 106 trains one or more untrained neural networks 108 without supervision and using only training data 104. In at least one embodiment, training framework 106 trains one or more untrained neural networks 108 using any available supervision combined with training data 104.

[0062] In at least one embodiment, the training framework 106 uses supervised training data 104, wherein supervision is in the form of classification, labels, bounding boxes, pixel-level annotations, image-level annotations, points containing locations corresponding to objects, or lines containing locations corresponding to objects. In at least one embodiment, the training framework 106 uses the training data 104 to train one or more untrained neural networks 108 with any other form of supervision to facilitate the training 102 of the one or more untrained neural networks 108. In at least one embodiment, the training framework 106 does not use supervision for some or all of the training data 104.

[0063] In at least one embodiment, training framework 106 uses supervision to train one or more untrained neural networks 108. In at least one embodiment, supervision includes various types of assistance, as described above, for facilitating training 102 of one or more untrained neural networks 108 by training framework 106. In at least one embodiment, supervision includes input information describing one or more aspects of training data 104, such as objects, features, or styles, or classifications of training data 104, to help train one or more untrained neural networks 108 by training framework 106. In at least one embodiment, supervision is strong, where the input information provides direct identification of objects, features, styles, or other aspects of entries (e.g., images) in training data 104. In at least one embodiment, supervision is weak, where the input information provides partial identification of objects, features, styles, or other aspects of input training data 104 entries. In at least one embodiment, strong supervision is, for example, input information of bounding boxes, where one or more objects or features are outlined in the input training data 104 entries. In at least one embodiment, weak supervision includes, for example, input information of points, where various locations in the input training data 104 entries are identified as being within one or more objects. In at least one embodiment, weak supervision includes input information such as lines, wherein each point in a line within the input training data 104 entries is identified by the weak supervision as being within one or more objects. In at least one embodiment, weak supervision includes input information such as tags or labels, wherein the tags or labels identify that the input training data 104 entries contain one or more specific objects, or have a specific classification.

[0064] In at least one embodiment, one or more untrained neural networks 108 are trained by training framework 106 to perform operations such as generating a supervised image 116 from an input medical image 112. In at least one embodiment, one or more untrained neural networks 108 are trained by training framework 106 to generate a supervised image 116 from an input medical image 112 and optional weak supervision 122, as described below in conjunction with Figures 2 to 6. In one embodiment, optional weak supervision 122 is any type of weak supervision further described herein, such as points, lines, labels, or any other type of partial identifier of objects, features, styles, or other aspects of the input medical image 112.

[0065] In at least one embodiment, one or more neural networks 108, 114 are individual neural networks of any type. In at least one embodiment, each of the one or more neural networks 108, 114 includes a set of nodes, wherein each node computes a value based on one or more inputs using an activation function. In at least one embodiment, one or more neural networks 108, 114 are embodied in software having instructions to perform operations when executed, and having memory that stores computation results based on input data items. In at least one embodiment, each of the one or more neural networks 108, 114 is of any type of neural network further described herein.

[0066] In at least one embodiment, one or more trained neural networks 114 perform inference 110 using an input medical image 112. In at least one embodiment, one or more trained neural networks 114 perform inference 110 using additional optional weak supervision 122. In at least one embodiment, one or more trained neural networks 114 perform inference 110, whereby the trained neural networks 114 generate an output supervised image 116 from the input medical image 112, such that the supervised image 116 includes labels 120. In at least one embodiment, labels 120 are data specifying one or more features of the medical image 118. In at least one embodiment, labels 120 include any type of strong supervision further described herein, such as bounding boxes or any other type of strong supervision. In at least one embodiment, input data such as the input medical image 112 is any type of image such as a 2D image or a 3D image, and the optional weak supervision 122 includes one or more data values to indicate features, styles, or objects in the output medical image 118.

[0067] In at least one embodiment, one or more trained neural networks 114 are trained 102 by training framework 106 based on training data 104 to perform operations, and one or more untrained neural networks 108 are trained 102 based on training data 104. In at least one embodiment, one or more trained neural networks 114 are trained 102 by training framework 106 based on training data 104 and unsupervised training. In at least one embodiment, one or more trained neural networks 114 are trained 102 by training framework 106 based on supervised training data 104. In at least one embodiment, one or more trained neural networks 114 are any type of neural network further described herein.

[0068] In at least one embodiment, one or more trained neural networks 114 generate output data such as a supervised image 116 based on input data such as an input medical image 112. In at least one embodiment, one or more trained neural networks 114 generate output data such as a supervised image 116 based on input data such as an input medical image 112 and optional weak supervision 122. In at least one embodiment, one or more trained neural networks 114 (which have been trained 102 by training framework 106 for this operation) perform operations on input data such as an input medical image 112 to generate output data such as an output supervised image 116. In at least one embodiment, the output data such as the output supervised image 116 includes a medical image 118 and one or more labels 120, which are further described herein.

[0069] Figure 2 is a block diagram illustrating an architecture, according to at least one embodiment, in which one or more weak supervision techniques 204 generate pseudo-labels 210 from input weak annotations 238, to be fused 232 into labels 234 based on information about the input image 202, in order to train a model 218. In at least one embodiment, the input image 202 is data including image information. In at least one embodiment, the input image 202 is a 3D image. In at least one embodiment, the input image 202 is a 2D image. In one embodiment, the input image 202 is multiple images. In at least one embodiment, the input image 202 includes medical image data.

[0070] In at least one embodiment, pseudo-label 210 is data that includes a class of unlabeled data, such as input image 202, as if said class were a real label. In at least one embodiment, pseudo-label 210 includes information about input image 202 that is less specific than a real label because the information is predicted and not considered real. For example, in one embodiment, pseudo-label 210 might indicate that a region of input image 202 includes a kidney, while the label directly indicates which pixels in input image 202 correspond to a kidney. In another example, pseudo-label 210 might indicate that a region of input image 202 includes something of medical interest, while in one embodiment the label accurately indicates which pixels correspond to that item of medical interest, and what that item is.

[0071] In at least one embodiment, weak supervision techniques 204, 206, and 208 use the input image 202 in conjunction with weak annotations 238 A_{1…m} to generate pseudo-labels 210. In at least one embodiment, weak annotations are data describing information about one or more objects in the input image 202, as described above in conjunction with Figure 1. In at least one embodiment, weak annotations 238 include centroids, 2D bounding boxes, 2D Response Evaluation Criteria in Solid Tumors (RECIST), 3D bounding boxes, and 3D extreme points, which are commonly used weak annotation methods in medical image analysis.
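
As an illustration of what some of the weak annotation types listed above contain, the following minimal Python sketch derives a centroid, a 2D bounding box, and 2D extreme points from a small binary mask. The function name `weak_annotations`, the dictionary keys, and the toy mask are assumptions made for this example only; they do not come from the patent, and RECIST and the 3D variants are not shown.

```python
def weak_annotations(mask):
    """Derive simple 2D weak annotations from a binary mask (list of rows).

    Hypothetical helper for illustration; not part of the described framework.
    """
    coords = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return {
        # Mean position of all foreground pixels.
        "centroid": (sum(rows) / len(coords), sum(cols) / len(coords)),
        # (row_min, col_min, row_max, col_max), inclusive.
        "bounding_box": (min(rows), min(cols), max(rows), max(cols)),
        # One foreground pixel touching each side of the bounding box.
        "extreme_points": {
            "top": min(coords),
            "bottom": max(coords),
            "left": min(coords, key=lambda rc: (rc[1], rc[0])),
            "right": max(coords, key=lambda rc: (rc[1], rc[0])),
        },
    }


mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
annotations = weak_annotations(mask)
```

Each of these annotations is far cheaper to provide than the full pixel-level mask, which is the motivation for starting from them.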

[0072] In at least one embodiment, the weak supervision techniques 204, 206, and 208 are software instructions that, when executed, generate one or more pseudo-labels 210 based on the weak annotations 238 and the input image 202. In at least one embodiment, the weak supervision techniques 204, 206, and 208 include region growing and random walks, wherein region growing provides initial seed generation, and random walks further refine the pseudo-labels 210 generated in each pseudo-label group 212, 214, and 216. In one embodiment, for each weak annotation type A_i with a spherical target object in the input image 202, the initial foreground and background points are specified by the training framework as follows: 1) for centroid weak annotations 238, the foreground is specified at a given point based on the data distribution or clinical significance in the input image 202, and the background is specified as the largest sphere; 2) for 2D bounding boxes, the foreground is specified at the center of a given box, and the background is specified as the sphere within the bounding box, which is expanded by a scale of r or voxels (whichever is smaller); 3) for 2D RECISTs, the foreground is specified as a point on the RECIST annotation axis, and the sphere within the expanded bounding box is specified as the background; 4) for 3D bounding boxes, the center point is specified as the foreground, and the sphere within the expanded bounding box is specified as the background; and 5) for 3D extreme points, the center and the contracted extreme points are specified as the foreground, and the sphere within the expanded bounding box is specified as the background. In at least one embodiment, the foreground points are expanded by the training framework using weak supervision techniques 204, 206, 208 (e.g., region growing methods). In at least one embodiment, each weak supervision technique 204, 206, 208 performs a random walk on the input weak annotation 238 A_i to refine the initial foreground and background points or the initial coarse seed positions.
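
The seed-then-grow step described above can be sketched in a few lines of Python. This is a minimal 2D illustration under stated assumptions, not the patented procedure: the helper names `seed_from_box` and `region_grow`, the intensity-tolerance growth criterion, and the toy image are all hypothetical, and the random-walk refinement and background sphere are omitted.

```python
from collections import deque


def seed_from_box(box):
    """Foreground seed at the center of a 2D bounding box.

    box is (row_min, col_min, row_max, col_max), inclusive.
    """
    r0, c0, r1, c1 = box
    return ((r0 + r1) // 2, (c0 + c1) // 2)


def region_grow(image, seed, tolerance):
    """Grow a 4-connected region from the seed.

    A neighbor joins the region when its intensity is within `tolerance`
    of the seed intensity (an assumed, simple growth criterion).
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 8, 0],
    [0, 9, 9, 9, 0],
    [0, 8, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
seed = seed_from_box((1, 1, 3, 3))
pseudo_label = region_grow(image, seed, tolerance=1)
```

Here the grown region covers the bright 3 x 3 block and stops at the dark border, giving a coarse pseudo-label from a single box annotation.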

[0073] In at least one embodiment, for each weak annotation 238 type A_i, each weak supervision technique 204, 206, and 208 generates n pseudo-labels 210 in pseudo-label groups 212, 214, and 216. In at least one embodiment, each pseudo-label group 212, 214, 216 is a set of n pseudo-labels 210 for a single weak annotation 238 A_i. In at least one embodiment, pseudo-label 210 is data that includes a class of unlabeled data, such as the input image 202, as if the class were a real label. That is, in one embodiment, pseudo-label 210 includes a predicted label for the input image 202. In at least one embodiment, the predicted label is data indicating an estimate of one or more features (e.g., foreground objects) in the input image 202 as described above.

[0074] In at least one embodiment, the training framework updates 222 each pseudo-label 210 in each pseudo-label group 212, 214, 216 according to the prediction map 220 to generate one or more feature maps 226, 228, 230. In at least one embodiment, update 222 is software instructions that, when executed, adjust the information contained in pseudo-labels 210 based on information about objects contained in prediction map 220. In at least one embodiment, prediction map 220 is a set of data values including information about objects in input image 202 predicted by model 218. In at least one embodiment, model 218 is data values and software instructions that, when executed, predict segmentation boundaries between background and foreground objects in input image 202. In at least one embodiment, model 218 is a 3D U-Net. In at least one embodiment, model 218 is any other type of neural network further described herein. In one embodiment, model 218 M is trained by the training framework (via backpropagation 240) using an initial context loss function based at least on pseudo-labels 210 (which are generated by the training framework based on weak annotations 238 and will be updated 222 by prediction map 220 X). In at least one embodiment, the context loss is a measurement that includes the distance between the location (context) of an object in one or more pseudo-labels 210 and the predicted segmentation boundary from prediction map 220 X. In at least one embodiment, the training framework applies Gaussian filters of varying sizes to the prediction map 220 X for the various weak annotation 238 types A_i. In at least one embodiment, the Gaussian filter is carried out by a series of B convolution operations on the prediction map 220 X, where B is a controllable parameter for the region coverage, and is:

[0075]

[0076] In at least one embodiment, the convolution kernel in each convolution operation is set by the training framework to a constant averaging kernel, and B is adjusted based on the size associated with the target object in the input image 202 and the weak annotation type A_i.
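
The idea of building a Gaussian-like filter from B passes of a constant averaging kernel can be illustrated with a short Python sketch. It is a 1D toy under stated assumptions: the kernel width of 3, the zero padding, and the function names `box_filter_1d` and `repeated_average` are hypothetical choices for this example, and the source's own formula for B is not reproduced here.

```python
def box_filter_1d(signal, width=3):
    """One convolution with a constant averaging kernel (zero-padded borders)."""
    half = width // 2
    n = len(signal)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < n:
                total += signal[j]
        out.append(total / width)
    return out


def repeated_average(signal, b):
    """Apply the averaging kernel b times in sequence."""
    for _ in range(b):
        signal = box_filter_1d(signal)
    return signal


# A unit impulse stands in for a single annotated position.
impulse = [0.0] * 21
impulse[10] = 1.0

narrow = repeated_average(impulse, b=2)  # covers 5 positions
wide = repeated_average(impulse, b=6)    # covers 13 positions
```

Each pass of a width-3 averaging kernel adds a variance of 2/3, so B passes give a bell-shaped response with variance 2B/3. Increasing B therefore widens the region covered around the annotation, which matches the role of B as a coverage parameter.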

[0077]

[0078] where G({a}) is a Gaussian distribution with respect to the annotation position {a}. In one embodiment, during training via the training framework, the context loss is combined with a common segmentation loss, with weights, to train 240 model 218 M as follows:

[0079]

[0080] where the segmentation loss is the Dice loss and γ = 1. In one embodiment, for each pseudo-label 210 of each weak annotation 238 type A_i, the training framework computes the local loss function as described above and performs gradient descent to backpropagate 240 the local loss, thereby locally updating the current model 218 M to a separate locally updated model M_j.
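
A minimal Python sketch of a soft Dice loss and a weighted combination with a context term is given below. The additive form with weight `gamma`, the smoothing constant `eps`, and the function names are assumptions made for illustration; the exact weighting formula is not reproduced in this text, and the context loss is passed in as a precomputed number rather than implemented.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flattened maps: 1 - 2|P.T| / (|P| + |T|)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def combined_loss(pred, pseudo_label, context_loss_value, gamma=1.0):
    """Assumed form: Dice segmentation loss plus gamma-weighted context loss."""
    return dice_loss(pred, pseudo_label) + gamma * context_loss_value


perfect = dice_loss([1, 0, 1, 0], [1, 0, 1, 0])    # full overlap
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])   # no overlap
local_loss = combined_loss([1, 0, 1, 0], [1, 0, 1, 0], context_loss_value=0.25)
```

In a local update, a loss of this kind would be computed against one pseudo-label and backpropagated to obtain the locally updated model for that pseudo-label.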

[0081] In at least one embodiment, the training framework uses each locally updated model M_j to determine an updated feature map 224 based on the prediction map 220 X and the pseudo-labels 210 in each pseudo-label group 212, 214, 216. In at least one embodiment, the updated feature map 224 is a set of local feature maps 226, 228, 230. In at least one embodiment, each feature map 226, 228, 230 is a set of data values that represent features of model 218 M for the corresponding pseudo-label. In at least one embodiment, the training framework updates 222 each pseudo-label 210 using a prediction map 220 X based on a single weak annotation 238 type A_j or, alternatively, a mixture from different pseudo-label groups 212, 214, 216 and different annotation types 238, to generate each feature map 226, 228, 230.

[0082] In at least one embodiment, the training framework performs meta-label fusion 232 on the feature maps 226, 228, and 230 to generate a fused label 234. In at least one embodiment, meta-label fusion 232 is software instructions that, when executed, concatenate or otherwise combine feature maps 226, 228, and 230 into a concatenated or combined feature map, and feed the concatenated feature map into a lightweight convolutional neural network to generate the fused label 234. In at least one embodiment, the combined feature map includes the data of each feature map 226, 228, 230 that has been concatenated or otherwise combined. In at least one embodiment, the fused label 234 includes data containing information about one or more objects in the input image 202 I. In at least one embodiment, meta-label fusion 232 is defined as:

[0083]

[0084] In at least one embodiment, during meta-label fusion 232, the convolutional weights of the convolutional neural network N are learned based on each feature map 226, 228, 230 in the concatenated feature maps. The network learns from the consistency and differences in features among the concatenated feature maps and, based on these, merges the pseudo-labels 210 into the fused label 234. In at least one embodiment, the training framework updates 236 the global model 218 M based on the updated or fused label 234. In at least one embodiment, the training framework performs meta-label fusion 232 for a single weak annotation 238 type A_i. In at least one embodiment, the training framework performs meta-label fusion 232 on a mixture of different weak annotations 238 during the training of the global model 218 M.
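
The concatenate-then-convolve structure of meta-label fusion can be sketched in plain Python as follows. This is only a toy under stated assumptions: a single 1 x 1 convolution with a sigmoid stands in for the lightweight convolutional network N, the weights are fixed by hand rather than learned, and the names `concatenate` and `fuse` are hypothetical.

```python
import math


def concatenate(feature_maps):
    """Stack K maps of shape H x W into an H x W grid of K-channel vectors."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[[fm[r][c] for fm in feature_maps] for c in range(cols)]
            for r in range(rows)]


def fuse(stacked, weights, bias):
    """Apply a 1x1 convolution across channels, followed by a sigmoid."""
    fused = []
    for row in stacked:
        fused_row = []
        for channels in row:
            z = sum(w * x for w, x in zip(weights, channels)) + bias
            fused_row.append(1.0 / (1.0 + math.exp(-z)))
        fused.append(fused_row)
    return fused


# Three 2 x 2 feature maps that partly agree and partly disagree.
maps = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 0.0]],
]
stacked = concatenate(maps)
# Hand-picked weights (an assumption); in training these would be learned.
fused_label = fuse(stacked, weights=[2.0, 2.0, 2.0], bias=-3.0)
binary = [[1 if v > 0.5 else 0 for v in row] for row in fused_label]
```

With these hand-picked weights the fusion behaves like a majority vote over the three maps; learned weights could instead favor whichever maps prove more reliable.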

[0085] In at least one embodiment, one or more training iterations are performed by the training framework. In at least one embodiment, during each training iteration, a training sample or training data (as described above in conjunction with Figure 1) is considered by the training framework. In at least one embodiment, the training sample includes input data or an input image I, together with the n pseudo-labels 210 in the pseudo-label groups 212, 214, and 216 generated for each weak annotation 238 A_i. In at least one embodiment, the training framework calculates a predicted segmentation mask, or prediction map 220 X, with features F_m based on the input data or input image 202 I.

[0086] Figure 3 is a block diagram illustrating a weak supervision technique 304 for generating one or more pseudo-labels 308, 310, 312, 314 in a pseudo-label group 306, according to at least one embodiment. In at least one embodiment, the input image 302 and the weak annotation 316 are input to the weak supervision technique 304 by the training framework described above in conjunction with Figure 1. In one embodiment, the input image 302 is data including information about the image, such as the image data described above in conjunction with Figure 1 and Figure 2. In at least one embodiment, the input image 302 is a medical image. In at least one embodiment, the medical image is an image including medical data generated by a medical imaging device. In at least one embodiment, the medical data generated by the medical imaging device includes anatomical information, such as organ scans, X-rays, or other medical imaging techniques.

[0087] In at least one embodiment, the training framework inputs the weak annotations 316 A_1…m, together with the input image 302 I, into the weak supervision technique 304 to generate one or more pseudo-labels 308, 310, 312, 314 in the pseudo-label group 306. In at least one embodiment, the weak annotations 316 A_1…m are data including information about one or more objects in the input image 302 I. In at least one embodiment, the weak supervision technique 304 receives the input image 302 I along with the weak annotations 316 A_1…m as input in order to generate the pseudo-labels 308, 310, 312, 314.

[0088] In at least one embodiment, the pseudo-labels 308, 310, 312, 314 are data that assign a class to unlabeled data, such as the input image 302, as if that class were a true label. That is, in one embodiment, the pseudo-labels 308, 310, 312, 314 include the predicted locations of foreground objects in the input image 302. In at least one embodiment, the pseudo-labels 308, 310, 312, 314 include any other information about the input image 302 that provides an estimated classification of the input image 302.

[0089] In at least one embodiment, the weak annotations 316 A_1…m describe information about one or more objects in the input image 302 and are constructed in different formats, such as boxes, lines, points, and the other formats described above in conjunction with Figure 1. In at least one embodiment, the weak annotations 316 A_1…m input into the weak supervision technique 304 include specific types of information. In at least one embodiment, the weak annotations 316 A_1…m include weak annotation types commonly used in medical image analysis, such as centroids, 2D bounding boxes, 2D Response Evaluation Criteria in Solid Tumors (RECIST) information, 3D bounding boxes, and 3D extrema.

[0090] In at least one embodiment, the weak supervision technique 304 is software instructions that, when executed, generate one or more pseudo-labels 308, 310, 312, 314 from the input image 302 I based on the weak annotations 316 A_1…m. In at least one embodiment, the weak supervision technique 304 is implemented without the weak annotations 316 A_1…m, in which case the one or more pseudo-labels 308, 310, 312, 314 are generated based on the input image 302 I alone.

[0091] In at least one embodiment, the weak supervision technique 304 includes software instructions that, when executed, implement region growing and random walk. In at least one embodiment, region growing generates initial values, such as foreground and background points, of the pseudo-labels 308, 310, 312, 314 for a weak annotation 316 type A_i. In at least one embodiment, for a given weak annotation 316 type A_i, the random walk refines the generated pseudo-labels 308, 310, 312, 314 in the pseudo-label group 306.
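The region growing step described above can be illustrated with a minimal sketch. It assumes a 2D image stored as nested lists, 4-connectivity, and an intensity tolerance around the seed value as the growth criterion; none of these choices is stated in the text, and the function name `region_grow` is hypothetical. The random walk refinement is not shown.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from `seed` over 4-connected pixels whose intensity is
    within `tolerance` of the seed intensity. Returns a binary mask.

    The intensity-tolerance criterion is an assumption of this sketch.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours of the current pixel.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


# Toy image: a bright 2x2 object plus one isolated bright pixel.
image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 91, 10, 10],
    [10, 10, 10, 10, 88],
]
initial_pseudo_label = region_grow(image, seed=(1, 1), tolerance=10)
```

In this toy case the isolated bright pixel at (3, 4) stays outside the mask because it is not connected to the seed, which is the behaviour that distinguishes region growing from plain thresholding.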

[0092] In at least one embodiment, the pseudo-label group 306 is a set of n pseudo-labels 308, 310, 312, 314 for a single weak annotation 316 type A_i. In at least one embodiment, for each weak annotation 316 type A_i, in conjunction with a spherical target object (e.g., an object found in a medical image) in the input image 302, the initial values (e.g., foreground and background points) of the pseudo-labels 308, 310, 312, 314 are generated or specified during region growing by the training framework (e.g., the training framework described in conjunction with Figure 1).

[0093] The training framework performs region growing to specify foreground and background points in the weak supervision technique 304. In one embodiment, for the weak annotation types commonly used in medical image analysis, this proceeds as follows: 1) for centroid weak annotations 316, the foreground is specified by the training framework at the given point, and the background is specified as a maximal sphere, based on the data distribution or clinical significance in the input image 302; 2) for 2D bounding boxes, the foreground is specified by the training framework at the center of the 2D bounding box, and the background is specified as a sphere within the 2D bounding box, which is expanded by a predetermined or parameterized ratio r or number of voxels v (whichever is smaller); 3) for 2D RECIST, the points on the 2D RECIST annotation axes are specified as foreground by the training framework, and spheres within the expanded bounding box are designated as background by the training framework; 4) for 3D bounding boxes, the center point of the 3D bounding box is designated as foreground by the training framework, and spheres within the expanded 3D bounding box are designated as background; and 5) for 3D extrema, the center point and the contracted extrema are designated as foreground by the training framework, and spheres within the expanded bounding box are designated as background by the training framework. In at least one embodiment, during the weak supervision technique 304, the foreground points are expanded by the training framework using additional region growing. In at least one embodiment, the training framework performing the weak supervision technique 304 applies a random walk to the input weak annotation 316 A_i to refine the initial foreground and background points, or initial coarse seed positions. In at least one embodiment, for a weak annotation 316 type A_i, the training framework implementing the weak supervision technique 304 generates n pseudo-labels 308, 310, 312, 314 in the pseudo-label group 306. In at least one embodiment, the pseudo-label group 306 is a set of n pseudo-labels 308, 310, 312, 314 for a single weak annotation 316 A_i.
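As an illustration of the 2D bounding-box case, the following minimal sketch computes the foreground seed at the box center and the expanded box. It assumes the expansion margin per axis is the smaller of the ratio r times the box extent and v voxels, which is one reading of "whichever is smaller"; the function name, argument layout, and clipping to the image are hypothetical. How background seeds are then placed relative to the expanded box is left out.

```python
def bbox_seeds_2d(box, ratio, max_voxels, image_shape):
    """Return (foreground_seed, expanded_box) for a 2D bounding box.

    box and expanded_box are (row_min, col_min, row_max, col_max), inclusive.
    The margin rule min(ratio * extent, max_voxels) is an assumption.
    """
    r0, c0, r1, c1 = box
    # Foreground seed: the center of the bounding box.
    center = ((r0 + r1) // 2, (c0 + c1) // 2)
    # Expansion margin per axis, capped by the voxel limit.
    margin_r = min(int(ratio * (r1 - r0 + 1)), max_voxels)
    margin_c = min(int(ratio * (c1 - c0 + 1)), max_voxels)
    # Expand the box and clip it to the image bounds.
    expanded = (
        max(r0 - margin_r, 0),
        max(c0 - margin_c, 0),
        min(r1 + margin_r, image_shape[0] - 1),
        min(c1 + margin_c, image_shape[1] - 1),
    )
    return center, expanded


seed, expanded_box = bbox_seeds_2d(
    box=(10, 20, 29, 59), ratio=0.25, max_voxels=8, image_shape=(64, 64)
)
```

Here the row margin is ratio-limited (5 rows) while the column margin is capped by `max_voxels` (8 columns), and the expanded box is clipped at the right image border.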

[0094] Figure 4 is a block diagram illustrating updating 414 a pseudo-label group 402 into one or more updated feature maps 416 based on information about the input image in a prediction map 412, according to at least one embodiment. In at least one embodiment, the training framework updates 414 each pseudo-label 404, 406, 408, 410 in the pseudo-label group 402 according to the prediction map 412 X to generate one or more feature maps 418, 420, 422, 424. In at least one embodiment, the pseudo-labels 404, 406, 408, 410 are data including a class (classification) for unlabeled data, such as the input medical image described above in conjunction with Figures 1-3. In at least one embodiment, the pseudo-labels 404, 406, 408, 410 include the predicted locations of foreground objects in the input image. In at least one embodiment, the pseudo-label group 402 is a set of n pseudo-labels 404, 406, 408, 410 for a single weak annotation type A_i, as described above in conjunction with Figure 2 and Figure 3.

[0095] In at least one embodiment, the training framework performs update 414, which is software instructions that, when executed, adjust the information contained in the pseudo-labels 404, 406, 408, 410 based on information about objects contained in the prediction map 412 X, as discussed above in conjunction with Figure 2. In at least one embodiment, the prediction map 412 X is a set of data values that include information about objects predicted by the model in the input image, as discussed above in conjunction with Figure 2. In at least one embodiment, the prediction map 412 X includes information indicating the segmentation boundary between background and foreground objects in the input image.

[0096] In at least one embodiment, the training framework determines a context loss during update 414 in order to perform a local update of the model M, generating a locally updated model M_j as described above in conjunction with Figure 2. In at least one embodiment, the context loss is a measure of the distance between the object or object location (context) included in one or more pseudo-labels 404, 406, 408, 410 and the predicted segmentation boundary in the prediction map 412 X. In at least one embodiment, the training framework provides the context loss to the model 426 M to locally update the model via backpropagation and, for each pseudo-label 404, 406, 408, 410 of a given weak annotation type A_i, generates a locally updated model M_j, as described above in conjunction with Figure 2.
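The text does not give a formula for the context loss. As a hedged illustration only, the sketch below uses a soft Dice loss as a stand-in distance between a prediction map and a pseudo-label; the choice of Dice and the name `soft_dice_loss` are assumptions of this example, not the loss defined by the embodiments.

```python
def soft_dice_loss(prediction, pseudo_label, eps=1e-6):
    """Soft Dice loss between a prediction map and a pseudo-label.

    Both inputs are 2D nested lists with values in [0, 1]. Dice is an
    assumed stand-in for the distance measure described in the text.
    """
    flat_p = [p for row in prediction for p in row]
    flat_y = [y for row in pseudo_label for y in row]
    intersection = sum(p * y for p, y in zip(flat_p, flat_y))
    total = sum(flat_p) + sum(flat_y)
    # 0 when the maps agree perfectly, close to 1 when they do not overlap.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


pseudo_label = [[1, 0], [0, 1]]
loss_match = soft_dice_loss([[1, 0], [0, 1]], pseudo_label)
loss_disjoint = soft_dice_loss([[0, 1], [1, 0]], pseudo_label)
```

A loss of this kind is differentiable with respect to the prediction values, which is what allows it to be backpropagated into the model for a local update.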

[0097] In at least one embodiment, during update 414, the training framework uses each locally updated model M_j to determine the updated feature maps 418, 420, 422, 424 based on the prediction map 412 X and the pseudo-labels 404, 406, 408, 410 in the pseudo-label group 402 for the given weak annotation type A_i. In at least one embodiment, the updated feature maps 416 are a set of locally updated feature maps 418, 420, 422, 424. In at least one embodiment, the locally updated feature maps 418, 420, 422, 424 are sets of data values representing the features of the model M for the corresponding pseudo-labels 404, 406, 408, 410, as described above in conjunction with Figure 2. In at least one embodiment, the training framework updates 414 each pseudo-label 404, 406, 408, 410 based on the prediction map 412 X, either for a single weak annotation type A_i or for a mixture of different weak annotation types from multiple pseudo-label groups 402, to generate each locally updated feature map 418, 420, 422, 424, as described above in conjunction with Figure 2.

[0098] Figure 5 is a block diagram illustrating meta-label fusion 526 of one or more updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 from one or more groups 502, 510, 518 into a fused label 528, according to at least one embodiment. In at least one embodiment, the training framework performs meta-label fusion 526 to generate a fused label 528 for a given weak annotation type A_i from the feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 in the groups 502, 510, 518 that correspond to that weak annotation type A_i, as described above in conjunction with Figures 2 and 4. In one embodiment, the training framework performs meta-label fusion 526 to generate a fused label 528 Y for multiple weak annotation types A_i based on the locally updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 from multiple groups 502, 510, 518, where each group 502, 510, 518 corresponds to a weak annotation type A_i.

[0099] In at least one embodiment, the updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 are sets of data values representing the features of the model M for the corresponding pseudo-labels, as described above in conjunction with Figure 2 and Figure 4. In one embodiment, one or more groups 502, 510, 518 include one or more feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524. In at least one embodiment, the groups 502, 510, 518 are collections of one or more updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 organized by a given weak annotation A_i, where the given weak annotation A_i is the one used by the training framework to generate the pseudo-labels on which each updated feature map in the group is based, as described above in conjunction with Figures 2-4. In at least one embodiment, the training framework uses the updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 from an individual group 502, 510, 518 for meta-label fusion 526. In one embodiment, the training framework uses the updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 from all groups 502, 510, 518 for meta-label fusion 526. In at least one embodiment, the training framework uses a mixture of updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 from the groups 502, 510, 518 for meta-label fusion 526.

[0100] In at least one embodiment, meta-label fusion 526 is software instructions that, when executed, concatenate or otherwise combine the feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 of the groups 502, 510, 518 corresponding to the weak annotation type A_i, as described above in conjunction with Figure 2. In at least one embodiment, during meta-label fusion 526, the training framework concatenates or otherwise combines the feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 into combined feature maps as an intermediate combination. In at least one embodiment, the training framework feeds the combined feature maps into a lightweight convolutional network N during meta-label fusion 526.

[0101] In at least one embodiment, the training framework uses the lightweight convolutional network N during meta-label fusion 526 to generate a fused label 528 for the weak annotation type A_i, as described above in conjunction with Figure 2. In at least one embodiment, the fused label 528 is data including information about one or more objects in the input image I, generated by the training framework based on weak annotations of type A_i. In at least one embodiment, meta-label fusion 526 is defined as:

[0102]

[0103] In at least one embodiment, during meta-label fusion 232, the convolutional weights of the convolutional network N are learned by the convolutional network N based on the consistency and differences in features between the feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524.
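The idea of concatenating feature maps and passing them through a lightweight convolutional network can be sketched as follows. This is an illustration under stated assumptions, not the fusion definition of the embodiments: the network N is reduced to a single 1x1 convolution (one weight per concatenated channel) followed by a sigmoid, the weights are fixed by hand rather than learned, and the name `fuse_feature_maps` is hypothetical.

```python
import math


def fuse_feature_maps(feature_maps, weights, bias=0.0):
    """Fuse n same-sized 2D feature maps into one map of values in (0, 1).

    Stacking the maps as channels and applying a 1x1 convolution amounts
    to a per-pixel weighted sum over the channels. In training, `weights`
    and `bias` would be learned; here they are set by hand.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = []
    for r in range(rows):
        fused_row = []
        for c in range(cols):
            # Weighted sum over the concatenated channels at pixel (r, c).
            logit = bias + sum(w * fm[r][c] for w, fm in zip(weights, feature_maps))
            fused_row.append(1.0 / (1.0 + math.exp(-logit)))
        fused.append(fused_row)
    return fused


# Three toy feature maps that partly disagree with each other.
f1 = [[1, 1], [0, 0]]
f2 = [[1, 0], [0, 0]]
f3 = [[1, 1], [1, 0]]
fused = fuse_feature_maps([f1, f2, f3], weights=[2.0, 2.0, 2.0], bias=-3.0)
fused_label = [[1 if v > 0.5 else 0 for v in row] for row in fused]
```

With these hand-picked weights a pixel is marked foreground only when at least two of the three maps agree, which mimics how a fusion network can exploit consistency between maps derived from different pseudo-labels.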

[0104] In at least one embodiment, the training framework generates a fused label 528 for the weak annotation type A_i through meta-label fusion 526. The training framework then updates 530 the global model M based on the fused label 528 for the weak annotation type A_i, as described above in conjunction with Figure 2 and Figure 4. In at least one embodiment, the training framework performs meta-label fusion 526 on updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 that were generated by the training framework based on a single weak annotation type A_i. In at least one embodiment, the training framework performs meta-label fusion 526 on updated feature maps 504, 506, 508, 512, 514, 516, 520, 522, 524 that were generated by the training framework based on a mixture of different weak annotation types A_i.

[0105] Figure 6 illustrates a process 600 for training one or more neural networks to generate fused labels based on an input image, according to at least one embodiment. In at least one embodiment, the process 600 begins 602 with a training framework receiving input data 604, such as an input medical image or a general input image, along with one or more weak annotations of different types, as described above in conjunction with Figures 2 and 3. In one embodiment, using an input image such as a medical image, the training framework uses a model M to generate a prediction map 606 X, which includes information about features or objects in the input image, as described above in conjunction with Figure 2.

[0106] In at least one embodiment, the training framework generates one or more pseudo-labels 608 using region growing and random walk based on each weak annotation type A_i input to the training framework, as described above in conjunction with Figures 2 and 3. In at least one embodiment, the training framework uses the generated pseudo-labels 608 together with the prediction map 606 X to generate a locally updated model 610 M_j, as described above in conjunction with Figures 2 and 4. The locally updated model 610 M_j is generated by the training framework by backpropagating a context loss to the model M. In at least one embodiment, the training framework uses the locally updated model 610 M_j, together with the pseudo-labels 608 and the prediction map 606 X, to generate updated feature maps 612 for one or more weak annotation types A_i, as described above in conjunction with Figures 2 and 4.

[0107] In at least one embodiment, the training framework fuses 614 the feature maps using meta-label fusion for one or more weak annotation types A_i to generate a fused label, as described above in conjunction with Figures 2 and 5. In one embodiment, using the fused label, the training framework globally updates 616 the model M based on the information contained in the fused label, as described above in conjunction with Figure 2. In at least one embodiment, the training process 600 ends 618 once the model M has been globally updated 616 by the training framework.
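The order of operations in process 600 can be summarized in a short skeleton. It is a sketch only: each step is passed in as a callable, all names (`run_training_iteration`, the keys of `steps`) are hypothetical, and the toy stand-ins below merely record the call order rather than performing any learning. Fusing all feature maps together at the end corresponds to the mixed-annotation variant; fusing per annotation type would move the fusion call inside the outer loop.

```python
def run_training_iteration(image, weak_annotations, model, steps):
    """One pass through process 600, with every step supplied by the caller."""
    prediction_map = steps["predict"](model, image)                      # 606
    feature_maps = []
    for annotation in weak_annotations:
        pseudo_labels = steps["make_pseudo_labels"](image, annotation)    # 608
        for pseudo_label in pseudo_labels:
            # Local update via the context loss, then feature extraction.
            local_model = steps["local_update"](model, prediction_map, pseudo_label)  # 610
            feature_maps.append(
                steps["extract_features"](local_model, prediction_map, pseudo_label)  # 612
            )
    fused_label = steps["fuse"](feature_maps)                             # 614
    return steps["global_update"](model, fused_label)                     # 616


# Toy stand-ins that only record the order in which steps are called.
log = []


def record(name, result):
    def step(*args):
        log.append(name)
        return result
    return step


toy_steps = {
    "predict": record("predict", "X"),
    "make_pseudo_labels": record("make_pseudo_labels", ["label_1", "label_2"]),
    "local_update": record("local_update", "M_j"),
    "extract_features": record("extract_features", "F"),
    "fuse": record("fuse", "fused_label"),
    "global_update": record("global_update", "updated_M"),
}
updated_model = run_training_iteration("I", ["A_1"], "M", toy_steps)
```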

[0108] Inference and training logic

[0109] Figure 7A illustrates inference and / or training logic 715 used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided below in conjunction with Figure 7A and / or Figure 7B.

[0110] In at least one embodiment, inference and / or training logic 715 may include, but is not limited to, code and / or data storage 701 for storing forward and / or output weights and / or input / output data, and / or other parameters configuring neurons or layers of a neural network trained for and / or used for inference in one or more embodiments. In at least one embodiment, training logic 715 may include or be coupled to code and / or data storage 701 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic, including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, code and / or data storage 701 stores weight parameters and / or input / output data of each layer of a neural network trained or used in one or more embodiments during forward propagation of input / output data and / or weight parameters during training and / or inference using one or more embodiments. In at least one embodiment, any portion of the code and / or data storage 701 may be included within other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.

[0111] In at least one embodiment, any portion of the code and / or data storage 701 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 701 may be a cache memory, dynamic random-addressable memory (“DRAM”), static random-addressable memory (“SRAM”), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 701 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage space, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.

[0112] In at least one embodiment, the inference and / or training logic 715 may include, but is not limited to, code and / or data storage 705 to store backward and / or output weights and / or input / output data corresponding to neurons or layers of a neural network trained and / or used for inference in one or more embodiments. In at least one embodiment, during training and / or inference using one or more embodiments, the code and / or data storage 705 stores weight parameters and / or input / output data for each layer of a neural network trained or used in one or more embodiments during backward propagation of input / output data and / or weight parameters. In at least one embodiment, the training logic 715 may include or be coupled to code and / or data storage 705 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)).

[0113] In at least one embodiment, code (such as graph code) causes weight or other parameter information to be loaded into the processor ALUs based on the architecture of the neural network to which the code corresponds. In at least one embodiment, any portion of the code and / or data storage 705 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of the code and / or data storage 705 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 705 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 705 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.

[0114] In at least one embodiment, code and / or data storage 701 and code and / or data storage 705 may be separate storage structures. In at least one embodiment, code and / or data storage 701 and code and / or data storage 705 may be the same storage structure. In at least one embodiment, code and / or data storage 701 and code and / or data storage 705 may be partially combined and partially separated. In at least one embodiment, any portion of code and / or data storage 701 and code and / or data storage 705 may be included with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.

[0115] In at least one embodiment, the inference and / or training logic 715 may include, but is not limited to, one or more arithmetic logic units (“ALUs”) 710 (including integer and / or floating-point units) for performing logical and / or mathematical operations based at least in part on, or as indicated by, training and / or inference code (e.g., graph code), the results of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 720, which are functions of input / output and / or weight parameter data stored in code and / or data storage 701 and / or code and / or data storage 705. In at least one embodiment, the activations stored in activation storage 720 are generated according to linear algebraic and / or matrix-based mathematics performed by the ALUs 710 in response to executing instructions or other code, wherein weight values stored in code and / or data storage 705 and / or code and / or data storage 701 are used as operands together with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in code and / or data storage 705 or code and / or data storage 701 or other on-chip or off-chip storage.

[0116] In at least one embodiment, one or more processors or other hardware logic devices or circuits include one or more ALUs 710, while in another embodiment, one or more ALUs 710 may be external to the processor or other hardware logic device or circuit that uses them (e.g., a coprocessor). In at least one embodiment, one or more ALUs 710 may be included within an execution unit of a processor, or otherwise included in a group of ALUs accessible by the execution unit of the processor, which may be within the same processor or distributed among different processors of different types (e.g., a central processing unit, a graphics processing unit, a fixed-function unit, etc.). In at least one embodiment, code and / or data storage 701, code and / or data storage 705, and activation storage 720 may share a processor or other hardware logic device or circuit, while in another embodiment, they may be located in different processors or other hardware logic devices or circuits, or in some combination of the same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 720 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. Furthermore, inference and / or training code may be stored together with other code accessible to the processor or other hardware logic or circuit, and may be fetched and / or processed using the processor's fetch, decode, schedule, execute, retire, and / or other logic circuits.

[0117] In at least one embodiment, the activation storage 720 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the activation storage 720 may be wholly or partially located inside or outside one or more processors or other logic circuits. In at least one embodiment, the choice of whether the activation storage 720 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip or off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of data used in inference and / or training of the neural network, or some combination of these factors.

[0118] In at least one embodiment, the inference and / or training logic 715 shown in Figure 7A can be used in conjunction with an application-specific integrated circuit (“ASIC”), such as a TensorFlow Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, the inference and / or training logic 715 shown in Figure 7A can be used in conjunction with central processing unit (“CPU”) hardware, graphics processing unit (“GPU”) hardware, or other hardware such as field programmable gate arrays (“FPGAs”).

[0119] Figure 7B illustrates inference and / or training logic 715 according to at least one embodiment. In at least one embodiment, the inference and / or training logic 715 may include, but is not limited to, hardware logic in which computational resources are dedicated or otherwise exclusively used in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and / or training logic 715 shown in Figure 7B can be used in conjunction with an application-specific integrated circuit (ASIC), such as a TensorFlow Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, the inference and / or training logic 715 shown in Figure 7B can be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware (e.g., field-programmable gate arrays (FPGAs)). In at least one embodiment, the inference and / or training logic 715 includes, but is not limited to, code and / or data storage 701 and code and / or data storage 705, which can be used to store code (e.g., graph code), weight values, and / or other information, including bias values, gradient information, momentum values, and / or other parameter or hyperparameter information. In at least one embodiment shown in Figure 7B, each of code and / or data storage 701 and code and / or data storage 705 is associated with dedicated computing resources (e.g., computing hardware 702 and computing hardware 706), respectively. In at least one embodiment, each of computing hardware 702 and computing hardware 706 includes one or more ALUs that perform mathematical functions (e.g., linear algebraic functions) only on the information stored in code and / or data storage 701 and code and / or data storage 705, respectively, and the results of those functions are stored in activation storage 720.

[0120] In at least one embodiment, each of the code and / or data storage 701 and 705 and the corresponding computing hardware 702 and 706 corresponds to a different layer of the neural network, such that the activation resulting from one “store / computation pair 701 / 702” of the code and / or data storage 701 and computing hardware 702 is provided as input to the next “store / computation pair 705 / 706” of the code and / or data storage 705 and computing hardware 706, in order to mirror the conceptual organization of the neural network. In at least one embodiment, each store / computation pair 701 / 702 and 705 / 706 may correspond to more than one neural network layer. In at least one embodiment, additional store / computation pairs (not shown) may be included in the inference and / or training logic 715 after or in parallel with the store / computation pairs 701 / 702 and 705 / 706.
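The chaining of store / computation pairs can be pictured with a small software analogy. This sketch assumes each pair holds the weights of one fully connected layer with a ReLU activation; the class `StoreComputePair` and the numeric values are hypothetical, and the hardware arrangement itself is of course not modelled.

```python
class StoreComputePair:
    """Software stand-in for one storage unit plus its dedicated compute unit."""

    def __init__(self, weights, biases):
        # Plays the role of code and / or data storage (e.g., 701 or 705).
        self.weights = weights
        self.biases = biases

    def compute(self, inputs):
        # Plays the role of computing hardware (e.g., 702 or 706): it only
        # operates on the parameters held by this pair.
        outputs = []
        for row, bias in zip(self.weights, self.biases):
            pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
            outputs.append(max(0.0, pre_activation))  # ReLU
        return outputs


pair_701_702 = StoreComputePair([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
pair_705_706 = StoreComputePair([[2.0, 1.0]], [0.5])

# The activation of the first pair becomes the input of the next pair.
activation = pair_701_702.compute([3.0, 1.0])
result = pair_705_706.compute(activation)
```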

[0121] Neural network training and deployment

[0122] Figure 8 illustrates training and deployment of a deep neural network according to at least one embodiment. In at least one embodiment, an untrained neural network 806 is trained using a training dataset 802. In at least one embodiment, the training framework 804 is the PyTorch framework, while in other embodiments, the training framework 804 is TensorFlow, Boost, Caffe, Microsoft Cognitive Toolkit / CNTK, MXNet, Chainer, Keras, Deeplearning4j, or another training framework. In at least one embodiment, the training framework 804 trains the untrained neural network 806 using the processing resources described herein to generate a trained neural network 808. In at least one embodiment, the weights may be chosen randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in a supervised, partially supervised, or unsupervised manner.

[0123] In at least one embodiment, supervised learning is used to train the untrained neural network 806, wherein the training dataset 802 includes inputs paired with desired outputs, or wherein the training dataset 802 includes inputs with known outputs and the output of the neural network 806 is manually graded. In at least one embodiment, the untrained neural network 806 is trained in a supervised manner: inputs from the training dataset 802 are processed, and the resulting outputs are compared with a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through the untrained neural network 806. In at least one embodiment, the training framework 804 adjusts the weights that control the untrained neural network 806. In at least one embodiment, the training framework 804 includes tools for monitoring how well the untrained neural network 806 is converging toward a model (e.g., a trained neural network 808) suited to generating the correct answer (e.g., result 814) based on input data (e.g., a new dataset 812). In at least one embodiment, the training framework 804 repeatedly trains the untrained neural network 806 while adjusting the weights to improve the output of the untrained neural network 806 using a loss function and an adjustment algorithm (e.g., stochastic gradient descent). In at least one embodiment, the training framework 804 trains the untrained neural network 806 until the untrained neural network 806 reaches the desired accuracy. In at least one embodiment, the trained neural network 808 can then be deployed to perform any number of machine learning operations.
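The supervised loop described above (process an input, compare with the desired output, propagate the error, adjust the weights) can be shown on the smallest possible model. The sketch below fits a single-weight linear model with stochastic gradient descent on a squared-error loss; the model, the synthetic data, and the hyperparameters are assumptions chosen for illustration and say nothing about the networks of the embodiments.

```python
import random


def train_sgd(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent.

    Loss per sample: 0.5 * (prediction - target) ** 2.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, target in data:
            error = (w * x + b) - target       # compare output with target
            w -= learning_rate * error * x     # gradient of the loss w.r.t. w
            b -= learning_rate * error         # gradient of the loss w.r.t. b
    return w, b


# Inputs paired with desired outputs, generated from y = 2x + 1.
samples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_sgd(samples)
```

On this noise-free data the learned parameters approach w = 2 and b = 1; a real training framework would additionally monitor convergence and stop once the desired accuracy is reached.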

[0124] In at least one embodiment, unsupervised learning is used to train the untrained neural network 806, wherein the untrained neural network 806 attempts to train itself using unlabeled data. In at least one embodiment, the training dataset 802 for unsupervised learning includes input data without any associated output data or "ground truth" data. In at least one embodiment, the untrained neural network 806 can learn groupings within the training dataset 802 and can determine how each input relates to the untrained dataset 802. In at least one embodiment, unsupervised training can be used to generate a self-organizing map in the trained neural network 808, which is capable of performing operations useful for reducing the dimensionality of the new dataset 812. In at least one embodiment, unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in the new dataset 812 that deviate from the normal patterns of the new dataset 812.

[0125] In at least one embodiment, semi-supervised learning can be used, a technique in which the training dataset 802 includes a mixture of labeled and unlabeled data. In at least one embodiment, the training framework 804 can be used to perform incremental learning, for example, via a transfer learning technique. In at least one embodiment, incremental learning enables the trained neural network 808 to adapt to a new dataset 812 without forgetting the knowledge instilled in the trained neural network 808 during initial training.

[0126] Data Center

[0127] Figure 9 shows an example data center 900 in which at least one embodiment may be used. In at least one embodiment, the data center 900 includes a data center infrastructure layer 910, a framework layer 920, a software layer 930, and an application layer 940.

[0128] In at least one embodiment, as shown in Figure 9, the data center infrastructure layer 910 may include a resource coordinator 912, grouped computing resources 914, and node computing resources (“nodes CR”) 916(1)-916(N), where “N” represents a positive integer (which may be a different integer “N” than that used in other figures). In at least one embodiment, nodes CR 916(1)-916(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), memory storage devices 918(1)-918(N) (e.g., dynamic read-only memory, solid-state drives, or disk drives), network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power modules, and cooling modules, etc. In at least one embodiment, one or more of nodes CR 916(1)-916(N) may be a server having one or more of the aforementioned computing resources.

[0129] In at least one embodiment, the grouped computing resources 914 may include individual groups (not shown) of node CRs housed within one or more racks, or a plurality of racks (also not shown) housed within data centers in various geographical locations. In at least one embodiment, the individual groups of node CRs within the grouped computing resources 914 may include computing, networking, memory, or storage resources that can be configured or allocated to support one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

[0130] In at least one embodiment, resource coordinator 912 may configure or otherwise control one or more nodes CR 916(1)-916(N) and / or grouped computing resources 914. In at least one embodiment, resource coordinator 912 may include a software design infrastructure (“SDI”) management entity for data center 900. In at least one embodiment, resource coordinator 912 may include hardware, software, or some combination thereof.

[0131] In at least one embodiment, as shown in Figure 9, framework layer 920 includes a job scheduler 922, a configuration manager 924, a resource manager 926, and a distributed file system 928. In at least one embodiment, framework layer 920 may include a framework to support software 932 of software layer 930 and / or one or more applications 942 of application layer 940. In at least one embodiment, software 932 or application 942 may respectively include web-based service software or applications, such as services or applications provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 920 may be, but is not limited to, a free and open-source software web application framework, such as Apache Spark™ (hereinafter referred to as "Spark"), which can utilize distributed file system 928 for large-scale data processing (e.g., "big data"). In at least one embodiment, the job scheduler 922 may include a Spark driver to facilitate the scheduling of workloads supported by various layers of the data center 900. In at least one embodiment, the configuration manager 924 may be able to configure different layers, such as the software layer 930 and the framework layer 920, which includes Spark and a distributed file system 928 for supporting large-scale data processing. In at least one embodiment, the resource manager 926 is able to manage clustered or grouped computing resources mapped to or allocated to support the distributed file system 928 and the job scheduler 922. In at least one embodiment, the clustered or grouped computing resources may include grouped computing resources 914 at the data center infrastructure layer 910. In at least one embodiment, the resource manager 926 may coordinate with the resource coordinator 912 to manage these mapped or allocated computing resources.

[0132] In at least one embodiment, the software 932 included in the software layer 930 may include software used by at least a portion of the nodes CR 916(1)-916(N), the grouped computing resources 914, and / or the distributed file system 928 of the framework layer 920. In at least one embodiment, one or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.

[0133] In at least one embodiment, one or more applications 942 included in application layer 940 may include one or more types of applications used by at least a portion of nodes CR 916(1)-916(N), grouped computing resources 914, and / or the distributed file system 928 of framework layer 920. In at least one embodiment, one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

[0134] In at least one embodiment, any of the configuration manager 924, resource manager 926, and resource coordinator 912 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 900 and can prevent underutilization and / or poor performance of the data center.

[0135] In at least one embodiment, data center 900 may include tools, services, software, or other resources to train one or more machine learning models or to use one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above with respect to data center 900. In at least one embodiment, information can be inferred or predicted using trained machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 900 by using weight parameters calculated through one or more training techniques described herein.

[0136] In at least one embodiment, the data center may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, or other hardware to utilize the aforementioned resources to perform training and / or inference. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured as a service to allow a user to train or perform information inference, such as image recognition, speech recognition, or other artificial intelligence services.

[0137] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 can be used in the system of Figure 9 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0138] In at least one embodiment, the inference and / or training logic 2 can be used in the system of Figure 9 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0139] Autonomous vehicles

[0140] Figure 10A shows an example of an autonomous vehicle 1000 according to at least one embodiment. In at least one embodiment, the autonomous vehicle 1000 (which may alternatively be referred to herein as "vehicle 1000") may be, but is not limited to, a passenger vehicle, such as a car, truck, bus, and / or another type of vehicle capable of accommodating one or more passengers. In at least one embodiment, vehicle 1000 may be a semi-tractor-trailer for hauling goods. In at least one embodiment, vehicle 1000 may be an aircraft, a robotic vehicle, or other type of vehicle.

[0141] Autonomous vehicles can be described according to the levels of automation defined by the National Highway Traffic Safety Administration (“NHTSA”), a division of the U.S. Department of Transportation, and the Society of Automotive Engineers (“SAE”) in the standard “Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles” (e.g., standard number J3016-201806, published June 15, 2018; standard number J3016-201609, published September 30, 2016; and previous and future versions of this standard). In one or more embodiments, vehicle 1000 may be able to function according to one or more of Level 1 through Level 5 of the autonomous driving levels. For example, in at least one embodiment, vehicle 1000 may be able to perform conditional automation (Level 3), high automation (Level 4), and / or full automation (Level 5).
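
For readers who want the level numbering in one place, the following Python sketch encodes the driving automation levels as an enumeration. The names for Levels 3 to 5 follow the paragraph above; the names for Levels 0 to 2 are taken from the general J3016 taxonomy rather than from this document. The set `VEHICLE_1000_LEVELS` and the helper `can_operate_at` are hypothetical and only mirror the "for example" case given above.

```python
from enum import IntEnum


class DrivingAutomationLevel(IntEnum):
    """Driving automation levels (numbering per the J3016 taxonomy)."""
    NO_AUTOMATION = 0           # name not stated in this document
    DRIVER_ASSISTANCE = 1       # name not stated in this document
    PARTIAL_AUTOMATION = 2      # name not stated in this document
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


# Hypothetical capability set matching the example in the text (Levels 3 to 5).
VEHICLE_1000_LEVELS = {
    DrivingAutomationLevel.CONDITIONAL_AUTOMATION,
    DrivingAutomationLevel.HIGH_AUTOMATION,
    DrivingAutomationLevel.FULL_AUTOMATION,
}


def can_operate_at(supported_levels, level):
    """Return True if the given level (int or enum member) is supported."""
    return DrivingAutomationLevel(level) in supported_levels
```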

[0142] In at least one embodiment, vehicle 1000 may include, but is not limited to, components such as chassis, body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other vehicle components. In at least one embodiment, vehicle 1000 may include, but is not limited to, propulsion system 1050, such as an internal combustion engine, a hybrid powertrain, a fully electric motor, and / or another type of propulsion system. In at least one embodiment, propulsion system 1050 may be connected to the drivetrain of vehicle 1000, which may include, but is not limited to, a transmission, to enable propulsion of vehicle 1000. In at least one embodiment, propulsion system 1050 may be controlled in response to receiving a signal from throttle / accelerator 1052.

[0143] In at least one embodiment, when the propulsion system 1050 is operating (e.g., when the vehicle 1000 is traveling), the steering system 1054 (which may include, but is not limited to, a steering wheel) is used to steer the vehicle 1000 (e.g., along a desired path or route). In at least one embodiment, the steering system 1054 may receive signals from the steering actuator 1056. In at least one embodiment, the steering wheel may be optional for fully automated (Level 5) functionality. In at least one embodiment, the brake sensor system 1046 may be used to operate the vehicle brakes in response to signals received from the brake actuator 1048 and / or brake sensors.

[0144] In at least one embodiment, one or more controllers 1036, which may include, but are not limited to, one or more system-on-chips (“SoCs”) (not shown in Figure 10A) and / or one or more graphics processing units (“GPUs”), provide signals (e.g., representing commands) to one or more components and / or systems of vehicle 1000. For example, in at least one embodiment, controller 1036 may send signals to operate vehicle braking via brake actuator 1048, to operate steering system 1054 via one or more steering actuators 1056, and to operate propulsion system 1050 via one or more throttles / accelerators 1052. In at least one embodiment, one or more controllers 1036 may include one or more onboard (e.g., integrated) computing devices that process sensor signals and output operating commands (e.g., signals representing commands) to enable autonomous driving and / or assist a driver in driving vehicle 1000. In at least one embodiment, one or more controllers 1036 may include a first controller for autonomous driving functions, a second controller for functional safety functions, a third controller for artificial intelligence functions (e.g., computer vision), a fourth controller for infotainment functions, a fifth controller for redundancy in emergency situations, and / or other controllers. In at least one embodiment, a single controller may handle two or more of the functions described above, two or more controllers may handle a single function, and / or any combination thereof.

[0145] In at least one embodiment, one or more controllers 1036 provide signals for controlling one or more components and / or systems of vehicle 1000 in response to sensor data received from one or more sensors (e.g., sensor inputs). In at least one embodiment, sensor data can be received from sensors including, but not limited to, one or more Global Navigation Satellite System (“GNSS”) sensors 1058 (e.g., one or more Global Positioning System sensors), one or more RADAR sensors 1060, one or more ultrasonic sensors 1062, one or more LIDAR sensors 1064, one or more inertial measurement unit (“IMU”) sensors 1066 (e.g., one or more accelerometers, one or more gyroscopes, one or more magnetic compasses, one or more magnetometers, etc.), one or more microphones 1096, one or more stereo cameras 1068, one or more wide-angle cameras 1070 (e.g., fisheye cameras), one or more infrared cameras 1072, one or more surround cameras 1074 (e.g., 360-degree cameras), long-range cameras (not shown in Figure 10A), mid-range cameras (not shown in Figure 10A), one or more speed sensors 1044 (e.g., for measuring the speed of vehicle 1000), one or more vibration sensors 1042, one or more steering sensors 1040, one or more brake sensors (e.g., as part of brake sensor system 1046), and / or other sensor types.

[0146] In at least one embodiment, one or more controllers 1036 may receive input (e.g., represented by input data) from the dashboard 1032 of the vehicle 1000 and provide output (e.g., represented by output data, display data, etc.) via a human-machine interface (“HMI”) display 1034, an audible annunciator, a speaker, and / or other components of the vehicle 1000. In at least one embodiment, the output may include information such as vehicle speed, velocity, time, map data (e.g., a high-definition map (not shown in Figure 10A)), location data (e.g., the location of vehicle 1000, for example on a map), direction, the location of other vehicles (e.g., an occupancy grid), and information about objects and the state of objects as perceived by one or more controllers 1036. For example, in at least one embodiment, the HMI display 1034 may display information about the presence of one or more objects (e.g., road signs, warning signs, traffic light changes, etc.) and / or information about driving maneuvers that the vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).

[0147] In at least one embodiment, the vehicle 1000 also includes a network interface 1024, which can communicate over one or more networks using one or more wireless antennas 1026 and / or one or more modems. For example, in at least one embodiment, the network interface 1024 may be able to communicate over Long Term Evolution (“LTE”), Wideband Code Division Multiple Access (“WCDMA”), Universal Mobile Telecommunications System (“UMTS”), Global System for Mobile Communications (“GSM”), IMT-CDMA Multicarrier (“CDMA2000”) networks, etc. In at least one embodiment, one or more wireless antennas 1026 may also enable communication between objects in the environment (e.g., vehicles, mobile devices) using one or more local area networks (e.g., Bluetooth, Bluetooth Low Energy (LE), Z-Wave, ZigBee, etc.) and / or one or more low-power wide area networks (hereinafter referred to as “LPWAN”) (e.g., LoRaWAN, SigFox, etc. protocols).

[0148] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 can be used in the system of Figure 10A for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0149] In at least one embodiment, the inference and / or training logic 2 can be used in the system of Figure 10A for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0150] Figure 10B shows an example of camera positions and fields of view for the autonomous vehicle 1000 of Figure 10A, according to at least one embodiment. In at least one embodiment, the cameras and their respective fields of view are one exemplary embodiment and are not intended to be limiting. For example, in at least one embodiment, additional and / or alternative cameras may be included and / or the cameras may be located at different positions on the vehicle 1000.

[0151] In at least one embodiment, the camera types used for the cameras may include, but are not limited to, digital cameras suitable for use with components and / or systems of vehicle 1000. In at least one embodiment, one or more cameras may operate at Automotive Safety Integrity Level (“ASIL”) B and / or other ASILs. In at least one embodiment, the camera types may have any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc. In at least one embodiment, the cameras may be able to use a rolling shutter, a global shutter, another type of shutter, or a combination thereof. In at least one embodiment, the color filter array may include a red clear clear clear (“RCCC”) color filter array, a red clear clear blue (“RCCB”) color filter array, a red blue green clear (“RBGC”) color filter array, a Foveon X3 color filter array, a Bayer sensor (“RGGB”) color filter array, a monochrome sensor color filter array, and / or other types of color filter arrays. In at least one embodiment, clear pixel cameras, such as cameras with RCCC, RCCB, and / or RBGC color filter arrays, may be used to improve light sensitivity.

[0152] In at least one embodiment, one or more cameras may be used to perform advanced driver assistance system (“ADAS”) functions (e.g., as part of a redundancy or fail-safe design). For example, in at least one embodiment, a multi-function mono camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlight control. In at least one embodiment, one or more cameras (e.g., all cameras) may simultaneously record and provide image data (e.g., video).

[0153] In at least one embodiment, one or more cameras may be mounted in a mounting assembly, such as a custom-designed (3D-printed) assembly, to block stray light and reflections from within the vehicle 1000 (e.g., reflections from the dashboard in the windshield), which may interfere with the camera's image data capture capabilities. Regarding the rearview mirror mounting assembly, in at least one embodiment, the rearview mirror assembly may be custom 3D-printed such that the camera mounting plate matches the shape of the rearview mirror. In at least one embodiment, one or more cameras may be integrated into the rearview mirror. In at least one embodiment, for side-view cameras, one or more cameras may also be integrated within the four pillars at each corner of the cabin.

[0154] In at least one embodiment, a camera (e.g., a forward-facing camera) having a field of view including a portion of the environment in front of the vehicle 1000 can be used for surround view and, with the assistance of one or more controllers 1036 and / or control SoCs, to help identify forward paths and obstacles, thereby providing information crucial for generating an occupancy grid and / or determining a preferred vehicle path. In at least one embodiment, the forward-facing camera can be used to perform many ADAS functions similar to LIDAR, including but not limited to emergency braking, pedestrian detection, and collision avoidance. In at least one embodiment, the forward-facing camera can also be used for ADAS functions and systems, including but not limited to lane departure warning (“LDW”), adaptive cruise control (“ACC”), and / or other functions (e.g., traffic sign recognition).

[0155] In at least one embodiment, various cameras can be used in a forward-facing configuration, including, for example, a monocular camera platform including a CMOS (“complementary metal-oxide-semiconductor”) color imager. In at least one embodiment, a wide-angle camera 1070 can be used to sense objects entering from the periphery (e.g., pedestrians, people crossing the street, or bicycles). Although only one wide-angle camera 1070 is shown in Figure 10B, in other embodiments the vehicle 1000 may have any number (including zero) of wide-angle cameras. In at least one embodiment, any number of long-range cameras 1098 (e.g., a pair of long-range stereo cameras) can be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. In at least one embodiment, the long-range camera 1098 can also be used for object detection and classification, as well as basic object tracking.

[0156] In at least one embodiment, any number of stereo cameras 1068 may also be included in a forward configuration. In at least one embodiment, one or more stereo cameras 1068 may include an integrated control unit comprising a scalable processing unit that may provide programmable logic (“FPGA”) and a multi-core microprocessor with a controller area network (“CAN”) or Ethernet interface integrated on a single chip. In at least one embodiment, such a unit may be used to generate a 3D map of the environment of the vehicle 1000, including distance estimates for all points in the image. In at least one embodiment, one or more stereo cameras 1068 may include, but are not limited to, a compact stereo vision sensor, which may include, but is not limited to, two camera lenses (one on the left and one on the right) and an image processing chip that can measure the distance from the vehicle 1000 to a target object and use the generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo cameras 1068 may also be used in addition to those described herein.
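
The distance measurement attributed to the stereo vision sensor can be illustrated with the standard pinhole stereo relation, depth = focal length × baseline / disparity. The Python sketch below is an assumption-laden illustration, not the sensor's actual processing: the function names, the focal length, the baseline, and the braking rule (stopping distance v²/(2a) plus a margin) are all hypothetical.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Distance to a point from its left/right pixel disparity.

    Uses the standard relation Z = f * B / d for a rectified stereo pair.
    A non-positive disparity is treated as "infinitely far".
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def should_emergency_brake(distance_m, speed_mps, max_decel_mps2=6.0, margin_m=2.0):
    """Hypothetical trigger: brake if the object lies within the stopping
    distance v**2 / (2 * a) plus a safety margin."""
    stopping_distance_m = speed_mps ** 2 / (2.0 * max_decel_mps2)
    return distance_m <= stopping_distance_m + margin_m


# Assumed camera parameters: 1000 px focal length, 0.25 m lens separation.
distance_m = depth_from_disparity(disparity_px=12.5, focal_length_px=1000.0, baseline_m=0.25)
brake_at_20_mps = should_emergency_brake(distance_m, speed_mps=20.0)
brake_at_10_mps = should_emergency_brake(distance_m, speed_mps=10.0)
```

Because disparity shrinks as distance grows, depth resolution degrades for far objects, which is one reason a wider baseline helps at long range.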

[0157] In at least one embodiment, a camera (e.g., a side-view camera) having a field of view including a portion of the environment on the sides of the vehicle 1000 can be used for surround view, thereby providing information for creating and updating the occupancy grid and for generating side collision warnings. For example, in at least one embodiment, surround cameras 1074 (e.g., the four surround cameras shown in Figure 10B) can be positioned on vehicle 1000. In at least one embodiment, one or more surround cameras 1074 can include, but are not limited to, any number and combination of wide-angle cameras, one or more fisheye lenses, one or more 360-degree cameras, and / or similar cameras. For example, in at least one embodiment, four fisheye lens cameras can be located at the front, rear, and sides of vehicle 1000. In at least one embodiment, vehicle 1000 can use three surround cameras 1074 (e.g., left, right, and rear) and can utilize one or more other cameras (e.g., forward-facing cameras) as a fourth surround-view camera.

[0158] In at least one embodiment, a camera (e.g., a rear-view camera) having a field of view including a portion of the environment behind the vehicle 1000 can be used for parking assistance, surround view, rear collision warning, and creating and updating the occupancy grid. In at least one embodiment, a wide variety of cameras can be used, including but not limited to cameras that are also suitable as one or more forward-facing cameras (e.g., long-range camera 1098 and / or one or more mid-range cameras 1076, one or more stereo cameras 1068, one or more infrared cameras 1072, etc.), as described herein.

[0159] The inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the system of Figure 10B for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0160] In at least one embodiment, the inference and / or training logic 2 can be used in the system of Figure 10B for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0161] Figure 10C is a block diagram showing an example system architecture for the autonomous vehicle 1000 of Figure 10A, according to at least one embodiment. In at least one embodiment, each of the one or more components, features, and systems of vehicle 1000 in Figure 10C is shown as connected via bus 1002. In at least one embodiment, bus 1002 may include, but is not limited to, a Controller Area Network (“CAN”) data interface (which may alternatively be referred to herein as “CAN bus”). In at least one embodiment, CAN may be a network within vehicle 1000 used to help control various features and functions of vehicle 1000, such as brake actuation, acceleration, braking, steering, windshield wipers, etc. In one embodiment, bus 1002 may be configured to have dozens or even hundreds of nodes, each node having its own unique identifier (e.g., CAN ID). In at least one embodiment, bus 1002 can be read to find steering wheel angle, ground speed, engine revolutions per minute (“RPM”), button position, and / or other vehicle status indicators. In at least one embodiment, bus 1002 may be an ASIL B compliant CAN bus.
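
Reading status indicators from a bus whose nodes are distinguished by identifier can be sketched as a lookup from CAN ID to a decoding rule. Everything specific in the Python example below is an assumption made for illustration: the identifiers, the signal names, the byte layouts, and the scale factors do not come from this document or from any real vehicle's message definitions.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    can_id: int   # identifier of the sending node / message
    data: bytes   # payload (classic CAN carries up to 8 bytes)


# Hypothetical signal table: CAN ID -> (signal name, struct format, scale factor).
SIGNALS = {
    0x025: ("steering_wheel_angle_deg", ">h", 0.1),   # signed 16-bit, big-endian
    0x0B4: ("ground_speed_kph", ">H", 0.01),          # unsigned 16-bit, big-endian
    0x1C4: ("engine_rpm", ">H", 1.0),
}


def read_status(frames):
    """Decode known frames into a {signal name: value} dict; skip unknown IDs."""
    status = {}
    for frame in frames:
        entry = SIGNALS.get(frame.can_id)
        if entry is None:
            continue
        name, fmt, scale = entry
        (raw,) = struct.unpack_from(fmt, frame.data)
        status[name] = raw * scale
    return status


frames = [
    CanFrame(0x025, struct.pack(">h", -150)),
    CanFrame(0x1C4, struct.pack(">H", 2400)),
    CanFrame(0x7FF, b"\x00"),  # unknown identifier, ignored
]
status = read_status(frames)
```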

[0162] In at least one embodiment, FlexRay and / or Ethernet protocols may be used in addition to or as an alternative to CAN. In at least one embodiment, there may be any number of buses forming bus 1002, which may include, but are not limited to, zero or more CAN buses, zero or more FlexRay buses, zero or more Ethernet buses, and / or zero or more other types of buses using other protocols. In at least one embodiment, two or more buses may be used to perform different functions and / or may be used for redundancy. For example, a first bus may be used for a collision avoidance function, and a second bus may be used for actuation control. In at least one embodiment, each of any number of system-on-chips (“SoCs”) 1004 (e.g., SoC 1004(A) and SoC 1004(B)), each of one or more controllers 1036, and / or each computer within the vehicle may access the same input data (e.g., input from sensors of the vehicle 1000) and may be connected to a common bus, such as a CAN bus.

[0163] In at least one embodiment, vehicle 1000 may include one or more controllers 1036, such as those described herein with respect to Figure 10A. In at least one embodiment, controller 1036 can be used for a variety of functions. In at least one embodiment, controller 1036 can be coupled to any of various other components and systems of vehicle 1000 and can be used to control vehicle 1000, artificial intelligence of vehicle 1000, infotainment and / or other functions of vehicle 1000.

[0164] In at least one embodiment, vehicle 1000 may include any number of SoCs 1004. In at least one embodiment, each of the SoCs 1004 may include, but is not limited to, one or more central processing units (“CPUs”) 1006, one or more graphics processing units (“GPUs”) 1008, one or more processors 1010, one or more caches 1012, one or more accelerators 1014, one or more data stores 1016, and / or other components and features not shown. In at least one embodiment, one or more SoCs 1004 may be used to control vehicle 1000 on various platforms and systems. For example, in at least one embodiment, one or more SoCs 1004 may be combined in a system (e.g., the system of vehicle 1000) with a high-definition (“HD”) map 1022, which may obtain map refreshes and / or updates via a network interface 1024 from one or more servers (not shown in Figure 10C).

[0165] In at least one embodiment, one or more CPUs 1006 may include CPU clusters or CPU complexes (which may alternatively be referred to herein as “CCPLEX”). In at least one embodiment, one or more CPUs 1006 may include multiple cores and / or a level two (“L2”) cache. For example, in at least one embodiment, one or more CPUs 1006 may include eight cores in an intercoupled multiprocessor configuration. In at least one embodiment, one or more CPUs 1006 may include four dual-core clusters, each cluster having a dedicated L2 cache (e.g., a 2MB L2 cache). In at least one embodiment, one or more CPUs 1006 (e.g., CCPLEX) may be configured to support simultaneous cluster operation, such that any combination of clusters of one or more CPUs 1006 can be active at any given time.

[0166] In at least one embodiment, one or more CPUs 1006 may implement power management functions, including but not limited to one or more of the following features: individual hardware blocks may be clock-gated automatically when idle to conserve dynamic power; each core clock may be gated when the core is not actively executing instructions due to execution of Wait for Interrupt (“WFI”) / Wait for Event (“WFE”) instructions; each core may be independently power-gated; each core cluster may be independently clock-gated when all of its cores are clock-gated or power-gated; and / or each core cluster may be independently power-gated when all of its cores are power-gated. In at least one embodiment, one or more CPUs 1006 may further implement an enhanced algorithm for managing power states, wherein allowed power states and expected wake-up times are specified, and the hardware / microcode determines the best power state to enter for the core, cluster, and CCPLEX. In at least one embodiment, the processing cores may support simplified power state entry sequences in software, with the work offloaded to microcode.
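
The paragraph above states only that allowed power states and an expected wake-up time are specified and that hardware / microcode picks the state to enter; it does not disclose the selection rule. The Python sketch below shows one plausible rule purely as an assumption: choose the lowest-power allowed state whose wake latency still fits within the expected wake-up time. The state names, power figures, and latencies are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    power_mw: float         # hypothetical steady-state power draw
    wake_latency_us: float  # hypothetical time needed to resume execution


# Hypothetical state table, ordered from shallowest to deepest.
STATES = [
    PowerState("active", 1000.0, 0.0),
    PowerState("clock_gated", 300.0, 1.0),
    PowerState("power_gated", 10.0, 100.0),
]


def choose_power_state(allowed, expected_wake_us):
    """Pick the lowest-power allowed state that can wake in time.

    Falls back to the shallowest state if no allowed state fits.
    """
    candidates = [
        s for s in STATES
        if s.name in allowed and s.wake_latency_us <= expected_wake_us
    ]
    if not candidates:
        return STATES[0]
    return min(candidates, key=lambda s: s.power_mw)


all_states = {"active", "clock_gated", "power_gated"}
short_idle = choose_power_state(all_states, expected_wake_us=50.0)
long_idle = choose_power_state(all_states, expected_wake_us=500.0)
```

Under these assumed numbers, a short idle period selects clock gating, while a long one justifies the slower but cheaper power-gated state.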

[0167] In at least one embodiment, one or more GPUs 1008 may include integrated GPUs (or "iGPUs" herein). In at least one embodiment, one or more GPUs 1008 may be programmable and efficient for parallel workloads. In one embodiment, one or more GPUs 1008 may use an enhanced tensor instruction set. In at least one embodiment, one or more GPUs 1008 may include one or more streaming microprocessors, wherein each streaming microprocessor may include a Level 1 ("L1") cache (e.g., an L1 cache with at least 96KB of storage capacity), and two or more streaming microprocessors may share an L2 cache (e.g., an L2 cache with 512KB of storage capacity). In at least one embodiment, one or more GPUs 1008 may include at least eight streaming microprocessors. In at least one embodiment, one or more GPUs 1008 may use a computation application programming interface (API). In at least one embodiment, one or more GPUs 1008 may use one or more parallel computing platforms and / or programming models (e.g., NVIDIA's CUDA model).

[0168] In at least one embodiment, one or more GPUs 1008 may be power-optimized for best performance in automotive and embedded use cases. For example, in one embodiment, one or more GPUs 1008 may be fabricated using fin field-effect transistor (“FinFET”) circuits. In at least one embodiment, each streaming multiprocessor may include multiple mixed-precision processing cores divided into multiple blocks. For example, but not limited to, 64 FP32 cores and 32 FP64 cores may be divided into four processing blocks. In at least one embodiment, each processing block may be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA Tensor cores for deep learning matrix arithmetic, a level-zero (“L0”) instruction cache, a warp scheduler, a dispatch unit, and / or a 64KB register file. In at least one embodiment, the streaming multiprocessor may include independent parallel integer and floating-point data paths to provide efficient execution of workloads that mix computation and addressing operations. In at least one embodiment, the streaming multiprocessor may include independent thread scheduling capabilities to enable finer-grained synchronization and collaboration between parallel threads. In at least one embodiment, the streaming multiprocessor may include a combined L1 data cache and shared memory unit to improve performance while simplifying programming.

[0169] In at least one embodiment, one or more GPUs 1008 may include high-bandwidth memory (“HBM”) and / or a 16GB HBM2 memory subsystem to provide, in some examples, a peak memory bandwidth of approximately 900 GB/s. In at least one embodiment, in addition to or instead of HBM memory, synchronous graphics random access memory (“SGRAM”) may be used, such as graphics double data rate type five synchronous random access memory (“GDDR5”).

[0170] In at least one embodiment, one or more GPUs 1008 may include unified memory technology. In at least one embodiment, address translation service (“ATS”) support may be used to allow one or more GPUs 1008 to directly access the page tables of one or more CPUs 1006. In at least one embodiment, when a memory management unit (“MMU”) of one or more GPUs 1008 experiences a miss, an address translation request may be sent to one or more CPUs 1006. In response, in at least one embodiment, a CPU of the one or more CPUs 1006 may look up the virtual-to-physical mapping of the address in its page tables and transmit the translation back to one or more GPUs 1008. In at least one embodiment, unified memory technology may allow a single unified virtual address space to be used for the memory of both one or more CPUs 1006 and one or more GPUs 1008, thereby simplifying the programming of one or more GPUs 1008 and the porting of applications to one or more GPUs 1008.
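The miss-and-translate flow described above can be pictured with a toy model. The following Python sketch is illustrative only: the class names, the 4096-byte page size, and the dictionary-based page table are assumptions made for the example and are not taken from the described embodiments.

```python
PAGE_SIZE = 4096  # assumed page size; the text does not specify one


class CpuPageTable:
    """Toy CPU page table: virtual page number -> physical frame number."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def translate(self, vpn):
        # Services an address translation request sent by the GPU MMU.
        return self.mapping[vpn]


class GpuMmu:
    """Toy GPU MMU that caches translations and asks the CPU on a miss."""

    def __init__(self, cpu_page_table):
        self.cpu_page_table = cpu_page_table
        self.cached = {}   # translations already returned by the CPU
        self.misses = 0

    def to_physical(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.cached:
            # Miss: request the mapping from the CPU page tables.
            self.misses += 1
            self.cached[vpn] = self.cpu_page_table.translate(vpn)
        return self.cached[vpn] * PAGE_SIZE + offset


mmu = GpuMmu(CpuPageTable({0: 7, 1: 3}))
first = mmu.to_physical(4100)   # first touch of virtual page 1: miss
second = mmu.to_physical(4200)  # same page again: served from the cache
```

Because both processors resolve addresses through the same page tables, a pointer has the same meaning on either side, which is the simplification the unified virtual address space is meant to provide.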

[0171] In at least one embodiment, one or more GPUs 1008 may include any number of access counters that can track how frequently one or more GPUs 1008 access the memory of other processors. In at least one embodiment, one or more access counters can help ensure that memory pages are moved to the physical memory of the processor that accesses the pages most frequently, thereby improving the efficiency of memory ranges shared between processors.
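A minimal sketch of counter-driven page placement follows, assuming a simple policy in which each page is moved to whichever processor has accessed it most often. The class and function names are hypothetical, and real hardware counters and migration policies are likely more involved.

```python
from collections import Counter, defaultdict


class AccessCounters:
    """Toy per-page access counters keyed by processor name."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # page -> Counter(processor -> accesses)

    def record(self, page, processor):
        self.counts[page][processor] += 1

    def preferred_owner(self, page):
        # Processor with the highest access count for this page.
        return self.counts[page].most_common(1)[0][0]


def migrate(placement, counters):
    """Return a new placement with each tracked page moved to its most frequent accessor."""
    return {
        page: counters.preferred_owner(page) if page in counters.counts else owner
        for page, owner in placement.items()
    }


counters = AccessCounters()
for _ in range(3):
    counters.record(0, "gpu")
counters.record(0, "cpu")
new_placement = migrate({0: "cpu", 1: "cpu"}, counters)
```

Here page 0 moves to the GPU because the GPU touched it three times against one CPU access, while page 1 has no recorded accesses and stays where it is.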

[0172] In at least one embodiment, one or more SoCs 1004 may include any number of caches 1012, including those described herein. For example, in at least one embodiment, one or more caches 1012 may include a Level 3 (“L3”) cache available to both one or more CPUs 1006 and one or more GPUs 1008 (e.g., connected to both CPUs 1006 and GPUs 1008). In at least one embodiment, one or more caches 1012 may include a write-back cache that can, for example, track the state of a line using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). In at least one embodiment, the L3 cache may include 4 MB of memory or more, depending on the embodiment, although smaller cache sizes may be used.

[0173] In at least one embodiment, one or more SoCs 1004 may include one or more accelerators 1014 (e.g., hardware accelerators, software accelerators, or combinations thereof). In at least one embodiment, one or more SoCs 1004 may include a hardware acceleration cluster, which may include optimized hardware accelerators and / or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4MB of SRAM) enables the hardware acceleration cluster to accelerate neural networks and other computations. In at least one embodiment, the hardware acceleration cluster may be used to supplement one or more GPUs 1008 and offload some tasks from one or more GPUs 1008 (e.g., freeing up more cycles of one or more GPUs 1008 to perform other tasks). In at least one embodiment, one or more accelerators 1014 may be used for target workloads that are stable enough to be amenable to acceleration (e.g., perception, convolutional neural networks (“CNN”), recurrent neural networks (“RNN”), etc.). In at least one embodiment, the CNN may include region-based or regional convolutional neural networks (“RCNN”) and Fast RCNN (e.g., for object detection) or other types of CNNs.

[0174] In at least one embodiment, one or more accelerators 1014 (e.g., a hardware acceleration cluster) may include one or more deep learning accelerators (“DLAs”). In at least one embodiment, one or more DLAs may include, but are not limited to, one or more Tensor Processing Units (“TPUs”), which may be configured to provide an additional 10 trillion operations per second for deep learning applications and inference. In at least one embodiment, a TPU may be an accelerator configured and optimized for performing image processing functions (e.g., for CNNs, RCNNs, etc.). In at least one embodiment, one or more DLAs may be further optimized for a specific set of neural network types and floating-point operations, as well as for inference. In at least one embodiment, one or more DLAs are designed to provide higher performance per millimeter than typical general-purpose GPUs and typically significantly outperform CPUs. In at least one embodiment, one or more TPUs may perform several functions, including single-instance convolution functions supporting, for example, INT8, INT16, and FP16 data types for features and weights, as well as post-processor functions. In at least one embodiment, one or more DLAs can execute neural networks, particularly CNNs, quickly and efficiently on processed or unprocessed data for any of a variety of functions, including, but not limited to: CNNs for object recognition and detection using data from camera sensors; CNNs for distance estimation using data from camera sensors; CNNs for emergency vehicle detection, recognition, and identification using data from microphones; CNNs for face recognition and vehicle owner recognition using data from camera sensors; and / or CNNs for security and / or safety-related events.

[0175] In at least one embodiment, the DLA can perform any function of one or more GPUs 1008, and by using inference accelerators, for example, the designer can target one or more DLAs or one or more GPUs 1008 for any function. For example, in at least one embodiment, the designer can concentrate the CNN processing and floating-point operations on one or more DLAs, leaving other functions to one or more GPUs 1008 and / or one or more accelerators 1014.

[0176] In at least one embodiment, one or more accelerators 1014 may include programmable vision accelerators (“PVAs”), which may alternatively be referred to herein as computer vision accelerators. In at least one embodiment, one or more PVAs may be designed and configured to accelerate computer vision algorithms for advanced driver assistance systems (“ADAS”) 1038, autonomous driving, augmented reality (“AR”) applications, and / or virtual reality (“VR”) applications. In at least one embodiment, one or more PVAs may strike a balance between performance and flexibility. For example, in at least one embodiment, each of one or more PVAs may include, for example, but not limited to, any number of reduced instruction set computer (“RISC”) cores, direct memory access (“DMA”), and / or any number of vector processors.

[0177] In at least one embodiment, the RISC core can interact with an image sensor (e.g., the image sensor of any camera described herein), an image signal processor, etc. In at least one embodiment, each RISC core may include any number of memories. In at least one embodiment, the RISC core may use any of a variety of protocols, depending on the embodiment. In at least one embodiment, the RISC core may execute a real-time operating system (“RTOS”). In at least one embodiment, the RISC core may be implemented using one or more integrated circuit devices, application-specific integrated circuits (“ASICs”), and / or storage devices. For example, in at least one embodiment, the RISC core may include an instruction cache and / or tightly coupled RAM.

[0178] In at least one embodiment, DMA enables components of the PVA to access system memory independently of one or more CPUs 1006. In at least one embodiment, DMA can support any number of features for providing optimization to the PVA, including but not limited to, support for multidimensional addressing and / or circular addressing. In at least one embodiment, DMA can support up to six or more addressing dimensions, which may include, but are not limited to, block width, block height, block depth, horizontal block step, vertical block step, and / or depth step.
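As an illustration of multidimensional and circular addressing, the Python sketch below generates the linear addresses a DMA transfer might touch. It takes one possible reading of the six parameters, treating the three step values as address strides along the horizontal, vertical, and depth axes; that reading, the function names, and the element-granular addressing are assumptions, not details from the described embodiments.

```python
def dma_addresses(base, width, height, depth, h_step, v_step, d_step):
    """Yield linear addresses for a width x height x depth block.

    h_step, v_step and d_step are assumed to be the address increments
    along the horizontal, vertical and depth axes respectively.
    """
    for d in range(depth):
        for y in range(height):
            for x in range(width):
                yield base + d * d_step + y * v_step + x * h_step


def circular(addresses, buf_start, buf_len):
    """Wrap addresses into a ring buffer [buf_start, buf_start + buf_len)."""
    for addr in addresses:
        yield buf_start + (addr - buf_start) % buf_len


# A 2x2 tile taken from a row-major image whose rows are 8 elements apart.
tile = list(dma_addresses(100, 2, 2, 1, 1, 8, 64))
# Addresses that run past the end of a 6-element ring wrap to its start.
wrapped = list(circular([14, 15, 16, 17], buf_start=10, buf_len=6))
```

Describing a transfer by extents and strides lets a single descriptor move a tile or a volume without the CPU computing each address.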

[0179] In at least one embodiment, the vector processor may be a programmable processor designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In at least one embodiment, the PVA may include a PVA core and two vector processing subsystem partitions. In at least one embodiment, the PVA core may include a processor subsystem, a DMA engine (e.g., two DMA engines), and / or other peripherals. In at least one embodiment, the vector processing subsystem may serve as the main processing engine of the PVA and may include a vector processing unit (“VPU”), an instruction cache, and / or a vector memory (e.g., “VMEM”). In at least one embodiment, the VPU core may include a digital signal processor, such as a Single Instruction Multiple Data (“SIMD”) or Very Long Instruction Word (“VLIW”) digital signal processor. In at least one embodiment, the combination of SIMD and VLIW can improve throughput and speed.

[0180] In at least one embodiment, each vector processor may include an instruction cache and may be coupled to dedicated memory. As a result, in at least one embodiment, each vector processor may be configured to execute independently of other vector processors. In at least one embodiment, the vector processors included in a particular PVA may be configured to employ data parallelism. For example, in at least one embodiment, multiple vector processors included in a single PVA may execute the same computer vision algorithm, but on different regions of an image. In at least one embodiment, vector processors included in a particular PVA may execute different computer vision algorithms simultaneously on the same image, or even execute different algorithms on a sequence of images or portions of an image. In at least one embodiment, any number of PVAs may be included in the hardware acceleration cluster, and any number of vector processors may be included in each PVA. In at least one embodiment, the PVA may include additional error-correcting code (“ECC”) memory to enhance overall system safety.

[0181] In at least one embodiment, one or more accelerators 1014 may include an on-chip computer vision network and static random access memory (“SRAM”) for providing high-bandwidth, low-latency SRAM to one or more accelerators 1014. In at least one embodiment, the on-chip memory may include at least 4 MB of SRAM, comprising, for example, but not limited to, eight field-configurable memory blocks accessible to both the PVA and DLA. In at least one embodiment, each pair of memory blocks may include an advanced peripheral bus (“APB”) interface, configuration circuitry, a controller, and a multiplexer. In at least one embodiment, any type of memory may be used. In at least one embodiment, the PVA and DLA may access the memory via a backbone providing high-speed access to the memory for both the PVA and DLA. In at least one embodiment, the backbone may include an on-chip computer vision network that interconnects the PVA and DLA to the memory (e.g., using an APB).

[0182] In at least one embodiment, the on-chip computer vision network may include an interface that determines, before transmitting any control signals / addresses / data, that both the PVA and DLA provide ready and valid signals. In at least one embodiment, the interface may provide separate phases and separate channels for transmitting control signals / addresses / data, as well as burst-type communication for continuous data transmission. In at least one embodiment, the interface may conform to the International Organization for Standardization (“ISO”) 26262 or the International Electrotechnical Commission (“IEC”) 61508 standard, although other standards and protocols may be used.

[0183] In at least one embodiment, one or more SoCs 1004 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, the real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine the location and extent of an object (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and / or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison with LIDAR data for localization and / or other functions, and / or for other purposes.

[0184] In at least one embodiment, one or more accelerators 1014 have a wide range of uses in autonomous driving. In at least one embodiment, the PVA can be used in key processing stages in ADAS and autonomous vehicles. In at least one embodiment, the capabilities of the PVA are well matched to algorithmic domains requiring predictable processing at low power and low latency. In other words, the PVA performs well on semi-dense or dense regular computation, even on small datasets, which may require predictable runtimes with low latency and low power consumption. In at least one embodiment, such as in vehicle 1000, the PVA may be designed to run classic computer vision algorithms, as these can be efficient at object detection and at operating on integer math.

[0185] For example, according to at least one embodiment of the technology, the PVA is used to perform computer stereo vision. In at least one embodiment, a semi-global matching-based algorithm may be used in some examples, although this is not intended to be limiting. In at least one embodiment, applications for Level 3-5 autonomous driving use motion estimation / stereo matching on the fly (e.g., structure from motion, pedestrian recognition, lane detection, etc.). In at least one embodiment, the PVA can perform computer stereo vision functions on input from two monocular cameras.

[0186] In at least one embodiment, the PVA can be used to perform dense optical flow. For example, in at least one embodiment, the PVA can process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In at least one embodiment, the PVA is used for time-of-flight depth processing, for example, by processing raw time-of-flight data to provide processed time-of-flight data.

[0187] In at least one embodiment, the DLA can be used to run any type of network to enhance control and driving safety, including, but not limited to, a neural network that outputs a confidence score for each object detection. In at least one embodiment, the confidence score can be represented or interpreted as a probability, or as providing a relative “weight” for each detection compared with other detections. In at least one embodiment, the confidence score enables the system to make further decisions about which detections should be considered true positives rather than false positives. In at least one embodiment, the system can set a threshold for the confidence score and only consider detections exceeding the threshold as true positives. In embodiments using an Automatic Emergency Braking (“AEB”) system, a false positive detection would cause the vehicle to automatically perform emergency braking, which is obviously undesirable. In at least one embodiment, only highly confident detections may be considered as triggers for AEB. In at least one embodiment, the DLA can run a neural network for regressing the confidence value. In at least one embodiment, the neural network may take as its input at least some subset of parameters, such as bounding box dimensions, a ground plane estimate obtained (e.g., from another subsystem), output from one or more IMU sensors 1066 that correlates with the orientation of vehicle 1000, distance, and 3D position estimates of the object obtained from the neural network and / or other sensors (e.g., one or more LiDAR sensors 1064 or one or more RADAR sensors 1060), among others.
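The thresholding step can be sketched as follows. This is a minimal Python illustration assuming confidence scores in the range 0 to 1; the `Detection` type, the function names, and the threshold values are hypothetical and are not specified in the described embodiments.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float  # assumed to lie in [0, 1]


def true_positives(detections, threshold):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.confidence > threshold]


def should_trigger_aeb(detections, aeb_threshold=0.9):
    """Trigger braking only for highly confident detections (assumed threshold)."""
    return any(d.confidence > aeb_threshold for d in detections)


detections = [
    Detection("pedestrian", 0.97),
    Detection("shadow", 0.35),
    Detection("vehicle", 0.80),
]
kept = true_positives(detections, threshold=0.5)
brake = should_trigger_aeb(detections)
```

Using a stricter threshold for braking than for general perception reflects the point above that a false positive is far more costly when it causes an emergency stop.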

[0188] In at least one embodiment, one or more SoCs 1004 may include one or more data storage devices 1016 (e.g., memory). In at least one embodiment, one or more data storage devices 1016 may be on-chip memory of one or more SoCs 1004, which may store neural networks to be executed on one or more GPUs 1008 and / or DLAs. In at least one embodiment, one or more data storage devices 1016 may have a sufficiently large capacity to store multiple instances of a neural network for redundancy and safety. In at least one embodiment, one or more data storage devices 1016 may include L2 or L3 caches.

[0189] In at least one embodiment, one or more SoCs 1004 may include any number of processors 1010 (e.g., embedded processors). In at least one embodiment, one or more processors 1010 may include a boot and power management processor, which may be a dedicated processor and subsystem for handling boot power and management functions, as well as associated security enforcement. In at least one embodiment, the boot and power management processor may be part of a boot sequence of one or more SoCs 1004 and may provide runtime power management services. In at least one embodiment, the boot and power management processor may provide clock and voltage programming, assistance with system low-power state transitions, management of thermal and temperature sensors of one or more SoCs 1004, and / or management of power states of one or more SoCs 1004. In at least one embodiment, each temperature sensor may be implemented as a ring oscillator whose output frequency is proportional to temperature, and one or more SoCs 1004 may use the ring oscillators to detect the temperature of one or more CPUs 1006, one or more GPUs 1008, and / or one or more accelerators 1014. In at least one embodiment, if it is determined that the temperature exceeds a threshold, the boot and power management processor may enter a temperature fault routine and place one or more SoCs 1004 into a lower power state and / or place the vehicle 1000 into a chauffeur-to-safe-stop mode (e.g., bring the vehicle 1000 to a safe stop).
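A simple way to picture the temperature path is a sensor whose frequency scales linearly with temperature, followed by a threshold check. The Python sketch below assumes a strictly proportional calibration constant in hertz per kelvin and an arbitrary fault threshold; both values, and all names, are hypothetical.

```python
def temperature_from_frequency(freq_hz, hz_per_kelvin):
    """Convert a ring-oscillator frequency to kelvin, assuming f = k * T."""
    return freq_hz / hz_per_kelvin


def thermal_action(temperature_k, threshold_k):
    """Return the action a power management routine might take."""
    if temperature_k > threshold_k:
        return "enter_fault_routine"  # e.g., lower power state and / or safe stop
    return "normal"


HZ_PER_KELVIN = 1e6   # assumed calibration constant
THRESHOLD_K = 380.0   # assumed fault threshold

cool = temperature_from_frequency(3.5e8, HZ_PER_KELVIN)
hot = temperature_from_frequency(3.9e8, HZ_PER_KELVIN)
actions = [thermal_action(t, THRESHOLD_K) for t in (cool, hot)]
```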

[0190] In at least one embodiment, one or more processors 1010 may further include a set of embedded processors that can serve as an audio processing engine. The audio processing engine may be an audio subsystem capable of providing full hardware support for multi-channel audio through multiple interfaces and a wide and flexible range of audio I / O interfaces. In at least one embodiment, the audio processing engine is a dedicated processor core with a digital signal processor having dedicated RAM.

[0191] In at least one embodiment, one or more processors 1010 may also include an always-on processor engine that can provide the necessary hardware features to support low-power sensor management and wake-up use cases. In at least one embodiment, the processor on the always-on processor engine may include, but is not limited to, a processor core, tightly coupled RAM, peripheral support (e.g., timers and interrupt controllers), various I / O controller peripherals, and routing logic.

[0192] In at least one embodiment, one or more processors 1010 may further include a safety cluster engine, which includes, but is not limited to, a dedicated processor subsystem for handling safety management of automotive applications. In at least one embodiment, the safety cluster engine may include, but is not limited to, two or more processor cores, tightly coupled RAM, supporting peripherals (e.g., timers, interrupt controllers, etc.) and / or routing logic. In safety mode, in at least one embodiment, the two or more cores may operate in lockstep mode and may function as a single core with comparison logic for detecting any differences between their operations. In at least one embodiment, one or more processors 1010 may further include a real-time camera engine, which may include, but is not limited to, a dedicated processor subsystem for handling real-time camera management. In at least one embodiment, one or more processors 1010 may further include a high dynamic range signal processor, which may include, but is not limited to, an image signal processor, which is a hardware engine that is part of the camera processing pipeline.
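Lockstep operation with comparison logic can be illustrated in software, although in practice it is a hardware mechanism. In the Python sketch below, two callables stand in for the redundant cores and an exception stands in for the fault signal; all names are hypothetical.

```python
class LockstepMismatch(Exception):
    """Raised when the two redundant cores disagree."""


def run_lockstep(core_a, core_b, inputs):
    """Run both cores on each input and compare their results step by step."""
    outputs = []
    for step, value in enumerate(inputs):
        a, b = core_a(value), core_b(value)
        if a != b:
            # Comparison logic detected a difference between the cores.
            raise LockstepMismatch(f"step {step}: {a!r} != {b!r}")
        outputs.append(a)  # the pair behaves as a single core
    return outputs


healthy = run_lockstep(lambda x: x * 2, lambda x: x + x, [1, 2, 3])
```

If one core produced a different result at any step, for example because of a transient fault, `run_lockstep` would raise `LockstepMismatch` instead of returning a value.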

[0193] In at least one embodiment, one or more processors 1010 may include a video image compositor, which may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions required by a video playback application to generate the final image for the player window. In at least one embodiment, the video image compositor may perform lens distortion correction on one or more wide-angle cameras 1070, one or more surround cameras 1074, and / or one or more cabin monitoring camera sensors. In at least one embodiment, the cabin monitoring camera sensors are preferably monitored by a neural network running on another instance of SoC 1004, the neural network being configured to recognize in-cabin events and respond accordingly. In at least one embodiment, the in-cabin system may perform, but is not limited to, lip reading to activate cellular service and place phone calls, dictate emails, change the vehicle's destination, activate or change the vehicle's infotainment system and settings, or provide voice-activated web browsing. In at least one embodiment, certain functions are available to the driver when the vehicle is operating in autonomous mode, and are otherwise disabled.

[0194] In at least one embodiment, the video image compositor may include enhanced temporal noise reduction for both spatial and temporal noise reduction. For example, in at least one embodiment, when motion occurs in the video, noise reduction weights spatial information appropriately, reducing the weight of information provided by adjacent frames. In at least one embodiment, when the image or a portion of the image does not contain motion, temporal noise reduction performed by the video image compositor may use information from previous images to reduce noise in the current image.
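One way to express motion-dependent weighting is to blend each pixel with the previous frame only where the two frames agree. The Python sketch below uses the absolute frame difference as a stand-in for motion and a linear falloff for the temporal weight; this particular blend, its parameters, and the function name are assumptions for illustration, not the described compositor's algorithm.

```python
def temporal_denoise(current, previous, motion_scale=32.0, max_temporal_weight=0.5):
    """Blend each pixel with the previous frame, trusting it less where motion is large.

    current and previous are equal-length sequences of pixel intensities.
    """
    out = []
    for cur, prev in zip(current, previous):
        motion = abs(cur - prev)  # crude per-pixel motion estimate
        # Weight of the previous frame falls to zero as motion grows.
        w = max_temporal_weight * max(0.0, 1.0 - motion / motion_scale)
        out.append((1.0 - w) * cur + w * prev)
    return out


# Pixel 0 is static and slightly noisy; pixel 1 changed sharply between frames.
denoised = temporal_denoise([104, 200], [100, 0])
```

The static pixel is pulled toward its previous value, while the pixel that changed sharply keeps its current value, so moving content is not smeared across frames.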

[0195] In at least one embodiment, the video image compositor can also be configured to perform stereo rectification on input stereo lens frames. In at least one embodiment, the video image compositor can also be used for user interface composition when an operating system desktop is in use, so that one or more GPUs 1008 are not required to continuously render new surfaces. In at least one embodiment, when one or more GPUs 1008 are powered on and actively performing 3D rendering, the video image compositor can be used to offload one or more GPUs 1008 to improve performance and responsiveness.

[0196] In at least one embodiment, one or more SoCs 1004 may also include a Mobile Industry Processor Interface (“MIPI”) camera serial interface for receiving video and input from cameras, a high-speed interface, and / or a video input block that can be used for camera and related pixel input functions. In at least one embodiment, one or more SoCs 1004 may also include an input / output controller that can be controlled by software and can be used to receive I / O signals not assigned to a specific role.

[0197] In at least one embodiment, one or more SoCs 1004 may also include a broad range of peripheral interfaces to enable communication with peripheral devices, audio encoders / decoders (“codecs”), power management and / or other devices. In at least one embodiment, one or more SoCs 1004 may be used to process data from cameras (e.g., connected via Gigabit Multimedia Serial Link and Ethernet channels), sensors (e.g., one or more LiDAR sensors 1064, one or more RADAR sensors 1060, etc., which may be connected via Ethernet channels), data from bus 1002 (e.g., vehicle 1000 speed, steering wheel position, etc.), data from one or more GNSS sensors 1058 (e.g., connected via an Ethernet bus or a CAN bus), etc. In at least one embodiment, one or more SoCs 1004 may also include a dedicated high-performance mass storage controller, which may include its own DMA engine and may be used to free one or more CPUs 1006 from routine data management tasks.

[0198] In at least one embodiment, one or more SoCs 1004 can be an end-to-end platform with a flexible architecture spanning automation levels 3-5, providing a comprehensive functional safety architecture that leverages and effectively utilizes computer vision and ADAS technologies to achieve diversity and redundancy. This provides a platform offering a flexible and reliable driving software stack as well as deep learning tools. In at least one embodiment, one or more SoCs 1004 can be faster, more reliable, and even more energy and space efficient than conventional systems. For example, in at least one embodiment, one or more accelerators 1014, when combined with one or more CPUs 1006, one or more GPUs 1008, and one or more data storage devices 1016, can provide a fast and efficient platform for Level 3-5 autonomous vehicles.

[0199] In at least one embodiment, the computer vision algorithm can be executed on a CPU, which can be configured using a high-level programming language (e.g., C) to execute multiple processing algorithms on a variety of visual data. However, in at least one embodiment, the CPU typically cannot meet the performance requirements of many computer vision applications, such as performance requirements related to execution time and power consumption. In at least one embodiment, many CPUs cannot execute complex object detection algorithms in real time, which are used in automotive ADAS applications and practical Level 3-5 autonomous vehicles.

[0200] The embodiments described herein allow multiple neural networks to be executed simultaneously and / or sequentially, and allow the results to be combined to achieve Level 3-5 autonomous driving capabilities. For example, in at least one embodiment, a CNN executed on a DLA or discrete GPU (e.g., one or more GPUs 1020) may include text and word recognition, thereby allowing a supercomputer to read and understand traffic signs, including signs for which the neural network has not been specifically trained. In at least one embodiment, the DLA may also include a neural network capable of recognizing, interpreting, and providing semantic understanding of signs, and passing this semantic understanding to a path planning module running on a CPU Complex.

[0201] In at least one embodiment, for Level 3, 4, or 5 driving, multiple neural networks can run simultaneously. For example, in at least one embodiment, a warning sign reading “Caution: flashing lights indicate icy conditions,” together with an electric light, can be interpreted independently or jointly by multiple neural networks. In at least one embodiment, the sign itself can be recognized as a traffic sign by a first deployed neural network (e.g., a trained neural network), and the text “flashing lights indicate icy conditions” can be interpreted by a second deployed neural network, which informs the vehicle’s path planning software (preferably executed on the CPU Complex) that icy conditions exist when flashing lights are detected. In at least one embodiment, flashing lights can be identified by operating a third deployed neural network over multiple frames, informing the vehicle’s path planning software of the presence (or absence) of flashing lights. In at least one embodiment, all three neural networks can run simultaneously, for example within the DLA and / or on one or more GPUs 1008.

[0202] In at least one embodiment, the CNN for facial recognition and vehicle owner identification can use data from camera sensors to identify the presence of an authorized driver and / or the owner of vehicle 1000. In at least one embodiment, an always-on sensor processing engine can be used to unlock the vehicle and turn on the lights when the owner approaches the driver's door, and, in security mode, to disable the vehicle when the owner leaves it. In this way, one or more SoCs 1004 provide protection against theft and / or carjacking.

[0203] In at least one embodiment, the CNN for emergency vehicle detection and identification can use data from microphone 1096 to detect and identify emergency vehicle sirens. In at least one embodiment, one or more SoCs 1004 use the CNN to classify environmental and urban sounds, as well as visual data. In at least one embodiment, the CNN running on the DLA is trained to identify the relative approach speed of emergency vehicles (e.g., by using the Doppler effect). In at least one embodiment, the CNN can also be trained to identify emergency vehicles specific to the local area in which the vehicle is operating, as identified by one or more GNSS sensors 1058. In at least one embodiment, when operating in Europe, the CNN will seek to detect European sirens, while in North America, the CNN will seek to identify only North American sirens. In at least one embodiment, once an emergency vehicle is detected, a control program can be used, with the assistance of one or more ultrasonic sensors 1062, to execute an emergency vehicle safety routine, slowing the vehicle, pulling over to the side of the road, parking the vehicle, and / or idling the vehicle until the emergency vehicle passes.
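The text describes a CNN trained to estimate closing speed; the underlying physical relationship can be written out directly. The Python sketch below applies the classical Doppler formula for a source approaching a stationary listener, f_obs = f_src * c / (c - v). The known siren frequency, the stationary listener, and the speed of sound of 343 m/s are simplifying assumptions, and the function name is hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at roughly 20 degrees Celsius


def approach_speed(source_hz, observed_hz, c=SPEED_OF_SOUND_M_S):
    """Speed of a source moving toward a stationary listener.

    Rearranges f_obs = f_src * c / (c - v) to v = c * (1 - f_src / f_obs).
    A negative result means the source is moving away.
    """
    return c * (1.0 - source_hz / observed_hz)


# A 700 Hz siren heard at a higher pitch is approaching.
observed = 700.0 * SPEED_OF_SOUND_M_S / (SPEED_OF_SOUND_M_S - 20.0)
speed = approach_speed(700.0, observed)
```

A learned model does not need the source frequency in advance, which is one reason a trained network may be preferred over this closed-form estimate.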

[0204] In at least one embodiment, vehicle 1000 may include one or more CPUs 1018 (e.g., one or more discrete CPUs, or dCPUs) that may be coupled to one or more SoCs 1004 via high-speed interconnects (e.g., PCIe). In at least one embodiment, one or more CPUs 1018 may include x86 processors. For example, one or more CPUs 1018 may be used to perform any of a variety of functions, such as arbitrating potentially inconsistent results between ADAS sensors and one or more SoCs 1004, and / or monitoring the status and health of one or more controllers 1036 and / or an infotainment system on chip (“infotainment SoC”) 1030.

[0205] In at least one embodiment, vehicle 1000 may include one or more GPUs 1020 (e.g., one or more discrete GPUs or one or more dGPUs) coupled to one or more SoCs 1004 via high-speed interconnects (e.g., NVIDIA's NVLINK channels). In at least one embodiment, one or more GPUs 1020 may provide additional artificial intelligence capabilities, such as by executing redundant and / or different neural networks, and may be used to train and / or update the neural networks based at least in part on inputs from sensors of vehicle 1000 (e.g., sensor data).

[0206] In at least one embodiment, vehicle 1000 may also include a network interface 1024, which may include, but is not limited to, one or more wireless antennas 1026 (e.g., one or more wireless antennas for different communication protocols, such as cellular antennas, Bluetooth antennas, etc.). In at least one embodiment, network interface 1024 may be used to enable wireless connectivity to Internet cloud services (e.g., using servers and / or other network devices), to other vehicles, and / or to computing devices (e.g., passenger client devices). In at least one embodiment, for communication with other vehicles, a direct link and / or an indirect link (e.g., via a network and the Internet) may be established between vehicle 1000 and another vehicle. In at least one embodiment, a vehicle-to-vehicle communication link may be used to provide a direct link. In at least one embodiment, the vehicle-to-vehicle communication link may provide vehicle 1000 with information about vehicles near vehicle 1000 (e.g., vehicles in front of, to the side of, and / or behind vehicle 1000). In at least one embodiment, the foregoing functionality may be part of a cooperative adaptive cruise control function of vehicle 1000.

[0207] In at least one embodiment, network interface 1024 may include a System-on-Chip (SoC) that provides modulation and demodulation functions and enables one or more controllers 1036 to communicate over a wireless network. In at least one embodiment, network interface 1024 may include a radio frequency (RF) front-end for up-conversion from baseband to RF and down-conversion from RF to baseband. In at least one embodiment, frequency conversion may be performed in any technically feasible manner. For example, frequency conversion may be performed using known processes and / or using a superheterodyne process. In at least one embodiment, the RF front-end functionality may be provided by a separate chip. In at least one embodiment, the network interface may include wireless functions for communication over LTE, WCDMA, UMTS, GSM, CDMA2000, Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and / or other wireless protocols.

[0208] In at least one embodiment, the vehicle 1000 may also include one or more data storage units 1028, which may include, but are not limited to, off-chip storage (e.g., storage external to one or more SoCs 1004). In at least one embodiment, the one or more data storage units 1028 may include, but are not limited to, one or more storage elements, including RAM, SRAM, dynamic random access memory (“DRAM”), video random access memory (“VRAM”), flash memory, hard disks, and / or other components and / or devices capable of storing at least one bit of data.

[0209] In at least one embodiment, the vehicle 1000 may also include one or more GNSS sensors 1058 (e.g., GPS and / or assisted GPS sensors) to assist in mapping, perception, occupancy grid generation, and / or path planning functions. In at least one embodiment, any number of GNSS sensors 1058 may be used, including, for example and without limitation, a GPS sensor using a USB connector with an Ethernet-to-serial (e.g., RS-232) bridge.

[0210] In at least one embodiment, vehicle 1000 may also include one or more RADAR sensors 1060. In at least one embodiment, one or more RADAR sensors 1060 may be used by vehicle 1000 for remote vehicle detection, even in dark and / or inclement weather conditions. In at least one embodiment, the RADAR functional safety level may be ASIL B. In at least one embodiment, one or more RADAR sensors 1060 may use a CAN bus and / or bus 1002 (e.g., to transmit data generated by one or more RADAR sensors 1060) for control and access to object tracking data, and in some examples may access an Ethernet channel to access raw data. In at least one embodiment, a wide variety of RADAR sensor types may be used. For example, but not limited to, one or more of the RADAR sensors 1060 may be suitable for front, rear, and side RADAR use. In at least one embodiment, one or more RADAR sensors 1060 are pulse Doppler RADAR sensors.

[0211] In at least one embodiment, one or more RADAR sensors 1060 may include different configurations, such as long-range with a narrow field of view, short-range with a wide field of view, short-range side coverage, etc. In at least one embodiment, long-range RADAR can be used for adaptive cruise control functions. In at least one embodiment, a long-range RADAR system can provide a wide field of view achieved through two or more independent scans (e.g., within a 250 m range). In at least one embodiment, one or more RADAR sensors 1060 can help distinguish between stationary and moving objects and can be used by the ADAS system 1038 for emergency braking assistance and forward collision warning. In at least one embodiment, one or more sensors 1060 included in the long-range RADAR system may include, but are not limited to, a monostatic multimode RADAR with multiple (e.g., six or more) fixed RADAR antennas and high-speed CAN and FlexRay interfaces. In at least one embodiment with six antennas, the four central antennas can create a focused beam pattern designed to record the surroundings of vehicle 1000 at higher speeds with minimal interference from traffic in adjacent lanes. In at least one embodiment, the other two antennas can expand the field of view, making it possible to quickly detect vehicles entering or leaving the lane of vehicle 1000.

[0212] In at least one embodiment, as an example, a mid-range RADAR system may include, for example, a range of up to 160m (front) or 80m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). In at least one embodiment, a short-range RADAR system may include, but is not limited to, any number of RADAR sensors 1060 designed to be mounted at both ends of the rear bumper. When mounted at both ends of the rear bumper, in at least one embodiment, the RADAR sensor system may generate two beams that continuously monitor the rearward direction of the vehicle and nearby blind spots. In at least one embodiment, the short-range RADAR system may be used in ADAS system 1038 for blind spot detection and / or lane change assistance.
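The mid-range figures quoted above can be captured in a small data structure. The following is a minimal sketch, not part of the disclosed embodiments: the class name `RadarCoverage`, the method `covers`, and the convention that bearing is measured from the sensor boresight are assumptions introduced for illustration. Only the range and field-of-view numbers come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RadarCoverage:
    """Hypothetical description of a RADAR sensor's coverage cone."""
    max_range_m: float  # maximum detection range in metres
    fov_deg: float      # full field of view in degrees, centred on boresight

    def covers(self, distance_m: float, bearing_deg: float) -> bool:
        """Return True if a target at the given distance and bearing
        (degrees off boresight, sign = side) lies inside the cone."""
        in_range = 0.0 <= distance_m <= self.max_range_m
        in_fov = abs(bearing_deg) <= self.fov_deg / 2.0
        return in_range and in_fov


# Example values taken from the mid-range figures in the paragraph above.
MID_RANGE_FRONT = RadarCoverage(max_range_m=160.0, fov_deg=42.0)
MID_RANGE_REAR = RadarCoverage(max_range_m=80.0, fov_deg=150.0)
```

For instance, `MID_RANGE_FRONT.covers(150.0, 20.0)` is true because 20 degrees is within half of the 42-degree field of view, while a target 25 degrees off boresight falls outside it.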

[0213] In at least one embodiment, the vehicle 1000 may also include one or more ultrasonic sensors 1062. In at least one embodiment, one or more ultrasonic sensors 1062, which may be positioned at the front, rear, and / or sides of the vehicle 1000, may be used for parking assistance and / or for creating and updating an occupancy grid. In at least one embodiment, a wide variety of ultrasonic sensors 1062 may be used, and different ultrasonic sensors 1062 may be used for different detection ranges (e.g., 2.5 m, 4 m). In at least one embodiment, the ultrasonic sensors 1062 may operate at the ASIL B functional safety level.

[0214] In at least one embodiment, vehicle 1000 may include one or more LiDAR sensors 1064. In at least one embodiment, one or more LiDAR sensors 1064 may be used for object and pedestrian detection, emergency braking, collision avoidance, and / or other functions. In at least one embodiment, one or more LiDAR sensors 1064 may operate at functional safety level ASIL B. In at least one embodiment, vehicle 1000 may include multiple (e.g., two, four, six, etc.) LiDAR sensors 1064 that can use Ethernet channels (e.g., providing data to a Gigabit Ethernet switch).

[0215] In at least one embodiment, one or more LiDAR sensors 1064 may be able to provide a list of objects and their distances for a 360-degree field of view. In at least one embodiment, one or more commercially available LiDAR sensors 1064 may, for example, have an advertised range of approximately 100 m, an accuracy of 2 cm to 3 cm, and support for a 100 Mbps Ethernet connection. In at least one embodiment, one or more non-protruding LiDAR sensors may be used. In such embodiments, one or more LiDAR sensors 1064 may include small devices that can be embedded in the front, rear, side, and / or corner locations of vehicle 1000. In such embodiments, one or more LiDAR sensors 1064 can provide a horizontal field of view of up to 120 degrees and a vertical field of view of 35 degrees, with a range of 200 m even for objects with low reflectivity. In at least one embodiment, one or more forward-facing LiDAR sensors 1064 may be configured for a horizontal field of view between 45 degrees and 105 degrees.

[0216] In at least one embodiment, LIDAR technology (such as 3D flash LIDAR) may also be used. In at least one embodiment, 3D flash LIDAR uses a laser flash as a transmission source to illuminate the surroundings of vehicle 1000 up to approximately 200 m. In at least one embodiment, the flash LIDAR unit includes, but is not limited to, a receiver that records the laser pulse propagation time and reflected light on each pixel, which in turn corresponds to the range from vehicle 1000 to the object. In at least one embodiment, flash LIDAR can allow the generation of highly accurate and distortion-free images of the surrounding environment with each laser flash. In at least one embodiment, four flash LIDAR sensors may be deployed, one on each side of vehicle 1000. In at least one embodiment, the 3D flash LIDAR system includes, but is not limited to, a solid-state 3D staring-array LIDAR camera with no moving parts other than a fan (e.g., a non-scanning LIDAR device). In at least one embodiment, the flash LIDAR device can use a 5-nanosecond Class I (eye-safe) laser pulse per frame and can capture reflected laser light as a 3D ranging point cloud and co-registered intensity data.
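The relationship between per-pixel pulse propagation time and range follows the standard time-of-flight relation, range = c · t / 2, since the pulse travels to the object and back. A minimal sketch is shown below; the function names and the list-of-rows image layout are assumptions for illustration and are not taken from the disclosure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact by SI definition


def range_from_round_trip(round_trip_s: float) -> float:
    """Return the one-way distance in metres for a round-trip time in seconds."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels out and back, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def range_image(round_trip_times):
    """Convert a 2D grid (list of rows) of per-pixel times into ranges."""
    return [[range_from_round_trip(t) for t in row] for row in round_trip_times]
```

Under this relation, an object at the approximately 200 m limit mentioned above returns its pulse after roughly 1.33 microseconds.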

[0217] In at least one embodiment, vehicle 1000 may further include one or more IMU sensors 1066. In at least one embodiment, one or more IMU sensors 1066 may be located at the center of the rear axle of vehicle 1000. In at least one embodiment, one or more IMU sensors 1066 may include, for example, but not limited to, one or more accelerometers, one or more magnetometers, one or more gyroscopes, a magnetic compass, multiple magnetic compasses, and / or other sensor types. In at least one embodiment, for example in a six-axis application, one or more IMU sensors 1066 may include, but are not limited to, accelerometers and gyroscopes. In at least one embodiment, for example in a nine-axis application, one or more IMU sensors 1066 may include, but are not limited to, accelerometers, gyroscopes, and magnetometers.

[0218] In at least one embodiment, one or more IMU sensors 1066 may be implemented as a miniature, high-performance GPS-aided inertial navigation system (“GPS / INS”) that combines microelectromechanical system (“MEMS”) inertial sensors, a high-sensitivity GPS receiver, and advanced Kalman filtering algorithms to provide estimates of position, velocity, and attitude. In at least one embodiment, one or more IMU sensors 1066 may enable vehicle 1000 to estimate heading without requiring input from a magnetic sensor, by directly observing and correlating changes in velocity from GPS to one or more IMU sensors 1066. In at least one embodiment, one or more IMU sensors 1066 and one or more GNSS sensors 1058 may be combined in a single integrated unit.
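As a simplified illustration of estimating direction without a magnetic sensor, the sketch below derives course over ground from GPS velocity components alone. This is an assumption-laden simplification, not the GPS / INS fusion described above: it omits the IMU and the Kalman filter entirely, course equals heading only when the vehicle is not sideslipping, and the function name and the `min_speed` cutoff are hypothetical.

```python
import math


def course_from_velocity(v_north: float, v_east: float, min_speed: float = 0.5):
    """Return course over ground in degrees clockwise from north, or None
    when the speed (m/s) is too low for the direction to be meaningful."""
    speed = math.hypot(v_north, v_east)
    if speed < min_speed:
        return None  # direction of a near-zero velocity vector is mostly noise
    # atan2(east, north) measures the angle clockwise from north.
    return math.degrees(math.atan2(v_east, v_north)) % 360.0
```

A full GPS / INS implementation would instead feed GPS velocity and IMU measurements into a filter that tracks attitude continuously, including while stationary.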

[0219] In at least one embodiment, vehicle 1000 may include one or more microphones 1096 placed inside and / or around vehicle 1000. In at least one embodiment, in addition, one or more microphones 1096 may be used for emergency vehicle detection and identification.

[0220] In at least one embodiment, vehicle 1000 may also include any number of camera types, including one or more stereo cameras 1068, one or more wide-angle cameras 1070, one or more infrared cameras 1072, one or more surround cameras 1074, one or more long-range cameras 1098, one or more mid-range cameras 1076, and / or other camera types. In at least one embodiment, the cameras can be used to capture image data around the entire perimeter of vehicle 1000. In at least one embodiment, the types of cameras used depend on vehicle 1000. In at least one embodiment, any combination of camera types can be used to provide the necessary coverage around vehicle 1000. In at least one embodiment, the number of cameras deployed may vary depending on the embodiment. For example, in at least one embodiment, vehicle 1000 may include six cameras, seven cameras, ten cameras, twelve cameras, or another number of cameras. In at least one embodiment, the cameras may support, by way of example and without limitation, Gigabit Multimedia Serial Link (“GMSL”) and / or Gigabit Ethernet communication. In at least one embodiment, each camera may be as described in more detail previously herein with reference to Figure 10A and Figure 10B.

[0221] In at least one embodiment, the vehicle 1000 may also include one or more vibration sensors 1042. In at least one embodiment, the one or more vibration sensors 1042 can measure vibrations of components of the vehicle 1000 (e.g., axles). For example, in at least one embodiment, changes in vibration can indicate changes in road surface conditions. In at least one embodiment, when two or more vibration sensors 1042 are used, differences between vibrations can be used to determine road surface friction or slippage (e.g., when there is a vibration difference between a power drive axle and a free-rotating axle).
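The comparison between a driven axle and a freely rotating axle can be sketched as follows. This is only an illustration under stated assumptions: the use of RMS amplitude as the vibration measure, the relative-difference test, the 0.25 threshold, and the function names are all hypothetical and not specified by the disclosure.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def possible_slip(driven_axle, free_axle, rel_threshold=0.25):
    """Flag possible slip when the two axles' vibration levels differ by more
    than rel_threshold, relative to the larger of the two levels."""
    d, f = rms(driven_axle), rms(free_axle)
    baseline = max(d, f)
    if baseline == 0.0:
        return False  # no vibration on either axle, nothing to compare
    return abs(d - f) / baseline > rel_threshold
```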

[0222] In at least one embodiment, vehicle 1000 may include ADAS system 1038. In at least one embodiment, ADAS system 1038 may include, but is not limited to, SoC. In at least one embodiment, ADAS system 1038 may include, but is not limited to, any number of autonomous / adaptive / automatic cruise control (“ACC”) systems, cooperative adaptive cruise control (“CACC”) systems, forward collision warning (“FCW”) systems, automatic emergency braking (“AEB”) systems, lane departure warning (“LDW”) systems, lane keeping assist (“LKA”) systems, blind spot warning (“BSW”) systems, rear cross traffic warning (“RCTW”) systems, collision warning (“CW”) systems, lane centering (“LC”) systems, and / or other systems, features, and / or functions, and combinations thereof.

[0223] In at least one embodiment, the ACC system may use one or more RADAR sensors 1060, one or more LIDAR sensors 1064, and / or any number of cameras. In at least one embodiment, the ACC system may include a longitudinal ACC system and / or a lateral ACC system. In at least one embodiment, the longitudinal ACC system monitors and controls the distance to another vehicle immediately ahead of vehicle 1000 and automatically adjusts the speed of vehicle 1000 to maintain a safe distance from the vehicle ahead. In at least one embodiment, the lateral ACC system performs distance keeping and advises vehicle 1000 to change lanes when necessary. In at least one embodiment, lateral ACC is related to other ADAS applications, such as LC and CW.
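One common way to realize the longitudinal distance-keeping behavior is a constant time-gap spacing policy with proportional feedback. The sketch below uses that technique as an assumption; the disclosure does not state which control law is used, and the function name, gains, time gap, and acceleration limits are all hypothetical values chosen for illustration.

```python
def acc_accel_command(ego_speed, lead_speed, gap,
                      time_gap=1.8, standstill_gap=5.0,
                      k_gap=0.2, k_speed=0.5,
                      a_min=-3.0, a_max=1.5):
    """Return a commanded acceleration (m/s^2) from speeds (m/s) and gap (m).

    The desired gap grows with ego speed (constant time-gap policy); the
    command reacts to both the gap error and the relative speed.
    """
    desired_gap = standstill_gap + time_gap * ego_speed
    accel = k_gap * (gap - desired_gap) + k_speed * (lead_speed - ego_speed)
    # Clamp to comfort and actuator limits (hypothetical values).
    return max(a_min, min(a_max, accel))
```

With these example parameters, following at 20 m/s with a 41 m gap behind a vehicle also travelling at 20 m/s yields a zero command, while a much shorter gap saturates at the braking limit.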

[0224] In at least one embodiment, the CACC system uses information from other vehicles, which may be received via network interface 1024 and / or one or more wireless antennas 1026, either directly from other vehicles via a wireless link or indirectly via a network connection (e.g., via the Internet). In at least one embodiment, the direct link may be provided by a vehicle-to-vehicle (“V2V”) communication link, while the indirect link may be provided by an infrastructure-to-vehicle (“I2V”) communication link. Typically, V2V communication provides information about the immediately preceding vehicle (e.g., a vehicle immediately in front of vehicle 1000 and in the same lane), while I2V communication provides information about traffic further ahead. In at least one embodiment, the CACC system may include one or both of the I2V and V2V information sources. In at least one embodiment, given information about vehicles ahead of vehicle 1000, the CACC system can be more reliable and has the potential to improve traffic flow smoothness and reduce road congestion.

[0225] In at least one embodiment, the FCW system is designed to warn the driver of a hazard so that the driver can take corrective action. In at least one embodiment, the FCW system uses a forward-facing camera and / or one or more RADAR sensors 1060 coupled to a dedicated processor, DSP, FPGA, and / or ASIC, which is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibration component. In at least one embodiment, the FCW system can provide a warning, for example, in the form of a sound, a visual warning, a vibration, and / or a quick brake pulse.

[0226] In at least one embodiment, the AEB system detects an impending forward collision with another vehicle or other object and can automatically apply the brakes if the driver does not take corrective action within a specified time or distance parameter. In at least one embodiment, the AEB system may use one or more forward-facing cameras and / or one or more RADAR sensors 1060 coupled to a dedicated processor, DSP, FPGA, and / or ASIC. In at least one embodiment, when the AEB system detects a hazard, it typically first warns the driver to take corrective action to avoid a collision, and if the driver does not take corrective action, the AEB system may automatically apply the brakes to attempt to prevent, or at least mitigate, the effects of the predicted collision. In at least one embodiment, the AEB system may include techniques such as dynamic brake support and / or crash-imminent braking.
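The warn-first, brake-second behavior can be sketched as a staged decision on time-to-collision (TTC). This is an illustrative sketch only: the use of TTC as the "specified time parameter", the two threshold values, and the names `AebAction`, `time_to_collision`, and `aeb_action` are assumptions not found in the disclosure.

```python
from enum import Enum


class AebAction(Enum):
    NONE = "none"
    WARN = "warn"
    BRAKE = "brake"


def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Seconds until impact at constant closing speed; infinite if not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return gap_m / closing_speed_mps


def aeb_action(gap_m, closing_speed_mps, driver_reacted,
               warn_ttc_s=2.5, brake_ttc_s=1.2) -> AebAction:
    """Warn first; brake automatically only if the driver has not reacted
    by the time TTC falls to the (hypothetical) braking threshold."""
    ttc = time_to_collision(gap_m, closing_speed_mps)
    if ttc <= brake_ttc_s and not driver_reacted:
        return AebAction.BRAKE
    if ttc <= warn_ttc_s:
        return AebAction.WARN
    return AebAction.NONE
```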

[0227] In at least one embodiment, when vehicle 1000 crosses lane markings, the LDW system provides visual, auditory, and / or tactile warnings, such as steering wheel or seat vibrations, to alert the driver. In at least one embodiment, the LDW system does not activate when the driver indicates an intentional lane departure, such as by activating a turn signal. In at least one embodiment, the LDW system may use a front-facing camera coupled to a dedicated processor, DSP, FPGA, and / or ASIC, which is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibration component. In at least one embodiment, the LKA system is a variant of the LDW system. In at least one embodiment, if vehicle 1000 begins to leave the lane, the LKA system provides steering input or braking to correct vehicle 1000.

[0228] In at least one embodiment, the BSW system detects and warns the driver of a vehicle in the blind spot. In at least one embodiment, the BSW system can provide visual, auditory, and / or tactile alerts to indicate that merging or changing lanes is unsafe. In at least one embodiment, the BSW system can provide additional warnings when the driver uses the turn signal. In at least one embodiment, the BSW system can use one or more rear-facing cameras and / or one or more RADAR sensors 1060 coupled to a dedicated processor, DSP, FPGA, and / or ASIC, electrically coupled to driver feedback, such as a display, speaker, and / or vibration assembly.

[0229] In at least one embodiment, the RCTW system can provide visual, auditory, and / or tactile notifications when an object is detected outside the range of the rear camera while the vehicle 1000 is reversing. In at least one embodiment, the RCTW system includes an AEB system to ensure that the vehicle brakes are applied to avoid a collision. In at least one embodiment, the RCTW system may use one or more rear-facing RADAR sensors 1060 coupled to a dedicated processor, DSP, FPGA, and / or ASIC, which is electrically coupled to components providing driver feedback, such as a display, speaker, and / or vibration component.

[0230] In at least one embodiment, conventional ADAS systems may be prone to generating false alarms, which can be annoying and distracting to the driver, but are generally not catastrophic because conventional ADAS systems warn the driver and allow the driver to determine whether a safety condition truly exists and to act accordingly. In at least one embodiment, in the event of conflicting results, vehicle 1000 itself decides whether to follow the result of the main computer or the auxiliary computer (e.g., the first or second controller of controllers 1036). For example, in at least one embodiment, ADAS system 1038 may be a backup and / or auxiliary computer for providing perception information to a backup computer rationality monitor. In at least one embodiment, the backup computer rationality monitor may run redundant, diverse software on hardware components to detect faults in perception and dynamic driving tasks. In at least one embodiment, the output from ADAS system 1038 may be provided to a supervisory MCU. In at least one embodiment, if the output from the main computer and the output from the auxiliary computer conflict, the supervisory MCU decides how to reconcile the conflict to ensure safe operation.

[0231] In at least one embodiment, the master computer may be configured to provide a confidence score to the supervisory MCU to indicate the master computer's confidence in the selected result. In at least one embodiment, if the confidence score exceeds a threshold, the supervisory MCU may follow the master computer's instructions regardless of whether the auxiliary computer provides conflicting or inconsistent results. In at least one embodiment, if the confidence score does not meet the threshold, and if the master computer and the auxiliary computer indicate different results (e.g., conflicting), the supervisory MCU may arbitrate between the computers to determine the appropriate result.
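The confidence-threshold rule described above can be sketched as follows. The branching on the threshold and on conflicting results follows the paragraph; everything else is an assumption for illustration, including the function names, the 0.9 threshold, the action labels, and in particular the default arbitration policy (prefer the more cautious action), which the disclosure does not specify.

```python
# Hypothetical severity ranking used only by the example arbitration policy.
SEVERITY = {"continue": 0, "warn": 1, "brake": 2}


def prefer_cautious(primary, secondary):
    """Example arbitration policy (an assumption): pick the more cautious action."""
    return max(primary, secondary, key=SEVERITY.__getitem__)


def supervise(primary, confidence, secondary,
              threshold=0.9, arbitrate=prefer_cautious):
    """Select a result the way the supervisory MCU is described as doing."""
    if confidence > threshold:
        # Confident primary result is followed regardless of the secondary.
        return primary
    if primary != secondary:
        # Low confidence and conflicting results: arbitrate between them.
        return arbitrate(primary, secondary)
    return primary  # low confidence, but both computers agree
```

Passing a different `arbitrate` callable swaps in another policy, such as one backed by a trained neural network.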

[0232] In at least one embodiment, the supervisory MCU may be configured to run a neural network that is trained and configured to determine, based at least in part on outputs from the main computer and the auxiliary computer, the conditions under which the auxiliary computer provides false alarms. In at least one embodiment, the neural network in the supervisory MCU may learn when the outputs of the auxiliary computer can be trusted and when they cannot. For example, in at least one embodiment, when the auxiliary computer is a RADAR-based FCW system, the neural network in the supervisory MCU may learn when the FCW system is identifying metallic objects that are not actually hazards, such as a drainage grate or manhole cover that triggers an alarm. In at least one embodiment, when the auxiliary computer is a camera-based LDW system, the neural network in the supervisory MCU may learn to override the LDW when a cyclist or pedestrian is present and lane departure is actually the safest maneuver. In at least one embodiment, the supervisory MCU may include at least one of a DLA or GPU suitable for running a neural network with associated memory. In at least one embodiment, the supervisory MCU may include and / or be included as a component of one or more SoCs 1004.

[0233] In at least one embodiment, the ADAS system 1038 may include an auxiliary computer that performs ADAS functions using conventional computer vision rules. In at least one embodiment, the auxiliary computer may use classic computer vision rules (if-then), and the presence of a neural network in the supervisory MCU can improve reliability, safety, and performance. For example, in at least one embodiment, the diverse implementation and intentional non-identity make the overall system more fault-tolerant, especially with respect to failures caused by software (or software-hardware interface) functionality. For example, in at least one embodiment, if a software bug or error exists in the software running on the main computer, and different software code running on the auxiliary computer provides the same overall result, the supervisory MCU can have greater confidence that the overall result is correct and that the bug in the software or hardware on the main computer is not causing a material error.

[0234] In at least one embodiment, the output of the ADAS system 1038 can be fed into the perception module and / or the dynamic driving task module of the main computer. For example, in at least one embodiment, if the ADAS system 1038 indicates a forward collision warning due to an object directly ahead, the perception module can use this information when identifying objects. In at least one embodiment, as described herein, the auxiliary computer can have its own trained neural network, which reduces the risk of false alarms.

[0235] In at least one embodiment, vehicle 1000 may also include an infotainment SoC 1030 (e.g., an in-vehicle infotainment system (IVI)). Although shown and described as an SoC, in at least one embodiment, the infotainment SoC 1030 may not be an SoC and may include, but is not limited to, two or more discrete components. In at least one embodiment, the infotainment SoC 1030 may include, but is not limited to, a combination of hardware and software that can be used to provide audio (e.g., music, personal digital assistant, navigation instructions, news, radio, etc.), video (e.g., television, movies, streaming media, etc.), telephone (e.g., hands-free calling), network connectivity (e.g., LTE, WiFi, etc.), and / or information services (e.g., navigation system, rear parking assist, radio data system, and vehicle-related information such as fuel level, total distance covered, brake fluid level, oil level, door open / closed status, air filter information, etc.) to vehicle 1000. For example, the infotainment SoC 1030 may include a radio, disk player, navigation system, video player, USB and Bluetooth connectivity, in-vehicle computer, in-vehicle entertainment system, WiFi, steering wheel audio controls, hands-free voice control, head-up display (“HUD”), HMI display 1034, telematics device, control panel (e.g., for controlling and / or interacting with various components, features, and / or systems), and / or other components. In at least one embodiment, the infotainment SoC 1030 may further be used to provide information (e.g., visual and / or auditory) to a user of vehicle 1000, such as information from ADAS system 1038, autonomous driving information (such as planned vehicle maneuvers and trajectories), surrounding environment information (e.g., intersection information, vehicle information, road information, etc.), and / or other information.

[0236] In at least one embodiment, the infotainment SoC 1030 may include any number and type of GPU functionality. In at least one embodiment, the infotainment SoC 1030 may communicate with other devices, systems, and / or components of the vehicle 1000 via bus 1002. In at least one embodiment, the infotainment SoC 1030 may be coupled to a supervisory MCU such that the GPU of the infotainment system can perform some autonomous driving functions in the event of a failure of one or more main controllers 1036 (e.g., the main computer and / or backup computer of the vehicle 1000). In at least one embodiment, the infotainment SoC 1030 may place the vehicle 1000 into a chauffeur-to-safe-stop mode, as described herein.

[0237] In at least one embodiment, vehicle 1000 may also include instrument panel 1032 (e.g., digital instrument panel, electronic instrument panel, digital instrument control panel, etc.). In at least one embodiment, instrument panel 1032 may include, but is not limited to, controllers and / or supercomputers (e.g., discrete controllers or supercomputers). In at least one embodiment, instrument panel 1032 may include, but is not limited to, any number and combination of a set of instruments, such as speedometer, fuel level, oil pressure, tachometer, odometer, turn indicator, shift position indicator, one or more seatbelt warning lights, one or more parking brake warning lights, one or more engine malfunction lights, auxiliary restraint system (e.g., airbag) information, lighting controls, safety system controls, navigation information, etc. In some examples, information may be displayed and / or shared between infotainment SoC 1030 and instrument panel 1032. In at least one embodiment, instrument panel 1032 may be included as part of infotainment SoC 1030, or vice versa.

[0238] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, inference and / or training logic 715 may be used in the system of Figure 10C for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0239] In at least one embodiment, the inference and / or training logic 2 may be used in the system of Figure 10C for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0240] Figure 10D is a diagram of a system 1076 for communication between one or more cloud-based servers and autonomous vehicle 1000 of Figure 10A, according to at least one embodiment. In at least one embodiment, system 1076 may include, but is not limited to, one or more servers 1078, one or more networks 1090, and any number and type of vehicles, including vehicle 1000. In at least one embodiment, one or more servers 1078 may include, but are not limited to, multiple GPUs 1084(A)-1084(H) (collectively referred to herein as GPUs 1084), PCIe switches 1082(A)-1082(D) (collectively referred to herein as PCIe switches 1082), and / or CPUs 1080(A)-1080(B) (collectively referred to herein as CPUs 1080). GPUs 1084, CPUs 1080, and PCIe switches 1082 may be interconnected with high-speed interconnects, such as, but not limited to, NVLink interface 1088 developed by NVIDIA and / or PCIe connection 1086. In at least one embodiment, the GPUs 1084 are connected via NVLink and / or an NVSwitch SoC, and the GPUs 1084 and PCIe switches 1082 are connected via PCIe interconnects. Although eight GPUs 1084, two CPUs 1080, and four PCIe switches 1082 are shown, this is not intended to be limiting. In at least one embodiment, each of one or more servers 1078 may include, but is not limited to, any combination of any number of GPUs 1084, CPUs 1080, and / or PCIe switches 1082. For example, in at least one embodiment, one or more servers 1078 may each include eight, sixteen, thirty-two, and / or more GPUs 1084.

[0241] In at least one embodiment, one or more servers 1078 may receive image data representing images from vehicles via one or more networks 1090, the images showing unexpected or changed road conditions, such as recently started roadworks. In at least one embodiment, one or more servers 1078 may transmit updated neural network 1092 and / or map information 1094, including but not limited to information about traffic and road conditions, to vehicles via one or more networks 1090. In at least one embodiment, updating the map information 1094 may include, but is not limited to, updating the HD map 1022, such as information about construction sites, potholes, sidewalks, floods, and / or other obstacles. In at least one embodiment, the neural network 1092 and / or map information 1094 may be generated from new training and / or experience represented by data received from any number of vehicles in the environment, and / or at least based on training performed in a data center (e.g., using one or more servers 1078 and / or other servers).

[0242] In at least one embodiment, one or more servers 1078 may be used to train a machine learning model (e.g., a neural network) at least in part based on training data. In at least one embodiment, the training data may be generated by the vehicle, and / or may be generated in a simulation (e.g., using a game engine). In at least one embodiment, any amount of training data is labeled (e.g., where the associated neural network benefits from supervised learning) and / or undergoes other preprocessing. In at least one embodiment, no amount of training data is labeled and / or preprocessed (e.g., where the associated neural network does not require supervised learning). In at least one embodiment, once the machine learning model is trained, the machine learning model may be used by the vehicle (e.g., transmitted to the vehicle via one or more networks 1090), and / or the machine learning model may be used by one or more servers 1078 to remotely monitor the vehicle.

[0243] In at least one embodiment, one or more servers 1078 may receive data from the vehicle and apply the data to state-of-the-art real-time neural networks for real-time intelligent inference. In at least one embodiment, one or more servers 1078 may include a deep learning supercomputer and / or a dedicated AI computer powered by one or more GPUs 1084, such as the DGX and DGX Station machines developed by NVIDIA. However, in at least one embodiment, one or more servers 1078 may include a deep learning infrastructure in a data center using CPU power.

[0244] In at least one embodiment, the deep learning infrastructure of one or more servers 1078 may be capable of fast, real-time inference and can use this capability to assess and verify the health of the processor, software, and / or associated hardware in vehicle 1000. For example, in at least one embodiment, the deep learning infrastructure may receive periodic updates from vehicle 1000, such as image sequences and / or objects located by vehicle 1000 in the image sequence (e.g., via computer vision and / or other machine learning object classification techniques). In at least one embodiment, the deep learning infrastructure may run its own neural network to identify objects and compare them with objects identified by vehicle 1000, and if the results do not match and the deep learning infrastructure determines that the AI in vehicle 1000 is malfunctioning, one or more servers 1078 may signal to vehicle 1000 to instruct the fail-safe computer of vehicle 1000 to take control, notify passengers, and complete a safe stopping operation.
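The server-side comparison can be sketched as an agreement check between the objects reported by the vehicle and those identified by the server for the same frames. This is a hedged illustration: the use of Jaccard overlap on object labels, both threshold values, and the function names are assumptions, since the disclosure does not state how a mismatch or malfunction is determined.

```python
def frame_agreement(vehicle_objects, server_objects) -> float:
    """Jaccard overlap between two collections of object labels for one frame."""
    v, s = set(vehicle_objects), set(server_objects)
    if not v and not s:
        return 1.0  # both report an empty scene, which counts as agreement
    return len(v & s) / len(v | s)


def should_signal_failsafe(vehicle_frames, server_frames,
                           min_agreement=0.5, max_bad_fraction=0.3) -> bool:
    """Return True when too many frames disagree (hypothetical criterion)."""
    if len(vehicle_frames) != len(server_frames):
        raise ValueError("frame sequences must have the same length")
    if not vehicle_frames:
        return False  # nothing to compare yet
    bad = sum(
        1 for v, s in zip(vehicle_frames, server_frames)
        if frame_agreement(v, s) < min_agreement
    )
    return bad / len(vehicle_frames) > max_bad_fraction
```

Requiring disagreement across a fraction of frames, rather than a single frame, is a design choice in this sketch to avoid triggering a safe stop on one isolated misdetection.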

[0245] In at least one embodiment, one or more servers 1078 may include one or more GPUs 1084 and one or more programmable inference accelerators (e.g., NVIDIA's TensorRT 3 devices). In at least one embodiment, the combination of GPU-driven servers and inference acceleration enables real-time response. In at least one embodiment, for example, where performance is less critical, servers driven by CPUs, FPGAs, and other processors may be used for inference. In at least one embodiment, hardware architecture 715 is used to execute one or more embodiments. Details regarding hardware architecture 715 are provided herein in conjunction with Figure 7A and / or Figure 7B.

[0246] Computer System

[0247] Figure 11 is a block diagram illustrating an exemplary computer system according to at least one embodiment. The exemplary computer system may be a system of interconnected devices and components, a system-on-a-chip (SoC), or some combination thereof formed with a processor, which may include an execution unit to execute instructions. In at least one embodiment, in accordance with this disclosure, such as the embodiments described herein, computer system 1100 may include, but is not limited to, components such as processor 1102, whose execution unit includes logic to execute algorithms for processing data. In at least one embodiment, computer system 1100 may include a processor, such as a processor family, Xeon™, XScale™ and / or StrongARM™, Core™, or Nervana™ microprocessor available from Intel Corporation of Santa Clara, California, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, etc.) may also be used. In at least one embodiment, computer system 1100 may execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and / or graphical user interfaces may also be used.

[0248] The embodiments can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system that can execute one or more instructions according to at least one embodiment.

[0249] In at least one embodiment, computer system 1100 may include, but is not limited to, processor 1102, which may include, but is not limited to, one or more execution units 1108, to perform machine learning model training and / or inference according to the techniques described herein. In at least one embodiment, computer system 1100 is a single-processor desktop or server system, but in another embodiment, computer system 1100 may be a multiprocessor system. In at least one embodiment, processor 1102 may include, but is not limited to, a Complex Instruction Set Computer (“CISC”) microprocessor, a Reduced Instruction Set Computing (“RISC”) microprocessor, a Very Long Instruction Word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In at least one embodiment, processor 1102 may be coupled to processor bus 1110, which can transmit data signals between processor 1102 and other components in computer system 1100.

[0250] In at least one embodiment, processor 1102 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 1104. In at least one embodiment, processor 1102 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory may reside external to processor 1102. Depending on specific implementation and requirements, other embodiments may also include a combination of internal and external caches. In at least one embodiment, register file 1106 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.

[0251] In at least one embodiment, an execution unit 1108, including but not limited to logic for performing integer and floating-point operations, is also located within the processor 1102. In at least one embodiment, the processor 1102 may also include a microcode (“ucode”) read-only memory (“ROM”) that stores microcode for certain macro instructions. In at least one embodiment, the execution unit 1108 may include logic for processing a packed instruction set 1109. In at least one embodiment, by including the packed instruction set 1109 in the instruction set of a general-purpose processor, along with the associated circuitry for executing the instructions, operations used by numerous multimedia applications can be performed using packed data in the processor 1102. In one or more embodiments, many multimedia applications can be executed more quickly and efficiently by using the full width of the processor’s data bus to perform operations on the packed data, which may eliminate the need to transfer smaller data units across the processor’s data bus to perform one or more operations on one data element at a time.
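
As a minimal, non-limiting sketch of operating on several small data elements packed into one wide word, the following example packs four 8-bit values into a 32-bit word and adds all four lanes in a single pass. The helper names and the 4 x 8-bit lane layout are assumptions chosen for illustration; this software emulation does not represent instruction set 1109 itself.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of every 8-bit lane
HIGH = 0x80808080   # top bit of every 8-bit lane


def pack_bytes(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack_bytes(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def packed_add(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256."""
    partial = (a & LOW7) + (b & LOW7)          # carries cannot cross a lane boundary
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF  # fold the top bits back in


if __name__ == "__main__":
    a = pack_bytes([1, 200, 255, 16])
    b = pack_bytes([2, 100, 1, 16])
    print(unpack_bytes(packed_add(a, b)))
```

One word-wide addition replaces four element-wise additions, which is the saving the paragraph above attributes to using the full width of the data path.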

[0252] In at least one embodiment, execution unit 1108 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, computer system 1100 may include, but is not limited to, memory 1120. In at least one embodiment, memory 1120 may be a dynamic random access memory (“DRAM”) device, a static random access memory (“SRAM”) device, a flash memory device, or another storage device. In at least one embodiment, memory 1120 may store instructions 1119 and / or data 1121 represented by data signals that can be executed by processor 1102.

[0253] In at least one embodiment, the system logic chip may be coupled to processor bus 1110 and memory 1120. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 1116, and processor 1102 may communicate with MCH 1116 via processor bus 1110. In at least one embodiment, MCH 1116 may provide a high-bandwidth memory path 1118 to memory 1120 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, MCH 1116 may initiate data signals between processor 1102, memory 1120, and other components in computer system 1100, and bridge data signals between processor bus 1110, memory 1120, and system I / O interface 1122. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1116 can be coupled to memory 1120 via high-bandwidth memory path 1118, and graphics / video card 1112 can be coupled to MCH 1116 via Accelerated Graphics Port (“AGP”) interconnect 1114.

[0254] In at least one embodiment, the computer system 1100 may use the system I / O interface 1122 as a proprietary hub interface bus to couple the MCH 1116 to the I / O controller hub (“ICH”) 1130. In at least one embodiment, the ICH 1130 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to the memory 1120, chipset, and processor 1102. Examples may include, but are not limited to, an audio controller 1129, a firmware hub (“Flash BIOS”) 1128, a wireless transceiver 1126, a data storage 1124, a conventional I / O controller 1123 including a user input and keyboard interface, a serial expansion port 1127 (e.g., a Universal Serial Bus (USB) port), and a network controller 1134. In at least one embodiment, the data storage 1124 may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.

[0255] In at least one embodiment, Figure 11 illustrates a system including interconnected hardware devices or "chips", while in other embodiments, Figure 11 may illustrate an exemplary SoC. In at least one embodiment, the devices shown in Figure 11 can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of the computer system 1100 are interconnected using a Compute Express Link (CXL) interconnect.

[0256] The inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the system of Figure 11 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0257] In at least one embodiment, the inference and / or training logic 2 may be used in the system of Figure 11 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0258] Figure 12 is a block diagram illustrating an electronic device 1200 that utilizes a processor 1210 according to at least one embodiment. In at least one embodiment, the electronic device 1200 may be, for example, but not limited to, a laptop computer, tower server, rack server, blade server, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.

[0259] In at least one embodiment, the electronic device 1200 may include, but is not limited to, a processor 1210 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, the processor 1210 is coupled using a bus or interface, such as an I²C bus, a System Management Bus (“SMBus”), a Low Pin Count (LPC) bus, a Serial Peripheral Interface (“SPI”), a High Definition Audio (“HDA”) bus, a Serial Advanced Technology Attachment (“SATA”) bus, a Universal Serial Bus (“USB”) (versions 1, 2, 3, etc.), or a Universal Asynchronous Receiver / Transmitter (“UART”) bus. In at least one embodiment, Figure 12 illustrates a system including interconnected hardware devices or "chips", while in other embodiments, Figure 12 may illustrate an exemplary SoC. In at least one embodiment, the devices shown in Figure 12 can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of Figure 12 are interconnected using Compute Express Link (CXL) interconnects.

[0260] In at least one embodiment, Figure 12 may include a display 1224, a touch screen 1225, a touchpad 1230, a near field communication unit (“NFC”) 1245, a sensor hub 1240, a thermal sensor 1246, an express chipset (“EC”) 1235, a trusted platform module (“TPM”) 1238, a BIOS / firmware / flash (“BIOS, FW Flash”) 1222, a DSP 1260, a drive 1220 (e.g., a solid-state drive (“SSD”) or a hard disk drive (“HDD”)), a wireless local area network unit (“WLAN”) 1250, a Bluetooth unit 1252, a wireless wide area network unit (“WWAN”) 1256, a global positioning system (GPS) unit 1255, a camera (“USB 3.0 camera”) 1254 (e.g., a USB 3.0 camera), and / or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 1215 implemented in, for example, the LPDDR3 standard. These components can each be implemented in any suitable way.

[0261] In at least one embodiment, other components may be communicatively coupled to processor 1210 via the components described herein. In at least one embodiment, accelerometer 1241, ambient light sensor (“ALS”) 1242, compass 1243, and gyroscope 1244 may be communicatively coupled to sensor hub 1240. In at least one embodiment, thermal sensor 1239, fan 1237, keyboard 1236, and touchpad 1230 may be communicatively coupled to EC 1235. In at least one embodiment, speaker 1263, earphone 1264, and microphone (“mic”) 1265 may be communicatively coupled to audio unit (“audio codec and Class D amplifier”) 1262, which in turn may be communicatively coupled to DSP 1260. In at least one embodiment, audio unit 1262 may include, for example, but not limited to, audio encoder / decoder (“codec”) and Class D amplifier. In at least one embodiment, SIM card (“SIM”) 1257 may be communicatively coupled to WWAN unit 1256. In at least one embodiment, components such as WLAN unit 1250, Bluetooth unit 1252, and WWAN unit 1256 can be implemented as next-generation form factor (NGFF).

[0262] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the system of Figure 15 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0263] In at least one embodiment, the inference and / or training logic 2 may be used in the system of Figure 8 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0264] Figure 13 illustrates a computer system 1300 according to at least one embodiment. In at least one embodiment, the computer system 1300 is configured to implement various processes and methods described throughout this disclosure.

[0265] In at least one embodiment, the computer system 1300 includes, but is not limited to, at least one central processing unit (“CPU”) 1302 connected to a communication bus 1310 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), Peripheral Component Interconnect Express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol. In at least one embodiment, the computer system 1300 includes, but is not limited to, main memory 1304 and control logic (e.g., implemented in hardware, software, or a combination thereof), and data may be stored in main memory 1304, which may take the form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“Network Interface”) 1322 provides an interface to other computing devices and networks, through which the computer system 1300 receives data from and transfers data to other systems.

[0266] In at least one embodiment, the computer system 1300 includes, but is not limited to, an input device 1308, a parallel processing system 1312, and a display device 1306, which may be implemented using conventional cathode ray tube (“CRT”), liquid crystal display (“LCD”), light-emitting diode (“LED”) display, plasma display, or other suitable display technologies. In at least one embodiment, user input is received from the input device 1308 (such as a keyboard, mouse, touchpad, microphone, etc.). In at least one embodiment, each of the modules described herein may reside on a single semiconductor platform to form the processing system.

[0267] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the system of Figure 13 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0268] In at least one embodiment, the inference and / or training logic 2 may be used in the system of Figure 13 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0269] Figure 14 illustrates a computer system 1400 according to at least one embodiment. In at least one embodiment, the computer system 1400 includes, but is not limited to, a computer 1410 and a USB flash drive 1420. In at least one embodiment, the computer 1410 may include, but is not limited to, any number and type of processors (not shown) and memory (not shown). In at least one embodiment, the computer 1410 includes, but is not limited to, a server, a cloud instance, a laptop computer, and a desktop computer.

[0270] In at least one embodiment, the USB flash drive 1420 includes, but is not limited to, a processing unit 1430, a USB interface 1440, and USB interface logic 1450. In at least one embodiment, the processing unit 1430 can be any instruction execution system, apparatus, or device capable of executing instructions. In at least one embodiment, the processing unit 1430 can include, but is not limited to, any number and type of processing cores (not shown). In at least one embodiment, the processing unit 1430 includes an application-specific integrated circuit (“ASIC”) optimized to perform any number and type of operations associated with machine learning. For example, in at least one embodiment, the processing unit 1430 is a tensor processing unit (“TPC”) optimized to perform machine learning inference operations. In at least one embodiment, the processing unit 1430 is a vision processing unit (“VPU”) optimized to perform machine vision and machine learning inference operations.

[0271] In at least one embodiment, the USB interface 1440 can be any type of USB connector or USB receptacle. For example, in at least one embodiment, the USB interface 1440 is a USB 3.0 Type-C receptacle for data and power. In at least one embodiment, the USB interface 1440 is a USB 3.0 Type-A connector. In at least one embodiment, the USB interface logic 1450 may include any amount and type of logic enabling the processing unit 1430 to connect to a device (e.g., computer 1410) via the USB connector 1440.

[0272] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the system of Figure 14 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0273] In at least one embodiment, the inference and / or training logic 2 may be used in the system of Figure 14 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0274] Figure 15A illustrates an exemplary architecture in which multiple GPUs 1510(1)-1510(N) are communicatively coupled to multiple multi-core processors 1505(1)-1505(M) via high-speed links 1540(1)-1540(N) (e.g., bus / point-to-point interconnect, etc.). In at least one embodiment, the high-speed links 1540(1)-1540(N) support communication throughput of 4GB / s, 30GB / s, 80GB / s, or higher. In at least one embodiment, various interconnect protocols may be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. In the various figures, “N” and “M” represent positive integers, the values of which may vary from figure to figure.

[0275] Furthermore, in one embodiment, two or more GPUs 1510 are interconnected via high-speed links 1529(1)-1529(2), which can be implemented using a protocol / link similar to or different from that used for high-speed links 1540(1)-1540(N). Similarly, two or more multi-core processors 1505 can be connected via high-speed link 1528, which can be a symmetric multiprocessor (SMP) bus operating at speeds of 20GB / s, 30GB / s, 120GB / s, or higher. Alternatively, all communication between the various system components shown in Figure 15A can be accomplished using similar protocols / links (e.g., via a common interconnect structure).

[0276] In one embodiment, each multi-core processor 1505 is communicatively coupled to processor memory 1501(1)-1501(M) via memory interconnects 1526(1)-1526(M), and each GPU 1510(1)-1510(N) is communicatively coupled to GPU memory 1520(1)-1520(N) via GPU memory interconnects 1550(1)-1550(N). In at least one embodiment, memory interconnects 1526 and 1550 may utilize similar or different memory access technologies. By way of example and not limitation, processor memory 1501(1)-1501(M) and GPU memory 1520 may be volatile memory, such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high bandwidth memory (HBM), and / or may be non-volatile memory, such as 3D XPoint or Nano-RAM. In at least one embodiment, some portions of the processor memory 1501 may be volatile memory, while other portions may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

[0277] As described herein, the various multi-core processors 1505 and GPUs 1510 may be physically coupled to specific memories 1501 and 1520, respectively, and / or a unified memory architecture may be implemented in which the virtual system address space (also known as the “effective address” space) is distributed among the various physical memories. For example, processor memories 1501(1)-1501(M) can each contain 64GB of system memory address space, and GPU memories 1520(1)-1520(N) can each contain 32GB of system memory address space, resulting in a total addressable memory size of 256GB when M=2 and N=4. N and M may also be other values.
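
The 256GB figure follows from 2 x 64GB + 4 x 32GB. As an illustrative sketch only, the arithmetic and one possible mapping from an address to the memory that backs it can be written as below. The contiguous layout with processor memories placed first is an assumption of this sketch; this disclosure does not specify how the address space is laid out.

```python
def total_addressable_gb(m, n, cpu_gb=64, gpu_gb=32):
    """Total unified address space for m processor memories and n GPU memories."""
    return m * cpu_gb + n * gpu_gb


def owning_memory(address_gb, m=2, n=4, cpu_gb=64, gpu_gb=32):
    """Return which physical memory backs a given address (in whole GB).

    Assumes a hypothetical contiguous layout: processor memories first,
    followed by GPU memories, each numbered from 1.
    """
    if not 0 <= address_gb < total_addressable_gb(m, n, cpu_gb, gpu_gb):
        raise ValueError("address outside the unified address space")
    if address_gb < m * cpu_gb:
        return ("processor memory", address_gb // cpu_gb + 1)
    return ("GPU memory", (address_gb - m * cpu_gb) // gpu_gb + 1)


if __name__ == "__main__":
    print(total_addressable_gb(m=2, n=4))
    print(owning_memory(130))
```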

[0278] Figure 15B illustrates additional details regarding the interconnection between a multi-core processor 1507 and a graphics acceleration module 1546 according to an exemplary embodiment. In at least one embodiment, the graphics acceleration module 1546 may include one or more GPU chips integrated on a line card coupled to the processor 1507 via a high-speed link 1540 (e.g., PCIe bus, NVLink, etc.). In at least one embodiment, the graphics acceleration module 1546 may optionally be integrated on a package or chip having the processor 1507.

[0279] In at least one embodiment, the processor 1507 includes a plurality of cores 1560A-1560D, each core having a translation lookaside buffer (“TLB”) 1561A-1561D and one or more caches 1562A-1562D. In at least one embodiment, the cores 1560A-1560D may include various other components (not shown) for executing instructions and processing data. In at least one embodiment, the caches 1562A-1562D may include level 1 (L1) and level 2 (L2) caches. Furthermore, one or more shared caches 1556 may be included in the caches 1562A-1562D and shared by the respective groups of cores 1560A-1560D. For example, one embodiment of the processor 1507 includes 24 cores, each core having its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, two adjacent cores share one or more L2 and L3 caches. In at least one embodiment, the processor 1507 and the graphics acceleration module 1546 are connected to a system memory 1514, which may include the processor memories 1501(1)-1501(M) of Figure 15A.
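
The 24-core example can be made concrete with a short sketch that assigns each pair of adjacent cores to one shared cache. Zero-based core numbering and the function names are assumptions made for illustration.

```python
def shared_cache_index(core_id, cores_per_cache=2):
    """Index of the shared L2 / L3 cache used by a core (adjacent cores pair up)."""
    return core_id // cores_per_cache


def build_cache_layout(num_cores=24, cores_per_cache=2):
    """Map each shared cache index to the list of cores that share it."""
    layout = {}
    for core in range(num_cores):
        layout.setdefault(shared_cache_index(core, cores_per_cache), []).append(core)
    return layout


if __name__ == "__main__":
    layout = build_cache_layout()
    print(len(layout), layout[0], layout[11])
```

With 24 cores and two cores per shared cache, the layout contains twelve entries, matching the twelve shared L2 caches and twelve shared L3 caches described above.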

[0280] In at least one embodiment, coherence of data and instructions stored in the various caches 1562A-1562D, 1556 and system memory 1514 is maintained via inter-core communication over the coherence bus 1564. In at least one embodiment, for example, each cache may have associated cache coherence logic / circuitry to communicate via the coherence bus 1564 in response to the detection of a read or write to a particular cache line. In at least one embodiment, a cache snooping protocol is implemented via the coherence bus 1564 to snoop on cache accesses.
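
A minimal sketch of a snooping protocol is given below, assuming a write-invalidate, write-through scheme, which is only one of several possible designs and is not asserted to be the protocol used on bus 1564. The class and method names are hypothetical.

```python
class CoherenceBus:
    """Shared bus that lets every attached cache observe writes by the others."""

    def __init__(self):
        self.memory = {}   # address -> value held in system memory
        self.caches = []

    def broadcast_write(self, writer, address):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)


class SnoopingCache:
    def __init__(self, bus):
        self.lines = {}    # address -> locally cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:                 # miss: fetch from memory
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_write(self, address)       # other caches snoop this write
        self.lines[address] = value
        self.bus.memory[address] = value              # write-through for simplicity

    def snoop_write(self, address):
        self.lines.pop(address, None)                 # invalidate any stale copy


if __name__ == "__main__":
    bus = CoherenceBus()
    core_a, core_b = SnoopingCache(bus), SnoopingCache(bus)
    core_a.write(0x40, 1)
    print(core_b.read(0x40))
    core_a.write(0x40, 2)
    print(core_b.read(0x40))
```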

[0281] In at least one embodiment, proxy circuitry 1525 communicatively couples graphics acceleration module 1546 to coherence bus 1564, thereby allowing graphics acceleration module 1546 to participate in cache coherence protocols as a peer of cores 1560A-1560D. Specifically, in at least one embodiment, interface 1535 provides connectivity to proxy circuitry 1525 via high-speed link 1540, and interface 1537 connects graphics acceleration module 1546 to high-speed link 1540.

[0282] In at least one embodiment, the accelerator integrated circuit 1536 provides cache management, memory access, context management, and interrupt management services for a plurality of graphics processing engines 1531(1)-1531(N) of the graphics acceleration module. In at least one embodiment, the graphics processing engines 1531(1)-1531(N) may each include a separate graphics processing unit (GPU). In at least one embodiment, the graphics processing engines 1531(1)-1531(N) may optionally include different types of graphics processing engines within the GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and blit engines. In at least one embodiment, the graphics acceleration module 1546 may be a GPU having a plurality of graphics processing engines 1531(1)-1531(N), or the graphics processing engines 1531(1)-1531(N) may be individual GPUs integrated on a general-purpose package, line card, or chip.

[0283] In at least one embodiment, the accelerator integrated circuit 1536 includes a memory management unit (MMU) 1539 for performing various memory management functions, such as virtual-to-physical memory translation (also known as effective-to-real memory translation), and a memory access protocol for accessing system memory 1514. In at least one embodiment, the MMU 1539 may also include a translation lookaside buffer (“TLB”) (not shown) for caching virtual / effective-to-physical / real address translations. In at least one embodiment, a cache 1538 may store commands and data for efficient access by the graphics processing engines 1531(1)-1531(N). In at least one embodiment, a fetch unit 1544 may be used to keep data stored in the cache 1538 and graphics memories 1533(1)-1533(M) consistent with core caches 1562A-1562D, 1556 and system memory 1514. As previously mentioned, this task can be accomplished via proxy circuitry 1525 on behalf of cache 1538 and graphics memories 1533(1)-1533(M) (e.g., sending updates related to the modification / access of cache lines on processor caches 1562A-1562D, 1556 to cache 1538 and receiving updates from cache 1538).
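
As a sketch under stated assumptions, virtual-to-physical translation with a TLB that caches prior translations might look as follows. The 4096-byte page size, the dictionary-based page table, and the class name are illustrative assumptions and are not details of MMU 1539.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class SimpleMMU:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cache of recently used translations
        self.tlb_hits = 0

    def translate(self, virtual_address):
        """Translate a virtual (effective) address to a physical (real) address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[page]
        else:
            frame = self.page_table[page]  # a KeyError here models a page fault
            self.tlb[page] = frame
        return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    mmu = SimpleMMU({0: 5, 1: 9})
    print(mmu.translate(4100), mmu.translate(4200), mmu.tlb_hits)
```

The second lookup falls in the same page as the first, so it is served from the TLB without consulting the page table.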

[0284] In at least one embodiment, a set of registers 1545 stores context data of threads executed by graphics processing engines 1531(1)-1531(N), and context management circuitry 1548 manages the thread context. For example, context management circuitry 1548 can perform save and restore operations to save and restore the context of individual threads during context switching (e.g., saving the first thread and storing the second thread so that the second thread can be executed by the graphics processing engine). For example, during context switching, context management circuitry 1548 can store the current register value to a designated area in memory (e.g., identified by a context pointer). The register value can then be restored when returning to the context. In at least one embodiment, interrupt management circuitry 1547 receives and processes interrupts received from system devices.
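
The save and restore operations performed during context switching can be sketched as below. The class name, the use of dictionaries for registers and for the save area, and the string context pointers are assumptions for illustration only.

```python
class ContextSwitcher:
    def __init__(self):
        self.registers = {}   # current register values of the running thread
        self.save_area = {}   # context pointer -> saved register snapshot

    def save(self, context_pointer):
        """Store the current register values in the area named by the pointer."""
        self.save_area[context_pointer] = dict(self.registers)

    def restore(self, context_pointer):
        """Reload register values; a thread never run before starts empty."""
        self.registers = dict(self.save_area.get(context_pointer, {}))

    def switch(self, old_pointer, new_pointer):
        self.save(old_pointer)
        self.restore(new_pointer)


if __name__ == "__main__":
    ctx = ContextSwitcher()
    ctx.registers = {"r0": 1}
    ctx.switch("thread1", "thread2")   # thread1 saved, thread2 starts fresh
    ctx.registers["r0"] = 7
    ctx.switch("thread2", "thread1")   # thread1's registers come back
    print(ctx.registers)
```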

[0285] In one implementation, MMU 1539 translates virtual / effective addresses from graphics processing engine 1531 into real / physical addresses in system memory 1514. In at least one embodiment, accelerator integrated circuit 1536 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 1546 and / or other accelerator devices. In at least one embodiment, graphics accelerator module 1546 may be dedicated to a single application executing on processor 1507, or may be shared among multiple applications. In at least one embodiment, a virtualized graphics execution environment is presented, wherein resources of graphics processing engines 1531(1)-1531(N) are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into “slices” based on processing requirements and priorities associated with VMs and / or applications, which are allocated to different VMs and / or applications.

[0286] In at least one embodiment, the accelerator integrated circuit 1536 acts as a bridge to the system for the graphics acceleration module 1546 and provides address translation and system memory caching services. Additionally, in at least one embodiment, the accelerator integrated circuit 1536 can provide virtualization facilities for the host processor to manage the virtualization, interrupt, and memory management of the graphics processing engines 1531(1)-1531(N).

[0287] In at least one embodiment, since the hardware resources of the graphics processing engines 1531(1)-1531(N) are explicitly mapped to the real address space seen by the host processor 1507, any host processor can directly address these resources using effective address values. In at least one embodiment, a function of the accelerator integrated circuit 1536 is to physically separate the graphics processing engines 1531(1)-1531(N) so that they appear as independent units to the system.

[0288] In at least one embodiment, one or more graphics memories 1533(1)-1533(M) are coupled to each graphics processing engine 1531(1)-1531(N), and N = M. In at least one embodiment, the graphics memories 1533(1)-1533(M) store instructions and data processed by each graphics processing engine 1531(1)-1531(N). In at least one embodiment, the graphics memories 1533(1)-1533(M) may be volatile memory, such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or may be non-volatile memory, such as 3D XPoint or Nano-RAM.

[0289] In one embodiment, to reduce data traffic on the high-speed link 1540, a biasing technique is used to ensure that the data stored in graphics memory 1533(1)-1533(M) is the data most frequently used by graphics processing engines 1531(1)-1531(N), and preferably data that cores 1560A-1560D do not use (or at least do not use frequently). Similarly, in at least one embodiment, the biasing mechanism attempts to keep the data needed by the cores (and preferably not by graphics processing engines 1531(1)-1531(N)) in caches 1562A-1562D, 1556 and system memory 1514.

[0290] Figure 15C illustrates another exemplary embodiment, in which the accelerator integrated circuit 1536 is integrated within the processor 1507. In this embodiment, the graphics processing engines 1531(1)-1531(N) communicate directly with the accelerator integrated circuit 1536 via a high-speed link 1540 through interfaces 1537 and 1535 (which may also be any form of bus or interface protocol). In at least one embodiment, the accelerator integrated circuit 1536 can perform operations similar to those described with respect to Figure 15B, but may have higher throughput due to its close proximity to the coherence bus 1564 and caches 1562A-1562D, 1556. One embodiment supports different programming models, including a dedicated process programming model (without graphics acceleration module virtualization) and a shared programming model (with virtualization), which may include a programming model controlled by accelerator integrated circuit 1536 and a programming model controlled by graphics acceleration module 1546.

[0291] In at least one embodiment, graphics processing engines 1531(1)-1531(N) are dedicated to a single application or process under a single operating system. In at least one embodiment, a single application can funnel requests from other applications to graphics processing engines 1531(1)-1531(N), thereby providing virtualization within a VM / partition.

[0292] In at least one embodiment, graphics processing engines 1531(1)-1531(N) can be shared by multiple VM / application partitions. In at least one embodiment, the shared model can use a hypervisor to virtualize graphics processing engines 1531(1)-1531(N) to allow each operating system to access them. In at least one embodiment, for a single-partition system without a hypervisor, the operating system owns graphics processing engines 1531(1)-1531(N). In at least one embodiment, the operating system can virtualize graphics processing engines 1531(1)-1531(N) to provide access to each process or application.

[0293] In at least one embodiment, the graphics acceleration module 1546 or the individual graphics processing engine 1531(1)-1531(N) uses a process handle to select a process element. In at least one embodiment, the process element is stored in system memory 1514 and can be addressed using the effective address to real address translation techniques described herein. In at least one embodiment, the process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engine 1531(1)-1531(N) (i.e., invoking system software to add the process element to the process element linked list). In at least one embodiment, the lower 16 bits of the process handle may be the offset of the process element in the process element linked list.
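As an illustration of the handle layout described above, the following minimal sketch extracts the process element offset from the lower 16 bits of a process handle. The function names and the way the upper bits are packed are assumptions made for illustration only; the upper bits remain implementation-specific.

```python
# Lower 16 bits of the handle carry the offset of the process element
# within the process element linked list (per the paragraph above).
PROCESS_ELEMENT_OFFSET_MASK = 0xFFFF


def process_element_offset(process_handle: int) -> int:
    """Return the process element offset encoded in the low 16 bits."""
    return process_handle & PROCESS_ELEMENT_OFFSET_MASK


def make_process_handle(upper_bits: int, offset: int) -> int:
    """Build a handle from hypothetical upper bits and a 16-bit offset.

    The meaning of the upper bits is an assumption; the description only
    states that the handle is an implementation-specific value.
    """
    if not 0 <= offset <= PROCESS_ELEMENT_OFFSET_MASK:
        raise ValueError("offset must fit in 16 bits")
    if upper_bits < 0:
        raise ValueError("upper_bits must be non-negative")
    return (upper_bits << 16) | offset
```

For example, a handle built with `make_process_handle(0xABCD, 0x0120)` yields the offset `0x0120` when passed to `process_element_offset`.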

[0294] Figure 15D illustrates an exemplary accelerator integration slice 1590. In at least one embodiment, a "slice" comprises a designated portion of the processing resources of the accelerator integrated circuit 1536. In at least one embodiment, an application effective address space 1582 in system memory 1514 stores process element 1583. In at least one embodiment, process element 1583 is stored in response to a GPU call 1581 from an application 1580 executing on processor 1507. In at least one embodiment, process element 1583 contains the process state of the corresponding application 1580. In one embodiment, a work descriptor (WD) 1584 contained in process element 1583 may be a single job requested by the application, or it may contain a pointer to a queue of jobs. In at least one embodiment, WD 1584 is a pointer to a job request queue in the effective address space 1582 of the application.

[0295] In at least one embodiment, the graphics acceleration module 1546 and / or the various graphics processing engines 1531(1)-1531(N) may be shared by all processes or subsets of processes in the system. In at least one embodiment, infrastructure may be included for setting process states and sending WD 1584 to the graphics acceleration module 1546 to begin operations in a virtualized environment.

[0296] In at least one embodiment, the dedicated-process programming model is implementation-specific. In at least one embodiment, in this model, a single process owns either the graphics acceleration module 1546 or an individual graphics processing engine 1531. In at least one embodiment, because the graphics acceleration module 1546 is owned by a single process, the hypervisor initializes the accelerator integrated circuit 1536 for the owning partition, and the operating system initializes the accelerator integrated circuit 1536 for the owning process when the graphics acceleration module 1546 is assigned.

[0297] In at least one embodiment, during operation, the WD fetch unit 1591 in the accelerator integration slice 1590 fetches the next WD 1584, which includes an indication of the work to be performed by one or more graphics processing engines of the graphics acceleration module 1546. In at least one embodiment, data from the WD 1584 may be stored in registers 1545 and used by MMU 1539, interrupt management circuitry 1547, and / or context management circuitry 1548, as shown. For example, one embodiment of MMU 1539 includes segment / page walk circuitry for accessing segment / page tables 1586 within the OS virtual address space 1585. In at least one embodiment, interrupt management circuitry 1547 may process interrupt events 1592 received from the graphics acceleration module 1546. In at least one embodiment, when performing graphics operations, an effective address 1593 generated by graphics processing engines 1531(1)-1531(N) is translated into a real address by MMU 1539.
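The effective-to-real address translation performed by MMU 1539 can be pictured with a minimal sketch, assuming a single-level page table and a 4 KiB page size. Neither assumption comes from this description, which refers more generally to segment / page tables 1586; the names `translate` and `TranslationFault` are hypothetical.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size, not specified in the description


class TranslationFault(Exception):
    """Raised when no mapping exists for the requested page (hypothetical)."""


def translate(effective_address: int, page_table: Dict[int, int]) -> int:
    """Translate an effective address to a real address.

    page_table maps effective page numbers to real page numbers; the
    offset within the page is carried over unchanged.
    """
    page_number, offset = divmod(effective_address, PAGE_SIZE)
    if page_number not in page_table:
        raise TranslationFault(hex(effective_address))
    return page_table[page_number] * PAGE_SIZE + offset
```

A real MMU would walk multi-level segment and page tables and cache the result; the dictionary lookup here only stands in for that walk.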

[0298] In one embodiment, the same set of registers 1545 is duplicated for each graphics processing engine 1531(1)-1531(N) and / or graphics acceleration module 1546, and may be initialized by a hypervisor or operating system. In at least one embodiment, each of these duplicated registers may be included in accelerator integration slice 1590. Exemplary registers that may be initialized by a hypervisor are shown in Table 1.

[0299] Table 1 – Hypervisor-Initialized Registers

[0300]

[0301] Table 2 shows exemplary registers that can be initialized by the operating system.

[0302] Table 2 – Operating System-Initialized Registers

[0303]

[0304] In at least one embodiment, each WD 1584 is specific to a particular graphics acceleration module 1546 and / or graphics processing engine 1531(1)-1531(N). In at least one embodiment, it contains all the information required for the graphics processing engine 1531(1)-1531(N) to complete its work, or it may be a pointer to a memory location where the application has set up a command queue for the work to be completed.

[0305] Figure 15E shows additional details of an exemplary embodiment of the shared model. This embodiment includes a hypervisor real address space 1598, in which a list of process elements 1599 is stored. In at least one embodiment, the hypervisor real address space 1598 can be accessed via a hypervisor 1596, which virtualizes the graphics acceleration module engines for an operating system 1595.

[0306] In at least one embodiment, the shared programming models allow all processes or a subset of processes from all partitions or a subset of partitions in the system to use the graphics acceleration module 1546. In at least one embodiment, there are two programming models in which the graphics acceleration module 1546 is shared by multiple processes and partitions, namely, time-sliced sharing and graphics-directed sharing.

[0307] In at least one embodiment, in this model, the hypervisor 1596 owns the graphics acceleration module 1546 and makes its functionality available to all operating systems 1595. In at least one embodiment, for the graphics acceleration module 1546 to support virtualization by the hypervisor 1596, the graphics acceleration module 1546 may comply with certain requirements, such as: (1) the job requests of an application must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 1546 must provide a context save and restore mechanism; (2) the graphics acceleration module 1546 guarantees that the job requests of an application complete within a specified amount of time, including any translation faults, or the graphics acceleration module 1546 provides the ability to preempt job processing; and (3) the graphics acceleration module 1546 must be guaranteed fairness between processes when operating in the directed shared programming model.

[0308] In at least one embodiment, application 1580 is required to make a system call to operating system 1595 with a graphics acceleration module type, a work descriptor (WD), an authority mask register (AMR) value, and a context save / restore area pointer (CSRP). In at least one embodiment, the graphics acceleration module type describes the target acceleration function for the system call. In at least one embodiment, the graphics acceleration module type can be a system-specific value. In at least one embodiment, the WD is formatted specifically for graphics acceleration module 1546 and can take the form of a graphics acceleration module 1546 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by graphics acceleration module 1546.

[0309] In at least one embodiment, the AMR value is the AMR state to use for the current process. In at least one embodiment, the value passed to the operating system is similar to an application setting the AMR. In at least one embodiment, if the implementations of the accelerator integrated circuit 1536 (not shown) and the graphics acceleration module 1546 do not support a User Authority Mask Override Register (UAMOR), the operating system may apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. In at least one embodiment, the hypervisor 1596 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element 1583. In at least one embodiment, the CSRP is one of the registers 1545 and contains the effective address of an area in the effective address space 1582 of the application in which the graphics acceleration module 1546 saves and restores context state. In at least one embodiment, this pointer is optional if no state needs to be saved between jobs or when a job is preempted. In at least one embodiment, the context save / restore area may be pinned system memory.

[0310] Upon receiving a system call, the operating system 1595 can verify that the application 1580 has been registered and granted permission to use the graphics acceleration module 1546. Then, in at least one embodiment, the operating system 1595 uses the information shown in Table 3 to invoke the hypervisor 1596.

[0311] Table 3 – Operating System to Hypervisor Call Parameters

[0312]

[0313] In at least one embodiment, upon receiving a hypervisor call, hypervisor 1596 verifies that operating system 1595 has been registered and granted permission to use graphics acceleration module 1546. Then, in at least one embodiment, hypervisor 1596 adds process element 1583 to a linked list of process elements of the corresponding graphics acceleration module 1546 type. In at least one embodiment, the process element may include the information shown in Table 4.

[0314] Table 4 – Process Element Information

[0315]

[0316]

[0317] In at least one embodiment, the hypervisor initializes multiple accelerator integration slice 1590 registers 1545.

[0318] As shown in Figure 15F, in at least one embodiment, a unified memory is used, which is addressable via a common virtual memory address space for accessing physical processor memories 1501(1)-1501(N) and GPU memories 1520(1)-1520(N). In this implementation, operations performed on GPUs 1510(1)-1510(N) use the same virtual / effective memory address space to access processor memories 1501(1)-1501(M) and vice versa, thereby simplifying programmability. In at least one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 1501(1), a second portion to second processor memory 1501(N), a third portion to GPU memory 1520(1), and so on. In at least one embodiment, the entire virtual / effective memory space (sometimes referred to as the effective address space) is thus distributed across each of processor memories 1501 and GPU memories 1520, allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
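The partitioning of one virtual / effective address space across processor and GPU memories can be sketched as a simple range lookup. The region boundaries, sizes, and the function name `backing_memory` below are illustrative assumptions; the description does not specify how the portions are laid out.

```python
from bisect import bisect_right
from typing import List, Tuple

# (base address, size in bytes, backing physical memory).
# Bases and sizes are made-up values for illustration only.
REGIONS: List[Tuple[int, int, str]] = [
    (0x0000_0000, 0x4000_0000, "processor memory 1501(1)"),
    (0x4000_0000, 0x4000_0000, "processor memory 1501(N)"),
    (0x8000_0000, 0x2000_0000, "GPU memory 1520(1)"),
    (0xA000_0000, 0x2000_0000, "GPU memory 1520(N)"),
]
_BASES = [base for base, _, _ in REGIONS]  # sorted ascending


def backing_memory(virtual_address: int) -> str:
    """Return which physical memory backs a virtual / effective address."""
    index = bisect_right(_BASES, virtual_address) - 1
    if index < 0:
        raise ValueError("address below the mapped range")
    base, size, name = REGIONS[index]
    if virtual_address >= base + size:
        raise ValueError("address is not mapped")
    return name
```

Because every processor and GPU consults the same mapping, the same address resolves to the same physical memory regardless of which device issues the access.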

[0319] In one embodiment, the bias / coherence management circuitry 1594A-1594E within one or more MMUs 1539A-1539E ensures cache coherence between the caches of one or more host processors (e.g., 1505) and GPUs 1510, and implements biasing techniques that indicate the physical memory in which certain types of data should be stored. In at least one embodiment, although several instances of bias / coherence management circuitry 1594A-1594E are shown in Figure 15F, bias / coherence circuitry can be implemented within the MMU of one or more host processors 1505 and / or within the accelerator integrated circuit 1536.

[0320] One embodiment allows GPU memory 1520 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology without suffering the performance drawbacks associated with full system cache coherence. In at least one embodiment, the ability to access GPU memory 1520 as system memory without heavy cache coherence overhead provides a favorable operating environment for GPU offloading. In at least one embodiment, this arrangement allows software on the host processor 1505 to set up operands and access computation results without the overhead of conventional I / O DMA data copies. In at least one embodiment, such conventional copies involve driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are inefficient relative to simple memory accesses. In at least one embodiment, the ability to access GPU memory 1520 without cache coherence overhead may be critical to the execution time of an offloaded computation. In at least one embodiment, for example, in cases with substantial streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPU 1510. In at least one embodiment, the efficiency of operand setup, the efficiency of result access, and the efficiency of GPU computation may each play a role in determining the effectiveness of GPU offloading.

[0321] In at least one embodiment, the selection between GPU bias and host processor bias is driven by a bias tracker data structure. In at least one embodiment, for example, a bias table can be used, which may be a page-granular structure (i.e., controlled at the granularity of a memory page) comprising one or two bits per GPU-attached memory page. In at least one embodiment, the bias table can be implemented in one or more stolen memory ranges of GPU memory 1520, with or without a bias cache in GPU 1510 (e.g., for caching frequently / recently used entries of the bias table). Alternatively, in at least one embodiment, the entire bias table can be maintained within the GPU.

[0322] In at least one embodiment, the bias table entry associated with each access to GPU-attached memory 1520 is consulted before the actual access to GPU memory, resulting in the following operations. In at least one embodiment, local requests from GPU 1510 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 1520. In at least one embodiment, local requests from the GPU that find their page in host bias are forwarded to processor 1505 (e.g., via the high-speed link described herein). In at least one embodiment, requests from processor 1505 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to GPU 1510. In at least one embodiment, the GPU may then transition the page to host processor bias if it is not currently using the page. In at least one embodiment, the bias state of a page can be changed by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism.
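The routing decisions listed above can be summarized in a short sketch. It assumes a one-bit-per-page bias table held in a dictionary, a 4 KiB page size, and host bias as the default for untracked pages; these choices, along with the names `Bias` and `route_request`, are illustrative assumptions rather than details from this description.

```python
from enum import Enum
from typing import Dict

PAGE_SIZE = 4096  # assumed page size


class Bias(Enum):
    HOST = 0
    GPU = 1


def route_request(requester: str, address: int,
                  bias_table: Dict[int, Bias]) -> str:
    """Decide where a memory request is forwarded based on page bias.

    requester is "gpu" or "host". Pages missing from the table are
    treated as host-biased, which is an assumption of this sketch.
    """
    bias = bias_table.get(address // PAGE_SIZE, Bias.HOST)
    if requester == "gpu":
        # GPU-biased page: go straight to local GPU memory.
        # Host-biased page: forward to the host processor over the link.
        return "gpu_memory" if bias is Bias.GPU else "host_processor"
    if requester == "host":
        # Host-biased page: completes like a normal memory read.
        # GPU-biased page: forward the request to the GPU.
        return "normal_memory_read" if bias is Bias.HOST else "gpu"
    raise ValueError("requester must be 'gpu' or 'host'")
```

The lookup happens before the memory access itself, so the common case (a GPU touching a GPU-biased page) never crosses the link to the host.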

[0323] In at least one embodiment, one mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver. The device driver then sends a message (or enqueues a command descriptor) to the GPU, directing it to change the bias state and, for some transitions, to perform a cache flush operation on the host. In at least one embodiment, the cache flush operation is used for a transition from host processor 1505 bias to GPU bias, but not for the reverse transition.
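The asymmetry in the host-side cache operation can be sketched as follows. The function `change_bias`, its string-valued bias states, and the `flush_host_cache` callback are hypothetical names introduced for illustration; a real driver would issue a message or command descriptor to the GPU instead of updating a dictionary.

```python
from typing import Callable, Dict


def change_bias(page: int, new_bias: str, bias_table: Dict[int, str],
                flush_host_cache: Callable[[int], None]) -> bool:
    """Change the bias state of a page; return True if it changed.

    Bias states are "host" or "gpu". The host cache operation is invoked
    only for a host-to-GPU transition, not for the reverse.
    """
    if new_bias not in ("host", "gpu"):
        raise ValueError("new_bias must be 'host' or 'gpu'")
    old_bias = bias_table.get(page, "host")  # default is an assumption
    if old_bias == new_bias:
        return False
    if old_bias == "host" and new_bias == "gpu":
        flush_host_cache(page)
    bias_table[page] = new_bias
    return True
```

Passing the cache operation as a callback keeps the sketch testable without modelling the host cache itself.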

[0324] In one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 1505. In at least one embodiment, to access these pages, the processor 1505 may request access from the GPU 1510, which may or may not grant access right away. Therefore, in at least one embodiment, to reduce communication between the processor 1505 and the GPU 1510, it is beneficial to ensure that GPU-biased pages are those required by the GPU but not by the host processor 1505, and vice versa.

[0325] One or more hardware structures 715 are used to perform one or more embodiments. Details regarding the one or more hardware structures 715 may be provided herein in conjunction with Figure 7A and / or Figure 7B.

[0326] Figure 16 illustrates exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.

[0327] Figure 16 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 1600 that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the integrated circuit 1600 includes one or more application processors 1605 (e.g., CPUs), at least one graphics processor 1610, and may additionally include an image processor 1615 and / or a video processor 1620, any of which may be a modular IP core. In at least one embodiment, the integrated circuit 1600 includes peripheral or bus logic, including a USB controller 1625, a UART controller 1630, an SPI / SDIO controller 1635, and an I2S / I2C controller 1640. In at least one embodiment, integrated circuit 1600 may include a display device 1645 coupled to one or more of a High-Definition Multimedia Interface (HDMI) controller 1650 and a Mobile Industry Processor Interface (MIPI) display interface 1655. In at least one embodiment, storage may be provided by a flash memory subsystem 1660, including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 1665 for access to SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits additionally include an embedded security engine 1670.

[0328] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in integrated circuit 1600 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0329] In at least one embodiment, inference and / or training logic 715 may be used in integrated circuit 1600 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0330] Figure 17A and Figure 17B illustrate exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.

[0331] Figure 17A and Figure 17B are block diagrams illustrating exemplary graphics processors for use within an SoC, according to embodiments described herein. Figure 17A shows an exemplary graphics processor 1710 of a system-on-a-chip integrated circuit that can be fabricated using one or more IP cores, according to at least one embodiment. Figure 17B shows a further exemplary graphics processor 1740 of a system-on-a-chip integrated circuit that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the graphics processor 1710 of Figure 17A is a low-power graphics processor core. In at least one embodiment, the graphics processor 1740 of Figure 17B is a higher-performance graphics processor core. In at least one embodiment, each of the graphics processors 1710, 1740 may be a variant of the graphics processor 1610 of Figure 16.

[0332] In at least one embodiment, the graphics processor 1710 includes a vertex processor 1705 and one or more fragment processors 1715A-1715N (e.g., 1715A, 1715B, 1715C, 1715D to 1715N-1 and 1715N). In at least one embodiment, the graphics processor 1710 can execute different shader programs via separate logic, such that the vertex processor 1705 is optimized to perform operations for vertex shader programs, while the one or more fragment processors 1715A-1715N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, the vertex processor 1705 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, the one or more fragment processors 1715A-1715N use the primitive and vertex data generated by the vertex processor 1705 to produce a framebuffer for display on a display device. In at least one embodiment, the one or more fragment processors 1715A-1715N are optimized to execute fragment shader programs as provided for in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs as provided for in the Direct 3D API.

[0333] In at least one embodiment, the graphics processor 1710 additionally includes one or more memory management units (MMUs) 1720A-1720B, one or more caches 1725A-1725B, and one or more circuit interconnects 1730A-1730B. In at least one embodiment, the one or more MMUs 1720A-1720B provide virtual-to-physical address mapping for the graphics processor 1710, including for the vertex processor 1705 and / or fragment processors 1715A-1715N, which may reference vertex or image / texture data stored in memory in addition to vertex or image / texture data stored in the one or more caches 1725A-1725B. In at least one embodiment, the one or more MMUs 1720A-1720B can be synchronized with other MMUs within the system, including one or more MMUs associated with the one or more application processors 1605, image processor 1615, and / or video processor 1620 of Figure 16, so that each processor 1605-1620 can participate in a shared or unified virtual memory system. In at least one embodiment, the one or more circuit interconnects 1730A-1730B enable the graphics processor 1710 to interface with other IP cores within the SoC, either via the SoC's internal bus or via a direct connection.

[0334] In at least one embodiment, the graphics processor 1740 includes one or more shader cores 1755A-1755N (e.g., 1755A, 1755B, 1755C, 1755D, 1755E, 1755F to 1755N-1 and 1755N), as shown in Figure 17B, which provide a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and / or compute shaders. In at least one embodiment, the number of shader cores can vary. In at least one embodiment, the graphics processor 1740 includes an inter-core task manager 1745, which acts as a thread dispatcher to dispatch execution threads to the one or more shader cores 1755A-1755N, and a tiling unit 1758 to accelerate tile-based rendering, in which the rendering operations for a scene are subdivided in image space, for example, to exploit local spatial coherence within the scene or to optimize the use of internal caches.

[0335] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the integrated circuits of Figure 17A and / or Figure 17B for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0336] In at least one embodiment, inference and / or training logic 715 may be used in the integrated circuits of Figure 17A and / or Figure 17B for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0337] Figures 18A-18B illustrate additional exemplary graphics processor logic according to embodiments described herein. In at least one embodiment, Figure 18A shows a graphics core 1800 that may be included within the graphics processor 1610 of Figure 16 and that, in at least one embodiment, may be one of the unified shader cores 1755A-1755N shown in Figure 17B. In at least one embodiment, Figure 18B shows a highly parallel general-purpose graphics processing unit ("GPGPU") 1830 suitable for deployment on a multi-chip module.

[0338] In at least one embodiment, the graphics core 1800 includes a shared instruction cache 1802, texture units 1818, and cache / shared memory 1820, which are common to the execution resources within the graphics core 1800. In at least one embodiment, the graphics core 1800 may include multiple slices 1801A-1801N or partitions of each core, and the graphics processor may include multiple instances of the graphics core 1800. In at least one embodiment, slices 1801A-1801N may include supporting logic, including local instruction caches 1804A-1804N, thread schedulers 1806A-1806N, thread dispatchers 1808A-1808N, and a set of registers 1810A-1810N. In at least one embodiment, slices 1801A-1801N may include a set of additional functional units (AFU 1812A-1812N), floating-point units (FPU 1814A-1814N), integer arithmetic logic units (ALU 1816A-1816N), address calculation units (ACU 1813A-1813N), double-precision floating-point units (DPFPU 1815A-1815N), and matrix processing units (MPU 1817A-1817N).

[0339] In at least one embodiment, the FPUs 1814A-1814N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPUs 1815A-1815N perform double-precision (64-bit) floating-point operations. In at least one embodiment, the ALUs 1816A-1816N can perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. In at least one embodiment, the MPUs 1817A-1817N can also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, the MPUs 1817A-1817N can perform a variety of matrix operations to accelerate machine learning application frameworks, including support for accelerated general matrix-to-matrix multiplication (GEMM). In at least one embodiment, the AFUs 1812A-1812N can perform additional logic operations not supported by the floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
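To make the idea of a mixed-precision matrix operation concrete, the following sketch multiplies two matrices whose elements are first rounded to half precision while the running sum is kept in single precision. This pairing of input and accumulator precisions is an assumption chosen for illustration; the description does not state which combinations the matrix processing units use. The helper names are hypothetical, and the rounding relies on the standard `struct` formats `e` (half) and `f` (single).

```python
import struct
from typing import List

Matrix = List[List[float]]


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def gemm_mixed(a: Matrix, b: Matrix) -> Matrix:
    """Compute a @ b with half-precision inputs and single-precision sums."""
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                # Inputs are narrowed to half precision; the accumulator
                # is rounded to single precision after every addition.
                acc = to_single(acc + to_half(a[i][k]) * to_half(b[k][j]))
            out[i][j] = acc
    return out
```

Keeping the accumulator wider than the inputs limits the rounding error that would otherwise build up over a long inner dimension.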

[0340] Inference and / or training logic 715 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 715 are provided herein in conjunction with Figure 7A and / or Figure 7B. In at least one embodiment, the inference and / or training logic 715 may be used in the graphics core 1800 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0341] In at least one embodiment, inference and / or training logic 715 may be used in the graphics core 1800 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.

[0342] Figure 18B illustrates a general-purpose graphics processing unit (GPGPU) 1830 that, in at least one embodiment, can be configured to enable highly parallel compute operations to be performed by an array of graphics processing units. In at least one embodiment, the GPGPU 1830 can be linked directly to other instances of the GPGPU 1830 to create a multi-GPU cluster that improves training speed for deep neural networks. In at least one embodiment, the GPGPU 1830 includes a host interface 1832 for connection to a host processor. In at least one embodiment, the host interface 1832 is a PCI Express interface. In at least one embodiment, the host interface 1832 may be a vendor-specific communication interface or communication fabric. In at least one embodiment, the GPGPU 1830 receives commands from the host processor and uses a global scheduler 1834 to distribute execution threads associated with those commands to a set of compute clusters 1836A-1836H. In at least one embodiment, compute clusters 1836A-1836H share a cache memory 1838. In at least one embodiment, cache memory 1838 can serve as a higher-level cache for the cache memories within compute clusters 1836A-1836H.

[0343] In at least one embodiment, the GPGPU 1830 includes memories 1844A-1844B, which are coupled to the computing cluster 1836A-1836H via a set of memory controllers 1842A-1842B. In at least one embodiment, memories 1844A-1844B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), which includes graphics double data rate (GDDR) memory.

[0344] In at least one embodiment, each of the compute clusters 1836A-1836H includes a set of graphics cores, such as the graphics core 1800 of Figure 18A, which may include multiple types of integer and floating-point logic units that can perform computational operations at a range of precisions, including precisions suited to machine learning computations. For example, in at least one embodiment, at least a subset of the floating-point units in each of the compute clusters 1836A-1836H may be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units may be configured to perform 64-bit floating-point operations.

[0345] In at least one embodiment, multiple instances of the GPGPU 1830 can be configured to operate as a compute cluster. In at least one embodiment, the communication used by the compute clusters 1836A-1836H for synchronization and data exchange varies between embodiments. In at least one embodiment, the multiple instances of the GPGPU 1830 communicate via the host interface 1832. In at least one embodiment, the GPGPU 1830 includes an I / O hub 1839 that couples the GPGPU 1830 to a GPU link 1840, enabling a direct connection to other instances of the GPGPU 1830. In at least one embodiment, the GPU link 1840 is coupled to a dedicated GPU-to-GPU bridge that enables communication and synchronization between the multiple instances of the GPGPU 1830. In at least one embodiment, the GPU link 1840 is coupled to a high-speed interconnect for sending data to and receiving data from other GPGPUs or parallel processors. In at least one embodiment, the multiple instances of the GPGPU 1830 reside in separate data processing systems and communicate via a network device accessible through the host interface 1832. In at least one embodiment, the GPU link 1840 may be configured to enable a connection to a host processor in addition to, or as an alternative to, the host interface 1832.

[0346] In at least one embodiment, the GPGPU 1830 can be configured to train a neural network. In at least one embodiment, the GPGPU 1830 can be used within an inference platform. In at least one embodiment, when using the GPGPU 1830 for inference, the GPGPU 1830 may include fewer compute clusters 1836A-1836H compared to when using the GPGPU 1830 to train a neural network. In at least one embodiment, the memory technology associated with the memories 1844A-1844B can differ between inference and training configurations, with higher bandwidth memory technology dedicated to the training configuration. In at least one embodiment, the inference configuration of the GPGPU 1830 can support inference-specific instructions. For example, in at least one embodiment, the inference configuration can provide support for one or more 8-bit integer dot product instructions, which can be used during the inference operation of the deployed neural network.
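
As a rough illustration of what an 8-bit integer dot product instruction computes, the sketch below multiplies signed 8-bit values pairwise and accumulates the products into a 32-bit integer. The four-element width, the symmetric quantization helper, the scale of 1/64, and all function names are assumptions made for this example; the text does not specify them.

```python
def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers using a hypothetical symmetric scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def wrap_int32(x):
    """Wrap a Python int into the signed 32-bit two's-complement range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def dot4_int8(a, b, acc=0):
    """Four-way int8 dot product accumulated into a 32-bit integer."""
    assert len(a) == len(b) == 4
    for x, y in zip(a, b):
        assert -128 <= x <= 127 and -128 <= y <= 127
        acc += x * y
    return wrap_int32(acc)


def dot_int8(a, b):
    """Dot product of longer int8 vectors, built from four-element steps."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = dot4_int8(a[i:i + 4], b[i:i + 4], acc)
    return acc


scale = 1 / 64  # assumed quantization step
weights = quantize_int8([0.5, -0.25, 1.0, 0.75], scale)
activations = quantize_int8([1.0, 1.0, -0.5, 0.25], scale)
result = dot4_int8(weights, activations)
dequantized = result * scale * scale  # back to the floating-point domain
```

With these inputs the integer result dequantizes to the same value as the floating-point dot product, because every input is exactly representable at the assumed scale; real deployments generally incur some quantization error.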

[0347] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in the GPGPU 1830 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0348] In at least one embodiment, inference and/or training logic 2 may be used in the GPGPU 1830 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0349] Figure 19 is a block diagram of a computer system 1900 according to at least one embodiment. In at least one embodiment, the computer system 1900 includes a processing subsystem 1901 having one or more processors 1902 and a system memory 1904 communicating via an interconnect path that may include a memory hub 1905. In at least one embodiment, the memory hub 1905 may be a separate component within a chipset assembly or may be integrated within one or more processors 1902. In at least one embodiment, the memory hub 1905 is coupled to an I/O subsystem 1911 via a communication link 1906. In at least one embodiment, the I/O subsystem 1911 includes an I/O hub 1907 that enables the computer system 1900 to receive input from one or more input devices 1908. In at least one embodiment, the I/O hub 1907 enables a display controller, which may be included in one or more processors 1902, to provide output to one or more display devices 1910A. In at least one embodiment, one or more display devices 1910A coupled to the I/O hub 1907 may include local, internal, or embedded display devices.

[0350] In at least one embodiment, the processing subsystem 1901 includes one or more parallel processors 1912 coupled to the memory hub 1905 via a bus or other communication link 1913. In at least one embodiment, the communication link 1913 may use any of many standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or may be a vendor-specific communication interface or communication architecture. In at least one embodiment, one or more parallel processors 1912 form a computationally focused parallel or vector processing system, which may include a large number of processing cores and/or processing clusters, such as a many integrated core (MIC) processor. In at least one embodiment, one or more parallel processors 1912 form a graphics processing subsystem that can output pixels to one or more display devices 1910A coupled via the I/O hub 1907. In at least one embodiment, the parallel processors 1912 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 1910B.

[0351] In at least one embodiment, a system storage unit 1914 may be connected to the I/O hub 1907 to provide a storage mechanism for the computer system 1900. In at least one embodiment, an I/O switch 1916 may be used to provide an interface mechanism that enables connections between the I/O hub 1907 and other components, such as a network adapter 1918 and/or a wireless network adapter 1919 that may be integrated into the platform, and various other devices that can be added via one or more add-in devices 1920. In at least one embodiment, the network adapter 1918 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, the wireless network adapter 1919 may include one or more of a Wi-Fi, Bluetooth, Near Field Communication (NFC), or other network device that includes one or more wireless radios.

[0352] In at least one embodiment, the computer system 1900 may include other components not explicitly shown, such as USB or other port connections, optical storage drives, video capture devices, etc., which may also be connected to the I/O hub 1907. In at least one embodiment, the communication paths interconnecting the various components in Figure 19 can be implemented using any suitable protocol, such as PCI-based protocols (e.g., PCI-Express) or other bus or point-to-point communication interfaces and/or protocols, such as the NV-Link high-speed interconnect or interconnect protocols.

[0353] In at least one embodiment, one or more parallel processors 1912 include circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In at least one embodiment, the parallel processor 1912 includes circuitry optimized for general-purpose processing. In at least one embodiment, components of the computer system 1900 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, the parallel processor 1912, memory hub 1905, processor 1902, and I/O hub 1907 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computer system 1900 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computer system 1900 may be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computer system.

[0354] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in the system 1900 of Figure 19 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0355] In at least one embodiment, inference and/or training logic 2 may be used in the system 1900 of Figure 19 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0356] processor

[0357] Figure 20A illustrates a parallel processor 2000 according to at least one embodiment. In at least one embodiment, various components of the parallel processor 2000 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). In at least one embodiment, the illustrated parallel processor 2000 is a variant of the one or more parallel processors 1912 shown in Figure 19, according to an exemplary embodiment.

[0358] In at least one embodiment, the parallel processor 2000 includes a parallel processing unit 2002. In at least one embodiment, the parallel processing unit 2002 includes an I / O unit 2004 that enables communication with other devices, including other instances of the parallel processing unit 2002. In at least one embodiment, the I / O unit 2004 can be directly connected to other devices. In at least one embodiment, the I / O unit 2004 is connected to other devices using a hub or switch interface (e.g., a memory hub 2005). In at least one embodiment, the connection between the memory hub 2005 and the I / O unit 2004 forms a communication link 2013. In at least one embodiment, the I / O unit 2004 is connected to a host interface 2006 and a memory crossbar 2016, wherein the host interface 2006 receives commands for performing processing operations, and the memory crossbar 2016 receives commands for performing memory operations.

[0359] In at least one embodiment, when host interface 2006 receives a command buffer via I/O unit 2004, host interface 2006 can direct the work operations for executing those commands to front end 2008. In at least one embodiment, front end 2008 is coupled to scheduler 2010, which is configured to assign commands or other work items to processing cluster array 2012. In at least one embodiment, scheduler 2010 ensures that processing cluster array 2012 is correctly configured and in an active state before assigning tasks to processing cluster array 2012. In at least one embodiment, scheduler 2010 is implemented via firmware logic executed on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 2010 can be configured to perform complex scheduling and work assignment operations at both coarse and fine granularity, thereby enabling fast preemption and context switching of threads executing on processing array 2012. In at least one embodiment, host software can submit workloads for scheduling on processing array 2012 via one of multiple graphics processing paths. In at least one embodiment, the workloads can then be automatically distributed across processing array 2012 by scheduler 2010 logic within the microcontroller that includes scheduler 2010.

[0360] In at least one embodiment, the processing cluster array 2012 may include up to "N" processing clusters (e.g., clusters 2014A, 2014B to 2014N), where "N" represents a positive integer (which may be an integer different from the integer "N" used in other diagrams). In at least one embodiment, each cluster 2014A-2014N of the processing cluster array 2012 can execute a large number of concurrent threads. In at least one embodiment, the scheduler 2010 may use various scheduling and / or work allocation algorithms to allocate work to the clusters 2014A-2014N of the processing cluster array 2012, which may vary depending on the workload generated by each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 2010, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing cluster array 2012. In at least one embodiment, the different clusters 2014A-2014N of the processing cluster array 2012 may be assigned to process different types of programs or to perform different types of computations.

[0361] In at least one embodiment, the processing cluster array 2012 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing cluster array 2012 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing cluster array 2012 may include logic for performing processing tasks, including filtering video and/or audio data, performing modeling operations (including physics operations), and performing data transformations.

[0362] In at least one embodiment, the processing cluster array 2012 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing cluster array 2012 may include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing cluster array 2012 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 2002 may transfer data from system memory via I / O unit 2004 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 2022) and then written back to system memory.

[0363] In at least one embodiment, when the parallel processing unit 2002 is used to perform graphics processing, the scheduler 2010 may be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among the multiple clusters 2014A-2014N of the processing cluster array 2012. In at least one embodiment, portions of the processing cluster array 2012 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 2014A-2014N may be stored in a buffer to allow intermediate data to be transferred between the clusters 2014A-2014N for further processing.
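
The idea of dividing a workload into tasks of approximately equal size can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_workload`, the representation of a task as a half-open index range, and the policy of giving leftover items to the lowest-numbered clusters are hypothetical and are not taken from the scheduler 2010 itself.

```python
def split_workload(num_items, num_clusters):
    """Return (start, end) ranges, one per cluster, whose sizes differ by at most one item."""
    base, extra = divmod(num_items, num_clusters)
    ranges = []
    start = 0
    for cluster in range(num_clusters):
        # The first `extra` clusters each take one additional item.
        size = base + (1 if cluster < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Ten work items spread over four clusters.
tasks = split_workload(num_items=10, num_clusters=4)
for cluster, (start, end) in enumerate(tasks):
    print(f"cluster {cluster}: items {start}..{end - 1}")
```

Keeping task sizes within one item of each other avoids the case where a single cluster finishes long after the others and leaves them idle.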

[0364] In at least one embodiment, the processing cluster array 2012 may receive processing tasks to be executed via a scheduler 2010, which receives commands defining the processing tasks from a front end 2008. In at least one embodiment, a processing task may include indices of data to be processed, such as surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program to execute). In at least one embodiment, the scheduler 2010 may be configured to fetch the indices corresponding to a task, or may receive the indices from the front end 2008. In at least one embodiment, the front end 2008 may be configured to ensure that the processing cluster array 2012 is configured to be active before initiating the workload specified by an incoming command buffer (e.g., a batch buffer, push buffer, etc.).

[0365] In at least one embodiment, each of one or more instances of the parallel processing unit 2002 may be coupled to the parallel processor memory 2022. In at least one embodiment, the parallel processor memory 2022 may be accessed via a memory crossbar switch 2016, which may receive memory requests from the processing cluster array 2012 and the I/O unit 2004. In at least one embodiment, the memory crossbar switch 2016 may access the parallel processor memory 2022 via a memory interface 2018. In at least one embodiment, the memory interface 2018 may include a plurality of partition units (e.g., partition units 2020A, 2020B to 2020N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 2022. In at least one embodiment, the number of partition units 2020A-2020N is configured to equal the number of memory units, such that the first partition unit 2020A has a corresponding first memory unit 2024A, the second partition unit 2020B has a corresponding second memory unit 2024B, and the Nth partition unit 2020N has a corresponding Nth memory unit 2024N. In at least one embodiment, the number of partition units 2020A-2020N may not be equal to the number of memory units.

[0366] In at least one embodiment, memory units 2024A-2024N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, memory units 2024A-2024N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). In at least one embodiment, render targets such as frame buffers or texture maps may be stored across memory units 2024A-2024N, allowing partition units 2020A-2020N to write portions of each render target in parallel and thereby use the available bandwidth of the parallel processor memory 2022 efficiently. In at least one embodiment, local instances of the parallel processor memory 2022 may be excluded in favor of a unified memory design that uses system memory in conjunction with local cache memory.
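
One common way to spread a render target across several memory devices so that they can be written in parallel is address interleaving. The sketch below shows the idea under stated assumptions: the 256-byte stripe size, the round-robin mapping, and the function name `route_address` are hypothetical and are not specified in the text.

```python
def route_address(addr, num_partitions, stripe_bytes=256):
    """Map a linear byte address to (partition index, offset within that partition)."""
    stripe = addr // stripe_bytes
    partition = stripe % num_partitions  # consecutive stripes rotate across partitions
    local_offset = (stripe // num_partitions) * stripe_bytes + addr % stripe_bytes
    return partition, local_offset


# Eight consecutive 256-byte stripes of a frame buffer land on four partitions in turn.
owners = [route_address(i * 256, num_partitions=4)[0] for i in range(8)]
print(owners)
```

Because neighboring stripes belong to different partitions, a large sequential write keeps every partition busy instead of queuing behind a single one.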

[0367] In at least one embodiment, any of clusters 2014A-2014N of the processing cluster array 2012 can process data to be written to any memory unit 2024A-2024N within the parallel processor memory 2022. In at least one embodiment, the memory crossbar switch 2016 can be configured to transfer the output of each cluster 2014A-2014N to any partition unit 2020A-2020N or to another cluster 2014A-2014N, which can perform further processing operations on the output. In at least one embodiment, each cluster 2014A-2014N can communicate with the memory interface 2018 via the memory crossbar switch 2016 to read from or write to various external storage devices. In at least one embodiment, the memory crossbar switch 2016 has a connection to the memory interface 2018 for communication with I/O unit 2004, and a connection to a local instance of parallel processor memory 2022, thereby enabling processing units within different processing clusters 2014A-2014N to communicate with system memory or other memory not local to parallel processing unit 2002. In at least one embodiment, the memory crossbar switch 2016 may use virtual channels to separate traffic flows between clusters 2014A-2014N and partition units 2020A-2020N.

[0368] In at least one embodiment, multiple instances of the parallel processing unit 2002 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 2002 may be configured to interoperate, even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 2002 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system incorporating one or more instances of the parallel processing unit 2002 or the parallel processor 2000 may be implemented in various configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.

[0369] Figure 20B is a block diagram of a partition unit 2020 according to at least one embodiment. In at least one embodiment, the partition unit 2020 is an instance of one of the partition units 2020A-2020N of Figure 20A. In at least one embodiment, the partition unit 2020 includes an L2 cache 2021, a frame buffer interface 2025, and a ROP 2026 (raster operation unit). In at least one embodiment, the L2 cache 2021 is a read/write cache configured to perform load and store operations received from the memory crossbar switch 2016 and the ROP 2026. In at least one embodiment, the L2 cache 2021 outputs read misses and urgent write-back requests to the frame buffer interface 2025 for processing. In at least one embodiment, updates can also be sent to the frame buffer for processing via the frame buffer interface 2025. In at least one embodiment, the frame buffer interface 2025 interfaces with one of the memory units in the parallel processor memory, such as the memory units 2024A-2024N of Figure 20A (e.g., within the parallel processor memory 2022).

[0370] In at least one embodiment, ROP 2026 is a processing unit that performs raster operations such as stenciling, z-testing, blending, etc. In at least one embodiment, ROP 2026 then outputs processed graphics data that is stored in graphics memory. In at least one embodiment, ROP 2026 includes compression logic to compress depth or color data written to memory and decompress depth or color data read from memory. In at least one embodiment, the compression logic may be lossless compression logic utilizing one or more of a variety of compression algorithms. In at least one embodiment, the type of compression performed by ROP 2026 may vary based on the statistical characteristics of the data to be compressed. For example, in at least one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
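
To make the notion of lossless, per-tile compression concrete, the sketch below stores a tile as one base value followed by the differences between neighboring values. It is a generic difference-based (delta) encoding chosen for illustration; the actual algorithm used by the ROP 2026 is not described in the text, and the function names and the flat list-of-integers tile layout are assumptions.

```python
def delta_encode(tile):
    """Encode a flat tile of integers as a base value plus successive differences."""
    if not tile:
        return []
    return [tile[0]] + [b - a for a, b in zip(tile, tile[1:])]


def delta_decode(encoded):
    """Rebuild the original tile by accumulating the stored differences."""
    tile = []
    total = 0
    for i, value in enumerate(encoded):
        total = value if i == 0 else total + value
        tile.append(total)
    return tile


# A smooth gradient yields small differences, which need fewer bits than raw values.
tile = [200, 201, 201, 203, 204, 204, 205, 207]
encoded = delta_encode(tile)
restored = delta_decode(encoded)
```

The round trip reproduces the tile exactly, which is what makes the scheme lossless; the saving comes from packing the small differences into fewer bits, a step this sketch omits.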

[0371] In at least one embodiment, ROP 2026 is included within each processing cluster (e.g., clusters 2014A-2014N of Figure 20A) instead of within the partition unit 2020. In at least one embodiment, read and write requests for pixel data, rather than pixel fragment data, are transmitted via the memory crossbar switch 2016. In at least one embodiment, the processed graphics data can be displayed on a display device (such as one of the one or more display devices 1910 of Figure 19), routed for further processing by the processor 1902, or routed for further processing by one of the processing entities within the parallel processor 2000 of Figure 20A.

[0372] Figure 20C is a block diagram of a processing cluster 2014 within a parallel processing unit according to at least one embodiment. In at least one embodiment, the processing cluster is an instance of one of the processing clusters 2014A-2014N of Figure 20A. In at least one embodiment, the processing cluster 2014 can be configured to execute a number of threads in parallel, where a "thread" refers to an instance of a specific program executing on a particular set of input data. In at least one embodiment, Single Instruction Multiple Data (SIMD) instruction issue techniques are used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, Single Instruction Multiple Thread (SIMT) techniques are used to support the parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster.

[0373] In at least one embodiment, the operation of the processing cluster 2014 can be controlled by a pipeline manager 2032 that assigns processing tasks to the SIMT parallel processors. In at least one embodiment, the pipeline manager 2032 receives instructions from the scheduler 2010 of Figure 20A and manages the execution of these instructions via the graphics multiprocessor 2034 and/or texture unit 2036. In at least one embodiment, the graphics multiprocessor 2034 is an exemplary instance of a SIMT parallel processor. However, in at least one embodiment, the processing cluster 2014 may include various types of SIMT parallel processors with different architectures. In at least one embodiment, the processing cluster 2014 may include one or more instances of the graphics multiprocessor 2034. In at least one embodiment, the graphics multiprocessor 2034 can process data, and the data crossbar 2040 can be used to distribute the processed data to one of a number of possible destinations (including other shader units). In at least one embodiment, the pipeline manager 2032 can facilitate the distribution of processed data by specifying the destinations of the processed data to be distributed via the data crossbar 2040.

[0374] In at least one embodiment, each graphics multiprocessor 2034 within the processing cluster 2014 may include the same set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). In at least one embodiment, the functional execution logic may be configured in a pipelined manner, wherein new instructions may be issued before previous instructions complete. In at least one embodiment, the functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, shift operations, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be used to perform different operations, and any combination of functional units may exist.

[0375] In at least one embodiment, instructions transmitted to the processing cluster 2014 constitute threads. In at least one embodiment, a set of threads executed across a set of parallel processing engines is a thread group. In at least one embodiment, the thread group executes a common program on different input data. In at least one embodiment, each thread within the thread group may be assigned to a different processing engine within the graphics multiprocessor 2034. In at least one embodiment, the thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 2034. In at least one embodiment, when the number of threads included in the thread group is less than the number of processing engines, one or more processing engines may be idle during the cycles in which that thread group is being processed. In at least one embodiment, the thread group may also include more threads than the number of processing engines within the graphics multiprocessor 2034. In at least one embodiment, when the thread group includes more threads than the number of processing engines within the graphics multiprocessor 2034, processing can be performed over consecutive clock cycles. In at least one embodiment, multiple thread groups can be executed simultaneously on the graphics multiprocessor 2034.
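
The relationship between thread-group size and the number of processing engines can be worked through with a small calculation. The sketch below is a simplified model: it assumes every thread occupies exactly one engine for one pass and that each pass takes one cycle, neither of which is stated in the text, and the function name `schedule_thread_group` is hypothetical.

```python
import math


def schedule_thread_group(num_threads, num_engines):
    """Return (passes needed, engines left idle in the final pass) for one thread group."""
    passes = math.ceil(num_threads / num_engines)
    idle_in_last_pass = passes * num_engines - num_threads
    return passes, idle_in_last_pass


# Fewer threads than engines: one pass, some engines idle.
small = schedule_thread_group(num_threads=24, num_engines=32)
# More threads than engines: processing spills into consecutive passes.
large = schedule_thread_group(num_threads=40, num_engines=32)
# An exact multiple keeps every engine busy in every pass.
exact = schedule_thread_group(num_threads=64, num_engines=32)
```

In this model a group of 40 threads on 32 engines needs two consecutive passes, and the second pass leaves most engines idle, which illustrates why group sizes that are multiples of the engine count use the hardware more fully.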

[0376] In at least one embodiment, the graphics multiprocessor 2034 includes an internal cache memory for performing load and store operations. In at least one embodiment, the graphics multiprocessor 2034 may forgo the internal cache and use a cache memory within the processing cluster 2014 (e.g., L1 cache 2048). In at least one embodiment, each graphics multiprocessor 2034 may also access the L2 caches within partition units (e.g., partition units 2020A-2020N of Figure 20A), which are shared among all processing clusters 2014 and can be used to transfer data between threads. In at least one embodiment, the graphics multiprocessor 2034 can also access off-chip global memory, which may include one or more of local parallel processor memory and/or system memory. In at least one embodiment, any memory outside of the parallel processing unit 2002 can be used as global memory. In at least one embodiment, the processing cluster 2014 includes multiple instances of the graphics multiprocessor 2034, which can share common instructions and data that can be stored in the L1 cache 2048.

[0377] In at least one embodiment, each processing cluster 2014 may include a memory management unit ("MMU") 2045 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of the MMU 2045 may reside within the memory interface 2018 of Figure 20A. In at least one embodiment, the MMU 2045 includes a set of page table entries (PTEs) for mapping a virtual address to a physical address of a tile and, optionally, to a cache line index. In at least one embodiment, the MMU 2045 may include an address translation lookaside buffer (TLB) or a cache that may reside within the graphics multiprocessor 2034, the L1 cache 2048, or the processing cluster 2014. In at least one embodiment, physical addresses are processed to distribute surface data access locality so that requests can be interleaved efficiently among partition units. In at least one embodiment, cache line indices may be used to determine whether a request for a cache line is a hit or a miss.
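
The interplay of page table entries and a translation cache can be sketched in a few lines. The following is a toy model under stated assumptions: the 4096-byte page size, the dictionary-based page table, the unbounded TLB, and the class name `SimpleMMU` are all hypothetical, and the tile and cache-line-index aspects of the MMU 2045 are left out.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class SimpleMMU:
    """Toy virtual-to-physical translator with a page table and a TLB."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page number -> physical page number
        self.tlb = {}                       # cache of recently used translations
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_addr):
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            ppn = self.tlb[vpn]
        else:
            self.tlb_misses += 1
            ppn = self.page_table[vpn]  # raises KeyError for an unmapped page
            self.tlb[vpn] = ppn
        return ppn * PAGE_SIZE + offset


mmu = SimpleMMU({0: 7, 1: 3})
first = mmu.translate(4100)   # page 1, offset 4: walks the page table
second = mmu.translate(4200)  # same page: served from the TLB
```

The second lookup touches the same virtual page as the first, so it is answered from the TLB without consulting the page table, which is the purpose of caching translations close to the requester.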

[0378] In at least one embodiment, the processing cluster 2014 can be configured such that each graphics multiprocessor 2034 is coupled to a texture unit 2036 to perform texture mapping operations that determine texture sample locations, read texture data, and filter texture data. In at least one embodiment, texture data is read from an internal texture L1 cache (not shown) or from an L1 cache within the graphics multiprocessor 2034, and is fetched as needed from an L2 cache, local parallel processor memory, or system memory. In at least one embodiment, each graphics multiprocessor 2034 outputs a processed task to a data crossbar 2040 to provide the processed task to another processing cluster 2014 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar switch 2016. In at least one embodiment, a preROP 2042 (pre-raster operation unit) is configured to receive data from the graphics multiprocessor 2034 and direct the data to ROP units, which may be located with partition units as described herein (e.g., the partition units 2020A-2020N of Figure 20A). In at least one embodiment, the preROP 2042 unit can perform optimizations for color blending, organize pixel color data, and perform address translation.

[0379] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in a graphics processing cluster 2014 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0380] In at least one embodiment, inference and/or training logic 2 may be used in a graphics processing cluster 2014 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0381] Figure 20D illustrates a graphics multiprocessor 2034 according to at least one embodiment. In at least one embodiment, the graphics multiprocessor 2034 is coupled to a pipeline manager 2032 of a processing cluster 2014. In at least one embodiment, the graphics multiprocessor 2034 has an execution pipeline including, but not limited to, an instruction cache 2052, an instruction unit 2054, an address mapping unit 2056, a register file 2058, one or more general-purpose graphics processing unit (GPGPU) cores 2062, and one or more load/store units 2066. In at least one embodiment, the GPGPU cores 2062 and the load/store units 2066 are coupled to a cache memory 2072 and a shared memory 2070 via a memory and cache interconnect 2068.

[0382] In at least one embodiment, instruction cache 2052 receives a stream of instructions to be executed from pipeline manager 2032. In at least one embodiment, instructions are cached in instruction cache 2052 and dispatched by instruction unit 2054 for execution. In at least one embodiment, instruction unit 2054 may dispatch instructions as thread groups (e.g., warps), assigning each thread of the thread group to a different execution unit within GPGPU core 2062. In at least one embodiment, instructions can access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 2056 may be used to translate addresses in the unified address space into distinct memory addresses that can be accessed by load/store unit 2066.
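
A unified address space can be pictured as a single range of addresses carved into windows, each backed by a different memory. The sketch below resolves a unified address to a memory space and an offset within it. The window boundaries, the table name `WINDOWS`, and the function name `map_unified_address` are invented for illustration; the text does not describe how the address mapping unit 2056 lays out its address space.

```python
# Hypothetical layout: (space name, first address, one past the last address).
WINDOWS = [
    ("local", 0x0000_0000, 0x0001_0000),
    ("shared", 0x0001_0000, 0x0002_0000),
    ("global", 0x0002_0000, 0x1_0000_0000),
]


def map_unified_address(addr):
    """Translate a unified address into (memory space, offset within that space)."""
    for space, start, end in WINDOWS:
        if start <= addr < end:
            return space, addr - start
    raise ValueError(f"address {addr:#x} is outside the unified address space")


# The same load instruction can reach different memories depending on the address.
in_shared = map_unified_address(0x0001_0010)
in_global = map_unified_address(0x0002_0000)
```

With this arrangement a program issues one kind of load or store, and the mapping step decides which physical memory services it.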

[0383] In at least one embodiment, register file 2058 provides a set of registers for functional units of graphics multiprocessor 2034. In at least one embodiment, register file 2058 provides temporary storage for operands of data paths connected to functional units of graphics multiprocessor 2034 (e.g., GPGPU cores 2062, load/store units 2066). In at least one embodiment, register file 2058 is divided among the functional units, such that a dedicated portion of register file 2058 is allocated to each functional unit. In at least one embodiment, register file 2058 is divided among the different warps being executed by graphics multiprocessor 2034.

[0384] In at least one embodiment, each of the GPGPU cores 2062 may include a floating-point unit (FPU) and/or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 2034. In at least one embodiment, the GPGPU cores 2062 may be architecturally similar or may differ in architecture. In at least one embodiment, a first portion of the GPGPU cores 2062 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores 2062 includes a double-precision FPU. In at least one embodiment, the FPUs may implement the IEEE 754-2008 standard for floating-point arithmetic or enable variable-precision floating-point arithmetic. In at least one embodiment, the graphics multiprocessor 2034 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores 2062 may also include fixed-function or special-function logic.

[0385] In at least one embodiment, the GPGPU cores 2062 include SIMD logic capable of executing a single instruction on multiple sets of data. In one embodiment, the GPGPU cores 2062 can physically execute SIMD4, SIMD8, and SIMD16 instructions, and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU cores can be generated by a shader compiler at compile time, or automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. In at least one embodiment, multiple threads of a program configured for a SIMT execution model can be executed using a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads performing the same or similar operations can be executed in parallel using a single SIMD8 logic unit.
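The mapping of threads onto a SIMD8 logic unit can be illustrated with a small software model. This is only a sketch under stated assumptions: the helper names `simd8_execute` and `run_simt_threads` are hypothetical, each thread is assumed to own one pair of operands, and inactive lanes are padded with zeros whose results are discarded (so the operation must be defined for zero operands).

```python
from typing import Callable, List, Tuple

SIMD_WIDTH = 8  # lanes in the hypothetical SIMD8 logic unit


def simd8_execute(op: Callable[[int, int], int],
                  a: List[int], b: List[int]) -> List[int]:
    """Apply a single instruction `op` to all eight lanes at once."""
    if len(a) != SIMD_WIDTH or len(b) != SIMD_WIDTH:
        raise ValueError("a SIMD8 unit operates on exactly eight lanes")
    return [op(x, y) for x, y in zip(a, b)]


def run_simt_threads(op: Callable[[int, int], int],
                     a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Execute one operation for every thread, eight threads per SIMD issue.

    Thread i owns operands a[i] and b[i]. Returns the per-thread results
    and the number of SIMD8 issues that were needed.
    """
    results: List[int] = []
    issues = 0
    for start in range(0, len(a), SIMD_WIDTH):
        lane_a = a[start:start + SIMD_WIDTH]
        lane_b = b[start:start + SIMD_WIDTH]
        active = len(lane_a)
        pad = SIMD_WIDTH - active  # inactive lanes are padded, then ignored
        out = simd8_execute(op, lane_a + [0] * pad, lane_b + [0] * pad)
        results.extend(out[:active])
        issues += 1
    return results, issues
```

In this model, eight threads that all perform an addition complete in a single issue, while sixteen threads need two.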

[0386] In at least one embodiment, the memory and cache interconnect 2068 is an interconnect network connecting each functional unit of the graphics multiprocessor 2034 to the register file 2058 and the shared memory 2070. In at least one embodiment, the memory and cache interconnect 2068 is a crossbar interconnect that allows the load/store units 2066 to perform load and store operations between the shared memory 2070 and the register file 2058. In at least one embodiment, the register file 2058 can operate at the same frequency as the GPGPU cores 2062, so that data transfer between the GPGPU cores 2062 and the register file 2058 has very low latency. In at least one embodiment, the shared memory 2070 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 2034. In at least one embodiment, the cache memory 2072 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 2036. In at least one embodiment, the shared memory 2070 can also be used as a program-managed cache. In at least one embodiment, in addition to the automatically cached data stored in cache memory 2072, threads executing on GPGPU cores 2062 can also programmatically store data in the shared memory.

[0387] In at least one embodiment, a parallel processor or GPGPU, as described herein, is communicatively coupled to host/processor cores to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor/cores via a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated with the cores on a package or chip and communicatively coupled to the cores via an internal processor bus/interconnect (i.e., internal to the package or chip). In at least one embodiment, regardless of how the GPU is connected, the processor cores may assign work to the GPU in the form of a sequence of commands/instructions contained in a job descriptor. In at least one embodiment, the GPU then uses dedicated circuitry/logic to efficiently process these commands/instructions.

[0388] The inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in a graphics multiprocessor 2034 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0389] In at least one embodiment, inference and/or training logic 715 may be used in graphics multiprocessor 2034 to perform inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

Figure 21 illustrates a multi-GPU computing system 2100 according to at least one embodiment. In at least one embodiment, the multi-GPU computing system 2100 may include a processor 2102 coupled to a plurality of general-purpose graphics processing units (GPGPUs) 2106A-D via a host interface switch 2104. In at least one embodiment, the host interface switch 2104 is a PCI Express switch device that couples the processor 2102 to a PCI Express bus, through which the processor 2102 can communicate with the GPGPUs 2106A-D. In at least one embodiment, the GPGPUs 2106A-D may be interconnected via a set of high-speed point-to-point (P2P) GPU-to-GPU links 2116. In at least one embodiment, the GPU-to-GPU links 2116 are connected to each of the GPGPUs 2106A-D via a dedicated GPU link. In at least one embodiment, the P2P GPU links 2116 enable direct communication between the GPGPUs 2106A-D without requiring communication over the host interface bus 2104 to which the processor 2102 is connected. In at least one embodiment, with GPU-to-GPU traffic directed to the P2P GPU links 2116, the host interface bus 2104 remains available for system memory access or, for example, for communication with other instances of the multi-GPU computing system 2100 via one or more network devices. While in at least one embodiment the GPGPUs 2106A-D are connected to the processor 2102 via the host interface switch 2104, in at least one embodiment the processor 2102 includes direct support for the P2P GPU links 2116 and can be connected directly to the GPGPUs 2106A-D.

[0390] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in a multi-GPU computing system 2100 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0391] In at least one embodiment, inference and/or training logic 715 can be used in a multi-GPU computing system 2100 to perform inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0392] Figure 22 is a block diagram of a graphics processor 2200 according to at least one embodiment. In at least one embodiment, the graphics processor 2200 includes a ring interconnect 2202, a pipeline front end 2204, a media engine 2237, and graphics cores 2280A-2280N. In at least one embodiment, the ring interconnect 2202 couples the graphics processor 2200 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, the graphics processor 2200 is one of many processors integrated within a multi-core processing system.

[0393] In at least one embodiment, graphics processor 2200 receives batches of commands via ring interconnect 2202. In at least one embodiment, the incoming commands are interpreted by command streamer 2203 in pipeline front end 2204. In at least one embodiment, graphics processor 2200 includes scalable execution logic for performing 3D geometry processing and media processing via graphics cores 2280A-2280N. In at least one embodiment, for 3D geometry processing commands, command streamer 2203 provides commands to geometry pipeline 2236. In at least one embodiment, for at least some media processing commands, command streamer 2203 provides commands to video front end 2234, which is coupled to media engine 2237. In at least one embodiment, media engine 2237 includes a video quality engine (VQE) 2230 for video and image post-processing, and a multi-format encode/decode (MFX) engine 2233 for providing hardware-accelerated media data encoding and decoding. In at least one embodiment, the geometry pipeline 2236 and the media engine 2237 each generate execution threads for thread execution resources provided by at least one graphics core 2280.

[0394] In at least one embodiment, the graphics processor 2200 includes scalable thread execution resources featuring graphics cores 2280A-2280N (which may be modular and are sometimes referred to as core slices), each graphics core having multiple sub-cores 2250A-2250N, 2260A-2260N (sometimes referred to as core sub-slices). In at least one embodiment, the graphics processor 2200 may have any number of graphics cores 2280A-2280N. In at least one embodiment, the graphics processor 2200 includes a graphics core 2280A having at least a first sub-core 2250A and a second sub-core 2260A. In at least one embodiment, the graphics processor 2200 is a low-power processor having a single sub-core (e.g., 2250A). In at least one embodiment, the graphics processor 2200 includes multiple graphics cores 2280A-2280N, each graphics core including a set of first sub-cores 2250A-2250N and a set of second sub-cores 2260A-2260N. In at least one embodiment, each of the first sub-cores 2250A-2250N includes at least a first set of execution units 2252A-2252N and media/texture samplers 2254A-2254N. In at least one embodiment, each of the second sub-cores 2260A-2260N includes at least a second set of execution units 2262A-2262N and samplers 2264A-2264N. In at least one embodiment, each sub-core 2250A-2250N, 2260A-2260N shares a set of shared resources 2270A-2270N. In at least one embodiment, the shared resources include shared cache memory and pixel operation logic.

[0395] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the inference and/or training logic 715 may be used in the graphics processor 2200 to perform inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0396] In at least one embodiment, inference and/or training logic 715 may be used in graphics processor 2200 to perform inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0397] Figure 23 is a block diagram illustrating a microarchitecture for a processor 2300 according to at least one embodiment, the processor 2300 including logic circuitry for executing instructions. In at least one embodiment, the processor 2300 can execute instructions, including x86 instructions, ARM instructions, and special-purpose instructions for application-specific integrated circuits (ASICs). In at least one embodiment, the processor 2300 may include registers for storing packed data, such as the 64-bit wide MMX™ registers in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. In at least one embodiment, MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany Single Instruction Multiple Data (“SIMD”) and Streaming SIMD Extensions (“SSE”) instructions. In at least one embodiment, 128-bit wide XMM registers associated with SSE2, SSE3, SSE4, AVX, or later (generally referred to as “SSEx”) technologies can hold such packed data operands. In at least one embodiment, processor 2300 can execute instructions to accelerate machine learning or deep learning algorithms, training, or inference.
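To make the notion of packed data concrete, the following sketch models a 64-bit register holding four 16-bit lanes and a lane-wise addition that wraps around within each lane. The lane layout and the helper names `pack`, `unpack`, and `packed_add` are illustrative assumptions, not a description of any particular instruction in the processor 2300.

```python
from typing import List

LANES = 4        # assumed number of elements per 64-bit register
LANE_BITS = 16   # assumed width of each packed element
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values: List[int]) -> int:
    """Pack four 16-bit values into one 64-bit integer (lane 0 is lowest)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lane values")
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack(reg: int) -> List[int]:
    """Split a 64-bit register image back into its four 16-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(reg_a: int, reg_b: int) -> int:
    """Add lane by lane; each lane wraps modulo 2**16, with no carry
    propagating into the neighbouring lane."""
    sums = [(x + y) & LANE_MASK for x, y in zip(unpack(reg_a), unpack(reg_b))]
    return pack(sums)
```

One call to `packed_add` thus stands in for a single instruction that updates four data elements at once, which is the property that makes packed operands useful for SIMD-style acceleration.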

[0398] In at least one embodiment, processor 2300 includes an in-order front end (“front end”) 2301 to fetch instructions to be executed and prepare them for later use in the processor pipeline. In at least one embodiment, front end 2301 may include several units. In at least one embodiment, instruction prefetcher 2326 fetches instructions from memory and provides them to instruction decoder 2328, which in turn decodes or interprets the instructions. For example, in at least one embodiment, instruction decoder 2328 decodes a received instruction into one or more machine-executable operations called “micro-instructions” or “micro-operations” (also referred to as “micro-ops” or “uops”). In at least one embodiment, instruction decoder 2328 parses the instruction into an opcode and corresponding data and control fields, which can be used by the microarchitecture to perform operations according to at least one embodiment. In at least one embodiment, trace cache 2330 may assemble the decoded micro-instructions into program-ordered sequences or traces in micro-instruction queue 2334 for execution. In at least one embodiment, when the trace cache 2330 encounters a complex instruction, the microcode ROM 2332 provides the micro-instructions required to complete the operation.

[0399] In at least one embodiment, some instructions may be converted into a single micro-operation, while others require several micro-operations to complete the entire operation. In at least one embodiment, if more than four micro-instructions are required to complete an instruction, the instruction decoder 2328 may access the microcode ROM 2332 to execute the instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-instructions for processing at the instruction decoder 2328. In at least one embodiment, if multiple micro-instructions are required to complete the operation, the instruction may be stored in the microcode ROM 2332. In at least one embodiment, the trace cache 2330 references an entry point programmable logic array (“PLA”) to determine the correct micro-instruction pointer for reading a microcode sequence from the microcode ROM 2332 to complete one or more instructions. In at least one embodiment, after the microcode ROM 2332 finishes sequencing the micro-operations for an instruction, the front end 2301 of the machine may resume fetching micro-operations from the trace cache 2330.

[0400] In at least one embodiment, the out-of-order execution engine (“out-of-order engine”) 2303 can prepare instructions for execution. In at least one embodiment, the out-of-order execution logic has multiple buffers to smooth and reorder the instruction flow to optimize performance as instructions move down the pipeline and are scheduled for execution. In at least one embodiment, the out-of-order execution engine 2303 includes, but is not limited to, an allocator/register renamer 2340, a memory microinstruction queue 2342, an integer/floating-point microinstruction queue 2344, a memory scheduler 2346, a fast scheduler 2302, a slow/general-purpose floating-point scheduler (“slow/general-purpose FP scheduler”) 2304, and a simple floating-point scheduler (“simple FP scheduler”) 2306. In at least one embodiment, the fast scheduler 2302, the slow/general-purpose floating-point scheduler 2304, and the simple floating-point scheduler 2306 are also collectively referred to as “microinstruction schedulers 2302, 2304, 2306”. In at least one embodiment, the allocator/register renamer 2340 allocates the machine buffers and resources that each microinstruction needs in order to execute. In at least one embodiment, the allocator/register renamer 2340 renames logical registers to entries in a register file. In at least one embodiment, the allocator/register renamer 2340 also allocates an entry for each microinstruction in one of two microinstruction queues, the memory microinstruction queue 2342 for memory operations and the integer/floating-point microinstruction queue 2344 for non-memory operations, which precede the memory scheduler 2346 and the microinstruction schedulers 2302, 2304, and 2306. In at least one embodiment, the microinstruction schedulers 2302, 2304, and 2306 determine when a microinstruction is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources that the microinstruction needs to complete its operation. In at least one embodiment, the fast scheduler 2302 can schedule on each half of the main clock cycle, while the slow/general-purpose floating-point scheduler 2304 and the simple floating-point scheduler 2306 can schedule once per main processor clock cycle. In at least one embodiment, microinstruction schedulers 2302, 2304, and 2306 arbitrate for dispatch ports to schedule microinstructions for execution.
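The readiness test applied by a microinstruction scheduler can be sketched as follows. This is a simplified software model under stated assumptions, not the scheduler hardware itself: the `Uop` record, the unit names, and the function `ready_uops` are hypothetical, and arbitration is reduced to "first ready microinstruction in queue order wins a free unit".

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Uop:
    name: str
    sources: List[str]  # input register operands the uop depends on
    dest: str           # register written by the uop
    unit: str           # kind of execution resource needed (hypothetical names)


def ready_uops(queue: List[Uop], ready_regs: Set[str],
               free_units: Dict[str, int]) -> List[Uop]:
    """Select uops whose operands are ready and whose unit is available.

    A uop is dispatched only if every source register is ready AND an
    execution resource of the required kind is still free this cycle.
    """
    available = dict(free_units)  # copy so the caller's view is unchanged
    selected: List[Uop] = []
    for uop in queue:
        operands_ready = all(src in ready_regs for src in uop.sources)
        if operands_ready and available.get(uop.unit, 0) > 0:
            available[uop.unit] -= 1  # claim the resource for this cycle
            selected.append(uop)
    return selected
```

In this model a microinstruction waiting on a register that has not been produced yet is simply skipped, so younger independent microinstructions can be selected ahead of it, which is the essence of out-of-order scheduling.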

[0401] In at least one embodiment, execution block 2311 includes, but is not limited to, integer register file/bypass network 2308, floating-point register file/bypass network (“FP register file/bypass network”) 2310, address generation units (“AGU”) 2312 and 2314, fast arithmetic logic units (“fast ALU”) 2316 and 2318, slow arithmetic logic unit (“slow ALU”) 2320, floating-point ALU (“FP”) 2322, and floating-point move unit (“FP move”) 2324. In at least one embodiment, integer register file/bypass network 2308 and floating-point register file/bypass network 2310 are also referred to herein as “register files 2308, 2310”. In at least one embodiment, AGUs 2312 and 2314, fast ALUs 2316 and 2318, slow ALU 2320, floating-point ALU 2322, and floating-point move unit 2324 are also referred to herein as "execution units 2312, 2314, 2316, 2318, 2320, 2322, and 2324". In at least one embodiment, execution block 2311 may include, but is not limited to, any number (including zero) and type of register files, bypass networks, address generation units, and execution units (in any combination).

[0402] In at least one embodiment, register files 2308, 2310 may be arranged between microinstruction schedulers 2302, 2304, 2306 and execution units 2312, 2314, 2316, 2318, 2320, 2322, and 2324. In at least one embodiment, integer register file/bypass network 2308 performs integer operations. In at least one embodiment, floating-point register file/bypass network 2310 performs floating-point operations. In at least one embodiment, each of register files 2308, 2310 may include, but is not limited to, a bypass network that can bypass or forward recently completed results that have not yet been written to the register file to new dependent microinstructions. In at least one embodiment, register files 2308, 2310 may communicate data with each other. In at least one embodiment, integer register file/bypass network 2308 may include, but is not limited to, two separate register files, one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. In at least one embodiment, the floating-point register file/bypass network 2310 may include, but is not limited to, entries with a width of 128 bits, since floating-point instructions typically have operands with a width of 64 to 128 bits.
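The forwarding role of a bypass network can be illustrated with a minimal model in which a dependent read checks for a recently completed result before falling back to the register file. The class and method names below are hypothetical, and real hardware forwards values over wires in the same cycle rather than through a dictionary; the sketch only shows the lookup order.

```python
from typing import Dict


class RegisterFileWithBypass:
    """Toy register file with a bypass (forwarding) path."""

    def __init__(self) -> None:
        self.registers: Dict[str, int] = {}  # architecturally written values
        self.bypass: Dict[str, int] = {}     # completed but not yet written

    def complete(self, reg: str, value: int) -> None:
        """An execution unit finished; the result is visible only via bypass."""
        self.bypass[reg] = value

    def write_back(self) -> None:
        """Commit all pending results into the register file."""
        self.registers.update(self.bypass)
        self.bypass.clear()

    def read(self, reg: str) -> int:
        """A dependent operation reads the forwarded value if one exists."""
        if reg in self.bypass:
            return self.bypass[reg]
        return self.registers[reg]
```

With this ordering, a dependent operation observes the newest result immediately instead of waiting for the write-back step to finish.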

[0403] In at least one embodiment, execution units 2312, 2314, 2316, 2318, 2320, 2322, and 2324 can execute instructions. In at least one embodiment, register files 2308 and 2310 store the integer and floating-point data operand values that the microinstructions need to execute. In at least one embodiment, processor 2300 may include, but is not limited to, any number and combination of execution units 2312, 2314, 2316, 2318, 2320, 2322, and 2324. In at least one embodiment, floating-point ALU 2322 and floating-point move unit 2324 can perform floating-point, MMX, SIMD, AVX, SSE, or other operations, including specialized machine learning instructions. In at least one embodiment, floating-point ALU 2322 may include, but is not limited to, a 64-bit by 64-bit floating-point divider to perform division, square root, and remainder micro-operations. In at least one embodiment, instructions involving floating-point values can be handled by floating-point hardware. In at least one embodiment, ALU operations can be passed to fast ALUs 2316 and 2318. In at least one embodiment, fast ALUs 2316 and 2318 can perform fast operations with an effective latency of half a clock cycle. In at least one embodiment, most complex integer operations are routed to slow ALU 2320, because slow ALU 2320 can include, but is not limited to, integer execution hardware for long-latency operations, such as multipliers, shifters, flag logic, and branch processing. In at least one embodiment, memory load/store operations can be performed by AGUs 2312 and 2314. In at least one embodiment, fast ALU 2316, fast ALU 2318, and slow ALU 2320 can perform integer operations on 64-bit data operands. In at least one embodiment, fast ALU 2316, fast ALU 2318, and slow ALU 2320 can be implemented to support various data bit sizes, including 16, 32, 128, 256, etc. In at least one embodiment, the floating-point ALU 2322 and the floating-point move unit 2324 can be implemented to support a range of operands with various bit widths; for example, they can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

[0404] In at least one embodiment, microinstruction schedulers 2302, 2304, and 2306 schedule dependent operations before the parent load completes execution. In at least one embodiment, since microinstructions can be speculatively scheduled and executed within processor 2300, processor 2300 may also include logic for handling memory misses. In at least one embodiment, if a data load misses in the data cache, there may be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations may need to be replayed, while independent operations may be allowed to complete. In at least one embodiment, the schedulers and replay mechanism of at least one embodiment of the processor may also be designed to catch instruction sequences for text string comparison operations.

[0405] In at least one embodiment, "register" can refer to an on-board processor storage location that can be used as part of an instruction to identify an operand. In at least one embodiment, registers may be those that are usable from outside the processor (from a programmer's perspective). In at least one embodiment, a register may not be limited to a particular type of circuit. Rather, in at least one embodiment, a register can store data, provide data, and perform the functions described herein. In at least one embodiment, the registers described herein can be implemented by circuitry within the processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, a combination of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, integer registers store 32-bit integer data. The register file of at least one embodiment also includes eight multimedia SIMD registers for packed data.

[0406] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, some or all of the inference and/or training logic 715 may be incorporated into execution block 2311 and other memories or registers, shown or not shown. For example, in at least one embodiment, the training and/or inference techniques described herein may use one or more ALUs shown in execution block 2311. Furthermore, weight parameters may be stored in on-chip or off-chip memory and/or registers (shown or not shown) that configure the ALUs of execution block 2311 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.

[0407] Figure 24 illustrates a deep learning application processor 2400 according to at least one embodiment. In at least one embodiment, the deep learning application processor 2400 uses instructions that, if executed by the deep learning application processor 2400, cause it to perform some or all of the processes and techniques described herein. In at least one embodiment, the deep learning application processor 2400 is an application-specific integrated circuit (ASIC). In at least one embodiment, the application processor 2400 performs matrix multiplication operations that are either "hardwired" into hardware, performed as a result of executing one or more instructions, or both. In at least one embodiment, the deep learning application processor 2400 includes, but is not limited to, processing clusters 2410(1)-2410(12), inter-chip links (“ICL”) 2420(1)-2420(12), inter-chip controllers (“ICC”) 2430(1)-2430(2), second-generation high-bandwidth memory (“HBM2”) 2440(1)-2440(4), memory controllers (“Mem Ctrlr”) 2442(1)-2442(4), high-bandwidth memory physical layers (“HBM PHY”) 2444(1)-2444(4), a management controller central processing unit (“management controller CPU”) 2450, a serial peripheral interface, inter-integrated circuit, and general-purpose input/output block (“SPI, I2C, GPIO”) 2460, a peripheral component interconnect express controller and direct memory access block (“PCIe controller and DMA”) 2470, and a sixteen-lane peripheral component interconnect express port (“PCI Express x 16”) 2480.

[0408] In at least one embodiment, processing clusters 2410 can perform deep learning operations, including inference or prediction operations based on weight parameters computed using one or more training techniques, including those described herein. In at least one embodiment, each processing cluster 2410 can include, but is not limited to, any number and type of processors. In at least one embodiment, deep learning application processor 2400 can include any number and type of processing clusters 2410. In at least one embodiment, the inter-chip links 2420 are bidirectional. In at least one embodiment, the inter-chip links 2420 and the inter-chip controllers 2430 enable multiple deep learning application processors 2400 to exchange information, including activation information generated from executing one or more machine learning algorithms embodied in one or more neural networks. In at least one embodiment, deep learning application processor 2400 can include any number (including zero) and type of ICLs 2420 and ICCs 2430.

[0409] In at least one embodiment, the HBM2s 2440 provide a total of 32 GB of memory. In at least one embodiment, HBM2 2440(i) is associated with both memory controller 2442(i) and HBM PHY 2444(i), where “i” is any integer. In at least one embodiment, any number of HBM2s 2440 can provide any type and total amount of high-bandwidth memory and can be associated with any number (including zero) and type of memory controllers 2442 and HBM PHYs 2444. In at least one embodiment, any number and type of blocks can replace SPI, I2C, GPIO 2460, PCIe controller and DMA 2470, and/or PCIe 2480 to implement any number and type of communication standards in any technically feasible manner.

[0410] Inference and/or training logic 715 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 715 are provided herein in conjunction with Figure 7A and/or Figure 7B. In at least one embodiment, the deep learning application processor 2400 is used to train a machine learning model (e.g., a neural network) to predict or infer information provided to the deep learning application processor 2400. In at least one embodiment, the deep learning application processor 2400 is used to infer or predict information based on a trained machine learning model (e.g., a neural network) that has been trained by another processor or system or by the deep learning application processor 2400. In at least one embodiment, the processor 2400 may be used to perform one or more neural network use cases described herein.

[0411] Figure 25 is a block diagram of a neuromorphic processor 2500 according to at least one embodiment. In at least one embodiment, the neuromorphic processor 2500 may receive one or more inputs from a source external to the neuromorphic processor 2500. In at least one embodiment, these inputs may be transmitted to one or more neurons 2502 within the neuromorphic processor 2500. In at least one embodiment, the neurons 2502 and their components may be implemented using circuitry or logic including one or more arithmetic logic units (ALUs). In at least one embodiment, the neuromorphic processor 2500 may include, but is not limited to, many thousands of instances of neurons 2502, but any suitable number of neurons 2502 may be used. In at least one embodiment, each instance of a neuron 2502 may include a neuron input 2504 and a neuron output 2506. In at least one embodiment, a neuron 2502 may generate an output that can be transmitted to the inputs of other instances of the neuron 2502. In at least one embodiment, the neuron inputs 2504 and the neuron outputs 2506 may be interconnected via synapses 2508.

[0412] In at least one embodiment, neurons 2502 and synapses 2508 may be interconnected such that neuromorphic processor 2500 operates to process or analyze information received by neuromorphic processor 2500. In at least one embodiment, neuron 2502 may send an output pulse (or “fire” or “spike”) when the input received through neuron input 2504 exceeds a threshold. In at least one embodiment, neuron 2502 may sum or integrate the signals received at neuron input 2504. For example, in at least one embodiment, neuron 2502 may be implemented as a leaky integrate-and-fire neuron, wherein if the sum (referred to as the “membrane potential”) exceeds a threshold, neuron 2502 may use a transfer function such as a sigmoid or threshold function to generate an output (or “fire”). In at least one embodiment, the leaky integrate-and-fire neuron may sum the signals received at neuron input 2504 into a membrane potential and may apply an attenuation factor (or leak) to reduce the membrane potential. In at least one embodiment, a leaky integrate-and-fire neuron may fire if multiple input signals are received at neuron input 2504 quickly enough to exceed the threshold (i.e., before the membrane potential decays too low to fire). In at least one embodiment, neuron 2502 may be implemented using circuitry or logic that receives input, integrates the input into the membrane potential, and decays the membrane potential. In at least one embodiment, the input may be averaged, or any other suitable transfer function may be used. Furthermore, in at least one embodiment, neuron 2502 may include, but is not limited to, comparator circuitry or logic that generates an output spike at neuron output 2506 when the result of applying the transfer function to neuron input 2504 exceeds a threshold. In at least one embodiment, once neuron 2502 fires, it can disregard previously received input information by, for example, resetting the membrane potential to 0 or another suitable default value. In at least one embodiment, once the membrane potential is reset to 0, neuron 2502 may resume normal operation after a suitable period of time (or recovery period).
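The integrate, leak, fire, reset, and recovery behavior described above can be sketched in software as follows. This is a minimal discrete-time illustration, not the circuitry of neuron 2502: the class name `LeakyIntegrateFireNeuron`, the multiplicative form of the leak, the simple threshold comparison, and all default parameter values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LeakyIntegrateFireNeuron:
    """Toy discrete-time leaky integrate-and-fire neuron (hypothetical model)."""
    threshold: float = 1.0     # membrane potential that must be exceeded to fire
    leak: float = 0.9          # assumed multiplicative decay applied each step
    reset_value: float = 0.0   # membrane potential after firing
    refractory_steps: int = 2  # assumed length of the recovery period, in steps
    potential: float = 0.0     # current membrane potential
    _cooldown: int = 0         # remaining recovery steps

    def step(self, input_signal: float) -> bool:
        """Advance one time step and return True if the neuron fires."""
        if self._cooldown > 0:
            # Recovery period: incoming input is ignored.
            self._cooldown -= 1
            return False
        # Decay the stored potential, then integrate the new input.
        self.potential = self.potential * self.leak + input_signal
        if self.potential > self.threshold:
            # Fire, discard accumulated input, and start the recovery period.
            self.potential = self.reset_value
            self._cooldown = self.refractory_steps
            return True
        return False


def run(neuron: LeakyIntegrateFireNeuron, inputs: list[float]) -> list[bool]:
    """Feed a sequence of inputs to the neuron and collect its spike train."""
    return [neuron.step(x) for x in inputs]
```

With a strong leak (for example `leak=0.5` and `threshold=0.8`), two inputs of 0.6 arriving on consecutive steps make the neuron fire, whereas the same two inputs separated by an idle step do not, because the membrane potential decays in between.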

[0413] In at least one embodiment, neurons 2502 can be interconnected via synapses 2508. In at least one embodiment, synapses 2508 are operable to transmit signals from the output of a first neuron 2502 to the input of a second neuron 2502. In at least one embodiment, neurons 2502 can transmit information over more than one instance of synapse 2508. In at least one embodiment, one or more instances of neuron output 2506 can be connected, via instances of synapse 2508, to instances of neuron input 2504 within the same neuron 2502. In at least one embodiment, an instance of neuron 2502 that produces an output to be transmitted over an instance of synapse 2508 may be referred to as a “presynaptic neuron” with respect to that instance of synapse 2508. In at least one embodiment, an instance of neuron 2502 that receives input transmitted over an instance of synapse 2508 may be referred to as a “postsynaptic neuron” with respect to that instance of synapse 2508. In at least one embodiment, because an instance of neuron 2502 can receive input from one or more instances of synapse 2508 and can also transmit output over one or more instances of synapse 2508, a single instance of neuron 2502 can be both a “presynaptic neuron” and a “postsynaptic neuron” with respect to various instances of synapse 2508.
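As an informal illustration of the presynaptic and postsynaptic roles described above, synapses can be modeled as directed (source, destination) pairs. The function name `synaptic_roles` and the integer neuron identifiers are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict


def synaptic_roles(synapses):
    """Map each neuron id to the set of roles it plays across all synapses.

    Each synapse is a (pre, post) pair: the signal travels from the output
    of neuron `pre` to the input of neuron `post`.
    """
    roles = defaultdict(set)
    for pre, post in synapses:
        roles[pre].add("presynaptic")
        roles[post].add("postsynaptic")
    return dict(roles)


# Neuron 0 feeds neuron 1, neuron 1 feeds neuron 2,
# and neuron 2 feeds its own input (a self-connection).
example_synapses = [(0, 1), (1, 2), (2, 2)]
example_roles = synaptic_roles(example_synapses)
```

In this example neuron 0 is only presynaptic, while neurons 1 and 2 each hold both roles, since each is the source of one synapse and the destination of another.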

[0414] In at least one embodiment, neurons 2502 may be organized into one or more layers. In at least one embodiment, each instance of neuron 2502 may have a neuron output 2506 that fans out to one or more neuron inputs 2504 via one or more synapses 2508. In at least one embodiment, the neuron outputs 2506 of neurons 2502 in the first layer 2510 may be connected to the neuron inputs 2504 of neurons 2502 in the second layer 2512. In at least one embodiment, layer 2510 may be referred to as a “feedforward layer.” In at least one embodiment, each instance of neuron 2502 in the first layer 2510 may fan out to each instance of neuron 2502 in the second layer 2512. In at least one embodiment, the first layer 2510 may be referred to as a “fully connected feedforward layer.” In at least one embodiment, each instance of neuron 2502 in the second layer 2512 may fan out to fewer than all instances of neuron 2502 in the third layer 2514. In at least one embodiment, the second layer 2512 may be referred to as a “sparsely connected feedforward layer.” In at least one embodiment, neurons 2502 in the second layer 2512 may fan out to neurons 2502 in multiple other layers, including to neurons 2502 in the second layer 2512 itself. In at least one embodiment, the second layer 2512 may be referred to as a “recurrent layer.” In at least one embodiment, the neuromorphic processor 2500 may include any suitable combination of recurrent layers and feedforward layers, including but not limited to sparsely connected feedforward layers and fully connected feedforward layers.
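The three connectivity patterns named above can be told apart from a list of directed connections, as the following sketch suggests. It is a simplified illustration under stated assumptions: the function name `describe_layer` is hypothetical, layers are numbered consecutively, "the next layer" is taken to be the layer with the next index, and a layer counts as recurrent as soon as any of its neurons fans out to a neuron in the same layer.

```python
def describe_layer(edges, layer_of, layer):
    """Classify the fan-out pattern of one layer.

    edges    -- iterable of (pre, post) neuron-id pairs
    layer_of -- dict mapping neuron id to its layer index
    layer    -- index of the layer to classify
    """
    members = [n for n, idx in layer_of.items() if idx == layer]
    next_layer = {n for n, idx in layer_of.items() if idx == layer + 1}
    # Fan-out targets of every neuron in the layer under inspection.
    targets = {n: {post for pre, post in edges if pre == n} for n in members}

    # Any connection that stays inside the layer makes it recurrent.
    if any(layer_of[t] == layer for ts in targets.values() for t in ts):
        return "recurrent"
    # Every neuron reaches every neuron of the next layer.
    # Note: a final layer with no successor also falls into this branch.
    if all(next_layer <= ts for ts in targets.values()):
        return "fully connected feedforward"
    return "sparsely connected feedforward"


# Three layers of two neurons each: ids 0-1, 2-3, and 4-5.
example_layers = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
example_edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 5)]
```

Here layer 0 is classified as fully connected, because both of its neurons reach both neurons of layer 1, and layer 1 as sparsely connected, because each of its neurons reaches only one neuron of layer 2. Adding the connection `(2, 3)` would make layer 1 recurrent.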

[0415] In at least one embodiment, the neuromorphic processor 2500 may include, but is not limited to, a reconfigurable interconnect architecture or dedicated hardwired interconnects to connect synapses 2508 to neurons 2502. In at least one embodiment, the neuromorphic processor 2500 may include, but is not limited to, circuitry or logic that allows synapses to be assigned to different neurons 2502 as needed, depending on the neural network topology and neuron fan-in / fan-out. For example, in at least one embodiment, synapses 2508 may be connected to neurons 2502 using interconnect structures (such as on-chip networks) or via dedicated connections. In at least one embodiment, synaptic interconnects and their components may be implemented using circuitry or logic.

[0416] Figure 26 illustrates a processing system according to at least one embodiment. In at least one embodiment, system 2600 includes one or more processors 2602 and one or more graphics processors 2608, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 2602 or processor cores 2607. In at least one embodiment, system 2600 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

[0417] In at least one embodiment, system 2600 may include, or be integrated into, a server-based gaming platform, a game console (including a game and media console), a mobile game console, a handheld game console, or an online game console. In at least one embodiment, system 2600 is a mobile phone, smartphone, tablet computing device, or mobile internet device. In at least one embodiment, processing system 2600 may also include, be coupled to, or be integrated into a wearable device, such as a smartwatch, smart glasses, an augmented reality device, or a virtual reality device. In at least one embodiment, processing system 2600 is a television or set-top box device having one or more processors 2602 and a graphical interface generated by one or more graphics processors 2608.

[0418] In at least one embodiment, each of the one or more processors 2602 includes one or more processor cores 2607 for processing instructions that, when executed, perform operations for system and user software. In at least one embodiment, each of the one or more processor cores 2607 is configured to process a specific instruction set 2609. In at least one embodiment, the instruction set 2609 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computation via a Very Long Instruction Word (VLIW). In at least one embodiment, each processor core 2607 may process a different instruction set 2609, which may include instructions that facilitate the emulation of other instruction sets. In at least one embodiment, the processor core 2607 may also include other processing devices, such as a digital signal processor (DSP).

[0419] In at least one embodiment, processor 2602 includes cache memory 2604. In at least one embodiment, processor 2602 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory is shared among various components of processor 2602. In at least one embodiment, processor 2602 also uses an external cache (e.g., a Level 3 (L3) cache or a last-level cache (LLC)) (not shown), which can be shared among processor cores 2607 using known cache coherence techniques. In at least one embodiment, processor 2602 further includes a register file 2606, which may include different types of registers (e.g., integer registers, floating-point registers, status r...

Claims

1. A processor, comprising: a circuit for labeling one or more objects within one or more images using one or more neural networks, based at least in part on one or more updates of two or more pseudo-labels, the two or more pseudo-labels being associated with two or more distinct annotations associated with the one or more objects, wherein the one or more updates of the two or more pseudo-labels are obtained by adjusting the information about the one or more objects indicated by the two or more pseudo-labels according to information about the one or more objects identified using the one or more neural networks.

2. The processor of claim 1, wherein the circuitry for labeling one or more objects within one or more images using one or more neural networks based at least in part on one or more updates of two or more pseudo-labels comprises circuitry for: generating the two or more pseudo-labels based at least in part on the two or more annotations; determining, using the one or more neural networks, one or more prediction maps about the one or more objects; generating one or more feature maps based at least in part on the two or more pseudo-labels and the one or more prediction maps, wherein the information contained in the two or more pseudo-labels is adjusted for objects contained in the one or more prediction maps; and generating labels for marking the one or more objects within the one or more images, based at least in part on a combination of the one or more feature maps.

3. The processor of claim 2, wherein the two or more pseudo-labels are generated by performing weak supervision on the two or more different annotations.

4. The processor of claim 2, wherein the labels for marking the one or more objects within the one or more images are generated by concatenating the one or more feature maps into a combined feature map and using a fusion neural network to determine the labels based on the combined feature map.

5. The processor of claim 2, wherein the one or more neural networks are trained to determine the one or more objects in the one or more images based at least in part on the one or more prediction maps and the two or more pseudo-labels.

6. The processor of claim 1, wherein the one or more neural networks used to determine the one or more objects in the training image are convolutional neural networks.

7. A system comprising: one or more processors including one or more circuits configured to label one or more objects within one or more images using one or more neural networks, based at least in part on updates to two or more pseudo-labels associated with two or more distinct annotations related to the one or more objects, wherein the updates to the two or more pseudo-labels are obtained by adjusting information about the one or more objects indicated by the two or more pseudo-labels based on information about the one or more objects identified using the one or more neural networks.

8. The system of claim 7, wherein the one or more circuits for labeling one or more objects within one or more images using one or more neural networks based at least in part on one or more updates of two or more pseudo-labels comprise one or more circuits for: generating, using one or more weak supervision techniques, the two or more pseudo-labels based on the two or more different annotations; generating, using the one or more neural networks, one or more prediction maps indicating information about the one or more objects; generating, using the one or more prediction maps and the two or more pseudo-labels, one or more feature maps, wherein the information contained in the two or more pseudo-labels is adjusted for the objects contained in the one or more prediction maps; and combining the one or more feature maps into labels for tagging the one or more objects within the one or more images.

9. The system of claim 8, wherein the one or more weak supervision techniques include random walk and region growing operations for determining the two or more pseudo-labels, the two or more pseudo-labels indicating at least the foreground and background of the one or more images.

10. The system of claim 8, wherein a context loss is computed based at least in part on the one or more prediction maps, and the one or more neural networks are trained based at least in part on the context loss.

11. The system of claim 8, wherein the one or more feature maps are generated by using the one or more prediction maps to determine information indicating the one or more objects in the one or more images from the two or more pseudo-labels.

12. The system of claim 8, wherein the one or more feature maps are combined by concatenating the one or more feature maps into a concatenated feature map and using a convolutional neural network to determine the two or more pseudo-labels.

13. The system of claim 12, wherein one or more loss values of the one or more neural networks are calculated based at least in part on the two or more pseudo-labels, and the one or more loss values are used to train the one or more neural networks.

14. The system of claim 7, wherein the two or more distinct annotations comprise approximate indications of the one or more objects in the one or more images.

15. A non-transitory machine-readable medium having a set of instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to at least: label one or more objects within one or more images using one or more neural networks, based at least in part on one or more updates of two or more pseudo-labels, the two or more pseudo-labels being associated with two or more distinct annotations associated with the one or more objects, wherein the one or more updates of the two or more pseudo-labels are obtained by adjusting the information about the one or more objects indicated by the two or more pseudo-labels according to information about the one or more objects identified using the one or more neural networks.

16. The non-transitory machine-readable medium of claim 15, wherein causing the one or more processors to label one or more objects within one or more images using one or more neural networks based at least in part on one or more updates of two or more pseudo-labels comprises causing the one or more processors to: generate, using one or more weak supervision techniques and based at least in part on the two or more different annotations and the one or more images, the two or more pseudo-labels, the two or more pseudo-labels indicating estimates of foreground and background in the one or more images; generate one or more prediction maps using the one or more neural networks, based at least in part on the one or more images, wherein the information contained in the two or more pseudo-labels is adjusted for objects contained in the one or more prediction maps; update the two or more pseudo-labels into one or more feature maps using the one or more prediction maps; and combine the one or more feature maps into labels for tagging the one or more objects within the one or more images.

17. The non-transitory machine-readable medium of claim 16, wherein the one or more neural networks comprise convolutional neural networks, and the one or more prediction maps comprise information indicating estimates of the one or more objects in the one or more images.

18. The non-transitory machine-readable medium of claim 16, wherein the one or more weak supervision techniques include region growing operations and random walk operations, and the two or more pseudo-labels include information indicating estimates of foreground and background in the one or more images.

19. The non-transitory machine-readable medium of claim 16, wherein the one or more feature maps are combined into a combined feature map by concatenating the one or more feature maps, and the label for marking the one or more objects within the one or more images is determined based at least in part on the combined feature map.

20. The non-transitory machine-readable medium of claim 19, wherein the two or more pseudo-labels are determined using a convolutional neural network trained at least in part based on shared information between the one or more feature maps.

21. The non-transitory machine-readable medium of claim 15, wherein each of the labeled objects within the one or more images comprises a label determined at least in part based on the one or more images and the two or more distinct annotations, and the one or more neural networks are trained at least in part based on information contained in the two or more pseudo-labels.

22. A method comprising: labeling one or more objects within one or more images using one or more neural networks, based at least in part on one or more updates of two or more pseudo-labels, the two or more pseudo-labels being associated with two or more distinct annotations associated with the one or more objects, wherein the one or more updates of the two or more pseudo-labels are obtained by adjusting the information about the one or more objects indicated by the two or more pseudo-labels according to information about the one or more objects identified using the one or more neural networks.

23. The method of claim 22, wherein labeling one or more objects within one or more images using one or more neural networks, based at least in part on one or more updates of two or more pseudo-labels, comprises: generating, using the one or more neural networks, one or more feature maps about the one or more images, the one or more feature maps being generated based at least in part on the one or more images and the two or more pseudo-labels determined according to the two or more different annotations; and combining the one or more feature maps into labels for tagging the one or more objects within the one or more images.

24. The method of claim 23, wherein the two or more pseudo-labels are determined using one or more weak supervision techniques based on the two or more different annotations, and the two or more pseudo-labels include information for at least indicating the estimated foreground and estimated background in the one or more images.

25. The method of claim 23, wherein the one or more feature maps are further generated at least in part based on updating the two or more pseudo-labels, and the two or more pseudo-labels are updated based on one or more prediction maps determined by the one or more neural networks, the one or more prediction maps indicating estimates of the one or more objects in the one or more images.

26. The method of claim 25, wherein one or more context loss values are computed based at least in part on the one or more prediction maps, and the one or more context loss values are used to train the one or more neural networks.

27. The method of claim 23, wherein the one or more feature maps are combined into the label by concatenating the one or more feature maps into a concatenated feature map and using a fusion neural network to determine the label based on the concatenated feature map.

28. The method of claim 27, wherein the fusion neural network is a convolutional neural network.

29. The method of claim 27, wherein one or more loss values are computed based at least in part on the one or more feature maps, and the one or more loss values are used to train the fusion neural network.

30. The method of claim 22, wherein at least one of the one or more neural networks is a 3D U-Net neural network.

31. The processor of claim 1, wherein the one or more neural networks are used to combine two or more different types of annotations associated with objects within one or more images, and to label the objects at least in part based on the combined two or more different types of annotations, wherein the combined two or more different types of annotations include one or more updates of two or more pseudo-labels.