Evaluating training of machine learning models and altering the training based on the evaluation

The facility uses spectral training observability data to efficiently evaluate and adjust machine learning model training, addressing resource and time inefficiencies in conventional methods by enabling real-time assessment and hyperparameter tuning.

WO2026122985A2 · PCT designated stage · Publication Date: 2026-06-11 · MTS IP HLDG LTD +5

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
MTS IP HLDG LTD
Filing Date
2025-12-05
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Conventional methods for assessing machine learning model architectures during training are resource-intensive, time-consuming, and inefficient, requiring extensive computing resources and long training periods to evaluate multiple model architectures before selecting the optimal one.

Method used

A facility that generates spectral training observability data through dynamic mode decomposition, allowing for real-time evaluation of model training and adjustment of hyperparameters, reducing the need for resource-intensive prediction of future training data.

Benefits of technology

This approach reduces computing resource usage, accelerates the evaluation process, and enables more accurate assessment of model training, so that an optimal architecture can be selected efficiently without lengthy training periods.

✦ Generated by Eureka AI based on patent content.


Abstract

A facility for evaluating the training of machine learning models is described. The facility receives an indication of a machine learning model that is being trained and first and second training observability data that indicate the state of the machine learning model at a first and second time, respectively. The facility generates spectral training observability data by applying a dynamic mode decomposition to the first and second training observability data. The facility determines whether training of the machine learning model is to change based on the spectral observability data and causes the training to change based on the determination.

A facility for optimizing mixed precision training for artificial intelligence models is described. The facility identifies a computer number format used to represent one or more parameters of an artificial intelligence model. The facility obtains first and second training observability data that indicate states of the AI model during a first and second training period, respectively. The facility generates spectral training observability data by applying dynamic mode decomposition to the first and second training observability data. The facility determines whether to change the computer number format used to represent the one or more parameters based on the spectral training observability data and causes the AI model to be trained based on the determination.

A facility for evaluating the training of ensemble machine learning models is described. The facility receives an indication of an ensemble machine learning model that is being trained. For each constituent machine learning model of the ensemble machine learning model, the facility receives first and second training observability data indicating at least one state of the constituent machine learning model. The facility generates spectral training observability data for each constituent machine learning model based on the first and second training observability data. The facility determines, for at least a subset of the constituent machine learning models, an adjustment to an aspect of the constituent machine learning model. Based on the determination, the facility continues the training of the ensemble machine learning model in a way that captures the determined adjustments.

Description

EVALUATING TRAINING OF MACHINE LEARNING MODELS AND ALTERING THE TRAINING BASED ON THE EVALUATION

CROSS-REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims the benefit of provisional U.S. Application No. 63/729,142, filed on December 6, 2024, and entitled “EVALUATING TRAINING OF MACHINE LEARNING MODELS AND ALTERING THE TRAINING BASED ON THE EVALUATION”; U.S. Application No. 63/730,895, filed on December 11, 2024, and entitled “OPTIMIZING MIXED PRECISION TRAINING OF ARTIFICIAL INTELLIGENCE MODELS”; and U.S. Application No. 63/748,395, filed on January 22, 2025, and entitled “EVALUATING THE TRAINING OF ENSEMBLE MACHINE LEARNING MODELS AND ADJUSTING AN ASPECT OF CONSTITUENT MACHINE LEARNING MODELS BASED ON THE EVALUATION,” which are hereby incorporated by reference in their entireties.

[0002] In cases where the present application conflicts with a document incorporated by reference, the present application controls.

BACKGROUND

[0003] Section 1: Machine learning models, such as Large Language Models (“LLMs”), are typically trained by using large datasets over months-long periods of time.

[0004] Section 2: Training artificial intelligence models such as large language models (i.e., “LLMs”) is often expensive, requiring large datasets and significant computing resources.

[0005] Section 3: Machine learning models, such as ensemble machine learning models, Large Language Models (“LLMs”), and ensemble LLMs, are typically trained by using large datasets over months-long periods of time.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] Figure 1.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates.

[0007] Figure 1.2 is a flow diagram showing a process for changing the training of a machine learning model, used by the facility in some embodiments.

720229.45101WO / #11326387.1

[0008] Figure 1.3 is a sample observability training data matrix describing a state of a machine learning model during a period of time that occurs while the machine learning model is being trained, used by the facility in some embodiments.

[0009] Figure 1.4 is a flow diagram showing a process for performing a dynamic mode decomposition on training observability data, used by the facility in some embodiments.

[0010] Figure 1.5 is a flow diagram showing a process for selecting a model architecture for a machine learning model, used by the facility in some embodiments.

[0011] Figure 1.6 is a block diagram showing a system for obtaining training observability data and evaluating the training of a machine learning model, used by the facilities in some embodiments.

[0012] Figure 2.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates.

[0013] Figure 2.2 is a flow diagram showing a process used by the facility in some embodiments to determine a computer number format with which to represent one or more parameters of an artificial intelligence model.

[0014] Figure 2.3 is a sample training data observability matrix used by the facility in some embodiments to describe a training state of an artificial intelligence model.

[0015] Figure 2.4 is a flow diagram showing a process used by the facility in some embodiments to perform dynamic mode decomposition on training observability data.

[0016] Figure 2.5 is a flow diagram showing a process used by the facility in some embodiments to select a computer number format with which to represent one or more parameters of an artificial intelligence model.

[0017] Figure 2.6 is a block diagram showing a system used by the facility in some embodiments to obtain training observability data and evaluate training of an artificial intelligence model.

[0018] Figure 3.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates.

[0019] Figure 3.2 is a flow diagram showing a process for generating spectral training observability data, used by the facility in some embodiments.

[0020] Figure 3.3 is a sample observability training data matrix describing a state of a machine learning model during a period of time that occurs while the machine learning model is being trained, used by the facility in some embodiments.

[0021] Figure 3.4 is a flow diagram showing a process for performing a dynamic mode decomposition on training observability data, used by the facility in some embodiments.

[0022] Figure 3.5 is a block diagram showing a sample training life-cycle for training an ensemble machine learning model, used by the facility in some embodiments.

[0023] Figure 3.6 is a flow diagram showing a process for changing an aspect of a constituent machine learning model or ensemble machine learning model during training, used by the facility in some embodiments.

[0024] Figure 3.7 is a flow diagram showing a process for adjusting a hyperparameter or model architecture for a constituent machine learning model of an ensemble machine learning model, used by the facility in some embodiments.

[0025] Figure 3.8 is a flow diagram showing a process for removing a constituent machine learning model from an ensemble machine learning model during training, used by the facility in some embodiments.

[0026] Figure 3.9 is a block diagram showing a system for obtaining training observability data and evaluating the training of a machine learning model, used by the facility in some embodiments.

DETAILED DESCRIPTION

[0027] Section 1: The inventors have recognized that it would be of great benefit to those who train machine learning models, including model training managers, to determine whether to reject a model architecture for a machine learning model as soon as possible during the training phase of the machine learning model. However, because it is difficult to predict how a model architecture and the training data will affect the training of the model, conventional approaches are unable to predict the likelihood of success that any model architecture of the many choices of model architectures will have. Additionally, the inventors have also recognized that some model architectures may perform worse than others early in training, but may still be successful if one or more “hyperparameters” or other attributes of the model’s architecture are changed before training continues.

[0028] The inventors have further recognized that while conventional methods of assessing the performance of machine learning models during the training phase exist, these methods require long periods of time to train the model, which increase exponentially as the size of the model increases. Furthermore, training a machine learning model, such as a large language model (i.e., an “LLM”), requires a large amount of computing resources, such as processing power, graphics processing unit usage, memory usage, electricity, and other computer resources, and the need for such resources increases with each model architecture that is assessed for the purpose of choosing the resulting model architecture of the machine learning model. Thus, conventional methods of assessing performance of machine learning models during the training phase require the use of computing resources to train many different versions of a machine learning model that each have different model architectures, and then discard all but the version that is configured with the resulting model architecture. In some cases, training a single machine learning model requires the use of computing resources for many months, and each model architecture considered must be trained for at least a matter of months before conventional systems are able to select a model architecture to use for the machine learning model.

[0029] Furthermore, the inventors have recognized that the amount of data generated as a result of training each version of the machine learning model is on a scale that also requires a large amount of computing resources to process and analyze, and that these resources must be expended multiple times throughout the training phase of the machine learning model in order to select the resulting model architecture of the machine learning model. For example, some conventional systems receive training observability data matrices that describe the state of the machine learning model while it is being trained, and predict future training observability data matrices to assess whether training the model should continue. However, each training observability data matrix may include hundreds, thousands, etc., of data points, and predicting what each of these data points will be at a future time requires a significant amount of computing resources in addition to the resources being used to train the model.

[0030] As a result of these disadvantages, conventional systems are currently unable to reliably and accurately predict whether a model architecture is able to be used for a machine learning model without first training multiple versions of the model that each have different model architectures. Furthermore, conventional systems must also use a large amount of computing resources in order to assess each model architecture considered for a machine learning model at multiple times throughout the training phase in order to select the resulting model architecture.

[0031] In response to recognizing these disadvantages, the inventors have conceived and reduced to practice a software and/or hardware facility for managing the training of machine learning models based on training observability data representing the state of the machine learning model while it is being trained (“the facility”). By generating spectral training observability data, the facility is able to assess the training of a machine learning model without generating predicted training observability matrices. Furthermore, the facility is able to use the spectral training observability data to tune one or more hyperparameters for training a machine learning model.

[0032] Additionally, by generating spectral training observability data instead of predicting future training observability data, the facility is able to tune hyperparameters during the training of a machine learning model, determine whether to cease training of the machine learning model, or perform other functions, with fewer computing resources than conventional systems that predict future training observability data. Generating spectral training observability data also allows for samples of training observability data to be nonuniform, in contrast with the uniform samples required by conventional systems. Thus, by using spectral training observability data, the facility is able to alter the interval for obtaining additional training observability data to be more frequent, less frequent, etc., and to alter the amount of additional training observability data obtained. Therefore, the facility allows sample training observability data to be gathered at any time, and performs the analysis and processing of the training observability data faster and with fewer computing resources than conventional systems. Furthermore, the resources saved by using the processes performed by the facility may then be used directly for training the machine learning model instead of analyzing the data produced as a result of training the machine learning model.

[0033] The facility generates spectral training observability data by applying a dynamic mode decomposition to training observability data received during at least two periods of time that occur during the training of the machine learning model. The training observability data is data that indicates the state of a machine learning model during a period of time while the machine learning model is being trained. In some embodiments, the training observability data indicates training observability metrics that are measured at one or more times during the time period for which the training observability data is received. For example, the training observability data may be data that represents one or more metrics calculated based on observable data of the state of the machine learning model while it is being trained.
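
As an illustration only, the following is a minimal sketch of the standard "exact" dynamic mode decomposition from the numerical literature, applied to two matrices of training observability data. It is not taken from the application and may differ from the steps the facility performs. The function name `dmd_spectrum`, the tolerance, the assumption that the second matrix holds the same metrics as the first shifted forward by a fixed number of steps, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def dmd_spectrum(X, Y, rank=None, tol=1e-10):
    """Return DMD eigenvalues and modes for snapshot matrices X and Y.

    Columns are snapshots. Y is assumed (for this sketch) to hold the same
    metrics as X, shifted forward by a fixed number of steps, so Y ~ A X.
    """
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    if rank is None:
        # Keep only singular values that are numerically non-zero.
        rank = int(np.sum(s > tol * s[0]))
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]

    # Y V Sigma^-1: dividing by s scales column j by 1 / s[j].
    Y_V_Sinv = (Y @ Vh.conj().T) / s
    # Reduced operator: A_tilde = U* Y V Sigma^-1.
    A_tilde = U.conj().T @ Y_V_Sinv

    eigenvalues, W = np.linalg.eig(A_tilde)
    modes = Y_V_Sinv @ W  # exact DMD modes
    return eigenvalues, modes


# Synthetic data: two made-up metrics that shrink by 0.9 and 0.5 per step.
A_true = np.diag([0.9, 0.5])
snapshots = np.stack(
    [np.linalg.matrix_power(A_true, k) @ np.ones(2) for k in range(6)], axis=1
)
first, second = snapshots[:, :-1], snapshots[:, 1:]
eigenvalues, modes = dmd_spectrum(first, second)
```

In this synthetic case the recovered eigenvalues match the per-step decay factors, which is the sense in which the eigenvalues and eigenvectors summarize how the observed metrics evolve between the two periods.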

[0034] In some embodiments, the time periods for different sets of training observability data intersect. For example, first training observability data may be received between 100 seconds and 200 seconds after training a model begins and second training observability data may be received between 150 seconds and 250 seconds after training the model begins. In some embodiments, the facility generates a training observability data matrix for training observability data received within a time period. Continuing the example above, the facility may generate a first matrix of training observability data for the data received between 100 and 200 seconds, and a second matrix of training observability data for the data received between 150 and 250 seconds.
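
The overlapping time periods in the example above can be sketched as follows. This is a hypothetical illustration, not the facility's implementation: the function name `window_matrix`, the one-sample-per-second layout, the half-open window boundaries, and the synthetic metric values are assumptions made for the example.

```python
def window_matrix(samples, start, end):
    """Return a metrics-by-time matrix for samples with start <= t < end.

    `samples` maps a timestamp in seconds to a list of metric values
    (a hypothetical layout chosen for this sketch).
    """
    times = sorted(t for t in samples if start <= t < end)
    n_metrics = len(samples[times[0]]) if times else 0
    # One row per metric, one column per sampled time.
    return [[samples[t][m] for t in times] for m in range(n_metrics)]


# Synthetic data: two made-up metrics sampled once per second for 300 seconds.
samples = {t: [1.0 / (1 + t), 0.5 * t] for t in range(300)}

first = window_matrix(samples, 100, 200)   # data received between 100 s and 200 s
second = window_matrix(samples, 150, 250)  # data received between 150 s and 250 s
```

The two matrices share the samples taken between 150 and 200 seconds, which is what it means for the time periods to intersect.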

[0035] In some embodiments, the facility changes the frequency at which training observability data is obtained. In such embodiments, the facility may change the frequency at which training observability data is obtained based on spectral training observability data, input indicating that the frequency of receiving training observability data is to change, input indicating that the facility is to obtain training observability data during a selected time period, the amount of time that the model has been trained, the amount of training observability data received, other methods of determining whether the frequency of receiving the training observability data is to change, or some combination thereof. In some embodiments, the facility may increase or reduce the amount of training observability data obtained based on indications similar to those of changing the frequency of obtaining training observability data.

[0036] For example, the facility may change the frequency of receiving training observability data as the machine learning model is being trained, such that earlier stages of training correspond to a greater frequency of obtaining training observability data and later stages of training correspond to a lesser frequency of obtaining training observability data. In another example, the facility may receive input, such as user input, input from a system that manages training of the machine learning model, or some combination thereof, that indicates that training observability data for a selected time period is to be received. In yet another example, the facility may change the frequency of obtaining training observability data based on a determination that training the machine learning model is proceeding well, proceeding poorly, etc. For example, if the spectral training observability data indicates that training the machine learning model is proceeding well, the facility may reduce the frequency at which training observability data is received. In another example, if the spectral training observability data indicates that training the machine learning model is proceeding poorly, the facility may increase the frequency at which training observability data is obtained.
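
The last two examples can be sketched as a simple interval policy. This is a minimal sketch under stated assumptions: the function name `next_sampling_interval`, the "well" / "poorly" labels, the doubling and halving factor, and the interval bounds are hypothetical, and the text does not specify how the assessment itself is derived from the spectral training observability data.

```python
def next_sampling_interval(current_interval, assessment, factor=2,
                           min_interval=1, max_interval=1000):
    """Return the number of training iterations to wait before the next sample.

    A longer interval means a lower sampling frequency. `assessment` is a
    hypothetical label produced elsewhere from the spectral data.
    """
    if assessment == "well":
        proposed = current_interval * factor    # sample less often
    elif assessment == "poorly":
        proposed = current_interval // factor   # sample more often
    else:
        proposed = current_interval             # no change
    # Clamp to the allowed range (bounds are assumptions for this sketch).
    return max(min_interval, min(max_interval, proposed))
```

For instance, `next_sampling_interval(100, "well")` lengthens the interval to 200 iterations, while `next_sampling_interval(100, "poorly")` shortens it to 50.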

[0037] The spectral training observability data may include one or more eigenvalues, one or more eigenvectors, or some combination thereof. The facility uses the spectral training observability data to determine whether training of the machine learning model with the model architecture for which the spectral observability data is generated is to cease. In some embodiments, the facility uses the spectral observability data to identify one or more hyperparameters during the training of the machine learning model that are to be changed. In such embodiments, the facility may determine the magnitude of the change based on the spectral observability data. In some embodiments, the facility may cause at least one of the one or more hyperparameters to be changed by automatically changing the at least one hyperparameter, receiving input indicating that at least one hyperparameter of the one or more hyperparameters is to be changed, causing at least one hyperparameter of the one or more hyperparameters to change via one or more other methods of changing a hyperparameter, or some combination thereof.
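
The paragraph above does not state how the eigenvalues are interpreted. As background only, in a standard discrete-time dynamic mode decomposition an eigenvalue with magnitude above 1 corresponds to a growing mode and one with magnitude below 1 to a decaying mode. The sketch below merely counts modes on that basis; the function name `summarize_eigenvalues` and the tolerance are assumptions, and how such a summary would map to ceasing training or changing a hyperparameter is deliberately left open.

```python
def summarize_eigenvalues(eigenvalues, tol=1e-6):
    """Count growing, neutral, and decaying modes by eigenvalue magnitude.

    Uses the standard discrete-time reading of DMD eigenvalues; this is an
    illustrative assumption, not a rule stated in the text.
    """
    summary = {"growing": 0, "neutral": 0, "decaying": 0}
    for value in eigenvalues:
        magnitude = abs(value)  # works for real and complex eigenvalues
        if magnitude > 1 + tol:
            summary["growing"] += 1
        elif magnitude < 1 - tol:
            summary["decaying"] += 1
        else:
            summary["neutral"] += 1
    return summary
```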

[0038] By performing in some or all of the ways described above, the facility is able to reduce the computing resources needed to evaluate the training of a machine learning model and provide more accurate evaluations of the training of the machine learning model. Also, the facility improves the functioning of computer or other hardware, such as by reducing the dynamic display area, processing, storage, and/or data transmission resources needed to perform a certain task, thereby enabling the task to be permitted by less capable, capacious, and/or expensive hardware devices, and/or be performed with lesser latency, and/or preserving more of the conserved resources for use in performing other tasks. For example, generating spectral training observability data requires less memory, less processing power, and is able to be performed more quickly, when compared to generating a prediction of future training observability data, because generating spectral training observability data uses fewer resources than predicting future training observability data. Also, by generating spectral training observability data, the facility does not need to obtain training observability data at predetermined and unchangeable time intervals. Thus, the facility is able to vary the frequency of obtaining training observability data and amount of training observability data received as the model is being trained. This feature also enables the facility to evaluate the training of the machine learning model contemporaneously with receiving an indication that additional training observability data is to be received during a selected time period.

[0039] Further, for at least some of the domains and scenarios discussed herein, the processes described herein as being performed automatically by a computing system cannot practically be performed in the human mind, for reasons that include that the starting data, intermediate state(s), and ending data are too voluminous and/or poorly organized for human access and processing, and/or are in a form not perceivable and/or expressible by the human mind; the involved data manipulation operations and/or subprocesses are too complex, and/or too different from typical human mental operations; required response times are too short to be satisfied by human performance; etc.

[0040] Figure 1.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates. In various embodiments, these computer systems and other devices 1.100 can include server computer systems, cloud computing platforms or virtual machines in other configurations, desktop computer systems, laptop computer systems, netbooks, mobile phones, personal digital assistants, televisions, cameras, automobile computers, electronic media players, etc. In various embodiments, the computer systems and devices include zero or more of each of the following: a processor 1.101 for executing computer programs and/or training or applying machine learning models, such as a CPU, GPU, TPU, NNP, FPGA, or ASIC; a computer memory 1.102 (such as RAM, SDRAM, ROM, PROM, etc.) for storing programs and data while they are being used, including the facility and associated data, an operating system including a kernel, and device drivers; a persistent storage device 1.103, such as a hard drive or flash drive for persistently storing programs and data; a computer-readable media drive 1.104, such as a floppy, CD-ROM, or DVD drive, for reading programs and data stored on a computer-readable medium; and a network connection 1.105 for connecting the computer system to other computer systems to send and/or receive data, such as via the Internet or another network and its networking hardware, such as switches, routers, repeaters, electrical cables and optical fibers, light emitters and receivers, radio transmitters and receivers, and the like. None of the components shown in Figure 1.1 and discussed above constitutes a data signal per se. While computer systems configured as described above are typically used to support the operation of the facility, those skilled in the art will appreciate that the facility may be implemented using devices of various types and configurations, and having various components.

[0041] Those skilled in the art will appreciate that the acts shown in the flow diagrams of Figures 1.2, 1.4, and 1.5 discussed below may be altered in a variety of ways. For example, the order of the acts may be rearranged; some acts may be performed in parallel; shown acts may be omitted, or other acts may be included; a shown act may be divided into subacts, or multiple shown acts may be combined into a single act, etc.

[0042] While the table diagram shown in Figure 1.3 discussed below shows a table that represents a matrix whose contents and organization are designed to make them more comprehensible by a human reader, those skilled in the art will appreciate that actual data structures used by the facility to store this information may differ from the table shown, in that they, for example, may be organized in a different manner; may contain more or less information than shown; may be compressed, encrypted, and/or indexed; may contain a much larger number of rows than shown, etc. Additionally, in some embodiments, rather than storing the data shown in the table diagrams in tables, the facility stores it in semi-structured or unstructured data stores, such as JSON objects.

[0043] Figure 1.2 is a flow diagram showing a process 1.200 for changing the training of a machine learning model, used by the facility in some embodiments. First, at act 1.201, the facility receives an indication of a machine learning model that is being trained. In some embodiments, at act 1.201, the facility receives an indication of a model architecture of the machine learning model.

[0044] At act 1.202, the facility receives first training observability data obtained during a first time period during which the machine learning model is being trained. In some embodiments, the facility receives training observability data from one or more systems that receive data indicating metrics that describe how a machine learning model is being trained from a computer system that trains the machine learning model, such as the watch dog 1.602, described below in connection with Figure 1.6.

[0045] In some embodiments, the facility receives training observability data at one or more selected time intervals, at one or more selected training iteration intervals, or some combination thereof. In some embodiments, the facility may alter the interval at which training observability data is received based on spectral training observability data, such as the spectral training observability data generated during act 1.204, described below.

[0046] In some embodiments, the time interval, training iteration interval, or some combination thereof, of the second training observability data is the same as for the first training observability data, but is shifted by a selected amount of time, a number of training iterations, or some combination thereof. For example, the first training observability data may be collected between time 1 and time 10, and the second training observability data may be collected between time 5 and time 15. In another example, the first training observability data may be collected between time 1 and time 10, and the second training observability data may be collected between time 20 and time 30. In some embodiments, the size of the interval for receiving the training observability data may be different between the first training observability data and subsequent instances of training observability data. For example, the size of the interval for the first training observability data may be 10, and for subsequent training observability data the interval may be 5, 20, 3, etc. In such an example, the size of the interval for subsequent training observability data may be different for a portion of the subsequent training observability data than for other portions of the subsequent training observability data (e.g., the size of the interval for the second training observability data may be 5, and the size of the interval of a third instance of training observability data may be 7).

[0047] In some embodiments, the facility determines a time interval for training observability data based on spectral observability data, such as the spectral observability data generated as part of performing act 1.204 described below. In such embodiments, the facility may determine the number of iteration steps for which the spectral observability data is accurate by determining the number of iteration steps for which the spectral observability data does not include an error. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, for training observability data at one or more selected times, training iterations, or some combination thereof. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, when a checkpoint for training of the machine learning model is identified. In some such embodiments, the facility identifies a checkpoint for training of the machine learning model based on user input, one or more selected training iteration intervals, one or more selected time intervals, or some combination thereof.

[0048] At act 1.203, the facility receives second training observability data obtained during a second time period during which the machine learning model is being trained. In some embodiments, the facility performs act 1.203 in a similar manner to act 1.202.

[0049] In some embodiments, the training observability data is included in a matrix of training observability data, such as the training observability data matrix 1.300, described below in connection with Figure 1.3.

[0050] Figure 1.3 is a sample observability training data matrix 1.300 describing a state of a machine learning model during a period of time that occurs while the machine learning model is being trained, used by the facility in some embodiments. The columns of the observability training data matrix 1.300, such as columns 1.321 and 1.322, indicate a time at which the observability training data indicated in the column is received. In some embodiments, the time indicated by the column is a time represented by an “iteration step,” i.e., an iteration of training the machine learning model. The observability data metrics rows 1.301 each indicate an observability metric associated with training the machine learning model. The facility may determine the observability metrics based on observable data of the state of the machine learning model while it is being trained. In some embodiments, a system other than the facility generates one or more of the observability metrics, the observability training data matrix, or some combination thereof.

[0051] The observability metrics in the observability training data matrix 1.300 may include, but are not limited to, a cross entropy, a gradient norm, a perplexity, a learning rate, a validation loss, a BiLingual Evaluation Understudy (BLEU) score, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an F1 score, precision, recall, an Area Under the Curve Receiver Operating Characteristics (AUC-ROC) score, and an Area Under the Curve Precision Recall Curve (AUC-PRC) score.
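
A matrix with one row per observability metric and one column per iteration step, as described above, might be assembled along the following lines. This is a hedged sketch: the function name `build_observability_matrix`, the `records` layout, the choice of four metrics, and all numeric values are invented for illustration and do not come from Figure 1.3.

```python
import math

import numpy as np

# Hypothetical row order; the text lists these metrics among others.
METRIC_ROWS = ["cross_entropy", "gradient_norm", "perplexity", "learning_rate"]


def build_observability_matrix(records, metric_rows=METRIC_ROWS):
    """Return (steps, matrix) with one row per metric and one column per step.

    `records` maps an iteration step to a dict of metric values.
    """
    steps = sorted(records)
    matrix = np.array(
        [[records[step][name] for step in steps] for name in metric_rows],
        dtype=float,
    )
    return steps, matrix


# Made-up values; perplexity is computed as exp(cross entropy in nats).
records = {
    step: {
        "cross_entropy": ce,
        "gradient_norm": gn,
        "perplexity": math.exp(ce),
        "learning_rate": 1e-3,
    }
    for step, ce, gn in [(100, 2.0, 1.5), (200, 1.6, 1.1), (300, 1.4, 0.9)]
}
steps, matrix = build_observability_matrix(records)
```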

[0052] Returning to Figure 1.2, at act 1.204, the facility generates spectral training observability data based on the first and second training observability data. In some embodiments, the facility uses the process 1.400, described below in connection with Figure 1.4 to generate the spectral training observability data.

[0053] Figure 1.4 is a flow diagram showing a process for performing a dynamic mode decomposition on training observability data, used by the facility in some embodiments. First, at act 1.401, the facility receives a first matrix indicating training observability data for a machine learning model that is being trained.

[0054] At act 1.402, the facility receives a second matrix indicating training observability data for the machine learning model that is being trained. In some embodiments, the facility performs acts 1.401 and 1.402 in a similar manner to acts 1.201 and 1.202, respectively, described above in connection with Figure 1.2. The first matrix and second matrix may be observability training data matrices, such as the observability training data matrix 1.300 described above in connection with Figure 1.3.

[0055] At act 1.403, the facility determines a singular value decomposition of the first matrix. In some embodiments, the facility determines the singular value decomposition based on Equation 1.1 below:

$$X = U \Sigma V^{*} \tag{1.1}$$

[0056] In Equation 1.1: “X” represents the first matrix, “U” represents a complex unitary matrix, “Σ” represents a rectangular diagonal matrix, and “V*” represents the conjugate transpose of complex unitary matrix V. In some embodiments, the facility generates “U,” “Σ,” and “V*” based on the first matrix “X” by factoring “X” into these three components, such as by using the theories of singular value decomposition, dynamic mode decomposition, or a combination thereof.

[0057] In some embodiments, the facility determines a compact singular value decomposition, such as according to Equation 1.1A below. In compact SVD, “Σr” includes the “r” non-zero singular values of “Σ,” “Ur” includes the corresponding “r” columns of “U,” and the “r” rows of “V*” corresponding to the non-zero singular values “Σr” in “Σ” are calculated:

$$X = U_r \Sigma_r V_r^{*} \tag{1.1A}$$

[0058] In some embodiments, the facility determines a truncated singular value decomposition that approximates “X” using a configurable number “t” of the largest singular values of “Σ,” such as according to Equation 1.1B below:

$$X \approx U_t \Sigma_t V_t^{*} \tag{1.1B}$$

[0059] In truncated singular value decomposition, “Σt” includes the “t” largest singular values of “Σ”. Accordingly, “t” columns of “U” and “t” rows of “V*” corresponding to the “t” largest singular values in “Σt” are calculated. In various embodiments, “t” is any value less than or equal to the number of singular values in “Σ”.
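
As an illustration of the full, compact, and truncated decompositions described above, the following sketch uses NumPy's `numpy.linalg.svd`. The helper names and the relative tolerance used to decide which singular values count as non-zero are assumptions made for this example, not details of the facility.

```python
import numpy as np

def compact_svd(X, rel_tol=1e-10):
    """Keep only the r singular values treated as non-zero (rel_tol is an assumption)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > rel_tol * s[0])) if s.size and s[0] > 0 else 0
    return U[:, :r], s[:r], Vh[:r, :]

def truncated_svd(X, t):
    """Keep the t largest singular values; NumPy returns them in descending order."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    t = min(t, s.size)
    return U[:, :t], s[:t], Vh[:t, :]

# A 3x4 matrix of rank 2: the second row is twice the first.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [0.0, 1.0, 0.0, 1.0]])

Ur, sr, Vhr = compact_svd(X)
X_compact = Ur @ np.diag(sr) @ Vhr       # reproduces X up to rounding

Ut, st, Vht = truncated_svd(X, t=1)
X_truncated = Ut @ np.diag(st) @ Vht     # rank-1 approximation of X
```

The compact form reproduces the matrix exactly (up to floating-point rounding), while the truncated form trades accuracy for a smaller factorization.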

[0060] At act 1.404, the facility defines a third matrix based on the second matrix and the singular value decomposition of the first matrix. Thus, at act 1.404, the facility generates a definition of the third matrix without generating the third matrix itself. In some embodiments, the facility determines the third matrix based on Equation 1.2 below:

$$A = U^{*} Y V \Sigma^{-1} \tag{1.2}$$

[0061] In Equation 1.2: “U*” is the conjugate transpose of “U” from Equation 1.1, “V” is the complex unitary matrix described with respect to Equation 1.1, and “Σ⁻¹” is the inverse of “Σ” from Equation 1.1. “Y” represents the second matrix, and “A” represents the third matrix. In embodiments where truncated SVD is used, “Ut*” is used in place of “U*,” “Σt” is used in place of “Σ,” and “Vt” is used in place of “V” according to Equation 1.1B. In embodiments where compact SVD is used, “Ur” is used in place of “U,” “Σr” is used in place of “Σ,” and “Vr” is used in place of “V” according to Equation 1.1A.

[0062] At act 1.405, the facility generates one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on the third matrix. In some embodiments, the facility performs eigendecomposition on the third matrix to determine the one or more eigenvalues, the one or more eigenvectors, or some combination thereof. In some embodiments, the facility identifies the one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on Equation 1.3, below:

$$A w = \lambda w \tag{1.3}$$

[0063] In Equation 1.3: “A” represents the third matrix, “w” represents a matrix having columns that correspond to eigenvectors, and “λ” represents a matrix having columns that correspond to eigenvalues.

[0064] In some embodiments, if the third matrix “A” is not a square matrix, as part of performing act 1.405, the facility transposes the third matrix to calculate “Aᵀ,” such as by defining the rows of the matrix as columns and the columns of the matrix as rows. In such embodiments, the facility generates the eigenvalues and eigenvectors based on the transposed third matrix.

[0065] In some embodiments in which “A” is not a square matrix, the facility calculates a square matrix “A*Aᵀ” and generates the eigenvalues and eigenvectors based on “A*Aᵀ.”

[0066] After act 1.405, the process 1.400 ends.
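
Acts 1.401 through 1.405 can be sketched end to end as shown below. This is a minimal sketch under stated assumptions: it uses a truncated (or, by default, compact) SVD so that the third matrix is square, and the function name `dmd_spectrum`, its parameters, and the tolerance are hypothetical. The toy data is generated from a known linear map so that the recovered eigenvalues can be checked.

```python
import numpy as np

def dmd_spectrum(X, Y, t=None, rel_tol=1e-10):
    """Return eigenvalues and eigenvectors of A = U* Y V Sigma^-1 (Equation 1.2).

    X and Y are the first and second observability matrices. If t is None,
    the singular values treated as non-zero are kept (rel_tol is an assumption).
    """
    U, s, Vh = np.linalg.svd(X, full_matrices=False)           # act 1.403
    if t is None:
        t = int(np.sum(s > rel_tol * s[0]))
    U, s, Vh = U[:, :t], s[:t], Vh[:t, :]
    A = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)        # act 1.404
    eigenvalues, eigenvectors = np.linalg.eig(A)               # act 1.405
    return eigenvalues, eigenvectors

# Toy data: snapshots of x_{k+1} = M x_k with known eigenvalues 0.9 and 0.5.
M = np.diag([0.9, 0.5])
snapshots = [np.array([1.0, 1.0])]
for _ in range(3):
    snapshots.append(M @ snapshots[-1])
snapshots = np.stack(snapshots, axis=1)    # shape (2, 4), one column per step

X = snapshots[:, :-1]   # first matrix: steps 0..2
Y = snapshots[:, 1:]    # second matrix: steps 1..3
eigenvalues, eigenvectors = dmd_spectrum(X, Y)
```

For this toy system the eigenvalue magnitudes recovered from the two matrices match those of the generating map `M`.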

[0067] Returning to Figure 1.2, at act 1.205, the facility determines whether training of the machine learning model should be changed. In some embodiments, the facility determines whether training of the machine learning model is to be changed based on the spectral training observability data. For example, if the spectral training observability data includes an eigenvalue whose absolute value is less than or equal to 1, the facility may determine that the training should not be changed. In another example, if the spectral training observability data includes an eigenvalue whose absolute value is greater than or equal to 1, the facility may determine that the training should cease, that a hyperparameter associated with training the machine learning model should be changed, that another change should be made to the training of the machine learning model, or some combination thereof. If the facility determines that training of the machine learning model is to be changed, the process 1.200 proceeds to act 1.206, otherwise the process 1.200 ends.
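
The eigenvalue test in act 1.205 can be sketched as a small decision function. Because the two examples above both include the case where the absolute value equals 1, this sketch makes an assumption: magnitudes within a small tolerance of 1 are treated as "do not change". The function name, return values, and tolerance are hypothetical.

```python
import numpy as np

def training_action(eigenvalues, tol=1e-6):
    """Return "change" if any eigenvalue magnitude exceeds 1 (plus tol), else "continue"."""
    largest_magnitude = float(np.max(np.abs(eigenvalues)))
    if largest_magnitude > 1.0 + tol:
        return "change"      # proceed to act 1.206
    return "continue"        # process 1.200 ends without changing training

stable = training_action(np.array([0.9, 0.5]))
unstable = training_action(np.array([1.2, 0.3]))
complex_case = training_action(np.array([0.6 + 0.9j]))   # magnitude is about 1.08
```

Using the magnitude rather than the real part means complex eigenvalues, which can arise from oscillating metrics, are handled the same way as real ones.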

[0068] In some embodiments, at act 1.205, the facility transmits an indication to a user regarding the determination of whether training of the machine learning model should be changed. In such embodiments, the facility may determine whether to change the machine learning model based on input received in response to transmitting such an indication to the user.

[0069] At act 1.206, the facility causes the training of the machine learning model to change. In some embodiments, the facility automatically causes the training of the machine learning model to change, such as by causing instructions, commands, etc., to be transmitted to a system that manages the training of the machine learning model. In some embodiments, the facility causes the training of the machine learning model to change by transmitting a request to a user for permission to change the training of the machine learning model, and changing - or not changing - the training of the machine learning model based on a response from the user.

[0070] In some embodiments, the facility changes the training of the machine learning model by causing the training of the machine learning model to cease. In some embodiments, the facility changes the training of the machine learning model by changing one or more hyperparameters associated with training the machine learning model, such as: a learning rate for training the machine learning model, a batch size for training the machine learning model, a momentum for training the machine learning model, an adaptive learning rate for training the machine learning model, a number of attention heads of the machine learning model, a number of feedforward layers of the machine learning model, other hyperparameters associated with training a machine learning model, or some combination thereof. In some embodiments, the facility may determine that multiple hyperparameters are to be changed. In such embodiments, the facility may change all of the hyperparameters or a portion of the hyperparameters. In some such embodiments, the facility may receive user input indicating which of the hyperparameters are to be changed.

[0071] In some embodiments, the facility determines which hyperparameters are to be changed by computing a new value for at least one hyperparameter based on the spectral training observability data and comparing the new value for the at least one hyperparameter with a current value of the hyperparameter, such as by using one or more of Equations 1.4-1.7, described below. In some such embodiments, the facility determines to change a hyperparameter based on a determination that the new value for the hyperparameter is outside of a threshold range of values for the hyperparameter. In some embodiments, the threshold range of values is determined based on user input. For example, if the facility determines that a new learning rate computed based on the spectral observability data is outside of a threshold range of learning rate values that includes the current learning rate, the facility may determine that the learning rate for training the machine learning model is to change.
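
The threshold-range comparison described above can be sketched as follows. The form of the range is an assumption made for illustration: a symmetric relative band around the current value, suitable for positive hyperparameters such as a learning rate. The function name and the default band width are hypothetical.

```python
def should_change_hyperparameter(current, proposed, rel_tolerance=0.2):
    """Return True if `proposed` falls outside a band around `current`.

    The band [current*(1-rel_tolerance), current*(1+rel_tolerance)] is a
    hypothetical threshold range; it could instead come from user input.
    """
    lower = current * (1.0 - rel_tolerance)
    upper = current * (1.0 + rel_tolerance)
    return not (lower <= proposed <= upper)

# Current learning rate 1e-3; a proposal of 1.1e-3 is inside the band,
# while a proposal of 5e-4 is outside it.
keep = should_change_hyperparameter(1e-3, 1.1e-3)
change = should_change_hyperparameter(1e-3, 5e-4)
```

A band of this kind avoids changing a hyperparameter in response to small fluctuations in the computed value.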

[0072] In some embodiments, the facility determines a new learning rate for training the machine learning model based on Equation 1.4. In Equation 1.4, “η*” represents the new learning rate and “ρ(A)” represents the maximum eigenvalue magnitude of the spectral training observability data.

[0073] In some embodiments, the facility determines a new batch size for training the machine learning model based on Equation 1.5. In Equation 1.5, “B*” represents the new batch size, “Σ” is the covariance matrix of the gradients, “L” is the Lipschitz constant of the loss function, “μ” is the mean gradient, and “η” is the learning rate.

[0074] In some embodiments, the facility determines a new momentum for training the machine learning model based on Equation 1.6. In Equation 1.6, “β*” represents the new momentum and “κ” represents the maximum eigenvalue included in the spectral observability data divided by the minimum eigenvalue included in the spectral observability data.
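
The two spectral quantities that feed Equations 1.4 and 1.6 (the maximum eigenvalue magnitude, and the ratio of the maximum eigenvalue to the minimum eigenvalue) can be computed as sketched below. The equations themselves are not reproduced in this sketch. One assumption is made: because eigenvalues may be complex, the ratio is taken over magnitudes. The function names are hypothetical.

```python
import numpy as np

def max_eigenvalue_magnitude(eigenvalues):
    """Largest absolute value among the eigenvalues (the input to Equation 1.4)."""
    return float(np.max(np.abs(eigenvalues)))

def eigenvalue_ratio(eigenvalues):
    """Largest magnitude divided by smallest magnitude (the input to Equation 1.6).

    Using magnitudes is an assumption so that complex eigenvalues are handled.
    """
    magnitudes = np.abs(eigenvalues)
    return float(np.max(magnitudes) / np.min(magnitudes))

eigs = np.array([0.9, 0.5, 0.3])
radius = max_eigenvalue_magnitude(eigs)
ratio = eigenvalue_ratio(eigs)
```

A large ratio indicates that the observed training dynamics evolve on widely separated time scales.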

[0075] In some embodiments, the facility determines a new adaptive learning rate for training the machine learning model based on Equation 1.7, below:

[0076] In Equation 1.7, “ηt*” represents the new adaptive learning rate, “α” is the base learning rate, “v̂t” is a bias-corrected second moment estimate, and “ε” is a small constant.

[0077] In some embodiments, the facility determines the magnitude of changing the hyperparameters based on a new hyperparameter calculated for the training of the machine learning model, the precision of the training observability data (such as whether the data is a 32-bit floating point number, a 16-bit floating point number, etc.), or some combination thereof.

[0078] In some embodiments, the facility determines whether a ratio of attention heads and feedforward layers of the machine learning model should be changed based on how much greater than 1 the absolute values of the eigenvalues included in the spectral observability data are. For example, the facility may determine that if the absolute value of the eigenvalues is between 1 and 2, the ratio of attention heads and feedforward layers of the machine learning model should be changed.

[0079] After act 1.206, the process 1.200 ends.

[0080] In some embodiments, the facility performs aspects of the process 1.200 multiple times throughout training of the machine learning model. In such embodiments, the facility may skip one or more of acts 1.201 and 1.202. In some embodiments, the facility may determine a time interval or training iteration interval for receiving additional training observability data based on the spectral observability data in a similar manner to selecting a time interval or training iteration interval described above in connection with act 1.202. In some such embodiments, the facility may use training observability data previously obtained by the facility, such as by performing aspects of the process 1.200, to generate the spectral training observability data. Thus, in such embodiments, the facility may generate the spectral training observability data with more than two sets of training observability data.

[0081] Figure 1.5 is a flow diagram showing a process 1.500 for selecting a model architecture for a machine learning model, used by the facility in some embodiments. First, at act 1.501, the facility receives an indication of a plurality of model architectures.

[0082] At act 1.502a, the facility, for each model architecture of the plurality of model architectures, performs acts 1.503-1.506.

[0083] At act 1.503, the facility causes training of a machine learning model configured based on the model architecture to be initiated.

[0084] At act 1.504, the facility receives first and second training observability data while the machine learning model is being trained. In some embodiments, the facility performs act 1.504 in a similar manner to acts 1.201 and 1.202 and acts 1.401 and 1.402, described above in connection with Figures 1.2 and 1.4, respectively.

[0085] At act 1.505, the facility generates spectral training observability data based on the first and second training observability data. In some embodiments, the facility performs act 1.505 in a similar manner to act 1.204, described above in connection with Figure 1.2.

[0086] At act 1.506, the facility determines whether the training of the model is to change based on the spectral training observability data. In some embodiments, the facility performs act 1.506 in a similar manner to acts 1.205 and 1.206, described above in connection with Figure 1.2.

[0087] At act 1.502b, if the facility has performed acts 1.503-1.506 for each model architecture of the plurality of model architectures, the process 1.500 proceeds to act 1.507. Otherwise, the facility continues to perform acts 1.503-1.506 until they have been performed for each model architecture of the plurality of model architectures.

[0088] At act 1.507, the facility selects a model architecture of the plurality of model architectures based on spectral training observability data generated for each of the model architectures. In some embodiments, the facility selects the model architecture based on eigenvalues or eigenvectors included in the spectral observability data.

[0089] In some embodiments, the facility selects a model architecture by determining whether the absolute value of an eigenvalue, or an aggregation of one or more eigenvalues, included in the spectral observability data for the machine learning model configured based on the model architecture is greater than or equal to 1, less than or equal to 1, equal to 1, etc. For example, the facility may select a model architecture based on a determination that the absolute value of an eigenvalue of the spectral observability data generated for the machine learning model configured with the model architecture is greater than 1. In such an example, the facility may determine whether to cease training of the model, change the training of the model, etc., in a manner similar to act 1.206 described above in connection with Figure 1.2. In some embodiments, if the absolute value of an eigenvalue, or an aggregation of one or more eigenvalues, included in the spectral observability data is less than 1, the facility determines that the training of a machine learning model with the selected architecture is not to change. Thus, in some embodiments, the facility may alter or cease the training of some machine learning models that are being trained without altering or ceasing the training of other machine learning models that are being trained.
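
One of the selection rules described above (selecting an architecture when an eigenvalue of its spectral observability data has an absolute value greater than 1, so that its training can be ceased or changed) can be sketched as follows. The architecture names, the dictionary layout, and the function name are hypothetical, and using the largest magnitude as the aggregation is an assumption.

```python
import numpy as np

def select_architectures(spectra, threshold=1.0):
    """Return names of architectures whose largest eigenvalue magnitude exceeds threshold.

    `spectra` maps an architecture name to the eigenvalues generated for the
    model configured with that architecture (act 1.505).
    """
    selected = []
    for name, eigenvalues in spectra.items():
        if np.max(np.abs(eigenvalues)) > threshold:
            selected.append(name)
    return selected

spectra = {
    "arch_a": np.array([0.9, 0.4]),      # all magnitudes below 1: not selected
    "arch_b": np.array([1.3, 0.2]),      # one magnitude above 1: selected
    "arch_c": np.array([0.7 + 0.8j]),    # magnitude about 1.06: selected
}
selected = select_architectures(spectra)
```

Training for the selected architectures could then be ceased or altered, while training for the unselected architecture continues unchanged.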

[0090] In some embodiments, the facility continues to perform aspects of the process 1.500 until one machine learning model configured with a model architecture remains, or until training of the machine learning models has completed. In such embodiments, the facility may cease the training of one or more machine learning models having a model architecture selected in act 1.507. In some such embodiments, the facility may alter the training of one or more machine learning models having a model architecture selected in act 1.507.

[0091] In embodiments where the facility performs aspects of the process 1.500 multiple times, the facility may determine one or more time intervals or training intervals for receiving additional training observability data, such as in a similar manner to the embodiments described above in connection with Figure 1.2.

[0092] In some embodiments, at act 1.507, none of the model architectures are selected. In some such embodiments, the facility continues training each of the machine learning models. In other such embodiments, the facility presents information regarding the progress of training each of the machine learning models generated based on the spectral observability data to a user and receives input indicating the machine learning models that are to continue training.

[0093] In some embodiments, at act 1.507, all of the model architectures are selected. In such embodiments, the facility may determine whether all of the machine learning models are to cease training, or whether the training of at least a portion of the machine learning models is to be changed. In some such embodiments, the facility presents information regarding the progress of training each of the machine learning models generated based on the spectral observability data to a user, and receives input indicating whether training of all of the machine learning models is to cease or the training of at least a portion of the machine learning models is to be changed. In embodiments where the facility ceases training for all of the machine learning models, the facility may receive an indication of one or more additional model architectures and begin the process 1.500 for each of the additional model architectures.

[0094] After act 1.507, the process 1.500 ends.

[0095] Figure 1.6 is a block diagram showing a sample system 1.600 for obtaining training observability data and evaluating the training of a machine learning model, used by the facility in some embodiments. The system 1.600 includes a model training workload block 1.601, a watch dog block 1.602, a spectral observability data generation and evaluation block (“spectral data block”) 1.603, and a data store block 1.604.

[0096] The model training workload block 1.601 represents a system that trains one or more machine learning models, each machine learning model having been configured based on a model architecture of a plurality of model architectures, based on training data included in the data store block 1.604.

[0097] The watch dog block 1.602 represents a system that generates training observability data based on observable data representing the state of a machine learning model that is being trained.

[0098] The spectral data block 1.603 represents a system that receives training observability data from the watch dog block 1.602, such as in a similar manner to acts 1.201 and 1.202, 1.401 and 1.402, and 1.504, described above in connection with Figures 1.2, 1.4, and 1.5, respectively. The spectral data block 1.603 may perform any of the processes 1.200, 1.400, and 1.500, described above in connection with Figures 1.2, 1.4, and 1.5, respectively. The spectral data block 1.603 may transmit instructions to one or more of the watch dog block 1.602, the model training workload block 1.601, other systems, or some combination thereof, as part of performing the processes 1.200, 1.400, 1.500, other processes, methods, or functions performed by the facility, or some combination thereof.

[0099] The data store block 1.604 stores training data for training one or more machine learning models, training observability data generated by the watch dog block 1.602, and spectral training observability data generated by the spectral data block 1.603.

[0100] Section 2: Computer number formats such as double-precision floating-point (i.e., “FP64”), single-precision floating-point (i.e., “FP32”), etc. enable computers to represent numbers with various precisions using various numbers of bits. Numbers requiring relatively high precision can be represented using a computer number format such as FP64 (requiring 64 bits in memory), while numbers requiring relatively low precision can be represented using a computer number format such as FP8 (requiring 8 bits in memory).

[0101] Artificial intelligence (AI) models may include billions of parameters stored using a computer number format. Due to the large number of parameters typically included in AI models, it is often impractical to use high-precision computer number formats such as FP64 due to memory or processing constraints. But lower-precision formats such as FP8 may lead to less accurate training and performance of the AI model. Furthermore, different parameters of the artificial intelligence model can be stored using different computer number formats. For example, a model may use FP8 during backpropagation and FP32 during inference, store a first group of weights according to a first computer number format and a second group of weights according to a second computer number format, etc.
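
The memory trade-off described above can be made concrete with simple arithmetic. The sketch below counts only the bytes needed to store the parameter values themselves; the parameter count of 7 billion is a hypothetical example, and the function and table names are assumptions.

```python
# Bits used per value by each computer number format.
BITS_PER_VALUE = {"FP64": 64, "FP32": 32, "FP16": 16, "FP8": 8}

def parameter_memory_bytes(num_parameters, number_format):
    """Bytes needed to store `num_parameters` values in the given format."""
    return num_parameters * BITS_PER_VALUE[number_format] // 8

num_parameters = 7_000_000_000   # hypothetical model size
fp64_bytes = parameter_memory_bytes(num_parameters, "FP64")   # 56 GB (decimal)
fp8_bytes = parameter_memory_bytes(num_parameters, "FP8")     # 7 GB (decimal)
```

For this hypothetical model, storing the parameters in FP64 takes eight times the memory of storing them in FP8, before accounting for gradients or optimizer state.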

[0102] It is often difficult to determine which computer number format or combination thereof provides a suitable balance of precision and performance for an AI model. The inventors have recognized that it would be useful to identify optimal computer number formats to use to represent parameters of AI models.

[0103] In response to recognizing these disadvantages, the inventors have conceived and reduced to practice a software and / or hardware facility for optimizing mixed precision training for AI models (“the facility”).

[0104] In some embodiments, the facility identifies a first computer number format used to represent one or more parameters of an artificial intelligence model. The facility obtains first and second training observability data that indicate states of the AI model during a first and second training period, respectively. The facility generates spectral training observability data by applying dynamic mode decomposition to the first and second training observability data. The facility determines, based on the spectral training observability data, whether to change the first computer number format used to represent the one or more parameters to a second computer number format and causes the AI model to be trained based on the determination.

[0105] In some embodiments, the facility determines whether to change the computer number format based on determining whether the computer number format is associated with vanishing or exploding gradients.

[0106] In some embodiments, the facility determines whether to change the computer number format based on determining whether the computer number format is associated with convergence of the AI model.

[0107] In some embodiments, the facility determines whether to change the computer number format based on an early stopping criterion.

[0108] In some embodiments, the facility selects a computer number format with which to represent one or more parameters of the AI model from a plurality of computer number formats.

[0109] In some embodiments, the facility changes the computer number format while the AI model is trained.

[0110] The facility generates spectral training observability data by applying dynamic mode decomposition to training observability data received during at least two periods of time that occur during the training of the AI model. The training observability data is data that indicates the state of an AI model during a period of time while the AI model is being trained. In some embodiments, the training observability data indicates training observability metrics that are measured at one or more times during the time period for which the training observability data is received. For example, the training observability data may be data that represents one or more metrics calculated based on observable data of the state of the AI model while it is being trained.

[0111] In some embodiments, the time periods for different sets of training observability data intersect. For example, first training observability data may be received between 100 seconds and 200 seconds after training a model begins and second training observability data may be received between 150 seconds and 250 seconds after training the model begins. In some embodiments, the facility generates a training observability data matrix for training observability data received within a time period. Continuing the example above, the facility may generate a first matrix of training observability data for the data received between 100 and 200 seconds, and a second matrix of training observability data for the data received between 150 and 250 seconds.
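
The intersecting time periods in the example above can be sketched by slicing one stream of timestamped samples into two matrices. The sampling times, the sample values, the inclusive treatment of the window boundaries, and the function name `window_matrix` are assumptions made for illustration.

```python
import numpy as np

def window_matrix(timestamps, samples, start, end):
    """Return the columns of `samples` whose timestamps fall within [start, end].

    `samples` has one row per metric and one column per timestamp.
    """
    timestamps = np.asarray(timestamps)
    mask = (timestamps >= start) & (timestamps <= end)
    return samples[:, mask]

# Hypothetical samples taken every 50 seconds, with two metrics per sample.
timestamps = np.array([100, 150, 200, 250])
samples = np.arange(8, dtype=float).reshape(2, 4)

first_matrix = window_matrix(timestamps, samples, 100, 200)    # columns at 100, 150, 200
second_matrix = window_matrix(timestamps, samples, 150, 250)   # columns at 150, 200, 250
```

The columns sampled at 150 and 200 seconds appear in both matrices, which reflects the intersection of the two time periods.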

[0112] In some embodiments, the facility changes the frequency at which training observability data is obtained. In some such embodiments, the facility may change the frequency at which training observability data is obtained based on spectral training observability data, input indicating that the frequency of receiving training observability data is to change, input indicating that the facility is to obtain training observability data during a selected time period, the amount of time that the model has been trained, the amount of training observability data received, other methods of determining whether the frequency of receiving the training observability data is to change, or some combination thereof. In some embodiments, the facility may increase or reduce the amount of training observability data obtained based on indications similar to those of changing the frequency of obtaining training observability data.

[0113] For example, the facility may change the frequency of receiving training observability data as the AI model is being trained, such that earlier stages of training correspond to a greater frequency of obtaining training observability data and later stages of training correspond to a lesser frequency of obtaining training observability data. In another example, the facility may receive input, such as user input, input from a system that manages training of the AI model, or some combination thereof, that indicates that training observability data for a selected time period is to be received. In yet another example, the facility may change the frequency of obtaining training observability data based on a determination that training the AI model is proceeding well, proceeding poorly, etc. For example, if the spectral training observability data indicates that training the AI model is proceeding well (e.g., the eigenvalues do not indicate exploding or vanishing gradients), the facility may reduce the frequency at which training observability data is received. In another example, if the spectral training observability data indicates that training the AI model is proceeding poorly (e.g., the eigenvalues indicate exploding or vanishing gradients), the facility may increase the frequency at which training observability data is obtained.
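
The last two examples above (sampling less often when training proceeds well, more often when it proceeds poorly) can be sketched as an interval update. Every numeric choice here is an assumption: the magnitude thresholds used as stand-ins for exploding and vanishing gradients, the halving and doubling of the interval, and the clamping bounds. The function name is hypothetical.

```python
import numpy as np

def next_sampling_interval(current_interval, eigenvalues,
                           explode_above=1.0, vanish_below=0.1,
                           min_interval=10, max_interval=1000):
    """Return the number of iteration steps to wait before the next sample.

    Thresholds are hypothetical stand-ins for "exploding" and "vanishing".
    """
    largest = float(np.max(np.abs(eigenvalues)))
    proceeding_poorly = largest > explode_above or largest < vanish_below
    if proceeding_poorly:
        proposed = current_interval // 2   # sample more frequently
    else:
        proposed = current_interval * 2    # sample less frequently
    return int(min(max(proposed, min_interval), max_interval))

relaxed = next_sampling_interval(100, np.array([0.9, 0.5]))   # training looks healthy
tightened = next_sampling_interval(100, np.array([1.3]))      # possible exploding gradients
```

Clamping the interval keeps the facility from sampling on every step or from effectively ceasing to sample.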

[0114] The spectral training observability data may include one or more eigenvalues, one or more eigenvectors, or some combination thereof. The facility uses the spectral training observability data to determine whether the computer number format of the AI model for which the spectral observability data is generated is to change. In some embodiments, the facility uses the spectral observability data to identify one or more computer number formats during the training of the AI model that are to be changed. In some such embodiments, the facility may determine the magnitude of the change based on the spectral observability data. In some embodiments, the facility may cause at least one of the one or more computer number formats to be changed by automatically changing the at least one computer number format, receive input indicating that at least one computer number format of the one or more computer number formats is to be changed, cause at least one computer number format of the one or more computer number formats to change via one or more other methods of changing a computer number format, or some combination thereof.
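
One possible way to turn eigenvalues into a number format decision is sketched below. The policy is purely illustrative and is an assumption, not the facility's rule: if the largest eigenvalue magnitude exceeds 1, the sketch proposes the next higher-precision format in a hypothetical ordered list; otherwise it keeps the current format.

```python
import numpy as np

# Hypothetical ordering of formats from lowest to highest precision.
FORMATS = ["FP8", "FP16", "FP32", "FP64"]

def propose_number_format(current, eigenvalues, unstable_above=1.0):
    """Propose a format: step up one level of precision if training looks unstable."""
    index = FORMATS.index(current)
    if np.max(np.abs(eigenvalues)) > unstable_above:
        index = min(index + 1, len(FORMATS) - 1)   # already at the top: stay there
    return FORMATS[index]

raised = propose_number_format("FP8", np.array([1.2, 0.4]))    # unstable: FP8 -> FP16
kept = propose_number_format("FP16", np.array([0.8, 0.3]))     # stable: remains FP16
```

A fuller policy could also step down in precision when the eigenvalues indicate stable training with margin, or could apply different formats to different groups of parameters.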

[0115] By performing in some or all of the ways described above, the facility optimizes mixed precision training of an artificial intelligence model. Also, the facility improves the functioning of computer or other hardware, such as by reducing the dynamic display area, processing, storage, and / or data transmission resources needed to perform a certain task, thereby enabling the task to be permitted by less capable, capacious, and / or expensive hardware devices, and / or be performed with lesser latency, and / or preserving more of the conserved resources for use in performing other tasks. For example, by changing the computer number format used to represent one or more parameters of an AI model, the facility may avoid using a computer number format of an unsuitably high precision that increases compute requirements in training the AI model or avoid using a computer number format of an unsuitably low precision that may increase a likelihood that the AI model is re-trained due to inadequate performance. Additionally, the facility may determine which computer number format of a plurality of computer number formats to use to represent the one or more parameters of the AI model.

[0116] Generating spectral training observability data requires less memory and less processing power, and can be performed more quickly, than generating a prediction of future training observability data. Also, by generating spectral training observability data, the facility does not need to obtain training observability data at predetermined and unchangeable time intervals. Thus, the facility is able to vary the frequency of obtaining training observability data and the amount of training observability data received as the model is being trained. This feature also enables the facility to evaluate the training of the AI model contemporaneously with receiving an indication that additional training observability data is to be received during a selected time period.

[0117] Additionally, by generating spectral training observability data instead of predicting future training observability data, the facility is able to perform multi-resolution training of an AI model, determine whether to cease training of the AI model, and determine whether to perform other functions, with fewer computing resources than conventional systems that predict future training observability data. Generating spectral training observability data also allows for samples of training observability data to be nonuniform, in contrast with the uniform samples required by conventional systems. Thus, by using spectral training observability data, the facility is able to alter the interval for obtaining additional training observability data to be more frequent, less frequent, etc., and to alter the amount of additional training observability data obtained. Therefore, the facility allows sample training observability data to be gathered at any time, and performs the analysis and processing of the training observability data faster and with fewer computing resources than conventional systems. Furthermore, the resources saved by using the processes performed by the facility may then be used directly for training the AI model instead of analyzing the data produced as a result of training the AI model.

[0118] Further, for at least some of the domains and scenarios discussed herein, the processes described herein as being performed automatically by a computing system cannot practically be performed in the human mind, for reasons that include that the starting data, intermediate state(s), and ending data are too voluminous and/or poorly organized for human access and processing, and/or are in a form not perceivable and/or expressible by the human mind; the involved data manipulation operations and/or subprocesses are too complex, and/or too different from typical human mental operations; required response times are too short to be satisfied by human performance; etc. For example, it is impractical for a human mind to create spectral training observability data by performing dynamic mode decomposition and determine whether to change a computer number format based on the spectral training observability data.

[0119] Figure 2.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates. In various embodiments, these computer systems and other devices 2.100 can include server computer systems, cloud computing platforms or virtual machines in other configurations, desktop computer systems, laptop computer systems, netbooks, mobile phones, personal digital assistants, televisions, cameras, automobile computers, electronic media players, etc. In various embodiments, the computer systems and devices include zero or more of each of the following: a processor 2.101 for executing computer programs and/or training or applying AI models, such as a CPU, GPU, TPU, NNP, FPGA, or ASIC; a computer memory 2.102, such as RAM, SDRAM, ROM, PROM, etc., for storing programs and data while they are being used, including the facility and associated data, an operating system including a kernel, and device drivers; a persistent storage device 2.103, such as a hard drive or flash drive, for persistently storing programs and data; a computer-readable media drive 2.104, such as a floppy, CD-ROM, or DVD drive, for reading programs and data stored on a computer-readable medium; and a network connection 2.105 for connecting the computer system to other computer systems to send and/or receive data, such as via the Internet or another network and its networking hardware, such as switches, routers, repeaters, electrical cables and optical fibers, light emitters and receivers, radio transmitters and receivers, and the like. None of the components shown in Figure 2.1 and discussed above constitutes a data signal per se.
While computer systems configured as described above are typically used to support the operation of the facility, those skilled in the art will appreciate that the facility may be implemented using devices of various types and configurations, and having various components.

[0120] Figure 2.2 is a flow diagram showing a process 2.200 used by the facility in some embodiments to determine a computer number format with which to represent one or more parameters of an artificial intelligence model. In some embodiments, process 2.200 is implemented using computer system 2.100 of Fig. 2.1.

[0121] Those skilled in the art will appreciate that the acts shown in Figure 2.2 and in each of the flow diagrams discussed below may be altered in a variety of ways. For example, the order of the acts may be rearranged; some acts may be performed in parallel; shown acts may be omitted, or other acts may be included; a shown act may be divided into subacts, or multiple shown acts may be combined into a single act, etc.

[0122] Process 2.200 begins, after a start block, at block 2.202 where the facility identifies a computer number format used to represent one or more parameters of an artificial intelligence model.

[0123] In various embodiments, the one or more parameters include one or more weights, biases, etc. of the artificial intelligence model.

[0124] In various embodiments, the computer number format includes any computer number format such as floating point 32 (i.e., “FP32”), floating point 16 (i.e., “FP16”), floating point 8 (i.e., “FP8”), brain float 16 (i.e., “BF16”), TensorFloat32 (i.e., “TF32”), Microsoft® binary format, IBM® floating-point architecture, any fixed-point computer number format, any logarithmic computer number format, etc. After block 2.202, process 2.200 proceeds to block 2.204.

[0125] At block 2.204, the facility obtains first training observability data indicating a first state of an AI model during a first training period.

[0126] In some embodiments, the training observability data includes a matrix of training observability data such as training observability data matrix 2.300, described below with respect to Fig. 2.3.

[0127] Figure 2.3 is a sample training observability data matrix 2.300 used by the facility in some embodiments to describe a training state of an artificial intelligence model. The columns of the training observability data matrix 2.300, such as columns 2.321 and 2.322, indicate a time at which the training observability data indicated in the column is received. In some embodiments, the time indicated by the column represents an “iteration step,” i.e., an iteration of training the AI model. The observability data metrics rows 2.301 each indicate an observability metric associated with training the AI model. The facility may determine the observability metrics based on observable data of the state of the AI model while it is being trained. In some embodiments, a system other than the facility generates one or more of the observability metrics, the training observability data matrix, or some combination thereof.

[0128] The observability metrics in the training observability data matrix 2.300 may include, but are not limited to, a cross entropy, a gradient norm, a perplexity, a learning rate, a validation loss, a BiLingual Evaluation Understudy (BLEU) score, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an F1 score, precision, recall, an Area Under the Curve Receiver Operating Characteristics (AUC-ROC) score, and an Area Under the Curve Precision Recall Curve (AUC-PRC) score.

[0129] While Figure 2.3 shows a table diagram whose contents and organization are designed to make them more comprehensible by a human reader, those skilled in the art will appreciate that actual data structures used by the facility to store this information may differ from the table shown, in that they, for example, may be organized in a different manner; may contain more or less information than shown; may be compressed, encrypted, and/or indexed; may contain a much larger number of rows than shown, etc. Additionally, in some embodiments, rather than storing the data shown in the table diagrams in tables, the facility stores it in semistructured or unstructured data stores, such as JSON objects.
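
The matrix layout described for Figure 2.3 can be illustrated with a minimal sketch, assuming the observations arrive as JSON records with one record per iteration step. The function name `observability_matrix`, the field name `step`, and the sample metric values are hypothetical and are not taken from the disclosure.

```python
import json
import numpy as np

def observability_matrix(records, metrics):
    """Build a matrix whose rows are metrics and whose columns are iteration steps."""
    # Sort by the (assumed) "step" field so columns run in time order.
    ordered = sorted(records, key=lambda r: r["step"])
    return np.array([[r[m] for r in ordered] for m in metrics], dtype=float)

# Hypothetical JSON records; names and values are made up for illustration.
raw = (
    '[{"step": 2, "cross_entropy": 2.1, "gradient_norm": 0.8},'
    ' {"step": 1, "cross_entropy": 2.4, "gradient_norm": 1.1}]'
)
records = json.loads(raw)
matrix = observability_matrix(records, ["cross_entropy", "gradient_norm"])
```

Each row of `matrix` then corresponds to one observability metric and each column to one iteration step, matching the row and column roles described above.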

[0130] Returning to Figure 2.2, in some embodiments, the facility obtains training observability data from one or more systems that receive data indicating metrics that describe how an AI model is being trained via a computer system that trains the model, such as AI model training engine 2.602, described below in connection with Figure 2.6.

[0131] In some embodiments, the time interval, training iteration interval, or some combination thereof, of the second training observability data is the same as that of the first training observability data, but is shifted by a selected amount of time, a number of training iterations, or some combination thereof. For example, the first training observability data may be collected between time 1 and time 9, and the second training observability data may be collected between time 2 and time 10. In another example, the facility collects training observability data between time M and time N, where M and N are integers and M is less than N. The facility splits the training observability data into first training observability data that includes observability data from time M to time N-1, and second training observability data that includes observability data from time M+1 to time N.
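
As a minimal sketch of the split described above, assuming the training observability data is held as a NumPy array whose columns are consecutive time steps, the two shifted data sets can be taken as slices of one array. The function name `split_snapshots` and the sample values are illustrative only.

```python
import numpy as np

def split_snapshots(obs):
    """Split a (metrics x time) matrix into two matrices shifted by one step."""
    obs = np.asarray(obs)
    if obs.ndim != 2 or obs.shape[1] < 2:
        raise ValueError("need a 2-D matrix with at least two time columns")
    first = obs[:, :-1]   # columns for time M through time N-1
    second = obs[:, 1:]   # columns for time M+1 through time N
    return first, second

# Hypothetical data: 3 metrics observed at 10 iteration steps (time 1 to time 10).
obs = np.arange(30, dtype=float).reshape(3, 10)
first, second = split_snapshots(obs)
```

With ten time steps, `first` covers time 1 through time 9 and `second` covers time 2 through time 10, as in the first example of the paragraph above.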

[0132] In some embodiments, the size of the interval for receiving the training observability data may be different between the first training observability data and a subsequent instance of training observability data. For example, the size of the interval for the first training observability data may be 10, and for subsequent training observability data the interval may be 5, 20, 3, etc. In such an example, the size of the interval for subsequent training observability data may be different for a portion of the subsequent training observability data than for other portions of the subsequent training observability data (e.g., the size of the interval for the second training observability data may be 5, and the size of the interval of a third instance of training observability data may be 7).

[0133] In some embodiments, the facility determines a time interval for training observability data based on spectral observability data, such as the spectral observability data generated as part of performing block 2.208 described below. In some such embodiments, the facility may determine the number of iteration steps for which the spectral observability data is accurate by determining the number of iteration steps for which the spectral observability data does not include an error. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, for training observability data at one or more selected times, training iterations, or some combination thereof. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, when a checkpoint for training of the AI model is identified. In some such embodiments, the facility identifies a checkpoint for training of the AI model based on user input, one or more selected training iteration intervals, one or more selected time intervals, or some combination thereof. After block 2.204, process 2.200 proceeds to block 2.206.

[0134] At block 2.206, the facility receives second training observability data indicating a second state of the AI model during a second training period. In some embodiments, the facility employs embodiments of block 2.204 to receive the second training observability data. After block 2.206, process 2.200 continues to block 2.208.

[0135] At block 2.208, the facility generates spectral training observability data based on the first training observability data and the second training observability data. In some embodiments, the facility uses process 2.400, described below with respect to Figure 2.4, to generate the spectral training observability data.

[0136] Figure 2.4 is a flow diagram showing a process 2.400 used by the facility in some embodiments to perform dynamic mode decomposition on training observability data to generate spectral training observability data. In some embodiments, the facility uses embodiments of process 2.400 to perform block 2.208 of process 2.200 shown in Fig. 2.2.

[0137] Process 2.400 begins, after a start block, at block 2.401, where the facility receives a first matrix indicating training observability data for an AI model that is being trained.

[0138] At block 2.402, the facility receives a second matrix indicating training observability data for the AI model that is being trained. In some embodiments, the facility performs blocks 2.401 and 2.402 in a similar manner to blocks 2.204 and 2.206, respectively, described above in connection with Figure 2.2. The first matrix and second matrix may be training observability data matrices, such as the training observability data matrix 2.300 described above in connection with Figure 2.3.

[0139] At block 2.403, the facility determines a singular value decomposition of the first matrix. In some embodiments, the facility determines the singular value decomposition based on Equation 2.1 below:
$$X = U S V^{*} \qquad (2.1)$$

[0140] In Equation 2.1: “X” represents the first matrix, “U” represents a complex unitary matrix, “S” represents a rectangular diagonal matrix, and “V*” represents the conjugate transpose of complex unitary matrix “V.” In some embodiments, the facility generates “U,” “S,” and “V*” based on the first matrix, “X,” by factoring “X” into these three components, such as by using the theories of singular value decomposition, dynamic mode decomposition, or a combination thereof.

[0141] In some embodiments, the facility determines a compact singular value decomposition, such as according to Equation 2.1A below. In compact SVD, “Sr” includes the “r” non-zero singular values of “S,” “Ur” includes the corresponding “r” columns of “U,” and “Vr*” includes the “r” rows of “V*” corresponding to the non-zero singular values “Sr” in “S”:
$$X = U_r S_r V_r^{*} \qquad (2.1A)$$

[0142] In some embodiments, the facility determines a truncated singular value decomposition that approximates “X” using a configurable number “t” of the largest singular values of “S,” such as according to Equation 2.1B below:
$$X \approx U_t S_t V_t^{*} \qquad (2.1B)$$

[0143] In truncated singular value decomposition, “St” includes the “t” largest singular values of “S.” Accordingly, the “t” columns of “U” and the “t” rows of “V*” corresponding to the “t” largest singular values in “St” are calculated. In various embodiments, “t” is any value less than or equal to the number of singular values in “S.”
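
The truncated decomposition of Equation 2.1B can be sketched as follows, assuming NumPy's `numpy.linalg.svd` is used to compute the full factorization first. The helper name `truncated_svd` and the random sample matrix are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def truncated_svd(X, t):
    """Keep the t largest singular values of X and the matching columns/rows."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)  # s is sorted in descending order
    if not 1 <= t <= s.size:
        raise ValueError("t must be between 1 and the number of singular values")
    return U[:, :t], s[:t], Vh[:t, :]

# Hypothetical 4 x 6 matrix standing in for the first matrix X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
U_t, s_t, Vh_t = truncated_svd(X, 2)
X_approx = U_t @ np.diag(s_t) @ Vh_t  # rank-2 approximation of X
```

Choosing `t` equal to the number of singular values reproduces `X` up to rounding, while smaller values of `t` trade accuracy for less computation.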

[0144] At block 2.404, the facility defines a third matrix based on the second matrix and the singular value decomposition of the first matrix. Thus, at block 2.404, the facility generates a definition of the third matrix without generating the third matrix itself. In some embodiments, the facility determines the third matrix based on Equation 2.2 below:
$$A = U^{*} Y V S^{-1} \qquad (2.2)$$

[0145] In Equation 2.2: “U*” is the conjugate transpose of “U” from Eq. 2.1, “V” is the complex unitary matrix described with respect to Eq. 2.1, and “S⁻¹” is the inverse of “S” from Eq. 2.1. “Y” represents the second matrix, and “A” represents the third matrix. In embodiments where truncated SVD is used, “Ut*” is used in place of “U*,” “St” is used in place of “S,” and “Vt” is used in place of “V” according to Equation 2.1B. In embodiments where compact SVD is used, “Ur*” is used in place of “U*,” “Sr” is used in place of “S,” and “Vr” is used in place of “V” according to Equation 2.1A.

[0146] At block 2.405, the facility generates one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on the third matrix. In some embodiments, the facility performs eigendecomposition on the third matrix to determine the one or more eigenvalues, the one or more eigenvectors, or some combination thereof. In some embodiments, the facility identifies the one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on Equation 2.3, below:
$$A w = \lambda w \qquad (2.3)$$

[0147] In Equation 2.3: “A” represents the third matrix, “w” represents a matrix having columns that correspond to eigenvectors, and “λ” represents a matrix having columns that correspond to eigenvalues.

[0148] In some embodiments, if the third matrix “A” is not a square matrix, as part of performing block 2.405, the facility transposes the third matrix to calculate “Aᵀ,” such as by defining the rows of the matrix as columns and the columns of the matrix as rows. In some such embodiments, the facility generates the eigenvalues and eigenvectors based on the transposed third matrix.

[0149] In some embodiments in which “A” is not square, the facility calculates a square matrix “AAᵀ” and generates the eigenvalues and eigenvectors based on “AAᵀ.”

[0150] After block 2.405, the process 2.400 ends at an end block.
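
Blocks 2.401 through 2.405 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the facility's implementation: it assumes real-valued NumPy matrices, uses a compact SVD by default with a numerical tolerance chosen here for convenience, and the function name `dmd_spectrum` and the toy linear system are hypothetical.

```python
import numpy as np

def dmd_spectrum(X, Y, t=None):
    """Return eigenvalues and eigenvectors of the reduced matrix A = U* Y V S^-1."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    U, s, Vh = np.linalg.svd(X, full_matrices=False)  # X = U S V*
    if t is None:
        # Compact SVD: keep singular values above an assumed numerical tolerance.
        tol = s.max() * max(X.shape) * np.finfo(float).eps
        t = int(np.count_nonzero(s > tol))
    if t < 1:
        raise ValueError("X has no non-zero singular values")
    U, s, Vh = U[:, :t], s[:t], Vh[:t, :]
    # Third matrix per Equation 2.2, using the retained factors.
    A = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    return np.linalg.eig(A)

# Toy check: snapshots of x_{k+1} = M x_k, where M has eigenvalues 0.9 and 0.5.
M = np.array([[0.9, 0.2], [0.0, 0.5]])
cols = [np.array([1.0, 1.0])]
for _ in range(5):
    cols.append(M @ cols[-1])
obs = np.column_stack(cols)                      # 2 metrics x 6 time steps
eigvals, eigvecs = dmd_spectrum(obs[:, :-1], obs[:, 1:])
```

For this toy system the recovered eigenvalues match those of `M`, which is the sense in which the spectrum summarizes how the observed quantities evolve from one step to the next.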

[0151] Returning to Figure 2.2, at block 2.210, the facility determines whether to change the computer number format used to represent the one or more parameters based on the spectral training observability data. The spectral training observability data provides insights into the loss landscape of the AI model during training, and can be used to identify several conditions associated with using suboptimal computer number formats to represent parameters of the AI model. Using a computer number format with too low of a precision can introduce quantization of parameters of the AI model, which can lead to various suboptimal training conditions such as:
(1) increased roughness of the loss landscape;
(2) false local minima in the loss landscape, whereby quantization introduced by the computer number format leads to artificial “flat” regions of the loss landscape where gradients become zero due to limited precision;
(3) altered optimization paths; or
(4) gradient explosion or vanishing, whereby gradients become too large or too small to be represented using the computer number format.

[0152] As discussed herein, the spectral training observability data may include spectra of one or more eigenvalues, eigenvectors, or both. In some embodiments, the higher-magnitude eigenvalues of the spectral training observability data are associated with more important directions in the parameter space. These eigenvalues are often least affected by precision reduction. In some embodiments, the lower-magnitude eigenvalues of the spectral training observability data are associated with fine-tuning directions in the parameter space.

[0153] As precision of a computer number format is reduced, the eigenvalue distribution of the spectral training observability data may exhibit one or more characteristics such as (1) the number of distinct eigenvalues may decrease due to quantization; (2) relatively low-magnitude eigenvalues may become indistinguishable from zero in the computer number format; (3) a difference between a lowest-magnitude eigenvalue and a highest-magnitude eigenvalue (i.e., the “spectral gap”) may increase, potentially leading to faster but less precise convergence.

[0154] For example, changing the computer number format of weights of an AI model from a relatively high-precision computer number format such as FP64 to a relatively low-precision computer number format such as FP8 may cause many relatively low-magnitude eigenvalues to rapidly decay to zero due to quantization effects of FP8 being on a similar order of magnitude as the relatively low-magnitude eigenvalues. The relatively high-magnitude eigenvalues may be less impacted by these quantization effects.

[0155] In some embodiments, the facility determines to change the computer number format based on identifying one or more of the above-referenced training conditions or characteristics using the spectral training observability data.

[0156] In various embodiments, the facility determines whether to change the computer number format based on determining that the computer number format is associated with vanishing gradients.

[0157] In some embodiments, the facility determines that the computer number format is associated with vanishing gradients by calculating a ratio of eigenvalues of the spectral training observability data that are below a configurable threshold, as shown in Equation 2.4:
$$V = \frac{\text{Number of eigenvalues} < \epsilon}{\text{Total number of eigenvalues}} \qquad (2.4)$$

[0158] In Equation 2.4, “ε” is the configurable threshold, and “V” is the ratio of eigenvalues that are below the configurable threshold. As “V” increases, an increasing proportion of eigenvalues are small, meaning the gradient will be dominated by a few directions, which can lead to slow or stalled training. In various embodiments, the facility determines that the computer number format is associated with vanishing gradients based on “V” exceeding a configurable ratio threshold such as 0.1, 0.2, 0.3, etc. In some embodiments, the configurable ratio threshold is based on user input.
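
The ratio of Equation 2.4 can be sketched as below. Comparing eigenvalue magnitudes (rather than signed or complex values) against the threshold is an assumption made for this example, and the function name `small_eigenvalue_ratio`, the sample eigenvalues, and the chosen thresholds are hypothetical.

```python
import numpy as np

def small_eigenvalue_ratio(eigvals, eps):
    """Fraction of eigenvalues whose magnitude is below the threshold eps."""
    mags = np.abs(np.asarray(eigvals))
    if mags.size == 0:
        raise ValueError("at least one eigenvalue is required")
    return float(np.count_nonzero(mags < eps) / mags.size)

# Hypothetical spectrum and thresholds, chosen only to illustrate the arithmetic.
eigvals = np.array([0.95, 0.4, 1e-4, 5e-6])
V = small_eigenvalue_ratio(eigvals, eps=1e-3)
vanishing = V > 0.3  # compare against a configurable ratio threshold
```

Here two of the four eigenvalues fall below the threshold, so `V` is 0.5 and the example flags the format when the ratio threshold is 0.3.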

[0159] In some embodiments, the facility determines that the computer number format is associated with vanishing gradients by calculating a fraction F of gradient norm contributed by eigenvalues below a configurable threshold e, such as according to Equation 2.5:

[0160] In Equation 2.5, S = {i : λᵢ < ε}, λᵢ is the eigenvalue corresponding to i, and vᵢ is the eigenvector corresponding to i. In some embodiments, the facility determines that the computer number format is associated with vanishing gradients when F satisfies a gradient norm fraction threshold such as 0.1, 0.2, 0.3, etc. In some embodiments, the gradient norm fraction threshold is based on user input. In some embodiments, the facility selects a new computer number format based on the gradient norm fraction threshold. In some embodiments, the facility selects the computer number format associated with the lowest gradient norm fraction.

[0161] One non-limiting example of gradient norm fractions computed for various computer number formats is given in Table 2.6 below:

[0162] As can be seen in Table 2.6, lower-precision computer number formats such as FP8 may be associated with higher gradient norm fractions.

[0163] In various embodiments, the facility determines whether to change the computer number format based on determining that the computer number format is associated with convergence of the AI model.

[0164] In some embodiments, a likelihood of convergence is determined based on one or more eigenvalues of the spectral training observability data, such as using Inequality (2.7) below:

[0165] In Inequality 2.7, λ₁ is the eigenvalue having the greatest magnitude, and η is the learning rate of the AI model. In some embodiments, when Inequality 2.7 is true, the AI model is likely to converge.

[0166] In some embodiments, a convergence metric is calculated for a plurality of computer number formats. In such embodiments, spectral training observability data is obtained using each computer number format, such as by training the AI model for a selectable number of iterations using each of the plurality of computer number formats. In some embodiments, the convergence metric associated with each computer number format is given by Equation 2.8: (2.8)

[0167] In Equation 2.8, $R_{\text{format}}$ is the convergence metric associated with a computer number format, η is the learning rate of the AI model, and $\lambda_{\max}^{\text{format}}$ is the highest-magnitude eigenvalue of the spectral training observability data associated with the computer number format.

[0168] One non-limiting example of convergence metrics computed for various computer number formats is given in Table 2.9 below:

[0169] In some embodiments, the convergence metric associated with each computer number format is given by Equation 2.10:

[0170] In Equation 2.10, $Q_{\text{format}}$ is the quantization function for each computer number format. In some embodiments, the quantization functions for various computer number formats are as follows:

[0171] For FP32: $Q_{FP32}(x) \approx x$;

[0172] For FP16: $Q_{FP16}(x) = \operatorname{round}(x \cdot 2^{10}) \cdot 2^{-10}$;

[0173] For FP8: $Q_{FP8}(x) = \operatorname{round}(x \cdot 2^{3}) \cdot 2^{-3}$; and

[0174] For BF16: $Q_{BF16}(x) = \operatorname{round}(x \cdot 2^{7}) \cdot 2^{-7}$.
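
The quantization functions listed above can be sketched in a few lines, assuming each one rounds to a fixed number of fractional bits (10 for FP16, 3 for FP8, 7 for BF16) and ignores the exponent range of the real formats. The helper name `quantize` and the dictionary `FRACTION_BITS` are illustrative; `numpy.round` rounds halves to the nearest even value, which may differ from the rounding mode intended in the disclosure.

```python
import numpy as np

# Fractional bits implied by the scale factors 2^10, 2^3, and 2^7 above.
FRACTION_BITS = {"FP16": 10, "FP8": 3, "BF16": 7}

def quantize(x, bits):
    """Round x to the nearest multiple of 2**-bits (simplified model of Q_format)."""
    scale = 2.0 ** bits
    return np.round(np.asarray(x, dtype=np.float64) * scale) / scale

q_fp8 = quantize(0.3, FRACTION_BITS["FP8"])     # 0.3 * 8 = 2.4, rounds to 2, gives 0.25
q_fp16 = quantize(0.3, FRACTION_BITS["FP16"])   # 0.3 * 1024 = 307.2, rounds to 307
q_small = quantize(0.05, FRACTION_BITS["FP8"])  # 0.05 * 8 = 0.4, rounds to 0
```

The last line shows the effect discussed earlier for low-precision formats: a value smaller than half the quantization step becomes indistinguishable from zero.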

[0175] In some embodiments, the convergence metrics include one or more loss functions for various computer number formats, which may be approximated as follows, wherein Cᵢ are coefficients determined by initial conditions of the AI model:

[0176] Equation 2.11 for FP32:

[0177] Equation 2.12 for FP16:
where ε is the smallest positive number that can be represented using FP16.

[0178] Equation 2.13 for FP8: Lm(t) £(0) + £ (2.13)
where ε′ is the smallest positive number that can be represented in FP8.

[0179] Equation 2.14 for BF16: (2.14)
where ε″ is the smallest positive number that can be represented in BF16.

[0180] In some embodiments, the convergence metrics associated with the plurality of computer number formats are compared to determine the computer number format most associated with convergence of the AI model. In some embodiments, the computer number format having the largest convergence metric is the most associated with convergence of the AI model. In Table 2.9, FP32 has the largest convergence metric (i.e., “Ratio”) of 1.82. Thus, the facility may determine to change the computer number format to FP32 to increase the likelihood of convergence.

[0181] In various embodiments, the facility determines whether to change the computer number format based on determining whether the computer number format is associated with early stopping. In some embodiments, early stopping is determined according to Equation 2.15: (2.15)

[0182] In Equation 2.15, λ₁ is the eigenvalue having the greatest magnitude, λₖ is the smallest non-zero eigenvalue, and C is the eigenvalue magnitude ratio. In some embodiments, early stopping is determined when C changes more than a configurable stopping threshold value within a time period.
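
The early-stopping check described above can be sketched as follows. Because Equation 2.15 is not reproduced in this text, the sketch assumes that C is the magnitude of the greatest-magnitude eigenvalue divided by the magnitude of the smallest non-zero eigenvalue; that form, the zero tolerance, the function names, and the sample values are all assumptions made for illustration.

```python
import numpy as np

def eigenvalue_magnitude_ratio(eigvals, zero_tol=1e-12):
    """Assumed form of C: largest magnitude over smallest non-zero magnitude."""
    mags = np.abs(np.asarray(eigvals))
    nonzero = mags[mags > zero_tol]
    if nonzero.size == 0:
        raise ValueError("no non-zero eigenvalues")
    return float(nonzero.max() / nonzero.min())

def should_stop_early(c_history, stopping_threshold):
    """True when C changed by more than the threshold across the observed window."""
    if len(c_history) < 2:
        return False
    return abs(c_history[-1] - c_history[0]) > stopping_threshold

# Hypothetical spectrum and history of C values over one time period.
C = eigenvalue_magnitude_ratio([0.8, 0.2, 0.0])
stop = should_stop_early([4.0, 4.1, 9.5], stopping_threshold=2.0)
```

The window passed to `should_stop_early` plays the role of the time period mentioned above; how that window is chosen is left open here.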

[0183] In some embodiments, early stopping is determined based on a time-based criterion such as in Inequality 2.16:

[0184] In Inequality 2.16, η is the learning rate of the AI model, λ₁ is the eigenvalue having the greatest magnitude, λₖ is the smallest non-zero eigenvalue, r is a configurable threshold value, C is the eigenvalue magnitude ratio defined according to Equation 2.15, and t is a threshold number of training epochs. In various embodiments, the facility determines early stopping when Inequality 2.16 is true or false.

[0185] In some embodiments, the facility ceases training of the AI model based on an early stopping criterion. In some embodiments, the facility determines to change the computer number format based on the early stopping criterion.

[0186] After block 2.210, process 2.200 continues to block 2.212, where the AI model is trained based on the determination whether to change the computer number format.

[0187] In some embodiments, the facility uses process 2.200 to change the computer number format used to represent the one or more parameters during training of the AI model. In one non-limiting example, in response to determining to change the computer number format, the facility copies the relevant parameters into variables of the new computer number format. In some embodiments, the facility pauses training to change the computer number format that represents the one or more parameters. In some embodiments, the facility changes the computer number format during training.
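
Copying parameters into variables of a new computer number format can be sketched with NumPy dtypes, assuming the parameters are held as a dictionary of arrays. The function name `cast_parameters` and the sample weights are hypothetical; NumPy itself provides `float16`, `float32`, and `float64` but no native BF16 or FP8 type, so those formats would need another library.

```python
import numpy as np

def cast_parameters(params, dtype):
    """Copy each parameter array into a new array of the requested dtype."""
    return {name: value.astype(dtype) for name, value in params.items()}

# Hypothetical weights stored in FP32, then copied into FP16.
params_fp32 = {"w": np.array([0.1, 1e-9], dtype=np.float32)}
params_fp16 = cast_parameters(params_fp32, np.float16)
```

Because `astype` returns a copy, the original FP32 values remain available, which makes it possible to switch back if the lower-precision format proves unsuitable. In this example the second weight underflows to zero in FP16.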

[0188] While process 2.200 is discussed in terms of determining whether to change one computer number format used to represent one or more parameters of an AI model for ease of discussion, the disclosure is not so limited. In some embodiments, the facility performs process 2.200 any number of times to determine any number of computer number formats for an AI model. In some embodiments, the facility uses process 2.200 to determine a plurality of computer number formats with which to represent a corresponding plurality of one or more parameters. In one non-limiting example, the facility uses process 2.200 to determine a first computer number format with which to represent a first set of weights of the AI model, and a second computer number format with which to represent a second set of weights of the AI model.

[0189] Figure 2.5 is a flow diagram showing a process 2.500 used by the facility in some embodiments to select a computer number format with which to represent one or more parameters of an artificial intelligence model.

[0190] Process 2.500 begins, after a start block, at block 2.501, where the facility receives a plurality of computer number formats usable to represent one or more parameters of an AI model.

[0191] At block 2.502a, the facility, for each computer number format of the plurality of computer number formats, performs blocks 2.503-2.505.

[0192] At block 2.503, the facility causes training of an artificial intelligence model using the computer number format to be initialized.

[0193] At block 2.504, the facility receives first and second training observability data while the AI model is being trained. In some embodiments, the facility performs block 2.504 in a similar manner to blocks 2.204 and 2.206 and blocks 2.401 and 2.402, described above in connection with Figures 2.2 and 2.4, respectively.

[0194] At block 2.505, the facility generates spectral training observability data based on the first and second training observability data. In some embodiments, the facility performs block 2.505 in a similar manner to block 2.208, described above in connection with Figure 2.2.

[0195] At block 2.502b, if the facility has performed blocks 2.503-2.505 for each computer number format of the plurality of computer number formats, process 2.500 proceeds to block 2.506. Otherwise, the facility continues to perform blocks 2.503-2.505 until they have been performed for each computer number format of the plurality of computer number formats.

[0196] At block 2.506, the facility selects a computer number format of the plurality of computer number formats based on spectral training observability data generated for each of the computer number formats. In some embodiments, the facility selects the computer number format based on eigenvalues or eigenvectors included in the spectral observability data.

[0197] In some embodiments, the facility selects the computer number format by determining whether the absolute value of an eigenvalue, or an aggregation of one or more eigenvalues, included in the spectral observability data for the AI model using the computer number format is greater than or equal to 1, less than or equal to 1, equal to 1, etc. For example, the facility may select a computer number format based on a determination that the absolute value of an eigenvalue of the spectral observability data generated for the AI model configured with the computer number format is greater than 1. In such an example, the facility may determine whether to cease training of the model, change the training of the model, etc., in a manner similar to block 2.210 described above in connection with Figure 2.2. In some embodiments, if the absolute value of an eigenvalue, or an aggregation of one or more eigenvalues, included in the spectral observability data is less than 1, the facility determines that the training of an AI model with the computer number format is not to change. Thus, in some embodiments, the facility may alter or cease the training of some AI model instances that are being trained without altering or ceasing the training of other AI model instances that are being trained.
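
The selection test described above can be sketched as follows, assuming the spectral observability data for each computer number format is available as an array of eigenvalues and that the largest eigenvalue magnitude is used as the aggregation. The function name `formats_to_change`, the choice of aggregation, and the sample spectra are hypothetical and are not measured results.

```python
import numpy as np

def formats_to_change(spectra):
    """Return the formats whose largest eigenvalue magnitude exceeds 1."""
    selected = []
    for name, eigvals in spectra.items():
        if np.max(np.abs(np.asarray(eigvals))) > 1.0:
            selected.append(name)
    return selected

# Made-up eigenvalues for three formats, for illustration only.
spectra = {
    "FP32": [0.97, 0.50],
    "FP16": [0.99, 0.60],
    "FP8": [1.08, 0.20],
}
selected = formats_to_change(spectra)
```

In this example only the FP8 spectrum contains an eigenvalue with magnitude above 1, so only the model instance trained with FP8 would be considered for a change or for ceasing training, while the other instances continue unchanged.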

[0198] In some embodiments, process 2.500 is performed iteratively, such that the plurality of computer number formats is progressively narrowed. For example, a subset of the plurality of computer number formats may be selected at block 2.506, and process 2.500 may continue to block 2.502a to continue evaluating the selected subset of the plurality of computer number formats (not shown).

[0199] In some embodiments, the facility continues to perform aspects of the process 2.500 until one AI model configured with a computer number format remains, or until training of the AI models is complete. In such embodiments, the facility may cease the training of one or more AI models having a computer number format selected in block 2.507. In some such embodiments, the facility may alter the training of one or more AI models having a computer number format selected in block 2.507.

[0200] In embodiments where the facility performs aspects of the process 2.500 multiple times, the facility may determine one or more time intervals or training intervals for receiving additional training observability data, such as in a similar manner to the embodiments described above in connection with Figure 2.2.

[0201] In some embodiments, at block 2.506, none of the computer number formats are selected, such as when none of the computer number formats satisfy a threshold performance metric. In some such embodiments, the facility continues training each of the AI models. In other such embodiments, the facility presents information regarding the progress of training each of the AI models generated based on the spectral observability data to a user and receives input indicating the AI models that are to continue training.

[0202] In some embodiments, at block 2.507, all of the computer number formats are selected. In such embodiments, the facility may determine whether all of the AI models are to cease training, or whether the training of at least a portion of the AI models is to be changed. In some such embodiments, the facility presents information regarding the progress of training each of the AI models generated based on the spectral observability data to a user, and receives input indicating whether training of all of the AI models is to cease or the training of at least a portion of the AI models is to be changed. In embodiments where the facility ceases training for all of the AI models, the facility may receive an indication of one or more additional computer number formats and begin the process 2.500 for each of the additional computer number formats.

[0203] After block 2.507, process 2.500 ends at an end block.

[0204] Figure 2.6 is a block diagram showing a system 2.600 used by the facility in some embodiments to obtain training observability data and evaluate training of an artificial intelligence model.

[0205] System 2.600 includes AI model training engine 2.602, training data observation engine 2.604, dynamic mode decomposition engine 2.606, dynamic hyperparameter tuning engine 2.608, and mixed precision training optimization engine 2.610.

[0206] AI model training engine 2.602 is configured to control data flow in system 2.600, such as data flow involving one or more of training data observation engine 2.604, dynamic mode decomposition engine 2.606, dynamic hyperparameter tuning engine 2.608, or mixed precision training optimization engine 2.610. In some embodiments, AI model training engine 2.602 provides one or more user interfaces configured to receive user input, such as user input specifying one or more computer number formats with which to analyze training performance of an AI model.

[0207] Training data observation engine 2.604 is configured to obtain training observability data. In some embodiments, the facility uses training data observation engine 2.604 to perform blocks 2.204 and 2.206 of process 2.200 shown in Figure 2.2 or block 2.504 of process 2.500, shown in Figure 2.5.

[0208] Dynamic mode decomposition engine 2.606 is configured to produce spectral training observability data by performing dynamic mode decomposition on training observability data obtained via training data observation engine 2.604. In some embodiments, the facility uses dynamic mode decomposition engine 2.606 to perform block 2.208 of process 2.200 shown in Figure 2.2 or block 2.505 of process 2.500 shown in Figure 2.5.

[0209] Dynamic hyperparameter tuning engine 2.608 is configured to change hyperparameters of an AI model. Dynamic hyperparameter tuning engine 2.608 is described in detail in U.S. Provisional Application No. 63 / 729,142, filed December 6, 2024, and entitled “EVALUATING TRAINING OF MACHINE LEARNING MODELS AND ALTERING THE TRAINING BASED ON THE EVALUATION,” which is hereby incorporated by reference in its entirety.

[0210] Mixed precision training optimization engine 2.610 is configured to change a computer number format used to represent one or more parameters of an AI model. In some embodiments, the facility uses mixed precision training optimization engine 2.610 to perform blocks 2.210 and 2.212 of process 2.200 shown in Figure 2.2 or blocks 2.506 or 2.507 of process 2.500 shown in Figure 2.5.

[0211] Section 3: The inventors have recognized that it would be of great benefit to those who train machine learning models, including model training managers, to dynamically adjust the training and model architecture of ensemble machine learning models and their constituent machine learning models while training the machine learning models. However, because it is difficult to predict how a model architecture and the training data will affect the training of the ensemble machine learning model and its constituent machine learning models, conventional approaches are unable to predict the likelihood of success that any model architecture of the many choices of model architectures will have. Additionally, the inventors have also recognized that some model architectures may perform worse than others early in training, but may still be successful if one or more aspects of the machine learning model, such as “hyperparameters,” the model architecture, or some combination thereof, are changed before training continues.

[0212] The inventors have further recognized that while conventional methods of assessing the performance of ensemble machine learning models during the training phase exist, these methods require long periods of time to train the model, and the time required increases exponentially as the size of the ensemble machine learning model, its constituent machine learning models, or some combination thereof, increases. Furthermore, training an ensemble machine learning model requires a large amount of computing resources, such as processing power, graphics processing unit usage, memory usage, electricity, and other computer resources, and the need for such resources increases with each constituent machine learning model included in the ensemble machine learning model. In some cases, training a single ensemble machine learning model requires the use of computing resources for many months, as well as the training of multiple constituent machine learning models that may not have any impact, or may only have a negligible impact, on the performance of the ensemble machine learning model.

[0213] Furthermore, the inventors have recognized that the amount of data generated as a result of training a single constituent machine learning model is on a scale that also requires a large amount of computing resources to process and analyze, and that these resources must be expended multiple times throughout the training phase of the ensemble machine learning model. For example, some conventional systems receive training observability data matrices that describe the state of the machine learning model while it is being trained, and predict future training observability data matrices to assess whether training the model should continue or whether an aspect of the model should be changed. However, each training observability data matrix may include hundreds, thousands, etc., of data points, and predicting what each of these data points will be at a future time requires a significant amount of additional computing resources in addition to the resources being used to train the model.

[0214] As a result of these disadvantages, conventional systems are currently unable to optimize the training dynamics of an ensemble machine learning model and its constituent machine learning models. Furthermore, conventional systems are unable to change the model architecture of the ensemble machine learning model, its constituent machine learning models, or some combination thereof, while the models are being trained.

[0215] In response to recognizing these disadvantages, the inventors have conceived and reduced to practice a software and / or hardware facility for managing and adjusting the training of ensemble machine learning models based on training observability data representing the state of the ensemble machine learning model and its constituent machine learning models during training (“the facility”). By generating spectral training observability data, the facility is able to assess the training of an ensemble machine learning model without generating predicted training observability matrices. Furthermore, the facility is able to use the spectral training observability data to change one or more aspects of an ensemble machine learning model, its constituent machine learning models, or some combination thereof, without halting training.

[0216] Additionally, the facility is able to determine whether a constituent machine learning model should be removed from the ensemble machine learning model based on spectral training observability data generated from validation data, training observability data, or some combination thereof. Generating spectral training observability data also allows for samples of training observability data to be nonuniform, in contrast with the uniform samples required by conventional systems. Thus, by using spectral training observability data, the facility is able to alter the interval for obtaining additional training observability data to be more frequent, less frequent, etc., and to alter the amount of additional training observability data obtained. Therefore, the facility allows sample training observability data, validation data, or some combination thereof, to be gathered at any time, and performs the analysis and processing of the training observability data, validation data, or some combination thereof, faster and with fewer computing resources than conventional systems. Furthermore, the resources saved by using the processes performed by the facility may then be used directly for training the ensemble machine learning model instead of analyzing the data produced as a result of training the machine learning model.

[0217] The facility generates spectral training observability data by applying dynamic mode decomposition to training observability data received during at least two periods of time that occur during the training of the machine learning model. The training observability data is data that indicates the state of a machine learning model during a period of time while the machine learning model is being trained. In some embodiments, the training observability data indicates training observability metrics that are measured at one or more times during the time period for which the training observability data is received. For example, the training observability data may be data that represents one or more metrics calculated based on observable data of the state of the machine learning model while it is being trained. In some embodiments, the facility generates spectral training observability data based on validation data received as a result of validating the training of the ensemble machine learning model, its constituent machine learning models, or some combination thereof.

[0218] In some embodiments, the time periods for different sets of training observability data intersect. For example, first training observability data may be received between 100 seconds and 200 seconds after training a model begins and second training observability data may be received between 150 seconds and 250 seconds after training the model begins. In some embodiments, the facility generates a training observability data matrix for training observability data received within a time period. Continuing the example above, the facility may generate a first matrix of training observability data for the data received between 100 and 200 seconds, and a second matrix of training observability data for the data received between 150 and 250 seconds. In some embodiments, the facility changes the frequency at which training observability data is obtained.
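The overlapping time periods in the example above can be sketched as follows. The sample layout (a list of timestamped metric readings), the function name, and the metric values are hypothetical and serve only to show how two intersecting windows yield two matrices that share some columns.

```python
# Hypothetical sketch: sample data and names are illustrative assumptions.
def window_matrix(samples, start, end):
    """Build a matrix (rows = metrics, columns = sample times) from the
    samples whose timestamp falls within [start, end] seconds."""
    columns = [values for t, values in samples if start <= t <= end]
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]


# Assumed readings taken 100, 150, 200, and 250 seconds after training
# begins; each reading holds two made-up metric values.
samples = [(t, [1.0 / t, 0.5 * t]) for t in range(100, 251, 50)]

first_matrix = window_matrix(samples, 100, 200)   # data from 100 s to 200 s
second_matrix = window_matrix(samples, 150, 250)  # data from 150 s to 250 s

print(first_matrix[1])   # second metric over the first window
print(second_matrix[1])  # second metric over the second window
```

Because the windows intersect, the readings taken at 150 and 200 seconds appear in both matrices.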

[0219] The spectral training observability data may include one or more eigenvalues, one or more eigenvectors, or some combination thereof. The facility uses the spectral training observability data to determine whether training of the machine learning model for which the spectral observability data is generated is to cease. In some embodiments, the facility uses the spectral observability data to identify one or more hyperparameters that are to be changed during the training of the machine learning model. In such embodiments, the facility may determine the magnitude of the change based on the spectral observability data. In some embodiments, the facility may cause at least one of the one or more hyperparameters to be changed by automatically changing the at least one hyperparameter, receiving input indicating that at least one hyperparameter of the one or more hyperparameters is to be changed, causing at least one hyperparameter of the one or more hyperparameters to change via one or more other methods of changing a hyperparameter, or some combination thereof.

[0220] In some embodiments, the facility determines whether an aspect of a machine learning model is to be changed based on the spectral training observability data. The aspect of the machine learning model to be changed may be a hyperparameter for training the machine learning model, a model architecture of the machine learning model, other aspects of a machine learning model, or some combination thereof. In some embodiments, the facility determines whether a proposed change to an aspect of a machine learning model is feasible based on model architecture criteria. In some embodiments, model architecture criteria include a desired architecture of a machine learning model, one or more hardware constraints of a machine learning model (such as, for example, a number of CPUs, GPUs, memory, etc., for operating or training the machine learning model), or some combination thereof.

[0221] In some embodiments, the facility trains the ensemble machine learning model within a “feedback loop.” In some embodiments, the feedback loop includes generating spectral training observability data and adjusting an aspect of the ensemble machine learning model, at least one constituent machine learning model, or some combination thereof. In some embodiments, the facility uses the feedback loop during training, pre-training, fine tuning, other aspects of the training lifecycle of an ensemble machine learning model, or some combination thereof.

[0222] In some embodiments, the facility adjusts an aspect of a machine learning model by changing a hyperparameter for training the machine learning model. In some embodiments, the facility adjusts an aspect of a machine learning model by changing a model architecture of the machine learning model. In some embodiments, the facility adjusts an aspect of an ensemble machine learning model by removing a constituent machine learning model of the ensemble machine learning model, such as based on a determination that the constituent machine learning model has a negligible effect on the output of the ensemble machine learning model.

[0223] By performing in some or all of the ways described above, the facility is able to train an ensemble machine learning model in a manner that improves “convergence properties” for the ensemble machine learning model’s training more than conventional methods. Furthermore, by performing in some or all of the ways described above, the facility is able to reduce the computing resources needed to evaluate the training of an ensemble machine learning model and provide more accurate evaluations of the training of the ensemble machine learning model. Also, the facility improves the functioning of computer or other hardware, such as by reducing the dynamic display area, processing, storage, and / or data transmission resources needed to perform a certain task, thereby enabling the task to be performed by less capable, capacious, and / or expensive hardware devices, and / or be performed with lesser latency, and / or preserving more of the conserved resources for use in performing other tasks. For example, generating spectral training observability data requires less memory and less processing power, and can be performed more quickly, than generating a prediction of future training observability data. Also, by generating spectral training observability data, the facility does not need to obtain training observability data at predetermined and unchangeable time intervals. Thus, the facility is able to vary the frequency of obtaining training observability data and amount of training observability data received as the model is being trained. This feature also enables the facility to evaluate the training of the machine learning model contemporaneously with receiving an indication that additional training observability data is to be received during a selected time period. As another example, the facility is able to determine whether one or more constituent machine learning models can be removed from the ensemble machine learning model, resulting in an ensemble machine learning model that requires fewer computing resources for storing and operating the ensemble machine learning model.

[0224] Further, for at least some of the domains and scenarios discussed herein, the processes described herein as being performed automatically by a computing system cannot practically be performed in the human mind, for reasons that include that the starting data, intermediate state(s), and ending data are too voluminous and / or poorly organized for human access and processing, and / or are a form not perceivable and / or expressible by the human mind; the involved data manipulation operations and / or subprocesses are too complex, and / or too different from typical human mental operations; required response times are too short to be satisfied by human performance; etc. For example, the volume of training observability data needed to generate spectral training observability data for evaluating a typical ensemble machine learning model, let alone a large language ensemble machine learning model, is too voluminous for a human to be able to practically generate spectral training observability data, even with the aid of pen and paper.

[0225] Figure 3.1 is a block diagram showing some of the components typically incorporated in at least some of the computer systems and other devices on which the facility operates. In various embodiments, these computer systems and other devices 3.100 can include server computer systems, cloud computing platforms or virtual machines in other configurations, desktop computer systems, laptop computer systems, netbooks, mobile phones, personal digital assistants, televisions, cameras, automobile computers, electronic media players, etc. In various embodiments, the computer systems and devices include zero or more of each of the following: a processor 3.101 for executing computer programs and / or training or applying machine learning models, such as a CPU, GPU, TPU, NNP, FPGA, or ASIC; a computer memory 3.102 — such as RAM, SDRAM, ROM, PROM, etc. — for storing programs and data while they are being used, including the facility and associated data, an operating system including a kernel, and device drivers; a persistent storage device 3.103, such as a hard drive or flash drive for persistently storing programs and data; a computer-readable media drive 3.104, such as a floppy, CD-ROM, or DVD drive, for reading programs and data stored on a computer-readable medium; and a network connection 3.105 for connecting the computer system to other computer systems to send and / or receive data, such as via the Internet or another network and its networking hardware, such as switches, routers, repeaters, electrical cables and optical fibers, light emitters and receivers, radio transmitters and receivers, and the like. None of the components shown in Figure 3.1 and discussed above constitutes a data signal per se. While computer systems configured as described above are typically used to support the operation of the facility, those skilled in the art will appreciate that the facility may be implemented using devices of various types and configurations, and having various components.

[0226] Figure 3.2 is a flow diagram showing a process 3.200 for generating spectral training observability data, used by the facility in some embodiments. Those skilled in the art will appreciate that the acts shown in the flow diagrams of Figures 3.2, 3.4, 3.6, 3.7, and 3.8 discussed below may be altered in a variety of ways. For example, the order of the acts may be rearranged; some acts may be performed in parallel; shown acts may be omitted, or other acts may be included; a shown act may be divided into subacts, or multiple shown acts may be combined into a single act, etc.

[0227] First, at act 3.201, the facility receives an indication of a machine learning model that is being trained. In some embodiments, at act 3.201, the facility receives an indication of a model architecture of the machine learning model.

[0228] At act 3.202, the facility receives first training observability data obtained during a first time period during which the machine learning model is being trained. In some embodiments, the facility receives training observability data from one or more systems that receive data indicating metrics that describe how a machine learning model is being trained from a computer system that trains the machine learning model, such as the watch dog 3.902, described below in connection with Figure 3.9.

[0229] In some embodiments, the facility receives training observability data at one or more selected time intervals, at one or more selected training iteration intervals, or some combination thereof. In some embodiments, the facility may alter the interval at which training observability data is received based on spectral training observability data, such as the spectral training observability data generated during act 3.204, described below.

[0230] In some embodiments, the time interval, training iteration interval, or some combination thereof, of the second training observability data is the same as for the first training observability data, but is shifted by a selected amount of time, a number of training iterations, or some combination thereof. For example, the first training observability data may be collected between time 1 and time 10, and the second training observability data may be collected between time 5 and time 15. In another example, the first training observability data may be collected between time 1 and time 10, and the second training observability data may be collected between time 20 and time 30. In some embodiments, the size of the interval for receiving the training observability data may be different between the first training observability data and subsequent instance of training observability data. For example, the size of the interval for the first training observability data may be 10, and for subsequent training observability data the interval may be 5, 20, 3, etc. In such an example, the size of the interval for subsequent training observability data may be different for a portion of the subsequent training observability data than for other portions of the subsequent training observability data (e.g. the size of the interval for the second training observability data may be 5, and the size of the interval of a third instance of training observability data may be 7).

[0231] In some embodiments, the facility determines a time interval for training observability data based on spectral observability data, such as the spectral observability data generated as part of performing act 3.204 described below. In such embodiments, the facility may determine the number of iteration steps for which the spectral observability data is accurate by determining the number of iteration steps for which the spectral observability data does not include an error. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, for training observability data at one or more selected times, training iterations, or some combination thereof. In some embodiments, the facility determines a new time interval, training iteration interval, or some combination thereof, when a checkpoint for training of the machine learning model is identified. In some such embodiments, the facility identifies a checkpoint for training of the machine learning model based on user input, one or more selected training iteration intervals, one or more selected time intervals, or some combination thereof.

[0232] At act 3.203, the facility receives second training observability data obtained during a second time period during which the machine learning model is being trained. In some embodiments, the facility performs act 3.203 in a similar manner to act 3.202.

[0233] In some embodiments, the training observability data is included in a matrix of training observability data, such as the training observability data matrix 3.300, described below in connection with Figure 3.3.

[0234] Figure 3.3 is a sample observability training data matrix 3.300 describing a state of a machine learning model during a period of time that occurs while the machine learning model is being trained, used by the facility in some embodiments. The columns of the observability training data matrix 3.300, such as columns 3.321 and 3.322, indicate a time at which the observability training data indicated in the column is received. In some embodiments, the time indicated by the column is a time represented by an “iteration step,” i.e. an iteration of training the machine learning model. The observability data metrics rows 3.301 each indicate an observability metric associated with training the machine learning model. The facility may determine the observability metrics based on observable data of the state of the machine learning model while it is being trained. In some embodiments, a system other than the facility generates one or more of the observability metrics, the observability training data matrix, or some combination thereof. In some embodiments, the observability training data matrix 3.300 includes one or more observability data metrics associated with validation loss experienced by a machine learning model during training of the machine learning model.

[0235] The observability metrics in the observability training data matrix 3.300 may include, but are not limited to, a cross entropy, a gradient norm, a perplexity, a learning rate, a validation loss, a BiLingual Evaluation Understudy (BLEU) score, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an F1 score, precision, recall, an Area Under the Curve Receiver Operating Characteristics (AUC-ROC) score, and an Area Under the Curve Precision Recall Curve (AUC-PRC) score.
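Two of the listed metrics have standard closed forms: perplexity is the exponential of a cross entropy expressed in nats, and the F1 score is the harmonic mean of precision and recall. The sketch below assembles one hypothetical matrix column from assumed raw values; the dictionary layout and the numbers are illustrative assumptions rather than a format used by the facility.

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity for a cross entropy measured in nats."""
    return math.exp(cross_entropy_nats)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One hypothetical column of an observability matrix at a single
# iteration step; the raw values are made up for illustration.
column = {
    "cross_entropy": 2.0,
    "perplexity": perplexity(2.0),
    "precision": 0.8,
    "recall": 0.5,
    "f1": f1_score(0.8, 0.5),
}
print(column)
```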

[0236] In some embodiments, the facility generates an observability training data matrix 3.300 for each constituent machine learning model of an ensemble machine learning model. In some embodiments, the facility generates an observability training data matrix 3.300 for an ensemble machine learning model based on one or more observability training data matrices for one or more constituent machine learning models of the ensemble machine learning model. In some such embodiments, the facility generates the observability training data matrix 3.300 for the ensemble machine learning model by aggregating one or more observability training data matrices for one or more constituent machine learning models of the ensemble machine learning model.
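The text does not specify how constituent matrices are aggregated. As one possible reading, the sketch below takes an element-wise mean of equally shaped matrices; the choice of a mean, the function name, and the sample values are assumptions made only for illustration.

```python
# Hypothetical sketch: element-wise mean is an assumed aggregation,
# and all matrices are assumed to have the same shape.
def aggregate_matrices(matrices):
    """Element-wise mean of equally shaped matrices (lists of rows)."""
    count = len(matrices)
    return [
        [sum(values) / count for values in zip(*rows)]
        for rows in zip(*matrices)  # rows = the i-th row of every matrix
    ]


# Two made-up constituent matrices: rows are metrics, columns are steps.
constituent_a = [[1.0, 2.0], [3.0, 4.0]]
constituent_b = [[3.0, 4.0], [5.0, 6.0]]

ensemble_matrix = aggregate_matrices([constituent_a, constituent_b])
print(ensemble_matrix)
```

Other aggregations, such as stacking the constituent rows into one taller matrix, would fit the description equally well.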

[0237] While the table diagram shown in Figure 3.3 shows a table that represents a matrix whose contents and organization are designed to make them more comprehensible by a human reader, those skilled in the art will appreciate that actual data structures used by the facility to store this information may differ from the table shown, in that they, for example, may be organized in a different manner; may contain more or less information than shown; may be compressed, encrypted, and / or indexed; may contain a much larger number of rows than shown, etc. Additionally, in some embodiments, rather than storing the data shown in the table diagrams in tables, the facility stores it in semi-structured or unstructured data stores, such as JSON objects.

[0238] Returning to Figure 3.2, at act 3.204, the facility generates spectral training observability data based on the first and second training observability data. In some embodiments, the facility uses the process 3.400, described below in connection with Figure 3.4 to generate the spectral training observability data. In some embodiments, the facility generates the spectral training observability data based on training observability data received as a result of training the machine learning model based on a training dataset. In some embodiments, the facility generates the spectral training observability data based on validation data received as a result of validating the machine learning model based on a validation dataset. In some embodiments, the spectral training observability data is generated based on a combination of the training observability data and the validation data.

[0239] Figure 3.4 is a flow diagram showing a process for performing a dynamic mode decomposition on training observability data, used by the facility in some embodiments. First, at act 3.401, the facility receives a first matrix indicating training observability data for a machine learning model that is being trained.

[0240] At act 3.402, the facility receives a second matrix indicating training observability data for the machine learning model that is being trained. In some embodiments, the facility performs acts 3.401 and 3.402 in a similar manner to acts 3.201 and 3.202, respectively, described above in connection with Figure 3.2. The first matrix and second matrix may be observability training data matrices, such as the observability training data matrix 3.300 described above in connection with Figure 3.3. In some embodiments, when the machine learning model being trained is an ensemble machine learning model, the first and second matrices include training observability data for one or more constituent machine learning models of the ensemble machine learning model.

[0241] At act 3.403, the facility determines a singular value decomposition of the first matrix. In some embodiments, the facility determines the singular value decomposition based on Equation 3.1 below:

X = USV* (3.1)

[0242] In Equation 3.1: “X” represents the first matrix, “U” represents a complex unitary matrix, “S” represents a rectangular diagonal matrix, and “V*” represents the conjugate transpose of complex unitary matrix V. In some embodiments, the facility generates “U,” “S,” and “V*” based on the first matrix, “X,” by factoring “X” into these three components, such as by using the theories of singular value decomposition, dynamic mode decomposition, or a combination thereof.

[0243] In some embodiments, the facility determines a compact singular value decomposition, such as according to Equation 3.1A below. In compact SVD, “Sr” includes the “r” non-zero singular values of “S,” “Ur” includes the corresponding “r” columns of “U,” and “Vr*” includes the “r” rows of “V*” corresponding to the non-zero singular values in “Sr”:

X = UrSrVr* (3.1A)

[0244] In some embodiments, the facility determines a truncated singular value decomposition that approximates “X” using a configurable number “t” of the largest singular values of “S,” such as according to Equation 3.1B below:

X = UtStVt* (3.1B)

[0245] In truncated singular value decomposition, “St” includes the “t” largest singular values of “S”. Accordingly, the “t” columns of “U” and the “t” rows of “V*” corresponding to the “t” largest singular values in “St” are calculated. In various embodiments, “t” is any value less than or equal to the number of singular values in “S”.
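The compact and truncated decompositions of Equations 3.1A and 3.1B can be sketched with NumPy's `numpy.linalg.svd`. The helper names, the tolerance used to decide which singular values count as non-zero, and the sample matrix are assumptions chosen for illustration; this is not presented as the facility's implementation.

```python
import numpy as np


def compact_svd(X, tol=1e-10):
    """Keep the r singular values above tol (treated here as non-zero)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol))
    return U[:, :r], s[:r], Vh[:r, :]


def truncated_svd(X, t):
    """Keep the t largest singular values (NumPy returns them in
    descending order)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U[:, :t], s[:t], Vh[:t, :]


# Made-up 3 x 4 matrix of rank 2 (the second row is twice the first).
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [1.0, 0.0, 1.0, 0.0]])

Ur, sr, Vhr = compact_svd(X)
X_compact = Ur @ np.diag(sr) @ Vhr   # reproduces X up to rounding

Ut, st, Vht = truncated_svd(X, t=1)
X_rank1 = Ut @ np.diag(st) @ Vht     # rank-1 approximation of X

print(len(sr), np.allclose(X_compact, X))
```

For this sample matrix the compact form reproduces X, while the truncated form with t = 1 only approximates it, with a spectral-norm error equal to the first discarded singular value.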

[0246] At act 3.404, the facility defines a third matrix based on the second matrix and the singular value decomposition of the first matrix. Thus, at act 3.404, the facility generates a definition of the third matrix without generating the third matrix itself. In some embodiments, the facility determines the third matrix based on Equation 3.2 below:

A = U*YVS⁻¹ (3.2)

[0247] In Equation 3.2: “U*” is the conjugate transpose of “U” from Equation 3.1, “V” is the complex unitary matrix described with respect to Equation 3.1, and “S⁻¹” is the inverse of “S” from Equation 3.1. “Y” represents the second matrix, and “A” represents the third matrix. In embodiments where truncated SVD is used, “Ut*” is used in place of “U*,” “St” is used in place of “S,” and “Vt” is used in place of “V” according to Equation 3.1B. In embodiments where compact SVD is used, “Ur*” is used in place of “U*,” “Sr” is used in place of “S,” and “Vr” is used in place of “V” according to Equation 3.1A.

[0248] At act 3.405, the facility generates one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on the third matrix. In some embodiments, the facility performs eigendecomposition on the third matrix to determine the one or more eigenvalues, the one or more eigenvectors, or some combination thereof. In some embodiments, the facility identifies the one or more eigenvalues, one or more eigenvectors, or some combination thereof, based on Equation 3.3, below:

Aw = λw (3.3)

[0249] In Equation 3.3: “A” represents the third matrix, “w” represents a matrix having columns that correspond to eigenvectors, and “λ” represents a matrix having columns that correspond to eigenvalues.
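Acts 3.403 through 3.405 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the facility's implementation: the function name `dmd_eigs`, the optional truncation argument `t`, and the toy two-metric system are hypothetical, and the sketch assumes every retained singular value is non-zero so that “S” can be inverted.

```python
import numpy as np

def dmd_eigs(X, Y, t=None):
    """Return eigenvalues and eigenvectors of A = U* Y V S^-1 (Equations 3.1 to 3.3)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)   # X = U S V*
    if t is not None:                                   # optional truncation to t modes
        U, s, Vh = U[:, :t], s[:t], Vh[:t, :]
    # Assumes the retained singular values are non-zero so S is invertible.
    A = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    eigenvalues, W = np.linalg.eig(A)                   # A w = lambda w
    return eigenvalues, W

# Toy "training state": two metrics that shrink by factors 0.9 and 0.5 per step.
M = np.diag([0.9, 0.5])
states = [np.array([1.0, 1.0])]
for _ in range(3):
    states.append(M @ states[-1])

X = np.column_stack(states[:-1])   # first matrix
Y = np.column_stack(states[1:])    # second matrix, one step later
eigenvalues, W = dmd_eigs(X, Y)
```

For this toy system the recovered eigenvalues are the per-step decay factors 0.9 and 0.5, both with magnitude below 1.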

[0250] In some embodiments, if the third matrix “A” is not a square matrix, as part of performing act 3.405, the facility transposes the third matrix to calculate “AT,” such as by defining the rows of the matrix as columns and the columns of the matrix as rows. In such embodiments, the facility generates the eigenvalues and eigenvectors based on the transposed third matrix.

[0251] In some embodiments in which “A” is not a square matrix, the facility calculates a square matrix “A*AT” (the product of “A” and its transpose “AT”) and generates the eigenvalues and eigenvectors based on “A*AT.”

[0252] After act 3.405, the process 3.400 ends.

[0253] Returning to Figure 3.2, after act 3.204, the process 3.200 ends.

[0254] In some embodiments, the facility performs aspects of the process 3.200 multiple times throughout training of the machine learning model. In such embodiments, the facility may skip one or more of acts 3.201 and 3.202. In some embodiments, the facility may determine a time interval or training iteration interval for receiving additional training observability data based on the spectral observability data in a similar manner to selecting a time interval or training iteration interval described above in connection with act 3.202. In some such embodiments, the facility may use training observability data previously obtained by the facility, such as by performing aspects of the process 3.200, to generate the spectral training observability data. Thus, in such embodiments, the facility may generate the spectral training observability data with more than two sets of training observability data.
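When more than two sets of training observability data are available, one common convention in dynamic mode decomposition is to stack the snapshots as columns and form two matrices offset by one step. The text does not fix a particular layout, so the column-per-snapshot layout, the helper name `snapshot_matrices`, and the example values below are assumptions made for illustration.

```python
import numpy as np

def snapshot_matrices(snapshots):
    """Stack snapshots as columns; return (all but last, all but first)."""
    S = np.column_stack(snapshots)
    return S[:, :-1], S[:, 1:]

# Four hypothetical snapshots, each holding three observability metrics.
snapshots = [
    np.array([1.00, 0.80, 0.50]),
    np.array([0.90, 0.70, 0.45]),
    np.array([0.82, 0.61, 0.41]),
    np.array([0.75, 0.54, 0.38]),
]
X, Y = snapshot_matrices(snapshots)
```

Each column of `Y` is the snapshot that follows the matching column of `X`, so the pair can serve as the first and second matrices of the process 3.400.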

[0255] Figure 3.5 is a block diagram showing a sample training life-cycle 3.500 for training an ensemble machine learning model, used by the facility in some embodiments. The training life-cycle 3.500 includes a training data block 3.501, a data preprocessing block 3.502, a begin training block 3.503, a generate spectral training observability data block 3.504, a model adjustments block 3.505, and a continue training block 3.506. The model adjustments block 3.505 includes a hyperparameter tuning block 3.505a, an architecture adjustment block 3.505b, and a validation block 3.505c.

[0256] The training data block 3.501 represents the facility’s receipt of training data for an ensemble machine learning model. The data preprocessing block 3.502 represents the facility’s preprocessing of the training data to prepare the data to be used to train the ensemble machine learning model. In some embodiments, preprocessing the training data includes selecting a portion of the training data to be designated as validation data.

[0257] The begin training block 3.503 represents the facility’s initiation of training of the ensemble machine learning model based on the preprocessed training data. The generate spectral training observability data block 3.504 represents the facility’s generation of spectral training observability data for the ensemble machine learning model, constituent machine learning models, or some combination thereof, such as by using the processes 3.200 and 3.400 described above in connection with Figures 3.2 and 3.4.

[0258] The model adjustments block 3.505 represents the facility’s determination of whether an adjustment should be made to an aspect of the ensemble machine learning model, at least one constituent machine learning model, or some combination thereof. The model adjustments block 3.505 includes a hyperparameter tuning block 3.505a, architecture adjustment block 3.505b, and a validation block 3.505c.

[0259] The hyperparameter tuning block 3.505a represents one or more processes used by the facility to tune hyperparameters for training the ensemble machine learning model, at least one constituent machine learning model, or some combination thereof. In some embodiments, the facility tunes one or more hyperparameters associated with training the machine learning model, such as: a learning rate for training the machine learning model, a batch size for training the machine learning model, a momentum for training the machine learning model, an adaptive learning rate for training the machine learning model, other hyperparameters associated with training a machine learning model, or some combination thereof. In some embodiments, the facility may determine that multiple hyperparameters are to be changed. In such embodiments, the facility may change all of the hyperparameters or a portion of the hyperparameters. In some such embodiments, the facility may receive user input indicating which of the hyperparameters are to be changed.

[0260] In some embodiments, the facility determines which hyperparameters are to be changed by computing a new value for at least one hyperparameter based on the spectral training observability data and comparing the new value for the at least one hyperparameter with a current value of the hyperparameter, such as by using one or more of equations 3.4-3.7, described below. In some such embodiments, the facility determines to change a hyperparameter based on a determination that the new hyperparameter is outside of a threshold range of values for the hyperparameter. In some embodiments, the threshold range of values is determined based on user input. For example, if the facility determines that a new learning rate computed based on the spectral observability data is outside of a threshold range of learning rate values that include the current learning rate, the facility may determine that the learning rate for training the machine learning model is to change.
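The threshold comparison described above can be sketched as follows. The helper name `should_change`, the use of a symmetric relative band around the current value, and the 25% default are assumptions chosen for illustration; the text only says that the range includes the current value and may come from user input.

```python
def should_change(current, proposed, rel_tolerance=0.25):
    """Return True when the proposed value falls outside a band around the current value."""
    low = current * (1 - rel_tolerance)
    high = current * (1 + rel_tolerance)
    return not (low <= proposed <= high)

# Hypothetical learning rates: the current value is 0.01.
keep_lr = should_change(current=0.01, proposed=0.011)    # inside the band
change_lr = should_change(current=0.01, proposed=0.02)   # outside the band
```

A proposed learning rate of 0.011 stays inside the band and leaves training unchanged, while 0.02 falls outside it and would trigger a change.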

[0261] In some embodiments, the facility determines a new learning rate for training the machine learning model based on Equation 3.4, below:

[0262] In Equation 3.4, “η*” represents the new learning rate and “ρ(A)” represents the maximum eigenvalue magnitude of the spectral training observability data.

[0263] In some embodiments, the facility determines a new batch size for training the machine learning model based on Equation 3.5, below:

[0264] In Equation 3.5, “B*” represents the new batch size, “Σ” is the covariance matrix of the gradients, “L” is the Lipschitz constant of the loss function, “μ” is the mean gradient, and “η” is the learning rate.

[0265] In some embodiments, the facility determines a new momentum for training the machine learning model based on Equation 3.6, below: (3.6)

[0266] In Equation 3.6, “β*” represents the new momentum and “κ” represents the maximum eigenvalue included in the spectral observability data divided by the minimum eigenvalue included in the spectral observability data.

[0267] In some embodiments, the facility determines a new adaptive learning rate for training the machine learning model based on Equation 3.7, below: (3.7)

[0268] In Equation 3.7, “ηt*” represents the new adaptive learning rate, “α” is the base learning rate, “v̂t” is a bias-corrected second moment estimate, and “ε” is a small constant.

[0269] In some embodiments, the facility determines the magnitude of changing the hyperparameters based on a new hyperparameter calculated for the training of the machine learning model, the precision of the training observability data (such as whether the data is a 32-bit floating point number, a 16-bit floating point number, etc.), or some combination thereof.

[0270] The architecture adjustment block 3.505b represents one or more processes to change a model architecture of the ensemble machine learning model, at least one constituent machine learning model, or some combination thereof. In some embodiments, the facility determines whether to change a model architecture of a constituent machine learning model, the ensemble machine learning model, or some combination thereof, based on the spectral training observability data, such as in a similar manner to changing a hyperparameter for training a machine learning model.

[0271] In some embodiments, at the architecture adjustment block 3.505b, the facility determines whether a determined change to a model architecture of a machine learning model would result in a change in the base architecture of the machine learning model. For example, if the change includes adding or removing a block or layer of the machine learning model, or if the change would result in retraining the constituent machine learning model, the ensemble machine learning model, or some combination thereof, from scratch, the facility may determine that the determined change to the architecture of the machine learning model is a change in the base architecture of the machine learning model. In some embodiments, when a determined change to a model architecture of a machine learning model would result in a change in the base architecture of the machine learning model, the facility requests input, such as user input, regarding whether the change is to be made. In some such embodiments, the facility creates a “checkpoint” for the training of the machine learning model, such as by saving a current state of the machine learning model, before the change is made.

[0272] In some embodiments, when training of a machine learning model reaches a point at which another state of the machine learning model is checkpointed, such as after a number of training iterations or amount of time spent training the checkpointed machine learning model is reached, the facility compares the checkpointed machine learning model with the machine learning model for which the architecture was changed. In some embodiments, based on the comparison of the machine learning models, the facility may determine that the training of one of the machine learning models is to resume. In some embodiments, based on the comparison of the machine learning models, the facility may determine that the training of one of the machine learning models is to cease. In some embodiments, the facility receives user input regarding whether training of a checkpointed machine learning model, a machine learning model whose model architecture was changed, or some combination thereof, should resume or cease. In some embodiments, the facility trains a checkpointed machine learning model, a machine learning model whose model architecture was changed, a machine learning model having one or more constituent machine learning models whose architecture was changed, or some combination thereof, in parallel.

[0273] In some embodiments, the facility determines whether a change in a model architecture for a machine learning model is feasible based on model architecture criteria. In some embodiments, the facility determines that a change to a model architecture is not possible based on a comparison of the model architecture criteria and a proposed change in a model architecture for a machine learning model. For example, a proposed change in model architecture determined by the facility may require more processing power or memory usage than is available to use or train the machine learning model. In such an example, the facility may determine that the proposed change is not possible.

[0274] In some embodiments, when a proposed change is not possible, the facility may determine whether a similar change is possible based on the model architecture criteria. For example, the facility may determine that adding two layers to a constituent machine learning model would require too much processing power, but adding one layer would not require too much processing power but would still improve the constituent machine learning model, its training, or some combination thereof. In such an example, the facility may determine that changing the constituent machine learning model to add one layer is possible, and the facility may add the layer to the constituent machine learning model.
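The fallback from a proposed change to a smaller, feasible one can be sketched as a simple search. Treating the model architecture criteria as a single memory budget, and the names `feasible_layers_to_add`, `memory_per_layer_gb`, and `free_memory_gb`, are simplifying assumptions made for illustration only.

```python
def feasible_layers_to_add(requested, memory_per_layer_gb, free_memory_gb):
    """Return the largest layer count, up to the requested one, that fits the budget."""
    for n in range(requested, 0, -1):
        if n * memory_per_layer_gb <= free_memory_gb:
            return n
    return 0  # no version of the change is feasible

# Hypothetical budget: two layers need 6 GB, but only 4 GB is free.
layers = feasible_layers_to_add(requested=2, memory_per_layer_gb=3.0, free_memory_gb=4.0)
```

Here the two-layer change is rejected and the sketch settles on adding one layer, mirroring the example in the paragraph above.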

[0275] In some embodiments, the proposed change may be a change in the depth, width, or some combination thereof, of the machine learning model. In some embodiments, the facility determines the change in depth, width, or both, based on a ratio of aspects of the machine learning model and at least one eigenvalue generated from the spectral training observability data. In an example embodiment, the facility may identify a current ratio of feedforward layers and attenuation heads that represent the depth and width of the machine learning model. In such an example embodiment, the facility may compare the ratio of feedforward layers and attenuation heads to an integral multiple of the at least one eigenvalue to determine how the ratio of feedforward layers and attenuation heads is to change. Continuing the example, if the ratio is 4.7 and is closer to five times the at least one eigenvalue than four times the at least one eigenvalue, the facility determines that the ratio of feedforward layers and attenuation heads should be changed so that it is closer to five. In some embodiments, the distance between the ratio and the nearest integral multiple of the at least one eigenvalue is used to determine a magnitude of the proposed change.
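The comparison against integral multiples of an eigenvalue can be illustrated with the numbers from the example. This sketch assumes the eigenvalue is represented by a positive real magnitude and that the multiple is at least one; the function name `nearest_multiple_target` is hypothetical.

```python
def nearest_multiple_target(ratio, eigenvalue):
    """Find the integral multiple of the eigenvalue closest to the current ratio."""
    k = max(1, round(ratio / eigenvalue))
    target = k * eigenvalue
    distance = abs(ratio - target)   # may be used to scale the size of the change
    return k, target, distance

# Ratio of feedforward layers to attenuation heads is 4.7; eigenvalue magnitude is 1.0.
k, target, distance = nearest_multiple_target(ratio=4.7, eigenvalue=1.0)
```

With these values the nearest multiple is five times the eigenvalue, so the ratio would be moved toward 5, and the distance of about 0.3 could set the magnitude of the proposed change.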

[0276] In some embodiments, the facility determines whether a ratio of attenuation heads and feedforward layers of the machine learning model should be changed based on how much greater than 1 the absolute values of the eigenvalues included in the spectral observability data are. For example, the facility may determine that if the absolute values of the eigenvalues are between 1 and 2, the ratio of attenuation heads and feedforward layers of the machine learning model should be changed.

[0277] The continue training block 3.506 represents one or more processes of the facility that determine whether the ensemble machine learning model training has been completed, and continue the training if not. If the training is not completed, the facility continues in a “feedback loop” to generate spectral training observability data and adjust the ensemble machine learning model, constituent machine learning models, or some combination thereof, until the training has completed.

[0278] Furthermore, in some embodiments, the facility uses the feedback loop described above in connection with Figure 3.5 to dynamically adjust parameters of the ensemble machine learning model, constituent machine learning models, or some combination thereof. By using the feedback loop to dynamically adjust machine learning models, the facility is able to determine a set of constituent machine learning models included in an ensemble machine learning model that is expected to result in better “convergence properties” for the ensemble machine learning model than conventional methods, because the facility is assessing and adjusting the constituent machine learning models to improve their ability to perform certain tasks while they are being trained. Convergence properties may refer to properties of the machine learning model that indicate that training of the machine learning model has reached a stable state and that the parameters of the network have reached values that result in accurate predictions based on the training data.

[0279] Figure 3.6 is a flow diagram showing a process for changing an aspect of a constituent machine learning model or ensemble machine learning model during training, used by the facility in some embodiments. First, at act 3.601, the facility receives an indication of an ensemble machine learning model. In some embodiments, at act 3.601, the facility receives an indication of a model architecture of the ensemble machine learning model, one or more constituent machine learning models of the ensemble machine learning model, or some combination thereof.

[0280] At act 3.602a, the facility begins a loop of acts 3.603-3.606 that continues while the ensemble machine learning model is being trained.

[0281] At act 3.603, the facility receives training observability data reflecting the training of constituent machine learning models of the ensemble machine learning model, such as in a similar manner to acts 3.202 and 3.203, described above in connection with Figure 3.2.

[0282] At act 3.604, the facility generates spectral training observability data based on the training observability data, such as by using the process 3.200, the process 3.400, other methods of generating spectral training observability data, or some combination thereof. In some embodiments, the facility generates spectral training observability data for each constituent machine learning model. In some embodiments, the facility aggregates the spectral training observability data for each constituent machine learning model to generate spectral training observability data for the ensemble machine learning model. In some embodiments, the facility weights the spectral training observability data for each constituent machine learning model as part of generating the spectral training observability data for the ensemble machine learning model. In some such embodiments, the facility equally weights the spectral training observability data for constituent machine learning models. In some embodiments, the facility weights the spectral training observability data for constituent machine learning models based on learned weights for the constituent machine learning models.
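The text does not specify how per-constituent spectral data is aggregated. One possible reading, offered purely as an assumption, is a weighted average of each constituent's largest eigenvalue magnitude; the function name `ensemble_spectral_radius` and the example eigenvalues and weights are hypothetical.

```python
import numpy as np

def ensemble_spectral_radius(eigs_per_model, weights=None):
    """Weighted average of each constituent's largest eigenvalue magnitude."""
    radii = np.array([np.max(np.abs(e)) for e in eigs_per_model])
    if weights is None:                      # equal weighting by default
        weights = np.ones(len(radii))
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * radii) / np.sum(weights))

# Two hypothetical constituents with their eigenvalues.
eigs = [np.array([0.9, 0.5]), np.array([1.2, 0.3])]
equal = ensemble_spectral_radius(eigs)                    # equal weights
learned = ensemble_spectral_radius(eigs, weights=[3, 1])  # e.g. learned weights
```

Equal weighting gives 1.05 for these values, while weighting the first constituent three times as heavily gives 0.975.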

[0283] At act 3.605, the facility determines whether an aspect of the ensemble machine learning model or at least one constituent machine learning model of the ensemble machine learning model should be changed. At act 3.605, if the facility determines that an aspect of the ensemble machine learning model or at least one constituent machine learning model should be changed, the process 3.600 proceeds to act 3.606, otherwise the process 3.600 proceeds to act 3.603. In some embodiments, an aspect of the ensemble model or at least one constituent machine learning model may be a hyperparameter, an aspect of the model architecture, or some combination thereof.

[0284] For example, if the spectral training observability data includes an eigenvalue whose absolute value is less than or equal to 1, the facility may determine that a constituent machine learning model, the ensemble machine learning model, or some combination thereof should not be changed. In another example, if the spectral training observability data includes an eigenvalue whose absolute value is greater than or equal to 1, the facility may determine that training of a constituent machine learning model, the ensemble machine learning model, or some combination thereof, should cease; that a hyperparameter associated with training a constituent machine learning model, the ensemble machine learning model, or some combination thereof, should be changed; that the model architecture of a constituent machine learning model, the ensemble machine learning model, or some combination thereof, should change; or some combination thereof.
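A minimal sketch of this eigenvalue test follows. It assumes the decision is driven by the largest eigenvalue magnitude and treats a magnitude of exactly 1 as "keep"; the two examples above overlap at that boundary, so this tie-break and the name `training_decision` are assumptions.

```python
import numpy as np

def training_decision(eigenvalues):
    """Return 'keep' when every eigenvalue magnitude is at most 1, else 'change'."""
    radius = float(np.max(np.abs(eigenvalues)))
    return "keep" if radius <= 1.0 else "change"

stable = training_decision([0.9, 0.5 + 0.2j])   # all magnitudes below 1
unstable = training_decision([1.3, 0.4])        # one magnitude above 1
```

A "change" result would then lead to one of the listed responses, such as ceasing training, changing a hyperparameter, or changing the model architecture.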

[0285] In some embodiments, at act 3.605, the facility transmits an indication to a user regarding the determination of whether an aspect of a constituent machine learning model, the ensemble machine learning model, or some combination thereof, should be changed. In such embodiments, the facility may determine whether to change the machine learning model based on input received in response to transmitting such an indication to the user.

[0286] In some embodiments, as part of performing act 3.605, the facility receives an indication of additional data, such as training loss, validation loss, F1 score, BLEU score, root score, perplexity, image fidelity, signal to noise ratio, temporal coherence, or some combination thereof, for at least one constituent machine learning model, the ensemble machine learning model, or some combination thereof. For example, the facility may determine based on the additional data that a particular constituent machine learning model is not needed and may determine that the particular constituent machine learning model is to be removed from the ensemble machine learning model.

[0287] In some embodiments, the facility changes the training of the machine learning model by causing the training of the machine learning model to cease. For example, the facility may cease the training of a constituent machine learning model. In such an example, the facility may remove the constituent machine learning model from the ensemble machine learning model.

[0288] At act 3.606, the facility changes an aspect of the ensemble machine learning model or at least one of the constituent machine learning models. After act 3.606, if the training is not complete, the process 3.600 proceeds to act 3.603, otherwise the process 3.600 proceeds to act 3.602b. In some embodiments, the facility automatically causes the aspect of the machine learning model to change, such as by causing instructions, commands, etc., to be transmitted to a system that manages the training of the machine learning model. In some embodiments, the facility causes the aspect of the machine learning model to change by transmitting a request to a user for permission to change the training of the machine learning model, and changing - or not changing - the training of the machine learning model based on a response from the user.

[0289] In some embodiments, as part of performing acts 3.605 and 3.606, the facility performs one or more aspects of the processes 3.700, 3.800, or some combination thereof.

[0290] Figure 3.7 is a flow diagram showing a process for adjusting a hyperparameter or model architecture for a constituent machine learning model of an ensemble machine learning model, used by the facility in some embodiments. First, at act 3.701, the facility initiates the training of an ensemble machine learning model.

[0291] At act 3.702, the facility receives spectral training observability data reflecting training of an ensemble machine learning model. In some embodiments, the facility performs act 3.702 in a similar manner to act 3.604, described above in connection with Figure 3.6.

[0292] At act 3.703, the facility determines whether the spectral training observability data indicates that a hyperparameter for training at least one constituent machine learning model or a model architecture for at least one constituent machine learning model is to change. In some embodiments, the facility performs act 3.703 in a similar manner to act 3.605, described above in connection with Figure 3.6. If the spectral training observability data indicates that a hyperparameter or model architecture for the constituent machine learning model is to change, the process 3.700 proceeds to act 3.704. Otherwise, the process 3.700 ends.

[0293] At act 3.704, the facility generates at least one hyperparameter adjustment, at least one model architecture adjustment, or some combination thereof, based on the spectral training observability data.

[0294] At act 3.705, the facility resumes training of the at least one constituent machine learning model based on the generated hyperparameter, generated model architecture configuration, or some combination thereof.

[0295] After act 3.705, the process 3.700 ends.

[0296] Figure 3.8 is a flow diagram showing a process for removing a constituent machine learning model from an ensemble machine learning model during training, used by the facility in some embodiments. First, at act 3.801, the facility initiates training of an ensemble machine learning model.

[0297] At act 3.802, the facility receives spectral training observability data for an ensemble machine learning model. In some embodiments, the facility performs act 3.802 in a similar manner to act 3.604, described above in connection with Figure 3.6.

[0298] At act 3.803, the facility receives validation data for training the ensemble machine learning model. In some embodiments, the validation data is a subset of the data included in training data for the ensemble machine learning model.

[0299] At act 3.804, the facility determines whether at least one constituent machine learning model should be removed from the ensemble machine learning model. In some embodiments, the facility determines whether at least one constituent machine learning model should be removed based on spectral training observability data generated from training observability data, validation data, or some combination thereof. If the at least one constituent machine learning model should be removed from the ensemble machine learning model, the process 3.800 proceeds to act 3.805. Otherwise, the process proceeds to act 3.806.

[0300] At act 3.805, the facility removes the at least one constituent machine learning model from the ensemble machine learning model.

[0301] At act 3.806, the facility resumes training of the ensemble machine learning model.

[0302] After act 3.806, the process 3.800 ends.

[0303] At act 3.602b, the loop ends when training of the machine learning model is complete. In some embodiments, the facility performs one or more aspects of the process 3.600 during pre-training, while training the ensemble machine learning model, while training a constituent machine learning model, while validating the training of the ensemble machine learning model, while validating the training of a constituent machine learning model, or some combination thereof.

[0304] After act 3.602b, the process 3.600 ends.

[0305] Figure 3.9 is a block diagram showing a sample system 3.900 for obtaining training observability data and evaluating the training of a machine learning model, used by the facility in some embodiments. The system 3.900 includes a model training workload block 3.901, a watch dog block 3.902, a spectral observability data generation and evaluation block (“spectral data block”) 3.903, and a data store block 3.904.

[0306] The model training workload block 3.901 represents a system that trains one or more machine learning models, each machine learning model having been configured based on a model architecture of a plurality of model architectures, based on training data included in the data store block 3.904.

[0307] The watch dog block 3.902 represents a system that generates training observability data based on observable data representing the state of a machine learning model that is being trained.

[0308] The spectral data block 3.903 represents a system that receives training observability data from the watch dog block 3.902, such as in a similar manner to acts 3.202 and 3.203, 3.401 and 3.402, 3.603, and 3.701 described above in connection with Figures 3.2, 3.4, 3.6, and 3.7, respectively. The spectral data block 3.903 may perform any of the processes 3.200, 3.400, 3.600, 3.700, and 3.800, described above in connection with Figures 3.2, 3.4, 3.6, 3.7, and 3.8, respectively. The spectral data block 3.903 may transmit instructions to one or more of the watch dog block 3.902, model training workload block 3.901, other systems, or some combination thereof, as part of performing the processes 3.200, 3.400, 3.600, 3.700, 3.800, other processes, methods, or functions performed by the facility, or some combination thereof.

[0309] The data store block 3.904 stores training data for training one or more machine learning models, training observability data generated by the watch dog 3.902, and spectral training observability data generated by the spectral data block 3.903.

[0310] The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet are incorporated herein by reference, in their entirety. Aspects of the embodiments can be modified, if necessary, to employ concepts of the various patents, applications and publications to provide yet further embodiments.

[0311] These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.

Claims

1. A method in a computing system, comprising: receiving an indication of a machine learning model that is being trained; receiving first training observability data indicating at least one state of a machine learning model at a first time, the first time occurring while the machine learning model is being trained; receiving second training observability data indicating at least one state of a machine learning model at a second time, the second time occurring while the machine learning model is being trained; generating spectral training observability data by applying a dynamic mode decomposition to the first training observability data and second training observability data, wherein the spectral training observability data indicates a prediction of how training observability data will change over time; determining whether the training of the machine learning model is to change based on the spectral training observability data; and based on the determining, causing training of the machine learning model to be changed.

2. The method of claim 1, wherein: the first training observability data includes a matrix of observability data metrics representing at least one state of the model at the first time; the second training observability data includes a matrix of observability data metrics representing at least one state of the model at the second time; and the spectral training observability data includes one or more eigenvalues and one or more eigenvectors representing a matrix that may be computed from the first training observability data and the second training observability data.

3. The method of claim 1, wherein generating the spectral training observability data by applying the dynamic mode decomposition to the first training observability data and the second training observability data comprises: determining a singular value decomposition of the data indicated by the first training observability data; generating an indication of third training observability data based on the second training data and the singular value decomposition of the first training data; and generating at least one eigenvalue and at least one eigenvector based on the indication of the third training observability data.

4. The method of claim 1, wherein causing the training of the machine learning model to be changed comprises: selecting at least one hyperparameter of a plurality of hyperparameters for training the machine learning model; and causing the at least one hyperparameter to change based on the spectral training observability data.

5. The method of claim 4, wherein causing the at least one hyperparameter to change comprises: transmitting commands to a computing system able to change a hyperparameter of the machine learning model that instruct the computing system to change the hyperparameter.

6. The method of claim 4, wherein causing the at least one hyperparameter to change comprises: transmitting a message to a user indicating that the at least one hyperparameter should be changed; and receiving input indicating that the hyperparameter is to be changed.

7. The method of claim 4, wherein the plurality of hyperparameters include: a learning rate for training the machine learning model; a batch size for training the machine learning model; a momentum for training the machine learning model; an adaptive learning rate for training the machine learning model; a number of attention heads of the machine learning model; and a number of feedforward layers of the machine learning model.

8. The method of claim 4, wherein causing the at least one hyperparameter to change comprises: determining a magnitude for changing the hyperparameter based on the spectral training observability data.

9. The method of claim 1, wherein causing the training of the machine learning model to be changed comprises: determining whether the training of the machine learning model should cease based on the spectral training observability data; and based on the determining, ceasing the training of the machine learning model.

10. One or more instances of computer-readable media not constituting a transitory propagating data signal, the one or more instances of computer-readable media collectively having contents configured to cause a computing device to perform a method comprising: receiving an indication of a plurality of model architectures; for each respective model architecture of the plurality of model architectures: training a machine learning model configured based on the respective model architecture; during the training of the machine learning model: receiving first training observability data during a first time period; receiving second training observability data during a second time period; generating spectral training observability data based on the first and second training observability data; and determining whether the training of the model is to change based on the spectral training observability data; and selecting a model architecture of the plurality of model architectures based on the spectral training observability data for each model architecture of the plurality of model architectures.

11. The one or more instances of computer-readable media of claim 10, wherein the method further comprises: causing training of the machine learning model having the selected model architecture to cease.

12. The one or more instances of computer-readable media of claim 10, wherein the method further comprises: determining whether one or more hyperparameters for training the machine learning model associated with the selected model architecture are to be changed based on the spectral training observability data; and based on the determining, causing the one or more hyperparameters to be changed.

13. The one or more instances of computer-readable media of claim 12, wherein causing the one or more hyperparameters to be changed comprises: automatically transmitting instructions to change the one or more hyperparameters for training the machine learning model associated with the selected model architecture.

14. The one or more instances of computer-readable media of claim 12, wherein causing the one or more hyperparameters to be changed comprises: presenting an indication of the one or more hyperparameters to a user; receiving input indicating that at least one hyperparameter of the one or more hyperparameters is to be changed; and changing the at least one hyperparameter based on the received input.

15. A method comprising: generating spectral training observability data; and assessing the training of a machine learning model based on the spectral training observability data.

16. A method comprising: tuning hyperparameters for training a machine learning model based on training observability data.

17. A method comprising: identifying a first computer number format used to represent one or more parameters of an artificial intelligence (AI) model; obtaining first training observability data indicating a first state of the AI model during a first training period; obtaining second training observability data indicating a second state of the AI model during a second training period; generating spectral training observability data by applying dynamic mode decomposition (“DMD”) to the first training observability data and the second training observability data; based on the spectral training observability data, determining whether to change the first computer number format used to represent the one or more parameters to a second computer number format; and causing the AI model to be trained during a third training period based on the determination.

18. The method of claim 17, wherein the spectral training observability data includes one or more eigenvalues.

19. The method of claim 17, wherein determining whether to change the first computer number format includes: determining, based on the spectral training observability data, whether vanishing gradients are observed while training the AI model.

20. The method of claim 19, wherein determining whether vanishing gradients are observed while training the AI model includes: obtaining an eigenvalue threshold; counting eigenvalues of the spectral training observability data having a value less than the eigenvalue threshold; and based on the count of eigenvalues, determining whether vanishing gradients are observed while training the AI model.

21. The method of claim 19, wherein determining whether vanishing gradients are observed while training the AI model includes: obtaining an eigenvalue threshold; determining a proportion of eigenvalues of the spectral training observability data having a value less than the eigenvalue threshold; comparing the proportion of eigenvalues to a vanishing gradient threshold; and based on the comparing, determining whether vanishing gradients are observed while training the AI model.
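
The count-based test of claim 20 and the proportion-based test of claim 21 can be illustrated with a short Python sketch. It assumes that "a value less than the eigenvalue threshold" refers to eigenvalue magnitude (so complex eigenvalues are handled) and that vanishing gradients are flagged when the proportion exceeds the vanishing gradient threshold; neither detail is fixed by the claims. The function and parameter names are hypothetical.

```python
def vanishing_gradient_check(eigenvalues, eigenvalue_threshold, vanishing_threshold):
    """Return (count, proportion, flagged) for a list of DMD eigenvalues.

    Assumption: eigenvalues are compared by magnitude, and a proportion
    strictly above vanishing_threshold is treated as vanishing gradients.
    """
    magnitudes = [abs(ev) for ev in eigenvalues]
    # Claim 20: count the eigenvalues below the eigenvalue threshold.
    count = sum(1 for m in magnitudes if m < eigenvalue_threshold)
    # Claim 21: express that count as a proportion of all eigenvalues.
    proportion = count / len(magnitudes) if magnitudes else 0.0
    flagged = proportion > vanishing_threshold
    return count, proportion, flagged


# Illustrative values only; they are not taken from the specification.
result = vanishing_gradient_check([0.95, 0.02, 0.01 + 0.01j, 0.6], 0.1, 0.4)
```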

22. The method of claim 19, wherein determining whether vanishing gradients are observed while training the AI model includes: obtaining an eigenvalue threshold; determining a first gradient norm contributed by eigenvalues less than the eigenvalue threshold; determining a second gradient norm contributed by eigenvalues greater than or equal to the eigenvalue threshold; and based on the first gradient norm and the second gradient norm, determining whether vanishing gradients are observed while training the AI model.

23. The method of claim 22, further comprising: determining a fraction of gradient norm contributed by the eigenvalues less than the eigenvalue threshold based on the first gradient norm and the second gradient norm; comparing the fraction of gradient norm to a gradient norm threshold; and determining whether vanishing gradients are observed based on the comparing.
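
The gradient-norm fraction of claims 22 and 23 might be computed as in the sketch below. The claims do not say how the norm "contributed by" a set of eigenvalues is obtained, so this sketch assumes a hypothetical input `mode_norms` that gives one non-negative contribution per eigenvalue, assumes the contributions add, and assumes a fraction above the threshold indicates vanishing gradients.

```python
def small_mode_norm_fraction(eigenvalues, mode_norms, eigenvalue_threshold):
    """Fraction of total gradient norm attributed to small-magnitude eigenvalues.

    mode_norms[i] is an assumed per-eigenvalue norm contribution (hypothetical
    input; the claims do not define how it is derived).
    """
    first = 0.0   # norm from eigenvalues below the threshold
    second = 0.0  # norm from eigenvalues at or above the threshold
    for ev, norm in zip(eigenvalues, mode_norms):
        if abs(ev) < eigenvalue_threshold:
            first += norm
        else:
            second += norm
    total = first + second
    return first / total if total > 0 else 0.0


def vanishing_by_norm(eigenvalues, mode_norms, eigenvalue_threshold, norm_threshold):
    # Assumed direction of comparison: a larger fraction means vanishing gradients.
    fraction = small_mode_norm_fraction(eigenvalues, mode_norms, eigenvalue_threshold)
    return fraction > norm_threshold
```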

24. The method of claim 17, wherein determining whether to change the first computer number format includes: determining to change the first computer number format based on the spectral training observability data indicating that exploding gradients are observed while training the AI model.

25. The method of claim 17, wherein determining whether to change the first computer number format includes: determining not to change the first computer number format based on determining that the spectral training observability data indicates convergence of the AI model.

26. The method of claim 17, wherein determining whether to change the first computer number format includes: selecting a first eigenvalue of the spectral training observability data having a highest magnitude; selecting a second eigenvalue of the spectral training observability data having a lowest magnitude among the eigenvalues of the spectral training observability data; determining a difference in magnitude between the first eigenvalue and the second eigenvalue; and based on the difference satisfying a spectral gap threshold, determining whether to change the first computer number format to the second computer number format.
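
The spectral gap test of claim 26 reduces to a few lines of Python. The sketch assumes that the difference "satisfies" the spectral gap threshold when it is greater than or equal to the threshold; the claim does not state the direction of the comparison. The function name `spectral_gap` is hypothetical.

```python
def spectral_gap(eigenvalues, gap_threshold):
    """Return (gap, satisfied) for a non-empty list of DMD eigenvalues.

    gap is the difference between the highest and lowest eigenvalue magnitude.
    Assumption: the threshold is satisfied when gap >= gap_threshold.
    """
    magnitudes = [abs(ev) for ev in eigenvalues]
    gap = max(magnitudes) - min(magnitudes)
    return gap, gap >= gap_threshold


# Illustrative values only; they are not taken from the specification.
gap, satisfied = spectral_gap([1.2, 0.3 + 0.4j, 0.1], 1.0)
```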

27. The method of claim 17, further comprising: based on determining to change the first computer number format, selecting the second computer number format to have a greater precision than the first computer number format; and representing the one or more parameters using the second computer number format.

28. The method of claim 17, wherein the one or more parameters include a weight of the AI model.

29. The method of claim 17, wherein the one or more parameters include a bias of the AI model.

30. The method of claim 17, further comprising: in response to determining to change the first computer number format, changing the first computer number format of the one or more parameters to the second computer number format while training the AI model.

31. The method of claim 17, wherein at least one of the first computer number format or the second computer number format includes one of floating point 32, floating point 16, floating point 8, brain floating point 16, or TensorFloat-32.

32. A system comprising: one or more processors; and one or more memories collectively storing contents executable by the one or more processors to perform actions, the actions comprising: generating spectral training observability data for an artificial intelligence (AI) model; and changing, based on the spectral training observability data, a first computer number format of a parameter of the AI model to a second computer number format while training the AI model.

33. A method comprising: generating spectral training observability data for an artificial intelligence (AI) model; and changing, based on the spectral training observability data, a first computer number format of a parameter of the AI model to a second computer number format.

34. One or more computer-readable media, not constituting a signal per se, collectively storing contents executable by one or more processors to perform actions, the actions comprising: obtaining a plurality of computer number formats usable to represent one or more parameters of an artificial intelligence (AI) model; for each computer number format of the plurality of computer number formats: training the AI model using the computer number format to represent one or more parameters of the AI model; during training of the AI model: obtaining first training observability data indicating a first state of the AI model during a first time period; obtaining second training observability data indicating a second state of the AI model during a second time period; and generating spectral training observability data based on the first training observability data and the second training observability data; based on the spectral training observability data, selecting a computer number format to represent the one or more parameters of the AI model; and training the AI model using the computer number format to represent the one or more parameters of the AI model.

35. A method in a computing system, comprising: receiving an indication of an ensemble machine learning model that is being trained, the ensemble machine learning model comprising two or more constituent machine learning models; for each constituent machine learning model of the ensemble machine learning model: receiving first training observability data indicating at least one state of the constituent machine learning model at a first time, the first time occurring while the ensemble machine learning model is being trained; receiving second training observability data indicating at least one state of the constituent machine learning model at a second time, the second time occurring while the ensemble machine learning model is being trained; and generating spectral training observability data for the constituent machine learning model based on the first training observability data and second training observability data; for each of one or more constituent machine learning models of the ensemble machine learning model, determining, based on spectral training observability data, an adjustment to an aspect of the constituent machine learning model; and based on the determining, continuing the training of the ensemble machine learning model in a way that reflects the determined adjustments.

36. The method of claim 35, wherein continuing the training of the ensemble machine learning model in a way that reflects the determined adjustments comprises continuing the training of only a proper subset of the constituent machine learning models of the ensemble machine learning model.

37. The method of claim 35, wherein continuing the training of the ensemble machine learning model in a way that reflects the determined adjustments comprises changing at least one of: a learning rate for training the at least one constituent machine learning model; a batch size for training the at least one constituent machine learning model; a momentum for training the at least one constituent machine learning model; and an adaptive learning rate for training the at least one constituent machine learning model.

38. The method of claim 35, wherein continuing the training of the ensemble machine learning model in a way that reflects the determined adjustments comprises changing at least one of: a number of attention heads of the at least one constituent machine learning model; and a number of feedforward layers of the at least one constituent machine learning model.

39. The method of claim 35, wherein generating spectral training observability data for the constituent machine learning model comprises: applying dynamic mode decomposition to the first training observability data and the second training observability data.

40. The method of claim 35, wherein determining an adjustment to an aspect of the constituent machine learning model comprises: generating a matrix of spectral training observability data based on the spectral training observability data generated for each constituent machine learning model; and applying a dynamic mode decomposition to the matrix of spectral training observability data.

41. The method of claim 35, wherein determining an adjustment to an aspect of the constituent machine learning model comprises: generating at least one validation data matrix based on a validation dataset for training the ensemble machine learning model; generating a matrix of training observability data for the ensemble machine learning model based on the validation data matrix, the first training observability data matrix for each constituent machine learning model, and the second training observability data matrix for each constituent machine learning model; and applying a dynamic mode decomposition to the matrix of training observability data.

42. One or more instances of computer-readable media, the one or more instances of computer-readable media collectively having contents configured to cause a computing device to perform a method comprising: receiving an indication of an ensemble machine learning model to be trained, the ensemble machine learning model comprising two or more constituent machine learning models; and while the ensemble machine learning model is being trained: receiving first training observability data for each constituent machine learning model; receiving second training observability data for each constituent machine learning model; generating spectral training observability data for the ensemble machine learning model based on the first training observability data and the second training observability data; determining, based on the spectral training observability data, that a change to an aspect of the ensemble machine learning model is to be made; and based on the determining, causing the aspect of the ensemble machine learning model to be changed.

43. The one or more instances of computer-readable media of claim 42, wherein causing the aspect of the ensemble machine learning model to change comprises: causing the training of at least one constituent machine learning model of the ensemble machine learning model to cease; and removing the at least one constituent machine learning model from the ensemble machine learning model.

44. The one or more instances of computer-readable media of claim 42, wherein causing the aspect of the ensemble machine learning model to change comprises: causing a hyperparameter of at least one constituent machine learning model of the ensemble machine learning model to be changed.

45. The one or more instances of computer-readable media of claim 42, wherein causing the aspect of the ensemble machine learning model to change comprises: causing a model architecture of at least one constituent machine learning model of the ensemble machine learning model to be changed.

46. The one or more instances of computer-readable media of claim 42, wherein generating the spectral training observability data for the ensemble machine learning model comprises: applying a dynamic mode decomposition to the first training observability data and the second training observability data.

47. The one or more instances of computer-readable media of claim 42, wherein generating the spectral training observability data for the ensemble machine learning model comprises: for each constituent machine learning model: applying a dynamic mode decomposition to first training observability data of the constituent machine learning model and second training observability data of the constituent machine learning model.

48. The one or more instances of computer-readable media of claim 42, wherein determining whether the aspect of the ensemble machine learning model is to change comprises: receiving an indication of a validation dataset for training the ensemble machine learning model; and determining whether the training of the ensemble machine learning model is to change based on the spectral training observability data and the validation dataset.

49. A method comprising: receiving an indication of a constituent machine learning model of an ensemble machine learning model that is being trained; and changing hyperparameters or a model architecture for the constituent machine learning model.

50. A method comprising: receiving an indication of an ensemble machine learning model; and assessing the training of the ensemble machine learning model based on training observability data.