Capturing technical variability in discovery proteomics experiments

By modeling technical variability in mass spectrometry proteomics experiments, the method improves sensitivity and specificity, addressing issues with small sample sizes and heteroskedasticity, enabling reliable detection of biological changes with reduced replicates.

WO2026143106A1 · PCT designated stage · Publication Date: 2026-07-02 · GOLGI LLC +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GOLGI LLC
Filing Date
2025-12-23
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Mass spectrometry proteomics experiments face challenges with high technical variability due to small sample sizes, leading to increased false positives and negatives, especially when dealing with heteroskedasticity and signal-dependent variance, which complicates the separation of biological signals from technical noise.

Method used

A method and system that models technical variability by relating composite quality metrics to estimated variability, applying functions to transform these metrics, and performing variance moderation to improve sensitivity and specificity, even with reduced replicates.

Benefits of technology

The method enhances statistical power and reduces experimental overhead by accurately determining variance metrics, maintaining sensitivity and controlling false discovery rates, even with single replicates, thus optimizing throughput in high-throughput proteomics.

✦ Generated by Eureka AI based on patent content.

Abstract

A method processes an experiment design definition and resulting omics data containing measurements for multiple analytes. For each analyte, the method fits a model to estimate a first variability and computes a first quality metric per measurement. The method aggregates per-analyte measurements into a composite quality metric, fits a function relating the composite quality metric to the first variability, and uses the function to transform both variables. The method fits a model to the transformed data to obtain a second estimated variability for each analyte and computes residuals from the first and second estimates, followed by variance moderation of the residuals. The method then reads a second dataset with samples for multiple analytes, computes a second quality metric for each analyte, transforms the second quality metric using the function, and applies the fitted model to output a variance metric for each analyte in the second dataset.

Description

GLI-00125 CAPTURING TECHNICAL VARIABILITY IN DISCOVERY PROTEOMICS EXPERIMENTS

RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application No. 63/738,964, filed December 26, 2024, which is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] Proteomics is the study of the interactions, function, composition, and structures of proteins and their cellular activities. In addition, mass spectrometry is a tool for measuring the mass-to-charge ratio (m/z) of a molecule in a sample.

BRIEF SUMMARY

[0003] According to embodiments of the present disclosure, methods of and computer program products for determining a variance metric are provided. In some embodiments, a method can comprise reading an experiment design definition. The method can further comprise reading Omics data. The Omics data can result from executing the experiment design. The Omics data can comprise a plurality of measurements for a plurality of analytes. The method can further comprise fitting at least one model for each analyte. Each model can provide a first estimated variability of the Omics data for its analyte. The method can comprise determining a first quality metric for each of the plurality of measurements. In some embodiments, and as described below, the first quality metric can be converted into a weight. The method can comprise calculating, based on the plurality of first quality metrics, a composite quality metric for each of the plurality of analytes. The method can comprise fitting a function. The function can relate the composite quality metric to the first estimated variability for each analyte. The method can comprise applying the function to transform the composite quality metric and the first estimated variability for each analyte. In some embodiments, the method can comprise fitting a model to the transformed data. In some embodiments, the method can comprise obtaining, from the model, a plurality of model parameter estimates. The method can comprise determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric. The method can comprise determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability. The method can comprise performing variance moderation on the vector of residuals. 
The method can comprise reading a second dataset comprising a plurality of samples for a plurality of analytes. The method can comprise determining a second quality metric for each of the plurality of analytes. In some embodiments, the plurality of second quality metrics comprises one or more quality metrics. The method can comprise applying the function to transform each second quality metric. The method can comprise applying the fitted model to the plurality of second quality metrics (e.g., the transformed second set of quality metrics) to determine a variance metric for each of the plurality of analytes in the second dataset.

[0004] In some embodiments, the method further includes determining a vector of covariate between group error (ESS) for the first dataset. Fitting the model can further comprise including the covariate ESS. In some embodiments, the method further comprises determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

[0005] In some embodiments, the method can comprise determining a subset of the plurality of analytes which were not obtained by the Omics data. The method can further comprise generating measurements for each of the subset of the plurality of analytes.

[0006] In some embodiments, the method can comprise aggregating the plurality of measurements and estimated variabilities from the protein analyte level to the complete protein level, the complete protein comprising all its set of analytes.

[0007] In some embodiments, the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics. In some embodiments, Proteomics comprises data relating to peptides and measurements thereof.

[0008] In some embodiments, the Omics data is mass spectrometry data.

[0009] In some embodiments, the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as a representation of a flux, or a read count, and other representations of analyte experimental attributes such as mass charge, mapping score, etc. In some embodiments, the ion count can be represented as a signal-to-noise ratio (SNR, SN Ratio). In some embodiments, intensity can be represented as a measurement of luminescence, a measurement of frequency, or a measurement of flux. The measurement of flux can be calculated as a Fourier transform of the measurement of frequency. In some embodiments, the read count can be a representation of RNA Sequencing data (RNASeq).

[0010] In some embodiments, the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

[0011] In some embodiments, fitting the model is further based on complete data for that analyte.

[0012] In some embodiments, each of the plurality of variances is a sum of squares.

[0013] In some embodiments, at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

[0014] In some embodiments, a method of modifying experiment parameters can include obtaining the omics data of the above described methods. The omics data can comprise a plurality of omics samples. The method can comprise, for each of a plurality of combinations of the plurality of omics samples, calculating, by any of the above described methods, the variance metric. The method can comprise calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics. The method can comprise calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

[0015] In some embodiments, the method can comprise outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

[0016] In some embodiments, a computing node comprises a computer readable storage medium having program instructions embodied therewith. The program instructions can be executable by a processor of the computing node to cause the processor to perform a method comprising reading an experiment design definition. The method can further comprise reading Omics data. The Omics data can result from executing the experiment design. The Omics data can comprise a plurality of measurements for a plurality of analytes. The method can further comprise fitting at least one model for each analyte. Each model can provide a first estimated variability of the Omics data for its analyte. The method can comprise determining a first quality metric for each of the plurality of measurements. In some embodiments, and as described below, the plurality of first quality metrics can be used to compute a weight. The method can comprise calculating, based on the plurality of the first quality metrics, a composite quality metric for each of the plurality of analytes. The method can comprise fitting a function. The function can relate the composite quality metric to the first estimated variability for each analyte. The method can comprise applying the function to transform the composite quality metric and the first estimated variability for each analyte. In some embodiments, the method can comprise fitting a model to the transformed data. In some embodiments, the method can comprise obtaining, from the model, a plurality of model parameter estimates. The method can comprise determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric. The method can comprise determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability. The method can comprise performing variance moderation on the vector of residuals. 
The method can comprise reading a second dataset comprising a plurality of samples for a plurality of analytes. The method can comprise determining a second quality metric for each of the plurality of analytes. The method can comprise applying the function to transform each of the second quality metrics. The method can comprise applying the fitted model to the plurality of second quality metrics (e.g., the transformed second set of quality metrics) to determine a variance metric for each of the plurality of analytes in the second dataset.

[0017] In some embodiments, the method performed by the processor further comprises determining a vector of covariate between group error (ESS) for the first dataset. Fitting the model can further comprise including the covariate ESS. In some embodiments, the method performed by the processor further comprises determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

[0018] In some embodiments, the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics.

[0019] In some embodiments, the Omics data is mass spectrometry data.

[0020] In some embodiments, the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as a representation of a flux, or a read count, and other representations of analyte experimental attributes such as mass charge, mapping score, etc.

[0021] In some embodiments, the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

[0022] In some embodiments, fitting the model is further based on complete data for that analyte.

[0023] In some embodiments, each of the plurality of variances is a sum of squares.

[0024] In some embodiments, at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

[0025] In some embodiments, a system for modifying experiment parameters comprises a computing node comprising a computer readable storage medium having program instructions embodied therewith. The program instructions can be executable by a processor of the computing node to cause the processor to perform a method comprising obtaining the above described omics data, the omics data comprising a plurality of omics samples. The method further includes, for each of a plurality of combinations of the plurality of omics samples, calculating, by the above method, the variance metric. The method can further comprise calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics. The method can further comprise calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

[0026] In some embodiments, the method performed by the processor further comprises outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

[0027] In some embodiments, calculating a number of false positives comprises generating a first analysis of each omics sample of the plurality of omics samples, removing one of the plurality of omics samples in the plurality of combinations, thereby generating a reduced plurality of combinations, generating a second analysis of the reduced plurality of combinations, and marking an omics sample as a false positive when the second analysis indicates it is significant and the first analysis indicates it is not significant.

[0028] In some embodiments, calculating the relative sensitivity comprises calculating a ratio of (a) significant detected changes in the plurality of omics samples and significant detected changes in the reduced plurality of combinations to (b) significant detected changes in the plurality of omics samples.
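
The comparison between a full analysis and a reduced-replicate analysis described in paragraphs [0027] and [0028] can be sketched as simple set arithmetic. This is only an illustration under stated assumptions: the function name, the dictionary keys, and the item identifiers are hypothetical; the numerator of the relative sensitivity is read here as the changes significant in both analyses; and the relative false discovery rate is assumed to be the false positives divided by all calls in the reduced analysis, a definition the text does not spell out.

```python
def compare_reduced_to_full(full_significant, reduced_significant):
    """Compare significance calls from a reduced design against the full design.

    Both arguments are iterables of identifiers called significant in the
    respective analysis (hypothetical inputs for illustration).
    """
    full = set(full_significant)
    reduced = set(reduced_significant)

    # Significant in the reduced analysis but not in the full analysis.
    false_positives = reduced - full
    # Significant in both analyses (assumed reading of part (a) of the ratio).
    retained = full & reduced

    relative_sensitivity = len(retained) / len(full) if full else float("nan")
    # Assumed definition: share of reduced-analysis calls that are false positives.
    relative_fdr = len(false_positives) / len(reduced) if reduced else 0.0

    return {
        "false_positives": false_positives,
        "relative_sensitivity": relative_sensitivity,
        "relative_fdr": relative_fdr,
    }


full_calls = {"P1", "P2", "P3", "P4"}
reduced_calls = {"P1", "P2", "P5"}
result = compare_reduced_to_full(full_calls, reduced_calls)
print(result)
```

In this toy input, two of the four full-design calls survive the reduction and one new call appears, so the relative sensitivity is 0.5 and one item is flagged as a false positive.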

[0029] In some embodiments, a computer program product for determining variance moderation can comprise a computer readable storage medium having program instructions embodied therewith. The program instructions can be executable by a processor to cause the processor to perform a method comprising reading Omics data. The Omics data can result from executing the experiment design. The Omics data can comprise a plurality of measurements for a plurality of analytes. The method can further comprise fitting at least one model for each analyte. Each model can provide a first estimated variability of the Omics data for its analyte. The method can comprise determining a first quality metric for each of the plurality of measurements. In some embodiments, the plurality of first quality metrics comprises one or more quality metrics. In some embodiments, and as described below, the plurality of first quality metrics can be used to compute a weight. The method can comprise calculating, based on the plurality of the first quality metrics, a composite quality metric for each of the plurality of analytes. The method can comprise fitting a function. The function can relate the composite quality metric to the first estimated variability for each analyte. The method can comprise applying the function to transform the composite quality metric and the first estimated variability for each analyte. In some embodiments, the method can comprise fitting a model to the transformed data. In some embodiments, the method can comprise obtaining, from the model, a plurality of model parameter estimates. The method can comprise determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric. The method can comprise determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability. 
The method can comprise performing variance moderation on the vector of residuals. The method can comprise reading a second dataset comprising a plurality of samples for a plurality of analytes. The method can comprise determining a second quality metric for each of the plurality of analytes. The method can comprise applying the function to transform the plurality of second quality metrics. The method can comprise applying the fitted model to the plurality of second quality metrics (e.g., the transformed second set of quality metrics) to determine a variance metric for each of the plurality of analytes in the second dataset.

[0030] In some embodiments, the method performed by the processor further comprises determining a vector of covariate between group error (ESS) for the first dataset. Fitting the model can further comprise including the covariate ESS. In some embodiments, the method performed by the processor further comprises determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

[0031] In some embodiments, the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics.

[0032] In some embodiments, the Omics data is mass spectrometry data.

[0033] In some embodiments, the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as a representation of a flux, or a read count.

[0034] In some embodiments, the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

[0035] In some embodiments, fitting the model is further based on complete data for that analyte.

[0036] In some embodiments, each of the plurality of variances is a sum of squares.

[0037] In some embodiments, at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

[0038] In some embodiments, a computer program product for modifying experiment parameters comprises a computing node comprising a computer readable storage medium having program instructions embodied therewith. The program instructions can be executable by a processor of the computing node to cause the processor to perform a method comprising obtaining the above described omics data, the omics data comprising a plurality of omics samples. The method further includes, for each of a plurality of combinations of the plurality of omics samples, calculating, by the above method, the variance metric. The method can further comprise calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics. The method can further comprise calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

[0039] In some embodiments, the method performed by the processor further comprises outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

[0040] In some embodiments, calculating a number of false positives comprises generating a first analysis of each omics sample of the plurality of omics samples, removing one of the plurality of omics samples in the plurality of combinations, thereby generating a reduced plurality of combinations, generating a second analysis of the reduced plurality of combinations, and marking an omics sample as a false positive when the second analysis indicates it is significant and the first analysis indicates it is not significant.

[0041] In some embodiments, calculating the relative sensitivity comprises calculating a ratio of (a) significant detected changes in the plurality of omics samples and significant detected changes in the reduced plurality of combinations to (b) significant detected changes in the plurality of omics samples.

BRIEF DESCRIPTION OF THE DRAWINGS

[0042] Fig. 1A is a flowchart illustrating a method according to embodiments of the present disclosure.

[0043] Fig. 1B is a flowchart illustrating a method according to embodiments of the present disclosure.

[0044] Fig. 1C is a block diagram illustrating a system according to embodiments of the present disclosure.

[0045] Fig. 2A is a graph illustrating a plot of the log of the observed variance versus the log of the average ion count at N=2.

[0046] Fig. 2B is a graph illustrating a plot of the log of the weighted variance versus the log of the average ion count at N=2. The graph was generated using ion counts as weights.

[0047] Fig. 2C is a graph illustrating analyte reweights according to embodiments of the present disclosure.

[0048] Fig. 2D is a graph illustrating residuals of variances according to embodiments of the present disclosure.

[0049] Fig. 2E is a graph illustrating a plot of DIA analyte level EB variances according to embodiments of the present disclosure.

[0050] Fig. 2F is a graph illustrating residuals of the variances.

[0051] Fig. 3A is a graph illustrating a plot of the log of the observed variance versus the log of the average weight.

[0052] Fig. 3B is a graph illustrating a plot of the log of the weighted variance versus the log of the average weight.

[0053] Fig. 4A is a graph illustrating a plot of sensitivity versus the number of samples, both for Signal Independent Variability (SIV) values and t-test values.

[0054] Fig. 4B is a graph illustrating a plot of empirical False Discovery Rate (FDR) versus the number of samples, both for Signal Independent Variability (SIV) values and t-test values.

[0055] Fig. 5A is a graph illustrating performance of the Empirical Throughput Control (ETC) system in a three-replicate analysis.

[0056] Fig. 5B is a graph illustrating performance of the Empirical Throughput Control (ETC) system in a single-replicate analysis.

[0057] Fig. 5C is a chart illustrating performance metrics of the ETC system.

[0058] Fig. 5D is a chart illustrating a false discovery rate analysis.

[0059] Fig. 6 is a flowchart illustrating experiment design using empirical throughput calibration according to embodiments of the present disclosure.

[0060] Fig. 7 is a schematic illustrating an example of a computing node according to embodiments of the present disclosure.

DETAILED DESCRIPTION

[0061] Mass spectrometry proteomics experiments are frequently performed in vitro with small numbers of technical replicates used to assess measurement variability. In these settings, measurement error can be determined by experimental attributes such as the number of ions collected in a mass spectrometer, precursor charge, mapping quality, etc., thereby creating a correlation between residual error and regression variability. By explicitly modeling these relationships, sensitivity and specificity can be improved while simultaneously reducing the number of technical replicates collected. Using embodiments of the present disclosure, improvements are reported even when reducing replicates to a single sample per group, a design that makes calculating a sample standard deviation impossible.

[0062] In vitro post-translational biology is frequently observed through mass spectrometry proteomics experiments. Many designs of these experiments tend to be relatively simple, with the vast majority focused on differential expression between a small number of experimental conditions (e.g., genotypes, drug treatments, etc.). However, one aspect of common proteomics experiments is unusual from an outside perspective: experiments often contain multiple replicates that are theoretically identical (e.g., multiple swabs from a yeast culture or different aliquots of the same cell culture). In such experiments, the trait of interest cannot be observed or measured with perfect precision. If the trait of interest could be measured with perfect precision, the researcher would never bother collecting more than one of the same sample. This is not necessarily true of studies of tumors, or any other scientific inquiry that requires sampling from a large biological population. If perfect technical precision were achievable, the vast majority of proteomics experiments would not contain any replicates.

[0063] Mass spectrometry proteomics experiments do not have perfect precision. Even in benchmarking experiments that contain no biological variability across replicates, the observed residual standard deviations can typically span multiple orders of magnitude. Accordingly, experiments include replicates of the same sample to reliably separate observed patterns from what could occur due to nothing more than technical noise. Unfortunately, this creates a tension in objectives: many researchers desire to increase the throughput of their experiments as much as possible. For example, chemoproteomics screening experiments can plausibly test 10^60 small molecules as potential therapeutics. In this setting throughput is extremely valuable, and experiments are often performed with sample sizes less than or equal to three per group. Unfortunately, throughput comes at a cost to reliability as experimental error is difficult to capture at small sample sizes.

[0064] There are challenges associated with small sample sizes. As a first example, in a t-test comparing the mean of two groups with three replicates, separate estimates of the variance in each group are often created, resulting in only two degrees of freedom to estimate each within group variance. If three Gaussian random variables are simulated with a mean of zero and a standard deviation (SD) of one, the SD estimates deviate from the true value by at least two-fold (SD < 0.5 or SD > 2) approximately 24% of the time. If the sample size is reduced to 2 per group, the percentage of estimate deviation rises to 43%, potentially resulting in large numbers of false positives and false negatives.
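
The two percentages above can be checked exactly rather than by simulation. The sketch below assumes independent Gaussian draws with a true SD of one, so that (n - 1) times the sample variance follows a chi-square distribution with n - 1 degrees of freedom. Only the closed-form chi-square distribution functions for one and two degrees of freedom are used, and the function name is illustrative rather than part of the disclosure.

```python
import math


def prob_sd_off_by_twofold(n):
    """P(s < 0.5 or s > 2) for the sample SD s of n standard normal draws.

    Supports n = 2 or n = 3, where the chi-square CDF has a simple closed form.
    """
    df = n - 1

    def chi2_cdf(x):
        if df == 1:
            return math.erf(math.sqrt(x / 2.0))
        if df == 2:
            return 1.0 - math.exp(-x / 2.0)
        raise ValueError("only n = 2 or n = 3 are supported in this sketch")

    # With true variance 1, df * s^2 is chi-square distributed with df degrees
    # of freedom, so s < 0.5 means df * s^2 < df * 0.25, and s > 2 means
    # df * s^2 > df * 4.
    too_low = chi2_cdf(df * 0.25)
    too_high = 1.0 - chi2_cdf(df * 4.0)
    return too_low + too_high


p_three = prob_sd_off_by_twofold(3)  # about 0.24
p_two = prob_sd_off_by_twofold(2)    # about 0.43
print(round(p_three, 3), round(p_two, 3))
```

Most of the probability mass comes from underestimating the SD, which is the direction that inflates test statistics and produces false positives.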

[0065] One strategy for dealing with variance in small samples, empirical Bayesian variance moderation, was popularized in the context of analyzing microarray data. The basic idea is that performance can improve if one could 'borrow' information across molecules, because estimating variance with a small number of samples is difficult. 'Empirical Bayes' implies that the prior distribution on the variances is estimated from each dataset, and the vector of variances (one variance for each gene) is moderated in the sense that extreme values are pulled towards the center of the prior distribution.
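
The shrinkage idea can be illustrated with the weighted-average form commonly used for moderated variances, in which each observed variance is combined with a prior variance in proportion to their degrees of freedom. This is a minimal sketch and not the method of the disclosure: the prior degrees of freedom are a hand-picked assumption, and the prior variance defaults to the plain mean of the observed variances as a crude stand-in for a properly estimated prior.

```python
def moderate_variances(sample_variances, df_resid, prior_df=4.0, prior_var=None):
    """Shrink per-analyte variances towards a common prior variance.

    sample_variances: observed variances, one per analyte.
    df_resid: residual degrees of freedom behind each observed variance.
    prior_df, prior_var: prior parameters (assumed here, not estimated by
    a full empirical Bayes procedure).
    """
    if prior_var is None:
        # Crude stand-in for an estimated prior: the average observed variance.
        prior_var = sum(sample_variances) / len(sample_variances)
    return [
        (prior_df * prior_var + df_resid * s2) / (prior_df + df_resid)
        for s2 in sample_variances
    ]


observed = [0.01, 0.5, 1.0, 2.5, 9.0]
moderated = moderate_variances(observed, df_resid=2)
for before, after in zip(observed, moderated):
    print(f"{before:6.2f} -> {after:6.3f}")
```

The smallest variance is pulled up and the largest is pulled down, which is the behavior that becomes harmful when variance depends systematically on signal.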

[0066] Subsequent work in genomics showed that variance moderation in the presence of heteroskedasticity could have negative consequences. In genomics, residual variance in each gene can be a function of the average signal. Similarly, heteroskedasticity is one of the properties of proteomics data. For example, the use of weighted models can be helpful for controlling variation across multiple batches of isobaric data. In the presence of heteroskedasticity, variance moderation strategies risk systematically reducing the variance of poor quality measurements (e.g., low signals have above average variance) while increasing the variance of those with the highest quality measurements. For this reason, some solutions model variance as a function of the average signal quality, thereby creating a signal-to-variance curve, and then perform moderation after adjusting for the observed dependence on signal magnitude. A modification of this strategy, called Voom, uses the signal-variance curve not primarily for moderation, but to create weights for subsequent use in linear modeling.

[0067] Signal dependent variance moderation strategies are powerful tools for small sample size omics experiments, yet signal dependence is not the only predictable property that determines technical variability. Another underlying property that makes signal dependent variance moderation useful is heteroskedasticity. When looking at each molecule independently, weights can be used to account for this variance. Employing a strategy such as Voom can help to properly calibrate those weights. However, the question of heteroskedasticity across compounds such as proteins can be more complex. If all the measurements for an analyte (e.g., a protein analyte) are poor, then weights within a linear model do not account for the systematically poor measurements. There are two implications of this property that should be understood.

[0068] The first implication is that the systematic decrease in measurement quality across analytes can be captured with a sample standard deviation, but only in large sample sizes. In small sample sizes, using variance moderation even after using weighted models could prove beneficial.

[0069] The second implication is that residual and explained variance are no longer independent. Many tools in statistics are based on a decomposition of variance into the variability of the estimates and the variability of the residuals. For example, an F-test is based on the ratio of the sum of the squared deviations explained by a model (ESS) and the sum of the squared residual values (RSS). Geometrically, estimates in linear models are an orthogonal projection from the data onto a column space spanned by a design matrix (e.g., high dimensional data is being projected onto a mathematical space that can be understood). Since this projection is orthogonal, in a standard Gaussian model, the ESS and RSS are independent statistics. However, systematically altered variance components create an association between ESS and RSS. This dependency is readily verifiable with simulated data and the association can be seen in any omics dataset with signal dependent variation.
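
A small simulation can illustrate the dependency described in paragraph [0069]. The sketch assumes a two-group design with three replicates per group and no true group difference, and it compares analytes that all share one SD with analytes whose SD varies over a wide range. The function names, the SD range, and the seed are illustrative assumptions.

```python
import math
import random


def one_way_sums_of_squares(groups):
    """Return (ESS, RSS): between-group and within-group sums of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ess, rss = 0.0, 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ess += len(g) * (group_mean - grand_mean) ** 2
        rss += sum((v - group_mean) ** 2 for v in g)
    return ess, rss


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_log_ss_correlation(n_analytes, log_sd_range, rng, n_groups=2, n_reps=3):
    """Correlation of log ESS and log RSS across simulated null analytes.

    Each analyte gets its own SD, drawn log-uniformly from
    exp(-log_sd_range) to exp(+log_sd_range); a range of 0 gives a shared SD of 1.
    """
    log_ess, log_rss = [], []
    for _ in range(n_analytes):
        sd = math.exp(rng.uniform(-log_sd_range, log_sd_range))
        groups = [[rng.gauss(0.0, sd) for _ in range(n_reps)] for _ in range(n_groups)]
        ess, rss = one_way_sums_of_squares(groups)
        log_ess.append(math.log(ess))
        log_rss.append(math.log(rss))
    return pearson(log_ess, log_rss)


rng = random.Random(0)
corr_homoskedastic = simulate_log_ss_correlation(2000, 0.0, rng)
corr_heteroskedastic = simulate_log_ss_correlation(2000, 1.5, rng)
print(corr_homoskedastic, corr_heteroskedastic)
```

With a shared SD the correlation should sit near zero, while analyte-specific SDs induce a clearly positive correlation, which is why ESS carries usable information about technical variance.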

[0070] The above principles of heteroskedasticity have implications for high throughput screening experiments. In this setting, sample sizes per group are typically small (N < 3) and there is generally no biological variation in the samples, and therefore all of the error is technical. Statistics are only used to separate signals from technical noise and this task is made difficult by the small sample sizes. However, when the consequences of heteroskedasticity are understood, both ESS and signal magnitude can be used to supplement or even entirely replace sample standard deviations.

[0071] In some embodiments, a method and corresponding system corrects for additional sources of technical error and expands upon variance moderation. In some embodiments, this method and system is used to enable statistical modeling based purely on prior information about technical variance. In addition, the method can include an approach for artificially reducing sample size to perform a cost benefit analysis on the consequences of reducing technical replicates from any agreed upon standard.

[0072] Fig. 1A is a flowchart 100 illustrating a method according to embodiments of the present disclosure. In some embodiments, the method includes reading an experiment design function, and reading Omics data (102). The Omics data can result from executing the experiment design (e.g., performing mass spectrometry on an analyte, sequencing RNA, performing a proteomics experiment, etc.). The Omics data can comprise a plurality of measurements for a plurality of analytes. The method includes fitting at least one model for each analyte (104). Each of the at least one model can provide a first estimated variability of the mass spectrometry data for its analyte (e.g., each of the at least one model is local relative to its analyte). The method further can include determining a first quality metric for each of the plurality of measurements (106). The method can further include calculating a composite quality metric for each of the plurality of analytes (108). The method can include fitting a function, where the function relates the composite quality metric to the first estimated variability for each analyte (110). The method can include applying at least one function to transform the composite quality metric and the first estimated variability for each analyte (112). In some embodiments, the function is a linearization. The method can include fitting a model to the transformed data (114). The model fit to the transformed data can be a model of all or multiple analytes represented by the at least one model (e.g., a global model relative to the analytes). The method can include obtaining, from the model, a plurality of model parameter estimates (116). The method can include determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric (118). The method can include determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability (120). The vector of residuals can be the average within group sum of squares (RSS), or the average explained sum of squares (ESS) (e.g., between group sum of squares), or both. The method can include performing variance moderation on the vector of residuals (122).

[0073] In some embodiments, and with reference to Fig. 1B, the method can additionally and optionally comprise determining a subset of the plurality of analytes which were not obtained by the Omics data (132). The method can further comprise generating measurements for each of the subset of the plurality of analytes (132). In some embodiments, the method of Fig. 1B can additionally and optionally comprise aggregating the plurality of measurements and estimated variabilities from the protein analyte level to the complete protein level, the complete protein comprising its full set of analytes (134).

[0074] Returning to Fig. 1A, the method can include reading a second dataset comprising a plurality of samples for a plurality of analytes (124). The method can further include determining a second quality metric for each of the plurality of analytes (126). The method can further include applying the at least one function to transform the second quality metric (128). The method can further include applying the fitted model to determine a variance metric for each of the plurality of analytes in the second dataset (130). In some embodiments, the plurality of model parameter estimates are applied to the plurality of second quality metrics to determine the variance metric.

[0075] Fig. 1B is a block diagram 150 illustrating a system according to embodiments of the present disclosure. An experimental design 152 and Omics data 154 are provided to a model generator 156. The model generator 156 can be a computer, such as a server, with a processor that operates a machine-learning training method to output one or more trained machine learning models. The Omics data 154 can include RNA sequencing data (e.g., RNA-Seq), metabolomics data, proteomics data, or mass spectrometry data. Likewise, the Omics data 154 can be received from a device (not shown) for measuring such data, such as an RNA sequencing machine, a mass spectrometer, etc. The Omics data 154 can further include measurements for multiple analytes (e.g., analyte_1, analyte_2, ..., analyte_n) and for multiple samples of each analyte (e.g., sample_1, sample_2, ..., sample_m). As shown in Fig. 1B, each sample is represented by a column and each analyte is represented by a row, although other orientations and organizations of the Omics data 154 can be used. The Omics data 154 can include, for each analyte, measurements for some or all of the samples. The Omics data 154 can include, for each sample, measurements for some or all of the analytes.

[0076] The model generator 156 generates multiple models 157a-1-157n-2 for each analyte. Each respective model 157a-1-157n-2 generates a measure of variability for its respective analyte, the variability being based on the variance of the measurements in each sample measured for that analyte. The model generator can generate multiple models for each respective analyte, e.g., Models 157a-1-157a-2 for analyte_1, Models 157b-1-157b-2 for analyte_2, and Models 157n-1-157n-2 for analyte_n. Likewise, the outputs of these models for each analyte can be used to generate a quality metric for each measurement (e.g., thereby resulting in a set of quality metrics), and a composite quality metric 160 therefrom. In addition, the models can generate a variability for each analyte 158a-n. A function fitter 162 can receive the composite quality metric 160 and variability for each analyte 158a-n, and generate a transform function 164 that can transform the composite quality metric to the estimated variability for each analyte. When applied, that function generates a model fit to the transform function 166. The model 166 can output variability for each analyte 168a-n. A residual calculator 170 can then calculate a vector of residuals 172 based on the variability for each analyte 158a-n and variability for each analyte 168a-n. A variance modulation calculator 174 can then output a variance metric based on the vector of residuals 172.

[0077] The system can further generate a variance metric for a second set of Omics data 180 (e.g., a smaller set of data). In Fig. 1B, the data processing of the first set of Omics data 154 is illustrated by the solid blocks and lines, and the data processing of the second set of Omics data 180 is illustrated by the dashed blocks and lines.

[0078] The second set of Omics data 180 is provided to the model generator 156. In some embodiments, the second set of Omics data comprises data having at least a partial overlap of analytes with the first set of Omics data. In some embodiments, the second set of Omics data comprises data having analytes that are mutually exclusive to the analytes of the first set of Omics data. The model generator can generate a second plurality of models for each analyte of the second set of Omics data (not shown). These models can be used to generate a set of quality metrics 182 for the second set of Omics data 180. The transform function 164 can transform the set of quality metrics, and the fit model 166 can be applied to determine a variance metric for each of the plurality of analytes in the second dataset.

[0079] As such, a relationship between quality and variance can be determined from a first dataset. Then, the system can apply that relationship to a second dataset where only quality, but not variance, can be determined directly.

[0080] In some embodiments, the model generator can generate a model of “dataset A” to determine a first relationship, such as between observation(s) of ion count and the variance, while dataset B includes observations of ion count without variance. In this example, the relationship between observations of ion count and variance of dataset A is applied to the observations of ion count in dataset B, thereby deriving a new estimate of variability.
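The dataset A to dataset B transfer can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the disclosed implementation: it assumes a straight-line trend between log ion count and log variance, and the function names `fit_log_log_trend` and `predict_variance` and the numeric values are hypothetical.

```python
import math


def fit_log_log_trend(ion_counts, variances):
    """Least-squares line through (log ion count, log variance) from dataset A."""
    x = [math.log(c) for c in ion_counts]
    y = [math.log(v) for v in variances]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


def predict_variance(trend, ion_counts):
    """Apply the dataset A trend to ion counts observed in dataset B."""
    intercept, slope = trend
    return [math.exp(intercept + slope * math.log(c)) for c in ion_counts]


# Dataset A: ion counts with observed variances (toy values, variance = 100 / count).
counts_a = [10.0, 100.0, 1000.0, 10000.0]
variances_a = [100.0 / c for c in counts_a]
trend = fit_log_log_trend(counts_a, variances_a)

# Dataset B: ion counts only, no replicate-based variance available.
predicted_b = predict_variance(trend, [50.0, 500.0])
```

A richer curve (for example a quadratic in log ion count) could be substituted for the straight line without changing the overall flow of fitting on one dataset and predicting on the other.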

[0081] In some embodiments, the model can use the determined relationship between ESS and variability to determine variability. If so, the model generator can generate a model of the relationship between ESS and variability in a first dataset (e.g., “dataset A”). The model generator can then fit a model to each analyte from a second dataset (e.g., “dataset B”) to derive ESS. Then, the modeled or discovered relationship from the first dataset (e.g., “dataset A”) is combined with the model of each analyte from the second dataset (e.g., “dataset B”) to estimate variability.

[0082] To explore the tradeoffs in performance and throughput in experiments with minimal biological variation, a benchmarking dataset was analyzed. The benchmarking dataset has no biological variation at all, but has known changes. The Ratio Expansion Data is a two-species dilution model where human proteins are all at 1:1 ratios and the yeast proteome was diluted at levels of 1x, 2x, 4x, 8x, and 16x. Within the design, each of these dilution groups has three technical replicates. Additionally, advantages of Empirical Throughput Control (ETC) are explored using publicly available chemoproteomics data.

[0083] Both experiments were analyzed, as described in the methods, three times while artificially reducing the sample size from N=3 per group to N=1. At N=3 and N=2 the primary mechanism for dealing with analyte heteroskedasticity is variance moderation, where the moderation pulls values towards the predicted level of variability.

[0084] Fig. 2A is a graph 200 illustrating a plot of the log of the observed variance versus the log of the average ion count at N=2 in a two-species dilution model. The graph 200 illustrates a clear relationship between overall abundance and observed variability.

[0085] Fig. 2B is a graph 202 illustrating a plot of the log of the observed variance versus the log of the average ion count at N=2. The graph 202 was generated using ion counts as weights. The graph 202 illustrates that the naive weighting scheme does not eliminate the signal dependency in the data.

[0086] The variance versus signal relationship seen in Figs. 2A-2B was used to calculate a set of weights to be applied to each observation (Figs. 3A-3B).

[0087] Fig. 2C is a graph 210 illustrating analyte reweights according to embodiments of the present disclosure. Graph 210 illustrates weighted log variances as a function of a logarithm of average weights.

[0088] Fig. 2D is a graph 220 illustrating residuals of variances according to embodiments of the present disclosure. Graph 220 illustrates re-weighted log residual variances as a function of a logarithm of average weights.

[0089] Fig. 2E is a graph 230 illustrating a plot of DIA analyte level EB variances according to embodiments of the present disclosure. The graph 230 illustrates weighted residual variances. The graph 230 illustrates logarithms of residual variance as a function of logarithms of average weight.

[0090] Fig. 2F is a graph 240 illustrating residuals of the variances. The graph 240 illustrates logarithms of residual value as a function of average fluence.

[0091] Fig. 3A is a graph 300 illustrating a plot of the log of the observed variance versus the log of the average weight.

[0092] Fig. 3B is a graph 302 illustrating a plot of the log of the weighted variance versus the log of the average weight. Graph 302 illustrates that the signal dependence has been removed from the overall structure of the variance. Variance moderation is performed on these signal independent variances and the moderated variances are used to replace estimated residual error in each linear model.

[0093] Fig. 4A is a graph 400 illustrating a plot of sensitivity versus the number of samples, both for Signal Independent Variability (SIV) values 402 and t-test values 404. Empirical Throughput Calibration (ETC) was performed; however, in this case it is known which answers are correct based on species, so the assignment of true and false positives was instead determined using this information. Performance was assessed separately for each magnitude of change since changes in sensitivity are a function of effect size. For a two-fold decrease in signal (8x versus 16x) in the SIV values 402, the reduction in sample size from N=3 to N=1 is associated with only a ~5% drop in sensitivity. The same cannot be said for the t-test values 404 on the same data, which show a collapse in sensitivity when going from N=3 to N=2, and t-tests cannot be done at all with only a single replicate per group.

[0094] In some embodiments, a change in sensitivity can be considered significant when it is a change of an order of magnitude. In some embodiments, a change in sensitivity can be a 2X, 4X, 8X, or 16X change in sensitivity.

[0095] Fig. 4B is a graph 450 illustrating a plot of empirical False Discovery Rate (FDR) versus the number of samples, both for Signal Independent Variability (SIV) values 452 and t-test values 454. In the benchmarking data, human proteins can be used to evaluate false discovery rates, which is helpful when comparing method sensitivity since increases to sensitivity often result in higher false positives. However, in this experiment, at an FDR adjusted p-value of 0.01, the empirical false discovery rate (e.g., the ratio of the number of significant human proteins to total number of significant proteins) remained under control (< 1% at all sample sizes).

[0096] Chemoproteomics Data

[0097] To evaluate the ETC system in a real-world setting, an experiment was run to analyze chemoproteomics data from SHSY5Y cells treated with five distinct drugs (e.g., Ibrutinib, THZ-1, ARS1620, Sulfopin, and DMF). Mass spectrometry data were processed using Crux, with the experimental design incorporating three DMSO controls and three replicates per drug treatment. The ETC framework was employed to model technical variability as a quadratic function of both ion count and explained sum of squares (ESS). A reduced dataset comprising one replicate per treatment group was generated to assess the system’s performance with minimal replication. This design enabled direct comparison between conventional three-replicate analysis and our single-replicate approach. Importantly, while residual variance cannot be estimated in single-replicate experiments, the ETC leverages total sum of squares (TSS) and ion count-based weights to establish variance priors based on relationships learned from the complete dataset.

[0098] Fig. 5A is a graph 500 illustrating performance of the Empirical Throughput Control (ETC) system in a three-replicate analysis. Graph 500 is a volcano plot comparing drug treatments to DMSO controls using three replicates. The labels 502a-e indicate the five most significant peptides identified in the three-replicate analysis, demonstrating their consistent detection in single-replicate data as shown by labels 522a-e in Fig. 5B.

[0099] Fig. 5B is a graph 520 illustrating performance of the Empirical Throughput Control (ETC) system in a single-replicate analysis. Graph 520 is a volcano plot comparing drug treatments to DMSO controls using a single replicate. The labels 522a-e indicate the five most significant peptides identified in the single-replicate data. In graphs 500 and 520 of Figs. 5A-B, the volcano plot visualization for substantial decreases in abundance (those exceeding a log2 fold change of -1.5, e.g., corresponding to a greater than 65% reduction) confirms this effect size-dependent performance. The five most significant peptide changes identified in the three-replicate analysis maintain their significance in single-replicate data, demonstrating reliable detection of substantial biological changes despite reduced replication. These results show that the ETC framework effectively preserves statistical power for meaningful effect sizes while substantially reducing experimental overhead. This capability could significantly impact experimental design in high-throughput proteomics, enabling broader biological investigations within fixed resource constraints.

[0100] Fig. 5C is a chart 540 illustrating performance metrics of the ETC system. The chart 540 illustrates a sensitivity analysis stratified by effect size, which shows nearly perfect recovery of large changes (log2 fold change < -2) and robust detection of moderate changes using single replicates.

[0101] Fig. 5D is a chart 560 illustrating a false discovery rate analysis. The chart 560 and the results therein confirm controlled error rates across effect size categories, remaining well below the nominal threshold of 0.01 for substantial changes.

[0102] The performance analysis illustrated by Figs. 5A-D revealed a clear relationship between statistical power and effect magnitude on the log2 scale.

[0103] Methods

[0104] The process of optimizing throughput starts by modeling variability in a setting with a sufficiently large sample size to see the underlying relationships. The sample size is then artificially reduced, using the estimated relationships as a prior, when needed.

[0105] Analysis of Heteroskedastic Structure

[0106] In some embodiments, preprocessed data that is ready for statistical analysis is obtained as a starting point (see, e.g., 102 of Fig. 1A). For example, in proteomics data a mass spectrometer may generate .RAW files that can then be searched and quantified in software such as Proteome Discoverer or MaxQuant. Preprocessing software can further perform column normalizations or isotopic adjustments. The data can contain intensities for the analyte in rows (protein, peptide, metabolite, or RNA) and different samples in each column.

[0107] For each analyte (e.g., indexed by i in a set of analytes 1,..., I), a linear model is fit for the i-th analyte. In some embodiments, the linear model is fit using only complete data (e.g., data without missing values) (see, e.g., 104 of Fig. 1A). The model can correspond to an experimental design. For example, the model can group samples by drug treatments or each sample can have associated metadata (e.g., age, sex, BMI, etc.). In some embodiments, software such as msTrawler can automatically convert the metadata into a design matrix for use in a linear model, or a user can do this manually.

[0108] In some embodiments, for each model, the average within group sum of squares (RSS) and the average explained sum of squares (ESS) (e.g., between group sum of squares) are recorded (see, e.g., 106 of Fig. 1A). Additionally, an average weight of each observation can be recorded, where the weight may be an ion count, intensity, precursor charge, mapping quality, etc., depending on the data type (see, e.g., 108 of Fig. 1A). The two average sum-of-squares calculations come from the standard variance decomposition used in any F-test. For the i-th analyte, the within group error is the RSS, the between group error is the ESS, and the average weight is referred to as $W_i$.
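The per-analyte bookkeeping described above can be sketched as a one-way variance decomposition. The Python example below is a minimal illustration, assuming a simple grouped design with at least one replicated group; the function name `variance_decomposition` and the toy values are hypothetical and not taken from the disclosure.

```python
def variance_decomposition(values, groups, weights):
    """Return (average RSS, average ESS, average weight) for one analyte.

    Assumes a one-way grouped design with more observations than groups,
    so the within-group degrees of freedom are positive.
    """
    labels = sorted(set(groups))
    grand_mean = sum(values) / len(values)
    rss = 0.0
    ess = 0.0
    for label in labels:
        member = [v for v, g in zip(values, groups) if g == label]
        mean = sum(member) / len(member)
        # Within-group squared deviations accumulate into RSS.
        rss += sum((v - mean) ** 2 for v in member)
        # Group mean deviations from the grand mean accumulate into ESS.
        ess += len(member) * (mean - grand_mean) ** 2
    df_within = len(values) - len(labels)
    df_between = len(labels) - 1
    avg_weight = sum(weights) / len(weights)
    return rss / df_within, ess / df_between, avg_weight


avg_rss, avg_ess, avg_weight = variance_decomposition(
    values=[1.0, 3.0, 5.0, 7.0],
    groups=["control", "control", "treated", "treated"],
    weights=[10.0, 20.0, 30.0, 40.0],
)
```

Dividing each sum of squares by its degrees of freedom gives the two mean squares whose ratio is the usual F statistic, which is why the paragraph ties these quantities to the F-test decomposition.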

[0109] In some embodiments, a function such as a logarithm can be taken to scale the weights linearly (see, e.g., 110 and 112 of Fig. 1A). It can be recognized that other functions can be employed and the function does not have to be a logarithm, but that the above function is exemplary. A linear model is fit to the vector of log residual errors, where the explanatory variables are functions of the weights and the ESS (see, e.g., 114 of Fig. 1A). As an example, this linear model can include covariates of the log of the average weight, the log of the average weight squared, and the estimate of the between sample variance, which is the ESS divided by the relevant degrees of freedom, so that the predicted log residual error takes the form $\beta_0 + \beta_1 \log(W_i) + \beta_2 \log(W_i)^2 + \dots$
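The trend fit described above can be sketched with ordinary least squares. The following Python example is an illustration under stated assumptions: the between sample variance is entered as a single linear covariate (the disclosure may use a different functional form), the helper names `solve_linear_system` and `fit_variance_trend` are hypothetical, and the data are synthetic.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_variance_trend(avg_rss, avg_weight, between_var):
    """Regress log residual variance on log weight, its square, and ESS/df."""
    rows = [
        [1.0, math.log(w), math.log(w) ** 2, b]
        for w, b in zip(avg_weight, between_var)
    ]
    y = [math.log(r) for r in avg_rss]
    k = 4
    # Normal equations: (X'X) coef = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic analytes generated from known coefficients (assumed values).
weights = [10.0, 100.0, 1000.0, 10000.0, 50.0, 500.0]
between = [0.5, 1.0, 0.2, 2.0, 1.5, 0.1]
true_coefs = [1.0, -0.5, 0.1, 0.2]
rss = [
    math.exp(
        true_coefs[0]
        + true_coefs[1] * math.log(w)
        + true_coefs[2] * math.log(w) ** 2
        + true_coefs[3] * b
    )
    for w, b in zip(weights, between)
]
coefs = fit_variance_trend(rss, weights, between)
```

In practice a numerical library would normally replace the hand-written solver; it is written out here only to keep the sketch free of external dependencies.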

[0110] The method can then create a weight function from the predicted curve (see, e.g., 116 of Fig. 1A). In various non-limiting embodiments, an approach similar to Voom, in which the calibrated weight for an observation weight $x$ is $1 / \exp(\beta_0 + \beta_1 \log(x) + \beta_2 \log(x)^2)$, is used to calculate calibrated weights for each observation weight in the dataset. It can be appreciated that other approaches to create the weight function can be employed.
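A weight function of this kind can be sketched in a few lines. The Python example below assumes the calibrated weight is the reciprocal of the exponentiated quadratic trend in the log of the observation weight; the function name `calibrated_weight` and the coefficient values are hypothetical.

```python
import math


def calibrated_weight(x, b0, b1, b2):
    """Inverse of the predicted variance at observation weight x (assumed form)."""
    log_x = math.log(x)
    return 1.0 / math.exp(b0 + b1 * log_x + b2 * log_x ** 2)


# With a slope of -1 on the log scale, predicted variance falls as 1 / x,
# so the calibrated weight grows in proportion to x.
low_signal = calibrated_weight(10.0, 0.0, -1.0, 0.0)
high_signal = calibrated_weight(1000.0, 0.0, -1.0, 0.0)
```

Observations with higher predicted variance receive proportionally lower weight, which is the same inverse-variance logic used for precision weights in weighted regression.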

[0111] In some embodiments, the method centers the new weights at the median value of the weights and each linear model is re-fit using weighted linear regression (see, e.g., 118 of Fig. 1A). The variances and average weights can be recorded as before. The new residual standard deviation can provide the expected amount of error for an observation with the median calibrated weight (see, e.g., 120 of Fig. 1A). This can be referred to as the Signal Independent Variability ($\mathrm{SIV}_i$).
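The weighted re-fit can be sketched for a simple grouped design. This Python example is an illustration under stated assumptions: centering is interpreted as dividing every calibrated weight by the median weight, the linear model is a one-way group-means model, and the function name `signal_independent_variance` is hypothetical. It returns a variance; its square root is the residual standard deviation mentioned above.

```python
def signal_independent_variance(values, groups, weights, median_weight):
    """Weighted residual variance after scaling weights so the median is 1."""
    scaled = [w / median_weight for w in weights]
    labels = sorted(set(groups))
    weighted_rss = 0.0
    for label in labels:
        idx = [i for i, g in enumerate(groups) if g == label]
        total_w = sum(scaled[i] for i in idx)
        # Weighted group mean is the weighted least-squares fit for this design.
        mean = sum(scaled[i] * values[i] for i in idx) / total_w
        weighted_rss += sum(scaled[i] * (values[i] - mean) ** 2 for i in idx)
    return weighted_rss / (len(values) - len(labels))


values = [1.0, 3.0, 5.0, 7.0]
groups = ["control", "control", "treated", "treated"]
# Every observation at the median weight reproduces the unweighted variance.
at_median = signal_independent_variance(values, groups, [2.0] * 4, median_weight=2.0)
# Observations at twice the median weight double the scaled residual variance.
above_median = signal_independent_variance(values, groups, [4.0] * 4, median_weight=2.0)
```

Because the weights are expressed relative to the median, the returned value is on the scale of an observation carrying the median calibrated weight.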

[0112] The method fits the model (e.g., from step 3) to the log of the SIV values, $E(\log(\mathrm{SIV}_i)) = \beta_0^* + \beta_1^* \log(W_i^*) + \beta_2^* \log(W_i^*)^2 + \dots$

[0113] where * indicates that the statistic was generated after calibration of the weights. Note that $\beta_1^*$ and $\beta_2^*$ should be near zero because well calibrated weights remove the signal dependency. From these new residuals the prior degrees of freedom $d_0$ are calculated, as shown previously.

[0114] The method can perform variance moderation against the predicted SIV for each analyte, as done previously (see, e.g., 122 of Fig. 1A). The output of this method can be a new, moderated, variance for every analyte along with an updated degrees of freedom for the combination of prior information along with the current sample size.
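The moderation step can be sketched as a degrees-of-freedom weighted average. The Python example below assumes the conventional empirical Bayes shrinkage form used in limma-style variance moderation, which may differ in detail from the form defined earlier in this disclosure; the function name `moderate_variance` and the numbers are hypothetical.

```python
def moderate_variance(observed_var, predicted_var, df_residual, d0):
    """Shrink an observed variance toward its predicted value.

    Returns the moderated variance and the combined degrees of freedom.
    Assumed form: (d0 * predicted + df * observed) / (d0 + df).
    """
    total_df = d0 + df_residual
    moderated = (d0 * predicted_var + df_residual * observed_var) / total_df
    return moderated, total_df


# An analyte with observed variance 4.0 on 2 residual degrees of freedom,
# a predicted SIV of 1.0, and prior degrees of freedom d0 = 2.
moderated, total_df = moderate_variance(4.0, 1.0, df_residual=2, d0=2)
```

With a large d0 the result is dominated by the predicted value, and with a large residual degrees of freedom it is dominated by the observed value, which is the behavior variance moderation is meant to provide in small samples.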

[0115] Statistical analyses can be performed using the moderated variance in place of the observed residual variance (see, e.g., 126, 128, and 130 of Fig. 1A).

[0116] Having estimated the key relationships on an experiment with a reasonable sample size, these relationships can be employed in experiments with less available information. The consequences of the data loss can be evaluated by artificially reducing the size of the data to evaluate the performance versus throughput trade off and examine the consequences.

[0117] Empirical Throughput Calibration

[0118] Fig. 6 is a flowchart 600 illustrating experiment design using empirical throughput calibration according to embodiments of the present disclosure.

[0119] In some embodiments, the method obtains Omics data comprising a plurality of Omics samples (602).

[0120] In some embodiments, the method can then calculate the variance metric, as described above, for Omics data having at least two combinations of the Omics samples (604). For example, in some embodiments, the complete dataset (or datasets, if available) can be analyzed with N samples per condition, as above. The model parameters $\beta_0$, $\beta_1$, $\beta_2$ are recorded as well as $d_0$. In some embodiments, the method can set a cutoff on significance and record a vector of strings indicating the significant analytes for one or more hypothesis tests. In some embodiments, for j in 2,..., N-1, the method can perform an analysis of Heteroskedastic Structure, using j samples from the original N. In some embodiments, the method can calculate a variance moderation for each j in 2,..., N-1. In some embodiments, the method can calculate a variance moderation for any number of samples of j in 2,..., N-1. In some embodiments, the method can calculate a variance moderation for multiple combinations of samples for each value of j. For example, where N = 6 and j = 3, the method can use samples 1, 2, and 5, samples 4, 5, and 6, or any combination and permutation of the samples for j = 3.
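The enumeration of reduced sample sets can be sketched with the Python standard library. This example is an illustration only; the function name `replicate_subsets` is hypothetical, and samples are numbered from 1 to match the example in the text.

```python
from itertools import combinations


def replicate_subsets(n_total, j):
    """All ways of keeping j of the original n_total replicates."""
    return list(combinations(range(1, n_total + 1), j))


# For N = 6 and j = 3 there are 20 candidate subsets, including
# samples (1, 2, 5) and samples (4, 5, 6).
subsets = replicate_subsets(6, 3)
```

Each subset can then be passed through the same analysis as the full data, and the results averaged or compared across subsets of the same size.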

[0121] In some embodiments, for j = 1, the method can perform the data analysis using $\beta_0$, $\beta_1$, $\beta_2$ along with the observed weights. If $\alpha_1$ and $\alpha_2$ are present in the model, and variance cannot be estimated, then the method can utilize the TSS method, as described further below, to obtain new variance estimates.

[0122] In some embodiments, for all N-1 analyses, and whatever partitions of effect size are of interest, the size of the intersection between the significant analytes in the smaller dataset and the significant list from the full data is taken to be the number of true positives in dataset j ($\mathrm{TP}_j$). The size of the set difference in significant analytes (e.g., the number of analytes significant in the j-th analysis but not in the full analysis) is considered to be the number of false positives ($\mathrm{FP}_j$) (606).

[0123] In some embodiments, the method calculates relative (relative to performance in the full data) sensitivity and a false discovery rate (FDR) for each of the j subsets of data, where Sensitivity = $\mathrm{TP}_j / \mathrm{TP}_N$ and FDR = $\mathrm{FP}_j / (\mathrm{TP}_j + \mathrm{FP}_j)$.
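The relative sensitivity and FDR can be computed directly from sets of significant analytes. The Python sketch below is a minimal illustration that treats the full-data significant list as the reference; the function name `relative_performance` and the analyte labels are hypothetical.

```python
def relative_performance(full_hits, reduced_hits):
    """Relative sensitivity and FDR of a reduced analysis versus the full data."""
    full, reduced = set(full_hits), set(reduced_hits)
    true_pos = len(reduced & full)   # significant in both analyses
    false_pos = len(reduced - full)  # significant only in the reduced analysis
    sensitivity = true_pos / len(full) if full else float("nan")
    called = true_pos + false_pos
    fdr = false_pos / called if called else 0.0
    return sensitivity, fdr


sensitivity, fdr = relative_performance(
    full_hits=["P1", "P2", "P3", "P4"],
    reduced_hits=["P1", "P2", "P3", "P9"],
)
```

Here the full analysis plays the role of ground truth, so both numbers describe agreement with the full data rather than agreement with an external standard.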

[0124] In some embodiments, the method reports the relative changes to sensitivity and false discovery rates, along with other summary statistics to inform decisions about throughput (608).

[0125] Variance Estimation in Saturated Models

[0126] For saturated models (e.g., models having degrees of freedom equal to the number of parameters) traditional variance decomposition is not possible because there are no replicates to use for measuring residual variance. In these cases, the method only observes the total sum of squared errors (TSS) and the average weight. However, with the prior relationships established, this is still enough data to derive an estimate of RSS.

[0127] In some embodiments, the variance structure follows a log-linear relationship with weights: $\log(V_w) = \beta_0 + \beta_1 \log(w) + \beta_2 \log(w)^2$ (1), where $V_w$ is the weight-predicted variance and $w$ is the mean of observed weights.

[0128] The total variance ($V_T$) can be linked to residual variance through the sum of squared errors: $\mathrm{TSS} = \mathrm{RSS} + \mathrm{ESS}$. For a dataset with $p$ observations and prior weight $d_0$, the total variance can be calculated as follows: $(d_0 + p)V_T = pV_E + d_0 V_w \exp(\alpha_1 \sqrt{V_E}) \exp(\alpha_2 V_E)$.

[0129] The method then returns $V_w$ as the estimate if the weight-predicted variance $V_w$ exceeds the observed total variance $V_T$. The method otherwise solves for $V_E$ by minimizing: $\left| pV_E + d_0 V_w \exp(\alpha_1 \sqrt{V_E}) \exp(\alpha_2 V_E) - (d_0 + p)V_T \right|$.
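The solve for the explained variance can be sketched with a one-dimensional search. The Python example below is an illustration under stated assumptions, not the disclosed implementation: it assumes the balance relation $(d_0 + p)V_T = pV_E + d_0 V_w \exp(\alpha_1 \sqrt{V_E}) \exp(\alpha_2 V_E)$, uses a golden-section search that presumes the objective has a single minimum on the search interval, and the function name `solve_explained_variance` and all numeric values are hypothetical.

```python
import math


def solve_explained_variance(v_t, v_w, p, d0, a1=0.0, a2=0.0, iterations=200):
    """Return ('weight_predicted', V_w) or ('explained', V_E)."""
    if v_w > v_t:
        # Weight-predicted variance already exceeds the total: use it directly.
        return "weight_predicted", v_w
    target = (d0 + p) * v_t

    def objective(v_e):
        model = p * v_e + d0 * v_w * math.exp(a1 * math.sqrt(v_e) + a2 * v_e)
        return abs(model - target)

    # V_E cannot exceed target / p because the second term is non-negative.
    lo, hi = 0.0, target / p
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iterations):
        m1 = hi - ratio * (hi - lo)
        m2 = lo + ratio * (hi - lo)
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return "explained", (lo + hi) / 2.0


# With a1 = a2 = 0 the relation is linear: 4 * V_E + 2 * 1 = 6 * 3, so V_E = 4.
kind, v_e = solve_explained_variance(v_t=3.0, v_w=1.0, p=4, d0=2)
fallback = solve_explained_variance(v_t=1.0, v_w=2.0, p=4, d0=2)
```

When the coefficients on the explained-variance terms are negative the objective may have more than one local minimum, in which case a grid search or a root bracket would be a safer choice than the golden-section search used here.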

[0130] Finally, the method can calculate residual variance: $V_R = \frac{(d_0 + p)V_T - pV_E}{d_0}$

[0131] This approach ensures proper variance estimation while accounting for the weight-dependent structure of measurement precision in proteomics data.

[0132] Fig. 7 is a schematic illustrating an example of a computing node according to embodiments of the present disclosure. Computing node 10 is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments described herein. Regardless, computing node 10 is capable of being implemented and/or performing any of the functionality set forth hereinabove.

[0133] In computing node 10 there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.

[0134] Computer system/server 12 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

[0135] As shown in Fig. 7, computer system/server 12 in computing node 10 is shown in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including system memory 28 to processor 16.

[0136] Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and Advanced Microcontroller Bus Architecture (AMBA).

[0137] Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.

[0138] System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.

[0139] Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments as described herein.

[0140] Computer system / server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system / server 12; and / or any devices (e.g., network card, modem, etc.) that enable computer system / server 12 to communicate with one or more other computing devices. Such communication can occur via Input / Output (I / O) interfaces 22. Still yet, computer system / server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and / or a public network (e.g., the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system / server 12 via bus 18. It should be understood that although not shown, other hardware and / or software components could be used in conjunction with computer system / server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.

[0141] The present disclosure may be embodied as a system, a method, and / or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.

[0142] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0143] Computer readable program instructions described herein can be downloaded to respective computing / processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and / or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and / or edge servers. A network adapter card or network interface in each computing / processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing / processing device.

[0144] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0145] Aspects of the present disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer readable program instructions.

[0146] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / acts specified in the flowchart and / or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and / or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function / act specified in the flowchart and / or block diagram block or blocks.

[0147] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions / acts specified in the flowchart and / or block diagram block or blocks.

[0148] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and / or flowchart illustration, and combinations of blocks in the block diagrams and / or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0149] The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

What is claimed is:

1. A method comprising:
reading an experiment design definition;
reading Omics data, the Omics data resulting from executing the experiment design, the Omics data comprising a plurality of measurements for a plurality of analytes;
fitting at least one model for each analyte, each model providing a first estimated variability of the Omics data for its analyte;
determining a first quality metric for each of the plurality of measurements;
calculating, based on the plurality of first quality metrics, a composite quality metric for each of the plurality of analytes;
fitting a function, the function relating the composite quality metric to the first estimated variability for each analyte;
applying the function to transform the composite quality metric and the first estimated variability for each analyte;
fitting a model to the transformed data;
obtaining, from the model, a plurality of model parameter estimates;
determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric;
determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability;
performing variance moderation on the vector of residuals;
reading a second dataset comprising a plurality of samples for a plurality of analytes;
determining a second quality metric for each of the plurality of analytes;
applying the function to transform each second quality metric;
applying, for each of the plurality of analytes, the fitted model to each second quality metric to determine a variance metric.
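By way of illustration only, and not limitation, the steps recited in Claim 1 might be sketched as follows. Everything specific in the sketch is an assumption rather than part of the claim: the per-analyte model is taken to be a pooled within-group variance, the composite quality metric is a median, the linearization is a logarithm applied to both axes (a power-law relationship), and the `shrink` factor is only a hypothetical stand-in for whatever variance moderation an embodiment uses. The function names are likewise hypothetical.

```python
import math
from statistics import mean, median


def pooled_within_group_variance(values, groups):
    """First estimated variability for one analyte (assumed: pooled within-group variance)."""
    labels = sorted(set(groups))
    rss = 0.0
    for label in labels:
        block = [v for v, g in zip(values, groups) if g == label]
        centre = mean(block)
        rss += sum((v - centre) ** 2 for v in block)
    return rss / (len(values) - len(labels))


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_variance_model(data, quality, groups, shrink=0.5):
    """data, quality: one row per analyte, one column per measurement.

    Assumes strictly positive variances and quality metrics so the logs exist.
    """
    var1 = [pooled_within_group_variance(row, groups) for row in data]
    composite = [median(row) for row in quality]  # composite quality metric
    # Assumed linearization: take logs of both quantities.
    x = [math.log(q) for q in composite]
    y = [math.log(v) for v in var1]
    slope, intercept = fit_line(x, y)  # model parameter estimates
    var2_log = [intercept + slope * a for a in x]  # second estimated variability (log scale)
    residuals = [obs - fit for obs, fit in zip(y, var2_log)]
    moderated = [shrink * r for r in residuals]  # hypothetical stand-in for variance moderation
    return {"slope": slope, "intercept": intercept, "moderated_residuals": moderated}


def predict_variance(model, quality_new):
    """Variance metric for each analyte of a second dataset, from its quality metrics."""
    return [
        math.exp(model["intercept"] + model["slope"] * math.log(median(row)))
        for row in quality_new
    ]
```

In this sketch the second dataset needs no replicate-based variance estimate of its own: `predict_variance` uses only the quality metrics and the parameters fitted on the first dataset.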

2. The method of Claim 1, further comprising:
determining a vector of covariate between group error (ESS) for the first dataset;
wherein fitting the model further comprises including the covariate ESS.

3. The method of Claim 2, further comprising: determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

4. The method of Claim 1, wherein the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics.

5. The method of Claim 1, wherein the Omics data is mass spectrometry data.

6. The method of Claim 1, wherein the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as representative as a flux, or a read count.

7. The method of Claim 1, wherein the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

8. The method of Claim 1, wherein fitting the model is further based on complete data for that analyte.

9. The method of Claim 1, wherein each of the plurality of variances is a sum of squares.

10. The method of Claim 1, wherein the at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

11. A method of modifying experiment parameters, the method comprising:
obtaining the omics data of Claim 1, the omics data comprising a plurality of omics samples;
for each of a plurality of combinations of the plurality of omics samples, calculating, by the method of Claim 1, the variance metric;
calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics;
calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

12. The method of Claim 11, further comprising: outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

13. The method of Claim 11, wherein calculating a number of false positives comprises:
generating a first analysis of each omics sample of the plurality of omics samples;
removing one of the plurality of omics samples in the plurality of combinations, thereby generating a reduced plurality of combinations;
generating a second analysis of the reduced plurality of combinations;
marking an omics sample as a false positive when the second analysis indicates it is significant and the first analysis indicates it is not significant.

14. The method of Claim 13, wherein calculating the relative sensitivity comprises calculating a ratio of (a) significant detected changes in the plurality of omics samples and significant detected changes in the reduced plurality of combinations to (b) significant detected changes in the plurality of omics samples.
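By way of illustration only, the comparison recited in Claims 13 and 14 might be sketched as set arithmetic over significance calls. The sketch makes several assumptions: significance calls are tracked by hypothetical analyte identifiers, part (a) of the ratio in Claim 14 is read as the calls significant in both analyses, and the relative false discovery rate, which the claims do not define, is taken to be the false positives divided by all calls in the reduced analysis. The function name `compare_to_full` is hypothetical.

```python
def compare_to_full(full_significant, reduced_significant):
    """Compare a reduced-sample analysis against the full-sample analysis.

    Both arguments are collections of identifiers called significant.
    """
    full = set(full_significant)
    reduced = set(reduced_significant)
    # Significant in the second (reduced) analysis but not in the first (full) one.
    false_positives = reduced - full
    # Significant in both analyses.
    shared = reduced & full
    relative_sensitivity = len(shared) / len(full) if full else float("nan")
    # Assumed definition: share of reduced-analysis calls that are false positives.
    relative_fdr = len(false_positives) / len(reduced) if reduced else 0.0
    return {
        "false_positives": false_positives,
        "relative_sensitivity": relative_sensitivity,
        "relative_fdr": relative_fdr,
    }
```

Calling this once per reduced combination of samples gives one relative sensitivity and one relative false discovery rate per combination, each measured against the full set of samples.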

15. A system comprising:
a computing node comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising:
reading an experiment design definition;
reading Omics data, the Omics data resulting from executing the experiment design, the Omics data comprising a plurality of measurements for a plurality of analytes;
fitting at least one model for each analyte, each model providing a first estimated variability of the Omics data for its analyte;
determining a quality metric for each of the plurality of measurements;
calculating, based on the plurality of first quality metrics, a composite quality metric for each of the plurality of analytes;
fitting a function, the function relating the composite quality metric to the first estimated variability for each analyte;
applying the function to transform the composite quality metric and the first estimated variability for each analyte;
fitting a model to the transformed data;
obtaining, from the model, a plurality of model parameter estimates;
determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric;
determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability;
performing variance moderation on the vector of residuals;
reading a second dataset comprising a plurality of samples for a plurality of analytes;
determining a second quality metric for each of the plurality of analytes;
applying the function to transform each second quality metric;
applying, for each of the plurality of analytes, the fitted model to each second quality metric to determine a variance metric.

16. The system of Claim 15, wherein the method performed by the processor further comprises:
determining a vector of covariate between group error (ESS) for the first dataset;
wherein fitting the model further comprises including the covariate ESS.

17. The system of Claim 16, wherein the method performed by the processor further comprises: determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

18. The system of Claim 15, wherein the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics.

19. The system of Claim 15, wherein the Omics data is mass spectrometry data.

20. The system of Claim 15, wherein the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as representative as a flux, or a read count.

21. The system of Claim 15, wherein the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

22. The system of Claim 15, wherein fitting the model is further based on complete data for that analyte.

23. The system of Claim 15, wherein each of the plurality of variances is a sum of squares.

24. The system of Claim 15, wherein the at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

25. A system for modifying experiment parameters, comprising:
a computing node comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising:
obtaining the omics data of Claim 15, the omics data comprising a plurality of omics samples;
for each of a plurality of combinations of the plurality of omics samples, calculating, by the method of Claim 1, the variance metric;
calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics;
calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

26. The system of Claim 25, wherein the method performed by the processor further comprises: outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

27. The system of Claim 25, wherein calculating a number of false positives comprises:
generating a first analysis of each omics sample of the plurality of omics samples;
removing one of the plurality of omics samples in the plurality of combinations, thereby generating a reduced plurality of combinations;
generating a second analysis of the reduced plurality of combinations;
marking an omics sample as a false positive when the second analysis indicates it is significant and the first analysis indicates it is not significant.

28. The system of Claim 27, wherein calculating the relative sensitivity comprises calculating a ratio of (a) significant detected changes in the plurality of omics samples and significant detected changes in the reduced plurality of combinations to (b) significant detected changes in the plurality of omics samples.

29. A computer program product for determining variance moderation, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:
reading an experiment design definition;
reading Omics data, the Omics data resulting from executing the experiment design, the Omics data comprising a plurality of measurements for a plurality of analytes;
fitting at least one model for each analyte, each model providing a first estimated variability of the Omics data for its analyte;
determining a quality metric for each of the plurality of measurements;
calculating, based on the plurality of first quality metrics, a composite quality metric for each of the plurality of analytes;
fitting a function, the function relating the composite quality metric to the first estimated variability for each analyte;
applying the function to transform the composite quality metric and the first estimated variability for each analyte;
fitting a model to the transformed data;
obtaining, from the model, a plurality of model parameter estimates;
determining a second estimated variability for each analyte by applying the fitted model to the composite quality metric;
determining a vector of residuals based on a comparison of the first estimated variability and second estimated variability;
performing variance moderation on the vector of residuals;
reading a second dataset comprising a plurality of samples for a plurality of analytes;
determining a second quality metric for each of the plurality of analytes;
applying the function to transform each second quality metric;
applying, for each of the plurality of analytes, the fitted model to each second quality metric to determine a variance metric.

30. The computer program product of Claim 29, wherein the method performed by the processor further comprises:
determining a vector of covariate between group error (ESS) for the first dataset;
wherein fitting the model further comprises including the covariate ESS.

31. The computer program product of Claim 30, wherein the method performed by the processor further comprises: determining an estimate of the ESS for the second dataset based on each analyte of the first dataset.

32. The computer program product of Claim 29, wherein the Omics data comprises one or more of RNA-Seq, Metabolomics, or Proteomics.

33. The computer program product of Claim 29, wherein the Omics data is mass spectrometry data.

34. The computer program product of Claim 29, wherein the plurality of measurements are one or more of a representation of an ion count, or a measurement of an intensity, intensity as representative as a flux, or a read count.

35. The computer program product of Claim 29, wherein the at least one variance is one or more of an average within group sum of squares (RSS) and an average between group sum of squares (ESS).

36. The computer program product of Claim 29, wherein fitting the model is further based on complete data for that analyte.

37. The computer program product of Claim 29, wherein each of the plurality of variances is a sum of squares.

38. The computer program product of Claim 29, wherein the at least one function to transform the composite quality metric and the first estimated variability for each analyte is a linearization function.

39. A computer program product for determining variance moderation, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:
obtaining the omics data of Claim 29, the omics data comprising a plurality of omics samples;
for each of a plurality of combinations of the plurality of omics samples, calculating, by the method of Claim 1, the variance metric;
calculating a number of false positives of the plurality of combinations of the plurality of omics samples based on the plurality of variance metrics;
calculating a relative sensitivity and a relative false discovery rate for each plurality of combinations of the plurality of omics samples, each relative sensitivity and relative false discovery rate being relative to the plurality of omics samples as a whole.

40. The computer program product of Claim 39, wherein the method performed by the processor further comprises: outputting, for each of the combinations, the relative sensitivity and relative false discovery rate.

41. The computer program product of Claim 39, wherein calculating a number of false positives comprises:
generating a first analysis of each omics sample of the plurality of omics samples;
removing one of the plurality of omics samples in the plurality of combinations, thereby generating a reduced plurality of combinations;
generating a second analysis of the reduced plurality of combinations;
marking an omics sample as a false positive when the second analysis indicates it is significant and the first analysis indicates it is not significant.

42. The computer program product of Claim 41, wherein calculating the relative sensitivity comprises calculating a ratio of (a) significant detected changes in the plurality of omics samples and significant detected changes in the reduced plurality of combinations to (b) significant detected changes in the plurality of omics samples.