A multi-sample joint deconvolution method applied to GC-MS technology
A multi-sample joint deconvolution method uses non-negative matrix factorization and alternating iterative optimization to construct a shared mass spectrometry matrix and a specific elution curve matrix for each sample. This addresses overlapping peaks and retention time drift in GC-MS analysis and improves deconvolution accuracy and cross-sample consistency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI DEV CENT OF COMP SOFTWARE TECH
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-16
AI Technical Summary
Traditional GC-MS technology suffers from insufficient accuracy in deconvolution of overlapping peaks and analytical errors caused by retention time drift between samples in the analysis of complex samples. It cannot effectively utilize the correlation of multi-sample data, resulting in insufficient analysis accuracy and consistency.
A multi-sample joint deconvolution method is adopted. Through non-negative matrix factorization and alternating iterative optimization, a shared mass spectrometry matrix and a specific elution curve matrix are constructed to achieve synergistic optimization of retention time and deconvolution, and to integrate multi-sample information to improve resolution stability.
It improves the analytical stability and consistency of cross-sample component matching under overlapping peak conditions, solves the analytical error and drift problems existing in traditional methods, and enhances the accuracy of complex sample analysis.
Figure CN122045583B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of chemical analysis, and in particular to a multi-sample joint deconvolution method for GC-MS technology.
Background Technology
[0002] Gas chromatography-mass spectrometry (GC-MS) has become the gold standard for analyzing complex mixtures (such as metabolomics samples, environmental pollutants, and natural product extracts) due to its high selectivity of chromatographic separation and the specificity of mass spectrometry. It achieves qualitative and quantitative analysis by using the retention time of compounds (chromatographic dimension) and characteristic mass spectrometry (mass-charge ratio-intensity distribution).
[0003] However, in actual analysis, two major challenges severely restrict the accuracy and efficiency of analyzing complex samples.
[0004] 1. Traditional overlapping peak deconvolution has insufficient accuracy.
[0005] Complex samples often contain dozens to hundreds of compounds. Due to the limitations of chromatographic column separation capabilities, many compounds form co-elution peaks (overlapping peaks) due to similar elution times. Signals from low-abundance components are easily masked by signals from high-abundance components. Traditional methods for deconvolution of overlapping peaks have significant limitations: Single-sample dependent algorithms iteratively separate chromatographic peaks from mass spectrometric features by identifying local maxima in a single sample. However, their resolution drops sharply for peak groups with overlap exceeding 60%, and they are prone to misinterpreting noise as low-abundance component signals, leading to false positives. Single-sample matrix decomposition methods, while achieving deconvolution by multiplying the chromatographic peak profile matrix by the mass spectrometric feature matrix, only utilize information from a single sample. The decomposition results are highly susceptible to noise interference, have weak ability to distinguish mass spectrometric features of structurally similar compounds (such as homologs), and the mass spectrometric features of the same component are prone to drift when different samples are analyzed independently.
[0006] 2. Retention time drift between samples leads to cross-sample analysis errors.
[0007] When analyzing different samples, the retention time of the same compound may drift from seconds to minutes due to factors such as instrument stability (e.g., column temperature fluctuations), matrix effects (e.g., sample matrix differences), and flow rate variations. Traditional deconvolution processing methods have significant drawbacks: Independent correction strategies, such as fixed time window matching (setting a tolerance of ±0.5 minutes) or linear or piecewise linear correction, are difficult to adapt to complex nonlinear drift scenarios, resulting in limited correction accuracy. Post-deconvolution matching involves deconvolving each sample individually and then matching cross-sample components using mass spectrometry similarity (e.g., cosine similarity). However, errors from single-sample deconvolution accumulate and propagate, causing the same compound to be misclassified as different components in different samples, severely impacting the reliability of inter-sample difference analysis (e.g., comparing metabolites between case and control groups).
[0008] In summary, traditional deconvolution methods fail to fully utilize the correlation between multiple sample data. On the one hand, single-sample deconvolution processes each sample in isolation, so it cannot use cross-sample information to strengthen the identification of low-abundance component signals, which reduces analytical stability under overlapping peak conditions. On the other hand, retention time correction and deconvolution are performed separately, so no collaborative optimization mechanism exists between them, and the component matching problem caused by drift is difficult to solve at its root. Therefore, developing a technical solution that integrates multi-sample information and achieves collaborative optimization of retention time and deconvolution has become a key breakthrough for improving the analytical accuracy and cross-sample consistency of complex GC-MS samples.
Summary of the Invention
[0009] The purpose of this application is to provide a multi-sample joint deconvolution method for GC-MS technology, which can integrate multi-sample information, improve the parsing stability under overlapping peak conditions, and achieve collaborative optimization of retention time and deconvolution.
[0010] To achieve the above objectives, this application provides the following solution: a multi-sample joint deconvolution method applied to GC-MS technology, including: obtaining the GC-MS data matrix of the sample in each batch and presetting multiple candidate group scores (each candidate group score is a candidate number of components k); one batch corresponds to one sample, and the sample categories in all batches are the same.
[0011] The GC-MS data matrices of each sample were time-aligned to obtain the aligned matrices for each sample.
[0012] For any candidate group score, perform non-negative matrix decomposition on the aligned matrix of each sample according to the candidate group score to obtain the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score.
[0013] Based on the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score, the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number under the candidate group score are obtained.
[0014] The shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial number of cycles under the candidate group score are alternately iteratively optimized to obtain the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score.
[0015] The final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group scores are sorted by component.
[0016] The index value corresponding to the candidate group score is obtained based on the final shared mass spectrometry matrix after sorting the candidate group scores and the final specific elution curve matrix after sorting each sample.
[0017] The optimal group score is determined based on the index value corresponding to each group score.
[0018] Output the optimal group score, the final shared mass spectrometry matrix after sorting the candidate group scores, the final specific elution curve matrix after sorting each sample, and the final peak shape function parameter matrix after sorting each sample.
[0019] According to the specific embodiments provided in this application, this application has the following technical effects: This application provides a multi-sample joint deconvolution method applied to GC-MS technology. It constructs a shared mass spectrometry matrix by jointly using the NMF spectral coefficient matrices of all samples, which addresses the problem that current single-sample deconvolution processes each sample in isolation and cannot use cross-sample information to strengthen the identification of low-abundance component signals. It can therefore integrate multi-sample information and improve the resolution stability under overlapping peak conditions. By temporally aligning the GC-MS data matrices of all samples, the time axis of all samples is unified, which ensures the consistency of cross-sample component matching and addresses the component matching problem caused by drift. Deconvolution is then performed on the aligned data, achieving coordinated optimization of retention time and deconvolution.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart illustrating a multi-sample joint deconvolution method applied to GC-MS technology, as provided in an embodiment of this application.
[0022] Figure 2 is a schematic diagram of a multi-sample joint deconvolution method applied to GC-MS technology, as provided in an embodiment of this application.
[0023] Figure 3 is a flowchart illustrating a multi-sample joint deconvolution method applied to GC-MS technology, as provided in a more specific embodiment of this application.
Detailed Implementation
[0024] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0025] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0026] In one exemplary embodiment, as shown in Figure 1, a multi-sample joint deconvolution method for GC-MS technology is provided, including: Step 201: Obtain the GC-MS data matrix of the sample in each batch and preset multiple candidate group scores; one batch corresponds to one sample, and the sample categories in all batches are the same. All samples are obtained from the target substance in different batches. Taking blood samples as an example: one test of one person's blood constitutes one batch, while blood from different people, or repeated tests of the same person's blood, belong to different batches.
[0027] Step 202: Perform time alignment on the GC-MS data matrix of each sample to obtain the aligned matrix of each sample.
[0028] Step 203: For any candidate group score, perform non-negative matrix decomposition on the aligned matrix of each sample according to the candidate group score to obtain the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score.
[0029] Step 204: Based on the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score, obtain the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number under the candidate group score.
[0030] Step 205: Perform alternating iterative optimization on the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial number of cycles under the candidate group score, to obtain the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score.
[0031] Step 206: Sort the components of the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group scores.
[0032] Step 207: Obtain the index value corresponding to the candidate group score based on the final shared mass spectrometry matrix after sorting the candidate group scores and the final specific elution curve matrix after sorting each sample.
[0033] Step 208: Determine the optimal group score based on the index values corresponding to each group score.
[0034] Step 209: Output the optimal group score and the final shared mass spectrometry matrix after sorting the candidate group scores, the final specific elution curve matrix after sorting each sample, and the final peak shape function parameter matrix after sorting each sample.
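The per-sample factorization in Step 203 can be illustrated with a minimal sketch. The source does not say which NMF solver is used; the sketch below assumes the classic multiplicative-update rule, and the function name `nmf_multiplicative`, the iteration count, and the synthetic data are all hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-12, seed=0):
    """Factor a non-negative (m x n) matrix X as C @ H, C: (m x k), H: (k x n).

    Assumed solver: multiplicative updates (not specified in the source).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    C = rng.random((m, k)) + eps   # NMF elution curve matrix (scan x component)
    H = rng.random((k, n)) + eps   # NMF spectral coefficient matrix (component x m/z bin)
    for _ in range(n_iter):
        # Element-wise updates keep both factors non-negative.
        H *= (C.T @ X) / (C.T @ C @ H + eps)
        C *= (X @ H.T) / (C @ H @ H.T + eps)
    return C, H

# Synthetic aligned matrix: 50 scans, 8 m/z bins, 2 underlying components.
rng = np.random.default_rng(1)
X = rng.random((50, 2)) @ rng.random((2, 8))
C, H = nmf_multiplicative(X, k=2)
rel_err = np.linalg.norm(X - C @ H) / np.linalg.norm(X)
```

Running this once per sample and per candidate k would give the NMF elution curve matrix `C` and NMF spectral coefficient matrix `H` that Step 204 takes as its starting point.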
[0035] In practical applications, the GC-MS data matrices of each sample are time-aligned to obtain the aligned matrix of each sample. Specifically, for any sample, with the reference time axis as the reference, the data in the GC-MS data matrix of the sample are linearly interpolated to obtain the aligned matrix of the sample.
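As a minimal sketch of this alignment step, each mass-to-charge column of a sample matrix can be linearly interpolated onto the reference time axis with `numpy.interp`. The function name `align_to_reference` and the toy values are hypothetical, and the sketch assumes the sample's own retention times are strictly increasing.

```python
import numpy as np

def align_to_reference(rt, X, rt_ref):
    """Linearly interpolate each m/z column of X (scans x bins) onto rt_ref."""
    aligned = np.empty((rt_ref.size, X.shape[1]))
    for j in range(X.shape[1]):
        # np.interp holds the edge values for reference times outside [rt[0], rt[-1]].
        aligned[:, j] = np.interp(rt_ref, rt, X[:, j])
    return aligned

rt = np.array([0.0, 1.0, 2.0, 3.0])            # sample's own scan times
X = np.array([[0.0, 10.0],
              [2.0, 8.0],
              [4.0, 6.0],
              [6.0, 4.0]])                     # intensities, 4 scans x 2 m/z bins
rt_ref = np.array([0.5, 1.5, 2.5])             # shared reference time axis
aligned = align_to_reference(rt, X, rt_ref)
```

After this step every sample has the same number of rows, which is what allows the later cross-sample sums over matrices of identical shape.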
[0036] In another exemplary embodiment of this application, obtaining the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number under the candidate group score, based on the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score, specifically includes: fitting the peak shape function to the NMF elution curve matrix of each sample under the candidate group score using a least squares fitting method, thereby obtaining the peak shape function parameter matrix and amplitude scaling factor matrix of each sample under the initial cycle number under the candidate group score. The peak shape function parameter matrix $\Theta_s$ of sample $s$ includes the peak shape function parameters of the sample under each component. The peak shape function parameters under component $i$ are $\mu_{s,i}$, $\sigma_{s,i}$, and $\tau_{s,i}$, representing the peak center, peak width, and tailing of component $i$, respectively. The amplitude scaling factor matrix of the sample includes the amplitude scaling factor of the sample for each component.
[0037] Based on the peak shape function and the peak shape function parameter matrix and amplitude scaling factor matrix of each sample under the initial number of cycles under the candidate group score, construct the specific elution curve matrix of each sample under the initial number of cycles under the candidate group score; the specific elution curve matrix of the sample includes the specific elution curve of the sample under each component.
[0038] A shared mass spectrometry matrix is generated based on the NMF spectral coefficient matrix of each sample under the candidate group score and the initial cycle number. The shared mass spectrometry matrix includes the shared mass spectrometry data of the sample under each component.
[0039] In another exemplary embodiment of this application, the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number under the candidate group score are alternately iteratively optimized to obtain the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score. Specifically, under the current cycle number, the shared mass spectrometry matrix under the current cycle number under the candidate group score is updated according to the aligned matrix of each sample and the specific elution curve matrix of each sample under the current cycle number under the candidate group score to obtain the shared mass spectrometry matrix under the next cycle number under the candidate group score.
[0040] Under the condition of the shared mass spectrometry matrix in the next cycle number under the candidate group score, the peak shape function parameter matrix and amplitude scaling factor matrix of each sample in the current cycle number under the candidate group score are updated according to the matrix after each sample is aligned and the specific elution curve matrix of each sample in the current cycle number under the candidate group score, so as to obtain the peak shape function parameter matrix and amplitude scaling factor matrix of each sample in the next cycle number under the candidate group score.
[0041] Based on the peak shape function parameter matrix, amplitude scaling factor matrix, and peak shape function of each sample in the next cycle under the candidate group score, the specific elution curve matrix of each sample in the next cycle under the candidate group score is obtained, the cycle number is updated, and the next cycle is entered until the cycle stopping condition is reached.
[0042] The shared mass spectrometry matrix at the last cycle number under the candidate group score is determined as the final shared mass spectrometry matrix under the candidate group score.
[0043] The peak shape function parameter matrix of each sample under the last cycle number under the candidate group score is determined as the final peak shape function parameter matrix of each sample under the candidate group score.
[0044] The specific elution curve matrix of each sample under the last cycle number under the candidate group score is determined as the final specific elution curve matrix of each sample under the candidate group score.
[0045] In another exemplary embodiment of this application, the shared mass spectrometry matrix under the current cycle number of the candidate group score is updated based on the matrix after each sample is aligned and the specific elution curve matrix of each sample under the current cycle number of the candidate group score, to obtain the shared mass spectrometry matrix under the next cycle number of the candidate group score. Specifically, this includes: calculating the cumulative autocorrelation matrix across samples based on the specific elution curve matrix of all samples under the current cycle number of the candidate group score.
[0046] Based on the aligned matrix of each sample and the specific elution curve matrix of each sample at the current cycle number under the candidate group score, calculate the cumulative cross-correlation matrix across samples.
[0047] The regularized normal equation is solved based on the cumulative autocorrelation matrix and the cumulative cross-correlation matrix across samples to obtain the unconstrained shared mass spectrometry estimation results.
[0048] The unconstrained shared mass spectrometry estimation results are truncated by non-negative projection, and the shared mass spectrometry matrix under the current cycle number under the candidate group score is updated to obtain the shared mass spectrometry matrix under the next cycle number under the candidate group score.
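The four sub-steps above (cumulative autocorrelation, cumulative cross-correlation, regularized normal equation, non-negative projection) can be sketched as follows. This is an illustration under stated assumptions: the regularization is taken to be a ridge term `lam * I`, and the function name `update_shared_spectra` and the synthetic data are hypothetical.

```python
import numpy as np

def update_shared_spectra(X_list, C_list, lam=1e-8):
    """One shared-spectrum update from all samples' aligned matrices and elution curves."""
    k = C_list[0].shape[1]
    # Cumulative autocorrelation matrix across samples (k x k).
    A = sum(C.T @ C for C in C_list)
    # Cumulative cross-correlation matrix across samples (k x n).
    B = sum(C.T @ X for C, X in zip(C_list, X_list))
    # Regularized normal equation (A + lam*I) H = B; ridge form is an assumption.
    H_free = np.linalg.solve(A + lam * np.eye(k), B)
    # Non-negative projection: truncate negative entries to zero.
    return np.maximum(H_free, 0.0)

# Synthetic check: three samples sharing the same spectra H_true.
rng = np.random.default_rng(0)
H_true = rng.random((2, 6))
C_list = [rng.random((30, 2)) for _ in range(3)]
X_list = [C @ H_true for C in C_list]
H_new = update_shared_spectra(X_list, C_list)
```

Because `A` and `B` are sums over samples, every sample contributes to a single estimate of the shared spectra, which is where the cross-sample information enters the update.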
[0049] In another exemplary embodiment of this application, updating the peak shape function parameter matrix and amplitude scaling factor matrix of each sample under the current cycle number under the candidate group score, given the shared mass spectrometry matrix for the next cycle number under the candidate group score and according to the aligned matrix of each sample and the specific elution curve matrix of each sample under the current cycle number under the candidate group score, so as to obtain the peak shape function parameter matrix and amplitude scaling factor matrix of each sample under the next cycle number under the candidate group score, specifically includes: for any component in any sample, calculating the residual matrix of the sample with respect to that component according to the aligned matrix of the sample, the shared mass spectrometry data corresponding to the target components in the shared mass spectrometry matrix for the next cycle number under the candidate group score, and the specific elution curves corresponding to the target components in the specific elution curve matrix of the sample under the current cycle number under the candidate group score; the target components are all components other than the component in question.
[0050] Based on the residual matrix of the sample with respect to the component, a nonlinear least squares fitting method is used to solve the objective function

$$(\hat{\theta}_{s,i},\ \hat{a}_{s,i}) = \arg\min_{\theta_{s,i},\, a_{s,i}} \left\| R_{s,i} - a_{s,i}\, \tilde{g}(\theta_{s,i})\, h_i^{\top} \right\|_F^2$$

to obtain the optimal peak shape function parameters and the optimal amplitude scaling factor of the sample under the component; wherein $\hat{\theta}_{s,i}$ represents the optimal peak shape function parameters of sample $s$ under component $i$; $\hat{a}_{s,i}$ represents the optimal amplitude scaling factor of sample $s$ under component $i$; $\arg\min$ denotes the peak shape function parameters $\theta_{s,i}$ of sample $s$ under component $i$ and the amplitude scaling factor $a_{s,i}$ of sample $s$ under component $i$ that minimize the objective; $R_{s,i}$ represents the residual matrix of sample $s$ with respect to component $i$; $\tilde{g}(\theta_{s,i})$ represents the normalized peak shape vector, where the peak shape vector is obtained by substituting $\theta_{s,i}$ into the peak shape function; $h_i^{\top}$ represents the transpose of the shared mass spectrometry data corresponding to component $i$ in the shared mass spectrometry matrix for the next cycle number under the candidate group score; and $\|\cdot\|_F^2$ represents the square of the Frobenius norm.
[0051] Based on the optimal peak shape function parameters of the sample under each component, construct the peak shape function parameter matrix of the sample under the next cycle number under the candidate group score.
[0052] Construct the amplitude scaling factor matrix of the sample for the next iteration under the candidate group score based on the optimal amplitude scaling factor of the sample under each component.
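A minimal sketch of the per-component update follows. It computes the residual of one sample with respect to component `i` and, for a fixed normalized peak shape, the amplitude that minimizes the squared Frobenius objective in closed form. The nonlinear search over the peak shape parameters themselves is left to whatever optimizer is used and is not shown. The function names, the Gaussian test shapes, and the clipping of the amplitude at zero are assumptions made for illustration.

```python
import numpy as np

def residual_for_component(X, C, H, i):
    """Aligned matrix X minus the contribution of every component except i."""
    others = [j for j in range(H.shape[0]) if j != i]
    return X - C[:, others] @ H[others, :]

def best_amplitude(R, g, h):
    """Minimizer of ||R - a * outer(g, h)||_F^2 over a, clipped at zero (assumed)."""
    a = (g @ R @ h) / ((g @ g) * (h @ h))
    return max(float(a), 0.0)

def objective(R, a, g, h):
    """Squared Frobenius norm of the rank-one fit residual."""
    return np.linalg.norm(R - a * np.outer(g, h)) ** 2

# Toy sample with two components; component 0 has amplitude 3 on a unit-norm shape.
t = np.linspace(0.0, 1.0, 20)
g = np.exp(-0.5 * ((t - 0.5) / 0.1) ** 2)
g /= np.linalg.norm(g)
g2 = np.exp(-0.5 * ((t - 0.7) / 0.1) ** 2)
C = np.column_stack([3.0 * g, 2.0 * g2])
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.2, 1.0]])
X = C @ H
R = residual_for_component(X, C, H, i=0)
a_hat = best_amplitude(R, g, H[0])
```

In a full fit, a nonlinear least squares routine would vary the peak shape parameters, regenerate the normalized shape `g` at each trial, and evaluate `objective` to pick the best combination.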
[0053] In another exemplary embodiment of this application, the specific elution curve matrix of each sample at the next cycle number under the candidate group score is obtained based on the peak shape function parameter matrix, amplitude scaling factor matrix, and peak shape function of each sample at the next cycle number under the candidate group score. Specifically, this includes: for any component in any sample, inputting the peak shape function parameter corresponding to the component in the peak shape function parameter matrix of the sample at the next cycle number under the candidate group score into the peak shape function to obtain the peak shape vector corresponding to the component in the sample.
[0054] The peak shape vectors corresponding to the components in the sample are normalized.
[0055] The peak shape vector corresponding to the component in the normalized sample is multiplied by the optimal amplitude scaling factor of the sample under the component to obtain the product corresponding to the component.
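[0056] The specific elution curve matrix of the sample at the next cycle number is obtained by assembling the products corresponding to the components in the sample, with the product of each component forming the specific elution curve of that component.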
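The construction in the preceding paragraphs (evaluate the peak shape function, normalize, scale by the amplitude, assemble per component) can be sketched as below, assuming the exponentially modified Gaussian (EMG) peak shape that the description names later as its parameterization. The unit L2 normalization, the function names, and the parameter values are assumptions for illustration only.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def emg(t, mu, sigma, tau):
    """Exponentially modified Gaussian density on time axis t.

    mu: peak center, sigma: peak width, tau: tailing time constant.
    """
    z = (sigma / tau - (t - mu) / sigma) / math.sqrt(2.0)
    return (0.5 / tau) * np.exp(sigma**2 / (2 * tau**2) - (t - mu) / tau) * _erfc(z)

def specific_elution_matrix(t, params, amps):
    """Columns are amplitude-scaled, normalized EMG curves, one per component."""
    cols = []
    for (mu, sigma, tau), a in zip(params, amps):
        g = emg(t, mu, sigma, tau)
        g = g / (np.linalg.norm(g) + 1e-12)   # unit L2 norm (assumed normalization)
        cols.append(a * g)
    return np.column_stack(cols)

t = np.linspace(0.0, 10.0, 501)
params = [(3.0, 0.2, 0.3), (6.0, 0.3, 0.5)]    # hypothetical (mu, sigma, tau) per component
amps = [2.0, 5.0]                              # hypothetical amplitude scaling factors
C_s = specific_elution_matrix(t, params, amps)
area = emg(t, 3.0, 0.2, 0.3).sum() * (t[1] - t[0])
t_peak = t[np.argmax(C_s[:, 0])]
```

The tail parameter shifts the apex to the right of `mu`, so the fitted peak center and the observed peak retention time are not the same number.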
[0057] In another exemplary embodiment of this application, determining whether the loop termination condition has been met specifically involves: under the current cycle number $t$, calculating the relative reconstruction error according to the formula

$$E^{(t)} = \frac{\sum_{s=1}^{S} \left\| X_s - C_s^{(t+1)} H^{(t+1)} \right\|_F^2}{\sum_{s=1}^{S} \left\| X_s \right\|_F^2 + \epsilon}$$

[0058] If the formula

$$\frac{\left| E^{(t)} - E^{(t-1)} \right|}{\max\left(E^{(t-1)},\ \epsilon\right)} < \delta$$

is satisfied, the loop termination condition is determined to have been met; otherwise, the loop termination condition is determined not to have been met. Here $E^{(t)}$ represents the relative reconstruction error at the current cycle number; $S$ represents the total number of samples; $X_s$ represents the aligned matrix of sample $s$; $C_s^{(t+1)}$ represents the specific elution curve matrix of sample $s$ for the next cycle number under the candidate group score; $H^{(t+1)}$ represents the shared mass spectrometry matrix for the next cycle number under the candidate group score; $\epsilon$ has no practical significance and is used only to prevent the denominator from being 0; $E^{(t-1)}$ represents the relative reconstruction error at the previous cycle number; $\delta$ represents the preset convergence tolerance threshold; $\max(\cdot)$ indicates taking the maximum value; $\|\cdot\|_F^2$ denotes the square of the Frobenius norm; and $|\cdot|$ denotes the absolute value.
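A minimal sketch of such a stopping test is given below. The exact formulas are not legible in the source text, so the sketch assumes one plausible form built from the quantities the paragraph lists: a relative reconstruction error summed over samples, and a relative change between consecutive cycles compared against a tolerance. The function names and default values are hypothetical.

```python
import numpy as np

def relative_error(X_list, C_list, H, eps=1e-12):
    """Summed squared reconstruction error divided by summed squared data norm."""
    num = sum(np.linalg.norm(X - C @ H) ** 2 for X, C in zip(X_list, C_list))
    den = sum(np.linalg.norm(X) ** 2 for X in X_list)
    return num / (den + eps)          # eps only guards against a zero denominator

def has_converged(err, err_prev, tol=1e-6, eps=1e-12):
    """True when the relative change in error between cycles drops below tol."""
    return abs(err - err_prev) / max(err_prev, eps) < tol

# Toy check with data that the model reproduces exactly.
rng = np.random.default_rng(0)
H = rng.random((2, 5))
C_list = [rng.random((12, 2)) for _ in range(3)]
X_list = [C @ H for C in C_list]
err_exact = relative_error(X_list, C_list, H)
```

In an alternating loop, `relative_error` would be evaluated after each cycle and passed to `has_converged` together with the value from the previous cycle.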
[0059] In another exemplary embodiment of this application, the final shared mass spectrometry matrix under the candidate group score, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample are sorted by component. Specifically, this includes: for any component, identifying the retention time of the component reaching the peak in the final specific elution curve matrix of each sample under the candidate group score, and obtaining the peak retention time of the component in all samples.
[0060] The cross-sample representative retention time of a component is obtained based on its peak retention time across all samples. The median of the peak retention times of a component across all samples is the cross-sample representative retention time of that component.
[0061] Based on the cross-sample representative retention time of all components from smallest to largest, all components are sorted to obtain the component index order.
[0062] Based on the component index order, the row vectors (i.e., the shared mass spectrometry data corresponding to the components) in the final shared mass spectrometry matrix under the candidate group score, the column vectors (i.e., the specific elution curves corresponding to the components) in the final specific elution curve matrix of each sample under the candidate group score, and the column vectors (i.e., the peak shape function parameters corresponding to the components) in the final peak shape function parameter matrix of each sample under the candidate group score are sorted to obtain the sorted final shared mass spectrometry matrix under the candidate group score, the sorted final specific elution curve matrix of each sample, and the sorted final peak shape function parameter matrix of each sample.
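The sorting procedure can be sketched as follows: find each component's apex time in every sample, take the median across samples as the representative retention time, and reorder rows and columns accordingly. The function name `sort_components`, the array layouts (parameters stored as one column per component), and the toy values are assumptions for illustration.

```python
import numpy as np

def sort_components(t, C_list, H, P_list):
    """Reorder components by cross-sample median peak retention time (ascending)."""
    # Peak retention time of every component in every sample: shape (S, k).
    peaks = np.array([t[np.argmax(C, axis=0)] for C in C_list])
    rep_rt = np.median(peaks, axis=0)             # representative retention time
    order = np.argsort(rep_rt, kind="stable")     # component index order
    H_sorted = H[order, :]                        # rows of the shared matrix
    C_sorted = [C[:, order] for C in C_list]      # columns of each elution matrix
    P_sorted = [P[:, order] for P in P_list]      # columns of each parameter matrix
    return order, H_sorted, C_sorted, P_sorted

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
C1 = np.array([[0, 0], [0, 5], [0, 1], [4, 0], [1, 0]], dtype=float)
C2 = np.array([[0, 3], [0, 1], [0, 0], [1, 0], [6, 0]], dtype=float)
C3 = np.array([[0, 0], [0, 1], [0, 4], [5, 0], [1, 0]], dtype=float)
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
P = np.array([[3.0, 1.0], [0.2, 0.2], [0.3, 0.3]])   # rows: center, width, tail
order, H_sorted, C_sorted, P_sorted = sort_components(t, [C1, C2, C3], H, [P, P, P])
```

Using the median rather than the mean keeps a single badly drifted sample from changing the component order.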
[0063] In another exemplary embodiment of this application, obtaining the index value corresponding to the candidate group score based on the sorted final shared mass spectrometry matrix under the candidate group score and the sorted final specific elution curve matrix of each sample, and determining the optimal group score based on the index value corresponding to each group score, specifically includes: calculating the comprehensive score corresponding to candidate group score k by formula. The formula is defined in terms of the following quantities: the total number of samples; the aligned matrix of sample s; the square of the Frobenius norm; m, the number of retention time scan points of the GC-MS data matrix of any sample; n, the number of mass-to-charge ratio bins of the GC-MS data matrix of any sample; the sorted final specific elution curve matrix of sample s under the candidate group score; and the sorted final shared mass spectrometry matrix under the candidate group score. The GC-MS data matrices of all samples have the same number of retention time scan points and the same number of mass-to-charge ratio bins.
[0064] The average explained variance corresponding to candidate group score k is calculated by formula. The formula includes a small constant that has no practical significance and is used only to prevent the denominator from being 0.
[0065] The average signal-to-noise ratio corresponding to candidate group score k is calculated by formula.
[0066] The optimal group score is determined based on the comprehensive score, average explained variance, and average signal-to-noise ratio corresponding to the scores of each candidate group.
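As an illustration of one of the three indices, the sketch below computes an average explained variance. The source's own formula is not legible, so the definition used here (one minus the ratio of squared residual to squared data norm, averaged over samples, with a small constant in the denominator) is an assumption based on the common meaning of the term, and the function name is hypothetical. The comprehensive score and the signal-to-noise ratio are not sketched because their definitions cannot be recovered from the text.

```python
import numpy as np

def mean_explained_variance(X_list, C_list, H, eps=1e-12):
    """Average over samples of 1 - ||X - C H||_F^2 / (||X||_F^2 + eps) (assumed form)."""
    ev = []
    for X, C in zip(X_list, C_list):
        resid = np.linalg.norm(X - C @ H) ** 2
        total = np.linalg.norm(X) ** 2
        ev.append(1.0 - resid / (total + eps))   # eps only avoids division by zero
    return float(np.mean(ev))

rng = np.random.default_rng(0)
H = rng.random((2, 5))
C_list = [rng.random((12, 2)) for _ in range(3)]
X_list = [C @ H for C in C_list]
ev_exact = mean_explained_variance(X_list, C_list, H)
ev_empty = mean_explained_variance(X_list, [np.zeros_like(C) for C in C_list], H)
```

Evaluating such an index once per candidate k gives the values that are later plotted against k.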
[0067] In another exemplary embodiment of this application, after outputting the optimal group score and the final shared mass spectrometry matrix after sorting the scores of each candidate group, the final specific elution curve matrix of each sample, and the final peak shape function parameter matrix of each sample, the method further includes: plotting a comprehensive score curve with each candidate component as the abscissa and the comprehensive score corresponding to each candidate component as the ordinate.
[0068] Plot the average explained variance curve with each candidate component as the x-axis and the average explained variance corresponding to each candidate component as the y-axis.
[0069] Plot the average signal-to-noise ratio curve with each candidate component as the x-axis and the average signal-to-noise ratio corresponding to each candidate component as the y-axis.
[0070] Component mass spectra are generated based on the final shared mass spectrometry matrix after sorting under the optimal component set. The shared mass spectrometry vectors of each component under the optimal component set are displayed as bar charts.
[0071] The specific elution curves for each component are generated based on the final specific elution curve matrix after sorting the samples under the optimal component score. The specific elution curves for each component in each sample are then displayed as line graphs.
[0072] A superimposed comparison plot of the target sample aligned matrix under the optimal grouping score and the reconstructed data matrix is generated. The reconstructed data matrix is obtained from the final specific elution curve matrix and the final shared mass spectrometry matrix after sorting the target samples under the optimal grouping score. A superimposed comparison of the original total ion current chromatogram and the total ion current chromatogram after model reconstruction of the representative sample (target sample) is displayed to visually evaluate the decomposition effect.
[0073] The core idea of this application is to assume that all samples share the same mass spectrometry fingerprint, while each sample has its own specific elution curve due to chromatographic conditions. By introducing an Exponentially Modified Gaussian (EMG) model to parameterize the elution curve, the mathematical decomposition problem is transformed into a physical parameter optimization problem, thereby achieving high-precision component extraction and alignment. The core mathematical model is $X_s \approx C_s H$, where the GC-MS data of each sample $s$ are denoted as a matrix $X_s$, $C_s$ is the specific elution curve matrix of that sample, and $H$ is the shared mass spectrometry matrix; that is, the model assumes that each sample can be decomposed into the product of a specific elution curve matrix and a shared mass spectrometry matrix. As shown in Figure 2, the process comprises two stages. In the first stage, the data are binned, an intensity matrix is constructed for the sample of each batch, and retention times are aligned. Then, EMG peak shapes are defined for each binned matrix, and the decomposition hypothesis is established. In the second stage, an outer loop iterates, updating the shared mass spectrometry matrix, the peak shape function parameter matrix, and the specific elution curve matrix until convergence. The components are then sorted, the candidate k values are traversed to calculate a comprehensive score, and the optimal k value is selected. Finally, the sorted final shared mass spectrometry matrix, the final specific elution curve matrix and peak shape function parameter matrix of each sample, the comprehensive score curve, the average explained variance curve, the average signal-to-noise ratio curve, the component mass spectra, the specific elution curves, and the overlay comparison plot are output.
[0074] This application also provides a more specific embodiment detailing the above multi-sample joint deconvolution method applied to GC-MS technology, specifically for the joint deconvolution of multi-batch, multi-sample GC-MS data: first, the GC-MS data of each sample are constructed into a matrix; then, multiple samples are stacked along a sample dimension into a third-order structure; the data are then interpreted using a shared mass spectrometry matrix plus specific elution curve matrices (parameterized by an exponentially modified Gaussian peak shape function), and the optimal group score is selected automatically. As shown in Figure 3, the process begins by inputting the raw GC-MS data and binning it by mass-to-charge ratio. Then, data preprocessing, sample retention time alignment, model initialization, and alternating iterative optimization are performed in sequence. The convergence condition is checked: if it is not met, alternating iterative optimization is performed again; if it is met, the components are sorted by their representative retention times across samples, the candidate k values are traversed, and a comprehensive score is calculated to determine the optimal number of components. Finally, the results are output and visualized. The specific steps are as follows: Step 1: Obtain the GC-MS data matrix of the samples in each batch.
[0075] Data input: Read the GC-MS mass spectrometry data files of similar samples from different batches (samples containing the same or similar target components, such as biological samples from the same source or repeatedly tested samples), with one sample corresponding to one data file. Extract the MS1-level scan data and return two arrays, mass-to-charge ratio and intensity; each peak consists of a mass-to-charge ratio and an intensity. At this stage, invalid samples (such as samples with fewer than 6 scans) must be filtered out after the scan data are extracted, so that they do not degrade the joint deconvolution. The mass spectrometry data are then binned over a given mass-to-charge ratio range, i.e., discretized into predetermined intervals to form a histogram: the cumulative intensity of the data in each bin is returned, and the center value of each bin is taken as the mass-to-charge ratio of that bin. From the cumulative ion intensity and the mass-to-charge ratio value of each bin, the observation matrices of the similar samples from different batches are obtained.
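The binning described above can be sketched as follows. This is a minimal illustration only: the function names `bin_scan` and `build_observation_matrix`, the toy scans, and the 1.0-wide bins are assumptions made for the example and do not come from the application.

```python
import numpy as np

def bin_scan(mz, intensity, mz_min, mz_max, bin_width):
    """Accumulate one scan's intensities into fixed-width m/z bins."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    edges = mz_min + bin_width * np.arange(n_bins + 1)
    # Weighted histogram: each peak adds its intensity to the bin holding its m/z.
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin center used as the bin's m/z
    return centers, binned

def build_observation_matrix(scans, mz_min, mz_max, bin_width):
    """Stack binned scans row by row: rows = scans (time), columns = m/z bins."""
    rows = []
    centers = None
    for mz, inten in scans:
        centers, row = bin_scan(np.asarray(mz, float), np.asarray(inten, float),
                                mz_min, mz_max, bin_width)
        rows.append(row)
    return centers, np.vstack(rows)

# Hypothetical toy data: two scans, each a (m/z list, intensity list) pair.
scans = [([50.2, 50.4, 51.7], [10.0, 5.0, 2.0]),
         ([50.9, 52.1], [1.0, 4.0])]
centers, X = build_observation_matrix(scans, 50.0, 53.0, 1.0)
```

Peaks outside the chosen m/z range are silently dropped by `np.histogram`, which may or may not match the intended handling of out-of-range ions.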
[0076] Data preprocessing: The observation matrices of the similar samples from different batches are subjected to Savitzky-Golay smoothing, SNIP baseline correction, and weak signal filtering, yielding the GC-MS data matrix of the sample in each batch, $\{X_s\}_{s=1}^{N}$. Here $X_s$ is the GC-MS data matrix of sample s, $X_s \in \mathbb{R}^{m \times n}$ indicates that its dimension is $m \times n$ (m is the number of retention time points, n is the number of mass-to-charge ratio bins), and $N$ represents the total number of samples.
[0077] Step 2: Time alignment is performed on the GC-MS data matrices of samples from each batch to help unify the time axis of all samples. This time alignment step primarily aims to reduce large-scale retention time shifts between different samples (e.g., drift caused by column aging) to improve the stability of the initialization and iteration processes. This method does not rely on strict physical time alignment; the final fine peak positions will be automatically determined by subsequent EMG peak shape function parameter optimization.
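[0078] Specifically, taking the reference time axis $t^{\mathrm{ref}}$ as the standard, the GC-MS data matrix $X_s$ of sample s, defined on its own time axis $t^{(s)}$, is mapped onto $t^{\mathrm{ref}}$ by linear interpolation to obtain the aligned data $\tilde{X}_s$.
[0079] For each reference time point $t^{\mathrm{ref}}_j$, if $t^{(s)}_a \le t^{\mathrm{ref}}_j \le t^{(s)}_{a+1}$ in sample s, then: $\tilde{X}_s(t^{\mathrm{ref}}_j) = X_s(t^{(s)}_a) + \frac{t^{\mathrm{ref}}_j - t^{(s)}_a}{t^{(s)}_{a+1} - t^{(s)}_a}\,\bigl(X_s(t^{(s)}_{a+1}) - X_s(t^{(s)}_a)\bigr)$.
[0080] Here $\tilde{X}_s(t^{\mathrm{ref}}_j)$ denotes the interpolated value at reference time point $t^{\mathrm{ref}}_j$, $X_s(t^{(s)}_a)$ denotes the data in $X_s$ at time point $t^{(s)}_a$, and $X_s(t^{(s)}_{a+1})$ denotes the data in $X_s$ at time point $t^{(s)}_{a+1}$. Introducing the reference time axis $t^{\mathrm{ref}}$ and linearly mapping the time axis $t^{(s)}$ of each sample onto this unified scale resolves coarse-grained retention time drift, providing a foundation for subsequent refined modeling.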
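As a rough illustration of the alignment step, the sketch below interpolates every m/z column of one sample onto a reference time axis. The function name `align_to_reference` and the toy values are hypothetical; `np.interp` is used as one convenient way to do piecewise-linear interpolation.

```python
import numpy as np

def align_to_reference(t_sample, X_sample, t_ref):
    """Linearly interpolate each m/z column of X_sample onto t_ref."""
    t_sample = np.asarray(t_sample, float)
    X_sample = np.asarray(X_sample, float)
    aligned = np.empty((len(t_ref), X_sample.shape[1]))
    for col in range(X_sample.shape[1]):
        # np.interp holds the end values constant outside the sampled range.
        aligned[:, col] = np.interp(t_ref, t_sample, X_sample[:, col])
    return aligned

# Hypothetical sample with 3 time points and 2 m/z bins.
t_s = [0.0, 1.0, 2.0]
X_s = [[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
t_ref = [0.5, 1.5]
X_aligned = align_to_reference(t_s, X_s, t_ref)
```

Reference points that fall outside the sample's own time range take the nearest end value here; a real pipeline might prefer to zero-fill or trim them instead.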
[0081] Step 3: Model initialization.
[0082] To avoid becoming trapped in local optima of the non-convex optimization, the model parameters are first initialized using the coarse results produced by Non-negative Matrix Factorization (NMF). The specific process is as follows: with the current candidate group score (number of components) preset as k, NMF is first performed on the aligned matrix of each sample to obtain the NMF elution curve matrix and the NMF spectral coefficient matrix, i.e., $\tilde{X}_s \approx W_s^{(k)} H_s^{(k)}$, where $W_s^{(k)}$ represents the NMF elution curve matrix of sample s under candidate group score k, and $H_s^{(k)}$ represents the NMF spectral coefficient matrix of sample s under candidate group score k.
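The application does not state which NMF solver is used. As a sketch only, the example below uses the classic multiplicative-update rule for the squared Frobenius loss; the function name, iteration count, and toy matrices are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, eps=1e-12, seed=0):
    """Factor a non-negative X (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical rank-2 toy data: 4 time points, 3 m/z bins.
W_true = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]])
H_true = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
X = W_true @ H_true
W, H = nmf_multiplicative(X, k=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The columns of `W` play the role of the coarse elution curves and the rows of `H` the coarse spectra; their order is arbitrary, which is why component matching across samples is needed before averaging.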
[0083] Next, least squares fitting is used to fit the peak shape function to $W_s^{(k)}$, yielding the peak shape function parameter matrix and the amplitude scaling factor matrix of each sample at the initial cycle number under the candidate group score. Specifically, for component i of sample s, the peak shape function parameters and the amplitude scaling factor are obtained by optimizing $(\theta_{s,i}^{(0)}, a_{s,i}^{(0)}) = \arg\min_{\theta, a} \lVert w_{s,i} - a\, g(t;\theta) \rVert_2^2$.
[0084] Here $\theta_{s,i}^{(0)}$ denotes the peak shape function parameters of sample s for component i at the initial cycle number, $a_{s,i}^{(0)}$ denotes the amplitude scaling factor of sample s for component i at the initial cycle number, $w_{s,i}$ is the i-th column of $W_s^{(k)}$, i.e., the data corresponding to component i, and $g(t;\theta)$ is the exponentially modified Gaussian function defined by the parameters $\theta$.
[0085] Subsequently, the peak shape corresponding to the optimized $\theta_{s,i}^{(0)}$ is normalized by area; that is, from the EMG peak shape function $g$ together with $\theta_{s,i}^{(0)}$ and $a_{s,i}^{(0)}$, the specific elution curve of sample s for component i at the initial cycle number is constructed: $c_{s,i}^{(0)} = a_{s,i}^{(0)}\, \hat{g}(t;\theta_{s,i}^{(0)})$, where $\hat{g}$ is $g$ scaled to unit area. Normalization ensures that the peak area is 1, which resolves the scale ambiguity.
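To make the peak shape concrete, the sketch below evaluates one common parameterization of the EMG density (peak center `mu`, Gaussian width `sigma`, exponential time constant `tau`) and scales it to unit area. The application does not spell out its exact parameterization, so this form, the function names, and the trapezoid-rule area are assumptions.

```python
import math
import numpy as np

def emg(t, mu, sigma, tau):
    """Exponentially modified Gaussian: a Gaussian convolved with an exponential decay."""
    t = np.asarray(t, float)
    z = (sigma / tau - (t - mu) / sigma) / math.sqrt(2.0)
    expo = sigma**2 / (2.0 * tau**2) - (t - mu) / tau
    erfc_vals = np.array([math.erfc(v) for v in z])
    return np.exp(expo) * erfc_vals / (2.0 * tau)

def trapezoid_area(y, t):
    """Area under y(t) by the trapezoid rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def area_normalized_emg(t, mu, sigma, tau):
    """EMG peak rescaled so that its area on the grid t equals 1."""
    g = emg(t, mu, sigma, tau)
    return g / trapezoid_area(g, t)

# Hypothetical peak: center 5.0, width 0.3, tailing 0.5, amplitude 2.5.
t = np.linspace(0.0, 20.0, 2001)
g_hat = area_normalized_emg(t, mu=5.0, sigma=0.3, tau=0.5)
curve = 2.5 * g_hat  # elution curve = amplitude scaling factor x unit-area shape
```

Because the shape has unit area, the amplitude factor alone carries the quantitative information for the component. This naive formula can overflow for very small `tau` or points far ahead of the peak, so a production version would need a numerically stabilized form.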
[0086] Simultaneously, the NMF spectral coefficient matrices of all samples are averaged to serve as the initial value of the shared mass spectrometry matrix, i.e., $S^{(0)} = \frac{1}{N}\sum_{s=1}^{N} H_s^{(k)}$, where $H_s^{(k)}$ is the NMF spectral coefficient matrix of sample s and $N$ is the total number of samples. This converges faster than random initialization. Since the non-negative matrix factorization results may exhibit inconsistent component orderings across samples, component correspondences should be established based on mass spectrometry similarity and / or elution time characteristics before the NMF spectral coefficient matrices are aggregated.
[0087] Step 4: Iterative optimization of the outer loop.
[0088] Given the number of components k, the algorithm employs an alternating least squares (ALS) framework, alternating between two steps: updating the shared mass spectrometry matrix with the specific elution curve matrices fixed, and updating the specific elution curve matrices with the shared mass spectrometry matrix fixed, so as to minimize the objective function $\min_{S,\{\theta_s\}} \sum_{s=1}^{N} \lVert \tilde{X}_s - C_s(\theta_s)\, S \rVert_F^2 + \lambda \lVert S \rVert_F^2$.
[0089] Here $\tilde{X}_s$ is the aligned matrix of sample s, $C_s(\theta_s)$ represents the specific elution curve matrix of sample s constrained by the peak shape function parameters, $S$ represents the shared mass spectrometry matrix, $\lVert\cdot\rVert_F^2$ is the square of the Frobenius norm, $\lambda \lVert S \rVert_F^2$ is the regularization term, and $\lambda$ is a regularization parameter used to control the complexity of S and improve the model's generalization ability. Unlike in traditional methods, each column of $C_s$ (the specific elution curve matrix of sample s) is not a free variable but a physical curve strictly defined by the EMG peak shape function.
[0090] Sub-step 4.1: Update the shared mass spectrometry matrix S (global step). With the specific elution curve matrices $C_s$ of all samples fixed, solve the ridge regression problem containing the regularization term $\lambda \lVert S \rVert_F^2$ to update S, and enforce a non-negative projection ($S \ge 0$). This step uses the statistics of all samples to jointly constrain the mass spectrometry solution, significantly improving the signal-to-noise ratio.
[0091] Specifically, with the specific elution curve matrices $C_s$ of all samples fixed, S is updated by solving a ridge regression problem with non-negativity constraints. First, the cumulative matrices are calculated: $A = \sum_{s=1}^{N} C_s^\top C_s$, $B = \sum_{s=1}^{N} C_s^\top \tilde{X}_s$.
[0092] Here $A$, of dimension $k \times k$, is the cumulative sum of the covariance of the elution matrices, and $B$, of dimension $k \times n$, is the cumulative sum of the projections, with $\tilde{X}_s$ the aligned matrix of sample s. Then solve the linear system: $(A + \lambda I_k)\, S = B$.
[0093] Here $I_k$ is the $k \times k$ identity matrix. Finally, a non-negative projection is applied to the solution to satisfy the physical constraints: $S \leftarrow \max(S, \epsilon)$, where $\epsilon$ is a very small positive number.
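The global update of sub-step 4.1 can be sketched in a few lines: accumulate the two cross-sample sums, solve the regularized normal equations, and clip to a small positive floor. The function name, the default `lam` and `eps`, and the toy matrices are assumptions for illustration.

```python
import numpy as np

def update_shared_spectra(X_list, C_list, lam=1e-3, eps=1e-12):
    """Ridge update of the shared spectra S (k x n) with all elution matrices fixed."""
    k = C_list[0].shape[1]
    n = X_list[0].shape[1]
    A = np.zeros((k, k))  # cumulative sum of C_s^T C_s
    B = np.zeros((k, n))  # cumulative sum of C_s^T X_s
    for X, C in zip(X_list, C_list):
        A += C.T @ C
        B += C.T @ X
    S = np.linalg.solve(A + lam * np.eye(k), B)  # (A + lam I) S = B
    return np.maximum(S, eps)                    # non-negative projection

# Hypothetical noise-free toy problem with two samples and two components.
S_true = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])
C1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C2 = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
X_list = [C1 @ S_true, C2 @ S_true]
S_est = update_shared_spectra(X_list, [C1, C2], lam=1e-9)
```

Pooling the sums over samples before solving is what lets every sample contribute to each spectrum; with noise-free data and a negligible `lam`, the estimate should reproduce `S_true` up to the clipping floor.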
[0094] Sub-step 4.2: Update the peak shape function parameter matrix and the amplitude scaling factor matrix. With the shared mass spectrometry matrix S fixed, nonlinear least squares optimization is performed on each component of each sample: the residual that remains after removing the contributions of the other components is calculated, and $\mu_{s,i}$, $\sigma_{s,i}$, $\tau_{s,i}$ are optimized against it. This allows the model to adaptively match the retention time drift and peak distortion specific to the sample. $\mu_{s,i}$, $\sigma_{s,i}$, $\tau_{s,i}$ represent the peak center, peak width, and tailing of sample s for component i, respectively.
[0095] Specifically, with the shared mass spectrometry matrix fixed, the EMG parameters are updated sample by sample and component by component. To update component i in sample s, the residual after deducting the contributions of all other components is first calculated: $R_{s,i} = \tilde{X}_s - \sum_{j \ne i} c_{s,j}\, s_j^\top$.
[0096] Here $R_{s,i}$ is the residual matrix of sample s for component i, $\tilde{X}_s$ is the aligned matrix of sample s, $s_j^\top$ represents the j-th row of the shared mass spectrometry matrix S, i.e., the shared mass spectrum of component j, and $c_{s,j}$ is the specific elution curve of sample s for component j. Subsequently, the peak shape function parameters and the amplitude coefficient are jointly optimized using a nonlinear least squares method: $(\theta_{s,i}^{*}, a_{s,i}^{*}) = \arg\min_{\theta, a} \lVert R_{s,i} - a\, \hat{g}(\theta)\, s_i^\top \rVert_F^2$,
[0097] subject to $t_{\min} \le \mu \le t_{\max}$, $\sigma > 0$, $\tau > 0$.
[0098] Here $\hat{g}(\theta)$ represents the area-normalized EMG peak shape vector to be optimized, with $\theta = (\mu, \sigma, \tau)$. The constraints keep the peak center $\mu$ within a reasonable time range, where $t_{\min}$ and $t_{\max}$ are the preset minimum and maximum time values, and keep the peak width $\sigma$ and the tailing $\tau$ positive. It should be noted that although the EMG peak shape vector is normalized (to fix the shape), during parameter optimization the model simultaneously solves for a scaling factor, or retains intensity information in the unnormalized values of the elution curve matrix, to reflect the concentration / abundance differences of the target component between samples. Therefore, the final output specific elution curve matrix contains the peak area information required for quantification.
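One building block of the per-component update can be sketched as follows. It computes the residual that a single component has to explain, then uses the fact that, for a fixed peak shape, the least-squares amplitude has a closed form, so a nonlinear solver would only need to search over the shape parameters. That reduction is a design choice of this sketch, not something the application states; all names and toy values are hypothetical.

```python
import numpy as np

def residual_for_component(X, C, S, i):
    """X minus the contributions of every component except i."""
    return X - C @ S + np.outer(C[:, i], S[i])

def best_amplitude(R, g_hat, s_i):
    """Least-squares amplitude a minimizing ||R - a * outer(g_hat, s_i)||_F^2."""
    num = g_hat @ R @ s_i
    den = (g_hat @ g_hat) * (s_i @ s_i)
    return max(num / den, 0.0)  # amplitudes are kept non-negative

# Hypothetical toy sample: 3 time points, 2 components, 3 m/z bins.
C = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]])
S = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
X = C @ S
R0 = residual_for_component(X, C, S, 0)
g_hat0 = C[:, 0] / C[:, 0].sum()       # unit-area shape (unit time spacing assumed)
a0 = best_amplitude(R0, g_hat0, S[0])  # should recover the column's area
```

In a full implementation, a bounded nonlinear least-squares routine would evaluate candidate (peak center, width, tailing) triples, build `g_hat` for each, and score it with the amplitude returned here.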
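[0099] Sub-step 4.3: Reconstruct the elution matrix $C_s$.
[0100] Using the optimal parameters $\theta_{s,i}^{*}$ obtained in sub-step 4.2, generate the corresponding normalized EMG peak shape vector $\hat{g}(\theta_{s,i}^{*})$ and, according to the optimal amplitude scaling factor $a_{s,i}^{*}$, update the corresponding column of the specific elution curve matrix of sample s: $C_s[:, i] = a_{s,i}^{*}\, \hat{g}(\theta_{s,i}^{*})$.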
[0101] Sub-step 4.4: Convergence criterion.
[0102] Calculate the relative reconstruction error of the current model from the reconstruction residuals of all samples.
[0103] Here rel_err is the current relative reconstruction error, and a small positive constant is included to prevent division by zero. If the change in rel_err relative to the previous iteration is less than tol, the iteration stops, where tol is the preset convergence tolerance (e.g., ...). Otherwise, return to sub-step 4.1 and continue iterating.
[0104] Step 5: Component sorting. The analysis results are uniformly sorted based on the median peak time of each component to ensure that "Component 1" has a consistent physical meaning in all samples.
[0105] To ensure that "component i" identifies the same chemical substance in all samples, the components need to be globally ordered by retention time. First, the median of the peak times (apex) of each component across all samples is calculated as its representative retention time: $t_i^{\mathrm{rep}} = \operatorname{median}_{s}\bigl( t[\arg\max_j C_s[j, i]] \bigr)$.
[0106] Here $\operatorname{median}(\cdot)$ represents the median, $t_i^{\mathrm{rep}}$ is a robust estimate of the retention time of component i, $C_s$ is the specific elution curve matrix of sample s, and $t[\cdot]$ converts the matrix index to physical time. Then, the index that sorts $t^{\mathrm{rep}}$ from smallest to largest is obtained as order = argsort($t^{\mathrm{rep}}$), where `argsort()` is the sorting index function. Finally, all related matrices and parameters are synchronously rearranged according to this index to obtain $S^{\mathrm{final}}$, $\Theta_s^{\mathrm{final}}$, and $C_s^{\mathrm{final}}$. The first row corresponds to the earliest-eluting component. Here $S^{\mathrm{final}}$ represents the final shared mass spectrometry matrix after sorting, $\Theta_s^{\mathrm{final}}$ represents the final peak shape function parameter matrix of sample s after sorting, and $C_s^{\mathrm{final}}$ represents the final specific elution curve matrix of sample s after sorting.
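The sorting step can be illustrated as below. The sketch assumes that each sample's parameter matrix stores one column per component (as the claims describe) and that all samples share one time axis after alignment; the function name and toy values are hypothetical.

```python
import numpy as np

def sort_components(t_ref, S, C_list, theta_list):
    """Reorder components by the median apex time across samples."""
    k = S.shape[0]
    # apex_times[i, s] = physical time at which component i peaks in sample s
    apex_times = np.array([[t_ref[int(np.argmax(C[:, i]))] for C in C_list]
                           for i in range(k)])
    rep_rt = np.median(apex_times, axis=1)  # representative retention times
    order = np.argsort(rep_rt)              # earliest-eluting component first
    S_sorted = S[order]                                 # rows of S
    C_sorted = [C[:, order] for C in C_list]            # columns of each C_s
    theta_sorted = [th[:, order] for th in theta_list]  # columns of each parameter matrix
    return order, S_sorted, C_sorted, theta_sorted

# Hypothetical toy case: 4 time points, 2 components, 2 samples.
t_ref = np.array([0.0, 1.0, 2.0, 3.0])
S = np.array([[1.0, 0.0], [0.0, 1.0]])
C_a = np.array([[0.0, 1.0], [0.0, 5.0], [1.0, 1.0], [4.0, 0.0]])
C_b = np.array([[0.0, 6.0], [1.0, 2.0], [3.0, 0.0], [1.0, 0.0]])
theta_a = np.array([[3.0, 1.0], [0.2, 0.2], [0.3, 0.3]])  # rows: center, width, tailing
theta_b = np.array([[2.0, 0.0], [0.2, 0.2], [0.3, 0.3]])
order, S_sorted, C_sorted, theta_sorted = sort_components(
    t_ref, S, [C_a, C_b], [theta_a, theta_b])
```

Using the median rather than the mean keeps a single badly fitted sample from reordering the components.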
[0107] Step 6: Model selection (determining the optimal group score k).
[0108] Since the number of components k is unknown, the optimal value is determined by traversing the candidate k values and evaluating a comprehensive score for each k.
[0109] The first term of the comprehensive score is the total reconstruction error (it is recommended to apply overall intensity normalization or standardization to the input matrix X before calculation so that this term matches the magnitude of the penalty term), and the second term is a complexity-based penalty term; the score also involves the total number of data elements and empirical weighting coefficients. The optimal group score is the candidate with the smallest comprehensive score, $k^{*} = \arg\min_k \mathrm{Score}(k)$. Meanwhile, the average explained variance (EV) and the average signal-to-noise ratio (SNR) are calculated as auxiliary evaluation metrics.
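[0110] Average explained variance EV (need EV ≥ 0.8).
[0111] Average signal-to-noise ratio SNR (need SNR ≥ 10 dB).
[0112] Both metrics compare the aligned data with the data reconstructed by the model.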
[0113] First, EV and SNR are used as validity constraints to screen out candidate group scores that satisfy EV≥0.8 and SNR≥10dB. Then, among these candidate group scores that meet the conditions, instead of pursuing the maximization of the index, the group score with the smallest comprehensive score is selected as the optimal solution. This is because a smaller comprehensive score means that the model complexity is the lowest while ensuring the accuracy of data fitting, thereby avoiding overfitting.
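The two-stage selection rule (validity filter first, then minimum score) can be sketched as follows. The dictionary layout, the function name, and the numeric values are made up for illustration; only the thresholds EV ≥ 0.8 and SNR ≥ 10 dB come from the text.

```python
def select_optimal_k(candidates, ev_min=0.8, snr_min_db=10.0):
    """Return the k with the smallest score among candidates passing both constraints."""
    valid = [c for c in candidates
             if c["ev"] >= ev_min and c["snr_db"] >= snr_min_db]
    if not valid:
        return None  # no candidate satisfies the validity constraints
    return min(valid, key=lambda c: c["score"])["k"]

# Hypothetical evaluation results for four candidate group scores.
candidates = [
    {"k": 2, "score": 0.10, "ev": 0.70, "snr_db": 8.0},   # lowest score, fails EV and SNR
    {"k": 3, "score": 0.15, "ev": 0.92, "snr_db": 14.0},
    {"k": 4, "score": 0.18, "ev": 0.95, "snr_db": 16.0},
    {"k": 5, "score": 0.25, "ev": 0.96, "snr_db": 9.0},   # fails SNR
]
best_k = select_optimal_k(candidates)
```

Here k = 2 has the lowest score but is rejected by the constraints, so k = 3 is chosen even though k = 4 has higher EV and SNR. Returning `None` when nothing passes is an assumption of this sketch; the text does not say how that case is handled.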
[0114] Step 7: Results Output and Visualization.
[0115] After the optimal model is determined, the system outputs quantitative results and visualization charts. The quantitative results include the sorted final shared mass spectrometry matrix, the sorted final specific elution curve matrix of each sample, and the sorted final peak shape function parameter matrix of each sample, saved as CSV or other formats. The visualization charts include: model selection curves showing the relationship between the k values and Score, EV, and SNR; mass spectrum bar charts for each component (…); elution profiles of each component in each sample (line plots); and a superimposed comparison chart of the raw data and the reconstructed data, used to visually verify the degree of fit.
[0116] Traditional GC-MS methods for analyzing complex mixtures such as biological samples, environmental samples, and drug components suffer from insufficient accuracy in the deconvolution of overlapping peaks and from large analytical errors caused by retention time drift across samples. To address these shortcomings, this application provides an automated, efficient, and interpretable multi-sample joint deconvolution method for GC-MS, which jointly analyzes the data of multiple samples, corrects for retention time differences, and leverages inter-sample correlations to improve component resolution. It can be widely applied in fields that rely on GC-MS analysis, such as metabolomics, environmental monitoring, food testing, and drug development, as well as in methods for identifying and quantifying components in complex GC-MS samples, and it is particularly suitable for the automated component resolution of complex mixture systems. This application first parses the raw GC-MS data of multiple samples of the same type from different batches (samples containing the same or similar target components, such as biological samples from the same source or repeatedly tested samples), extracting retention time, mass-to-charge ratio, and ion intensity. It then bins the data according to a preset mass-to-charge ratio range and bin width and accumulates the intensities within each bin to construct an intensity matrix specific to the sample of each batch. Next, using samples with stable signals as a reference, linear interpolation is used to align the intensity matrices of the remaining samples to the reference time axis, unifying the time axis. An exponentially modified Gaussian function containing a peak center, a Gaussian width, and an exponential decay rate is then used to model the component elution curves, constructing a sample-specific elution curve matrix. Based on the assumption that "the aligned intensity matrix of a sample ≈ elution curve matrix × shared mass spectrometry matrix", the non-negative shared mass spectrometry matrix with regularization constraints and the elution curve parameters of each sample are alternately optimized, iterating until the relative change of the objective function falls below the convergence threshold. Then, the numbers of components within a preset range are traversed, a comprehensive score fusing the model's reconstruction error and a complexity penalty term is calculated for each, and the number of components with the lowest score is selected as the optimal solution. Finally, the shared mass spectrum of each component, the sample-specific elution curves, the peak parameters, and evaluation indicators such as the explained variance ratio and the signal-to-noise ratio are output. This scheme enhances the ability to identify low-abundance and overlapping components, improves the consistency of cross-sample analysis, fits actual data more closely, and automates the full process. It has the following beneficial effects.
[0117] Improving the accuracy and reliability of component analysis: By constructing a shared mass spectrometry matrix by combining data from multiple samples, the ability to identify low-abundance or overlapping components is enhanced by utilizing the correlation between samples; at the same time, by decoupling the peak shape modeling from the amplitude parameter, this application can accurately reflect the relative quantitative differences of target components among different samples while ensuring the stability of peak shape fitting, and reducing the randomness error of single-sample analysis.
[0118] Eliminate the effects of time drift: Unify the time axis of all samples through optional time alignment to ensure consistency of cross-sample component matching and improve the reliability of inter-sample comparisons.
[0119] Optimized peak shape fitting: The EMG function is used to accurately simulate the asymmetric elution characteristics of the chromatographic peaks, which is closer to the actual experimental data than the traditional symmetric peak shape model, thus improving the fitting accuracy of the elution curve.
[0120] Automation and efficiency: It achieves fully automated processing from data parsing and preprocessing to deconvolution, and combines parallel computing to accelerate parameter optimization, making it suitable for large-scale sample analysis scenarios.
[0121] High interpretability: The output of shared mass spectra, sample-specific elution curves, and quantitative evaluation indicators provides a direct basis for subsequent qualitative (such as database matching) and quantitative analysis of components.
[0122] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0123] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0124] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A multi-sample joint deconvolution method applied to GC-MS technology, characterized in that the method includes: obtaining the GC-MS data matrix of the sample in each batch and presetting multiple candidate group scores, wherein one batch corresponds to one sample and the samples in all batches are of the same category; time-aligning the GC-MS data matrix of each sample to obtain the aligned matrix of each sample; for any candidate group score, performing non-negative matrix factorization on the aligned matrix of each sample according to the candidate group score to obtain the NMF elution curve matrix and the NMF spectral coefficient matrix of each sample under the candidate group score; obtaining, based on the NMF elution curve matrix and the NMF spectral coefficient matrix of each sample under the candidate group score, the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample at the initial cycle number under the candidate group score; alternately and iteratively optimizing the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample at the initial cycle number under the candidate group score to obtain the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score; sorting by component the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score; obtaining the index value corresponding to the candidate group score based on the sorted final shared mass spectrometry matrix under the candidate group score and the sorted final specific elution curve matrix of each sample; determining the optimal group score based on the index value corresponding to each candidate group score; and outputting the optimal group score together with, under the optimal group score, the sorted final shared mass spectrometry matrix, the sorted final specific elution curve matrix of each sample, and the sorted final peak shape function parameter matrix of each sample.
2. The multi-sample joint deconvolution method applied to GC-MS technology according to claim 1, characterized in that, Based on the NMF elution curve matrix and NMF spectral coefficient matrix of each sample under the candidate group score, the shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number under the candidate group score are obtained, specifically including: The least squares fitting method is used to fit the peak shape function to the NMF elution curve matrix of each sample under the candidate group score, so as to obtain the peak shape function parameter matrix and amplitude scaling coefficient matrix of each sample under the initial number of cycles under the candidate group score; Based on the peak shape function and the peak shape function parameter matrix and amplitude scaling factor matrix of each sample under the initial number of cycles under the candidate group score, construct the specific elution curve matrix of each sample under the initial number of cycles under the candidate group score; A shared mass spectrometry matrix for the initial number of cycles under the candidate group score is generated based on the NMF spectral coefficient matrix of each sample under the candidate group score.
3. The multi-sample joint deconvolution method applied to GC-MS technology according to claim 1, characterized in that, The shared mass spectrometry matrix, the peak shape function parameter matrix of each sample, and the specific elution curve matrix of each sample under the initial cycle number at the candidate group score are alternately and iteratively optimized to obtain the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate group score, specifically including: At the current cycle number, based on the aligned matrix of each sample and the specific elution curve matrix of each sample at the current cycle number under the candidate group score, the shared mass spectrometry matrix at the current cycle number under the candidate group score is updated to obtain the shared mass spectrometry matrix at the next cycle number under the candidate group score. Under the condition of the shared mass spectrometry matrix in the next cycle number under the candidate group score, the peak shape function parameter matrix and amplitude scaling factor matrix of each sample in the current cycle number under the candidate group score are updated according to the matrix after each sample is aligned and the specific elution curve matrix of each sample in the current cycle number under the candidate group score, so as to obtain the peak shape function parameter matrix and amplitude scaling factor matrix of each sample in the next cycle number under the candidate group score. Based on the peak function parameter matrix, amplitude scaling factor matrix, and peak function of each sample in the next cycle under the candidate group score, obtain the specific elution curve matrix of each sample in the next cycle under the candidate group score, update the cycle number, and enter the next cycle until the cycle stop condition is reached. 
The shared mass spectrometry matrix at the last cycle number under the candidate group score is determined as the final shared mass spectrometry matrix under the candidate group score; the peak shape function parameter matrix of each sample at the last cycle number under the candidate group score is determined as the final peak shape function parameter matrix of each sample under the candidate group score; and the specific elution curve matrix of each sample at the last cycle number under the candidate group score is determined as the final specific elution curve matrix of each sample under the candidate group score.
4. The multi-sample joint deconvolution method applied to GC-MS technology according to claim 3, characterized in that, Based on the aligned matrix of each sample and the specific elution curve matrix of each sample at the current cycle number under the candidate group score, the shared mass spectrometry matrix at the current cycle number under the candidate group score is updated to obtain the shared mass spectrometry matrix at the next cycle number under the candidate group score, specifically including: Based on the specific elution curve matrix of all samples under the current cycle number at the candidate group score, calculate the cumulative autocorrelation matrix across samples; Based on the aligned matrix of each sample and the specific elution curve matrix of each sample under the current cycle number at the candidate group score, calculate the cumulative cross-correlation matrix across samples; The regularized normal equation is solved based on the cumulative autocorrelation matrix and the cumulative cross-correlation matrix across samples to obtain the unconstrained shared mass spectrometry estimation results; The unconstrained shared mass spectrometry estimation results are truncated by non-negative projection, and the shared mass spectrometry matrix under the current cycle number under the candidate group score is updated to obtain the shared mass spectrometry matrix under the next cycle number under the candidate group score.
5. A multi-sample joint deconvolution method for GC-MS technology according to claim 3, characterized in that, under the condition of the shared mass spectrometry matrix at the next cycle number under the candidate group score, updating the peak shape function parameter matrix and the amplitude scaling factor matrix of each sample at the current cycle number under the candidate group score, based on the aligned matrix of each sample and the specific elution curve matrix of each sample at the current cycle number under the candidate group score, to obtain the peak shape function parameter matrix and the amplitude scaling factor matrix of each sample at the next cycle number under the candidate group score, specifically includes: for any component in any sample, calculating the residual matrix after removing the target components, based on the aligned matrix of the sample, the shared mass spectrometry data corresponding to the target components in the shared mass spectrometry matrix at the next cycle number under the candidate group score, and the specific elution curves corresponding to the target components in the specific elution curve matrix of the sample at the current cycle number under the candidate group score, wherein the target components are the components other than said component; based on the residual matrix of the sample after removing the target components, using a nonlinear least squares fitting method to solve the objective function $(\theta_{s,i}^{*}, a_{s,i}^{*}) = \arg\min_{\theta_{s,i},\, a_{s,i}} \lVert R_{s,i} - a_{s,i}\, \hat{g}(\theta_{s,i})\, s_i^\top \rVert_F^2$ to obtain the optimal peak shape function parameters and the optimal amplitude scaling factor of the sample for said component; wherein $\theta_{s,i}^{*}$ represents the optimal peak shape function parameters of sample s for component i, $a_{s,i}^{*}$ represents the optimal amplitude scaling factor of sample s for component i, $\arg\min$ indicates the peak shape function parameters $\theta_{s,i}$ and the amplitude scaling factor $a_{s,i}$ of sample s for component i that make $\lVert R_{s,i} - a_{s,i}\, \hat{g}(\theta_{s,i})\, s_i^\top \rVert_F^2$ smallest, $R_{s,i}$ represents the residual matrix of sample s for component i, $\hat{g}(\theta_{s,i})$ represents the normalized peak shape vector, the peak shape vector being obtained by substituting $\theta_{s,i}$ into the peak shape function, $s_i^\top$ represents the transpose of the shared mass spectrometry data corresponding to component i in the shared mass spectrometry matrix at the next cycle number under the candidate group score, and $\lVert\cdot\rVert_F^2$ denotes the square of the Frobenius norm; constructing the peak shape function parameter matrix of the sample at the next cycle number under the candidate group score from the optimal peak shape function parameters of the sample for each component; and constructing the amplitude scaling factor matrix of the sample at the next cycle number under the candidate group score from the optimal amplitude scaling factor of the sample for each component.
6. A multi-sample joint deconvolution method for GC-MS technology according to claim 3, characterized in that obtaining the specific elution curve matrix of each sample at the next cycle number under the candidate group score, based on the peak shape function parameter matrix and the amplitude scaling factor matrix of each sample at the next cycle number under the candidate group score and on the peak shape function, specifically includes: for any component in any sample, inputting the peak shape function parameters corresponding to the component in the peak shape function parameter matrix of the sample at the next cycle number under the candidate group score into the peak shape function to obtain the peak shape vector corresponding to the component in the sample; normalizing the peak shape vector corresponding to the component in the sample; multiplying the normalized peak shape vector corresponding to the component in the sample by the optimal amplitude scaling factor of the sample for the component to obtain the product corresponding to the component; and assembling the products corresponding to all components in the sample to obtain the specific elution curve matrix of the sample at the next cycle number.
7. A multi-sample joint deconvolution method for GC-MS technology according to claim 3, characterized in that determining whether the loop termination condition has been met specifically includes: at the current iteration number, calculating the relative reconstruction error at the current iteration number according to the error formula; if the convergence condition is satisfied, determining that the loop termination condition has been met; otherwise, determining that the loop termination condition has not been met; wherein the symbols of the error formula and the convergence condition denote, respectively: the relative reconstruction error at the current iteration number; the total number of samples; the aligned matrix of sample s; the specific elution curve matrix of sample s at the next iteration number under the candidate component number; the shared mass spectrometry matrix at the next iteration number under the candidate component number; a small constant with no practical significance, used only to prevent the denominator from being 0; the relative reconstruction error at the previous iteration number; and the preset convergence tolerance threshold; max() denotes taking the maximum value, ‖·‖_F² denotes the square of the Frobenius norm, and |·| denotes the absolute value.
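The formulas of claim 7 are not reproduced in this text, so the following is only one plausible form that is consistent with the listed terms: the summed squared Frobenius residual over all samples divided by the summed squared Frobenius norm of the aligned matrices, with termination when the relative change in error falls below the tolerance. The function names and the default values of `eps` and `tol` are assumptions.

```python
import numpy as np

EPS = 1e-12  # assumed small constant guarding against a zero denominator

def relative_reconstruction_error(X_list, C_list, S, eps=EPS):
    """Summed squared residual over samples, relative to the summed squared data norm."""
    num = sum(np.linalg.norm(X - C @ S, "fro") ** 2 for X, C in zip(X_list, C_list))
    den = sum(np.linalg.norm(X, "fro") ** 2 for X in X_list)
    return float(num / (den + eps))

def has_converged(err_now, err_prev, tol=1e-6, eps=EPS):
    """True when the relative change in error is below the tolerance."""
    return abs(err_prev - err_now) / max(err_prev, eps) < tol

# Toy data: three samples sharing one spectrum matrix S (k x n).
rng = np.random.default_rng(0)
S = rng.random((2, 6))
C_list = [rng.random((8, 2)) for _ in range(3)]
X_list = [C @ S for C in C_list]

err_exact = relative_reconstruction_error(X_list, C_list, S)
err_scaled = relative_reconstruction_error(X_list, [1.1 * C for C in C_list], S)
```

Scaling every elution curve by 1.1 leaves a residual of 0.1 times the data, so `err_scaled` is close to 0.01, while the exact factors give an error of essentially zero.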
8. A multi-sample joint deconvolution method for GC-MS technology according to claim 1, characterized in that sorting by component the final shared mass spectrometry matrix, the final peak shape function parameter matrix of each sample, and the final specific elution curve matrix of each sample under the candidate component number specifically includes: for any component, identifying, in the final specific elution curve matrix of each sample under the candidate component number, the retention time at which the component reaches its peak, thereby obtaining the peak retention times of the component in all samples; obtaining the representative retention time of the component across samples based on its peak retention times in all samples; sorting all components in ascending order of their representative retention times across samples to obtain the component index order; and, based on the component index order, reordering the row vectors corresponding to each component in the final shared mass spectrometry matrix under the candidate component number, the column vectors corresponding to each component in the final specific elution curve matrix of each sample under the candidate component number, and the column vectors corresponding to each component in the final peak shape function parameter matrix of each sample under the candidate component number, to obtain the sorted final shared mass spectrometry matrix, the sorted final specific elution curve matrix of each sample, and the sorted final peak shape function parameter matrix of each sample under the candidate component number.
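The sorting steps of claim 8 can be sketched as follows. The source does not say how the representative retention time is derived from the per-sample peak times; the median is used here purely as an assumption, and `sort_components` is a hypothetical name. Rows of the shared spectrum matrix and columns of the per-sample matrices are reordered with the same index order so that component i keeps referring to the same compound everywhere.

```python
import numpy as np

def sort_components(times, S, C_list, P_list):
    """Reorder components by ascending representative retention time.

    S:       (k, n) shared spectra, one row per component
    C_list:  per-sample (m, k) elution curve matrices, one column per component
    P_list:  per-sample (p, k) peak shape parameter matrices, one column per component
    """
    # Peak retention time of each component in each sample: shape (num_samples, k).
    peak_times = np.array([times[C.argmax(axis=0)] for C in C_list])
    representative = np.median(peak_times, axis=0)  # assumption: median across samples
    order = np.argsort(representative, kind="stable")
    S_sorted = S[order, :]
    C_sorted = [C[:, order] for C in C_list]
    P_sorted = [P[:, order] for P in P_list]
    return order, S_sorted, C_sorted, P_sorted

# Toy data: component 0 elutes late, component 1 elutes early.
times = np.arange(6.0)
C1 = np.zeros((6, 2)); C1[4, 0] = 1.0; C1[1, 1] = 2.0
C2 = np.zeros((6, 2)); C2[5, 0] = 1.0; C2[1, 1] = 3.0
S = np.array([[1.0, 0.0],
              [0.0, 1.0]])
P1 = np.array([[4.0, 1.0], [0.5, 0.3]])
P2 = np.array([[5.0, 1.0], [0.5, 0.3]])

order, S_sorted, C_sorted, P_sorted = sort_components(times, S, [C1, C2], [P1, P2])
```

After sorting, the early-eluting component occupies index 0 in all three kinds of matrix.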
9. A multi-sample joint deconvolution method for GC-MS technology according to claim 1, characterized in that obtaining the index values corresponding to the candidate component number based on the sorted final shared mass spectrometry matrix under the candidate component number and the sorted final specific elution curve matrix of each sample, and determining the optimal component number based on the index values corresponding to each candidate component number, specifically includes: calculating the comprehensive score corresponding to the candidate component number according to the comprehensive score formula, wherein the symbols denote, respectively: the comprehensive score corresponding to candidate component number k; the total number of samples; the aligned matrix of sample s; ‖·‖_F², the square of the Frobenius norm; m, the number of retention time scan points of the GC-MS data matrix of any sample; n, the number of mass-to-charge ratio bins of the GC-MS data matrix of any sample, all samples having the same number of retention time scan points and the same number of mass-to-charge ratio bins; the sorted final specific elution curve matrix of sample s under the candidate component number; and the sorted final shared mass spectrometry matrix under the candidate component number; calculating the average explained variance corresponding to the candidate component number according to the explained variance formula, wherein the symbols denote the average explained variance corresponding to candidate component number k and a small constant with no practical significance, used only to prevent the denominator from being 0; calculating the average signal-to-noise ratio corresponding to the candidate component number according to the signal-to-noise ratio formula, wherein the symbol denotes the average signal-to-noise ratio corresponding to candidate component number k; and determining the optimal component number based on the comprehensive score, average explained variance, and average signal-to-noise ratio corresponding to each candidate component number.
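Of the three indices in claim 9, only the average explained variance has a widely used standard form, so only that one is sketched here. The definition below (one minus the squared residual divided by the squared data norm, averaged over samples) is an assumption and may differ from the claimed formula; the comprehensive score and signal-to-noise formulas are not reproduced in this text and are deliberately left out. `average_explained_variance` is a hypothetical name.

```python
import numpy as np

def average_explained_variance(X_list, C_list, S, eps=1e-12):
    """Mean over samples of 1 - ||X - C S||_F^2 / (||X||_F^2 + eps)."""
    values = []
    for X, C in zip(X_list, C_list):
        rss = np.linalg.norm(X - C @ S, "fro") ** 2  # residual sum of squares
        tss = np.linalg.norm(X, "fro") ** 2          # total sum of squares (uncentered)
        values.append(1.0 - rss / (tss + eps))
    return float(np.mean(values))

# Toy data: exact factors explain everything, all-zero curves explain nothing.
rng = np.random.default_rng(1)
S = rng.random((3, 5))
C_list = [rng.random((7, 3)) for _ in range(2)]
X_list = [C @ S for C in C_list]

ev_exact = average_explained_variance(X_list, C_list, S)
ev_zero = average_explained_variance(X_list, [np.zeros_like(C) for C in C_list], S)
```

Evaluating this function once per candidate, on the sorted final matrices, would give the values needed for the explained variance index of each candidate.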
10. A multi-sample joint deconvolution method for GC-MS technology according to claim 9, characterized in that, after outputting the optimal component number together with the sorted final shared mass spectrometry matrix, the sorted final specific elution curve matrix of each sample, and the sorted final peak shape function parameter matrix of each sample under each candidate component number, the method further includes: plotting a comprehensive score curve with the candidate component numbers on the x-axis and the corresponding comprehensive scores on the y-axis; plotting an average explained variance curve with the candidate component numbers on the x-axis and the corresponding average explained variances on the y-axis; plotting an average signal-to-noise ratio curve with the candidate component numbers on the x-axis and the corresponding average signal-to-noise ratios on the y-axis; generating the mass spectrum of each component based on the sorted final shared mass spectrometry matrix under the optimal component number; generating the specific elution curve of each component based on the sorted final specific elution curve matrix of each sample under the optimal component number; and generating a superimposed comparison plot of the aligned matrix of a target sample and the reconstructed data matrix under the optimal component number, the reconstructed data matrix being obtained from the sorted final specific elution curve matrix of the target sample under the optimal component number and the sorted final shared mass spectrometry matrix under the optimal component number.