Metabolite retention time domain determination method, annotation method, and apparatus

By using a method for determining the retention time domain of metabolites and a machine learning model, the retention time drift problem in metabolomics data processing was solved, improving the accuracy and efficiency of metabolite annotation.

CN122307016A · Pending · Publication Date: 2026-06-30 · SUZHOU BIONOVOGENE BIOMEDICAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SUZHOU BIONOVOGENE BIOMEDICAL TECH CO LTD
Filing Date
2024-12-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, metabolomics data processing suffers from problems such as high noise, multi-dimensionality, difficulty in identification, and high irregularity, and retention time drift leads to errors in metabolite annotation.

Method used

A method for determining the retention time domain of metabolites was adopted. By obtaining the retention times of metabolites under the same chromatographic conditions, calibrators were identified and base point compounds were determined. The retention time domain was divided, and the division and matching of the retention time domain were optimized by combining machine learning models and cluster analysis.

Benefits of technology

It effectively reduces the deviation caused by retention time drift, improves the accuracy of identifying unknown metabolites, simplifies the process of exploring chromatographic conditions, and improves the accuracy of metabolite annotation.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of metabolomics technology, specifically to a method for determining the retention time domains of metabolites, an annotation method, and an apparatus. The method for determining the retention time domains analyzes the distribution of retention times of a group of metabolites to identify base point compounds or base points. The base points divide the time interval in which the retention times of the metabolites occur into multiple time domains. Combining these time domains with the retention time of each individual metabolite gives the retention time domain of each metabolite. In the annotation of unknown metabolites, using retention time domain information can further improve annotation accuracy and avoid the influence of retention time drift or adjustments to chromatographic conditions.

Description

Technical Field

[0001] This application relates to the field of metabolomics technology, specifically to methods, annotation methods, and apparatus for determining the retention time domain of metabolites.

Background Technology

[0002] Metabolomics data is extremely large, especially non-targeted metabolomics data. Generally, metabolomics data exhibits the following characteristics: ① High noise level: A large number of endogenous small molecules exist in organisms, maintaining physiological functions. However, only a small percentage of biomarkers and functional metabolites have specific research value. The vast majority of metabolites are unrelated to the research objective, making the few functional metabolites subject to severe noise interference from a large number of useless metabolites within the overall metabolite background. ② High dimensionality (relatively small sample size): Typically, the number of metabolites detected in non-targeted metabolomics far exceeds the sample size. Therefore, traditional statistical methods are not suitable for processing metabolomics data. ③ Significant identification difficulties: Multiple factors make the identification and characterization of metabolomics data quite challenging. For example, the presence of isomers, metabolites with similar physicochemical properties, the complexity of liquid-phase systems, and the difficulty in resolving the metabolite mass spectrometry structure. ④ High irregularity: The distribution of metabolomics data is highly irregular, with frequent occurrences of zero values. This necessitates more complex and reasonable statistical analysis strategies to uncover the hidden complex data relationships.

[0003] Existing methods generally identify metabolites using three dimensions: retention time (RT), first-order mass spectrometry (MS1), and secondary mass spectrometry (MS2). However, most databases on the market lack RT information. Even when RT information is available, the RT values detected under different conditions vary. Furthermore, even under the same conditions, RT values can drift, leading to errors in metabolite annotation.

Summary of the Invention

[0004] To address the retention time drift problem, this application provides a method, annotation method, and apparatus for determining the retention time domain of metabolites. The retention time of a metabolite is replaced by a retention time domain, and the time range of the retention time domain is determined. In the identification of unknown metabolites, known compounds are searched for and annotated according to the retention time domain into which the unknown metabolite falls.

[0005] The first aspect of this application provides a method for determining the retention time domain of metabolites, including:

[0006] S1. Obtain the retention times of n metabolites under the same chromatographic conditions;

[0007] S2. Based on the distribution of retention times of the n metabolites, select i metabolites from the n metabolites as calibrators;

[0008] S3. Determine the base point compound and base point based on the calibrator and its retention time, wherein the base point divides the running time under the chromatographic conditions into i+1 time domains;

[0009] S4. For each metabolite, the retention time domain is determined based on the retention time under the chromatographic conditions and the i+1 time domains.

[0010] Where n and i are both positive integers, and n is greater than i.

[0011] A second aspect of this application provides a method for metabolite annotation, including:

[0012] The retention time domains of known metabolites are obtained using the aforementioned method for determining metabolite retention time domains.

[0013] The sample is analyzed under the chromatographic conditions to obtain the retention time of the unknown metabolite;

[0014] Among the known metabolites belonging to the retention time domain into which the retention time of the unknown metabolite falls, one is selected to annotate the unknown metabolite.

[0015] A third aspect of this application provides a device for annotating metabolites, comprising:

[0016] The metabolite information acquisition module is used to acquire retention time and mass spectrometry data of unknown metabolites;

[0017] An annotation module is used to annotate the unknown metabolites based on their retention time, mass spectrometry data, and a pre-created metabolite database;

[0018] The metabolite database includes retention time domains for multiple known metabolites, and the retention time domain of at least one known metabolite is obtained by the aforementioned method for determining metabolite retention time domains.

[0019] A fourth aspect of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory, the processor executing the computer program to implement the aforementioned method for determining the metabolite retention time domain or the metabolite annotation method.

[0020] A fifth aspect of this application provides a computer program product, including a computer program that, when executed by a processor, implements the aforementioned method for determining the metabolite retention time domain or the metabolite annotation method.

[0021] The various technical solutions provided in this application bring at least one of the following beneficial effects:

[0022] By analyzing the retention time distribution of a large number of metabolites under chromatographic conditions, the running time is rationally divided, and the retention time of each metabolite is combined to obtain the retention time domain for each metabolite. Using this retention time domain information for annotation of unknown metabolites can effectively reduce the bias caused by retention time drift, resulting in higher accuracy of annotation results when identifying unknown metabolites.

[0023] For multiple chromatographic conditions under the same chromatographic system, the elution order (retention time order) of compounds is generally the same. That is, the elution order of other metabolites relative to the base point compounds remains unchanged. Therefore, the retention time domains obtained by using the retention times of the base point compounds as base points also remain unchanged: however the chromatographic conditions are varied within the same chromatographic system, the retention time domain of a metabolite stays constant. Using the retention time domain instead of the retention time allows retention time domains to be transferred directly between conditions of the same chromatographic system. This eliminates the need to reconfirm the retention time every time the chromatographic conditions are changed, simplifying the process of exploring chromatographic conditions. Retention time data for the same chromatographic system in existing databases or literature can also be further utilized through retention time domains to improve the accuracy of metabolite annotation.

Attached Figure Description

[0024] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments of this application will be briefly introduced below.

[0025] Figure 1 is a flowchart of the method for determining the retention time domain of metabolites provided in an embodiment of this application;

[0026] Figure 2 is a schematic diagram of metabolite retention times, base points, and retention time domains provided in an embodiment of this application;

[0027] Figure 3 is a flowchart of the MetaOffset method provided in an embodiment of this application;

[0028] Figure 4 is a schematic flowchart of the metabolite annotation method provided in an embodiment of this application;

[0029] Figure 5 is a schematic diagram of the metabolite annotation device provided in an embodiment of this application;

[0030] Figure 6 is a schematic diagram of the electronic device provided in an embodiment of this application;

[0031] Figure 7 is a schematic diagram of the retention time distribution of multiple known metabolites in serum under the same chromatographic conditions, provided in an embodiment of this application;

[0032] Figure 8 is a schematic diagram of the predicted retention time distribution of multiple known metabolites in serum under the same chromatographic conditions, provided in an embodiment of this application;

[0033] Figure 9 shows the accuracy verification results for retention time domains obtained with the MetaOffset method, with retention times predicted by the RT-Transformer model;

[0034] Figure 10 shows an example of the effect of using retention time domains to annotate unknown metabolites.

Detailed Implementation

[0035] The embodiments of this application are described below with reference to the accompanying drawings. It should be understood that the embodiments described below with reference to the accompanying drawings are exemplary descriptions for explaining the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions of the embodiments of this application.

[0036] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” “each,” “this,” and “the” used herein may also include plural forms. The terms “first,” “second,” “third,” “i,” “j,” “n,” etc. (if present), used herein are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence, nor do they indicate any difference between them. It should be understood that such data used herein can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in a sequence other than that shown in the illustrations or textual descriptions. Furthermore, the term “and / or” used herein indicates at least one of the items defined by the term; for example, “A and / or B” can be implemented as “A,” or as “B,” or as “A and B.”

[0037] To make the purpose, technical solution and advantages of this application clearer and easier to understand, several terms involved in this application will be introduced and explained first.

[0038] Chromatographic system and chromatographic conditions

[0039] A chromatographic system refers to the instrument system used for chromatographic analysis. Common chromatographic systems include gas chromatography (GC), liquid chromatography (LC), and ion exchange chromatography (IEC), among others. The chromatographic separation principles of the same chromatographic system are generally the same. For example, LC separates different substances based on their different partition coefficients in the stationary and mobile phases. LC is further divided into normal-phase and reversed-phase systems based on the polarity of the stationary and mobile phases. The elution order of multiple substances in different normal-phase or reversed-phase chromatographic systems generally remains unchanged. Gas chromatography separates different substances based on differences in boiling point, polarity, and adsorption properties. Substances with lower boiling points generally elute first, and substances with higher boiling points generally elute later. The elution order of multiple substances in a gas chromatography system also generally remains unchanged.

[0040] Chromatographic conditions refer to the specific parameters and requirements used in chromatographic analysis, including: chromatographic column, mobile phase, detector (such as ultraviolet detector, electro-negation detector, mass spectrometry detector), column temperature, flow rate, and other parameter settings. In this application, "the same chromatographic system" or "identical chromatographic system" means that two or more chromatographic conditions correspond to the same chromatographic system. Here, "identical" means that the principle of chromatographic separation is the same, and that changes in the chromatographic conditions within that system do not alter the order of peak elution (or retention time) of different metabolites. These changes in chromatographic conditions include: using different models or manufacturers of the same type of chromatographic column; routinely adjusting the composition of the mobile phase (such as adjusting the buffer salt and organic phase ratio, or replacing the aqueous or organic phase with equivalent values); changing the flow rate; changing the column temperature, etc.

[0041] Retention time, retention time domain, base point, base point compound

[0042] Retention time (RT or rt) has a general meaning in analytical chemistry; it refers to the time from the start of sample injection to the appearance of the maximum concentration of a component at the post-column level, or the time elapsed from the start of injection to the appearance of the peak of a chromatographic component. It indicates the ease with which the analyte (metabolite or other compound) is eluted under the given chromatographic conditions. The retention time in this application can also be replaced by other indicators with similar indicative functions, such as relative retention time or corrected retention time.

[0043] The retention time domain refers to the time range within which the chromatographic peak of an analyte (metabolite or other compound) may appear under the given chromatographic conditions, i.e., the time period during which the analyte may elute. The endpoints of this time period are base points. The base points divide the running time under the chromatographic conditions into time domains and serve as a reference for metabolite retention times. A base point can be a fixed time point or the retention time of a compound measured under the chromatographic conditions in use; such a compound is a base point compound. For relatively stable chromatographic conditions, the retention time remains essentially constant, and a fixed time point (i.e., the retention time of a base point compound detected under that chromatographic condition) can be used as the base point. If a chromatographic condition is unstable, with significant retention time variations, or if different chromatographic conditions are used under the same chromatographic system, then the retention time of the base point compound detected in each run is preferably used as the base point. Under the same chromatographic conditions or the same chromatographic system, the elution order of compounds is generally the same. Thus, even if the retention times vary greatly, the retention times of the base point compounds change synchronously, and the compounds eluting between two base points remain unchanged (i.e., the retention time domain remains unchanged).

[0044] The chromatographic conditions corresponding to the retention times of compounds given in existing databases are often different from the chromatographic conditions actually used in the laboratory. Therefore, when annotating unknown metabolites, the laboratory cannot match its measurements against the retention times in the database, and the retention time information goes unused. However, as long as the two chromatographic systems are the same and the same base point compounds can be used, consistent retention time domains can be obtained. The retention time (domain) information can then be used to improve the accuracy of unknown metabolite annotation.
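The transfer of retention time domains between conditions of the same chromatographic system can be illustrated with a minimal sketch. It is not taken from the application: the compound names (M1 to M4, P1, P2), the retention times, and the function name `assign_domains` are hypothetical, and a retention time equal to a base point is assumed to fall in the later domain. Under both conditions every retention time changes, but the elution order does not.

```python
def assign_domains(rts, base_point_compounds):
    """Map each compound to the index of its retention time domain.

    The base points are the retention times of the base point compounds
    measured under the same condition as `rts`. The domain index is the
    number of base points at or before the compound's retention time.
    """
    base_points = [rts[name] for name in base_point_compounds]
    return {name: sum(bp <= rt for bp in base_points) for name, rt in rts.items()}

# Hypothetical retention times (minutes) under two conditions of one system.
condition_1 = {"M1": 1.2, "P1": 2.0, "M2": 2.6, "M3": 3.1, "P2": 4.0, "M4": 5.5}
condition_2 = {"M1": 1.5, "P1": 2.6, "M2": 3.3, "M3": 3.9, "P2": 5.1, "M4": 6.8}

domains_1 = assign_domains(condition_1, ["P1", "P2"])
domains_2 = assign_domains(condition_2, ["P1", "P2"])
print(domains_1 == domains_2)  # True: the domains carry over unchanged
```

Because the domain index depends only on the position of a compound relative to the base point compounds, a domain table built under one condition can be reused under the other without re-measuring every retention time.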

[0045] Machine learning, machine learning models

[0046] Machine learning (ML) is a branch of artificial intelligence that aims to enable computer systems to automatically learn from data and make decisions, predictions, and discover patterns based on the learned knowledge. A machine learning model is a mathematical representation or function used in machine learning to predict, classify, cluster, or perform other tasks on data. It is the core component of machine learning algorithms, and it learns from training data to capture relationships and patterns between data, thereby making predictions or inferences on new data.

[0047] In some embodiments of this application, machine learning methods are used to predict the retention time of metabolites. First, a training set, a validation set, and a test set are constructed using known metabolites and related information. A pre-defined machine learning model is then trained and adjusted to obtain a retention time prediction model that meets expectations. Specifically, the molecular characterization of the metabolites and chromatographic conditions (such as column type, mobile phase information, gradient, column temperature, etc.) are used as model inputs, and the output is the retention time. Models that can be used for retention time prediction include RT-Transformer, 1D CNN, RGCN, GNN-RT, MPNN, CPORT, and DNN (deep neural network), as specifically described in Xue J, Wang B, Ji H, Li W. RT-Transformer: retention time prediction for metabolite annotation to assist in metabolite identification. Bioinformatics. 2024 Mar 4;40(3):btae084. doi:10.1093/bioinformatics/btae084 and CN119068999A. In addition to the machine learning models listed above, other models that can be used for metabolite retention time prediction should also be considered part of this application.

[0048] Cluster analysis, clusters, cluster centers

[0049] Cluster analysis divides data objects (datasets) into several groups (subsets) based on similarity or distance. Data points within each group are similar (correlated), while data points in different groups are different (unrelated). The greater the similarity within a group and the greater the difference between groups, the better the clustering effect. The groups or subsets in cluster analysis are called clusters, which are the basic units of cluster analysis. Cluster centers represent a specific cluster; other samples determine their membership by calculating their distance from the cluster center, which can be the mean or median of all data points within that cluster.

[0050] In some embodiments of this application, cluster analysis is used to analyze a set of retention times, clustering them into multiple clusters. Calibrators or base points are selected, or retention time domains are divided, based on the clusters. The cluster analysis methods listed in this application include K-means clustering, affinity propagation, DBSCAN (density-based spatial clustering of applications with noise), BIRCH (balanced iterative reducing and clustering using hierarchies), agglomerative clustering, and mean shift. In addition to these cluster analysis methods, other methods capable of clustering a set of numerical values should also be considered part of this application.
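As an illustration of clustering a set of retention times, the following is a minimal sketch of one-dimensional K-means (Lloyd's algorithm) in plain Python. It is not the applicant's implementation: the function name `kmeans_1d`, the evenly spaced choice of initial centers, and the retention times are assumptions made for the example, and in practice a library implementation would normally be used.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values into k clusters with Lloyd's k-means.

    Initial centers are picked at evenly spaced positions in the sorted
    data (an assumption; other initializations are possible).
    Returns (centers, clusters).
    """
    data = sorted(values)
    centers = [data[(2 * c + 1) * len(data) // (2 * k)] for c in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[idx]
                       for idx, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical retention times, in minutes.
rts = [1.0, 1.1, 1.3, 4.0, 4.2, 4.3, 8.0, 8.1]
centers, clusters = kmeans_1d(rts, k=3)
print(clusters)  # [[1.0, 1.1, 1.3], [4.0, 4.2, 4.3], [8.0, 8.1]]
```

The gaps between neighboring clusters are then natural places to look for calibrators or base points.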

[0051] In existing techniques for substance annotation using retention time information, the retention time of the analyte is typically compared with the retention times of known metabolites, taken from a database or obtained through detection. Metabolites with similar or close retention times are matched to the analyte. To allow for retention time drift, a certain error range is usually set during the retention time matching process. However, retention time drift varies for different compounds and under different chromatographic conditions. A single error range is not applicable to all metabolites, which can negatively impact the annotation of some metabolites, leading to annotation errors. For example, suppose the retention times of metabolites A1, B1, and C1 are known to be 2 min, 3 min, and 3.5 min, respectively, and the retention time error range is set to ±0.2 min. The retention times of unknown metabolites A2 and B2 (which are in fact A1 and B1, respectively) are detected to be 2.1 min and 3.3 min. In this case, the retention time of A2 is within the range of 2 min ± 0.2 min, so A2 is matched with A1. The retention time of B2 exceeds the range of 3 min ± 0.2 min, but is within the range of 3.5 min ± 0.2 min, so B2 is wrongly matched with C1. Here the retention time information has a negative impact on the annotation of B2.
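The example above can be reproduced with a short sketch of fixed-window matching. The function name `match_by_tolerance` is hypothetical; the retention times and the ±0.2 min window are those of the example, converted to seconds so that the comparison at the window edge is exact.

```python
def match_by_tolerance(unknown_rt, known_rts, tolerance):
    """Return the known metabolites whose retention time lies within ±tolerance."""
    return [name for name, rt in known_rts.items()
            if abs(unknown_rt - rt) <= tolerance]

# A1 = 2 min, B1 = 3 min, C1 = 3.5 min, expressed in seconds.
known_rts = {"A1": 120, "B1": 180, "C1": 210}
tolerance = 12  # ±0.2 min

print(match_by_tolerance(126, known_rts, tolerance))  # A2 at 2.1 min -> ['A1']
print(match_by_tolerance(198, known_rts, tolerance))  # B2 at 3.3 min -> ['C1']
```

B2 drifts by 18 s, which exceeds the 12 s window around B1 while landing exactly on the edge of the window around C1, so the fixed window returns the wrong compound.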

[0052] This application provides a method for determining the retention time domain of metabolites, a method for annotating metabolites, and an apparatus, which aim to solve the above-mentioned technical problems.

[0053] The technical solutions of this application and their effects are described below through several exemplary embodiments. It should be noted that the following embodiments can be referenced, borrowed from, or combined with each other. Identical terms, similar features, and similar implementation steps in different embodiments will not be repeated.

[0054] This application provides a method for determining the retention time domain of metabolites. As shown in Figure 1, the method includes steps S1-S4:

[0055] S1. Obtain the retention times of n metabolites under the same chromatographic conditions.

[0056] In the embodiments of this application, one or more reference standards from n metabolites can be used to prepare samples individually or in combination, and the samples can be detected under the same chromatographic conditions to obtain the retention times of each of the n metabolites. Each sample can also be measured multiple times, and for each metabolite, the mean or median of the retention times from several measurements can be taken, or multiple retention times can be retained.
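Collapsing repeated measurements into one retention time per metabolite might look like the following minimal sketch. The function name `summarize_retention_times`, the metabolite labels, and the values are hypothetical.

```python
from statistics import mean, median

def summarize_retention_times(replicates, how="median"):
    """Collapse replicate retention times to one value per metabolite."""
    reducer = {"mean": mean, "median": median}[how]
    return {name: reducer(values) for name, values in replicates.items()}

# Hypothetical replicate measurements, in minutes.
replicates = {"M1": [2.0, 2.2, 2.1], "M2": [5.4, 5.5, 5.9]}
print(summarize_retention_times(replicates))          # {'M1': 2.1, 'M2': 5.5}
print(summarize_retention_times(replicates, "mean"))  # mean per metabolite
```

The median is less sensitive than the mean to a single drifted run, such as the 5.9 min reading for M2.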

[0057] As an optional embodiment, the retention times of the n metabolites are obtained by a pre-built machine learning model (predicting retention times).

[0058] Metabolite reference standards are expensive, and some are even unavailable, making it costly and impractical to obtain metabolite retention times through reference standard detection. While machine learning models such as RT-Transformer, 1D CNN, RGCN, GNN-RT, MPNN, and CPORT can predict metabolite retention times, these predictions may differ from actual detected retention times, leading to errors. This method is also susceptible to retention time drift when used for annotating unknown metabolites. Predicting metabolite retention times using retention time prediction models readily yields a large dataset, providing a more comprehensive picture of retention time distribution under specific chromatographic conditions. Analyzing this extensive retention time data allows for the identification of retention time domains for each metabolite under specific chromatographic conditions, improving annotation accuracy.

[0059] The metabolites mentioned generally refer to any metabolites that may be present in a biological sample, and can be exogenous or endogenous. Biological samples refer to samples used for metabolomics analysis, which can be animal samples, such as blood, serum, plasma, cells, or tissues, and samples obtained through their processing; plant samples, such as flowers, fruits, roots, stems, and leaves, and samples obtained through their processing; or microbial samples, such as gut microbiota, fermentation broth, and samples obtained through their processing. Biological samples may contain a large number of structurally similar metabolites because these metabolites are formed through a series of biochemical reactions, with each step producing at least one metabolite. Metabolites along a reaction pathway are likely to have similar structures, such as citric acid and isocitrate, succinic acid and fumaric acid in the tricarboxylic acid cycle.

[0060] S2. Based on the distribution of retention times of the n metabolites, select i metabolites from the n metabolites as calibrators.

[0061] It should be noted that the distribution of retention times of n metabolites typically implies the following information:

[0062] (1) The elution order of various metabolites under the same chromatographic conditions: metabolites with earlier elution (shorter retention time) are easier to elute, while metabolites with later elution (longer retention time) are more difficult to elute. This is related to the chromatographic conditions and the properties of the metabolites themselves. For example, under reverse phase chromatography, highly polar compounds are easier to elute and have shorter retention times, while less polar compounds are more difficult to elute and have longer retention times; the opposite is true under normal phase chromatography.

[0063] (2) The aggregation or difference of the retention times of each metabolite. If the difference between the retention times is small, then these retention times are clustered together, and the corresponding metabolites have high similarity in chromatographic behavior under the chromatographic conditions. If the difference between the retention times is large, then these retention times are relatively dispersed, and the corresponding metabolites have low similarity in chromatographic behavior under the chromatographic conditions.

[0064] (3) Based on the distribution of the retention times of each metabolite over the running time under the corresponding chromatographic conditions, regions with sparse or no retention time points can be identified, for example the region between rt5 and rt7 in Figure 2 (where the retention time points from left to right are denoted as rt1, rt2...rtn). Regions with dense retention time points can also be identified, such as the region between rt1 and rt6.

[0065] (4) The stability of the retention time of each metabolite. When the n metabolites are detected multiple times under the same chromatographic conditions, the retention times from all detections can be pooled and the distribution for each metabolite examined, which shows which metabolites have stable retention times and which do not.

[0066] Therefore, calibrators can be selected based on the distribution of retention times of n metabolites (retention time magnitude, aggregation, sparse regions, dense regions, and / or stability). For example, for a large time span of n retention times (generally corresponding to long run times under chromatographic conditions), more calibrators can be selected; conversely, fewer calibrators can be selected. Calibrators can be found in sparse regions, where there are fewer retention time points and fewer adjacent metabolites, so even if a point drifts, the impact is small. Calibrators can be found at the edges of dense regions, where they are relatively close to the dense region and have fewer retention time points at the edges, making them a better reference for that dense region. Alternatively, relatively stable metabolites can be selected as calibrators, and so on.
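One of the heuristics above, looking for calibrators in sparse regions, can be sketched as follows. This is only an illustrative heuristic and not the MetaOffset procedure: the isolation score (distance to the nearest neighbor in retention time), the function names, and the data are assumptions.

```python
def isolation_scores(rts):
    """For each compound, the distance to its nearest neighbor in retention time."""
    ordered = sorted(rts.items(), key=lambda item: item[1])
    scores = {}
    for pos, (name, rt) in enumerate(ordered):
        gaps = []
        if pos > 0:
            gaps.append(rt - ordered[pos - 1][1])
        if pos < len(ordered) - 1:
            gaps.append(ordered[pos + 1][1] - rt)
        scores[name] = min(gaps) if gaps else 0.0
    return scores

def pick_calibrators(rts, i):
    """Pick the i most isolated compounds, returned in elution order."""
    scores = isolation_scores(rts)
    chosen = sorted(scores, key=scores.get, reverse=True)[:i]
    return sorted(chosen, key=rts.get)

# Hypothetical retention times, in minutes.
rts = {"M1": 1.0, "M2": 1.1, "M3": 1.2, "M4": 2.5, "M5": 4.0, "M6": 4.1, "M7": 6.0}
print(pick_calibrators(rts, 2))  # ['M4', 'M7']
```

M4 and M7 have no close neighbors, so a drift in their retention times is unlikely to move other metabolites across the corresponding base points.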

[0067] S3. Determine the base point compound and base point based on the calibrator and its retention time. The base point divides the time period in which the retention times of the n metabolites are located into i+1 time domains.

[0068] After finding the calibrator in step S2, the base point compound and base point are determined based on the structure and retention time of the calibrator. Each calibrator corresponds to at least one base point compound and base point.

[0069] In some embodiments, the calibrator is used as a base point compound, and the retention time of the base point compound is used as a base point.

[0070] In some embodiments, based on the calibrator and its retention time, a compound whose absolute difference in retention time with the calibrator is less than a first threshold is selected as a base point compound under the same chromatographic conditions, and the retention time of the base point compound is used as the base point.

[0071] Preferably, the first threshold is 30s. The first threshold can be adjusted appropriately according to the length of the running time under the corresponding chromatographic conditions and the retention time drift. For example, the first threshold can be larger if the running time is longer, and the first threshold can also be larger if the retention time points of the calibrator are sparser.
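Selecting candidate base point compounds around a calibrator can be sketched as a simple filter on the absolute retention time difference. The function name `base_point_candidates` and the compound data are hypothetical; the 30 s default follows the preferred first threshold stated above. The text does not specify how to choose among several candidates, so the sketch returns all of them.

```python
def base_point_candidates(calibrator_rt, compound_rts, first_threshold=30.0):
    """Compounds whose |RT - calibrator RT| is below the first threshold (seconds)."""
    return {name: rt for name, rt in compound_rts.items()
            if abs(rt - calibrator_rt) < first_threshold}

# Hypothetical retention times in seconds; the calibrator elutes at 180 s.
compound_rts = {"X1": 150.0, "X2": 178.0, "X3": 205.0, "X4": 260.0}
print(base_point_candidates(180.0, compound_rts))  # {'X2': 178.0, 'X3': 205.0}
```

X1 is excluded because its difference is exactly 30 s, and the condition requires a difference strictly less than the threshold.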

[0072] S4. Determine the retention time domain of each metabolite based on the retention time of each metabolite and the i+1 time domains.

[0073] Once the base points are determined for the corresponding chromatographic conditions or chromatographic system, the time interval in which the retention times of the n metabolites are located is divided into multiple time domains. For each metabolite, the time domain in which its retention time is located is taken as the retention time domain of that metabolite.
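Step S4 can be sketched as a lookup of each retention time among the sorted base points. This is a minimal illustration under stated assumptions: the function name `retention_time_domains`, the run boundaries, and the data are hypothetical, and a retention time equal to a base point is assigned to the later domain, which the text does not specify.

```python
from bisect import bisect_right

def retention_time_domains(rts, base_points, run_start, run_end):
    """Map each metabolite to the (start, end) interval of its time domain.

    i base points split the run into i+1 domains.
    """
    edges = [run_start] + sorted(base_points) + [run_end]
    domains = {}
    for name, rt in rts.items():
        # Search only among the base points (edges[1:-1]).
        k = bisect_right(edges, rt, 1, len(edges) - 1) - 1
        domains[name] = (edges[k], edges[k + 1])
    return domains

# Hypothetical retention times and base points, in minutes, for a 10 min run.
rts = {"M1": 1.2, "M2": 2.6, "M3": 5.5}
print(retention_time_domains(rts, [2.0, 4.0], 0.0, 10.0))
# {'M1': (0.0, 2.0), 'M2': (2.0, 4.0), 'M3': (4.0, 10.0)}
```

When the base points are the retention times of base point compounds measured in each run, the interval boundaries move with the drift while the domain a metabolite belongs to stays the same.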

[0074] In some embodiments, for each metabolite, the measured retention time is obtained by detecting its reference standard under the chromatographic conditions, and the time domain in which the measured retention time falls is taken as the retention time domain of that metabolite.

[0075] In some implementations, for each metabolite, the retention time under the chromatographic conditions is predicted by a pre-built machine learning model, and the time domain in which the predicted retention time falls is taken as the retention time domain of that metabolite.

[0076] For metabolites that are unavailable as reference standards, a pre-built machine learning model can be used to predict retention time. This model takes the molecular characteristics of the metabolite and the chromatographic conditions as input and outputs the predicted retention time.

[0077] In some implementations, the machine learning model is selected from at least one of the following: RT-Transformer, 1DCNN, RGCN, GNN-RT, MPNN, and CPORT.

[0078] Preferably, when constructing the machine learning model, the training dataset includes data on n metabolites or multiple metabolites therein. The molecular characteristics and chromatographic conditions of each metabolite are used as model input, and retention time is used as a label for model training.

[0079] Based on the above, as an optional embodiment, as shown in Figure 3, selecting i metabolites from the n metabolites as calibrators based on the retention time distribution of the n metabolites includes steps S201-S206 (this method is denoted as MetaOffset):

[0080] S201. Sort the retention times of n metabolites by size to obtain a dataset Array. For example, dataset Array = [rt1,rt2,rt3,…,rtn], where rt1,rt2,rt3,…,rtn are sorted in ascending order.

[0081] n is a positive integer greater than 1, generally greater than 30, preferably greater than 100, more preferably greater than 200, still more preferably greater than 300, and most preferably greater than 400. The larger n is, the more abundant the data and the more accurate the results of the statistical analysis, but the more computational resources are required. The more representative each value in the dataset Array is, the more accurate the results of the analysis.

[0082] S202. Select j values from the dataset Array to obtain the dataset Node. For example, dataset Node = [t1, t2, t3, ..., tj], where j is a positive integer less than n.

[0083] S203. Calculate the dispersion index of the dataset Node, denoted as σ. Performing conventional transformations on σ, such as adding, subtracting, dividing or multiplying by a constant, or taking the log value, should all fall within the scope of this application.

[0084] In some implementations, the dispersion index is variance or standard deviation.

[0085] S204. Calculate the sum of the absolute differences between each value in the dataset Array and each value in the dataset Node, denoted as dis:

[0086]

[0087] Performing routine transformations on dis, such as adding, subtracting, dividing, or multiplying by a constant, specifically dividing by n or taking the log value, should all fall within the scope of this application.

[0088] S205. Calculate the ratio of the dispersion index to the sum of the absolute differences, denoted as the score:

[0089] score = σ / dis (Equation 2).

[0090] S206. Repeat steps S202-S205 until the maximum ratio is obtained. At this point, the j value is the i value, and the metabolites corresponding to each retention time value in the dataset Node are the calibrators.

[0091] The maximum score can be found by enumerating all possible cases. Furthermore, the experiment revealed that as the value of j increases, the maximum score for each j value initially increases and then decreases; therefore, finding this peak yields the maximum score.
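
Steps S201-S205, together with the brute-force enumeration mentioned above, can be sketched as follows for a fixed j. This is a sketch under stated assumptions, not the applicant's code. Equation 1 is not reproduced in this text, so the double sum used for dis is a literal reading of S204 and the scores it produces are not expected to reproduce Table 2. The population standard deviation is used as the dispersion index, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def meta_offset_score(array, node):
    """Score one candidate Node against Array (steps S203-S205)."""
    sigma = pstdev(node)  # S203: dispersion index (population std, an assumption)
    # S204: literal reading, every Array value against every Node value.
    dis = sum(abs(rt - t) for rt in array for t in node)
    return sigma / dis if dis > 0 else 0.0  # S205: score = sigma / dis


def best_node_for_j(array, j):
    """Enumerate all j-subsets of Array and return (best score, best Node)."""
    array = sorted(array)  # S201
    best_node = max(
        combinations(array, j),  # S202, exhaustively
        key=lambda node: meta_offset_score(array, node),
    )
    return meta_offset_score(array, best_node), best_node


def best_scores_by_j(array, j_values):
    """Best (score, Node) for each j, so the score-versus-j curve can be inspected."""
    return {j: best_node_for_j(array, j) for j in j_values}


# Toy data (hypothetical retention times, s).
toy_array = [1.0, 2.0, 3.0, 10.0]
score, node = best_node_for_j(toy_array, 2)
```

Exhaustive enumeration grows combinatorially with n and j, so this form is only practical for small inputs. For a dataset of several hundred retention times a heuristic search over candidate Node sets would be needed.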

[0092] For a given chromatographic condition, the fewer regions the run time is divided into, the larger the retention time domains for each metabolite will be. This increases the probability that the retention time will fall within its respective domain, resulting in higher accuracy. For example, if the entire run time is considered a single retention time domain, all metabolite retention times will fall within that domain under all circumstances, achieving 100% accuracy. However, this retention time domain is ineffective in metabolite annotation. Conversely, the more regions the run time is divided into, the narrower the retention time domains for each metabolite become. While the retention time domains play a more significant role in distinguishing metabolites during annotation, the lower the accuracy of retention times falling within these domains, thus having the opposite effect on metabolite annotation.

[0093] The score is calculated using Equation 2 above. The larger j is (corresponding to more retention time domains), the larger dis is; the larger j is, the more dispersed the values in the dataset Node are (corresponding to a more uniform division of the retention time domains), and the larger σ is. For each tj, the closer it is to the retention times of more metabolites (that is, the closer the calibrator is to the retention times of more metabolites), the smaller dis is. Therefore, determining the i value and the calibrators through steps S201-S206 above takes into account the number and range of the retention time domains as well as the distance between each calibrator's retention time and the retention times of the metabolites, so that an optimal result is achieved. This keeps the accuracy with which each metabolite's retention time falls within its retention time domain high while maximizing the value of the retention time domain for metabolite annotation.

[0094] Based on the above, as an optional embodiment, according to the distribution of retention times of the n metabolites, i metabolites are selected from the n metabolites as calibrators, including:

[0095] Cluster analysis is performed on the retention times of n metabolites, where i is the number of clusters. In each cluster, a metabolite with a retention time corresponding to a given time is selected as a calibrator.

[0096] Preferably, a metabolite with a retention time corresponding to the edge of each cluster is selected as a calibrator.

[0097] When analyzing metabolites in a large number of samples, it was unexpectedly discovered that the retention times of the metabolites in the samples tend to cluster into several groups, as shown in the schematic diagram of Figure 2. This may be due to the presence of structurally similar metabolites in the sample; similar metabolites have closer retention times under the same chromatographic conditions, thus forming retention time clusters. Alternatively, it could be due to changes in chromatographic conditions, whether slow or drastic (e.g., gradient elution or temperature-programmed elution). Such changes can cause different metabolites to elute, resulting in two retention time clusters, one before and one after the change. This is especially true when the change is drastic, such as a sudden increase in the proportion of organic phase in the mobile phase under reversed-phase chromatography conditions, which can cause more metabolites to be eluted, and the retention times of these additional metabolites can form a cluster. The actual process is likely the combined result of the characteristics of the metabolites themselves and the chromatographic conditions; the specific reasons require further investigation.

[0098] Cluster analysis is performed on the retention times of n metabolites. The number of clusters corresponds to the number of calibrators. For each cluster, one metabolite corresponding to a specific retention time can be selected as the calibrator. Preferably, the metabolite is relatively stable, meaning it can be stably detected under the chromatographic conditions with minimal retention time drift. Preferably, for each cluster or each pair of adjacent clusters, calibrators (base points) are found at the edges of the clusters or in the boundary region between the two clusters. Regions with fewer retention time points are less affected by retention time drift, and these time points effectively delimit the retention time domains of the metabolites within the two clusters, making it difficult for these metabolites to drift beyond the boundary even if retention time drift occurs. See, for example, the base points and retention time domains shown in Figure 2.

[0099] In some implementations, the clustering analysis method is selected from one or more of the following: K-means clustering, affinity propagation, DBSCAN, BIRCH, agglomerative clustering, and meanshift.
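
As a rough illustration of picking calibrators at cluster edges, the sketch below splits sorted retention times at their widest gaps and takes the upper edge of each group. Largest-gap splitting is a simple stand-in chosen here for brevity; it is not one of the clustering methods listed above, and choosing the upper edge rather than the lower edge is likewise an assumption. The function names are hypothetical; the example values are a small subset of the retention times from the embodiment.

```python
def split_by_largest_gaps(sorted_rts, n_clusters):
    """Split ascending retention times into n_clusters groups at the widest gaps."""
    # Indices k where a cut between sorted_rts[k - 1] and sorted_rts[k] is possible,
    # ordered from the widest gap to the narrowest.
    widest = sorted(
        range(1, len(sorted_rts)),
        key=lambda k: sorted_rts[k] - sorted_rts[k - 1],
        reverse=True,
    )[: n_clusters - 1]
    bounds = [0, *sorted(widest), len(sorted_rts)]
    return [sorted_rts[a:b] for a, b in zip(bounds, bounds[1:])]


def edge_calibrators(clusters):
    """Take the last (largest) retention time of each cluster as its calibrator."""
    return [cluster[-1] for cluster in clusters]


rts = [63.6, 68.4, 69.6, 140.4, 141.0, 148.2, 280.8, 285.0, 291.0]
clusters = split_by_largest_gaps(rts, n_clusters=3)
calibrator_rts = edge_calibrators(clusters)
```

In practice the compound at each selected retention time would still need to be checked for stable detection and low drift, as described above.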

[0100] In some embodiments, the chromatographic conditions are those of liquid chromatography-tandem mass spectrometry (LC-MS/MS) or gas chromatography-tandem mass spectrometry (GC-MS/MS). LC-MS/MS or GC-MS/MS can simultaneously detect a large number of unknown metabolites, obtaining measured retention times and mass spectrometry data. These measured retention times and mass spectrometry data are then matched with data on known metabolites in a database to obtain identification results, which are then annotated. During the annotation process, consideration is given to whether the measured retention times fall within the retention time domains of known metabolites, thus effectively improving the accuracy of the annotation results for these unknown metabolites.

[0101] This application provides a method for metabolite annotation which, as shown in Figure 4, includes:

[0102] S501. Obtain the retention time domain of a known metabolite in the chromatographic system using the aforementioned method for determining the retention time domain of metabolites.

[0103] S502. Detect the sample under the same chromatographic system to obtain the retention time of the unknown metabolite;

[0104] S503. Among the known metabolites, find one whose retention time domain contains the retention time of the unknown metabolite, and use it to annotate the unknown metabolite.
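
Steps S501-S503 can be sketched as a filter on candidate matches. The sketch assumes that each candidate known metabolite already carries a spectral similarity value and a stored retention time domain index. The tuple layout, the function name `annotate`, the candidate names, and all numbers are hypothetical, and ranking the remaining candidates by similarity is an added assumption.

```python
from bisect import bisect_right


def annotate(unknown_rt, candidates, base_points):
    """Return the best candidate whose retention time domain contains unknown_rt.

    candidates: list of (name, similarity, domain_index) tuples, where
        domain_index is the known metabolite's retention time domain (S501).
    base_points: ascending base points (s) that define the domains.
    Returns None when no candidate shares the unknown's domain.
    """
    unknown_domain = bisect_right(base_points, unknown_rt)  # domain of the S502 result
    in_domain = [c for c in candidates if c[2] == unknown_domain]  # S503
    if not in_domain:
        return None
    return max(in_domain, key=lambda c: c[1])[0]


base_points = [100.0, 300.0, 500.0]
candidates = [("known_a", 0.95, 1), ("known_b", 0.93, 2)]
annotation = annotate(350.0, candidates, base_points)
```

Here `known_a` has the higher spectral similarity but lies in a different retention time domain, so `known_b` is returned.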

[0105] This method for annotating unknown metabolites effectively avoids the impact of measured retention time drift by matching it with the retention time domains of known metabolites. This allows the retention time dimension to play a greater role in the annotation process of unknown metabolites, improving annotation accuracy. Furthermore, since the retention times of different metabolites are in the same order within the same chromatographic system, the retention times of unknown metabolites detected under different chromatographic conditions within the same system can also be matched with the retention time domains of known metabolites (using the region between the retention times of the base point compounds detected under each chromatographic condition as their respective retention time domains) to further improve the accuracy of the annotation results.
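
The cross-condition matching described above can be illustrated as follows, assuming that the same base point compounds are measured under each chromatographic condition and that the elution order is preserved. All compound names and retention times are hypothetical.

```python
from bisect import bisect_right


def domain_under_condition(rt, base_point_rts):
    """Domain index of rt relative to the base point compounds of one condition."""
    return bisect_right(sorted(base_point_rts), rt)


# Retention times (s) of the same three base point compounds under two conditions.
condition_1 = {"bp_1": 100.0, "bp_2": 300.0, "bp_3": 500.0}
condition_2 = {"bp_1": 60.0, "bp_2": 170.0, "bp_3": 320.0}

# A known metabolite characterized under condition 1 and an unknown
# metabolite detected under condition 2.
known_domain = domain_under_condition(350.0, condition_1.values())
unknown_domain = domain_under_condition(200.0, condition_2.values())
same_domain = known_domain == unknown_domain
```

Although the absolute retention times differ between the two conditions, both compounds elute between `bp_2` and `bp_3`, so they share a domain index.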

[0106] This application provides an annotation device for metabolites which, as shown in Figure 5, includes:

[0107] The metabolite information acquisition module 601 is used to acquire the retention time and mass spectrometry data of unknown metabolites;

[0108] Annotation module 602 is used to annotate the unknown metabolites based on their retention time and mass spectrometry data, as well as a pre-created metabolite database.

[0109] The metabolite database includes retention time domains for multiple known metabolites, and the retention time domain of at least one known metabolite is obtained by the aforementioned method for determining metabolite retention time domains.

[0110] The metabolite annotation device of this application can execute the metabolite annotation method provided in this application, and its implementation principle is similar. The retention time and mass spectrometry data of the unknown metabolite are obtained through the metabolite information acquisition module 601, which can be a liquid chromatography-mass spectrometry (LC-MS) instrument or a gas chromatography-mass spectrometry (GC-MS) instrument. The annotation module 602 matches the retention time and mass spectrometry data of the unknown metabolite with known metabolites in a pre-created metabolite database to annotate the unknown metabolite. The annotation module 602 can be a computer device. Here, the pre-created metabolite database contains the retention time domain information of the metabolites obtained by the metabolite retention time domain determination method of this application.

[0111] This application provides a memory or an electronic device for storing or executing the methods of this application. The electronic device includes a memory, a processor, and a computer program stored in the memory. The processor executes the computer program to implement the steps of the method for determining the retention time domain of metabolites or the method for annotating unknown metabolites. As shown in Figure 6, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 and the memory 4003 are connected, for example, via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which can be used for data interaction between the electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that in practical applications, the transceiver 4004 is not limited to one type, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of this application.

[0112] This application also provides a computer program product, including a computer program that, when executed by a processor, performs the steps of the metabolite retention time domain determination method or the unknown metabolite annotation method of this application.

[0113] The following describes the method for determining the retention time domain of metabolites and the method for annotating metabolites provided in this application, along with their effects, with reference to specific embodiments.

[0114] I. Methods for Determining the Retention Time Domain of Metabolites

[0115] 1. Determine the retention time range of metabolites by analyzing measured retention times.

[0116] Existing literature was reviewed and serum samples were tested using liquid chromatography-tandem mass spectrometry to identify possible metabolites in the serum. 300 reference standards for metabolites that may be present in the serum were preliminarily identified and purchased. The mixed solution of the reference standards for the 300 metabolites was detected by liquid chromatography-tandem mass spectrometry in positive ion mode (chromatographic condition 1).

[0117] Chromatographic conditions 1 (17 min) include:

[0118] ① Instrument: liquid chromatography-tandem mass spectrometer (LC-ESI-Orbitrap); model: Vanquish Q-Exactive Plus (using data-dependent acquisition mode);

[0119] ② Chromatographic column: Acclaim RSLC 120 C18, 100 × 2.1 mm, 2.2 μm;

[0120] ③Mobile phase

[0121] Mobile phase A: 90% methanol aqueous solution of 0.01% formic acid and 5mM ammonium formate (volume ratio of methanol to water is 90:10);

[0122] Mobile phase B: Methanol.

[0123] Perform gradient elution using the following procedure:

[0124] Table 1

[0125]

[0126]

[0127] The retention times of 231 metabolites out of 300 metabolites were obtained from the sample injection detection. These 231 retention times were arranged in ascending order to obtain the dataset Array231 = [63.6, 68.4, 69.6, 73.8, 73.8, 75.0, 75.0, 76.2, 77.4, 78.6, 79.2, 79.2, 79.2, 79.8, 80.4, 81.0, 82.2, 82.2, 82.2, 82.8, 82.8, 83.4, 83.4, 84.0, 84.6, 84.6, 84.6, 85.2, 85.2, 85.2, 85.8, 86.4, 86.4, 87.6, 87.6, 88.2, 88.8, 91.2, 91.2, 91.2, 91.8, 92.4, 93.0, 94.2, 94.2, 94.8, 95.4, 96.0, 96.6, 96.6, 96.6, 96.6, 96.6, 96.6, 97.2, 97.8, 97.8, 99.0, 99.0, 99.0, 99.6, 99.6, 102.6, 140.4, 140.4, 141.0, 141.0, 141.6, 141.6, 142.2, 143.4, 148.2, 148.8, 149.4, 149.4, 152.4, 159.6, 163.8, 169.2, 188.4, 190.2, 194.4, 198.6, 206.4, 207.0, 210.6, 220.2, 222.0, 225.0, 256.8, 280.8, 285.0, 289.8, 291.0, 291.0, 295.2, 298.2, 316.2, 316.2, 321.0, 326.4, 330.6, 335.4, 337.8, 339.0, 348.6, 354.0, 358.2, 366.6, 371.4, 389.4, 391.2, 392.4, 396.6, 399.0, 401.4, 403.2, 412.8, 414.0, 425.4, 425.4, 427.8, 430.8, 435.0, 438.6, 444.0, 454.8, 457.8, 460.2, 466.2, 492.6, 499.2, 506.4, 510.6, 512.4, 516.0, 528.0, 531.0, 546.6, 547.2, 551.4, 565.8, 568.2, 583.8, 584.4, 584.4, 586.2, 591.6, 597.6, 601.8, 603.6, 607.2, 619.2, 620.4, 621.6, 621.6, 630.6, 644.4, 648.6, 651.0, 657.0, 657.6, 658.2, 680.4, 681.6, 682.2, 685.2, 685.8, 686.4, 704.4, 705.0, 705.0, 708.6, 708.6, 709.8, 710.4, 715.8, 716.4, 718.8, 719.4, 719.4, 721.8, 722.4, 724.2, 724.2, 727.8, 730.8, 736.8, 737.4, 739.8, 741.6, 747.0, 747.6, 747.6, 762.6, 765.0, 766.2, 770.4, 771.6, 775.8, 777.6, 777.6, 777.6, 779.4, 792.0, 792.6, 795.0, 801.6, 802.8, 802.8, 802.8, 804.6, 805.2, 823.8, 828.0, 829.2, 832.2, 832.8, 835.2, 835.2, 837.0, 847.2, 855.6, 856.2, 858.6, 868.8, 870.6, 871.8, 876.0, 920.4, 936.0, 970.2].

[0128] Figure 7 shows the distribution of retention times, with the position index (1-231) of the values in dataset Array231 as the x-axis and the value (time, s) as the y-axis. As can be seen from the figure, these 231 values can be divided into 6 groups. For each group, a metabolite corresponding to one retention time can be selected as a calibrator or base point compound, with the retention time of that compound serving as the base point.

[0129] Preferably, the dotted lines in Figure 7 serve as the dividing points. The data within each group are close to each other and show similar increasing trends; the points are densely packed within each group, while the values in the regions between two adjacent groups are relatively sparse. Therefore, base points (calibrators or base point compounds) can be found in the regions between two adjacent groups (i.e., near the dotted lines). For example, the metabolites corresponding to the values 102.6, 280.8, 466.2, 708.6, and 920.4 are dopamine, propionylcarnitine, hippuric acid, glycohyodeoxycholic acid, and erucic acid (22:1n9), respectively. Using these as calibrators and also as base point compounds, six time domains were obtained: the time period from 0 to the retention time of dopamine, the time period from the retention time of dopamine to adenosine, the time period from the retention time of adenosine to hippuric acid, the time period from the retention time of hippuric acid to glycine-deoxycholic acid, the time period from glycine-deoxycholic acid to erucic acid, and the time period from the retention time of erucic acid to the end of the chromatographic run. For the other 223 metabolites, their respective retention time domains were obtained based on the time domains into which each metabolite's retention time falls.

[0130] Finding base points by directly observing the distribution of retention times involves subjective judgment, which may introduce human error. Furthermore, as the data volume increases, such judgments become increasingly difficult. Therefore, cluster analysis methods or the MetaOffset method shown in Figure 3 can be used to process the retention time data and find the calibrators and base points by calculation.

[0131] For convenience, a compound near the dashed line was directly selected as the calibrator and base point compound. Alternatively, other compounds with stable retention times near the dashed line can be selected. However, it is better to select a compound that is not present in the sample and has a stable signal as the base point compound, such as an isotope internal standard or other artificially synthesized compounds that do not exist in nature. Adding such a base point compound to the sample can further ensure the stability of the base point compound detection results and will not interfere with the detection results.

[0132] 2. Determine the retention time domain of metabolites by analyzing and predicting retention times.

[0133] The RT-Transformer model was constructed using the method of Xue J et al. [Xue J, Wang B, Ji H, Li W. RT-Transformer: retention time prediction for metabolite annotation to assist in metabolite identification. Bioinformatics. 2024 Mar 4;40(3):btae084. doi:10.1093/bioinformatics/btae084], in which transfer learning was performed using the laboratory-measured retention times of the above 300 metabolites with reference standards.

[0134] The retention times of the 231 reference standards under chromatographic condition 1 were predicted using the constructed RT-Transformer model. The predicted retention times were obtained and arranged in ascending order, denoted as dataset Array231' = [41.9, 59.0, 59.2, 63.5, 63.6, 66.4, 70.2, 73.3, 73.7, 74.1, 74.7, 74.8, 74.9, 75.8, 78.6, 79.0, 79.2, 79.6, 79.7, 80.6, 81.0, 81.7, 82.1, 82.9, 84.2, 85.7, 86.1, 86.5, 87.1, 87.8, 88.5, 89.0, 89.1, 90.5, 90.8, 90.9, 92.2, 92.2, 92.4, 93.9, 94.6, 95.3, 95.3, 97.7, 98.7, 98.9, 102.3, 102.8, 104.8, 105.8, 106.6, 107.9, 109.1, 110.1, 111.2, 115.8, 116.5, 121.9, 130.4, 134.3, 134.9, 140.3, 146.1, 148.5, 151.5, 154.4, 155.9, 158.1, 164.8, 171.2, 172.8, 176.7, 188.8, 196.4, 197.7, 200.6, 201.6, 202.6, 204.0, 205.6, 207.1, 212.1, 224.9, 227.6, 227.6, 231.4, 240.5, 244.3, 248.1, 253.8, 272.6, 286.0, 288.7, 296.2, 298.6, 303.8, 306.5, 309.3, 321.6, 329.3, 330.0, 333.1, 334.3, 356.0, 356.1, 365.5, 378.3, 383.7, 389.5, 392.1, 415.3, 416.0, 418.5, 428.0, 432.0, 435.7, 436.5, 437.8, 438.4, 443.9, 450.4, 451.2, 452.0, 453.0, 453.9, 462.6, 466.9, 468.1, 469.9, 487.3, 489.0, 501.9, 505.9, 508.3, 512.1, 519.5, 526.7, 535.8, 536.2, 538.0, 559.2, 563.3, 566.6, 568.5, 590.7, 590.7, 597.2, 599.4, 601.1, 601.2, 604.2, 605.4, 609.5, 609.5, 620.1, 620.6, 621.7, 622.8, 632.0, 642.7, 645.1, 646.8, 647.9, 651.1, 654.6, 655.2, 655.4, 656.2, 663.4, 672.9, 677.7, 679.7, 683.3, 683.4, 684.3, 685.2, 687.2, 691.6, 693.4, 697.8, 701.2, 704.5, 707.2, 710.5, 712.5, 717.6, 719.2, 719.2, 722.5, 722.7, 724.0, 729.5, 734.9, 737.5, 740.2, 740.6, 741.9, 751.5, 756.5, 756.8, 757.9, 765.4, 777.1, 778.5, 779.7, 782.4, 791.6, 794.9, 799.5, 800.3, 803.1, 808.6, 812.9, 813.9, 815.7, 826.4, 828.4, 832.0, 835.0, 839.1, 858.8, 859.2, 866.8, 870.8, 876.0, 880.0, 889.1, 909.4, 911.6, 936.4, 1006.7]. Figure 8 shows the distribution of the predicted retention times, with the position index (1-231) of the values in dataset Array231' as the x-axis and the value (time, s) as the y-axis. Comparing Figure 7 and Figure 8, the trend in Figure 8 is similar to that in Figure 7, but the boundaries in Figure 8 are not as clear as those in Figure 7, which is likely due to model prediction errors. Therefore, relying on manual observation and judgment of the retention time distribution to find the base points would increase the probability of human error.

[0135] The retention times of the 300 metabolites with reference standards were predicted, resulting in a dataset Array300. Cluster analysis methods such as K-means clustering, affinity propagation, DBSCAN, and BIRCH, as well as the MetaOffset method shown in Figure 3, were then used to process the Array300 dataset to find the calibrators (the base point compounds are the same as the calibrators) and base points. When using each cluster analysis method, for each cluster (or subset), each retention time point is traversed as a base point to obtain a preset retention time domain. Then, the measured retention time is matched with the preset retention time domain, and the accuracy of the matching results is calculated, as shown in Table 2 below.

[0136] Table 2

[0137]

[0138]

[0139] The MetaOffset method score in Table 2 is calculated using the same method as in Equation 2 (with dis divided by n). The scores for K-means clustering, affinity propagation, DBSCAN, and BIRCH are calculated in the same way as the MetaOffset method score, except that the dataset Array is divided into j clusters (subsets) through cluster analysis, and one point is selected from each cluster to form the dataset Node. The scores shown in Table 2 are the maximum values obtained by using each retention time point of each subset under the corresponding j as the base point (the value in the dataset Node).
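
Read literally, the score used for Table 2 is Equation 2 with dis replaced by dis divided by n. Written out, with dis itself left as defined in step S204:

```latex
\text{score} \;=\; \frac{\sigma}{\text{dis}/n} \;=\; \frac{n\,\sigma}{\text{dis}}
```

For a fixed dataset Array, n is constant, so this normalization rescales every score by the same factor and does not change which Node or which j attains the maximum.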

[0140] As shown in Table 2: ① across all five methods, accuracy generally decreases as the number of base points increases; ② the j at which the score peaks differs somewhat between methods: K-means, DBSCAN, BIRCH, and MetaOffset all reach their maximum score at j=5, and MetaOffset has the highest accuracy at this point; ③ comparing the results at each method's maximum score, MetaOffset corresponds to the highest accuracy; ④ at j=5, MetaOffset, AgglomerativeClustering, and BIRCH have the higher accuracies, all exceeding 0.8. Therefore, taking the result at j=5 in Table 2 (i=5), the run time under chromatographic condition 1 is divided into six time domains. This is consistent with the results obtained earlier by observing the distribution of retention times. Furthermore, based on the time domain in which the retention time of each metabolite falls, the retention time domain of each metabolite is obtained. This reasonable division of retention times ensures accuracy while providing greater value for metabolite annotation.

[0141] For ease of understanding, the accuracy in Table 2 is further illustrated using the MetaOffset method as an example. According to the MetaOffset results, there are six retention time domains under chromatographic condition 1: 0-198.6s, 198.6s-256.8s, 256.8s-460.2s, 460.2s-657.6s, 657.6s-802.8s, and 802.8s-1000s. The results are validated using the measured retention times of 300 metabolites under chromatographic condition 1, as shown in Table 3 below and in Figure 9. The final accuracy was 0.84.
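
One way to read this validation is as the fraction of metabolites whose measured retention time falls in the same domain as the one assigned from the predicted retention time. The sketch below follows that reading, which is an assumption. The base points are the domain boundaries listed above, while the function name and the predicted and measured pairs are hypothetical and are not data from Table 3.

```python
from bisect import bisect_right

# Domain boundaries (s) from the MetaOffset result under chromatographic condition 1.
BASE_POINTS = [198.6, 256.8, 460.2, 657.6, 802.8]


def domain_accuracy(pairs, base_points):
    """Fraction of (predicted_rt, measured_rt) pairs that share a time domain."""
    hits = sum(
        bisect_right(base_points, predicted) == bisect_right(base_points, measured)
        for predicted, measured in pairs
    )
    return hits / len(pairs)


# Hypothetical (predicted, measured) retention times in seconds.
pairs = [(150.0, 160.0), (250.0, 262.0), (500.0, 480.0), (700.0, 650.0)]
accuracy = domain_accuracy(pairs, BASE_POINTS)
```

In this toy input, two of the four pairs straddle a base point (256.8 s and 657.6 s), so half of the assignments are counted as correct.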

[0142] Table 3

[0143]

[0144] II. Metabolite Annotation Methods

[0145] 1. Single metabolite

[0146] Serum samples were analyzed under chromatographic condition 1 (using data-dependent acquisition mode). When annotating the unknown metabolite M118T466_1, if similarity matching algorithms such as cosine similarity (cosine_socre) and spectral entropy similarity (entropy_score) are used to find the most similar known metabolite in the mass spectrometry database based on its mass spectrometry data (including primary and secondary mass spectrometry data), then, as shown in Figure 10, M118T466_1 might be annotated as either N6-methyladenosine or 1-methyladenosine. Using `fusion_socre` alone would annotate it as N6-methyladenosine, but once retention time domain information is added, M118T466_1 is unambiguously annotated as 1-methyladenosine. Verification with N6-methyladenosine and 1-methyladenosine reference standards confirmed that M118T466_1 is 1-methyladenosine. This demonstrates that using retention time domain information for metabolite annotation can distinguish metabolites with similar mass spectrometry data, improving annotation accuracy.

[0147] 2. Multiple metabolites

[0148] The above-mentioned 300 metabolites were added to a blank solvent to prepare test samples (two in parallel). The samples were then analyzed under chromatographic conditions 1, and metabolite annotations were performed. The results are as follows:

[0149] Table 4

[0150]

[0151] Table 4 compares the annotation of the detection results using mass spectrometry data alone (MS1+MS2), mass spectrometry data plus retention time (with different error ranges), and mass spectrometry data plus retention time domain (the retention time domain obtained using the MetaOffset method described above). The comparison shows that the mass spectrometry data plus retention time domain method has the highest true positive rate and the lowest false positive rate. This indicates that combining retention time domain information with mass spectrometry data for metabolite annotation can improve the accuracy of annotation while reducing the false positive rate, resulting in more reliable annotation results.

[0152] In summary, this application provides a method, apparatus, electronic device, and computer program product for determining the retention time domain of metabolites. By analyzing the retention time distribution of multiple metabolites under the same chromatographic conditions, suitable base points are selected to rationally divide the run time and obtain the retention time domain of each metabolite. Using the retention time domain instead of the retention time can reduce the impact of retention time drift.

[0153] This application analyzes the retention time distribution of multiple metabolites using cluster analysis or the MetaOffset method, so that the base points and retention time domains are obtained by calculation. This makes the division of the retention time domains more scientific and reasonable, and provides more effective information for metabolite annotation while ensuring accuracy.

[0154] The retention time in this application can also be the predicted retention time. In addition to the drift of the actual measured retention time during detection, the prediction of the retention time of metabolites by machine learning models also has errors. Using the retention time domain instead of the predicted retention time can also reduce the impact of prediction errors.

[0155] This application also provides metabolite annotation methods, devices, electronic devices, and computer program products, which utilize time-domain information to annotate unknown metabolites, effectively improving annotation accuracy and reducing false positives, thus providing better basic data for metabolomics research.

[0156] It should be understood that although arrows indicate various operation steps in the flowcharts of this application's embodiments, the order in which these steps are implemented is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of this application's embodiments, the implementation steps in each flowchart can be executed in other orders as required. Furthermore, some or all steps in each flowchart, based on the actual implementation scenario, may include multiple sub-steps or multiple stages. Some or all of these sub-steps or stages can be executed at the same time, and each sub-step or stage can also be executed at different times. In scenarios where execution times differ, the execution order of these sub-steps or stages can be flexibly configured according to requirements, and this application's embodiments do not limit this.

[0157] The above description is only an optional implementation method for some implementation scenarios of this application. It should be noted that for those skilled in the art, other similar implementation methods based on the technical concept of this application without departing from the technical concept of this application also fall within the protection scope of the embodiments of this application.

Claims

1. A method for determining the retention time domain of metabolites, characterized in that it comprises: S1. Obtain the retention times of n metabolites under the same chromatographic conditions; S2. Based on the distribution of retention times of the n metabolites, select i metabolites from the n metabolites as calibrators; S3. Determine the base point compounds and base points based on the calibrators and their retention times, wherein the base points divide the time period in which the retention times of the n metabolites are located into i+1 time domains; S4. Determine the retention time domain of each metabolite based on the retention time of each metabolite and the i+1 time domains.

2. The method according to claim 1, characterized in that the step of selecting i metabolites from the n metabolites as calibrators based on the distribution of their retention times includes: S201. Sort the retention times of n metabolites by size to obtain a dataset Array; S202. Select j values from the dataset Array to obtain the dataset Node; S203. Calculate the dispersion index of the dataset Node; S204. Calculate the sum of the absolute differences between each value in the dataset Array and each value in the dataset Node; S205. Calculate the ratio of the dispersion index to the sum of the absolute differences; S206. Repeat steps S202-S205 until the maximum ratio is obtained; the j value at this point is the i value, and the metabolites corresponding to each retention time value in the dataset Node are the calibrators.

3. The method according to claim 2, characterized in that the dispersion index is either variance or standard deviation.

4. The method according to claim 1, characterized in that determining the base point compound and base point based on the calibrator and its retention time includes: using the calibrator as the base point compound, and using the retention time of the base point compound as the base point; or, based on the calibrator and its retention time, finding, under the same chromatographic conditions, compounds whose absolute difference in retention time from the calibrator is less than a first threshold as base point compounds, and using the retention times of the base point compounds as base points; preferably, the first threshold is 30 seconds.

5. The method according to claim 1, characterized in that the retention times of the n metabolites are the retention times obtained by detection under the same chromatographic conditions, or the predicted retention times obtained based on a pre-built machine learning model.

6. The method according to claim 1, characterized in that determining the retention time domain of each metabolite based on its retention time and the i+1 time domains comprises: for each metabolite, obtaining a measured retention time by detecting its reference standard under the chromatographic conditions, and taking the time domain in which the measured retention time falls as the retention time domain of that metabolite; or, for each metabolite, predicting its retention time under the chromatographic conditions with a pre-built machine learning model, and taking the time domain in which the predicted retention time falls as the retention time domain of that metabolite.

7. The method according to claim 5 or 6, characterized in that the machine learning model is at least one selected from: RT-Transformer, 1D CNN, RGCN, GNN-RT, MPNN, and CPORT.

8. The method according to claim 1, characterized in that the step of selecting i metabolites from the n metabolites as calibrators based on the distribution of their retention times comprises: performing cluster analysis on the retention times of the n metabolites, where i is the number of clusters, and selecting from each cluster one metabolite at a given retention time value as a calibrator; preferably, the metabolite whose retention time lies at the edge of each cluster is selected as the calibrator; preferably, the cluster analysis method is one or more selected from: k-means clustering, affinity propagation, density-based spatial clustering of applications with noise (DBSCAN), balanced iterative reducing and clustering using hierarchies (BIRCH), hierarchical clustering, and mean shift.
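
The clustering variant of claim 8 can be illustrated with a small, dependency-free one-dimensional k-means (Lloyd's algorithm). This is a sketch under stated assumptions: the initial centres are picked deterministically from the sorted data, the number of clusters is supplied by the caller, and the "edge" of a cluster is taken to be its largest retention time. The names `kmeans_1d` and `edge_calibrators` are hypothetical, and a production pipeline would more likely use an established clustering library.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values with Lloyd's algorithm; returns non-empty clusters."""
    data = sorted(values)
    # Deterministic initialisation: k evenly spaced points of the sorted data.
    centers = [data[(2 * c + 1) * len(data) // (2 * k)] for c in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        new_centers = [
            sum(members) / len(members) if members else centers[idx]
            for idx, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return [members for members in clusters if members]


def edge_calibrators(values, k):
    """Assumption: the upper edge (maximum) of each cluster is the calibrator."""
    return [max(cluster) for cluster in kmeans_1d(values, k)]


# Hypothetical retention times in seconds.
rts = [10.0, 12.0, 14.0, 100.0, 104.0, 300.0, 310.0]
print(edge_calibrators(rts, k=3))  # [14.0, 104.0, 310.0]
```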

9. The method according to claim 1, characterized in that the chromatographic conditions are those of liquid chromatography-tandem mass spectrometry or gas chromatography-tandem mass spectrometry.

10. A method for annotating metabolites, characterized in that the method comprises: obtaining the retention time domain of each known metabolite by the method for determining the retention time domain of metabolites according to any one of claims 1-9; analyzing a sample under the chromatographic conditions to obtain the retention time of an unknown metabolite; and, among the known metabolites whose retention time domain contains the retention time of the unknown metabolite, finding one with which to annotate the unknown metabolite.
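
The candidate filtering in claim 10 can be sketched as below, assuming each known metabolite already carries a 0-based time-domain index and that the same base points are used for the unknown. The function name `candidates_for_unknown`, the placeholder compound names, and the numeric values are hypothetical. Choosing a single annotation among the candidates would rely on further evidence such as mass spectrometry data, which this sketch leaves out.

```python
from bisect import bisect_right


def candidates_for_unknown(unknown_rt, base_points, known_domains):
    """Return known metabolites whose time domain contains unknown_rt."""
    domain = bisect_right(sorted(base_points), unknown_rt)
    return sorted(name for name, d in known_domains.items() if d == domain)


# Placeholder names and domain indices, not real library entries.
known_domains = {"known_1": 0, "known_2": 1, "known_3": 1, "known_4": 2}
hits = candidates_for_unknown(215.0, [120.0, 400.0], known_domains)
print(hits)  # ['known_2', 'known_3']
```

Because matching is done on the domain rather than on an exact retention time, a moderate drift that keeps the unknown inside the same domain does not change the candidate list.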

11. A metabolite annotation device, characterized in that it comprises: a metabolite information acquisition module, used to acquire the retention time and mass spectrometry data of unknown metabolites; and an annotation module, used to annotate the unknown metabolites based on their retention time, their mass spectrometry data, and a pre-created metabolite database; wherein the metabolite database includes the retention time domains of multiple known metabolites, and the retention time domain of at least one known metabolite is obtained by the method for determining the retention time domain of metabolites according to any one of claims 1-9.

12. An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the method according to any one of claims 1-10.

13. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method according to any one of claims 1-10.