LCMS read model construction method and application of LCMS read model
By constructing an LCMS reading model and using multiple features for iterative training, the accuracy and efficiency issues of existing LCMS reading methods are solved, achieving efficient and accurate reading results output, meeting the needs of multiple business scenarios, and improving the automated preparation process of chemical synthesis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN JINGTAI TECH CO LTD
- Filing Date
- 2024-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
Existing LCMS spectrum reading methods have limited accuracy and cannot meet the needs of multiple business scenarios and high precision. Traditional methods are inefficient and subject to human error.
An LCMS reading model is constructed by acquiring LCMS training data, including material structure characteristics, mass spectrum characteristics, ultraviolet absorption spectrum (UV) characteristics, MS-UV correlation characteristics, and instrument parameter characteristics, and then iteratively training it to generate reading models suitable for different business types.
It achieves efficient and accurate output of spectral reading results, meets the needs of different business scenarios, and improves the reliability and efficiency of automated preparation processes in chemical synthesis.
Figure CN122309947A_ABST
Description
Technical Field
[0001] This application relates to the field of data analysis technology, and in particular to a method for constructing an LCMS reading model and the application of the LCMS reading model.
Background Technology
[0002] LCMS, or Liquid Chromatography-Mass Spectrometer, is an analytical instrument that combines the high-efficiency separation capabilities of liquid chromatography with the high sensitivity and specificity of mass spectrometry. Liquid chromatography (LC) separates different components in a sample, while mass spectrometry (MS) analyzes each separated component individually, obtaining information such as molecular weight, structure, and concentration. Therefore, accurate spectral reading is crucial for obtaining this vital information. Traditional LCMS spectral reading methods either rely on expert experience, which is inefficient and prone to human error, or they introduce automated reading algorithms. However, existing LCMS reading algorithms have limited accuracy and cannot meet the demands of various business scenarios and high precision requirements.
[0003] Therefore, designing a fast and accurate intelligent spectrum reading method that can adapt to various business scenarios is a problem that needs to be solved.
Summary of the Invention
[0004] To address or partially address the problems existing in related technologies, this application provides an LCMS reading model construction method and an application of the LCMS reading model, which can efficiently and accurately output reading results for different business types, providing a reliable guarantee for the automated preparation process of products.
[0005] The first aspect of this application provides a method for constructing an LCMS reading model, which includes:
[0006] Acquire LCMS training data, which includes training samples and corresponding training labels; wherein the training samples include at least one of material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features; the training labels include spectrum reading results corresponding to at least one business type.
[0007] Based on the LCMS training data, the initial model is iteratively trained to obtain a trained LCMS reading model, which is used to predict the target reading result corresponding to the business type.
[0008] In some embodiments, the training samples include material structural features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features; the method further includes:
[0009] Obtain the weight values of each feature in the training samples corresponding to the business type;
[0010] The step of iteratively training the initial model based on the LCMS training data to obtain a trained LCMS spectrum reading model includes:
[0011] Based on the LCMS training data and the weight values of each feature in the training samples, the initial model is iteratively trained to obtain a trained LCMS spectrum reading model.
[0012] In some implementations, the business type of the LCMS spectrum reading model includes at least one of the following:
[0013] In the intermediate process control stage of chemical synthesis, predict the presence of the target product and / or whether the content of the target product meets the standard.
[0014] In the separation and purification stage of chemical synthesis, predict at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product.
[0015] In the quality control stage of chemical synthesis, predict whether the target product meets the quality standards and / or recommend sample injection methods and injection volumes;
[0016] In the scenario deployment phase of chemical synthesis, the applicable scenario category is predicted, which includes fully automated processing, semi-automated processing, or manual processing.
[0017] In some embodiments, the construction method further includes:
[0018] Based on the business type, the initial model is determined; the initial model includes at least one of a classification model, a regression model, and a generative model.
[0019] In some embodiments, the training samples include training samples of mixtures and training samples of pure substances; after acquiring LCMS training data, the method further includes:
[0020] The training samples are subjected to data augmentation processing to obtain augmented LCMS training data; the data augmentation processing method includes at least one of the following: overlay of spectra of multiple pure substances, overlay of spectra with simulated noise, and generation of spectra of preset groups;
[0021] The step of iteratively training the initial model based on the LCMS training data to obtain a trained LCMS spectrum reading model includes:
[0022] Based on the LCMS training data and the enhanced LCMS training data, the initial model is iteratively trained to obtain a trained LCMS spectrum reading model.
[0023] In some embodiments, the data enhancement processing method for superimposing the spectra of the multiple pure substances includes:
[0024] The original spectra of at least two pure substances are mixed in a preset ratio, and the time-series signals of each original spectrum are aligned in time and then superimposed to generate the corresponding superimposed spectrum.
[0025] Based on the superimposed spectrum, the corresponding enhanced training samples are obtained.
[0026] In some embodiments, the data enhancement processing method for generating the spectrum of the preset group includes:
[0027] Sample compounds containing preset groups are simulated to undergo ionization and collision according to preset parameters to obtain fragment ions;
[0028] Based on the spectra of the fragment ions, the corresponding enhanced training samples are obtained.
[0029] A second aspect of this application provides an LCMS spectrum reading method, comprising:
[0030] Based on the business type, obtain the corresponding sampling features, which include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features;
[0031] Based on the sampling features, the target spectrum reading result is output by the LCMS spectrum reading model pre-constructed using the spectrum reading model construction method described in the first aspect above.
[0032] In some embodiments, the LCMS reading model includes at least one of a first model, a second model, a third model, and a fourth model; wherein:
[0033] The first model is used to predict and output the corresponding target spectral reading results during the intermediate process control stage of chemical synthesis. The target spectral reading results include whether the target product exists and / or whether the content of the target product meets the standard.
[0034] The second model is used to predict and output the corresponding target spectral reading results during the separation and purification stage of chemical synthesis. The target spectral reading results include at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product.
[0035] The third model is used to predict and output the corresponding target spectral reading results during the quality control stage of chemical synthesis. The target spectral reading results include whether the target product meets the quality standard and / or the recommended sample injection method and injection volume.
[0036] The fourth model is used to predict the corresponding target spectral reading results during the deployment phase of chemical synthesis scenarios. The target spectral reading results include applicable scenario categories, which include fully automated processing, semi-automated processing, or manual processing.
[0037] A third aspect of this application provides an LCMS reading model construction apparatus, comprising:
[0038] The training data acquisition module is used to acquire LCMS training data, which includes training samples and corresponding training labels. The training samples include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features. The training labels include spectrum reading results corresponding to at least one business type.
[0039] The model training module is used to iteratively train the initial model based on the LCMS training data to obtain a trained LCMS reading model, which is used to predict the target reading result corresponding to the business type.
[0040] The fourth aspect of this application provides a spectrum reading system for LCMS, comprising:
[0041] The data acquisition module is used to obtain corresponding sampling features according to the business type. The sampling features include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features.
[0042] The model reading module is used to output the corresponding target reading result based on the sampling features and the LCMS reading model pre-constructed by the LCMS reading model construction method described in the first aspect.
[0043] The fifth aspect of this application provides an electronic device, comprising:
[0044] A processor; and
[0045] A memory storing executable code, which, when executed by the processor, causes the processor to perform the LCMS reading model construction method as described in the first aspect and / or the LCMS reading method as described in the second aspect.
[0046] The sixth aspect of this application provides a computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the LCMS spectrum reading model construction method as described in the first aspect above and / or the LCMS spectrum reading method as described in the second aspect.
[0047] The seventh aspect of this application provides a computer program product, including a computer program for executing the LCMS reading model construction method as described in the first aspect above and / or the LCMS reading method as described in the second aspect.
[0048] The technical solution provided in this application may include the following beneficial effects:
[0049] The LCMS reading model construction method in this application starts with building standardized, large-scale, high-quality LCMS training data, designs training data for specific downstream tasks, and uses the LCMS training data for feature engineering and model training to fully leverage the value of high-quality data. By using different training labels, an LCMS reading model that can be used to predict target reading results for different business types is obtained. The trained LCMS reading model can accurately predict the target reading results corresponding to different business types, meeting the business needs of different scenarios.
[0050] The LCMS reading method of this application can use rich sampling features and a trained LCMS reading model to read spectra according to different business stages of chemical synthesis, so as to obtain the target reading results efficiently and accurately for R&D personnel to refer to, meet the needs of different use scenarios, improve drug development efficiency, and reduce the chain reaction caused by errors in reading results.
[0051] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
Attached Figure Description
[0052] The above and other objects, features and advantages of this application will become more apparent from the following description of exemplary embodiments of this application in conjunction with the accompanying drawings, wherein the same reference numerals generally represent the same components in the exemplary embodiments of this application.
[0053] Figure 1 is a flowchart illustrating the LCMS spectrum reading model construction method shown in this application;
[0054] Figure 2 is another flowchart illustrating the LCMS reading model construction method shown in this application;
[0055] Figure 3 is a schematic flowchart of the LCMS spectrum reading method shown in this application;
[0056] Figure 4 is another schematic diagram of the LCMS spectrum reading method shown in this application;
[0057] Figure 5 is a schematic diagram of the LCMS reading model construction device shown in this application;
[0058] Figure 6 is a schematic diagram of the LCMS spectrum reading system shown in this application;
[0059] Figure 7 is another schematic diagram of the LCMS reading system shown in this application;
[0060] Figure 8 is a schematic diagram of the structure of the electronic device shown in this application.
Detailed Implementation
[0061] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to make this application more thorough and complete, and to fully convey the scope of this application to those skilled in the art.
[0062] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
[0063] It should be understood that although the terms "first," "second," "third," etc., may be used in this application to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0064] In related technologies, the accuracy of traditional LCMS intelligent spectrum reading results is generally low. If the spectrum reading results are incorrect, for example, if the LCMS spectrum reading scheme incorrectly judges the absence of products as the presence of products and enters the subsequent automated processes such as post-processing and separation, it will cause resource waste or even process interruption and affect production efficiency.
[0065] To address the aforementioned issues, this application provides a method for constructing an LCMS reading model and the application of the LCMS reading model, which can efficiently and accurately output target reading results for different business types, providing a reliable guarantee for the automated preparation process of products.
[0066] This application pre-builds an LCMS reading model and then applies the trained LCMS reading model to each business stage for intelligent reading, so as to obtain the target reading results of each business stage efficiently and accurately.
[0067] Referring to Figure 1, the technical solution for constructing the LCMS reading model of this application is described in detail below with reference to the accompanying drawings.
[0068] S110, Obtain LCMS training data. The LCMS training data includes training samples and corresponding training labels. The training labels include the spectrum reading results corresponding to at least one business type.
[0069] Taking the development of small molecule drugs as an example, according to the production process / reaction sequence of chemical synthesis, multiple raw materials undergo chemical reactions, separation and purification, quality inspection, and other steps to obtain the target product. Accordingly, the business types of this application include spectrum reading for at least one of the intermediate process control stage, separation and purification stage, quality control stage, and scenario deployment stage of chemical synthesis. Each stage corresponds to one business type, and different business types yield different spectrum reading results, i.e., different training labels are used.
[0070] In some implementations, the business types of the LCMS spectrum reading model include at least one of the following:
[0071] (1) In the intermediate process control stage of chemical synthesis, predict whether the target product exists and / or whether the content of the target product meets the standard.
[0072] (2) In the separation and purification stage of chemical synthesis, predict at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product.
[0073] (3) In the quality control stage of chemical synthesis, predict whether the target product meets the quality standard and / or recommend the sample injection method and injection volume.
[0074] (4) In the scenario deployment stage of chemical synthesis, predict the applicable scenario category, which includes fully automated processing, semi-automated processing or manual processing.
[0075] In other words, each business type corresponds to a different application stage, and the target spectral reading results predicted by the LCMS spectral reading model differ accordingly. The training labels therefore differ by business type: the training labels used by the LCMS spectral reading model during training are consistent with the target spectral reading results of that business type. For example, in the intermediate process control stage of chemical synthesis (hereinafter referred to as the IPC stage), the training labels include whether the target product corresponding to the training sample exists and / or whether the content of the target product meets the standard. The IPC stage requires LCMS spectral analysis of the mixture from the intermediate reaction process. Because the products of the intermediate reaction process are relatively complex, there may be many impurities. The analysis reports generated by the analysis instrument's built-in software often contain limited information, so the features that can be extracted from those reports for model training are also limited, which seriously affects the model's performance. Therefore, the original files generated by the analysis instrument cannot be used directly and need to be parsed to extract sufficiently rich features for model training. In the separation and purification stage of chemical synthesis, the training labels include at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product. Here too, the raw data generated by the analytical instrument cannot be used directly; the underlying data must be analyzed and key features extracted. In the quality control stage of chemical synthesis, the training labels include whether the target product corresponding to the training sample meets the quality standard and / or the recommended sample injection method and injection volume. The remaining business types follow the same pattern and are not elaborated further here.
[0076] Furthermore, the training samples required for different business types may be all the same, partially the same, or completely different. In some implementations, the training samples may include, but are not limited to, at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features.
[0077] In some specific implementations, training samples include material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features, each with a preset weight value. That is, the importance of each type of feature in the training samples varies with the business type: more critical features have higher weight values, and less critical features have lower ones. When all types of features are used together, the weight value of each feature can be adjusted for the corresponding business type, which helps balance the generalization and accuracy of the trained model. In other words, a different set of feature weights is used for each business type to achieve precise parameter tuning and obtain the corresponding LCMS reading model. Each weight value can be any value between 0 and 1 (inclusive). The distribution of feature weight values for each business type's model can be learned automatically from the label distribution of that business type. Prediction in the IPC stage mainly concerns whether the reaction is successful, so features that affect the determination of product content have relatively high weights, such as the area ratio of the UV peak corresponding to the product's MS signal. Prediction in the separation and purification stage mainly concerns whether the product is easy to separate, so features such as MS purity and UV peak resolution have relatively high weights. The QC stage mainly concerns whether the spectrum meets the delivery standards, so features such as purity and peak shape have relatively high weights.
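As an illustration of per-business-type feature weighting, the following minimal sketch scales each of the five feature groups by a weight in [0, 1] before the features are passed to a model. The group names, the `WEIGHTS` table, and all weight values are hypothetical; the application states that weights may be preset or learned from the label distribution and does not prescribe this particular scheme.

```python
# Hypothetical names for the five feature groups described in the text.
FEATURE_GROUPS = ["structure", "ms", "uv", "ms_uv", "instrument"]

# Hypothetical weight table: one weight in [0, 1] per feature group and business type.
WEIGHTS = {
    "ipc": {"structure": 0.5, "ms": 0.75, "uv": 0.75, "ms_uv": 1.0, "instrument": 0.25},
    "purification": {"structure": 0.5, "ms": 1.0, "uv": 1.0, "ms_uv": 0.75, "instrument": 0.5},
    "qc": {"structure": 0.25, "ms": 0.75, "uv": 1.0, "ms_uv": 0.75, "instrument": 0.5},
}


def apply_group_weights(sample, business_type):
    """Scale every feature group of one sample by its business-type weight.

    `sample` maps each group name to a list of numeric feature values.
    Returns one flat feature vector in FEATURE_GROUPS order.
    """
    weights = WEIGHTS[business_type]
    weighted = []
    for group in FEATURE_GROUPS:
        w = weights[group]
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight for {group!r} must lie in [0, 1]")
        weighted.extend(w * x for x in sample[group])
    return weighted
```

A weight of 0 removes a feature group from a given business type, while a weight of 1 passes it through unchanged.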
[0078] In some specific implementations, the material structure features include, but are not limited to, at least one of the following: compound molecular weight, compound SMILES expression, molecular fingerprint or latent space vector representation extracted using graph neural networks (GNNs), and specific group structure information related to LCMS analysis, such as the structure information of groups such as Boc (tert-butyloxycarbonyl), benzylamine, and benzyl alcohol.
[0079] In some specific implementations, mass spectrum features, also known as mass spectrometry (MS) signal features, include, but are not limited to, at least one of the following: common MS feature signals matched in the mass spectrum, such as M+1, M+23, and M-1 signals, as well as MS feature signals related to the structure of specific substances, such as M+100-1 and M+56-1 signals related to Boc, M+1-17 signals related to benzylamine, and M+1-18 signals related to benzyl alcohol. It should be noted that M refers to the mass-to-charge ratio of the molecular ion peak of the target compound (i.e., the ion formed by the loss or gain of an electron by the molecule). For each MS feature signal, features such as peak height, peak area, peak width, and peak start and end times need to be extracted. In addition to the features with clear physical meaning mentioned above, high-dimensional features represented by latent space vectors, extracted from the raw MS signal using deep learning models (such as convolutional neural networks (CNN), Transformer, ViT, etc.), can also be included.
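Matching the common signals named above can be sketched as follows. This is a minimal illustration, assuming nominal integer offsets for M+1, M+23, and M-1 and a hypothetical matching tolerance `tol`; the function name and the tolerance are not taken from the application.

```python
# Offsets follow the signal names in the text; using nominal values is an assumption.
COMMON_SIGNALS = {"M+1": 1.0, "M+23": 23.0, "M-1": -1.0}


def match_common_signals(m, observed_mz, tol=0.5):
    """For each named signal, return the closest observed m/z within `tol`, else None.

    `m` is the reference value M for the target compound and `observed_mz`
    is a list of m/z values at which peaks were detected.
    """
    matches = {}
    for name, offset in COMMON_SIGNALS.items():
        target = m + offset
        best = min(observed_mz, key=lambda mz: abs(mz - target), default=None)
        if best is not None and abs(best - target) <= tol:
            matches[name] = best
        else:
            matches[name] = None
    return matches
```

Each matched signal would then be passed to peak feature extraction (height, area, width, start and end times), as the paragraph describes.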
[0080] In some specific implementations, ultraviolet absorption spectral (UV) features, also known as UV signal features, include, but are not limited to, UV peak features extracted from UV signals at specific wavelengths (such as 220 nm, 254 nm, etc.), such as peak height, peak area, peak width, and peak start and end times. In addition to features with explicit physical meaning, high-dimensional features represented by latent space vectors extracted from the underlying UV signal using deep learning models (such as convolutional neural networks (CNN), Transformer, ViT, etc.) can also be included.
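The peak features listed above (height, area, width, start and end times) can be computed from a single-wavelength UV trace roughly as follows. This is a simplified sketch that treats every contiguous run of points above a fixed `threshold` as one peak and integrates with the trapezoidal rule; the thresholding rule is an assumption, and a production peak picker would typically also handle baseline correction and overlapping peaks.

```python
def uv_peak_features(times, signal, threshold):
    """Extract simple features for each contiguous run of points above `threshold`."""
    peaks = []
    start = None
    last = len(signal) - 1
    for i, value in enumerate(signal):
        above = value > threshold
        if above and start is None:
            start = i  # a new peak begins here
        if start is not None and (not above or i == last):
            end = i if above else i - 1  # last index that is still above threshold
            seg_t = times[start:end + 1]
            seg_y = signal[start:end + 1]
            # Trapezoidal integration over the peak segment.
            area = sum(
                (seg_t[k + 1] - seg_t[k]) * (seg_y[k] + seg_y[k + 1]) / 2.0
                for k in range(len(seg_t) - 1)
            )
            peaks.append({
                "start_time": seg_t[0],
                "end_time": seg_t[-1],
                "width": seg_t[-1] - seg_t[0],
                "height": max(seg_y),
                "area": area,
            })
            start = None
    return peaks
```

The same function could be applied to an extracted MS trace, since the text lists the same peak features for MS feature signals.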
[0081] In some specific implementations, MS-UV correlation features include, but are not limited to, the MS signal features corresponding to the time range covered by the UV peaks, such as the MS peak features and peak intensity ratios corresponding to the aforementioned common MS feature signals. In addition to the features with clear physical meanings mentioned above, high-dimensional features represented by latent space vectors can also be included; these are extracted using deep learning models (such as convolutional neural networks (CNN), Transformer, ViT, etc.) from the integrated signal obtained by aligning the raw UV signal and the raw MS signal along the time dimension.
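One way to derive such MS-UV correlation features is to sum each MS feature signal over the time window of a UV peak and compute each signal's share of the total. The sketch below assumes that definition of "intensity ratio", and the optional `uv_offset` parameter (with its sign convention) is a hypothetical stand-in for an instrument-specific UV time offset; neither is specified in the application.

```python
def ms_uv_correlation(uv_start, uv_end, ms_traces, uv_offset=0.0):
    """Summarize MS signals that fall inside one UV peak's time window.

    `ms_traces` maps a signal name (e.g. "M+1") to a list of (time, intensity)
    points. `uv_offset` is subtracted from the UV window to map it onto the
    MS time axis (assumed convention).
    """
    lo, hi = uv_start - uv_offset, uv_end - uv_offset
    in_window = {
        name: sum(intensity for t, intensity in trace if lo <= t <= hi)
        for name, trace in ms_traces.items()
    }
    total = sum(in_window.values())
    return {
        name: {"intensity": value, "ratio": value / total if total > 0 else 0.0}
        for name, value in in_window.items()
    }
```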
[0082] In some specific implementations, instrument parameter characteristics include, but are not limited to, at least one of the following: instrument manufacturer, instrument name, instrument column model, instrument-specific UV signal offset, analytical method gradient used by the instrument, and mobile phase information. Different instrument parameters have a significant impact on the generated spectral results, such as the time shift of the UV peak and detector sensitivity. Therefore, the introduction of instrument parameter characteristics can effectively improve the accuracy of subsequent target spectral readings.
[0083] To obtain high-quality training samples for training a more accurate LCMS reading model, some implementations include training samples of mixtures and training samples of pure substances in the LCMS training data. That is, whether it is a mixture or a pure substance, it contains at least one of the five features mentioned above.
[0084] Specifically, LCMS training data can come from various sources, such as academic literature, public databases, and LCMS datasets of pure substances (pure products). Academic literature can be papers that publish complete experimental data; public databases such as PubChem, ChemSpider, MassBank, and METLIN provide a large number of compounds and corresponding LCMS data. LCMS data includes the time-series information of the spectra generated by the LCMS instrument, instrument configuration information, and the material structure information of the products. Similarly, a series of representative pure substances can be selected according to the research objective. These selected pure substances include different chemical categories and physicochemical properties. Experimental protocols can then be designed, including sample preparation, chromatographic conditions, and mass spectrometry parameters, followed by LCMS experiments, and finally, the corresponding LCMS data can be collected. In addition, high-throughput experiments can be designed to collect LCMS data of common chemical reactions, such as Suzuki and Buchwald reactions. Samples, including reactants, catalysts, and solvents, are prepared according to predetermined reaction conditions. Automated sample processing and LCMS analysis systems are used to conduct experiments, and LCMS data for each sample are collected. Using the data collection methods described above, a large-scale dataset can be constructed for training the LCMS reading model. Based on the LCMS data collected from the different channels mentioned above, the five features described above, in a unified format, are extracted and used as training samples.
[0085] To further increase data diversity and enable the model to learn more specific tasks, more refined data augmentation strategies can be adopted for specific tasks, such as determining the presence of a certain compound in a spectrum. Taking the pure spectrum of compound A as an example, we can use it as a positive sample indicating the presence of compound A. However, this spectrum can also be used as a negative sample indicating the absence of other compounds such as compounds B, C, and D, as long as these compounds exhibit completely different characteristics in the spectrum compared to compound A. This method can significantly expand the number of negative samples because each pure spectrum can be used to represent multiple absent compounds.
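The negative-sample expansion described above can be sketched as a small labeling helper. The `distinct_from` mapping is a hypothetical input that records which compounds are known to show completely different spectral characteristics; how that knowledge is obtained is not specified here.

```python
def label_pure_spectrum(spectrum_id, compound, all_compounds, distinct_from):
    """Build (spectrum_id, query_compound, label) rows from one pure-substance spectrum.

    The spectrum is a positive sample (label 1) for its own compound and a
    negative sample (label 0) for every other compound listed as spectrally
    distinct from it.
    """
    rows = [(spectrum_id, compound, 1)]
    distinct = distinct_from.get(compound, set())
    for other in all_compounds:
        if other != compound and other in distinct:
            rows.append((spectrum_id, other, 0))
    return rows
```

For example, a pure spectrum of compound A with B and D marked as distinct yields one positive row for A and negative rows for B and D; C is skipped because its distinctness from A is not established.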
[0086] In some embodiments, data augmentation processing is performed on the training samples to obtain enhanced LCMS training data. The data augmentation methods include at least one of the following: spectral superposition of multiple pure substances, spectral superposition of simulated noise, and spectral generation of preset functional groups. Unlike conventional data augmentation strategies, such as simple transformations based on the original training samples (e.g., rotation, flipping, or adding noise), the processing method used in this application includes at least one of the following:
[0087] (1) The superposition of spectra of multiple pure substances can also be regarded as the generation of mixed spectra. In some embodiments, the original spectra of at least two pure substances are mixed in a preset ratio, and the time-series signals of each original spectrum are aligned in time and then superimposed to generate the corresponding superimposed spectrum; based on the superimposed spectrum, the corresponding enhanced training sample is obtained. Specifically, a new spectrum is generated by mixing the LCMS spectra of different compounds in a certain ratio. The original LCMS spectrum of a single pure substance is a time-series signal, which contains the intensity of MS and UV signals that change over time. When mixing the LCMS spectra of multiple pure substances, it is only necessary to superimpose the time-series signals of the LCMS spectra of each pure substance in time, and then obtain the training sample corresponding to this enhancement method. The material structure features in the training sample contain the material structure features corresponding to each pure substance. Such training samples can be used to enhance the ability of the LCMS reading model to identify mixtures.
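The superposition step can be sketched for a single signal channel (one UV wavelength or one MS trace) as follows. Aligning by linear interpolation onto a shared time grid, and treating a spectrum as zero outside its recorded time range, are assumptions made for this illustration; the application only states that the time-series signals are aligned in time and superimposed in a preset ratio.

```python
from bisect import bisect_left


def interpolate(times, values, t):
    """Linearly interpolate a trace at time t; return 0.0 outside its time range."""
    if t < times[0] or t > times[-1]:
        return 0.0
    i = bisect_left(times, t)
    if times[i] == t:
        return values[i]
    t0, t1 = times[i - 1], times[i]
    return values[i - 1] + (values[i] - values[i - 1]) * (t - t0) / (t1 - t0)


def superimpose(spectra, ratios, grid):
    """Mix at least two pure-substance traces on a common time grid.

    `spectra` is a list of (times, intensities) pairs with ascending times,
    `ratios` holds one mixing ratio per spectrum, and `grid` is the shared
    time axis of the generated mixed trace.
    """
    if len(spectra) < 2 or len(spectra) != len(ratios):
        raise ValueError("need at least two spectra and one ratio per spectrum")
    return [
        sum(r * interpolate(ts, ys, t) for (ts, ys), r in zip(spectra, ratios))
        for t in grid
    ]
```

The material structure features of the resulting training sample would then list the structures of all mixed pure substances, as the paragraph notes.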
[0088] (2) Spectral overlay with simulated noise involves adding random noise or interference that may occur in the experiment, such as instrument noise or baseline drift, to the original LCMS spectrum of the pure substance to form a new spectrum. Such training data can be used to enhance the robustness of machine learning models.
[0089] (3) Spectrum generation of preset functional groups involves simulating the ionization and collision of sample compounds containing the preset functional groups according to preset parameters to obtain fragment ions. Based on the spectra of these fragment ions, corresponding enhanced training samples are obtained. Specifically, during ionization and collision-induced dissociation (CID), some bonds in molecules are easily broken, forming specific fragment ions. These fragment ions form characteristic ion peaks in the mass spectrum, and their mass-to-charge ratio (m/z) values are key information for molecular structure analysis. The molecular structure information containing the preset functional groups is input into computational simulation software, which takes into account factors such as the specific structural characteristics of the bonded groups and the settings of the LCMS instrument (including the energy applied by the instrument) to predict the probability of each chemical bond in the molecule breaking. Through this simulation-based method, more spectral data can be obtained for substances with a specific type of structure (such as Boc (tert-butyloxycarbonyl), benzylamine, benzyl alcohol, etc.), and corresponding training samples can then be extracted as enhanced data.
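Augmentation methods (1) and (2) above can be sketched as follows. This is a minimal single-channel illustration, not the implementation of this application: each spectrum is represented as a hypothetical mapping from retention time to intensity, missing time points are treated as zero when aligning, and the noise model (Gaussian noise plus a linear baseline drift) is an assumption. Method (3) depends on external simulation software and is not sketched.

```python
import random


def mix_spectra(spectra, ratios):
    """Superimpose time-aligned signals of pure-substance spectra.

    spectra: list of dicts mapping retention time -> intensity.
    ratios: preset mixing ratio for each spectrum.
    """
    if len(spectra) != len(ratios):
        raise ValueError("one ratio per spectrum is required")
    # Align on the union of time points; a missing point counts as 0.
    times = sorted(set().union(*(s.keys() for s in spectra)))
    return {
        t: sum(r * s.get(t, 0.0) for s, r in zip(spectra, ratios))
        for t in times
    }


def add_noise_and_drift(trace, noise_sd, drift_per_point, seed=None):
    """Add assumed Gaussian instrument noise and linear baseline drift."""
    rng = random.Random(seed)
    noisy = {}
    for i, t in enumerate(sorted(trace)):
        noisy[t] = trace[t] + drift_per_point * i + rng.gauss(0.0, noise_sd)
    return noisy


compound_a = {0.0: 0.0, 1.0: 10.0, 2.0: 0.0}
compound_b = {1.0: 2.0, 2.0: 8.0, 3.0: 0.0}
mixture = mix_spectra([compound_a, compound_b], [0.5, 0.5])
augmented = add_noise_and_drift(mixture, noise_sd=0.1, drift_per_point=0.05, seed=0)
```

A real LCMS spectrum carries both MS and UV signals over time; the same superposition would be applied to each channel, and the material structure features of the mixed sample would list every pure substance that was combined.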
[0090] S120: Based on the LCMS training data, iteratively train the initial model to obtain a trained LCMS reading model, which can be used to predict the target reading results corresponding to the business type.
[0091] It is understood that, for different business types, this application can train a separate LCMS reading model for each business type using the LCMS training data corresponding to that business type. Each trained LCMS reading model can be independently applied to a different business stage to predict the target reading results corresponding to the business type. Alternatively, the LCMS training data required for different business types can be combined and trained uniformly to obtain an integrated LCMS reading model. This integrated LCMS reading model can include multiple parallel outputs, each capable of outputting the target reading results corresponding to a different business type.
[0092] Optionally, in some implementations, when the weight values of each feature in the training samples are preset according to different business types, the initial model is iteratively trained based on the LCMS training data and the weight values of each feature in the training samples to obtain a trained LCMS spectrum reading model.
[0093] Optionally, in some implementations, when the training data also includes enhanced LCMS training data, the initial model is iteratively trained based on the LCMS training data and the enhanced LCMS training data to obtain a trained LCMS reading model.
[0094] In other words, depending on the different business requirements for prediction accuracy, different training data can be designed using the above different schemes for iterative training of the initial model, thereby training a more accurate LCMS reading model, or a more generalizable LCMS reading model, or an LCMS reading model suitable for certain specific products, or an LCMS reading model that combines accuracy and generalization.
[0095] It is understood that when the training task involves training multiple independent LCMS reading models, the types of sampling features required in the training samples for different business types may be entirely or partially the same. In other words, training samples of the same type can be used, but labeled with different training labels to adapt to the training tasks of different business types.
[0096] Before formally training the model, in some implementations, an initial model is determined based on the business type; the initial model may include, but is not limited to, at least one of classification models, regression models, and generative models.
[0097] Specifically, when the attribute of the target spectral reading result is categorical, the initial model is a classification model, such as a binary or multi-class classification model. Examples include a binary model to determine the presence / absence of the target product from the spectrum, and a multi-class classification model to determine whether the target product is present, its quantity, and ease of separation. When the attribute of the target spectral reading result is a quantified numerical value, the initial model is a regression model. For example, a regression model to determine the content of the target product in the reaction result from the spectrum. Additionally, cutting-edge generative AI techniques (AIGC) can be used, such as using a diffusion model to analogize the process of multiple mixtures forming an LCMS spectrum to image noise reduction, thereby accurately distinguishing the target product from impurities in the spectrum, and then generating a spectrum of a single product to determine whether the target product is present.
[0098] When the initial model is a classification model or a regression model, the preset algorithm can be, for example, XGBoost, LightGBM, random forest, or a deep learning model with a fully connected layer architecture.
[0099] The following section will use the initial XGBoost model to train a binary classification model for identifying the presence or absence of target products in LCMS spectra as an example to illustrate the training process of the initial model in this application.
[0100] XGBoost is an ensemble learning model that combines multiple weak learners (usually decision trees) to build a strong learner. Each decision tree is a simple model that makes predictions by partitioning the feature space into different regions. XGBoost adds a new tree in each iteration, and each tree attempts to correct the errors from the previous iteration.
[0101] When training an XGBoost binary classification model based on spectral data, the input training data includes feature vectors transformed from the aforementioned features and their corresponding training labels. Each feature vector's classification label uses a specific numerical value to indicate the presence of the target product in the LCMS spectral map; for example, 0 indicates the absence of the target product, and 1 indicates its presence. The training process includes:
[0102] (1) Data preprocessing: Before training the initial model, it is usually necessary to preprocess the spectral data, such as denoising, normalization, feature selection, etc.
[0103] (2) Model initialization: Select an initial model, usually a model that makes a single constant prediction.
[0104] (3) Iterative training: In each iteration, XGBoost adds a new decision tree to improve the model. This process includes calculating residuals, building the tree, and updating the model.
[0105] (4) Pruning: To avoid overfitting, XGBoost uses regularization techniques to prune the trees during the construction of each tree.
[0106] (5) Early termination: After multiple rounds of iterative training according to steps (3) and (4) above, if the performance on the validation set of the training data does not improve within a certain number of rounds, the training can be terminated early.
[0107] The convergence condition of the XGBoost model is typically based on its performance on the validation set. If, over several iterations, the model's evaluation metrics (such as accuracy, AUC, etc.) on the validation set do not show significant improvement, training can stop. This process is called "early stopping." After training, the XGBoost model outputs the probability that each input sample belongs to a certain class. In binary classification problems, this is usually a value between 0 and 1. Based on a threshold (usually 0.5), this probability can be converted into a class label (0 or 1), thus completing the classification and outputting the corresponding target spectrum reading result.
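The training steps (2), (3), and (5) and the thresholding described above can be illustrated with a small, self-contained sketch. It is plain gradient boosting with depth-1 decision stumps written with the Python standard library, not XGBoost itself: XGBoost's regularized objective and pruning in step (4) are not reproduced, and the names `train`, `fit_stump`, `rate`, and `patience` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lm, rm)
    return best[1:] if best else None


def raw_score(model, x):
    base, rate, stumps = model
    return base + rate * sum(lm if x[j] <= t else rm for j, t, lm, rm in stumps)


def log_loss(model, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(raw_score(model, x)), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train(train_x, train_y, valid_x, valid_y, rounds=50, rate=0.5, patience=5):
    # Step (2): start from a single constant prediction (log-odds of base rate).
    pos = min(max(sum(train_y) / len(train_y), 1e-6), 1 - 1e-6)
    base = math.log(pos / (1 - pos))
    stumps = []
    best_loss, best_len = log_loss((base, rate, stumps), valid_x, valid_y), 0
    for _ in range(rounds):
        # Step (3): residuals are the negative gradient of log loss, y - p.
        residuals = [y - sigmoid(raw_score((base, rate, stumps), x))
                     for x, y in zip(train_x, train_y)]
        stump = fit_stump(train_x, residuals)
        if stump is None:
            break
        stumps.append(stump)
        # Step (5): stop when validation loss has not improved for `patience` rounds.
        loss = log_loss((base, rate, stumps), valid_x, valid_y)
        if loss < best_loss - 1e-9:
            best_loss, best_len = loss, len(stumps)
        elif len(stumps) - best_len >= patience:
            break
    return (base, rate, stumps[:best_len])


def predict_label(model, x, threshold=0.5):
    # Convert the predicted probability into a class label (0 or 1).
    return 1 if sigmoid(raw_score(model, x)) >= threshold else 0
```

For example, calling `train([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1], [[0.15], [0.85]], [0, 1])` on this toy, one-feature data yields a model for which `predict_label` separates low from high feature values. In practice the feature vectors would be built from the five feature types described in this application.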
[0108] As can be seen from this example, the LCMS reading model construction method of this application starts with building standardized large-scale, high-quality LCMS training data, designs training data augmentation schemes for specific downstream tasks, and uses the augmented LCMS training data for feature engineering and model training to fully leverage the value of high-quality data. By using different training labels, an LCMS reading model that can be used to predict the target reading results of different business types is obtained. The trained LCMS reading model can accurately predict the target reading results corresponding to different business types, meeting the business needs of different scenarios.
[0109] Referring to Figure 2, this application also discloses a method for constructing an LCMS reading model, which includes:
[0110] S210: Acquire LCMS training data and enhanced LCMS training data; the LCMS training data includes training samples and corresponding training labels; the training labels include spectrum reading results corresponding to at least one business type.
[0111] In this embodiment, the training samples for both types of LCMS training data include material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features. Based on business requirements, corresponding training labels are determined, and the number of training labels is determined according to the business type. That is, when there are multiple business types, there will be multiple training labels accordingly.
[0112] It is understood that the enhanced LCMS training data can be obtained by using existing LCMS training data and performing enhancement processing according to at least one of the data augmentation methods mentioned above, which will not be elaborated here.
[0113] S220: Obtain the weight values of each feature in the training samples corresponding to the business type.
[0114] For each business type, the corresponding training samples include the five sampling features mentioned above. The difference between the training samples for each business type is that the weight values of the same sampling feature can be the same or different, thus adapting to the different business types' attention to different types of sampling features, so as to improve the accuracy of the trained LCMS reading model in the corresponding business type.
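The text does not specify how the weight values enter training. One possible reading, sketched below purely as an assumption, is to scale each feature group by its business-type weight before the groups are concatenated into a feature vector. The group names in `FEATURE_GROUPS` and the function `weighted_vector` are hypothetical.

```python
# Hypothetical keys for the five sampling feature types named in the text.
FEATURE_GROUPS = ("structure", "ms", "uv", "ms_uv", "instrument")


def weighted_vector(sample, weights):
    """Concatenate feature groups, scaling each by its business-type weight.

    sample: maps a feature group name to a list of numeric features.
    weights: maps a feature group name to its weight (default 1.0).
    """
    vector = []
    for group in FEATURE_GROUPS:
        w = weights.get(group, 1.0)
        vector.extend(w * v for v in sample[group])
    return vector


sample = {
    "structure": [1.0],
    "ms": [2.0, 4.0],
    "uv": [3.0],
    "ms_uv": [1.0],
    "instrument": [5.0],
}
vector = weighted_vector(sample, {"ms": 0.5, "uv": 2.0})
```

Scaling of this kind influences models such as fully connected networks; tree-based models split on thresholds and are unaffected by positive rescaling of a feature, so for those models the weights would have to be applied in another way.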
[0115] S230: Based on the LCMS training data and the enhanced LCMS training data, as well as the weight values of each feature in the training samples, iteratively train the initial model to obtain a trained LCMS spectrum reading model.
[0116] It is understood that, depending on the business type, a corresponding initial model is selected, such as a classification model, a regression model, or a generative model. Based on the selected model, training is performed using the corresponding training data to obtain the network parameters of the trained model, thereby obtaining the trained LCMS spectrum reading model.
[0117] For different business types, corresponding LCMS reading models can be trained independently to predict the corresponding target reading results. Alternatively, training data from business types that use similar models can be aggregated for training to obtain a single large LCMS reading model.
[0118] As can be seen from this example, the LCMS reading model construction method of this application can construct large-scale training data based on LCMS training data and enhanced LCMS training data. In addition, the sampling features in the training data can be weighted by the weight values corresponding to the business type, so that the initial model focuses on the influence of the sampling features with larger weight values on the target reading results during training, thereby training to obtain an LCMS reading model that meets the needs of specific business types.
[0119] Referring to Figure 3, this application discloses an LCMS spectrum reading method, which includes:
[0120] S310: Based on the business type, obtain the corresponding sampling features, which include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features.
[0121] The spectral reading method of this application can be applied to liquid chromatography-mass spectrometry (LC-MS) or gas chromatography-mass spectrometry (GC-MS). This spectral reading method can be deployed in either business type.
[0122] S320: Based on the sampling features and the pre-built LCMS spectrum reading model, output the corresponding target spectrum reading result.
[0123] In this step, all sampling features are input into the LCMS reading model. The LCMS reading model of this application is pre-trained according to the LCMS reading model construction method of the above embodiments. Depending on different service types, corresponding LCMS reading models can be adopted. In some embodiments, the LCMS reading model includes at least one of a first model, a second model, a third model, and a fourth model.
[0124] In some implementations, the first model is used to predict and output the corresponding target spectral reading results during the intermediate process control (IPC) stage of chemical synthesis. These target spectral reading results include, but are not limited to, the presence of the target product and / or whether the content of the target product meets the standards. It is understood that the intermediate process control stage (hereinafter referred to as the IPC stage) refers to the stage in chemical synthesis where intermediate products are controlled and detected. This stage is crucial for ensuring the quality and yield of the final product. It should be noted that in chemical synthesis, such as in the IPC stage of drug synthesis, there are usually multiple steps, each of which may produce one or more intermediates. The purpose of the IPC stage is to monitor and verify the quality of these intermediates to ensure they meet predetermined specifications and purity requirements, i.e., detection and judgment are performed after each step. Based on this, after the IPC stage, the collected product is the crude mixture after the reaction of this stage. By sampling this crude mixture and collecting corresponding characteristic data, the sampling characteristics corresponding to the target product are formed and input into the first model, outputting the target spectral reading results corresponding to the IPC stage. The target spectral reading results output in this stage indicate whether the target product exists in the crude mixture after the reaction; preferably, it can further output whether the content of the target product is sufficient. In some implementations, the content of the target product can be determined according to corresponding criteria based on business needs to predict whether the content of the target product is sufficient. 
In some specific implementations, when the peak area ratio of the UV peak in the ultraviolet absorption spectrum of the target product is greater than a preset value, the content of the target product is predicted to be sufficient. The preset value can be selected, for example, from 20% to 30%. In addition, the specific content of the target product can also be output, for example, by using the internal standard method to determine the content of the target product more accurately.
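The peak-area criterion above can be sketched as follows. The sketch assumes that integrated UV peak areas are already available (for example from the instrument report); the function name `target_content_sufficient` and the dictionary layout are assumptions, and the default preset value of 0.2 is simply the lower end of the 20% to 30% range mentioned in the text.

```python
def target_content_sufficient(peak_areas, target_peak, preset_ratio=0.2):
    """Return True if the target UV peak's area ratio exceeds the preset value.

    peak_areas: maps a peak name to its integrated UV peak area.
    target_peak: key of the peak assigned to the target product.
    """
    total = sum(peak_areas.values())
    if total <= 0:
        return False
    return peak_areas[target_peak] / total > preset_ratio


areas = {"target": 30.0, "impurity_1": 50.0, "impurity_2": 20.0}
sufficient = target_content_sufficient(areas, "target", preset_ratio=0.25)
```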
[0125] In some embodiments, the second model is used to predict the corresponding target spectral readings during the separation and purification stage of chemical synthesis. These target spectral readings include, but are not limited to, at least one of the following: applicable separation system and method; whether the spacing between the UV absorption peak of the target product and adjacent peaks meets the requirements; and whether the UV absorption peak of the target product contains any extraneous MS signals unrelated to the target product. It should be noted that the separation and purification stage after chemical synthesis is a crucial step, used to extract the target product from the crude mixture and remove byproducts, unreacted raw materials, catalysts, and other impurities. On one hand, a key objective of this separation and purification stage is to determine whether the UV absorption peaks in the UV absorption spectrum of the target product are easily separated. Accordingly, the criteria for this determination may include whether the spacing between the UV absorption peak of the target product and adjacent peaks is sufficiently large, and whether the UV absorption peak of the target product contains any extraneous MS signals unrelated to the target product. On the other hand, by comparing and analyzing multiple spectra of the crude mixture under different gradients, times, and mobile phases, a suitable separation system and method are determined.
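The two ease-of-separation criteria described above can be sketched as simple checks. This is an illustration under stated assumptions, not the second model itself: the function name `easy_to_separate`, the minimum retention-time gap `min_gap`, and the m/z tolerance `mz_tol` are all hypothetical parameters.

```python
def easy_to_separate(target_rt, other_rts, min_gap,
                     target_mz, ms_under_peak, mz_tol=0.5):
    """Check peak spacing and extraneous MS signals for the target UV peak.

    target_rt: retention time of the target product's UV peak.
    other_rts: retention times of the other UV peaks.
    min_gap: hypothetical minimum spacing required between peaks.
    target_mz: m/z values expected for the target product.
    ms_under_peak: m/z values observed under the target UV peak.
    """
    # Criterion 1: spacing to every other peak is sufficiently large.
    gap_ok = all(abs(rt - target_rt) >= min_gap for rt in other_rts)
    # Criterion 2: no MS signal under the peak is unrelated to the target.
    extraneous = [
        mz for mz in ms_under_peak
        if all(abs(mz - expected) > mz_tol for expected in target_mz)
    ]
    return gap_ok and not extraneous, extraneous


ok, extraneous = easy_to_separate(
    target_rt=2.0, other_rts=[1.0, 3.5], min_gap=0.5,
    target_mz=[301.1], ms_under_peak=[301.1, 415.2],
)
```

In this example the spacing is acceptable, but an unrelated signal at m/z 415.2 lies under the target peak, so the peak is reported as not easily separated.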
[0126] In some implementations, the third model is used to predict the corresponding target spectral reading results during the quality control stage of chemical synthesis. These target spectral reading results include, but are not limited to, whether the target product meets quality standards and / or recommended sample injection methods and injection volumes. It is understood that entering the quality control stage indicates that the aforementioned rigorous IPC stage and separation purification have already been completed, resulting in relatively pure LCMS spectra. Therefore, the spectral reading algorithm in the quality control stage prioritizes using information from the instrument's report for various sampling feature extractions. In this application, the target spectral reading result can be determined based on the LCMS spectrum to indicate whether the target product meets or does not meet quality standards. When the LCMS spectrum is unqualified, i.e., the target product does not meet quality standards, the purity, peak height, and peak shape of the current target product can be analyzed based on the LCMS spectrum to recommend injection methods and injection volumes for the QC stage, thereby reducing the total number of QC injections and improving the efficiency of the QC stage. For example, if the peak height of the LCMS spectrum is insufficient, the QC injection volume can be increased; if an acidic injection method cannot obtain ideal QC results, an alkaline injection method can be used, etc.
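The two recommendation examples at the end of the preceding paragraph can be written as a rule-based sketch. The function `recommend_qc_injection`, its parameters, and the fixed volume increment are assumptions for illustration; the text leaves the actual recommendation logic to the third model.

```python
def recommend_qc_injection(peak_height, min_peak_height, volume_ul,
                           method, method_ok, volume_step_ul=1.0):
    """Suggest an injection method and volume for the next QC injection.

    peak_height / min_peak_height: observed and required peak height.
    volume_ul: current injection volume (microliters, assumed unit).
    method: current injection method, "acidic" or "alkaline".
    method_ok: whether the current method gave acceptable QC results.
    """
    # Insufficient peak height: increase the injection volume.
    new_volume = volume_ul + volume_step_ul if peak_height < min_peak_height else volume_ul
    # Acidic method not giving ideal results: switch to the alkaline method.
    new_method = "alkaline" if (method == "acidic" and not method_ok) else method
    return new_method, new_volume


recommendation = recommend_qc_injection(
    peak_height=40.0, min_peak_height=100.0, volume_ul=2.0,
    method="acidic", method_ok=False,
)
```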
[0127] In some implementations, the fourth model is used to predict the corresponding target spectral reading results during the scenario deployment phase of chemical synthesis. The target spectral reading results include applicable scenario categories, which include, but are not limited to, fully automated processing, semi-automated processing, or manual processing. Based on the output scenario category, the reliability of the target spectral reading results can be further determined. Specifically, the corresponding scenario category can be predicted based on the confidence level of the fourth model's judgment of the current scenario. When the scenario category is automated processing, it indicates that the reliability of the target spectral reading result is high, and it can be directly adopted to determine whether to proceed to the next reaction stage. When the scenario category is semi-automated processing or manual processing, it indicates that the reliability of the target spectral reading result is low or very low. To avoid triggering a series of subsequent errors, manual intervention is required to evaluate the correctness of the target spectral reading result, thereby determining whether to proceed to the next reaction stage and correcting errors in a timely manner. It is understandable that for fully automated equipment, since there is no human intervention in the process to promptly correct erroneous spectral reading results, it is difficult to troubleshoot errors at any stage, leading to interruptions and difficulties in repairing the automated process. 
This application can monitor the target spectrum reading results at different business stages, evaluate the scenario category of the target spectrum reading results, and then hand over the parts that can be accurately judged and interpreted to automated equipment, while handing over the parts that are not accurate or have poor interpretability to manual or semi-manual processing, thus avoiding a large-scale chain reaction caused by a single erroneous data.
[0128] In practical applications, the fourth model can be used in conjunction with the first, second, and third models to predict the applicable scenarios for these three models. For example, after the first model outputs the spectrum reading results, the fourth model can simultaneously output the corresponding scenario category to evaluate whether the output results of the first model are suitable for automation.
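The pairing of a reading result with a scenario category can be sketched as a confidence-based routing step. The thresholds `auto_min` and `semi_min` and the function names are hypothetical; the text states only that the category is predicted from the fourth model's confidence, not how the cut-offs are chosen.

```python
def scenario_category(confidence, auto_min=0.9, semi_min=0.6):
    """Map a confidence value in [0, 1] to a scenario category."""
    if confidence >= auto_min:
        return "fully automated"
    if confidence >= semi_min:
        return "semi-automated"
    return "manual"


def route(reading_result, confidence):
    """Attach the scenario category and a manual-review flag to a result."""
    category = scenario_category(confidence)
    return {
        "result": reading_result,
        "category": category,
        # Only fully automated results proceed without manual intervention.
        "needs_review": category != "fully automated",
    }


decision = route({"target_present": True}, confidence=0.72)
```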
[0129] In some implementations, when the scenario category is semi-automated or manual processing, the corresponding target spectrum reading results and contextual information are displayed on a preset user interface to help the user understand and evaluate the algorithm's decision-making process. For example, if the contextual information indicates factors associated with low confidence, such as excessively high product polarity or peak tailing, the correctness of the target spectrum reading results is checked manually before proceeding to the next stage of synthesis, improving the controllability and timely error correction capability of the overall drug synthesis process. Furthermore, in some implementations, users can correct and/or annotate erroneous target spectrum reading results using preset annotation tools, generating corresponding correction data. When manual review finds the target spectrum reading results to be incorrect or unreasonable, the results can be corrected promptly or the reaction can be interrupted in time, reducing losses in reaction cost.
[0130] Furthermore, in some implementations, the corrected data is aggregated and formatted to correspond to the sampling features for iterative training of the LCMS model. For example, a data collection system can be established to aggregate expert corrections and annotations, format this data into the form required for algorithm training, and remove noise and inconsistent data points through a review and cleaning process, thereby ensuring the quality and consistency of the training data.
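The aggregation and cleaning of correction data can be sketched as follows. This is a minimal illustration assuming that each correction is a pair of sample identifier and corrected label, and that "inconsistent data points" means samples for which experts submitted conflicting labels; both the data layout and the function names are assumptions.

```python
from collections import defaultdict


def clean_corrections(corrections):
    """Aggregate expert corrections and drop samples with conflicting labels.

    corrections: list of (sample_id, corrected_label) pairs.
    """
    labels_by_sample = defaultdict(set)
    for sample_id, label in corrections:
        labels_by_sample[sample_id].add(label)
    # Keep only samples on which all submitted corrections agree.
    return {
        sample_id: next(iter(labels))
        for sample_id, labels in labels_by_sample.items()
        if len(labels) == 1
    }


def to_training_rows(features_by_sample, cleaned):
    """Pair each cleaned label with its sampling features for retraining."""
    return [
        (features_by_sample[sample_id], label)
        for sample_id, label in sorted(cleaned.items())
        if sample_id in features_by_sample
    ]


corrections = [("s1", 1), ("s1", 1), ("s2", 0), ("s2", 1), ("s3", 0)]
cleaned = clean_corrections(corrections)
rows = to_training_rows({"s1": [0.1, 0.2], "s3": [0.3, 0.4]}, cleaned)
```

Here sample `s2` received conflicting corrections and is excluded, while `s1` and `s3` become new training rows in the same feature format used for the original training samples.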
[0131] As can be seen from this example, the LCMS reading method of this application can use rich sampling features and a trained LCMS reading model to read spectra according to different business stages of chemical synthesis, efficiently and accurately obtain target reading results for R&D personnel to refer to, meet the needs of different use cases, improve drug development efficiency, and reduce the chain reaction caused by errors in reading results.
[0132] Referring to Figure 4, the LCMS spectrum reading method in one embodiment of this application includes:
[0133] S410: Based on the service type, obtain the corresponding sampling characteristics, including material structure characteristics, mass spectrum characteristics, ultraviolet absorption spectrum characteristics, MS-UV correlation characteristics, and instrument parameter characteristics.
[0134] In this application, the chemical synthesis process sequentially includes an intermediate process control stage, a separation and purification stage, and a quality control stage. At the end of each stage, samples corresponding to that stage can be collected and analyzed using an LCMS instrument to obtain the corresponding LCMS spectra, and then the corresponding sampling characteristics can be extracted.
[0135] S420: Based on the sampling features, output the corresponding target spectrum reading result through the pre-trained LCMS spectrum reading model, where the target spectrum reading result matches the corresponding business type.
[0136] S430: Monitor the accuracy and/or recall of the target spectrum reading results output by the LCMS spectrum reading model; when the accuracy and/or recall is lower than the preset performance indicator, update and iterate the LCMS spectrum reading model based on the correction data.
[0137] Specifically, in this application, the target spectrum reading results output by the machine learning model can be collected according to a preset period or a preset threshold for the amount of data collected, and compared with the correct results to statistically analyze the accuracy and / or recall of the current version of the model prediction. In some embodiments, the preset period may be, for example, daily, several days, weekly, or monthly, etc., and is not limited thereto.
[0138] For example, by establishing a long-term automatic monitoring module, the accuracy and / or recall of model predictions within the current period or the current data collection volume can be calculated according to a preset cycle or when the total amount of collected data reaches a preset threshold. When the accuracy and / or recall is lower than the preset performance index, the LCMS reading model can be updated and iterated based on the correction data collected in the above embodiments to obtain an updated machine learning model. Accordingly, after the model is updated, its performance can continue to be monitored to ensure that the improvement measures are effective and that new problems can be discovered and resolved in a timely manner. This design, by establishing a long-term feedback mechanism, encourages users to continuously provide opinions and insights, making full use of experts' experience and knowledge to continuously optimize the algorithm's performance. This not only improves the quality of decision-making but also enhances the adaptability and reliability of the algorithm.
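The monitoring check described above can be sketched for a binary reading result as follows. The thresholds `min_accuracy` and `min_recall` stand in for the preset performance index and are hypothetical values, as are the function names; the sketch only decides whether an update should be triggered and does not perform the retraining.

```python
def accuracy_and_recall(predicted, actual):
    """Compute accuracy and recall for binary labels (1 = target present)."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    actual_pos = sum(1 for a in actual if a == 1)
    accuracy = correct / len(actual)
    # Recall is undefined when the period contains no positive samples.
    recall = true_pos / actual_pos if actual_pos else None
    return accuracy, recall


def needs_update(predicted, actual, min_accuracy=0.9, min_recall=0.9):
    """Return True when either metric falls below its preset threshold."""
    accuracy, recall = accuracy_and_recall(predicted, actual)
    return accuracy < min_accuracy or (recall is not None and recall < min_recall)


# Results collected over one monitoring period, compared with correct results.
trigger = needs_update(predicted=[1, 0, 1, 0], actual=[1, 1, 1, 0])
```

Such a check would run once per preset period or whenever the amount of collected data reaches the preset threshold, and a positive result would start an update using the collected correction data.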
[0139] As can be seen from this example, the LCMS reading method of this application, by using the LCMS reading model trained with a high-quality dataset, can be applied to scenarios in different business stages and adapt to the intelligent reading needs of various businesses. In addition, it can promptly review and correct the target reading results according to different scenario categories to avoid a chain of errors caused by one error. At the same time, by monitoring the accuracy and / or recall of the LCMS reading model, the model can be updated and iterated in a timely manner based on the collected correction data, thereby obtaining a more accurate intelligent LCMS reading model.
[0140] Corresponding to the aforementioned application function implementation method embodiments, this application also provides an LCMS reading model construction device, an LCMS reading system, and corresponding embodiments.
[0141] Figure 5 is a schematic diagram of the LCMS reading model construction device shown in this application.
[0142] Referring to Figure 5, the LCMS spectrum reading model construction device shown in this application includes a training data acquisition module 510 and a model training module 520. Wherein:
[0143] The training data acquisition module 510 is used to acquire LCMS training data, which includes training samples and corresponding training labels. The training samples include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features. The training labels include spectral readings corresponding to at least one service type.
[0144] The model training module 520 is used to iteratively train the initial model based on the LCMS training data to obtain a trained LCMS reading model, which can be used to predict the target reading results corresponding to the business type.
[0145] Furthermore, in some implementations, the training data acquisition module 510 is also used to acquire the weight values of each feature in the training samples corresponding to the business type. The model training module 520 is used to iteratively train the initial model based on the LCMS training data and the weight values of each feature in the training samples to obtain a trained LCMS reading model.
[0146] Furthermore, in some embodiments, the training data acquisition module 510 is also used to perform data augmentation processing on the training samples to obtain augmented LCMS training data. The model training module 520 is used to iteratively train the initial model based on the LCMS training data and the augmented LCMS training data to obtain a trained LCMS spectrum reading model.
[0147] Furthermore, in some embodiments, the training data acquisition module 510 is also used to acquire the weight values of each feature in the training samples corresponding to the business type and the enhanced LCMS training data. The model training module 520 is used to iteratively train the initial model based on the weight values of each feature, the LCMS training data, and the enhanced LCMS training data to obtain a trained LCMS reading model.
[0148] As can be seen from this example, the LCMS reading model building device of this application can train an initial model selected according to the business type based on rich training data, and build a more accurate LCMS reading model.
[0149] Referring to Figure 6, the LCMS spectrum reading system shown in this application includes a data acquisition module 510 and a model spectrum reading module 520. Wherein:
[0150] The data acquisition module 510 is used to acquire corresponding sampling features according to the business type. The sampling features include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features.
[0151] The model reading module 520 is used to output the corresponding target reading result based on the sampling features and the pre-built LCMS reading model.
[0152] Referring to Figure 7, in some specific implementations, the LCMS spectrum reading model includes at least one of a first model, a second model, a third model, and a fourth model; the model spectrum reading module 520 includes at least one of a first processing module 521, a second processing module 522, a third processing module 523, and a fourth processing module 524. Wherein:
[0153] The first processing module 521 is used to output target spectral reading results through the first model during the intermediate process control stage of chemical synthesis. The target spectral reading results include whether the target product exists and / or whether the content of the target product meets the standard.
[0154] The second processing module 522 is used to output target spectral reading results through the second model during the separation and purification stage of chemical synthesis. The target spectral reading results include at least one of the following: the applicable separation system and separation method; whether the interval between the ultraviolet absorption peak of the target product and the adjacent peak meets the standard; and whether the ultraviolet absorption peak of the target product is mixed with MS signals unrelated to the target product.
[0155] The third processing module 523 is used to output target spectral reading results through the third model during the quality control stage of chemical synthesis. The target spectral reading results include whether the target product meets the quality standard and / or the recommended sample injection method and injection volume.
[0156] The fourth processing module 524 is used to output target spectral reading results through the fourth model during the scenario deployment stage of chemical synthesis. The target spectral reading results include the applicable scenario category, which includes fully automated processing, semi-automated processing, or manual processing.
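The routing from business stage to model described in paragraphs [0152] to [0156] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the `Stage` enum, the `ReadingModule` class, the feature key `target_mz_detected`, and the stub predictor are all hypothetical names, and the stub merely stands in for a trained first model.

```python
from enum import Enum
from typing import Callable, Dict, Mapping

# Hypothetical names throughout; the application does not specify an API.
class Stage(Enum):
    PROCESS_CONTROL = "intermediate process control"   # served by the first model
    PURIFICATION = "separation and purification"       # served by the second model
    QUALITY_CONTROL = "quality control"                # served by the third model
    SCENARIO_DEPLOYMENT = "scenario deployment"        # served by the fourth model

Features = Mapping[str, object]
Predictor = Callable[[Features], dict]

class ReadingModule:
    """Routes sampling features to the model registered for a business stage."""

    def __init__(self) -> None:
        self._models: Dict[Stage, Predictor] = {}

    def register(self, stage: Stage, model: Predictor) -> None:
        self._models[stage] = model

    def read(self, stage: Stage, features: Features) -> dict:
        if stage not in self._models:
            raise KeyError(f"no model registered for stage: {stage.value}")
        return self._models[stage](features)

# Stub standing in for a trained first model (assumption, for illustration only).
def stub_process_control(features: Features) -> dict:
    found = bool(features.get("target_mz_detected", False))
    return {"target_present": found}

module = ReadingModule()
module.register(Stage.PROCESS_CONTROL, stub_process_control)
result = module.read(Stage.PROCESS_CONTROL, {"target_mz_detected": True})
```

Because the text requires only "at least one of" the four models, the sketch allows a stage to be left unregistered and raises an error when a result is requested for it.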
[0157] In some implementations, the spectrum reading system also includes a data collection module 530, which collects correction data uploaded by users and converts it into a format corresponding to the sampling features for iterative training of the LCMS spectrum reading model.
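One possible way to put user corrections into the same shape as the sampling features is sketched below. The key names in `FEATURE_KEYS`, the function `to_training_record`, and the record layout are assumptions made for illustration; the application does not define a storage format.

```python
from typing import Any, Dict, Mapping

# Hypothetical keys mirroring the five feature groups named in the text.
FEATURE_KEYS = (
    "material_structure",
    "mass_spectrum",
    "uv_spectrum",
    "ms_uv_correlation",
    "instrument_parameters",
)

def to_training_record(business_type: str,
                       features: Mapping[str, Any],
                       corrected_result: Mapping[str, Any]) -> Dict[str, Any]:
    """Package a user correction as a (sample, label) record for retraining."""
    unknown = set(features) - set(FEATURE_KEYS)
    if unknown:
        raise ValueError(f"unexpected feature keys: {sorted(unknown)}")
    # Missing feature groups are kept as None so every record has the same shape.
    sample = {key: features.get(key) for key in FEATURE_KEYS}
    return {
        "business_type": business_type,
        "sample": sample,
        "label": dict(corrected_result),
    }

# Example values are made up for illustration.
record = to_training_record(
    "quality control",
    {"mass_spectrum": [(301.1, 1.0)], "uv_spectrum": [(2.4, 0.8)]},
    {"meets_quality_standard": False},
)
```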
[0158] In some implementations, the system also includes an automatic monitoring module 540, which monitors the accuracy and / or recall of the target spectrum reading results output by the LCMS spectrum reading model; when the accuracy and / or recall falls below a preset performance index, the LCMS spectrum reading model is updated and iterated based on the correction data.
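The monitoring rule can be illustrated with a small sketch. It assumes binary labels (1 for a positive reading result, 0 otherwise) and uses user-corrected results as ground truth; the function names and the 0.9 thresholds are hypothetical, since the application does not give concrete values for the preset performance index.

```python
from typing import Sequence, Tuple

def accuracy_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and recall for binary labels; 0.0 when a denominator is zero."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    accuracy = correct / len(y_true) if y_true else 0.0
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall

def needs_update(y_true: Sequence[int], y_pred: Sequence[int],
                 min_accuracy: float = 0.9, min_recall: float = 0.9) -> bool:
    """True when either metric falls below its (assumed) preset threshold."""
    accuracy, recall = accuracy_recall(y_true, y_pred)
    return accuracy < min_accuracy or recall < min_recall

corrected = [1, 1, 1, 0, 0]   # labels taken from user corrections
predicted = [1, 0, 0, 0, 0]   # what the model originally output
retrain = needs_update(corrected, predicted)  # accuracy 3/5, recall 1/3
```

In this example both metrics are below the assumed thresholds, so `retrain` is `True` and the collected correction data would be used for the next training iteration.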
[0159] As can be seen from this example, the LCMS spectrum reading system of this application can be accurately adapted to intelligent spectrum reading at different business stages and obtain the target spectrum reading results for each stage. At the same time, by collecting correction data and monitoring the accuracy of the model, the model is updated and iterated in a timely manner, so that it operates in a virtuous cycle and continuously improves the accuracy of its prediction results.
[0160] Regarding the system in the above embodiments, the specific ways in which each module performs operations have been described in detail in the embodiments related to the method, and will not be elaborated further here.
[0161] Figure 8 is a schematic diagram of the structure of the electronic device shown in this application.
[0162] See Figure 8. The electronic device 1000 includes a memory 1010 and a processor 1020.
[0163] The processor 1020 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.
[0164] Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage devices. ROM may store static data or instructions required by processor 1020 or other modules of the computer. A permanent storage device may be a read-write storage device, and may be a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). System memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. System memory may store some or all of the instructions and data required by the processor during operation. Furthermore, memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and / or optical disks may also be used. In some embodiments, the memory 1010 may include a removable storage device that is readable and / or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a high-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or via wired connections.
[0165] The memory 1010 stores executable code, which, when executed by the processor 1020, can cause the processor 1020 to execute part or all of the methods described above.
[0166] Furthermore, the method according to this application can also be implemented as a computer program or computer program product, which includes computer program code instructions for performing some or all of the steps in the method described above.
[0167] Alternatively, this application may be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) storing executable code (or computer program or computer instruction code) thereon, which, when executed by a processor of an electronic device (or a similar apparatus), causes the processor to perform part or all of the steps of the above-described method according to this application.
[0168] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, and they are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A method for constructing an LCMS reading model, characterized in that it comprises: acquiring LCMS training data, which includes training samples and corresponding training labels, wherein the training samples include at least one of material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features, and the training labels include spectrum reading results corresponding to at least one business type; and iteratively training an initial model based on the LCMS training data to obtain a trained LCMS reading model, which is used to predict the target reading result corresponding to the business type.
2. The method according to claim 1, characterized in that the training samples include material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features; the method further includes: obtaining the weight value of each feature in the training samples corresponding to the business type; and the step of iteratively training the initial model based on the LCMS training data to obtain a trained LCMS spectrum reading model includes: iteratively training the initial model based on the LCMS training data and the weight value of each feature in the training samples to obtain the trained LCMS spectrum reading model.
3. The method according to claim 1 or 2, characterized in that the business types of the LCMS spectrum reading model include at least one of the following: in the intermediate process control stage of chemical synthesis, predicting whether the target product exists and / or whether the content of the target product meets the standard; in the separation and purification stage of chemical synthesis, predicting at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product; in the quality control stage of chemical synthesis, predicting whether the target product meets the quality standard and / or recommending a sample injection method and injection volume; and in the scenario deployment stage of chemical synthesis, predicting the applicable scenario category, which includes fully automated processing, semi-automated processing, or manual processing.
4. The method according to any one of claims 1-3, characterized in that the method further includes: determining the initial model based on the business type, wherein the initial model includes at least one of a classification model, a regression model, and a generative model.
5. The method according to any one of claims 1-4, characterized in that the training samples include training samples of mixtures and training samples of pure substances; after the LCMS training data is acquired, the method further includes: performing data augmentation processing on the training samples to obtain augmented LCMS training data, wherein the data augmentation processing includes at least one of the following: superimposing the spectra of multiple pure substances, superimposing simulated noise on spectra, and generating spectra of preset groups; and the step of iteratively training the initial model based on the LCMS training data to obtain a trained LCMS spectrum reading model includes: iteratively training the initial model based on the LCMS training data and the augmented LCMS training data to obtain the trained LCMS spectrum reading model.
6. The method according to claim 5, characterized in that the data augmentation processing of superimposing the spectra of multiple pure substances includes: mixing the original spectra of at least two pure substances in a preset ratio, aligning the time-series signals of the original spectra in time, and then superimposing them to generate the corresponding superimposed spectrum; and obtaining the corresponding augmented training samples based on the superimposed spectrum.
7. The method according to claim 5, characterized in that the data augmentation processing of generating spectra of preset groups includes: simulating the ionization and collision of sample compounds containing the preset groups according to preset parameters to obtain fragment ions; and obtaining the corresponding augmented training samples based on the spectra of the fragment ions.
8. An LCMS spectrum reading method, characterized in that it comprises: obtaining, based on the business type, the corresponding sampling features, which include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features; and outputting, based on the sampling features, the corresponding target spectrum reading result through the LCMS spectrum reading model pre-constructed by the LCMS spectrum reading model construction method as described in any one of claims 1 to 7.
9. The method according to claim 8, characterized in that the LCMS reading model includes at least one of the following: a first model, a second model, a third model, and a fourth model; wherein: the first model is used to predict and output the corresponding target spectral reading results during the intermediate process control stage of chemical synthesis, the target spectral reading results including whether the target product exists and / or whether the content of the target product meets the standard; the second model is used to predict and output the corresponding target spectral reading results during the separation and purification stage of chemical synthesis, the target spectral reading results including at least one of the following: the applicable separation system and separation method, whether the interval between the UV absorption peak of the target product and the adjacent peak meets the standard, and whether the UV absorption peak of the target product is mixed with MS signals unrelated to the target product; the third model is used to predict and output the corresponding target spectral reading results during the quality control stage of chemical synthesis, the target spectral reading results including whether the target product meets the quality standard and / or the recommended sample injection method and injection volume; and the fourth model is used to predict and output the corresponding target spectral reading results during the scenario deployment stage of chemical synthesis, the target spectral reading results including the applicable scenario category, which includes fully automated processing, semi-automated processing, or manual processing.
10. An LCMS reading model construction device, characterized in that it comprises: a training data acquisition module, used to acquire LCMS training data, which includes training samples and corresponding training labels, wherein the training samples include at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum (UV) features, MS-UV correlation features, and instrument parameter features, and the training labels include spectrum reading results corresponding to at least one business type; and a model training module, used to iteratively train an initial model based on the LCMS training data to obtain a trained LCMS reading model, which is used to predict the target reading result corresponding to the business type.
11. An LCMS spectrum reading system, characterized in that it comprises: a data acquisition module, used to obtain corresponding sampling features according to the business type, the sampling features including at least one of the following: material structure features, mass spectrum features, ultraviolet absorption spectrum features, MS-UV correlation features, and instrument parameter features; and a model reading module, used to output the corresponding target reading result based on the sampling features and the LCMS reading model pre-constructed by the LCMS reading model construction method as described in any one of claims 1 to 7.
12. An electronic device, characterized in that it comprises: a processor; and a memory storing executable code, which, when executed by the processor, causes the processor to perform the LCMS reading model construction method according to any one of claims 1-7 and / or the LCMS reading method according to any one of claims 8-9.
13. A computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the LCMS spectrum reading model construction method according to any one of claims 1-7 and / or the LCMS spectrum reading method according to any one of claims 8-9.
14. A computer program product, comprising a computer program, characterized in that the computer program is used to execute the computer program code instructions corresponding to the LCMS reading model construction method according to any one of claims 1-7 and / or the LCMS reading method according to any one of claims 8-9.