Method and system for identifying organic pollutants in soil on the basis of a combination of AI and high-throughput screening

By combining artificial intelligence with high-throughput screening technology and using deep learning to construct a molecular fingerprint prediction model, the problem of low coverage and low efficiency of traditional targeted analysis methods in identifying unknown organic pollutants has been solved, enabling rapid and accurate identification and risk assessment of soil organic pollutants.

WO2026129392A1 · PCT designated stage · Publication Date: 2026-06-25 · BCEG ENVIRONMENTAL REMEDIATION CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
BCEG ENVIRONMENTAL REMEDIATION CO LTD
Filing Date
2024-12-24
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Traditional targeted analysis methods have low coverage when identifying organic pollutants in the environment, cannot effectively identify unknown substances, and have low efficiency.

Method used

By combining artificial intelligence with high-throughput screening technology, high-resolution mass spectrometry analysis is used to construct an organic pollutant mass spectrum database. Deep learning is then used to build a molecular fingerprint prediction model to achieve the mapping from spectrum to structure and perform non-targeted intelligent analysis.

Benefits of technology

It enables rapid and accurate identification of soil organic pollutants, provides non-targeted intelligent analysis technology for new pollutants, and provides a data foundation for pollutant distribution statistics and risk assessment in target areas.


Abstract

The present application relates to the technical field of pollutant screening. Disclosed are a method and system for identifying organic pollutants in soil on the basis of a combination of AI and high-throughput screening. The method comprises: extracting substance peaks from high-resolution mass spectrometry data of a soil sample; constructing an organic pollutant mass spectrometry database, extracting spectrum features and structural features of compounds from the organic pollutant mass spectrometry database, constructing a molecular fingerprint prediction model, and establishing a spectrum-to-structure mapping relationship; on the basis of the extracted substance peaks, constructing spectrum vectors to predict molecular fingerprints, and acquiring candidate chemical structures by searching the organic pollutant mass spectrometry database; and scoring the candidate chemical structures by means of the predicted molecular fingerprints, selecting chemical structures that meet a preset standard, constructing an identification basis on the basis of the selected chemical structures, and acquiring an organic pollutant identification result for the soil sample. In the present application, non-targeted intelligent analysis is performed by means of the synergistic integration of artificial intelligence and mass spectrometry analysis, thereby realizing the rapid and accurate identification of organic pollutants in soil.

Description

A Method and System for Identifying Soil Organic Pollutants Based on AI and High-Throughput Screening

[0001] Cross-references to related applications

[0002] This application claims priority to Chinese Patent Application No. 202411874876.9, filed on December 19, 2024, entitled "A Method and System for Identifying Soil Organic Pollutants Based on the Combination of AI and High-Throughput Screening", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of pollutant identification technology, and more specifically, to a method and system for identifying soil organic pollutants based on a combination of AI and high-throughput screening.

Background Technology

[0004] With the acceleration of industrialization, soil pollution has become increasingly prominent. Organic pollutants in particular are highly toxic and slow to degrade, posing a serious threat to the safety of soil and the entire ecosystem. Organic pollutants are numerous and diverse and differ in their physicochemical properties, which presents significant challenges to environmental analysis methods. Currently, the main method for identifying compounds in the environment is the traditional targeted analysis approach, which typically employs gas chromatography-mass spectrometry (GC-MS) or liquid chromatography-mass spectrometry (LC-MS). For each target compound, a standard is purchased to establish a definite ion-pair monitoring method; qualitative analysis is achieved through retention time, and quantitative analysis through standard curves. However, because of limitations in chromatographic separation capability, targeted analysis can usually cover only about a hundred substances at most, so a large number of unknown pollutants remain unmonitored.

[0005] Traditional organic pollutant screening methods suffer from low pollutant coverage, weak chemical substance identification, and limited ability to identify new pollutants. Furthermore, traditional targeted analysis methods are labor-intensive and inefficient. High-throughput screening technology, on the other hand, offers comprehensive identification of organic compounds in samples, characterized by high throughput, sensitivity, and speed. Therefore, combining artificial intelligence with high-throughput screening technology to achieve rapid and accurate identification of soil organic pollutants, and providing a new non-targeted intelligent analysis technology for new pollutants, is a pressing issue that needs to be addressed.

Summary of the Invention

[0006] To address the aforementioned technical challenges, this application proposes a method and system for identifying soil organic pollutants based on a combination of AI and high-throughput screening. The aim is to achieve rapid and accurate identification of soil organic pollutants by using the cross-integration of artificial intelligence and mass spectrometry analysis for non-targeted intelligent analysis, thereby overcoming the limitations of traditional targeted analysis methods that rely on databases and cannot identify unknown substances.

[0007] The first aspect of this application provides a method for identifying soil organic pollutants based on a combination of AI and high-throughput screening, comprising the following steps:

[0008] High-resolution mass spectrometry analysis is performed on the collected soil samples to obtain high-resolution mass spectrometry data, data preprocessing is performed, and substance peaks are extracted from the high-resolution mass spectrometry data.

[0009] An organic pollutant mass spectrum database is constructed, and the spectral and structural features of compounds in the database are extracted to construct training and validation samples.

[0010] A molecular fingerprint prediction model is constructed based on deep learning. The training samples and validation samples are used to train and evaluate the model, and a mapping relationship from spectrum to structure is constructed.

[0011] Based on the extracted substance peaks, spectral vectors are constructed, molecular fingerprints are predicted using the molecular fingerprint prediction model, and candidate chemical structures are obtained by searching the organic pollutant mass spectrum database.

[0012] The candidate chemical structures are scored using the predicted molecular fingerprints, and chemical structures that meet preset standards are selected. Identification criteria are constructed based on the selected chemical structures to obtain the identification results of organic pollutants in the soil samples.

[0013] In this scheme, high-resolution mass spectrometry data is acquired, substance peaks are extracted from the high-resolution mass spectrometry data, and data preprocessing is performed, specifically as follows:

[0014] High-throughput screening analysis instruments are used to perform high-resolution mass spectrometry analysis on the pretreated soil samples; raw data peaks are extracted, background noise is removed with filters, and baseline correction is performed.

[0015] A peak list is constructed from the accurate mass, retention time, and response intensity of each peak. Peak correction and alignment are performed on the peak list, and the list is sorted by response intensity from largest to smallest. The ratio of the response intensity in the soil sample to that in the procedural blank sample is calculated, and data whose ratio does not meet the preset threshold are removed.

[0016] The remaining data are sorted by mass-to-charge ratio, and a preset number of data peaks are retained based on the sorting results and summarized as the high-response substance peaks.
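As an illustrative sketch only, the blank-ratio filtering and peak retention described in [0015] and [0016] might look as follows in Python. The `Peak` fields, the ratio threshold of 10, and the number of retained peaks are assumptions made for the example; the application only states that these values are preset.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    mz: float         # accurate mass (m/z)
    rt: float         # retention time in minutes
    intensity: float  # response intensity in the soil sample
    blank: float      # response intensity in the procedural blank

def select_high_response_peaks(peaks, min_ratio=10.0, keep=3):
    """Filter a peak list by sample/blank ratio, then keep a preset number of peaks."""
    # Sort by response intensity, largest first.
    ordered = sorted(peaks, key=lambda p: p.intensity, reverse=True)
    # Keep peaks whose sample/blank ratio reaches the threshold.
    # A peak that is absent from the blank passes automatically (assumption).
    kept = [p for p in ordered
            if p.blank == 0 or p.intensity / p.blank >= min_ratio]
    # Sort the remaining peaks by m/z and retain the preset number.
    kept.sort(key=lambda p: p.mz)
    return kept[:keep]

peaks = [
    Peak(301.14, 5.2, 9000.0, 100.0),
    Peak(150.05, 2.1, 500.0, 400.0),   # ratio 1.25, close to the blank
    Peak(212.08, 3.3, 4000.0, 0.0),
    Peak(455.29, 8.8, 7000.0, 50.0),
    Peak(120.04, 1.0, 2000.0, 100.0),
]
selected = select_high_response_peaks(peaks)
print([p.mz for p in selected])
```

The retention step follows the wording of [0016] literally (sort by mass-to-charge ratio, then keep a fixed count); a different retention rule would only change the last two lines of the function.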

[0017] In this scheme, an organic pollutant mass spectrometry database is constructed, and the spectral and structural features of compounds in the database are extracted to construct training and validation samples. Specifically:

[0018] Compound mass spectrometry data are obtained from open-source data, historical high-resolution standard spectra of known organic pollutants, literature search data, and mass spectrometry big-data search data, and are then cleaned and preprocessed.

[0019] An organic pollutant mass spectrometry database is constructed by integrating the compounds, their mass spectrometry data, and their molecular structure data. Any compound tag is selected from the organic pollutant mass spectrometry database, and the corresponding high-resolution standard spectrum and molecular structure data are retrieved according to the compound tag.

[0020] Multidimensional feature extraction is performed on the molecular structure data to obtain characteristic ion fragments, homologue structural features, isotope distributions, and diagnostic ions, from which the structural features corresponding to the compound tag are generated.

[0021] The high-resolution standard spectrum is preprocessed and imported into a pre-trained dense convolutional neural network for feature extraction. The deep features of the substance peaks are output by the last fully connected layer. The mutual information among the deep features is calculated, and the mRMR relevance is computed from the deep features and the compound labels.

[0022] The deep features are sorted by relevance, and a first preset number of deep features are retained. The LASSO algorithm, with cross-validation introduced, is then used to reduce the feature dimensionality and obtain the importance of the different deep features. A second preset number of deep features are selected and integrated to generate the spectral features.

[0023] Based on the structural and spectral features matched to each compound tag, training and validation samples are constructed using the matched feature data at a pre-set ratio.

[0024] In this scheme, a molecular fingerprint prediction model is constructed based on deep learning. The training samples and validation samples are used for model training and evaluation to establish a mapping relationship between spectra and structures. Specifically:

[0025] A molecular fingerprint prediction model is constructed based on a deep extreme learning machine. An improved gray wolf optimization algorithm is used to optimize the weight parameters of the deep extreme learning machine: the gray wolf population is initialized, the fitness of the wolf pack is calculated, and elite differential mutation is introduced into the position update of individual gray wolves to guide their position and fitness updates.

[0026] After iterative updates, the optimal weight parameters are obtained based on the optimal gray wolf individual position. The optimal weight parameters are used to configure the deep extreme learning machine. The training samples are imported into the deep extreme learning machine for training. The output of the hidden layer of the previous extreme learning machine is used as the input matrix of the next extreme learning machine. After iterative training, the prediction performance of the deep extreme learning machine is verified using validation samples.

[0027] Once the prediction performance meets the standard, a molecular fingerprint prediction model is constructed based on the current deep extreme learning machine to establish the mapping relationship between the spectrum and the structure.

[0028] In this scheme, the spectral features corresponding to the substance peaks of the soil sample are extracted, and a spectral vector is constructed based on the spectral features. The spectral vector is fed into the molecular fingerprint prediction model as the model input, the model predicts the corresponding molecular structure, and a molecular fingerprint is generated from the molecular structure as a bit-string representation.

[0029] In this scheme, candidate chemical structures are obtained by searching an organic pollutant mass spectrometry database, specifically as follows:

[0030] The organic pollutant mass spectrum database is represented graphically, with compounds as nodes. The exact mass number, retention time, structural features, and spectral features of the compounds are used as additional features of the nodes. The abundance similarity, transformation reaction, addition relationship, and spectral similarity between compounds are used as different types of edge structures. Multiple association graphs corresponding to the organic pollutant mass spectrum database are generated based on different types of edge structures.

[0031] Based on their spectral features, the substance peaks of the soil sample are located in the multiple association graphs, and the neighboring nodes and graph structures are obtained. The graph structure is learned and represented using a graph convolutional network to obtain the compound tag embedding vectors. The similarity between the spectral vector and the compound tag embedding vectors is calculated.

[0032] Select compound tags that meet the similarity threshold, and obtain candidate chemical structures based on the node-additional features of the corresponding compound nodes.

[0033] In this scheme, candidate chemical structures are scored by predicting molecular fingerprints, and chemical structures that meet preset criteria are selected. Identification criteria are then constructed based on the selected chemical structures to obtain the identification results of organic pollutants in soil samples. Specifically:

[0034] Obtain the predicted molecular fingerprint and candidate chemical structure of the material peaks in the soil sample, construct candidate molecular fingerprints through the candidate chemical structures, calculate the similarity between the predicted molecular fingerprint and different candidate molecular fingerprints, and score different candidate chemical structures using the similarity.

[0035] Chemical structures that meet preset standards are selected, and identification criteria are constructed based on the selected chemical structures. Molecular formulas and confidence levels are extracted from the organic pollutant mass spectrum database using the identification criteria to obtain the identification results of organic pollutants in soil samples.
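The scoring in [0034] and [0035] can be pictured with a short Python sketch. The application does not name the similarity measure; the Tanimoto coefficient used here is an assumption (it is a common choice for bit-string fingerprints), and the candidate tags, 8-bit fingerprints, and score threshold are hypothetical.

```python
def tanimoto(a: str, b: str) -> float:
    """Tanimoto similarity between two equal-length bit strings such as '1101'."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    return both / either if either else 0.0

def rank_candidates(predicted: str, candidates: dict, min_score: float = 0.5):
    """Score every candidate fingerprint against the predicted one, best first."""
    scored = [(tag, tanimoto(predicted, fp)) for tag, fp in candidates.items()]
    passing = [(tag, score) for tag, score in scored if score >= min_score]
    return sorted(passing, key=lambda item: item[1], reverse=True)

predicted = "11010010"            # fingerprint predicted for one substance peak
candidates = {
    "candidate_A": "11010010",    # identical bits
    "candidate_B": "11000010",    # one set bit missing
    "candidate_C": "00101101",    # no set bits in common
}
print(rank_candidates(predicted, candidates))
```

Here `min_score` plays the role of the preset standard; the retained tags would then be used to look up molecular formulas and confidence levels in the database.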

[0036] The second aspect of this application provides a soil organic pollutant identification system based on the combination of AI and high-throughput screening. The system includes: a soil sample collection module, a high-throughput screening module, an organic pollutant mass spectrum database module, a material structure prediction module, and a non-targeted screening output module.

[0037] The soil sample collection module is responsible for collecting soil samples from the target area, setting up a procedural blank sample, and preprocessing the collected samples to meet the requirements of high-resolution mass spectrometry analysis.

[0038] The high-throughput screening module is responsible for performing high-resolution mass spectrometry analysis on the collected soil samples, obtaining high-resolution mass spectrometry data for data preprocessing, and extracting substance peaks from the high-resolution mass spectrometry data.

[0039] The organic pollutant mass spectrometry database module is responsible for constructing an organic pollutant mass spectrometry database using open source data, literature retrieval, known organic pollutant standard spectral data, and mass spectrometry big data retrieval data. It uses the spectral and structural features of compounds in the database to construct training and validation samples, and uses the spectral features of soil samples to retrieve candidate chemical structures from the organic pollutant mass spectrometry database.

[0040] The material structure prediction module is responsible for establishing a molecular fingerprint prediction model, using training samples and validation samples to train and evaluate the model, constructing a mapping relationship between spectra and structures, using the material peaks extracted by the high-throughput screening module to construct a spectral vector, importing it into the model to obtain possible chemical structures and predicting molecular fingerprints, using the predicted molecular fingerprints to score candidate chemical structures, and selecting material structures that meet preset standards.

[0041] The non-targeted screening output module is responsible for constructing identification criteria based on the predicted high-resolution material structure, and using the identification criteria to output the organic pollutant identification results of the soil sample.

[0042] Compared with the prior art, the beneficial effects of this application are as follows:

[0043] This application utilizes a cross-integration of deep learning methods and mass spectrometry analysis for non-targeted intelligent analysis, addressing the limitations of traditional targeted analysis methods that rely on databases and cannot identify unknown substances. This enables rapid and accurate identification of soil organic pollutants, providing a novel non-targeted intelligent analysis technology that offers a data foundation for pollutant distribution statistics and risk assessment in target areas.

Attached Figure Description

[0044] To more clearly illustrate the technical solutions in the embodiments or examples of this application, the drawings used in the embodiments or examples will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained according to these drawings without creative effort.

[0045] Figure 1 shows a flowchart of the soil organic pollutant identification method based on the combination of AI and high-throughput screening in this application;

[0046] Figure 2 shows a flowchart of setting up training data for constructing the organic pollutant mass spectrum database in this application;

[0047] Figure 3 shows a flowchart of the molecular fingerprint prediction model constructed in this application to predict molecular fingerprints;

[0048] Figure 4 shows a block diagram of the soil organic pollutant identification system based on the combination of AI and high-throughput screening in this application.

Detailed Implementation

[0049] To better understand the above-mentioned objectives, features, and advantages of this application, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.

[0050] Many specific details are set forth in the following description in order to provide a full understanding of this application. However, this application may also be implemented in other ways different from those described herein. Therefore, the scope of protection of this application is not limited to the specific embodiments disclosed below.

[0051] Figure 1 shows a flowchart of the soil organic pollutant identification method based on the combination of AI and high-throughput screening proposed in this application.

[0052] As shown in Figure 1, the first aspect of this application provides a method for identifying soil organic pollutants based on a combination of AI and high-throughput screening, including:

[0053] S102: Perform high-resolution mass spectrometry analysis on the collected soil samples, obtain high-resolution mass spectrometry data, perform data preprocessing, and extract substance peaks from the high-resolution mass spectrometry data;

[0054] S104: Construct an organic pollutant mass spectrum database, extract the spectral and structural features of compounds in the organic pollutant mass spectrum database, and construct training samples and validation samples;

[0055] S106: Construct a molecular fingerprint prediction model based on deep learning, use the training samples and validation samples to train and evaluate the model, and construct the mapping relationship from spectrum to structure;

[0056] S108: Construct spectral vectors based on the extracted substance peaks, predict molecular fingerprints using the molecular fingerprint prediction model, and retrieve candidate chemical structures from the organic pollutant mass spectrum database;

[0057] S110: Score the candidate chemical structures using the predicted molecular fingerprints, select chemical structures that meet preset standards, construct identification criteria based on the selected chemical structures, and obtain the identification results of organic pollutants in the soil samples.

[0058] It should be noted that high-resolution mass spectrometry, such as quadrupole time-of-flight mass spectrometry (Q-TOF MS), with its high resolving power (RP ≥ 10000), provides accurate masses, isotopic distributions, and MS/MS spectra, and plays a crucial role in identifying unknown or novel contaminants. Non-targeted analysis of mass spectrometry data is mainly divided into suspect screening and non-targeted screening of unknown substances. Suspect screening comprehensively compares the information of given compounds against the sample to determine whether they are present. However, for the large number of substances without prior information, it cannot determine presence or absence. Non-targeted screening can, in theory, effectively screen all substances using both primary (MS1) and secondary (MS/MS) mass spectrometry information.

[0059] Soil samples are collected from the target area. Appropriate quality control measures are implemented to reduce interference from the solution and the analysis process, and unstable interfering ions are eliminated through repeated analysis. High-throughput screening analyzers are used to perform high-resolution mass spectrometry analysis on the pretreated soil samples and extract the raw data peaks. Background noise is removed using filters, and baseline correction is performed. A peak list is constructed from the accurate mass, retention time, and response intensity of each peak. Peak correction and alignment are performed on the peak list, and the peaks are sorted by response intensity from largest to smallest. The ratio of the response intensity in the soil sample to that in the procedural blank sample is calculated, and data whose ratio does not meet a preset threshold are removed. The remaining data are sorted by mass-to-charge ratio, and a preset number of data peaks are retained based on the sorting results and summarized as the high-response substance peaks.

[0060] Figure 2 shows a flowchart of the process for setting up training data for constructing the organic pollutant mass spectrum database in this application.

[0061] According to the embodiments of this application, an organic pollutant mass spectrum database is constructed, and the spectral and structural features of compounds in the organic pollutant mass spectrum database are extracted to construct training samples and validation samples, specifically as follows:

[0062] S202: Obtain compound mass spectrometry data using open-source data, historical high-resolution standard spectra of known organic pollutants, literature search data, and mass spectrometry big-data search data as data sources, and perform data cleaning and data structuring preprocessing;

[0063] S204: Integrate the compounds, their mass spectrometry data, and their molecular structure data to construct the organic pollutant mass spectrometry database, select any compound tag from the database, and retrieve the corresponding high-resolution standard spectrum and molecular structure data according to the compound tag;

[0064] S206: Perform multidimensional feature extraction on the molecular structure data to obtain characteristic ion fragments, homologue structural features, isotope distributions, and diagnostic ions, and generate the structural features corresponding to the compound tag;

[0065] S208: Preprocess the high-resolution standard spectrum and import it into the pre-trained dense convolutional neural network for feature extraction, output the deep features of the substance peaks from the last fully connected layer, calculate the mutual information among the deep features, and compute the mRMR relevance from the deep features and the compound labels;

[0066] S210: Sort the deep features by relevance, retain a first preset number of deep features, use the LASSO algorithm with cross-validation to reduce the feature dimensionality and obtain the importance of the different deep features, and select a second preset number of deep features to integrate into the spectral features;

[0067] S212: Match the structural and spectral features corresponding to each compound tag, and construct training and validation samples from the matched feature data at a preset ratio.

[0068] It should be noted that in a dense convolutional neural network (DCNN), each layer receives the outputs of all preceding layers. Features from all previous layers are passed to subsequent layers, which improves information flow through the feature maps and makes the network more accurate and efficient. Dropout layers are introduced to reduce the overfitting caused by the large number of connections. Furthermore, the dense connections of a DCNN allow feature vectors from previous layers to be reused multiple times, further enhancing the network's feature extraction capability. The DCNN is trained using stochastic gradient descent with mean squared error as the cost function. Features of the high-resolution standard spectrum are calculated through forward propagation, and deep features are obtained from the last fully connected layer. These deep features capture key information from the characteristic peaks, thereby improving the accuracy and reliability of subsequent predictions. Additionally, based on the mutual information between the deep features and the compound labels, the minimum redundancy maximum relevance (mRMR) algorithm is used to rank the deep features, retaining the top 20 deep features with the highest mRMR relevance. The LASSO algorithm is then used to reduce the dimensionality of the features and obtain the optimal features.
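To make the mRMR ranking step more concrete, the following is a minimal Python sketch of greedy mRMR selection. It assumes the deep features have already been discretised into a few levels (the application does not say how mutual information is estimated), and the feature names and toy values are hypothetical. The subsequent LASSO step is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=score))
    return selected

labels = [0, 0, 0, 0, 1, 1, 1, 1]                    # compound labels
features = {
    "feat_a":      [0, 0, 0, 1, 1, 1, 1, 1],
    "feat_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],        # fully redundant with feat_a
    "feat_b":      [0, 0, 1, 0, 1, 1, 1, 1],
}
print(mrmr(features, labels, k=2))
```

In this toy data the duplicated feature is as relevant as the original but is penalised for redundancy, so the two retained features are the original and the independent one.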

[0069] Figure 3 shows a flowchart of the molecular fingerprint prediction model constructed in this application to predict molecular fingerprints.

[0070] According to an embodiment of this application, a molecular fingerprint prediction model is constructed based on deep learning to predict the molecular fingerprint corresponding to the material peaks in a soil sample, specifically as follows:

[0071] S302: Construct a molecular fingerprint prediction model based on a deep extreme learning machine, and use an improved gray wolf optimization algorithm to optimize the weight parameters of the deep extreme learning machine: initialize the gray wolf population, calculate the fitness of the wolf pack, and introduce elite differential mutation into the position update of individual gray wolves to guide their position and fitness updates;

[0072] S304: After iterative updates, obtain the optimal weight parameters based on the position of the optimal gray wolf individual, configure the deep extreme learning machine using the optimal weight parameters, import the training samples into the deep extreme learning machine for training, use the output of the hidden layer of the previous extreme learning machine as the input matrix of the next extreme learning machine, and, after iterative training, use the validation samples to verify the prediction performance of the deep extreme learning machine;

[0073] S306: Once the prediction performance meets the standard, construct the molecular fingerprint prediction model based on the current deep extreme learning machine to build the mapping relationship between the spectrum and the structure;

[0074] S308: Extract the spectral features corresponding to the substance peaks of the soil sample, construct a spectral vector based on the spectral features, feed the spectral vector into the molecular fingerprint prediction model as the model input, use the model to predict the corresponding molecular structure, and generate a molecular fingerprint from the molecular structure as a bit-string representation.

[0075] It should be noted that the deep extreme learning machine (DELM) is constructed using the concept of stacked autoencoders. The DELM is formed by cascading extreme learning machine autoencoders (ELM-AE), which consist of an input layer, hidden layers, and an output layer. The DELM can capture the mapping relationships in the data more comprehensively, improving nonlinear fitting ability and prediction performance. Furthermore, the DELM has no backpropagation process, which significantly shortens network training time. During training, the training samples are imported into the DELM to obtain the output weight matrix of the ELM-AE; after orthogonalization, the input weight matrix of the hidden layer is generated. Unsupervised training proceeds layer by layer to obtain the output weight matrix of the final layer. After model testing, if the prediction performance meets the standard, the molecular fingerprint prediction model is constructed based on the current DELM. Constructing the molecular fingerprint prediction model with artificial intelligence methods streamlines the determination of compound molecular structures and greatly improves the efficiency of predicting them.
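The absence of backpropagation can be illustrated with a single extreme learning machine layer, the building block that the paragraph above stacks. The sketch below is a simplified, assumed form: random fixed hidden weights, a sigmoid activation, and output weights obtained in closed form by ridge-regularised least squares. The ridge value, the layer size, and the toy data are hypothetical, and the layer-by-layer autoencoder stacking and orthogonalization are not shown.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def hidden_output(x, weights, bias):
    """Sigmoid activations of the randomly initialised hidden layer."""
    z = matmul(x, weights)
    return [[1.0 / (1.0 + math.exp(-(v + b))) for v, b in zip(row, bias)] for row in z]

def train_elm(x, targets, n_hidden, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(len(x[0]))]
    bias = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = hidden_output(x, weights, bias)
    ht = transpose(h)
    gram = matmul(ht, h)
    for i in range(n_hidden):
        gram[i][i] += ridge                      # ridge term keeps the system solvable
    beta = solve(gram, matmul(ht, targets))      # closed-form output weights
    return weights, bias, beta

def predict(x, model):
    weights, bias, beta = model
    return matmul(hidden_output(x, weights, bias), beta)

spectral_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
fingerprint_bits = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
model = train_elm(spectral_vectors, fingerprint_bits, n_hidden=8)
print(predict(spectral_vectors, model))
```

Only the output weights are fitted, and they come from one linear solve, which is why training is fast compared with gradient-based networks. In the scheme described above, the hidden weights would be supplied by the optimization step rather than left purely random.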

[0076] The stability of the deep extreme learning machine is affected by the weight parameters of each layer. To optimize these weight parameters, the mean squared error on the training set is used as the objective function, the minimum fitness value is sought, and the optimal weight combination is obtained. The gray wolf optimization algorithm simulates the hierarchy and predation strategies of a wolf pack and finds the optimal value through iterative optimization. In the algorithm, the hunting process is divided into three stages: searching, tracking, and encircling and attacking. The three wolves with the best fitness are defined as α, β, and δ, respectively, and the remaining wolves are defined as ω. α, β, and δ guide the ω wolves towards the target, and each ω wolf updates its position around α, β, and δ. Elite differential mutation is introduced during the position update, using cooperation and competition within the group so that newly generated gray wolf individuals tend towards the optimal individuals. The differential mutation strengthens information exchange among elite individuals and clarifies the search direction. Furthermore, to improve the search range and convergence accuracy of the algorithm, the reverse solution of the gray wolf with the best fitness is calculated, enhancing the algorithm's ability to escape local optima.

[0077] The elite differential mutation of the gray wolf with the highest fitness, X_best, is calculated as follows:

[0078] where X′_best denotes the gray wolf individual after elite differential mutation, the two remaining position terms denote the second- and third-fittest gray wolf individuals, and b1 and b2 are random numbers.
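The optimization loop of paragraphs [0076] to [0078] can be sketched as below. The mutation formula itself is not reproduced in this text, so the form used here, X′_best = X_best + b1·(X_second − X_best) + b2·(X_third − X_best), is an assumption chosen only to match the listed symbols. The function name `improved_gwo`, the box bounds, the greedy acceptance of trial solutions, and the opposite solution `lower + upper − X_best` are likewise illustrative. In the setting described above, `objective` would be the training-set mean squared error of the network for a given weight vector.

```python
import random


def improved_gwo(objective, dim, lower, upper, n_wolves=20, n_iter=100, seed=0):
    """Minimize `objective` over [lower, upper]^dim; returns (position, fitness)."""
    rng = random.Random(seed)

    def clip(value):
        return min(max(value, lower), upper)

    wolves = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_wolves)]
    fitness = [objective(w) for w in wolves]
    i_best = min(range(n_wolves), key=fitness.__getitem__)
    best_pos, best_fit = list(wolves[i_best]), fitness[i_best]

    for it in range(n_iter):
        order = sorted(range(n_wolves), key=fitness.__getitem__)
        leaders = [list(wolves[i]) for i in order[:3]]   # alpha, beta, delta
        a = 2.0 * (1.0 - it / n_iter)                    # decreases from 2 toward 0

        # Standard update: each wolf moves to the mean of three leader-guided points.
        for i in range(n_wolves):
            new_pos = []
            for d in range(dim):
                guided = 0.0
                for leader in leaders:
                    big_a = 2.0 * a * rng.random() - a
                    big_c = 2.0 * rng.random()
                    guided += leader[d] - big_a * abs(big_c * leader[d] - wolves[i][d])
                new_pos.append(clip(guided / 3.0))
            wolves[i], fitness[i] = new_pos, objective(new_pos)

        # Elite differential mutation (assumed form) and the opposite solution of
        # the current best wolf; a trial is kept only if it lowers the fitness.
        i_best, i_second, i_third = sorted(range(n_wolves), key=fitness.__getitem__)[:3]
        b1, b2 = rng.random(), rng.random()
        x_best, x_second, x_third = wolves[i_best], wolves[i_second], wolves[i_third]
        mutant = [
            clip(x_best[d] + b1 * (x_second[d] - x_best[d]) + b2 * (x_third[d] - x_best[d]))
            for d in range(dim)
        ]
        opposite = [clip(lower + upper - x) for x in x_best]
        for trial in (mutant, opposite):
            trial_fit = objective(trial)
            if trial_fit < fitness[i_best]:
                wolves[i_best], fitness[i_best] = trial, trial_fit

        # Remember the best solution seen so far across all iterations.
        if fitness[i_best] < best_fit:
            best_pos, best_fit = list(wolves[i_best]), fitness[i_best]

    return best_pos, best_fit
```

The sketch needs at least three wolves. Tracking the best solution separately from the population makes the returned fitness non-increasing in the number of iterations, which is a design choice of this sketch rather than something the text specifies.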

[0079] It should be noted that the organic pollutant mass spectrometry database is represented as a graph with compounds as nodes. The exact mass, retention time, structural features, and spectral features of each compound serve as additional node features. Abundance similarity, transformation reactions, adduct relationships, and spectral similarity between compounds serve as different types of edge structures, and multiple association graphs of the organic pollutant mass spectrometry database are generated from these edge types. The substance peaks of a soil sample are located in these association graphs by their spectral features, exact mass, and retention time, and the neighbor nodes of the located nodes, together with their graph structures, are obtained. A graph convolutional network extracts features from the graph structures: its message-passing and neighbor-aggregation mechanisms learn a representation of each graph structure and yield a compound tag embedding vector for every compound node in each graph structure. Optionally, the number of edges directly connected to each compound node is counted in each graph structure; for any compound node, these counts are summed and normalized to give the node's weight, which is used to weight the similarity so that the tags of highly important compounds stand out. The spectral vectors of the soil sample's substance peaks are mapped into the same common representation space, and the similarity between each spectral vector and each compound tag embedding vector is calculated. Compound tags that meet the similarity threshold are selected, the tags selected from the different association graphs are merged with duplicates removed, and candidate chemical structures are obtained from the additional node features of the corresponding compound nodes. In the similarity calculation, similarity denotes the prediction score that the i-th spectral feature belongs to the j-th compound tag, v_i denotes the spectral vector, and u_j denotes the compound tag embedding vector.
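The selection step at the end of this paragraph can be sketched as follows, assuming the compound tag embedding vectors have already been produced by the graph convolutional network and the spectral vector has already been mapped into the same space. The similarity formula is not reproduced in this text, so cosine similarity is used here purely as an assumption. The data layout (a dict of graphs, each with `embeddings` and `edges`), the function names, and normalization by the largest summed edge count are also hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def node_weights(graphs):
    """Sum each node's directly connected edges over all graphs, then normalize."""
    degree = Counter()
    for graph in graphs.values():
        for tag_a, tag_b in graph["edges"]:
            degree[tag_a] += 1
            degree[tag_b] += 1
    largest = max(degree.values(), default=0)
    tags = {tag for graph in graphs.values() for tag in graph["embeddings"]}
    return {tag: (degree[tag] / largest if largest else 0.0) for tag in tags}


def select_candidates(spectrum_vec, graphs, threshold):
    """Return de-duplicated compound tags whose weighted similarity meets the threshold."""
    weights = node_weights(graphs)
    selected = {}
    for graph in graphs.values():
        for tag, embedding in graph["embeddings"].items():
            score = weights[tag] * cosine(spectrum_vec, embedding)
            if score >= threshold:
                # A tag found in several association graphs keeps its best score.
                selected[tag] = max(score, selected.get(tag, 0.0))
    return sorted(selected, key=selected.get, reverse=True)
```

Multiplying the similarity by the node weight means an isolated compound node can never be selected in this sketch; a real system might blend the two quantities less aggressively.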

[0080] It should be noted that the predicted molecular fingerprints and candidate chemical structures of the soil sample's substance peaks are obtained, candidate molecular fingerprints are constructed from the candidate chemical structures, the similarity between the predicted molecular fingerprint and each candidate molecular fingerprint is calculated, and the similarity is used to score the candidate chemical structures. Chemical structures that meet preset standards are selected, and identification criteria are constructed based on the selected chemical structures. Molecular formulas are searched in the organic pollutant mass spectrometry database using the identification criteria. The screened substances are assigned to confidence levels: levels above Level 3 are high confidence, Level 3 is medium confidence, and levels below Level 3 are low confidence. Finally, the organic pollutant identification results of the soil sample are output.
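The scoring step can be illustrated with a short sketch. The text only says that a similarity between fingerprints is used; the Tanimoto coefficient, a common choice for bit-string fingerprints, is assumed here. The function names, the bit-string encoding as `'0'`/`'1'` characters, and the `min_score` cutoff standing in for the "preset standards" are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length '0'/'1' bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(fp_a, fp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


def rank_candidates(predicted_fp, candidate_fps, min_score):
    """Score each candidate structure and keep those meeting the cutoff, best first."""
    scored = [(name, tanimoto(predicted_fp, fp)) for name, fp in candidate_fps.items()]
    kept = [(name, score) for name, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_candidates(predicted, {"candidate_1": fp1, "candidate_2": fp2}, 0.5)` would return the candidates whose fingerprints share at least half of their combined set bits with the prediction, where the candidate names and fingerprints are placeholders.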

[0081] Figure 4 shows a block diagram of the soil organic pollutant identification system based on the combination of AI and high-throughput screening in this application.

[0082] The second embodiment of this application provides a soil organic pollutant identification system 4 based on the combination of AI and high-throughput screening. The system includes: a soil sample collection module 401, a high-throughput screening module 402, an organic pollutant mass spectrum database module 403, a material structure prediction module 404, and a non-targeted screening output module 405.

[0083] The soil sample collection module is responsible for collecting soil samples from the target area, setting up a procedural blank sample, and preprocessing the collected samples to meet the requirements of high-resolution mass spectrometry analysis.

[0084] The high-throughput screening module is responsible for performing high-resolution mass spectrometry analysis on the collected soil samples to obtain high-resolution mass spectrometry data, preprocessing the data, and extracting substance peaks from the high-resolution mass spectrometry data.

[0085] The organic pollutant mass spectrometry database module is responsible for constructing an organic pollutant mass spectrometry database using open source data, literature retrieval, known organic pollutant standard spectral data, and mass spectrometry big data retrieval data. It uses the spectral and structural features of compounds in the database to construct training and validation samples, and uses the spectral features of soil samples to retrieve candidate chemical structures from the organic pollutant mass spectrometry database.

[0086] The material structure prediction module is responsible for establishing a molecular fingerprint prediction model, training and evaluating the model with the training samples and validation samples, constructing a mapping relationship from spectrum to structure, constructing a spectral vector from the substance peaks extracted by the high-throughput screening module, importing the spectral vector into the model to obtain possible chemical structures and predicted molecular fingerprints, scoring the candidate chemical structures with the predicted molecular fingerprints, and selecting the chemical structures that meet preset standards.

[0087] The non-targeted screening output module is responsible for constructing identification criteria based on the predicted high-resolution material structure, using the identification criteria to find the corresponding molecular formula and confidence level, and outputting the organic pollutant identification results of the soil sample.

[0088] In the embodiments provided in this application, it should be understood that the disclosed methods can be implemented in other ways. The embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple modules or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or modules, and can be electrical, mechanical, or other forms.

[0089] In addition, each functional module in the various embodiments of this application can be integrated into one processing module, or each module can be a separate module, or two or more modules can be integrated into one module; the integrated module can be implemented in hardware or in the form of hardware plus software functional modules.

[0090] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0091] Alternatively, if the integrated modules described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0092] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.

Claims

1. An AI-based soil organic pollutant identification method combined with high-throughput screening, characterized in that it includes the following steps: performing high-resolution mass spectrometry analysis on collected soil samples to obtain high-resolution mass spectrometry data, preprocessing the data, and extracting substance peaks from the high-resolution mass spectrometry data; constructing an organic pollutant mass spectrometry database, extracting the spectral features and structural features of compounds in the database, and constructing training samples and validation samples; constructing a molecular fingerprint prediction model based on deep learning, training and evaluating the model with the training samples and validation samples, and constructing a mapping relationship from spectrum to structure; constructing spectral vectors from the extracted substance peaks, predicting molecular fingerprints with the molecular fingerprint prediction model, and obtaining candidate chemical structures by searching the organic pollutant mass spectrometry database; and scoring the candidate chemical structures with the predicted molecular fingerprints, selecting chemical structures that meet preset standards, and constructing identification criteria based on the selected chemical structures to obtain the identification results of organic pollutants in the soil samples.

2. The soil organic pollutant identification method based on AI combined with high-throughput screening according to claim 1, characterized in that acquiring the high-resolution mass spectrometry data, preprocessing the data, and extracting substance peaks from the high-resolution mass spectrometry data is specifically as follows: a high-throughput screening analysis instrument is used to perform high-resolution mass spectrometry analysis on the pretreated soil samples, the raw data peaks are extracted, background noise interference is removed with filters, and baseline correction is performed; a peak list is constructed from the exact mass, retention time, and response intensity, and peak correction and alignment are performed on the peak list; the peak list is sorted in descending order of response intensity, the ratio of the response intensity in the soil sample to the response intensity in the procedural blank sample is calculated, and data that do not meet the preset threshold are removed based on the ratio; the remaining data are sorted by mass-to-charge ratio, and a preset number of data peaks are retained based on the sorting result and summarized as high-response substance peaks.

3. The soil organic pollutant identification method based on AI combined with high-throughput screening according to claim 1, characterized in that constructing the organic pollutant mass spectrometry database, extracting the spectral features and structural features of compounds in the database, and constructing the training samples and validation samples is specifically as follows: open-source data, historical known high-resolution standard spectrum data of organic pollutants, literature search data, and mass spectrometry big data search data are used as data sources, and the compound mass spectrometry data are cleaned and preprocessed; the organic pollutant mass spectrometry database is constructed by integrating the compounds, their mass spectrometry data, and their molecular structure data; any compound tag is selected from the organic pollutant mass spectrometry database, and the corresponding high-resolution standard spectrum and molecular structure data are retrieved according to the compound tag; multidimensional feature extraction is performed on the molecular structure data to obtain characteristic ion fragments, homologue structural features, isotope distribution, and diagnostic ions, generating the structural features corresponding to the compound tag; the high-resolution standard spectrum is preprocessed and imported into a pre-trained dense convolutional neural network for feature extraction, and the deep features of the substance peaks are output through the last fully connected layer; the mutual information between the deep features is calculated, and the relevance between the acquired deep features and the compound identity is calculated based on the mRMR algorithm; the deep features are sorted by relevance and a first preset number of deep features are retained; the LASSO algorithm is used to reduce the feature dimensionality, with cross-validation introduced into the LASSO algorithm to obtain the importance of the different deep features; a second preset number of deep features are selected and integrated to generate the spectral features; and, based on the structural features and spectral features matched to each compound tag, the training samples and validation samples are constructed from the matched feature data at a preset ratio.

4. The method of claim 1, characterized in that constructing the molecular fingerprint prediction model based on deep learning, training and evaluating the model with the training samples and validation samples, and constructing the mapping relationship from spectrum to structure is specifically as follows: the molecular fingerprint prediction model is constructed based on a deep extreme learning machine; an improved gray wolf optimization algorithm is used to optimize the weight parameters of the deep extreme learning machine, in which the gray wolf population is initialized, the wolf pack fitness is calculated, and elite differential mutation is introduced into the position update of gray wolf individuals to guide their position update and fitness update; after iterative updates, the optimal weight parameters are obtained from the position of the optimal gray wolf individual; the optimal weight parameters are used to configure the deep extreme learning machine; the training samples are imported into the deep extreme learning machine for training, with the hidden-layer output of the previous extreme learning machine used as the input matrix of the next extreme learning machine; after iterative training, the prediction performance of the deep extreme learning machine is verified with the validation samples; and once the prediction performance meets the standard, the molecular fingerprint prediction model is constructed based on the current deep extreme learning machine to establish the mapping relationship from spectrum to structure.

5. The method for identifying soil organic pollutants based on a combination of AI and high-throughput screening according to claim 1, characterized in that the spectral features corresponding to the substance peaks of the soil sample are extracted, a spectral vector is constructed based on the spectral features, the spectral vector is imported as the model input into the molecular fingerprint prediction model, the molecular fingerprint prediction model is used to predict the corresponding molecular structure, and the molecular structure is used to generate a molecular fingerprint in bit-string representation.

6. The method for identifying soil organic pollutants based on the combination of AI and high-throughput screening according to claim 1, characterized in that obtaining candidate chemical structures by searching the organic pollutant mass spectrometry database is specifically as follows: the organic pollutant mass spectrometry database is represented as a graph with compounds as nodes; the exact mass, retention time, structural features, and spectral features of the compounds are used as additional features of the nodes; the abundance similarity, transformation reactions, adduct relationships, and spectral similarity between compounds are used as different types of edge structures; multiple association graphs corresponding to the organic pollutant mass spectrometry database are generated based on the different types of edge structures; neighbor nodes and graph structures are obtained from the multiple association graphs according to the spectral features of the soil sample's substance peaks; the graph structures are learned and represented using a graph convolutional network to obtain the compound tag embedding vectors; the similarity between the spectral vector and the compound tag embedding vectors is calculated; and compound tags that meet the similarity threshold are selected, and candidate chemical structures are obtained based on the additional node features of the corresponding compound nodes.

7. The method for identifying soil organic pollutants based on the combination of AI and high-throughput screening according to claim 1, characterized in that scoring the candidate chemical structures with the predicted molecular fingerprints, selecting chemical structures that meet preset standards, and constructing identification criteria based on the selected chemical structures to obtain the identification results of organic pollutants in the soil samples is specifically as follows: the predicted molecular fingerprint and the candidate chemical structures of the substance peaks in the soil sample are obtained, candidate molecular fingerprints are constructed from the candidate chemical structures, the similarity between the predicted molecular fingerprint and each candidate molecular fingerprint is calculated, and the candidate chemical structures are scored using the similarity; chemical structures that meet preset standards are selected, and identification criteria are constructed based on the selected chemical structures; and molecular formulas and confidence levels are extracted from the organic pollutant mass spectrometry database using the identification criteria to obtain the identification results of organic pollutants in the soil samples.

8. A soil organic pollutant identification system based on the combination of AI and high-throughput screening, characterized in that the system is used to implement the soil organic pollutant identification method based on the combination of AI and high-throughput screening as described in any one of claims 1-7, and includes: a soil sample collection module, a high-throughput screening module, an organic pollutant mass spectrum database module, a material structure prediction module, and a non-targeted screening output module; the soil sample collection module is responsible for collecting soil samples from the target area, setting up a procedural blank sample, and preprocessing the collected samples to meet the requirements of high-resolution mass spectrometry analysis; the high-throughput screening module is responsible for performing high-resolution mass spectrometry analysis on the collected soil samples to obtain high-resolution mass spectrometry data, preprocessing the data, and extracting substance peaks from the high-resolution mass spectrometry data; the organic pollutant mass spectrum database module is responsible for constructing an organic pollutant mass spectrometry database using open source data, literature retrieval, known organic pollutant standard spectral data, and mass spectrometry big data retrieval data, using the spectral and structural features of compounds in the database to construct training and validation samples, and using the spectral features of soil samples to retrieve candidate chemical structures from the organic pollutant mass spectrometry database; the material structure prediction module is responsible for establishing a molecular fingerprint prediction model, training and evaluating the model with the training samples and validation samples, constructing a mapping relationship from spectrum to structure, constructing a spectral vector from the substance peaks extracted by the high-throughput screening module, importing the spectral vector into the model to obtain possible chemical structures and predicted molecular fingerprints, scoring the candidate chemical structures with the predicted molecular fingerprints, and selecting the chemical structures that meet preset standards; and the non-targeted screening output module is responsible for constructing identification criteria based on the predicted high-resolution material structure and using the identification criteria to output the organic pollutant identification results of the soil sample.