Microbial induced calcium carbonate precipitation rate prediction method, device, equipment and medium

By constructing a biological interaction network topology and integrating attention mechanisms with multi-omics data, the problem of existing models ignoring microbial metabolic regulation is solved, achieving more accurate prediction of microbial-induced calcium carbonate precipitation rate and cost reduction.

CN122245446A · Pending · Publication Date: 2026-06-19 · LONGYAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
LONGYAN UNIV
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing microbial-induced calcium carbonate precipitation rate prediction models neglect the complex metabolic regulation within microorganisms, cannot explain the differences in precipitation efficiency under different nutrient substrates, and traditional methods lack adaptive prediction capabilities through multimodal feature fusion.

Method used

By acquiring multi-omics data, a biological interaction network topology map is constructed. Feature fusion is performed using graph convolutional networks and attention mechanisms, and precipitation rate prediction is performed by combining neural coding networks. By integrating information from multiple biological levels, the functional relationships between molecules are captured.

Benefits of technology

It achieves more accurate precipitation rate prediction, provides guidance on biological targets, reduces detection costs, and improves the model's generalization ability and scientific rigor.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the field of precipitation prediction technology, and discloses a method, apparatus, equipment, and medium for predicting the precipitation rate of microbially induced calcium carbonate. The method includes acquiring multiple initial omics data of a target microbial sample during the calcium carbonate precipitation process; preprocessing and reducing the dimensionality of each initial omics data to obtain corresponding low-dimensional target feature vectors; constructing a biological interaction network topology based on protein-protein interaction or metabolic pathway knowledge; mapping multiple low-dimensional target feature vectors to a unified low-dimensional embedding space based on a neural coding network to obtain a low-dimensional multimodal joint feature vector; weighting and integrating the low-dimensional multimodal joint feature vector based on an attention mechanism to obtain a fused feature vector; and inputting the fused feature vector into a pre-constructed regression prediction framework to obtain the precipitation prediction rate. This solution improves the ability to capture the microbially induced calcium carbonate precipitation process and enhances the accuracy of precipitation rate prediction.

Description

Technical Field

[0001] This invention belongs to the field of precipitation prediction technology, specifically relating to a method, apparatus, equipment, and medium for predicting the precipitation rate of microbially induced calcium carbonate.

Background Technology

[0002] Microbially induced calcium carbonate precipitation (MICP) is a process that utilizes microbial metabolic activity to drive calcium carbonate precipitation. MICP has broad application prospects in soil stabilization, building material preparation, and environmental remediation. Accurately predicting the precipitation rate of MICP is a core prerequisite for optimizing process parameters, reducing production costs, and expanding application scenarios. However, the MICP process involves complex coupling between multiple omics layers, such as metabolomics, proteomics, and transcriptomics, and biological topological networks, such as protein interactions and metabolic pathways.

[0003] Current research and prediction of MICP precipitation rates mainly rely on the following models and methods. Physicochemical kinetics-based models: chemical equilibrium equations, such as the urea hydrolysis kinetic rate equation, and mass transfer equations, such as Darcy's law, are used to simulate the precipitation process. The amount of calcium carbonate formed is calculated by setting environmental parameters such as initial concentration and flow rate.

[0004] Statistical models based on empirical parameters: using multiple linear regression or orthogonal experimental design, the effects of macroscopic factors such as pH, temperature, and type of calcium source on precipitation effect are analyzed.

[0005] While the aforementioned technologies have advanced MICP prediction techniques to some extent, several challenges remain in practical application. Most existing models treat microorganisms as constant factors or simple reaction rate constants, neglecting the complex metabolic regulation within microorganisms. They fail to explain why the same strain exhibits drastic differences in precipitation efficiency under different nutrient substrates, and their generalization ability is extremely poor when applied across strains and environments. Traditional machine learning methods treat protein expression levels or metabolites as independent feature points. In vivo, urease production is regulated by multiple transcription factors and is intertwined with the synthesis pathways of calcium transporters and extracellular polymeric substances (EPS). Existing technologies sever the logical connections between these biomolecules, causing models to lose crucial background biological information. Furthermore, there is a lack of adaptive prediction schemes for MICP under multimodal feature fusion.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, the present invention aims to provide a method, apparatus, equipment, and medium for predicting the rate of microbial-induced calcium carbonate precipitation.

[0007] According to one aspect of this application, a method for predicting the rate of microbial-induced calcium carbonate precipitation is disclosed, the method comprising:
acquiring multiple types of initial omics data of the target microbial sample during the process of inducing calcium carbonate precipitation;
preprocessing each type of initial omics data to obtain a high-dimensional feature vector corresponding to each preprocessed target omics data, wherein each dimension of the high-dimensional feature vector represents the expression level of the corresponding target omics data in the target microbial sample;
for each type of target omics data, processing the high-dimensional feature vector with the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vector corresponding to each high-dimensional feature vector;
acquiring the protein interaction or metabolic pathway knowledge of the target microbial sample, and constructing a biological interaction network topology based on the protein interaction or metabolic pathway knowledge, wherein the nodes of the biological interaction network topology are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in the metabolic pathway;
mapping, based on a pre-constructed neural coding network, the multiple low-dimensional target feature vectors to a unified low-dimensional embedding space to obtain a low-dimensional multimodal joint feature vector, wherein the multimodal joint feature vector contains feature information of each target omics data of the target microorganism;
performing weighted integration of the low-dimensional multimodal joint feature vector based on an attention mechanism to obtain a fused feature vector; and
inputting the fused feature vector into a pre-constructed regression prediction framework to obtain the predicted precipitation rate of the target microbial sample during the induced calcium carbonate precipitation process, as output by the regression prediction framework.

[0008] In some embodiments, the preprocessing of each type of initial omics data to obtain a high-dimensional feature vector corresponding to each type of initial omics data includes: preprocessing each type of initial omics data to obtain the intermediate omics data corresponding to each type of initial omics data; and performing feature scaling on each intermediate omics data according to the feature scaling formula to obtain the high-dimensional feature vector corresponding to each initial omics data. The feature scaling formula is as follows:

$$x' = \frac{x - \mathrm{Median}(x)}{\mathrm{IQR}(x)}$$

In the formula, $x'$ is the high-dimensional feature vector corresponding to the initial omics data $x$; $x$ is a single initial omics data item of the target microbial sample; $\mathrm{Median}(x)$ is the median of this initial omics data across all samples; and $\mathrm{IQR}(x)$ is the interquartile range of this initial omics data.

[0009] In some embodiments, the step of processing the high-dimensional feature vectors corresponding to each type of omics data using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vectors corresponding to each high-dimensional feature vector includes: The high-dimensional feature vectors are input into the neural network encoder, and through nonlinear activation functions and layer normalization, the high-dimensional feature vectors are compressed into the low-dimensional embedding space of the target dimension to obtain the low-dimensional target feature vector.

[0010] In some embodiments, constructing a biological interaction network topology based on the knowledge of protein interactions or metabolic pathways includes: Retrieve interaction data of target microbial samples from protein interaction databases or metabolic pathway databases to construct an undirected graph G=(V,E), where node V represents functional proteins or key enzymes involved in the induction of calcium carbonate precipitation in the target microbial sample, and edge E represents the interaction between functional proteins or the upstream and downstream relationship in the metabolic pathway. Determine the initial feature vectors of the nodes in the undirected graph; The undirected graph is encoded using a graph convolutional network to generate a network embedding vector containing node and topological context information. The topology of the biological interaction network is determined based on the undirected graph, the initial feature vectors of the nodes, and the network embedding vectors.

[0011] In some embodiments, determining the initial feature vectors of the nodes in the undirected graph includes: For any node in the undirected graph: If the expression level of the functional protein corresponding to the node exists in the initial omics data, the feature value corresponding to the functional protein in the high-dimensional feature vector obtained after preprocessing will be determined as the initial feature vector of the node. If the expression level of the functional protein corresponding to the node is not recorded in the initial omics data, the initial feature vector of the node is obtained by initializing and filling it with the average expression level of the neighboring nodes or by using a pre-trained biological network embedding vector.

[0012] In some embodiments, applying the graph convolutional network to encode the undirected graph to generate a network embedding vector containing node and topological context information includes: applying a graph convolutional network together with a feature transfer formula, so that features are transferred between nodes through the adjacency matrix A and the degree matrix D to generate network embedding vectors containing node and topological context information. The feature transfer formula is as follows:

$$H^{(l+1)} = \sigma\left(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}\right)$$

In the formula, $H^{(l+1)}$ is the node feature matrix of the (l+1)-th layer of the graph convolutional network; $\sigma$ is a nonlinear activation function; $D$ is the degree matrix of the undirected graph; $D^{-1/2}$ is the negative 1/2 power of the degree matrix; $A$ is the adjacency matrix of the undirected graph; $H^{(l)}$ is the node feature matrix of the l-th layer of the graph convolutional network; and $W^{(l)}$ is the learnable weight parameter matrix of the l-th layer of the graph convolutional network.

[0013] In some embodiments, the weighted integration of the low-dimensional multimodal joint feature vector based on the attention mechanism to obtain the fused feature vector includes: calculating a contribution score by linearly transforming each single-modal feature vector in the low-dimensional multimodal joint feature vector with a learnable parameter matrix; normalizing the contribution scores with the Softmax function to generate the single-modal weight coefficient corresponding to each single-modal feature vector; and multiplying each single-modal feature vector by its corresponding single-modal weight coefficient and summing the results to generate the fused feature vector. The normalization formula of the Softmax function is as follows:

$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{M} \exp(s_j)}$$

In the formula, $\alpha_i$ is the single-modal weight coefficient corresponding to the i-th single-modal feature vector; $s_i$ is the contribution score calculated after linear transformation of the i-th single-modal feature vector; $M$ is the total number of single-modal feature vectors participating in the fusion; $j$ is the summation index, ranging from 1 to $M$; $s_j$ is the contribution score of the j-th single-modal feature vector; and exp is the exponential function with the natural constant e as its base.
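The attention-based fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the applicant's implementation: the function name `attention_fuse` is hypothetical, and the learnable parameter matrix is assumed here to be a single scoring vector `w` that maps each single-modal feature vector to one scalar contribution score.

```python
import numpy as np

def attention_fuse(features, w):
    """Fuse M single-modal vectors with softmax attention weights.

    features: array of shape (M, d), one row per omics modality.
    w: array of shape (d,), an assumed learnable scoring vector.
    """
    scores = features @ w                    # contribution score s_i per modality
    scores = scores - scores.max()           # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights alpha_i
    fused = alpha @ features                 # weighted sum of the modality vectors
    return fused, alpha

# Three toy modality embeddings of dimension 2 (illustrative values only)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
fused, alpha = attention_fuse(features, w=np.array([1.0, 0.0]))
```

In a trained model `w` would be learned jointly with the rest of the network; the resulting `alpha` values are the per-modality weights that the text interprets as contributions.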

[0014] According to another aspect of this application, a microbial-induced calcium carbonate precipitation rate prediction device is also disclosed, the device comprising:
an initial omics data acquisition module, used to acquire multiple types of initial omics data of the target microbial sample during the process of inducing calcium carbonate precipitation;
a high-dimensional feature vector determination module, used to preprocess each type of initial omics data to obtain a high-dimensional feature vector corresponding to each type of initial omics data, wherein each dimension of the high-dimensional feature vector represents the expression level of the corresponding omics data in the target microbial sample;
a low-dimensional target feature vector determination module, used to process the high-dimensional feature vectors corresponding to each type of omics data using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vectors corresponding to each high-dimensional feature vector;
a biological interaction network topology construction module, used to acquire knowledge of protein interactions or metabolic pathways of the target microbial sample and construct a biological interaction network topology based on that knowledge, wherein the nodes of the biological interaction network topology are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in metabolic pathways;
a low-dimensional multimodal joint feature vector determination module, used to map multiple low-dimensional target feature vectors to a unified low-dimensional embedding space based on a pre-constructed neural coding network to obtain a low-dimensional multimodal joint feature vector, wherein the multimodal joint feature vector contains feature information of each omics data of the target microorganism;
a fused feature vector determination module, used to perform weighted integration of the low-dimensional multimodal joint feature vector based on an attention mechanism to obtain a fused feature vector; and
a precipitation rate prediction module, used to input the fused feature vector into a pre-constructed regression prediction framework to obtain the predicted precipitation rate of the target microbial sample during the induced calcium carbonate precipitation process, as output by the regression prediction framework.

[0015] According to another aspect of this application, an electronic device is also disclosed, the electronic device including a memory and at least one processor, the memory storing instructions; the at least one processor invokes the instructions in the memory to cause the electronic device to perform the various steps of the microbial-induced calcium carbonate precipitation rate prediction method as described above.

[0016] According to another aspect of this application, a computer-readable storage medium is also disclosed, on which instructions are stored, which, when executed by a processor, implement the various steps of the microbial-induced calcium carbonate precipitation rate prediction method as described above.

[0017] The present invention includes, but is not limited to, the following beneficial effects:

(1) This scheme integrates information from multiple biological levels, captures the functional relationships between molecules through graph networks, and uses an attention mechanism so that the model automatically focuses on the important biological signals under different environmental conditions. The prediction results are therefore no longer just statistical fitting but generalizable scientific inferences based on an understanding of the biological processes, turning a black-box model into a transparent tool. The attention weights directly quantify the contribution of different biomolecular levels to the final precipitation rate and provide clear biological targets for actual process optimization, that is, they indicate whether it is the activity of a certain key enzyme or the concentration of a certain metabolite that limits precipitation efficiency. This guides the modification of target microorganisms or the adjustment of the culture medium formulation, greatly accelerating the transfer from laboratory discovery to industrial application;

(2) Through preprocessing, redundant or unnecessary detection indicators can be eliminated, reducing the cost of continuous, high-frequency detection in actual production;

(3) This scheme transforms biological topological networks into computable and embeddable graph structure features, realizing the deep integration of biological mechanisms with topology-aware feature representation and the deep fusion of molecular attributes with topological associations. It provides graph structure features rich in biological semantics for subsequent multimodal fusion, characterizes the synergistic mechanisms in the MICP process more accurately, and makes up for the shortcoming that traditional omics features focus only on individual molecules, so that the model can capture deeper biological regulatory logic and further improve the scientific rigor and accuracy of precipitation rate prediction;

(4) When the initial omics data includes proteomics data, the low-dimensional target feature vector corresponding to proteomics is directly used as the initial feature vector of the nodes of the undirected graph. This directly links node features to protein expression and preserves the biological specificity at the protein level, so that the node features of the biological interaction network topology graph have clear molecular functional semantics, which makes it easier to capture protein-protein or protein-metabolism interaction logic during subsequent graph convolutional encoding. On the other hand, when the initial omics data does not contain proteomics data, the node features are initialized and filled with the average expression level of neighboring nodes or with pre-trained biological network embedding vectors. This uses the topological neighborhood information of the network and the general biological knowledge of the pre-trained embeddings, solves the problem of missing node features in non-proteomics data, and ensures the integrity of the biological interaction network topology graph. This differentiated strategy makes the node features of the topology graph fit the specific omics data type while supplementing biological information through the topological neighborhood or pre-trained knowledge. It provides a structurally complete graph feature foundation for subsequent multimodal fusion, improves the model's ability to capture the process of microbial-induced calcium carbonate precipitation, and thus improves the accuracy of precipitation rate prediction.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.

[0019] Figure 1 is a flowchart of the microbial-induced calcium carbonate precipitation rate prediction method according to an embodiment of this application; Figure 2 is a structural block diagram of the microbial-induced calcium carbonate precipitation rate prediction device according to an embodiment of this application; Figure 3 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0020] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0021] For ease of understanding, the specific process of the embodiments of the present invention is described below. Figure 1 is a flowchart of the microbial-induced calcium carbonate precipitation rate prediction method according to an embodiment of this application. Referring to Figure 1, the method includes the following steps:

S100. Obtain multiple types of initial omics data of the target microbial sample during the process of inducing calcium carbonate precipitation.

[0022] Specifically, for the target microbial sample, different environmental factor gradients can be set, such as calcium ion concentration (10-100 mM), pH (7.0-9.5), and temperature (20-40°C), so that the strain exhibits differentiated MICP efficiency. Destructive sampling is then performed at key calcification nodes after the strain enters the logarithmic growth phase to obtain multiple omics data, including but not limited to genomics, proteomics, metabolomics, and transcriptomics data. For example, the protocol focuses on transcriptomics and proteomics data, and uses liquid chromatography-tandem mass spectrometry (LC-MS/MS) to acquire metabolomics data, high-throughput sequencing to acquire transcriptomics (mRNA) data, and mass spectrometry to acquire proteomic expression profiles.

[0023] S101. Preprocess each type of initial omics data to obtain the high-dimensional feature vectors corresponding to the preprocessed target omics data.

[0024] In this context, each dimension of the high-dimensional feature vector represents the expression level of the corresponding target omics data in the target microbial sample.

[0025] Specifically, each initial omics data can be preprocessed to obtain the target omics data corresponding to each initial omics data. Then, the feature scaling formula can be used to scale the features of each target omics data to obtain the high-dimensional feature vector corresponding to each target omics data. The feature scaling formula is as follows:

$$x' = \frac{x - \mathrm{Median}(x)}{\mathrm{IQR}(x)}$$

In the formula, $x'$ is the high-dimensional feature vector corresponding to the target omics data $x$; $x$ is a specific target omics data item of a target microbial sample; $\mathrm{Median}(x)$ is the median of the target omics data across all samples; and $\mathrm{IQR}(x)$ is the interquartile range of the target omics data.
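As a rough illustration of the median and interquartile-range scaling in this paragraph, the following NumPy sketch scales each feature column of a samples-by-features matrix. The function name `robust_scale` and the toy matrix are hypothetical, and leaving zero-IQR features undivided is an assumption that the text does not address.

```python
import numpy as np

def robust_scale(X):
    """Scale each feature (column) by its median and interquartile range.

    X: array of shape (n_samples, n_features).
    """
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    # Assumption: a feature with zero IQR is only centered, not divided
    iqr = np.where(iqr == 0, 1.0, iqr)
    return (X - median) / iqr

# Toy matrix: 5 samples x 2 features; the first feature has one extreme value
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])
X_scaled = robust_scale(X)
```

Because the median and IQR are insensitive to the single extreme value in the first column, the four typical samples keep a comparable scale in both features, which is the usual motivation for this kind of scaling over mean and standard deviation.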

[0026] Preprocessing each type of initial omics data to obtain the corresponding target omics data may include: Feature filtering: features whose expression levels are constant (variance close to zero) across all samples are removed to reduce redundant information. Missing value imputation: for the missing values common in mass spectrometry-based proteomics or metabolomics, median imputation or a K-nearest neighbor (KNN) interpolation algorithm is used to impute the missing values and ensure the integrity of the data matrix.

[0027] Alignment and synchronization: Cross-sample alignment is performed for each omics modality to ensure that each sample (cell row) has corresponding records in the transcription, protein, and metabolism dimensions, and to remove sample data that cannot be matched.
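The three preprocessing operations listed above (feature filtering, missing value imputation, and cross-sample alignment) might look like the sketch below. All function names, the variance tolerance, and the sample IDs are illustrative assumptions; only median imputation is shown, not the KNN variant.

```python
import numpy as np

def drop_constant_features(X, tol=1e-8):
    """Remove columns whose variance across samples is close to zero."""
    keep = np.nanvar(X, axis=0) > tol
    return X[:, keep], keep

def impute_median(X):
    """Replace NaN entries with the per-feature median of observed values."""
    X = X.copy()
    med = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = med[cols]
    return X

def align_samples(ids_by_modality):
    """Return the sorted sample IDs that are present in every modality."""
    common = set.intersection(*(set(ids) for ids in ids_by_modality.values()))
    return sorted(common)

# Toy proteomics matrix: column 0 is constant, column 2 has a missing value
X = np.array([[1.0, 5.0, np.nan],
              [1.0, 7.0, 2.0],
              [1.0, 9.0, 4.0]])
X_filtered, kept = drop_constant_features(X)
X_imputed = impute_median(X_filtered)

# Hypothetical sample IDs per omics modality
shared = align_samples({
    "transcriptome": ["s1", "s2", "s3"],
    "proteome": ["s2", "s3", "s4"],
    "metabolome": ["s3", "s2"],
})
```

Samples outside `shared` would be dropped from every modality so that each remaining row has a record in all omics layers.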

[0028] Furthermore, after scaling the features of each target omics data using the feature scaling formula to obtain the high-dimensional feature vector corresponding to each target omics data, log2 transformation or L2 normalization can be applied to omics features with highly skewed distributions, such as the abundance of certain metabolites, to make the feature distribution closer to a normal distribution and improve the feature capture efficiency of the neural network.

[0029] In some specific embodiments, log2 transformation or L2 normalization is applied to omics features with highly skewed distributions, such as the abundance of certain metabolites, to make the feature distribution closer to a normal distribution. Specific implementation details include the following:

1. Automated mechanism for determining skewed distributions

Before applying the nonlinear transformation, skewness detection is first performed on the preprocessed intermediate omics data. For the sequence of values of each feature across samples, the skewness coefficient is calculated. Specifically, if the absolute value of the skewness coefficient of the expression sequence of a certain metabolite or protein across all target microbial samples is greater than a set skewness threshold (e.g., |S| > 1.0 or 1.5), the omics feature is determined to have a highly skewed distribution (usually a long-tailed or exponential distribution), and nonlinear distribution correction is triggered; otherwise, this correction step is skipped.
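The skewness check can be sketched as follows. The text does not say which skewness estimator is used; this sketch assumes the moment-based (Fisher-Pearson) coefficient, and the function names and toy data are hypothetical.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness coefficient m3 / m2**1.5 (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def highly_skewed(X, threshold=1.0):
    """Flag each feature (column) whose |skewness| exceeds the threshold."""
    return np.array([abs(skewness(col)) > threshold for col in X.T])

# Column 0 is symmetric; column 1 has a long right tail
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0],
              [5.0, 1000.0]])
flags = highly_skewed(X, threshold=1.0)
```

Only the columns flagged in `flags` would be passed on to the distribution correction; the others skip that step.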

[0030] 2. Specific operations of log2 transformation

When a certain omics feature (such as a metabolite whose abundance spans multiple orders of magnitude) is determined to be highly right-skewed (long-tailed distribution), a base-2 logarithmic transformation with a pseudo-count is applied. The specific formula is as follows:

$$x' = \log_2(x + \epsilon)$$

In the formula: $x$ is the original abundance or expression level value of the feature determined to have a skewed distribution; $x'$ is the transformed feature value; $\epsilon$ is a preset small constant (pseudo-count, for example 1 or $10^{-6}$), used to prevent the logarithm from becoming meaningless (negative infinity) when the original expression level $x$ is 0, while maintaining a smooth transition in the low-abundance noise region.
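A minimal sketch of the pseudo-count log2 transformation follows; the function name `log2_pseudocount` and the example abundances are illustrative, and the pseudo-count of 1 is one of the two example values given in the text.

```python
import numpy as np

def log2_pseudocount(x, eps=1.0):
    """Base-2 log transform with a pseudo-count so that x = 0 stays finite."""
    x = np.asarray(x, dtype=float)
    return np.log2(x + eps)

# Abundances spanning several orders of magnitude, including a zero
abundance = np.array([0.0, 1.0, 3.0, 1023.0])
transformed = log2_pseudocount(abundance, eps=1.0)
```

With `eps=1.0` a zero abundance maps to exactly 0; with a very small pseudo-count such as 1e-6, zeros instead map to a large negative value (about -19.9), which is worth keeping in mind when choosing between the two.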

[0031] 3. Specific operations of L2 norm normalization

For omics data matrices with significant differences in magnitude between samples and a sharply peaked, heavy-tailed distribution, L2-norm normalization can be applied. This operation is performed along either the sample dimension (row vectors) or the feature dimension (column vectors), mapping the vectors onto the unit hypersphere. The specific formula is as follows:

$$x' = \frac{x}{\lVert x \rVert_2}, \qquad \lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

In the formula: $x$ is the omics feature vector being processed, with elements $x_i$; $n$ is the dimension of the vector (the total number of features or samples involved in normalization); $\lVert x \rVert_2$ is the L2 norm of the vector (i.e., the square root of the sum of the squares of its elements); and $x'$ is the feature vector obtained after L2 normalization.
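The L2-norm step can be sketched as below, with `axis=1` normalizing each sample (row) and `axis=0` each feature (column). The function name and the small floor `tiny`, which guards against division by zero for all-zero vectors, are assumptions not stated in the text.

```python
import numpy as np

def l2_normalize(X, axis=1, tiny=1e-12):
    """Divide each row (axis=1) or column (axis=0) by its L2 norm."""
    X = np.asarray(X, dtype=float)
    norms = np.sqrt(np.sum(X ** 2, axis=axis, keepdims=True))
    # Assumption: all-zero vectors are left as zeros instead of raising
    return X / np.maximum(norms, tiny)

X = np.array([[3.0, 4.0],
              [0.0, 5.0]])
X_rows = l2_normalize(X, axis=1)   # each sample mapped to unit length
X_cols = l2_normalize(X, axis=0)   # each feature mapped to unit length
```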

[0032] 4. Benefits of implementation

The aforementioned distribution correction scheme effectively weakens the "absolute dominance" of extremely abundant metabolites (such as substrates or major metabolic byproducts remaining in the culture medium) over neuronal activation states during forward propagation in neural networks, mitigating the gradient explosion problem. At the same time, it amplifies the feature weights of low-to-medium abundance rate-limiting enzymes or key trace metabolites. This operation, which forces the feature distribution back to a pseudo-normal or unit-norm space, provides a more uniform data input for the nonlinear activation functions (such as ReLU) in subsequent multilayer perceptron (MLP) encoders, thereby significantly improving the neural network's efficiency in capturing deep features of the complex MICP mechanism and its convergence stability.

[0033] S102. For each type of target omics data, the high-dimensional feature vector is processed using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vector corresponding to each high-dimensional feature vector.

[0034] Specifically, high-dimensional feature vectors can be input into independent neural network encoders, and the dimensions can be compressed into a unified low-dimensional embedding space through non-linear activation functions (such as ReLU) and layer normalization to reduce the computational complexity of subsequent fusion.
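One way to picture the independent per-modality encoders is the NumPy forward pass below. It is a sketch under several assumptions: the two-layer layout, the hidden width of 128, the embedding size of 32, the input dimensions, and the random (untrained) weights are all hypothetical, and layer normalization is shown without learnable gain and bias.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def make_encoder(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialized parameters for one modality-specific encoder."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, np.sqrt(2.0 / hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def encode(x, p):
    """Linear -> ReLU -> LayerNorm -> Linear -> LayerNorm."""
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)
    h = layer_norm(h)
    return layer_norm(h @ p["W2"] + p["b2"])

rng = np.random.default_rng(0)
# Hypothetical input dimensions per omics modality
dims = {"transcriptome": 2000, "proteome": 800, "metabolome": 300}
encoders = {m: make_encoder(d, 128, 32, rng) for m, d in dims.items()}
batch = {m: rng.normal(size=(4, d)) for m, d in dims.items()}
embeddings = {m: encode(batch[m], encoders[m]) for m in dims}
```

Each modality keeps its own encoder, but all outputs share the same 32-dimensional space, so they can later be compared and fused directly.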

[0035] S103. Obtain knowledge of protein interactions or metabolic pathways of the target microbial sample, and construct a biological interaction network topology based on the knowledge of protein interactions or metabolic pathways.

[0036] In this biological interaction network topology, the nodes are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in metabolic pathways.

[0037] Specifically, for commonly used MICP strains such as *Sporosarcina pasteurii* or *Bacillus*, the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins, a database that collects, evaluates, and integrates known and predicted protein-protein interaction information) or the KEGG (Kyoto Encyclopedia of Genes and Genomes, a database for understanding higher functions and the operation of biological systems) metabolic pathway database can be used to retrieve data on physical interactions and functional associations of proteins across the entire genome of the strain. An undirected graph G=(V,E) can then be constructed, where the node set V represents the functional proteins or key enzymes involved in the MICP process, such as the urease subunits UreA/B/C, carbonic anhydrase, and calcium transporters, and the edge set E represents the physical binding between proteins or the upstream and downstream regulatory relationships in metabolic pathways.
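The construction of the undirected graph G=(V,E) and its adjacency and degree matrices can be sketched as follows. The node names echo the examples in the text, but the edge list is a made-up placeholder for illustration; it is not retrieved from STRING or KEGG and does not describe real interactions.

```python
import numpy as np

# Nodes named after the examples in the text
nodes = ["UreA", "UreB", "UreC", "carbonic_anhydrase", "Ca_transporter"]
# Hypothetical edges for illustration only (NOT real database records)
edges = [("UreA", "UreB"), ("UreB", "UreC"), ("UreA", "UreC"),
         ("UreC", "carbonic_anhydrase"),
         ("carbonic_anhydrase", "Ca_transporter")]

def build_adjacency(nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    index = {name: i for i, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        i, j = index[u], index[v]
        A[i, j] = A[j, i] = 1.0    # undirected: set both directions
    return A

A = build_adjacency(nodes, edges)
D = np.diag(A.sum(axis=1))         # degree matrix
```

In practice the edge list would come from an export of the chosen interaction or pathway database, possibly filtered by a confidence score.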

[0038] Furthermore, for any node in the undirected graph, if there is an expression level record of the functional protein corresponding to the node in the initial omics data, the feature value corresponding to the functional protein in the high-dimensional feature vector obtained after preprocessing is determined as the initial feature vector of the node; if there is no expression level record of the functional protein corresponding to the node in the initial omics data, the initial feature vector of the node is obtained by initializing and filling it with the average expression level of the neighboring nodes of the node or by using the pre-trained biological network embedding vector.
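The neighbor-average branch of this initialization rule can be illustrated as follows. This is a sketch under stated assumptions: the node names and edges are hypothetical, each node carries a single scalar feature, and the fallback value of 0.0 for a node with no measured neighbors is an assumption that the text does not specify. The alternative of using a pre-trained network embedding is not shown.

```python
def init_node_features(adjacency, measured):
    """Return one initial feature per node.

    adjacency: dict mapping node -> list of neighboring nodes
    measured:  dict mapping node -> preprocessed expression value (may omit nodes)
    """
    features = {}
    for node, neighbours in adjacency.items():
        if node in measured:
            # An expression record exists: use the preprocessed value directly.
            features[node] = measured[node]
        else:
            # No record: fill with the mean of the measured neighbors.
            known = [measured[n] for n in neighbours if n in measured]
            features[node] = sum(known) / len(known) if known else 0.0  # assumed fallback
    return features


# Hypothetical toy graph; the edges are illustrative, not taken from STRING or KEGG.
adjacency = {
    "UreA": ["UreB", "UreC"],
    "UreB": ["UreA", "UreC"],
    "UreC": ["UreA", "UreB", "CA"],
    "CA": ["UreC"],
}
measured = {"UreA": 1.2, "UreB": 0.8, "CA": -0.4}  # UreC has no expression record
features = init_node_features(adjacency, measured)
```

Here `features["UreC"]` is filled with the mean of its three measured neighbors, while the measured nodes keep their own values.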

[0039] Furthermore, a Graph Convolutional Network (GCN) layer is applied, using the adjacency matrix A and the degree matrix D to achieve feature transfer between nodes. Its core calculation formula is as follows: $H^{(l+1)} = \sigma\left(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}\right)$; in the formula, $H^{(l+1)}$ is the node feature matrix of the (l+1)-th layer of the graph convolutional network; $\sigma$ is a non-linear activation function; $D$ is the degree matrix of the undirected graph; $D^{-1/2}$ is the negative 1/2 power of the degree matrix; $A$ is the adjacency matrix of the undirected graph; $H^{(l)}$ is the node feature matrix of the l-th layer of the graph convolutional network; $W^{(l)}$ is the learnable weight parameter matrix of the l-th layer of the graph convolutional network.

[0040] Specifically, in the computation of the graph convolutional network, the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$ is first calculated from the adjacency matrix $A$ and the degree matrix $D$. It is then multiplied by the node feature matrix $H^{(l)}$ of the l-th layer and by the learnable weight parameter matrix $W^{(l)}$, and the node feature matrix $H^{(l+1)}$ of the (l+1)-th layer is finally obtained through the non-linear activation function $\sigma$. This process is repeated multiple times (i.e., l increases gradually from 0). Each layer aggregates the feature information of neighboring nodes into the feature representation of the current node, so that the final node feature matrix $H^{(L)}$ (where L is the total number of layers) contains richer node and topological context information. This operation enables the features of each protein node to be fused with the context information of its neighboring nodes in the MICP metabolic network. After feature propagation through multiple graph convolution layers, the node feature matrix of the last layer, $H^{(L)}$, is the generated network embedding. Each row of this embedding corresponds to the embedding representation of one biological entity (node), which integrates the node's initial features with information from neighboring nodes and the broader context passed along the topology of the entire biological network (graph). These network embedding vectors can be used as input to subsequent prediction models (such as multilayer perceptrons or support vector machines) to predict the rate of microbially induced calcium carbonate precipitation.
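A single propagation step of this kind can be written out directly. The sketch below uses plain Python lists and ReLU as the activation; the three-node path graph, the feature values, and the identity weight matrix are illustrative assumptions. It normalizes with A and D exactly as stated in the text and adds no self-loops, which some GCN variants do add.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(adj):
    # D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A.
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, h, w):
    # H^{(l+1)} = ReLU(adj_norm @ H^{(l)} @ W^{(l)})
    z = matmul(matmul(adj_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy path graph 0 - 1 - 2 (hypothetical nodes).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # initial node features, 2 per node
w0 = [[1.0, 0.0], [0.0, 1.0]]              # identity weights, for readability only

adj_norm = normalized_adjacency(adj)
h1 = gcn_layer(adj_norm, h0, w0)           # one round of neighbor aggregation
```

Stacking several calls to `gcn_layer`, each with its own weight matrix, widens the neighborhood that every node embedding summarizes.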

[0041] Furthermore, through multi-layer convolution, the network can capture the overall activity of functional modules such as the "urease metabolic cluster" or the "calcium ion pump regulatory chain".

[0042] S104. Based on a pre-constructed neural coding network, multiple low-dimensional target feature vectors are mapped to a unified low-dimensional embedding space to obtain a low-dimensional multimodal joint feature vector.

[0043] The multimodal joint feature vector contains feature information from each omics data of the target microorganism.

[0044] Specifically, this step involves constructing independent neural coding networks to map different modalities of multiple low-dimensional target feature vectors to a unified low-dimensional embedding space, thereby extracting deep nonlinear features at various biological levels. For each type of non-graph structured data, such as metabolomics, proteomics, and transcriptomics, a multilayer perceptron (MLP)-based encoder is constructed. Each encoder contains an input layer, multiple hidden layers, and an output embedding layer, where the number of hidden neurons decreases according to the feature dimension, such as 512-256-128. Nonlinear activation functions such as ReLU or LeakyReLU are used in each layer of the encoder to capture complex biological regulatory relationships in omics data, rather than simple linear superposition. Specifically, layer normalization is introduced after each MLP hidden layer to accelerate model convergence and ensure consistent feature scales across different samples, avoiding gradient vanishing. A dropout layer (with a dropout rate typically set to 0.3-0.5) is introduced to enhance the model's robustness and prevent overfitting to specific experimental sample noise. Furthermore, the topology graph of the biological interaction network is input into the Graph Convolutional Network (GCN) encoder. Through multiple rounds of graph convolution operations, the node expression level is fused with its topological position in the metabolic network. Global pooling is used to aggregate node-level features into a fixed-length vector representing the overall biological network state of the sample in the current MICP process. Further alignment processing is performed to ensure that the embedding vectors output by all encoders, including the multilayer perceptron (MLP) encoder and the GCN encoder, have the same dimension, such as 64 or 128, laying the mathematical foundation for subsequent attention mechanism fusion. 
Further synchronous cleaning is performed to align and verify the feature representations generated by each modality, eliminating incomplete or illogical sample embeddings at all levels.
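The pooling and dimension-alignment steps can be illustrated as follows. This is a minimal sketch: mean pooling is assumed as the global pooling operator, and the node embeddings, the projection matrix, and the target dimension of 2 are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def global_mean_pool(node_embeddings):
    """Average node-level rows into one fixed-length graph-level vector."""
    n_nodes = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(row[d] for row in node_embeddings) / n_nodes for d in range(dim)]


def project(vector, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


# Hypothetical GCN output: 3 nodes with 4 features each.
node_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]
graph_vector = global_mean_pool(node_embeddings)

# Hypothetical 2 x 4 projection aligning the graph vector with 2-dimensional
# embeddings produced by the MLP encoders.
projection = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
graph_embedding = project(graph_vector, projection)
metabolomics_embedding = [0.3, -0.1]  # stand-in MLP encoder output
```

After projection, `graph_embedding` and `metabolomics_embedding` have the same length and can be stacked as rows of the joint representation matrix.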

[0045] S105. Based on the attention mechanism, the low-dimensional multimodal joint feature vector is weighted and integrated to obtain the fused feature vector.

[0046] First, the feature vectors output by the different encoders, including the metabolomics embedding, the proteomics embedding, and the GCN topology embedding, are aligned and transformed into a representation of unified dimension. The embedding vectors of all M modalities are then concatenated into a joint representation matrix, which serves as the input to the attention layer.

[0047] Furthermore, a learnable parameter matrix is used to perform a linear transformation on each modality embedding and to calculate the contribution score of each modality to the final prediction task, and the scores are normalized using the Softmax function to generate the modality weight coefficients. The normalization formula is as follows: $\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{M} \exp(s_j)}$; in the formula, $\alpha_i$ is the single-modal weight coefficient corresponding to the i-th single-modal feature vector; $s_i$ is the contribution score calculated after linear transformation of the i-th single-modal feature vector; M is the total number of single-modal feature vectors participating in the fusion; j is the summation index variable, with a value ranging from 1 to M; $s_j$ is the contribution score of the j-th single-modal feature vector; exp is the exponential function with the natural constant e as its base.

[0048] This mechanism allows the model to automatically increase the weight of the proteomics modality when the MICP precipitation rate is dominated by urease protein concentration, and to automatically increase the weight of the metabolomics modality when the substrate is limiting.

[0049] Furthermore, each modality embedding vector is multiplied by its corresponding weight coefficient, and the products are summed to generate the final fused feature vector. The fused vector retains the specific information of each omics layer while incorporating cross-modal synergistic associations, thus solving the problem that a single omics layer cannot fully explain the complex biological processes of MICP.
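The scoring, Softmax normalization, and weighted summation can be sketched together. In this illustration the learnable parameter matrix is assumed to be a single row vector `w` that maps each embedding to one scalar score; the embeddings and the values of `w` are hypothetical, and in practice `w` would be learned during training.

```python
import math


def softmax(scores):
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, w):
    """Return (modality weights, fused vector) for equally sized embeddings."""
    # Contribution score of each modality: linear transformation to a scalar.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in embeddings]
    alphas = softmax(scores)
    dim = len(embeddings[0])
    # Weighted sum over modalities, computed per embedding dimension.
    fused = [sum(a * h[d] for a, h in zip(alphas, embeddings)) for d in range(dim)]
    return alphas, fused


# Hypothetical aligned embeddings (M = 3 modalities, dimension 2).
embeddings = [
    [1.0, 0.0],  # metabolomics
    [0.0, 1.0],  # proteomics
    [1.0, 1.0],  # GCN topology
]
w = [1.0, 0.0]   # stand-in for the learned scoring parameters
alphas, fused = attention_fuse(embeddings, w)
```

The weights in `alphas` sum to 1, so they can be read directly as the relative contribution of each modality to the fused vector.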

[0050] Furthermore, by extracting the attention weights after training, it is possible to quantitatively assess which biological levels, such as transcriptional regulation, play a key role in accelerating calcium carbonate formation under specific environments, such as high pH environments.

[0051] Based on the weight analysis results, the model input can be further simplified: redundant modalities, such as genomics, that contribute little to the prediction of the real-time precipitation rate can be eliminated, thereby reducing the detection cost in actual production.

[0052] S106. Input the fused feature vector into the pre-constructed regression prediction framework to obtain the precipitation prediction rate of the target microbial sample in the process of induced calcium carbonate precipitation output by the regression prediction framework.

[0053] Specifically, the fused feature vector is input into a regression prediction framework consisting of 2-3 fully connected layers to obtain the precipitation prediction rate of the target microbial sample during the induced calcium carbonate precipitation process.

[0054] The final layer of the regression prediction framework uses a single output neuron, either with no activation function or with ReLU to ensure a non-negative output, and directly outputs the predicted calcium carbonate precipitation rate or yield. Dropout with a rate of 0.2-0.4 and batch normalization are added between the fully connected layers to improve the model's stability on small-sample microbial datasets.

[0055] Furthermore, mean squared error (MSE) is used as the training loss function to minimize the deviation between the predicted rate and the actual experimental observations. The coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) are used as performance evaluation criteria on the validation set, and a model whose R² is close to 0.96 is preferentially regarded as having excellent generalization performance. The MICP experimental samples are divided into training, validation, and independent test sets in a 7:2:1 ratio. The validation loss is monitored during training; if it no longer decreases within 20-50 consecutive training epochs, training is terminated early to prevent the model from overfitting to noisy data in a specific batch of experiments. The Adam optimizer is used in conjunction with a cosine annealing learning rate decay strategy to help the model converge stably in the later stages of training.
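The evaluation and scheduling components named above are simple enough to write out. The sketch below is illustrative only: the function names, the random seed, and all example numbers are assumptions; the cosine schedule is the common closed form without restarts; and the Adam optimizer itself is not shown.

```python
import math
import random


def split_7_2_1(samples, seed=0):
    # Shuffle, then cut into training / validation / test sets at 70% and 90%.
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - sse / ss_tot,
    }


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    # Learning rate decays from lr_max (epoch 0) to lr_min (epoch total_epochs).
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))


def early_stopping_epoch(val_losses, patience):
    # Return the epoch at which training stops: the first epoch where the
    # validation loss has failed to improve for `patience` consecutive epochs.
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1


train, val, test = split_7_2_1(range(20))
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
# Toy validation curve with patience 3 (the text uses a patience of 20-50 epochs).
stop = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74], patience=3)
```

In this toy run the validation loss last improves at epoch 2, so with a patience of 3 training stops at epoch 5.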

[0056] Furthermore, optimization can be performed on key hyperparameters in the model, including adjusting the number of hidden layers in the MLP, optimizing the dimension of the weight matrix in the attention mechanism, and adjusting the receptive field of the graph convolutional layer for extracting topological features. Through cyclic ablation studies, the predictive efficacy of different omics combinations (such as "proteomics + metabolomics" versus "whole omics") can be verified, seeking the optimal configuration that balances accuracy and detection cost.
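The ablation over omics combinations can be organized as a loop over subsets. The sketch below is a hypothetical harness: `evaluate` stands in for training and validating a model on a given subset, the per-modality gains and costs are placeholder numbers and not experimental results, and the selection rule (cheapest subset within a tolerance of the best score) is one possible way to balance accuracy against detection cost.

```python
from itertools import combinations


def ablation_subsets(modalities, min_size=1):
    # Enumerate every non-empty combination of modalities.
    for k in range(min_size, len(modalities) + 1):
        for subset in combinations(modalities, k):
            yield subset


def select_configuration(modalities, evaluate, cost, tolerance=0.01):
    """Pick the cheapest subset whose score is within `tolerance` of the best."""
    results = [(s, evaluate(s), sum(cost[m] for m in s)) for s in ablation_subsets(modalities)]
    best_score = max(score for _, score, _ in results)
    candidates = [r for r in results if r[1] >= best_score - tolerance]
    return min(candidates, key=lambda r: r[2])


modalities = ["proteomics", "metabolomics", "transcriptomics"]

# Placeholder values for illustration only; in practice evaluate() would retrain
# the model on the subset and return a validation metric such as R².
toy_gain = {"proteomics": 0.5, "metabolomics": 0.4, "transcriptomics": 0.005}
toy_cost = {"proteomics": 1.0, "metabolomics": 1.0, "transcriptomics": 1.0}


def evaluate(subset):
    return sum(toy_gain[m] for m in subset)


chosen = select_configuration(modalities, evaluate, toy_cost)
```

With these placeholder numbers the harness prefers "proteomics + metabolomics" over the full set, because the third modality adds less than the tolerance while increasing cost.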

[0057] Furthermore, the attention weight coefficients optimized during training can be extracted. If the model assigns higher weights to "proteomics" or "metabolomics" in its predictions, this indicates that phenotypic signals (such as enzyme abundance or substrate concentration) reflect the precipitation rate better than static genomic information in this specific MICP environment. By comparing the coefficients of determination obtained for different omics combinations, the molecular level that best explains the MICP rate response can be identified. For example, it can be verified whether the combination of "proteomics + transcriptomics" generates the strongest synergistic predictive signal. By perturbing the input features of the model, the specific features that have the greatest impact on the predicted precipitation rate (such as the expression levels of specific metabolites, urease subunits, or calcium transporters) can be identified. It can then be evaluated whether the identified high-contribution features correspond to known MICP mechanisms, and key signal transduction or metabolic pathways in the microorganism (such as urea hydrolysis and carbonic anhydrase catalysis) can be analyzed. By analyzing the protein network embeddings produced by the GCN layers, core protein modules (hub nodes) in the protein-protein interaction (PPI) network can be identified, and it can be investigated whether these core modules are related to calcium ion homeostasis, extracellular polymeric substance (EPS) synthesis, or cell surface charge regulation, thereby uncovering latent regulatory factors that may be overlooked by traditional single-omics analyses.
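The input-perturbation analysis can be sketched as a one-feature-at-a-time sensitivity check. Everything here is illustrative: `toy_model` is a hypothetical stand-in for the trained predictor, the feature names and coefficients are invented for the example, and the additive perturbation of size `delta` is one of several possible perturbation schemes.

```python
def perturbation_importance(predict, x, delta=0.1):
    """Absolute change in the prediction when each feature is shifted by delta."""
    base = predict(x)
    importance = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta  # perturb one feature, hold the others fixed
        importance.append(abs(predict(perturbed) - base))
    return importance


feature_names = ["urease_subunit", "metabolite_k", "ca_transporter"]  # hypothetical


def toy_model(x):
    # Stand-in for the trained regression framework (illustrative coefficients).
    return 2.0 * x[0] + 0.1 * x[1] - 0.5 * x[2]


x = [1.0, 1.0, 1.0]
importance = perturbation_importance(toy_model, x)
ranked = sorted(zip(feature_names, importance), key=lambda p: p[1], reverse=True)
```

The resulting ranking lists the features in order of their effect on the predicted rate, which can then be compared against known MICP mechanisms.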

[0058] It can be understood that, based on the rate-limiting metabolites or proteins identified by the model, the nutrient composition of the MICP reaction system can be adjusted, for example by adding specific metabolic precursors, and environmental parameters such as the pH control strategy can be tuned. The approach also provides target suggestions for the modification of genetically engineered strains, for example enhancing the expression of the rate-limiting enzyme with the highest attention weight, so as to achieve higher yields and faster calcium carbonate precipitation.

[0059] Furthermore, Figure 2 is a structural block diagram of the microbial-induced calcium carbonate precipitation rate prediction device according to an embodiment of the application. As shown in Figure 2, the device includes: The initial omics data acquisition module is used to acquire various initial omics data of the target microbial sample during the process of inducing calcium carbonate precipitation; The high-dimensional feature vector determination module is used to preprocess each type of initial omics data to obtain a high-dimensional feature vector corresponding to each type of initial omics data, wherein each dimension of the high-dimensional feature vector represents the expression level of the corresponding omics data in the target microbial sample; The low-dimensional target feature vector determination module is used to process the high-dimensional feature vectors corresponding to each type of omics data using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vectors corresponding to each high-dimensional feature vector; The biological interaction network topology construction module is used to acquire knowledge of protein interactions or metabolic pathways of the target microbial sample and construct a biological interaction network topology based on the knowledge of protein interactions or metabolic pathways, wherein the nodes of the biological interaction network topology are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in metabolic pathways; 
The low-dimensional multimodal joint feature vector determination module is used to map multiple low-dimensional target feature vectors to a unified low-dimensional embedding space based on a pre-constructed neural coding network, thereby obtaining a low-dimensional multimodal joint feature vector. The multimodal joint feature vector contains feature information of each omics data of the target microorganism. The fusion feature vector determination module is used to weight and integrate low-dimensional multimodal joint feature vectors based on an attention mechanism to obtain a fusion feature vector; The precipitation rate prediction module is used to input the fused feature vector into a pre-constructed regression prediction framework to obtain the precipitation prediction rate of the target microbial sample during the induced calcium carbonate precipitation process, as output by the regression prediction framework.

[0060] The application of the relevant modules of the device in this example can be referred to the relevant introduction of the method principle above, and will not be repeated here.

[0061] Figure 2 above describes in detail the microbial-induced calcium carbonate precipitation rate prediction device in this embodiment of the invention from the perspective of modular functional entities. The electronic device in this embodiment of the invention is described in detail below from the perspective of hardware processing.

[0062] Figure 3 is a schematic diagram of the structure of an electronic device 300 provided in an embodiment of the present invention. The electronic device 300 can vary significantly due to different configurations or performance characteristics. It may include one or more central processing units (CPUs) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) for storing application programs 333 or data 332. The memory 320 and the storage media 330 can be temporary or persistent storage. The program stored in the storage media 330 may include one or more modules (not shown in the diagram), each module including a series of instruction operations on the electronic device 300. Furthermore, the processor 310 may be configured to communicate with the storage media 330 and to execute, on the electronic device 300, the series of instruction operations stored in the storage media 330.

[0063] The electronic device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input/output interfaces 360, and/or one or more operating systems 331, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the electronic device structure illustrated in Figure 3 does not constitute a limitation on the electronic device, which may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.

[0064] The present invention also provides a computer-readable storage medium, which can be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, wherein the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the steps of any of the above-described microbial-induced calcium carbonate precipitation rate prediction methods.

[0065] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the system, device, or unit described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0066] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0067] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting the precipitation rate of microbially induced calcium carbonate, characterized in that, The method includes: Multiple types of initial omics data of the target microbial sample are obtained during the process of inducing calcium carbonate precipitation; Each type of initial omics data is preprocessed to obtain a high-dimensional feature vector corresponding to each preprocessed target omics data, wherein each dimension of the high-dimensional feature vector represents the expression level of the corresponding target omics data in the target microbial sample; For each type of target omics data, the same type of dimensionality reduction algorithm is used to process the high-dimensional feature vector to obtain the low-dimensional target feature vector corresponding to each high-dimensional feature vector; The protein interaction or metabolic pathway knowledge of the target microbial sample is obtained, and a biological interaction network topology is constructed based on the protein interaction or metabolic pathway knowledge, wherein the nodes of the biological interaction network topology are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in the metabolic pathway; Based on a pre-constructed neural coding network, multiple low-dimensional target feature vectors are mapped to a unified low-dimensional embedding space to obtain a low-dimensional multimodal joint feature vector, wherein the multimodal joint feature vector contains feature information of each target omics data of the target microorganism; 
The low-dimensional multimodal joint feature vector is weighted and integrated based on an attention mechanism to obtain a fused feature vector; The fused feature vector is input into a pre-constructed regression prediction framework to obtain the precipitation prediction rate of the target microbial sample during the induced calcium carbonate precipitation process, as output by the regression prediction framework.

2. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 1, characterized in that, The step of preprocessing each type of initial omics data to obtain the high-dimensional feature vector corresponding to each preprocessed target omics data includes: Each type of initial omics data is preprocessed to obtain the target omics data corresponding to each type of initial omics data; By combining the feature scaling formula, feature scaling is performed on each target omics data to obtain a high-dimensional feature vector corresponding to each target omics data; The feature scaling formula is as follows: $x' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}$; In the formula, $x'$ is the high-dimensional feature vector corresponding to the target omics data $x$; $x$ is a given target omics data of the target microbial sample; $\mathrm{median}(x)$ is the median of the target omics data across all samples; $\mathrm{IQR}(x)$ is the interquartile range of the target omics data.

3. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 1, characterized in that, The high-dimensional feature vectors corresponding to each type of target omics data are processed using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vectors corresponding to each high-dimensional feature vector, including: The high-dimensional feature vectors are input into the neural network encoder, and through nonlinear activation functions and layer normalization, the high-dimensional feature vectors are compressed into the low-dimensional embedding space of the target dimension to obtain the low-dimensional target feature vector.

4. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 1, characterized in that, The construction of the biological interaction network topology based on the knowledge of protein interactions or metabolic pathways includes: Retrieve interaction data of target microbial samples from protein interaction databases or metabolic pathway databases to construct an undirected graph G=(V,E), where node V represents functional proteins or key enzymes involved in the induction of calcium carbonate precipitation in the target microbial sample, and edge E represents the interaction between functional proteins or the upstream and downstream relationship in the metabolic pathway. Determine the initial feature vectors of the nodes in the undirected graph; The undirected graph is encoded using a graph convolutional network to generate a network embedding vector containing node and topological context information. The topology of the biological interaction network is determined based on the undirected graph, the initial feature vectors of the nodes, and the network embedding vectors.

5. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 4, characterized in that, The determination of the initial feature vectors of the nodes in the undirected graph includes: For any node in the undirected graph: If the expression level of the functional protein corresponding to the node exists in the initial omics data, the feature value corresponding to the functional protein in the high-dimensional feature vector obtained after preprocessing will be determined as the initial feature vector of the node. If the expression level of the functional protein corresponding to the node is not recorded in the initial omics data, the initial feature vector of the node is obtained by initializing and filling it with the average expression level of the neighboring nodes or by using a pre-trained biological network embedding vector.

6. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 4, characterized in that, The application of the graph convolutional network to encode the undirected graph to generate a network embedding vector containing node and topological context information includes: By applying a graph convolutional network and combining it with a feature transfer formula, features are transferred between nodes through the adjacency matrix A and the degree matrix D to generate network embedding vectors containing node and topological context information; The feature transfer formula is as follows: $H^{(l+1)} = \sigma\left(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}\right)$; In the formula, $H^{(l+1)}$ is the node feature matrix of the (l+1)-th layer of the graph convolutional network; $\sigma$ is a non-linear activation function; $D$ is the degree matrix of the undirected graph; $D^{-1/2}$ is the negative 1/2 power of the degree matrix; $A$ is the adjacency matrix of the undirected graph; $H^{(l)}$ is the node feature matrix of the l-th layer of the graph convolutional network; $W^{(l)}$ is the learnable weight parameter matrix of the l-th layer of the graph convolutional network.

7. The method for predicting the rate of microbial-induced calcium carbonate precipitation according to claim 1, characterized in that, The weighted integration of the low-dimensional multimodal joint feature vector based on the attention mechanism to obtain the fused feature vector includes: The contribution score is calculated by linearly transforming each single-modal feature vector in the low-dimensional multimodal joint feature vector using a learnable parameter matrix; The contribution score is then normalized using the Softmax function to generate the single-modal weight coefficient corresponding to each single-modal feature vector; Each single-modal feature vector is multiplied by its corresponding single-modal weight coefficient and summed to generate a fused feature vector; The normalization formula for the Softmax function is as follows: $\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{M} \exp(s_j)}$; In the formula, $\alpha_i$ is the single-modal weight coefficient corresponding to the i-th single-modal feature vector; $s_i$ is the contribution score calculated after linear transformation of the i-th single-modal feature vector; M is the total number of single-modal feature vectors participating in the fusion; j is the summation index variable, taking values from 1 to M; $s_j$ is the contribution score of the j-th single-modal feature vector; exp is the exponential function with the natural constant e as its base.

8. A device for predicting the rate of microbial-induced calcium carbonate precipitation, characterized in that, The device includes: The initial omics data acquisition module is used to acquire various initial omics data of the target microbial sample during the process of inducing calcium carbonate precipitation; A high-dimensional feature vector determination module is used to preprocess each type of initial omics data to obtain high-dimensional feature vectors corresponding to each preprocessed target omics data, wherein each dimension of the high-dimensional feature vector represents the expression level of the corresponding target omics data in the target microbial sample; The low-dimensional target feature vector determination module is used to process the high-dimensional feature vectors corresponding to each type of target omics data using the same type of dimensionality reduction algorithm to obtain the low-dimensional target feature vectors corresponding to each high-dimensional feature vector. A biological interaction network topology construction module is used to acquire knowledge of protein interactions or metabolic pathways of the target microbial sample and construct a biological interaction network topology based on the knowledge of protein interactions or metabolic pathways. The nodes of the biological interaction network topology are functional proteins or key enzymes involved in the process of calcium carbonate precipitation induced by the target microorganism, and the edges are the physical binding between functional proteins or the upstream and downstream regulatory relationships in metabolic pathways. 
The low-dimensional multimodal joint feature vector determination module is used to map multiple low-dimensional target feature vectors to a unified low-dimensional embedding space based on a pre-constructed neural coding network to obtain a low-dimensional multimodal joint feature vector, wherein the multimodal joint feature vector contains feature information of each target omics data of the target microorganism; The fusion feature vector determination module is used to perform weighted integration of the low-dimensional multimodal joint feature vector based on an attention mechanism to obtain a fusion feature vector; The precipitation rate prediction module is used to input the fused feature vector into a pre-constructed regression prediction framework to obtain the precipitation prediction rate of the target microbial sample during the induced calcium carbonate precipitation process, as output by the regression prediction framework.

9. An electronic device, characterized in that, The electronic device includes a memory and at least one processor, the memory storing instructions; the at least one processor invokes the instructions in the memory to cause the electronic device to perform the steps of the microbial-induced calcium carbonate precipitation rate prediction method as described in any one of claims 1-7.

10. A computer-readable storage medium storing instructions thereon, characterized in that, When the instructions are executed by the processor, they implement the various steps of the microbial-induced calcium carbonate precipitation rate prediction method as described in any one of claims 1-7.