Method and apparatus for processing multi-omic components of breast milk

CN122245435A · Pending · Publication Date: 2026-06-19 · BEIJING SANYUAN FOOD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING SANYUAN FOOD
Filing Date
2026-03-05
Publication Date
2026-06-19


Abstract

This invention provides a method and apparatus for processing multi-omics components of breast milk, relating to the field of bioinformatics. The processing method includes: performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and ancillary information of the breast milk sample, wherein the detection data includes the component detection results and sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms; and inputting candidate biomarkers into the multiple biomarker recognition models to determine key biomarker combinations. According to the technical solution of this invention, a technical system integrating breast milk component detection, microbiome data acquisition, data standardization and integration, and key biomarker recognition is constructed.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics, specifically to a method and apparatus for processing multi-omics components of breast milk.

Background Technology

[0002] Breast milk is considered the "gold standard" for infant growth and development, providing essential nutrients such as carbohydrates, proteins, vitamins, minerals, and fatty acids. Both the World Health Organization (WHO) and the Chinese Dietary Guidelines recommend exclusive breastfeeding for the first six months of life, followed by the gradual introduction of complementary foods. However, achieving full-term exclusive breastfeeding remains challenging due to various factors such as maternal constitution, anxiety, and illness. Currently, most infant formula is based on cow's milk, formulated with added nutrients. However, there are significant differences between cow's milk and breast milk in key nutritional components. For example, the whey protein to casein ratio in breast milk is 3:2, making it easier to digest and absorb; while in cow's milk it is 1:4, potentially affecting protein utilization efficiency. Establishing a more comprehensive database of breast milk composition and identifying key biomarkers have become current research hotspots.

[0003] Existing breast milk research still faces several technical bottlenecks in terms of component detection, microbial acquisition, and data utilization: First, the nutritional composition of breast milk is complex and diverse, and the standards for detection methods are not uniform, making it difficult to directly compare data from different studies or different detection platforms; second, traditional methods often focus only on information about single nutrient components or microbial modalities, lacking a systematic analysis of the relationships between macronutrients, micronutrients, bioactive components, and the breast milk microbiome; third, the lack of a complete, structured, and scalable breast milk component database results in low efficiency in identifying key functional components or health-related biomarkers, making it difficult to support the needs of maternal and infant health management or formula milk optimization research and development.

[0004] In summary, the research chain of the existing technology has gaps.

Summary of the Invention

[0005] This invention provides a method and apparatus for processing multi-omics components of breast milk, which addresses the shortcomings of existing technologies where the entire research chain is fragmented, and realizes the construction of a technical system that integrates breast milk component detection, microbiome data acquisition, data standardization and integration, and key biomarker identification.

[0006] This invention provides a method for processing multi-omics components of breast milk, comprising:

[0007] performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and the sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms; and inputting candidate biomarkers into the multiple biomarker recognition models to determine a key biomarker combination.

[0008] According to the method for processing breast milk multi-omics components provided by the present invention, performing component detection on the breast milk sample to obtain component detection results includes: standardizing the breast milk sample to obtain a standardized processing result; emitting an ultrasonic wave of a preset frequency into the standardized processing result, and measuring the speed of sound and the attenuation coefficient of the ultrasonic wave propagating in the breast milk sample; and, based on a pre-built breast milk component database and an inversion algorithm model, obtaining the content of the main components of the breast milk sample from the speed of sound and the attenuation coefficient as the component detection results.

[0009] According to the method for processing breast milk multi-omics components provided by the present invention, performing gene sequencing on the breast milk sample to obtain sequencing results includes: extracting microbial genomic DNA from the breast milk sample; inputting the microbial genomic DNA into a preset high-throughput sequencing platform to obtain sequence data; and analyzing the sequence data to obtain the sequencing results.

[0010] According to the method for processing breast milk multi-omics components provided by the present invention, analyzing the sequence data to obtain the sequencing results includes: preprocessing the sequence data to obtain ASV representative sequences; comparing the ASV representative sequences with a preset reference database, and performing species annotation based on the comparison results to construct a microbial abundance matrix; and standardizing the microbial abundance matrix to obtain a standardized community characteristic table as the sequencing results.

[0011] According to the method for processing breast milk multi-omics components provided by the present invention, constructing the breast milk multi-omics component database based on the original dataset includes: validating the original dataset according to preset original data validation rules and extracting valid data; and constructing the breast milk multi-omics component database based on the valid data.

[0012] According to the present invention, a method for processing breast milk multi-omics components is provided, wherein the breast milk multi-omics component database includes: The data management module is used to manage the valid data, numbering system, and status tracking of the breast milk samples; The analysis function modules include sub-modules for macronutrient analysis, proteomics analysis, mineral analysis, vitamin analysis, fatty acid analysis, MFGM protein analysis, and microbial community analysis, which are used to perform statistical processing, visualization, and / or indicator calculation on the corresponding data.

[0013] According to the present invention, a method for processing breast milk multi-omics components is provided, wherein the breast milk multi-omics component database supports query function, filtering function, chart display function and / or data export function.

[0014] According to the method for processing breast milk multi-omics components provided by the present invention, training multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms includes: extracting, from the breast milk multi-omics component database, the macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics and/or β diversity characteristics of the breast milk samples; constructing a standardized feature matrix based on the extracted characteristics; constructing a sample dataset from the standardized feature matrix; and training the multiple biomarker recognition models on the sample dataset using the various machine learning algorithms.

[0015] According to the method for processing breast milk multi-omics components provided by the present invention, inputting candidate biomarkers into the multiple biomarker recognition models to determine the key biomarker combination includes: inputting the candidate biomarkers into the multiple biomarker recognition models to obtain multiple scores; calculating a comprehensive importance score based on the multiple scores; verifying the candidate biomarkers to obtain verification results; and determining the key biomarker combination based on the comprehensive importance score and the verification results.

[0016] The present invention also provides an apparatus for processing breast milk multi-omics components, comprising: a component detection unit, used to perform component detection on a breast milk sample to obtain component detection results; a gene sequencing unit, used to perform gene sequencing on the breast milk sample to obtain sequencing results; a dataset unit, used to construct an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and the sequencing results; a database unit, used to construct a breast milk multi-omics component database based on the original dataset; a model training unit, used to train multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms; and a model application unit, used to input candidate biomarkers into the multiple biomarker recognition models to determine a key biomarker combination.

[0017] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the processing method for breast milk multi-omics components as described above.

[0018] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method for processing breast milk multi-omics components as described above.

[0019] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the processing method for breast milk multi-omics components as described above.

[0020] This invention provides a method and apparatus for processing breast milk multi-omics components. By performing component detection and gene sequencing on breast milk samples, component detection results and sequencing results are obtained. These results are used as detection data, combined with basic and auxiliary information of the breast milk samples, to construct a raw dataset, thereby building a structured breast milk multi-omics component database. This standardized storage effectively solves the problem of integrating multi-source heterogeneous data, ensuring data comparability and consistency. Based on the breast milk multi-omics component database, multiple biomarker recognition models are trained using various machine learning algorithms. These models are then used to identify combinations of key biomarkers, improving the accuracy and efficiency of biomarker screening. This achieves the construction of a comprehensive technical system integrating breast milk component detection, microbiome data acquisition, data standardization and integration, and key biomarker identification.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0022] Figure 1 is a first schematic flowchart of the method for processing breast milk multi-omics components provided by the present invention; Figure 2 is a second schematic flowchart of the method for processing breast milk multi-omics components provided by the present invention; Figure 3 is a schematic diagram of the database architecture used in the method for processing breast milk multi-omics components provided by the present invention; Figure 4 is a schematic structural diagram of the apparatus for processing breast milk multi-omics components provided by the present invention; Figure 5 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0024] The method for processing breast milk multi-omics components of the present invention is described below with reference to Figures 1-3. Figure 1 is a first schematic flowchart of the method provided by the present invention. As shown in Figure 1, the method includes steps 110-160.

[0025] Step 110: Perform component detection on the breast milk sample to obtain component detection results.

[0026] This invention does not restrict the source of the breast milk samples; for example, they may be breast milk samples collected through standardized hospital procedures.

[0027] There are various methods for detecting breast milk components, and the content of major components such as breast milk fat, protein, true protein, lactose and water is obtained as the component detection results.

[0028] According to the example embodiment, ultrasonic detection is used to detect the components of breast milk samples.

[0029] Understandably, ultrasonic testing is a rapid and non-destructive measurement technique. It utilizes the acoustic characteristics of ultrasound waves propagating in breast milk (such as sound velocity, sound attenuation, acoustic impedance, and adiabatic compressibility) to establish predictive models for the main components, thereby enabling simple and rapid determination of the content of proteins, fats, lactose, minerals, and water.

[0030] Step 120: Perform gene sequencing on the breast milk sample to obtain sequencing results.

[0031] Understandably, breast milk contains a variety of components that can affect the infant gut microbiota, including bacterial components and bioactive components such as human milk oligosaccharides (HMO), butyrate, disaccharides, milk fat globule membrane (MFGM), and free amino acids (FAA). These components work together to promote the healthy development of the infant gut microbiota.

[0032] In microbiome research, obtaining data on the abundance of breast milk microorganisms relies on a technical system that combines molecular biology and bioinformatics. The process begins with the aseptic collection of samples and the extraction of microbial genomic DNA. Subsequently, high-throughput sequencing technology is used to perform amplicon sequencing on specific variable regions of the bacterial 16S rRNA gene, or metagenomic sequencing on all microbial DNA in the sample, yielding massive amounts of sequence data as the sequencing results.

[0033] Furthermore, the sequencing results also include a species abundance table. Specifically, the sequence data are processed through a bioinformatics workflow that includes quality control, denoising, and either clustering into operational taxonomic units (OTUs) or resolution into amplicon sequence variants (ASVs); the resulting sequences are then compared against and annotated with reference databases to generate a detailed species abundance table.

[0034] This table reveals the relative proportions of various microorganisms in the sample, laying a data foundation for subsequent construction of a microbiome database and exploration of the relationship between microbial community structure and maternal and infant health.
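
As an illustration of how such a table expresses relative proportions, the following minimal sketch converts raw per-taxon counts for one sample into relative abundances. It also computes the Shannon index, which is assumed here as one common α diversity measure; the source does not name a specific index. The genus names and counts are hypothetical placeholders, not measured data.

```python
import math

def relative_abundance(counts):
    """Convert raw taxon counts for one sample into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}

def shannon_index(proportions):
    """Shannon diversity H = -sum(p * ln p), skipping zero proportions."""
    return -sum(p * math.log(p) for p in proportions.values() if p > 0)

# Hypothetical counts for one sample (illustrative only).
sample_counts = {"Streptococcus": 50, "Staphylococcus": 30, "Bifidobacterium": 20}
rel = relative_abundance(sample_counts)
h = shannon_index(rel)
```

A perfectly even community of n taxa gives H = ln(n), so larger values indicate a richer and more even community.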

[0035] Step 130: Based on the basic information, detection data, and auxiliary information of the breast milk sample, construct an original dataset, wherein the detection data includes the component detection results and sequencing results.

[0036] The component detection results and sequencing results obtained in steps 110-120 are recorded as detection data.

[0037] Furthermore, the component detection results and sequencing results can be preprocessed.

[0038] During the data preprocessing stage, quality control and standardization are applied to the multi-source heterogeneous data from component detection and microbial sequencing. For example, for component data, measurement units are unified, missing values are handled using multiple imputation, and outliers are identified and corrected using statistical methods (such as the IQR rule). For microbial sequence data, the workflow includes quality-control filtering, denoising, generation of ASV (amplicon sequence variant) tables, and species annotation based on the SILVA reference database.
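
The unit unification and IQR rule for component data can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the two supported units, the multiplier k = 1.5, and the sample values are all hypothetical, and the sketch only flags outliers rather than correcting them.

```python
import statistics

def to_g_per_100ml(value, unit):
    """Convert a concentration to g/100 mL; only two units are handled here."""
    if unit == "g/100mL":
        return value
    if unit == "g/L":
        return value / 10.0
    raise ValueError(f"unsupported unit: {unit}")

def iqr_outlier_indices(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical protein readings in g/100 mL; the last one is deliberately extreme.
protein = [1.1, 1.2, 1.3, 1.2, 1.25, 5.0]
flagged = iqr_outlier_indices(protein)
```

Flagged values would then be reviewed or corrected according to whatever policy the laboratory adopts; that policy is outside this sketch.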

[0039] The original dataset is constructed by combining the detection data with the basic information and auxiliary information of the breast milk samples.

[0040] The basic information of the breast milk sample includes the sample ID, donor basic information (such as age, gestational age, diet, health status, etc.), and collection information (such as collection time, collection site, storage conditions, shelf life, etc.); the detection data includes the content of major components (i.e., the component detection results, such as macronutrients, vitamins, minerals, lipids, proteins, etc.) and raw microbial sequencing data (metagenomic sequencing data); the auxiliary information includes the batch of test reagents, sequencing platform parameters, etc.

[0041] Understandably, the constructed dataset is a multi-dimensional, standardized database of breast milk composition and microbiome data. All samples undergo rigorous quality control to ensure data accuracy and comparability. The dataset covers the following core contents: (1) breast milk composition data, including quantitative detection results of macronutrients (protein, fat, lactose, etc.), micronutrients (vitamins, minerals), and functional components (MFGM protein, oligosaccharides, etc.); (2) microbiome data, including microbial species abundance tables, diversity indices, and community structure characteristics obtained from 16S rRNA sequencing; (3) sample metadata, including detailed sample collection information, donor basic information, storage conditions, and other supporting data.

[0042] Step 140: Construct a breast milk multi-omics component database based on the original dataset.

[0043] Understandably, the construction of a multi-omics database of breast milk aims to provide a systematic data foundation for subsequent nutritional assessments, formula milk optimization, and maternal and infant health research. First, it is necessary to standardize and integrate the test results of breast milk samples, including macronutrients (protein, fat, lactose, etc.), micronutrients (vitamins, minerals), bioactive molecules (such as HMOs, MFGM components, fatty acid profiles, amino acid profiles), and microbial abundance information.

[0044] According to the example implementation, at the data management level, a hierarchical database architecture can be established: the basic layer records the original test data and sampling metadata (such as sampling time, maternal nutritional status, delivery method, etc.); the feature layer stores standardized nutritional indicators, metabolite content, and microbial abundance tables; and the association layer integrates clinical information, infant growth indicators, and potential breast milk component-microbe-health relationships.

[0045] Furthermore, the database can further support multimodal data mining based on statistical modeling, machine learning, and network analysis, providing a scientific basis for breast milk quality evaluation, personalized nutritional intervention, and the development of new infant formula foods by identifying key nutrients and microbial biomarkers.

[0046] Step 150: Using the breast milk multi-omics component database, multiple biomarker recognition models are trained based on various machine learning algorithms.

[0047] Choose an appropriate machine learning algorithm framework based on the research objectives (e.g., health status assessment).

[0048] A training dataset is constructed from the breast milk multi-omics component database. Each selected machine learning algorithm is trained separately to obtain multiple biomarker recognition models, with each model corresponding one-to-one to a machine learning algorithm.

[0049] According to the example implementation, a random forest classifier is constructed to handle high-dimensional, small-sample data, and support vector machines and neural network models are introduced to capture complex nonlinear relationships. The models are combined within an ensemble learning framework.

[0050] Step 160: Input the candidate markers into the multiple marker recognition models to determine the key marker combination.

[0051] Based on multiple trained marker recognition models, a multi-index fusion strategy is used to screen key markers.

[0052] Specifically, candidate markers are input into multiple marker recognition models, and the interpretability outputs of multiple models are combined to determine the key marker combination.

[0053] Furthermore, based on the model output, a screening strategy combining feature importance assessment, statistical verification, and biological function annotation can be adopted to overcome the limitations of a single method and improve the robustness of the discovery, quantifying the contribution of each feature to the prediction results from the perspective of model interpretability.
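
One way to combine per-model importance outputs with a verification result is sketched below. The source does not specify how the comprehensive importance score is calculated; averaging max-normalized scores is an assumption made here for illustration, and the model names, marker names, scores, and the top_k cutoff are all hypothetical.

```python
def composite_importance(model_scores):
    """Average max-normalized importance per candidate marker across models.

    A marker missing from a model's output contributes 0 for that model.
    """
    totals = {}
    for scores in model_scores.values():
        top = max(scores.values())
        for marker, s in scores.items():
            norm = s / top if top > 0 else 0.0
            totals[marker] = totals.get(marker, 0.0) + norm
    n_models = len(model_scores)
    return {m: t / n_models for m, t in totals.items()}

def select_key_markers(composite, validated, top_k=2):
    """Keep markers that passed verification, ranked by composite score."""
    passed = [m for m in composite if validated.get(m, False)]
    return sorted(passed, key=lambda m: composite[m], reverse=True)[:top_k]

# Hypothetical importance scores from three trained models.
model_scores = {
    "random_forest": {"marker_A": 0.4, "marker_B": 0.3, "marker_C": 0.2, "marker_D": 0.1},
    "svm": {"marker_A": 0.5, "marker_B": 1.0, "marker_C": 0.25, "marker_D": 0.0},
    "neural_net": {"marker_A": 0.3, "marker_B": 0.1, "marker_C": 0.4, "marker_D": 0.1},
}
# Hypothetical outcome of statistical verification for each candidate.
validated = {"marker_A": True, "marker_B": False, "marker_C": True, "marker_D": True}

composite = composite_importance(model_scores)
key_markers = select_key_markers(composite, validated)
```

Normalizing each model's scores before averaging keeps a model with a larger raw score scale from dominating the combined ranking.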

[0054] This invention performs component analysis and gene sequencing on breast milk samples to obtain component analysis and sequencing results. These results are then used as analytical data, combined with basic and supplementary information from the breast milk samples, to construct a raw dataset. This results in a structured breast milk multi-omics component database, enabling standardized storage and effectively solving the integration challenge of multi-source heterogeneous data, ensuring data comparability and consistency. Based on this database, multiple biomarker recognition models are trained using various machine learning algorithms. These models are then used to identify combinations of key biomarkers, improving the accuracy and efficiency of biomarker screening. This invention establishes a comprehensive technical system integrating breast milk component analysis, microbiome data acquisition, data standardization and integration, and key biomarker identification.

[0055] The following provides further explanation of the component detection of breast milk samples. In some embodiments, the breast milk sample is subjected to component detection to obtain component detection results, including steps 111-113.

[0056] Step 111: Standardize the breast milk sample to obtain the standardization result.

[0057] Standardization processes include constant temperature treatment (e.g., 40°C) and mechanical homogenization.

[0058] According to the example embodiment, the sample is subjected to constant temperature treatment (e.g., 40°C) and mechanical homogenization using a breast milk composition analyzer.

[0059] Step 112: Emit an ultrasonic wave of a preset frequency to the standardized processing result, and measure the speed of sound and attenuation coefficient of the ultrasonic wave propagating in the breast milk sample.

[0060] Ultrasonic waves are emitted at a preset frequency, and the speed of sound (approximately 1400-1550 m/s) and the attenuation coefficient are accurately measured as the waves propagate in the breast milk.

[0061] According to the example embodiment, the preset frequency is 5 MHz.

[0062] Step 113: Based on the pre-built breast milk component database and inversion algorithm model, the main component content of the breast milk sample is obtained according to the propagation speed of sound and the attenuation coefficient, which is used as the component detection result.

[0063] Based on the measured propagation speed and attenuation coefficient, the content of major components such as breast milk fat, protein, true protein, lactose and water is obtained as the component detection results by combining the breast milk component database and the inversion algorithm model.
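
The source does not disclose the inversion algorithm model. As a stand-in only, the sketch below looks up the nearest entries of a small reference table and interpolates their compositions with inverse-distance weights. The reference entries, the attenuation units, the choice of k, and the function name are all hypothetical placeholders, not calibration data.

```python
import math

# Hypothetical reference entries:
# (speed of sound in m/s, attenuation in arbitrary units) -> composition in g/100 mL.
REFERENCE = [
    ((1500.0, 1.0), {"fat": 3.0, "protein": 1.0, "lactose": 6.8}),
    ((1520.0, 2.0), {"fat": 4.0, "protein": 1.2, "lactose": 7.0}),
    ((1540.0, 3.0), {"fat": 5.0, "protein": 1.4, "lactose": 7.2}),
]

def estimate_composition(speed, attenuation, reference=REFERENCE, k=2):
    """Inverse-distance-weighted average of the k nearest reference entries."""
    speeds = [key[0] for key, _ in reference]
    attens = [key[1] for key, _ in reference]
    # Scale each axis by its range so neither measurement dominates the distance.
    s_range = (max(speeds) - min(speeds)) or 1.0
    a_range = (max(attens) - min(attens)) or 1.0
    scored = sorted(
        ((math.hypot((speed - s) / s_range, (attenuation - a) / a_range), comp)
         for (s, a), comp in reference),
        key=lambda pair: pair[0],
    )
    nearest = scored[:k]
    if nearest[0][0] == 0.0:
        return dict(nearest[0][1])  # exact match with a reference entry
    weights = [1.0 / d for d, _ in nearest]
    total = sum(weights)
    return {
        name: sum(w * comp[name] for w, (_, comp) in zip(weights, nearest)) / total
        for name in nearest[0][1]
    }

estimate = estimate_composition(1510.0, 1.5)
```

A regression model fitted to calibration samples would be another reasonable choice; the lookup form is used here only because it keeps the dependence on a pre-built database explicit.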

[0064] This detection approach requires no chemical reagents and is fast; the measurement results are used as basic nutritional data in subsequent processing.

[0065] The following provides a further explanation of gene sequencing of breast milk samples. In some embodiments, gene sequencing is performed on the breast milk sample to obtain sequencing results, including steps 121-123.

[0066] Step 121: Extract microbial genomic DNA from the breast milk sample.

[0067] Microbial genomic DNA is extracted from breast milk samples. The extracted DNA must meet the required purity and concentration to ensure the accuracy of subsequent amplification and sequencing.

[0068] In the specific implementation process, a standard procedure of lysis, centrifugation and purification is adopted to break the bacterial cells and recover high-quality DNA.

[0069] According to the example embodiment, microbial genomic DNA is extracted using a DNA extraction kit.

[0070] It should be noted that the DNA extraction kit may include centrifuge tubes, lysis buffer, magnetic bead purification module, etc.

[0071] Step 122: The microbial genomic DNA is fed into a pre-set high-throughput sequencing platform to obtain sequence data.

[0072] The extracted microbial genomic DNA is input into a high-throughput sequencing platform to sequence specific regions of the 16S rRNA gene or the entire genome. The sequencer reads the DNA fragment sequences, generating massive amounts of raw sequencing data, which are denoted as sequence data. This step can comprehensively cover all detectable microbial species in the sample, providing basic data for abundance analysis.

[0073] Step 123: Analyze the sequence data to obtain the sequencing results.

[0074] The sequence data is analyzed to transform the raw, disordered sequence information into reliable, interpretable biological knowledge, forming a standardized community characteristic table that can be used for diversity analysis and statistical modeling as the sequencing results. All processed microbial data will be standardized and stored for use in the database construction phase.

[0075] The following provides further explanation of step 123. As shown in Figure 2, in some embodiments, the sequence data is analyzed to obtain the sequencing results, including steps 210-230.

[0076] Step 210: Preprocess the sequence data to obtain the ASV representative sequence.

[0077] Preprocessing includes quality assessment, removal of low-quality data, denoising, and error correction.

[0078] According to the example embodiment, quality assessment is performed using FastQC (a sequencing read quality-control tool), and low-quality bases and adapter sequences are removed using Trimmomatic (a read-trimming tool). Subsequently, the DADA2 denoising algorithm is used for denoising and error correction to generate high-resolution amplicon sequence variants (ASVs), denoted as the ASV representative sequences.

[0079] Step 220: Compare the ASV representative sequence with a preset reference database, and annotate the species based on the comparison results to construct a microbial abundance matrix.

[0080] The representative ASV sequences are compared with a preset reference database, and species annotation is completed by using a confidence threshold (e.g., ≥80%) to construct a microbial abundance matrix.

[0081] According to the example embodiment, the preset reference database is the SILVA (v138) reference database.
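
The thresholded annotation and abundance matrix construction in step 220 can be sketched as follows. This is a simplified illustration: the classifier outputs, ASV identifiers, sample IDs, and counts are hypothetical, and a single confidence value per ASV is assumed, whereas real classifiers typically report confidence per taxonomic rank.

```python
from collections import defaultdict

def annotate(assignments, threshold=0.80):
    """Keep a taxon label only when the classifier confidence meets the threshold."""
    return {
        asv: (taxon if conf >= threshold else "Unclassified")
        for asv, (taxon, conf) in assignments.items()
    }

def abundance_matrix(asv_counts, labels):
    """Sum ASV counts per taxon for each sample: {sample: {taxon: count}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for sample, counts in asv_counts.items():
        for asv, n in counts.items():
            matrix[sample][labels.get(asv, "Unclassified")] += n
    return {sample: dict(taxa) for sample, taxa in matrix.items()}

# Hypothetical classifier output: ASV -> (taxon, confidence).
assignments = {
    "ASV1": ("Streptococcus", 0.97),
    "ASV2": ("Streptococcus", 0.85),
    "ASV3": ("Staphylococcus", 0.62),  # below the 80% threshold
}
# Hypothetical read counts per sample and ASV.
asv_counts = {
    "S001": {"ASV1": 120, "ASV2": 30, "ASV3": 15},
    "S002": {"ASV1": 40, "ASV3": 60},
}

labels = annotate(assignments)
matrix = abundance_matrix(asv_counts, labels)
```

ASVs that fall below the threshold are pooled under an "Unclassified" label rather than dropped, so per-sample totals are preserved.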

[0082] Step 230: Standardize the microbial abundance matrix to obtain a standardized community characteristic table as the sequencing result.

[0083] The obtained microbial abundance matrix is standardized to form a standardized community characteristic table that can be used for diversity analysis and statistical modeling, which serves as the sequencing results. All processed microbial data are standardized and stored for use in the database construction phase.

[0084] According to the example embodiment, the microbial abundance matrix is standardized using CSS normalization.
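
A simplified sketch of cumulative sum scaling (CSS) for a single sample is given below. In CSS, each count is divided by the sum of that sample's counts up to a chosen quantile of its non-zero counts. Here the quantile is fixed at 0.5 and the scaling constant at 1000, both of which are assumptions for illustration; established implementations choose the quantile from the data. The counts are hypothetical.

```python
import math

def css_normalize(counts, quantile=0.5, scale=1000.0):
    """Divide counts by the sum of non-zero counts at or below the given quantile."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return [0.0 for _ in counts]
    # Value at the requested quantile of the non-zero counts.
    idx = max(0, math.ceil(quantile * len(nonzero)) - 1)
    threshold = nonzero[idx]
    # Cumulative sum of counts that do not exceed the quantile value.
    factor = sum(c for c in nonzero if c <= threshold)
    return [c / factor * scale for c in counts]

# Hypothetical taxon counts for one sample, with one dominant taxon.
sample = [0, 2, 4, 10, 100]
normalized = css_normalize(sample)
```

Because the scaling factor ignores counts above the quantile, a single highly abundant taxon does not compress the normalized values of the remaining taxa, which is the motivation for CSS over total-sum scaling.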

[0085] The following provides further explanation of the construction of a breast milk multi-omics composition database. In some embodiments, a breast milk multi-omics composition database is constructed based on the original dataset, including steps 141-142.

[0086] Step 141: Verify the original dataset according to the pre-set original data verification rules and extract valid data.

[0087] First, establish raw data verification rules. Using these preset rules, verify the data item by item for completeness (e.g., no missing key fields), accuracy (e.g., values within a reasonable physiological range, such as a breast milk protein content of 1.0-1.5 g/100 mL), and consistency (e.g., the IDs corresponding to the same sample in different testing stages are consistent). Eliminate invalid data (e.g., test results of samples that have deteriorated due to improper storage); mark suspicious data as pending verification, contact testing personnel for confirmation, and then supplement or remove the data; retain the valid data.
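
The verification rules can be sketched as a small classifier over raw records. The field names, the `deteriorated` flag, and the mapping of each rule failure to a "pending" status are assumptions for illustration; only the protein range of 1.0-1.5 g/100 mL is taken from the text.

```python
REQUIRED_FIELDS = ("sample_id", "collection_time", "protein_g_per_100ml")
PROTEIN_RANGE = (1.0, 1.5)  # g/100 mL, the example range given in the text

def validate_record(record, sequencing_ids):
    """Return 'valid', 'pending', or 'invalid' for one raw record."""
    # Samples known to have deteriorated are eliminated outright.
    if record.get("deteriorated", False):
        return "invalid"
    # Completeness: every key field must be present and non-empty.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return "pending"
    # Consistency: the same sample ID must appear in the sequencing stage.
    if record["sample_id"] not in sequencing_ids:
        return "pending"
    # Accuracy: the value must fall within the physiological range.
    low, high = PROTEIN_RANGE
    if not (low <= record["protein_g_per_100ml"] <= high):
        return "pending"
    return "valid"

# Hypothetical records and sequencing-stage IDs.
sequencing_ids = {"S001", "S002", "S003"}
records = [
    {"sample_id": "S001", "collection_time": "2026-01-01T09:00", "protein_g_per_100ml": 1.2},
    {"sample_id": "S002", "collection_time": "2026-01-01T10:00", "protein_g_per_100ml": 2.4},
    {"sample_id": "S003", "collection_time": "", "protein_g_per_100ml": 1.1},
    {"sample_id": "S004", "collection_time": "2026-01-02T09:00", "protein_g_per_100ml": 1.3},
    {"sample_id": "S001", "collection_time": "2026-01-03T09:00", "protein_g_per_100ml": 1.2,
     "deteriorated": True},
]
statuses = [validate_record(r, sequencing_ids) for r in records]
valid_data = [r for r, s in zip(records, statuses) if s == "valid"]
```

Records marked "pending" would go to testing personnel for confirmation before being supplemented or removed.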

[0088] Step 142: Based on the valid data, construct the breast milk multi-omics component database.

[0089] Based on valid data, a relational database (such as MySQL) is used to store the data in a structured manner to construct a multi-omics database of breast milk components.
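
A minimal sketch of structured storage in a relational database follows. SQLite (through Python's built-in `sqlite3` module) is used here as a stand-in for MySQL so that the example is self-contained. The table names, column names, and inserted values are hypothetical and are not the schema of the invention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id       TEXT PRIMARY KEY,
    collection_time TEXT,
    status          TEXT
);
CREATE TABLE component_result (
    sample_id TEXT REFERENCES sample(sample_id),
    component TEXT,
    value     REAL,
    unit      TEXT,
    PRIMARY KEY (sample_id, component)
);
CREATE TABLE taxon_abundance (
    sample_id            TEXT REFERENCES sample(sample_id),
    taxon                TEXT,
    normalized_abundance REAL,
    PRIMARY KEY (sample_id, taxon)
);
""")

# Hypothetical valid data for one sample.
conn.execute("INSERT INTO sample VALUES (?, ?, ?)",
             ("S001", "2026-01-01T09:00", "valid"))
conn.executemany("INSERT INTO component_result VALUES (?, ?, ?, ?)", [
    ("S001", "protein", 1.2, "g/100mL"),
    ("S001", "fat", 3.5, "g/100mL"),
])
conn.execute("INSERT INTO taxon_abundance VALUES (?, ?, ?)",
             ("S001", "Streptococcus", 666.7))

# Join component results with microbial abundance for the same sample.
rows = conn.execute("""
    SELECT c.component, c.value, t.taxon
    FROM component_result AS c
    JOIN taxon_abundance AS t ON t.sample_id = c.sample_id
    WHERE c.sample_id = ?
    ORDER BY c.component
""", ("S001",)).fetchall()
```

Keying every table on the sample ID is what allows component data and microbial data from different platforms to be queried together.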

[0090] This invention establishes a standardized data integration system, overcoming the challenge of integrating multi-source heterogeneous data. Traditional breast milk research faces technical bottlenecks such as inconsistent data standards across different testing platforms, making direct comparison difficult. This invention, by constructing a structured breast milk database and employing a relational database for standardized storage, effectively solves the integration problem of multi-source heterogeneous data, ensuring data comparability and consistency. This provides a reliable foundation for large-scale data analysis and cross-study comparisons, significantly enhancing the reusability of data and research efficiency.

[0091] The following provides a further description of the breast milk multi-omics composition database. In some embodiments, the breast milk multi-omics composition database includes: a data management module, used to manage the valid data, numbering system, and status tracking of the breast milk samples.

[0092] According to the example implementation, a breast milk sample management module is set up to manage the validity of samples (including test data, basic information and auxiliary information), sampling conditions, numbering system and status tracking.

[0093] Furthermore, according to the example embodiment, the data management module includes a breast milk sample management module, a plant sample management module, a bacterial strain management module, and a buffalo milk nutrient composition module, as shown in Figure 3.

[0094] The core objects of the breast milk sample management module are breast milk samples themselves, focusing on their entire life cycle information (collection, storage, basic attributes), as well as related macronutrients, microorganisms, and other multi-dimensional data. The core objects of the plant sample management module are related plant samples, focusing on their species, nutritional value, and consumption status; its core function is "auxiliary variable recording and correlation," and it does not directly participate in the core breast milk analysis. The core objects of the strain management module are the microbial strains in breast milk, focusing on their identification and function; its core function is "precise management and annotation of research subjects," which is the foundation of microbial analysis. The core objects of the buffalo milk nutrient composition module are the nutrients in buffalo milk, serving as the carrier for buffalo milk nutritional research.

[0095] Furthermore, in Figure 3, the strain information, metabolite information, and lipid information are constructed based on existing standard databases (such as the NCBI (National Center for Biotechnology Information) microbial database, HMDB (Human Metabolome Database), and the LIPID MAPS lipid database). Raw test data of breast milk samples are obtained using professional testing instruments. Through data preprocessing, comparison and calibration, and annotation and analysis, and relying on the authoritative classification systems and annotation information of the standard databases, the raw instrument data are transformed into standardized and interpretable specialized data, thereby constructing a dedicated dataset adapted to breast milk research.

[0096] In addition, the database uses a structured storage method to classify and archive information such as macronutrient data, proteome data, vitamin and mineral content, MFGM composition, and microbial abundance tables, providing a data foundation for subsequent functional module calls.

[0097] Furthermore, it allows for unified management of sample metadata, test results, and microbial abundance tables. The database schema is meticulously designed to ensure data consistency, integrity, and efficient query performance, providing a reliable data foundation for upper-layer applications.

[0098] The analysis function modules include sub-modules for macronutrient analysis, proteomics analysis, mineral analysis, vitamin analysis, fatty acid analysis, MFGM protein analysis, and microbial community analysis, which are used to perform statistical processing, visualization, and / or indicator calculation on the corresponding data.

[0099] Multiple analytical modules have been constructed, including sub-modules for macronutrient analysis, proteomics analysis, mineral analysis, vitamin analysis, fatty acid analysis, MFGM protein analysis, and microbial community analysis. Each module can perform statistical processing, visualization, and indicator calculation on the corresponding data.

[0100] It should be noted that the analysis function module has the following functions: 1) Improve data organization efficiency: Manage data in layers according to “sample-strain-nutrient analysis”, separate basic information, core objects and analysis results, avoid data mixing, facilitate quick retrieval and reduce maintenance costs.

[0101] 2) Supporting correlation analysis: Module linkage (for example, linking breast milk samples to nutritional analysis and strain management) enables cross-dimensional analysis (such as donor characteristics → breast milk nutrition → microbial composition), uncovering inherent patterns and adapting to specific research.

[0102] 3) Adapted to breast milk research scenarios: Specialized modules (breast milk, strain management) focus on core objects and avoid redundancy; detailed analysis modules (such as MFGM protein, microbial analysis) are aligned with research hotspots and accurately match needs.

[0103] Furthermore, in some embodiments, the breast milk multi-omics composition database supports query functions, filtering functions, chart display functions, and / or data export functions.

[0104] The database system is developed using the Python Flask framework and integrates front-end visualization libraries such as ECharts. It provides query, filtering, chart display, and data export functions through a web interface, enabling users to intuitively view the relationship between breast milk components and microorganisms, thus providing support for subsequent biomarker screening and model training.

[0105] Users can query data, filter data in multiple dimensions, and generate interactive charts (such as component stacked bar charts, species abundance heatmaps, PCA scatter plots, etc.) through the web interface, and export the analysis results, which greatly facilitates researchers' exploration and interpretation of data.

[0106] This invention achieves deep fusion of multimodal data, overcoming the technical limitations of single-dimensional analysis. Existing research methods often analyze nutrient components or microbial composition in isolation, making it difficult to reveal their intrinsic relationships. This invention systematically integrates data from different modalities and constructs a comprehensive feature matrix containing quantitative features of components, microbial composition features, and diversity indices through feature extraction. Based on this, a visualization analysis platform further provides multi-dimensional data correlation analysis capabilities, supporting researchers in intuitively exploring the interactions between components and microorganisms.

[0107] The training of the multiple biomarker recognition models is further explained below. In some embodiments, multiple biomarker recognition models are trained using the breast milk multi-omics composition database and based on various machine learning algorithms, including steps 151-154. Step 151: Based on the breast milk multi-omics composition database, extract the macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics and / or β diversity characteristics of the breast milk sample.

[0108] In the feature extraction process, quantitative features are first extracted from macronutrients, vitamins and minerals, and MFGM protein components, including the absolute content and relative proportion of each component, to obtain macronutrient features, vitamin and mineral features, and MFGM protein features.

[0109] Simultaneously, species abundance features (phylum, class, order, family, genus, and species levels), α diversity indices (e.g., Shannon index, Simpson index, etc.), and β diversity distance matrices were extracted from the sequencing results. The α diversity indices were used as α diversity features, and the β diversity distance matrix was used as β diversity features.
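The diversity features named above can be computed from a per-sample count vector as in the following sketch. The Simpson index is taken here in its Gini-Simpson form (1 minus the sum of squared proportions), and Bray-Curtis dissimilarity is used as one example of a β diversity distance; both choices are assumptions, since the text does not specify which variants are used.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p^2) (assumed variant)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return 1.0 - sum((c / total) ** 2 for c in counts)

def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples (assumed beta metric)."""
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    return numerator / denominator if denominator else 0.0
```

Applying `bray_curtis` to every pair of samples yields the symmetric distance matrix that serves as the β diversity feature.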

[0110] Step 152: Construct a standardized feature matrix based on the macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics, and / or β diversity characteristics.

[0111] Based on macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics, and β diversity characteristics, a standardized feature matrix is constructed with samples as rows and features as columns.

[0112] This invention extracts numerical features from cleaned data in a breast milk multi-omics composition database. These features primarily include: quantitative features of components (absolute concentration and relative percentage), microbial composition features (relative abundance and diversity indices of species at each taxonomic level), and derived features constructed using domain knowledge (such as key component ratios). All features are Z-score standardized to form a structured feature matrix with samples as rows and features as columns, providing high-quality input for subsequent machine learning modeling.
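The Z-score standardization described above can be sketched as follows, with samples as rows and features as columns. The use of the population standard deviation and the handling of constant columns (mapped to 0.0) are assumptions; the function name `zscore_columns` is hypothetical.

```python
import statistics

def zscore_columns(matrix):
    """Standardize each feature column to mean 0 and unit variance."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation per feature (assumed convention).
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant feature carries no information, so it maps to 0.0.
        [(value - mean) / std if std > 0 else 0.0
         for value, mean, std in zip(row, means, stds)]
        for row in matrix
    ]
```

In practice the means and standard deviations would be estimated on the training portion only and then reused for the validation and test portions, so that no information leaks across the split.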

[0113] According to the example embodiment, the standardized feature matrix is shown in Table 1.

[0114] Table 1 Standardized Feature Matrix

[0115] Step 153: Construct a sample dataset using the standardized feature matrix.

[0116] According to the example implementation, the standardized feature matrix is divided into a training set, a validation set, and a test set in a ratio of 7:1.5:1.5, which together serve as the sample dataset.
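A minimal sketch of the 7:1.5:1.5 split is given below. The random shuffle with a fixed seed and the rule that any rounding remainder goes to the test set are assumptions; the function name `split_dataset` is hypothetical.

```python
import random

def split_dataset(rows, seed=0):
    """Split rows into training, validation, and test sets at 70/15/15."""
    shuffled = list(rows)
    # Fixed seed (assumed) so that the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains after rounding goes to the test set.
    test = shuffled[n_train + n_val:]
    return train, val, test
```

For grouped classification tasks a stratified split may be preferable, so that each subset preserves the class proportions.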

[0117] Step 154: Based on the sample dataset, train the multiple marker recognition models using various machine learning algorithms.

[0118] Various machine learning algorithms can be selected based on the specific situation, including random forests, support vector machines, and neural networks. For common classification problems, random forests (an ensemble learning method) are used. For complex nonlinear relationships, support vector machines and neural networks are introduced.

[0119] Based on various machine learning algorithms, a variety of machine learning models were constructed to address different analysis tasks.

[0120] In model training, random forests use Bayesian optimization to search for the optimal tree depth and feature subset size; support vector machines determine the optimal kernel function parameters through grid search; and neural networks introduce Dropout and L2 regularization, along with an early stopping mechanism to prevent overfitting, ensuring the generalization ability and stability of the final model.

[0121] Furthermore, in some embodiments, five-fold cross-validation is used to evaluate the stability of all models, and finally a weighted ensemble is performed based on the validation set F1-score.
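The weighted ensemble can be sketched as follows, assuming binary labels (0/1) and weights proportional to each model's validation F1-score. The proportional weighting rule is an assumption, since the text states only that the ensemble is weighted based on the F1-score; both function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1-score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_ensemble(probabilities, f1_scores):
    """Average per-model positive-class probabilities, weighted by F1."""
    total = sum(f1_scores)
    if total == 0:
        raise ValueError("all models have zero F1-score")
    # Weights proportional to validation F1 (assumed rule).
    weights = [score / total for score in f1_scores]
    n_samples = len(probabilities[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, probabilities))
        for i in range(n_samples)
    ]
```

Here `probabilities` holds one list per model (for example random forest, support vector machine, neural network), each containing the predicted positive-class probability for every sample.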

[0122] In one specific embodiment, the trained biomarker recognition models were evaluated on an independent test set. Blind testing on this set showed that the ensemble model maintained high accuracy and a high AUC-ROC, with metric fluctuations of less than 2% across three repeated experiments, supporting the model's generalization ability and robustness. SHAP value analysis further identified key microbial biomarkers and nutritional indicators influencing classification decisions.

[0123] This invention applies machine learning methods to the correlation analysis between breast milk components and microorganisms. It constructs a comprehensive feature matrix containing quantitative features of components and microbial composition features through feature engineering, and employs a hybrid modeling strategy combining ensemble learning and neural networks. By utilizing machine learning algorithms to capture complex nonlinear relationships in high-dimensional data, it achieves a technological leap from single-indicator analysis to multi-dimensional correlation mining, significantly improving the depth and breadth of biomarker discovery.

[0124] The application of multiple marker recognition models is further explained below. In some embodiments, candidate markers are input into the multiple marker recognition models to determine key marker combinations, including steps 161-164.

[0125] Step 161: Input the candidate markers into the multiple marker recognition models to obtain multiple scores.

[0126] The candidate markers are input into multiple marker recognition models to obtain interpretive outputs (scores) from different models.

[0127] Step 162: Calculate the overall importance score based on the multiple scores.

[0128] Key biomarkers are screened using a multi-indicator fusion strategy based on multiple scores. This integrates the interpretable outputs of different models to overcome the limitations of a single method and improve the robustness of the discovery. The contribution of each feature to the prediction results is quantified from the perspective of model interpretability.

[0129] According to an example embodiment, the comprehensive importance score is calculated using a first preset formula. The first preset formula includes: [formula omitted in source]; where the terms of the formula denote the score of the m-th marker recognition model and the number of marker recognition models.

[0130] Step 163: Verify the candidate markers to obtain the verification results.

[0131] The candidate biomarkers were subjected to significance tests and multiple hypothesis corrections to obtain validation results, thereby verifying their consistency in biological function and ultimately determining the core biomarker combination with statistical significance and mechanistic explanatory power.

[0132] According to the example implementation, statistical tests (such as the Mann-Whitney U test) and multiple hypothesis corrections were used, combined with KEGG pathway enrichment analysis or gene ontology enrichment analysis for verification.
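The statistical part of this verification can be sketched as follows. The Mann-Whitney U statistic is computed by direct pair counting, and the Benjamini-Hochberg procedure is used as an example of multiple hypothesis correction; the text does not name a specific correction method, so that choice is an assumption. Converting U into a p-value is left to a statistics library and is not shown.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b, with ties counted as 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A candidate marker would pass this stage when its adjusted p-value falls below a chosen significance threshold, before the enrichment analysis is applied.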

[0133] Step 164: Determine the key marker combination based on the comprehensive importance score and the verification results.

[0134] Candidate markers are sorted according to their overall importance score, and those that fail the verification are removed. A preset number of candidate markers are selected as the key marker combination.
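Steps 162 to 164 can be sketched together as follows. Because the first preset formula is not reproduced in this text, the sketch assumes that the comprehensive importance score is the unweighted mean of the per-model scores; this is an illustrative assumption and not the formula of the invention. The function name and data layout (one dictionary of marker scores per model) are hypothetical.

```python
def select_key_markers(scores_by_model, passed_validation, top_k):
    """Rank markers by mean score, drop unverified ones, keep the top_k."""
    n_models = len(scores_by_model)
    # Assumed aggregation: unweighted mean across models.
    combined = {
        marker: sum(scores[marker] for scores in scores_by_model) / n_models
        for marker in scores_by_model[0]
    }
    ranked = sorted(combined, key=combined.get, reverse=True)
    # Remove markers that failed verification, then take the preset number.
    return [marker for marker in ranked if marker in passed_validation][:top_k]
```

For example, with two models scoring markers `a`, `b`, and `c`, and only `b` and `c` passing verification, the function returns the highest-ranked verified markers up to `top_k`.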

[0135] This invention employs a multi-step strategy to screen biologically significant biomarkers from a model. First, feature importance scores (ranking importance) are calculated based on a trained model (e.g., random forest) to initially identify key features. Then, statistical tests (e.g., the Mann-Whitney U test) and multiple hypothesis correction are used to assess the significant differences between groups for these features. Finally, biological function annotation and pathway enrichment analysis are performed on the selected candidate biomarkers (especially microorganisms and proteins), and their potential biological functional consistency is verified using databases such as KEGG and GO, thereby determining a core biomarker combination that possesses both statistical significance and biological explanatory power.

[0136] This invention constructs an intelligent biomarker discovery process, improving the accuracy and efficiency of biomarker screening. Addressing the inefficiencies and unreliability of traditional biomarker discovery methods, this invention innovatively introduces a machine learning-based biomarker identification process. This process constructs a high-quality feature matrix through feature engineering, builds a predictive model using ensemble learning methods such as random forests, and employs screening strategies based on feature importance assessment, statistical validation, and biological function annotation. This not only significantly improves the efficiency of biomarker discovery but also ensures the reliability of the results through multi-dimensional validation, providing precise targeted guidance for maternal and infant health assessment and formula food development.

[0137] The apparatus for processing breast milk multi-omics components provided by the present invention is described below. The apparatus described below and the method for processing breast milk multi-omics components described above may be referred to in correspondence with each other. Figure 4 is a schematic diagram of the structure of the breast milk multi-omics component processing device provided by the present invention. As shown in Figure 4, the device includes: a component detection unit 410, used to perform component detection on a breast milk sample to obtain component detection results.

[0138] Gene sequencing unit 420 is used to perform gene sequencing on the breast milk sample to obtain sequencing results.

[0139] Data set unit 430 is used to construct an original dataset based on the basic information, detection data and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and sequencing results.

[0140] Database unit 440 is used to construct a breast milk multi-omics component database based on the original dataset.

[0141] The model training unit 450 is used to train multiple biomarker recognition models using the breast milk multi-omics component database and based on various machine learning algorithms.

[0142] The model application unit 460 is used to input candidate markers into the plurality of marker recognition models to determine the key marker combination.

[0143] The device performs functions similar to those described above; other functions are described in the preceding descriptions and will not be repeated here.

[0144] Figure 5 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other via the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute a method for processing breast milk multi-omics components. This method includes: performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms; and inputting candidate biomarkers into the multiple biomarker recognition models to determine key biomarker combinations.

[0145] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0146] In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the processing method for breast milk multi-omics components provided by the above methods. The method includes: performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database based on multiple machine learning algorithms; and inputting candidate biomarkers into the multiple biomarker recognition models to determine key biomarker combinations. In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the method for processing breast milk multi-omics components provided by the methods described above. This method includes: performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database based on various machine learning algorithms; and inputting candidate biomarkers into the multiple biomarker recognition models to determine key biomarker combinations.

[0147] The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0148] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0149] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for processing multi-omics components of breast milk, characterized in that it includes: performing component detection on a breast milk sample to obtain component detection results; performing gene sequencing on the breast milk sample to obtain sequencing results; constructing an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and the sequencing results; constructing a breast milk multi-omics component database based on the original dataset; training multiple biomarker recognition models using the breast milk multi-omics component database and based on various machine learning algorithms; and inputting candidate markers into the multiple biomarker recognition models to determine a key marker combination.

2. The method according to claim 1, characterized in that performing component detection on the breast milk sample to obtain the component detection results includes: standardizing the breast milk sample to obtain a standardized processing result; emitting an ultrasonic wave of a preset frequency into the standardized processing result, and measuring the speed of sound and the attenuation coefficient of the ultrasonic wave propagating in the breast milk sample; and obtaining, based on a pre-built breast milk component database and an inversion algorithm model, the content of the main components of the breast milk sample according to the speed of sound and the attenuation coefficient, as the component detection results.

3. The method according to claim 1, characterized in that performing gene sequencing on the breast milk sample to obtain the sequencing results includes: extracting microbial genomic DNA from the breast milk sample; inputting the microbial genomic DNA into a preset high-throughput sequencing platform to obtain sequence data; and analyzing the sequence data to obtain the sequencing results.

4. The method according to claim 3, characterized in that analyzing the sequence data to obtain the sequencing results includes: preprocessing the sequence data to obtain ASV representative sequences; comparing the ASV representative sequences with a preset reference database, and performing species annotation based on the comparison results to construct a microbial abundance matrix; and standardizing the microbial abundance matrix to obtain a standardized community characteristic table as the sequencing results.

5. The method according to claim 1, characterized in that constructing the breast milk multi-omics component database based on the original dataset includes: verifying the original dataset according to preset original data verification rules and extracting valid data; and constructing the breast milk multi-omics component database based on the valid data.

6. The method according to claim 1, characterized in that the breast milk multi-omics component database includes: a data management module, used to manage the valid data, numbering system, and status tracking of the breast milk samples; and analysis function modules, including sub-modules for macronutrient analysis, proteomics analysis, mineral analysis, vitamin analysis, fatty acid analysis, MFGM protein analysis, and microbial community analysis, which are used to perform statistical processing, visualization, and / or indicator calculation on the corresponding data.

7. The method according to claim 1, characterized in that the breast milk multi-omics component database supports query functions, filtering functions, chart display functions, and / or data export functions.

8. The method according to claim 1, characterized in that training the multiple biomarker recognition models using the breast milk multi-omics component database and based on various machine learning algorithms includes: extracting, based on the breast milk multi-omics component database, the macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics and / or β diversity characteristics of the breast milk sample; constructing a standardized feature matrix based on the macronutrient characteristics, vitamin and mineral characteristics, MFGM protein characteristics, species abundance characteristics, α diversity characteristics, and / or β diversity characteristics; constructing a sample dataset using the standardized feature matrix; and training the multiple biomarker recognition models based on the sample dataset using the various machine learning algorithms.

9. The method according to claim 1, characterized in that inputting the candidate markers into the multiple biomarker recognition models to determine the key marker combination includes: inputting the candidate markers into the multiple biomarker recognition models to obtain multiple scores; calculating a comprehensive importance score based on the multiple scores; verifying the candidate markers to obtain verification results; and determining the key marker combination based on the comprehensive importance score and the verification results.

10. A processing device for multi-omics components of breast milk, characterized in that it includes: a component detection unit, used to perform component detection on a breast milk sample to obtain component detection results; a gene sequencing unit, used to perform gene sequencing on the breast milk sample to obtain sequencing results; a dataset unit, used to construct an original dataset based on the basic information, detection data, and auxiliary information of the breast milk sample, wherein the detection data includes the component detection results and the sequencing results; a database unit, used to construct a breast milk multi-omics component database based on the original dataset; a model training unit, used to train multiple biomarker recognition models using the breast milk multi-omics component database and based on various machine learning algorithms; and a model application unit, used to input candidate markers into the multiple biomarker recognition models to determine a key marker combination.