Method, system, and medium for human multi-site pathogen identification based on metagenome

By constructing a multi-level reference database and comparing sequencing data against it level by level, the method addresses the accuracy problem of identifying pathogens in different parts of the human body with metagenomic sequencing, enabling efficient and accurate pathogen identification.

CN122245406A · Pending · Publication Date: 2026-06-19 · INST OF MICROBIOLOGY CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF MICROBIOLOGY CHINESE ACAD OF SCI
Filing Date
2026-05-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Current metagenomic sequencing technologies struggle to accurately distinguish between background microorganisms and pathogenic microorganisms in different tissues of the human body, impacting the development and implementation of personalized treatment strategies.

Method used

A reference database is constructed from four components: genus-level unique gene sets, species-level unique genome sets, representative viral sequences, and representative fungal and parasite sequences. Metagenomic sequencing data from multiple parts of the human body are compared against these components one database at a time to identify background pathogens and pathogenic pathogens.

Benefits of technology

It improves the accuracy of pathogen identification, can distinguish microorganisms with highly similar genomes, has a wide identification range, outputs identification results accurate to subtype, and distinguishes background species from pathogenic pathogens.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a method, system, and medium for identifying pathogens from multiple sites of the human body based on metagenomics. By analyzing existing pan-genome data, a first reference database including a genus-level unique gene set, a second reference database including a species-level unique genome set, a third reference database including representative viral sequences, and a fourth reference database including representative fungal and parasitic sequences are constructed. Then, metagenomic sequencing data from multiple sites of the human body are compared sequentially with the four reference databases, i.e., in the order of genus-level pathogens, species-level pathogens, viruses, and fungal and parasitic organisms. This method can effectively distinguish microorganisms of different genera or different species of the same genus with highly similar genomes. It can also identify viruses and fungal and parasitic organisms. The range of pathogen species and strains identified is wide, which can improve the identification accuracy. It can directly output identification results accurate to the subtype level and can distinguish background microorganisms from pathogenic pathogens, thus having high application value.

Description

Technical Field

[0001] This disclosure relates to the field of gene detection technology, and in particular to a method, system, device and medium for identifying pathogens in multiple parts of the human body based on metagenomics.

Background Technology

[0002] Some background microorganisms present in various human tissues and organs exhibit opportunistic pathogenicity, posing a challenge to the accurate detection of pathogenic microorganisms. Currently, metagenomic next-generation sequencing (mNGS) shows great potential in pathogen detection. However, the lack of reference databases for background pathogens in different human tissues and accurate species detection thresholds makes it difficult for mNGS analysis to accurately distinguish between background microorganisms and true pathogens in a sample. This insufficient ability to differentiate directly weakens the sensitivity and accuracy of mNGS detection, thus affecting the formulation and implementation of personalized treatment strategies. Therefore, this limitation poses a significant challenge to the development of precision medicine.

[0003] Therefore, it is necessary to provide a metagenomic-based method, system, and medium for identifying background pathogens and pathogenic pathogens in human samples from different parts of the body.

Summary of the Invention

[0004] The purpose of this disclosure is to provide a method, system, and medium for identifying pathogens from multiple sites in the human body based on metagenomics, in order to solve at least one of the aforementioned technical problems.

[0005] To achieve the above objectives, in a first aspect, this disclosure provides a method for identifying pathogens from multiple sites in the human body based on metagenomics, comprising the following steps:

[0006] Analyze the genus-level pan-genome of pathogenic bacteria and construct a first reference database; the first reference database includes a set of genus-specific genes.

[0007] Analyze the species-level pan-genome of pathogenic bacteria within the same genus and construct a second reference database; the second reference database includes a set of species-specific genomes.

[0008] Integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences;

[0009] Genetic data of fungi and parasites from public databases are integrated to construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0010] Metagenomic sequencing data of multiple human organ samples were sequentially compared with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal parasite sequences of the fourth reference database, and gene parameters were calculated based on the comparison results.

[0011] Based on the gene parameters, background pathogens and pathogenic pathogens in multiple parts of the human body were identified, and an identification report was generated.

[0012] In a preferred embodiment, the genus-level pan-genome of the pathogen is analyzed, and a first reference database is constructed, including:

[0013] Quality control of genus-level genomic data;

[0014] After quality control, the longest sequence is selected from the reference genome of each species in the genus and from the sequences of each sequence-type subtype of each species, and the selected sequences form the pan-genome.

[0015] The core gene set of the genus was identified within the pangenome.

[0016] The core gene sets of pathogens from different genera are compared pairwise, and the genus-specific gene set of each genus is derived from these comparisons.

[0017] The genus-specific gene sets of each genus were constructed as the first reference database.

[0018] In a preferred embodiment, the species-level pan-genome of the pathogenic bacteria within the same genus is analyzed, and a second reference database is constructed, including:

[0019] Quality control of genomic data at the species level;

[0020] Gene identification is performed on the quality-controlled genomic data to generate a pan-genome for the species;

[0021] The core gene set of each species was identified within its pangenome;

[0022] Comparative analysis of the core gene sets of different species within the same genus was conducted, and species-specific gene sets were derived by comparing the core gene sets of different species pairwise.

[0023] A second reference database was constructed using gene sets specific to each species level.

[0024] In one preferred implementation, viral genetic data from public databases are integrated, and a third reference database is constructed, including:

[0025] Integrate viral genomes and representative fragment sequences from public databases, remove bacteriophages, and generate a pathogen gene sequence set that does not contain bacteriophages;

[0026] The pathogen gene sequence set was filtered to remove gene sequences with a fuzzy base content >5%, repetitive fragment gene sequences, and gene sequences with a length <300bp.

[0027] Clustering of viral sequences from the filtered pathogen gene sequence set;

[0028] If the clustering results are consistent with the fragment information of the species, the longest sequence is selected from the different clustering branches after quality control as the representative virus sequence of that clustering branch, and all the representative virus sequences are constructed into a third reference database.

[0029] If the clustering results are inconsistent with the fragment information of the species, the inconsistent genomes will be re-clustered by subspecies.
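The filtering and representative-selection rules in the steps above can be sketched in Python. This is a minimal illustration, not the implementation used in the disclosure: the function names, the in-memory dictionaries, and the cluster assignments are hypothetical; "fuzzy bases" are assumed to mean any character other than A, C, G, or T; and "repetitive fragment gene sequences" are assumed to mean exact duplicates. The clustering itself and the re-clustering by subspecies are not shown.

```python
from typing import Dict, List

AMBIGUOUS_MAX = 0.05  # sequences with more than 5% ambiguous bases are removed
MIN_LENGTH = 300      # sequences shorter than 300 bp are removed


def ambiguous_fraction(seq: str) -> float:
    """Fraction of characters that are not A, C, G or T (assumed meaning of 'fuzzy')."""
    if not seq:
        return 1.0
    unambiguous = sum(seq.upper().count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(seq)


def filter_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Apply the length, ambiguity and duplicate filters; keep the first copy of a duplicate."""
    kept: Dict[str, str] = {}
    seen = set()
    for name, seq in records.items():
        seq = seq.upper()
        if len(seq) < MIN_LENGTH:
            continue
        if ambiguous_fraction(seq) > AMBIGUOUS_MAX:
            continue
        if seq in seen:  # exact duplicate (assumption about 'repetitive')
            continue
        seen.add(seq)
        kept[name] = seq
    return kept


def pick_representatives(seqs: Dict[str, str],
                         clusters: Dict[str, List[str]]) -> Dict[str, str]:
    """Return the name of the longest surviving sequence in each cluster."""
    reps: Dict[str, str] = {}
    for cluster_id, members in clusters.items():
        candidates = [m for m in members if m in seqs]
        if candidates:
            reps[cluster_id] = max(candidates, key=lambda m: len(seqs[m]))
    return reps
```

Clusters whose members were all filtered out simply yield no representative in this sketch.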

[0030] In a preferred embodiment, genetic data of fungi and parasites from public databases are integrated, and a fourth reference database is constructed, including:

[0031] Integrate reference sequences of fungi and parasites from public databases;

[0032] Quality control was performed on the reference sequences of pathogenic fungi and parasites. Sequences shorter than 1000 bp and sequences that successfully aligned with human and bacterial genomes were removed. Genome data with contamination level <5% and integrity >95% were retained to obtain representative sequences of fungi and parasites.

[0033] All representative sequences of fungal parasites were constructed into a fourth reference database.

[0034] In a preferred embodiment, metagenomic sequencing data of multiple human organ samples are sequentially compared with genus-specific gene sets in a first reference database, species-specific genome sets in a second reference database, representative viral sequences in a third reference database, and representative fungal parasite sequences in a fourth reference database, database by database. Gene parameters are calculated based on the comparison results, including:

[0035] Obtain metagenomic sequencing data from samples of multiple human organs;

[0036] The metagenomic sequencing data is compared with the genus-specific gene set in the first reference database;

[0037] Gene sequences of a genus that are uniquely matched with the first reference database are extracted, and the genus-level gene sequences are compared with the species-specific genome sets within the same genus in the second reference database.

[0038] Gene sequences that are uniquely matched with the second reference database are classified into that species, and the number of species-specific genes and species genome coverage are counted.

[0039] If the metagenomic sequencing data does not match with either the first or second reference database, then the metagenomic sequencing data will be compared with the viral gene set in the third reference database.

[0040] Gene sequences that are uniquely matched with the third reference database are classified according to the classification information of viral gene reference sequences, and the abundance and genome coverage of viral sequences are statistically analyzed.

[0041] If the metagenomic sequencing data does not match the first reference database, the second reference database, and the third reference database, then the metagenomic sequencing data will be compared with the representative fungal parasite sequences in the fourth reference database.

[0042] Gene sequences that successfully matched the fourth reference database were classified according to the classification information of fungal parasite gene reference sequences, and the abundance and genome coverage of fungal parasite sequences were statistically analyzed.
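The database-by-database comparison described above can be sketched as a simple cascade. This is an illustrative outline under stated assumptions: the alignment step is replaced by hypothetical lookup functions that return the list of reference targets a read matches, a read with zero or several matches is treated as not uniquely matched, and the names `classify_read` and `table_lookup` are invented for this example. The abundance and coverage statistics are not computed here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A lookup stands in for an aligner: it maps a read to the reference targets it matches.
Lookup = Callable[[str], List[str]]


def table_lookup(table: Dict[str, List[str]]) -> Lookup:
    """Build a toy lookup from a dictionary (hypothetical helper for illustration)."""
    return lambda read: table.get(read, [])


def unique_hit(lookup: Lookup, read: str) -> Optional[str]:
    """Return the single matching target, or None when there are zero or several matches."""
    hits = lookup(read)
    return hits[0] if len(hits) == 1 else None


def classify_read(read: str,
                  genus_db: Lookup,
                  species_dbs: Dict[str, Lookup],
                  virus_db: Lookup,
                  fungi_parasite_db: Lookup) -> Tuple[str, Optional[str]]:
    # Database 1: genus-level unique gene set.
    genus = unique_hit(genus_db, read)
    if genus is not None:
        # Database 2: species-level set, restricted to the matched genus.
        species_lookup = species_dbs.get(genus)
        species = unique_hit(species_lookup, read) if species_lookup else None
        if species is not None:
            return ("species", species)
        return ("genus", genus)
    # Database 3: representative viral sequences.
    virus = unique_hit(virus_db, read)
    if virus is not None:
        return ("virus", virus)
    # Database 4: fungal and parasite sequences; any successful match is accepted,
    # and the first reported target is used (an assumption of this sketch).
    hits = fungi_parasite_db(read)
    if hits:
        return ("fungus_or_parasite", hits[0])
    return ("unclassified", None)
```

Keeping each database behind the same lookup interface makes the order of the cascade explicit and lets a real aligner wrapper be substituted for the toy tables.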

[0043] In a preferred embodiment, background pathogens and pathogenic pathogens from multiple sites of the human body are identified based on the genetic parameters, and an identification report is output, including:

[0044] Species abundance in each organ sample is calculated based on the gene parameters, using human organs as the unit.

[0045] The threshold for background pathogens is calculated based on the species abundance, and the squared difference of twice the average species abundance in all organ samples is used as the threshold for background pathogens.

[0046] Species that do not exceed the threshold are identified as background pathogens, and species that exceed the threshold are identified as pathogenic pathogens.
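The threshold-based split into background and pathogenic species can be sketched as follows. The exact threshold formula is not fully recoverable from the wording above, so the threshold is supplied as a function; the default `mean_plus_two_sd` (mean abundance plus twice the population standard deviation across organ samples) is only one possible reading and is an assumption of this sketch, as are all function and variable names.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence


def mean_plus_two_sd(values: Sequence[float]) -> float:
    """Assumed threshold: mean + 2 * population standard deviation (not confirmed by the text)."""
    return mean(values) + 2 * pstdev(values)


def classify_species(abundance_by_sample: Dict[str, float],
                     threshold_fn: Callable[[Sequence[float]], float] = mean_plus_two_sd
                     ) -> Dict[str, str]:
    """Label one species in each organ sample as 'background' or 'pathogenic'."""
    threshold = threshold_fn(list(abundance_by_sample.values()))
    return {
        sample: "pathogenic" if abundance > threshold else "background"
        for sample, abundance in abundance_by_sample.items()
    }
```

Passing a different `threshold_fn` is enough to try another interpretation of the threshold without changing the classification rule, which follows the text directly: values above the threshold are pathogenic, all others are background.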

[0047] In a second aspect, this disclosure provides a metagenomics-based system for identifying pathogens from multiple sites in the human body, comprising the following modules:

[0048] The first construction module is used to analyze the genus-level pan-genome of pathogens and construct a first reference database; the first reference database includes a genus-level specific gene set.

[0049] The second construction module is used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and to construct a second reference database; the second reference database includes a species-specific genome set.

[0050] The third building module is used to integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences.

[0051] The fourth building module is used to integrate the genetic data of fungi and parasites in public databases and construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0052] The alignment module is used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal parasite sequences of the fourth reference database, and calculate gene parameters based on the alignment results.

[0053] The identification module is used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters, and output an identification report.

[0054] In a third aspect, this disclosure also provides a database construction platform for identifying pathogens from multiple sites in the human body, including:

[0055] The first construction module is used to analyze the genus-level pan-genome of pathogens and construct a first reference database; the first reference database includes a genus-level specific gene set.

[0056] The second construction module is used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and to construct a second reference database; the second reference database includes a species-specific genome set.

[0057] The third building module is used to integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences.

[0058] The fourth building module is used to integrate the genetic data of fungi and parasites in public databases and construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0059] The genus-level unique gene set of the first reference database, the species-level unique genome set of the second reference database, the representative virus sequence of the third reference database, and the representative fungal parasite sequence of the fourth reference database are used to compare with metagenomic sequencing data of multiple human organ samples in turn to identify background pathogens and pathogenic pathogens in multiple parts of the human body.

[0060] In a fourth aspect, this disclosure also provides a metagenomics-based platform for identifying pathogens from multiple sites in the human body, including:

[0061] A storage module is used to store a pre-constructed first reference database, a second reference database, a third reference database, and a fourth reference database; the first reference database includes a genus-level unique gene set, the second reference database includes a species-level unique genome set, the third reference database includes representative virus sequences, and the fourth reference database includes representative fungal parasite sequences.

[0062] The alignment module is used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal parasite sequences of the fourth reference database, and calculate gene parameters based on the alignment results.

[0063] The identification module is used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters, and output an identification report.

[0064] In a fifth aspect, this disclosure also provides an electronic device, including: a memory and one or more processors; the memory is used to store one or more computer programs; when the one or more computer programs are executed by the one or more processors, they implement the metagenomic-based method for identifying pathogens from multiple sites in the human body as described in any embodiment of the first aspect of this disclosure.

[0065] In a sixth aspect, this disclosure also provides a computer storage medium storing a computer program; when the computer program is executed by a processor, it implements the metagenomic-based method for identifying pathogens from multiple parts of the human body as described in any embodiment of the first aspect of this disclosure.

[0066] Beneficial effects:

[0067] Compared with existing technologies, the metagenomic-based method, system, and medium for identifying pathogens from multiple human sites provided in this disclosure analyze existing pan-genome data to construct a first reference database including a genus-level unique gene set, a second reference database including a species-level unique genome set, a third reference database including representative viral sequences, and a fourth reference database including representative fungal and parasitic sequences. Metagenomic sequencing data from multiple human sites are then compared sequentially with the first, second, third, and fourth reference databases, i.e., in the order of genus-level pathogens, species-level pathogens, viruses, and fungi and parasites. This effectively distinguishes microorganisms from different genera, or from different species within the same genus, that have highly similar genomes. It also enables the identification of viruses, fungi, and parasites, covers a wide range of pathogenic species and strains, improves identification accuracy, and directly outputs identification results accurate to the subtype level. It can also distinguish between background species and true pathogens in human samples, demonstrating high application value.

Attached Figure Description

[0068] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of this disclosure and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0069] Figure 1 is a flowchart of the metagenomic-based method for identifying pathogens from multiple sites in the human body provided in this disclosure.

[0070] Figure 2 is a schematic diagram of the principle of the metagenomic-based multi-site pathogen identification system for humans provided in this disclosure.

[0071] Figure 3 is a schematic block diagram of the electronic device provided in this disclosure.

Detailed Implementation

[0072] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. The components of the embodiments of this disclosure described and shown in the accompanying drawings can be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without inventive effort are within the scope of protection of this disclosure.

[0073] Please refer to Figure 1, which is a flowchart of the method for identifying pathogens from multiple sites in the human body based on metagenomics provided in Embodiment 1 of this disclosure. It should be noted that the method of this disclosure is not limited to the order of the following steps; in other embodiments, the method may include only a portion of the following steps, or some steps may be deleted.

[0074] The metagenomic-based method for identifying pathogens from multiple sites of the human body provided in this embodiment can be applied to a metagenomic-based system for identifying pathogens from multiple sites of the human body. The method and system are suitable for identifying background pathogens and pathogenic pathogens in samples from different parts of the human body. It can identify different species that are closely related or have high nucleotide similarity, and can also detect important pathogen subtypes.

[0075] The metagenomic-based method for identifying pathogens from multiple sites in the human body provided in this embodiment includes the following steps:

[0076] Step S10: Analyze the genus-level pan-genome of the pathogen and construct a first reference database; the first reference database includes a genus-level specific gene set.

[0077] Specifically, this step involves constructing a first reference database based on genus-level pan-genome analysis, representing the first standard workflow in the current research field applicable to genus-level pan-genome analysis of pathogens. The metagenomic-based human multi-site pathogen identification system includes a first construction module, which analyzes the genus-level pan-genome of pathogens and constructs the first reference database.

[0078] In some implementations, step S10 includes the following steps:

[0079] Quality control of genus-level genomic data;

[0080] After quality control, the longest sequence is selected from the reference genome of each species in the genus and from the sequences of each sequence-type subtype of each species, and the selected sequences form the pan-genome.

[0081] The core gene set of the genus was identified within the pangenome.

[0082] The core gene sets of pathogens from different genera are compared pairwise, and the genus-specific gene set of each genus is derived from these comparisons.

[0083] The genus-specific gene sets of each genus were constructed as the first reference database.

[0084] Specifically, quality control was performed on genus-level FASTA format genomic data. Quality control parameters included: removing data shorter than 1000 bp; retaining genomic data with a contamination level <5% and integrity >95% after CheckM evaluation to obtain high-quality genomic data. FASTA format is a plain text format for storing biological sequences (such as DNA, RNA, and proteins), with a simple structure. CheckM is an existing bioinformatics tool capable of assessing the integrity and contamination level of microbial genome (especially metagenomic) data. Understandably, genus-level genomic data can come from multiple existing pathogen gene databases, such as the NCBI database, PATRIC database, and gcPathogen database, providing a wide and comprehensive range of data sources.
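The quality-control parameters in this paragraph can be expressed as a small filter. This is a minimal sketch: it assumes the completeness ("integrity") and contamination percentages have already been obtained from a CheckM run and parsed into records, and the `GenomeQC` class and `passes_qc` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class GenomeQC:
    accession: str
    length_bp: int
    completeness: float   # percent, as reported by a completeness estimator such as CheckM
    contamination: float  # percent


def passes_qc(genome: GenomeQC,
              min_length: int = 1000,
              min_completeness: float = 95.0,
              max_contamination: float = 5.0) -> bool:
    """Keep genomes of at least 1000 bp with completeness > 95% and contamination < 5%."""
    return (genome.length_bp >= min_length
            and genome.completeness > min_completeness
            and genome.contamination < max_contamination)


def filter_genomes(genomes: Iterable[GenomeQC]) -> List[GenomeQC]:
    """Return only the genomes that satisfy all three quality-control criteria."""
    return [g for g in genomes if passes_qc(g)]
```

The thresholds are keyword arguments so the same filter can be reused for the species-level data, which the text subjects to the same parameters.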

[0085] At the genus level, the best-performing and longest sequence from each species' reference genome and from each ST (Sequence Type) subtype is selected to form the pan-genome. Specifically, the reference genome is the genome sequence marked "reference" in the NCBI RefSeq database, and an ST subtype is the intraspecific clonal group defined by the allelic combination of seven housekeeping genes according to the MLST (Multilocus Sequence Typing) scheme. The purpose is to ensure that the genomic data are high-quality, standardized, and representative of the main evolutionary lineages and clonal groups within the species, avoiding the omission of rare or underrepresented subtypes.

[0086] The quality control steps are as follows: First, at the genus level, all sequences with "integrity > 95% and contamination < 5%" are screened out. Second, based on this, sequences whose length meets the requirement of "not lower than the upper quartile (Q3, i.e., the 75th percentile) of the overall length distribution" are further screened out, resulting in a number of quality-controlled sequences. These sequences, after quality control using the above parameters, are the best-performing sequences. Finally, the longest of these quality-controlled sequences is selected as the genome to be used for pan-genome analysis.
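The length-based selection can be sketched with the standard library. This is an illustration only: the input is assumed to contain sequences that already passed the completeness and contamination screen, the function name is hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since the text does not say how Q3 is computed.

```python
from statistics import quantiles
from typing import Dict, List, Tuple


def select_pan_genome_candidate(lengths: Dict[str, int]) -> Tuple[List[str], str]:
    """Return (sequences with length >= Q3, the longest of those sequences)."""
    if not lengths:
        raise ValueError("no quality-controlled sequences supplied")
    if len(lengths) == 1:
        only = next(iter(lengths))
        return [only], only
    # Q3 is the third of the three quartile cut points.
    q3 = quantiles(lengths.values(), n=4, method="inclusive")[2]
    shortlisted = [acc for acc, length in lengths.items() if length >= q3]
    longest = max(shortlisted, key=lambda acc: lengths[acc])
    return shortlisted, longest
```

The "inclusive" method keeps Q3 within the observed range, so the shortlist is never empty; other quartile conventions can place Q3 above the longest sequence for very small inputs.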

[0087] The core genome and pangenome set sizes were assessed. Roary software was used to identify homologous genes in all genomes screened using the genus-level quality control steps described above. Genes with a similarity >70% were considered homologous, and sporadic genes were then removed. Roary software is an existing bioinformatics tool for pangenome analysis, capable of rapidly comparing the genome composition of hundreds to thousands of strains; for example, it can be used to reveal the distribution patterns of core and variable genes. The core objective of this step is to systematically screen and define the homologous gene sets in all quality-controlled genomes. Specifically, by setting a 70% gene similarity threshold, sporadic genes with insufficient sequence similarity and no homologous association can be precisely removed, while evolutionarily related homologous genes are retained. This provides standardized basic data support for subsequent pangenome analysis (including the screening of genus-level, species-level core gene sets, and unique gene sets).

[0088] The core gene set of the genus is identified within the pan-genome. The core gene set is the set of genes common to all genomes in the pan-genome of the genus. Typically, a pan-genome includes a core gene set, a subsidiary gene set, and a special gene set. The core gene set of a genus consists of genes present in all genomes; the subsidiary gene set consists of genes present in more than one but less than 30% of genomes; and the special gene set consists of genes present in only one genome. The core gene set includes the genes that represent the genus and can serve as the basis for subsequent comparative analyses, whereas the subsidiary and special gene sets are not representative and cannot serve as that basis. Identifying the core gene set of a genus within the pan-genome therefore ensures the accuracy of the underlying data.
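The partition of a pan-genome by gene presence can be sketched as follows. This is a simplified illustration: the presence map (gene family to the set of genomes carrying it) is a hypothetical stand-in for a pan-genome tool's presence/absence output, and every gene that is neither core nor single-genome is labeled "accessory", without applying the 30% upper bound mentioned above.

```python
from typing import Dict, Set


def partition_pan_genome(presence: Dict[str, Set[str]],
                         n_genomes: int,
                         core_fraction: float = 1.0) -> Dict[str, str]:
    """Label each gene family as 'core', 'accessory' or 'unique'."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    labels: Dict[str, str] = {}
    for gene, genomes in presence.items():
        count = len(genomes)
        if count / n_genomes >= core_fraction:
            labels[gene] = "core"       # present in (nearly) all genomes
        elif count == 1:
            labels[gene] = "unique"     # present in a single genome
        else:
            labels[gene] = "accessory"  # everything in between (simplification)
    return labels
```

With `core_fraction=1.0` a gene must occur in every genome to be core, matching the genus-level definition; a lower value relaxes that requirement.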

[0089] A comparative analysis of the core gene sets across different genera of pathogenic bacteria was conducted. Specifically, the core gene set of each genus was used as input, and BLASTn software was used with parameters set to nucleotide identity < 90% and alignment coverage > 80%. Pairwise comparisons of the core gene sets in the pan-genomes of different genera were performed at the genus level to obtain the set of genus-specific core genes, i.e., the genus-level unique gene set. This genus-level unique gene set includes the core unique genes of each genus, i.e., the key genes used to distinguish different genera. Finally, the genus-level unique gene sets of each genus were constructed into a first reference database, which can be used for the accurate identification of different bacterial genera.

[0090] Specifically, BLASTn software is an existing nucleic acid sequence alignment tool. When the parameters are set to identity < 90% and coverage > 80%, it can effectively screen out target nucleic acid sequences with moderate homology and incomplete matching while ensuring that the alignment region is sufficiently complete. It can quickly distinguish between highly similar sequences and closely related homologous sequences, reduce redundant matching, and improve the accuracy of specific alignment results.
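One way to turn the pairwise alignment results into genus-specific gene sets is sketched below. The text gives the BLASTn thresholds (identity < 90%, coverage > 80%) but not the exact decision rule, so this sketch assumes that a core gene is discarded when a sufficiently covered alignment to a core gene of another genus reaches at least 90% identity, and is kept otherwise. The `Hit` record and function name are hypothetical, and the hits are assumed to have been parsed from tabular alignment output beforehand.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Hit:
    query_gene: str
    subject_gene: str
    identity: float  # percent nucleotide identity
    coverage: float  # percent alignment coverage


def genus_specific_genes(gene_to_genus: Dict[str, str],
                         hits: Iterable[Hit],
                         max_identity: float = 90.0,
                         min_coverage: float = 80.0) -> Dict[str, Set[str]]:
    """Return, per genus, the core genes with no close match in any other genus."""
    shared: Set[str] = set()
    for hit in hits:
        q_genus = gene_to_genus.get(hit.query_gene)
        s_genus = gene_to_genus.get(hit.subject_gene)
        if q_genus is None or s_genus is None or q_genus == s_genus:
            continue  # only cross-genus comparisons matter here
        if hit.identity >= max_identity and hit.coverage > min_coverage:
            shared.update((hit.query_gene, hit.subject_gene))
    specific: Dict[str, Set[str]] = {}
    for gene, genus in gene_to_genus.items():
        if gene not in shared:
            specific.setdefault(genus, set()).add(gene)
    return specific
```

The same function could be applied within a genus by mapping genes to species instead of genera and passing a different `max_identity`.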

[0091] Step S20: Analyze the species-level pan-genome of the pathogenic bacteria within the same genus and construct a second reference database; the second reference database includes a species-specific genome set.

[0092] Specifically, this step involves constructing a second reference database based on species-level pan-genome analysis. The human multi-site pathogen identification system based on metagenomics includes a second construction module, which is used to analyze the species-level pan-genome of pathogens within the same genus and construct the second reference database.

[0093] In some implementations, step S20 includes the following steps:

[0094] Quality control of genomic data at the species level;

[0095] Gene identification is performed on the quality-controlled genomic data to generate a pan-genome for the species;

[0096] The core gene set of each species was identified within its pangenome;

[0097] Comparative analysis of the core gene sets of different species within the same genus was conducted, and species-specific gene sets were derived by comparing the core gene sets of different species pairwise.

[0098] A second reference database was constructed using gene sets specific to each species level.

[0099] Specifically, quality control was performed on species-level FASTA format genomic data. Quality control parameters included: removing data shorter than 1000 bp; retaining genomic data with a contamination level <5% and integrity >95% after CheckM evaluation to obtain high-quality genomic data. As at the genus level, the species-level genomic data were sourced from multiple existing pathogen gene databases, such as the NCBI database, PATRIC database, and gcPathogen database.

[0100] Using Prokka software, genes in each genome are identified, and Roary software is then used to generate a pan-genome for the species. SNV annotation, including SNP and Indel annotation, is performed using the pan-genome as the reference sequence, and MUMmer software is used for filtering based on Identity > 95%, S-coverage > 80%, and Q-coverage > 80%. Specifically, Prokka automatically identifies and annotates functional elements such as protein-coding genes, tRNA, and rRNA in each input genome sequence. Prokka uses the Prodigal gene prediction tool and converts raw sequences into annotated GFF3 / GenBank format files, providing standardized input data for subsequent pan-genome analysis. Roary clusters all genes in the Prokka-annotated genome files into homologous gene families (which together form the "pan-genome") through pairwise alignment, and identifies which gene families are common to all strains (the core genome) and which exist only in some strains (the variable / accessory genome). MUMmer compares the genome sequence of each strain with the pan-genome reference sequence set constructed in the previous step, thereby detecting single nucleotide polymorphisms (SNPs) and insertions / deletions (Indels) between them. This ultimately yields a high-quality pan-genome for the species.
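The annotation and pan-genome steps could be driven from Python by assembling the command lines for the two tools. The disclosure does not state which options were used, so this sketch is an assumption throughout: it uses only the commonly documented Prokka options `--outdir` and `--prefix` and the Roary options `-f` (output directory), `-cd` (percentage of isolates a gene must be in to be core), and `-p` (threads). The paths, the 95% core setting, and the function names are illustrative. The functions only build argument lists; running them (for example with `subprocess.run`) requires both tools to be installed.

```python
from pathlib import Path
from typing import List, Sequence


def prokka_command(genome_fasta: Path, out_dir: Path, prefix: str) -> List[str]:
    """Annotate one genome; Prokka writes <prefix>.gff into out_dir."""
    return ["prokka", "--outdir", str(out_dir), "--prefix", prefix, str(genome_fasta)]


def roary_command(gff_files: Sequence[Path], out_dir: Path,
                  core_percent: int = 95, threads: int = 4) -> List[str]:
    """Cluster the annotated genomes into a pan-genome (option values are assumptions)."""
    return ["roary", "-f", str(out_dir), "-cd", str(core_percent),
            "-p", str(threads), *[str(p) for p in gff_files]]


def pipeline_commands(genomes: Sequence[Path], work_dir: Path) -> List[List[str]]:
    """Build one Prokka command per genome followed by a single Roary command."""
    commands: List[List[str]] = []
    gffs: List[Path] = []
    for fasta in genomes:
        prefix = fasta.stem
        annot_dir = work_dir / "prokka" / prefix
        commands.append(prokka_command(fasta, annot_dir, prefix))
        gffs.append(annot_dir / f"{prefix}.gff")
    commands.append(roary_command(gffs, work_dir / "roary"))
    return commands
```

Separating command construction from execution makes the planned invocations easy to inspect or log before any tool is run.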

[0101] Within the pangenome of each species, the core gene set is identified. The core gene set is the set of genes shared by essentially all genomes of a species. Typically, a pangenome includes a core gene set, a secondary gene set, and a specific gene set. Here, the core gene set is the set of genes present in more than 95% of the genomes; the secondary gene set is the set of genes present in more than one genome but in fewer than 95% of the genomes; and the specific gene set is the set of genes present in only one genome. Understandably, the core gene set includes the core genes representative of the species and can serve as the basis for subsequent comparative analyses, while the secondary and specific gene sets are not representative and cannot be used as the basis for such analyses. Therefore, identifying the core gene set within the pangenome ensures the accuracy of the basic data.
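The three-way division of a pangenome described above can be illustrated as follows. This is a sketch under stated assumptions: the input is assumed to be a presence map from gene family to the set of genomes carrying it (for example, derived from a Roary presence/absence table), the function name is hypothetical, and a family present in exactly 95% of the genomes is placed in the secondary set because the text leaves that boundary open.

```python
def classify_gene_families(presence, n_genomes, core_fraction=0.95):
    """Split gene families into core, secondary and specific sets.

    presence maps a gene family name to the set of genome ids that carry it.
    """
    categories = {"core": [], "secondary": [], "specific": []}
    for family, genomes in presence.items():
        count = len(genomes)
        if count / n_genomes > core_fraction:
            categories["core"].append(family)       # present in more than 95% of genomes
        elif count > 1:
            categories["secondary"].append(family)  # more than one genome, not core
        elif count == 1:
            categories["specific"].append(family)   # present in a single genome only
    return categories
```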

[0102] After the pan-genomes of different species within the same genus were analyzed to obtain the core gene set of each species, the core gene sets were used as input, and pairwise comparisons of the core gene sets of different species were conducted using BLASTn software. The parameters were set to nucleotide identity <75%-95% and alignment coverage >80%, resulting in the species-specific core gene sets, i.e., the species-specific gene sets. These species-specific gene sets include the core-specific genes of each species, i.e., the key genes used to distinguish different species. Finally, the species-specific gene sets of all species were used to construct the second reference database, which can be used for the precise identification of different species within the same genus.

[0103] Specifically, when the parameters are set to nucleotide identity < 75%-95% and alignment coverage > 80%, highly conserved homologous sequences with no species differentiation between different species within the same genus are removed, and only gene sequences with significant sequence differences in cross-species alignment and whose sequence integrity meets the requirements for effective alignment are retained. This can accurately define the specific differences between different species within the same genus at the core genome level, providing a stable and unique molecular marker for accurate species-level differentiation.
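One possible reading of this screening step is sketched below: a core gene is dropped when it has a well-covered, high-identity match in another species of the same genus, and kept otherwise. The `Hit` record and function name are hypothetical, the hits are assumed to have been parsed from BLASTn tabular output, and `identity_cutoff` is left as a required argument because the text gives a range (75%-95%) rather than a single value.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical summary of one cross-species BLASTn alignment.
    query_gene: str
    subject_species: str
    identity: float  # percent nucleotide identity
    coverage: float  # percent of the query gene covered by the alignment


def species_specific_genes(core_genes, hits, own_species, identity_cutoff, min_coverage=80.0):
    """Return the core genes that lack a conserved homolog in any other species."""
    conserved = {
        h.query_gene
        for h in hits
        if h.subject_species != own_species
        and h.identity >= identity_cutoff
        and h.coverage > min_coverage
    }
    return [g for g in core_genes if g not in conserved]
```

In this sketch, the coverage requirement only decides whether a cross-species match counts as a reliable homolog; other readings of the two thresholds are possible.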

[0104] The parameter settings differ slightly in the acquisition of genus-level and species-level unique gene sets because the taxonomic levels of genus and species are different. Their sequence differentiation degree, conservation intervals, and sequence difference thresholds required for species differentiation are significantly different. This allows for the accurate division of specific gene sets at different taxonomic levels, avoiding the decrease in distinguishing ability or false positive / false negative results caused by a uniform threshold, and ensuring high specificity and accuracy for both genus and species identification.

[0105] Step S30: Integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences.

[0106] Specifically, this step involves constructing a third reference database based on viral genetic data from public databases. This third reference database is essentially a viral genetic reference database. The metagenomic-based human multi-site pathogen identification system includes a third construction module, which integrates viral genetic data from public databases and constructs the third reference database.

[0107] In some implementations, step S30 includes the following steps:

[0108] Integrate viral genomes and representative fragment sequences from public databases, remove bacteriophages, and generate a pathogen gene sequence set that does not contain bacteriophages;

[0109] The pathogen gene sequence set was filtered to remove gene sequences with an ambiguous base content >5%, repeated fragment gene sequences, and gene sequences with a length <300 bp;

[0110] Clustering of viral sequences from the filtered pathogen gene sequence set;

[0111] If the clustering results are consistent with the fragment information of the species, the longest sequence is selected from the different clustering branches after quality control as the representative virus sequence of that clustering branch, and all the representative virus sequences are constructed into a third reference database.

[0112] If the clustering results are inconsistent with the fragment information of the species, the inconsistent genomes will be re-clustered by subspecies.

[0113] Specifically, first, all viral genomes and representative fragment sequences are integrated from existing public databases such as GenBank, NCBI-RefSeq, and IMG-VR, and from the existing literature. Phage removal is then performed: sequences whose names contain "phage" and sequences with duplicate strain names are screened out and deleted, generating a pathogen sequence set that does not contain phages and thus preventing phage interference with the results.

[0114] Next, all sequences with an ambiguous base content >5% were removed. Ambiguous bases are defined as bases other than A, T, C, and G (the four bases of DNA). Repeated fragments were also removed; that is, on a species-by-species basis, sequences were compared pairwise, and if a short fragment was 100% covered by a longer fragment, the short fragment was removed. All sequences <300 bp in length were also removed. In short, this filtering of the pathogen gene sequence set yielded higher-quality viral sequences, improving data accuracy.
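The three filters above can be sketched for the sequences of one species as follows. This is an illustration only: the function names are hypothetical, and "100% covered" is approximated by exact substring containment on the same strand, whereas a real pipeline would more likely use pairwise alignment.

```python
def ambiguous_fraction(seq):
    """Fraction of bases that are not A, C, G or T."""
    seq = seq.upper()
    if not seq:
        return 1.0
    return sum(base not in "ACGT" for base in seq) / len(seq)


def filter_viral_sequences(seqs, max_ambiguous=0.05, min_len=300):
    """Filter a dict of name -> sequence belonging to a single species."""
    kept = {
        name: seq.upper()
        for name, seq in seqs.items()
        if len(seq) >= min_len and ambiguous_fraction(seq) <= max_ambiguous
    }
    result = {}
    # Visit the longest sequences first, so that each shorter fragment is
    # only compared against longer sequences that were already retained.
    for name, seq in sorted(kept.items(), key=lambda item: -len(item[1])):
        if any(seq in longer for longer in result.values()):
            continue  # fully contained in a longer sequence, so it is dropped
        result[name] = seq
    return result
```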

[0115] Then, MMseqs2 software was used to cluster the viral sequences at the species level, with the criterion that each pair of sequences has a nucleotide identity >80% and an alignment coverage >80%. MMseqs2 is an efficient and sensitive sequence clustering and alignment tool that can rapidly perform homology clustering on large sequence sets, and is suitable for batch clustering of viral and other microbial sequences. Setting Identity > 80% and Coverage > 80% applies dual screening thresholds for sequence homology and alignment completeness: sequence pairs with insufficient homology or incompletely aligned regions are not grouped together, and only homologous sequences with high sequence consistency and sufficient alignment coverage are clustered into the same group. This ensures the accuracy and reliability of the clustering, effectively distinguishes viral sequences of different homologous categories, and removes redundant sequences to reduce data complexity, providing a standardized, high-quality sequence grouping foundation for subsequent accurate identification, evolutionary analysis, and functional annotation of viral sequences, and further improving the accuracy and efficiency of species-level virus identification.

[0116] If the clustering results are consistent with the original segment information of the species, the sequence with the best quality control results and the longest length in each cluster is selected as the representative sequence of that cluster and included in the third reference database. Specifically, the sequences with the best quality control results are those with a completeness >95%, a contamination level <5%, and a sequence length that is "not lower than the upper quartile (Q3, i.e., the 75th percentile) of the overall length distribution." The longest of these sequences is then selected as the representative sequence. If the clustering results are inconsistent with the original segment information of the species, these genomes are clustered again by subspecies to establish the third reference database.
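The selection of a representative sequence from one cluster can be sketched as below. Several points are assumptions rather than statements of this disclosure: the function and dictionary keys are hypothetical, the "overall length distribution" is taken to be the lengths within the cluster, and Q3 is computed with the inclusive quantile method of the Python standard library.

```python
import statistics


def pick_representative(cluster, min_completeness=95.0, max_contamination=5.0):
    """Pick the longest sequence among those passing quality control.

    cluster is a list of dicts with the keys
    'name', 'length', 'completeness' and 'contamination'.
    """
    if not cluster:
        return None
    lengths = [s["length"] for s in cluster]
    if len(lengths) > 1:
        # Third cut point of the quartiles, i.e. the upper quartile (Q3).
        q3 = statistics.quantiles(lengths, n=4, method="inclusive")[2]
    else:
        q3 = lengths[0]
    candidates = [
        s for s in cluster
        if s["completeness"] > min_completeness
        and s["contamination"] < max_contamination
        and s["length"] >= q3
    ]
    # None is returned when no sequence satisfies all three criteria.
    return max(candidates, key=lambda s: s["length"], default=None)
```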

[0117] Segment information refers to the existing taxonomic or phylogenetic classification of a species, such as serotype, sequence type (ST), or clonal lineage. If the clustering results are consistent with the existing segment information of the species, it indicates that the current clustering strategy can effectively reproduce the known population structure. In this case, the sequence with the best quality control index and longest length in each cluster is taken as the representative sequence of that cluster and included in the third reference database. If the clustering results are inconsistent with the existing segment information of the species, it suggests that the current clustering may be affected by subspecies recombination or other confounding factors. In this case, it is necessary to re-cluster the genome at the subspecies level, that is, to perform clustering within each subspecies, rather than at the species-wide level, thereby establishing the third reference database. This method can address intraspecific heterogeneity and ensure that the reference database accurately reflects the true phylogenetic framework at the subspecies level.

[0118] This step can establish a complete and accurate virus database, providing reliable data support for pathogen identification.

[0119] Step S40: Integrate the genetic data of fungi and parasites in public databases and construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0120] Specifically, this step involves constructing a fourth reference database based on fungal and parasitic genetic data from public databases. This fourth reference database is essentially a fungal and parasitic genetic reference database. The metagenomic-based human multi-site pathogen identification system includes a fourth construction module, which integrates fungal and parasitic genetic data from public databases and constructs the fourth reference database.

[0121] In some implementations, step S40 includes:

[0122] Integrate reference sequences of fungi and parasites from public databases;

[0123] Quality control was performed on the reference sequences of pathogenic fungi and parasites. Sequences shorter than 1000 bp and sequences that successfully aligned with human and bacterial genomes were removed. Genome data with a contamination level <5% and a completeness >95% were retained to obtain representative sequences of fungi and parasites.

[0124] All representative sequences of fungi and parasites were constructed into a fourth reference database.

[0125] Specifically, reference sequences of pathogenic fungi and parasites from existing public databases such as GenBank, NCBI-RefSeq, and IMG-VR are integrated. Quality control is performed on these genomic reference sequence data in FASTA format, with quality control parameters including: removing sequences shorter than 1000 bp, removing sequences that align with human and bacterial genomes, and, after CheckM evaluation, retaining genomic data with a contamination level <5% and a completeness >95%. This yields high-quality representative sequences of fungi and parasites, and all representative sequences of fungi and parasites are used to construct a fungal and parasite database, i.e., the fourth reference database.

[0126] This step can establish a complete and accurate database of fungi and parasites, providing reliable data support for pathogen identification.

[0127] Step S50: The metagenomic sequencing data of multiple human organ samples are sequentially compared with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal parasite sequences of the fourth reference database, and the gene parameters are calculated based on the comparison results.

[0128] Specifically, this step enables pathogen detection based on metagenomic sequencing data of human organs. The metagenomic-based human multi-site pathogen identification system includes a comparison module, which is used to sequentially compare the metagenomic sequencing data of multiple human organ samples with the database established in steps S10-S40, and calculate gene parameters based on the comparison results.

[0129] In some implementations, step S50 includes:

[0130] Obtain metagenomic sequencing data from samples of multiple human organs;

[0131] The metagenomic sequencing data is compared with the genus-specific gene set in the first reference database;

[0132] Gene sequences of a genus that are uniquely matched with the first reference database are extracted, and the genus-level gene sequences are compared with the species-specific genome sets within the same genus in the second reference database.

[0133] Gene sequences that are uniquely matched with the second reference database are classified into that species, and the number of species-specific genes and species genome coverage are counted.

[0134] If the metagenomic sequencing data does not match with either the first or second reference database, then the metagenomic sequencing data will be compared with the viral gene set in the third reference database.

[0135] Gene sequences that are uniquely matched with the third reference database are classified according to the classification information of viral gene reference sequences, and the abundance and genome coverage of viral sequences are statistically analyzed.

[0136] If the metagenomic sequencing data does not match the first reference database, the second reference database, and the third reference database, then the metagenomic sequencing data will be compared with the representative fungal parasite sequences in the fourth reference database.

[0137] Gene sequences that successfully matched the fourth reference database were classified according to the classification information of fungal parasite gene reference sequences, and the abundance and genome coverage of fungal parasite sequences were statistically analyzed.

[0138] Specifically, metagenomic sequencing data are obtained from samples of multiple human organs, such as blood, cerebrospinal fluid, bronchoalveolar lavage fluid, and feces. Quality control is performed on the mNGS sequencing data of the different human organ samples, including removal of host sequences, low-quality sequences, low-complexity sequences, and adapter sequences, to obtain high-quality sequence data. In essence, metagenomic data from different parts of the human body refers to the collection of all nucleic acid sequences obtained by collecting samples from different organs, tissues, or body surface areas (such as the lungs, intestines, skin, blood, urogenital tract, and oral cavity), extracting the total genomic DNA / RNA of all microorganisms (including bacteria, viruses, fungi, parasites, etc.), and performing metagenomic sequencing. It is essentially the sum of the genomic information of the microbial community in a specific part of the human body, comprehensively reflecting the species composition, gene function, and community structure characteristics of the microorganisms in that location.

[0139] The high-quality sequence data undergo hierarchical alignment across the databases. First, the sequenced reads are aligned with the genus-level core-specific gene set (i.e., the genus-level specific gene set in the first reference database). Reads that align uniquely to a specific genus are extracted and further aligned with the species-level core-specific gene set within that genus (i.e., the species-level specific gene set in the second reference database). Reads uniquely aligned to a specific species are classified as belonging to that species, and the number of aligned species-level core-specific genes and the species genome coverage are counted. Specifically, "aligned uniquely" means a single successful alignment; that is, there is one and only one gene sequence in the reference database that successfully aligns with the sequenced read. A uniquely aligned read can be directly identified as belonging to that genus / species.

[0140] Furthermore, if the metagenomic sequencing data aligns with the first reference database but is not uniquely aligned (i.e., the reference database contains multiple (more than one) gene sequences that align successfully with the sequencing data), then this sequence is extracted and further assembled. The assembled sequence is then used to classify species and calculate abundance information using the ANI algorithm. Similarly, if the metagenomic sequencing data aligns with the second reference database but is not uniquely aligned, sequence assembly is also performed, and the assembled sequence is used to classify species and calculate abundance information using the ANI algorithm.

[0141] Understandably, the ANI algorithm, or Average Nucleotide Identity (ANI) algorithm, is a core algorithm in the field of microbial species classification and identification used to quantify genome-level sequence similarity and define species relationships. Its core principle is to calculate the average nucleotide identity value of homologous sequence fragments by comparing homologous sequence fragments between two genomes, thereby quantifying the degree of homology at the genome level. This can accurately determine the species classification status corresponding to the spliced sequence and effectively solve the classification problem of non-unique matching in reference database comparisons.
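The averaging principle described above can be illustrated with a short sketch. It assumes that the per-fragment identities between an assembled sequence and each candidate reference genome have already been obtained from an alignment tool; the function names and the optional `min_identity` filter are hypothetical and are not taken from this disclosure.

```python
def average_nucleotide_identity(fragment_identities, min_identity=0.0):
    """Mean percent identity over the homologous fragments that were aligned."""
    kept = [identity for identity in fragment_identities if identity >= min_identity]
    if not kept:
        return None  # no homologous fragments, so ANI is undefined
    return sum(kept) / len(kept)


def best_ani_match(identities_by_reference):
    """Return (reference, ANI) for the candidate reference with the highest ANI."""
    scored = {
        ref: average_nucleotide_identity(values)
        for ref, values in identities_by_reference.items()
    }
    scored = {ref: ani for ref, ani in scored.items() if ani is not None}
    if not scored:
        return None
    best = max(scored, key=scored.get)
    return best, scored[best]
```

A species-level cutoff on the resulting ANI value could be applied afterwards; no such cutoff is stated in the text, so none is hard-coded here.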

[0142] For sequences that fail to align in the first and second reference databases, they are extracted and then aligned with the viral reference database (i.e., the third reference database). The sequences that are successfully aligned are classified according to the classification information of the reference sequences, and the sequence abundance and genome coverage are calculated.

[0143] Understandably, the failure of the metagenomic sequencing data to match the first and second reference databases means that there are no (i.e., zero) gene sequences in the first and second reference databases that match the sequencing data. In this case, the viral representative sequence from the third reference database is used for alignment.

[0144] Furthermore, if the metagenomic sequencing data can be matched with the third reference database but not uniquely matched, that is, if there are multiple (more than one) gene sequences in the reference database that have been successfully matched with the sequencing data, then the sequence is extracted and further sequence splicing is performed. The spliced sequence is used to classify species and calculate abundance information using the ANI algorithm.

[0145] For sequences that fail to align in the first, second, and third reference databases, they are extracted and aligned with the fungal and parasitic reference database (i.e., the fourth reference database). The uniquely aligned sequence is then classified according to the classification information of the reference sequence, and the sequence abundance and genome coverage are calculated.

[0146] Understandably, the failure of the metagenomic sequencing data to match the first, second, and third reference databases means that there are no (i.e., zero) gene sequences in the first, second, and third reference databases that match the sequencing data. In this case, a comparison with representative fungal parasite sequences from the fourth reference database is performed.

[0147] Furthermore, if the metagenomic sequencing data can be matched with the fourth reference database but not uniquely matched, that is, if there are multiple (more than one) gene sequences in the reference database that have been successfully matched with the sequencing data, then the sequence is extracted and further sequence splicing is performed. The spliced sequence is used to classify species and calculate abundance information using the ANI algorithm.

[0148] This step involves comparing metagenomic sequencing data sequentially with the first, second, third, and fourth reference databases, i.e., comparing them at the genus level for pathogens, the species level for pathogens, viruses, and fungal parasites. This effectively distinguishes microorganisms from different genera or different species within the same genus that have highly similar genomes. It also enables further identification of viruses and fungal parasites, covering a wide range of pathogen species and strains, thus improving the accuracy of identification.
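The database-by-database routing of a single read described in step S50 can be sketched as a small decision function. This is an illustration under assumptions: the function name and the returned labels are hypothetical, each argument is taken to be the list of reference entries that the read aligns to in the corresponding database, and the "genus_only" outcome covers a case (a unique genus match with no species match) that the text does not spell out.

```python
def route_read(genus, species, virus, fungi_parasite):
    """Decide how one read is handled, given its hits in the four databases.

    An empty list means that the read did not align to that database.
    """
    if genus:
        if len(genus) > 1:
            return ("assemble_and_ani", "genus")    # non-unique genus-level match
        if len(species) == 1:
            return ("species", species[0])          # unique species-level match
        if len(species) > 1:
            return ("assemble_and_ani", "species")  # non-unique species-level match
        return ("genus_only", genus[0])             # assumption: keep the genus call
    # Reads without a bacterial match fall through to the viral database,
    # and then to the fungal and parasitic database.
    for level, hits in (("virus", virus), ("fungi_parasite", fungi_parasite)):
        if len(hits) == 1:
            return (level, hits[0])
        if len(hits) > 1:
            return ("assemble_and_ani", level)
    return ("unclassified", None)
```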

[0149] Step S60: Identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters, and output an identification report.

[0150] Specifically, this step can identify background pathogens and pathogenic pathogens in multiple parts of the human body based on calculated gene parameters and output an identification report. The metagenomic-based human multi-site pathogen identification system includes an identification module, which is used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters and output an identification report.

[0151] In some implementations, step S60 includes:

[0152] Species abundance in each organ sample is calculated based on the gene parameters, using human organs as the unit.

[0153] The threshold for background pathogens is calculated based on the species abundance, and the squared difference of twice the average species abundance in all organ samples is used as the threshold for background pathogens.

[0154] Species exceeding the threshold will be identified as pathogenic pathogens.

[0155] Specifically, false positives are filtered out based on the ratio of theoretical to actual genome coverage and the proportion of species-specific core genes in the alignment. After alignment, the proportion of each detected species in all samples is statistically analyzed, using human organs as the unit, to determine the range of background species in the samples. Species abundance in each human body part is calculated, and the squared difference of twice the average species abundance across all samples is used as the threshold for background bacteria. Pathogens exceeding this threshold are considered pathogenic, while those below are considered opportunistic pathogens (background pathogens). An identification report is output, which may include organ type, pathogenic pathogen type, and background pathogen type. Furthermore, by combining previously integrated lists of pathogenic, opportunistic, and hospital-acquired infections, risk warnings can be issued for opportunistic pathogens that have already transformed into pathogenic pathogens.
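The final split into pathogenic and background species for one organ can be sketched as follows. The threshold is passed in as an argument rather than computed, because this sketch makes no assumption about how the background threshold formula above is to be evaluated; the function name and report keys are hypothetical, and a species exactly at the threshold is placed in the background group.

```python
def build_report(organ, abundance_by_species, threshold):
    """Split the detected species of one organ sample by a background threshold."""
    pathogenic = sorted(
        species for species, abundance in abundance_by_species.items()
        if abundance > threshold
    )
    background = sorted(
        species for species, abundance in abundance_by_species.items()
        if abundance <= threshold
    )
    return {"organ": organ, "pathogenic": pathogenic, "background": background}
```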

[0156] Furthermore, step S50 allows for the calculation of gene parameters for genus-level pathogens, species-level pathogens, viruses, and fungal parasites. Specifically, it calculates the species abundance, the Coverage value, and the ratio of genic to non-genic reads for each type of pathogen. The specific calculation process for each parameter is as follows. Species abundance is calculated using the number of reads aligned to the target species per million sequencing reads (RPM) combined with the genome length of the target species. The formula is: Species abundance (RPM) = (Number of valid reads aligned to the target species ÷ Total number of sequencing reads) × 10^6. The Coverage value is calculated by dividing the total number of bases aligned to the target species' reference sequence by the total length of the target species' reference sequence. The formula is: Actual Coverage = (Total number of bases aligned to the target sequence ÷ Total length of the target sequence) × 100%. The theoretical Coverage value is also calculated based on sequencing depth. The ratio of reads between genic and non-genic regions is determined by identifying the two types of regions using the target species' reference genome annotation information. The ratio (number of genic region reads ÷ number of non-genic region reads) is calculated after counting the number of valid reads aligned to each region. A preset screening threshold is then set based on the ratio of the theoretical to the actual Coverage value and the ratio of genic to non-genic region reads, to filter out false positive sequences with sequencing bias, alignment anomalies, and abnormal ratios. After filtering, knowledge base annotation is performed. Based on a preset pathogen classification knowledge base, each species is annotated as a pathogenic bacterium, opportunistic pathogen, colonizing bacterium, or ESCAPE nosocomial pathogen, ultimately distinguishing between background pathogens and pathogenic pathogens.
The identification method proposed in this application is suited to species detection tasks that cannot be performed accurately using traditional mNGS methods. It combines pan-genome analysis with mNGS to provide a detection and analysis tool for the pathogenic pathogens and background pathogens in different human organs. It is easy to use, accurate, and fast. It solves the problem of accurately detecting different species that are phylogenetically close or have high nucleotide similarity. At the same time, the detection resolution is improved to the subtype level, and the difficulty of distinguishing background pathogens from true pathogenic pathogens in different human organs is addressed.
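The parameter formulas given in paragraph [0156] can be written out as a short sketch. The RPM, actual Coverage, and genic / non-genic ratio follow the formulas as stated. The theoretical Coverage function is an assumption: the text gives no formula, so the standard Lander-Waterman expectation, 1 - exp(-depth), is used here as a stand-in. All function names are hypothetical.

```python
import math


def species_abundance_rpm(reads_on_target, total_reads):
    """Valid reads aligned to the target species per million sequencing reads."""
    return reads_on_target * 1_000_000 / total_reads


def actual_coverage_pct(aligned_bases, reference_length):
    """Total aligned bases divided by the reference length, in percent."""
    return aligned_bases * 100.0 / reference_length


def theoretical_coverage_pct(reads_on_target, read_length, reference_length):
    """Expected covered fraction at the observed depth (Lander-Waterman, an assumption)."""
    depth = reads_on_target * read_length / reference_length
    return (1.0 - math.exp(-depth)) * 100.0


def genic_ratio(genic_reads, non_genic_reads):
    """Number of genic-region reads divided by the number of non-genic-region reads."""
    if non_genic_reads == 0:
        return math.inf
    return genic_reads / non_genic_reads
```

For example, 50 valid reads out of 10,000,000 total reads give an abundance of 5 RPM.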

[0157] Understandably, while preliminary progress has been made in microbial detection using metagenomic next-generation sequencing (mNGS) both domestically and internationally, its clinical application is limited by the inherent deficiencies of reference databases. Current analyses heavily rely on microbial reference genome sequences for assessing relative abundance of species, but mainstream databases (such as NCBI RefSeq) suffer from key shortcomings such as insufficient coverage, limited classification accuracy, and missing subtype information. The use of these standard genomic reference databases makes it easy for classification errors to occur between different genera or even different species within the same genus of microorganisms with highly similar genomes. Furthermore, subtype sequences of important pathogens (such as influenza virus HA / NA subtypes) are not included in existing reference databases, creating identification blind spots. These deficiencies further amplify the interference effect of background microorganisms in the human body. This limitation of reference databases not only hinders the accurate analysis of multiple pathogens in mixed infections but also limits the application value of mNGS in rapid clinical etiological diagnosis.

[0158] The identification method proposed in this application meets the need to detect true pathogenic pathogens in samples from different parts of the human body. It can identify different species that are closely related or have high nucleotide similarity, and can detect important pathogen subtypes. The reference database uses the specific core gene sets of each pathogen species at the genus / species level and the representative sequences of the selected subtypes as its basic data. The metagenomic sequencing data are compared against the reference database, and different schemes and comparison algorithms are used to detect pathogen species in samples from multiple parts of the human body at the genus level and the species level, covering bacteria, viruses, fungi, and parasites.

[0159] In summary, the metagenomic-based method for identifying pathogens from multiple sites in the human body provided in this disclosure has a complete process, is easy to use, has a high accuracy rate, can identify a wide range of pathogen species and bacterial / viral strains, can directly output identification results accurate to the subtype level, and can simultaneously distinguish between background species and true pathogens in human samples, providing risk warnings, thus having high application value.

[0160] Please refer to Figure 2, which is a schematic diagram of the human multi-site pathogen identification system based on metagenomics provided in Embodiment 2 of this disclosure. The human multi-site pathogen identification system 100 based on metagenomics provided in this disclosure includes:

[0161] The first construction module is used to analyze the genus-level pan-genome of pathogens and construct a first reference database; the first reference database includes a genus-level specific gene set.

[0162] The second construction module is used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and to construct a second reference database; the second reference database includes a species-specific genome set.

[0163] The third construction module is used to integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences.

[0164] The fourth construction module is used to integrate the genetic data of fungi and parasites in public databases and construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0165] The alignment module is used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal parasite sequences of the fourth reference database, and calculate gene parameters based on the alignment results.

[0166] The identification module is used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters, and output an identification report.

[0167] Embodiment 3 of this disclosure also provides a database construction platform for identifying pathogens from multiple sites in the human body. Specifically, the database construction platform for identifying pathogens from multiple sites in the human body provided in this disclosure includes:

[0168] The first construction module is used to analyze the genus-level pan-genome of pathogens and construct a first reference database; the first reference database includes a genus-level specific gene set.

[0169] The second construction module is used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and to construct a second reference database; the second reference database includes a species-specific genome set.

[0170] The third construction module is used to integrate viral genetic data from public databases and construct a third reference database; the third reference database includes representative viral sequences.

[0171] The fourth construction module is used to integrate the genetic data of fungi and parasites in public databases and construct a fourth reference database; the fourth reference database includes representative sequences of fungi and parasites.

[0172] The genus-level unique gene set of the first reference database, the species-level unique genome set of the second reference database, the representative virus sequence of the third reference database, and the representative fungal parasite sequence of the fourth reference database are used to compare with metagenomic sequencing data of multiple human organ samples in turn to identify background pathogens and pathogenic pathogens in multiple parts of the human body.

[0173] Embodiment 4 of this disclosure also provides a metagenomics-based platform for identifying pathogens from multiple sites in the human body, specifically including:

[0174] A storage module is used to store a pre-constructed first reference database, second reference database, third reference database, and fourth reference database; the first reference database includes a genus-level unique gene set, the second reference database includes a species-level unique genome set, the third reference database includes representative viral sequences, and the fourth reference database includes representative fungal and parasitic sequences.

[0175] The alignment module is used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database, and to calculate gene parameters based on the alignment results.

[0176] The identification module is used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters, and output an identification report.

[0177] Understandably, the aforementioned functional modules can be stored in memory as software programs and executed by a processor. In alternative embodiments, the aforementioned functional modules can also be hardware with specific functions, such as chips programmed with specific software.

[0178] It should be noted that, in practice, the metagenomics-based method for identifying pathogens from multiple human sites can be implemented using the metagenomics-based system 100, the database construction platform for identifying pathogens from multiple human sites, and the metagenomics-based platform for identifying pathogens from multiple human sites described in the above embodiments. The metagenomics-based system 100, the database construction platform for identifying pathogens from multiple human sites, and the metagenomics-based platform for identifying pathogens from multiple human sites can perform database construction and identify background pathogens and pathogenic pathogens from multiple human sites using one or more specific implementations of the metagenomics-based method for identifying pathogens from multiple human sites described in the above embodiments. That is, all embodiments of the metagenomics-based method for identifying pathogens from multiple human sites provided in this disclosure are applicable to the metagenomics-based system 100, the database construction platform for identifying pathogens from multiple human sites, and the metagenomics-based platform for identifying pathogens from multiple human sites provided in this disclosure, and can all achieve the same or similar beneficial effects, which will not be elaborated upon here.

[0179] Please see Figure 3. This disclosure also provides an electronic device, including a memory 210 and one or more processors 220.

[0180] Specifically, the memory 210 is used to store one or more computer programs; when the one or more computer programs are executed by one or more processors 220, the method for identifying pathogens in multiple parts of the human body based on metagenomics described in Embodiment 1 above is implemented.

[0181] The memory 210 may be, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc. The memory 210 stores programs, and the processor 220 runs these programs after receiving execution instructions to implement the metagenomic-based multi-site pathogen identification method for humans described in Embodiment 1. It is understood that access to the memory 210 by the processor 220 and other possible components can be performed under the control of a memory controller.

[0182] The processor 220 may be an integrated circuit chip with signal processing capabilities. The processor 220 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, capable of implementing or executing the methods and steps disclosed in Embodiment 1 of this disclosure.

[0183] This disclosure also provides a computer storage medium storing a computer program, which, when executed by a processor, implements the metagenomic-based method for identifying pathogens from multiple parts of the human body as described in Embodiment 1 above.

[0184] This disclosure also provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the metagenomic-based method for identifying pathogens in multiple parts of the human body as described in Embodiment 1 above.

[0185] In summary, the metagenomics-based method, system, platform, device, medium, and product for identifying pathogens from multiple human sites provided in this disclosure analyze existing pan-genome data to construct a first reference database including a genus-level unique gene set, a second reference database including a species-level unique genome set, a third reference database including representative viral sequences, and a fourth reference database including representative fungal and parasitic sequences. The metagenomic sequencing data from multiple human sites are then compared sequentially with the first, second, third, and fourth reference databases, i.e., in the order of genus-level pathogens, species-level pathogens, viruses, and fungi and parasites. This effectively distinguishes microorganisms of different genera, or of different species within the same genus, that have highly similar genomes. It also enables the identification of viruses, fungi, and parasites, covers a wide range of pathogen species and strains, improves identification accuracy, and directly outputs identification results accurate to the subtype level. At the same time, it can distinguish background species from true pathogens in human samples, and thus has high application value.

[0186] The above description is merely an embodiment of this disclosure and does not limit the patent scope of this disclosure. Any equivalent structural or procedural transformations made using the content of this disclosure and its drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this disclosure.

Claims

1. A method for identifying pathogens from multiple sites in the human body based on metagenomics, characterized in that it includes the following steps: analyzing the genus-level pan-genome of pathogenic bacteria and constructing a first reference database, the first reference database including a set of genus-specific genes; analyzing the species-level pan-genome of pathogenic bacteria within the same genus and constructing a second reference database, the second reference database including a set of species-specific genomes; integrating viral genetic data from public databases and constructing a third reference database, the third reference database including representative viral sequences; integrating genetic data of fungi and parasites from public databases and constructing a fourth reference database, the fourth reference database including representative sequences of fungi and parasites; sequentially comparing metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database, and calculating gene parameters based on the comparison results; and, based on the gene parameters, identifying background pathogens and pathogenic pathogens in multiple parts of the human body and generating an identification report.

2. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in claim 1, characterized in that analyzing the genus-level pan-genome of pathogens and constructing a first reference database includes: performing quality control on genus-level genomic data; after quality control, selecting the longest sequence from the reference genome of each genus-level species and several sequences of each genus-level species sequence subtype as the pan-genome; identifying the core gene set of the genus within the pan-genome; performing comparative analysis of the core gene sets of all pathogens from different genera, and deriving genus-specific gene sets by comparing the core gene sets of different genera pairwise; and constructing the genus-specific gene sets of each genus into the first reference database.
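The pairwise comparison step of claim 2 can be illustrated with a minimal Python sketch. It assumes that each genus's core gene set has already been reduced to a set of comparable gene identifiers; the claim does not state how genes are matched across genera, so this is a simplification. The function name `genus_specific_genes` and the example identifiers are hypothetical.

```python
from typing import Dict, Set


def genus_specific_genes(core_sets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each genus, keep the core genes that appear in no other genus's core set."""
    specific: Dict[str, Set[str]] = {}
    for genus, genes in core_sets.items():
        remaining = set(genes)
        for other, other_genes in core_sets.items():
            if other != genus:
                # Pairwise comparison: drop genes shared with this other genus.
                remaining -= other_genes
        specific[genus] = remaining
    return specific


# Hypothetical core gene sets (identifiers are illustrative only).
core_sets = {
    "GenusA": {"g1", "g2", "g3"},
    "GenusB": {"g2", "g4"},
    "GenusC": {"g3", "g5"},
}
first_reference_db = genus_specific_genes(core_sets)
```

In this toy input, `g2` and `g3` are shared between genera, so they are excluded from every genus-specific set.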

3. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in claim 1, characterized in that analyzing the species-level pan-genome of pathogenic bacteria within the same genus and constructing a second reference database includes: performing quality control on species-level genomic data; performing gene identification on the quality-controlled genomic data to generate a pan-genome for the species; identifying the core gene set of each species within its pan-genome; performing comparative analysis of the core gene sets of different species within the same genus, and deriving species-specific gene sets by comparing the core gene sets of different species pairwise; and constructing the species-specific gene sets of each species into the second reference database.

4. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in claim 1, characterized in that integrating viral genetic data from public databases and constructing a third reference database includes: integrating viral genomes and representative fragment sequences from public databases, removing bacteriophages, and generating a pathogen gene sequence set that does not contain bacteriophages; filtering the pathogen gene sequence set to remove gene sequences with an ambiguous base content >5%, repetitive fragment gene sequences, and gene sequences with a length <300 bp; clustering the viral sequences in the filtered pathogen gene sequence set; if the clustering results are consistent with the fragment information of the species, selecting the longest sequence from each clustering branch after quality control as the representative viral sequence of that clustering branch, and constructing all the representative viral sequences into the third reference database; and if the clustering results are inconsistent with the fragment information of the species, re-clustering the inconsistent genomes by subspecies.
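The filtering step of claim 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: any base other than A, C, G, or T is counted toward the 5% limit, and "repetitive" sequences are treated as exact duplicates only. The function names and example records are hypothetical.

```python
from typing import Dict

AMBIGUOUS_MAX = 0.05  # remove sequences whose ambiguous-base content is > 5%
MIN_LENGTH = 300      # remove sequences shorter than 300 bp


def ambiguous_fraction(seq: str) -> float:
    """Fraction of bases that are not A, C, G or T (assumed definition)."""
    if not seq:
        return 1.0
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return (len(seq) - unambiguous) / len(seq)


def filter_viral_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Apply the length, ambiguous-base and duplicate filters, keeping input order."""
    kept: Dict[str, str] = {}
    seen = set()
    for seq_id, raw in records.items():
        seq = raw.upper()
        if len(seq) < MIN_LENGTH:
            continue
        if ambiguous_fraction(seq) > AMBIGUOUS_MAX:
            continue
        if seq in seen:  # assumption: only exact duplicates count as repeats
            continue
        seen.add(seq)
        kept[seq_id] = seq
    return kept


records = {
    "v1": "ACGT" * 100,             # 400 bp, clean
    "v2": "ACGT" * 100,             # exact duplicate of v1
    "v3": "ACGT" * 50,              # 200 bp, too short
    "v4": "ACGT" * 90 + "N" * 40,   # 400 bp, 10% ambiguous
    "v5": "ACGT" * 100 + "N" * 10,  # 410 bp, about 2.4% ambiguous
}
filtered = filter_viral_sequences(records)
```

Clustering and selection of the longest sequence per branch would follow this filter and are not shown here.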

5. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in claim 1, characterized in that integrating genetic data of fungi and parasites from public databases and constructing a fourth reference database includes: integrating reference sequences of fungi and parasites from public databases; performing quality control on the reference sequences of pathogenic fungi and parasites, removing sequences shorter than 1000 bp and sequences that successfully align with human and bacterial genomes, and retaining genome data with a contamination level <5% and completeness >95% to obtain representative sequences of fungi and parasites; and constructing all the representative sequences of fungi and parasites into the fourth reference database.
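The quality-control criteria of claim 5 can be expressed as a single predicate. The sketch below is a simplification: it folds the per-sequence length filter and the per-genome contamination and completeness filters into one hypothetical `GenomeRecord`, and it assumes the alignment against human and bacterial genomes has already been run elsewhere and recorded as a boolean.

```python
from dataclasses import dataclass


@dataclass
class GenomeRecord:
    name: str
    length_bp: int
    hits_human_or_bacteria: bool  # outcome of a prior alignment step (not shown)
    contamination: float          # fraction between 0.0 and 1.0
    completeness: float           # fraction between 0.0 and 1.0


def passes_qc(rec: GenomeRecord) -> bool:
    """Return True if the record satisfies all four criteria named in claim 5."""
    return (
        rec.length_bp >= 1000
        and not rec.hits_human_or_bacteria
        and rec.contamination < 0.05
        and rec.completeness > 0.95
    )


# Hypothetical candidates; names and values are illustrative only.
candidates = [
    GenomeRecord("fungus_ok", 12_000_000, False, 0.01, 0.98),
    GenomeRecord("too_short", 800, False, 0.00, 0.99),
    GenomeRecord("host_like", 5_000, True, 0.01, 0.99),
    GenomeRecord("contaminated", 9_000_000, False, 0.08, 0.97),
    GenomeRecord("incomplete", 9_000_000, False, 0.02, 0.90),
]
fourth_reference_db = [rec.name for rec in candidates if passes_qc(rec)]
```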

6. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in claim 1, characterized in that sequentially comparing, database by database, the metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database, and calculating gene parameters based on the comparison results, includes: obtaining metagenomic sequencing data from samples of multiple human organs; comparing the metagenomic sequencing data with the genus-specific gene set in the first reference database; extracting the gene sequences that are uniquely matched to a genus in the first reference database, and comparing these genus-level gene sequences with the species-specific genome sets of the same genus in the second reference database; classifying gene sequences that are uniquely matched with the second reference database into the corresponding species, and counting the number of species-specific genes and the species genome coverage; if the metagenomic sequencing data matches neither the first nor the second reference database, comparing the metagenomic sequencing data with the viral gene set in the third reference database; classifying gene sequences that are uniquely matched with the third reference database according to the classification information of the viral gene reference sequences, and computing the abundance and genome coverage of the viral sequences; if the metagenomic sequencing data matches none of the first, second, and third reference databases, comparing the metagenomic sequencing data with the representative fungal and parasitic sequences in the fourth reference database; and classifying gene sequences that are successfully matched with the fourth reference database according to the classification information of the fungal and parasitic gene reference sequences, and computing the abundance and genome coverage of the fungal and parasitic sequences.
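The database-by-database order in claim 6 can be sketched as a cascade. In this minimal Python illustration the matchers are toy substring lookups standing in for a real sequence aligner; `marker_matcher`, `classify_read`, and all marker strings are hypothetical. One assumption is made where the claim is silent: a read that matches a genus but no species within it is reported at genus level rather than passed on to the viral database.

```python
from typing import Callable, Dict, Optional, Tuple

Matcher = Callable[[str], Optional[str]]  # returns a taxon on a unique match, else None


def marker_matcher(markers: Dict[str, str]) -> Matcher:
    """Toy stand-in for an aligner: a read matches if it hits exactly one taxon."""
    def match(read: str) -> Optional[str]:
        hits = {taxon for marker, taxon in markers.items() if marker in read}
        return hits.pop() if len(hits) == 1 else None
    return match


def classify_read(
    read: str,
    genus_db: Matcher,
    species_dbs: Dict[str, Matcher],
    virus_db: Matcher,
    fungal_parasite_db: Matcher,
) -> Tuple[str, Optional[str]]:
    genus = genus_db(read)                  # first reference database
    if genus is not None:
        species_db = species_dbs.get(genus)  # second database, same genus only
        species = species_db(read) if species_db else None
        if species is not None:
            return ("species", species)
        return ("genus", genus)              # assumption: keep the genus-level call
    virus = virus_db(read)                   # third reference database
    if virus is not None:
        return ("virus", virus)
    other = fungal_parasite_db(read)         # fourth reference database
    if other is not None:
        return ("fungus_or_parasite", other)
    return ("unclassified", None)


genus_db = marker_matcher({"AAAA": "GenusA"})
species_dbs = {"GenusA": marker_matcher({"CCCC": "GenusA speciesX"})}
virus_db = marker_matcher({"GGGG": "VirusV"})
fungal_parasite_db = marker_matcher({"TTTT": "FungusF"})

reads = {"r1": "AAAACCCC", "r2": "AAAAGT", "r3": "ACGGGGTA", "r4": "ATTTTA", "r5": "ACGTACGT"}
calls = {
    rid: classify_read(seq, genus_db, species_dbs, virus_db, fungal_parasite_db)
    for rid, seq in reads.items()
}
```

Counting species-specific genes, abundance, and genome coverage would be computed from these per-read calls and is omitted from the sketch.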

7. The method for identifying pathogens from multiple sites of the human body based on metagenomics as described in any one of claims 1-6, characterized in that identifying, based on the gene parameters, background pathogens and pathogenic pathogens in multiple parts of the human body and generating an identification report includes: calculating the species abundance in each organ sample based on the gene parameters, using human organs as the unit; calculating the threshold for background pathogens based on the species abundance, wherein the squared difference of twice the average species abundance in all organ samples is used as the threshold for background pathogens; and identifying species that do not exceed the threshold as background pathogens and species that exceed the threshold as pathogenic pathogens.
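The final classification step of claim 7 can be sketched in Python as a simple split around a threshold. The threshold is taken here as a given number: the sketch deliberately does not reproduce the claim's threshold formula, and the value used in the example is purely illustrative. The function name `split_by_threshold` and the species names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    abundance: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Species at or below the threshold are background; those above are pathogenic."""
    background = sorted(sp for sp, a in abundance.items() if a <= threshold)
    pathogenic = sorted(sp for sp, a in abundance.items() if a > threshold)
    return background, pathogenic


# Hypothetical species abundances for one organ sample.
lung_abundance = {"species_a": 0.002, "species_b": 0.150, "species_c": 0.010}
background, pathogenic = split_by_threshold(lung_abundance, threshold=0.010)
```

A species whose abundance equals the threshold does not exceed it, so it falls on the background side, matching the claim's wording.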

8. A metagenomics-based system for identifying pathogens from multiple sites in the human body, characterized in that it is implemented through the following modules: a first construction module, used to analyze the genus-level pan-genome of pathogens and construct a first reference database, the first reference database including a genus-level specific gene set; a second construction module, used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and construct a second reference database, the second reference database including a species-specific genome set; a third construction module, used to integrate viral genetic data from public databases and construct a third reference database, the third reference database including representative viral sequences; a fourth construction module, used to integrate the genetic data of fungi and parasites from public databases and construct a fourth reference database, the fourth reference database including representative sequences of fungi and parasites; an alignment module, used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database, and to calculate gene parameters based on the alignment results; and an identification module, used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters and to output an identification report.

9. A database construction platform for identifying pathogens from multiple sites in the human body, characterized in that it includes: a first construction module, used to analyze the genus-level pan-genome of pathogens and construct a first reference database, the first reference database including a genus-specific gene set; a second construction module, used to analyze the species-level pan-genome of pathogenic bacteria within the same genus and construct a second reference database, the second reference database including a species-specific genome set; a third construction module, used to integrate viral genetic data from public databases and construct a third reference database, the third reference database including representative viral sequences; and a fourth construction module, used to integrate the genetic data of fungi and parasites from public databases and construct a fourth reference database, the fourth reference database including representative sequences of fungi and parasites; wherein the genus-level unique gene set of the first reference database, the species-level unique genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database are compared in turn with metagenomic sequencing data of multiple human organ samples to identify background pathogens and pathogenic pathogens in multiple parts of the human body.

10. A metagenomics-based platform for identifying pathogens from multiple sites in the human body, characterized in that it includes: a storage module, used to store a pre-constructed first reference database, second reference database, third reference database, and fourth reference database, wherein the first reference database includes a genus-level unique gene set, the second reference database includes a species-level unique genome set, the third reference database includes representative viral sequences, and the fourth reference database includes representative fungal and parasitic sequences; an alignment module, used to sequentially align metagenomic sequencing data of multiple human organ samples with the genus-specific gene set of the first reference database, the species-specific genome set of the second reference database, the representative viral sequences of the third reference database, and the representative fungal and parasitic sequences of the fourth reference database, and to calculate gene parameters based on the alignment results; and an identification module, used to identify background pathogens and pathogenic pathogens in multiple parts of the human body based on the gene parameters and to output an identification report.

11. An electronic device, characterized in that it includes: a memory and one or more processors; the memory is used to store one or more computer programs; and when the one or more computer programs are executed by the one or more processors, the method for identifying pathogens in multiple parts of the human body based on metagenomics as described in any one of claims 1-7 is implemented.

12. A computer storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for identifying pathogens in multiple parts of the human body based on metagenomics as described in any one of claims 1-7.