Method for identifying viruses infecting dormant bacteria in river sediments and use thereof

By combining propidium monoazide (PMA) treatment with multi-dimensional gene comparison, the method addresses dead-bacteria interference and false positives in the identification of viruses of dormant bacteria, enabling accurate identification of dormant hosts and accurate analysis of virus-host relationships. This technology is applicable to microbial monitoring and environmental management of river ecosystems.

CN121963871B · Active · Publication Date: 2026-06-16 · HOHAI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HOHAI UNIV
Filing Date
2026-03-30
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies cannot effectively exclude DNA interference from dead bacteria, struggle to accurately identify dormant hosts in environmental samples, and have a high false-positive rate for virus-host matching. Traditional methods cannot accurately analyze the relationship between dormant bacteria and their infecting viruses in complex environmental samples.

Method used

The method constructs a multi-constraint screening mechanism comprising viability screening, joint determination of dormancy features, cross-validation of viruses using multiple models, and high-stringency CRISPR matching. It combines propidium monoazide (PMA) treatment, which suppresses the DNA signal of dead cells, with multi-dimensional gene alignment and CRISPR spacer sequence alignment to ensure accurate matching between the virus and the host.

🎯Benefits of technology

It effectively eliminates interference from dead bacteria, improves the accuracy of results, accurately identifies dormant bacteria, reduces the false positive rate of virus-host matching, and breaks through the limitations of laboratory isolation and culture of dormant bacteria. It is suitable for microbial analysis of complex aquatic sediment ecosystems.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a method for identifying viruses infecting dormant bacteria in river sediments and application thereof. The application relates to the technical field of microbiology and bioinformatics, and solves the problems that the prior art cannot effectively distinguish dead bacteria DNA interference, cannot accurately identify dormant hosts in environmental samples, and has a high false positive rate of virus-host matching. The application treats river sediment samples with the dye propidium monoazide (PMA), obtains DNA of the active microbial community, and carries out high-throughput sequencing; performs multi-module cooperative dormancy characteristic analysis on the assembled bacterial genomes and defines candidate genomes of dormant hosts; identifies candidate virus sequences by triple cross-validation; and extracts CRISPR spacer sequences from the genomes of the dormant hosts and compares them with the candidate virus sequences, thereby establishing the infection relationship between the viruses and the dormant hosts. The application can reduce the false positive rate in the virus-host matching process and achieves accurate identification of viruses infecting dormant bacteria in complex environmental samples.

Description

Technical Field

[0001] This invention relates to the fields of microbiology and bioinformatics, and in particular to a method for identifying viruses infecting dormant bacteria in river sediments and its application.

Background Technology

[0002] River sediments are among the most biodiverse habitats in aquatic ecosystems. Under complex and variable environmental pressures such as nutrient deprivation, extreme temperatures, and pollutant stress, many bacteria enter a dormant state to survive. Dormant bacteria play a crucial role in environmental material cycling and ecosystem stability. Meanwhile, viruses, as the most abundant biological entities in the environment, regulate the structure and function of microbial communities by infecting host bacteria. However, the interaction mechanisms between free viruses and dormant bacteria in the environment remain a blind spot in microbial ecology research.

[0003] In existing technologies, monitoring the infection relationship between viruses and hosts in the environment faces three main technical bottlenecks. First, traditional methods rely heavily on the pure culture of microorganisms, but more than 99% of microorganisms in the natural environment, dormant bacteria in particular, cannot be cultured in the laboratory. A large amount of potential "virus-dormant host" interaction information is therefore missed. Second, although metagenomic technology based on high-throughput sequencing has been widely used in environmental microbiology research, conventional DNA extraction methods cannot distinguish DNA of dead bacteria from DNA of live bacteria. Because of this dead-cell DNA interference, sequencing results do not accurately reflect the potentially active microbial community. Third, there is currently no efficient, systematic, and multi-dimensional bioinformatics analysis strategy for accurately identifying dormant bacteria and the viruses specific to them.

[0004] It is evident that traditional microbial analysis methods and conventional metagenomic techniques have significant limitations in accurately identifying dormant bacteria and their infecting viruses in complex environmental samples. Currently, there is an urgent need for a novel identification method that can effectively eliminate interference from dead bacteria, accurately identify dormant hosts, and establish precise infection relationships.

Summary of the Invention

[0005] The purpose of this invention is to provide a method for identifying viruses infecting dormant bacteria in river sediments, and its application, in order to solve the following problems in the prior art: dead bacterial DNA interference cannot be effectively distinguished, dormant hosts in environmental samples are difficult to identify accurately, and virus-host matching has a high false positive rate.

[0006] Firstly, this invention provides a method for identifying viruses infecting dormant bacteria in river sediments. By constructing a multi-constraint screening mechanism that includes activity screening, collaborative determination of dormant characteristics, cross-validation of multiple virus models, and high-strictness CRISPR matching, a systematic identification process for dormant hosts and their infecting viruses is established. The method specifically includes the following steps:

[0007] Step 1: Collect river sediment samples, add the samples to a propidium monoazide (PMA) solution and incubate in the dark, then activate by light irradiation; extract DNA from the PMA-treated samples, which represents the live bacteria, and perform high-throughput sequencing to obtain raw sequencing data.

[0008] Step 2: Perform quality filtering on the raw sequencing data to remove adapters and low-quality sequences, obtaining clean sequences; perform de novo assembly and binning on the clean sequences to obtain bacterial genome sequences.

[0009] Step 3: Perform quantitative analysis of dormancy characteristics on the bacterial genome sequence; wherein, spore formation-related genes, toxin-antitoxin system genes, and resuscitation-promoting factor genes are identified respectively; calculate the proportion of the above three types of genes to the total number of functional genes in each genome; after standardizing the proportions of the three types of genes, substitute them into a weighted scoring model to calculate the comprehensive score S. When S reaches a preset threshold, the genome is defined as a candidate genome sequence for a dormant host.

[0010] Step 4: Perform virus identification and screening on the assembled sequences obtained from de novo assembly in Step 2. Use the following three types of virus identification models with different algorithm principles to independently predict the same assembled sequence: deep learning virus identification model based on sequence pattern learning, prediction model based on viral feature marker genes, and virus protein domain identification model based on hidden Markov model. Only when the same sequence is identified as a viral sequence by all three types of models at the same time will it be defined as a candidate viral sequence.

[0011] Step 5: Predict and extract CRISPR spacer sequences from the candidate genome sequences of the dormant host obtained in Step 3. Compare the CRISPR spacer sequences with the candidate virus sequences obtained in Step 4. Determine the infection relationship between the virus and the host based on the homology of the comparison, and finally identify the virus that infects dormant bacteria.

[0012] There are sequential constraints between the steps of this invention. The candidate set generated by the previous screening step directly limits the input range of the next screening step, forming a progressive filtering structure rather than a parallel combination.

[0013] Optionally, in step one, the final concentration of the propidium monoazide (PMA) solution used is 20μM-100μM; the incubation time in the dark is 5min-15min; and the light irradiation activation uses a halogen lamp or an LED lamp, with an irradiation time of 10min-20min.

[0014] Optionally, in step two, the length of the assembled bacterial genome sequence can be set to ≥1000bp.

[0015] Optionally, in step three, the comprehensive score S is: S = w1 P'_spore + w2 P'_TA + w3 P'_Rpf;

[0016] Wherein P'_spore, P'_TA, and P'_Rpf are the standardized values of the gene proportions of the spore formation regulation module, the toxin-antitoxin regulation module, and the resuscitation promotion module, respectively, obtained by Z-score or Min-Max standardization of the proportion values; the weights w1, w2, and w3 are determined by optimization using the receiver operating characteristic curve of the training sample set; the preset threshold is determined based on positive and negative control genome training.

[0017] Optionally, the spore formation regulation module is applicable to bacterial groups with spore-forming ability, and the weight of the spore formation regulation module is set to 0 in bacterial groups without spore-forming ability; the spore formation regulation module includes Spo0A, Spo0F and their homologous protein families, the toxin-antitoxin regulation module includes HipA-HipB, YafQ-DinJ, MazE-MazF and RelE-RelB and their homologous protein families, and the resuscitation promotion function module includes Rpf family proteins or their domain homologous proteins.
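
The weighted score and the spore-module rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary keys, and the `spore_forming` flag are hypothetical, the default weights are those of the preferred embodiment given later in the description, and leaving the remaining weights unchanged (not renormalized) when w1 is set to 0 is an assumption, since the text does not specify it.

```python
def comprehensive_score(p_std, weights=(0.4, 0.35, 0.25), spore_forming=True):
    """Return S = w1*P'_spore + w2*P'_TA + w3*P'_Rpf for one genome.

    p_std: standardized module proportions with keys "spore", "TA", "Rpf"
           (hypothetical key names chosen for this sketch).
    spore_forming: when False, the spore module weight w1 is set to 0.
    """
    w1, w2, w3 = weights
    if not spore_forming:
        # Assumption: the other weights are left as-is, not renormalized.
        w1 = 0.0
    return w1 * p_std["spore"] + w2 * p_std["TA"] + w3 * p_std["Rpf"]


if __name__ == "__main__":
    genome = {"spore": 0.9, "TA": 0.6, "Rpf": 0.4}
    print(comprehensive_score(genome))
    print(comprehensive_score(genome, spore_forming=False))
```

A genome would then be called a dormant-host candidate when the returned value reaches the preset threshold.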

[0018] Optionally, in step four, the three virus identification models with different algorithm principles are selected from DeepVirFinder, VirSorter, VIBRANT, or virus identification models implemented based on the same algorithm principle as the above three virus identification models; the length screening parameter of the assembled candidate virus sequence is set to ≥5000bp.

[0019] Optionally, in step five, the CRISPR spacer sequence is compared with the candidate virus sequence, and the parameters for determining whether there is an infection relationship are set as follows: coverage ≥90%, sequence identity ≥80%, mismatch number ≤4, to obtain the predicted phage host information.

[0020] Secondly, this invention provides the application of the above-described method for identifying viruses infecting dormant bacteria in river sediments in river ecosystem microbial monitoring, virus-host interaction research, and environmental governance assessment.

[0021] Compared with the prior art, the beneficial effects that the method of this application can achieve specifically include:

[0022] Effectively eliminating interference from dead cells and improving the reliability of results: This invention introduces a propidium monoazide (PMA) chemical modification step. PMA can only penetrate dead cells with damaged cell membranes, and under light it covalently binds to their DNA and to free DNA, thereby blocking amplification of that DNA in subsequent experiments. This step effectively removes the interference of a large amount of dead bacterial DNA in river sediments, ensuring that the objects of subsequent assembly and analysis are microbial communities with biological activity or potential activity (dormant state), reducing host prediction errors at the source.

[0023] Multidimensional feature gene alignment for precise identification of dormant bacteria: This invention does not rely on a single indicator, but instead constructs a feature functional gene library covering three dimensions: spore formation genes, toxin-antitoxin system genes, and resuscitation promotion genes. Through multidimensional cross-validation and a proportional synergistic determination mechanism of the three functional modules, dormant hosts are screened, rather than relying on a single dormancy marker gene. This significantly improves the accuracy and reliability of screening and defining dormant bacterial genome sequences from complex metagenomic data.

[0024] Triple cross-validation ensures the reliability of virus identification: This method requires candidate sequences to pass triple validation simultaneously using a deep learning model, feature gene prediction, and an HMM model. Compared to methods using a single tool or union screening, the candidate virus sequences obtained by this approach show better performance in terms of genome integrity and viral feature significance, effectively eliminating noise interference from viral-like sequences in the environmental background.

[0025] Definite Host Origin Tracing Based on CRISPR System: This invention extracts specific CRISPR spacer sequences from the genomes of dormant bacteria as "immune memory" records and directly compares them with assembled viral sequences for homology. Compared to traditional alignment methods that construct CRISPR specific spacer sequences based on the NCBI database, the method of matching CRISPR spacer sequences from the genomes of dormant bacteria in the same sediment sample provides strong evidence of a definitive historical infection relationship, greatly reducing the false positive rate of virus-host network prediction.

[0026] No pure culture required, wide applicability: This method is based on culture-independent metagenomic sequencing and multidimensional bioinformatics alignment strategies, which overcomes the limitation that dormant bacteria are difficult to isolate and culture in the laboratory, and provides an innovative scientific means to analyze the survival strategies of microorganisms in complex aquatic sediment ecosystems.

Attached Figure Description

[0027] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments will be briefly introduced below. Obviously, those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0028] Figure 1 is a flowchart of a method for identifying viruses infecting dormant bacteria in river sediments, according to an embodiment of this application.

[0029] Figure 2 is a bar chart comparing the number of virus-host matches and the proportion of low-confidence matches before and after propidium monoazide (PMA) treatment. The hollow, black-outlined bars (left axis) represent the total number of matches, and the diagonally hatched bars (right axis) represent the proportion of low-confidence matches.

[0030] Figure 3 is a chart comparing the number of virus-host matches and the proportion of low-confidence matches under different dormancy determination strategies.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention. The technical solutions provided by various embodiments of this invention will be described in detail below with reference to the accompanying drawings.

[0032] Referring to Figure 1, this application provides a method for identifying viruses infecting dormant bacteria in river sediments. By combining chemical modification with metagenomic bioinformatics analysis, it achieves efficient discovery of dormant microbial communities and related viruses. The method specifically includes the following steps:

[0033] Step 1: Collect river sediment samples, add the samples to a propidium monoazide (PMA) solution and incubate in the dark, then activate by light irradiation; extract DNA from the PMA-treated samples, which represents the live bacteria, and perform high-throughput sequencing to obtain raw sequencing data.

[0034] Specifically, PMA (Propidium Monoazide) is a photoreactive DNA-binding dye. Because it cannot penetrate the cell membrane of intact living cells, it can only selectively enter dead cells or cells with damaged membranes. During incubation in the dark, PMA binds to the DNA of dead bacteria; subsequently, under strong light (such as halogen lamps or LED lamps), PMA undergoes photolysis and forms covalent cross-links with the DNA, preventing the modified DNA from being amplified and sequenced in subsequent DNA extraction and library construction steps.

[0035] As one implementation method, the final concentration of the PMA solution is set to 20μM-100μM, and the incubation time in the dark is set to 5min-15min to ensure that the dye fully penetrates the dead cells; the strong light irradiation time is set to 10min-20min to ensure that the cross-linking reaction is complete.

[0036] Step 2: Perform quality filtering on the raw sequencing data to remove adapters and low-quality sequences, obtaining clean sequences; perform de novo assembly and binning on the clean sequences to obtain bacterial genome sequences.

[0037] Specifically, after obtaining the sequencing data, quality control software, such as FastP or Trimmomatic, is used to remove adapter sequences and low-quality bases to obtain high-quality, clean data. Next, assembly tools, such as MEGAHIT or SPAdes, are used for de novo assembly. To ensure data reliability, the length of the assembled bacterial genome sequence can be set to be greater than 1000 bp to remove excessively short, meaningless sequences. Subsequently, binning software, such as MetaBAT2, is used to obtain the metagenomically assembled genome.
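
The length filter described above can be illustrated with a small sketch. It assumes the assembler output is available as FASTA text and keeps contigs of at least 1000 bp, matching the ≥1000bp setting stated earlier; the function names are hypothetical, and the quality control, assembly, and binning tools themselves are not reproduced here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only contigs whose sequence length is at least min_len bp."""
    return [(h, s) for h, s in records if len(s) >= min_len]


if __name__ == "__main__":
    demo = [">contig_1", "ACGT" * 300, ">contig_2", "ACGT" * 100]
    for name, seq in filter_by_length(read_fasta(demo)):
        print(name, len(seq))
```

The same function with `min_len=5000` would apply the longer cutoff used for candidate virus sequences in step four.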

[0038] Step 3: Perform quantitative analysis of dormancy characteristics on bacterial genome sequences; specifically, identify genes related to spore formation, genes of the toxin-antitoxin system, and genes that promote resuscitation; calculate the proportion of the above three types of genes to the total number of functional genes in each genome; after standardizing the proportions of the three types of genes, substitute them into a weighted scoring model to calculate the comprehensive score S; when S reaches a preset threshold, the genome is defined as a candidate genome sequence for a dormant host.

[0039] Specifically, bacterial dormancy involves complex gene regulation. This application selects three key systems: spore formation-related genes, toxin-antitoxin system-related genes, and resuscitation-promoting factor-related genes. Spore formation-related genes are mainly used for identifying dormancy in sporogenic bacteria such as Firmicutes, including Spo0A, Spo0F, and their homologous protein families. Toxin-antitoxin system-related genes induce cell growth arrest and entry into a dormant state, including HipA-HipB, RelE-RelB, MazE-MazF, or their functional homologous systems. Resuscitation-promoting factor-related genes are Rpf family proteins or their domain homologous proteins; carrying these genes indicates that the bacteria possess the potential to awaken from dormancy.

[0040] In practice, a local-alignment search tool (Basic Local Alignment Search Tool, BLAST) or a profile hidden Markov model search tool (HMMER) can be used to align the assembled genome sequences against the constructed local feature library to identify the dormancy-related genes. Then, the proportions of the three gene types relative to the total number of functional genes in the genome are calculated. These proportions are then standardized, for example by Z-score normalization or Min-Max normalization. A comprehensive index is calculated from the standardized proportions using a weighted algorithm. Only when this comprehensive index reaches a preset threshold can the genome be defined as a candidate genome for a dormant host. This multi-dimensional constraint mechanism effectively eliminates interference from non-dormant bacteria.

[0041] In practice, the preset threshold is determined based on training and validation analysis of a known reference genome set. The specific steps are as follows:

[0042] (1) Constructing a reference genome set

[0043] First, two sets of bacterial genomes with clearly defined phenotypic records are selected from public databases such as the National Center for Biotechnology Information (NCBI) and the RefSeq (Reference Sequence Database): The training positive set contains typical bacteria with clearly defined dormancy, persistence, or spore-forming abilities, such as Bacillus subtilis and Mycobacterium tuberculosis. The training negative set contains rapidly growing bacteria that typically do not possess typical dormancy mechanisms.

[0044] (2) Feature value extraction and standardization

[0045] The HMMER or BLAST tools were used to scan the genomes in the reference set and count the number of genes in the sporulation regulation module (N_spore), toxin-antitoxin regulation module (N_TA), and resuscitation promotion module (N_Rpf). The proportion P_i of each module's genes relative to the total number of functional genes in the genome (N_total) was calculated as:

[0046] P_i = N_i / N_total.

[0047] Here, i represents the three types of functional modules mentioned above. Subsequently, P_i is subjected to Z-score standardization or Min-Max normalization to eliminate the order-of-magnitude differences between different modules.

[0048] (3) Calculation of comprehensive indicators

[0049] Construct a comprehensive judgment function, i.e., a comprehensive score S: S = w1 P'_spore + w2 P'_TA + w3 P'_Rpf, where w1, w2, and w3 are the weight coefficients of each module, and P'_spore, P'_TA, and P'_Rpf are the standardized values of the gene proportions of the spore formation regulation module, the toxin-antitoxin regulation module, and the resuscitation promotion module, respectively. These values are obtained by Z-score or Min-Max standardization.

[0050] The weights w1, w2, and w3 are determined as follows: First, a training sample library containing known dormant phenotype genomes (positive set) and non-dormant genomes (negative set) is constructed; the standardized proportion of each genome in the three functional modules is calculated respectively; with classification accuracy as the objective function, a random search or grid search algorithm is used to traverse the weight combinations in the interval [0,1], and an ROC curve is plotted for the comprehensive score S generated for each weight group. The weight coefficient corresponding to maximizing the area under the curve is selected as the final model parameter. In a preferred embodiment of this application, the optimized weight allocation is as follows: w1 = 0.4 for the spore formation regulation module, w2 = 0.35 for the toxin-antitoxin regulation module, and w3 = 0.25 for the resuscitation promotion functional module. The basis for setting w1 as the highest weight is that spore-producing bacteria such as Firmicutes in river sediments are the core components of dormant communities, and their sequence characteristics are the most significant, with the strongest phenotypic indicative role.
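
The weight search can be sketched as a grid search that keeps the weight triple with the largest area under the ROC curve. This is an illustration under assumptions, not the procedure actually used: weights are constrained to sum to 1 (consistent with 0.4 + 0.35 + 0.25 but not stated as a rule), the grid step is arbitrary, ties keep the first triple found, and AUC is computed by pairwise comparison of positive and negative scores. All names are hypothetical.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def score(features, w):
    """S = w1*P'_spore + w2*P'_TA + w3*P'_Rpf for one genome."""
    return sum(wi * fi for wi, fi in zip(w, features))


def grid_search_weights(pos, neg, step=0.05):
    """pos/neg: lists of (P'_spore, P'_TA, P'_Rpf) tuples for the positive
    and negative training genomes. Returns (best_weights, best_auc)."""
    n = round(1 / step)
    best_w, best_auc = None, -1.0
    for i, j in product(range(n + 1), repeat=2):
        if i + j > n:
            continue  # weights must stay within [0, 1] and sum to 1
        w = (i / n, j / n, (n - i - j) / n)
        a = auc([score(f, w) for f in pos], [score(f, w) for f in neg])
        if a > best_auc:
            best_w, best_auc = w, a
    return best_w, best_auc
```

A random search, which the paragraph also mentions, would replace the nested grid with randomly sampled weight triples and the same AUC comparison.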

[0051] (4) Threshold optimization selection

[0052] Based on the optimized weights described above, the comprehensive score S of the positive and negative samples in the training set is calculated. Sensitivity and specificity are calculated at each candidate S value, and the Youden index (sensitivity + specificity - 1) is computed. The S value that maximizes the Youden index is selected as the preset threshold, since it balances the identification of dormant hosts against the false positive rate and thereby limits noise from non-dormant bacteria. When the comprehensive score S of a bacterial bin genome in the river sediment sample to be tested is ≥ the preset threshold, it is defined as a candidate genome of a dormant host. In Example 1, after training analysis with 100 positive samples and 100 negative samples, the preset threshold is set to 0.65.
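
Threshold selection by the Youden index can be sketched as follows. The sketch assumes that a genome is called dormant when S is greater than or equal to the threshold, that candidate thresholds are the observed scores, and that ties are resolved in favor of the lowest threshold; the function name and the toy scores are hypothetical.

```python
def youden_threshold(pos_scores, neg_scores):
    """Return (threshold, J, sensitivity, specificity) maximizing
    J = sensitivity + specificity - 1, calling S >= threshold positive."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sens = sum(s >= t for s in pos_scores) / len(pos_scores)
        spec = sum(s < t for s in neg_scores) / len(neg_scores)
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j, sens, spec)
    return best


if __name__ == "__main__":
    positives = [0.7, 0.8, 0.9, 0.6]    # toy scores of known dormant genomes
    negatives = [0.1, 0.3, 0.65, 0.2]   # toy scores of non-dormant genomes
    print(youden_threshold(positives, negatives))
```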

[0053] Step 4: Perform virus identification and screening on the assembled sequences obtained from de novo assembly in Step 2. Use the following three types of virus identification models with different algorithm principles to independently predict the same assembled sequence: deep learning virus identification model based on sequence pattern learning, prediction model based on viral feature marker genes, and virus protein domain identification model based on hidden Markov model. Only when the same sequence is identified as a viral sequence by all three types of models at the same time will it be defined as a candidate viral sequence.

[0054] Specifically, to reduce the false positive rate of virus identification in environmental metagenomics, this method does not use a single tool or union screening. Instead, it requires that a sequence be identified as viral simultaneously by three models based on different principles. In practice, DeepVirFinder can be selected as the reference-free deep learning model; VirSorter can be selected as the sequence prediction model based on viral characteristic genes; and VIBRANT can be selected as the virus annotation algorithm based on hidden Markov models. Only when the same sequence is identified as a virus by all three models is it defined as a candidate viral sequence. Furthermore, to eliminate false positives caused by short sequences, the minimum viral sequence length is set to a value in the range of 5000 bp to 10000 bp, or higher.
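
The intersection rule can be sketched as a set operation. The sketch assumes each tool's output has already been reduced to a set of contig identifiers it calls viral (parsing the tools' own output files and applying their score cutoffs is not shown), and the function and parameter names are hypothetical.

```python
def candidate_viruses(contig_lengths, dvf_hits, virsorter_hits, vibrant_hits,
                      min_len=5000):
    """Return contig IDs that are at least min_len bp long and are called
    viral by all three tools (intersection, not union)."""
    long_enough = {c for c, n in contig_lengths.items() if n >= min_len}
    return long_enough & set(dvf_hits) & set(virsorter_hits) & set(vibrant_hits)


if __name__ == "__main__":
    lengths = {"c1": 6000, "c2": 4000, "c3": 12000, "c4": 5000}
    print(sorted(candidate_viruses(
        lengths,
        dvf_hits={"c1", "c2", "c3", "c4"},
        virsorter_hits={"c1", "c2", "c4"},
        vibrant_hits={"c1", "c3", "c4"},
    )))
```

Replacing `&` with `|` between the three hit sets would give the union screening that the method deliberately avoids.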

[0055] Step 5: Predict and extract CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) spacer sequences from the dormant host candidate genome sequences obtained in Step 3. Compare the CRISPR spacer sequences with the candidate virus sequences obtained in Step 4. Determine the infection relationship between the virus and the host based on the homology of the comparison, and finally identify the virus that infects dormant bacteria.

[0056] Specifically, the CRISPR-Cas system is an important component of the bacterial adaptive immune system. After encountering a viral infection, bacteria cleave viral fragments and integrate them as spacer sequences into their own CRISPR array. Therefore, the CRISPR spacer sequence provides a precise historical record of the host's past infections with specific viruses.

[0057] In the procedure, tools such as MinCED or CRISPRCasFinder are first used to predict and extract CRISPR spacer sequences from the dormant bacterial genomes defined in step three. Then, the BLASTn tool is used to compare the extracted spacer sequences with the candidate virus sequences obtained in step four at the nucleic acid level. To ensure the rigor of the matching, the parameters for determining the existence of an infection relationship are strictly set as follows: coverage ≥ 90%, sequence identity ≥ 80%, and mismatch number ≤ 4. For any pairing that meets these conditions, the virus is regarded as capable of infecting the corresponding dormant bacterium.
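
The three matching criteria can be sketched as a filter over alignment records. The sketch assumes each record already carries the percent identity (`pident`), alignment length (`length`), mismatch count (`mismatch`), and spacer length (`qlen`), named after BLAST tabular columns, plus a hypothetical `host` field mapping the spacer to its genome bin. Defining coverage as alignment length divided by spacer length is an assumption, since the text does not define it.

```python
def infection_pairs(hits, min_cov=0.90, min_ident=80.0, max_mismatch=4):
    """Return the set of (virus_id, host_id) pairs supported by at least
    one spacer hit that meets all three criteria."""
    pairs = set()
    for h in hits:
        coverage = h["length"] / h["qlen"]  # assumed definition of coverage
        if (coverage >= min_cov
                and h["pident"] >= min_ident
                and h["mismatch"] <= max_mismatch):
            pairs.add((h["sseqid"], h["host"]))
    return pairs


if __name__ == "__main__":
    demo_hits = [
        {"host": "bin7", "sseqid": "v1", "pident": 100.0,
         "length": 32, "mismatch": 0, "qlen": 32},
        {"host": "bin7", "sseqid": "v2", "pident": 96.0,
         "length": 25, "mismatch": 1, "qlen": 32},  # coverage too low
    ]
    print(infection_pairs(demo_hits))
```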

[0058] It should be noted that the method of this application can be implemented with the assistance of various established bioinformatics software tools, which will not be listed here. It may also be implemented with self-written computer programs, which will not be detailed here. Those skilled in the art should understand that the preset threshold of 0.65 is the optimal value for the river sediment samples of this application; for other types of environmental samples, the optimal threshold can be re-determined using the ROC curve and Youden index optimization method of this application.

[0059] The specific operation of this method will be further explained below with specific application examples.

[0060] Example 1

[0061] The following example, using a specific river sediment sample, further illustrates the operation of this method:

[0062] A surface sediment sample was collected from a polluted river, and 1g of the sample was weighed and resuspended in PBS buffer. PMA reagent was added to bring the final concentration to 50μM. The sample was incubated in a dark room for 10 min; then it was activated by strong light irradiation under a halogen lamp for 15 min, during which time the sample was kept in an ice-water bath to prevent overheating.

[0063] Total viable bacterial DNA was extracted from the processed samples, a library was constructed, and Illumina high-throughput sequencing was performed to obtain 60 Gb of raw sequencing data.

[0064] Data was filtered using FastP software, and de novo assembly was performed using MEGAHIT. Short contigs shorter than 1000 bp were removed, and then 150 bacterial bin genome sequences were obtained using MetaBAT2.

[0065] In this embodiment, ROC curves were plotted by training and analyzing the genomes of 100 known dormant bacteria and 100 non-dormant bacteria. The results showed that when S was 0.65, the Youden index reached its maximum value of 0.82, at which point the model's sensitivity was 89% and its specificity was 93%. Therefore, setting 0.65 as the preset threshold for defining dormant hosts can effectively balance identification efficiency and false positive control.
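
As a quick consistency check, the reported Youden index follows directly from the stated sensitivity and specificity, writing J for the Youden index (a symbol introduced here for convenience):

```latex
J = \text{sensitivity} + \text{specificity} - 1 = 0.89 + 0.93 - 1 = 0.82
```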

[0066] When performing quantitative analysis on the 150 bacterial bin genomes, the formula S = 0.4 P'_spore + 0.35 P'_TA + 0.25 P'_Rpf was used to calculate the comprehensive score of each genome. The calculated S values for the 150 genomes range from 0.08 to 0.92. For typical non-dormant strains, such as some Proteobacteria members, the spore module scores are close to 0, with S values generally concentrated in the 0.10-0.45 range. Target dormant candidate bacteria have higher scores in the TA system or Rpf module, and spore-producing groups in particular have significantly higher S values. Ultimately, 32 genomes with S values ≥ 0.65 (preset threshold) were identified as dormant host candidate genomes.

[0067] Contigs longer than 5000 bp were extracted and input into DeepVirFinder (Score ≥ 0.9, P < 0.05), VirSorter (Score ≥ 0.9), and VIBRANT software, respectively. The intersection of the three datasets identified 2105 high-quality candidate virus sequences.

[0068] 480 CRISPR spacer sequences were extracted from the genomes of the aforementioned 32 dormant bacteria using the MinCED tool.

[0069] The 480 CRISPR spacer sequences were compared with the 2105 candidate viral sequences using BLASTn. The parameters were set as follows: coverage ≥90%, sequence identity ≥80%, and mismatch number ≤4.

[0070] Results analysis: Under these criteria, 750 viral sequences matched CRISPR spacer sequences from 24 dormant bacteria (belonging to the phyla Actinobacteria and Firmicutes). This indicates that the method removed dead-bacteria interference from the large metagenomic dataset, accurately screened dormant hosts, and confirmed genuine infection relationships between these 750 environmental viruses and the dormant bacteria.

[0071] This application also provides the use of the above method in river ecosystem microbial monitoring, virus-host interaction research, and environmental remediation assessment. The method can rapidly assess the dormancy potential of pathogens in aquatic sediments and screen for specific bacteriophages capable of targeting and infecting these pathogens, providing data support for future bioremediation of aquatic environments and for phage therapy.

[0072] Example 2

[0073] To verify the effect of propidium monoazide (PMA) treatment on the accuracy of virus-host matching, a control experiment without PMA treatment was set up.

[0074] River sediment samples from the same source as in Example 1 were collected and divided into two groups. The experimental group was treated with PMA according to the steps of this invention; the control group received no PMA treatment. The remaining DNA extraction, sequencing, assembly, binning, virus screening, and CRISPR matching steps were kept identical.

[0075] After high-throughput sequencing and assembly analysis, the results are as follows: 150 bacterial genomes were obtained in the experimental group, of which 32 were identified as candidate genomes of dormant hosts through three-module synergistic screening; 230 bacterial genomes were obtained in the control group, of which 68 were identified as candidate genomes of dormant hosts.

[0076] The virus screening results showed that the experimental group obtained 2105 candidate virus sequences after cross-screening using three models, while the control group obtained 2702 candidate virus sequences.

[0077] CRISPR matching results showed that the experimental group obtained 750 virus-host matching relationships, while the control group obtained 960 matching relationships.

[0078] The viral sequences were manually verified and combined with a genome integrity assessment. As shown in Figure 2, 482 matching relationships (about 50%) in the control group involved low-quality viral sequences or low-quality bacterial genome sequences, whereas 68 unreasonable matching relationships (about 9%) were found in the experimental group.

[0079] The above results indicate that, without propidium monoazide (PMA) treatment, residual dead-bacteria DNA in the environment increases the potential sources of false hosts and inflates the number of virus-host matches. After PMA treatment, the matching network structure is more stable and the false positive rate is significantly reduced.
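The percentages quoted in this example follow directly from the reported counts. The short check below uses only numbers stated in paragraphs [0077] and [0078]; the helper name `unreasonable_rate` is hypothetical.

```python
def unreasonable_rate(unreasonable: int, total: int) -> float:
    """Fraction of virus-host matches judged unreasonable."""
    return unreasonable / total


# Counts reported in Example 2.
control_rate = unreasonable_rate(482, 960)       # control group, no treatment
experimental_rate = unreasonable_rate(68, 750)   # treated experimental group

# Extra matches reported in the control group relative to the experimental group.
extra_matches = 960 - 750
```

These values round to about 50% and about 9%, matching the figures in the text, and the control group reports 210 more matches than the experimental group.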

[0080] Example 3

[0081] To verify the effectiveness of the three-module synergistic dormancy determination mechanism, a control experiment comparing single-module and multi-module screening was set up.

[0082] Based on the bacterial genomes obtained in Example 1, the following three criteria were used to screen dormant hosts:

[0083] Group A: The proportion of genes related to the spore formation regulatory module was used as the sole criterion for judgment;

[0084] Group B: The spore formation regulation module and the toxin-antitoxin module were used as the criteria for judgment;

[0085] Group C: A three-module synergistic judgment mechanism was adopted, consisting of a spore formation regulation module, a toxin-antitoxin module, and a resuscitation promotion module. The judgment threshold was set at 0.65.

[0086] The screening results are as follows:

[0087] Group A screening yielded 62 candidate genomes of dormant hosts;

[0088] Group B screening yielded 48 candidate genomes;

[0089] Group C's three-module collaborative screening yielded 32 candidate genomes.

[0090] Further CRISPR matching analysis was conducted:

[0091] Group A obtained 1228 virus-host matching relationships;

[0092] Group B yielded 919 matching relationships;

[0093] Group C yielded 750 matches.

[0094] After the functional consistency and genome integrity of the matching relationships were reviewed, the following was found, as shown in Figure 3:

[0095] In Group A, 634 (approximately 52%) of the matches lacked support from resuscitation-related genes or were low-quality viral or bacterial genome sequences.

[0096] In Group B, 253 (approximately 28%) of the matches lacked support from resuscitation-related genes or were low-quality viral or bacterial genome sequences.

[0097] In Group C, all matches obtained through the three-module collaborative screening were supported by resuscitation-related genes, with 68 unreasonable matches (approximately 9%). Statistical analysis showed that the unreasonable match rate in Group C was significantly lower than that in Groups A and B (p<0.05).
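The source does not state which statistical test produced p<0.05. As one plausible check, the sketch below applies a pooled two-proportion z-test to the reported counts; the choice of test and the function name `two_proportion_z_test` are assumptions, not the method used in this example.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Unreasonable matches / total matches, as reported in Example 3.
group_a = (634, 1228)
group_b = (253, 919)
group_c = (68, 750)

z_ca, p_ca = two_proportion_z_test(*group_c, *group_a)
z_cb, p_cb = two_proportion_z_test(*group_c, *group_b)
```

Under this assumed test, both comparisons give negative z statistics (a lower unreasonable match rate in Group C) with p-values far below 0.05, which is consistent with the stated conclusion.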

[0098] The results showed that a single dormancy marker module or a two-module determination method can easily misclassify atypical bacteria carrying some stress-regulating genes as dormant hosts, thereby amplifying virus matching noise. By applying cross-constraints across multiple functional dimensions, the three-module synergistic screening mechanism significantly reduces misclassifications and improves the stability and consistency of dormant host identification.

[0099] The embodiments of the present invention described above do not constitute a limitation on the scope of protection of the present invention.

Claims

1. A method for identifying viruses infecting dormant bacteria in river sediments, characterized in that it comprises the following steps:
Step 1: Collect river sediment samples, add the river sediment samples to a propidium monoazide (PMA) solution and incubate in the dark, then activate by light irradiation; extract the DNA of the viable bacteria from the PMA-treated samples and perform high-throughput sequencing to obtain raw sequencing data;
Step 2: Perform quality filtering on the raw sequencing data to remove adapters and low-quality sequences, obtaining clean sequences; perform de novo assembly and binning on the clean sequences to obtain bacterial genome sequences;
Step 3: Perform quantitative analysis of dormancy characteristics on the bacterial genome sequences, wherein spore formation-related genes, toxin-antitoxin system genes, and resuscitation-promoting factor genes are identified respectively; calculate the proportion of each of the three types of genes in the total number of functional genes of each genome; after standardizing the proportions of the three types of genes, substitute them into a weighted scoring model to calculate the comprehensive score S; when S reaches a preset threshold, the genome is defined as a candidate genome sequence of a dormant host;
Step 4: Perform virus identification and screening on the assembled sequences obtained from the de novo assembly in Step 2, using the following three types of virus identification models with different algorithmic principles to independently predict the same assembled sequence: a deep learning virus identification model based on sequence pattern learning, a prediction model based on viral hallmark genes, and a viral protein domain identification model based on hidden Markov models; only when the same sequence is identified as a viral sequence by all three types of models is it defined as a candidate viral sequence;
Step 5: Predict and extract CRISPR spacer sequences from the candidate genome sequences of the dormant hosts obtained in Step 3; align the CRISPR spacer sequences against the candidate viral sequences obtained in Step 4; determine the infection relationship between virus and host based on the homology of the alignment, and finally identify the viruses that infect dormant bacteria.

2. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 1, characterized in that, in Step 1, the final concentration of the propidium monoazide (PMA) solution used is 20 μM-100 μM; the incubation time in the dark is 5 min-15 min; the light irradiation activation uses a halogen lamp or an LED lamp, and the irradiation time is 10 min-20 min.

3. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 1, characterized in that, in Step 2, the length of the assembled bacterial genome sequences is set to ≥ 1000 bp.

4. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 1, characterized in that, in Step 3, the comprehensive score S is: S = w1 P'_spore + w2 P'_TA + w3 P'_Rpf; wherein P'_spore, P'_TA, and P'_Rpf are the standardized values of the gene proportions of the spore formation regulation module, the toxin-antitoxin regulation module, and the resuscitation promotion module, respectively, obtained by Z-score or Min-Max standardization of the proportion values; the weights w1, w2, and w3 are determined by optimization using the receiver operating characteristic curve of the training sample set; the preset threshold is determined based on positive and negative control genome training.

5. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 4, characterized in that the spore formation regulation module is applicable to bacterial groups with spore-forming ability, and in bacterial groups without spore-forming ability the weight of the spore formation regulation module is set to 0; the spore formation regulation module includes Spo0A, Spo0F, and their homologous protein families; the toxin-antitoxin regulation module includes HipA-HipB, YafQ-DinJ, MazE-MazF, and RelE-RelB and their homologous protein families; the resuscitation promotion module includes Rpf family proteins or their domain homologous proteins.

6. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 1, characterized in that, in Step 4, the three virus identification models with different algorithmic principles are selected from DeepVirFinder, VirSorter, and VIBRANT, or from virus identification models based on the same algorithmic principles as these three models; the length screening parameter for the assembled candidate viral sequences is set to ≥ 5000 bp.

7. The method for identifying viruses infecting dormant bacteria in river sediments according to claim 1, characterized in that, in Step 5, the CRISPR spacer sequences are aligned against the candidate viral sequences, and the parameters for determining whether an infection relationship exists are set as follows: coverage ≥ 90%, identity ≥ 80%, and number of mismatches ≤ 4, to obtain the predicted phage host information.

8. Use of the method for identifying viruses infecting dormant bacteria in river sediments according to any one of claims 1-7 in river ecosystem microbial monitoring, virus-host interaction research, and environmental governance assessment.