A method for pathogenic microorganism analysis based on long read sequencing data
By establishing an error-correcting reference sequence library and performing quality control correction on sequencing data, the problem of low data quality in third-generation sequencing has been solved, achieving highly accurate and sensitive detection of pathogenic microorganisms, applicable to the identification of bacteria, fungi, parasites, and viruses.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN EASYDIAGNOSIS BIOMEDICINE
- Filing Date
- 2022-03-29
- Publication Date
- 2026-06-30
AI Technical Summary
Third-generation sequencing data has long read lengths but low quality, resulting in poor accuracy in pathogen identification and a high number of false positives. Existing software lacks specificity, which affects the application of pathogen detection.
Develop a parameterized clustering method suitable for third-generation targeted sequencing. By establishing an error-correcting reference sequence library, the sequencing data is quality controlled and corrected. Similar sequencing sequences are merged and compared to a pathogen identification database to filter out erroneous and interfering data.
It significantly improves the accuracy of pathogenic microorganism species identification, reduces false positive results, shortens data analysis time, and can simultaneously detect bacteria, fungi, parasites, and viruses, generating high-quality test reports.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of pathogen detection technology, specifically relating to a method for analyzing pathogens based on long-read sequencing data.
Background Technology
[0002] Pathogenic microorganisms, also known as pathogens, are a class of microorganisms that can cause disease in humans and animals. These include viruses, bacteria, rickettsiae, mycoplasma, chlamydia, spirochetes, fungi, and actinomycetes. These pathogenic microorganisms can cause infections, allergies, tumors, dementia, and other diseases, and are also a major factor endangering food safety. Therefore, the detection of pathogens must be rapid and accurate. With the continuous development of medical microbiology research techniques, etiological diagnosis is no longer limited to the pathogen level; detection methods at the molecular and genetic levels are constantly emerging and being applied in clinical and laboratory settings. With the advancement of technology, gene detection technology, which can rapidly and objectively detect suspected pathogenic microorganisms (including bacteria, fungi, and viruses) in clinical samples without relying on traditional microbial culture, is gradually replacing other detection techniques and becoming the mainstream detection technology for pathogens in clinical laboratories and basic laboratories.
[0003] For the detection of pathogenic microorganisms, second-generation sequencing (next-generation sequencing, NGS) is currently the mainstream platform. NGS data offers high accuracy and throughput, but suffers from short read lengths and long sequencing times; typically, read lengths are only 150 bp, and sequencing times exceed 12 hours. In contrast, third-generation sequencing (TGS) provides much longer read lengths, reaching the kb level and above, but suffers from higher error rates. Currently, the mainstream TGS platforms are PacBio and Nanopore. Nanopore sequencing technology, in particular, uses changes in electrical current through biological nanopores to infer the base composition of single DNA or RNA molecules. It offers advantages such as small and portable instruments, ease of operation, fast sequencing speed, and longer read lengths (average read length up to 20 kb, with some reaching the Mb level), demonstrating significant application potential in the clinical detection of pathogenic microorganisms.
[0004] Pathogen analysis based on sequencing data relies heavily on the reference databases and analysis software used. Currently, most databases and analysis software for pathogen identification are developed for second-generation sequencing data. Pathogen identification software suitable for Nanopore sequencing data (such as Centrifuge) is highly versatile, but it is not tailored to processing real-world data. Alignment is performed on a per-sequence basis; because of the low quality of Nanopore sequencing, directly using these per-sequence results for species classification and statistics is highly prone to false positives, which severely hinders the application of third-generation sequencing in pathogen detection.
Summary of the Invention
[0005] In response to the characteristics of long read lengths and low data quality in existing third-generation sequencing data, this invention develops a parameterized clustering method suitable for third-generation targeted sequencing. This method can accurately correct sequencing data, significantly improve the accuracy of pathogenic microorganism species identification, and effectively shorten data analysis time.
[0006] To achieve the above objectives, the technical solution of the present invention is as follows:
[0007] A method for analyzing pathogenic microorganisms based on long-read sequencing data includes the following steps:
[0008] S1. Extract and download the reference genome of pathogenic microorganisms to obtain reference sequences of target regions, and cluster the reference sequences according to similarity to obtain an error correction reference sequence library;
[0009] S2. Align the quality-controlled sequencing data to the error correction reference sequence library, and correct the sequencing sequences that are aligned to the same reference sequence;
[0010] S3. Next, compare the sequencing sequences in the sequencing data pairwise and merge similar sequencing sequences;
[0011] S4. Align the sequencing data obtained in step S3 with the pathogen identification database to obtain a list of pathogen detections.
[0012] Furthermore, in the above technical solution, in step S1, after clustering correction, the reference sequences are adjusted for directional consistency according to the primer matching order to obtain a directionally consistent error-correction reference database. The clustering correction in step S1 involves merging identical reference sequences, statistically analyzing the similarity between them, and filtering out sequences with intra-species similarity greater than inter-species similarity. Specifically, the similarity threshold is between 95% and 99.9%.
[0013] The reference sequence is obtained by downloading the reference genome sequence of the pathogenic microorganism from databases such as NCBI, and then truncating the matched reference sequence according to the matching of primers for the target region and the length of the amplified product after matching. The target region can be 16S, 23S, ITS, or other target regions. It should be understood that the primers are those used during targeted sequencing.
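The primer-based truncation and orientation adjustment described above can be illustrated with a minimal sketch. It assumes exact primer matches (real primers are often degenerate and need mismatch-tolerant matching), and the function names and the `max_len` default are hypothetical, not taken from the patent.

```python
from typing import Optional

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string; unknown symbols become N."""
    return "".join(_COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def extract_target_region(genome: str, fwd_primer: str, rev_primer: str,
                          max_len: int = 2000) -> Optional[str]:
    """Cut out the amplicon from the forward primer to the reverse primer site.

    Assumption: exact matching; max_len is a hypothetical upper bound on the
    length of the amplified product.
    """
    genome = genome.upper()
    start = genome.find(fwd_primer.upper())
    if start == -1:
        return None
    # On the forward strand the reverse primer appears as its reverse complement.
    rev_site = reverse_complement(rev_primer)
    end = genome.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    end += len(rev_site)
    if end - start > max_len:
        return None
    return genome[start:end]


def extract_oriented(genome: str, fwd_primer: str, rev_primer: str,
                     max_len: int = 2000) -> Optional[str]:
    """Try both strands so that every extracted region has the same orientation."""
    for candidate in (genome, reverse_complement(genome)):
        region = extract_target_region(candidate, fwd_primer, rev_primer, max_len)
        if region is not None:
            return region
    return None
```

Because both strands are tried with the same primer order, a reference deposited in either orientation yields the same truncated sequence.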
[0014] Furthermore, in the above technical solution, the quality control method for sequencing data in step S2 can be: filtering out sequencing sequences in the original sequencing data that are less than m in length or have a quality value lower than n. Preferably, the value of m ranges from 100bp to 600bp, and the value of n ranges from 8 to 11.
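As an illustration of the length and quality filter, the following sketch applies thresholds m and n to reads held in memory. It assumes Phred+33 quality strings and computes the read quality from the mean error probability, which is one common convention; the patent does not state how the quality value is computed, and all names here are hypothetical.

```python
import math
from typing import List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, Phred+33 quality string)


def mean_quality(qual: str) -> float:
    """Mean read quality, averaged in error-probability space (assumption)."""
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(ch) - 33) / 10) for ch in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_qc(seq: str, qual: str, min_len: int, min_qual: float) -> bool:
    """Keep a read only if it is at least min_len long and at least min_qual."""
    return len(seq) >= min_len and mean_quality(qual) >= min_qual


def filter_reads(reads: List[Read], min_len: int = 300,
                 min_qual: float = 10.0) -> List[Read]:
    """Apply the m (length) and n (quality) thresholds to a list of reads."""
    return [r for r in reads if passes_qc(r[1], r[2], min_len, min_qual)]
```

The defaults of 300 bp and quality 10 fall within the preferred ranges given for m and n.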
[0015] After quality control, the raw sequencing data yields long and high-quality sequencing data. The data volume, average length, N50 length, and sequencing sequence quality score of the filtered high-quality sequencing data are used as evaluation indicators for sequencing data quality.
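The evaluation indicators listed above are simple to compute from the read lengths. The sketch below shows the N50 length together with read count, total bases, and mean length; the function names and the returned keys are illustrative only.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize_lengths(lengths: List[int]) -> Dict[str, float]:
    """Basic yield statistics for a filtered read set."""
    count = len(lengths)
    total = sum(lengths)
    return {
        "reads": count,
        "total_bases": total,
        "mean_length": total / count if count else 0.0,
        "n50": n50(lengths),
    }
```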
[0016] Furthermore, during the quality control of the raw sequencing data, sequencing sequences that align to the host reference genome must be removed. If the host is human, the quality-controlled sequencing data is aligned to the human reference genome. Data that aligns to the human reference genome is treated as human-derived and filtered out; the remaining data is treated as non-human data and used for analysis. The proportion of human data in the quality-controlled sequencing data is used as a reference indicator for the reliability of library construction and sequencing quality.
[0017] Furthermore, in the above technical solution, the correction method in step S2 is specifically as follows: first, sequencing sequences with an error rate higher than 'a' or a coverage rate lower than 'b' are filtered out, where 'a' ranges from 5% to 10% and 'b' ranges from 50% to 95%; then, using the reference sequence as a benchmark, errors are corrected on the base sites in the sequencing sequence. It can be understood that the error rate is the error rate when comparing the sequencing sequence to the reference sequence, and the coverage rate is relative to the target region.
[0018] Furthermore, the error correction method can be as follows: Statistically analyze the base frequency at each site in the sequencing sequence. If the proportion of sites with sequencing depths lower than *c* is greater than *d*, then that sequencing sequence is removed. For the retained sequencing sequences, if the sequencing depth is lower than *c*, the corresponding base from the target region reference sequence is used for substitution. For sites with sequencing depths greater than *c*, the base with the highest base frequency is used. The values of *c* and *d* can be determined based on the actual sequencing data. Preferably, the value of *c* ranges from 3 to 7, and the value of *d* ranges from 10% to 30%.
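The depth-aware correction rule can be sketched as follows. This is a simplified illustration, not the patented implementation: it assumes the reads aligned to one reference have already been projected onto reference coordinates as equal-length strings, with "-" marking uncovered positions and insertions ignored. Sites with depth exactly equal to c are treated here as sufficiently covered, which is an assumption because the text only covers "lower than" and "greater than".

```python
from collections import Counter
from typing import List, Optional


def correct_consensus(reference: str, aligned_reads: List[str],
                      min_depth: int = 5,
                      max_low_frac: float = 0.20) -> Optional[str]:
    """Build a corrected sequence from reads aligned to one reference.

    min_depth plays the role of c and max_low_frac the role of d.
    Returns None when too many sites are poorly covered.
    """
    if not reference:
        return None
    if any(len(read) != len(reference) for read in aligned_reads):
        raise ValueError("reads must be projected onto reference coordinates")

    # Base frequency at each site, ignoring uncovered positions.
    columns = [
        Counter(read[i] for read in aligned_reads if read[i] != "-")
        for i in range(len(reference))
    ]
    depths = [sum(col.values()) for col in columns]

    low_sites = sum(1 for depth in depths if depth < min_depth)
    if low_sites / len(reference) > max_low_frac:
        return None  # proportion of low-depth sites exceeds d

    corrected = []
    for ref_base, col, depth in zip(reference, columns, depths):
        if depth < min_depth:
            corrected.append(ref_base)  # fall back to the reference base
        else:
            corrected.append(col.most_common(1)[0][0])  # most frequent base
    return "".join(corrected)
```

The defaults c = 5 and d = 20% match the values used in Example 1.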
[0019] Furthermore, in the above technical solution, the merging method in step S3 specifically involves: for two sequences that reach a similarity threshold, retaining the sequence with the higher frequency, and updating the frequency of that sequence to the sum of the frequencies of the two sequences. Even further, the similarity threshold ranges from 95% to 99.9%. The similarity in this invention can be calculated using the ratio function of the Levenshtein module in Python (Levenshtein.ratio).
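A minimal sketch of the merging rule follows. To stay dependency-free it uses a pure-Python similarity based on the longest common subsequence as a stand-in for Levenshtein.ratio; the two may not agree exactly, so treat the scores as illustrative. The greedy order (most frequent sequence first) is also an assumption, since the text only specifies the pairwise rule.

```python
from typing import Dict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; stand-in for Levenshtein.ratio (assumption)."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def merge_similar(freqs: Dict[str, int], threshold: float = 0.99) -> Dict[str, int]:
    """Merge sequences at or above the threshold into the more frequent one."""
    merged: Dict[str, int] = {}
    by_frequency = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    for seq, count in by_frequency:
        for kept in merged:
            if similarity(seq, kept) >= threshold:
                merged[kept] += count  # retained sequence absorbs the frequency
                break
        else:
            merged[seq] = count  # no similar representative yet
    return merged
```

Pairwise comparison is quadratic in the number of sequences, which is tolerable here because correction in step S2 has already collapsed the raw reads into far fewer sequences.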
[0020] Furthermore, in the above technical solution, the merged sequencing data obtained in step S3 is compared with the reagent background microbial database to remove data from microorganisms originating from sequencing reagents or consumables. Even further, the removal parameters can be: statistically analyzing the number of sequences compared to the background microbial database and the number of mismatches, filtering out sequencing sequences with a mismatch count of less than 3. The construction of the reagent background microbial database is existing technology and will not be elaborated upon here.
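The background-removal rule can be sketched as a simple lookup, assuming the alignment against the reagent background database has already been run and summarized as the smallest mismatch count per merged sequence. The data layout and names below are hypothetical.

```python
from typing import Dict


def remove_background(sequences: Dict[str, str],
                      background_mismatches: Dict[str, int],
                      mismatch_cutoff: int = 3) -> Dict[str, str]:
    """Drop merged sequences that match a background organism closely.

    background_mismatches maps a sequence name to its smallest mismatch count
    against the background database; names without an entry had no hit.
    """
    kept = {}
    for name, seq in sequences.items():
        mismatches = background_mismatches.get(name)
        if mismatches is not None and mismatches < mismatch_cutoff:
            continue  # fewer than 3 mismatches: treated as reagent background
        kept[name] = seq
    return kept
```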
[0021] Furthermore, in the above technical solution, the quality control parameters in step S4 are: retaining alignment sequences with coverage greater than e and alignment error rate less than f, where the value of e ranges from 85% to 99% and the value of f ranges from 0.5% to 3%.
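The coverage and error-rate filter of step S4 might look like the sketch below, assuming the alignments are available as PAF lines (the tab-separated format Minimap2 writes by default). Coverage is taken here as the aligned fraction of the query and the error rate as one minus matches over alignment block length; the patent does not fix these definitions, so both are assumptions, as are the function names.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def parse_paf_line(line: str) -> Record:
    """Extract query, target, coverage, and error rate from one PAF line."""
    fields = line.rstrip("\n").split("\t")
    query_len, query_start, query_end = int(fields[1]), int(fields[2]), int(fields[3])
    matches, block_len = int(fields[9]), int(fields[10])
    return {
        "query": fields[0],
        "target": fields[5],
        "coverage": (query_end - query_start) / query_len,
        "error_rate": 1 - matches / block_len,
    }


def keep_alignment(rec: Record, min_coverage: float = 0.90,
                   max_error: float = 0.01) -> bool:
    """Retain alignments with coverage above e and error rate below f."""
    return rec["coverage"] > min_coverage and rec["error_rate"] < max_error


def filter_paf(lines: List[str], min_coverage: float = 0.90,
               max_error: float = 0.01) -> List[Record]:
    """Parse PAF lines and keep only those passing both thresholds."""
    records = (parse_paf_line(line) for line in lines if line.strip())
    return [r for r in records if keep_alignment(r, min_coverage, max_error)]
```

The defaults of 90% coverage and 1% error correspond to the values used in Example 1 and lie within the stated ranges for e and f.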
[0022] In the above technical solution, while generating the list of detected microorganisms in step S4, the target region information, species groups, number of sequences compared, sequence length, number of mismatches, and high-quality merged sequences are statistically analyzed and output as basic data quality control indicators for data analysis and judgment.
[0023] It should be understood that the pathogen identification database in step S4 can be an existing database. However, because existing databases suffer from low accuracy and inconsistent standards, the existing database can, in specific implementations, be corrected using the method for establishing the error correction reference sequence library.
[0024] Furthermore, in the above technical solution, due to the low quality of the raw Nanopore sequencing data, a certain proportion of data splitting errors occur during data splitting. Additionally, due to unavoidable aerosol contamination and environmental background microbial contamination during the experiment, the microbial detection list obtained in step S4 needs further filtering. The filtering parameters include:
[0025] Groups with fewer than x reads in the identification results are filtered out, where the preferred value of x is 5 to 20. Such data has low reliability and may be caused by environmental contamination, aerosols, reagent background, or incorrect matching due to low sequence quality during alignment.
[0026] Groups whose read count is lower than a proportion y of the sample's total read count are filtered out, where the preferred value of y is 0.2% to 2%. Such data has low reliability because of its low relative read count.
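The two filtering parameters above can be combined in a short sketch. It assumes the detection list is a mapping from species group to read count and that the sample total is supplied by the caller (falling back to the sum of the listed counts); the names are hypothetical.

```python
from typing import Dict, Optional


def filter_detection_list(counts: Dict[str, int],
                          min_reads: int = 10,
                          min_fraction: float = 0.01,
                          total_reads: Optional[int] = None) -> Dict[str, int]:
    """Keep groups with at least x reads and at least a fraction y of the sample."""
    if total_reads is None:
        total_reads = sum(counts.values())  # assumption: list covers the sample
    if total_reads <= 0:
        return {}
    return {
        group: n
        for group, n in counts.items()
        if n >= min_reads and n / total_reads >= min_fraction
    }
```

With x = 10 and y = 1% (the values used in Example 1), a group supported by 30 reads in a sample of 10,000 reads passes the absolute threshold but is removed by the relative one.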
[0027] In addition, based on the sample source, the microorganisms in the pathogen detection list can be further identified to determine the final detected pathogens and colonizing microorganisms.
[0028] The beneficial effects of this invention are as follows:
[0029] 1) An error-correction reference sequence library was established, and the sequencing data was corrected using this library to obtain high-quality sequencing data. This effectively improved the speed of data analysis (2Gb third-generation sequencing data analysis and the final pathogen detection report could be generated within minutes), and reduced the complexity of sequencing alignment results, making the analysis results simpler, more accurate, and easier to screen and interpret.
[0030] 2) By quality control and correction of the raw sequencing data, the accuracy of sequencing sequences can exceed 99%, effectively reducing false positives in data analysis. Furthermore, for third-generation targeted sequencing data, this method can accurately detect the same species in the target region with as few as 10 sequences, demonstrating high accuracy and sensitivity. Merging the corrected sequences can effectively reduce interference between neighboring species.
[0031] 3) Reagent or environmental background microorganisms are removed at the sequencing data level, avoiding interference from such microorganisms with the final detection results.
[0032] 4) The method provided by this invention is adapted to the detection of 16S, 23S, ITS or other target regions, and can realize the simultaneous detection of bacteria, fungi, parasites, viruses and drug resistance genes.
[0033] 5) Statistical quality control can be performed in the processes of sequencing data quality control, correction, comparison and identification, which can not only generate the classification level of species identification, but also determine the reliability of the final results.
Detailed Implementation
[0034] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments thereof. It should be understood that the specific embodiments described herein are for illustration and explanation only and are not intended to limit the present invention.
[0035] Example 1
[0036] This embodiment uses bronchoalveolar lavage fluid samples as an example to simultaneously detect bacteria and fungi present in the samples. The specific process is as follows:
[0037] (1) Establishment of the error correction reference sequence library
[0038] Reference sequences of target regions for pathogenic microorganisms were obtained from databases such as NCBI. The target region for bacteria was the full-length 16S region, and the target region for fungi was the full-length ITS region (existing library construction methods can be referenced, and will not be elaborated here). The above reference sequences were clustered and corrected according to 99% similarity, and the orientation of the reference sequences was aligned to obtain a high-quality error-corrected reference sequence library.
[0039] (2) Quality control of raw sequencing data
[0040] The bronchoalveolar lavage fluid samples were sequenced using the Nanopore platform. After sequencing, NanoStat was used to perform quality control on the sequencing data to determine the sequencing data yield, accuracy, and length. The input sequencing data folder was named "20220113_1331_X1_FAR52513". This folder was generated by the sequencer system and contained information such as sequencing time, chip placement, and chip number. The folder also contained sequencing-related information and the original FASTQ file.
[0041] NanoFilt was used to filter the sequencing data, with parameters set to -q 10 and -l 300 (i.e., reads shorter than 300 bp or with a quality score below 10 were filtered out). The filtered FASTQ file was then used as the input file for subsequent analysis. Because target lengths vary considerably, the length cutoff was selected based on the actual lengths that could match the reference database. The actual sequencing generated 23.6M of data, and the filtered data volume was 22.8M.
[0042] The filtered FASTQ file was aligned to the human genome, and the reads that aligned to it were removed. The data size after removing human data was 22.7M.
[0043] (3) Error correction of sequencing data
[0044] The third-generation sequencing data was aligned to an error correction database using the software Minimap2, resulting in a BAM file aligned to the reference sequence. Error correction was performed on the alignment results, calculating the coverage and base frequency of each reference sequence at each site. If the coverage was below 90% or the matching error rate was greater than 8%, the sequence was discarded. Base sites were corrected based on the base frequencies of the retained sequences (in this example, c = 5 and d = 20%). A corrected FASTA file was generated, with annotation information for each sequence in the FASTA file including the number of original sequencing sequences from which the corrected sequence was obtained.
[0045] (4) Merging of sequencing sequences
[0046] The sequencing sequences in the corrected FASTA file are compared pairwise, and similar sequences with a similarity greater than 99% are merged to obtain the merged FASTA file.
[0047] (5) Preliminary identification results
[0048] The merged FASTA files were aligned to the identification database using Minimap2 software. Alignment details were analyzed, and sequences with coverage greater than 90% and an error rate less than 1% were retained to obtain preliminary identification results. The identification results were then merged according to species classification level, and a preliminary list of detected microorganisms was generated based on the annotation database. The preliminary microbial detection information is shown in Table 1.
[0049] Table 1
[0050]
[0051]
[0052] (6) Filtering of preliminary identification results
[0053] Groups with fewer than 10 reads in the aligned sequence were filtered out, as were analysis results where the number of sequencing reads in the identified sequence was less than 1% of the total number of reads in the sample. In addition, this embodiment also performed multiple parallel samples. Based on this, further filtering was conducted to remove analysis results where the number of sequencing reads in the identified sequence was less than 0.5% of the total number of reads for the species in all samples from the same batch (i.e., all samples sequenced simultaneously). This latter part was highly likely due to data splitting errors. The detection information of the filtered microorganisms is shown in Table 2:
[0054] Table 2
[0055]
[0056]
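The batch-level filter described in step (6), which targets data splitting errors, can be sketched as below. It assumes per-sample detection lists for all samples sequenced together, keyed by sample and then by species; the data layout and names are hypothetical.

```python
from collections import Counter
from typing import Dict


def filter_batch_crosstalk(batch: Dict[str, Dict[str, int]],
                           min_share: float = 0.005) -> Dict[str, Dict[str, int]]:
    """Drop a species from a sample when its reads are a tiny share of the
    reads assigned to that species across the whole batch."""
    species_totals: Counter = Counter()
    for counts in batch.values():
        species_totals.update(counts)  # sum reads per species over all samples

    return {
        sample: {
            species: n
            for species, n in counts.items()
            if n >= min_share * species_totals[species]
        }
        for sample, counts in batch.items()
    }
```

A species that is abundant in one sample and appears with only a handful of reads in another sample of the same run is the typical pattern this rule removes.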
[0057] (7) Removal of background microorganisms
[0058] The merged high-quality sequences were compared with the reagent background microbial database. The number of sequences that aligned to the background microbial database and the number of mismatches were counted, and data with fewer than 3 mismatches were filtered out. After comparison, Cladosporium crousii in Table 2 was considered a reagent background microorganism.
[0059] (8) Determination of pathogenic microorganisms and colonizing microorganisms
[0060] Based on the sample source, all eligible species identification results were further analyzed to determine the pathogenic and colonizing microorganisms, and the final detected pathogenic and colonizing microorganisms were identified, as shown in Table 3.
[0061] Table 3
[0062]
| Latin name of species | Sequence number | Chinese name | Determination |
| --- | --- | --- | --- |
| Klebsiella pneumoniae | 3961 | Klebsiella pneumoniae | Pathogenic microorganisms |
| Parvimonas micra | 1640 | Micromonas | Colonizing microorganisms |
| Streptococcus anginosus | 1542 | Streptococcus pharyngitis | Colonizing microorganisms |
| Staphylococcus epidermidis | 1062 | Staphylococcus epidermidis | Pathogenic microorganisms |
| Haemophilus influenzae | 501 | Haemophilus influenzae | Colonizing microorganisms |
| Anaeroglobus geminatus | 469 | Diploanaerobic cocci | Colonizing microorganisms |
| Campylobacter concisus | 449 | Conc. Campylobacter | Colonizing microorganisms |
| Corynebacterium propinquum | 396 | Corynebacterium propionate | Colonizing microorganisms |
| Streptococcus mitis | 287 | Streptococcus suis | Colonizing microorganisms |
| Proteus mirabilis | 231 | Proteus mirabilis | Colonizing microorganisms |
| Porphyromonas endodontalis | 226 | Porphyromonas pulpis | Colonizing microorganisms |
| Anaerococcus nagyae | 162 | Anaerobic cocci of Naggi | Colonizing microorganisms |
[0063] In summary, the method of this invention can achieve accurate detection of the same species in the target region with a sequencing error rate of less than 8%, and the data analysis has high accuracy and sensitivity.
[0064] The above description is merely a preferred embodiment of the present invention and should not be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention by those skilled in the art should be included within the scope of protection of the present invention.
Claims
1. A method for analyzing pathogenic microorganisms based on long-read sequencing data for non-diagnostic purposes, characterized in that it includes the following steps:
S1. The reference genome of pathogenic microorganisms is extracted and downloaded to obtain the reference sequence of the target region. The reference sequences are clustered and corrected according to similarity to obtain the error correction reference sequence library;
S2. Compare the quality-controlled sequencing data with the error correction reference sequence library, and correct the sequencing sequences that match the same reference sequence;
S3. Next, compare the sequencing sequences in the sequencing data pairwise and merge similar sequencing sequences;
S4. Align the sequencing data obtained in step S3 with the pathogen identification database, and obtain the pathogen detection list after quality control.
The reference sequence described in step S1 is adjusted for directional consistency according to the primer matching order after clustering;
The merging method described in step S3 is as follows: for two sequences that reach the similarity threshold, retain the sequence with the higher frequency and update the frequency of that sequence to the sum of the frequencies of the two sequences; the similarity threshold ranges from 95% to 99.9%.
The quality control method described in step S2 is as follows: filtering out sequencing sequences with a length less than m or a quality value less than n, and removing sequencing sequences aligned to the host reference genome; the value of m ranges from 100 to 600 bp, and the value of n ranges from 8 to 11.
The correction method described in step S2 is as follows: first, sequencing sequences with an error rate higher than a or a coverage rate lower than b are filtered out; then, based on a reference sequence, errors are corrected on the base sites in the sequencing sequence; the value of a ranges from 5% to 10%, and the value of b ranges from 50% to 95%.
The error correction process is as follows: The base frequency of each site in the sequencing sequence is statistically analyzed, and the ratio of sites with sequencing depths lower than c to the total number of sites in the sequencing sequence is calculated. Sequencing sequences with a ratio not higher than d are retained. In the retained sequencing sequences, for sites with sequencing depths lower than c, the corresponding base from the reference sequence is used for substitution, and for sites with sequencing depths greater than c, the base with the highest base frequency is used. The value of c ranges from 3 to 7, and the value of d ranges from 10% to 30%.
2. The method according to claim 1, characterized in that, The merged sequencing data is aligned to the reagent background microbial database. After removing the sequencing data aligned to the reagent background microbial database, it is then aligned to the pathogen identification database.
3. The method according to claim 1, characterized in that, The quality control parameters in step S4 are: retaining alignment sequences with coverage greater than e and alignment error rate less than f, where e ranges from 85% to 99% and f ranges from 0.5% to 3%.
4. The method according to claim 1, characterized in that it also includes the following step: S5. Filter the pathogen detection list to remove groups with a sequence count below x or a sequence count below a proportion y of the total sequence count, where x ranges from 5 to 20 and y ranges from 0.2% to 2%.