Methods, systems, and media for reference database construction for highly pathogenic virus detection

By integrating and quality-controlling viral sequences, standardizing metadata, and clustering based on nucleotide identity and alignment coverage, a reference database for detecting highly pathogenic viruses is constructed. This addresses the data fragmentation and redundancy problems of existing databases, enabling accurate detection of and rapid response to highly pathogenic viruses, and improving the sensitivity and specificity of detection.

CN122245444A · Pending · Publication Date: 2026-06-19 · INST OF MICROBIOLOGY CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF MICROBIOLOGY CHINESE ACAD OF SCI
Filing Date
2026-05-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing virus databases suffer from fragmented data resources, numerous redundant sequences, chaotic metadata, and a lack of standardized cleaning processes in the detection of highly pathogenic viruses. This leads to a high risk of missed detections during high-throughput sequencing and alignment, making it difficult to meet the needs for early warning and precise prevention and control of emerging and re-emerging infectious diseases.

Method used

Virus sequences from public databases are integrated, and quality control and metadata standardization are performed. Clustering is conducted based on nucleotide identity > 99% and alignment coverage > 99% to identify and merge highly homologous redundant sequences. These sequences are then compared with biological characteristics to retain the representative reference sequences with the best quality control results, thus constructing a reference database for the detection of highly pathogenic viruses.

Benefits of technology

A high-quality representative viral sequence dataset has been constructed, which can meet the needs for accurate detection of highly pathogenic viral species and subtypes, has the ability to identify closely related viruses, supports rapid response to emerging and re-emerging viruses, provides systematic data support, and provides technical support for early warning and precise prevention and control of highly pathogenic viruses.

Abstract

This disclosure provides a method, system, and medium for constructing a reference database for the detection of highly pathogenic viruses. It integrates viral sequences from public databases, performs quality control on the viral sequences, standardizes metadata information, and then performs rigorous clustering to identify and merge highly homologous redundant sequences. Further fine-grained deduplication within cluster branches is then performed, ultimately yielding a high-quality dataset of representative viral sequences, which forms the basis of the reference database for the detection of highly pathogenic viruses. This disclosure constructs representative subtype sequences based on genetic evolutionary screening among viral species at the genus / species level as core data resources. This ensures that the constructed reference database not only meets the needs for accurate detection at the species and subtype levels of highly pathogenic viruses but also possesses effective identification capabilities for closely related viruses. Furthermore, it supports rapid response to emerging and re-emerging viruses, providing systematic data support for early warning and precise control of highly pathogenic viruses.

Description

Technical Field

[0001] This disclosure relates to the field of gene processing technology, and in particular to a method, system, and medium for constructing a reference database for the detection of highly pathogenic viruses.

Background Technology

[0002] Currently, the detection and identification technology system for highly pathogenic viruses has significant limitations, making it difficult to meet the needs of early warning and precise prevention and control of emerging and re-emerging infectious diseases. Existing virus databases, such as VISDB, ICTV-Virus Taxonomy, and GISAID, have significant shortcomings in addressing the needs for species and typing detection and risk warning of highly pathogenic viruses. The fundamental differences between the initial design intentions and research perspectives of existing databases have resulted in fragmented data resources, hindering the formation of a systematic support capability. Specifically, public virus sequence databases (such as NCBI's viral genome resources) themselves suffer from serious data quality problems, directly restricting the accurate detection and risk assessment of highly pathogenic viruses. Statistics show that among known virus species, over 76% lack reference or representative genome sequences; over 23% have only one reference sequence, failing to represent the genetic diversity of the entire species; furthermore, the metadata in the sequence data is disorganized, lacking a unified and standardized cleaning process, and contains excessive redundant sequences. For example, the total number of viral gene sequences excluding influenza and SARS-CoV-2 exceeds 3 million, of which over 2.7 million are redundant sequences. These data deficiencies mean that when performing high-throughput sequencing alignment based on existing databases, a large number of short sequence reads cannot obtain accurate species annotations, especially for emerging or rare viruses, where the risk of missed detection is extremely high.

[0003] While various specialized databases have made contributions in specific fields, none have fundamentally solved the aforementioned data quality problems, and the lack of functional modules further exacerbates the limitations of their applications. For example, although the VISDB database focuses on viral sequence integration, its core service target is basic virology research, lacking a specific design for detecting exogenous viral contamination. It also has significant deficiencies in data quality control standards, annotation depth, and clinical relevance assessment, and has not effectively cleaned up redundancy and metadata clutter. ICTV-Virus Taxonomy, as an authoritative reference for viral taxonomy, focuses on the definition and naming rules of viral species, without providing genomic sequence data for direct comparison, and cannot support rapid species identification and pathogenicity prediction after high-throughput sequencing. While GISAID has played an important role in the monitoring of influenza viruses and SARS-CoV-2, its construction scheme is concentrated on sharing mechanisms, resulting in uneven data coverage across regions and viral groups, and a lack of a unified quality control system and standardized analysis processes.

[0004] Therefore, the core issue lies in the lack of a comprehensive reference database construction method specifically designed for the detection of highly pathogenic viruses. Current databases, built for varying purposes (either focusing on specific viral groups or serving basic research), have failed to support the detection and typing of a wide range of highly pathogenic virus species, making it difficult to provide systematic data resource support for high-throughput identification and risk warning of highly pathogenic viruses. It is therefore necessary to provide a method, system, and medium for constructing a reference database for the detection of highly pathogenic viruses.

Summary of the Invention

[0005] The purpose of this disclosure is to provide a method, system, and medium for constructing a reference database for the detection of highly pathogenic viruses, in order to solve at least one of the aforementioned technical problems.

[0006] To achieve the above objectives, in a first aspect, this disclosure provides a method for constructing a reference database for the detection of highly pathogenic viruses, comprising the following steps:

[0007] Integrate virus sequences from public databases and perform quality control on the virus sequences;

[0008] Standardize the metadata information corresponding to the quality-controlled virus sequences;

[0009] For the standardized viral sequences, clustering is performed at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and the clustering results are compared with the known biological characteristics of the virus.

[0010] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch; if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are carried out again based on the downgraded classification unit.

[0011] For the multiple representative reference sequences retained in the viral branch, duplicates are removed based on their corresponding metadata information to obtain the final representative reference sequences of the viral species.

[0012] A reference database for detecting highly pathogenic viruses is constructed by incorporating the final representative reference sequences of the viral species into the reference database.

[0013] In a preferred embodiment, the viral sequences integrated from a public database include all sequences of species, subspecies, and undetermined species.

[0014] In a preferred embodiment, quality control of the viral sequence includes:

[0015] Remove sequences containing the keyword "phage" from the viral sequence and remove sequences with duplicate names;

[0016] For the fragment sequences in the viral sequences, calculate the proportion of ambiguous bases in each fragment sequence, and remove sequences whose ambiguous base proportion exceeds 5% of the total sequence length; on a per-species basis, compare the fragment sequences pairwise, and if the shorter sequence can be 100% covered by the longer sequence, remove the shorter sequence.

[0017] For the assembled sequences in the viral sequence, count the number of contigs in each assembled sequence, remove sequences with more than a preset threshold of contigs, and remove sequences with a length of less than 300 bp.

[0018] Sequences that meet the criteria of a contamination rate of less than 1% and an assembly integrity of more than 99% are screened to obtain the quality-controlled virus sequences.

[0019] In one preferred implementation, the metadata information corresponding to the quality-controlled virus sequence is standardized, including:

[0020] The metadata information is cleaned, the time information in the metadata information is corrected by year, the geographical source information is classified by continent and place name, and the isolation source and host are classified according to the ecological classification standard;

[0021] By mapping the metadata dictionary to the corresponding fields of the metadata information of the virus sequence, the metadata information of all sequences can be standardized.

[0022] In a preferred embodiment, if the clustering result is consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control result in the cluster is selected as the representative reference sequence of the viral branch. The viral branch sequence with the best quality control result is the sequence with the most complete metadata information and the most complete sequence information or the longest assembly length.

[0023] In a preferred embodiment, if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are carried out again based on the downgraded classification unit, and the downgraded classification unit is the subtype category.

[0024] In one preferred implementation, for multiple representative reference sequences retained in the virus branch, deduplication is performed based on their corresponding metadata information, including:

[0025] If the metadata and subtype information of multiple sequences in the same viral branch are completely identical, then only the single entry with the highest assembly quality is retained as the unique representative sequence of that viral branch.

[0026] If multiple sequences within the same viral branch have different metadata information or belong to different subtypes, all of them are retained.

[0027] In a second aspect, this disclosure provides a reference database construction system for the detection of highly pathogenic viruses, implemented through the following modules:

[0028] An integrated quality control module is used to integrate virus sequences from public databases and perform quality control on the virus sequences;

[0029] The metadata processing module is used to standardize the metadata information corresponding to the quality-controlled virus sequences.

[0030] The clustering module is used to cluster standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus.

[0031] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0032] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0033] The deduplication module is used to deduplicate the multiple representative reference sequences retained in the virus branch based on their corresponding metadata information to obtain the final representative reference sequence of the virus species.

[0034] The database building module is used to incorporate representative reference sequences of the final virus species into the reference database, thereby constructing a reference database for the detection of highly pathogenic viruses.

[0035] In a third aspect, this disclosure also provides a viral sequence processing method for detecting highly pathogenic viruses, comprising the following steps:

[0036] Integrate virus sequences from public databases and perform quality control on the virus sequences;

[0037] Standardize the metadata information corresponding to the quality-controlled virus sequences;

[0038] For the standardized viral sequences, clustering is performed at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and the clustering results are compared with the known biological characteristics of the virus.

[0039] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0040] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0041] For the multiple representative reference sequences retained in the viral branch, duplicates are removed based on their corresponding metadata information to obtain the final representative reference sequences of the viral species.

[0042] In a fourth aspect, this disclosure also provides a virus sequence processing system for detecting highly pathogenic viruses, implemented through the following modules:

[0043] An integrated quality control module is used to integrate virus sequences from public databases and perform quality control on the virus sequences;

[0044] The metadata processing module is used to standardize the metadata information corresponding to the quality-controlled virus sequences.

[0045] The clustering module is used to cluster standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus.

[0046] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0047] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0048] The deduplication module is used to remove duplicates from multiple representative reference sequences retained in the virus branch based on their corresponding metadata information, so as to obtain the final representative reference sequence of the virus species.

[0049] In a fifth aspect, this disclosure also provides an electronic device, including: a memory and one or more processors; the memory is used to store one or more computer programs; when the one or more computer programs are executed by the one or more processors, they implement the reference database construction method for highly pathogenic virus detection described in any embodiment of the first aspect of this disclosure or the virus sequence processing method for highly pathogenic virus detection described in the third aspect.

[0050] In a sixth aspect, this disclosure also provides a computer storage medium storing a computer program; when the computer program is executed by a processor, it implements the reference database construction method for detecting highly pathogenic viruses as described in any embodiment of the first aspect of this disclosure or the virus sequence processing method for detecting highly pathogenic viruses as described in the third aspect.

[0051] Beneficial effects:

[0052] Compared to existing technologies, the reference database construction method, system, and medium for highly pathogenic virus detection provided in this disclosure integrate virus sequences from public databases, perform quality control on the virus sequences, standardize metadata information, and then perform rigorous clustering to identify and merge highly homologous redundant sequences. Refined deduplication is then performed within the cluster branches, ultimately yielding a high-quality representative virus sequence dataset from which a reference database usable for highly pathogenic virus detection is constructed. This disclosure constructs representative subtype sequences based on genetic evolution screening among virus species at the genus / species level as core data resources. This ensures that the constructed reference database not only meets the precise detection needs at the species and subtype levels of highly pathogenic viruses, but can also effectively identify closely related viruses. It also supports rapid response to emerging and re-emerging viruses, thus providing systematic data support for early warning and precise prevention and control of highly pathogenic viruses.

Attached Figure Description

[0053] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of this disclosure and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0054] Figure 1 is a flowchart illustrating a method for constructing a reference database for detecting highly pathogenic viruses, as provided in this disclosure.

[0055] Figure 2 is a schematic diagram of the reference database construction system for the detection of highly pathogenic viruses provided in this disclosure.

[0056] Figure 3 is a schematic block diagram of the electronic device provided in this disclosure.

Detailed Implementation

[0057] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. The components of the embodiments of this disclosure described and shown in the accompanying drawings can be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without inventive effort are within the scope of protection of this disclosure.

[0058] Figure 1 is a flowchart of a method for constructing a reference database for detecting highly pathogenic viruses, provided in Embodiment 1 of this disclosure. It should be noted that the method of this disclosure is not limited to the order of the following steps, and in other embodiments the method may include only a portion of the following steps.

[0059] The reference database construction method for highly pathogenic viruses provided in this embodiment can be applied to a reference database construction system for highly pathogenic viruses. By constructing representative subtype sequences based on genetic evolution screening among viral species at the genus / species level as core data resources, the constructed reference database not only meets the accurate detection needs at the species and subtype level of highly pathogenic viruses, but can also effectively identify closely related viruses. It also supports rapid response to emerging and re-emerging viruses, thereby providing systematic data support for early warning and precise prevention and control of highly pathogenic viruses.

[0060] The reference database construction method for detecting highly pathogenic viruses provided in this embodiment includes the following steps:

[0061] Step S10: Integrate virus sequences from public databases and perform quality control on the virus sequences.

[0062] Specifically, this step can screen high-quality viral sequences, ensuring the integrity and quality of the sequence data. The reference database construction system for highly pathogenic virus detection includes an integrated quality control module, which integrates viral sequences from public databases and performs quality control on the viral sequences.

[0063] A virus catalog can be constructed based on public databases, such as NCBI and GenBank. The naming of virus strains must conform to the species naming guidelines established by the International Committee on Taxonomy of Viruses (ICTV), and whole genome sequences are prioritized. The assembly should be complete and continuous, containing few or no unknown bases. Genomes contaminated with exogenous sequences such as vectors, adapters, or host genomes must be rigorously evaluated and excluded. The virus catalog must include all viruses in the pathogen database. Virus sequences must include all sequences at the species, subspecies, and no-rank (undetermined species) levels. The corresponding nucleic acid sequence data can come from RefSeq, a high-quality, non-redundant gene and protein sequence database maintained by NCBI, or from the GenBank database. Specifically, it can include non-reference sequence data and NCBI reference sequence data (i.e., NC sequences). Non-reference sequences can include assembly sequences, scaffold sequences, etc.

[0064] In some implementations, step S10 involves quality control of the viral sequence, including:

[0065] Remove sequences containing the keyword "phage" from the viral sequence and remove sequences with duplicate names;

[0066] For the fragment sequences in the viral sequences, calculate the proportion of ambiguous bases in each fragment sequence, and remove sequences whose ambiguous base proportion exceeds 5% of the total sequence length; on a per-species basis, compare the fragment sequences pairwise, and if the shorter sequence can be 100% covered by the longer sequence, remove the shorter sequence.

[0067] For the assembled sequences in the viral sequence, count the number of contigs in each assembled sequence, remove sequences with more than a preset threshold of contigs, and remove sequences with a length of less than 300 bp.

[0068] Sequences that meet the criteria of a contamination rate of less than 1% and an assembly integrity of more than 99% are screened to obtain the quality-controlled virus sequences.

[0069] Specifically, after integrating data from multiple sources, records whose sequence identifiers contain the keyword "phage" and records with duplicate strain names are filtered out to construct a basic viral dataset with phage sequences removed. That is, data containing "phage" in the sequence name are filtered and deleted, the data are compared with the Phage dataset from NCBI Virus (43,716 records), and sequences with duplicate names are removed.
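The name-based filtering described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `drop_phage_and_duplicate_names`, the `(name, sequence)` record layout, the case-insensitive matching, and the choice to keep the first record of each duplicated name are all assumptions. The comparison against the NCBI Virus Phage dataset is not modeled.

```python
def drop_phage_and_duplicate_names(records):
    """Filter (name, sequence) pairs by name only.

    Assumptions (not stated in the source): matching is case-insensitive,
    and the first record seen for a duplicated name is the one kept.
    """
    seen_names = set()
    kept = []
    for name, sequence in records:
        # Drop any record whose name mentions "phage".
        if "phage" in name.lower():
            continue
        # Drop later records that reuse an already-seen name.
        key = name.strip().lower()
        if key in seen_names:
            continue
        seen_names.add(key)
        kept.append((name, sequence))
    return kept


example = [
    ("Escherichia phage T4", "ACGT"),
    ("Ebola virus Mayinga", "AAAA"),
    ("Ebola virus Mayinga", "AAAT"),
    ("Marburg virus", "CCCC"),
]
filtered = drop_phage_and_duplicate_names(example)
```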

[0070] For the fragment sequences of the viral sequences, the proportion of ambiguous bases (nucleotide symbols other than adenine A, thymine T, cytosine C, and guanine G) is calculated for each fragment. Sequences with an ambiguous base proportion exceeding 5% of the total sequence length are removed. Then, on a per-species basis, sequences are compared pairwise. If a shorter sequence is 100% covered by a longer sequence, the shorter sequence is removed from the dataset. This can be performed using the BlastN software with the parameter set to 100%.
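A minimal sketch of the two fragment-level filters follows. The source performs the coverage check with BlastN; here exact substring containment is swapped in as a simplified stand-in so that the example is self-contained, which means it will miss cases an alignment would catch (for example reverse-complement matches). The function names and the dictionary layout are hypothetical.

```python
def ambiguous_fraction(sequence):
    """Fraction of symbols other than A, C, G, T (empty input counts as fully ambiguous)."""
    sequence = sequence.upper()
    if not sequence:
        return 1.0
    return sum(1 for base in sequence if base not in "ACGT") / len(sequence)


def filter_fragments(seqs_by_id, max_ambiguous=0.05):
    """Apply both fragment filters to the sequences of ONE virus species.

    1. Remove sequences whose ambiguous-base fraction exceeds max_ambiguous.
    2. Remove a shorter sequence fully contained in a longer kept one.
       (Assumption: exact substring containment stands in for a BlastN
       alignment with 100% coverage.)
    """
    clean = {
        seq_id: seq.upper()
        for seq_id, seq in seqs_by_id.items()
        if ambiguous_fraction(seq) <= max_ambiguous
    }
    # Visit longest first so each sequence is only compared against longer ones.
    ordered = sorted(clean.items(), key=lambda item: len(item[1]), reverse=True)
    kept = {}
    for seq_id, seq in ordered:
        if any(seq in longer for longer in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


fragments = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGT",      # contained in "a"
    "c": "NNNNACGTAC",    # 40% ambiguous
    "d": "TTTTGGGG",
}
kept_fragments = filter_fragments(fragments)
```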

[0071] For assembly sequences in the viral sequences, the number of contigs in each assembly is counted on a per-species basis. The number of contigs is the number of longer, continuous sequences (contigs) produced when short sequencing reads are assembled during genome assembly. A higher number of contigs indicates poorer genome assembly, so the range of outliers must be determined. Furthermore, RVDB provides a viral length range of 300 bp to 1.5 kb; therefore, sequences shorter than 300 bp are deleted.
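One possible way to realize the contig-count and length filters is sketched below. The source does not say how the outlier range is determined; Tukey's upper fence (Q3 + 1.5 × IQR) is used here purely as an assumed example, and a fixed `max_contigs` can be passed instead. Treating "length" as the summed contig length, and the record layout with `id` and `contigs` keys, are also assumptions.

```python
import statistics


def tukey_upper_fence(counts):
    """Q3 + 1.5 * IQR, an assumed rule for the outlier range of contig counts."""
    q1, _median, q3 = statistics.quantiles(counts, n=4)
    return q3 + 1.5 * (q3 - q1)


def filter_assemblies(assemblies, min_length=300, max_contigs=None):
    """Filter the assemblies of ONE species.

    assemblies: list of dicts with 'id' and 'contigs' (list of sequence strings).
    Assemblies with too many contigs, or with a total length below min_length,
    are dropped.
    """
    counts = [len(a["contigs"]) for a in assemblies]
    if max_contigs is None:
        # statistics.quantiles needs at least two data points.
        max_contigs = tukey_upper_fence(counts) if len(counts) >= 2 else float("inf")
    kept = []
    for assembly in assemblies:
        total_length = sum(len(contig) for contig in assembly["contigs"])
        if len(assembly["contigs"]) > max_contigs:
            continue
        if total_length < min_length:
            continue
        kept.append(assembly)
    return kept


def make_assembly(index, n_contigs, contig_length=400):
    """Build a toy assembly for the example."""
    return {"id": f"asm{index}", "contigs": ["A" * contig_length] * n_contigs}


toy = [make_assembly(i, n) for i, n in enumerate([1, 1, 2, 2, 3, 3, 4, 50])]
toy.append({"id": "short", "contigs": ["ACGT" * 50]})  # 200 bp in one contig
kept_assemblies = filter_assemblies(toy)
```

In the toy data the 50-contig assembly falls outside the fence and the 200 bp record fails the length check, so both are dropped.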

[0072] Sequences meeting the criteria of a contamination rate of less than 1% and an assembly integrity of greater than 99% are selected and included in the candidate reference genome set, thus obtaining the quality-controlled viral sequences. The high quality of the viral sequences is beneficial for subsequent applications.
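The final quality gate can be expressed as a simple predicate. The sketch below assumes that a contamination rate and an assembly integrity (completeness) value have already been computed for each record by some upstream tool, which the source does not name; the field names `contamination_pct` and `completeness_pct` are hypothetical.

```python
def passes_quality_gate(contamination_pct, completeness_pct,
                        max_contamination=1.0, min_completeness=99.0):
    """True when contamination is below 1% and assembly integrity is above 99%."""
    return contamination_pct < max_contamination and completeness_pct > min_completeness


def candidate_reference_set(records):
    """Keep only the records that pass the gate (field names are assumptions)."""
    return [
        record for record in records
        if passes_quality_gate(record["contamination_pct"], record["completeness_pct"])
    ]


qc_records = [
    {"id": "ok", "contamination_pct": 0.2, "completeness_pct": 99.7},
    {"id": "dirty", "contamination_pct": 1.5, "completeness_pct": 99.9},
    {"id": "partial", "contamination_pct": 0.1, "completeness_pct": 97.0},
]
candidates = candidate_reference_set(qc_records)
```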

[0073] Step S20: Standardize the metadata information corresponding to the quality-controlled virus sequence.

[0074] Specifically, this step involves the structured collection and integration of metadata. The reference database construction system for the detection of highly pathogenic viruses includes a metadata processing module, which is used to standardize the metadata information corresponding to the virus sequences after quality control.

[0075] To achieve data interaction and collaborative analysis among clinical information systems, genome monitoring systems and related platforms, a standardized structured format was used to collect sample and corresponding viral metadata information. A bidirectional readable framework that balances human interpretation and machine processing was established to systematically integrate viral genomics data, covering genome, metagenomics and targeted sequencing data from human samples.

[0076] In some implementations, step S20 includes the following steps:

[0077] The metadata information is cleaned, the time information in the metadata information is corrected by year, the geographical source information is classified by continent and place name, and the isolation source and host are classified according to the ecological classification standard;

[0078] By mapping the metadata dictionary to the corresponding fields of the metadata information of the virus sequence, the metadata information of all sequences can be standardized.

[0079] Specifically, the time and geographic origin information in the metadata is corrected by adjusting them to the same format to make them more accurate and consistent. The standardized virus sequence has a unified structured format, which is highly readable and facilitates subsequent processing.
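The dictionary-based standardization can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the input field names (`collection_date`, `country`, `host`), the tiny lookup tables, the host categories, and the "Country: region" location format. A real metadata dictionary would be far larger and would follow whatever ecological classification standard is adopted.

```python
import re

# Hypothetical, deliberately tiny metadata dictionaries.
CONTINENT_BY_COUNTRY = {
    "china": "Asia",
    "guinea": "Africa",
    "usa": "North America",
    "brazil": "South America",
}
HOST_CATEGORY = {
    "homo sapiens": "human",
    "human": "human",
    "sus scrofa": "livestock",
    "gallus gallus": "poultry",
}


def normalize_year(raw):
    """Extract a four-digit year (1800-2099) from a free-text date, else None."""
    match = re.search(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)", raw or "")
    return int(match.group(1)) if match else None


def normalize_location(raw):
    """Split 'Country: region' and look up the continent."""
    country = (raw or "").split(":")[0].strip()
    return {
        "continent": CONTINENT_BY_COUNTRY.get(country.lower(), "unknown"),
        "place": country or "unknown",
    }


def standardize(record):
    """Map raw metadata fields onto a unified structured record."""
    location = normalize_location(record.get("country"))
    host = (record.get("host") or "").strip().lower()
    return {
        "year": normalize_year(record.get("collection_date")),
        "continent": location["continent"],
        "place": location["place"],
        "host_category": HOST_CATEGORY.get(host, "unknown"),
    }


standardized = standardize(
    {"collection_date": "12-Mar-2014", "country": "Guinea: Gueckedou", "host": "Homo sapiens"}
)
```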

[0080] Step S30: For the standardized viral sequences, at the viral species level, clustering is performed based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and the clustering results are compared with the known biological characteristics of the virus;

[0081] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch; if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are carried out again based on the downgraded classification unit.

[0082] Specifically, the clustering results are compared and verified against known biological characteristics of the virus, such as the segment information of segmented (multi-fragment) viruses. If the cluster branches are consistent with the known segment division, the sequence with the best quality control result is selected from each branch as the representative reference sequence of that branch. The viral branch sequence with the best quality control result is the sequence with the most complete metadata information and the most complete sequence information or the longest assembly length. Here, the most complete metadata information means that the record includes at least five metadata items, such as strain name, strain type, isolation host, isolation time, and geographical origin. Among the sequences that include at least five metadata items, the one with the most complete sequence information or the longest assembly length is selected as the viral branch sequence. The most complete sequence information means the largest genome coverage, that is, the highest integrity.
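The selection rule for a branch representative can be sketched as a ranking. The field names, the `genome_coverage` value, and the tie-breaking order (metadata completeness first, then genome coverage, then sequence length) are assumptions; the source only states that the best sequence has the most complete metadata and the most complete sequence information or the longest assembly length.

```python
# Hypothetical field names for the five metadata items named in the text.
REQUIRED_FIELDS = ("strain_name", "strain_type", "host", "isolation_date", "geo_origin")


def metadata_item_count(record):
    """Number of the five metadata items that are present and non-empty."""
    return sum(1 for field in REQUIRED_FIELDS if record.get(field))


def pick_representative(cluster):
    """Choose one record from a cluster branch.

    Ranking (assumed order): metadata completeness, then genome coverage,
    then sequence length.
    """
    return max(
        cluster,
        key=lambda record: (
            metadata_item_count(record),
            record.get("genome_coverage", 0.0),
            len(record["sequence"]),
        ),
    )


full_metadata = {
    "strain_name": "X1", "strain_type": "wild", "host": "human",
    "isolation_date": "2014", "geo_origin": "Africa",
}
branch = [
    {"id": "long_but_sparse", "sequence": "A" * 900, "genome_coverage": 0.99, "host": "human"},
    {"id": "complete_short", "sequence": "A" * 500, "genome_coverage": 0.90, **full_metadata},
    {"id": "complete_long", "sequence": "A" * 800, "genome_coverage": 0.98, **full_metadata},
]
representative = pick_representative(branch)
```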

[0083] If the clustering results are inconsistent with known segment information or other key classification features, it indicates that there may be insufficiently clarified genetic diversity within the current "species" level taxonomic unit. In this case, the relevant sequence set should be re-clustered based on a downgraded taxonomic unit (such as subtype category), and the above clustering, comparison, and representative sequence selection process should be repeated.
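The fallback from species-level to subtype-level clustering can be sketched as below. The consistency test used here (every cluster contains sequences from exactly one genome segment) is one assumed reading of "consistent with the known segment division", and the clustering routine is passed in as a hypothetical `cluster_fn` callable rather than implemented.

```python
from collections import defaultdict


def is_consistent(clusters, segment_of):
    """Assumed check: each cluster holds sequences from a single segment."""
    return all(len({segment_of[seq_id] for seq_id in cluster}) == 1 for cluster in clusters)


def cluster_with_fallback(records, cluster_fn):
    """Cluster at species level; on inconsistency, re-cluster within each subtype.

    records: dicts with 'id', 'segment' and 'subtype' (hypothetical fields).
    cluster_fn: callable taking a list of ids and returning a list of clusters.
    Returns the level that was used and the resulting clusters.
    """
    ids = [record["id"] for record in records]
    segment_of = {record["id"]: record["segment"] for record in records}
    clusters = cluster_fn(ids)
    if is_consistent(clusters, segment_of):
        return "species", clusters
    # Downgrade the classification unit: cluster each subtype separately.
    ids_by_subtype = defaultdict(list)
    for record in records:
        ids_by_subtype[record["subtype"]].append(record["id"])
    refined = []
    for subtype in sorted(ids_by_subtype):
        refined.extend(cluster_fn(ids_by_subtype[subtype]))
    return "subtype", refined


def lump_everything(ids):
    """Toy clustering that always returns one cluster, used to trigger the fallback."""
    return [list(ids)]


toy_records = [
    {"id": "a", "segment": "L", "subtype": "H1"},
    {"id": "b", "segment": "S", "subtype": "H2"},
]
level, final_clusters = cluster_with_fallback(toy_records, lump_everything)
```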

[0084] Through iterative and refined analysis, this step captures and retains biologically significant subtype-specific sequences while removing redundancy, thereby optimizing species detection sensitivity and improving detection specificity for highly homologous species, so that closely related species remain distinguishable. Setting the clustering parameters to nucleotide identity > 99% and alignment coverage > 99% preserves a sufficient number of cluster branches and therefore of representative sequences, improving species and strain coverage and making the constructed reference database more comprehensive.
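The threshold-based clustering itself can be illustrated with a union-find over precomputed pairwise alignment results. The source does not specify the linkage strategy or the clustering tool; single linkage (any qualifying pair joins two clusters) is an assumption here, as is the `(id_a, id_b, identity, coverage)` tuple layout with both values in percent.

```python
def cluster_by_thresholds(ids, hits, min_identity=99.0, min_coverage=99.0):
    """Single-linkage clustering of one species' sequences.

    ids:  all sequence ids.
    hits: iterable of (id_a, id_b, identity_pct, coverage_pct) pairwise results.
    Two sequences are linked when identity > 99% AND coverage > 99%.
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(node):
        # Walk to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for id_a, id_b, identity, coverage in hits:
        if identity > min_identity and coverage > min_coverage:
            root_a, root_b = find(id_a), find(id_b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


pairwise_hits = [
    ("s1", "s2", 99.6, 99.8),
    ("s2", "s3", 99.2, 99.5),
    ("s3", "s4", 99.9, 80.0),  # identity passes, coverage does not
]
species_clusters = cluster_by_thresholds(["s1", "s2", "s3", "s4"], pairwise_hits)
```

With single linkage, s1 and s3 end up in the same cluster through s2 even though they were never compared directly; a stricter linkage would produce more, smaller clusters.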

[0085] Step S40: For the multiple representative reference sequences retained in the virus branch, deduplication is performed based on their corresponding metadata information to obtain the final representative reference sequences of the virus species.

[0086] Specifically, this step implements refined deduplication within cluster branches. The reference database construction system for highly pathogenic virus detection includes a deduplication module. This module is used to deduplicate the multiple representative reference sequences retained in the virus branches based on their corresponding metadata information, thereby obtaining the final representative reference sequences of the virus species.

[0087] After completing the initial clustering and preliminarily selecting representative sequences, a refined deduplication screening is performed on multiple sequences retained in the same cluster branch based on their metadata information. This can effectively eliminate non-critical repetitive sequences while fully retaining key sequences that reflect the true genetic structure, transmission trajectory and diversity of the viral population, thereby improving the sensitivity of species detection and enhancing the specificity for distinguishing highly homologous viruses.

[0088] In some implementations, step S40 includes:

[0089] If the metadata and subtype information of multiple sequences in the same viral branch are completely identical, then only the single entry with the highest assembly quality is retained as the unique representative sequence of that viral branch.

[0090] If multiple sequences within the same viral branch have different metadata information or belong to different subtypes, all of them are retained.

[0091] Specifically, if multiple sequences within a branch have identical metadata and subtype information, they are considered technical duplicates or multiple submissions from the same source. In this case, only the single entry with the highest sequence assembly quality is retained as the sole representative sequence for that branch. The entry with the highest assembly quality is the one with the longest assembled sequence, i.e., the sequence with the best assembly integrity. If sequences within the same branch have significantly different biologically relevant metadata information (e.g., they were isolated from different hosts or regions), or belong to clearly defined different subtype categories, all are retained. In some implementations, metadata information includes, but is not limited to, strain name, strain type, isolation host, isolation time, and geographical origin.
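By way of example, the within-branch deduplication rule may be sketched as follows. Records are represented as plain dictionaries; the key names in `META_KEYS` and the `accession`, `length`, and `subtype` keys are hypothetical. Records whose metadata and subtype are identical fall into the same group, from which only the longest is kept, while records that differ in any of these items fall into different groups and are therefore all retained.

```python
from collections import defaultdict

# Hypothetical key names for the metadata items listed in the text.
META_KEYS = ("strain_name", "strain_type", "host", "isolation_time", "geo_origin")


def dedup_branch(records):
    """Deduplicate the sequences of one cluster branch by metadata.

    records: list of dicts with 'accession', 'length', 'subtype' and the
             metadata keys above (missing keys are treated as None).
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec.get("subtype"),) + tuple(rec.get(k) for k in META_KEYS)
        groups[key].append(rec)
    # Within each group of identical metadata, keep the longest assembly.
    return [max(group, key=lambda r: r["length"]) for group in groups.values()]
```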

[0092] Step S50: Incorporate the representative reference sequences of the final virus species into the reference database to construct a reference database for the detection of highly pathogenic viruses.

[0093] Specifically, after processing the virus sequences in steps S10-S40, a standard dataset of representative reference sequences of non-redundant, high-quality virus species is obtained. This standard dataset is incorporated into a reference database to construct a reference database for the detection of highly pathogenic viruses. The standard dataset can be used for the detection and typing of highly pathogenic viruses, and can provide systematic data resource support for the high-throughput identification and risk warning of highly pathogenic viruses.

[0094] In some implementations, after step S50, the method further includes: verifying and correcting the accuracy of species classification names in the reference database in accordance with international nomenclature standards.

[0095] Specifically, in order to ensure the accuracy and timeliness of all species classification names and scientific names in the database, existing classification names are reviewed and corrected in accordance with relevant international naming standards to confirm that they are all currently recognized valid classification names.

[0096] In some implementations, after step S50, the method further includes: performing pairwise comparisons of the representative reference sequences of viral species in the reference database; if a representative reference sequence of one viral species shows high similarity to sequences of a different species, it is considered a potential classification error or data contamination and is removed; if it cannot be aligned to sequences of other species, it is retained. This ultimately generates a non-redundant reference database containing typing information.
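The cross-species check may be illustrated by the following sketch. The disclosure does not state a numeric cutoff for "high similarity", so reusing the 99% identity and coverage thresholds here is an assumption, as is the choice to flag both members of an offending pair; the function name and the input layout are hypothetical.

```python
def drop_cross_species_hits(species_of, hits, min_identity=99.0, min_coverage=99.0):
    """Return accessions that show no high-similarity hit to another species.

    species_of: dict mapping accession -> species name.
    hits:       iterable of (acc_a, acc_b, identity_percent, coverage_percent)
                from a pairwise comparison of representative sequences.
    """
    flagged = set()
    for a, b, identity, coverage in hits:
        different_species = species_of[a] != species_of[b]
        if different_species and identity > min_identity and coverage > min_coverage:
            # Assumption: both sequences are treated as suspect.
            flagged.update((a, b))
    # Sequences with no cross-species hit are retained, in input order.
    return [acc for acc in species_of if acc not in flagged]
```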

[0097] Understandably, existing public viral sequence databases suffer from severe reference sequence gaps. Over 76% of viral species lack representative genome sequences, and over 23% of species have only one sequence, failing to represent the genetic diversity within a species and making it difficult to obtain accurate species annotations from high-throughput sequencing data. Furthermore, existing databases have disorganized metadata and excessive redundant sequences, severely interfering with the reliability of detection results. In addition, while existing specialized databases such as VISDB, ICTV-VirusTaxonomy, and GISAID each have their strengths, none have integrated the function of accurately distinguishing between closely related or highly nucleotide-similar viral species, nor can they perform typing detection of important viral subtypes, and they lack the ability to rapidly identify emerging and re-emerging viruses. To address the aforementioned issues, this disclosure constructs representative subtype sequences based on genetic evolution screening among viral species at the genus / species level as core data resources. This enables the constructed reference database to not only meet the precise detection needs at the level of highly pathogenic viral species and subtypes, but also to effectively identify closely related viruses. Furthermore, it supports rapid response to emerging and re-emerging viruses, thereby providing systematic data support for early warning and precise prevention and control of highly pathogenic viruses.

[0098] This disclosure employs a "genetic evolutionary screening" strategy for precise differentiation of closely related species. Addressing the limitation of existing databases in effectively distinguishing different viral species and subtypes with high nucleotide similarity, this disclosure proposes for the first time a subtype representative sequence screening method based on genus / species-level genetic evolutionary relationships. Unlike traditional databases that simply compile sequences, this disclosure identifies and preserves biologically significant subtype-specific sequences through rigorous cluster analysis (Identity > 99%, Coverage > 99%) and cross-validation with known viral biological characteristics (such as segment information for multi-fragment viruses). This fundamentally solves the problem of accurate identification of highly homologous viruses.

[0099] This disclosure employs a dual quality control system that balances "redundancy removal" and "preservation of genetic diversity." Existing databases often face a dilemma: either excessive redundancy impacts analysis efficiency, or excessive redundancy removal leads to the loss of genetic diversity. This disclosure innovatively constructs a "two-step" refined redundancy removal mechanism. First, technical redundancy is removed through sequence similarity clustering. Then, sequences within the same cluster branch are further screened based on metadata information (host, geographical origin, time, etc.). This design effectively eliminates redundant sequences while fully preserving key sequences that reflect the true genetic structure, transmission trajectory, and spatiotemporal diversity of the viral population, significantly improving the sensitivity of species detection.

[0100] This disclosure employs a "two-way readable" integration framework for structured metadata and genomic data. Addressing the shortcomings of existing public databases' metadata being disorganized and difficult for machines to process, this disclosure establishes a two-way readable metadata framework that balances human interpretation and machine processing. By standardizing the collection of sample information (host, geographical origin, isolation time, habitat, etc.) in a structured format, seamless integration and collaborative analysis with clinical information systems and genome surveillance systems are achieved, providing a high-quality data foundation for subsequent source tracing analysis, spatiotemporal evolution, and risk assessment.
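One possible shape for such a structured record is sketched below. The class `SampleMetadata`, its field names, and the choice of JSON as the exchange format are assumptions made for illustration; the disclosure does not prescribe a particular schema or serialization.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SampleMetadata:
    """Structured sample record; all field names are hypothetical."""
    accession: str
    host: Optional[str] = None
    geo_continent: Optional[str] = None
    geo_place: Optional[str] = None
    isolation_year: Optional[int] = None
    habitat: Optional[str] = None

    def to_json(self):
        # Sorted keys keep the text stable for people and for diffs.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)

    @classmethod
    def from_json(cls, text):
        # Machine-side parsing back into the same typed record.
        return cls(**json.loads(text))
```

A record written with `to_json` can be read by a person as plain key-value text and parsed back by a program with `from_json` without loss.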

[0101] This disclosure provides a rapid response detection capability for emerging and re-emerging viruses. Existing databases used to detect emerging viruses rely heavily on known reference sequences, resulting in inherent blind spots. By retaining representative subtype-level sequences based on genetic evolutionary selection and combining them with a complete species-level taxonomic framework, this disclosure gives the database the potential for "cross-subtype identification" of closely related unknown or mutated viruses, thereby achieving a methodologically sound rapid early warning capability for emerging and re-emerging viruses.

[0102] In summary, the reference database construction method for detecting highly pathogenic viruses provided in this disclosure has the following technical advantages:

[0103] 1) The accuracy and specificity of virus species and subtype identification are significantly improved. By constructing a representative sequence library of subtypes based on genus / species-level genetic evolution screening, and by using strict clustering with the dual thresholds of nucleotide identity > 99% and alignment coverage > 99%, this disclosure effectively solves the technical challenge, present in existing databases, of accurately distinguishing different virus species with close phylogenetic relationships and high nucleotide similarity. At the same time, the clustering results are cross-validated against biological characteristics such as the segment information of multi-segment viruses, ensuring the biological significance of subtype classification and fundamentally improving detection specificity.

[0104] 2) This disclosure effectively solves the problem of balancing redundancy removal against diversity preservation, and improves detection sensitivity. It innovatively adopts a "two-step" refined redundancy removal mechanism. First, it removes technical redundancy through sequence similarity clustering. Then, it performs a secondary screening of sequences within the same cluster branch based on metadata information (host, geographical origin, isolation time, habitat, etc.). This design effectively removes more than 2.7 million redundant sequences while fully preserving key sequences that reflect the true genetic structure, transmission trajectory, and spatiotemporal diversity of the virus population. It fundamentally solves the dilemma of traditional databases, in which either "excessive redundancy removal leads to loss of diversity" or "excessive redundancy affects analysis efficiency," and significantly improves the detection sensitivity for emerging and mutated viruses.

[0105] 3) A high-quality, standardized viral genome data resource is constructed. The raw data are systematically screened using multiple quality control parameters, including contamination rate < 1%, integrity > 99%, ambiguous (fuzzy) bases < 5%, exogenous sequence screening, and sequence length ≥ 300 bp. Combined with a structured metadata collection framework, this effectively addresses the data quality problems commonly found in existing public databases, such as messy metadata, incomplete sequence information, and missing reference sequences (more than 76% of species have no reference genome), providing a reliable data foundation for accurate virus detection.
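As a simplified illustration, the numeric quality control thresholds listed above may be combined into a single filter as sketched below. The `QCStats` type and its field names are hypothetical, the contamination and integrity values are assumed to come from an external assessment tool, and exogenous sequence screening is not covered. The sketch also applies every threshold to every record, whereas the method may apply different checks to fragment sequences and to assembled sequences.

```python
from dataclasses import dataclass


@dataclass
class QCStats:
    accession: str
    length: int                # sequence length in bp
    contamination: float       # percent, from an external tool (assumed)
    integrity: float           # percent completeness, from an external tool (assumed)
    ambiguous_percent: float   # percent of ambiguous bases


def ambiguous_percent(seq):
    """Percentage of bases that are not A, C, G, T or U."""
    if not seq:
        return 100.0
    upper = seq.upper()
    unambiguous = sum(upper.count(base) for base in "ACGTU")
    return 100.0 * (len(upper) - unambiguous) / len(upper)


def passes_qc(stats):
    """Apply the thresholds: length >= 300 bp, contamination < 1%,
    integrity > 99%, ambiguous bases < 5%."""
    return (
        stats.length >= 300
        and stats.contamination < 1.0
        and stats.integrity > 99.0
        and stats.ambiguous_percent < 5.0
    )
```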

[0106] 4) It enables rapid response to emerging and re-emerging viruses. By preserving representative subtype sequences and establishing a complete species-level taxonomic framework, the database possesses the potential for "cross-subtype identification" of closely related unknown or mutated viruses. Compared to traditional methods that rely on known reference sequences, this disclosure methodologically achieves early warning capabilities for emerging and re-emerging viruses, shortening the technological gap from virus emergence to detection and response.

[0107] The reference database constructed using the method provided in this disclosure can significantly reduce the cost of virus detection and monitoring. Traditional virus detection methods rely heavily on virus culture and animal experiments, resulting in long experimental cycles, complex operations, and high costs. By constructing a high-quality reference database, this disclosure enables high-throughput sequencing-based virus detection to achieve accurate identification without relying on live virus culture, significantly reducing experimental consumables, equipment investment, and labor costs. Statistics show that this disclosure can reduce the cost of a single virus detection by more than 60% and shorten the detection cycle by more than 70%. This disclosure also enhances data reuse value and avoids redundant construction. It effectively cleans redundant sequences from existing public databases, retaining only representative and biologically significant sequences, to form a non-redundant, high-quality standard dataset. Researchers do not need to repeat data cleaning and quality control work and can use this database directly for subsequent analysis, significantly saving the time and computational resources spent on data preprocessing. This disclosure further lowers the technical threshold of the public health monitoring system. The database is standardized, structured, and easy to integrate, and can be readily connected with existing clinical information systems and genome monitoring systems. With the help of this database, grassroots disease control institutions can achieve efficient virus identification and source tracing analysis without a professional bioinformatics team, which reduces the application threshold of advanced detection technologies.

[0108] This disclosure enhances early warning capabilities for emerging and re-emerging infectious diseases, safeguarding public health security. By constructing a high-quality virus reference database, it provides crucial technical support for the early identification and precise tracing of highly pathogenic viruses. In the early stages of an outbreak, the ability to rapidly identify virus species and subtypes and trace their cross-host transmission pathways is decisive for timely implementation of control measures and interruption of transmission chains. Through structured collection of metadata information such as host, geographical origin, and habitat of samples, it provides a data foundation for revealing the transmission patterns of viruses at the human-animal-environment interface, contributing to a deeper understanding of the ecological drivers of cross-species viral transmission and providing a scientific basis for the source control of zoonotic diseases.

[0109] Please see Figure 2, which is a schematic diagram of the reference database construction system for highly pathogenic virus detection provided in Embodiment 2 of this disclosure. The reference database construction system 100 for highly pathogenic virus detection provided in this disclosure includes:

[0110] The integrated quality control module 10 is used to integrate virus sequences from public databases and perform quality control on the virus sequences.

[0111] Metadata processing module 20 is used to standardize the metadata information corresponding to the quality-controlled virus sequence;

[0112] Clustering module 30 is used to cluster standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus.

[0113] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0114] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0115] The deduplication module 40 is used to deduplicate the multiple representative reference sequences retained in the virus branch based on their corresponding metadata information to obtain the final representative reference sequence of the virus species.

[0116] Database building module 50 is used to incorporate representative reference sequences of the final virus species into the reference database to build a reference database for the detection of highly pathogenic viruses.

[0117] This disclosure also provides a viral sequence processing method for detecting highly pathogenic viruses, comprising the following steps:

[0118] Integrate virus sequences from public databases and perform quality control on the virus sequences;

[0119] Standardize the metadata information corresponding to the quality-controlled virus sequences;

[0120] Cluster the standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and compare the clustering results with the known biological characteristics of the virus.

[0121] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0122] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0123] For the multiple representative reference sequences retained in the viral branch, duplicates are removed based on their corresponding metadata information to obtain the final representative reference sequences of the viral species.

[0124] Embodiment 4 of this disclosure also provides a virus sequence processing system for detecting highly pathogenic viruses, implemented through the following modules:

[0125] An integrated quality control module is used to integrate virus sequences from public databases and perform quality control on the virus sequences;

[0126] The metadata processing module is used to standardize the metadata information corresponding to the quality-controlled virus sequences.

[0127] The clustering module is used to cluster standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus.

[0128] If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch.

[0129] If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison should be carried out again based on the downgraded classification units;

[0130] The deduplication module is used to remove duplicates from multiple representative reference sequences retained in the virus branch based on their corresponding metadata information, so as to obtain the final representative reference sequence of the virus species.

[0131] Understandably, the aforementioned functional modules can be stored in memory as software programs and executed by a processor. In alternative embodiments, the aforementioned functional modules can also be hardware with specific functions, such as chips programmed with specific software.

[0132] It should be noted that, in practice, the reference database construction method for highly pathogenic virus detection can be implemented using the aforementioned reference database construction system 100 for highly pathogenic virus detection. The reference database construction system 100 for highly pathogenic virus detection, employing one or more specific embodiments of the reference database construction method for highly pathogenic virus detection described in the above embodiments, can construct a reference database including high-quality representative viral sequences for virus detection. Similarly, the virus sequence processing method for highly pathogenic virus detection can be implemented using the aforementioned virus sequence processing system for highly pathogenic virus detection. The virus sequence processing system for highly pathogenic virus detection, employing the aforementioned virus sequence processing method for highly pathogenic virus detection, can process viral sequences to obtain high-quality representative viral sequences for virus detection.

[0133] That is, all embodiments of the reference database construction method for highly pathogenic virus detection provided in this disclosure are applicable to the reference database construction system 100 for highly pathogenic virus detection, the virus sequence processing method for highly pathogenic virus detection, and the virus sequence processing system for highly pathogenic virus detection provided in this disclosure, and can all achieve the same or similar beneficial effects, which will not be described in detail here.

[0134] Referring to Figure 3, this disclosure also provides an electronic device, including: a memory 210 and one or more processors 220.

[0135] Specifically, the memory 210 is used to store one or more computer programs; when the one or more computer programs are executed by one or more processors 220, they implement the reference database construction method for highly pathogenic virus detection described in Embodiment 1 or the virus sequence processing method for highly pathogenic virus detection described in Embodiment 3.

[0136] The memory 210 may be, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc. The memory 210 stores programs, and the processor 220 runs these programs after receiving execution instructions to implement the reference database construction method for highly pathogenic virus detection described in Embodiment 1 or the virus sequence processing method for highly pathogenic virus detection described in Embodiment 3. It is understood that access to the memory 210 by the processor 220 and other possible components can be performed under the control of the memory controller.

[0137] The processor 220 may be an integrated circuit chip with signal processing capabilities. The processor 220 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., or it may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, capable of implementing or executing the methods and steps disclosed in Embodiment 1 or Embodiment 3 of this disclosure.

[0138] This disclosure also provides a computer storage medium storing a computer program, which, when executed by a processor, implements the reference database construction method for detecting highly pathogenic viruses described in Embodiment 1 or the virus sequence processing method for detecting highly pathogenic viruses described in Embodiment 3.

[0139] This disclosure also provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the reference database construction method for detecting highly pathogenic viruses described in Embodiment 1 or the virus sequence processing method for detecting highly pathogenic viruses described in Embodiment 3.

[0140] In summary, the method, system, device, medium, and product for constructing a reference database for highly pathogenic virus detection provided in this disclosure integrate virus sequences from public databases, perform quality control on the virus sequences, standardize the metadata information, and then perform rigorous clustering to identify and merge highly homologous redundant sequences. Refined deduplication is then performed within the clustering branches, ultimately yielding a high-quality representative virus sequence dataset from which a reference database usable for highly pathogenic virus detection is constructed. This disclosure constructs representative subtype sequences, based on genetic evolution screening among virus species at the genus / species level, as core data resources. This ensures that the constructed reference database can meet the precise detection needs at the species and subtype levels of highly pathogenic viruses, effectively identify closely related viruses, and support rapid response to emerging and re-emerging viruses, thus providing systematic data support for early warning and precise prevention and control of highly pathogenic viruses.

[0141] The above description is merely an embodiment of this disclosure and does not limit the patent scope of this disclosure. Any equivalent structural or procedural transformations made using the content of this disclosure and its drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this disclosure.

Claims

1. A method for constructing a reference database for detecting highly pathogenic viruses, characterized in that it includes the following steps: integrating virus sequences from public databases and performing quality control on the virus sequences; standardizing the metadata information corresponding to the quality-controlled virus sequences; clustering the standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and comparing the clustering results with the known biological characteristics of the virus; if the clustering results are consistent with the known biological characteristics of the virus, selecting the viral branch sequence with the best quality control results in the cluster as the representative reference sequence of the viral branch; if the clustering results are inconsistent with the known biological characteristics of the virus, carrying out the clustering and comparison again based on the downgraded classification unit; for the multiple representative reference sequences retained in the viral branch, removing duplicates based on their corresponding metadata information to obtain the final representative reference sequences of the viral species; and incorporating the final representative reference sequences of the viral species into the reference database to construct a reference database for detecting highly pathogenic viruses.

2. The method for constructing a reference database for detecting highly pathogenic viruses as described in claim 1, characterized in that, The viral sequences integrated from public databases include all sequences from species, subspecies, and undetermined species.

3. The method for constructing a reference database for detecting highly pathogenic viruses as described in claim 1, characterized in that the quality control of the viral sequences includes: removing sequences containing the keyword "phage" from the viral sequences and removing sequences with duplicate names; for the fragment sequences in the viral sequences, calculating the proportion of ambiguous (fuzzy) bases in each fragment sequence and removing sequences whose ambiguous base proportion exceeds 5% of the total sequence length; on a per-virus-species basis, comparing the fragment sequences pairwise by length and, if a shorter sequence is 100% covered by a longer sequence, removing the shorter sequence; for the assembled sequences in the viral sequences, counting the number of contigs in each assembled sequence, removing sequences with more than a preset threshold of contigs, and removing sequences with a length of less than 300 bp; and screening for sequences that meet the criteria of a contamination rate of less than 1% and an assembly integrity of more than 99% to obtain the quality-controlled virus sequences.

4. The method for constructing a reference database for detecting highly pathogenic viruses as described in claim 1, characterized in that standardizing the metadata information corresponding to the quality-controlled virus sequences includes: cleaning the metadata information, correcting the time information in the metadata information by year, classifying the geographical source information by continent and place name, and classifying the isolation source and host according to the ecological classification standard; and mapping the metadata dictionary to the corresponding fields of the metadata information of the virus sequences, so that the metadata information of all sequences is standardized.

5. The method for constructing a reference database for detecting highly pathogenic viruses as described in claim 1, characterized in that, If the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch. The viral branch sequence with the best quality control results is the sequence with the most complete metadata information and the most complete sequence information or the longest assembly length.

6. The method for constructing a reference database for detecting highly pathogenic viruses as described in claim 1, characterized in that, If the clustering results are inconsistent with the known biological characteristics of the virus, then the clustering and comparison will be carried out again based on the downgraded classification unit, and the downgraded classification unit will be the subtype category.

7. The method for constructing a reference database for detecting highly pathogenic viruses as described in any one of claims 1-6, characterized in that, For multiple representative reference sequences retained in the virus branch, deduplication is performed based on their corresponding metadata information, including: If the metadata and subtype information of multiple sequences in the same viral branch are completely identical, then only the single entry with the highest assembly quality is retained as the unique representative sequence of that viral branch. If multiple sequences within the same viral branch have different metadata information or belong to different subtypes, all of them are retained.

8. A reference database construction system for detecting highly pathogenic viruses, characterized in that it is implemented through the following modules: an integrated quality control module, used to integrate virus sequences from public databases and perform quality control on the virus sequences; a metadata processing module, used to standardize the metadata information corresponding to the quality-controlled virus sequences; a clustering module, used to cluster the standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus, wherein if the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch, and if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are carried out again based on the downgraded classification unit; a deduplication module, used to deduplicate the multiple representative reference sequences retained in the virus branch based on their corresponding metadata information to obtain the final representative reference sequence of the virus species; and a database building module, used to incorporate the final representative reference sequences of the virus species into the reference database, thereby constructing a reference database for the detection of highly pathogenic viruses.

9. A method for processing viral sequences for detecting highly pathogenic viruses, characterized in that it comprises the following steps: integrating viral sequences from public databases and performing quality control on the viral sequences; standardizing the metadata information corresponding to the quality-controlled viral sequences; clustering the standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and comparing the clustering results with the known biological characteristics of the virus, wherein if the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch, and if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are performed again based on downgraded classification units; and deduplicating the multiple representative reference sequences retained in the viral branch based on their corresponding metadata information, so as to obtain the final representative reference sequences of the viral species.
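The clustering and representative-selection step of claim 9 can be sketched as follows. The claims fix only the pairwise thresholds (nucleotide identity > 99% and alignment coverage > 99%); the clustering algorithm itself is not specified, so this sketch assumes single-linkage grouping with a union-find structure. The function name, the precomputed `pairs` input, and the `qc_score` mapping are hypothetical, and the comparison against known biological characteristics (with re-clustering on downgraded classification units) is left out.

```python
def cluster_and_pick(seq_ids, pairs, qc_score,
                     min_identity=0.99, min_coverage=0.99):
    """Group sequences linked by high-identity, high-coverage alignments.

    pairs: iterable of (id_a, id_b, identity, coverage), fractions in [0, 1],
           assumed to come from an external pairwise aligner.
    qc_score: dict mapping sequence id to an assumed quality-control score.
    Returns {representative_id: sorted list of cluster members}.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Find the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairs:
        # Both thresholds are strict (> 99%), as in the claim wording.
        if identity > min_identity and coverage > min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)

    # The member with the best QC score represents each cluster.
    return {
        max(members, key=lambda s: qc_score[s]): sorted(members)
        for members in clusters.values()
    }


if __name__ == "__main__":
    ids = ["s1", "s2", "s3"]
    pairs = [("s1", "s2", 0.995, 0.999), ("s2", "s3", 0.98, 1.0)]
    qc = {"s1": 0.8, "s2": 0.9, "s3": 0.7}
    print(cluster_and_pick(ids, pairs, qc))
```

Single linkage can chain distantly related sequences through intermediates; a centroid-based or complete-linkage scheme would be a stricter alternative under the same thresholds.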

10. A viral sequence processing system for detecting highly pathogenic viruses, characterized in that it comprises the following modules: an integrated quality control module, configured to integrate viral sequences from public databases and perform quality control on the viral sequences; a metadata processing module, configured to standardize the metadata information corresponding to the quality-controlled viral sequences; a clustering module, configured to cluster the standardized viral sequences at the viral species level based on the criteria of nucleotide identity > 99% and alignment coverage > 99% between pairs of sequences, and to compare the clustering results with the known biological characteristics of the virus, wherein if the clustering results are consistent with the known biological characteristics of the virus, the viral branch sequence with the best quality control results in the cluster is selected as the representative reference sequence of the viral branch, and if the clustering results are inconsistent with the known biological characteristics of the virus, the clustering and comparison are performed again based on downgraded classification units; and a deduplication module, configured to deduplicate the multiple representative reference sequences retained in the viral branch based on their corresponding metadata information, so as to obtain the final representative reference sequences of the viral species.

11. An electronic device, characterized in that it comprises: a memory and one or more processors; the memory is configured to store one or more computer programs; and when the one or more computer programs are executed by the one or more processors, they implement the reference database construction method for detecting highly pathogenic viruses according to any one of claims 1-7 or the viral sequence processing method for detecting highly pathogenic viruses according to claim 9.

12. A computer storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the reference database construction method for detecting highly pathogenic viruses according to any one of claims 1-7 or the viral sequence processing method for detecting highly pathogenic viruses according to claim 9.