Method and system for evaluating the virulence grade of a Nocardia seriolae strain
By constructing a specific virulence gene feature library and a random forest regression model for Nocardia seriolae and combining them with semi-supervised learning, the method achieves rapid and accurate assessment of the virulence level of Nocardia seriolae strains, solves the high cost and long cycle of existing techniques, and provides a basis for screening vaccine candidate strains.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGXI ACADEMY OF FISHERY SCI
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
Figure CN122245453A_ABST
Abstract
Description
Technical Field
[0001] This application relates to, but is not limited to, the field of aquatic pathogen detection technology, and in particular to a method and system for assessing the virulence level of Nocardia seriolae strains.
Background Technology
[0002] Nocardia seriolae is a pathogen widely found in both marine and freshwater aquaculture environments. It can infect a variety of important commercially farmed fish species, such as mandarin fish, largemouth bass, snakehead, and tilapia, causing chronic systemic granulomatous lesions that lead to continuous mortality in farmed populations and serious economic losses to the aquaculture industry.
[0003] In the related art, the conventional median lethal dose (LD50) animal experiment for assessing the virulence of Nocardia seriolae takes 20-30 days, is costly, and is subject to animal-ethics constraints. Furthermore, results are difficult to standardize across laboratories, so very few quantitative virulence assessments are carried out in actual production. Although high-throughput sequencing can deliver a whole genome within 48 hours, existing alignment against the Virulence Factors of Pathogenic Bacteria Database (VFDB) only outputs a list of virulence genes and cannot quantify the virulence level. Existing machine learning methods, such as random forest typing and the Zoonoticus binary classification model, are either used only for strain tracing or designed for zoonotic diseases, and they lack both an assessment scheme for Nocardia-specific virulence characteristics and a continuous virulence score. Therefore, there is an urgent need for a genome-sequence-based technology that can rapidly quantify the virulence level of Nocardia seriolae.
Summary of the Invention
[0004] In view of the above problems, this application provides a method and system for assessing the virulence level of Nocardia seriolae strains, aiming to assess the virulence level of Nocardia seriolae accurately while significantly reducing experimental costs and shortening the cycle.
[0005] The technical solution of the embodiments of this application is implemented as follows. In a first aspect, embodiments of this application provide a method for assessing the virulence level of a Nocardia seriolae strain. The method includes: obtaining the whole genome sequence of the Nocardia seriolae strain to be tested; analyzing the whole genome sequence based on a classified and weighted specific virulence gene feature library of Nocardia seriolae to obtain a virulence feature vector of the strain to be tested; using a pre-trained regression model to perform LD50 prediction on the virulence feature vector to obtain the predicted LD50 value of the strain to be tested, wherein the regression model is based on a random forest regression algorithm and is trained, through a self-training semi-supervised learning framework, on first Nocardia seriolae strains that have a labeled measured LD50 value and completed whole genome sequencing and on second Nocardia seriolae strains that have no labeled LD50 value but have completed whole genome sequencing; and determining the virulence level of the strain to be tested based on its predicted LD50 value and a preset virulence classification baseline.
[0006] In some embodiments, the method for obtaining the classified and weighted specific virulence gene feature library of Nocardia amberjack includes: obtaining the reference genome sequences of multiple publicly published Nocardia amberjack strains, and performing comprehensive analysis and integration of the reference genome sequences to obtain a specific virulence gene set of Nocardia amberjack; classifying the virulence genes in the specific virulence gene set of Nocardia amberjack by functional category and assigning weights to obtain a specific virulence gene feature library.
[0007] In some embodiments, the virulence genes in the Nocardia amberjack's specific virulence gene set are functionally classified and weighted to obtain a specific virulence gene feature library. This includes: classifying the virulence genes in the Nocardia amberjack's specific virulence gene set according to the biological function of the virulence gene encoding products to obtain multiple virulence categories; assigning differentiated weight coefficients to each virulence category based on the relative contribution of each virulence gene in the multiple virulence categories to the pathogenic process of Nocardia amberjack; and integrating the identification information of each virulence gene, its virulence category, and the weight coefficient of the virulence category to obtain the specific virulence gene feature library.
[0008] In some embodiments, based on a classified and weighted specific virulence gene feature library of Nocardia amberjack, the whole genome sequence is analyzed to obtain the virulence feature vector of the Nocardia amberjack strain to be tested. This includes: comparing the presence / deletion profile of virulence genes in the whole genome sequence based on the identification information of virulence genes in the specific virulence gene feature library to obtain the presence / deletion profile of virulence genes in the Nocardia amberjack strain to be tested; and performing a comprehensive virulence analysis on the presence / deletion profile of virulence genes in the Nocardia amberjack strain to be tested using virulence categories and weighting coefficients of virulence categories in the specific virulence gene feature library to obtain the virulence feature vector of the Nocardia amberjack strain to be tested.
[0009] In some embodiments, performing the comprehensive virulence analysis on the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested, using the virulence categories and the weight coefficients of the virulence categories in the specific virulence gene feature library, to obtain the virulence feature vector of the strain to be tested includes: for each virulence category $k$ in the specific virulence gene feature library, determining the number $n_k$ of gene types present under that virulence category in the presence/absence profile of the strain to be tested; using a preset virulence calculation formula to compute the virulence score $S_k$ of the virulence category from the number of present gene types $n_k$, the weight coefficient $w_k$ of the virulence category, and the total number $N_k$ of gene types that the specific virulence gene feature library contains for the virulence category, the preset virulence calculation formula being $S_k = w_k \cdot n_k / N_k$; and concatenating, in a preset virulence category order, the virulence scores $S_1, S_2, \dots, S_K$ of all virulence categories in the specific virulence gene feature library to obtain the virulence feature vector $V = (S_1, S_2, \dots, S_K)$ of the strain to be tested.
[0010] In some embodiments, the training process of the regression model includes: analyzing the whole genome sequences of the first and second Nocardia seriolae strains based on the specific virulence gene feature library to obtain a first virulence feature vector for each first strain and a second virulence feature vector for each second strain; using the first virulence feature vectors and the measured LD50 values of the first strains to train an initial model built on the random forest regression algorithm and evaluate its baseline performance, obtaining an intermediate model; using the intermediate model to predict on the second virulence feature vectors, ranking the predictions by confidence, and selecting a set of high-confidence unlabeled samples together with their pseudo-labels; and performing weighted retraining of the intermediate model on the first virulence feature vectors, the measured LD50 values of the first strains, and the pseudo-labeled high-confidence sample set to obtain the trained regression model.
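The training steps above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation, and it assumes scikit-learn and NumPy are available. Several choices are assumptions not stated in the text: confidence is taken as the spread of per-tree predictions, the target is log10(LD50), and `rounds`, `top_frac`, and `pseudo_weight` are hypothetical parameters. The baseline performance evaluation step is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tree_spread(model, X):
    """Std. dev. of per-tree predictions; a small spread is read as high confidence."""
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.std(axis=0)


def self_train(X_lab, y_lab, X_unlab, rounds=3, top_frac=0.3,
               pseudo_weight=0.5, seed=0):
    """Self-training loop around a random forest regressor.

    X_lab / y_lab: feature vectors and log10(LD50) of the labeled strains
    (the log scale is an assumption). X_unlab: feature vectors of strains
    that only have a genome sequence. Returns the final model and the
    number of pseudo-labeled samples that were added.
    """
    X_train = np.asarray(X_lab, dtype=float)
    y_train = np.asarray(y_lab, dtype=float)
    pool = np.asarray(X_unlab, dtype=float)
    weights = np.ones(len(y_train))
    n_labeled = len(y_train)

    # Intermediate model: fitted on measured LD50 values only.
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train, sample_weight=weights)

    for _ in range(rounds):
        if len(pool) == 0:
            break
        # Rank unlabeled strains by confidence and keep the most confident.
        spread = tree_spread(model, pool)
        n_pick = max(1, int(len(pool) * top_frac))
        picked = np.argsort(spread)[:n_pick]
        pseudo_y = model.predict(pool[picked])

        # Add pseudo-labeled samples with a reduced weight, then retrain.
        X_train = np.vstack([X_train, pool[picked]])
        y_train = np.concatenate([y_train, pseudo_y])
        weights = np.concatenate([weights, np.full(n_pick, pseudo_weight)])
        pool = np.delete(pool, picked, axis=0)

        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X_train, y_train, sample_weight=weights)

    return model, len(y_train) - n_labeled
```

Giving pseudo-labeled samples a weight below 1 is one way to read the weighted retraining step: measured LD50 values keep more influence than model-generated labels.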
[0011] In some embodiments, determining the virulence level of the Nocardia seriolae strain to be tested based on its predicted LD50 value and the preset virulence classification baseline includes: when the predicted LD50 value of the strain is less than 1 × 10⁵ CFU/mL, determining that its virulence level is high-risk virulence; when the predicted LD50 value is greater than or equal to 1 × 10⁵ CFU/mL and less than 1 × 10⁷ CFU/mL, determining that its virulence level is medium-risk virulence; and when the predicted LD50 value is greater than or equal to 1 × 10⁷ CFU/mL, determining that its virulence level is low-risk virulence.
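The grading rule can be written as a small Python function. This sketch reads the two thresholds as 1 × 10⁵ and 1 × 10⁷ CFU/mL; the function and constant names are illustrative assumptions.

```python
# Thresholds in CFU/mL, read from the classification baseline above.
HIGH_RISK_BELOW = 1e5
LOW_RISK_FROM = 1e7


def virulence_grade(ld50_cfu_per_ml: float) -> str:
    """Map a predicted LD50 (CFU/mL) to a virulence grade.

    A lower LD50 means fewer cells are needed to kill half of the
    challenged fish, so lower values map to higher risk.
    """
    if ld50_cfu_per_ml <= 0:
        raise ValueError("LD50 must be positive")
    if ld50_cfu_per_ml < HIGH_RISK_BELOW:
        return "high"
    if ld50_cfu_per_ml < LOW_RISK_FROM:
        return "medium"
    return "low"
```

Both boundary values fall into the less virulent grade, matching the "greater than or equal to" wording of the rule.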
[0012] In a second aspect, embodiments of this application provide a virulence level assessment system for Nocardia seriolae strains. The system includes: an acquisition unit for obtaining the whole genome sequence of the Nocardia seriolae strain to be tested; an analysis unit for analyzing the whole genome sequence based on a classified and weighted specific virulence gene feature library of Nocardia seriolae to obtain a virulence feature vector of the strain to be tested; a prediction unit for using a trained regression model to perform LD50 prediction on the virulence feature vector to obtain the predicted LD50 value of the strain to be tested, wherein the regression model is based on the random forest regression algorithm and is trained, through a self-training semi-supervised learning framework, on first Nocardia seriolae strains that have a labeled measured LD50 value and completed whole genome sequencing and on second Nocardia seriolae strains that have no labeled LD50 value but have completed whole genome sequencing; and a determination unit for determining the virulence level of the strain to be tested based on its predicted LD50 value and the preset virulence classification baseline.
[0013] The beneficial effects of the technical solutions provided in this application include at least the following: The method and system for assessing the virulence level of *Nocardia amberjack* strains provided in this application embodiment include the following steps: First, the whole genome sequence of the *Nocardia amberjack* strain to be tested is obtained. Second, based on the classified and weighted specific virulence gene feature library of *Nocardia amberjack*, the whole genome sequence is analyzed to obtain the virulence feature vector of the *Nocardia amberjack* strain to be tested. Then, a pre-trained regression model is used to predict the LD50 of the virulence feature vector to obtain the predicted LD50 value of the *Nocardia amberjack* strain to be tested. The regression model is based on the random forest regression algorithm and is trained using a self-trained semi-supervised learning framework with a first *Nocardia amberjack* strain labeled with measured LD50 values and having completed whole genome sequencing, and a second *Nocardia amberjack* strain without labeled LD50 values and having completed whole genome sequencing. Finally, the virulence level of the *Nocardia amberjack* strain to be tested is determined based on the predicted LD50 value of the *Nocardia amberjack* strain to be tested and a preset virulence classification baseline. In this way, on the one hand, by utilizing a classified and weighted dedicated virulence gene feature library to calculate the virulence feature vector of the whole genome sequence of the tested Nocardia amberjack strain, the virulence feature vector can be made more biologically interpretable and discriminative. 
On the other hand, self-training semi-supervised learning effectively integrates a small number of strains with measured LD50 values and a large number of strains for which only public genome sequences are available, establishing a mapping between virulence feature vectors and predicted LD50 values and overcoming the data bottleneck caused by the traditional reliance on large-scale animal reinfection experiments. This enables quantitative prediction from virulence gene profile to virulence level. In other words, by constructing a functionally weighted specific virulence gene feature library of Nocardia seriolae and combining it with a random forest regression model under a semi-supervised learning framework, this application automates the entire chain from whole genome sequence to predicted LD50 value to virulence level. It can thus assess the virulence level of Nocardia seriolae accurately while significantly reducing experimental costs and shortening the cycle, providing an intuitive decision-making basis for subsequent applications such as vaccine candidate strain screening.
[0014] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the technical solutions provided in the embodiments of this application.
Attached Figure Description
[0015] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort, wherein:
Figure 1 is a flowchart of the method for assessing the virulence level of a Nocardia seriolae strain provided in an embodiment of this application;
Figure 2 is a schematic diagram of the functional category weight distribution of Nocardia seriolae virulence genes provided in an embodiment of this application;
Figure 3 is a schematic diagram of the iterative performance curve of self-training semi-supervised learning provided in an embodiment of this application;
Figure 4 is a set of radar charts showing the distribution of virulence features of multiple test strains provided in an embodiment of this application;
Figure 5 is a schematic diagram of the composition of the virulence level assessment system for Nocardia seriolae strains provided in an embodiment of this application.
Detailed Implementation
[0016] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. The following embodiments are used to illustrate this application, but are not intended to limit the scope of this application. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0017] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.
[0018] It should be noted that the terms "first, second, and third" used in the embodiments of this application are merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, and third" can be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.
[0019] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of this application pertain. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0020] Example 1: Figure 1 is a flowchart of the method for assessing the virulence level of a Nocardia seriolae strain provided in an embodiment of this application. The method can be executed by an electronic device, such as a computer or a server. It is explained below with reference to Figure 1.
Step 101: Obtain the whole genome sequence of the Nocardia seriolae strain to be tested.
[0021] In some embodiments, the Nocardia seriolae strain to be tested refers to any strain belonging to the taxonomic Nocardia seriolae species, which can be obtained by isolation and identification from lesion tissue of diseased fish, or from public collections such as the American Type Culture Collection (ATCC) and the China Center for Type Culture Collection (CCTCC).
[0022] It should be noted that this application does not specifically limit the source of the Nocardia seriolae strain to be tested; it can be a known preserved strain, a clinical isolate, an environmental isolate, or a strain that has undergone artificial mutagenesis or genetic modification. That is, those skilled in the art can obtain or prepare such strains by conventional microbiological methods.
[0023] In some embodiments, the whole genome sequence of the Nocardia seriolae strain to be tested refers to the complete nucleotide sequence information, covering the entire chromosome and any plasmids, obtained by high-throughput sequencing after DNA is extracted from the strain. The assembly result is usually expressed in FASTA or GenBank format, with a genome coverage of not less than 90%, preferably not less than 99%. In other words, the whole genome sequence of the strain to be tested refers to sequence data that can reflect the genotypic characteristics of the strain, and its scope includes a complete genome sequence, a draft genome sequence, or a sequence set that covers at least 90% of all coding genes of the strain.
[0024] In some embodiments, a high-throughput sequencing platform, such as Illumina, DNBSEQ, or Nanopore, can be used to sequence the Nocardia seriolae strain. The sequencing data can then be quality controlled, for example by removing low-quality reads and adapter sequences, followed by genome assembly and gene prediction to obtain the whole genome sequence and the coding DNA sequence (CDS) prediction results. The completeness, GC content (typically about 68% for Nocardia seriolae), and genome size (typical range 7.7-8.4 Mb) of the assembled genome are then checked to ensure that the quality of the input whole genome sequence meets the requirements of subsequent analysis.
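The GC content and genome size checks can be sketched in Python as below, using only the standard library. The accepted GC window of 66-70% is an assumption chosen around the typical value of about 68%, the size window is the typical range quoted above, and the function names are illustrative.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else f"record_{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def assembly_qc(records, gc_range=(66.0, 70.0), size_range_mb=(7.7, 8.4)):
    """Summarize genome size and GC content and flag out-of-range values.

    gc_range is an assumed tolerance around the typical GC content;
    size_range_mb is the typical genome size range in megabases.
    """
    total = sum(len(seq) for seq in records.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in records.values())
    gc_percent = 100.0 * gc / total if total else 0.0
    size_mb = total / 1e6
    return {
        "contigs": len(records),
        "size_mb": size_mb,
        "gc_percent": gc_percent,
        "gc_ok": gc_range[0] <= gc_percent <= gc_range[1],
        "size_ok": size_range_mb[0] <= size_mb <= size_range_mb[1],
    }
```

An assembly that fails either flag would be re-examined before it is passed on to feature extraction.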
[0025] Step 102: Based on the classified and weighted specific virulence gene feature library of Nocardia seriolae, analyze the whole genome sequence to obtain the virulence feature vector of the Nocardia seriolae strain to be tested.
[0026] In some embodiments, the classified and weighted specific virulence gene feature library of Nocardia seriolae establishes a function-weight mapping for Nocardia seriolae virulence genes. The library can be obtained through the following steps A1 and A2:
Step A1: Obtain the reference genome sequences of multiple publicly released Nocardia seriolae strains, and comprehensively analyze and integrate them to obtain the specific virulence gene set of Nocardia seriolae.
[0027] In some embodiments, the reference genome sequences of multiple publicly released Nocardia seriolae strains are obtained from public nucleic acid databases, such as the Genetic Sequence Data Bank (GenBank), including but not limited to:
1. Strain 20230510 (GenBank accession number: CP130742; a complete circular chromosome of 8,123,106 bp, GC content 68.14%, encoding 7,638 CDS).
[0028] 2. Strain UTF1 (reference strain, GenBank accession number: AP017900.1 or AP017900, with complete genome).
[0029] 3. Strain ZJ0503 (GenBank accession number: JNCT00000000.1, currently a scaffold-level genome draft).
[0030] 4. Strain 024013 (GenBank accession number: AP028459 or AP028459.1, with a complete genome).
[0031] 5. Strain KGN1266 (GenBank accession number: AP028458 or AP028458.1, with a complete genome).
[0032] Here, strains that have completed full genome assembly and functional annotation are given priority.
[0033] In some embodiments, comprehensively analyzing and integrating the reference genome sequences to obtain the specific virulence gene set of Nocardia seriolae includes, but is not limited to, the following. Sequences are aligned against VFDB, using software such as BLASTn for nucleotide sequences or BLASTp for amino acid sequences. Significant alignments are screened with the criteria E value ≤ 1e-5, sequence identity ≥ 80%, and query sequence coverage ≥ 70%, yielding a list of virulence genes for each strain. At the same time, the reference genome sequences of the strains are compared with each other to identify shared and strain-specific virulence-related genes. Based on existing literature reports on Nocardia seriolae virulence factors, such as gamma-glutamyl endopeptidase (GluNS), ESX-1 secretion system components, the superoxide dismutases SodC and Sod2, type VII secretion system components (T7SS-2/T7SS-3), EspG1, and ESAT6, virulence genes specific to Nocardia seriolae that are not included in the database are added, forming the specific virulence gene set of Nocardia seriolae.
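The screening criteria can be applied to BLAST tabular output with a short Python filter. The sketch assumes the search was run with a custom tabular layout, `-outfmt "6 qseqid sseqid pident length evalue bitscore qcovs"`; that column choice is an assumption and is not specified in this application.

```python
import csv
import io

# Assumed column layout of the BLAST tabular output (see lead-in).
COLUMNS = ("qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "qcovs")


def significant_hits(tabular_text, max_evalue=1e-5, min_identity=80.0,
                     min_query_coverage=70.0):
    """Keep alignments that pass the E value, identity, and coverage criteria."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        if (float(record["evalue"]) <= max_evalue
                and float(record["pident"]) >= min_identity
                and float(record["qcovs"]) >= min_query_coverage):
            hits.append(record)
    return hits
```

Running the filter once per reference strain gives the per-strain virulence gene lists, which can then be compared as sets to separate shared genes from strain-specific ones.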
[0034] Step A2: Classify the virulence genes in the specific virulence gene set of Nocardia seriolae by function and assign weights to obtain the specific virulence gene feature library.
[0035] In some embodiments, step A2 above can be implemented by the following steps A21 to A23:
Step A21: According to the biological function of the products encoded by the virulence genes, classify the virulence genes in the specific virulence gene set of Nocardia seriolae into multiple virulence categories.
[0036] In some embodiments, each virulence gene in the specific virulence gene set of Nocardia seriolae is assigned to a predetermined functional category according to the biological function of its encoded product. The functional categories follow the classification system applied to the 403 virulence genes of strain 20230510 in the published paper "Pathogenicity and whole-genome analysis of a Siniperca chuatsi-derived Nocardia seriolae strain", and specifically include: exotoxins, immune modulation, effector delivery systems, regulatory systems, nutrient metabolism, adhesion and invasion, secretion systems, stress adaptation, biofilm formation, iron uptake, proteases, and other virulence-related categories.
[0037] Step A22: Assign differentiated weight coefficients to the virulence categories based on the relative contribution of the virulence genes in each category to the pathogenic process of Nocardia seriolae.
[0038] As noted above, differentiated weight coefficients are assigned according to the relative contribution of the virulence genes in each virulence category to the pathogenicity of Nocardia seriolae, that is, how directly they contribute to the pathogenic process. The weighting rationale is as follows: exotoxins directly damage host cells; effector delivery genes (e.g., secretion systems) directly mediate the transmembrane transport of virulence factors; and immune modulation genes directly help the pathogen evade host immune clearance. These three categories can be assigned the highest weight coefficients (e.g., in the range 1.5-2.0). Regulatory system genes affect pathogenicity indirectly by regulating virulence gene expression and are assigned medium weight coefficients (e.g., in the range 1.0-1.5). Nutrient metabolism genes and stress adaptation genes mainly maintain the survival of the pathogen within the host and are assigned basic weight coefficients (e.g., in the range 0.5-1.0). In practical applications, the specific weight coefficients can be dynamically optimized based on feedback from newly added training data. See Table 1 for reference.
[0039] Table 1. Examples of weights for some virulence categories
Step A23: Integrate the identification information of each virulence gene, its virulence category, and the weight coefficient of that virulence category to obtain the specific virulence gene feature library.
[0040] In some embodiments, the identification information of each virulence gene (e.g., gene name, locus_tag, or sequence feature), its virulence category, and the weight coefficient of that virulence category are associated and stored to form a structured specific virulence gene feature library, as shown in Figure 2. An example structure of the specific virulence gene feature library is shown in Table 2 below.
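One possible in-memory layout for such a library is sketched below in Python. The weights are illustrative values picked inside the ranges quoted in the description (1.5-2.0, 1.0-1.5, 0.5-1.0), only six of the categories are shown, and the gene identifiers such as `vf_0001` are hypothetical placeholders, not entries from the actual library.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative category weights; the exact numbers are assumptions.
CATEGORY_WEIGHTS = {
    "exotoxin": 2.0,
    "effector delivery system": 1.8,
    "immune modulation": 1.6,
    "regulatory system": 1.2,
    "nutrient metabolism": 0.8,
    "stress adaptation": 0.6,
}


@dataclass(frozen=True)
class LibraryEntry:
    gene_id: str      # gene name, locus_tag, or other identifier
    category: str     # virulence category of the gene
    weight: float     # weight coefficient of that category


def build_library(assignments):
    """Build library entries from (gene_id, category) pairs."""
    return [LibraryEntry(gene_id, category, CATEGORY_WEIGHTS[category])
            for gene_id, category in assignments]


def category_totals(library):
    """Count how many gene types the library holds per category."""
    totals = defaultdict(int)
    for entry in library:
        totals[entry.category] += 1
    return dict(totals)
```

Storing the weight on each entry keeps the gene identifier, category, and weight associated in one record, while `category_totals` provides the per-category gene counts needed when scores are normalized later.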
[0041] It should be noted that the specific virulence gene feature library is based on currently published literature and may be dynamically adjusted according to actual circumstances.
[0042] Table 2. Relevant information on the 12 virulence categories included in the specific virulence gene feature library.
In some embodiments, step 102 can be implemented by the following steps 1021 and 1022 (not shown in Figure 1):
Step 1021: Based on the identification information of the virulence genes in the specific virulence gene feature library, compare the whole genome sequence against the library to obtain the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested.
[0043] In some embodiments, the whole genome sequence of the Nocardia seriolae strain to be tested is obtained first. Bioinformatics software, such as Prokka, Bakta, or BRAKER, can be used to predict the full set of CDS sequences, which can optionally be translated into a set of protein sequences. The predicted CDS or protein sequences are then compared one by one with the virulence genes identified in the specific virulence gene feature library to obtain the virulence gene presence/absence profile of the strain to be tested. Here, BLASTn or BLASTp can be used, with the alignment thresholds set to E value ≤ 1e-5, sequence identity ≥ 80%, and query sequence coverage ≥ 70%. Specifically, for each virulence gene in the specific virulence gene feature library, if a homologous sequence meeting the alignment thresholds exists in the strain to be tested, the virulence gene is determined to be "present" (recorded as 1); otherwise, it is determined to be "absent" (recorded as 0). Finally, the results for all virulence genes are assembled into a binary vector $G = (g_1, g_2, \dots, g_M)$, where $M$ is the total number of genes in the specific virulence gene feature library and $g_i \in \{0, 1\}$; this vector $G$ is output as the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested.
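The presence/absence decision can be sketched as a short Python function. The input format, a list of `(library_gene_id, evalue, identity_percent, query_coverage_percent)` tuples taken from the alignment results, is an assumption made for this illustration.

```python
def presence_absence_profile(library_gene_ids, hits, max_evalue=1e-5,
                             min_identity=80.0, min_query_coverage=70.0):
    """Return a 0/1 list aligned with library_gene_ids.

    hits: iterable of (library_gene_id, evalue, identity_percent,
    query_coverage_percent) tuples (assumed layout). A gene is marked
    present (1) if at least one hit to it passes all three thresholds.
    """
    present = {
        gene_id
        for gene_id, evalue, identity, coverage in hits
        if evalue <= max_evalue
        and identity >= min_identity
        and coverage >= min_query_coverage
    }
    return [1 if gene_id in present else 0 for gene_id in library_gene_ids]
```

Because the output follows the order of `library_gene_ids`, profiles from different strains are directly comparable position by position.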
[0044] Step 1022: Using the virulence categories and the weight coefficients of the virulence categories in the specific virulence gene feature library, perform a comprehensive virulence analysis on the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested to obtain its virulence feature vector.
[0045] In some embodiments, step 1022 may be implemented by the following steps B1 to B3:
Step B1: For each virulence category in the specific virulence gene feature library, determine the number of gene types present under that virulence category in the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested.
[0046] Step B2: Using a preset virulence calculation formula, compute the virulence score $S_k$ of each virulence category $k$ from the number $n_k$ of gene types present under that category, the weight coefficient $w_k$ of that category, and the total number $N_k$ of gene types that the specific virulence gene feature library contains for that category.
[0047] The preset virulence calculation formula is as follows:
$$S_k = w_k \cdot \frac{n_k}{N_k}$$
Formula (1).
[0048] Step B3: Concatenate, in the preset virulence category order, the virulence scores $S_1, S_2, \dots, S_K$ of all virulence categories in the specific virulence gene feature library to obtain the virulence feature vector $V$ of the Nocardia seriolae strain to be tested.
[0049] In some embodiments, the genes in the virulence gene presence/absence profile of the Nocardia seriolae strain to be tested are grouped according to the virulence categories in the specific virulence gene feature library. Let there be $K$ virulence categories in total ($K$ can be 12, as mentioned above). For the $k$-th virulence category ($k = 1, 2, \dots, K$), the number $n_k$ of gene types present under that category is counted from the presence/absence profile of the strain to be tested, and the total number $N_k$ of gene types that the specific virulence gene feature library contains for that category is determined.
[0050] The virulence score $S_k$ of the $k$-th virulence category is calculated according to Formula (1) above. Here, $n_k / N_k$ is the coverage of virulence genes in the $k$-th virulence category. Formula (1) reflects the following technical consideration: by using coverage rather than absolute counts alone, the bias caused by differences in the number of genes across functional categories is eliminated, so that the virulence scores of the categories are comparable.
[0051] The virulence scores $S_1, S_2, \dots, S_K$ of all virulence categories in the specific virulence gene feature library are then concatenated in the preset virulence category order (as described above) to obtain the virulence feature vector $V = (S_1, S_2, \dots, S_K)$ of the Nocardia seriolae strain to be tested.
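Steps B1 to B3 can be sketched in Python as follows. The sketch assumes that each category score is the category weight multiplied by the coverage (present gene types divided by library gene types in that category), which is how the description of Formula (1) reads; the function and argument names are illustrative.

```python
def virulence_feature_vector(profile, gene_categories, category_weights,
                             category_order):
    """Compute one weighted coverage score per virulence category.

    profile: 0/1 flags, one per library gene (presence/absence profile).
    gene_categories: category of each library gene, same order as profile.
    category_weights: {category: weight coefficient}.
    category_order: fixed order in which scores are concatenated.
    """
    present = {category: 0 for category in category_order}
    total = {category: 0 for category in category_order}
    for flag, category in zip(profile, gene_categories):
        total[category] += 1         # gene types in the library (N_k)
        present[category] += flag    # gene types present in the strain (n_k)

    vector = []
    for category in category_order:
        if total[category] == 0:
            vector.append(0.0)       # empty category contributes nothing
        else:
            coverage = present[category] / total[category]
            vector.append(category_weights[category] * coverage)
    return vector
```

For example, with two exotoxin genes of which one is present and a weight of 2.0, the exotoxin score is 2.0 × 1/2 = 1.0. Keeping `category_order` fixed ensures that every strain yields a vector with the same layout, as the regression model requires.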
[0052] Step 103: Using the trained regression model, predict the LD50 of the virulence feature vector to obtain the predicted LD50 value of the Nocardia amberjack strain to be tested.
[0053] The regression model is based on the random forest regression algorithm and is trained, through a self-training semi-supervised learning framework, using a first Nocardia amberjack strain that has a measured LD50 value as its label and completed whole-genome sequencing, and a second Nocardia amberjack strain that has no measured LD50 label and completed whole-genome sequencing.
[0054] In some embodiments, the first Nocardia amberjack strain with a measured LD50 value and completed whole-genome sequencing can be Nocardia amberjack strain data, extracted from published experimental research literature, for which animal reinfection experiments have been completed and measured LD50 values obtained. Specifically, this includes: 1. Strain 20230510: isolated from diseased mandarin fish, with a measured LD50 of 3.89 × 10⁴ CFU / mL; data from the paper "Pathogenicity and whole-genome analysis of a Siniperca chuatsi-derived Nocardia seriolae strain"; the GenBank accession number is CP130742.
[0055] 2. Wild strain ZJ0503: isolated from diseased oval pomfret in Zhanjiang, Guangdong Province, with a measured LD50 of 4.75 × 10⁵ CFU / mL; data from the paper "Construction of an attenuated glutamyl endopeptidase deletion strain of Nocardia seriolae"; the GenBank accession number is JNCT00000000.1.
[0056] 3. Attenuated strain NS-ΔGluNS: obtained by gene knockout from wild-type strain ZJ0503, with a measured LD50 of 3.41 × 10⁶ CFU / mL; data from the paper "Construction of an attenuated glutamyl endopeptidase deletion strain of Nocardia seriolae".
[0057] 4. Strain NK201610020: isolated from diseased black-spotted hybrid snakehead, with a measured LD50 of 1.079 × 10³ CFU / mL; data from the paper “Pathogenicity and whole genome analysis of Nocardia scutellaria from snakehead” (Microbiology Bulletin, 2022, No. 006); whole genome sequencing has been completed.
[0058] 5. Strain NS01: isolated from diseased largemouth bass, with a measured LD50 of 7.5 × 10⁷ CFU / mL; data from the paper “Analysis of Nocardia virulence and drug resistance in largemouth bass based on whole genome sequencing” (Journal of Fisheries of China, 2025); whole genome sequencing has been completed.
[0059] The whole genome sequences of strain 20230510 and wild strain ZJ0503 (the parent strain of attenuated strain NS-ΔGluNS) have been published, and whole genome sequencing of strains NK201610020 and NS01 has been confirmed as completed and submitted to public databases.
[0060] In some embodiments, the second Nocardia amberjack strain without labeled LD50 values and with completed whole-genome sequencing includes strains whose whole-genome sequences have been retrieved from public databases, such as GenBank, including but not limited to strains UTF1, 024013, KGN1266, and M150506. All of these strains have completed whole-genome sequencing and assembly, and possess complete CDS prediction information.
[0061] In some embodiments, the training process of the above regression model includes: Step C1: Based on the dedicated virulence gene feature library, the whole genome sequences of the first and second Nocardia amberjack strains are analyzed to obtain the first virulence feature vector of the first Nocardia amberjack strain and the second virulence feature vector of the second Nocardia amberjack strain.
[0062] Step C2: Using the first virulence feature vector and the measured LD50 value of the first Nocardia amberjack strain, the initial model built on the random forest regression algorithm is trained to obtain the intermediate model.
[0063] Step C3: Using the intermediate model, predictions are made for the second virulence feature vector of the second Nocardia amberjack strain and ranked by confidence, yielding a high-confidence unlabeled sample set with pseudo-labels.
[0064] Step C4: Based on the first virulence feature vector of the first Nocardia amberjack strain, the measured LD50 value of the first Nocardia amberjack strain, and the high-confidence unlabeled sample set with pseudo-labels, the intermediate model is retrained with sample weights to obtain the trained regression model.
[0065] In some embodiments, the first Nocardia amberjack strain is used as a labeled sample, and the second Nocardia amberjack strain is used as an unlabeled sample. The following preprocessing operations are performed on each sample to obtain a uniform input feature representation: First, following the methods provided in steps 101 and 102 above, the whole genome sequence of each strain (the first and the second Nocardia amberjack strains) is compared against the dedicated virulence gene feature library to obtain its virulence gene presence / deletion profile, and the virulence feature vector of each strain is calculated. A weighted summation can further be performed on the virulence feature vector of each strain to obtain the comprehensive virulence index Vscore of that strain.
[0066] In this application, for a labeled sample (the first Nocardia amberjack strain), the measured LD50 value is logarithmically transformed and the result is used as the regression target value, which eliminates the training bias caused by the order-of-magnitude variation in measured LD50 values. For an unlabeled sample (the second Nocardia amberjack strain), only its second virulence feature vector and its comprehensive virulence index Vscore are retained, and no regression target value is set.
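The effect of the logarithmic transformation can be illustrated with a short sketch. The base of the logarithm is not stated in the text, so base 10 is an assumption here, and the LD50 values below are illustrative round numbers rather than measurements.

```python
import math


def to_target(ld50_cfu_per_ml: float) -> float:
    """Map an LD50 in CFU/mL to a regression target (base 10 is assumed)."""
    return math.log10(ld50_cfu_per_ml)


# Illustrative LD50 values spanning three orders of magnitude.
illustrative_ld50 = [1.0e4, 1.0e5, 1.0e7]
targets = [to_target(v) for v in illustrative_ld50]

raw_ratio = max(illustrative_ld50) / min(illustrative_ld50)  # 1000-fold spread
log_range = max(targets) - min(targets)                      # spread of 3 units
print(targets, raw_ratio, log_range)
```

On the raw scale the largest value would dominate a squared-error loss; on the logarithmic scale every order of magnitude counts equally.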
[0067] In addition, for labeled samples whose measured LD50 value is reported in "CFU / fish", the value can be uniformly converted to "CFU / g" based on the average weight of the experimental fish (about 50 g), and then converted to "CFU / mL" (0.2 mL intraperitoneal injection volume), to maintain consistency with the units of samples such as strain 20230510.
[0068] Thus, after preprocessing, a labeled sample set and an unlabeled sample set U are formed. Each element of the labeled sample set pairs the virulence feature vector of a labeled sample with the log-transformed measured LD50 value of that sample; each element of the unlabeled sample set U is the virulence feature vector of an unlabeled sample. Each set has its own sample count: the number of labeled samples and the number of unlabeled samples.
[0069] Then, with the virulence feature vectors in the labeled sample set as input features and the corresponding log-transformed LD50 values as prediction targets, an initial model is constructed using a random forest regression algorithm (e.g., 100 decision trees, with a maximum depth of 5 for each tree).
[0070] During the initial model construction, leave-one-out cross-validation can be used to evaluate model performance: each time, one labeled sample is selected as the validation sample and the remaining labeled samples are used for training, and the process is repeated until every labeled sample has served once as the validation sample. Finally, the root mean square error (RMSE) and the coefficient of determination are calculated from all the validation results. If the coefficient of determination of the initial model is lower than a preset threshold, it is determined that the number of labeled samples is insufficient or the feature representation ability is limited, and further semi-supervised learning is needed to expand the training sample set and improve the generalization performance of the model.
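The leave-one-out evaluation can be sketched independently of the regressor. In the sketch below the model is passed in as a function; a nearest-neighbour predictor is used purely as a hypothetical stand-in for the random forest so that the example stays self-contained, and the feature vectors and targets are illustrative.

```python
import math
from typing import Callable, List, Tuple


def loocv(xs: List[List[float]], ys: List[float],
          fit_predict: Callable[[List[List[float]], List[float], List[float]], float]
          ) -> Tuple[float, float]:
    """Leave-one-out cross-validation: return (RMSE, R^2) over all held-out samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every sample except the i-th
        train_y = ys[:i] + ys[i + 1:]
        preds.append(fit_predict(train_x, train_y, xs[i]))
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.sqrt(ss_res / len(ys)), 1.0 - ss_res / ss_tot


def nearest_neighbour(train_x, train_y, x):
    """Stand-in model (not the random forest): copy the closest sample's target."""
    dists = [sum((a - b) ** 2 for a, b in zip(tx, x)) for tx in train_x]
    return train_y[dists.index(min(dists))]


# Illustrative one-dimensional feature vectors and log-scale targets.
xs = [[0.9], [0.8], [0.5], [0.4], [0.1]]
ys = [4.0, 4.2, 5.5, 5.7, 7.0]
rmse, r2 = loocv(xs, ys, nearest_neighbour)
print(round(rmse, 3), round(r2, 3))
```

With only a handful of labeled strains, each fold trains on nearly all of the data, which is why leave-one-out is a natural choice here.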
[0071] Finally, the self-training semi-supervised learning iteration is performed: (a) Using the currently trained regression model, a prediction is made for each sample in the unlabeled sample set, and the predicted value and the prediction confidence are output for each unlabeled sample. For the random forest regression model, the prediction confidence can be expressed through the standard deviation of the predicted values of the individual decision trees: the smaller the standard deviation, the more consistent the predictions of the trees and the higher the confidence.
[0072] (b) From the unlabeled sample set, the p samples with the highest prediction confidence are selected in descending order of prediction confidence (p is usually set to 10%-20% of the size of the unlabeled sample set). Samples whose prediction standard deviation is greater than 0.5 are discarded, and the predicted values of the remaining samples are used as pseudo-labels. These pseudo-labeled samples are removed from the unlabeled sample set and added to the labeled sample set, which yields an expanded labeled sample set containing the newly added pseudo-labeled samples.
[0073] (c) The regression model is retrained using the expanded labeled sample set. During training, smaller sample weights are applied to pseudo-labeled samples to reduce the impact of pseudo-label error on the model. The sample weights are set as follows: the weight of a true-labeled sample is 1.0, and the weight of a pseudo-labeled sample is a function of its confidence value; for example, weight = confidence value × decay factor, where the decay factor decreases with each iteration.
[0074] (d) Steps (a)-(c) are repeated until any of the following stopping conditions is met: no selectable samples remain in the unlabeled sample set; the model's performance on the validation set no longer improves; or the preset maximum number of iterations has been reached.
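Steps (a) and (b) can be sketched as follows, assuming the per-tree predictions for each unlabeled strain are already available (for example from the individual trees of a fitted random forest). The sample identifiers and prediction values are hypothetical, and the population standard deviation is used as the spread measure.

```python
import statistics
from typing import Dict, List, Tuple


def select_pseudo_labels(tree_preds: Dict[str, List[float]], p: int,
                         max_std: float = 0.5) -> List[Tuple[str, float, float]]:
    """Return (sample_id, pseudo_label, std) for the p most confident samples,
    dropping any whose per-tree standard deviation exceeds max_std."""
    scored = [(sid, statistics.mean(v), statistics.pstdev(v))
              for sid, v in tree_preds.items()]
    scored.sort(key=lambda item: item[2])  # smallest spread = highest confidence
    return [item for item in scored[:p] if item[2] <= max_std]


# Hypothetical per-tree predictions (log scale) for five unlabeled strains.
tree_preds = {
    "u1": [5.0, 5.1, 4.9],
    "u2": [4.0, 6.0, 5.0],
    "u3": [6.0, 6.2, 6.4],
    "u4": [3.0, 7.0, 5.0],
    "u5": [5.5, 5.5, 5.6],
}
pseudo_labeled: Dict[str, float] = {}
unlabeled = set(tree_preds)

p = max(1, round(0.2 * len(unlabeled)))  # 20% of the unlabeled set
selected = select_pseudo_labels(tree_preds, p)
for sample_id, pseudo_label, _ in selected:
    pseudo_labeled[sample_id] = pseudo_label  # joins the labeled set
    unlabeled.discard(sample_id)              # leaves the unlabeled set
print(selected)
```

Taking the top p first and applying the 0.5 threshold afterwards means a round can add fewer than p samples, or none at all, which is one of the stopping signals.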
[0075] In this application, during the self-training iteration process, leave-one-out cross-validation is used to evaluate the performance of the model after each round of augmentation: the current labeled sample set is divided into a training subset and a validation subset (one true-labeled sample is held out for validation each time), and the validation RMSE is calculated. The pseudo-label confidence is defined through the standard deviation of the predictions made by the individual decision trees in the random forest: the smaller the standard deviation, the higher the prediction consistency and the higher the confidence. The iteration stops when any of the following conditions is met: (i) The unlabeled sample set U is empty.
[0076] (ii) The decrease in RMSE is less than 0.01 for two consecutive iterations, or no sample passes the confidence threshold (standard deviation ≤ 0.5) in the current round.
[0077] (iii) The preset maximum number of iterations, 10 rounds, is reached.
[0078] The weight of a pseudo-labeled sample in subsequent training is set to the product of its confidence value and a decay factor that depends on the current iteration round, which reduces the cumulative impact of early pseudo-label errors.
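The weighting and stopping rules can be sketched as below. The text does not give the mapping from standard deviation to confidence value or the form of the decay factor, so `1 / (1 + std)` and `gamma ** iteration` are assumptions chosen only for illustration; reading condition (ii) as "each of the last two rounds improved RMSE by less than 0.01" is likewise an interpretation.

```python
from typing import List


def pseudo_label_weight(std: float, iteration: int, gamma: float = 0.8) -> float:
    """Weight of a pseudo-labeled sample; true-labeled samples keep weight 1.0."""
    confidence = 1.0 / (1.0 + std)  # assumed mapping: smaller spread, higher confidence
    decay = gamma ** iteration      # assumed decay: shrinks with each round
    return confidence * decay


def should_stop(n_unlabeled: int, n_selected: int, rmse_history: List[float],
                iteration: int, max_iter: int = 10, tol: float = 0.01) -> bool:
    """Check stopping conditions (i)-(iii) after a self-training round."""
    if n_unlabeled == 0:       # (i) unlabeled set U is empty
        return True
    if n_selected == 0:        # (ii) nothing passed the confidence threshold
        return True
    if iteration >= max_iter:  # (iii) maximum number of rounds reached
        return True
    if len(rmse_history) >= 3:  # (ii) two consecutive small improvements
        d1 = rmse_history[-3] - rmse_history[-2]
        d2 = rmse_history[-2] - rmse_history[-1]
        if d1 < tol and d2 < tol:
            return True
    return False


print(pseudo_label_weight(0.25, 1))                    # about 0.64
print(should_stop(3, 1, [0.500, 0.495, 0.493], 3))     # True
```

In a library such as scikit-learn, weights of this kind would typically be passed through the `sample_weight` argument of the regressor's `fit` method.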
[0079] Figure 3 is a schematic diagram of the iterative performance curve of self-training semi-supervised learning. After training, a trained regression model M is obtained. Furthermore, based on the comprehensive virulence index Vscore of each sample in the labeled sample set, a calibration curve between Vscore and the regression target can be fitted by the least squares method. The calibration curve can be linear or nonlinear: the fitting function can take the form of a linear affine transformation or a more general polynomial form, depending on the goodness of fit and the residual distribution.
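A linear calibration of this kind can be sketched with ordinary least squares. The sketch assumes that the calibration relates Vscore to the log-transformed LD50 target; the weights, feature vectors, and target values are illustrative, not data from this application.

```python
from typing import List, Sequence, Tuple


def vscore(vector: Sequence[float], weights: Sequence[float]) -> float:
    """Comprehensive virulence index: weighted sum of the category scores."""
    return sum(w * s for w, s in zip(weights, vector))


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative Vscore values and log-scale LD50 targets for three labeled strains.
vscores = [0.2, 0.5, 0.8]
log_ld50 = [7.0, 5.5, 4.0]
slope, intercept = fit_line(vscores, log_ld50)
print(slope, intercept)  # about -5.0 and 8.0
```

The negative slope reflects the expected direction: a higher virulence index corresponds to a lower LD50. A polynomial fit would replace `fit_line` if the residuals show curvature.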
[0080] In some embodiments, the virulence feature vector of the Nocardia amberjack strain to be tested can be directly input into the trained regression model, which outputs a predicted value on the logarithmic scale. This predicted value is then subjected to inverse logarithmic transformation to obtain the predicted LD50 value of the Nocardia amberjack strain to be tested, in CFU / mL.
[0081] In this application, the confidence interval of the LD50 prediction value can also be calculated based on the distribution of prediction errors from cross-validation during model training and the distribution of prediction values from each sub-model in the random forest regression model. For the random forest model, the 2.5 percentile and 97.5 percentile of each decision tree prediction value are taken as the lower and upper bounds of the 95% prediction confidence interval.
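The percentile-based interval and the back-transformation can be sketched as follows. The sketch assumes a base-10 target and takes the mean of the per-tree predictions as the point estimate, which is how random forest regressors commonly aggregate; the per-tree values are synthetic.

```python
import statistics
from typing import List, Tuple


def percentile(sorted_vals: List[float], q: float) -> float:
    """Percentile with linear interpolation between neighbours, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def predict_ld50(tree_preds_log10: List[float]) -> Tuple[float, float, float]:
    """Return (point, lower, upper) in CFU/mL from per-tree log10 predictions."""
    vals = sorted(tree_preds_log10)
    point = 10 ** statistics.mean(vals)  # forest prediction, back-transformed
    lower = 10 ** percentile(vals, 2.5)  # lower bound of the 95% interval
    upper = 10 ** percentile(vals, 97.5)  # upper bound of the 95% interval
    return point, lower, upper


# Synthetic per-tree predictions spread evenly between 4.0 and 5.0 (101 trees).
tree_preds = [4.0 + 0.01 * i for i in range(101)]
point, lower, upper = predict_ld50(tree_preds)
print(f"{point:.3g} CFU/mL (95% interval {lower:.3g} to {upper:.3g})")
```

Because the percentiles are taken on the logarithmic scale and then back-transformed, the resulting interval is asymmetric around the point estimate in CFU / mL.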
[0082] Step 104: Determine the virulence level of the tested Nocardia amberjack strain based on the predicted LD50 value and the preset virulence classification baseline.
[0083] In some embodiments, the virulence level of the tested Nocardia amberjack strain can be further classified based on the predicted LD50 value and 95% confidence interval of the tested Nocardia amberjack strain.
[0084] In some embodiments, step 104 can be implemented in the following manner: Method 1: When the predicted LD50 value of the Nocardia amberjack strain to be tested is less than 1×10⁵ CFU / mL, the virulence level of the strain is determined to be high-risk virulence.
[0085] Method 2: When the predicted LD50 value of the Nocardia amberjack strain to be tested is greater than or equal to 1×10⁵ CFU / mL and less than 1×10⁷ CFU / mL, the virulence level of the strain is determined to be medium-risk virulence.
[0086] Method 3: When the predicted LD50 value of the Nocardia amberjack strain to be tested is greater than or equal to 1×10⁷ CFU / mL, the virulence level of the strain is determined to be low-risk virulence.
[0087] When the 95% confidence interval of the predicted LD50 value spans a grade boundary, the Nocardia amberjack strain to be tested is marked as a boundary strain and is conservatively treated according to the higher risk level.
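The grading rule, including the conservative handling of boundary strains, can be sketched as below using the thresholds of 1×10⁵ and 1×10⁷ CFU / mL. The function names and the example LD50 values are hypothetical.

```python
from typing import Tuple

HIGH, MEDIUM, LOW = "high", "medium", "low"


def grade(ld50_cfu_per_ml: float) -> str:
    """Map an LD50 value to a risk grade; a lower LD50 means higher virulence."""
    if ld50_cfu_per_ml < 1e5:
        return HIGH
    if ld50_cfu_per_ml < 1e7:
        return MEDIUM
    return LOW


def assess(point: float, lower: float, upper: float) -> Tuple[str, bool]:
    """Return (grade, is_boundary) for a prediction with its 95% interval."""
    is_boundary = grade(lower) != grade(upper)  # interval spans a grade boundary
    if is_boundary:
        # The lower end of the interval carries the higher risk,
        # so the conservative choice is the grade of the lower bound.
        return grade(lower), True
    return grade(point), False


print(assess(3.0e4, 1.0e4, 8.0e4))  # ('high', False)
print(assess(2.0e5, 8.0e4, 5.0e5))  # ('high', True)
```

In the second call the point estimate alone would be graded medium, but the interval reaches below 1×10⁵ CFU / mL, so the strain is flagged as a boundary strain and graded high.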
[0088] In this way, a Nocardia amberjack strain with an unknown virulence phenotype, i.e. the Nocardia amberjack strain to be tested, can obtain LD50 prediction value and virulence level determination results within a few hours after its whole genome sequencing is completed, thereby guiding disease control decisions in aquaculture sites without having to wait for animal reinfection experiment results that take more than 20 days.
[0089] In some embodiments, the above thresholds of 1×10⁵ CFU / mL and 1×10⁷ CFU / mL can be dynamically calibrated as new training data accumulate. The following publicly available experimental data can serve as a reference for the grading standard: wild-type strain ZJ0503 (LD50 = 4.75 × 10⁵ CFU / mL) is a medium-virulence strain; strain 20230510 (LD50 = 3.89 × 10⁴ CFU / mL) is a highly virulent strain; and the attenuated strain NS-ΔGluNS (LD50 = 3.41 × 10⁶ CFU / mL) is a medium-virulence strain.
[0090] In some embodiments, after obtaining the virulence level of the Nocardia amberjack strain to be tested, a corresponding virulence level assessment report can also be generated; wherein, the virulence level assessment report may include: basic information of the Nocardia amberjack strain to be tested (number, source host, isolation time); score distribution of each virulence category; LD50 predicted value and confidence interval; virulence level determination result; details such as gene presence / deletion spectrum of each virulence category (for result interpretability and traceability).
[0091] It should be noted that when new strain data with measured LD50 values are added, the new data can be included in the labeled sample set, and the semi-supervised learning training process of the regression model can be re-executed to update the regression model. At the same time, the functional category weight coefficients in the classified and weighted specific virulence gene feature library of Nocardia amberjack can be fine-tuned and optimized according to the correlation strength between virulence genes and LD50 in the new data, so that the weight system is more in line with actual biological laws.
[0092] The virulence level assessment method for *Nocardia amberjack* strains provided in this application involves the following steps. First, the whole genome sequence of the *Nocardia amberjack* strain to be tested is obtained. Second, based on the classified and weighted dedicated virulence gene feature library of *Nocardia amberjack*, the whole genome sequence is analyzed to obtain the virulence feature vector of the strain to be tested. Then, a pre-trained regression model performs LD50 prediction on the virulence feature vector to obtain the predicted LD50 value of the strain to be tested. The regression model is based on the random forest regression algorithm and is trained through a self-training semi-supervised learning framework, using a first *Nocardia amberjack* strain labeled with a measured LD50 value and with completed whole genome sequencing, and a second *Nocardia amberjack* strain without a measured LD50 label and with completed whole genome sequencing. Finally, the virulence level of the strain to be tested is determined based on its predicted LD50 value and a preset virulence classification baseline. On the one hand, calculating the virulence feature vector from the whole genome sequence with a classified and weighted dedicated virulence gene feature library makes the feature vector more biologically interpretable and discriminative. On the other hand, self-training semi-supervised learning integrates a small number of strains with measured LD50 values and a large number of strains for which only public genome sequences are available, establishing a mapping between virulence feature vectors and predicted LD50 values and overcoming the data bottleneck of traditional large-scale reinfection experiments. This enables quantitative prediction from virulence gene profile to virulence level. In other words, by constructing a functionally weighted dedicated virulence gene feature library of Nocardia amberjack and combining it with a random forest regression model under a semi-supervised learning framework, this application automates the entire chain from whole genome sequence to predicted LD50 value to virulence level. It can thus assess the virulence level of Nocardia amberjack accurately while significantly reducing experimental costs and shortening the cycle, providing an intuitive decision-making basis for subsequent applications such as vaccine candidate strain screening.
[0093] The following description uses specific experimental data to illustrate the virulence assessment method of the above-mentioned Nocardia amberjack strain. However, it is worth noting that this specific experimental data is only for better illustration of this application and does not constitute an undue limitation on this application.
[0094] The specific experimental data involves the strain samples in the experimental materials: Test strain A: Nocardia amberjack 20230510 (isolated from diseased mandarin fish, with complete genome sequence and measured LD50 value).
[0095] Test strain B: Nocardia amberjack ZJ0503 (isolated from diseased oval pomfret, with complete genome sequence and measured LD50 value).
[0096] Test strain C: Nocardia amberjack NS01 (isolated from diseased largemouth bass, with complete genome sequence and measured LD50 value).
[0097] Test strain D: A newly isolated Nocardia amberjack (for timeliness verification).
[0098] Figure 4 shows radar charts of the virulence characteristics of Nocardia amberjack 20230510, Nocardia amberjack ZJ0503, and test strain D.
[0099] Timeliness comparison experiment: Experimental setup: Test strain D was assessed using both the virulence assessment method provided in this application (A) and the traditional LD50 animal infection experiment (B), and the total time each method took to reach a virulence assessment conclusion for test strain D was recorded. The process for method A is as follows: strain resuscitation and culture (3-5 days) → genome sequencing (2 days) → bioinformatics analysis and virulence calculation (several hours).
[0100] The process for method B is as follows: strain resuscitation and culture (3-5 days) → concentration gradient preparation (1 day) → intraperitoneal injection infection (1 day) → continuous observation and statistical analysis of mortality rate (14-20 days) → LD50 calculation (1 day).
[0101] Evaluation metrics: total time for the entire process (unit: days) and the time breakdown for each stage. The expected results are shown in Table 3.
[0102] Table 3 Timeliness Comparison Data. Thus, the virulence assessment method provided in this application shortens the virulence assessment cycle from the month scale (approximately 24 days) to the week scale (approximately 7.5 days), which can significantly improve the efficiency of emergency response to aquatic diseases.
[0103] Comparative experiment on quantitative capability: The virulence level assessment method provided in this application was compared with the existing VFDB virulence gene alignment method, in which the strain genome is aligned against the VFDB database, the presence / deletion profile of virulence genes is output, and virulence is assessed by gene count (the higher the count, the stronger the virulence).
[0104] Experimental setup: Three bacterial strains with known measured LD50 values (20230510 high-virulence baseline strain, ZJ0503 medium-virulence strain, and NS01 low-virulence baseline strain) were selected and evaluated using the VFDB method and the virulence rating assessment method provided in this application, respectively. The existing VFDB virulence gene alignment method outputs: virulence gene counts and a list; the virulence rating assessment method provided in this application outputs the virulence rating. The evaluation indicators are: whether a continuously quantified virulence score can be output, the correlation between the score result and the measured LD50 value, and the consistency between the virulence rating determination and the measured LD50 rating. The expected results are shown in Table 4.
[0105] Table 4 Comparison of Quantitative Capabilities. Analysis: The VFDB method only outputs gene counts, and the count differences among the three strains are small (375-389), so it cannot effectively distinguish virulence levels and misclassifies the medium-virulence strain B (ZJ0503) as highly virulent; the virulence level assessment method provided in this application has a virulence level determination accuracy of 100% on these three strains.
[0106] Conclusion: Through the functional category weighting mechanism, the virulence level assessment method provided in this application solves the low virulence discrimination caused by "equal weight counting" in the VFDB method, and achieves a quantitative assessment that is highly consistent with the measured LD50 values.
[0107] Based on the above description, the virulence assessment method for Nocardia amberjack strains provided in this application aims to solve the following technical problems existing in the prior art: 1. Existing animal reinfection experiments (LD50 determination) are time-consuming (20-30 days), costly, and subject to significant animal ethics pressures, failing to meet the actual needs of aquaculture sites for rapid assessment of strain virulence.
[0108] 2. Existing virulence gene analysis methods, such as VFDB alignment, can only output a list of the presence / absence of virulence genes, lacking a quantitative scoring mechanism for the differences in the functional categories of virulence genes, and thus cannot quantitatively classify the virulence level of strains.
[0109] 3. Existing machine learning-based pathogen prediction models are designed for the classification of zoonotic diseases, outputting binary classification results. They do not consider the unique virulence factors and pathogenic characteristics of Nocardia amberjack, and cannot be directly applied to the virulence level assessment of aquatic pathogens.
[0110] This application targets *Nocardia amberjack*, a specific aquatic pathogen, and constructs an intelligent virulence level assessment technology system based on a combination of weighted scoring of virulence gene functional categories and semi-supervised learning. This system includes the following core technical components: First, we established a virulence gene feature library specific to Nocardia amberjack and its functional category weighting system. Unlike existing methods that count all virulence genes with "equal weight," this application assigns differentiated weight coefficients to virulence genes according to their functional categories (e.g., exotoxins, immune regulation, effector transmission, nutritional metabolism, etc.), forming a "gene-functional category-weight" mapping relationship, making the virulence score closer to biological reality.
[0111] Second, a multi-dimensional virulence score calculation based on functional category weighting was designed. After the whole genome sequence of the test strain is input, it is compared with the virulence gene feature library through a sequence alignment engine to obtain the virulence gene presence / deletion profile; then, according to the functional category of the gene and the preset weight coefficient, the category score of each category is calculated.
[0112] Third, a virulence index-LD50 mapping model based on semi-supervised learning is constructed. A small number of strains with measured LD50 values are used as labeled samples (baseline strains), and a large number of publicly available strains with genome sequences but no measured LD50 values are used as unlabeled samples. Through a self-training semi-supervised learning framework, a mapping relationship between multi-dimensional virulence scores and predicted LD50 values is established, realizing quantitative prediction from virulence gene profile to virulence level.
[0113] Based on the foregoing embodiments, this application further provides a virulence level assessment system for Nocardia amberjack strains; see Figure 5. The virulence level assessment system 500 for Nocardia amberjack strains includes: an acquisition unit 501, used to acquire the whole genome sequence of the Nocardia amberjack strain to be tested.
[0114] Analysis unit 502 is used to analyze the whole genome sequence based on the classified and weighted specific virulence gene feature library of Nocardia amberjack to obtain the virulence feature vector of the Nocardia amberjack strain to be tested.
[0115] The prediction unit 503 is used to perform LD50 prediction on the virulence feature vector using a pre-trained regression model to obtain the predicted LD50 value of the Nocardia amberjack strain to be tested. The regression model is based on the random forest regression algorithm and is trained through a self-training semi-supervised learning framework, using a first Nocardia amberjack strain labeled with a measured LD50 value and with completed whole-genome sequencing, and a second Nocardia amberjack strain without a measured LD50 label and with completed whole-genome sequencing.
[0116] The determination unit 504 is used to determine the virulence level of the tested Nocardia amberjack strain based on the predicted LD50 value of the strain and the preset virulence classification baseline.
[0117] It should be noted that the description of the above system embodiments is similar to the description of the above method embodiments, and has similar beneficial effects. For technical details not disclosed in the system embodiments of this application, please refer to the description of the method embodiments of this application for understanding.
[0118] It should be noted that, in the embodiments of this application, if the above-mentioned method for assessing the virulence level of *Nocardia amberjack* strains is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), magnetic disks, or optical disks. Thus, the embodiments of this application are not limited to any specific hardware and software combination.
[0119] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the above-described embodiments are merely descriptive and do not represent the superiority or inferiority of the embodiments.
[0120] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0121] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.
[0122] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0123] In addition, each functional unit in the various embodiments of this application can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.
[0124] Alternatively, if the integrated units described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, or the parts that contribute to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions that cause a device to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, ROMs, magnetic disks, or optical disks.
[0125] The methods disclosed in the several method embodiments provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.
[0126] The features disclosed in the several method or system embodiments provided in this application can be arbitrarily combined without conflict to obtain new method or system embodiments.
[0127] The above description is merely an embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for assessing the virulence level of a Nocardia amberjack strain, characterized in that the method comprises: obtaining the whole genome sequence of the Nocardia amberjack strain to be tested; analyzing the whole genome sequence based on a classified and weighted dedicated virulence gene feature library of Nocardia amberjack to obtain a virulence feature vector of the Nocardia amberjack strain to be tested; predicting the LD50 from the virulence feature vector using a pre-trained regression model to obtain a predicted LD50 value of the Nocardia amberjack strain to be tested, wherein the regression model is based on the random forest regression algorithm and is trained, under a self-training semi-supervised learning framework, using a first Nocardia amberjack strain that has completed whole-genome sequencing and is labeled with a measured LD50 value and a second Nocardia amberjack strain that has completed whole-genome sequencing but has no LD50 label; and determining the virulence level of the Nocardia amberjack strain to be tested based on the predicted LD50 value and a preset virulence classification baseline.
2. The method according to claim 1, characterized in that the classified and weighted dedicated virulence gene feature library of Nocardia amberjack is obtained by: obtaining the reference genome sequences of several published Nocardia amberjack strains, and comprehensively analyzing and integrating the reference genome sequences to obtain a dedicated virulence gene set of Nocardia amberjack; and functionally classifying and weighting the virulence genes in the dedicated virulence gene set to obtain the dedicated virulence gene feature library.
3. The method according to claim 2, characterized in that functionally classifying and weighting the virulence genes in the dedicated virulence gene set of Nocardia amberjack to obtain the dedicated virulence gene feature library comprises: classifying the virulence genes in the dedicated virulence gene set into multiple virulence categories based on the biological function of the products they encode; assigning a differentiated weighting coefficient to each virulence category based on the relative contribution of the virulence genes in that category to the pathogenesis of Nocardia amberjack; and integrating the identifier information of each virulence gene, its corresponding virulence category, and the weighting coefficient of that virulence category to obtain the dedicated virulence gene feature library.
4. The method according to claim 1, characterized in that analyzing the whole genome sequence based on the classified and weighted dedicated virulence gene feature library of Nocardia amberjack to obtain the virulence feature vector of the Nocardia amberjack strain to be tested comprises: aligning the virulence genes against the whole genome sequence, based on the identifier information of the virulence genes in the dedicated virulence gene feature library, to obtain a virulence gene presence/deletion profile of the Nocardia amberjack strain to be tested; and performing a comprehensive virulence analysis on the virulence gene presence/deletion profile, using the virulence categories and weighting coefficients of the dedicated virulence gene feature library, to obtain the virulence feature vector of the Nocardia amberjack strain to be tested.
5. The method according to claim 4, characterized in that performing the comprehensive virulence analysis on the virulence gene presence/deletion profile of the Nocardia amberjack strain to be tested, using the virulence categories and weighting coefficients of the dedicated virulence gene feature library, to obtain the virulence feature vector comprises: for each virulence category in the dedicated virulence gene feature library, determining, in the virulence gene presence/deletion profile of the Nocardia amberjack strain to be tested, the number of gene types present under that virulence category; calculating, using a preset virulence calculation formula, a virulence score of the virulence category from the number of gene types present under the virulence category, the weighting coefficient of the virulence category, and the total number of gene types that the dedicated virulence gene feature library contains for the virulence category, the preset virulence calculation formula being as follows: [formula not reproduced]; and concatenating the virulence scores of the virulence categories in the dedicated virulence gene feature library, in a preset virulence category order, to obtain the virulence feature vector of the Nocardia amberjack strain to be tested.
6. The method according to claim 1, characterized in that the training process of the regression model comprises: analyzing the whole genome sequences of the first Nocardia amberjack strain and the second Nocardia amberjack strain based on the dedicated virulence gene feature library to obtain a first virulence feature vector of the first Nocardia amberjack strain and a second virulence feature vector of the second Nocardia amberjack strain; performing a baseline performance evaluation of an initial model built on the random forest regression algorithm, using the first virulence feature vector and the measured LD50 value of the first Nocardia amberjack strain, to obtain an intermediate model; using the intermediate model to predict and rank the second virulence feature vector of the second Nocardia amberjack strain to obtain a high-confidence unlabeled sample set carrying pseudo-labels; and performing weighted training of the intermediate model based on the first virulence feature vector of the first Nocardia amberjack strain, the measured LD50 value of the first Nocardia amberjack strain, and the high-confidence unlabeled sample set carrying pseudo-labels, to obtain the trained regression model.
7. The method according to claim 1, characterized in that determining the virulence level of the Nocardia amberjack strain to be tested based on the predicted LD50 value and the preset virulence classification baseline comprises: when the predicted LD50 value of the Nocardia amberjack strain to be tested is less than 1×10⁵ CFU/mL, determining the virulence level of the strain to be high-risk virulence; when the predicted LD50 value is greater than or equal to 1×10⁵ CFU/mL and less than 1×10⁷ CFU/mL, determining the virulence level of the strain to be medium-risk virulence; and when the predicted LD50 value is greater than or equal to 1×10⁷ CFU/mL, determining the virulence level of the strain to be low-risk virulence.
8. A system for assessing the virulence level of a Nocardia amberjack strain, characterized in that the system comprises: an acquisition unit, used to acquire the whole genome sequence of the Nocardia amberjack strain to be tested; an analysis unit, used to analyze the whole genome sequence based on a classified and weighted dedicated virulence gene feature library of Nocardia amberjack to obtain a virulence feature vector of the Nocardia amberjack strain to be tested; a prediction unit, used to predict the LD50 from the virulence feature vector using a pre-trained regression model to obtain a predicted LD50 value of the Nocardia amberjack strain to be tested, wherein the regression model is based on the random forest regression algorithm and is trained, under a self-training semi-supervised learning framework, using a first Nocardia amberjack strain that has completed whole-genome sequencing and is labeled with a measured LD50 value and a second Nocardia amberjack strain that has completed whole-genome sequencing but has no LD50 label; and a determination unit, used to determine the virulence level of the Nocardia amberjack strain to be tested based on the predicted LD50 value and a preset virulence classification baseline.
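By way of illustration only, the per-category scoring of claims 4 and 5 and the grading of claim 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the category names, gene identifiers, and weighting coefficients are hypothetical placeholders; the score formula (weighting coefficient multiplied by the fraction of the category's gene types that are present) is an assumed form built from the three inputs the claim names, since the formula itself is not reproduced in the text; and the thresholds assume that the baselines of claim 7 read as 1×10⁵ and 1×10⁷ CFU/mL.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical feature library: category -> (weighting coefficient, gene identifiers).
# Names and numbers are placeholders, not values taken from the application.
FEATURE_LIBRARY: Dict[str, Tuple[float, Set[str]]] = {
    "adhesion": (1.0, {"geneA", "geneB"}),
    "secretion": (2.0, {"geneC", "geneD", "geneE", "geneF"}),
}
# Preset category order used when concatenating the scores into a vector.
CATEGORY_ORDER: List[str] = ["adhesion", "secretion"]


def virulence_feature_vector(present_genes: Set[str]) -> List[float]:
    """Turn the set of virulence genes found in a genome into one score per category."""
    vector: List[float] = []
    for category in CATEGORY_ORDER:
        weight, genes = FEATURE_LIBRARY[category]
        n_present = len(genes & present_genes)  # gene types present in this category
        # ASSUMED formula: weight * (present gene types / total gene types in category).
        vector.append(weight * n_present / len(genes))
    return vector


def virulence_level(predicted_ld50: float) -> str:
    """Grade a predicted LD50 (CFU/mL); a lower LD50 means higher virulence."""
    if predicted_ld50 < 1e5:   # assumed reading of the first baseline
        return "high-risk"
    if predicted_ld50 < 1e7:   # assumed reading of the second baseline
        return "medium-risk"
    return "low-risk"


# Example: one adhesion gene and two secretion genes were detected by alignment.
print(virulence_feature_vector({"geneA", "geneC", "geneD"}))  # [0.5, 1.0]
print(virulence_level(3.2e6))  # medium-risk
```

The set of detected genes stands in for the presence/deletion profile; how that profile is produced (the alignment step) is outside the scope of this sketch.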
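The self-training procedure of claim 6 can likewise be illustrated with a short, dependency-free Python sketch. Everything concrete here is an assumption made for illustration: `BaggedStumps` is a toy ensemble of bootstrap-sampled one-split trees standing in for random forest regression, confidence is taken to be the agreement between trees, the targets are assumed to be log10 LD50 values, the pseudo-label weight of 0.5 and the top fraction of 0.5 are arbitrary, and the final model is refitted from scratch rather than updated in place. The feature vectors and LD50 values are invented.

```python
import random
import statistics
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


class BaggedStumps:
    """Toy stand-in for random forest regression: bagged one-split trees."""

    def __init__(self, n_trees: int = 50, seed: int = 0) -> None:
        self.n_trees = n_trees
        self.rng = random.Random(seed)
        self.trees: List[Tuple[int, float, float, float]] = []

    def fit(self, X: Sequence[Vector], y: Sequence[float],
            weights: Optional[Sequence[float]] = None) -> "BaggedStumps":
        weights = list(weights) if weights is not None else [1.0] * len(X)
        self.trees = []
        for _ in range(self.n_trees):
            # Weighted bootstrap: rows with a low weight are drawn less often.
            idx = self.rng.choices(range(len(X)), weights=weights, k=len(X))
            feature = self.rng.randrange(len(X[0]))  # one random feature per tree
            threshold = statistics.mean(X[i][feature] for i in idx)
            left = [y[i] for i in idx if X[i][feature] <= threshold]
            right = [y[i] for i in idx if X[i][feature] > threshold]
            overall = statistics.mean(y[i] for i in idx)
            self.trees.append((feature, threshold,
                               statistics.mean(left) if left else overall,
                               statistics.mean(right) if right else overall))
        return self

    def tree_predictions(self, x: Vector) -> List[float]:
        return [lo if x[f] <= t else hi for f, t, lo, hi in self.trees]

    def predict(self, x: Vector) -> float:
        return statistics.mean(self.tree_predictions(x))


def self_train(X_lab: Sequence[Vector], y_lab: Sequence[float],
               X_unlab: Sequence[Vector], top_fraction: float = 0.5,
               pseudo_weight: float = 0.5, seed: int = 0):
    # Step 1: fit an intermediate model on the labeled strains only.
    intermediate = BaggedStumps(seed=seed).fit(X_lab, y_lab)
    # Step 2: rank unlabeled strains by tree disagreement (low spread = confident).
    ranked = sorted(X_unlab,
                    key=lambda x: statistics.pstdev(intermediate.tree_predictions(x)))
    confident = ranked[: int(len(ranked) * top_fraction)]
    pseudo_labels = [intermediate.predict(x) for x in confident]
    # Step 3: refit on labeled plus pseudo-labeled rows, down-weighting the latter.
    X_all = list(X_lab) + confident
    y_all = list(y_lab) + pseudo_labels
    w_all = [1.0] * len(X_lab) + [pseudo_weight] * len(confident)
    return BaggedStumps(seed=seed).fit(X_all, y_all, w_all), confident


# Invented example: two-category feature vectors and log10(LD50) labels.
X_LAB = [[0.9, 1.8], [0.8, 1.6], [0.2, 0.4], [0.1, 0.3]]
Y_LAB = [4.5, 4.8, 7.2, 7.5]
X_UNLAB = [[0.85, 1.7], [0.15, 0.35], [0.5, 1.0], [0.45, 0.9]]

final_model, confident = self_train(X_LAB, Y_LAB, X_UNLAB)
print(len(confident), "pseudo-labeled strains added")
print("predicted log10(LD50):", final_model.predict([0.5, 1.0]))
```

In practice the toy ensemble would be replaced by a library random forest regressor that accepts per-sample weights when fitting; the three-step loop stays the same.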