Target sequence screening method, kit and application for species identification based on artificial intelligence

An AI-based target sequence screening method uses a target sequence recognition model and a nucleic acid database to screen species-specific target sequences. It addresses the misjudgment and low efficiency of existing species identification methods and enables efficient, accurate species identification.

CN121237196B · Active · Publication Date: 2026-06-23 · INST OF MEDICINAL PLANT DEV CHINESE ACADEMY OF MEDICAL SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INST OF MEDICINAL PLANT DEV CHINESE ACADEMY OF MEDICAL SCI
Filing Date
2025-09-18
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing species identification methods are prone to misjudgment, rely on expert experience, and are time-consuming and costly. AI models struggle to identify organisms accurately at the species and subspecies levels, and they fail to make effective use of the discriminative features in whole-genome data.

Method used

The AI-based target sequence screening method uses a target sequence recognition model to score gene fragments from whole-genome data, combines a preset screening condition with a nucleic acid database to screen candidate target sequences, and designs target primers to verify them, yielding species-specific target sequences.

Benefits of technology

It significantly improves the screening efficiency and accuracy of target sequences, enabling efficient identification, especially at the species and subspecies levels.


Abstract

The present disclosure provides a target sequence screening method, a kit and an application for species identification based on artificial intelligence. Specifically, first, a target sequence identification model is used to predict a plurality of first gene fragments from the whole genome data of a target species and output a prediction probability; then, based on the prediction probability and a preset first screening condition, the plurality of first gene fragments are screened to obtain second gene fragments; then, based on a pre-constructed nucleic acid database, the plurality of second gene fragments are screened to obtain third gene fragments; finally, the third gene fragments verified by a target primer are determined as target sequences. With such a technical solution, the target sequence identification model can predict a plurality of first gene fragments and output a prediction probability, greatly improving the screening efficiency and accuracy of target sequences.

Description

Technical Field

[0001] This disclosure relates to the field of biotechnology, and in particular to a target sequence screening method, kit, and application for species identification based on artificial intelligence.

Background Technology

[0002] Classical species identification methods face numerous problems in practical applications, such as misjudgment due to morphological similarity, reliance on the experience and subjective judgment of identification experts, high time and cost, and difficulty in handling large-scale samples. Emerging molecular biology techniques, such as DNA barcoding, focus on only a few specific regions of the genome. Although they can achieve species identification within a certain range, some species cannot be accurately identified due to the presence of identical DNA barcode sequences.

[0003] Deep learning-based artificial intelligence (AI) identification models primarily rely on image recognition. Since these models are trained on datasets typically encompassing various phenotypic features of species, their identification capabilities encounter significant bottlenecks as the number of species and the scale of data increase. For example, in practical applications, AI identification models usually only achieve good identification results at the genus level, while identification at the species and subspecies levels remains challenging. In recent years, AI technology, through algorithms such as neural networks and natural language processing, has demonstrated significant advantages in areas such as gene sequence analysis and variant site annotation. However, no research has yet reported on how to utilize AI technology to mine discriminative features from whole-genome data for biological species identification.

Summary of the Invention

[0004] In view of this, the purpose of this disclosure is to propose a target sequence screening method, kit and application for species identification based on artificial intelligence.

[0005] To achieve the above objectives, this disclosure provides a target sequence screening method for species identification based on artificial intelligence, comprising:

[0006] Acquire and identify multiple first gene fragments based on whole-genome data of the target species;

[0007] Using a target sequence recognition model, target sequence probability prediction is performed on the plurality of first gene fragments, and the predicted probability of each first gene fragment is output.

[0008] Based on the predicted probability and the preset first screening conditions, multiple first gene fragments are screened to obtain multiple second gene fragments;

[0009] Based on a pre-constructed nucleic acid database, the third gene fragment is obtained by screening the multiple second gene fragments.

[0010] Target primers are designed based on the third gene fragment. The specificity of the third gene fragment is verified using the target primers, the first genome of the target species, and the second genome of a non-target species. The verified third gene fragment is then determined to be the target sequence.

[0011] The target sequence recognition model is obtained by training an initial model based on a dataset.

[0012] The dataset includes target sequence samples and whole genome samples of the species corresponding to the target sequence samples.

[0013] In some embodiments, the acquisition and determination of multiple first gene fragments based on the whole genome data of the target species specifically includes:

[0014] Obtain whole genome data from multiple individuals of the target species, and segment each whole genome data to obtain multiple fourth gene fragments;

[0015] An inverted index is established based on multiple fourth gene fragments and each individual, and the number of times each fourth gene fragment appears in different individuals is counted.

[0016] Based on the preset second screening condition and the number of occurrences, the multiple fourth gene fragments are screened to obtain the first gene fragments.

[0017] In some embodiments, the preset first filtering condition includes a first sorting range;

[0018] The step of selecting multiple first gene fragments to obtain multiple second gene fragments based on the predicted probability and preset first screening conditions specifically includes:

[0019] The plurality of first gene fragments are sorted based on the predicted probabilities;

[0020] In response to determining that the first gene fragment belongs to the first sorting range, the first gene fragment is identified as the second gene fragment.

[0021] In some embodiments, the step of screening the plurality of second gene fragments to obtain a third gene fragment based on a pre-constructed nucleic acid database specifically includes:

[0022] Each second gene fragment is compared with the nucleic acid sequence in the nucleic acid database;

[0023] In response to the determination that any second gene fragment and any nucleic acid sequence of the same length as the second gene fragment in any species other than the target species in the nucleic acid database have at least N base differences, the second gene fragment is determined to be the third gene fragment;

[0024] Where N≥3.

[0025] In some embodiments, the target sequence samples satisfy multiple filtering rules; wherein, the filtering rules are obtained statistically based on multiple target sequence samples;

[0026] The dataset also includes at least one of the first type of sequence samples and the second type of sequence samples;

[0027] The first type of sequence sample satisfies the multiple filtering rules and does not belong to the target sequence; the second type of sequence sample is a random sequence or satisfies some of the multiple filtering rules.

[0028] In some embodiments, the ratio of the number of target sequence samples, the first type of sequence samples, and the second type of sequence samples is 1:0.8 to 1.0:0.05 to 0.2.

[0029] In some embodiments, the initial model includes a feature extraction module and a classification module; wherein the feature extraction module is a pre-trained model based on the Transformer architecture.

[0030] Based on the same inventive concept, this disclosure also provides a species identification kit, the kit comprising primer sequences; the primer sequences are designed for target sequences; the target sequences are selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO:1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO:4.

[0031] Based on the same inventive concept, this disclosure also provides an application of a target sequence in species identification, wherein the target sequence is selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO:1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO:4. It should be noted that the application provided by this disclosure can identify all samples for which the target sequence can be obtained, including but not limited to medicinal herbs, processed medicinal materials, prepared Chinese medicines, and dietary supplements.

[0032] In some embodiments, SEQ ID NO:1 is used to identify the fungus *Colletotrichum fructicola*, and SEQ ID NO:4 is used to identify the fungus *Colletotrichum siamense*.

[0033] As described above, in the target sequence screening method, kit, and application for species identification based on artificial intelligence provided in this disclosure, a target sequence recognition model first performs prediction on multiple first gene fragments from whole-genome data of a target species and outputs predicted probabilities. Next, based on the predicted probabilities and a preset first screening condition, the multiple first gene fragments are screened to obtain second gene fragments. Then, based on a pre-constructed nucleic acid database, the multiple second gene fragments are screened to obtain third gene fragments. Finally, the third gene fragment validated by target primers is identified as the target sequence. Because the target sequence recognition model scores many first gene fragments and outputs a predicted probability for each, this technical solution significantly improves the efficiency and accuracy of target sequence screening.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions in this disclosure or related technologies, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0035] Figure 1A is a flowchart of a target sequence screening method for species identification based on artificial intelligence, provided in an embodiment of this disclosure.

[0036] Figure 1B is a partial flowchart of another target sequence screening method for species identification based on artificial intelligence, provided in an embodiment of this disclosure.

[0037] Figure 2A shows the GenBank alignment results of the species-specific target sequence of *Colletotrichum fructicola* in Example 2 of this disclosure.

[0038] Figure 2B shows the GenBank alignment results of the species-specific target sequence of *Colletotrichum siamense* in Example 2 of this disclosure.

[0039] Figure 2C shows the Sanger sequencing results of the species-specific target sequence of *Colletotrichum fructicola* in Example 2 of this disclosure.

[0040] Figure 2D shows the Sanger sequencing results of the species-specific target sequence of *Colletotrichum siamense* in Example 2 of this disclosure.

[0041] Figure 3A shows the results of enzyme-linked immunosorbent assay (ELISA) analysis of *Colletotrichum* and other closely related species in Example 3 of this disclosure.

[0042] Figure 3B shows the ELISA reader detection results of the *Colletotrichum* species and other closely related species of the genus *Colletotrichum* in Example 3 of this disclosure.

[0043] Figure 4A shows the results of visual fluorescence detection of *Colletotrichum* and other closely related species of the genus *Colletotrichum* in Example 3 of this disclosure.

[0044] Figure 4B shows the results of visual fluorescence detection of the *Colletotrichum* species and other closely related species of the genus *Colletotrichum* in Example 3 of this disclosure.

Detailed Implementation

[0045] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0046] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this disclosure should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar words used in the embodiments of this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Words such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, but do not exclude other elements or objects.

[0047] The abbreviations used in this disclosure have their conventional meanings in the chemical and biological fields. The chemical structures and formulas described herein are constructed according to standardized valence rules known in the field of chemistry. Unless otherwise specified, “μM” refers to “μmol/L” and “mM” refers to “mmol/L”.

[0048] As described in the background section, the relevant technologies have not yet utilized AI technology to mine distinguishing features in species whole-genome data for biological species identification.

[0049] In view of this, embodiments of this disclosure provide a target sequence screening method, kit, and application for species identification based on artificial intelligence. The screening method includes: first, using a target sequence recognition model to predict multiple first gene fragments from whole-genome data of a target species and outputting predicted probabilities; then, screening the multiple first gene fragments based on the predicted probabilities and preset first screening conditions to obtain second gene fragments; next, screening the multiple second gene fragments based on a pre-constructed nucleic acid database to obtain third gene fragments; and finally, identifying the third gene fragment, validated by target primers, as the target sequence. Using this technical solution, the target sequence recognition model can predict multiple first gene fragments and output predicted probabilities, significantly improving the efficiency and accuracy of target sequence screening.

[0050] Figure 1A is a flowchart of a target sequence screening method for species identification based on artificial intelligence, provided in an embodiment of this disclosure. Figure 1B is a partial flowchart of another target sequence screening method for species identification based on artificial intelligence, provided in an embodiment of this disclosure.

[0051] In some embodiments, an initial model is first constructed.

[0052] Optionally, the initial model may include a feature extraction module and a classification module; wherein, the feature extraction module is a pre-trained model based on the Transformer architecture, such as the DNABERT model; the classification module may be a linear layer or other network that implements classification. It should be noted that the feature extraction module can output the features at the CLS position as sequence features to the classification module.

[0053] Alternatively, the initial model can also be a DNABERT model with a classification head, such as a sequence-level classification model, a token-level classification model, etc., and this disclosure does not limit it.

[0054] It should be noted that the initial model can also be a large text classification model, and this disclosure does not limit it to this.

[0055] Next, referring to 201A in Figure 1B, a dataset is constructed based on species-specific data of known species. It should be noted that the dataset may include both a training set and a test set.

[0056] For example, the dataset may include target sequence samples from 31 species of the genus *Alternaria* identified in previous studies, along with whole genome samples from 145 individuals of the corresponding species. It should be noted that the target sequence samples can be positive samples.

[0057] To enhance the model's ability to distinguish subtle differences, the inventors of this disclosure have also added hard negative samples to the dataset.

[0058] In some embodiments, hard negative samples can be determined as follows. Statistical analysis is performed on the positive samples (corresponding to the target sequence samples) to obtain filtering rules. For example, the following indicators are computed: the overall GC content of the sequence; the GC content of the 3-mers and 6-mers at the 5' and 3' ends; and, for consecutive single-, double-, and triple-base repeats, the repeat length and the repeat count (e.g., AAA has a repeat length of 3 and AAAA a repeat length of 4, while the count is the number of times such a repeat occurs in the sequence). The value range of each indicator across the positive samples constitutes the filtering rules.

[0059] By filtering the whole genome sequence using the filtering rules, gene fragments that meet all filtering rules but are not positive samples can be identified as hard negative samples.

[0060] Hard negative samples therefore closely resemble positive samples in their sequence features but are not positive samples. They are the difficult cases for model learning and help improve the model's ability to distinguish subtle differences.
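The rule-based mining of hard negative samples described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the disclosure: the indicator set is reduced to four items, each rule is taken to be the minimum-to-maximum range observed in the positive samples, and all function names (`indicators`, `learn_rules`, `hard_negatives`) are hypothetical.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def longest_run(seq: str) -> int:
    """Length of the longest single-base repeat, e.g. 4 for 'AAAA'."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def indicators(seq: str) -> dict:
    """Hypothetical, reduced indicator set; the text lists more indicators."""
    return {
        "gc_total": gc_content(seq),
        "gc_5prime_6mer": gc_content(seq[:6]),
        "gc_3prime_6mer": gc_content(seq[-6:]),
        "longest_run": float(longest_run(seq)),
    }


def learn_rules(positives):
    """Assumption: each rule is the (min, max) range of one indicator over the positives."""
    rows = [indicators(s) for s in positives]
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in rows[0]}


def rules_passed(seq, rules) -> int:
    """Number of rules whose range contains the indicator value of seq."""
    values = indicators(seq)
    return sum(lo <= values[name] <= hi for name, (lo, hi) in rules.items())


def hard_negatives(genome, k, positives, rules):
    """Fragments of length k that satisfy every rule but are not positive samples."""
    known = set(positives)
    found = []
    for i in range(len(genome) - k + 1):
        fragment = genome[i:i + k]
        if fragment not in known and rules_passed(fragment, rules) == len(rules):
            found.append(fragment)
    return found
```

Under the same assumptions, fragments for which `rules_passed` returns a value greater than zero but smaller than `len(rules)` would correspond to the ordinary negative samples that satisfy only some of the rules.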

[0061] To enhance the model's ability to discriminate against non-target general backgrounds, some embodiments of this disclosure also add negative samples to the dataset.

[0062] In some embodiments, negative samples can be constructed using two methods of randomly selecting sequence fragments. For example, negative samples can be constructed using random sequences. Alternatively, gene fragments that satisfy some of the filtering rules (e.g., one or two conditions) can be selected as negative samples. Gene fragments that satisfy some of the filtering rules maintain similarity to positive samples in overall sequence features, but their sequence content is significantly different, which can enhance the model's discriminative power.

[0063] Then, supervised transfer learning is performed on the initial model using the dataset. It should be noted that during training, a partial parameter freezing strategy can be employed to prevent noisy data from significantly disturbing the pre-trained parameters and to reduce overfitting. In addition, a label smoothing strategy can be applied to avoid overfitting to noisy data.
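To make the label smoothing strategy concrete, the sketch below computes a smoothed target and the resulting cross-entropy for a two-class case (target sequence vs. non-target). It uses the common uniform smoothing formulation; the smoothing factor `eps` and the function names are assumptions, since the disclosure does not state which variant or value is used.

```python
import math


def smooth_labels(label: int, num_classes: int, eps: float) -> list:
    """Put (1 - eps) on the true class and spread eps uniformly over all classes."""
    return [
        (1.0 - eps) * (1.0 if c == label else 0.0) + eps / num_classes
        for c in range(num_classes)
    ]


def cross_entropy(logits: list, target: list) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


# A very confident, correct prediction for class 1 (hypothetical logits).
logits = [-5.0, 5.0]
hard_loss = cross_entropy(logits, [0.0, 1.0])
smooth_loss = cross_entropy(logits, smooth_labels(1, 2, eps=0.1))
```

With smoothed targets the loss stays clearly above zero even for a confident correct prediction, so the model is discouraged from pushing logits to extremes on samples whose labels may be noisy.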

[0064] In some embodiments, the ratio of positive samples:hard negative samples:negative samples can be 1:0.8~1.0:0.05~0.2. Such a ratio can enhance the model's learning of hard negative samples while ensuring learning of both positive and negative samples, thereby improving classification accuracy. If the ratio of hard negative samples is increased, the model will not learn enough from positive samples; if the ratio of hard negative samples is decreased, the model will not be able to learn sufficiently from hard negative samples.

[0065] Optionally, the ratio of positive samples : hard negative samples : negative samples can be 1:0.9:0.1.
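As a small worked example of these ratios, the helper below converts a number of positive samples into target counts for the other two classes. The function name and the rounding rule are assumptions made for illustration only.

```python
def sample_counts(n_positive: int, hard_ratio: float = 0.9, neg_ratio: float = 0.1) -> dict:
    """Target sample counts for a positive : hard negative : negative ratio of 1 : hard_ratio : neg_ratio."""
    # Ranges taken from the text: 0.8 to 1.0 for hard negatives, 0.05 to 0.2 for negatives.
    if not 0.8 <= hard_ratio <= 1.0:
        raise ValueError("hard_ratio outside the 0.8 to 1.0 range given in the text")
    if not 0.05 <= neg_ratio <= 0.2:
        raise ValueError("neg_ratio outside the 0.05 to 0.2 range given in the text")
    return {
        "positive": n_positive,
        "hard_negative": round(n_positive * hard_ratio),
        "negative": round(n_positive * neg_ratio),
    }


counts = sample_counts(1000)  # 1000 positives at the optional 1:0.9:0.1 ratio
```

For 1000 positive samples this gives 900 hard negative samples and 100 negative samples.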

[0066] Through targeted dataset construction and training strategy optimization, the trained target sequence recognition model can accurately adapt to the target sequence characteristics of species, ultimately achieving high-precision recognition and differentiation of species target sequences, and significantly improving the model's recognition performance on species-specific target sequences.

[0067] Continuing to refer to 202A in Figure 1B, in some embodiments the local nucleic acid database may be constructed based on DNA sequence data from the GenBank database, the National Genomics Data Center (NGDC), or other publicly available databases.

[0068] It should be noted that GenBank is a DNA sequence database established by the National Center for Biotechnology Information (NCBI) in the United States. Its website address is: https://www.ncbi.nlm.nih.gov/genbank/. The National Genomics Data Center is a life and health big data center jointly built by the Beijing Institute of Genomics, Chinese Academy of Sciences, the Institute of Biophysics, and the Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences. Its website address is: https://ngdc.cncb.ac.cn/.

[0069] For example, if the species to be tested is a eukaryote, the local nucleic acid database can primarily use the Core nucleotide database and all eukaryotic genome data in the GenBank database.

[0070] Based on the target sequence recognition model trained above and the local nucleic acid database, target sequences for species identification can be screened. The target sequence screening method is explained in detail below with reference to the accompanying figures.

[0071] As shown in Figure 1A and Figure 1B, the screening method 100 includes:

[0072] S101: Obtain and identify multiple first gene fragments based on the whole genome data of the target species.

[0073] Here, as shown in Figure 1B, users can input the name of the target species and the storage path of its whole genome data (203A), and the whole genome data is obtained based on the storage path.

[0074] In some embodiments, step S101 may include:

[0075] The whole genome data of multiple individuals of the target species are obtained, and the whole genome data of each individual is segmented to obtain multiple fourth gene fragments. Here, the length of the fourth gene fragment can be preset, for example to any value from 20 bp to 800 bp, such as 20 bp to 100 bp, 20 bp to 80 bp, 25 bp, or 50 bp; this disclosure does not limit it.

[0076] Optionally, the whole genome data can be segmented to obtain multiple fourth gene fragments in the following manner: if the length of the fourth gene fragment is K and the length of the whole genome is L, a window of length K is moved along the genome one base at a time, so that the whole genome is segmented into L-K+1 fourth gene fragments.

[0077] Referring to 204B in Figure 1B, an inverted index is established based on the multiple fourth gene fragments and each individual, in order to count the number of times each fourth gene fragment appears in different individuals.

[0078] Based on the preset second screening condition and the number of occurrences, the multiple fourth gene fragments are screened to obtain the first gene fragments.

[0079] For example, the second screening condition may be that the fourth gene fragment ranks within the top 80% when the fragments are sorted by number of occurrences.

[0080] A fourth gene fragment that occurs in only a few individuals may be an individual-specific fragment rather than a species-specific fragment. By applying the preset second screening condition, such infrequently occurring fourth gene fragments can be excluded, thereby reducing the number of individual-specific fourth gene fragments that are input into the target sequence recognition model.

[0081] It should be noted that the above inverted index is only an example, and those skilled in the art can also use other methods to count the number of times the fourth gene fragment appears in different individuals, and this disclosure does not limit this.
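Step S101 can be sketched in a few lines: sliding-window segmentation into L-K+1 fragments, an inverted index from each fragment to the individuals that contain it, and a cut-off by occurrence rank. This is a minimal in-memory sketch; the function names are hypothetical, real genomes would need a disk-backed or streaming index, and reading "top 80%" as a rank cut-off with ties broken alphabetically is an assumption.

```python
import math
from collections import defaultdict


def kmers(genome: str, k: int) -> list:
    """Slide a window of length k one base at a time: len(genome) - k + 1 fragments."""
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def build_inverted_index(genomes: dict, k: int) -> dict:
    """Map each fragment to the set of individuals whose genome contains it."""
    index = defaultdict(set)
    for individual, genome in genomes.items():
        for fragment in kmers(genome, k):
            index[fragment].add(individual)
    return index


def select_by_occurrence(index: dict, keep_fraction: float = 0.8) -> list:
    """Keep the top keep_fraction of fragments, ranked by how many individuals contain them."""
    ranked = sorted(index, key=lambda f: (-len(index[f]), f))
    return ranked[:math.ceil(len(ranked) * keep_fraction)]


# Toy genomes for three hypothetical individuals of one species.
genomes = {"ind1": "ACGTAC", "ind2": "ACGTTT", "ind3": "ACGAAA"}
index = build_inverted_index(genomes, k=3)
first_fragments = select_by_occurrence(index)
```

In this toy input the fragment `ACG` is shared by all three individuals and ranks first, while fragments seen in a single individual fall toward the end of the ranking and are the first to be cut.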

[0082] S103: Referring to 205B in Figure 1B, a target sequence recognition model is used to perform target sequence probability prediction on the plurality of first gene fragments, and the predicted probability of each first gene fragment is output.

[0083] S105: Based on the predicted probability and the preset first screening conditions, multiple first gene fragments are screened to obtain multiple second gene fragments;

[0084] In some embodiments, the preset first screening condition includes a first sorting range, such as the top 20,000 fragments by predicted probability; S105 specifically includes:

[0085] The plurality of first gene fragments are sorted based on the predicted probabilities;

[0086] In response to determining that the first gene fragment belongs to the first sorting range, the first gene fragment is identified as the second gene fragment.

[0087] In some alternative embodiments, the preset first screening conditions include a first sorting range (e.g., the top 500 predicted sequences), a second sorting range (501-1000 predicted sequences), and a third sorting range (1001-1500 predicted sequences). Based on this, steps S105-S109 can be executed cyclically. For example, if none of the third gene fragments described in S109 are identified as target sequences, then the first gene fragment in the second sorting range can be identified as the second gene fragment, and steps S107 and S109 can continue. This process continues until at least one third gene fragment in S109 is identified as a target sequence.
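The cyclic execution over sorting ranges can be sketched as a loop over rank tiers. Here `validate` is a hypothetical callback that stands in for steps S107 and S109 (database screening and primer verification) and returns the fragments confirmed as target sequences; equal-sized tiers and the `max_tiers` limit are assumptions.

```python
from typing import Callable, Dict, List


def screen_in_tiers(
    scored: Dict[str, float],
    tier_size: int,
    validate: Callable[[List[str]], List[str]],
    max_tiers: int = 3,
) -> List[str]:
    """Try rank tiers in order of predicted probability until one yields a target sequence."""
    ranked = sorted(scored, key=lambda fragment: scored[fragment], reverse=True)
    for tier_number in range(max_tiers):
        tier = ranked[tier_number * tier_size:(tier_number + 1) * tier_size]
        if not tier:
            break  # no fragments left to try
        confirmed = validate(tier)  # stands in for S107 + S109
        if confirmed:
            return confirmed
    return []


# Hypothetical predicted probabilities; only fragment "C" passes validation.
scored = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}
targets = screen_in_tiers(scored, 2, lambda tier: [f for f in tier if f == "C"])
```

Because the first tier (`A`, `B`) yields nothing, the loop moves to the second tier and stops there, so lower-ranked fragments are only validated when higher-ranked ones fail.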

[0088] S107: Based on a pre-constructed nucleic acid database (202A), the multiple second gene fragments are screened to obtain a third gene fragment;

[0089] In some embodiments, S107 specifically includes:

[0090] Referring to 206B in Figure 1B, each of the second gene fragments is aligned with the nucleic acid sequences in the nucleic acid database;

[0091] Here, a query library can be constructed based on the second gene fragments; based on the query library and the nucleic acid database, BLAST (Basic Local Alignment Search Tool) alignment can be performed and the alignment results can be output.

[0092] Referring to 207B in Figure 1B, in response to determining that a second gene fragment differs by at least N bases from every nucleic acid sequence of equal length belonging to any species other than the target species in the nucleic acid database, the second gene fragment is identified as the third gene fragment, where N ≥ 3. It should be noted that N is a positive integer.

[0093] S107 can be used to filter out second gene fragments that differ only slightly from sequences of other species in the nucleic acid database, thereby improving the specificity of the third gene fragments.
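The "at least N base differences" criterion of S107 can be illustrated with a brute-force Hamming-distance check. The disclosure performs this comparison with BLAST against a large database; the sketch below swaps in an exhaustive window scan that is only practical for toy inputs, ignores the reverse-complement strand, and uses hypothetical function names.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(fragment: str, sequence: str) -> int:
    """Smallest Hamming distance between fragment and any equal-length window of sequence."""
    k = len(fragment)
    # If the sequence is shorter than the fragment there is no equal-length
    # window; treat that as the maximal distance k (an assumption).
    return min(
        (hamming(fragment, sequence[i:i + k]) for i in range(len(sequence) - k + 1)),
        default=k,
    )


def is_specific(fragment: str, non_target_sequences: list, n: int = 3) -> bool:
    """True if the fragment differs by at least n bases from every non-target window."""
    return all(min_distance(fragment, seq) >= n for seq in non_target_sequences)


candidate = "ACGTACGT"
near_match = "TTACGTACGATT"   # contains ACGTACGA, one base away from the candidate
unrelated = "GGGGGGGGGGGG"
```

With these toy sequences, `is_specific(candidate, [unrelated])` holds, while adding `near_match` to the non-target list rejects the candidate because a window one base away exists.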

[0094] S109: Design target primers based on the third gene fragment, and use the target primers, the first genome of the target species, and the second genome of the non-target species to verify the specificity of the third gene fragment; and determine the verified third gene fragment as the target sequence; here, the non-target species can be species of the same genus as the target species, and the number of such species can be multiple, which is not limited in this disclosure;

[0095] It should be noted that the first genome of the target species and the second genome of the non-target species can be obtained separately through DNA extraction technology. For example, the first genome can be obtained from an organism of the target species using DNA extraction technology, and the second genome can be obtained from an organism of the non-target species using DNA extraction technology.

[0096] Continuing to refer to 207B in Figure 1B, the target primers designed based on the third gene fragment can be primers designed within a 500 bp range upstream and downstream of the location of the third gene fragment (corresponding to context matching), e.g., SEQ ID NO:2, SEQ ID NO:3, SEQ ID NO:5, and SEQ ID NO:6.
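Locating the region in which primers may be designed amounts to taking the fragment position plus a 500 bp flank on each side, clipped at the ends of the sequence. The sketch below only extracts that window; actual primer design (melting temperature, GC content, dimer checks) is out of scope, and the function name and the use of the first exact match are assumptions.

```python
from typing import Optional, Tuple


def primer_window(genome: str, fragment: str, flank: int = 500) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, subsequence) covering the fragment plus `flank` bases on each side."""
    position = genome.find(fragment)  # first exact occurrence on the forward strand
    if position == -1:
        return None
    start = max(0, position - flank)
    end = min(len(genome), position + len(fragment) + flank)
    return start, end, genome[start:end]


# Toy contig: 600 bases upstream, a 10-base fragment, 300 bases downstream.
contig = "A" * 600 + "CGCGTTGCAC" + "T" * 300
window = primer_window(contig, "CGCGTTGCAC")
```

Here the upstream flank is the full 500 bp, while the downstream flank is clipped to the 300 bp that remain before the end of the contig.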

[0097] In some embodiments, verifying the specificity of the third gene fragment using the target primer, the first genome of the target species, and the second genome of a non-target species specifically includes:

[0098] Using the target primers, the first genome is amplified to obtain a first amplification product, and the second genome is amplified to obtain a second amplification product;

[0099] Based on the first amplification product and the second amplification product, gel electrophoresis images and sequencing data (e.g., Sanger sequencing) are obtained;

[0100] Based on the gel electrophoresis pattern and the sequencing data, the specificity of the third gene fragment was determined.

[0101] It should be noted that the third gene fragment is determined to be specific only if the target primers amplify the target band in the first genome and not in the second genome, and the sequencing data completely match the sequence in the first genome while showing at least N base differences from all other genomes.
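The acceptance rule for primer verification can be written as a simple decision function over per-genome results. The record layout below is hypothetical: `band_present` stands for the gel electrophoresis outcome, and `mismatches` for the number of base differences between the candidate fragment and the sequence obtained for that genome, which may be unavailable when nothing was amplified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenomeResult:
    genome_id: str
    is_target: bool            # True for the first genome (target species)
    band_present: bool         # target band visible on the gel
    mismatches: Optional[int]  # base differences vs. the candidate; None if not sequenced


def fragment_is_specific(results: List[GenomeResult], n: int = 3) -> bool:
    """Apply the rule: band and perfect match in the target, no band and >= n differences elsewhere."""
    for result in results:
        if result.is_target:
            if not result.band_present or result.mismatches != 0:
                return False
        else:
            if result.band_present:
                return False
            if result.mismatches is not None and result.mismatches < n:
                return False
    return True


passing = [
    GenomeResult("target", True, True, 0),
    GenomeResult("relative_1", False, False, None),
    GenomeResult("relative_2", False, False, 4),
]
```

Under these assumptions `fragment_is_specific(passing)` holds; a band in any non-target genome, or an imperfect match in the target genome, rejects the fragment.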

[0102] In some embodiments, the target sequence samples satisfy multiple filtering rules; wherein, the filtering rules are obtained statistically based on multiple target sequence samples;

[0103] The dataset also includes at least one of the first type of sequence samples (corresponding to hard negative samples) and the second type of sequence samples (corresponding to negative samples);

[0104] The first type of sequence sample satisfies the multiple filtering rules and does not belong to the target sequence; the second type of sequence sample is a random sequence or satisfies some of the multiple filtering rules.

[0105] In some embodiments, the ratio of the number of target sequence samples, the first type of sequence samples, and the second type of sequence samples is 1:0.8 to 1.0:0.05 to 0.2.

[0106] It should be noted that the filtering method of this embodiment can be executed by a single device, such as a computer or server. The method of this embodiment can also be applied to a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method of this embodiment, and the multiple devices will interact with each other to complete the method described.

[0107] It should be noted that the above description describes some embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims can be performed in a different order than that shown in the above embodiments and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0108] To make the technical solution of this disclosure clearer and easier to understand, the target sequence screening method for species identification based on artificial intelligence provided in this disclosure will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0109] Unless otherwise specified, the experimental methods used in the following examples are conventional methods, performed according to the techniques or conditions described in the literature in this field or according to the product instructions. Unless otherwise specified, the materials and reagents used in the following examples are commercially available.

[0110] Example 1

[0111] 1. Materials

[0112] Seventeen published whole-genome sequences of the fungus *Colletotrichum fructicola* and 24 published whole-genome sequences of *Colletotrichum siamense* were downloaded from GenBank.

[0113] Table 1. Information on GenBank-Downloaded Genome Data

[0114]

[0115]

[0116] 2. Target sequence prediction for species identification based on artificial intelligence

[0117] 1) Split the above genome data into 25-mer fragments (k-mers with k = 25) and build an inverted index that records the number of times each 25-mer appears in the genomes of the different individuals within the species;

[0118] 2) From the 25-mers obtained in step 1), retain those ranked in the top 80% by occurrence across the different individuals within the species, and submit them to the target sequence recognition model trained as described above. The model predicts whether each 25-mer is a species-specific target sequence and outputs the predicted probability;

[0119] 3) Select the 20,000 fragments with the highest species-specific target sequence prediction probabilities from step 2) to construct a query library, align the library against the local nucleic acid database using BLAST, and output the alignment results. Figure 2A shows the GenBank alignment results for the species-specific target sequence of *Colletotrichum fructicola* in Example 2 of this disclosure; Figure 2B shows the GenBank alignment results for the species-specific target sequence of *Colletotrichum siamense* in Example 2 of this disclosure;

[0120] 4) Based on the alignment results of step 3), 25-mers that differ from the sequences of every other species by 3 or more bases are selected as species-specific candidate target sequences. In this way, 14 species-specific candidate target sequences were obtained for *Colletotrichum fructicola* and 5 were obtained for *Colletotrichum siamense*.
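Steps 1), 2) and 4) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: k-mers are taken as overlapping windows, "occurrence" is counted as the number of individuals containing a k-mer, and a plain Hamming-distance check against a list of off-target k-mers stands in for the BLAST alignment of step 3). The model scoring of step 2) is omitted, and all function names and the toy genomes are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k=25):
    """Yield every overlapping k-mer of seq (overlap is an assumption)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_inverted_index(genomes, k=25):
    """Map each k-mer to the set of individual IDs whose genome contains it."""
    index = defaultdict(set)
    for individual, seq in genomes.items():
        for kmer in kmers(seq.upper(), k):
            index[kmer].add(individual)
    return index


def top_fraction_by_prevalence(index, fraction=0.8):
    """Keep the top `fraction` of k-mers, ranked by number of individuals."""
    ranked = sorted(index, key=lambda km: (-len(index[km]), km))
    return ranked[:max(1, int(len(ranked) * fraction))]


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_species_specific(candidate, other_species_kmers, min_diff=3):
    """True if the candidate differs from every off-target k-mer by >= min_diff bases."""
    return all(hamming(candidate, other) >= min_diff for other in other_species_kmers)


# Toy data with k = 4 so the result is easy to inspect by eye.
genomes = {"ind1": "ACGTAC", "ind2": "ACGTTT", "ind3": "ACGTAA"}
index = build_inverted_index(genomes, k=4)
shared = top_fraction_by_prevalence(index, fraction=0.5)
print(shared)
```

With real data, the retained k-mers would be scored by the recognition model before the specificity check, and the off-target comparison would come from the BLAST results rather than an in-memory list.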

[0121] Example 2

[0122] To confirm that the specific candidate target sequences screened in Example 1 actually exist only in the target species, and that they behave in practice as predicted, specific primer pairs were designed based on the species-specific candidate target sequences of *Colletotrichum fructicola* and *Colletotrichum siamense*. PCR amplification and Sanger sequencing were then performed on these species and on other closely related species for verification.

[0123] 1. Materials

[0124] The *Colletotrichum* fungal strains were purchased from the China General Microbiological Culture Collection Center and Beina Chuanglian Biotechnology Co., Ltd. Specific species information is as follows:

[0125] Table 2 Information on samples of *Colletotrichum* fungi

[0126]

[0127]

[0128] 2. Experimental Procedure

[0129] 2.1 DNA Extraction

[0130] Genomic DNA was extracted from the *Colletotrichum* samples to be tested using an extraction kit.

[0131] 2.2 PCR amplification

[0132] Each specific candidate target sequence screened in Example 1 was extended by 100 bp upstream and 100 bp downstream, and the extended sequence was used for primer design. The designed primer pairs were used to perform PCR amplification on all samples to be tested for specificity verification. The PCR system was: 12.5 μL 2×Taq PCR Master Mix, 1 μL each of upstream and downstream primers (10 μmol/L), 2 μL DNA template (approximately 20 ng), and 8.5 μL ddH2O, for a total of 25 μL. The PCR program was: 95℃ for 3 min; 30 cycles of 95℃ for 30 s, 56℃ for 30 s, and 72℃ for 30 s; 72℃ for 10 min.
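The primer-design window and the reaction arithmetic above can be checked with a short sketch. It assumes the target is located by an exact forward-strand match and that the window is clipped at the ends of the sequence; the function name and the synthetic contig are hypothetical, and the actual primer design software is not named in this disclosure.

```python
def extended_region(genome, target, flank=100):
    """Return the target plus up to `flank` bases on each side.

    Only an exact forward-strand match is searched (an assumption);
    the window is clipped at the sequence ends.
    """
    start = genome.find(target)
    if start < 0:
        raise ValueError("target not found on the forward strand")
    left = max(0, start - flank)
    right = min(len(genome), start + len(target) + flank)
    return genome[left:right]


# Synthetic contig: 150 nt of filler on each side of SEQ ID NO: 1.
target = "TTTCAGATTCTAAGCCTACCCTACT"
contig = "A" * 150 + target + "C" * 150
region = extended_region(contig, target)

# Reaction components from paragraph [0132], in microlitres.
pcr_mix_ul = {
    "2x Taq PCR Master Mix": 12.5,
    "forward primer (10 umol/L)": 1.0,
    "reverse primer (10 umol/L)": 1.0,
    "DNA template": 2.0,
    "ddH2O": 8.5,
}
total_ul = sum(pcr_mix_ul.values())
# Final concentration of each primer after dilution into the full reaction.
primer_final_umol_per_l = 10 * pcr_mix_ul["forward primer (10 umol/L)"] / total_ul
print(len(region), total_ul, primer_final_umol_per_l)
```

The 225-nt window (100 + 25 + 100) is the sequence handed to primer design; the component volumes sum to the stated 25 μL, which gives each primer a final concentration of 0.4 μmol/L.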

[0133] 2.3 Agarose gel electrophoresis

[0134] The specificity and fragment length distribution of the amplified products were evaluated by 1.5% agarose gel electrophoresis (120 V, 50 min), using DL1000 as the molecular weight standard.

[0135] 2.4 Sanger sequencing

[0136] Bidirectional Sanger sequencing was performed on the specific target bands and on all visible amplified bands in the agarose gel images. A specific candidate target sequence was considered valid, and usable as a target sequence for species identification, if the sequencing result of the target species completely matched the candidate target sequence and each non-target species either showed no amplified band or showed a band whose sequence differed from the candidate target sequence by 3 or more bases.

[0137] The specific target sequence finally selected for *Colletotrichum fructicola* and its primer pair are as follows:

[0138] Cfr_Target: 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO: 1;

[0139] Cfr_F: 5'-GAACAAGGAAATCCAGGCCCTACTC-3', SEQ ID NO: 2;

[0140] Cfr_R: 5'-ATAATCAGGCTTTGCGTGGCTGTAG-3', SEQ ID NO: 3;

[0141] The specific target sequence finally selected for *Colletotrichum siamense* and its primer pair are as follows:

[0142] Csi_Target: 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO: 4;

[0143] Csi_F: 5'-TTTCCAGTCCGGCTCAGTGTATTGG-3', SEQ ID NO: 5;

[0144] Csi_R: 5'-TGAAAGTCCGTCGAAGTTCAATGGC-3', SEQ ID NO: 6;

[0145] Figure 2C shows the Sanger sequencing results for the species-specific target sequence of *Colletotrichum fructicola* in Example 2 of this disclosure. Figure 2D shows the Sanger sequencing results for the species-specific target sequence of *Colletotrichum siamense* in Example 2 of this disclosure.

[0146] It should be noted that, because the target sequences are intended to support the accurate identification and rapid detection of *Colletotrichum fructicola* and *Colletotrichum siamense* by CRISPR-Cas12a detection technology, the specific target sequences for *Colletotrichum fructicola* and *Colletotrichum siamense* given above differ from the sequences of non-target species by three or more bases, not counting the protospacer adjacent motif (PAM).

[0147] It should be understood that when other identification and rapid detection technologies are used, such as sequencing, droplet digital PCR (ddPCR), and quantitative real-time PCR (qPCR), PAM need not be considered.
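The PAM-aware mismatch count can be illustrated as follows. The sketch assumes that the first four bases of each 25-nt target form the Cas12a PAM (5'-TTTV-3'), which is consistent with the crRNAs of Example 3 covering bases 5 to 25 but is not stated explicitly in this disclosure; the function names and the off-target variant are hypothetical.

```python
PAM_LEN = 4  # assumed: the first four bases (5'-TTTV-3') are the Cas12a PAM


def has_tttv_pam(target):
    """True if the target starts with TTT followed by A, C or G."""
    return target[:3] == "TTT" and target[3] in "ACG"


def mismatches_outside_pam(target, other):
    """Count base differences between two equal-length sequences, ignoring the PAM."""
    if len(target) != len(other):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(target[PAM_LEN:], other[PAM_LEN:]))


cfr_target = "TTTCAGATTCTAAGCCTACCCTACT"  # SEQ ID NO: 1
csi_target = "TTTCCTGACGAATGGACATGTTGCG"  # SEQ ID NO: 4

# Hypothetical off-target that differs from SEQ ID NO: 1 only inside the PAM:
# it contributes no countable mismatch under this rule.
pam_variant = "TTTA" + cfr_target[PAM_LEN:]
print(mismatches_outside_pam(cfr_target, pam_variant))
```

For detection methods that do not rely on Cas12a, the same comparison would simply be made over the full 25 nt.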

[0148] As can be seen from the above results, the target sequences obtained by the target sequence screening method for species identification based on artificial intelligence provided in this embodiment can be used for the identification of species of the genus Colletotrichum, and have good feasibility.

[0149] Example 3

[0150] In this embodiment, CRISPR-Cas12a detection technology is used to provide technical support for the accurate identification and rapid detection of *Colletotrichum fructicola* and *Colletotrichum siamense*.

[0151] 1. Materials

[0152] Same as Example 2.

[0153] 2. Experimental Procedure

[0154] 2.1 DNA Extraction

[0155] Same as Example 2.

[0156] 2.2 PCR amplification

[0157] Same as Example 2.

[0158] 2.3 Detection of species-specific target sequences of *Colletotrichum fructicola* and *Colletotrichum siamense* based on the CRISPR/Cas12a system

[0159] A crRNA was designed for *Colletotrichum fructicola*, Cfr_crRNA: 5'-UAAUUUCUACUAAGUGUAGAUAGAUUCUAAGCCUACCCUACU-3', SEQ ID NO: 7; and a crRNA was designed for *Colletotrichum siamense*, Csi_crRNA: 5'-UAAUUUCUACUAAGUGUAGAUCUGACGAAUGGACAUGUUGCG-3', SEQ ID NO: 8. To 10 μL of the PCR product obtained in step 2.2, 1.65 μL crRNA (300 nM), 5 μL 10×NEBuffer 2.1, 1 μL EnGen Lba Cas12a (Cpf1) (20 nM), and 30.35 μL ddH2O were added and mixed well, and the mixture was incubated at 37℃ for 10 min. Then 2 μL Poly_C_FQ (5'6-FAM/CCCCCCCCCC/3'BHQ-1, SEQ ID NO: 9) was added. The fluorescence value can be measured at 37℃ using a microplate reader at 0, 5, 10, 15, and 20 min (λex 483 nm, λem 535 nm), or the fluorescence can be observed directly using a blue light transilluminator.
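The relationship between the 25-nt target sequences and the two crRNAs can be reproduced with a short sketch. It assumes that each crRNA is the 21-nt 5' segment shared by SEQ ID NO: 7 and SEQ ID NO: 8, followed by the RNA version of the target with its first four bases (the presumed PAM) removed. The constant and function names are hypothetical; the sequences and volumes are taken from this disclosure.

```python
# 21-nt 5' segment common to SEQ ID NO: 7 and SEQ ID NO: 8.
SHARED_5P_SEGMENT = "UAAUUUCUACUAAGUGUAGAU"
PAM_LEN = 4  # assumed: the first four bases of each target are the PAM


def design_crrna(target_dna):
    """Build a crRNA: shared 5' segment plus the target (minus PAM) as RNA."""
    spacer = target_dna[PAM_LEN:].upper().replace("T", "U")
    return SHARED_5P_SEGMENT + spacer


cfr_crrna = design_crrna("TTTCAGATTCTAAGCCTACCCTACT")  # from SEQ ID NO: 1
csi_crrna = design_crrna("TTTCCTGACGAATGGACATGTTGCG")  # from SEQ ID NO: 4

# Detection reaction components from paragraph [0159], in microlitres.
detection_mix_ul = {
    "PCR product": 10,
    "crRNA": 1.65,
    "10x NEBuffer 2.1": 5,
    "EnGen Lba Cas12a (Cpf1)": 1,
    "ddH2O": 30.35,
    "Poly_C_FQ reporter": 2,
}
total_ul = sum(detection_mix_ul.values())
print(cfr_crrna, csi_crrna, round(total_ul, 2))
```

Under these assumptions the function reproduces SEQ ID NO: 7 and SEQ ID NO: 8 exactly, and the listed volumes sum to 50 μL, so the 5 μL of 10× buffer is at 1× in the final reaction.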

[0160] In this embodiment, *Colletotrichum fructicola* and *Colletotrichum siamense* were selected as target species, and the remaining samples were used as closely related species for experimental research.

[0161] Figure 3A shows the microplate reader fluorescence results for *Colletotrichum fructicola* and other closely related species in Example 3 of this disclosure. Figure 3B shows the microplate reader fluorescence results for *Colletotrichum siamense* and other closely related species in Example 3 of this disclosure. As shown in Figures 3A and 3B, the fluorescence value of the target species was significantly higher than that of the other species and the control group (CK) (P<0.01).

[0162] Figure 4A shows the visual fluorescence detection results for *Colletotrichum fructicola* and other closely related *Colletotrichum* species in Example 3 of this disclosure; Figure 4B shows the visual fluorescence detection results for *Colletotrichum siamense* and other closely related *Colletotrichum* species in Example 3 of this disclosure. Only the target species showed a strong fluorescence signal visible to the naked eye.

[0163] The above experimental results show that the technical system described in this disclosure gave consistent results with two detection methods: microplate reader fluorescence measurement and visual fluorescence detection. This indicates that the technical system can meet the needs of accurate identification and rapid detection of the target species.

[0164] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the embodiments of this disclosure as described above, which are not provided in detail for the sake of brevity.

[0165] This disclosure is intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A target sequence screening method for species identification based on artificial intelligence, characterized in that the method comprises: acquiring and determining a plurality of first gene fragments based on whole-genome data of a target species; performing target sequence probability prediction on the plurality of first gene fragments using a target sequence recognition model, and outputting a predicted probability for each first gene fragment; screening the plurality of first gene fragments based on the predicted probabilities and a preset first screening condition to obtain a plurality of second gene fragments; screening the plurality of second gene fragments based on a pre-constructed nucleic acid database to obtain a third gene fragment; designing target primers based on the third gene fragment; verifying the specificity of the third gene fragment using the target primers, a first genome of the target species, and a second genome of a non-target species; and determining the verified third gene fragment as the target sequence; wherein the target sequence recognition model is obtained by training an initial model on a dataset, the dataset including target sequence samples and whole-genome samples of the species corresponding to the target sequence samples; wherein acquiring and determining the plurality of first gene fragments based on the whole-genome data of the target species specifically comprises: obtaining whole-genome data from a plurality of individuals of the target species, and segmenting each whole-genome data to obtain a plurality of fourth gene fragments; establishing an inverted index based on the plurality of fourth gene fragments and each individual, and counting the number of times each fourth gene fragment appears in different individuals; and screening the plurality of fourth gene fragments based on a preset second screening condition and the number of occurrences to obtain the first gene fragments, wherein the second screening condition excludes fourth gene fragments that occur infrequently in the individuals; wherein the preset first screening condition includes a first sorting range; and wherein screening the plurality of first gene fragments based on the predicted probabilities and the preset first screening condition to obtain the plurality of second gene fragments specifically comprises: sorting the plurality of first gene fragments based on the predicted probabilities; and, in response to determining that a first gene fragment falls within the first sorting range, determining that first gene fragment as a second gene fragment.

2. The target sequence screening method according to claim 1, characterized in that screening the plurality of second gene fragments based on the pre-constructed nucleic acid database to obtain the third gene fragment specifically comprises: comparing each second gene fragment with the nucleic acid sequences in the nucleic acid database; and, in response to determining that a second gene fragment has at least N base differences from every nucleic acid sequence of the same length as the second gene fragment in every species other than the target species in the nucleic acid database, determining that second gene fragment as the third gene fragment; wherein N≥3.

3. The target sequence screening method according to claim 1, characterized in that the target sequence samples satisfy multiple filtering rules, wherein the filtering rules are obtained statistically from multiple target sequence samples; the dataset also includes at least one of first-type sequence samples and second-type sequence samples; a first-type sequence sample satisfies the multiple filtering rules but is not a target sequence; and a second-type sequence sample is a random sequence or satisfies only some of the multiple filtering rules.

4. The target sequence screening method according to claim 3, characterized in that the ratio of the numbers of target sequence samples, first-type sequence samples, and second-type sequence samples is 1:0.8~1.0:0.05~0.2.

5. The target sequence screening method according to claim 1, characterized in that the initial model includes a feature extraction module and a classification module, wherein the feature extraction module is a pre-trained model based on the Transformer architecture.

6. A reagent kit for species identification, characterized in that the kit includes primer sequences; the primer sequences are designed for target sequences; the target sequences are selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO: 1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO: 4; wherein the target sequences are obtained by the target sequence screening method according to any one of claims 1 to 5.

7. An application of a target sequence in species identification, characterized in that the target sequence is selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO: 1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO: 4; wherein the target sequence is obtained by the target sequence screening method according to any one of claims 1 to 5; SEQ ID NO: 1 is used to identify *Colletotrichum fructicola*; and SEQ ID NO: 4 is used to identify *Colletotrichum siamense*.