Chromosome assembly method and apparatus for removing organelle genomic contamination sequences

By constructing an organelle database and using various sequencing technologies to compare and remove organelle genome sequences, the problem of organelle genome contamination during chromosome assembly was solved, thus improving the accuracy of chromosome assembly.

CN115261378B · Active · Publication Date: 2026-06-12 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING INST OF TECH
Filing Date: 2022-07-14
Publication Date: 2026-06-12

Application Information

IPC: C12N15/10; C12Q1/6869; G16B30/10; G16B30/20; G16B50/30

CPC: C12N15/1027; C12Q1/6869; G16B30/10; G16B30/20; G16B50/30; C12Q2535/122; C12Q2537/165

AI Tagging

Application Domain

Microbiological testing/measurement; Sequence analysis

AI Technical Summary

⚠Technical Problem

In existing technologies, contamination of organelle genome sequences during chromosome assembly can lead to errors in chromosome assembly, affecting assembly accuracy.

⚗Method used

We constructed an organelle database and used second-generation sequencing, third-generation sequencing, and Hi-C sequencing to compare and remove organelle genome sequences, thereby improving the accuracy of chromosome assembly.

🎯Benefits of technology

Effective removal of organelle genome sequences improves the accuracy of chromosome assembly and reduces the impact of organelle genomes on chromosome assembly.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to a chromosome assembly method and device for removing organelle genome contaminant sequences. Second-generation sequencing, third-generation sequencing and Hi-C sequencing are performed on a sample; a genome is assembled from the third-generation sequencing data; the genome is corrected using the second-generation and the third-generation sequencing reads, respectively; haplotype sequences are removed from the corrected genome; the genome is aligned against an organelle database to identify and remove organelle genome sequences; the Hi-C sequencing reads are aligned to the resulting genome; and the alignment results are used to assemble the genome into chromosomes. Specifically, the application relates to a chromosome assembly method and device based on a locally constructed organelle database, which identifies and removes organelle genome sequences from the genome sequence so that organelle genomes are not incorrectly assembled into chromosomes and do not cause chromosome assembly errors.

Description

Technical Field

[0001] This invention relates to a chromosome assembly method, specifically to a chromosome assembly method and apparatus that construct a local organelle database and use the database to screen a sample and remove organelle genomic contamination sequences.

Background Technology

[0002] Chromosome assembly currently relies primarily on a hybrid approach combining third-generation sequencing (TGS), second-generation sequencing (NGS), and Hi-C sequencing. TGS reads are extremely long, with a single-molecule read typically reaching 100 kb, and are therefore commonly used to assemble contig-level genomes. However, because TGS has a high error rate (up to 10%), contig-level genomes assembled from TGS reads require error correction using the more accurate NGS reads. Even after error correction, the genome remains a contig-level genome, with the number of sequence fragments far exceeding the actual number of chromosomes. Therefore, Hi-C sequencing is used to cluster the contigs according to their corresponding chromosomes and ultimately assemble them into the appropriate chromosomes.

[0003] In addition to the cell nucleus, which contains chromosomal DNA sequences, organelles such as mitochondria and chloroplasts also contain their own organelle genomic DNA sequences. Organelle genomes are typically circular; each cell contains multiple organelles, and each organelle usually contains multiple copies of the organelle genome. Therefore, the copy number of organelle genomes within a cell is far greater than the copy number of chromosomes. Before genome sequencing, total DNA is extracted and a sequencing library is constructed and then sequenced. Since the extracted total DNA contains both organelle genomes and chromosome sequences, the final sequencing results contain both organelle genome sequences and chromosome sequences. Analysis during chromosome assembly revealed that, if organelle genome sequences are not removed from the genome sequence beforehand, they may be incorrectly assembled into chromosomes, leading to chromosome assembly errors. The inventors have proposed a new chromosome assembly strategy to address this problem and conducted further research, thus completing this invention.

Summary of the Invention

[0004] The purpose of this invention is to provide a chromosome assembly method that removes organelle genomic contamination sequences. This method utilizes a locally constructed organelle database to remove organelle genomic sequences before chromosome assembly using a comparative genomic approach, thereby reducing the impact of organelle genomic sequences on chromosome assembly and improving its accuracy.

[0005] Another objective of this invention is to provide a chromosome assembly apparatus for removing organelle genomic contamination sequences. This apparatus utilizes a locally constructed organelle database to remove organelle genomic sequences using a comparative genomic approach before chromosome assembly, thereby reducing the impact of organelle genomic sequences on the accuracy of chromosome assembly and improving the accuracy of chromosome assembly.

[0006] According to one embodiment of the present invention, a chromosome assembly method for removing organelle genomic contamination sequences is provided, which may include the following steps: Step S1, performing second-generation sequencing, third-generation sequencing, and Hi-C sequencing on a sample; Step S2, assembling the genome from the third-generation sequencing data; Step S3, correcting errors in the genome obtained in Step S2 using the second-generation sequencing and third-generation sequencing sequences, and removing haplotype sequences from the corrected genome; Step S4, aligning the genome obtained in Step S3 with an organelle database to identify and remove organelle genomic sequences; Step S5, aligning the Hi-C sequencing sequences to the genome obtained in Step S4; and Step S6, using the alignment results obtained in Step S5 to finally assemble the genome obtained in Step S4 into chromosomes.

[0007] As one implementation, the organelle database can be constructed through the following steps: Step S7, obtaining existing mitochondrial and chloroplast gene information; Step S8, based on the information obtained in Step S7, screening out coding genes present in more than 50% of species from the protein-coding genes of mitochondria and chloroplasts of different species as core coding genes, and then clustering each core coding gene according to 95% similarity, selecting one sequence from the groups containing more than two sequences as representative sequences to form the final core coding gene database; Step S9, based on the information obtained in Step S7, comparing the mitochondrial and chloroplast genomes, removing misassembled mitochondrial and chloroplast genomes, and the remaining sequences forming the organelle genome database; and Step S10, using the core coding gene database and the organelle genome database to form the organelle database.
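The core-gene screening in Step S8 can be illustrated with a minimal sketch. The function name `select_core_genes`, the input layout (a mapping from species to annotated gene names), and the toy data are assumptions made for illustration; they are not taken from the patent's own `create_organelle_db.py`.

```python
from collections import Counter


def select_core_genes(genes_by_species, min_fraction=0.5):
    """Return gene names present in more than `min_fraction` of species.

    `genes_by_species` maps a species name to the protein-coding genes
    annotated in its organelle genome (hypothetical input layout).
    """
    n_species = len(genes_by_species)
    if n_species == 0:
        return set()
    counts = Counter()
    for genes in genes_by_species.values():
        counts.update(set(genes))  # count each gene at most once per species
    return {gene for gene, c in counts.items() if c / n_species > min_fraction}


# Toy, made-up annotation table for four species.
toy_annotations = {
    "species_a": ["cox1", "cob", "nad1", "atp8"],
    "species_b": ["cox1", "cob", "nad1"],
    "species_c": ["cox1", "cob"],
    "species_d": ["cox1", "atp8"],
}
core_genes = select_core_genes(toy_annotations)
print(sorted(core_genes))
```

In this toy table, `nad1` and `atp8` occur in exactly half of the species and are therefore excluded, because the sketch reads "more than 50%" as a strict inequality.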

[0008] As one implementation, the core coding gene database may include a mitochondrial core coding gene database and a chloroplast core coding gene database.

[0009] As one implementation, the mitochondrial core coding gene database may include an animal mitochondrial core coding gene database, a plant mitochondrial core coding gene database, a fungal mitochondrial core coding gene database, and a protist mitochondrial core coding gene database.

[0010] As one implementation, step S4 may include the following steps: Step S41, aligning the genome obtained in step S2 with the organelle genome database, and extracting sequences with an alignment length greater than 1000 bp; Step S42, aligning the sequences extracted in step S41 with the core coding gene database, extracting the sequences that align to core coding genes, and finally identifying the organelle genome sequences within the genome sequence based on the aligned length and the number of aligned core coding genes; and Step S43, removing the sequences extracted in step S42 from the genome obtained in step S3.

[0011] According to another embodiment of the present invention, a chromosome assembly apparatus for removing organelle genomic contamination sequences is provided, which may include: an organelle database, including a core coding gene database and an organelle genome database; a sequencing module for performing second-generation sequencing, third-generation sequencing, and Hi-C sequencing on a sample; a genome assembly module for assembling a genome from third-generation sequencing data; an error correction module for correcting errors in the genome assembled by the genome assembly module using second-generation sequencing sequences and third-generation sequencing sequences, and removing haplotype sequences from the corrected genome; an organelle genome removal module for aligning the genome processed by the error correction module with the organelle database, identifying and removing organelle genome sequences; an alignment module for aligning Hi-C sequencing sequences to the genome processed by the organelle genome removal module; and a chromosome assembly module for finally assembling the genome processed by the organelle genome removal module into chromosomes using the alignment results obtained by the alignment module.

[0012] As one implementation, the core coding gene database is formed as follows: based on existing mitochondrial and chloroplast gene information, coding genes that are present in more than 50% of species are selected from the protein-coding genes of mitochondria and chloroplasts of different species as core coding genes; each core coding gene is then clustered according to 95% similarity, and one sequence is selected as the representative sequence from each group containing more than two sequences; the representative sequences together form the database. The organelle genome database is formed by comparing the mitochondrial and chloroplast genomes based on existing mitochondrial and chloroplast gene information, removing misassembled mitochondrial and chloroplast genomes, and combining the remaining sequences.

[0013] As one implementation, the core coding gene database may include a mitochondrial core coding gene database and a chloroplast core coding gene database.

[0014] As one implementation, the mitochondrial core coding gene database may include an animal mitochondrial core coding gene database, a plant mitochondrial core coding gene database, a fungal mitochondrial core coding gene database, and a protist mitochondrial core coding gene database.

[0015] In one implementation, the organelle genome removal module aligns the genome assembled by the genome assembly module with the organelle genome database, extracts sequences with an alignment length greater than 1000 bp, aligns the extracted sequences with the core coding gene database, extracts the sequences that align to core coding genes, and removes them from the genome processed by the error correction module.

[0016] The chromosome assembly method of the present invention for removing organelle genomic contamination sequences utilizes multiple methods for genome sequencing and constructs and utilizes an organelle database. Before assembling chromosomes, organelle genomic sequences mixed in with the genome sequence are removed in advance, thereby reducing the impact of organelle genomic sequences on the accuracy of chromosome assembly.

[0017] The chromosome assembly apparatus of the present invention for removing organelle genomic contamination sequences utilizes multiple methods for genome sequencing and constructs and utilizes an organelle database to remove organelle genomic sequences mixed in with the genome sequence before assembling chromosomes, thereby reducing the impact of organelle genomic sequences on the accuracy of chromosome assembly.

Attached Figure Description

[0018] Figure 1 is a schematic flowchart of a chromosome assembly method for removing organelle genomic contamination sequences according to one embodiment of the present invention.

[0019] Figure 2 is a flowchart for constructing a local organelle database and identifying and removing organelle genomes according to one embodiment of the present invention.

[0020] Figure 3 is a chart showing the names of the pufferfish genomic fragments identified according to an embodiment of the present invention, and the total length of each fragment aligned to organelle genomes.

[0021] Figure 4 is a graph showing the alignment results of the tig00001732_pilon_pilon fragment with the core coding genes according to an embodiment of the present invention.

[0022] Figure 5 is a graph comparing the sequencing depth of the mitochondrial genome fragment tig00001732_pilon_pilon identified in pufferfish according to an embodiment of the present invention with the average sequencing depth of the genome.

[0023] Figure 6 is a graph illustrating the chromosome misassembly that results in an embodiment of the present invention when the tig00001732_pilon_pilon fragment is not removed.

[0024] Figure 7 is a schematic diagram of a chromosome assembly apparatus for removing organelle genomic contamination sequences according to one embodiment of the present invention.

[0025] Figure Labels

[0026] 1: Chromosome assembly device for removing organelle genomic contamination sequences; 11: Organelle database; 12: Sequencing module; 13: Genome assembly module; 14: Error correction module; 15: Organelle genome removal module; 16: Alignment module; 17: Chromosome assembly module.

Detailed Implementation

[0027] The present invention will now be described in detail with reference to the accompanying drawings. However, the present invention is not limited to the drawings and the following description. Those skilled in the art can make various modifications and changes within the scope of the technical concept of the present invention, and all such modifications and changes fall within the scope of the present invention.

[0028] Furthermore, descriptions of common knowledge in the art are omitted where such descriptions might obscure the technical aspects of the present invention. Content not described herein is content that can be deduced by those skilled in the art.

[0029] Figure 1 is a schematic flowchart of a chromosome assembly method for removing organelle genomic contamination sequences according to one embodiment of the present invention. Figure 2 is a flowchart illustrating the construction of a local organelle database according to one embodiment of the present invention.

[0030] As shown in Figures 1 and 2, the present invention provides a chromosome assembly method for removing organelle genomic contamination sequences, which may include the following steps: Step S1, performing second-generation sequencing, third-generation sequencing, and Hi-C sequencing on the sample; Step S2, assembling the genome from the third-generation sequencing data; Step S3, correcting errors in the genome obtained in Step S2 using the second-generation sequencing and third-generation sequencing sequences, and removing haplotype sequences from the corrected genome; Step S4, aligning the genome obtained in Step S3 with an organelle database to identify and remove organelle genomic sequences; Step S5, aligning the Hi-C sequencing sequences to the genome obtained in Step S4; and Step S6, using the alignment results obtained in Step S5 to finally assemble the genome obtained in Step S4 into chromosomes.

[0031] In one implementation, in step S2, Canu software can be used to assemble the genome from the third-generation sequencing data; in step S3, Pilon and Racon software can be used to correct errors in the genome obtained in step S2 using second-generation and third-generation sequencing sequences, respectively, and purge_dups software can be used to remove haplotype sequences from the corrected genome; in step S5, Hi-C sequencing sequences can be aligned to the genome obtained in step S4 using HiC-Pro software; in step S6, ALLHiC software can be used to finally assemble the genome obtained in step S4 into chromosomes using the alignment results obtained in step S5. These steps can also be performed using other software commonly used in the art, and the present invention is not limited thereto.

[0032] Specifically, as an implementation, the organelle database can be constructed through the following steps: Step S7, obtaining existing mitochondrial and chloroplast gene information; Step S8, based on the information obtained in Step S7, screening out coding genes present in more than 50% of species from the protein-coding genes of mitochondria and chloroplasts of different species as core coding genes, then clustering each core coding gene according to 95% similarity and selecting one sequence from the groups containing more than two sequences as representative sequences to form the final core coding gene database; Step S9, based on the information obtained in Step S7, comparing the mitochondrial and chloroplast genomes, removing misassembled mitochondrial and chloroplast genomes, and the remaining sequences to form an organelle genome database; and Step S10, using the core coding gene database and the organelle genome database to form the organelle database.
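The clustering part of Step S8 can be sketched as a greedy, representative-based clustering in the spirit of CD-HIT. This is a simplified stand-in, not the CD-HIT algorithm itself: identity is approximated with Python's `difflib.SequenceMatcher` ratio rather than a real alignment, and the names `identity`, `greedy_cluster`, and `representatives`, as well as the toy sequences, are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Approximate sequence identity in [0, 1] (stand-in for a real aligner)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.95):
    """Greedy clustering: longest sequences first, each sequence joins the
    first cluster whose representative is at least `threshold` identical."""
    clusters = []  # list of (representative_name, [member_names])
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        for rep, members in clusters:
            if identity(seqs[rep], seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return clusters


def representatives(clusters, min_members=3):
    """Keep one representative per cluster with more than two members."""
    return [rep for rep, members in clusters if len(members) >= min_members]


# Toy, made-up sequences: three near-identical variants plus one unrelated.
base = "ATGGCTAAGCTTGACCGTTAGCATCGGATCCATGCAAGTC"
toy_seqs = {
    "cox1_sp1": base,
    "cox1_sp2": base[:5] + "A" + base[6:],    # one substitution
    "cox1_sp3": base[:20] + "T" + base[21:],  # one substitution
    "nad1_sp1": "TTTTTTTTTTCCCCCCCCCCGGGGGGGGGGAAAAAAAAAA",
}
toy_clusters = greedy_cluster(toy_seqs)
toy_reps = representatives(toy_clusters)
print(toy_reps)
```

Dropping clusters with one or two members follows the text's requirement that representatives come from groups containing more than two sequences; a production run would use a dedicated clustering tool rather than this quadratic-time sketch.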

[0033] As shown in Table 1, based on the existing mitochondrial and chloroplast gene information obtained in step S7, the numbers of organelle genes present in 98%, 90%, 80%, 70%, 60%, 50%, 40%, and 30% of species were counted. The counts showed that the number of core genes tends to stabilize from the 50% threshold onward. Therefore, in step S8, coding genes present in more than 50% of species were selected from the protein-coding genes of mitochondria and chloroplasts of different species as core coding genes. Then, each core coding gene was clustered using 95% similarity, a default value in the commonly used CD-HIT software. To improve the accuracy of the core gene sequences, one sequence was selected as the representative sequence from each group containing more than two sequences, thus forming the final core coding gene database.
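The threshold survey described above can be sketched as a simple sweep over presence thresholds. The function name `core_gene_counts`, the "at least X% of species" reading of each threshold, and the toy presence data are assumptions; the actual counts reported in Table 1 are not reproduced here.

```python
from collections import Counter


def core_gene_counts(genes_by_species, percents):
    """For each percentage, count genes present in at least that share of species."""
    n_species = len(genes_by_species)
    presence = Counter(
        gene for genes in genes_by_species.values() for gene in set(genes)
    )
    # Integer comparison avoids floating-point edge cases at exact thresholds.
    return {
        pct: sum(1 for c in presence.values() if c * 100 >= pct * n_species)
        for pct in percents
    }


# Toy, made-up data: ten species; each gene is present in the first k species.
species_with_gene = {"g1": 10, "g2": 9, "g3": 6, "g4": 5, "g5": 2}
toy_table = {
    f"s{i}": [g for g, k in species_with_gene.items() if i < k]
    for i in range(10)
}
thresholds = [98, 90, 80, 70, 60, 50, 40, 30]
sweep = core_gene_counts(toy_table, thresholds)
for pct in thresholds:
    print(pct, sweep[pct])
```

In the toy data the count stops growing at the 50% threshold, which mirrors the kind of plateau the text uses to justify the 50% cut-off.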

[0034] Table 1:

[0035]-[0036] (Table 1 is not reproduced in this text.)

[0037] Additionally, as an implementation, the existing mitochondrial and chloroplast genetic information can be obtained from the National Center for Biotechnology Information (NCBI), or other commonly used database resources in the field can be used; the present invention is not limited thereto. However, the organelle genome sequences in NCBI lack manual verification and may contain misassembled organelle genomes. Therefore, the step of removing misassembled mitochondrial and chloroplast genomes in step S9 is essential.

[0038] Specifically, as one implementation, in step S8, the code for constructing the organelle database, create_organelle_db.py, is shown in Table 2.

[0039] Table 2:

[0040]-[0050] (Table 2 is not reproduced in this text.)

[0051] On the other hand, as an implementation, the core coding gene database may include a mitochondrial core coding gene database and a chloroplast core coding gene database.

[0052] The mitochondrial core coding gene database may include an animal mitochondrial core coding gene database, a plant mitochondrial core coding gene database, a fungal mitochondrial core coding gene database, and a protist mitochondrial core coding gene database.

[0053] In addition, as an implementation, step S4 may include the following steps: step S41, aligning the genome obtained in step S2 with the organelle genome database, and extracting sequences with an alignment length greater than 1000 bp; step S42, aligning the sequences extracted in step S41 with the core coding gene database, extracting the sequences that align to core coding genes, and finally identifying the organelle genome sequences within the genome sequence based on the aligned length and the number of aligned core coding genes; and step S43, removing the sequences extracted in step S42 from the genome obtained in step S3.
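Steps S41 and S42 can be sketched as a two-stage filter over alignment hits. This is an illustration under stated assumptions rather than the patent's `organelle_filter.py`: the hit tuples, the half-open coordinates, the use of the merged (non-overlapping) aligned length per contig, and the `min_core_genes` parameter are all hypothetical choices.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total length covered by half-open (start, end) intervals, overlaps merged."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def identify_organelle_contigs(genome_hits, core_gene_hits,
                               min_aln_len=1000, min_core_genes=1):
    """S41: keep contigs aligned to organelle genomes over more than
    `min_aln_len` bp. S42: among those, keep contigs hitting core genes."""
    by_contig = defaultdict(list)
    for contig, start, end in genome_hits:
        by_contig[contig].append((start, end))
    candidates = {
        c for c, ivs in by_contig.items() if merged_length(ivs) > min_aln_len
    }
    genes = defaultdict(set)
    for contig, gene in core_gene_hits:
        if contig in candidates:
            genes[contig].add(gene)
    return {c: sorted(g) for c, g in genes.items() if len(g) >= min_core_genes}


# Toy, made-up hits: (contig, start, end) against the organelle genome database.
toy_genome_hits = [
    ("contigA", 0, 600), ("contigA", 500, 1200),   # 1200 bp after merging
    ("contigB", 0, 400), ("contigB", 1000, 1300),  # 700 bp, below the cut-off
    ("contigC", 0, 5000),                          # long hit, no core genes
]
toy_core_hits = [("contigA", "cox1"), ("contigA", "cob"), ("contigB", "cox1")]
flagged = identify_organelle_contigs(toy_genome_hits, toy_core_hits)
print(flagged)
```

Requiring both a long organelle-genome alignment and core-gene hits is what distinguishes a genuine organelle contig (`contigA`) from a nuclear contig that merely carries an organelle-like insert (`contigC`).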

[0054] In genome assembly, contigs shorter than 1000 bp are generally considered to be poorly assembled sequences. Therefore, in step S41, sequences longer than 1000 bp are extracted as potential organelle genome sequences.

[0055] Specifically, as one implementation, in step S4, the core code organelle_filter.py is shown in Table 3.

[0056] Table 3:

[0057]-[0063] (Table 3 is not reproduced in this text.)

[0064] In another aspect, the present invention provides a chromosome assembly apparatus 1 for removing organelle genomic contamination sequences. As shown in Figure 7, the chromosome assembly apparatus 1 for removing organelle genomic contamination sequences may include: an organelle database 11, including a core coding gene database and an organelle genome database; a sequencing module 12 for performing second-generation sequencing, third-generation sequencing, and Hi-C sequencing on the sample; a genome assembly module 13 for assembling the genome from the third-generation sequencing data; an error correction module 14 for correcting errors in the genome assembled by the genome assembly module 13 using the second-generation sequencing and third-generation sequencing sequences, and removing haplotype sequences from the corrected genome; an organelle genome removal module 15 for aligning the genome processed by the error correction module 14 with the organelle database 11, identifying and removing organelle genome sequences; an alignment module 16 for aligning the Hi-C sequencing sequences to the genome processed by the organelle genome removal module 15; and a chromosome assembly module 17 for finally assembling the genome processed by the organelle genome removal module 15 into chromosomes using the alignment results obtained through the alignment module 16.
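The ordering of the modules can be illustrated with a minimal sketch in which each module is a function from genome to genome. All names here (`Genome`, `Stage`, `make_organelle_removal`, `make_recording_stage`, `run_pipeline`) are hypothetical, and the recording stages are placeholders that only log what they receive; they do not perform error correction, alignment, or scaffolding.

```python
from typing import Callable, Dict, List, Set, Tuple

Genome = Dict[str, str]             # contig name -> sequence
Stage = Callable[[Genome], Genome]  # one processing module


def make_organelle_removal(organelle_contigs: Set[str]) -> Stage:
    """Stand-in for module 15: drop contigs flagged as organelle sequences."""
    def stage(genome: Genome) -> Genome:
        return {n: s for n, s in genome.items() if n not in organelle_contigs}
    return stage


def make_recording_stage(log: List[Tuple[str, List[str]]], label: str) -> Stage:
    """Placeholder module that only records which contigs it was given."""
    def stage(genome: Genome) -> Genome:
        log.append((label, sorted(genome)))
        return genome
    return stage


def run_pipeline(genome: Genome, stages: List[Stage]) -> Genome:
    """Apply the stages in order, feeding each output into the next stage."""
    for stage in stages:
        genome = stage(genome)
    return genome


log: List[Tuple[str, List[str]]] = []
toy_genome = {"contig1": "ACGT", "contig2": "GGCC", "mito_contig": "ATAT"}
stages = [
    make_recording_stage(log, "error_correction"),     # module 14 (placeholder)
    make_organelle_removal({"mito_contig"}),           # module 15
    make_recording_stage(log, "hic_alignment"),        # module 16 (placeholder)
    make_recording_stage(log, "chromosome_assembly"),  # module 17 (placeholder)
]
result = run_pipeline(toy_genome, stages)
for label, contigs in log:
    print(label, contigs)
```

The point of the sketch is the ordering: because the removal stage sits before the Hi-C alignment and chromosome assembly stages, those later stages never see the organelle contig.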

[0065] In one implementation, the genome assembly module 13 can use Canu software to assemble the genome from third-generation sequencing data; the error correction module 14 can use Pilon and Racon software to correct errors in the genome assembled by the genome assembly module 13 using second-generation sequencing and third-generation sequencing sequences, respectively, and use purge_dups software to remove haplotype sequences from the corrected genome; the alignment module 16 can use HiC-Pro software to align Hi-C sequencing sequences to the genome processed by the organelle genome removal module 15; the chromosome assembly module 17 can use ALLHiC software with the alignment results obtained by the alignment module 16 to finally assemble the genome processed by the organelle genome removal module 15 into chromosomes. The above operations can also be performed using other software commonly used in the art, and the present invention is not limited thereto.

[0066] Furthermore, the description of the organelle database 11 in the chromosome assembly method for removing organelle genomic contamination sequences of the present invention can be applied in the same manner to the chromosome assembly apparatus 1 for removing organelle genomic contamination sequences of the present invention; therefore, this part of the description is omitted.

[0067] In addition, as an implementation, the organelle genome removal module 15 aligns the genome assembled by the genome assembly module 13 with the organelle genome database, extracts the sequences with an alignment length greater than 1000 bp, aligns the extracted sequences with the core coding gene database, extracts the sequences that align to core coding genes, and removes them from the genome processed by the error correction module 14.

[0068] Example

[0069] The present invention is described in more detail below by way of an embodiment, with reference to Figures 1 to 7.

[0070] As shown in Figures 1 and 2, the embodiment may include the following steps:

[0071] The second-generation sequencing, third-generation sequencing, and Hi-C sequencing data of pufferfish were downloaded from NCBI. The second-generation sequencing data is 56Gb, the third-generation sequencing data is 55Gb, and the Hi-C sequencing data is 55Gb.

[0072] The third-generation sequencing data was assembled using Canu software to obtain a preliminary assembled genome g1. The command was: `nohup canu -p Sample -d canu_assemble_result genomeSize=348m -pacbio TGS.fastq &`. After assembly by Canu, a genome of 445 Mb was obtained, consisting of 4957 genome fragments.

[0073] The Racon software was used to correct errors in the g1 genome using TGS sequences. The command was: `nohup minimap2 -ax map-pb -r2k genome1.fasta TGS.fastq | samtools sort | samtools view > dbg.srp.minimap2.sam && racon TGS.fastq dbg.srp.minimap2.sam TGS_sequence_assembled_draft_genome.fasta > genome2.fasta &`.

[0074] The Pilon software was used to correct errors in the genome2.fasta file using the second-generation sequencing reads, ultimately obtaining the corrected genome3.fasta file. The command was: `bwa index racon_correct_1.fasta && bwa mem racon_correct_1.fasta ngs_1.fastq ngs_2.fastq | samtools view -bS -F 12 | samtools sort > racon_bwa.sort.bam && java -Xmx250G -jar pilon-1.24.jar --genome genome2.fasta --frags racon_bwa.sort.bam --output genome3.fasta --changes`.

[0075] The purge_dups software was used to remove haplotype sequences from the genome based on sequencing depth. The command was: `minimap2 -x map-pb genome3.fasta TGS.fastq | gzip -c - > aligned.paf.gz && pbcstat aligned.paf.gz && calcuts PB.stat > cutoffs 2> calcults.log && split_fa genome3.fasta > racon.pilon2.split && minimap2 -x asm5 -DP racon.pilon2.split racon.pilon2.split | gzip -c - > racon.pilon2.split.self.paf.gz && purge_dups -2 -T cutoffs -c PB.base.cov racon.pilon2.split.self.paf.gz > dups.bed 2> purge_dups.log && get_seqs -e dups.bed genome3.fasta && cut -d " " -f 1 purged.fa > racon.pilon2.purged.fasta`. This ultimately yielded a 340 Mb genome file, genome4.fasta, containing 1970 genome fragments.
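The effect of this step can be made concrete with a short arithmetic sketch that uses only figures stated in the text: the 348 Mb genome size passed to Canu, the 445 Mb / 4957-fragment assembly before purging, and the 340 Mb / 1970-fragment assembly after purging. The helper name `percent_change` is hypothetical.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


expected_mb = 348                        # genomeSize given to Canu
raw_mb, raw_fragments = 445, 4957        # assembly before purge_dups
purged_mb, purged_fragments = 340, 1970  # assembly after purge_dups

excess_over_expected = percent_change(expected_mb, raw_mb)
size_change = percent_change(raw_mb, purged_mb)
fragment_change = percent_change(raw_fragments, purged_fragments)

print(f"raw assembly vs expected size: {excess_over_expected:+.1f}%")
print(f"assembly size after purging:   {size_change:+.1f}%")
print(f"fragment count after purging:  {fragment_change:+.1f}%")
```

The raw assembly is roughly 28% larger than the expected genome size, which is consistent with redundant haplotype sequences being present; after purging, the assembly is about 24% smaller, has about 60% fewer fragments, and is close to the expected size.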

[0076] The `organelle_filter.py` script was used to align the genome4.fasta file with the organelle database and to identify and remove organelle genomes, resulting in the genome5.fasta file after organelle genome removal. The command was: `python organelle_filter.py organelles.csv 0.90 true 1000 2 genome4.fasta output`. Finally, a mitochondrial genome sequence (tig00001732_pilon_pilon) was identified in the genome4.fasta file. This sequence aligned to mitochondrial genome sequences over 32,384 bases and matched 13 mitochondrial core coding genes. The tig00001732_pilon_pilon sequence was removed from the genome4.fasta file to obtain the new genome5.fasta file.
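The final removal step (dropping an identified contig from a FASTA file) can be sketched with a small pure-Python FASTA reader. This is an illustrative sketch, not the patent's script: the helper names `read_fasta`, `drop_contigs`, and `to_fasta`, and the toy sequences, are made up, and only the contig name tig00001732_pilon_pilon comes from the text.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    The name is the first whitespace-delimited token of the header line.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_contigs(records, to_remove):
    """Return the records whose names are not in `to_remove`."""
    return [(name, seq) for name, seq in records if name not in to_remove]


def to_fasta(records, width=60):
    """Serialize records back to FASTA text with fixed-width sequence lines."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy, made-up stand-in for genome4.fasta.
toy_fasta = """>contig1 len=8
ACGTACGT
>tig00001732_pilon_pilon
ATATATAT
ATAT
>contig2
GGCCGGCC
"""
records = list(read_fasta(toy_fasta.splitlines()))
kept = drop_contigs(records, {"tig00001732_pilon_pilon"})
print(to_fasta(kept), end="")
```

For real assemblies, an established FASTA library or indexed tool would normally be preferable to hand-rolled parsing; the sketch only shows the logic of filtering by contig name.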

[0077] HiC-Pro software was used to align the Hi-C sequencing reads to the genome5.fasta file to obtain the alignment file genome5.sam. The command was: `python hicpro_align.py hic_1.fastq hic_2.fastq MBOI 60`.

[0078] The genome5.fasta genome was assembled into chromosomes using ALLHiC software based on the HiC-Pro alignment results in genome5.sam. The command was: `python allhic_assemble_TGS.py MBOI 22 60`.

[0079] Specifically, aligning the pufferfish genome against the organelle genome database revealed 54 genome fragments that matched organelle genomes over more than 1000 bp. Figure 3 shows the names of the 54 identified genomic fragments and the total length of each fragment aligned to organelle genomes.

[0080] The 54 sequence fragments were then compared with the organelle core gene database, and the fragments that matched core coding genes were extracted. The tig00001732_pilon_pilon fragment was found to contain 13 core coding genes. Therefore, the tig00001732_pilon_pilon fragment belongs to the mitochondrial genome of the pufferfish. Figure 4 shows the 13 core coding genes that aligned to the tig00001732_pilon_pilon fragment.

[0081] Figure 5 compares the sequencing depth of the mitochondrial genome fragment tig00001732_pilon_pilon identified in pufferfish with the average sequencing depth of the genome. As shown in Figure 5, the sequencing depth of the organelle genome is significantly higher than the average sequencing depth of the genome. This is consistent with the fact that organelle genome copy numbers are greater than those of chromosomes. Therefore, it can be further confirmed that the tig00001732_pilon_pilon fragment is the mitochondrial genome of the pufferfish.
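The depth comparison can be sketched as a ratio test against a genome-wide baseline. This sketch makes two assumptions that are not in the text: it uses the median contig depth as the baseline instead of the average (the median is less distorted by the high-depth contig itself), and it uses an arbitrary `min_ratio` cut-off. The depth values are made-up toy numbers, not the values plotted in Figure 5.

```python
from statistics import median


def depth_outliers(depth_by_contig, min_ratio=5.0):
    """Return {contig: depth / baseline} for contigs far above the baseline.

    The baseline is the median depth across contigs (an assumption; the
    text compares against the average sequencing depth).
    """
    baseline = median(depth_by_contig.values())
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    ratios = {c: d / baseline for c, d in depth_by_contig.items()}
    return {c: r for c, r in ratios.items() if r >= min_ratio}


# Toy, made-up mean depths per contig.
toy_depths = {
    "contig1": 50.0,
    "contig2": 48.0,
    "contig3": 52.0,
    "tig_mito": 2400.0,
}
high_depth = depth_outliers(toy_depths)
print(high_depth)
```

Depth alone is only supporting evidence, since collapsed repeats can also show elevated depth; in the text it is used to corroborate an identification already made from alignment length and core-gene content.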

[0082] If the tig00001732_pilon_pilon fragment is not removed and chromosome assembly is performed using all genome fragments, then, as shown in Figure 6, the tig00001732_pilon_pilon fragment is ultimately incorrectly assembled onto chromosome 7. Therefore, it can be confirmed that removing organelle genomic contamination sequences is a very important step before assembling chromosomes.

[0083] After the above steps, 22 pufferfish chromosomes were finally obtained, with a total genome size of 340Mb. Comparing the 22 pufferfish chromosomes with the identified mitochondrial genome (tig00001732_pilon_pilon) confirmed that the mitochondrial sequences were no longer present in the 22 pufferfish chromosomes.
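A final check of this kind can be approximated with a k-mer comparison. This is a simplified stand-in for the sequence comparison described in the text, which was presumably alignment-based: the function names, the choice of `k`, and the toy sequences are all assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in `seq` (upper-cased); empty if seq is shorter than k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query, target, k=21):
    """Fraction of query k-mers found in the target on either strand."""
    q = kmers(query, k)
    if not q:
        return 0.0
    t = kmers(target, k) | kmers(revcomp(target), k)
    return len(q & t) / len(q)


# Toy, made-up sequences; k is reduced to suit their short length.
mito = "ATGGCTAAGCTTGACCGTTAGCATCGGATC"
clean_chromosome = "T" * 40
contaminated_chromosome = "T" * 10 + revcomp(mito) + "T" * 10

clean_fraction = shared_kmer_fraction(mito, clean_chromosome, k=11)
contaminated_fraction = shared_kmer_fraction(mito, contaminated_chromosome, k=11)
print(clean_fraction, contaminated_fraction)
```

Checking both strands matters because a contaminating sequence may have been scaffolded in reverse orientation, as in the toy contaminated chromosome.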

[0084] In summary, the chromosome assembly method and apparatus for removing organelle genomic contamination sequences of the present invention utilizes multiple methods for genome sequencing and constructs and utilizes an organelle database to remove organelle genomic sequences mixed in with the genome sequence before chromosome assembly, thereby reducing the impact of organelle genomic sequences on the accuracy of chromosome assembly.

[0085] The above provides a detailed description of one embodiment of the present invention, but the present invention is not limited thereto. The scope of the present invention is defined only by the appended claims, and various modifications and variations of the present invention fall within the scope of the present invention.

Claims

1. A chromosome assembly method for removing organelle genomic contamination sequences, characterized in that it comprises the following steps:
Step S1: performing second-generation sequencing, third-generation sequencing, and Hi-C sequencing on a sample;
Step S2: assembling a genome from the third-generation sequencing data;
Step S3: correcting the genome obtained in step S2 using the second-generation sequencing reads and the third-generation sequencing reads, respectively, and removing haploid sequences from the corrected genome;
Step S4: comparing the genome obtained in step S3 against an organelle database to identify and remove organelle genome sequences;
Step S5: aligning the Hi-C sequencing reads to the genome obtained in step S4; and
Step S6: using the alignment results obtained in step S5, assembling the genome obtained in step S4 into chromosomes;
wherein the organelle database is constructed through the following steps:
Step S7: obtaining existing gene information for mitochondria and chloroplasts;
Step S8: based on the information obtained in step S7, selecting, from the protein-coding genes of mitochondria and chloroplasts of different species, the coding genes present in more than 50% of species as core coding genes; then clustering each core coding gene at 95% similarity and selecting one sequence from each group containing more than two sequences as the representative sequence, to form the final core coding gene database;
Step S9: based on the information obtained in step S7, comparing the mitochondrial and chloroplast genomes and removing misassembled mitochondrial and chloroplast genomes, the remaining sequences forming an organelle genome database; and
Step S10: constructing the organelle database from the core coding gene database and the organelle genome database;
and wherein step S4 comprises the following steps:
Step S41: comparing the genome obtained in step S2 with the organelle genome database and extracting sequences with an alignment length greater than 1000 bp;
Step S42: comparing the sequences extracted in step S41 with the core coding gene database and extracting the sequences that align to the core coding gene database, thereby identifying the organelle genome sequences within the genome sequence; and
Step S43: removing the sequences extracted in step S42 from the genome obtained in step S3.
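
For illustration only, the two-stage identification of steps S41 to S43 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not part of the claimed subject matter: the alignments themselves are assumed to have been produced beforehand by some alignment tool, the 1000 bp threshold is assumed to apply to each individual alignment, and removal is assumed to act on whole contigs. The function names and the input layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

MIN_ALIGN_LEN = 1000  # bp threshold stated in step S41


def candidates_from_genome_hits(
    genome_hits: Iterable[Tuple[str, int]], min_len: int = MIN_ALIGN_LEN
) -> Set[str]:
    """Step S41 (sketch): keep contigs that have an alignment to the
    organelle genome database longer than min_len.

    genome_hits is a hypothetical list of (contig_id, alignment_length).
    """
    return {contig for contig, aln_len in genome_hits if aln_len > min_len}


def confirm_with_core_genes(
    candidates: Set[str], core_gene_hits: Iterable[str]
) -> Set[str]:
    """Step S42 (sketch): keep only candidates that also align to the
    core coding gene database.

    core_gene_hits is a hypothetical list of contig ids with such a hit.
    """
    return candidates & set(core_gene_hits)


def remove_organelle_contigs(
    assembly: Dict[str, str], organelle_ids: Set[str]
) -> Dict[str, str]:
    """Step S43 (sketch): drop the identified contigs from the corrected
    assembly. Whole-contig removal is an assumption of this sketch."""
    return {name: seq for name, seq in assembly.items() if name not in organelle_ids}
```

Requiring both conditions means that a contig matching the organelle genome database but carrying no core coding gene is retained in the assembly, as is a contig whose only match is a short one.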

2. The chromosome assembly method for removing organelle genomic contamination sequences according to claim 1, characterized in that the core coding gene database includes a mitochondrial core coding gene database and a chloroplast core coding gene database.

3. The chromosome assembly method for removing organelle genomic contamination sequences according to claim 2, characterized in that the mitochondrial core coding gene database includes an animal mitochondrial core coding gene database, a plant mitochondrial core coding gene database, a fungal mitochondrial core coding gene database, and a protist mitochondrial core coding gene database.

4. A chromosome assembly apparatus for removing organelle genomic contamination sequences, characterized in that the apparatus is used to implement the chromosome assembly method for removing organelle genomic contamination sequences according to any one of claims 1 to 3, the apparatus comprising:
an organelle database, including a core coding gene database and an organelle genome database;
a sequencing module, which performs second-generation sequencing, third-generation sequencing, and Hi-C sequencing on a sample;
a genome assembly module, which assembles a genome from the third-generation sequencing data;
an error correction module, which uses the second-generation sequencing reads and the third-generation sequencing reads to correct the genome assembled by the genome assembly module, and removes haploid sequences from the corrected genome;
an organelle genome removal module, which compares the genome processed by the error correction module against the organelle database to identify and remove organelle genome sequences;
an alignment module, which aligns the Hi-C sequencing reads to the genome processed by the organelle genome removal module; and
a chromosome assembly module, which uses the alignment results obtained by the alignment module to assemble the genome processed by the organelle genome removal module into chromosomes;
wherein the core coding gene database is constructed by screening, from mitochondrial and chloroplast gene information, the protein-coding genes present in more than 50% of species, clustering each core coding gene at 95% similarity, and selecting one sequence as the representative sequence from each group containing two or more sequences;
the organelle genome database is composed of the sequences that remain after comparing the mitochondrial and chloroplast genomes based on existing gene information and removing misassembled mitochondrial and chloroplast genomes; and
the organelle genome removal module compares the genome assembled by the genome assembly module with the organelle genome database, extracts sequences with an alignment length greater than 1000 bp, compares the extracted sequences with the core coding gene database, extracts the sequences that align to the core coding gene database, and removes them from the genome processed by the error correction module.
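
For illustration only, the construction of the core coding gene database can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not part of the claimed subject matter: gene presence per species is assumed to be available as a simple mapping, the 95% similarity clustering is assumed to have been done beforehand by a separate clustering tool, and the longest member of each cluster is taken as the representative, a rule the claims do not specify. The minimum cluster size is left as a parameter; its default of 2 follows the "two or more" wording of this claim. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def select_core_genes(
    genes_by_species: Dict[str, Set[str]], min_fraction: float = 0.5
) -> Set[str]:
    """Keep genes present in strictly more than min_fraction of species.

    genes_by_species is a hypothetical mapping from species name to the
    set of protein-coding gene names annotated in its organelle genome.
    """
    n_species = len(genes_by_species)
    counts = Counter(g for genes in genes_by_species.values() for g in genes)
    return {g for g, c in counts.items() if c / n_species > min_fraction}


def pick_representatives(
    clusters: List[List[str]], min_cluster_size: int = 2
) -> List[str]:
    """Pick one representative sequence per sufficiently large cluster.

    clusters is a hypothetical list of sequence groups already clustered
    at 95% similarity. Choosing the longest member is an assumption.
    """
    return [
        max(cluster, key=len)
        for cluster in clusters
        if len(cluster) >= min_cluster_size
    ]
```

With four species, a gene annotated in exactly two of them sits at 50% and is therefore not selected, since the threshold is "more than 50%".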

5. The chromosome assembly apparatus for removing organelle genomic contamination sequences according to claim 4, characterized in that the core coding gene database includes a mitochondrial core coding gene database and a chloroplast core coding gene database.

6. The chromosome assembly apparatus for removing organelle genomic contamination sequences according to claim 5, characterized in that the mitochondrial core coding gene database includes an animal mitochondrial core coding gene database, a plant mitochondrial core coding gene database, a fungal mitochondrial core coding gene database, and a protist mitochondrial core coding gene database.