Method for determining position loaded with nucleic acid library to be sequenced on sequencing chip, and use thereof
By sequencing the primer binding sequences of the nucleic acid library to be tested on a sequencing chip, incorporating labeling groups and detecting signals, the problem of low sequencing accuracy in existing technologies is solved, enabling early accurate localization and improved sequencing quality.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- MGI TECH CO LTD
- Filing Date
- 2024-12-11
- Publication Date
- 2026-06-18
Abstract
Description
Method for determining the location on a sequencing chip loaded with a nucleic acid library to be tested, and application thereof
Technical Field
[0001] This application belongs to the field of biotechnology, specifically relating to a method for determining the location of a sequencing chip loaded with a nucleic acid library to be tested, and its application.
Background Technology
[0002] Current technologies for identifying whether DNA nanoballs (DNBs) or DNA clusters are loaded at each modification site on a sequencing chip typically rely on copy number correction algorithms. However, such a correction algorithm primarily analyzes and judges information from the first ten cycles. Within those ten cycles, the reported sequencing quality becomes more accurate as the cycle number increases, and only from cycle 11 onwards does it approach the true quality. Because of the inherent randomness of the sequences, the cycles sequenced without copy number correction can have lower sequencing accuracy.
Summary of the Invention
[0003] This application aims to at least partially address one of the technical problems in the related art. To this end, one objective of this application is to provide a method for accurately determining the location of the sequencing chip where a nucleic acid library to be tested is loaded.
[0004] Another objective of this application is to propose a sequencing correction method for different positions on a sequencing chip to improve sequencing accuracy.
[0005] Specifically, this application provides the following technical solution:
[0006] In a first aspect, this application proposes a method for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. According to an embodiment of this application, the method includes the following steps: (a) sequencing at least one known nucleotide at the 5' end of the primer-binding sequence in the nucleic acid library to be tested via a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in the extension reaction, and signal detection is performed on the sequencing chip in each cycle of the extension reaction; (b) determining the location of the sequencing chip loaded with the nucleic acid library to be tested based on the detected signal, wherein at least one detected signal indicates the location where the nucleic acid library to be tested is loaded.
[0007] In some examples of this application, the aforementioned method, by sequencing nucleotides of known type, can determine early on whether a sequencing site has been successfully loaded with the nucleic acid library to be tested, effectively saving resources in later sequencing analysis. It can also avoid low sequencing accuracy caused by insufficient information in the first few sequencing cycles, effectively improving the sequencing accuracy of the first few nucleotides at the 5' end of the nucleic acid library to be tested, thereby improving the overall sequencing quality and preventing errors in site determination due to poor sequencing quality.
[0008] In a second aspect, this application proposes a sequencing correction method for different positions on a sequencing chip. According to an embodiment of this application, the method includes: (i) sequencing at least one known nucleotide at the 5' end of a primer-binding sequence in multiple nucleic acid libraries at different positions on the sequencing chip via a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in the extension reaction; (ii) obtaining the signal intensity value of the primer-incorporated nucleotide in a first preset number of sequencing cycles; (iii) performing a first classification on the first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle; and (iv) determining the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
[0009] In some examples of this application, the aforementioned method improves signal consistency during gene sequencing, effectively reduces the error rate caused by copy number inconsistency, optimizes signal distribution, and improves the accuracy of base identification.
[0010] In a third aspect, this application proposes an apparatus for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. According to an embodiment of this application, the apparatus includes: a sequencing unit for sequencing at least one known nucleotide at the 5' end of a primer-binding sequence in the nucleic acid library to be tested via a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal during the extension reaction, and signal detection is performed on the sequencing chip in each cycle of the extension reaction; and a judgment unit for determining the location of the sequencing chip loaded with the nucleic acid library to be tested based on the detected signal, wherein at least one detected signal indicates the location where the nucleic acid library to be tested is loaded.
[0011] Those skilled in the art will understand that the features and advantages described above for the method for determining the location of the sequencing chip loaded with the nucleic acid library to be tested also apply to the above apparatus, and will not be repeated here.
[0012] In a fourth aspect, this application proposes a sequencing correction system for different positions on a sequencing chip. According to an embodiment of this application, the system includes: a sequencing module for sequencing at least one known nucleotide at the 5' end of a primer-binding sequence in multiple nucleic acid libraries at different positions on the sequencing chip via a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal during the extension reaction; a signal intensity value acquisition module for acquiring the signal intensity value of the primer-incorporated nucleotide in a first preset number of sequencing cycles; a nucleotide classification module for performing a first classification on the first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle; and a correction module for determining the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
[0013] Those skilled in the art will understand that the features and advantages described above for sequencing correction methods at different locations on the sequencing chip also apply to the above system, and will not be repeated here.
[0014] In a fifth aspect, this application proposes a kit for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. According to embodiments of this application, the kit includes: primers for extension on the nucleic acid library to be tested, a polymerase, and nucleotides that carry a labeling group and generate a detectable signal; wherein the 5' end of the primer is complementary to the 3' end of the primer-binding sequence in the nucleic acid library to be tested, and the 3' end of the primer lacks at least one nucleotide complementary to the 5' end of the primer-binding sequence. In some examples of this application, the kit can be used for portable and accurate determination of the location of a sequencing chip loaded with a nucleic acid library to be tested.
[0015] In a sixth aspect, this application proposes a computing device. According to an embodiment of this application, the device includes a processor and a memory; the memory is used to store a computer program; the processor is used to execute the computer program to implement the method for determining the location of the sequencing chip loaded with the nucleic acid library to be tested as described in the first aspect, or the sequencing correction method for different positions on the sequencing chip as described in the second aspect.
[0016] In some examples of this application, the aforementioned computing device improves data processing speed and obtains faster result feedback by automatically executing the method for determining the location of the sequencing chip loading nucleic acid library or the sequencing correction method for different locations of the sequencing chip through computer instructions.
[0017] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic flowchart of a method for determining the location of a sequencing chip loading a nucleic acid library to be tested, provided in an embodiment of this application.
[0020] Figure 2 is a schematic flowchart of the sequencing correction method for different positions of the sequencing chip provided in the embodiments of this application;
[0021] Figure 3 is a schematic diagram of the device for determining the location of the sequencing chip loading nucleic acid library provided in the embodiments of this application;
[0022] Figure 4 is a schematic diagram of the sequencing correction system for different positions of the sequencing chip provided in the embodiments of this application;
[0023] Figure 5 is a schematic diagram of the electronic device provided in an embodiment of this application;
[0024] Figure 6 is a schematic diagram showing the comparison of sequencing error rates of different methods provided in the embodiments of this application;
[0025] Figure 7 is a schematic diagram showing the results of the analysis of the impact of introducing fixed sequence correction on the effective cycle and coefficient changes of Copynum provided in the embodiments of this application.
Detailed Implementation
[0026] The embodiments of this application are described in detail below, with examples of the embodiments illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0027] In this application, unless otherwise stated, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0028] This application discloses a method and apparatus for determining the location of a sequencing chip loaded with a nucleic acid library, a sequencing correction method and system for different locations on the sequencing chip, a reagent kit for determining the location of the sequencing chip loaded with a nucleic acid library, and a computing device. These are described in detail below:
[0029] Methods for determining the location of the sequencing chip loading the nucleic acid library to be tested
[0030] In one aspect of this application, a method is proposed for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. Referring to Figure 1, the method includes:
[0031] (a) Sequencing at least one known nucleotide at the 5' end of the primer binding sequence in the aforementioned nucleic acid library by primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in each cycle of the extension reaction; the aforementioned sequencing chip is detected in each cycle of the extension reaction.
[0032] It should be noted that the 5' end of the aforementioned primer binding sequence only indicates the direction.
[0033] In some examples of this application, the aforementioned sequencing is selected from single-color sequencing and multicolor sequencing. Among them, the aforementioned multicolor sequencing includes: two-color sequencing, three-color sequencing, and four-color sequencing.
[0034] In some examples of this application, the primer nucleotides incorporated in the extension reaction are complementary to the aforementioned known nucleotides. Sequencing of the known nucleotides allows for the effective detection of the loading location and loading status of the nucleic acid library to be tested.
[0035] It should be noted that the aforementioned "known nucleotides" refer to specific nucleotides expected to appear at a certain position on the sequencing chip during the sequencing process; that is, nucleotides whose type and position are known. Known nucleotides include the following two cases:
[0036] 1. Nucleotides capable of generating detectable signals during sequencing. For example, in two-color sequencing, nucleotides carrying fluorescent groups can be detected by the sequencer by generating fluorescent signals; these are considered "known nucleotides."
[0037] 2. Even if a nucleotide cannot directly generate a detectable signal, its position and type are known. For example, G in the sequencing primer binding sequence, even if G does not carry a fluorescent group and does not emit light during sequencing, its type and position are still known, and therefore it is also considered a "known nucleotide".
[0038] In some examples of this application, the primer-incorporated nucleotides further carry reversible blocking groups. These groups temporarily prevent further extension in each sequencing cycle, ensuring that only one nucleotide is incorporated at each position per cycle.
[0039] In some examples of this application, the aforementioned signal is an optical signal.
[0040] This application does not specifically limit the type of light signal. Those skilled in the art can choose according to experimental needs, such as fluorescence, phosphorescence, chemiluminescence, bioluminescence, electroluminescence, thermoluminescence, sonoluminescence, triboluminescence, or radioluminescence.
[0041] Fluorescence refers to the phenomenon where certain substances absorb high-energy photons (such as ultraviolet or blue light) and then rapidly emit lower-energy photons. Phosphorescence refers to the phenomenon where certain substances emit light for a relatively long time (from milliseconds to hours or even longer) after absorbing photons. This is because electrons pass through a triplet state when returning from the excited state to the ground state, resulting in a delayed emission. Chemiluminescence refers to the phenomenon where reactants release energy and emit light when they are converted into products during a chemical reaction. Bioluminescence refers to the phenomenon of cold light produced by certain organisms through chemical reactions, usually the reaction of luciferin with oxygen catalyzed by luciferase, which releases energy and emits light. Electroluminescence refers to the phenomenon where certain materials emit light under the influence of an electric field. Electrons transition to an excited state under the influence of an electric field and then release photons when returning to the ground state. Thermoluminescence refers to the phenomenon where certain materials release previously absorbed energy and emit light when heated. Sonoluminescence refers to the light emitted when bubbles in a liquid rapidly collapse under the influence of ultrasound. Triboluminescence refers to the light emitted when certain materials are rubbed, broken, or torn. Radioluminescence refers to the light emitted by certain materials under the influence of ionizing radiation (such as alpha, beta, and gamma rays).
[0042] In some examples of this application, the aforementioned signal detection includes signal type detection and signal strength detection.
[0043] In some examples of this application, the aforementioned optical signal is a fluorescence signal. Determining whether a sequencing site is loaded with a nucleic acid library based on the fluorescence signal includes: classifying the fluorescence signal obtained from sequencing into nucleotides; if the type of nucleotide obtained from sequencing at least partially matches at least one nucleotide at the 5' end of the primer binding sequence, this indicates that the sequencing site is loaded with a nucleic acid library. It is understood that if the type of nucleotide obtained from sequencing matches one or more known nucleotide types (at least one nucleotide at the 5' end of the sequencing primer binding site), then the sequencing site is considered to be loaded with a nucleic acid library.
[0044] For example, assuming the sequencing type is two-color sequencing, the adenine deoxyribonucleotide (abbreviated as "A") is labeled with red fluorescence, the thymine deoxyribonucleotide (abbreviated as "T") is labeled with green fluorescence, the cytosine deoxyribonucleotide (abbreviated as "C") is labeled with both red and green fluorescence, and the guanine deoxyribonucleotide (abbreviated as "G") is not labeled with fluorescence.
[0045] If the 4 nucleotides (ATGG) at the 5' end of the sequencing primer binding site are sequenced and the obtained nucleotide sequence is TACC, which is fully complementary to ATGG, then the sequencing site is considered to be loaded with the nucleic acid library to be tested.
[0046] If the 4 nucleotides (ATGG) at the 5' end of the sequencing primer binding site are sequenced and the obtained nucleotide sequence is GAGG, which matches the expected sequence TACC only in part (at the second position), the sequencing site is also considered to be loaded with the nucleic acid library to be tested.
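The two-color example above can be sketched in Python as follows. This is an illustrative sketch only, not the method as claimed: the function names, the 0.5 intensity threshold, the rule that a single matching position is sufficient, and the choice to ignore matches on the non-luminous base G (an empty site would also read as G) are all assumptions introduced here.

```python
# Dye scheme from the two-color example: A = red, T = green, C = red + green, G = dark.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
CALL_BY_CHANNELS = {
    (True, False): "A",
    (False, True): "T",
    (True, True): "C",
    (False, False): "G",
}


def expected_read(known_template: str) -> str:
    """Bases expected to be incorporated opposite the known nucleotides, position by position."""
    return "".join(COMPLEMENT[base] for base in known_template)


def call_base(red: float, green: float, threshold: float = 0.5) -> str:
    """Classify one cycle's two-channel signal into a base (threshold is hypothetical)."""
    return CALL_BY_CHANNELS[(red > threshold, green > threshold)]


def is_loaded(called: str, known_template: str, min_matches: int = 1) -> bool:
    """Treat a site as loaded if the called bases at least partially match the expected read."""
    expected = expected_read(known_template)
    # Assumption: only positions whose expected base emits light count as evidence,
    # because a dark G call is also what an empty site would produce.
    matches = sum(1 for c, e in zip(called, expected) if c == e and e != "G")
    return matches >= min_matches
```

With the known nucleotides ATGG, both the full match TACC and the partial match GAGG are reported as loaded, while an all-dark read GGGG is not.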
[0047] In some examples of this application, the aforementioned at least one known nucleotide is selected from two, three, four or more nucleotides.
[0048] Those skilled in the art will understand that the aforementioned at least one known nucleotide may be entirely selected from the primer binding sequence; or may be partially selected from the primer binding sequence and partially selected from the nucleic acid sequence to be tested.
[0049] In some preferred embodiments of this application, the number of the aforementioned at least one known nucleotide is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, or 52.
[0050] In some examples of this application, in step (a): at least one known nucleotide at the 5' end of the primer-binding sequence and at least one nucleotide at the 3' end of the nucleic acid library to be tested are sequenced by primer extension reaction, wherein the 5' end of the primer-binding sequence is linked to the 3' end of the nucleic acid library to be tested. By sequencing at least one known nucleotide at the 5' end of the primer-binding sequence and at least one nucleotide at the 3' end of the nucleic acid library to be tested, more sequencing information can be obtained, providing more reference information during the sequencing process. This not only helps to confirm the loading position of the library, but also improves the accuracy and reliability of subsequent sequencing results through the information of the known nucleotides.
[0051] In some examples of this application, at least one nucleotide at the 3' end of the aforementioned nucleic acid library to be tested is selected from 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides.
[0052] (b) Based on the detected signal, determine the location where the nucleic acid library to be tested is loaded on the aforementioned sequencing chip, wherein at least one signal detection is an indication of the location where the nucleic acid library to be tested is loaded.
[0053] In some examples of this application, the aforementioned nucleic acid library to be tested includes DNA nanoballs or DNA clusters. By detecting the signal, the location of the nucleic acid library loaded on the sequencing chip is accurately determined, so that subsequent signal acquisition and analysis are performed only on effectively loaded locations. This avoids sequencing errors and data interference caused by locations where the library is not loaded. This approach not only improves sequencing accuracy and data quality but also reduces resource waste on vacant sites, optimizes subsequent data analysis workflows, and enhances overall sequencing efficiency and reliability.
[0054] Sequencing correction methods for different positions on the sequencing chip
[0055] In another aspect of this application, a sequencing correction method for different positions on a sequencing chip is proposed. Referring to Figure 2, the method includes: (i) sequencing at least one known nucleotide at the 5' end of a primer-binding sequence in multiple nucleic acid libraries at different positions on the sequencing chip via a primer extension reaction, wherein the nucleotide incorporated into the primer during the extension reaction carries a labeling group and generates a detectable signal;
[0056] It should be noted that the 5' end of the aforementioned primer binding sequence only indicates the direction.
[0057] In some examples of this application, the aforementioned sequencing is selected from single-color sequencing and multicolor sequencing. Among them, the aforementioned multicolor sequencing includes: two-color sequencing, three-color sequencing, and four-color sequencing.
[0058] In some examples of this application, the primer nucleotides incorporated in the extension reaction are complementary to the aforementioned known nucleotides. Sequencing of the known nucleotides allows for the effective detection of the loading location and loading status of the nucleic acid library to be tested.
[0059] It should be noted that the aforementioned "known nucleotides" refer to specific nucleotides expected to appear at a certain position on the sequencing chip during the sequencing process; that is, nucleotides whose type and position are known. Known nucleotides include the following two cases:
[0060] 1. Nucleotides capable of generating detectable signals during sequencing. For example, in two-color sequencing, nucleotides carrying fluorescent groups can be detected by the sequencer by generating fluorescent signals; these are considered "known nucleotides."
[0061] 2. Nucleotides that cannot directly generate a detectable signal but whose position and type are known. For example, G in the sequencing primer binding sequence: even if G does not carry a fluorescent group and does not emit light during sequencing, its type and position are still known, and therefore it is also considered a "known nucleotide".
[0062] In some examples of this application, the primer-incorporated nucleotides further carry reversible blocking groups. These groups temporarily prevent further extension in each sequencing cycle, ensuring that only one nucleotide is incorporated at each position per cycle.
[0063] In some examples of this application, the aforementioned signal is an optical signal.
[0064] Here, the term "optical signal" has the meaning explained above for the method for determining the location of the sequencing chip loaded with the nucleic acid library to be tested, and the explanation is not repeated.
[0065] In some examples of this application, the aforementioned at least one known nucleotide is selected from two, three, four or more nucleotides.
[0066] Those skilled in the art will understand that the aforementioned at least one known nucleotide can be entirely selected from the primer binding sequence; or it can be partially selected from the primer binding sequence and partially selected from the nucleic acid sequence to be tested. Introducing fixed sequence correction can effectively improve sequencing performance while maintaining the stability of the sequencing system.
[0067] In some preferred embodiments of this application, the number of the aforementioned at least one known nucleotide is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, or 52.
[0068] In some examples of this application, step (i) involves sequencing at least one known nucleotide at the 5' end of the primer-binding sequence and at least one nucleotide at the 3' end of the aforementioned nucleic acid libraries at different locations on the sequencing chip via primer extension reactions, wherein the 5' end of the primer-binding sequence is linked to the 3' end of the aforementioned nucleic acid libraries. Sequencing at least one known nucleotide at the 5' end of the primer-binding sequence and at least one nucleotide at the 3' end of the nucleic acid libraries provides more sequencing information and more reference information during the sequencing process. This not only helps confirm the library loading location but also improves the accuracy and reliability of subsequent sequencing results through the information from the known nucleotides.
[0069] In some examples of this application, at least one nucleotide at the 3' end of the aforementioned nucleic acid library to be tested is selected from 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides.
[0070] (ii) Obtain the signal intensity value of the primer-incorporated nucleotides in the first preset number of sequencing cycles;
[0071] It should be noted that the aforementioned first preset number can be denoted N, and its value can be flexibly adjusted according to actual needs. The first preset number of sequencing cycles can be the first N cycles of the run, or N consecutive cycles near its start. For example, the N cycles from the 1st cycle to the Nth cycle, or from the 2nd cycle to the (N+1)th cycle, can be selected, and the specific range can be set according to the sequencing protocol.
[0072] (iii) Perform a first classification on the aforementioned first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle;
[0073] The aforementioned first classification can be a base pre-classification, while the final classification obtained from sequencing can be a second classification, etc.
[0074] (iv) Determine the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
[0075] In some examples of this application, steps (iii) and (iv) above include: correcting the light intensity value based on a sorting (median) method or on a non-sorting (non-normalization) method.
[0076] Light intensity value correction based on the sorting (median) method includes:
[0077] The light intensity values of the primer-incorporated nucleotides in the first preset number of sequencing cycles are obtained after normalization and after error and asynchrony elimination, and the luminescence signal of each cycle is pre-classified by base.
[0078] Based on the signal intensity values of the first preset number of sequencing cycles, the second preset number of quantiles of the signal intensity values of each second nucleotide are calculated for each sequencing cycle, wherein the second nucleotide is the nucleotide in the first nucleotide that has a detectable signal in the first channel and / or the second channel.
[0079] Normalization is performed based on all signal intensity values of each second nucleotide in the first and second channels and the calculated quantiles of the signal intensity values to reduce errors caused by signal differences.
[0080] Based on the normalization results and the number of the second nucleotides in the first preset number of sequencing cycles for each signal, the corrected relative replication number of each signal is determined.
[0081] Optionally, based on the quantile of the signal intensity value of each second nucleotide obtained for each sequencing cycle, the upper bound of the relative replication number of all signals is calculated; and abnormal signals in the signal are excluded according to the comparison results of the corrected relative replication number of each signal with the upper bound value.
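As an illustration only, the sorting (median) correction described above might be sketched in Python as follows. The formulas actually used are not reproduced in this text, so the sketch assumes that the quantile is the median, that normalization means dividing each intensity by that median, and that the corrected relative replication number is the mean normalized intensity over the light-emitting cycles. The function name, the data layout, and the two-color dye assignment (A in the first channel, T in the second, C in both, G dark) are likewise assumptions. The optional upper-bound filter is left out.

```python
from statistics import median

# Assumed channels in which each "second nucleotide" emits light (index 0 = first channel).
LIT_CHANNELS = {"A": (0,), "T": (1,), "C": (0, 1)}


def relative_replication_numbers(intensities, calls):
    """Sketch of a median-based estimate.

    intensities[site][cycle] is a (first_channel, second_channel) pair;
    calls[site][cycle] is the pre-classified base for that site and cycle.
    """
    n_sites, n_cycles = len(calls), len(calls[0])
    totals = [0.0] * n_sites
    counts = [0] * n_sites
    for cycle in range(n_cycles):
        for base, channels in LIT_CHANNELS.items():
            sites = [s for s in range(n_sites) if calls[s][cycle] == base]
            if not sites:
                continue
            for ch in channels:
                # Quantile (here: median) of this base's intensity in this cycle and channel.
                ref = median(intensities[s][cycle][ch] for s in sites)
                if ref <= 0:
                    continue
                for s in sites:
                    totals[s] += intensities[s][cycle][ch] / ref
                    counts[s] += 1
    # Sites with no light-emitting call in the window get 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

Under these assumptions a site whose signals sit at the per-cycle median receives a value of 1, brighter sites receive proportionally larger values, and a site that is dark in every cycle receives 0.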
[0082] Light intensity value correction that is not based on sorting (non-normalization) methods includes:
[0083] Obtain the signal intensity values of the first preset number of sequencing cycles that have undergone error and asynchrony elimination. Specifically, extract the light intensity values from the first N sequencing cycles, which have completed error and asynchrony elimination but have not been normalized. Each light intensity value is recorded on its own channel, such as the H channel (first channel) and the L channel (second channel), to distinguish the light intensity signal corresponding to each base.
[0084] The light intensity signal of each sequencing cycle is pre-classified by bases and classified into one of A, C, G, or T.
[0085] The number of third nucleotides in each signal in the first preset number of sequencing cycles is counted. The third nucleotide is the nucleotide in the first nucleotide that can emit light in the first channel and / or the second channel. For example, in dual-channel sequencing, the G base does not have a detectable signal, and only the number of A, C, and T bases in each signal is counted.
[0086] The sum of the signal intensity values of each signal in its corresponding light emission channel over the first preset number of sequencing cycles is then calculated. That is, the light intensity values of each signal in the different channels over the first N sequencing cycles are accumulated to obtain the total light intensity value of each signal.
[0087] The relative replication number after signal correction is determined based on the number of the third nucleotide and the sum of the signal intensity values.
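The counting and summing steps of the non-sorting approach can be sketched in Python as below. This is a minimal sketch under stated assumptions: the dye assignment (A in the H channel, T in the L channel, C in both, G dark) follows the earlier two-color example, the function names are hypothetical, and combining the two quantities as sum divided by count is a simplification chosen for illustration, not the formula of this application.

```python
# Assumed emission channels per base (index 0 = H channel, index 1 = L channel); G is dark.
LIT_CHANNELS = {"A": (0,), "T": (1,), "C": (0, 1)}


def count_and_sum(site_intensities, site_calls):
    """Count light-emitting bases and sum their intensities in the emitting channel(s).

    site_intensities is a list of (H, L) pairs for the first N cycles of one site;
    site_calls is the pre-classified base for each of those cycles.
    """
    count = 0
    total = 0.0
    for pair, base in zip(site_intensities, site_calls):
        channels = LIT_CHANNELS.get(base, ())
        if channels:
            count += 1
            total += sum(pair[ch] for ch in channels)
    return count, total


def relative_replication_number(site_intensities, site_calls):
    """Assumed combination: mean summed intensity per light-emitting base, 0 if none."""
    count, total = count_and_sum(site_intensities, site_calls)
    return total / count if count else 0.0
```

Note that in this simplified combination a C call contributes both of its channels to the sum while counting as a single base; a real implementation would need to account for that.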
[0088] In the above-mentioned schemes for correcting light intensity values based on the sorting (median) method and the schemes for correcting light intensity values without sorting (non-normalization), the formulas for calculating the relative replication number can be found in CN118314955A.
[0089] In some examples of this application, the above method further includes: using the relative replication number of the signal to perform light intensity correction on each channel of the entire sequencing cycle.
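One way to picture this per-channel correction is the sketch below. The text does not state how the relative replication number is applied, so dividing each channel intensity by it is an assumption, as are the function name `correct_intensities` and the (H channel, L channel) tuple layout.

```python
from typing import List, Tuple


def correct_intensities(
    cycles: List[Tuple[float, float]], rel_copy: float
) -> List[Tuple[float, float]]:
    """Scale the (H, L) intensities of every cycle of one signal.

    Assumption: correction means dividing by the relative replication
    number; the source does not give the operation.
    """
    if rel_copy == 0:
        # A zero value marks an unloaded site in [0090]; leave data untouched.
        return list(cycles)
    return [(h / rel_copy, l / rel_copy) for h, l in cycles]


print(correct_intensities([(2.0, 4.0), (6.0, 0.0)], 2.0))
# [(1.0, 2.0), (3.0, 0.0)]
```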
[0090] In some examples of this application, if the corrected relative replication number of a signal is 0, the corresponding sequencing site has not been loaded with the nucleic acid library to be tested; if the corrected relative replication number is not 0, the sequencing site has been loaded with the nucleic acid library to be tested. Determining whether a sequencing site is loaded makes it possible to decide whether to analyze that site in subsequent sequencing data processing. A sequencing site determined to be unloaded is filtered out of subsequent sequencing data processing, saving sequencing analysis resources.
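The loaded / unloaded decision described above amounts to a simple filter. The following sketch assumes the corrected relative replication numbers are already available per site; the function name `loaded_sites` and the site identifiers are hypothetical.

```python
from typing import Dict, List


def loaded_sites(rel_copy_by_site: Dict[str, float]) -> List[str]:
    """Return the sites to keep for downstream analysis.

    A site whose corrected relative replication number is 0 is treated as
    unloaded and dropped, following the rule stated in the text.
    """
    return [site for site, rel_copy in rel_copy_by_site.items() if rel_copy != 0]


sites = {"site_1": 1.2, "site_2": 0.0, "site_3": 0.8}
print(loaded_sites(sites))  # ['site_1', 'site_3']
```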
[0091] In some examples of this application, the aforementioned method improves signal consistency during gene sequencing, effectively reduces the error rate caused by copy number inconsistency, optimizes signal distribution, and improves the accuracy of base identification.
[0092] A device for determining the location of the sequencing chip where the nucleic acid library to be tested is loaded.
[0093] In another aspect of this application, an apparatus is proposed for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. Referring to Figure 3, the apparatus includes a sequencing unit 100 and a judgment unit 200.
[0094] The sequencing unit 100 is used to sequence at least one known nucleotide at the 5' end of the primer binding sequence in the aforementioned nucleic acid library by means of a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in the extension reaction, and the aforementioned sequencing chip is detected in each cycle of the extension reaction.
[0095] The judgment unit 200 is used to determine the location of the nucleic acid library to be tested loaded on the aforementioned sequencing chip based on the detected signal, wherein at least one signal detection is an indication of the location where the nucleic acid library to be tested is loaded.
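The rule applied by the judgment unit, that at least one signal detection indicates a loaded location, can be sketched as follows. How a signal counts as detected is not specified in this passage, so the intensity `threshold` is a hypothetical parameter, as is the function name `site_is_loaded`.

```python
from typing import Sequence


def site_is_loaded(cycle_intensities: Sequence[float], threshold: float) -> bool:
    """Report a site as loaded if any extension cycle shows a detected signal.

    Assumption: a signal is "detected" when its intensity exceeds `threshold`.
    """
    return any(value > threshold for value in cycle_intensities)


print(site_is_loaded([0.1, 0.9, 0.2], threshold=0.5))  # True
print(site_is_loaded([0.1, 0.2], threshold=0.5))       # False
```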
[0096] Those skilled in the art will understand that the features and advantages described above for the method of determining the location of the sequencing chip loaded with the nucleic acid library to be tested also apply to the above-mentioned device, and will not be repeated here.
[0097] It should be understood that the device embodiments and the method embodiments correspond to each other; for similar descriptions, refer to the method embodiments. Specifically, the device can perform the embodiments of the above-described method for determining the location of the sequencing chip loaded with the nucleic acid library to be tested, and the operations and / or functions performed by each unit in the device correspond to those in the method embodiments, which are not repeated here for the sake of brevity.
[0098] Sequencing correction system for different positions on the sequencing chip
[0099] In another aspect of this application, a sequencing correction system for different positions on a sequencing chip is proposed. Referring to Figure 4, the system includes: a sequencing module 01, a signal intensity value acquisition module 02, a nucleotide classification module 03, and a correction module 04. Wherein,
[0100] Sequencing module 01 is used to sequence at least one known nucleotide at the 5' end of the primer binding sequence in multiple nucleic acid libraries at different positions on the sequencing chip through primer extension reaction, wherein the nucleotide incorporated into the primer in the extension reaction carries a labeling group and generates a detectable signal;
[0101] Signal intensity value acquisition module 02 is used to acquire the signal intensity value of the primer-incorporated nucleotide in the first preset number of sequencing cycles;
[0102] Nucleotide classification module 03 is used to perform a first classification on the aforementioned first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle;
[0103] Correction module 04 is used to determine the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
[0104] Those skilled in the art will understand that the features and advantages described above for the sequencing correction method for different positions on the sequencing chip also apply to the above system, and will not be repeated here.
[0105] It should be understood that the system embodiments and the method embodiments correspond to each other; for similar descriptions, refer to the method embodiments. Specifically, the system can execute the above embodiments of the sequencing correction method for different positions on the sequencing chip, and the operations and / or functions performed by each module in the system correspond to those in the method embodiments, which are not repeated here for the sake of brevity.
[0106] A kit for determining the location of the sequencing chip where the nucleic acid library to be tested is loaded.
[0107] In another aspect of this application, a kit is provided for determining the location of a sequencing chip loaded with a nucleic acid library to be tested. The kit includes: primers for extending the nucleic acid library to be tested, a polymerase, and nucleotides carrying a labeling group and generating a detectable signal; wherein the 5' end of the primers is complementary to the 3' end of the primer-binding sequence in the nucleic acid library to be tested, and the 3' end of the primers lacks at least one nucleotide complementary to the 5' end of the primer-binding sequence. In some examples of this application, the aforementioned kit can be used for portable and accurate determination of the location of a sequencing chip loaded with a nucleic acid library to be tested.
[0108] In some examples of this application, the nucleotide carrying the labeling group and generating a detectable signal further carries a reversible blocking modification group. This group temporarily prevents further extension after a nucleotide is incorporated in each sequencing cycle, ensuring that only one nucleotide is incorporated at each position per cycle.
[0109] Computing device
[0110] In another aspect of this application, a computing device is proposed, comprising: a processor and a memory; the memory for storing a computer program; and the processor for executing the computer program to implement a method for determining the location of a sequencing chip loaded with a nucleic acid library to be tested, as in any of the foregoing examples, or a sequencing correction method for different locations on a sequencing chip, as in any of the foregoing examples.
[0111] The term "electronic device" is intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices can also refer to various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0112] As shown in Figure 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes based on a computer program stored in ROM (Read-Only Memory) 502 or a computer program loaded from storage unit 508 into RAM (Random Access Memory) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, ROM 502, and RAM 503 are interconnected via a bus 504. An I / O (Input / Output) interface 505 is also connected to the bus 504.
[0113] Multiple components in device 500 are connected to I / O interface 505, including: input unit 506, such as keyboard, mouse, etc.; output unit 507, such as various types of monitors, speakers, etc.; storage unit 508, such as disk, optical disk, etc.; and communication unit 509, such as network card, modem, wireless transceiver, etc. Communication unit 509 allows device 500 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0114] The computing unit 501 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, CPUs (Central Processing Units), GPUs (Graphics Processing Units), various special-purpose AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, DSPs (Digital Signal Processors), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as methods for determining the location of the sequencing chip for loading the nucleic acid library to be tested or methods for sequencing correction at different locations on the sequencing chip. For example, in some embodiments, the methods for determining the location of the sequencing chip for loading the nucleic acid library to be tested or the methods for sequencing correction at different locations on the sequencing chip can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program can be loaded and / or installed on device 500 via ROM 502 and / or communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the method described above can be performed. Alternatively, in other embodiments, computing unit 501 can be configured by any other suitable means (e.g., by means of firmware) to perform the aforementioned method for determining the location of the sequencing chip loading the nucleic acid library to be tested or the sequencing correction method for different locations on the sequencing chip.
[0115] In this application, the logic and / or steps represented in the flowcharts or otherwise described herein can be considered, for example, an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, a computer-readable medium can even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in a computer memory. The various computer-readable storage media described in this application can represent one or more devices and / or other machine-readable storage media for storing information. The term "machine-readable storage medium" can include, but is not limited to, wireless channels and various other media capable of storing, containing, and / or carrying instructions and / or data.
[0116] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0117] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
[0118] The solution of the present invention will be explained below in conjunction with embodiments. Those skilled in the art will understand that the following embodiments are only used to illustrate the present invention and should not be regarded as limiting its scope. Where specific techniques or conditions are not specified in the embodiments, those described in the literature in the art or in the product specifications are followed. Reagents or instruments for which no manufacturer is indicated are conventional products that can be purchased commercially.
[0119] Example 1:
[0120] This example is based on the Dolphin sequencing platform of MGISEQ. All the reagents used are from the sequencing reagents (hereinafter referred to as SE150 kit) supporting this platform; the verification samples used are from Escherichia coli. During the verification process, referring to the "Instruction Manual for MGISEQ-200RS High-throughput (Fast) Sequencing Reagent Kit", the preparation of DNA nanoballs (DNBs) required for SE150, loading, and the preparation of the SE150 reagent tank were carried out. This verification was tested using SE150 sequencing as an example.
[0121] 1. Instruments and Reagents
[0122] Instruments: Dolphin sequencer, PCR instrument, 3.0 fluorescence quantitative analyzer, high-speed centrifuge;
[0123] Reagents: Table 1;
[0124] Table 1
[0125] 2. Specific Steps
[0126] During the single-strand primer hybridization process, the conventional primer was replaced with an IP1-4bp primer (the last 4 bases at the 3' end of the single-strand sequencing primer were removed):
[0127] IP1-4bp primer: CCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCG (SEQ ID NO:1)
[0128] The design of this primer ensures the binding of the 5' end of the primer to the 3' end of the nucleic acid library to be detected, thereby providing a stable extension starting point.
[0129] Sequencing Strategy:
[0130] Single strand: ACTT followed by a 146 bp target fragment; the target region was accurately sequenced using the extension reaction.
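The relationship between the truncated primer and the first sequenced bases can be checked with a short sketch. It assumes that the ACTT named in the sequencing strategy corresponds to the four bases removed from the 3' end of the conventional primer; the text does not list the conventional primer sequence, so this is an inference. The helper names `reverse_complement` and `expected_first_calls` are hypothetical.

```python
# SEQ ID NO:1 from the text: the sequencing primer with the last 4 bases
# at its 3' end removed.
IP1_4BP = "CCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCG"
# Assumption: the removed 3' bases are the "ACTT" of the sequencing strategy.
REMOVED_3P = "ACTT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def expected_first_calls(truncated_primer: str, removed_bases: str) -> str:
    """Bases incorporated in the first cycles when the truncated primer is extended."""
    # Primer binding sequence on the template (5'->3'): the reverse
    # complement of the assumed full-length primer.
    binding_site = reverse_complement(truncated_primer + removed_bases)
    # The truncated primer leaves the 5'-most bases of the binding site unpaired.
    unpaired_5p = binding_site[: len(removed_bases)]
    # Extension incorporates the complement of those unpaired template bases.
    return reverse_complement(unpaired_5p)


print(len(IP1_4BP))                               # 39
print(expected_first_calls(IP1_4BP, REMOVED_3P))  # ACTT
```

Under this assumption, the first four extension cycles read known bases at every loaded site, independent of the library insert.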
[0131] Analysis Method:
[0132] Method 1 (control group):
[0133] Copy number correction is performed over 10 cycles starting from cycle 5, and 146 bp of sequencing quality data is read.
[0134] Method 2 (Experimental Group):
[0135] During the basecall process of sequencing, the first 4 bases of the primers and the first 6 bases of the nucleic acid library to be tested are used as auxiliary corrections to form a copy number correction region of a total of 10 bases, and 146 bp of sequencing quality data is read.
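The two correction windows can be compared with a small sketch. It assumes cycles are numbered from the first extension cycle with the IP1-4bp primer, so that cycles 1 to 4 read the known primer-region bases and cycle 5 is the first library base; the function name `correction_window` is hypothetical.

```python
def correction_window(start_cycle: int, n_cycles: int = 10) -> range:
    """Cycles (1-based) used for copy number correction."""
    return range(start_cycle, start_cycle + n_cycles)


# Method 1 (control): 10 cycles starting from cycle 5.
method_1 = correction_window(5)
# Method 2 (experimental): 4 known bases plus the first 6 library bases.
method_2 = correction_window(1)

known_base_cycles = [c for c in method_2 if c <= 4]
library_base_cycles = [c for c in method_2 if c > 4]

print(list(method_1))       # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(known_base_cycles)    # [1, 2, 3, 4]
print(library_base_cycles)  # [5, 6, 7, 8, 9, 10]
```

Both windows span 10 cycles; under this numbering, Method 2 starts four cycles earlier and draws four of its ten positions from bases whose identity is known in advance.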
[0136] 3. Results Analysis
[0137] As shown in Figure 6, sequencing based on the method of this application significantly reduces the sequencing error rate of the first 5 bp.
[0138] As shown in Figure 7, the effective cycle of copynum in the experimental group increased by an average of about 0.08, indicating that the correction strategy can improve the effectiveness of sequencing signals (Figure 7A-C); most of the differences were concentrated in the range of 0.00001 to 0.0001, and the system stability was not significantly affected (Figure 7D); the proportion of changes in copynum coefficient was low, and the cumulative curve showed that more than 90% of the changes were concentrated in a small range, and the performance remained stable (Figure 7E, F).
[0139] As shown in Table 2, compared with Method 1, sequencing using Method 2 improved the mapping rate by 0.28%, decreased the mismatch rate by 0.03%, increased the number of concordant reads by 1.34%, and decreased the number of reads with mismatch by 1.06%.
[0140] Table 2
[0141] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0142] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A method for determining the location of a sequencing chip loaded with a nucleic acid library to be tested, characterized in that the method includes the following steps: (a) sequencing at least one known nucleotide at the 5' end of the primer binding sequence in the nucleic acid library to be tested by a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in each cycle of the extension reaction, and the sequencing chip is detected in each cycle of the extension reaction; (b) based on the detected signal, determining the location where the sequencing chip is loaded with the nucleic acid library to be tested, wherein at least one signal detection is an indication of the location where the nucleic acid library to be tested is loaded.
2. The method according to claim 1, characterized in that, in the extension reaction, the nucleotides incorporated into the primers are complementary to the known nucleotides.
3. The method according to claim 2, characterized in that the nucleotides incorporated into the primers further carry reversible blocking modification groups.
4. The method according to claim 1, characterized in that the signal is an optical signal; optionally, the signal detection includes signal type detection and signal strength detection.
5. The method according to claim 1, characterized in that the at least one known nucleotide is selected from two, three, four or more nucleotides.
6. The method according to any one of claims 1-5, characterized in that, in step (a), at least one known nucleotide at the 5' end of the primer binding sequence and at least one nucleotide at the 3' end of the nucleic acid library to be tested are sequenced by a primer extension reaction, wherein the 5' end of the primer binding sequence is connected to the 3' end of the nucleic acid library to be tested; optionally, the at least one nucleotide at the 3' end of the nucleic acid library to be tested is selected from 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides.
7. The method according to claim 1, characterized in that the nucleic acid library to be tested includes DNA nanospheres or DNA clusters.
8. A sequencing correction method for different locations on a sequencing chip, characterized in that the method includes: (i) sequencing at least one known nucleotide at the 5' end of the primer binding sequence in multiple nucleic acid libraries at different locations on a sequencing chip by a primer extension reaction, wherein the nucleotide incorporated into the primer in the extension reaction carries a labeling group and generates a detectable signal; (ii) obtaining the signal intensity value of the primer-incorporated nucleotides in the first preset number of sequencing cycles; (iii) performing a first classification on the first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle; (iv) determining the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
9. The method according to claim 8, characterized in that the nucleotides incorporated into the primers during the extension reaction are complementary to the known nucleotides; optionally, the nucleotides incorporated into the primers further carry reversible blocking modification groups.
10. The method according to claim 8, characterized in that the signal is an optical signal.
11. The method according to claim 8, characterized in that the at least one known nucleotide is selected from two, three, four or more nucleotides.
12. The method according to any one of claims 8-11, characterized in that step (i) is as follows: sequencing, by a primer extension reaction, at least one known nucleotide at the 5' end of the primer binding sequence and at least one nucleotide at the 3' end of the nucleic acid library to be tested in multiple nucleic acid libraries to be tested at different positions on the sequencing chip, wherein the 5' end of the primer binding sequence is connected to the 3' end of the nucleic acid library to be tested; optionally, the at least one nucleotide at the 3' end of the nucleic acid library to be tested is selected from 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides.
13. The method according to claim 8, characterized in that step (ii) includes: obtaining the light intensity values of the primer-incorporated nucleotides in the first preset number of sequencing cycles after normalization and error and asynchronous elimination; optionally, step (iv) includes: based on the signal intensity values of the first preset number of sequencing cycles, calculating, for each sequencing cycle, a second preset number of quantiles of the signal intensity values of each second nucleotide, wherein the second nucleotide is a nucleotide among the first nucleotides that has a detectable signal in the first channel and / or the second channel; performing normalization based on all signal intensity values of each second nucleotide in the first channel and the second channel and on the calculated quantiles of the signal intensity values; and determining the corrected relative replication number of each signal based on the normalization results and the number of second nucleotides of each signal in the first preset number of sequencing cycles; optionally, the method further includes: calculating an upper limit value of the relative replication number of all signals based on the quantiles of the signal intensity values of each second nucleotide obtained for each sequencing cycle; and excluding abnormal signals based on a comparison of the corrected relative replication number of each signal with the upper limit value.
14. The method according to claim 8, characterized in that the method further includes: performing light intensity correction on each channel of the entire sequencing cycle using the relative replication number of the signal; optionally, step (ii) further includes: obtaining the signal intensity values of a first preset number (at least one) of sequencing cycles for which error and asynchronous elimination have been completed; optionally, step (iv) includes: counting the number of third nucleotides of each signal in the first preset number of sequencing cycles, wherein the third nucleotide is a nucleotide among the first nucleotides that can emit light in the first channel and / or the second channel; calculating the sum of the signal intensity values of each signal in the corresponding light-emitting channel in the first preset number of sequencing cycles; and determining the corrected relative replication number of the signal based on the number of third nucleotides and the sum of the signal intensity values.
15. The method according to claim 8, characterized in that, if the relative replication number after correction for each signal is 0, it indicates that the sequencing site has not been loaded with the nucleic acid library to be tested; if the relative replication number after correction for each signal is not 0, it indicates that the sequencing site is loaded with the nucleic acid library to be tested.
16. An apparatus for determining the location of a sequencing chip loaded with a nucleic acid library to be tested, characterized in that the apparatus includes: a sequencing unit, used to sequence at least one known nucleotide at the 5' end of the primer binding sequence in the nucleic acid library to be tested via a primer extension reaction, wherein the nucleotide incorporated into the primer carries a labeling group and generates a detectable signal in the extension reaction, and the sequencing chip is detected in each cycle of the extension reaction; and a judgment unit, used to determine the location where the sequencing chip is loaded with the nucleic acid library to be tested based on the detected signal, wherein at least one signal detection is an indication of the location where the nucleic acid library to be tested is loaded.
17. A sequencing correction system for different positions on a sequencing chip, characterized in that the system includes: a sequencing module, used to sequence at least one known nucleotide at the 5' end of the primer binding sequence in multiple nucleic acid libraries at different positions on the sequencing chip through a primer extension reaction, wherein the nucleotide incorporated into the primer in the extension reaction carries a labeling group and generates a detectable signal; a signal intensity value acquisition module, used to acquire the signal intensity value of the primer-incorporated nucleotides in the first preset number of sequencing cycles; a nucleotide classification module, used to perform a first classification on the first preset number of sequencing cycles to obtain the first nucleotide that each signal is identified as in each sequencing cycle; and a correction module, used to determine the relative replication number based on the signal intensity value of the first preset number of sequencing cycles and the first nucleotide.
18. A kit for determining the location of a sequencing chip loaded with a nucleic acid library to be tested, characterized in that the kit includes: primers for extending the nucleic acid library to be tested, a polymerase, and nucleotides carrying a labeling group and generating a detectable signal; wherein the 5' end of the primer is complementary to the 3' end of the primer-binding sequence in the nucleic acid library to be tested, and the 3' end of the primer lacks at least one nucleotide complementary to the 5' end of the primer-binding sequence.
19. The kit according to claim 18, characterized in that the nucleotides carrying a labeling group and generating a detectable signal further carry reversible blocking modification groups.
20. A computing device, characterized in that the computing device includes: a processor and a memory; the memory is used to store a computer program; and the processor is configured to execute the computer program to implement the method for determining the location of the sequencing chip loaded with the nucleic acid library to be tested as described in any one of claims 1-7, or the sequencing correction method for different locations on the sequencing chip as described in any one of claims 8-15.