Sequence correction method and device, electronic equipment and storage medium
By acquiring the relationship data between second-generation and third-generation sequencing, and using distribution models and gene gradients to screen target gene sets, the problem of uncontrollable barcode correction accuracy in third-generation sequencing technology was solved, achieving controllable correction accuracy and improved correction rate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU HUADA XUFENG TECHNOLOGY CO LTD
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-19
AI Technical Summary
In existing third-generation sequencing technologies, the accuracy of barcode correction methods is uncontrollable, and multiple alignments and mismatches are prone to occur.
By acquiring the relationship data between second-generation sequencing and third-generation sequencing of the target sample, and using pre-constructed distribution models, gene gradients, and internal edit distances, the target gene set is screened out, and based on this, the first alignment tag sequence and the second alignment tag sequence are determined, so as to achieve controllable correction accuracy.
It achieves controllable accuracy for third-generation barcode correction and improves the correction accuracy rate.
Figure CN122245409A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of sequencing technology, specifically to a sequence correction method and apparatus, electronic equipment and storage medium.
Background Technology
[0002] In biotechnology, sequencing refers to the process of determining the nucleotide sequence in DNA or RNA molecules. Currently, sequencing technologies include second-generation sequencing (NGS) and third-generation sequencing (TGS). NGS can refer to high-throughput sequencing, while TGS can refer to single-molecule sequencing.
[0003] Barcode technology refers to assigning a unique identifier (i.e., barcode) to each single cell during the sequencing process, so as to distinguish different cells in subsequent data processing and analysis.
[0004] In related technologies, the method for correcting barcodes in third-generation sequencing technology involves comparing the barcodes detected by third-generation sequencing technology with those detected by second-generation sequencing technology one-to-one. However, this method is prone to multiple alignments and mismatches, meaning its accuracy is uncontrollable. Therefore, how to provide a sequence correction method that achieves controllable correction accuracy has become an urgent technical problem to be solved.
Summary of the Invention
[0005] The main objective of this application is to propose a sequence correction method, apparatus, electronic device, and storage medium, aiming to achieve controllable accuracy of third-generation barcode correction.
[0006] To achieve the above objectives, a first aspect of this application provides a sequence correction method, the method comprising:
[0007] Obtain first relational data from second-generation sequencing of the target sample, wherein the first relational data is used to describe the association between each gene in the target sample and the first target tag sequence;
[0008] Based on the pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data, a target gene set is determined from the first relation data. The distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients; the gene gradient is used to represent the number of the first target tag sequences corresponding to the gene.
[0009] Obtain second relational data from third-generation sequencing of the target sample, wherein the second relational data is used to describe the association between each gene in the target sample and the second target tag sequence;
[0010] A first alignment tag sequence is selected from the first relation data based on the target gene set, and a second alignment tag sequence is selected from the second relation data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
[0011] In some embodiments, the method for constructing the distribution model includes:
[0012] Based on a preset sampling gradient, sequence extraction is performed on a preset label sequence database to obtain sample label sequences;
[0013] Based on the sample label sequence and the simulated label sequence of the sample label sequence, determine the target accuracy of the simulated label sequence at different reference edit distances;
[0014] The distribution model is constructed based on the sampling gradient, the reference edit distance, and the target accuracy.
[0015] In some embodiments, the method for determining the simulated tag sequence includes:
[0016] Determine the base mutation rate corresponding to the sampling gradient;
[0017] The sample tag sequence corresponding to the sampling gradient is subjected to base mutation according to the base mutation rate to obtain the simulated tag sequence.
[0018] In some embodiments, determining the target accuracy of the simulated label sequence at different reference edit distances based on the sample label sequence and the simulated label sequence of the sample label sequence includes:
[0019] Calculate the target edit distance between the sample label sequence and the simulated label sequence;
[0020] When it is determined that the target edit distance meets the preset edit conditions, the simulated label sequence corresponding to the target edit distance is taken as the positive sample label sequence;
[0021] For each reference edit distance, a first total number of positive sample label sequences whose target edit distance is equal to the reference edit distance is determined, and the target accuracy is calculated based on the first total number and the second total number of sample label sequences.
[0022] In some embodiments, the method for determining the reference edit distance includes:
[0023] The lower limit of edit distance is calculated based on the base mutation rate and the tag sequence length.
[0024] Determine the upper limit of the edit distance;
[0025] The range of the reference edit distance is determined based on the upper limit of the edit distance and the lower limit of the edit distance.
[0026] In some embodiments, determining the target gene set from the first relation data based on a pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data includes:
[0027] The target accuracy for each gene in the first relational data is determined based on the gene gradient, the internal edit distance, the sampling gradient, and the reference edit distance.
[0028] The target gene set is selected from the first relation data based on the target accuracy corresponding to each gene in the first relation data.
[0029] In some embodiments, the step of filtering a first alignment tag sequence from the first relation data based on the target gene set and filtering a second alignment tag sequence from the second relation data includes:
[0030] Based on the target gene set, the genes corresponding to each second target tag sequence are traversed and matched, and the second alignment tag sequence is determined from the second target tag sequence according to the matching results;
[0031] Based on the genes in the target gene set that correspond to the second alignment tag sequence, a third target tag sequence is determined from the first target tag sequence;
[0032] The first alignment tag sequence is determined from the third target tag sequence by comparing the alignment edit distance between the third target tag sequence and the second alignment tag sequence.
[0033] In some embodiments, before selecting a first alignment tag sequence from the first relation data based on the target gene set and selecting a second alignment tag sequence from the second relation data, the method further includes:
[0034] Determine the expression level of each gene in the second relationship data in the target sample;
[0035] The genes corresponding to each second target tag sequence in the second relation data are sorted based on the expression levels.
[0036] To achieve the above objectives, a second aspect of this application provides a sequence correction apparatus, the apparatus comprising:
[0037] The first relation data determination unit is used to obtain the first relation data under the second-generation sequencing of the target sample, wherein the first relation data is used to describe the relationship between each gene in the target sample and the first target tag sequence.
[0038] A gene set determination unit is used to determine a target gene set from the first relation data based on a pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data. The distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients; the gene gradient is used to represent the number of the first target tag sequences corresponding to the genes.
[0039] The second relationship data determination unit is used to obtain the second relationship data under the third-generation sequencing of the target sample, wherein the second relationship data is used to describe the association between each gene in the target sample and the second target tag sequence;
[0040] The second sequence acquisition unit is used to select a first alignment tag sequence from the first relationship data according to the target gene set, and to select a second alignment tag sequence from the second relationship data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
[0041] To achieve the above objectives, a third aspect of this application provides an electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method described in the first aspect.
[0042] To achieve the above objectives, a fourth aspect of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first aspect.
[0043] To achieve the above objectives, a fifth aspect of this application provides a computer program product comprising a computer program that is read and executed by a processor of a computer device, causing the computer device to perform the method described in the first aspect.
[0044] The sequence correction method, apparatus, electronic device, and storage medium proposed in this application utilize a distribution model to describe the target accuracy at different reference edit distances under different sampling gradients. This ensures that the range of the target accuracy is controllable when the range of the reference edit distance is set. Furthermore, this application's method for determining a target gene set based on the distribution model and then determining a first alignment tag sequence and a second alignment tag sequence based on the target gene set also makes the correction accuracy of the second alignment tag sequence based on the first alignment tag sequence controllable.
Description of the Drawings
[0045] The accompanying drawings are provided to further understand the technical solutions of this disclosure and constitute a part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure.
[0046] Figure 1 is a flowchart of the sequence correction method provided in the embodiments of this application;
[0047] Figure 2 is a flowchart of a method for determining a simulated tag sequence provided in an embodiment of this application;
[0048] Figure 3 is a flowchart of the method for determining the reference edit distance provided in the embodiments of this application;
[0049] Figure 4 is a flowchart of an embodiment of step S102 in Figure 1;
[0050] Figure 5 is a flowchart of an embodiment of step S403 in Figure 4;
[0051] Figures 6A to 6C are schematic diagrams of the distribution model provided in the embodiments of this application;
[0052] Figure 7 is a flowchart of an embodiment of step S104 in Figure 1;
[0053] Figure 8 is a flowchart of another embodiment of the sequence correction method provided in this application;
[0054] Figure 9 is a flowchart of an embodiment of step S106 in Figure 1;
[0055] Figure 10 is a schematic diagram of the sequence correction device provided in the embodiments of this application;
[0056] Figure 11 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.
Detailed Description of the Embodiments
[0057] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the scope of this disclosure.
[0058] Before providing a further detailed description of the embodiments of this disclosure, the terms and concepts used in these embodiments are explained, and they are subject to the following interpretations:
[0059] Long-read sequencing is an advanced DNA sequencing technology. It produces longer DNA fragments. Compared to short-read sequencing, long-read sequencing provides longer reads, allowing for a more comprehensive interpretation of the genome. Long-read sequencing can be applied to genome assembly, structural variation detection, transcriptomics research, and epigenetics research.
[0060] Short read sequencing is a DNA sequencing technology. It produces short DNA sequence fragments, typically ranging from tens to hundreds of base pairs (bp) in length. Correspondingly, short read sequencing can also be applied to genome assembly, structural variation detection, transcriptome research, and epigenetics research.
[0061] Next-Generation Sequencing (NGS) includes technologies such as Roche 454 sequencing, Illumina sequencing, and SOLiD sequencing. NGS is characterized by high throughput, high accuracy, short read lengths, and low cost.
[0062] Third-generation sequencing (TGS) includes single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) and nanopore sequencing technology from Oxford Nanopore Technologies. Third-generation sequencing technologies are characterized by longer read lengths and the ability to directly sequence single molecules.
[0063] A unique molecular identifier (UMI) is a short, randomized or specific nucleotide sequence used to label individual mRNA molecules to distinguish different mRNA molecules of the same gene from the same cell. Compared to a barcode, which distinguishes different samples or cells, a UMI distinguishes different mRNA molecules of the same gene within a single sample. In other words, a barcode focuses on sample or cell-level differentiation, while a UMI focuses on molecular-level differentiation.
[0064] Because second-generation sequencing (NGS) has shorter read lengths and a more stable data processing workflow, the barcodes added during library construction can be accurately identified and distinguished, resulting in a generally low misclassification rate for second-generation barcodes (hereinafter referred to as NGS barcodes). Third-generation sequencing (TGS) offers advantages such as longer read lengths, but complex factors during sequencing, such as fluctuations in electrical signals, can affect the accuracy of barcode sequence identification, potentially leading to base misclassification in third-generation barcodes (hereinafter referred to as TGS barcodes). Therefore, TGS barcodes can be corrected based on NGS barcodes.
[0065] In related technologies, the correction method for third-generation barcodes involves comparing each third-generation barcode with the second-generation barcodes one by one. However, this method is prone to multiple alignments and mismatches, meaning that the accuracy of this correction method is uncontrollable.
[0066] Based on this, embodiments of this application propose a sequence correction method, apparatus, electronic device, and storage medium that can achieve controllable accuracy of third-generation barcode correction.
[0067] The sequence correction method provided in the embodiments of this application will be described below.
[0068] Referring to Figure 1, in some embodiments, the sequence correction method provided in this application includes, but is not limited to, steps S101 to S104.
[0069] Step S101: Obtain the first relational data under the second-generation sequencing of the target sample, wherein the first relational data is used to describe the association between each gene in the target sample and the first target tag sequence;
[0070] Step S102: Based on the pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data, determine the target gene set from the first relation data. The distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients; the gene gradient is used to represent the number of first target tag sequences corresponding to the gene.
[0071] Step S103: Obtain the second relationship data under third-generation sequencing of the target sample, wherein the second relationship data is used to describe the association between each gene in the target sample and the second target tag sequence;
[0072] Step S104: Select a first alignment tag sequence from the first relation data based on the target gene set, and select a second alignment tag sequence from the second relation data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
[0073] In steps S101 to S104 of the embodiments of this application, the distribution model can be used to describe the target accuracy at different reference edit distances under different sampling gradients. Thus, when the range of the reference edit distance is set, the range of the target accuracy is controllable. Based on this, the method of determining the target gene set based on the distribution model and determining the first alignment tag sequence and the second alignment tag sequence based on the target gene set in the embodiments of this application makes the correction accuracy of the second alignment tag sequence based on the first alignment tag sequence controllable as well.
[0074] To facilitate understanding, the distribution model is explained first. Referring to Figure 2, in some embodiments, the method for constructing the distribution model may include, but is not limited to, steps S201 to S203.
[0075] Step S201: Extract sequences from a preset label sequence database according to a preset sampling gradient to obtain sample label sequences;
[0076] Step S202: Determine the target accuracy of the simulated label sequence at different reference edit distances based on the sample label sequence and the simulated label sequence of the sample label sequence;
[0077] Step S203: Construct a distribution model based on the sampling gradient, reference edit distance, and target accuracy.
[0078] In step S201 of some embodiments, the sampling gradient can be set according to actual needs, or determined according to how many cells in the target sample express the gene. In other words, the sampling gradient indicates the number of tag sequences extracted from the tag sequence database in each batch. For example, the sampling gradient can be set to [100, 200, 300, 400, ..., 2000]. The tag sequence database can refer to a database that stores a series of known barcode sequences (a whitelist). N batches of sequences are randomly extracted from the tag sequence database according to the sampling gradient to obtain N batches of data, each batch including n sample tag sequences, where n equals the sampling gradient. N is a positive integer whose specific value can be set according to actual needs, for example 100 or more; this application embodiment does not specifically limit it. For example, when the sampling gradient is 100, 100 sample tag sequences are randomly extracted from the tag sequence database each time, and these 100 sample tag sequences form the data of one extraction batch. It is understandable that the target sample can refer to a biological sample obtained in accordance with relevant laws and regulations, including plant samples and animal samples.
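The batch extraction of step S201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the whitelist is a randomly generated stand-in for the tag sequence database, and the names `draw_batches`, `whitelist`, and `batches` are hypothetical.

```python
import random


def draw_batches(whitelist, sampling_gradient, n_batches, seed=0):
    """Draw n_batches random batches of sampling_gradient distinct barcodes each."""
    rng = random.Random(seed)
    return [rng.sample(whitelist, sampling_gradient) for _ in range(n_batches)]


# Hypothetical stand-in for the tag sequence database:
# 3000 distinct random 20 bp barcodes (a real whitelist would be loaded from a file).
_rng = random.Random(1)
_unique = set()
while len(_unique) < 3000:
    _unique.add("".join(_rng.choice("ACGT") for _ in range(20)))
whitelist = sorted(_unique)

# One list of N = 100 batches per sampling gradient; each batch holds n = gradient sequences.
batches = {g: draw_batches(whitelist, g, n_batches=100) for g in (100, 200, 300)}
print({g: (len(b), len(b[0])) for g, b in batches.items()})
```

Sampling without replacement inside a batch is a choice made for this sketch; the text only states that the sequences are extracted randomly.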
[0079] In step S202 of some embodiments, the simulated tag sequence may refer to a tag sequence obtained by base mutation of a sample tag sequence. It is understood that the sample tag sequence may represent a second-generation barcode sequence, and the simulated tag sequence may represent a third-generation barcode sequence. Edit distance can refer to the minimum number of edit operations required to transform one sequence into another, where an edit operation may be an insertion, a deletion, or a substitution. For example, for strings, "kitten" can be transformed into "sitting" through the following edit path P1:
[0080] The first step is to replace the character 'k' with the character 's' (kitten → sitten);
[0081] The second step is to replace the character 'e' with the character 'i' (sitten → sittin);
[0082] The third step is to insert the character 'g' after the character 'n' (sittin → sitting).
[0083] Therefore, it can be seen that the number of steps required to transform the string "kitten" into the string "sitting" is 3, i.e., the edit distance is 3. It is understood that the method for determining the sequence edit distance is similar to the method for determining the string edit distance described above, and will not be repeated in this embodiment.
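The edit distance described above corresponds to the standard Levenshtein distance. As an illustration only, it can be computed with the usual dynamic-programming recurrence; the function name `edit_distance` is chosen for this sketch and does not come from this application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The same function applies unchanged to barcode sequences, since they are strings over the alphabet A, C, G, T.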
[0084] The reference edit distance can refer to a pre-set edit distance. Target accuracy can refer to the correct recall rate when a sample label sequence obtained according to a certain sampling gradient is transformed into a simulated label sequence through the reference edit distance.
[0085] Referring to Figure 3, in some embodiments, the method for determining the simulated tag sequence may include, but is not limited to, steps S301 to S302:
[0086] S301, determine the base mutation rate corresponding to the sampling gradient;
[0087] S302, perform base mutations on the sample tag sequence corresponding to the sampling gradient according to the base mutation rate to obtain the simulated tag sequence.
[0088] In step S301 of some embodiments, the base mutation rate can refer to the probability of a base mutation occurring in the tag sequence. Each sampling gradient can be set with a corresponding base mutation rate. The specific value of the base mutation rate can be adaptively set as needed, and this embodiment of the application does not specifically limit this. For example, the base mutation rate can be set according to different third-generation sequencing technologies, such as 10%, 15%, and 20%.
[0089] In step S302 of some embodiments, the sample tag sequence extracted by the corresponding sampling gradient is subjected to base mutation according to the base mutation rate to obtain a simulated tag sequence. For example, when the base mutation rate is 10%, 10% of the bases in the sample tag sequence are randomly mutated to obtain the simulated tag sequence corresponding to the sample tag sequence.
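Steps S301 to S302 can be illustrated with the following sketch. It assumes that a base mutation is a substitution of one base by a different base and that the number of mutated positions is the rounded product of the mutation rate and the sequence length; neither detail is fixed by the text, and the name `mutate` is hypothetical.

```python
import random

BASES = "ACGT"


def mutate(barcode: str, mutation_rate: float, rng: random.Random) -> str:
    """Substitute round(rate * length) randomly chosen positions with a different base."""
    n_mut = round(mutation_rate * len(barcode))
    positions = rng.sample(range(len(barcode)), n_mut)
    seq = list(barcode)
    for p in positions:
        # Assumption: substitution only, and the new base always differs from the old one.
        seq[p] = rng.choice([b for b in BASES if b != seq[p]])
    return "".join(seq)


rng = random.Random(0)
original = "ACGTACGTACGTACGTACGT"       # 20 bp sample tag sequence
simulated = mutate(original, 0.10, rng)  # 10% mutation rate, so 2 bases change
print(original)
print(simulated)
```

A simulation closer to real third-generation error profiles would also introduce insertions and deletions; that extension is omitted here.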
[0090] Referring to Figure 4, in some embodiments, the method for determining the reference edit distance may include, but is not limited to, steps S401 to S403.
[0091] Step S401: Calculate the lower limit of the edit distance based on the base mutation rate and the tag sequence length;
[0092] Step S402: Determine the upper limit of the edit distance;
[0093] Step S403: Determine the range of the reference edit distance based on the upper limit and lower limit of the edit distance.
[0094] In step S401 of some embodiments, the tag sequence length can refer to a pre-set length of the tag sequence applicable to this embodiment, such as a tag sequence length of 20 bp. The maximum number of mutations in a tag sequence of this length can be calculated based on the base mutation rate. For example, with a base mutation rate of 10%, a 20 bp tag sequence will mutate a maximum of 2 base pairs during base mutation; therefore, the edit distance between this tag sequence and the mutated tag sequence is 2. This embodiment can use the edit distance calculated based on the base mutation rate and the tag sequence length as a lower limit for the edit distance; for example, based on the above example, the lower limit for the edit distance can be set to 2.
[0095] In step S402 of some embodiments, the upper limit of the edit distance can refer to the maximum possible value of the reference edit distance. The specific value of the upper limit of the edit distance can be dynamically adjusted according to actual needs, and this application embodiment does not specifically limit this. For example, the upper limit of the edit distance can be determined based on the reference accuracy. The reference accuracy can refer to a pre-set correct recall rate. The edit distance corresponding to the reference accuracy is used as the upper limit of the edit distance. For example, a test tag sequence is randomly sampled from the tag sequence database according to the sampling gradient, and the test tag sequence is mutated according to a preset base mutation rate to obtain a test simulation sequence of the test tag sequence. The test accuracy of the test tag sequence and the test simulation sequence at different edit distances is calculated. The test accuracy equal to the reference accuracy is determined, and the edit distance corresponding to the test accuracy is used as the upper limit of the edit distance. For example, if the reference accuracy is 0%, the edit distance corresponding to the test accuracy of 0% is used as the upper limit of the edit distance. For example, the upper limit of the edit distance can be determined to be 10. It is understood that the specific value of the reference accuracy can be adaptively set according to actual needs, and this application embodiment does not specifically limit this.
[0096] In step S403 of some embodiments, the edit distance range can be determined based on the upper limit and lower limit of the edit distance, and any edit distance within this range can be used as a reference edit distance. For example, when the edit distance range is determined to be [2, 10] based on the upper limit and lower limit of the edit distance, the reference edit distance can be any positive integer in [2, 10].
[0097] It is understandable that different edit distance ranges are obtained with the above method when the base mutation rate differs. For example, when the base mutation rate is 10%, the edit distance range is [2, 10]; when the base mutation rate is 15%, the edit distance range is [3, 10]; and when the base mutation rate is 20%, the edit distance range is [4, 10].
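The ranges in the example above follow from simple arithmetic, sketched below for a 20 bp tag sequence and an upper limit of 10. The function name `reference_edit_distances` is hypothetical, and rounding the product to the nearest integer is an assumption made for this sketch; the text does not say how non-integer products are handled.

```python
def reference_edit_distances(mutation_rate: float, tag_length: int, upper_limit: int):
    """Return every integer reference edit distance in [lower limit, upper limit]."""
    # Lower limit: number of bases mutated at this rate (rounding is an assumption).
    lower_limit = round(mutation_rate * tag_length)
    if lower_limit > upper_limit:
        raise ValueError("lower limit exceeds upper limit")
    return list(range(lower_limit, upper_limit + 1))


for rate in (0.10, 0.15, 0.20):
    distances = reference_edit_distances(rate, tag_length=20, upper_limit=10)
    print(f"{rate:.0%}: [{distances[0]}, {distances[-1]}]")
```

For the three rates this prints the ranges [2, 10], [3, 10], and [4, 10], matching the example.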
[0098] Referring to Figure 5, in some embodiments, step S202 may include, but is not limited to, steps S501 to S503.
[0099] Step S501: Calculate the target edit distance between the sample label sequence and the simulated label sequence;
[0100] Step S502: When it is determined that the target edit distance meets the preset edit conditions, the simulated label sequence corresponding to the target edit distance is used as the positive sample label sequence;
[0101] Step S503: For each reference edit distance, determine the first total number of positive sample label sequences whose target edit distance is equal to the reference edit distance, and calculate the target accuracy based on the first total number and the second total number of sample label sequences.
[0102] In step S501 of some embodiments, for each sampling batch, the edit distance between each simulated label sequence and every sample label sequence in the batch is calculated, that is, the number of steps required to transform a sample label sequence into the simulated label sequence is determined, resulting in multiple target edit distances. For example, when the sampling gradient is 100, 100 sample label sequences are randomly sampled in each sampling batch, and performing base mutations on each of them yields 100 simulated label sequences. The edit distance between each simulated label sequence and each of the 100 sample label sequences in the same sampling batch is then calculated, resulting in multiple target edit distances.
[0103] In step S502 of some embodiments, the preset editing condition can refer to the condition that determines whether the sample label sequence is correctly recalled. For example, in the embodiments of this application, if the target edit distance value is the smallest and unique, then the sample label sequence corresponding to the target edit distance is determined to be correctly recalled. The sample label sequence determined to be correctly recalled is taken as a positive sample label sequence. That is, in a certain sampling batch of a certain sampling gradient, for a certain sample label sequence, if the target edit distance between the sample label sequence and a certain simulated label sequence obtained in the same sampling batch is the smallest, and the edit path of the sample label sequence changing into the simulated label sequence is unique, then the sample label sequence is taken as a positive sample label sequence. It can be understood that in the above example of transforming the string "kitten" into the string "sitting", the following edit path P2 can also be included:
[0104] The first step is to insert the character 'g' after the character 'n' (kitten → kitteng);
[0105] The second step is to replace the character 'k' with the character 's' (kitteng → sitteng);
[0106] The third step is to replace the character 'e' with the character 'i' (sitteng → sitting).
[0107] As can be seen, edit path P2 also has 3 steps, but it differs from edit path P1. Therefore, although the edit distance between "kitten" and "sitting" is 3, the edit path for transforming the string "kitten" into the string "sitting" is not unique.
[0108] In step S503 of some embodiments, the target edit distance corresponding to each positive sample label sequence can be determined. Thus, to determine the target accuracy of a reference edit distance under a certain sampling gradient, for each sampling batch, the positive sample label sequences whose target edit distance equals the reference edit distance are identified, and their count is the first total number. The second total number can refer to the total number of sample label sequences in the sampling batch. The ratio of the first total number to the second total number gives the initial accuracy of that sampling batch for the reference edit distance under the sampling gradient. This ratio is calculated for every sampling batch to obtain multiple initial accuracies, and an aggregate such as a quantile or the mean of these initial accuracies is taken as the target accuracy of the sampling gradient at that reference edit distance. For example, the target accuracy when the sampling gradient is 100 and the reference edit distance is 2 can be obtained in this way.
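Steps S501 to S503 can be sketched as follows. The sketch rests on several assumptions that the text does not settle: `simulated[i]` is taken to be derived from `samples[i]`; a sequence counts as positive when the smallest edit distance is attained by exactly one sample label sequence and that sequence is the true origin; and the mean is used as the aggregate over batches. All function names are hypothetical.

```python
from statistics import mean


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def batch_accuracy(samples, simulated, reference_distances):
    """Initial accuracy of one batch for each reference edit distance."""
    first_total = {d: 0 for d in reference_distances}
    for i, sim in enumerate(simulated):
        dists = [edit_distance(s, sim) for s in samples]      # S501
        best = min(dists)
        # S502 (assumed reading): unique minimum that points back to the true origin.
        is_positive = dists.count(best) == 1 and dists[i] == best
        if is_positive and best in first_total:
            first_total[best] += 1
    second_total = len(samples)
    return {d: first_total[d] / second_total for d in reference_distances}  # S503


def target_accuracy(batches, reference_distances):
    """Aggregate the per-batch initial accuracies (mean is one permitted choice)."""
    per_batch = [batch_accuracy(s, m, reference_distances) for s, m in batches]
    return {d: mean([acc[d] for acc in per_batch]) for d in reference_distances}


# Toy batch: the last simulated sequence is closer to samples[0] than to its origin.
samples = ["AAAAAA", "CCCCCC", "GGGGGG", "AAACCC"]
simulated = ["AAAATT", "CCCCCG", "GGGGGG", "AAAACA"]
result = target_accuracy([(samples, simulated)], reference_distances=[1, 2])
print(result)  # {1: 0.25, 2: 0.25}
```

In the toy batch, one simulated sequence is recalled at distance 1 and one at distance 2; the unmutated sequence has distance 0, which lies outside the reference range, and the last sequence is assigned to the wrong origin, so neither contributes.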
[0109] In step S203 of some embodiments, the distribution models shown in Figures 6A to 6C can be constructed based on the target accuracy, sampling gradient, and reference edit distance. Here, BC represents the sampling gradient, LD represents the reference edit distance, and the percentage corresponding to each pair of BC and LD represents the target accuracy. The distribution models shown in Figures 6A, 6B, and 6C can correspond to different base mutation rates. For example, the distribution model shown in Figure 6A is constructed based on sample tag sequences that have undergone 10% base mutation, the distribution model shown in Figure 6B is constructed based on sample tag sequences that have undergone 15% base mutation, and the distribution model shown in Figure 6C is constructed based on sample tag sequences that have undergone 20% base mutation. In Figure 6A, the data in region 601 indicates that the sample label sequences obtained with a sampling gradient of 50 have a correct recall rate of 100.0% at an edit distance of 2.
[0110] In step S101 of some embodiments, second-generation sequencing may refer to short-read sequencing. The first genome of the target sample and the first target tag sequence of the target sample during second-generation sequencing are obtained. It is understood that the first target tag sequence also belongs to the tag sequence database. The first genome may include multiple genes. By aligning the first target tag sequence to the first genome, the gene corresponding to the first target tag sequence in the first genome can be determined. Thus, based on the relationship between the first target tag sequences and the genes of multiple target samples, the first relation data shown in equation (1) can be constructed.
[0111]
[0112] In equation (1), G11 to G1n represent genes, and b11 to b1n represent the first target tag sequence.
[0113] In step S102 of some embodiments, the gene gradient and internal edit distance of each gene in the first relation data can be determined. The gene gradient can refer to the gene library capacity, i.e., how many first target tag sequences express the gene. The internal edit distance can refer to the minimum edit distance between multiple first target tag sequences in which the gene is expressed. For example, if gene G11 is expressed in 100 first target tag sequences, then the gene gradient of gene G11 is 100, and the internal edit distance of gene G11 can refer to the minimum edit distance between these 100 first target tag sequences. Based on the distribution model, the gene gradient and internal edit distance of each gene in the first relation data, multiple genes are screened from the first relation data, and a target gene set is constructed based on the screened multiple genes.
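The two per-gene quantities defined above can be computed as in the following sketch. It assumes the first relation data is available as a mapping from each gene to the list of first target tag sequences in which it is expressed, that duplicate tags are counted once, and that the edit distance is the Levenshtein distance; the names `gene_metrics` and `edit_distance` and the example tags are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def gene_metrics(first_relation):
    """Map each gene to (gene gradient, internal edit distance).

    The internal edit distance is the minimum pairwise distance between the
    tags expressing the gene; it is None when fewer than two tags exist."""
    metrics = {}
    for gene, tags in first_relation.items():
        unique_tags = sorted(set(tags))  # assumption: duplicates count once
        gradient = len(unique_tags)
        if gradient < 2:
            internal = None
        else:
            internal = min(edit_distance(x, y)
                           for x, y in combinations(unique_tags, 2))
        metrics[gene] = (gradient, internal)
    return metrics


# Made-up tags for illustration.
relation = {"G11": ["ACGT", "ACGA", "TTTT"], "G12": ["AAAA"]}
print(gene_metrics(relation))  # {'G11': (3, 1), 'G12': (1, None)}
```

The pairwise comparison is quadratic in the gene gradient, so a gene expressed in many tag sequences may call for a faster neighbor search in practice.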
[0114] Referring to Figure 7, in some embodiments, step S102 may include, but is not limited to, steps S701 to S702.
[0115] Step S701: Determine the target accuracy of each gene in the first relation data based on gene gradient, internal edit distance, sampling gradient and reference edit distance;
[0116] Step S702: Select the target gene set from the first relation data based on the target accuracy corresponding to each gene in the first relation data.
[0117] In step S701 of some embodiments, the gene gradient of each gene in the first relation data is matched with the sampling gradients, and the internal edit distance of each gene is matched with the reference edit distances, to obtain the target accuracy corresponding to each gene in the first relation data. For example, if the gene gradient of a certain gene in the first relation data is 200 and its internal edit distance, matched to the reference edit distance, is 10, then according to the distribution model shown in Figure 6A, the target accuracy for this gene is 0.0%.
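The matching in step S701 amounts to a table lookup, sketched below. The table contains only the two Figure 6A values quoted in this description; the remaining cells of the figure are not reproduced here. The text does not state how a gene gradient that falls between two sampling gradients is matched, so rounding up to the next sampling gradient is purely an assumption of this sketch, and `lookup_accuracy` is a hypothetical name.

```python
from bisect import bisect_left

# (sampling gradient BC, reference edit distance LD) -> target accuracy in percent.
# Only the two cells quoted in the text for Figure 6A are included.
MODEL_6A = {(50, 2): 100.0, (200, 10): 0.0}


def lookup_accuracy(model, gene_gradient, internal_distance):
    """Match a gene's (gradient, internal edit distance) to a model cell.

    Assumption: a gene gradient between two sampling gradients is matched to
    the next larger sampling gradient. Returns None when no cell applies."""
    gradients = sorted({bc for bc, _ in model})
    idx = bisect_left(gradients, gene_gradient)
    if idx == len(gradients):
        return None  # gene gradient exceeds every modelled sampling gradient
    return model.get((gradients[idx], internal_distance))


print(lookup_accuracy(MODEL_6A, 200, 10))  # 0.0
print(lookup_accuracy(MODEL_6A, 50, 2))    # 100.0
```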
[0118] It is understandable that, based on the gene gradient and internal edit distance corresponding to each gene in the first relational data, the relational data [gene gradient, internal edit distance] shown in the following equation (2) can be constructed.
[0119]
[0120] In step S702 of some embodiments, all genes in the first relation data are sorted according to the target accuracy corresponding to each gene in the first relation data. Multiple genes with an accuracy not less than a preset accuracy threshold can be selected from the sorted data, and a target gene set is constructed based on the selected genes. It is understood that the specific value of the accuracy threshold can be adaptively set according to actual needs, and this embodiment of the application does not specifically limit it in this regard.
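The sorting and thresholding of step S702 can be sketched as follows, assuming the target accuracy of each gene is already known and expressed in percent. The function name, the example accuracies, and the threshold of 95.0 are illustrative assumptions; genes without a matched accuracy are simply skipped in this sketch.

```python
def select_target_genes(gene_accuracy, threshold):
    """Return genes whose target accuracy is not less than the threshold,
    ordered from highest to lowest accuracy."""
    ranked = sorted(
        ((gene, acc) for gene, acc in gene_accuracy.items() if acc is not None),
        key=lambda item: item[1],
        reverse=True,
    )
    return [gene for gene, acc in ranked if acc >= threshold]


# Made-up accuracies for illustration.
accuracies = {"G11": 100.0, "G12": 0.0, "G13": 97.5, "G14": None}
print(select_target_genes(accuracies, 95.0))  # ['G11', 'G13']
```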
[0121] In step S103 of some embodiments, third-generation sequencing can refer to long-read sequencing. The second genome of the target sample and the second target tag sequence of the target sample during third-generation sequencing are obtained. It is understood that the second target tag sequence also belongs to the tag sequence database. The second genome may include multiple genes. Based on the second target tag sequence corresponding to each target sample and multiple genes, the second relationship data shown in equation (3) can be constructed.
[0122]
[0123] In equation (3), G21 to G2n represent genes, and b21 to b2n represent second target tag sequences.
[0124] Referring to Figure 8, in some embodiments, before step S104, the method provided in this application may include, but is not limited to, steps S801 to S802.
[0125] Step S801: Determine the expression level of each gene in the second relation data in the target sample;
[0126] Step S802: Sort the genes corresponding to each second target tag sequence in the second relation data based on expression levels.
[0127] In steps S801 to S802 of some embodiments, expression level may refer to the number of UMIs. In the second relational data, for each second target tag sequence, the corresponding genes are sorted based on expression level, such that genes ranked earlier have higher expression levels than genes ranked later, i.e., more reads. For example, for the second relational data b2n: [G21, G22, G23], the expression level of gene G21 > the expression level of gene G22 > the expression level of gene G23.
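The ordering in steps S801 to S802 can be sketched as a per-tag sort by UMI count. The sketch assumes the second relation data maps each second target tag sequence to its genes and that UMI counts are available per (tag, gene) pair; that data layout, the function name, and the counts are assumptions for illustration.

```python
def sort_genes_by_expression(second_relation, umi_counts):
    """For each second target tag sequence, order its genes from highest to
    lowest UMI count. Genes without a recorded count are treated as 0."""
    return {
        tag: sorted(genes,
                    key=lambda gene: umi_counts.get((tag, gene), 0),
                    reverse=True)
        for tag, genes in second_relation.items()
    }


# Made-up UMI counts for illustration.
second_relation = {"b2n": ["G22", "G23", "G21"]}
umi_counts = {("b2n", "G21"): 30, ("b2n", "G22"): 12, ("b2n", "G23"): 4}
print(sort_genes_by_expression(second_relation, umi_counts))
# {'b2n': ['G21', 'G22', 'G23']}
```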
[0128] In step S104 of some embodiments, the genes included in the target gene set are matched with the genes in the second relationship data, and the second target tag sequence corresponding to the matched gene in the second relationship data is used as the second alignment tag sequence. The second alignment tag sequence is matched with the first target tag sequence in the first relationship data, and a first alignment tag sequence is obtained by screening from multiple first target tag sequences. The first alignment tag sequence and the second alignment tag sequence are used as the optimal alignment, that is, the second alignment tag sequence can be corrected based on the first alignment tag sequence. It can be understood that the first alignment tag sequence is a second-generation barcode, and the second alignment tag sequence is a third-generation barcode. Therefore, by sorting the genes corresponding to the second target tag sequence by expression level, genes with high expression levels can be preferentially matched with genes in the target gene set. In this way, the accuracy of determining the second alignment tag sequence can be improved, which in turn can improve the accuracy of determining the first alignment tag sequence.
[0129] Referring to Figure 9, in some embodiments, step S104 may include, but is not limited to, steps S901 to S903.
[0130] Step S901: Based on the target gene set, perform traversal matching on the genes corresponding to each second target tag sequence, and determine the second alignment tag sequence from the second target tag sequence according to the matching results;
[0131] Step S902: Based on the genes in the target gene set that correspond to the second alignment tag sequence, determine the third target tag sequence from the first target tag sequence;
[0132] Step S903: Based on the alignment edit distance between the third target label sequence and the second alignment label sequence, determine the first alignment label sequence from the third target label sequence.
[0133] In step S901 of some embodiments, the gene corresponding to each second target tag sequence is matched with the target gene set. If there is a gene that is the same as a gene in the target gene set, the second target tag sequence corresponding to that gene is used as the second alignment tag sequence. For example, if the gene G22 included in the second target sequence tag b22 is the same as the gene Gx in the target gene set, then the second target sequence tag b22 is used as the second alignment tag sequence.
[0134] In step S902 of some embodiments, based on genes that successfully match the target gene set, genes in the first relationship data that are identical to the successfully matched genes are identified. Multiple first target tag sequences corresponding to these identical genes are then used as third target tag sequences. For example, if genes G12 and G22 are determined to be identical in the first relationship data, then multiple first target tag sequences (such as b12, b15, and b1n) corresponding to gene G12 are used as third target tag sequences.
[0135] In step S903 of some embodiments, the second alignment tag sequence is compared with each of the multiple third target tag sequences, and the number of edit steps is counted, thereby determining the alignment edit distance between the second alignment tag sequence and each third target tag sequence. The multiple alignment edit distances are compared, and the third target tag sequence corresponding to the smallest and unique alignment edit distance is taken as the first alignment tag sequence. For example, if the second alignment tag sequence b22 has the smallest alignment edit distance with the third target tag sequence b15, and the editing path from the third target tag sequence b15 to the second alignment tag sequence b22 is unique, then the third target tag sequence b15 can be taken as the first alignment tag sequence. Thus, the second alignment tag sequence b22 can be corrected based on the first alignment tag sequence b15.
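Steps S901 to S903 can be sketched end to end as follows. This is a simplified illustration, not the implementation of this application. It assumes the genes of a second target tag sequence are already sorted by expression level, takes the first of them found in the target gene set as the match, and interprets "smallest and unique" as meaning that exactly one candidate attains the minimum alignment edit distance, which is an assumption of this sketch rather than the edit-path wording used above. All names and tag sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def find_first_alignment_tag(second_tag, genes, target_genes, first_relation):
    """Return the first alignment tag sequence for second_tag, or None."""
    # S901: the tag qualifies if one of its genes is in the target gene set.
    matched = next((g for g in genes if g in target_genes), None)
    if matched is None:
        return None
    # S902: first target tags expressing the matched gene are the candidates.
    candidates = set(first_relation.get(matched, []))
    if not candidates:
        return None
    # S903: keep the candidate only if it alone has the smallest distance.
    distances = {c: edit_distance(second_tag, c) for c in candidates}
    best = min(distances.values())
    winners = [c for c, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else None


# Made-up tags for illustration.
first_relation = {"G12": ["ACGTAA", "TTTTTT", "ACCTAG"]}
print(find_first_alignment_tag("ACGTAT", ["G25", "G12"], {"G12"}, first_relation))
# ACGTAA
```

Returning None on a tie leaves the third-generation tag uncorrected instead of guessing between equally close candidates.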
[0136] The sequence correction method provided in this application can determine the target accuracy of sample tag sequences and simulated tag sequences at different reference edit distances, thus making the range of target accuracy controllable when the range of reference edit distance is set. Based on this, this application provides a method for determining a target gene set based on the target accuracy, and then determining the first alignment tag sequence and the second alignment tag sequence based on the target gene set, making the correction accuracy of the second alignment tag sequence based on the first alignment tag sequence controllable. This application constructs different distribution models based on sampling gradient and reference edit distance, providing an analytical basis for subsequent accurate barcode identification. This application determines the first alignment tag sequence and the second alignment tag sequence based on barcode, gene, and UMI triple information, which can improve the accuracy of correcting third-generation barcodes based on second-generation barcodes.
[0137] Referring to Figure 10, this application also provides a sequence correction device, which includes:
[0138] The first relation data determination unit 1001 is used to obtain the first relation data under the second-generation sequencing of the target sample, wherein the first relation data is used to describe the relationship between each gene in the target sample and the first target tag sequence.
[0139] The gene set determination unit 1002 is used to determine the target gene set from the first relation data based on the pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data. The distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients; the gene gradient is used to represent the number of first target tag sequences corresponding to the gene.
[0140] The second relation data determination unit 1003 is used to obtain the second relation data under the third generation sequencing of the target sample, wherein the second relation data is used to describe the association between each gene in the target sample and the second target tag sequence;
[0141] The second sequence acquisition unit 1004 is used to select a first alignment tag sequence from the first relation data based on the target gene set, and to select a second alignment tag sequence from the second relation data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
[0142] It is evident that the content of the above sequence correction method embodiments is applicable to the embodiments of this sequence correction device. The specific functions implemented by the embodiments of this sequence correction device are the same as those of the above sequence correction method embodiments, and the beneficial effects achieved are also the same as those achieved by the above sequence correction method embodiments.
[0143] Referring to Figure 11, Figure 11 illustrates the hardware structure of an electronic device according to another embodiment. The electronic device includes:
[0144] The processor 1101 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0145] The memory 1102 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1102 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1102 and is called by the processor 1101 to execute the sequence correction method of the embodiments of this application.
[0146] Input / output interface 1103 is used to implement information input and output;
[0147] The communication interface 1104 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0148] Bus 1105 transmits information between various components of the device (e.g., processor 1101, memory 1102, input / output interface 1103, and communication interface 1104);
[0149] The processor 1101, memory 1102, input / output interface 1103 and communication interface 1104 are connected to each other within the device via bus 1105.
[0150] This application also provides a computer program product, which includes a computer program. A processor of a computer device reads and executes the computer program, causing the computer device to perform the sequence correction method described above.
[0151] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in this disclosure and the foregoing drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “including,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatuses.
[0152] It should be understood that in this disclosure, "at least one item" means one or more, and "multiple" means two or more. "And/or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0153] It should be understood that in the description of the embodiments of this application, "multiple" means two or more, "greater than", "less than", "exceeding" etc. are understood to exclude the number itself, and "above", "below", "within" etc. are understood to include the number itself.
[0154] In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.
[0155] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0156] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0157] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0158] It should also be understood that the various implementation methods provided in this application can be combined arbitrarily to achieve different technical effects.
[0159] The above is a detailed description of the embodiments of this disclosure. However, this disclosure is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this disclosure. All such equivalent modifications or substitutions are included within the scope defined by the claims of this disclosure.
Claims
1. A sequence correction method, characterized in that the method includes: obtaining first relation data from second-generation sequencing of the target sample, wherein the first relation data is used to describe the association between each gene in the target sample and the first target tag sequence; determining a target gene set from the first relation data based on a pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data, wherein the distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients, and the gene gradient is used to represent the number of the first target tag sequences corresponding to the gene; obtaining second relation data from third-generation sequencing of the target sample, wherein the second relation data is used to describe the association between each gene in the target sample and the second target tag sequence; and selecting a first alignment tag sequence from the first relation data based on the target gene set, and selecting a second alignment tag sequence from the second relation data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
2. The method according to claim 1, characterized in that, The method for constructing the distribution model includes: Based on a preset sampling gradient, sequence extraction is performed on a preset label sequence database to obtain sample label sequences; Based on the sample label sequence and the simulated label sequence of the sample label sequence, determine the target accuracy of the simulated label sequence at different reference edit distances; The distribution model is constructed based on the sampling gradient, the reference edit distance, and the target accuracy.
3. The method according to claim 2, characterized in that, The method for determining the simulated tag sequence includes: Determine the base mutation rate corresponding to the sampling gradient; The sample tag sequence corresponding to the sampling gradient is subjected to base mutation according to the base mutation rate to obtain the simulated tag sequence.
4. The method according to claim 2, characterized in that, The determination of the target accuracy of the simulated label sequence at different reference edit distances based on the sample label sequence and the simulated label sequence of the sample label sequence includes: Calculate the target edit distance between the sample label sequence and the simulated label sequence; When it is determined that the target edit distance meets the preset edit conditions, the simulated label sequence corresponding to the target distance is taken as the positive sample label sequence; For each reference edit distance, a first total number of positive sample label sequences whose target edit distance is equal to the reference edit distance is determined, and the target accuracy is calculated based on the first total number and the second total number of sample label sequences.
5. The method according to claim 3, characterized in that the method for determining the reference edit distance includes: calculating the lower limit of the edit distance based on the base mutation rate and the tag sequence length; determining the upper limit (maximum) of the edit distance; and determining the range of the reference edit distance based on the upper limit of the edit distance and the lower limit of the edit distance.
6. The method according to claim 1, characterized in that, The step of determining the target gene set from the first relation data based on the pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data includes: The target accuracy for each gene in the first relational data is determined based on the gene gradient, the internal edit distance, the sampling gradient, and the reference edit distance. The target gene set is selected from the first relation data based on the target accuracy corresponding to each gene in the first relation data.
7. The method according to claim 1, characterized in that, The step of selecting a first alignment tag sequence from the first relation data based on the target gene set and selecting a second alignment tag sequence from the second relation data includes: Based on the target gene set, the genes corresponding to each second target tag sequence are traversed and matched, and the second alignment tag sequence is determined from the second target tag sequence according to the matching results; Based on the genes in the target gene set that correspond to the second alignment tag sequence, a third target tag sequence is determined from the first target tag sequence; The first alignment tag sequence is determined from the third target tag sequence based on the alignment edit distance between the third target tag sequence and the second alignment tag sequence.
8. The method according to claim 7, characterized in that, Before selecting a first alignment tag sequence from the first relation data based on the target gene set and selecting a second alignment tag sequence from the second relation data, the method further includes: Determine the expression level of each gene in the second relationship data in the target sample; The genes corresponding to each second target tag sequence in the second relation data are sorted based on the expression levels.
9. A sequence correction device, characterized in that, The device includes: The first relation data determination unit is used to obtain the first relation data under the second-generation sequencing of the target sample, wherein the first relation data is used to describe the relationship between each gene in the target sample and the first target tag sequence. A gene set determination unit is used to determine a target gene set from the first relation data based on a pre-constructed distribution model, the gene gradient of each gene in the first relation data, and the internal edit distance of each gene in the first relation data. The distribution model is used to describe the target accuracy at different reference edit distances under different sampling gradients; the gene gradient is used to represent the number of the first target tag sequences corresponding to the genes. The second relationship data determination unit is used to obtain the second relationship data under the third-generation sequencing of the target sample, wherein the second relationship data is used to describe the association between each gene in the target sample and the second target tag sequence; The second sequence acquisition unit is used to select a first alignment tag sequence from the first relationship data according to the target gene set, and to select a second alignment tag sequence from the second relationship data, wherein the first alignment tag sequence is used to correct the second alignment tag sequence.
10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the sequence correction method according to any one of claims 1 to 8.
11. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it implements the sequence correction method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program that is read and executed by a processor of a computer device, causing the computer device to perform the sequence correction method according to any one of claims 1 to 8.