Sequencing data grading methods, systems, equipment and media

CN116844648B · Active · Publication Date: 2026-06-30 · BOE TECHNOLOGY GROUP CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BOE TECHNOLOGY GROUP CO LTD
Filing Date
2023-06-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing targeted sequencing technologies are mainly used to detect rearrangement features of IG or TCR, which has limited applicability and cannot comprehensively detect lymphoma-related fusion genes and clonal rearrangement information.

Method used

After acquiring targeted sequencing data and performing data preprocessing, the fusion gene score is calculated by combining fusion gene identification and clonal rearrangement identification, and the sequencing data is graded by integrating detection methods such as IG/TCR rearrangement.

Benefits of technology

It improves the accuracy and applicability of sequencing data grading, enabling more comprehensive detection of lymphoma-related fusion genes and clonal rearrangement information, and adapting to the needs of large-scale, high-throughput, and high-precision sequencing.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN116844648B_ABST

Abstract

This disclosure provides a method, system, device, and medium for grading sequencing data. The grading method includes: acquiring targeted sequencing data; performing data preprocessing on the targeted sequencing data to obtain sample sequences; identifying fusion genes in the sample sequences to determine if fusion genes exist, obtaining fusion gene identification results; and performing clonal rearrangement identification on the sample sequences to determine if a master clone exists; calculating fusion gene scores based on the fusion gene identification results; and grading the sample sequences based on the fusion gene scores and the clonal rearrangement identification results.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics, specifically to a method, system, computer equipment, and storage medium for grading sequencing data.

Background Technology

[0002] Currently, the mainstream technology used for lymphoma-related detection is multiplex amplification in targeted sequencing, but this technology is currently used only to detect rearrangement features of IG (immunoglobulin) or TCR (T cell receptor).

Summary of the Invention

[0003] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.

[0004] This disclosure provides a method, system, computer device, and storage medium for grading sequencing data, which can integrate multiple types of information to grade sequencing data.

[0005] In one aspect, embodiments of this disclosure provide a method for grading sequencing data, including:

[0006] Obtain targeted sequencing data, and perform data preprocessing on the targeted sequencing data to obtain sample sequences;

[0007] Fusion gene identification for targeted sequencing is performed on the sample sequences to determine whether a fusion gene exists and to obtain the fusion gene identification result. Clonal rearrangement identification is also performed on the sample sequences to determine whether a master clone exists.

[0008] The fusion gene score is calculated based on the fusion gene identification results, and the sample sequence is graded based on the fusion gene score and the clonal rearrangement identification results.

[0009] In another aspect, embodiments of this disclosure also provide a grading system for sequencing data, including:

[0010] The preprocessing module is used to acquire targeted sequencing data and perform data preprocessing on the targeted sequencing data to obtain sample sequences;

[0011] The identification module is used to perform fusion gene identification for targeted sequencing on the sample sequences to determine whether a fusion gene exists and obtain the fusion gene identification result, and to perform clonal rearrangement identification on the sample sequences to determine whether a master clone exists; and

[0012] The grading module is used to calculate the fusion gene score based on the fusion gene identification results, and to grade the sample sequence based on the fusion gene score and the clonal rearrangement identification results.

[0013] In another aspect, embodiments of this disclosure also provide a computer-readable storage medium storing computer-executable instructions for implementing the above-described sequencing data grading method.

[0014] In another aspect, embodiments of this disclosure also provide a computer device, including a processor and a memory storing a computer program that can run on the processor, wherein the processor executes the program to implement the steps in the sequencing data grading method described above.

[0015] This disclosure discloses a method for grading sequencing data by combining the detection results of fusion genes and IG / TCR rearrangement detection methods, rather than using a single gene feature for grading, which has higher accuracy and wider applicability.

[0016] After reading and understanding the accompanying drawings and detailed description, other aspects can be understood.

Attached Figure Description

[0017] The accompanying drawings are provided to further illustrate the technical solutions of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure. The shape and size of one or more modules in the drawings do not reflect actual proportions and are only intended to illustrate the content of this disclosure.

[0018] Figure 1 is a flowchart of a sequencing data grading method according to an embodiment of this disclosure;

[0019] Figure 2 is a schematic diagram of a sequencing data grading system according to an embodiment of the present disclosure;

[0020] Figure 3 is a schematic diagram of the preprocessing module;

[0021] Figure 4 is a schematic diagram of the identification module;

[0022] Figure 5 is a schematic diagram of a computer device according to an embodiment of the present disclosure.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this disclosure clearer, the embodiments of this disclosure will be described in detail below. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be arbitrarily combined with each other.

[0024] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings. The implementation can be carried out in many different forms. Those skilled in the art will readily understand that the methods and content can be changed to one or more forms without departing from the spirit and scope of this disclosure. Therefore, this disclosure should not be construed as limited to the content described in the following embodiments. Without conflict, the embodiments and features in the embodiments of this disclosure can be arbitrarily combined with each other.

[0025] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. While any methods and materials similar to or equivalent to those described herein may be used to practice or test this disclosure, preferred methods and materials are described. For the purposes of this disclosure, the following terms are defined below.

[0026] In this application, unless otherwise expressly stated, the use of the singular includes the plural. It must be noted that, unless the context clearly indicates otherwise, the singular forms “a,” “an,” and “the” as used herein include plural references. In this application, unless otherwise stated, the use of “or” means “and / or.”

[0027] The terms “about” or “approximately” mean within an acceptable range of error for a particular value, as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, “about” may, according to practice in the art, mean within one or more standard deviations. Alternatively, “about” may mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a particular value. For example, “about 10” includes quantities from 9 to 11.

[0028] In other instances, the term "about" when referring to a reference value may also include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1%. Optionally, particularly relating to biological systems or processes, the term "about" may mean within an order of magnitude of the value. When a particular value is described in this application and claims, unless otherwise stated, the term "about" should be assumed to mean within an acceptable range of error for the particular value.

[0029] As used in this specification and one or more claims, the terms “comprising” (and any form of “comprising” such as “comprise” and “comprises”), “having” (and any form of “having” such as “have” and “has”), “including” (and any form of “including” such as “includes” and “include”), or “containing” (and any form of “containing” such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unlisted elements or method steps. It is contemplated that any embodiments discussed in this specification can be implemented with reference to any method or combination of this disclosure, and vice versa. Furthermore, combinations of this disclosure can be used to implement methods of this disclosure.

[0030] This disclosure provides a method for grading sequencing data. As shown in Figure 1, it includes the following steps 10-30.

[0031] Step 10: Obtain targeted sequencing data, and perform data preprocessing on the targeted sequencing data to obtain sample sequences;

[0032] Step 20: Perform fusion gene identification for targeted sequencing on the sample sequences to determine whether a fusion gene exists and obtain the fusion gene identification result; and perform clonal rearrangement identification on the sample sequences to determine whether a master clone exists.

[0033] Step 30: Calculate the fusion gene score based on the fusion gene identification results, and grade the sample sequences according to the fusion gene score and the clonal rearrangement identification results.

[0034] The method described in this disclosure classifies sequencing data by integrating the detection results of gene fusion and IG / TCR rearrangement detection methods, rather than using a single gene feature for classification, resulting in higher accuracy and wider applicability.

[0035] Furthermore, by using targeted sequencing data for detection, the high flexibility of targeted sequencing allows for modification of sequencing content, making it easier to add or remove gene combinations in the future, facilitating technology updates and iterations. In addition, targeted sequencing has lower sequencing data volume and cost compared to whole genome sequencing, transcriptome sequencing, and other methods, and has lower requirements for data storage and data processing time.

[0036] In one exemplary embodiment, the preprocessing in step 10 above may include one or more of the following:

[0037] 1) Adapter sequence removal is performed on the sequencing data obtained after target sequencing;

[0038] The sequencing data may be data captured by hybridization.

[0039] For example, the adapter sequence can be identified by comparison. During excision, the adapter sequence with a length greater than the preset adapter length can be excised. The preset adapter length can be 2-4 bp (base pairs), for example, 3 bp.

[0040] 2) Filter the sequencing data after adapter sequence removal. This filtering includes one or more of the following: quality filtering and length filtering. Quality filtering includes filtering out sequences and / or segments with quality values lower than a preset quality value, which can be 20-30, for example, 25. Length filtering includes filtering out sequences shorter than a preset length value. The preset length value can vary depending on the library. For example, for libraries with paired-end sequencing lengths greater than or equal to 150 bp (such as PE150, PE250, PE300, etc.), the preset length value can range from 70-90 bp, for example, 80 bp. The preset length can also be adjusted between 50-120 bp as needed. For libraries with paired-end sequencing lengths less than 250 bp (such as PE150), the preset length value can be 80-180 bp, for example, 120 bp. PE indicates paired-end sequencing. If the filtering includes both quality filtering and length filtering, the order is not limited; quality filtering can be performed first, followed by length filtering, or vice versa.

[0041] By excising the adapter sequences and filtering the resulting sequences, subsequent identification can be made more accurate.
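The quality and length filtering of step 2) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Read` tuple layout, the function names, and the use of the mean per-base Phred quality as the quality value are assumptions, and the defaults (25 and 80 bp) are simply the example values given above.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical read representation: (name, bases, per-base Phred qualities).
Read = Tuple[str, str, List[int]]


def mean_quality(quals: List[int]) -> float:
    """Mean Phred quality of a read (assumed here to be the 'quality value')."""
    return sum(quals) / len(quals) if quals else 0.0


def filter_reads(
    reads: Iterable[Read],
    min_quality: float = 25,  # preset quality value (text: 20-30, e.g. 25)
    min_length: int = 80,     # preset length value (text: e.g. 80 bp)
) -> Iterator[Read]:
    """Yield reads that pass both the length filter and the quality filter."""
    for name, seq, quals in reads:
        if len(seq) < min_length:
            continue  # length filtering: drop sequences shorter than the preset
        if mean_quality(quals) < min_quality:
            continue  # quality filtering: drop sequences below the preset
        yield name, seq, quals


example = [
    ("r1", "A" * 100, [30] * 100),  # long enough, high quality
    ("r2", "A" * 50, [30] * 50),    # too short
    ("r3", "A" * 100, [10] * 100),  # low quality
]
kept = [name for name, _, _ in filter_reads(example)]
```

Because the text leaves the order of the two filters open, the sketch applies the cheaper length check first; swapping the two checks gives the same set of retained reads.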

[0042] Targeted sequencing is a high-throughput sequencing technology based on specific sequences, used to selectively analyze or detect target DNA or RNA sequences. Targeted sequencing uses probes to capture DNA fragments from specific genomic regions in a sample, followed by library construction, and then sequencing and quantitative analysis using a high-throughput sequencer. Compared to whole-genome sequencing, targeted sequencing offers advantages such as low cost, high efficiency, and high sensitivity, and can adapt to large-scale, high-throughput, and high-precision sequencing needs.

[0043] In one exemplary embodiment, step 20 may include steps 21-22, wherein the order of steps 21 and 22 is interchangeable. By identifying the fusion gene and the master clone, the sequencing data category can be identified from multiple dimensions, thus broadening its applicability.

[0044] Step 21 involves identifying fusion genes through targeted sequencing of the sample sequences and selecting candidate fusion gene pairs; specifically including:

[0045] 1) Align the sample sequences to the genome. Genome versions include, but are not limited to, any one or more of the following: hs1, hg38, and hg19. BWA can be used as the alignment software. After alignment, SAMtools should be used for format conversion, sorting, and indexing.

[0046] Genome versions hs1, hg38, and hg19 are all common human genome sequence versions. hs1, also known as T2T-CHM13V2.0, where T2T indicates telomere-to-telomere assembly, representing a complete genome version, was released on January 24, 2022, and is currently the most up-to-date genome version, with a total length of 3,117,275,501 bp. hg38 and hg19 were released in 2013 and 2009, respectively, with hg38 currently being the most widely used human genome sequence version.

[0047] Sequence alignment to the genome is the process of comparing a known DNA or RNA sequence with the genome sequence of a species or strain to find the location of these sequences in the genome and related information.

[0048] BWA (Burrows-Wheeler Aligner) is a software tool for DNA sequence alignment that efficiently aligns short sequences to a reference genome. Compared to other algorithms, BWA is more efficient and accurate.

[0049] SAMtools is a software tool for processing SAM (Sequence Alignment / Map) and BAM (Binary Alignment / Map) format files.
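The alignment step can be sketched as a short command pipeline. The sketch below only builds the command lines; the file names, the `prefix` convention, and the thread count are illustrative assumptions, the reference is assumed to have been indexed beforehand with `bwa index`, and a reasonably recent BWA and SAMtools are assumed (so that `bwa mem -o` is available and `samtools sort` infers BAM output from the `.bam` extension).

```python
import subprocess
from typing import List


def build_alignment_commands(
    reference: str, fastq1: str, fastq2: str, prefix: str, threads: int = 4
) -> List[List[str]]:
    """Return the BWA / SAMtools commands for alignment, sorting and indexing."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # Align paired-end reads to the reference genome (e.g. hs1, hg38 or hg19).
        ["bwa", "mem", "-t", str(threads), "-o", sam, reference, fastq1, fastq2],
        # Convert SAM to BAM and sort by coordinate in a single step.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Index the sorted BAM file.
        ["samtools", "index", bam],
    ]


def run_alignment(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
commands = build_alignment_commands(
    "hg38.fa", "sample_1.fq.gz", "sample_2.fq.gz", "sample"
)
```

Calling `run_alignment(commands)` would execute the three steps; it is kept separate so the command lines can be inspected or logged first.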

[0050] 2) Fusion gene detection was performed using FuSeq_WES for targeted sequencing. The results were compared with known fusion genes to identify candidate fusion gene pairs.

[0051] Fusion genes can help determine disease type and prognosis, guiding doctors to choose the most suitable treatment. Furthermore, detecting changes in fusion genes allows for monitoring treatment effectiveness and disease progression, enabling timely adjustments to treatment plans. Fusion genes can also serve as drug targets: some fusion genes can act as tumor-specific biomarkers, becoming novel targets for cancer therapy.

[0052] FuSeq_WES is a software tool for detecting gene fusion events from DNA sequencing data such as WES (whole exome sequencing) data. Gene fusion refers to the linkage of the DNA sequences of two or more genes with each other under certain conditions to form a new gene. By using this tool to detect fusion genes in the targeted sequencing data, fusion gene identification results can be obtained, that is, candidate fusion gene pairs can be identified.
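The comparison of detected fusions with known fusion genes can be sketched as a simple set lookup. This is an illustrative sketch only: the function names are hypothetical, the gene pairs are example inputs rather than output of the disclosed method, and treating a pair as order-insensitive (A-B equals B-A) is an assumption.

```python
from typing import FrozenSet, Iterable, List, Tuple

GenePair = Tuple[str, str]


def pair_key(pair: GenePair) -> FrozenSet[str]:
    """Order-insensitive, case-insensitive key for a gene pair (assumption)."""
    a, b = pair
    return frozenset((a.upper(), b.upper()))


def select_candidate_pairs(
    detected: Iterable[GenePair], known: Iterable[GenePair]
) -> List[GenePair]:
    """Keep detected fusion pairs that also appear in the known-fusion list."""
    known_keys = {pair_key(p) for p in known}
    seen = set()
    candidates = []
    for pair in detected:
        key = pair_key(pair)
        if key in known_keys and key not in seen:
            seen.add(key)            # report each pair only once
            candidates.append(pair)
    return candidates


# Illustrative inputs only.
detected_pairs = [("BCL2", "IGH"), ("IGH", "BCL2"), ("GENE_X", "GENE_Y")]
known_pairs = [("IGH", "BCL2"), ("MYC", "IGH")]
candidates = select_candidate_pairs(detected_pairs, known_pairs)
```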

[0053] Step 22 involves performing clonal rearrangement identification on the sample sequence to determine the presence of a master clone; specifically including:

[0054] 1) Perform IG and TCR rearrangement identification on the sample sequences;

[0055] IG and TCR refer to immunoglobulin and T cell receptor, respectively. Identifying rearrangements yields various rearrangement information; each rearrangement represents a clone, which is used in the next step of unique clone filtering.

[0056] Identification methods can include Mixcr or Vidjil. Both Mixcr and Vidjil are software tools for sequence analysis of immune receptor genes. Mixcr can be used to identify, align, and annotate immune receptor gene sequences from high-throughput sequencing data (such as single-cell RNA sequencing, B / T lymphocyte clonal amplification, etc.). Mixcr supports processing various types of immune receptor genes, including heavy and light chains of IG and TCR, and provides rich functional and parameter setting options. Using Mixcr, clonal amplification events and their variation information in different samples can be quickly and effectively identified, and immune receptor gene expression levels can be analyzed and compared. If Mixcr is used, the identification method can be 'Exome-CDR3'. Exome-CDR3 is a method for immune receptor gene analysis that performs targeted sequencing of the exome and then uses CDR3 region-specific primers to detect and analyze the CDR3 subregion (complementarity-determining region 3) sequence of the immune receptor. CDR3 is one of the highly variable regions in the immune receptor and plays a crucial role in antigen recognition and selection by the immune system. By utilizing exome sequencing technology and CDR3 region-specific primers, Exome-CDR3 can rapidly and accurately identify and analyze the CDR3 subregion sequences of the immune receptor in multiple samples, and assess their impact on physiological and pathological processes. For example, in fields such as tumor immunotherapy and cancer immune surveillance, Exome-CDR3 can help detect and analyze tumor-associated T-cell and B-cell clonal expansion events, assess their association with tumor antigen specificity, and provide valuable references for personalized immunotherapy.

[0057] Vidjil is primarily designed for analyzing single-cell sequencing data. Similar to Mixcr, Vidjil can also be used to identify and annotate immune receptor gene sequences, and it also offers clustering-based and visualization-based data analysis methods. Vidjil boasts good compatibility and ease of use, supporting multiple input and output formats.

[0058] 2) Perform unique clone filtering on the rearranged data, i.e., the rearranged information, to determine whether it is a unique clone;

[0059] In immunology and bioinformatics, a unique clone typically refers to an immune cell clone possessing a unique or specific B-cell receptor or T-cell receptor sequence. In the human immune system, B cells and T cells generate different receptors through VDJ rearrangements, allowing each cell to potentially generate a unique receptor capable of recognizing and binding to a specific antigen. Therefore, each immune cell with a unique receptor can be considered a "unique clone." In practical bioinformatics analyses, researchers often identify and study unique clones in the immune system by sequencing and analyzing the receptor gene sequences of B cells and T cells. This information helps us understand the diversity and specificity of immune responses and how the immune system responds to a wide variety of pathogens.

[0060] Filtering can be performed using one or both of the following conditions; that is, a unique clone should satisfy one or both of them:

[0061] Condition 1: Determine whether the number of supporting sequences of the clone is greater than a first preset value; the first preset value may be the minimum number of sequences required to make the unique clone reliable.

[0062] The support sequence number represents the number of sequences supported by a clone identified in the sequencing data. For example, if clone A has 6 sequences identified as A in the sequencing data, then clone A has a support sequence number of 6. Unreliable sequences can be eliminated by determining the support sequence number of a unique clone. When the support sequence number is greater than a first preset value, it indicates that there are enough sequences to support this clone, and it is considered a unique clone. If a clone has only one supported sequence, it is considered unreliable. This is because after PCR amplification, the sequencer performs high-throughput sequencing, and it is unlikely that only one sequence would support it. If so, it may be a sequencing error or a chimera, etc.

[0063] The range of the first preset value can be, for example, 3-7, such as 5.

[0064] Condition 2: Determine whether the amino acid length of the clone's CDR is greater than a second preset value. This second preset value can be the minimum CDR amino acid sequence length that makes the unique clone reliable;

[0065] CDRs (Complementarity-Determining Regions) are specific regions on immune receptor molecules (including antibodies and T-cell receptors). They contain highly variable amino acid sequences, are the basis for clone identification, and play a crucial role in the recognition and binding of antigens. In the structure of B-cell receptors (antibodies) and T-cell receptors, each receptor has two main parts: a constant region and a variable region. The variable region contains three CDRs: CDR1, CDR2, and CDR3. These regions are highly variable compared with the surrounding framework regions. Longer CDRs indicate more accurate rearrangement information and a more accurate unique clone.

[0066] The range of the second preset value can be, for example, 4-8, such as 6.
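The two filtering conditions can be sketched as follows. The `Clone` record and the `require_both` switch are illustrative assumptions (the text allows one or both conditions to be applied); the defaults 5 and 6 are the example preset values given above, and both comparisons are strict, as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    cdr_aa: str   # CDR amino acid sequence of the rearrangement
    support: int  # number of supporting sequences


def is_unique_clone(
    clone: Clone,
    first_preset: int = 5,    # text: 3-7, e.g. 5
    second_preset: int = 6,   # text: 4-8, e.g. 6
    require_both: bool = True,
) -> bool:
    """Apply Condition 1 (support) and Condition 2 (CDR amino acid length)."""
    cond1 = clone.support > first_preset
    cond2 = len(clone.cdr_aa) > second_preset
    return (cond1 and cond2) if require_both else (cond1 or cond2)


# Illustrative clones only.
well_supported = Clone(cdr_aa="CASSLGQGYEQYF", support=6)
short_cdr = Clone(cdr_aa="CASS", support=10)
```

With these inputs, `well_supported` passes both conditions, while `short_cdr` passes only Condition 1 and is therefore kept only when `require_both` is `False`.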

[0067] Master clone analysis can aid in the identification of sample sequences. For example, a low master clone level indicates a smaller number of cell clones, which can be identified as one type of sequence; a high master clone level indicates a larger number of cell clones, which can be identified as another type of sequence. Furthermore, detecting changes in the master clone can reveal tumor progression. For instance, in lymphoma patients, disease progression can be monitored by detecting the master clone level of clonal immunoglobulin genes.

[0068] 3) Perform master clone identification on the unique clone obtained from the identification process to determine whether a master clone exists;

[0069] In immunology, the dominant clone of an immune cell (referred to herein as the master clone) is the one that is dominant at a particular moment or under specific physiological or pathological conditions. In the human immune system, each immune cell clone has a unique receptor that recognizes and binds to a specific antigen. When these cells encounter their specific antigen, they are activated and begin clonal expansion to produce more antigen-specific immune cells. During this process, those immune cell clones that are particularly effective at recognizing and responding to a specific pathogen proliferate more and thus become dominant in the overall immune cell population; this is the so-called "dominant clone." In certain situations, such as in a specific type of infection or disease, the concept of the dominant clone is particularly important because these cells play a leading role in the immune response to that pathological condition.

[0070] The following method can be used for master clone identification. Calculate the ratio of the number of supporting sequences of each unique clone to the total number of supporting sequences of all unique clones, and sort the ratios in descending order to obtain a candidate dataset. Accumulating the ratios from the top of the candidate dataset, determine the number of top-ranked unique clones whose cumulative proportion of supporting sequences is greater than or equal to a first benchmark value, and the number of top-ranked unique clones whose cumulative proportion is greater than or equal to a second benchmark value. If the former number is greater than a first threshold and the latter number is greater than a second threshold, the current sample sequence is considered to have no master clone. Conversely, if the former number is less than the first threshold, or the latter number is less than the second threshold, the current sample sequence is considered to have a master clone. The rationale is that a clone is considered a master clone when it accounts for a prominent proportion of the total supporting sequences of all unique clones. This embodiment therefore sorts and accumulates the supporting sequence counts to obtain master clones that are prominent in proportion and controllable in number.

[0071] For example, calculate the ratio Qi of the number of supporting sequences Ni of the i-th unique clone to the total number of supporting sequences C of all unique clones: Qi = Ni / C. Arrange the ratios in descending order to obtain the dataset D = {Q1, Q2, Q3…}, and accumulate the values of Q in that order. The first benchmark value is in the range of 90-98%, for example, 95%; n is the number of top-ranked unique clones in D whose supporting reads together account for at least 95% of the total. The second benchmark value is in the range of 0.3-0.7, for example, 0.5; m is the number of top-ranked unique clones in D whose supporting reads together account for at least 50% of the total. When n is greater than the first threshold T1 and m is greater than the second threshold T2, the sample is considered to have no master clone. T1 can be in the range of 5-8, and T2 can be in the range of 8-10.
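The master clone decision can be sketched as follows, reading n and m as the smallest numbers of top-ranked unique clones needed to reach the first and second benchmark values. This reading, the function names, the choice T1 = 5 and T2 = 8 from the stated ranges, the handling of the boundary case (n equal to T1 or m equal to T2 counts as having a master clone), and returning `False` when there are no unique clones are all assumptions of this sketch.

```python
from typing import List, Sequence


def clones_needed(sorted_supports: Sequence[int], total: int, benchmark: float) -> int:
    """Smallest number of top-ranked clones whose cumulative share >= benchmark."""
    cumulative = 0
    for count, support in enumerate(sorted_supports, start=1):
        cumulative += support
        if cumulative / total >= benchmark:
            return count
    return len(sorted_supports)


def has_master_clone(
    supports: List[int],
    first_benchmark: float = 0.95,   # text: 90-98%, e.g. 95%
    second_benchmark: float = 0.5,   # text: 0.3-0.7, e.g. 0.5
    t1: int = 5,                     # first threshold T1 (text: 5-8)
    t2: int = 8,                     # second threshold T2 (text: 8-10)
) -> bool:
    """Return True if the sample is considered to have a master clone."""
    total = sum(supports)
    if total == 0:
        return False  # assumption: no unique clones means no master clone
    ranked = sorted(supports, reverse=True)  # same order as sorting Qi = Ni / C
    n = clones_needed(ranked, total, first_benchmark)
    m = clones_needed(ranked, total, second_benchmark)
    # "No master clone" only when n > T1 and m > T2; otherwise a master clone exists.
    return not (n > t1 and m > t2)


# Illustrative supporting-sequence counts only.
dominated = [90, 5, 3, 2]   # one clone holds 90% of the supporting sequences
polyclonal = [10] * 20      # twenty equally supported clones
```

For `dominated`, n = 2 and m = 1, so a master clone is reported; for `polyclonal`, n = 19 and m = 10, which exceed both thresholds, so no master clone is reported.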

[0072] In one exemplary embodiment, step 30 may include steps 31-32.

[0073] Step 31, calculate the fusion gene score S;

[0074] When a fusion gene is identified in sample j through the aforementioned step 21, let the number of candidate fusion gene pairs be M, the number of non-redundant genes constituting the fusion gene pairs be K, the gene set be A = {a1, a2, a3, …, aK}, and the number of different fusion gene combinations in which each gene participates be B = {b1, b2, b3, …, bK}.

[0075] (Formula 1)

[0076] Here Sj is the fusion gene score of the j-th sample, M is the number of candidate fusion gene pairs, K is the number of non-redundant genes, and bi is the number of different fusion gene combinations in which the i-th gene participates.

[0077] Formula 1 yields a score that can be used for fusion determination. For a fixed number of non-redundant genes K, a higher score indicates that fewer different gene pairs are involved, that is, each gene participates in fewer different fusion gene combinations. The smaller the value of (Formula 1...), the larger the fusion gene score S.
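The quantities that enter the score (M, K, the gene set A, and the counts B) can be derived from the candidate fusion gene pairs as sketched below. The sketch deliberately stops before the score itself, because Formula 1 is not reproduced in this text. The function name is hypothetical, the gene pairs are illustrative, and counting bi as the number of distinct partner genes of gene ai is an assumed reading of "number of different fusion gene combinations".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def fusion_pair_statistics(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[int, List[str], List[int]]:
    """Return (M, A, B) for a collection of candidate fusion gene pairs.

    M: number of distinct candidate pairs (order-insensitive).
    A: sorted list of the K non-redundant genes.
    B: for each gene in A, the number of distinct fusion partners.
    """
    unique_pairs = sorted({tuple(sorted(p)) for p in pairs})
    partners: Dict[str, Set[str]] = defaultdict(set)
    for a, b in unique_pairs:
        partners[a].add(b)
        partners[b].add(a)
    genes = sorted(partners)
    counts = [len(partners[g]) for g in genes]
    return len(unique_pairs), genes, counts


# Illustrative candidate pairs only.
M, A, B = fusion_pair_statistics(
    [("IGH", "BCL2"), ("IGH", "MYC"), ("BCL2", "IGH")]
)
K = len(A)
```

Here the duplicate pair is counted once, giving M = 2 and K = 3, with IGH participating in two combinations and BCL2 and MYC in one each.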

[0078] Step 32: Classify and identify the sample sequences based on the fusion gene score and the presence of a master clone.

[0079] When Sj is greater than a third preset value (the third preset value characterizes whether the current sample data has evidence of fusion genes; for example, it is 0) and the clonal rearrangement result indicates the presence of a master clone, the sequencing data is identified as belonging to the first level, for example, highly suspected cancer;

[0080] When Sj is less than or equal to the third preset value and the clonal rearrangement result indicates the presence of a master clone, the sequencing data is identified as belonging to the second level, for example, highly suspected cancer but lacking evidence of fusion genes;

[0081] When Sj is greater than the third preset value and the clonal rearrangement result indicates that there is no master clone, the sequencing data is identified as belonging to the third level, for example, not having cancer characteristics. However, since Sj is greater than the third preset value, there is evidence of fusion genes, and further verification can be performed by PCR (polymerase chain reaction);

[0082] When Sj is less than or equal to the third preset value and the clonal rearrangement result indicates that there is no master clone, the sequencing data is identified as belonging to the fourth level, for example, not having cancer characteristics.
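The four grading rules form a two-by-two decision table, which can be sketched as follows. The function name and the integer level encoding are illustrative assumptions; the default third preset value of 0 is the example given above.

```python
def grade_sample(
    fusion_score: float, has_master_clone: bool, third_preset: float = 0.0
) -> int:
    """Return the level (1-4) from the fusion gene score and master clone result."""
    has_fusion_evidence = fusion_score > third_preset
    if has_master_clone:
        # Level 1: fusion evidence and master clone; level 2: master clone only.
        return 1 if has_fusion_evidence else 2
    # Level 3: fusion evidence only (PCR verification suggested); level 4: neither.
    return 3 if has_fusion_evidence else 4


levels = [
    grade_sample(1.5, True),
    grade_sample(0.0, True),
    grade_sample(1.5, False),
    grade_sample(0.0, False),
]
```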

[0083] This disclosure also provides a grading system for sequencing data. As shown in Figure 2, it includes a preprocessing module, an identification module, and a grading module, wherein:

[0084] The preprocessing module is used to acquire targeted sequencing data and perform data preprocessing on the targeted sequencing data to obtain sample sequences;

[0085] The identification module is used to perform fusion gene identification for targeted sequencing on the sample sequences to determine whether a fusion gene exists and obtain the fusion gene identification result, and to perform clonal rearrangement identification on the sample sequences to determine whether a master clone exists.

[0086] The grading module is used to calculate the fusion gene score based on the fusion gene identification results, and to grade the sample sequence according to the fusion gene score and the clonal rearrangement identification results.

[0087] The system of this disclosure uses a combination of gene fusion and IG / TCR rearrangement detection methods to classify sequencing data based on the overall detection results, rather than using a single gene feature for classification. This results in higher accuracy and wider applicability.

[0088] In one exemplary embodiment, as shown in Figure 3, to improve the accuracy of subsequent identification, the preprocessing module includes an excision module and a filtering module, wherein:

[0089] The excision module is used to excise adapter sequences from targeted sequencing data;

[0090] The filtering module is used to filter the sequencing data after the adapter sequence is removed. The filtering includes any one or more of the following: quality filtering and length filtering. The quality filtering includes filtering out sequences and / or segments with quality values lower than a preset quality value. The length filtering includes filtering out sequences with lengths less than a preset length value.

[0091] In one exemplary embodiment, as shown in Figure 4, the identification module includes a fusion gene identification module and a master clone identification module, wherein:

[0092] The fusion gene identification module is used to align the sample sequence to the genome, perform fusion gene detection, and identify candidate fusion gene pairs.

[0093] The master clone identification module is used to perform immunoglobulin (IG) and T-cell receptor (TCR) rearrangement identification on the sample sequences, perform unique clone filtering on the rearrangement data to identify unique clones, and then perform master clone identification on the unique clones to identify whether a master clone exists.

[0094] In an exemplary embodiment, the master clone identification module performs unique clone filtering on the rearranged data in the following manner: clones that meet one or two of the following conditions are identified as unique clones:

[0095] Condition 1: the number of sequences supporting the clone is greater than the first preset value;

[0096] Condition 2: the amino acid length of the clone's CDR is greater than the second preset value.

[0097] In an exemplary embodiment, the master clone identification module performs master clone identification on the unique clone in the following manner:

[0098] The master clone identification module calculates the ratio of the number of supporting sequences of each unique clone to the total number of supporting sequences of all unique clones, and arranges the ratios in descending order to obtain a candidate dataset. If the number of top-ranked unique clones in the candidate dataset whose combined share of supporting sequences is greater than or equal to the first benchmark value is less than the first threshold, or if the number of top-ranked unique clones whose combined share of supporting sequences is greater than or equal to the second benchmark value is less than the second threshold, the current sample sequence is considered to have a master clone.

[0099] In an exemplary embodiment, the grading module identifies the sample sequence in the following manner:

[0100] The fusion gene score is calculated using the following formula (not reproduced in this text), where $S_j$ is the fusion gene score of the j-th sample data, $M$ is the number of candidate fusion gene pairs, $K$ is the number of non-redundant genes, and $b_i$ represents the i-th fusion gene pair.

[0101] When the fusion gene score is greater than a third preset value and the clonal rearrangement identification result indicates the presence of a master clone, the grading module identifies the sequencing data as belonging to the first level; the third preset value is used to characterize that the current sample data has evidence of fusion genes.

[0102] When the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates the presence of a master clone, the grading module identifies the sequencing data as belonging to the second level.

[0103] When the fusion gene score is greater than the third preset value and the clonal rearrangement identification result indicates that there is no master clone, the grading module identifies the sequencing data as belonging to the third level.

[0104] When the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates that there is no master clone, the grading module identifies the sequencing data as belonging to the fourth level.
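The four-way decision in paragraphs [0101]-[0104] can be sketched as a small function. This is a minimal illustration under stated assumptions, not the disclosed implementation: the name `grade_sample` is hypothetical, and the default `third_preset=0.0` is an assumption chosen to match the S > 0 criterion used in the worked example of this disclosure.

```python
def grade_sample(fusion_score: float, has_master_clone: bool,
                 third_preset: float = 0.0) -> int:
    """Return the grading level (1-4) for one sample.

    third_preset is a hypothetical default; the disclosure leaves the
    third preset value configurable.
    """
    # A score above the third preset value counts as fusion gene evidence.
    has_fusion_evidence = fusion_score > third_preset
    if has_master_clone:
        return 1 if has_fusion_evidence else 2
    return 3 if has_fusion_evidence else 4
```

Keeping the two inputs separate makes it easy to swap in a different fusion score formula or master clone rule without touching the grading logic.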

[0105] For details on the implementation, please refer to the methods described above; they will not be repeated here.

[0106] The following example, using targeted sequencing technology to grade and identify lymphomas, will be used to illustrate the above method in detail.

[0107] S1. Targeted sequencing was performed on 7 samples. The customized sequencing information is given in Table 1. The sequencing strategy was PE150 (paired-end sequencing with a read length of 150 bp). The amount of data obtained is shown in Table 2.

[0108] Table 1. Customized sequencing information

[0115] Table 2. Raw sequencing data statistics

[0116]

| Sample name | Number of sequences (pairs) | Number of bases |
|---|---|---|
| Sample1 | 61,911,787 | 9,286,768,050 |
| Sample2 | 39,048,223 | 5,857,233,450 |
| Sample3 | 58,613,404 | 8,792,010,600 |
| Sample4 | 27,882,034 | 4,182,305,100 |
| Sample5 | 45,325,899 | 6,798,884,850 |
| Sample6 | 60,137,595 | 9,020,639,250 |
| Sample7 | 36,099,125 | 5,414,868,750 |
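As a quick consistency check on Table 2, each base count equals the listed sequence count multiplied by the 150 bp read length. The sketch below only re-derives the table's own numbers; the variable and function names are illustrative and do not come from the disclosure.

```python
READ_LENGTH_BP = 150  # PE150 read length stated in step S1

# (sequence count, base count) per sample, copied from Table 2
table2 = {
    "Sample1": (61_911_787, 9_286_768_050),
    "Sample2": (39_048_223, 5_857_233_450),
    "Sample3": (58_613_404, 8_792_010_600),
    "Sample4": (27_882_034, 4_182_305_100),
    "Sample5": (45_325_899, 6_798_884_850),
    "Sample6": (60_137_595, 9_020_639_250),
    "Sample7": (36_099_125, 5_414_868_750),
}


def bases_from_count(sequence_count: int,
                     read_length: int = READ_LENGTH_BP) -> int:
    """Base count implied by a sequence count at a fixed read length."""
    return sequence_count * read_length


# Samples whose reported base count differs from count * read length.
mismatches = [name for name, (count, bases) in table2.items()
              if bases_from_count(count) != bases]
print(mismatches)
```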

[0117] The obtained sequencing data were preprocessed as follows: adapter sequence removal, low-quality base removal, and filtering of shorter sequence pairs after removal, including steps S1.1-S1.3.

[0118] S1.1 The sequencing reads are parsed and the adapter sequence is removed. Specifically, the adapter sequence is identified in the sequencing data based on the adapter used by the sequencing platform and is then removed. For example, the adapter is removed when the adapter sequence found at the end of a read is at least 3 bp long.
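The adapter removal in S1.1 can be illustrated with a simplified sketch. It assumes exact matching only (no mismatches or indels), which real trimming tools do not require; the function name `trim_adapter` and the adapter string used in the usage example are illustrative inputs, not values given in this disclosure.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an adapter from the 3' end of a read (exact matching only)."""
    if not adapter:
        return read
    # Case 1: the complete adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter.
    # Try the longest possible overlap first, down to min_overlap bases.
    longest = min(len(read), len(adapter) - 1)
    shortest = max(min_overlap, 1)
    for k in range(longest, shortest - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Usage with an example adapter string (assumed input for illustration).
example_adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", example_adapter))
print(trim_adapter("ACGTACGTAGAT", example_adapter))
```

The `min_overlap` parameter corresponds to the 3 bp minimum mentioned above: a read ending in only one or two adapter bases is left untouched.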

[0119] S1.2 Low-quality sequences are filtered out of the data after adapter removal. The filtering criterion is, for example, an average base quality value below 25. If either read of a pair, Read1 (hereinafter R1; each sequence obtained from sequencing is called a read) or Read2 (hereinafter R2), meets the filtering criterion, the whole pair is filtered out.

[0120] S1.3 Length filtering is performed on the trimmed sequences. The length filtering criterion is determined by the library characteristics. For libraries with paired-end read lengths of 150 bp or less (e.g., PE150), the length threshold is 80 bp and can be adjusted between 50 and 120 bp as needed. For libraries with paired-end read lengths of 250 bp or more (e.g., PE250), the length threshold is 120 bp and can be adjusted between 80 and 180 bp as needed. If either Read1 or Read2 of a pair meets the filtering criterion, the whole pair is filtered out.

[0121] In this example, a sequence is filtered out if its length after low-quality removal is less than 50 bp.
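The pair-level filtering of S1.2 and S1.3 can be sketched as follows. The sketch assumes that the quality criterion is the mean per-base quality of a read (one possible reading of the text) and models a read as a tuple of bases and quality values; both choices, along with the names `read_fails` and `filter_pairs`, are assumptions made for illustration. The defaults of 25 and 50 bp are the example values given above.

```python
from typing import List, Tuple

# A read is modeled as (bases, per-base quality values); this layout is an
# assumption of the sketch, not a format defined by the disclosure.
Read = Tuple[str, List[int]]


def read_fails(read: Read, min_mean_quality: float = 25.0,
               min_length: int = 50) -> bool:
    """True if a read is too short or its mean base quality is too low."""
    bases, quals = read
    if len(bases) < min_length or not quals:
        return True
    return sum(quals) / len(quals) < min_mean_quality


def filter_pairs(pairs: List[Tuple[Read, Read]],
                 min_mean_quality: float = 25.0,
                 min_length: int = 50) -> List[Tuple[Read, Read]]:
    """Keep a pair only if neither R1 nor R2 meets a filtering criterion."""
    return [
        (r1, r2) for r1, r2 in pairs
        if not read_fails(r1, min_mean_quality, min_length)
        and not read_fails(r2, min_mean_quality, min_length)
    ]
```

Dropping the whole pair when one mate fails keeps R1 and R2 synchronized, which paired-end aligners expect.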

[0122] The adapter content and the proportions of data removed by quality filtering and length filtering are shown in Table 3.

[0123] Table 3. Sequencing data filtering statistics

[0124]

| Sample name | Adapter percentage (%) | Quality-filtered percentage (%) | Length-filtered percentage (%) |
|---|---|---|---|
| Sample1 | 11.1 | 0.2 | 1.07 |
| Sample2 | 14.1 | 0.4 | 0.24 |
| Sample3 | 20.1 | 0.1 | 0.05 |
| Sample4 | 14.1 | 0.1 | 0.02 |
| Sample5 | 11.3 | 0.1 | 0.93 |
| Sample6 | 11.9 | 0.2 | 0.98 |
| Sample7 | 10.6 | 0.3 | 1.35 |

[0125] S2. Build the alignment database, align the filtered data, and obtain the fusion gene features. This includes steps S2.1-S2.2.

[0126] S2.1 In this example, hg38 is used as the reference genome. A genome index is built with BWA and the filtered reads are aligned to it. After alignment, SAMtools is used to convert the output to BAM format, sort the BAM file, and build an index file.
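The index, align, convert, sort, and index sequence in S2.1 could be scripted roughly as below. This is a sketch under assumptions: the disclosure names BWA and SAMtools but not the subcommands, so the use of `bwa mem` and the thread options are assumptions, and all file names and the `build_alignment_commands` helper are hypothetical. The function only builds the shell command strings so that they can be inspected before being run.

```python
import shlex
from typing import List


def build_alignment_commands(reference_fa: str, r1_fq: str, r2_fq: str,
                             prefix: str, threads: int = 4) -> List[str]:
    """Build shell commands for indexing, alignment, and BAM preparation."""
    q = shlex.quote
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Build the genome index with BWA.
        f"bwa index {q(reference_fa)}",
        # Align the paired reads (bwa mem is an assumed choice of aligner mode).
        f"bwa mem -t {threads} {q(reference_fa)} {q(r1_fq)} {q(r2_fq)} > {q(sam)}",
        # Convert SAM to BAM.
        f"samtools view -b -o {q(bam)} {q(sam)}",
        # Sort the BAM file by coordinate.
        f"samtools sort -@ {threads} -o {q(sorted_bam)} {q(bam)}",
        # Build the BAM index file.
        f"samtools index {q(sorted_bam)}",
    ]


for cmd in build_alignment_commands("hg38.fa", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1"):
    print(cmd)
```

Each string can be passed to `subprocess.run(cmd, shell=True, check=True)` once the tools are installed and the paths are confirmed.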

[0127] S2.2 The aligned files were used to perform fusion gene detection using FuSeq_WES to select candidate fusion gene pairs. Some of the identification results are shown in Table 4.

[0128] Table 4. Fusion gene identification results (partial)

[0129]

| Sample name | Candidate fusion gene pairs | Non-redundant number of fusion genes |
|---|---|---|
| Sample1 | 4 | 7 |
| Sample2 | 3 | 3 |
| Sample3 | 0 | 0 |
| Sample4 | 3 | 4 |
| Sample5 | 5 | 8 |
| Sample6 | 1 | 1 |
| Sample7 | 1 | 1 |

[0130] S3. Perform IG and TCR rearrangement identification and filtering on the filtered high-quality sequencing data obtained in step S1, and identify the master clone. This includes steps S3.1-S3.3.

[0131] S3.1 Clone identification is performed with the 'exome-cdr3' method of MiXCR and may cover the following chains: IGH, IGK, IGL, TRA, TRB, TRG, and TRD.

[0132] S3.2 Filter the identified data; only clones meeting the following conditions are considered unique clones i:

[0133] 1) The number of supporting sequences of the clone satisfies $N_i \geq 5$;

[0134] 2) The amino acid length of the clone's CDR satisfies $L_i \geq 6$.

[0135] In this example, a unique clone is considered credible if five identical sequences support it. Furthermore, if the specific information in these five sequences (such as the CDR3 length) is reasonable (the longer the CDR, the more accurate the information), the unique clone is considered more credible.

[0136] The threshold for identifying a unique clone can be adjusted. For example, the threshold can be lowered, i.e., the standard can be reduced, or the threshold can be raised, i.e., the standard can be increased.
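The unique clone filter of S3.2 can be sketched as below. The sketch follows the worked example, where a clone must satisfy both conditions, whereas the general embodiment also allows a clone to qualify on a single condition. The `Clone` record, its field names, and `filter_unique_clones` are hypothetical and do not correspond to any tool's output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clone:
    clone_id: str        # hypothetical identifier
    support_reads: int   # N_i: number of supporting sequences
    cdr_aa_length: int   # L_i: amino acid length of the clone's CDR


def filter_unique_clones(clones: List[Clone], min_support: int = 5,
                         min_cdr_aa_length: int = 6) -> List[Clone]:
    """Keep clones with N_i >= min_support and L_i >= min_cdr_aa_length."""
    return [
        c for c in clones
        if c.support_reads >= min_support
        and c.cdr_aa_length >= min_cdr_aa_length
    ]
```

Raising or lowering `min_support` and `min_cdr_aa_length` corresponds to the threshold adjustment described above.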

[0137] S3.3 Perform master clone (P) identification on the identified unique clones:

[0138] 1) Calculate the ratio $Q_i$ of the number of supporting reads $N_i$ of each unique clone to the total number of supporting reads $C$ of all unique clones: $Q_i = N_i / C$;

[0139] 2) Sort $Q$ in descending order to obtain the dataset $D = \{Q_1, Q_2, Q_3, \ldots\}$ and accumulate the values of $Q$ in that order. When $\sum_{i=1}^{n} Q_i \geq 95\%$, that is, when the reads supporting the top $n$ unique clones in $D$ account for at least 95% of the total, and when $\sum_{i=1}^{m} Q_i \geq 50\%$, that is, when the reads supporting the top $m$ unique clones in $D$ account for at least 50% of the total, record the values of $n$ and $m$.

[0140] 3) When $n$ is greater than the threshold $T_1$ and $m$ is greater than the threshold $T_2$, the sample is considered to have no master clone. Conversely, when $n$ is less than $T_1$ or $m$ is less than $T_2$, the sample is considered to have a master clone. $T_1$ can be adjusted between 5 and 8, and $T_2$ between 8 and 10.
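Steps 1) to 3) of S3.3 can be sketched as follows. Several points are assumptions of this sketch rather than statements of the disclosure: n and m are taken as the smallest ranks at which the cumulative share first reaches 95% and 50%; the defaults `t1=5` and `t2=8` are arbitrary picks from the stated ranges; and the case where n or m equals its threshold, which the text does not address, is treated as no master clone. Cumulative read counts are compared against a share of the total using exact fractions, which is equivalent to summing the ratios Q_i but avoids floating-point rounding.

```python
from fractions import Fraction
from typing import List, Tuple


def clones_needed(counts_desc: List[int], share: Fraction) -> int:
    """Smallest number of top-ranked clones whose reads reach `share` of the total."""
    total = sum(counts_desc)
    running = 0
    for rank, count in enumerate(counts_desc, start=1):
        running += count
        # Same as Q_1 + ... + Q_rank >= share, in exact arithmetic.
        if running >= share * total:
            return rank
    return len(counts_desc)


def identify_master_clone(support_counts: List[int], t1: int = 5,
                          t2: int = 8) -> Tuple[bool, int, int]:
    """Return (has_master_clone, n, m) for the unique clones of one sample."""
    counts = sorted(support_counts, reverse=True)
    if sum(counts) == 0:
        return False, 0, 0
    n = clones_needed(counts, Fraction(95, 100))
    m = clones_needed(counts, Fraction(50, 100))
    # Master clone if n < T1 or m < T2; equality is treated as "no master
    # clone" here, which is an assumption of this sketch.
    return (n < t1 or m < t2), n, m
```

A sample dominated by one clone yields small n and m and is flagged as having a master clone, while an evenly spread repertoire needs many clones to reach each share and is not.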

[0141] Table 5 shows some of the identification results after unique clone and master clone identification.

[0142] Table 5. Master clone identification results (partial)

[0145] S4. Based on the fusion gene identification results in step S2 and the master clone identification results in step S3, lymphoma grading is performed.

[0146] In this experiment, a master clone was identified in each of the 7 samples and every fusion gene score satisfied S > 0, so all samples were classified as Grade 1, i.e., suspected lymphoma. This is consistent with the actual results, giving an experimental accuracy of 100%.

[0147] In an exemplary embodiment of this disclosure, a computer device is also provided. The computer device may include a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the operations performed by the grading system of this disclosure.

[0148] As shown in Figure 5, in one example, terminal 100 may include a processor 110, a memory 120, a bus system 130, and a transceiver 140, wherein the processor 110, the memory 120, and the transceiver 140 are connected through the bus system 130, the memory 120 is used to store instructions, and the processor 110 is used to execute the instructions stored in the memory 120 to control the transceiver 140 to send signals.

[0149] It should be understood that processor 110 can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0150] Memory 120 may include read-only memory and random access memory, and provides instructions and data to processor 110. A portion of memory 120 may also include non-volatile random access memory. For example, memory 120 may also store device type information.

[0151] In addition to a data bus, the bus system 130 may also include a power bus, a control bus, and a status signal bus. For clarity, however, all of these buses are labeled as bus system 130 in Figure 5.

[0152] In implementation, the processing performed by the terminal device can be accomplished through integrated logic circuits in the hardware of the processor 110 or through software instructions. That is, the steps of the method disclosed in this embodiment can be executed by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media. This storage medium is located in memory 120, and the processor 110 reads information from memory 120 and, in conjunction with its hardware, completes the steps of the aforementioned method. To avoid repetition, further details are omitted here.

[0153] It will be understood by those skilled in the art that all or some of the steps, systems, or apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

Claims

1. A method for grading sequencing data, characterized in that it comprises:
acquiring targeted sequencing data, and performing data preprocessing on the targeted sequencing data to obtain sample sequences;
performing fusion gene identification on the sample sequences to identify whether a fusion gene exists and obtain a fusion gene identification result, and performing clonal rearrangement identification on the sample sequences to identify whether a master clone exists;
calculating a fusion gene score based on the fusion gene identification result, and grading the sample sequences based on the fusion gene score and the clonal rearrangement identification result; wherein
calculating the fusion gene score based on the fusion gene identification result comprises: calculating the fusion gene score using the following formula (not reproduced in this text), where $S_j$ is the fusion gene score of the j-th sample data, $M$ is the number of candidate fusion gene pairs, $K$ is the number of non-redundant genes, and $b_i$ represents the i-th fusion gene pair;
grading the sample sequences based on the fusion gene score and the clonal rearrangement identification result comprises:
when the fusion gene score is greater than a third preset value and the clonal rearrangement identification result indicates the presence of a master clone, identifying the sequencing data as belonging to the first level, the third preset value being used to characterize that the current sample data has evidence of fusion genes;
when the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates the presence of a master clone, identifying the sequencing data as belonging to the second level;
when the fusion gene score is greater than the third preset value and the clonal rearrangement identification result indicates that there is no master clone, identifying the sequencing data as belonging to the third level;
when the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates that there is no master clone, identifying the sequencing data as belonging to the fourth level.

2. The grading method according to claim 1, characterized in that performing data preprocessing on the targeted sequencing data comprises: removing adapter sequences from the targeted sequencing data; and filtering the sequencing data after adapter removal, the filtering including any one or more of the following: quality filtering and length filtering; wherein the quality filtering includes filtering out sequences and/or segments with a quality value lower than a preset quality value, and the length filtering includes filtering out sequences with a length less than a preset length value.

3. The grading method according to claim 1, characterized in that performing fusion gene identification on the sample sequences to identify whether a fusion gene exists and obtain a fusion gene identification result comprises: aligning the sample sequences to the genome and performing fusion gene detection to identify candidate fusion gene pairs.

4. The grading method according to claim 1, characterized in that performing clonal rearrangement identification on the sample sequences to identify whether a master clone exists comprises: performing immunoglobulin (IG) and T-cell receptor (TCR) rearrangement identification on the sample sequences; performing unique clone filtering on the rearrangement data to identify unique clones; and performing master clone identification on the identified unique clones to identify whether a master clone exists.

5. The grading method according to claim 4, characterized in that performing unique clone filtering on the rearrangement data comprises identifying a clone as a unique clone if it meets one or more of the following conditions: Condition 1: the number of sequences supporting the clone is greater than a first preset value; Condition 2: the amino acid length of the clone's CDR is greater than a second preset value.

6. The grading method according to claim 5, characterized in that performing master clone identification on the identified unique clones comprises: calculating the ratio of the number of supporting sequences of each unique clone to the total number of supporting sequences of all unique clones, and sorting the ratios in descending order to obtain a candidate dataset; and, if the number of top-ranked unique clones in the candidate dataset whose combined share of supporting sequences is greater than or equal to a first benchmark value is less than a first threshold, or if the number of top-ranked unique clones whose combined share of supporting sequences is greater than or equal to a second benchmark value is less than a second threshold, considering the current sample sequence to have a master clone.

7. A grading system for sequencing data, characterized in that it comprises:
a preprocessing module, used to acquire targeted sequencing data and perform data preprocessing on the targeted sequencing data to obtain sample sequences;
an identification module, used to perform fusion gene identification on the sample sequences to identify whether a fusion gene exists and obtain a fusion gene identification result, and to perform clonal rearrangement identification on the sample sequences to identify whether a master clone exists; and
a grading module, used to calculate a fusion gene score based on the fusion gene identification result and to grade the sample sequences based on the fusion gene score and the clonal rearrangement identification result; wherein the grading module grades the sample sequences in the following manner:
the fusion gene score is calculated using the following formula (not reproduced in this text), where $S_j$ is the fusion gene score of the j-th sample data, $M$ is the number of candidate fusion gene pairs, $K$ is the number of non-redundant genes, and $b_i$ represents the i-th fusion gene pair;
when the fusion gene score is greater than a third preset value and the clonal rearrangement identification result indicates the presence of a master clone, the grading module identifies the sequencing data as belonging to the first level, the third preset value being used to characterize that the current sample data has evidence of fusion genes;
when the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates the presence of a master clone, the grading module identifies the sequencing data as belonging to the second level;
when the fusion gene score is greater than the third preset value and the clonal rearrangement identification result indicates that there is no master clone, the grading module identifies the sequencing data as belonging to the third level;
when the fusion gene score is less than or equal to the third preset value and the clonal rearrangement identification result indicates that there is no master clone, the grading module identifies the sequencing data as belonging to the fourth level.

8. The grading system according to claim 7, characterized in that the preprocessing module includes an excision module and a filtering module, wherein: the excision module is used to excise adapter sequences from the targeted sequencing data; and the filtering module is used to filter the sequencing data after the adapter sequence is removed, the filtering including any one or more of the following: quality filtering and length filtering; the quality filtering includes filtering out sequences and/or segments with quality values lower than a preset quality value, and the length filtering includes filtering out sequences with lengths less than a preset length value.

9. The grading system according to claim 7, characterized in that the identification module includes a fusion gene identification module and a master clone identification module, wherein: the fusion gene identification module is used to align the sample sequences to the genome, perform fusion gene detection, and identify candidate fusion gene pairs; and the master clone identification module is used to perform immunoglobulin (IG) and T-cell receptor (TCR) rearrangement identification on the sample sequences, perform unique clone filtering on the rearrangement data to identify unique clones, and then perform master clone identification on the unique clones to identify whether a master clone exists.

10. The grading system according to claim 9, characterized in that the master clone identification module performs unique clone filtering on the rearrangement data in the following manner: clones that meet one or two of the following conditions are identified as unique clones: Condition 1: the number of sequences supporting the clone is greater than a first preset value, where the first preset value is the minimum number of sequences required to make the unique clone reliable; Condition 2: the amino acid length of the clone's CDR is greater than a second preset value, where the second preset value is the minimum CDR amino acid sequence length that makes the unique clone reliable.

11. The grading system according to claim 9, characterized in that the master clone identification module performs master clone identification on the unique clones in the following manner: the master clone identification module calculates the ratio of the number of supporting sequences of each unique clone to the total number of supporting sequences of all unique clones, and arranges the ratios in descending order to obtain a candidate dataset; if the number of top-ranked unique clones in the candidate dataset whose combined share of supporting sequences is greater than or equal to a first benchmark value is less than a first threshold, or if the number of top-ranked unique clones whose combined share of supporting sequences is greater than or equal to a second benchmark value is less than a second threshold, the current sample sequence is considered to have a master clone.

12. A non-transitory computer-readable storage medium storing computer-executable instructions for performing the method according to any one of claims 1-6.

13. A computer device comprising a processor and a memory storing a computer program executable on the processor, wherein, when the processor executes the program, it implements the steps of the method according to any one of claims 1-6.