A method, device, product and medium for processing alternative splicing data

CN122201419A · Pending · Publication Date: 2026-06-12 · 遵义医科大学第二附属医院

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
遵义医科大学第二附属医院
Filing Date
2026-03-17
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing techniques struggle to integrate the outputs of multiple alternative splicing quantification algorithms because they cannot accommodate the small coordinate shifts between algorithms. This increases false positives and data redundancy, and reduces both the accuracy of alternative splicing data integration and storage efficiency.

Method used

Multiple independent alternative splicing quantification algorithms are called. Events that share the same gene symbol, strand direction, and splicing event type, and whose genomic coordinates differ by no more than a tolerance threshold, are identified as the same target alternative splicing event and given a unified identifier. High-confidence consensus splicing events are then screened using consensus rules and multi-step quality control conditions and stored in a relational database as binary document arrays, achieving accurate integration and efficient storage of cross-algorithm events.

Benefits of technology

It improves the integration accuracy and storage efficiency of alternative splicing data, reduces the interference of false positive events, supports rapid retrieval and interactive analysis, and ensures the reliability and reproducibility of biological research.

✦ Generated by Eureka AI based on patent content.


Abstract

An alternative splicing data processing method, device, product, and medium, relating to the technical field of gene sequencing data processing. In the method, ribonucleic acid sequencing data is obtained, and a plurality of independent alternative splicing quantification algorithms are called to obtain initial alternative splicing event sets and splicing inclusion ratio values; gene symbols, strand directions, splicing event types, and genome coordinates are extracted; if the gene symbols, strand directions, and splicing event types are the same and the absolute value of the difference between the genome coordinates is less than or equal to a preset tolerance threshold, the events are determined to be the same target alternative splicing event; a unified normalized identifier is generated, and a high-confidence consensus splicing event set is screened based on a preset consensus rule and preset multi-step quality control conditions; and the corresponding sample identifier sets and splicing inclusion ratio value sets are stored as parallel binary document array fields in a relational database. The accuracy of alternative splicing data integration is thereby improved.

Description

Technical Field

[0001] This application relates to the field of gene sequencing data processing technology, specifically to a method, device, product, and medium for alternative splicing data processing.

Background Technology

[0002] With the groundbreaking advancements in tumor immunotherapy in clinical practice, the search for robust predictive biomarkers has become a research hotspot. Alternative splicing, as a key post-transcriptional regulatory mechanism in the tumor immune microenvironment, can provide a novel dimension independent of the overall gene expression level. Accurate identification and massive data management of alternative splicing events in large-scale sequencing data through high-throughput RNA sequencing technology combined with bioinformatics algorithms has significant implications for analyzing immunotherapy heterogeneity and discovering new targets.

[0003] Currently, to improve the detection coverage of alternative splicing events, multiple independent alternative splicing quantification algorithms are often used in combination to process sequencing data. When integrating the results of multiple algorithms, existing techniques mainly rely on strict exact matching of the splice junction coordinates output by each algorithm to extract overlapping events. The massive amount of sample and splicing event quantification data is then written directly into a database as traditional two-dimensional relational tables for subsequent clinical phenotype association analysis.

[0004] However, strict exact coordinate matching cannot accommodate the slight boundary coordinate offsets caused by differences in the underlying statistical models of different quantification algorithms. As a result, the same biological event is easily misclassified as multiple independent events, which are false positives, and the integration results become highly redundant. This redundancy further inflates the data dimensions, so that traditional two-dimensional table storage suffers severe table structure bloat when cross-storing tens of thousands of samples and millions of splicing events, and the accuracy of alternative splicing data integration is reduced.

Summary of the Invention

[0005] This application provides an alternative splicing data processing method, device, product, and medium that improve the accuracy of alternative splicing data integration.

[0006] The first aspect of this application provides a method for processing alternative splicing data, which specifically includes: acquiring ribonucleic acid sequencing data of multiple sequencing samples, and calling multiple independent alternative splicing quantification algorithms to process the ribonucleic acid sequencing data, thereby obtaining the initial alternative splicing event set output by each alternative splicing quantification algorithm and the corresponding splicing inclusion ratio values; extracting the gene symbol, strand direction, splicing event type, and genome coordinates of each initial alternative splicing event from the initial alternative splicing event set; for the initial alternative splicing events output by different alternative splicing quantification algorithms, if the gene symbol, strand direction, and splicing event type are the same, and the absolute value of the difference between the genome coordinates is less than or equal to the preset tolerance threshold, determining the corresponding initial alternative splicing events to be the same target alternative splicing event; generating a unified normalized identifier for the same target alternative splicing event, and filtering the same target alternative splicing events based on preset consensus rules and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set; and storing the set of sample identifiers and the set of corresponding splicing inclusion ratio values for each high-confidence consensus splicing event in the high-confidence consensus splicing event set as parallel binary document array fields in a relational database.

[0007] By employing the aforementioned technical solution, multiple independent alternative splicing quantification algorithms are invoked to process ribonucleic acid (RNA) sequencing data. Leveraging the complementarity of different algorithms in feature extraction strategies and statistical models, systematic biases caused by parameter settings, alignment strategies, or statistical assumptions in a single algorithm are eliminated, resulting in a more comprehensive initial alternative splicing event set coverage. Four core feature dimensions—gene symbol, strand direction, splicing event type, and genomic coordinates—are extracted, providing a standardized feature space for cross-platform comparison of different algorithm outputs. This ensures accurate identification of homologous events even with subtle differences in coordinate systems, naming conventions, and event classification systems. The same target alternative splicing event is identified by determining that the gene symbol, strand direction, and splicing event type are identical, and the absolute value of the difference in genomic coordinates is less than or equal to a preset tolerance threshold. This approach tolerates reasonable deviations in the precise location of splicing sites from different algorithms while avoiding duplicate counting caused by misclassifying multiple algorithm detection results of the same biological event as independent events, ensuring a one-to-one accurate mapping for event integration. A unified, standardized identifier is generated for the same target alternative splicing event, integrating scattered detection results from different algorithms into a single event entity. This eliminates the heterogeneity of algorithmic sources and establishes a foundation for cross-algorithm event tracing and quantitative data aggregation. 
Based on preset consensus rules requiring events to be detected jointly by multiple algorithms, a cross-validation mechanism between algorithms filters out algorithm-specific false positives and technical noise. Combined with preset multi-step quality control conditions, rigorous screening is performed along three orthogonal dimensions: sample coverage, median splicing inclusion ratio, and the number of junction-supporting reads. This preserves a high-confidence consensus splicing event set that simultaneously possesses algorithmic consensus, sample representativeness, splicing strength, and sufficient sequencing evidence, significantly reducing the interference of false positive events with subsequent analysis. For each high-confidence consensus splicing event, the corresponding set of sample identifiers and set of splicing inclusion ratio values are stored as parallel binary document array fields in a relational database. Binary encoding reduces storage space. The parallel array structure maintains a precise correspondence between sample identifiers and splicing inclusion ratio values through index position alignment, supporting containment queries based on sample sets and filtering operations based on numerical ranges. This enables rapid retrieval and interactive analysis of large-scale splicing event datasets. In summary, this solution reduces algorithmic bias through multi-algorithm collaboration, achieves accurate integration of cross-algorithm events through feature matching and tolerance control, ensures event reliability through consensus rules and multi-dimensional quality control, and supports complex query requirements through an efficient storage structure. From data generation and event integration through quality control to data management, a complete technical chain is formed, improving the accuracy of alternative splicing data integration.

[0008] Optionally, the extraction of gene symbols, strand directions, splicing event types, and genomic coordinates for each initial alternative splicing event in the initial alternative splicing event set specifically includes: reading the original annotation files corresponding to the initial alternative splicing event sets output by each alternative splicing quantification algorithm; based on a preset dictionary mapping table, converting the algorithm-specific event type labels in the original annotation file into a unified splicing event type; based on a preset gene annotation lookup table, converting the original gene identifiers in the original annotation file into standardized gene symbols; and parsing the strand direction and genome coordinates from the original annotation file.

[0009] By adopting the above technical solution, the original annotation files corresponding to the initial alternative splicing event sets output by the various alternative splicing quantification algorithms are read to obtain the original output data of the different algorithms. According to the preset dictionary mapping table, the algorithm-specific event type labels in the original annotation files are converted into unified splicing event types, eliminating the naming conflicts that arise when different algorithms use different terms, such as exon skipping, skipping exon, ES, and SkipExon, to refer to the same biological phenomenon. This establishes an event type classification system that is comparable across algorithms. Based on the preset gene annotation lookup table, the original gene identifiers in the original annotation files are converted into standardized gene symbols, resolving the gene identification barrier caused by different algorithms using different annotation systems such as ENSG numbering, RefSeq numbering, and gene symbols. The strand direction and genomic coordinates are parsed from the original annotation files to provide the precise location of the event on the chromosome. The unified splicing event type, combined with the standardized gene symbol, strand direction, and genomic coordinates, constitutes a complete feature vector for cross-algorithm event matching, enabling the same biological event from different algorithms to be accurately identified and integrated through feature consistency and coordinate proximity.
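
As a minimal illustration of the conversion described in [0009], the Python sketch below maps algorithm-specific event type labels and gene identifiers to unified values. The table contents, the unified type codes (`SE`, `RI`), and the names `EVENT_TYPE_MAP`, `GENE_LOOKUP`, and `normalize_event` are assumptions made for illustration; the application does not specify them.

```python
# Hypothetical dictionary mapping table: algorithm-specific label -> unified type.
EVENT_TYPE_MAP = {
    "exon skipping": "SE",
    "skipping exon": "SE",
    "es": "SE",
    "skipexon": "SE",
    "se": "SE",
    "intron retention": "RI",
    "ri": "RI",
}

# Hypothetical gene annotation lookup table: raw identifier -> standardized symbol.
GENE_LOOKUP = {
    "ENSG00000141510": "TP53",  # Ensembl gene ID
    "NM_000546": "TP53",        # RefSeq transcript accession
    "TP53": "TP53",             # already a gene symbol
}


def normalize_event(raw_gene, raw_type, strand, start, end):
    """Return one event with a unified event type and standardized gene symbol.

    Raises KeyError when a label or identifier is missing from the tables,
    so unmapped values are not silently passed through.
    """
    # Drop a trailing version suffix such as ".17" when the raw key is unknown.
    gene_key = raw_gene if raw_gene in GENE_LOOKUP else raw_gene.split(".")[0]
    label = raw_type.strip().lower()
    return {
        "gene": GENE_LOOKUP[gene_key],
        "strand": strand,
        "event_type": EVENT_TYPE_MAP[label],
        "start": int(start),
        "end": int(end),
    }


event = normalize_event("ENSG00000141510.17", "SkipExon", "-", "7675053", "7675236")
print(event["gene"], event["event_type"])
```

Failing loudly on an unmapped label is one possible design choice: it keeps an unknown event type from being merged under the wrong category later.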

[0010] Optionally, for the initial alternative splicing events output by different alternative splicing quantification algorithms, if the gene symbol, strand direction, and splicing event type are all the same, and the absolute value of the difference between the genome coordinates is less than or equal to a preset tolerance threshold, determining the corresponding initial alternative splicing events to be the same target alternative splicing event specifically includes: comparing the gene symbol, strand direction, and splicing event type of the initial alternative splicing events output by different alternative splicing quantification algorithms; extracting the start and end coordinates of the initial alternative splicing events; calculating the absolute value of a first difference between the start coordinates and the absolute value of a second difference between the end coordinates; and, if the compared fields are all the same and the absolute values of the first and second differences are both less than or equal to the preset tolerance threshold, merging the corresponding initial alternative splicing events into the same target alternative splicing event.

[0011] By employing the above technical solution, comparing the gene symbols, strand directions, and splicing event types of the initial alternative splicing events output by different alternative splicing quantification algorithms quickly eliminates event pairs that occur in different genes, on different DNA strands, or belong to different splicing modes, significantly narrowing the range of candidate events that require fine coordinate comparison. Extracting the start and end coordinates of the initial alternative splicing events yields the key location information defining the event boundaries. Calculating the absolute value of the first difference between the start coordinates and the absolute value of the second difference between the end coordinates converts the localization deviations of different algorithms for the same event into quantifiable numerical indicators. If the compared fields are identical and the absolute values of both differences are less than or equal to the preset tolerance threshold, the corresponding initial alternative splicing events are merged into the same target alternative splicing event. This approach tolerates shifts of a few bases caused by differences in alignment strategies and splice site identification rules between algorithms, while avoiding the erroneous merging of different biological events whose coordinates differ too much. This achieves accurate cross-algorithm identification and redundancy removal for the same splicing event.
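
The matching rule in [0010] and [0011] can be sketched as a single predicate. The `Event` type, the function name, and the default tolerance of 3 bases are hypothetical; the application only says the threshold is preset.

```python
from typing import NamedTuple


class Event(NamedTuple):
    gene: str        # standardized gene symbol
    strand: str      # "+" or "-"
    event_type: str  # unified splicing event type
    start: int       # start coordinate
    end: int         # end coordinate


def is_same_target_event(a: Event, b: Event, tolerance: int = 3) -> bool:
    """True when two events from different algorithms should be merged."""
    # Cheap categorical comparison first: gene, strand, and event type must match.
    if (a.gene, a.strand, a.event_type) != (b.gene, b.strand, b.event_type):
        return False
    first_diff = abs(a.start - b.start)   # deviation at the start coordinate
    second_diff = abs(a.end - b.end)      # deviation at the end coordinate
    return first_diff <= tolerance and second_diff <= tolerance


x = Event("TP53", "-", "SE", 7675053, 7675236)
y = Event("TP53", "-", "SE", 7675055, 7675236)  # start shifted by 2 bases
z = Event("TP53", "-", "SE", 7675100, 7675236)  # start shifted by 47 bases
print(is_same_target_event(x, y), is_same_target_event(x, z))
```

Both boundaries are checked independently, so an event that matches at one end but diverges at the other is not merged.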

[0012] Optionally, generating a unified normalized identifier for the same target alternative splicing event specifically includes: for multiple initial alternative splicing events identified as the same target alternative splicing event, extracting the corresponding genome coordinates and constructing them into an exon boundary coordinate set; and, according to a preset concatenation order, concatenating the gene symbol, strand direction, splicing event type, and exon boundary coordinate set into a string, which is used as the unified normalized identifier for the same target alternative splicing event.

[0013] By adopting the above technical solution, the genomic coordinates of the multiple initial alternative splicing events identified as the same target alternative splicing event are extracted and constructed into an exon boundary coordinate set. The coordinate information from different algorithms is thereby integrated into a single location description representing the event, which eliminates the scattering of identifiers caused by multiple algorithms expressing the coordinates of the same event differently. According to the preset concatenation order, the gene symbol, strand direction, splicing event type, and exon boundary coordinate set are concatenated into a string that serves as the unified normalized identifier for the same target alternative splicing event. This gives the same biological event, originally scattered across the outputs of different algorithms, a unique and human-readable identity code, so that during subsequent data retrieval the target event can be located directly by its normalized identifier without querying the original event identifiers of each algorithm separately. Because the normalized identifier contains the gene symbol, strand direction, splicing event type, and exon boundary coordinate set, it supports rapid filtering and classification on any of these components. The same splicing event in different research batches or databases can be recognized as the same object through string matching of the normalized identifier, establishing the ability to track events across datasets and over time.
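
A minimal sketch of the identifier construction in [0012] and [0013] follows. The application does not say how the merged coordinates are reduced to one boundary set or which separator is used, so two assumptions are made here: each boundary is represented by the low median of the merged values (so the result is always a coordinate that some algorithm actually reported), and fields are joined with `|`. The function name is hypothetical.

```python
from statistics import median_low


def make_normalized_id(gene, strand, event_type, coords, sep="|"):
    """Build a normalized identifier from merged events.

    coords is a list of (start, end) pairs, one per merged initial event.
    """
    starts = [start for start, _ in coords]
    ends = [end for _, end in coords]
    # Assumed reduction rule: low median of each boundary across algorithms.
    boundary = f"{median_low(starts)}-{median_low(ends)}"
    # Assumed concatenation order: gene, strand, event type, boundary set.
    return sep.join([gene, strand, event_type, boundary])


merged = [(100, 200), (101, 200), (100, 202)]
print(make_normalized_id("TP53", "-", "SE", merged))  # TP53|-|SE|100-200
```

Because the identifier is a plain string with a fixed field order, it can be split on the separator to filter by gene, strand, or event type.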

[0014] Optionally, the step of filtering the same target alternative splicing events based on preset consensus rules and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set specifically includes: counting the number of alternative splicing quantification algorithms that identify the same target alternative splicing event to obtain the algorithm support number; if the algorithm support number is greater than or equal to a preset algorithm number threshold, determining that the same target alternative splicing event satisfies the preset consensus rule, and retaining the same target alternative splicing event that satisfies the preset consensus rule as a candidate splicing event; and filtering the candidate splicing events based on the preset multi-step quality control conditions to obtain the high-confidence consensus splicing event set.

[0015] By employing the above technical solution, the number of alternative splicing quantification algorithms that identify the same target alternative splicing event is counted to obtain the algorithm support number. This turns the agreement among multiple algorithms into a quantifiable reliability index. If the algorithm support number is greater than or equal to the preset algorithm number threshold, the same target alternative splicing event is determined to satisfy the preset consensus rule. This filters out potential false positive events detected by only a single algorithm; mutual verification between algorithms significantly reduces erroneous detections caused by alignment biases, parameter settings, or limitations of the statistical model in a specific algorithm. The same target alternative splicing events that satisfy the preset consensus rule are retained as candidate splicing events, forming a preliminary credible event set and a sound data foundation for the subsequent refined quality control. The candidate splicing events are then filtered based on the preset multi-step quality control conditions. Through successive checks, including read coverage thresholds, splice site canonicity tests, expression level filtering, and statistical significance tests, low-quality and low-confidence events are further eliminated, yielding the high-confidence consensus splicing event set. This provides a data source with both algorithmic consensus and reliable quality for downstream differential splicing analysis, functional annotation, and biological interpretation, and avoids biased research conclusions drawn from low-quality splicing events.
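
The consensus rule above can be sketched as a count of distinct supporting algorithms per merged event. The input layout, the function name, and the default threshold of 2 are assumptions for illustration; the application only states that the threshold is preset.

```python
def apply_consensus_rule(detections, min_algorithms=2):
    """Keep events supported by at least min_algorithms distinct algorithms.

    detections maps a normalized event identifier to the names of the
    algorithms that reported it. Returns {event_id: algorithm support number}.
    """
    candidates = {}
    for event_id, algorithms in detections.items():
        support = len(set(algorithms))  # repeated reports by one algorithm count once
        if support >= min_algorithms:
            candidates[event_id] = support
    return candidates


detections = {
    "TP53|-|SE|100-200": ["rMATS", "MAJIQ", "SUPPA2"],
    "BRCA1|-|RI|300-450": ["SplAdder"],
    "EGFR|+|SE|500-620": ["rMATS", "rMATS"],
}
print(apply_consensus_rule(detections))  # {'TP53|-|SE|100-200': 3}
```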

[0016] Optionally, the step of filtering candidate splicing events based on preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set specifically includes: obtaining the sample coverage, median splicing inclusion ratio, and number of junction-supporting reads of each candidate splicing event across all sequencing samples; if the sample coverage is greater than or equal to a preset coverage threshold, the median splicing inclusion ratio is greater than a preset ratio threshold, and the number of junction-supporting reads is greater than or equal to a preset read count threshold, determining that the candidate splicing event meets the preset multi-step quality control conditions; and placing the candidate splicing events that meet the preset multi-step quality control conditions into the high-confidence consensus splicing event set.

[0017] By employing the above technical solution, the sample coverage, median splicing inclusion ratio, and number of junction-supporting reads of each candidate splicing event across all sequencing samples are obtained, establishing a multi-level quality assessment system along three orthogonal dimensions. Requiring the sample coverage to be greater than or equal to the preset coverage threshold ensures that a splicing event is not a technical artifact or biological noise originating from individual samples, but a genuine phenomenon stably detected in a sufficient proportion of samples, which improves the reproducibility and biological generality of the event. Requiring the median splicing inclusion ratio to be greater than the preset ratio threshold excludes trace events with extremely low inclusion levels, prevents random splicing errors during pre-mRNA processing or extremely low-abundance non-functional transcripts from being misclassified as regulated alternative splicing, and ensures that the events included in the analysis have sufficient expression intensity and potential functional relevance. Requiring the number of junction-supporting reads to be greater than or equal to the preset read count threshold ensures that the splice junction is directly supported by a sufficient number of sequencing reads, avoiding the inclusion of low-confidence splice sites that result from insufficient sequencing depth or chance matches by the alignment algorithm. The three quality control conditions work together as a rigorous multi-dimensional filter, placing the candidate splicing events that meet the preset multi-step quality control conditions into the high-confidence consensus splicing event set. 
This provides a high-quality dataset with algorithmic consensus, sample representativeness, biological relevance, and sufficient sequencing evidence for subsequent differential splicing analysis, studies on the association between alternative splicing and phenotypes, and exploration of splicing regulation mechanisms. It significantly reduces the interference of false positive events on downstream analysis results and improves the reliability and reproducibility of biological discoveries.
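
The three quality control conditions in [0016] can be sketched as one predicate. The threshold values, the function name, and the treatment of the junction read count as a single per-event number are assumptions for illustration. Note that, following the text, the median ratio test is strict (`>`), while the other two tests are inclusive (`>=`).

```python
from statistics import median


def passes_quality_control(psi_by_sample, junction_reads, n_samples,
                           min_coverage=0.2, min_median_psi=0.05, min_reads=10):
    """Check sample coverage, median inclusion ratio, and junction read support.

    psi_by_sample maps a sample identifier to its splicing inclusion ratio,
    or to None when the event was not quantified in that sample.
    """
    observed = [v for v in psi_by_sample.values() if v is not None]
    if not observed or n_samples <= 0:
        return False
    coverage = len(observed) / n_samples  # fraction of samples with a value
    return (
        coverage >= min_coverage
        and median(observed) > min_median_psi
        and junction_reads >= min_reads
    )


psi = {"S1": 0.3, "S2": 0.5, "S3": None}
print(passes_quality_control(psi, junction_reads=12, n_samples=4))  # True
print(passes_quality_control(psi, junction_reads=5, n_samples=4))   # False
```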

[0018] Optionally, storing the set of sample identifiers corresponding to each high-confidence consensus splicing event in the high-confidence consensus splicing event set and the set of corresponding splicing inclusion ratio values as parallel binary document array fields in a relational database specifically includes: sorting the sample identifiers in the sample identifier set according to a preset sample sorting rule to obtain a first ordered array; sorting the splicing inclusion ratio values corresponding to each sample identifier according to the same preset sample sorting rule to obtain a second ordered array that is index-aligned with the first ordered array; serializing the first ordered array and the second ordered array into binary document format respectively; and inserting the serialized first and second ordered arrays as independent data columns into the same data row of the relational database, and building a generalized inverted index on the independent data columns.
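
The storage steps in [0018] can be sketched as follows. The application does not name a database product; this sketch assumes PostgreSQL, whose `JSONB` type and `GIN` index correspond to a binary document type and a generalized inverted index. The table and column names, the lexicographic sorting rule, and the `%s` placeholder style are likewise assumptions. The SQL is shown as text only and is not executed here.

```python
import json

# Assumed PostgreSQL schema: two parallel JSONB array columns in one row per event.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS consensus_events (
    event_id   TEXT PRIMARY KEY,
    sample_ids JSONB NOT NULL,
    psi_values JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_events_sample_ids
    ON consensus_events USING GIN (sample_ids);
"""

# Parameterized insert; the text parameters are cast to JSONB by the server.
INSERT_SQL = (
    "INSERT INTO consensus_events (event_id, sample_ids, psi_values) "
    "VALUES (%s, %s::jsonb, %s::jsonb)"
)

# Containment query: events detected in every sample of a given set.
CONTAINS_SQL = "SELECT event_id FROM consensus_events WHERE sample_ids @> %s::jsonb"


def build_row(event_id, psi_by_sample):
    """Return insert parameters with two index-aligned arrays."""
    sample_ids = sorted(psi_by_sample)                    # first ordered array
    psi_values = [psi_by_sample[s] for s in sample_ids]   # second, same order
    return (event_id, json.dumps(sample_ids), json.dumps(psi_values))


row = build_row("TP53|-|SE|100-200", {"S2": 0.4, "S1": 0.9})
print(row)
```

Because both arrays are built from the same sorted key list, the value at position i of `psi_values` always belongs to the sample at position i of `sample_ids`.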

[0019] By adopting the above technical solution, the sample identifier set is sorted according to the preset sample sorting rule to obtain the first ordered array. This ensures that the sample dimension of every splicing event is arranged in the same order, eliminating inconsistent sample order caused by differences in data sources or processing batches, and establishes a standardized data organization for cross-event sample-level comparison and batch querying. The splicing inclusion ratio values are sorted according to the same rule to obtain the second ordered array, which is index-aligned with the first. This strict positional correspondence binds each sample identity to its quantitative value, so that a sample identifier and its splicing inclusion ratio value can be associated by array index. This avoids the redundant storage overhead and additional field matching of a nested key-value structure. The two ordered arrays are serialized into binary document format, using compact binary encoding instead of a text-based structure, which reduces storage space. The binary format also supports streaming reads and partial parsing, reducing peak memory during data loading and shortening deserialization time. Inserting the serialized arrays as independent data columns into the same data row of the relational database preserves the transactional integrity of each splicing event while making use of the mature transaction control mechanisms of relational databases. Storing the arrays in parallel columns allows the query engine to load only the sample identifier column or only the value column as needed, achieving column pruning and reducing I/O overhead. 
Building a generalized inverted index on the independent data columns enables containment queries, intersection queries, and range filtering on array elements to quickly locate target splicing events using the index, without requiring full table scans and element-by-element array comparisons. It supports complex query conditions such as finding splicing events detected in a specific sample set or filtering events with a splicing inclusion ratio within a specified range, returning results in millisecond response times, significantly improving the retrieval efficiency and interactive analysis experience of large-scale splicing event datasets.

Attached Figure Description

[0020] Figure 1 is a schematic diagram of the architecture of an alternative splicing data processing system provided in an embodiment of this application; Figure 2 is a flowchart of an alternative splicing data processing method provided in an embodiment of this application; Figure 3 is an exemplary hardware structure diagram of an electronic device for alternative splicing data processing provided in an embodiment of this application.

Detailed Implementation

[0021] Figure 1 shows an exemplary system architecture for an alternative splicing data processing system.

[0022] As shown in Figure 1, the system architecture may include electronic device 11, network 12, and client 13. Network 12 serves as the medium providing a communication link between electronic device 11 and client 13. Network 12 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0023] Users (such as medical researchers or data analysts) can use client 13 to interact with electronic device 11 via network 12 to send RNA sequencing data of sequencing samples, or receive and visualize analysis results of high-confidence consensus splicing events. Client 13 is hardware and can be various terminal devices with a display screen, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0024] Electronic device 11 can be a backend server providing alternative splicing data processing and integration services. Electronic device 11 can acquire sequencing data, call multiple independent alternative splicing quantification algorithms for processing and feature extraction, determine the same target alternative splicing event through tolerance matching, and then filter to obtain a high-confidence consensus splicing event set, and store the corresponding data in a parallel binary document array field in an associated relational database.

[0025] The following detailed explanation uses the electronic device side as an example.

[0026] This embodiment provides an alternative splicing data processing method. Figure 2 is a flowchart of the method. As shown in Figure 2, the method includes steps S101 to S105:

S101: Obtain ribonucleic acid sequencing data from multiple sequencing samples, and call multiple independent alternative splicing quantification algorithms to process the ribonucleic acid sequencing data, obtaining the initial alternative splicing event set output by each alternative splicing quantification algorithm and the corresponding splicing inclusion ratio values.

[0027] In this embodiment, an alternative splicing quantification algorithm is a computational tool that identifies alternative splicing events in sequencing data and quantifies how often they occur. By analyzing the alignment positions of sequencing reads on the genome together with splice junction information, such an algorithm can detect different types of splicing changes during gene transcription, such as exon skipping, intron retention, and the use of alternative splice sites, and can numerically quantify the proportion of each type of splicing change. Different alternative splicing quantification algorithms employ different statistical models and detection strategies. For example, rMATS performs statistical inference based on splice junction read counts, MAJIQ constructs confidence intervals based on local splicing variations, SUPPA2 estimates proportions based on transcript abundance, and SplAdder detects events by augmenting the splicing graph derived from reference annotations.

[0028] Specifically, the electronic device reads from the storage system the ribonucleic acid (RNA) sequencing data of multiple sequencing samples for which sequence alignment has been completed. The RNA sequencing data includes the alignment coordinates of the sequencing reads on the reference genome, the locations of splice junctions, and the coverage depth. The electronic device inputs the RNA sequencing data into multiple independent alternative splicing quantification algorithms for parallel processing. Each alternative splicing quantification algorithm analyzes and statistically models the splice junction information or transcript abundance information in the RNA sequencing data, identifies the alternative splicing events that meet its preset detection criteria, and generates an initial alternative splicing event set. The initial alternative splicing event set contains the genomic location information and event type label of every alternative splicing event detected by the algorithm. The algorithm also calculates the splicing inclusion ratio of each alternative splicing event in each sequencing sample. The splicing inclusion ratio is the percentage of transcripts that follow a specific splicing pattern out of all transcripts covering the splice site. In this way, the initial alternative splicing event set and the corresponding splicing inclusion ratio values output by each alternative splicing quantification algorithm are obtained.
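
As a simplified illustration of the splicing inclusion ratio defined in [0028], the sketch below computes it from read counts supporting the inclusion pattern and the alternative (exclusion) pattern. This is only the basic ratio; each quantification algorithm named in this application applies its own model, and the function name and read-count inputs are assumptions.

```python
def splicing_inclusion_ratio(inclusion_reads, exclusion_reads):
    """Fraction of evidence supporting the inclusion pattern at a splice site.

    Returns None when there is no evidence at all, so that an unobserved
    event is not confused with an inclusion ratio of zero.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


print(splicing_inclusion_ratio(30, 10))  # 0.75
print(splicing_inclusion_ratio(0, 0))    # None
```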

[0029] In a preferred embodiment, acquiring ribonucleic acid sequencing data from multiple sequencing samples specifically includes acquiring a bulk RNA-seq dataset from an immune checkpoint blockade (ICB) treatment cohort. Before alternative splicing quantification, the raw sequencing data undergoes unified sequence alignment and gene expression quantification: the STAR algorithm is run in two-pass mode (Basic) to align sequencing reads to the human reference genome, filtering out non-canonical intron motifs and allowing a maximum of 15 mismatches per read; simultaneously, featureCounts is used for gene-level expression quantification, and Salmon is used for transcript abundance quantification based on transcript sequences to obtain TPM values. Further, invoking multiple independent alternative splicing quantification algorithms specifically includes calling the MAJIQ, rMATS, SUPPA2, and SplAdder algorithms independently in single-sample mode. Specifically: the MAJIQ algorithm generates PSI estimates in Voila binary format based on the constructed splice graph, requiring at least 10 supporting reads for each detected junction; the rMATS algorithm generates an output file based on junction counts, with a statistical cutoff value of 0.0001 applied to the splice site usage proportion; the SUPPA2 algorithm estimates PSI values based on the transcript abundance (TPM) output by Salmon; and the SplAdder algorithm constructs a splice graph by augmenting the reference annotation with sample-specific junctions, and detects alternative splicing events at the highest stringency level (C3 confidence level).

[0030] S102: Extract the gene symbol, strand direction, splicing event type, and genome coordinates of each initial alternative splicing event in the initial alternative splicing event set.

[0031] In this embodiment, a gene symbol refers to a standardized character label used to uniquely identify and formally represent gene names. Since different alternative splicing quantification algorithms may use different gene identification systems when outputting initial alternative splicing events—for example, Ensembl gene numbers, RefSeq numbers, or HGNC official gene symbols—it is necessary to uniformly convert these heterogeneous gene identifiers into standardized gene symbols to achieve cross-algorithm event alignment. The strand direction associated with the gene symbol indicates the transcriptional direction of the gene on the DNA double helix, the splicing event type is used to label the specific pattern of alternative splicing, and the genomic coordinates are used to record the precise location information of the splicing event on the reference genome. These features together constitute a complete descriptive framework for alternative splicing events.

[0032] Specifically, the electronic device reads the original annotation files corresponding to the initial alternative splicing event sets output by each alternative splicing quantification algorithm. The original annotation files contain event labeling information and gene identification information in algorithm-specific formats. The electronic device calls a preset dictionary mapping table to identify and convert the algorithm-specific event type labels in the original annotation files, uniformly mapping the event type labels used by different algorithms to standardized splicing event types. The electronic device calls a preset gene annotation lookup table to query and match the original gene identifiers in the original annotation files, converting the original gene identifiers into standardized gene symbols. At the same time, the electronic device parses the strand direction label field and coordinate field in the original annotation files, extracting strand direction information representing the transcription direction and genomic coordinate information containing chromosome number and nucleotide position, thereby completing the extraction of gene symbols, strand direction, splicing event types, and genomic coordinates for each initial alternative splicing event in the initial alternative splicing event set.

[0033] Based on the above embodiments, as an optional embodiment, the step of extracting the gene symbol, strand direction, splicing event type, and genome coordinates of each initial alternative splicing event in the initial alternative splicing event set may include S201 to S204:

S201: Read the original annotation file corresponding to the initial alternative splicing event set output by each alternative splicing quantification algorithm.

[0034] In this embodiment, the original annotation file refers to a structured file, output by an alternative splicing quantification algorithm after splicing event detection, that contains detailed descriptions of the detected splicing events. The original annotation file records the attribute information of each detected initial alternative splicing event using an algorithm-specific file format and field naming conventions, including multi-dimensional annotation data such as gene identifiers, event type labels, genomic location coordinates, strand direction markers, and statistical confidence scores. The original annotation files generated by different alternative splicing quantification algorithms differ significantly in file format, field names, and content organization. For example, the rMATS algorithm outputs tab-delimited text files, the MAJIQ algorithm outputs binary Voila files, the SUPPA2 algorithm outputs text files containing PSI values, and the SplAdder algorithm outputs compressed tab-delimited files.

[0035] Specifically, the electronic device determines the output directory path and file naming pattern corresponding to each alternative splicing quantification algorithm. The electronic device traverses the output directory of each algorithm to locate the original annotation file corresponding to the initial alternative splicing event set. The electronic device identifies the file format of the original annotation file and selects the corresponding file parser to read it. The file parser loads the structured data in the original annotation file into memory to form a data object for subsequent processing. The data object retains all field information and data records in the original annotation file, thereby completing the reading of the original annotation file corresponding to the initial alternative splicing event set output by each alternative splicing quantification algorithm.

[0036] S202: Convert the algorithm-specific event type labels in the original annotation file into unified splicing event types according to the preset dictionary mapping table.

[0037] In this embodiment, the dictionary mapping table refers to a pre-constructed data structure used to store the correspondence between the event type labels of different alternative splicing quantification algorithms and standardized splicing event types. The dictionary mapping table records, as key-value pairs, the conversion rules from algorithm-specific event type labels to unified splicing event types, so that the annotation terms specific to each algorithm are mapped onto a consistent event classification system. An algorithm-specific event type label is the naming convention that a given alternative splicing quantification algorithm uses for event types in its output. For example, rMATS uses the SE label to represent exon skipping events, MAJIQ uses the cassette label to represent the same type of event, and SUPPA2 uses the skipping_exon label to represent this type of event. The dictionary mapping table uniformly converts these different labels into the standardized splicing event type of exon skipping.

[0038] Specifically, the electronic device loads a preset dictionary mapping table from the configuration file into memory. The dictionary mapping table contains multiple mapping entries, each recording an algorithm identifier, an algorithm-specific event type label, and the corresponding unified splicing event type. The electronic device traverses the data records in the original annotation file, extracts the algorithm-specific event type label field for each data record, and uses the current algorithm identifier and the extracted algorithm-specific event type label as the query key to search and match in the dictionary mapping table. After retrieving the corresponding mapping entry, the electronic device obtains the unified splicing event type stored in the mapping entry. The electronic device replaces the algorithm-specific event type label field in the data record with the obtained unified splicing event type. After completing the traversal and replacement of all data records, a conversion annotation file with standardized event types is generated, thereby realizing the conversion of the algorithm-specific event type label in the original annotation file to the unified splicing event type.

[0039] In a specific application scenario, the preset dictionary mapping table uniformly maps the event types of the four algorithms to five classic splicing types. The specific mapping rules include: converting rMATS's SE, SUPPA2's SE, SplAdder's exon_skip, and MAJIQ's cassette to exon skipping (ES); converting rMATS's IR, SUPPA2's IR, SplAdder's intron_retention, and MAJIQ's alternative_intron to intron retention (IR); converting rMATS's MXE, SUPPA2's MX, SplAdder's mutex_exons, and MAJIQ's mutually_exclusive to mutually exclusive exons (MXE); and reporting alternative 3' splice site usage (A3SS) and alternative 5' splice site usage (A5SS) consistently across the algorithms.
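The lookup performed in S202 can be sketched as a dictionary keyed by the pair (algorithm identifier, algorithm-specific label). This is only an illustration under stated assumptions: the table holds a small subset of entries, the record layout (`event_type` field) and the function name are hypothetical, and the label strings should be checked against the real output of each tool version before use.

```python
# Keys: (algorithm identifier, algorithm-specific label). Values: unified event type.
# Only a subset of entries is shown; the label strings are illustrative assumptions.
EVENT_TYPE_MAP = {
    ("MAJIQ", "cassette"): "ES",
    ("SplAdder", "exon_skip"): "ES",
    ("SplAdder", "intron_retention"): "IR",
    ("SplAdder", "mutex_exons"): "MXE",
    ("rMATS", "MXE"): "MXE",
    ("SUPPA2", "MX"): "MXE",
}


def unify_event_types(records, algorithm):
    """Return copies of the records with the event type replaced by the unified type."""
    converted = []
    for record in records:
        key = (algorithm, record["event_type"])
        if key not in EVENT_TYPE_MAP:
            # Fail loudly instead of silently passing an unknown label downstream.
            raise KeyError(f"no mapping entry for {key}")
        converted.append({**record, "event_type": EVENT_TYPE_MAP[key]})
    return converted


spladder_records = [{"gene_id": "ENSG00000141510", "event_type": "exon_skip"}]
print(unify_event_types(spladder_records, "SplAdder"))
```

Including the algorithm identifier in the key matters because the same string may carry different meanings in different tools.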

[0040] S203: Based on a preset gene annotation lookup table, convert the original gene identifiers in the original annotation file into standardized gene symbols.

[0041] In this embodiment, the gene annotation lookup table refers to a pre-established reference database that stores the mapping relationships between various gene identifier systems and standardized gene symbols. The gene annotation lookup table integrates gene identifier information from authoritative genome annotation resources, recording the correspondence between identifiers such as Ensembl gene IDs, RefSeq transcript accessions, Entrez gene IDs, and HGNC official gene symbols, and thus provides a unified reference standard for converting between heterogeneous gene identifier systems. The original gene identifier refers to the form of gene identifier used by the alternative splicing quantification algorithm in the original annotation file. Different algorithms may use different gene identifier systems; for example, rMATS typically uses Ensembl gene IDs, LeafCutter may use RefSeq accessions, and MAJIQ may directly use gene symbols. Standardized gene symbols refer to gene names expressed according to internationally recognized gene nomenclature standards, typically the official gene symbols assigned by the HUGO Gene Nomenclature Committee (HGNC) or the authoritative gene symbol systems of other species.

[0042] Specifically, the electronic device loads a pre-defined gene annotation lookup table from the reference database into memory. The gene annotation lookup table contains multiple annotation entries, each recording an original gene identifier type, an original gene identifier value, and the corresponding canonical gene symbol. The electronic device traverses the data records in the original annotation file, identifies the identifier type of the gene identifier field for each data record, and extracts the gene identifier value. The electronic device uses the identified identifier type and the extracted gene identifier value as the query key to perform a search and match in the gene annotation lookup table. If a corresponding annotation entry is found, the canonical gene symbol stored in the annotation entry is obtained. The electronic device replaces the original gene identifier field in the data record with the obtained canonical gene symbol. If no corresponding annotation entry is found, the original gene identifier is retained or marked as unmapped. After completing the traversal and replacement of all data records, a standard annotation file with normalized gene identifiers is generated, thereby realizing the conversion of the original gene identifiers in the original annotation file to canonical gene symbols.
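A minimal sketch of the S203 conversion follows, including the fallback for identifiers that have no annotation entry. The lookup table is a two-entry stand-in for a real annotation resource, and the identifier-type heuristic, field names, and `mapped` flag are assumptions made for illustration.

```python
# Stand-in for the gene annotation lookup table:
# (identifier type, identifier value) -> standardized gene symbol.
GENE_LOOKUP = {
    ("ensembl", "ENSG00000141510"): "TP53",
    ("entrez", "7157"): "TP53",
}


def identifier_type(value):
    """Guess the identifier system from the value's shape (a simplifying assumption)."""
    if value.startswith("ENSG"):
        return "ensembl"
    if value.isdigit():
        return "entrez"
    return "symbol"


def to_gene_symbol(record):
    """Attach a standardized gene symbol, or retain the original and mark it unmapped."""
    raw = record["gene_id"]
    id_type = identifier_type(raw)
    if id_type == "symbol":
        return {**record, "gene_symbol": raw, "mapped": True}
    symbol = GENE_LOOKUP.get((id_type, raw))
    if symbol is None:
        # No annotation entry found: keep the original identifier and flag it.
        return {**record, "gene_symbol": raw, "mapped": False}
    return {**record, "gene_symbol": symbol, "mapped": True}


print(to_gene_symbol({"gene_id": "ENSG00000141510"}))
```

Flagging unmapped records instead of dropping them keeps the decision about how to treat them with the later quality control steps.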

[0043] S204: Extract the strand direction and genome coordinates from the original annotation file.

[0044] In this embodiment, strand direction refers to the transcriptional direction, within the DNA double-stranded structure, of the gene involved in the alternative splicing event, and identifies whether gene transcription occurs on the positive or negative strand. Strand direction is typically represented by symbols: a plus sign (+) or the number 1 for the positive strand, and a minus sign (-) or the number -1 for the negative strand. Some algorithms may also use text labels such as "forward" and "reverse." Strand direction information is crucial for correctly interpreting the biological significance of splicing events and for resolving the relative positions of exons and introns. Genome coordinates refer to the precise locations, on the reference genome sequence, of the key splice sites and exon boundaries involved in the alternative splicing event, and typically include three core elements: chromosome number, start position, and end position. Genome coordinates are expressed in a specific coordinate system, most commonly the 0-based half-open interval system or the 1-based closed interval system, and the original annotation files output by different algorithms may use different coordinate systems. For example, the genome coordinates of an exon skipping event need to record the coordinate values of multiple key sites, such as the end position of the upstream exon, the start and end positions of the skipped exon, and the start position of the downstream exon.

[0045] Specifically, the electronic device identifies the file format specifications of the original annotation file and obtains field definition information. Based on the field definition information, the electronic device locates the target field position storing strand direction information in the original annotation file. The electronic device traverses the data records in the original annotation file and extracts the strand direction marker value from the target field position. The electronic device performs format normalization processing on the extracted strand direction marker value and uniformly converts it into a standard plus or minus sign representation. Based on the field definition information, the electronic device locates the chromosome field, start position field, and end position field storing genomic coordinate information in the original annotation file. The electronic device extracts the chromosome number, coordinate start value, and coordinate end value from each coordinate-related field. The electronic device identifies the coordinate system type used in the original annotation file and performs coordinate value conversion to unify it into a standard coordinate system when necessary. The electronic device associates and stores the parsed strand direction information and genomic coordinate information with the corresponding data records to form a structured data object containing position annotations, thereby completing the operation of parsing strand direction and genomic coordinates from the original annotation file.
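The two normalizations in S204 can be sketched as follows: strand markers are mapped to a plus or minus sign, and intervals are converted to one coordinate system. Choosing the 1-based closed system as the standard, as well as the function names and the system labels, are assumptions of this sketch; the text only requires that one standard system be used.

```python
# Accepted strand markers, taken from the forms mentioned in the description.
STRAND_ALIASES = {
    "+": "+", "1": "+", "forward": "+",
    "-": "-", "-1": "-", "reverse": "-",
}


def normalize_strand(value):
    """Map a strand marker (symbol, number, or text label) to '+' or '-'."""
    key = str(value).strip().lower()
    if key not in STRAND_ALIASES:
        raise ValueError(f"unrecognized strand marker: {value!r}")
    return STRAND_ALIASES[key]


def to_one_based_closed(start, end, system):
    """Convert an interval to 1-based closed coordinates."""
    if system == "0-based-half-open":
        # The start shifts by one; the exclusive end equals the inclusive 1-based end.
        return start + 1, end
    if system == "1-based-closed":
        return start, end
    raise ValueError(f"unknown coordinate system: {system!r}")


print(normalize_strand("Reverse"))                         # -
print(to_one_based_closed(99, 200, "0-based-half-open"))   # (100, 200)
```

An unconverted coordinate system difference shows up as exactly a 1 bp offset at interval starts, which is the kind of shift the tolerance threshold in S103 is meant to absorb.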

[0046] S103: For the initial alternative splicing events output by different alternative splicing quantification algorithms, if the gene symbol, strand direction, and splicing event type are all the same, and the absolute value of the difference between the genome coordinates is less than or equal to the preset tolerance threshold, then the corresponding initial alternative splicing events are determined to be the same target alternative splicing event.

[0047] In this embodiment, an initial alternative splicing event refers to the original splicing event record extracted from the standard annotation files of different alternative splicing quantification algorithms. Each initial alternative splicing event contains basic annotation information such as gene symbol, strand direction, splicing event type, and genome coordinates. Because different algorithms differ slightly in how they identify splice sites and define boundaries, the same biological splicing event may be recorded by different algorithms as multiple initial alternative splicing events with slightly different coordinates. The same target alternative splicing event refers to a unified event entity that represents the same biological splicing phenomenon after cross-algorithm comparison and merging. It integrates the detection results of the same splicing event from multiple algorithms, providing a standardized event unit for subsequent consistency analysis and comprehensive evaluation. The tolerance threshold refers to the maximum difference between genome coordinates allowed when comparing initial alternative splicing events output by different algorithms. The setting of the tolerance threshold takes into account factors such as the uncertainty of sequencing data, differences in the localization accuracy of algorithms, and slight changes between genome annotation versions. In a preferred embodiment, the preset tolerance threshold is set to 1 base pair (i.e., ±1 bp). A fuzzy matching tolerance window of ±1 bp absorbs the small offsets in exon boundary definitions between different algorithms (such as MAJIQ and rMATS) and removes a large share of algorithm-specific artifacts. This allows hundreds of thousands of initial events to be converged into high-confidence events, so that the alternative splicing events finally identified as molecular mechanism targets are highly reproducible.

[0048] Specifically, the electronic device acquires standard annotation files from the outputs of the first and second alternative splicing quantification algorithms. It extracts a first initial alternative splicing event set from the first algorithm's standard annotation file and a second initial alternative splicing event set from the second algorithm's standard annotation file. The electronic device iterates through each initial event in the first initial alternative splicing event set, extracting its gene symbol, strand direction, splicing event type, and genomic coordinate information as a comparison benchmark. The electronic device searches the second initial alternative splicing event set for candidate events with the same gene symbol to form a candidate event subset. The electronic device then further filters the candidate event subset for matching events with the same strand direction and splicing event type. For the selected matching events, the electronic device calculates the absolute value of the difference between the genomic coordinates of each matching event and those of the comparison benchmark. The electronic device determines whether the absolute difference at every key coordinate site is less than or equal to the preset tolerance threshold. If all absolute differences meet the tolerance threshold condition, the current initial event and the matching event are determined to be the same target alternative splicing event and are merged and marked. If the absolute difference at any coordinate exceeds the tolerance threshold, they are determined to be different alternative splicing events. After the electronic device completes the traversal and comparison of all initial events, it generates a merged set of same target alternative splicing events.
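The pairwise comparison between two event sets can be sketched as below, with the ±1 bp tolerance of the preferred embodiment as the default. The record layout (`id`, `gene`, `strand`, `type`, `coords`) and the function names are hypothetical, and the nested loop is chosen for readability rather than speed.

```python
def same_target(event_a, event_b, tolerance=1):
    """True if gene, strand and type are equal and every key coordinate differs by <= tolerance."""
    key_a = (event_a["gene"], event_a["strand"], event_a["type"])
    key_b = (event_b["gene"], event_b["strand"], event_b["type"])
    if key_a != key_b:
        return False
    if len(event_a["coords"]) != len(event_b["coords"]):
        return False
    return all(abs(a - b) <= tolerance
               for a, b in zip(event_a["coords"], event_b["coords"]))


def match_event_sets(first_set, second_set, tolerance=1):
    """Return (first id, second id) pairs judged to be the same target event."""
    pairs = []
    for ref in first_set:
        for cand in second_set:
            if same_target(ref, cand, tolerance):
                pairs.append((ref["id"], cand["id"]))
    return pairs


first = [{"id": "a1", "gene": "TP53", "strand": "-", "type": "ES",
          "coords": (100, 200, 300, 400)}]
second = [
    {"id": "b1", "gene": "TP53", "strand": "-", "type": "ES",
     "coords": (101, 200, 299, 400)},   # every site within 1 bp: same target
    {"id": "b2", "gene": "TP53", "strand": "-", "type": "ES",
     "coords": (103, 200, 300, 400)},   # first site is 3 bp away: different event
]
print(match_event_sets(first, second))  # [('a1', 'b1')]
```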

[0049] Based on the above embodiments, as an optional embodiment, for the initial alternative splicing events output by different alternative splicing quantification algorithms, if the gene symbol, strand direction, and splicing event type are all the same, and the absolute value of the difference between the genome coordinates is less than or equal to a preset tolerance threshold, then the step of determining that the corresponding initial alternative splicing events are the same target alternative splicing event may include S301 to S304:

S301: Compare the gene symbol, strand direction, and splicing event type of the initial alternative splicing events output by different alternative splicing quantification algorithms.

[0050] In this application embodiment, a gene symbol refers to a standardized name or code that identifies a specific gene. Common gene symbol systems include HGNC gene symbols, Ensembl gene IDs, and RefSeq gene IDs. Gene symbols are used to uniquely identify the gene to which an alternative splicing event belongs and are the primary matching condition for cross-algorithm event alignment. Strand direction refers to the transcriptional direction, within the DNA double-stranded structure, of the gene containing the alternative splicing event, and is either positive or negative. Strand direction information determines the relative order of splice sites and the arrangement of exons and introns, and is a key dimension for determining the consistency of splicing events. Splicing event type refers to the pattern classification of the alternative splicing event. Common types include exon skipping (ES), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), intron retention (IR), and mutually exclusive exons (MXE). The splicing event type defines the topological characteristics of the splicing event; two records must have the same type classification to be considered a match for the same biological splicing phenomenon.

[0051] Specifically, the electronic device acquires standard annotation files output by multiple alternative splicing quantification algorithms and extracts initial alternative splicing event sets from them. The electronic device uses the initial alternative splicing event set output by the first algorithm as a reference event set and the initial alternative splicing event sets output by other algorithms as comparison event sets. The electronic device iterates through each initial alternative splicing event in the reference event set and extracts its gene symbol, strand direction, and splicing event type as comparison feature combinations. The electronic device searches the comparison event set for candidate events with the same gene symbol to form the first round of screening results. In the first round of screening results, the electronic device further filters for candidate events with the same strand direction to form the second round of screening results. In the second round of screening results, the electronic device continues to filter for candidate events with the same splicing event type to form the final matching candidate event set. The electronic device establishes a temporary association between the reference event and the events in the matching candidate event set. The electronic device repeats the above comparison process until all events in the reference event set are cross-referenced with all comparison event sets. The electronic device counts the number of matching candidate events found for each reference event in different algorithms, thus completing the comparison operation of the initial alternative splicing events output by different alternative splicing quantification algorithms in three dimensions: gene symbol, strand direction, and splicing event type.
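The three successive screening rounds select the same events as a single lookup on the combined key (gene symbol, strand direction, splicing event type). The sketch below uses that shortcut, which is a design choice of this example rather than something stated in the text; the record layout and function names are likewise hypothetical.

```python
from collections import defaultdict


def build_index(events):
    """Group events by (gene symbol, strand direction, splicing event type)."""
    index = defaultdict(list)
    for event in events:
        index[(event["gene"], event["strand"], event["type"])].append(event)
    return index


def find_matching_candidates(reference_events, comparison_sets):
    """For each reference event, list the candidate ids found in each comparison algorithm.

    comparison_sets maps an algorithm name to that algorithm's list of events.
    """
    indexes = {alg: build_index(events) for alg, events in comparison_sets.items()}
    associations = {}
    for ref in reference_events:
        key = (ref["gene"], ref["strand"], ref["type"])
        associations[ref["id"]] = {
            alg: [cand["id"] for cand in index.get(key, [])]
            for alg, index in indexes.items()
        }
    return associations


reference = [{"id": "r1", "gene": "TP53", "strand": "-", "type": "ES"}]
comparison = {
    "MAJIQ": [{"id": "m1", "gene": "TP53", "strand": "-", "type": "ES"},
              {"id": "m2", "gene": "TP53", "strand": "-", "type": "IR"}],
    "SUPPA2": [{"id": "s1", "gene": "TP53", "strand": "+", "type": "ES"}],
}
print(find_matching_candidates(reference, comparison))
```

The number of matching candidates per algorithm is the length of each returned list. Building the index once per comparison set avoids rescanning that set for every reference event.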

[0052] S302: Extract the start site coordinates and end site coordinates of the initial alternative splicing event.

[0053] In the embodiments of this application, the start site coordinates refer to the start boundary positions, on the reference genome, of the splice sites involved in the alternative splicing event, and the end site coordinates refer to the corresponding end boundary positions; together they define the complete range of the splicing region. The key sites depend on the event type. For exon skipping events, they include the end site of the upstream exon, the start site of the skipped exon, the end site of the skipped exon, and the start site of the downstream exon. For alternative 5' splice site events, they include the start site of the long exon, the start site of the short exon, and the end site of the common exon. For alternative 3' splice site events, they include the start site of the common exon, the end site of the long exon, and the end site of the short exon. For intron retention events, they include the end site of the exon upstream of the intron, the start site of the intron, the end site of the intron, and the start site of the exon downstream of the intron. Different splicing event types therefore correspond to different numbers and combinations of start site coordinates and end site coordinates.

[0054] Specifically, the electronic device acquires the initial alternative splicing event data record after the gene symbol, strand direction, and splicing event type have been aligned. It identifies the splicing event type field of the initial alternative splicing event and reads the type value. It determines the name and number of coordinate fields to be extracted for the current splicing event type, locates the field in the initial alternative splicing event data record that stores genomic coordinates, reads the string value of the start site coordinate from the corresponding field and converts it to an integer format, reads the string value of the end site coordinate from the corresponding field and converts it to an integer format, and arranges the extracted start site coordinate and end site coordinate according to the order of the splicing sites on the genome to form an ordered coordinate set containing all key coordinates.
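A minimal sketch of S302 is given below: the event type selects which coordinate fields to read, the string values are converted to integers, and the result is ordered by genomic position. The field names are hypothetical, and only two event types are listed.

```python
# Coordinate fields to read for each unified event type (hypothetical field names).
COORDINATE_FIELDS = {
    "ES": ["upstream_exon_end", "skipped_exon_start",
           "skipped_exon_end", "downstream_exon_start"],
    "IR": ["upstream_exon_end", "intron_start",
           "intron_end", "downstream_exon_start"],
}


def ordered_coordinates(record):
    """Read the type-specific coordinate fields as integers, sorted by genomic position."""
    fields = COORDINATE_FIELDS[record["type"]]
    # Convert before sorting: as strings, "900" would sort after "1500".
    return sorted(int(record[name]) for name in fields)


record = {
    "type": "ES",
    "upstream_exon_end": "900",
    "skipped_exon_start": "1500",
    "skipped_exon_end": "1600",
    "downstream_exon_start": "2100",
}
print(ordered_coordinates(record))  # [900, 1500, 1600, 2100]
```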

[0055] S303: Calculate the first absolute difference between the start site coordinates and the second absolute difference between the end site coordinates.

[0056] In this embodiment, the first absolute difference refers to the degree of difference in the coordinates of the initial variable splicing events output by different quantitative variable splicing algorithms at the starting point. It is obtained by subtracting the starting point coordinates in the reference event set from the starting point coordinates in the event set to be compared and taking the absolute value. This is used to quantify the positional deviation of different algorithms in identifying the starting boundary of the same splicing event. The smaller the first absolute difference, the more consistent the different algorithms are in locating the starting point. The second absolute difference refers to the degree of difference in the coordinates of the initial variable splicing events output by different quantitative variable splicing algorithms at the ending point. It is obtained by subtracting the ending point coordinates in the reference event set from the ending point coordinates in the event set to be compared and taking the absolute value. This is used to quantify the positional deviation of different algorithms in identifying the ending boundary of the same splicing event. The first and second absolute differences together constitute the evaluation index of the splicing event coordinate matching degree.

[0057] Specifically, the electronic device selects an initial alternative splicing event from the reference event set as the reference event and extracts all of its start site coordinates and end site coordinates. It then obtains, from the event set to be compared, the matching candidate events that have the same gene symbol, strand direction, and splicing event type as the reference event, and extracts their corresponding start site coordinates and end site coordinates. Each start site coordinate of the reference event is paired one by one with the corresponding start site coordinate of the matching candidate event; the two coordinates in each pair are subtracted and the absolute value of the result is taken to obtain a first absolute difference. Similarly, each end site coordinate of the reference event is paired one by one with the corresponding end site coordinate of the matching candidate event; the two coordinates in each pair are subtracted and the absolute value of the result is taken to obtain a second absolute difference. The first absolute differences for all start site coordinates and the second absolute differences for all end site coordinates are stored to form a set of absolute differences.
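The difference calculation in S303 reduces to element-wise subtraction over paired coordinates. In this sketch the events carry separate `starts` and `ends` lists in matching order; that layout and the function name are assumptions made for illustration.

```python
def absolute_differences(reference, candidate):
    """Return (first, second): absolute differences of paired start and end site coordinates."""
    if (len(reference["starts"]) != len(candidate["starts"])
            or len(reference["ends"]) != len(candidate["ends"])):
        raise ValueError("events must have the same number of start and end sites")
    first = [abs(c - r) for r, c in zip(reference["starts"], candidate["starts"])]
    second = [abs(c - r) for r, c in zip(reference["ends"], candidate["ends"])]
    return first, second


reference = {"starts": [1500, 2100], "ends": [900, 1600]}
candidate = {"starts": [1501, 2100], "ends": [900, 1599]}
print(absolute_differences(reference, candidate))  # ([1, 0], [0, 1])
```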

[0058] S304: If the comparison results are all the same, and the first absolute difference and the second absolute difference are both less than or equal to the preset tolerance threshold, then the corresponding initial alternative splicing events are confirmed and merged into the same target alternative splicing event.

[0059] In this embodiment, the tolerance threshold refers to the maximum permissible difference between the splice site coordinates of the initial alternative splicing events output by different alternative splicing quantification algorithms. It is used to determine whether initial alternative splicing events from different algorithms point to the same true splicing event. The tolerance threshold is typically set to a small number of base pairs to ensure the accuracy of coordinate matching. When both the first absolute difference and the second absolute difference are less than or equal to the tolerance threshold, the splice site coordinates of the two initial alternative splicing events are considered sufficiently close, and the two events can be regarded as the same splicing event. The same target alternative splicing event refers to the merged result of initial alternative splicing events confirmed, after coordinate comparison and tolerance determination, to be the same true splicing event. It includes consensus information from multiple initial alternative splicing events from different algorithms and is the basic data unit for subsequent alternative splicing analysis.

[0060] Specifically, the electronic device reads the gene symbol comparison result, the strand direction comparison result, and the splicing event type comparison result, and determines whether all three results indicate identity. If all three are identical, it iterates through the first absolute differences corresponding to all start site coordinates, compares each with the tolerance threshold, and determines whether all of them are less than or equal to the tolerance threshold. If so, it iterates through the second absolute differences corresponding to all end site coordinates, compares each with the tolerance threshold, and determines whether all of them are less than or equal to the tolerance threshold. If all first absolute differences and all second absolute differences are less than or equal to the tolerance threshold, it confirms that the initial alternative splicing event in the reference event set and the matching candidate event in the event set to be compared point to the same real splicing event. The two events are then merged to generate a single same target alternative splicing event data record containing information from both.
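The decision and merge in S304 can be sketched as follows. The checks run in the order described: the three comparison results first, then the first absolute differences, then the second absolute differences. The merged record layout is an assumption, as is keeping the reference event's coordinates as the representative coordinates.

```python
def merge_if_same_target(reference, candidate, first_diffs, second_diffs, tolerance=1):
    """Return a merged record if both events point to the same target event, else None."""
    if any(reference[k] != candidate[k] for k in ("gene", "strand", "type")):
        return None
    if any(d > tolerance for d in first_diffs):
        return None
    if any(d > tolerance for d in second_diffs):
        return None
    return {
        "gene": reference["gene"],
        "strand": reference["strand"],
        "type": reference["type"],
        # Assumption: the reference event's coordinates represent the merged event.
        "starts": reference["starts"],
        "ends": reference["ends"],
        # Keep each algorithm's own inclusion ratio so nothing is lost by merging.
        "psi_by_algorithm": {
            reference["algorithm"]: reference["psi"],
            candidate["algorithm"]: candidate["psi"],
        },
    }


ref = {"algorithm": "rMATS", "gene": "TP53", "strand": "-", "type": "ES",
       "starts": [1500, 2100], "ends": [900, 1600], "psi": 0.62}
cand = {"algorithm": "MAJIQ", "gene": "TP53", "strand": "-", "type": "ES",
        "starts": [1501, 2100], "ends": [900, 1599], "psi": 0.58}
merged = merge_if_same_target(ref, cand, [1, 0], [0, 1])
print(merged["psi_by_algorithm"])  # {'rMATS': 0.62, 'MAJIQ': 0.58}
```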

[0061] S104: Generate a unified normalized identifier for the same target alternative splicing event, and filter the same target alternative splicing events based on preset consensus rules and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set.

[0062] In this embodiment, the normalized identifier refers to a unique standardized coded string generated for the same target alternative splicing event after coordinate comparison and merging. It is constructed by combining information such as the gene symbol, splicing event type, chromosome position, strand direction, and coordinates of key splice sites according to a unified format rule. The identifier ensures that the same splicing event output by different algorithms carries a consistent identification code after merging. It eliminates differences in naming conventions among alternative splicing quantification algorithms and facilitates subsequent event tracing and data integration. The consensus rule refers to the requirements for determining that multiple alternative splicing quantification algorithms agree on the same target alternative splicing event. It includes a frequency requirement, namely that the splicing event must be detected by at least a specified number of algorithms, and a directional requirement, namely that the splicing inclusion rate values output by different algorithms show a consistent trend.

[0063] Specifically, the electronic device acquires the gene symbol, splicing event type, chromosome number, strand direction, and coordinates of all start and end sites for the same target alternative splicing event. It then concatenates the gene symbol, splicing event type code, chromosome number, strand direction symbol, and coordinate values according to predefined delimiters and order to form a normalized identifier string. This normalized identifier is written into the identifier field of the same target alternative splicing event to complete the identifier assignment. The device also counts the number of different alternative splicing quantification algorithms that detect the same target alternative splicing event and determines whether this number meets the minimum algorithm count preset in the consensus rules. If the minimum algorithm count is met, the device extracts the splicing inclusion rate trends among different sample groups from the same target alternative splicing event and determines whether the trends output by different algorithms have a consistent upward or downward direction. If the trend direction is consistent, the multi-step quality control conditions are applied to screen the same target alternative splicing event. The multi-step quality control conditions include that the difference in splicing inclusion rate must exceed a preset change threshold, the probability value of the statistical significance test must be less than a preset significance level, and the splice site sequence must conform to the canonical splicing signal pattern. Each target alternative splicing event is tested to determine whether it simultaneously meets all of the multi-step quality control conditions. The target alternative splicing events that simultaneously meet the consensus rules and the multi-step quality control conditions are marked as having passed screening and are collected to form a high-confidence consensus splicing event set.

[0064] Based on the above embodiments, as an optional embodiment, the step of generating a unified normalized identifier for the same target variable splicing event may include S401 to S402: S401: For multiple initial alternative splicing events that are determined to be the same target alternative splicing event, extract the corresponding genome coordinates and construct an exon boundary coordinate set.

[0065] In this embodiment, the exon boundary coordinate set refers to a data structure containing the genomic coordinates of all exon start and stop sites extracted from multiple initial alternative splicing events identified as the same target alternative splicing event. It includes the positional information of all exon boundaries involved in the same splicing event from different alternative splicing quantification algorithms. The exon boundary coordinate set is the foundational data for subsequent coordinate consensus integration and normalized identifier generation. Genomic coordinates refer to the absolute positional values of exon boundaries on the reference genome sequence, represented by a coordinate triplet consisting of chromosome number, base position, and strand direction. These coordinates are used to precisely locate the physical position of the exon fragments involved in the splicing event within the genome.

[0066] Specifically, the electronic device reads a list of all merged initial alternative splicing events from the same target alternative splicing event data record. It iterates through each initial alternative splicing event and extracts its chromosome number, strand direction, exon start site coordinates, and exon stop site coordinates. Based on the splicing event type, it determines the number of exon boundaries to be extracted and their corresponding coordinates. For exon skipping events, four boundary coordinates are extracted: the upstream exon stop site, the skipped exon start site, the skipped exon stop site, and the downstream exon start site. For alternative 5' splice site events, three boundary coordinates are extracted: the shared exon start site, the short transcript exon stop site, and the long transcript exon stop site. For alternative 3' splice site events, three boundary coordinates are extracted: the two alternative exon start sites (one for the long transcript and one for the short transcript) and the shared exon stop site. For mutually exclusive exon events, six boundary coordinates are extracted: the upstream exon stop site, the start and stop sites of the first exon, the start and stop sites of the second exon, and the downstream exon start site. For intron retention events, two boundary coordinates are extracted: the upstream exon stop site and the downstream exon start site. All extracted boundary coordinates are stored in a list data structure as coordinate triplets composed of chromosome number, base position, and strand direction. The coordinate triplets are sorted according to the order of the boundary positions on the genome, and the sorted triplets are organized into the exon boundary coordinate set data structure.

[0067] S402: According to the preset splicing order, the gene symbol, strand direction, splicing event type, and exon boundary coordinate set are concatenated into a string, and the string is used as the unified normalized identifier for the same target alternative splicing event.

[0068] In this embodiment, the splicing order refers to the concatenation order, that is, the predefined arrangement in which information elements such as the gene symbol, strand direction, splicing event type, and exon boundary coordinate set are combined when constructing a normalized identifier. The splicing order ensures that the same splicing event from different sources generates an identical identifier string. It typically follows a logical structure from gene-level information to location-level information: gene symbol, splicing event type, chromosome number, strand direction, and the exon boundary coordinate sequence ordered by genomic position. String splicing refers to concatenating the information elements into a single continuous string using predefined delimiters. Commonly used delimiters include colons to separate information fields, vertical bars to separate multiple coordinate values, and underscores to connect composite information. The resulting normalized identifier string is unique, readable, and parsable.

[0069] Specifically, the electronic device reads the gene symbol field from the data record of the same target alternative splicing event to obtain the gene name string, reads the strand direction field to obtain the positive or negative strand direction symbol, and reads the splicing event type field to obtain the event type code, such as ES for exon skipping, A5SS for alternative 5' splice site, A3SS for alternative 3' splice site, MXE for mutually exclusive exons, and IR for intron retention. It also reads the exon boundary coordinate set to obtain the chromosome number and a list of all boundary coordinate values sorted by genomic position. Following the preset splicing order, the gene symbol, splicing event type code, chromosome number, and strand direction symbol are connected sequentially using colons as delimiters to form the identifier prefix. Each coordinate value in the exon boundary coordinate set is traversed and converted into a string, and all coordinate value strings are connected sequentially using vertical bars as delimiters to form the coordinate sequence. The identifier prefix and the coordinate sequence are then joined with a colon to generate the complete normalized identifier string. The generated normalized identifier string is written into the identifier field of the same target alternative splicing event to complete the identifier assignment.
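As a non-limiting illustration, the construction of the normalized identifier may be sketched in Python as follows. The function name `build_normalized_identifier` and the example gene symbol and coordinates are hypothetical. The sketch assumes that the boundary coordinates passed in have already been reduced to one consensus value per boundary; how that reduction is performed is not shown here.

```python
from typing import Iterable


def build_normalized_identifier(gene: str, event_type: str, chromosome: str,
                                strand: str, boundaries: Iterable[int]) -> str:
    """Join the fields with ':' and the sorted coordinates with '|'."""
    # Prefix: gene symbol, event type code, chromosome number, strand symbol.
    prefix = ":".join([gene, event_type, chromosome, strand])
    # Coordinate sequence: boundary positions ordered by genomic position.
    coords = "|".join(str(pos) for pos in sorted(boundaries))
    return f"{prefix}:{coords}"


# Hypothetical exon skipping event with four boundary coordinates.
identifier = build_normalized_identifier(
    "GENE1", "ES", "chr1", "+", [400, 100, 200, 300])
print(identifier)  # GENE1:ES:chr1:+:100|200|300|400
```

Sorting inside the function means the same set of boundaries always yields the same string regardless of the order in which an algorithm reported them.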

[0070] Based on the above embodiments, as an optional embodiment, the step of filtering the same target alternative splicing events based on preset consensus rules and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set may include S501 to S503: S501: Count the number of alternative splicing quantification algorithms that identify the same target alternative splicing event to obtain the algorithm support number.

[0071] In this embodiment, the algorithm support number refers to the total number of independent algorithms that successfully identify and report the same target alternative splicing event after multiple alternative splicing quantification algorithms analyze the same batch of RNA sequencing data in parallel. The algorithm support number is an important indicator of the reliability and consistency of splicing event detection: a higher support number means that more independent algorithms have verified the splicing event, giving it a higher confidence level. The algorithm support number is usually used as a core criterion in consensus rules to screen high-quality splicing events. Alternative splicing quantification algorithms refer to bioinformatics analysis tools used to identify and quantify alternative splicing events from RNA sequencing data. Different algorithms use different statistical models and detection strategies, such as calculation based on the exon inclusion ratio, counting of splice junction reads, or inference based on transcript abundance. Commonly used alternative splicing quantification algorithms include rMATS, MAJIQ, SUPPA2, and SplAdder.

[0072] Specifically, the electronic device reads the merge source field from the data records of the same target variable splice event to obtain the initial variable splice event list that is merged into the target event. It then iterates through each event record in the initial variable splice event list to extract its source algorithm identifier field. The extracted source algorithm identifier is stored in a set data structure to automatically remove duplicate algorithm names. The number of elements with unique algorithm identifiers in the set data structure is counted to obtain the number of independent algorithms that identify the same target variable splice event. The counted number of independent algorithms is assigned as the algorithm support number to the algorithm support number field of the same target variable splice event. It is then determined whether the algorithm support number reaches the preset minimum algorithm support number threshold requirement. If the algorithm support number meets the minimum threshold requirement, the same target variable splice event is marked as having passed the algorithm support number screening state. If the algorithm support number does not meet the minimum threshold requirement, the same target variable splice event is marked as having failed the algorithm support number screening state and is filtered out in subsequent analysis.
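As a non-limiting illustration, the counting of the algorithm support number may be sketched in Python as follows. The function name `count_algorithm_support` and the `"algorithm"` dictionary key are assumptions introduced for this sketch.

```python
from typing import Dict, List


def count_algorithm_support(merged_events: List[Dict[str, str]]) -> int:
    """Count the distinct source algorithms among the merged initial events."""
    # A set removes duplicate algorithm names automatically.
    algorithms = {event["algorithm"] for event in merged_events}
    return len(algorithms)


# Hypothetical merge source list: rMATS reported the event twice.
merged = [
    {"algorithm": "rMATS"},
    {"algorithm": "MAJIQ"},
    {"algorithm": "rMATS"},
]
print(count_algorithm_support(merged))  # 2
```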

[0073] S502: If the algorithm support number is greater than or equal to the preset algorithm number threshold, the same target alternative splicing event is determined to satisfy the preset consensus rule, and the same target alternative splicing event that satisfies the preset consensus rule is retained as a candidate splicing event.

[0074] In this embodiment, the algorithm number threshold refers to the minimum number of algorithmic supports pre-set in the consensus rules to determine whether the same target alternative splicing event has sufficient reliability. The algorithm number threshold needs to balance the requirements of detection sensitivity and specificity. Setting the threshold too low may lead to an increase in false positive splicing events, while setting it too high may lead to over-filtering of real splicing events. Common algorithm number thresholds are set to two, indicating that at least two independent algorithms are required, or three, indicating that at least three independent algorithms are required. The specific value of the threshold can be adjusted according to the total number of algorithms used in the project and the data quality requirements. The consensus rule is a quality control standard system based on multi-algorithm consistency verification. By requiring that the same target alternative splicing event must be simultaneously identified by at least a specified number of independent algorithms to be accepted as a high-confidence event, the consensus rule effectively reduces the inherent bias of a single algorithm and the impact of false positive detections, improving the accuracy and repeatability of the splicing events reported in the final report. Candidate splicing events refer to a set of high-quality alternative splicing events of the same target that meet the minimum number of algorithmic supports after being screened by the consensus rules and retained for subsequent differential analysis and biological interpretation.

[0075] Specifically, the electronic device reads the algorithm support number field from the same target alternative splicing event data record to obtain the algorithm support number of the current event. It reads the preset algorithm number threshold parameter from the configuration parameter file or system settings to obtain the minimum algorithm support number required by the consensus rule. It then compares the two values to determine whether the algorithm support number is greater than or equal to the algorithm number threshold. If so, it determines that the same target alternative splicing event meets the preset consensus rule, sets the pass status field of the event to true to indicate that the event has passed the screening, and adds the event to the candidate splicing event list data structure, where it is retained. If the algorithm support number is less than the algorithm number threshold, it determines that the same target alternative splicing event does not meet the preset consensus rule and sets the pass status field of the event to false to indicate that the event has not passed the screening. Such an event is added to the filtered event list data structure or directly discarded without further processing. The total number of events in the candidate splicing event list is counted to obtain the number of splicing events that have passed the consensus rule screening. A screening statistics report is generated to record the original total number of the same target alternative splicing events, the number of candidate splicing events that have passed the screening, and the number and percentage of filtered events.
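As a non-limiting illustration, the consensus rule screening and the screening statistics report may be sketched in Python as follows. The function name `apply_consensus_rule`, the dictionary keys, and the default threshold of 2 are assumptions introduced for this sketch.

```python
from typing import Dict, List, Tuple


def apply_consensus_rule(events: List[Dict], min_algorithms: int = 2
                         ) -> Tuple[List[Dict], Dict[str, float]]:
    """Split events by algorithm support number and summarize the screening."""
    candidates, filtered = [], []
    for event in events:
        passed = event["support"] >= min_algorithms
        # Copy the record so the caller's data is left unchanged.
        marked = dict(event, passed=passed)
        (candidates if passed else filtered).append(marked)
    total = len(events)
    report = {
        "total": total,
        "candidates": len(candidates),
        "filtered": len(filtered),
        "filtered_pct": 100.0 * len(filtered) / total if total else 0.0,
    }
    return candidates, report


# Hypothetical events with their algorithm support numbers.
events = [{"id": "e1", "support": 1}, {"id": "e2", "support": 2},
          {"id": "e3", "support": 3}, {"id": "e4", "support": 1}]
candidates, report = apply_consensus_rule(events)
print([c["id"] for c in candidates])  # ['e2', 'e3']
print(report["filtered_pct"])         # 50.0
```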

[0076] In a preferred embodiment, the preset algorithm number threshold is set to 2. That is, the system employs a strict "2-out-of-4" consensus rule, requiring that the same target alternative splicing event must be independently detected by at least two of the four algorithms (MAJIQ, rMATS, SUPPA2, and SplAdder) to be retained as a candidate splicing event. This hard threshold filtering significantly removes artifacts specific to a single algorithm, accurately converging hundreds of thousands of initial events into high-confidence events, ensuring that the final derived alternative splicing molecular mechanism targets have extremely high reproducibility.

[0077] S503: Filter candidate splicing events based on preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set.

[0078] In this embodiment, the multi-step quality control conditions refer to a multi-level quality control standard system applied sequentially according to a predetermined priority order. This includes statistical significance thresholds, splicing ratio difference thresholds, read support number thresholds, sample coverage thresholds, and coordinate consistency checks. A funnel-style screening strategy ensures that the events ultimately retained simultaneously meet all quality standards. The high-confidence consensus splicing event set refers to the most reliable set of splicing events that, after being screened layer by layer by the multi-step quality control conditions, simultaneously meet the algorithm consistency requirements and multi-dimensional quality standards. It possesses characteristics of high detection reliability, high statistical significance, high biological relevance, and high reproducibility.

[0079] Specifically, the electronic device reads all variable splicing event data for the same target from the candidate splicing event list, reads the multi-step quality control condition parameter set from the configuration parameter file, initializes the quality control statistical counter, and sequentially executes the following steps: First, statistical significance threshold filtering to determine whether the false discovery rate is less than the preset threshold; second, splicing ratio difference threshold filtering to determine whether the absolute value of the difference between the comparison groups is greater than the minimum difference threshold; third, read support number threshold filtering to determine whether the number of included and skip reads has reached the minimum support number; fourth, sample coverage threshold filtering to determine whether the sample coverage ratio in each experimental group has reached the minimum requirement; and fifth, coordinate consistency test filtering to verify whether the exon boundary coordinates are consistent with the reference genome annotation. The splicing events that pass all quality control conditions are saved as a high-confidence consensus splicing event set, and a quality control statistical report is generated to record the initial number, the remaining number and filtering rate after each step of filtering, and the total number and proportion of the final high-confidence consensus splicing event set.
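As a non-limiting illustration, the funnel-style screening and the per-step statistics may be sketched in Python as follows. The function name `run_qc_funnel`, the dictionary keys, and every numeric threshold in `QC_STEPS` are hypothetical values chosen for this sketch; requiring the inclusion and skipping read counts to each reach the minimum is one possible reading of the read support condition.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Step = Tuple[str, Callable[[Event], bool]]


def run_qc_funnel(events: List[Event], steps: List[Step]
                  ) -> Tuple[List[Event], List[Dict[str, object]]]:
    """Apply the quality control steps in order and record each step's effect."""
    remaining = list(events)
    report = []
    for name, keep in steps:
        before = len(remaining)
        remaining = [event for event in remaining if keep(event)]
        removed = before - len(remaining)
        report.append({
            "step": name,
            "before": before,
            "after": len(remaining),
            "filter_rate": removed / before if before else 0.0,
        })
    return remaining, report


# The five steps in the order given in the description (thresholds assumed).
QC_STEPS: List[Step] = [
    ("significance", lambda e: e["fdr"] < 0.05),
    ("ratio_difference", lambda e: abs(e["delta_psi"]) > 0.1),
    ("read_support",
     lambda e: e["inclusion_reads"] >= 10 and e["skipping_reads"] >= 10),
    ("sample_coverage", lambda e: e["min_group_coverage"] >= 0.33),
    ("coordinate_consistency", lambda e: bool(e["matches_annotation"])),
]
```

Because each step only sees the events that survived the previous one, the recorded counts show where in the funnel events are lost.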

[0080] Based on the above embodiments, as an optional embodiment, the step of filtering candidate splicing events based on preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set may include S601 to S603: S601: Obtain the sample coverage, the median splicing inclusion ratio, and the junction read support number of the candidate splicing events across all sequencing samples.

[0081] In this embodiment, sample coverage refers to the proportion of samples in which a candidate splicing event is successfully detected out of the total number of sequencing samples. It is used to assess the detection stability and prevalence of the splicing event in the sample population. The median splicing inclusion ratio is the median of the splicing inclusion ratios of a candidate splicing event across all samples in which the event was detected, obtained by sorting the values from smallest to largest. It reflects the typical splicing level of the event in the sample population and reduces the impact of extreme values. The junction read support number refers to the total number of sequencing reads spanning splice junctions that support a candidate splicing event, including the reads supporting the exon inclusion pattern and the reads supporting the exon skipping pattern. It is used to measure the sequencing depth and reliability of the splicing event detection.

[0082] Specifically, the electronic device reads the identification information and gene information of each candidate splicing event from the candidate splicing event list. For each candidate splicing event, it traverses all sequencing samples to read the detection status and quantitative data of the event in each sample. It counts the number of samples in which the event is successfully detected and calculates the sample coverage, which equals the number of detected samples divided by the total number of samples. It extracts the splicing inclusion ratio of the event in each detected sample to construct a numerical array, sorts the array in ascending order, and takes the value in the middle position as the median splicing inclusion ratio. It reads the number of inclusion-type junction reads and the number of skipping-type junction reads of the event in all samples and sums them to obtain the junction read support number. The sample coverage, median splicing inclusion ratio, and junction read support number of each candidate splicing event are integrated into feature data records and saved in the splicing event feature data table.
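As a non-limiting illustration, the calculation of the three features may be sketched in Python as follows. The function name `summarize_event` and the dictionary keys are assumptions introduced for this sketch, and a sample in which the event was not detected is represented by a `psi` value of `None`. Note that `statistics.median` averages the two middle values when the number of detected samples is even, which is one common convention for the "value in the middle position".

```python
from statistics import median
from typing import Dict, List, Optional


def summarize_event(per_sample: List[Dict], total_samples: int
                    ) -> Dict[str, Optional[float]]:
    """Compute sample coverage, median inclusion ratio, and junction reads."""
    # Samples in which the event was detected carry a numeric inclusion ratio.
    detected = [s for s in per_sample if s["psi"] is not None]
    coverage = len(detected) / total_samples if total_samples else 0.0
    median_psi = median(s["psi"] for s in detected) if detected else None
    # Sum inclusion-type and skipping-type junction reads over all samples.
    junction_reads = sum(s["inclusion_reads"] + s["skipping_reads"]
                         for s in per_sample)
    return {
        "sample_coverage": coverage,
        "median_psi": median_psi,
        "junction_reads": junction_reads,
    }


# Hypothetical quantification of one event in four samples.
samples = [
    {"psi": 0.2, "inclusion_reads": 10, "skipping_reads": 5},
    {"psi": 0.5, "inclusion_reads": 20, "skipping_reads": 10},
    {"psi": 0.8, "inclusion_reads": 3, "skipping_reads": 2},
    {"psi": None, "inclusion_reads": 0, "skipping_reads": 0},
]
print(summarize_event(samples, total_samples=4))
# {'sample_coverage': 0.75, 'median_psi': 0.5, 'junction_reads': 50}
```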

[0083] S602: If the sample coverage is greater than or equal to the preset coverage threshold, the median splicing inclusion ratio is greater than the preset proportion threshold, and the junction read support number is greater than or equal to the preset read count threshold, the candidate splicing event is determined to meet the preset multi-step quality control conditions.

[0084] In this embodiment, the coverage threshold refers to a preset minimum requirement for sample coverage, used to screen splicing events that are stably detected in a sufficient proportion of samples. The proportion threshold refers to a preset minimum requirement for the median splice inclusion proportion, used to screen events with significant splicing activity. The read count threshold refers to a preset minimum requirement for the number of reads supporting the junction, used to screen events with sufficient sequencing depth. The multi-step quality control condition determination refers to a quality control process that screens high-quality candidate splicing events by simultaneously meeting the threshold requirements of three dimensions: sample coverage, median splice inclusion proportion, and number of reads supporting the junction.

[0085] Specifically, the electronic device reads the preset coverage threshold, proportion threshold, and read count threshold from the configuration parameter file. It also reads the sample coverage, median splicing inclusion ratio, and junction read support number of each candidate splicing event from the splicing event feature data table. For each candidate splicing event, a three-step judgment is executed: first, whether the sample coverage of the event is greater than or equal to the coverage threshold; second, whether the median splicing inclusion ratio of the event is greater than the proportion threshold; and third, whether the junction read support number of the event is greater than or equal to the read count threshold. If all three judgments are true, the candidate splicing event is deemed to meet the preset multi-step quality control conditions and is marked as a qualified event. If any judgment is false, the candidate splicing event is deemed not to meet the multi-step quality control conditions and is marked as an unqualified event. Qualified events are added to the high-confidence consensus splicing event set. A quality control statistical report is generated, recording the number of events that passed quality control, the number of events that failed quality control, and the filtering contribution rate of each threshold. In this application, the coverage threshold can be 33%, the proportion threshold can be 0.1, and the read count threshold can be 10.
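As a non-limiting illustration, the three-step judgment may be sketched in Python as follows, using the example thresholds given above (33%, 0.1, and 10) as defaults. The function name `passes_multi_step_qc` and the dictionary keys are assumptions introduced for this sketch.

```python
from typing import Dict


def passes_multi_step_qc(features: Dict[str, float],
                         coverage_threshold: float = 0.33,
                         proportion_threshold: float = 0.1,
                         read_count_threshold: int = 10) -> bool:
    """Return True only if all three threshold judgments are true."""
    return (
        # Step 1: sample coverage must reach the coverage threshold.
        features["sample_coverage"] >= coverage_threshold
        # Step 2: the median inclusion ratio must strictly exceed the
        # proportion threshold.
        and features["median_psi"] > proportion_threshold
        # Step 3: junction read support must reach the read count threshold.
        and features["junction_reads"] >= read_count_threshold
    )


# Hypothetical feature record for one candidate splicing event.
event = {"sample_coverage": 0.75, "median_psi": 0.5, "junction_reads": 50}
print(passes_multi_step_qc(event))  # True
```

The second comparison is strict while the other two are inclusive, matching the wording of S602.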

[0086] S603: Candidate splicing events that meet the preset multi-step quality control conditions are included in the high-confidence consensus splicing event set.

[0087] In this embodiment, inclusion in the high-confidence consensus splicing event set refers to transferring and saving candidate splicing events selected through multi-step quality control conditions to a dedicated high-quality event dataset. This dataset contains complete annotation information, quantitative data, statistical significance indicators, and quality control labels for the events, serving as a reliable data foundation for subsequent differential analysis, functional annotation, and biological interpretation. The high-confidence consensus splicing event set is characterized by high event reliability, strong detection stability, clear biological significance, and good reproducibility.

[0088] Specifically, the electronic device filters all candidate splicing events marked as qualified events from the quality control judgment results, reads the complete information of each qualified event, including event identifier, gene identifier, splicing type, chromosome coordinates, exon number, sample coverage, median splicing inclusion ratio, number of connection point reads, statistical significance value, and difference value between comparison groups. A quality control pass tag and pass timestamp are added to each qualified event. The qualified events and their complete information are written into the high-confidence consensus splicing event set data structure. An index is built for the high-confidence consensus splicing event set to support fast retrieval by gene identifier, splicing type, and chromosome location. A high-confidence consensus splicing event set summary report is generated, recording the total number of events, the distribution of each splicing type, the number of genes involved, and the quality control pass rate. The high-confidence consensus splicing event set and summary report are persistently saved to the database and file system.

[0089] S105: Store the sample identifier set and the corresponding splicing inclusion ratio value set of each high-confidence consensus splicing event in the high-confidence consensus splicing event set as parallel binary document array fields in a relational database.

[0090] Based on the above embodiments, as an optional embodiment, the step of storing the sample identifier set corresponding to each high-confidence consensus splicing event in the high-confidence consensus splicing event set and the corresponding splicing inclusion ratio value set as parallel binary document array fields in a relational database may include S701 to S704: S701: Sort each sample identifier in the sample identifier set according to the preset sample sorting rules to obtain the first ordered array.

[0091] In this embodiment, the sample identifier set refers to an ordered set of unique identifiers of all sequencing samples in which a high-confidence consensus splicing event was detected, arranged in order of detection or sample number. The splicing inclusion ratio value set refers to an ordered set of quantitative values of the splicing inclusion ratio of the splicing event in each sample, corresponding one-to-one by position with the sample identifier set. The parallel binary document array field refers to using fields that support binary large objects or document types in a relational database table structure to store the sample identifier array and the splicing inclusion ratio value array separately. The two arrays maintain their correspondence through position indexes, and a binary serialization format is used to achieve efficient storage and fast retrieval.

[0092] Specifically, the electronic device reads the event identifier and basic annotation information of each splicing event from the high-confidence consensus splicing event set, extracts the detection results of the event from all sequencing samples, selects the samples in which the event was successfully detected, and constructs an ordered array of sample identifiers by arranging the identifiers of the detected samples according to sample number or detection order. It then extracts the corresponding splicing inclusion ratio values in the same order as the sample identifier array and constructs an ordered array of splicing inclusion ratio values, ensuring a one-to-one correspondence between the position indices of the two arrays. Next, it performs binary serialization encoding on the sample identifier array to generate a sample identifier binary byte stream, and performs binary serialization encoding on the splicing inclusion ratio value array to generate a splicing inclusion ratio binary byte stream. A splicing event storage table is created in a relational database, containing an event identifier primary key field, a gene identifier field, a splicing type field, a sample identifier array binary document field, and a splicing inclusion ratio array binary document field. For each splicing event, the event identifier, gene identifier, splicing type, sample identifier binary byte stream, and splicing inclusion ratio binary byte stream are inserted as one record into the database table. A primary key index on the event identifier and an ordinary index on the gene identifier are created for the database table to support fast querying. A data storage completion report is generated, recording the number of stored events, the table size, and the storage time.
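As a non-limiting illustration, the storage table and the insertion of one record may be sketched in Python with the standard `sqlite3` module as follows. SQLite, the table and column names, and the function names are assumptions introduced for this sketch; the application does not name a specific database product. UTF-8 encoded JSON is used here only as a stand-in byte encoding so that the sketch stays short, and any binary document format could be substituted.

```python
import json
import sqlite3
from typing import List, Tuple


def store_event(conn: sqlite3.Connection, event_id: str, gene_id: str,
                splice_type: str, sample_ids: List[str],
                psi_values: List[float]) -> None:
    """Create the table if needed and insert one event with parallel arrays."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS splice_events ("
        " event_id TEXT PRIMARY KEY,"
        " gene_id TEXT NOT NULL,"
        " splice_type TEXT NOT NULL,"
        " sample_ids BLOB NOT NULL,"
        " psi_values BLOB NOT NULL)")
    # Ordinary index on the gene identifier; the primary key indexes event_id.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_splice_events_gene"
                 " ON splice_events (gene_id)")
    conn.execute(
        "INSERT INTO splice_events VALUES (?, ?, ?, ?, ?)",
        (event_id, gene_id, splice_type,
         json.dumps(sample_ids).encode("utf-8"),    # stand-in encoding
         json.dumps(psi_values).encode("utf-8")))   # stand-in encoding
    conn.commit()


def load_event(conn: sqlite3.Connection, event_id: str
               ) -> Tuple[List[str], List[float]]:
    """Read back the two parallel arrays of one event."""
    row = conn.execute(
        "SELECT sample_ids, psi_values FROM splice_events WHERE event_id = ?",
        (event_id,)).fetchone()
    return json.loads(row[0]), json.loads(row[1])


conn = sqlite3.connect(":memory:")
store_event(conn, "GENE1:ES:chr1:+:100|200|300|400", "GENE1", "ES",
            ["S1", "S2"], [0.25, 0.75])
print(load_event(conn, "GENE1:ES:chr1:+:100|200|300|400"))
# (['S1', 'S2'], [0.25, 0.75])
```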

[0093] S702: Sort the splicing inclusion ratio values corresponding to each sample identifier according to the preset sample sorting rules to obtain a second ordered array aligned with the index of the first ordered array.

[0094] In this embodiment, the preset sample sorting rule refers to the sorting standard used when sorting sample identifiers, including rules such as ascending lexicographical order of sample numbers, priority sorting of sample groups, order of sample collection time, or sorting by sample type classification. Alignment with the first ordered array index means that the splicing inclusion ratio value at each position in the second ordered array maintains the original correspondence with the sample identifier at the same position in the first ordered array; that is, when a sample identifier is moved to a new position after sorting, its corresponding splicing inclusion ratio value also moves synchronously to the same position. The second ordered array refers to the splicing inclusion ratio value array rearranged according to the sample order of the first ordered array, where the i-th element is the splicing inclusion ratio value corresponding to the i-th sample identifier in the first ordered array.

[0095] Specifically, the electronic device reads from memory the first ordered array (i.e., the ordered array of sample identifiers), which has already been sorted according to the preset sample sorting rule. It then reads the original, unsorted sample identifier set and splicing inclusion ratio value set and builds a mapping dictionary from sample identifiers to splicing inclusion ratio values. An empty second ordered array is created to hold the reordered splicing inclusion ratio values. The sample identifiers in the first ordered array are traversed in order of position index, starting from zero. For the sample identifier at position index i of the first ordered array, the corresponding splicing inclusion ratio value is looked up in the mapping dictionary and written to position index i of the second ordered array. After all positions have been traversed, the second ordered array has the same length as the first ordered array and corresponds to it position by position. The device verifies that the two arrays have the same length and that the value at each position belongs to the sample identifier at the same position. The index-aligned first and second ordered arrays are then stored in memory variables as a paired data structure.
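
The alignment procedure of step S702 can be illustrated with a minimal Python sketch. The function name `build_aligned_arrays`, the sample identifiers, and the ratio values are hypothetical, and ascending lexicographic order is assumed as the preset sample sorting rule (one of the options listed in paragraph [0094]); it is an illustration under these assumptions rather than the implementation of this embodiment.

```python
from typing import Dict, List, Tuple


def build_aligned_arrays(
    psi_by_sample: Dict[str, float],
) -> Tuple[List[str], List[float]]:
    """Return (first_ordered_array, second_ordered_array), index-aligned.

    psi_by_sample is the mapping dictionary from sample identifier to
    splicing inclusion ratio value for one splicing event.
    """
    # Assumed sorting rule: ascending lexicographic order of sample identifiers.
    first_ordered = sorted(psi_by_sample)

    # Position i of the second array receives the value that belongs to the
    # sample identifier at position i of the first array.
    second_ordered = [psi_by_sample[sample_id] for sample_id in first_ordered]

    # Verification: same length, and every position maps back to its sample.
    if len(first_ordered) != len(second_ordered):
        raise ValueError("ordered arrays differ in length")
    if dict(zip(first_ordered, second_ordered)) != psi_by_sample:
        raise ValueError("ordered arrays are not index-aligned")

    return first_ordered, second_ordered


# Hypothetical input for one event: values arrive in arbitrary order.
sample_ids, psi_values = build_aligned_arrays(
    {"S03": 0.82, "S01": 0.15, "S02": 0.40}
)
```

Because the second array is derived by lookup rather than sorted independently, the pairing between a sample and its value cannot drift, whatever sorting rule is substituted for `sorted`.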

[0096] S703: Serialize the first ordered array and the second ordered array into binary document format respectively.

[0097] In this embodiment, serialization into a binary document format refers to converting an in-memory data structure into a compact binary byte stream according to specific encoding rules for efficient storage and transmission. Binary document formats include binary JSON formats such as BSON, message packing formats such as MessagePack, Protocol Buffers, and custom binary encoding formats. Serializing the first ordered array means encoding the array of sample identifier strings into a binary byte stream containing string length prefixes and the encoded characters. Serializing the second ordered array means encoding the array of floating-point splicing inclusion ratio values into a byte stream containing a numeric type identifier and the binary floating-point representations.

[0098] Specifically, the electronic device reads the first ordered array (the ordered array of sample identifiers) and the second ordered array (the ordered array of splicing inclusion ratio values) from memory variables. It selects a preset binary serialization encoding format, such as MessagePack or a custom binary format. When serializing the first ordered array, it first writes the array length as a four-byte integer header. It then iterates through the sample identifier strings in the array and, for each string, writes the string length as a two-byte integer prefix followed by the UTF-8 encoded byte sequence of the string. All bytes are concatenated in order to produce the binary byte stream of the first ordered array. When serializing the second ordered array, it likewise first writes the array length as a four-byte integer header. It then iterates through the splicing inclusion ratio values in the array, converting each floating-point value to an eight-byte double-precision binary representation according to the IEEE 754 standard. All bytes are concatenated in order to produce the binary byte stream of the second ordered array. Optionally, the generated binary byte streams are compressed with GZIP or LZ4 to reduce storage space. The serialized binary byte streams of the first and second ordered arrays are stored in memory variables, and the length and compression ratio of each byte stream are recorded.
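
The custom binary layout described above can be sketched with Python's standard `struct` module. The function names are hypothetical, and little-endian byte order and unsigned length fields are assumptions, since this embodiment does not specify them. The optional compression step is shown with `gzip` only; LZ4 would require a third-party package.

```python
import gzip
import struct
from typing import List


def serialize_sample_ids(sample_ids: List[str]) -> bytes:
    """Four-byte count header, then (two-byte length prefix + UTF-8 bytes) per string."""
    parts = [struct.pack("<I", len(sample_ids))]  # assumed little-endian
    for sample_id in sample_ids:
        encoded = sample_id.encode("utf-8")
        parts.append(struct.pack("<H", len(encoded)))
        parts.append(encoded)
    return b"".join(parts)


def serialize_psi_values(values: List[float]) -> bytes:
    """Four-byte count header, then one eight-byte IEEE 754 double per value."""
    return struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}d", *values)


def deserialize_sample_ids(blob: bytes) -> List[str]:
    (count,) = struct.unpack_from("<I", blob, 0)
    offset, result = 4, []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        result.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return result


def deserialize_psi_values(blob: bytes) -> List[float]:
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


# Hypothetical index-aligned arrays for one splicing event.
id_blob = serialize_sample_ids(["S01", "S02"])
psi_blob = serialize_psi_values([0.15, 0.40])

# Optional compression; short byte streams may not shrink at all.
compressed_psi = gzip.compress(psi_blob)
compression_ratio = len(compressed_psi) / len(psi_blob)
```

With this layout the two-identifier array occupies 4 + 2 × (2 + 3) = 14 bytes and the two-value array 4 + 2 × 8 = 20 bytes. The two-byte prefix limits each identifier to 65,535 encoded bytes.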

[0099] S704: Insert the serialized first and second ordered arrays as independent data columns into the same data row of the relational database, and build a generalized inverted index on the independent data columns.

[0100] In this embodiment, independent data columns refer to two separate field columns created in the relational database table structure, one for the first ordered array and one for the second ordered array. These columns typically use a binary type such as BLOB, VARBINARY, or PostgreSQL's BYTEA, or a dedicated binary document type. The same data row means that the first ordered array and the second ordered array are stored as different attribute fields of the same splicing event record in one row of the database table, linked by the event identifier primary key. A generalized inverted index refers to an index structure built on binary document type fields that supports fast retrieval. Unlike a traditional B-tree index, a generalized inverted index parses the binary document content and establishes a mapping between that content and the row records, supporting fast lookup of the splicing events that contain a given sample identifier, or filtering of the events that satisfy a condition on a numerical range.

[0101] Specifically, the electronic device reads the serialized binary byte streams of the first and second ordered arrays from memory variables, and reads the basic attributes of the splicing event currently being processed, such as the event identifier, gene identifier, and splicing type. It then creates or opens a splicing event storage table in the relational database. This table contains an event identifier primary key field, a gene identifier field, a splicing type field, a sample identifier array binary field, and a splicing inclusion ratio array binary field. A SQL INSERT statement is constructed with the event identifier, gene identifier, splicing type, and the two binary byte streams as parameters. Executing the statement inserts the serialized binary data as BLOB or BYTEA values into the sample identifier array field and the splicing inclusion ratio array field, forming a single data row. To build the generalized inverted index on the sample identifier array field, an index auxiliary table is first created to store the mapping between sample identifiers and event identifiers. All data rows are traversed, the sample identifier list of each row is obtained by deserializing its sample identifier array binary field, and each sample identifier is inserted together with the corresponding event identifier as a key-value pair into the index auxiliary table. A B-tree index is built on the sample identifier field of the index auxiliary table to support fast lookup. To build a numerical range index on the splicing inclusion ratio array field, a range index auxiliary table is created to store the mapping between numerical ranges and event identifiers. The list of values obtained by deserializing the splicing inclusion ratio array field is used to calculate the minimum, maximum, and quantile values, which define the numerical ranges. The mapping between each event identifier and its numerical ranges is inserted into the range index auxiliary table, and a range query index is built. After all events have been inserted and the indexes built, an index statistics report is generated to record the index size and construction time.
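
The storage table and the sample-identifier auxiliary index of step S704 can be sketched with Python's built-in `sqlite3` module. SQLite stands in here for the relational database, which this embodiment does not name. The table names, column names, helper functions, example events, and the `|`-delimited event identifiers are all hypothetical. The sketch covers only lookup by sample identifier; the numerical range index is omitted.

```python
import sqlite3
import struct
from typing import List


def pack_ids(ids: List[str]) -> bytes:
    # Four-byte count, then two-byte length prefix + UTF-8 bytes per identifier.
    out = struct.pack("<I", len(ids))
    for sample_id in ids:
        raw = sample_id.encode("utf-8")
        out += struct.pack("<H", len(raw)) + raw
    return out


def unpack_ids(blob: bytes) -> List[str]:
    (count,) = struct.unpack_from("<I", blob, 0)
    offset, ids = 4, []
    for _ in range(count):
        (n,) = struct.unpack_from("<H", blob, offset)
        ids.append(blob[offset + 2:offset + 2 + n].decode("utf-8"))
        offset += 2 + n
    return ids


def pack_psi(values: List[float]) -> bytes:
    # Four-byte count, then one eight-byte double per value.
    return struct.pack(f"<I{len(values)}d", len(values), *values)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE splicing_event (
    event_id      TEXT PRIMARY KEY,
    gene_id       TEXT NOT NULL,
    splicing_type TEXT NOT NULL,
    sample_ids    BLOB NOT NULL,   -- serialized first ordered array
    psi_values    BLOB NOT NULL    -- serialized second ordered array
);
CREATE INDEX idx_event_gene ON splicing_event (gene_id);
CREATE TABLE sample_event_index (  -- index auxiliary table
    sample_id TEXT NOT NULL,
    event_id  TEXT NOT NULL REFERENCES splicing_event (event_id)
);
""")

# Hypothetical events: both arrays of an event go into the same row.
events = [
    ("GENE1|+|SE|100-200", "GENE1", "SE", ["S01", "S02"], [0.15, 0.40]),
    ("GENE2|-|A5SS|500-650", "GENE2", "A5SS", ["S02", "S03"], [0.71, 0.82]),
]
for event_id, gene_id, splicing_type, ids, psi in events:
    conn.execute(
        "INSERT INTO splicing_event VALUES (?, ?, ?, ?, ?)",
        (event_id, gene_id, splicing_type, pack_ids(ids), pack_psi(psi)),
    )

# Traverse all rows, deserialize the sample identifier field, and record
# one (sample_id, event_id) pair per sample in the auxiliary table.
rows = conn.execute("SELECT event_id, sample_ids FROM splicing_event").fetchall()
for event_id, blob in rows:
    conn.executemany(
        "INSERT INTO sample_event_index VALUES (?, ?)",
        [(sample_id, event_id) for sample_id in unpack_ids(blob)],
    )
# SQLite implements ordinary indexes as B-trees.
conn.execute("CREATE INDEX idx_sample ON sample_event_index (sample_id)")
conn.commit()


def events_for_sample(sample_id: str) -> List[str]:
    """Return the identifiers of all events detected in the given sample."""
    cursor = conn.execute(
        "SELECT event_id FROM sample_event_index "
        "WHERE sample_id = ? ORDER BY event_id",
        (sample_id,),
    )
    return [row[0] for row in cursor]
```

Calling `events_for_sample("S02")` returns both hypothetical events without deserializing any BLOB at query time, which is the purpose of the auxiliary table.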

[0102] The following describes an exemplary electronic device for alternative splicing data processing provided in embodiments of this application. Figure 3 is an exemplary hardware structure diagram of an electronic device for alternative splicing data processing provided in an embodiment of this application.

[0103] In some embodiments, the electronic device for alternative splicing data processing is a computer device or includes a computer device. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The database of the computer device stores data. The network interface of the computer device is used to communicate with external terminals or servers via a network connection. In some embodiments, the network interface is a wired network interface; in other embodiments, it is a wireless network interface. When the computer program is executed by the processor, it implements the methods described in the embodiments of this application.

[0104] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of the part of the structure related to the present application and does not limit the computer device to which the present application is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

Claims

1. An alternative splicing data processing method, characterized in that the method includes: obtaining ribonucleic acid (RNA) sequencing data of multiple sequencing samples, and calling multiple independent alternative splicing quantification algorithms to process the RNA sequencing data, so as to obtain the initial alternative splicing event set and the corresponding splicing inclusion ratio values output by each alternative splicing quantification algorithm; extracting the gene symbol, strand direction, splicing event type, and genome coordinates of each initial alternative splicing event from the initial alternative splicing event set; for initial alternative splicing events output by different alternative splicing quantification algorithms, if the gene symbol, strand direction, and splicing event type are the same, and the absolute value of the difference between the genome coordinates is less than or equal to a preset tolerance threshold, determining that the corresponding initial alternative splicing events are the same target alternative splicing event; generating a unified normalized identifier for the same target alternative splicing event, and filtering the same target alternative splicing events based on a preset consensus rule and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set; and storing the sample identifier set and the corresponding splicing inclusion ratio value set of each high-confidence consensus splicing event in the high-confidence consensus splicing event set as parallel binary document array fields in a relational database.

2. The alternative splicing data processing method according to claim 1, characterized in that extracting the gene symbol, strand direction, splicing event type, and genome coordinates of each initial alternative splicing event from the initial alternative splicing event set specifically includes: reading the original annotation files corresponding to the initial alternative splicing event sets output by each alternative splicing quantification algorithm; converting the algorithm-specific event type labels in the original annotation files into unified splicing event types based on a predefined dictionary mapping table; converting the original gene identifiers in the original annotation files into standardized gene symbols based on a predefined gene annotation lookup table; and parsing the strand direction and genome coordinates from the original annotation files.

3. The alternative splicing data processing method according to claim 1, characterized in that, for initial alternative splicing events output by different alternative splicing quantification algorithms, determining that the corresponding initial alternative splicing events are the same target alternative splicing event if the gene symbol, strand direction, and splicing event type are all the same and the absolute value of the difference between the genome coordinates is less than or equal to a preset tolerance threshold specifically includes: comparing the gene symbols, strand directions, and splicing event types of the initial alternative splicing events output by different alternative splicing quantification algorithms; extracting the start coordinate and end coordinate of each initial alternative splicing event; calculating the absolute value of a first difference between the start coordinates and the absolute value of a second difference between the end coordinates; and, if the compared fields are all the same and the absolute values of the first and second differences are both less than or equal to the preset tolerance threshold, merging the corresponding initial alternative splicing events into the same target alternative splicing event.

4. The alternative splicing data processing method according to claim 1, characterized in that generating a unified normalized identifier for the same target alternative splicing event specifically includes: for multiple initial alternative splicing events identified as the same target alternative splicing event, extracting the corresponding genome coordinates and constructing an exon boundary coordinate set from them; and concatenating the gene symbol, strand direction, splicing event type, and exon boundary coordinate set into a string according to a preset concatenation order, the string being used as the unified normalized identifier of the same target alternative splicing event.

5. The alternative splicing data processing method according to claim 1, characterized in that filtering the same target alternative splicing events based on a preset consensus rule and preset multi-step quality control conditions to obtain a high-confidence consensus splicing event set specifically includes: counting the number of alternative splicing quantification algorithms that identify the same target alternative splicing event to obtain a supporting algorithm count; if the supporting algorithm count is greater than or equal to a preset algorithm count threshold, determining that the same target alternative splicing event satisfies the preset consensus rule, and retaining the same target alternative splicing event that satisfies the preset consensus rule as a candidate splicing event; and filtering the candidate splicing events based on the preset multi-step quality control conditions to obtain the high-confidence consensus splicing event set.

6. The alternative splicing data processing method according to claim 5, characterized in that filtering the candidate splicing events based on the preset multi-step quality control conditions to obtain the high-confidence consensus splicing event set specifically includes: obtaining the sample coverage, the median splicing inclusion ratio, and the number of junction-supporting reads of each candidate splicing event across all sequencing samples; if the sample coverage is greater than or equal to a preset coverage threshold, the median splicing inclusion ratio is greater than a preset ratio threshold, and the number of junction-supporting reads is greater than or equal to a preset read count threshold, determining that the candidate splicing event satisfies the preset multi-step quality control conditions; and adding the candidate splicing events that satisfy the preset multi-step quality control conditions to the high-confidence consensus splicing event set.

7. The alternative splicing data processing method according to claim 1, characterized in that storing the sample identifier set and the corresponding splicing inclusion ratio value set of each high-confidence consensus splicing event in the high-confidence consensus splicing event set as parallel binary document array fields in a relational database specifically includes: sorting the sample identifiers in the sample identifier set according to a preset sample sorting rule to obtain a first ordered array; sorting the splicing inclusion ratio values corresponding to each sample identifier according to the preset sample sorting rule to obtain a second ordered array that is index-aligned with the first ordered array; serializing the first ordered array and the second ordered array into a binary document format respectively; and inserting the serialized first and second ordered arrays as independent data columns into the same data row of the relational database, and building a generalized inverted index on the independent data columns.

8. An electronic device for alternative splicing data processing, characterized in that the electronic device includes: a memory and one or more processors; the memory is coupled to the one or more processors and is used to store computer program code, the computer program code including computer instructions; and the one or more processors call the computer instructions to cause the electronic device to perform the method as described in any one of claims 1-7.

9. A computer program product containing instructions, characterized in that, when the computer program product is run on an electronic device that performs alternative splicing data processing, the electronic device is caused to perform the method as described in any one of claims 1-7.

10. A computer-readable storage medium comprising instructions, characterized in that, when the instructions are executed on an electronic device that performs alternative splicing data processing, the electronic device is caused to perform the method as described in any one of claims 1-7.