A query method and device for a MiRBase gene library based on Spark SQL
By using Spark SQL's distributed query statements and standard query statements, combined with the RefGene and MiRBase gene databases, the problem of low query efficiency in MiRBase was solved, achieving efficient and accurate mutation analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAN UNIV OF POSTS & TELECOMM
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing distributed SQL query engines have low query efficiency in the MiRBase gene database and cannot meet the needs of the explosive growth of gene sequencing data.
Using Spark SQL's distributed and standard query statements, combined with the RefGene and MiRBase gene databases, we can perform efficient mutation analysis based on specified query conditions.
It enables efficient and accurate querying of the MiRBase gene database, improving the efficiency of variant analysis.
Smart Images

Figure CN122201461A_ABST
Description
Technical Field
[0001] This invention relates to the field of distributed SQL query engine technology, and in particular to a query method and apparatus for the MiRBase gene database based on Spark SQL.
Background Technology
[0002] Gene sequencing refers to the analysis of blood, body fluids, or cells using sequencing instruments to determine the base sequence of deoxyribonucleic acid (DNA). miRBase provides a comprehensive database including miRNA sequence data, annotations, and predicted gene targets, making it one of the most important public databases for storing miRNA information.
[0003] With the rapid decline in costs, gene sequencing is increasingly being applied clinically, leading to explosive growth in sequencing data and a surge in the amount of data requiring variant analysis. However, existing gene data analysis based on databases such as RefGene and RefSeq is limited by the efficiency with which distributed SQL query engines handle range queries on these two databases, resulting in very low query efficiency for MiRBase.
Summary of the Invention
[0004] In view of this, the purpose of this invention is to propose a query method and apparatus for the MiRBase gene database based on Spark SQL, which can perform variant analysis efficiently and accurately.
[0005] To achieve the above objectives, this invention provides a method for querying the MiRBase gene database based on Spark SQL, comprising:
[0006] Query the RefGene database using a specified distributed SQL query statement and return the IDs that uniquely identify the genes. The specified distributed SQL query statement is `select * from s rgjoin r on goverlap((s.txStart, s.txEnd, s.exonCount, s.exonStarts, s.exonEnds, s.chr, s.strand), (r.start, r.end, r.chr))`. In this query, `s` represents the RefGene database in table form, and `r` represents the variants to be annotated in table form. Tuples are used as the condition for ON, and the parameters in the tuples are as follows: s.txStart represents the start point field of mutation in table s, s.txEnd represents the end point field of mutation in table s, s.exonCount represents the number of exons in table s, s.exonStarts represents the set of start points for each exon in table s, s.exonEnds represents the set of end points for each exon in table s, s.chr represents the chromosome number field in table s, and s.strand represents the direction (positive or negative strand) field of the gene in table s; r.start represents the start point field of mutation in table r, r.end represents the end point field of mutation in table r, and r.chr represents the chromosome number field in table r.
[0007] Based on the returned gene ID, query the MiRBase gene database and return the query results. Queries to the MiRBase gene database use standard query statements from a distributed SQL query engine.
[0008] This invention also provides a query device for the MiRBase gene database based on the Spark SQL query engine. The annotation device includes a central processing unit (CPU) capable of performing various appropriate actions and processes based on data and programs stored in memory. The CPU, memory, input / output section, external storage section, and network section are interconnected via a bus.
Attached Figure Description
[0009] Figure 1 is a flowchart of the method.
[0010] Figure 2 is a schematic diagram of the device.
Detailed Implementation
[0011] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to specific embodiments and accompanying drawings.
[0012] Refer to Figure 1, which is a flowchart of an embodiment of the present invention.
[0013] The gene analysis annotation method includes the following steps:
[0014] Step 101: Query the RefGene database using the specified Spark SQL statement and return the unique IDs of the genes. The specified Spark SQL statement is the query `select * from s rgjoin r on goverlap((s.txStart, s.txEnd, s.exonCount, s.exonStarts, s.exonEnds, s.chr, s.strand), (r.start, r.end, r.chr))`. In this query, `s` represents the RefGene database in table form, and `r` represents the variants to be annotated in table form. Using a tuple as the condition for ON, the parameters in the tuple are represented as follows: s.txStart represents the start point field of mutation in table s, s.txEnd represents the end point field of mutation in table s, s.exonCount represents the number of exons in table s, s.exonStarts represents the set of start points for each exon in table s, s.exonEnds represents the set of end points for each exon in table s, s.chr represents the chromosome number field in table s, and s.strand represents the direction (positive and negative strand) field of the gene in table s; r.start represents the start point field of mutation in table r, r.end represents the end point field of mutation in table r, and r.chr represents the chromosome number field in table r.
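The text does not define how `goverlap` evaluates its two tuples. The following is a minimal plain-Python sketch of one plausible reading, not the patented implementation: a variant matches a RefGene row when both lie on the same chromosome and their coordinate ranges intersect, and the exon lists are then used to find which exons the variant touches. The class names, the helper names, the half-open coordinate convention, and the comma-separated encoding of `exonStarts` / `exonEnds` are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    """One row of table s (RefGene), mirroring the fields in the query."""
    tx_start: int
    tx_end: int
    exon_count: int
    exon_starts: str  # assumed comma-separated, e.g. "100,300,"
    exon_ends: str    # assumed comma-separated, e.g. "200,400,"
    chrom: str
    strand: str


class Variant(NamedTuple):
    """One row of table r (variants to be annotated)."""
    start: int
    end: int
    chrom: str


def parse_positions(text: str) -> List[int]:
    """Parse a comma-separated coordinate list, ignoring empty items."""
    return [int(p) for p in text.split(",") if p.strip()]


def intervals_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def overlaps_transcript(s: Transcript, r: Variant) -> bool:
    """Assumed join predicate: same chromosome and intersecting ranges."""
    if s.chrom != r.chrom:
        return False
    return intervals_overlap(s.tx_start, s.tx_end, r.start, r.end)


def overlapping_exons(s: Transcript, r: Variant) -> List[int]:
    """0-based indices of the exons of s that the variant r intersects."""
    if not overlaps_transcript(s, r):
        return []
    starts = parse_positions(s.exon_starts)
    ends = parse_positions(s.exon_ends)
    return [
        i
        for i, (exon_start, exon_end) in enumerate(zip(starts, ends))
        if intervals_overlap(exon_start, exon_end, r.start, r.end)
    ]
```

Under these assumptions, a variant that falls between two exons still satisfies the join predicate but yields an empty exon list, which is one way an annotation step could separate intronic from exonic variants.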
[0015] Step 102: Based on the returned gene ID, query the MiRBase gene database and return the query results. The MiRBase gene database query uses standard Spark SQL query statements, such as `select * from mirbase where ID = "NM_147191"`. The condition parameters include the gene ID returned for the SNP, which is its unique identifier in RefGene.
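Step 102 can be sketched as an ordinary equality lookup driven by the IDs from Step 101. The example below uses Python's built-in `sqlite3` module in place of Spark SQL purely so that it runs without a cluster; the `mirbase` table and its `ID` column come from the example statement above, while the `annotation` column, the helper names, the second ID, and the row contents are hypothetical placeholders rather than real MiRBase data.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_demo_mirbase() -> sqlite3.Connection:
    """Create an in-memory stand-in for the mirbase table (placeholder rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mirbase (ID TEXT, annotation TEXT)")
    conn.executemany(
        "INSERT INTO mirbase VALUES (?, ?)",
        [
            ("NM_147191", "placeholder annotation A"),
            ("NM_000000", "placeholder annotation B"),  # made-up ID
        ],
    )
    return conn


def query_mirbase(
    conn: sqlite3.Connection, gene_ids: Iterable[str]
) -> List[Tuple[str, str]]:
    """Run the Step 102 lookup once per gene ID returned by Step 101."""
    rows: List[Tuple[str, str]] = []
    for gene_id in gene_ids:
        # Same shape as the statement in the text, with the ID bound as a parameter.
        cursor = conn.execute("SELECT * FROM mirbase WHERE ID = ?", (gene_id,))
        rows.extend(cursor.fetchall())
    return rows
```

Calling `query_mirbase(build_demo_mirbase(), ["NM_147191"])` returns the single placeholder row stored for that ID. Binding the ID as a parameter instead of concatenating it into the statement is a design choice of this sketch, not something the text prescribes.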
[0016] Refer to Figure 2, which shows a schematic diagram of the structure of a computer system suitable for implementing the terminal device / server of the present application. The terminal device / server shown in Figure 2 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.
[0017] As shown in Figure 2, the device includes a central processing unit 201, which can perform various appropriate actions and processes based on data and programs stored in a memory 202. The central processing unit 201, the memory 202, the input / output section 204, the external storage section 205, and the network section 206 are interconnected via a bus 203.
[0018] The apparatus described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0019] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this invention, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in the details for the sake of brevity.
[0020] The embodiments of this invention are intended to cover all such substitutions, modifications, and variations falling within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A method for querying the MiRBase gene database based on Spark SQL, characterized in that it comprises: using a specified distributed SQL query statement to query the RefGene database and returning the ID that uniquely identifies the gene, wherein the specified distributed SQL query statement is the query statement `select * from s rgjoin r on goverlap((s.txStart, s.txEnd, s.exonCount, s.exonStarts, s.exonEnds, s.chr, s.strand), (r.start, r.end, r.chr))`; in this query statement, `s` represents the RefGene database in table form, and `r` represents the variants to be annotated in table form; tuples are used as the condition for ON, and the parameters in the tuples are as follows: s.txStart represents the start point field of mutation in table s, s.txEnd represents the end point field of mutation in table s, s.exonCount represents the number of exons in table s, s.exonStarts represents the set of start points for each exon in table s, s.exonEnds represents the set of end points for each exon in table s, s.chr represents the chromosome number field in table s, and s.strand represents the direction (positive or negative strand) field of the gene in table s; r.start represents the start point field of mutation in table r, r.end represents the end point field of mutation in table r, and r.chr represents the chromosome number field in table r; and querying the MiRBase gene database based on the returned gene ID and returning the query results, wherein the queries to the MiRBase gene database use standard query statements from a distributed SQL query engine.
2. An annotation device for MiRBase based on Spark SQL, characterized in that it comprises: a central processing unit (CPU), memory, an input / output section, an external storage section, and a network section, which are interconnected via a bus, wherein the CPU can perform various appropriate actions and processes based on the data and programs stored in memory.