Query, acquisition and traversal method of columnar storage database and related device
By working together with a columnar file sequential reader and a streaming client, efficient and accurate data reading of Parquet files is achieved, solving the problems of low data reading efficiency and memory overflow in the native Spark SQL engine and improving the performance of the Spark SQL engine.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG DAHUA TECH CO LTD
- Filing Date
- 2026-01-28
- Publication Date
- 2026-06-19
AI Technical Summary
When reading Parquet files, the native Spark SQL engine reads a large amount of invalid data at once, resulting in wasted disk I/O and network bandwidth, reduced execution speed, and increased risk of memory overflow.
By working together with a columnar file sequential reader and a streaming client, target column data is read and stored on demand, and attribute values are accurately obtained using a column data block reader, avoiding unnecessary data loading.
It improves data reading efficiency, reduces disk I/O and network bandwidth consumption, lowers the risk of memory overflow, and enhances the execution performance of the Spark SQL engine.
Smart Images

Figure CN122240610A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of big data distributed computing technology, and in particular to a method and apparatus for querying, retrieving, and traversing a columnar storage database. Background Technology
[0002] The native Spark SQL (Structured Query Language) engine, when reading data (such as Parquet files) based on SQL, currently reads all the data from the required columns of a row group in the Parquet file at once, and then filters the read data in memory. This processing method of the native Spark SQL engine not only reads a large amount of invalid data, wasting disk I / O (Input / Output) and network bandwidth, greatly reducing the execution speed of the Spark SQL engine, but also introduces the risk of memory overflow because it reads a large amount of data into memory for computation. Summary of the Invention
[0003] This application provides at least one method and related apparatus for querying, retrieving, and traversing a columnar storage database, which can improve data reading efficiency and security.
[0004] The first aspect of this application provides a columnar storage database query method, which includes: in response to receiving a query keyword and query attribute information, a columnar file sequential reader sends a storage instruction to a streaming client, determines the target row number of the query keyword in a target columnar storage file, and stores the target column data in the target columnar storage file into several column data blocks, wherein the target column data is column data determined by the streaming client in the target columnar storage file based on the query attribute information, and the several column data blocks are created by the streaming client for the target column data; the columnar file sequential reader creates corresponding column data block readers for the several column data blocks, wherein the column data block readers are used to read row data in the corresponding column data blocks; the columnar file sequential reader sends a data read instruction to the column data block readers to invoke the column data block readers to obtain the attribute value corresponding to the query attribute information from the several column data blocks based on the target row number; the columnar file sequential reader uses the attribute value as the query result of the query keyword and the query attribute information.
[0005] Before storing the target column data from the target column storage file into the target column data block, the process includes: obtaining footnote information for each column storage file in the database; and, based on the footnote information, using the column storage file containing the keyword to be queried as the target column storage file.
[0006] A second aspect of this application provides a method for acquiring columnar storage data. The method includes: a streaming client receiving a storage instruction from a columnar file sequential reader, determining the target row number of the keyword to be queried in the target columnar storage file, and creating a column reader and several column data blocks for the target column data. The storage instruction includes the keyword to be queried and the attribute information to be queried received by the columnar file sequential reader. The target column data is the column data determined by the streaming client in the target columnar storage file based on the attribute information to be queried. The streaming client uses the column reader to store the target column data into several column data blocks. The streaming client receives a data reading instruction forwarded by the column data block reader, and based on the target row number, retrieves the attribute value corresponding to the attribute information to be queried from the several column data blocks corresponding to the target column data. The column data block reader is created by the columnar file sequential reader for the target column data block.
[0007] The process of determining the target row number of the query keyword in the target columnar storage file includes: analyzing the query keyword to determine the column data corresponding to the query keyword in the target columnar storage file; traversing the row data in the column data to determine the target row data corresponding to the query keyword; and using the row number corresponding to the target row data as the target row number. Before creating a column reader and several column data blocks for the target column data, the process includes: obtaining the length information of the target column data; dividing the target column data based on the length information to obtain several slices; and creating a column reader and several column data blocks for the target column data includes: generating a slice read stream based on several slices and inserting the slice read stream into the column reader, wherein the slice read stream represents the reading order of several slices; or, determining the number of column data blocks based on the length information of the target column data.
[0008] The column reader includes a page reader, a decoder, a data type adapter, and a slice read stream. The column reader stores the target column data into a column data block, including: reading slices sequentially based on the slice read stream; reading page data from the current slice using the page reader, where page data represents data of a preset page storage space size within the current slice, and page data includes several rows of the target column data; decoding the page data using the decoder to obtain page decoded data; encapsulating the page decoded data using the data type adapter to obtain page array data; and storing the page array data into the column data block, where the row numbers of the row data in the column data block match the row numbers of the target column data.
[0009] The process of storing page array data into column data blocks includes: summing the number of used rows in the current column data block with the number of rows in the page array data to obtain the number of currently used rows in the current column data block; and stopping the storage of data into the current column data block in response to the current number of used rows being greater than the row number threshold corresponding to the current column data block.
[0010] A third aspect of this application provides a method for traversing columnar storage data. The columnar data block reader is created by a columnar file sequential reader for columnar data blocks. The method includes: the columnar data block reader forwarding a data reading instruction sent by the columnar file sequential reader to a streaming client, thereby invoking the streaming client to read row data from several columnar data blocks. The several columnar data blocks are created by the streaming client for target columnar data, which is column data determined by the streaming client in a target columnar storage file based on the query attribute information received by the columnar file sequential reader; updating the current number of rows read; in response to the current number of rows read being less than the target row number, comparing the current number of rows read with a row number threshold of the columnar data block, wherein the target row number is the row number in the target columnar storage file determined by the streaming client based on the storage instruction sent by the columnar file sequential reader; and in response to the current number of rows read being less than or equal to the row number threshold of the columnar data block, repeatedly updating the current number of rows read and subsequent steps.
[0011] Updating the current number of rows read includes: incrementing the number of rows read by one to get the current number of rows read; after updating the current number of rows read, it also includes: in response to the current number of rows read being equal to the target row number, using the currently read row data as the attribute value corresponding to the attribute information to be queried.
[0012] The fourth aspect of this application provides an electronic device including a memory and a processor coupled to each other. The processor is used to execute program instructions stored in the memory to implement the columnar storage database query method in the first aspect, or the columnar storage data acquisition method in the second aspect, or the columnar storage data traversal method in the third aspect.
[0013] The fifth aspect of this application provides a computer-readable storage medium having program instructions stored thereon. When the program instructions are executed by a processor, they implement the columnar storage database query method of the first aspect above, or the columnar storage data acquisition method of the second aspect above, or the columnar storage data traversal method of the third aspect above.
[0014] In the above scheme, the columnar file sequential reader sends a storage instruction to the streaming client based on the query keyword and the query attribute information. The streaming client, according to the storage instruction, determines the target row number of the query keyword in the target columnar storage file and the target column data of the query attribute information in the target columnar storage file. Then, the streaming client creates several column data blocks for the target column data and stores the target column data from the target columnar storage file into these column data blocks. Next, the columnar file sequential reader creates corresponding column data block readers for the several column data blocks. The columnar file sequential reader sends a data read instruction to the column data block readers to retrieve the attribute values corresponding to the query attribute information from the several column data blocks based on the target row number. Finally, the columnar file sequential reader uses the attribute values as the query result for the query keyword and the query attribute information. This application stores the target column data corresponding to the attribute information to be queried into a column data block to speed up the reading speed. At the same time, the position of the query keyword can be determined by the target row number of the query keyword in the target column storage file. During reading, the data corresponding to the query keyword can be accurately read without reading additional data, thereby improving the efficiency of data reading. It can also avoid the risk of memory overflow caused by reading a large amount of data into memory for calculation.
[0015] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application. Attached Figure Description
[0016] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the technical solutions of this application.
[0017] Figure 1 This is a schematic diagram of the framework of an embodiment of the Spark SQL engine in this application;
[0018] Figure 2 This is a flowchart illustrating an embodiment of the columnar storage database query method of this application; Figure 3 This is a flowchart illustrating an embodiment of the columnar data retrieval method of this application; Figure 4 This is a flowchart illustrating an embodiment of the columnar data traversal method of this application; Figure 5 This is a schematic diagram of the framework of an embodiment of the columnar storage database query device of this application; Figure 6 This is a schematic diagram of the framework of an embodiment of the electronic device of this application; Figure 7 This is a schematic diagram of a framework of an embodiment of the computer-readable storage medium of this application. Detailed Implementation
[0019] The embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0020] In the following description, specific details such as particular system architectures, interfaces, and technologies are presented for illustrative purposes rather than for limiting purposes, in order to provide a thorough understanding of this application.
[0021] In this document, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " generally indicates that the preceding and following related objects have an "or" relationship. Furthermore, "many" in this document means two or more. Moreover, the term "at least one" in this document means any combination of at least two of any one or more of a plurality of objects. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C.
[0022] Please see Figure 1 , Figure 1 This is a schematic diagram of the framework of an embodiment of the Spark SQL engine 100 of this application. The Spark SQL engine 100 includes an operator 110 and a streaming client 120. The operator 110 includes a columnar file sequential reader 111 and a columnar data block reader 112. The columnar file sequential reader 111 is used to send a storage instruction to the streaming client 120, causing the streaming client 120 to split the columnar data in the Parquet file and store the split data into columnar data blocks. In addition, the columnar file sequential reader 111 is also used to create the columnar data block reader 112 and send a data read instruction to the columnar data block reader 112 to read the required data from the columnar data blocks. The columnar data block reader 112 is used to read data from the columnar data blocks. The streaming client 120 is used to split the columnar data in the Parquet file and store the split data into columnar data blocks.
[0023] The streaming client is a Parquet client, a tool for reading data from Parquet files. It can be implemented in various programming languages, such as Java and C++, and can be integrated into the Spark SQL engine for use by the engine. The "streaming" in the streaming client refers to a sequential reading of data from columns. For example, the streaming client reads rows of data in column C1 in the order of the first row, second row, third row, and so on.
[0024] Parquet is an open-source columnar storage file format designed for big data analytics, emphasizing high-performance compression, encoding, and querying. Originally developed by Cloudera and Twitter, it has become a core component of the Apache Hadoop ecosystem and is widely used in big data processing tools such as Spark, Hive, and Presto.
[0025] Operator 110 can be the DataSourceScan operator, the delayed materialization operator, etc., without specific limitations here.
[0026] Please see Figure 2 , Figure 2 This is a flowchart illustrating an embodiment of the columnar storage database query method of this application. The execution entity of this method is a columnar file sequential reader 111. Specifically, it may include the following steps: Step S210: In response to receiving the query keyword and query attribute information, the columnar file sequential reader sends a storage instruction to the streaming client, determines the target row number of the query keyword in the target columnar storage file, and stores the target column data in the target columnar storage file into several column data blocks.
[0027] The target column data is determined by the streaming client 120 in the target columnar storage file based on the attribute information to be queried. Several column data blocks are created by the streaming client 120 for the target column data. The attribute information to be queried is the attribute information corresponding to the query keyword. For example, if the query keyword is "Zhang San" and the query attribute information is age, then the query task is to query the age of "Zhang San". Another example is if the query keyword is "sales department" and the query attribute information is first-quarter profit, then the query task is to query the sales department's first-quarter profit in previous years. The target columnar storage file is a Parquet file, and the target column data can be column C1 data, column C2 data, etc., from the Parquet file; no specific limitations are made here. The column data blocks can be Lazy Blocks.
[0028] The columnar file sequential reader 111, as the core coordination component, establishes a connection with the streaming client 120 through storage instructions. The streaming client 120 determines the target column data in the target columnar storage file based on the attribute information to be queried. This target column data is the column data in the columnar storage file that matches the attribute information to be queried, and it is stored in a logically segmented manner. The streaming client 120 divides the target column data into several column data blocks, each column data block containing 4096 rows of data by default, forming a data storage unit.
[0029] In one implementation, after the user inputs the keyword and attribute information to be queried in the current query task, the columnar file sequential reader 111 generates a save instruction and sends it to the streaming client 120. This invokes the streaming client 120 to read data from the target columnar storage file, thereby determining the target row number of the keyword in the target columnar storage file. Simultaneously, the streaming client 120 saves the target column data from the target columnar storage file into several column data blocks.
[0030] Furthermore, multiple columnar storage files exist in the database. To improve data reading efficiency and reduce the need for full reads of all columnar storage files, the columnar file sequential reader 111 calls the streaming client 120 to obtain footer information for each columnar storage file in the database. This footer information refers to the footer portion of the Parquet file, which stores metadata such as column data length and statistical information (minimum value, maximum value). Next, the columnar file sequential reader 111 analyzes each columnar storage file based on the footer information, selecting the columnar storage file containing the query keyword as the target columnar storage file. For example, in the name column, if the keyword range is between the minimum and maximum values, that file is selected as the target columnar storage file. There can be one or more columnar storage files containing the query keyword; therefore, there can be one or more corresponding target columnar storage files.
[0031] Filtering files based on footnote information ensures that queries only process files containing the keywords, reducing unnecessary data scanning. Analyzing keywords to determine corresponding columns allows queries to directly target relevant data columns, improving query efficiency. Traversing row data to determine the target row enables efficient location of the target row, shortening query response time and improving query performance.
[0032] Step S220: The columnar file sequential reader creates a corresponding column data block reader for several column data blocks.
[0033] The column data block reader 112 is used to read the row data in the corresponding column data block.
[0034] The columnar file sequential reader 111 creates a column data block reader 112 for each column data block. This reader establishes a cooperative relationship with the column data block through a data access interface, providing row-level data reading functionality.
[0035] In one embodiment, the streaming client 120 creates five column data blocks for the target column data, and the column file sequential reader 111 creates five corresponding column data block readers 112 to read the data in the column data blocks in order to read the data in the target column data.
[0036] Step S230: The columnar file sequential reader sends a data reading instruction to the column data block reader to invoke the column data block reader to obtain the attribute value corresponding to the attribute information to be queried from several column data blocks based on the target row number.
[0037] In one embodiment, the columnar file sequential reader 111 calls the column data block reader 112 to read row data sequentially from several column data blocks via a data read instruction. Specifically, the target row number is row 5000, the target column data corresponds to 5 column data blocks, and each column data block has 4096 rows of storage space for data storage. First, the first column data block is read row by row, and the number of rows read is counted. After reading the first column data block, the number of rows read is 4096. Then, the second column data block is read. When the 904th row of the second column data block is read, the number of rows read is exactly 5000. The row data of the 904th row is returned as an attribute value. The row data after the 904th row of the second column data block will not be read again.
[0038] Step S240: The columnar document sequential reader uses the attribute values as the query results of the query keywords and the query attribute information.
[0039] In one embodiment, the columnar file sequential reader 111 merges the received attribute values with the corresponding keywords to be queried as the query result for this query task.
[0040] This embodiment integrates the delayed materialization function into the reading process, and immediately terminates subsequent data loading when the target row number is read.
[0041] Please see Figure 3 , Figure 3 This is a flowchart illustrating an embodiment of the columnar data acquisition method of this application. The execution entity of this method is the streaming client 120. Specifically, it may include the following steps: Step S310: The streaming client receives the storage instruction sent by the columnar file sequential reader, determines the target row number of the keyword to be queried in the target columnar storage file, and creates a column reader and several column data blocks for the target column data.
[0042] The storage instruction includes the query keyword and query attribute information received by the columnar file sequential reader 111, and the target column data is the column data determined by the streaming client 120 in the target columnar storage file based on the query attribute information.
[0043] In one implementation, to determine the target row number, the streaming client 120 analyzes the query keyword to determine the column data corresponding to the query keyword in the target columnar storage file. Next, it iterates through the rows in that column data to determine the target row data corresponding to the query keyword. Finally, the row number corresponding to the target row data is used as the target row number.
[0044] For example, if the keyword to be queried is "Zhang San", the streaming client 120 analyzes "Zhang San" and the analysis result is that "Zhang San" is a name. Therefore, the corresponding name column data in the target column storage file is traversed.
[0045] For example, if the keyword to be queried is "sales department", the streaming client 120 analyzes "sales department" and the analysis result shows that "sales department" is a department. Therefore, the corresponding department column data in the target columnar storage file is traversed.
[0046] In one implementation, the streaming client 120 initializes the target columnar storage file after receiving a save instruction. Specifically, it obtains the length information of the target column data, which refers to the total number of bytes or logical length of the column data. Then, it segments the target column data according to the length information to obtain several slices. For example, by default, it segments the data into 8MB of storage space. If the target column data is 27MB, it can be divided into four slices: the first three slices are 8MB each, and the last slice is 3MB.
[0047] Next, the streaming client 120 creates a column reader for the target column data and generates a Chunked Input Stream based on several slices. This Chunked Input Stream is then inserted into the column reader, where the slice read stream represents the reading order of the slices. The streaming client 120 also generates column data blocks and determines the number of column data blocks based on the length information of the target column data. For example, if the target column data has 10,000 rows, and a column data block can store 4,096 rows by default, then 10,000 ÷ 4096 = 2 remainder 1808 rows. Therefore, this target column data requires 3 column data blocks.
[0048] Obtaining the length information of the target column data enables logical partitioning based on the data size, which helps avoid reading unnecessary data; partitioning the target column data based on the length information into several slices allows the data to be processed in batches of 8MB logical blocks, which helps reduce the reading of invalid data; generating slice read streams and inserting them into column readers makes the reading order orderly and controllable, which helps support the streaming reading process; determining the number of column data blocks based on the length information of the target column data makes the data block size adapt to the data volume, which helps improve memory usage efficiency.
[0049] Step S320: The streaming client uses a column reader to store the target column data into several column data blocks.
[0050] In one implementation, the column reader includes a page reader, a decoder, a data type adapter, and a slice read stream. The page reader is used to read binary data from a Parquet file of one page size.
[0051] The decoder is used to decode the binary data from the Parquet file into actual data, such as long or int type data. Depending on the encoding format of the Parquet file, a corresponding decoder needs to be developed. The specific correspondence is shown in Table 1: Table 1
[0052] The data type adapter is used to construct the corresponding array based on the field type of the actual data decoded by the decoder. For example, if a field in a Parquet file is stored in INT64 format, the data type adapter needs to create a Long array.
[0053] Based on the field types in the Parquet file, corresponding data type adapters need to be developed. Table 2 below lists the correspondences for some types: Table 2
[0054] Please combine Figure 1 To store the target column data into several data blocks, the streaming client 120 can read the corresponding slices of the target column data sequentially based on the slice read stream. Next, a page reader is used to read the page data (Page) in the current slice. A Page represents a preset page storage space size (e.g., 1MB or 2MB) of data in the current slice, and can include several rows of data from the target column. Data in Parquet files is binary, and different data types have different byte sizes. For example, names (e.g., Sima Guang) are text and occupy a relatively long number of bytes; a single Page can store two name data entries (i.e., two rows of data). Ages (e.g., 18) are numbers and occupy fewer bytes; a single Page can store ten age data entries (i.e., ten rows of data). Therefore, the number of rows of data that a Page can store varies depending on the column data.
[0055] The page data is decoded using a decoder to obtain page decoded data. Then, a data type adapter is used to encapsulate the corresponding field types of the page decoded data, resulting in page array data. Finally, the page array data is stored in column data blocks, where the row number in the column data block matches the row number in the target column data. For example, if "Zhang San" is in row 5000 of the target column data, then "Zhang San" is stored in row 904 of the second column data block; similarly, if "Sales Department" is in row 100 of the target column data, then "Sales Department" is stored in row 100 of the first column data block.
[0056] Furthermore, during the process of storing page array data into column data blocks, the number of rows already used in the current column data block is summed with the number of rows in the page array data to obtain the number of rows currently used in the current column data block. If the number of rows currently used exceeds the row count threshold corresponding to the current column data block, data storage for the current column data block is stopped. Specifically, the page array data corresponding to the target column data is stored into the current column data block. For each row stored in the current column data block, the row count is incremented by 1. When the number of rows stored in the current column data block equals 4096, page array data storage for the current column data block is stopped, and another empty column data block continues to store the page array data.
[0057] As a data storage container, the column data block has a preset maximum row count threshold (e.g., 4096 rows) to limit the storage size of a single data block. During operation, after page array data is stored into the column data block, the system calculates the cumulative value of the used row count in the column data block and the current page array data row count in real time to form the current used row count. When this value exceeds the row count threshold, the system immediately terminates subsequent data storage operations. The row count threshold can be set to 4096 rows to match the batch processing row count of Spark SQL Engine 100. The row count threshold can be dynamically adjusted based on the Parquet file data volume. When processing multiple columns of data, the row count threshold can be uniformly set to the same value to ensure data block capacity consistency. The row count calculation mechanism achieves dynamic tracking of the row count through accumulation operations, combined with threshold comparison to achieve precise control of data storage and avoid redundant reads caused by exceeding the data block capacity limit.
[0058] The row count calculation mechanism dynamically updates the number of used rows in a column data block, thereby enabling precise control over the capacity of the column data block and helping to avoid excessive reading of invalid data. The threshold comparison mechanism triggers a stop operation by comparing the current number of used rows with the row count threshold, thereby reducing unnecessary data storage and helping to reduce the risk of memory overflow.
[0059] Step S330: The streaming client receives the data reading instruction forwarded by the column data block reader, and obtains the attribute value corresponding to the attribute information to be queried from several column data blocks corresponding to the target column data based on the target row number.
[0060] The column data block reader 112 is created by the columnar file sequential reader 111 for the target column data block.
[0061] In one embodiment, the streaming client 120 receives a data reading instruction forwarded by the column data block reader 112, reads several column data blocks corresponding to the target column data until the data corresponding to the target row number in the column data block is read, stops the reading operation, and uses the data as the attribute value corresponding to the attribute information to be queried.
[0062] The above methods ensure that the target row number is determined so that data reading only covers the target location, thereby reducing invalid data reading and helping to reduce disk I / O and network bandwidth consumption. The lazy loading mechanism of column data blocks allows data to be read on demand, thereby avoiding storing the full amount of data in memory and helping to prevent the risk of memory overflow. The cooperation between the column reader and the column data blocks enables data to be accessed in batches, thereby realizing the streaming reading process and helping to improve the execution performance of the Spark SQL engine 100.
[0063] Please see Figure 4 , Figure 4 This is a flowchart illustrating an embodiment of the columnar data traversal method of this application. The execution entity of this method is the columnar data block reader 112. Specifically, it may include the following steps: Step S410: The column data block reader forwards the data reading instructions sent by the columnar file sequential reader to the streaming client, so as to call the streaming client to read the row data in several column data blocks.
[0064] Among them, several data blocks are created by the streaming client 120 for the target column data, which is the column data determined by the streaming client 120 in the target column storage file based on the query attribute information received by the column file sequential reader 111.
[0065] In one embodiment, the column data block reader 112 calls the streaming client 120 to sequentially read a number of column data blocks, and for each column data block, reads the row data therein row by row.
[0066] Step S420: Update the current number of rows read.
[0067] In one implementation, the number of rows read is incremented by one to obtain the current number of rows read. Furthermore, in response to the current number of rows read being equal to the target row number, the currently read row data is used as the attribute value corresponding to the attribute information to be queried.
[0068] Step S430: In response to the current number of rows read being less than the target row number, compare the current number of rows read with the row number threshold of the column data block.
[0069] The target row number is determined by the storage instruction sent by the streaming client based on the columnar file sequential reader, which determines the row number of the keyword to be queried in the target columnar storage file received by the columnar file sequential reader.
[0070] In one implementation, if the number of rows read is less than the target row number, it indicates that data reading can continue. At this point, it is also necessary to compare the number of rows read with the row count threshold of the current column data block. For example, if the target row number is 11000, the number of rows read is 10000, the current column data block is the 3rd column data block, and the row count threshold for each column data block is 4096 rows, then the row count threshold for the 3rd column data block is 12288 rows. Since the number of rows read is less than the target row number, the number of rows read is compared with the 12288 rows.
[0071] In addition, if the number of rows read equals the target row number during the data reading process, the currently read row data will be used as the attribute value corresponding to the attribute information to be queried.
[0072] Step S440: In response to the size relationship where the number of rows read is less than or equal to the row count threshold of the column data block, repeatedly update the number of rows read and subsequent steps.
[0073] In one implementation, the current column data block is the third column data block and the row count threshold for each column data block is 4096 rows. The row count threshold for the three column data blocks is 12288 rows. The current number of rows read is 10000 rows. Since 10000 rows < 12288 rows, the process returns to step S420 and is re-executed.
[0074] The column data block reader 112 has a delayed materialization function. The delayed materialization function updates the number of rows read after reading a row of data in the column data block and determines whether the number of rows read is less than the target row number. If it is less, it further compares the size relationship between the number of rows read and the row number threshold of the column data block.
[0075] In a specific application scenario, the user inputs the query keywords "Zhang San" and the attribute information "age" to be queried. The columnar file sequential reader 111 receives "Zhang San" and "age" and sends a storage instruction to the streaming client 120. The storage instruction includes a getSource instruction, a getNextBlockSet instruction, "Zhang San," and "age." The streaming client 120 initializes a Parquet file source instance and a Parquet file reader instance based on the getSource instruction. Then, the streaming client 120 obtains the footnote information of each columnar storage file in the database and returns the footnote information to the columnar file sequential reader 111. The columnar file sequential reader 111 analyzes each columnar storage file based on the footnote information and selects the columnar storage file containing "Zhang San" as the target columnar storage file.
[0076] After determining the target columnar storage file, the streaming client 120 obtains the length information of the target column data from the footer information of the Parquet file. Based on the length information, it segments the target column data into several slices, defaulting to 8MB of storage space. Based on these slices, it generates a Chunked Input Stream. The streaming client 120 determines the target column data to be the age column based on "age". The streaming client 120 analyzes "Zhang San" to determine the corresponding name column data in the target columnar storage file. Then, it iterates through the rows in the name column data, determining that "Zhang San" is located in row 5000; therefore, row 5000 is used as the target row number.
[0077] The streaming client 120, based on the `getNextBlockSet` instruction, calls the `AdvanceToNextRowGroup` method to create a column reader and the `CreateBlock` method to create column data blocks. Based on the length information of the target column data, the number of column data blocks is determined to be three. One of these column data blocks is a `LazyBLock`, which does not actually read data but only describes the required number of data rows (default 4096 rows) and the data reading method to be called, i.e., the anonymous function method: `()=>readBlock`. A chunked input stream is inserted into the column reader; therefore, the column reader includes a page reader, a decoder, a data type adapter, and the chunked input stream. The streaming client 120 can read the slices corresponding to the target column data sequentially based on the chunked input stream. Next, the page reader reads the page data (`Page`) from the current slice. The decoder decodes the page data (`Page`) to obtain the page decoded data. Then, the data type adapter encapsulates the corresponding field types of the page decoded data to obtain the page array data. Finally, the page array data is stored in the three column data blocks.
[0078] The columnar file sequential reader 111 creates the first column data block reader 112 and initializes it with Block data. The columnar file sequential reader 111 sends a Read Batch instruction (i.e., a data read instruction) to the first column data block reader 112, and the first column data block reader 112 sends a Get Value instruction to the streaming client 120 to read data in the first column data block row by row. The number of rows read is incremented by one in real time as the current number of rows read. If the current number of rows read is less than the target row number, the current number of rows read is compared with the row number threshold of the column data block. If the current number of rows read is less than or equal to the row number threshold of the current column data block, then the reader continues to read data in the first column data block.
[0079] After failing to read the 5000th row of data in the target column in the first column data block, the columnar file sequential reader 111 creates a second column data block reader 112. Subsequent operations are the same as those for the first column data block reader 112, and will not be described again here. When the 904th row of data is read from the second column data block reader 112, the current number of rows read equals the target row number, and subsequent reading operations stop, ending the query task. The 904th row of data read from the second column data block is used as the query result for the query keywords and query attribute information, and this query result is inserted into the column vector ColumnVector of ColumnBatch. In this way, the downstream calculation operators do not need to be modified, and the new streaming client 120 can be integrated.
[0080] This application utilizes a column data block storage mechanism in the streaming client to ensure that target column data is loaded into column data blocks only as needed. Determining the target row number ensures that data reading only covers the target location. Furthermore, by employing a collaborative working mechanism between the column data block reader and the streaming client, along with a row number threshold control mechanism, data traversal only reads the data required for the target row number, thereby reducing disk I / O and network bandwidth consumption. The cooperation between the column reader and the column data blocks enables batch data access, thus achieving a streaming reading process. Simultaneously, the row-level reading method of the column data blocks allows query results to be returned promptly, which helps reduce memory usage, further improving the execution performance of the Spark SQL engine and reducing the risk of memory overflow.
[0081] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which each step is written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.
[0082] Please see Figure 5 , Figure 5This is a schematic diagram of a framework of an embodiment of the columnar storage database query device 500 of this application. The columnar storage database query device 500 includes a storage instruction sending module 510, a creation module 520, a query module 530, and an output module 540. The storage instruction sending module 510, in response to receiving the query keyword and query attribute information, sends a storage instruction to the streaming client via a columnar file sequential reader, determines the target row number of the query keyword in the target columnar storage file, and stores the target column data in the target columnar storage file into several column data blocks. The target column data is the column data determined by the streaming client in the target columnar storage file based on the query attribute information, and the several column data blocks are created by the streaming client for the target column data. The creation module 520 executes the columnar file sequential reader to create corresponding column data block readers for the several column data blocks, wherein the column data block readers are used to read the row data in the corresponding column data blocks. The query module 530 executes the columnar file sequential reader to send a data read instruction to the column data block reader, so that the column data block reader can retrieve the attribute value corresponding to the attribute information to be queried from several column data blocks based on the target row number. The output module 540 executes the columnar file sequential reader to use the attribute value as the query result of the query keyword and the attribute information to be queried.
[0083] In one embodiment, before storing the target column data from the target column storage file into the target column data block, the storage instruction sending module 510 performs the following steps: obtaining footnote information for each column storage file in the database; and based on the footnote information, selecting the column storage file containing the keyword to be queried as the target column storage file.
[0084] Please see Figure 6 , Figure 6 This is a schematic diagram of an embodiment of the electronic device 60 of this application. The electronic device 60 includes a memory 61 and a processor 62 coupled to each other. The processor 62 executes program instructions stored in the memory 61 to implement the steps of any of the above-described columnar storage database query method embodiments, or the steps of any of the above-described columnar storage data acquisition method embodiments, or the steps of any of the above-described columnar storage data traversal method embodiments. In a specific implementation scenario, the electronic device 60 may include, but is not limited to, a microcomputer or a server. Furthermore, the electronic device 60 may also include mobile devices such as laptops and tablets, without limitation.
[0085] Specifically, processor 62 controls itself and memory 61 to implement the steps of any of the above-described columnar storage database query method embodiments, or the steps of any of the above-described graph-based storage data acquisition method embodiments, or the steps of any of the above-described columnar storage data traversal method embodiments. Processor 62 can also be called a CPU (Central Processing Unit). Processor 62 may be an integrated circuit chip with signal processing capabilities. Processor 62 can also be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor. Furthermore, processor 62 can be implemented using integrated circuit chips.
[0086] Please see Figure 7 , Figure 7 This is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 70 of this application. The computer-readable storage medium 70 stores program instructions 701 that can be executed by a processor. The program instructions 701 are used to implement the steps of any of the above-described columnar storage database query method embodiments, or to implement the steps of any of the above-described columnar storage data acquisition method embodiments, or to implement the steps of any of the above-described columnar storage data traversal method embodiments.
[0087] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. The specific implementation can be referred to the description of the above method embodiments, and for the sake of brevity, it will not be repeated here.
[0088] The description of the various embodiments above tends to emphasize the differences between the various embodiments. The similarities or similarities between them can be referred to, and for the sake of brevity, they will not be repeated here.
[0089] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus implementations described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or other forms.
[0090] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0091] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods of various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. A columnar storage database query method, characterized in that, include: In response to receiving the query keyword and query attribute information, the columnar file sequential reader sends a storage instruction to the streaming client, determines the target row number of the query keyword in the target columnar storage file, and stores the target column data in the target columnar storage file into several column data blocks. The target column data is the column data determined by the streaming client in the target columnar storage file based on the query attribute information, and the several column data blocks are created by the streaming client for the target column data. The columnar file sequential reader creates a corresponding column data block reader for the plurality of column data blocks, wherein the column data block reader is used to read the row data in the corresponding column data block; The columnar file sequential reader sends a data reading instruction to the column data block reader, so as to invoke the column data block reader to obtain the attribute value corresponding to the queried attribute information from the plurality of column data blocks based on the target row number; The columnar file sequential reader uses the attribute value as the query result for the keyword and attribute information to be queried.
2. The method according to claim 1, characterized in that, Before storing the target column data from the target columnar storage file into the target column data block, the following steps are included: Retrieve footnote information for each columnar storage file in the database; Based on the footnote information, the columnar storage file containing the keyword to be queried is selected as the target columnar storage file.
3. A method for retrieving data stored in a columnar format, characterized in that, include: The streaming client receives a storage instruction from the columnar file sequential reader, determines the target row number of the keyword to be queried in the target columnar storage file, and creates a column reader and several column data blocks for the target column data. The storage instruction includes the keyword to be queried and the attribute information to be queried received by the columnar file sequential reader. The target column data is the column data determined by the streaming client in the target columnar storage file based on the attribute information to be queried. The streaming client uses the column reader to store the target column data into the plurality of column data blocks; The streaming client receives a data reading instruction forwarded by the column data block reader, and obtains the attribute value corresponding to the queried attribute information from several column data blocks corresponding to the target column data based on the target row number. The column data block reader is created by the columnar file sequential reader for the target column data block.
4. The method according to claim 3, characterized in that, Determining the target row number of the keyword to be queried in the target columnar storage file includes: The query keyword is analyzed to determine the column data corresponding to the query keyword in the target columnar storage file; The row data in the column data is traversed to determine the target row data corresponding to the keyword to be queried; The row number corresponding to the target row data is used as the target row number; Before creating the column reader and several column data blocks for the target column data, the following is included: Obtain the length information of the target column data; The target column data is segmented based on the length information to obtain several slices; The process of creating a column reader and several column data blocks for the target column data includes: Based on the aforementioned slices, a slice read stream is generated and inserted into the column reader, wherein the slice read stream represents the reading order of the aforementioned slices; or, The number of column data blocks is determined based on the length information of the target column data.
5. The method according to claim 4, characterized in that, The column reader includes a page reader, a decoder, a data type adapter, and a slice read stream. The step of using the column reader to store the target column data into the column data block includes: Based on the slice read stream, the slices are read sequentially; The page reader is used to read page data in the current slice, wherein the page data represents data of a preset page storage space size in the current slice, and the page data includes several rows of data of the target column data; The page data is decoded using the decoder to obtain page decoded data; The page decoded data is encapsulated using the data type adapter to obtain page array data; The page array data is stored in the column data block, wherein the row number of the row data in the column data block matches the row number of the target column data.
6. The method according to claim 5, characterized in that, After storing the page array data into the column data block, the following steps are included: The number of rows used in the current column data block is summed with the number of rows in the page array data to obtain the number of rows currently used in the current column data block; In response to the fact that the number of rows currently used is greater than the threshold number of rows corresponding to the current column data block, data storage for the current column data block is stopped.
7. A method for traversing columnar data storage, characterized in that, A column data block reader is created by a columnar file sequential reader for column data blocks, the method comprising: The column data block reader forwards the data reading instruction sent by the columnar file sequential reader to the streaming client, so as to call the streaming client to read the row data in the plurality of column data blocks, wherein the plurality of column data blocks are created by the streaming client for target column data, and the target column data is the column data determined by the streaming client in the target columnar storage file based on the query attribute information received by the columnar file sequential reader; Update the current number of rows read; In response to the current number of read rows being less than the target row number, the size relationship between the current number of read rows and the row number threshold of the column data block is compared, wherein the target row number is the row number of the query keyword received by the column file sequential reader in the target column storage file, which is determined by the storage instruction sent by the streaming client based on the column file sequential reader. In response to the size relationship being that the number of rows currently read is less than or equal to the row count threshold of the column data block, the steps of updating the number of rows currently read and subsequent steps are repeated.
8. The method according to claim 7, characterized in that, The update of the number of rows currently read includes: Increment the number of rows read by one to get the current number of rows read; After updating the number of rows currently read, the following is also included: In response to the current number of rows read being equal to the target row number, the currently read row data is used as the attribute value corresponding to the attribute information to be queried.
9. An electronic device, characterized in that, The system includes a memory and a processor that are coupled to each other. The processor is used to execute program instructions stored in the memory to implement the columnar storage database query method according to any one of claims 1 to 2, or the columnar storage data acquisition method according to any one of claims 3 to 6, or the columnar storage data traversal method according to any one of claims 7 to 8.
10. A computer-readable storage medium having program instructions stored thereon, characterized in that, When the program instructions are executed by the processor, they implement the columnar storage database query method according to any one of claims 1 to 2, or the columnar storage data acquisition method according to any one of claims 3 to 6, or the columnar storage data traversal method according to any one of claims 7 to 8.