A method for multi-condition filtering queries in HBase

CN117931804B · Active · Publication Date: 2026-06-30 · UNIV OF ELECTRONICS SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
UNIV OF ELECTRONICS SCI & TECH OF CHINA
Filing Date
2024-01-24
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

HBase is inefficient when performing complex queries with multiple conditions, and existing solutions such as Phoenix are inefficient in large-scale data analysis and cannot meet the requirements of real-time queries.

Method used

A secondary index table for non-RowKey attributes is created. Multiple filtering conditions are processed in parallel using the MapReduce programming framework. Filters are generated through Spark's distributed HBase connector. The score for each condition is calculated, and RowKeys that meet the conditions are selected.

Benefits of technology

It significantly improves the speed of multi-condition queries, solves the efficiency problem of HBase in complex queries, and achieves more efficient data retrieval.

✦ Generated by Eureka AI based on patent content.

Figure CN117931804B_ABST

Abstract

This invention discloses an HBase multi-condition filtering query method, comprising the following steps: S1. Establish a secondary index table. S2. For the multiple filtering attributes in the query, calculate the score of each condition, and calculate the score $m_n$ required for objects that meet the final query results. S3. In the query conditions to be filtered, every condition except the first is connected by a flat AND, OR, or NOT; obtain from the secondary index table the RowKey set of objects in the main table that match each attribute. S4. Obtain the score for each RowKey; the objects corresponding to RowKeys whose score is not less than $m_n$ are the query objects that meet the filtering conditions. This invention utilizes big data tools and the MapReduce programming framework to execute queries with multiple filtering conditions in parallel and process the relationships between these conditions, effectively improving the speed of queries with multiple filtering conditions.

Description

Technical Field

[0001] This invention belongs to the field of software algorithm technology, and specifically relates to a multi-condition filtering query method for HBase.

Background Technology

[0002] With the rapid development of the internet, a large number of non-relational databases have emerged to address the challenges of storing and retrieving massive amounts of data. Unlike traditional relational databases, non-relational databases were designed from the outset with distributed computing, massive storage, and high availability in mind. They are mainly divided into five categories: Document stores, Graph DBMS, Search engines, Wide column stores, and Time Series DBMS. Each category directly addresses the storage and retrieval problems of a specific type of data. HBase belongs to the wide column store category. This type of database is specifically designed for handling large amounts of data. It organizes data through column families, making read and write operations on large datasets more efficient. As a core component of the Hadoop ecosystem (a big data storage and computing framework), HBase, with its excellent write performance, superior scalability, and stable data storage, is an ideal storage medium for massive amounts of structured data and plays a crucial role in the non-relational database camp. It is a key component in the core storage architectures of many companies both domestically and internationally.

[0003] HBase is based on HDFS (a distributed file storage system) and can store data securely and in a distributed manner. It organizes data files using LSM-Tree (Log Structured Merge Tree), which can significantly improve data loading efficiency compared to B-Tree and B+Tree (Multi-Way Search Tree) methods.

[0004] However, HBase has some disadvantages in retrieval. In the LSM-Tree structure, the IndexBlock stores RowKey (row key) index information, making queries using RowKey as the retrieval condition in HBase highly efficient, typically taking only a few milliseconds to tens of milliseconds. However, this method of retrieving HBase databases solely based on RowKey is very limited and cannot fully meet various query requirements. When performing exact matching searches on non-primary key columns, only Filter (a conditional query component provided by HBase, requiring a full table scan) can be used to obtain records that meet the conditions. If the table stores hundreds of millions of data points, timeout exceptions are prone to occur during conditional queries, far from meeting the requirements of real-time queries.

[0005] To address the shortcomings of HBase retrieval, existing solutions create secondary indexes on non-RowKey columns: the values of these columns are used as the RowKeys of an index table, and the RowKeys from the main table are stored as its column values. When querying with non-primary key columns as filtering attributes, filtering is performed in the corresponding index table using a RowFilter, and the matching records are then retrieved from the main table. However, this approach is inefficient when handling complex queries with multiple conditions. A more popular solution for complex queries is HBase + Phoenix. Phoenix is a query engine built on top of HBase and a top-level project under the Apache Software Foundation. It supports JDBC (Java Database Connectivity) and SQL (Structured Query Language) statements, making it well suited to technical migration from relational databases. However, when executing multi-condition queries over extremely large datasets, Phoenix first fetches data from the secondary indexes and then performs set operations on this data using Java tools, resulting in low efficiency. Large-scale data analysis often requires multi-condition queries over extremely large datasets.

[0006] In this scenario, Phoenix's efficiency is unsatisfactory.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the prior art and provide an HBase multi-condition filtering query method that utilizes big data tools and the MapReduce programming framework to execute queries with multiple filtering conditions in parallel and process the relationships between these conditions, thereby effectively improving the query speed with multiple filtering conditions.

[0008] The objective of this invention is achieved through the following technical solution: an HBase multi-condition filtering query method, comprising the following steps:

[0009] S1. Based on the query filtering attributes of the data in the HBase main table, save the non-RowKey attributes as separate entries in a new table in HBase as a secondary index table. In the secondary index table, the attribute entries are sorted in lexicographical order.

[0010] S2. For multiple filtering attributes to be filtered in the query, calculate the score for each condition using a formula based on their order and the descriptors AND, OR, and NOT. Then, calculate the score $m_n$ required for objects that meet the final query results based on the overall filtering conditions.

[0011] S3. For the query conditions to be filtered, except for the first filter condition, the rest are connected with flat AND, OR or NOT; then for each query condition, generate the corresponding HBase filter and obtain the RowKey set of objects that match the attribute in the main table from the secondary index table.

[0012] S4. Using Spark's MapReduce framework, in the Map phase, elements in the set corresponding to each filter condition are mapped to the score of that set. Then, in the Reduce phase, the scores of each identical RowKey are summed to obtain the score for each RowKey. The objects corresponding to RowKeys whose score is not less than $m_n$ are the query objects that meet the filtering conditions.

[0013] The specific implementation method of step S2 is as follows: First, accept n filtering conditions, and these filtering conditions, except for the first one, are described by AND, OR or NOT. According to their order and descriptor, calculate the score of each condition, as well as the minimum and maximum scores that satisfy all conditions from the first filtering condition to each condition.

[0014] Let $K_i$ represent the score of an object that satisfies the i-th condition, $m_i$ the minimum score that satisfies the first through i-th conditions, and $M_i$ the highest score that satisfies all conditions from the first to the i-th condition. Ultimately, in a query with n filtering conditions, the elements whose score is not lower than $m_n$ are the elements that meet the overall filtering conditions.

[0015] $m_i$ and $M_i$ are calculated as follows. $K_1$, $m_1$, and $M_1$ are each assigned 1. If the connector of the i-th condition is AND, then $K_i = M_{i-1}$, $M_i = M_{i-1} + K_i$, and $m_i = m_{i-1} + K_i$. If the connector of the i-th condition is OR, then $K_i = m_{i-1}$, $m_i = m_{i-1}$, and $M_i = M_{i-1} + K_i$. If the connector of the i-th condition is NOT, then $K_i = \text{INT\_MIN}$, $m_i = m_{i-1}$, and $M_i = M_{i-1}$.
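The recurrence in [0015] can be sketched in a few lines of Python. This is an illustrative sketch rather than the patent's implementation: the function name `condition_scores`, the list-based return value, and the use of a 32-bit value for INT_MIN are assumptions made for the example.

```python
INT_MIN = -(2 ** 31)  # assumption: a 32-bit minimum stands in for the patent's INT_MIN


def condition_scores(connectors):
    """Return the lists (K, m, M) for a flat chain of filtering conditions.

    `connectors` holds the connector in front of conditions 2..n, each one of
    "AND", "OR", or "NOT". The first condition has no connector.
    """
    K, m, M = [1], [1], [1]  # K_1 = m_1 = M_1 = 1
    for op in connectors:
        if op == "AND":
            k = M[-1]              # K_i = M_{i-1}
            m.append(m[-1] + k)    # m_i = m_{i-1} + K_i
            M.append(M[-1] + k)    # M_i = M_{i-1} + K_i
        elif op == "OR":
            k = m[-1]              # K_i = m_{i-1}
            m.append(m[-1])        # m_i = m_{i-1}
            M.append(M[-1] + k)    # M_i = M_{i-1} + K_i
        elif op == "NOT":
            k = INT_MIN            # K_i = INT_MIN
            m.append(m[-1])        # m_i = m_{i-1}
            M.append(M[-1])        # M_i = M_{i-1}
        else:
            raise ValueError(f"unknown connector: {op!r}")
        K.append(k)
    return K, m, M


# Hypothetical query: condition1 OR condition2 AND condition3
K, m, M = condition_scores(["OR", "AND"])
```

The last entry of `m` is the threshold $m_n$ used in step S4. Python integers do not overflow, so summing several INT_MIN scores is safe here; a fixed-width implementation would need to guard against that.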

[0016] In step S3, for each query condition, a corresponding HBase filter is generated. The specific implementation method for obtaining the RowKey set of objects matching the attribute in the main table from the secondary index table is as follows: through Spark's distributed HBase connector, according to each filter condition, the matching RowKey set of the main table is queried in the secondary index table, and the elements in the set are stored in Spark's memory.

[0017] The beneficial effects of this invention are: the method of this invention makes full use of big data tools and the MapReduce programming framework, and can execute queries with multiple filtering conditions in parallel and process the relationships between these conditions. Compared with the Phoenix query engine's method of serially processing condition relationships, this invention can effectively improve the speed of queries with multiple filtering conditions.

Attached Figure Description

[0018] Figure 1 is a flowchart of the HBase multi-condition query method of the present invention.

Detailed Implementation

[0019] The technical solution of the present invention will be further described below with reference to the accompanying drawings.

[0020] This invention provides an HBase multi-condition query method based on the MapReduce programming model. First, a secondary index table is created for the non-RowKey columns of the data in the HBase main table. Then, during the query, multiple filtering conditions are accepted, and the score of elements meeting each filtering condition and the score $m_n$ required for the final result are calculated. After the elements that meet each filtering condition are retrieved, each element is mapped to the score of its condition, and the scores of identical elements are summed. Elements with a final score not less than $m_n$ are the elements in the query result. As shown in Figure 1, the specific process includes the following steps:

[0021] S1. Based on the query filtering attributes of the data in the HBase main table, save the non-RowKey attributes as separate entries in a new table in HBase as a secondary index table. In the secondary index table, the attribute entries are sorted in lexicographical order.
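The index layout described in S1 can be illustrated with an in-memory stand-in. The sketch below does not use HBase; plain Python dictionaries play the role of the main table and the secondary index table, and the table contents, column names, and the `build_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical main table: RowKey -> {column: value}. All names are illustrative.
main_table = {
    "u001": {"city": "Chengdu", "level": "gold"},
    "u002": {"city": "Beijing", "level": "gold"},
    "u003": {"city": "Chengdu", "level": "silver"},
}


def build_index(table, column):
    """Map each value of `column` to the main-table RowKeys that hold it."""
    index = defaultdict(list)
    for rowkey, row in table.items():
        if column in row:
            index[row[column]].append(rowkey)
    # Emit entries in lexicographic order of the attribute value,
    # mirroring the sorted attribute entries of the secondary index table.
    return {value: sorted(index[value]) for value in sorted(index)}


city_index = build_index(main_table, "city")
```

Because the attribute value becomes the key of the index entry, a lookup by attribute turns into a key lookup on the index rather than a scan of the main table.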

[0022] S2. For multiple filtering attributes to be filtered in the query, calculate the score for each condition using a formula based on their order and the descriptors AND, OR, and NOT. Then, calculate the score $m_n$ required for objects that meet the final query results based on the overall filtering conditions. The specific implementation method is as follows: First, accept n filtering conditions, and these filtering conditions, except for the first one, are described by AND, OR or NOT. According to their order and descriptors, calculate the score of each condition, as well as the minimum and maximum scores that satisfy all conditions from the first filtering condition to each condition.

[0023] Let $K_i$ represent the score of an object that satisfies the i-th condition, $m_i$ the minimum score that satisfies the first through i-th conditions, and $M_i$ the highest score that can be achieved over the first through i-th conditions. For example, if the first condition scores 1 and the second condition (using OR as the connector) also scores 1, then an element only needs to satisfy either condition to be in the query result, and the lowest score $m_2$ for elements satisfying the query is 1. If an element satisfies both conditions, its score is the highest score $M_2$, which is 2. Ultimately, in a query with n filtering conditions, the elements whose score is not lower than $m_n$ are the elements that meet the overall filtering conditions.

[0024] $m_i$ and $M_i$ are calculated as follows. $K_1$, $m_1$, and $M_1$ are each assigned 1. If the connector of the i-th condition is AND, then $K_i = M_{i-1}$, $M_i = M_{i-1} + K_i$, and $m_i = m_{i-1} + K_i$. If the connector of the i-th condition is OR, then $K_i = m_{i-1}$, $m_i = m_{i-1}$, and $M_i = M_{i-1} + K_i$. If the connector of the i-th condition is NOT, then $K_i = \text{INT\_MIN}$, $m_i = m_{i-1}$, and $M_i = M_{i-1}$.
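As a worked example of these rules, consider a hypothetical three-condition query in which the second condition is attached with OR and the third with AND. The query itself is an assumption made for illustration; only the recurrence comes from the text.

```latex
\begin{align*}
K_1 &= 1, & m_1 &= 1, & M_1 &= 1 \\
\text{OR:}\quad K_2 &= m_1 = 1, & m_2 &= m_1 = 1, & M_2 &= M_1 + K_2 = 2 \\
\text{AND:}\quad K_3 &= M_2 = 2, & m_3 &= m_2 + K_3 = 3, & M_3 &= M_2 + K_3 = 4
\end{align*}
```

With the threshold $m_3 = 3$, an object that matches only the third condition scores 2 and is rejected, as is an object that matches only the first two conditions (1 + 1 = 2). An object that matches the first and third conditions scores 1 + 2 = 3 and is accepted, and an object that matches all three reaches the maximum $M_3 = 4$. The scores therefore behave as if the connectors were applied from left to right.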

[0025] S3. For the query conditions to be filtered, except for the first filter condition, the rest are connected by flat AND, OR, or NOT operators. For OR, it is enough to satisfy the condition on either side; for AND, the conditions on both sides must be satisfied; NOT indicates that objects in the result must not match the current condition. Then, for each query condition, a corresponding HBase filter is generated, and the RowKey set of objects matching that attribute in the main table is obtained from the secondary index table. The specific implementation method is as follows: through Spark's distributed HBase connector, according to each filter condition, the matching RowKey set of the main table is queried from the secondary index table, and the elements in the set are stored in Spark's memory.
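The per-condition lookup in S3 can be sketched without Spark or HBase. In the stand-in below, a sorted Python list plays the role of the lexicographically ordered secondary index table, and a binary search replaces the HBase filter. The row layout `(attribute entry, main RowKey)`, the sample data, and the `rowkeys_for` helper are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index rows, kept in lexicographic order like an HBase table.
index_rows = sorted([
    ("city=Beijing", "u002"),
    ("city=Chengdu", "u001"),
    ("city=Chengdu", "u003"),
    ("level=gold", "u001"),
    ("level=gold", "u002"),
    ("level=silver", "u003"),
])

MAX_CHAR = chr(0x10FFFF)  # sorts after any RowKey character


def rowkeys_for(entry, rows=index_rows):
    """Return the set of main-table RowKeys stored under one attribute entry."""
    lo = bisect_left(rows, (entry,))            # first row for this entry
    hi = bisect_right(rows, (entry, MAX_CHAR))  # one past the last row
    return {rowkey for _, rowkey in rows[lo:hi]}


# One RowKey set per filtering condition, held in memory for the scoring step.
condition_sets = [rowkeys_for("city=Chengdu"), rowkeys_for("level=gold")]
```

Because the rows are sorted, each condition touches only the contiguous range of index rows for its attribute entry, which is the property that makes an index lookup cheaper than a filtered scan of the main table.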

[0026] S4. Using Spark's MapReduce framework, in the Map phase, elements in the set corresponding to each filter condition are mapped to the score of that set. Then, in the Reduce phase, the scores of each identical RowKey are summed to obtain the score for each RowKey. The objects corresponding to RowKeys whose score is not less than $m_n$ are the query objects that meet the filtering conditions, and these objects can then be retrieved based on the RowKey set. The process is as follows: each element is mapped to the $K_i$ value calculated in S2 for its condition, and then the scores of identical elements across all conditions are summed. If the score of an element is not less than $m_n$, it is one of the elements in the query results. Finally, the set of elements with a score not less than $m_n$ is the RowKey set of the main table for the query results, and the object information in the main table can be obtained from this set.
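The Map and Reduce phases of S4 can be illustrated in plain Python. This sketch runs in a single process and does not use Spark; the function `select_rowkeys`, the RowKey sets, and the precomputed scores are hypothetical values chosen for the example.

```python
from itertools import chain

INT_MIN = -(2 ** 31)  # assumption: a 32-bit minimum stands in for INT_MIN


def select_rowkeys(condition_sets, K, m_n):
    """Score every RowKey and keep those whose total is at least m_n."""
    # Map phase: each RowKey in the i-th set becomes a (RowKey, K_i) pair.
    pairs = chain.from_iterable(
        ((rowkey, k) for rowkey in rowkeys)
        for rowkeys, k in zip(condition_sets, K)
    )
    # Reduce phase: sum the scores of identical RowKeys.
    totals = {}
    for rowkey, k in pairs:
        totals[rowkey] = totals.get(rowkey, 0) + k
    return {rowkey for rowkey, total in totals.items() if total >= m_n}


# Hypothetical query "c1 OR c2 AND c3": K = [1, 1, 2] and m_n = 3 per the S2 rules.
sets_or_and = [{"r1", "r2"}, {"r3"}, {"r1", "r3", "r4"}]
result_or_and = select_rowkeys(sets_or_and, [1, 1, 2], 3)

# Hypothetical query "c1 NOT c2": K = [1, INT_MIN] and m_n = 1.
sets_not = [{"r1", "r2"}, {"r2"}]
result_not = select_rowkeys(sets_not, [1, INT_MIN], 1)
```

In the first query, `r1` and `r3` each reach a total of 3 and are kept, while `r2` (1) and `r4` (2) fall short. In the second, `r2` matches the NOT condition, so its total drops far below the threshold and only `r1` remains. In Spark, the same shape would typically be expressed as a map to key-value pairs followed by a per-key sum.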

[0027] Those skilled in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the invention, and it should be understood that the scope of protection of the invention is not limited to such specific statements and embodiments. Those skilled in the art can make various other specific modifications and combinations based on the technical teachings disclosed in this invention without departing from the spirit of the invention, and these modifications and combinations are still within the scope of protection of this invention.

Claims

1. A multi-condition filtering query method for HBase, characterized in that it includes the following steps: S1. Based on the query filtering attributes of the data in the HBase main table, save the non-RowKey attributes as separate entries in a new table in HBase as a secondary index table. In the secondary index table, the attribute entries are sorted in lexicographical order. S2. For multiple filtering attributes to be filtered in the query, calculate the score for each condition using a formula based on their order and the descriptors AND, OR, and NOT. Then, calculate the score $m_n$ required for objects that meet the final query results based on the overall filtering conditions. The specific implementation method is as follows: First, accept n filtering conditions, and these filtering conditions, except for the first one, are described by AND, OR or NOT. According to their order and descriptors, calculate the score of each condition, as well as the minimum and maximum scores that satisfy all conditions from the first filtering condition to each condition. $K_i$ represents the score of the object that satisfies the i-th condition, $m_i$ represents the minimum score that satisfies the first to the i-th conditions, and $M_i$ represents the highest score that satisfies all conditions from the first to the i-th condition; ultimately, in queries with n filtering conditions, the elements whose score is not lower than $m_n$ are the elements that meet all the conditions. $m_i$ and $M_i$ are calculated as follows: $K_1$, $m_1$, and $M_1$ are each assigned 1; if the connector for the i-th condition is AND, then $K_i = M_{i-1}$, $M_i = M_{i-1} + K_i$, $m_i = m_{i-1} + K_i$; if the connector for the i-th condition is OR, then $K_i = m_{i-1}$, $m_i = m_{i-1}$, $M_i = M_{i-1} + K_i$; if the connector for the i-th condition is NOT, then $K_i = \text{INT\_MIN}$, $m_i = m_{i-1}$, $M_i = M_{i-1}$. S3. For the query conditions to be filtered, except for the first filter condition, the rest are connected with flat AND, OR or NOT; then for each query condition, generate the corresponding HBase filter and obtain the RowKey set of objects that match the attribute in the main table from the secondary index table. S4. Using Spark's MapReduce framework, in the Map phase, elements in the set corresponding to each filter condition are mapped to the score of that set. Then, in the Reduce phase, the scores of each identical RowKey are summed to obtain the score for each RowKey; the objects corresponding to RowKeys whose score is not less than $m_n$ are the query objects that meet the filtering conditions.

2. The HBase multi-condition filtering query method according to claim 1, characterized in that, In step S3, for each query condition, a corresponding HBase filter is generated. The specific implementation method for obtaining the RowKey set of objects matching the attribute in the main table from the secondary index table is as follows: through Spark's distributed HBase connector, according to each filter condition, the matching RowKey set of the main table is queried in the secondary index table, and the elements in the set are stored in Spark's memory.