Method and System for Constructing a Data Structure for Completeness Verification of Ciphertext Range Retrieval Results
By normalizing data records and constructing a k-ary tree structure, the problem of high computational and storage overhead in verifying the completeness of encrypted range retrieval results for large-scale datasets in existing technologies is solved, thus achieving efficient data verification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
- Filing Date
- 2023-03-23
- Publication Date
- 2026-06-30
AI Technical Summary
When processing datasets of hundreds of millions in scale, existing technologies incur excessive computational and storage overhead in constructing verification data structures for ciphertext range retrieval result completeness verification, especially when the spatial distribution of data record value ranges is uneven, where computational and storage overhead increases significantly.
By normalizing the data records to make them more balanced in the value range space, and by using cube coding and k-ary tree structure, combined with Bloom filter and hash signature, a data structure for verifying the completeness of ciphertext range retrieval results is constructed.
It reduces the computational overhead of constructing the verification data structure, reduces the number of tree nodes and the number of codes for Bloom filter insertion, significantly reduces computational and storage overhead, and improves construction efficiency.
Figure CN116484399B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of database security technology, and in particular to a method and system for constructing a data structure for verifying the completeness of encrypted range retrieval results.
Background Technology
[0002] With the rapid development of computing technology, enterprises and organizations in real life have accumulated a large amount of data. However, the computing power and data storage space of the data centers maintained by these enterprises and organizations themselves limit the sharing and utilization of massive amounts of data, and the maintenance cost of data centers is high. As a new computing paradigm, cloud computing can effectively reduce the maintenance cost of IT services. Therefore, more and more enterprises or organizations are choosing to outsource their local data to various services provided by cloud service providers (such as cloud database services) to obtain more efficient, professional, and low-cost data services.
[0003] However, outsourcing data to cloud service providers may introduce security risks such as sensitive information leakage or data tampering. Because cloud service providers operate in an open internet environment, they are vulnerable to attacks that could lead to the leakage of sensitive data. Furthermore, cloud service providers are not entirely trustworthy. Hackers or malicious employees could attack and tamper with data in cloud databases for illicit gain. Moreover, cloud database software itself may have defects, causing certain queries to fail to execute correctly and produce erroneous results.
[0004] For data-sensitive critical applications, data security and the accuracy of query results are of paramount importance. In practice, it is essential to ensure that cloud-based data query results are authentic, reliable, and unaltered; erroneous or tampered query results can lead to severe economic losses.
[0005] The proof-based query integrity authentication framework is an effective method for verifying the correctness and completeness of search results from outsourced cloud databases. This framework consists of three parties: the data owner, the service provider, and the client, as shown in Figure 1.
[0006] The service provider executes a query on the encrypted data index based on the client's encrypted range query request to obtain the encrypted query result R. Based on the encrypted query result R and the verification data structure, the service provider generates a completeness proof π for the encrypted query result, and returns both the encrypted query result and the completeness proof to the client.
[0007] After obtaining the encrypted query results and the integrity proof, the client decrypts the encrypted query results using the key sk provided by the data owner to obtain the plaintext query results. The client then uses the cryptographic digest of the verification data structure, the integrity proof, and the key sk′ to run an integrity verification algorithm on the plaintext query results. If the verification passes, the plaintext query results are correct and complete (i.e., no data records matching the query conditions are missing from the results), and the records in the plaintext query results have not been tampered with. The generation algorithm of the verification data structure and the integrity verification algorithm ensure that any tampering with the cloud dataset, the verification data structure, or the encrypted and plaintext query results can be detected and will fail the integrity verification.
[0008] Currently, methods using proof-based retrieval integrity verification frameworks can be categorized according to the construction characteristics of the verification data structure. They can be divided into tree-based methods, such as those based on Merkle hash trees (Devanbu P, Gertz M, Martel C, et al. Authentic data publication over the Internet[J]. Journal of Computer Security, 2002, 11(3):291-314.) and those based on the Merkle B-Tree (Yin Y, Papadias D, Papadopoulos S, et al. Authenticated join processing in outsourced databases[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2009.), methods based on cryptographic aggregate signatures (Mykletun E, Narasimha M, Tsudik G. Authentication and integrity in outsourced databases[J]. ACM Transactions on Storage, 2006, 2(2):107-138.), and methods based on cryptographic accumulators (Zhang Y, Katz J, Papamanthou C. IntegriDB: Verifiable SQL for outsourced databases[C]//Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 2015: 1480-1491.), etc. However, the aforementioned methods all assume that the dataset is a plaintext dataset or is protected only by order-preserving encryption. In these methods, service providers (usually cloud databases) can obtain the plaintext records of the dataset or the order information between data records, exposing the privacy of the relevant data.
[0009] To protect data privacy, Wu et al. proposed a verification data structure, SVETree, to support the completeness verification of encrypted data range retrieval results, and a corresponding prototype system, ServeDB (WU S, LI Q, LI G, et al. ServeDB: Secure, Verifiable, and Efficient Range Queries on Outsourced Database[C/OL]//2019 IEEE 35th International Conference on Data Engineering (ICDE). 2019:626-637. DOI:10.1109/ICDE.2019.00062.). However, the computational and storage overhead of the SVETree construction process is high, making it difficult to handle large-scale datasets with hundreds of millions of records. The time and storage overhead of the SVETree construction process is affected by the distribution of data records in the value range space. When the data records are unevenly distributed in the value range space, i.e., a small number of records are concentrated in a small area, SVETree will use more cubic encoding levels to reduce the size of the cube, which will lead to a significant increase in the computational and storage overhead of the SVETree structure. Meanwhile, the SVETree data structure generates a leaf node for each data record. When the dataset contains hundreds of millions of data records, the number of tree vertices in the SVETree structure is also hundreds of millions. Calculating the corresponding Bloom filter for each tree vertex in the SVETree structure will bring a lot of computational overhead, resulting in an excessively long construction time for the SVETree structure.
[0010] Therefore, existing methods for verifying the completeness of range retrieval results for encrypted data suffer from high computational and storage overhead in the construction process of the verification data structure. Currently, there is a lack of an efficient method to support the completeness verification of encrypted range retrieval results for datasets with scales of hundreds of millions.
Summary of the Invention
[0011] Purpose of the invention: The purpose of this invention is to provide a method and system for constructing a data structure for verifying the completeness of encrypted range retrieval results, supporting the completeness verification of encrypted range retrieval results on large-scale datasets, while reducing the computational overhead of constructing the verification data structure.
[0012] Technical Solution: This invention provides a method and system for constructing a data structure for verifying the completeness of encrypted range retrieval results, comprising the following steps:
[0013] (1) Obtain the dataset from the dataset owner and assign a unique and non-repeating number to each record in the dataset;
[0014] (2) Obtain the global maximum and global minimum values of each searchable dimension in the dataset, sample some data records from the dataset to form a sampled dataset, and calculate the quantile array of each searchable dimension based on the values of the sampled dataset in each searchable dimension and the global maximum and global minimum values of that dimension. The quantile array contains multiple quantiles in the searchable dimension.
[0015] (3) For each record in the dataset, normalize the value x of each dimension of the record to the range of [0,1]. For each value x of a searchable dimension of the record, calculate the quantile interval of x according to the quantile array of the corresponding dimension. If the value of x is between the a% quantile and the b% quantile in the quantile array, normalize x to the range of [a / 100%, b / 100%] and replace the value x with the normalized value to obtain the normalized dataset.
[0016] (4) Based on the normalized dataset, iteratively search for a suitable number of cube coding system levels L. In the Lth iteration, the entire data value space is divided evenly into $2^{dL}$ cube cells, where d is the number of searchable dimensions in the dataset; data records are assigned to the corresponding cube cells according to whether the values of the normalized data records fall within the value range of each cube cell; if the maximum number of records contained in a cube cell is less than or equal to a specified threshold, the iteration process stops, and the number of iterations at this time is the number of levels in the cube coding system.
[0017] (5) For each record in the normalized dataset, generate a set of cube codes based on the cube unit to which the record belongs at each cube level;
[0018] (6) Group the data records in the normalized dataset according to the cube code of the data record at the last cube level, group the records with the same cube code into a group, construct a cube for each group of records, the cube code of the record constitutes the key of the cube, the content of each group of records constitutes the value of the cube, encrypt the value of the cube, and form an encrypted cube index.
[0019] (7) Construct a k-ary tree structure based on the encrypted cubic grid index, where k is any positive integer greater than or equal to 2. Each node in the k-ary tree has at most k child nodes. The number of leaf nodes in the k-ary tree is the same as the number of cubic grids, and each leaf node corresponds to a different cubic grid.
[0020] (8) Attach a Bloom filter to each tree node of the k-ary tree. For each tree node, insert all the cube codes of the cubic grid corresponding to the leaf nodes covered by the tree node into the Bloom filter.
[0021] (9) Generate a node hash signature for each tree node in the k-ary tree. The node hash signature is related to the content of the Bloom filter of the tree node and the node hash signatures of all the child nodes of the tree node.
[0022] (10) The encrypted cubic index, k-ary tree and cubic coding system hierarchy together constitute the verification data structure of the dataset for verifying the completeness of the ciphertext range retrieval results. The verification data structure is uploaded to the cloud database. At the same time, the summary information of the verification data structure is shared with the customer for verification and query.
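Steps (8) and (9) above can be illustrated with a minimal sketch. The text does not specify the Bloom filter parameters or the hash function, so the following choices are assumptions made only for illustration: a 256-bit filter with 3 hash positions derived from SHA-256, and a node signature computed as SHA-256 over the filter bits followed by the child signatures. The names `BloomFilter` and `annotate` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (ASSUMED parameters: 256 bits, 3 hash positions)."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive each bit position from SHA-256 of a one-byte seed plus the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def annotate(node, children, leaf_codes, filters, signatures):
    """Post-order walk that fills `filters` and `signatures` for every node.

    Returns the cube codes of all leaves covered by `node`.
    """
    kids = children.get(node, [])
    covered = list(leaf_codes[node]) if not kids else []
    for child in kids:
        covered += annotate(child, children, leaf_codes, filters, signatures)
    bloom = BloomFilter()
    for code in covered:                  # step (8): insert all covered cube codes
        bloom.add(code)
    filters[node] = bloom
    # Step (9), ASSUMED form: hash the filter content, then the child signatures.
    h = hashlib.sha256(bytes(bloom.bits))
    for child in kids:
        h.update(signatures[child])
    signatures[node] = h.digest()
    return covered


# Hypothetical two-leaf tree: node 2 is the parent of leaves 0 and 1.
children = {2: [0, 1]}
leaf_codes = {0: [b"code-L1-a", b"code-L2-a"], 1: [b"code-L1-a", b"code-L2-b"]}
filters, signatures = {}, {}
annotate(2, children, leaf_codes, filters, signatures)
print(b"code-L2-b" in filters[2])  # True: Bloom filters have no false negatives
```

Because each parent signature depends on its own filter and on its children's signatures, a change anywhere in the tree propagates to the root signature, which is why a digest of the structure is enough for the client to detect tampering.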
[0023] Furthermore, in step (3), when normalizing a record, for each value x of a searchable dimension of the record, the smallest array index y in the quantile array q of that dimension that satisfies the condition q[y] ≤ x ≤ q[y+1] is found, and the quantiles corresponding to index y in the quantile array are calculated as $a = \frac{y}{|q|-1} \times 100\%$ and $b = \frac{y+1}{|q|-1} \times 100\%$, where |q| is the number of elements contained in the quantile array q. The normalized value norm(x) of x is then calculated according to the formula [formula missing in source], which maps x into the range [a, b]. The normalized value norm(x) takes values in [0,1].
[0024] Furthermore, in step (5), when generating a cube code for the record, a unique number is generated for all cubes in each cube level. The cube code of the record is related to the number of cube levels and the cube number to which the record belongs in that level.
[0025] Furthermore, in step (7), the process of constructing the k-ary tree adopts a bottom-up approach, that is, constructing the tree layer by layer from the bottom layer, i.e., the layer where the leaf node is located. For a certain layer of tree nodes, they are grouped into groups of k, and a new tree node is created for each group of nodes. All of these nodes are child nodes of the newly created tree node. The tree nodes are constructed layer by layer until there is only one tree node in a certain layer.
[0026] This invention provides a system for constructing a data structure to verify the completeness of encrypted range retrieval results, comprising a multidimensional dataset module, a normalization module, an iterative search module, an encryption module, and a data structure construction and verification module;
[0027] The multidimensional dataset module is used to obtain the dataset from the dataset owner, sample the dataset to form a sampled dataset, and calculate the quantile array of each searchable dimension based on the sampled dataset;
[0028] The normalization module is used to normalize the range of values of data records to the range [0,1] using the quantile array;
[0029] The iterative search module is used to iteratively search for a suitable number of levels in a cube coding system.
[0030] The encryption module is used to distribute data records into different cube cells through the cube coding system, and to encrypt records belonging to the same cube cell using a key to generate a cube index.
[0031] The verification data structure construction module is used to build a k-ary tree layer by layer from bottom to top, with each cube in the cube index serving as a leaf node of the tree, to form the complete verification data structure.
[0032] Beneficial Effects: Compared with existing technologies, the significant advantages of this invention are as follows. By normalizing the values of data records, the distribution of data records in the value domain space becomes more balanced, thereby reducing the number of levels in the cube coding system and the number of codes that need to be inserted into the tree node Bloom filters, ultimately reducing the computational overhead and power consumption during construction. Simultaneously, by allowing k-ary trees as the basic structure for the verification data structure, compared to the SVETree verification data structure using balanced binary trees, when k=4, the height of the k-ary tree generated by the method described in this invention, i.e., the number of tree node levels, is only half that of the balanced binary tree, reducing the computational overhead during construction by nearly 50%. Leaf nodes of the k-ary tree are created based on cube units, with the number of leaf nodes being the same as the number of cube units. By selecting a suitable maximum threshold for the number of records contained in a cube unit, the number of cube units can be significantly lower than the number of records in the dataset, thus reducing the number of tree nodes in the k-ary tree to be lower than the number of tree nodes in the balanced binary tree in the SVETree verification structure, significantly reducing the computational and storage overhead of constructing the verification data structure, and supporting the reduction of computational power consumption.
Description of the Drawings
[0033] Figure 1 is an interaction flowchart of the existing integrity verification framework;
[0034] Figure 2 is an example dataset from the existing technology;
[0035] Figure 3 is a distribution diagram of data records in the value range space for the example dataset in the prior art;
[0036] Figure 4 is the normalized example dataset in this invention;
[0037] Figure 5 is a distribution diagram of data records in the value range space for the normalized example dataset in this invention;
[0038] Figure 6 is a schematic diagram of the quadtree structure constructed from the normalized example dataset in this invention;
[0039] Figure 7 is a comparison chart of the time consumption for constructing the verification data structure in this invention and in existing technologies;
[0040] Figure 8 is a flowchart of the method for constructing the data structure for verifying the completeness of encrypted range retrieval results according to the present invention.
Detailed Implementation
[0041] The specific embodiments of the present invention will be described below with reference to the accompanying drawings.
[0042] Example
[0043] This invention provides a method for constructing a data structure to verify the completeness of encrypted range retrieval results. This method reduces the computational and storage overhead of constructing the verification data structure, solving the performance problem caused by the high time and space costs of existing methods when processing large-scale datasets. Referring to Figure 3, a dataset D under the existing technology and its distribution in a two-dimensional value space are taken as an example. The dataset contains 10 records, each with two searchable dimensions x and y. The system hyperparameters are set as follows: sampling rate r = 0.7, number of quantiles Q = 5, tree node width k = 4, maximum threshold u for the number of records per cube cell = 2, and maximum number of layers l = 25.
[0044] As shown in Figure 8, the method for constructing a data structure for verifying the completeness of encrypted range retrieval results according to the present invention includes the following steps:
[0045] (1) Obtain the dataset from the data owner and assign a unique and non-repeating number to each record in the dataset.
[0046] The records in the dataset are numbered consecutively starting from 0; the dataset contains 10 records, numbered sequentially from 0 to 9, as shown in Figure 2.
[0047] (2) Obtain the global maximum and global minimum values of each searchable dimension in the dataset. Sample a portion of the data records from the dataset to form a sampled dataset. Based on the values of the sampled dataset in each searchable dimension and the global maximum and global minimum values of that dimension, calculate the quantile array of each searchable dimension. The quantile array contains multiple quantiles in the searchable dimension.
[0048] Iterate through each record in the dataset to obtain the minimum value of the example dataset in each searchable dimension as <1, 10> (where the minimum value of dimension x is min1 = 1 and the minimum value of y is min2 = 10), and the maximum value of the two searchable dimensions as <19, 170> (where the maximum value of dimension x is max1 = 19 and the maximum value of dimension y is max2 = 170).
[0049] Random samples are taken from the dataset based on a sampling rate r = 0.7. Since the dataset contains 10 records, 7 records are randomly sampled. In this example, records numbered 0, 1, 2, 3, 4, 6, and 9 are sampled, forming the sampled dataset D′. The records contained in the sampled dataset are marked with a gray background in Figure 2.
[0050] For the searchable dimension x, the values of dimension x in the sampled dataset form the value array of that dimension, $V_x$ = [6.5, 7.5, 6, 7, 8, 12, 16]. The minimum value 1 and the maximum value 19 of this dimension are added to $V_x$, giving $V_x$ = [6.5, 7.5, 6, 7, 8, 12, 16, 1, 19]. For the searchable dimension y, the values of dimension y in the sampled dataset form the value array $V_y$ = [140, 145, 120, 110, 125, 70, 30]. The minimum value 10 and the maximum value 170 of this dimension are added to the value array, giving $V_y$ = [140, 145, 120, 110, 125, 70, 30, 10, 170].
[0051] For the value array $V_i$ of each dimension, sort all elements in ascending order. Obtain the number of elements N in the value array and, based on the quantile number parameter Q, select elements from the value array $V_i$: starting from the first element, select one element every [formula missing in source] elements until the last element of $V_i$ is reached. The selected elements of $V_i$ constitute the quantile array $q_i$ for this dimension. If the last element of the value array is not in the quantile array, then add the last element of the value array to the end of the quantile array.
[0052] For the searchable dimension x, sorting all elements of the value array $V_x$ in ascending order gives $V_x$ = [1, 6, 6.5, 7, 7.5, 8, 12, 16, 19]. The value array contains N = 9 elements. According to the system hyperparameter Q = 5, starting from the first element of $V_x$, one element is selected every 2 elements until the last element is reached, so the 1st, 3rd, 5th, 7th, and 9th elements of the value array are selected in turn (i.e., 1, 6.5, 7.5, 12, and 19). The selected elements form the quantile array $q_x$ = [1, 6.5, 7.5, 12, 19]. For the searchable dimension y, sorting all elements of the value array $V_y$ in ascending order gives $V_y$ = [10, 30, 70, 110, 120, 125, 140, 145, 170]. The value array $V_y$ contains N = 9 elements and, based on the system hyperparameter Q = 5, one element is selected every 2 elements. The 1st, 3rd, 5th, 7th, and 9th elements of $V_y$ are selected, and these elements constitute the quantile array of this dimension, $q_y$ = [10, 70, 120, 140, 170].
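The quantile array construction of step (2) can be sketched as follows. The exact step formula is not recoverable from the text, so the sketch assumes a step of (N − 1) // (Q − 1), which is one choice that reproduces the step of 2 used in the example (N = 9, Q = 5); the function name `quantile_array` is hypothetical, and Q ≥ 2 is assumed.

```python
def quantile_array(sample_values, global_min, global_max, Q):
    """Build the quantile array of one searchable dimension."""
    # Value array: sampled values plus the global minimum and maximum, ascending.
    values = sorted(list(sample_values) + [global_min, global_max])
    n = len(values)
    # ASSUMPTION: step = (N - 1) // (Q - 1); the source formula is missing.
    step = max(1, (n - 1) // (Q - 1))
    q = values[::step]
    # If the last element of the value array was not selected, append it.
    if q[-1] != values[-1]:
        q.append(values[-1])
    return q


q_x = quantile_array([6.5, 7.5, 6, 7, 8, 12, 16], 1, 19, Q=5)
q_y = quantile_array([140, 145, 120, 110, 125, 70, 30], 10, 170, Q=5)
print(q_x)  # [1, 6.5, 7.5, 12, 19]
print(q_y)  # [10, 70, 120, 140, 170]
```

Including the global minimum and maximum guarantees that every value in the full dataset, not only the sampled values, falls inside some quantile interval.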
[0053] (3) For each record in the dataset, normalize the value x of each dimension of the record to the range of [0, 1]. For each value x of a searchable dimension of the record, calculate the quantile interval of x according to the quantile array of the corresponding dimension. If the value of x is between the a% quantile and the b% quantile in the quantile array, normalize x to the range of [a / 100%, b / 100%] and replace the value x with the normalized value to obtain the normalized dataset.
[0054] When normalizing a record, for each value x in a searchable dimension of the record, find the smallest array index y in the quantile array q that satisfies the condition q[y] ≤ x ≤ q[y+1], and calculate the quantiles corresponding to index y in the quantile array as $a = \frac{y}{|q|-1} \times 100\%$ and $b = \frac{y+1}{|q|-1} \times 100\%$, where |q| is the number of elements contained in the quantile array q. The normalized value norm(x) of x is then calculated according to the formula [formula missing in source], which maps x into the range [a, b]. The normalized value norm(x) takes values in [0,1].
[0055] The processing procedure is the same for every record in the dataset. The following example uses record number 0. For this record, the searchable dimension x has a value of 6.5. In the quantile array $q_x$, the array index y that meets the condition $q_x[y] ≤ 6.5 ≤ q_x[y+1]$ is found to be 1, with $q_x[1]$ = 6.5 and $q_x[2]$ = 7.5. The corresponding quantiles are a = 1/4 = 25% and b = 2/4 = 50%, and the normalized value of 6.5 in the searchable dimension x, calculated according to the formula, is 0.25. The original value 6.5 in the data record is replaced with the normalized value 0.25. For the searchable dimension y of this record, which has a value of 140, the array index y = 3 that meets the condition is found in the quantile array, the quantiles a = 75% and b = 100% are calculated, and the normalized value calculated according to the formula is 0.75. The value of the searchable dimension y in the original record is replaced with 0.75. After step (3), the record with number 0 has values of 0.25 and 0.75 in dimensions x and y, respectively. After all records of the example dataset shown in Figure 2 are processed in this step, the resulting normalized dataset is as shown in Figure 4. As shown in Figure 5, the data records in the normalized dataset are more evenly distributed in the value range space.
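The normalization of step (3) can be sketched as below. Two points are assumptions rather than statements from the text: the mapping inside a quantile interval is taken to be linear interpolation between a and b, and the interval index is chosen as in the worked example (the largest y with q[y] ≤ x, clamped so that q[y+1] exists). Under linear interpolation a value that sits exactly on a quantile boundary receives the same normalized value from either adjacent interval. The function name `normalize` is hypothetical.

```python
from bisect import bisect_right


def normalize(x, q):
    """Map x into [0, 1] using the quantile array q (sorted ascending)."""
    # Interval index as in the worked example: largest y with q[y] <= x,
    # clamped to the range 0 .. len(q) - 2.
    y = min(max(bisect_right(q, x) - 1, 0), len(q) - 2)
    a = y / (len(q) - 1)          # lower quantile as a fraction
    b = (y + 1) / (len(q) - 1)    # upper quantile as a fraction
    lo, hi = q[y], q[y + 1]
    if hi == lo:                  # degenerate interval (repeated quantile)
        return a
    # ASSUMPTION: linear interpolation from [lo, hi] onto [a, b].
    return a + (x - lo) / (hi - lo) * (b - a)


q_x = [1, 6.5, 7.5, 12, 19]
q_y = [10, 70, 120, 140, 170]
print(normalize(6.5, q_x))  # 0.25
print(normalize(140, q_y))  # 0.75
```

Each quantile interval receives an equal share 1/(|q| − 1) of the unit interval regardless of its width in the original value space, which is what spreads densely clustered records apart.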
[0056] (4) Based on the normalized dataset, iteratively search for a suitable number of cube coding system levels L. In the Lth iteration, the entire data value space is divided evenly into $2^{dL}$ cube cells, where d is the number of searchable dimensions in the dataset. Data records are assigned to the corresponding cube cells according to whether the values of the normalized data records fall within the value range of each cube cell. If the maximum number of records contained in a cube cell is less than or equal to a specified threshold, the iteration process stops. The number of iterations at this time is the number of levels in the cube coding system.
[0057] Starting from the normalized dataset shown in Figure 4, the first iteration begins with the level number L = 1, and a suitable number of cube coding levels is searched for iteratively. In the first iteration, L is 1, and each searchable dimension is divided evenly over the interval [0,1] into two sub-intervals. The first sub-interval covers the range [0,0.5), and the second sub-interval covers the range [0.5,1.0]. The two searchable dimensions (d = 2) of the dataset then divide the entire data value space into four cube cells. The first cube cell covers the sub-interval [0,0.5) of dimension x and [0,0.5) of dimension y, the second cube cell covers the sub-interval [0.5,1.0] of dimension x and [0,0.5) of dimension y, the third cube cell covers the sub-interval [0,0.5) of dimension x and [0.5,1.0] of dimension y, and the fourth cube cell covers the sub-interval [0.5,1.0] of dimension x and [0.5,1.0] of dimension y. Each record in the normalized dataset is assigned to the corresponding cube cell according to its value in each searchable dimension. For example, record number 0 has values of 0.25 and 0.75 in dimensions x and y respectively, which fall within the sub-intervals covered by the third cube cell, so this record is assigned to the third cube cell. The number of records assigned to each cube cell is then counted. In this embodiment, the first cube cell is assigned 2 records, the second cube cell 3 records, the third cube cell 2 records, and the fourth cube cell 3 records, so the maximum value is X = 3. X (3 in this embodiment) is compared with the maximum threshold u (2 in this embodiment) for the number of records contained in each cube cell. Since X > u, the iteration stopping condition is not met.
[0058] In the second iteration, the number of levels is L = 2. Each searchable dimension is divided evenly over the interval [0,1] into four sub-intervals, covering the ranges [0,0.25), [0.25,0.5), [0.5,0.75), and [0.75,1.0], respectively. The two searchable dimensions divide the value space evenly into 16 cube cells, as shown in Figure 5. The data records of the normalized dataset are assigned to the cube cells, and the number of records assigned to each cube cell is counted, giving the maximum value X = 2. The maximum value X is compared with the maximum threshold u for the number of records contained in each cube cell. Since X ≤ u, the iteration stopping condition is met and the iteration stops. The level number L = 2 at the time the iteration stops is added to the verification data structure as the level number parameter of the cube coding system.
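The iterative level search of step (4) can be sketched as follows. The normalized dataset of Figure 4 is not reproduced in the text, so the ten points below are hypothetical values chosen only to match the cell counts stated above (2, 3, 2, 3 at level 1 and a maximum of 2 at level 2); only record 0 at (0.25, 0.75) comes from the text. The helper names `cell_index` and `find_levels` are also hypothetical, and the cell numbering (dimension x varying fastest) is inferred from the enumeration order in the example.

```python
from collections import Counter


def cell_index(point, level):
    """Sequence number of the cube cell that contains a normalized point."""
    side = 2 ** level                    # number of sub-intervals per dimension
    index = 0
    for dim, value in enumerate(point):
        # The last sub-interval is closed on the right, so 1.0 maps to side - 1.
        pos = min(int(value * side), side - 1)
        index += pos * side ** dim       # the first dimension varies fastest
    return index


def find_levels(points, u, max_levels):
    """Smallest level at which no cube cell holds more than u records."""
    for level in range(1, max_levels + 1):
        counts = Counter(cell_index(p, level) for p in points)
        if max(counts.values()) <= u:
            return level
    return max_levels                    # fall back to the configured maximum


# HYPOTHETICAL normalized records; index 5 is record 0 from the text.
points = [(0.0, 0.0), (0.3, 0.3), (0.6, 0.4), (0.8, 0.1), (0.9, 0.2),
          (0.25, 0.75), (0.1, 0.6), (0.5, 0.5), (0.7, 0.9), (1.0, 1.0)]
print(find_levels(points, u=2, max_levels=25))  # 2
```

With more evenly spread records, the stopping condition is reached at a shallower level, which is the effect the normalization step is designed to produce.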
[0059] (5) For each record in the normalized dataset, generate a set of cube codes based on the cube unit to which the record belongs at each cube level.
[0060] Each cube cell in level i is assigned a sequence number between 0 and $2^{di}-1$ (where d is the number of searchable dimensions), and the method of assigning sequence numbers is consistent across all records. Based on the value of record r in each searchable dimension, the cube cell to which the record belongs in each level i is calculated. Based on the sequence number I of that cube cell, the hash function value HMAC(sk′, i||I) (where || is the concatenation operator) is calculated using the key sk′. This hash function value is the cube code $c_{r,i}$ of record r in cube level i.
[0061] The method for generating the cube code set is the same for each record in the normalized dataset. The following describes the generation process for record number 0 (referred to as record 0) as an example. All cube cells in cube level 1 are assigned an index from 0 to 3, determined according to the enumeration order of the cube cells described in step (4). Record 0 belongs to the third cube cell in cube level 1, so the index of the cube cell is I = 2. The hash function value HMAC(sk′,1||2) is calculated using the key sk′, and this value is the cube code $c_{0,1}$ of record 0 in cube level 1. All cube cells in cube level 2 are assigned indices from 0 to 15. The index of the cube cell to which record 0 belongs in cube level 2 is I = 13. Therefore, the hash function value HMAC(sk′, 2||13) is the cube code $c_{0,2}$ of record 0 in cube level 2. Based on the two levels of cube coding, the cube code set of record 0 is {HMAC(sk′,1||2),HMAC(sk′,2||13)}.
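The cube code generation of step (5) can be sketched with Python's standard `hmac` module. The text does not state the underlying hash function or how i||I is serialized, so HMAC-SHA256 and the byte string "i|I" are assumptions; a separator is used because plain concatenation of decimal digits would make (1, 13) and (11, 3) indistinguishable. The key value and the helper names are hypothetical.

```python
import hashlib
import hmac


def cell_index(point, level):
    """Sequence number of the cube cell that contains a normalized point."""
    side = 2 ** level
    index = 0
    for dim, value in enumerate(point):
        pos = min(int(value * side), side - 1)   # 1.0 falls in the last cell
        index += pos * side ** dim               # the first dimension varies fastest
    return index


def cube_codes(point, levels, sk_prime):
    """Cube code set of one record: one HMAC value per cube level."""
    codes = []
    for level in range(1, levels + 1):
        index = cell_index(point, level)
        # ASSUMPTION: "i||I" is serialized as decimal strings joined by "|".
        message = f"{level}|{index}".encode()
        codes.append(hmac.new(sk_prime, message, hashlib.sha256).hexdigest())
    return codes


sk_prime = b"demo-key-not-for-production"        # hypothetical key sk'
codes_record_0 = cube_codes((0.25, 0.75), levels=2, sk_prime=sk_prime)
print(len(codes_record_0))  # 2
```

Because the codes are keyed, a party without sk′ cannot tell which cell index a code stands for, while records in the same cell still share the same code and can be grouped.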
[0062] (6) Group the data records in the normalized dataset according to the cube code of the data record at the last cube level, group the records with the same cube code into a group, construct a cube for each group of records, the cube code of the record constitutes the key of the cube, the content of each group of records constitutes the value of the cube, encrypt the value of the cube to form an encrypted cube index.
[0063] A cube index consists of several cube cells, each being a key-value pair. The cube encoding set of any record in a group constitutes the key part of the key-value pair, while the contents of all records belonging to that group constitute the value part. The value part of the key-value pair is encrypted using a key sk, forming an encrypted cube. All encrypted cubes constitute an encrypted cube index. Each cube corresponds to a cube unit in the cube encoding hierarchy L, and each cube unit corresponds to a continuous region in the data value domain space. Records belonging to the same cube are close to each other in the value domain space.
[0064] Because the number of cube levels L equals 2, the records in the normalized dataset are grouped according to their corresponding cube codes at cube coding level 2. Records with the same cube code are assigned to the same group. In this example, the two records numbered 8 and 9 belong to the same group, while the other records belong to independent groups, for a total of 9 groups. Taking the group to which the two records numbered 8 and 9 belong as an example, this group corresponds to a cube, which is a key-value pair. The key part is the cube code set {HMAC(sk′,1||1),HMAC(sk′,2||3)} of the record numbered 8, and the value part is the content of the two records numbered 8 and 9 (in this example, the record number and the values of the two dimensions x and y). The value part of the key-value pair is encrypted using the key sk and the AES256 algorithm to form an encrypted cube. All 9 encrypted cubes in this embodiment constitute a cube index, and the 9 cubes are numbered consecutively from 0 to 8. In this example, the cube containing the two records numbered 8 and 9 is numbered 1.
[0065] (7) Construct a k-ary tree structure based on the encrypted cube index, where k is any positive integer greater than or equal to 2. Each node in the k-ary tree has at most k child nodes. The number of leaf nodes in the k-ary tree equals the number of cubes, and each leaf node corresponds to a different cube.
[0066] In this embodiment, the tree node width hyperparameter k = 4. Therefore, this step constructs a quadtree in which the 9 cubes in the cube index correspond to the 9 leaf nodes of the quadtree, as shown in Figure 6. In the quadtree shown, the tree nodes numbered 0 to 8 are leaf nodes. Each leaf node corresponds to a cube in the cube index; for example, leaf node 1 corresponds to cube 1. Starting from tree level 1 (the bottom level), where the leaf nodes are located, the 9 tree nodes in this level are divided into groups of 4: tree nodes 0 to 3 form one group, tree nodes 4 to 7 form another group, and tree node 8 forms a group by itself. A new tree node numbered 9 (referred to as tree node 9) is created for the group of tree nodes 0 to 3, and tree nodes 0 to 3 all become child nodes of tree node 9. Tree nodes 10 and 11 are created in the same way for the other two groups. Tree nodes 9, 10, and 11 constitute tree level 2. The nodes of tree level 2 are again divided into groups of 4, which yields a single group. A tree node numbered 12 is created for this group and constitutes tree level 3. Since tree level 3 contains only one tree node, the quadtree construction stops, and tree node 12 is the root node of the quadtree.
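The bottom-up grouping described in this embodiment can be sketched as below. The class and function names are hypothetical, and the sketch covers only the tree shape and node numbering; Bloom filters and signatures are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    no: int                                       # node number
    children: list = field(default_factory=list)  # empty for leaf nodes


def build_kary_tree(num_leaves, k):
    """Build a k-ary tree layer by layer from the leaves up and return the root."""
    if k < 2 or num_leaves < 1:
        raise ValueError("k must be >= 2 and there must be at least one leaf")
    level = [TreeNode(i) for i in range(num_leaves)]  # tree level 1: the leaves
    next_no = num_leaves
    while len(level) > 1:
        parents = []
        # Take the current level in consecutive groups of at most k nodes.
        for start in range(0, len(level), k):
            parents.append(TreeNode(next_no, level[start:start + k]))
            next_no += 1
        level = parents
    return level[0]


# 9 cubes and k = 4, as in the embodiment.
root = build_kary_tree(9, 4)
```

With 9 leaves and k = 4, the first pass creates nodes 9, 10, and 11 over the groups 0 to 3, 4 to 7, and 8, and the second pass creates the root node 12.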
[0067] (8) Attach a Bloom filter to each tree node of the k-ary tree. For each tree node, insert into its Bloom filter all the cube codes of the cubes corresponding to the leaf nodes covered by that tree node.
[0068] For each tree node N in the k-ary tree, recursively search all descendant nodes of the tree node and extract all leaf nodes from the descendant nodes. These leaf nodes are the leaf nodes covered by tree node N. For each leaf node covered by tree node N, insert all the codes in the cube code set of the corresponding cube (i.e., the key part of the cube key-value pair) into the Bloom Filter of the tree node.
[0069] As shown in Figure 6, a Bloom filter is attached to each node of the quadtree. The process of inserting cube codes into the Bloom filter is the same for every node. The following explanation uses the node numbered 9 (referred to as node 9) as an example. The descendant nodes of node 9 are the four nodes numbered 0 to 3. Since these four descendant nodes are all leaf nodes, they are the leaf nodes covered by node 9. The insertion process is similar for each leaf node covered by node 9, so the following explanation uses leaf node 1 as an example. The two codes HMAC(sk′,1||1) and HMAC(sk′,2||3) from the cube code set {HMAC(sk′,1||1),HMAC(sk′,2||3)} of the cube corresponding to this leaf node (i.e., the cube numbered 1) are inserted into the Bloom filter of node 9.
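The insertion procedure can be sketched as follows. The Bloom filter below is a small illustrative implementation; its size, number of hash functions, and hashing scheme are assumptions, as are the placeholder cube codes. The sketch also assumes that a leaf node covers itself, which the source does not state explicitly.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; num_bits and num_hashes are arbitrary choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing a counter to the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def covered_leaves(node, children):
    """Recursively collect the leaf nodes below `node` (a leaf covers itself)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    leaves = []
    for child in kids:
        leaves.extend(covered_leaves(child, children))
    return leaves


def fill_filter(node, children, leaf_codes):
    """Insert every cube code of every covered leaf into the node's filter."""
    bf = BloomFilter()
    for leaf in covered_leaves(node, children):
        for code in leaf_codes[leaf]:
            bf.add(code)
    return bf


# Quadtree shape from the embodiment: node number -> child node numbers.
children = {12: [9, 10, 11], 9: [0, 1, 2, 3], 10: [4, 5, 6, 7], 11: [8]}
# Placeholder cube code sets (two codes per cube because L = 2).
leaf_codes = {i: [f"code-L1-{i}".encode(), f"code-L2-{i}".encode()]
              for i in range(9)}
bf9 = fill_filter(9, children, leaf_codes)
```

A Bloom filter never reports a false negative, so every code of leaf nodes 0 to 3 is found in `bf9`; codes of other leaves are usually rejected but may occasionally match as false positives.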
[0070] (9) Generate a node hash signature for each tree node in the k-ary tree. The node hash signature is related to the contents of the Bloom filter of the tree node and the node hash signatures of all the child nodes of the tree node.
[0071] The HMAC mechanism is used to generate the hash signature of the Bloom filter from the content BL of the Bloom filter in the tree node and the key sk′: sig_BL = HMAC(sk′, BL). The HMAC mechanism is then applied again to generate the node hash signature of the current tree node from the hash signature sig_BL of its Bloom filter, the node hash signatures of all child nodes of the tree node, and the key sk′: sig_N = HMAC(sk′, sig_BL || sig_C1 || sig_C2 || … || sig_Ck), where || is the concatenation operator and sig_C1, sig_C2, …, sig_Ck are the node hash signatures of the k child nodes of the tree node.
[0072] The method of appending a node hash signature is the same for every tree node. The following explanation uses the generation process for tree node number 9 (referred to as tree node 9) as an example. The bit array in the Bloom filter of tree node 9 is used as the content BL of the Bloom filter. Based on the key sk′, the hash value HMAC(sk′, BL) is calculated using the HMAC mechanism and stored as the hash signature sig_BL of the Bloom filter. From sig_BL and the node hash signatures sig0, sig1, sig2, and sig3 of the four child nodes of tree node 9 (i.e., tree nodes numbered 0 to 3), the node hash signature of tree node 9 is calculated by the formula sig9 = HMAC(sk′, sig_BL || sig0 || sig1 || sig2 || sig3), where || is the concatenation operation.
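The two-stage signature computation can be sketched with Python's standard `hmac` module. The choice of SHA-256 as the underlying hash, the key value, and the placeholder Bloom filter bit arrays are assumptions; the source only specifies that the HMAC mechanism is used with the key sk′.

```python
import hashlib
import hmac


def hmac_sig(sk_prime, data):
    """HMAC(sk', data); SHA-256 is an assumed hash function."""
    return hmac.new(sk_prime, data, hashlib.sha256).digest()


def node_signature(sk_prime, bloom_bits, child_sigs):
    """sig_N = HMAC(sk', sig_BL || sig_C1 || ... || sig_Ck)."""
    sig_bl = hmac_sig(sk_prime, bloom_bits)          # sig_BL = HMAC(sk', BL)
    return hmac_sig(sk_prime, sig_bl + b"".join(child_sigs))


SK_PRIME = b"hypothetical-hmac-key"

# Placeholder bit arrays standing in for the Bloom filters of leaf nodes 0 to 3.
leaf_bits = [bytes([i]) * 16 for i in range(4)]
# Leaf nodes have no children, so only their own Bloom filter is signed.
leaf_sigs = [node_signature(SK_PRIME, bits, []) for bits in leaf_bits]

# Tree node 9: its own Bloom filter content plus the four child signatures.
bits9 = bytes(16)
sig9 = node_signature(SK_PRIME, bits9, leaf_sigs)
```

Because each node signature folds in the signatures of its children, any change to a Bloom filter anywhere in the tree propagates up to the root signature.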
[0073] (10) The encrypted cube index, the k-ary tree, and the number of levels of the cube coding system together constitute the verification data structure of the dataset for verifying the completeness of ciphertext range retrieval results. The verification data structure is uploaded to the cloud database. At the same time, the summary information of the verification data structure is shared with the client for verification and query.
[0074] In this embodiment, the encrypted cube index, the quadtree, and the number of levels L = 2 of the cube coding system constitute the verification data structure. This content is stored in a file and uploaded to the service provider (usually a cloud database). The node hash signature of the root node of the quadtree, the number of levels L = 2 of the cube coding system, the quantile arrays q_x and q_y of the searchable dimensions in the dataset, and the keys sk and sk′ are stored in another file and transmitted to the client in a secure and private manner.
[0075] To test the performance of the method proposed in this invention, the Foursquare check-in dataset, the US Wild Fire event dataset, and the Gowalla social network check-in dataset were used as inputs, and the time the proposed method takes to construct the verification data structure was measured. For comparison, the construction method of the SVETree verification data structure described in the literature "WU S, LI Q, LI G, et al. ServeDB: Secure, Verifiable, and Efficient Range Queries on Outsourced Database [C/OL]//2019 IEEE 35th International Conference on Data Engineering (ICDE). 2019: 626-637." was used as a reference method, and its construction time was also measured. As shown in Figure 7, the construction time of the method proposed in this invention is significantly lower than that of the reference method, so the proposed method can effectively improve the construction efficiency of the verification data structure.
Claims
1. A method for constructing a data structure for verifying the completeness of ciphertext range retrieval results, characterized in that it includes the following steps:
(1) Obtain the dataset from the dataset owner and assign a unique, non-repeating number to each record in the dataset;
(2) Obtain the global maximum and global minimum values of each searchable dimension in the dataset, sample some data records from the dataset to form a sampled dataset, and calculate the quantile array of each searchable dimension based on the values of the sampled dataset in that dimension and the global maximum and global minimum values of that dimension, the quantile array containing multiple quantiles of the searchable dimension;
(3) For each record in the dataset, normalize the value of each dimension of the record into a range bounded by two quantiles, and replace the value with the normalized value, thereby obtaining a normalized dataset;
(4) Based on the normalized dataset, search iteratively for a suitable number of levels of the cube coding system; in each iteration, the entire data value range space is divided equally into cube units, the number of which depends on the iteration and on the number of searchable dimensions in the dataset; the data records are assigned to the corresponding cube units according to the inclusion relationship between the value range of each cube unit and the values of the normalized data records; if the maximum number of records contained in any cube unit is less than or equal to a specified threshold, the iteration stops, and the number of iterations at that point is the number of levels of the cube coding system;
(5) For each record in the normalized dataset, generate a set of cube codes based on the cube unit to which the record belongs at each cube level;
(6) Group the data records in the normalized dataset by their cube code at the last cube level, so that records with the same cube code fall into the same group; construct a cube for each group of records, in which the cube code of the records constitutes the key of the cube and the content of the records in the group constitutes the value of the cube; and encrypt the value of each cube to form an encrypted cube index;
(7) Construct a k-ary tree structure based on the encrypted cube index, where k is any positive integer greater than or equal to 2; each node in the k-ary tree has at most k child nodes, the number of leaf nodes in the k-ary tree equals the number of cubes, and each leaf node corresponds to a different cube;
(8) Attach a Bloom filter to each tree node of the k-ary tree, and, for each tree node, insert into its Bloom filter all the cube codes of the cubes corresponding to the leaf nodes covered by that tree node;
(9) Generate a node hash signature for each tree node in the k-ary tree, the node hash signature being related to the contents of the Bloom filter of that tree node and the node hash signatures of all the child nodes of that tree node;
(10) The encrypted cube index, the k-ary tree, and the number of levels of the cube coding system together constitute the verification data structure of the dataset for verifying the completeness of ciphertext range retrieval results; the verification data structure is uploaded to the cloud database, and the summary information of the verification data structure is shared with the client for verification and query.
2. The method for constructing a data structure for verifying the completeness of encrypted range retrieval results according to claim 1, characterized in that, in step (3), when normalizing a record, for the value of each searchable dimension of the record, the quantile array of that dimension is searched for the minimum array index that satisfies the search condition; the quantiles corresponding to that index in the quantile array are obtained; and the normalized value is calculated by a formula from the value, these quantiles, and the number of elements contained in the quantile array, the normalized value lying within a bounded range.
3. The method for constructing a data structure for verifying the completeness of encrypted range retrieval results according to claim 1, characterized in that, in step (5), when generating the cube codes for a record, a unique number is generated for every cube unit in each cube level, and the cube code of a record at a given level is related to the number of that cube level and the number of the cube unit to which the record belongs in that level.
4. The method for constructing a data structure for verifying the completeness of encrypted range retrieval results according to claim 1, characterized in that, in step (7), the k-ary tree is constructed layer by layer from the bottom up, starting with the level containing the leaf nodes. The nodes of a given level are divided into groups of k nodes each, and a new tree node is created for each group, with all nodes in the group becoming child nodes of the newly created tree node. The tree nodes are constructed layer by layer in this way until a layer contains only one tree node.
5. A system for constructing a data structure for verifying the completeness of encrypted range retrieval results using the method described in claim 1, characterized in that it includes a multidimensional dataset module, a normalization module, an iterative search module, an encryption module, and a verification data structure construction module. The multidimensional dataset module is used to obtain the dataset from the dataset owner, sample the dataset to form a sampled dataset, and calculate the quantile array of each searchable dimension based on the sampled dataset; the normalization module is used to normalize the values of the data records into a bounded range using the quantile arrays; the iterative search module is used to iteratively search for a suitable number of levels of the cube coding system; the encryption module is used to distribute the data records into different cube units through the cube coding system and to encrypt the records belonging to the same cube unit using a key to generate a cube index; and the verification data structure construction module is used to build a k-ary tree layer by layer from the bottom up, taking each cube in the cube index as a leaf node of the tree, to form the complete verification data structure.
6. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4.