A data privacy protection encryption retrieval method and system and a storage medium

By optimizing keyword frequency partitioning and index structure, and by combining a static primary index with a dynamic secondary index, the problems of efficient retrieval and frequent updates of encrypted data are solved, improving query speed and system stability while ensuring data security.

CN122286830A · Pending · Publication Date: 2026-06-26 · BEIJING DEXUN AVIATION SERVICE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING DEXUN AVIATION SERVICE CO LTD
Filing Date
2026-04-21
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve efficient keyword retrieval of encrypted data without decryption, especially when the dataset changes. It is difficult to balance query efficiency with update efficiency, and high-frequency keyword updates lead to lock contention and write conflicts, reducing system performance.

Method used

The keyword set is divided into low-frequency and high-frequency subsets. A static primary index is constructed for the low-frequency subset using a minimal perfect hash function, and a dynamic secondary index that supports concurrent updates is constructed for the high-frequency subset. Deterministic tokens are generated through a pseudo-random function, and secure retrieval is achieved by combining symmetric encryption and stream encryption techniques.

Benefits of technology

It enables efficient retrieval without decrypting data, improves query speed and storage efficiency, alleviates the bottleneck of high-frequency keyword updates, and ensures the stability and security of the system.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122286830A_ABST

Abstract

This invention provides an encrypted retrieval method, system, and storage medium for data privacy protection. The retrieval method includes: dividing document set keywords into low-frequency and high-frequency subsets according to a frequency threshold; generating deterministic tokens for keywords using a key-based pseudo-random function and storing the correspondence between tokens and keyword types; for low-frequency keywords, constructing a static primary index using a minimal perfect hash structure, mapping tokens to hash buckets, and storing pointers to lists of encrypted document identifiers; for high-frequency keywords, constructing a dynamically updated secondary index using encrypted key-value storage, storing the mapping between tokens and encrypted document identifiers; during retrieval, the user generates a deterministic token for the keyword to be searched and uploads it as a trapdoor; the server queries the primary or secondary index according to the keyword type; and the document identifiers are encrypted by generating a key stream from a derived symmetric key and an initialization vector and XORing the identifiers bitwise with that key stream.

Description

Technical Field

[0001] This application belongs to the field of retrieval, and in particular relates to an encrypted retrieval method, system and storage medium for data privacy protection.

Background Technology

[0002] To protect data confidentiality, users typically encrypt data before uploading. However, this renders traditional plaintext retrieval techniques ineffective. A key challenge is how to perform keyword retrieval on encrypted data without decryption. Searchable symmetric encryption technology allows data owners to build an encrypted index and store it along with the encrypted data on a server. Users can then initiate queries by submitting keyword-specific search tokens, called "trapdoors", that are generated with a key. The server uses these trapdoors to perform a matching operation on the encrypted index and returns the corresponding encrypted document identifiers. Throughout this process, the server learns neither the query content nor the original data.

[0003] Static indexing schemes typically build an index once for a fixed dataset, such as an inverted index structure. While offering very high query efficiency, they either do not support data updates at all or support them only very inefficiently. Once the dataset changes, such as when documents are added or deleted, the client often needs to download the entire index, modify it locally, re-encrypt it, and upload it again. This incurs huge overhead in large-scale, changing data environments, making it impractical. Dynamic indexing schemes, which organize the index with data structures such as linked lists, trees, or encrypted hash tables, allow documents to be added to and deleted from encrypted datasets. For high-frequency keywords, however, the corresponding index entries are frequently accessed and modified concurrently, which easily leads to severe lock contention and write conflicts. This reduces the system's update throughput and scalability, and query performance often cannot match that of optimized static structures. Therefore, how to design a scheme that balances query efficiency and update efficiency while alleviating the bottleneck of high-frequency keyword updates is a key focus and challenge in this field.

Summary of the Invention

[0004] This disclosure addresses the problem that existing technologies struggle to balance query efficiency and update efficiency and are unable to overcome the bottleneck in updating high-frequency keywords.

[0005] In the first aspect, this disclosure provides an encrypted retrieval method for data privacy protection, comprising: The document set is pre-analyzed, and the keyword set is divided into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold; a unified, key-based pseudo-random function is defined to generate a unique deterministic token for any keyword, and the correspondence between deterministic tokens and keyword types is securely stored. For the low-frequency keyword subset, a static primary index is constructed based on the deterministic token set corresponding to the subset; for the high-frequency keyword subset, a dynamic secondary index supporting concurrent updates is constructed based on the deterministic token set corresponding to the subset. When performing keyword retrieval, the user terminal uses the pseudo-random function to generate a trapdoor, i.e., a deterministic token, for the keyword to be retrieved and sends the trapdoor to the server. After receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored correspondence between deterministic tokens and keyword types. If it is a low-frequency keyword, the server searches the static primary index; if it is a high-frequency keyword, the server searches the dynamic secondary index; the server then returns the retrieved encrypted document identifier list. The encrypted document identifier list is obtained by a bitwise XOR of the original document identifier list with a key stream, where the key stream is generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and parameters that change with updates.

[0006] Furthermore, the pre-analysis of the document set, dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold, includes: Traversing the document set to count the total number of times each keyword appears in all documents; if the total number of times a keyword appears is not greater than the keyword frequency threshold, the keyword is classified into the low-frequency keyword subset; otherwise, the keyword is classified into the high-frequency keyword subset.

[0007] Furthermore, the step of constructing a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to the subset includes: Using the set of deterministic tokens corresponding to all low-frequency keywords as input, a function is constructed with a minimal perfect hash function algorithm to map each deterministic token to a unique integer value; a pointer array whose size equals the total number of low-frequency keywords is created, and the storage address of the encrypted document identifier list associated with each token is stored at the array position determined by that token's integer value.

[0008] Furthermore, the step of constructing a dynamic secondary index supporting concurrent updates for the subset of high-frequency keywords, based on the set of deterministic tokens corresponding to the subset, includes: The encrypted key-value storage structure is implemented using an encrypted key-value database that supports concurrent operations; The key of the database is a deterministic token encrypted using the pseudo-random function, and the value is the storage location of the corresponding encrypted document identifier list. The encryption ensures the confidentiality of the tokens and does not reveal their inherent order.

[0009] Furthermore, if the keyword is low-frequency, querying the static primary index includes: The server uses the trapdoor submitted by the client as input and calculates a unique integer index value using the minimal perfect hash function of the static primary index; the server then accesses the position of that index value in the pointer array, and reads and returns the stored list of encrypted document identifiers.

[0010] Furthermore, if the keyword is a high-frequency keyword, then the query is performed in the dynamic secondary index, including: The server uses the same pseudo-random function as when building the index to encrypt the trapdoor submitted by the client, resulting in an encrypted trapdoor; Using the encrypted trapdoor as a key, a matching search is performed in the encrypted key-value storage structure of the dynamic secondary index to locate and return the corresponding list of encrypted document identifiers.

[0011] Further, the step of obtaining the encrypted document identifier list by a bitwise XOR of the original document identifier list with a key stream, generated in stream encryption mode from a symmetric key and an initialization vector derived using the keyword, the master key, and parameters that change with updates, includes: From the keyword, the master key, and the parameters that change with updates, a key derivation method based on pseudo-random functions derives a symmetric key and an initialization vector specific to the keyword; a symmetric encryption algorithm in stream encryption mode, using the derived symmetric key and initialization vector, generates a key stream of the same length as the original document identifier list; the encrypted document identifier list is obtained by a bitwise XOR of the original document identifier list with the key stream.

[0012] In another aspect, this disclosure also provides an encrypted retrieval system for data privacy protection, including the following modules: The first generation module is used to analyze the document set in advance and divide the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold; to define a unified, key-based pseudo-random function that generates a unique deterministic token for any keyword; and to securely store the correspondence between deterministic tokens and keyword types. The construction module is used to construct a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to the subset, and to construct a dynamic secondary index that supports concurrent updates for the high-frequency keyword subset based on the deterministic token set corresponding to the subset. The retrieval module is used to generate, on the user end, a trapdoor (i.e., a deterministic token) for the keyword to be retrieved using the pseudo-random function, and to send the trapdoor to the server. After receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored correspondence between deterministic tokens and keyword types. If it is a low-frequency keyword, the server searches the static primary index; if it is a high-frequency keyword, the server searches the dynamic secondary index; the server then returns a list of encrypted document identifiers. The list of encrypted document identifiers is obtained by a bitwise XOR of the original document identifier list with a key stream, where the key stream is generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and parameters that change with updates.

[0013] Preferably, the step of pre-analyzing the document set and dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold includes: Traversing the document set to count the total number of times each keyword appears in all documents; if the total number of times a keyword appears is not greater than the keyword frequency threshold, the keyword is classified into the low-frequency keyword subset; otherwise, the keyword is classified into the high-frequency keyword subset.

[0014] Preferably, the step of constructing a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to the subset includes: Using the set of deterministic tokens corresponding to all low-frequency keywords as input, a function is constructed with a minimal perfect hash function algorithm to map each deterministic token to a unique integer value; a pointer array whose size equals the total number of low-frequency keywords is created, and the storage address of the encrypted document identifier list associated with each token is stored at the array position determined by that token's integer value.

[0015] Preferably, the step of constructing a dynamic secondary index supporting concurrent updates for the subset of high-frequency keywords, based on the deterministic token set corresponding to the subset, includes: The encrypted key-value storage structure is implemented using an encrypted key-value database that supports concurrent operations; The key of the database is a deterministic token encrypted using the pseudo-random function, and the value is the storage location of the corresponding encrypted document identifier list. The encryption ensures the confidentiality of the tokens and does not reveal their inherent order.

[0016] Preferably, if the keyword is low-frequency, querying the static primary index includes: The server uses the trapdoor submitted by the client as input and calculates a unique integer index value using the minimal perfect hash function of the static primary index; the server then accesses the position of that index value in the pointer array, and reads and returns the stored list of encrypted document identifiers.

[0017] Preferably, if the keyword is a high-frequency keyword, then the query is performed in the dynamic secondary index, including: The server uses the same pseudo-random function as when building the index to encrypt the trapdoor submitted by the client, resulting in an encrypted trapdoor; Using the encrypted trapdoor as a key, a matching search is performed in the encrypted key-value storage structure of the dynamic secondary index to locate and return the corresponding list of encrypted document identifiers.

[0018] Preferably, the step of obtaining the encrypted document identifier list by a bitwise XOR of the original document identifier list with a key stream, generated in stream encryption mode from a symmetric key and an initialization vector derived using the keyword, the master key, and parameters that change with updates, includes: From the keyword, the master key, and the parameters that change with updates, a key derivation method based on pseudo-random functions derives a symmetric key and an initialization vector specific to the keyword; a symmetric encryption algorithm in stream encryption mode, using the derived symmetric key and initialization vector, generates a key stream of the same length as the original document identifier list; the encrypted document identifier list is obtained by a bitwise XOR of the original document identifier list with the key stream.

[0019] This invention achieves a balance between retrieval efficiency and update performance by classifying keywords by frequency and employing different index structures. For infrequently changing low-frequency keywords, a static primary index is constructed using minimal perfect hashing, improving query speed and reducing storage space usage. For frequently updated high-frequency keywords, an independent secondary index is built for management, so that frequent changes do not affect the primary index structure and the system remains stable and efficient to update under continuous data writing. When encrypting the document identifier list, a key associated with the keyword and its update status is used, so that repeated updates to the same keyword produce different ciphertexts, thus hiding data access and update patterns and enhancing the security of the entire encrypted retrieval scheme.

Attached Figure Description

[0020] Figure 1 is a flowchart of the first embodiment provided by the present invention; Figure 2 is the architecture diagram provided by the present invention; Figure 3 is a flowchart of user retrieval in the present invention.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification.

[0022] Example 1. In Embodiment 1 of the present invention, an encrypted retrieval method for data privacy protection is proposed. As shown in Figure 1, it includes the following steps: S1, Analyze the document set in advance and divide the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold; Using natural language processing toolkits such as spaCy or NLTK, each document in the document set is segmented, stop words are removed, and stemming or lemmatization is performed to obtain standardized keywords. A hash map data structure is initialized, and the standardized keywords of all documents are traversed, using the keyword as the key and the number of documents in which it appears as the value, to count the document frequency of each keyword. A value, such as one percent of the total number of documents, is set as the frequency threshold. All keywords in the hash map are traversed again. If the document frequency is lower than the threshold, the keyword is classified into the low-frequency keyword subset; otherwise, it is classified into the high-frequency keyword subset.

[0023] In an optional embodiment, the pre-analysis of the document set, dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold, includes: Traversing the document set to count the total number of times each keyword appears in all documents; if the total number of times a keyword appears is not greater than the keyword frequency threshold, the keyword is classified into the low-frequency keyword subset; otherwise, the keyword is classified into the high-frequency keyword subset.

[0024] A key parameter, the keyword frequency threshold, must be determined. A lower threshold classifies more keywords as high-frequency, which enlarges the dynamic secondary index and may reduce query efficiency for high-frequency words, but accommodates keywords that are updated more often. A higher threshold enlarges the static primary index, giving very fast queries for low-frequency words, but any document update may then push some low-frequency words over the threshold, requiring re-partitioning and rebuilding of the index. In a typical system containing N documents, the threshold can be set as a function of N [expression not recoverable from the source] or to a fixed empirical value. For example, for a document set containing 1,000,000 documents, the threshold could be set to 1000.

[0025] Initialize a hash map `keyword_counts_map` to store each keyword and its occurrence count. Iterate through each document in document set D. For each document, standard natural language preprocessing is performed, including word segmentation, stop word removal, and stemming, to extract a meaningful set of keywords. Each extracted keyword w is iterated over, and its corresponding count in `keyword_counts_map` is incremented. After iterating through all documents, the frequencies of all keywords in the entire document set are obtained. `keyword_counts_map` is then iterated over; for each keyword w and its frequency c, if c is not greater than the frequency threshold, w is added to the low-frequency keyword subset; if c is greater than the frequency threshold, w is added to the high-frequency keyword subset.
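The counting and partitioning just described can be sketched as follows. This is a minimal illustration, assuming the documents have already been preprocessed into keyword lists; the function name `partition_keywords` and the sample data are hypothetical, and the comparison follows the "not greater than" rule stated above.

```python
from collections import Counter

def partition_keywords(documents, threshold):
    """Split keywords into low- and high-frequency subsets.

    documents: iterable of keyword lists, one list per preprocessed document.
    threshold: a keyword whose total count is <= threshold is low-frequency.
    """
    keyword_counts_map = Counter()
    for keywords in documents:
        # Accumulate the total number of occurrences across all documents.
        keyword_counts_map.update(keywords)

    low, high = set(), set()
    for w, c in keyword_counts_map.items():
        if c <= threshold:
            low.add(w)
        else:
            high.add(w)
    return low, high

# Illustrative data only.
docs = [["flight", "delay"], ["flight", "crew"], ["flight", "delay", "fuel"]]
low, high = partition_keywords(docs, threshold=2)
```

With this sample, "flight" occurs three times and lands in the high-frequency subset, while the remaining keywords stay low-frequency.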

[0026] S2, Define a unified, key-based pseudo-random function to generate a unique deterministic token for any keyword, and securely store the correspondence between deterministic tokens and keyword types. The HMAC-SHA256 algorithm is chosen as the pseudo-random function, with the master key MK as the HMAC key. For each predefined keyword w, a 256-bit deterministic token t is generated by calculating HMAC-SHA256(MK, w). A two-column table is created: the first column stores the deterministic token t, and the second column stores a Boolean flag, with 0 for low-frequency and 1 for high-frequency keywords. A separate encryption key is used to encrypt the entire table in AES-GCM mode, and the encrypted result is stored in a disk file or database to protect the token-type mapping while it is at rest.
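Token generation with HMAC-SHA256 can be sketched with the Python standard library. The helper name `make_token` and the all-zero key are placeholders for illustration; the AES-GCM encryption of the token-type table is omitted here because it needs a third-party library.

```python
import hashlib
import hmac

def make_token(master_key: bytes, keyword: str) -> bytes:
    """Return the 256-bit deterministic token t = HMAC-SHA256(MK, w)."""
    return hmac.new(master_key, keyword.encode("utf-8"), hashlib.sha256).digest()

# Placeholder key for illustration only; a real MK must be random and secret.
MK = bytes(32)

t1 = make_token(MK, "flight")
t2 = make_token(MK, "flight")  # same keyword and key give the same token

# Token-type table: 0 marks a low-frequency keyword, 1 a high-frequency one.
token_types = {make_token(MK, "flight"): 1, make_token(MK, "fuel"): 0}
```

Because the token is deterministic, the client can regenerate it at query time and use it directly as the trapdoor.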

[0027] S3, For the low-frequency keyword subset, construct a static primary index based on the set of deterministic tokens corresponding to the subset; Collect all deterministic tokens corresponding to low-frequency keywords to form a token set. Using a minimal perfect hash library that implements the CHD or BBHash algorithm, such as the cmph library, take this token set as input to generate a minimal perfect hash function G. This function G maps any token belonging to the set to a unique integer from 0 to M-1, where M is the total number of low-frequency keywords. Create a pointer array A of size M. For each low-frequency keyword token t, calculate the hash value h = G(t), and store the disk file offset or database record primary key of the encrypted document identifier list corresponding to t as a pointer in the pointer array A at index h.

[0028] In an optional embodiment, constructing a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to the subset includes: Using the set of deterministic tokens corresponding to all low-frequency keywords as input, a function is constructed with a minimal perfect hash function algorithm to map each deterministic token to a unique integer value; a pointer array whose size equals the total number of low-frequency keywords is created, and the storage address of the encrypted document identifier list associated with each token is stored at the array position determined by that token's integer value.

[0029] Obtain the low-frequency keyword subset and let its size be M. For each keyword w in the subset, a 256-bit deterministic token t is generated using the predefined pseudo-random function and the master key MK. All generated tokens form a token set. Choose a minimal perfect hash function (MPHF) algorithm, such as the compress, hash, and displace (CHD) algorithm, and construct the hash function G with the token set as input. G maps any token t in the set uniquely and without collisions to an integer index value h = G(t), where h lies in the range [0, M-1]. G itself is represented by a very compact data structure, typically only a few bits per token. For example, for M = 1,000,000 keywords, the structure requires only about 300 KB of storage space.

[0030] After generating the MPHF function G, create a pointer array A of size M in memory or persistent storage. Each element of this array stores a 64-bit memory address or file offset. Iterate through each low-frequency keyword w and its token t. Obtain the list of original document identifiers DB(w) associated with w, and encrypt the list to obtain EncDB(w). Store EncDB(w) at a location on disk or in a database, and obtain its storage address ptr. Calculate the hash value of the token, h = G(t), and store the storage address at the corresponding position in the pointer array, i.e., A[h] = ptr. After the traversal is complete, the static primary index is constructed, consisting of the MPHF function structure and A.
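The construction of G and the pointer array A can be illustrated with a toy hash-and-displace scheme. This is a simplified stand-in for a production MPHF library such as cmph, not the CHD algorithm itself: the helper names, the SHA-256 based hashing, the bucket count, and the storage offsets are all assumptions, and the tokens are assumed to be distinct.

```python
import hashlib

def _h(seed: int, token: bytes, modulus: int) -> int:
    """Seeded hash of a token, reduced to the range [0, modulus)."""
    digest = hashlib.sha256(seed.to_bytes(8, "big") + token).digest()
    return int.from_bytes(digest[:8], "big") % modulus

def build_mphf(tokens):
    """Find one displacement seed per bucket so that lookup() is a
    collision-free map from the token set onto 0..M-1."""
    m = len(tokens)
    n_buckets = max(1, m // 2)
    buckets = [[] for _ in range(n_buckets)]
    for t in tokens:
        buckets[_h(0, t, n_buckets)].append(t)

    seeds = [0] * n_buckets
    used = [False] * m
    # Place the largest buckets first, while most slots are still free.
    for b in sorted(range(n_buckets), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_h(seed, t, m) for t in buckets[b]]
            if len(set(slots)) == len(slots) and not any(used[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            used[s] = True
    return seeds

def lookup(seeds, token: bytes, m: int) -> int:
    """Evaluate h = G(t) using the stored displacement seeds."""
    return _h(seeds[_h(0, token, len(seeds))], token, m)

# Illustrative tokens; in the scheme these would be HMAC-SHA256 outputs.
tokens = [hashlib.sha256(f"kw{i}".encode()).digest() for i in range(50)]
M = len(tokens)
seeds = build_mphf(tokens)

# Pointer array A: A[G(t)] holds the (hypothetical) offset of EncDB(w).
A = [None] * M
for i, t in enumerate(tokens):
    A[lookup(seeds, t, M)] = i * 4096
```

A low-frequency query then reduces to `A[lookup(seeds, trapdoor, M)]`, one hash evaluation followed by one array access.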

[0031] S4, For the high-frequency keyword subset, construct a dynamic secondary index that supports concurrent updates based on the deterministic token set corresponding to the subset; Choose an embedded key-value database based on a B+ tree or LSM tree, such as the LSM-tree-based RocksDB or LevelDB, as the underlying storage for the dynamic secondary index. The database keys are the deterministic tokens of the high-frequency keywords, and the values are the corresponding lists of encrypted document identifiers. To support concurrent updates, write and modification operations on the database use the database's built-in transaction mechanism or external read-write locks, such as pthread_rwlock, to ensure data consistency and atomicity of operations. The entire database file is encrypted at the storage level using disk-level encryption such as LUKS, or the stored values (i.e., the encrypted document identifier lists) are encrypted at the application layer, thus achieving encrypted key-value storage.

[0032] In an optional embodiment, constructing a dynamic secondary index supporting concurrent updates for the subset of high-frequency keywords based on the set of deterministic tokens corresponding to the subset includes: The encrypted key-value storage structure is implemented using an encrypted key-value database that supports concurrent operations; The key of the database is a deterministic token encrypted using the pseudo-random function, and the value is the storage location of the corresponding encrypted document identifier list. The encryption ensures the confidentiality of the tokens and does not reveal their inherent order.

[0033] A key-value database supporting high-concurrency read/write and atomic operations is chosen as the underlying storage engine. To enhance security and prevent the keys in the database from exposing any information about the keyword token, a second layer of encryption is used. A dedicated 256-bit secret key is defined for the secondary index. For each keyword w in the high-frequency keyword subset, first generate its deterministic token t. Then use the secondary index key and another pseudo-random function PRF to transform the token t into the database key DB_Key, i.e., DB_Key is the output of PRF applied to t under the secondary index key. DB_Key is used as the key in the key-value database; because it is indistinguishable from random, it hides the original token.

[0034] Iterate through each high-frequency keyword w and its token t, and calculate the database key DB_Key. At the same time, obtain the list of original document identifiers DB(w) associated with w and encrypt it to obtain EncDB(w). Write EncDB(w) to the storage system to obtain a 64-bit storage pointer ptr. Insert the key-value pair (DB_Key, ptr) into the key-value database. Because modern key-value databases natively support concurrent writes under multi-threading and guarantee the atomicity of operations, multiple document update operations can be safely processed in parallel, maintaining the dynamic secondary index.
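The secondary index can be sketched with an in-memory dictionary standing in for the key-value database. This is a sketch under stated assumptions: the class `SecondaryIndex`, the choice of HMAC-SHA256 as the second PRF, the single coarse lock, and the pointer values are illustrative; a real store such as RocksDB provides its own finer-grained concurrency control.

```python
import hashlib
import hmac
import threading

class SecondaryIndex:
    """In-memory stand-in for the encrypted key-value database."""

    def __init__(self, index_key: bytes):
        self._index_key = index_key      # dedicated secondary index key
        self._store = {}                 # DB_Key -> storage pointer
        self._lock = threading.Lock()    # simplest way to make writes atomic

    def db_key(self, token: bytes) -> bytes:
        # DB_Key = PRF(index_key, t), with HMAC-SHA256 assumed as the PRF.
        return hmac.new(self._index_key, token, hashlib.sha256).digest()

    def put(self, token: bytes, ptr: int) -> None:
        with self._lock:
            self._store[self.db_key(token)] = ptr

    def get(self, token: bytes):
        # Returns None when the key does not exist.
        with self._lock:
            return self._store.get(self.db_key(token))

# Placeholder key and tokens for illustration only.
index = SecondaryIndex(index_key=bytes(32))
tokens = [hashlib.sha256(f"hot{i}".encode()).digest() for i in range(8)]

# Insert all entries from parallel writer threads.
threads = [
    threading.Thread(target=index.put, args=(t, i * 4096))
    for i, t in enumerate(tokens)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

The same `db_key` transformation is applied to a trapdoor at query time, so a lookup finds exactly the entry written during index construction.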

[0035] S5, When performing keyword retrieval, the user terminal uses the pseudo-random function to generate a trapdoor, i.e., a deterministic token, for the keyword to be retrieved, and sends the trapdoor to the server; As shown in the architecture diagram of Figure 2, the user inputs the keyword w to be searched on the client side; the client software calls the same HMAC-SHA256 algorithm as the server side, applies it to the keyword w under the user's master key MK, and generates a deterministic token t, which is the trapdoor. The client sends the trapdoor t to the server through a secure channel established using the TLS 1.3 protocol.

[0036] S6, After receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored correspondence between deterministic tokens and keyword types. If it is a low-frequency keyword, the server queries the static primary index; if it is a high-frequency keyword, the server queries the dynamic secondary index; the server then returns a list of encrypted document identifiers. After receiving the trapdoor t, the server uses the stored encryption key to decrypt the previously stored token-type table; it then searches for the trapdoor t in the table and obtains the corresponding type flag. If the flag is 0, indicating a low-frequency keyword, the server calls the constructed minimal perfect hash function G to calculate the hash value h = G(t), accesses the position at index h in the pointer array A, obtains a pointer to the encrypted document identifier list, and reads the corresponding encrypted data from the disk or database based on this pointer. If the flag is 1, indicating a high-frequency keyword, then as shown in Figure 3 the server uses the secondary index key to calculate the database key DB_Key for the trapdoor t, performs a Get operation in RocksDB to obtain the corresponding storage pointer, and reads the list of encrypted document identifiers based on the pointer. The server returns the retrieved list of encrypted document identifiers to the client via a TLS secure channel.
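The server-side dispatch in S6 can be sketched as a small routing function. The names `route_query`, `query_primary`, and `query_secondary` are hypothetical, the two indexes are replaced by plain dictionaries, and returning `None` for an unknown trapdoor is an assumption, since the text does not specify that case.

```python
def route_query(trapdoor, token_types, query_primary, query_secondary):
    """Dispatch a trapdoor to the index that matches its stored type flag."""
    flag = token_types.get(trapdoor)
    if flag is None:
        return None  # assumption: an unknown trapdoor yields no result
    if flag == 0:
        return query_primary(trapdoor)    # low-frequency: static primary index
    return query_secondary(trapdoor)      # high-frequency: dynamic secondary index

# Illustrative stand-ins for the decrypted token-type table and both indexes.
token_types = {b"t-low": 0, b"t-high": 1}
primary = {b"t-low": b"EncDB(low)"}
secondary = {b"t-high": b"EncDB(high)"}

result_low = route_query(b"t-low", token_types, primary.get, secondary.get)
result_high = route_query(b"t-high", token_types, primary.get, secondary.get)
```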

[0037] In an optional embodiment, if the keyword is low-frequency, querying the static primary index includes: The server uses the trapdoor submitted by the client as input and calculates a unique integer index value using the minimal perfect hash function of the static primary index; the server then accesses the position of that index value in the pointer array, and reads and returns the stored list of encrypted document identifiers.

[0038] When the server receives a trapdoor t from a user and determines, based on pre-stored partitioning relationships, that the trapdoor t corresponds to a low-frequency keyword, the query process is as follows. The server holds the MPHF function G and pointer array A generated during index building. The server uses the received trapdoor t as input to the G function and performs the calculation: h = G(t). Typically, only a few memory accesses and bitwise operations are required to obtain a unique integer index h in the range [0, M-1].

[0039] The server uses the calculated h as the array index to access A, i.e., executes the operation ptr = A[h]. This is a random memory access with O(1) time complexity that yields a 64-bit pointer ptr. Based on the storage address pointed to by ptr, the server reads the complete, encrypted list of document identifiers EncDB(w) from the disk file or block storage. The time taken for this step is determined by I/O latency. The server returns the read binary data block EncDB(w) as the query result to the client through a secure channel; the entire query process requires no comparison or iteration operations.

[0040] In an optional embodiment, if the keyword is a high-frequency keyword, then the query is performed in the dynamic secondary index, including: The server uses the same pseudo-random function as when building the index to encrypt the trapdoor submitted by the client, resulting in an encrypted trapdoor; Using the encrypted trapdoor as a key, a matching search is performed in the encrypted key-value storage structure of the dynamic secondary index to locate and return the corresponding list of encrypted document identifiers.

[0041] When the server receives the trapdoor t and determines that t corresponds to a high-frequency keyword, the query is performed on the dynamic secondary index. Using the pseudo-random function PRF with the same key that was employed when building the index, the server transforms the received trapdoor t into the key used for the database lookup. This ensures that the key generated during the query matches the key stored in the database.

[0042] The server then sends a GET request for this key to the underlying key-value database, and the database engine uses its internal index structure to look up the key. If the key exists, the database returns the associated value, a 64-bit pointer `ptr` to the stored list of encrypted document identifiers; if the key does not exist, a null value is returned. If the query is successful, the server uses `ptr` to read the encrypted data `EncDB(w)` from the storage system and returns it to the client. The performance of the entire process depends mainly on the efficiency of the key-value database's `get` operation, which typically has logarithmic time complexity and can support large-scale, high-frequency keyword queries.
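The high-frequency lookup path can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not fix: HMAC-SHA256 is taken as the pseudo-random function, an in-memory `dict` stands in for the encrypted key-value database (so it shows neither concurrency control nor the logarithmic lookup of a real database engine), and the class name `SecondaryIndex`, the index key, the token, and the pointer value are all hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def prf(key: bytes, data: bytes) -> bytes:
    """Assumed PRF: HMAC-SHA256 under the server-side index key."""
    return hmac.new(key, data, hashlib.sha256).digest()


class SecondaryIndex:
    """Toy dynamic secondary index: PRF(token) -> pointer to EncDB(w)."""

    def __init__(self, index_key: bytes) -> None:
        self._key = index_key
        self._store: dict[bytes, int] = {}  # stand-in for the key-value database

    def put(self, token: bytes, ptr: int) -> None:
        # Index build/update: the stored key is the PRF output, never the raw token.
        self._store[prf(self._key, token)] = ptr

    def get(self, trapdoor: bytes) -> Optional[int]:
        # Query: re-apply the same PRF so the lookup key matches the stored key.
        return self._store.get(prf(self._key, trapdoor))


# Hypothetical index key, token, and storage address, for illustration only.
index = SecondaryIndex(index_key=b"k" * 32)
index.put(b"token-hot", 0xABCD)
found = index.get(b"token-hot")        # pointer for a known trapdoor
missing = index.get(b"token-unknown")  # None models the null result
```

Because both the build path and the query path derive the lookup key with the same function and the same key, a trapdoor for an indexed keyword always resolves to the stored entry.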

[0043] S7, wherein the encrypted document identifier list is obtained by performing a bitwise XOR operation between the original document identifier list and a key stream, the key stream being generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and a parameter that changes with updates.

[0044] For each keyword w, an associated update counter is maintained; it is initially set to 0 and incremented by one each time the document identifier list is updated. Using the HMAC-based key derivation function HKDF-SHA256, with the master key MK as input key material and the keyword w combined with the current counter value as input, a 256-bit symmetric key k and a 128-bit initialization vector IV are derived. The original document identifier list DB(w) is serialized into a byte stream. The AES algorithm is configured in CTR (counter) stream encryption mode, and the encryptor is initialized using the derived key k and initialization vector IV. The encryptor generates a key stream of the same length as DB(w), and a bit-by-bit XOR operation between DB(w) and this key stream yields the encrypted document identifier list EncDB(w), which is the encrypted data stored in the index.

[0045] In an optional embodiment, obtaining the encrypted document identifier list by performing a bitwise XOR operation between the original document identifier list and a key stream generated in stream encryption mode from a symmetric key and an initialization vector derived using the keyword, the master key, and a parameter that changes with updates includes: for the keyword, the master key, and the parameter that changes with updates, deriving a symmetric key and an initialization vector specific to the keyword using a key derivation method based on pseudo-random functions; using a symmetric encryption algorithm in stream encryption mode, with the derived symmetric key and initialization vector, to generate a key stream of the same length as the original document identifier list; and performing a bitwise XOR operation between the original document identifier list and the key stream to obtain the encrypted document identifier list.

[0046] This provides independent, updatable encryption protection for the document list of each keyword. The parameters are: the keyword w, a 256-bit master key MK, and a 64-bit state counter associated with w that changes with updates. The counter has an initial value of 0 and must be incremented by one each time the document list is updated. Key derivation is then performed: an HMAC-based key derivation function is used to generate a one-time symmetric key and initialization vector. The derivation consists of two steps. First, the root key, i.e., the token, is derived from the master key MK and the keyword w. Second, the token is used as the input key material IKM and the current state counter value is supplied as the information 'info' to the HKDF-Expand stage, deriving a sufficiently long output key material 'OKM'. The OKM is split, with the first 256 bits used as the AES encryption key k and the last 128 bits used as the initialization vector IV.
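The two-step derivation can be sketched with Python's standard library, following the HKDF Extract and Expand steps of RFC 5869 with SHA-256. Several details are assumptions rather than statements of the text: the token is computed as HMAC-SHA256(MK, w), HKDF-Extract is run with an all-zero salt, the counter is encoded as 8 big-endian bytes for `info`, and exactly 48 bytes of OKM are requested. The function names and example values are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC-Hash(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869): T(i) = HMAC-Hash(PRK, T(i-1) | info | i)."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_key_iv(mk: bytes, keyword: str, state_counter: int) -> tuple[bytes, bytes]:
    """Derive (k, IV) for one keyword at one update state."""
    # Step 1 (assumed form): root key / token from MK and w.
    token = hmac.new(mk, keyword.encode("utf-8"), hashlib.sha256).digest()
    # Step 2: token as IKM, 64-bit counter as info, 384 bits of OKM.
    prk = hkdf_extract(bytes(HASH_LEN), token)  # all-zero salt is an assumption
    okm = hkdf_expand(prk, state_counter.to_bytes(8, "big"), 48)
    return okm[:32], okm[32:]  # first 256 bits -> k, last 128 bits -> IV


# Hypothetical master key and keyword, for illustration only.
MK = bytes(range(32))
k0, iv0 = derive_key_iv(MK, "invoice", 0)
k1, iv1 = derive_key_iv(MK, "invoice", 1)  # after one update of the list
```

Since the counter enters the derivation as `info`, each update of a keyword's document list yields a fresh (k, IV) pair while the master key stays unchanged.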

[0047] After k and IV are obtained, the encryption phase begins. The original document identifier list DB(w) associated with the keyword w is used as plaintext. The AES-256 algorithm is selected and operated in counter (CTR) mode. The AES-256-CTR cipher is initialized with k and IV and generates a pseudo-random keystream S of the same length as DB(w). The plaintext DB(w) and the keystream S are then XORed bitwise to obtain the encrypted document identifier list EncDB(w). Because the state counter changes with each update, the derived key k and initialization vector IV also change, ensuring that even minor changes to the document list result in completely different ciphertexts. The server needs to securely store and maintain the corresponding state counter for each keyword.
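The keystream-and-XOR step can be illustrated as follows. Python's standard library has no AES, so this sketch swaps in an HMAC-SHA256 counter construction as the keystream generator; it is a stand-in for AES-256-CTR, not an implementation of it, and a real deployment would use a vetted AES-CTR implementation. The serialization of DB(w) as 8-byte big-endian integers, the function names, and the fixed example key and IV are likewise assumptions.

```python
import hashlib
import hmac


def keystream(k: bytes, iv: bytes, length: int) -> bytes:
    """Stand-in keystream S: HMAC-SHA256(k, IV | block index), truncated to length."""
    out = bytearray()
    block_index = 0
    while len(out) < length:
        out += hmac.new(k, iv + block_index.to_bytes(8, "big"), hashlib.sha256).digest()
        block_index += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def serialize_ids(doc_ids: list[int]) -> bytes:
    """Assumed serialization of DB(w): each identifier as 8 big-endian bytes."""
    return b"".join(i.to_bytes(8, "big") for i in doc_ids)


def deserialize_ids(data: bytes) -> list[int]:
    return [int.from_bytes(data[i:i + 8], "big") for i in range(0, len(data), 8)]


def encrypt_list(doc_ids: list[int], k: bytes, iv: bytes) -> bytes:
    """EncDB(w) = DB(w) XOR S, with S the same length as DB(w)."""
    plain = serialize_ids(doc_ids)
    return xor_bytes(plain, keystream(k, iv, len(plain)))


def decrypt_list(enc: bytes, k: bytes, iv: bytes) -> list[int]:
    """Decryption regenerates S from (k, IV) and applies the same XOR."""
    return deserialize_ids(xor_bytes(enc, keystream(k, iv, len(enc))))


# Fixed example key and IV; in the scheme these come from the key derivation step.
k, iv = bytes(range(32)), bytes(range(16))
db_w = [17, 42, 1001]
enc_db_w = encrypt_list(db_w, k, iv)
recovered = decrypt_list(enc_db_w, k, iv)
```

The ciphertext has exactly the length of the serialized list, and a holder of k and IV recovers DB(w) by repeating the XOR.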

[0048] Embodiment 2. Embodiment 2 of the present invention provides an encrypted retrieval system for data privacy protection, comprising the following modules: a first generation module, used to analyze the document set in advance and divide the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold, to define a unified, key-based pseudo-random function that generates a unique deterministic token for any keyword, and to securely store the relationship between the deterministic tokens and the keyword types; a construction module, used to construct a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to that subset, and to construct a dynamic secondary index that supports concurrent updates for the high-frequency keyword subset based on the deterministic token set corresponding to that subset; and a retrieval module, by which the user end generates a trapdoor (i.e., a deterministic token) for the keyword to be retrieved using the pseudo-random function and sends the trapdoor to the server. After receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored relationship between deterministic tokens and keyword types. If it is a low-frequency keyword, the server searches in the static primary index; if it is a high-frequency keyword, the server searches in the dynamic secondary index; in either case it returns a list of encrypted document identifiers. The list of encrypted document identifiers is obtained by performing a bitwise XOR operation between the original document identifier list and a key stream generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and a parameter that changes with updates.

[0049] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.

Claims

1. A data privacy-preserving encrypted retrieval method, characterized in that it includes the following steps: analyzing the document set in advance and dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold; defining a unified, key-based pseudo-random function to generate a unique deterministic token for any keyword, and securely storing the relationship between the deterministic tokens and the keyword types; for the low-frequency keyword subset, constructing a static primary index based on the deterministic token set corresponding to the subset; for the high-frequency keyword subset, constructing a dynamic secondary index supporting concurrent updates based on the deterministic token set corresponding to the subset; when performing keyword retrieval, the user terminal uses the pseudo-random function to generate a trapdoor, i.e., a deterministic token, for the keyword to be retrieved and sends the trapdoor to the server; after receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored relationship between deterministic tokens and keyword types; if it is a low-frequency keyword, the server searches in the static primary index; if it is a high-frequency keyword, the server searches in the dynamic secondary index; and the server returns the retrieved encrypted document identifier list; wherein the encrypted document identifier list is obtained by performing a bitwise XOR operation between the original document identifier list and a key stream generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and a parameter that changes with updates.

2. The encrypted retrieval method according to claim 1, characterized in that analyzing the document set in advance and dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold includes: traversing the document set to count the total number of times each keyword appears in all documents; if the total number of times a keyword appears is not greater than the keyword frequency threshold, classifying the keyword into the low-frequency keyword subset; otherwise, classifying the keyword into the high-frequency keyword subset.

3. The encrypted retrieval method according to claim 1, characterized in that constructing a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to the subset includes: using the set of deterministic tokens corresponding to all low-frequency keywords as input, constructing a function with a minimum perfect hash function algorithm that maps each deterministic token to a unique integer value; and creating a pointer array whose size equals the total number of low-frequency keywords, and storing, at the position in the array determined by the integer value, the storage address of the list of encrypted document identifiers associated with the token.

4. The encrypted retrieval method according to claim 1, characterized in that constructing a dynamic secondary index supporting concurrent updates for the high-frequency keyword subset based on the deterministic token set corresponding to the subset includes: implementing the encrypted key-value storage structure using an encrypted key-value database that supports concurrent operations; wherein the key of the database is a deterministic token encrypted using the pseudo-random function, and the value is the storage location of the corresponding encrypted document identifier list; and the encryption ensures the confidentiality of the tokens and does not reveal their inherent order.

5. The encrypted retrieval method according to claim 1, characterized in that, if the keyword is a low-frequency keyword, searching in the static primary index includes: the server takes the trapdoor submitted by the client as input and calculates a unique integer index value using the minimum perfect hash function of the static primary index; and the server accesses the position given by that index value in the pointer array, and reads and returns the stored list of encrypted document identifiers.

6. The encrypted retrieval method according to claim 1 or 3, characterized in that, if the keyword is a high-frequency keyword, searching in the dynamic secondary index includes: the server applies the same pseudo-random function used when building the index to the trapdoor submitted by the client, producing an encrypted trapdoor; and, using the encrypted trapdoor as the key, a matching search is performed in the encrypted key-value storage structure of the dynamic secondary index to locate and return the corresponding list of encrypted document identifiers.

7. The encrypted retrieval method according to claim 1, characterized in that obtaining the encrypted document identifier list by performing a bitwise XOR operation between the original document identifier list and a key stream generated in stream encryption mode from a symmetric key and an initialization vector derived using the keyword, the master key, and a parameter that changes with updates includes: for the keyword, the master key, and the parameter that changes with updates, deriving a symmetric key and an initialization vector specific to the keyword using a key derivation method based on pseudo-random functions; using a symmetric encryption algorithm in stream encryption mode, with the derived symmetric key and initialization vector, to generate a key stream of the same length as the original document identifier list; and performing a bitwise XOR operation between the original document identifier list and the key stream to obtain the encrypted document identifier list.

8. A data privacy-preserving encrypted retrieval system, characterized in that it includes the following modules: a first generation module, used to analyze the document set in advance and divide the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold, to define a unified, key-based pseudo-random function that generates a unique deterministic token for any keyword, and to securely store the relationship between the deterministic tokens and the keyword types; a construction module, used to construct a static primary index for the low-frequency keyword subset based on the deterministic token set corresponding to that subset, and to construct a dynamic secondary index supporting concurrent updates for the high-frequency keyword subset based on the deterministic token set corresponding to that subset; and a retrieval module, by which the user end generates a trapdoor (i.e., a deterministic token) for the keyword to be retrieved using the pseudo-random function and sends the trapdoor to the server; after receiving the trapdoor, the server determines the keyword type corresponding to the trapdoor based on the pre-stored relationship between deterministic tokens and keyword types; if it is a low-frequency keyword, the server searches in the static primary index; if it is a high-frequency keyword, the server searches in the dynamic secondary index; and the server returns a list of encrypted document identifiers; wherein the list of encrypted document identifiers is obtained by performing a bitwise XOR operation between the original document identifier list and a key stream generated in stream encryption mode from a symmetric key and an initialization vector that are derived using the keyword, the master key, and a parameter that changes with updates.

9. The encrypted retrieval system according to claim 8, characterized in that analyzing the document set in advance and dividing the keyword set into a low-frequency keyword subset and a high-frequency keyword subset according to a preset frequency threshold includes: traversing the document set to count the total number of times each keyword appears in all documents; if the total number of times a keyword appears is not greater than the keyword frequency threshold, classifying the keyword into the low-frequency keyword subset; otherwise, classifying the keyword into the high-frequency keyword subset.

10. A computer-readable storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the method as described in any one of claims 1-7.