Ciphertext data search method and system based on overlapping clustering

By preprocessing and indexing encrypted datasets using overlapping clustering and blockchain technology, the efficiency and accuracy issues of encrypted data search in existing technologies are resolved. This enables efficient and secure multi-keyword search and dynamic index updates, ensuring the credibility of data sharing and user privacy.

CN118779496B · Active · Publication Date: 2026-06-19 · SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN
Filing Date
2024-07-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for outsourced data storage and sharing suffer from insufficient accuracy in single-keyword searches, low efficiency in large-scale encrypted searches, difficulty in dynamically updating search indexes, and issues with the relevance and reliability of search results, making it difficult for data requesters to quickly and accurately find the datasets they need.

Method used

Overlapping clustering technology is used to preprocess the encrypted dataset. Combined with blockchain and attribute encryption, a dynamically updated search index is generated. The dataset is distributed into multiple clusters using the overlapping K-means clustering algorithm. The keyword relevance is calculated and ranked using the TF-IDF algorithm. The credibility of the search results is verified by blockchain.

Benefits of technology

It improves the efficiency and accuracy of encrypted data search, ensures data security and trustworthiness, supports multi-keyword search, implements fine-grained access control and dynamic index updates, and protects user privacy.


Abstract

This invention discloses a method and system for searching encrypted data based on overlapping clustering, belonging to the field of data security and privacy protection technology. Before encrypted search, this invention performs clustering preprocessing based on the dataset's summary, grouping similar datasets together and generating a multi-keyword search index to narrow the search scope and improve search efficiency. By combining technologies such as blockchain, attribute encryption, and policy hiding, it achieves efficient and reliable multi-keyword encrypted search and fine-grained access authorization, supports complete hiding of access policies and dynamic updates of the search index, and optimally ranks search results based on relevance, thus improving the efficiency and accuracy of encrypted search. This addresses the problem that the accuracy and efficiency of existing data resource searches need improvement.

Description

Technical Field

[0001] This invention relates to the field of data security and privacy protection technology, and in particular to a method and system for searching encrypted data based on overlapping clustering.

Background Technology

[0002] The statements in this section merely refer to the background art related to this invention and do not necessarily constitute prior art.

[0003] With the rapid development of new-generation information technology, the amount of data generated by people has surged, making local storage space unable to meet the ever-increasing demand for data storage. More and more organizations and individuals are choosing to outsource data storage to third-party institutions such as cloud computing centers to effectively reduce storage costs for data owners and promote data sharing.

[0004] However, outsourced data storage and sharing face security risks such as data breaches. To ensure the security of outsourced data, data owners typically adopt a strategy of encrypting the data locally before outsourcing storage to prevent the leakage of sensitive information. However, encrypted storage brings new challenges to data sharing, as data requesters cannot directly find the desired dataset through keyword searches. To address this, searchable encryption technology has emerged, allowing data requesters to perform encrypted searches using specific keywords to find the target encrypted dataset, thereby achieving secure data sharing.

[0005] To prevent malicious users from obtaining raw data through searchable encryption and thus leaking sensitive information, researchers have proposed attribute-based searchable encryption methods. These methods enable ciphertext search while only allowing users who meet certain access policies to access the raw data. However, because these access policies are either explicitly stated in plaintext or partially hidden, the attributes of data requesters still face the security risk of information leakage. Furthermore, current searchable encryption methods also suffer from the following problems:

[0006] (1) The accuracy of single keyword search is insufficient, which can easily lead to poor matching between search results and actual needs.

[0007] (2) The efficiency of large-scale encrypted search decreases sharply. As the dataset continues to increase and data resources become richer, data requesters may find it difficult to accurately find all datasets closely related to the keywords even if they spend a lot of time. This not only affects search efficiency but also the accuracy of search results.

[0008] (3) The problem of dynamic updating of search index. As time goes by, the dataset will also change, and its initial search index may not be able to be updated in real time or can only be updated by rebuilding. This will affect search efficiency and may even make it difficult to obtain search results quickly.

[0009] (4) The relevance of search results. Due to the large amount of encrypted data resources, if the search results are not sorted from high to low based on relevance, data requesters will find it difficult to quickly find encrypted datasets that are highly relevant to their search keywords.

[0010] (5) The credibility of search results. Because the third-party servers that perform the search may not be entirely honest, it is difficult to guarantee the credibility of the search results.

[0011] Currently, encrypting and outsourcing data storage before publishing it on the blockchain for data sharing has become a common method for secure data sharing. This approach combines the advantages of data encryption and blockchain technologies, ensuring the security and trustworthiness of shared data. Data requesters can search for the desired dataset on the blockchain based on the dataset summary published on the chain. However, because there may be too many encrypted datasets that meet the search criteria, it is difficult to accurately find the desired dataset, and the accuracy and efficiency of data resource searching need to be improved.

Summary of the Invention

[0012] To address the shortcomings of existing technologies, this invention provides a method, system, electronic device, computer-readable storage medium, and computer program product for searching encrypted data based on overlapping clustering. It combines overlapping clustering, blockchain, and attribute-based encryption technologies to achieve efficient encrypted data search.

[0013] In a first aspect, the present invention provides a method for searching encrypted data based on overlapping clustering;

[0014] A method for searching encrypted data based on overlapping clustering, comprising:

[0015] Based on a preset algorithm, search parameters are generated according to the set of search keywords and sent to the cloud server, and a data usage request is sent to the blockchain.

[0016] Obtain the set of ciphertext hash addresses corresponding to the target ciphertext dataset and download the target ciphertext dataset. Use the preset private key to decrypt and obtain the corresponding plaintext data.

[0017] The encrypted hash address set is obtained by the cloud server after searching the encrypted dataset according to search parameters and a preset search index and sorting it by relevance; the search index is generated and dynamically updated by the cloud server using a clustering smart contract to perform overlapping clustering preprocessing based on the data provided by the data owner; the relevance sorting is performed by the blockchain after the attribute verification is passed in response to the data usage request.

[0018] In some implementations, the encrypted dataset is generated by the data owner by encrypting the datasets it owns based on a generated access policy vector, using attribute-based searchable encryption techniques.

[0019] The encrypted dataset is stored on the cloud server, and the access policy vector set, hash value set, cloud storage hash address of the encrypted dataset, and data digest corresponding to the encrypted dataset are uploaded to the blockchain by the data owner.

[0020] In some implementations, the search index is generated by the cloud server assigning clustering preprocessing tasks to computing nodes under the blockchain based on data provided by the data owner, and the computing nodes perform clustering preprocessing operations on the data digests published on the blockchain using overlapping K-means clustering.

[0021] In some implementations, the clustering preprocessing operation using overlapping K-means clustering specifically includes:

[0022] Multiple datasets are randomly selected as initial cluster centers. The correlation between each dataset and the cluster center is calculated and normalized to obtain the normalized correlation.

[0023] Based on the normalized correlation degree, the dataset is assigned to multiple cluster centers; the location and covariance matrix of the cluster centers are updated according to the assignment results, and the effective number of members of each cluster is calculated until the convergence condition is met.

[0024] In some implementations, the relevance ranking is determined by the cloud server based on the relevance score of the search keywords. The relevance score is generated by the cloud server based on the term frequency-inverse document frequency value fed back by the data owner and the search parameters sent by the data requester.

[0025] In some implementations, the target encrypted dataset is an encrypted dataset that has been verified through consensus.

[0026] The consensus verification is performed by the cloud server using a practical Byzantine fault-tolerant consensus algorithm to verify the search results and ranking results, and a trust-based reward and punishment mechanism is used to reward or punish the nodes that perform consensus verification.

[0027] Secondly, the present invention provides an encrypted data search system based on overlapping clustering;

[0028] An encrypted data search system based on overlapping clustering, including a client and a cloud server;

[0029] The client generates search parameters based on a set of search keywords using a preset algorithm and sends them to the cloud server, and sends a data usage request to the blockchain; the cloud server searches for the encrypted dataset based on the search parameters and a preset search index, sorts it by relevance, and obtains the encrypted hash address set corresponding to the target encrypted dataset; the client obtains the encrypted hash address set corresponding to the target encrypted dataset, downloads the target encrypted dataset, and decrypts it using a preset private key to obtain the corresponding plaintext data;

[0030] The search index is generated and dynamically updated by the cloud server using overlapping clustering preprocessing performed by a clustering smart contract based on the data provided by the data owner; the relevance calculation is performed by the blockchain after the attribute verification is passed in response to the data usage request.

[0031] Thirdly, the present invention provides an electronic device;

[0032] An electronic device includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the above-described encrypted data search method based on overlapping clustering.

[0033] Fourthly, the present invention provides a computer-readable storage medium;

[0034] A computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the steps of the encrypted data search method based on overlapping clustering as described above.

[0035] Fifthly, the present invention provides a computer program product;

[0036] A computer program product includes a computer program / instructions that, when executed by a processor, implement the steps of the above-described ciphertext data search method based on overlapping clustering.

[0037] Compared with the prior art, the beneficial effects of the present invention are:

[0038] 1. The technical solution provided by this invention uses an overlapping k-means clustering algorithm to preprocess the dataset and generate a search index, supporting the same dataset belonging to multiple clusters, ensuring the detail and accuracy of the clustering results. When performing encrypted search operations, the corresponding search index can be quickly located based on the search keywords, narrowing the search scope and significantly improving the efficiency and accuracy of encrypted search.

[0039] 2. The technical solution provided by this invention sorts search results according to their relevance to search keywords, enabling data requesters to quickly determine the datasets they need without spending time understanding the relevance of each dataset.

[0040] 3. The technical solution provided by this invention combines technologies such as attribute encryption, inner product functions, and blockchain to achieve fine-grained access control and complete hiding of access policies, ensuring the security of shared data, solving the problem of plaintext attribute leakage, and protecting the privacy of user identity.

[0041] 4. The technical solution provided by this invention supports multi-keyword search and uses smart contracts to monitor changes in encrypted datasets to automatically update the search index, solving the problem of search index updates and making it more flexible in application scenarios with large data scales and frequent updates.

[0042] 5. The technical solution provided by this invention uses blockchain instead of a trusted third party, and uses smart contracts to verify user attributes during the access authorization stage, avoiding risks such as third-party manipulation and improving the security and reliability of access authorization; it uses blockchain to verify and reach consensus on encrypted search results and ranking results, and to evaluate the trustworthiness of cloud servers, thereby realizing dynamic adjustment of cloud server revenue, incentivizing cloud servers to provide high-quality services, and improving the credibility of computing.

Attached Figure Description

[0043] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0044] Figure 1 is a flowchart of the encrypted data search method based on overlapping clustering provided in an embodiment of the present invention;

[0045] Figure 2 is a schematic diagram of the data interaction in the encrypted data search method based on overlapping clustering provided in an embodiment of the present invention;

[0046] Figure 3 is a schematic diagram of the system architecture of the encrypted data search system based on overlapping clustering provided in an embodiment of the present invention.

Detailed Implementation

[0047] It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0048] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments of the present invention. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. Furthermore, it should be understood that the terms “comprising” and “having”, and any variations thereof, are intended to cover non-exclusive inclusion, for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0049] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0050] Example 1

[0051] The large number of encrypted datasets stored on the blockchain, along with the numerous encrypted datasets that meet the search criteria, necessitates improvements in the efficiency and accuracy of existing encrypted data searches. Therefore, this invention provides an encrypted data search method based on overlapping clustering, combining clustering preprocessing, blockchain, and policy hiding technologies to achieve efficient and reliable multi-keyword encrypted search and fine-grained access authorization, thereby enhancing the effectiveness and accuracy of encrypted search.

[0052] Next, with reference to Figures 1-3, this embodiment describes in detail an encrypted data search method based on overlapping clustering. The method, applied to data requesters, includes the following steps:

[0053] S1. Based on a preset algorithm, generate search parameters from the set of search keywords and send them to the cloud server, and send a data usage request to the blockchain. The search parameters include the search trapdoor and the weight ratio of the multi-keyword set corresponding to the set of search keywords.

[0054] For example, data requester u locally inputs a set of search keywords Q_u = kw_1 ∧ … ∧ kw_g ∧ … ∧ kw_G (1 ≤ g ≤ G) and the public parameter pk to generate the search trapdoor T_u, and at the same time sets the weight ratio corresponding to the multi-keyword set. The generation of the search trapdoor T_u is represented as:

[0055] TrapGen(Q_u, pk) → T_u;

[0056] The weight ratio corresponding to the multi-keyword set is expressed as follows:

[0057] α_u = α_1 : … : α_g : … : α_G.

[0058] Then, the search trapdoor T_u and the weight ratio α_u corresponding to the multi-keyword set are sent to the cloud server to request a ciphertext search.

[0059] The encrypted data search method based on overlapping clustering requires data interaction among the data requester (client), the cloud server, the data owner, and the blockchain. Initialization is also required before step S1. As one implementation, the specific process is as follows:

[0060] Step 1: Initialize the blockchain, which includes:

[0061] (1) Election of authoritative nodes.

[0062] Blockchain nodes use the Practical Byzantine Fault Tolerance (PBFT) consensus algorithm to elect a node as the attribute authority node AU.

[0063] (2) Deployment of smart contracts.

[0064] Deploy clustering smart contracts, index dynamic update smart contracts, and attribute verification smart contracts on the blockchain.

[0065] Smart contracts are automated protocols that run on the blockchain. In practical applications, smart contracts automatically execute agreed-upon operations.

[0066] The clustering smart contract receives data summaries from data owners, stores them on the blockchain, and triggers the smart contract to complete the clustering task of the data summaries; the index dynamic update smart contract generates a search index for the data summaries of each cluster after clustering, and automatically triggers the clustering smart contract to execute dynamic updates of the search index when the data owner's data summaries change; the attribute verification smart contract verifies the attributes of the data requester before returning the searched data to the data requester.

[0067] (3) System initialization.

[0068] Input the security parameter λ and attribute domain AF into the attribute authority node AU to generate the system parameter pk and master key msk, represented as follows:

[0069] Setup(λ,AF)→{pk,msk}.

[0070] Step 2: Generate a key.

[0071] Specifically, the system parameter pk, the master key msk, and the attribute set s_u of the data requester u are input to the attribute authority node AU, which executes the key generation algorithm to generate the private key sk_u of the data requester u, represented as:

[0072] KeyGen(pk, msk, s_u) → sk_u.

[0073] Step 3: For each dataset m_{l,q} that it owns, the data owner l generates an access policy vector and a data summary, encrypts the data, stores the ciphertext, and publishes it on-chain. The specific steps are as follows:

[0074] Step 301: Generation of access policy vectors:

[0075] Specifically, the data owner l sets an access policy P_{l,q} = (A_{l,q}, ρ_{l,q}, τ_{l,q}) for each of its datasets m_{l,q} and transforms the access policy into an access policy vector using one-hot encoding, thereby generating an access policy vector set for all datasets of data owner l. Here 1 ≤ l ≤ L, where L is the number of data owners; 1 ≤ q ≤ N_l, where N_l is the number of datasets owned by data owner l; A_{l,q} is the attribute matrix of dataset m_{l,q}; τ_{l,q} is the attribute set; and ρ_{l,q} is the mapping between the attribute matrix A_{l,q} and the attribute set τ_{l,q}.

[0076] For example, in a medical scenario, the dataset m_{l,q} owned by data owner l has an attribute matrix A_{l,q}, the attribute set τ_{l,q} = {"Department: Cardiology", "Doctor's Title: Associate Chief Physician", "Institution: Hospital", "Institution: Research Institute", "Title: Researcher"}, and the mapping relationship ρ_{l,q} = {("Department: Cardiology"∩"Title: Associate Chief Physician"∩"Institution: Provincial Hospital")∪("Institution: Cardiovascular Research Institute"∩"Title: Researcher")}. Only when the data requester's attributes satisfy the access policy P_{l,q} can the requester obtain the medical information related to heart disease.
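The example policy in [0076] reads as an OR over AND-clauses. As a rough plaintext illustration only, the sketch below checks a requester's attribute set against such a policy. The names POLICY and satisfies are hypothetical, and the attribute strings are taken from the mapping relationship as printed. The method described here evaluates the policy over hidden one-hot vectors inside attribute-based encryption, not over readable strings as this sketch does.

```python
# Hypothetical plaintext illustration of the example policy in [0076].
# Each inner set is an AND-clause; the outer list is an OR over the clauses.
POLICY = [
    {"Department: Cardiology",
     "Title: Associate Chief Physician",
     "Institution: Provincial Hospital"},
    {"Institution: Cardiovascular Research Institute",
     "Title: Researcher"},
]


def satisfies(attributes: set[str], policy: list[set[str]]) -> bool:
    """Return True if the attributes cover every item of at least one clause."""
    return any(clause <= attributes for clause in policy)


researcher = {"Institution: Cardiovascular Research Institute", "Title: Researcher"}
nurse = {"Department: Cardiology", "Title: Nurse"}

print(satisfies(researcher, POLICY))  # True
print(satisfies(nurse, POLICY))       # False
```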

[0077] Step 302: The data owner l generates a data summary Abs_{l,q} for each dataset m_{l,q} that it owns, represented as:

[0078] AbsGen(m_{l,q}) → Abs_{l,q}.

[0079] Step 303: Data owner l uses the public parameter pk, the master key msk, and the access policy vector set to encrypt all of its datasets, producing the collection of ciphertext datasets CT_l and the corresponding hash value set H_l, represented as:

[0080]

[0081] Step 304, Ciphertext Storage: Store(CT_l) → addr_l.

[0082] Specifically, the data owner l uploads all of its encrypted datasets CT_l to the cloud server for cloud storage. The cloud server stores the encrypted datasets CT_l uploaded by the data owner l and returns their cloud storage hash address addr_l. When a new ciphertext dataset is added, the set of ciphertext datasets is updated to CT′.

[0083] Step 305, Data Release:

[0084] Specifically, the data owner l uploads the access policy vector set, the hash value set H_l, the cloud storage hash address addr_l of the ciphertext datasets, and the data summary set Abs_l formed from the data summaries Abs_{l,q} to the blockchain for centralized publication. Data owners may also publish individual encrypted datasets at any time.

[0085] Step 4: Cluster and generate search index:

[0086] Specifically, the clustering algorithm parameter K is input into the cloud server, and the cloud server calls the clustering smart contract to execute the clustering preprocessing task. The clustering smart contract delegates the computation task to the computing nodes under the blockchain. The computing nodes use the Overlapping K-means (OKM) clustering algorithm to perform clustering preprocessing on the data digests corresponding to all encrypted datasets published on the blockchain, and assign each encrypted dataset to one or more clusters j (j = 1, 2, ..., K). After the task is completed, the results are uploaded to the blockchain for verification and recording.

[0087] As one implementation method, the compute node performs the following operations:

[0088] 1) Initialization.

[0089] Specifically, randomly select K data summary sets x_i as the initial cluster centers c_j. Each cluster center has a corresponding covariance matrix Σ_j, which represents the distribution of the data summary sets in that cluster.

[0090] 2) Calculate the correlation degree.

[0091] Specifically, calculate the correlation degree between each data summary set x_i and each cluster center c_j using the following formula:

[0092] $$P(x_i \mid c_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - c_j)^{T}\,\Sigma_j^{-1}\,(x_i - c_j)\right)$$

[0093] where P(x_i | c_j) is the degree to which the data summary set x_i is associated with cluster j, d is the dimension of the data summary set, and Σ_j is the covariance matrix of cluster j.

[0094] 3) Normalize the correlation degree.

[0095] Specifically, for each data summary set x_i, normalize its correlation with all clusters so that the sum is 1:

[0096] $$P(x_i \mid c_j) \leftarrow \frac{P(x_i \mid c_j)}{\sum_{k=1}^{K} P(x_i \mid c_k)}$$

[0097] 4) Assign the data summary set to the cluster center.

[0098] Specifically, based on the normalized correlation P(x_i | c_j), a data summary set is distributed across multiple cluster centers, with its correlations over the different clusters summing to 1, so that a single data summary set can belong to multiple clusters.

[0099] For example, if P(x_i | c_1) = 0.5, P(x_i | c_2) = 0.2, and P(x_i | c_3) = 0.3, then the data summary set x_i can be considered assigned to clusters 1, 2, and 3.

[0100] 5) Update the cluster centers and covariance matrix.

[0101] Specifically, based on the assignment results, update the position and covariance matrix of each cluster center, and calculate the effective number of members N_j in each cluster:

[0102] $$N_j = \sum_{i=1}^{N} P(x_i \mid c_j)$$

[0103] where N is the total number of data summary sets. The cluster centers are updated as the weighted average of all data summary sets:

[0104] $$c_j = \frac{1}{N_j} \sum_{i=1}^{N} P(x_i \mid c_j)\, x_i$$

[0105] Then the covariance matrix Σ_j of each cluster is calculated, which represents the shape and range of the distribution of the data summary sets in cluster j. It is updated as follows:

[0106] $$\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} P(x_i \mid c_j)\,(x_i - c_j)(x_i - c_j)^{T}$$

[0107] where (x_i − c_j)(x_i − c_j)^T is the outer product of the difference vector between the data summary set x_i and the cluster center c_j.

[0108] 6) Complete clustering. Repeat steps 3)-5), updating the cluster centers and covariance matrices, until the maximum number of iterations is reached, i.e., the convergence condition is met, at which point clustering is complete.
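Steps 3) to 5) can be sketched for a single pass as follows. This is a minimal illustration that assumes the raw correlation degrees from step 2) are already available as plain numbers. The function names and the min_share cutoff are hypothetical, since the text does not state a membership threshold, and the update formulas are the usual weighted-mean and weighted-covariance forms suggested by the step descriptions.

```python
def normalize(correlations: list[float]) -> list[float]:
    """Step 3): scale one summary's correlations so that they sum to 1."""
    total = sum(correlations)
    if total <= 0:
        raise ValueError("correlations must have a positive sum")
    return [p / total for p in correlations]


def assigned_clusters(normalized: list[float], min_share: float = 0.0) -> list[int]:
    """Step 4): 1-based clusters whose share exceeds min_share (assumed cutoff)."""
    return [j for j, p in enumerate(normalized, start=1) if p > min_share]


def update_cluster(points: list[list[float]], weights: list[float]):
    """Step 5): effective member count, weighted center, weighted covariance."""
    n_eff = sum(weights)
    dim = len(points[0])
    center = [sum(w * x[k] for w, x in zip(weights, points)) / n_eff
              for k in range(dim)]
    cov = [[sum(w * (x[a] - center[a]) * (x[b] - center[b])
                for w, x in zip(weights, points)) / n_eff
            for b in range(dim)]
           for a in range(dim)]
    return n_eff, center, cov


shares = normalize([5.0, 2.0, 3.0])
print(shares)                     # [0.5, 0.2, 0.3]
print(assigned_clusters(shares))  # [1, 2, 3]

# Two 2-D summaries, each contributing half of its weight to this cluster.
n_eff, center, cov = update_cluster([[0.0, 0.0], [2.0, 0.0]], [0.5, 0.5])
print(n_eff, center)
print(cov)
```

With a positive min_share, weakly associated clusters would be dropped from the assignment; the default of 0.0 reproduces the example in which a summary with shares 0.5, 0.2, and 0.3 belongs to all three clusters.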

[0109] 7) Search index generation. After the data summary sets are grouped into clusters, a search index I_j and its index metadata are generated for each cluster j. The search indexes generated by all clusters form an index set I and an index metadata set IndInfo, where each index metadata entry consists of an index identifier, a timestamp, and an index storage hash address.

[0110] An index is a data structure used to accelerate data retrieval and querying. A cluster is a set of data summaries with similar content grouped together. In this embodiment, the data ID of each data summary under the cluster and the keywords it contains are used to form a search index, which facilitates search queries by keywords.

[0111] Index metadata is additional information that describes and manages an index. It consists of the index's unique identifier, storage address, timestamps of index creation and updates, and the cluster from which the index was generated. Index metadata can improve the query efficiency of an index and maintain index consistency.

[0112] Furthermore, to ensure the accuracy and timeliness of the search index, the cloud server invokes the index dynamic update smart contract to automatically monitor data changes. When a new encrypted dataset is added, the index dynamic update smart contract and the clustering smart contract are triggered to automatically update and generate a new search index. The dynamic update of the search index is represented as follows:

[0113] IndUpdate(I,CT′).
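The per-cluster index and its update can be pictured with a small inverted index. The sketch below is a simplified plaintext illustration: the function names build_index and ind_update, the metadata field names, and the identifier format are all hypothetical. In the method described here the keywords are encrypted, the metadata also carries a storage hash address, and an update is driven by smart contracts that may re-run clustering rather than appending in place.

```python
import time


def build_index(cluster_id: int, summaries: dict[str, set[str]]) -> dict:
    """Build a keyword -> data-ID index for one cluster, plus its metadata."""
    postings: dict[str, set[str]] = {}
    for data_id, keywords in summaries.items():
        for kw in keywords:
            postings.setdefault(kw, set()).add(data_id)
    meta = {
        "index_id": f"idx-{cluster_id}",  # hypothetical identifier format
        "cluster": cluster_id,
        "updated_at": time.time(),
    }
    return {"postings": postings, "meta": meta}


def ind_update(index: dict, data_id: str, keywords: set[str]) -> None:
    """Add one new dataset's keywords in place and refresh the timestamp."""
    for kw in keywords:
        index["postings"].setdefault(kw, set()).add(data_id)
    index["meta"]["updated_at"] = time.time()


index = build_index(1, {"d1": {"fever", "cough"}, "d2": {"fever"}})
ind_update(index, "d3", {"cough"})
print(sorted(index["postings"]["fever"]))  # ['d1', 'd2']
print(sorted(index["postings"]["cough"]))  # ['d1', 'd3']
```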

[0114] S2. Obtain the set of ciphertext hash addresses corresponding to the target ciphertext dataset and download the target ciphertext dataset. Use the preset private key to decrypt and obtain the corresponding plaintext data.

[0115] In this embodiment, the encrypted hash address is obtained by the cloud server after searching the encrypted dataset according to the search trapdoor and the preset search index and sorting it by relevance; the relevance sorting is executed after the blockchain responds to the data usage request and passes the attribute verification.

[0116] Furthermore, the specific process is as follows:

[0117] (1) The cloud server obtains the latest multi-keyword encrypted search index set I from the blockchain and, according to the search trapdoor, uses the search index to find in the encrypted data resources the encrypted datasets SerData_u that contain the search keywords of data requester u.

[0118] Specifically, the cloud server compares the trapdoor with the encrypted keywords in the search index. If the trapdoor matches a certain index entry, the query requirement is met. Then, based on the corresponding data ID information in the index, the encrypted dataset that meets the requirements is found in the encrypted data resource.

[0119] Building upon the preceding steps, let's take the keyword "fever" as an example for further explanation. The keyword "fever" is input, encrypted using a key to generate a trapdoor, which is then sent to the cloud server. The cloud server receives the trapdoor, searches for matching entries in the encrypted search index, and obtains the ID information corresponding to all entries containing "fever." Based on the ID information, it finds the cloud storage hash address, thereby extracting the corresponding encrypted dataset from the encrypted data resources.
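The "fever" walk-through can be mimicked with a keyed hash standing in for the trapdoor. This is a deliberate substitution: the method described here generates trapdoors with attribute-based searchable encryption via TrapGen, whereas the sketch uses HMAC-SHA256 with a shared key purely to show how the server can match tokens without seeing the keyword. The function names and the demo key are hypothetical.

```python
import hashlib
import hmac


def token(key: bytes, keyword: str) -> str:
    """Deterministic keyed digest, used here as a stand-in for a trapdoor."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).hexdigest()


def build_encrypted_index(key: bytes, summaries: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each keyword token to the IDs of the datasets containing it."""
    index: dict[str, set[str]] = {}
    for data_id, keywords in summaries.items():
        for kw in keywords:
            index.setdefault(token(key, kw), set()).add(data_id)
    return index


def search(index: dict[str, set[str]], trapdoor: str) -> list[str]:
    """Server side: match the trapdoor against index entries, return IDs."""
    return sorted(index.get(trapdoor, set()))


key = b"demo-key-not-for-production"
index = build_encrypted_index(
    key, {"d1": {"fever", "cough"}, "d2": {"fever"}, "d3": {"rash"}}
)
print(search(index, token(key, "fever")))     # ['d1', 'd2']
print(search(index, token(key, "headache")))  # []
```

The returned IDs would then be resolved to cloud storage hash addresses, as the paragraph above describes.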

[0120] (2) Data requester u initiates a data usage request to the blockchain, triggering the attribute verification smart contract. The attribute verification smart contract transforms the attributes of data requester u into a vector and matches it against the corresponding access policy vector. If the verification passes, it outputs 'Y', indicating that access permission is granted; otherwise, it outputs 'No', indicating that access is denied. After attribute access authorization succeeds, proceed to step (3).

[0121] Specifically, each attribute is mapped to a vector, with each attribute corresponding to a binary bit: 1 indicates the attribute is present, and 0 indicates it is not. A match is considered successful if the dot product of the attribute vector and the policy vector equals the number of 1s in the policy vector.

[0122] For example, the attribute vector of an internist can be represented as: [doctor, nurse, internal medicine, surgery], which is [1,0,1,0] after vectorization. At the same time, the access strategy is also in vector form. If the access strategy is doctor AND internal medicine, then it is [1,0,1,0] after vectorization.

[0123] The product of the attribute vector and the access policy is 1*1+0*0+1*1+0*0=2, and the number of 1s in the policy vector is 2. Therefore, the match is successful and the access permission is satisfied.
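The matching rule in the example above can be checked with a short sketch. The function name `attributes_match` and the nurse vector are hypothetical; only the dot-product rule and the internist example come from the text.

```python
def attributes_match(attribute_vector, policy_vector):
    """Match succeeds when the dot product equals the number of 1s in the policy."""
    if len(attribute_vector) != len(policy_vector):
        raise ValueError("vectors must have the same length")
    dot = sum(a * p for a, p in zip(attribute_vector, policy_vector))
    return dot == sum(policy_vector)


# Bit order: [doctor, nurse, internal medicine, surgery]
internist = [1, 0, 1, 0]
nurse = [0, 1, 0, 0]  # hypothetical second requester
policy = [1, 0, 1, 0]  # doctor AND internal medicine

print("Y" if attributes_match(internist, policy) else "No")
print("Y" if attributes_match(nurse, policy) else "No")
```

A requester holding extra attributes still matches, because only the bits set in the policy vector contribute to the comparison.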

[0124] (3) Calculate the TF-IDF value:

[0125] Specifically, based on the search trapdoor T_u sent by the cloud server and the retrieved encrypted datasets, the data owner uses the TF-IDF algorithm to calculate the TF-IDF value of each search keyword kw_g and returns it to the cloud server.

[0126] For example, the specific process is as follows:

[0127] (301) For each keyword kw_g, calculate the term frequency TF and the inverse document frequency IDF using the following formulas:

[0128] $$\mathrm{TF} = \frac{\eta_{g,s}}{\sum_{m} \eta_{m,s}}, \qquad \mathrm{IDF} = \log\frac{|D|}{|\{s : kw_g \in t_s\}|}$$

[0129] The first formula gives the frequency of keyword kw_g in encrypted dataset t_s, where η_{g,s} is the number of times kw_g appears in t_s and Σ_m η_{m,s} is the total number of all terms in t_s. The second formula gives the IDF value of keyword kw_g, where |D| is the total number of ciphertext datasets and |{s : kw_g ∈ t_s}| is the number of encrypted datasets that contain kw_g.

[0130] (302) Calculate the TF-IDF value of keyword kw_g and feed it back to the cloud server.

[0131] Specifically, data owner l calculates the TF-IDF value of each keyword kw_g contained in the encrypted dataset:

[0132] $$\text{TF-IDF}_{kw_g} = \mathrm{TF} \times \mathrm{IDF}$$
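The calculation in steps (301) and (302) can be sketched as follows. This is a minimal illustration, assuming the data owner works on plaintext term lists and that IDF is the natural logarithm of |D| divided by the number of datasets containing the keyword, with no smoothing; the function names and sample term lists are hypothetical.

```python
import math
from collections import Counter


def tf(keyword, terms):
    """Occurrences of the keyword divided by the total number of terms in one dataset."""
    if not terms:
        return 0.0
    return Counter(terms)[keyword] / len(terms)


def idf(keyword, datasets):
    """log(|D| / number of datasets that contain the keyword)."""
    containing = sum(1 for terms in datasets if keyword in terms)
    if containing == 0:
        return 0.0
    return math.log(len(datasets) / containing)


def tf_idf(keyword, terms, datasets):
    """TF-IDF value of one keyword for one dataset."""
    return tf(keyword, terms) * idf(keyword, datasets)


datasets = [
    ["fever", "cough", "fever", "rash"],
    ["cough", "headache"],
    ["headache", "nausea"],
]
value = tf_idf("fever", datasets[0], datasets)
print(round(value, 4))
```

Only the resulting values, not the term lists, would be returned to the cloud server.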

[0133] (4) Relevance ranking:

[0134] Based on the TF-IDF values returned by the data owner and the weight ratio α_u of the multiple keywords, the cloud server calculates the relevance score for data requester u and, without revealing the content of the retrieved encrypted datasets SerData_u, sorts the search results by relevance score. The specific steps of the sorting operation are as follows:

[0135] (401) Calculate the relevance score for data requester u:

[0136]

[0137] (402) Based on the calculated relevance scores, the cloud server sorts the search results from highest to lowest score, yielding the sorted encrypted dataset SortData_u.
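Steps (401) and (402) can be illustrated with a short sketch. It assumes the relevance score is a weight-normalized sum of the per-keyword TF-IDF values, which is one plausible reading of the weight ratio α_u rather than the formula of the method itself; the function names, weights, and TF-IDF values are hypothetical.

```python
def relevance_score(weights, tfidf_values):
    """Weighted sum of per-keyword TF-IDF values, with weights normalized to sum to 1."""
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, tfidf_values))


def rank(weights, tfidf_by_dataset):
    """Return dataset IDs sorted from highest to lowest relevance score."""
    scores = {ds: relevance_score(weights, vals) for ds, vals in tfidf_by_dataset.items()}
    return sorted(scores, key=scores.get, reverse=True)


alpha_u = [3, 1]  # weight ratio 3:1 for two keywords
tfidf_by_dataset = {
    "ds-01": [0.10, 0.40],
    "ds-02": [0.30, 0.05],
    "ds-03": [0.20, 0.20],
}
sort_data = rank(alpha_u, tfidf_by_dataset)
print(sort_data)
```

The ranking uses only dataset IDs and scores, so the server can order results without reading the encrypted content.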

[0138] To ensure data consistency and reliability, further measures include:

[0139] (403) The cloud server requests consensus verification of the search results and ranking results.

[0140] Specifically, under the PBFT consensus algorithm, the cloud server randomly selects sample data and sends it to the blockchain's master node Node_z. The master node Node_z receives the verification request, broadcasts it to each secondary node Node_f, instructs all nodes to perform the verification, and compares the verification results with the results submitted by the cloud server.

[0141] Under the PBFT framework, even if some malicious nodes exist, as long as more than two-thirds of the nodes verify the same result, the consistency and correctness of the data can be ensured, thus the result is considered reliable and valid. After successful consensus verification, steps (404) and (405) are executed.
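The two-thirds rule above can be expressed as a small tally function. This sketch covers only the final threshold check, not the message phases of PBFT itself, and the function name and vote values are hypothetical.

```python
from collections import Counter


def consensus_result(votes):
    """Return the result reported by more than two-thirds of the nodes, else None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    # Integer comparison avoids floating-point error at the exact 2/3 boundary.
    return value if 3 * count > 2 * len(votes) else None


print(consensus_result(["digest-a"] * 5 + ["digest-x"] * 2))  # 5 of 7 agree
print(consensus_result(["digest-a"] * 4 + ["digest-x"] * 2))  # exactly 2/3: not enough
```

Because the rule requires strictly more than two-thirds, four matching results out of six do not pass.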

[0142] (404) The cloud server returns the data digests corresponding to the sorted encrypted dataset SortData_u to data requester u, who then selects the desired data, i.e., the target encrypted dataset, from the sorted results; the data content itself remains confidential.

[0143] (405) Trust assessment of cloud servers based on a trust-based reward and punishment mechanism.

[0144] After consensus verification is completed on the cloud server, all nodes that participated in the consensus verification jointly calculate the trust score and the final reward. If more than two-thirds of the nodes agree that the data is valid, the trust score increases, resulting in more rewards. Otherwise, the verification fails, the cloud server's trust score is lowered, and a reward penalty is imposed.

[0145] A trust-based reward and punishment mechanism is represented as follows:

[0146] Here B is a constant that takes either the bonus value B_up or the deduction value B_down, and the associated coefficient, which changes with the number of times the cloud server has been verified, takes either the bonus coefficient or the deduction coefficient. B and the coefficient are both set by the cloud server user, with ranges (0, 9] and (0, 90). The current trust score TS and the historical trust score HTS both lie in [TS_min, TS_max] = [40, 100]. At the next trust assessment, the current trust score TS becomes the new historical trust score HTS. Note that when the trust score falls below 40, network security must be considered and the server should be replaced.

[0147] The formula for calculating trust scores is as follows:

[0148] When more than two-thirds of the nodes agree that the data is valid, the trust score increases as follows:

[0149]

[0150] Here β1, β2, and β3 are the weights of the factors influencing the trust score, namely response time, verification success rate, and historical trust score. The bonus coefficient can initially be set to 1 and increases with the number of successful verifications, and TS_max is the maximum trust score threshold.

[0151] Otherwise, the trust score decreases as follows:

[0152]

[0153] Here the deduction coefficient is initially set to 1 and gradually increases with the number of verification failures, TS_min is the minimum trust score threshold, and nr is the ratio of the number of agreeing nodes to the total number of nodes.

[0154] The formula for calculating rewards and penalties is as follows:

[0155] $$E = \max\left(E_{\min},\; E_0 \cdot \left[1 + \theta \cdot (TS - HTS)\right]\right)$$

[0156] Here E and E_0 are the final revenue and the basic revenue, respectively, and E_min is the minimum revenue value, set to prevent the final revenue from being negative. θ ∈ (0,1) is the revenue coefficient set by the user, and TS - HTS is the difference between the latest trust score after result verification and the historical trust score. The final revenue of the cloud server therefore depends on the trust score: if the cloud server performs well, the revenue increases as a reward; otherwise, the revenue decreases as a penalty.
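The revenue formula above translates directly into code. The function name and the sample values of E0, θ, TS, HTS, and E_min are hypothetical and chosen only to show both branches of the max.

```python
def final_revenue(e0, theta, ts, hts, e_min):
    """E = max(E_min, E0 * [1 + theta * (TS - HTS)]) with theta in (0, 1)."""
    if not 0 < theta < 1:
        raise ValueError("theta must lie in the open interval (0, 1)")
    return max(e_min, e0 * (1 + theta * (ts - hts)))


# Trust score rose from 70 to 80: the revenue grows.
rewarded = final_revenue(e0=100.0, theta=0.05, ts=80, hts=70, e_min=1.0)
# Trust score fell from 90 to 50: the raw value is negative, so E_min applies.
penalized = final_revenue(e0=100.0, theta=0.05, ts=50, hts=90, e_min=1.0)
print(rewarded, penalized)
```

The floor E_min only takes effect when the drop in trust score is large enough to push the bracketed factor below E_min / E0.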

[0157] (5) The data requester obtains the target search result: Return(GoalData_u) → {addr_l, H_l}.

[0158] Specifically, the cloud server returns all cloud storage hash addresses addr_l of the target encrypted dataset GoalData_u selected by data requester u.

[0159] (6) Data requester u decrypts to obtain plaintext data: Dec(GoalData_u, addr_l, sk_u).

[0160] Specifically, data requester u downloads the encrypted dataset GoalData_u according to the ciphertext hash address set addr_l, as needed, and decrypts it with the private key sk_u to obtain the corresponding plaintext data.

[0161] Example 2

[0162] With reference to Figures 1-3, this embodiment discloses an encrypted data search system based on overlapping clustering, including a client, a cloud server, and a blockchain. The client is used by the data owner to generate access policy vectors and data digests, encrypt datasets, and publish encrypted data resources; it is used by the data requester to input a search trapdoor to submit a search request, define keyword weight ratios for ranking the search results, submit data usage requests to complete attribute verification, and decrypt to obtain plaintext data. The cloud server stores the encrypted datasets sent by the data owner through the client, returns their cloud storage hash addresses, performs multi-keyword encrypted searches, ranks the search results by keyword relevance, requests the blockchain to perform consensus verification of the search and ranking results, and finally returns the target encrypted dataset to the data requester through the client. The blockchain elects a node as the attribute authority (AU) through a consensus algorithm; the AU is responsible for system initialization and for generating system parameters and keys. The blockchain is also used to deploy smart contracts, store information such as the cloud storage hash addresses of the encrypted datasets, allow smart contracts to be invoked to perform clustering preprocessing tasks, monitor data changes to trigger dynamic index updates and attribute verification, set verification nodes to perform consensus verification of the cloud server's search and ranking results, and evaluate the trust level of the cloud server based on the consensus results to implement an effective reward and punishment mechanism.

[0163] As one implementation, the encrypted data search steps performed by the above encrypted data search system based on overlapping clustering are as follows:

[0164] 1. Initialization: Setup(λ,AF)→{pk,msk}.

[0165] 1) Authority Node Election. Blockchain nodes use the PBFT consensus algorithm to elect a node as the attribute authority (AU).

[0166] 2) Smart Contract Deployment. Deploy clustering smart contracts, index dynamic update smart contracts, and attribute verification smart contracts on the blockchain.

[0167] 3) System initialization. AU takes security parameter λ and attribute field AF as input and generates system parameter pk and master key msk.

[0168] 2. Key generation: KeyGen(pk, msk, s_u) → sk_u.

[0169] The AU executes the key generation algorithm, taking as input the system parameter pk, the master key msk, and the attribute set s_u of data requester u, and generates the private key sk_u of data requester u.

[0170] 3. Access strategy vector generation:

[0171] Through the client, data owner l sets an access policy P_{l,q} = (A_{l,q}, ρ_{l,q}, τ_{l,q}) for each of its datasets m_{l,q} and transforms the access policy into an access policy vector using one-hot encoding, thereby generating the access policy vector set for all datasets of data owner l. Here 1 ≤ l ≤ L, where L is the number of data owners, and 1 ≤ q ≤ N_l, where N_l is the number of datasets owned by data owner l. A_{l,q} is the attribute matrix of dataset m_{l,q}, τ_{l,q} is the attribute set, and ρ_{l,q} is the mapping between the attribute matrix A_{l,q} and the attribute set τ_{l,q}. For example, in a medical scenario, data owner l has a dataset m_{l,q} with an attribute matrix A_{l,q}, the attribute set τ_{l,q} = {"Department: Cardiology", "Doctor's Title: Associate Chief Physician", "Institution: Hospital", "Institution: Research Institute", "Title: Researcher"}, and the mapping ρ_{l,q} = {("Department: Cardiology" ∩ "Title: Associate Chief Physician" ∩ "Institution: Provincial Hospital") ∪ ("Institution: Cardiovascular Research Institute" ∩ "Title: Researcher")}. Only a data requester whose attributes satisfy the access policy P_{l,q} can obtain the medical information related to heart disease.
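The medical example above combines two AND-clauses with an OR. A minimal sketch of one-hot encoding such a policy and checking a requester against it is shown below; it assumes one policy vector per clause and a fixed attribute order, both of which are illustrative choices, and the names `UNIVERSE`, `one_hot`, and `satisfies` are hypothetical.

```python
# Hypothetical attribute universe; the order fixes the bit positions.
UNIVERSE = [
    "Department: Cardiology",
    "Title: Associate Chief Physician",
    "Institution: Provincial Hospital",
    "Institution: Cardiovascular Research Institute",
    "Title: Researcher",
]


def one_hot(attributes):
    """Encode a set of attributes as a 0/1 vector over UNIVERSE."""
    return [1 if name in attributes else 0 for name in UNIVERSE]


# One vector per AND-clause; the policy is the OR of its clauses.
POLICY_VECTORS = [
    one_hot({"Department: Cardiology", "Title: Associate Chief Physician",
             "Institution: Provincial Hospital"}),
    one_hot({"Institution: Cardiovascular Research Institute", "Title: Researcher"}),
]


def satisfies(policy_vectors, attribute_vector):
    """True if some clause has all of its required bits set in attribute_vector."""
    for clause in policy_vectors:
        dot = sum(a * p for a, p in zip(attribute_vector, clause))
        if dot == sum(clause):
            return True
    return False


researcher = one_hot({"Institution: Cardiovascular Research Institute", "Title: Researcher"})
resident = one_hot({"Department: Cardiology", "Institution: Provincial Hospital"})
print(satisfies(POLICY_VECTORS, researcher), satisfies(POLICY_VECTORS, resident))
```

The resident fails because the first clause also requires the associate chief physician title, while the researcher satisfies the second clause in full.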

[0172] 4. Data summary generation: AbsGen(m_{l,q}) → Abs_{l,q}.

[0173] Through the client, data owner l generates a data summary Abs_{l,q} for each dataset m_{l,q} it owns.

[0174] 5. Data encryption:

[0175] Through the client, data owner l uses the public parameter pk, the master key msk, and the access policy vector set to encrypt all of its datasets, producing the set of ciphertext datasets CT_l and the corresponding hash value set H_l.

[0176] 6. Encrypted storage: Store(CT_l) → addr_l.

[0177] Through the client, data owner l uploads its set of encrypted datasets CT_l to cloud storage. The cloud server receives and stores the set CT_l uploaded by data owner l and returns its cloud storage hash address addr_l. When a new ciphertext dataset is added, the set of ciphertext datasets is updated to CT′.
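The Store step can be pictured as content-addressed storage. The sketch below assumes the cloud storage hash address is the SHA-256 digest of the ciphertext, which the text does not specify; the class `CloudStore` and its methods are hypothetical.

```python
import hashlib


class CloudStore:
    """In-memory stand-in for cloud storage keyed by a hash address."""

    def __init__(self):
        self._blobs = {}

    def store(self, ciphertext: bytes) -> str:
        """Store one ciphertext dataset and return its hash address."""
        addr = hashlib.sha256(ciphertext).hexdigest()
        self._blobs[addr] = ciphertext
        return addr

    def fetch(self, addr: str) -> bytes:
        """Return the ciphertext at addr after re-checking its hash."""
        blob = self._blobs[addr]
        if hashlib.sha256(blob).hexdigest() != addr:
            raise ValueError("stored ciphertext does not match its address")
        return blob


cloud = CloudStore()
ct_l = [b"ciphertext-of-dataset-1", b"ciphertext-of-dataset-2"]  # placeholder bytes
addr_l = [cloud.store(ct) for ct in ct_l]
print(addr_l[0][:16], len(addr_l))
```

With an address derived from the content, a requester can detect a tampered download by recomputing the hash.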

[0178] 7. Data Release:

[0179] Through the client, the data owner uploads the access policy vector set, the hash value set H_l, the cloud storage hash address addr_l of the ciphertext datasets, and the data summary set Abs_l formed from the data summaries Abs_{l,q} to the blockchain for centralized publication; data owners may also publish individual encrypted datasets at any time.

[0180] 8. Cluster and generate search index:

[0181] The cloud server inputs the clustering algorithm parameter K and invokes the clustering smart contract to execute the clustering preprocessing task. The clustering smart contract delegates the computation task to the computing nodes under the blockchain. The computing nodes perform clustering preprocessing operations on the data summaries of all datasets published on the blockchain using the Overlapping K-means Clustering (OKM) algorithm, assigning the datasets to one or more clusters j (j = 1, 2, ..., K). After completing the task, the results are uploaded to the blockchain for verification and recording. The computing nodes perform the following operations:

[0182] 1) Initialization. Randomly select K data summary sets x_i as the initial cluster centers c_j. Each cluster center has a corresponding covariance matrix Σ_j, which represents the distribution of the datasets in that cluster.

[0183] 2) Calculate the correlation degree. The correlation degree between each data summary set x_i and cluster center c_j is calculated using the following formula:

[0184]

[0185] Here d is the dimension of the dataset and Σ_j is the covariance matrix of cluster j.

[0186] 3) Normalize the correlation degree.

[0187] For each data summary set x_i, normalize its correlation degrees with all clusters so that they sum to 1:

[0188]

[0189] 4) Assign the data summary set to the cluster center.

[0190] Based on the normalized correlation degree P(x_i|c_j), each data summary set is assigned to multiple cluster centers, where P(x_i|c_j) represents the degree to which data summary set x_i belongs to cluster j.

[0191] 5) Update the cluster centers and covariance matrix.

[0192] Update the location and covariance matrix of each cluster center based on the allocation results, and calculate the effective number of members in each cluster:

[0193]

[0194] Where N is the total number of data summary sets.

[0195] Update the cluster centers based on the weighted average of all data summary sets:

[0196]

[0197] Then, calculate the covariance matrix Σ_j for each cluster, which represents the distribution shape and range of the data summary sets in cluster j. The update is as follows:

[0198]

[0199] Here (x_i - c_j)(x_i - c_j)^T is the outer product of the difference vector between data summary set x_i and cluster center c_j.

[0200] 6) Complete clustering. Repeat steps 3)-5) until the cluster centers and covariance matrix reach the maximum number of iterations, i.e., the convergence condition is met, and clustering is complete.
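Steps 1) to 6) can be sketched as a soft-assignment loop. This is an illustration under several assumptions that the text does not confirm: the correlation degree is taken to be a Gaussian density, the covariance is restricted to a diagonal for brevity, a fixed iteration count serves as the stopping rule, and a point joins every cluster whose normalized correlation reaches a hypothetical `threshold`. All function names and sample points are illustrative.

```python
import math
import random


def density(x, center, var):
    """Gaussian density of x around center, with per-dimension variances."""
    expo = sum((xi - ci) ** 2 / vi for xi, ci, vi in zip(x, center, var))
    norm = math.sqrt((2 * math.pi) ** len(x) * math.prod(var))
    return math.exp(-0.5 * expo) / norm


def memberships(points, centers, variances):
    """Steps 2)-3): correlation with every cluster, normalized to sum to 1."""
    rows = []
    for x in points:
        # The tiny constant keeps the total positive if every density underflows.
        raw = [density(x, c, v) + 1e-300 for c, v in zip(centers, variances)]
        total = sum(raw)
        rows.append([r / total for r in raw])
    return rows


def overlapping_clusters(points, k, iterations=30, threshold=0.2, seed=0):
    rng = random.Random(seed)
    dims = range(len(points[0]))
    centers = [list(p) for p in rng.sample(points, k)]  # step 1)
    variances = [[1.0 for _ in dims] for _ in range(k)]
    for _ in range(iterations):  # step 6): fixed number of iterations
        resp = memberships(points, centers, variances)
        for j in range(k):  # step 5)
            n_j = sum(row[j] for row in resp)  # effective number of members
            centers[j] = [sum(row[j] * x[t] for row, x in zip(resp, points)) / n_j
                          for t in dims]
            variances[j] = [max(sum(row[j] * (x[t] - centers[j][t]) ** 2
                                    for row, x in zip(resp, points)) / n_j, 1e-6)
                            for t in dims]
    resp = memberships(points, centers, variances)
    # Step 4): a point joins every cluster whose membership reaches the threshold.
    assigned = [[j for j in range(k) if row[j] >= threshold] for row in resp]
    return centers, assigned


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (2.6, 2.5)]
centers, assigned = overlapping_clusters(points, k=2)
for point, clusters in zip(points, assigned):
    print(point, clusters)
```

Because the memberships of a point sum to 1, any threshold no larger than 1/K guarantees that every data summary lands in at least one cluster, and points between clusters may land in several.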

[0201] 7) Search index generation. After the data summary sets are grouped into clusters, a search index I_j and index metadata are generated for each cluster j. The search indexes of all clusters form the index set I, and the index metadata form the index metadata set IndInfo, where each index metadata entry consists of an index identifier, a timestamp, and an index storage hash address.

[0202] 9. Dynamically update the search index: IndUpdate(I,CT′).

[0203] To ensure the accuracy and timeliness of the search index, the cloud server invokes a smart contract to dynamically update the index and automatically monitor data changes; when a new encrypted dataset is added, the smart contract is triggered to automatically update and generate a new search index.
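The update-on-change behavior can be sketched with an in-memory index and a hook that runs when a new ciphertext dataset arrives. This is an assumption-laden illustration rather than the smart contract itself: the class `SearchIndex`, the hook `on_new_ciphertext`, the token strings, and the version counter are all hypothetical.

```python
from collections import defaultdict


class SearchIndex:
    """Maps encrypted keyword tokens to the IDs of datasets that contain them."""

    def __init__(self):
        self.entries = defaultdict(set)
        self.version = 0  # bumped on every update so readers can fetch the latest

    def add_dataset(self, dataset_id, keyword_tokens):
        for token in keyword_tokens:
            self.entries[token].add(dataset_id)
        self.version += 1

    def lookup(self, token):
        return sorted(self.entries.get(token, ()))


def on_new_ciphertext(index, dataset_id, keyword_tokens):
    """Hypothetical hook that a data-change monitor would call."""
    index.add_dataset(dataset_id, keyword_tokens)
    return index.version


index = SearchIndex()
on_new_ciphertext(index, "ds-01", ["tok-fever", "tok-cough"])
latest = on_new_ciphertext(index, "ds-02", ["tok-fever"])
print(latest, index.lookup("tok-fever"))
```

Only the entries touched by the new dataset change, so existing entries do not need to be rebuilt.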

[0204] 10. Generate search trapdoor: TrapGen(Q_u, pk) → T_u.

[0205] Data requester u inputs the search keyword set Q_u = kw_1 ∧ … ∧ kw_g ∧ … ∧ kw_G (1 ≤ g ≤ G) and the public parameter pk into the client to generate the search trapdoor T_u. At the same time, the requester sets the weight ratio of the multiple keywords:

[0206] α_u = α_1 : … : α_g : … : α_G,

[0207] and sends both to the cloud server to request a ciphertext search.

[0208] 11. Encrypted search: Search(T_u, I) → SerData_u.

[0209] The cloud server retrieves the latest multi-keyword encrypted search index set I from the blockchain and, according to the search trapdoor, uses the search index to find in the encrypted data resources the encrypted datasets SerData_u that contain the search keywords of data requester u.

[0210] 12. Attribute Validation:

[0211] Data requester u initiates a data usage request to the blockchain, triggering the attribute verification smart contract. The smart contract transforms the data requester's attributes into an attribute vector and performs matching verification against the corresponding access policy vector. If the verification passes, it outputs 'Y', indicating that access is granted; otherwise, it outputs 'No', indicating that access is denied.

[0212] 13. Calculate the TF-IDF value:

[0213] After attribute access authorization is granted, based on the search trapdoor T_u sent by the cloud server and the retrieved encrypted datasets, the data owner uses the TF-IDF algorithm to calculate the TF-IDF value of each search keyword kw_g and returns it to the cloud server.

[0214] 1) The term frequency (TF) and inverse document frequency (IDF) are calculated as follows:

[0215] $$\mathrm{TF} = \frac{\eta_{g,s}}{\sum_{m} \eta_{m,s}}$$

[0216] $$\mathrm{IDF} = \log\frac{|D|}{|\{s : kw_g \in t_s\}|}$$

[0217] The first formula gives the frequency of keyword kw_g in dataset t_s, where η_{g,s} is the number of times kw_g appears in t_s and Σ_m η_{m,s} is the total number of all terms in t_s. The second formula gives the IDF value of keyword kw_g, where |D| is the total number of datasets and |{s : kw_g ∈ t_s}| is the number of datasets that contain kw_g.

[0218] 2) Calculate the TF-IDF value of keyword kw_g. Through the client, data owner l calculates the TF-IDF value of each keyword kw_g contained in the dataset:

[0219] $$\text{TF-IDF}_{kw_g} = \mathrm{TF} \times \mathrm{IDF}$$

[0220] 14. Result sorting:

[0221] Based on the TF-IDF_{kw_g} values returned by the data owner and the keyword weight ratio α_u, the cloud server calculates the relevance score for data requester u and, without revealing the content of the retrieved encrypted datasets SerData_u, sorts the search results by relevance score. The specific steps for sorting are as follows:

[0222] 1) Calculate the relevance score for data requester u:

[0223]

[0224] 2) Search result sorting. Based on the calculated relevance scores, the cloud server sorts the search results from highest to lowest score, yielding the sorted encrypted dataset SortData_u.

[0225] 3) Consensus verification. The cloud server requests consensus verification of the search results and ranking results: under the PBFT consensus algorithm, it randomly selects sample data and sends it to the master node Node_z. The master node receives the verification request, broadcasts it to all secondary nodes Node_f, instructs all nodes to perform the verification, and compares the verification results with the results submitted by the cloud server. Under the PBFT framework, even if some nodes are malicious, as long as more than two-thirds of the nodes verify the same result, the consistency and correctness of the data are ensured, and the result is considered reliable and valid.

[0226] 4) Return the sorting result. After successful consensus verification, the cloud server returns the data summaries corresponding to the sorted encrypted dataset SortData_u to data requester u, while the data content remains confidential.

[0227] 15. Trust-based reward and punishment mechanism:

[0228] After consensus verification of the cloud server is completed, all nodes involved in the consensus verification jointly calculate the trust score and the final reward. If more than two-thirds of the nodes agree that the data is valid, the trust score increases, resulting in more rewards. Otherwise, a verification failure result is returned, the cloud server's trust score is reduced, and a penalty is applied to the reward. Here nr is the ratio of the number of agreeing nodes to the total number of nodes. B is a constant that takes either the bonus value B_up or the deduction value B_down, and the associated coefficient, which changes with the number of times the cloud server has been verified, takes either the bonus coefficient or the deduction coefficient. B and the coefficient are both set by the cloud server user, with ranges (0, 9] and (0, 90). The current trust score TS and the historical trust score HTS both lie in [TS_min, TS_max] = [40, 100]. At the next trust assessment, the current trust score TS becomes the new historical trust score HTS. Note that when the trust score falls below 40, network security must be considered and the server should be replaced.

[0229] 1) Trust rating calculation formula:

[0230] When more than two-thirds of the nodes agree that the data is valid, the trust score increases as follows:

[0231]

[0232] Here β1, β2, and β3 are the weights of the factors influencing the trust score, namely response time, verification success rate, and historical trust score. The bonus coefficient can initially be set to 1 and increases with the number of successful verifications, and TS_max is the maximum trust score threshold.

[0233] Otherwise, the trust score decreases as follows:

[0234]

[0235] Here the deduction coefficient is initially set to 1 and gradually increases with the number of verification failures, and TS_min is the minimum trust score threshold.

[0236] 2) Reward and penalty calculation formula:

[0237] $$E = \max\left(E_{\min},\; E_0 \cdot \left[1 + \theta \cdot (TS - HTS)\right]\right)$$

[0238] Here E and E_0 are the final revenue and the basic revenue, respectively, and E_min is the minimum revenue value, set to prevent the final revenue from being negative. θ ∈ (0,1) is the revenue coefficient set by the user, and TS - HTS is the difference between the latest trust score after result verification and the historical trust score. The final revenue of the cloud server therefore depends on the trust score: if the cloud server performs well, the revenue increases as a reward; otherwise, the revenue decreases as a penalty.

[0239] 16. Obtain the target search result: Return(GoalData_u) → {addr_l, H_l}.

[0240] The cloud server returns all cloud storage hash addresses addr_l of the target encrypted dataset GoalData_u selected by data requester u.

[0241] 17. Decrypt to obtain plaintext data: Dec(GoalData_u, addr_l, sk_u).

[0242] Through the client, data requester u downloads the encrypted dataset GoalData_u according to the ciphertext hash address set addr_l and decrypts it with the private key sk_u to obtain the corresponding plaintext data.

[0243] Example 3

[0244] Embodiment 3 of the present invention provides an electronic device, including a memory and a processor, as well as computer instructions stored in the memory and running on the processor. When the computer instructions are executed by the processor, they complete the steps of the above-described encrypted data search method based on overlapping clustering.

[0245] Example 4

[0246] Embodiment 4 of the present invention provides a computer-readable storage medium for storing computer instructions, which, when executed by a processor, complete the steps of the above-described encrypted data search method based on overlapping clustering.

[0247] Example 5

[0248] Embodiment 5 of the present invention provides a computer program product, including a computer program / instruction, which, when executed by a processor, implements the steps of the above-described encrypted data search method based on overlapping clustering.

[0249] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in them, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0250] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0251] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, whereby a series of operational steps are performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0252] The descriptions of each embodiment in the above embodiments have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0253] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for searching ciphertext data based on overlapping clustering, characterized by comprising: generating search parameters from a set of search keywords based on a preset algorithm and sending them to the cloud server, and sending a data usage request to the blockchain; obtaining the set of ciphertext hash addresses corresponding to the target ciphertext dataset, downloading the target ciphertext dataset, and decrypting it with a preset private key to obtain the corresponding plaintext data. The set of ciphertext hash addresses is obtained by the cloud server after it searches the ciphertext datasets according to the search parameters and a preset search index and sorts the results by relevance; the search index is generated and dynamically updated by the cloud server, which uses a clustering smart contract to perform overlapping clustering preprocessing based on the data provided by the data owner; the relevance sorting is performed after the blockchain, in response to the data usage request, passes the attribute verification. The search index is generated by the cloud server assigning clustering preprocessing tasks to computing nodes under the blockchain based on data provided by the data owner; the computing nodes then perform clustering preprocessing operations on the data digests published on the blockchain using overlapping K-means clustering. The relevance ranking is determined by the cloud server based on the relevance score of the search keywords. The relevance score is generated by the cloud server based on the term frequency-inverse document frequency value fed back by the data owner and the search parameters sent by the data requester. The target ciphertext dataset is a ciphertext dataset that has passed consensus verification.
The consensus verification is performed by the cloud server using a practical Byzantine fault-tolerant consensus algorithm to verify the search results and ranking results, and a trust-based reward and punishment mechanism is used to reward or punish the nodes that perform consensus verification.

2. The overlapping clustering based ciphertext data search method of claim 1, wherein the encrypted dataset is generated by the data owner by encrypting the owned dataset according to the generated access policy vector in combination with attribute-based searchable encryption technology; the encrypted dataset is stored on the cloud server, and the access policy vector set, the hash value set, the cloud storage hash address of the encrypted dataset, and the data digest corresponding to the encrypted dataset are uploaded to the blockchain by the data owner.

3. The overlapping clustering based ciphertext data search method of claim 1, wherein the clustering preprocessing operation using overlapping K-means clustering specifically includes: randomly selecting multiple datasets as initial cluster centers; calculating the correlation between each dataset and each cluster center and normalizing it to obtain the normalized correlation degree; assigning each dataset to multiple cluster centers based on the normalized correlation degree; and updating the location and covariance matrix of the cluster centers according to the assignment results and calculating the effective number of members of each cluster, until the convergence condition is met.

4. A ciphertext data search system based on overlapping clustering, characterized by including a client and a cloud server. The client generates search parameters based on a set of search keywords using a preset algorithm and sends them to the cloud server, and sends a data usage request to the blockchain; the cloud server searches the encrypted datasets based on the search parameters and a preset search index, sorts the results by relevance, and obtains the encrypted hash address set corresponding to the target encrypted dataset; the client obtains the encrypted hash address set corresponding to the target encrypted dataset, downloads the target encrypted dataset, and decrypts it using a preset private key to obtain the corresponding plaintext data. The search index is generated and dynamically updated by the cloud server using overlapping clustering preprocessing performed by a clustering smart contract based on data provided by the data owner; the relevance calculation is performed after the blockchain responds to the data usage request and passes attribute verification. The search index is generated by the cloud server assigning clustering preprocessing tasks to computing nodes under the blockchain based on data provided by the data owner; the computing nodes then perform clustering preprocessing operations on the data digests published on the blockchain using overlapping K-means clustering. The relevance ranking is determined by the cloud server based on the relevance score of the search keywords. The relevance score is generated by the cloud server based on the term frequency-inverse document frequency value fed back by the data owner and the search parameters sent by the data requester. The target ciphertext dataset is a ciphertext dataset that has passed consensus verification.
The consensus verification is performed by the cloud server using a practical Byzantine fault-tolerant consensus algorithm to verify the search results and ranking results, and a trust-based reward and punishment mechanism is used to reward or punish the nodes that perform consensus verification.

5. An electronic device comprising a memory, a processor, and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the encrypted data search method based on overlapping clustering as described in any one of claims 1-3.

6. A computer readable storage medium having stored thereon computer programs / instructions, characterized in that, when executed by a processor, the computer program / instruction implements the steps of the encrypted data search method based on overlapping clustering as described in any one of claims 1-3.

7. A computer program product comprising computer programs / instructions, characterized in that, when executed by a processor, the computer program / instruction implements the steps of the encrypted data search method based on overlapping clustering as described in any one of claims 1-3.