Method and apparatus for clustering privacy data by multiple parties

By combining multi-round iterative multi-party secure computation with an improved Euclidean distance formula, the method addresses instability in multi-party joint clustering, keeping the computation stable and the privacy data protected even when a cluster contains zero samples.

CN115982607B · Active · Publication Date: 2026-06-12 · ANT BLOCKCHAIN TECHNOLOGY (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ANT BLOCKCHAIN TECHNOLOGY (SHANGHAI) CO LTD
Filing Date: 2022-12-31
Publication Date: 2026-06-12


Abstract

Embodiments of the present specification provide a method and device for clustering privacy data by multiple parties. The privacy data is distributed among multiple holders, and the sample features stored by the multiple holders are used to constitute a total feature matrix of multiple samples to be clustered. In any iteration, under a secure calculation manner, the multiple holders jointly determine distance slices of distances between the multiple samples and multiple cluster centers, and index slices of class cluster indexes; then, the holders jointly determine sum value slices of sample features belonging to each class cluster, and jointly determine the number of samples belonging to each class cluster, and correct the number of samples. Specifically, the number of samples can be corrected based on the sum of the number of samples and a first value to obtain a corrected number of samples. The multiple holders determine updated cluster center feature slices based on the sum value slices of each party and the corrected number of samples.

Description

Technical Field

[0001] This specification relates to the field of computer technology, and more particularly to a method and apparatus for multi-party collaborative clustering of privacy data.

Background Technology

[0002] K-means clustering has been widely used in many applications such as user classification, targeted marketing, image segmentation, and feature learning. The data to be clustered often involves multiple platforms and multiple domains. For example, in merchant classification analysis, electronic payment platforms possess merchants' transaction records, e-commerce platforms store merchants' sales data, and banking institutions possess merchants' loan data. This necessitates collaboration among multiple parties to perform clustering while ensuring the privacy of business data.

[0003] Currently, there is a desire for improved solutions that can enhance the stability of the clustering process when multiple parties collaborate to cluster privacy data.

Summary of the Invention

[0004] This specification describes one or more embodiments of a method and apparatus for multi-party collaborative clustering of privacy data, aiming to improve the stability of the clustering process when multiple parties collaborate on clustering privacy data. The specific technical solution is as follows.

[0005] Firstly, the embodiments provide a method for multi-party collaborative clustering of privacy data, wherein the privacy data is distributed among multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of multiple samples to be clustered; the method is executed by multiple holders and includes multiple iterations, wherein any one iteration includes:

[0006] Based on the sample feature slices and cluster center feature slices of each party, the first multi-party secure computation is performed among multiple holders to obtain the distance slices between multiple samples and multiple cluster centers respectively.

[0007] Based on the distance slices of each party, a secure comparison computation is performed among the multiple holders to obtain the index slices of the cluster index, wherein the cluster index characterizes which of the clusters corresponding to the multiple cluster centers a sample belongs to;

[0008] Based on the sample feature slices and index slices of each party, the multiple holders perform a second multi-party secure computation to determine the sum value slices of the sample features belonging to each cluster;

[0009] Based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain the number of samples belonging to each cluster, and determine the corrected number of samples belonging to each cluster based on the sum of each number of samples and a first value;

[0010] Based on the sum value slices of each party and the corrected numbers of samples, the multiple holders respectively determine the updated cluster center feature slices.

[0011] In one implementation, the first value is a value that is close to 0 but not 0.

[0012] In one implementation, the step of determining the corrected number of samples belonging to each cluster includes:

[0013] Using the sum of the number of samples and the first value as the corrected number of samples belonging to each cluster.

[0014] In one implementation, the step of determining the corrected number of samples belonging to each cluster includes:

[0015] For any first number of samples, determining the corrected number of samples for the corresponding cluster based on the ratio of a sum to a difference; wherein the sum is the sum of the first number of samples and the first value, and the difference is the difference between 1 and the first number of samples.

[0016] In one implementation, the step of performing a secure statistical computation among the multiple holders based on the index slices of each party includes:

[0017] Based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain sample number slices for each cluster; wherein, for any cluster, the multiple sample number slices of that cluster yield the corresponding number of samples if reconstructed.

[0018] The step of determining the corrected number of samples belonging to each cluster includes:

[0019] Based on the sample number slices of each party, the multiple holders perform a third multi-party secure computation that sums each number of samples with the first value, and obtain corrected sample number slices for each cluster.

[0020] The step in which the multiple holders determine the updated cluster center feature slices includes:

[0021] Based on the sum value slices and the corrected sample number slices of each party, the multiple holders perform secure matrix multiplication to respectively obtain the updated cluster center feature slices.

[0022] In one implementation, the first value is obtained after parameter tuning over multiple training cycles.

[0023] In one implementation, the sample feature slices from multiple holders are obtained in the following manner:

[0024] Multiple holders obtain their own sample feature slices based on their respective sample features through a secret sharing algorithm; wherein the dimension of any holder's sample feature slice is the same as the total feature matrix.

[0025] In one implementation, the initial multiple cluster center feature slices are determined in the following manner:

[0026] Each holder extracts a number of agreed-upon sample features from its own sample feature slices, which serve as the holder's initial multiple cluster center feature slices.

[0027] In one implementation, the step of performing a first multi-party secure computation among the plurality of holders includes:

[0028] Multiple holders perform a first multi-party secure computation based on the improved Euclidean distance formula; wherein the improved Euclidean distance formula does not include square root operations.

[0029] Secondly, the embodiment provides a method for multi-party joint clustering of privacy data, wherein the privacy data is distributed among multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of multiple samples to be clustered; the method is executed by any one holder and includes multiple iterations, wherein any one iteration includes:

[0030] Based on this party's sample feature slices and cluster center feature slices, a first multi-party secure computation is performed with the other holders to obtain this party's distance slices of the distances between the multiple samples and the multiple cluster centers; the other distance slices are held by the other holders;

[0031] Based on this party's distance slices, a secure comparison computation is performed with the distance slices held by the other holders to obtain this party's index slices of the cluster index; wherein the cluster index characterizes which of the clusters corresponding to the multiple cluster centers a sample belongs to, and the other index slices are held by the other holders;

[0032] Based on this party's sample feature slices and index slices, a second multi-party secure computation is performed with the sample feature slices and index slices held by the other holders to obtain this party's sum value slices of the sample features belonging to each cluster; the other sum value slices are held by the other holders;

[0033] Based on this party's index slices, a secure statistical computation is performed with the index slices held by the other holders to obtain the number of samples belonging to each cluster, and the corrected number of samples belonging to each cluster is determined based on the sum of each number of samples and a first value;

[0034] Based on this party's sum value slices and the corrected numbers of samples, the updated cluster center feature slices are determined.

[0035] Thirdly, the embodiments provide a system for multi-party joint clustering of privacy data, the system comprising multiple holders; the privacy data is distributed among the multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of multiple samples to be clustered; the multiple holders are used to jointly cluster the privacy data through multiple rounds of iteration, wherein any one round of iteration includes:

[0036] Based on the sample feature slices and cluster center feature slices of each party, the first multi-party secure computation is performed among multiple holders to obtain the distance slices between multiple samples and multiple cluster centers respectively.

[0037] Based on the distance slices of each party, a secure comparison computation is performed among the multiple holders to obtain the index slices of the cluster index, wherein the cluster index characterizes which of the clusters corresponding to the multiple cluster centers a sample belongs to;

[0038] Based on the sample feature slices and index slices of each party, the multiple holders perform a second multi-party secure computation to determine the sum value slices of the sample features belonging to each cluster;

[0039] Based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain the number of samples belonging to each cluster, and determine the corrected number of samples belonging to each cluster based on the sum of each number of samples and a first value;

[0040] Based on the sum value slices of each party and the corrected numbers of samples, the multiple holders respectively determine the updated cluster center feature slices.

[0041] Fourthly, the embodiment provides an apparatus for multi-party joint clustering of privacy data, wherein the privacy data is distributed among multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of multiple samples to be clustered; the apparatus is deployed in any one holder, and the apparatus includes multiple modules that perform multiple rounds of iteration, wherein any one round of iteration includes:

[0042] The distance module is configured to perform a first multi-party secure computation with the other holders based on this party's sample feature slices and cluster center feature slices, to obtain this party's distance slices of the distances between the multiple samples and the multiple cluster centers; the other distance slices are held by the other holders.

[0043] The indexing module is configured to perform a secure comparison computation based on this party's distance slices and the distance slices held by the other holders, to obtain this party's index slices of the cluster index; wherein the cluster index characterizes which of the clusters corresponding to the multiple cluster centers a sample belongs to, and the other index slices are held by the other holders;

[0044] The sum value module is configured to perform a second multi-party secure computation based on this party's sample feature slices and index slices, together with the sample feature slices and index slices held by the other holders, to determine this party's sum value slices of the sample features belonging to each cluster; the other sum value slices are held by the other holders.

[0045] The quantity module is configured to perform a secure statistical computation based on this party's index slices and the index slices held by the other holders, to obtain the number of samples belonging to each cluster, and to determine the corrected number of samples belonging to each cluster based on the sum of each number of samples and a first value.

[0046] The update module is configured to determine the updated cluster center feature slices based on this party's sum value slices and the corrected numbers of samples.

[0047] Fifthly, an embodiment provides a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method described in any one of the first and second aspects.

[0048] In a sixth aspect, an embodiment provides a computing device including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of the first and second aspects.

[0049] In the methods and apparatus provided in the embodiments of this specification, when multiple parties jointly cluster privacy data over multiple iterations, in any iteration, after the parties have determined the sum value slices of the sample features belonging to each cluster and the number of samples belonging to each cluster, the number of samples is corrected. That is, the corrected number of samples is determined based on the sum of the number of samples and a first value. After this processing, even if the number of samples belonging to a certain cluster is zero during clustering, no error occurs when the parties jointly compute the ratio of the sum value slices to the corrected number of samples, which improves the stability of the clustering process when multiple parties jointly cluster privacy data.

Attached Figure Description

[0050] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0051] Figure 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;

[0052] Figure 2 is a flowchart of a method for multi-party collaborative clustering of privacy data provided in an embodiment;

[0053] Figure 3 is another flowchart of a method for multi-party collaborative clustering of privacy data provided in an embodiment;

[0054] Figure 4 is a schematic block diagram of a system for multi-party collaborative clustering of privacy data provided in an embodiment;

[0055] Figure 5 is a schematic block diagram of an apparatus for multi-party collaborative clustering of privacy data provided in an embodiment.

Detailed Implementation

[0056] The solution provided in this specification will now be described with reference to the accompanying drawings.

[0057] Figure 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The data to be clustered is often distributed across multiple platforms and multiple domains, that is, across multiple holders, each storing data for multiple samples that contains private data. To protect the privacy and security of the private data, the multiple holders use their respective devices to perform clustering through secure data interaction that does not disclose private data, assigning the samples of the multiple holders to the clusters corresponding to k cluster centers, where k is an integer.

[0058] K-means is an iterative clustering algorithm. Its steps are as follows: divide the data into k clusters, initialize the cluster centers of the k clusters, calculate the distance between each sample and each of the k cluster centers, and assign each sample to the cluster whose center is closest to it. After all samples have been assigned, recalculate the center of each cluster using all samples assigned to that cluster. Repeat the above process until a termination condition is met. When k-means is applied to multi-party collaborative clustering of privacy data, the clustering process can be based on Secure Multi-party Computation (MPC) to protect privacy. In the embodiments of this specification, the clustering algorithm is not limited to k-means; other clustering algorithms with a similar iterative process, or improved versions of k-means, can also be used.
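For orientation, the iteration described above can be sketched in plaintext, with no privacy protection at all. This is only an illustrative sketch: the function names are hypothetical, the termination condition (centers stop changing) is one possible choice, and keeping the previous center for an empty cluster is an assumption made here so that the plaintext version runs; the specification handles empty clusters differently, by correcting the sample count.

```python
def squared_distance(x, y):
    """Sum of squared attribute differences between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(samples, centers):
    """For each sample, return the index of its nearest cluster center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
        for x in samples
    ]


def update(samples, labels, centers):
    """Recompute each center as the mean of the samples assigned to it."""
    k, m = len(centers), len(centers[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for x, j in zip(samples, labels):
        counts[j] += 1
        for h in range(m):
            sums[j][h] += x[h]
    # Assumption for this plaintext sketch: an empty cluster keeps its old center.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(k)
    ]


def kmeans(samples, centers, max_iter=100):
    """Alternate assignment and update until the centers stop changing."""
    labels = []
    for _ in range(max_iter):
        labels = assign(samples, centers)
        new_centers = update(samples, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Note that the empty-cluster branch in `update` depends on the value of a count, which a party cannot inspect when the count is only available in secret-shared form.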

[0059] When using an iterative clustering process, the number of samples assigned to clusters can vary, and some clusters may contain no samples at all. This can lead to unexpected errors when multiple parties jointly update the cluster centers, making the clustering process highly unstable.

[0060] To improve the stability of the clustering process when multiple parties jointly cluster privacy data, this specification provides a method for joint clustering of privacy data. The method includes multiple iterations, where any one iteration includes the following steps:
Step S210: based on the sample feature slices and cluster center feature slices of each party, the multiple holders perform a first multi-party secure computation to obtain distance slices of the distances between the multiple samples and the multiple cluster centers.
Step S220: based on the distance slices of each party, the multiple holders perform a secure comparison computation to obtain index slices of the cluster index, wherein the cluster index characterizes which of the clusters corresponding to the multiple cluster centers a sample belongs to.
Step S230: based on the sample feature slices and index slices of each party, the multiple holders perform a second multi-party secure computation to determine the sum value slices of the sample features belonging to each cluster.
Step S240: based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain the number of samples belonging to each cluster, and, based on the sum of the number of samples and a first value, determine the corrected number of samples belonging to each cluster.
Step S250: based on the sum value slices and the corrected numbers of samples, the multiple holders determine the updated cluster center feature slices.

[0061] In this embodiment, after the number of samples is corrected, even if the number of samples belonging to a certain cluster is zero during clustering, no error occurs when the multiple parties jointly compute the ratio of the sum value slices to the corrected number of samples, which improves the stability of the clustering process when multiple parties jointly cluster privacy data.
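The effect of the correction can be illustrated in plaintext. The sketch below assumes the variant in which the corrected number of samples is simply the number of samples plus the first value; the concrete value 1e-6 and the function names are hypothetical, and in the method itself the division is carried out jointly on slices rather than on plaintext values.

```python
FIRST_VALUE = 1e-6  # hypothetical choice: close to 0 but not 0


def corrected_counts(counts, first_value=FIRST_VALUE):
    """Corrected number of samples per cluster: count plus the first value."""
    return [n + first_value for n in counts]


def update_centers(sums, counts, first_value=FIRST_VALUE):
    """Divide each cluster's feature sum by its corrected sample count."""
    corrected = corrected_counts(counts, first_value)
    return [[s / c for s in row] for row, c in zip(sums, corrected)]


# Cluster 0 holds two samples; cluster 1 is empty, so its sum is all zeros.
sums = [[4.0, 6.0], [0.0, 0.0]]
counts = [2, 0]
centers = update_centers(sums, counts)
```

For the empty cluster the ratio is 0 divided by the first value, which is 0 rather than a division-by-zero error, while the center of the non-empty cluster shifts only by an amount on the order of the first value.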

[0062] The above embodiment is described in detail below with reference to Figure 2. Figure 2 is a flowchart of a method for multi-party collaborative clustering of privacy data provided in an embodiment. The privacy data is distributed among multiple holders. Each holder stores multiple sample features, which are privacy data and cannot be sent out in plaintext. The sample features stored by each holder are used to construct the overall feature matrix of the samples to be clustered. The overall feature matrix is an N*m-dimensional matrix composed of the values of m attributes of N samples, where N and m are both integers. The multiple holders do not concatenate their respective samples in plaintext to obtain the overall feature matrix.

[0063] There can be two or more holders. For the sake of convenience and brevity, the following explanation uses holder L and holder R (hereinafter referred to as L and R) as an example to illustrate the execution process of this method. Multiple holders process data through their respective devices, which can be any device, equipment, platform, device cluster, etc., with computing and processing capabilities.

[0064] Sample features stored across multiple holders can be considered business data. This business data can include feature data of business objects, which can be various business objects to be analyzed, such as users, products, or events. For example, if the business object is a user, the business data can contain user data, including user-related behavioral data items and basic user attribute data items. These specific attribute items constitute the attribute values of the business data across different dimensions.

[0065] The sample features stored by multiple holders can follow at least two data distributions: vertical data partitioning and horizontal data partitioning. In the vertical data partitioning scenario, each holder possesses data on different attributes of all N business objects; in the horizontal data partitioning scenario, each holder possesses data on all m attributes of different business objects. Each business object is a sample, and the data of the business object under different attributes constitutes the sample features of that sample.

[0066] For example, in a two-party scenario, consider the sample set X = (x1, x2, ...). In a vertical data partitioning scenario, parties L and R have the same number of samples but different attributes. That is, both parties have N samples, but party L has m_L attribute items and party R has m_R attribute items, where m = m_L + m_R. In a horizontal data partitioning scenario, parties L and R have different numbers of samples but the same number of attributes. That is, both parties have m attributes, party L has N1 samples, and party R has N2 samples, where N = N1 + N2.

[0067] Regardless of the data distribution method, the attribute characteristic data of the business objects are all private data and can be stored as a private data matrix. For the security of this private data, each holder needs to keep its private data locally, without outputting plaintext data or performing plaintext aggregation. This embodiment's method is executed by multiple holders (e.g., L party and R party). The method includes the following steps.

[0068] In step S210, based on party L's sample feature slice <x>_L and cluster center feature slice <y>_L, and party R's sample feature slice <x>_R and cluster center feature slice <y>_R, holders L and R perform a first multi-party secure computation to obtain, for the distances D between the multiple samples and the multiple cluster centers, the distance slice <d>_L held by party L and the distance slice <d>_R held by party R.

[0069] In the following description, "<>" denotes a slice, and the subscript denotes the corresponding holder; for example, <x>_L denotes the slice of the sample features X held by holder L. The distance slices of the multiple holders, if reconstructed, constitute the complete distance, for example D = <d>_L + <d>_R. Reconstruction can be an addition operation, or an addition operation combined with a matrix transformation, where the matrix transformation includes, for example, multiplying by a preset value.
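A minimal sketch of additive slices and their reconstruction by addition follows. The modulus, the function names, and the use of integers are assumptions made for illustration; the specification does not fix a particular number representation.

```python
import secrets

MODULUS = 2 ** 61 - 1  # assumed modulus; not specified in the source


def make_slices(value, modulus=MODULUS):
    """Split an integer into two additive slices, one per holder."""
    slice_l = secrets.randbelow(modulus)   # uniformly random, held by L
    slice_r = (value - slice_l) % modulus  # held by R
    return slice_l, slice_r


def reconstruct(slice_l, slice_r, modulus=MODULUS):
    """Reconstruction by addition: D = <d>_L + <d>_R (mod modulus)."""
    return (slice_l + slice_r) % modulus
```

Each slice on its own is a uniformly random number, so neither holder learns the value from its own slice; only the sum of both slices recovers it.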

[0070] A sample feature slice can be understood as a slice of a matrix composed of multiple sample features. The matrix composed of the sample features held by a holder can also be called the original matrix; for example, party L possesses the original matrix X_L and party R possesses the original matrix X_R. The sample feature slices of the multiple holders are obtained as follows:

[0071] Based on party L's original matrix X_L and party R's original matrix X_R, party L's sample feature slice <x>_L and party R's sample feature slice <x>_R are obtained through a secret sharing algorithm.

[0072] In a vertical data partitioning scenario, parties L and R both have N samples, party L has m_L attribute items, and party R has m_R attribute items; therefore party L's original matrix X_L has dimension N*m_L and party R's original matrix X_R has dimension N*m_R. In a horizontal data partitioning scenario, parties L and R both have m attribute items, party L has N1 samples, and party R has N2 samples; therefore party L's original matrix X_L has dimension N1*m and party R's original matrix X_R has dimension N2*m.

[0073] The dimension of the sample feature slice held by any holder is the same as that of the total feature matrix X. When the dimension of the total feature matrix X is N*m, the sample feature slices <x>_L and <x>_R each have dimension N*m, and the sample feature slices of the multiple holders, if reconstructed, constitute the total feature matrix, for example X = <x>_L + <x>_R.

[0074] The multiple holders can determine party L's sample feature slice <x>_L and party R's sample feature slice <x>_R through Secret Matrix Multiplication (SMM). The following uses a vertical data partitioning scenario as an example to illustrate the specific execution process. The execution process for a horizontal data partitioning scenario is similar and will not be repeated here.

[0075] Holder L locally generates a first random matrix A0 and subtracts it from the original matrix X_L to obtain a first hidden matrix A. The holders then exchange hidden matrices, and holder L concatenates the hidden matrix B obtained from the other holder with the first random matrix A0 in a predetermined order to form party L's sample feature slice <x>_L. The predetermined order can be, for example, a predetermined left-right position order.

[0076] For example, the first hidden matrix A can be calculated according to A = X_L - A0. In the vertical data partitioning scenario, party L's sample feature slice <x>_L can be obtained by concatenation in the form <x>_L = (A0, B).

[0077] Similarly, holder R locally generates a second random matrix B0 and subtracts it from the original matrix X_R to obtain a second hidden matrix B. The holders exchange hidden matrices, and holder R concatenates the hidden matrix A obtained from the other holder with the second random matrix B0 in the predetermined order to form party R's sample feature slice <x>_R.

[0078] For example, the second hidden matrix B can be calculated according to B = X_R - B0. In the vertical data partitioning scenario, party R's sample feature slice <x>_R can be obtained by concatenation in the form <x>_R = (A, B0).

[0079] Through the aforementioned secure data exchange, holder L obtains the sample feature slice <x>_L and holder R obtains the sample feature slice <x>_R, and both slices have the same dimension as the total feature matrix.
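The exchange described above for the vertical partitioning scenario can be sketched as follows. Both holders are simulated in one process, integer entries with modular arithmetic are an assumption made here so that reconstruction is exact, and all function names are hypothetical.

```python
import secrets

MODULUS = 2 ** 61 - 1  # assumed modulus; not specified in the source


def random_matrix(rows, cols):
    """Matrix of uniformly random entries, generated locally by one holder."""
    return [[secrets.randbelow(MODULUS) for _ in range(cols)] for _ in range(rows)]


def subtract(x, y):
    return [[(a - b) % MODULUS for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def add(x, y):
    return [[(a + b) % MODULUS for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def hconcat(left, right):
    """Concatenate two matrices column-wise (left block, right block)."""
    return [row_l + row_r for row_l, row_r in zip(left, right)]


def share_vertical(x_l, x_r):
    """Return (<x>_L, <x>_R) for original matrices X_L (N*m_L) and X_R (N*m_R)."""
    n = len(x_l)
    a0 = random_matrix(n, len(x_l[0]))  # first random matrix, stays with L
    b0 = random_matrix(n, len(x_r[0]))  # second random matrix, stays with R
    a = subtract(x_l, a0)               # hidden matrix A = X_L - A0, sent to R
    b = subtract(x_r, b0)               # hidden matrix B = X_R - B0, sent to L
    slice_l = hconcat(a0, b)            # <x>_L = (A0, B)
    slice_r = hconcat(a, b0)            # <x>_R = (A, B0)
    return slice_l, slice_r
```

Adding the two slices gives (A0 + A, B + B0) = (X_L, X_R), which is the total feature matrix in the vertical scenario, while each holder only ever sees the other party's data masked by a random matrix.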

[0080] In one implementation, the sample feature slices <x>_L and <x>_R can both be matrices in which rows represent samples and columns represent attributes. Alternatively, they can both be matrices in which columns represent samples and rows represent attributes.

[0081] To reduce communication among multiple holders, the initial k cluster center feature slices can be determined in the following way:

[0082] Each holder extracts a number of agreed-upon sample features from its own sample feature slices, which serve as the holder's initial multiple cluster center feature slices.

[0083] For example, when the sample feature slices <x>_L and <x>_R both have rows representing samples and columns representing attributes, parties L and R can each extract the first k rows from their own sample feature slice to form a matrix, thereby obtaining their respective initial cluster center feature slices.
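This initialization needs no communication, because each holder only copies rows of a slice it already holds. A small sketch, with hypothetical names and with toy slices that use plain integer addition for readability:

```python
def initial_center_slices(feature_slice, k):
    """Return copies of the first k rows (sample slices) of a feature slice."""
    return [list(row) for row in feature_slice[:k]]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy slices with X = <x>_L + <x>_R; rows are samples, columns are attributes.
x_slice_l = [[5, 1], [2, 2], [9, 9]]
x_slice_r = [[-4, 0], [1, 1], [0, 0]]

y_slice_l = initial_center_slices(x_slice_l, k=2)  # done locally by L
y_slice_r = initial_center_slices(x_slice_r, k=2)  # done locally by R
```

If the two center slices were reconstructed, they would equal the first k samples of the total feature matrix, so the initial centers are real samples that neither holder sees in plaintext.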

[0084] When holders L and R perform the first multi-party secure computation, the distance between the i-th sample X_i and the j-th cluster center Y_j can be calculated according to the following distance formula:

[0085] $$D_{ij} = \sum_{h} (X_{ih} - Y_{jh})^2$$

[0086] Here, D_ij represents the distance between the i-th sample X_i and the j-th cluster center Y_j, X_ih represents the h-th attribute value of the i-th sample X_i, Y_jh represents the h-th attribute value of the j-th cluster center Y_j, and Σ is the summation symbol, taken over the attribute index h. X_i and Y_j represent sample features and cluster center features, respectively.

[0087] When determining the distance slices <d>_L and <d>_R based on party L's sample feature slice <x>_L and cluster center feature slice <y>_L and party R's sample feature slice <x>_R and cluster center feature slice <y>_R, the relations X = <x>_L + <x>_R and Y = <y>_L + <y>_R are substituted into the distance formula and expanded into multiple terms. Terms whose subscripts are all L are computed locally by party L, and terms whose subscripts are all R are computed locally by party R. Mixed terms involving both L and R subscripts can be computed with secret sharing matrix multiplication from multi-party secure computation, using a multiplication triple protocol to carry out the private computation in additive-sharing form.
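One way to organize this expansion, shown here for a single sample and a single cluster center, is to let party L form u_h from its own sample and center slices and party R form v_h from its own; the distance is then the sum over h of u_h^2 + v_h^2 + 2*u_h*v_h, where only the last term mixes both parties. The sketch below simulates both parties in one process and makes several assumptions that are not taken from the source: integer slices modulo a prime, a trusted dealer that hands out multiplication triples, and all function names.

```python
import secrets

P = 2 ** 61 - 1  # assumed prime modulus for additive slices


def share(value):
    """Split a value into two additive slices modulo P."""
    r = secrets.randbelow(P)
    return r, (value - r) % P


def beaver_triple():
    # Assumption: a trusted dealer provides slices of a, b and c = a * b.
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)


def secure_multiply(x_slices, y_slices):
    """Return slices of x * y using one multiplication triple."""
    (a_l, a_r), (b_l, b_r), (c_l, c_r) = beaver_triple()
    # The parties open only the masked values e = x - a and f = y - b.
    e = (x_slices[0] - a_l + x_slices[1] - a_r) % P
    f = (y_slices[0] - b_l + y_slices[1] - b_r) % P
    z_l = (c_l + e * b_l + f * a_l + e * f) % P
    z_r = (c_r + e * b_r + f * a_r) % P
    return z_l, z_r


def distance_slices(x_l, x_r, y_l, y_r):
    """Slices of sum_h (X_h - Y_h)^2 for one sample and one cluster center."""
    d_l = d_r = 0
    for xl, xr, yl, yr in zip(x_l, x_r, y_l, y_r):
        u = (xl - yl) % P  # all-L term, computed locally by L
        v = (xr - yr) % P  # all-R term, computed locally by R
        # Mixed term u * v: treat u as slices (u, 0) and v as slices (0, v).
        cross_l, cross_r = secure_multiply((u, 0), (0, v))
        d_l = (d_l + u * u + 2 * cross_l) % P
        d_r = (d_r + v * v + 2 * cross_r) % P
    return d_l, d_r
```

Because u + v equals X_h - Y_h modulo P, the two returned slices add up to the squared distance, yet the only values ever opened are the masked quantities e and f.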

[0088] The above calculation modifies the Euclidean distance formula by removing the square root. Because the square root is monotonically increasing on non-negative values, comparing squared distances selects the same nearest cluster center as comparing Euclidean distances, so the first multi-party secure computation performed in this way leads to the same clustering result. This preserves accuracy while reducing computational complexity and improving processing efficiency.
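The claim that dropping the square root does not change the assignment can be checked on plaintext values. The helper names below are hypothetical:

```python
import math


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest_center_squared(x, centers):
    """Nearest center by the square-root-free distance."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))


def nearest_center_euclidean(x, centers):
    """Nearest center by the ordinary Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda j: math.sqrt(squared_distance(x, centers[j])),
    )
```

Both functions rank the centers in the same order, so only the distance values differ, not the cluster chosen for a sample.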

[0089] Performing the first multi-party secure computation between holders L and R means that they exchange data according to that computation and determine the distance slices through secure data interaction. A distance slice can be represented as a matrix. Each sample has k distances to the k cluster centers, so the distances between the N samples and the k cluster centers constitute an N*k-dimensional distance matrix, and each distance slice is likewise N*k-dimensional.

[0090] In step S220, based on party L's distance slice <d>_L and party R's distance slice <d>_R, the multiple holders perform a secure comparison computation to obtain party L's index slice <c>_L and party R's index slice <c>_R of the cluster index.

[0091] The cluster index characterizes which of the clusters corresponding to the multiple cluster centers each sample belongs to. The index slices held by the different holders would, if reconstructed, yield the complete cluster index. No holder can infer the complete data from its own slice alone, which protects the complete data from leakage.

[0092] A complete cluster index records the cluster to which each sample belongs. The cluster index can therefore be represented by an N-dimensional vector, and each index slice can likewise be represented by an N-dimensional vector. In the secure comparison computation, party L and party R compare, on the basis of their distance slices, the distances between each sample and the k cluster centers, find the minimum distance, and take the cluster corresponding to the minimum distance as the sample's cluster. The cluster assignments of the N samples constitute the index slices of each party. The entire comparison uses secure comparison, so neither the distances between samples and cluster centers nor the index information is leaked.
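The input and output of step S220 can be sketched with a plaintext stand-in. The function below is not a secure comparison protocol: it reconstructs the distances, which a real protocol never does, and serves only to show what the index slices should reconstruct to. The one-hot N*k encoding of the index and the name `index_slices_reference` are assumptions made for illustration.

```python
import numpy as np

def index_slices_reference(d_l, d_r, rng):
    """Plaintext stand-in for secure comparison: additive shares of a one-hot cluster index."""
    d = d_l + d_r                        # a real protocol never reconstructs this
    nearest = d.argmin(axis=1)           # cluster with the minimum distance, per sample
    c = np.eye(d.shape[1])[nearest]      # one-hot N x k index (assumed encoding)
    c_l = rng.normal(size=c.shape)       # random mask becomes party L's slice
    return c_l, c - c_l

rng = np.random.default_rng(0)
d = np.array([[1.0, 4.0], [9.0, 2.0], [0.5, 0.6]])   # N = 3 samples, k = 2 centers
d_l = rng.normal(size=d.shape)
d_r = d - d_l
c_l, c_r = index_slices_reference(d_l, d_r, rng)
labels = np.rint(c_l + c_r).argmax(axis=1)           # reconstruction, for checking only
print(labels)
```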

[0093] Secure comparison is a relatively mature technology. Step S220 can use one or more existing secure comparison algorithms, for example a binary-tree-based secure comparison. The specific details are not elaborated in this embodiment.

[0094] In step S230, based on party L's sample feature slice <x>_L and index slice <c>_L and party R's sample feature slice <x>_R and index slice <c>_R, the multiple holders perform a second multi-party secure computation to determine the sum value slices of the sample features belonging to each cluster, namely party L's sum value slice <s>_L and party R's sum value slice <s>_R.

[0095] The second multi-party secure computation may include secret-sharing matrix multiplication. The sample feature slices contain the feature vector information of each sample, and the index slices contain the information about the cluster to which each sample belongs. Therefore, by applying secret-sharing matrix multiplication to each party's sample feature slice and index slice, the multiple holders obtain their respective sum value slices. A single holder cannot learn from its index slice alone which cluster a sample belongs to, and with secret-sharing matrix multiplication the holders can sum the feature vectors of all samples belonging to a given cluster without reconstructing the complete cluster index. The sum value slice contains k m-dimensional feature vectors, one per cluster, so it can be represented by a k*m matrix. For the specific implementation of step S230, reference can be made to existing technologies; the specific formulas and procedures are not elaborated in this embodiment.
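One way to picture the per-cluster summation is as the matrix product of the transposed one-hot index with the feature matrix, evaluated on shares. The sketch below assumes a one-hot N*k index encoding, floating-point shares, and a matrix multiplication triple from a hypothetical trusted dealer simulated in-process; `share` and `shared_matmul` are illustrative names, not the specification's.

```python
import numpy as np

rng = np.random.default_rng(1)

def share(v):
    """Split v into two additive shares so that v = s_l + s_r."""
    s_l = rng.normal(size=v.shape)
    return s_l, v - s_l

def shared_matmul(a_l, a_r, b_l, b_r):
    """Additive shares of A @ B from additive shares of A and B, using a matrix triple."""
    # Hypothetical dealer: random U, V and W = U @ V, handed out as shares.
    u = rng.normal(size=a_l.shape)
    v = rng.normal(size=b_l.shape)
    (u_l, u_r), (v_l, v_r), (w_l, w_r) = share(u), share(v), share(u @ v)
    e = (a_l - u_l) + (a_r - u_r)    # opened value A - U
    f = (b_l - v_l) + (b_r - v_r)    # opened value B - V
    # A @ B = E @ F + E @ V + U @ F + W; party L adds the public E @ F term.
    p_l = e @ f + e @ v_l + u_l @ f + w_l
    p_r = e @ v_r + u_r @ f + w_r
    return p_l, p_r

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])   # N = 4, m = 2
C = np.eye(2)[[0, 1, 0, 1]]                                      # one-hot index, k = 2
x_l, x_r = share(X)
c_l, c_r = share(C)

# Sum of the features of the samples in each cluster: S = C^T @ X (k x m).
s_l, s_r = shared_matmul(c_l.T, c_r.T, x_l, x_r)
print(np.round(s_l + s_r, 6))
```

The complete index C is never reconstructed inside `shared_matmul`; only the masked values are opened.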

[0096] In step S240, based on party L's index slice <c>_L and party R's index slice <c>_R, the multiple holders perform a secure statistical computation to obtain the number of samples num belonging to each cluster, and determine the corrected number of samples belonging to each cluster based on the sum of each sample number num and the first value Q.

[0097] The index slices <c>_L and <c>_R jointly characterize the cluster to which each sample belongs, from which the number of samples in each cluster can be counted. A single holder cannot count the samples belonging to each cluster from its own index slice alone; the multiple holders obtain these counts by performing a secure statistical computation.

[0098] The secure statistical computation performed among the multiple holders on the basis of their index slices can use existing secure statistical computation algorithms.
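As a minimal sketch of one possible counting scheme: if the index is held as additive shares of a one-hot N*k matrix (an assumed encoding, not stated in the specification), each holder can sum the columns of its own slice locally, and the resulting count slices add up to the per-cluster sample counts.

```python
import numpy as np

rng = np.random.default_rng(2)

labels = np.array([0, 2, 0, 0, 2])      # plaintext labels, used only to build the demo
k = 3
C = np.eye(k)[labels]                   # one-hot index, N x k
c_l = rng.normal(size=C.shape)          # party L's index slice
c_r = C - c_l                           # party R's index slice

# Column sums are linear, so each party computes its count slice locally.
num_l = c_l.sum(axis=0)
num_r = c_r.sum(axis=0)

# Secret-sharing addition: exchanging and adding the slices yields the complete counts.
num = np.rint(num_l + num_r).astype(int)
print(num)
```

In this demo the second cluster receives no samples, which is exactly the case the first value Q is meant to handle.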

[0099] The first value Q can be a value close to 0 but not zero, for example 10^-12 or 10^-11. The first value Q can be either positive or negative.

[0100] The corrected number of samples belonging to each cluster can be determined from the sum of each sample count num and the first value Q. For example, the corrected number of samples can be expressed as Q + num.

[0101] The corrected number of samples can also be determined as follows: for any first sample count num1, the corrected number of samples of the cluster corresponding to num1 is determined based on the ratio of a sum to a difference.

[0102] Here, the sum is the sum of the first sample count num1 and the first value Q, and the difference is the difference between 1 and the first sample count num1. For example, the corrected number of samples can be expressed as:

[0103] K1 = (Q + num1) / (1 - num1)    (1)

[0104] Here, K1 is the corrected number of samples of the cluster corresponding to the first sample count num1. In a specific implementation, this ratio can be used directly as the corresponding corrected number of samples, or the ratio can be transformed in a preset manner to obtain the corresponding corrected number of samples.

[0105] Because it's possible for the number of samples belonging to a particular cluster to be zero during clustering, a serious error can occur in joint calculations when determining the mean of sample features (e.g., by combining the sum of sample features with the number of samples). Adding a small value to the number of samples can prevent this situation from happening, thus improving the stability of the joint clustering process.
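The effect of the small additive correction can be seen in a short plaintext example. It is a sketch of the Q + num form only, with Q = 1e-12 chosen for illustration; the arrays are made-up demo data.

```python
import numpy as np

Q = 1e-12                                   # first value: close to 0 but not 0
sums = np.array([[6.0, 8.0], [0.0, 0.0]])   # per-cluster feature sums (k x m)
num = np.array([2.0, 0.0])                  # the second cluster is empty

# Uncorrected mean: the empty cluster produces 0/0, which is undefined (nan).
with np.errstate(divide="ignore", invalid="ignore"):
    naive = sums / num[:, None]

# Corrected mean: the denominator Q + num is never zero.
corrected = sums / (num + Q)[:, None]

print(naive)
print(corrected)
```

For the non-empty cluster the corrected mean differs from the exact mean only by a relative amount on the order of Q, while the empty cluster yields a well-defined zero vector.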

[0106] To avoid the slight decrease in accuracy caused by directly adding a very small value to the denominator, the above formula (1) can be used to overcome this problem. After many experiments, it was found that the above formula (1) performs very well in practical applications, overcoming the error problem when the denominator is 0 and improving the accuracy of the calculation.

[0107] The aforementioned first value Q can be a preset value determined empirically, or it can be a value obtained after parameter tuning over multiple training cycles. In practical applications, plaintext datasets from a single device can be used for cluster training, with the first value Q used as a hyperparameter to be trained. The first value Q is adjusted over multiple training cycles. Each training cycle includes a process of performing clustering through multiple iterations until the clustering stopping condition is met.

[0108] The form of the sample count is described next. In one case, the sample count obtained in step S240 is complete data: holders L and R each receive the complete number of samples belonging to each cluster rather than slices of it.

[0109] In other cases, holders L and R obtain sample count slices for each cluster by performing the secure statistical computation on their respective index slices. For example, for the j-th cluster, L and R obtain the sample count slices <num_j>_L and <num_j>_R, respectively. For any j-th cluster, the sample count slices corresponding to that cluster would, if reconstructed, yield the complete sample count.

[0110] For any given j-th cluster, multiple holders can obtain the sample quantity shards belonging to that j-th cluster by performing security statistical calculations based on their respective index shards. This step can be implemented using existing technologies, and the specific implementation process can be referred to existing technologies, which will not be elaborated in this embodiment.

[0111] In one implementation, when the multiple holders each obtain sample count slices, they can exchange data using secret-sharing addition over their respective slices to obtain the sum of the slices, that is, the complete number of samples belonging to each cluster.

[0112] The aforementioned sample correction quantity can be complete data or fragmented data. When the sample quantity is complete data, the sample correction quantity determined based on the sum of multiple sample quantities with the first value Q is also complete data. In practice, any holder can directly determine the complete sample correction quantity belonging to each cluster by summing multiple sample quantities with the first value Q. Alternatively, for any complete first sample quantity num1, the ratio of the sum to the difference can be directly determined as the complete sample correction quantity for the cluster corresponding to the first sample quantity num1, where the sum is the sum of the complete first sample quantity num1 and the first value Q, and the difference is the difference between 1 and the complete first sample quantity num1. When using complete data for the sample quantity, all parties can reduce data interaction, thereby improving processing efficiency.

[0113] When the sample count is sliced data, the corrected number of samples belonging to each cluster is determined from the sample count slices and the first value Q as follows: based on their respective sample count slices, the multiple holders perform a third multi-party secure computation that sums the sample counts with the first value Q, and thereby obtain slices of the corrected number of samples for each cluster.

[0114] The operation of summing any first sample count num1 with the first value Q can be expressed as Q + (<num1>_L + <num1>_R), or as K1 = [Q + (<num1>_L + <num1>_R)] / [1 - (<num1>_L + <num1>_R)]. Guided by these formulas, the multiple holders can perform the third multi-party secure computation (for example, based on secure matrix multiplication) to obtain the slices of the corrected number of samples. Keeping both the sample counts and the corrected numbers of samples in sliced form strengthens the protection of the privacy data.

[0115] The counting of samples for each cluster in step S240 and the computation of the sum value slices in step S230 can be performed in either order or simultaneously.

[0116] In step S250, based on each party's sum value slice and the corrected number of samples, the multiple holders respectively determine the updated cluster center feature slices.

[0117] When updating cluster centers, for any j-th cluster, the mean of multiple sample features can be calculated by the ratio of the sum of the features of multiple samples belonging to the j-th cluster to the number of samples. This mean can be determined as the updated cluster center of the j-th cluster, or the result after a preset transformation of the mean can be determined as the updated cluster center of the j-th cluster.

[0118] For any j-th cluster, when the corrected number of samples is complete data, each holder can individually determine its slice of the updated cluster center feature of the j-th cluster as the ratio of its own sum value slice to the corrected number of samples. In this way, the updated cluster center features of all k clusters can be determined.
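When the corrected number of samples is known in full to every holder, the update needs no interaction, because dividing each additive share by the same public number divides their sum by that number. The following sketch assumes floating-point shares, the Q + num form of the correction, and made-up demo data.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = 1e-12

s = np.array([[6.0, 8.0], [0.0, 0.0]])   # per-cluster feature sums (k x m)
s_l = rng.normal(size=s.shape)           # party L's sum value slice
s_r = s - s_l                            # party R's sum value slice

num = np.array([2.0, 0.0])               # complete sample counts, known to both parties
corrected = num + Q                      # corrected number of samples

# Each party divides its own slice locally by the public corrected count.
y_l = s_l / corrected[:, None]
y_r = s_r / corrected[:, None]

print(np.round(y_l + y_r, 6))            # reconstruction, for checking only
```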

[0119] For any j-th cluster, when the corrected number of samples is sliced data (that is, held as slices of the corrected number of samples), the multiple holders can obtain their updated cluster center feature slices by performing secure matrix multiplication on their sum value slices and their slices of the corrected number of samples. The secure matrix multiplication can be a secret-sharing matrix multiplication.

[0120] For example, party L and party R can obtain the sample feature mean slices using the following formula:

[0121]

[0122] Here, num1 is the number of samples belonging to the j-th cluster, with num1 = <num1>_L + <num1>_R. The sum sum1 can be represented by its sum value slices, that is, sum1 = <sum1>_L + <sum1>_R. Based on equation (2), by performing secure matrix multiplication, party L and party R each obtain a sample feature mean slice. The multiple holders can then determine the updated cluster center feature slices from their respective sample feature mean slices.

[0123] Based on equation (2) above, when the number of samples belonging to a certain cluster is 1, the mean of the sample features is 0, and the cluster center feature of that cluster can be updated to 0. When the number of samples belonging to a certain cluster is 0, the sum of the sample features, sum1, is 0, so the mean of the sample features obtained based on equation (2) is also 0, and the cluster center feature of that cluster can be updated to 0. When the number of samples belonging to a certain cluster is a large value, the mean of the sample features can also be obtained through equation (2) in the same way as in plaintext.

[0124] The clustering process can be iterated multiple times and terminated when a termination condition is met, yielding the final cluster centers. The termination condition can be achieving a globally better or locally better outcome, or reaching a specified number of iterations.
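For orientation, the whole iteration and its termination can be written as a plaintext reference against which a secure implementation could be compared. This sketch makes several assumptions that are not taken from the specification: the Q + num form of the correction, a convergence test on how far the centers move, and the names `kmeans_reference`, `max_iter`, and `tol`.

```python
import numpy as np

def kmeans_reference(X, centers, Q=1e-12, max_iter=100, tol=1e-9):
    """Plaintext reference loop: squared distances, Q-corrected counts, two stop rules."""
    k = centers.shape[0]
    for it in range(1, max_iter + 1):                    # stop rule 1: iteration budget
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # no square root
        labels = d.argmin(axis=1)                        # nearest center per sample
        C = np.eye(k)[labels]                            # one-hot cluster index
        sums = C.T @ X                                   # per-cluster feature sums
        new_centers = sums / (C.sum(axis=0) + Q)[:, None]
        moved = np.abs(new_centers - centers).max()
        centers = new_centers
        if moved < tol:                                  # stop rule 2: centers stopped moving
            break
    return centers, labels, it

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = X[[0, 2]]                                         # two samples as initial centers
centers, labels, iterations = kmeans_reference(X, init)
print(np.round(centers, 6), labels, iterations)
```

In the secure setting each of the four intermediate quantities (distances, index, sums, counts) would be held as slices rather than in the clear.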

[0125] Figure 3 is another flowchart of a method for multi-party joint clustering of privacy data provided in an embodiment. The embodiment of Figure 3 is derived from the embodiment shown in Figure 2, and its executing entity is any one of the multiple holders. For parts that are the same as or similar to Figure 2, refer to the embodiment shown in Figure 2; they are not repeated here. In this embodiment, any round of iteration includes the following steps.

[0126] Step S310: Based on this party's sample feature slice and cluster center feature slice, perform a first multi-party secure computation with the other holders to obtain this party's distance slice of the distances between the multiple samples and the multiple cluster centers. The other distance slices are held by the other holders.

[0127] Step S320: Based on the distance shards held by this party, a secure comparison calculation is performed with the distance shards held by other holders to obtain the index shards of the cluster index. The cluster index is used to characterize the affiliation of a sample within the clusters corresponding to multiple cluster centers; other index shards are held by other holders.

[0128] Step S330: Based on the sample feature slices and index slices held by this party, perform a second multi-party secure computation with the sample feature slices and index slices held by other holders to obtain the sum slices of sample features belonging to each cluster. Other sum slices are held by other holders.

[0129] Step S340: Based on the index slice held by this party, perform a secure statistical computation with the index slices held by the other holders to obtain the number of samples belonging to each cluster, and determine the corrected number of samples belonging to each cluster based on the sum of each number of samples and the first value.

[0130] Step S350: Based on this party's sum value slice and the corrected number of samples, determine the updated cluster center feature slice.

[0131] In the above embodiments, multiple parties perform privacy-preserving data clustering based on secure computation, achieving data security protection without revealing the input data throughout the process. The iterative process improves the plaintext computation process, avoiding complex square root and division operations within the cryptographic domain. This embodiment also supports batch processing, thus exhibiting good scalability when handling extremely large datasets. By adjusting the first value, this embodiment can achieve clustering results that are almost identical to the plaintext, with high accuracy.

[0132] In this specification, the terms "first" in phrases such as "first multi-party secure computation" and "first numerical value," as well as "second" in the text, are used merely for ease of distinction and description and do not have any limiting meaning.

[0133] The foregoing description describes specific embodiments of this specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than those shown in the embodiments, and the desired result may still be achieved. Furthermore, the processes depicted in the drawings do not necessarily need to follow the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0134] Figure 4 This is a schematic block diagram of a system for joint clustering of privacy data by multiple parties, provided as an embodiment. The system includes multiple holders, such as holder 410, holder 420, and holder 430. Privacy data is distributed among these multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of the samples to be clustered. The multiple holders are used to jointly cluster the privacy data through multiple iterations, wherein any one iteration includes:

[0135] Based on the sample feature slices and cluster center feature slices of each party, the first multi-party secure computation is performed among multiple holders to obtain the distance slices between multiple samples and multiple cluster centers respectively.

[0136] Based on the distance sharding of each party, a security comparison calculation is performed among multiple holders to obtain the index sharding of the cluster index, which is used to characterize the affiliation of a sample in the clusters corresponding to multiple cluster centers;

[0137] Based on the sample feature sharding and index sharding of each party, multiple holders perform a second multi-party secure computation to determine the sum value sharding of sample features belonging to each type of cluster.

[0138] Based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain the number of samples belonging to each cluster, and determine the corrected number of samples belonging to each cluster based on the sum of each number of samples and the first value.

[0139] Based on the sum value sharding of each party and the number of sample corrections, multiple holders respectively determine the updated cluster center feature sharding.

[0140] In one implementation, the first value is a value that is close to 0 but not 0.

[0141] In one implementation, the multiple holders determining the corrected number of samples belonging to each cluster includes:

[0142] The sum of the number of samples and the first value is used to determine the corrected number of samples belonging to each cluster.

[0143] In one implementation, the multiple holders determining the corrected number of samples belonging to each cluster includes:

[0144] For any first sample count, the corrected number of samples of the cluster corresponding to the first sample count is determined based on the ratio of a sum to a difference, wherein the sum is the sum of the first sample count and the first value, and the difference is the difference between 1 and the first sample count.

[0145] In one implementation, the multiple holders performing the secure statistical computation based on the index slices of each party includes:

[0146] Based on the index slices of each party, the multiple holders perform a secure statistical computation to obtain sample count slices for each cluster, wherein, for any cluster, the sample count slices of that cluster would, if reconstructed, yield the corresponding sample count.

[0147] The multiple holders determining the corrected number of samples belonging to each cluster includes:

[0148] Based on the sample count slices of each party, the multiple holders perform a third multi-party secure computation that sums the sample counts with the first value, and obtain slices of the corrected number of samples for each cluster.

[0149] The multiple holders respectively determining the updated cluster center feature slices includes:

[0150] Based on the sum value slices and the slices of the corrected number of samples of each party, the multiple holders perform secure matrix multiplication to obtain the updated cluster center feature slices respectively.

[0151] In one implementation, the first value is obtained after parameter tuning over multiple training cycles.

[0152] In one implementation, the sample feature slices from multiple holders are obtained in the following manner:

[0153] Multiple holders obtain their own sample feature slices based on their respective sample features through a secret sharing algorithm; wherein the dimension of any holder's sample feature slice is the same as the total feature matrix.
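One way such full-size slices could arise is sketched below. It assumes a vertical split in which holder L owns the first two feature columns and holder R the third; each holder embeds its columns in a zero matrix of the total shape, splits that matrix additively, and sends one part to the other holder. The split, the function name `embed_and_share`, and the demo sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def embed_and_share(block, cols, shape):
    """Place private columns into a zero matrix of the total shape, then split it additively."""
    full = np.zeros(shape)
    full[:, cols] = block
    keep = rng.normal(size=shape)     # part kept locally
    return keep, full - keep          # (kept locally, sent to the other holder)

N, m = 4, 3
feat_l = rng.normal(size=(N, 2))      # holder L's private features (columns 0 and 1)
feat_r = rng.normal(size=(N, 1))      # holder R's private features (column 2)

l_keep, l_send = embed_and_share(feat_l, [0, 1], (N, m))
r_keep, r_send = embed_and_share(feat_r, [2], (N, m))

# Each holder's slice is its kept part plus the part received from the other holder.
x_l = l_keep + r_send
x_r = r_keep + l_send

total = np.hstack([feat_l, feat_r])   # total feature matrix, built here only for checking
print(x_l.shape, x_r.shape, np.allclose(x_l + x_r, total))
```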

[0154] In one implementation, multiple holders determine initial multiple cluster center feature slices in the following manner:

[0155] Each holder extracts a number of agreed-upon sample features from its own sample feature slices, which serve as the holder's initial multiple cluster center feature slices.
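Because the sample feature slices are additive, taking the same agreed rows from each holder's slice yields additive slices of the corresponding samples, which can then serve as the initial cluster centers. The row positions in `agreed_rows` below are hypothetical; the specification only says that the samples are agreed upon.

```python
import numpy as np

rng = np.random.default_rng(5)

X = rng.normal(size=(6, 2))       # total feature matrix, built here only for checking
x_l = rng.normal(size=X.shape)    # holder L's sample feature slice
x_r = X - x_l                     # holder R's sample feature slice

agreed_rows = [0, 3, 5]           # hypothetical agreed sample positions, so k = 3

# Each holder extracts the agreed rows from its own slice, with no interaction.
y_l = x_l[agreed_rows]
y_r = x_r[agreed_rows]

print(np.allclose(y_l + y_r, X[agreed_rows]))
```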

[0156] In one implementation, when multiple holders perform a first multi-party secure computation, it includes:

[0157] Multiple holders perform a first multi-party secure computation based on the improved Euclidean distance formula; wherein the improved Euclidean distance formula does not include square root operations.

[0158] Figure 5 is a schematic block diagram of an apparatus for multi-party joint clustering of privacy data provided in an embodiment. The privacy data is distributed among multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of the multiple samples to be clustered. This apparatus embodiment corresponds to the method embodiment shown in Figure 2. The apparatus 500 is deployed in any one holder and includes several modules that perform multiple iterations, wherein any one iteration involves:

[0159] The distance module 510 is configured to perform a first multi-party secure computation with the other holders, based on this party's sample feature slice and cluster center feature slice, to obtain this party's distance slice of the distances between the multiple samples and the multiple cluster centers; the other distance slices are held by the other holders.

[0160] The index module 520 is configured to perform a secure comparison calculation based on its own distance shards and the distance shards held by other holders to obtain the index shards of the cluster index; wherein, the cluster index is used to characterize the affiliation of a sample in the clusters corresponding to multiple cluster centers, and the other index shards are held by the other holders;

[0161] The sum value module 530 is configured to perform a second multi-party secure computation based on its own sample feature slices and index slices, together with the sample feature slices and index slices held by the other holders, to determine the sum value slices of sample features belonging to each cluster; other sum value slices are held by the other holders.

[0162] The quantity module 540 is configured to perform secure statistical calculations based on its own index shards and the index shards held by other holders to obtain the number of samples belonging to each cluster, and to determine the corrected number of samples belonging to each cluster based on the sum of multiple sample numbers and a first value.

[0163] The update module 550 is configured to determine the updated cluster center feature slices based on the sum value slices of this side and the number of sample corrections.

[0164] In one implementation, the first value is a value that is close to 0 but not 0.

[0165] In one implementation, the quantity module 540 determining the corrected number of samples belonging to each cluster includes:

[0166] The sum of the number of samples and the first value is used to determine the corrected number of samples belonging to each cluster.

[0167] In one implementation, the quantity module 540 determining the corrected number of samples belonging to each cluster includes:

[0168] For any first sample count, the corrected number of samples of the cluster corresponding to the first sample count is determined based on the ratio of a sum to a difference, wherein the sum is the sum of the first sample count and the first value, and the difference is the difference between 1 and the first sample count.

[0169] In one embodiment, the quantity module 540 includes:

[0170] The statistics submodule (not shown in the figure) is configured to perform a secure statistical computation with the other holders, based on this party's index slice and the index slices held by the other holders, to obtain this party's sample count slices for each cluster, wherein, for any cluster, the sample count slices of that cluster would, if reconstructed, yield the corresponding sample count.

[0171] The correction submodule (not shown in the figure) is configured to perform, based on this party's sample count slices and together with the other holders, a third multi-party secure computation that sums the sample counts with the first value, to obtain this party's slices of the corrected number of samples for each cluster; the other slices of the corrected number of samples for each cluster are held by the other holders.

[0172] The update module 550 is specifically configured as follows:

[0173] Based on this party's sum value slice and its slice of the corrected number of samples, secure matrix multiplication is performed with the other holders to obtain this party's updated cluster center feature slice; the other updated cluster center feature slices are held by the other holders.

[0174] In one implementation, the first value is obtained after parameter tuning over multiple training cycles.

[0175] In one embodiment, the apparatus 500 further includes a determining module (not shown) configured to determine sample feature slices in the following manner:

[0176] Based on this party's sample features, this party's sample feature slice is obtained together with the other holders through a secret sharing algorithm, wherein the dimension of any holder's sample feature slice is the same as that of the total feature matrix, and the other sample feature slices are held by the other holders.

[0177] In one embodiment, the apparatus 500 further includes an initialization module (not shown) configured to determine initial cluster center feature slices in the following manner:

[0178] Multiple agreed-upon sample features are extracted from the sample feature segments of this party to serve as the initial multiple cluster center feature segments of this party; the initial multiple cluster center feature segments of other holders are extracted from the sample feature segments of the other holders according to the agreement.

[0179] In one embodiment, the distance module 510 is specifically configured as follows:

[0180] A first multi-party secure computation is performed with the other holders based on the improved Euclidean distance formula, wherein the improved Euclidean distance formula does not include a square root operation.

[0181] The above-described apparatus embodiments correspond to the method embodiments, and detailed descriptions can be found in the description of the method embodiments section, which will not be repeated here. The apparatus embodiments are derived based on the corresponding method embodiments and have the same technical effects as the corresponding method embodiments; detailed descriptions can be found in the corresponding method embodiments.

[0182] This specification also provides a computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method described in any one of Figures 1 to 3.

[0183] This specification also provides a computing device, including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method described in any one of Figures 1 to 3.

[0184] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the embodiments for storage media and computing devices are basically similar to the method embodiments, so they are described more simply; relevant parts can be referred to the descriptions of the method embodiments.

[0185] Those skilled in the art will recognize that the functions described in the embodiments of the present invention in one or more of the above examples can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

[0186] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, or improvements made based on the technical solutions of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for jointly clustering privacy data by multiple parties, wherein the privacy data is distributed among multiple holders, and the sample features stored by the multiple holders are used to form a total feature matrix of multiple samples to be clustered; the method is executed by the multiple holders and includes multiple iterations, wherein any one iteration includes: based on the sample feature slices and cluster center feature slices of each party, performing a first multi-party secure computation among the multiple holders to obtain, respectively, distance slices of the distances between the multiple samples and multiple cluster centers; based on the distance slices of each party, performing a secure comparison computation among the multiple holders to obtain, respectively, index slices of a cluster index, wherein the cluster index is used to characterize the affiliation of a sample in the clusters corresponding to the multiple cluster centers; based on the sample feature slices and index slices of each party, performing a second multi-party secure computation among the multiple holders to determine sum value slices of the sample features belonging to each cluster; based on the index slices of each party, performing a secure statistical computation among the multiple holders to obtain the number of samples belonging to each cluster, and determining the corrected number of samples belonging to each cluster based on the sum of each number of samples and a first value, so that the corrected number of samples differs from the corresponding number of samples by a small amount, wherein the first value is a value close to 0 but not 0; and based on the sum value slices of each party and the corrected number of samples, the multiple holders respectively determining updated cluster center feature slices.

2. The method according to claim 1, wherein the step of determining the corrected number of samples belonging to each cluster comprises: using the sum of the number of samples and the first value to determine the corrected number of samples belonging to each cluster.

3. The method according to claim 1, wherein the step of performing a secure statistical computation among the multiple holders based on the index slices of each party includes:
based on the index slices of each party, performing the secure statistical computation among the multiple holders, so that the holders respectively obtain sample number slices for each cluster, wherein, for any cluster, the multiple sample number slices of that cluster would yield the corresponding number of samples if reconstructed;
the step of determining the corrected number of samples belonging to each cluster includes:
based on the sample number slices of each party, performing a third multi-party secure computation among the multiple holders based on the summation of the multiple numbers of samples with the first value, so that the holders respectively obtain corrected sample number slices for each cluster;
the step of the multiple holders determining the updated cluster center feature slices includes:
based on the sum value slices and the corrected sample number slices of each party, performing a secure matrix multiplication among the multiple holders, so that the holders respectively obtain the updated cluster center feature slices.

4. The method according to claim 1, wherein the first value is obtained after parameter tuning through multiple training cycles.

5. The method according to claim 1, wherein the sample feature slices of the multiple holders are obtained in the following manner: the multiple holders, based on their respective sample features, each obtain their own sample feature slice through a secret sharing algorithm, wherein the dimensions of any holder's sample feature slice are the same as those of the total feature matrix.

6. The method according to claim 5, wherein the initial multiple cluster center feature slices are determined in the following manner: each holder extracts a number of agreed-upon sample features from its own sample feature slice, which serve as that holder's initial multiple cluster center feature slices.

7. The method according to claim 5, wherein the step of performing a first multi-party secure computation among the multiple holders comprises: performing, by the multiple holders, the first multi-party secure computation based on an improved Euclidean distance formula, wherein the improved Euclidean distance formula does not include a square root operation.

8. A method for clustering privacy data jointly by multiple parties, wherein the privacy data is distributed among multiple holders, and the sample features stored by each holder are used to construct a total feature matrix of multiple samples to be clustered; the method is executed by any one holder and includes multiple iterations, wherein any one iteration includes:
based on the sample feature slices and cluster center feature slices of this party, performing a first multi-party secure computation with the other holders to obtain this party's distance slice of the distances between the multiple samples and the multiple cluster centers, the other distance slices being held by the other holders;
based on the distance slice of this party, performing a secure comparison computation with the distance slices held by the other holders to obtain this party's index slice of a cluster index, wherein the cluster index is used to characterize the affiliation of a sample among the clusters corresponding to the multiple cluster centers, and the other index slices are held by the other holders;
based on the sample feature slices and index slice of this party, performing a second multi-party secure computation with the sample feature slices and index slices held by the other holders to obtain this party's sum value slices of the sample features belonging to each cluster, the other sum value slices being held by the other holders;
based on the index slice held by this party, performing a secure statistical computation with the index slices held by the other holders to obtain the number of samples belonging to each cluster;
based on the summation of the multiple numbers of samples with a first value, determining the corrected number of samples belonging to each cluster, such that each corrected number of samples differs only slightly from the corresponding number of samples, wherein the first value is close to 0 but not equal to 0;
based on the sum value slices of this party and the corrected numbers of samples, determining the updated cluster center feature slices.

9. A system for clustering privacy data jointly by multiple parties, the system comprising multiple holders, wherein the privacy data is distributed among the multiple holders, and the sample features stored by the multiple holders are used to construct a total feature matrix of multiple samples to be clustered; the multiple holders are configured to jointly cluster the privacy data through multiple iterations, wherein any one iteration includes:
based on the sample feature slices and cluster center feature slices of each party, performing a first multi-party secure computation among the multiple holders, so that the holders respectively obtain distance slices of the distances between the multiple samples and the multiple cluster centers;
based on the distance slices of each party, performing a secure comparison computation among the multiple holders, so that the holders respectively obtain index slices of a cluster index, wherein the cluster index is used to characterize the affiliation of a sample among the clusters corresponding to the multiple cluster centers;
based on the sample feature slices and index slices of each party, performing a second multi-party secure computation among the multiple holders to determine sum value slices of the sample features belonging to each cluster;
based on the index slices of each party, performing a secure statistical computation among the multiple holders to obtain the number of samples belonging to each cluster;
based on the summation of the multiple numbers of samples with a first value, determining the corrected number of samples belonging to each cluster, such that each corrected number of samples differs only slightly from the corresponding number of samples, wherein the first value is close to 0 but not equal to 0;
based on the sum value slices of each party and the corrected numbers of samples, determining, by the multiple holders respectively, updated cluster center feature slices.

10. An apparatus for multi-party joint clustering of privacy data, wherein the privacy data is distributed among multiple holders, and the sample features stored by the multiple holders are used to form a total feature matrix of multiple samples to be clustered; the apparatus is deployed in any one holder and includes multiple modules that perform multiple iterations, wherein, in any one iteration:
a distance module is configured to perform a first multi-party secure computation with the other holders based on the sample feature slices and cluster center feature slices of the local party, to obtain the local party's distance slice of the distances between the multiple samples and the multiple cluster centers, the other distance slices being held by the other holders;
an index module is configured to perform a secure comparison computation based on the distance slice held by the local party and the distance slices held by the other holders, to obtain the local party's index slice of a cluster index, wherein the cluster index is used to characterize the affiliation of a sample among the clusters corresponding to the multiple cluster centers, and the other index slices are held by the other holders;
a sum value module is configured to perform a second multi-party secure computation based on the local party's sample feature slices and index slice, together with the sample feature slices and index slices held by the other holders, to determine the local party's sum value slices of the sample features belonging to each cluster, the other sum value slices being held by the other holders;
a quantity module is configured to perform a secure statistical computation based on the index slice held by the local party and the index slices held by the other holders, to obtain the number of samples belonging to each cluster;
wherein, based on the summation of the multiple numbers of samples with a first value, the corrected number of samples belonging to each cluster is determined, such that each corrected number of samples differs only slightly from the corresponding number of samples, the first value being close to 0 but not equal to 0;
an update module is configured to determine the updated cluster center feature slices based on the sum value slices of the local party and the corrected numbers of samples.

11. A computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-7.

12. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-7.
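
By way of non-limiting illustration only, the arithmetic of a single iteration as recited in claim 1 may be sketched in Python with NumPy. The sketch is not the claimed secure protocol: it operates on reconstructed plaintext values, whereas the claims perform every step on slices under multi-party secure computation. The names `share`, `reconstruct`, `kmeans_iteration`, and `eps` are hypothetical and do not appear in this specification; `eps` stands in for the first value, and the float-valued additive slices are an assumption made for readability (a practical secret sharing algorithm would typically operate over a finite ring or field).

```python
import numpy as np


def share(x, n_parties, rng):
    """Split x into n_parties additive slices whose sum is x.

    Assumption: float-valued random masks, for illustration only. A real
    protocol would draw slices from a finite ring or field.
    """
    slices = [rng.normal(size=x.shape) for _ in range(n_parties - 1)]
    slices.append(x - sum(slices))
    return slices


def reconstruct(slices):
    """Recover the secret by adding all slices together."""
    return sum(slices)


def kmeans_iteration(X, C, eps=1e-6):
    """One clustering iteration on reconstructed (plaintext) values.

    X: (n, f) total feature matrix; C: (k, f) cluster centers;
    eps: stand-in for the "first value" (close to 0 but not 0).
    """
    # Squared Euclidean distance, with no square root taken.
    dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    # One-hot cluster index: row i marks the nearest center of sample i.
    index = np.eye(C.shape[0])[dist.argmin(axis=1)]             # (n, k)
    sums = index.T @ X            # per-cluster feature sums, (k, f)
    counts = index.sum(axis=0)    # per-cluster numbers of samples, (k,)
    corrected = counts + eps      # corrected numbers, never zero
    return sums / corrected[:, None], counts


rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [2.0, 0.0]])
C = np.array([[1.0, 0.0], [100.0, 100.0]])   # second cluster attracts no sample

X_slices = share(X, 3, rng)                  # three hypothetical holders
new_C, counts = kmeans_iteration(reconstruct(X_slices), C)
print(counts)   # number of samples per cluster
print(new_C)    # updated centers; all entries finite
```

In this example the second cluster receives no samples. Dividing its zero feature sum by the raw count of 0 would produce an undefined value, whereas dividing by the corrected number keeps the updated center finite while changing the center of the non-empty cluster only slightly.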