A differentially private non-negative matrix factorization method based on random sampling
By incorporating random sampling into the noise introduction stage, a differential privacy nonnegative matrix factorization method is proposed to address the privacy information leakage problem in recommendation systems, achieving a higher level of privacy protection and better prediction performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TIANJIN UNIV
- Filing Date
- 2023-04-11
- Publication Date
- 2026-06-23
Figure CN116401707B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of machine learning technology and relates to a differential privacy nonnegative matrix factorization method based on random sampling.
Background Technology
[0002] With the arrival of the big data era, massive amounts of data bring convenience but also pose new challenges to the protection of users' private information. As awareness of privacy protection has grown in recent years, more and more users are unwilling to have their personal information exposed to the risk of leakage. Recommendation systems are an indispensable technology in today's online business and are widely used on the Internet and in some industrial fields, and non-negative matrix factorization is one of the most commonly used methods for implementing them. Because the non-negative matrix factorization algorithm needs to collect users' private data, applying it to a recommendation system without protective measures may leak private information. Existing differential attacks, such as reconstruction attacks, membership inference attacks, and linkage attacks, may cause the non-negative matrix factorization algorithm to leak sensitive information such as users' shopping and search preferences. For example, in movie recommendation, suppose the system's data table records whether each user likes a movie (0 means dislike, 1 means like), and the system provides a statistical query q(n) that returns the sum of the values of the first n records. Suppose further that q(1) = 1, q(2) = 2, q(3) = 2, and q(4) = 3, and that the position of each record in the data table is known. In this scenario, an attacker can infer that the fourth person liked the movie by calculating q(4) - q(3) = 1, which leaks personal preference data. This demonstrates that even if the recommendation system does not directly publish private data, a potential risk of privacy leakage remains. These privacy protection issues pose a significant challenge to the application of non-negative matrix factorization algorithms in recommendation systems and have become an urgent problem to be solved.
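The differencing attack in the movie example can be reproduced in a few lines. The following is a minimal sketch, not part of the patented method; the record values are hypothetical and were chosen only so that the prefix sums match the q(1) to q(4) values quoted above, and the helper names `q` and `infer_record` are illustrative.

```python
from itertools import accumulate

# Hypothetical like/dislike records (1 = like, 0 = dislike), chosen so that
# the prefix sums equal the query answers quoted in the text.
records = [1, 1, 0, 1]
prefix_sums = list(accumulate(records))


def q(n: int) -> int:
    """Statistical query: sum of the values of the first n records."""
    return prefix_sums[n - 1]


def infer_record(n: int) -> int:
    """Recover record n from two aggregate answers (differencing attack)."""
    previous = q(n - 1) if n > 1 else 0
    return q(n) - previous


print([q(n) for n in range(1, 5)])  # [1, 2, 2, 3]
print(infer_record(4))              # 1, so the fourth user liked the movie
```

The attacker never sees an individual record, yet every record can be recovered from differences of aggregate answers, which is the leakage that noise addition is meant to prevent.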
[0003] Differential privacy has a rigorous mathematical definition and a light computational burden. It works by perturbing the results of an algorithm so that the probabilities of obtaining the same result on adjacent datasets stay within a bounded range of each other, thus mitigating differential attacks to some extent. Because of these advantages, differential privacy has become a popular privacy protection framework for many applications, and numerous recommender system algorithms have been built on it. McSherry et al. (MCSHERRY F, MIRONOV I. Differentially private recommender systems: Building privacy into the Netflix prize contenders[C]//Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM, 2009: 627-636.) first introduced differential privacy into a recommender system based on collaborative filtering matrix factorization. Their perturbation method injects noise during the similarity calculation and recommendation stages to guarantee differential privacy. Berlioz et al. (BERLIOZ A, FRIEDMAN A, KAAFAR M A, et al. Applying differential privacy to matrix factorization[C]//Proceedings of the 9th ACM Conference on Recommender Systems. Vienna, Austria: ACM, 2015: 107-114.) proposed three ways to apply perturbation at different stages of the matrix factorization algorithm: input perturbation of the original data, gradient perturbation during the iteration process, and output perturbation of the algorithm result. Zhang et al. (ZHANG S, LIU L, CHEN Z, et al. Probabilistic matrix factorization with personalized differential privacy[J]. Knowledge-Based Systems, 2019, 183: 104864.) designed a privacy-preserving probabilistic matrix factorization algorithm that publishes a summary of perturbation terms while considering users' personalized privacy needs. A common drawback of these methods is the lack of nonnegativity constraints. Recently, Ran et al. (RAN X, WANG Y, ZHANG L Y, et al. A differentially private nonnegative matrix factorization for recommender system[J]. Information Sciences, 2022.) proposed a differentially private nonnegative matrix factorization algorithm (IDPNMF). This method guarantees nonnegativity by perturbing the objective function, but it requires pre-training and has a high computational cost. The IDPNMF algorithm first runs a nonnegative matrix factorization as pre-processing to avoid excessive error accumulation over multiple iterations, and then performs differential privacy-preserving recommendation. Furthermore, because of the complex correlation between the ratings and the nonnegative matrix factorization objective function, privacy analysis is difficult. The current best algorithm, IDPNMF, can only guarantee ε-DP, whereas current differential privacy algorithms are generally expected to achieve (ε,δ)-DP to enhance their application value.
Summary of the Invention
[0004] To overcome the shortcomings of the existing technologies, the present invention proposes a differential privacy nonnegative matrix factorization method based on random sampling. By setting a noise addition mechanism and incorporating random sampling operations in the noise introduction stage, the method successfully avoids excessive error accumulation caused by multiple iterations, while taking into account both the privacy and usability of the application, thereby improving the privacy protection capability of recommendation systems.
[0005] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0006] A differential privacy nonnegative matrix factorization method based on random sampling includes the following steps:
[0007] Step 1: Input the raw rating data matrix: that is, the rating data V of the raw user rating data matrix collected by the system;
[0008] Step 2: Initialize the factor matrix;
[0009] Step 3: Iteratively optimize the factor matrix described in Step 2 until convergence or the required accuracy is achieved;
[0010] Step 4: Make recommendations based on the factor matrix generated in Step 3 and calculate the root mean square error (RMSE).
[0011] The specific method of step 2 is as follows: the rank is set in advance, and the factor matrix is randomly initialized according to the pre-set rank: the basis matrix H and the coefficient matrix W; the recommendation algorithm based on non-negative matrix factorization approximates the rating data V collected in step 1 as V≈WH, and then predicts the real rating data V according to WH.
[0012] The specific method for step 3 is as follows:
[0013] (3a) Optimize the coefficient matrix W: the current coefficient matrix $W^{i}$ is updated by applying the update rule for W to obtain the updated coefficient matrix $W^{i+1}$, where the superscript indicates the number of iterations and the subscript indicates the row position of the vector in the matrix;
[0014] (3b) Optimize the basis matrix H:
[0015] First, taking into account the sensitivity of the objective function and the pre-set privacy budget parameter ε, a Gaussian noise matrix N is generated. The dimensions of N are consistent with those of the coefficient matrix W, and its elements $N_{ij}$ are independent and identically distributed according to a Gaussian distribution with zero mean and variance σ;
[0016] Secondly, a random sampling matrix Φ is generated based on the sampling rate γ, where $\Phi_{ij}$ denotes the ij-th element of Φ and γ∈(0,1) controls the sampling rate, that is, the proportion of elements in W to which noise is added. The Gaussian noise addition mechanism based on random sampling can then be described as: $W' = W^{i+1} + \Phi(\gamma) \odot N$, where W′ is the matrix after adding noise, ⊙ denotes element-wise multiplication, N is the Gaussian noise matrix, and Φ(γ) is the sampling matrix with sampling rate γ;
[0017] Finally, the basis matrix H is updated using the coefficient matrix W′ with randomly sampled Gaussian noise. The update formula for the basis matrix H is as follows:
[0018] (3c) Step (3a) completes one differentially private iterative update of the coefficient matrix W, and step (3b) completes one differentially private iterative update of the basis matrix H. A nonnegative projection is then applied to the updated coefficient matrix W and basis matrix H: $W^{i+1} = \max(0, W^{i+1})$, $H^{i+1} = \max(0, H^{i+1})$.
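[0019] The formula for calculating the root mean square error (RMSE) in step 4 is as follows:
[0020] $\mathrm{RMSE} = \sqrt{\frac{1}{|R|} \sum_{r_{ij} \in R} \left( r_{ij} - \hat{r}_{ij} \right)^{2}}$, where $r_{ij}$ and $\hat{r}_{ij}$ are the actual and predicted ratings, respectively, and R is the test set of ratings.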
[0021] Compared with the prior art, the present invention has the following advantages:
[0022] This invention incorporates random sampling in the noise introduction stage, which avoids excessive error accumulation over multiple iterations while balancing the privacy and usability of the application, thus improving its practical value. It provides proofs of the privacy and usability of RDPNMF: whereas the current best method, IDPNMF, can only guarantee ε-DP, the RDPNMF algorithm proposed in this invention satisfies (ε,δ)-DP, which is more in line with the current needs of the differential privacy field. It satisfies differential privacy while maintaining the non-negativity of the learned model, and thus has a wider range of privacy protection applications.
Attached Figure Description
[0023] Figure 1 is a flowchart illustrating the implementation of the present invention.
[0024] Figure 2 shows the experimental results of existing methods and the method of this invention on the MovieLens movie rating dataset, where: Figure 2 (a) shows the results for MovieLens-100k; Figure 2 (b) shows the results for MovieLens-1m; Figure 2 (c) shows the results for MovieLens-10m; Figure 2 (d) shows the results on synthetic data.
Detailed Implementation
[0025] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0026] Referring to Figure 1, a differential privacy nonnegative matrix factorization method based on random sampling includes the following steps:
[0027] Step 1: Input the raw rating data matrix: that is, the rating data V of the raw user rating data matrix collected by the system;
[0028] Step 2: Initialize the factor matrix; the specific method is as follows: pre-set the rank, and randomly initialize the factor matrix according to the pre-set rank: the basis matrix H and the coefficient matrix W; the recommendation algorithm based on non-negative matrix factorization approximates the rating data V collected in Step 1 as V≈WH, and then predicts the real rating data V according to WH;
[0029] Step 3: Iteratively optimize the factor matrix described in Step 2 until convergence or the required accuracy is achieved; the specific method is as follows:
[0030] (3a) Optimize the coefficient matrix W: the current coefficient matrix $W^{i}$ is updated by applying the update rule for W to obtain the updated coefficient matrix $W^{i+1}$, where the superscript indicates the number of iterations and the subscript indicates the row position of the vector in the matrix;
[0031] (3b) Optimize the basis matrix H:
[0032] First, taking into account the sensitivity of the objective function and the pre-set privacy budget parameter ε, a Gaussian noise matrix N is generated. The dimensions of N are consistent with those of the coefficient matrix W, and its elements $N_{ij}$ are independent and identically distributed according to a Gaussian distribution with zero mean and variance σ;
[0033] Secondly, a random sampling matrix Φ is generated based on the sampling rate γ, where $\Phi_{ij}$ denotes the ij-th element of Φ and γ∈(0,1) controls the sampling rate, that is, the proportion of elements in W to which noise is added. The Gaussian noise addition mechanism based on random sampling can then be described as: $W' = W^{i+1} + \Phi(\gamma) \odot N$, where W′ is the matrix after adding noise, ⊙ denotes element-wise multiplication, N is the Gaussian noise matrix, and Φ(γ) is the sampling matrix with sampling rate γ;
[0034] Finally, the basis matrix H is updated using the coefficient matrix W′ with randomly sampled Gaussian noise. The update formula for the basis matrix H is as follows:
[0035] (3c) Step (3a) completes one differentially private iterative update of the coefficient matrix W, and step (3b) completes one differentially private iterative update of the basis matrix H. A nonnegative projection is then applied to the updated coefficient matrix W and basis matrix H: $W^{i+1} = \max(0, W^{i+1})$, $H^{i+1} = \max(0, H^{i+1})$.
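[0036] Step 4: Make recommendations based on the factor matrix generated in Step 3, and calculate the root mean square error (RMSE).
[0037] The RMSE is calculated as follows:
[0038] $\mathrm{RMSE} = \sqrt{\frac{1}{|R|} \sum_{r_{ij} \in R} \left( r_{ij} - \hat{r}_{ij} \right)^{2}}$, where $r_{ij}$ and $\hat{r}_{ij}$ are the actual and predicted ratings, respectively, and R is the test set of ratings.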
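The noise addition in step (3b) and the projection in step (3c) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the text does not give the definition of Φ, so the sketch assumes each entry is selected independently with probability γ; `noise_std` is a hypothetical parameter standing in for the noise scale, whose calibration from ε and the sensitivity is not reproduced here; the update rules for W and H are not shown because their formulas are not available in this text.

```python
import random


def sampling_mask(rows, cols, gamma, rng):
    """Binary sampling matrix Phi(gamma); 1 marks an entry that receives noise.

    Assumption: each entry is selected independently with probability gamma.
    """
    return [[1 if rng.random() < gamma else 0 for _ in range(cols)]
            for _ in range(rows)]


def add_sampled_gaussian_noise(W, gamma, noise_std, rng):
    """Return (W', Phi) where W' = W + Phi * N, taken element-wise.

    N has the same shape as W, with i.i.d. zero-mean Gaussian entries.
    noise_std is a hypothetical standard deviation, not the patent's value.
    """
    rows, cols = len(W), len(W[0])
    phi = sampling_mask(rows, cols, gamma, rng)
    W_noisy = [[W[i][j] + phi[i][j] * rng.gauss(0.0, noise_std)
                for j in range(cols)]
               for i in range(rows)]
    return W_noisy, phi


def project_nonnegative(M):
    """Element-wise nonnegative projection max(0, M), as in step (3c)."""
    return [[max(0.0, x) for x in row] for row in M]


rng = random.Random(0)
W_next = [[0.5, 0.1, 0.9], [0.2, 0.7, 0.4]]  # hypothetical W^{i+1}
W_noisy, phi = add_sampled_gaussian_noise(W_next, gamma=0.1, noise_std=1.0, rng=rng)
print(phi)      # entries equal to 1 are the ones that received noise
print(W_noisy)  # entries where phi is 0 are identical to W_next
print(project_nonnegative([[-0.3, 0.2], [0.0, -1.5]]))  # [[0.0, 0.2], [0.0, 0.0]]
```

With a small γ only a small share of the entries of W is perturbed in each iteration, which is consistent with the stated motivation of limiting the error accumulated over many iterations.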
[0039] The invention will be further explained below with reference to simulation experiments.
[0040] 1. Simulation conditions:
[0041] The experimental hardware platform consisted of an Intel Core i7-8750H CPU with a clock speed of 2.20GHz, 12GB of memory, and Windows 10 operating system. All methods were implemented on Matlab 2020b.
[0042] 2. Simulation content and simulation result analysis:
[0043] In the simulation experiments of this invention, the publicly available standard dataset MovieLens was used to generate the training and test sets. This dataset, collected by GroupLens, contains movie rating data from multiple users for multiple movies, as well as movie metadata and user attribute information. This invention employs three datasets of increasing size: MovieLens-100K, MovieLens-1M, and MovieLens-10M. The datasets include the number of users, the number of movies, and rating data.
[0044] A rating matrix R can be constructed from the MovieLens dataset, where the ij-th element represents the rating of the j-th movie by the i-th user. A non-negative matrix factorization is then performed on this matrix to obtain two factor matrices for predicting user ratings of other movies. The experiment divides the rating data into training and testing sets. The training set is used to fit the factorization, and the testing set is used to evaluate the recommendation system.
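The data preparation described above can be sketched as follows. This is a minimal sketch with hypothetical inputs: the `(user, movie, rating)` triples stand in for parsed MovieLens rows, unobserved entries are stored as 0 (an assumption, since the text does not say how missing ratings are represented), and `k_fold_splits` is an illustrative helper for a cross-validation split such as the tenfold scheme used in the experiments.

```python
import random


def build_rating_matrix(triples, n_users, n_movies):
    """Build R so that R[i][j] is the rating of movie j by user i.

    Assumption: unobserved entries are stored as 0.0.
    """
    R = [[0.0] * n_movies for _ in range(n_users)]
    for user, movie, rating in triples:
        R[user][movie] = float(rating)
    return R


def k_fold_splits(triples, k, seed=0):
    """Yield (train, test) lists of rating triples for k-fold cross-validation."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [t for g, fold in enumerate(folds) if g != f for t in fold]
        yield train, test


# Hypothetical ratings: (user index, movie index, rating on a 1-5 scale).
triples = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (1, 2, 1), (2, 0, 2), (2, 1, 5)]
for train, test in k_fold_splits(triples, k=3):
    R_train = build_rating_matrix(train, n_users=3, n_movies=3)
    print(len(train), len(test))  # 4 2 for each of the three folds
```

Each fold's training triples are turned into a rating matrix for factorization, while the held-out triples are kept aside for evaluation.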
[0045] To evaluate the effectiveness of the simulation results of this invention, this section focuses on the accuracy of the predictions in the experiments. The accuracy of the predicted ratings is typically measured using the root mean square error (RMSE) on the test set. The RMSE is defined as follows:
[0046] $\mathrm{RMSE} = \sqrt{\frac{1}{|R|} \sum_{r_{ij} \in R} \left( r_{ij} - \hat{r}_{ij} \right)^{2}}$
[0047] where $r_{ij}$ and $\hat{r}_{ij}$ represent the actual and predicted values of the rating data, respectively, and R is the test set of ratings. A smaller RMSE value indicates a smaller prediction error and better recommendation performance.
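The RMSE evaluation can be sketched in code as follows. This is a minimal sketch assuming that the predicted rating for user i and movie j is the (i, j) entry of the product WH, in line with V≈WH; the factor matrices and test ratings below are hypothetical values chosen so the result is easy to check by hand.

```python
import math


def predict_rating(W, H, i, j):
    """Predicted rating: the (i, j) entry of the product WH."""
    return sum(W[i][k] * H[k][j] for k in range(len(H)))


def rmse(test_ratings, W, H):
    """Root mean square error over a test set.

    test_ratings maps (i, j) index pairs to the actual rating.
    """
    if not test_ratings:
        raise ValueError("test set is empty")
    squared_error = sum((r - predict_rating(W, H, i, j)) ** 2
                        for (i, j), r in test_ratings.items())
    return math.sqrt(squared_error / len(test_ratings))


# Hypothetical factors; their product WH is [[3.0, 2.0], [2.0, 5.0]].
W = [[1.0, 0.0], [0.0, 2.0]]
H = [[3.0, 2.0], [1.0, 2.5]]
test_ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 1): 3.0}

# Errors are 1, 0 and -2, so the RMSE equals sqrt((1 + 0 + 4) / 3).
print(rmse(test_ratings, W, H))
```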
[0048] Differential privacy algorithms are typically compared by studying the relationship between usability and privacy budget. To investigate the relationship between the predictive power (usability) and the privacy budget ε of the differential privacy nonnegative matrix factorization (RDPNMF) algorithm, we let ε take values in {0.1, 0.5, 1, 2, 3, 4, 5}. We fixed the other parameters at r = 10, γ = 0.1, λ = 0.03, and T = 20, and ran the proposed RDPNMF, the baseline, and other existing algorithms, calculating their respective RMSE values. The closer the RMSE of a differential privacy algorithm is to the baseline, the better its usability; conversely, the farther it is from the baseline, the worse its usability. The experiment divided the rating data into training and testing sets, using tenfold cross-validation to train and evaluate the recommendation system, and the average of the ten RMSE results was used as the actual output of the algorithm. The comparison results are shown in Table 1.
[0049]
[0050] Table 1 Comparison of the present invention and prior art in simulation experiments.
[0051] Table 1 lists the root mean square error of RDPNMF for these three datasets. As can be seen from the table, under any level of privacy budget, MovieLens-10M exhibits the smallest accuracy loss among the three datasets, and MovieLens-1M's accuracy loss is also generally smaller than MovieLens-100k (with only one exception). Based on these experimental results, it can be concluded that RDPNMF can achieve better prediction performance on large-scale datasets.
[0052] The experimental results of the comparison with other algorithms are shown in Figure 2. Based on the result plots and the experimental data, the following conclusions can be drawn:
[0053] 1. All algorithms follow a general rule: as the privacy budget ε increases, the private algorithm gets closer and closer to the baseline. This is because the larger ε is, the smaller the amplitude of the random noise, and therefore the smaller its impact on the RMSE result.
[0054] 2. Under the same privacy budget ε, RDPNMF has a smaller RMSE and is closer to the baseline than the other algorithms, indicating that the decomposition result is closer to the original matrix, resulting in better prediction performance and stronger usability. Taking ε = 1 as an example, the RMSE values of RDPNMF, RDPNMF-1, IDPNMF, and input-ALS on MovieLens 100k are 0.9417, 1.1333, 1.0163, and 1.8833, respectively, with differences from the baseline of 0.0214, 0.2130, 0.0960, and 0.9630. Measured by the gap to the baseline, RDPNMF's performance is 97.8% better than input-ALS and 77.8% better than IDPNMF.
[0055] 3. Although RDPNMF-1 does not perform as well as RDPNMF, it outperforms input-ALS, which verifies the effectiveness of the random sampling noise addition method. Furthermore, RDPNMF approaches the baseline more quickly than IDPNMF, indicating that iterative perturbation performs better than objective function perturbation in practice, which supports the motivation of the proposed algorithm.
[0056] 4. To balance privacy and usability: according to the above experimental results, RDPNMF is already close to the baseline when ε = 1. Therefore, we suggest that for RDPNMF, ε should ideally be around 1.
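The inverse relation between ε and the noise amplitude noted in conclusion 1 can be illustrated with the classical Gaussian mechanism, whose noise standard deviation is Δ·sqrt(2·ln(1.25/δ))/ε for sensitivity Δ. This is only an illustration and an assumption on our part: the text does not state how RDPNMF calibrates its noise, the classical bound is formally stated for ε below 1, and the values of `delta` and `sensitivity` below are hypothetical.

```python
import math


def classical_gaussian_std(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation of the classical Gaussian mechanism.

    Shown only to illustrate that the noise scale shrinks as epsilon grows;
    it is not the calibration used by RDPNMF, which the text does not give.
    """
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("require epsilon > 0 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Privacy budgets used in the experiments; delta is a hypothetical choice.
for eps in (0.1, 0.5, 1, 2, 3, 4, 5):
    print(eps, round(classical_gaussian_std(eps, delta=1e-5), 3))
```

Under this calibration, doubling ε halves the noise standard deviation, which matches the qualitative trend of the private algorithms approaching the baseline as ε grows.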
[0057] In summary, when privacy requirements are relaxed, the proposed RDPNMF method preserves the main characteristics of the dataset, and the RDPNMF proposed in this invention achieves the best performance in all experiments.
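The percentages quoted in conclusion 2 can be checked from the rounded figures given there, reading "better" as the relative reduction of the gap to the baseline (our interpretation of the text). The sketch below uses only the numbers quoted for ε = 1 on MovieLens 100k. From these rounded values the second ratio comes out at about 77.7%, marginally below the quoted 77.8%, presumably because the quoted figure was computed from unrounded results.

```python
# RMSE values and gaps to the baseline quoted for epsilon = 1 (MovieLens 100k).
rmse_values = {"RDPNMF": 0.9417, "RDPNMF-1": 1.1333, "IDPNMF": 1.0163, "input-ALS": 1.8833}
gaps = {"RDPNMF": 0.0214, "RDPNMF-1": 0.2130, "IDPNMF": 0.0960, "input-ALS": 0.9630}


def gap_reduction(ours, other):
    """Relative reduction of the gap to the baseline, as a fraction."""
    return (gaps[other] - gaps[ours]) / gaps[other]


# Every (RMSE, gap) pair implies the same baseline RMSE of about 0.9203.
implied_baselines = {name: round(rmse_values[name] - gaps[name], 4) for name in gaps}
print(implied_baselines)

print(round(100 * gap_reduction("RDPNMF", "input-ALS"), 1))  # about 97.8
print(round(100 * gap_reduction("RDPNMF", "IDPNMF"), 1))     # about 77.7
```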
Claims
1. A differential privacy nonnegative matrix factorization method based on random sampling, characterized in that it includes the following steps: Step 1: Input the raw rating data matrix, that is, the rating data V of the raw user rating data matrix collected by the system; Step 2: Initialize the factor matrix; the specific method is as follows: pre-set the rank, and randomly initialize the factor matrix according to the pre-set rank: the basis matrix H and the coefficient matrix W; the recommendation algorithm based on non-negative matrix factorization approximates the rating data V collected in Step 1 as V ≈ WH, and then predicts the real rating data V according to WH; Step 3: Iteratively optimize the factor matrix described in Step 2 until convergence or the required accuracy is achieved; the specific method is as follows: (3a) Optimize the coefficient matrix W: the current coefficient matrix $W^{i}$ is updated by applying the update rule for W to obtain the updated coefficient matrix $W^{i+1}$, where the superscript indicates the number of iterations and the subscript indicates the row position of the vector in the matrix; (3b) Optimize the basis matrix H: First, taking into account the sensitivity of the objective function and the pre-set privacy budget parameter ε, a Gaussian noise matrix N is generated; the dimensions of N are consistent with those of the coefficient matrix W, and its elements $N_{ij}$ are independent and identically distributed according to a Gaussian distribution with zero mean and variance σ; Secondly, a random sampling matrix Φ is generated based on the sampling rate γ, where $\Phi_{ij}$ denotes the ij-th element of Φ and γ ∈ (0, 1) controls the sampling rate, that is, the proportion of elements in W to which noise is added; Next, the Gaussian noise addition mechanism based on random sampling can be described as $W' = W^{i+1} + \Phi(\gamma) \odot N$, where W′ is the matrix after adding noise, ⊙ denotes element-wise multiplication, N is the Gaussian noise matrix, and Φ(γ) is the sampling matrix with sampling rate γ; Finally, the basis matrix H is updated, according to the update formula for the basis matrix H, using the coefficient matrix W′ with randomly sampled Gaussian noise; (3c) Step (3a) completes one differentially private iterative update of the coefficient matrix W, and step (3b) completes one differentially private iterative update of the basis matrix H; a non-negative projection is then applied to the updated coefficient matrix W and basis matrix H: $W^{i+1} = \max(0, W^{i+1})$, $H^{i+1} = \max(0, H^{i+1})$; Step 4: Make recommendations based on the factor matrix generated in Step 3 and calculate the root mean square error (RMSE).
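2. The differential privacy nonnegative matrix factorization method based on random sampling according to claim 1, characterized in that the formula for calculating the root mean square error (RMSE) in step 4 is as follows: $\mathrm{RMSE} = \sqrt{\frac{1}{|R|} \sum_{r_{ij} \in R} \left( r_{ij} - \hat{r}_{ij} \right)^{2}}$, where $r_{ij}$ and $\hat{r}_{ij}$ are the actual and predicted ratings, respectively, and R is the test set of ratings.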