User sensitive information protection method based on data obfuscation technique
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NINGBO UNIV
- Filing Date
- 2022-09-16
- Publication Date
- 2026-06-26
AI Technical Summary
In movie recommendation systems, user privacy information can be easily inferred by attackers through movie review information, and existing technologies struggle to protect user privacy without compromising the quality of personalized recommendations.
By generating a sensitive attribute association table, adding obfuscation ratings, and applying sampling and removal strategies, an obfuscation matrix is constructed that maintains data scale and recommendation efficiency.
The method protects user privacy by misleading attackers' gender attribute inference without degrading the quality of personalized recommendations; it adapts to data-sparse scenarios and improves recommendation performance.
Abstract
Description
Technical Field
[0001] This invention relates to the field of movie recommendation technology, and in particular to a method for protecting sensitive user information based on data obfuscation technology.
Background Technology
[0002] With the continuous development of internet technology and the exponential growth of online information, recommender systems, as an efficient information filtering method, have gradually gained favor in industry. In recommender systems, demographic information such as age, gender, and region is frequently used to provide users with targeted content and advertising products. Therefore, privacy-conscious users often do not include such information in their online profiles. However, with the rapid development of data mining technology, the information hidden by users is highly likely to be predicted by various powerful inference attacks.
[0003] In the field of movie recommendation, user-shared movie reviews on various platforms pose a significant privacy vulnerability. Attackers can collect this data to extract a range of private information, such as occupation, gender, and age. To combat malicious attacks on user privacy, many researchers have employed various methods, commonly including data perturbation algorithms and data encryption algorithms. Currently, obfuscation techniques in data perturbation are widely studied in many fields, but have not received much attention in the field of recommender systems. Therefore, combining data obfuscation techniques with movie recommendation to develop a user sensitive information protection method based on data obfuscation technology has significant application value.
Summary of the Invention
[0004] The purpose of this invention is to provide a method for protecting sensitive user information based on data obfuscation technology. The invention misleads attackers' gender attribute inference and thereby protects user privacy, without affecting the quality of existing personalized recommendations.
[0005] The technical solution of this invention: a method for protecting sensitive user information based on data obfuscation technology, comprising the following steps:
[0006] Step S1: Generate a sensitive attribute association table based on the association between recommended items and user gender characteristics;
[0007] Step S2: Based on the sensitive attribute association table obtained in Step S1, add obfuscation ratings to the existing user-item matrix using a sampling strategy to construct the user-item matrix with added obfuscation ratings.
[0008] Step S3: Record the number of obfuscation ratings added in step S2, apply the removal strategy, and finally generate the obfuscation matrix after applying the removal strategy.
[0009] The removal strategy is as follows: delete as many ratings as obfuscation ratings were added, so as to maintain the original data scale; users are selected at random, and when the number of ratings of a selected user reaches a set threshold, strongly correlated items are deleted from that user's rating history.
[0010] Compared with existing technologies, the beneficial effects of this invention are as follows: This invention solves the problem of user privacy information leakage in recommendation systems. Step S1 generates a sensitive attribute association table that balances personalization and strong gender correlation; Step S2 overcomes the shortcomings of conventional data obfuscation schemes and considers the user's personalized recommendation needs while misleading privacy intruders; Step S3 introduces a removal strategy into the data obfuscation scheme. This strategy not only ensures the effectiveness of privacy protection but also improves recommendation efficiency by removing redundant interaction records and maintaining data scale. This invention provides a solution for privacy protection in recommendation systems, and has the advantages of adapting to sparse data scenarios and improving the recommendation effect of various data models. Ultimately, it can achieve user privacy protection while misleading attackers to infer gender attributes, without affecting the quality of existing personalized recommendations.
[0011] In the aforementioned method for protecting sensitive user information based on data obfuscation technology, the recommended item is a movie recommendation, and step S1 includes the following sub-steps:
[0012] Sub-step S1.1: Initially screen out a candidate set of movies that are strongly associated with gender characteristics, and use the logistic regression algorithm to obtain a preliminary gender association form;
[0013] Sub-step S1.2: Use the user KNN algorithm to mine user personalized preferences and obtain a personalized recommendation form for the user;
[0014] Sub-step S1.3: Take the intersection of the gender association form obtained in sub-step S1.1 and the user personalized recommendation form obtained in sub-step S1.2 to construct the final sensitive attribute association table.
[0015] In the aforementioned method for protecting sensitive user information based on data obfuscation technology, the linear formula and loss function of the logistic regression algorithm in sub-step S1.1 are expressed as follows:
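[0016] $$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i} \beta_i x_i$$
[0017] $$J(\beta) = -\sum_{(x,\,y)} \ln P(y \mid x) + \lambda R(\beta)$$
[0018] where $\beta_i$ denotes the correlation between the $i$-th movie and the gender attribute, $J(\beta)$ denotes the cross-entropy loss after adding the regularization term $R(\beta)$, and $\lambda$ is the adjustable parameter of the regularization term; $\beta_0$ is the intercept coefficient, $x_i$ is the value of the $i$-th feature of a sample, $y$ is the binary gender label, and $P(y \mid x)$ is the probability that the result is $y$ given sample $x$.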
[0019] In the aforementioned user sensitive information protection method based on data obfuscation technology, the userKNN algorithm in sub-step S1.2 includes the following three steps:
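[0020] S1.2.1 Calculating Similarity: The Pearson coefficient is used, whose value lies in [-1, 1]. A value of -1 indicates a negative correlation between the two sets of variables; 0 indicates no correlation; and 1 indicates a positive correlation. The calculation formula is as follows:
[0021] $$s(m,j) = \frac{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)(r_{u,j}-\bar{r}_j)}{\sqrt{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)^2}\,\sqrt{\sum_{u \in U_{mj}} (r_{u,j}-\bar{r}_j)^2}}$$
[0022] where $s(m,j)$ denotes the similarity between movies $m$ and $j$, $U_{mj}$ denotes the set of all users who have rated both movies $m$ and $j$, $\bar{r}_m$ and $\bar{r}_j$ denote the average ratings of movies $m$ and $j$, $r_{u,m}$ denotes the rating of user $u$ for movie $m$, and $r_{u,j}$ denotes the rating of user $u$ for movie $j$;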
[0023] S1.2.2 Similarity Compression: The similarity is compressed once based on the size of the intersection. The calculation formula is as follows:
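[0024] $$s'(m,j) = \frac{|U_{mj}|}{|U_{mj}| + \delta}\, s(m,j)$$
[0025] where $\delta$ is the compression ratio specified by the user, $U_{mj}$ denotes the set of users who have rated both movies $m$ and $j$, $s(m,j)$ is the similarity from S1.2.1, and $s'(m,j)$ is the compressed similarity;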
[0026] S1.2.3, Neighbor Selection: Among all the movies rated by user u, find the k movies with the highest similarity to movie m, and let N(u, m) represent the set of these k movies. Using the algorithms in S1.2.1 and S1.2.2, obtain the user's personalized recommendation list. The KNN score prediction formula is as follows:
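[0027] $$\hat{r}_{u,m} = \frac{\sum_{j \in N(u,m)} s'(m,j)\, r_{u,j}}{\sum_{j \in N(u,m)} |s'(m,j)|}$$
[0028] where $\hat{r}_{u,m}$ denotes the predicted score of user $u$ for movie $m$, $s'(m,j)$ is the compressed similarity between movies $m$ and $j$, and $r_{u,j}$ is the score user $u$ has already given to movie $j$.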
[0029] In the aforementioned user sensitive information protection method based on data obfuscation technology, the sampling strategy in step S2 specifically selects movies by analyzing the relevant score distribution of movies corresponding to the opposite sex in the list, and directly selects the average score of the opposite sex group with the best overall performance as the scoring selection strategy.
[0030] In the aforementioned user sensitive information protection method based on data obfuscation technology, step S2 uses a softmax function to map the association scores of the samples to the range [0, 1] to form sampling probabilities. When the number of times an item is selected reaches twice the original number, it is removed from the associated form to maintain the overall data distribution. The softmax function is:
[0031] $$p_i = \frac{e^{c_i}}{\sum_{j=1}^{n} e^{c_j}}$$
[0032] where $c_i$ denotes the association score of the $i$-th sample, $n$ represents the number of samples in the associated form, and $p_i$ represents the probability that each sample will eventually be selected.
[0033] In the aforementioned user sensitive information protection method based on data obfuscation technology, the average score of the opposite-sex group is used as the scoring selection strategy, and its calculation formula is as follows:
[0034] $$\tilde{r}_{i} = \frac{1}{N_o} \sum_{v \in G_o} r_{v,i}$$
[0035] where $\tilde{r}_{i}$ is the value of the obfuscation rating added for movie $i$, $N_o$ represents the number of users of the opposite sex, and $r_{v,i}$ represents the rating given to movie $i$ by user $v$ in the opposite-sex group $G_o$.
Attached Figure Description
[0036] Figure 1 is a flowchart of the method of the present invention;
[0037] Figure 2 is a diagram illustrating the application of data obfuscation.
Detailed Implementation
[0038] The present invention will be further described below with reference to the accompanying drawings and embodiments, but this should not be construed as limiting the present invention.
[0039] Example: A method for protecting sensitive user information based on data obfuscation technology. The workflow, shown in Figure 1, includes the following steps:
[0040] Step S1: Generate a sensitive attribute association table based on the association between recommended items and user gender characteristics.
[0041] The recommended item is movie recommendations. Existing research indicates that the vast majority of movies are associated with gender characteristics, meaning different gender groups have significant cognitive biases regarding the same movie. If a user's interaction history contains a large number of movies strongly associated with "female," attribute inference can accurately capture this information and thus label the user as female. Based on these characteristics, this embodiment first preliminarily filters out a candidate set of movies strongly associated with gender characteristics for subsequent obfuscation work. Step S1 specifically proceeds with the following sub-steps:
[0042] Sub-step S1.1: Initially screen out a candidate set of movies strongly associated with gender characteristics. Use logistic regression to obtain a preliminary gender association table. Logistic regression is one of the most influential classification algorithms in machine learning and a fundamental building block of deep learning. The regression coefficients generated by logistic regression capture the degree of correlation between each movie and its category, where positive coefficients indicate a correlation with the male category and negative coefficients indicate a correlation with the female category. Simultaneously, regularization parameters are selected to further optimize the decision boundary. If the absolute value of the regression coefficient is less than a certain threshold, it indicates that the movie has a weak association with both genders and cannot effectively contribute to subsequent obfuscation; therefore, it should be discarded.
[0043] The linear formula and loss function of the logistic regression algorithm in sub-step S1.1 are expressed as follows:
[0044] $$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i} \beta_i x_i$$
[0045] $$J(\beta) = -\sum_{(x,\,y)} \ln P(y \mid x) + \lambda R(\beta)$$
[0046] where $\operatorname{logit}(\cdot)$ represents the logit transformation; $\beta_0$ represents the intercept coefficient; $\beta_i$ indicates the correlation between the $i$-th movie and the gender attribute; $p$ is the predicted probability of the positive gender class; $x_i$ indicates the value of the $i$-th feature of the sample; $y$ indicates the result of the binary classification of gender; $J(\beta)$ represents the cross-entropy loss after adding the regularization term; $P(y \mid x)$ indicates the probability that the result is $y$ given sample $x$; $R(\beta)$ indicates the regularization term to be added; and $\lambda$ is the adjustable parameter of the regularization term.
[0047] Finally, the gender association forms $L_m$ and $L_f$ are obtained, representing the forms of movies strongly associated with men and with women, respectively.
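Sub-step S1.1 can be illustrated with a minimal sketch. It assumes a binary "has rated" feature per movie, an L2 regularization term, plain batch gradient descent, and label 1 for male users; the function names, learning rate, and threshold value are hypothetical and are not taken from the patent.

```python
import math

def fit_logistic(X, y, lam=0.1, lr=0.5, epochs=500):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * d
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            z = b0 + sum(b * v for b, v in zip(beta, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            g0 += err
            for j in range(d):
                g[j] += err * xi[j]
        b0 -= lr * g0 / n
        beta = [b - lr * (gj / n + lam * b) for b, gj in zip(beta, g)]
    return b0, beta

def gender_lists(beta, threshold):
    """Split movies by coefficient sign; drop weakly associated ones."""
    male = sorted((j for j, b in enumerate(beta) if b >= threshold),
                  key=lambda j: -beta[j])
    female = sorted((j for j, b in enumerate(beta) if b <= -threshold),
                    key=lambda j: beta[j])
    return male, female

# Rows are users, columns are movies (1 = rated); y = 1 marks male users.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
b0, beta = fit_logistic(X, y)
male, female = gender_lists(beta, threshold=0.1)
```

In this toy data, movie 0 is rated only by male users and movie 1 only by female users, while movie 2 is rated by one user of each gender, so its coefficient stays near zero and it is discarded.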
[0048] Sub-step S1.2: Utilize the user KNN algorithm to mine user personalized preferences and obtain a personalized recommendation form. In previous work on obfuscation techniques, researchers focused on privacy protection while neglecting the characteristics of the recommender system domain. Recommender systems aim to provide high-quality personalized recommendations to solve the information overload problem, but obfuscation techniques introduce a large number of obfuscated samples into the current interaction matrix, significantly impacting recommendation performance. Specifically, in movie recommendations, user activity varies considerably; some users provide a large amount of movie reviews on the platform, while the vast majority of users have only a few interaction records, meaning the usable data is very sparse. How to avoid the decline in recommendation efficiency and the problem of sparse usable data is one of the core research contents of this invention.
[0049] KNN (k-nearest neighbors) is a commonly used algorithm in machine learning. The principle is as follows: for an item A that needs to be classified, a distance measure between items is defined, the K nearest neighbors of A with known categories are found, and the most frequent category among these K neighbors is assigned to item A.
[0050] Extending this to movie recommendations, the idea of KNN can be implemented by finding similar users, defined as user-based KNN (user KNN). The KNN recommendation algorithm selects a similarity calculation method to calculate similarity, then sorts the objects according to their similarity, takes the top K objects, uses their similarity to the target object as weights, performs a weighted sum of scores, and finally normalizes the result using the sum of the similarity between these K objects and the target.
[0051] In this invention, considering data sparsity, the improved user KNN mainly consists of the following three steps:
[0052] S1.2.1 Calculating Similarity: Commonly used similarity metrics in recommendation systems include Pearson coefficient, cosine similarity, Euclidean distance, and Jaccard coefficient. This invention uses the Pearson coefficient, which ranges from -1 to 1. A value of -1 indicates a negative correlation between the two sets of variables; 0 indicates no correlation; and 1 indicates a positive correlation. The calculation formula is as follows:
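[0053] $$s(m,j) = \frac{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)(r_{u,j}-\bar{r}_j)}{\sqrt{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)^2}\,\sqrt{\sum_{u \in U_{mj}} (r_{u,j}-\bar{r}_j)^2}}$$
[0054] where $s(m,j)$ denotes the similarity between movies $m$ and $j$, $U_{mj}$ denotes the set of all users who have rated both movies $m$ and $j$, $\bar{r}_m$ and $\bar{r}_j$ denote the average ratings of movies $m$ and $j$, $r_{u,m}$ denotes the rating of user $u$ for movie $m$, and $r_{u,j}$ denotes the rating of user $u$ for movie $j$;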
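The Pearson similarity between two movies can be sketched as follows. This is an illustration only: the nested-dictionary data layout and the function name are hypothetical, and the movie averages are taken over the co-rating users, which is one possible reading of the text.

```python
import math

def pearson(ratings, m, j):
    """Pearson similarity of movies m and j over users who rated both.

    ratings maps user -> {movie: rating}.
    Returns (similarity, number of co-rating users).
    """
    common = [r for r in ratings.values() if m in r and j in r]
    n = len(common)
    if n < 2:
        return 0.0, n
    mean_m = sum(r[m] for r in common) / n
    mean_j = sum(r[j] for r in common) / n
    num = sum((r[m] - mean_m) * (r[j] - mean_j) for r in common)
    den = (math.sqrt(sum((r[m] - mean_m) ** 2 for r in common))
           * math.sqrt(sum((r[j] - mean_j) ** 2 for r in common)))
    return (num / den if den else 0.0), n

ratings = {
    "u1": {"m": 5, "j": 4},
    "u2": {"m": 3, "j": 2},
    "u3": {"m": 1, "j": 0},
    "u4": {"m": 4},  # rated only one of the two movies, so ignored
}
sim, n_common = pearson(ratings, "m", "j")
```

The co-rater count is returned alongside the similarity because the subsequent compression step depends on the size of the intersection.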
[0055] S1.2.2 Similarity Compression: According to the Pearson correlation formula, if the set of users who rated both of two movies is much smaller than the corresponding set for other movie pairs, then the similarity computed for these two movies is relatively unreliable. As described above regarding data sparsity, small intersections are quite common in recommender systems, which significantly increases the unreliability of the similarity.
[0056] In real-world recommendation systems, the number of users and movies can easily reach millions, resulting in very little overlap in choices between two users. For mainstream datasets, only 4.5% of the possible user-item ratings are present in MovieLens, 1.2% in Netflix, 0.35% in Bibsonomy, and 0.046% in Delicious. To improve the reliability of predictions, it is necessary to mitigate this unreliability; therefore, the similarity is compressed based on the size of the intersection. The calculation formula is as follows:
[0057] $$s'(m,j) = \frac{|U_{mj}|}{|U_{mj}| + \delta}\, s(m,j)$$
[0058] where $\delta$ is the compression ratio specified by the user, $U_{mj}$ denotes the set of all users who have rated both movies $m$ and $j$, $s'(m,j)$ indicates the compressed similarity between movies $m$ and $j$, and $s(m,j)$ indicates the similarity between movies $m$ and $j$ obtained from sub-step S1.2.1;
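A short sketch of the compression step, assuming the commonly used shrinkage form in which the similarity is scaled by n / (n + delta), where n is the number of co-rating users. The exact form used in the patent is not recoverable from the text, so this is an assumption, and the function and parameter names are hypothetical.

```python
def shrink(similarity, n_common, delta):
    """Scale a similarity toward zero when few users rated both movies."""
    return similarity * n_common / (n_common + delta)

# The same raw similarity is trusted less when it rests on fewer co-ratings.
weak_support = shrink(0.9, n_common=2, delta=10)
strong_support = shrink(0.9, n_common=200, delta=10)
```

With a compression coefficient of 10, a similarity of 0.9 backed by only 2 co-raters is reduced to about 0.15, while the same value backed by 200 co-raters stays near 0.86.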
[0059] S1.2.3, Neighbor Selection: Among all the movies rated by user u, find the k movies with the highest similarity to movie m, and let N(u, m) represent the set of these k movies. Using the algorithms in S1.2.1 and S1.2.2, obtain the user's personalized recommendation list. The KNN score prediction formula is as follows:
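[0060] $$\hat{r}_{u,m} = \frac{\sum_{j \in N(u,m)} s'(m,j)\, r_{u,j}}{\sum_{j \in N(u,m)} |s'(m,j)|}$$
[0061] where $\hat{r}_{u,m}$ indicates the predicted score of user $u$ for movie $m$, $s'(m,j)$ is the compressed similarity between movies $m$ and $j$, and $r_{u,j}$ represents the actual score that user $u$ has already given to movie $j$.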
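The neighbor selection and score prediction can be sketched as a similarity-weighted average over the k most similar movies the user has already rated. The data layout (a dictionary of the user's ratings and a dictionary of pairwise similarities) and the function name are hypothetical.

```python
def predict(user_ratings, sims, m, k):
    """Predict a user's score for movie m from the k most similar rated movies.

    user_ratings maps movie -> rating given by this user.
    sims maps (m, j) -> similarity between movies m and j.
    """
    neighbors = sorted((j for j in user_ratings if j != m),
                       key=lambda j: sims.get((m, j), 0.0),
                       reverse=True)[:k]
    num = sum(sims.get((m, j), 0.0) * user_ratings[j] for j in neighbors)
    den = sum(abs(sims.get((m, j), 0.0)) for j in neighbors)
    return num / den if den else None

user_ratings = {"a": 4, "b": 2, "c": 5}
sims = {("m", "a"): 0.8, ("m", "b"): 0.2, ("m", "c"): 0.1}
score = predict(user_ratings, sims, "m", k=2)
```

Here the two nearest neighbors are movies "a" and "b", so the prediction is (0.8 × 4 + 0.2 × 2) / (0.8 + 0.2) = 3.6.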
[0062] Ultimately, a personalized form $L_u$ is obtained for each user $u$.
[0063] Sub-step S1.3: Intersect the gender association forms $L_m$ and $L_f$ obtained in sub-step S1.1 with the user personalized recommendation form $L_u$ obtained in sub-step S1.2 to construct the final sensitive attribute association table $T_u$. The resulting table $T_u$ both reflects the user's individual preferences and is highly correlated with the gender attribute.
[0064] Based on the above analysis, Algorithm 1 provides detailed steps for generating the sensitive attribute association table.
[0065] Algorithm 1:
[0066] Input: original user-item interaction matrix $R$ (containing the users $U$ and the movies $I$); logistic regression regularization parameter $\lambda$; similarity compression coefficient $\delta$;
[0067] Output: sensitive attribute association table $T_u$;
[0068] 1. for (item $i$) in $I$ do
[0069] 2.   Compute correlation coefficients by logistic regression for each $i$;
[0070] 3. Sort coefficients descending;
[0071] 4. return $L_m$ and $L_f$ // lists of indicative items for male and female users;
[0072] 5. for (user $u$) in $U$ do
[0073] 6.   for (item $i$ in $I$) do
[0074] 7.     Compute similarity to find nearest neighbor candidates in $I$;
[0075] 8.   Sort selected items based on the number of possessed neighbor candidates;
[0076] 9. return $L_u$ // contains personalized list of items for $u$;
[0077] 10. for (user $u$ in $U$) do
[0078] 11.   Fix a threshold on $L_m$ and $L_f$
[0079] 12.   Create personalized list of indicative items for $u$: $T_u$;
[0080] 13.   if ($u$ is a Female) then
[0081] 14.     for item $i \in L_u$ do
[0082] 15.       add($i$) to $T_u$ if item $i \in L_m$
[0083] 16.   else
[0084] 17.     Do the same steps but for a Male target user
[0085] 18. Return $T_u$
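The final intersection step of Algorithm 1 can be sketched as below, assuming the gender forms and the personalized form are already available as ordered lists. The function name, the "F"/"M" gender codes, and the `top_n` cut-off standing in for the threshold are hypothetical.

```python
def sensitive_table(personal, male_list, female_list, gender, top_n):
    """Keep a user's personalized items that are also strongly
    associated with the opposite gender."""
    opposite = male_list if gender == "F" else female_list
    allowed = set(opposite[:top_n])  # threshold on the gender form
    return [item for item in personal if item in allowed]

personal = ["x", "y", "z"]    # personalized form of one user
male_list = ["z", "q", "x"]   # sorted by strength of male association
female_list = ["y"]
table_for_female_user = sensitive_table(personal, male_list, female_list, "F", top_n=2)
table_for_male_user = sensitive_table(personal, male_list, female_list, "M", top_n=2)
```

Preserving the order of the personalized form keeps the items the user is most likely to enjoy at the front of the resulting table.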
[0086] Step S2: Based on the sensitive attribute association table obtained in Step S1, add obfuscation ratings to the existing user-item matrix using a sampling strategy, and construct the user-item matrix with added obfuscation ratings. The application of data obfuscation is illustrated in Figure 2.
[0087] This process requires consideration of item selection strategies and scoring strategies.
[0088] Regarding the item selection strategy, previous studies employed a simple and efficient "greedy strategy": obfuscation items are selected from the opposite-sex association form in descending order of priority. If the current user is female and *Beautiful Days* is the movie most strongly associated with men, i.e., it is at the top of the male association form, then the greedy strategy adds *Beautiful Days* to this user's ratings first. However, obfuscation matrices built with the greedy strategy are filled with a large number of strongly gender-related samples, and visualization shows that the overall data distribution changes abruptly. Attackers can recognize from this that obfuscation has been applied to the rating matrix. Furthermore, because gender is a binary attribute, they can simply invert the inference model's output and still recover the user's gender. Therefore, this invention uses a sampling strategy to select obfuscation items: movies are drawn according to the distribution of association scores in the opposite-sex form. For example, if the male form $L_m$ contains three movies j1, j2, and j3 with scores of 0.5, 0.3, and 0.2 respectively, then j1 is selected with probability 0.5, j2 with probability 0.3, and j3 with probability 0.2. Additionally, once an item has been selected twice as many times as its original count, it is removed from the associated form to maintain the overall data distribution.
[0089] In terms of scoring selection strategies, there are mainly three strategies: "item average score," "user average score," and "opposite-sex group average score." In this invention, the "opposite-sex group average score," which has the best overall performance, is directly selected as the scoring selection strategy.
[0090] That is, the sampling strategy in step S2 specifically selects movies by analyzing the relevant score distribution of movies corresponding to the opposite sex in the list, and directly selects the average score of the opposite sex group with the best overall performance as the scoring selection strategy.
[0091] In step S2, the softmax function maps the association scores of the samples to the range [0, 1] to form sampling probabilities. When the number of times an item is selected reaches twice the original number, it is removed from the associated form to maintain the overall data distribution. The softmax function is:
[0092] $$p_i = \frac{e^{c_i}}{\sum_{j=1}^{n} e^{c_j}}$$
[0093] where $c_i$ denotes the association score of the $i$-th sample, $n$ represents the number of samples in the associated form, and $p_i$ represents the probability that each sample will eventually be selected.
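The sampling strategy can be sketched as softmax-weighted draws with a cap on how often each item may be selected. The sketch assumes that "the original number" refers to a per-item original count supplied by the caller; that interpretation, like the function and parameter names, is an assumption.

```python
import math
import random

def softmax(scores):
    """Map raw association scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_items(table, original_counts, n_draws, rng):
    """Draw items with softmax probabilities; drop an item once it has
    been picked twice as often as its original count."""
    table = dict(table)  # item -> association score; copied, then pruned
    picked, draws = {}, []
    while table and len(draws) < n_draws:
        items = list(table)
        probs = softmax([table[i] for i in items])
        item = rng.choices(items, weights=probs, k=1)[0]
        draws.append(item)
        picked[item] = picked.get(item, 0) + 1
        if picked[item] >= 2 * original_counts[item]:
            del table[item]
    return draws

scores = {"j1": 0.5, "j2": 0.3, "j3": 0.2}
draws = sample_items(scores, {"j1": 1, "j2": 1, "j3": 1}, n_draws=10,
                     rng=random.Random(0))
```

Because each item here has an original count of 1, it leaves the table after two selections, so sampling stops after six draws even though ten were requested.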
[0094] Regarding the scoring strategy, the "opposite-sex group average score" approach, which yielded the best overall performance, was chosen. The calculation formula is as follows:
[0095] $$\tilde{r}_{i} = \frac{1}{N_o} \sum_{v \in G_o} r_{v,i}$$
[0096] where $\tilde{r}_{i}$ is the value of the obfuscation rating added for movie $i$, $N_o$ represents the number of users of the opposite sex, and $r_{v,i}$ represents the rating given to movie $i$ by user $v$ in the opposite-sex group $G_o$.
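The scoring strategy can be sketched as a plain average. The sketch averages over the opposite-sex users who actually rated the movie, which is an assumption about how the group count is taken; the data layout and names are hypothetical.

```python
def opposite_group_mean(ratings, genders, user_gender, movie):
    """Average rating of `movie` among users whose gender differs
    from `user_gender`; None if nobody in that group rated it."""
    values = [r[movie] for u, r in ratings.items()
              if genders[u] != user_gender and movie in r]
    return sum(values) / len(values) if values else None

ratings = {"a": {"x": 4}, "b": {"x": 2}, "c": {"x": 5}, "d": {}}
genders = {"a": "M", "b": "M", "c": "F", "d": "M"}
score = opposite_group_mean(ratings, genders, "F", "x")
```

For a female target user, only the male users "a" and "b" contribute, giving an obfuscation rating of 3.0 for movie "x".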
[0097] Based on the above analysis, Algorithm 2 provides detailed steps for constructing the user item matrix after adding obfuscation ratings.
[0098] Algorithm 2:
[0099] Input: proportion of obfuscation samples $\rho$; original user-item interaction matrix $R$ (containing the users $U$ and the movies $I$); the users' sensitive attribute association tables $T_u$;
[0100] Output: obfuscation matrix $R'$ after adding obfuscation samples;
[0101] 1. for (user $u$) in $U$ do
[0102] 2.   count = initial_count[$u$] * $\rho$
[0103] 3.   added = 0
[0104] 4.   while added < count do
[0105] 5.     $i$ = pick an item from $T_u$ at random
[0106] 6.     if $R'[u][i]$ == 0 then
[0107] 7.       $R'[u][i]$ = $\tilde{r}_i$
[0108] 8.       added += 1
[0109] 9.   total_added += added
[0110] 10. Return $R'$
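Algorithm 2 can be sketched for a single user as follows. The sketch picks candidates uniformly at random and caps the number of additions at the number of available candidates so that the loop always terminates; both choices, along with the function name and the `rate` callback that supplies the obfuscation rating, are assumptions.

```python
import random

def add_obfuscation(profile, table, proportion, rate, rng):
    """Add obfuscation ratings to one user's profile.

    profile: movie -> rating; table: candidate movies for this user;
    rate: callable returning the obfuscation rating for a movie.
    Returns the new profile and the number of ratings added.
    """
    new_profile = dict(profile)
    # Only movies the user has not rated can receive an obfuscation rating.
    candidates = [i for i in table if i not in new_profile]
    count = min(int(len(profile) * proportion), len(candidates))
    added = 0
    while added < count:
        i = candidates.pop(rng.randrange(len(candidates)))
        new_profile[i] = rate(i)
        added += 1
    return new_profile, added

rng = random.Random(0)
profile = {f"m{k}": 4 for k in range(10)}
table = ["m0", "x1", "x2", "x3", "x4"]
obfuscated, added = add_obfuscation(profile, table, 0.2, lambda i: 3.0, rng)
```

Summing `added` over all users yields the total number of obfuscation ratings, which the removal step needs in order to restore the original data scale.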
[0111] Step S3: Record the number of obfuscation ratings added in step S2, apply the removal strategy, and finally generate the obfuscation matrix after applying the removal strategy.
[0112] To further improve privacy protection and recommendation efficiency, this invention employs a "removal strategy" to eliminate some redundant ratings. Existing research indicates two key points: First, once a user's ratings reach a certain threshold, subsequent rating data no longer provides sufficient information for the recommendation system; second, data scale is the primary indicator affecting recommendation efficiency, meaning that recommendation efficiency is highly sensitive to data scale. Based on these two points, removing unnecessary ratings to maintain data scale is essential. The main content of the "removal strategy" is as follows:
[0113] 1) Record the number of ratings added. The number of ratings deleted should be the same as the number added to maintain the original data size.
[0114] 2) Randomly select users, and when a user's rating count reaches 20, delete movies with high opposite-sex association from that user's rating history.
[0115] Based on the above analysis, Algorithm 3 provides detailed steps for generating the obfuscation matrix to which the "removal strategy" has been applied.
[0116] Algorithm 3:
[0117] Input: obfuscation matrix $R'$ after adding ratings (containing the users $U$ and the movies $I$); total number of added obfuscation ratings; removal threshold;
[0118] Output: final privacy-preserving obfuscation matrix $R''$;
[0119] 1. for (user $u$) in $U$ do
[0120] 2.   if interaction_count[$u$] ≥ threshold then
[0121] 3.     remove_count += 1
[0122] 4. to_be_removed = total_added / remove_count
[0123] 5. for (user $u$) in $U$ do
[0124] 6.   if (interaction_count[$u$] ≥ threshold) then
[0125] 7.     if ($u$ is a Female) then
[0126] 8.       removed = 0
[0127] 9.       while (removed < to_be_removed) do
[0128] 10.        $i$ = pick an item from $L_m$
[0129] 11.        if $R'[u][i]$ != 0 then
[0130] 12.          $R''[u][i]$ = 0
[0131] 13.          removed += 1
[0132] 14.    else
[0133] 15.      Do the same steps but for a Male target user
[0134] 16. Return $R''$
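The removal strategy can be sketched as below. The sketch spreads the removal quota evenly over the users whose rating count reaches the threshold and takes, per user, a caller-supplied list of strongly gender-associated movies as removal candidates; integer division for the quota and all names are assumptions.

```python
import random

def apply_removal(matrix, removal_lists, total_added, threshold, rng):
    """Remove ratings from active users to offset the added obfuscation.

    matrix: user -> {movie: rating};
    removal_lists: user -> strongly gender-associated movies to consider.
    Returns the new matrix and the number of ratings removed.
    """
    out = {u: dict(p) for u, p in matrix.items()}
    eligible = [u for u, p in matrix.items() if len(p) >= threshold]
    if not eligible:
        return out, 0
    quota = total_added // len(eligible)  # ratings to remove per eligible user
    removed_total = 0
    for u in eligible:
        candidates = [i for i in removal_lists.get(u, []) if i in out[u]]
        rng.shuffle(candidates)
        for i in candidates[:quota]:
            del out[u][i]
            removed_total += 1
    return out, removed_total

matrix = {"u1": {"a": 5, "b": 4, "c": 3, "d": 2}, "u2": {"a": 1}}
removal_lists = {"u1": ["a", "b", "z"], "u2": ["a"]}
cleaned, removed = apply_removal(matrix, removal_lists, total_added=2,
                                 threshold=3, rng=random.Random(0))
```

Only "u1" reaches the threshold of 3 ratings, so both removals come from that user's profile and the sparse profile of "u2" is left untouched.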
[0135] The above are merely preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A method for protecting sensitive user information based on data obfuscation technology, characterized in that it includes the following steps: Step S1: Generate a sensitive attribute association table based on the association between recommended items and user gender characteristics; Step S2: Based on the sensitive attribute association table obtained in Step S1, add obfuscation ratings to the existing user-item matrix using a sampling strategy to construct the user-item matrix with added obfuscation ratings. The sampling strategy in step S2 selects movies by sampling from the association form corresponding to the opposite sex, and directly selects the average score of the opposite-sex group, which has the best overall performance, as the scoring selection strategy. In step S2, the softmax function maps the association scores of the samples to the range [0, 1] to form sampling probabilities. When the number of times an item is selected reaches twice the original number, it is removed from the associated form to maintain the overall data distribution. The softmax function is: $$p_i = \frac{e^{c_i}}{\sum_{j=1}^{n} e^{c_j}}$$ where $c_i$ indicates the association score of the $i$-th sample, $n$ represents the number of samples in the associated form, $c_j$ indicates the association score of the $j$-th sample in the associated form, $e$ is the base of the natural logarithm, and $p_i$ represents the probability that each sample will ultimately be selected. Step S3: Record the number of obfuscation ratings added in step S2, apply the removal strategy, and finally generate the obfuscation matrix after applying the removal strategy. The removal strategy is as follows: delete as many ratings as obfuscation ratings were added, so as to maintain the original data scale; users are selected at random, and when the number of ratings of a selected user reaches a set threshold, strongly correlated items are deleted from that user's rating history.
2. The method for protecting sensitive user information based on data obfuscation technology according to claim 1, characterized in that: The recommended items are movie recommendations, and step S1 includes the following sub-steps: Sub-step S1.1: Initially screen out a candidate set of movies that are strongly associated with gender characteristics, and use the logistic regression algorithm to obtain a preliminary gender association form; Sub-step S1.2: Use the user KNN algorithm to mine user personalized preferences and obtain a personalized recommendation form for the user; Sub-step S1.3: Take the intersection of the gender association form obtained in sub-step S1.1 and the user personalized recommendation form obtained in sub-step S1.2 to construct the final sensitive attribute association table.
3. The method for protecting sensitive user information based on data obfuscation technology according to claim 2, characterized in that: The linear formula and loss function of the logistic regression algorithm in sub-step S1.1 are expressed as follows: $$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i} \beta_i x_i$$ $$J(\beta) = -\sum_{(x,\,y)} \ln P(y \mid x) + \lambda R(\beta)$$ where $\operatorname{logit}(\cdot)$ represents the logit transformation; $\beta_0$ represents the intercept coefficient; $\beta_i$ indicates the correlation between the $i$-th movie and the gender attribute; $p$ is the predicted probability of the positive gender class; $x_i$ indicates the value of the $i$-th feature of the sample; $y$ indicates the result of the binary classification of gender; $J(\beta)$ represents the cross-entropy loss after adding the regularization term; $P(y \mid x)$ indicates the probability that the result is $y$ given sample $x$; $R(\beta)$ indicates the regularization term to be added; and $\lambda$ is the adjustable parameter of the regularization term.
4. The method for protecting sensitive user information based on data obfuscation technology according to claim 2, characterized in that: The user KNN algorithm in sub-step S1.2 includes the following three steps: S1.2.1 Calculating Similarity: The Pearson coefficient is used, whose value lies in [-1, 1]. A value of -1 indicates a negative correlation between the two sets of variables; 0 indicates no correlation; and 1 indicates a positive correlation. The calculation formula is as follows: $$s(m,j) = \frac{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)(r_{u,j}-\bar{r}_j)}{\sqrt{\sum_{u \in U_{mj}} (r_{u,m}-\bar{r}_m)^2}\,\sqrt{\sum_{u \in U_{mj}} (r_{u,j}-\bar{r}_j)^2}}$$ where $s(m,j)$ denotes the similarity between movies $m$ and $j$, $U_{mj}$ denotes the set of all users who have rated both movies $m$ and $j$, $\bar{r}_m$ and $\bar{r}_j$ denote the average ratings of movies $m$ and $j$, $r_{u,m}$ denotes the rating of user $u$ for movie $m$, and $r_{u,j}$ denotes the rating of user $u$ for movie $j$; S1.2.2 Similarity Compression: The similarity is compressed once based on the size of the intersection. The calculation formula is as follows: $$s'(m,j) = \frac{|U_{mj}|}{|U_{mj}| + \delta}\, s(m,j)$$ where $\delta$ is the compression ratio specified by the user, $U_{mj}$ denotes the set of all users who have rated both movies $m$ and $j$, $s'(m,j)$ indicates the compressed similarity between movies $m$ and $j$, and $s(m,j)$ indicates the similarity between movies $m$ and $j$ obtained from sub-step S1.2.1; S1.2.3 Neighbor Selection: Among all the movies rated by user $u$, find the $k$ movies with the highest similarity to movie $m$, and let $N(u, m)$ represent the set of these $k$ movies. Using the algorithms in S1.2.1 and S1.2.2, obtain the user's personalized recommendation list. The KNN score prediction formula is as follows: $$\hat{r}_{u,m} = \frac{\sum_{j \in N(u,m)} s'(m,j)\, r_{u,j}}{\sum_{j \in N(u,m)} |s'(m,j)|}$$ where $\hat{r}_{u,m}$ indicates the predicted score of user $u$ for movie $m$, and $r_{u,j}$ represents the actual score that user $u$ has already given to movie $j$.
5. The method for protecting sensitive user information based on data obfuscation technology according to claim 1, characterized in that: The average score of the opposite-sex group is used as the scoring selection strategy, and its calculation formula is as follows: $$\tilde{r}_{i} = \frac{1}{N_o} \sum_{v \in G_o} r_{v,i}$$ where $\tilde{r}_{i}$ is the value of the obfuscation rating added for movie $i$, $N_o$ represents the number of users of the opposite sex, and $r_{v,i}$ represents the rating given to movie $i$ by user $v$ in the opposite-sex group $G_o$.