Credit investigation data desensitization processing method and system based on artificial intelligence
By employing an AI-based credit data anonymization method, sensitive information is identified and a disguised dataset is constructed for differential analysis. This resolves the conflict between privacy protection and data accuracy in existing technologies, achieving efficient and accurate credit data anonymization and improving the reliability of credit assessment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 天创信用服务有限公司
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies, in the process of de-identifying credit data, struggle to maintain the accuracy and intrinsic relevance of the data while protecting privacy. Furthermore, there is a risk of insufficient or excessive de-identification, which can affect the accuracy and security of credit assessment.
An AI-based approach is employed to identify sensitive information, construct a disguised dataset, and perform differential analysis. The sensitive information is then replaced with validation data to ensure that the logical and statistical characteristics of the data remain unchanged. The primary disguised data is prioritized, while the secondary disguised data serves as a backup, ensuring data stability and fault tolerance.
This approach protects privacy while maintaining the inherent logic and statistical characteristics of data, improving the quality and analyzability of anonymized data, avoiding the risk of sensitive information leakage, and enhancing the accuracy and reliability of credit assessment.
Figure CN122241759A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of data desensitization technology, specifically relating to a method and system for desensitizing credit information based on artificial intelligence.
Background Technology
[0002] The social credit system collects, integrates, and deeply analyzes credit data to construct credit profiles for individuals and enterprises, providing credit data support for areas such as financial lending, risk control, and market supervision. Due to the high degree of privacy involved in credit data, data anonymization techniques are required to protect the privacy of individuals and enterprises.
[0003] Traditional data anonymization methods such as masking, generalization, or data perturbation usually sacrifice data accuracy, causing the anonymized data to lose its original statistical distribution characteristics and intrinsic correlations. Subsequent credit assessments based on the distorted data can then be misleading, which reduces their accuracy and reliability. Furthermore, existing anonymization strategies struggle to differentiate their treatment according to the sensitivity levels of data fields, the specific needs of business scenarios, and the complex logical relationships between data. This can easily result in over-anonymization of critical information or in privacy leaks caused by insufficient anonymization, potentially leading to data security incidents in severe cases.
[0004] In view of this, this application discloses a method and system for de-identifying credit information data based on artificial intelligence.
Summary of the Invention
[0005] The purpose of this invention is to provide a method and system for de-identifying credit data based on artificial intelligence. This system can de-identify personal privacy and sensitive corporate information contained in credit data, while also ensuring that the de-identified data can be used to replace the original data.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] An AI-based method for de-identifying credit information data includes the following steps:
[0008] Identify sensitive information in the credit data to be processed;
[0009] In response to sensitive information, a de-identification and replacement process based on verification data is executed. The de-identification and replacement process includes: constructing a disguised dataset based on the sensitive information; performing a difference analysis on the disguised dataset to select verification data from the disguised data; and replacing the sensitive information with the verification data to form de-identified credit data.
[0010] The process of constructing a disguised dataset based on sensitive information includes: determining a corresponding sorting number for the sensitive information; extracting disguised data from the disguised database according to the sorting number; and replacing the corresponding sensitive information with the disguised data to construct the disguised dataset. The data stored in the disguised database includes primary disguised data and secondary disguised data.
[0011] The process of extracting disguised data from the disguised database based on the sorting number includes: first extracting the primary disguised data according to the sorting number; if the primary disguised data cannot be extracted, extracting the secondary disguised data and using the timestamp of the secondary disguised data as the sorting number.
[0012] Furthermore, the identification of sensitive information in the credit data to be processed includes:
[0013] Perform attribute classification on the data fields in the credit data to be processed to divide the data fields into basic fields and sensitive fields; mark the information contained in the sensitive fields as sensitive information.
[0014] Furthermore, the step of performing differential analysis on the disguised dataset to select verification data from the disguised data includes:
[0015] Obtain adjacent disguised data in the disguised dataset; use a preset comparison function to calculate the difference between adjacent disguised data to generate a verification deviation value;
[0016] Determine whether the verification deviation value is within the preset value range; when the verification deviation value is within the preset value range, determine the adjacent disguised data that generated the verification deviation value as candidate verification data, and select one of the candidate verification data as the verification data according to the positive or negative sign of the verification deviation value.
[0017] Furthermore, the step of replacing sensitive information with verification data to form de-identified credit data includes:
[0018] Replacement is performed when the degree of matching between the verification data and the sensitive information meets the preset matching threshold.
[0019] An AI-based credit data anonymization system includes the following modules:
[0020] The sensitive information identification module is used to identify sensitive information in the credit data to be processed;
[0021] The verification data selection module is used to construct a disguised dataset based on sensitive information in response to sensitive information; and to perform differential analysis on the disguised dataset to select verification data from the disguised data.
[0022] The desensitization and replacement module is used to replace sensitive information with verification data to form desensitized credit data.
[0023] Furthermore, the construction of the disguised dataset based on sensitive information includes:
[0024] Assign a sort number to the sensitive information; extract the disguised data from the disguised database based on the sort number; and replace the corresponding sensitive information with the disguised data to construct the disguised dataset.
[0025] Furthermore, the step of extracting disguised data from the disguised database according to the sorting number includes:
[0026] Primary disguised data is extracted first based on the sorting number; if primary disguised data cannot be extracted, secondary disguised data is extracted, and the timestamp of the secondary disguised data is used as the sorting number.
[0027] Furthermore, the step of performing differential analysis on the disguised dataset to select verification data from the disguised data includes:
[0028] Obtain adjacent disguised data in the disguised dataset; use a preset comparison function to calculate the difference between adjacent disguised data to generate a verification deviation value.
[0029] Furthermore, the step of performing differential analysis on the disguised dataset to select verification data from the disguised data also includes:
[0030] Determine whether the verification deviation value is within the preset value range; when the verification deviation value is within the preset value range, determine the adjacent disguised data that generated the verification deviation value as candidate verification data, and select one of the candidate verification data as the verification data according to the positive or negative sign of the verification deviation value.
[0031] Furthermore, replacing sensitive information with verification data to form de-identified credit data includes:
[0032] Replacement is performed when the degree of matching between the verification data and the sensitive information meets the preset matching threshold.
[0033] Beneficial effects
[0034] This invention constructs a disguised dataset by replacing sensitive information with disguised data. The disguised dataset is then arranged by sorting number, and the difference between adjacent disguised data is calculated to output a verification deviation value. Based on the verification deviation value, verification data is selected from the disguised data to complete the replacement of sensitive information. This ensures that the final de-identified credit data effectively masks sensitive information while maintaining the inherent logic and statistical characteristics between data, thereby improving the data quality and analyzability of the de-identified credit data.
[0035] This invention performs attribute determination on the identified data fields, and then automatically classifies the data fields into basic fields and sensitive fields according to preset determination criteria, thereby marking the data fields to be de-identified, efficiently and accurately locating sensitive information, improving the processing efficiency of the de-identification process, avoiding the risk of sensitive information leakage due to incomplete rules, and enhancing the accuracy of the de-identification process.
[0036] This invention employs a dual-layer data source design that includes primary and secondary disguised data. When extracting data, the primary disguised data is extracted according to the sorting number; if the extraction fails, the process automatically switches to the secondary disguised data, thereby ensuring a stable supply of disguised data and enhancing the fault tolerance and stability of the desensitization process.
Attached Figure Description
[0037] Figure 1 is a flowchart of the method of the present invention;
[0038] Figure 2 is a flowchart of the method for constructing a disguised dataset based on sensitive information according to the present invention;
[0039] Figure 3 is a flowchart of the method of the present invention for performing differential analysis on a disguised dataset to select verification data from the disguised data;
[0040] Figure 4 is a system module diagram of the present invention.
Detailed Implementation
[0041] The technical solution of this patent will be further described in detail below with reference to specific embodiments. The following embodiments are used to illustrate the present invention, but should not be used to limit the scope of protection of the present invention. The conditions in the embodiments can be further adjusted according to specific conditions. Simple improvements to the method of the present invention under the premise of the concept of the present invention are all within the scope of protection claimed by the present invention.
[0042] Example 1
[0043] As shown in Figures 1-3, this embodiment provides a credit data anonymization method based on artificial intelligence, including the following steps:
[0044] The credit data to be processed is acquired, and the multiple data fields it contains are identified. These data fields include specific items used to record personal information, sensitive corporate information, transaction information, and credit behavior.
[0045] Attribute classification is performed on multiple data fields to accurately identify data that needs to be protected. Specifically, attribute classification includes performing attribute judgment on multiple data fields based on preset judgment criteria, which include a set of rules for identifying sensitive information.
[0046] The rule set includes rules based on keywords in field names, format features of field content, and statistical distribution features of the data within fields. Keyword rules match field names containing words such as "ID card number," "address," and "telephone number." Format rules match on the content itself: for example, a field whose values conform to a specific length and character combination is recognized as an ID card number.
[0047] By comparing the attributes of each data field with the rule set one by one, multiple data fields are divided into basic fields and sensitive fields. Specifically, information that is crucial to credit analysis but is not directly identifying an individual, such as loan amount and repayment status, can be classified as basic fields; while fields such as ID number, home address, and contact information are classified as sensitive fields.
[0048] After classification, all identified sensitive fields are marked as fields to be de-identified, providing a clear target for subsequent targeted de-identification operations.
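The rule-based classification in paragraphs [0045] to [0048] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the keyword list, the 18-character ID pattern, and the function names `is_sensitive` and `classify_fields` are all assumptions introduced here, and the statistical-distribution rule mentioned in the text is omitted.

```python
import re

# Hypothetical rule set: the keyword list and the ID-number pattern are
# illustrative assumptions, not values taken from the patent.
SENSITIVE_KEYWORDS = ("id card", "id number", "address", "telephone", "phone")
ID_NUMBER_PATTERN = re.compile(r"\d{17}[\dXx]")  # assumed 18-character ID format


def is_sensitive(field_name, values):
    """Return True if the field name or the field content matches a rule."""
    name = field_name.lower()
    # Keyword rule: the field name contains a sensitive keyword.
    if any(keyword in name for keyword in SENSITIVE_KEYWORDS):
        return True
    # Format rule: every non-empty value looks like an ID number.
    non_empty = [str(v) for v in values if v not in (None, "")]
    return bool(non_empty) and all(
        ID_NUMBER_PATTERN.fullmatch(v) for v in non_empty
    )


def classify_fields(columns):
    """Split a {field name: list of values} mapping into basic and sensitive fields."""
    basic, sensitive = [], []
    for name, values in columns.items():
        (sensitive if is_sensitive(name, values) else basic).append(name)
    return basic, sensitive
```

The fields returned in the second list correspond to the fields marked for de-identification; a learned classifier could replace `is_sensitive` without changing the surrounding flow.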
[0049] Furthermore, specific sensitive information is extracted from the marked fields of data to be de-identified; a unique sort number is assigned to each piece of sensitive information, which is used to establish a mapping relationship with the original data order to ensure the logical consistency of subsequent processing;
[0050] Based on the sorting number, the corresponding disguised data that does not contain real sensitive information is extracted from the pre-set disguised database. The data stored in the disguised database includes primary disguised data and secondary disguised data. The primary disguised data is extracted first based on the sorting number. The primary disguised data is pre-generated and stored synthetic data that is similar to real data in terms of data type, format and statistical distribution.
[0051] If the primary disguised data cannot be successfully extracted based on the sorting number, for example because there is no matching pre-stored data in the database, the backup data acquisition process is started, that is, the secondary disguised data is extracted.
[0052] The secondary disguised data is generated in real time by the data synthesis process. Specifically, the data synthesis process dynamically creates data that conforms to a specific format and statistical characteristics according to the input requirements, and the timestamp at which the secondary disguised data is acquired is used as its sorting number, both to ensure that dynamically generated data is uniquely numbered and to record the time sequence of data generation.
[0053] All the extracted disguised data are combined to form a disguised dataset.
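The primary/secondary extraction with timestamp fallback described in paragraphs [0049] to [0053] might look like the sketch below. The names `DisguisedItem`, `synthesize_secondary`, `extract_disguised`, and `build_disguised_dataset` are hypothetical, the primary store is modelled as a plain dictionary keyed by sort number, and the character-masking synthesizer is only a stand-in for whatever real-time data synthesis process an actual system would use.

```python
import time
from dataclasses import dataclass


@dataclass
class DisguisedItem:
    sort_number: int
    value: str
    source: str  # "primary" or "secondary"


def synthesize_secondary(template):
    """Hypothetical stand-in for the real-time synthesis process:
    keeps the format of the template but none of its characters."""
    return "".join(
        "0" if ch.isdigit() else "x" if ch.isalpha() else ch for ch in template
    )


def extract_disguised(sort_number, original, primary_store, clock=time.time_ns):
    """Prefer pre-generated primary data; fall back to synthesized secondary data."""
    if sort_number in primary_store:
        return DisguisedItem(sort_number, primary_store[sort_number], "primary")
    # Fallback: the acquisition timestamp becomes the sort number.
    return DisguisedItem(clock(), synthesize_secondary(original), "secondary")


def build_disguised_dataset(sensitive_values, primary_store, clock=time.time_ns):
    """Assign sort numbers 0..n-1 in original order and collect disguised data."""
    return [
        extract_disguised(i, value, primary_store, clock)
        for i, value in enumerate(sensitive_values)
    ]
```

The `clock` parameter is injectable so that the timestamp source can be replaced in tests; by default it uses `time.time_ns`.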
[0054] Furthermore, the differences are calculated and the verification deviation values are output. The disguised dataset is arranged according to sort number. To evaluate the internal consistency and smoothness of the disguised dataset, the difference between each pair of adjacent disguised data items is calculated item by item. The output of the calculation is the verification deviation value, which constitutes a quantitative description of the local volatility of the disguised dataset.
[0055] The calculation process is carried out according to a preset comparison logic, which is adjusted according to different data types. For numerical data, the difference between two values is directly calculated as the difference. For non-numerical data such as text or categories, it is mapped to a multi-dimensional numerical vector, and then the geometric distance or angle between the vectors is calculated to quantify the difference.
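A type-dependent comparison function of the kind described in paragraph [0055] could be sketched as follows. The sign convention (later value minus earlier value) and the character-frequency embedding in `embed_text` are assumptions made for illustration; a real system would presumably use a proper embedding model for non-numerical data.

```python
import math


def embed_text(text, dims=8):
    """Hypothetical embedding: a character-frequency vector standing in for
    whatever mapping to a multi-dimensional vector a real system would use."""
    vec = [0.0] * dims
    for ch in text:
        vec[ord(ch) % dims] += 1.0
    return vec


def difference(a, b):
    """Difference between two adjacent disguised values (b follows a)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return b - a  # signed numeric difference (assumed convention)
    va, vb = embed_text(str(a)), embed_text(str(b))
    return math.dist(va, vb)  # Euclidean distance between the vectors


def verification_deviations(values):
    """One deviation value per adjacent pair, in sort-number order."""
    return [difference(a, b) for a, b in zip(values, values[1:])]
```

Note that the distance used for non-numerical data is never negative, so sign-based selection only distinguishes cases for numerical data in this sketch.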
[0056] Furthermore, the verification data is selected. Based on the generated verification deviation values, the verification data to be used for replacement is selected from the disguised dataset, eliminating disguised data that may introduce abnormal fluctuations.
[0057] A preset numerical range is set for the verification deviation value. This range defines the acceptable normal fluctuation between data. It is obtained by calculating the difference between all adjacent data in the original dataset to form the original difference dataset, calculating the mean and standard deviation of the original difference dataset, and setting the preset numerical range to the interval from the mean minus 1.5 times the standard deviation to the mean plus 1.5 times the standard deviation.
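As a worked illustration of paragraph [0057], the range can be computed from the adjacent differences of numerical original data. The function name `preset_range` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def preset_range(original_values, k=1.5):
    """Return (low, high) = mean -/+ k * std of the adjacent differences."""
    diffs = [b - a for a, b in zip(original_values, original_values[1:])]
    if not diffs:
        raise ValueError("need at least two values to form a difference")
    mu = mean(diffs)
    sigma = pstdev(diffs)  # population standard deviation (an assumption)
    return mu - k * sigma, mu + k * sigma
```

For the values 0, 1, 4 the adjacent differences are 1 and 3, with mean 2 and population standard deviation 1, so the range is 0.5 to 3.5.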
[0058] Determine whether each verification deviation value falls within the preset value range. If it does, the transition between the two adjacent disguised data items that generated the deviation value is smooth, and these two adjacent items are identified as a set of candidate verification data. If multiple verification deviation values fall within the preset value range, the adjacent disguised data corresponding to each such value are identified as a separate set of candidate verification data.
[0059] The final verification data is selected from each set of candidate verification data. The selection rules are as follows: if the verification deviation value corresponding to the candidate verification data is positive, the data is showing an increasing trend, so the disguised data item with the larger value is selected from the candidate verification data as the verification data; if the verification deviation value is negative, the disguised data item with the smaller value is selected.
[0060] The selection strategy can maintain the local variation trend of the data, so that each selected verification data can be better integrated with the context of the original data.
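The range check and sign-based selection in paragraphs [0058] and [0059] can be sketched for numerical disguised data as below. The function name `select_verification_data` is hypothetical, the deviation is assumed to be the later value minus the earlier one, and the handling of a zero deviation is an assumption because the text does not cover that case.

```python
def select_verification_data(disguised, low, high):
    """For each adjacent pair whose deviation lies in [low, high], pick one value.

    Positive deviation: keep the larger value; negative: keep the smaller.
    """
    selected = []
    for a, b in zip(disguised, disguised[1:]):
        deviation = b - a
        if not (low <= deviation <= high):
            continue  # abnormal fluctuation: the pair is not a candidate
        if deviation > 0:
            selected.append(max(a, b))
        elif deviation < 0:
            selected.append(min(a, b))
        else:
            selected.append(b)  # zero deviation: both values are equal
    return selected
```

Under this particular sign convention both branches end up choosing the later element of the pair; a different comparison function or convention would not necessarily have that property.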
[0061] Furthermore, the replacement operation is performed and the de-identified credit data is generated. Before the selected verification data replaces the corresponding sensitive information, a quality check is performed: the content matching degree between the sensitive information in the fields to be de-identified and the verification data is evaluated to obtain a quantitative matching degree.
[0062] The degree of matching is a comprehensive evaluation result. Its calculation process includes comparing whether the data types of the two are consistent, whether the date format, address hierarchy structure and other data formats are consistent, and whether they belong to the same category semantically.
[0063] The functional relationship for calculating the degree of matching is:
[0064] $$M(D_o, D_v) = w_1\,\delta\big(T(D_o),\, T(D_v)\big) + w_2\,S_f(D_o, D_v) + w_3\,\operatorname{sim}(\mathbf{v}_o, \mathbf{v}_v)$$
[0065] In the formula, $\delta(\cdot,\cdot)$ represents the Kronecker function, which is used to determine whether the two inputs are of exactly the same type. If they are, the function value is 1; otherwise, it is 0.
[0066] $T(\cdot)$ represents the type extraction function, which returns the data type of the input data, such as string, integer, or date.
[0067] $S_f(\cdot,\cdot)$ represents the format similarity function, which calculates the similarity in format between two input data and can be computed using regular expression matching degree or edit distance;
[0068] $M(D_o, D_v)$ represents the overall matching score, that is, the weighted overall similarity between the verification data and the original sensitive data in terms of type, format, and semantics.
[0069] $D_o$ represents the original sensitive data, meaning the original sensitive information contained in the data field to be desensitized before the replacement operation is performed;
[0070] $D_v$ represents the verification data, which is the candidate data selected from the disguised dataset to replace the original sensitive information.
[0071] $w_1$, $w_2$, and $w_3$ are the weight coefficients, representing the importance weights of data type, format, and semantic similarity, respectively, and satisfy the following condition: $w_1 + w_2 + w_3 = 1$;
[0072] $\mathbf{v}_o$ and $\mathbf{v}_v$ are the semantic vectors, that is, the high-dimensional vector representations of the original data and the verification data, respectively, obtained by transforming them through pre-trained language models such as Word2Vec or BERT and used to capture their semantic information; $\operatorname{sim}(\cdot,\cdot)$ denotes the similarity between the two semantic vectors.
[0073] The calculated matching degree is compared with the preset matching threshold. If the matching degree is greater than or equal to the preset matching threshold, it indicates that the verification data meets the quality requirements for replacement. Then, the replacement operation is performed to replace the corresponding sensitive information with the verification data to generate de-identified credit data.
[0074] If the matching degree does not reach the preset matching threshold, the replacement operation will be abandoned to avoid introducing incompatible or erroneous data, thereby ensuring the overall quality and usability of the final output de-identified credit data.
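The threshold-gated replacement in paragraphs [0061] to [0074] can be illustrated with the sketch below, which combines a type check, a format similarity, and a semantic similarity into one weighted score. Everything concrete here is an assumption: the weights `(0.3, 0.3, 0.4)`, the threshold `0.8`, the character-class format similarity, and the use of cosine similarity on caller-supplied semantic vectors (which in practice would come from a model such as Word2Vec or BERT).

```python
import math
import re


def type_of(value):
    """Type extraction: name of the Python type (str, int, ...)."""
    return type(value).__name__


def format_similarity(a, b):
    """Share of positions whose character class (digit/letter/other) agrees."""
    def shape(v):
        return re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "9", str(v)))
    sa, sb = shape(a), shape(b)
    if not sa and not sb:
        return 1.0
    same = sum(x == y for x, y in zip(sa, sb))
    return same / max(len(sa), len(sb))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def matching_degree(original, candidate, vec_o, vec_c, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of type agreement, format similarity, semantic similarity."""
    w1, w2, w3 = weights  # assumed weights that sum to 1
    same_type = 1.0 if type_of(original) == type_of(candidate) else 0.0
    return (
        w1 * same_type
        + w2 * format_similarity(original, candidate)
        + w3 * cosine(vec_o, vec_c)
    )


def replace_if_matching(original, candidate, vec_o, vec_c, threshold=0.8):
    """Return the candidate when the score reaches the threshold, else None."""
    score = matching_degree(original, candidate, vec_o, vec_c)
    return candidate if score >= threshold else None
```

Returning `None` models the abandoned replacement; how the pipeline then handles the still-unreplaced sensitive value is not specified in the text and is left to the caller.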
[0075] Example 2
[0076] As shown in Figure 4, this embodiment provides an artificial intelligence-based credit data anonymization processing system, including the following modules:
[0077] The sensitive information identification module is used to preprocess the received credit data to be processed and accurately identify the sensitive information that needs to be de-identified. It performs attribute classification on each data field in the credit data to be processed. The attribute classification can be based on a preset set of rules or implemented through machine learning models such as natural language processing models or classifiers, and divides the data fields into basic fields and sensitive fields.
[0078] Among them, basic fields usually refer to general information that does not directly expose personal identity or privacy, while sensitive fields include information that can directly or indirectly identify an individual, such as ID card number, mobile phone number, home address, and specific financial amount.
[0079] After classification, all fields classified as sensitive are marked as fields to be de-identified, and sensitive information is extracted from these fields and passed to subsequent modules.
[0080] The verification data selection module is used to receive sensitive information, construct a disguised dataset, perform differential analysis, and select the verification data to be replaced; it constructs a disguised dataset based on the sensitive information and assigns a corresponding sort number to each piece of sensitive information; based on this sort number, it extracts disguised data from the pre-constructed disguised database.
[0081] The disguised database stores two types of data: primary disguised data and secondary disguised data. During extraction, the primary disguised data is extracted first based on the sorting number. If the primary disguised data is not successfully extracted, the secondary disguised data is extracted, and the timestamp of the secondary disguised data is used as the new sorting number to ensure the traceability and uniqueness of the data processing.
[0082] All extracted disguised data are combined to construct a complete disguised dataset. A difference analysis is then performed on this dataset: adjacent disguised data are obtained in a specific order, such as the original data order, and a preset comparison function, which can be a numerical difference calculation, string edit distance, vector cosine similarity, and so on, is called to calculate the difference between adjacent disguised data and generate a quantified verification deviation value.
[0083] Based on the analysis results, verification data is selected from the disguised data by determining whether the generated verification deviation value is within the preset value range. The preset value range is used to filter out data that both maintain a certain logical relationship with neighboring data and have sufficient differences.
[0084] When the verification deviation value is within the preset value range, the adjacent disguised data that generated the verification deviation value is determined as candidate verification data; according to the positive or negative sign of the verification deviation value or other logical indications, one of the candidate verification data is selected as the final verification data.
[0085] The de-identification and replacement module is used to perform the final replacement operation to generate de-identified credit data; it receives sensitive information and verification data, and performs content matching assessment before performing the replacement.
[0086] Evaluate whether the degree of matching between the verification data and sensitive information meets the preset matching threshold, and ensure that the replaced data is consistent with the original data in terms of format, type or semantics, so as to avoid introducing invalid or incorrectly formatted data;
[0087] The replacement operation is only performed when the matching degree meets the preset matching threshold. The sensitive information is replaced with the verification data. After processing, the output is the final de-identified credit data.
[0088] Dynamic and verified data anonymization can effectively hide sensitive information. Furthermore, through differential analysis and verification mechanisms, the quality of the verified data and the rationality of the context can be ensured, thus preserving the analytical value of the credit data.
[0089] The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand and implement the present invention. They should not be construed as limiting the scope of protection of the present invention. All equivalent changes or modifications made in accordance with the spirit and essence of the present invention should be covered within the scope of protection of the present invention.
Claims
1. A method for de-identifying credit information based on artificial intelligence, characterized in that it includes the following steps: identifying sensitive information in the credit data to be processed; in response to the sensitive information, executing a de-identification and replacement process based on verification data, the de-identification and replacement process including: constructing a disguised dataset based on the sensitive information; performing a difference analysis on the disguised dataset to select verification data from the disguised data; and replacing the sensitive information with the verification data to form de-identified credit data. The process of constructing a disguised dataset based on sensitive information includes: determining a corresponding sorting number for the sensitive information; extracting disguised data from the disguised database according to the sorting number; and replacing the corresponding sensitive information with the disguised data to construct the disguised dataset. The data stored in the disguised database includes primary disguised data and secondary disguised data. The process of extracting disguised data from the disguised database based on the sorting number includes: first extracting the primary disguised data according to the sorting number; if the primary disguised data cannot be extracted, extracting the secondary disguised data and using the timestamp of the secondary disguised data as the sorting number.
2. The method for de-identifying credit data based on artificial intelligence according to claim 1, characterized in that identifying sensitive information in the credit data to be processed includes: performing attribute classification on the data fields in the credit data to be processed to divide the data fields into basic fields and sensitive fields; and marking the information contained in the sensitive fields as sensitive information.
3. The method for de-identifying credit data based on artificial intelligence according to claim 1, characterized in that the step of performing differential analysis on the disguised dataset to select verification data from the disguised data includes: obtaining adjacent disguised data in the disguised dataset; using a preset comparison function to calculate the difference between adjacent disguised data to generate a verification deviation value; determining whether the verification deviation value is within the preset value range; and, when the verification deviation value is within the preset value range, determining the adjacent disguised data that generated the verification deviation value as candidate verification data and selecting one of the candidate verification data as the verification data according to the positive or negative sign of the verification deviation value.
4. The method for de-identifying credit data based on artificial intelligence according to claim 1, characterized in that the process of replacing sensitive information with verification data to form de-identified credit data includes: performing the replacement when the degree of matching between the verification data and the sensitive information meets the preset matching threshold.
5. A credit data anonymization processing system based on artificial intelligence, characterized in that it includes the following modules: a sensitive information identification module, used to identify sensitive information in the credit data to be processed; a verification data selection module, used to construct, in response to the sensitive information, a disguised dataset based on the sensitive information, and to perform a differential analysis on the disguised dataset to select verification data from the disguised data; and a desensitization and replacement module, used to replace sensitive information with verification data to form desensitized credit data.
6. The credit data anonymization system based on artificial intelligence according to claim 5, characterized in that the construction of the disguised dataset based on sensitive information includes: assigning a sort number to the sensitive information; extracting the disguised data from the disguised database based on the sort number; and replacing the corresponding sensitive information with the disguised data to construct the disguised dataset.
7. The credit data anonymization system based on artificial intelligence according to claim 6, characterized in that the step of extracting disguised data from the disguised database according to the sorting number includes: extracting primary disguised data first based on the sorting number; if primary disguised data cannot be extracted, extracting secondary disguised data and using the timestamp of the secondary disguised data as the sorting number.
8. The credit data anonymization system based on artificial intelligence according to claim 5, characterized in that the step of performing differential analysis on the disguised dataset to select verification data from the disguised data includes: obtaining adjacent disguised data in the disguised dataset; and using a preset comparison function to calculate the difference between adjacent disguised data to generate a verification deviation value.
9. The credit data anonymization processing system based on artificial intelligence according to claim 5, characterized in that the step of performing differential analysis on the disguised dataset to select verification data from the disguised data also includes: determining whether the verification deviation value is within the preset value range; and, when the verification deviation value is within the preset value range, determining the adjacent disguised data that generated the verification deviation value as candidate verification data and selecting one of the candidate verification data as the verification data according to the positive or negative sign of the verification deviation value.
10. The credit data anonymization processing system based on artificial intelligence according to claim 5, characterized in that using verification data to replace sensitive information to form de-identified credit data includes: performing the replacement when the degree of matching between the verification data and the sensitive information meets the preset matching threshold.