File desensitization method and device, electronic equipment and storage medium

By constructing desensitization rules for the historical sensitive fields of an information system and automatically matching current data fields against them, the method addresses low file desensitization efficiency and automates desensitization of the current file, improving processing speed.

CN114547696B · Active · Publication Date: 2026-06-23 · PING AN SECURITIES CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: PING AN SECURITIES CO LTD
Filing Date: 2022-03-23
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing file desensitization is largely manual and therefore inefficient, especially for large volumes of system files, where it consumes significant time and labor.

Method used

By constructing historical desensitization rules for sensitive fields in the information system, identifying and matching data fields in the current file, using historical desensitization rules for automated desensitization, and constructing current desensitization rules when a match fails, the system achieves automated data desensitization.

Benefits of technology

It automates desensitization of the current file, which reduces manual intervention time, increases processing speed, and improves overall file desensitization efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to the field of data processing, and discloses a file desensitization method, device, electronic equipment and storage medium, the method comprises: obtaining the historical data field of an information system, identifying the historical sensitive field in the historical data field, and constructing the historical desensitization rule of the historical sensitive field; receiving the current file of the information system, extracting the current data field of the current file, and matching the current data field with the historical sensitive field; when the current data field and the historical sensitive field match successfully, desensitizing the data of the current data field by using the historical desensitization rule to obtain first desensitization data; when the current data field and the historical sensitive field fail to match, constructing the current desensitization rule of the current data field to perform data desensitization on the current data field to obtain second desensitization data; and the first desensitization data and the second desensitization data are summarized to obtain the desensitized file of the current file. The present application can improve the efficiency of file desensitization.

Description

Technical Field

[0001] This invention relates to the field of data processing, and more particularly to a method, apparatus, electronic device, and computer-readable storage medium for de-identifying documents.

Background Technology

[0002] In the daily operation of information systems, various files are stored due to the system's own operation and business needs, such as code files, databases, log files, configuration files, and data files. These files record information about hosts, networks, and customers. Some of the recorded data is sensitive and cannot be disclosed at will, yet business or other requirements mean that some files must be sent out. For example, purchased systems need to send log files to software vendors to troubleshoot and locate problems, and some system data needs to be exported to the operations department for analysis. Therefore, it is particularly important to desensitize sensitive data in the files generated by information systems.

[0003] Currently, data desensitization of files is usually done manually. In actual business scenarios, because system files differ in their characteristics, different personnel are usually needed to desensitize each kind of system file in a targeted way. This consumes a lot of manual time and reduces the efficiency of file desensitization. In addition, processing a large number of system files takes a long time, which further reduces the efficiency of file desensitization.

Summary of the Invention

[0004] This invention provides a document desensitization method, apparatus, electronic device, and computer-readable storage medium, the main purpose of which is to improve the efficiency of document desensitization.

[0005] To achieve the above objectives, the present invention provides a document de-identification method, comprising:

[0006] Obtain historical data fields from the information system, identify historically sensitive fields within the historical data fields, and construct historical desensitization rules for the historically sensitive fields;

[0007] Receive the current file of the information system, extract the current data field of the current file, and match the current data field with the historical sensitive field;

[0008] When the current data field successfully matches the historical sensitive field, the data in the current data field is desensitized using the historical desensitization rules to obtain the first desensitized data.

[0009] When the current data field fails to match the historical sensitive field, a current desensitization rule is constructed for the current data field to perform data desensitization on the current data field and obtain the second desensitized data;

[0010] The first de-identified data and the second de-identified data are combined to obtain the de-identified file of the current file.

[0011] Optionally, identifying historically sensitive fields in the historical data fields includes:

[0012] Obtain the field dimensions of the historical data field, and identify the field attributes of the historical data field based on the field dimensions;

[0013] Determine whether the field attribute exists in a preset sensitive attribute table;

[0014] If the field attribute does not exist in the sensitive attribute table, then the historical data field will not be considered a historical sensitive field;

[0015] If the field attribute exists in the sensitive attribute table, then the historical data field will be used as the historical sensitive field.

[0016] Optionally, the construction of the historical desensitization rules for the historical sensitive fields includes:

[0017] Configure the de-identification script for the historical sensitive fields, and define the de-identification strategy for the historical sensitive fields in the de-identification script;

[0018] Based on the desensitization strategy, generate historical desensitization rules for the historical sensitive fields.

[0019] Optionally, extracting the current data field of the current file includes:

[0020] The current file is cleaned to obtain cleaned data;

[0021] Identify the data objects in the cleaned data, convert the data objects into data fields, and obtain the current data fields of the current file.

[0022] Optionally, the step of cleaning the current file to obtain cleaned data includes:

[0023] The current file is deduplicated to obtain deduplicated data;

[0024] The system detects whether there is abnormal data in the deduplicated data, and deletes the abnormal data when abnormal data is found, thus obtaining cleaned data.

[0025] Optionally, matching the current data field with the historical sensitive field includes:

[0026] Convert the current data field into a current field vector, and convert the historical sensitive field into a historical field vector;

[0027] Calculate the field similarity between the current field vector and the historical field vector;

[0028] If the field similarity is less than the preset similarity, then the current data field fails to match the historical sensitive field;

[0029] If the field similarity is not less than the preset similarity, then the current data field successfully matches the historical sensitive field.

[0030] Optionally, calculating the field similarity between the current field vector and the historical field vector includes:

[0031] The field similarity between the current field vector and the historical field vector is calculated using the following formula:

[0032] (formula not reproduced in this text)

[0033] Here, the formula's output denotes the field similarity, and its inputs are the vector of the i-th current field and the vector of the j-th historical field.

[0034] To address the above problems, the present invention also provides a document desensitization device, the device comprising:

[0035] The desensitization rule construction module is used to obtain historical data fields of the information system, identify historical sensitive fields in the historical data fields, and construct historical desensitization rules for the historical sensitive fields;

[0036] The sensitive field matching module is used to receive the current file of the information system, extract the current data field of the current file, and match the current data field with the historical sensitive fields;

[0037] The data desensitization module is used to desensitize the data in the current data field using the historical desensitization rules when the current data field successfully matches the historical sensitive field, so as to obtain the first desensitized data;

[0038] The data desensitization module is also used to construct a current desensitization rule for the current data field when the current data field fails to match the historical sensitive field, so as to perform data desensitization on the current data field and obtain the second desensitized data;

[0039] The de-identified file generation module is used to summarize the first de-identified data and the second de-identified data to obtain the de-identified file of the current file.

[0040] To address the above problems, the present invention also provides an electronic device, the electronic device comprising:

[0041] At least one processor; and,

[0042] A memory communicatively connected to the at least one processor; wherein,

[0043] The memory stores a computer program that can be executed by the at least one processor to implement the file desensitization method described above.

[0044] To address the aforementioned problems, the present invention also provides a computer-readable storage medium storing at least one computer program, which is executed by a processor in an electronic device to implement the aforementioned file desensitization method.

[0045] As can be seen, this embodiment of the invention first constructs historical desensitization rules for historically sensitive fields in the information system, which serve as desensitization rules for subsequent fields generated by the information system, thereby improving the desensitization efficiency of subsequent fields. It then matches the current data field in the information system with the historically sensitive fields to determine the desensitization rule for the current data field, thus fulfilling the prerequisite for data desensitization of the current data field. Secondly, when the current data field successfully matches the historically sensitive field, this embodiment of the invention uses the historical desensitization rules to desensitize the data in the current data field, obtaining first desensitized data. Conversely, when the current data field fails to match the historically sensitive field, it constructs a current desensitization rule for the current data field to perform data desensitization on the current data field, obtaining second desensitized data. This automates the data desensitization of the current file, ensuring the efficiency of the current file's desensitization. Therefore, the file desensitization method, apparatus, electronic device, and computer-readable storage medium proposed in this embodiment of the invention can improve the efficiency of file desensitization.

Attached Figure Description

[0046] Figure 1 is a schematic flowchart of a document de-identification method provided in an embodiment of the present invention;

[0047] Figure 2 is a schematic diagram of a document desensitization device provided in an embodiment of the present invention;

[0048] Figure 3 is a schematic diagram of the internal structure of an electronic device that implements a file desensitization method according to an embodiment of the present invention;

[0049] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0050] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0051] This invention provides a file de-identification method. The executing entity of the file de-identification method includes, but is not limited to, at least one of the following: a server, a terminal, or other electronic devices that can be configured to execute the method provided in this invention. In other words, the file de-identification method can be executed by software or hardware installed on a terminal device or a server device, and the software can be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, or a cloud server cluster. The server can be an independent server or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.

[0052] Referring to Figure 1, a schematic flowchart of a file de-identification method according to an embodiment of the present invention is shown. In this embodiment, the file de-identification method includes:

[0053] S1. Obtain historical data fields from the information system, identify historically sensitive fields in the historical data fields, and construct historical desensitization rules for the historically sensitive fields.

[0054] In this embodiment of the invention, the information system refers to a business system running in different business scenarios, such as an e-commerce system, a management system, a financial system, and a logistics system. The historical data field refers to a data field that has been generated in the information system. For example, in an e-commerce system, the historical data field can be an order information field, a user information field, and a product information field. In a financial system, the historical data field can be a salary information field, a profit and loss information field, and a performance information field.

[0055] It should be understood that there may be some sensitive information fields in the historical data fields, such as the bank card number field in the salary information field. Therefore, this embodiment of the invention identifies the historical sensitive fields in the historical data fields in order to filter the sensitive fields of the current data fields in the subsequent information system.

[0056] As an embodiment of the present invention, the step of identifying historical sensitive fields in the historical data fields includes: obtaining the field dimension of the historical data field; identifying the field attribute of the historical data field based on the field dimension; determining whether the field attribute exists in a preset sensitive attribute table; if the field attribute does not exist in the sensitive attribute table, then the historical data field is not considered a historical sensitive field; if the field attribute exists in the sensitive attribute table, then the historical data field is considered a historical sensitive field.

[0057] The field dimension refers to the descriptive information used to characterize the historical data field, such as identity, name, address, etc. The field attribute refers to the feature indicator used to characterize the historical data field. For example, if the field dimension is identity, the field attributes of the historical data field can be determined to include ID number, family information, etc. If the field dimension is name, the field attributes of the historical data field can be determined to include surname, name length, etc. The preset sensitive attribute table is constructed based on different user needs. For example, user A needs to use attributes such as ID number, bank number, and institution name as sensitive attributes in the sensitive attribute table, while user B needs to use attributes such as mobile phone number, surname, and gender as sensitive attributes in the sensitive attribute table.
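The identification step described in [0056] and [0057] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `DataField`, `DIMENSION_ATTRIBUTES`, `USER_A_TABLE`, `USER_B_TABLE`, and `identify_sensitive_fields` are hypothetical, the dimension-to-attribute mapping is hard-coded from the examples in [0057], and a field is assumed to be sensitive when any one of its attributes appears in the preset table.

```python
from dataclasses import dataclass

# Hypothetical mapping from a field dimension to the attributes it implies,
# taken from the examples in paragraph [0057].
DIMENSION_ATTRIBUTES = {
    "identity": {"id_number", "family_information"},
    "name": {"surname", "name_length"},
}

# Hypothetical preset sensitive attribute tables for two different users.
USER_A_TABLE = {"id_number", "bank_number", "institution_name"}
USER_B_TABLE = {"mobile_phone_number", "surname", "gender"}


@dataclass(frozen=True)
class DataField:
    name: str       # field name as it appears in the information system
    dimension: str  # descriptive dimension, e.g. "identity" or "name"


def field_attributes(field):
    """Look up the attributes implied by the field's dimension."""
    return DIMENSION_ATTRIBUTES.get(field.dimension, set())


def identify_sensitive_fields(fields, sensitive_table):
    """Keep the fields with at least one attribute in the sensitive table.

    Assumption: a single overlapping attribute is enough to mark a field
    as a historical sensitive field.
    """
    return [f for f in fields if field_attributes(f) & sensitive_table]


historical_fields = [
    DataField("customer_id_card", "identity"),
    DataField("customer_name", "name"),
]
sensitive_for_a = identify_sensitive_fields(historical_fields, USER_A_TABLE)
sensitive_for_b = identify_sensitive_fields(historical_fields, USER_B_TABLE)
```

With these example tables, the same two historical fields yield different sensitive sets for user A and user B, which reflects the statement that the table is built according to each user's needs.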

[0058] Furthermore, this embodiment of the invention constructs historical desensitization rules for the historical sensitive fields to determine the data desensitization methods and strategies for the historical sensitive fields, which serve as desensitization rules for subsequent information systems generating the same fields, thereby improving the desensitization efficiency of subsequent fields.

[0059] As an embodiment of the present invention, the construction of the historical desensitization rules for the historical sensitive fields includes: configuring a desensitization script for the historical sensitive fields, defining a desensitization strategy for the historical sensitive fields in the desensitization script, and generating historical desensitization rules for the historical sensitive fields according to the desensitization strategy.

[0060] The desensitization script refers to a program used to automate desensitization of the historical sensitive fields. It is written in a scripting language, such as Python. The desensitization strategy refers to the desensitization method used when desensitizing the historical sensitive fields. It can define strategies such as deleting sensitive fields, obfuscating sensitive fields, encrypting sensitive fields, and blurring sensitive fields, and can also be set according to the actual business scenario.
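A desensitization script of the kind described in [0059] and [0060] might look like the sketch below. It is only an illustration: the function names, the `STRATEGIES` registry, and the dictionary shape of a rule are hypothetical, and only three of the listed strategies (delete, obfuscate, blur) are shown. Encryption is left out because it would require a key-management scheme that the text does not describe.

```python
def delete_value(value):
    """Delete strategy: drop the sensitive value entirely."""
    return ""


def mask_value(value, keep=4):
    """Obfuscation strategy: hide everything except the last `keep` characters."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def blur_value(value, keep=3):
    """Blur strategy: keep only a coarse prefix of the value."""
    return value[:keep] + "*" * max(len(value) - keep, 0)


# Hypothetical registry of strategies that a script can define.
STRATEGIES = {"delete": delete_value, "mask": mask_value, "blur": blur_value}


def build_rule(field_name, strategy):
    """Generate a desensitization rule for one sensitive field."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown desensitization strategy: {strategy}")
    return {"field": field_name, "strategy": strategy, "apply": STRATEGIES[strategy]}


# Historical rules keyed by the historical sensitive field they belong to.
historical_rules = {
    "bank_card_number": build_rule("bank_card_number", "mask"),
    "mobile_phone_number": build_rule("mobile_phone_number", "blur"),
    "home_address": build_rule("home_address", "delete"),
}
masked_card = historical_rules["bank_card_number"]["apply"]("6222020200112345678")
```

Keeping the strategy name inside the rule makes it possible to store or audit rules separately from the code that executes them, which is one plausible way to reuse them for later files.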

[0061] S2. Receive the current file of the information system, extract the current data field of the current file, and match the current data field with the historical sensitive field.

[0062] In this embodiment of the invention, the current file refers to a system file generated in real time by the information system, such as a log file, code file, or data file. Furthermore, this embodiment of the invention extracts the current data field of the current file and matches the current data field with the historical sensitive field, which establishes the precondition for desensitizing the current file.

[0063] As an embodiment of the present invention, the step of extracting the current data field of the current file includes: cleaning the current file to obtain cleaned data, identifying data objects in the cleaned data, converting the data objects into data fields, and obtaining the current data field of the current file.

[0064] It should be understood that there may be some duplicate and useless data in the current file. Therefore, this embodiment of the invention cleans the current file to reduce the amount of data in the current file and improve the processing speed of subsequent files. Optionally, the step of cleaning the current file to obtain cleaned data includes: deduplicating the current file to obtain deduplicated data, detecting whether there is abnormal data in the deduplicated data, and deleting the abnormal data when there is abnormal data in the deduplicated data to obtain cleaned data.
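The cleaning step in [0064] can be illustrated with a short Python sketch. The source does not define what counts as abnormal data, so the criterion used here (blank lines and lines containing non-printable characters) is purely an assumption, as are the function names.

```python
def deduplicate(lines):
    """Remove repeated lines while keeping the first occurrence in order."""
    return list(dict.fromkeys(lines))


def is_abnormal(line):
    """Hypothetical anomaly check: blank lines or non-printable content."""
    stripped = line.strip()
    return not stripped or not stripped.isprintable()


def clean_file(lines):
    """Deduplicate first, then delete abnormal data, as in [0064]."""
    return [line for line in deduplicate(lines) if not is_abnormal(line)]


raw_lines = [
    "user=alice phone=13800000000",
    "user=alice phone=13800000000",  # exact duplicate
    "",                              # blank line
    "\x00\x01",                      # non-printable bytes
    "user=bob phone=13900000000",
]
cleaned = clean_file(raw_lines)
```

Deduplicating before the anomaly check means each distinct line is inspected only once, which is consistent with the stated goal of reducing the data volume for later steps.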

[0065] Furthermore, in this embodiment of the invention, data objects in the cleaned data are identified to obtain data entities in the cleaned data, thereby determining the data fields of the current file. Optionally, the data objects are identified using Named-Entity Recognition (NER) tools.

[0066] Furthermore, in this embodiment of the invention, the conversion of the data object is implemented using a structured query language, such as SQL.

[0067] Furthermore, to ensure the privacy and security of the current data field, the current data field can also be stored in a blockchain node.

[0068] Furthermore, in this embodiment of the invention, the current data field is matched with the historical sensitive field to determine the desensitization rules for the current data field, thereby achieving the prerequisite for data desensitization of the current data field.

[0069] As an embodiment of the present invention, the step of matching the current data field with the historical sensitive field includes: converting the current data field into a current field vector and the historical sensitive field into a historical field vector; calculating the field similarity between the current field vector and the historical field vector; if the field similarity is less than a preset similarity, the current data field and the historical sensitive field fail to match; if the field similarity is not less than the preset similarity, the current data field and the historical sensitive field match successfully.

[0070] The current data field and the historical sensitive field are converted into vectors by a vector conversion algorithm, such as one-hot encoding. This algorithm encodes the current data field and the historical sensitive field to obtain their semantic representation information, thereby ensuring the accuracy of subsequent field matching.

[0071] Furthermore, in an optional embodiment of the present invention, the field similarity between the current field vector and the historical field vector is calculated using the following formula:

[0072] (formula not reproduced in this text)

[0073] Here, the formula's output denotes the field similarity, and its inputs are the vector of the i-th current field and the vector of the j-th historical field.
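The matching procedure of [0069] to [0073] can be sketched as below. Because the similarity formula itself is not reproduced in this text, the sketch substitutes cosine similarity as an assumed stand-in; the patent's actual formula may differ. The token-presence encoding is a simple one-hot-style scheme chosen for illustration, and the function names and the 0.8 threshold are hypothetical.

```python
import math


def tokenize(field_name):
    """Split a field name such as 'bank_card_number' into lowercase tokens."""
    return field_name.lower().split("_")


def build_vocabulary(field_names):
    """Assign one vector position to every distinct token."""
    tokens = sorted({t for name in field_names for t in tokenize(name)})
    return {token: index for index, token in enumerate(tokens)}


def to_vector(field_name, vocabulary):
    """One-hot-style encoding: 1.0 at each position whose token is present."""
    vector = [0.0] * len(vocabulary)
    for token in tokenize(field_name):
        if token in vocabulary:
            vector[vocabulary[token]] = 1.0
    return vector


def cosine_similarity(x, y):
    """Assumed similarity measure; returns 0.0 for an all-zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def match_field(current_field, historical_fields, preset_similarity=0.8):
    """Return the best-matching historical sensitive field, or None on failure."""
    vocabulary = build_vocabulary(list(historical_fields) + [current_field])
    current_vector = to_vector(current_field, vocabulary)
    best_name, best_score = None, 0.0
    for name in historical_fields:
        score = cosine_similarity(current_vector, to_vector(name, vocabulary))
        if score > best_score:
            best_name, best_score = name, score
    # Match succeeds only when similarity is not less than the preset value.
    return best_name if best_score >= preset_similarity else None


historical = ["bank_card_number", "id_number"]
matched = match_field("card_number", historical)
unmatched = match_field("order_total", historical)
```

In this example `card_number` shares two of three tokens with `bank_card_number`, giving a cosine similarity of about 0.82, which clears the assumed threshold, while `order_total` shares no tokens with any historical field and therefore fails to match.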

[0074] S3. When the current data field successfully matches the historical sensitive field, the data in the current data field is desensitized using the historical desensitization rules to obtain the first desensitized data.

[0075] It should be understood that when the current data field successfully matches the historical sensitive field, it means that the current data field and the historical sensitive field have the same data content attributes. Therefore, in this embodiment of the invention, the current data field is desensitized by the historical desensitization rules to automate the desensitization of the current data field and improve the efficiency of file desensitization.

[0076] As an embodiment of the present invention, the step of using the historical desensitization rules to desensitize the current data field to obtain the first desensitized data includes: obtaining the desensitization script of the historical desensitization rules, and loading the current data field into the desensitization script to achieve data desensitization of the current data field.
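Step S3 as described in [0076] amounts to looking up the script of the matched historical rule and feeding the current field's data through it. A minimal sketch follows, assuming records are dictionaries and a rule is a plain callable; `HISTORICAL_RULES`, `mask_keep_last4`, and `desensitize_matched_field` are hypothetical names.

```python
def mask_keep_last4(value):
    """Example historical rule: hide all but the last four characters."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


# Hypothetical store of historical rules, keyed by historical sensitive field.
HISTORICAL_RULES = {"bank_card_number": mask_keep_last4}


def desensitize_matched_field(records, current_field, matched_field,
                              rules=HISTORICAL_RULES):
    """Apply the matched historical rule to one field of every record.

    The input records are copied, so the original data is left unchanged.
    """
    rule = rules[matched_field]
    result = []
    for record in records:
        updated = dict(record)
        if current_field in updated:
            updated[current_field] = rule(updated[current_field])
        result.append(updated)
    return result


records = [{"card_no": "6222020200112345678", "user": "alice"}]
first_desensitized = desensitize_matched_field(records, "card_no", "bank_card_number")
```

The current field name (`card_no`) and the historical field name (`bank_card_number`) are kept as separate arguments because a successful match does not require the two names to be identical.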

[0077] S4. When the current data field fails to match the historical sensitive field, construct the current desensitization rule for the current data field to perform data desensitization on the current data field and obtain the second desensitized data.

[0078] It should be understood that when the current data field fails to match the historical sensitive field, it means that the current data field has not yet appeared in the information system. Therefore, this embodiment of the invention constructs a current desensitization rule for the current data field to perform data desensitization on the current data field. It should be noted that the construction principle of the current desensitization rule is the same as that of the historical desensitization rule, and the principle of performing data desensitization on the current data field through the current desensitization rule is the same as that of performing data desensitization on the current data field through the historical desensitization rule. Further details will not be elaborated here.

[0079] S5. Summarize the first de-identified data and the second de-identified data to obtain the de-identified file of the current file.

[0080] In this embodiment of the invention, the first de-identified data and the second de-identified data are summarized, that is, the first de-identified data and the second de-identified data are merged, so as to achieve data de-identification of the current file and obtain a de-identified file.
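Steps S2 through S5 can be tied together in one short sketch. Everything here is illustrative: the exact-name matcher stands in for the vector-based matching of S2, `build_current_rule` is a hypothetical placeholder for constructing a new rule on a failed match, and treating a new non-sensitive field as pass-through is an assumption rather than something the text specifies.

```python
def exact_matcher(field_name, historical_rules):
    """Stand-in matcher: succeed only on an identical field name."""
    return field_name if field_name in historical_rules else None


def build_current_rule(field_name):
    """Hypothetical rule construction for a field with no historical match."""
    if "email" in field_name:
        return lambda value: "*" * len(value)
    return lambda value: value  # assumed pass-through for non-sensitive fields


def desensitize_file(current_fields, historical_rules, matcher, rule_builder):
    """Run S3 and S4 per field, then merge the two results as in S5."""
    first_data, second_data = {}, {}
    for name, values in current_fields.items():
        matched = matcher(name, historical_rules)
        if matched is not None:
            rule = historical_rules[matched]          # S3: reuse historical rule
            first_data[name] = [rule(v) for v in values]
        else:
            rule = rule_builder(name)                 # S4: construct current rule
            second_data[name] = [rule(v) for v in values]
    merged = {**first_data, **second_data}            # S5: summarize both parts
    # Restore the field order of the current file.
    return {name: merged[name] for name in current_fields}


historical_rules = {"phone": lambda v: v[:3] + "*" * (len(v) - 3)}
current_fields = {
    "phone": ["13800000000"],
    "email": ["a@b.c"],
    "order_id": ["42"],
}
desensitized_file = desensitize_file(
    current_fields, historical_rules, exact_matcher, build_current_rule
)
```

Passing the matcher and the rule builder as arguments keeps the merge logic independent of how matching or rule construction is actually implemented.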

[0081] Furthermore, after obtaining the de-identified file of the current file, this embodiment of the invention further includes: encrypting the de-identified file using a preset encryption algorithm to prevent malicious tampering and to ensure the security and privacy of the de-identified file. Optionally, the preset encryption algorithm includes SHA algorithm, MD5 algorithm, etc.
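As a hedged illustration of the tamper-protection idea in [0081], the sketch below computes a SHA-256 digest of the de-identified file with Python's standard `hashlib` module and checks it later. Note that SHA and MD5 are hash functions that produce one-way digests, so this sketch only shows how tampering could be detected; it does not make the file content confidential. The function names are hypothetical, and SHA-256 is chosen here as one member of the SHA family.

```python
import hashlib
import hmac


def file_digest(content):
    """Return the SHA-256 hex digest of the de-identified file bytes."""
    return hashlib.sha256(content).hexdigest()


def is_untampered(content, expected_digest):
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(file_digest(content), expected_digest)


desensitized_bytes = b"user=alice phone=138********\n"
recorded_digest = file_digest(desensitized_bytes)

intact = is_untampered(desensitized_bytes, recorded_digest)
modified = is_untampered(b"user=alice phone=13800000000\n", recorded_digest)
```

The digest would need to be stored or transmitted separately from the file for the check to be meaningful, since anyone able to change both could recompute it.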

[0082] As can be seen, this embodiment of the invention first constructs historical desensitization rules for sensitive fields in the information system, which serve as desensitization rules for subsequent fields generated by the information system, thereby improving the desensitization efficiency of subsequent fields. It then matches the current data field in the information system with the historical sensitive fields to determine the desensitization rule for the current data field, thus fulfilling the prerequisite for data desensitization of the current data field. Secondly, when the current data field successfully matches the historical sensitive field, this embodiment uses the historical desensitization rules to desensitize the data in the current data field, obtaining first desensitized data. Conversely, when the current data field fails to match the historical sensitive field, it constructs a current desensitization rule for the current data field to perform data desensitization on the current data field, obtaining second desensitized data. This automates the data desensitization of the current file, ensuring the desensitization efficiency of the current file. Therefore, the file desensitization method proposed in this embodiment can improve the efficiency of file desensitization.

[0083] Figure 2 is a functional block diagram of the document desensitization device of this invention.

[0084] The document desensitization device 100 of this invention can be installed in an electronic device. Depending on the functions implemented, the document desensitization device may include a desensitization rule construction module 101, a sensitive field matching module 102, a data desensitization module 103, and a desensitized document generation module 104. The module described in this invention can also be referred to as a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can perform a fixed function, and which are stored in the memory of the electronic device.

[0085] In this embodiment, the functions of each module / unit are as follows:

[0086] The desensitization rule construction module 101 is used to obtain historical data fields of the information system, identify historical sensitive fields in the historical data fields, and construct historical desensitization rules for the historical sensitive fields;

[0087] The sensitive field matching module 102 is used to receive the current file of the information system, extract the current data field of the current file, and match the current data field with the historical sensitive fields;

[0088] The data desensitization module 103 is used to desensitize the data in the current data field using the historical desensitization rules when the current data field successfully matches the historical sensitive field, so as to obtain the first desensitized data.

[0089] The data desensitization module 103 is further configured to construct a current desensitization rule for the current data field when the current data field fails to match the historical sensitive field, so as to perform data desensitization on the current data field and obtain the second desensitized data;

[0090] The de-identified file generation module 104 is used to summarize the first de-identified data and the second de-identified data to obtain the de-identified file of the current file.

[0091] In detail, when in use, the modules of the document desensitization device 100 described in this embodiment of the invention employ the same technical means as the document desensitization method described above with reference to Figure 1 and can produce the same technical effect, so they will not be elaborated here.

[0092] Figure 3 is a schematic diagram of the structure of the electronic device 1 that implements the document desensitization method of the present invention.

[0093] The electronic device 1 may include a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may also include a computer program, such as a file desensitization program, stored in the memory 11 and capable of running on the processor 10.

[0094] In some embodiments, the processor 10 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 10 is the control unit of the electronic device 1, connecting various components of the electronic device 1 through various interfaces and lines. It executes programs or modules stored in the memory 11 (e.g., file desensitization programs) and calls data stored in the memory 11 to perform various functions and process data of the electronic device 1.

[0095] The memory 11 includes at least one type of readable storage medium, including flash memory, portable hard drive, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 can be an internal storage unit of the electronic device 1, such as a portable hard drive of the electronic device 1. In other embodiments, the memory 11 can be an external storage device of the electronic device 1, such as a plug-in portable hard drive, Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc., equipped on the electronic device 1. Furthermore, the memory 11 can include both internal storage units and external storage devices of the electronic device 1. The memory 11 can be used not only to store application software and various types of data installed on the electronic device 1, such as the code of a file de-identification program, but also to temporarily store data that has been output or will be output.

[0096] The communication bus 12 can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. This bus can be divided into an address bus, a data bus, a control bus, etc. The bus is configured to enable communication between the memory 11 and at least one processor 10, etc.

[0097] The communication interface 13 is used for communication between the aforementioned electronic device 1 and other devices, including a network interface and a user interface. Optionally, the network interface may include a wired interface and / or a wireless interface (such as a Wi-Fi interface, Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 1 and other electronic devices. The user interface may be a display, an input unit (such as a keyboard), or optionally, a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen, etc. The display may also be appropriately referred to as a screen or display unit, used to display information processed in the electronic device 1 and to display a visual user interface.

[0098] Figure 3 shows only the electronic device 1 with certain components. Those skilled in the art will understand that the structure shown in Figure 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, combine certain components, or use a different arrangement of components.

[0099] For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) to power the various components. Preferably, the power supply can be logically connected to the at least one processor 10 through a power management device, thereby enabling functions such as charging management, discharging management, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail here.

[0100] It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited by this structure.

[0101] The file desensitization program stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs. When run in the processor 10, it can achieve the following:

[0102] Obtain historical data fields of the information system, identify the historical sensitive fields among the historical data fields, and construct historical desensitization rules for the historical sensitive fields;

[0103] Receive the current file of the information system, extract the current data field of the current file, and match the current data field against the historical sensitive fields;

[0104] When the current data field matches a historical sensitive field, desensitize the data of the current data field using the historical desensitization rules to obtain first desensitized data;

[0105] When the current data field fails to match the historical sensitive fields, construct a current desensitization rule for the current data field and use it to desensitize the current data field to obtain second desensitized data;

[0106] Combine the first desensitized data and the second desensitized data to obtain the desensitized file of the current file.
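The five steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation described in the patent: the function names (`mask_keep_edges`, `mask_all`, `desensitize_file`), the dictionary used as a rule store, the exact-name matching, and the full-masking fallback rule are all hypothetical choices made for the example.

```python
from typing import Callable, Dict

Rule = Callable[[str], str]


def mask_keep_edges(value: str) -> str:
    """Hypothetical historical rule: keep the first and last character."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "*" * (len(value) - 2) + value[-1]


def mask_all(value: str) -> str:
    """Hypothetical rule constructed on the fly for an unmatched field."""
    return "*" * len(value)


def desensitize_file(record: Dict[str, str],
                     historical_rules: Dict[str, Rule]) -> Dict[str, str]:
    first: Dict[str, str] = {}   # fields handled by historical rules
    second: Dict[str, str] = {}  # fields handled by newly built rules
    for field, value in record.items():
        rule = historical_rules.get(field)  # assumed: match by exact name
        if rule is not None:
            first[field] = rule(value)
        else:
            current_rule = mask_all         # assumed: trivial rule builder
            second[field] = current_rule(value)
    # Combine both parts into the desensitized file.
    return {**first, **second}


historical_rules = {"phone": mask_keep_edges}
current_file = {"phone": "13800001111", "nickname": "alice"}
result = desensitize_file(current_file, historical_rules)
print(result)  # {'phone': '1*********1', 'nickname': '*****'}
```

Keeping the two result sets separate until the final merge mirrors the distinction the text draws between first and second desensitized data.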

[0107] Specifically, for how the processor 10 implements the above computer program, refer to the description of the relevant steps in the embodiment corresponding to Figure 1, which is not repeated here.

[0108] Furthermore, if the modules / units integrated in the electronic device 1 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium can be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).

[0109] The present invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor of an electronic device 1, can perform the following:

[0110] Obtain historical data fields of the information system, identify the historical sensitive fields among the historical data fields, and construct historical desensitization rules for the historical sensitive fields;

[0111] Receive the current file of the information system, extract the current data field of the current file, and match the current data field against the historical sensitive fields;

[0112] When the current data field matches a historical sensitive field, desensitize the data of the current data field using the historical desensitization rules to obtain first desensitized data;

[0113] When the current data field fails to match the historical sensitive fields, construct a current desensitization rule for the current data field and use it to desensitize the current data field to obtain second desensitized data;

[0114] Combine the first desensitized data and the second desensitized data to obtain the desensitized file of the current file.

[0115] In the several embodiments provided by this invention, it should be understood that the disclosed devices, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.

[0116] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0117] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.

[0118] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention.

[0119] Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the invention. No reference sign in the claims should be construed as limiting the claim concerned.

[0120] The blockchain referred to in this invention is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

[0121] The embodiments of this invention can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0122] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Terms such as "first" and "second" are used to indicate names and do not indicate any specific order.

[0123] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A file desensitization method, characterized in that the method comprises:

obtaining historical data fields of an information system, identifying historical sensitive fields among the historical data fields, configuring a desensitization script for the historical sensitive fields, defining a desensitization strategy for the historical sensitive fields in the desensitization script, and generating historical desensitization rules for the historical sensitive fields according to the desensitization strategy;

receiving a current file of the information system, extracting a current data field of the current file, and matching the current data field against the historical sensitive fields;

when the current data field matches a historical sensitive field, desensitizing the data of the current data field using the historical desensitization rules to obtain first desensitized data;

when the current data field fails to match the historical sensitive fields, configuring a desensitization script for the current data field, defining a desensitization strategy for the current data field in the desensitization script, generating a current desensitization rule for the current data field according to the desensitization strategy, and desensitizing the current data field using the current desensitization rule to obtain second desensitized data; and

combining the first desensitized data and the second desensitized data to obtain the desensitized file of the current file.
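The chain from desensitization script to strategy to rule recited in claim 1 can be illustrated with a short Python sketch. The claim does not specify a script format, so everything here is an assumption: the script is modeled as a dictionary of per-field strategies, and the strategy types `mask` and `hash`, the keys `keep_prefix` and `keep_suffix`, and the function `build_rule` are hypothetical names.

```python
import hashlib
from typing import Any, Callable, Dict

Rule = Callable[[str], str]


def build_rule(strategy: Dict[str, Any]) -> Rule:
    """Turn a declarative strategy (assumed format) into a callable rule."""
    kind = strategy["type"]
    if kind == "mask":
        prefix = strategy.get("keep_prefix", 0)
        suffix = strategy.get("keep_suffix", 0)

        def mask(value: str) -> str:
            # Too short to keep anything visible: mask the whole value.
            if prefix + suffix >= len(value):
                return "*" * len(value)
            hidden = len(value) - prefix - suffix
            return value[:prefix] + "*" * hidden + value[len(value) - suffix:]

        return mask
    if kind == "hash":
        def digest(value: str) -> str:
            # One-way SHA-256 digest, rendered as 64 hex characters.
            return hashlib.sha256(value.encode("utf-8")).hexdigest()

        return digest
    raise ValueError(f"unknown strategy type: {kind}")


# Hypothetical desensitization script: one strategy per sensitive field.
script = {
    "phone": {"type": "mask", "keep_prefix": 3, "keep_suffix": 4},
    "id_number": {"type": "hash"},
}
rules = {field: build_rule(strategy) for field, strategy in script.items()}
print(rules["phone"]("13800001111"))  # 138****1111
```

Because the same `build_rule` function serves both historical and current fields, a rule built for a newly encountered field could later be stored alongside the historical rules, although the claim itself does not state this.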

2. The file desensitization method as described in claim 1, characterized in that identifying the historical sensitive fields among the historical data fields comprises: obtaining the field dimension of a historical data field, and identifying the field attribute of the historical data field based on the field dimension; determining whether the field attribute exists in a preset sensitive attribute table; if the field attribute does not exist in the sensitive attribute table, not treating the historical data field as a historical sensitive field; and if the field attribute exists in the sensitive attribute table, treating the historical data field as a historical sensitive field.
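The table lookup in claim 2 can be sketched as follows. The claim does not define "field dimension" concretely, so this example assumes it can be modeled by the field's name; the table contents, the mapping `DIMENSION_TO_ATTRIBUTE`, and the function `identify_sensitive_fields` are hypothetical.

```python
from typing import Dict, List

# Hypothetical preset sensitive attribute table.
SENSITIVE_ATTRIBUTES = {"phone_number", "id_number", "bank_account"}

# Assumed mapping from a field dimension (modeled as the field name)
# to a field attribute.
DIMENSION_TO_ATTRIBUTE: Dict[str, str] = {
    "mobile": "phone_number",
    "id_card": "id_number",
    "acct_no": "bank_account",
    "created_at": "timestamp",
}


def identify_sensitive_fields(fields: List[str]) -> List[str]:
    """Return the fields whose attribute appears in the sensitive table."""
    sensitive = []
    for name in fields:
        attribute = DIMENSION_TO_ATTRIBUTE.get(name)  # None if unknown
        if attribute in SENSITIVE_ATTRIBUTES:
            sensitive.append(name)
    return sensitive


print(identify_sensitive_fields(["mobile", "created_at", "unknown", "id_card"]))
# ['mobile', 'id_card']
```

Fields with no known attribute are treated as non-sensitive in this sketch; a real system might instead flag them for review.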

3. The file desensitization method as described in claim 1, characterized in that extracting the current data field of the current file comprises: cleaning the current file to obtain cleaned data; and identifying the data objects in the cleaned data and converting the data objects into data fields to obtain the current data fields of the current file.

4. The file desensitization method as described in claim 3, characterized in that cleaning the current file to obtain cleaned data comprises: deduplicating the current file to obtain deduplicated data; and detecting whether abnormal data exists in the deduplicated data and, when abnormal data is found, deleting the abnormal data to obtain the cleaned data.
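The cleaning and field extraction of claims 3 and 4 can be sketched in Python as below. The claims do not say what counts as abnormal data or how data objects become fields, so this example assumes that the file is a list of key-value rows, that a row is abnormal if any value is blank, and that the data fields are the distinct keys; all function names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def deduplicate(rows: List[Row]) -> List[Row]:
    """Drop exact duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def is_abnormal(row: Row) -> bool:
    """Assumed definition: any empty or whitespace-only value."""
    return any(not value.strip() for value in row.values())


def clean(rows: List[Row]) -> List[Row]:
    """Deduplicate first, then delete abnormal rows."""
    return [row for row in deduplicate(rows) if not is_abnormal(row)]


def extract_fields(rows: List[Row]) -> List[str]:
    """Collect distinct field names in order of first appearance."""
    fields: List[str] = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    return fields


raw = [
    {"name": "Li", "phone": "138"},
    {"name": "Li", "phone": "138"},   # duplicate
    {"name": "", "phone": "139"},     # abnormal: blank value
    {"name": "Wu", "email": "w@x.io"},
]
cleaned = clean(raw)
print(extract_fields(cleaned))  # ['name', 'phone', 'email']
```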

5. The file desensitization method according to any one of claims 1 to 4, characterized in that matching the current data field against the historical sensitive field comprises: converting the current data field into a current field vector, and converting the historical sensitive field into a historical field vector; calculating the field similarity between the current field vector and the historical field vector; if the field similarity is less than a preset similarity, determining that the current data field fails to match the historical sensitive field; and if the field similarity is not less than the preset similarity, determining that the current data field matches the historical sensitive field.

6. The file desensitization method as described in claim 5, characterized in that calculating the field similarity between the current field vector and the historical field vector comprises: calculating the field similarity between the current field vector and the historical field vector using the following formula: [formula not reproduced in the source text] wherein the symbols of the formula denote, respectively, the field similarity, the vector of the i-th current field, and the vector of the j-th historical field.
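The threshold-based matching of claims 5 and 6 can be illustrated with the Python sketch below. The formula of claim 6 is not available in this text, so cosine similarity is used here purely as a stand-in; the character-bigram vectorization, the threshold value 0.8, and the function names are likewise assumptions rather than details taken from the patent.

```python
import math
from collections import Counter


def to_vector(field_name: str) -> Counter:
    """Assumed vectorization: counts of character bigrams of the name."""
    text = field_name.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Stand-in similarity measure; not necessarily the claimed formula."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches(current: str, historical: str, threshold: float = 0.8) -> bool:
    """Match succeeds when similarity is not less than the preset value."""
    similarity = cosine_similarity(to_vector(current), to_vector(historical))
    return similarity >= threshold


print(matches("phone_num", "phone_number"))  # True
print(matches("address", "phone_number"))    # False
```

The comparison uses `>=` so that a similarity exactly equal to the preset value counts as a match, in line with the "not less than" wording of claim 5.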

7. A file desensitization device, characterized in that the device comprises:

a desensitization rule construction module, configured to obtain historical data fields of an information system, identify historical sensitive fields among the historical data fields, configure a desensitization script for the historical sensitive fields, define a desensitization strategy for the historical sensitive fields in the desensitization script, and generate historical desensitization rules for the historical sensitive fields according to the desensitization strategy;

a sensitive field matching module, configured to receive a current file of the information system, extract a current data field of the current file, and match the current data field against the historical sensitive fields;

a data desensitization module, configured to desensitize the data of the current data field using the historical desensitization rules when the current data field matches a historical sensitive field, so as to obtain first desensitized data;

the data desensitization module being further configured to, when the current data field fails to match the historical sensitive fields, configure a desensitization script for the current data field, define a desensitization strategy for the current data field in the desensitization script, generate a current desensitization rule for the current data field according to the desensitization strategy, and desensitize the current data field using the current desensitization rule to obtain second desensitized data; and

a desensitized file generation module, configured to combine the first desensitized data and the second desensitized data to obtain the desensitized file of the current file.

8. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the file desensitization method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the file desensitization method as described in any one of claims 1 to 6.