Method and apparatus for identifying sensitive information

By using the AC automaton to describe and classify the features of sensitive information, generating pattern strings, and constructing an automaton suitable for sensitive information matching, the performance degradation problem in existing technologies is solved, and the ability to efficiently identify a variety of sensitive information is achieved.

CN116702034B · Active · Publication Date: 2026-06-16 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date
2023-06-02
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing methods for identifying sensitive information based on regular expressions suffer from performance degradation in high-concurrency, long-string, and large-data-volume scenarios, and cannot handle multiple different types of sensitive information simultaneously.

Method used

The AC automaton recognition method is adopted. By describing and classifying various sensitive information features, corresponding pattern strings are generated, and the AC automaton is constructed for string matching, thus avoiding exponential consumption of memory and CPU.

Benefits of technology

It improves the efficiency of sensitive information matching, enabling the identification of multiple types of sensitive information in a single scan and reducing resource consumption.



Abstract

Embodiments of the present specification provide a sensitive information identification method and device. The method comprises: obtaining a to-be-identified string; inputting the to-be-identified string as an input parameter of an identification interface into a pre-created AC automaton; determining, by the AC automaton, whether the to-be-identified string can match at least one of various pattern strings, and if so, determining that the to-be-identified string includes sensitive information; wherein the various pattern strings correspond to various types of sensitive information respectively; and the AC automaton is created based on the various pattern strings. Embodiments of the present specification can more efficiently identify sensitive information such as user privacy information.

Description

Technical Field

[0001] One or more embodiments of this specification relate to electronic information technology, and more particularly to methods and apparatus for identifying sensitive information.

Background Technology

[0002] In internet businesses, there is a significant amount of sensitive information flowing through the network. This sensitive information may include users' personal privacy information or companies' operational data. To protect this sensitive information and prevent leakage risks, it is necessary to identify and classify the data flow links on the internet and to anonymize portions containing sensitive information.

[0003] Existing technologies include regular expression-based sensitive information identification schemes, which use regular expressions to identify and locate sensitive information in string data. However, the performance of a regular expression depends heavily on how it is written, and in high-concurrency, long-string, and large-data-volume scenarios it can cause memory and CPU consumption to grow exponentially, resulting in a sharp decline in processing performance. Furthermore, a single regular expression can identify and locate only one type of sensitive information at a time, so it cannot process multiple different types of sensitive information simultaneously, and its processing performance needs improvement.

[0004] Therefore, a more effective method for identifying sensitive information is needed.

Summary of the Invention

[0005] This specification describes one or more embodiments of a method and apparatus for identifying sensitive information, which can identify sensitive information more efficiently.

[0006] According to the first aspect, a method for identifying sensitive information is provided, wherein the method includes:

[0007] Obtain the string to be recognized;

[0008] The string to be identified is passed, as an input parameter of the identification interface, into the pre-created AC automaton;

[0009] The AC automaton determines whether the string to be identified can match at least one of the various pattern strings. If it can, it is determined that the string to be identified contains sensitive information.

[0010] The various pattern strings correspond to different types of sensitive information; the AC automaton is created based on these various pattern strings.

[0011] The method for creating the AC automaton includes:

[0012] The sensitive information of various types is characterized by features, wherein the feature description includes at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; and the position of each type of character involved in the sensitive information within the sensitive information.

[0013] Classify each character involved in the various feature descriptions into at least one character set;

[0014] Assign a code to each character set; different character sets have different codes.

[0015] For each type of sensitive information, the following steps are performed: Based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form the pattern string corresponding to the current type of sensitive information.

[0016] Create an AC automaton using the pattern strings corresponding to various types of sensitive information.

[0017] The method of creating an AC automaton using various pattern strings corresponding to different types of sensitive information includes:

[0018] A first AC automaton is created using the pattern strings corresponding to various types of sensitive information. For the created first AC automaton, it is determined whether any two child nodes directly connected to the same root node can represent the same type of character. If they can, the pattern string corresponding to the path of any one of the two child nodes is deleted in the first AC automaton. The deleted pattern string is then input into the second AC automaton to create the second AC automaton.

[0019] The step of using an AC automaton to determine whether the string to be identified can match at least one of various pattern strings includes: using multiple created AC automata to determine whether the string to be identified can match at least one of various pattern strings.
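The split into a first and a second AC automaton can be sketched as follows. This is only one possible reading of paragraph [0018]: it assumes that two child nodes of the root "can represent the same type of character" when the character sets behind their codes intersect (for example, a digit satisfies both N and S). The names `CLASS_MEMBERS`, `members`, and `split_patterns` are hypothetical, and the license plate prefix set is a partial stand-in.

```python
import string

# Character sets behind each code (codes N, L, W, S, P as in the examples of
# this specification). The plate prefix set is a partial, assumed stand-in.
CLASS_MEMBERS = {
    "N": set(string.digits),
    "L": set(string.ascii_letters),
    "W": set(string.digits + string.ascii_letters),
    "S": set(string.hexdigits),
    "P": set("京津冀"),
}

def members(code):
    """Characters a code can stand for; any other code is a literal."""
    return CLASS_MEMBERS.get(code, {code})

def split_patterns(patterns):
    """Greedily group patterns so that, within one group, no two different
    first codes can stand for the same text character."""
    groups = []  # one list of pattern strings per automaton
    for pat in patterns:
        for group in groups:
            if all(p[0] == pat[0] or not (members(p[0]) & members(pat[0]))
                   for p in group):
                group.append(pat)
                break
        else:
            # Conflicts with every existing group: start another automaton.
            groups.append([pat])
    return groups

patterns = ["N" * 18, "N" * 17 + "X", "SS:SS:SS:SS:SS:SS",
            "N" * 11, "PLWWWWW", "PLWWWWWW"]
groups = split_patterns(patterns)
```

Under this assumption the MAC address pattern (first code S) is moved to a second group because a digit could follow either the N branch or the S branch at the root, while the N and P patterns stay together.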

[0020] Wherein,

[0021] The at least one character set includes a numeric character set consisting of at least one digit; correspondingly, setting a code for each character set includes: using a first character to represent any character in the range of 0-9, and setting a numeric character set to correspond to the first character;

[0022] And / or,

[0023] The at least one character set includes an alphabetic character set consisting of at least one English letter; correspondingly, setting a code for each character set includes: using a second character to represent any character in the ranges a-z and A-Z, and setting the alphabetic character set to correspond to the second character;

[0024] And / or,

[0025] The at least one character set includes a mixed character set composed of numbers and English letters; correspondingly, setting a code for each character set includes: using a third character to represent any character in the ranges 0-9, a-z, and A-Z, and setting the mixed character set to correspond to the third character;

[0026] And / or,

[0027] The at least one character set includes a hexadecimal character set consisting of characters that satisfy the hexadecimal format; correspondingly, setting a code for each character set includes: using a fourth character to represent any character in the ranges 0-9, a-z, and A-Z that satisfies the hexadecimal format, and setting the hexadecimal character set to correspond to the fourth character;

[0028] The at least one character set includes a license plate prefix character set; correspondingly, setting a code for each character set includes: using the fifth character to represent each Chinese character in the Chinese character prefix of the license plate number.

[0029] The sensitive information includes sensitive information related to the ID card type; correspondingly, the feature description of the sensitive information related to the ID card type includes: a string of 18 characters in length, the first 17 characters being numbers, and the last character possibly being a number or an X; correspondingly, the pattern string corresponding to the sensitive information related to the ID card type includes: NNNNNNNNNNNNNNNNNN and NNNNNNNNNNNNNNNNNX, where N is used to represent the first character;

[0030] And / or,

[0031] The sensitive information includes sensitive information of MAC address type; correspondingly, the feature description of sensitive information of MAC address type includes: the MAC address is a string of 17 characters in length, which contains 6 groups of 2-digit numbers in hexadecimal format, and each group of numbers is separated by a colon; correspondingly, the pattern string corresponding to sensitive information of MAC address type includes: SS:SS:SS:SS:SS:SS, where S is used to represent the fourth character;

[0032] And / or,

[0033] The sensitive information includes sensitive information of mobile phone number type; correspondingly, the feature description of sensitive information of mobile phone number type includes: a Chinese mobile phone number is a purely numeric string of length 11; correspondingly, the pattern string corresponding to sensitive information of mobile phone number type includes: NNNNNNNNNNN, where N is used to represent the first character;

[0034] And / or,

[0035] The sensitive information includes sensitive information of the type of civilian vehicle license plate number; correspondingly, the feature description of the sensitive information of the type of civilian vehicle license plate number includes: the civilian vehicle license plate number is a string with the first character being Chinese and the second character being an English letter, and the length being 7 or 8 characters, wherein the 3rd to 7th or 8th characters are numbers or English letters; correspondingly, the pattern string corresponding to the sensitive information of the type of civilian vehicle license plate number includes: PLWWWWW and PLWWWWWW, where P is used to represent the fifth character; L is used to represent the second character.

[0036] According to a second aspect, a sensitive information identification device is provided, the device comprising:

[0037] The module for obtaining the string to be recognized is configured to obtain the string to be recognized.

[0038] The input module is configured to pass the string to be recognized as an input parameter to the pre-created AC automaton.

[0039] The AC automaton module is configured to determine whether the string to be identified can match at least one of various pattern strings. If it can, it is determined that the string to be identified contains sensitive information. The various pattern strings correspond to various types of sensitive information. The AC automaton is created based on the various pattern strings.

[0040] The device further includes: a creation module;

[0041] The creation module is configured to: obtain feature descriptions of various types of sensitive information; wherein the feature descriptions include at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; the position of each type of character involved in the sensitive information in the sensitive information; classify each character involved in the various feature descriptions into at least one character set; set a code corresponding to each character set; wherein different character sets correspond to different codes; for each type of sensitive information, perform the following: based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form a pattern string corresponding to the current type of sensitive information; create an AC automaton using the pattern strings corresponding to various types of sensitive information.

[0042] The creation module is configured to: create a first AC automaton using pattern strings corresponding to various types of sensitive information; for the created first AC automaton, determine whether any two child nodes directly connected to the same root node can represent the same type of character; if so, delete the pattern string corresponding to the path of any one of the two child nodes in the first AC automaton; input the deleted pattern string into the second AC automaton to create the second AC automaton.

[0043] The AC automaton module is configured to perform the following: using multiple created AC automata to determine whether the string to be identified can match at least one of various pattern strings.

[0044] According to a third aspect, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the methods described in any embodiment of this specification.

[0045] According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method described in any embodiment of this specification.

[0046] The method and apparatus for identifying sensitive information provided in the embodiments of this specification have at least the following beneficial effects:

[0047] 1. In the embodiments of this specification, an Aho-Corasick automaton is used to identify whether a string contains sensitive information. In these embodiments, various types of sensitive information are categorized. For each type of sensitive information, a pattern string corresponding to that type is pre-obtained, and these pattern strings are added to the Aho-Corasick automaton. The Aho-Corasick automaton is initialized and constructed based on its algorithm. This constructed Aho-Corasick automaton can then be used for the identification and location of sensitive information. Therefore, based on the Aho-Corasick automaton, character matching of sensitive information can be transformed into state transitions. Even if the string to be identified is very long, it only needs to be scanned once. Compared to regular expression-based sensitive information matching methods, this avoids the problem of exponential memory and CPU consumption, greatly improving matching efficiency.

[0048] 2. The AC automaton in this specification is improved by classifying sensitive information and, for each type, summarizing a general string (called a pattern string) that conforms to the characteristics of that type. That is, each type of sensitive information corresponds to a pattern string, rather than each individual piece of sensitive information corresponding to its own string. In this way, as long as characters in the string to be identified match a pattern string, it can be determined that the string contains sensitive information of the type corresponding to the matched pattern string. For example, if characters in the string to be identified match the pattern string for the mobile phone number type, it can be determined that the string contains a mobile phone number (it is only necessary to determine that a mobile phone number is present, not which one), and thus that it contains sensitive information. Appropriate processing, such as desensitizing the sensitive information, can then be performed. The AC automaton can thereby be used for matching sensitive information.

Attached Figure Description

[0049] To more clearly illustrate the technical solutions in the embodiments or prior art of this specification, the drawings used in the description of the embodiments or prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 is a flowchart of a method for identifying sensitive information in one embodiment of this specification.

[0051] Figure 2 is a flowchart of a method for creating an AC automaton in one embodiment of this specification.

[0052] Figure 3 is a state diagram of an AC automaton in the prior art.

[0053] Figure 4 is a schematic diagram of the state of an AC automaton in one embodiment of this specification.

[0054] Figure 5 is a schematic diagram of the structure of a sensitive information identification device in one embodiment of this specification.

[0055] Figure 6 is a schematic diagram of the structure of a sensitive information identification device in another embodiment of this specification.

Detailed Implementation

[0056] The solution provided in this specification will now be described with reference to the accompanying drawings.

[0057] First, it should be noted that the terminology used in the embodiments of this invention is for the purpose of describing specific embodiments only and is not intended to limit the invention. The singular forms “a,” “said,” and “the” used in the embodiments of this invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.

[0058] It should be understood that the term "and / or" used in this specification merely describes a relationship between associated objects and indicates that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this specification generally indicates that the objects before and after it are in an "or" relationship.

[0059] Figure 1 is a flowchart of a sensitive information identification method in one embodiment of this specification. The method is executed by a sensitive information identification device. It is understood that the method can also be executed by any apparatus, equipment, platform, or device cluster with computing and processing capabilities. Referring to Figure 1, the method includes:

[0060] Step 101: Obtain the string to be recognized.

[0061] Step 103: Pass the string to be recognized as the input parameter of the recognition interface into the pre-created AC automaton.

[0062] Step 105: The AC automaton determines whether the string to be identified can match at least one of the various pattern strings. If it can, proceed to step 107; otherwise, proceed to step 109. The various pattern strings correspond to various types of sensitive information. The AC automaton is created based on the various pattern strings.

[0063] Step 107: Determine that the string to be identified contains sensitive information, and end the current process.

[0064] Step 109: Determine that the string to be identified does not contain sensitive information.
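Steps 101 to 109 can be sketched as a thin wrapper around whatever matcher is available. The names `identify` and `toy_matcher` are hypothetical; `toy_matcher` is only a stand-in for the pre-created AC automaton (it flags any run of 11 consecutive ASCII digits) so that the sketch runs on its own.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, int]  # (type of sensitive information, start index)

def identify(text: str,
             match: Callable[[str], List[Hit]]) -> Tuple[bool, List[Hit]]:
    """Steps 101-109: pass the string to the matcher and report the result."""
    hits = match(text)             # steps 103 and 105
    return (len(hits) > 0, hits)   # step 107 if True, step 109 if False

def toy_matcher(text: str) -> List[Hit]:
    """Stand-in matcher (an assumption, not the AC automaton itself):
    reports every position where 11 consecutive ASCII digits start."""
    hits, run = [], 0
    for i, ch in enumerate(text):
        run = run + 1 if ch in "0123456789" else 0
        if run >= 11:
            hits.append(("mobile_phone", i - 10))
    return hits

found, hits = identify("tel 13800138000", toy_matcher)
```

Keeping the matcher behind a plain callable means the identification interface stays the same whether one automaton or several are consulted.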

[0065] As shown in the flowchart of Figure 1, this embodiment uses an Aho-Corasick (AC) automaton to identify whether a string contains sensitive information. The AC automaton is a multi-pattern matching algorithm. In this embodiment, sensitive information is divided into types. For each type, a corresponding pattern string is obtained in advance and added to the AC automaton, which is then initialized and constructed according to the AC algorithm. The constructed AC automaton can then be used to identify and locate sensitive information. Because the AC automaton turns character matching into state transitions, even a very long string to be identified needs to be scanned only once. Compared with regular expression-based matching, this avoids exponential growth in memory and CPU consumption and greatly improves matching efficiency.

[0066] As the flowchart of Figure 1 shows, the embodiments of this specification require an AC automaton suitable for sensitive information matching to be created in advance; that is, the existing AC automaton must be improved to make it suitable for sensitive information matching. Referring to Figure 2, in one embodiment of this specification, the method for creating an AC automaton includes:

[0067] Step 201: Classify sensitive information.

[0068] Step 203: Describe the features of various types of sensitive information; wherein the feature description includes at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; and the position of each type of character involved in the sensitive information within the sensitive information.

[0069] Step 205: Classify each character involved in the various feature descriptions into at least one character set.

[0070] Step 207: Assign a code to each character set; different character sets have different codes.

[0071] Step 209: For each type of sensitive information, perform the following: Based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form the pattern string corresponding to the current type of sensitive information.

[0072] Step 211: Create an AC automaton using the pattern strings corresponding to various types of sensitive information, where a path from the root node to a leaf node in the AC automaton corresponds to at least one pattern string.

[0073] The Aho-Corasick (AC) automaton is a multi-pattern matching algorithm for matching multiple strings at once. Because it turns character matching into state transitions and needs only one scan even for long strings, it greatly improves matching efficiency and saves processing resources. In the prior art, however, the AC automaton can only be used for keyword matching; that is, it can only match strings against keywords known in advance. Figure 3 illustrates how a prior-art AC automaton works. Referring to Figure 3, the strings "she", "his", and "hers" are input as keywords to create an AC automaton suitable for keyword matching. When a string to be identified later arrives, it is input into the AC automaton to determine whether it contains any of the predetermined keywords "she", "his", or "hers". Such an automaton cannot be used directly for sensitive information matching, because the prior art essentially enumerates keywords exhaustively: every character of every keyword must be known before the automaton can be created. Sensitive information is usually defined by type, and the instances of a type cannot be enumerated. For example, ID card numbers and mobile phone numbers are sensitive information, yet it is impossible to enumerate all ID card numbers or all mobile phone numbers as keywords for creating the automaton. Therefore, in the embodiments of this specification, the process shown in Figure 2 improves the AC automaton by classifying sensitive information and, for each type, summarizing a general string (called a pattern string) that conforms to the characteristics of that type. In other words, each type of sensitive information corresponds to a pattern string, rather than each individual piece of sensitive information corresponding to its own string. Thus, if characters in the string to be identified match a pattern string, it can be determined that the string contains sensitive information of the type corresponding to that pattern string. For example, if characters in the string to be identified match the pattern string for the mobile phone number type, it can be determined that the string contains a mobile phone number (it is only necessary to determine that a mobile phone number is present, not which one), and therefore that it contains sensitive information. Appropriate processing, such as de-identifying the sensitive information, can then be performed, which allows the AC automaton to be used for matching sensitive information. The process of Figure 2 for creating an AC automaton suitable for sensitive information matching is explained in detail below.
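For reference, prior-art keyword matching with the keywords "she", "his", and "hers" can be sketched as a textbook Aho-Corasick automaton. This is a minimal illustration and not code from the patent; the function names `build_ac` and `search` are hypothetical.

```python
from collections import deque

def build_ac(keywords):
    """Build the trie, failure links and output sets for the keywords."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state for the longest proper suffix
    out = [set()]   # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: children of the root keep fail = 0; deeper
    # states derive their failure link from their parent's link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Scan the text once and return (start index, keyword) for each hit."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_ac(["she", "his", "hers"])
hits = search("ushers", automaton)
```

Every transition is keyed on a concrete character, which is why each keyword must be fully known in advance and why this form cannot cover an open-ended class such as "any 11-digit number".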

[0074] First, regarding step 201: classify sensitive information.

[0075] As mentioned earlier, sensitive information is usually defined by type, and its instances cannot be exhaustively listed. For example, ID card numbers, mobile phone numbers, and computer MAC (Media Access Control) addresses are all sensitive information, and the number of possible values is vast. A matching pattern string needs to be set for each type of sensitive information. Therefore, step 201 is executed first to determine the types of sensitive information, i.e., to classify the sensitive information. For example, the classified types include the following four: ID card numbers, MAC addresses, mobile phone numbers, and vehicle license plate numbers.

[0076] Next, for step 203: perform feature description on various types of sensitive information; wherein the feature description includes at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; and the position of each type of character involved in the sensitive information within the sensitive information.

[0077] As mentioned earlier, in this embodiment of the specification, a general string (called a pattern string) that conforms to the characteristics of each type of sensitive information is summarized for each type of sensitive information. That is, each type of sensitive information corresponds to a string (called a pattern string), rather than each sensitive information corresponding to a string. Therefore, it is necessary to perform the feature description processing for each type of sensitive information in step 203 so that a general pattern string that conforms to the characteristics of each type of sensitive information can be summarized based on the feature description in subsequent processes.

[0078] The processing of step 203 is illustrated with the following examples.

[0079] For sensitive information of the ID card number type, the feature description can be summarized from the characteristics of ID card numbers. For example: a string of 18 characters, in which the first 17 characters are digits and the last character is either a digit or X.

[0080] For sensitive information of the MAC address type, the feature description can be summarized from the characteristics of MAC addresses as follows: a MAC address is a string of 17 characters containing 6 groups of 2 hexadecimal characters (each character is a digit from 0 to 9 or a letter in the range a-f or A-F), with adjacent groups separated by a colon.

[0081] For sensitive information of the mobile phone number type, the feature description can be summarized from the characteristics of mobile phone numbers as follows: a mobile phone number is a purely numeric string of 11 digits.

[0082] For sensitive information of the civilian vehicle license plate number type, the feature description can be summarized from the characteristics of civilian vehicle license plates as follows: a civilian vehicle license plate number is a string whose first character is a Chinese character (representing a region, such as "Jing", "Jin", or "Ji"), whose second character is an English letter, and whose length is 7 or 8 characters; the 3rd through the 7th or 8th characters are digits or English letters.

[0083] Next, for steps 205 to 207: classify each character involved in the various feature descriptions into at least one character set, and set a code for each character set, where different character sets have different codes.

[0084] For example, at least one of the following character sets will be involved in the various feature descriptions of various types of sensitive information:

[0085] a. A numeric character set composed of at least one number.

[0086] In steps 205 to 207, a first character, such as N, can be used to represent any character in the range 0-9, and the code corresponding to the numeric character set is set to the first character N. That is, each digit in the numeric character set is subsequently replaced by the first character, such as N.

[0087] b. An alphabetic character set composed of at least one English letter.

[0088] In steps 205 to 207, a second character, such as L, can be used to represent any character in the ranges a-z and A-Z, and the code corresponding to the alphabetic character set is set to the second character L. That is, each letter in the alphabetic character set is subsequently replaced by the second character, such as L.

[0089] c. A mixed character set consisting of numbers and English letters.

[0090] In steps 205 to 207, a third character, such as W, can be used to represent any character in the ranges 0-9, a-z, and A-Z, and the code corresponding to the mixed character set is set to the third character W. That is, each character (digit or letter) in the mixed character set is subsequently replaced by the third character, such as W.

[0091] d. A hexadecimal character set consisting of characters that conform to a hexadecimal format.

[0092] In steps 205 to 207, a fourth character, such as S, can be used to represent any character in the ranges 0-9, a-z, and A-Z that satisfies the hexadecimal format, and the code corresponding to the hexadecimal character set is set to the fourth character S. That is, each character in the hexadecimal character set is subsequently replaced by the fourth character, such as S.

[0093] e. License plate prefix character set.

[0094] In steps 205-207, the fifth character, such as P, can be used to represent each Chinese character in the Chinese character prefix of the license plate number, and the code corresponding to the license plate prefix character set is set to the fifth character P. That is to say, each Chinese character included in the subsequent license plate prefix character set will be replaced by the fifth character, such as P.

[0095] f. Any other character not appearing in the above rules shall be represented by the character itself.
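
The character-set codes a to f above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the names `CODE_TABLE`, `PLATE_PREFIXES`, and `matches_code` are not taken from the specification, and the license plate prefix characters listed are a small assumed sample rather than a complete set.

```python
import string

# Assumed sample of Chinese license plate prefix characters (not a complete list).
PLATE_PREFIXES = "京沪粤浙"

# Illustrative code table; the letters N, L, W, S, P follow the examples in the text.
CODE_TABLE = {
    "N": set(string.digits),                         # a. numeric character set
    "L": set(string.ascii_letters),                  # b. alphabetic character set
    "W": set(string.digits + string.ascii_letters),  # c. mixed character set
    "S": set(string.hexdigits),                      # d. hexadecimal character set
    "P": set(PLATE_PREFIXES),                        # e. license plate prefix character set
}


def matches_code(code: str, ch: str) -> bool:
    """Return True if the character ch is covered by the given code."""
    if code in CODE_TABLE:
        return ch in CODE_TABLE[code]
    # f. any other character is represented by the character itself
    return ch == code
```

In this sketch a letter that is also a code (for example N) is always read as the code, which is a simplification chosen for the example.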

[0096] Next, for step 209: For each type of sensitive information, perform the following: Based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form the pattern string corresponding to the current type of sensitive information.

[0097] The implementation of step 209 is illustrated below using the character sets a to f above and their corresponding codes.

[0098] For sensitive information related to ID card types, there are two corresponding pattern strings, including: NNNNNNNNNNNNNNNNNN and NNNNNNNNNNNNNNNNNX.

[0099] For sensitive information of MAC address type, the corresponding pattern string includes: SS:SS:SS:SS:SS:SS.

[0100] For sensitive information such as mobile phone numbers, the corresponding pattern string includes: NNNNNNNNNNN.

[0101] For sensitive information related to civilian vehicle license plate numbers, there are two corresponding pattern strings: PLWWWWW and PLWWWWWW.

[0102] As can be seen, through the processing of steps 201 to 209 above, a pattern string corresponding to each type of sensitive information is generated.
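
As a rough illustration of step 209, a feature description can be modeled as an ordered list of (code, count) segments that are expanded and concatenated into a pattern string. The names `build_pattern`, `mac_segments`, `FEATURES`, and `PATTERNS`, and the segment representation itself, are assumptions made for this sketch and are not prescribed by the specification.

```python
from typing import Dict, List, Tuple

# A feature description is modeled here as an ordered list of (code, count) segments.
Segments = List[Tuple[str, int]]


def build_pattern(segments: Segments) -> str:
    """Repeat each code once per character and concatenate the codes in order."""
    return "".join(code * count for code, count in segments)


def mac_segments() -> Segments:
    """Six groups of two hexadecimal characters, separated by colons."""
    segs: Segments = []
    for group in range(6):
        if group:
            segs.append((":", 1))
        segs.append(("S", 2))
    return segs


# Each type may have more than one variant (for example, an ID card ending in X).
FEATURES: Dict[str, List[Segments]] = {
    "ID card": [[("N", 18)], [("N", 17), ("X", 1)]],
    "MAC address": [mac_segments()],
    "Mobile Phone Number": [[("N", 11)]],
    "License plate": [
        [("P", 1), ("L", 1), ("W", 5)],
        [("P", 1), ("L", 1), ("W", 6)],
    ],
}

PATTERNS: Dict[str, List[str]] = {
    name: [build_pattern(segs) for segs in variants]
    for name, variants in FEATURES.items()
}
```

With these inputs, `PATTERNS` reproduces the pattern strings listed in paragraphs [0098] to [0101].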

[0103] Next, for step 211: create an AC automaton using the pattern strings corresponding to various types of sensitive information, where a path from the root node to a leaf node in the AC automaton corresponds to at least one pattern string.

[0104] In step 211, the existing AC automaton is extended so that each node of the AC automaton supports matching against the codes of the pattern strings. The pattern strings corresponding to the various types of sensitive information are input into the AC automaton, and for each pattern string, information is recorded at the node corresponding to its last code. The recorded information includes the type name of the corresponding sensitive information. For example, for the pattern string of a mobile phone number, the node corresponding to its last code in the AC automaton is recorded as: "Mobile Phone Number". This ensures that a path from the root node to a leaf node in the AC automaton corresponds to at least one pattern string. Here, initialization and construction can be performed according to the traditional AC automaton algorithm, while the calculation of failure nodes in the AC automaton needs to be adapted to the codes corresponding to the character sets mentioned above.

[0105] In the created AC automaton, a path from the root node to a leaf node corresponds to at least one pattern string. According to the principle of the AC automaton, for example, the pattern string corresponding to ID card sensitive information is NNNNNNNNNNNNNNNNNN, and the pattern string corresponding to mobile phone number sensitive information is NNNNNNNNNNN. These two pattern strings lie on the same path from the root node to a leaf node in the AC automaton; on this path, the 11th child node is marked with the mobile phone number type, and the 18th child node is marked with the ID card type.
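
The shared path described above can be sketched with a plain trie whose edges are labeled with codes. This sketch covers only the trie portion of an AC automaton: failure links are omitted, only the code N is modeled, and the names `Node`, `insert`, `covers`, and `walk` are assumptions made for the example.

```python
import string
from typing import Dict, List, Optional, Tuple

DIGITS = set(string.digits)


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}  # edge label is a code such as "N"
        self.label: Optional[str] = None       # type name recorded at the last code


def insert(root: Node, pattern: str, label: str) -> None:
    """Add a pattern string to the trie and record its type name at the last node."""
    node = root
    for code in pattern:
        node = node.children.setdefault(code, Node())
    node.label = label


def covers(code: str, ch: str) -> bool:
    # Only the code N is modeled here; every other code stands for itself.
    return ch in DIGITS if code == "N" else ch == code


def walk(root: Node, text: str) -> List[Tuple[int, str]]:
    """Follow text from the root and collect (depth, label) for each marked node."""
    found: List[Tuple[int, str]] = []
    node = root
    for depth, ch in enumerate(text, start=1):
        nxt = next((c for code, c in node.children.items() if covers(code, ch)), None)
        if nxt is None:
            break
        node = nxt
        if node.label is not None:
            found.append((depth, node.label))
    return found


root = Node()
insert(root, "N" * 18, "ID card")
insert(root, "N" * 11, "Mobile Phone Number")
hits = walk(root, "123456789012345678")
```

Walking an 18-digit string passes the node marked at depth 11 before reaching the node marked at depth 18, so both type names are reported along one path.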

[0106] For example, an AC automaton created in the embodiments of this specification may be as shown in Figure 4.

[0107] In one embodiment of this specification, the process of creating an AC automaton using various pattern strings corresponding to various types of sensitive information in step 211 may include: creating a first AC automaton using various pattern strings corresponding to various types of sensitive information; for the created first AC automaton, determining whether any two child nodes directly connected to the same root node can represent the same type of character; if so, deleting the pattern string corresponding to the path of any one of the two child nodes in the first AC automaton; inputting the deleted pattern string into a second AC automaton to create a second AC automaton.

[0108] The processing logic of the AC automaton is to find the corresponding child nodes level by level, starting from the root node or parent node, for matching. Therefore, as long as there is no conflict among the first-level child nodes under the same parent node, the correctness of the entire logic can be guaranteed. Whether multiple first-level child nodes under the same parent node conflict is determined by checking whether they intersect, that is, whether any two first-level child nodes can represent the same type of character, such as both being able to represent digits. For example, if the code S represents a hexadecimal digit (0-9 and a-f) and the code N represents a decimal digit (0-9), then N and S intersect: they can represent the same type of character, and if they are under the same parent node, there will be a conflict. Therefore, the pattern string corresponding to MAC address sensitive information and the pattern string corresponding to telephone number sensitive information need to be placed in different AC automata.
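
The sibling-conflict check can be sketched as a set intersection over the characters each code can represent, followed by a first-fit split of the pattern strings into several tries. The first-fit strategy, the nested-dict trie, the sample license plate prefix characters, and the names `CODE_SETS`, `conflicts`, `try_insert`, and `partition` are all assumptions made for this illustration; the specification only requires that conflicting pattern strings end up in different AC automata.

```python
import string
from typing import Dict, List, Set

# Assumed code table; the plate prefix set is a small illustrative sample.
CODE_SETS: Dict[str, Set[str]] = {
    "N": set(string.digits),
    "L": set(string.ascii_letters),
    "W": set(string.digits + string.ascii_letters),
    "S": set(string.hexdigits),
    "P": set("京沪粤浙"),
}


def char_set(code: str) -> Set[str]:
    """Characters a code can represent; a non-code character represents itself."""
    return CODE_SETS.get(code, {code})


def conflicts(code_a: str, code_b: str) -> bool:
    """Two different sibling codes conflict when their character sets intersect."""
    return code_a != code_b and bool(char_set(code_a) & char_set(code_b))


def try_insert(trie: dict, pattern: str) -> bool:
    """Insert pattern into a nested-dict trie unless a sibling conflict arises."""
    node = trie
    for code in pattern:
        if code in node:
            node = node[code]
            continue
        if any(conflicts(code, sibling) for sibling in node):
            return False
        break  # the rest of the path is new, so it has no siblings to check
    node = trie
    for code in pattern:
        node = node.setdefault(code, {})
    return True


def partition(patterns: List[str]) -> List[List[str]]:
    """Place each pattern in the first trie that accepts it, else open a new one."""
    tries: List[dict] = []
    groups: List[List[str]] = []
    for pattern in patterns:
        for trie, group in zip(tries, groups):
            if try_insert(trie, pattern):
                group.append(pattern)
                break
        else:
            trie: dict = {}
            try_insert(trie, pattern)
            tries.append(trie)
            groups.append([pattern])
    return groups


groups = partition(["N" * 11, "SS:SS:SS:SS:SS:SS", "N" * 18, "PLWWWWW"])
```

Here the MAC address pattern is rejected by the first trie because S and N intersect at the root, so it is placed in a second group, while the remaining pattern strings share the first.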

[0109] After the AC automaton is created, sensitive information matching can be performed using the process shown in Figure 1 above. If multiple AC automata are created in the embodiments of this specification, then in the process shown in Figure 1, the multiple created AC automata are used to determine whether the string to be recognized matches at least one of the various pattern strings. For example, the AC automata can be matched one by one, with the matching process ending once a match is found in one AC automaton; alternatively, matching can be performed simultaneously in multiple AC automata.
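
The two ways of using multiple AC automata can be sketched as follows. Each automaton is modeled as a plain callable that returns a type name or `None`; `digits_automaton` and `mac_automaton` are hypothetical stand-ins that check a whole string, not real AC automata, and the use of a thread pool for simultaneous matching is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

# Each automaton is modeled as a callable that returns a type name or None.
Matcher = Callable[[str], Optional[str]]

DEC = set("0123456789")
HEX = set("0123456789abcdefABCDEF")


def match_one_by_one(automata: List[Matcher], text: str) -> Optional[str]:
    """Query the automata in order and stop at the first match."""
    for automaton in automata:
        result = automaton(text)
        if result is not None:
            return result
    return None


def match_simultaneously(automata: List[Matcher], text: str) -> List[str]:
    """Query all automata concurrently and collect every match."""
    with ThreadPoolExecutor(max_workers=len(automata) or 1) as pool:
        results = list(pool.map(lambda automaton: automaton(text), automata))
    return [r for r in results if r is not None]


# Hypothetical stand-ins for two created AC automata.
def digits_automaton(text: str) -> Optional[str]:
    ok = len(text) == 11 and set(text) <= DEC
    return "Mobile Phone Number" if ok else None


def mac_automaton(text: str) -> Optional[str]:
    parts = text.split(":")
    ok = len(parts) == 6 and all(len(p) == 2 and set(p) <= HEX for p in parts)
    return "MAC address" if ok else None


automata = [digits_automaton, mac_automaton]
```

Matching one by one avoids querying later automata once a match is found, while simultaneous matching trades extra work for lower latency when several automata exist.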

[0110] In one embodiment of this specification, a sensitive information identification device is provided. Referring to Figure 5, the device includes:

[0111] The string acquisition module 501 is configured to obtain the string to be recognized;

[0112] Input module 502 is configured to pass the string to be recognized as an input parameter of the recognition interface into a pre-created AC automaton;

[0113] AC automaton module 503 is configured to determine whether the string to be identified can match at least one of various pattern strings. If it can, it is determined that the string to be identified contains sensitive information. The various pattern strings correspond to various types of sensitive information. The AC automaton is created based on the various pattern strings.

[0114] In one embodiment of the device described in this specification, referring to Figure 6, the device further includes a creation module 601;

[0115] The creation module 601 is configured to: obtain feature descriptions of various types of sensitive information; wherein the feature descriptions include at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; the position of each type of character involved in the sensitive information in the sensitive information; classify each character involved in the various feature descriptions into at least one character set; set a code corresponding to each character set; wherein different character sets have different codes; for each type of sensitive information, perform the following: based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form a pattern string corresponding to the current type of sensitive information; and create an AC automaton using the pattern strings corresponding to various types of sensitive information.

[0116] The creation module 601 is configured to: create a first AC automaton using pattern strings corresponding to various types of sensitive information; for the created first AC automaton, determine whether any two child nodes directly connected to the same root node can represent the same type of character; if so, delete the pattern string corresponding to the path of any one of the two child nodes in the first AC automaton, and input the deleted pattern string into the second AC automaton to create the second AC automaton.

[0117] AC automaton module 503 is configured to perform: using multiple created AC automata to determine whether the string to be identified can match at least one of various pattern strings.

[0118] It should be noted that the above-mentioned devices are typically implemented on the server side. They can be set up on independent servers, or some or all of the devices can be combined and installed on the same server. This server can be a single server or a server cluster consisting of multiple servers. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system. The above-mentioned devices can also be implemented on computer terminals with strong computing capabilities.

[0119] This specification provides, in one embodiment, a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the methods of any embodiment in the specification.

[0120] This specification provides a computing device according to one embodiment, including a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to perform the method of any embodiment of the specification.

[0121] It is understood that the structures illustrated in the embodiments of this specification do not constitute a specific limitation on the apparatus of the embodiments of this specification. In other embodiments of the specification, the above-described apparatus may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0122] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the apparatus embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.

[0123] Those skilled in the art will recognize that, in one or more of the examples above, the functions described in this invention can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

[0124] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solution of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for identifying sensitive information, wherein the method comprises: obtaining a string to be identified; passing the string to be identified, as an input parameter of an identification interface, into a pre-created AC automaton; determining, by the AC automaton, whether the string to be identified can match at least one of various pattern strings, and if so, determining that the string to be identified contains sensitive information; wherein the various pattern strings correspond respectively to various types of sensitive information, and the AC automaton is created based on the various pattern strings; wherein the method for creating the AC automaton comprises: obtaining feature descriptions of the various types of sensitive information, wherein a feature description includes at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; and the position, within the sensitive information, of each type of character involved in the sensitive information; classifying each character involved in the various feature descriptions into at least one character set; setting a code for each character set, wherein different character sets have different codes; for each type of sensitive information, performing the following: based on the feature description of the current type of sensitive information, determining at least one current character set included in the current type of sensitive information, using the code corresponding to each current character set to mark each character in the current character set one by one, and connecting the codes in sequence to form the pattern string corresponding to the current type of sensitive information; and creating the AC automaton using the pattern strings corresponding to the various types of sensitive information.

2. The method according to claim 1, wherein, The method of creating an AC automaton using various pattern strings corresponding to different types of sensitive information includes: A first AC automaton is created using the pattern strings corresponding to various types of sensitive information. For the created first AC automaton, it is determined whether any two child nodes directly connected to the same root node can represent the same type of character. If they can, the pattern string corresponding to the path of any one of the two child nodes is deleted in the first AC automaton. The deleted pattern string is then input into the second AC automaton to create the second AC automaton. The step of using an AC automaton to determine whether the string to be identified can match at least one of various pattern strings includes: using multiple created AC automata to determine whether the string to be identified can match at least one of various pattern strings.

3. The method according to claim 1, wherein the at least one character set includes a numeric character set consisting of at least one digit; correspondingly, setting a code for each character set includes: using a first character to represent any character in the range 0-9, and setting the numeric character set to correspond to the first character; and/or, the at least one character set includes an alphabetic character set consisting of at least one English letter; correspondingly, setting a code for each character set includes: using a second character to represent any character in the ranges a-z and A-Z, and setting the alphabetic character set to correspond to the second character; and/or, the at least one character set includes a mixed character set composed of digits and English letters; correspondingly, setting a code for each character set includes: using a third character to represent any character in the ranges 0-9, a-z, and A-Z, and setting the mixed character set to correspond to the third character; and/or, the at least one character set includes a hexadecimal character set consisting of characters that satisfy the hexadecimal format; correspondingly, setting a code for each character set includes: using a fourth character to represent any character in the ranges 0-9, a-z, and A-Z that satisfies the hexadecimal format, and setting the hexadecimal character set to correspond to the fourth character; the at least one character set includes a license plate prefix character set; correspondingly, setting a code for each character set includes: using a fifth character to represent each Chinese character in the Chinese character prefix of the license plate number.

4. The method according to claim 3, wherein the sensitive information includes sensitive information of the ID card type; correspondingly, the feature description of the sensitive information of the ID card type includes: a string of 18 characters in length, the first 17 characters being digits, and the last character being either a digit or an X; correspondingly, the pattern strings corresponding to the sensitive information of the ID card type include: NNNNNNNNNNNNNNNNNN and NNNNNNNNNNNNNNNNNX, where N is used to represent the first character; and/or, the sensitive information includes sensitive information of the MAC address type; correspondingly, the feature description of the sensitive information of the MAC address type includes: the MAC address is a string of 17 characters in length, which contains 6 groups of 2-digit numbers in hexadecimal format, with adjacent groups separated by a colon; correspondingly, the pattern string corresponding to the sensitive information of the MAC address type includes: SS:SS:SS:SS:SS:SS, where S is used to represent the fourth character; and/or, the sensitive information includes sensitive information of the mobile phone number type; correspondingly, the feature description of the sensitive information of the mobile phone number type includes: a Chinese mobile phone number is a purely numeric string of length 11; correspondingly, the pattern string corresponding to the sensitive information of the mobile phone number type includes: NNNNNNNNNNN, where N is used to represent the first character; and/or, the sensitive information includes sensitive information of the civilian vehicle license plate number type; correspondingly, the feature description of the sensitive information of the civilian vehicle license plate number type includes: the civilian vehicle license plate number is a string whose first character is a Chinese character and whose second character is an English letter, with a length of 7 or 8 characters, wherein the 3rd to 7th or 8th characters are digits or English letters; correspondingly, the pattern strings corresponding to the sensitive information of the civilian vehicle license plate number type include: PLWWWWW and PLWWWWWW, where P is used to represent the fifth character, L is used to represent the second character, and W is used to represent the third character.

5. A device for identifying sensitive information, the device comprising: The module for obtaining the string to be recognized is configured to obtain the string to be recognized. The input module is configured to pass the string to be recognized as an input parameter to the pre-created AC automaton. The AC automaton module is configured to determine whether the string to be identified can match at least one of various pattern strings. If it can, it is determined that the string to be identified contains sensitive information. The various pattern strings correspond to various types of sensitive information. The AC automaton is created based on the various pattern strings. The device further includes: a creation module; The creation module is configured to: obtain feature descriptions of various types of sensitive information; wherein the feature descriptions include at least one of the following: the number of characters involved in the sensitive information; the type of characters involved in the sensitive information; the position of each type of character involved in the sensitive information in the sensitive information; classify each character involved in the various feature descriptions into at least one character set; set a code corresponding to each character set; wherein different character sets correspond to different codes; for each type of sensitive information, perform the following: based on the feature description of the current type of sensitive information, determine at least one current character set included in the current type of sensitive information, use the code corresponding to each current character set to mark each character in the current character set one by one, and connect each code in sequence to form a pattern string corresponding to the current type of sensitive information; create an AC automaton using the pattern strings corresponding to various types of sensitive information.

6. The apparatus according to claim 5, wherein, The creation module is configured to: create a first AC automaton using various pattern strings corresponding to various types of sensitive information; for the created first AC automaton, determine whether any two child nodes directly connected to the same root node can represent the same type of character; if they can, delete the pattern string corresponding to the path of any one of the two child nodes in the first AC automaton; input the deleted pattern string into the second AC automaton to create the second AC automaton. The AC automaton module is configured to perform the following: using multiple created AC automata to determine whether the string to be identified can match at least one of various pattern strings.

7. A computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-4.

8. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-4.