A sensitive word processing method and device, electronic equipment and storage medium
By splitting language data and matching tail units, the method optimizes the sensitive word query process. It addresses the efficiency and complexity issues of the Aho-Corasick algorithm in scenarios with small amounts of data, and achieves faster sensitive word identification and higher matching accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- RICHFIT INFORMATION TECH
- Filing Date
- 2024-12-13
- Publication Date
- 2026-06-16
AI Technical Summary
In scenarios with simple business functions and relatively small amounts of data, existing multi-pattern matching algorithms such as the Aho-Corasick algorithm do not offer significant advantages in querying, resulting in slow and complex sensitive word identification.
The query language data is split according to predetermined splitting rules, the tail language units are identified, and matching is performed in the sensitive word database to build a trie to optimize sensitive word query and storage.
It improves the speed of sensitive word identification, reduces matching complexity, and enhances query efficiency and accuracy, making it suitable for sensitive word processing needs of different task types.
Figure CN122220573A_ABST
Description
Technical Field
[0001] This application relates to the field of information processing technology, and in particular to a method, apparatus, electronic device, and storage medium for processing sensitive words.
Background Technology
[0002] In the field of sensitive word algorithms, multi-pattern matching is a key technology widely used in areas such as web censorship, text filtering, and information security. The main purpose of multi-pattern matching algorithms is to find substrings in given text that match a predefined pattern string. Here, the predefined pattern string typically includes sensitive words and their variants. To improve matching efficiency and accuracy, researchers have proposed many multi-pattern matching algorithms, with the Aho-Corasick algorithm being the mainstream approach.
[0003] The Aho-Corasick algorithm is a highly efficient string matching algorithm primarily used for pattern matching and text search. Its main advantage is its ability to process multiple patterns simultaneously, regardless of pattern length. This makes it highly efficient when handling large numbers of patterns. However, due to its algorithmic complexity, its query advantages are less pronounced when dealing with simpler business functions and smaller datasets.
Summary of the Invention
[0004] In view of this, the purpose of this application is to provide a method, apparatus, electronic device and storage medium for processing sensitive words, which can effectively improve the speed of sensitive word determination and reduce the complexity of matching in scenarios with simple business functions and small data volume.
[0005] This application provides a method for processing sensitive words, the method including:
[0006] When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried.
[0007] Iterate through each language unit to be queried in sequence, and query the pre-built sensitive word database to determine whether the language unit to be queried is a tail language unit;
[0008] If the language unit to be queried is a tail language unit, then at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried is identified.
[0009] The target language data, which begins with the first language unit to be queried and ends with that language unit, is matched with each pre-stored sensitive word found to determine whether the language data to be queried contains a sensitive word.
[0010] Optionally, the processing method further includes:
[0011] When the task type is sensitive word addition processing, the language data to be added is split according to the predetermined splitting rules to determine at least one language unit to be added;
[0012] Iterate through each language unit to be added sequentially to determine whether the new language unit exists in the pre-built sensitive word database;
[0013] If it does not exist, the newly added language unit and the language data to be added will be stored in the sensitive word database according to the preset data storage format requirements.
[0014] Optionally, storing the newly added language unit and the language data to be added in the sensitive word database according to the preset data storage format requirements includes:
[0015] The data structure information of the newly added language unit is determined based on the data structure information of the pre-stored language units recorded in the first data structure table of the sensitive word database.
[0016] Update the first data structure table using the data structure information of the newly added language unit;
[0017] Based on the data structure information of the pre-stored sensitive words recorded in the second data structure table of the sensitive word database, determine the data structure information of the language data to be added;
[0018] Update the second data structure table using the data structure information of the language data to be added.
[0019] Optionally, the processing method further includes:
[0020] Based on the second data structure table, a trie is constructed with language units as nodes, and each node of the trie is labeled with an identifier indicating whether it is a tail language unit;
[0021] When any node is labeled as a tail language unit, the node is associated with the corresponding pre-stored sensitive word.
[0022] Optionally, when it is determined that the newly added language unit exists in the pre-built sensitive word database, the processing method further includes:
[0023] Determine whether the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty;
[0024] If empty, update the parent identifier of the pre-stored language unit using the parent identifier of the newly added language unit;
[0025] If not empty, identify whether the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit;
[0026] If they are different, the data structure information of the newly added language unit will be updated in the first data structure table of the sensitive word database.
[0027] Optionally, when it is determined that the newly added language unit exists in the pre-built sensitive word database, the processing method further includes:
[0028] Identify whether the newly added language unit is a tail language unit;
[0029] If so, determine whether the language data to be added exists in the pre-stored sensitive words associated with the newly added language unit;
[0030] If it does not exist, update the data structure information of the language data to be added to the second data structure table.
[0031] Optionally, the processing method further includes:
[0032] When the task type is sensitive word deletion processing, the language data to be deleted is split according to the predetermined splitting rules to determine at least one language unit to be deleted;
[0033] Iterate through each language unit to be deleted sequentially, and query the pre-built sensitive word database to see if the language unit to be deleted has been stored.
[0034] If it has been stored, determine whether the language unit to be deleted has any other reference relationships besides those defined by the language data to be deleted;
[0035] If no such reference relationship exists, delete the language unit to be deleted, and determine whether the pre-stored sensitive words associated with the tail language unit to be deleted include the language data to be deleted.
[0036] If included, delete the language data to be deleted.
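By way of a non-limiting illustration, the deletion steps above can be sketched in Python as follows. The in-memory layout is an assumption made only for this sketch: `unit_refs` is a hypothetical map from each stored language unit to the set of sensitive words that reference it, and `tail_pool` is a hypothetical map from each tail language unit to the sensitive words ending with it. One character is treated as one language unit.

```python
def delete_word(unit_refs, tail_pool, word):
    """Delete one sensitive word, keeping units that other words still reference."""
    units = list(word)  # assumed splitting rule: one character per unit
    for unit in units:
        refs = unit_refs.get(unit)
        if refs is None:
            continue  # the unit is not stored, nothing to do
        refs.discard(word)  # drop the reference held by the word being deleted
        if not refs:
            del unit_refs[unit]  # no other reference relationship remains
    if units:
        pool = tail_pool.get(units[-1], set())
        if word in pool:
            pool.discard(word)  # remove the word from its tail unit's pool


# Hypothetical database holding "she" and "he".
unit_refs = {"s": {"she"}, "h": {"she", "he"}, "e": {"she", "he"}}
tail_pool = {"e": {"she", "he"}}
delete_word(unit_refs, tail_pool, "she")
print(sorted(unit_refs))  # ['e', 'h']
```

In this sketch the unit "s" is removed because only "she" referenced it, while "h" and "e" are retained because "he" still references them.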
[0037] This application embodiment also provides a sensitive word processing device, the processing device including a query module, the query module being used for:
[0038] When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried.
[0039] Iterate through each language unit to be queried in sequence, and query the pre-built sensitive word database to determine whether the language unit to be queried is a tail language unit;
[0040] If the language unit to be queried is a tail language unit, then at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried is identified.
[0041] The target language data, which begins with the first language unit to be queried and ends with that language unit, is matched with each pre-stored sensitive word found to determine whether the language data to be queried contains a sensitive word.
[0042] Optionally, the processing device includes an addition module, the addition module being used for:
[0043] When the task type is sensitive word addition processing, the language data to be added is split according to the predetermined splitting rules to determine at least one language unit to be added;
[0044] Iterate through each language unit to be added sequentially to determine whether the new language unit exists in the pre-built sensitive word database;
[0045] If it does not exist, the newly added language unit and the language data to be added will be stored in the sensitive word database according to the preset data storage format requirements.
[0046] Optionally, when the addition module is used to store the newly added language unit and the language data to be added in the sensitive word database according to the preset data storage format requirements, the addition module is used to:
[0047] The data structure information of the newly added language unit is determined based on the data structure information of the pre-stored language units recorded in the first data structure table of the sensitive word database.
[0048] Update the first data structure table using the data structure information of the newly added language unit;
[0049] Based on the data structure information of the pre-stored sensitive words recorded in the second data structure table of the sensitive word database, determine the data structure information of the language data to be added;
[0050] Update the second data structure table using the data structure information of the language data to be added.
[0051] Optionally, the addition module is also used for:
[0052] Based on the second data structure table, a trie is constructed with language units as nodes, and each node of the trie is labeled with an identifier indicating whether it is a tail language unit;
[0053] When any node is labeled as a tail language unit, the node is associated with the corresponding pre-stored sensitive word.
[0054] Optionally, the addition module is also used for:
[0055] When it is determined that the newly added language unit exists in the pre-built sensitive word database, it is determined whether the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty;
[0056] If empty, update the parent identifier of the pre-stored language unit using the parent identifier of the newly added language unit;
[0057] If not empty, identify whether the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit;
[0058] If they are different, the data structure information of the newly added language unit will be updated in the first data structure table of the sensitive word database.
[0059] Optionally, the addition module is also used for:
[0060] When it is determined that the newly added language unit exists in the pre-built sensitive word database, identify whether the newly added language unit is a tail language unit;
[0061] If so, determine whether the language data to be added exists in the pre-stored sensitive words associated with the newly added language unit;
[0062] If it does not exist, update the data structure information of the language data to be added to the second data structure table.
[0063] Optionally, the processing device includes a deletion module, which is used for:
[0064] When the task type is sensitive word deletion processing, the language data to be deleted is split according to the predetermined splitting rules to determine at least one language unit to be deleted;
[0065] Iterate through each language unit to be deleted sequentially, and query the pre-built sensitive word database to see if the language unit to be deleted has been stored.
[0066] If it has been stored, determine whether the language unit to be deleted has any other reference relationships besides those defined by the language data to be deleted;
[0067] If no such reference relationship exists, delete the language unit to be deleted, and determine whether the pre-stored sensitive words associated with the tail language unit to be deleted include the language data to be deleted.
[0068] If included, delete the language data to be deleted.
[0069] This application embodiment also provides an electronic device, including: a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, the steps of the processing method described above are performed.
[0070] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the processing method described above.
[0071] This application provides a method, apparatus, electronic device, and storage medium for processing sensitive words, comprising: when the task type is sensitive word query processing, splitting the language data to be queried according to a predetermined splitting rule to determine at least one language unit to be queried; sequentially traversing each language unit to be queried, querying from a pre-built sensitive word database whether the language unit to be queried is a tail language unit; if so, determining at least one pre-stored sensitive word stored in the sensitive word database that ends with the language unit to be queried; and performing matching processing with each pre-stored sensitive word queried using the target language data formed from the first language unit to be queried to the end of the language unit to be queried, to determine whether the language data to be queried includes a sensitive word.
[0072] In this way, this solution splits the language data to be queried according to predetermined splitting rules, which reduces the burden of processing large blocks of text at once, making sensitive word queries more efficient and improving the overall response speed. By adopting a query method based on tail language units, unnecessary matching operations can be reduced, saving computing resources, lowering the computational complexity of sensitive word detection, and improving the speed of sensitive word identification. Furthermore, when a tail language unit is found, further matching against the complete sensitive words associated with it locates the sensitive word precisely, which helps improve matching accuracy. In addition, this method is applicable to different task types. By flexibly adjusting the splitting rules and matching process, it can adapt to the sensitive word query needs of various language data and is suitable for sensitive word filtering or detection work in different scenarios.
[0073] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Attached Figure Description
[0074] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0075] Figure 1 is a flowchart illustrating a method for processing sensitive words provided in an embodiment of this application;
[0076] Figure 2 is a schematic diagram illustrating the principle of the sensitive word query process provided in this application;
[0077] Figure 3 is a schematic diagram illustrating the principle of sensitive word search using the Aho-Corasick algorithm;
[0078] Figure 4 is a schematic diagram of the failure function;
[0079] Figure 5 is a schematic diagram of the structure of a sensitive word processing device provided in an embodiment of this application;
[0080] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0081] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. Based on the embodiments of this application, every other embodiment obtained by those skilled in the art without inventive effort falls within the scope of protection of this application.
[0082] In the field of sensitive word algorithms, multi-pattern matching is a key technology widely used in areas such as web censorship, text filtering, and information security. The main purpose of multi-pattern matching algorithms is to find substrings in given text that match a predefined pattern string. Here, the predefined pattern string typically includes sensitive words and their variants. To improve matching efficiency and accuracy, researchers have proposed many multi-pattern matching algorithms, with the Aho-Corasick algorithm being the mainstream approach.
[0083] The Aho-Corasick algorithm is a highly efficient string matching algorithm primarily used for pattern matching and text search. Its main advantage is its ability to process multiple patterns simultaneously, regardless of pattern length. This makes it highly efficient when handling large numbers of patterns. However, due to its algorithmic complexity, its query advantages are less pronounced when dealing with simpler business functions and smaller datasets.
[0084] Based on this, embodiments of this application provide a method, apparatus, electronic device, and storage medium for processing sensitive words, which can effectively improve the speed of sensitive word determination and reduce matching complexity.
[0085] Please refer to Figure 1, which is a flowchart illustrating a method for processing sensitive words provided in an embodiment of this application. As shown in Figure 1, the processing method provided in the embodiments of this application includes:
[0086] S101. When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried.
[0087] S102. Iterate through each language unit to be queried in sequence, and query the pre-built sensitive word database to determine whether the language unit to be queried is a tail language unit.
[0088] S103. If the language unit to be queried is a tail language unit, determine at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried.
[0089] S104. The target language data formed from the first language unit to be queried to the end of the language unit to be queried is matched with each pre-stored sensitive word found to determine whether the language data to be queried includes a sensitive word.
[0090] For step S101, when the task type is sensitive word query processing, the language data to be queried is split according to the pre-determined splitting rules, and it is divided into one or more language units to be queried.
[0091] Here, the language unit can be a letter, character, morpheme, word, phrase, or other language structure.
[0092] The specific splitting rules can be set according to the data storage format of the pre-constructed sensitive word database. The language units are consistent with the storage units in the sensitive word database.
[0093] For example, assuming the language data to be queried is "ushers", the language units to be queried, determined according to the predetermined splitting rules, include: "u", "s", "h", "e", "r", and "s".
[0094] For step S102, the language units to be queried are traversed sequentially, and the sensitive word database is consulted to determine whether the current language unit is the "tail" of a sensitive word, that is, whether it is the last unit of a complete sensitive word.
[0095] Continuing with the example above, the sensitive words pre-stored in the sensitive word database include: she, he, his, and hers. When "h" is traversed, it is determined not to be a tail language unit; when "e" is traversed, it is determined to be a tail language unit.
[0096] Regarding step S103, when it is determined that the current language unit is the end of a certain sensitive word, the system will extract all sensitive words ending with that language unit from the sensitive word database and perform a matching operation.
[0097] Continuing with the example above, when traversing to 'e', it is determined to be a tail language unit, and the pre-stored sensitive words ending with "e" in the sensitive word database are identified as 'she' and 'he'.
[0098] In step S104, starting from the first language unit to be queried and continuing to the last language unit, a complete target language segment is formed and matched against each pre-stored sensitive word found. If the target language segment matches a sensitive word, it indicates that the sensitive word is contained.
[0099] Continuing with the example above, when traversing to 'e', the target language data is determined to be "ushe". Using "ushe" to match with "she" and "he", it is determined that "ushe" includes "she" and "he". Therefore, the language data to be queried includes the sensitive words "she" and "he".
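By way of a non-limiting illustration, steps S101 to S104 and the "ushers" example can be sketched in Python as follows. The function names, the in-memory dictionary standing in for the sensitive word database, and the one-character splitting rule are assumptions made only for this sketch. Because every candidate word ends with the current unit, the match against the target language data is implemented here as a suffix comparison.

```python
from collections import defaultdict


def split_units(text):
    """S101: split language data into language units (assumed: one character each)."""
    return list(text)


def build_tail_index(sensitive_words):
    """Map each tail language unit to the sensitive words that end with it."""
    index = defaultdict(list)
    for word in sensitive_words:
        units = split_units(word)
        if units:
            index[units[-1]].append(word)
    return index


def find_sensitive_words(text, tail_index):
    """Return the sensitive words found in text, in order of first detection."""
    found = []
    units = split_units(text)
    for pos, unit in enumerate(units):
        candidates = tail_index.get(unit)
        if not candidates:
            continue  # S102: not a tail language unit, move to the next unit
        target = "".join(units[: pos + 1])  # S104: first unit up to the current unit
        for word in candidates:  # S103: pre-stored words ending with this unit
            if target.endswith(word) and word not in found:
                found.append(word)
    return found


tail_index = build_tail_index(["she", "he", "his", "hers"])
print(find_sensitive_words("ushers", tail_index))  # ['she', 'he', 'hers']
```

At "e" the target language data is "ushe", which yields "she" and "he"; at the final "s" the target is "ushers", which yields "hers".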
[0100] In addition, please refer to Figure 2, which is a schematic diagram illustrating the principle of the sensitive word query process provided in this application. As shown in Figure 2, the sensitive words are "she", "he", "his", and "hers", and the language data to be queried is "ushers". The processing procedure of this application is as follows. Let g(x) = X, for example g(s) = S; the path from S to H can then be represented by g(s) + g(h). In the diagram, the nodes "s" and "e" are sensitive word storage pools, each storing all sensitive words that end with that unit: node E stores HE and SHE, and node S stores HIS and HERS. "ushers" is then decomposed unit by unit. First, according to the principle g(x) = X, g(u) = U does not exist, so the query proceeds to the next unit. Next, g(s) = S exists, so the initial position is found; since no matching sensitive word exists in the pool, the output is empty. The query then moves to H, whose path consists of g(s) + g(h). H is not a sensitive word storage pool, so the query continues to E. The path of E consists of g(s) + g(h) + g(e); E is a sensitive word storage pool, and the output is HE and SHE. Following this pattern, the query eventually reaches the final S, whose path consists of g(s) + g(h) + g(e) + g(r) + g(s). The sensitive word that can be matched is HERS, so the final output is she, he, hers.
[0101] In addition, please refer to Figure 3, which is a schematic diagram illustrating the principle of sensitive word search using the Aho-Corasick algorithm. As shown in Figure 3, taking "ushers" as an example, the Aho-Corasick algorithm proceeds as follows. Let g(pre, x) = next, meaning that state pre transitions to state next on input character x, for example g(1, e) = 2 and g(1, a) = ?. Clearly, g(1, e) has a next transition, while g(1, a) does not. A missing transition is handled by the failure function. The failure function is the core of the AC algorithm; it defines the transition between states when a match fails. That is, when a failure occurs, the algorithm automatically switches to another state, based on one pattern containing a prefix of another pattern. Please refer to Figure 4, which is a schematic diagram of the failure function. As shown in Figure 4, pattern string 1 contains a prefix of pattern string 2. When matching reaches position 'a', the match against pattern string 1 fails, and the process jumps to pattern string 2 and continues matching instead of returning to the initial state (state 0) and starting over. The failure state can be derived recursively. For example, g(5, r) = fail, so g(5, r) = g(f(5), r), where f(5) = g(f(4), e) = g(g(f(3), h), e). State 3 corresponds to the character 's'; since no prefix of another pattern applies, it fails to state 0, i.e., f(3) = 0. Hence f(5) = g(f(4), e) = g(g(f(3), h), e) = g(g(0, h), e) = g(1, e) = 2, and g(5, r) = g(f(5), r) = g(2, r) = 8. The resulting failure function is f(1)=0, f(2)=0, f(3)=0, f(4)=1, f(5)=2, f(6)=0, f(7)=3, f(8)=0, f(9)=3. Proceeding in this way, the final execution results are she, he, hers.
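For comparison, the following is a minimal Python sketch of a textbook Aho-Corasick automaton, not code taken from this application. It assumes the patterns are inserted in the order he, she, his, hers, which yields the state numbering used in the derivation above (0-h-1-e-2-r-8-s-9, 0-s-3-h-4-e-5, 1-i-6-s-7); the function names are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto table, failure function and outputs for Aho-Corasick."""
    goto = [{}]       # goto[state][char] -> next state; state 0 is the root
    output = [set()]  # output[state] -> patterns recognized at that state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pattern)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # breadth-first, so fail[state] is known before its children
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]  # inherit matches ending here
    return goto, fail, output


def search(text, goto, fail, output):
    """Return the matched patterns in order of first detection."""
    found, state = [], 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]  # follow failure links on a missing transition
        state = goto[state].get(ch, 0)
        for word in sorted(output[state], key=len, reverse=True):
            if word not in found:
                found.append(word)
    return found


goto, fail, output = build_automaton(["he", "she", "his", "hers"])
print(fail[1:])                               # [0, 0, 0, 1, 2, 0, 3, 0, 3]
print(search("ushers", goto, fail, output))   # ['she', 'he', 'hers']
```

The printed failure values correspond to f(1) through f(9) above, and the transition g(5, r) is resolved through f(5) = 2 to state 8.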
[0102] Therefore, based on the processing examples of this scheme and the Aho-Corasick algorithm, it can be seen that this scheme can identify sensitive words included in the language data to be queried more quickly than the Aho-Corasick algorithm.
[0103] In another embodiment provided in this application, the processing method further includes:
[0104] S201. When the task type is sensitive word addition processing, the language data to be added is split according to the predetermined splitting rules to determine at least one language unit to be added.
[0105] S202. Iterate through each language unit to be added in sequence and determine whether the new language unit exists in the pre-built sensitive word database.
[0106] S203. If it does not exist, store the newly added language unit and the language data to be added in the sensitive word database according to the preset data storage format requirements.
[0107] For the description of S201, refer to the description of S101; the same technical effect is achieved, so it is not repeated here.
[0108] In step S202, each language unit to be added is sequentially traversed, and it is checked whether these language units already exist in the sensitive word database. If they do not exist, step S203 is executed, and the new language unit and the complete set of sensitive words to be added are stored in the database according to the preset data storage format. This ensures the integrity of the sensitive word database and avoids duplicate storage.
[0109] Regarding step S203, in one embodiment provided in this application, storing the newly added language unit and the language data to be added in the sensitive word database according to preset data storage format requirements includes:
[0110] S2031. Determine the data structure information of the newly added language unit based on the data structure information of the pre-stored language units recorded in the first data structure table of the sensitive word database.
[0111] S2032. Update the first data structure table using the data structure information of the newly added language unit.
[0112] S2033. Based on the data structure information of the pre-stored sensitive words recorded in the second data structure table of the sensitive word database, determine the data structure information of the language data to be added.
[0113] S2034. Update the second data structure table using the data structure information of the language data to be added.
[0114] Here, the data structure information of the newly added language unit determined according to the first data structure table includes the following information: its own identifier (ID, which is characterized by being incremental and non-repeating), sensitive unit (current language unit), parent identifier (PID, parent language unit ID), and parent language unit.
[0115] The data structure information of the language data to be added, as determined by the second data structure table, includes the following information: the identifier itself (ID, which is characterized by being incremental and non-repeating), sensitive data (current language data), and the tail language unit identifier (i.e., PID, parent language unit ID).
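By way of a non-limiting illustration, the rows of the two data structure tables can be sketched in Python as follows. The class names, field names, and the helper `rows_for_word` are hypothetical; the sketch also assumes one character per language unit, and that each unit's parent is the unit immediately preceding it in the word. Duplicate handling is deliberately left out.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class UnitRow:
    """One row of the first data structure table (language units)."""
    id: int                     # own identifier, incremental and non-repeating
    unit: str                   # sensitive unit (current language unit)
    pid: Optional[int]          # parent identifier (ID of the parent language unit)
    parent_unit: Optional[str]  # parent language unit


@dataclass
class WordRow:
    """One row of the second data structure table (sensitive words)."""
    id: int    # own identifier, incremental and non-repeating
    word: str  # sensitive data (current language data)
    pid: int   # tail language unit identifier


def rows_for_word(word, unit_ids, word_ids):
    """Build the rows for one non-empty sensitive word."""
    unit_rows, parent = [], None
    for ch in word:
        row = UnitRow(next(unit_ids), ch,
                      parent.id if parent else None,
                      parent.unit if parent else None)
        unit_rows.append(row)
        parent = row
    return unit_rows, WordRow(next(word_ids), word, unit_rows[-1].id)


unit_rows, word_row = rows_for_word("she", count(1), count(1))
for row in unit_rows:
    print(row)
print(word_row)
```

Under these assumptions, "she" produces three unit rows chained by their parent identifiers, and one word row whose identifier field points at the row for the tail unit "e".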
[0116] In another embodiment provided in this application, the processing method further includes:
[0117] Based on the second data structure table, a trie is constructed with language units as nodes, and each node of the trie is labeled with an identifier indicating whether it is a tail language unit; when a node is labeled as a tail language unit, the node is associated with the corresponding pre-stored sensitive word.
[0118] For the structure of the constructed trie, refer to Figure 2.
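By way of a non-limiting illustration, a trie with tail language unit labels can be sketched in Python as follows. This is a conventional prefix tree; the class and attribute names are hypothetical, one character is treated as one language unit, and the exact node layout of Figure 2 may differ from this sketch.

```python
class TrieNode:
    """A trie node holding one language unit."""

    def __init__(self, unit=None):
        self.unit = unit
        self.children = {}    # next language unit -> child node
        self.is_tail = False  # tail language unit identifier
        self.words = []       # sensitive words associated with a tail node


def build_trie(words):
    """Build a trie from the pre-stored sensitive words."""
    root = TrieNode()
    for word in words:
        node = root
        for unit in word:
            if unit not in node.children:
                node.children[unit] = TrieNode(unit)
            node = node.children[unit]
        node.is_tail = True      # label the last unit as a tail language unit
        node.words.append(word)  # associate the node with its sensitive word
    return root


root = build_trie(["she", "he", "his", "hers"])
he_node = root.children["h"].children["e"]
print(he_node.is_tail, he_node.words)  # True ['he']
```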
[0119] Furthermore, in another embodiment provided in this application, when it is determined that the newly added language unit exists in the pre-constructed sensitive word database, the processing method further includes:
[0120] S204. Determine whether the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty.
[0121] S205. If empty, update the parent identifier of the pre-stored language unit using the parent identifier of the newly added language unit.
[0122] S206. If not empty, identify whether the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit.
[0123] S207. If they are not the same, the data structure information of the newly added language unit is updated in the first data structure table of the sensitive word database.
[0124] Regarding step S204, if it is determined that the newly added language unit exists in the pre-constructed sensitive word database after step S202 is completed, then step S204 is executed.
[0125] The newly added language unit is the same as the corresponding pre-stored language unit.
[0126] If it is determined that the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty, proceed to step S205; if it is determined that the parent identifier of the pre-stored language unit corresponding to the newly added language unit is not empty, proceed to step S206.
[0127] For step S205, the parent identifier of the newly added language unit is inserted at the parent identifier of the pre-stored language unit.
[0128] For step S206, if the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit, then no re-insertion is required. If it is determined that the parent identifier of the newly added language unit is different from the parent identifier of the pre-stored language unit, then step S207 is executed.
[0129] In step S207, the data structure information of the newly added language unit is inserted into the first data structure table of the sensitive word database, thereby establishing an index and query conditions.
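Steps S204 to S207 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation described in this application: the `UnitRow` record, the in-memory list standing in for the first data structure table, and the function name `merge_existing_unit` are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnitRow:
    """One row of the first data structure table (hypothetical in-memory form)."""
    id: int
    unit: str
    pid: Optional[int]          # parent identifier, None when empty
    parent_unit: Optional[str]  # parent language unit, None when empty


def merge_existing_unit(table: List[UnitRow], stored: UnitRow,
                        new_pid: Optional[int], new_parent: Optional[str]) -> str:
    """Apply S204-S207 to a language unit that already exists in the table."""
    # S204/S205: the stored parent identifier is empty, so fill it in.
    if stored.pid is None:
        stored.pid = new_pid
        stored.parent_unit = new_parent
        return "updated"
    # S206: same parent identifier, so no re-insertion is required.
    if stored.pid == new_pid:
        return "unchanged"
    # S207: different parent identifier, so a further row is recorded.
    # Reusing the stored ID is an assumption made for this sketch.
    table.append(UnitRow(stored.id, stored.unit, new_pid, new_parent))
    return "inserted"
```

The return value is there only to make the three branches visible when experimenting; a real store would issue an update or insert against the database instead.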
[0130] After performing step S202, if it is determined that the newly added language unit exists in the pre-constructed sensitive word database, in another embodiment provided in this application, the processing method further includes:
[0131] S208. Identify whether the newly added language unit is a tail language unit.
[0132] S209. If yes, determine whether the language data to be added exists in the pre-stored sensitive words associated with the newly added language unit.
[0133] S210. If it does not exist, update the data structure information of the language data to be added to the second data structure table.
[0134] For step S208, if it is determined that the newly added language unit is a tail language unit, proceed to step S209; otherwise, end the processing flow for the newly added language unit.
[0135] Regarding step S209, if it is determined that the language data to be added does not exist in the pre-stored sensitive words associated with the newly added language unit, then proceed to step S210. Otherwise, no processing is required.
[0136] In step S210, the data structure information of the language data to be added is inserted into the second data structure table, thereby establishing an index and query conditions.
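Steps S208 to S210 can be sketched in the same spirit. The sketch assumes that the splitting rule yields one character per language unit and that the second data structure table is a plain list of `(id, word, tail_unit_id)` tuples; the function name `add_word_if_tail` and its parameters are hypothetical.

```python
from typing import List, Tuple

# Hypothetical in-memory second data structure table:
# (own ID, sensitive word, ID of the tail language unit)
WordRow = Tuple[int, str, int]


def add_word_if_tail(second_table: List[WordRow], word: str, position: int,
                     tail_unit_id: int, next_id: int) -> bool:
    """Apply S208-S210 to the unit at `position` of `word`.

    Returns True when a new row was added to the second table.
    """
    # S208: only the last unit of the word is a tail language unit.
    if position != len(word) - 1:
        return False
    # S209: collect the sensitive words already associated with this tail unit.
    associated = {w for _, w, pid in second_table if pid == tail_unit_id}
    if word in associated:
        return False
    # S210: record the word against its tail language unit.
    second_table.append((next_id, word, tail_unit_id))
    return True
```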
[0137] In addition, before a sensitive word addition is processed, the identity of the person performing the addition must be verified. For example, the JWT token is decoded and its signature is checked using the HS256 algorithm and the built-in signing key, which yields the information of the currently logged-in user (the person performing the addition); this information is then verified.
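The token check can be illustrated with the Python standard library. HS256 is an HMAC-SHA256 signature scheme, so the token is verified and decoded rather than decrypted in the cryptographic sense. The sketch below is an assumption-laden illustration: the function names, the secret, and the payload fields are hypothetical, `sign_hs256` exists only so that the example is self-contained, and claims such as expiry are not checked.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 token (helper for the example only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def verify_hs256(token: str, secret: bytes) -> dict:
    """Check the HS256 signature and return the payload of the token."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # compare_digest avoids leaking timing information about the signature.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

In practice a maintained JWT library would normally be preferred over hand-written verification; the sketch only shows what the HS256 check involves.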
[0138] The process of adding sensitive words to the database is illustrated by the following example, in which sensitive words are entered into an Elasticsearch (ES) database (assuming the sensitive word database is initially empty). The newly added sensitive words are she / he / his / hers. The data structure information in the first data structure table includes the unit's own identifier (ID, incrementing and non-repeating), the sensitive unit (current language unit), the parent identifier (PID, the ID of the parent language unit), and the parent language unit. The sensitive words are iterated through, and for each language unit the data is queried based on the index and query conditions. If no result is found, ES writes the data to the corresponding primary shard (the first data structure table) and automatically creates an index for it. If a result is found, there are two cases: ① if the PID of the queried record is empty, it is updated to the PID of the current data; ② if the PID of the queried record is not empty, it is compared with the PID of the current data. If they are the same, no re-insertion is needed; otherwise, the current data is inserted into the database again. The data in the first data structure table evolves as follows:
[0139] {(1, s, null, null)},
[0140] {(1, s, null, null), (2, h, 1, s)},
[0141] {(1, s, null, null), (2, h, 1, s), (3, e, 2, h)},
[0142] {(1, s, null, null), (2, h, 1, s), (3, e, 2, h), (4, i, 2, h)},
[0143] {(1, s, null, null), (2, h, 1, s), (3, e, 2, h), (4, i, 2, h), (5, r, 3, e)},
{(1, s, 4, i), (2, h, 1, s), (3, e, 2, h), (4, i, 2, h), (5, r, 3, e), (1, s, 5, r)}.
[0144] Please refer to Table 1, which is the first data structure table.
[0145] Table 1:
[0146]
| ID | Sensitive unit | PID | Parent language unit |
| --- | --- | --- | --- |
| 1 | S | 4 | I |
| 2 | H | 1 | S |
| 3 | E | 2 | H |
| 4 | I | 2 | H |
| 5 | R | 3 | E |
| 1 | S | 5 | R |
[0147] The data structure information in the second data structure table includes the word's own identifier (ID, incrementing and non-repeating), the sensitive data (current language data), and the tail language unit identifier (i.e., PID, parent language unit ID). While the sensitive words are iterated through, the `string.length` method is used to determine whether the newly added language unit is a tail language unit. If it is not, processing of the current unit ends and the next iteration begins. If it is a tail language unit, the data is queried based on the index and query conditions. If no result is found, Elasticsearch writes the data to the corresponding primary shard (the second data structure table) and automatically creates an index for it. If a result is found, it is checked whether the current sensitive word already exists in the second data structure table; if not, it is added; otherwise, no further processing is needed. The data in the second data structure table evolves as follows: {(10, she, 3)}, {(10, she, 3), (11, he, 3)}, {(10, she, 3), (11, he, 3), (12, his, 1)}, {(10, she, 3), (11, he, 3), (12, his, 1), (13, hers, 1)}. The final data structure is shown in the table below:
[0148] Please refer to Table 2, which is the second data structure table.
[0149] Table 2:
[0150]
| ID | Sensitive word | PID |
| --- | --- | --- |
| 10 | SHE | 3 |
| 11 | HE | 3 |
| 12 | HIS | 1 |
| 13 | HERS | 1 |
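To show how the two tables relate, the final rows of Table 1 and Table 2 can be loaded into memory and joined on the tail unit identifier. This is only an illustrative sketch: the tuple layout, the constant names, and the function `words_by_tail_unit` are assumptions, and a real deployment would query Elasticsearch instead.

```python
from collections import defaultdict

# Final rows of the first data structure table: (ID, unit, PID, parent unit)
FIRST_TABLE = [
    (1, "s", 4, "i"), (2, "h", 1, "s"), (3, "e", 2, "h"),
    (4, "i", 2, "h"), (5, "r", 3, "e"), (1, "s", 5, "r"),
]

# Final rows of the second data structure table: (ID, word, PID of the tail unit)
SECOND_TABLE = [(10, "she", 3), (11, "he", 3), (12, "his", 1), (13, "hers", 1)]


def words_by_tail_unit(first_table, second_table):
    """Group the sensitive words by the language unit they end with."""
    unit_of_id = {row_id: unit for row_id, unit, _, _ in first_table}
    index = defaultdict(list)
    for _, word, tail_id in second_table:
        index[unit_of_id[tail_id]].append(word)
    return dict(index)


print(words_by_tail_unit(FIRST_TABLE, SECOND_TABLE))
# {'e': ['she', 'he'], 's': ['his', 'hers']}
```

Grouping by tail unit is what lets a query consult only the words that can end at the current position.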
[0151] In another embodiment provided in this application, the processing method further includes:
[0152] S301. When the task type is sensitive word deletion processing, the language data to be deleted is split according to the predetermined splitting rules to determine at least one language unit to be deleted.
[0153] S302. Iterate through each language unit to be deleted in sequence and query whether the language unit to be deleted has been stored in the pre-built sensitive word database.
[0154] S304. If it has been stored, determine whether the language unit to be deleted has any other reference relationships besides the reference relationships defined by the language data to be deleted.
[0155] S305. If no other reference relationship exists, delete the language unit to be deleted, and determine whether the pre-stored sensitive words associated with the tail language unit to be deleted include the language data to be deleted.
[0156] S306. If included, delete the language data to be deleted.
[0157] As an example, the process of deleting sensitive words is explained below with reference to Figure 2. Taking the "ushers" example, the stored sensitive words are "she / he / his / hers," and the language data to be deleted is "his." The process iterates through "his," first checking whether "h" exists in the database. If it does, it checks whether the language unit to be deleted has other references: if so, the unit remains unchanged; otherwise, it is deleted. Next, it checks whether the pre-stored sensitive words associated with the tail language unit include the language data to be deleted; if they do, the language data to be deleted is deleted. Here, "h" has references to "s," "e," and "i," so "h" remains unchanged. "i" has no other references, so it is deleted, and the "his" associated with the tail unit "s" is deleted.
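The deletion flow can be sketched as follows. The sketch simplifies the reference relationship to "which stored sensitive words use this unit", which is an assumption made for illustration rather than the storage model of this application; the helper names `build_index` and `delete_sensitive_word` are likewise hypothetical, and one character per language unit is assumed.

```python
from typing import Dict, Iterable, Set, Tuple


def build_index(words: Iterable[str]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build unit -> referencing words, and tail unit -> words ending with it."""
    unit_refs: Dict[str, Set[str]] = {}
    words_by_tail: Dict[str, Set[str]] = {}
    for word in words:
        for unit in word:
            unit_refs.setdefault(unit, set()).add(word)
        words_by_tail.setdefault(word[-1], set()).add(word)
    return unit_refs, words_by_tail


def delete_sensitive_word(unit_refs: Dict[str, Set[str]],
                          words_by_tail: Dict[str, Set[str]], word: str) -> None:
    """Remove `word`, keeping every unit that another word still references."""
    if not word:
        return
    for unit in word:
        refs = unit_refs.get(unit)   # S302: is the unit stored at all?
        if refs is None:
            continue
        refs.discard(word)           # S304: ignore the references of `word` itself
        if not refs:                 # S305: no other references, delete the unit
            del unit_refs[unit]
    tail_words = words_by_tail.get(word[-1])
    if tail_words is not None:
        tail_words.discard(word)     # S306: drop the word from its tail unit
        if not tail_words:
            del words_by_tail[word[-1]]
```

With the words she / he / his / hers stored, deleting "his" removes the unit "i", keeps "h" and "s", and leaves only "hers" associated with the tail unit "s".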
[0158] This approach makes sensitive word processing more efficient, improving overall response speed. It also saves computational resources and increases the speed and accuracy of sensitive word identification. Furthermore, this method is applicable to different task types; by flexibly adjusting the splitting rules and matching process, it can adapt to the sensitive word processing needs of various language data, making it suitable for sensitive word filtering or detection in different scenarios.
[0159] Based on the same inventive concept, this application also provides a processing device corresponding to the processing method. Since the principle of the device in this application to solve the problem is similar to the processing method described above in this application, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.
[0160] Please refer to Figure 5, which is a schematic diagram of a sensitive word processing device provided in an embodiment of this application. As shown in Figure 5, the processing device 500 includes a query module 510, which is used for:
[0161] When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried.
[0162] Iterate through each language unit to be queried in sequence, and query the pre-built sensitive word database to check whether the language unit to be queried is a tail language unit;
[0163] If the language unit to be queried is a tail language unit, then at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried is identified.
[0164] The target language data, which begins with the first language unit to be queried and ends with that language unit, is matched with each pre-stored sensitive word found to determine whether the language data to be queried contains a sensitive word.
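The query flow handled by the query module can be sketched as below. The sketch assumes one character per language unit and interprets "matching" as checking whether the target language data ends with a candidate sensitive word; both points, along with the function name `find_sensitive_words` and the dictionary layout, are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def find_sensitive_words(text: str,
                         words_by_tail: Dict[str, Set[str]]) -> List[Tuple[int, str]]:
    """Return (start index, word) for every sensitive word found in `text`."""
    hits: List[Tuple[int, str]] = []
    for end, unit in enumerate(text):
        # Is the current unit a tail language unit of any stored word?
        candidates = words_by_tail.get(unit)
        if not candidates:
            continue
        # Target language data: from the first unit up to the current unit.
        target = text[: end + 1]
        for word in sorted(candidates):
            if target.endswith(word):
                hits.append((end - len(word) + 1, word))
    return hits


index = {"e": {"she", "he"}, "s": {"his", "hers"}}
print(find_sensitive_words("ushers", index))
# [(2, 'he'), (1, 'she'), (2, 'hers')]
```

Only positions whose unit is a tail language unit trigger any comparison, which is where the approach saves work on short inputs.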
[0165] Optionally, the processing device 500 includes an adding module 520, which is used for:
[0166] When the task type is sensitive word addition processing, the language data to be added is split according to the predetermined splitting rules to determine at least one language unit to be added;
[0167] Iterate through each language unit to be added sequentially to determine whether the new language unit exists in the pre-built sensitive word database;
[0168] If it does not exist, the newly added language unit and the language data to be added will be stored in the sensitive word database according to the preset data storage format requirements.
[0169] Optionally, when the adding module 520 is used to store the newly added language unit and the language data to be added in the sensitive word database according to the preset data storage format requirements, the adding module 520 is used to:
[0170] The data structure information of the newly added language unit is determined based on the data structure information of the pre-stored language units recorded in the first data structure table of the sensitive word database.
[0171] Update the first data structure table using the data structure information of the newly added language unit;
[0172] Based on the data structure information of the pre-stored sensitive words recorded in the second data structure table of the sensitive word database, determine the data structure information of the language data to be added;
[0173] Update the second data structure table using the data structure information of the language data to be added.
[0174] Optionally, the adding module 520 is further used for:
[0175] Based on the second data structure table, a trie is constructed with language units as nodes, and each node of the trie is labeled with whether it is a tail language unit identifier;
[0176] When any node is identified as a tail language unit identifier, the node is associated with the corresponding pre-stored sensitive word.
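A trie of this kind can be sketched as follows. The sketch builds a conventional prefix trie directly from a list of sensitive words, marks each node that ends a word as a tail node, and attaches the word to it. The class `TrieNode`, the function `build_trie`, and the choice of one character per language unit are assumptions for illustration; the application derives the trie from the second data structure table.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """A trie node holding one language unit's children and tail information."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_tail: bool = False   # tail language unit marker
        self.words: List[str] = []   # sensitive words ending at this node


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every word unit by unit and label the final node as a tail."""
    root = TrieNode()
    for word in words:
        node = root
        for unit in word:
            node = node.children.setdefault(unit, TrieNode())
        node.is_tail = True
        node.words.append(word)
    return root


root = build_trie(["she", "he", "his", "hers"])
he_node = root.children["h"].children["e"]
print(he_node.is_tail, he_node.words)  # True ['he']
```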
[0177] Optionally, the adding module 520 is further used for:
[0178] When it is determined that the newly added language unit exists in the pre-built sensitive word database, it is determined whether the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty;
[0179] If empty, update the parent identifier of the pre-stored language unit using the parent identifier of the newly added language unit;
[0180] If not empty, identify whether the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit;
[0181] If they are different, the data structure information of the newly added language unit will be updated in the first data structure table of the sensitive word database.
[0182] Optionally, the adding module 520 is further used for:
[0183] When it is determined that the newly added language unit exists in the pre-built sensitive word database, identify whether the newly added language unit is a tail language unit;
[0184] If so, determine whether the language data to be added exists in the pre-stored sensitive words associated with the newly added language unit;
[0185] If it does not exist, update the data structure information of the language data to be added to the second data structure table.
[0186] Optionally, the processing device 500 includes a deletion module 530, which is used for:
[0187] When the task type is sensitive word deletion processing, the language data to be deleted is split according to the predetermined splitting rules to determine at least one language unit to be deleted;
[0188] Iterate through each language unit to be deleted sequentially, and query the pre-built sensitive word database to see if the language unit to be deleted has been stored.
[0189] If it has been stored, determine whether the language unit to be deleted has any other reference relationships besides those defined by the language data to be deleted;
[0190] If no other reference relationship exists, delete the language unit to be deleted, and determine whether the pre-stored sensitive words associated with the tail language unit to be deleted include the language data to be deleted;
[0191] If included, delete the language data to be deleted.
[0192] Please refer to Figure 6, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 6, the electronic device 600 includes a processor 610, a memory 620, and a bus 630.
[0193] The memory 620 stores machine-readable instructions executable by the processor 610. When the electronic device 600 is running, the processor 610 and the memory 620 communicate via the bus 630. When the machine-readable instructions are executed by the processor 610, the steps of the method embodiments shown in Figure 1 and Figure 2 above can be performed. For the specific implementation, refer to the method embodiments; the details are not repeated here.
[0194] This application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can perform the steps of the method embodiments shown in Figure 1 and Figure 2 above. For the specific implementation, refer to the method embodiments; the details are not repeated here.
[0195] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0196] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the shown or discussed mutual couplings, direct couplings, or communication connections may be through some communication interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.
[0197] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0198] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0199] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0200] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate the technical solutions of this application, and not to limit them. The scope of protection of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the scope of the technology disclosed in this application. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for processing sensitive words, characterized in that, The processing method includes: When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried. Iterate through each language unit to be queried in sequence, and check whether the language unit to be queried is a tail language unit from the pre-built sensitive word database; If the language unit to be queried is a tail language unit, then at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried is identified. The target language data, which begins with the first language unit to be queried and ends with that language unit, is matched with each pre-stored sensitive word found to determine whether the language data to be queried contains a sensitive word.
2. The processing method according to claim 1, characterized in that, The processing method further includes: When the task type is sensitive word addition processing, the language data to be added is split according to the predetermined splitting rules to determine at least one language unit to be added; Iterate through each language unit to be added sequentially to determine whether the new language unit exists in the pre-built sensitive word database; If it does not exist, the newly added language unit and the language data to be added will be stored in the sensitive word database according to the preset data storage format requirements.
3. The processing method according to claim 2, characterized in that, The step of storing the newly added language unit and the language data to be added in the sensitive word database according to the preset data storage format requirements includes: The data structure information of the newly added language unit is determined based on the data structure information of the pre-stored language units recorded in the first data structure table of the sensitive word database. Update the first data structure table using the data structure information of the newly added language unit; Based on the data structure information of the pre-stored sensitive words recorded in the second data structure table of the sensitive word database, determine the data structure information of the language data to be added; Update the second data structure table using the data structure information of the language data to be added.
4. The processing method according to claim 3, characterized in that, The processing method further includes: Based on the second data structure table, a trie is constructed with language units as nodes, and each node of the trie is labeled with whether it is a tail language unit identifier; When any node is identified as a tail language unit identifier, the node is associated with the corresponding pre-stored sensitive word.
5. The processing method according to claim 2, characterized in that, When it is determined that the newly added language unit exists in the pre-built sensitive word database, the processing method further includes: Determine whether the parent identifier of the pre-stored language unit corresponding to the newly added language unit is empty; If empty, update the parent identifier of the pre-stored language unit using the parent identifier of the newly added language unit; If not empty, identify whether the parent identifier of the newly added language unit is the same as the parent identifier of the pre-stored language unit; If they are different, the data structure information of the newly added language unit will be updated in the first data structure table of the sensitive word database.
6. The processing method according to claim 5, characterized in that, When it is determined that the newly added language unit exists in the pre-built sensitive word database, the processing method further includes: Identify whether the newly added language unit is a tail language unit; If so, determine whether the language data to be added exists in the pre-stored sensitive words associated with the newly added language unit; If it does not exist, update the data structure information of the language data to be added to the second data structure table.
7. The processing method according to claim 1, characterized in that the processing method further includes: when the task type is sensitive word deletion processing, splitting the language data to be deleted according to the predetermined splitting rules to determine at least one language unit to be deleted; iterating through each language unit to be deleted sequentially, and querying the pre-built sensitive word database to see whether the language unit to be deleted has been stored; if it has been stored, determining whether the language unit to be deleted has any other reference relationships besides those defined by the language data to be deleted; if no other reference relationship exists, deleting the language unit to be deleted and determining whether the pre-stored sensitive words associated with the tail language unit to be deleted include the language data to be deleted; and if included, deleting the language data to be deleted.
8. A device for processing sensitive words, characterized in that, The processing device includes a query module, which is used for: When the task type is sensitive word query processing, the language data to be queried is split according to the predetermined splitting rules to determine at least one language unit to be queried. Iterate through each language unit to be queried in sequence, and check whether the language unit to be queried is a tail language unit from the pre-built sensitive word database; If the language unit to be queried is a tail language unit, then at least one pre-stored sensitive word in the sensitive word database that ends with the language unit to be queried is identified. The target language data, which begins with the first language unit to be queried and ends with that language unit, is matched with each pre-stored sensitive word found to determine whether the language data to be queried contains a sensitive word.
9. An electronic device, characterized in that it includes a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor communicates with the memory via the bus, and the machine-readable instructions are executed by the processor to perform the steps of the processing method as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the processing method as described in any one of claims 1 to 7.