
Chinese sensitive text recognition method and device, storage medium and equipment

A text recognition and storage medium technology, applied to text database indexing, text database querying, unstructured text data retrieval, and related fields. It addresses the problems that existing sensitive text recognition is inaccurate and cannot cover variants such as homophones, visually similar characters, and split characters, and it achieves the effects of improving the recall rate and avoiding misjudgment.

Pending Publication Date: 2021-12-21
北京云上曲率科技有限公司

AI Technical Summary

Problems solved by technology

[0004] To this end, the present invention provides a Chinese sensitive text recognition method, device, storage medium and equipment to solve the problem that the existing sensitive text recognition is not accurate and cannot cover variants such as homophones, similar characters, and split characters.



Examples


Embodiment 1

[0037] Referring to Figure 1, Embodiment 1 of the present invention provides a Chinese sensitive text recognition method, comprising the following steps:

[0038] S1. Obtain a text object to be recognized, preprocess the text object, and obtain the text pinyin list corresponding to the preprocessed text object;

[0039] S2. Convert the sensitive Chinese characters in the sensitive lexicon into sensitive pinyin, and generate a pinyin Trie tree corresponding to the sensitive pinyin;

[0040] S3. Search the pinyin Trie tree using the text pinyin list, mark the text pinyin found in the text pinyin list as sensitive words, perform context backtracking from the marked sensitive words to obtain the sensitive content in the text object, and blank out the sensitive words in the sensitive content.
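Steps S1 to S3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the character-to-pinyin table `TOY_PINYIN`, the function names, and the `safe_words` whitelist are all hypothetical. In particular, the whitelist check is only a crude stand-in for the context backtracking step, whose exact rule is not spelled out in this excerpt.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy character-to-pinyin table (tones dropped). A real system
# would use a full pinyin dictionary; these entries are illustrative only.
TOY_PINYIN: Dict[str, str] = {
    "赌": "du", "博": "bo", "渡": "du", "伯": "bo",
    "我": "wo", "去": "qu", "了": "le",
}


class PinyinNode:
    """One node of the pinyin trie; edges are whole pinyin syllables."""

    def __init__(self) -> None:
        self.children: Dict[str, "PinyinNode"] = {}
        self.is_end = False  # True if a sensitive word ends at this node


def to_pinyin_list(text: str) -> List[str]:
    """S1: map each character to a syllable (unknown characters map to themselves)."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def build_pinyin_trie(lexicon: List[str]) -> PinyinNode:
    """S2: convert each sensitive word to pinyin and insert it into the trie."""
    root = PinyinNode()
    for word in lexicon:
        node = root
        for syllable in to_pinyin_list(word):
            node = node.children.setdefault(syllable, PinyinNode())
        node.is_end = True
    return root


def find_spans(pinyin: List[str], root: PinyinNode) -> List[Tuple[int, int]]:
    """S3 (search): return [start, end) spans matching a trie entry, longest match first."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(pinyin):
        node, j, last_end = root, i, -1
        while j < len(pinyin) and pinyin[j] in node.children:
            node = node.children[pinyin[j]]
            j += 1
            if node.is_end:
                last_end = j
        if last_end > 0:
            spans.append((i, last_end))
            i = last_end
        else:
            i += 1
    return spans


def mask_sensitive(text: str, lexicon: List[str],
                   safe_words: FrozenSet[str] = frozenset()) -> str:
    """S3 (blanking): replace matched spans with '*', skipping whitelisted originals."""
    root = build_pinyin_trie(lexicon)
    chars = list(text)
    for start, end in find_spans(to_pinyin_list(text), root):
        if text[start:end] in safe_words:  # assumed stand-in for backtracking
            continue
        chars[start:end] = "*" * (end - start)
    return "".join(chars)
```

Because matching happens on pinyin rather than on characters, a homophone variant such as 渡伯 is caught by the lexicon entry 赌博 in this sketch, while adding 渡伯 to `safe_words` leaves it untouched.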

[0041] Specifically, a Trie tree, also known as a prefix tree or word lookup tree, is a tree structure and a variant of a hash tree. Typical applications include counting, sorting and...
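As a generic illustration of the counting and sorting uses mentioned above, the following sketch stores strings in a character-level trie, counts how often each was inserted, and lists them in lexicographic order. The class and method names are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # number of inserted strings that end at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        """Walk or create one node per character, then bump the end counter."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def count(self, word: str) -> int:
        """Return how many times `word` was inserted (0 if never)."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count

    def sorted_items(self) -> List[Tuple[str, int]]:
        """Pre-order walk over sorted children yields words in lexicographic order."""
        out: List[Tuple[str, int]] = []

        def walk(node: TrieNode, prefix: str) -> None:
            if node.count:
                out.append((prefix, node.count))
            for ch in sorted(node.children):
                walk(node.children[ch], prefix + ch)

        walk(self.root, "")
        return out
```

Lookup cost depends on the length of the key rather than on the number of stored words, which is one reason a trie suits matching against a large lexicon.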

Embodiment 2

[0063] Referring to Figure 3, Embodiment 2 of the present invention provides a Chinese sensitive text recognition device that uses the Chinese sensitive text recognition method of Embodiment 1 or any possible implementation thereof, including:

[0064] A text recognition preprocessing unit 1, configured to obtain a text object to be recognized, preprocess the text object, and obtain the text pinyin list corresponding to the preprocessed text object;

[0065] A sensitive word pinyin Trie tree generating unit 2, configured to convert the sensitive Chinese characters in the sensitive lexicon into sensitive pinyin and generate the pinyin Trie tree corresponding to the sensitive pinyin;

[0066] A text sensitive content identification processing unit 3, configured to search the pinyin Trie tree using the text pinyin list, mark the text pinyin found in the text pinyin list as sensitive words, and perform context backtracking from the marked sensitive words to obtain sensitive content...

Embodiment 3

[0083] Embodiment 3 of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores program code for the Chinese sensitive text recognition method, and the program code includes instructions for executing the Chinese sensitive text recognition method of Embodiment 1 or any possible implementation thereof.

[0084] The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), and the like.


Abstract

The invention discloses a Chinese sensitive text recognition method and device, a storage medium and equipment. The method comprises the following steps: obtaining a text object to be recognized, preprocessing the text object, and obtaining the text pinyin list corresponding to the preprocessed text object; converting the sensitive Chinese characters in a sensitive lexicon into sensitive pinyin, and generating a pinyin Trie tree corresponding to the sensitive pinyin; searching the pinyin Trie tree using the text pinyin list, marking the text pinyin found in the text pinyin list as sensitive words, performing context backtracking from the marked sensitive words to obtain the sensitive content in the text object, and blanking out the sensitive words in the sensitive content. The method provides broad coverage of sensitive vocabulary and improves the recall rate: it supports recall of obfuscated variants that use polyphonic characters, homophones, visually similar characters, and split characters, and it adopts common-word backtracking to avoid misjudgment.

Description

Technical field

[0001] The invention relates to the technical field of sensitive text processing, in particular to a Chinese sensitive text recognition method, device, storage medium and equipment.

Background

[0002] At present, in Internet scenarios, content published by users usually needs to be reviewed, either for compliance reasons or to meet actual business needs. Compared with other carriers such as images or audio, text is generally cheaper for users to publish, and text content is more likely to contain sensitive or illegal material. Timely detection and blocking of sensitive content is the basis for maintaining a clean Internet environment.

[0003] In the prior art, sensitive text recognition schemes usually include sensitive word matching and whole-sentence text classification models. For sensitive word matching, a lexicon is usually defined in advance; when words in the lexicon appear in the text to be detected, it is consi...


Application Information

IPC(8): G06F40/279, G06F40/284, G06F40/216, G06F40/151, G06F16/31, G06F16/33
CPC: G06F40/279, G06F40/284, G06F40/216, G06F40/151, G06F16/322, G06F16/334
Inventor 李勇涛王圳樊伟华杜晓祥
Owner 北京云上曲率科技有限公司