A sensitive corpus detection method based on thesaurus and word vector model

A detection method using word vector technology, applied in text database querying, digital data information retrieval, natural language data processing, etc. It addresses the problem that sensitive words outside the lexicon go undetected, widening detection coverage and improving performance.

Active Publication Date: 2022-06-17
XIDIAN UNIV +1

AI Technical Summary

Problems solved by technology

[0004] Regarding sensitive corpus detection for network media, the applicant found a related patent through a patent search: "Sensitive text detection method and device", patent application number CN201410064854.6. That patent proposes a sensitive text detection scheme based on a finite state machine and keyword category weights, judging the sensitivity of a text from the frequency and weights of the sensitive words it contains. However, this method can only filter out sensitive words already in the lexicon; it offers no effective solution for detecting sensitive words outside the lexicon.
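
The limitation described above can be illustrated with a minimal sketch of purely lexicon-based scoring. This is not the cited patent's implementation: the lexicon, the weights, the tokens, and the function name `lexicon_score` are all hypothetical, and the score is assumed to be a simple sum of frequency times weight.

```python
from collections import Counter

# Hypothetical lexicon: sensitive word -> category weight (illustrative values).
LEXICON_WEIGHTS = {"bomb": 3.0, "riot": 2.0}


def lexicon_score(tokens, weights=LEXICON_WEIGHTS):
    """Sum frequency * weight over the lexicon words found in the tokens."""
    counts = Counter(tokens)  # missing words count as 0
    return sum(counts[word] * weight for word, weight in weights.items())


known = ["bomb", "threat", "bomb"]
variant = ["b0mb", "threat"]  # obfuscated spelling that is absent from the lexicon

print(lexicon_score(known))    # lexicon words contribute to the score
print(lexicon_score(variant))  # the out-of-lexicon variant contributes nothing
```

Because the score only ever looks up exact lexicon entries, any variant spelling or new synonym is invisible to it, which is the gap the word vector model is meant to close.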

Examples

Embodiment Construction

[0027] The present invention is described in further detail below with reference to the accompanying drawings:

[0028] Referring to Figure 1, the sensitive corpus detection method based on a thesaurus and word vector model of the present invention comprises the following steps:

[0029] 1) Obtain an open text corpus and then preprocess it, wherein the open text corpus includes a Chinese Wikipedia corpus and a news corpus;

[0030] The Chinese Wikipedia corpus in step 1) comes from Wikipedia's open Chinese corpus. The latest version of the Wikipedia Chinese corpus can be obtained from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2; the news corpus comes from Sohu news data.

[0031] The specific process of preprocessing the Chinese Wikipedia corpus in step 1) is as follows:

[0032] Use the open tool WikiExtractor to extract effective information from the Chinese Wikipedia corpus. After extracting the effective informa...
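
As a rough illustration of working with the extracted text, the sketch below pulls article titles and bodies out of text in WikiExtractor's default `<doc ...> ... </doc>` wrapper format. The sample string is made up for the example, the name `iter_articles` is hypothetical, and the rest of the patent's preprocessing is not reproduced here because the excerpt is truncated.

```python
import re

# One article block in WikiExtractor's default (non-JSON) output looks like:
#   <doc id="..." url="..." title="...">
#   article text
#   </doc>
DOC_RE = re.compile(r'<doc [^>]*title="([^"]*)"[^>]*>(.*?)</doc>', re.DOTALL)


def iter_articles(extracted_text):
    """Yield (title, body) pairs from WikiExtractor-style text."""
    for match in DOC_RE.finditer(extracted_text):
        yield match.group(1), match.group(2).strip()


# Made-up sample in the same shape as WikiExtractor output.
sample = (
    '<doc id="13" url="https://zh.wikipedia.org/wiki?curid=13" title="数学">\n'
    '数学\n\n数学是研究数量、结构以及空间等概念的一门学科。\n'
    '</doc>\n'
    '<doc id="18" url="https://zh.wikipedia.org/wiki?curid=18" title="哲学">\n'
    '哲学\n\n哲学是研究普遍的、基本问题的学科。\n'
    '</doc>\n'
)

articles = list(iter_articles(sample))
for title, body in articles:
    print(title, len(body))
```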

Abstract

The invention discloses a sensitive corpus detection method based on a lexicon and a word vector model, comprising the following steps: 1) acquire an open text corpus and preprocess it, where the open text corpus includes a Chinese Wikipedia corpus and a news corpus; 2) merge the Chinese Wikipedia corpus and news corpus processed in step 1) to obtain a merged corpus, segment the merged corpus with a word segmentation tool, and filter the stop words out of the segmentation results; 3) use the open-source tool word2vec to perform unsupervised training on the stop-word-filtered segmentation results, and build a word vector model from the training results; 4) obtain the text to be detected, score its vocabulary, and at the same time build a dictionary of similar words; 5) use the similar-word dictionary, the word vector model and the sensitive lexicon to perform sensitivity detection on the words in the word segmentation table, completing the detection of sensitive corpus based on the lexicon and word vector model. The method has strong detection ability for sensitive words.
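
The idea behind steps 2) to 5) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the patented procedure: the stop words, lexicon, and three-dimensional vectors are invented stand-ins for a real segmentation tool and a trained word2vec model, and the cosine-similarity threshold is an assumed decision rule, since the excerpt does not give the patent's scoring or dictionary-building details.

```python
import math

# Toy stand-ins (all hypothetical). A real pipeline would segment the text with
# a word segmentation tool and load vectors trained by word2vec.
STOP_WORDS = {"the", "a", "of"}
SENSITIVE_LEXICON = {"bomb"}
WORD_VECTORS = {
    "bomb": [0.9, 0.1, 0.0],
    "explosive": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_sensitive(tokens, threshold=0.9):
    """Flag tokens that are in the lexicon or close to a lexicon word."""
    hits = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue  # stop words are filtered out before detection
        if tok in SENSITIVE_LEXICON:
            hits.append((tok, "lexicon"))
            continue
        vec = WORD_VECTORS.get(tok)
        if vec is None:
            continue  # no vector available, so nothing to compare
        if any(
            cosine(vec, WORD_VECTORS[s]) >= threshold
            for s in SENSITIVE_LEXICON
            if s in WORD_VECTORS
        ):
            hits.append((tok, "similar"))
    return hits


result = detect_sensitive(["the", "explosive", "weather", "bomb"])
print(result)
```

In this toy setup "explosive" is not in the lexicon but is still flagged because its vector lies close to that of "bomb", while "weather" is left alone.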

Description

Technical Field

[0001] The invention belongs to the technical field of Internet information processing and relates to a sensitive corpus detection method based on a thesaurus and a word vector model.

Background Art

[0002] With the rapid development of the information age, the social media platforms emerging on the Internet are favored by many users, and publishing information on social platforms has become an important way in which public opinion forms and spreads. Social media generates a huge volume of text every day, a small part of which poses serious risks to social security and political stability. To avoid the negative impact of potentially sensitive corpus, it is necessary to examine the corpus on the Internet, quickly identify the sensitive information it contains, and then process it further.

[0003] For sensitive corpus on the Internet, the traditional detection method purely based on thesaurus has...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9536, G06F16/33, G06F40/284
CPC: G06F16/9536, G06F16/3344
Inventors: 李辉, 陈鹏
Owner: XIDIAN UNIV