Sensitive corpus detection method based on lexicon and word vector model

A detection method using word vector technology, applied in text database querying, natural language data processing, digital data information retrieval, and related fields. It addresses the inability of lexicon-only approaches to detect sensitive words outside the lexicon, covering a wider range of sensitive words and improving detection performance.

Active Publication Date: 2020-01-24
XIDIAN UNIV +1
5 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

[0004] For sensitive corpus detection in network media, the applicant found a related patent through a patent search: "Sensitive text detection method and device", patent application number CN201410064854.6. That patent proposes a sensitive text detection scheme based on a finite-state machine and keyword category weights, judging the sensitivity of a text from the frequency and weights of the sensitive words it contains. However, that method can only filter out sensitive words already in the lexicon; it offers no effective way to detect sensitive words outside the lexicon.
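
The limitation described in [0004] can be illustrated with a minimal sketch. This is not the cited patent's finite-state-machine scheme: a plain dictionary lookup stands in for it, and the lexicon entries, weights, and function names are hypothetical placeholders.

```python
# Hypothetical lexicon: sensitive word -> category weight (illustrative values only).
LEXICON = {"badword": 1.0, "riskyterm": 0.5}


def lexicon_score(tokens, lexicon=LEXICON):
    """Sum the weights of the tokens found in the lexicon.

    A word that occurs twice contributes its weight twice, so the score
    reflects both frequency and weight.
    """
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def lexicon_hits(tokens, lexicon=LEXICON):
    """Return the tokens that match a lexicon entry exactly, in order."""
    return [tok for tok in tokens if tok in lexicon]
```

Because matching is exact, a variant spelling such as `b4dword` produces no hit and a score of zero, which is the gap the present method aims to close.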

Examples

Detailed Description of the Embodiments

[0027] The present invention is described in further detail below with reference to the accompanying drawing:

[0028] Referring to Figure 1, the sensitive corpus detection method of the present invention, based on a lexicon and a word vector model, comprises the following steps:

[0029] 1) Obtain an open text corpus and preprocess it, wherein the open text corpus includes a Chinese Wikipedia corpus and a news corpus;

[0030] In step 1), the Chinese Wikipedia corpus comes from Wikipedia's open Chinese-language dump. The latest dump is available at: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2; the news corpus comes from Sohu news data.
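
Once the dump has been downloaded, it can be inspected without unpacking it to disk. The following is a minimal sketch using Python's standard `bz2` module; the function name and the local file path passed to it are assumptions made for illustration, not part of the described method.

```python
import bz2


def iter_dump_lines(path, encoding="utf-8"):
    """Stream text lines from a .bz2 file, decompressing on the fly."""
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Streaming line by line keeps memory use flat, which matters for a dump of this size.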

[0031] The specific process of preprocessing the Chinese Wikipedia corpus in step 1) is:

[0032] Use the open tool WikiExtractor to extract effective information from the Chinese Wikipedia corpus, remove the invalid tags in the effective information te...
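
As a hedged illustration of the tag-removal step, the sketch below assumes that the "invalid tags" are the `<doc ...>` and `</doc>` wrapper lines that WikiExtractor places around each article in its default output. The source text is truncated at this point, so that reading is an assumption, and the function name is hypothetical.

```python
import re

# Assumption: the tags to remove are the <doc ...> / </doc> wrapper lines
# found in WikiExtractor's default output.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")


def strip_doc_tags(lines):
    """Yield non-empty text lines, skipping <doc ...> and </doc> wrapper lines."""
    for line in lines:
        text = line.strip()
        if text and not DOC_TAG.match(text):
            yield text
```

The function accepts any iterable of lines, so it can be applied directly to an open file handle.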

Abstract

The invention discloses a sensitive corpus detection method based on a lexicon and a word vector model, comprising the following steps: 1) obtaining an open text corpus and preprocessing it, the open text corpus including a Chinese Wikipedia corpus and a news corpus; 2) combining the Chinese Wikipedia corpus processed in step 1) with the news corpus to obtain a combined corpus, performing word segmentation on the combined corpus with a word segmentation tool, and filtering stop words out of the segmentation result; 3) using the open tool word2vec to perform unsupervised training on the stop-word-filtered segmentation result, and constructing a word vector model from the training result; 4) obtaining a to-be-detected text and a score word list, and constructing a similar word dictionary; and 5) using the similar word dictionary, the word vector model, and a sensitive word lexicon to perform sensitivity detection on the words in the score word list, thereby completing sensitive corpus detection based on the lexicon and the word vector model. The method has excellent sensitive word detection capability.
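
Steps 2) and 5) can be sketched as follows. The abstract does not say how the similar word dictionary, the word vector model, and the sensitive word lexicon are combined, so the lookup order, the cosine-similarity threshold of 0.8, and the dictionary layout (variant word mapped to a lexicon word) are all assumptions. Plain Python lists stand in for the word2vec vectors, and every function name is hypothetical.

```python
import math


def filter_stop_words(tokens, stop_words):
    """Step 2, after segmentation: drop stop words from a token list."""
    return [tok for tok in tokens if tok not in stop_words]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_sensitive(tokens, sensitive_words, vectors,
                     similar_words=None, threshold=0.8):
    """Step 5 (assumed combination): flag each token by the first rule that fires.

    1. The token is in the sensitive word lexicon.
    2. The similar word dictionary maps the token to a lexicon word.
    3. The token's vector is within `threshold` cosine similarity
       of some lexicon word's vector.
    """
    similar_words = similar_words or {}
    flagged = []
    for tok in tokens:
        if tok in sensitive_words:
            flagged.append((tok, "lexicon"))
        elif similar_words.get(tok) in sensitive_words:
            flagged.append((tok, "similar-word dictionary"))
        elif tok in vectors and any(
            word in vectors and cosine(vectors[tok], vectors[word]) >= threshold
            for word in sensitive_words
        ):
            flagged.append((tok, "word vector"))
    return flagged
```

In a real pipeline, `vectors` would be replaced by lookups into the model trained in step 3), and `tokens` would come from the word segmentation tool. The third rule is what lets the method reach words that appear in neither the lexicon nor the similar word dictionary.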

Description

Technical field

[0001] The invention belongs to the technical field of Internet information processing and relates to a sensitive corpus detection method based on a lexicon and a word vector model.

Background

[0002] With the rapid development of the information age, the social media platforms emerging on the Internet have been embraced by many users, and publishing information on social platforms has become an important way in which public opinion forms and spreads. Social media generates a massive text corpus every day, and even a small amount of sensitive corpus within it can pose serious risks to social security and political stability. To avoid the negative impact of potentially sensitive corpus, it is necessary to monitor the corpus on the Internet, quickly identify the sensitive information it contains, and then process it further.

[0003] For sensitive corpus on the Internet, the traditional method of detection purely based on thesaurus...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/9536, G06F16/33, G06F40/284
CPC: G06F16/9536, G06F16/3344
Inventor: 李辉, 陈鹏
Owner: XIDIAN UNIV