Sensitive corpus detection method based on lexicon and word vector model
A detection method using word vector technology, applied in text database querying, natural language data processing, digital data information retrieval, and related fields. It addresses problems such as insensitive word detection, and achieves a wide detection range and improved performance.
Embodiment Construction
[0027] The present invention is described in further detail below in conjunction with the accompanying drawing:
[0028] Referring to Figure 1, the sensitive corpus detection method of the present invention, based on a lexicon and a word vector model, comprises the following steps:
[0029] 1) Obtain an open text corpus and preprocess it, wherein the open text corpus includes the Chinese Wikipedia corpus and a news corpus;
[0030] In step 1), the Chinese Wikipedia corpus comes from Wikipedia's open Chinese corpus. The latest version of the Wikipedia Chinese corpus can be obtained at: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2. The news corpus comes from Sohu news data.
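As an illustration of what the downloaded dump contains, the following minimal sketch streams page titles and raw wikitext from a bz2-compressed MediaWiki XML dump using only the Python standard library. It is not the extraction step described in the patent; it only shows one way the file could be inspected. The function name `iter_pages` is hypothetical, and the dump is assumed to have been downloaded already.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_file):
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki XML dump.

    dump_file may be a file path or a binary file-like object.
    iter_pages is a hypothetical helper name, not part of the patent.
    """
    with bz2.open(dump_file, "rb") as stream:
        title, text = None, ""
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Dump elements carry an XML namespace; keep only the local name.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release memory held by the finished page
```

Calling `iter_pages("zhwiki-latest-pages-articles.xml.bz2")` on a local copy of the dump would iterate over articles one at a time, so the whole file never has to be decompressed into memory.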
[0031] The specific process of preprocessing the Chinese Wikipedia corpus in step 1) is:
[0032] Use the open tool WikiExtractor to extract effective information from the Chinese Wikipedia corpus, remove the invalid tags in the effective information te...
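The tag-removal part of this step might look like the sketch below. It assumes the "invalid tags" are the `<doc ...>` and `</doc>` wrapper lines that WikiExtractor writes around each article in its default output; the source text is truncated, so the remaining preprocessing is not covered here. The helper name `strip_doc_tags` is hypothetical.

```python
import re

# Assumption: the tags to remove are WikiExtractor's per-article wrappers,
# <doc id="..." url="..." title="..."> and </doc>.
DOC_TAG = re.compile(r"</?doc\b[^>]*>")


def strip_doc_tags(lines):
    """Return the non-empty text lines with <doc> wrapper tags removed."""
    cleaned = []
    for line in lines:
        line = DOC_TAG.sub("", line).strip()
        if line:
            cleaned.append(line)
    return cleaned
```

The function takes any iterable of lines, so it can be applied directly to an open WikiExtractor output file.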
