A sensitive corpus detection method based on thesaurus and word vector model
The detection method uses word vector technology and is applied in text database querying, digital data information retrieval, natural language data processing, and related fields. It addresses problems such as sensitive words going undetected, covers a wider range of sensitive content, and improves detection performance.
Embodiment Construction
[0027] The present invention is described in further detail below with reference to the accompanying drawing:
[0028] Referring to Figure 1, the sensitive corpus detection method based on a thesaurus and a word vector model of the present invention comprises the following steps:
[0029] 1) Obtaining an open text corpus and then preprocessing it, wherein the open text corpus includes a Chinese Wikipedia corpus and a news corpus;
[0030] The Chinese Wikipedia corpus in step 1) comes from Wikipedia's open Chinese-language corpus. The latest version of this corpus can be obtained at: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2. The news corpus comes from Sohu news data.
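As a quick way to inspect the downloaded dump before any preprocessing, the following minimal sketch streams a bz2-compressed MediaWiki XML file and yields each page's title and raw wikitext. It is an illustration only and not the extraction step of the described method. The names `local_name`, `iter_pages`, and `iter_dump` are hypothetical helpers, and the sketch assumes the dump has already been saved locally and follows the usual `<page>` / `<title>` / `<text>` layout of MediaWiki exports.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Address quoted in paragraph [0030]; downloading is left to the reader.
DUMP_URL = "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2"


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> in a decompressed MediaWiki XML stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def iter_dump(path):
    """Open a local .xml.bz2 dump (assumed path) and iterate over its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Under these assumptions, `for title, text in iter_dump("zhwiki-latest-pages-articles.xml.bz2"): ...` would walk the dump page by page without decompressing the whole file into memory. The text yielded here is still raw wikitext, so it still needs the kind of cleaning the next paragraphs delegate to WikiExtractor.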
[0031] The specific process of preprocessing the Chinese Wikipedia corpus in step 1) is as follows:
[0032] Use the open-source tool WikiExtractor to extract the useful content from the Chinese Wikipedia corpus. After extracting the effective informa...
