Keyword-based bad texts detection method and device

A text detection and keyword technology, applied in unstructured text data retrieval, text database clustering/classification, and special data processing applications. It addresses the problems of missed illegal words, difficulty in identifying camouflaged words, and the low accuracy of web page recognition, and thereby improves detection accuracy.

Status: Inactive
Publication Date: 2017-06-09
SURFILTER NETWORK TECH
Cites: 5; Cited by: 17

AI Technical Summary

Problems solved by technology

[0005] With previous detection methods, even when considerable manual labor is spent marking illegal words as keywords, many illegal words are inevitably missed.
In addition, detection based on a list of offending words has difficulty identifying disguised (camouflaged) words.
Because of these keyword limitations, the prior art identifies web page violations with low accuracy.

Examples

Embodiment 1

[0053] This embodiment provides a keyword-based method for detecting bad text, which can be executed by a computer with an information processing function, a network server, or the like. Bad text refers to text content that contains bad information related to pornography, gambling, or drugs. Keywords are words carrying bad or sensitive information that are obtained in advance for use in bad text detection, such as "sex" and other illegal words. As an application scenario of the present invention, in this embodiment the network server detects webpage text carried as a data stream in the network, according to the method provided by the present invention. It can be understood that, for detection, webpage text in data stream form can first be restored to webpage text in natural language form. The keyword-based bad text detection method provided in this embodiment is described below.
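The restoration step mentioned in [0053] is not detailed in the text available here. As a rough illustration only, the sketch below assumes the page arrives as a sequence of byte chunks holding UTF-8 HTML, and uses Python's standard `codecs` and `html.parser` modules to recover the visible text. The names `restore_text` and `_TextExtractor` are hypothetical and do not come from the patent.

```python
import codecs
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def restore_text(chunks, encoding="utf-8"):
    """Reassemble the byte chunks of one page and return its visible text.

    Assumption: the chunks are already in order and belong to one page.
    """
    # An incremental decoder copes with a multi-byte character that is
    # split across two chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    html = "".join(decoder.decode(chunk) for chunk in chunks)
    html += decoder.decode(b"", final=True)

    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)
```

A real deployment would also have to handle packet reordering, compression, and other character encodings, none of which this sketch attempts.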

[0054] Figure 1 is a flow chart of the keyword-based bad ...

Embodiment 2

[0108] Corresponding to the keyword-based bad text detection method provided in Embodiment 1, Embodiment 2 provides a keyword-based bad text detection device. The device may specifically be a computer with an information processing function, a network server, or the like. As shown in Figure 2, the keyword-based bad text detection device 100 includes:

[0109] A seed word obtaining unit 101, which obtains a plurality of seed words, a seed word being a word used to represent bad information;

[0110] A semantically associated word expansion unit 102, which expands the seed words obtained by the seed word obtaining unit 101 according to a semantic clustering method to obtain semantically associated words that are semantically related to the seed words, and which uses the seed words and the semantically associated words as the keywords for detecting bad text;

[0111] The bad text judging unit 103, when the webpage text is transmitte...

Abstract

The invention discloses a keyword-based bad text detection method and device. It relates to the field of web content detection and is intended to improve the accuracy of keyword-based bad text detection. The method includes the following steps: S0, acquiring a plurality of seed words, which are words that characterize bad information; S1, expanding the seed words according to a semantic clustering method to obtain semantically associated words related to the seed words, and using the seed words and the semantically associated words as keywords for detecting bad text; S2, while web text is transmitted in a broadband environment, counting the frequency of occurrence of each keyword in each web text, and determining which web texts are bad texts according to those frequencies.
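Steps S0 to S2 can be illustrated with a small sketch. The abstract does not specify the semantic clustering method or the decision rule, so both are assumptions here: expansion is stood in for by a cosine-similarity test against toy word vectors, and a text is flagged when the total keyword count reaches a minimum. All function names, the vectors, and the parameters `min_similarity` and `min_total` are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seed_words(seeds, vectors, min_similarity=0.8):
    """S1 (assumed form): add every vocabulary word close to some seed word."""
    keywords = set(seeds)
    for word, vec in vectors.items():
        if any(s in vectors and cosine(vec, vectors[s]) >= min_similarity
               for s in seeds):
            keywords.add(word)
    return keywords


def keyword_frequencies(text, keywords):
    """S2, first half: count how often each keyword occurs in one text.

    Assumption: simple \\w+ tokenization, which suits space-delimited
    languages only; Chinese text would need a word segmenter instead.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t in keywords)


def is_bad_text(text, keywords, min_total=3):
    """S2, second half (assumed rule): flag when total hits reach min_total."""
    return sum(keyword_frequencies(text, keywords).values()) >= min_total


# Toy 2-D vectors, invented purely for illustration.
TOY_VECTORS = {
    "casino": (0.9, 0.1),
    "betting": (0.85, 0.2),
    "jackpot": (0.8, 0.3),
    "weather": (0.1, 0.9),
}

seeds = ["casino"]                                  # S0
keywords = expand_seed_words(seeds, TOY_VECTORS)    # S1
```

With these toy vectors, "betting" and "jackpot" join the keyword set while "weather" does not, so a page that never mentions the seed word itself can still accumulate keyword hits.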

Description

Technical field

[0001] The invention relates to the field of web page content detection, and more specifically to a keyword-based bad text detection method and device.

Background art

[0002] With the popularization of the Internet and the improvement of network bandwidth, the number of accessible websites and the volume of webpage content on the Internet have grown explosively. Because the Internet is open, webpage content includes a large amount of illegal information related to pornography, gambling, and drugs. To block illegal webpages containing bad information and to purify the network environment, webpage content must be monitored in real time.

[0003] In the past, to monitor webpage content in real time, it was proposed to judge whether a web page violated regulations based on the number of occurrences of keywords. Specifically, when the number of occurrences of keywords in a certain webpage exceeds a threshold, it i...
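The prior approach in [0003], counting occurrences of a fixed keyword list and comparing the count with a threshold, can be sketched as below to show where it falls short. The keyword list, the threshold, and the function names are all hypothetical, and the disguised spellings are invented examples of the camouflaged words the summary mentions.

```python
import re


def count_hits(text, keywords):
    """Count tokens of `text` that appear in a fixed keyword set."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for t in tokens if t in keywords)


def exceeds_threshold(text, keywords, threshold):
    """True when the keyword count is strictly greater than the threshold."""
    return count_hits(text, keywords) > threshold


# Hypothetical hand-maintained keyword list.
FIXED_KEYWORDS = {"casino", "betting"}
```

Under this sketch, `exceeds_threshold("casino betting casino", FIXED_KEYWORDS, 2)` is true, while the disguised text `"c4sino b3tting c4sino"` scores zero hits and passes unnoticed, which is the kind of miss the invention aims to reduce.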

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/35, G06F40/279
Inventor: 唐新民, 沈智杰, 景晓军, 刘永强
Owner: SURFILTER NETWORK TECH