High-accuracy website sensitive word detection method based on machine learning

A high-accuracy machine learning technique, applied in the fields of machine learning, instruments, and electrical digital data processing. It addresses the high false positive rate of rule-based sensitive word detection and the resulting heavy labor costs for website supervision agencies. It reduces the volume of data to be processed, improves detection speed and accuracy, and lowers labor costs.

Pending Publication Date: 2020-02-04
HANGZHOU ANHENG INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] The invention addresses a problem in the prior art: monitoring software relies mainly on rule matching, which raises the false positive rate of sensitive word detection, and the resulting volume of false positives imposes heavy labor costs on website supervision organizations. The invention provides an optimized, machine learning-based, high-accuracy method for detecting sensitive words on websites.




Embodiment Construction

[0030] The present invention will be described in further detail below in conjunction with the examples, but the protection scope of the present invention is not limited thereto.

[0031] The invention relates to a method for detecting sensitive words on a website with high accuracy based on machine learning. The method includes the following steps.

[0032] Step 1: Download the files to be detected from the website, and create a sensitive word database.

[0033] In step 1, all page files of the monitored website are crawled from the Internet as the files to be detected.

[0034] In the present invention, the sensitive word database is a database created in the system in advance. It covers categories such as pornography, politics, people's livelihood, gambling, and drugs, and it holds the vocabulary that needs to be partially blocked, or monitored and flagged with an alarm, on web pages.
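As an illustration only, a category-keyed sensitive word database of the kind described in [0034] might be sketched as follows. The class name, method names, and the placeholder entries are assumptions made for this example; the source does not specify how the database is stored or queried.

```python
from typing import Dict, Optional, Set


class SensitiveWordDatabase:
    """Hypothetical in-memory store: category name -> set of sensitive words."""

    def __init__(self) -> None:
        self.words_by_category: Dict[str, Set[str]] = {}

    def add(self, category: str, word: str) -> None:
        # Words are stored lowercased so lookups are case-insensitive.
        self.words_by_category.setdefault(category, set()).add(word.lower())

    def category_of(self, word: str) -> Optional[str]:
        # Return the first category that contains the word, or None.
        for category, words in self.words_by_category.items():
            if word.lower() in words:
                return category
        return None

    def all_words(self) -> Set[str]:
        # Flatten every category into one set for rule matching.
        merged: Set[str] = set()
        for words in self.words_by_category.values():
            merged |= words
        return merged


# Placeholder entries; real vocabularies would be curated by the operator.
db = SensitiveWordDatabase()
db.add("gambling", "Jackpot")
db.add("drugs", "placeholder-drug-term")
```

Keeping the category alongside each word would let a later stage report which category triggered an alarm, although the source does not say whether the method does this.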

[0035] Step 2: Match the files to be detected with the datab...
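The text of step 2 is truncated in this copy. Going by the abstract, it performs rule matching between the files to be detected and the sensitive word database to obtain the set of documents that contain sensitive words. A minimal sketch of that idea, assuming plain case-insensitive substring matching (the actual matching rules are not given in the source, and the function names are hypothetical):

```python
from typing import Dict, Iterable, List


def match_sensitive_words(text: str, sensitive_words: Iterable[str]) -> List[str]:
    """Return the sensitive words that occur in text (assumed rule: substring match)."""
    lowered = text.lower()
    return sorted(w for w in sensitive_words if w.lower() in lowered)


def filter_documents(documents: Dict[str, str],
                     sensitive_words: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only the documents with at least one match, mapped to their matched words."""
    words = list(sensitive_words)  # allow any iterable to be reused per document
    hits: Dict[str, List[str]] = {}
    for name, text in documents.items():
        found = match_sensitive_words(text, words)
        if found:
            hits[name] = found
    return hits


documents = {
    "a.html": "Win the JACKPOT today",
    "b.html": "Weather report for Tuesday",
    "c.html": "Children learn the alphabet",
}
matched = filter_documents(documents, {"jackpot", "bet"})
```

Here `c.html` is matched only because "bet" is a substring of "alphabet". This is the kind of false positive that pure rule matching produces and that the later machine learning stage is meant to reduce.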


Abstract

The invention relates to a high-accuracy website sensitive word detection method based on machine learning. The method comprises the following steps: performing rule matching between the files to be detected and the sensitive word database to obtain a document set containing sensitive words; processing the training data and learning from it to output a machine learning model; and inputting the document set into the model to obtain the website sensitive word detection result. Model training is carried out with a machine learning algorithm. Sensitive word rule matching is first performed on the crawled website pages, and machine learning analysis is then applied automatically to the pages output by rule matching. This reduces the volume of data on which the machine learning model must predict and improves detection speed and accuracy. Finally, the likelihood that a page contains sensitive words is obtained through statistical calculation. Because machine learning and semantic analysis are combined with judgment based on the meanings of the segmented words, the sensitive word recognition rate can be effectively improved, monitoring accuracy can be ensured, and the labor cost of supervision institutions can be greatly reduced.
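The two-stage flow in the abstract can be sketched roughly as below. This is not the patented implementation: `toy_model` is a hypothetical stand-in for the trained model, and averaging the per-word scores is an assumed placeholder for the unspecified "statistical calculation".

```python
from typing import Callable, Dict, List


def rule_match(text: str, words: List[str]) -> List[str]:
    """Stage 1: case-insensitive substring matching against the word list."""
    lowered = text.lower()
    return [w for w in words if w.lower() in lowered]


def detect(pages: Dict[str, str], words: List[str],
           model: Callable[[str, str], float]) -> Dict[str, float]:
    """Run the model only on pages that pass rule matching; return a score per page."""
    results: Dict[str, float] = {}
    for url, text in pages.items():
        hits = rule_match(text, words)
        if not hits:
            continue  # filtered out in stage 1, so the model never sees this page
        scores = [model(text, w) for w in hits]
        results[url] = sum(scores) / len(scores)  # assumed statistic: mean score
    return results


def toy_model(text: str, word: str) -> float:
    """Stand-in for a trained model: high score only if the word is a whole token."""
    tokens = text.lower().split()
    return 0.9 if word.lower() in tokens else 0.1


pages = {
    "p1": "place your bet now",
    "p2": "learn the alphabet",
    "p3": "weather report",
}
result = detect(pages, ["bet"], toy_model)
```

In this toy run the model is called for `p1` and `p2` but not for `p3`, which mirrors the claimed reduction in data volume, and the rule-matching false positive on `p2` receives a low score.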

Description

Technical field

[0001] The present invention relates to the technical field of digital computing equipment, data processing equipment, or data processing methods specially adapted for specific functions, and in particular to a high-accuracy method for detecting sensitive words on websites based on machine learning.

Background technique

[0002] For a country, and even for the world, a healthy network environment is very important and bears on the healthy development of society. However, with the rapid development of the Internet, the network is flooded with sensitive words, such as words related to pornography, politics, people's livelihood, gambling, and drugs. This poses a very serious challenge to a healthy Internet environment. Therefore, more and more institutions have begun to use specialized software to monitor sensitive words.

[0003] In the prior art, much traditional monitoring software is based on rule matching, most o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/216, G06F40/289, G06F40/30, G06N20/00
CPC: G06N20/00
Inventor: 江辉云, 范渊
Owner: HANGZHOU ANHENG INFORMATION TECH CO LTD