Three-fold web page text content recognition and filtering method based on Chinese punctuation

A punctuation and content recognition technology, applied in the field of network information security, that addresses problems such as slow filtering speed, low filtering accuracy and filtering rate, and easily bypassed filters, and achieves improved speed, high efficiency, and a low CPU usage rate.

Inactive Publication Date: 2007-09-12
DALIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0010] 1. Filtering by URL and keyword gives low filtering accuracy and a low filtering rate, and the filter is easily bypassed;
[0011] 2. Content filtering based on the text vector space model alone is slow and cannot meet the real-time filtering requirements of broadband network data transmission;
[0012] 3. There are few studies on t

Embodiment Construction

[0067] Tags construct the tag tree of the web page and regularize the page into nested content blocks; then, for a set of web pages generated from the same template, the content blocks that appear many times across the set are identified as noise content, and the content blocks that appear less frequently are treated as valid information blocks. Fudan University proposed an Internet filtering system and filtering method based on a Content Filtering Agent (CFA). The system framework includes three parts: the Content Filtering Agent (CFA), a Query Server (QS), and a Content Analysis and Management Server (CAMS). The filtering process of the network content filtering system is: ...
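
The template-based noise detection described at the start of this paragraph can be illustrated with a minimal sketch. It assumes each page has already been split into content blocks (plain strings here, rather than nodes of a real tag tree), and the function name `split_noise_blocks` and the cutoff `min_page_fraction` are hypothetical choices, not values taken from the cited work.

```python
from collections import Counter


def split_noise_blocks(pages, min_page_fraction=0.5):
    """Separate recurring (noise) blocks from rarely seen (valid) blocks.

    pages: list of pages, each a list of content-block strings, all
           assumed to come from the same template.
    min_page_fraction: assumed cutoff; a block found in more than this
           fraction of the pages is treated as noise.
    Returns (cleaned_pages, noise_blocks).
    """
    page_count = len(pages)
    # Count the number of pages containing each block; set() prevents a
    # block repeated inside one page from being counted twice.
    seen_in = Counter(block for page in pages for block in set(page))
    noise = {
        block
        for block, count in seen_in.items()
        if count / page_count > min_page_fraction
    }
    cleaned = [[block for block in page if block not in noise] for page in pages]
    return cleaned, noise


# Three hypothetical pages sharing a navigation bar and a copyright footer.
pages = [
    ["Home | News", "Story A text", "Copyright"],
    ["Home | News", "Story B text", "Copyright"],
    ["Home | News", "Story C text", "Copyright"],
]
cleaned, noise = split_noise_blocks(pages)
```

In this toy input the navigation bar and footer appear on every page and are classed as noise, while each story block appears once and is kept.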

Abstract

A three-fold method for recognizing and filtering the text content of Chinese web pages, based on Chinese punctuation. To address the low filtering accuracy and low filtering rate of existing methods that filter web page information by URL and keyword, the invention proposes a composite method that combines URL- and keyword-based filtering with web page text content filtering based on a vector space model of text knowledge representation. The method applies URL filtering based on black and white lists, and uses the statistical characteristics of Chinese punctuation to effectively remove noise such as navigation information, related links, advertising links, and copyright information, thereby extracting the body text. It then represents the text with a vector space model, computes the cosine of the angle between the text vector and the feature vector of an unhealthy-information template, and compares the result with a set threshold to determine the class of the text. The invention can be widely used for filtering undesirable information on networks and for personalized website information services.
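
The three stages named in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the host lists, the punctuation set, the per-block punctuation count used as the "statistical characteristic", the term-count vectors, and both thresholds are hypothetical stand-ins, since the abstract does not give the patent's actual statistics or parameter values.

```python
import math
from urllib.parse import urlparse

# All values below are hypothetical; none come from the patent text.
BLACKLIST = {"bad.example.com"}
WHITELIST = {"edu.example.org"}
CN_PUNCT = "，。；：？！、"   # assumed set of Chinese punctuation marks
PUNCT_MIN = 2                 # assumed minimum marks for a body-text block
SIM_THRESHOLD = 0.5           # assumed cosine threshold


def url_verdict(url):
    """Stage 1: classify a URL as 'block', 'allow', or 'unknown' by host lists."""
    host = urlparse(url).hostname or ""
    if host in BLACKLIST:
        return "block"
    if host in WHITELIST:
        return "allow"
    return "unknown"


def extract_body(blocks):
    """Stage 2: keep blocks containing at least PUNCT_MIN Chinese punctuation marks.

    Assumption: navigation, link, and copyright blocks carry little or no
    sentence punctuation, while body text carries plenty.
    """
    return [b for b in blocks if sum(b.count(p) for p in CN_PUNCT) >= PUNCT_MIN]


def term_vector(text, vocabulary):
    """Represent text as raw counts of each vocabulary term (substring counts)."""
    return [text.count(term) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def should_block(url, blocks, vocabulary, template_vector):
    """Stage 3: fall back to content similarity when the URL lists are silent."""
    verdict = url_verdict(url)
    if verdict != "unknown":
        return verdict == "block"
    body = "".join(extract_body(blocks))
    similarity = cosine(term_vector(body, vocabulary), template_vector)
    return similarity >= SIM_THRESHOLD


blocks = [
    "首页 新闻 体育 财经",
    "今天天气很好，我们去公园散步。回来以后，大家都很高兴。",
    "版权所有 2007",
]
body_blocks = extract_body(blocks)
```

Checking the URL lists first keeps the more expensive text comparison for pages whose host is not already known, which matches the abstract's ordering of the stages.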

Description

Technical field

[0001] The invention belongs to the field of network information security and relates to the identification and filtering of undesirable text information on Chinese web pages.

Background art

[0002] Most existing web content security products, such as "Internet Nanny" and "Internet Dad", use methods based on URL addresses and keywords to prohibit access to illegal web pages and websites. Given the diversity and dynamic nature of web content, an approach that relies on a static address library or on manually updated URLs and keywords falls far short of people's filtering requirements. Parents look forward to more effective and comprehensive information filtering products.

[0003] Existing filtering methods for web page text content mainly revolve around the vector space model.

[0004] Liu Peide and others constructed a network information filtering system (NIFS) with a feedback mechanism by using the vector space model, TC3 classification algorithm...

Application Information

IPC(8): H04L29/06, G06F17/30, G06F17/27, H04L12/24
Inventor: 宋明秋, 吴新涛
Owner: DALIAN UNIV OF TECH