System for automatic classification analysis for website based on website content

An automatic website classification technology, applied in the fields of network data retrieval, network data indexing, and special data processing applications, that addresses the slow update speed, low efficiency, and high maintenance cost of manual classification.

Inactive Publication Date: 2014-04-23
NANJING HUGEDATA NETWORK TECH
3 Cites, 46 Cited by

AI Technical Summary

Problems solved by technology

Website classification can effectively improve the accuracy of Web information retrieval. The classification search engines represented by Yahoo, Sohu, etc. use manual classification methods which, because of their low efficiency, slow update speed, and high maintenance costs, make it difficult to effectively track and manage the large number of dynamically changing websites on the Internet.


Examples

Embodiment Construction

[0039] The present invention will be further described below in conjunction with the accompanying drawings.

[0040] As shown in Figure 1, the number of links to the industry benchmark website is first checked: if it is greater than a certain threshold, the homepage data is captured; otherwise, the next-level link data is captured. The captured data is preprocessed and the text content of the web pages is analyzed. Each container node is then checked for validity: a node that is not valid is judged to be noise and deleted; otherwise the node block is passed to word segmentation. The importance of each feature word to a category is calculated, the category discrimination of each feature word is obtained from the website category feature thesaurus, and importance and discrimination are combined to obtain the weight set of feature keywords. The set of website category feature keywords is then obtained to establish the website category tem...
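
The weighting step described above can be sketched as follows. The text does not give the formulas for importance or discrimination, nor how the two are combined, so everything here is an illustrative assumption: importance is taken as a word's relative frequency within a category, discrimination as a logarithmic penalty for appearing in many categories, and the weight as their product. The function name `keyword_weights` and the input layout are hypothetical.

```python
import math
from collections import Counter


def keyword_weights(category_docs):
    """Compute a weight for every word in every category.

    category_docs maps a category name to a list of tokenized pages
    (each page is a list of words). Returns {category: {word: weight}}.
    The formulas below are assumptions, not the patent's definitions.
    """
    n_categories = len(category_docs)

    # Word counts per category, pooled over all pages in that category.
    counts = {
        cat: Counter(word for page in pages for word in page)
        for cat, pages in category_docs.items()
    }

    # Number of categories in which each word appears at least once.
    category_freq = Counter(word for c in counts.values() for word in c)

    weights = {}
    for cat, counter in counts.items():
        total = sum(counter.values())
        weights[cat] = {}
        for word, n in counter.items():
            # Assumed importance: relative frequency inside the category.
            importance = n / total
            # Assumed discrimination: higher when fewer categories use the word.
            discrimination = math.log(1 + n_categories / category_freq[word])
            weights[cat][word] = importance * discrimination
    return weights
```

With two toy categories, a word confined to one category ("football") ends up with a higher weight than a word shared by both ("news"), which is the behaviour the importance and discrimination combination is meant to produce.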

Abstract

The invention discloses a system for automatic classification analysis of websites based on website content. The system comprises a capture module, a website text content analysis module, a word segmentation module, a feature training extraction module, and a website classification module. The feature training extraction module calculates the importance degree, distinction degree, and feature keyword weight of every candidate feature word, sorts the candidate feature words by feature keyword weight, and selects the feature words with the largest weights. The feature keyword weights of the selected feature words are normalized and used as weightings, and a website classification vector template is created from the set of selected feature words and their weightings. The website classification module generates a feature space vector from the set of selected feature words and the weightings obtained by the feature training extraction module, and identifies the category of a website by calculating the similarity between this feature space vector and the feature space vector of the website. The system can effectively address the problem of disorganized network information and lets users locate the information they are searching for conveniently and accurately.
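
The template and classification steps in the abstract can be illustrated with a minimal sketch. The abstract does not name the normalization or the similarity measure, so this sketch assumes sum-to-one normalization and cosine similarity, and it assumes a website's vector is built from raw counts of the template words. The names `build_template`, `cosine`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def build_template(weights, k):
    """Keep the k highest-weighted words and normalize their weights to sum to 1."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(w for _, w in top)
    return {word: w / total for word, w in top} if total > 0 else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(key, 0.0) for key, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(site_words, templates):
    """Return (category, similarity) for the best-matching template.

    site_words is the website's token list; templates maps category -> template.
    Returns (None, 0.0) when the site shares no word with any template.
    """
    counts = Counter(site_words)
    best, best_sim = None, 0.0
    for category, template in templates.items():
        # Assumed site vector: counts of the template's words on the site.
        site_vec = {w: counts[w] for w in template if counts[w] > 0}
        sim = cosine(site_vec, template)
        if sim > best_sim:
            best, best_sim = category, sim
    return best, best_sim
```

For example, with one template built from sports words and one from finance words, `classify(["football", "league", "football", "ticket"], templates)` selects the sports template, while a token list that shares no word with any template yields `(None, 0.0)`.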

Description

Technical field

[0001] The invention belongs to the field of data mining and machine learning, and relates to a system for automatically classifying and analyzing websites based on website content.

Background technique

[0002] Since the 1990s, the Internet has developed at an astonishing speed and has come to hold a large amount of original information of many types, including web pages, texts, images, and multimedia. How to find useful information within this vast amount of information has always been one of the main goals of information processing. Website classification can effectively improve the accuracy of Web information retrieval. The classification search engines represented by Yahoo, Sohu, etc. use manual classification methods which, because of their low efficiency, slow update speed, and high maintenance costs, make it difficult to effectively track and manage the large number of dynamically changing websites on the Internet.

Contents of the invention

[0...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/353, G06F16/951
Inventor: 耿伟吴蒙乔波
Owner: NANJING HUGEDATA NETWORK TECH