Method for automatically classifying webpage content visited by Internet users

An automatic web content classification technology, applied in the computer field, that addresses sparse text data, small classification scale, and high data dimensionality, achieving accurate classification while reducing the dimensionality of the feature space.

Active Publication Date: 2015-02-04
ASIAINFO TECH NANJING
4 Cites, 36 Cited by

AI Technical Summary

Problems solved by technology

[0008] The technical problem to be solved by the present invention is as follows: with the development of Internet technology, existing automatic text classification systems cannot cope with problems such as small classification scale, sparse text data, and high data dimensionality when classifying web page text at large scale, and their classification quality degrades after they have been running for a period of time.

Examples

Embodiment Construction

[0041] The present invention is built on the statistical learning theory of support vector machines, combined with the majority voting strategy of a decision forest. It performs supervised machine learning on a limited set of samples of webpage content accessed by Internet users, builds a decision system from multiple classifiers, and finally acquires new webpage samples adaptively and retrains the classifiers automatically at regular intervals. This classification system has a rigorous theoretical basis and also handles practical problems well: small samples, conversion of nonlinear problems to linear ones, sparse data, high-dimensional data, long classifier training times, and local minima. The decision system also solves the problem of inaccurate classification caused by relying on a single classifier. Since many operations can use the parallel MapReduce architecture, classifier training time is greatly reduced. The analysis is f...
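
The majority voting step described in [0041] can be illustrated with a minimal sketch. The names `keyword_classifier`, `majority_vote`, and `ensemble` are hypothetical, and the toy keyword rules are only stand-ins for the trained support vector machine classifiers that the described system would use; the patent text shown here does not specify the voting or tie-breaking details, so those are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# A classifier maps page text to a category label. In the described system each
# member would be a trained SVM; here they are hypothetical keyword rules.
Classifier = Callable[[str], str]


def keyword_classifier(rules: Dict[str, Sequence[str]], default: str = "other") -> Classifier:
    """Build a toy classifier that picks the label with the most keyword hits."""
    def classify(text: str) -> str:
        lowered = text.lower()
        hits = {label: sum(k in lowered for k in kws) for label, kws in rules.items()}
        best = max(hits, key=hits.get)
        # Fall back to the default label when no keyword matched at all.
        return best if hits[best] > 0 else default
    return classify


def majority_vote(classifiers: List[Classifier], text: str) -> str:
    """Return the label predicted by the largest number of classifiers.

    Assumption: ties go to the label that was voted for first
    (Counter.most_common keeps first-seen order among equal counts).
    """
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three independent (toy) classifiers forming the decision system.
ensemble = [
    keyword_classifier({"sports": ["match", "league"], "finance": ["stock", "bank"]}),
    keyword_classifier({"sports": ["team", "goal"], "finance": ["market", "shares"]}),
    keyword_classifier({"sports": ["score"], "finance": ["price", "stock"]}),
]

print(majority_vote(ensemble, "The team won the league match"))
print(majority_vote(ensemble, "Stock market shares fell"))
```

In the first call the third classifier finds no keyword and votes for the default label, yet the ensemble still agrees on a category, which is the sense in which a voting system can compensate for the errors of a single classifier.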

Abstract

The invention discloses a method for automatically classifying webpage content visited by Internet users. The method comprises the following steps: carrying out machine learning on a limited set of webpage content samples accessed by Internet users, using a text classification technique based on support vector machines and decision forest technology; then building a decision system from a plurality of classifiers; and finally obtaining new webpage samples adaptively and automatically classifying them. The method handles well the practical problems of small samples, conversion of nonlinear problems to linear ones, data sparseness, high data dimensionality, long classifier training time, and local minima. The decision system solves the problem of inaccurate classification caused by a single classifier. Many operations can use a parallel MapReduce structure, so the classifier training time is greatly shortened. The classification process can also analyze mobile Internet webpage content within milliseconds and finally assign it to a predefined class.
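
The abstract does not say which operations are run on MapReduce. As a hedged illustration only, the sketch below assumes that one such operation is counting in how many pages each term occurs (document frequency), a common preprocessing statistic for text classification. The function names `map_document`, `reduce_counts`, and `document_frequency` are hypothetical, and the phases run serially here; on a real cluster the map calls would be distributed across workers.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_document(text: str) -> Counter:
    """Map phase: emit each distinct term of one page exactly once."""
    return Counter(set(text.lower().split()))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce phase: merge two partial counts.

    Addition is associative and commutative, so partial results can be
    combined in any order, which is what makes the job parallelizable.
    """
    return left + right


def document_frequency(pages: Iterable[str]) -> Counter:
    """Run the map phase over every page, then fold the partial counts."""
    return reduce(reduce_counts, map(map_document, pages), Counter())


# Hypothetical sample pages, whitespace-tokenized for simplicity.
pages = ["news sports football", "sports scores today", "finance news"]
df = document_frequency(pages)
print(df["sports"], df["news"], df["finance"])
```

Because each page is mapped independently and the merge step is order-insensitive, the same two functions could be handed to a distributed framework without changing their logic.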

Description

Technical field

[0001] The invention belongs to the field of computer technology, relates to network technology, and is a method for automatically classifying web page content accessed by Internet users.

Background

[0002] With the rapid development of the mobile Internet, and faced with hundreds of millions of pieces of information, people can no longer rely on processing all of it manually; they need auxiliary tools to help them discover, filter, and manage these information resources. Information mining has become a bottleneck in the development of science and technology and in the further improvement of quality of life. Automatic text classification, as the basis of such mining, has become a major research focus in modern information processing.

[0003] Automatic text classification systems have gone through three milestone stages:

[0004] Stage 1: Knowledge engineering method. At the beginning, most of them used the m...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 孙洋
Owner: ASIAINFO TECH NANJING