Text classification method for Chinese web pages based on stream clustering

A web page text technology applied to the clustering of massive web page texts. It addresses the uncertain dimensionality of data units and the resulting difficulty of analysis, and achieves high operating speed, high processing efficiency, and wide coverage.

Status: Inactive
Publication Date: 2010-06-09
TSINGHUA UNIV
0 Cites, 51 Cited by

AI Technical Summary

Problems solved by technology

[0004] The main problem in applying the general stream clustering method to web page texts is that the characteristic information of a web page text includes the title, author, publication time, and so on in addition to the body text. After preprocessing, the data units of web page texts are often high-dimensional with an uncertain number of dimensions, which makes them more difficult to analyze.

Embodiment Construction

[0029] The Chinese web page text classification method based on stream clustering proposed by the present invention, and an embodiment thereof, are described in detail as follows:

[0030] First, define a single text structure consisting of the title vector, label vector, text vector, author vector, related link vector and publication time of the text;

[0031] A text class is the collection of texts $P_1, P_2, \ldots, P_n$, with corresponding publication times $T_1, T_2, \ldots, T_n$ (in days), that have arrived by a certain time t. The class structure is composed of multiple feature vectors, a class weight, and an update time, expressed as (feature vectors, ω, t). The feature vectors are, respectively, the weighted linear sums of the title vectors, label vectors, text vectors, author vectors, and related blog post link vectors of all texts in the class; ω represents the weight of the class; $f(t) = 2^{-\lambda t}$ is the decay function (the recommended value is λ = 0.1, which gives a half-life of 10 days); and t is the publication date of the text cl...
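
The text and class structures in [0030] and [0031] can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Text`, `TextClass`, `decay`, and `add` are hypothetical, sparse dictionaries stand in for the feature vectors, and the update rule (fade the accumulated sums by f(t), then add the new text) is an assumption, since the source text is truncated before it states the exact weighting.

```python
from dataclasses import dataclass, field
from typing import Dict

Vector = Dict[str, float]  # sparse vector: term -> weight
FIELDS = ("title", "labels", "body", "author", "links")


def decay(age_days: float, lam: float = 0.1) -> float:
    """Decay function f(t) = 2^(-lam * t); lam = 0.1 gives a 10-day half-life."""
    return 2.0 ** (-lam * age_days)


@dataclass
class Text:
    """Single text structure from [0030]; field names are hypothetical."""
    title: Vector = field(default_factory=dict)
    labels: Vector = field(default_factory=dict)
    body: Vector = field(default_factory=dict)
    author: Vector = field(default_factory=dict)
    links: Vector = field(default_factory=dict)
    published: float = 0.0  # publication time, in days


@dataclass
class TextClass:
    """Class structure from [0031]: feature vectors, weight, update time."""
    title: Vector = field(default_factory=dict)
    labels: Vector = field(default_factory=dict)
    body: Vector = field(default_factory=dict)
    author: Vector = field(default_factory=dict)
    links: Vector = field(default_factory=dict)
    weight: float = 0.0
    updated: float = 0.0

    def add(self, text: Text) -> None:
        # Assumed update rule: decay both the stored sums and the incoming
        # text to the newer of the two timestamps, then accumulate.
        now = max(self.updated, text.published)
        fade_old = decay(now - self.updated)
        fade_new = decay(now - text.published)
        for name in FIELDS:
            acc = getattr(self, name)
            for term in acc:
                acc[term] *= fade_old
            for term, w in getattr(text, name).items():
                acc[term] = acc.get(term, 0.0) + fade_new * w
        self.weight = self.weight * fade_old + fade_new
        self.updated = now
```

Because the class keeps only decayed sums, a new text can be folded in without revisiting earlier texts, which is what makes the representation convenient for incremental updates.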

Abstract

The invention relates to a text classification method for Chinese web pages based on stream clustering, belonging to the technical field of Internet data mining. The method comprises the following steps: acquiring a web page in real time; removing unprocessed format tags from the web page and parsing the characteristic information of the web page text; segmenting the text content into n-grams to form a plurality of word strings; computing the weight of each word string; extracting the word strings with high weights and using them, together with their weights, as the feature vector; computing the similarity between the feature vector and characteristic information of the text and each known class; computing the total similarity and, based on it, either assigning the text to a known class or establishing a new class; judging, according to the number of feature items of the known class, whether the known class should be divided into two subclasses; and storing the processed text records and the information of the known classes. The method fully exploits the useful information in web page texts according to their characteristics, and is incremental, fast, effective, and practical.
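
The segmentation, weighting, similarity, and assignment steps in the abstract can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented procedure: the abstract does not give the weight formula, the similarity measure, or the threshold, so relative n-gram frequency, cosine similarity, and the `threshold` value are all stand-ins, and every function name is hypothetical. Similarity over the title, author, and other characteristic information, as well as the class-splitting step, are omitted.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def feature_vector(text: str, n: int = 2, top_k: int = 5) -> Vector:
    """Keep the top_k n-grams; weight = relative frequency (an assumption)."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(vec: Vector, classes: List[Vector], threshold: float = 0.5) -> int:
    """Merge vec into the most similar class, or start a new class.

    Returns the index of the class the text ended up in.
    """
    best, best_sim = -1, 0.0
    for i, cls in enumerate(classes):
        sim = cosine(vec, cls)
        if sim > best_sim:
            best, best_sim = i, sim
    if best >= 0 and best_sim >= threshold:
        for term, w in vec.items():
            classes[best][term] = classes[best].get(term, 0.0) + w
        return best
    classes.append(dict(vec))
    return len(classes) - 1
```

Each incoming text is compared only against the class summaries, never against earlier texts, so one pass over the stream suffices.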

Description

Technical field

[0001] The invention belongs to the technical field of Internet data mining, and in particular relates to a clustering method for massive web page texts.

Background

[0002] With the rapid development and spread of computer network technology, the volume of network data has expanded rapidly. These data are updated quickly, are huge in volume, and are irregularly organized, but they also contain a great deal of valuable information. How to extract useful information from such massive data has become a focus of attention.

[0003] To classify massive data effectively, the approach mainly used at present is stream clustering. Each data item is assigned to a class, and a class is represented by a weighted combination of the feature information of the data in that class, which makes it easy to update the class.

[0004] The main problem in applying this general stream clustering method ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 卞小丁, 袁睿翕, 孙立远
Owner: TSINGHUA UNIV