Text clustering method and system

A text clustering technique, applied in the field of text clustering, that addresses the poor matching and low accuracy of traditional methods. It improves the clustering result, adapts well to noisy data, and avoids dependence on a single centroid.

Active Publication Date: 2017-01-25
GUANGZHOU SHIYUAN ELECTRONICS CO LTD

AI Technical Summary

Problems solved by technology

[0006] Accordingly, a text clustering method and system are needed to address the low accuracy and poor matching performance of traditional text clustering methods.

Examples

Embodiment 1

[0024] To address the low accuracy and poor matching performance of traditional text clustering methods, the present invention provides Embodiment 1 of a text clustering method. Figure 1 is a schematic flow diagram of Embodiment 1. As shown in Figure 1, the method may include the following steps:

[0025] Step S110: When a text to be classified is received, extract the keywords of that text.

[0026] Step S120: Match the keywords of the text to be classified against the keywords in the obtained final bag of words to obtain the type label of the text. The final bag of words is obtained by sorting and filtering the keywords in each type-label bag of words according to preset selection rules. A type-label bag of words is the set of keywords generated by extracting keywords from every text that corresponds to that type label.

[0027] Specifical...
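Steps S110 and S120 can be illustrated with a minimal Python sketch. The source does not spell out the "preset selection rules" here, so the rule used below (keep the `top_k` most frequent keywords per label and drop keywords shared by several labels) is purely an assumption for illustration, as are the function names `build_final_bag` and `classify` and the sample data.

```python
from collections import Counter

def build_final_bag(label_keywords, top_k=3):
    """Build a final bag of words from per-label keyword lists.

    label_keywords maps each type label to the keywords extracted from
    all texts carrying that label. ASSUMED selection rule: drop keywords
    that occur under more than one label, then keep the top_k most
    frequent remaining keywords per label.
    """
    counts = {label: Counter(words) for label, words in label_keywords.items()}
    # Number of label bags in which each keyword occurs.
    spread = Counter(word for c in counts.values() for word in c)
    final_bag = {}
    for label, c in counts.items():
        unique = [(w, n) for w, n in c.items() if spread[w] == 1]
        unique.sort(key=lambda item: (-item[1], item[0]))  # sort, then filter
        final_bag[label] = {w for w, _ in unique[:top_k]}
    return final_bag

def classify(text_keywords, final_bag):
    """Return the label whose keywords overlap most with the text, or None."""
    keywords = set(text_keywords)
    scores = {label: len(bag & keywords) for label, bag in final_bag.items()}
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else None

# Hypothetical labelled keywords, for illustration only.
label_keywords = {
    "sports": ["ball", "goal", "team", "ball", "news"],
    "tech": ["chip", "code", "chip", "news"],
}
final_bag = build_final_bag(label_keywords)
result = classify(["chip", "team", "code"], final_bag)
```

In this sketch "news" is discarded because it appears under both labels, and a text with no matching keyword is left unlabelled rather than forced into a class.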

Embodiment 2

[0052] To address the low accuracy and poor matching performance of traditional text clustering methods, the present invention also provides Embodiment 2 of a text clustering method. Figure 3 is a schematic flow diagram of Embodiment 2. As shown in Figure 3, the method may include the following steps (generate keywords → construct bags of words from the keywords → adjust the bags of words → classify using the bags of words):

[0053] Step S310: Extract keywords using TFIDF.

[0054] TF can be calculated with the following formula: (number of times the word appears in the document) / (total number of words in the document). The larger the value, the more important the word, that is, the greater its weight.

[0055] For example, if a document contains 500 tokens after word segmentation and the token "Hello" appears 20 times, then the TF value is tf = 20 / 500 = 0.04.
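The TF formula and the worked example above can be checked with a short Python sketch. The function name `term_frequency` and the filler token "other" are illustrative assumptions; only the counts (20 occurrences out of 500 tokens) come from the text.

```python
def term_frequency(term, tokens):
    """TF = (occurrences of term in the document) / (total tokens in the document)."""
    if not tokens:
        return 0.0  # avoid division by zero for an empty document
    return tokens.count(term) / len(tokens)

# Reproduce the example: 500 tokens in total, "Hello" appears 20 times.
tokens = ["Hello"] * 20 + ["other"] * 480
tf = term_frequency("Hello", tokens)
print(tf)
```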

[0056] IDF c...

Abstract

The invention relates to a text clustering method and system. The method comprises the following steps: when a text to be classified is received, its keywords are extracted; the keywords of the text are matched against the keywords in an obtained final bag of words to obtain the type label of the text. The final bag of words is obtained by sorting and filtering the keywords in each type-label bag of words according to preset selection rules; a type-label bag of words is the set of keywords generated by extracting keywords from the texts corresponding to that type label. Because the keywords for each label are extracted from records that already carry labels and then combined into the final bag of words, and texts are classified against the keywords in that bag, the method adapts well to noisy data and avoids a sharp drop in accuracy when noise is heavy. Approximate string matching is also greatly improved by applying a wide-range threshold to the centroid.

Description

Technical field

[0001] The invention relates to the technical field of digital text mining, in particular to a text clustering method and system.

Background

[0002] Traditional text clustering is mainly based on the Rocchio algorithm with TFIDF (term frequency-inverse document frequency) weighting. The Rocchio algorithm comes from vector space model theory. The basic idea of the vector space model is to represent a text as a vector, so that subsequent processing can be converted into operations on vectors in space. TFIDF-based Rocchio is one implementation of this idea: an N-dimensional vector represents the text, the dimension N is the number of features, and each vector component is the weight of a feature. The method used to calculate this weight is called the TFIDF method. With the TFIDF method, the texts in the training set are first expressed as vectors, and then the cate...
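The background passage is cut off before it finishes describing the Rocchio step, so the following Python sketch shows only the commonly known form of a Rocchio (nearest-centroid) classifier, not necessarily the variant the patent discusses. It assumes each text has already been turned into a weight vector (for example by TFIDF); the names `centroid`, `cosine`, and `rocchio_classify` and the sample vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length weight vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rocchio_classify(vector, training):
    """Assign the label whose class centroid is most similar to the vector."""
    centroids = {label: centroid(vs) for label, vs in training.items()}
    return max(sorted(centroids), key=lambda label: cosine(vector, centroids[label]))

# Hypothetical, already-weighted training vectors over three features.
training = {
    "sports": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "tech": [[0.0, 0.2, 0.8], [0.1, 0.1, 0.9]],
}
label = rocchio_classify([0.1, 0.2, 0.7], training)
```

Each class is summarized by exactly one centroid in this baseline, which is the property the summary above describes the invention as avoiding.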

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35, G06F40/289
Inventors: 李贤, 陈振安, 王鹏
Owner: GUANGZHOU SHIYUAN ELECTRONICS CO LTD