Text classification based on tfidf algorithm and related word weight correction

A text classification technique with related word weight correction, applied in the computer field. It addresses the problems that the plain tfidf weight adjustment cannot effectively reflect the importance of words or the positional distribution of feature words, which limits the accuracy of text classification. The stated effects are a short processing cycle and efficient, accurate, and comprehensive extraction of category keywords.

Active Publication Date: 2018-01-26
北京东方通网信科技有限公司

AI Technical Summary

Problems solved by technology

[0011] The disadvantage of using the tfidf algorithm for text classification is as follows. The algorithm assumes that the lower a word's document frequency, the better the word distinguishes between types of text, so it introduces the inverse document frequency (IDF) to adjust the TF weight. The purpose o...

Examples

Detailed Description of the Embodiment

[0046] As shown in Figure 1, a text classification method based on the tfidf algorithm and related word weight correction includes the following steps:

[0047] S1: Extract category keywords from part of the training data, or take them from keywords provided by the user;

[0048] S2: Arrange the word segmentation results of the text into a sliding text window, set the weight of each word, and correct its position in the sliding text window;

[0049] S3: Calculate the word frequency of each word from its weight and its position in the sliding text window, using the word frequency statistical correction function;

[0050] S4: Weight each word of the text according to the TFIDF algorithm to vectorize the text;

[0051] S5: Once the text is vectorized, classify it with an SVM classifier.
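
Steps S2 and S3 can be illustrated with a minimal Python sketch. The source does not give the word frequency statistical correction function, so the form used here is an assumption made only for illustration: each occurrence contributes its word weight multiplied by a position factor that decays with the index of the window the word falls into. The function name `corrected_term_frequency` and the parameters `keyword_weight`, `window_size`, and `decay` are hypothetical.

```python
from collections import defaultdict


def corrected_term_frequency(tokens, category_keywords,
                             keyword_weight=2.0, window_size=4, decay=0.5):
    """Weight- and position-corrected term frequency (hypothetical form)."""
    totals = defaultdict(float)
    for index, word in enumerate(tokens):
        # S2: the window a word falls into stands in for its position.
        window_index = index // window_size
        # Assumption: words in earlier windows count more than later ones.
        position_factor = 1.0 / (1.0 + decay * window_index)
        # Assumption: category keywords get a fixed boost, other words weight 1.
        weight = keyword_weight if word in category_keywords else 1.0
        # S3: each occurrence adds weight * position factor instead of 1.
        totals[word] += weight * position_factor
    norm = sum(totals.values())
    if norm == 0:
        return {}
    # Normalize so the corrected frequencies of one text sum to 1.
    return {word: value / norm for word, value in totals.items()}


tokens = ["stock", "market", "falls", "as", "bank", "rates", "rise", "again"]
tf = corrected_term_frequency(tokens, category_keywords={"stock", "bank"})
print(sorted(tf.items(), key=lambda item: -item[1]))
```

In this sketch a category keyword in the first window ("stock") outranks the same kind of keyword in a later window ("bank"), which in turn outranks ordinary words. The resulting values would replace the raw TF term before the IDF weighting of step S4.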

[0052] Compared with the traditional tfidf algorithm, the acc...

Abstract

The invention relates to a text classification method based on the tfidf algorithm and related word weight correction. The method comprises the steps of: S1, extracting category keywords; S2, forming a sliding text window, setting word weights, and correcting each word's position in the sliding text window; S3, calculating the word frequency of each word according to a word frequency statistical correction function; S4, performing a weighted calculation according to the TFIDF algorithm to vectorize the words in the text; and S5, classifying the text with an SVM classifier. During classification, the weight of the category keywords is increased, so that the vectorized text better reflects the information in the text. The method introduces a sliding text window and takes full account of the position of words in the text. The category keywords come partly from training data and partly from users. They are extracted using the tfidf algorithm, which captures the characteristics of the keywords efficiently and accurately while also accommodating real application scenarios in which only a few category keywords are available, so that the category keywords are extracted comprehensively and accurately.
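
The keyword extraction of step S1 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: each category is treated as one pseudo-document built from its training texts, words are scored with a textbook TF-IDF over those pseudo-documents, and the top-scoring words are merged with any user-provided keywords. The function name `extract_category_keywords` and the parameters `top_k` and `user_keywords` are hypothetical.

```python
import math
from collections import Counter


def extract_category_keywords(docs_by_category, user_keywords=None, top_k=2):
    """Pick the top_k highest-scoring words per category, plus user keywords."""
    # Assumption: all training tokens of a category form one pseudo-document.
    counts = {cat: Counter(tok for doc in docs for tok in doc)
              for cat, docs in docs_by_category.items()}
    n_categories = len(counts)
    # Category frequency: in how many categories each word appears.
    cat_freq = Counter(word for c in counts.values() for word in c)
    keywords = {}
    for cat, c in counts.items():
        total = sum(c.values())
        scores = {w: (n / total) * math.log(n_categories / cat_freq[w])
                  for w, n in c.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top = {w for w in ranked[:top_k] if scores[w] > 0}
        # User-provided keywords supplement categories with few extracted ones.
        keywords[cat] = top | set((user_keywords or {}).get(cat, ()))
    return keywords


training = {
    "finance": [["stock", "market", "report"], ["bank", "stock", "report"]],
    "sports": [["match", "goal", "report"], ["team", "goal", "match"]],
}
keywords = extract_category_keywords(training, user_keywords={"finance": ["loan"]})
print(keywords)
```

A word such as "report" that appears in every category scores zero and is never selected, while the user-supplied "loan" is kept even though it never occurs in the training data.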

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a text classification method based on the tfidf algorithm and related word weight correction.

Background

[0002] In the prior art, the common approach to text classification is to use the tfidf algorithm to calculate the weights of related words and vectorize them.

[0003] The tfidf algorithm was proposed by Salton in 1988. Its core idea is that words appearing frequently within one text but rarely across other texts should be given higher weights. Term frequency (TF) describes how well a word reflects the content of a document; inverse document frequency (IDF) measures how well a word distinguishes between documents. The calculation formulas are as follows:

[0004]

[0005]

[0006] $$\mathrm{TF\_IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$$

[0007] no (i,...
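
The weighting in [0003] and [0006] can be illustrated with a short Python sketch. The TF and IDF formulas at [0004] and [0005] are missing from this copy, so the sketch assumes the common textbook definitions: TF is a word's count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # TF_IDF(i, j) = TF(i, j) * IDF(i), with the assumed definitions.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = [["rate", "rate", "bank"], ["goal", "bank"], ["goal", "team"]]
for vector in tf_idf(docs):
    print(vector)
```

In the first document, "rate" is both frequent locally and absent elsewhere, so it receives a higher weight than "bank", which also occurs in the second document. A word that occurs in every document gets weight zero under these definitions.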

Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 黄永军
Owner: 北京东方通网信科技有限公司