Short text-oriented optimization classification method

A short text classification method, applicable to special data processing applications, instruments, and electronic digital data processing. It addresses the difficulty of extracting semantic information from short texts, the need for a large external corpus, and the limited improvement in classification accuracy of existing approaches. It strengthens semantic representation, reduces the amount of computation, and improves precision.

Active Publication Date: 2019-07-02
长沙市智为信息技术有限公司

AI Technical Summary

Problems solved by technology

When the vector space model (VSM) is applied to short text classification, the following problems arise:
(1) When VSM computes the semantic similarity between sentences, it ignores the effect of synonyms on that similarity.
(2) When the volume of text data is large, representing text with VSM leads to a severe curse of dimensionality.
(3) Short texts are brief and contain many polysemous and noisy words. Traditional methods often extract too few effective features from them, so the short texts carry little semantic information, which hampers subsequent classification.
Methods based on semantic expansion require a large external corpus and add computational overhead, which again leads to the curse of dimensionality, so their application scenarios are often limited.
Methods based on word vector representation alone, however, yield only limited improvement in classification accuracy.
The main reason is that the word representations produced by traditional word embedding or TF-IDF methods capture only the semantic or statistical information in the current text corpus, while short texts are brief and contain many polysemous and noisy words. With few effective features available, it is difficult to extract enough semantic information.
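
The synonym problem (1) and the dimensionality problem (2) can be illustrated with a minimal sketch. It assumes a plain bag-of-words VSM with cosine similarity; the function names `bow_vector` and `cosine` and the two toy documents are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Represent a token list as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two toy short texts (hypothetical) that mean roughly the same thing
# but share no surface words.
doc_a = ["cheap", "phone", "sale"]
doc_b = ["inexpensive", "mobile", "discount"]

# The VSM dimension equals the vocabulary size, so it grows with the corpus.
vocabulary = sorted(set(doc_a) | set(doc_b))
similarity = cosine(bow_vector(doc_a, vocabulary), bow_vector(doc_b, vocabulary))
```

Because the two texts have no word in common, their count vectors are orthogonal and the similarity is zero, even though a reader would consider them near-synonymous. Every new word also adds one dimension to the vocabulary.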

Examples

Embodiment 1

[0054] This is a specific embodiment of the short text optimization classification method based on feature clustering. The method consists of six main steps:

[0055] Step 1. Obtain the training data and preprocess it as follows:

[0056] A. The training data come from the open-source news corpora released by Fudan and Sogou Labs, comprising more than 200,000 items in six categories: sports, Internet, economy, politics, art, and military;

[0057] B. Add a collected and curated dictionary of online-content words to improve the accuracy of the subsequent word segmentation;

[0058] C. Remove stop words;

[0059] D. Segment the training data to complete the preprocessing.

[0060] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set and select the distinct feature words whose word frequency exceeds the set threshold to construct a ...
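
The preprocessing in Step 1 can be sketched roughly as follows. This is only an illustration: the embodiment does not name its segmenter, so forward maximum matching is used here as a simple stand-in, and the names `segment` and `preprocess`, the toy dictionary, and the stop word list are all hypothetical. The sketch filters stop words after segmentation, which is an assumption about how the listed sub-steps fit together.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text, dictionary, stop_words):
    """Segment the text, then drop stop words and whitespace tokens."""
    return [
        token
        for token in segment(text, dictionary)
        if token not in stop_words and not token.isspace()
    ]


# Hypothetical toy resources, for illustration only.
base_dictionary = {"我们", "喜欢", "足球"}
user_dictionary = {"世界杯"}  # stands in for the added online-content dictionary
stop_words = {"的", "我们"}

text = "我们喜欢世界杯的足球"
with_user_dict = preprocess(text, base_dictionary | user_dictionary, stop_words)
without_user_dict = preprocess(text, base_dictionary, stop_words)
```

With the extra dictionary, "世界杯" survives as one token. Without it, the segmenter splits the word into single characters, which shows why sub-step B can improve segmentation accuracy.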

Embodiment 2

[0086] This is a specific embodiment of the short text optimization classification method based on feature clustering. The method consists of six main steps:

[0087] Step 1. Obtain the training data and preprocess it as follows:

[0088] A. The training data come from the China Mobile SMS data set, which contains about 100,000 items in 5 categories: normal, marketing, advertising, credit card, and others;

[0089] B. Remove stop words;

[0090] C. Segment the training data to complete the preprocessing.

[0091] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set and select the distinct feature words whose word frequency exceeds the set threshold (set to 2 here) to construct the feature item set;

[0092] Step 3. Train on the collected large-scale corpus. The specific steps are:

[0093] A. Collect open source Chinese corpus from Wi...
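
The feature selection in Step 2 of this embodiment can be sketched as below. The sketch assumes that "word frequency" means the total number of occurrences across the whole segmented data set; the function name `select_feature_items` and the toy messages are hypothetical.

```python
from collections import Counter


def select_feature_items(segmented_docs, threshold=2):
    """Return the distinct words whose corpus-wide frequency exceeds the
    threshold, in sorted order so the result is reproducible."""
    counts = Counter(token for doc in segmented_docs for token in doc)
    return sorted(word for word, count in counts.items() if count > threshold)


# Hypothetical, already segmented short messages.
docs = [
    ["loan", "offer", "bank"],
    ["loan", "bank", "card"],
    ["loan", "meeting"],
    ["bank", "offer"],
]

feature_items = select_feature_items(docs, threshold=2)
```

Here "loan" and "bank" each occur three times and are kept, while "offer" (twice), "card", and "meeting" (once each) fall at or below the threshold of 2 and are dropped.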

Abstract

The invention discloses a short text-oriented optimization classification method. The method comprises the following steps: 1, obtaining an original data set and preprocessing it; 2, selecting a feature item set from the preprocessed data set; 3, training a word vector tool on the collected large-scale corpora to obtain a word vector model; 4, representing each feature item in the feature item set as a word vector using the word vector model, and performing primary clustering on the word vectors of the feature items to obtain a plurality of preliminary feature clusters; 5, performing two-stage loose clustering within each preliminary feature cluster to obtain a plurality of similar feature clusters; and 6, replacing the feature words obtained in step 4 with the similar feature clusters obtained in step 5, and then carrying out short text classification using a classifier. Traditional short text classification mostly lacks semantic expression capability and has a feature space of very high dimension. The invention expresses the semantic information of short texts better, reduces the dimension of the feature space, and improves the precision and efficiency of short text classification. It can be applied to short text classification tasks in various fields, such as spam short message classification and microblog topic classification.
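
Steps 4 to 6 of the abstract can be sketched as follows. The abstract does not specify the clustering algorithms, so this sketch makes several assumptions: the primary clustering is plain k-means with a deterministic farthest-point initialization, the loose clustering is read as a greedy second pass that groups words whose cosine similarity to a group's first member reaches a threshold, and the 2-dimensional word vectors are invented toy values. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iterations=20):
    """Primary clustering: k-means seeded with the first vector, then the
    vector farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def loose_clusters(words, word_vectors, threshold):
    """Second pass inside one primary cluster: a word joins the first group
    whose representative (first member) is similar enough, else starts a group."""
    groups = []
    for word in words:
        for group in groups:
            if cosine(word_vectors[word], word_vectors[group[0]]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


def build_feature_clusters(word_vectors, k, threshold):
    """Map every feature word to the label of its similar feature cluster."""
    words = sorted(word_vectors)
    assignment = kmeans([word_vectors[w] for w in words], k)
    mapping = {}
    for c in range(k):
        members = [w for w, a in zip(words, assignment) if a == c]
        for group in loose_clusters(members, word_vectors, threshold):
            for word in group:
                mapping[word] = "cluster:" + group[0]
    return mapping


def to_cluster_features(tokens, mapping):
    """Step 6: replace feature words by cluster labels; unknown words are dropped."""
    return sorted({mapping[t] for t in tokens if t in mapping})


# Invented toy word vectors, for illustration only.
word_vectors = {
    "cheap": [1.0, 0.1], "inexpensive": [0.9, 0.2], "discount": [0.8, -0.5],
    "football": [0.1, 1.0], "soccer": [0.2, 0.9], "referee": [-0.5, 0.8],
}
mapping = build_feature_clusters(word_vectors, k=2, threshold=0.95)
features = to_cluster_features(["inexpensive", "soccer", "unknown"], mapping)
```

In this toy run the six feature words collapse into four cluster features, with "cheap" and "inexpensive" sharing one label and "football" and "soccer" sharing another. A classifier would then be trained on these cluster features instead of the individual words, which is how the feature space shrinks while synonyms are merged.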

Description

Technical field

[0001] The invention belongs to the technical field of Chinese short text classification and relates to an optimized classification method for short texts, in particular to a classification method for network short texts.

Background art

[0002] In the information age of exploding data, increasingly intelligent mobile terminals and the rapid development of Internet technology have led people to communicate more and more frequently on the mobile Internet, generating a large amount of data. Most of this data takes the form of short texts, such as Weibo posts and instant push news. Their content is concise, refined, and rich in meaning, and has high research value. Therefore, how to classify these short texts automatically, to help understand the rich meaning they express, has become a hot and difficult research topic in natural language processing and machine learning...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27; G06K9/62
CPC: G06F40/289; G06F40/30; G06F18/22; G06F18/23213; G06F18/2411
Inventor: 尹垚, 李芳芳, 毛星亮, 施荣华, 石金晶, 胡超
Owner: 长沙市智为信息技术有限公司