An Optimal Classification Method for Short Texts

A short text classification method, applied in semantic analysis, instruments, computing, and related fields. It addresses the difficulty of extracting semantic information from short texts, the limited improvement in classification accuracy, polysemy, and noise, and has the effect of reducing the amount of calculation.

Active Publication Date: 2021-07-27
长沙市智为信息技术有限公司

AI Technical Summary

Problems solved by technology

When the vector space model (VSM) is applied to short text classification, the following problems arise:
(1) When VSM calculates the semantic similarity between sentences, it does not consider the influence of synonyms on that similarity.
(2) When there is a large amount of text data, representing text with VSM leads to a severe curse of dimensionality.
(3) Short texts are usually short in length and contain many polysemous and noisy words. Traditional methods often extract too few effective features from them, so little semantic information is represented, which hinders subsequent classification.
Methods based on semantic expansion require a large external corpus and increase computational overhead, which brings about the curse of dimensionality, so their application scenarios are often limited.
Methods based on word vector representation alone, however, yield only a limited improvement in classification accuracy. The main reason is that the word representations obtained by traditional word embedding or TF-IDF methods contain only the semantic or statistical information in the current text corpus, while short texts are short and contain many polysemous and noisy words. With so few effective features, it is difficult to extract enough semantic information.
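
Problems (1) and (2) can be illustrated with a small sketch. It is not taken from the patent: the sentences, the bag-of-words weighting, and the helper names `bow_vector` and `cosine` are assumptions chosen only to show how a plain VSM treats synonyms as unrelated dimensions and how the vector length grows with the vocabulary.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Represent a tokenized sentence as raw term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy sentences: s1 and s2 mean the same thing but share no words.
s1 = ["buy", "cheap", "phone"]
s2 = ["purchase", "inexpensive", "mobile"]
s3 = ["buy", "cheap", "tickets"]

# One dimension per distinct word, so the vector length grows with the corpus.
vocabulary = sorted(set(s1) | set(s2) | set(s3))
v1, v2, v3 = (bow_vector(s, vocabulary) for s in (s1, s2, s3))

sim_synonyms = cosine(v1, v2)  # no shared surface words
sim_overlap = cosine(v1, v3)   # two shared surface words
```

Here the two synonymous sentences receive a similarity of zero, while the sentence on a different topic scores higher simply because it repeats two surface words.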

Examples

Embodiment 1

[0054] This embodiment is a specific embodiment of a method for optimal classification of short texts based on feature clustering. The present invention is mainly divided into six steps:

[0055] Step 1. Obtain training data, and preprocess the training data using the following steps:

[0056] A. The training data comes from the open source news corpora released by Fudan and Sogou Labs, with more than 200,000 records in six categories: sports, Internet, economy, politics, art, and military;

[0057] B. Add a collected and curated dictionary of online vocabulary to improve the accuracy of subsequent word segmentation;

[0058] C. Remove stop words;

[0059] D. Segment the training data and complete the preprocessing.

[0060] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set, and select the distinct feature words whose frequency exceeds the set threshold to construct a ...
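
The preprocessing in Step 1 can be sketched as follows. The source does not name a segmentation tool, so this sketch swaps in a simple forward maximum matching segmenter; the dictionaries, the stop word list, the sample sentence, and the function names `segment` and `preprocess` are all hypothetical. It also assumes stop words are filtered after segmentation, since the source lists the sub-steps without detailing how they interleave.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary
    word, falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def preprocess(text, dictionary, stop_words):
    """Segment the text, then drop tokens that appear in the stop word list."""
    return [tok for tok in segment(text, dictionary) if tok not in stop_words]


# Hypothetical resources, for illustration only.
base_dict = {"这个", "视频"}
online_dict = {"给力"}          # an online-vocabulary entry added in sub-step B
stop_words = {"这个", "真"}
text = "这个视频真给力"

without_online = preprocess(text, base_dict, stop_words)
with_online = preprocess(text, base_dict | online_dict, stop_words)
```

Without the online vocabulary entry the segmenter breaks the unknown word into single characters; with it, the word survives as one feature, which is the accuracy gain sub-step B aims for.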

Embodiment 2

[0086] This embodiment is a specific embodiment of a method for optimal classification of short texts based on feature clustering. The present invention is mainly divided into six steps:

[0087] Step 1. Obtain training data, and preprocess the training data using the following steps:

[0088] A. The training data comes from the China Mobile SMS data set, about 100,000 records in total across 5 categories: normal, marketing, advertising, credit card, and others;

[0089] B. Remove stop words;

[0090] C. Segment the training data and complete the preprocessing.

[0091] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set, and select the distinct feature words whose frequency exceeds the set threshold (set to 2 here) to construct the feature item set;

[0092] Step 3. Train on the collected large-scale corpus. The specific steps are:

[0093] A. Collect open source Chinese corpus from Wi...
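
The feature item selection in Step 2 of this embodiment can be sketched as follows. This is a minimal illustration, assuming the threshold test is a strict "greater than" and that frequency is counted over the whole segmented data set; the function name `build_feature_item_set` and the toy messages are hypothetical.

```python
from collections import Counter


def build_feature_item_set(segmented_docs, threshold=2):
    """Return the distinct words whose corpus-wide frequency exceeds the threshold."""
    counts = Counter(tok for doc in segmented_docs for tok in doc)
    return sorted(word for word, freq in counts.items() if freq > threshold)


# Hypothetical segmented short messages, for illustration only.
docs = [
    ["free", "data", "claim"],
    ["data", "plan", "signup"],
    ["free", "data", "plan"],
    ["credit", "bill"],
    ["free", "plan"],
]

feature_items = build_feature_item_set(docs, threshold=2)
```

Words that occur only once or twice are dropped, so rare and noisy tokens never enter the feature space used in the later steps.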

Abstract

The invention discloses an optimized classification method for short texts. Step 1: obtain an original data set and preprocess it. Step 2: select a feature item set from the preprocessed data set. Step 3: use a word vector tool to train on the collected large-scale corpus to obtain a word vector model. Step 4: use the word vector model to represent each feature item in the feature item set as a word vector, and perform a preliminary clustering of these word vectors to obtain several preliminary feature clusters. Step 5: perform two-stage loose clustering within each preliminary feature cluster to obtain several similar feature clusters. Step 6: replace the feature words obtained in Step 4 with the similar feature clusters obtained in Step 5, and then use a classifier for short text classification. Most traditional short text classification methods lack semantic expressiveness, and the dimension of their feature space is relatively high. The present invention better expresses the semantic information of short texts while reducing the dimension of the feature space, thereby improving the accuracy and efficiency of short text classification. It can be used in short text classification tasks in various fields, such as spam SMS classification and Weibo topic classification.
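
The idea behind Steps 4 to 6 can be sketched as follows. This excerpt does not describe the preliminary or two-stage loose clustering in detail, so the sketch swaps in a simple greedy clustering with a cosine threshold as a stand-in. The 2-D word vectors, the threshold value, and the function names are hypothetical, and the classifier itself is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cluster_features(word_vectors, threshold=0.9):
    """Greedy single-pass clustering (a stand-in, not the patent's algorithm):
    a word joins the first cluster whose seed vector is similar enough,
    otherwise it starts a new cluster."""
    seeds = []
    assignment = {}
    for word, vec in word_vectors.items():
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment[word] = cluster_id
                break
        else:
            assignment[word] = len(seeds)
            seeds.append(vec)
    return assignment


def to_cluster_counts(tokens, assignment, n_clusters):
    """Replace each feature word by its cluster and count cluster occurrences."""
    counts = [0] * n_clusters
    for tok in tokens:
        if tok in assignment:
            counts[assignment[tok]] += 1
    return counts


# Hypothetical word vectors; a real model would come from Step 3.
word_vectors = {
    "buy": (1.0, 0.1),
    "purchase": (0.9, 0.2),
    "cheap": (0.1, 1.0),
    "inexpensive": (0.2, 0.9),
}

assignment = cluster_features(word_vectors)
n_clusters = len(set(assignment.values()))
doc_a = to_cluster_counts(["buy", "cheap"], assignment, n_clusters)
doc_b = to_cluster_counts(["purchase", "inexpensive"], assignment, n_clusters)
```

In this toy case four feature words collapse into two clusters, and two texts that share no surface words receive the same representation, which is the combination of dimension reduction and better semantic expression that the abstract claims.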

Description

Technical field

[0001] The invention belongs to the technical field of Chinese short text classification, and relates to an optimized classification method for short texts, in particular to a classification method for network short texts.

Background art

[0002] In the information age of data explosion, the intelligence of mobile terminals and the rapid development of Internet technology have prompted people to communicate more and more frequently on the mobile Internet, generating a large amount of information data. Most of this data takes the form of short texts as the carrier of information, such as Weibo posts and instant push news. The content is concise, refined, and rich in meaning, and has high research value. Therefore, how to automatically classify these short texts to help understand the rich meanings they express has become a hot and difficult research topic in the fields of natural language processing and machine learning...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/289, G06F40/30, G06K9/62
CPC: G06F40/289, G06F40/30, G06F18/22, G06F18/23213, G06F18/2411
Inventor: 李芳芳, 尹垚, 毛星亮, 施荣华, 石金晶, 胡超
Owner: 长沙市智为信息技术有限公司