An Optimal Classification Method for Short Texts
A short text classification method, applicable to semantic analysis and computing. It addresses the difficulty of extracting semantic information from short texts, the limited accuracy gains of existing methods, polysemy, and noise, while reducing the amount of calculation.
Active Publication Date: 2021-07-27
长沙市智为信息技术有限公司
AI Technical Summary
Problems solved by technology
When the vector space model (VSM) is applied to short text classification, the following problems arise:
(1) When VSM calculates the semantic similarity between sentences, it does not consider the effect of synonyms on that similarity.
(2) When there is a large amount of text data, representing text with VSM causes a severe curse of dimensionality.
(3) Short texts are brief and contain many polysemous and noisy words. Traditional methods often extract too few effective features, so the short text carries little semantic information, which hinders subsequent classification.
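Problems (1) and (2) can be seen in a few lines. The following is a minimal sketch, assuming a plain bag-of-words VSM with cosine similarity; the toy sentences and the function names are hypothetical and do not come from the patent.

```python
import math
from collections import Counter

def bow_vector(tokens):
    """Bag-of-words term-frequency vector stored as a sparse Counter."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy sentences: s1 and s2 mean the same thing but share no words.
s1 = ["buy", "cheap", "phone"]
s2 = ["purchase", "inexpensive", "handset"]
s3 = ["buy", "cheap", "ticket"]

sim_synonyms = cosine(bow_vector(s1), bow_vector(s2))  # no shared terms
sim_overlap = cosine(bow_vector(s1), bow_vector(s3))   # two shared terms
vocab_size = len(set(s1) | set(s2) | set(s3))          # one dimension per distinct word
```

The synonymous pair scores zero because the vectors share no dimension, while the pair that merely shares surface words scores higher. The vocabulary size also grows with every new distinct word, which is the dimensionality problem in miniature.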
Semantic expansion methods require a large external corpus and increase computational overhead, which leads to the curse of dimensionality, so their application scenarios are often limited.
Methods based on word vector representation alone, however, yield only limited improvement in classification accuracy.
The main reason is that word representations obtained by traditional word embedding or TF-IDF contain only the semantic or statistical information of the current text corpus. Because short texts are brief and contain many polysemous and noisy words, they offer few effective features, which makes it difficult to extract enough semantic information.
Examples
Embodiment 1
[0054] This embodiment is a specific embodiment of an optimized classification method for short texts based on feature clustering. The method is divided into six main steps:
[0055] Step 1. Obtain the training data and preprocess it as follows:
[0056] A. The training data comes from the open source news corpus released by Fudan and Sogou Labs. It contains more than 200,000 records in six categories: sports, Internet, economy, politics, art, and military;
[0057] B. Add a collected and curated dictionary of online vocabulary to improve the accuracy of subsequent word segmentation;
[0058] C. Remove stop words;
[0059] D. Segment the training data to complete the preprocessing.
[0060] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set and select the unique feature words whose word frequency is greater than the set threshold to construct a ...
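The preprocessing in Step 1 can be sketched as follows. This is an illustrative stand-in, not the embodiment's implementation: the corpus is Chinese and would need a real word segmenter, whereas this sketch uses English text, a whitespace-based `segment` function, and a hypothetical stop-word list and custom dictionary. It also segments first and filters stop words afterwards, which is an assumption about the order.

```python
import re

# Hypothetical stop-word list and custom dictionary; the source names neither.
STOP_WORDS = {"the", "a", "of", "is", "to"}
CUSTOM_TERMS = ["short text", "word vector"]

def segment(text, custom_terms=CUSTOM_TERMS):
    """Stand-in segmenter: keep custom multi-word terms whole, then split into words."""
    text = text.lower()
    for term in custom_terms:
        # Join the words of a dictionary term so it survives as one token.
        text = text.replace(term, term.replace(" ", "_"))
    return re.findall(r"[a-z0-9_]+", text)

def preprocess(text):
    """Segment the text and drop stop words."""
    return [tok for tok in segment(text) if tok not in STOP_WORDS]

tokens = preprocess("The word vector is a representation of the short text")
```

Here the custom dictionary plays the role described in item B: without it, "word vector" and "short text" would each be split into two unrelated tokens.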
Embodiment 2
[0086] This embodiment is a specific embodiment of an optimized classification method for short texts based on feature clustering. The method is divided into six main steps:
[0087] Step 1. Obtain the training data and preprocess it as follows:
[0088] A. The training data comes from the China Mobile SMS data set, which has 5 categories (normal, marketing, advertising, credit card, and others) and about 100,000 records in total;
[0089] B. Remove stop words;
[0090] C. Segment the training data to complete the preprocessing.
[0091] Step 2. For the training data obtained in step 1, traverse each feature word in the segmented data set and select the unique feature words whose word frequency is greater than the set threshold (set to 2 here) to construct the feature item set;
[0092] Step 3. Train on the collected large-scale corpus, using the following steps:
[0093] A. Collect open source Chinese corpus from Wi...
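Step 2 of this embodiment (frequency threshold of 2, no repeated feature words) can be sketched as below. The function name `select_feature_items` and the toy documents are hypothetical; the sketch assumes that word frequency means the total count across the whole data set, which the text does not state explicitly.

```python
from collections import Counter

def select_feature_items(segmented_docs, min_freq=2):
    """Return the unique words whose corpus-wide frequency is greater than min_freq."""
    counts = Counter(tok for doc in segmented_docs for tok in doc)
    # Strictly greater than the threshold, as the step is worded; sorted for determinism.
    return sorted(word for word, freq in counts.items() if freq > min_freq)

# Hypothetical segmented SMS messages.
docs = [
    ["bank", "card", "offer"],
    ["bank", "card", "bill"],
    ["bank", "offer", "card"],
    ["meeting", "bill"],
]
feature_items = select_feature_items(docs, min_freq=2)
```

Words that occur exactly twice ("offer", "bill") are excluded because the step requires a frequency greater than the threshold, not equal to it.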
Abstract
The invention discloses an optimized classification method for short texts:
Step 1: Obtain an original data set and preprocess it.
Step 2: Select a feature item set from the preprocessed data set.
Step 3: Use a word vector tool to train on the collected large-scale corpus to obtain a word vector model.
Step 4: Use the word vector model to represent each feature item in the feature item set as a word vector, and perform a preliminary clustering of these word vectors to obtain several preliminary feature clusters.
Step 5: Perform two-stage loose clustering within each preliminary feature cluster to obtain several similar feature clusters.
Step 6: Replace the feature words obtained in Step 4 with the similar feature clusters obtained in Step 5, and then use a classifier for short text classification.
Most traditional short text classification methods lack semantic expression ability, and the dimension of their feature space is relatively high. The present invention expresses the semantic information of short texts better while reducing the dimension of the feature space, thereby improving the accuracy and efficiency of short text classification. It can be used in short text classification tasks in various fields, such as spam SMS classification and Weibo topic classification.
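The idea behind Steps 4 to 6 can be illustrated with a small sketch. It is not the patented procedure: the abstract does not specify the preliminary clustering or the two-stage loose clustering, so a single greedy pass with a cosine similarity threshold is swapped in here. The 2-D word vectors, the threshold of 0.9, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def threshold_clusters(vectors, threshold):
    """Greedy single pass: a word joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of words; the first word is the seed
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Hypothetical 2-D word vectors; a trained word vector model would supply these.
vectors = {
    "cheap": (1.0, 0.1),
    "inexpensive": (0.9, 0.2),
    "phone": (0.1, 1.0),
    "handset": (0.2, 0.9),
}

clusters = threshold_clusters(vectors, threshold=0.9)
word_to_cluster = {w: f"C{i}" for i, c in enumerate(clusters) for w in c}

def to_cluster_features(tokens):
    """Replace each feature word with its cluster label; drop words with no cluster."""
    return [word_to_cluster[t] for t in tokens if t in word_to_cluster]

doc_a = to_cluster_features(["cheap", "phone"])
doc_b = to_cluster_features(["inexpensive", "handset"])
```

After replacement, the two documents have identical features even though they share no words, and the feature space shrinks from four words to two clusters. The cluster-level features would then be passed to whichever classifier is used.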
Description
Technical field
[0001] The invention belongs to the technical field of Chinese short text classification and relates to an optimized classification method for short texts, in particular to a classification method for short network texts.
Background
[0002] In the information age of data explosion, increasingly intelligent mobile terminals and the rapid development of Internet technology have led people to communicate more and more frequently on the mobile Internet, producing a large amount of data. Most of this data takes the form of short texts, such as Weibo posts and instant push news. Their content is concise, refined, and rich in meaning, which gives it high research value. Therefore, how to automatically classify these short texts to help understand the meanings they express has become a hot and difficult research topic in the fields of natural language processing and machine learning...