Integrated classification method for mass multi-word short texts

A classification method for short texts, applied in the field of text representation and representation learning, that addresses problems such as the curse of dimensionality

Active Publication Date: 2019-04-19
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0005] To overcome the shortcomings of the prior art described above, the present invention provides an integrated classification method for massive multi-word short texts. It aims to solve the "curse of dimensionality" problem of traditional representation learning methods, thereby improving short text representation learning and the accuracy of text classification, while offering high robustness and practicability.

Embodiment Construction

[0042] In this embodiment, an integrated classification method for massive multi-word short texts, as shown in Figure 1, includes the following steps:

[0043] Step 1. Obtain the multi-word short text collection, as shown in Table 1, and segment it into words using the jieba_fast word segmentation method in multi-process precise mode. jieba_fast is an improved version of jieba word segmentation that can significantly increase segmentation speed on large volumes of data. Multi-process segmentation improves CPU and memory utilization, and adding a custom thesaurus increases segmentation precision. The final word segmentation result is X = {x_1, x_2, ..., x_i, ..., x_{M+N}}, where x_i denotes the i-th short text after word segmentation and is made up of words indexed by k within x_i. The word segmentation result X is a marked word segment...
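Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names `segment` and `segment_corpus`, the `processes` and `user_dict` parameters, and the chunk size are hypothetical, and the sketch assumes that jieba_fast exposes the same `lcut` and `load_userdict` calls as jieba. If neither package is installed, it falls back to whitespace splitting so that it still runs.

```python
from multiprocessing import Pool

try:
    # Assumption: jieba_fast mirrors the jieba API, as its description suggests.
    import jieba_fast as jieba
except ImportError:
    try:
        import jieba
    except ImportError:
        jieba = None  # no segmenter installed; fall back to whitespace splitting


def segment(text):
    """Segment one short text into a list of words."""
    if jieba is None:
        return text.split()
    # cut_all=False selects jieba's precise mode; whitespace-only tokens are dropped.
    return [w for w in jieba.lcut(text, cut_all=False) if w.strip()]


def segment_corpus(texts, processes=1, user_dict=None):
    """Return the segmentation result X as one word list per short text."""
    if user_dict is not None and jieba is not None:
        # Custom thesaurus file. It is loaded in the parent process, so worker
        # processes only see it when they are forked (the default on Linux).
        jieba.load_userdict(user_dict)
    if processes <= 1:
        return [segment(t) for t in texts]
    # Spread the short texts over several worker processes.
    with Pool(processes) as pool:
        return pool.map(segment, texts, chunksize=1000)
```

Calling `segment_corpus(texts, processes=4)` from a script's main entry point would distribute the work over four processes; the single-process default keeps the sketch easy to try on a small sample first.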

Abstract

The invention discloses an integrated classification method for massive multi-word short texts, which comprises the following steps: 1, acquiring a multi-word short text set and performing word segmentation preprocessing on the multi-word short texts; 2, training a word vector representation model on the word segmentation result using the continuous bag-of-words (CBOW) model of the Word2vec word vector representation method; 3, based on the word vector representation model, constructing sentence vector representations using the PV-DM model of the Sentence2vec sentence vector representation method; and 4, on the basis of the sentence vector representation model, using a kNN classifier to predict category labels. The method can solve the "curse of dimensionality" problem of traditional representation learning methods, thereby improving short text representation learning and text classification precision, and it has relatively high robustness and practicability.
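The final classification step can be illustrated with a small kNN sketch over sentence vectors. It assumes the vectors have already been produced by the PV-DM step. The cosine similarity measure, the default `k=3`, and the names `cosine` and `knn_predict` are assumptions made for illustration, since the abstract does not specify the distance metric or the value of k.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, train_vecs, train_labels, k=3):
    """Predict a label for `query` by majority vote among its k most similar neighbours."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    # Counter keeps first-seen order for equal counts, so a tied vote
    # goes to the label of the closer neighbour.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, with labeled vectors `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]` carrying the placeholder labels `["A", "A", "B", "B"]`, the query `[1, 0.05]` has two "A" neighbours among its three nearest and is therefore assigned "A".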

Description

Technical field

[0001] The invention relates to the field of text representation and representation learning methods, in particular to an integrated classification method for massive multi-word short texts.

Background art

[0002] With the continuous prosperity of the commodity economy, new commodities and services are constantly emerging. According to national regulations, enterprises and individuals must issue value-added tax invoices as required in the course of their operations. When an invoice is issued, the commodities on it must be associated with the tax codes approved by the State Administration of Taxation. However, there are more than 4,200 approved tax codes, spanning many categories. The traditional method of manually selecting tax classification codes not only requires taxpayers to have certain professional knowledge, but is also prone to filling errors, which increases the cost of busin...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06K9/62
CPC: G06F18/24147
Inventor: 胡学钢, 唐雪涛, 朱毅, 李培培
Owner: HEFEI UNIV OF TECH