An integrated classification method for massive multi-word short texts

A classification method and short text technology, applied in the field of text representation and representation learning, which can solve problems such as the curse of dimensionality.

Active Publication Date: 2020-11-27
HEFEI UNIV OF TECH
6 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] To overcome the shortcomings of the prior art described above, the present invention provides an integrated classification method for massive multi-word short texts. It aims to solve the "curse of dimensionality" problem of traditional representation learning methods, thereby improving short text representation learning and text classification accuracy, while offering high robustness and practicability.

Embodiment Construction

[0043] In this embodiment, an integrated classification method for massive multi-word short texts, as shown in Figure 1, includes the following steps:

[0044] Step 1. Obtain the multi-word short text collection, as shown in Table 1, and segment it with the jieba_fast word segmentation method in multi-process precise mode. jieba_fast is an improved version of jieba word segmentation that significantly increases segmentation speed on large data volumes. Multi-process segmentation improves CPU and memory utilization, and adding a custom thesaurus increases segmentation precision. The final word segmentation result is $X=\{x_1,x_2,\dots,x_i,\dots,x_{M+N}\}$, where $x_i$ denotes the i-th short text after word segmentation and is made up of its words, indexed by k (the k-th word of the i-th short text $x_i$). The word segmentation result X is a marked word segmenta...
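The segmentation step can be sketched as follows. This is a minimal illustration, not the patented implementation: it swaps in a simple forward maximum matching segmenter from the standard library as a stand-in for jieba_fast, and the names `CUSTOM_WORDS`, `segment`, and `segment_corpus` as well as the sample dictionary entries are assumptions made for the example. It only shows how a custom thesaurus and a process pool fit together.

```python
from multiprocessing import Pool

# Hypothetical custom thesaurus; a real system would load domain terms from a file.
CUSTOM_WORDS = {"增值税", "发票", "税收", "分类", "编码"}
MAX_WORD_LEN = max(len(w) for w in CUSTOM_WORDS)


def segment(text):
    """Forward maximum matching (a stand-in for jieba_fast, not its algorithm).

    At each position, take the longest dictionary word; fall back to a
    single character when nothing in the dictionary matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in CUSTOM_WORDS:
                words.append(piece)
                i += size
                break
    return words


def segment_corpus(texts, processes=1):
    """Segment every short text, optionally spreading the work over processes."""
    if processes <= 1:
        return [segment(t) for t in texts]
    with Pool(processes) as pool:
        # chunksize batches texts so each worker receives fewer, larger tasks.
        return pool.map(segment, texts, chunksize=64)


if __name__ == "__main__":
    X = segment_corpus(["增值税发票", "税收分类编码"], processes=2)
    print(X)
```

With the jieba package installed, `jieba.lcut(text)` could plausibly take the place of `segment`; the process pool part of the sketch would stay the same.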

Abstract

The invention discloses an integrated classification method for massive multi-word short texts. 3. Based on the word vector representation model, use the PV-DM model of the Sentence2vec sentence vector representation method to construct a sentence vector representation; 4. Based on the sentence vector representation model, use the kNN classifier to predict the category labels of the unlabeled data. The invention can solve the "curse of dimensionality" problem of traditional representation learning methods, thereby improving short text representation learning and text classification accuracy, while offering high robustness and practicability.
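The final kNN step can be illustrated with a small sketch. It assumes the sentence vectors have already been produced (for instance by a PV-DM model) and uses cosine similarity with majority voting; the similarity measure, the value of k, the toy vectors, and the labels "A" and "B" are all assumptions for the example, since the extract does not specify them.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, labeled_vectors, labels, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors."""
    ranked = sorted(
        range(len(labeled_vectors)),
        key=lambda i: cosine(query, labeled_vectors[i]),
        reverse=True,  # most similar first
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy sentence vectors standing in for PV-DM output (made up for illustration).
labeled_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]

prediction = knn_predict([1.0, 0.2], labeled_vectors, labels, k=3)
```

For large collections, a brute-force scan like this becomes slow, so an indexed nearest-neighbor search would likely be needed in practice.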

Description

Technical field

[0001] The invention relates to the field of text representation and representation learning methods, and in particular to an integrated classification method for massive multi-word short texts.

Background technique

[0002] With the continuing growth of the commodity economy, new commodities and services keep emerging. Under national regulations, enterprises and individuals must issue value-added tax invoices in the course of their operations. When an invoice is issued, each commodity on it must be associated with a tax code approved by the State Administration of Taxation. However, there are more than 4,200 approved tax codes, spanning many categories. The traditional method of manually selecting tax classification codes not only requires taxpayers to have some professional knowledge, but is also prone to filling errors, which increases the cost of busin...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06K9/62
CPC: G06F18/24147
Inventor: 胡学钢, 唐雪涛, 朱毅, 李培培
Owner: HEFEI UNIV OF TECH