A method for realizing Chinese text classification and related equipment

Chinese text classification technology, applied in the fields of text database clustering/classification, unstructured text data retrieval, and semantic tool creation. It addresses problems such as the many spelling errors found in short texts, overcomes the single-dimension limitation of the standard model, and achieves higher classification accuracy.

Inactive Publication Date: 2019-03-08
深兰人工智能芯片研究院(江苏)有限公司

AI Technical Summary

Problems solved by technology

Deep-learning-based methods are highly adaptable, but they have not adequately handled the homophones and spelling mistakes that are common in short Chinese texts.



Examples


Embodiment approach 1

[0030] Figure 1 is a schematic flowchart of the method for realizing Chinese text classification provided by Embodiment 1 of the present invention. As shown in Figure 1, the method includes:

[0031] Step 101: use the Chinese pinyin sequence to expand the semantics of the Chinese short text, and use word vectors to build a character mapping matrix and a word-level mapping matrix.

[0032] Step 102: perform convolution and down-sampling operations on the character mapping matrix and the word-level mapping matrix to automatically extract the local feature vectors of the Chinese short text.

[0033] Step 103: concatenate and fuse the local feature vectors, then feed them into a normalized Softmax classifier to classify the Chinese short text.
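The classification in Step 103 can be illustrated with a minimal sketch in plain Python. It assumes a single linear layer in front of the Softmax; the feature values, weights and biases are hypothetical, untrained numbers chosen only for illustration and do not come from the patent.

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fused, weights, biases):
    """Apply a linear layer, then Softmax; return (class index, probabilities)."""
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    return probs.index(max(probs)), probs

# Hypothetical local feature vectors from the character and word branches.
char_features = [0.2, 0.9]
word_features = [0.5, 0.1]
fused = char_features + word_features  # concatenation of the two branches

# Hypothetical parameters for a two-class problem (assumption, not trained).
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
biases = [0.0, 0.0]

label, probs = classify(fused, weights, biases)
print(label, probs)
```

In a trained model the weights and biases would be learned together with the convolution kernels; here they only show how the fused vector becomes a probability per class.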

[0034] Using the Chinese pinyin sequence to expand the semantics of the Chinese short text, and using word vectors to build a character mapping matrix and a word-level mapping matrix, includes:...

Embodiment 1

[0048] Figure 2 is a schematic flowchart of the method for realizing Chinese text classification provided by Embodiment 1 of the present invention. As shown in Figure 2, the method includes:

[0049] Step 201: use the Chinese pinyin sequence to expand the semantics of the original text, and use word vectors to build a character-level and word-level dual-input matrix.

[0050] Here, the dual-input matrix refers to the character mapping matrix W_C and the phrase mapping matrix W_P.
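The idea behind the pinyin expansion in Step 201 can be sketched as follows. This is an illustration under stated assumptions: the `PINYIN` table is a hypothetical three-entry lookup (a real system would need a full pinyin dictionary), the input is assumed to be already segmented into phrases, and the character-level units are assumed to be whole pinyin syllables, which the text does not specify.

```python
# Hypothetical toy lookup table, for illustration only.
PINYIN = {"再": "zai", "在": "zai", "见": "jian"}

def to_pinyin(text):
    """Map each Chinese character to a pinyin syllable; keep unknown characters."""
    return [PINYIN.get(ch, ch) for ch in text]

def build_inputs(phrases):
    """Return (character-level pinyin sequence, phrase-level sequence)."""
    char_seq = [syllable for phrase in phrases for syllable in to_pinyin(phrase)]
    return char_seq, list(phrases)

correct = build_inputs(["再见"])  # "goodbye", spelled correctly
typo = build_inputs(["在见"])     # homophone typo for the same word

print(correct)
print(typo)
```

The two inputs differ at the phrase level but share the same pinyin sequence, which is why a pinyin-based channel can help with homophone mistakes in short texts.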

[0051] Step 202: automatically extract the local feature vectors of the input text through convolution and down-sampling operations.

[0052] Step 203: feed the concatenated and fused feature vectors into the Softmax classifier to classify the Chinese short text.
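Steps 202 and 203 up to the fusion point can be sketched in plain Python. The sketch assumes one-dimensional convolution over the sequence axis, a ReLU activation, and max pooling as the down-sampling operation; none of these choices is confirmed by the text, and the matrices and kernels are hypothetical values.

```python
def conv1d(matrix, kernel):
    """Slide a (window x dim) kernel over the rows of a (length x dim) matrix."""
    window = len(kernel)
    feature_map = []
    for start in range(len(matrix) - window + 1):
        total = 0.0
        for offset in range(window):
            for a, b in zip(matrix[start + offset], kernel[offset]):
                total += a * b
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return feature_map

def max_pool(feature_map):
    """Down-sample a feature map to its strongest response (assumed pooling)."""
    return max(feature_map)

def extract(matrix, kernels):
    """One pooled feature per kernel: the local feature vector of one branch."""
    return [max_pool(conv1d(matrix, k)) for k in kernels]

# Hypothetical mapping matrices: one row per unit, two embedding dimensions.
w_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # character-level branch
w_p = [[0.5, 0.5], [1.0, -1.0]]             # phrase-level branch

# Two hypothetical kernels, each spanning two consecutive units.
kernels = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, 0.0], [0.0, -1.0]]]

# Concatenate the two branches into one fused feature vector.
fused = extract(w_c, kernels) + extract(w_p, kernels)
print(fused)
```

The fused vector has one entry per kernel and branch, so its length stays fixed no matter how long the input text is, which is what allows it to be passed to a fixed-size classifier.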

Embodiment 2

[0054] Figure 3 is a schematic flowchart of the specific implementation of step 201 in Embodiment 1 of the present invention. As shown in Figure 3, step 201 in Embodiment 1 includes:

[0055] Step 301: preprocess the text, which includes removing the many meaningless symbols while retaining mixed-language comments.

[0056] Here, the mixed-language comments may be in Chinese, English or other languages.
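The preprocessing in Step 301 might look like the sketch below, which uses Python's standard `re` module. The definition of a "meaningless symbol" is an assumption: anything other than CJK ideographs, ASCII letters, digits and whitespace. Text in other scripts would need the character class widened, since the text says comments in other languages are retained.

```python
import re

# Assumption: noise is any run of characters that is not a CJK ideograph,
# an ASCII letter, a digit, or whitespace.
_NOISE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]+")
_SPACES = re.compile(r"\s+")

def preprocess(text):
    """Replace symbol runs with a space, then collapse repeated whitespace."""
    cleaned = _NOISE.sub(" ", text)
    return _SPACES.sub(" ", cleaned).strip()

sample = "这个手机 very good!!! ~~~ 值得买 ###"
result = preprocess(sample)
print(result)  # 这个手机 very good 值得买
```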

[0057] Step 302: using the set of word embedding vectors obtained by training on a large-scale corpus, denoted VT, produce a vectorized representation of each component unit in CF and PF to obtain the character mapping matrix W_C and the phrase mapping matrix W_P.

[0058] Here, the character-level feature (Char Level Feature, CF) is the pinyin representation sequence, and the word-level feature (Phrase Level Feature, PF) is the phrase representation sequence.

[0059] The calculation formula is as follows:

[0060] W_C = VT·idx(CF), W_P = VT·idx(...
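The formula for W_C can be read as an embedding lookup, sketched below for the character-level branch only. The sketch assumes that idx maps each unit to its vocabulary index and that multiplying VT by the resulting one-hot vectors selects one embedding per unit; the vocabulary, the values in `vt`, and the row-per-unit orientation are all hypothetical.

```python
def idx(units, vocab):
    """Map each unit to its vocabulary index (assumed meaning of idx)."""
    return [vocab[u] for u in units]

def lookup(vt, indices):
    """Pick one embedding row per index."""
    return [list(vt[i]) for i in indices]

def one_hot_product(vt, indices):
    """Same result computed as an explicit product with one-hot vectors."""
    rows = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(len(vt))]
        rows.append([sum(h * vt[j][d] for j, h in enumerate(one_hot))
                     for d in range(len(vt[0]))])
    return rows

# Hypothetical vocabulary and embedding table (one row per vocabulary entry).
vocab = {"zai": 0, "jian": 1}
vt = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]

cf = ["zai", "jian", "zai"]  # hypothetical pinyin sequence
w_c = lookup(vt, idx(cf, vocab))
print(w_c)
```

The direct lookup and the one-hot product give the same matrix; the lookup is simply the cheaper way to compute it.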


Abstract

The embodiment of the invention relates to the field of text classification and discloses a method and related equipment for realizing Chinese text classification. The method comprises the following steps: performing semantic expansion on a Chinese short text by using a Chinese pinyin sequence, and building a character mapping matrix and a word-level mapping matrix by using word vectors; performing convolution and down-sampling operations on the character mapping matrix and the word-level mapping matrix to automatically extract the local feature vectors of the Chinese short text; and concatenating and fusing the local feature vectors, then feeding them into a normalized Softmax classifier to classify the Chinese short text. Because the convolutional neural network model takes the character mapping matrix and the word-level mapping matrix as joint input, it effectively overcomes the single-dimension limitation of the standard convolutional neural network, extracts the contextual features of the Chinese short text more fully, and obtains a more accurate classification result.

Description

Technical field

[0001] The embodiments of the present invention relate to the field of text classification, in particular to a method and related equipment for realizing Chinese text classification.

Background

[0002] The strong performance of deep learning in image recognition and handwriting recognition is now widely recognized. In recent years, natural language processing (NLP) has been applied more and more widely, and short text classification is an important part of it.

[0003] Methods for short text classification include methods based on text feature expansion and methods based on deep learning. Methods based on text feature expansion can be further divided into rule-based methods and statistics-based methods. Rule-based methods rely mainly on expert knowledge and classify data sets by formulating rules; statistics-based methods work mainly from the perspective of machine learning, external corpus ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/36
Inventor: 陈海波
Owner: 深兰人工智能芯片研究院(江苏)有限公司