Rubbish article classification method based on distributed feature representation of text

A classification method using distributed feature representations of text, applied in text database clustering/classification, unstructured text data retrieval, and special data processing applications. It addresses problems such as the failure to consider word order and a high misjudgment rate.

Active Publication Date: 2016-03-09
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is that the bag-of-words model does not consider the order of words when classifyin


Embodiment Construction

[0019] The present invention will be further described below in conjunction with the accompanying drawings.

[0020] Collect a manuscript text data set (containing both junk manuscripts and valid manuscripts) and label the category of each manuscript: junk manuscripts are recorded as class y = -1 and valid manuscripts as class y = 1. A support vector machine text classification model is then trained on these labeled categories.
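The labeling convention and the linear classifier can be illustrated with a minimal sketch. It assumes each manuscript has already been turned into a fixed-length text vector; the two-dimensional vectors, the function names `train_linear_svm` and `predict`, and the subgradient-descent trainer with its hyperparameters are all hypothetical choices made for illustration, not the training procedure described in the patent.

```python
# Minimal linear SVM sketch: hinge loss with L2 regularisation, trained by
# plain subgradient descent. Labels follow the convention in [0020]:
# y = -1 for junk manuscripts, y = 1 for valid manuscripts.

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Return weights w and bias b of a linear decision function w.x + b."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # The sample violates the margin: move the boundary towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Classify one text vector as valid (1) or junk (-1)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy text vectors (assumption: real vectors would come from
# the distributed representation of each manuscript).
X = [[1.0, 0.2], [0.9, 0.1], [-1.0, -0.3], [-0.8, -0.2]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In practice a library implementation of a linear-kernel SVM would normally replace the hand-written trainer; the sketch only shows how the ±1 labels enter the hinge-loss update.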

[0021] Segment the manuscript text corpus into words. This embodiment uses a Chinese word segmentation algorithm that combines dictionary-based reverse maximum matching with a statistical word segmentation strategy.
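A minimal sketch of the dictionary half of that combination, reverse maximum matching, is given here. The statistical strategy is not shown, and the function name, the `max_word_len` parameter, and the toy dictionary are assumptions made for illustration.

```python
def reverse_maximum_matching(text, dictionary, max_word_len=4):
    """Segment `text` by scanning from the end and, at each position,
    taking the longest dictionary word that ends there."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_word_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # Words were collected right to left.
    return words


# Hypothetical toy dictionary.
dictionary = {"研究", "研究生", "生命", "起源"}
segments = reverse_maximum_matching("研究生命的起源", dictionary)
```

On this sentence, scanning from the end picks 生命 before 研究生 can be matched, which is the usual argument for the reverse direction in Chinese segmentation.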

[0022] First, the manuscript text to be segmented is preprocessed to normalize its non-Chinese-character information: separators (such as a space " ") can replace non-Chinese-character content such as punctuation and English letters.
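This normalization step can be sketched with a regular expression. The sketch assumes that "Chinese character" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that a run of consecutive non-Chinese characters collapses into one separator; neither detail is stated in the source, and the function name is hypothetical.

```python
import re

# Assumption: any character outside U+4E00..U+9FFF counts as non-Chinese.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def normalize_non_chinese(text, separator=" "):
    """Replace each run of non-Chinese characters with one separator and
    drop separators left at the start or end of the text."""
    return NON_CHINESE.sub(separator, text).strip(separator)


normalized = normalize_non_chinese("今天,天气 good!很好。")
```

The separators then act as hard boundaries, so the segmenter only ever sees runs of Chinese characters.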

[00...

Abstract

The present invention discloses a rubbish article classification method based on distributed feature representation of text. The method comprises: performing word segmentation on the article text with a Chinese word segmentation algorithm based on a dictionary and a statistical strategy; using the Skip-Gram model with the Negative-Sampling algorithm in word2vec to obtain the distributed representation of the text; selecting a support vector machine with a linear kernel; and training it on the text vectors of the articles to obtain an SVM article classification model. The method markedly improves the accuracy of article category discrimination.
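For reference, the Skip-Gram objective with negative sampling is commonly written as below. This is the standard formulation from the word2vec literature, not text taken from the patent, and the symbols are assumptions for illustration: $v_{w_I}$ is the input vector of the center word, $v'_{w_O}$ the output vector of an observed context word, $k$ the number of negative samples, $P_n(w)$ the noise distribution, and $\sigma$ the logistic sigmoid.

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

Maximizing this term for each (center, context) pair pushes the vectors of words that co-occur together and pushes the vectors of randomly sampled noise words apart.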

Description

Technical field

[0001] The invention relates to a garbage manuscript text classification method, in particular to a garbage manuscript classification method based on distributed feature representation of text.

Background technique

[0002] Text classification methods are widely used in text data mining, natural language processing, information retrieval, and other fields. Many methods currently address the text classification problem, chiefly Naive Bayes, K-nearest neighbor, and the support vector machine. Among them, the support vector machine overcomes the influence of factors such as sample distribution, redundant features, and overfitting, so it generalizes well and is more effective and stable than the other methods.

[0003] Two methods are currently used to represent the words of manuscript text as vectors: One-hot Representation and Distributed Representation. The biggest problem with the first meth...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 唐贤伦, 周家林, 胡志强, 陈瑛洁, 郭飞, 张毅, 张浩
Owner: CHONGQING UNIV OF POSTS & TELECOMM