Chinese text classification method based on MPI (Message Passing Interface) and adaboost.MH

A Chinese text classification technology, applied in character and pattern recognition, special data processing applications, instruments, and similar fields. It addresses the long time needed to build a training set and to train the classification model, shortening training time and improving time efficiency.

Inactive Publication Date: 2017-08-25
CHONGQING UNIV OF POSTS & TELECOMM
Cites: 3; Cited by: 4

AI Technical Summary

Problems solved by technology

[0006] To address the long time needed to build a training set from massive data and the long time needed to train a classification model with the Adaboost.MH algorithm, the present invention combines MPI with Adaboost.MH and proposes a parallel text classification method based on MPI and Adaboost.MH.


Examples


Detailed Description of Embodiments

[0027] The present invention will be further described below in conjunction with the accompanying drawings.

[0028] As shown in Figure 1, the present invention comprises the following five steps.

[0029] 1. Text preprocessing: collect Chinese text files from different fields using web crawlers and web searches, then perform word segmentation on the collected files. Open-source word segmentation packages such as IK and ICTCLAS can be used for this step. After segmentation, remove punctuation marks and stop words. Stop words are words that appear very frequently but carry no practical meaning, such as "了" (le), "的" (of), and "和" (and). The terms that remain after segmentation are separated and saved to the local training data set as preliminary features.
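The cleaning part of this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the text has already been segmented by a tool such as IK or ICTCLAS (their APIs are not shown), and the stop-word set, the `clean_tokens` helper, and the sample tokens are hypothetical.

```python
import unicodedata

# Hypothetical stop-word list, for illustration only; a real list is much longer.
STOP_WORDS = {"了", "的", "和"}


def is_punctuation(token):
    """True when every character of the token is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens, punctuation, and stop words from an already segmented text."""
    return [
        t for t in tokens
        if t and not is_punctuation(t) and t not in stop_words
    ]


# Assumed output of a word segmenter for one short sentence.
segmented = ["我", "喜欢", "足球", "和", "篮球", "，", "的", "。"]
features = clean_tokens(segmented)
```

The surviving tokens in `features` are what the text calls preliminary features; they would then be written to the training data set.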

[0030] 2. Feature selection: preliminary features are selected using the mutual information method. First, use the MPI_Init function to start p processes, ob...
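Because the remainder of this step is elided in the source, the following is only a sketch of how mutual information could be computed from per-process counts. It assumes the common document-count form MI(t, c) = log(A·N / ((A + C)(A + B))), which the source does not confirm, and it simulates the p processes as plain Python lists; in an MPI program the `sum_reduce` helper would be replaced by a sum reduction across processes. All function names and sample data are hypothetical.

```python
import math


def local_counts(docs, term, label):
    """Counts one process would compute over its own slice of documents.

    Each doc is (set_of_terms, set_of_labels). Returns [A, B, C, N]:
    A = docs with the term and the label,
    B = docs with the term but without the label,
    C = docs with the label but without the term,
    N = number of docs in the slice.
    """
    a = b = c = 0
    for terms, labels in docs:
        has_term, has_label = term in terms, label in labels
        a += int(has_term and has_label)
        b += int(has_term and not has_label)
        c += int(has_label and not has_term)
    return [a, b, c, len(docs)]


def sum_reduce(partials):
    """Element-wise sum; stands in for an MPI sum reduction across processes."""
    return [sum(column) for column in zip(*partials)]


def mutual_information(a, b, c, n):
    """Assumed form: log(A * N / ((A + C) * (A + B))); -inf when A is 0."""
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


# Two simulated processes, each holding its own slice of the training set.
partitions = [
    [({"足球", "比赛"}, {"sports"}), ({"股票", "市场"}, {"finance"})],
    [({"足球", "球员"}, {"sports"}), ({"市场", "比赛"}, {"finance"})],
]


def score(term, label):
    """Global MI for one (term, label) pair from the summed per-process counts."""
    totals = sum_reduce([local_counts(p, term, label) for p in partitions])
    return mutual_information(*totals)


mi_football = score("足球", "sports")
mi_match = score("比赛", "sports")
```

In this toy data "足球" occurs only in sports documents and scores higher than "比赛", which occurs in both categories, so a threshold or top-k cut on the score would keep the former as a feature word.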


Abstract

The invention discloses a Chinese text classification method based on MPI (Message Passing Interface) and Adaboost.MH. It addresses the long overall classification time caused by long Adaboost.MH training when the data set is large. The method comprises the following steps: Chinese text that has undergone word segmentation is stored in a training data set; the mutual information method is combined with MPI to select feature words; the results of all processes are reduced by summation through the MPI_Reduce function to obtain a similarity, and feature words are selected according to that similarity; each process assigns a weight to each feature word according to whether the selected feature word is present in the Chinese texts held by that process; and the per-process results are integrated through MPI communication functions to obtain a text classification model, which is then used to classify the Chinese texts to be classified. The method greatly shortens Chinese text classification time.
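The weighting and integration steps can be illustrated with one round of weak-learner evaluation. The sketch below assumes the real-valued decision-stump formulation commonly paired with AdaBoost.MH (per-block confidences c = ½·ln(W₊/W₋) and normalizer Z = 2·Σ√(W₊·W₋)); the source does not state that this is the exact weak learner used. Processes are simulated as lists, `combine` stands in for an MPI sum reduction, and the label set, helper names, and data are hypothetical.

```python
import math

LABELS = ["sports", "finance"]  # hypothetical label set


def partial_weight_sums(docs, weights, term):
    """Per-process sums w[j][l] = [W_plus, W_minus].

    j is 1 when the term is present in the document and 0 when it is absent;
    l indexes LABELS; weights[i][l] is the boosting weight of (doc i, label l).
    """
    w = [[[0.0, 0.0] for _ in LABELS] for _ in range(2)]
    for (terms, labels), d in zip(docs, weights):
        j = 1 if term in terms else 0
        for l, name in enumerate(LABELS):
            side = 0 if name in labels else 1  # 0: label applies, 1: it does not
            w[j][l][side] += d[l]
    return w


def combine(partials):
    """Element-wise sum over processes; stands in for an MPI sum reduction."""
    total = [[[0.0, 0.0] for _ in LABELS] for _ in range(2)]
    for p in partials:
        for j in range(2):
            for l in range(len(LABELS)):
                total[j][l][0] += p[j][l][0]
                total[j][l][1] += p[j][l][1]
    return total


def stump(total, eps=1e-6):
    """Smoothed confidences c[j][l] and the normalizer Z for one candidate term."""
    c = [[0.5 * math.log((wp + eps) / (wm + eps)) for wp, wm in block]
         for block in total]
    z = 2.0 * sum(math.sqrt(wp * wm) for block in total for wp, wm in block)
    return c, z


# Two simulated processes with uniform initial weights 1 / (4 docs * 2 labels).
partitions = [
    [({"足球", "比赛"}, {"sports"}), ({"股票", "市场"}, {"finance"})],
    [({"足球", "球员"}, {"sports"}), ({"市场", "比赛"}, {"finance"})],
]
uniform = [[0.125, 0.125], [0.125, 0.125]]

partials = [partial_weight_sums(p, uniform, "足球") for p in partitions]
c, z = stump(combine(partials))
```

A smaller Z indicates a better stump. Here "足球" separates the two categories perfectly, so Z is 0, and the confidences are positive for "sports" and negative for "finance" when the term is present.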

Description

Technical field

[0001] The invention relates to the technical field of text mining, and in particular to a Chinese text classification method based on MPI and Adaboost.MH.

Background

[0002] Text classification is the process of assigning texts to relevant categories according to their content, given a known category system. With the advancement of science and technology, the development of society, the popularization of computers, and the advent of the Internet age, the number of online texts is increasing rapidly, and the text classification task shows new characteristics. First, a large number of new texts that need to be classified are generated every day, and this data usually exceeds the terabyte scale. Second, text categories are diverse: a single text can belong to multiple categories. For example, the same text can belong to history, politics, and technology.

[0003] Traditional single-label classification methods suc...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27, G06K9/62
CPC: G06F16/355, G06F40/284, G06F18/22, G06F18/2411, G06F18/2431
Inventor 王进高延雨李颖李航余薇高选人邓欣陈乔松胡峰
Owner CHONGQING UNIV OF POSTS & TELECOMM