Chinese text classification method based on inter-category correlation learning

A text classification technology based on correlation between categories, applied in the research field of Chinese text classification algorithms, that addresses the long running time and heavy computational load of existing algorithms.

Inactive Publication Date: 2012-01-25
南方报业传媒集团

AI Technical Summary

Problems solved by technology

[0008] All the above algorithms need to use methods such as SVM to train and construct classifiers. The



Examples


Example Embodiment

[0076] Example 1

[0077] As shown in Figure 1, the Chinese text classification method based on correlation learning between categories includes the following steps:

[0078] (1) Training process:

[0079] (1-1) Feature selection: A standard dictionary contains the complete set of Chinese lexical terms, and the terms in this set are indexed in phonetic order. The goal of feature selection is to pick representative terms from the dictionary as feature terms and to build a feature index, likewise in phonetic order. The procedure is as follows: read in all training documents and segment them into words. After segmentation, count the frequency of each term in turn, following the index order of the terms in the dictionary. Select the terms that appear frequently in the training documents to form a feature subset after rough ...
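The rough selection step described above can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes the documents are already segmented into word lists, it stands in Python's default string ordering for the phonetic dictionary order, and the function name `rough_select` and the threshold `min_count` are hypothetical.

```python
from collections import Counter

def rough_select(segmented_docs, dictionary, min_count=2):
    """Keep dictionary terms whose total frequency in the training set
    is at least min_count (hypothetical threshold)."""
    known = set(dictionary)
    # Count every occurrence of a dictionary term across all documents.
    counts = Counter(
        tok for doc in segmented_docs for tok in doc if tok in known
    )
    # Walk the dictionary in index order so the feature index keeps that order.
    selected = [term for term in dictionary if counts[term] >= min_count]
    return selected, counts

# Toy, pre-segmented training documents (illustrative data only).
docs = [["经济", "增长", "政策"], ["足球", "比赛", "经济"], ["政策", "经济", "改革"]]
# Stand-in for the phonetically ordered dictionary: plain sorted order.
dictionary = sorted({"经济", "增长", "政策", "足球", "比赛", "改革"})
features, counts = rough_select(docs, dictionary, min_count=2)
```

In this toy data only two terms occur at least twice, so they form the roughly selected feature subset; the later fine selection by discrimination index is not shown here.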



Abstract

The invention discloses a Chinese text classification method based on correlation learning between categories. The method comprises the following steps: first, segmenting the documents into words and performing a rough feature selection by computing word frequencies; second, further determining representative terms according to the discrimination index between terms and categories, so as to form the finely selected feature terms; third, representing each training document by tf-idf weights and discrimination-index weights according to the index of the feature terms; fourth, establishing a group of binary classifiers corresponding to different projection vectors and training to obtain a coding matrix that expresses the correlation between the binary classifiers; and finally, projecting the multi-vector representation of a new document onto all the binary classifiers, introducing the coding matrix, computing the similarity between each category and the document, and outputting the category with the maximum similarity as the classification result for the new document. The new document is thus classified based on the learned correlation between categories, and the running efficiency of the algorithm is improved while classification performance is maintained.
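The final decision step of the abstract can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the patent text: the document is a single vector rather than a multi-vector representation, each binary classifier is a plain dot product with its projection vector, and the similarity of a category is the dot product of its row in the coding matrix with the classifier responses. The names `classify`, `projections`, and `code_matrix`, and all numbers, are hypothetical.

```python
def classify(x, projections, code_matrix, labels):
    """Return the label whose code row best matches the classifier responses."""
    # Response of each binary classifier: projection of x onto its vector.
    responses = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in projections]
    # Similarity of each category: its code row dotted with the responses.
    sims = [sum(c * r for c, r in zip(row, responses)) for row in code_matrix]
    # Output the category with the maximum similarity.
    best = max(range(len(labels)), key=lambda k: sims[k])
    return labels[best], sims

# Two binary classifiers over a 3-dimensional document vector (toy values).
projections = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
# One row per category, one column per binary classifier (assumed +1/-1 codes).
code_matrix = [[1, -1], [-1, 1], [-1, -1]]
labels = ["finance", "sports", "culture"]
label, sims = classify([0.2, 0.9, 0.1], projections, code_matrix, labels)
```

Because every category is scored from the same small set of classifier responses, adding categories adds rows to the coding matrix rather than new classifiers, which is one way to read the efficiency claim in the abstract.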

Description

Technical field

[0001] The invention belongs to the research field of Chinese text classification algorithms, and in particular relates to a Chinese text classification method that selects features using the discrimination index between terms and categories and learns from the correlation between categories.

Background

[0002] With the rapid development of China's publishing industry, the number of Chinese documents in electronic format continues to rise. Document classification is becoming increasingly laborious, so advanced machine learning and pattern classification methods are needed to assist traditional manual classification.

[0003] A Chinese text classification method mainly consists of two parts: feature selection and a classification algorithm. The features of a document set are generally expressed in the form of the Bag-of-Words model and the Vector Space Model. The probability...
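As a concrete picture of the Vector Space Model mentioned above, the sketch below turns segmented documents into tf-idf vectors over a fixed feature index. It uses one common tf-idf variant (raw term count times log(N / document frequency)); the exact weighting used by the invention, including its discrimination-index weight, is not given in this excerpt, and the name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(segmented_docs, features):
    """Represent each document as a tf-idf vector ordered by the feature index."""
    feature_set = set(features)
    n_docs = len(segmented_docs)
    # Document frequency: number of documents containing each feature term.
    df = Counter(t for doc in segmented_docs for t in set(doc) & feature_set)
    vectors = []
    for doc in segmented_docs:
        tf = Counter(t for t in doc if t in feature_set)
        vectors.append([
            tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in features
        ])
    return vectors

# Toy, pre-segmented documents and a small feature index (illustrative only).
docs = [["经济", "政策", "经济"], ["足球", "比赛"], ["政策", "改革"]]
features = ["政策", "经济", "足球"]
vecs = tfidf_vectors(docs, features)
```

Terms outside the feature index are ignored, so every document maps to a vector of the same length regardless of its vocabulary.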


Application Information

IPC(8): G06F17/30
Inventor 吴娴, 杨兴锋, 张东明, 何崑
Owner 南方报业传媒集团