Document semantic representation method based on thematic word class similarities and text classification method and device

A semantic representation and similarity technology, applied in the information field, that addresses the data sparsity and missing word-level semantic information of existing document representations and improves the accuracy of text classification.

Active Publication Date: 2018-09-28
INST OF INFORMATION ENG CAS

AI Technical Summary

Problems solved by technology

[0007] Existing bag-of-words document representation methods have several deficiencies. The BOW model considers only the frequency of words, not their semantic information. The TF-IDF model represents a text as a vector by combining term frequency and inverse document frequency; it does not consider the semantic information of the text and is prone to data sparsity. The FBOW model uses the positions of words in the semantic space to represent the correlation between words, but does not represent the semantic information of the document as a whole.
Therefore, document semantic vector representation methods still have considerable room for improvement.
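
To make the sparsity and missing-semantics problems concrete, the following is a minimal Python sketch that is not taken from the patent. The toy corpus, the helper names `bow`, `tfidf` and `dot`, and the unsmoothed `log(N / df)` form of the inverse document frequency are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["stocks", "fell", "sharply"],
]
vocab = sorted({w for d in docs for w in d})  # one dimension per distinct word

def bow(doc):
    """Bag-of-words vector: raw count of each vocabulary word."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf(doc):
    """TF-IDF vector with tf = count / len(doc) and idf = log(N / df)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)  # df >= 1 because vocab comes from docs
        vec.append((counts[w] / len(doc)) * math.log(n_docs / df))
    return vec

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

cat_doc, dog_doc = bow(["cat"]), bow(["dog"])
nonzero = sum(1 for v in bow(docs[0]) if v)
print(len(vocab), nonzero)    # vector length vs. number of non-zero entries
print(dot(cat_doc, dog_doc))  # overlap between two semantically related words
```

Even on this tiny corpus most entries of each vector are zero, and "cat" and "dog" occupy unrelated dimensions, so their vectors have no overlap although the words are semantically close.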



Examples


Embodiment Construction

[0041] In order to make the above objects, features and advantages of the present invention more clearly understood, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

[0042] The document semantic representation method based on thematic word class similarities in this embodiment mainly includes two aspects:

[0043] 1) Clustering for the bag-of-words model: First, train a word vector model on the corpus to obtain word vectors. Then use a Gaussian Mixture Model (GMM) to cluster the trained word vectors in the semantic space, so that words with similar semantics are grouped into one category. Each cluster category represents a set of semantically related words. The corpus may be an English corpus, a Chinese corpus (which requires word segmentation) or a corpus in another language.

[0044] 2) Text Semantic Representation: Treat each clustered category as an independent clustered "text", and use the WMD model to calcula...
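
The clustering step in 1) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the two-dimensional toy word vectors, the function name `gmm_cluster`, the spherical (single-variance) covariance per component, and the deterministic initialization are all assumptions. In practice the word vectors would come from a trained word vector model and a library GMM would normally be used.

```python
import math

def gmm_cluster(vectors, k, iters=50):
    """Fit a spherical-covariance GMM with EM; return one hard label per vector."""
    n, dim = len(vectors), len(vectors[0])
    # Assumed initialization: pick k input vectors spaced evenly through the list.
    means = [list(vectors[(i * n) // k]) for i in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each vector.
        resp = []
        for x in vectors:
            log_p = []
            for j in range(k):
                sq = sum((a - b) ** 2 for a, b in zip(x, means[j]))
                log_p.append(
                    math.log(weights[j])
                    - 0.5 * dim * math.log(2 * math.pi * variances[j])
                    - sq / (2 * variances[j])
                )
            top = max(log_p)  # subtract the maximum for numerical stability
            exps = [math.exp(lp - top) for lp in log_p]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, vectors)) / nj
                for d in range(dim)
            ]
            sq = sum(
                r[j] * sum((a - b) ** 2 for a, b in zip(x, means[j]))
                for r, x in zip(resp, vectors)
            )
            variances[j] = max(sq / (nj * dim), 1e-6)  # floor avoids collapse
    return [max(range(k), key=lambda j: r[j]) for r in resp]

# Hypothetical 2-D "word vectors": three animal words and three finance words.
words = ["cat", "dog", "pet", "stock", "bond", "fund"]
vectors = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
           (5.0, 5.1), (5.1, 5.0), (5.05, 4.95)]
labels = gmm_cluster(vectors, k=2)
print(dict(zip(words, labels)))
```

Each resulting cluster plays the role of one word class; the number of clusters then fixes the dimension of the document representation.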


Abstract

The invention relates to a document semantic representation method based on thematic word class similarities, and to a text classification method and device. The document semantic representation method comprises the steps of: (1) training a word vector model on a corpus to obtain word vectors; (2) clustering the word vectors in a semantic space; and (3) using the WMD algorithm to calculate the distance between a to-be-represented document and each category obtained through clustering, and using the obtained distances as the semantic representation of the document. Document classification is then realized by calculating the similarities among the semantic representation vectors of documents. Using the semantic information, word frequency and other information of the texts, the method calculates the transfer cost between text words and cluster sets through the WMD model and represents each text as a low-dimensional dense vector containing semantic information. Text information is thus better represented and classification accuracy is high, and the method can be applied to natural language processing tasks such as information retrieval and text classification.
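
Step (3) and the classification step can be sketched in Python as below. This is a hedged illustration only. Exact WMD (Word Mover's Distance) requires solving an optimal transport problem, so the sketch substitutes the simpler one-sided relaxed WMD, a lower bound in which each document word sends all of its frequency mass to its nearest word in the cluster. The toy vectors and clusters, the function names, and the use of cosine similarity with a nearest-neighbour rule are assumptions that the source does not specify.

```python
import math
from collections import Counter

def euclid(a, b):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_wmd(doc_tokens, cluster_words, vec):
    """One-sided relaxed WMD (a lower bound of WMD, not the exact distance)."""
    counts = Counter(t for t in doc_tokens if t in vec)
    total = sum(counts.values())
    if total == 0 or not cluster_words:
        raise ValueError("document and cluster must both contain known words")
    return sum(
        (c / total) * min(euclid(vec[w], vec[u]) for u in cluster_words)
        for w, c in counts.items()
    )

def represent(doc_tokens, clusters, vec):
    """Document vector: one distance per cluster, so its length is len(clusters)."""
    return [relaxed_wmd(doc_tokens, cw, vec) for cw in clusters]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def classify(doc_vec, labeled):
    """Return the label of the most similar labeled representation."""
    return max(labeled, key=lambda item: cosine(doc_vec, item[0]))[1]

# Hypothetical word vectors and two word-class clusters.
vec = {"cat": (0.0, 0.1), "dog": (0.1, 0.0), "pet": (0.05, 0.05),
       "stock": (5.0, 5.1), "bond": (5.1, 5.0), "fund": (5.05, 4.95)}
clusters = [["cat", "dog", "pet"], ["stock", "bond", "fund"]]

labeled = [
    (represent(["cat", "dog", "cat"], clusters, vec), "animals"),
    (represent(["stock", "fund"], clusters, vec), "finance"),
]
query = represent(["dog", "pet"], clusters, vec)
print(query, classify(query, labeled))
```

The representation has one entry per cluster, which is what makes it low-dimensional and dense compared with a vocabulary-sized bag-of-words vector.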

Description

Technical field

[0001] The invention belongs to the field of information technology, and in particular relates to a document semantic representation method based on thematic word class similarities, a text classification method and a corresponding device.

Background art

[0002] Text vector representation is one of the key technologies in the fields of text mining and natural language processing. A good document semantic representation method can improve the performance of tasks such as information retrieval and text classification.

[0003] The present invention is a document semantic representation method based on thematic word class similarities, proposed as an improvement on the high dimensionality, sparseness and lack of semantics of the bag-of-words model. At present, the document representation methods based on the bag-of-words model are as follows:

[0004] 1) The traditional bag-of-words (BOW) model represents the frequency of words...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30; G06F17/27
CPC: G06F40/30
Inventor: 陈小军王大魁时金桥白离胡兰兰文新张闯马建伟
Owner: INST OF INFORMATION ENG CAS