Document semantic representation method based on thematic word class similarities and text classification method and device

A semantic representation and similarity technology, applied in the information field, that addresses the problems of sparse data, missing semantic information, and no consideration of word-level semantics, and improves results on tasks such as text classification.
CN108595706A · Active · Publication Date: 2018-09-28 · INST OF INFORMATION ENG CAS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF INFORMATION ENG CAS
Publication Date
2018-09-28


Abstract

The invention relates to a document semantic representation method based on thematic word class similarities, and to a text classification method and device. The document semantic representation method comprises the following steps: (1) a word vector model is trained on a corpus to obtain word vectors; (2) the word vectors are clustered in a semantic space; and (3) the Word Mover's Distance (WMD) algorithm is used to calculate the distance between a to-be-represented document and each category obtained through clustering, and the resulting distances are used as the semantic representation of the document. Document classification is then realized by calculating the similarities among the semantic representation vectors of documents. Using the semantic information, word frequency and other information of the texts, the method calculates the transfer cost between the words of a text and the cluster sets through the WMD model, so that each text is represented by one low-dimensional dense vector containing semantic information. Text information is thereby better represented, classification accuracy is high, and the method can be applied to natural language processing tasks such as information retrieval and text classification.
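The three steps and the similarity-based classification can be illustrated with a minimal sketch. It rests on several assumptions that are not in the patent text: the hand-written 2-D vectors in `WORD_VECS` stand in for trained word vectors, a plain k-means stands in for the clustering step (the abstract does not name a clustering algorithm), and a one-sided relaxed WMD, in which each document word ships all of its weight to its nearest cluster word, replaces the exact transport problem. All function names are hypothetical.

```python
import math
from collections import Counter

# Step 1 stand-in: toy 2-D vectors instead of trained word vectors (hypothetical values).
WORD_VECS = {
    "goal": (0.9, 0.1), "match": (1.0, 0.2), "team": (0.8, 0.0),
    "stock": (0.1, 0.9), "market": (0.0, 1.0), "price": (0.2, 0.8),
}

def kmeans(words, k, iters=10):
    """Step 2: cluster words by vector; farthest-point init keeps it deterministic."""
    centers = [WORD_VECS[words[0]]]
    while len(centers) < k:
        far = max(words, key=lambda w: min(math.dist(WORD_VECS[w], c) for c in centers))
        centers.append(WORD_VECS[far])
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w in words:
            idx = min(range(k), key=lambda i: math.dist(WORD_VECS[w], centers[i]))
            clusters[idx].append(w)
        for i, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[i] = tuple(sum(xs) / len(members)
                                   for xs in zip(*(WORD_VECS[w] for w in members)))
    return clusters

def relaxed_wmd(doc_tokens, cluster_words):
    """One-sided relaxed WMD (a lower bound on exact WMD, not the exact value)."""
    counts = Counter(t for t in doc_tokens if t in WORD_VECS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no in-vocabulary words")
    return sum((n / total) * min(math.dist(WORD_VECS[t], WORD_VECS[w])
                                 for w in cluster_words)
               for t, n in counts.items())

def represent(doc_tokens, clusters):
    """Step 3: one distance per cluster gives a low-dimensional dense vector."""
    return [relaxed_wmd(doc_tokens, c) for c in clusters]

def cosine(u, v):
    nu, nv = math.hypot(*u), math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

clusters = kmeans(list(WORD_VECS), k=2)
sports = represent("team match goal goal".split(), clusters)
finance = represent("stock price market".split(), clusters)
query = represent("match team".split(), clusters)
# Classify the query by its most similar labeled representation.
label = "sports" if cosine(query, sports) > cosine(query, finance) else "finance"
```

The representation has as many dimensions as there are clusters, regardless of vocabulary size. An exact WMD would solve a transport problem with weights on both the document and the cluster side; the relaxation here is used only to keep the sketch dependency-free.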

Description

Technical field

[0001] The invention belongs to the field of information technology, and in particular relates to a document semantic representation method based on thematic word class similarity, together with a text classification method and a corresponding device.

Background

[0002] Text vector representation is one of the key technologies in the fields of text mining and natural language processing. A good document semantic representation method can improve the performance of tasks such as information retrieval and text classification.

[0003] The present invention is a document semantic representation method based on thematic word class similarity. It is proposed as an improvement over the bag-of-words model, whose representations are high-dimensional, sparse, and carry no semantic information. The current document representation methods based on the bag-of-words model are as follows:

[0004] 1) The traditional bag-of-words (BOW) model represents the frequency of words...
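The high-dimensional, sparse, semantics-free character of word-frequency vectors can be seen in a small sketch of the standard bag-of-words idea. The two sample sentences and the helper name `bow` are illustrative assumptions and do not come from the patent.

```python
from collections import Counter

# Two hypothetical documents with no words in common.
docs = ["the cat sat on the mat", "stocks fell as markets closed"]

# The vocabulary fixes the vector dimension: one slot per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    """Return the word-frequency vector of a document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow(d) for d in docs]

# Fraction of zero entries: grows quickly as the vocabulary grows.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocab))

# Dot product between the two documents: zero whenever no surface word is
# shared, even if the texts were semantically related.
overlap = sum(a * b for a, b in zip(*vectors))
```

With only two short documents the vectors already have one dimension per distinct word and are mostly zeros.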

Claims
