A short text representation method based on word2vec

A short text representation technology, applied to unstructured text data retrieval, text database clustering/classification, and related fields. It addresses the sparse data-space expression of existing models, their neglect of the semantic information between words, and their resulting weak ability to represent short text, with the effect of improving classification accuracy and clustering quality.

Active Publication Date: 2021-07-27
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

The vector space model produces a sparse expression of the data space and ignores the semantic information between words, which leaves it weak at representing short text.
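The sparsity problem can be illustrated with a minimal sketch. The two token lists below are invented for illustration and are not taken from the patent's corpus; the helper names `bow_vector` and `cosine` are likewise hypothetical. Two short texts with related meaning but no shared words get a bag-of-words similarity of zero.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical short texts: related in meaning, no words in common.
doc_a = ["two", "child", "policy"]
doc_b = ["second", "baby", "rule"]

vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = bow_vector(doc_a, vocabulary)
vec_b = bow_vector(doc_b, vocabulary)

# Half of each vector is zero, and the vectors share no non-zero dimension.
similarity = cosine(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

With only six vocabulary entries, half of each vector is already zero; on a real corpus the vocabulary is far larger and the effect is correspondingly stronger.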


Examples


Embodiment 1

[0035] The technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawings. This embodiment takes a short-text corpus on the comprehensive two-child policy as an example.


Abstract

The present invention relates to a short text representation method based on word2vec, comprising the following steps: S1: input a training text set after text preprocessing, set the word2vec parameters, and train to obtain the set of word vectors corresponding to the training text set; S2: for each word in each document, compute the cosine distance between word vectors to obtain a series of similar words for that word across the entire training text set; S3: compute the cosine distance between each of these similar words and the document; S4: sort by cosine distance from large to small and select the first n similar words, with their corresponding cosine distances, to form the document's n similar words and cosine measures; S5: compute the weights of the document's own words and of the selected n similar words in the document, form a new text representation, and output a word2vec-based vector space representation for each document.
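Steps S2 to S5 can be sketched as follows. This is an illustrative outline under several assumptions that the abstract does not confirm: the word vectors are a hand-made toy dictionary standing in for the output of S1; a document is compared with a candidate word through the mean of its word vectors; original words are weighted by term frequency and added words by their cosine score; and a larger cosine value is treated as more similar. The names `TOY_VECTORS`, `similar_words`, `expand_document`, and the parameter `k` are hypothetical.

```python
import math
from collections import Counter

# Stand-in for the word vectors trained in S1 (values are invented).
TOY_VECTORS = {
    "policy": [1.0, 0.1, 0.0],
    "law": [0.9, 0.2, 0.0],
    "child": [0.1, 1.0, 0.0],
    "baby": [0.2, 0.9, 0.1],
    "tax": [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(words, vectors):
    """Assumed document vector: mean of the vectors of its known words."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * len(next(iter(vectors.values())))
    return [sum(column) / len(known) for column in zip(*known)]


def similar_words(word, vectors, k):
    """S2: the k words closest to `word` over the whole vocabulary."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:k]]


def expand_document(doc, vectors, k=2, n=2):
    """S2-S5: return {word: weight} for the document plus n similar words."""
    doc_vec = mean_vector(doc, vectors)
    candidates = set()
    for word in doc:
        if word in vectors:
            candidates.update(similar_words(word, vectors, k))
    candidates -= set(doc)
    # S3: score each candidate against the document; S4: keep the top n.
    scored = sorted(((c, cosine(vectors[c], doc_vec)) for c in candidates),
                    key=lambda pair: (-pair[1], pair[0]))
    # S5 (assumed weighting): term frequency for original words,
    # cosine score for the added similar words.
    weights = {w: float(c) for w, c in Counter(doc).items()}
    for word, score in scored[:n]:
        weights[word] = score
    return weights


weights = expand_document(["child", "policy"], TOY_VECTORS, k=1, n=2)
print(weights)
```

In practice the toy dictionary would be replaced by vectors trained on the preprocessed corpus, and the resulting word-to-weight mapping would be laid out over the corpus vocabulary to give each document its vector space representation.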

Description

Technical field

[0001] The present invention relates to the field of computer science and technology, and more specifically to a short text representation method based on word2vec.

Background technique

[0002] In text mining, machine interpretation of sample information requires a text representation process that converts samples into numerical values. With the continuous expansion of the scope of natural language processing and the development of computer technology, how to use numerical values to better represent the semantic information carried by text has always been one of the most important research points in the field of text processing, because it directly affects the effect of text mining. For short text mining, effective text feature representation methods are even more difficult to research, especially for short texts generated by social platforms, which not only have traditional features such as sparse features, incomplete semantics, p...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/36, G06F40/284, G06F40/289, G06K9/62
Inventors: 路永和, 张炜婷
Owner: SUN YAT SEN UNIV