A short text representation method based on word2vec
A short-text representation technique applied to unstructured text data retrieval, text database clustering/classification, instruments, etc. It addresses the sparse data-space representation of the vector space model, its disregard of semantic information between words, and its resulting weak ability to represent short text, with the effect of improving classification accuracy and clustering quality.
Active Publication Date: 2021-07-27
SUN YAT SEN UNIV
AI Technical Summary
Problems solved by technology
The vector space model has two defects: its data-space representation is sparse, and it ignores the semantic information between words. Both weaken its ability to represent short text.
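The sparsity problem can be illustrated with a minimal sketch. The two example texts and the helper names `bow_vector` and `cosine` are assumptions made for illustration and do not come from the patent. The sketch builds plain bag-of-words vectors for two short texts that mean roughly the same thing but share no words.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two hypothetical short texts with similar meaning and no shared words.
doc_a = ["cheap", "flight", "tickets"]
doc_b = ["inexpensive", "airfare", "deals"]

vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = bow_vector(doc_a, vocabulary)
vec_b = bow_vector(doc_b, vocabulary)

# Fraction of zero entries in the vector for doc_a.
sparsity = vec_a.count(0) / len(vec_a)
# Bag-of-words similarity between the two texts.
similarity = cosine(vec_a, vec_b)
print(sparsity, similarity)
```

Half of the entries in `vec_a` are zero even with a six-word vocabulary, and the similarity is 0.0 because the model has no notion that "cheap" and "inexpensive" are related.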
Examples
Embodiment 1
[0035] The above and other technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawings. This embodiment uses a short-text corpus on the comprehensive two-child policy as an example.
Abstract
The present invention relates to a short text representation method based on word2vec, comprising the following steps:
S1: Input a training text set after text preprocessing, set the word2vec parameters, and train to obtain the set of word vectors corresponding to the training text set.
S2: For each word in each document, compute the cosine similarity between word vectors to obtain a series of words across the entire training text set that are similar to that word.
S3: Compute the cosine similarity between each document and the similar words found for it.
S4: Sort by cosine similarity from largest to smallest, and select the first n similar words together with their cosine values as the document's n similar words and cosine measures.
S5: Compute the weights of the document's own words and of the selected n similar words, form a new text representation, and output a word2vec-based vector space representation for each document.
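Steps S2 to S5 can be sketched roughly as follows. This is one possible reading of the abstract, not the patented method itself. The toy values in `WORD_VECTORS` stand in for the output of S1 and are hypothetical, as are the function names `expand_document` and `to_vector`. The abstract does not say how a similar word is scored against a document or how weights are computed, so the sketch assumes the score is the highest cosine similarity to any word in the document, and the weight is term frequency scaled by that similarity.

```python
import math
from collections import Counter

# Hypothetical word vectors standing in for the result of S1.
# In practice these would come from a word2vec model trained on the corpus.
WORD_VECTORS = {
    "policy":  [0.9, 0.1, 0.0],
    "law":     [0.8, 0.2, 0.1],
    "child":   [0.1, 0.9, 0.1],
    "baby":    [0.2, 0.8, 0.2],
    "cost":    [0.0, 0.2, 0.9],
    "expense": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_document(tokens, vectors, n):
    """Return {word: weight} for the document's words plus n similar words."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return {}
    # Assumed weighting for the document's own words: relative term frequency.
    tf = {word: count / len(known) for word, count in Counter(known).items()}

    # S2/S3 (assumed reading): score each out-of-document word by its best
    # cosine similarity to a document word, remembering which word that was.
    scored = []
    for cand, cand_vec in vectors.items():
        if cand in tf:
            continue
        nearest = max(tf, key=lambda w: cosine(cand_vec, vectors[w]))
        scored.append((cosine(cand_vec, vectors[nearest]), cand, nearest))

    # S4: sort from largest to smallest and keep the first n similar words.
    scored.sort(reverse=True)

    # S5 (assumed weighting): a similar word inherits the term frequency of
    # its nearest document word, scaled by their cosine similarity.
    weights = dict(tf)
    for score, cand, nearest in scored[:n]:
        weights[cand] = score * tf[nearest]
    return weights


def to_vector(weights, vocabulary):
    """Lay the weights out over a fixed vocabulary order; absent words get 0.0."""
    return [weights.get(word, 0.0) for word in vocabulary]


vocabulary = sorted(WORD_VECTORS)
weights = expand_document(["policy", "child", "child"], WORD_VECTORS, n=2)
vector = to_vector(weights, vocabulary)
print(vocabulary)
print(vector)
```

With these toy vectors the document gains non-zero weights for "law" and "baby", so it can now overlap with other documents that use those words, while unrelated words such as "cost" stay at zero.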
Description
Technical field
[0001] The present invention relates to the field of computer science and technology, and more specifically to a short text representation method based on word2vec.
Background technique
[0002] In text mining, a machine can interpret sample information only after a text representation step has converted the samples into numerical values. As the scope of natural language processing keeps expanding and computer technology develops, how to use numerical values to better represent the semantic information carried by text has remained one of the most important research questions in text processing, because it directly affects the effectiveness of text mining. For short text mining, effective text feature representation methods are even harder to devise, especially for short texts generated on social platforms, which not only have traditional features such as sparse features, incomplete semantics, p...
Application Information
Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/36, G06F40/284, G06F40/289, G06K9/62
Inventor: 路永和, 张炜婷
Owner: SUN YAT SEN UNIV