
Doc2vec-based similar entity mining method

A similar-entity mining technique, applied in the field of similar document mining, that achieves strong scalability, comprehensive vector representation, and strong portability.

Status: Inactive
Publication Date: 2018-03-23
WUHAN UNIV
6 Cites, 33 Cited by

AI Technical Summary

Problems solved by technology

The ball tree is more complex to construct, but it is very efficient at computing nearest neighbors, even in very high dimensions.
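The trade-off described above (more work at build time, less work per query) can be illustrated with a minimal ball tree sketch in plain Python. This is not the patent's implementation: the class name `Ball`, the function `nearest`, the median split on the widest axis, and the `leaf_size` parameter are all assumptions made for illustration.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Ball:
    """A node covering its points with a ball (center, radius)."""
    def __init__(self, points, leaf_size=4):
        dim = len(points[0])
        self.center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
        self.radius = max(dist(self.center, p) for p in points)
        self.points, self.left, self.right = None, None, None
        if len(points) <= leaf_size:
            self.points = points  # leaf: store the points directly
            return
        # Assumed split rule: median cut along the axis with the largest spread.
        spread = [max(p[i] for p in points) - min(p[i] for p in points)
                  for i in range(dim)]
        axis = spread.index(max(spread))
        ordered = sorted(points, key=lambda p: p[axis])
        mid = len(ordered) // 2
        self.left = Ball(ordered[:mid], leaf_size)
        self.right = Ball(ordered[mid:], leaf_size)

def nearest(node, query, best=(math.inf, None)):
    """Return (distance, point) of the nearest neighbor of query."""
    # Triangle inequality: nothing inside the ball can be closer than
    # dist(query, center) - radius, so the whole ball can be skipped.
    if dist(query, node.center) - node.radius >= best[0]:
        return best
    if node.points is not None:
        for p in node.points:
            d = dist(query, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit the closer child first so the bound tightens early.
    for child in sorted((node.left, node.right),
                        key=lambda c: dist(query, c.center)):
        best = nearest(child, query, best)
    return best

random.seed(0)
points = [tuple(random.random() for _ in range(8)) for _ in range(200)]
tree = Ball(points)
query = tuple(random.random() for _ in range(8))
nearest_dist, nearest_point = nearest(tree, query)
```

The pruning test in `nearest` is where the query-time saving comes from: a whole subtree is discarded with one distance computation whenever its ball lies entirely farther away than the best candidate found so far.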



Examples


Embodiment Construction

[0037] Step 1: word2vec calculation

[0038] 1.1 Word segmentation

[0039] For Chinese word2vec calculation, the corpus should be segmented first.

[0040] The current mainstream approach to Chinese word segmentation is as follows. For words in the dictionary, efficient word-graph scanning is implemented on top of a prefix dictionary, producing a directed acyclic graph (DAG) of all possible words that the Chinese characters in a sentence can form; dynamic programming then finds the maximum-probability path, which gives the most likely segmentation based on word frequency. For out-of-vocabulary words, an HMM based on the word-forming ability of Chinese characters is used, and the Viterbi algorithm solves the model.
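The dictionary-based half of this procedure can be sketched in a few lines of Python. The dictionary `FREQ` and its frequencies are made up for illustration, and the function names `build_dag` and `segment` are hypothetical. The sketch covers only the DAG and dynamic-programming step; it falls back to single characters for unknown text rather than running the HMM and Viterbi step.

```python
import math

# Hypothetical toy dictionary: word -> frequency (values invented for the example).
FREQ = {
    "武汉": 50, "大学": 60, "武汉大学": 30, "学生": 40, "大学生": 20,
    "武": 5, "汉": 5, "大": 10, "学": 10, "生": 10,
}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Map each start index to the end indices of all dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1)
                if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: treat it as a one-character word
    return dag

def segment(sentence):
    """Return the maximum-probability segmentation under a unigram model."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # fill right to left
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # follow the recorded word boundaries
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(segment("武汉大学生"))
```

With these invented frequencies the path through "武汉" and "大学生" has the highest probability, so the sentence is split into those two words rather than "武汉大学" followed by "生".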

[0041] Existing relatively mature Chinese word segmentation tools include IKAnalyzer, PaodingAnalyzer, etc.

[0042] 1.2 word2vec

[0043] Unsupervised learning of word embeddings has achieved unprecedented success in many natural language processing tasks. Words...


Abstract

The invention addresses the problem of mining similar documents in natural language processing, relates to the technical fields of word embedding, document keyword extraction, document embedding, and fast nearest-neighbor computation in high-dimensional space, and discloses a Doc2vec-based similar entity mining method. The method mines similar entities from their description documents: it uses Word2vec word embeddings and TFIDF keyword extraction, converts each entity's description document into a continuous, dense vector using Doc2vec, and uses a Balltree data structure for the nearest-neighbor search.
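As a rough illustration of the TFIDF keyword-extraction step named in the abstract, the sketch below scores each token by term frequency times the log of inverse document frequency and keeps the top-scoring tokens per document. The exact weighting used in the patent is not shown in this excerpt, so the formula, the function name `tfidf_keywords`, the `top_k` parameter, and the sample documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """docs: list of non-empty token lists. Returns the top_k TF-IDF tokens per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keywords = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable result.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords

# Hypothetical, already-tokenized entity descriptions.
docs = [
    ["music", "player", "app", "music", "playlist"],
    ["video", "player", "app", "video", "stream"],
    ["news", "reader", "app", "news", "headline"],
]
keywords = tfidf_keywords(docs)
print(keywords)
```

A token such as "app" that occurs in every description gets a score of zero, so it never becomes a keyword; tokens concentrated in one description rank highest.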

Description

Technical field

[0001] The invention addresses the problem of mining similar documents in natural language processing, and relates to technical fields such as word embedding, document keyword extraction, document embedding, and fast nearest-neighbor computation in high-dimensional space.

Background

[0002] In many fields such as search, machine reading comprehension, user profiling, and recommendation systems, mining similar words, similar documents, and more specifically similar apps or similar public accounts plays a key role. One of the most direct approaches to similarity mining is to map words or documents into a high-dimensional space, that is, word embedding or document embedding.

[0003] At present, the most mainstream and successful word embedding method is Word2Vec. The technique is a neural probabilistic language model, which was first proposed by Bengio Y et al. The neural probabilistic langu...


Application Information

IPC(8): G06F17/27
CPC: G06F40/279, G06F40/284
Inventor: 李石君, 刘杰, 杨济海, 李号号, 余伟, 余放, 李宇轩
Owner: WUHAN UNIV