Subject term embedding disambiguation method and system based on LDA

The LDA-based topic word embedding disambiguation technique addresses polysemy, where the same word carries different meanings that are difficult to capture, and achieves a clear performance improvement by drawing on richer semantic information.

Active Publication Date: 2020-07-03
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0011] Word sense disambiguation based on Word2Vec has made considerable progress over earlier approaches. However, most of this work shares a main shortcoming: it generates a single, context-independent word vector for each word, but as we all know, words genera


Examples


Embodiment 1

[0048] Embodiment 1: As shown in Figure 1, an LDA-based keyword embedding disambiguation method comprises the following steps:

[0049] Step1: Train the topic model with the online LDA algorithm on the large-scale Wiki corpus, which has no word sense annotations;

[0050] Step2: Using the topic model, assign each document of the Wiki corpus to a topic, forming one document set per topic; then train Word2Vec on each topic's document set to obtain the word vectors under that topic, i.e. the topic word vectors;

[0051] Step3: On the small-scale, semantically annotated SemCor corpus, use the topic model and the topic word vectors to calculate the context vector;

[0052] Step4: Concatenate the context vector with other traditional semantic features, and use an SVM to train and test the disambiguation model.
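The grouping in Step2 can be sketched as follows. This is a minimal illustration rather than the patented procedure: the text does not say how a document is assigned to a topic, so the sketch assumes each document goes to its single most probable topic. The function name `group_documents_by_topic` and the probability values are hypothetical.

```python
from collections import defaultdict


def group_documents_by_topic(doc_topic_probs):
    """Group document indices by their most probable topic.

    doc_topic_probs[d][i] holds p(t_i | d) for document d and topic i.
    Assumption: a document belongs to the topic with the highest probability.
    Returns a dict mapping topic index -> list of document indices.
    """
    topic_sets = defaultdict(list)
    for doc_id, probs in enumerate(doc_topic_probs):
        # Index of the largest probability; ties go to the lowest topic index.
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        topic_sets[best_topic].append(doc_id)
    return dict(topic_sets)


# Hypothetical p(t_i | d) values for three documents and two topics.
doc_topic_probs = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
]
print(group_documents_by_topic(doc_topic_probs))  # {0: [0, 2], 1: [1]}
```

Each resulting document set would then be passed to a separate Word2Vec training run, so that the same word can receive a different vector under each topic.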

[0053] Further, Step1 specifically comprises:

[0054] Step1.1: Tokenize the Wiki corpus, remove the non-word symbols in each document for word segm...

Embodiment 2

[0079] Embodiment 2: An LDA-based keyword embedding disambiguation method, comprising:

[0080] Topic model training steps:

[0081] Step1.1: Tokenize the Wiki corpus, removing the non-word symbols from each document, and convert each document into a single line;

[0082] Step1.2: Then lemmatize the corpus using WordNet;

[0083] Step1.3: Then remove all stop words from the corpus using a preset stop word set, generating a new Wiki corpus;

[0084] Step1.4: Finally, train the topic model on this Wiki corpus using online LDA. The model includes the document-topic probability distribution p(t_i | d) and the word-topic probability distribution p(t_j | w), where d denotes the current document, w denotes the current word, and t_i denotes the i-th topic.
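Steps 1.1 to 1.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `STOP_WORDS` set and the `LEMMAS` lookup table are hypothetical stand-ins for the preset stop word set and the WordNet lemmatizer, and lowercasing the text is an added assumption that the steps above do not mention.

```python
import re

# Hypothetical stand-ins: a real run would use a full stop word list and a
# WordNet-based lemmatizer instead of these small tables.
STOP_WORDS = {"on", "of", "and", "the", "a", "many"}
LEMMAS = {"draws": "draw", "currents": "current"}


def preprocess(document):
    """Return the document as one line of cleaned, space-separated tokens."""
    # Step1.1: keep alphabetic tokens only, dropping non-word symbols.
    # Lowercasing is an assumption, not stated in the source.
    tokens = re.findall(r"[a-z]+", document.lower())
    # Step1.2: lemmatize (a dictionary lookup stands in for WordNet here).
    tokens = [LEMMAS.get(tok, tok) for tok in tokens]
    # Step1.3: remove stop words.
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("Anarchism draws on many currents of thought and strategy."))
# anarchism draw current thought strategy
```

The one-line output per document is the form that topic model training in Step1.4 would consume.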

[0085] Example:

[0086] Suppose there is a corpus containing three documents: {"Anarchism draws on many currents of thought and strategy.", "Anarchism do...


Abstract

The invention relates to a subject term embedding disambiguation method and system based on LDA, and belongs to the technical field of semantic analysis. The method comprises the following steps: a topic model training step, in which a topic model is trained on a Wiki corpus with an LDA algorithm; a topic word vector generation step, in which topic word vectors are trained with Word2Vec according to the Wiki corpus and the topic model; a context vector generation step, in which the vector representation of the context of each ambiguous word is calculated using the topic model and the topic word vectors; and a supervised word sense disambiguation step, in which the context vector is combined with other traditional semantic features and word sense disambiguation is performed with an SVM.

Description

Technical field

[0001] The invention relates to an LDA-based keyword embedding disambiguation method and system, and belongs to the technical field of semantic analysis.

Background

[0002] Natural language is inherently ambiguous, and many words have multiple meanings. For example, "cricket" can denote a sport or an insect; in a specific context, however, each word has one definite meaning. Word sense disambiguation determines the correct meaning of an ambiguous word from its specific context, and is considered an AI-complete problem. It is one of the longest-studied tasks in natural language processing and a key foundational task for many natural language processing applications, widely used in machine translation, information retrieval, information extraction and other fields.

[0003] The commonly used solutions for word sense...


Application Information

IPC(8): G06F40/284, G06F40/30, G06F40/211, G06F40/247, G06K9/62
CPC: G06F18/2411
Inventor: 唐季林, 贾连印, 陈明鲜, 张崇德
Owner KUNMING UNIV OF SCI & TECH