Subject term embedding disambiguation method and system based on LDA

An LDA-based topic word embedding technique for word sense disambiguation. It addresses polysemy, where a word's meaning varies with context and is hard to capture in a single vector, and it achieves a clear performance improvement with richer semantic information.

Active Publication Date: 2020-07-03

KUNMING UNIV OF SCI & TECH

8 Cites, 3 Cited by


AI Technical Summary

Problems solved by technology

[0011] Word sense disambiguation based on Word2Vec has made considerable progress over earlier approaches, but most of this work shares one main shortcoming: it generates a single, context-independent word vector for each word. Words are commonly polysemous, and the meaning of a word often differs from one context to another.

Therefore, the single word vector generated for each ambiguous word has difficulty capturing the word's meaning in different contexts, which limits further improvement of the disambiguation effect.


Examples


Embodiment 1

[0048] Embodiment 1: As shown in Figure 1, an LDA-based topic word embedding disambiguation method comprises the following steps:

[0049] Step1: Using the large-scale Wiki corpus, which has no word sense annotation, train the topic model with the online LDA algorithm;

[0050] Step2: Using the topic model, classify each document of the Wiki corpus by topic to form a document set for each topic, then use Word2Vec to train word vectors on each topic's document set; these are the topic word vectors;

[0051] Step3: Using the small-scale SemCor corpus, which has semantic annotations, calculate the context vector with the topic model and the topic word vectors;

[0052] Step4: Concatenate the context vector with other traditional semantic features, and use an SVM to train and test the disambiguation model.
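The document grouping in Step2 can be sketched in a few lines of plain Python. This is only an illustration: the text does not say how a document is assigned to a topic, so the sketch assumes each document goes to its single most probable topic. The function name `partition_by_topic` and the probability values are hypothetical.

```python
from collections import defaultdict


def partition_by_topic(doc_topic_probs):
    """Group document ids by their most probable topic.

    doc_topic_probs maps a document id to a list of probabilities,
    one per topic. Assumption (not stated in the source): a document
    is assigned to exactly one topic, the one with the highest value.
    """
    topic_docs = defaultdict(list)
    for doc_id, probs in doc_topic_probs.items():
        # Index of the largest probability; ties go to the lowest index.
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        topic_docs[best_topic].append(doc_id)
    return dict(topic_docs)


# Hypothetical document-topic probabilities: three documents, two topics.
doc_topic_probs = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.2, 0.8],
    "doc_c": [0.7, 0.3],
}

topic_docs = partition_by_topic(doc_topic_probs)
print(topic_docs)  # {0: ['doc_a', 'doc_c'], 1: ['doc_b']}
```

Each resulting document set would then serve as the training corpus for one Word2Vec model, giving one set of word vectors per topic.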

[0053] Further, Step1 is specifically:

[0054] Step1.1: Perform word segmentation on the Wiki corpus, remove the non-word symbols in each document for word segm...

Embodiment 2

[0079] Embodiment 2: An LDA-based topic word embedding disambiguation method, comprising:

[0080] Topic model training steps:

[0081] Step1.1: Perform word segmentation on the Wiki corpus, removing the non-word symbols in each document and converting each document into a single line;

[0082] Step1.2: Then lemmatize the corpus using WordNet;

[0083] Step1.3: Then remove all stop words from the corpus using the preset stop word set, generating a new Wiki corpus;

[0084] Step1.4: Finally, train the topic model on the new Wiki corpus using online LDA. The model provides the document-topic probability distribution p(t_i | d) and the word-topic probability distribution p(t_j | w), where d is the current document, w is the current word, and t_i is the i-th topic.
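The preprocessing in Step1.1 to Step1.3 might look roughly like the following sketch. It uses only the Python standard library, so the WordNet lemmatizer and the preset stop word set are replaced by tiny hypothetical stand-ins (`LEMMAS` and `STOP_WORDS`). Lower-casing is also an assumption, since the text does not mention it.

```python
import re

# Hypothetical stand-ins: the source uses WordNet lemmatization and a
# preset stop word set, neither of which is reproduced here.
LEMMAS = {"draws": "draw", "currents": "current"}
STOP_WORDS = {"on", "of", "and", "the", "a", "many"}


def preprocess(document):
    """Return the document as one line of cleaned tokens."""
    # Step1.1: keep alphabetic tokens only, dropping non-word symbols.
    tokens = re.findall(r"[a-z]+", document.lower())
    # Step1.2: lemmatize (dictionary lookup as a stand-in for WordNet).
    tokens = [LEMMAS.get(token, token) for token in tokens]
    # Step1.3: remove stop words.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(tokens)


line = preprocess("Anarchism draws on many currents of thought and strategy.")
print(line)  # anarchism draw current thought strategy
```

The cleaned one-line documents would then be the input to online LDA training in Step1.4.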

[0085] Example:

[0086] Suppose there is a corpus containing three documents: {"Anarchism draws on many currents of thought and strategy.", "Anarchism do...


Abstract

The invention relates to an LDA-based topic word embedding disambiguation method and system, and belongs to the technical field of semantic analysis. The method comprises the following steps: a topic model training step, which trains a topic model on a Wiki corpus with an LDA algorithm; a topic word vector generation step, which trains topic word vectors with Word2Vec from the Wiki corpus and the topic model; a context vector generation step, which calculates the vector representation of the context containing the ambiguous word using the topic model and the topic word vectors; and a supervised word sense disambiguation step, which combines the context vector with other traditional semantic features and performs word sense disambiguation with an SVM.
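The context vector and feature concatenation steps can be illustrated with a small sketch. The excerpt does not give the actual formula, so this is one plausible reading rather than the patent's method: each context word contributes its topic word vectors weighted by its word-topic probabilities, and the contributions are averaged. All names and numbers below are hypothetical.

```python
def context_vector(context_words, word_topic_probs, topic_word_vectors, dim):
    """Average of topic word vectors weighted by word-topic probabilities.

    Assumed formula (not taken from the source): for each context word w,
    sum over topics j of p(t_j | w) * v_j(w), then average over the words
    that the topic model knows.
    """
    total = [0.0] * dim
    used = 0
    for word in context_words:
        probs = word_topic_probs.get(word)
        if probs is None:
            continue  # word unknown to the topic model
        for topic, prob in enumerate(probs):
            vector = topic_word_vectors[topic].get(word)
            if vector is None:
                continue  # word has no vector under this topic
            for k in range(dim):
                total[k] += prob * vector[k]
        used += 1
    return [value / used for value in total] if used else total


# Hypothetical word-topic probabilities and per-topic word vectors.
word_topic_probs = {"bat": [1.0, 0.0], "ball": [0.5, 0.5]}
topic_word_vectors = [
    {"bat": [1.0, 0.0], "ball": [2.0, 0.0]},  # vectors trained on topic 0
    {"bat": [0.0, 1.0], "ball": [0.0, 2.0]},  # vectors trained on topic 1
]

ctx = context_vector(["bat", "ball", "unknown"], word_topic_probs,
                     topic_word_vectors, dim=2)
other_features = [1.0, 0.0]      # stand-in for traditional semantic features
features = ctx + other_features  # concatenated input for the SVM
print(features)  # [1.0, 0.5, 1.0, 0.0]
```

The concatenated `features` list corresponds to one training or test instance for the SVM classifier.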

Description

Technical field

[0001] The invention relates to an LDA-based topic word embedding disambiguation method and system, and belongs to the technical field of semantic analysis.

Background art

[0002] Natural language is inherently ambiguous, and many words have multiple meanings. For example, "cricket" can denote a sport or an insect, but in a specific context each word has a definite meaning. Word sense disambiguation determines the correct meaning of an ambiguous word according to its specific context, and is considered an AI-complete problem. It is one of the longest-standing tasks in natural language processing and a key basic task for many natural language processing applications, widely used in machine translation, information retrieval, information extraction and other fields.

[0003] The commonly used solutions for word sense...


Application Information


IPC(8): G06F40/284, G06F40/30, G06F40/211, G06F40/247, G06K9/62

CPC: G06F18/2411

Inventor: 唐季林, 贾连印, 陈明鲜, 张崇德

Owner: KUNMING UNIV OF SCI & TECH
