English word relevancy calculating method and device based on Wikipedia concept vectors

A technique for calculating English word relevancy based on Wikipedia concept vectors. It addresses problems such as the inability of existing word vectors to accurately distinguish the different concepts of a word.

Active Publication Date: 2017-12-05
QILU UNIV OF TECH

AI Technical Summary

Problems solved by technology

Existing word vectors fuse all the concept information of a word into one vector, so they cannot accurately distinguish the word's different concepts.


Examples

Embodiment Construction

[0107] To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

[0108] Figure 1 shows the flow chart of the English word relevancy calculation method based on Wikipedia concept vectors according to an embodiment of the present invention. The method includes the following steps.

[0109] Step 101: construct the Wikipedia basic corpus.

[0110] Obtain the raw Dump corpus from the Wikipedia Dump service site and normalize it, keeping only the Wikipedia concept documents whose namespace attribute is 0. For each concept document, keep only its main text and concept annotation information. The processed concept documents are collected as the Wikipedia basic corpus. Specifically:

[0111] Step 1-1) visit the Wikipedia Dump service site, download the latest en...
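The namespace filter described in Step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the inline `SAMPLE_DUMP` string, the helper names `local` and `iter_concept_documents`, and the choice of Python's standard `xml.etree.ElementTree` parser are all assumptions made for the example. It assumes the dump uses the MediaWiki XML export layout, in which each `<page>` carries an `<ns>` element.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia dump (hypothetical data). Real dumps are far
# larger and carry an XML namespace on every tag, which local() strips.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apple</title>
    <ns>0</ns>
    <revision><text>An apple is a [[fruit]].</text></revision>
  </page>
  <page>
    <title>Talk:Apple</title>
    <ns>1</ns>
    <revision><text>Discussion page.</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_concept_documents(stream):
    """Yield (title, text) for every page whose <ns> value is 0."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local(child.tag)
            if name in ("title", "ns", "text"):
                fields[name] = child.text or ""
        if fields.get("ns") == "0":
            yield fields.get("title", ""), fields.get("text", "")
        elem.clear()  # release the finished page subtree to limit memory use


basic_corpus = list(
    iter_concept_documents(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
)
print(basic_corpus)
```

Streaming with `iterparse` rather than loading the whole tree is a design choice suited to multi-gigabyte dumps; the kept text still contains wiki link markup such as `[[fruit]]`, which is one plausible form of the concept annotation information mentioned above.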


Abstract

The invention discloses a method and device for calculating English word relevancy based on Wikipedia concept vectors. The method comprises the following steps: 1, raw corpus data is obtained from the Wikipedia Dump service site and normalized to generate a Wikipedia basic corpus; 2, concept annotation and expansion are performed to build a Wikipedia concept corpus; 3, concept vectors are trained on the Wikipedia concept corpus; 4, for each word pair to be compared, the concept set of each word is obtained from Wikipedia; 5, the similarity of the corresponding concept vectors is calculated for every concept pair in the Cartesian product of the two concept sets, and the maximum value is taken as the relevancy of the word pair. With this method, the word concept information contained in Wikipedia can be fully mined to generate word concept vectors, and word relevancy can be calculated more accurately and effectively.
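Step 5 of the abstract can be illustrated with a short sketch. The abstract does not name the similarity measure, so cosine similarity is an assumption here, as are the function names `cosine` and `word_relevancy`, the toy two-dimensional vectors, and the choice to return 0.0 when no concept pair has a vector.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_relevancy(concepts_a, concepts_b, vectors):
    """Maximum similarity over the Cartesian product of two concept sets."""
    scores = [
        cosine(vectors[ca], vectors[cb])
        for ca, cb in product(concepts_a, concepts_b)
        if ca in vectors and cb in vectors  # skip concepts with no vector
    ]
    # Assumption: report 0.0 when no concept pair can be compared.
    return max(scores, default=0.0)


# Hypothetical concept vectors; real ones would come from step 3 (training).
toy_vectors = {
    "Apple (fruit)": [1.0, 0.0],
    "Apple Inc.": [0.0, 1.0],
    "Microsoft": [0.0, 2.0],
}

score = word_relevancy(
    ["Apple (fruit)", "Apple Inc."], ["Microsoft"], toy_vectors
)
print(score)
```

In this toy data the pair ("Apple Inc.", "Microsoft") dominates, which shows why taking the maximum lets the most closely related pair of concepts decide the score for an ambiguous word.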

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method and device for calculating English word relevancy based on Wikipedia concept vectors.

Background art

[0002] Word relevancy refers to the degree of semantic association between two words. It is widely used in natural language processing and directly affects the results of information retrieval, semantic understanding, word sense disambiguation, and text clustering. Existing word relevancy calculation methods fall into two categories. The first is knowledge-base methods, which usually use semantic ontology knowledge bases such as WordNet and judge the relevancy of words by analyzing the number of overlapping words in their definitions, or the path length, concept density, etc. of the words in the ontology concept tree. The second is statistical methods, according to t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/36, G06F40/284
Inventors: 鹿文鹏, 张玉腾
Owner: QILU UNIV OF TECH