Chinese word relevancy calculation method and device based on Wikipedia concept vectors

A calculation method using concept vectors, applied to computing Chinese word relevancy from Wikipedia concept vectors. It addresses problems such as the inability of word vectors to accurately distinguish a word's different concepts.

Active Publication Date: 2017-12-19
南方电网互联网服务有限公司

AI Technical Summary

Problems solved by technology

Existing word vectors fuse all the concept information of a word into a single vector and therefore cannot accurately distinguish the word's different concepts.

Embodiment Construction

[0105] To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

[0106] Figure 1 shows the flow chart of the Chinese word relevancy calculation method based on Wikipedia concept vectors according to an embodiment of the present invention. The method includes the following steps.

[0107] Step 101, constructing a basic corpus of Wikipedia.

[0108] Obtain the raw Dump corpus from the Wikipedia Dump service site and normalize it, keeping only the Wikipedia concept documents whose namespace attribute is 0. For each concept document, keep only its body text and concept annotation information. The processed concept documents are collected as the Wikipedia basic corpus, specifically:

[0109] Step 1-1) visit the Wikipedia Dump service site, download the latest zhwiki...
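
The filtering described in Step 101 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the dump follows the usual MediaWiki export layout (`page`, `title`, `ns`, `revision`, `text` elements) and that concept annotations are ordinary `[[target|anchor]]` wiki links. The sample dump and the names `SAMPLE_DUMP`, `LINK_RE`, `local_name` and `iter_concept_documents` are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real zhwiki dump is far larger and compressed.
SAMPLE_DUMP = """<mediawiki>
  <page><title>苹果</title><ns>0</ns>
    <revision><text>苹果是一种[[水果]]，也指[[苹果公司|苹果]]。</text></revision></page>
  <page><title>Talk:苹果</title><ns>1</ns>
    <revision><text>讨论页</text></revision></page>
</mediawiki>"""

# Assumption: concept annotations are wiki links, [[target]] or [[target|anchor]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}page' if present."""
    return tag.rsplit("}", 1)[-1]


def iter_concept_documents(stream):
    """Yield (title, text, links) for every page whose namespace is 0."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Index the page element and all of its descendants by local tag name.
        fields = {local_name(node.tag): node for node in elem.iter()}
        namespace = (fields["ns"].text or "").strip()
        if namespace == "0":
            title = fields["title"].text or ""
            text = fields["text"].text or ""
            # Each link becomes a (concept, anchor text) pair.
            links = [(m.group(1), m.group(2) or m.group(1))
                     for m in LINK_RE.finditer(text)]
            yield title, text, links
        elem.clear()  # release the finished page to keep memory use flat


if __name__ == "__main__":
    dump = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
    for title, text, links in iter_concept_documents(dump):
        print(title, links)
```

Streaming with `iterparse` and clearing each finished page avoids loading the whole dump into memory, which matters for a full Wikipedia dump.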

Abstract

The invention discloses a Chinese word relevancy calculation method and device based on Wikipedia concept vectors. The method comprises the following steps: 1, raw corpora are acquired from the Wikipedia Dump service site and normalized to generate a Wikipedia basic corpus; 2, concept annotations are expanded to build a Wikipedia concept corpus; 3, concept vectors are trained on the Wikipedia concept corpus; 4, the concept sets of the words in the pair to be compared are obtained from Wikipedia; 5, the similarity of the corresponding concept vectors is calculated for every concept pair in the Cartesian product of the concept sets, and the largest value is taken as the relevancy of the word pair to be compared. The method and device make full use of the word concept information contained in Wikipedia to generate word concept vectors, so that word relevancy can be calculated more accurately and effectively.
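
Step 5 of the abstract can be sketched as below. This is an illustration under stated assumptions, not the patent's code: the abstract only says "similarity" of concept vectors, so cosine similarity is assumed here; the toy two-dimensional vectors, the concept names, and the functions `cosine` and `word_relevancy` are hypothetical; returning 0.0 when no concept pair has vectors is also an assumption.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_relevancy(concepts_a, concepts_b, vectors):
    """Maximum similarity over the Cartesian product of two concept sets.

    concepts_a / concepts_b: concepts associated with each word of the pair.
    vectors: mapping from concept name to its trained concept vector.
    """
    scores = [
        cosine(vectors[ca], vectors[cb])
        for ca, cb in product(concepts_a, concepts_b)
        if ca in vectors and cb in vectors  # skip concepts without a vector
    ]
    return max(scores, default=0.0)


if __name__ == "__main__":
    # Toy vectors, made up purely for illustration.
    toy_vectors = {
        "苹果 (水果)": [1.0, 0.0],
        "苹果公司": [0.0, 1.0],
        "微软": [0.0, 2.0],
    }
    score = word_relevancy(["苹果 (水果)", "苹果公司"], ["微软"], toy_vectors)
    print(score)
```

Taking the maximum means an ambiguous word is scored by whichever of its concepts best matches the other word, instead of by a single vector that blends all of its senses.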

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method and device for calculating the relevancy of Chinese words based on Wikipedia concept vectors.

Background art

[0002] Word relevancy refers to the degree of semantic association between two words. It is widely used in natural language processing and directly affects the results of information retrieval, semantic understanding, word sense disambiguation, and text clustering. Existing word relevancy calculation methods fall into two categories. The first is based on knowledge bases: it usually uses semantic ontology knowledge bases such as WordNet and judges the relevancy of words by analyzing the number of overlapping words in their definitions, or the path length, concept density, etc. of the words in the ontology concept tree. The other is based on statistical method...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/36, G06F40/284
Inventor: 鹿文鹏, 张玉腾, 张甜甜, 孟凡擎
Owner 南方电网互联网服务有限公司