Chinese word similarity calculation method based on fusion strategy

A word similarity calculation technology, applied in the field of text processing, that addresses problems such as slow calculation speed, calculation errors, and the inability to handle unregistered (out-of-vocabulary) words.

Status: Inactive. Publication date: 2019-07-02
BEIJING INFORMATION SCI & TECH UNIV
0 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

[0004] Ontology-based methods are limited by the semantic dictionary: they cannot handle unregistered (out-of-vocabulary, OOV) words, and improper classification of words during ontology construction also leads to errors in word similarity calculation. Statistical methods based on large-scale corpora and word embedding methods are limited by the size of the training corpus; they are computationally expensive and slow, and they are strongly affected by corpus sparsity and by noise in the corpus.



Embodiment Construction

[0073] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0074] A method for calculating the similarity of Chinese words based on a fusion strategy computes word similarity by combining four resources: HowNet, Synonym Cilin, Word2Vec trained on the Chinese Wikipedia corpus, and Baidu Dictionary. For two input words, first determine whether they exist in HowNet or Synonym Cilin; if they do, use HowNet or Synonym Cilin to...
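The lookup order described here can be sketched in Python as a simple cascade. This is a minimal sketch under stated assumptions, not the patented implementation: the fallback to Word2Vec or Baidu Dictionary follows the abstract, since this paragraph is truncated; the names `Resource`, `fused_similarity` and `toy_resource` are hypothetical; the sketch assumes both words must be covered by the same resource and that resources are tried in a fixed order, neither of which the source states; and the score tables are made-up stand-ins for the real HowNet, Synonym Cilin, Word2Vec and Baidu Dictionary computations.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Resource(NamedTuple):
    """A similarity source: a vocabulary check plus a scoring function."""
    name: str
    contains: Callable[[str], bool]
    similarity: Callable[[str, str], float]


def fused_similarity(
    w1: str, w2: str, resources: List[Resource]
) -> Optional[Tuple[str, float]]:
    """Return (resource name, score) from the first resource covering both words.

    Assumption: resources are tried in list order, and a resource is used
    only if it contains both words.
    """
    for res in resources:
        if res.contains(w1) and res.contains(w2):
            return res.name, res.similarity(w1, w2)
    return None  # no resource covers both words


def toy_resource(name: str, scores: Dict[Tuple[str, str], float]) -> Resource:
    """Build a stand-in resource from a hand-written score table (illustrative only)."""
    table = {frozenset(pair): score for pair, score in scores.items()}
    vocab = {word for pair in scores for word in pair}

    def similarity(a: str, b: str) -> float:
        return 1.0 if a == b else table.get(frozenset((a, b)), 0.0)

    return Resource(name, lambda w: w in vocab, similarity)


# Hypothetical vocabularies and scores, ordered as in the described cascade.
RESOURCES = [
    toy_resource("HowNet", {("医生", "大夫"): 0.9}),
    toy_resource("Cilin", {("高兴", "快乐"): 0.8}),
    toy_resource("Word2Vec", {("微信", "微博"): 0.6}),
    toy_resource("BaiduDictionary", {("点赞", "吐槽"): 0.3}),
]

if __name__ == "__main__":
    for pair in [("医生", "大夫"), ("微信", "微博"), ("医生", "微信")]:
        print(pair, fused_similarity(*pair, RESOURCES))
```

Keeping each resource behind the same two-function interface means the dictionary-based and corpus-based scorers can be swapped or reordered without changing the cascade itself.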


Abstract

The invention relates to a Chinese word similarity calculation method based on a fusion strategy. The method calculates word similarity by combining four resources: HowNet, Synonym Cilin, Word2Vec trained on the Chinese Wikipedia corpus, and Baidu Dictionary. For two input words, the method first judges whether the words exist in HowNet or Synonym Cilin; if so, it uses HowNet or Synonym Cilin to calculate the similarity. If not, it judges whether the words exist in the Wikipedia corpus or Baidu Dictionary, and if so, it uses Word2Vec or Baidu Dictionary to calculate the similarity. Because the fusion strategy considers HowNet, Synonym Cilin, Word2Vec and Baidu Dictionary together, the individual strategies complement one another. The resulting Spearman and Pearson correlation coefficients are higher than those of other methods, the accuracy of the word similarity results is improved, and the requirements of practical application can be well met.
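The two evaluation measures named above compare system scores against reference (for example, human-rated) scores for the same word pairs. The following is a minimal standard-library sketch of both coefficients; the function names and the sample score lists are illustrative assumptions, and the source does not state which reference dataset was used.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for four word pairs (not data from the patent).
    reference = [1.0, 2.0, 3.0, 4.0]
    system = [0.1, 0.2, 0.4, 0.9]
    print("Pearson:", pearson(reference, system))
    print("Spearman:", spearman(reference, system))
```

Spearman depends only on the ordering of the scores, so it rewards a method that ranks word pairs correctly even when its raw scores are on a different scale from the reference ratings; Pearson also penalizes departures from a linear relationship.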

Description

Technical field

[0001] The invention belongs to the technical field of text processing, and in particular relates to a method for calculating the similarity of Chinese words based on a fusion strategy.

Background

[0002] Word similarity calculation is a basic research topic in Chinese information processing. It is studied extensively in fields such as natural language processing, automatic question answering, knowledge graphs, text classification, text clustering, information retrieval, information extraction, word sense disambiguation, and machine translation, and it is attracting the attention of more and more scholars.

[0003] Current word similarity calculation methods can be divided into three types: methods based on an existing knowledge ontology, methods based on large-scale corpus statistics, and corpus-based word embedding methods. The first type, based on a knowledge ontology, uses the level, density and distance between words in th...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
CPC: G06F40/194
Inventor: 吕学强, 董志安, 游新冬
Owner: BEIJING INFORMATION SCI & TECH UNIV