Chinese word similarity detection algorithm based on pronunciation, shape and meaning

A detection algorithm and similarity technology, applied in computing, database retrieval, instruments, and related fields, that addresses two problems: the similarity of words containing hidden typos cannot be detected, and the length of Chinese character strings cannot be recognized.

Active Publication Date: 2021-02-05
HAINAN UNIVERSITY

AI Technical Summary

Problems solved by technology

At present, two kinds of algorithms are commonly used for similarity detection of Chinese character strings. The first is similarity detection based on the sound and shape of Chinese characters: basic information about each character, such as its pinyin, shape structure, number of strokes, and stroke order, is encoded as a mathematical expression according to certain coding rules, and specific algorithms then process these expressions to obtain the similarity of the characters. The second is similarity detection based on the semantics of Chinese characters: the character strings are compared with the words and descriptions included in a large knowledge base, and the semantic similarity is calculated from the distance between sememes in the knowledge base. Both types of methods have defects. The detection method must assume that the detected words are completely correct, so the similarity between words with hidden typos cannot be detected.


Embodiment Construction

[0075] For a better understanding of the technical content of the present invention, specific embodiments are provided below, and the invention is further described with reference to the accompanying drawings:

[0076] Referring to Figure 1, the present invention provides a Chinese word similarity detection algorithm based on sound, form and meaning, which combines these three major features of Chinese characters to perform similarity detection on Chinese character strings. It comprises the following steps:

[0077] Step S1: convert the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;

[0078] Step S2: convert each Chinese character in the input Chinese character strings s1 and s2 into a font code (shape code) according to its shape;

[0079] Step S3: calculate the phonetic code similarity, the font code similarity, and the meaning similarity of the Chinese character strings s1 and s2 respectively;

[0080] Step S4: consider t...
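As a rough illustration of steps S1 and S3 for the phonetic feature only, the following Python sketch encodes pinyin as bit strings and compares two strings position by position. The lookup table `PINYIN`, the 8-bits-per-byte encoding, and the position-wise matching rule are all assumptions made for illustration; this excerpt does not disclose the patent's actual coding rules or similarity formula.

```python
# Hypothetical toy lookup table; a real system would need a full pinyin dictionary.
PINYIN = {"账": "zhang", "帐": "zhang", "户": "hu", "护": "hu", "号": "hao"}


def phonetic_code(text):
    """S1 (sketch): map each character to a bit string derived from its pinyin."""
    codes = []
    for ch in text:
        # Characters missing from the table fall back to themselves.
        syllable = PINYIN.get(ch, ch)
        # Assumed encoding: 8 bits for every UTF-8 byte of the syllable.
        codes.append("".join(format(b, "08b") for b in syllable.encode("utf-8")))
    return codes


def phonetic_similarity(s1, s2):
    """S3 (sketch): share of aligned positions whose phonetic codes are equal."""
    c1, c2 = phonetic_code(s1), phonetic_code(s2)
    if not c1 and not c2:
        return 1.0
    matches = sum(a == b for a, b in zip(c1, c2))
    # Dividing by the longer length penalizes strings of different lengths.
    return matches / max(len(c1), len(c2))
```

Under these assumptions, the homophone typo 帐护 scores 1.0 against 账户 on the phonetic feature, while 账号 scores 0.5 because only the first syllable matches.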


Abstract

The invention provides a Chinese word similarity detection algorithm based on pronunciation, shape and meaning, which detects the overall similarity of Chinese character strings by jointly considering these three characteristics of Chinese characters. The algorithm comprises the following steps: first, the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a corresponding phonetic code, and each Chinese character in s1 and s2 is converted into a shape code; next, the phonetic code similarity and the shape code similarity between s1 and s2 are calculated, and the similarity of the meanings of the two strings is calculated independently; finally, contribution parameters for pronunciation, shape and meaning are set according to the application scenario and used to calculate the overall similarity of s1 and s2. The algorithm can meet complex application scenarios: it can be applied to detecting the degree of repetition among structured data items, especially where manual input errors occur, and to detecting sensitive words hidden in wrongly written characters. Compared with Chinese character similarity detection algorithms of the same type, the detection effect on Chinese character string similarity is greatly enhanced.
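The final combination step can be pictured with a short Python sketch. It assumes the contribution parameters act as weights in a normalized weighted sum, which the abstract does not confirm; the function name `overall_similarity` and the default weights are hypothetical.

```python
def overall_similarity(sim_sound, sim_shape, sim_meaning, weights=(0.4, 0.3, 0.3)):
    """Combine three per-feature similarities in [0, 1] into one score.

    Assumption: the contribution parameters are non-negative weights and the
    result is their normalized weighted sum. The default values are arbitrary.
    """
    w_sound, w_shape, w_meaning = weights
    if min(weights) < 0:
        raise ValueError("weights must be non-negative")
    total = w_sound + w_shape + w_meaning
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = w_sound * sim_sound + w_shape * sim_shape + w_meaning * sim_meaning
    # Normalizing by the weight total keeps the result in [0, 1].
    return weighted / total
```

With this form, a scenario such as spotting sensitive words hidden behind homophones could raise the sound weight, while a scenario dominated by keying errors on look-alike characters could raise the shape weight.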

Description

Technical field

[0001] The invention relates to the technical field of Chinese word similarity, and more specifically to a Chinese word similarity detection algorithm based on sound, form and meaning.

Background art

[0002] A string similarity algorithm is a method for calculating the similarity between two different strings, usually expressed as a percentage. String similarity algorithms are used in many computing scenarios, such as data cleaning, user input error correction, recommendation systems, plagiarism detection systems, automatic scoring systems, web search, and DNA sequence matching. At present, the commonly used algorithms for similarity detection of Chinese character strings are as follows. The first is similarity detection based on the sound and shape of Chinese characters, which obtains basic information about the characters, such as the pinyin, the shape structure, the number of strokes...
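To make the idea of a percentage similarity concrete, here is a minimal Python sketch of one generic, widely used measure: edit (Levenshtein) distance normalized by the longer string length. It is offered only as background illustration and is not the method of the invention; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """Similarity as a percentage: 100 means identical, 0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

A purely character-level measure like this treats any substituted character the same way, regardless of whether it sounds or looks like the original, which is the gap that sound, shape and meaning features are meant to address.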


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/126, G06F40/284, G06F16/903
CPC: G06F40/126, G06F40/284, G06F16/90344, Y02D10/00
Inventor: 黄梦醒, 王华敏, 冯思玲, 冯文龙, 张雨, 吴迪
Owner: HAINAN UNIVERSITY