
A Seed-Based Method for Generating Typos Confusion Sets

A technique for generating typo confusion sets, applied in the field of natural language processing. It addresses three problems: unreasonable confusion sets, heavy manual workload, and the high false positive rate of automatic proofreading systems.

Active Publication Date: 2017-03-22
中科国力(镇江)智能技术有限公司

AI Technical Summary

Problems solved by technology

[0003] Chinese computer text is entered mainly through phonetic input methods (such as the Sogou Pinyin input method) and shape-based input methods (such as the Wubi input method), so typos in Chinese characters are characterized mainly by similar sound or similar shape. If confusion sets are generated purely by a sound-similarity or shape-similarity algorithm, many unreasonable confusion sets are produced, which leads to a very high false positive rate in the automatic proofreading system. If the sets are instead filtered entirely by hand, human subjectivity lets some unreasonable confusion sets through while reasonable ones are missed, and the workload is huge.

Method used


Examples


Embodiment Construction

[0053] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0054] As shown in Figure 1 and Figure 2, the present invention is a method for generating confusion sets based on seed typos, comprising the following steps:

[0055] Step 1) Create a typo confusion set graph. The graph is built from the seed typo confusion sets.

[0056] Step 2) Add typo confusion sets automatically. The graph created in Step 1 is used to discover regularities among typos, and new typo confusion sets are added automatically.

[0057] Step 3) Automatically generate homophone typos in the typo confusion sets. Homophonic typos of Chinese characters are added automatically.

[0058] Step 4) Automatically generate non-homophone typos in the typo confusion sets. Non-homophonic typos of Chinese characters are added automatically according to features such as shape similarity and the typo confusion set graph.
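The steps above can be illustrated with a minimal Python sketch. It is not the patented algorithm: the visible text does not specify the graph representation, the rule-mining procedure of Step 2, or the shape-similarity features of Step 4, so only Step 1 and Step 3 are sketched here. The seed sets, the toy toneless-pinyin table, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical seed confusion sets: each key is a character and the value
# lists characters it is commonly confused with. Illustrative data only.
SEED_CONFUSION_SETS = {
    "在": ["再"],
    "的": ["地", "得"],
    "做": ["作"],
}

# Hypothetical toy pronunciation table (toneless pinyin). A real system would
# load a full dictionary; these few entries are assumptions for the demo.
PINYIN = {
    "在": "zai", "再": "zai", "载": "zai",
    "的": "de", "地": "de", "得": "de",
    "做": "zuo", "作": "zuo", "坐": "zuo", "座": "zuo",
}


def build_confusion_graph(seed_sets):
    """Step 1 (sketch): one undirected edge per seed (character, typo) pair."""
    graph = defaultdict(set)
    for char, typos in seed_sets.items():
        for typo in typos:
            graph[char].add(typo)
            graph[typo].add(char)
    return graph


def add_homophones(graph, pinyin):
    """Step 3 (sketch): link every character in the graph to its homophones."""
    by_sound = defaultdict(set)
    for char, sound in pinyin.items():
        by_sound[sound].add(char)
    # Iterate over a snapshot, because new nodes are added inside the loop.
    for char in list(graph):
        sound = pinyin.get(char)
        if sound is None:
            continue  # no pronunciation known for this character
        for other in by_sound[sound] - {char}:
            graph[char].add(other)
            graph[other].add(char)
    return graph


def confusion_set(graph, char):
    """Return the confusion set of `char` as a sorted list (empty if unknown)."""
    return sorted(graph.get(char, set()))


if __name__ == "__main__":
    g = add_homophones(build_confusion_graph(SEED_CONFUSION_SETS), PINYIN)
    for c in SEED_CONFUSION_SETS:
        print(c, confusion_set(g, c))
```

Storing the confusion sets as an undirected adjacency map keeps the relation symmetric, which is one plausible reading of a "confusion set graph". Adding every homophone without filtering, as this sketch does, is exactly the kind of expansion that paragraph [0003] warns can inflate the false positive rate, so a practical system would need an additional filtering step.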

[...


Abstract

A seed-based method for generating typo confusion sets includes the following steps: (1) a typo confusion set graph is established from the seed typo confusion sets; (2) using the typo confusion set graph, regularities among typos are automatically discovered and mined by an algorithm, and the typos found are automatically added to the typo confusion sets; (3) homophone typos are automatically generated in the typo confusion sets, so that homophonic typos of Chinese characters are added automatically; (4) non-homophone typos are automatically generated in the typo confusion sets, so that non-homophonic typos of Chinese characters are added automatically according to features such as shape similarity and the typo confusion set graph.

Description

Technical field

[0001] The invention relates to natural language processing in the computer field, and in particular to a method that uses seeds and a typo graph to build typo confusion sets automatically. The method effectively reduces manual labor, and the generated typo confusion sets can be applied in automatic proofreading systems for Chinese text.

Background art

[0002] With the rapid development of information processing technology and the Internet, traditional text work has been almost completely taken over by computers. Electronic texts such as e-books, e-newspapers, e-mails, office documents, blogs, and microblogs have all become part of people's daily lives. However, typos in these texts are increasingly common, which poses a great challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand for text proofreading. Therefore, it is necessary to study automatic tex...

Claims


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27
Inventor: 刘亮亮, 符建辉, 施恒利, 王石
Owner: 中科国力(镇江)智能技术有限公司