
Method for automatically identifying word repetition errors

A method for automatically recognizing word repetition errors, applied in the fields of electrical digital data processing and natural language data processing. It addresses the problem that repeated words are not handled separately during automatic proofreading.

Pending Publication Date: 2020-09-25
CHINA NAT INST OF STANDARDIZATION

AI Technical Summary

Problems solved by technology

[0004] However, Chinese allows legitimate repetition of words, so naively flagging every repeated word produces many false positives. Most current automatic proofreading of Chinese text does not handle word repetition errors separately, but simply uses word bigram or trigram information to judge whether an error exists.



Examples


Embodiment Construction

[0043] The present invention will be described in further detail below in conjunction with the examples and accompanying drawings, and the following examples do not limit the present invention.

[0044] The present invention provides a method for automatically recognizing word repetition errors, comprising the following steps:

[0045] Segmenting a large-scale training corpus into words, then collecting statistics to obtain the bigram and trigram structures that contain repeated words in the training corpus, together with each repeated word's repetition combination degree, the information entropy of its left-context (preceding) adjacent words, and the information entropy of its right-context (following) adjacent words;

[0046] Collecting the words in the Chinese dictionary that contain repeated characters and building a repeated-word lexicon from the Chinese dictionary;

[0047] Judging the repeated words appearing in the text to be checked based on the repeated words in the C...
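The preparatory steps [0045] and [0046] might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not define the "repetition combination degree", so the ratio of doubled occurrences to total occurrences used here is an assumed stand-in, and "words containing repeated characters" is assumed to mean words with two identical adjacent characters. The function names `entropy`, `repetition_statistics`, and `build_repeated_word_lexicon` are hypothetical, and the corpus is assumed to be segmented already.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (base 2) of a neighbor-frequency Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def repetition_statistics(sentences):
    """Collect statistics for every word that occurs doubled ("w w").

    `sentences` is a list of token lists produced by a word segmenter.
    """
    word_count = Counter()         # occurrences of each word
    pair_count = Counter()         # occurrences of the bigram "w w"
    left = defaultdict(Counter)    # words seen immediately before "w w"
    right = defaultdict(Counter)   # words seen immediately after "w w"
    for tokens in sentences:
        word_count.update(tokens)
        for i in range(len(tokens) - 1):
            w = tokens[i]
            if w != tokens[i + 1]:
                continue
            pair_count[w] += 1
            if i > 0:
                left[w][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[w][tokens[i + 2]] += 1
    return {
        w: {
            # Assumed definition: share of w's occurrences that start a doubling.
            "repetition_degree": pair_count[w] / word_count[w],
            "left_entropy": entropy(left[w]),
            "right_entropy": entropy(right[w]),
        }
        for w in pair_count
    }


def build_repeated_word_lexicon(dictionary_words):
    """Keep dictionary words that contain two identical adjacent characters."""
    return {w for w in dictionary_words
            if any(a == b for a, b in zip(w, w[1:]))}
```

A high neighbor entropy means the doubled word appears in many different contexts, which is the usual reason entropy is used as evidence that a combination is a stable, legitimate unit rather than a typing slip.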



Abstract

The invention discloses a method for automatically identifying word repetition errors, comprising the following steps: after performing word segmentation on a large-scale training corpus, collecting statistics to obtain the bigram and trigram structures that contain repeated words in the training corpus, together with the repetition combination degree, the left-context adjacent-word information entropy, and the right-context adjacent-word information entropy of those repeated words; collecting the words in the Chinese dictionary that contain repeated characters and building a repeated-word lexicon from the Chinese dictionary; judging repeated words that appear in the text to be checked against the repeated words in the Chinese dictionary; and judging repeated words that appear in the text to be checked based on the statistically obtained repetition combination degree, left-context adjacent-word information entropy, and right-context adjacent-word information entropy. The method can quickly determine whether a repeated word is one recorded in the dictionary, and can effectively determine whether a repeated word that is not in the dictionary is nonetheless a legitimate repetition in everyday usage. Judgment and recognition are fast and comprehensive, and practicability is high.
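The two judgment stages described in the abstract could be combined roughly as below. This is a sketch under assumptions rather than the method as claimed: the source gives no thresholds or decision rule, so `min_degree`, `min_entropy`, and the requirement that all three statistics pass are hypothetical choices, as are the function name `find_repetition_errors` and the layout of the `stats` dictionary. The dictionary lookup assumes the segmenter may split a doubled word such as "常常" into two identical tokens.

```python
def find_repetition_errors(tokens, lexicon, stats,
                           min_degree=0.05, min_entropy=1.0):
    """Return (index, word) for each doubled token judged to be an error.

    tokens  : segmented text to be checked
    lexicon : set of dictionary words containing repeated characters
    stats   : word -> {"repetition_degree", "left_entropy", "right_entropy"}
    The thresholds are illustrative assumptions, not values from the source.
    """
    errors = []
    for i in range(len(tokens) - 1):
        w = tokens[i]
        if w != tokens[i + 1]:
            continue
        # Stage 1: the doubled form is a word recorded in the dictionary.
        if w + w in lexicon:
            continue
        # Stage 2: corpus statistics suggest the doubling is common usage.
        s = stats.get(w)
        if (s is not None
                and s["repetition_degree"] >= min_degree
                and s["left_entropy"] >= min_entropy
                and s["right_entropy"] >= min_entropy):
            continue
        errors.append((i, w))
    return errors


# Usage with toy data (all values are made up for illustration).
lexicon = {"常常"}
stats = {"研究": {"repetition_degree": 0.4,
                  "left_entropy": 1.0,
                  "right_entropy": 1.0}}
tokens = ["我", "常", "常", "研究", "研究", "的", "的", "问题"]
flagged = find_repetition_errors(tokens, lexicon, stats)
```

Checking the dictionary first keeps the common case cheap, and the statistical stage only runs for doublings that the dictionary cannot explain.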

Description

Technical field

[0001] The invention relates to a natural language processing method, in particular to a method for discovering word repetition errors in the field of automatic proofreading of Chinese text.

Background art

[0002] In the era of big data, the volume of text data keeps growing, and so does the number of errors in text, including word repetition errors (also known as insertion errors). In Chinese, some words can legitimately be repeated, such as "research research", but others cannot, such as "apology apology" or a doubled "de"; whenever these appear, they are repetition errors.

[0003] How to automatically find repeated words in text is one of the research topics in automatic proofreading of Chinese text.

[0004] However, Chinese allows legitimate repetition of words, so naively flagging every repeated word produces many false positives, and most current automatic proofreading of Chinese text does not deal with word repetiti...


Application Information

IPC(8): G06F40/232, G06F40/242, G06F40/284
CPC: G06F40/232, G06F40/284, G06F40/242, Y02D10/00
Inventor: 王海涛, 曹馨宇, 刘亮亮, 周长青
Owner CHINA NAT INST OF STANDARDIZATION