Lucene-based wrongly written character query method

A typo query method, applied in the fields of electronic digital data processing, special data processing applications, and instruments. It addresses low proofreading efficiency, misjudgment of correct words, and poor user experience, and improves retrieval accuracy.

Active Publication Date: 2017-12-22
南方电网互联网服务有限公司

AI Technical Summary

Problems solved by technology

[0006] 3) True word errors will interfere with the grammar and semantics of the entire sentence, so finding true word errors requires a lot of knowledge and resources;
[0007] 4) Data sparseness is a major obstacle to the automatic proofreading of true word errors.
The automatic proofreading method for Chinese true word errors discussed here addresses data sparseness, misjudgment of correct words, and low proofreading efficiency, and it is effective and accurate. It still has defects, however: in practical applications it requires a large training corpus, and retrieval is slow, which degrades the user experience.



Examples


Embodiment

[0030] In this embodiment, the Lucene-based typo query method is applied to the query text "Youhua". The method includes the following steps:

[0031] (1) Segment the query text "Youhua" into words; the result is "You" and "Hua";

[0032] (2) Read the word "You" and check whether it is a single-character word. It is, so obtain its simset: simset("You")=[游, cala, 唷, 莸, 郵, 呦, from, 牖, inducement, 窈, Worm, worm, worm, friendly, wart, again, larvae, especially, worm, young, worry, quiet, lonely, rich, stubborn, uranium, long, life, oil, right, grapefruit, still, have, excellent, humer, Yor, unity, 卣, confine, euro, blessing, yo, glaze

[0033] (3) Read the word "Hua" (rendered as "flower" in translation), determine that it is a single-character word, and obtain its simset: simset("Hua")=[化, 画, 吪, slip, 铧, 姡, cunning, 骅, stroke, 呚, flower, wow, birch, flower, Hua, Piao, words]

[0034] (4) Form the Cartesian product of the two simsets: Result=[You Hua, You Hua, You Woma...
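Steps (2) to (4) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the simsets are abbreviated, the characters 优 and 油 are our assumption for entries the translation rendered as "excellent" and "oil", and the `dictionary` set and the `cartesian_candidates` helper are hypothetical names introduced for illustration.

```python
from itertools import product

# Abbreviated, partly assumed simsets; the full tables are not given in the text.
sim_you = ["游", "优", "油", "郵"]
sim_hua = ["化", "画", "花"]

# Hypothetical dictionary of known words.
dictionary = {"优化", "油画", "游戏"}


def cartesian_candidates(*simsets):
    """Join one character from each simset, in order, into candidate words."""
    return ["".join(chars) for chars in product(*simsets)]


candidates = cartesian_candidates(sim_you, sim_hua)

# Keep only the candidates that are real dictionary words.
corrections = [word for word in candidates if word in dictionary]

print(len(candidates))
print(corrections)
```

With 4 and 3 characters in the two simsets, the product has 12 candidates, and only those present in the dictionary survive as correction suggestions. The candidate count grows multiplicatively with simset size, which is why the dictionary filter matters.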


Abstract

The invention relates to a Lucene-based query method for wrongly written characters (typos). The query text is segmented into words and the first word is selected. If that word is a single character, the similar-sound table and the similar-shape table are queried, and the characters they return form the result set simset. The Cartesian product of this simset with the next word, or with the next word's simset, is then computed, and each resulting string is matched against all words in a dictionary. Each successful match is added to the error correction result set. If the result set is empty, a null value is returned and matching exits; otherwise, all error correction results are returned and used to perform the query. If the first word is not a single character, or if no product result matches a dictionary word, the subsequent characters are read and the preceding steps are repeated. The method makes Lucene retrieval more accurate and more user-friendly.
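The flow in the abstract can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the patented implementation: segmentation is assumed to have been done already, the tables are plain dictionaries, each simset is assumed to include the character itself, and the names `simset`, `correct`, `sound_table`, `shape_table`, and the sample data are all hypothetical.

```python
from itertools import product


def simset(ch, sound_table, shape_table):
    """Union of similar-sound and similar-shape characters for ch.

    Assumption: the character itself is part of its own simset.
    """
    merged = [ch] + sound_table.get(ch, []) + shape_table.get(ch, [])
    return list(dict.fromkeys(merged))  # de-duplicate, keep first-seen order


def correct(tokens, sound_table, shape_table, dictionary):
    """Return dictionary words reachable by swapping similar characters.

    Returns None when the error correction result set is empty.
    """
    results = []
    i = 0
    while i < len(tokens) - 1:
        word, nxt = tokens[i], tokens[i + 1]
        if len(word) == 1:
            left = simset(word, sound_table, shape_table)
            # Use the next word's simset if it is a single character,
            # otherwise use the next word unchanged.
            if len(nxt) == 1:
                right = simset(nxt, sound_table, shape_table)
            else:
                right = [nxt]
            matches = [a + b for a, b in product(left, right)
                       if a + b in dictionary]
            if matches:
                results.extend(matches)
                i += 2  # both tokens were consumed by the correction
                continue
        i += 1  # not a single character, or no match: read onward
    return results or None


# Hypothetical sample data.
sound_table = {"游": ["优", "油"], "花": ["化", "画"]}
shape_table = {}
dictionary = {"优化", "油画"}

suggestions = correct(["游", "花"], sound_table, shape_table, dictionary)
print(suggestions)
```

In a real deployment the returned suggestions would be used to run the actual Lucene query; that step is omitted here because the text does not describe how the query is built.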

Description

Technical field

[0001] The invention belongs to natural language processing within the field of artificial intelligence, and particularly relates to a Lucene-based typo query method.

Background technique

[0002] With the rapid development of information processing technology and the Internet, traditional text work has been almost completely taken over by computers. Electronic text publications such as e-books, e-newspapers, e-mails, and office documents continue to emerge, and the number of errors in text keeps growing.

[0003] At present, proofreading is mostly done manually. The work is monotonous, labor-intensive, and inefficient, and manual proofreading can no longer meet the demand for text proofreading. The study of automatic text proofreading is therefore of far-reaching significance for both theory and application. Automatic text proofreading is one of the main applications of natural language processing, and it is also a diffic...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/3344, G06F40/289, G06F40/30
Inventor: 张晓如陈璟刘嘎琼陈国程文月刘亮亮
Owner: 南方电网互联网服务有限公司