Candidate set calculation method and device and character error correction method and device in character input

This candidate set and text input technology, applied in computing, digital data processing, and special data processing applications, addresses problems such as a limited error correction range, unsatisfactory error correction results, and an inability to cope with new words.

Inactive Publication Date: 2017-07-18
ALIBABA (CHINA) CO LTD
9 Cites, 18 Cited by

AI Technical Summary

Problems solved by technology

This method relies on very comprehensive user logs, often covers only a limited error correction range, and cannot cope with new words.
The inventor found that, in practice, the above two methods do not perform well at correcting input with specific language habits, such as Hinglish (Indian English).



Examples


Example 1

[0092] According to the first embodiment of the present invention, as shown in Figure 2, a method for calculating a candidate set in text input is provided, including the following steps:

[0093] First, in step S2100, the user log is mined. The user log can be selected for a specific user group with specific input habits; for example, it can be the user log of minority-language users. In particular, Indian English is multilingual: users usually transliterate vocabulary from other native Indian languages into Latin-alphabet input. A word then often has multiple spellings that share the same meaning and pronunciation. Because such spellings follow phonetic rules, we address this case by mining Indian English user logs, so as to obtain a candidate set that matches Indian English input habits. In addition, user logs can be dynamically updated online, so correspondingly, accor...

Example 2

[0146] According to the second embodiment of the present invention, as shown in Figures 4 and 5, an input error correction method based on the method described in the first embodiment is provided. Parts that repeat the first embodiment are therefore not described again in detail.

[0147] As shown in Figure 4, the input error correction method according to this embodiment includes a transition probability calculation step. In traditional pattern recognition theory, user input can be viewed as a sequence of states. The transition probability between states is the probability that two words appear in adjacent positions in the corpus. For example, suppose the existing English corpus is as follows:

[0148] it is over

[0149] How Sweet It Is

[0150] it is time to say goodbye

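[0151] Then the transition probability from "it" to "is" can be calculated as P(is|it) = 3/3 = 1, and the transition probability from "is" to "over" is P(over|is) = 1/3. (Ignoring case, "it" occurs three times and is followed by "is" each time, while "is" occurs three times and is followed by "over" once.)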
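
The counting in this example can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `transition_probabilities` is hypothetical, words are compared case-insensitively, and the denominator is assumed to be the total number of occurrences of the preceding word, which is what the figures 3/3 and 1/3 imply.

```python
from collections import Counter
from fractions import Fraction


def transition_probabilities(corpus):
    """Estimate P(next | prev) from adjacent word pairs in a list of sentences.

    Assumption: the denominator is the total count of `prev` in the corpus,
    including sentence-final occurrences.
    """
    unigram = Counter()
    bigram = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        unigram.update(words)
        bigram.update(zip(words, words[1:]))  # adjacent (prev, next) pairs
    return {
        (prev, nxt): Fraction(count, unigram[prev])
        for (prev, nxt), count in bigram.items()
    }


corpus = ["it is over", "How Sweet It Is", "it is time to say goodbye"]
probs = transition_probabilities(corpus)
p_is_given_it = probs[("it", "is")]
p_over_given_is = probs[("is", "over")]
```

With the three-sentence corpus above, `p_is_given_it` is 1 and `p_over_given_is` is 1/3.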

[0152] The calculation of the tran...


Abstract

The invention discloses a method for calculating a candidate set in character input. The method comprises an extraction step and a candidate set calculation step. Extraction: error correction query pairs are extracted from a user log, and an error correction character string pair is established for each error correction query pair. An error correction query pair indicates the correspondence between wrongly input character content and correctly input character content; an error correction character string pair indicates the correspondence between the wrongly input character string and the correctly input character string within the query pair. Candidate set calculation: when a character string in an input word ti matches the error correction character string pairs, a variant set V={v1, v2, ..., vn} of the word is generated according to those pairs to serve as a candidate set C={c1, c2, ..., cn}, and the corresponding output probabilities P={p1, p2, ..., pn} are calculated. The invention further discloses a candidate set calculation device and an input error correction method and device. The disclosed methods and devices can improve error correction accuracy and adapt well to correcting new words.
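
The two steps in the abstract can be illustrated with a rough Python sketch. Everything specific here is an assumption made for illustration: the query pairs are invented, the string pair is taken to be whatever remains after trimming the shared prefix and suffix of a query pair, and the output probability is taken to be the relative frequency of the string pair in the log. The visible text does not say how the patent aligns strings or computes P.

```python
from collections import Counter


def differing_substrings(wrong, right):
    """Trim the shared prefix and suffix; return the parts that differ."""
    limit = min(len(wrong), len(right))
    start = 0
    while start < limit and wrong[start] == right[start]:
        start += 1
    end = 0
    while end < limit - start and wrong[-1 - end] == right[-1 - end]:
        end += 1
    return wrong[start:len(wrong) - end], right[start:len(right) - end]


def build_pair_counts(query_pairs):
    """Extraction: count (wrong string, right string) pairs seen in the log."""
    counts = Counter()
    for wrong_query, right_query in query_pairs:
        wrong, right = differing_substrings(wrong_query, right_query)
        if wrong and right:  # this sketch skips pure insertions and deletions
            counts[(wrong, right)] += 1
    return counts


def candidate_set(word, pair_counts):
    """Candidate set calculation: variants of `word` with assumed probabilities."""
    variants = Counter()
    for (wrong, right), count in pair_counts.items():
        if wrong in word:
            variants[word.replace(wrong, right)] += count
    total = sum(variants.values())
    return {variant: count / total for variant, count in variants.items()}


# Hypothetical error correction query pairs: (wrongly input, correctly input).
query_pairs = [
    ("zindagee", "zindagi"),
    ("khushee", "khushi"),
    ("deewana", "diwana"),
    ("dewar", "devar"),
]
pair_counts = build_pair_counts(query_pairs)
candidates = candidate_set("deewana", pair_counts)
```

Because candidates are generated from substring pairs rather than looked up as whole words, a word that never appeared in the log can still receive variants, which is one way to read the abstract's claim about new words.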

Description

Technical field

[0001] The present invention relates to the technical field of natural language processing. Specifically, the present invention relates to a method and device for calculating a candidate set in text input, and a method and device for text error correction.

Background

[0002] Error correction is an important part of search. According to statistics in the literature, about 10%-15% of search engine queries are entered incorrectly. In groups with specific language habits, such as Indian English or Indian music search terms, erroneous queries account for 30%. Common search error correction methods include the noisy channel model and the hidden Markov model. The noisy channel model obtains the candidate set through edit distance and then finds the maximum conversion probability based on statistics, yielding the optimal candidate correction; the hidden Markov model regards the query as a set of observation states, and th...
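
For context, the edit-distance candidate generation used by the noisy channel approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular system: the vocabulary and its counts are made up, the function names are hypothetical, and the channel probability is treated as uniform within the distance threshold, so candidates are ranked by word frequency alone.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(word, word_counts, max_distance=1):
    """Return the most frequent vocabulary word within `max_distance` edits."""
    candidates = [w for w in word_counts
                  if edit_distance(word, w) <= max_distance]
    if not candidates:
        return word  # nothing close enough; leave the input unchanged
    return max(candidates, key=lambda w: word_counts[w])


# Hypothetical vocabulary with made-up frequencies.
word_counts = {"hello": 50, "help": 30, "held": 5, "world": 40}
best = correct("helo", word_counts)
distance = edit_distance("deewana", "diwana")
```

Here `best` is "hello". The transliteration variants "deewana" and "diwana" are two edits apart, so with `max_distance=1` neither would be proposed for the other, which is one concrete form of the limited error correction range.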


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/3344, G06F16/951
Inventor: 吴岳谢玄亮陈凯成
Owner: ALIBABA (CHINA) CO LTD