Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building

Status: Inactive. Publication Date: 2006-01-19
IBM CORP

AI Technical Summary

Benefits of technology

[0034] Advantages of the invention include that, with a word boundary probability estimating device, a probabilistic language model building device, a kana-kanji converting device, and a method therefor according to the present invention as described above, existing vocabulary/linguistic models built from the first corpus (word-segmented) are combined with vocabulary/linguistic models built by probabilistically segmenting the second corpus (word-unsegmented).
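As a rough illustration of what such a combination could look like (the text above does not fix an implementation; the interpolation weight `lam` and the dict-of-pairs model representation are assumptions for this sketch), two word-bigram models can be merged by linear interpolation:

```python
# Hypothetical sketch: combine a bigram model estimated from the
# word-segmented first corpus (p_seg) with one estimated from the
# probabilistically segmented second corpus (p_raw). The weight `lam`
# and the dict keyed by (history, word) pairs are illustrative choices.
def interpolate_models(p_seg, p_raw, lam=0.7):
    """Return P(word | history) = lam * P_seg + (1 - lam) * P_raw."""
    def prob(history, word):
        return (lam * p_seg.get((history, word), 0.0)
                + (1.0 - lam) * p_raw.get((history, word), 0.0))
    return prob

# Usage: p = interpolate_models(p_seg, p_raw); p("東京", "都")
```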

Problems solved by technology

However, existing automatic word segmentation systems provide low accuracy in fields such as the medical field, where many technical terms are used.
For training with a corpus in an application field, it is generally difficult to obtain a huge corpus that has been manually segmented and tagged for that field; doing so takes much time and cost, making it difficult to develop a system in a short period.
Although information segmented into words in one field (for example, the medical field) may work for processing the language of that field, there is no assurance that it will also work in another application field (for example, the economic field, which is completely different).




Example

[0048] 1 . . . Kana-kanji converting device
[0049] 10 . . . CPU
[0050] 12 . . . Input device
[0051] 14 . . . Display device
[0052] 16 . . . Storage device
[0053] 18 . . . Recording medium
[0054] 2 . . . Kana-kanji conversion program
[0055] 22 . . . Language decoding section
[0056] 30 . . . Base form pool
[0057] 300 . . . Vocabulary dictionary
[0058] 302 . . . Character dictionary
[0059] 32 . . . Language model
[0060] 320 . . . First corpus (word-segmented)
[0061] 322 . . . Second corpus (word-unsegmented)

DETAILED DESCRIPTION OF THE INVENTION

[0062] The present invention provides that a word n-gram probability is calculated with high accuracy in a situation where the following are given as a training corpus, that is, storage containing vast quantities of sample sentences:
[0063] (a) a first corpus (word-segmented), a relatively small corpus containing manually segmented word information; and
[0064] (b) a second corpus (word-unsegmented), a relatively large corpus containing raw, unsegmented text.
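To make the roles of the two corpora concrete, here is a minimal sketch, consistent with the abstract below but not the patented implementation itself, of how fractional (expected) word counts could be accumulated from the unsegmented second corpus once each inter-character position has been assigned a boundary probability; the names `p_bound` and `max_len` are assumptions:

```python
from collections import defaultdict

def expected_word_counts(text, p_bound, max_len=8):
    """Fractional (expected) word counts over an unsegmented string.

    p_bound[k] is the probability of a word boundary between text[k-1]
    and text[k]; p_bound[0] and p_bound[len(text)] are 1.0 because the
    sentence edges are certain boundaries. The substring text[i:j] is a
    word with probability
        p_bound[i] * prod_{i<k<j} (1 - p_bound[k]) * p_bound[j].
    """
    counts = defaultdict(float)
    n = len(text)
    for i in range(n):
        inside = 1.0                      # product of (1 - p_bound[k]) so far
        for j in range(i + 1, min(i + max_len, n) + 1):
            counts[text[i:j]] += p_bound[i] * inside * p_bound[j]
            inside *= 1.0 - p_bound[j]    # position j becomes word-internal
    return counts

# Usage with made-up boundary probabilities for a 3-character string:
# expected_word_counts("東京都", [1.0, 0.2, 0.6, 1.0])
```

Summing such fractional counts over the whole second corpus yields the probabilistically weighted n-gram statistics that complement the exact counts from the first corpus.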

[0...



Abstract

Calculates a word n-gram probability with high accuracy in a situation where a first corpus, which is a relatively small corpus containing manually segmented word information, and a second corpus, which is a relatively large corpus of raw, unsegmented text, are given as a training corpus, that is, storage containing vast quantities of sample sentences. The vocabulary, including contextual information, is expanded from the words occurring in the relatively small first corpus to the words occurring in the relatively large second corpus by using a word n-gram probability estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating word n-grams and the probability that the boundary between two adjacent characters is the boundary between two words (the segmentation probability). The second corpus (word-unsegmented), in which probabilistic word boundaries are assigned based on information in the first corpus (word-segmented), is used for calculating word n-grams.
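As an illustration of the segmentation probability mentioned above, the following sketch estimates the probability of a boundary between two adjacent characters by relative frequency in the word-segmented first corpus; conditioning only on the bare character pair is a simplifying assumption (the invention may use richer context):

```python
from collections import defaultdict

def train_boundary_model(segmented_sentences):
    """P(word boundary | adjacent character pair), estimated by relative
    frequency from the word-segmented first corpus. Conditioning on the
    bare character pair is a simplifying assumption for illustration.

    segmented_sentences: iterable of word lists, e.g. [["東京", "都"], ...]
    """
    bound = defaultdict(int)   # times the pair occurred with a boundary
    total = defaultdict(int)   # times the pair occurred at all
    for words in segmented_sentences:
        chars = "".join(words)
        cuts, pos = set(), 0
        for w in words[:-1]:   # a boundary follows every non-final word
            pos += len(w)
            cuts.add(pos)
        for k in range(1, len(chars)):
            pair = (chars[k - 1], chars[k])
            total[pair] += 1
            bound[pair] += k in cuts
    return {pair: bound[pair] / total[pair] for pair in total}
```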

Description

FIELD OF THE INVENTION

[0001] The present invention relates to recognition technology in natural language processing, and in particular to improving the accuracy of recognition in natural language processing by using a corpus, especially by effectively using a corpus to which word segmentation has not been applied.

BACKGROUND ART

[0002] Along with the progress of recognition technology for natural language, various techniques, including kana-kanji conversion, spelling checking (character error correction), OCR, and speech recognition, have achieved a practical level of prediction capability. At present, most of the methods for implementing these techniques with high accuracy are based on probabilistic language models and/or statistical language models. Probabilistic language models are based on the frequency of occurrence of words or characters and require a collection of a huge number of texts (a corpus) in the application field.

[0003] The following documents are considered: [0004] [Non-patent Docum...
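To ground the statement in [0002] that probabilistic language models are based on frequencies of occurrence, here is a minimal maximum-likelihood word-bigram sketch over a segmented corpus; the smoothing that any practical system needs is omitted, and all names are illustrative:

```python
from collections import Counter

def bigram_model(corpus):
    """Maximum-likelihood word-bigram model from a word-segmented corpus
    (illustrative only; practical systems add smoothing for unseen events).

    corpus: iterable of sentences, each a list of words.
    """
    BOS, EOS = "<s>", "</s>"
    uni, bi = Counter(), Counter()
    for words in corpus:
        seq = [BOS] + list(words) + [EOS]
        uni.update(seq[:-1])                 # history counts
        bi.update(zip(seq[:-1], seq[1:]))    # (history, word) counts
    # P(w | h) = count(h, w) / count(h)
    return {hw: c / uni[hw[0]] for hw, c in bi.items()}
```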


Application Information

IPC(8): G06F17/27; G06F40/00; G10L15/187; G10L15/197
CPC: G06F17/2863; G06F17/2715; G06F40/216; G06F40/53
Inventors: MORI, SHINSUKE; TAKUMA, DAISUKE
Owner: IBM CORP