Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building

Inactive Publication Date: 2006-01-19

IBM CORP

6 Cites, 110 Cited by

Benefits of technology

[0034] With the word boundary probability estimating device, probabilistic language model building device, kana-kanji converting device, and method according to the present invention described above, existing vocabulary / linguistic models for the first corpus (word-segmented) are combined with vocabulary / linguistic models built by probabilistically segmenting the second corpus (word-unsegmented), which is a raw corpus, so that the accuracy of recognition in natural language processing can be improved. Because the capability of a probabilistic language model can be improved simply by collecting sample sentences in a field of interest, the present invention can also be applied to fields for which no corpus for language recognition techniques is provided.

Problems solved by technology

However, existing automatic word segmentation systems achieve low accuracy in fields such as the medical field, where many technical terms are used.

For training on a corpus in an application field, it is generally difficult to obtain a huge manually segmented and tagged corpus for that field; doing so takes much time and cost, which makes it difficult to develop a system in a short period.

Although information segmented into words in one field (for example, the medical field) may work for processing the language in that field, there is no assurance that it will also work in another application field (for example, the economic field, which is completely different from the medical field).

In other words, a corpus segmented and tagged correctly for one field may be definitely correct in that field but not necessarily correct in other fields, because the segmentation and / or tagging of the corpus has been fixed.

However, all of these techniques aim to fix word boundaries in advance during word segmentation.

However, in Asian languages such as Japanese, it is difficult to morphologically analyze even written text because, unlike English text, text in such languages is written without a space between words.

However, methods (a) and (c) require high computational costs for bi-grams and higher-order n-grams and are unrealistic.

Embodiment Construction

[0062] The present invention calculates a word n-gram probability with high accuracy in a situation where the following are given as a training corpus, that is, a store of vast quantities of sample sentences:
[0063] (a) a first corpus (word-segmented), which is a relatively small corpus containing manually segmented word information, and
[0064] (b) a second corpus (word-unsegmented), which is a relatively large corpus containing raw information.

[0065] Vocabulary including contextual information is expanded from words occurring in the first corpus (word-segmented) of relatively small size to words occurring in the second corpus (word-unsegmented) of relatively large size by using a word n-gram probability estimated from an unknown word model and the raw corpus.

[0066] The first corpus (word-segmented) is used for calculating n-grams and the probability that the boundary between two adjacent characters will be the boundary of two words (segmentation probability). The second...
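
As a rough illustration of the segmentation probability described in [0066], the sketch below estimates, from a tiny word-segmented corpus, the probability that a word boundary falls between two adjacent characters. Conditioning the estimate on the pair of character classes (hiragana, katakana, kanji, other) is an assumption made here for illustration; this excerpt does not say what the estimate is conditioned on. The names `char_class` and `estimate_boundary_probs` and the sample sentences are hypothetical.

```python
from collections import Counter

def char_class(ch):
    """Coarse character class; the classes chosen here are an assumption."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"

def estimate_boundary_probs(segmented_sentences):
    """Relative-frequency estimate of P(boundary | class pair).

    segmented_sentences: list of sentences, each a list of words.
    """
    boundary = Counter()  # class pairs seen with a word boundary between them
    total = Counter()     # class pairs seen at all
    for words in segmented_sentences:
        chars = "".join(words)
        # Character offsets at which a new word starts (sentence ends excluded).
        cuts = set()
        pos = 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(chars)):
            pair = (char_class(chars[i - 1]), char_class(chars[i]))
            total[pair] += 1
            if i in cuts:
                boundary[pair] += 1
    return {pair: boundary[pair] / total[pair] for pair in total}

# Hypothetical first corpus: two manually segmented sentences.
first_corpus = [
    ["言語", "を", "学ぶ"],
    ["言語", "の", "学習"],
]
probs = estimate_boundary_probs(first_corpus)
```

In this toy corpus a boundary never falls between two kanji, always falls between hiragana and kanji, and falls between kanji and hiragana in two of three cases. A real first corpus would give smoother estimates, and a finer conditioning context than character classes may be preferable.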

Abstract

A word n-gram probability is calculated with high accuracy in a situation where a first corpus (word-segmented), which is a relatively small corpus containing manually segmented word information, and a second corpus (word-unsegmented), which is a relatively large corpus, are given as a training corpus, that is, a store of vast quantities of sample sentences. Vocabulary including contextual information is expanded from words occurring in the first corpus of relatively small size to words occurring in the second corpus of relatively large size by using a word n-gram probability estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the boundary between two adjacent characters will be the boundary of two words (segmentation probability). The second corpus (word-unsegmented), in which probabilistic word boundaries are assigned based on information in the first corpus (word-segmented), is used for calculating word n-grams.
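
One way to read "probabilistic word boundaries" is that each gap between adjacent characters carries a probability of being a word boundary, and a substring counts as a word occurrence with the probability that boundaries fall at both of its ends and nowhere inside it. The sketch below computes such expected word counts for the unigram case, assuming the gaps are independent of one another. That independence assumption, the function name `expected_word_counts`, the `max_len` cap, and the example probabilities are illustrative and are not taken from the patent text.

```python
from collections import defaultdict

def expected_word_counts(sentence, gap_probs, max_len=4):
    """Expected count of each substring occurring as a word.

    sentence:  unsegmented string of n characters.
    gap_probs: n - 1 probabilities; gap_probs[i] is the probability of a
               word boundary between sentence[i] and sentence[i + 1].
    Sentence start and end are treated as certain boundaries.
    """
    n = len(sentence)
    if len(gap_probs) != n - 1:
        raise ValueError("need exactly one probability per character gap")
    # p[k] = probability of a boundary just before character k (k = 0..n).
    p = [1.0] + list(gap_probs) + [1.0]
    counts = defaultdict(float)
    for start in range(n):
        inside = 1.0  # probability of no boundary strictly inside the span
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end - start > 1:
                inside *= 1.0 - p[end - 1]
            counts[sentence[start:end]] += p[start] * inside * p[end]
    return dict(counts)

# Hypothetical example: "abc" with a 50% boundary after "a" and a
# certain boundary after "b".
counts = expected_word_counts("abc", [0.5, 1.0])
```

Here the two possible segmentations, a|b|c and ab|c, each have probability 0.5, so "a", "b", and "ab" each get an expected count of 0.5 and "c" gets 1.0. Normalizing such counts over the whole second corpus would give unigram probabilities; extending the same product over consecutive spans is one plausible route to higher-order n-grams.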

Description

FIELD OF THE INVENTION

[0001] The present invention relates to recognition technology in natural language processing, and to improving the accuracy of recognition in natural language processing by using a corpus, in particular by effectively using a corpus to which segmentation is not applied.

BACKGROUND ART

[0002] Along with the progress of recognition technology for natural language, various techniques, including kana-kanji conversion, spelling checking (character error correction), OCR, and speech recognition techniques, have achieved a practical-level prediction capability. At present, most of the methods for implementing these techniques with high accuracy are based on probabilistic language models and / or statistical language models. Probabilistic language models are based on the frequency of occurrence of words or characters and require a collection of a huge number of texts (corpus) in an application field.

[0003] The following documents are considered:

[0004] [Non-patent Docum...

Application Information

IPC(8): G06F17/27, G06F40/00, G10L15/187, G10L15/197

CPC: G06F17/2863, G06F17/2715, G06F40/216, G06F40/53

Inventor: MORI, SHINSUKE; TAKUMA, DAISUKE

Owner: IBM CORP
