System and method for diacritization of text

A text diacritic restoration system, applied in the field of diacritization, addresses two problems: documents without diacritics confuse beginner readers and people with learning disabilities, and they are also problematic for language processing applications. It provides an accurate and reliable technique for restoring diacritics.

Status: Inactive; Publication Date: 2008-10-30
NUANCE COMM INC
3 Cites, 10 Cited by

AI Technical Summary

Benefits of technology

This method provides a highly accurate and reliable technique for restoring diacritics, improving the accuracy of language processing and synthetic production, and reducing the need for manual diacritization, while being adaptable to dynamic language models.

Problems solved by technology

Arabic is usually written without short vowels and other diacritics, which often leads to considerable ambiguity, since several words with different diacritic patterns may appear identical in a diacritic-less setting.
A document without diacritics is a source of confusion for beginner readers and people with learning disabilities.
It is also problematic for video, speech, and natural language processing applications, where a diacritic-less setting adds another layer of ambiguity when processing the data.
Currently, applications such as text-to-speech, speech-to-text, and others use data in which diacritics are placed manually or by rule-based methods, which can be tedious and time-consuming to generate and less accurate.
The main disadvantage of rule-based methods is that it is difficult to keep the rules up to date or to extend the method to new applications, because of the productive nature of any “living” spoken language.
A previously proposed method does not appear to handle the two syllabification marks: shedda, which shows the doubling of the preceding consonant, and sukuun, which denotes the lack of a vowel.
Even though the methods proposed for diacritization have been maturing and improving over time, they still offer only a limited solution in terms of accuracy and diacritic coverage.
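The ambiguity described above can be illustrated with a short sketch. It is not part of the patent: the helper name `strip_diacritics` is hypothetical, and the sketch assumes that the Arabic diacritics of interest (short vowels, shedda, sukuun) are the Unicode nonspacing marks, category `Mn`. The two example words, "kataba" (he wrote) and "kutub" (books), are written with explicit code points so the snippet does not depend on editor font support.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop Unicode nonspacing marks (category Mn).

    Assumption: for Arabic text this removes the short vowels,
    shedda and sukuun while leaving the base letters untouched.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# kaf + fatha, ta + fatha, ba + fatha  -> "kataba" (he wrote)
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf + damma, ta + damma, ba          -> "kutub" (books)
kutub = "\u0643\u064F\u062A\u064F\u0628"

# The diacritized forms differ, but both collapse to the same
# three-letter skeleton once the marks are removed.
assert kataba != kutub
assert strip_diacritics(kataba) == strip_diacritics(kutub) == "\u0643\u062A\u0628"
```

Restoration is the inverse of `strip_diacritics`, and the example shows why it is not a deterministic mapping: one undiacritized string can correspond to several valid diacritized words.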

Embodiment Construction

[0025]Aspects of the present invention provide systems and methods that ensure a highly accurate restoration of diacritics in language processing and synthetic production. This highly accurate restoration eliminates the cost of manually diacritizing text needed for many applications. While the present disclosure describes the Arabic language and employs Arabic as an example, the principles of the present embodiments may be employed in any language or coding system which employs diacritics or other symbolic equivalents (e.g., Hebrew).

[0026]Introduction to Diacritics: Like most Semitic languages, Arabic is usually written without diacritical marks. In TABLE 1, diacritics are presented with the grapheme (lam) to demonstrate where they are placed in the text along with their names and meaning. Arabic has 28 letters (graphemes), 25 of which are consonants and the remaining 3 are long vowels. The Arabic alphabet can be extended to 90 by additional shapes, marks, and vowels. Unlike many other l...

Abstract

A system and method for restoration of diacritics includes making classification decisions regarding an utterance in accordance with an aggregate of a plurality of information sources in a diacritization model for diacritic restoration. A best diacritic representation is determined for graphemes in the utterance based upon a best match with the diacritization model. A diacritically restored representation of the utterance is output.
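The abstract's idea of choosing a diacritic per grapheme from an aggregate of information sources can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented model: the linear scoring, the greedy left-to-right decoding, the feature names, the label set, and the hand-set `TOY_WEIGHTS` are all hypothetical, and Latin transliteration is used for readability.

```python
# Hypothetical label set: short vowels "a", "u", "i" and "" for no diacritic.
LABELS = ["a", "u", "i", ""]


def features(graphemes, i, prev_label):
    """Combine several information sources for position i:
    the current grapheme, its neighbours, and the previous decision."""
    prev_char = graphemes[i - 1] if i > 0 else "<s>"
    next_char = graphemes[i + 1] if i + 1 < len(graphemes) else "</s>"
    return [
        f"cur={graphemes[i]}",
        f"prev_char={prev_char}",
        f"next_char={next_char}",
        f"prev_label={prev_label}",
    ]


def score(weights, feats, label):
    """Sum the weights of all (feature, label) pairs that fire."""
    return sum(weights.get((f, label), 0.0) for f in feats)


def restore(graphemes, weights):
    """Greedy decoding: pick the best-scoring label for each grapheme."""
    restored, prev_label = [], "<s>"
    for i, g in enumerate(graphemes):
        feats = features(graphemes, i, prev_label)
        best = max(LABELS, key=lambda lab: score(weights, feats, lab))
        restored.append((g, best))
        prev_label = best
    return restored


# Hand-set toy weights, invented purely for this illustration.
TOY_WEIGHTS = {
    ("cur=k", "u"): 1.0,
    ("prev_label=u", "u"): 0.8,
    ("next_char=</s>", ""): 1.5,
}

result = restore(list("ktb"), TOY_WEIGHTS)
assert result == [("k", "u"), ("t", "u"), ("b", "")]  # i.e. "kutub"
```

In a real system the weights would be learned from diacritized training text rather than set by hand, and a search over whole label sequences could replace the greedy choice.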

Description

RELATED APPLICATION INFORMATION

[0001]This application is a Continuation of co-pending U.S. patent application Ser. No. 11/386,626 filed on Mar. 22, 2006, incorporated herein in its entirety.

BACKGROUND

[0002]1. Technical Field

[0003]The present invention relates to diacritization (e.g., vowelization) of text and more particularly to a diacritization restoration system and method, which restores missing diacritics from text reproductions of speech and translated text.

[0004]2. Description of the Related Art

[0005]Arabic documents are composed of scripts without short vowels and other diacritic marks. The written text is actually missing indications of the vowels, since those familiar with reading the language can do so without the vowels being indicated. This often leads to a considerable ambiguity since several words that have different diacritic patterns may appear identical in a diacritic-less setting. Educated Modern Standard Arabic speakers are able to accurately restore diacritics i...

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/20, G06F40/00
CPC: G06F17/273, G06F17/2863, G06F40/232, G06F40/53
Inventors: EMAM, OSSAMA S.; SARIKAYA, RUHI; ZITOUNI, IMED
Owner: NUANCE COMM INC