Noise removal in multibyte text encodings using statistical models

A technique for detecting and removing noise bytes, applied in the fields of coding, coding components, and digital transmission systems, that addresses problems such as information loss and ambiguous repair

Inactive Publication Date: 2005-11-09
INT BUSINESS MASCH CORP

AI Technical Summary

Problems solved by technology

For example, a GB-to-Unicode converter simply crashes at the first invalid byte sequence, so that all information after the noise is lost.
[0005] Repairing such noise is inherently ambiguous.
For example, consider a 9-byte sequence in which every byte lies in the GB2312-80 range 161-254: which "half-character" is the noise to discard? Discarding any one byte leaves four perfectly valid Chinese characters, but possibly in an incomprehensible sequence.
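The ambiguity can be illustrated with a minimal Python sketch. It checks byte ranges only (ASCII 0-127, paired bytes 161-254) and does not verify that each pair is an assigned GB2312-80 character. The function names and the sample bytes are illustrative assumptions, not taken from the patent.

```python
def is_range_valid(data: bytes) -> bool:
    """True if data consists only of ASCII bytes and pairs of bytes in 161-254."""
    i = 0
    while i < len(data):
        if data[i] <= 127:
            i += 1  # single ASCII byte
        elif (161 <= data[i] <= 254 and i + 1 < len(data)
              and 161 <= data[i + 1] <= 254):
            i += 2  # one double-byte character
        else:
            return False
    return True


def single_byte_repairs(data: bytes):
    """List (deleted_index, result) for every one-byte deletion that yields valid data."""
    results = []
    for k in range(len(data)):
        candidate = data[:k] + data[k + 1:]
        if is_range_valid(candidate):
            results.append((k, candidate))
    return results


# Hypothetical sample: nine distinct bytes, all in the range 161-254.
noisy = bytes([0xB0, 0xA1, 0xB1, 0xA2, 0xB2, 0xA3, 0xB3, 0xA4, 0xB4])
repairs = single_byte_repairs(noisy)
```

Here the original sequence is invalid, yet each of the nine possible deletions produces a different range-valid sequence of four characters, so range checks alone cannot pick the right repair.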



Examples


Embodiment Construction

[0038] Figure 1 depicts the basic flow of the present invention. A possibly corrupted byte sequence is passed through a state sequence marker 100, which first generates the most probable state sequence for the byte sequence and then modifies that state sequence so that all errors, or "noise", are localized into a single state. The byte sequence and its associated state sequence are then passed through a repair module 110, which examines them to determine whether the byte sequence contains any errors and, if so, corrects them, thereby outputting a sequence of valid bytes.
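The two-stage flow can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: it folds the "generate, then modify" steps into a single Viterbi pass with an explicit noise state `N`. The transition table, the noise penalty, and the `pair_logprob` scoring function are all assumptions introduced here.

```python
import math

STATES = ("A", "GB1", "GB2", "N")
# Assumed transitions: N (noise) may occur wherever a character could start.
NEXT = {"A": ("A", "GB1", "N"), "GB1": ("GB2",),
        "GB2": ("A", "GB1", "N"), "N": ("A", "GB1", "N")}
STARTS = ("A", "GB1", "N")
ENDS = ("A", "GB2", "N")
NOISE_PENALTY = -5.0  # hypothetical log-score for labelling a byte as noise


def emit(state, data, i, pair_logprob):
    """Log-score of byte data[i] under the given state."""
    b = data[i]
    if state == "N":
        return NOISE_PENALTY
    if state == "A":
        return 0.0 if b <= 127 else -math.inf
    if not 161 <= b <= 254:
        return -math.inf
    # GB2 is only reachable from GB1, so data[i - 1] is the first byte of the pair.
    return pair_logprob(data[i - 1], b) if state == "GB2" else 0.0


def mark_states(data, pair_logprob):
    """State sequence marker: Viterbi search for the best-scoring state path."""
    if not data:
        return []
    score = {s: emit(s, data, 0, pair_logprob) if s in STARTS else -math.inf
             for s in STATES}
    back = []
    for i in range(1, len(data)):
        new, ptr = {}, {}
        for s in STATES:
            preds = [p for p in STATES if s in NEXT[p]]
            best = max(preds, key=lambda p: score[p])
            new[s] = score[best] + emit(s, data, i, pair_logprob)
            ptr[s] = best
        back.append(ptr)
        score = new
    path = [max(ENDS, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def repair(data, pair_logprob):
    """Repair module: drop every byte that the marker labelled as noise."""
    states = mark_states(data, pair_logprob)
    return bytes(b for b, s in zip(data, states) if s != "N")


# Toy statistical model: known character pairs score 0, unknown pairs -3.
clean = "中文".encode("gb2312")
known = {clean[0:2], clean[2:4]}
toy_model = lambda b1, b2: 0.0 if bytes([b1, b2]) in known else -3.0
noisy = b"\xb5" + clean  # one stray high byte in front of two characters
```

With this toy model, `repair(noisy, toy_model)` removes the stray leading byte and returns `clean`. A realistic model would replace `toy_model` with scores estimated from a text corpus.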

[0039] Figure 2 depicts a typical Markov model of the admissible state sequences associated with mixed double-byte and ASCII text (e.g., GB-type byte sequences). In this example, each byte is in one of three states: ASCII character (state A), first byte of a double-byte character (state GB1), or second byte of a double-byte cha...
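A minimal sketch of such a three-state model follows, assuming the transitions implied by the pairing rule (a first byte must be followed by a second byte, and a character may start only after an ASCII byte or a completed pair). The transition table is an assumption, since the figure itself is not reproduced here.

```python
# Assumed admissible transitions between the three states.
ALLOWED = {
    "A": {"A", "GB1"},    # after ASCII: more ASCII, or start a double-byte character
    "GB1": {"GB2"},       # a first byte must be followed by a second byte
    "GB2": {"A", "GB1"},  # after a complete pair: ASCII or a new pair
}
START_STATES = {"A", "GB1"}
END_STATES = {"A", "GB2"}


def is_admissible(states):
    """True if the state sequence follows the assumed transition table."""
    if not states:
        return True
    if states[0] not in START_STATES or states[-1] not in END_STATES:
        return False
    return all(b in ALLOWED[a] for a, b in zip(states, states[1:]))
```

For instance, `["A", "GB1", "GB2", "A"]` is admissible, while `["A", "GB1"]` is not, because it ends in the middle of a double-byte character.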


Abstract

Disclosed is a method of validating a byte sequence having a plurality of states, the method comprising designating one or more noise states from among the plurality of states; generating a most probable state sequence for the byte sequence; utilizing said state sequence to identify all noise in the byte sequence; and localizing said noise in said noise states. Once localized, the noise may be deleted from the byte sequence.

Description

[0001] This work was supported by the DARPA government contract under SPAWAR contract number N66001-99-2-8916.

Technical field

[0002] The present invention relates to the validation of character code sequences.

Background

[0003] Double-byte character encodings are commonly used for many purposes, including encoding complex character sets such as GB2312-80, the Simplified Chinese character set used in mainland China. GB2312-80 contains 7,445 characters, each represented as a pair of bytes, where each byte is a number from 161 to 254. This allows Chinese characters to be mixed with traditional ASCII text represented by byte values in the range 0 to 127. Technically, the simultaneous representation of GB2312-80 and ASCII is called EUC-CN encoding, but for simplicity, we refer to it as GB2312-80 throughout this specification. This necessarily means that bytes in the range 161 to 254 must occur in pairs, and that any string of such characters must contain an even nu...
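The pairing rule suggests a simple parity check: any maximal run of bytes in the range 161-254 with odd length must contain noise. The sketch below is an illustration only; the function name is hypothetical, and bytes outside both ranges (128-160 and 255) are simply ignored here.

```python
import re

# Maximal runs of bytes in the double-byte range 161-254 (0xA1-0xFE).
HIGH_RUN = re.compile(rb"[\xa1-\xfe]+")


def odd_high_runs(data: bytes):
    """Return (start, end) offsets of high-byte runs whose length is odd."""
    return [(m.start(), m.end())
            for m in HIGH_RUN.finditer(data)
            if (m.end() - m.start()) % 2 == 1]
```

A run of odd length shows that something is wrong but not which byte is at fault, and an even-length run can still hide noise, which is why a statistical model is needed for the actual repair.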


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/44, G06F9/445, G06F9/45, G06F11/00, H03M7/30
CPC: G06F8/447, G06F9/44521, H03M7/30
Inventor: 杰弗里·S·麦卡利, 朱玮晶
Owner: INT BUSINESS MASCH CORP