Chinese character encoding method and device

A technology for identifying Chinese character encoding methods, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems of large memory usage, complex implementation, and low detection efficiency, and achieves the effects of improving recognition efficiency and accuracy, avoiding excessive memory usage, and narrowing the detection range.

Active Publication Date: 2017-10-20
RUN TECH CO LTD BEIJING

AI Technical Summary

Problems solved by technology

[0004] The encoding pattern method determines the encoding method mainly from the encoding range. For a large number of character sequences, it can only check them one by one against the encoding range. Detection efficiency is low, and when many code points fall in overlapping ranges, the method cannot decide which encoding is in use.
[0005] The character distribution method uses a character distribution probability model. Before the encoding method can be identified, a character probability distribution model must be established for a specific character set. In a complex network environment, Chinese, English, and other special symbols are often mixed together in the network data stream. When English and other non-Chinese characters are in the majority, they interfere with recognition of the Chinese character encoding method, and in particular reduce the recognition accuracy of schemes based on a character probability distribution model.
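
The range-overlap problem described in [0004] can be illustrated with a minimal Python sketch. It is not the patent's code. The function names are hypothetical, the input is assumed to contain only Chinese characters (no ASCII), and the byte ranges are the standard GBK double-byte range and the generic UTF-8 three-byte pattern.

```python
# Illustrative sketch only: simplified byte-range checks, assuming the input
# holds nothing but Chinese characters. Function names are hypothetical.

def fits_gbk_range(data: bytes) -> bool:
    """True if data splits into byte pairs that lie in the GBK double-byte range."""
    if len(data) % 2 != 0:
        return False
    for i in range(0, len(data), 2):
        lead, trail = data[i], data[i + 1]
        # GBK lead byte: 0x81-0xFE; trail byte: 0x40-0xFE except 0x7F.
        if not 0x81 <= lead <= 0xFE:
            return False
        if not 0x40 <= trail <= 0xFE or trail == 0x7F:
            return False
    return True


def fits_utf8_three_byte_range(data: bytes) -> bool:
    """True if data splits into triples matching the UTF-8 three-byte pattern."""
    if len(data) % 3 != 0:
        return False
    for i in range(0, len(data), 3):
        lead, cont1, cont2 = data[i], data[i + 1], data[i + 2]
        # Three-byte lead: 0xE0-0xEF; continuation bytes: 0x80-0xBF.
        if not 0xE0 <= lead <= 0xEF:
            return False
        if not (0x80 <= cont1 <= 0xBF and 0x80 <= cont2 <= 0xBF):
            return False
    return True


# Two Chinese characters encoded as UTF-8: E4 B8 AD E6 96 87.
sample = "\u4e2d\u6587".encode("utf-8")
print(fits_utf8_three_byte_range(sample))  # True
print(fits_gbk_range(sample))              # True: the same bytes also fit GBK
```

Both checks accept the same six bytes, so a range test alone cannot tell the two encodings apart for this input.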

Examples

Embodiment 1

[0022] Figure 1 is a flow chart of a method for identifying a Chinese character encoding method provided in Embodiment 1 of the present invention. The method of this embodiment is applicable to a system for recognizing the encoding method of Chinese characters. The system includes a distribution device and a data restoration device: the distribution device obtains the Chinese character sequence to be recognized from the network, and the data restoration device acquires that sequence from the distribution device and identifies its encoding method. The method can be executed by a device for identifying the Chinese character encoding method, implemented in hardware and/or software and typically configured in the data restoration device.

[0023] The method includes:

[0024] Step 110, obtaining character sequ...

Embodiment 2

[0041] Figure 2 is a flow chart of a method for identifying a Chinese character encoding method provided in Embodiment 2 of the present invention. Building on the above embodiment, this embodiment provides a preferred scheme for determining the encoding method of the Chinese character sequence to be recognized from the character sequence features, based on the set Chinese encoding recognition strategy.

[0042] This preferred method includes:

[0043] Step 210: if the length of the Chinese character sequence to be recognized is not divisible by 2, determine that the encoding method of the Chinese character sequence to be recognized is UTF-8;

[0044] In this step, since GB2312 and GBK both use double-byte encoding and UTF-8 uses three-byte encoding, if the length of the Chinese character sequence to be recognized is not divisible by 2, then the corresponding encoding method must no...
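
The parity argument behind Step 210 can be written out as a short worked derivation. The symbols n (number of Chinese characters) and L (sequence length in bytes) are introduced here for illustration. The derivation assumes the sequence contains only Chinese characters, each taking exactly two bytes in GB2312/GBK and three bytes in UTF-8, as the paragraph states.

```latex
\begin{align*}
L_{\text{GB2312/GBK}} &= 2n \quad\Rightarrow\quad L \equiv 0 \pmod{2} \text{ for every } n,\\
L_{\text{UTF-8}} &= 3n \quad\Rightarrow\quad L \equiv n \pmod{2}.
\end{align*}
```

An odd L therefore rules out the two double-byte encodings. An even L does not settle the question, because 3n is also even whenever n is even.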

Embodiment 3

[0070] Figure 3 is a flow chart of a method for identifying a Chinese character encoding method provided in Embodiment 3 of the present invention. Building on the above embodiments, this embodiment provides a specific implementation for determining the encoding method of the Chinese character sequence to be recognized from the character sequence features, based on the set Chinese encoding recognition strategy.

[0071] Step 310: judge whether the length of the Chinese character sequence to be recognized is divisible by 2. If not, perform step 320; if so, perform step 330.

[0072] Step 320: determine that the encoding method of the Chinese character sequence to be recognized is UTF-8, and end the process.

[0073] In this step, since GB2312 and GBK both use double-byte encoding and UTF-8 uses three-byte encoding, if the length of the Chinese character sequence to be recogniz...
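
Steps 310 and 320 can be sketched in Python as follows. The function name `check_length_step` is hypothetical. Because step 330 is not shown in the visible text, the sketch returns `None` for an even length rather than guessing what happens next.

```python
from typing import Optional


def check_length_step(data: bytes) -> Optional[str]:
    """Steps 310/320 as a sketch: an odd byte length means UTF-8.

    Returns None for an even length, where the flow would continue with
    step 330 (not reproduced here because the source text is truncated).
    """
    if len(data) % 2 != 0:
        return "UTF-8"
    return None


one_char_utf8 = "\u4e2d".encode("utf-8")         # 3 bytes
two_chars_utf8 = "\u4e2d\u6587".encode("utf-8")  # 6 bytes
two_chars_gbk = "\u4e2d\u6587".encode("gbk")     # 4 bytes

print(check_length_step(one_char_utf8))   # UTF-8
print(check_length_step(two_chars_utf8))  # None: even length, undecided
print(check_length_step(two_chars_gbk))   # None: even length, undecided
```

The second call shows why the length test only narrows the search: a UTF-8 sequence with an even number of characters also has an even byte length.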

Abstract

Embodiments of the present invention provide a method and device for identifying the encoding method of a Chinese character sequence. The method includes: obtaining character sequence features from the Chinese character sequence to be recognized, where the features are either the length feature of the character sequence alone, or the length feature together with the abnormal code point feature of the character sequence; and determining the encoding method of the Chinese character sequence to be recognized from those features, based on the set Chinese encoding recognition strategy. Because no complex character probability distribution model has to be established in advance, the recognition process is simplified. For the Chinese character sequences to be recognized in massive network data, the length feature is used to narrow the detection range, which avoids the memory usage incurred by directly checking every sequence for abnormal code points one by one. After the detection range has been narrowed, the abnormal code point feature is applied as well, which improves the efficiency and accuracy of encoding recognition.
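
The two-stage idea in the abstract (length feature first, abnormal code points second) can be sketched in Python. The visible text does not define "abnormal code point", so this sketch substitutes an assumption: a sequence counts as abnormal for an encoding if Python's strict codec for it raises a decoding error. The names `CANDIDATES`, `decodes_cleanly`, and `identify_encoding` are hypothetical, and the patent's actual recognition strategy may differ.

```python
from typing import List

# (label, Python codec name) pairs for the three encodings named in the text.
CANDIDATES = [("GB2312", "gb2312"), ("GBK", "gbk"), ("UTF-8", "utf-8")]


def decodes_cleanly(data: bytes, codec: str) -> bool:
    """Stand-in for an abnormal code point check (an assumption of this
    sketch): True if the strict decoder accepts every byte."""
    try:
        data.decode(codec, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def identify_encoding(data: bytes) -> List[str]:
    """Return the encodings still possible after both stages."""
    # Stage 1: length feature. An odd length excludes the double-byte encodings.
    if len(data) % 2 != 0:
        return ["UTF-8"]
    # Stage 2: only even-length sequences reach the costlier byte-level check.
    return [label for label, codec in CANDIDATES if decodes_cleanly(data, codec)]


print(identify_encoding("\u4e2d".encode("utf-8")))       # ['UTF-8']
print(identify_encoding("\u4e2d\u6587".encode("gbk")))   # ['GB2312', 'GBK']
```

The function returns a list because the second stage may leave more than one candidate, as in the GBK example, where the bytes are valid in both GB2312 and GBK.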

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of computer data communication, and in particular to a method and device for identifying the encoding method of Chinese characters.

Background

[0002] With the continuous development of computer communication technology, a variety of encoding methods have been created for transmitting data over networks. For Chinese characters, the commonly used encoding methods are GBK, GB2312, and UTF-8. After an encoded Chinese character sequence transmitted over the network is obtained, it must be decoded in order to correctly restore the original data it represents. Technology for identifying the encoding method of a Chinese character sequence came into being for this reason.

[0003] Existing recognition technologies for encoding modes of Chinese character sequences mainly include: encoding pattern method an...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/22
Inventor: 许敬缓
Owner: RUN TECH CO LTD BEIJING