Method and device for identifying coding mode of Chinese characters

A technique for identifying the encoding mode of Chinese characters, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems of large memory usage, complex implementation, and low recognition efficiency, so as to improve recognition efficiency and accuracy, avoid excessive memory occupation, and simplify the recognition process.

Active Publication Date: 2015-02-18
RUN TECH CO LTD BEIJING
Cites: 5 | Cited by: 8

AI Technical Summary

Problems solved by technology

[0004] The encoding pattern method determines the encoding mainly from the encoding range. For a large number of character sequences, each sequence can only be checked one by one against the encoding ranges. Not only is detection inefficient, but when many code points fall in overlapping ranges, it becomes impossible to decide which encoding is in use.
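The overlap problem can be illustrated with a minimal Python sketch. It assumes the commonly documented byte ranges for GB2312, GBK and three-byte UTF-8; the function names are hypothetical and are not taken from the patent. The checks test byte ranges only, not whether a code point is actually assigned.

```python
def fits_gb2312(data: bytes) -> bool:
    """Range check only: lead byte 0xA1-0xF7, trail byte 0xA1-0xFE."""
    if len(data) % 2 != 0:
        return False
    return all(
        0xA1 <= data[i] <= 0xF7 and 0xA1 <= data[i + 1] <= 0xFE
        for i in range(0, len(data), 2)
    )


def fits_gbk(data: bytes) -> bool:
    """Range check only: lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F."""
    if len(data) % 2 != 0:
        return False
    return all(
        0x81 <= data[i] <= 0xFE
        and 0x40 <= data[i + 1] <= 0xFE
        and data[i + 1] != 0x7F
        for i in range(0, len(data), 2)
    )


def fits_utf8_3byte(data: bytes) -> bool:
    """Range check only: lead byte 0xE0-0xEF plus two continuation bytes 0x80-0xBF."""
    if len(data) % 3 != 0:
        return False
    return all(
        0xE0 <= data[i] <= 0xEF
        and 0x80 <= data[i + 1] <= 0xBF
        and 0x80 <= data[i + 2] <= 0xBF
        for i in range(0, len(data), 3)
    )


# Two Chinese characters encoded as UTF-8 give six bytes.
sample = "中文".encode("utf-8")

print({
    "GB2312": fits_gb2312(sample),
    "GBK": fits_gbk(sample),
    "UTF-8": fits_utf8_3byte(sample),
})
```

In this example the same six bytes pass both the GBK and the UTF-8 range checks, so a range test alone cannot decide between the two.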
[0005] The character distribution method uses the character distribution probability as its model. Before the encoding can be identified, a character probability distribution model must be established for a specific character set. In a complex network environment, Chinese, English and other special symbols are often mixed together in the network data stream. When English characters and other non-Chinese characters are in the majority, they often interfere with the recognition of the Chinese character encoding, reducing the recognition accuracy of schemes based on a character probability distribution model.

Examples

Embodiment 1

[0022] Figure 1 is a flow chart of a method for identifying the encoding mode of Chinese characters provided in Embodiment 1 of the present invention. The method of this embodiment is applicable to a system for recognizing the encoding mode of Chinese characters. The system includes a distribution device and a data restoration device: the distribution device obtains the Chinese character sequence to be recognized from the network, and the data restoration device acquires that sequence from the distribution device and identifies its encoding mode. The method can be executed by a device for recognizing the Chinese character encoding mode, implemented in hardware and/or software, which is typically configured in the data restoration device.

[0023] The method includes:

[0024] Step 110, obtaining character sequ...

Embodiment 2

[0041] Figure 2 is a flow chart of a method for recognizing the encoding mode of Chinese characters provided in Embodiment 2 of the present invention. On the basis of the above embodiment, this embodiment provides a preferred scheme for determining the encoding mode of the Chinese character sequence to be recognized from the characteristics of the character sequence, based on the set Chinese encoding recognition strategy.

[0042] This preferred method includes:

[0043] Step 210: if the length of the Chinese character sequence to be recognized is not divisible by 2, determine that the encoding mode of the sequence is UTF-8;

[0044] In this step, since GB2312 and GBK both use double-byte encoding while UTF-8 uses three-byte encoding, if the length of the Chinese character sequence to be recognized is not divisible by 2, then the corresponding encoding method must no...
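The byte-length reasoning in this step can be checked with a short Python sketch. It assumes the sequence contains only Chinese characters (no ASCII), as the paragraph does; the helper name `encoded_lengths` is hypothetical.

```python
def encoded_lengths(text: str) -> dict:
    """Return the encoded byte length of text under each candidate encoding."""
    return {enc: len(text.encode(enc)) for enc in ("gb2312", "gbk", "utf-8")}


# One to four Chinese characters: double-byte encodings give 2n bytes,
# UTF-8 gives 3n bytes for these characters.
for text in ("中", "中文", "中文编", "中文编码"):
    print(text, encoded_lengths(text))
```

GB2312 and GBK lengths are always even, so an odd length can only come from UTF-8. The converse does not hold: an even number of characters in UTF-8 also yields an even byte length, so an even length by itself leaves the encoding undecided.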

Embodiment 3

[0070] Figure 3 is a flow chart of a method for recognizing the encoding mode of Chinese characters provided in Embodiment 3 of the present invention. On the basis of the above embodiments, this embodiment provides a specific implementation for determining the encoding mode of the Chinese character sequence to be recognized from the characteristics of the character sequence, based on the set Chinese encoding recognition strategy.

[0071] Step 310: judge whether the length of the Chinese character sequence to be recognized is divisible by 2; if not, perform step 320; if so, perform step 330;

[0072] Step 320, determine that the encoding method of the Chinese character sequence to be recognized is UTF-8 encoding, and the process ends.

[0073] In this step, since GB2312 encoding and GBK encoding both adopt double-byte encoding, and UTF-8 encoding adopts three-byte encoding, if the length of the Chinese character sequence to be recogniz...
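Steps 310 and 320 can be sketched as a small Python function. This is an illustration under assumptions, not the patent's implementation: the function name `classify_by_length` is hypothetical, the input is assumed to hold only encoded Chinese characters, and because step 330 and later steps are not shown in this excerpt, the even-length branch simply returns `None` to mean "undecided".

```python
from typing import Optional


def classify_by_length(data: bytes) -> Optional[str]:
    """Apply the length test of steps 310/320.

    Returns "UTF-8" when the byte length is odd, otherwise None
    (further checks from step 330 onward are not modelled here).
    """
    if len(data) % 2 != 0:
        # Step 320: a double-byte encoding cannot produce an odd length.
        return "UTF-8"
    # Step 330 onward: not covered by this sketch.
    return None


print(classify_by_length("中".encode("utf-8")))    # odd length (3 bytes)
print(classify_by_length("中".encode("gbk")))      # even length (2 bytes)
print(classify_by_length("中文".encode("utf-8")))  # even length (6 bytes)
```

Only the first call is decided by length alone; the other two need the later steps of the flow.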

Abstract

The embodiments of the invention provide a method and a device for identifying the encoding mode of a Chinese character sequence. The method comprises: obtaining character sequence characteristics from the Chinese character sequence to be identified, wherein the characteristics comprise the length characteristic of the character sequence, or the length characteristic together with the abnormal code point characteristic of the character sequence; and determining the encoding mode of the sequence according to those characteristics, based on a set Chinese encoding identification strategy. No complicated character probability distribution model needs to be established in advance, which simplifies the identification process. For Chinese character sequences in massive network data, applying the length characteristic first narrows the detection range and avoids the memory occupation caused by directly checking abnormal code points one by one; combining the abnormal code point characteristic after the range has been narrowed further improves the efficiency and accuracy of identification.

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of computer data communication, and in particular to a method and device for identifying the encoding mode of Chinese characters.

Background art

[0002] With the continuous development of computer communication technology, a variety of encoding methods have been created for transmitting data over networks. For Chinese characters, the commonly used encodings are GBK, GB2312 and UTF-8. After an encoded Chinese character sequence transmitted over the network is obtained, it must be decoded in order to correctly restore the original data it represents. Technology for identifying the encoding mode of a Chinese character sequence came into being for this reason.

[0003] Existing recognition technologies for encoding modes of Chinese character sequences mainly include: encoding pattern method an...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22
Inventor: 许敬缓
Owner: RUN TECH CO LTD BEIJING