Text compression method and text compression device

A text compression technology, applied in the fields of instruments, calculation, and electrical digital data processing, that addresses the problem of a low compression rate and achieves a high compression rate.

Active Publication Date: 2012-07-11
PEKING UNIV +2

AI Technical Summary

Problems solved by technology

Since this method compresses and decompresses Chinese text in units of bytes, it can be combined with various current compression algorithms or tools,

Embodiment Construction

[0023] Hereinafter, the present invention will be described in detail with reference to the accompanying drawings and embodiments.

[0024] The technical idea of the present invention is to perform compression in units of words rather than individual characters, thereby improving the compression rate. To achieve this object, the text compression method according to the present invention comprises the following steps:

[0025] Step S1, screening, from the text to be compressed, words that meet predetermined word-length and occurrence-frequency conditions;

[0026] Step S2, assigning codes to the screened words according to their frequency of occurrence; and

[0027] Step S3, compressing the text using the assigned code.
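The screening in step S1 can be illustrated with a minimal sketch. The source excerpt does not say how candidate words are generated, so this example assumes they are simply all character n-grams within a length range; the function name `screen_words` and the thresholds `min_len`, `max_len`, and `min_count` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def screen_words(text, min_len=2, max_len=4, min_count=2):
    """Return candidate words that satisfy the length and frequency conditions.

    Assumption: candidates are all character n-grams of the text whose
    length lies in [min_len, max_len]. The result maps word -> count.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the candidates that occur often enough.
    return {w: c for w, c in counts.items() if c >= min_count}


sample = "压缩文本压缩文本方法"
screened = screen_words(sample, min_len=2, max_len=4, min_count=2)
```

With this sample, the repeated string "压缩文本" and its repeated substrings pass the filter, while "方法", which occurs once, is dropped. A real implementation would likely prune overlapping substrings further, which this sketch does not attempt.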

[0028] Among them, steps S2 and S3 belong to the prior art and can be implemented by various known techniques; detailed descriptions of them are therefore omitted in this specification. Hereinafter, step S1 will be mainly described. ...
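Because code assignment and compression are said to rely on known techniques, one such technique, Huffman coding, can serve as an illustration. The sketch below is not taken from the patent: the greedy longest-match tokenizer, the function names (`huffman_codes`, `tokenize`, `compress`, `decompress`), and the use of a `str` of "0"/"1" characters in place of packed bits are all assumptions made for readability.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    if len(freqs) <= 1:
        return {sym: "0" for sym in freqs}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def tokenize(text, words):
    """Greedy longest-match split into screened words or single characters."""
    max_len = max((len(w) for w in words), default=1)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in words:
                tokens.append(piece)
                i += n
                break
    return tokens


def compress(text, words):
    """Encode the text token by token; return the bit string and code table."""
    tokens = tokenize(text, words)
    codes = huffman_codes(Counter(tokens))
    return "".join(codes[t] for t in tokens), codes


def decompress(bits, codes):
    """Invert compress() by matching prefix-free codes bit by bit."""
    inverse = {c: s for s, c in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


text = "压缩文本压缩文本方法"
bits, codes = compress(text, {"压缩文本"})
restored = decompress(bits, codes)
```

Here the four-character word "压缩文本" is treated as one symbol and receives a one-bit code, so the ten-character sample encodes to six bits (code table aside), which shows why coding whole words can help when they repeat.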

Abstract

The invention provides a text compression method applicable to texts in non-Latin languages such as Chinese. The method includes the steps of: screening, from a text to be compressed, words that meet predetermined word-length and occurrence-frequency conditions; allocating codes to the screened words according to their occurrence frequency; and compressing the text using the allocated codes. Correspondingly, the invention provides a text compression device. In the method and device, candidate words are extracted from the text to be compressed, taking one word as the unit, and are screened by occurrence frequency so that only the more frequent words are kept. High-frequency expansion words in non-Latin text data such as Chinese can thus be extracted effectively, the total number of coded words in the dictionary is reduced, and compressing the text with these codes achieves a high compression ratio.

Description

Technical field

[0001] The invention relates to the technical field of text data processing, in particular to a compression method and device suitable for non-Latin languages such as Chinese.

Background technique

[0002] At present, there are many mature algorithms for compressing Latin-script texts, mainly statistical methods (for example, the Huffman algorithm) and dictionary encoding methods. However, texts in Chinese, Japanese, Korean and similar languages cannot be split into words using separators such as spaces and punctuation, as Latin-script texts can. Moreover, the number of commonly used words in these languages is huge and the rules are complex. It is therefore difficult to efficiently extract words, update word frequencies, and obtain frequently occurring expansion words in text data by simply applying the statistical or dictionary encoding methods designed for Latin languages. In addition, even if the high-frequency expansion wo...

Application Information

IPC(8): G06F17/30
Inventor: 仇睿恒, 胡薇
Owner: PEKING UNIV