Text compression method and device, text decompression method and device, computer equipment and storage medium

A text compression and computer program technology, applied in the field of data processing, which addresses the problems of low compression ratios and large compressed files, and achieves a more compact representation and a higher compression ratio.

Pending Publication Date: 2022-02-15
深圳市领存技术有限公司
0 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, if the text is compressed according to the above two encoding met...

Method used

[0048] Specifically, using the semantic vector of the text to compress can obtain a more compact representation, compared to compre...

Abstract

The invention relates to a text compression method and device, a text decompression method and device, computer equipment and a storage medium. In the method, text preprocessing is performed on a to-be-compressed text to obtain word vectors for a plurality of target segmented words; the word vectors of the target segmented words are then aggregated into a semantic vector corresponding to the whole text, which captures the meaning the text expresses. Because compression is performed on the semantic vector of the text, a more compact representation is obtained, and compared with compressing text according to character frequency and simple arrangement rules, the compression ratio can be greatly improved.


Examples

  • Experimental program(1)

Example Embodiment

[0039] In order to make the objects, technical solutions, and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described below in conjunction with those embodiments. The described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative labor shall fall within the protection scope of the present application.
[0040] In one embodiment, Figure 1 is a schematic diagram of a text compression method. Referring to Figure 1, a text compression method is provided. This embodiment is mainly illustrated by applying the method to the server 120, and the text compression method specifically includes the following steps:
[0041] Step S210: perform text preprocessing on the text to be compressed to obtain the word vectors of a plurality of target segmented words.
[0042] Specifically, the text to be compressed refers to text that has not undergone compression processing. The text to be compressed includes a plurality of target segmented words, and word embedding is used to convert each target segmented word in the text into a word vector. A word vector consists of multiple vector values indicating its vector dimensions; in the present embodiment, the vector dimension of a word vector is 256, i.e., each word vector comprises 256 vector values. Representing words with word vectors reduces the amount of data to be calculated and stored.
[0043] Methods of converting a word into a word vector include the following. The first: count the probability that two words occur simultaneously in a large corpus, and map words that frequently co-occur to nearby positions in the vector space. The second: predict the likely adjacent words from one or several words, and naturally learn the word vectors corresponding to the words during the prediction process.
[0044] It is also possible to use word vectors that have already been trained and open-sourced, for example those provided through the gensim library.
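As an illustration of paragraph [0044], the following minimal sketch loads pre-trained word vectors through gensim. The file name zh_word_vectors.txt is a placeholder, not something named in the patent, and the example assumes a 256-dimensional word2vec-format vector file consistent with the dimension used in [0042].

```python
# Minimal sketch: loading pre-trained word vectors with gensim.
# "zh_word_vectors.txt" is a hypothetical file name; any word2vec-format
# file of 256-dimensional Chinese word vectors would play the same role.
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("zh_word_vectors.txt", binary=False)

vec = word_vectors["文本"]   # a 256-dimensional numpy array, if the model was trained with 256 dimensions
print(vec.shape)
```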
[0045] Step S220: generate the semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words.
[0046] Specifically, the semantic vector contains the meaning expressed by the text. Through the semantic vector, the correlation and similarity between words in their context can be understood, which facilitates subsequent processing that uses the semantic vector, such as keyword search or text recommendation.
[0047] Step S230: compress the semantic vector corresponding to the text to be compressed to generate compressed text.
[0048] Specifically, compressing the semantic vector of the text yields a more compact representation; compared with compressing text according to character frequency and simple arrangement rules, the compression ratio can be greatly increased.
[0049] In one embodiment, performing text preprocessing on the text to be compressed to obtain the word vectors of a plurality of target segmented words includes: generating an encoding dictionary based on a word-embedding table; performing word segmentation on the text to be compressed based on the encoding dictionary to obtain a plurality of target segmented words, wherein each target segmented word carries a corresponding word code; and determining the word vector corresponding to each target segmented word based on the word-embedding table.
[0050] The word-embedding table includes W words and the word vector corresponding to each word, for example [word_1: (0.234, 0.252, ..., 0.234); word_2: (0.254, 0.227, ..., 0.284); ...; word_W: (0.256, 0.297, ..., 0.384)], where word_1 and word_2 each correspond to one word. The encoding dictionary is built from the word-embedding table and includes all the target segmented words in the text to be compressed. The total length of the encoding dictionary is N, i.e., the encoding dictionary includes N words, and each word in the encoding dictionary corresponds to one word code, for example {text: 1, compression: 2, based on: 3, Chinese: 4, we: 5, a: 6, system: 7, word: 8, embedding: 9, compress: 10, recurrent neural network: 11, of (的): 12, algorithm: 13, ..., XXX: N}; that is, each word corresponds to one numeric value.
[0051] Specifically, word segmentation is performed on the text to be compressed based on the encoding dictionary. The segmenter may specifically be a tokenizer such as HanLP or jieba. Taking the text "a Chinese text compression algorithm based on word embedding" as an example, it is split into "a | based on | word | embedding | of | Chinese | text | compression | algorithm" to obtain a plurality of target segmented words, where adjacent target segmented words are separated by a separator. The word code of each target segmented word can then be looked up in the encoding dictionary; for example, the word code corresponding to "a" is 6 and the word code corresponding to "based on" is 3, so that the word code corresponding to each target segmented word is obtained.
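The segmentation and dictionary-lookup step can be illustrated with a short sketch. jieba is one of the tokenizers the patent names; the Chinese string below is assumed to be the original of the translated example sentence, and the encoding dictionary is a hypothetical hand-built fragment matching the word codes quoted above.

```python
# Illustrative sketch of word segmentation and word-code lookup with jieba.
# The encoding dictionary here is a hypothetical fragment; in the method it is
# built from the word-embedding table. The exact split produced by jieba can
# vary with its dictionary, so unknown tokens fall back to a placeholder code 0.
import jieba

encoding_dict = {"一种": 6, "基于": 3, "词": 8, "嵌入": 9, "的": 12,
                 "中文": 4, "文本": 1, "压缩": 2, "算法": 13}

text = "一种基于词嵌入的中文文本压缩算法"          # assumed original of the translated example
target_words = list(jieba.cut(text))               # e.g. ['一种', '基于', '词', '嵌入', '的', '中文', '文本', '压缩', '算法']
word_codes = [encoding_dict.get(w, 0) for w in target_words]
print(target_words, word_codes)
```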
[0052] Each target segmented word corresponds to a word vector and a word code. Based on the word codes, the entire encoding sequence of the text to be compressed is obtained, denoted S = (s_1, s_2, ..., s_M), where s_1, s_2, ... correspond to different target segmented words. For the example text above, s_1 is the matrix index of the first target segmented word "a" in the text to be compressed, and its value is that word's code, i.e., s_1 = 6; s_M is the matrix index of the last target segmented word "algorithm" in the text to be compressed. The text to be compressed includes M target segmented words, and since the word vector corresponding to each target segmented word is a 256-dimensional vector, the encoding matrix corresponding to the text to be compressed is an M * 256 matrix. In the present embodiment, for the example text above, M = 9.
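Continuing the same illustration, the following hedged sketch assembles the word-code sequence S and the M * 256 matrix from the tokens and vectors of the previous sketches. It assumes every target word is present in the embedding vocabulary loaded earlier.

```python
# Sketch of turning the word-code sequence S = (s_1, ..., s_M) into the
# M x 256 matrix to be compressed, reusing `word_vectors` from the gensim
# sketch and `target_words` / `word_codes` from the tokenizer sketch.
import numpy as np

S = np.array(word_codes)                                               # shape (M,), one code per target word
embedding_matrix = np.stack([word_vectors[w] for w in target_words])   # shape (M, 256)
print(S.shape, embedding_matrix.shape)                                 # (9,) (9, 256) for the example sentence
```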
[0053] In one embodiment, generating the semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words includes: dividing the plurality of target segmented words into multiple word segments; generating the semantic vector corresponding to each word segment according to the word vectors of the target segmented words in that segment; and generating the semantic vector of the text to be compressed based on the semantic vectors corresponding to the respective word segments.
[0054] Specifically, the plurality of target segmented words is divided into multiple word segments according to a specified number: each word segment includes the specified number X of target segmented words, i.e., the division is made every X target segmented words, and the number of segments is D = M / X. The word vectors corresponding to the target segmented words in each word segment are used for training to obtain the semantic vector representing that word segment; that is, the X word vectors corresponding to the X target segmented words in a segment are compressed into one semantic vector that can express those X target segmented words. In other words, X word vectors are compressed into one semantic vector. Integrating all the word segments corresponding to the text to be compressed generates the semantic vector corresponding to the text, i.e., the semantic vector of the text is composed of the semantic vectors corresponding to the D word segments. Using the word-embedding table to capture the semantic information of the text makes the semantic vector more concise, and a word-embedding table for a specified field can be trained to further increase the compression ratio for texts in that field.
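A short sketch of this segment division, continuing the illustrative matrix from above; X = 3 is chosen purely for illustration.

```python
# Sketch of splitting the M word vectors into D = M / X word segments of X
# target segmented words each. X is a design choice; 3 is illustrative only.
X = 3
segments = [embedding_matrix[i:i + X] for i in range(0, len(embedding_matrix), X)]
D = len(segments)               # for M = 9 and X = 3, D = 3 segments, each of shape (3, 256)
```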
[0055] In one embodiment, generating the semantic vector corresponding to each word segment according to the word vectors of the target segmented words in that segment includes: determining the semantic vector of the preceding target segmented word in the word segment; and determining the semantic vector of the following target segmented word in the word segment based on the semantic vector of the preceding target segmented word.
[0056] Specifically, referring to Figure 2, the word vectors corresponding to the X target segmented words in each word segment are used for iterative learning. For two adjacent target segmented words in a segment, the preceding target segmented word is the one that comes earlier, i.e., its matrix index is smaller than that of the following target segmented word. In Figure 2, x_(t-1) denotes the semantic vector of the preceding target segmented word relative to x_t, and x_t denotes the semantic vector of the following target segmented word relative to x_(t-1). After the preceding target segmented word has been learned, its semantic vector is obtained, and this semantic vector participates in learning the following target segmented word, yielding the semantic vector of the following target segmented word. Specifically, this can be implemented with an LSTM network model or a GRU model.
[0057] Each learning step for a following target segmented word combines the results of all preceding target segmented words, until the last target segmented word in the segment has been learned and the semantic vector corresponding to the word segment is generated; this is the process of iterative learning. Connecting the semantic vectors of all target segmented words in the word segment yields a semantic vector containing the semantic information of those target segmented words. The semantic vector corresponding to the text to be compressed is generated from the semantic vectors corresponding to the respective word segments and is denoted C, with C = (C_1, C_2, ..., C_i, ..., C_D), where C_i denotes the semantic vector corresponding to the i-th word segment of the text to be compressed. C_i includes L vector strings, and each vector string includes b numbers, i.e., C_i is a vector consisting of L * b numbers; for example, with b = 3 and L = 3, C_i = (123, 245, 356). For the D word segments of the text to be compressed, the semantic vector corresponding to the text is therefore a vector consisting of D * L * b numbers.
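Paragraph [0056] allows either an LSTM or a GRU for the iterative learning. The sketch below uses a PyTorch GRU purely as an illustration: the patent names no framework, the hidden size of 128 is an arbitrary illustrative choice, and in practice the recurrent model would be trained rather than used with random weights.

```python
# Hedged sketch: one semantic vector per word segment via a GRU's final hidden
# state, reusing `segments` from the division sketch above. PyTorch and the
# hidden size are assumptions, not part of the patent text.
import torch
import torch.nn as nn

gru = nn.GRU(input_size=256, hidden_size=128, batch_first=True)

segment_vectors = []
for seg in segments:                                                   # each seg: (X, 256) word vectors
    seg_tensor = torch.tensor(seg, dtype=torch.float32).unsqueeze(0)   # (1, X, 256)
    _, h_n = gru(seg_tensor)                                           # h_n: (1, 1, 128), last hidden state
    segment_vectors.append(h_n.squeeze())                              # one semantic vector per segment
C = torch.stack(segment_vectors)                                       # (D, 128): semantic vector of the whole text
```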
[0058] In one embodiment, compressing the semantic vector corresponding to the text to be compressed to generate the compressed text includes: determining the probability of occurrence of each character in the semantic vectors corresponding to the respective word segments; sorting the occurrence probabilities in descending numerical order to generate a descending probability sequence; superimposing the occurrence probabilities in turn from the end of the descending probability sequence to obtain a superposition value; when the superposition value reaches a preset value, labeling the occurrence probabilities in the descending probability sequence with preset characters, and generating the compression code corresponding to each word segment based on the preset characters corresponding to the occurrence probabilities associated with that word segment; and generating the compressed text based on the compression codes corresponding to the respective word segments.
[0059] Specifically, the probability of occurrence of each character in the corresponding semantic vectors is counted. Since each semantic vector includes L * b numeric characters, the occurrence probability of each number is obtained by counting the numbers appearing in the semantic vectors of all word segments, and a descending probability sequence is obtained; the descending probability sequence includes the numbers and the occurrence probability corresponding to each number. For example, the text to be compressed is split into word segments and the semantic vector corresponding to each word segment is obtained, each semantic vector consisting of digits 0-9; the descending probability sequence generated for these ten digits is shown in the following table:
[0060]

    Digit in semantic vector    Probability of occurrence
    2                           0.4
    1                           0.2
    3                           0.1
    4                           0.1
    6                           0.06
    5                           0.04
    0                           0.04
    9                           0.03
    7                           0.02
    8                           0.01
[0061] In the descending probability sequence, the last occurrence probability is the smallest, and each earlier probability is greater than or equal to the one after it. Referring to the descending probability sequence, 0.01 and 0.02 are the two smallest probability values: 0.01 is the last occurrence probability, and 0.02 is the preceding occurrence probability relative to 0.01. The last occurrence probability is superimposed on the one before it, i.e., the superposition value is 0.01 + 0.02 = 0.03. The obtained superposition value is used as a new occurrence probability, and the two smallest values among the new set of occurrence probabilities, i.e., (0.03, 0.03, 0.04, 0.04, 0.06, 0.1, 0.1, 0.2, 0.4), are again selected and added, giving 0.03 + 0.03 = 0.06. Then 0.06 is used as a new occurrence probability, and in (0.04, 0.04, 0.06, 0.06, 0.1, 0.1, 0.2, 0.4) the two smallest values are selected and added, giving 0.04 + 0.04 = 0.08, and so on until the superposition value reaches the preset value. The preset value is usually set to 1, thereby forming the tree structure shown in Figure 3.
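A minimal sketch of this merge-the-two-smallest construction, using the probabilities from the table in [0060]. The exact codes it prints may differ from those read off Figure 3, since ties between equal probabilities can be broken either way.

```python
# Minimal sketch of paragraphs [0061]-[0062]: repeatedly merge the two smallest
# probabilities until the sum reaches 1, prepending 0 to the codes of one merged
# group and 1 to the other, which yields a prefix-free (Huffman-style) code.
import heapq
from itertools import count

probs = {"2": 0.4, "1": 0.2, "3": 0.1, "4": 0.1, "6": 0.06,
         "5": 0.04, "0": 0.04, "9": 0.03, "7": 0.02, "8": 0.01}

tiebreak = count()
heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
heapq.heapify(heap)

while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)                 # smallest probability
    p2, _, codes2 = heapq.heappop(heap)                 # second smallest
    merged = {s: "0" + c for s, c in codes1.items()}    # left branch marked 0
    merged.update({s: "1" + c for s, c in codes2.items()})   # right branch marked 1
    heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))

_, _, huffman_codes = heap[0]
print(huffman_codes)    # e.g. {'2': '1', '1': '011', ...}; exact codes depend on tie-breaking
```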
[0062] In the tree structure formed from the occurrence probabilities, each occurrence probability corresponds to a node, and the nodes are marked in turn down the tree structure. Nodes are divided into parent nodes and child nodes, a child node being what a parent node branches into. The child nodes after each branch are marked respectively: the child node on the left of a branch is marked 0 and the child node on the right is marked 1, or alternatively the child node on the left is marked 1 and the child node on the right is marked 0. In the present embodiment the first marking scheme is selected. The compression code corresponding to each node consists of the marks of its parent nodes followed by its own mark.
[0063] As shown in Figure 3, the node on the left side of a branch is marked 0 and the node on the right side is marked 1. The compression code corresponding to the digit 7 consists of the mark 00000 corresponding to its parent nodes and its own mark 0, i.e., 000000. The compression code corresponding to the digit 8 consists of the mark 00000 corresponding to its parent nodes and its own mark 1, i.e., 000001. Similarly, the compression code of digit 9 is 00001, the compression code of digit 6 is 0001, the compression code of digit 3 is 001, the compression code of digit 0 is 01000, the compression code of digit 5 is 01001, the compression code of digit 4 is 0101, the compression code of digit 1 is 011, and the compression code of digit 2 is 1. The compressed text is generated based on the compression code corresponding to each digit.
[0064] In this way, the semantic vector corresponding to the text to be compressed is compressed: characters that appear frequently in the semantic vector are represented with fewer bits, while characters that appear rarely are represented with more bits. The number of bits used for the frequent part of the semantic vector thus decreases while the number of bits used for the rare part increases, and since the characters represented with fewer bits far outnumber those represented with more bits, the amount of data in the compressed text is smaller than the amount of data in the text to be compressed.
[0065] In one embodiment, referring to Figure 4, a text decompression method is provided, the method comprising:
[0066] Step S310: perform vector decoding processing on the text to be decompressed to obtain the semantic vector corresponding to the text to be decompressed.
[0067] In the present embodiment, the text to be decompressed is text that has undergone compression processing and consists of a plurality of compression codes; a compression code is a binary string composed of 0s and 1s. Performing vector decoding on the text to be decompressed based on the Huffman coding algorithm converts the binary string of 0s and 1s into a digit sequence of length D * L * b, and this digit sequence represents the semantic vector corresponding to the text to be decompressed.
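A hedged sketch of this decoding step, reusing the huffman_codes table from the compression sketch above; the code table is assumed to be shared between compressor and decompressor.

```python
# Hedged sketch of paragraph [0067]: walk the 0/1 string and emit one digit
# whenever the accumulated prefix matches a code. Prefix-free codes make the
# match unambiguous.
def huffman_decode(bitstring, codes):
    inverse = {code: sym for sym, code in codes.items()}
    digits, buffer = [], ""
    for bit in bitstring:
        buffer += bit
        if buffer in inverse:
            digits.append(inverse[buffer])
            buffer = ""
    return digits

encoded = "".join(huffman_codes[d] for d in "123")        # encode an illustrative digit string
digit_sequence = huffman_decode(encoded, huffman_codes)   # -> ['1', '2', '3']
```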
[0068] Step S320: divide the semantic vector of the text to be decompressed to generate multiple sub-vectors.
[0069] In the present embodiment, the semantic vector corresponding to the text to be decompressed is divided into D sub-vectors, and each sub-vector is the semantic vector corresponding to one word segment; that is, the semantic vector corresponding to the text is split as C = (C_1, C_2, ..., C_i, ..., C_D), where C_i denotes the semantic vector corresponding to the i-th word segment. C_i includes L vector strings, and each vector string includes b numbers, i.e., C_i is a vector consisting of L * b numbers; for example, with b = 3 and L = 3, C_i = (123, 245, 356).
[0070] Step S330: perform number decoding processing on each sub-vector to generate the decompressed text.
[0071] In this embodiment, referring to Figure 5, number decoding processing is performed on each sub-vector; that is, the sub-vector corresponding to a word segment is input into a decoder, which outputs the corresponding word codes. Each word segment includes X target segmented words, so performing number decoding processing on the sub-vector corresponding to a word segment yields X word codes. For each word code, the corresponding word is found in the encoding dictionary; for example, if the word code is 5, looking up 5 in the encoding dictionary yields the word "we". In this way, the sub-vectors corresponding to the D word segments are decoded, thereby obtaining the plurality of words corresponding to the text to be decompressed, i.e., the decompressed text.
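A hedged sketch of steps S320 and S330. The regrouping into sub-vectors follows [0069]; the decoder that turns a sub-vector back into X word codes is a trained model the patent does not spell out, so the word codes below are a hypothetical decoder output used only to show the final dictionary lookup.

```python
# Sketch of steps S320/S330, reusing `digit_sequence` from the decoding sketch
# and `encoding_dict` from the tokenizer sketch. L_, b are the illustrative
# values from [0057]; a real digit sequence would have length D * L * b.
L_, b = 3, 3
sub_vectors = [digit_sequence[i:i + L_ * b]
               for i in range(0, len(digit_sequence), L_ * b)]   # one sub-vector per word segment

decoding_dict = {code: word for word, code in encoding_dict.items()}

# A trained decoder (not specified in detail by the patent) would map each
# sub-vector to X word codes; here a hypothetical output stands in for it.
word_codes_per_segment = [[6, 3, 8]]                             # hypothetical decoder output for one segment
decompressed_words = [decoding_dict[c] for codes in word_codes_per_segment for c in codes]
decompressed_text = "".join(decompressed_words)                  # -> "一种基于词"
```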
[0072] Figures 1 and 4 are flowcharts of the compression and decompression methods in an embodiment. It should be understood that although the steps in the flowcharts of Figures 1 and 4 are displayed in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order for executing these steps, and they may be performed in other orders. Moreover, at least a portion of the steps in Figures 1 and 4 may include multiple sub-steps or multiple stages, which are not necessarily completed at the same moment but may be performed at different times, and the execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn with, or alternately with, other steps or at least a portion of the sub-steps or stages of other steps.
[0073] In one embodiment, as shown in Figure 6, a text compression device includes:
[0074] a preprocessing module 410, configured to perform text preprocessing on the text to be compressed to obtain the word vectors of a plurality of target segmented words;
[0075] a generation module 420, configured to generate the semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words;
[0076] a compression module 430, configured to compress the semantic vector corresponding to the text to be compressed to generate the compressed text.
[0077] In one embodiment, the preprocessing module 410 is further configured to:
[0078] generate the encoding dictionary based on the plurality of words in the word-embedding table, wherein the word-embedding table includes a plurality of words and the word vector corresponding to each word, and each word in the encoding dictionary corresponds to one word code;
[0079] perform word segmentation on the text to be compressed based on the encoding dictionary to obtain a plurality of target segmented words, wherein each target segmented word carries a corresponding word code;
[0080] determine the word vector corresponding to each target segmented word based on the word-embedding table.
[0081] In one embodiment, the generation module 420 is further configured to:
[0082] divide the plurality of target segmented words into multiple word segments, wherein each word segment includes at least two consecutive target segmented words;
[0083] generate the semantic vector corresponding to each word segment based on the word vectors of the target segmented words in that segment;
[0084] generate the semantic vector of the text to be compressed based on the semantic vectors corresponding to the respective word segments.
[0085] In one embodiment, the generation module 420 is further configured to:
[0086] determine the semantic vector of the preceding target segmented word in the word segment;
[0087] determine the semantic vector of the following target segmented word in the word segment based on the semantic vector of the preceding target segmented word.
[0088] In one embodiment, the compression module 430 is further configured to:
[0089] determine the probability of occurrence of each character in the semantic vectors corresponding to the respective word segments;
[0090] sort the occurrence probabilities in descending numerical order to generate a descending probability sequence;
[0091] superimpose the occurrence probabilities in turn from the end of the descending probability sequence to obtain a superposition value;
[0092] when the superposition value reaches a preset value, label the occurrence probabilities in the descending probability sequence with preset characters, and generate the compression code corresponding to each word segment based on the preset characters corresponding to the occurrence probabilities associated with that word segment;
[0093] generate the compressed text based on the compression codes corresponding to the respective word segments.
[0094] In one embodiment, referring to Figure 7, a text decompression device is provided, which includes:
[0095] a first decoding module 510, configured to perform vector decoding processing on the text to be decompressed to obtain the semantic vector corresponding to the text to be decompressed;
[0096] a division module 520, configured to divide the semantic vector of the text to be decompressed to generate multiple sub-vectors;
[0097] a second decoding module 530, configured to perform number decoding processing on each sub-vector to generate the decompressed text.
[0098] Figure 8 shows an internal structure diagram of a computer device in one embodiment. The computer device may be a server. As shown in Figure 8, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the text compression and decompression methods. A computer program may also be stored in the internal memory, and when this computer program is executed by the processor, the processor can perform the text compression and decompression methods. Those skilled in the art will appreciate that the structure shown in Figure 8 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
[0099] In one embodiment, the text compression and decompression devices provided by the present application can be implemented as a computer program, and the computer program can run on the computer device shown in Figure 8. The memory of the computer device stores the program modules constituting the text compression and decompression devices, for example the preprocessing module 410, the generation module 420, and the compression module 430 shown in Figure 6. The computer program composed of these program modules causes the processor to perform the steps of the text compression and decompression methods of the embodiments described in this specification.
[0100] The computer device shown in Figure 8 can perform text preprocessing on the text to be compressed to obtain a plurality of target segmented words through the preprocessing module 410 of the text compression and decompression devices in Figure 6. The computer device can generate the semantic vector corresponding to the text to be compressed based on the word vectors through the generation module 420. The computer device can compress the semantic vector corresponding to the text to be compressed to generate the compressed text through the compression module 430.
[0101] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the methods described in the above embodiments are implemented.
[0102] In one embodiment, a computer-readable storage medium is provided on which a computer program is stored; when the computer program is executed by a processor, the methods described above are implemented.
[0103] A person of ordinary skill in the art will appreciate that all or part of the flow in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium; when executed, the program can include the flows of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0104] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
[0105] The above are only specific embodiments of the present invention, which enable those skilled in the art to understand or implement the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Accordingly, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.


