Text compression method and apparatus, text decompression method and apparatus, model training method and apparatus, and device
By combining a large language model (LLM) with a deep learning encoder and a q-former quantization module, the problem of low compression rate and poor decompression effect of long texts and complex language structures in existing technologies is solved, achieving efficient text compression and high-quality restoration.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SHANGHAI SOULGATE TECH CO LTD
- Filing Date
- 2025-07-31
- Publication Date
- 2026-06-18
AI Technical Summary
Existing text compression technologies have low compression rates and poor decompression performance when dealing with long texts and complex language structures, making it difficult to effectively handle text content with rich context and high language dependence.
A large language model (LLM) is used in conjunction with a deep learning encoder and a q-former quantization module to generate a high-compression token sequence through feature extraction and quantization, and then decompression is performed using an autoregressive language model.
It achieves high compression ratio text compression while maintaining high-quality text restoration capabilities, applicable to text in different languages and formats, and significantly reduces data storage and transmission resources.
Description
A text compression method, a text decompression method, a model training method, an apparatus, and a device.
[0001] This application claims priority to Chinese Patent Application No. 202411794658.4, filed on December 9, 2024, entitled “A text compression method, a text decompression method, a model training method, an apparatus and device”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the technical field of text processing, and in particular to a text compression method, a text decompression method, a model training method, an apparatus, and a device.
Background Technology
[0003] With the rapid development of information technology, the amount of global data has exploded. Against this backdrop, text data, as an important form of information transmission and knowledge storage, has become a hot research topic in terms of processing and compression technologies.
[0004] Conventional text compression techniques, such as Huffman coding and Lempel-Ziv-Welch (LZW) coding, achieve compression by analyzing the frequency of characters in the text to construct an optimal prefix code or through string matching. For example, Huffman coding achieves compression by constructing a binary tree based on character frequency and assigning a unique variable-length code to each character. The LZW algorithm, on the other hand, achieves compression by constructing a string dictionary and replacing repeated strings with their indices in the dictionary.
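As a point of reference for the conventional approach described above, the following is a minimal Python sketch of Huffman coding. It is illustrative only and not part of the claimed method; the function names `huffman_codes`, `huffman_encode`, and `huffman_decode` are hypothetical, and the code table is built by merging the two lightest subtrees with a heap, which is one common way of constructing the binary tree.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {character: bit string} prefix code built from character frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit so the output is decodable.
        return {next(iter(freq)): "0"}
    # Each heap entry is (weight, tie_breaker, partial code table for that subtree).
    heap = [(weight, i, {ch: ""}) for i, (ch, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code beneath them.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(text, codes):
    """Concatenate the variable-length code of every character."""
    return "".join(codes[ch] for ch in text)


def huffman_decode(bits, codes):
    """Walk the bit string; a prefix code guarantees each match is unambiguous."""
    inverse = {code: ch for ch, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)
```

For the string "abracadabra", the encoded stream is 23 bits long, compared with 88 bits at one byte per character. The gain comes purely from character frequencies, which is why this kind of coding cannot exploit long-range context.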
[0005] However, with the surge in text data volume and the increasing complexity of language structures, conventional text compression techniques have low compression rates and poor decompression performance.
Summary of the Invention
[0006] The purpose of this application is to provide a text compression method, a text decompression method, a model training method, an apparatus, and a device that can improve the compression ratio and decompression effect.
[0007] In a first aspect, a text compression method is provided, including:
[0008] Obtain the target text to be compressed;
[0009] A text compression model is obtained, which includes: a feature extraction module and a q-former quantization processing module;
[0010] The target text to be compressed is input into the text compression model, and features are extracted using the feature extraction module based on the target text to obtain an embedded vector sequence; the embedded vector sequence is then quantized using the q-former quantization module to obtain a token sequence, thereby achieving text compression.
[0011] In a preferred embodiment, this application may be further configured such that the text compression model also includes a preprocessing module;
[0012] Before performing feature extraction based on the target text to be compressed using the feature extraction module to obtain the embedded vector sequence, the following steps are also included:
[0013] The preprocessing module is used to preprocess the target text to be compressed, resulting in processed text.
[0014] Accordingly, the step of extracting features from the target text to be compressed using the feature extraction module to obtain an embedded vector sequence includes:
[0015] The feature extraction module is used to extract features from the processed text to obtain an embedded vector sequence.
[0016] In a preferred embodiment, this application may be further configured such that the feature extraction module includes GPT or XLNet.
[0017] In a preferred embodiment, this application may be further configured such that the step of quantizing the embedded vector sequence using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression, includes:
[0018] Obtain the quantization granularity; based on the quantization granularity, use the q-former quantization processing module to quantize the embedded vector sequence to obtain a token sequence, thereby achieving text compression.
[0019] In a second aspect, a text decompression method is provided, including:
[0020] Obtain a token sequence and a request language, wherein the token sequence is obtained by the text compression method described in any implementation of the first aspect;
[0021] Obtain the large language model for decompression;
[0022] The token sequence and the request language are input into the large language model, and the large language model is used to reconstruct the text of the token sequence according to the request language to obtain the output text.
[0023] In a third aspect, a model training method is provided, including:
[0024] Obtain multiple training raw texts, the request language, and the token sequence corresponding to each of the multiple training raw texts;
[0025] Obtain a training model, which includes a compressed training model and a large language training model. The compressed training model includes a feature extraction training module and a q-former quantization processing training module.
[0026] The original training text is input into the compressed training model to extract features from the original training text using the feature extraction training module, thereby obtaining a training embedded vector sequence; the training embedded vector sequence is then quantized using the q-former quantization processing training module to obtain a training token sequence.
[0027] The training token sequence and the request language are input into the large language training model, so that the large language training model can be used to restore the text of the training token sequence according to the request language to obtain the training output text.
[0028] The training model is iteratively trained based on the training output text and the original training text to obtain a text processing model. The text processing model includes a text compression model and a large language model. The text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
[0029] In a fourth aspect, a text compression apparatus is provided, comprising:
[0030] The first acquisition module is used to acquire the target text to be compressed; and to acquire the text compression model, which includes: a feature extraction module and a q-former quantization processing module.
[0031] The compression module is used to input the target text to be compressed into the text compression model, and to extract features from the target text using the feature extraction module to obtain an embedded vector sequence; the embedded vector sequence is then quantized using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression.
[0032] In a fifth aspect, a text decompression apparatus is provided, comprising:
[0033] The second acquisition module is used to acquire a token sequence and a request language, wherein the token sequence is obtained by the text compression method described in any implementation of the first aspect; and to acquire a large language model for decompression.
[0034] The decompression module is used to input the token sequence and request language into the large language model, so as to use the large language model to restore the text of the token sequence according to the request language and obtain the output text.
[0035] In a sixth aspect, a model training apparatus is provided, comprising:
[0036] The third acquisition module is used to acquire multiple training original texts, request language, and token sequences corresponding to each of the multiple training original texts; and to acquire the training model, which includes a compressed training model and a large language training model. The compressed training model includes a feature extraction training module and a q-former quantization processing training module.
[0037] The training module is used to input the original training text into the compressed training model, and extract features from the original training text using the feature extraction training module to obtain a training embedded vector sequence; quantize the training embedded vector sequence using the q-former quantization processing training module to obtain a training token sequence; input the training token sequence and the request language into the large language training model, and restore the text of the training token sequence using the large language training model according to the request language to obtain the training output text; iteratively train the training model based on the training output text and the original training text to obtain a text processing model, which includes a text compression model and a large language model, wherein the text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
[0038] In a seventh aspect, an electronic device is provided, comprising:
[0039] One or more processors;
[0040] Memory;
[0041] One or more applications, wherein the applications are stored in memory and configured to be executed by one or more processors, the applications being configured to: perform an operation corresponding to a method shown in any possible implementation of the first aspect, or perform an operation corresponding to a method shown in an implementation of the second aspect, or perform an operation corresponding to a method shown in an implementation of the third aspect.
[0042] In an eighth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to perform an operation corresponding to the method shown in any possible implementation of the first aspect, or an operation corresponding to the method shown in an implementation of the second aspect, or an operation corresponding to the method shown in an implementation of the third aspect.
[0043] In a ninth aspect, a computer program product is provided, comprising a computer program that, when executed by a processor, implements an operation corresponding to the method shown in any possible implementation of the first aspect, or an operation corresponding to the method shown in an implementation of the second aspect, or an operation corresponding to the method shown in an implementation of the third aspect.
[0044] In summary, the text compression method provided in this application has the following beneficial technical effects:
[0045] The target text to be compressed is obtained and input into the feature extraction module of the text compression model. Deep learning techniques are used to capture the semantic and syntactic features of the text, and the extracted key information is represented as an embedded vector sequence. The q-former quantization module is then used to quantize the embedded vector sequence, transforming the continuous embedded vector sequence into a discrete token sequence, thus achieving effective compression of the text information. This combination of the feature extraction module and the q-former structure significantly reduces the redundancy of the text data while maintaining high-quality restoration of the original text.
[0046] In addition, this application also provides a text decompression method, a model training method, an apparatus, and a device, all of which have the aforementioned beneficial technical effects.
Attached Figure Description
[0047] To more clearly illustrate the technical solutions of the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0048] Figure 1 is a schematic diagram of an application scenario of a text compression method provided in an embodiment of this application;
[0049] Figure 2 is a schematic flowchart of a text compression method provided in an embodiment of this application;
[0050] Figure 3 is a schematic flowchart of a text decompression method provided in an embodiment of this application;
[0051] Figure 4 is a schematic flowchart of a text compression and decompression method provided in an embodiment of this application;
[0052] Figure 5 is a schematic flowchart of a model training method provided in an embodiment of this application;
[0053] Figure 6 is a structural schematic diagram of a text compression device provided in an embodiment of this application;
[0054] Figure 7 is a schematic diagram of a text decompression device provided in an embodiment of this application;
[0055] Figure 8 is a schematic diagram of a model training device provided in an embodiment of this application;
[0056] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0057] This specific embodiment is merely an explanation of this application and is not intended to limit it. After reading this specification, those skilled in the art can make modifications to this embodiment without contributing any inventive step, but such modifications are protected by patent law as long as they are within the scope of this application.
[0058] It should be noted that, in the optional embodiments of this application, the data related to object information, when applied to specific products or technologies, requires the permission or consent of the object. Furthermore, the collection, use, and processing of this data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. In other words, if the embodiments of this application involve data related to an object, it must be obtained with the object's authorization and consent, the authorization and consent of relevant departments, and in accordance with the relevant laws, regulations, and standards of the country and region. If the embodiments involve personal information, the acquisition of all personal information requires the individual's consent. If sensitive information is involved, the separate consent of the information subject is required. The embodiments also need to be implemented with the object's authorization and consent.
[0059] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0060] Furthermore, the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article, unless otherwise specified, generally indicates that the preceding and following related objects have an "or" relationship.
[0061] Currently, conventional text compression algorithms, such as Huffman coding and LZW coding, achieve compression by analyzing the frequency of characters in the text to construct an optimal prefix code or through string matching. For example, Huffman coding achieves compression by constructing a binary tree based on character frequency and assigning a unique variable-length code to each character. The LZW algorithm, on the other hand, achieves compression by constructing a string dictionary and replacing repeated strings with their indices in the dictionary.
[0062] These algorithms perform well with short texts and simple language structures, but their compression and decompression efficiency is often unsatisfactory when faced with long texts and complex language structures. Furthermore, these algorithms struggle with texts rich in context and highly language-dependent, leading to information loss or compression distortion.
[0063] Therefore, the inventors discovered the following drawbacks in conventional techniques:
[0064] Limited compression ratio: When dealing with long texts and complex language structures, it is difficult to further improve the compression ratio.
[0065] Poor decompression performance: During the compression process, conventional algorithms may lose some key contextual information, resulting in a decrease in the quality of the decompressed text.
[0066] Difficulty in handling complex language structures: Conventional algorithms mainly rely on character frequency and local pattern matching, which are difficult to effectively handle text content with rich context and high language dependence.
[0067] In recent years, the development of artificial intelligence technology, especially Large Language Models (LLMs), has provided a new perspective for text processing. LLMs, through deep learning of large amounts of text data, are able to understand complex language structures and contextual information, demonstrating outstanding capabilities in text generation, translation, and summarization. However, despite significant progress in text generation, LLMs are still rarely used in text compression, particularly in achieving high compression rates and high-quality restoration, where effective solutions are still lacking.
[0068] Based on this, to solve the problems of low compression ratio and poor decompression effect that existing text compression technologies exhibit when dealing with long texts and complex language structures, this application proposes a new text compression technology that utilizes the natural language processing capabilities of large language models (LLMs). The technology achieves text compression with a high compression ratio and high quality. Specifically, it extracts text features through a deep learning encoder, quantizes them with a q-former module, and restores the text through predictive generation by an autoregressive large language model (LLM). This application can effectively compress text data while maintaining high restoration quality, especially in scenarios involving long texts and complex language structures.
[0069] This application not only improves the efficiency of text compression but also ensures the quality of the decompressed text. This has significant practical value for applications that require processing large amounts of text data, such as cloud computing, big data analytics, mobile communications, and digital storage. The technical solution of this application significantly reduces the resources required for data storage and transmission while maintaining the integrity of text information, providing users with a more efficient and economical text processing solution.
[0070] Please refer to Figure 1, which is a schematic diagram of an application scenario of a text compression method provided in an embodiment of this application. The text compression method can be applied to a text compression system. In some embodiments, the text compression system includes an electronic device and a terminal device. The terminal device can send the target text to be compressed to the electronic device. Alternatively, another server can send the target text to the electronic device, or the electronic device can read target text that it already stores. The terminal device includes, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle terminals (such as vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device includes, but is not limited to, a server or a terminal device. The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The electronic device is used to implement the text compression method. The terminal device and the electronic device can be connected directly or indirectly through wired or wireless communication. It is understood that the above is only an example, and the embodiments of this application are not limited to it.
[0071] This application provides a text compression method, as shown in Figure 2, which includes:
[0072] S101. Obtain the target text to be compressed;
[0073] In this embodiment of the application, the target text to be compressed is the target text uploaded by the terminal device or the target text in the electronic device.
[0074] For example, in the cloud computing field, target files can be numerous log files generated by cloud platform servers and applications, including but not limited to runtime status, error messages, and user access records. Target files can also be user data stored in cloud storage services, such as documents, images, and videos. In the big data analytics field, target files can be data from various sources, such as social media, IoT devices, and enterprise databases. During data transmission or storage, large data volumes require file compression to save bandwidth and storage space. In the mobile communications field, target files can be SMS / MMS content or voice call logs, which may need to be compressed before transmission to save bandwidth and costs. Target files can also be data transmitted by mobile users when accessing web pages.
[0075] It should be noted that the embodiments of this application can effectively process texts of different languages and formats, without limiting the language and format of the target text.
[0076] S102. Obtain the text compression model, which includes: a feature extraction module and a q-former quantization processing module;
[0077] S103. Input the target text to be compressed into the text compression model, and use the feature extraction module to extract features based on the target text to obtain an embedded vector sequence; use the q-former quantization processing module to quantize the embedded vector sequence to obtain a token sequence, so as to achieve text compression.
[0078] The feature extraction module is used to extract features from the target text. In some embodiments, the feature extraction module can use a deep learning encoder to process the text; the encoder can be a Transformer-based model, such as BERT. The feature extraction module can capture the deep semantic and syntactic features of the text, transforming it into an embedded vector sequence in a high-dimensional space. It is understood that the embedded vector sequence contains key information about the text, providing a foundation for subsequent compression.
[0079] The q-former quantization module is a neural network module specifically designed for quantizing embedded vector sequences: it transforms continuous vector values into discrete token values, thereby achieving text compression. The module takes the embedded vector sequence output by the feature extraction module (the encoder) as its input. By learning the distribution of the embedded vectors, it discretizes the continuous values and maps them to a finite set of tokens, yielding a token sequence. The token sequence is the result of quantization: a series of discrete symbols that approximately represent the original text data, which is what allows the text to be stored in compressed form. The quantized token sequence has a high compression ratio while retaining the key information of the original text. In this embodiment, the token sequence is much smaller than the target text. Even so, the encoder-q-former structure significantly reduces the redundancy of the text data while maintaining high-quality reconstruction of the original text: the encoder uses deep learning techniques to capture the semantic and syntactic features of the text, and the q-former module compresses these features efficiently, so that even a small token sequence is sufficient to support high-quality text reconstruction.
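The mapping from continuous vectors to a finite token set can be illustrated with a small Python sketch. This is an assumption-laden stand-in: the application does not specify how the q-former assigns tokens, so the sketch uses generic nearest-neighbor vector quantization against a fixed codebook. The names `nearest_code`, `quantize`, and `dequantize`, and the toy codebook, are hypothetical; in a trained system the codebook (or its equivalent) would be learned.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def quantize(embeddings, codebook):
    """Map each continuous embedding to a discrete token id."""
    return [nearest_code(vector, codebook) for vector in embeddings]


def dequantize(token_ids, codebook):
    """Recover the approximate embedding that each token id stands for."""
    return [codebook[i] for i in token_ids]


# Toy 2-D codebook with four entries (hypothetical values).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
EMBEDDINGS = [(0.1, -0.2), (0.9, 0.2), (0.2, 0.8), (1.1, 0.7)]
TOKEN_IDS = quantize(EMBEDDINGS, CODEBOOK)  # [0, 1, 2, 3]
```

Only the integer ids need to be stored or transmitted; the approximation error introduced here is what the downstream language model has to compensate for during reconstruction.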
[0080] As can be seen, in this embodiment, the target text to be compressed is obtained, input into the feature extraction module of the text compression model, and deep learning technology is used to capture the semantic and syntactic features of the text. The extracted key information is represented in the form of an embedded vector sequence. The q-former quantization processing module is used to quantize the embedded vector sequence, transforming the continuous embedded vector sequence into a discrete token sequence, thus achieving effective compression of text information. By combining the feature extraction module with the q-former structure, the redundancy of text data can be significantly reduced while maintaining the high-quality restoration capability of the original text.
[0081] In one possible implementation of this application embodiment, the text compression model further includes: a preprocessing module;
[0082] Before extracting features from the target text to be compressed using the feature extraction module to obtain the embedded vector sequence, the process also includes: preprocessing the target text to be compressed using the preprocessing module to obtain the processed text.
[0083] Accordingly, feature extraction is performed on the target text to be compressed using the feature extraction module to obtain an embedded vector sequence, including: using the feature extraction module to extract features from the processed text to obtain an embedded vector sequence.
[0084] The preprocessing module is responsible for preliminarily processing the input text data (the target text) to improve the efficiency and accuracy of subsequent processing. The preliminary processing includes at least one of the following:
[0085] Word segmentation processing: Decompose the continuous text string of the target text into meaningful lexical units.
[0086] Stop word removal processing: Delete common and less meaningful words in the target text, such as "de" (of), "shi" (is), etc.
[0087] Punctuation processing: Identify and retain the punctuation marks in the target text because they have an important impact on the text structure.
[0088] It can be seen that, in the embodiments of the present application, preprocessing cleans the target text and removes redundant information, which improves the data quality and thus the compression quality.
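The three preprocessing operations listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the regular-expression tokenizer only suits languages that separate words with spaces (text such as Chinese would need a dedicated word segmenter), and the stop-word list, `TOKEN_PATTERN`, and `preprocess` are hypothetical names and values chosen for the example.

```python
import re

# Hypothetical stop-word list; a real deployment would use a language-specific one.
STOP_WORDS = frozenset({"the", "of", "is", "a", "an"})

# A token is either a run of word characters or a single punctuation mark,
# so punctuation is kept as its own token rather than discarded.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def preprocess(text, stop_words=STOP_WORDS):
    """Segment text into tokens, drop stop words, and retain punctuation."""
    tokens = TOKEN_PATTERN.findall(text)
    return [token for token in tokens if token.lower() not in stop_words]


PROCESSED = preprocess("The cat is on the mat, of course.")
# ['cat', 'on', 'mat', ',', 'course', '.']
```

Note that stop-word removal discards words from the input, so the text restored after decompression can only match the original to the extent that the language model reinserts them.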
[0089] In one possible implementation of the embodiments of the present application, the feature extraction module includes: GPT (Generative Pre-trained Transformer) or XLNet.
[0090] A feature extraction module based on the Transformer architecture captures the dependencies in the text directly through the self-attention mechanism, greatly improving the efficiency of the model in processing text data. GPT is a language model based on the Transformer architecture. XLNet is another language model based on the Transformer architecture, which uses permutation language modeling.
[0091] It can be seen that in the embodiments of the present application, in the feature extraction stage, different pre-trained models, such as GPT or XLNet, can be used to adapt to different types of text data.
[0092] In one possible implementation of the embodiments of the present application, the step of using the q-former quantization processing module to quantize the embedded vector sequence to obtain a token sequence for text compression includes:
[0093] Obtain the quantization granularity; based on the quantization granularity, use the q-former quantization processing module to perform quantization processing on the embedded vector sequence to obtain a token sequence for text compression.
[0094] Quantization granularity refers to the level of fineness or resolution used when quantizing embedded vectors during text compression, determining the level of detail of information that the quantized tokens can represent. In this embodiment, when efficient storage or transmission of large amounts of text data is required, the text is first embedded, converting it into a sequence of embedded vectors. Based on a preset quantization granularity, the q-former quantization module quantizes these vectors, dividing the continuous vector space into a series of discrete intervals, and mapping each vector to the token corresponding to its interval, thus achieving text compression. During the quantization stage, the number of tokens after compression can be changed by adjusting the output dimension of the q-former, thereby adjusting the quantization granularity and achieving different compression ratios and restoration qualities. The quantization granularity can be adjusted according to actual needs; for example, the embedded vectors can be quantized into a set of 256, 512, or 1024 tokens.
[0095] Specifically, the quantization granularity can be determined based on the characteristics of the text, the compression requirements, or the default granularity. For example, for texts that require a high degree of semantic information preservation (such as legal documents and medical reports), a finer quantization granularity can be set; while for texts with low requirements for semantic information (such as log data and social media content), a coarser quantization granularity can be set to achieve a higher compression ratio.
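The trade-off between granularity and compression ratio can be made concrete with some simple arithmetic, sketched below in Python. The sketch assumes that figures such as 256, 512, or 1024 denote the size of the discrete token set, so that each token costs the number of bits needed to index that set; the text leaves this interpretation open. The function names and the example token count of 32 are hypothetical.

```python
def bits_per_token(token_set_size):
    """Bits needed to index one token out of `token_set_size` possibilities."""
    if token_set_size < 2:
        raise ValueError("token_set_size must be at least 2")
    return (token_set_size - 1).bit_length()


def compressed_bits(num_tokens, token_set_size):
    """Total size of a token sequence when every token uses a fixed-width index."""
    return num_tokens * bits_per_token(token_set_size)


def compression_ratio(text, num_tokens, token_set_size):
    """Ratio of the UTF-8 size of `text` to the size of its token sequence."""
    original_bits = 8 * len(text.encode("utf-8"))
    return original_bits / compressed_bits(num_tokens, token_set_size)


# A 1000-byte text represented by 32 tokens drawn from a 1024-entry set:
# 8000 bits / (32 * 10 bits) = 25.0
RATIO = compression_ratio("a" * 1000, 32, 1024)
```

Under this reading, a finer granularity (a larger token set or more tokens per text) raises the bit cost and lowers the ratio, in exchange for tokens that can carry more detail.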
[0096] Therefore, in some embodiments, the step of obtaining quantization granularity can be achieved in multiple ways:
[0097] The first approach is to determine the quantization granularity through experience or expert knowledge, that is, by manually setting a suitable granularity. The second approach is to determine the quantization granularity dynamically using an automated algorithm. For example, features such as word distribution and sentence length are analyzed in the text, and the quantization granularity is adjusted automatically according to a preset correspondence between such features and granularities. This can optimize compression performance and adapt more flexibly to different compression needs. It is understood that other methods can also be used to obtain the quantization granularity, such as adjustment mechanisms based on user feedback. This is not limited here; the specific implementation varies with the application scenario and requirements.
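The automated approach can be sketched as a rule-based selector in Python. Everything specific here is an assumption made for illustration: the two features (average sentence length and the ratio of distinct words to total words), the thresholds, the candidate granularities, and the name `choose_granularity` are hypothetical and do not come from the application.

```python
import re


def choose_granularity(text, default=512):
    """Pick a quantization granularity from simple text statistics (hypothetical rules)."""
    words = text.split()
    if not words:
        return default
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    distinct_ratio = len({word.lower() for word in words}) / len(words)
    # Long sentences or a varied vocabulary suggest dense content: use a finer granularity.
    if avg_sentence_len > 25 or distinct_ratio > 0.8:
        return 1024
    # Short, repetitive text (for example, log lines) tolerates a coarser granularity.
    if avg_sentence_len < 8 and distinct_ratio < 0.4:
        return 256
    return default
```

A table-driven variant, in which the thresholds and granularities are loaded from configuration, would correspond more directly to a preset correspondence that operators can tune.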
[0098] Furthermore, this application provides a text decompression method, which includes: obtaining a token sequence and a request language, wherein the token sequence is obtained by the above-mentioned text compression method; obtaining a decompression model; inputting the token sequence into the decompression model to restore the text of the token sequence using the decompression model, thereby obtaining the output text.
[0099] Furthermore, the decompression model is a large language model used for decompression; see Figure 3, which illustrates a text decompression method provided in an embodiment of this application, including:
[0100] S201. Obtain the token sequence and request language. The token sequence is obtained through a text compression method.
[0101] The request language is used to instruct the large language model to perform decoding operations, for example, "Please decode the text based on the provided token sequence".
[0102] S202. Obtain the large language model for decompression;
[0103] S203. Input the token sequence and request language into the large language model, and use the large language model to restore the text of the token sequence according to the request language to obtain the output text.
[0104] During the decoding phase, the Large Language Model (LLM) leverages its powerful language understanding capabilities to progressively predict and reconstruct the original text based on the compressed token sequence and the request language, thus obtaining the output text. The structure of the LLM is not limited in this embodiment; users can configure it according to their actual needs. An exemplary LLM structure is QWEN2.
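Steps S201 to S203 can be sketched as assembling one model input from the request language and the token sequence. The `<c_N>` placeholder format and the `generate` callable are hypothetical; a real system may instead pass the compressed tokens to the model as embeddings rather than as text.

```python
from typing import Callable, Sequence


def build_decode_input(request_language: str, token_ids: Sequence[int]) -> str:
    """Concatenate the request language and a textual rendering of the tokens."""
    placeholders = "".join(f"<c_{t}>" for t in token_ids)  # assumed format
    return f"{request_language}\n{placeholders}\n"


def decompress(
    token_ids: Sequence[int],
    request_language: str,
    generate: Callable[[str], str],
) -> str:
    """S203: feed the combined input to the model and return its output text."""
    return generate(build_decode_input(request_language, token_ids))


prompt = build_decode_input(
    "Please decode the text based on the provided token sequence", [3, 7]
)
print(prompt)
```

Here `generate` stands in for whatever inference call the chosen large language model exposes.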
[0105] This application proposes a text compression technique based on a Large Language Model (LLM), aiming to address the challenges faced by existing text compression algorithms when processing long and complex texts. Referring to Figure 4, this technique extracts features from the input text using a deep learning encoder, transforming the text into a sequence of embedded vectors. These vectors are then quantized using a q-former module to generate a token sequence with a high compression ratio. These tokens retain key information from the original text and are decoded by an autoregressive large language model (LLM), which predicts and reconstructs the original text content.
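The wiring of the three stages (encoder, q-former quantization, LLM decoding) can be sketched as below. The `TextCodec` class and its toy stand-in functions are hypothetical and only show how the stages connect; the stand-ins perform a vocabulary lookup and do not actually reduce the data size.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class TextCodec:
    encode: Callable[[str], List[Vector]]           # deep learning encoder (stand-in)
    quantize: Callable[[List[Vector]], List[int]]   # q-former quantization (stand-in)
    decode: Callable[[List[int]], str]              # autoregressive LLM (stand-in)

    def compress(self, text: str) -> List[int]:
        return self.quantize(self.encode(text))

    def decompress(self, tokens: List[int]) -> str:
        return self.decode(tokens)


# Toy stand-ins so the sketch runs end to end.
VOCAB = ["text", "compression", "with", "tokens"]
codec = TextCodec(
    encode=lambda text: [(float(VOCAB.index(w)),) for w in text.split()],
    quantize=lambda vecs: [round(v[0]) for v in vecs],
    decode=lambda toks: " ".join(VOCAB[t] for t in toks),
)

tokens = codec.compress("text compression with tokens")
restored = codec.decompress(tokens)
print(tokens, restored)
```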
[0106] As can be seen, the decompression method provided in this application uses the LLM to progressively expand the compressed token sequence and restore the original long text, which not only improves the compression rate but also ensures the integrity and accuracy of the text information.
[0107] Furthermore, embodiments of this application provide a model training method, comprising: acquiring multiple training original texts and token sequences corresponding to each of the multiple training original texts; acquiring a training model, the training model including a compression training model and a decompression training model, the compression training model including a feature extraction training module and a q-former quantization processing training module; inputting the training original texts into the compression training model to extract features from the training original texts using the feature extraction training module, obtaining a training embedded vector sequence; quantizing the training embedded vector sequence using the q-former quantization processing training module, obtaining a training token sequence; inputting the training token sequence into the decompression training model to restore the text of the training token sequence using the decompression training model, obtaining a training output text; iteratively training the training model based on the training output text and the training original texts to obtain a text processing model, the text processing model including a text compression model and a decompression model, the text compression model being used to compress the text, and the decompression model being used to decompress the token sequence obtained from the compressed text.
[0108] Furthermore, the decompression training model is a large language training model; see Figure 5, which is a model training method provided by an embodiment of this application, including:
[0109] S301. Obtain multiple training original texts, the request language, and the token sequence corresponding to each of the multiple training original texts;
[0110] Training samples are built from internal and external data covering multiple fields, and each sample is constructed in the following format: request language (e.g., "Please restore the original text") + compressed token sequence + original text.
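The three-part training format can be illustrated with a small record builder. The JSON layout and the field names are assumptions made for this sketch; the application only specifies the order request language, compressed token sequence, original text.

```python
import json
from typing import List


def make_training_record(
    request_language: str, compressed_tokens: List[int], original_text: str
) -> str:
    """Serialize one sample in the order: request + compressed tokens + original."""
    record = {
        "request_language": request_language,
        "compressed_tokens": list(compressed_tokens),
        "original_text": original_text,
    }
    return json.dumps(record, ensure_ascii=False)


line = make_training_record(
    "Please restore the original text", [12, 845, 3], "An example sentence."
)
print(line)
```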
[0111] S302. Obtain the training model. The training model includes a compression training model and a large language training model. The compression training model includes: a feature extraction training module and a q-former quantization processing training module.
[0112] S303. Input the training original text into the compression training model, and use the feature extraction training module to extract features from the training original text to obtain the training embedded vector sequence; use the q-former quantization processing training module to quantize the training embedded vector sequence to obtain the training token sequence.
[0113] S304. Input the training token sequence and the request language into the large language training model, so as to use the large language training model to restore the text of the training token sequence according to the request language and obtain the training output text.
[0114] S305. Iteratively train the training model based on the training output text and the training original text to obtain a text processing model. The text processing model includes a text compression model and a large language model. The text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
[0115] During the training phase, the large language training model learns how to predict and generate the original text based on the compressed token sequence and the request language. Regarding model parameters, the hidden layer size of the LLM can be set between 128 and 512 to balance model complexity and performance. A large amount of text data is used to tune the model parameters; the model learns the patterns of text generation so that it can accurately recover the structural and semantic information of the original text from the compressed token sequence.
[0116] In one feasible approach, the loss is calculated only on the original text portion of each training sample, and iterative training is performed on that basis. Specifically, the difference between the training output text and the desired output is calculated, and the model parameters are adjusted based on this difference until the difference meets the requirements, at which point training stops.
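Computing the loss only on the original text portion can be sketched with a loss mask. The sketch assumes the loss is a mean negative log-likelihood over the masked positions and takes the model's per-position probabilities for the correct token as given numbers; both are assumptions, since the application does not name a specific loss function.

```python
import math
from typing import List, Sequence


def build_loss_mask(n_request: int, n_compressed: int, n_original: int) -> List[int]:
    """0 for request-language and compressed-token positions, 1 for original text."""
    return [0] * (n_request + n_compressed) + [1] * n_original


def masked_nll(target_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where the mask is 1."""
    picked = [-math.log(p) for p, m in zip(target_probs, loss_mask) if m]
    if not picked:
        raise ValueError("loss mask selects no positions")
    return sum(picked) / len(picked)


mask = build_loss_mask(n_request=2, n_compressed=3, n_original=2)
# Hypothetical probabilities the model assigned to the correct token at each position.
probs = [0.1, 0.1, 0.2, 0.2, 0.2, 0.5, 0.25]
loss = masked_nll(probs, mask)
print(mask, loss)
```

Only the last two positions contribute here, so the low probabilities on the request and compressed-token positions do not affect the result.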
[0117] The training process includes compression and decompression, wherein:
[0118] Compression process:
[0119] Step a101: Input the preprocessed text.
[0120] Step a102: Extract features using the encoder.
[0121] Step a103: Perform quantization using the q-former module.
[0122] Step a104: Generate a compressed token sequence.
[0123] Decompression process: The decompression process is the reverse of the compression process, and includes the following steps:
[0124] Step a201: Input the compressed token sequence.
[0125] Step a202: Use an autoregressive language model (LLM) to predict the text based on the token sequence and contextual information.
[0126] Step a203: Gradually expand the token sequence to restore the original text.
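Steps a201 to a203 can be pictured as a token-by-token generation loop. The sketch assumes greedy decoding with an end-of-sequence marker; `next_token`, `eos_id`, and the toy model are hypothetical stand-ins for the LLM's prediction step.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int]], int],
    prefix: List[int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Extend the context one token at a time until EOS or the length limit."""
    context = list(prefix)        # a201: compressed tokens (plus any request text)
    output: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)   # a202: predict from tokens and context
        if token == eos_id:
            break
        output.append(token)          # a203: the restored text grows step by step
        context.append(token)
    return output


# Toy predictor: counts upward from the last token and stops after 5.
toy_model = lambda ctx: ctx[-1] + 1 if ctx[-1] < 5 else 0
restored_ids = greedy_decode(toy_model, prefix=[2], eos_id=0, max_new_tokens=10)
print(restored_ids)
```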
[0127] Furthermore, during the training phase of the autoregressive language model, different optimization algorithms, such as Adam or RMSprop, can be used to improve training efficiency.
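For reference, a single Adam parameter update for one scalar parameter can be written as below. This is the standard published update rule with its commonly used default hyperparameters; how the optimizer is configured in this embodiment is not stated, so the learning rate and the toy objective are assumptions.

```python
import math
from typing import Tuple


def adam_step(
    param: float, grad: float, m: float, v: float, t: int,
    lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3), starting at x = 0.
x, m, v = 0.0, 0.0, 0.0
x, m, v = adam_step(x, 2 * (x - 3), m, v, t=1, lr=0.1)
print(x)
```

On the first step the bias-corrected update has magnitude close to the learning rate, regardless of the gradient's scale.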
[0128] In summary, this embodiment provides an efficient and flexible text compression technology that can significantly reduce the resources required for data storage and transmission while maintaining high restoration quality.
[0129] Specifically:
1. High compression ratio: a high compression ratio is achieved through quantization processing and the predictive capabilities of the LLM.
2. High restoration quality: the q-former quantization processing preserves key information of the original text while achieving a high compression ratio, and the LLM's natural language processing capabilities ensure high-quality text restoration.
3. Wide applicability: the technology is applicable to text in different languages and formats and has broad application prospects.
[0130] It should be noted that the text compression method, the text decompression method, and the model training method provided in the embodiments of this application may be cross-referenced with one another, and the details are not repeated in this embodiment.
[0131] The following describes a text compression device provided in an embodiment of this application. The text compression device described below and the text compression method described above may be referred to in correspondence with each other. The text compression device of this embodiment is installed in an electronic device. Referring to Figure 6, which is a structural block diagram of a text compression device according to one embodiment of this application, the device includes:
[0132] The first acquisition module 510 is used to acquire the target text to be compressed; and to acquire the text compression model, which includes: a feature extraction module and a q-former quantization processing module.
[0133] The compression module 520 is used to input the target text to be compressed into the text compression model, and to extract features from the target text using the feature extraction module to obtain an embedded vector sequence; the embedded vector sequence is then quantized using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression.
[0134] In one feasible approach, the text compression model also includes a preprocessing module;
[0135] The device further includes:
[0136] A preprocessing module, used to preprocess the target text to be compressed to obtain the processed text.
[0137] Accordingly, the compression module 520 is specifically used to:
[0138] Extract features from the processed text using the feature extraction module to obtain an embedded vector sequence.
[0139] In one possible implementation, the feature extraction module includes either GPT or XLNet.
[0140] In one possible implementation, the compression module 520 is specifically used to:
[0141] Obtain the quantization granularity; and, based on the quantization granularity, use the q-former quantization processing module to quantize the embedded vector sequence to obtain the token sequence, thereby achieving text compression.
[0142] The following describes a text decompression apparatus provided in an embodiment of this application. The text decompression apparatus described below and the text decompression method described above may be referred to in correspondence with each other. The text decompression apparatus of this embodiment is installed in an electronic device. Referring to Figure 7, which is a structural block diagram of a text decompression apparatus according to one embodiment of this application, the apparatus includes:
[0143] The second acquisition module 610 is used to acquire the token sequence and the request language, wherein the token sequence is obtained through the text compression method; and to acquire a large language model for decompression.
[0144] The decompression module 620 is used to input the token sequence and request language into the large language model, so as to use the large language model to restore the text of the token sequence according to the request language and obtain the output text.
[0145] The following describes a model training device provided in an embodiment of this application. The model training device described below and the model training method described above may be referred to in correspondence with each other. The model training device in this embodiment is installed in an electronic device. Referring to Figure 8, which is a structural block diagram of a model training device in one embodiment of this application, the device includes:
[0146] The third acquisition module 710 is used to acquire multiple training original texts, the request language, and the token sequences corresponding to each of the multiple training original texts; and to acquire a training model, which includes a compression training model and a large language training model. The compression training model includes: a feature extraction training module and a q-former quantization processing training module.
[0147] The training module 720 is used to input the training original text into the compression training model, so that the feature extraction training module can extract features from the training original text to obtain the training embedded vector sequence; the q-former quantization processing training module quantizes the training embedded vector sequence to obtain the training token sequence; the training token sequence and the request language are input into the large language training model, so that the large language training model can restore the text of the training token sequence according to the request language to obtain the training output text; the training model is iteratively trained based on the training output text and the training original text to obtain the text processing model. The text processing model includes a text compression model and a large language model. The text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
[0148] This application provides an electronic device, as shown in Figure 9. The electronic device 300 shown in Figure 9 includes a processor 301 and a memory 303. The processor 301 and the memory 303 are connected, for example, via a bus 302. Optionally, the electronic device 300 may further include a transceiver 304. It should be noted that in practical applications, the transceiver 304 is not limited to one, and the structure of this electronic device 300 does not constitute a limitation on the embodiments of this application.
[0149] Processor 301 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute the various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this application. Processor 301 may also be a combination that implements computational functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
[0150] Bus 302 may include a pathway for transmitting information between the aforementioned components. Bus 302 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. Bus 302 may be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in Figure 9, but this does not indicate that there is only one bus or one type of bus.
[0151] The memory 303 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.
[0152] The memory 303 is used to store application code that executes the solution of this application, and its execution is controlled by the processor 301. The processor 301 is used to execute the application code stored in the memory 303 to implement the content shown in the foregoing method embodiments.
[0153] The electronic device shown in Figure 9 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.
[0154] This application provides a computer-readable storage medium storing a computer program that, when run on a computer, enables the computer to execute the corresponding content in the aforementioned method embodiments.
[0155] This application provides a computer program product, including a computer program that, when executed by a processor, implements the corresponding content in the aforementioned method embodiments.
[0156] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.
[0157] The above are only some embodiments of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.
Claims
1. A text compression method, characterized in that it comprises: obtaining the target text to be compressed; obtaining a text compression model, which includes a feature extraction module and a q-former quantization processing module; inputting the target text to be compressed into the text compression model, so as to extract features from the target text using the feature extraction module to obtain an embedded vector sequence; and quantizing the embedded vector sequence using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression.
2. The text compression method according to claim 1, characterized in that the text compression model further includes a preprocessing module; before extracting features from the target text to be compressed using the feature extraction module to obtain the embedded vector sequence, the method further comprises: preprocessing the target text to be compressed using the preprocessing module to obtain processed text; accordingly, extracting features from the target text to be compressed using the feature extraction module to obtain the embedded vector sequence comprises: extracting features from the processed text using the feature extraction module to obtain the embedded vector sequence.
3. The text compression method according to claim 1, characterized in that the feature extraction module includes GPT or XLNet.
4. The text compression method according to claim 1, characterized in that quantizing the embedded vector sequence using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression, comprises: obtaining the quantization granularity; and, based on the quantization granularity, quantizing the embedded vector sequence using the q-former quantization processing module to obtain the token sequence, thereby achieving text compression.
5. A text decompression method, characterized in that it comprises: obtaining a token sequence and a request language, wherein the token sequence is obtained by the text compression method as described in any one of claims 1 to 4; obtaining a large language model for decompression; and inputting the token sequence and the request language into the large language model, so as to use the large language model to restore the text of the token sequence according to the request language and obtain the output text.
6. A model training method, characterized in that it comprises: obtaining multiple training original texts, the request language, and the token sequence corresponding to each of the multiple training original texts; obtaining a training model, which includes a compression training model and a large language training model, the compression training model including a feature extraction training module and a q-former quantization processing training module; inputting the training original text into the compression training model, so as to extract features from the training original text using the feature extraction training module to obtain a training embedded vector sequence, and quantizing the training embedded vector sequence using the q-former quantization processing training module to obtain a training token sequence; inputting the training token sequence and the request language into the large language training model, so as to use the large language training model to restore the text of the training token sequence according to the request language and obtain the training output text; and iteratively training the training model based on the training output text and the training original text to obtain a text processing model, wherein the text processing model includes a text compression model and a large language model, the text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
7. A text compression device, characterized in that it comprises: a first acquisition module, used to acquire the target text to be compressed, and to acquire a text compression model, which includes a feature extraction module and a q-former quantization processing module; and a compression module, used to input the target text to be compressed into the text compression model, to extract features from the target text using the feature extraction module to obtain an embedded vector sequence, and to quantize the embedded vector sequence using the q-former quantization processing module to obtain a token sequence, thereby achieving text compression.
8. A text decompression device, characterized in that it comprises: a second acquisition module, used to acquire a token sequence and a request language, wherein the token sequence is obtained by the text compression method as described in any one of claims 1 to 4, and to acquire a large language model for decompression; and a decompression module, used to input the token sequence and the request language into the large language model, so as to use the large language model to restore the text of the token sequence according to the request language and obtain the output text.
9. A model training device, characterized in that it comprises: a third acquisition module, used to acquire multiple training original texts, the request language, and the token sequence corresponding to each of the multiple training original texts, and to acquire a training model, which includes a compression training model and a large language training model, the compression training model including a feature extraction training module and a q-former quantization processing training module; and a training module, used to input the training original text into the compression training model, and extract features from the training original text using the feature extraction training module to obtain a training embedded vector sequence; quantize the training embedded vector sequence using the q-former quantization processing training module to obtain a training token sequence; input the training token sequence and the request language into the large language training model, and restore the text of the training token sequence using the large language training model according to the request language to obtain the training output text; and iteratively train the training model based on the training output text and the training original text to obtain a text processing model, which includes a text compression model and a large language model, wherein the text compression model is used to compress the text, and the large language model is used to decompress the token sequence obtained from the compressed text.
10. An electronic device, characterized in that it comprises: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the steps of the method according to any one of claims 1 to 6.