Large language model pre-training method and apparatus, electronic device, and storage medium
By introducing self-distillation in the pre-training stage of the large language model and training student models on self-distilled mixed datasets, the method solves the problem of training instability, improves the stability and generalization ability of the model, and enhances the pre-training effect.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
- Filing Date
- 2025-09-19
- Publication Date
- 2026-07-02
AI Technical Summary
Large language models often encounter training instability during pre-training, resulting in poor pre-training performance or even crashes. Existing technologies mainly maintain the stability of model training by optimizing parameter configuration, but rely on hyperparameter tuning and lack general applicability across datasets.
By employing self-distillation technology, the generative model at each stage of the large language model is used as the teacher model, and the student model is trained using a self-distilled mixed dataset. This enhances the stability of the training process and the breadth and diversity of the data. The self-distilled mixed dataset is used for self-improvement, integrating language representations and knowledge from different training stages.
It significantly improves the parameter accuracy and generalization ability of generative large language models during pre-training, solves the problem of training instability, and enhances the comprehensiveness and learning effect of pre-training.
Description
A method, apparatus, electronic device and storage medium for pre-training large language models
[0001] Related applications
[0002] This application claims priority to the Chinese patent application filed on December 24, 2024, with application number 202411919771.0, entitled "A method, apparatus, electronic device and storage medium for pre-training a large language model", the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, electronic device and storage medium for pre-training large language models.
Background Technology
[0004] Large language model pre-training refers to the process of initially training a large neural network model using large-scale text data to learn general feature representations and knowledge. Due to the scaling of the model structure and the surge in the amount of pre-training data, large language model pre-training often encounters training instability problems, which interfere with the pre-training process, resulting in poor pre-training performance or even pre-training failure.
Summary of the Invention
[0005] One aspect of this application proposes a method for pre-training a large language model, the method comprising the following steps:
[0006] Acquire multiple types of text data and preprocess the text data to obtain an initial pre-training dataset;
[0007] Generative large language models are built using a decoder architecture based on the Transformer model;
[0008] The generative large language model is trained in multiple stages based on the initial pre-training dataset;
[0009] A self-distilled mixed dataset is synthesized based on the teacher model and the initial pre-training dataset; wherein, the generative large language model trained at each stage serves as the teacher model;
[0010] The student model is trained using the self-distilled mixed dataset; wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer;
[0011] Return to the step of synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset, until the training has completed a preset number of stages, and use the student model obtained from the final training as the pre-trained generative large language model.
[0012] In some embodiments, constructing the generative large language model based on the decoder architecture of the Transformer model includes the following steps:
[0013] A construction step is performed on the causal decoder architecture of the Transformer model to construct the generative large language model; the construction step includes:
[0014] The causal decoder architecture of the Transformer model is normalized using the root mean square normalization (RMSNorm) algorithm.
[0015] A grouped query attention mechanism is set for the causal decoder architecture of the Transformer model;
[0016] The activation function of the feedforward neural network layer in the causal decoder architecture of the Transformer model is determined to be the SwiGLU function;
[0017] The position encoding of the causal decoder architecture of the Transformer model is set to rotary position embedding (RoPE), a relative position encoding.
[0018] In some embodiments, training the generative large language model on the initial pre-training dataset in multiple stages includes the following steps:
[0019] Determine the mixing ratio of different types of text data at different stages in the initial pre-training dataset;
[0020] A proportioned mixed pre-training dataset is obtained by sampling from the initial pre-training dataset according to the mixing ratio;
[0021] The text in the proportioned mixed pre-training dataset is concatenated and padded to a set length to obtain multiple texts of the same length as training samples;
[0022] The training samples are randomly divided into several batches of data;
[0023] The weight parameters of the embedding layer, decoder layer, and normalization layer of the generative large language model are initialized using a random method.
[0024] The generative large language model is trained in N stages based on the batch data; wherein each stage of training uses at least B batch data; N and B are positive integers.
[0025] In some embodiments, training the generative large language model with N stages based on the batch data includes the following steps:
[0026] The training process for each phase consists of the following steps:
[0027] The word sequence in each batch of data is input into the generative large language model, and then forward propagation is performed to obtain the probability distribution of the predicted target word.
[0028] The loss value between the predicted distribution of the generative large language model and the actual annotation results is calculated based on the loss function.
[0029] The gradient of the loss function with respect to the parameters of the generative large language model is calculated using the backpropagation algorithm;
[0030] The gradient is propagated sequentially from the output layer of the generative large language model back to the input layer of the generative large language model.
[0031] The parameters of the generative large language model are updated using an optimization algorithm based on the gradient and with the training objective of minimizing the loss value; wherein, the parameters of the generative large language model are updated accordingly after each batch of data has been processed.
[0032] In some embodiments, synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset includes the following steps:
[0033] Initialize an empty data pool;
[0034] During training at each stage, training samples are sampled from each batch of data at each stage with a first probability, and then the sampled training samples are added to the data pool.
[0035] After each stage of training, the current parameters of the generative large language model are saved to obtain the teacher model;
[0036] After training at each stage, candidate synthetic samples are randomly sampled from the data pool; wherein the data volume of the candidate synthetic samples is the same as the data volume of the batch data.
[0037] The text of each training sample in the candidate synthetic sample is segmented using a delimiter to obtain multiple original text data;
[0038] The data to be synthesized is selected from each of the original text data with a second probability.
[0039] The characters or words of each of the data to be synthesized are selected as cutoff points with a third probability; wherein the third probability follows a normal distribution.
[0040] The cutoff point and the text before the cutoff point are used as query text, and the query text is input into the teacher model at the current stage.
[0041] The teacher model is used to output the predicted text corresponding to the query text.
[0042] By concatenating the query text and the predicted text, a composite text is obtained.
[0043] The synthesized text and the remaining original text data that were not selected as the data to be synthesized are combined into the self-distilled mixed dataset.
[0044] In some embodiments, training the student model using the self-distilled mixed dataset includes the following steps:
[0045] Calculate the knowledge transfer interval; where k represents the knowledge transfer interval;
[0046] The formula for calculating the knowledge transfer interval includes:
[0047] Where K and c are hyperparameters; K represents the maximum knowledge transfer interval, and c (c > 0) controls the rate at which k grows and approaches its limit over the training stage i; ⌈·⌉ denotes rounding up; f(i) represents an intermediate function.
[0048] The generative large language model that is k stages away from the teacher model is selected as the student model;
[0049] The student model is trained using the self-distilled mixed dataset.
[0050] In some embodiments, the method further includes the following steps:
[0051] The student model is trained using a text dataset of the target type, thereby obtaining a target large language model;
[0052] The target type of text is input into the target large language model, and then the target large language model is used to output the prediction result corresponding to the target type of text.
[0053] Another aspect of this application provides a large language model pre-training apparatus, the apparatus comprising:
[0054] The data processing unit is used to acquire multiple types of text data and preprocess the text data to obtain an initial pre-training dataset.
[0055] A model building unit is used to build a generative large language model based on the decoder architecture of the Transformer model;
[0056] A stage training unit is used to train the generative large language model in multiple stages based on the initial pre-training dataset.
[0057] A dataset synthesis unit is used to synthesize a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset; wherein the generative large language model trained at each stage serves as the teacher model.
[0058] A distillation training unit is used to train a student model using the self-distilled mixed dataset; wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer;
[0059] An iterative training unit is used to return to the step of synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset until the training of a preset number of stages is completed, and the student model obtained by the final training is used as the pre-trained generative large language model.
[0060] Another aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described large language model pre-training method.
[0061] Another aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described large language model pre-training method.
[0062] Details of one or more embodiments of this application will be set forth in the following drawings and description. Other features, objects, and advantages of this application will become apparent from the specification, drawings, and claims.
Attached Figure Description
[0063] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0064] Figure 1 is a flowchart illustrating a large language model pre-training method provided in an embodiment of this application;
[0065] Figure 2 is a flowchart of the steps for constructing a generative large language model provided in an embodiment of this application;
[0066] Figure 3 is a flowchart of step S120 provided in an embodiment of this application;
[0067] Figure 4 is a flowchart of step S126 provided in an embodiment of this application;
[0068] Figure 5 is a flowchart of step S140 provided in an embodiment of this application;
[0069] Figure 6 is an example flowchart of a large language model pre-training provided in an embodiment of this application;
[0070] Figure 7 is a flowchart of a self-distillation-type large language model pre-training method provided in an embodiment of this application;
[0071] Figure 8 is a flowchart illustrating the construction of a self-distilled mixed dataset provided in an embodiment of this application;
[0072] Figure 9 is a schematic diagram of the structure of a large language model pre-training device provided in an embodiment of this application;
[0073] Figure 10 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0074] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.
[0075] It is understood that the terms “first,” “second,” etc., used in this application may describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word “if” as used herein may be interpreted as “at the time of…”, “when…”, or “in response to a determination.”
[0076] As used in this application, "at least one" includes one, two or more; "multiple" includes two or more; "each" refers to every one of the corresponding multiple items; and "any" refers to any one of the multiple items.
[0077] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0078] Before providing a detailed description of the embodiments of this application, some of the nouns and terms used in the embodiments of this application will be explained first. The nouns and terms used in the embodiments of this application shall be interpreted as follows:
[0079] A language model is an artificial intelligence model that uses machine learning techniques to understand and generate human natural language. By analyzing and modeling text data, a language model can learn the rules of human language use and, based on this, predict the probability distribution over masked positions or over the next word in a given content fragment. This probability calculation process helps computers understand and generate natural language, and can then be applied to various tasks such as intelligent question answering, machine translation, and content analysis. A large language model (LLM) refers to a scaled-up neural network language model, typically with one billion or more parameters. When solving complex tasks, it exhibits capabilities that smaller pre-trained language models lack, such as strong in-context learning, reasoning ability, cross-domain adaptability, and emergent abilities.
[0080] Pre-training refers to the process of training a model on a large-scale, unsupervised text dataset, and is the initial stage of language model learning. During pre-training, the model extracts and models patterns, structures, and semantic knowledge of natural language from massive amounts of data to learn general feature representations and knowledge, thereby acquiring the potential to solve downstream tasks.
[0081] Self-distillation: Knowledge distillation refers to the transfer of knowledge from the teacher model to the student model. This distillation process is typically achieved by minimizing the difference in output distribution between the teacher and student models in the hidden or classification layers. Generally, the teacher model has the same or even more complex network capacity and model structure than the student model. Self-distillation is a special case of knowledge distillation; it does not rely on an external large language model as a teacher model but utilizes the same model for self-transfer and optimization of knowledge. Common practices include distilling from one module of the model to another, or from one stage of the model to another.
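The conventional distillation objective described above, minimizing the difference between teacher and student output distributions, can be illustrated with a minimal sketch. It assumes the difference is measured by KL divergence over softmax outputs, which is a common choice but is not stated in this application; the function names and the example logits are hypothetical. Note that the method of this application transfers knowledge through synthesized data rather than through this loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical logits for one prediction over a 3-word vocabulary.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)
distillation_loss = kl_divergence(teacher_probs, student_probs)
```

The loss is zero only when the student reproduces the teacher distribution exactly, so minimizing it pulls the student's outputs toward the teacher's.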
[0082] Unlike the conventional definition in model distillation, in this application, "teacher model" refers to a model in an earlier stage of training (fewer training steps), while "student model" refers to a model in a later stage of training (more training steps). By distilling the knowledge of the preceding teacher model, the student model is further trained to achieve a more stable and natural effect. In this process, the teacher model acts as the "source" of knowledge transfer, and the student model acts as the "receiver" of knowledge; the two constitute a knowledge transfer relationship in a relative sense. The aforementioned "more" and "fewer" training steps refer to the relative relationship between the teacher model and the student model in terms of training steps.
[0083] Large language model pre-training refers to the process of initially training a large neural network model using massive amounts of text data to learn general feature representations and knowledge. Due to the scaling of the model structure and the surge in the amount of pre-training data, large language model pre-training often encounters problems of training instability, which interferes with the learning process and may even lead to training crashes.
[0084] To address the aforementioned issues, commonly used techniques include gradient clipping, dynamic batch training, and regularization. Gradient clipping controls the gradient norm by setting a threshold to constrain the gradient magnitude. When the calculated gradient magnitude exceeds this threshold, it is clipped to avoid gradient explosion. Dynamic batch training dynamically adjusts the batch size during training, gradually increasing it to the millions. Regularization techniques control model complexity to prevent overfitting; for example, weight decay adds a penalty term to the loss function to suppress parameter overfitting, thereby improving the model's generalization ability and stabilizing the training process.
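As an illustration of the gradient clipping technique described above, the following minimal sketch rescales a gradient whose L2 norm exceeds a threshold. The flat-list representation of the gradient and the name `clip_by_global_norm` are assumptions made for this example only.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Return (clipped_grads, original_norm); grads is a flat list of floats."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm          # within the threshold: leave unchanged
    scale = max_norm / norm               # shrink so the new norm equals max_norm
    return [g * scale for g in grads], norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping preserves the direction of the gradient and only limits its magnitude, which is why the threshold itself becomes one of the hyperparameters that paragraph [0085] refers to.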
[0085] In summary, related technologies primarily maintain model training stability by optimizing parameter configurations. While these technologies can enhance stability to some extent, they still face limitations such as high dependence on hyperparameter tuning and a lack of generality across different source datasets. This application addresses this issue from the perspective of data generalization by introducing self-distillation technology during the pre-training stage of a large language model. The aim is to apply and reinforce knowledge in the language representation space across different training stages, thereby enhancing the stability of the training process and improving the generalization ability of the large language model.
[0086] This application provides a method, apparatus, electronic device, and storage medium for pre-training a large language model. The technical solution includes: acquiring multiple types of text data and preprocessing the text data to obtain an initial pre-training dataset; constructing a generative large language model based on a decoder architecture using a Transformer model; training the generative large language model in multiple stages using the initial pre-training dataset; synthesizing a self-distilled mixed dataset using a teacher model and the initial pre-training dataset; wherein the generative large language model trained in each stage serves as the teacher model; training a student model using the self-distilled mixed dataset; wherein the student model is a generative large language model k stages after the teacher model, where k is a positive integer; returning to the step of synthesizing the self-distilled mixed dataset using the teacher model and the initial pre-training dataset, until a preset number of training stages are completed, and using the finally trained student model as the pre-trained generative large language model. This application utilizes a self-distilled mixed dataset to improve a generative large language model, integrating language representations and knowledge obtained from different training stages into the pre-training steps of subsequent stages. This enhances the stability of the pre-training process, increases the breadth and diversity of pre-training data, improves the comprehensiveness of pre-training learning, effectively solves the problem of repeated learning on low-quality pre-training data, and significantly improves the parameter accuracy of the generative large language model during the pre-training process.
[0087] This application provides a method, apparatus, electronic device, and storage medium for pre-training a large language model, relating to the field of artificial intelligence technology. The method, apparatus, electronic device, and storage medium provided in this application can be applied to a terminal, a server, or software running on a terminal or server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, or in-vehicle terminal, but is not limited thereto; the server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The server can also be a node server in a blockchain network; the software can be an application implementing knowledge extraction methods, but is not limited to the above forms.
[0088] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0089] Referring to Figure 1, an embodiment of this application provides a method for pre-training a large language model, which may include, but is not limited to, the following steps S100 to S150.
[0090] S100: Acquire multiple types of text data and preprocess the text data to obtain an initial pre-training dataset.
[0091] For example, general and specialized text data of various types are collected and preprocessed to obtain an initial pre-training dataset.
[0092] The text data sources include publicly available pre-training data and proprietary data collected for preset application scenarios, covering content such as web pages, books, multilingual text, mathematics, and code.
[0093] Data preprocessing operations include, but are not limited to, quality filtering, sensitive content filtering, data deduplication, and tokenization.
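The preprocessing operations listed for S100 could be sketched as follows. This is a minimal illustration under strong assumptions: the quality filter is a simple length threshold, the sensitive content filter is a substring blocklist, deduplication is exact-match on normalized text, and whitespace splitting stands in for a real tokenizer. None of these specific choices comes from this application, and `preprocess`, `min_chars`, and `blocked_terms` are hypothetical names.

```python
import hashlib

def preprocess(texts, min_chars=20, blocked_terms=()):
    """Filter, deduplicate, and split raw texts into token lists."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.split())           # collapse whitespace
        if len(normalized) < min_chars:               # crude quality filter
            continue
        lowered = normalized.lower()
        if any(term in lowered for term in blocked_terms):   # sensitive content filter
            continue
        digest = hashlib.sha256(lowered.encode("utf-8")).hexdigest()
        if digest in seen:                            # exact deduplication
            continue
        seen.add(digest)
        kept.append(normalized.split())               # stand-in for tokenization
    return kept

raw_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",       # duplicate after normalization
    "too short",
    "This sentence mentions a forbidden term somewhere.",
]
corpus = preprocess(raw_texts, blocked_terms=("forbidden",))
```

In this example only the first text survives: the second is a duplicate once case and spacing are normalized, the third fails the length threshold, and the fourth matches the blocklist.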
[0094] S110: Constructing generative large language models based on a decoder architecture of the Transformer model.
[0095] Furthermore, S110 may include S111:
[0096] S111: Perform the construction steps on the causal decoder architecture of the Transformer model to construct the generative large language model; referring to Figure 2, the construction steps in S111 include S1111 to S1114:
[0097] S1111: Apply pre-normalization to each sub-layer of the causal decoder architecture of the Transformer model using the root mean square normalization (RMSNorm) algorithm.
[0098] S1112: Set a grouped query attention mechanism for the causal decoder architecture of the Transformer model;
[0099] S1113: The activation function of the feedforward neural network layer of the causal decoder architecture of the Transformer model is determined to be the SwiGLU function;
[0100] S1114: Set the position encoding of the causal decoder architecture of the Transformer model to rotary position embedding (RoPE), a relative position encoding.
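Three of the components named in S1111 to S1114 can be sketched on plain Python lists as follows. This is an illustrative sketch using the commonly published definitions of RMSNorm, SwiGLU, and RoPE, not code from this application; the helper names, the epsilon value, the base of 10000, and the pairing of consecutive dimensions in `rope` are assumptions.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward layer: w_down @ (SiLU(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2j], x[2j+1]) by the angle position * base**(-2j/d)."""
    d = len(x)
    out = list(x)
    for j in range(d // 2):
        theta = position * base ** (-2.0 * j / d)
        a, b = x[2 * j], x[2 * j + 1]
        out[2 * j] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * j + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out

normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
```

Because `rope` applies a rotation whose angle is proportional to the position, the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is what makes the encoding relative.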
[0101] For example, a generative large language model is constructed based on a decoder architecture of the Transformer model.
[0102] Decoder architectures can be further subdivided into causal decoders, prefix decoders, etc. This embodiment adopts a causal decoder architecture based on a unidirectional masked attention mechanism: each word attends only to the words preceding it in the input sequence and to itself, and the model autoregressively predicts the next word. The large language model in this embodiment includes at least the GPT series models and any generative large language model implemented by inheriting the structure of the GPT series models.
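The one-way masking of the causal decoder, together with the grouped query attention set in S1112, can be sketched as below. The sketch assumes the usual formulation in which masked positions receive zero attention weight and consecutive query heads share one key/value head; the function names and the head counts in the example are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] in which position i only sees j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # positions after i are masked
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Grouped query attention: map a query head to the key/value head it shares."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Uniform scores for a 3-token sequence, and 8 query heads sharing 2 key/value heads.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
head_map = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

With uniform scores, the first token attends only to itself, the second splits its attention over two tokens, and the third over three, which shows how the mask restricts each position to its own prefix.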
[0103] S120: Train the generative large language model in multiple stages based on the initial pre-training dataset.
[0104] Referring to Figure 3, S120 may further include S121 to S126:
[0105] S121: Determine the mixing ratio of different types of text data at different stages in the initial pre-training dataset;
[0106] S122: Sample the proportioned mixed pre-training dataset from the initial pre-training dataset according to the mixing ratio;
[0107] S123: Concatenate and pad the text in the proportioned mixed pre-training dataset to a set length to obtain multiple texts of the same length as training samples;
[0108] S124: Randomly divide the training samples into several batches of data;
[0109] S125: Initialize the weight parameters of the embedding layer, decoder layer, and normalization layer of the generative large language model using a random method;
[0110] S126: Train the generative large language model in N stages based on the batch data; wherein each stage of training uses at least B batch data; N and B are positive integers.
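Steps S121 to S124 can be sketched as a small data pipeline. The sketch below assumes documents are already tokenized into lists of integer ids, that the mixing ratio is applied as a sampling weight per data type, and that a separator id is inserted between concatenated documents; the function names, the example ratios, and the ids used for separator and padding are all hypothetical.

```python
import random

def sample_mixture(sources, ratios, num_docs, rng):
    """S121-S122: draw documents, choosing each one's data type by its mixing ratio."""
    names = list(sources)
    weights = [ratios[name] for name in names]
    picks = rng.choices(names, weights=weights, k=num_docs)
    return [rng.choice(sources[name]) for name in picks]

def pack(token_docs, seq_len, sep_id, pad_id):
    """S123: concatenate documents, cut into seq_len chunks, and pad the last chunk."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(sep_id)
    samples = [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]
    if samples and len(samples[-1]) < seq_len:
        samples[-1] = samples[-1] + [pad_id] * (seq_len - len(samples[-1]))
    return samples

def make_batches(samples, batch_size, rng):
    """S124: randomly divide the training samples into batches."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

rng = random.Random(0)
sources = {"web": [[1, 2, 3], [4, 5]], "code": [[6, 7, 8, 9]]}
ratios = {"web": 0.7, "code": 0.3}            # hypothetical mixing ratio for one stage
docs = sample_mixture(sources, ratios, num_docs=10, rng=rng)
samples = pack(docs, seq_len=8, sep_id=0, pad_id=-1)
batches = make_batches(samples, batch_size=2, rng=rng)
```

Passing a different `ratios` dictionary for each stage would give the stage-dependent mixing described in S121.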
[0111] Furthermore, referring to Figure 4, when executing S126, the training of each stage performs the following steps S1261 to S1265:
[0112] S1261: Input the word sequence in each batch of data into the generative large language model, and then forward propagate to obtain the probability distribution of the predicted target word;
[0113] S1262: Calculate the loss value between the predicted distribution of the generative large language model and the actual annotation results based on the loss function;
[0114] S1263: Calculate the gradient of the loss function with respect to the parameters of the generative large language model using the backpropagation algorithm;
[0115] S1264: Propagate the gradient from the output layer of the generative large language model back to the input layer of the generative large language model in sequence;
[0116] S1265: Based on the gradient and with the goal of minimizing the loss value, update the parameters of the generative large language model using an optimization algorithm; wherein, each time the generative large language model finishes processing a batch of data, the parameters of the generative large language model are updated accordingly.
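The per-batch loop of S1261 to S1265 can be illustrated on a deliberately tiny stand-in model. The sketch below replaces the Transformer with a table of next-token logits indexed by the previous token, uses cross-entropy as the loss function, and uses plain gradient descent as the optimization algorithm; all three choices, as well as the names and the learning rate, are assumptions made to keep the example self-contained.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_stage(weights, batches, learning_rate=0.5):
    """Run one stage; weights[prev][next] holds the logit of `next` following `prev`."""
    vocab = len(weights)
    losses = []
    for batch in batches:
        grad = [[0.0] * vocab for _ in range(vocab)]
        loss, count = 0.0, 0
        for sequence in batch:
            for prev, target in zip(sequence, sequence[1:]):
                probs = softmax(weights[prev])            # S1261: forward pass
                loss -= math.log(probs[target])           # S1262: cross-entropy loss
                for j in range(vocab):                    # S1263-S1264: d(loss)/d(logit)
                    grad[prev][j] += probs[j] - (1.0 if j == target else 0.0)
                count += 1
        if count == 0:
            continue
        for i in range(vocab):                            # S1265: update after each batch
            for j in range(vocab):
                weights[i][j] -= learning_rate * grad[i][j] / count
        losses.append(loss / count)
    return losses

weights = [[0.0] * 3 for _ in range(3)]
batches = [[[0, 1, 2, 0, 1, 2]] for _ in range(20)]       # 20 batches of one sequence
losses = train_stage(weights, batches)
```

The parameters are updated once per batch, matching the requirement in S1265, and the recorded loss falls as the table learns the repeating pattern in the toy data.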
[0117] For example, this embodiment can pre-train a large language model based on large-scale unsupervised text data, and the specific steps include:
[0118] (1) Batch dataset preparation. The vocabulary of the large language model is initialized based on the pre-training dataset. Through curriculum learning, data mixing, and sample concatenation strategies, the pre-training dataset is divided into batches and the learning order is arranged to obtain the batch dataset sequence.
[0119] (2) Initialize the weight parameters of the large language model, including the weight parameters of the embedding layer, decoder layer and normalization layer.
[0120] (3) Training phase division. The pre-training process of the large language model is divided into N training phases, where each training phase includes the learning of at least B batches of data.
[0121] (4) Train the large language model iteratively using a batch dataset sequence. Each round of batch training includes inputting batch data, forward propagation to calculate the loss, backpropagation to calculate the gradient, and the optimizer updating the parameters of the large language model.
[0122] S130: Synthesize a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset; wherein the generative large language model trained at each stage serves as the teacher model.
[0123] Furthermore, S130 may include S131 to S1311:
[0124] S131: Initialize an empty data pool;
[0125] S132: During training at each stage, training samples are sampled from each batch of data at each stage with a first probability, and then the sampled training samples are added to the data pool.
[0126] S133: After training at each stage, save the current parameters of the generative large language model to obtain the teacher model;
[0127] S134: After training at each stage, candidate synthetic samples are randomly sampled from the data pool; wherein the amount of data in the candidate synthetic samples is the same as the amount of data in the batch data;
[0128] S135: Use a separator to segment the text of each training sample in the candidate synthetic sample to obtain multiple original text data;
[0129] S136: Select the data to be synthesized from each of the original text data with a second probability;
[0130] S137: Select characters or words from each of the data to be synthesized as cutoff points with a third probability; wherein the third probability follows a normal distribution;
[0131] S138: Use the cutoff point and the text before the cutoff point as query text, and input the query text into the teacher model at the current stage;
[0132] S139: Output the predicted text corresponding to the query text using the teacher model;
[0133] S1310: Concatenate the query text and the predicted text to obtain the synthesized text;
[0134] S1311: Combine the synthesized text and the remaining original text data that were not selected as the data to be synthesized into the self-distilled mixed dataset.
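Steps S134 to S1311 can be sketched as a single function. This is a simplified illustration under stated assumptions: the teacher is any callable that maps a query string to a continuation (a stand-in for the saved model), the separator string `<eos>` is hypothetical, text is split on whitespace rather than by a real tokenizer, and the cutoff is drawn from a normal distribution centred on the middle of the text (the centre and spread are assumptions).

```python
import random
from typing import Callable, List

EOS = "<eos>"  # hypothetical separator; the real identifier depends on the tokenizer


def synthesize_mixed_dataset(
    pool: List[str],
    teacher: Callable[[str], str],
    batch_size: int,
    beta: float,
    rng: random.Random,
) -> List[str]:
    """Build a self-distilled mixed dataset from samples drawn from the data pool."""
    candidates = rng.sample(pool, min(batch_size, len(pool)))      # S134: candidate samples
    mixed: List[str] = []
    for sample in candidates:
        for text in (t for t in sample.split(EOS) if t.strip()):  # S135: segment by separator
            words = text.split()
            if len(words) < 2 or rng.random() >= beta:             # S136: not selected, keep as is
                mixed.append(text)
                continue
            # S137: normally distributed cutoff, clipped so that at least
            # one word is kept and at least one word is regenerated.
            pos = round(rng.gauss(len(words) / 2, len(words) / 6))
            cut = min(max(pos, 1), len(words) - 1)
            query = " ".join(words[:cut])                          # S138: text up to the cutoff
            predicted = teacher(query)                             # S139: teacher continuation
            mixed.append(query + " " + predicted)                  # S1310: concatenate
    return mixed                                                   # S1311: real + synthesized text
```

With `beta=0.0` the function returns only real text, and with `beta=1.0` every sufficiently long piece is truncated and completed by the teacher.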
[0135] For example, in this embodiment, the generative large language model trained in the i-th stage can be used as a teacher model to synthesize a self-distilled mixed dataset.
[0136] This embodiment proposes a self-distillation-based pre-training method for large language models, aiming to promote self-improvement of large language models through a distillation iterative mechanism, thereby enhancing training stability, learning performance, and generalization ability. Its core technical point lies in applying a self-distillation-based data generalization scheme to the iterative training of large language models. Using real-world data as the context, after the i-th stage of model training is completed, the model obtained in the i-th training stage is used as the teacher model to generate high-quality synthetic data that more closely resembles the knowledge distribution of the base model. This data is then used for the learning of the student model in the (i+k)-th training stage.
[0137] Specifically, it includes the following steps:
[0138] 1. Maintain a real data pool P, initialized to empty.
[0139] 2. For each training phase i, a self-distilled mixed dataset is synthesized, specifically including the following steps:
[0140] 2.1 Real Data Sampling. For each data point in each batch of training data, there is a first probability α that it will be selected into the data pool P. α can be a fixed value uniform across all samples, or it can vary depending on the sample's class label.
[0141] This embodiment enables the samples in the data pool P to reflect the distribution characteristics of the original dataset, and can adjust the sampling probability according to the representativeness and priority of different categories, thereby optimizing the balance of learning for each category during model training, which helps to enhance the generalization ability of the final model and make it more robust when facing new data.
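The sampling rule of step 2.1 can be sketched as below. This is a minimal illustration, assuming each data point arrives as a `(text, label)` pair; the helper name `add_batch_to_pool` and the `alpha_by_class` override table are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple


def add_batch_to_pool(
    pool: List[str],
    batch: Iterable[Tuple[str, Optional[str]]],
    alpha: float,
    alpha_by_class: Optional[Dict[str, float]] = None,
    rng: Optional[random.Random] = None,
) -> None:
    """Add each (text, label) item of a batch to the pool with probability alpha.

    alpha_by_class optionally overrides alpha for specific class labels.
    """
    rng = rng or random.Random()
    overrides = alpha_by_class or {}
    for text, label in batch:
        p = overrides.get(label, alpha)  # fall back to the uniform alpha
        if rng.random() < p:
            pool.append(text)
```

Raising the probability for an under-represented class is one way to bias the pool toward categories that should carry more weight.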
[0142] 2.2 Model archiving. The last training epoch of each training phase serves as the model archiving point. Once training is complete, the current model parameters are saved.
[0143] 2.3 Obtaining the candidate set for distillation.
[0144] After the current training phase is completed, b data samples are randomly sampled from the data pool P as candidate synthetic samples. The size of b is equal to that of the batch data in normal training.
[0145] Using the special end-of-sentence identifier as a delimiter, the b data samples are segmented into multiple pieces of original text data of varying length and number. Because each pre-training sample is constructed by concatenating multiple pieces of data and padding them to a fixed length according to certain rules, the original text data must be recovered by reversing those concatenation rules.
[0146] Each piece of original text data is selected for synthesis with a second probability β and is left unchanged with probability 1-β. The proportion of distilled synthetic data needs to be adjusted dynamically across training stages. In the early stages of training, the model is not yet mature and needs to learn basic patterns and features from purer real-world data to build robust basic representation capabilities. As training progresses, the synthesis probability β is gradually increased, allowing the model to progressively adapt to the data format it generates, thereby improving its generalization ability and robustness.
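The text states only that β grows as training progresses. One possible schedule is a linear ramp, sketched below; the linear shape and the default endpoints `beta_start=0.0` and `beta_end=0.5` are assumptions chosen for illustration.

```python
def beta_for_stage(stage: int, n_stages: int,
                   beta_start: float = 0.0, beta_end: float = 0.5) -> float:
    """Linearly increase the synthesis probability from beta_start to beta_end.

    Stages are numbered 1..n_stages; the linear ramp is an assumed schedule.
    """
    if n_stages < 1 or not 1 <= stage <= n_stages:
        raise ValueError("stage must lie in 1..n_stages")
    if n_stages == 1:
        return beta_end
    frac = (stage - 1) / (n_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    return beta_start + frac * (beta_end - beta_start)
```

Any non-decreasing schedule would fit the description equally well; a step or cosine ramp could be substituted without changing the rest of the procedure.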
[0147] 2.4 Self-distillation synthesis.
[0148] For each piece of source data to be synthesized, the location of the cutoff point is selected. The third probability of each character or word in the text being selected as the cutoff point follows a normal distribution to ensure the diversity and randomness of the synthesized data, while maintaining a high content of real-world knowledge.
[0149] The truncation point and the text fragment preceding it are used as queries and input into the model saved after training in this stage to predict the subsequent content of the sentence.
[0150] The text preceding the truncation point in the source data is concatenated with the generated text from the large language model to form a complete distilled mixed text sentence. This sentence not only preserves the authenticity of the original data but also ensures the distributional consistency of the large language model.
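The cutoff selection of step 2.4 can be made concrete by assigning each candidate position a weight from a normal density and normalizing. The sketch below assumes the distribution is centred at half the text length with a standard deviation of 15% of the length; both values, and the function names, are illustrative assumptions.

```python
import math
import random
from typing import List, Optional


def cutoff_weights(n_tokens: int, mean_frac: float = 0.5, std_frac: float = 0.15) -> List[float]:
    """Selection probability for each candidate cutoff index 1..n_tokens-1.

    A cutoff index c means the first c tokens are kept as the query.
    """
    if n_tokens < 2:
        raise ValueError("need at least two tokens to place a cutoff")
    mu, sigma = mean_frac * n_tokens, std_frac * n_tokens
    raw = [math.exp(-0.5 * ((i - mu) / sigma) ** 2) for i in range(1, n_tokens)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def sample_cutoff(n_tokens: int, rng: Optional[random.Random] = None) -> int:
    """Draw one cutoff index according to cutoff_weights."""
    rng = rng or random.Random()
    positions = list(range(1, n_tokens))
    return rng.choices(positions, weights=cutoff_weights(n_tokens), k=1)[0]
```

Centring the distribution away from the start of the text keeps a substantial real prefix in every synthesized sentence, which matches the stated goal of retaining real-world knowledge.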
[0151] 2.5 The original text and the distilled mixed text data are merged to obtain a self-distilled mixed dataset D, which combines real-world knowledge with data synthesized by self-distillation under the guidance of real data.
[0152] 3. Remove the b data samples collected in this stage from the data pool P.
[0153] S140: Train a student model using the self-distilled mixed dataset; wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer.
[0154] Referring to Figure 5, S140 may further include S141 to S143:
[0155] S141: Calculate the knowledge transfer interval; where k represents the knowledge transfer interval;
[0156] The formula for calculating the knowledge transfer interval includes:
[0157] Where K and c are hyperparameters; K represents the maximum knowledge transfer interval; c controls how quickly k grows with the training stage i and approaches K, and c>0; ⌈·⌉ denotes rounding up (the ceiling function); and f(i) denotes an intermediate function.
[0158] S142: Select the generative large language model that is k stages away from the teacher model as the student model;
[0159] S143: Train the student model using the self-distilled mixed dataset.
[0160] For example, in this embodiment, the self-distilled mixed dataset D is used to train the student model in the (i+k)-th stage.
[0161] The self-distilled mixed dataset D obtained in step S130 is further processed into a batch data format that can be used for model training, and is used as a batch of data in the (i+k)-th stage to participate in the training of that stage. The above process can be regarded as the transfer and generalization of knowledge from the previous teacher model to the growing student model.
[0162] The knowledge transfer interval k is dynamically adjusted based on the training progress. In the early stages of pre-training a large language model, the learning rate is high, the loss curve decreases rapidly, and the model learns quickly. Knowledge should therefore be transferred frequently, that is, the knowledge transfer interval k should be set relatively small. As training progresses, the model's learning gradually slows down. At this point, the knowledge transfer interval k should be increased to adapt to the model's learning process, improve training efficiency, and enhance the robustness of the pre-trained model.
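The behaviour described above (a small k early on that grows toward a maximum K at a rate set by c) can be illustrated with a saturating schedule. The sketch below assumes the intermediate function f(i) = 1 - exp(-c·i) and k = ⌈K·f(i)⌉; this particular choice of f is a hypothetical example that satisfies the description, not a formula taken from the embodiment.

```python
import math


def knowledge_transfer_interval(stage: int, K: int, c: float) -> int:
    """Interval k for training stage i under an ASSUMED saturating schedule.

    f(i) = 1 - exp(-c * i) is hypothetical; k = ceil(K * f(i)) then lies in 1..K.
    """
    if stage < 1 or K < 1 or c <= 0:
        raise ValueError("require stage >= 1, K >= 1 and c > 0")
    f = 1.0 - math.exp(-c * stage)  # rises from near 0 toward 1 as the stage index grows
    return math.ceil(K * f)
```

Under this assumed form, a larger c makes k reach K within fewer stages, while a smaller c keeps knowledge transfer frequent for longer.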
[0163] S150: Return to the step of synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset, until the training of the preset number of stages is completed, and use the student model obtained by the final training as the pre-trained generative large language model.
[0164] It is understood that this embodiment can be iterated repeatedly to continuously adjust and optimize the parameters of the large language model until the training loss of the large language model converges or the preset number of training stages is reached, thus obtaining the pre-trained generative large language model.
[0165] As a further implementation, embodiments of this application may also include training the pre-trained generative large language model using text data of the target type, thereby enabling the generative large language model to generate corresponding prediction results under the guidance of text data of the target type. Specifically, embodiments of this application may further include the following steps S161-S162:
[0166] S161: Train the student model obtained by the final training using the target type of text dataset, and then obtain the target large language model;
[0167] S162: Input the text of the target type into the target large language model, and then use the target large language model to output the prediction result corresponding to the text of the target type.
[0168] For example, texts in different scenarios correspond to text datasets of different target types. For instance, the text types in different scenarios such as education, daily life, or travel are generally different. Using texts from different scenarios as target type text datasets to train a pre-trained generative large language model enables the generative large language model to better learn the knowledge in the corresponding scenarios and thus output prediction results with higher matching degree. The prediction results may include text, links, images, or audio and video content.
[0169] The following section will provide a detailed introduction and explanation of the solutions in the embodiments of this application, using specific application examples.
[0170] The stability of a pre-trained model is crucial for ensuring its generalization ability and practical application effectiveness. A highly stable pre-trained model can adapt more effectively to downstream tasks, thus exhibiting superior performance in diverse task scenarios. Furthermore, model stability directly affects training efficiency and the amount of hardware resources required.
[0171] To enhance training stability and further improve the generalization ability of large language models, this embodiment proposes introducing self-distillation technology during the pre-training stage of large language models. The core technology of this embodiment lies in using the previous model as the teacher model and the currently trained model as the student model. During the student model training process, in addition to using the original data, a hybrid dataset synthesized by the teacher model through self-distillation will be added. The synthesized data is randomly sampled from previously learned data according to a specific probability distribution, and the cutoff point is determined following a normal distribution. The content starting from the cutoff point is entirely generated by the teacher model. This synthesized data not only retains real-world knowledge but also closely matches the knowledge distribution of the base model.
[0172] This embodiment aims to leverage distilled mixed data for self-improvement, integrating language representations and knowledge acquired at different training stages into subsequent learning processes. This strategy not only enhances the stability of the pre-training process but also increases the breadth and diversity of data, ensuring comprehensive learning and effectively avoiding repeated learning on low-quality data, significantly improving performance in downstream tasks. Traditional distillation techniques typically require teacher models with equally or even more complex structures, leading to higher training costs; while self-distillation simplifies model compression and knowledge transfer through a self-supervised mechanism, eliminating reliance on complex teacher models or additional computational steps. Although self-distillation does not actually add new information, numerous studies have shown that self-distilled student models achieve better generalization capabilities than teacher models.
[0173] Referring to Figure 6, this embodiment provides an example flowchart of a large language model pre-training method.
[0174] Specifically, the large language model pre-training method in this embodiment may include the following steps:
[0175] S1. Collect general text data and various types of specialized text data, and preprocess these data to obtain the initial pre-training dataset.
[0176] The data sources include publicly available pre-training data and proprietary data collected for preset application scenarios, covering content such as web pages, books, multilingual text, mathematics, and code.
[0177] Data preprocessing operations include quality filtering, sensitive content filtering, data deduplication, and tokenization.
[0178] S2. Construct a generative large language model based on the decoder architecture of the Transformer model.
[0179] This embodiment employs a standard, dense Transformer causal decoder architecture to construct the network structure of the generative large language model. The Transformer decoder consists of multiple identical network blocks stacked together, with each block containing a self-attention network layer and a feedforward neural network layer.
[0180] The causal decoder architecture is an autoregressive architecture that employs a one-way attention masking mechanism. Its key feature is that when predicting a word at each position, it only uses the words preceding it as context, without utilizing future word information. This mechanism ensures the temporal coherence and logical consistency of the generated text, demonstrating good context learning and generalization abilities, making it suitable for text generation tasks.
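The one-way attention mask can be illustrated with a small helper. This is a generic sketch of causal masking, not code from the embodiment; the name `causal_mask` is an assumption.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Only positions j <= i are visible, so no future token leaks into the
    prediction made at position i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```

For a sequence of length 3 the mask is lower triangular: the first position sees only itself, and the last position sees all three.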
[0181] Building on the original Transformer decoder architecture, preferred implementations include: applying root mean square normalization (RMSNorm) before each sub-layer (pre-normalization) to enhance the stability of the training process; employing grouped-query attention to balance inference efficiency and performance; using the SwiGLU function as the activation function of the feedforward neural network layer to improve the model's expressive power; and using rotary position embedding (RoPE), a relative position encoding, instead of the traditional sinusoidal absolute position encoding to process long sequences more effectively.
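Two of these components are simple enough to show in plain Python. The sketch below follows the commonly used definitions of RMSNorm and of the SwiGLU gate (SiLU of one projection multiplied elementwise by another); the function names and the `eps` default are assumptions, and the linear projections that surround the gate in a real feedforward layer are omitted.

```python
import math
from typing import List


def rms_norm(x: List[float], gain: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for v, g in zip(x, gain)]


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_gate(gate_proj: List[float], up_proj: List[float]) -> List[float]:
    """Elementwise SwiGLU gate: silu(gate_proj) * up_proj."""
    return [silu(g) * u for g, u in zip(gate_proj, up_proj)]
```

Unlike layer normalization, RMSNorm does not subtract the mean, so it needs only one reduction over the hidden dimension.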
[0182] For example, a large language model structure with 7 billion parameters is constructed, which contains 30 Transformer layers, a hidden layer dimension of 4096, an attention layer with 32 query heads, 8 key and value heads, and a vocabulary size of 12000.
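The stated hyperparameters can be collected in a small configuration object. This is a sketch only: the class name `ModelConfig`, the derived properties, and the contiguous assignment of query heads to key/value heads are assumptions (the text gives the head counts but not the grouping rule), and the feedforward width is left out because the text does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    n_layers: int = 30        # Transformer layers
    hidden_size: int = 4096   # hidden layer dimension
    n_query_heads: int = 32   # attention query heads
    n_kv_heads: int = 8       # shared key/value heads (grouped-query attention)
    vocab_size: int = 12000   # vocabulary size as stated in the text

    @property
    def head_dim(self) -> int:
        """Dimension of a single attention head."""
        return self.hidden_size // self.n_query_heads

    @property
    def queries_per_kv_head(self) -> int:
        """Number of query heads that share one key/value head."""
        return self.n_query_heads // self.n_kv_heads

    def kv_head_for(self, query_head: int) -> int:
        """Key/value head used by a query head (contiguous grouping assumed)."""
        return query_head // self.queries_per_kv_head
```

With these values each head has dimension 128 and every key/value head serves four query heads, which cuts the key/value cache to a quarter of what full multi-head attention would need.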
[0183] S3. Pre-train a large language model based on large-scale unsupervised text data.
[0184] For example, Figure 7 is a flowchart of a self-distillation-type large language model pre-training method provided in this embodiment.
[0185] S301: For the initial pre-training dataset, design a scheduling strategy to determine the order in which each data source is used for training, and use different data source mixing ratios at different stages.
[0186] S302: Based on the data mixing ratio that meets the preset conditions (e.g., achieving the most ideal performance), sample data from different data sources to adjust the overall distribution of the pre-training data and obtain the pre-training dataset after the ratio mixing.
[0187] In some embodiments, specific capabilities of the model can be enhanced by increasing the proportion of the corresponding data sources, for example by using more mathematical and code data in the later stages of pre-training to enhance reasoning capabilities.
[0188] S303: Construct a batch dataset sequence for training based on the pre-training dataset obtained after ratio mixing.
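Stage-dependent mixing ratios can be sketched as weighted sampling over data sources. The source names and ratio values in the usage note are made up for illustration, and `sample_sources` is a hypothetical helper.

```python
import random
from typing import Dict, List


def sample_sources(ratios: Dict[str, float], n: int, rng: random.Random) -> List[str]:
    """Draw n source names with probability proportional to their mixing ratio."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

For example, an early stage might use `{"web": 0.7, "books": 0.2, "code": 0.1}` and a later stage `{"web": 0.4, "books": 0.2, "code": 0.4}`, so that code is drawn more often near the end of pre-training.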
[0189] To adapt to the model's context length and conserve computational resources, multiple data points are first concatenated and padded to a fixed length using the fewest possible special identifiers to form a training sample. Then, the large number of training samples are shuffled and randomly divided into several small batches to obtain a sequence of batch datasets suitable for training. In the pre-training process of large language models, the batch size is typically set to a relatively high value.
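The concatenate-and-pad construction, together with its inverse, can be sketched on token-id lists. This is a simplified greedy packer under stated assumptions: `EOS_ID` and `PAD_ID` are hypothetical special identifiers that never occur inside a document, and documents longer than the context are rejected rather than split.

```python
from typing import List

EOS_ID = 1  # hypothetical end-of-text identifier
PAD_ID = 0  # hypothetical padding identifier


def pack_documents(docs: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate documents (each followed by EOS) into fixed-length samples."""
    samples: List[List[int]] = []
    current: List[int] = []
    for doc in docs:
        piece = doc + [EOS_ID]
        if len(piece) > max_len:
            raise ValueError("document does not fit in the context length")
        if len(current) + len(piece) > max_len:
            # Close the current sample and pad it to the fixed length.
            samples.append(current + [PAD_ID] * (max_len - len(current)))
            current = []
        current = current + piece
    if current:
        samples.append(current + [PAD_ID] * (max_len - len(current)))
    return samples


def unpack_sample(sample: List[int]) -> List[List[int]]:
    """Recover the original documents from one packed sample."""
    docs: List[List[int]] = []
    current: List[int] = []
    for tok in sample:
        if tok == PAD_ID:  # padding marks the end of real content
            break
        if tok == EOS_ID:
            docs.append(current)
            current = []
        else:
            current.append(tok)
    return docs
```

The `unpack_sample` direction corresponds to the segmentation needed when original text is recovered from stored training samples during dataset synthesis.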
[0190] S304: Initialize the weight parameters of the embedding layer, decoder layer, and normalization layer of the generative large language model using a random method.
[0191] S305: Divide the pre-training process of a large language model into N training phases, where each training phase includes the learning of at least B batches of data.
[0192] S306: Iteratively train a large language model using batch dataset sequences. The core process of large language model pre-training includes inputting batch data, calculating the loss through forward propagation, calculating the gradient through backpropagation, and updating the large language model parameters using the optimizer. Specifically:
[0193] Language modeling is used as the pre-training task. For each word sequence $x = \{x_1, \ldots, x_n\}$ in the batch data, the language modeling task is to predict the next target word $x_t$ autoregressively, based on the word sequence $x_{<t}$ preceding the current position.
[0194] For each batch of data, the word sequence is input into the large language model, and forward propagation is used to obtain the probability distribution of the predicted target words. The loss between the model's predicted distribution and the actual labeled results is calculated based on the loss function. The gradient of the loss function with respect to the model parameters is calculated using the backpropagation algorithm, and this gradient information is propagated sequentially from the output layer back to the input layer. After obtaining the gradient, the parameters of the large language model are updated using an optimization algorithm with the goal of minimizing the loss. The model parameters are updated after each batch of data has been processed.
[0195] For example, the likelihood function is used as the loss function, and the AdamW optimizer is used to update the parameters.
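A likelihood-based loss for autoregressive language modeling is conventionally written as the negative log-likelihood of each target word given its preceding words. The form below is that standard objective, given here as an assumption about what is meant; the parameter symbol $\theta$ is introduced only for this illustration.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log P_{\theta}\left(x_t \mid x_{<t}\right)
```

Minimizing this quantity over a batch is equivalent to maximizing the likelihood of the observed word sequences under the model.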
[0196] S4. Use the models trained in each stage as teacher models to synthesize a self-distilled mixed dataset.
[0197] For example, Figure 8 is a flowchart of the construction process of the self-distillation mixing dataset provided in this embodiment.
[0198] S401: Maintain a real data pool P, initialized to empty.
[0199] S402: Real data sampling. For each data point in each batch, there is a first probability α that it will be selected into the data pool P. α can be a fixed value uniform across all samples, or it can vary depending on the sample's category label.
[0200] This embodiment enables the samples in the data pool P to reflect the distribution characteristics of the original dataset, and can adjust the sampling probability according to the representativeness and priority of different categories, thereby optimizing the balance of learning for each category during model training, which helps to enhance the generalization ability of the final model and make it more robust when facing new data.
[0201] S403: Model Archive. The last training epoch of each training phase serves as the model archive point. Once training is complete, the current model parameters are saved.
[0202] The synthesis process of the self-distillation mixed dataset includes the following steps (S404~S409):
[0203] S404: After the current training phase is completed, randomly sample b data samples from the data pool P as candidate synthetic samples. The size of b is equal to the batch data size used in normal training.
[0204] S405: Using the special end-of-sentence identifier as a delimiter, segment the b data samples into multiple pieces of original text data of varying length and number. Because each pre-training sample is constructed by concatenating multiple pieces of data and padding them to a fixed length according to certain rules, the original text data must be recovered by reversing those concatenation rules.
[0205] S406: Each piece of original text data is selected for synthesis with a second probability β and is left unchanged with probability 1-β.
[0206] Different training phases require dynamically adjusting the proportion of distilled synthetic data. In the early stages of training, the model is not yet mature and needs to learn basic patterns and features from purer real-world data to build robust basic representation capabilities. As training progresses, the probability β of data synthesis is gradually increased, allowing the model to progressively adapt to the data format it generates, thereby improving its generalization ability and robustness.
[0207] S407: For each piece of source data to be synthesized, select the location of the cutoff point. The third probability of each character or word in the text being selected as the cutoff point follows a normal distribution to ensure the diversity and randomness of the synthesized data, while maintaining a high content of real-world knowledge.
[0208] S408: Take the truncation point and the text fragment before it as the query, input it into the model saved after training in this stage, and predict the subsequent content of the sentence.
[0209] S409: Concatenate the text before the truncation point in the source data with the generated text from the large language model to form a complete distilled mixed text statement. This statement not only preserves the authenticity of the original data but also ensures the distribution consistency of the large language model.
[0210] S410: Merge the original text and the distilled mixed text data to obtain a self-distilled mixed dataset D, which combines real-world knowledge with data synthesized by self-distillation under the guidance of real data.
[0211] S411: Remove from the real data pool P the b data samples that were used to build the self-distilled mixed dataset D.
[0212] S5. Use the self-distilled mixed dataset for training the student model in stage i+k.
[0213] S501: The self-distilled mixed dataset D obtained in the previous steps is further processed into a batch data format suitable for model training, and used as a batch data in the (i+k)th stage for training. The above process can be regarded as a process of knowledge transfer and generalization from the previous teacher model to the developing student model.
[0214] The knowledge transfer interval k is dynamically adjusted based on the training progress. In the early stages of pre-training a large language model, the learning rate is high, the loss curve decreases rapidly, and the model learns quickly. Knowledge should therefore be transferred frequently, that is, the knowledge transfer interval k should be set relatively small. As training progresses, the model's learning gradually slows down. At this point, the knowledge transfer interval k should be increased to adapt to the model's learning process, improve training efficiency, and enhance the robustness of the pre-trained model.
[0215] In some embodiments, an adaptive selection scheme for the knowledge transfer interval k is provided:
[0216] Where K and c are hyperparameters. K represents the maximum knowledge transfer interval, and c controls how quickly k grows with the training stage i and approaches K, with c > 0. ⌈·⌉ is the ceiling function, which returns the smallest integer greater than or equal to its argument.
[0217] S6. After each training phase is completed, execute steps S4 and S5.
[0218] S7. Iterate through step S4 repeatedly, continuously adjusting and optimizing the parameters of the large language model until the model loss converges or the preset number of training rounds is reached, to obtain the pre-trained large language model.
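The interplay of steps S4 to S7 amounts to routing the dataset synthesized after stage i to stage i+k. The bookkeeping can be sketched as below; `schedule_distilled_data` and the `interval` callable are hypothetical names, and dropping data whose target stage lies beyond the last stage is an assumption the text does not address.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def schedule_distilled_data(n_stages: int, interval: Callable[[int], int]) -> Dict[int, List[int]]:
    """Map each student stage to the teacher stages whose distilled data it receives.

    interval(i) returns the knowledge transfer interval k for teacher stage i.
    """
    incoming: Dict[int, List[int]] = defaultdict(list)
    for i in range(1, n_stages + 1):
        target = i + interval(i)
        if target <= n_stages:  # data aimed past the final stage is never consumed
            incoming[target].append(i)
    return dict(incoming)
```

With a constant interval of 1, every stage after the first receives data from its immediate predecessor; with a growing interval, later stages receive data from increasingly distant teachers.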
[0219] In summary, the technical features and corresponding beneficial effects of this embodiment include:
[0220] 1. This embodiment proposes a self-distillation-based pre-training method for large language models, promoting self-improvement through a distillation iterative mechanism during the model's growth. During pre-training, the model learns not only from new training data but also from real data previously learned by the teacher models and from synthetic data generated by the teacher models under the guidance of real data. This self-distillation approach, combined with self-supervised pre-training, improves training stability, simplifies model compression and knowledge transfer, and enables better generalization to downstream tasks, thus enhancing the reliability and generalization ability of the pre-trained model.
[0221] 2. This embodiment proposes a novel method for synthesizing distilled mixed data. First, a cutoff point is selected; the text before the cutoff point remains unchanged, while the text after the cutoff point is resynthesized using a teacher model. Compared to synthesizing complete data entirely with a teacher model, this method uses real-world data to provide positive guidance during generation, mitigating hallucination and error accumulation in the synthesized data and reducing the distribution shift relative to real-world data. Furthermore, by dynamically adjusting the proportion of synthesized data in the mixed dataset at different training stages, overfitting to the feature distribution of a specific earlier period is avoided. At the same time, the model shows stronger adaptability and generalization when facing data drift in practical applications.
[0222] 3. This embodiment proposes a dynamic knowledge transfer interval selection strategy. The distilled mixed dataset synthesized by the teacher model will be adaptively transferred to the training of the student model in subsequent stages according to the model training progress. In the initial stage of pre-training, the knowledge growth rate is fast, and the knowledge transfer interval is set to be small; as training progresses, the model's learning rate gradually slows down, and the knowledge transfer interval is increased accordingly. This dynamic adjustment strategy can more effectively adapt to the model's growth trend and enhance the robustness of the pre-trained model.
[0223] Referring to Figure 9, this application embodiment also provides a large language model pre-training device, which can implement the above-described large language model pre-training method. The device includes:
[0224] The data processing unit 901 is used to acquire multiple types of text data and preprocess the text data to obtain an initial pre-training dataset.
[0225] Model building unit 902 is used to build generative large language models based on the decoder architecture of the Transformer model;
[0226] The stage training unit 903 is used to train the generative large language model in multiple stages based on the initial pre-training dataset.
[0227] Dataset synthesis unit 904 is used to synthesize a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset; wherein the generative large language model trained at each stage serves as the teacher model.
[0228] Distillation training unit 905 is used to train a student model using the self-distillation mixed dataset; wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer;
[0229] The loop training unit 906 is used to return to the step of synthesizing the self-distilled mixed dataset based on the teacher model and the initial pre-training dataset until the training of a preset number of stages is completed, and the student model obtained by the final training is used as the pre-trained generative large language model.
[0230] It is understood that the content of the above method embodiments is applicable to the present device embodiments. The specific functions implemented by the present device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0231] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned large language model pre-training method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.
[0232] It is understood that the content of the above method embodiments is applicable to this device embodiment. The specific functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0233] Please refer to Figure 10, which illustrates the hardware structure of an electronic device according to another embodiment. The electronic device includes:
[0234] The processor 1001 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0235] The memory 1002 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 1002 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1002 and is called and executed by the processor 1001 to implement a large language model pre-training method according to an embodiment of this application.
[0236] Input / output interface 1003 is used to implement information input and output;
[0237] The communication interface 1004 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0238] Bus 1005 transmits information between various components of the device (e.g., processor 1001, memory 1002, input / output interface 1003, and communication interface 1004);
[0239] The processor 1001, memory 1002, input / output interface 1003 and communication interface 1004 are connected to each other within the device via bus 1005.
[0240] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described large language model pre-training method.
[0241] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0242] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0243] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0244] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
[0245] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0246] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0247] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0248] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" describes the relationship between related objects and indicates that three relationships can exist. For example, "A and / or B" covers three cases: only A exists, only B exists, and both A and B exist, where A and B can each be singular or plural. The character " / " generally indicates that the objects before and after it are in an "or" relationship. "At least one (item) of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c can be single or multiple.
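As an illustration only, the seven cases listed for "at least one (item) of a, b, or c" are the non-empty combinations of the three items. The short Python sketch below enumerates them; the function name `at_least_one_of` is hypothetical and is not part of this application.

```python
from itertools import combinations

def at_least_one_of(items):
    """Return every non-empty combination of the given items, smallest first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]

# The three single items, the three pairs, and the full triple.
cases = at_least_one_of(["a", "b", "c"])
```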
[0249] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of the units described above is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
[0250] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0251] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0252] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0253] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for pre-training a large language model, the method comprising the following steps:
acquiring multiple types of text data and preprocessing the text data to obtain an initial pre-training dataset;
building a generative large language model using a decoder architecture based on the Transformer model;
training the generative large language model in multiple stages based on the initial pre-training dataset;
synthesizing a self-distilled mixed dataset based on a teacher model and the initial pre-training dataset, wherein the generative large language model trained at each stage serves as the teacher model;
training a student model using the self-distilled mixed dataset, wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer; and
returning to the step of synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset until a preset number of training stages has been completed, and using the student model obtained from the final training as the pre-trained generative large language model.
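The stage loop of claim 1 can be sketched in Python as follows. This is a minimal scheduling sketch under stated assumptions, not the applicant's implementation: k is held constant, and `run_pretraining`, `train_stage`, `synthesize`, and the toy callables are hypothetical names introduced only for illustration. How the student stage combines the mixed dataset with its regular batches is not fixed by the claim, so the sketch simply passes the mixed dataset to the stage that consumes it.

```python
def run_pretraining(num_stages, k, train_stage, synthesize):
    """Run the stage loop; return the final model and a (stage, teacher_stage) log."""
    pending = {}  # student stage -> mixed dataset produced by the teacher k stages earlier
    model, log = None, []
    for stage in range(1, num_stages + 1):
        mixed = pending.pop(stage, None)
        model = train_stage(model, stage, mixed)  # student step when `mixed` is present
        log.append((stage, stage - k if mixed is not None else None))
        if stage + k <= num_stages:
            # The checkpoint just trained acts as teacher for the model k stages later.
            pending[stage + k] = synthesize(model, stage)
    return model, log

# Toy stand-ins (assumptions) so the schedule can be run without a real model.
def toy_train(model, stage, mixed):
    return {"stage": stage, "used_teacher_data": mixed is not None}

def toy_synthesize(model, stage):
    return [f"synthetic text from teacher {stage}"]

final_model, log = run_pretraining(num_stages=5, k=2,
                                   train_stage=toy_train, synthesize=toy_synthesize)
```

With five stages and k = 2, stages 3, 4, and 5 are trained as students of the checkpoints saved after stages 1, 2, and 3 respectively, and the model returned after stage 5 is the pre-trained model.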
2. The large language model pre-training method of claim 1, wherein building the generative large language model using a decoder architecture based on the Transformer model comprises:
performing a construction step on the causal decoder architecture of the Transformer model to build the generative large language model, the construction step comprising:
normalizing the causal decoder architecture of the Transformer model using the root mean square normalization (RMSNorm) algorithm;
setting a grouped-query attention mechanism for the causal decoder architecture of the Transformer model;
setting the activation function of the feedforward neural network layer in the causal decoder architecture of the Transformer model to the SwiGLU function; and
setting the position encoding of the causal decoder architecture of the Transformer model to rotary position embedding (RoPE), a relative position encoding.
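Three of the components named in claim 2 can be sketched in plain Python on single vectors. This is an illustrative sketch only: the function names, the epsilon value, the adjacent-pair rotation convention, and the base of 10000 are assumptions drawn from common practice, not from this application, and grouped-query attention is omitted for brevity.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(z):
    """SiLU (swish) activation used inside SwiGLU."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward layer with SwiGLU: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def rope(x, position, base=10000.0):
    """Rotate each adjacent pair of channels by an angle proportional to the position."""
    out = []
    for i in range(0, len(x), 2):
        theta = position * base ** (-i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))
```

Because each channel pair is rotated by an angle proportional to the position, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is why RoPE acts as a relative position encoding.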
3. The large language model pre-training method of claim 1, wherein training the generative large language model in multiple stages based on the initial pre-training dataset comprises:
determining the mixing ratio of the different types of text data in the initial pre-training dataset at each stage;
sampling from the initial pre-training dataset according to the mixing ratio to obtain a proportioned mixed pre-training dataset;
concatenating the texts in the proportioned mixed pre-training dataset and padding them to a set length to obtain multiple texts of equal length as training samples;
randomly dividing the training samples into multiple batches of data;
randomly initializing the weight parameters of the embedding layer, decoder layer, and normalization layer of the generative large language model; and
training the generative large language model in N stages based on the batch data, wherein each stage of training uses at least B batches of data, and N and B are positive integers.
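The data preparation steps of claim 3 (proportioned sampling, concatenation and padding to a set length, and random batching) might look like the following Python sketch. It assumes that texts are already tokenized into integer IDs and that sampling is done with replacement; the function names and the `sep_id` and `pad_id` parameters are hypothetical and are not taken from this application.

```python
import random

def sample_mixture(corpus_by_type, ratios, total, rng):
    """Draw `total` documents, choosing each document's type according to `ratios`."""
    types = list(ratios)
    weights = [ratios[t] for t in types]
    picked_types = rng.choices(types, weights=weights, k=total)
    return [rng.choice(corpus_by_type[t]) for t in picked_types]

def pack_to_length(token_lists, length, sep_id, pad_id):
    """Concatenate token lists with a separator, cut into fixed-length samples, pad the last."""
    stream = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(sep_id)
    samples = [stream[i:i + length] for i in range(0, len(stream), length)]
    if samples:
        samples[-1] = samples[-1] + [pad_id] * (length - len(samples[-1]))
    return samples

def make_batches(samples, batch_size, rng):
    """Shuffle a copy of the samples and split it into batches."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

# Two short documents packed into samples of length 4 (separator 0, padding -1).
packed = pack_to_length([[1, 2, 3], [4, 5]], length=4, sep_id=0, pad_id=-1)
```

Changing the `ratios` dictionary from one stage to the next is one way to realize a stage-dependent mixing ratio.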
4. The large language model pre-training method of claim 3, wherein training the generative large language model in N stages based on the batch data comprises, for each stage of training:
inputting the word sequence of each batch of data into the generative large language model and performing forward propagation to obtain the probability distribution of the predicted target word;
calculating, using a loss function, the loss value between the predicted distribution of the generative large language model and the ground-truth labels;
calculating the gradient of the loss function with respect to the parameters of the generative large language model using the backpropagation algorithm, the gradient being propagated layer by layer from the output layer of the generative large language model back to its input layer; and
updating the parameters of the generative large language model using an optimization algorithm based on the gradient, with the training objective of minimizing the loss value, wherein the parameters are updated after each batch of data has been processed.
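The per-batch procedure of claim 4 (forward propagation, loss calculation, gradient calculation, parameter update) can be illustrated with a deliberately tiny stand-in model. The sketch below assumes a single table of logits indexed by the previous token, a cross-entropy loss, and plain gradient descent; these choices, the function names, and the learning rate are assumptions for illustration, whereas the application itself concerns a Transformer decoder whose gradient passes through many layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_on_batch(weights, batch, lr=0.5):
    """One parameter update on a batch of (context_token, target_token) pairs.

    `weights[c]` holds the logits predicted after context token c.
    Returns the mean cross-entropy loss measured before the update.
    """
    grads = [[0.0] * len(row) for row in weights]
    loss = 0.0
    for context, target in batch:
        probs = softmax(weights[context])            # forward propagation
        loss -= math.log(probs[target])              # cross-entropy against the label
        for j, p in enumerate(probs):                # d(loss)/d(logit_j) = p_j - 1[j == target]
            grads[context][j] += p - (1.0 if j == target else 0.0)
    for c, row in enumerate(grads):                  # one update after the whole batch
        for j, g in enumerate(row):
            weights[c][j] -= lr * g / len(batch)
    return loss / len(batch)

# A vocabulary of three tokens with all logits starting at zero.
weights = [[0.0] * 3 for _ in range(3)]
batch = [(0, 1), (1, 2), (2, 0)]
first_loss = train_on_batch(weights, batch)
second_loss = train_on_batch(weights, batch)
```

Starting from uniform predictions, the first loss equals ln 3, and repeating the update on the same batch lowers the loss, which is the behavior the training objective of minimizing the loss value calls for.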
5. The large language model pre-training method of claim 1, wherein synthesizing the self-distilled mixed dataset based on the teacher model and the initial pre-training dataset comprises:
initializing an empty data pool;
during the training of each stage, sampling training samples from each batch of data of that stage with a first probability, and adding the sampled training samples to the data pool;
after the training of each stage, saving the current parameters of the generative large language model to obtain the teacher model;
after the training of each stage, randomly sampling candidate synthetic samples from the data pool, wherein the data volume of the candidate synthetic samples is the same as that of a batch of data;
splitting the text of each training sample in the candidate synthetic samples at a delimiter to obtain multiple pieces of original text data;
selecting data to be synthesized from the pieces of original text data with a second probability;
selecting a character or word of each piece of data to be synthesized as a cutoff point with a third probability, wherein the third probability follows a normal distribution;
using the cutoff point and the text before it as query text, and inputting the query text into the teacher model of the current stage;
using the teacher model to output the predicted text corresponding to the query text;
concatenating the query text and the predicted text to obtain synthesized text; and
combining the synthesized text with the remaining original text data that were not selected as data to be synthesized to form the self-distilled mixed dataset.
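The synthesis procedure of claim 5 can be sketched in Python as follows. Several details are assumptions made only for illustration: text is split into words on whitespace, the cutoff position is drawn from a normal distribution centred on the middle of the text with a standard deviation of one sixth of its length, and `teacher_generate` is a hypothetical callable standing in for generation by the saved teacher checkpoint.

```python
import random

def add_to_pool(pool, batch, first_prob, rng):
    """During a stage, keep each training sample of a batch with probability first_prob."""
    pool.extend(sample for sample in batch if rng.random() < first_prob)

def draw_candidates(pool, batch_size, rng):
    """After a stage, draw as many candidate samples as one batch holds."""
    return rng.sample(pool, min(batch_size, len(pool)))

def synthesize_mixed_dataset(candidates, teacher_generate, second_prob, rng, delimiter="\n"):
    """Replace the tail of some text pieces with teacher output; keep the rest unchanged."""
    mixed = []
    for sample in candidates:
        for piece in sample.split(delimiter):
            words = piece.split()
            if len(words) < 2 or rng.random() >= second_prob:
                mixed.append(piece)                      # stays as original text
                continue
            # Cutoff drawn from a normal distribution (mean and spread are assumed values).
            raw = rng.gauss(len(words) / 2, len(words) / 6)
            cut = min(max(int(round(raw)), 1), len(words) - 1)
            query = " ".join(words[:cut])                # cutoff word and everything before it
            mixed.append(query + " " + teacher_generate(query))
    return mixed

# Toy teacher (assumption) that always returns the same continuation.
rng = random.Random(0)
pool = []
add_to_pool(pool, ["a b c\nd e"], first_prob=1.0, rng=rng)
candidates = draw_candidates(pool, batch_size=4, rng=rng)
mixed = synthesize_mixed_dataset(candidates, lambda query: "<gen>", second_prob=1.0, rng=rng)
```

Setting `second_prob` to 0 leaves every piece as original text, and setting it to 1 synthesizes every piece that has at least two words; values in between yield the mixture of original and teacher-generated text that the claim describes.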
6. The large language model pre-training method of claim 1, wherein training the student model using the self-distilled mixed dataset comprises:
calculating a knowledge transfer interval, wherein k denotes the knowledge transfer interval;
selecting the generative large language model that is k stages after the teacher model as the student model; and
training the student model using the self-distilled mixed dataset.
7. The large language model pre-training method of claim 6, wherein the calculation formula of the knowledge transfer interval includes: [formula not reproduced in the source text]; wherein K and c are hyperparameters; K represents the maximum knowledge gap; c controls the growth rate and convergence rate of k as the training stage i increases, with c > 0; ⌊·⌋ is the floor function; and f(i) represents the intermediate function.
8. The large language model pre-training method according to any one of claims 1 to 7, wherein the method further comprises:
training the student model using a text dataset of a target type to obtain a target large language model; and
inputting text of the target type into the target large language model, and using the target large language model to output the prediction result corresponding to that text.
9. A large language model pre-training device, the device comprising:
a data processing unit, configured to acquire multiple types of text data and preprocess the text data to obtain an initial pre-training dataset;
a model building unit, configured to build a generative large language model using a decoder architecture based on the Transformer model;
a stage training unit, configured to train the generative large language model in multiple stages based on the initial pre-training dataset;
a dataset synthesis unit, configured to synthesize a self-distilled mixed dataset based on a teacher model and the initial pre-training dataset, wherein the generative large language model trained at each stage serves as the teacher model;
a distillation training unit, configured to train a student model using the self-distilled mixed dataset, wherein the student model is the generative large language model k stages after the teacher model, and k is a positive integer; and
an iterative training unit, configured to return to the step of synthesizing a self-distilled mixed dataset based on the teacher model and the initial pre-training dataset until a preset number of training stages has been completed, and to use the student model obtained from the final training as the pre-trained generative large language model.
10. An electronic device comprising a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement a large language model pre-training method as described in any one of claims 1 to 8.
11. A computer-readable storage medium storing a computer program that, when executed by a processor, implements a large language model pre-training method as described in any one of claims 1 to 8.