Model training method, device, apparatus, storage medium, and program product

By constructing an auxiliary model aligned with the output distribution of the target model in a specific vertical domain, the problem of decreased accuracy of the draft model after fine-tuning in the financial and medical fields is solved, and efficient inference acceleration of speculative sampling is achieved.

CN122242622A · Pending · Publication Date: 2026-06-19 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date
2026-05-20
Publication Date
2026-06-19

Abstract

This specification provides a model training method, device, apparatus, storage medium, and program product. A target model, fine-tuned for a specific vertical domain, corresponds to a lightweight initial model. Based on the target model's system prompts, a large language model is used to reverse-engineer model input data that is adapted to the target model and relevant to the domain. The target model then infers from this input data to obtain output data. Training samples are constructed based on the target model's input and output data. Labeling results for the training samples are built based on the hidden state data during the inference process. Supervised training of the initial model is performed using the training samples and their labeling results to obtain an auxiliary model highly aligned with the target model's output distribution. This method can construct accurate training data and labeling results highly adapted to a specific vertical domain. The auxiliary model trained based on this data exhibits high predictive accuracy, thereby enhancing the acceleration effect of speculative sampling inference.

Description

Technical Field

[0001] This specification relates to the field of artificial intelligence technology, and in particular to a model training method, device, apparatus, storage medium, and program product.

Background Technology

[0002] Speculative decoding is a model inference acceleration technique. Its principle is as follows: a lightweight draft model quickly predicts candidate tokens corresponding to the model input, and the target model then performs parallel validation on these candidate tokens. During validation, tokens accepted by the target model are retained. If an unaccepted token is encountered, the target model starts from that position and autoregressively generates subsequent tokens. Finally, the accepted tokens and the autoregressively generated tokens are concatenated as the output. The better the predictive ability of the draft model and the fewer autoregressions the target model performs, the better the inference acceleration effect.
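
The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this specification: it uses greedy (argmax) acceptance, the names `speculative_decode`, `toy_draft`, and `toy_target` are hypothetical, and the target's check is written as a loop for readability, whereas a real system verifies all candidates in one batched forward pass. On a rejection, this sketch keeps the target's own token and resumes drafting from there.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_decode(
    prompt: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    num_draft: int,
    max_new_tokens: int,
) -> List[Token]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes num_draft candidate tokens.
        candidates: List[Token] = []
        for _ in range(num_draft):
            candidates.append(draft_next(out + candidates))

        # 2. The target model checks each candidate position in order.
        for i, cand in enumerate(candidates):
            expected = target_next(out + candidates[:i])
            if expected != cand:
                # 3. First rejection: keep the accepted prefix plus the
                #    target's own token, then go back to drafting.
                out.extend(candidates[:i])
                out.append(expected)
                break
        else:
            out.extend(candidates)  # every candidate was accepted
    return out[: len(prompt) + max_new_tokens]


# Toy models (hypothetical): the target counts upward modulo 10; the draft
# agrees except after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = speculative_decode([1], toy_draft, toy_target, num_draft=3, max_new_tokens=6)
```

With greedy acceptance, the result is identical to what the target model would produce on its own; the draft model only changes how many target passes are needed.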

[0003] Currently, a native development framework has been proposed for speculative sampling techniques. Within this framework, both a target model and a draft model are developed simultaneously, and both are trained concurrently using a general dataset. However, when the target model is fine-tuned for specific domains such as finance and healthcare, its output distribution changes. In this case, the draft model trained on general data struggles to accurately match the domain-specific output distribution, leading to decreased prediction accuracy and limited inference acceleration.

Summary of the Invention

[0004] This specification provides a model training method, device, apparatus, storage medium, and program product to improve the prediction accuracy of auxiliary models in speculative sampling and enhance inference acceleration.

[0005] This specification provides a model training method, comprising: determining a target model, which is a first large language model fine-tuned in a specific vertical domain and associated with an initial model, the parameter size of the initial model being smaller than that of the target model; based on system prompt words supported by the target model, performing reverse reasoning in the specific vertical domain using a second large language model to obtain multiple model input data adapted to the target model; guided by the system prompt words, performing model reasoning using the target model for any model input data to obtain model output data, and collecting target hidden state data generated by the target model during the model reasoning process; constructing training samples based on at least some model input data and their corresponding model output data, and constructing annotation results of the training samples based on the corresponding target hidden state data; and performing supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model, the auxiliary model being used to generate multiple candidate lexical units for the target model during speculative sampling.

[0006] This specification also provides a model training device, comprising: a determination module, a first inference module, a second inference module, a construction module, and a training module; the determination module is used to determine a target model, which is a first large language model fine-tuned in a specific vertical domain and associated with an initial model, the parameter scale of the initial model being smaller than that of the target model; the first inference module is used to perform reverse inference in the specific vertical domain using a second large language model based on system prompt words supported by the target model, to obtain multiple model input data adapted to the target model; the second inference module is used to perform model inference using the target model for any model input data under the guidance of system prompt words, to obtain model output data, and to collect target hidden state data generated by the target model during the model inference process; the construction module is used to construct training samples based on at least some model input data and their corresponding model output data, and to construct the annotation results of the training samples based on the corresponding target hidden state data; the training module is used to perform supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model, the auxiliary model being used to generate multiple candidate lexical units for the target model during speculative sampling.

[0007] This specification also provides an electronic device, including: a memory and a processor; the memory is used to store one or more computer instructions; the processor is used to execute one or more computer instructions to perform the steps in the method provided in this specification.

[0008] This specification also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can implement the steps of the method provided in this specification.

[0009] This specification also provides a computer program product, including: a computer program / instructions, which, when executed by a processor, can implement the steps of the method provided in this specification.

[0010] In the embodiments of this specification, the target model, fine-tuned for a specific vertical domain, corresponds to a lightweight initial model. Based on the system prompts of the target model, a large language model is used to reverse-engineer model input data that is adapted to the target model and relevant to the domain. The target model then infers from this input data to obtain model output data. Training samples are constructed based on the target model's input and output data. Labeling results for the training samples are constructed based on the hidden state data during the inference process. The lightweight initial model is then subjected to supervised training using the training samples and their labeling results to obtain an auxiliary model that is highly aligned with the output distribution of the target model. The method in the embodiments of this application can construct accurate training data and labeling results that are highly adapted to the vertical domain. Training the auxiliary model based on this data can improve the prediction accuracy of the auxiliary model, thereby enhancing the inference acceleration effect of speculative sampling.

Description of the Drawings

[0011] The accompanying drawings, which are included to provide a further understanding of this specification and form part of this specification, illustrate exemplary embodiments and are used to explain this specification, but do not constitute an undue limitation thereof. In the drawings: Figure 1 is a schematic flowchart of a model training method provided by an exemplary embodiment of this specification.

[0012] Figure 2 is a flowchart of another model training method provided by an embodiment of this specification.

[0013] Figure 3 is a schematic diagram of a model training apparatus provided by an exemplary embodiment of this specification.

[0014] Figure 4 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of this specification.

Detailed Implementation

[0015] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification.

[0016] It should be noted that, in the cases involving user information in this application's embodiments, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse. The various models involved in this application (including but not limited to language models or large models) comply with relevant laws and standards.

[0017] In the speculative sampling domain, the target model is a large language model fine-tuned in a specific vertical domain. The target model is associated with an initial model, whose parameter size is smaller than that of the target model. The method of obtaining the initial model is not limited. The initial model can be a natively trained model developed using a native speculative sampling framework, without pre-training or fine-tuning in a specific vertical domain. Alternatively, the initial model can be a non-natively trained model developed using a non-native framework, also without pre-training or fine-tuning in a specific vertical domain. Regardless of the method of acquisition, the initial model is a model that has not yet been trained or fine-tuned in a specific vertical domain, and its output distribution is not yet aligned with the target model. Therefore, it is necessary to pre-train or fine-tune the initial model in a specific vertical domain to align the output distribution of the trained auxiliary model with that of the target model, thereby improving the inference acceleration effect in speculative sampling.

[0018] In the embodiments of this specification, the target model, fine-tuned for a specific vertical domain, corresponds to a lightweight initial model. Based on the system prompts of the target model, model input data that is adapted to the target model and relevant to the domain is generated through reverse engineering using a large language model. The target model then infers from this input data to obtain model output data. Training samples are constructed based on the target model's input and output data. Labeling results for the training samples are constructed based on the hidden state data during the inference process. The lightweight initial model is then subjected to supervised training using the training samples and their labeling results to obtain an auxiliary model that is highly aligned with the output distribution of the target model. The method in the embodiments of this application can construct accurate training data and labeling results that are highly adapted to a specific vertical domain. Training the auxiliary model based on this data can improve the prediction accuracy of the auxiliary model, thereby enhancing the inference acceleration effect of speculative sampling.

[0019] The technical solutions provided in the various embodiments of this specification are described in detail below with reference to the accompanying drawings.

[0020] Figure 1 is a flowchart of a model training method provided by an exemplary embodiment of this specification. As shown in Figure 1, the method includes: 101. Determine the target model. The target model is a first large language model fine-tuned in a specific vertical domain, and it is associated with an initial model. The parameter size of the initial model is smaller than that of the target model.

[0021] 102. Based on the system prompt words supported by the target model, use a second large language model to perform reverse reasoning in the specific vertical domain to obtain multiple model input data that are adapted to the target model.

[0022] 103. Guided by system prompts, use the target model to perform model inference on any model input data to obtain model output data, and collect target hidden state data generated by the target model during the model inference process.

[0023] 104. Construct training samples based on at least a portion of the model input data and their corresponding model output data, and construct the annotation results of the training samples based on the corresponding target hidden state data.

[0024] 105. Supervised training of the initial model is performed using training samples and their annotation results to obtain an auxiliary model. The auxiliary model is used to generate multiple candidate lexical units for the target model during the speculative sampling process.

[0025] In the embodiments described in this specification, the target model is a first large language model fine-tuned for a specific vertical domain. This specific vertical domain may include, but is not limited to, medical, legal, or financial fields. The target model is a high-precision model with a large parameter scale used to generate high-quality text. The first large language model is a large-scale trained natural language processing (NLP) model. Large language models are typically built on deep learning techniques and trained on large-scale training datasets, thus exhibiting powerful performance in processing natural language tasks. Given a text (i.e., context), the large language model attempts to predict the most likely next word. The number of parameters in the large language model is greater than a set threshold, typically in the millions or billions. During training, these parameters are continuously adjusted and optimized based on the difference between the large language model's predictions and actual results to improve its prediction accuracy. In some embodiments, the large language model typically employs an advanced neural network architecture, such as the Transformer architecture, to build its model structure. This architecture enables the large language model to capture complex patterns in the text and handle long-distance dependencies. After thorough training, the large language model possesses powerful generative capabilities, capable of producing coherent and context-appropriate text content based on given prompts.

[0026] The internal structure of the target model is not limited. Below is one example of an internal structure for a target model, which includes multiple network layers, where each network layer refers to a hidden layer located between the input and output layers. In addition, the target model may also include input and output layers.

[0027] The input layer transforms the model input data into continuous high-dimensional semantic vectors and generates positional information for each word in the input data. For example, the input layer may include a word embedding network and a positional encoding network. The word embedding network maps each word in the model input data to a dense vector of fixed dimension as its high-dimensional semantic vector; the positional encoding network adds a temporal feature vector to the word at each position.

[0028] The network layers perform deep iterative feature extraction and logical reasoning on the high-dimensional semantic vectors and temporal feature vectors. They can capture contextual dependencies, integrate semantic knowledge, perform complex logical deductions, and gradually transform shallow syntactic information into deep abstract representations. For example, in a Transformer architecture, the multiple network layers can be implemented as multiple decoder modules. Each decoder module includes a self-attention network for global context aggregation and a feed-forward network (FFN) for non-linear knowledge transformation. The self-attention network and the feed-forward network are connected through residual connections and normalization layers to ensure training stability and smooth information flow.

[0029] The output layer maps the high-dimensional abstract latent states output by the network layers back to a discrete vocabulary space, calculates the probability distribution of candidate lexical units, and thus generates the final prediction result. For example, the output layer includes a linear transformation network and a final normalization network. The linear transformation network maps the latent state data of deep semantics to a discrete vocabulary space, and the final normalization network generates candidate lexical units.
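
As a rough illustration of the output layer described above, the sketch below projects one hidden state onto a small vocabulary and normalizes the result. It assumes that the final normalization is a softmax, which the text does not state explicitly; the function name `output_layer` and all numeric values are made up for the example.

```python
import math
from typing import List


def output_layer(hidden: List[float], weight: List[List[float]]) -> List[float]:
    """Project a hidden state to vocabulary logits, then apply softmax."""
    # Linear transformation: one logit per vocabulary entry (one row of weight).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    # Softmax, shifted by the largest logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: hidden size 2, vocabulary size 3.
probs = output_layer([1.0, 0.0], [[2.0, 0.0], [1.0, 0.0], [0.0, 5.0]])
best_token = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` is the probability distribution over candidate lexical units, and `best_token` is the index a greedy decoder would pick.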

[0030] The target model is associated with an initial model, whose parameter size is smaller than that of the target model. The initial model is a model that has not yet been trained or fine-tuned for the specific vertical domain, and its output distribution is not yet aligned with that of the target model. In application scenarios employing speculative sampling, the method shown in Figure 1 can be used to train the initial model to obtain an auxiliary model corresponding to the target model; the auxiliary model can also be called a draft model. In this way, during speculative sampling, the auxiliary model can quickly generate multiple candidate tokens based on the actual input data of the target model. The target model then validates these candidate tokens to determine which of them are accepted as output tokens. The accepted tokens no longer need to be produced by autoregressive decoding, which accelerates the inference of the target model.

[0031] The internal structure of the initial model is not limited. In one example, the initial model and the target model use a homologous architecture, with the initial model having a smaller network depth than the target model. The target model achieves deep logical reasoning and a high-precision output distribution by stacking many network layers; for example, the target model includes one input layer, 61 network layers, and one output layer. The initial model approximates the deep reasoning of the target model with fewer layers (such as 1, 5, or 10 layers) to improve inference speed; for example, the initial model includes one input layer, one network layer, and one output layer. In another example, the initial model and the target model use non-homologous architectures but share the same vocabulary and tokenizer; the initial model employs a lightweight heterogeneous architecture. Such an initial model may include one input layer, 12 linear recurrent units, and one output layer. Its linear time complexity allows faster generation of each token, and it approximates the output distribution of the target model through a different computational method, achieving acceleration in scenarios with limited GPU memory bandwidth.

[0032] In the embodiments of this specification, the target model supports system prompts, which are used to guide the target model in generating model output data based on the input data. System prompts may include the target model's role, task objectives and scope, output constraints, behavioral guidelines, contextual information, and thought processes.
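
One way to picture the components of a system prompt listed above is as a small record that is flattened into text. The sketch below is illustrative only: the class `SystemPrompt`, its field names, and the rendered layout are assumptions for the example, not a format defined in this specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SystemPrompt:
    """Hypothetical container for the system prompt components."""

    role: str
    task_scope: str
    output_constraints: List[str] = field(default_factory=list)
    behavioral_guidelines: List[str] = field(default_factory=list)
    context: str = ""
    thought_process: str = ""

    def render(self) -> str:
        """Flatten the non-empty fields into one prompt string."""
        parts = [
            f"Role: {self.role}",
            f"Task objectives and scope: {self.task_scope}",
        ]
        if self.output_constraints:
            parts.append("Output constraints: " + "; ".join(self.output_constraints))
        if self.behavioral_guidelines:
            parts.append("Behavioral guidelines: " + "; ".join(self.behavioral_guidelines))
        if self.context:
            parts.append("Context: " + self.context)
        if self.thought_process:
            parts.append("Thought process: " + self.thought_process)
        return "\n".join(parts)


# Example values are invented for illustration.
example_prompt = SystemPrompt(
    role="Financial customer-service assistant",
    task_scope="Answer questions about account statements",
    output_constraints=["Reply in plain text", "Keep answers under 200 words"],
)
rendered = example_prompt.render()
```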

[0033] In speculative sampling or model fine-tuning scenarios, directly collecting model input data often faces the following problems: data scarcity in specific vertical domains, distribution mismatch, and privacy compliance issues. Therefore, it is possible to obtain system prompt words associated with the target model. Based on these prompt words, training samples and their labeled results can be constructed in the specific vertical domain of the target model to train the initial model. This will be introduced below.

[0034] In the embodiments of this specification, based on the system prompt words supported by the target model, a second large language model is used to perform reverse reasoning in a specific vertical domain to obtain multiple model input data that are adapted to the target model.

[0035] Forward reasoning, guided by system prompts, involves using the model's input data to derive its output. Backward reasoning, on the other hand, uses the target model's system prompts to construct input data that triggers the target model's reasoning. The second language model differs from the first, essentially acting as the "exam setter," tailoring high-quality, diverse input data as "test questions" for the target model. This second language model possesses different pre-training corpora and parameter distributions, allowing it to interpret system prompts from various perspectives and generate input data that the target model might not anticipate or rarely generate, thus covering more application scenarios and improving the robustness of the subsequent auxiliary model. For a detailed explanation of the internal structure of the second language model, please refer to the preceding description of the large language model; it will not be repeated here.
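
The "exam setter" idea above can be sketched as a function that wraps the target model's system prompt in an instruction for the second large language model and parses the reply into candidate inputs. Everything here is an assumption made for illustration: the instruction wording, the one-input-per-line reply format, and the `generator` callable, which stands in for whatever interface the second model actually exposes.

```python
from typing import Callable, List


def build_reverse_prompt(system_prompt: str, domain: str, count: int) -> str:
    """Compose the instruction given to the second (question-writing) model."""
    return (
        f"An assistant in the {domain} domain runs under this system prompt:\n"
        f"{system_prompt}\n\n"
        f"Write {count} diverse user inputs this assistant could receive, "
        "one per line."
    )


def reverse_generate_inputs(
    system_prompt: str,
    domain: str,
    count: int,
    generator: Callable[[str], str],
) -> List[str]:
    """Ask the second model for inputs, then clean and de-duplicate them."""
    raw = generator(build_reverse_prompt(system_prompt, domain, count))
    seen = set()
    inputs: List[str] = []
    for line in raw.splitlines():
        text = line.strip()
        if text and text not in seen:  # skip blank lines and repeats
            seen.add(text)
            inputs.append(text)
    return inputs[:count]


# Stand-in for the second model so the sketch runs without any service.
def fake_generator(prompt: str) -> str:
    return "What is an ETF?\n\nHow is APR calculated?\nWhat is an ETF?\nWhat is a bond?"


model_inputs = reverse_generate_inputs("You are a finance assistant.", "finance", 2, fake_generator)
```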

[0036] In the embodiments of this specification, after obtaining model input data suitable for the target model, the model input data can be sent to the target model. Guided by system prompts, the target model performs model inference for any model input data to obtain model output data corresponding to that model input data. In the embodiments of this specification, it is necessary not only to construct model input data and model output data suitable for the target model, but also to collect the target hidden state data generated by the target model during model inference. Step 1.3 in Figure 2 illustrates, by way of example, model inference by the target model and the collection of target hidden state data.

[0037] In this context, target hidden state data refers to the intermediate activation values or feature representations generated by the internal network layers of the target model during the model inference process. Target hidden state data encodes the target model's understanding of the model's input data, the intermediate states of logical reasoning, and the basis for predicting the probability distribution of the model's output data.

[0038] In the embodiments of this specification, training samples are constructed based on at least a portion of the model input data and their corresponding model output data, and the annotation results of the training samples are constructed based on the corresponding target hidden state data. Step 1.4 in Figure 2 illustrates, by way of example, the construction of training samples and their annotation results. When constructing training samples, all of the model input data or only a portion of it can be used; there is no limitation on this, as long as the number of training samples is sufficient to successfully train the auxiliary model.

[0039] For example, one piece of model input data and the model output data inferred from it by the target model can be used as a training sample, and the target hidden state data collected during that inference can be used as the annotation result of the training sample. Alternatively, a system prompt word, one piece of model input data, and the model output data inferred from that input data can be used as a training sample, with the target hidden state data collected during that inference again used as the annotation result. The initial model is trained under supervision using the training samples and their annotation results to obtain an auxiliary model. The target hidden state data represents the target model's integration, inference, and abstraction of contextual information, so using it as the supervision signal aligns the auxiliary model with the internal representation space of the target model. This can improve the accuracy of the trained auxiliary model in predicting candidate tokens under complex contexts, long logical chains, and system prompt word constraints. In the supervised training process, the initial model generates predicted hidden state data during model inference on any training sample, and the supervised loss is calculated based on the predicted hidden state data and the target hidden state data.
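
The text states that the supervised loss is computed from the predicted and target hidden state data but does not name a loss function. The sketch below assumes mean squared error purely for illustration; the function name `hidden_state_loss` and the list-of-vectors layout (one vector per position) are likewise hypothetical.

```python
from typing import List


def hidden_state_loss(
    predicted: List[List[float]],
    target: List[List[float]],
) -> float:
    """Mean squared error over all positions and hidden dimensions."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same positions")
    total = 0.0
    count = 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("hidden sizes differ")
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no hidden state values to compare")
    return total / count


# Toy values: one position, hidden size 2.
loss = hidden_state_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

Other distance measures (for example, cosine distance, or a combination with a token-level cross-entropy term) would fit the same description; the choice here is only one possibility.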

[0040] In the embodiments of this specification, the target model, fine-tuned for a specific vertical domain, corresponds to a lightweight initial model. Based on the system prompts of the target model, model input data that is adapted to the target model and relevant to the domain is generated through reverse engineering using a large language model. The target model then infers from this input data to obtain model output data. Training samples are constructed based on the target model's input and output data. Labeling results for the training samples are constructed based on the hidden state data during the inference process. The lightweight initial model is then subjected to supervised training using the training samples and their labeling results to obtain an auxiliary model that is highly aligned with the output distribution of the target model. The method in the embodiments of this application can construct accurate training data and labeling results that are highly adapted to a specific vertical domain. Training the auxiliary model based on this data can improve the prediction accuracy of the auxiliary model, thereby enhancing the inference acceleration effect of speculative sampling.

[0041] In one optional embodiment, the implementation of using the target model to perform model inference on the model input data under the guidance of system prompts to obtain model output data and collecting target hidden state data generated by the target model during the model inference process is not limited, and will be described exemplarily below.

[0042] In this embodiment, an acquisition module is mounted on a target network layer among the multiple network layers. The number and layer indices of the target network layers are not limited. For example, if the target model includes 61 network layers, the target network layers may include, but are not limited to, layer 2, layers 20 to 30, and layer 56. Optionally, one acquisition module is mounted on each target network layer and collects the target hidden state data of the layer on which it is mounted. Alternatively, multiple target network layers may share the same acquisition module, which collects the target hidden state data of all of those layers.

[0043] Based on this, one implementation of using the target model, guided by system prompts, to perform model inference on the model input data to obtain model output data, and of collecting the target hidden state data generated during the model inference process, includes: feeding the model input data into the target model; under the guidance of system prompts, the data passes sequentially through the multiple network layers of the target model for autoregressive decoding, ultimately yielding the model output data. Here, the autoregressive decoding process refers to performing multiple forward propagation calculations. Each forward propagation calculation starts from the first network layer and propagates sequentially through the multiple network layers. Each forward propagation calculation generates one token, and the input of each forward propagation calculation consists of the model input data and the tokens generated so far. The input of the forward propagation calculation may also include system prompts. During the autoregressive decoding process, when forward propagation reaches a target network layer, the acquisition module mounted on that layer collects the hidden state data output by the layer as target hidden state data. Each target network layer corresponds to one piece of target hidden state data; if there are multiple target network layers, there are multiple pieces of target hidden state data.
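
The collection mechanism above can be sketched with a toy layer stack and a shared collector. This is a simplified illustration, not the implementation of this specification: layers are plain Python callables, `HiddenStateCollector` and `forward` are hypothetical names, and a deep learning framework would normally attach the collector through its own hook mechanism.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


class HiddenStateCollector:
    """Records the output of selected layers on every forward pass."""

    def __init__(self, target_layers: List[int]) -> None:
        self.target_layers = set(target_layers)
        self.records: List[Dict[int, Vector]] = []  # one dict per forward pass

    def start_pass(self) -> None:
        self.records.append({})

    def observe(self, layer_index: int, hidden: Vector) -> None:
        if layer_index in self.target_layers:
            self.records[-1][layer_index] = list(hidden)  # copy the values


def forward(layers: List[Layer], hidden: Vector, collector: HiddenStateCollector) -> Vector:
    """One forward propagation through all layers (layers are 1-indexed)."""
    collector.start_pass()
    for index, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        collector.observe(index, hidden)
    return hidden


# Toy stack: three layers that each add 1.0; collect only layer 2.
toy_layers: List[Layer] = [lambda v: [x + 1.0 for x in v] for _ in range(3)]
collector = HiddenStateCollector(target_layers=[2])
final_hidden = forward(toy_layers, [0.0], collector)
```

In autoregressive decoding, `forward` would be called once per generated token, so `collector.records` ends up with one entry per forward propagation calculation.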

[0044] Optionally, the implementation method of constructing training samples based on at least a portion of the model input data and their corresponding model output data, and constructing the annotation results of the training samples based on the corresponding target hidden state data, is not limited. An example is provided below: autoregressive decoding includes N forward propagation calculations, each of which outputs a word. Therefore, the model output data includes N words output from the N forward propagation calculations, where N is a positive integer. Based on this, an implementation method of constructing training samples based on at least a portion of the model input data and their corresponding model output data, and constructing the annotation results of the training samples based on the corresponding target hidden state data, includes: for the first forward propagation calculation, using the model input data as the context input for this forward propagation calculation, performing the first forward propagation calculation through multiple network layers, and finally outputting the first word; for the first forward propagation calculation, using the context input data of this forward propagation calculation as the training sample, and using the target hidden state data collected from the target network layer during this forward propagation calculation as the annotation result of the training sample. For the second forward propagation computation, the model input data and the first word output from the first forward propagation computation are used together as context input. After passing through multiple network layers, the second forward propagation computation is performed, outputting the second word. 
For the second forward propagation computation, the context input of this forward propagation computation is used as the training sample, and the target hidden state data collected from the target network layer during this forward propagation computation is used as the annotation result. For the third forward propagation computation, the model input data and the first and second word outputs from the previous two forward propagation computations are used together as context input. After passing through multiple network layers, the third forward propagation computation is performed, outputting the third word. For the third forward propagation computation, the context input of this forward propagation computation is used as the training sample, and the target hidden state data collected from the target network layer during this forward propagation computation is used as the annotation result, and so on. Compared to the first forward propagation computation, the second and subsequent forward propagation computations are collectively referred to as subsequent forward propagation computations. Specifically, for subsequent forward propagation computations, the input data of the model and the word units output from previously performed forward propagation computations are used as context inputs. The context input data of the forward propagation computations are used as training samples, and the target hidden state data collected from the target network layer during the forward propagation computation is used as the labeled results of the training samples. 
Optionally, regardless of which forward propagation computation is being performed, the context input may also include the system prompt words. Specifically, by using the context input data of the forward propagation computation as training samples and the target hidden state data collected from the target network layer during the forward propagation computation as the labeled results of the training samples, the constructed samples and their labeled results inherit the thought chain and intrinsic feature distribution of the target model. This makes the auxiliary model trained on the training samples and their labeled results highly aligned with the output distribution of the target model, thereby improving the predictive ability of the auxiliary model and increasing the inference speed of speculative sampling.
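The sample construction described above can be sketched as follows. This is a minimal illustration, not the method's actual implementation: `build_samples` and `toy_forward` are hypothetical names, tokens are plain integers, and `toy_forward` is a stand-in for one forward propagation computation of the target model that returns the next word together with the hidden state collected from the target network layer.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
Hidden = List[float]


def build_samples(
    prompt: Sequence[Token],
    forward: Callable[[Sequence[Token]], Tuple[Token, Hidden]],
    num_steps: int,
) -> List[Tuple[List[Token], Hidden]]:
    """Run num_steps forward computations and pair each context with its hidden state."""
    context = list(prompt)
    samples = []
    for _ in range(num_steps):
        next_token, hidden = forward(context)
        # The context input is the training sample; the hidden state is its annotation.
        samples.append((list(context), hidden))
        # The word just emitted becomes part of the next context input.
        context.append(next_token)
    return samples


def toy_forward(context: Sequence[Token]) -> Tuple[Token, Hidden]:
    """Hypothetical stand-in for the target model; it is not a real network."""
    total = sum(context)
    return total % 7, [float(len(context)), float(total)]


samples = build_samples([3, 1, 4], toy_forward, num_steps=3)
for context, hidden in samples:
    print(context, hidden)
```

Each of the N forward computations contributes exactly one (training sample, annotation result) pair, and the context grows by one word per computation.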

[0045] In one optional embodiment, the implementation method of using training samples and their annotation results to perform supervised training on the initial model to obtain the auxiliary model is not limited. A specific implementation method based on dynamic learning rate scheduling is provided below. Specifically, the training samples are divided into multiple batches, with each batch corresponding to one training step. Each training step has a training step count (its ordinal position in the training process) and corresponds to one update of the model parameters. For any training step, a target supervised loss is constructed based on the training samples and their annotation results corresponding to the training step. The learning rate corresponding to the training step is determined based on the relationship between the training step count of the training step and a preset training step threshold. Based on the gradient information of the target supervised loss, the model parameters of the initial model are updated using the learning rate to complete this round of training.

[0046] The training dataset is massive (potentially millions of samples), making it impossible to load all samples into GPU memory for supervised training at once. Therefore, the dataset is divided into mini-batches. A batch can contain multiple training samples, such as 32, 64, 100, or 500 samples. The target supervised loss measures the difference between the initial model's current predictions and the labeled results (i.e., the previously collected target hidden state data). The learning rate represents the step size for updating the initial model's parameters. This learning rate is dynamically adjusted. The preset training step threshold is not fixed; for example, it could be 1/2, 1/3, or 2/5 of the total training steps.

[0047] One method for updating the model parameters of the initial model based on the gradient information of the target supervised loss with the learning rate is to use the backpropagation algorithm to calculate the gradient information of the supervised loss function with respect to each model parameter, and update the model parameters along the opposite direction of the gradient to complete the current training round.
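As a rough illustration of a single update, the sketch below applies one gradient step to a tiny linear predictor. The squared-error loss, the linear predictor, and the name `sgd_step` are assumptions made for the example; the text only states that the loss measures the difference between predictions and the labeled hidden state data.

```python
from typing import List, Tuple


def sgd_step(weights: List[float], inputs: List[float], target: float,
             lr: float) -> Tuple[List[float], float]:
    """One parameter update for a linear predictor under a squared-error loss."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = prediction - target
    loss = 0.5 * error ** 2
    # Gradient of the loss with respect to each weight (chain rule).
    grads = [error * x for x in inputs]
    # Move each weight against its gradient, scaled by the learning rate.
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, loss


w0 = [0.0, 0.0]
w1, loss0 = sgd_step(w0, [1.0, 2.0], target=1.0, lr=0.1)
_, loss1 = sgd_step(w1, [1.0, 2.0], target=1.0, lr=0.1)
print(w1, loss0, loss1)
```

The loss measured after the update is smaller than before it, which is the effect one training step is meant to have.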

[0048] Optionally, the implementation method for determining the learning rate corresponding to a training step based on the relationship between the training step count of the training step and a preset training step threshold is not limited. A specific example is provided below: based on the preset training step threshold, the supervised training process can be divided into two stages. In the first training stage, the training step count of each training step is less than the preset training step threshold. In the second training stage, the training step count of each training step is greater than or equal to the preset training step threshold and less than or equal to the total number of training steps. For any training step, if its training step count is less than the preset training step threshold, the training step is determined to be in the first training stage, and a first learning rate is determined based on the proportion of training progress in the first training stage. In the first training stage, the first learning rate corresponding to each training step is positively correlated with the training step count and increases from an initial value to a set learning rate threshold. If the training step count is greater than or equal to the preset training step threshold, the training step is determined to be in the second training stage, and a second learning rate is determined based on the proportion of training progress in the second training stage. In the second training stage, the second learning rate corresponding to each training step is negatively correlated with the training step count and decreases from the set learning rate threshold. Step 2.2 in Figure 2 illustrates the adjustment of the learning rate.

[0049] In the embodiments described in this specification, a progressive, phased training strategy is employed. In the initial training phase (i.e., the first training stage), a higher initial learning rate (e.g., 5e-5) is used, and the learning rate is gradually increased to allow the model to quickly learn the main distribution and macroscopic patterns of the data. As training progresses, in the second training phase, the learning rate is gradually reduced to a smaller final value (e.g., 5e-6). This "fast first, stable later" training approach helps the model converge quickly in the early stages and allows for fine-tuning in the later stages, preventing the model from oscillating around a relatively optimal solution and thus achieving better final performance.

[0050] In the first training phase, the implementation method for determining the first learning rate based on the percentage of training progress in the first training phase is not limited. For example, one way is to calculate the ratio of the training step count to the preset training step threshold as the percentage of training progress in the first training phase, and to determine the first learning rate based on this percentage and the set learning rate threshold. For example, $\eta_1 = \eta_{\max} \cdot t / T_{\mathrm{th}}$, where $\eta_1$ is the first learning rate, $\eta_{\max}$ is the set learning rate threshold, $t$ is the training step count, and $T_{\mathrm{th}}$ is the preset training step threshold; in this case, the first learning rate increases linearly. In another example, the first learning rate is a non-linear function of the ratio $t / T_{\mathrm{th}}$; in this case, the first learning rate grows non-linearly.

[0051] In the second training phase, the method for determining the second learning rate based on the percentage of training progress in the second training phase is not limited. For example, one method includes: generating the percentage of training progress in the second training phase based on the training step count, the preset training step threshold, and the total number of training steps; and determining the second learning rate using a decay coefficient and this percentage of training progress. For example, the percentage of training progress in the second training phase can be expressed as $p = (t - T_{\mathrm{th}}) / (T - T_{\mathrm{th}})$, where $T$ is the total number of training steps, $t$ is the training step count, and $T_{\mathrm{th}}$ is the preset training step threshold. The second learning rate can then be expressed as a function of $p$ and the decay coefficient, where the decay coefficient is used to perform a power operation on the decay result so as to adjust the "steepness" of the decay curve. For one range of the decay coefficient, the decay is faster in the later stages and smoother in the early stages, making it suitable for tasks that require rapid convergence. For the other range, the learning rate decays more slowly in the later stages and drops rapidly in the early stages, making it suitable for tasks that require long-term fine-tuning.
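The two-stage schedule can be sketched as below. The function name `learning_rate`, the linear warm-up from zero, and in particular the decay form `final_lr + (peak_lr - final_lr) * (1 - p) ** power` are assumptions chosen for illustration; the text only fixes that the rate rises toward the learning rate threshold in the first stage and falls, shaped by a power operation, in the second. The example values 5e-5 and 5e-6 are the ones mentioned in the description.

```python
def learning_rate(step: int, threshold_step: int, total_steps: int,
                  peak_lr: float = 5e-5, final_lr: float = 5e-6,
                  power: float = 2.0) -> float:
    """Two-stage schedule: linear increase, then power-shaped decay (assumed form)."""
    if step < threshold_step:
        # First stage: progress is the ratio of the step count to the threshold.
        progress = step / threshold_step
        return peak_lr * progress
    # Second stage: progress measured between the threshold and the total step count.
    span = total_steps - threshold_step
    progress = (step - threshold_step) / span if span > 0 else 1.0
    # The exponent controls the steepness of the decay curve.
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power


schedule = [learning_rate(s, threshold_step=100, total_steps=1000)
            for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

Under this assumed form, an exponent above 1 makes the rate drop quickly right after the threshold and flatten toward the end, while an exponent below 1 keeps the rate high for longer and drops it sharply near the end.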

[0052] In practical applications, for auxiliary models, different input lengths often correspond to different distributions of true output lengths. By statistically analyzing this pattern in advance, the auxiliary model can be guided to generate candidate word sequences of "more suitable lengths" during inference, which helps improve the acceptance rate of the target model. Based on this, in some optional embodiments of this specification, the second large language model is used to perform reverse inference in a specific vertical domain based on the system prompt words supported by the target model to obtain multiple model input data adapted to the target model. Guided by the system prompt words, the target model is used to perform model inference on any model input data to obtain model output data. Once the model input data and model output data suited to the target model have been constructed, training samples can be constructed using a portion of the model input data and its corresponding model output data. That is, the training samples include a portion of the model input data and its corresponding model output data. These training samples are used to perform supervised training on the initial model to obtain the auxiliary model. After the auxiliary model is obtained, multiple test samples can be constructed using at least a portion of the model input data outside the training samples. These test samples are then grouped according to their length, with different groups corresponding to different input length intervals; test samples within the same input length interval are placed in the same group. For any group, speculative sampling is performed using the auxiliary and target models based on the test samples within that group to obtain the corresponding output length interval, also known as the speculative window. The input and output length intervals corresponding to different groups are then associated with the auxiliary model; that is, the auxiliary model supports multiple input and output length intervals, with each input length interval associated with one output length interval. The number of input and output length intervals supported by the auxiliary model is the same as the number of groups. By analyzing the output length distribution patterns under different input length intervals, a better output length interval is dynamically adapted to the auxiliary model, ensuring that the length distribution of the candidate words generated by the auxiliary model is highly aligned with the statistical patterns of the actual output length, thereby improving the acceptance rate of speculative sampling. Step 2.3 in Figure 2 illustrates the process of optimizing the output length.

[0053] The method of grouping multiple test samples according to sample length is not limited. The number of groups can be 2 or 3, etc. For example, if there are 2 groups, it is a short input group and a long input group. Or, if there are 3 groups, it is a short input group, a medium input group, and a long input group. Different groups correspond to different input length ranges. For example, the input length range corresponding to the short input group can be 0 to 6k (k is 1,000), the input length range corresponding to the medium input group can be 6k to 16k, and the input length range corresponding to the long input group can be 16k to 32k.
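A minimal sketch of the three-group example follows. The names `group_name` and `group_by_length` are hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption, since the text does not say which group a boundary length such as exactly 6k belongs to.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence

# Exclusive upper bounds of the short, medium, and long input groups.
UPPER_BOUNDS = [6_000, 16_000, 32_000]
GROUP_NAMES = ["short", "medium", "long"]


def group_name(length: int) -> str:
    """Map an input length to its group, using half-open ranges [low, high)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    index = bisect_right(UPPER_BOUNDS, length)
    if index >= len(GROUP_NAMES):
        raise ValueError(f"length {length} is outside the supported ranges")
    return GROUP_NAMES[index]


def group_by_length(samples: Sequence[Sequence[int]]) -> Dict[str, List[Sequence[int]]]:
    """Place every test sample in the group whose range contains its length."""
    groups: Dict[str, List[Sequence[int]]] = defaultdict(list)
    for sample in samples:
        groups[group_name(len(sample))].append(sample)
    return dict(groups)


test_samples = [[0] * 10, [0] * 6_000, [0] * 20_000, [0] * 5_999]
grouped = group_by_length(test_samples)
print({name: len(items) for name, items in grouped.items()})
```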

[0054] Optionally, the implementation method of obtaining the output length interval corresponding to the group by performing speculative sampling through the auxiliary model and the target model based on the test samples within the group is not limited. An example is provided below: predicting the probability distribution of the group under different output lengths; inputting the test samples in the group into the auxiliary model for candidate word prediction, and statistically analyzing the probability distribution at different word positions within the group based on the target model's acceptance length of the candidate words; determining the output length interval corresponding to the group based on the probability distribution of each group under different output lengths and the probability distribution at different word positions.

[0055] Among them, the output lengths in different groups correspond to different probability distributions. For example, the output lengths in the short input group follow a peaked distribution, meaning that the output lengths are highly concentrated. On the other hand, the output lengths in the long input group follow a long-tailed distribution, meaning that the span of the output lengths is relatively large.

[0056] The implementation method of inputting test samples from the group into the auxiliary model for candidate word prediction and statistically analyzing the probability distribution at different word positions within the group based on the acceptance length of the candidate words by the target model is not limited. For example, the auxiliary model inputs the predicted candidate words for each test sample into the target model for verification, and records how many consecutive candidate words (i.e., the acceptance length) the target model accepts in each verification process. Based on the acceptance length of the candidate words by the target model, the probability distribution at different word positions within the group is statistically analyzed.
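The statistics described above can be sketched as follows. Both function names are hypothetical. Counting an exact token-by-token match as acceptance is a simplifying assumption; speculative sampling in practice may accept or reject candidates probabilistically.

```python
from typing import List, Sequence


def acceptance_length(candidates: Sequence[int], verified: Sequence[int]) -> int:
    """Number of consecutive candidate words, from the start, that the target model accepts."""
    accepted = 0
    for candidate, reference in zip(candidates, verified):
        if candidate != reference:
            break
        accepted += 1
    return accepted


def position_acceptance_probs(acceptance_lengths: Sequence[int], window: int) -> List[float]:
    """For each word position 1..window, the fraction of verifications that accepted it."""
    rounds = len(acceptance_lengths)
    return [
        sum(1 for length in acceptance_lengths if length >= position) / rounds
        for position in range(1, window + 1)
    ]


# Made-up acceptance lengths recorded over five verification rounds.
recorded = [4, 3, 4, 1, 4]
probs = position_acceptance_probs(recorded, window=4)
print(probs)
```

A position counts as accepted in a round whenever the acceptance length of that round reaches it, so the resulting probabilities never increase from one position to the next.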

[0057] The method for determining the output length interval corresponding to a group based on the probability distribution of each group at different output lengths and at different word positions is not limited. For example, for any group, based on a set first probability threshold, such as 90%, 95%, or 98%, the output lengths with probability values exceeding the first probability threshold are selected from the probability distributions at different output lengths, and the integer interval covered by the output length is taken as the first length interval; based on a set second probability threshold, such as 92%, 96%, or 99%, word positions with probability values exceeding the second probability threshold are selected from the probability distributions at different word positions, and the integer interval covered by the word positions is taken as the second length interval; the output length interval corresponding to the group is generated based on the intersection of the first and second length intervals.
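The threshold-and-intersect procedure can be sketched as below. The function names and all probability values are made up for illustration, and how the probability values themselves are computed (for example per length or cumulatively) is not specified in the text and is left to the caller here.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]


def covered_interval(probs: Dict[int, float], threshold: float) -> Optional[Interval]:
    """Integer interval spanned by the keys whose probability exceeds the threshold."""
    selected = [key for key, value in probs.items() if value > threshold]
    if not selected:
        return None
    return min(selected), max(selected)


def intersect(first: Optional[Interval], second: Optional[Interval]) -> Optional[Interval]:
    """Intersection of two inclusive integer intervals, or None if they do not overlap."""
    if first is None or second is None:
        return None
    low, high = max(first[0], second[0]), min(first[1], second[1])
    return (low, high) if low <= high else None


# Hypothetical statistics for one group.
length_probs = {1: 0.50, 2: 0.95, 3: 0.97, 4: 0.96, 5: 0.93, 6: 0.91, 7: 0.40}
position_probs = {1: 0.99, 2: 0.97, 3: 0.95, 4: 0.93, 5: 0.85}

first_interval = covered_interval(length_probs, threshold=0.90)
second_interval = covered_interval(position_probs, threshold=0.92)
output_interval = intersect(first_interval, second_interval)
print(first_interval, second_interval, output_interval)
```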

[0058] Optionally, when the auxiliary model is associated with input length intervals and output length intervals corresponding to different groups, during the speculative sampling process, the target input data is fed into the auxiliary model, and the following operations are performed inside the auxiliary model: based on the length of the target input data, the target input length interval is determined from the input length interval associated with the auxiliary model; using the target output length interval associated with the target input length interval as a length constraint, model inference is performed on the target input data to obtain the target output data, the length of which is within the target output length interval.
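A minimal sketch of this lookup is shown below. The table contents, the half-open treatment of input intervals, and the names `target_output_interval` and `constrain_length` are assumptions for illustration; the output intervals are invented values, not results from the specification.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]

# Hypothetical association table: input length interval -> output length interval.
# Input intervals are treated as half-open [low, high); output intervals as inclusive.
LENGTH_TABLE: List[Tuple[Interval, Interval]] = [
    ((0, 6_000), (2, 4)),
    ((6_000, 16_000), (3, 6)),
    ((16_000, 32_000), (4, 8)),
]


def target_output_interval(input_length: int,
                           table: List[Tuple[Interval, Interval]] = LENGTH_TABLE
                           ) -> Optional[Interval]:
    """Find the output length interval associated with the matching input interval."""
    for (low, high), output_interval in table:
        if low <= input_length < high:
            return output_interval
    return None


def constrain_length(requested: int, interval: Interval) -> int:
    """Clamp a requested candidate length into the target output length interval."""
    low, high = interval
    return max(low, min(requested, high))


interval = target_output_interval(7_000)
draft_length = constrain_length(10, interval)
print(interval, draft_length)
```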

[0059] In one optional embodiment, the implementation method of using a second language model to perform reverse reasoning in a specific vertical domain based on system prompt words supported by the target model to obtain model input data adapted to the target model is not limited. A specific implementation method is provided below, including: inputting system prompt words supported by the target model and domain knowledge information from a specific vertical domain into a second language model; performing the following operations in the second language model: performing dual structural and semantic analysis on the system prompt words to obtain multiple semantic elements included in the system prompt words; constructing a forward reasoning thought chain for the target model to perform model reasoning under the guidance of system prompt words based on multiple semantic elements and combined with domain knowledge information from the vertical domain; deriving contextual features that can trigger the forward reasoning thought chain based on the forward reasoning thought chain; and generating model input data adapted to the target model based on the contextual features.

[0060] The process involves structural parsing of system prompts to obtain multiple semantic elements. These elements include, but are not limited to, roles, task instructions, output constraints, and behavioral boundaries. A forward reasoning chain guides the target model to infer output data based on the input data. Contextual features may include, but are not limited to, user identity, intent, entity information, and background.

[0061] Based on this, an example internal architecture of the second large language model is provided. The second large language model includes a multidimensional semantic parsing module, a thought chain causal mapping module, and an output data generation module. The three modules are connected in sequence, and the output of each module is the input of the next module.

[0062] The multidimensional semantic parsing module performs both structural and semantic analysis on system prompts to obtain multiple semantic elements. This module includes a self-attention network, a feedforward network, and a pooling network. The self-attention network globally scans the system prompts, while the feedforward network extracts deep features from the scanned text fragments to obtain deep semantic features. The pooling network then integrates these deep semantic features into structured label vectors, accurately identifying explicit instructions and implicit intentions.

[0063] The thought chain causal mapping module is used to construct a forward reasoning thought chain for the target model to perform model reasoning under the guidance of system prompts, based on multiple semantic elements and domain knowledge information from the vertical field. Based on the forward reasoning thought chain, it reverse-engineers the contextual features that can trigger the forward reasoning thought chain. The thought chain causal mapping module includes an attention layer, a graph convolutional layer, and a mapping layer. The attention layer constructs the forward reasoning thought chain, the graph convolutional layer performs reverse attribution analysis on the forward reasoning thought chain to obtain precondition features, and the mapping layer identifies contextual features from the precondition features.

[0064] The output data generation module is used to generate model input data adapted to the target model based on contextual features. This module includes a cross-attention network, which uses the derived contextual features as navigation signals to generate model input data adapted to the target model.

[0065] Furthermore, the generation parameters of the second language model can be controlled, enabling it to generate diverse model input data. Generation parameters include, but are not limited to, a temperature scheduler and sampling parameters. The temperature scheduler controls the randomness of the output candidate prompts; a higher temperature parameter results in more random output. For example, the temperature parameter can be dynamically adjusted between 0.7 and 1.2. Higher temperature parameters (e.g., 1.1 to 1.2) are used early on to explore the semantic space, while lower temperature parameters (e.g., 0.7 to 0.8) are used later to refine the generation quality. "Early" and "late" refer to the number of tokens generated in a single text generation task: "early" covers generation up to a predetermined number of tokens (e.g., 6k or 16k), while "late" covers generation beyond that predetermined number. When the second language model generates the next token, sampling parameters are used to select the target token from multiple candidate tokens. The sampling parameters indicate that the target token is selected from a predetermined number of top-ranked candidate tokens in descending order of probability value. The sampling parameter can be the top-p parameter, for example with a value of 0.9 to 0.95, to ensure diversity while avoiding overly scattered outputs. Step 1.2 in Figure 2 illustrates the process of generating diverse model input data through reverse reasoning.
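The interplay of the temperature scheduler and top-p sampling can be sketched as follows. All function names are hypothetical, the hard switch between one early and one late temperature is a simplification of "dynamically adjusted", and the top-p filter follows the common nucleus sampling definition (keep the most probable tokens until their cumulative probability reaches p), which is assumed rather than taken from the text.

```python
import math
import random
from typing import Dict


def temperature_at(num_generated: int, switch_after: int,
                   early: float = 1.2, late: float = 0.7) -> float:
    """Higher temperature for the first switch_after tokens, lower afterwards."""
    return early if num_generated < switch_after else late


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return {token: prob / cumulative for token, prob in kept}


def sample_next(logits: Dict[str, float], num_generated: int, switch_after: int,
                top_p: float, rng: random.Random) -> str:
    """Sample one token using the scheduled temperature and top-p filtering."""
    temperature = temperature_at(num_generated, switch_after)
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


nucleus = top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, top_p=0.9)
token = sample_next({"a": 2.0, "b": 1.0, "c": 0.1, "d": -3.0},
                    num_generated=0, switch_after=6_000, top_p=0.9,
                    rng=random.Random(0))
print(sorted(nucleus), token)
```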

[0066] In one optional embodiment, before using a second language model to perform reverse reasoning in a specific vertical domain based on the system prompt words supported by the target model to obtain multiple model input data adapted to the target model, a method for determining the system prompt words corresponding to the target model is also provided. Specifically, semantic parsing is performed on the basic prompt words corresponding to the target model to obtain a set of semantic elements; the expression form of the set of semantic elements is diversified to obtain multiple candidate prompt words; based on the semantic similarity between the multiple candidate prompt words and the basic prompt words, the target prompt word is selected from the candidate prompt words; the basic prompt words and the target prompt word are used as the system prompt words. Step 1.1 in Figure 2 illustrates the process of obtaining the system prompt words.

[0067] Among them, basic prompt words are a concentrated embodiment of the target model's behavioral patterns, task definitions, and style features, and are key to generating high-quality data. Semantic parsing is performed on the basic prompt words to obtain a set of semantic elements. This set of semantic elements includes, but is not limited to, multiple semantic elements such as roles, task instructions, output constraints, and behavioral boundaries.

[0068] The method of diversifying the expression of semantic element sets to obtain multiple candidate prompts is not limited. For example, the diversification methods may include, but are not limited to: synonym substitution, sentence restructuring, tone fine-tuning, style fine-tuning, logical equivalence restatement, and redundant addition and deletion.

[0069] The method for selecting the target prompt word from multiple candidate prompt words based on their semantic similarity to the base prompt word is not limited. For example, based on the semantic similarity between multiple candidate prompt words and the base prompt word, the candidate prompt words with a similarity exceeding a set similarity threshold are selected as the target prompt words. Another example is to calculate the semantic similarity between multiple candidate prompt words and the base prompt word, and select a set number (e.g., 10, 50, or 100) of the top-ranked candidate prompt words as the target prompt words, in descending order of similarity.
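Both selection rules can be sketched as below. The use of cosine similarity over embedding vectors is an assumption (the text does not say how semantic similarity is computed), and the two-dimensional vectors and all names are purely illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_by_threshold(base: Sequence[float], candidates: Dict[str, Sequence[float]],
                        threshold: float) -> List[str]:
    """Candidates whose similarity to the base prompt exceeds the threshold."""
    return [name for name, vec in candidates.items() if cosine(base, vec) > threshold]


def select_top_k(base: Sequence[float], candidates: Dict[str, Sequence[float]],
                 k: int) -> List[str]:
    """The k candidates most similar to the base prompt, in descending order."""
    ranked = sorted(candidates, key=lambda name: cosine(base, candidates[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical embeddings of the base prompt and three candidate prompts.
base_vec = [1.0, 0.0]
candidate_vecs = {"c1": [1.0, 0.0], "c2": [1.0, 1.0], "c3": [0.0, 1.0]}

above_threshold = select_by_threshold(base_vec, candidate_vecs, threshold=0.7)
top_one = select_top_k(base_vec, candidate_vecs, k=1)
print(above_threshold, top_one)
```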

[0070] In addition, based on the semantic similarity between multiple candidate prompts and the basic prompts, before selecting the target prompt from the candidate prompts, conflict detection can be performed on multiple candidate prompts. For candidate prompts that do not contain contradictory or conflicting information, the semantic similarity with the basic prompts can be calculated.

[0071] In one optional embodiment, in order to improve the ability of the auxiliary model to express complex hidden states, the dimension of the feedforward network of the base model can be expanded before supervised training to obtain an initial model. Then, supervised training is performed on the initial model to obtain the auxiliary model. Since the feedforward network is the main nonlinear computation part, it can improve the nonlinear fitting ability of the auxiliary model and enhance the ability of the auxiliary model to perform complex feature transformations at each word position, so that it can learn and predict the high-dimensional hidden states of the target model more accurately.

[0072] Specifically, the target model is associated with a base model. The base model can be a model trained on a general dataset or a lightweight model that has not yet been trained. For an introduction to the internal architecture of the base model, please refer to the internal architecture of the initial model. Based on the system prompts supported by the target model, a second large language model is used for reverse inference in a specific vertical domain to obtain multiple model input data adapted to the target model. Guided by the system prompts, the target model performs model inference on any model input data to obtain model output data, and the target hidden state data generated by the target model during the model inference process is collected. Training samples are constructed based on at least a portion of the model input data and their corresponding model output data, and the annotation results of the training samples are constructed based on the corresponding target hidden state data. Multiple validation samples are constructed based on at least a portion of the model input data outside the training samples. Based on different expansion ratios, the dimension of the feedforward network in the base model is expanded to obtain multiple candidate models. The multiple validation samples are input into the multiple candidate models respectively for candidate word prediction, and the initial model is selected from the multiple candidate models based on the target model's acceptance length for the candidate words they predict. Step 2.1 in Figure 2 illustrates the expansion of the feedforward network dimension.

[0073] In this context, the dimension of the feedforward network refers to the width of its hidden layers. The number of parameters in the hidden layers enables the model to store "knowledge memory" and perform complex nonlinear logical reasoning. Expanding the dimension of the feedforward network is equivalent to increasing the width of the hidden layers, enhancing the model's ability to perform complex feature transformations at each word position. A pre-maintained set of expansion ratios is used, containing multiple expansion ratios. For example, the expansion ratio set could be {1.0, 1.5, 2.0, 2.5}. Based on different expansion ratios in this set, the dimension of the feedforward network in the base model is expanded to obtain multiple candidate models.

[0074] The method of inputting multiple validation samples into multiple candidate models for candidate word prediction is not limited. For example, for any candidate model, multiple validation samples are input into that candidate model for candidate word prediction, resulting in multiple candidate word sequences corresponding to multiple validation samples, with one validation sample corresponding to one candidate word sequence. These candidate word sequences are then input into the target model for validation, obtaining the target model's acceptance length for each candidate word sequence. Based on the acceptance length for each candidate word sequence, the target model's acceptance length for the predicted candidate words is calculated. Therefore, based on the target model's acceptance length for the predicted candidate words from the multiple candidate models, an initial model is selected. For example, the first candidate model is selected as the initial model in descending order of acceptance length; or, in descending order of acceptance length, a candidate model is randomly selected from the top-ranked, predetermined number (e.g., 2 or 3) candidate models as the initial model.
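The first selection rule above (pick the candidate with the greatest acceptance length) can be sketched as follows. Aggregating the per-sequence acceptance lengths by their mean is an assumption, since the text does not name the aggregation; the function name and the numbers are made up, with candidates keyed by their expansion ratio.

```python
from statistics import mean
from typing import Dict, List, Tuple


def select_initial_model(
    acceptance_by_candidate: Dict[float, List[int]],
) -> Tuple[float, Dict[float, float]]:
    """Pick the candidate whose predicted words the target model accepts longest on average."""
    scores = {
        candidate: mean(lengths)
        for candidate, lengths in acceptance_by_candidate.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical acceptance lengths per validation sample, keyed by expansion ratio.
recorded = {
    1.0: [2, 3, 2],
    1.5: [3, 4, 3],
    2.0: [4, 4, 3],
    2.5: [3, 4, 3],
}
best_ratio, scores = select_initial_model(recorded)
print(best_ratio, scores)
```

The second rule in the text, choosing at random among the top few candidates, could be obtained by sorting `scores` and sampling from the leading entries.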

[0075] The method provided in the embodiments of this specification has the following technical effects: Based on the target model's basic prompt words, a hidden state training set highly aligned with its behavioral patterns is self-constructed through a perturbation-generation-verification closed loop, eliminating dependence on raw data. Perturbation refers to diversifying the basic prompt words to expand their richness. Generation involves using a second large language model for reverse inference based on the system prompt words supported by the target model to obtain multiple model input data adapted to the target model. Verification is used to determine the output length range through the input-output distribution. The auxiliary model, trained under supervision on this self-generated dataset, possesses "plug-and-play" deployment characteristics and can be seamlessly integrated into various large language models (i.e., the target model) with similar architectures to achieve efficient inference acceleration. Architectural similarity means that the target model and the auxiliary model maintain a high degree of consistency or compatibility in network topology (such as attention mechanisms, feedforward neural network structures, activation functions, etc.). Based on this, the "deep nesting" principle is followed: the target model has significantly more network layers than the auxiliary model, thus achieving speculative sampling while maintaining computational logical isomorphism, leveraging the lightweight characteristics of the auxiliary model.

[0076] The feedforward neural network (FFN) of the auxiliary model is dimensionally expanded, for example, by 1.5 to 2 times, giving it the ability to accurately fit the high-dimensional hidden state manifold of the target model, rather than simply mimicking the word distribution. This improves the target model's acceptance rate for words predicted by the auxiliary model. For example, the acceptance rate increases from 70% to 85%, the number of parameters in the auxiliary model increases by 20%, and the inference overhead increases by less than 5%.

[0077] During supervised training, the learning rate is adjusted in stages: a higher learning rate is used in the first training stage, and a lower learning rate is used in the second training stage, achieving a "fast first, stable later" training rhythm. Furthermore, different model output ranges (i.e., speculative windows) are determined based on different input length ranges, aligning with the inherent distribution of the data. This aligns the auxiliary model's training rhythm with the inherent distribution of the data and the requirements of the validation stage, improving the model's convergence speed. For example, the convergence speed is improved by 30%, and the end-to-end speedup is further increased by 10% to 15%.

[0078] It should be noted that the execution subject of each step of the method provided in the above embodiments can be the same device, or the method can be executed by different devices. For example, the execution subject of steps 101 to 104 can be device A; or the execution subject of steps 101 and 102 can be device A, and the execution subject of step 103 can be device B; and so on.

[0079] Furthermore, some processes described in the above embodiments and accompanying drawings include multiple operations appearing in a specific order. However, it should be clearly understood that these operations may not be executed in the order they appear herein, or they may be executed in parallel. The operation numbers, such as 101, 102, etc., are merely used to distinguish different operations and do not represent any execution order. Additionally, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that the descriptions such as "first" and "second" in this document are used to distinguish different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit "first" and "second" to different types.

[0080] In this specification, unless explicitly stated otherwise, "receiving and sending data" does not necessarily mean direct receiving and sending; it can also mean indirect receiving and sending. For example, A receiving data sent by B can be understood as A directly receiving data sent by B, or it can be understood as A indirectly receiving data sent by B through other entities such as C. Similarly, B sending data to A can be understood as B sending data directly to A, or it can be understood as B indirectly sending data to A through other entities such as C. Here, C can be one entity, or it can be two or more entities.

[0081] Figure 3 is a schematic diagram of a model training device provided by an exemplary embodiment of this specification. As shown in Figure 3, the device includes: a determination module 31, a first reasoning module 32, a second reasoning module 33, a construction module 34, and a training module 35.

[0082] The determination module is used to determine the target model, which is a first large language model fine-tuned in a specific vertical domain and is associated with an initial model. The parameter size of the initial model is smaller than that of the target model.

[0083] The first reasoning module is used to perform reverse reasoning in the specific vertical domain, based on the system prompts supported by the target model and using a second large language model, to obtain multiple pieces of model input data adapted to the target model.

[0084] The second reasoning module is used to perform model inference on any piece of model input data using the target model under the guidance of the system prompts, so as to obtain model output data, and to collect the target hidden state data generated by the target model during the model inference process.

[0085] The construction module is used to construct training samples based on at least a portion of the model input data and their corresponding model output data, and to construct the annotation results of the training samples based on the corresponding target hidden state data.

[0086] The training module is used to perform supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model. The auxiliary model is used to generate multiple candidate tokens for the target model during speculative sampling.
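By way of non-limiting illustration, the quantity that measures how useful the auxiliary model's candidates are, the acceptance length, can be sketched as follows. The sketch assumes greedy verification, in which a candidate is accepted only if it equals the token the target model itself would produce at that position. Speculative sampling in general uses a probabilistic accept/reject test, so this is a simplification, and the function names are hypothetical.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Count the leading candidate tokens that match the target model's own tokens.

    Verification stops at the first mismatch, so only a matching prefix counts.
    Greedy matching is an assumption of this sketch.
    """
    accepted = 0
    for draft_tok, target_tok in zip(draft_tokens, target_tokens):
        if draft_tok != target_tok:
            break
        accepted += 1
    return accepted


def mean_acceptance_length(rounds):
    """Average acceptance length over (draft_tokens, target_tokens) pairs."""
    if not rounds:
        raise ValueError("at least one verification round is required")
    return sum(acceptance_length(d, t) for d, t in rounds) / len(rounds)


if __name__ == "__main__":
    # The third candidate differs, so only the first two are accepted.
    print(acceptance_length([5, 9, 2, 7], [5, 9, 4, 7]))
```

A longer average acceptance length means the target model verifies more tokens per forward pass, which is the source of the inference speedup.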

[0087] For detailed descriptions of the implementation methods and effects of the above-mentioned device, please refer to the foregoing embodiments, which will not be repeated here.

[0088] Figure 4 is a schematic diagram of an electronic device provided by an exemplary embodiment of this specification, which is applicable to the model training method provided in the foregoing embodiments. As shown in Figure 4, the electronic device 700 mainly consists of a communication interface 702, a user interface 704, a processor 706, and a memory 708. These components are interconnected and communicate with each other through a system bus, network, or other connection mechanism 410. The communication interface 702 enables the device 700 to communicate with other devices, access networks, and transmission networks using analog or digital modulation. For example, the communication interface 702 may include a chipset and antenna for wireless communication with a radio access network or access point. Furthermore, the communication interface 702 can be a wired interface such as Ethernet, Token Ring, or a USB port; a wireless interface such as Wi-Fi (Wireless Fidelity), Bluetooth, or Global Positioning System (GPS); or a wide-area wireless interface such as WiMAX (Worldwide Interoperability for Microwave Access) or LTE (Long Term Evolution). Of course, the communication interface 702 can also support other forms of physical layer interfaces and standard or proprietary communication protocols. The communication interface 702 may also include multiple physical communication interfaces, such as a Wi-Fi interface, a Bluetooth interface, and a wide-area wireless interface.

[0089] User interface 704 is used to receive user input and provide output to the user. Therefore, user interface 704 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, still camera, and video camera, and output components such as a display screen (which may be combined with a touch-sensitive panel), CRT (Cathode Ray Tube), LCD (Liquid Crystal Display), LED (Light Emitting Diode), display using DLP (Digital Light Processing) technology, printer, and other known or future similar devices. User interface 704 may also generate auditory output via speakers, speaker jacks, audio output ports, audio output devices, headphones, and other known or future similar devices. In some embodiments, user interface 704 may include software, circuitry, or other forms of logic capable of transmitting data to and receiving data from external user input / output devices. Additionally or alternatively, electronic device 700 may support remote access from other devices via communication interface 702 or another physical interface (not shown). User interface 704 can be configured to receive user input, the position and movement of which can be indicated by an indicator or cursor described herein. User interface 704 can also be configured as a display device for rendering or displaying text fragments.

[0090] Processor 706 may include one or more general-purpose processors and / or special-purpose processors. Memory 708 may include one or more volatile and / or non-volatile memory components and may be integrated wholly or partially with processor 706. Memory 708 may include removable and non-removable components.

[0091] The processor 706 is capable of executing program instructions 718 (e.g., compiled or uncompiled program logic and / or machine code) stored in memory 708 to perform the various functions described herein.

[0092] Memory 708 may contain non-transitory computer-readable media, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. Memory 708 stores program instructions that, when executed by device 700, enable device 700 to perform any of the methods, processes, or functions disclosed in this specification and / or the accompanying drawings. Processor 706 executing program instructions 718 may cause processor 706 to use data 712.

[0093] For example, program instructions 718 may include an operating system 722 (e.g., an operating system kernel, device drivers, and / or other modules) installed on device 700 and one or more applications 720 (e.g., a browser, social application, or game application). Similarly, data 712 may include operating system data 716 and application data 714. Operating system data 716 is primarily accessible to the operating system 722, while application data 714 is primarily accessible to one or more applications 720. Application data 714 may reside in a file system visible or hidden from the user of device 700.

[0094] Application 720 can communicate with operating system 722 through one or more application programming interfaces (APIs). These APIs help application 720 read and / or write application data 714, transmit or receive information via communication interface 702, receive or display information on user interface 704, etc.

[0095] In some terminology, application 720 may be simply referred to as "app". Furthermore, application 720 can be downloaded to device 700 through one or more online app stores or app markets. However, applications can also be installed on device 700 in other ways, such as through a web browser or a physical interface on electronic device 700 (e.g., a USB port).

[0096] Accordingly, embodiments of this specification also provide a computer-readable storage medium storing a computer program which, when executed by a processor, enables the processor to implement the steps in the above-described method embodiments. The computer-readable storage medium may be volatile, non-volatile, or a combination thereof, and can be removable or non-removable. Examples of computer-readable storage media include, but are not limited to, phase-change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), flash memory or other memory technologies, CD-ROM, Digital Video Disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium. Embodiments of this specification likewise provide a computer program product, which includes a computer program or instructions that, when executed by a processor, cause the processor to implement the steps in the above-described method embodiments. It should be understood that each step or combination of steps in the above-described method flow can be implemented by the computer program or instructions. Furthermore, these computer programs or instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device, enabling that processor to function as an apparatus for implementing the corresponding functions in the above-described method embodiments.

[0097] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, product, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, product, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, product, or apparatus that includes that element.

[0098] This specification uses specific terms to describe embodiments thereof. Terms such as "an embodiment," "one embodiment," and / or "some embodiments" refer to a particular feature, structure, or characteristic associated with at least one embodiment of this specification. Therefore, it should be emphasized and noted that references to "an embodiment," "one embodiment," or "an alternative embodiment" in different locations throughout this specification do not necessarily refer to the same embodiment. Furthermore, those skilled in the art can combine and integrate the different embodiments or examples described herein, as well as the features of those different embodiments or examples, without contradiction.

[0099] The terminology used in the embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of this specification. The singular forms “a,” “an,” and “the” used in the embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. “Multiple” and “a plurality” generally include at least two, but do not exclude the inclusion of at least one.

[0100] It should be understood that the term "and / or" used in this specification merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this specification generally indicates that the preceding and following related objects have an "or" relationship.

[0101] The above are merely embodiments of this specification and are not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.

Claims

1. A model training method, characterized by comprising: determining a target model, the target model being a first large language model fine-tuned in a specific vertical domain and associated with an initial model, wherein the parameter size of the initial model is smaller than that of the target model; based on the system prompts supported by the target model, using a second large language model to perform reverse reasoning in the specific vertical domain to obtain multiple pieces of model input data adapted to the target model; under the guidance of the system prompts, using the target model to perform model inference on any piece of model input data to obtain model output data, and collecting target hidden state data generated by the target model during the model inference process; constructing training samples based on at least a portion of the model input data and the corresponding model output data, and constructing annotation results of the training samples based on the corresponding target hidden state data; and performing supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model, wherein the auxiliary model is used to generate multiple candidate tokens for the target model during speculative sampling.

2. The method according to claim 1, characterized in that the target model includes multiple network layers, and an acquisition module is mounted on a target network layer among the multiple network layers; and wherein, under the guidance of the system prompts, using the target model to perform model inference on any piece of model input data to obtain model output data, and collecting the target hidden state data generated by the target model during the model inference process, comprises: feeding the model input data into the target model and, under the guidance of the system prompts, passing it through the multiple network layers in sequence for autoregressive decoding to obtain the model output data; and, during the autoregressive decoding, when forward propagation reaches the target network layer, using the acquisition module mounted on the target network layer to acquire the hidden state data output by the target network layer as the target hidden state data.

3. The method according to claim 2, characterized in that the autoregressive decoding includes N forward propagation computations, and the model output data includes N tokens output by the N forward propagation computations, where N is a positive integer; and wherein constructing training samples based on at least a portion of the model input data and the corresponding model output data, and constructing the annotation results of the training samples based on the corresponding target hidden state data, comprises: for any forward propagation computation, using the context input data of that forward propagation computation as a training sample, wherein the context input data includes the model input data, or includes the model input data and the tokens output by previously executed forward propagation computations; and using the target hidden state data collected from the target network layer during that forward propagation computation as the annotation result of the training sample.
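By way of non-limiting illustration, the collection of hidden states from a target network layer and the per-step construction of training samples can be sketched with a toy stand-in for the target model. The class `ToyTargetModel`, its arithmetic "layers", and the helper `collect_samples` are hypothetical and exist only to make the data flow concrete. A real implementation would attach the acquisition module through the hook mechanism of the deep learning framework in use.

```python
class ToyTargetModel:
    """Toy stand-in for the target model (an assumption of this sketch).

    Each "layer" adds a constant to every position; the layer at index
    tap_layer plays the role of the target network layer.
    """

    def __init__(self, num_layers=4, tap_layer=2):
        self.num_layers = num_layers
        self.tap_layer = tap_layer
        self.captured = []  # filled by the acquisition hook

    def _acquire(self, hidden):
        # Acquisition module: copy the hidden state output by the tapped layer.
        self.captured.append(list(hidden))

    def forward(self, tokens):
        hidden = [float(t) for t in tokens]
        for layer in range(self.num_layers):
            hidden = [h + layer + 1 for h in hidden]  # toy layer computation
            if layer == self.tap_layer:
                self._acquire(hidden)
        return int(hidden[-1]) % 100  # toy "next token" from the last position


def collect_samples(model, prompt_tokens, n_steps):
    """Run n_steps forward passes and pair each context with its tapped hidden state."""
    samples = []
    context = list(prompt_tokens)
    for _ in range(n_steps):
        model.captured.clear()
        next_token = model.forward(context)
        # Training sample: the context fed to this pass. Label: the tapped hidden state.
        samples.append((list(context), model.captured[-1]))
        context.append(next_token)  # later passes see previously output tokens
    return samples, context[len(prompt_tokens):]


if __name__ == "__main__":
    samples, output_tokens = collect_samples(ToyTargetModel(), [3, 5], n_steps=3)
    for context, label in samples:
        print(context, label)
```

The first sample's context contains only the model input data, while later samples also contain the tokens output by earlier forward passes, matching the two cases distinguished in claim 3.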

4. The method according to any one of claims 1-3, characterized in that the target model is associated with a base model whose parameter size is smaller than that of the target model, and the method further comprises: constructing multiple validation samples based on at least a portion of the model input data other than the training samples; expanding the dimension of the feedforward network in the base model according to different expansion ratios to obtain multiple candidate models; and inputting the multiple validation samples into each of the multiple candidate models for candidate token prediction, and selecting the initial model from the multiple candidate models according to the target model's acceptance length for the predicted candidate tokens.
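By way of non-limiting illustration, the selection of the initial model from candidates with differently expanded feedforward networks can be sketched as below. The claim only states that the choice is made according to acceptance length. Picking the candidate with the highest mean acceptance length, rounding the expanded width up to a multiple of 64, and the names `expanded_ffn_dim` and `select_initial_model` are all assumptions of this sketch.

```python
import math
from statistics import mean


def expanded_ffn_dim(base_dim, ratio, multiple_of=64):
    """Scale the feedforward width by ratio, rounded up to a multiple (assumed)."""
    return math.ceil(base_dim * ratio / multiple_of) * multiple_of


def select_initial_model(acceptance_by_ratio):
    """Return the expansion ratio whose candidate has the highest mean acceptance length.

    acceptance_by_ratio maps each expansion ratio to the acceptance lengths
    measured on the validation samples for that candidate model.
    """
    return max(acceptance_by_ratio, key=lambda r: mean(acceptance_by_ratio[r]))


if __name__ == "__main__":
    measured = {1.0: [1, 2, 1], 1.5: [3, 2, 3], 2.0: [2, 2, 3]}  # made-up values
    best = select_initial_model(measured)
    print(best, expanded_ffn_dim(4096, best))
```

In practice the selection could also weigh the extra latency of a wider feedforward network against the gain in acceptance length; the sketch ignores that trade-off.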

5. The method according to any one of claims 1-3, characterized in that performing supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model comprises: dividing the training samples into multiple batches, each batch of training samples corresponding to one training step, and each training step corresponding to a training step count; for any training step, constructing a target supervision loss based on the training samples and annotation results corresponding to that training step; determining the learning rate for that training step based on the relationship between its training step count and a preset training step threshold; and, based on gradient information of the target supervision loss, updating the model parameters of the initial model at that learning rate to complete the current round of training.

6. The method according to claim 5, characterized in that determining the learning rate for the training step based on the relationship between its training step count and a preset training step threshold comprises: if the training step count is less than the preset training step threshold, determining that the training step is in a first training stage, and generating a first learning rate for the training step based on the training progress percentage of the training step count within the first training stage, wherein, in the first training stage, the learning rate of each training step is positively correlated with the training step count and increases from an initial value to a set learning rate threshold; and if the training step count is greater than or equal to the preset training step threshold, determining that the training step is in a second training stage, and generating a second learning rate for the training step based on the training progress percentage of the training step count within the second training stage, wherein, in the second training stage, the learning rate of each training step is negatively correlated with the training step count and decreases from the learning rate threshold.

7. The method according to any one of claims 1-3, characterized in that the training samples are constructed from a portion of the model input data, and after obtaining the auxiliary model, the method further comprises: constructing multiple test samples based on at least a portion of the model input data other than the training samples; grouping the multiple test samples by sample length, different groups corresponding to different input length ranges; for any group, performing speculative sampling with the auxiliary model and the target model on the test samples within the group to obtain the output length range corresponding to the group; and associating the input length ranges and output length ranges corresponding to the different groups with the auxiliary model.

8. The method according to claim 7, characterized in that performing speculative sampling with the auxiliary model and the target model on the test samples within the group to obtain the output length range corresponding to the group comprises: predicting the probability distribution of the group over different output lengths; inputting the test samples in the group into the auxiliary model for candidate token prediction, and computing the group's probability distribution over different token positions based on the target model's acceptance length for the candidate tokens; and determining the output length range corresponding to the group based on its probability distribution over different output lengths and its probability distribution over different token positions.

9. The method according to claim 7, characterized in that the method further comprises: during speculative sampling, feeding target input data into the auxiliary model, and performing the following operations within the auxiliary model: determining a target input length range, based on the length of the target input data, from the input length ranges associated with the auxiliary model; and, using the target output length range associated with the target input length range as a length constraint, performing model inference on the target input data to obtain target output data, wherein the length of the target output data is within the target output length range.
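By way of non-limiting illustration, the lookup from input length to an associated output length range (the speculative window) can be sketched as a sorted table searched by bisection. The boundary values and ranges in the table are invented for the example, and the names `speculative_window` and `clamp_draft_length` are hypothetical. In the described method the table would be derived from the grouped test samples.

```python
from bisect import bisect_right

# Hypothetical association table (values invented for illustration):
# lower bound of each input length range -> (min, max) number of candidate tokens.
INPUT_LOWER_BOUNDS = [0, 128, 512, 2048]
OUTPUT_RANGES = [(2, 4), (3, 6), (4, 8), (2, 5)]


def speculative_window(input_length):
    """Return the output length range associated with the given input length."""
    if input_length < 0:
        raise ValueError("input_length must be non-negative")
    index = bisect_right(INPUT_LOWER_BOUNDS, input_length) - 1
    return OUTPUT_RANGES[index]


def clamp_draft_length(requested, input_length):
    """Constrain a requested candidate count to the window for this input length."""
    low, high = speculative_window(input_length)
    return max(low, min(high, requested))


if __name__ == "__main__":
    for length in (10, 128, 600, 5000):
        print(length, speculative_window(length), clamp_draft_length(10, length))
```

Keeping the lower bounds sorted makes the lookup logarithmic in the number of ranges, although with only a handful of groups a linear scan would serve equally well.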

10. The method according to any one of claims 1-3, characterized in that, based on the system prompts supported by the target model, using a second large language model to perform reverse reasoning in the specific vertical domain to obtain model input data adapted to the target model comprises: inputting the system prompts supported by the target model and domain knowledge information of the specific vertical domain into the second large language model, and performing the following operations in the second large language model: performing structural and semantic analysis on the system prompts to obtain multiple semantic elements included in the system prompts; based on the multiple semantic elements and the domain knowledge information of the vertical domain, constructing a forward reasoning chain of thought that the target model would follow when performing model inference under the guidance of the system prompts; deducing in reverse, from the forward reasoning chain of thought, the contextual features that can trigger that chain of thought; and generating, based on the contextual features, model input data adapted to the target model.

11. The method according to any one of claims 1-3, characterized in that the method further comprises: performing semantic parsing on a base prompt corresponding to the target model to obtain a set of semantic elements; diversifying the expression forms of the set of semantic elements to obtain multiple candidate prompts; selecting target prompts from the candidate prompts based on the semantic similarity between the multiple candidate prompts and the base prompt; and using the base prompt and the target prompts as the system prompts.
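By way of non-limiting illustration, the selection of target prompts by similarity to the base prompt can be sketched as below. The claim does not specify the similarity measure or the selection rule. This sketch substitutes word-set Jaccard overlap for semantic similarity and keeps candidates above a fixed threshold. Both choices, the threshold value, and the function names are assumptions, and an embedding-based measure would be a more realistic choice.

```python
def jaccard(a, b):
    """Word-set overlap, used here as a crude stand-in for semantic similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_system_prompts(base_prompt, candidates, min_similarity=0.5):
    """Return the base prompt plus every candidate similar enough to it.

    Candidates identical to the base prompt are skipped, since they add no diversity.
    """
    selected = [
        c for c in candidates
        if c != base_prompt and jaccard(base_prompt, c) >= min_similarity
    ]
    return [base_prompt] + selected


if __name__ == "__main__":
    base = "you are a financial risk assistant"
    candidates = [
        "you are a financial risk advisor",
        "write a poem about spring",
    ]
    print(select_system_prompts(base, candidates))
```

The threshold guards against candidates that drift away from the base prompt's meaning while still allowing varied phrasings into the system prompt set.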

12. A model training device, characterized by comprising: a determination module, a first reasoning module, a second reasoning module, a construction module, and a training module; wherein the determination module is used to determine a target model, the target model being a first large language model fine-tuned in a specific vertical domain and associated with an initial model, wherein the parameter size of the initial model is smaller than that of the target model; the first reasoning module is used to perform reverse reasoning in the specific vertical domain, based on the system prompts supported by the target model and using a second large language model, to obtain multiple pieces of model input data adapted to the target model; the second reasoning module is used to perform model inference on any piece of model input data using the target model under the guidance of the system prompts, so as to obtain model output data, and to collect target hidden state data generated by the target model during the model inference process; the construction module is used to construct training samples based on at least a portion of the model input data and the corresponding model output data, and to construct the annotation results of the training samples based on the corresponding target hidden state data; and the training module is used to perform supervised training of the initial model using the training samples and their annotation results to obtain an auxiliary model, wherein the auxiliary model is used to generate multiple candidate tokens for the target model during speculative sampling.

13. An electronic device, characterized by comprising: a memory and a processor; the memory is used to store one or more computer instructions; the processor is used to execute the one or more computer instructions to perform the steps of the method according to any one of claims 1-11.

14. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1-11 are performed.

15. A computer program product, characterized by comprising: a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-11.