A pre-training language model training method and system based on smoothing and negative sampling

By introducing a loss function with Top-K mean smoothing and negative sampling mechanism, the memory bottleneck and target difference problems in large-scale language model training are solved, achieving more efficient training and better model performance.

CN122242577A · Pending · Publication Date: 2026-06-19 · SHAN DONG MSUN HEALTH TECH GRP CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHAN DONG MSUN HEALTH TECH GRP CO LTD
Filing Date: 2026-03-13
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing large-scale language models suffer from performance degradation issues during training, including memory bandwidth bottlenecks, label sparsity, overconfidence, low computational efficiency, and discrepancies between pre-training and reinforcement learning objectives.

Method used

We adopt a suboptimal smoothing strategy based on Top-K means and a random negative sampling mechanism to replace the traditional Softmax cross-entropy loss function. Through local Top-K smoothing alignment and sparse gradient backpropagation, we reduce the memory bandwidth requirements, alleviate model overconfidence, and improve training efficiency and diversity.

Benefits of technology

It significantly reduces memory bandwidth requirements, improves training efficiency, enhances model diversity and instruction compliance, reduces the difference in optimization objectives between pre-training and reinforcement learning, and improves model smoothness and robustness.

✦ Generated by Eureka AI based on patent content.

Figure CN122242577A_ABST

Abstract

This invention proposes a pre-trained language model training method and system based on smoothing and negative sampling, belonging to the field of natural language processing technology. The method includes: acquiring the language model to be trained and text data, wherein the data includes feature vectors and ground truth labels of the training text; inputting the text data into the language model to be trained, using a Transformer model and a linear projection layer to obtain unnormalized prediction score vectors of the entire vocabulary, selecting several maximum values from the unnormalized prediction score vectors to generate a first set, randomly sampling several negative sample words from the entire vocabulary to generate a second set, calculating a loss function based on the first and second sets; the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and obtaining the trained pre-trained language model based on the loss function. This invention optimizes the alignment consistency between language model pre-training and reinforcement learning.

Description

Technical Field

[0001] This invention belongs to the field of natural language processing technology, and in particular relates to a pre-trained language model training method and system based on smoothing and negative sampling.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] In recent years, large-scale language models based on the Transformer architecture (such as GPT-4, LLaMA, Claude, and Qwen) have made breakthroughs in natural language understanding and generation tasks. These models typically have hundreds of billions or even trillions of parameters and learn complex language rules, world knowledge, and preliminary logical reasoning abilities through self-supervised pre-training on text data of trillions of basic text units (tokens).

[0004] Current mainstream large-scale model training paradigms typically consist of three stages: pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) or preference optimization. The core loss function is cross-entropy loss, which is commonly used in the pre-training and SFT stages. While pre-trained models possess strong continuation capabilities, they often lack understanding of user intent and may generate harmful content. Therefore, the RLHF stage is crucial. However, the optimization objectives of the pre-training stage and the RLHF stage differ fundamentally, leading to an alignment tax. This means that forcibly aligning with human preferences causes performance degradation in certain fundamental capabilities. Consequently, with the continuous increase in model size and vocabulary size, and the increasing demands on model inference efficiency and alignment quality, existing training systems based on Softmax cross-entropy exhibit significant shortcomings in computational efficiency, smoothness of the model representation space, and compatibility with subsequent alignment tasks.

Summary of the Invention

[0005] To overcome the shortcomings of the existing technologies, this invention proposes a pre-training language model training method and system based on smoothing and negative sampling. This method replaces the traditional Softmax cross-entropy loss function in the design of pre-training and supervised fine-tuning objective functions. By eliminating global normalization calculations and introducing a suboptimal smoothing strategy based on Top-K means and a random negative sampling mechanism, this design aims to solve the memory bandwidth bottleneck problem in training large vocabulary models, alleviate the overconfidence of model predictions and label sparsity, and reduce the alignment gap between the optimization objective and the human feedback reinforcement learning stage from a mathematical mechanism perspective. This improves the training efficiency, inference diversity, and instruction compliance of large models.

[0006] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions. In a first aspect, this invention discloses a pre-trained language model training method based on smoothing and negative sampling, comprising: obtaining the language model to be trained and text data, wherein the text data includes the feature vectors of the training text and the ground-truth labels of the training text; inputting the text data into the language model to be trained, using a Transformer model and a linear projection layer to obtain an unnormalized prediction score vector over the entire vocabulary, selecting several maximum values from the unnormalized prediction score vector to generate a first set, randomly sampling several negative sample words from the entire vocabulary to generate a second set, and calculating the loss function of the current word based on the first set and the second set, wherein the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and iteratively updating the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

[0007] In a second aspect, this invention discloses a pre-trained language model training system based on smoothing and negative sampling, comprising: a text acquisition module configured to acquire the language model to be trained and text data, wherein the text data includes the feature vectors of the training text and the ground-truth labels of the training text; a prediction loss module configured to input the text data into the language model to be trained, use a Transformer model and a linear projection layer to obtain an unnormalized prediction score vector over the entire vocabulary, select several maximum values from the unnormalized prediction score vector to generate a first set, randomly sample several negative sample words from the entire vocabulary to generate a second set, and calculate the loss function for the current word based on the first set and the second set, wherein the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and an iterative training module configured to iteratively update the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

[0008] In a third aspect, the present invention discloses an electronic device, including a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when run by the processor, complete the steps of the above-mentioned pre-trained language model training method based on smoothing and negative sampling.

[0009] In a fourth aspect, the present invention discloses a computer-readable storage medium for storing computer instructions which, when executed by a processor, complete the steps of the above-mentioned pre-trained language model training method based on smoothing and negative sampling.

[0010] In a fifth aspect, the present invention discloses a computer program product comprising executable instructions stored in a computer-readable storage medium; when the processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes them, the steps of the above-described pre-trained language model training method based on smoothing and negative sampling are completed.

[0011] Compared with the prior art, the beneficial effects of the present invention are as follows: This invention abandons global Softmax and introduces local Top-K smoothing alignment, which prevents logit explosion and overconfidence, preserves the competitive relationship among semantically similar words, and makes the latent semantic space smoother and denser.

[0012] This invention is based on gradient sparsification and negative sampling mechanisms, which makes the gradient of the loss function change from dense to highly sparse. This not only greatly reduces the memory bandwidth requirement (from O(V) to O(n+k+1)), but also fits the sparse training framework and accelerates backpropagation.

[0013] The loss function proposed in this invention makes pre-training and RLHF more continuous in optimizing the manifold, significantly reducing the "alignment tax" and enabling the model to adjust the policy more smoothly during the reinforcement learning stage without destroying existing knowledge representations.

[0014] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0015] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0016] Figure 1 is an overall framework diagram of the pre-trained language model training method based on smoothing and negative sampling as described in Embodiment 1 of the present invention.

[0017] Figure 2 is a schematic diagram of sparse gradient backpropagation as described in Embodiment 1 of the present invention.

Detailed Implementation

[0018] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0019] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0020] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0021] The current mainstream large model training paradigm consists of three stages: 1. Pre-training: Using a large amount of unlabeled text, the likelihood probability of the text sequence is maximized through the Next Token Prediction (NTP) task.

[0022] 2. Supervised Fine-Tuning (SFT): Using high-quality instruction-response pairs, continue to fine-tune the model with NTP tasks, enabling it to learn to follow human instructions.

[0023] 3. Reinforcement Learning from Human Feedback (RLHF) or Preference Optimization: Using algorithms such as PPO, DPO, GRPO, and GSPO, the model's policy is adjusted to align with human values (such as usefulness, honesty, and safety) based on human rankings or ratings of the model's responses.

[0024] In the pre-training and SFT stages, industry and academia almost universally adopt the cross-entropy loss function. Its principle is as follows: assume the model's vocabulary size is $V$ (typically between 30,000 and 250,000). For a given context sequence, the model outputs an unnormalized logits vector $z \in \mathbb{R}^{V}$ for the next token. The traditional training process is as follows. First, the logits are transformed into a probability distribution $p$ using the Softmax function:

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}$$

[0025] Then, the negative log-likelihood loss corresponding to the ground-truth label $y$ is calculated, where $z$ is the logits vector and $p_y$ is the Softmax probability assigned to the label:

$$\mathcal{L}_{CE} = -\log p_y = -z_y + \log \sum_{j=1}^{V} \exp(z_j)$$

[0026] This mathematical form originates from Maximum Likelihood Estimation (MLE) and the concept of entropy in information theory. Its physical meaning is to maximize the probability of the true label while minimizing the probabilities of all other non-label tokens. The Softmax function ensures that the sum of all probabilities is 1, forming a valid probability distribution.
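As a point of reference for the baseline being replaced, the following is a minimal pure-Python sketch of Softmax cross-entropy for a single token position. The three-entry vocabulary and the names `toy_logits`, `toy_probs`, and `toy_loss` are illustrative assumptions, not values from the patent.

```python
import math

def softmax(logits):
    """Turn unnormalized logits into a probability distribution."""
    shift = max(logits)  # subtracting the max leaves the result unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)  # partition term: touches every vocabulary entry
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-likelihood of the ground-truth index `label`."""
    return -math.log(softmax(logits)[label])

toy_logits = [2.0, 1.0, 0.1]  # hypothetical 3-token vocabulary
toy_probs = softmax(toy_logits)
toy_loss = cross_entropy(toy_logits, label=0)
```

Note that both the normalization and the loss depend on every entry of the logits vector, which is the property the later paragraphs identify as the bottleneck.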

[0027] In the RLHF phase, the objective function shifts from simply "predicting the next word" to "maximizing the reward model's score." To prevent the model from forgetting its original language capabilities or experiencing mode collapse when optimizing rewards, a Kullback-Leibler divergence penalty is typically introduced to constrain how far the current policy deviates from the pre-trained reference policy. The optimization objective of the pre-training stage (fitting the data distribution) differs fundamentally from that of the RLHF stage (maximizing preference scores), leading to the so-called "alignment tax": after forcibly aligning with human preferences, the model experiences performance degradation in certain basic capabilities (such as long text generation and multilingual capabilities).

[0028] Although the cross-entropy loss function is the cornerstone of training large models, its inherent limitations become increasingly apparent as model size (number of parameters) and vocabulary size continue to increase and as requirements for model inference efficiency and alignment quality rise, making it a key bottleneck restricting the further development of large models. Specifically, these limitations include the following: (i) The Softmax bottleneck of computation and memory bandwidth. In modern GPU hardware architectures (such as the NVIDIA H100), the growth rate of computing power (FLOPS) far exceeds the growth rate of memory bandwidth. The output layer of large models (logits computation and Softmax) has become one of the main bottlenecks in training and inference, known as the "Softmax bottleneck".

[0029] 1. Extremely high memory bandwidth consumption: Computing the partition term of Softmax, $\sum_{j} \exp(z_j)$, requires reading all logits over the entire vocabulary. For a vocabulary of size $V$, a micro-batch size $B$, and a sequence length $T$, storing the logits matrix alone requires $B \times T \times V \times 2$ bytes of video memory in FP16.
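To make the scale concrete, the arithmetic can be sketched as below. The batch size, sequence length, and vocabulary size are hypothetical values chosen for illustration; the patent's own example figures are not reproduced in this text.

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_value=2):
    """Bytes needed to store a full logits matrix (FP16 uses 2 bytes per value)."""
    return batch_size * seq_len * vocab_size * bytes_per_value

# Hypothetical configuration, not taken from the patent.
example_bytes = logits_bytes(batch_size=8, seq_len=4096, vocab_size=128_000)
example_gib = example_bytes / 2**30
```

Under these assumed values the logits matrix alone occupies 8,388,608,000 bytes, about 7.8 GiB, before any gradient buffer of the same shape is allocated.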

[0030] When calculating gradients through backpropagation, the gradients generated by Softmax are dense. According to the chain rule, the gradient of the cross-entropy loss with respect to each logit is $\partial \mathcal{L}_{CE} / \partial z_i = p_i - \mathbb{1}[i = y]$, where $p_i$ is the Softmax probability and $y$ is the label index. This means that for each token position, the model needs to calculate and propagate $V$ gradient values. Even though the predicted probability of the vast majority of non-label tokens is extremely low, their gradients must still be calculated and stored. This results in huge memory access overhead, and in distributed training it leads to huge communication overhead (all-reduce), severely limiting the improvement of training efficiency.
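The density of this gradient can be checked with a short sketch. The four-entry logits vector is an illustrative assumption; the point is only that every vocabulary position receives a non-zero gradient, however small.

```python
import math

def softmax_ce_grad(logits, label):
    """Gradient of Softmax cross-entropy w.r.t. the logits: p - one_hot(label)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    grad = [e / total for e in exps]
    grad[label] -= 1.0
    return grad

dense_grad = softmax_ce_grad([3.0, 1.0, -2.0, -5.0], label=0)
nonzero_count = sum(1 for g in dense_grad if g != 0.0)
```

Here `nonzero_count` equals the vocabulary size, even though the last two entries carry almost no probability mass.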

[0031] 2. Computational cost and numerical stability: The exponentiation operation exp is a transcendental function, and its instruction cycle at the hardware level is much longer than that of addition, subtraction, multiplication, and division. For large vocabularies, performing the exp operation on each Logit constitutes a significant computational burden.

[0032] To prevent numerical overflow, the Log-Sum-Exp technique (subtracting the maximum value) is commonly used in engineering. This introduces an additional max scan operation, which further increases the number of memory accesses (it needs to traverse once to find the maximum value, and then traverse once to sum).
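The two traversals mentioned above can be seen in a minimal sketch of the Log-Sum-Exp technique. The input `big` is an illustrative assumption chosen so that the naive computation overflows in double precision.

```python
import math

def log_sum_exp(logits):
    """Two-pass LogSumExp: one scan for the max, one scan for the sum."""
    peak = max(logits)                               # pass 1: find the maximum
    total = sum(math.exp(z - peak) for z in logits)  # pass 2: shifted sum
    return peak + math.log(total)

big = [1000.0, 1000.0]
try:
    naive = math.log(sum(math.exp(z) for z in big))
except OverflowError:
    naive = None  # math.exp(1000.0) exceeds the float range
stable = log_sum_exp(big)
```

The stable version returns a finite value where the naive one fails, at the cost of reading the logits twice.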

[0033] (ii) Label sparsity and overconfidence. The mathematical property of cross-entropy loss is that it forces the probability corresponding to the label to 1 and pushes the probabilities of all other non-label tokens to 0. This "hard" alignment has significant problems in natural language processing: 1. Unnatural sparsity of semantic space: In natural language, given a context, there are often multiple reasonable follow-up words (synonyms, reasonable grammatical variations). For example, "I am very __ today" can be followed by "happy", "joyful", or "pleasant", all of which are reasonable.

[0034] Cross-entropy forces the model to recognize only the one-hot label in the training data, suppressing other semantically similar words. This results in an overly "sparse" and "rigid" representation space learned by the model, with an extremely sharp distribution of Logits, which is not conducive to capturing the rich semantic structure and synonym relationships of language.

[0035] 2. Overconfidence and lack of self-regulation: Because the model is trained to approximate a one-hot distribution, it often becomes overconfident in its predictions (logits vary greatly), tending to output extremely high probability values even in ambiguous situations. This lack of calibration means that the model's confidence score cannot truly reflect its prediction accuracy and also limits the diversity of sampling and decoding.

[0036] Existing label smoothing techniques attempt to alleviate this problem by lowering the label probability to $1 - \epsilon$. However, the smoothing mass is usually distributed evenly across all non-label words, which fails to distinguish between "reasonable suboptimal words" and "completely wrong words," and thus does not fundamentally solve the problem of semantic modeling.
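For comparison, conventional uniform label smoothing can be sketched as follows. The value `eps=0.1` and the five-entry vocabulary are illustrative assumptions.

```python
def smoothed_target(vocab_size, label, eps=0.1):
    """Uniform label smoothing: 1 - eps on the label, eps shared by all others."""
    off_value = eps / (vocab_size - 1)
    target = [off_value] * vocab_size
    target[label] = 1.0 - eps
    return target

uniform_target = smoothed_target(vocab_size=5, label=2)
```

Every non-label entry receives the same mass, so a plausible synonym and an unrelated token are treated identically, which is the weakness described above.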

[0037] (iii) The separation between pre-training and reinforcement learning objectives (Alignment Gap). There is a significant "gap" between the objective function in the pre-training and RLHF stages, which increases the difficulty and instability of RLHF.

[0038] 1. Inconsistent optimization directions: Pre-training is imitation learning (fitting the distribution), where the model passively learns the statistical regularities of the data.

[0039] RLHF is policy optimization (competitive ranking), where the model adjusts its policy by generating samples and receiving rewards. RLHF (especially methods like DPO) essentially optimizes the relative difference (margin) between logits to establish a dominant position for the winning response.

[0040] 2. Poor spatial plasticity and alignment tax: Because the cross-entropy loss during the pre-training phase causes the model to form a sharp winner-take-all Logits distribution, when RLHF attempts to adjust the policy (e.g., to make the model prefer to refuse to answer in certain situations), the model must overcome the strong inertia (extremely high Logits values) formed during the pre-training phase.

[0041] Such large adjustments often require large gradient updates, which can easily disrupt the general knowledge representations learned by the model during pre-training, leading to a decline in general capabilities, i.e., generating an "alignment tax".

[0042] The KL divergence constraint is extremely sensitive to large Logits differences, leading to instability in the RL training process.

[0043] (iv) The inefficiency and waste of resources in gradient calculation; In backpropagation, the cross-entropy loss produces a gradient for every token in the vocabulary. However, for the vast majority of completely irrelevant tokens (e.g., "car" appearing when predicting "apple"), their probabilities are extremely low, their gradient contribution is negligible and often noise. Computing and updating these tiny full gradients is a huge waste of expensive GPU computing power. Although there have been some studies on sparse training (such as Sparsemax and Gumbel-Softmax), they typically introduce complex projection operations or non-differentiable sampling steps, making them difficult to deploy efficiently in large-scale distributed training.

[0044] In summary, existing training systems based on Softmax cross-entropy have significant drawbacks in terms of computational efficiency, smoothness of the model representation space, and compatibility with subsequent alignment tasks.

[0045] Embodiment 1. In one or more embodiments, in order to reduce memory bandwidth pressure, support sparse gradient updates, and provide a better initialization state for RLHF through loss function design, this invention discloses a pre-trained language model training method based on smoothing and negative sampling. As shown in Figure 1, it includes the following steps: Step S1: Obtain the language model to be trained and the text data, wherein the text data includes the feature vectors of the training text and the ground-truth labels of the training text.

[0046] In this embodiment, the text data includes, but is not limited to, diagnostic-related outpatient text data, which can be analyzed using this language model to assist in generating diagnostic reports.

[0047] Furthermore, the feature vector of the training text is the unnormalized logits vector output for the current input context, denoted as $z$.

[0048] Step S2: Input the text data into the language model to be trained, use the Transformer model and linear projection layer to obtain the unnormalized prediction score vector of the whole vocabulary, select several maximum values from the unnormalized prediction score vector to generate a first set, randomly sample several negative sample tokens from the whole vocabulary to generate a second set, and calculate the loss function of the current token based on the first set and the second set; the loss function includes a positive sample enhancement term and a negative sample suppression term using label smoothing.

[0049] This embodiment performs parallel processing for each token position in each training step.

[0050] Step S2-1: Input the text data into the language model to be trained, perform forward propagation, and generate unnormalized prediction score vectors (Logits) of the entire vocabulary.

[0051] The input context sequence (i.e., the text data) is processed by the Transformer model (e.g., its attention and FFN layers) to yield the hidden-state output vector $h \in \mathbb{R}^{d}$ at the current position, where $d$ is the hidden layer dimension. The final linear projection layer (unembedding layer), with weight matrix $W$, then yields the unnormalized logits vector over the entire vocabulary, $z \in \mathbb{R}^{V}$, where $V$ is the vocabulary size.

[0052] Step S2-2: Top-n selection and calculation of the suboptimal score. From the unnormalized prediction score vector $z$, the $n$ largest logits are selected, and their indices are denoted as the first set $S_{top}$.

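[0053] Then, calculate the suboptimal score as the mean of the Top-n logits:

$$s_{sub} = \frac{1}{n} \sum_{i \in S_{top}} z_i$$

[0054] where $s_{sub}$ represents the suboptimal score, $S_{top}$ is the set of indices of the $n$ largest logits in $z$, and $n$ is a constant parameter that typically takes a small value.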
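A minimal sketch of this step, assuming the suboptimal score is the arithmetic mean of the `n` largest logits as the surrounding text describes. The function name `top_n_mean` and the example values are hypothetical; `heapq.nlargest` stands in for the on-chip selection discussed in the next paragraph.

```python
import heapq

def top_n_mean(logits, n):
    """Return the indices of the n largest logits and their mean."""
    top = heapq.nlargest(n, range(len(logits)), key=logits.__getitem__)
    s_sub = sum(logits[i] for i in top) / n
    return top, s_sub

example_top, example_s_sub = top_n_mean([5.0, 1.0, 3.0, 4.0, -2.0], n=2)
```

For the example vector, the selected indices are 0 and 3 and the suboptimal score is their mean, 4.5.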

[0055] Preferably, the selection of the $n$ largest logits is performed using block-wise parallel scanning and fused kernel technology in the on-chip cache (such as SRAM or L2 cache) of a hardware accelerator (such as a GPU or TPU). Only the filtered sparse Top-n results are output to global memory (HBM); the complete logits vector is not written to global memory, which reduces video memory bandwidth usage.

[0056] In this embodiment, to improve efficiency, this step can be performed in the GPU's SRAM (Shared Memory) through block-wise reduction, without the need for global sorting of the entire vocabulary.

[0057] Step S2-3: Negative sampling and calculation. Based on a preset sampling strategy, $k$ negative sample tokens are randomly sampled from the vocabulary of the language model, and their indices are denoted as the second set $S_{neg}$.

[0058] The sampling strategy is to exclude from the vocabulary the correct label (the index whose logit corresponds to the ground-truth label) and the first set from step S2-2, and then to sample uniformly at random, i.e., to draw with equal probability from the indices that are neither the label nor in the Top-n set.

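[0059] Then, calculate the negative score as the mean of the negative-sample logits:

$$s_{neg} = \frac{1}{k} \sum_{j \in S_{neg}} z_j$$

[0060] where $s_{neg}$ is the negative score, $S_{neg}$ is the set of sampled negative indices, and the number of negative samples $k$ is a hyperparameter.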
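The sampling rule and the negative score can be sketched as below, assuming uniform sampling without replacement. The function names and the toy sizes are hypothetical, and building the full candidate list is a simplification for clarity; a large-vocabulary implementation would more likely use rejection sampling.

```python
import random

def sample_negatives(vocab_size, label, top_indices, k, rng):
    """Uniformly sample k indices that are neither the label nor in Top-n."""
    excluded = set(top_indices) | {label}
    candidates = [i for i in range(vocab_size) if i not in excluded]
    return rng.sample(candidates, k)  # without replacement

def negative_score(logits, neg_indices):
    """Mean logit over the sampled negative indices."""
    return sum(logits[j] for j in neg_indices) / len(neg_indices)

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
example_negs = sample_negatives(10, label=2, top_indices=[0, 1], k=4, rng=rng)
```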

[0061] Step S2-4: Calculate the loss function of the current token based on the first set and the second set or the suboptimal item score and the negative item score; the loss function includes a positive sample enhancement term and a negative sample suppression term with label smoothing.

[0062] Specifically, the loss value of the current token is calculated according to the following formula:

[0063] where $m$ is the margin threshold (hyperparameter), with a typical value of 1.0; $\lambda$ is the negative sampling weight (hyperparameter); Baseline is a dynamic baseline; and $z_y$ is the logit corresponding to the ground-truth label.

[0064] In this formula, the first term is the positive sample enhancement term: $\max\left(0,\; m - (z_y - s_{sub})\right)$, where $z_y$ is the label logit, $s_{sub}$ is the Top-n mean, and $m$ is the margin.

[0065] This requires the logit of the correct label, $z_y$, to be higher than the Top-n mean by a preset margin $m$. If $z_y$ is already in the Top-n and its value is large enough, the loss of this term is 0, which allows $z_y$ to coexist with other high-confidence tokens and thereby achieves label smoothing. This is essentially a pairwise ranking loss, which widens the gap between the label and the suboptimal group but sets an upper limit on that gap to prevent over-optimization.

[0066] The second term is the negative sample suppression term.

[0067] Here, Baseline is a dynamic baseline. The term requires that the mean of the negative samples be significantly lower than the average of the label logit and the Top-n mean. This acts similarly to the denominator of a Softmax function, suppressing the logits of irrelevant words and preventing numerical drift. The negative sampling weight is used to control the suppression strength.

[0068] Furthermore, the complete form of the loss function is:

[0069] The first term is the positive sample enhancement term, and the second term is the negative sample suppression term. Here, $S_{top}$ is the set of indices of the $n$ largest values of $z$ (Top-n indices); $S_{neg}$ is the set of $k$ indices randomly sampled from the vocabulary; $z_y$ is the logit corresponding to the ground-truth label; $y$ is the index of the ground-truth label; and $z$ is the output logits vector of the model.
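Because the formula itself is not reproduced in this text, the following is only a sketch of one loss that is consistent with the verbal description, not the patent's exact expression. It assumes a hinge-style positive term `max(0, margin - (z_y - s_sub))`, a dynamic baseline equal to the average of the label logit and the Top-n mean, and a negative term `lam * max(0, s_neg - baseline)`. The function name, the default `lam=0.5`, and the toy logits are all hypothetical.

```python
import heapq
import random

def ranking_loss(logits, label, n, k, margin=1.0, lam=0.5, rng=None):
    """Assumed Top-n mean plus negative sampling loss for one token position."""
    rng = rng or random.Random(0)
    top = heapq.nlargest(n, range(len(logits)), key=logits.__getitem__)
    s_sub = sum(logits[i] for i in top) / n            # suboptimal score
    excluded = set(top) | {label}
    candidates = [i for i in range(len(logits)) if i not in excluded]
    neg = rng.sample(candidates, k)
    s_neg = sum(logits[j] for j in neg) / k            # negative score
    z_y = logits[label]
    positive = max(0.0, margin - (z_y - s_sub))        # assumed hinge form
    baseline = 0.5 * (z_y + s_sub)                     # assumed dynamic baseline
    negative = lam * max(0.0, s_neg - baseline)        # assumed suppression term
    return positive + negative, top, neg

toy = [10.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -4.0]
loss_ok, _, _ = ranking_loss(toy, label=0, n=2, k=3)   # label clears the margin
loss_bad, _, _ = ranking_loss(toy, label=3, n=2, k=3)  # label far below Top-n mean
```

In this sketch no exponential and no sum over the full vocabulary appears; only the label, the `n` Top-n entries, and the `k` sampled negatives enter the loss.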

[0070] Step S3: Iteratively update the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

[0071] Specifically, sparse gradient backpropagation is performed as shown in Figure 2: the gradients of the loss function with respect to the label logit, the Top-n logits, and the negative-sample logits are calculated. The backpropagation process only calculates and updates the gradient for the target label index, the $n$ selected Top-n indices, and the $k$ sampled negative sample indices. The gradients at the remaining unselected vocabulary positions are not calculated, thus achieving sparse gradient updates. That is, for the vast majority of tokens in the vocabulary that are not selected (neither label, nor Top-n, nor Neg-k), the gradient is always 0. Efficient update: only the columns of the weight matrix corresponding to the sparse indices involved are updated. In GPU implementations, the associated overhead can be greatly reduced by using a custom fused kernel.

[0072] In this embodiment, the calculation of the loss function and the backpropagation process involve neither exponential (exp) operations nor full-summation normalization over the complete logits vector.
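The sparsity pattern can be illustrated with a sketch that differentiates an assumed form of the loss: a hinge positive term `max(0, margin - (z_y - s_sub))` plus `lam * max(0, s_neg - baseline)` with `baseline = (z_y + s_sub) / 2`. These forms, the function name, and `lam=0.5` are assumptions made for illustration; the index sets are treated as constants, since selection and sampling are not differentiated.

```python
def sparse_grad(logits, label, top, neg, margin=1.0, lam=0.5):
    """Gradient of the assumed loss as a dict {vocab index: value}."""
    n, k = len(top), len(neg)
    z_y = logits[label]
    s_sub = sum(logits[i] for i in top) / n
    s_neg = sum(logits[j] for j in neg) / k
    grad = {}

    def add(index, value):
        grad[index] = grad.get(index, 0.0) + value

    if margin - (z_y - s_sub) > 0:          # positive term is active
        add(label, -1.0)
        for i in top:
            add(i, 1.0 / n)
    if s_neg - 0.5 * (z_y + s_sub) > 0:     # negative term is active
        add(label, -0.5 * lam)
        for i in top:
            add(i, -0.5 * lam / n)
        for j in neg:
            add(j, lam / k)
    return grad

toy = [10.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -4.0]
toy_grad = sparse_grad(toy, label=3, top=[0, 1], neg=[5, 6, 7])
```

The returned dictionary never holds more than `n + k + 1` entries, and every index absent from it has a gradient of exactly zero.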

[0073] This invention proposes a hybrid ranking loss function based on Top-K Logits Mean Smoothing and Dynamic Negative Sampling, aiming to replace the traditional Softmax cross-entropy loss as the core objective function in the pre-training and SFT stages of large models. During the pre-training stage, a feature space with a smooth Logit distribution is established, enabling the model to adapt to preference data with smaller parameter adjustments in subsequent Human Feedback Reinforcement Learning (RLHF) or Direct Preference Optimization (DPO) stages, thereby reducing alignment tax.

[0074] As one implementation method, in order to fully leverage the advantages of the present invention and avoid the overhead of loops at the Python level, this embodiment adopts a hardware acceleration strategy, which requires implementing a fused ranking kernel (FusedRanking Kernel) at the underlying level (CUDA / Triton), including: 1. Forward-pass optimization (cut-logits strategy): Traditional implementations first compute the complete logits matrix, which instantly consumes a large amount of GPU memory. This invention employs the following design. Input: the output of the last hidden layer and the output weight matrix.

[0075] The kernel logic includes: Tiling: The vocabulary dimension $V$ is divided into blocks (e.g., 1024 tokens per block). Each GPU thread block is responsible for calculating a portion of the logits.

[0076] On-Chip Top-K: While calculating Logits, a local Top-n heap is maintained using registers or shared memory. Logits are discarded after calculation and are not written to HBM (Global Memory); only the Top-n values and indices are retained.

[0077] Global Reduction: After all blocks have been calculated, a global reduction is performed and the blocks are merged to obtain the final global Top-n.

[0078] Label and Negative Extraction: The label logit must be calculated and retained separately. Candidate negative sample indices (excluding the label) are generated by the CPU before the kernel starts. Once the Top-n indices have been obtained, they are removed from the candidates, the first $k$ remaining indices are taken, and logits are then calculated only for those specific indices.

[0079] Finally, only the final loss scalar, along with the sparse indices used in the computation and their corresponding values, are output. The memory usage of the output for a single token position drops from $O(V)$ to $O(n+k+1)$, a decrease that can reach two orders of magnitude.
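The tiling and global reduction steps can be mimicked on the CPU to show why per-block Top-n followed by a merge yields the global Top-n. This is a plain-Python illustration, not a CUDA or Triton kernel; the function name, block size, and synthetic `demo` logits are hypothetical.

```python
import heapq

def tiled_top_n(logits, n, block_size):
    """Per-block Top-n followed by a global merge of the block winners."""
    candidates = []
    for start in range(0, len(logits), block_size):
        block = range(start, min(start + block_size, len(logits)))
        # Each block keeps only its own n best indices; the rest is discarded.
        candidates.extend(heapq.nlargest(n, block, key=logits.__getitem__))
    # Global reduction over at most n * num_blocks candidates.
    return heapq.nlargest(n, candidates, key=logits.__getitem__)

demo = [float((i * 37) % 307) for i in range(300)]  # 300 distinct values
tiled = tiled_top_n(demo, n=5, block_size=64)
reference = heapq.nlargest(5, range(len(demo)), key=demo.__getitem__)
```

Any element of the global Top-n must also be in the Top-n of its own block, so the merge loses nothing while only `n` values per block ever leave the block.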

[0080] 2. Backward-pass optimization (sparse gradient update): Since forward propagation does not compute the complete logits matrix, backpropagation also does not require computing the full gradient.

[0081] This embodiment uses sparse gradient calculation. The derivative of the loss with respect to the logits has a non-zero gradient only at the following positions: the label position $y$;

[0082] the Top-n positions $i \in S_{top}$;

[0083] the negative sample positions $j \in S_{neg}$.

[0084] When updating weights, these sparse gradients are used directly to update the corresponding rows of the weight matrix, or to calculate the gradient with respect to the hidden state. This avoids a huge dense matrix multiplication and greatly reduces compute consumption.
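A sketch of such a sparse update, assuming the logits are computed as `z[i] = dot(W[i], h)` so that each vocabulary index owns one row of `W`. The plain SGD step, the function name, and the toy values are illustrative assumptions.

```python
def sparse_update(W, h, grad_z, lr):
    """Apply an SGD step to the touched rows of W and return dL/dh.

    W: list of rows, one per vocabulary index; h: hidden vector;
    grad_z: dict {vocab index: dL/dz_i} holding only the non-zero entries.
    """
    grad_h = [0.0] * len(h)
    for i, g in grad_z.items():
        row = W[i]
        for c in range(len(h)):
            grad_h[c] += g * row[c]   # dL/dh, using the pre-update weight
            row[c] -= lr * g * h[c]   # dL/dW[i][c] = g * h[c]
    return grad_h

W_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grad_h_toy = sparse_update(W_toy, h=[2.0, 3.0], grad_z={0: -1.0, 2: 0.5}, lr=0.1)
```

Only rows 0 and 2 are read or written here; row 1 is untouched, which is the saving relative to a dense matrix product over the whole vocabulary.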

[0085] 3. Hyperparameter setting and training stability control, including: Top-n size n: a smaller n (e.g., 4) brings the model closer to the one-hot target and gives faster convergence; a larger n (e.g., 16) strengthens the smoothing effect and increases model diversity. This embodiment uses dynamic adjustment: a smaller n can be used early in training to accelerate learning, and a larger n later to enhance robustness.

[0086] Number of negative samples: for the sampling strategy, a value such as 1024 is recommended.

[0087] Margin and its warm-up: the margin determines how much higher the label logit needs to be than the Top-n mean. It starts from 0.1 and is gradually increased to 1.0 as training progresses. This is similar to Curriculum Learning and prevents the model from collapsing due to excessively large gradients in the early stages.
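The schedules described above might be sketched as follows. The linear shape of the margin warm-up and the single mid-training switch of n are assumptions; the text only states the start and end values (0.1 to 1.0) and the example sizes (4 and 16). The function names are hypothetical.

```python
def margin_at(step, total_steps, start=0.1, end=1.0):
    """Linearly warm the margin up from start to end over training (assumed shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def top_n_at(step, total_steps, small=4, large=16):
    """Use a small n early and a large n later (assumed single switch at halfway)."""
    return small if step < total_steps // 2 else large


schedule = [(s, round(margin_at(s, 100), 3), top_n_at(s, 100)) for s in (0, 25, 50, 100)]
```

Any monotone schedule would fit the description; a linear ramp is simply the smallest example.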

[0088] 4. Alternative evaluation metrics: since this embodiment no longer uses Softmax, the traditional "Perplexity" (PPL) metric cannot be calculated directly, because no normalized probability distribution is available. The following alternative metrics should be used for monitoring when implementing this invention: Top-1 / Top-5 Accuracy: still applicable, based on ranking the logits.

[0089] Ranking Accuracy: the proportion of cases in which the label ranks above the negative samples.

[0090] Margin Satisfaction Rate: the proportion of samples that satisfy the margin condition.
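These monitoring metrics could be computed along the following lines. The precise definitions are not preserved in the text, so this sketch assumes that ranking accuracy is measured over (label, negative) pairs and that the margin condition is "label logit minus Top-n mean is at least the margin". The function names and the tuple layout are hypothetical.

```python
def top_k_hit(logits, label, k):
    """True if the label is among the k highest-scoring vocabulary entries."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return label in ranked[:k]


def ranking_accuracy(examples):
    """Fraction of (label, negative) pairs where the label logit is higher."""
    wins = total = 0
    for z_label, _top_logits, neg_logits in examples:
        wins += sum(z_label > z for z in neg_logits)
        total += len(neg_logits)
    return wins / total


def margin_satisfaction_rate(examples, margin):
    """Fraction of positions where the label beats the top-n mean by the margin."""
    ok = 0
    for z_label, top_logits, _neg_logits in examples:
        ok += (z_label - sum(top_logits) / len(top_logits)) >= margin
    return ok / len(examples)


# Each example: (label logit, top-n logits, negative-sample logits).
examples = [(3.0, [2.0, 1.0], [0.0, 4.0]), (1.0, [2.0, 2.0], [0.5, 0.0])]
rank_acc = ranking_accuracy(examples)
margin_rate = margin_satisfaction_rate(examples, margin=1.0)
```

Top-k accuracy needs the full logits row, so it is better suited to evaluation than to the fused training kernel.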

[0091] This invention provides a pre-training method that decouples probability normalization from relative preference learning. Training a large model essentially teaches the model that "the score of a correct token is significantly higher than that of an incorrect token"; it does not necessarily require calculating an exact normalized probability distribution. By abandoning mandatory global normalization, significant computational advantages and superior representation space characteristics can be achieved. Therefore, the above technical solution achieves: 1. Traditional cross-entropy compares the target logit with the LogSumExp of the full-vocabulary logits. This invention completely eliminates the Softmax calculation and instead compares the label logit with the mean of the top-n largest logits predicted by the model. This is a dynamic, "soft" alignment strategy. The model does not need to push the label probability to 1 (i.e., push the logit toward infinity); it only needs to ensure that the label logit is higher than the average level of the "suboptimal group". This design naturally prevents logit explosion and overconfidence, preserves the competition between semantically similar words (which tend to appear in the top-n), and makes the underlying semantic space smoother and denser. This effectively achieves adaptive label smoothing at the logit level.

[0092] 2. To simulate the suppressive effect of Softmax on non-label tokens (i.e., to prevent non-label logits from growing indefinitely), this invention introduces a negative sampling term. Unlike Softmax, which requires computation over the entire vocabulary, this invention randomly samples only a limited number of negative samples and forces the mean logit of these negative samples to be significantly lower than the mixed baseline of the positive and suboptimal samples. This changes the gradient of the loss function from dense to highly sparse. It not only greatly reduces the demand for GPU memory bandwidth but also naturally fits sparse training frameworks, accelerating backpropagation. This design revives the negative sampling idea of the Word2Vec era and re-adapts it to the output layer of the Transformer architecture for large models, addressing the memory wall problem under large vocabularies.
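The two terms described above can be combined in a minimal per-position sketch. The exact formula is not preserved in the text, so the hinge (ReLU) form, the baseline as the average of the label logit and the top-n mean, and the names `ranking_loss`, `margin`, and `lam` are all assumptions.

```python
def relu(x):
    return x if x > 0 else 0.0


def ranking_loss(z_label, top_logits, neg_logits, margin, lam):
    """Assumed per-position loss: positive margin term + weighted negative term."""
    s_top = sum(top_logits) / len(top_logits)   # suboptimal score (top-n mean)
    s_neg = sum(neg_logits) / len(neg_logits)   # negative score (sample mean)
    baseline = 0.5 * (z_label + s_top)          # assumed dynamic baseline
    positive_term = relu(margin + s_top - z_label)  # label should beat top-n mean
    negative_term = relu(s_neg - baseline)          # negatives should sit below baseline
    return positive_term + lam * negative_term


violating = ranking_loss(1.0, [2.0, 1.0], [2.0, 1.0], margin=0.5, lam=1.0)
satisfied = ranking_loss(5.0, [2.0, 1.0], [0.0, -1.0], margin=0.5, lam=1.0)
```

When both conditions hold, the loss and its gradient are exactly zero, which is what keeps the label logit from being pushed toward infinity.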

[0093] 3. The form of the aforementioned loss function (ReLU Margin + Ranking) is mathematically closer to the objective function of Contrastive Learning and Direct Preference Optimization (DPO). By introducing this margin-based ranking mechanism during the pre-training stage, the logits distribution learned by the model is no longer a "winner-takes-all" spike, but rather retains a structure with clear boundaries between the "winning group" and the "elimination group." This structure provides excellent plasticity for subsequent RLHF. Because RLHF essentially adjusts the relative ranking of logits, the loss function of this invention makes pre-training and RLHF more continuous in terms of optimization manifold, significantly reducing the "alignment tax," allowing the model to adjust the policy more smoothly during the reinforcement learning stage without destroying existing knowledge representations.

[0094] Embodiment 2 In one or more embodiments, a pre-trained language model training system based on smoothing and negative sampling is disclosed, specifically including: a text acquisition module configured to acquire the language model to be trained and the text data, wherein the text data includes the feature vectors of the training text and the true labels of the training text; a prediction loss module configured to: input the text data into the language model to be trained; use a Transformer model and a linear projection layer to obtain an unnormalized prediction score vector over the entire vocabulary; select several maximum values from the unnormalized prediction score vector to generate a first set; randomly sample several negative sample words from the entire vocabulary to generate a second set; and calculate the loss function for the current word based on the first set and the second set, wherein the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and an iterative training module configured to iteratively update the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

[0095] Embodiment 3 This embodiment provides an electronic device, including a memory and a processor, as well as computer instructions stored in the memory and executable on the processor. When the computer instructions are executed by the processor, they complete the steps of the above-described pre-trained language model training method based on smoothing and negative sampling.

[0096] Embodiment 4 This embodiment provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, complete the steps of the above-described pre-trained language model training method based on smoothing and negative sampling.

[0097] Embodiment 5 This embodiment provides a computer program product including executable instructions, which are computer instructions; the executable instructions are stored in a computer-readable storage medium. When the processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes them, the electronic device performs the steps of the pre-trained language model training method based on smoothing and negative sampling provided in this embodiment.

[0098] The steps and methods involved in Embodiments 2 to 5 above correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.

[0099] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.

[0100] The above description is only a preferred embodiment of the present invention. Although the specific implementation of the present invention has been described in conjunction with the accompanying drawings, it is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that, based on the technical solution of the present invention, various modifications or variations that can be made by those skilled in the art without creative effort are still within the scope of protection of the present invention.

Claims

1. A pre-trained language model training method based on smoothing and negative sampling, characterized in that it includes: obtaining the language model to be trained and the text data, wherein the text data includes the feature vectors of the training text and the true labels of the training text; inputting the text data into the language model to be trained; using a Transformer model and a linear projection layer to obtain an unnormalized prediction score vector over the entire vocabulary; selecting several maximum values from the unnormalized prediction score vector to generate a first set; randomly sampling several negative sample words from the entire vocabulary to generate a second set; calculating the loss function for the current word based on the first set and the second set, wherein the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and iteratively updating the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

2. The pre-trained language model training method based on smoothing and negative sampling as described in claim 1, characterized in that the loss function is defined in terms of the following quantities: the margin threshold; the negative sampling weight; the dynamic baseline (Baseline); the logit corresponding to the target's true label; the negative-term score; and the suboptimal-term score.

3. The pre-trained language model training method based on smoothing and negative sampling as described in claim 2, characterized in that the dynamic baseline is used to ensure that the mean of the negative samples is lower than the average level of the positive sample and the mean of the first set.

4. The pre-trained language model training method based on smoothing and negative sampling as described in claim 2, characterized in that the suboptimal-term score is computed over the first set, where n is a constant parameter.

5. The pre-trained language model training method based on smoothing and negative sampling as described in claim 2, characterized in that the negative-term score is computed over the second set using a hyperparameter.

6. The pre-trained language model training method based on smoothing and negative sampling as described in claim 1, characterized in that the first set is generated by selecting several maximum values from the unnormalized prediction score vector, and the second set is generated by randomly sampling several negative sample words from the full vocabulary, wherein the sampling strategy is to draw words with equal probability while excluding the first set and the correct label.

7. A pre-trained language model training system based on smoothing and negative sampling, characterized in that it includes: a text acquisition module configured to acquire the language model to be trained and the text data, wherein the text data includes the feature vectors of the training text and the true labels of the training text; a prediction loss module configured to: input the text data into the language model to be trained; use a Transformer model and a linear projection layer to obtain an unnormalized prediction score vector over the entire vocabulary; select several maximum values from the unnormalized prediction score vector to generate a first set; randomly sample several negative sample words from the entire vocabulary to generate a second set; and calculate the loss function for the current word based on the first set and the second set, wherein the loss function includes a positive sample enhancement term using label smoothing and a negative sample suppression term; and an iterative training module configured to iteratively update the parameters of the pre-trained language model based on the loss function to obtain the trained pre-trained language model.

8. An electronic device, characterized in that it includes a memory and a processor, as well as computer instructions stored in the memory and executable on the processor, which, when executed by the processor, complete the pre-trained language model training method based on smoothing and negative sampling as described in any one of claims 1-6.

9. A computer-readable storage medium, characterized in that it is used to store computer instructions which, when executed by a processor, complete the pre-trained language model training method based on smoothing and negative sampling as described in any one of claims 1-6.

10. A computer program product, characterized in that the computer program product includes executable instructions stored in a computer-readable storage medium; when the processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes them, the electronic device implements the pre-trained language model training method based on smoothing and negative sampling as described in any one of claims 1-6.