Large language model training method, device and electronic equipment

By having a large language model attach number labels and number sources to its output and applying the reward mechanism of reinforcement learning, the method addresses inaccurate numerical output and improves the accuracy of the numbers the model generates.

CN120632448B · Active · Publication Date: 2026-06-16 · BEIJING SANKUAI CLOUD COMPUTING TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING SANKUAI CLOUD COMPUTING TECH CO LTD
Filing Date: 2025-05-28
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Large language models suffer from hallucination in numerical output: generated numbers may be unrelated to, or inconsistent with, the source text. In addition, numerical accuracy may be sacrificed during reinforcement learning alignment.

Method used

A system prompt instructs the large language model to attach a preset number label and a number source to the numbers in its output. During training, reinforcement learning is performed with proximal policy optimization (PPO) or group relative policy optimization (GRPO), using a training reward formed from a first reward value and a second reward value, to improve the accuracy of the model's numerical output.
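
The labeling step can be illustrated with a minimal Python sketch. The summary does not specify a concrete label syntax, so the `<num src="...">` tag, the `SYSTEM_PROMPT` wording, and the `extract_numbers` helper are all hypothetical choices made for illustration only. The sketch extracts every number in a model output (N of them) and, separately, those that carry both the label and a source (M of them).

```python
import re

# Hypothetical system prompt; the patent summary does not give the actual wording.
SYSTEM_PROMPT = (
    'Wrap every number you output as <num src="FIELD">VALUE</num>, '
    "where FIELD names the input field the value was taken from."
)

# Assumed label syntax: a <num> tag (the "number label") with a src attribute
# (the "number source").
TAGGED = re.compile(r'<num src="([^"]+)">(-?\d+(?:\.\d+)?)</num>')
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_numbers(output: str):
    """Return (all_numbers, tagged); tagged is a list of (source, value) pairs."""
    tagged = [(src, float(val)) for src, val in TAGGED.findall(output)]
    # Replace each tag by its bare value so digits inside src names are not counted.
    plain = TAGGED.sub(lambda m: m.group(2), output)
    all_numbers = [float(x) for x in NUMBER.findall(plain)]
    return all_numbers, tagged


sample = 'Revenue was <num src="revenue">42.5</num> million, up from 40 million.'
all_numbers, tagged = extract_numbers(sample)
n_total, m_tagged = len(all_numbers), len(tagged)  # N = 2, M = 1
```

In this sample the untagged "40" counts toward N but not toward M, which is the kind of output the first reward value is meant to discourage.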

Benefits of technology

The method improves the accuracy of numbers output by large language models, reduces hallucination, and achieves this improvement in numerical output at low cost.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN120632448B_ABST
Patent Text Reader

Abstract

The present disclosure provides a large language model training method, device, and electronic equipment. The method comprises: setting control information for a large language model through the model's system prompt; inputting training data into the large language model, obtaining its output data, and extracting N numbers from the output data; obtaining a first reward value according to M, the count of numbers among the N that carry both a preset number label and a number source; for each of the M numbers, determining the corresponding standard value according to its number source, and obtaining a second reward value according to the result of comparing the standard value with the number; and performing reinforcement learning training on the large language model using proximal policy optimization or group relative policy optimization, wherein the training reward value used by either method is formed from the first reward value and the second reward value. The present disclosure can improve the accuracy of numbers generated by the large language model.
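
The reward construction described in the abstract can be sketched in Python under stated assumptions. The abstract says only that the first reward depends on M and N, that the second depends on comparing each number with its standard value, and that the training reward is formed from both. The ratio M/N, the exact-match fraction, the tolerance `tol`, and the weights `w1` and `w2` below are illustrative assumptions, not the patent's formulas. The last function shows the usual group relative normalization of rewards across a group of sampled outputs, here with the population standard deviation as one possible choice.

```python
from statistics import mean, pstdev


def first_reward(n_total: int, m_tagged: int) -> float:
    """Assumed form: fraction of the N numbers that carry a label and a source."""
    return m_tagged / n_total if n_total else 0.0


def second_reward(tagged, standard_values, tol: float = 1e-6) -> float:
    """Assumed form: fraction of tagged numbers equal to the standard value
    looked up by their source. `tagged` is a list of (source, value) pairs."""
    if not tagged:
        return 0.0
    hits = sum(
        1
        for src, val in tagged
        if src in standard_values and abs(val - standard_values[src]) <= tol
    )
    return hits / len(tagged)


def training_reward(r1: float, r2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Assumed combination: a weighted sum of the two reward values."""
    return w1 * r1 + w2 * r2


def group_relative_advantages(rewards):
    """Normalize each reward against the mean and spread of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]


# Hypothetical standard values keyed by number source.
standards = {"revenue": 42.5, "cost": 12.0}
r1 = first_reward(n_total=2, m_tagged=2)
r2 = second_reward([("revenue", 42.5), ("cost", 10.0)], standards)
reward = training_reward(r1, r2)
```

With proximal policy optimization, the same scalar `reward` would instead feed the advantage estimate computed with a value model; only the group relative variant is sketched here.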