A large model optimization method and vehicle
By decomposing the task instructions of a large model into atomic sub-instructions, generating responses in stages, and combining preferred and non-preferred responses to construct a preference dataset, the method removes the reliance on manual annotation and feedback found in existing technologies and achieves efficient, low-cost hallucination suppression and model optimization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GREAT WALL MOTOR CO LTD
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing large models are prone to generating fictional content that does not conform to the input information or common sense when performing generation tasks. Existing hallucination suppression methods rely on manual annotation and feedback, which are costly, scale poorly, and offer insufficient training stability, so they cannot meet the needs of large-scale applications.
The original task instructions are decomposed into multiple atomic sub-instructions, and the large model is guided to generate preferred responses in stages. The preference dataset is constructed automatically by combining these with the non-preferred responses generated directly from the original task instructions, and the model parameters are then updated on that dataset, thereby automating both preference dataset construction and model optimization.
The method enables automatic construction of preference datasets without manual annotation or feedback, simplifies the training process, reduces the cost of hallucination suppression, improves the accuracy and credibility of generated content, and is suitable for efficient application in multimodal tasks.
Figure CN122308874A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the technical field of model optimization, and in particular to a large model optimization method and a vehicle.
Background Technology
[0002] With the widespread application of large models in tasks such as image captioning and visual question answering, the accuracy and credibility of their generated content have become key bottlenecks restricting the practical application of the technology. However, when performing generation tasks, large models often produce fictitious content that does not conform to the input information or common sense, seriously affecting the practical application value of the model and the user experience. Therefore, how to suppress the hallucinations of large models efficiently and cost-effectively has become an urgent technical problem to be solved.
[0003] Currently, among existing methods, supervised fine-tuning techniques guide models to learn real semantics and common sense through high-quality manually labeled data; reinforcement learning techniques based on human feedback construct reward models and iteratively adjust model generation strategies by combining human feedback. Both of these methods optimize model output by introducing human intervention in order to alleviate the hallucination problem.
[0004] In the aforementioned existing technologies, supervised fine-tuning relies on complex and highly subjective multimodal data annotation, which is costly and has poor scalability; reinforcement learning based on human feedback requires the construction of reward models and relies on unstable human feedback, resulting in a cumbersome training process and insufficient stability. All of these technologies rely too much on human intervention and cannot achieve automatic construction of preference data and concise optimization of models, making it difficult to meet the practical application requirements of large-scale and low-cost models.
Summary of the Invention
[0005] This application addresses, to at least some extent, one of the technical problems in the related art.
[0006] Therefore, this application aims to provide a large model optimization method and vehicle, which decomposes the original task instructions into multiple atomic sub-instructions through atomic instruction decomposition, guides the large model to generate preferred responses in stages, and automatically constructs a preference dataset by combining the non-preferred responses directly generated from the original task instructions, thereby updating the parameters of the large model. This realizes the automated construction of the preference dataset and simplifies the model optimization process, solving the problems of existing technologies such as reliance on a large amount of manual annotation, complex training process, high implementation cost, poor scalability, and insufficient training stability.
[0007] To achieve the above objectives, in a first aspect, this application provides a large model optimization method, which includes: acquiring input modality information and the original task instructions proposed in response to the input modality information, and confirming the task type based on the original task instructions; based on the task type, decomposing the original task instructions to obtain multiple atomic sub-instructions; inputting each of the atomic sub-instructions into the large model to generate a preferred response, and obtaining a non-preferred response generated by the large model based on the original task instructions; constructing a preference dataset by combining the input modality information, the original task instructions, the preferred response, and the non-preferred response; and updating the parameters of the large model based on the preference dataset to obtain an optimized large model, which is then used to execute the multimodal task corresponding to the task type.
[0008] By acquiring input modal information and original task instructions and determining the multimodal task, the original instructions are decomposed into atomic sub-instructions to generate preferred and non-preferred responses and construct a preference dataset. Then, the model parameters are updated based on the dataset to obtain an optimized model. This method can automatically complete the construction of preference data and model optimization, eliminating the reliance on manual annotation and feedback, effectively reducing the cost of hallucination suppression, simplifying the implementation process, improving the accuracy and credibility of the output content of large models, and adapting to the efficient application of multimodal tasks.
[0009] In some embodiments of this application, based on the task type, the original task instruction is decomposed into multiple atomic sub-instructions, including: Determine the phased prompting learning logic that matches the task type; Based on the phased prompting learning logic, the original task instruction is decomposed into multiple atomic sub-instructions that are executed sequentially in a preset order.
[0010] By determining the phased prompting learning logic that matches the multimodal task, and then decomposing the original instructions into atomic sub-instructions that are executed in a preset order according to that logic, the instruction decomposition better fits the characteristics and execution rules of the task. This ensures that the decomposed atomic sub-instructions are logically clear and have well-defined objectives, avoids the accumulation of errors caused by complex instructions, and provides a standardized instruction foundation for the subsequent reliable generation of responses and stable construction of datasets.
[0011] In some embodiments of this application, based on the phased prompting learning logic, the original task instruction is decomposed into a plurality of atomic sub-instructions executed sequentially in a preset order, including: Based on the phased prompting learning logic, the multimodal task corresponding to the original task instruction is decomposed into multiple atomic sub-tasks; The preset order is generated based on the logical dependencies between the atomic subtasks. Using the completion of one atomic subtask corresponding to each atomic subinstruction as the decomposition standard, the original task instruction is decomposed in stages to obtain multiple atomic subinstructions that are executed sequentially according to the preset order.
[0012] By breaking down multimodal tasks into atomic subtasks, decomposing instructions in stages based on the standard of one atomic sub-instruction corresponding to one atomic sub-task, and determining the execution order according to logical dependencies, atomic sub-instructions can perform their respective functions and be connected in an orderly manner. This ensures that each step of instruction execution focuses on a single simple task, reduces model generation bias, improves the objectivity and stability of response generation, and supports the efficient production of hallucination-free content.
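The ordering step described above can be sketched as a topological sort over the dependencies between atomic subtasks. This is a minimal illustration, not the applicant's implementation: the subtask names, the dependency table, and the prompt templates are hypothetical and were chosen to mirror the three-stage image description example given later in the description.

```python
from graphlib import TopologicalSorter

# Hypothetical atomic subtasks for an image description task. Each key maps
# to the set of subtasks whose output it depends on (assumed for illustration).
DEPENDENCIES = {
    "extract_entities": set(),
    "describe_features": {"extract_entities"},
    "fuse_description": {"extract_entities", "describe_features"},
}

# Hypothetical templates: exactly one atomic sub-instruction per atomic subtask.
TEMPLATES = {
    "extract_entities": "List every object, person, and scene element visible in the image.",
    "describe_features": "For each listed entity, state its color, shape, position, and state.",
    "fuse_description": "Combine the entity list and the feature descriptions into one coherent description.",
}


def preset_order(dependencies):
    """Return the subtasks in an order that respects every logical dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def decompose(dependencies, templates):
    """Pair each subtask with its sub-instruction, in the preset execution order."""
    return [(name, templates[name]) for name in preset_order(dependencies)]


if __name__ == "__main__":
    for step, (name, instruction) in enumerate(decompose(DEPENDENCIES, TEMPLATES), start=1):
        print(step, name, "->", instruction)
```

Because the dependencies here form a single chain, only one valid order exists; with independent subtasks, any order that respects the dependencies would satisfy the description.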
[0013] In some embodiments of this application, the atomic sub-instructions are input into a large model to generate a preferred response, including: Based on the preset order, each of the atomic sub-instructions is executed sequentially through the large model to generate a sub-response corresponding to each of the atomic sub-instructions, and the preferred response is obtained based on the sub-response.
[0014] By executing atomic sub-instructions sequentially in a preset order to generate corresponding sub-responses and then fusing them to obtain the preferred response, the model output can be constrained in a step-by-step generation manner. This ensures that each response is generated based on objective input modal information, avoids the spread of errors caused by generating long texts all at once, and guarantees that the content of the preferred response is true, accurate, and free of fictitious information, thus providing high-quality positive samples for the preference dataset.
[0015] In some embodiments of this application, the atomic sub-instructions are executed sequentially through the large model to generate sub-responses corresponding to each atomic sub-instruction, and the preferred response is obtained based on the sub-responses, including: In each of the atomic sub-instructions, a generation constraint condition is preset, and under the constraint condition, each of the atomic sub-instructions is executed sequentially; The sub-response generated by the previous atomic sub-instruction is used as the context input for the execution of the current atomic sub-instruction. The corresponding sub-responses are generated stage by stage, and the sub-responses are integrated to obtain the preferred response.
[0016] By setting generation constraints in the atomic sub-instructions and using the previous sub-response as the context input of the current stage to generate the preferred response, the output form and content of the large model can be strictly standardized. This ensures that the output of each stage is consistent with the others and with the input facts, further reducing the probability of hallucination and improving the controllability and reliability of response generation.
[0017] In some embodiments of this application, the sub-response generated by the previous atomic sub-instruction is used as the context input for the execution of the current atomic sub-instruction, and corresponding sub-responses are generated stage by stage. The sub-responses are then integrated to obtain the preferred response, including: Execute the first atomic sub-instruction to extract multiple objective facts from the input modal information and obtain the first sub-response; Execute the second atomic sub-instruction to generate corresponding feature descriptions based on the objective facts stated in the first sub-response, and obtain the second sub-response; Execute the third atomic sub-instruction to fuse the first sub-response and the second sub-response to obtain the preferred response.
[0018] By sequentially executing atomic sub-instructions to extract objective facts, generate feature descriptions, and fuse descriptions, the preferred response is obtained. The process follows a bottom-up generation logic: it first locks in real information, then refines details, and finally integrates and outputs the result. The entire process is based solely on the input modal information, avoiding the inclusion of fictitious content and thus ensuring that the preferred response is highly authentic and accurate.
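The staged generation described above can be sketched as a loop that feeds earlier sub-responses into each later stage. The sketch is illustrative only: `model` stands for any callable that maps a prompt string to a response string, the prompt layout and the function names are assumptions, and passing all earlier sub-responses (rather than only the immediately preceding one) is a simplification so that the final fusion stage sees both the facts and the feature descriptions.

```python
from typing import Callable, List


def generate_preferred(model: Callable[[str], str],
                       sub_instructions: List[str],
                       constraint: str) -> str:
    """Run the atomic sub-instructions in order; the last sub-response is the preferred response."""
    sub_responses: List[str] = []
    for instruction in sub_instructions:
        # Every stage carries the preset generation constraint.
        prompt = f"{constraint}\n{instruction}"
        if sub_responses:
            # Earlier sub-responses become the context of the current stage.
            prompt += "\nContext from earlier stages:\n" + "\n".join(sub_responses)
        sub_responses.append(model(prompt))
    return sub_responses[-1]


def generate_non_preferred(model: Callable[[str], str], original_instruction: str) -> str:
    """Answer the original instruction in a single pass, without decomposition."""
    return model(original_instruction)
```

In practice `model` would wrap a multimodal model call that also receives the input modal information; that wiring is omitted here because the source does not specify it.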
[0019] In some embodiments of this application, a preference dataset is constructed by combining the input modality information, the original task instruction, the preferred response, and the non-preferred response, including: The input modal information, the original task instruction, the preferred response, and the non-preferred response are combined into a quadruple data sample. Based on different multimodal tasks, multiple sets of quadruple data samples are generated; The preference dataset is obtained by aggregating multiple sets of the quadruple data samples.
[0020] By combining input modal information, original task instructions, preferred responses, and non-preferred responses into quadruples, and aggregating multi-task samples to form a preference dataset, it is possible to construct an automated dataset with complete structure and strong adaptability. This dataset can meet the needs of model optimization without manual annotation, improve the scalability and universality of the dataset, and reduce the cost of data construction.
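A quadruple data sample and its aggregation into a preference dataset might be represented as follows. The field names, the use of a string identifier for the input modal information, and the JSON Lines serialization are all assumptions made for illustration; the source only specifies the four components of each sample.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PreferenceSample:
    """One quadruple: input, original instruction, preferred and non-preferred response."""
    modality_input: str   # hypothetical: a path or identifier for the image, audio, etc.
    instruction: str      # the original task instruction
    preferred: str        # response produced stage by stage from atomic sub-instructions
    non_preferred: str    # response produced directly from the original instruction


def build_dataset(samples_by_task: Dict[str, List[PreferenceSample]]) -> List[PreferenceSample]:
    """Aggregate the quadruple samples of every multimodal task into one dataset."""
    dataset: List[PreferenceSample] = []
    for task_samples in samples_by_task.values():
        dataset.extend(task_samples)
    return dataset


def to_jsonl(dataset: List[PreferenceSample]) -> str:
    """Serialize the dataset as one JSON object per line."""
    return "\n".join(json.dumps(asdict(sample), ensure_ascii=False) for sample in dataset)
```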
[0021] In some embodiments of this application, the parameters of the large model are updated based on the preference dataset to obtain an optimized large model, including: Input the preference dataset into the large model; Define an optimization objective, and based on the optimization objective, minimize the log-likelihood loss function using the direct preference optimization algorithm, and iteratively update the parameters of the large model.
[0022] By inputting the preference dataset into the large model, setting the optimization objective, and minimizing the log-likelihood loss function through the direct preference optimization algorithm, the parameters can be updated iteratively. This eliminates the need for reward model construction and manual feedback, simplifies the training process, improves training efficiency and stability, and allows the large model to be optimized quickly toward fewer hallucinations.
[0023] In some embodiments of this application, the optimization objective is to make the probability of the large model generating the preferred response higher than the probability of generating the non-preferred response.
[0024] By setting the optimization objective so that the large model is more likely to generate the preferred response than the non-preferred response, the direction of parameter adjustment can be guided accurately: the model's tendency to generate realistic and accurate content is strengthened and the probability of outputting fictional content is weakened. Hallucination is thus suppressed at the level of the generation strategy, improving the credibility of the large model's output.
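The source names the direct preference optimization algorithm but does not write out its loss. The sketch below uses the commonly published form of the DPO objective for a single sample, which compares the model being trained against a frozen reference model and scales the difference by a coefficient `beta`. The reference model, the default `beta`, and all function and parameter names are assumptions that do not appear in the source.

```python
import math


def dpo_loss(policy_logp_pref: float,
             policy_logp_nonpref: float,
             ref_logp_pref: float,
             ref_logp_nonpref: float,
             beta: float = 0.1) -> float:
    """Per-sample DPO loss: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a whole response given the
    input and the original instruction, under the trained (policy) model or
    the frozen reference model.
    """
    margin = beta * ((policy_logp_pref - ref_logp_pref)
                     - (policy_logp_nonpref - ref_logp_nonpref))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the model assigns relatively more probability to the preferred response than to the non-preferred one, which matches the optimization objective stated above; averaging it over the preference dataset and minimizing it by gradient descent would give the iterative parameter update.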
[0025] In a second aspect, this application provides a vehicle including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the large model optimization method described above.
[0026] By equipping a vehicle with a memory and processor that store the corresponding computer program, the processor can implement the aforementioned large model optimization method when executing the program. This enables the large model on the vehicle to have efficient and low-cost hallucination suppression capabilities, improves the output accuracy of tasks such as in-vehicle multimodal interaction and visual perception, and enhances the practicality and reliability of in-vehicle intelligent systems.
[0027] As can be seen from the above technical solutions, additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0028] Figure 1 is a flowchart illustrating a large model optimization method provided in an embodiment of this application; Figure 2 is a flowchart illustrating another large model optimization method provided in an embodiment of this application.
Detailed Implementation
[0029] In this application, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with that embodiment or example, which is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0030] The present application will now be described in detail through exemplary embodiments. However, it should be understood that, without further description, elements, structures, and features in one embodiment may be advantageously incorporated into other embodiments. Large models refer to generative artificial intelligence models with a large number of parameters, obtained by pre-training on massive amounts of data using deep learning technology. Large models learn general knowledge representations from large-scale unlabeled data through self-supervised learning, and can understand and generate various types of content such as natural language, code, and images. The core feature of large models is that they have powerful in-context learning and knowledge emergence capabilities, and can adapt to various downstream tasks under few-shot or zero-shot conditions. Large models can include large language models and multimodal large models. The training data for large language models mainly consists of large-scale text corpora. The model structure is based on the Transformer and learns statistical patterns and semantic relationships in text to gain a deep understanding of natural language and the ability to generate it. The core input and output of large language models are both text modalities.
[0031] Multimodal large models can process and understand information from two or more modalities simultaneously, such as images, text, audio, and video. Through a unified network architecture, multimodal large models align data representations from different modalities, enabling cross-modal understanding, generation, and retrieval.
[0032] In generation and question-answering tasks, large models predict and generate coherent output content word by word through autoregression based on prompts or reference information input by users. Their generation logic relies on the memory and reasoning of statistical regularities in training data rather than a true understanding of objective facts. Therefore, they are prone to generating fictional content that does not conform to the input information or common sense, i.e., the problem of hallucination. The root cause of the hallucination problem lies in the fact that large models lack precise constraints on input information in complex tasks, which can easily introduce error accumulation and fabricated content when generating long texts or understanding cross-modal data, thereby affecting the credibility and usability of the output results. To alleviate hallucinations, the relevant solutions mainly employ two methods: supervised fine-tuning and reinforcement learning based on human feedback. Among them, supervised fine-tuning uses high-quality manually labeled data to fine-tune the parameters of large models, guiding them to learn real-world semantics and common sense. Reinforcement learning based on human feedback constructs a reward model, combines human feedback on the preferences of the large model's output, and iteratively adjusts the generation strategy to make the output conform to human preferences. However, both of these methods have significant drawbacks. Supervised fine-tuning relies heavily on manual annotation. For example, in multimodal tasks such as image description and visual question answering, annotation needs to take into account the matching degree between text semantics and image content. The process is complex and highly subjective, resulting in extremely high costs for obtaining high-quality data. 
Moreover, the annotation scale is limited and scales poorly, so it cannot meet the needs of large-scale training and makes it difficult to suppress hallucinations completely. Reinforcement learning based on human feedback requires first building a reward model that matches human preferences, and then iteratively adjusting parameters through continuous and stable human feedback. The overall training process is cumbersome and time-consuming. At the same time, the accuracy of the reward model depends on a large amount of human feedback, and the subjectivity and instability of human feedback directly affect the reliability of the reward model, resulting in insufficient training stability, low efficiency, and an inability to scale quickly to various multimodal tasks. What the two methods have in common is that both rely on human intervention and fail to achieve automatic construction of preference data and concise optimization of large models. The inherent limitations of human intervention, such as high cost, low efficiency, and strong subjectivity, directly lead to the shortcomings of existing hallucination suppression methods in terms of efficiency, scalability, and practicality. Therefore, existing hallucination suppression methods for large models generally suffer from problems such as reliance on a large amount of manual annotation, complex training processes, high implementation costs, poor scalability, and insufficient training stability. They cannot achieve hallucination suppression efficiently and at low cost, which seriously restricts the credibility and practicality of large models in actual applications.
[0033] Based on this, this application proposes a large model optimization method and vehicle. By decomposing the original task instructions into multiple atomic sub-instructions through atomic instruction decomposition, the large model is guided to generate preferred responses in stages. The method also automatically constructs a preference dataset by combining the non-preferred responses directly generated from the original task instructions, thereby updating the parameters of the large model. This achieves automated construction of the preference dataset and simplifies the model optimization process, solving the problems of existing technologies such as reliance on a large amount of manual annotation, complex training process, high implementation cost, poor scalability, and insufficient training stability.
[0034] In the following, embodiments of this application will be described in detail with reference to the accompanying drawings.
[0035] As shown in Figure 1 and Figure 2, in an illustrative embodiment of a large model optimization method of this application, the method includes: S1: Obtain input modality information and the original task instructions proposed in response to the input modality information, and confirm the task type based on the original task instructions. The input modal information is the objective data source being processed, such as road condition images captured by cameras in vehicle scenarios, voice commands recorded by microphones, or navigation text uploaded through external interfaces. Once this information is acquired, it is stored in the system as objective, fixed, and unchangeable basic data. The acquisition of input modal information is completed directly by the system front-end acquisition module or an external input interface: for an image or video modality, the raw visual data can be obtained by real-time capture through the vehicle-mounted image acquisition device, external file upload, or video stream input; for a text or voice modality, the raw content data can be obtained by direct entry through a text input box or by voice acquisition combined with a transcription module. The acquired input modal information serves as objective basic data and remains unchanged throughout the subsequent instruction decomposition, sub-response generation, and model optimization processes, providing a true and consistent factual basis for the model. The original task instruction is the specific task requirement that the user puts forward for the input modal information. For example, it may be to generate a detailed description of the collected road condition image or to understand and respond to the intent of the voice command. That is, there is a clear correspondence between the input modal information and the original task instruction. 
Throughout the entire subsequent processing, the large model must generate a response based solely on the input modal information. The original task instructions are only used to determine the task type and decomposition direction and must not replace or exceed the factual content carried by the objective modal information. After obtaining the input modal information and the original task instructions proposed by the user, the original task instructions are first semantically parsed and the task type is identified to clarify the specific task category that needs to be completed, thereby determining the task type; The original task instructions carry the user's intention to process the input modal information. Different intentions correspond to different multimodal task types, which may include image description tasks, visual question answering tasks, visual reasoning tasks, text-to-image generation tasks, image editing tasks, audio understanding tasks, video understanding tasks, and multimodal retrieval tasks. Among them, the image description task generates a corresponding natural language description based on the input image; the visual question answering task answers specific questions posed by users based on the input image; the visual reasoning task performs logical analysis and reasoning on scenes, relationships, or events in an image; the text-to-image generation task generates semantically consistent image content based on text descriptions; the image editing task performs local modifications or style conversions on the input image according to instructions; the audio understanding task identifies the speech content, event type, or scene attributes in an audio signal; the video understanding task understands the actions, events, or scene changes in a video sequence; and the multimodal retrieval task searches for matching information in cross-modal data based on text or images. 
For example, when a user provides a road image and enters the instruction "Please describe the current road conditions," the system recognizes the intent as generative description and confirms the current task as an image description task; when the user enters the instruction "Are there pedestrians ahead?", the system recognizes the intent as targeted question answering and confirms the current task as a visual question answering task; when the user enters the instruction "What accident might have happened in this picture?", the system recognizes the intent as logical reasoning and confirms the current task as a visual reasoning task; when the user enters the instruction "Generate a picture of a beach at sunset based on this description," the system recognizes the intent as creative generation and confirms the current task as a text-to-image generation task; when the user enters the instruction "Change the red car in the picture to blue," the system recognizes the intent as selective modification and confirms the current task as an image editing task; when the user enters a recording from inside a car and asks "What did the driver say?", the system recognizes the intent as speech content extraction and confirms the current task as an audio understanding task; when the user provides a dashcam video and asks "Did the car in front change lanes?", the system recognizes the intent as action recognition and confirms the current task as a video understanding task. Accurate identification of task type is the foundation for subsequent atomic instruction decomposition. Different task types have fundamental differences in output format, information requirements and inference path. Based on this, differentiated decomposition logic and sub-instruction structure must be designed. By customizing differentiated decomposition logic and sub-instruction design for different task types, it is ensured that atomic instruction decomposition is highly matched with task objectives. 
This allows the model to focus on a single, clear, and verifiable sub-task at each stage, thereby avoiding the accumulation of errors and content fabrication caused by processing complex tasks all at once. This lays the correct direction for the automatic construction of subsequent preference datasets. Based on the confirmation of the above task types, the system further obtains the current multimodal task. A multimodal task refers to a task that requires simultaneous processing and understanding of two or more modal information (such as images, text, audio, video, etc.) and completion of a specific goal. Each execution of a task instruction involves the interaction between at least two different modal information, such as the interaction between text instructions and image content, the interaction between text instructions and audio signals, or the interaction between image content and text description. Task type is a classification label for user intent (such as image description task, visual question answering task, etc.), while multimodal task is the actual cross-modal processing scenario determined based on task type and combined with the specific format and content characteristics of input modality information; task type determines the basic framework of multimodal task, and multimodal task is the instantiation of task type under specific input conditions. The system uses natural language understanding technology to classify the original task instructions by intent, and combines the type of input modal information to comprehensively judge and confirm the current multimodal task. The system extracts key verbs, interrogative words, and noun phrases from the original task instructions to identify the user's core intent. It determines whether the core intent is to generate descriptive content, answer a specific question, or perform other types of operations, thus defining the task type. 
Simultaneously, the system combines the format and content characteristics of the input modal information to determine whether the data being processed is image, video, audio, or other modalities, concretizing the abstract task type into an executable multimodal task.
[0036] For example, when the original task instruction contains generative verbs such as "describe," "depict," or "explain" and the input is an image or video, the system first confirms that the task type is an image description task, and then, based on the fact that the input is visual modal information, confirms that the current multimodal task is "generate a text description for the image"; when the original task instruction contains interrogative words such as "whether," "how much," or "what" and the input is an image or video, the system first confirms that the task type is a visual question answering task, and then confirms that the current multimodal task is "answer a specific question based on the image content"; when the original task instruction contains creative verbs such as "generate," "draw," or "create" and the output target is an image, the system first confirms that the task type is a text-to-image generation task, and then confirms that the current multimodal task is "generate a corresponding image based on the text description."
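The keyword rules in this example can be sketched as a small rule-based classifier. This is a simplified stand-in for the natural language understanding step, not the method itself: the function name, the modality labels, and the whole-word matching are assumptions, and the keyword lists are taken directly from the translated examples above, so an English question phrased without one of those words (for example "Are there pedestrians ahead?") would not match and would need a real intent classifier.

```python
import re

# Keyword lists taken from the translated examples in the description.
GENERATIVE_VERBS = ("describe", "depict", "explain")
QUESTION_WORDS = ("whether", "how much", "what")
CREATIVE_VERBS = ("generate", "draw", "create")


def classify(instruction, input_modality, output_target=None):
    """Return (task type, multimodal task), or None when no rule matches."""
    text = instruction.lower()

    def has_any(words):
        # Match whole words or phrases only, so "what" does not match "whatever".
        return any(re.search(rf"\b{re.escape(word)}\b", text) for word in words)

    if has_any(CREATIVE_VERBS) and output_target == "image":
        return ("text-to-image generation",
                "generate a corresponding image based on the text description")
    if input_modality in ("image", "video"):
        if has_any(GENERATIVE_VERBS):
            return ("image description", "generate a text description for the image")
        if has_any(QUESTION_WORDS):
            return ("visual question answering",
                    "answer a specific question based on the image content")
    return None
```

For instance, `classify("Please describe the current road conditions", "image")` returns the image description pair, while an instruction that matches no rule returns `None` so that a caller can fall back to another classifier.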
[0037] Taking an in-vehicle scenario as an example, suppose a user issues the voice command "Please describe the current road conditions" for a road image captured by the forward-looking camera. The system first converts the command into text through speech recognition, then extracts the generative verb "describe" and the target object "road conditions" through natural language understanding, confirming that the task type is an image description task. Because the input is a forward-looking road image, the system further confirms that the current multimodal task is "generate a road condition description text for this road image". If the user instead inputs the command "Are there any pedestrians ahead?", the system extracts the interrogative expression (a whether-type question) and the target entity "pedestrian", confirming that the task type is a visual question answering task. Combined with the forward-looking road image input, the system further confirms that the current multimodal task is "determine whether there are any pedestrians in the road image and give an answer". Once the task type is confirmed, the system applies differentiated decomposition logic. For example, in an image description task, the goal is to generate coherent, complete, and detailed scene description text. 
Therefore, it needs to be decomposed according to the level of information extraction, from shallow to deep. The first stage is the entity extraction sub-instruction, which guides the model to identify and list all objectively existing objects, people, and scene elements from the image, with the output format being a concise entity list. The second stage is the feature description sub-instruction, which guides the model to generate accurate descriptions of attributes such as the color, shape, position, and state of each item in the entity list, with the output format being a short descriptive sentence. The third stage is the fusion generation sub-instruction, which guides the model to organize all the descriptive information produced in the first two stages in a reasonable logical order and fuse it into a coherent and natural complete description text.
For visual question answering tasks, the goal is to provide accurate and concise answers to specific user questions. Therefore, the task is decomposed according to the cognitive logic of question solving. The first stage is the entity localization sub-instruction, which guides the model to locate the target region or object related to the question in the image and to output the bounding box coordinates or object label. The second stage is the attribute judgment sub-instruction, which guides the model to analyze the attributes of the located target and determine whether it meets the conditions asked in the question (such as color, quantity, existence, or action state), and to output the judgment result or attribute value. The third stage is the answer summarization sub-instruction, which guides the model to integrate the localization and judgment results of the first two stages into an answer text that conforms to the question format.
For text-to-image generation tasks, the goal is to generate image content that conforms to the semantics of the text description. Therefore, the task is decomposed according to the image synthesis process. The first stage is the semantic parsing sub-instruction, which guides the model to extract key semantic elements from the text description, including object types, attribute modifiers, and spatial relationships, and to output a structured semantic graph. The second stage is the layout planning sub-instruction, which guides the model to plan the position, size, and hierarchical relationship of each element in the image according to the extracted semantic elements, and to output a layout sketch. The third stage is the image generation sub-instruction, which guides the model to gradually generate complete image content that conforms to the semantic description according to the layout plan.
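As a rough sketch of the differentiated decomposition logic, the three stage sequences can be held in a lookup table keyed by task type. The names `STAGE_TEMPLATES`, `AtomicSubInstruction`, and `decompose`, as well as the prompt wording, are illustrative assumptions and do not come from the application itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicSubInstruction:
    order: int   # position in the preset execution order (1-based)
    name: str    # stage name
    prompt: str  # hypothetical prompt text for the stage


# Stage names follow the text; the prompt strings are assumed wording.
STAGE_TEMPLATES: dict[str, list[tuple[str, str]]] = {
    "image_description": [
        ("entity_extraction", "List all objects, people and scene elements in the image."),
        ("feature_description", "Describe color, shape, position and state of each listed entity."),
        ("fusion_generation", "Fuse the descriptions into one coherent description text."),
    ],
    "visual_question_answering": [
        ("entity_localization", "Locate the region or object the question refers to."),
        ("attribute_judgment", "Judge whether the located target meets the asked condition."),
        ("answer_summarization", "Combine localization and judgment into an answer."),
    ],
    "text_to_image": [
        ("semantic_parsing", "Extract objects, attributes and spatial relations as a semantic graph."),
        ("layout_planning", "Plan position, size and layering of each element."),
        ("image_generation", "Generate the image according to the layout plan."),
    ],
}


def decompose(task_type: str) -> list[AtomicSubInstruction]:
    """Return the ordered atomic sub-instructions for a confirmed task type."""
    if task_type not in STAGE_TEMPLATES:
        raise ValueError(f"unsupported task type: {task_type}")
    return [
        AtomicSubInstruction(order, name, prompt)
        for order, (name, prompt) in enumerate(STAGE_TEMPLATES[task_type], start=1)
    ]
```

Keeping the stage sequences in data rather than in branching code means a new task type can be supported by adding one table entry, which is one way to read the scalability claim made later in the description.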
[0038] S2: Based on the task type, the original task instruction is decomposed into atomic instructions to obtain multiple atomic sub-instructions. In some embodiments, this decomposition includes: determining the phased prompting learning logic that matches the task type; and, based on the phased prompting learning logic, decomposing the original task instruction into multiple atomic sub-instructions that are executed sequentially in a preset order.
[0039] By determining the phased prompting learning logic that matches the task type and then decomposing the original instruction into atomic sub-instructions executed in a preset order, the instruction decomposition better fits the task characteristics and execution rules. This ensures that the decomposed atomic sub-instructions have clear logic and clear objectives, avoids the accumulation of errors caused by complex instructions, and provides a standardized instruction foundation for subsequent reliable response generation and stable dataset construction.
[0040] Furthermore, the phased prompting learning logic can adapt to diverse and multimodal task scenarios without redesigning the instruction decomposition framework for different task types. It can adaptively match the corresponding phase division and execution rules according to the task type. While maintaining the consistency of the core decomposition logic, it can achieve the scenario-based and precise generation of atomic instructions, effectively improving the universality and scalability of instruction decomposition. For image description tasks, the phased prompting learning logic generates atomic sub-instructions according to the path of objective information extraction, feature detail description, and overall content integration. This allows each atomic sub-instruction to focus on extracting real information from the image and gradually improving the output content, thus avoiding the large model from generating fictitious information unrelated to the image out of thin air. For visual question-answering tasks, the phased prompting learning logic generates atomic sub-instructions according to the path of target object location, relevant attribute verification, and answer content organization, ensuring that the output of the large model at each step is based on the input modal information and does not rely on its own memory to make up answers. Meanwhile, executing atomic sub-instructions in a preset order can maintain a stable output state of the large model during the generation process. The instruction output of the previous stage can serve as a reliable basis for the execution of the instruction in the next stage. No manual intervention or correction is required throughout the process, avoiding the subjective bias and high cost caused by manual annotation. 
Furthermore, the phased prompting learning logic effectively avoids information confusion and error propagation caused by inputting complex instructions all at once, allowing large models to complete outputs under the guidance of a single, clear instruction. This improves the accuracy and controllability of response generation, providing stable and reliable input conditions for subsequent automatic construction of unlabeled preference datasets and model training using direct preference optimization. It also enables the entire illusion suppression scheme to be efficiently adapted to various application scenarios such as in-vehicle multimodal interaction, intelligent visual perception, and multimodal content generation. While reducing implementation costs, it improves model optimization efficiency and output credibility, meeting the actual needs of large-scale model deployment.
[0041] In some embodiments, based on a phased prompting learning logic, the original task instructions are decomposed into multiple atomic sub-instructions that are executed sequentially in a preset order, including: Based on the phased prompting learning logic, the multimodal task corresponding to the original task instruction is decomposed into multiple atomic sub-tasks; The preset order is generated based on the logical dependencies between the atomic subtasks; Using the completion of one atomic sub-task corresponding to each atomic sub-instruction as the decomposition standard, the original task instructions are decomposed in stages to obtain multiple atomic sub-instructions that are executed sequentially in a preset order.
[0042] By breaking down multimodal tasks into atomic subtasks, decomposing instructions in stages based on the standard of one atomic sub-instruction corresponding to one atomic sub-task, and determining the execution order according to logical dependencies, atomic sub-instructions can perform their respective functions and be connected in an orderly manner. This ensures that each step of instruction execution focuses on a single simple task, reduces model generation bias, improves the objectivity and stability of response generation, and supports the efficient production of hallucination-free content.
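Deriving the preset order from logical dependencies amounts to a topological sort. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the function name `preset_order` and the dependency table are assumptions chosen to mirror the visual question answering stages, not details given in the application.

```python
from graphlib import TopologicalSorter


def preset_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order atomic sub-tasks so that every sub-task follows its prerequisites.

    `dependencies` maps each sub-task to the set of sub-tasks it depends on.
    A cyclic dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency table for a visual question answering task.
vqa_dependencies = {
    "entity_localization": set(),
    "attribute_judgment": {"entity_localization"},
    "answer_summarization": {"attribute_judgment"},
}

order = preset_order(vqa_dependencies)
```

For a simple chain such as this one the order is unique; for tasks whose sub-tasks form a wider dependency graph, any order returned by the sorter respects the dependencies.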
[0043] Furthermore, the phased prompting learning logic atomizes the original task instructions. The core of this approach is to break down the originally complex holistic generation task into several single, concise, and independently verifiable atomic sub-tasks. Each atomic sub-instruction undertakes only one clear and limited execution objective, thus blocking the avalanche effect of error accumulation and illusion spread from the starting point of the generation chain. This allows the large model to focus on the current single task at each step of execution without having to process multiple semantics and multi-dimensional information simultaneously, significantly reducing generation bias. The phased prompting learning logic follows the principles of simplifying complexity, step-by-step verification, and bottom-up construction. Taking the image description task in multimodal tasks as an example, the original task instruction is "describe the image in detail". The phased prompting learning logic decomposes it into three atomic sub-instructions that are executed in a preset order. The first atomic sub-instruction is entity extraction, which guides the large model to extract all real-world objects from the input image and outputs a list of objects such as "person, bicycle, school bus, tattoo shop, and parking sign", ensuring that subsequent descriptions are based on objectively existing visual entities. The second atomic sub-instruction is the feature description. Based on the list of objects in the first sub-response, it generates a concise and accurate feature description for each object one by one. For example, it describes a person as "a man wearing a helmet and a backpack riding a bicycle" and a school bus as "a yellow school bus with a red stop sign". This forces the large model to focus on the visual features of individual objects and avoids information distortion or the introduction of irrelevant content. 
The third atomic sub-instruction is fusion generation. It takes all the concise descriptions generated in the second sub-response as context input and, by leveraging the contextual understanding capabilities of the large model, integrates them into a logically coherent, detailed, and hallucination-free complete image description, for example, "In the image, a man wearing a helmet and a backpack is riding a bicycle, and a yellow school bus with a red stop sign is parked next to him." This ensures the integrity of the final output and that all information has a reliable source. Taking the visual question answering task as an example, the original task instruction is "What is the pedestrian in front of the red car in the image doing?" Through the phased prompting learning logic, it is broken down into three atomic sub-tasks: entity localization, attribute judgment, and answer summarization. The entity localization task requires locating the key entities related to the question in the input image, namely the "red car" and the pedestrian "in front of" it. The attribute judgment task requires determining the pedestrian's specific action or state based on the localization results. The answer summarization task requires integrating the preceding results into an answer that matches the intent of the original question. The preset order is generated based on the logical dependencies between the atomic sub-tasks. Entity localization must be executed before attribute judgment, because attributes can only be judged after the target entity is located. Attribute judgment must be executed before answer summarization, because the final answer can only be summarized after the action features are obtained. Using the standard that each atomic sub-instruction corresponds to the completion of one atomic sub-task, the original task instruction is decomposed in stages to obtain three atomic sub-instructions that are executed in the preset order. 
Of these, the first atomic sub-instruction is entity localization, which guides the large model to output the location information of the target entity; the second is attribute judgment, which guides the large model to output the action description of the entity based on the localization result; and the third is answer summarization, which guides the large model to generate the final answer based on the previous results. Through decomposition by the above phased prompting learning logic, each atomic sub-instruction corresponds to a single simple task with a self-contained objective and clear semantics, which effectively avoids the accumulation of errors caused by complex tasks and lays a reliable foundation for the subsequent generation of preferred responses. Because the sub-instructions are connected in an orderly manner and each execution step focuses on a single, simple task, the generation bias of the large model is reduced, the objectivity and stability of response generation are improved, and the efficient production of hallucination-free content is supported.
[0044] S3: Input each atomic sub-instruction into the large model to generate the preferred response, and obtain the non-preferred response generated by the large model based on the original task instruction. The preferred response refers to the response that the large model generates in stages, in the preset order, after the atomic sub-instructions are input into it in sequence; it fully matches the reference information and contains no fictitious content. In this process, the large model extracts objective facts from the reference information when executing the first atomic sub-instruction, generates feature descriptions based on those facts when executing the second atomic sub-instruction, and fuses all preceding sub-responses when executing the last atomic sub-instruction, ultimately producing an output that is logically coherent, accurate, and highly consistent with the reference information. The preferred response represents the ideal hallucination-free output and serves as the positive reference sample in the preference dataset. The non-preferred response refers to the response that the large model generates when the original task instruction is input into it directly, without the guidance of atomic instruction decomposition. Because it is not subject to phased constraints, the non-preferred response is prone to containing fictitious content that does not match the reference information, that is, it exhibits hallucination. The non-preferred response serves as the negative reference sample in the preference dataset; contrasted with the preferred response, it is used to train the model to distinguish real content from fictitious content.
[0045] In some embodiments, inputting each atomic sub-instruction into the large model to generate the preferred response includes: executing each atomic sub-instruction sequentially through the large model in the preset order to generate a sub-response corresponding to each atomic sub-instruction, and obtaining the preferred response based on the sub-responses.
[0046] By executing atomic sub-instructions sequentially in a preset order to generate corresponding sub-responses and then fusing them to obtain the preferred response, the output of the large model can be constrained in a step-by-step generation manner. This ensures that each sub-response is generated based on objective input modal information, avoids the spread of errors caused by generating long texts all at once, and guarantees that the content of the preferred response is true, accurate, and free of fictitious information, thus providing high-quality positive samples for the preference dataset.
[0047] Furthermore, by executing atomic sub-instructions and generating corresponding sub-responses in a preset order, the large model can maintain a strong binding relationship with the input modal information throughout the generation process. Each output step is based on objective data sources such as original images and text, without relying on the model's internal memory or external non-empirical knowledge, thus reducing the possibility of fictional content appearing in the generation path. The step-by-step execution method described above can effectively avoid the error accumulation and avalanche effect caused by generating complex tasks all at once, enabling large models to complete output in a stable and controllable state, and improving the reliability and consistency of the overall response. For example, in image description scenarios, executing atomic sub-instructions sequentially allows the large model to gradually focus on visual information and improve the content layer by layer, ensuring that the final optimal response fully matches the real situation of the image. In visual question answering scenarios, sequential execution allows the model to gradually pinpoint the question, verify the authenticity of the information, and form accurate conclusions, avoiding irrelevant or unfounded answers. 
Meanwhile, the above step-by-step execution method does not require manual intervention for verification and correction, and can automatically complete the construction of the preferred response, greatly reducing data preparation costs and improving the efficiency of constructing the preference dataset; The optimal response obtained through step-by-step generation has high accuracy, high objectivity and high consistency, which can provide high-quality positive samples for subsequent direct preference optimization training, enabling large models to learn more clearly the real and credible generation logic during the optimization process, effectively improving training stability and illusion suppression effect. Furthermore, the step-by-step execution method has good versatility and scalability, and can be adapted to multimodal tasks of different types and fields. It does not require adjustment of the core process for specific scenarios, and can meet the diverse application needs of in-vehicle intelligent interaction, multimodal perception, content generation and understanding, etc. It provides key support for the efficient implementation and large-scale promotion of the entire large model illusion suppression solution, and enables the optimized large model to output more credible, stable and more realistic results in actual use.
[0048] In some embodiments, executing each atomic sub-instruction sequentially through the large model to generate the corresponding sub-responses, and obtaining the preferred response based on the sub-responses, includes: presetting generation constraints in each atomic sub-instruction and executing each atomic sub-instruction sequentially under those constraints; using the sub-response generated by the previous atomic sub-instruction as the context input for executing the current atomic sub-instruction; and generating the corresponding sub-responses stage by stage and integrating them to obtain the preferred response.
[0049] By setting generation constraints in the atomic sub-instructions and using the previous sub-response as the current context input to generate the preferred response, the output form and content of the large model can be strictly standardized. This ensures that the output of each stage is consistent and fits the input facts, further reducing the probability of hallucination and improving the controllability and reliability of response generation.
[0050] Furthermore, by presetting generation constraints in the atomic sub-instructions and passing results through the context, the large model can maintain a high degree of standardization and objectivity in each generation step, avoiding the information distortion and fabrication caused by unrestricted output formats and unclear content boundaries. Generation constraints limit the output of the large model to short, structured, objective, and non-redundant text, forcing it to provide accurate content only about the input modal information, without introducing irrelevant information or subjective speculation, thus reducing the possibility of hallucination at the level of the generation rules. Meanwhile, using the preceding sub-response as the context input for subsequent execution ensures logical coherence and consistency between the outputs of each stage, making the entire response generation process a stable and self-consistent chain and avoiding the spread of errors caused by information gaps. For example, in image description tasks, generation constraints guide the large model to output concise descriptions based on real visual content, while context passing ensures that the description details are consistent and do not contradict each other. In visual question answering tasks, generation constraints make the large model output judgments based on objective facts, while context passing ensures that the answer derivation process is rigorous and reliable. The entire generation process requires no manual annotation, manual correction, or external intervention; it is completed automatically by relying on the preset constraints and the contextual connections, which reduces data processing costs and improves the generation efficiency and quality stability of the preferred response. The preferred response obtained in this way is characterized by factual accuracy, logical coherence, and standardized form. 
It can provide high-quality positive samples for the preference dataset, thereby improving the training effect and model stability of subsequent direct preference optimization. This enables the large model to output results that are more credible, more accurate, and more faithful to the input in practical applications, addressing the pain points of traditional hallucination suppression methods such as high cost, complex process, and insufficient stability. It provides reliable support for the large-scale deployment of large models in fields such as in-vehicle intelligence, intelligent interaction, and content generation.
[0051] In some embodiments, the sub-response generated by the previous atomic sub-instruction is used as the context input for the execution of the current atomic sub-instruction, and corresponding sub-responses are generated stage by stage. The sub-responses are then integrated to obtain a preferred response, including: Execute the first atomic sub-instruction to extract multiple objective facts from the input modal information and obtain the first sub-response; Execute the second atomic sub-instruction to generate the corresponding feature description based on the objective facts in the first sub-response, and obtain the second sub-response; Execute the third atomic sub-instruction, fuse the first sub-response and the second sub-response to obtain the preferred response.
[0052] By sequentially executing atomic sub-instructions to extract objective facts, generate feature descriptions, and fuse the descriptions, the preferred response is obtained. The process follows a bottom-up generation logic: first locking in real information, then refining details, and finally integrating and outputting the result. The entire process is based solely on the input modal information and avoids the inclusion of fictitious content, thus ensuring that the preferred response has high authenticity and high accuracy.
[0053] Furthermore, phased sub-response generation refers to the process by which the large model, after receiving the atomic sub-instructions obtained from atomic instruction decomposition, generates the corresponding sub-responses one by one, independently and sequentially, according to the preset execution order. The objectivity and reliability of the sub-responses are ensured through specific implementation measures: input isolation, modal information binding, constraint-based generation, sequential execution with caching, and the absence of human intervention. Taking the image description task as an example, after the original instruction is decomposed into three atomic sub-instructions (entity extraction, feature description, and fusion generation), the first atomic sub-instruction is input into the large model on its own, bound to the original input image information. Under the generation constraint, the model focuses only on extracting real objects in the image and outputs a short text entity list as the first sub-response, such as "person, bicycle, school bus, restaurant". Input isolation prevents the large model from being disturbed by other instructions, and modal information binding ensures that the output is based on the real image. The first sub-response is then cached as an intermediate result, and the second atomic sub-instruction is input on its own, again bound to the original image information. Using the list of objects in the previous sub-response as context, the model generates a concise and accurate feature description for each object under the generation constraints and outputs the second sub-response, such as "A person wearing a helmet and backpack is riding a bicycle, a yellow school bus has a red stop sign, and a restaurant is located on one side of the street". Constraint-based generation limits the output for each object to a single-sentence attribute description, avoiding the spread of errors caused by long texts. 
Sequential execution and caching keep the logic of each stage consistent. Finally, the third atomic sub-instruction is input into the large model on its own, with the preceding sub-responses as context input, and the model fuses them to generate the final preferred response. Throughout the process, each atomic sub-instruction is input individually, modal information is bound at every stage, and intermediate results are cached stage by stage. The large model focuses only on the current single sub-task each time, without relying on its own memory or external knowledge to create content. Stable and reliable intermediate sub-responses are obtained automatically without human intervention, providing a solid foundation for the final generation of the hallucination-free preferred response.
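The staged execution with caching and context passing described above might be sketched as follows. The model is represented by a plain callable `(prompt, modal_input) -> str`, which is an assumption of this sketch; the application does not specify a model interface, and the function names and the default constraint text are likewise hypothetical.

```python
from typing import Callable, Sequence

# Assumed model interface: takes a text prompt and the bound modal input.
ModelFn = Callable[[str, object], str]


def generate_preferred(
    model: ModelFn,
    modal_input: object,
    sub_instructions: Sequence[str],
    constraint: str = "Answer briefly and only from the given input.",
) -> tuple[str, list[str]]:
    """Run the atomic sub-instructions in order and return (preferred, sub_responses)."""
    cache: list[str] = []  # intermediate sub-responses, cached stage by stage
    for instruction in sub_instructions:
        # Each stage sees only its own instruction, the constraint, and prior results.
        prompt = f"{constraint}\n{instruction}"
        if cache:
            prompt += "\nContext from earlier stages:\n" + "\n".join(cache)
        # The original modal input is bound again at every stage.
        cache.append(model(prompt, modal_input))
    return cache[-1], cache


def generate_non_preferred(
    model: ModelFn, modal_input: object, original_instruction: str
) -> str:
    """Single-pass response to the undecomposed instruction."""
    return model(original_instruction, modal_input)
```

The last sub-response is taken as the preferred response because the final stage is the fusion or summarization step. The single-pass function is included only to contrast the two generation paths on the same input.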
[0054] This implementation method, which involves phased execution, progressively passing context, and ultimately integrating the output, not only ensures the authenticity and accuracy of the preferred response, but also avoids the error accumulation and avalanche effect common in large models from the perspective of the generation mechanism, making the entire output process more stable, controllable, and traceable. By using the sub-responses generated by preceding atomic sub-instructions as the context input for the execution of the current instruction, the large model has clear and reliable prior evidence at each step of generation, without relying on its own parameter memory or external uncertain knowledge, thereby greatly reducing the generation of unfounded inferences and fictitious content; The preferred response generation method has strong versatility and can flexibly adjust the stage goals and output forms according to different multimodal tasks. In image description tasks, it can be generated step by step according to the path of objective entity extraction, core feature description and complete sentence fusion to ensure that the final description content is highly matched with the image information. In visual question answering tasks, the process can proceed by locating the question-related object, determining the object's attributes, and summarizing the answer content, making the answer derivation process rigorous and the results reliable. 
Meanwhile, the above-mentioned step-by-step generation and layer-by-layer integration method does not require manual participation in annotation, screening or correction, and is completed automatically throughout the entire process, which effectively reduces the cost and complexity of constructing preference data and solves the pain point of traditional hallucination suppression methods being highly dependent on manual annotation; In practical applications, the preferred response generation method is clear in facts, accurate in details, and logically coherent. It can provide high-quality positive samples for subsequent direct preference optimization training, enabling large models to quickly learn generation strategies that conform to real input information during the optimization process, thereby improving training efficiency and stability. Furthermore, the aforementioned method for generating optimal responses, relying on a simple execution process and stable generation results, can be easily extended to various scenarios such as in-vehicle multimodal interaction, intelligent visual perception, automated content generation, and multimodal understanding. This meets the requirements of different business scenarios for the credibility and practicality of model output, enabling the optimized large model to maintain low illusion, high accuracy, and high stability in various tasks. This provides a solid guarantee for the large-scale implementation and efficient application of the entire large model illusion suppression solution, and also provides feasible technical support for the safe, reliable, and efficient operation of the large model in real-world scenarios.
[0055] Furthermore, non-preferred response generation refers to obtaining the direct output of the large model to be optimized in response to the original task instruction. This output is not guided by atomic instruction decomposition. When the large model processes a complex task in a single pass, it lacks phased constraints and is prone to error accumulation during long text generation, and therefore generates responses containing fictitious content. These responses are used as non-preferred responses to construct the negative samples of the preference dataset. Taking the image description task as an example, when the user inputs the original task instruction "describe the image in detail", the large model to be optimized responds to it directly. Since the response has not gone through the phased guidance of entity extraction, feature description, and fusion generation, the large model may fabricate objects that do not exist in the image, such as "a policeman is directing traffic"; add inaccurate features to real objects, such as describing a yellow school bus as a "blue bus"; or make errors in logical association, such as fabricating an interaction between the school bus and pedestrians that does not exist. Although this direct output is fluent and structurally complete, the fictitious content it contains does not match the objective facts of the input modal information, which is a typical hallucination. Taking this direct output as the non-preferred response provides a clear positive and negative sample pair, in stark contrast to the preferred response generated through atomic instruction decomposition. This enables the model to learn which output is more consistent with objective facts under the same input conditions, thereby effectively suppressing hallucinations.
[0056] S4: Combine the input modal information, the original task instruction, the preferred response, and the non-preferred response to construct a preference dataset. Furthermore, the preference dataset refers to the dataset used to train the large model to distinguish between preferred and non-preferred responses. For the same input modal information and the same original task instruction, two responses of different quality are obtained. One is the preferred response, which matches the reference information and is obtained by inputting the atomic sub-instructions into the large model and generating in stages. The other is the non-preferred response, which contains hallucinations and is obtained by inputting the original task instruction directly into the large model. The input modal information, the original task instruction, the preferred response, and the non-preferred response are associated and stored as one data sample, a quadruple. By collecting a large amount of different input modal information and original task instructions and repeating the above process, a dataset containing multiple quadruple samples is formed, which is the preference dataset. In the preference dataset, the preferred response serves as the positive reference sample and the non-preferred response as the negative reference sample; together they constitute the training basis for model optimization. The dataset is constructed automatically without manual annotation and can be flexibly expanded for different task types.
[0057] In some embodiments, a preference dataset is constructed by combining input modality information, original task instructions, preferred responses, and non-preferred responses, including: The input modal information, the original task instructions, the preferred response and the non-preferred response are combined into a quadruple data sample; Based on different multimodal tasks, generate multiple sets of quadruple data samples; By aggregating multiple sets of quadruple data samples, a preference dataset is obtained.
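The three steps of this embodiment (combine into a quadruple, generate groups of quadruples per task, aggregate) can be illustrated with a minimal sketch. The class name `PreferenceSample`, the function `aggregate`, the field names, and the image file names are hypothetical and chosen only for illustration; the application does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PreferenceSample:
    """One quadruple: input modality info, instruction, preferred, non-preferred."""
    input_modality: str   # e.g. an image path or ID (hypothetical encoding)
    instruction: str      # original task instruction
    preferred: str        # generated in stages from atomic sub-instructions
    non_preferred: str    # generated directly from the original instruction


def aggregate(sample_groups: Iterable[Iterable[PreferenceSample]]) -> List[PreferenceSample]:
    """Flatten groups of quadruples (one group per multimodal task) into one dataset."""
    dataset: List[PreferenceSample] = []
    for group in sample_groups:
        dataset.extend(group)
    return dataset


# Example groups for two task types, using the responses quoted in the description.
captioning = [
    PreferenceSample(
        input_modality="street_scene.jpg",  # hypothetical file name
        instruction="describe the image in detail",
        preferred="A man wearing a helmet and a backpack is riding a bicycle "
                  "next to a yellow school bus with a red stop sign.",
        non_preferred="A man is riding a bicycle, a policeman is directing traffic, "
                      "and a school bus is waiting on the side of the road.",
    )
]
vqa = [
    PreferenceSample(
        input_modality="intersection.jpg",  # hypothetical file name
        instruction="What is the pedestrian in front of the red car doing?",
        preferred="The pedestrian in front of the red car is making a stop gesture.",
        non_preferred="The pedestrian is crossing the road.",
    )
]
preference_dataset = aggregate([captioning, vqa])
```

Keeping each sample immutable is one possible design choice; it reflects that a quadruple is stored as a single associated record once both responses have been generated.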
[0058] By combining input modal information, original task instructions, preferred responses, and non-preferred responses into quadruples, and aggregating multi-task samples to form a preference dataset, it is possible to construct an automated dataset with complete structure and strong adaptability. This dataset can meet the needs of model optimization without manual annotation, improve the scalability and universality of the dataset, and reduce the cost of data construction.
[0059] Furthermore, the construction of the preference dataset refers to the process of forming a quadruple data sample by combining input modal information, original task instructions, preferred response and non-preferred response. By repeatedly executing steps such as atomic instruction decomposition, staged sub-response generation, preferred response generation and non-preferred response generation, multiple sets of data samples are automatically generated and aggregated to form a preference dataset. This process does not require any manual annotation and can be flexibly extended according to different multimodal tasks. Taking the image description task as an example, the first step is to obtain the input modal information, which is a street image containing a school bus and cyclists, and the original task instruction "describe the image in detail"; By decomposing atomic instructions and generating staged sub-responses, a preferred response is obtained through staged guidance, such as "In the image, a man wearing a helmet and a backpack is riding a bicycle, next to a yellow school bus with a red stop sign, and there is a restaurant on the side of the street." At the same time, the direct output of the large model to the original task instructions is obtained as a non-preferred response, such as "In the image, a man is riding a bicycle, a policeman is directing traffic, and a school bus is waiting on the side of the road". The above input modal information, original task instructions, preferred response and non-preferred response are combined into a quadruple data sample; Repeat the above process to generate multiple sets of samples for the same image description task. For example, changing different input images or using different ways of expressing the original task instructions for the same image can generate new quadruple samples. 
For other multimodal tasks such as visual question answering, only the decomposition logic of the atomic sub-instructions needs to be adjusted. For example, the three-stage decomposition of the image description task can be changed to the decomposition logic of entity localization, attribute judgment and answer summarization, and the corresponding quadruple samples can be generated in the same way. By aggregating a large number of quadruple data samples from different tasks and sample sources, a complete preference dataset can be automatically constructed. This dataset has a complete structure and strong adaptability, and can meet the needs of model optimization without manual annotation, significantly improving the scalability and universality of the dataset and effectively reducing the cost of data construction.
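As a rough sketch of how only the decomposition logic changes between task types, the following example maps each task type to its staged sub-instructions and builds one quadruple. The prompt wording, the `DECOMPOSITION` table, the `Model` callable signature, and the stand-in `echo_model` are assumptions made for illustration; a real large model would replace `echo_model`.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage prompts per task type. The stage names follow the
# description; the exact wording of each prompt is an assumption.
DECOMPOSITION: Dict[str, List[str]] = {
    "image_description": [
        "List the entities visible in the image.",
        "Describe the features of each listed entity.",
        "Fuse the entities and features into one description.",
    ],
    "visual_question_answering": [
        "Locate the entities mentioned in the question.",
        "Judge the attributes of the located entities.",
        "Summarize the answer from the judged attributes.",
    ],
}

# A model is any callable (input_modality, prompt, context) -> text.
Model = Callable[[str, str, str], str]


def build_quadruple(model: Model, task_type: str, input_modality: str,
                    instruction: str) -> Tuple[str, str, str, str]:
    """Return (input_modality, instruction, preferred, non_preferred)."""
    context = ""
    response = ""
    for sub_instruction in DECOMPOSITION[task_type]:
        # Each stage receives the earlier sub-responses as context.
        response = model(input_modality, sub_instruction, context)
        context = (context + "\n" + response).strip()
    preferred = response  # output of the final fusion / summarization stage
    non_preferred = model(input_modality, instruction, "")  # direct, unguided output
    return (input_modality, instruction, preferred, non_preferred)


def echo_model(input_modality: str, prompt: str, context: str) -> str:
    """Stand-in model for demonstration only: it reports what it was asked."""
    return f"[{input_modality}] {prompt}"


sample = build_quadruple(echo_model, "visual_question_answering",
                         "intersection.jpg",
                         "What is the pedestrian in front of the red car doing?")
```

Supporting a further multimodal task in this sketch would only require adding another entry to `DECOMPOSITION`, which mirrors the statement that only the decomposition logic needs to be adjusted.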
[0060] S5: Update the parameters of the large model based on the preference dataset to obtain the optimized large model, which is then used to perform the multimodal task corresponding to the task type.
[0061] Furthermore, the optimization logic for updating model parameters refers to using the preferred responses in the preference dataset as positive reference samples and the non-preferred responses in the preference dataset as negative reference samples, and adjusting the parameters of the large model through a specific optimization algorithm so that the model tends to generate preferred responses rather than non-preferred responses under the same input conditions. Each quadruple in the preference dataset contains four parts: input modality information, original task instructions, preferred response, and non-preferred response. After the preference dataset is input into the large model, the large model needs to learn that, given the same reference information and original task instructions, the probability of generating a preferred response should be higher than the probability of generating a non-preferred response. This learning process does not rely on external reward models or human feedback, but rather iteratively updates model parameters by using a direct preference optimization algorithm to minimize the log-likelihood loss function. In each iteration, the large model simultaneously calculates the probability of generating the preferred response and the non-preferred response, and adjusts the internal weights according to the probability difference between the two, so that the large model gradually strengthens the preference for the preferred response and suppresses the tendency to favor the non-preferred response. After multiple rounds of iterative updates, the model parameters converge to a stable state, resulting in an optimized large model. 
When performing multimodal tasks corresponding to the task type, the optimized large model can generate real content that matches the reference information with a higher probability, significantly reducing the possibility of hallucinations, thereby improving the credibility and practicality of the model in real-world applications.
[0062] In the large model optimization method described above, the input modal information and the original task instruction are acquired and the multimodal task type is determined; the original task instruction is decomposed into atomic sub-instructions; preferred and non-preferred responses are generated and a preference dataset is constructed; and the model parameters are then updated based on the dataset to obtain the optimized model. The method completes preference data construction and model optimization automatically, eliminating the dependence on manual annotation and feedback. It effectively reduces the cost of hallucination suppression, simplifies the implementation process, improves the accuracy and credibility of the large model's output, and is suited to efficient application in multimodal tasks.
[0063] In some embodiments, updating the parameters of a large model based on a preference dataset to obtain an optimized large model includes: Input the preference dataset into the large model; Define an optimization objective, and based on this objective, minimize the log-likelihood loss function using the direct preference optimization algorithm, and iteratively update the parameters of the large model.
[0064] By inputting the preference dataset into the large model, setting the optimization objective, and minimizing the log-likelihood loss function through the direct preference optimization algorithm, the parameters are updated iteratively. This eliminates the need for reward model construction and manual feedback, simplifies the training process, improves training efficiency and stability, and allows the large model to be optimized quickly toward low hallucination.
[0065] Furthermore, model optimization based on direct preference optimization refers to training the large model on the automatically constructed preference dataset in a concise, data-driven manner, without building a reward model or providing manual feedback, thereby simplifying the training process and improving stability. Taking the image description task as an example, the preference dataset constructed through atomic instruction decomposition and staged sub-response generation contains a large number of quadruple data samples. Each sample consists of input modality information, an original task instruction, a preferred response, and a non-preferred response. In one such sample, the input modal information is a street image containing a school bus and a cyclist, and the original task instruction is "describe the image in detail". The preferred response is a hallucination-free description generated through atomic instruction decomposition, while the non-preferred response is a description containing fictional content output directly by the large model. During the model optimization phase, the preference dataset is input directly into the large model to be optimized. No manual screening, labeling, or additional preprocessing of the dataset is needed; the entire dataset participates in training in its original form. The large model uses the preferred response as a positive reference sample and the non-preferred response as a negative reference sample, and continuously adjusts its own parameters through the direct preference optimization algorithm during training. The large model processes each quadruple sample in the preference dataset and iteratively updates its parameters by minimizing the log-likelihood loss function. As a result, under the same input conditions the large model gradually becomes more inclined to generate preferred responses and less inclined to generate non-preferred responses.
As training iterations progress, the model parameters are gradually updated. For example, in the early stages of training, the direct output of the large model for the aforementioned street image may still contain fictional content such as "a policeman is directing traffic." After multiple rounds of iteration, however, the large model increasingly tends to output hallucination-free descriptions that are close to the preferred response. The entire process is driven solely by the data and constrained by the optimization objective; it involves no change to the internal structure of the large model and requires neither a complex reward model nor human feedback to guide the training direction. The core advantage of this model optimization method lies in its simplicity and efficiency. Compared with traditional reinforcement learning from human feedback, it does not require training a reward model in advance to simulate human preferences, nor does it require continuously collecting human feedback data during training. It only requires inputting the preference dataset into the large model and starting the optimization program, after which model training and optimization are completed automatically. This greatly simplifies the training process and improves training efficiency and stability. Taking the visual question answering task as an example, the preference dataset contains quadruple samples such as one whose original task instruction is "What is the pedestrian in front of the red car in the image doing?" The preferred response, generated after decomposition into entity localization, attribute judgment, and answer summarization, is "The pedestrian in front of the red car is making a stop gesture", while the non-preferred response, output directly by the large model, is "The pedestrian is crossing the road".
During the optimization process, the large model uses the preferred response as a positive reference and the non-preferred response as a negative reference. It iteratively updates the parameters through the direct preference optimization algorithm, making the large model more inclined to generate accurate answers based on facts under the same input conditions. Through the above optimization process, the large model significantly reduces the probability of hallucinations while maintaining its original generative capabilities. The optimized large model can be used to perform various multimodal tasks such as image description, visual question answering, multimodal content generation, multimodal understanding and interaction. The matching degree between its output content and input modal information and objective facts is greatly improved, effectively ensuring the credibility and practicality of the model in practical applications. The entire optimization process requires no manual annotation, no reward model building, and no continuous manual feedback. It achieves efficient, low-cost, and stable model optimization solely by relying on an automatically constructed preference dataset and a direct preference optimization algorithm.
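The iterative behaviour described above can be illustrated with a deliberately small toy, not the actual training procedure: the "model" is reduced to two logits, one for the preferred response and one for the non-preferred response, and is updated by gradient descent on the commonly published direct preference optimization loss. The two-candidate simplification, the frozen reference gap, and the values of `lr` and `beta` are all assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax2(a: float, b: float) -> Tuple[float, float]:
    """Probabilities of a two-candidate policy with logits a and b."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def train(theta_pref: float, theta_non: float, steps: int = 200,
          lr: float = 1.0, beta: float = 0.5) -> List[float]:
    """Gradient descent on the per-sample loss; returns P(preferred) at each step."""
    ref_gap = theta_pref - theta_non  # frozen reference log-probability gap
    history: List[float] = []
    for _ in range(steps):
        p_pref, _ = softmax2(theta_pref, theta_non)
        history.append(p_pref)
        # For a two-way softmax, log p_pref - log p_non == theta_pref - theta_non.
        margin = beta * ((theta_pref - theta_non) - ref_gap)
        weight = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
        # d(loss)/d(theta_pref) = -beta * weight; d(loss)/d(theta_non) = +beta * weight
        theta_pref += lr * beta * weight
        theta_non -= lr * beta * weight
    return history


# Start with the hallucinated response more likely than the grounded one.
history = train(theta_pref=-1.0, theta_non=1.0)
```

In this toy, `history` starts below 0.5 and rises at every step, which matches the qualitative description: the preference for the preferred response is strengthened gradually, with no reward model or human feedback in the loop.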
[0066] In some embodiments, the optimization objective is to make the probability of the large model generating a preferred response higher than the probability of generating a non-preferred response.
[0067] Setting the optimization objective so that the probability of the large model generating the preferred response is higher than that of generating the non-preferred response accurately guides the direction of parameter adjustment. It strengthens the model's tendency to generate realistic and accurate content, weakens the probability of outputting fictional content, and suppresses hallucinations at the level of the generation strategy, thereby improving the credibility of the large model's output.
[0068] Furthermore, the optimization objective works in close coordination with the direct preference optimization algorithm, and the generation strategy of the large model is controlled precisely through the constraints of the loss function. Taking the image description task as an example, the preference dataset contains a quadruple sample in which the input modality information is a street image containing a school bus and a cyclist and the original task instruction is "describe the image in detail". The preferred response is a hallucination-free description generated through atomic instruction decomposition: "In the image, a man wearing a helmet and a backpack is riding a bicycle, and a yellow school bus with a red stop sign is parked next to him". The non-preferred response is a description containing fictional content output directly by the large model: "In the image, a man is riding a bicycle, a policeman is directing traffic, and a school bus is waiting on the side of the road". During model optimization, the large model calculates, for this sample, the probability of generating the preferred response and the probability of generating the non-preferred response. The optimization objective drives the large model to adjust its parameters so as to increase the former and decrease the latter. As training iterations progress, when the same image and instruction are input again, the probability of the large model outputting fictional content such as "a policeman is directing traffic" decreases significantly, while the probability of outputting descriptions based on real visual facts increases significantly. Taking the visual question answering task as an example, the preference dataset contains a quadruple sample in which the input modal information is an intersection image containing a red car and a pedestrian.
The original task instruction is "What is the pedestrian in front of the red car doing?" The preferred response, generated after decomposition into entity localization, attribute judgment, and answer summarization, is "The pedestrian in front of the red car is making a stop gesture". The non-preferred response, output directly by the large model, is "The pedestrian is crossing the road". When updating parameters, the optimization objective guides the large model to generate the fact-based answer, "The pedestrian is making a stop gesture," with a higher probability than the fictitious answer, "The pedestrian is crossing the road." After multiple rounds of iterative optimization, when faced with similar visual question answering tasks, the large model tends to generate answers based on verifiable visual evidence in the image rather than on its own conjecture or external knowledge. The core advantage of the optimization objective lies in its directness and simplicity. The objective is set directly as the probability of the preferred response being higher than that of the non-preferred response, and it is achieved in a data-driven manner through the direct preference optimization algorithm. There is no need to build an intermediate reward model or involve human feedback. This removes both the path through which reward model training errors propagate and the subjective interference of human feedback, making the optimization of the large model more stable and efficient. The optimization objective directly guides the large model to learn, at the level of the generation strategy, which outputs are more consistent with objective facts. The large model thus avoids fictitious content during generation, hallucinations are suppressed at the source, and the credibility and practicality of the large model in real-world applications are significantly improved.
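Whether the optimization objective holds for a given sample can be checked by comparing the two response probabilities. The sketch below assumes an autoregressive model, for which the log-probability of a response is the sum of its per-token log-probabilities. The function names are hypothetical, and the per-token values are made-up placeholders for illustration, not measurements from any model.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole response: sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def objective_satisfied(pref_token_logprobs: Sequence[float],
                        nonpref_token_logprobs: Sequence[float]) -> bool:
    """True when the preferred response is assigned the higher probability."""
    return (sequence_logprob(pref_token_logprobs)
            > sequence_logprob(nonpref_token_logprobs))


# Made-up per-token log-probabilities for the two visual question answering answers.
before = objective_satisfied([-1.2, -0.9, -1.5], [-0.4, -0.3, -0.6])  # pre-optimization
after = objective_satisfied([-0.3, -0.2, -0.4], [-1.8, -1.1, -2.0])   # post-optimization
```

With these placeholder values, `before` is `False` and `after` is `True`, corresponding to a model that initially favors the fictitious answer and, after optimization, favors the fact-based one.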
[0069] This application provides a vehicle including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the large model optimization method described above.
[0070] By equipping a vehicle with a memory and processor that store the corresponding computer program, the processor can implement the aforementioned large model optimization method when executing the program. This enables the large model on the vehicle to have efficient and low-cost hallucination suppression capabilities, improves the output accuracy of tasks such as in-vehicle multimodal interaction and visual perception, and enhances the practicality and reliability of in-vehicle intelligent systems.
[0071] Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of this application.
Claims
1. A large model optimization method, characterized in that it includes: Acquire input modality information and the original task instructions proposed in response to the input modality information, and confirm the task type based on the original task instructions; Based on the task type, the original task instruction is decomposed into atomic instructions to obtain multiple atomic sub-instructions; Each of the atomic sub-instructions is input into the large model to generate a preferred response, and a non-preferred response generated by the large model based on the original task instructions is obtained; By combining the input modality information, the original task instructions, the preferred response, and the non-preferred response, a preference dataset is constructed; The parameters of the large model are updated based on the preference dataset to obtain an optimized large model, which is then used to execute the multimodal task corresponding to the task type.
2. The large model optimization method according to claim 1, characterized in that, Based on the task type, the original task instruction is decomposed into atomic instructions to obtain multiple atomic sub-instructions, including: Determine the phased prompting learning logic that matches the task type; Based on the phased prompting learning logic, the original task instruction is decomposed into multiple atomic sub-instructions that are executed sequentially in a preset order.
3. The large model optimization method according to claim 2, characterized in that, Based on the phased prompting learning logic, the original task instruction is decomposed into multiple atomic sub-instructions that are executed sequentially according to a preset order, including: Based on the phased prompting learning logic, the multimodal task corresponding to the original task instruction is decomposed into multiple atomic sub-tasks; The preset order is generated based on the logical dependencies between the atomic subtasks. Using the completion of one atomic subtask corresponding to each atomic subinstruction as the decomposition standard, the original task instruction is decomposed in stages to obtain multiple atomic subinstructions that are executed sequentially according to the preset order.
4. The large model optimization method according to claim 2, characterized in that, Each of the aforementioned atomic sub-instructions is input into the large model to generate a preferred response, including: Based on the preset order, each of the atomic sub-instructions is executed sequentially through the large model to generate a sub-response corresponding to each of the atomic sub-instructions, and the preferred response is obtained based on the sub-response.
5. The large model optimization method according to claim 4, characterized in that, The large model sequentially executes each of the atomic sub-instructions to generate a sub-response corresponding to each atomic sub-instruction, and obtains the preferred response based on the sub-responses, including: In each of the atomic sub-instructions, a generation constraint condition is preset, and under the constraint condition, each of the atomic sub-instructions is executed sequentially; The sub-response generated by the previous atomic sub-instruction is used as the context input for the execution of the current atomic sub-instruction. The corresponding sub-responses are generated stage by stage, and the sub-responses are integrated to obtain the preferred response.
6. The large model optimization method according to claim 5, characterized in that, The sub-response generated by the previous atomic sub-instruction is used as the context input for the execution of the current atomic sub-instruction. Corresponding sub-responses are generated stage by stage, and the sub-responses are integrated to obtain the preferred response, including: Execute the first atomic sub-instruction to extract multiple objective facts from the input modal information and obtain the first sub-response; Execute the second atomic sub-instruction to generate corresponding feature descriptions based on the objective facts stated in the first sub-response, and obtain the second sub-response; Execute the third atomic sub-instruction to fuse the first sub-response and the second sub-response to obtain the preferred response.
7. The large model optimization method according to claim 1, characterized in that, By combining the input modality information, the original task instructions, the preferred response, and the non-preferred response, a preference dataset is constructed, including: The input modal information, the original task instruction, the preferred response, and the non-preferred response are combined into a quadruple data sample. Based on different multimodal tasks, multiple sets of quadruple data samples are generated; The preference dataset is obtained by aggregating multiple sets of the quadruple data samples.
8. The large model optimization method according to claim 1, characterized in that, The parameters of the large model are updated based on the preference dataset to obtain the optimized large model, including: Input the preference dataset into the large model; Define an optimization objective, and based on the optimization objective, minimize the log-likelihood loss function using the direct preference optimization algorithm, and iteratively update the parameters of the large model.
9. The large model optimization method according to claim 8, characterized in that, The optimization objective is to make the probability of the large model generating the preferred response higher than the probability of generating the non-preferred response.
10. A vehicle, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the large model optimization method as described in any one of claims 1 to 9.