Automobile intelligent cockpit interaction system based on large-small model collaboration
The intelligent cockpit system combines large and small models to address the insufficient fusion of in-vehicle and external perception and the lack of intelligent interaction modes. It balances real-time performance against inference depth and improves the adaptability and robustness of the cockpit system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- AUTOMOBILE RES INST OF TSINGHUA UNIV IN SUZHOU XIANGCHENG
- Filing Date
- 2025-12-22
- Publication Date
- 2026-06-12
Figure CN122196468A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of automotive electronics and intelligent human-machine interaction technology, and in particular to an intelligent cockpit interaction system for automobiles based on the collaboration of large and small models.
Background Technology
[0002] With the deep application of technologies such as Artificial Intelligence (AI), the Internet of Things (IoT), and big data in the automotive field, the intelligent cockpit has become an important intelligent space inside the car. Through real-time perception and deep learning of user behavior and habits, the intelligent cockpit can provide multimodal human-machine interfaces such as voice interaction and gesture control. It can also dynamically adjust the cabin atmosphere based on the driving environment and passenger preferences, providing differentiated user experiences and commercial value for new energy vehicles.
[0003] The development of intelligent cockpits has evolved from traditional mechanical instruments to digital central control, and then to intelligent systems centered on AI and multimodal interaction. Future development trends lean towards deep integration with external environmental perception, evolving into scenario-driven cockpits. Under the current technological architecture, a typical intelligent cockpit can be divided into three layers: bottom-layer hardware (cameras, microphone arrays, embedded eMMC memory, DDR memory, etc.), middle-layer system / functional software (including driver and cockpit domain drivers and perception software, etc.), and upper-layer services (facial recognition, automatic voice recognition, data services, scenario gateways, account authentication, etc.); in addition, it includes a supporting software platform (growth platform) for rapid development.
[0004] In recent years, large language models (LLMs) have made breakthroughs in natural language processing tasks, and researchers have begun to introduce pre-trained language models into in-vehicle systems to enhance multi-turn dialogue management and contextual understanding. For example, research on domain-fine-tuned LLMs for in-vehicle rejection detection, LLM-based in-vehicle retrieval-augmented dialogue systems, and dialogue datasets and prompt fine-tuning frameworks built for in-vehicle scenarios all demonstrates the potential of LLMs in passenger intent recognition and multi-turn dialogue.
[0005] Meanwhile, multimodal fusion research has improved in-cabin semantic understanding and user experience by jointly modeling signals such as vision, speech, and depth sensing. For example, methods such as pleasure prediction based on multimodal data, cross-modal feature fusion of multiple video streams, and joint embedding of visual and text input into LLM have made progress in in-cabin information understanding. In the areas of external environment perception and intelligent driving, there is also research on visual question answering (VQA) and visual-language models (VLM) for driving scenarios, achieving semantic understanding and analysis of driving scenarios through question answering or hierarchical reasoning.
[0006] The above background information is provided only to assist in understanding the inventive concept and technical solution of this invention. It does not necessarily belong to the prior art of this application, nor does it necessarily provide technical teaching. In the absence of clear evidence that the above information was disclosed before the filing date of this application, the above background information should not be used to evaluate the novelty and inventiveness of this application.
Summary of the Invention
[0007] The purpose of this invention is to provide an intelligent cockpit interaction system for automobiles based on the collaboration of large and small models, which has higher adaptability, safety and timely response capabilities, and can enhance the intelligence of human-vehicle interaction.
[0008] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0009] An automobile intelligent cockpit interaction system based on large-small model collaboration includes a first model system, a second model system, and a data collaboration module. The first model system is configured as an environmental perception system, and the second model system is configured as a hierarchical decision-making system.
[0010] The first model system includes a data acquisition module and an environment perception module. The data acquisition module is configured to acquire external scene data, and the environment perception module includes multiple sub-models. The multiple sub-models are configured to detect and identify the external scene data to obtain multiple structured detection results.
[0011] The data collaboration module is configured to perform data extraction, spatial alignment, semantic alignment, and text conversion on each of the structured detection results to obtain corresponding text data. The text data has global consistency and is configured to describe vehicle exterior scene data.
[0012] The second model system is configured to make driving decisions based on one or more of the instruction information, vehicle status information, and the text data.
[0013] Furthermore, following any of the aforementioned technical solutions or combinations thereof, when the structured detection result obtained by any of the sub-models is an event exceeding a preset safety threshold, the data collaboration module is configured to transmit the text data corresponding to the event exceeding the preset safety threshold to the second model system, and the second model system makes a corresponding driving decision in response to receiving the text data corresponding to the event exceeding the preset safety threshold.
[0014] or,
[0015] The data collaboration module is further configured to perform risk assessment on the text data corresponding to each of the structured detection results to obtain the corresponding risk assessment value, and to actively transmit high-risk text data to the second model system, wherein the risk assessment value corresponding to the high-risk text data is higher than a preset risk threshold; the second model system is configured to make corresponding driving decisions based on the high-risk text data.
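The threshold-based forwarding described in the preceding paragraph can be sketched as follows. This is a minimal illustration only: the `TextRecord` type, the `select_high_risk` function, the 0-to-1 risk scale, and the sample values are assumptions made for the example, and the source does not specify how the risk assessment value itself is computed.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Text data for one structured detection result (hypothetical type)."""
    source: str   # name of the sub-model that produced the detection
    text: str     # text data produced by the data collaboration module
    risk: float   # risk assessment value; a 0..1 scale is assumed here


def select_high_risk(records, risk_threshold):
    """Keep only records whose risk value is higher than the preset threshold."""
    return [r for r in records if r.risk > risk_threshold]


# Illustrative values only; they are not taken from the source.
records = [
    TextRecord("obstacle_detection", "Pedestrian, 6.0 m ahead.", 0.92),
    TextRecord("road_pre_aiming", "Speed bump, 40.0 m ahead.", 0.30),
]

# Only the high-risk text data would be actively transmitted
# to the second model system.
pushed = select_high_risk(records, risk_threshold=0.8)
```

With these sample values, only the pedestrian record is forwarded, so the second model system is invoked on demand rather than for every detection.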
[0016] Furthermore, following any or a combination of the aforementioned technical solutions, when an event exceeds a preset safety threshold, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(Lab, H, z) \mid T)$, if ID exists, where Trigger represents triggering, $M$ represents the decision function of the second model system, ID represents the target identifier, $Lab$ represents the category label, $H$ represents the elevation value, $z$ represents the distance between the target and the vehicle, $T$ represents the set of functions, and $\Phi$ represents the preset text template;
[0017] Alternatively, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(H, z) \mid T, H_0)$, if ID exists, where Trigger represents triggering, $M$ represents the decision function of the second model system, $H$ represents the elevation value, $z$ represents the distance between the target and the vehicle, $T$ represents the set of functions, $H_0$ represents the vehicle height, and $\Phi$ represents the preset text template;
[0018] Alternatively, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(t, r_0, r_p) \mid T)$, if $|r_0 - r_p| > \varepsilon$, where Trigger represents triggering, $M$ represents the decision function of the second model system, $T$ represents the set of functions, $r_0$ represents the current actual steering angle of the vehicle, $r_p$ represents the predicted steering wheel angle, $\varepsilon$ represents the preset angle difference threshold, and $\Phi$ represents the preset text template;
[0019] Alternatively, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(Lab, z) \mid T)$, if $|z_i| > \beta$, where Trigger represents triggering, $M$ represents the decision function of the second model system, $T$ represents the set of functions, $Lab$ represents the category label, $z$ represents the distance between the target and the vehicle, $z_i$ represents the distance of the i-th target from the vehicle, $\beta$ represents the preset distance threshold, and $\Phi$ represents the preset text template.
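As an illustration of the steering-angle variant above, the sketch below checks the condition that the absolute difference between the actual and predicted angles exceeds the threshold, and only then fills a text template and calls a decision function. The names `render_template`, `steering_trigger`, and `demo_decide`, the wording of the template, and the reading of `t` as a timestamp are all assumptions; the source does not define `t` or the template text.

```python
def render_template(t, r0, r_p):
    """Hypothetical stand-in for the preset text template (Phi)."""
    return (f"At t={t:.1f} s the actual steering angle is {r0:.1f} deg "
            f"and the predicted steering wheel angle is {r_p:.1f} deg.")


def steering_trigger(t, r0, r_p, eps, decide, tools):
    """Call the decision function only when |r0 - r_p| exceeds eps.

    decide plays the role of M and tools the role of the function set T.
    Returns None when the condition is not met.
    """
    if abs(r0 - r_p) <= eps:
        return None
    return decide(render_template(t, r0, r_p), tools)


def demo_decide(text, tools):
    """Placeholder decision function; a real system would query the large model."""
    return {"tool": tools[0], "message": text}


# Illustrative values only.
quiet = steering_trigger(0.0, 10.0, 11.0, eps=5.0,
                         decide=demo_decide, tools=["steering_warning"])
fired = steering_trigger(0.1, 10.0, 25.0, eps=5.0,
                         decide=demo_decide, tools=["steering_warning"])
```

In this sketch the first call leaves the large model idle because the angle difference is within the threshold, while the second call produces a decision.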
[0020] Furthermore, based on any or a combination of the aforementioned technical solutions, the sub-model includes a height restriction recognition model, which includes a stereo matching module, a MobileNetV2 variant module, an adaptive attention fusion unit, a ConvGRU module, a spatial pyramid fusion module, and an upsampling and coordinate attention module.
[0021] The stereo matching module is configured to generate a disparity map based on the external scene data, which includes multiple consecutive RGB images captured by a binocular camera.
[0022] The MobileNetV2 variant module is configured to extract visual and depth features in parallel from the disparity map and multiple consecutive RGB images to obtain multi-scale features;
[0023] The adaptive attention fusion unit is configured to perform spatial and road information interaction based on the multi-scale features to obtain feature maps of multiple frames.
[0024] The ConvGRU module is configured to perform temporal modeling on multi-frame feature maps to encode the temporal consistency between multi-frame feature maps and generate fused features with temporal constraints.
[0025] The spatial pyramid fusion module is configured to perform multi-scale context convergence on fusion features with temporal constraints to enhance the feature representation capability of height-restricted objects with different distances and attitudes.
[0026] The upsampling module and the coordinate attention module are configured to perform resolution restoration and position information enhancement on the features after multi-scale context convergence and enhancement, so as to output pixel-level output results. Based on the pixel-level output results, combined with camera intrinsic parameters and disparity values, the absolute height of the height restriction object and the distance of the height restriction object relative to the vehicle body are calculated. The output pixel-level output results include the endpoint offset of the height restriction object, confidence level and category probability.
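The final geometric step, recovering distance and absolute height from disparity and camera intrinsics, can be sketched with the standard rectified-stereo pinhole relations. This is an assumption about how such a computation is commonly done and is not the patent's stated formula. The function names, the assumption that the optical axis is parallel to the road, and all numeric values are illustrative.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Forward distance z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def height_above_ground(v_px, cy_px, depth_m, focal_px, camera_height_m):
    """Height of an image point above the road surface.

    Assumes image rows grow downward, the optical axis is parallel to the
    road, and the camera is mounted camera_height_m above the road.
    """
    y_down_m = (v_px - cy_px) * depth_m / focal_px  # offset below the optical axis
    return camera_height_m - y_down_m


# Illustrative values only: 1000 px focal length, 0.12 m baseline.
z = depth_from_disparity(disparity_px=8.0, focal_px=1000.0, baseline_m=0.12)
h = height_above_ground(v_px=300.0, cy_px=400.0, depth_m=z,
                        focal_px=1000.0, camera_height_m=1.3)
```

With these sample values the lower edge of the height-restriction object is about 15 m ahead and about 2.8 m above the road, which is the kind of (H, z) pair that could then be compared with the vehicle height.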
[0027] Furthermore, based on any or a combination of the aforementioned technical solutions, the sub-model includes a road surface pre-aiming model, which includes a first-stage network and a second-stage network.
[0028] The first-stage network includes a VGG-11-based backbone network and a UNet-based encoding and decoding architecture. Its input is the vehicle exterior scene data, and its output is a semantic segmentation image.
[0029] The second-stage network fuses the semantic segmentation image, scene image, and disparity image, and maps the three-dimensional disparity information onto a two-dimensional plane based on a top-down orthogonal projection method to obtain structured detection results representing road feature type, distance, and height.
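A top-down orthogonal projection of this kind can be sketched as binning labelled 3D points into a ground-plane grid and dropping the vertical axis. The sketch assumes the points have already been reconstructed from disparity and labelled by the segmentation stage; the `project_to_bev` function, the grid layout, and the rule of keeping the highest point per cell are illustrative choices, not details from the source.

```python
def project_to_bev(points, cell_m, x_range, z_range):
    """Project labelled 3D points onto a top-down grid.

    points: iterable of (x, h, z, label), with x lateral, h height and
    z forward distance, all in metres.
    Returns {(ix, iz): (max_height, label)} for occupied cells.
    """
    x_min, x_max = x_range
    z_min, z_max = z_range
    grid = {}
    for x, h, z, label in points:
        if not (x_min <= x < x_max and z_min <= z < z_max):
            continue  # outside the region of interest
        key = (int((x - x_min) // cell_m), int((z - z_min) // cell_m))
        # Keep the highest point in each cell together with its label.
        if key not in grid or h > grid[key][0]:
            grid[key] = (h, label)
    return grid


# Illustrative points only.
points = [
    (0.1, 0.02, 5.2, "road"),
    (0.3, 0.12, 5.4, "speed_bump"),
    (9.0, 0.50, 5.0, "curb"),  # falls outside x_range and is dropped
]
grid = project_to_bev(points, cell_m=0.5, x_range=(-4.0, 4.0), z_range=(0.0, 20.0))
```

Each occupied cell then carries a road feature type, a height, and (through its row index) a forward distance, matching the three quantities named in the paragraph above.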
[0030] Furthermore, following any one or a combination of the aforementioned technical solutions, the sub-model includes a steering wheel angle prediction model, which includes a dual-branch CNN module and a dual-branch NCP module. The dual-branch CNN module performs multi-scale feature extraction on the input image through inverse residual blocks and multi-scale convolutional layers to obtain a first feature and a second feature, wherein the feature scale of the second feature is higher than that of the first feature.
[0031] The dual-branch NCP module includes two neural circuits based on a four-level topology. The two neural circuits obtain the predicted steering wheel angle $r_p$ based on the following formula:
[0032] $r_p = \mathrm{NCP}_l(\mathrm{CNN}_l(I)) + \mathrm{NCP}_h(\mathrm{CNN}_h(I))$
[0033] Where $I$ is the input image, $\mathrm{CNN}_l(I)$ and $\mathrm{CNN}_h(I)$ are the first and second features, respectively, $\mathrm{NCP}_l$ is the first neural circuit, and $\mathrm{NCP}_h$ is the second neural circuit.
[0034] Furthermore, following any one or a combination of the aforementioned technical solutions, the sub-model includes an obstacle detection model, which includes a stereo matching module and an obstacle perception module. The stereo matching module is configured to generate a disparity image based on the following formula:
[0035]
[0036] Where $U$ represents the disparity map, $U(p)$ represents the disparity value at point $p$, $N(p)$ represents the neighborhood of point $p$, $p'$ represents a point within the neighborhood of point $p$, $U(p')$ represents the disparity value at point $p'$, SSIM is the structural similarity cost function, $\lambda$ is the regularization coefficient, $\nabla$ denotes the image gradient, $\lVert\cdot\rVert$ denotes the L2 norm, and MVP() represents the Viterbi multipath algorithm function. $I_L$, $I_R$, and $U_{init}$ represent the left-eye image, the right-eye image, and the initial disparity map, respectively.
[0037] The obstacle perception module is configured to perform feature extraction and semantic classification on the disparity image to identify vehicles, pedestrians, non-motorized vehicles, and other obstacles.
[0038] Furthermore, based on any or a combination of the aforementioned technical solutions, the second model system includes a core control module, an instruction input module, and a system feedback module;
[0039] The instruction information includes voice instructions, and the instruction input module is configured to convert the voice instructions into text instructions for output to the core control module;
[0040] The core control module is equipped with a pre-trained cockpit model, which is configured to make driving decisions based on one or more of the text commands, vehicle status information and text data. The driving decisions include warning information and / or active intervention control operations.
[0041] The system feedback module is configured to convert the driving decision into audible and / or visual cues and output them.
[0042] Furthermore, following any one or a combination of the aforementioned technical solutions, the second model system includes a pre-trained large cockpit model, which is trained in the following manner:
[0043] Qwen2-7B-Instruct was selected as the teacher model, and Qwen2-1.5B-Instruct was selected as the student model.
[0044] A pre-configured supervised corpus is used as a learning sample set, which includes text in a structured question-and-answer format;
[0045] The teacher model is trained using the learning sample set. The output of the teacher model is smoothed by the temperature parameter $T_s$, and the probability distribution predicted by the teacher model is then obtained through Softmax.
[0046] The student model is trained using the learning sample set, and the training process is constrained by the probability distribution predicted by the teacher model to obtain a trained student model, which serves as the large cockpit model; the total loss function $L_{KD}$ of the student model is:
[0047] $L_{KD} = (1-\alpha)\,H(y, q_S) + \alpha\,D_{KL}(q_T \,\|\, q_S)$;
[0048] Where: $\alpha$ is the weighting coefficient, $0 < \alpha < 1$;
[0049] $y_i$ represents a learning sample, $q_S(i \mid x)$ represents the probability distribution predicted by the student model, and $H(y, q_S)$ represents the cross-entropy loss between the student model's prediction and the real text sequence, where $i$ is a positive integer;
[0050] $D_{KL}(q_T \,\|\, q_S)$ represents the Kullback-Leibler divergence between the teacher model and the student model, and $q_T(i \mid x)$ represents the probability distribution predicted by the teacher model.
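The distillation loss defined above can be illustrated for a single token position using only the standard library. This is a sketch under stated assumptions: the same temperature is applied to both teacher and student logits in the divergence term, the cross-entropy term uses the unsmoothed student distribution, and no additional temperature-squared scaling is applied because the formula above does not include one. The function names and sample logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-smoothed softmax, numerically stabilised."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_index, q):
    """H(y, q) for a one-hot target y."""
    return -math.log(q[target_index])


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, target_index, alpha, temperature):
    """(1 - alpha) * H(y, q_S) + alpha * D_KL(q_T || q_S)."""
    q_s_hard = softmax(student_logits)               # student, no smoothing
    q_s_soft = softmax(student_logits, temperature)  # student, smoothed
    q_t_soft = softmax(teacher_logits, temperature)  # teacher, smoothed
    return ((1 - alpha) * cross_entropy(target_index, q_s_hard)
            + alpha * kl_divergence(q_t_soft, q_s_soft))


# Illustrative logits over a three-token vocabulary.
loss = kd_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.5],
               target_index=2, alpha=0.5, temperature=2.0)
```

When the student's logits equal the teacher's, the divergence term vanishes and only the cross-entropy term remains, which is a quick sanity check on any implementation of this loss.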
[0051] Furthermore, based on any or a combination of the aforementioned technical solutions, the structured detection results include one or more of the following: timestamp, sensor identifier, calibration reference, detection data list, BEV slice, and elevation slice;
[0052] And / or,
[0053] The sub-models include multiple models such as road surface pre-aiming model, traffic restriction recognition model, steering wheel angle prediction model, and obstacle detection model;
[0054] And / or,
[0055] The data acquisition module includes a binocular camera, which is configured to acquire real-time images of the road scene and transmit them to the environmental perception module.
[0056] And / or,
[0057] The data collaboration module includes an interface layer for system calls of the second model. The interface layer includes multiple interfaces, one interface for each sub-model, and one interface is configured to output text data corresponding to a structured detection result.
[0058] And / or,
[0059] The second model system is configured with a cockpit function list and a simulated dialogue scenario list. The function list includes six functional modules: vehicle entertainment, driving functions, cockpit configuration, system operation, visual perception and visible information retrieval. The simulated dialogue scenario list includes four types of scenarios: single-turn tool retrieval, multi-turn tool retrieval, multi-turn question-and-answer dialogue, and combination of multiple tools.
[0060] And / or,
[0061] The supervised corpus of the second model system follows these design specifications: set the dialogue background based on the global description, present the decision basis through visible information, specify each sub-model and its corresponding functional interface, configure format examples to guide the tool call and feedback process, divide the guidance into prompts and body text, and use question sequences to simulate real interaction.
[0062] The beneficial effects of the technical solution provided by this invention are as follows:
[0063] a. Enhance the intelligence of human-vehicle interaction: A domain fine-tuned large cockpit model provides accurate intent understanding, natural and smooth multi-turn dialogue, and scenario-based proactive prompts, so that human-vehicle interaction expands from passive single-turn question answering to a proactive, intelligent, and context-aware interaction mode, which significantly improves convenience and user experience during driving.
[0064] b. Enables intelligent cockpit systems to balance real-time performance and reasoning depth: Through the division of labor and collaborative design of small and large models, the small model achieves rapid response and low-latency environmental perception on the vehicle side, while the large model provides deep semantic understanding and complex reasoning. The two form an efficient complementary mechanism, achieving the optimal balance between real-time performance and reasoning ability under limited computing power.
[0065] c. Improve model adaptability and engineering feasibility: By introducing structured output constraints and chain-of-thought prompts during the fine-tuning process, the cockpit model trained by this invention can output data in a standard format and has logical reasoning capabilities, thereby connecting seamlessly with the vehicle's infotainment system interface and improving the feasibility and robustness of vehicle-side deployment.
Description of the Drawings
[0066] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0067] Figure 1 is a schematic diagram of a system framework based on large-small model collaboration, provided as an exemplary embodiment of the present invention;
[0068] Figure 2 is a schematic diagram of the modules of an intelligent cockpit interaction system for automobiles, provided as an exemplary embodiment of the present invention;
[0069] Figure 3 is a schematic diagram illustrating the principle of system task processing based on large-small model collaboration, provided as an exemplary embodiment of the present invention;
[0070] Figure 4 is a schematic diagram of a prompt template used for the supervised corpus, provided as an exemplary embodiment of the present invention;
[0071] Figure 5 is a schematic diagram of a template-based data augmentation method for the supervised corpus, provided as an exemplary embodiment of the present invention;
[0072] Figure 6 is a schematic diagram of the systematic training process for a large cockpit model, provided as an exemplary embodiment of the present invention;
[0073] Figure 7 is a schematic diagram of the input and output of a road surface pre-aiming model, provided as an exemplary embodiment of the present invention;
[0074] Figure 8 is a schematic diagram of the input and output of a height restriction recognition model, provided as an exemplary embodiment of the present invention;
[0075] Figure 9 is a schematic diagram of the input and output of a steering wheel angle prediction model, provided as an exemplary embodiment of the present invention;
[0076] Figure 10 is a schematic diagram of the input and output of an obstacle detection model, provided as an exemplary embodiment of the present invention.
Detailed Implementation
[0077] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0078] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, apparatus, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.
[0079] Despite the progress made in existing research and products in in-cabin multimodal understanding and LLM-based dialogue systems, the following technical shortcomings still urgently need to be addressed:
[0080] (1) Insufficient integration of in-vehicle and out-of-vehicle perception: Most intelligent cockpit systems at present focus on understanding and interacting with multimodal information in the vehicle, and pay insufficient attention to environmental perception and decision support for driving scenarios outside the vehicle. This makes it difficult for the system to actively intervene or issue effective warnings when there is a conflict between in-vehicle commands and dangerous scenarios outside the vehicle, which poses a safety hazard.
[0081] (2) Lack of intelligent interaction mode: Most existing interactions remain at the passive question and answer stage, lacking the ability to actively generate prompts, risk warnings or personalized guidance based on the scene status, and cannot realize the transformation from "passive response" to "active collaboration", resulting in limited value of the system in the driving process and poor user experience;
[0082] (3) The contradiction between large model deployment and real-time performance: Although end-to-end visual question answering or large model reasoning can achieve deep semantic understanding, the computational overhead is large and the response latency is high, making it difficult to meet the requirements of vehicle systems for real-time performance and computing power-constrained environments. At the same time, existing solutions do not support the consistency, fault tolerance and generalization ability of multi-turn voice interaction, which limits their availability in vehicle continuous dialogue and complex tasks.
[0083] (4) Lack of cross-domain collaboration capability: Existing solutions mostly focus on a single domain (the cockpit domain or the driving domain) and lack a system and mechanism that can achieve on-demand collaboration between the intelligent cockpit and intelligent driving while ensuring the efficiency of vehicle operation, thus limiting the overall intelligence and safety assurance capability of the system in complex scenarios.
[0084] In one embodiment of the present invention, an automobile intelligent cockpit interaction system based on large-small model collaboration is provided; see Figure 1. The system includes a first model system, a second model system, and a data collaboration module. The first model system includes a data acquisition module and an environmental perception module. The data acquisition module is configured to acquire external scene data, and the environmental perception module includes multiple sub-models. The multiple sub-models are configured to detect and identify the external scene data to obtain multiple structured detection results. The data collaboration module is configured to perform data extraction, spatial alignment, semantic alignment, and text conversion on each structured detection result to obtain corresponding text data. The text data has global consistency and is configured to describe the external scene data. The second model system is configured to make driving decisions based on instruction information, vehicle status information, and the text data.
[0085] The structured detection results include one or more of the following: timestamps, sensor identifiers, calibration references, detection data lists, BEV slices, and elevation slices. The sub-models include multiple models such as road surface pre-aiming models, traffic restriction recognition models, steering wheel angle prediction models, and obstacle detection models. The traffic restriction recognition models include speed limit recognition models and height limit recognition models, etc. The text data has global consistency, meaning it conforms to a preset paradigm. Specifically, the output results of each sub-model are cleaned, extracted, and aligned by the data collaboration module to obtain specific numerical values for the recognition results. For example, a specific paradigm's text data representation might be: the road surface pre-aiming output includes target identifier ID, category label Lab, elevation value H, and forward distance z. After data transformation, the structured detection results are filled into the paradigm and finally prepared for transmission to the second model system. The text data is no longer the original image, point cloud, or independent numerical values, but a machine-readable, semantically clear text representation.
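The template-filling step described above can be sketched as follows for the road surface pre-aiming output. The field names mirror the ID, Lab, H, and z quantities named in the text, but the `RoadPreviewDetection` type, the `to_text` function, and the wording of the template are assumptions made for illustration; the source does not give the actual template.

```python
from dataclasses import dataclass


@dataclass
class RoadPreviewDetection:
    """One structured detection result from the road surface pre-aiming model."""
    target_id: int       # ID: target identifier
    label: str           # Lab: category label
    elevation_m: float   # H: elevation value
    distance_m: float    # z: forward distance


# Hypothetical preset text template shared by all detections of this type.
TEMPLATE = "Target {id}: {lab}, elevation {h:.2f} m, {z:.1f} m ahead."


def to_text(det):
    """Fill the preset template so every detection yields text in the same format."""
    return TEMPLATE.format(id=det.target_id, lab=det.label,
                           h=det.elevation_m, z=det.distance_m)


sentence = to_text(RoadPreviewDetection(3, "speed bump", 0.08, 12.5))
# sentence == "Target 3: speed bump, elevation 0.08 m, 12.5 m ahead."
```

Because every sub-model's output passes through a fixed template of this kind, the second model system receives uniformly formatted text rather than raw images, point clouds, or isolated numbers.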
[0086] In this embodiment, the first model system is a lightweight small-scale environmental perception model (hereinafter referred to as the environmental perception small model), and the second model system is a large-scale cockpit model (hereinafter referred to as the cockpit large model). The environmental perception small model and the cockpit large model work together efficiently through the data collaboration module to build a safe, convenient and intelligent in-vehicle experience.
[0087] In one embodiment of the present invention, as shown in Figure 1, the environmental perception small model acquires real-time road scene images using a binocular camera and performs image processing and feature extraction on the acquired data to identify and output environmental information such as road surface pre-aiming, height restriction detection, steering wheel angle prediction, pedestrian detection, and vehicle detection. The data collaboration module cleans, aligns, and converts the output of the environmental perception small model in time, providing high-quality structured environmental state information to the downstream decision-making module, i.e., the cockpit large model. The cockpit large model collects, parses, and intelligently processes user voice commands, integrating the output of the environmental perception small model into the human-machine interface and functional logic to achieve dynamic presentation and intelligent feedback of key environmental states. When the cockpit large model receives user questions or control commands, the system performs contextual association and reasoning based on the current driving environment and vehicle operating status to generate reasonable suggestions or proactive warnings, and issues vehicle control commands when necessary, thereby continuously optimizing the driver's cockpit experience in a closed-loop human-machine information flow.
[0088] This application presents a vehicle intelligent cockpit interaction system based on large and small model collaboration, which draws inspiration from the hierarchical processing and feedback mechanisms of biological nervous systems to simulate the collaborative work of the human sensory system and brain. Its core consists of a large cockpit model for the intelligent cockpit and a small environmental perception model for intelligent driving, constructing a cross-domain collaborative architecture that combines real-time perception and on-demand reasoning. On the perception side, driving environment data is collected in real time via binocular cameras, with each sub-model responsible for rapid scene analysis and risk detection. On the decision-making side, a large cockpit model generated from a professionally supervised and fine-tuned dataset based on an open-source large language model undertakes advanced intent understanding and decision-making. Both sides achieve lightweight collaborative reasoning through on-demand invocation and result write-back, balancing real-time response with semantic depth. The system also includes the aforementioned data collaboration module, which comprises an interface layer for basic cockpit function calls and a fusion module / data processing unit for human-vehicle interaction. It supports natural wake-up of the vehicle system by the driver or user in the cockpit and can also realize multi-round intelligent interaction with the driver or user. When a potential risk is detected, the system will trigger active activation, issue an alarm, and directly control the vehicle when necessary, thereby improving interactive intelligence, system robustness, and driving safety while ensuring vehicle-side operating efficiency.
[0089] In this embodiment, the collaborative process between the environmental perception small model and the large cockpit model is shown in Figure 2. The automotive intelligent cockpit interaction system adopts a modular design, mainly including: the core control module (main control module), the data acquisition module, the environmental perception module, the command input module, and the system feedback module.
[0090] The data acquisition module and the environment perception module constitute the first model system. The data acquisition module uses a binocular camera as the main sensor, supports hardware triggering or precise clock synchronization schemes, and adds a unified timestamp to each frame of image to ensure the temporal consistency between depth calculation and upper-layer perception logic. The camera system completes intrinsic and extrinsic parameter calibration and incorporates real-time distortion correction, image correction, and automatic exposure / gain adjustment preprocessing functions to adapt to complex lighting and scene changes. The preprocessed image input uses a deep learning-based disparity estimation algorithm to quickly generate a dense depth map. Subsequent post-processing steps, such as hole filling, smoothing, and uncertainty assessment, control the depth error within an acceptable centimeter range, thus providing a high-precision and robust data foundation for the environment perception module.
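For reference, the standard rectified-stereo relation behind converting a disparity map into depth can be written as below. The symbols f (focal length in pixels), B (baseline), d (disparity), and δd (disparity error), as well as the numeric values in the worked example, are introduced here for illustration and are not taken from the application.

```latex
Z = \frac{f B}{d}, \qquad \delta Z \approx \frac{Z^{2}}{f B}\,\delta d

% Worked example with assumed values f = 1000 px, B = 0.12 m, d = 12 px:
Z = \frac{1000 \times 0.12}{12} = 10\ \text{m}, \qquad
\delta Z \approx \frac{10^{2}}{1000 \times 0.12} \times 0.1 \approx 0.083\ \text{m}
\quad (\delta d = 0.1\ \text{px})
```

Because the depth error grows with the square of the distance, sub-pixel disparity accuracy and the post-processing steps described above matter most for distant targets.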
[0091] The environmental perception module, as the system's perception unit, relies on road images and other multi-source data. It employs efficient image processing workflows and deep learning algorithms to transform raw observations into environmental understanding information suitable for decision-making. Given the limitations of the onboard platform in terms of computing power and real-time performance, each small model balances accuracy and latency during its structural design and training phases, achieving performance trade-offs through model compression, distillation, or lightweight design. Under the scheduling of the large cockpit model, each small model runs in parallel, identifying and outputting key elements such as road condition information, lane lines, and the two-dimensional / three-dimensional position and size of obstacles in real time. The cleaned and aligned structured results are then fed back to the large cockpit model, providing criteria for subsequent safety decisions.
[0092] The command input module, core control module, and system feedback module constitute the second model system. The core control module is configured on an automotive-grade SoC or equivalent vehicle control unit that meets ASIL-D safety level requirements, and integrates the large cockpit model to handle human-machine interaction and driving decision-making simultaneously. This module performs semantic parsing of user-issued commands through an embedded natural language understanding subsystem, supports session-level state maintenance and contextual reasoning, and schedules and calls upon the structured perception results generated by the environmental perception sub-models accordingly; at the same time, it receives and integrates feedback information from each sub-model in real time. Using an edge inference framework and a deterministic scheduling mechanism, the module can complete the decision-making loop within milliseconds. Once a safety risk is detected, it immediately issues a warning and provides multimodal feedback to the driver via voice, touch, and other means. If necessary, it issues execution commands to the vehicle control unit to achieve proactive intervention, thereby ensuring the system's rapid response capability and operational safety.
[0093] The command input module serves as the entry point for human-computer interaction, supporting natural input methods such as voice. The module employs a high-sensitivity microphone array for audio acquisition and implements preprocessing techniques at the hardware / software level, including beamforming, noise suppression, and voice enhancement, to improve input signal quality. Subsequently, the voice recognition unit converts the speech into text and forwards it to the core control module. The core module performs natural language parsing on the received text to identify user intent and generates corresponding dispatch commands, which are then sent to the target functional modules to enable operations such as navigation, multimedia control, or vehicle function adjustments, ensuring the efficiency and accuracy of the interaction process.
[0094] The system feedback module presents decision results and early warning information to users in a multimodal manner to improve the timeliness and perceptibility of information delivery. Output methods include voice broadcasts and alarm prompts to highlight and convey key information. For vehicle control commands to be executed, the core control module directly issues commands through the chassis control interface and collects execution feedback in real time to confirm the completion of actions. Combining environmental perception and risk prediction capabilities, when a sudden event or potential hazard is detected ahead, the system can trigger warnings such as voice prompts, alert sounds, or vibrations, and provide handling suggestions or implement proactive intervention, thereby reducing accident risks and improving driving safety and transparency.
[0095] As shown in Figure 2, the above modules are functionally divided and communicate and collaborate with each other through structured interfaces, thus forming a closed-loop control system of information acquisition, perception, decision-making, execution, and feedback. The system proposed in this invention employs a brain-inspired bionic mechanism: a large-small model collaborative mechanism that simulates the collaborative work of the biological senses and the brain. The small models, i.e., the first model system, are equivalent to the "eyes / ears," responsible for real-time perception and rapid analysis of the driving environment; the large model, i.e., the second model system, is equivalent to the "brain," performing deep semantic understanding and decision generation on key information. This constructs a hierarchical processing mechanism of "fast path - slow path" and a closed-loop control mechanism of "acquisition - perception - decision-making - execution - feedback," achieving efficient integration and engineering feasibility verification.
[0096] This invention enables intelligent cockpit systems to balance real-time performance and reasoning depth: through the division of labor and collaborative design of small and large models, the small model achieves rapid response and low-latency environmental perception on the vehicle side, while the large model provides deep semantic understanding and complex reasoning. The two form an efficient complementary mechanism, achieving the optimal balance between real-time performance and reasoning ability under limited computing power.
[0097] In the system architecture design described in this invention, the collaboration strategy is the core of achieving efficient cooperation among various functional modules. To this end, this invention proposes two types of collaboration strategies: task-level collaboration and data-level collaboration. Based on task-level collaboration, dynamic task scheduling ensures the real-time performance and resource utilization efficiency of the decision-making loop between modules. Data-level collaboration achieves the fusion and semantic representation of multi-source heterogeneous information through a unified interface and standardized data format. These two strategies work together to support the deep coupling between the large cockpit model and the small environmental perception model.
[0098] In task-level collaboration, the principle of maximizing resource utilization for each task is maintained, fully leveraging the advantages of the large and small model collaborative architecture. Specifically, a lightweight visual small model group $S = \{S_1, S_2, S_3, \dots, S_k\}$ is constructed, where k is any natural number greater than 1, and the 3D scene is reconstructed from the stereo images $I_L$ and $I_R$ and the disparity image D. Each small model achieves functional decoupling through a multi-task learning framework: the road surface prediction model S1 uses road surface geometry modeling to evaluate road surface smoothness and slope, the height restriction detection model S2 measures the maximum height and depth distance of the height restriction object, the steering wheel angle prediction model S3 combines road geometry and vehicle status to output steering angle suggestions, and the pedestrian / vehicle obstacle detection model S4 locates the position of pedestrians / vehicles through a multi-target tracking algorithm.
[0099] The large cockpit model M, as the core decision-making entity, is responsible for scheduling and controlling the aforementioned k sub-tasks. This is manifested as a tool invocation capability over the controllable function set $T = \{t_1, t_2, \dots, t_n\}$, where n is a natural number greater than 1. The functions are embedded as prior capabilities into the model's system text prompt (Prompt) through a standardized interface. When a user input command q is received, the large cockpit model automatically parses the task and selects the most suitable tool t for invocation, then receives the execution result r returned by the vehicle system. For compound commands (e.g., "search for nearby gas stations and replan the route"), the large cockpit model generates and executes a tool invocation sequence step by step using chain-of-thought reasoning, with each step's action and result incorporated into the subsequent context until the task is completed, ultimately generating and outputting the integrated answer.
[0100] This process can be expressed as follows: where M represents the large cockpit model, q represents the input command, T represents the set of tool functions, r represents the execution result, i.e., the decision result, k is a natural number greater than 1, i = 1, 2, ..., k-1, and j = 1, 2, ..., k-1.
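The step-by-step tool invocation described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `stub_model` is a hypothetical stand-in for the large cockpit model M, and the tool names, arguments, and return strings are invented for the example rather than taken from the application.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool set T; names and return strings are illustrative only.
def search_gas_station(_: str) -> str:
    return "nearest gas station: 1.2 km ahead"

def replan_route(destination: str) -> str:
    return f"route replanned via {destination}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_gas_station": search_gas_station,
    "replan_route": replan_route,
}

Step = Tuple[str, str, str]  # (tool name, argument, execution result r)

def stub_model(query: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for M: choose the next (tool, argument) from the query q and
    the results so far, or return ("final", answer) when nothing is left."""
    done = {name for name, _, _ in history}
    if "gas station" in query and "search_gas_station" not in done:
        return "search_gas_station", ""
    if "replan" in query and "replan_route" not in done:
        return "replan_route", "gas station"
    return "final", "; ".join(result for _, _, result in history)

def run(query: str, model=stub_model, tools=TOOLS, max_steps: int = 8):
    """Call tools one at a time, feeding every result back into the context."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = model(query, history)
        if action == "final":
            return arg, history
        history.append((action, arg, tools[action](arg)))
    raise RuntimeError("tool-call budget exhausted")

answer, steps = run("search for nearby gas stations and replan the route")
print(answer)
```

The `max_steps` bound is a design choice of this sketch, so that a model that never emits a final answer cannot loop indefinitely.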
[0101] Since a large number of specialized utility functions may be involved in a car cockpit, the large cockpit model needs to have robust parsing capabilities for long sequences of text and contextual information.
[0102] In task-level collaboration, the human-vehicle interaction of intelligent cockpit tasks is mainly reflected in two aspects: proactive activation (the large cockpit model proactively initiates early warning notifications or risk notifications) and natural wake-up (the user proactively initiates command calls or voice wake-up).
[0103] Regarding the active activation, when any of the sub-models detects an event that exceeds a preset safety threshold, the data collaboration module is configured to transmit the text data corresponding to the event exceeding the preset safety threshold to the second model system. The second model system then makes a corresponding driving decision in response to receiving the text data corresponding to the event exceeding the preset safety threshold.
[0104] Specifically, the environmental perception submodule continuously outputs preprocessed 3D scene features. When any small model (sub-model) detects an event exceeding a preset safety threshold, this information is directly and proactively reported to the large cockpit model via the data collaboration module for key data capture and rule-based reasoning. The large cockpit model determines a warning strategy based on prior rules and the current context, and drives relevant sub-modules to perform corresponding actions in the form of task assignment. Subsequently, the system prompts the driver through voice or warning signals, and directly intervenes in vehicle control to achieve proactive intervention when necessary.
[0105] Alternatively, in a specific application example, each of the sub-models is only responsible for outputting the structured detection results and does not judge whether the event corresponding to the detection result exceeds a preset safety threshold. The data collaboration module performs risk assessment on the text data corresponding to each of the structured detection results to obtain a corresponding risk assessment value, and actively transmits high-risk text data to the second model system. The risk assessment value corresponding to the high-risk text data is higher than the preset risk threshold; the second model system is configured to make corresponding driving decisions based on the high-risk text data.
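A minimal sketch of this alternative, in which the data collaboration module scores each detection and forwards only the high-risk text, might look as follows. The scoring rules, the vehicle height, the distance range, and the threshold value are all assumptions made for illustration; the application does not specify how the risk assessment value is computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Detection:
    source: str   # small model that produced the result
    text: str     # text form produced by the data collaboration module
    value: float  # measured quantity in metres


# Hypothetical scoring rules mapping a detection to a risk value in [0, 1].
VEHICLE_HEIGHT_M = 2.8  # assumed vehicle height
MAX_RANGE_M = 50.0      # assumed distance at which obstacle risk falls to zero


def height_limit_risk(d: Detection) -> float:
    # Maximal risk when the clearance does not exceed the vehicle height.
    return 1.0 if d.value <= VEHICLE_HEIGHT_M else 0.0


def obstacle_risk(d: Detection) -> float:
    # Risk grows linearly as the obstacle gets closer.
    return max(0.0, min(1.0, 1.0 - d.value / MAX_RANGE_M))


RISK_RULES: Dict[str, Callable[[Detection], float]] = {
    "height_limit": height_limit_risk,
    "obstacle": obstacle_risk,
}
RISK_THRESHOLD = 0.7  # assumed preset risk threshold


def select_high_risk(detections: List[Detection],
                     threshold: float = RISK_THRESHOLD) -> List[Detection]:
    """Return only the detections whose risk value exceeds the threshold."""
    return [d for d in detections if RISK_RULES[d.source](d) > threshold]


frame = [
    Detection("height_limit", "height restriction of 2.5 meters in front", 2.5),
    Detection("obstacle", "pedestrian 40 meters ahead", 40.0),
    Detection("obstacle", "vehicle 10 meters ahead", 10.0),
]
forwarded = select_high_risk(frame)
print([d.text for d in forwarded])
```

Keeping the threshold logic in one place, rather than in each sub-model, means a new sub-model only needs a scoring rule to take part in active activation.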
[0106] For natural wake-up, the system first converts user commands into text using Automatic Speech Recognition (ASR) and transmits it to the large cockpit model. The large model then performs semantic parsing and intent recognition, combining real-time vehicle status and environmental perception data to determine the invocation strategy. After incorporating the capabilities of the smaller environmental perception models into a set T as tools, the large cockpit model can invoke perception results through function calls to respond to user queries or provide input for higher-order tasks. At the cockpit control level, the large cockpit model invokes system tools such as navigation, weather inquiries, and obstacle detection, and uses a chain-like approach to deduce decisions step by step, ultimately providing feedback to the user through voice or a visual interface. During the interaction, the system records the conversation context and user preferences to support multi-turn continuous dialogue and personalized services.
[0107] Data-level collaboration based on the aforementioned data collaboration module involves defining a unified interface and data format. Features extracted from each sub-model across visual, depth, and geometric dimensions are pre-fused and presented in a consistent, structured representation for the large cockpit model to access. This reduces latency caused by multi-frame and multi-format conversions, improving system lightweightness and real-time performance. This strategy enables deep integration of multi-source heterogeneous models, enhancing the comprehensiveness and robustness of environmental perception, providing richer and more accurate scene information to upper-level decision-making modules, and maintaining dynamic adaptability and scalability while ensuring professional capabilities.
[0108] The results generated by each sub-model are processed, fitted to a predefined text template, and converted into a textual representation by the data collaboration module. For example, if the small model detects a height restriction of 2.5 meters ahead, it outputs the value 2.5. After the value is filled into the template "height restriction of XX meters in front", the module outputs the textual description "height restriction of 2.5 meters in front".
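The template-filling step in this example can be sketched in a few lines. The dictionary key `height_limit` and the function name `to_text` are hypothetical; only the template wording comes from the text above.

```python
# Text templates keyed by detection type; only the first wording is from the
# description, the key names are assumptions for illustration.
TEMPLATES = {
    "height_limit": "height restriction of {value} meters in front",
}


def to_text(kind: str, value: float) -> str:
    """Fill a sub-model's numeric output into its text template."""
    return TEMPLATES[kind].format(value=value)


description = to_text("height_limit", 2.5)
print(description)  # height restriction of 2.5 meters in front
```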
[0109] As shown in Figure 3, the system first uses a binocular camera to acquire real-time images of the road ahead and generates disparity images using a stereo matching algorithm to extract depth information. The raw data enters the processing pipeline via a high-speed data transmission unit: small models such as road surface pre-aiming, height restriction recognition, steering wheel angle prediction, and pedestrian / vehicle obstacle detection take the binocular images and disparity maps as input, outputting structured detection results such as semantic segmentation, image annotation, distance estimation, and turning angle suggestions. Subsequently, the data collaboration module cleans and extracts key features from the various outputs, and the alignment module generates a unified data representation. Finally, in the conversion stage, the data from the various sources is encapsulated into standardized text for use by the large cockpit model. This process follows predefined data contract standards; newly connected small models only need their outputs to conform to the agreed format to be seamlessly integrated in parallel, thus ensuring the system's scalability and efficient operation.
[0110] In specific embodiments, with the 3D point cloud removed, the most intuitive outputs of the small models are semantic segmentation, image annotation, distance estimation, and steering angle suggestion. For example, the road surface pre-aiming model can output semantic segmentation results based on the image, marking the position of bumps or depressions on the road surface and calculating their positive or negative height relative to a smooth road surface; the height restriction recognition model can mark the position of height restriction objects in the image and calculate the restricted height and the distance to the vehicle; the steering wheel angle prediction model can directly predict the suggested rotation angle based on the image; and the obstacle detection model for people and vehicles can mark the position of people and vehicles in the image and calculate their distance to the vehicle.
[0111] The working principle of the data collaboration module is to correlate, calibrate and integrate multi-source outputs from multiple heterogeneous small models with different formats in time and space, and generate a globally consistent and conflict-free text-formatted scene description, laying the foundation for subsequent data conversion and final fusion.
[0112] The data collaboration module's operation includes data extraction, spatial alignment, semantic alignment, and generation of a unified text data representation. During the data extraction process, the data collaboration module simultaneously "listens" to and receives outputs from four smaller models (road surface prediction, height restriction object recognition, steering wheel angle prediction, and pedestrian / vehicle obstacle detection). Each smaller model can, based on the target's position in the image, process the data to output the target's 3D information, such as the position and distance of obstacles on the road surface, the height and distance of height restrictions, and the predicted steering wheel angle.
[0113] For the spatial alignment process, all small models receive input from the same calibrated binocular camera system, using binocular images and generated disparity maps acquired simultaneously. Therefore, all sub-models perceive within the same world coordinate system. The information extracted from the above data is then transformed into real-world spatial data in the world coordinate system.
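As an illustration of how a detection at an image position can be turned into metric 3D data, the sketch below back-projects a pixel and its disparity using the standard pinhole stereo model. The intrinsic parameters and baseline are assumed example values, and the result is expressed in the left camera frame; mapping it into a world or vehicle frame would additionally need the calibrated extrinsic transform, which is not shown.

```python
from typing import Tuple


def pixel_to_camera(u: float, v: float, disparity: float,
                    fx: float, fy: float, cx: float, cy: float,
                    baseline: float) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with the given disparity (pixels) into
    left-camera coordinates (metres) for a rectified stereo pair."""
    if disparity <= 0:
        raise ValueError("disparity must be positive for a valid depth")
    z = fx * baseline / disparity   # depth along the optical axis
    x = (u - cx) * z / fx           # lateral offset
    y = (v - cy) * z / fy           # vertical offset
    return x, y, z


# Assumed calibration: 1000 px focal length, 1280x720 image, 0.12 m baseline.
x, y, z = pixel_to_camera(u=740, v=360, disparity=12.0,
                          fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
                          baseline=0.12)
print(x, y, z)
```

Because every small model consumes the same calibrated stereo input, applying one such conversion to all of them is what places their outputs in a shared coordinate system.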
[0114] The data collaboration module refines the normalized data and assigns it clear semantic labels. This elevates the raw data to a higher level of semantic information and establishes connections between the perception results of different models, constructing a complete scene graph: road surface pre-aiming model: target identifier ID, category label Lab, elevation value H, and distance z; height restriction recognition model: target identifier ID, height restriction object height H, and distance z; steering wheel angle prediction model: timestamp t, current actual turning angle $r_0$, and algorithm-predicted turning angle $r_p$; human and vehicle obstacle detection model: target identifier ID, category label Lab, and distance z.
[0115] The data collaboration module generates unified text data based on the refined scene graph data. After the above steps, all data has achieved consistency in time, space, and semantics. At this point, the module will fill the aligned data into the corresponding fields according to the predefined "data contract," generating a complete, internally consistent structured data object. This object is the direct input for the "data transformation" stage and will be converted into a text prompt for use by the large model.
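One way to picture the "data contract" is as a set of typed records whose fields mirror the scene graph entries listed above, serialized into a single structured object. The class names, field names, units, and the choice of JSON as the serialization are assumptions for illustration; the application does not disclose the concrete format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RoadSurfaceTarget:      # ID, Lab, H, z
    id: int
    label: str
    elevation_m: float
    distance_m: float


@dataclass
class HeightLimitTarget:      # ID, H, z
    id: int
    height_m: float
    distance_m: float


@dataclass
class SteeringPrediction:     # t, r0, rp
    timestamp: float
    current_angle_deg: float
    predicted_angle_deg: float


@dataclass
class ObstacleTarget:         # ID, Lab, z
    id: int
    label: str
    distance_m: float


@dataclass
class SceneFrame:
    """One time-aligned frame holding the outputs of all four small models."""
    timestamp: float
    road_surface: List[RoadSurfaceTarget] = field(default_factory=list)
    height_limits: List[HeightLimitTarget] = field(default_factory=list)
    obstacles: List[ObstacleTarget] = field(default_factory=list)
    steering: Optional[SteeringPrediction] = None


def to_structured_text(frame: SceneFrame) -> str:
    """Serialize the aligned frame into a deterministic JSON string."""
    return json.dumps(asdict(frame), sort_keys=True)


frame = SceneFrame(
    timestamp=12.5,
    height_limits=[HeightLimitTarget(id=1, height_m=2.5, distance_m=30.0)],
    obstacles=[ObstacleTarget(id=7, label="pedestrian", distance_m=18.0)],
    steering=SteeringPrediction(12.5, current_angle_deg=0.0,
                                predicted_angle_deg=-3.5),
)
text = to_structured_text(frame)
print(text)
```

A newly connected small model would then only need its own record type and a field in the frame to be carried through to the conversion stage.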
[0116] The cockpit big model occupies a crucial central position in the entire system, and its core capabilities stem from the advanced LLM architecture it relies on. The cockpit big model not only needs to learn specific question-answering patterns in the cockpit domain but also needs powerful tool invocation capabilities. This invention employs a domain adapter fine-tuning scheme, retaining the pre-trained big model's general semantic understanding backbone network while using professional supervised corpora for efficient incremental parameter training.
[0117] In one embodiment of the present invention, in order to adapt the large cockpit model to the vertical application domain and enhance the tool calling capability, two types of core resources are constructed and configured for the large cockpit model in the second model system: one is a list of cockpit function functions, and the other is a list of simulated dialogue scenarios.
[0118] The function list covers six core functional modules: in-vehicle entertainment, driving functions, cabin configuration, system operation, visual perception, and visible information retrieval. It aims to cover common operation and service retrieval scenarios within the car cabin. These modules include not only in-vehicle entertainment operations such as playing music and adjusting volume, but also driver assistance functions such as navigation settings and switching autonomous driving modes, as well as the management of comfort configurations such as seat adjustment and air conditioning control, and even basic system functions such as numerical calculations and alarm reminders.
[0119] The simulated dialogue scenario list is divided into four categories based on functional type and interaction complexity: single-turn tool call, multi-turn tool call, multi-turn question-and-answer dialogue, and multi-tool combination scenarios. In single-turn tool call scenarios, users often issue simple and direct commands such as "turn on the air conditioner." In multi-turn tool call scenarios, the requirements usually involve continuous operation steps, such as "please increase the volume first, and then switch to the next song." In multi-turn question-and-answer scenarios, the questions are sometimes independent of each other, but more often they are semantically connected, such as "Can I get over the height restriction barrier ahead?" (assuming it's impassable) "Then please replan my route." Multi-tool combination scenarios emphasize calling different functional modules as much as possible during the dialogue to form a more complex tool chain or thought chain, such as "Can I get over the speed bump ahead?" "Then please replan my route."
[0120] The list design balances direct commands with contextual reasoning (e.g., consecutive questions like "What's the weather like today?" or "What about tomorrow?") to present a task breakdown from simple to complex and multi-layered interaction capabilities. Furthermore, the simulated dialogue scenario list includes proactive warning corpus from collaborative large and small models. The environmental perception sub-model monitors the driving environment and generates warning signals, which the large cockpit model then uses to make decisions and responses. The training corpus employs a structured question-and-answer (QA) format, converting perception results into controllable and reproducible inputs for model fine-tuning.
[0121] To ensure the quality and structural consistency of the supervised corpus, this invention establishes a standardized data annotation structure. Every sample in the tool invocation dataset follows a unified six-element design.
[0122] (1) Setting the dialogue context based on global descriptors: In order to achieve context awareness and consistent response in the dialogue system, global descriptors need to be defined in advance to establish a complete dialogue context. This part constructs an overall cognitive framework by providing a comprehensive overview of the interaction environment, providing a foundation for the model to understand the context in depth.
[0123] (2) Presenting decision-making basis through visible information: The system should integrate and present all visible data that may affect response generation, such as real-time media playback information, environmental state parameters, or weather forecasts. The structured listing of such information helps enhance the model's ability to perceive dynamic external conditions, thereby supporting more accurate decision-making and response generation.
[0124] (3) Define each sub-model, i.e., the toolset and its corresponding functional interface: Clearly define the callable toolset and its corresponding functional interface, including input / output specifications, preconditions and execution constraints. This part provides the necessary operation instructions and information resources for the agent to perform specific tasks, ensuring the reliability of function execution and consistency with the system objectives.
[0125] (4) Configuration Format Examples Guide the Invocation and Feedback Process: This guide clarifies the entire process from intent parsing, tool invocation, execution feedback to result integration through formal examples. It clearly defines the logical steps and data transfer mechanisms for response construction, providing reusable operational paradigms and error handling strategies for the model.
[0126] (5) Separating prompts from the main text: The prompts serve as meta-instruction markers, used to distinguish between the system prompts and the user-initiated question sequence. This mechanism ensures the structured organization of prompt information, guarantees a clear boundary between system instructions and the dialogue main text, thereby maintaining the coherence and parsability of the interaction flow.
[0127] (6) Employing a question sequence to simulate real-world interaction, and highlighting multi-round reasoning and task decomposition capabilities through the thought chain: By simulating multi-round question-and-answer sequences in real-world user interaction, the system demonstrates its ability to progressively parse complex queries or compound instructions and decompose tasks. This module emphasizes reasoning methods based on the thought chain, presenting a complete cognitive path from initial question identification, sub-task generation, intermediate reasoning to the formation of the final solution, reflecting its systematic processing capabilities for multi-level and multi-modal tasks.
[0128] This six-element structure is used to improve annotation consistency and enhance the model's ability to understand and execute complex interaction scenarios. In the prompt template shown in Figure 4, <global_info> indicates the globally visible information displayed on the vehicle's infotainment system; <tools_description> describes the functionality of the globally visible tools; <tools_name> represents the list of tool names; and <query> indicates the prompt from the vehicle's infotainment system when a user queries the system or during an active warning scenario.
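A prompt assembled from these four fields might be built as sketched below. Only the four tag names come from the description; the closing tags, the ordering, the line layout, and the example tools are assumptions, since the actual template of Figure 4 is not reproduced in the text.

```python
from typing import Dict


def build_prompt(global_info: str, tools: Dict[str, str], query: str) -> str:
    """Assemble a prompt from visible information, tool descriptions,
    tool names, and the user query or warning text (hypothetical layout)."""
    tools_description = "\n".join(
        f"{name}: {desc}" for name, desc in tools.items()
    )
    tools_name = ", ".join(tools)
    return "\n".join([
        f"<global_info>{global_info}</global_info>",
        f"<tools_description>\n{tools_description}\n</tools_description>",
        f"<tools_name>{tools_name}</tools_name>",
        f"<query>{query}</query>",
    ])


prompt = build_prompt(
    global_info="height restriction of 2.5 meters in front",
    tools={
        "replan_route": "replan the navigation route to the destination",
        "set_volume": "set the media volume to a given level",
    },
    query="Can I get past the height restriction ahead?",
)
print(prompt)
```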
[0129] This invention proposes a template-based data augmentation method to expand the diversity of labeled samples and reduce the risk of model overfitting. As shown in Figure 5, the data augmentation method uses placeholder templates as its core. In pre-labeled samples, a special identifier "#" marks the positions of visible information and tool return values. The placeholders are randomly replaced by scripts or parameterized functions to generate instances with various parameter combinations, thereby automatically generating a large number of sample variants. The templates, as functional resources, can systematically produce rich training samples through parameter combination, thereby improving the model's generalization ability in different scenarios and alleviating the data sparsity or repetitive pattern problems that occur during training.
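The placeholder replacement can be sketched as follows. The `#name#` placeholder syntax, the sample fields, and the value pools are assumptions made for the example; the description only states that "#" marks the slots for visible information and tool return values.

```python
import random
import re
from typing import Dict, List

# Hypothetical pre-labeled sample with "#name#" placeholders.
TEMPLATE: Dict[str, str] = {
    "global_info": "Now playing: #song#; cabin temperature: #temp# C",
    "query": "Turn the volume to #level#",
    "tool_return": "volume set to #level#",
}

# Hypothetical value pools for each placeholder.
CHOICES: Dict[str, List[str]] = {
    "song": ["Song A", "Song B", "Song C"],
    "temp": ["20", "22", "24"],
    "level": ["3", "5", "8"],
}


def fill(template: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Draw one value per placeholder and substitute it in every field, so
    the same value appears consistently across the whole sample."""
    values = {name: rng.choice(pool) for name, pool in CHOICES.items()}
    return {
        key: re.sub(r"#(\w+)#", lambda m: values[m.group(1)], text)
        for key, text in template.items()
    }


def augment(template: Dict[str, str], n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Generate n sample variants from one template, reproducibly."""
    rng = random.Random(seed)
    return [fill(template, rng) for _ in range(n)]


samples = augment(TEMPLATE, n=5)
print(samples[0])
```

Drawing the values once per sample, rather than once per field, keeps the query and the simulated tool return value in agreement.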
[0130] To prevent the large-scale cockpit model from forgetting common-sense knowledge or semantic coherence when training tool invocation capabilities, this invention introduces the public dialogue question-answering dataset Neo_sft_phase2 as an auxiliary training resource. Although this public dataset has a low direct correlation with the automotive cockpit vertical domain, simultaneously learning from multi-source heterogeneous data allows the model to maintain sensitivity to semantic coherence and common-sense reasoning while mastering tool invocation logic. This improves task execution accuracy while maintaining the robustness and fluency of natural language generation.
[0131] The systematic training process of the large cockpit model is shown in Figure 6. The process includes key steps such as data preparation (i.e., the supervised corpus described above), model selection, fine-tuning and compression, and evaluation and validation. The base model chosen for this invention is Qwen2-7B-Instruct because it demonstrates superior overall performance compared to models of similar scale on instruction-following and common-sense reasoning benchmarks (such as AlignBench, C-Eval, etc.). Therefore, as a teacher model or fine-tuning base, it is more conducive to achieving strong instruction-following and common-sense reasoning ability.
[0132] To perform task-oriented adaptation of large models under limited computing resources, this invention employs Low-Rank Adaptation (LoRA) for fine-tuning. Specifically, a low-rank trainable matrix is incorporated into the target sub-layer of the pre-trained model, the original weights are frozen, and only low-dimensional adaptation parameters are trained. The target sub-layer refers to the network layer selected in the base model for adapting to the cockpit task, used to perform information transformation and feature representation. This target sub-layer includes, but is not limited to, the linear mapping layer in the attention mechanism and the linear transformation layer in the feedforward network. The low-rank trainable matrix (sub-layer) refers to the adjustment structure added to the target sub-layer. This adjustment structure consists of low-rank parameters with a parameter dimension lower than the original weight matrix of the target sub-layer, used to adjust the output of the target sub-layer while freezing the original model parameters. The low-dimensional adaptation parameters refer to the small number of parameters that need to be trained in the low-rank trainable sub-layer, far fewer than the number of parameters in the original model. Since the pre-trained model itself has natural language understanding capabilities, the task of this application is simply to adjust it in a specific direction (cockpit), therefore, training only low-dimensional parameters is sufficient to achieve the desired effect. This method significantly reduces the number of training parameters (typically <1%), thereby achieving efficient fine-tuning while preserving as much of the original model's knowledge as possible. 
Key fine-tuning aspects include: adapting the model to cockpit-specific commands, generating a standardized JSON output format, and enhancing the model's logical reasoning and task decomposition capabilities by introducing chain-of-thought style examples into the training prompts. After this adaptation and fine-tuning, a sub-version of the large cockpit model with strong tool invocation and formatted output capabilities can be generated.
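The low-rank adaptation idea can be illustrated with a small numeric sketch: the frozen weight W is left untouched and a trainable product B·A is added to its output. The matrix sizes, the rank, and the `scale` factor below are assumed example values, not the configuration used in the application, and the code uses plain Python lists rather than any training framework.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 scale: float) -> Vector:
    """y = W x + scale * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters r*(d_in + d_out) relative to the full d_out*d_in."""
    return rank * (d_in + d_out) / (d_out * d_in)


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, 2 x 3
A = [[1.0, 1.0, 1.0]]                    # rank-1 down-projection, 1 x 3
x = [1.0, 2.0, 3.0]

B_zero = [[0.0], [0.0]]                  # zero-initialized up-projection
y_init = lora_forward(W, A, B_zero, x, scale=2.0)

B_trained = [[0.5], [0.0]]               # pretend values after training
y_adapted = lora_forward(W, A, B_trained, x, scale=2.0)

fraction = trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(y_init, y_adapted, fraction)
```

With a zero-initialized B the adapted layer starts out identical to the pre-trained layer, and for a 4096 x 4096 layer at rank 8 the adapter holds about 0.39% of that layer's parameters, consistent with the under-1% figure mentioned above.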
[0133] Given that the deployment of the 7B-scale model on in-vehicle systems is constrained by real-time performance and hardware resources, this invention employs a model compression scheme based on knowledge distillation to compress the model into a version that can run on in-vehicle systems. The adapted Qwen2-7B-Instruct is used as the teacher model, and the smaller Qwen2-1.5B-Instruct is used as the student model, achieving knowledge transfer through soft-label guidance. Specifically, for each learning sample, i.e., each input text sequence x, the teacher model outputs a logits vector v, which is then processed by the temperature parameter T. s After smoothing, the soft label distribution q of the teachers is obtained through Softmax, and the calculation formula is as follows:
[0134] $q_T = \mathrm{Softmax}(v / T_s)$
[0135] Here, a value of $T_s$ greater than 1 lowers the peak of the distribution and highlights the relative similarity information between the categories of the learning samples.
[0136] The student model uses a Softmax function with a temperature parameter of 1 to generate its predicted probability distribution $q_S$. It minimizes two loss terms: the supervised cross-entropy loss F against the real text sequence y, and the Kullback-Leibler divergence $D_{KL}$ between the teacher's soft labels $q_T$ and the student's prediction $q_S$. The two loss terms are calculated as follows:
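[0137] $F(y, q_S) = -\sum_i y_i \log q_{S,i}$, $D_{KL}(q_T \| q_S) = \sum_i q_{T,i} \log \frac{q_{T,i}}{q_{S,i}}$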
[0138] The two losses are combined using a weighting coefficient α to form the total loss function $L_{KD}$ of the student model:
[0139] $L_{KD} = (1-\alpha)\,F(y, q_S) + \alpha\,D_{KL}(q_T \| q_S)$;
[0140] The weighting coefficient α balances basic task ability against teacher knowledge transfer. The two hyperparameters, the temperature $T_s$ and the weighting coefficient α, have a key impact on the distillation effect: $T_s$ determines the smoothness of the soft labels, and a value that is too large or too small is detrimental to effective transfer; α determines the balance in the student model between preserving the original labeled signal and learning the teacher's soft knowledge. To achieve the best distillation results, this invention tunes $T_s$ and α during the training phase using a combination of grid search and cross-validation, and selects the final trade-off point for the student model used for vehicle deployment based on performance on the validation set.
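The distillation loss described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logits are made-up values for a three-token vocabulary, the real label y is treated as a one-hot target given by `target_index`, and the function names are hypothetical. As in the description, the teacher distribution is smoothed with $T_s$ while the student uses temperature 1.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures give a smoother distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_index, probs):
    """Cross-entropy F(y, q_S) for a one-hot target y at position target_index."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, target_index, t_s, alpha):
    """L_KD = (1 - alpha) * F(y, q_S) + alpha * D_KL(q_T || q_S)."""
    q_t = softmax(teacher_logits, t_s)   # teacher soft labels, smoothed by T_s
    q_s = softmax(student_logits, 1.0)   # student prediction at temperature 1
    hard = cross_entropy(target_index, q_s)
    soft = kl_divergence(q_t, q_s)
    return (1.0 - alpha) * hard + alpha * soft


teacher = [4.0, 1.5, 0.2]   # made-up logits, not measured values
student = [2.5, 1.0, 0.5]
loss = distillation_loss(teacher, student, target_index=0, t_s=2.0, alpha=0.5)
print(f"L_KD = {loss:.4f}")
```

A grid search over $T_s$ and α would call `distillation_loss` inside the training loop for each candidate pair and compare the resulting student models on the validation set.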
[0141] In one embodiment of the present invention, the environmental perception model includes multiple sub-models / small models. Within the technical framework of deep collaboration between intelligent cockpit and intelligent driving, the present invention defines the environmental perception small models as the system's "sensory nerves," used for real-time scene perception based on binocular vision and continuously providing structured environmental representations to the large cockpit model. The small models extract key environmental features from the original images and continuously input them into the large cockpit model, thereby enabling the large model to acquire high-quality environmental states in real time, achieving dynamic synchronization of human-machine information. To overcome the limitations of traditional visual models in scene adaptability, the present invention designs four lightweight dedicated visual small models: a road surface pre-aiming model that uses 3D-BEV (3D Bird's Eye View) to predict road elevation; a height restriction recognition model that uses RGB images and disparity maps to detect obstacle height; a steering wheel angle prediction model that performs trajectory deduction based on road geometric features; and a human-vehicle obstacle detection model that uses stereo vision and semantic segmentation to achieve target localization.
[0142] This collaborative architecture exhibits excellent scalability, allowing for the seamless introduction of smaller models for other intelligent driving tasks in the future to expand perception capabilities and enable more comprehensive perception of the driving environment. Simultaneously, the design of these smaller models adheres to a trade-off between detection accuracy and inference speed. Through strategies such as model lightweighting, structured output, confidence labeling, and priority fusion, it ensures both the reliability of critical perception tasks and meets the real-time requirements of in-vehicle edge computing, thereby achieving an optimal performance balance between multi-model parallelism and system resource constraints.
[0143] The network structure and data transmission process of the four types of small models mentioned above will be explained below.
[0144] In one embodiment of the present invention, a lightweight two-stage road surface pre-aiming model for vehicle-mounted edge computing is proposed. The road surface pre-aiming model includes a first-stage network and a second-stage network; the first-stage network includes a VGG-11-based backbone network and a UNet-based encoder-decoder architecture, with the external scene data as input and the semantic segmentation image as output; the second-stage network fuses the semantic segmentation image, scene image, and disparity image, and maps the three-dimensional disparity information to a two-dimensional plane based on a top-down orthogonal projection method to obtain structured detection results representing road surface feature type, distance, and height.
[0145] Specifically, the road surface pre-aiming model takes a YUV-format image of a preset resolution (the vehicle exterior scene data) as input. The first-stage network is a semantic segmentation network with an encoder-decoder (UNet variant) structure, whose backbone is replaced with a lightweight VGG-11 variant to reduce parameter redundancy while maintaining feature extraction capability. This branch outputs a pixel-level semantic segmentation map: the end of the branch is mapped to a heatmap for each category through a 1×1 convolution, and a median-frequency-balanced weighted cross-entropy loss is used to calculate the segmentation error. The second-stage network fuses the segmentation output of the first-stage network, the scene image, and the binocular disparity image. Based on a top-down orthogonal projection method, it maps the three-dimensional disparity information onto a two-dimensional plane, thereby constructing a three-dimensional representation of the road surface in 3D-BEV space and predicting road elevation and slope. As shown in Figure 7, the final output is structured data representing road surface feature types, distances, and heights. In the semantic segmentation network, the encoder consists of multiple convolution, batch normalization, and ReLU layers, with the number of channels expanding layer by layer as the spatial resolution is downsampled. The decoder is configured symmetrically and fuses the corresponding encoder features with the upsampled features through skip connections. In addition, multi-scale feature fusion based on the FPN (Feature Pyramid Networks) principle is introduced to restore spatial resolution. The network output is represented as follows:
[0146]
[0147] Here, $M_{road}$ represents the final output semantic segmentation map; $f_{Fusion}$ represents the feature fusion function; the feature operand is the multi-scale feature pyramid output by the decoder; and $B_{obs}$ and $B_{free}$ represent the obstacle branch function and the drivable area branch function, respectively.
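The median-frequency-balanced weighted cross-entropy loss used for the segmentation branch can be illustrated as follows. This is a simplified sketch, not the training code of this application: class frequencies are computed over all pixels of the given label maps, the tiny label map and the three class indices are made up, and the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def median_frequency_weights(label_maps):
    """Return {class: median_freq / class_freq} computed from integer label maps."""
    counts = Counter()
    for label_map in label_maps:
        for row in label_map:
            counts.update(row)
    total = sum(counts.values())
    freqs = {cls: n / total for cls, n in counts.items()}
    mid = median(freqs.values())
    return {cls: mid / f for cls, f in freqs.items()}


def weighted_cross_entropy(probs, label_map, weights):
    """Weighted mean of -log p[label] over all pixels of one image."""
    loss = 0.0
    norm = 0.0
    for prob_row, label_row in zip(probs, label_map):
        for pixel_probs, label in zip(prob_row, label_row):
            loss -= weights[label] * math.log(pixel_probs[label])
            norm += weights[label]
    return loss / norm


# One 2 x 4 label map with three made-up classes (0 is the most frequent).
labels = [[[0, 0, 0, 1],
           [0, 0, 2, 2]]]
weights = median_frequency_weights(labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so small road features are not drowned out by large background regions.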
[0148] The structured output of the road surface pre-aiming model includes target identifier ID, category label Lab, elevation value H, and forward distance z. The target identifiers corresponding to the road surface pre-aiming model include raised obstacles, subsidence defects, and non-standard obstacles. Raised obstacles include speed bumps and manhole covers, subsidence defects include depressions and cracks, and non-standard obstacle categories cover foreign objects other than regular road users, including but not limited to tree branches, bricks, cardboard boxes, and other objects that may affect driving safety.
[0149] For example, the structured detection result output by the road surface pre-aiming model is z=20, H=5, with Lab being a speed bump. In active activation mode, the data transformation Φ of the data collaboration module fills the structured detection result into a predefined text template, forming text data such as "A road surface elevation of 5 cm was detected 20 meters ahead, identified as a speed bump, requiring a reminder to the driver," and outputs it as a prompt to the large cockpit model in the second model system, triggering the processing flow of the large cockpit model, as shown in the following expression:
[0150] Trigger=M(Φ(Lab,H,z)|T), if ID exists,
[0151] Here, M represents the decision function of the large cockpit model, T represents the set of callable functions, and Φ represents the preset text template. In natural wake-up mode, the large cockpit model obtains the latest structured data from the small model by calling the road surface pre-aiming function interface and then completes the subsequent semantic response.
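The active activation path for the road surface pre-aiming model can be sketched as below. The sketch assumes a hypothetical `RoadDetection` record, treats a missing ID as `None`, and replaces the large cockpit model M with a stub function; the function set T is omitted for brevity. Only the template wording comes from the example in the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoadDetection:
    target_id: Optional[int]   # ID; None when no target was detected
    label: str                 # Lab, e.g. "speed bump"
    height_cm: float           # H, elevation value
    distance_m: float          # z, forward distance


TEMPLATE = (
    "A road surface elevation of {h:g} cm was detected {z:g} meters ahead, "
    "identified as a {lab}, requiring a reminder to the driver"
)


def phi(det: RoadDetection) -> str:
    """Data transformation: fill the structured result into the text template."""
    return TEMPLATE.format(h=det.height_cm, z=det.distance_m, lab=det.label)


def trigger(det: RoadDetection, large_model: Callable[[str], str]) -> Optional[str]:
    """Call the large model only if a target ID exists."""
    if det.target_id is None:
        return None
    return large_model(phi(det))


def stub_large_model(prompt: str) -> str:
    """Placeholder for the large cockpit model; it only echoes the prompt."""
    return f"[warning to driver] {prompt}"


det = RoadDetection(target_id=1, label="speed bump", height_cm=5, distance_m=20)
print(trigger(det, stub_large_model))
```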
[0152] To identify height restriction facilities and measure them in three dimensions, this invention designs a lightweight height restriction recognition model based on binocular vision; that is, the sub-models include a height restriction recognition model.
[0153] The height restriction recognition model includes a stereo matching module, a MobileNetV2 variant module, an adaptive attention fusion unit, a ConvGRU module, a spatial pyramid fusion module, and an upsampling and coordinate attention module. The stereo matching module is configured to generate a disparity map based on the vehicle exterior scene data, which includes multiple consecutive RGB images captured by a binocular camera. The MobileNetV2 variant module is configured to extract visual and depth features in parallel from the disparity map and the multiple consecutive RGB images to obtain multi-scale features. The adaptive attention fusion unit is configured to perform spatial and channel-level information interaction based on the multi-scale features to obtain multi-frame feature maps. The ConvGRU module is configured to perform temporal modeling on the multi-frame feature maps, encoding the temporal consistency between frames and generating fused features with temporal constraints. The spatial pyramid fusion module is configured to perform multi-scale context aggregation on the fused features with temporal constraints, enhancing the feature representation of height restriction objects at different distances and in different poses. The upsampling and coordinate attention module is configured to restore resolution and enhance position information for the features after multi-scale context enhancement, outputting the pixel-level endpoint offset, confidence level, and category probability of the height restriction object; based on these pixel-level outputs, combined with the camera intrinsic parameters and disparity values, the absolute height of the height restriction object and its distance relative to the front of the vehicle body are calculated.
[0154] First, the stereo matching module generates a disparity map. The consecutive binocular RGB frames and the disparity map are fed to the MobileNetV2 variant module, which extracts visual and depth features in parallel. A feature pyramid turns the output features into a multi-scale representation, and spatial and channel-level information exchange is completed in the Adaptive Attention Feature Fusion (AFF) unit. The AFF operation is as follows:
[0155] $M_{AFF} = \Lambda \odot X + (1 - \Lambda) \odot Y$
[0156] $\Lambda = \sigma(M_{global}(X + Y) + M_{local}(X + Y))$
[0157] Here, $M_{AFF}$ represents the feature map output by the AFF unit, ⊙ denotes element-wise multiplication, σ represents the Sigmoid activation function, $M_{global}$ represents the global average pooling calculation, $M_{local}$ represents the pointwise convolution calculation, X represents the features extracted from the disparity map, and Y represents the features extracted from the multiple RGB images. X and Y have the same dimensions and number of channels.
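The AFF operation can be written out directly from the two formulas above. The following pure-Python sketch is an illustration only: feature maps are nested lists shaped [channels][height][width], the global branch is plain global average pooling broadcast over each channel, the local branch is a pointwise (1×1) convolution with a hypothetical weight matrix `w_local`, and no learned parameters from this application are used.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def aff(x, y, w_local):
    """Fuse feature maps x and y, each shaped [channels][height][width]."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    # Element-wise sum X + Y feeds both attention branches.
    s = [[[x[c][i][j] + y[c][i][j] for j in range(width)]
          for i in range(height)] for c in range(channels)]
    # M_global: global average pooling, one scalar per channel.
    g = [sum(sum(row) for row in s[c]) / (height * width) for c in range(channels)]
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                # M_local: pointwise (1x1) convolution across channels.
                local = sum(w_local[c][k] * s[k][i][j] for k in range(channels))
                lam = sigmoid(g[c] + local)   # attention weight Lambda
                out[c][i][j] = lam * x[c][i][j] + (1.0 - lam) * y[c][i][j]
    return out


# Two channels, one row, two columns; all values are made up.
disparity_features = [[[1.0, -1.0]], [[0.5, 2.0]]]
rgb_features = [[[0.2, 0.4]], [[-0.3, 1.0]]]
pointwise_weights = [[0.3, -0.2], [0.1, 0.4]]
fused = aff(disparity_features, rgb_features, pointwise_weights)
```

Because Λ lies between 0 and 1, every fused value is a convex combination of the corresponding X and Y values.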
[0158] The fused features are further encoded by the ConvGRU module to ensure multi-frame temporal consistency, and then aggregated by the Spatial Pyramid Fusion (SPF) module to enhance recognition of crossbars at different distances and poses. Upsampling and coordinate attention (CoordAtt) restore the target resolution. The network regresses the crossbar endpoint offset, confidence, and class probability at the pixel level, and the absolute height H of the crossbar and its distance z from the front of the vehicle are calculated by combining the camera intrinsic parameters with the disparity values, as shown in Figure 8.
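How the height and distance might be recovered from a disparity value and camera intrinsics can be sketched with standard rectified-stereo geometry. This is an assumption-laden illustration rather than the computation of this application: it assumes a rectified stereo pair, a horizontal optical axis (no camera pitch), a focal length in pixels, and hypothetical parameters for the camera mounting height and the offset between the camera and the front of the vehicle.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth along the optical axis for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def crossbar_height_and_distance(v_px, disparity_px, focal_px, cy_px,
                                 baseline_m, camera_height_m, camera_to_front_m):
    """Return (height above ground, distance from vehicle front) in meters."""
    depth = depth_from_disparity(disparity_px, focal_px, baseline_m)
    # Image rows grow downward, so rows above the principal point give positive offsets.
    height = camera_height_m + (cy_px - v_px) * depth / focal_px
    distance = depth - camera_to_front_m
    return height, distance


# All numbers below are made-up example values.
H, z = crossbar_height_and_distance(
    v_px=300.0, disparity_px=6.0, focal_px=1000.0, cy_px=360.0,
    baseline_m=0.12, camera_height_m=1.3, camera_to_front_m=1.5,
)
print(f"height restriction of {H:.2f} m at {z:.1f} m ahead")
```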
[0159] The parameter count of the height restriction recognition model is kept small to suit real-time deployment in vehicle-mounted embedded systems. Its structured output includes the target ID, the height H of the height restriction object, and the distance z. In active activation scenarios, after template conversion Φ, a prompt text such as "A height restriction of H meters is z meters ahead; the driver needs to be reminded" is generated and sent to the large model. The large model uses the vehicle height H0 as prior knowledge for comparison and generates suggestions such as "safe passage" or "detour". The expression is as follows:
[0160] Trigger = M(Φ(H,z)|T,H0), if ID exists,
[0161] where M represents the decision function of the large cockpit model and H0 represents the vehicle height. In the "natural wake-up" scenario, the large model directly obtains, through the height restriction recognition function interface, the text data corresponding to the latest detection result output by the data collaboration module, such as "A height restriction of 2.5 meters is z meters ahead".
[0162] In one embodiment of the present invention, a dual-branch biomimetic end-to-end steering wheel angle prediction model based on binocular input is proposed. The model consists of a dual-branch CNN module and a dual-branch NCP (Neural Circuit Policies) module. The CNN module simulates the stereoscopic perception mechanism of insect compound eyes and achieves multi-scale feature extraction through improved inverted residual blocks and convolutions at different scales. The NCP module borrows the four-level topology (sensory, intermediate, command, motor) of nematode neural circuits and introduces a time-varying synaptic transmission mechanism to strengthen time-series modeling. The outputs of the two branches are summed to obtain the final predicted steering wheel angle $r_p$. The basic inference process is as follows:
[0163] $r_p = \mathrm{NCP}_l(\mathrm{CNN}_l(I)) + \mathrm{NCP}_h(\mathrm{CNN}_h(I))$
[0164] Here, $\mathrm{CNN}_l$ and $\mathrm{CNN}_h$ denote the first and second features output by the convolutional layers at the low and high feature scales, respectively; $\mathrm{NCP}_l$ and $\mathrm{NCP}_h$ are the neural circuit layers at the low and high feature scales, respectively; and I is the input image.
[0165] As shown in Figure 9, this network model takes road surface images acquired by visual sensors as input and combines them with a biologically interpretable neural circuit policy. This improves the accuracy of steering angle prediction while enhancing model transparency and safety, addressing the "black box" problem of traditional deep learning methods.
[0166] In the collaborative architecture, the steering wheel angle prediction model encapsulates real-time data in a unified format to obtain the structured detection result corresponding to the predicted steering wheel angle, which mainly includes the timestamp t, the vehicle's current actual steering angle r0, and the predicted steering wheel angle $r_p$.
[0167] In the "active activation" scenario, when the system determines that the steering angle deviation exceeds the preset safety threshold ε, the data collaboration module sends a prompt based on the predefined template Φ, such as "The vehicle's current actual steering angle is r0 degrees, and the predicted steering wheel angle is $r_p$; the speed needs to be adjusted, and the driver needs to be reminded." After receiving this warning message, the large cockpit model outputs a warning and operational suggestions to the driver. This process is represented by the following formula:
[0168] $\mathrm{Trigger} = M(\Phi(t, r0, r_p) \mid T)$, if $|r0 - r_p| > \varepsilon$
[0169] Here, M represents the decision function of the large cockpit model, T represents the set of callable functions, ε represents the preset angle difference threshold, and Φ represents the preset text template. In the "natural wake-up" scenario, when the driver asks a question such as "How much should I turn the steering wheel now?" through voice interaction, the large cockpit model calls the steering wheel angle prediction function interface based on the semantic understanding result, and uses the structured data returned by the vehicle system to generate a natural language response that conforms to the user's habits, such as "The current model suggests turning 16 degrees to the right".
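The two modes described for the steering wheel angle prediction model can be sketched as follows. Everything here is hypothetical scaffolding: the interface name, the JSON call format, the placeholder return values, and the sign convention (positive angles mean turning right) are assumptions made for illustration, and the large cockpit model itself is not modeled.

```python
import json


def should_trigger(r0, r_p, epsilon):
    """Active activation condition |r0 - r_p| > epsilon."""
    return abs(r0 - r_p) > epsilon


def latest_steering_prediction():
    """Stand-in for the small model; returns placeholder structured data."""
    return {"t": 0.0, "r0": 4.0, "r_p": 16.0}


# Hypothetical function set T: interface name -> callable.
FUNCTION_SET = {"steering_wheel_angle_prediction": latest_steering_prediction}


def call_interface(call_json, function_set):
    """Dispatch a JSON call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(call_json)
    return function_set[call["name"]](**call.get("arguments", {}))


def render_reply(result):
    """Turn the structured result into a natural language reply."""
    # Sign convention (assumed): positive angles mean turning right.
    direction = "right" if result["r_p"] >= 0 else "left"
    return f"The current model suggests turning {abs(result['r_p']):g} degrees to the {direction}"


# Natural wake-up: the large model is assumed to emit this JSON call.
result = call_interface('{"name": "steering_wheel_angle_prediction"}', FUNCTION_SET)
reply = render_reply(result)
print(reply)
```

Keeping the function set as a plain name-to-callable mapping means further small models can be registered without changing the dispatch code.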
[0170] In one embodiment of the present invention, a high-accuracy obstacle detection model for detecting obstacles such as people and vehicles is proposed based on a hybrid obstacle detection architecture. In this embodiment, stereo vision and semantic segmentation are combined to obtain complete 3D target detection and cross-frame tracking capabilities. This design fully utilizes the advantages of stereo vision in distance perception, forming a complete obstacle perception architecture suitable for obstacle detection tasks in complex environments. The obstacle detection model mainly consists of two parts: a stereo matching module and an obstacle perception module.
[0171] The stereo matching module uses an improved multi-path Viterbi (MPV) algorithm that combines a structural similarity (SSIM) cost function with multi-scale fast matching for disparity estimation, thereby generating high-quality disparity images. The calculation method is as follows:
[0172]
[0173] Here, U represents the disparity map, U(p) represents the disparity value at point p, N(p) represents the neighborhood of point p, p′ represents a point within the neighborhood of point p, U(p′) represents the disparity value at point p′, SSIM is the structural similarity cost function, λ is the regularization coefficient, $\nabla$ denotes the image gradient, $\|\cdot\|$ denotes the L2 norm, and MPV() represents the multi-path Viterbi algorithm function. $I_L$, $I_R$, and $U_{init}$ represent the left-eye image, the right-eye image, and the initial disparity map, respectively.
[0174] Based on the obtained disparity map, the obstacle perception module, which uses a VGG-16 variant as its basic architecture, extracts features from the corresponding disparity regions and performs semantic classification to distinguish vehicles, pedestrians, non-motorized vehicles, and other obstacle types, as shown in Figure 10. Subsequently, a multi-target tracking module maintains IDs across frames and estimates the motion state (velocity, acceleration).
[0175] The obstacle detection model outputs standardized, encapsulated structured detection results, including the target identifier ID, the category label Lab (covering three major categories of standard obstacles, namely motor vehicles, non-motorized vehicles (including bicycles and motorcycles), and pedestrians, as well as other obstacles), and the distance z from the vehicle, and sets a safe distance threshold β for real-time risk assessment. In active activation mode, a safety warning is triggered when $z_i < \beta$: the system generates a message such as "Lab is z meters ahead; the driver needs to be reminded" through data conversion Φ and sends it to the large model. The expression is as follows:
[0176] $\mathrm{Trigger} = M(\Phi(Lab, z) \mid T)$, if $|z_i| < \beta$
[0177] Here, M represents the decision function of the large cockpit model, and $z_i$ represents the distance of the i-th target from the vehicle. In the "natural wake-up" working mode, when the driver actively initiates a driving-related query, the large model recognizes the user's intention and calls the human, vehicle and obstacle detection function interface to obtain the latest target detection data. The data is returned to the large model for natural language conversion, and the user finally receives a reply such as "There is a vehicle ahead, 25.8 meters away".
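For several tracked targets in one frame, the safe-distance check can be sketched as below. The tuple layout, the function name, and the example detections are assumptions for illustration; the sketch follows the textual description, in which a warning is raised for targets closer than the safe distance threshold β, and orders the resulting prompts from nearest to farthest.

```python
def obstacle_warnings(detections, beta):
    """Return prompt texts for targets closer than beta, nearest first.

    detections: iterable of (target_id, label, distance_m) tuples.
    """
    risky = sorted((d for d in detections if d[2] < beta), key=lambda d: d[2])
    return [
        f"{label} is {distance:g} meters ahead; the driver needs to be reminded"
        for _, label, distance in risky
    ]


# Made-up detections for a single frame.
frame = [(1, "Pedestrian", 12.0), (2, "Motor vehicle", 25.8), (3, "Bicycle", 8.5)]
for message in obstacle_warnings(frame, beta=15.0):
    print(message)
```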
[0178] Each small model adheres to a unified data interface contract, with output fields including timestamps, sensor identifiers, calibration references, detection data lists, and necessary multimodal evidence such as BEV / elevation slices. Before reporting, the data undergoes deduplication, anomaly removal, spatiotemporal alignment, and formatting so that the upper-level cockpit model can parse it and make reliable decisions. To suit vehicle deployment, the small models can use engineering techniques such as model pruning, quantization, distillation, and operator fusion for edge deployment, meeting accuracy requirements while also respecting real-time performance, power consumption, and security constraints.
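One possible shape for the unified interface contract is sketched below. The field names follow the list in the paragraph above, but the concrete types, the `Detection` record, the 200-meter range limit, and the cleaning rules are assumptions for illustration; spatiotemporal alignment is not shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Detection:
    target_id: int
    label: str
    distance_m: float
    confidence: float


@dataclass
class SmallModelReport:
    timestamp: float
    sensor_id: str
    calibration_ref: str
    detections: List[Detection] = field(default_factory=list)
    evidence: Dict[str, str] = field(default_factory=dict)  # e.g. BEV / elevation slices


def clean(report, max_range_m=200.0):
    """Drop out-of-range values and keep the most confident detection per target."""
    best = {}
    for det in report.detections:
        if not 0.0 <= det.distance_m <= max_range_m:
            continue  # anomaly removal
        kept = best.get(det.target_id)
        if kept is None or det.confidence > kept.confidence:
            best[det.target_id] = det  # deduplication
    ordered = sorted(best.values(), key=lambda d: d.target_id)
    return SmallModelReport(report.timestamp, report.sensor_id,
                            report.calibration_ref, ordered, report.evidence)


def to_json(report):
    """Format the report as JSON for the upper-level cockpit model."""
    return json.dumps(asdict(report), sort_keys=True)


raw = SmallModelReport(1.0, "stereo_front", "calib_v1", [
    Detection(1, "pedestrian", 12.0, 0.6),
    Detection(1, "pedestrian", 12.1, 0.9),   # duplicate of target 1
    Detection(2, "vehicle", -3.0, 0.8),      # impossible distance
    Detection(3, "vehicle", 25.8, 0.7),
])
cleaned = clean(raw)
print(to_json(cleaned))
```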
[0179] The automotive intelligent cockpit interaction system based on big-small model collaboration proposed in this invention has the following characteristics.
[0180] Brain-inspired bionic collaborative mechanism: This application proposes a big-small model collaborative mechanism that simulates the cooperation between biological senses and the brain. The small models act as the "eyes / ears" and are responsible for real-time perception and rapid analysis of the driving environment. The large model acts as the "brain" and performs deep semantic understanding and decision generation on key information. Together they form a hierarchical "fast path - slow path" processing structure and a closed-loop "collection - perception - decision - execution - feedback" control mechanism, achieving efficient integration and verified engineering feasibility.
[0181] A large cockpit language model: Cockpit interaction corpus construction tools and simulation scenarios were independently designed, data augmentation methods were proposed to improve the coverage and robustness of the corpus, and a professional supervised dataset was built. An open-source large language model was then fine-tuned with supervision on this dataset to obtain a large cockpit model with vehicle-side task understanding capabilities, supporting intent recognition, multi-turn interaction, and control decision-making.
[0182] Highly efficient and comprehensive perception of small model clusters: The design principle of small model clusters with the optimal balance between detection accuracy and inference speed is proposed. The small models can not only quickly complete risk detection, but also cover the driving environment perception as comprehensively as possible. At the same time, the system is scalable, and small models for different driving tasks can be introduced as needed in the future to continuously improve and expand the vehicle's environmental perception capabilities.
[0183] The automotive intelligent cockpit interaction system based on the big-small model collaboration proposed in this invention can achieve the following technical effects.
[0184] Improve driving safety: By integrating real-time multimodal perception inside and outside the vehicle and combining it with low-latency risk detection of environmental perception small models, proactive warnings for dangerous scenarios and emergency control triggering when necessary can be achieved, thereby reducing safety risks caused by semantic misjudgment, interaction delays or lack of environmental perception.
[0185] Enhance the intelligence of human-vehicle interaction: A large cockpit model fine-tuned for the domain achieves accurate intent understanding, natural and smooth multi-turn dialogue, and scenario-based proactive prompts. This expands human-vehicle interaction from a single passive question-and-answer mode to a proactive, intelligent, and context-aware interaction mode, improving convenience and user experience during driving.
[0186] Balance real-time performance and reasoning depth: Through the division of labor and collaborative design of the small and large models, the small models achieve rapid response and low-latency environmental perception on the vehicle side, while the large model provides deep semantic understanding and complex reasoning. The two are complementary, balancing real-time performance and reasoning ability under limited computing power.
[0187] Enhance model adaptability and engineering feasibility: By introducing structured output constraints and chain-of-thought prompts during fine-tuning, the large cockpit model trained by this invention can output data in a standard format and has logical reasoning capabilities, so that it connects directly with the vehicle's infotainment system interface and improves the feasibility and robustness of vehicle-side deployment.
[0188] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0189] The above description is only a specific embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.
Claims
1. An automotive intelligent cockpit interaction system based on big-small model collaboration, characterized in that it includes a first model system, a second model system, and a data collaboration module. The first model system is configured as an environmental perception system, and the second model system is configured as a hierarchical decision-making system. The first model system includes a data acquisition module and an environment perception module. The data acquisition module is configured to acquire vehicle exterior scene data, and the environment perception module includes multiple sub-models. The multiple sub-models are configured to detect and identify the vehicle exterior scene data to obtain multiple structured detection results. The data collaboration module is configured to perform data extraction, spatial alignment, semantic alignment, and text conversion on each of the structured detection results to obtain corresponding text data. The text data has global consistency and is configured to describe the vehicle exterior scene data. The second model system is configured to make driving decisions based on one or more of instruction information, vehicle status information, and the text data.
2. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that, when any of the sub-models detects a structured detection result that exceeds a preset safety threshold, the data collaboration module is configured to transmit the text data corresponding to the event exceeding the preset safety threshold to the second model system, and the second model system makes a corresponding driving decision in response to receiving that text data; or, the data collaboration module is also configured to perform risk assessment on the text data corresponding to each of the structured detection results to obtain a corresponding risk assessment value, and to actively transmit high-risk text data to the second model system, wherein the risk assessment value corresponding to the high-risk text data is higher than a preset risk threshold, and the second model system is configured to make corresponding driving decisions based on the high-risk text data.
3. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 2, characterized in that, when an event exceeds a preset safety threshold, the second model system is configured to be triggered based on the following function: Trigger = M(Φ(Lab, H, z) | T), if ID exists, where Trigger represents triggering, M represents the decision function of the second model system, ID represents the target identifier, Lab represents the category label, H represents the elevation value, z represents the distance between the target and the vehicle, T represents the set of callable functions, and Φ represents the preset text template; alternatively, the second model system is configured to be triggered based on the following function: Trigger = M(Φ(H, z) | T, H0), if ID exists, where Trigger represents triggering, M represents the decision function of the second model system, H represents the elevation value, z represents the distance of the target from the vehicle, T represents the set of callable functions, H0 represents the vehicle height, and Φ represents the preset text template; alternatively, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(t, r0, r_p) \mid T)$, if $|r0 - r_p| > \varepsilon$, where Trigger represents triggering, M represents the decision function of the second model system, T represents the set of callable functions, r0 represents the current actual steering angle of the vehicle, $r_p$ represents the predicted steering wheel angle, ε represents the preset angle difference threshold, and Φ represents the preset text template; alternatively, the second model system is configured to be triggered based on the following function: $\mathrm{Trigger} = M(\Phi(Lab, z) \mid T)$, if $|z_i| < \beta$, where Trigger represents triggering, M represents the decision function of the second model system, T represents the set of callable functions, Lab represents the category label, z represents the distance between the target and the vehicle, $z_i$ represents the distance of the i-th target from the vehicle, β represents the preset distance threshold, and Φ represents the preset text template.
4. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the sub-model includes a height restriction recognition model, which includes a stereo matching module, a MobileNetV2 variant module, an adaptive attention fusion unit, a ConvGRU module, a spatial pyramid fusion module, and an upsampling and coordinate attention module. The stereo matching module is configured to generate a disparity map based on the external scene data, which includes multiple consecutive RGB images captured by a binocular camera. The MobileNetV2 variant module is configured to extract visual and depth features in parallel from the disparity map and the multiple consecutive RGB images to obtain multi-scale features. The adaptive attention fusion unit is configured to perform spatial and road information interaction based on the multi-scale features to obtain feature maps of multiple frames. The ConvGRU module is configured to perform temporal modeling on the multi-frame feature maps, encoding the temporal consistency between them and generating fused features with temporal constraints. The spatial pyramid fusion module is configured to perform multi-scale context aggregation on the fused features with temporal constraints, to enhance the feature representation of height restriction objects at different distances and attitudes. The upsampling and coordinate attention module is configured to perform resolution restoration and position information enhancement on the features after multi-scale context aggregation, so as to produce pixel-level output results. Based on the pixel-level output results, combined with camera intrinsic parameters and disparity values, the absolute height of the height restriction object and its distance relative to the vehicle body are calculated.
The pixel-level output results include the endpoint offset of the height restriction object, a confidence level, and a category probability.
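The final step of the claim above, recovering distance and absolute height from a pixel position and its disparity, can be sketched with the standard rectified-stereo relation Z = f·B/d. This is a minimal sketch rather than the patented computation: the stereo baseline, the camera mounting height and an optical axis parallel to the road are assumptions not stated in the claim, and all function and parameter names are hypothetical.

```python
from typing import Tuple


def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Standard rectified-stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def clearance_height(
    v_px: float,
    disparity_px: float,
    focal_px: float,
    cy_px: float,
    baseline_m: float,
    camera_height_m: float,
) -> Tuple[float, float]:
    """Return (distance_m, height_above_road_m) for one detected endpoint.

    Assumes the optical axis is parallel to the road and the image v axis
    points downwards, so a point above the principal point has y_cam < 0.
    """
    z = depth_from_disparity(disparity_px, focal_px, baseline_m)
    y_cam = (v_px - cy_px) * z / focal_px  # metres below the optical axis
    return z, camera_height_m - y_cam
```

With a focal length of 1000 px, a 0.12 m baseline and a disparity of 8 px, an endpoint 160 px above the principal point of a camera mounted 1.3 m high comes out at about 15 m ahead and about 3.7 m above the road.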
5. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the sub-model includes a road surface prediction model, which includes a first-stage network and a second-stage network. The first-stage network includes a VGG-11-based backbone network and a UNet-based encoder-decoder architecture; its input is the vehicle exterior scene data, and its output is a semantic segmentation image. The second-stage network fuses the semantic segmentation image, the scene image, and the disparity image, and maps the three-dimensional disparity information onto a two-dimensional plane using a top-down orthogonal projection to obtain structured detection results representing road feature type, distance, and height.
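The top-down orthogonal projection in the second stage can be sketched as follows. The sketch is illustrative only: the grid cell size, the pinhole parameters, the camera mounting height and the rule of keeping the highest point per cell are assumptions, and `project_to_bev` is a hypothetical name.

```python
import math
from typing import Dict, Iterable, Tuple

Pixel = Tuple[float, float, float, str]  # (u, v, disparity, semantic label)


def project_to_bev(
    pixels: Iterable[Pixel],
    focal_px: float,
    cx_px: float,
    cy_px: float,
    baseline_m: float,
    camera_height_m: float,
    cell_m: float = 0.5,
) -> Dict[Tuple[int, int], Tuple[str, float]]:
    """Drop the vertical axis and bin labelled 3D points into ground cells.

    Returns {(ix, iz): (label, height_m)}, keeping the highest point per cell.
    """
    grid: Dict[Tuple[int, int], Tuple[str, float]] = {}
    for u, v, d, label in pixels:
        if d <= 0:  # no valid stereo match for this pixel
            continue
        z = focal_px * baseline_m / d                    # forward distance
        x = (u - cx_px) * z / focal_px                   # lateral offset
        h = camera_height_m - (v - cy_px) * z / focal_px # height above road
        cell = (math.floor(x / cell_m), math.floor(z / cell_m))
        if cell not in grid or h > grid[cell][1]:
            grid[cell] = (label, h)
    return grid
```

Each occupied cell then carries the three quantities named in the claim: the road feature type (label), the distance (cell index times cell size) and the height.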
6. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the sub-model includes a steering wheel angle prediction model, which includes a dual-branch CNN module and a dual-branch NCP module. The dual-branch CNN module performs multi-scale feature extraction on the input image through inverted residual blocks and multi-scale convolutional layers to obtain a first feature and a second feature, where the feature scale of the second feature is higher than that of the first feature. The dual-branch NCP module includes two neural circuits based on a four-level topology. The two neural circuits obtain the predicted steering wheel angle $r_p$ based on the following formula: $r_p = \mathrm{NCP}_l(\mathrm{CNN}_l(I)) + \mathrm{NCP}_h(\mathrm{CNN}_h(I))$, where $I$ is the input image, $\mathrm{CNN}_l(I)$ and $\mathrm{CNN}_h(I)$ are the first and second features, respectively, $\mathrm{NCP}_l$ is the first neural circuit, and $\mathrm{NCP}_h$ is the second neural circuit.
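The additive combination of the two branches can be shown with a small Python sketch. Only the composition r_p = NCP_l(CNN_l(I)) + NCP_h(CNN_h(I)) comes from the claim; the `toy_*` functions are hypothetical stand-ins that replace the real CNN and NCP networks with trivial arithmetic so the example runs without any deep-learning library.

```python
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]
Feature = List[float]


def predict_steering_angle(
    image: Image,
    cnn_l: Callable[[Image], Feature],
    cnn_h: Callable[[Image], Feature],
    ncp_l: Callable[[Feature], float],
    ncp_h: Callable[[Feature], float],
) -> float:
    """r_p = NCP_l(CNN_l(I)) + NCP_h(CNN_h(I)): each branch is evaluated
    independently and the two scalar outputs are summed."""
    return ncp_l(cnn_l(image)) + ncp_h(cnn_h(image))


# Toy stand-ins (assumptions, not the patented networks).
def toy_cnn_l(image: Image) -> Feature:
    values = [p for row in image for p in row]
    return [sum(values) / len(values)]          # one global feature


def toy_cnn_h(image: Image) -> Feature:
    return [sum(row) / len(row) for row in image]  # one feature per row


def toy_ncp_l(feature: Feature) -> float:
    return 0.5 * feature[0]


def toy_ncp_h(feature: Feature) -> float:
    return 0.1 * sum(feature)
```

Passing the branches as callables keeps the summation rule separate from the network definitions, so either branch can be swapped without touching the other.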
7. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the sub-model includes an obstacle detection model, which comprises a stereo matching module and an obstacle perception module. The stereo matching module is configured to generate a disparity image based on a formula in which $U$ represents the disparity map, $U(p)$ represents the disparity value at point $p$, $N(p)$ represents the neighborhood of point $p$, $p'$ represents a point within the neighborhood of point $p$, $U(p')$ represents the disparity value at point $p'$, SSIM is the structural similarity cost function, and $\lambda$ is the regularization coefficient; the formula further involves the gradient of the image and the L2 norm $\|\cdot\|$; $\mathrm{MVP}(\cdot)$ represents the Viterbi multipath algorithm function; and $I_L$, $I_R$, and $U_{\text{init}}$ represent the left-eye image, the right-eye image, and the initial disparity map, respectively. The obstacle perception module is configured to perform feature extraction and semantic classification on the disparity image to identify vehicles, pedestrians, non-motorized vehicles, and other obstacles.
8. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the second model system includes a core control module, an instruction input module, and a system feedback module. The instruction information includes voice instructions, and the instruction input module is configured to convert the voice instructions into text instructions for output to the core control module. The core control module is equipped with a pre-trained cockpit model, which is configured to make driving decisions based on one or more of the text instructions, vehicle status information, and text data. The driving decisions include warning information and/or active intervention control operations. The system feedback module is configured to convert the driving decisions into audible and/or visual cues and output them.
9. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the second model system includes a pre-trained large cockpit model, which is trained in the following way: Qwen2-7B-Instruct is selected as the teacher model, and Qwen2-1.5B-Instruct is selected as the student model. A pre-configured supervised corpus is used as a learning sample set, which includes text in a structured question-and-answer format. The teacher model is trained using the learning sample set; the output of the teacher model is smoothed by the temperature parameter $T_s$, and the probability distribution predicted by the teacher model is then obtained through Softmax. The student model is trained using the learning sample set, and the training process is constrained by the probability distribution predicted by the teacher model to obtain a trained student model, which serves as the large cockpit model. The total loss function $L_{KD}$ of the student model is: $L_{KD} = (1-\alpha) H(y, q_S) + \alpha D_{KL}(q_T \| q_S)$, where $\alpha$ is the weighting coefficient, $0 < \alpha < 1$; $y_i$ represents a learning sample; $q_S(i|x)$ represents the probability distribution predicted by the student model; $H(y, q_S)$ represents the cross-entropy loss between the student model's prediction and the real text sequence, where $i$ is a positive natural number; $D_{KL}(q_T \| q_S)$ represents the Kullback-Leibler divergence between the teacher model and the student model; and $q_T(i|x)$ represents the probability distribution predicted by the teacher model.
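The distillation loss in this claim can be evaluated for a single token position with a short pure-Python sketch. It is a minimal illustration under stated assumptions: the claim does not say whether the student distribution inside the KL term is also temperature-smoothed or whether the cross-entropy term uses temperature 1, so both choices here are assumptions, and the names `softmax` and `kd_loss` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-smoothed softmax; a larger temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    target_index: int,
    alpha: float,
    temperature: float,
) -> float:
    """L_KD = (1 - alpha) * H(y, q_S) + alpha * D_KL(q_T || q_S)."""
    # Assumption: cross-entropy against the true token uses temperature 1.
    ce = -math.log(softmax(student_logits)[target_index])
    # Assumption: both distributions in the KL term share the same temperature.
    q_s = softmax(student_logits, temperature)
    q_t = softmax(teacher_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s) if t > 0)
    return (1 - alpha) * ce + alpha * kl
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy remains, which gives a quick sanity check on any implementation of this loss.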
10. The automotive intelligent cockpit interaction system based on big-small model collaboration according to claim 1, characterized in that the structured detection results include one or more of the following: timestamp, sensor identifier, calibration reference, detection data list, BEV slice, and elevation slice; and/or, the sub-models include multiple models, such as a road surface pre-aiming model, a traffic restriction recognition model, a steering wheel angle prediction model, and an obstacle detection model; and/or, the data acquisition module includes a binocular camera, which is configured to acquire real-time images of the road scene and transmit them to the environmental perception module; and/or, the data collaboration module includes an interface layer for calls by the second model system, where the interface layer includes multiple interfaces, one for each sub-model, and each interface is configured to output text data corresponding to a structured detection result; and/or, the second model system is configured with a cockpit function list and a simulated dialogue scenario list, where the function list includes six functional modules: vehicle entertainment, driving functions, cockpit configuration, system operation, visual perception, and visible information retrieval, and the simulated dialogue scenario list includes four types of scenarios: single-turn tool retrieval, multi-turn tool retrieval, multi-turn question-and-answer dialogue, and combination of multiple tools; and/or, the supervision of the second model system follows these design specifications: set the dialogue background based on the global description, present the decision basis through visible information, clarify each sub-model and its corresponding functional interface, configure format examples to guide the tool call and feedback process, divide the guidance into prompts and body text, and use question sequences to simulate real interaction.
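The structured detection result and the per-sub-model interface layer described in this claim can be sketched as a small data structure plus a registry. The field list follows the claim; the field types, the class names `StructuredDetection` and `InterfaceLayer`, the method names and the output wording are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StructuredDetection:
    timestamp: float                  # acquisition time, seconds
    sensor_id: str                    # sensor identifier
    calibration_ref: str              # calibration reference
    detections: List[dict] = field(default_factory=list)  # detection data list
    bev_slice: Optional[list] = None        # bird's-eye-view slice
    elevation_slice: Optional[list] = None  # elevation slice


class InterfaceLayer:
    """One text-producing interface per sub-model, callable by the large model."""

    def __init__(self) -> None:
        self._interfaces: Dict[str, Callable[[StructuredDetection], str]] = {}

    def register(self, sub_model: str,
                 to_text: Callable[[StructuredDetection], str]) -> None:
        self._interfaces[sub_model] = to_text

    def query(self, sub_model: str, result: StructuredDetection) -> str:
        if sub_model not in self._interfaces:
            raise KeyError(f"no interface registered for {sub_model!r}")
        return self._interfaces[sub_model](result)


def obstacle_to_text(result: StructuredDetection) -> str:
    """Example formatter for an obstacle detection sub-model (assumed wording)."""
    items = "; ".join(
        f"{d['label']} at {d['distance_m']:.1f} m" for d in result.detections
    )
    return f"[{result.sensor_id}] {items}"
```

A caller would register one formatter per sub-model and then request text by name, for example `layer.query("obstacle_detection", result)`.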