A goal-oriented proactive dialogue method and system based on collaboration between large and small models

By constructing a structured task flowchart and running multi-agent simulations, combined with supervised fine-tuning and preference optimization, the method gives the dialogue system initiative and process management capabilities, addressing the limitations of existing dialogue systems and improving the dialogue success rate and system efficiency in complex tasks.

CN122240798A · Pending · Publication Date: 2026-06-19 · ZHEJIANG XIWEI TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ZHEJIANG XIWEI TECHNOLOGY CO LTD
Filing Date: 2026-05-21
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing human-computer dialogue systems lack initiative, have weak process management capabilities, rely on simplistic strategy decisions, and have low efficiency in utilizing computing resources, making it difficult to effectively advance complex tasks and ensure the controllability and efficiency of the dialogue process.

Method used

A goal-oriented proactive dialogue method combining large and small models is adopted. A structured task flowchart is constructed, and multi-agent simulation generates policy-oriented dialogue data. Supervised fine-tuning and preference optimization then yield a dialogue policy decision model. Dialogue behavior is optimized using information gap-driven phased goal constraints and a multi-dimensional state change coupled decision mechanism.

Benefits of technology

It improves the success rate of dialogue systems in complex tasks, enhances the robustness, decision accuracy, response efficiency, and cross-scenario generalization ability of dialogue systems, and ensures the controllability of dialogue processes and the consistency of strategies.

✦ Generated by Eureka AI based on patent content.

[Figure CN122240798A_ABST: abstract drawing]

Abstract

This invention discloses a goal-oriented proactive dialogue method and system based on collaboration between large and small models, relating to the fields of artificial intelligence and natural language processing. The method includes: constructing a structured task flowchart and generating policy-oriented dialogue data through multi-agent simulation; performing supervised fine-tuning and preference optimization on a base large language model using the dialogue data to obtain a dialogue policy decision model; acquiring user input information and combining it with historical dialogues to form a real-time dialogue context; determining the goal-oriented intent through candidate intent semantic matching and scoring under information gap-driven phased goal constraints, and updating the current state node; selecting a dialogue processing path based on the updated state node and the fit between the policy model mechanism and the rule response mechanism, and outputting a dialogue behavior identifier or a transitional dialogue response; and generating natural language output. This invention addresses the problems of existing dialogue systems, such as lack of initiative, weak process management capabilities, simplistic strategy decisions, and low computational resource utilization efficiency.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and natural language processing, and more specifically to a goal-oriented proactive dialogue method and system based on collaboration between large and small models.

Background Technology

[0002] Traditional human-computer dialogue systems, such as intelligent customer service and chatbots, mostly adopt a passive question-and-answer or retrieval-based response model. These systems perform reasonably well when answering users' explicit questions, but they reveal significant limitations when dealing with scenarios that require multiple rounds of communication, follow specific processes, and ultimately achieve complex goals, such as business negotiations, dispute mediation, and product sales.

[0003] Existing technologies face the following challenges: 1) Lack of initiative: The system cannot lead the dialogue and can only wait for users to ask questions, making it easy for users to steer the conversation away from the core issues, resulting in low dialogue efficiency or task failure. 2) Weak process management capabilities: For tasks with clear steps, traditional systems struggle to effectively track dialogue status and manage process progress, failing to ensure that all necessary steps are covered. 3) Limited decision-making capabilities: When dialogue encounters stalemates or unexpected user reactions, the system lacks in-depth strategic thinking capabilities and cannot flexibly adjust communication strategies to resolve conflicts and advance the dialogue like human experts. 4) Uneven resource consumption: Some systems rely entirely on large language models for all interactions, resulting in significant waste of computing resources and latency when handling simple, routine dialogue segments.

Summary of the Invention

[0004] To overcome the aforementioned deficiencies of the prior art, embodiments of the present invention provide a goal-oriented proactive dialogue method and system based on collaboration between large and small models, which address the lack of initiative, weak process management capabilities, simplistic strategy decisions, and inefficient use of computing resources in existing dialogue systems.

[0005] To achieve the above objectives, the present invention provides the following technical solution. In a first aspect, this application provides a goal-oriented proactive dialogue method based on collaboration between large and small models. The method includes: constructing a structured task flowchart and generating policy-oriented dialogue data through multi-agent simulation, wherein the agents include a facilitator agent for guiding the dialogue process and participant agents for simulating user behavior; performing supervised fine-tuning and preference optimization on a base large language model using the dialogue data to obtain a dialogue policy decision model; acquiring user input information and combining it with historical dialogues to form a real-time dialogue context; determining the goal-oriented intent through candidate intent semantic matching and scoring under information gap-driven phased goal constraints, and updating the current state node; selecting a dialogue processing path based on the updated state node and the fit between the policy model mechanism and the rule response mechanism, and outputting a dialogue behavior identifier generated by the dialogue policy decision model or a transitional dialogue response; and generating natural language output based on the dialogue behavior identifier or transitional dialogue response.

[0006] In one embodiment, constructing a structured task flowchart includes: breaking down the dialogue process required to complete the dialogue objective into a set of subtasks arranged in logical order, with each subtask corresponding to a state node, and setting a node type and completion conditions for each state node; analyzing the dependencies between state nodes, defining the transition paths between nodes and their triggering conditions; determining the target node, and defining dialogue success conditions and dialogue failure conditions for the target node; performing consistency verification and path integrity verification on the overall task structure to ensure that each state node has at least one path to the target node, and generating a structured task flowchart.

[0007] In one embodiment, based on the dialogue objective, the dialogue process required to achieve the dialogue objective is decomposed into a set of subtasks arranged in logical order, including: semantic parsing of historical dialogue data and converting it into a sequence of dialogue behaviors arranged in chronological order; statistical analysis of the transition relationships between adjacent behaviors and extracting recurring behavior transition segments as an initial set of behavior patterns; compression and optimization of the set of behavior patterns using the minimum description length principle, retaining the set of behavior pattern blocks with the greatest expressive power; mapping behavior pattern blocks to functional behavior blocks, and generating a set of subtasks corresponding to different dialogue stages based on the semantic dependencies and chronological order between functional behavior blocks, and outputting the task decomposition results in logical order.

[0008] In one embodiment, generating policy-oriented dialogue data through multi-agent simulation includes: parsing the task flowchart to establish a dialogue state space and a set of state transition rules; defining a multi-agent system that includes a facilitator agent and participant agents, and configuring different user profile parameters for the participant agents to form a set of simulated user behaviors; starting from the initial state, having the facilitator agent select candidate dialogue behaviors according to the state transition rules and the participant agents generate response behaviors according to their user profiles, with the two sides interacting in turn and updating the dialogue state, and recording the state, behavior, and transition result of each round of dialogue until the termination condition is met; and performing structured processing on the generated dialogue trajectory data to form a set of policy-oriented dialogue data.

[0009] In one embodiment, supervised fine-tuning includes: parsing dialogue data and constructing a training sample set; preprocessing the training samples to obtain a high-quality training dataset; selecting a pre-trained basic large language model as the initial model, and using the high-quality training dataset to perform supervised learning training on it, updating the model parameters by minimizing the difference between the model's generated output and the target output; and obtaining the initial policy model after the training reaches a preset convergence condition.

[0010] In one embodiment, obtaining the dialogue policy decision model includes: generating candidate dialogue behaviors for each dialogue context using the initial policy model; evaluating the quality of each candidate dialogue behavior using a preset dialogue evaluation function; comparing candidate behaviors in the same context pairwise based on the evaluation scores to construct a preference data sample set; and updating the parameters of the initial policy model using a direct preference optimization objective function based on the preference data sample set, thereby increasing the probability of generating preferred behaviors and decreasing the probability of generating inferior behaviors, thus obtaining the dialogue policy decision model.
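The pairwise comparison and preference update described above can be illustrated with a minimal sketch. It assumes candidates arrive as (behavior, score) tuples and evaluates the standard direct preference optimization loss for a single pair; the function names, the `beta` value, and the data format are illustrative assumptions, since the text does not specify the evaluation function or any hyperparameters.

```python
import math
from itertools import combinations

def build_preference_pairs(scored_candidates):
    """Compare candidates for one context pairwise; ties produce no pair.

    scored_candidates: list of (behavior, score) tuples (hypothetical format).
    Returns a list of (preferred, inferior) behavior pairs.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_candidates, 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs

def dpo_loss(logp_pref, logp_inf, ref_logp_pref, ref_logp_inf, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the preferred
    behavior over the inferior one, relative to a frozen reference model.
    """
    margin = (logp_pref - ref_logp_pref) - (logp_inf - ref_logp_inf)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))
```

Minimizing this loss over the preference set raises the log-probability of preferred behaviors relative to inferior ones, which matches the effect the paragraph describes.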

[0011] In one embodiment, determining the goal-oriented intent through candidate intent semantic matching and scoring under information gap-driven phased goal constraints, and updating the current state node, includes: generating a target information demand set based on the preset dialogue goals and the state nodes in the task flowchart, and extracting an observed information set from the real-time dialogue context; determining an information gap set from the difference between the target information demand set and the observed information set, and evaluating the importance of the information gaps to obtain a weighted information gap set; based on the weighted information gap set, determining through reverse reasoning the phased goal of filling high-weight information gaps first in the current stage, and generating a set of candidate user intents related to filling the information gaps; semantically matching the user input against the candidate user intents, weighting the matching results by the weighted information gaps, and selecting the intent with the highest score as the goal-oriented intent; and matching state nodes in the task flowchart and determining state transitions based on the goal-oriented intent and the observed information, and updating the current state node.
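A small sketch may help make the gap-weighted intent scoring concrete. It assumes slots are plain strings, importance weights are supplied by hand, and semantic matching is approximated by token overlap. All of these, along with the function names and the candidate format, are illustrative assumptions rather than the method's actual matcher.

```python
def weighted_information_gaps(required, observed, importance):
    """Gap set = required slots not yet observed, each with an importance weight."""
    return {slot: importance.get(slot, 1.0) for slot in required if slot not in observed}

def token_overlap(a, b):
    """Toy stand-in for semantic matching: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_intent(user_input, candidate_intents, gaps):
    """Score each candidate as match * weight of the gap it fills; return the best.

    candidate_intents: {intent_name: (gap_slot, example_utterance)} (hypothetical format).
    """
    scores = {
        name: token_overlap(user_input, example) * gaps.get(slot, 0.0)
        for name, (slot, example) in candidate_intents.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Usage: "contact" is already known, so only budget and delivery date are gaps.
gaps = weighted_information_gaps(
    ["budget", "delivery_date", "contact"],
    {"contact": "alice@example.com"},
    {"budget": 0.9, "delivery_date": 0.5},
)
candidates = {
    "provide_budget": ("budget", "my budget is about"),
    "provide_date": ("delivery_date", "i need it delivered by"),
}
best_intent, intent_scores = select_intent("My budget is around 5000", candidates, gaps)
```

In a real system the overlap function would be replaced by an embedding-based or model-based matcher; the weighting step is the part this sketch is meant to show.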

[0012] In one embodiment, selecting a dialogue processing path based on the updated state node and the fit between the policy model mechanism and the rule response mechanism, and outputting a dialogue behavior identifier generated by the dialogue policy decision model or a transitional dialogue response, includes: obtaining the updated current dialogue state node and historical state information, and constructing multi-dimensional state vectors for the current and historical states; calculating the state change component for each dimension, and calculating a comprehensive change intensity value through a state change coupling function; constructing a change driving potential function based on the comprehensive change intensity value, and calculating the state change driving potential value; constructing fit functions for the rule response mechanism and the policy model mechanism based on the state change driving potential value, and calculating the fit value of each mechanism in the current state; and selecting the mechanism with the higher fit value as the dialogue processing path: if the policy model mechanism has the higher fit value, the dialogue policy decision model is called to output the dialogue behavior identifier; otherwise, a transitional dialogue response is generated based on preset rules.
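The decision chain above (change components, coupling, driving potential, fit or compatibility values) can be sketched as follows. This passage does not give the concrete functional forms, so every formula below is a placeholder chosen for illustration: absolute differences for the components, a weighted sum plus pairwise products for the coupling, `tanh` for the potential, and a logistic curve for the fit values. The names, `threshold`, and `sharpness` are likewise assumptions.

```python
import math

def change_components(current, previous):
    """Per-dimension state change: absolute difference of the two state vectors."""
    return [abs(c - p) for c, p in zip(current, previous)]

def coupled_intensity(components, weights, coupling=0.5):
    """Assumed coupling function: weighted sum plus a pairwise interaction term."""
    linear = sum(w * d for w, d in zip(weights, components))
    pairwise = sum(
        components[i] * components[j]
        for i in range(len(components))
        for j in range(i + 1, len(components))
    )
    return linear + coupling * pairwise

def driving_potential(intensity):
    """Assumed potential function: squash the non-negative intensity into [0, 1)."""
    return math.tanh(intensity)

def select_mechanism(current, previous, weights, threshold=0.5, sharpness=8.0):
    """Return the chosen mechanism together with both fit values."""
    components = change_components(current, previous)
    potential = driving_potential(coupled_intensity(components, weights))
    policy_fit = 1.0 / (1.0 + math.exp(-sharpness * (potential - threshold)))
    rule_fit = 1.0 - policy_fit
    choice = "policy_model" if policy_fit > rule_fit else "rule_response"
    return choice, policy_fit, rule_fit
```

Under these placeholder forms, a dialogue whose state barely changes is routed to the rule response mechanism, while a large multi-dimensional change is routed to the policy model.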

[0013] In one embodiment, generating natural language output based on dialogue behavior identifiers or transitional dialogue responses includes: obtaining dialogue behavior identifiers or transitional dialogue responses and mapping them to a basic semantic intent structure; constructing a pragmatic constraint set by combining user features and dialogue state in the current dialogue context; generating multiple candidate natural language expression sequences based on the semantic intent structure and pragmatic constraint set, and performing template mapping and structural reorganization to obtain an optimized expression set; performing multi-dimensional scoring on each candidate expression in the optimized expression set, and selecting the candidate expression with the highest score as the target natural language output; adjusting the target natural language output according to preset role style parameters to obtain the final output result.

[0014] In a second aspect, this application provides a goal-oriented proactive dialogue system based on collaboration between large and small models. The system includes: a dialogue flow and data generation module, used to construct a structured task flowchart and generate policy-oriented dialogue data through multi-agent simulation, wherein the agents include a facilitator agent for guiding the dialogue flow and participant agents for simulating user behavior; a policy model training module, used to perform supervised fine-tuning and preference optimization on a base large language model using the dialogue data to obtain a dialogue policy decision model; an information gap reverse reasoning module, used to acquire user input information, combine it with historical dialogues to form a real-time dialogue context, determine the goal-oriented intent through candidate intent semantic matching and scoring under information gap-driven phased goal constraints, and update the current state node; a mechanism adaptation decision module, used to select a dialogue processing path based on the updated state node and the fit between the policy model mechanism and the rule response mechanism, and to output a dialogue behavior identifier generated by the dialogue policy decision model or a transitional dialogue response; and an output module, used to generate natural language output based on the dialogue behavior identifier or transitional dialogue response.

[0015] As can be seen from the above technical solutions, the embodiments of this application have the following advantages: This invention transforms the dialogue system from a respondent to a facilitator, ensuring that the dialogue remains focused on the core objective through proactive agenda setting, questioning, and strategy adjustment, thereby greatly improving the success rate in complex tasks.

[0016] By integrating task flowcharts, multi-agent data generation, supervised fine-tuning and preference optimization training, and the information gap-driven reverse reasoning and multi-dimensional state change coupling decision mechanisms, the invention achieves closed-loop optimization of the entire chain from dialogue understanding and state tracking to policy decision-making and natural language generation. It can accurately identify goal-oriented intents and dynamically update state nodes when user expressions are incomplete or uncertain, and it adaptively selects the policy model or the rule mechanism to generate responses according to dialogue complexity. Thus, while ensuring the controllability of the dialogue process and the consistency of policies, it significantly improves the robustness, decision accuracy, response efficiency, and cross-scenario generalization ability of the dialogue system.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of the goal-oriented proactive dialogue method based on collaboration between large and small models provided in an embodiment of this application.

[0019] Figure 2 is a schematic structural diagram of the goal-oriented proactive dialogue system based on collaboration between large and small models provided in an embodiment of this application.

Detailed Implementation

[0020] To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0021] It should be noted that, in this document, terms such as “comprising,” “including,” or any other variations thereof are intended to cover non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such an article or device. Without further limitation, an element defined by the phrase “comprising one…” does not exclude the presence of other identical elements in the article or device that includes the aforementioned element.

[0022] As shown in Figure 1, the goal-oriented proactive dialogue method based on collaboration between large and small models provided by the present invention includes the following steps: S1. Based on the preset dialogue goal, construct a structured task flowchart, which includes multiple state nodes, transition conditions between nodes, and target nodes.

[0023] In this embodiment, constructing a structured task flowchart based on the preset dialogue objective includes: S11. Based on the dialogue objective, decompose the dialogue process required to complete the dialogue objective into a set of subtasks arranged in logical order, wherein each subtask corresponds to a stage objective in the dialogue process. S12. According to the set of subtasks, define a corresponding state node for each subtask and set node attributes for each state node. The node attributes include at least the node type and the node completion condition, forming an initial set of state nodes. The node types include information collection nodes, solution proposal nodes, objection handling nodes, and process summary nodes.

[0024] S13. Based on the initial set of state nodes, analyze the dependencies between each state node, define the transition paths between nodes, and set trigger conditions for each transition path to form state transition relationships. The triggering conditions include at least one of the following: "Key slot information filling complete" means that in the current dialogue task, the predefined information fields necessary to complete the objective have been fully or conditionally filled in through user input; "Dialogue rounds reach preset threshold" means that the current number of dialogue interactions has reached or exceeded the maximum number of rounds or stage rounds preset by the system.
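The two trigger conditions named above can be expressed as simple predicates. The sketch below is a minimal illustration that covers only the "fully filled" case of key slots (the conditional-fill case is left out), and the function names and slot format are assumptions.

```python
def key_slots_filled(filled_slots, required_slots):
    """True when every required slot has a non-empty value."""
    return all(filled_slots.get(name) not in (None, "") for name in required_slots)

def round_limit_reached(round_count, max_rounds):
    """True when the number of dialogue rounds has reached or exceeded the preset limit."""
    return round_count >= max_rounds

def transition_triggered(filled_slots, required_slots, round_count, max_rounds):
    """A transition fires when at least one of its trigger conditions holds."""
    return (key_slots_filled(filled_slots, required_slots)
            or round_limit_reached(round_count, max_rounds))
```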

[0025] S14. Determine at least one target node in the set of state nodes, and define dialogue success conditions and dialogue failure conditions for the target node. The target nodes include dialogue success nodes and dialogue failure nodes. Dialogue success nodes correspond to the goal being achieved, while dialogue failure nodes correspond to the process being terminated without the goal being achieved.

[0026] S15. Based on the state transition relationships and the target nodes, perform consistency verification and path integrity verification on the overall task structure to ensure that any initial state node has at least one path to a target node, and generate the structured task flowchart.

[0027] Consistency verification includes detecting whether there are non-target nodes without outgoing edges or isolated nodes that cannot reach the target node, and performing supplementary connection or deletion processing on the nodes; path integrity verification includes verifying that there are corresponding state transition paths for different user behavior branches, so as to ensure that the dialogue process can cover a variety of user response situations.
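The consistency check described above amounts to a reachability test on a directed graph. A minimal sketch follows, assuming the flowchart is stored as an adjacency mapping; the data format, node names, and function names are illustrative. It reports non-target nodes with no outgoing edges and nodes that cannot reach any target, and leaves the repair step (adding connections or deleting nodes) to the caller.

```python
from collections import deque

def nodes_reaching_targets(transitions, targets):
    """Return every node that has at least one path to some target node.

    transitions: {source_node: [destination_node, ...]} (hypothetical format).
    Runs a breadth-first search backwards from the targets over reversed edges.
    """
    reverse = {}
    for src, dests in transitions.items():
        for dst in dests:
            reverse.setdefault(dst, set()).add(src)
    reached, queue = set(targets), deque(targets)
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, ()):
            if pred not in reached:
                reached.add(pred)
                queue.append(pred)
    return reached

def verify_flowchart(nodes, transitions, targets):
    """Report non-target dead ends and nodes that cannot reach any target."""
    dead_ends = sorted(n for n in nodes if n not in targets and not transitions.get(n))
    reachable = nodes_reaching_targets(transitions, targets)
    unreachable = sorted(n for n in nodes if n not in reachable)
    return {"dead_ends": dead_ends, "cannot_reach_target": unreachable}

# Usage with a small flowchart containing one isolated node.
example_nodes = ["collect_info", "propose", "handle_objection", "summary",
                 "success", "failure", "orphan"]
example_transitions = {
    "collect_info": ["propose"],
    "propose": ["handle_objection", "summary"],
    "handle_objection": ["propose", "failure"],
    "summary": ["success"],
}
report = verify_flowchart(example_nodes, example_transitions, {"success", "failure"})
```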

[0028] Furthermore, decomposing the dialogue process required to achieve the dialogue objective into a set of subtasks arranged in logical order includes: S111. Acquire historical dialogue data related to the dialogue objective, and perform semantic parsing on the historical dialogue data to convert each round of dialogue into a preset behavior label, thereby forming a dialogue behavior sequence arranged in chronological order. The behavior labels include questioning, answering, clarifying, refusing, and explaining behaviors.

[0029] S112. Based on the dialogue behavior sequence, perform statistical analysis on the transition relationships between adjacent behavior labels, and extract the recurring behavior transition segments as basic behavior pattern units to obtain an initial behavior pattern set. S113. Based on the initial behavior pattern set, merge the redundant structures between behavior patterns, and compress and optimize the behavior pattern set using the minimum description length method to eliminate redundant patterns and retain the set of behavior pattern blocks with the greatest expressive power. The compression optimization specifically includes: calculating, for each candidate behavior pattern set, the sum of the description length and the data reconstruction length, where the description length is the number of bits required for lossless encoding of the behavior pattern set (e.g., using entropy encoding for the number of patterns and the symbol sequence in each pattern), and the data reconstruction length is the number of bits required to encode the original dialogue behavior sequence using the behavior pattern set, obtained by constructing a probabilistic model from the behavior pattern set (e.g., an nth-order Markov model of the sequence) and taking the negative log-likelihood (base 2) of the original dialogue behavior sequence under this model as the reconstruction cost; and using the sum of the description length and the data reconstruction length as the compression cost function, selecting the behavior pattern set that minimizes it.
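The pattern extraction and minimum description length selection can be sketched in a few lines. This is a simplified illustration under stated assumptions: recurring segments are fixed-length n-grams, the pattern set is encoded with a fixed-length code per symbol instead of entropy coding, and the reconstruction cost uses an empirical unigram model over greedily segmented blocks in place of the nth-order Markov model mentioned in the text. The cost of storing that empirical model is ignored, and all function names are hypothetical.

```python
import math
from collections import Counter

def recurring_segments(sequence, length, min_count=2):
    """Count label n-grams of the given length and keep those that recur."""
    grams = Counter(tuple(sequence[i:i + length])
                    for i in range(len(sequence) - length + 1))
    return {gram: count for gram, count in grams.items() if count >= min_count}

def segment(sequence, patterns):
    """Greedy longest-match split of the sequence into pattern blocks or single labels."""
    ordered = sorted(patterns, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(sequence):
        for p in ordered:
            if tuple(sequence[i:i + len(p)]) == p:
                tokens.append(p)
                i += len(p)
                break
        else:  # no pattern matched at position i
            tokens.append((sequence[i],))
            i += 1
    return tokens

def description_length(patterns, alphabet_size):
    """Bits to store the pattern set: fixed-length code per symbol plus an end marker."""
    bits_per_symbol = math.log2(alphabet_size)
    return sum((len(p) + 1) * bits_per_symbol for p in patterns)

def reconstruction_length(sequence, patterns):
    """Negative base-2 log-likelihood of the segmented sequence under a unigram model."""
    tokens = segment(sequence, patterns)
    counts, total = Counter(tokens), len(tokens)
    return -sum(math.log2(counts[t] / total) for t in tokens)

def mdl_cost(sequence, patterns):
    """Compression cost = description length + data reconstruction length."""
    alphabet_size = max(2, len(set(sequence)))
    return description_length(patterns, alphabet_size) + reconstruction_length(sequence, patterns)

def select_pattern_set(sequence, candidate_sets):
    """Pick the candidate pattern set with the lowest compression cost."""
    return min(candidate_sets, key=lambda patterns: mdl_cost(sequence, patterns))
```

Patterns are passed as tuples of labels. For a highly repetitive label sequence, a pattern set containing the repeated block costs fewer total bits than an empty pattern set, so it is the one selected.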

[0030] S114. Based on the set of behavior pattern blocks, the occurrence position and duration range of each behavior pattern block in the dialogue behavior sequence are analyzed, and the semantic function category corresponding to each behavior pattern block is identified, thereby mapping the behavior pattern block to a semantically meaningful functional behavior block. The semantic function categories include information collection, conflict resolution, solution advancement, and process confirmation.

[0031] S115. Based on the set of functional behavior blocks, the functional behavior blocks are grouped and classified according to the semantic dependencies and temporal order between the behavior blocks to generate a set of subtasks corresponding to different dialogue stages. Each subtask corresponds to a set of functional behavior blocks with an independent stage purpose. S116. Based on the set of subtasks, each subtask is arranged in an orderly manner according to the order of appearance of functional behavior blocks in the original dialogue behavior sequence and the semantic progression relationship, so as to obtain the task decomposition result organized in logical order.

[0032] It should be noted that by using the above-mentioned task decomposition method based on behavior sequence modeling, pattern compression and semantic function mapping, the dialogue stage division that relies on human experience can be transformed into a structured sub-task sequence that is automatically generated by data, thereby significantly improving the consistency, reproducibility and cross-scenario generalization ability of task decomposition.

[0033] S2. Based on the task flowchart, generate policy-oriented dialogue data through multi-agent simulation, where the agents include at least a facilitator agent for guiding the dialogue process and a participant agent for simulating user behavior, so that the generated dialogue data includes state nodes, dialogue behaviors, and state transition relationships.

[0034] In this embodiment, generating policy-oriented dialogue data through multi-agent simulation based on the task flowchart includes: S21. Parse the state nodes, the transition relationships between nodes, and the target nodes in the task flowchart, and establish a dialogue state space and a set of state transition rules based on the parsing results. The dialogue state space is the set of all state nodes in the task flowchart, representing every state the dialogue may be in at its different stages; the state transition rule set is the set of rules that define when the dialogue moves from one state node to another under specific conditions.

[0035] S22, Based on the dialogue state space and the set of state transition rules, define the role types in the multi-agent system, including at least one facilitator agent for guiding the dialogue process and at least one participant agent for simulating user behavior, and set the behavioral goals and decision-making strategies for each type of agent. The goal of the facilitator agent is to drive the dialogue along the task flow graph toward the target node, and the goal of the participant agent is to generate corresponding response behaviors according to the set user profile. The decision-making strategy refers to the decision-making mechanism by which an agent selects the optimal dialogue behavior to achieve its behavioral goals based on predetermined rules or a learning model, given the current dialogue state, dialogue history, and preset goals. The set of state transition rules is structured and encoded, with each state node corresponding to a set of allowed entry / exit conditions and a set of legal output behavior types. Simultaneously, a structured prompt template is designed for the facilitator agent, containing the current state node name, an enumerated list of available candidate behavior types (e.g., ["ask a question", "clarify", "propose a solution", "soothe emotions"]), a list of target nodes allowed for transition from the current state according to the rules, and a summary of user profile parameters. The facilitator agent is required to output structured data conforming to a predefined JSON Schema. This data must include a `behavior_type` field (selected from the enumerated list) and a `generated_text` field.
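The structured prompt and the required output format can be illustrated with a short sketch. The prompt layout, the function names, and the error wording are assumptions; the `behavior_type` and `generated_text` fields and the enumerated behavior list come from the text. Hand-written checks stand in for a full JSON Schema validator here.

```python
import json

ALLOWED_BEHAVIORS = ["ask a question", "clarify", "propose a solution", "soothe emotions"]

def build_facilitator_prompt(state_name, allowed_behaviors, next_nodes, profile_summary):
    """Render the structured prompt fields named in the text as plain lines."""
    return "\n".join([
        f"Current state node: {state_name}",
        f"Candidate behavior types: {json.dumps(allowed_behaviors)}",
        f"Allowed next nodes: {json.dumps(next_nodes)}",
        f"User profile summary: {profile_summary}",
        'Reply with JSON: {"behavior_type": <one candidate type>, "generated_text": <string>}',
    ])

def parse_facilitator_output(raw, allowed_behaviors):
    """Return (data, None) when the output is valid, otherwise (None, error message)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "output must be a JSON object"
    for field in ("behavior_type", "generated_text"):
        if not isinstance(data.get(field), str):
            return None, f"missing or non-string field '{field}'"
    if data["behavior_type"] not in allowed_behaviors:
        return None, f"behavior type '{data['behavior_type']}' is not in the allowed list"
    return data, None
```

Returning an error message instead of raising makes it straightforward to feed the message back into the dialogue context when the model is asked to try again.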

[0036] S23. Based on the participant intelligent agent, construct multiple participant instances with different user characteristics, and configure user profile parameters for each participant instance. The user profile parameters include behavioral preferences, information providing tendencies, and emotional response patterns, forming a diverse set of user behavior simulations. The user profile parameters (behavioral preferences, information provision tendency, and emotional response pattern) configured for each participant instance are embedded into system prompts via variable substitution. For example: User profile: Preference = {{preference}}; Information provision tendency = {{disclosure_tendency}}; Emotional pattern = {{emotion}}. Simultaneously, the set of state transition rules is preprocessed into a natural language rule list or JSON mapping table and embedded into the "constraints" section of the system prompts to guide the participant agent in generating responses within legal limits. The participant instance refers to a specific user agent entity generated based on the preset user profile parameters during the dialogue simulation, used to simulate user interaction behaviors with specific behavioral characteristics.
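Variable substitution into the system prompt might look like the following sketch. The `{{name}}` placeholder syntax and the three profile fields follow the example in the text; the function names, the "Constraints" layout, and the sample rules are assumptions.

```python
import re

PROFILE_TEMPLATE = (
    "User profile: Preference = {{preference}}; "
    "Information provision tendency = {{disclosure_tendency}}; "
    "Emotional pattern = {{emotion}}."
)

def render_template(template, values):
    """Replace each {{name}} placeholder; raise KeyError if a value is missing."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)

def build_participant_prompt(profile, rules):
    """Combine the rendered profile with a constraints section listing the rules."""
    constraints = "\n".join(f"- {rule}" for rule in rules)
    return f"{render_template(PROFILE_TEMPLATE, profile)}\n\nConstraints:\n{constraints}"

# Usage with one hypothetical participant instance.
participant_prompt = build_participant_prompt(
    {"preference": "low price", "disclosure_tendency": "reluctant", "emotion": "impatient"},
    ["In state collect_info, allowed responses: answer, refuse, ask for clarification"],
)
```

Failing loudly on a missing profile field avoids silently sending a prompt that still contains raw placeholders.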

[0037] S24. Based on the dialogue state space, select the initial state node as the dialogue start state, and set the initial dialogue context in combination with the agents to generate the initial dialogue state. S25. In the current dialogue state, based on the dialogue context and the current state node, the facilitator agent selects a candidate dialogue behavior according to the set of state transition rules to advance the dialogue toward the target node, producing the facilitator behavior output. The candidate dialogue behaviors include asking questions, clarifying, proposing solutions, and soothing emotions. The facilitator behavior output refers to the dialogue behavior generated by the facilitator agent according to its decision-making strategy in the current dialogue state, which is used to advance the dialogue toward the target node.

[0038] Immediately after the facilitator agent outputs, two-step legality checks are performed: (1) verifying whether behavior_type belongs to the set of candidate behaviors allowed by the current state; (2) verifying whether the behavior satisfies the transition conditions defined in the set of state transition rules.

[0039] If the verification fails, it is handled in the following order of priority. Retry: Add an error message (such as "The behavior type 'complain' in the output is not in the allowed list. Please select from [ask, clarify, propose a solution, soothe emotions]") to the dialogue context and call the model again. The model can be retried up to 3 times.

[0040] Fallback: If the number of retries has been exhausted and the attempt still fails, use the default safety behavior (e.g., behavior_type="clarify", generated_text="sorry, I didn't understand, could you say it again?").

[0041] Interruption: If consecutive failures exceed the system threshold (e.g., 5 rollbacks in the same trajectory), the current dialogue simulation will be terminated and the trajectory will be marked as "invalid data" and discarded or entered into the manual review queue.

[0042] Detailed information is recorded for each illegal output (e.g., state node, illegal output, number of attempts).
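The retry and fallback handling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `generate` is a hypothetical stand-in for the model call, only check (1) (membership of behavior_type in the allowed set) is implemented, and "up to 3 times" is read as one initial call plus three retries.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_RETRIES = 3  # retry limit stated in the text
DEFAULT_ACTION = {"behavior_type": "clarify",
                  "generated_text": "sorry, I didn't understand, could you say it again?"}

@dataclass
class Outcome:
    action: dict
    used_fallback: bool
    violations: list = field(default_factory=list)  # log of illegal outputs

def checked_action(generate: Callable[[list], dict], context: list,
                   state: str, allowed: dict) -> Outcome:
    """Call `generate` until its behavior_type is allowed in `state`, else fall back."""
    violations = []
    for attempt in range(1, MAX_RETRIES + 2):  # 1 initial call + 3 retries
        action = generate(context)
        if action.get("behavior_type") in allowed[state]:
            return Outcome(action, used_fallback=False, violations=violations)
        # Record the illegal output, then feed the error back into the context.
        violations.append({"state": state, "output": action, "attempt": attempt})
        context = context + [f"The behavior type {action.get('behavior_type')!r} is not "
                             f"in the allowed list. Please select from {allowed[state]}"]
    return Outcome(dict(DEFAULT_ACTION), used_fallback=True, violations=violations)

# Toy generator: two illegal outputs, then a legal one.
outputs = iter([{"behavior_type": "complain"}, {"behavior_type": "complain"},
                {"behavior_type": "ask", "generated_text": "What is your budget?"}])
result = checked_action(lambda ctx: next(outputs), [], "collect_info",
                        {"collect_info": ["ask", "clarify"]})
```

The interruption rule would sit in the caller: it can count how many `Outcome` objects in one trajectory have `used_fallback` set and abandon the trajectory once the count passes the system threshold.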

[0043] S26. Based on the facilitator behavior output and the current dialogue context, the participant agent generates a corresponding response according to its user profile parameters, forming the participant behavior output. The participant behavior output refers to the response generated by the participant agent after receiving the facilitator's action, based on its user profile and the current dialogue context. The participant agent's output is subject to the same legality verification and retry mechanism, except that the verification rules check whether the output conforms to the user profile constraints (e.g., the configured information provision tendency does not permit a direct refusal to answer) and whether the generated text is consistent with the predefined behavior type. If verification fails, the retry, fallback, or interruption process is triggered.

[0044] S27. Based on the facilitator's behavior output and the participant's behavior output, combined with the set of state transition rules, the current dialogue state is updated to obtain a new state node. The current state node, dialogue context, facilitator's behavior, participant's behavior and state transition result in each round of dialogue are recorded to form complete dialogue trajectory data. S28. Based on the new state node, determine whether the preset termination condition has been met. If the termination condition has been met, end the current dialogue simulation. If not, return to step S25 and continue to the next round of dialogue simulation. The termination conditions include reaching the target node or exceeding a preset threshold for the number of dialogue rounds.
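Steps S25 to S28 form a loop, which the following Python sketch illustrates. The agents are hypothetical callables, the transition table keyed by (state, facilitator behavior, participant behavior) is an assumed representation of the state transition rules, and the dialogue context is omitted from the records for brevity.

```python
from typing import Callable

def simulate(start: str, target: str, transitions: dict,
             facilitator: Callable[[str], str],
             participant: Callable[[str, str], str],
             max_rounds: int = 10) -> list:
    """Run one simulated dialogue and return its trajectory as per-round records."""
    state, trajectory = start, []
    for round_id in range(max_rounds):       # round-limit termination condition
        f_act = facilitator(state)
        p_act = participant(state, f_act)
        # Stay in place when no rule matches this (state, behavior, response) triple.
        next_state = transitions.get((state, f_act, p_act), state)
        trajectory.append({"round": round_id, "state": state, "facilitator": f_act,
                           "participant": p_act, "next_state": next_state})
        state = next_state
        if state == target:                  # target-node termination condition
            break
    return trajectory

# Toy flowchart: greet -> collect -> done.
rules = {("greet", "ask", "answer"): "collect",
         ("collect", "propose", "accept"): "done"}
trajectory = simulate(
    "greet", "done", rules,
    facilitator=lambda s: {"greet": "ask", "collect": "propose"}[s],
    participant=lambda s, a: {"ask": "answer", "propose": "accept"}[a],
)
```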

[0045] S29. The dialogue trajectory data is organized and structured to form a policy-oriented dialogue data set containing state node sequences, dialogue behavior sequences, and state transition relationships.

[0046] It should be noted that by introducing multi-agent collaborative simulation under the constraints of the task flowchart, the automatic generation and closed-loop recording of dialogue behaviors, state nodes and state transition relationships can be achieved, thereby efficiently constructing high-quality dialogue data that is structurally clear, policy-consistent and diverse.

[0047] S3. Based on the dialogue data, the basic large language model is subjected to supervised fine-tuning to obtain an initial policy model. Based on the preference data of the merits and demerits of dialogue behavior, the initial policy model is trained by direct preference optimization to obtain a dialogue policy decision model.

[0048] In this embodiment, the basic large language model is subjected to supervised fine-tuning based on the dialogue data to obtain an initial policy model, including: S31, the dialogue data is parsed, and the state nodes, dialogue context and corresponding facilitator behavior in each round of dialogue are aligned to construct a training sample set, wherein each training sample includes at least an input sequence and a target output. S32, based on the training sample set, the input sequence is structured and encoded, and the dialogue history, current state node and related context information are concatenated or serialized. At the same time, the corresponding facilitator behavior is converted into standardized dialogue behavior identifiers or natural language expressions as the target output, thereby obtaining the input and output pairs required for model training. S33, based on input and output pairs, preprocesses the training data, including removing outlier samples and filtering dialogue segments that do not conform to state transition rules, to obtain a high-quality training dataset; S34, Select a pre-trained basic large language model (e.g., Llama-3, Qwen3) as the initial model, load its pre-trained parameters, and use the model as the model to be fine-tuned; S35, based on a high-quality training dataset, supervised learning training is performed on the model to be fine-tuned. The model parameters are updated by minimizing the difference between the model's generated output and the target output, so that the model learns to generate the corresponding facilitator behavior given the dialogue state and context. 
S36. During training, a loss function is constructed to measure the difference between the dialogue behavior category predicted by the model and the true behavior label. The loss function is calculated as follows:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log \hat{y}_{i,k}$$

where $L_{cls}$ is the classification loss value, $N$ is the number of training samples, $K$ is the total number of behavior categories, $y_{i,k}$ is the true label of the i-th sample for the k-th class, and $\hat{y}_{i,k}$ is the probability predicted by the model that the i-th sample belongs to the k-th class.
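The quantities defined in S36 (true labels, predicted class probabilities, number of samples, number of categories) correspond to a categorical cross-entropy. A minimal Python sketch follows, assuming the loss is averaged over the samples; the example labels and probabilities are invented for illustration.

```python
import math

def cross_entropy(labels: list, probs: list) -> float:
    """Mean over samples of -sum_k y[i][k] * log(p[i][k])."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y_row, p_row in zip(labels, probs):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return total / len(labels)

# Two samples, four behavior categories: ask, clarify, propose, reassure.
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.5, 0.1]]
loss = cross_entropy(labels, probs)  # (-ln 0.7 - ln 0.5) / 2, about 0.525
```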

[0049] S37, based on the loss function, uses gradient descent or its variant optimization algorithms to iteratively update the model parameters, so that the model's output on the training dataset gradually approaches the target output; S38. During the training process, the performance of the model on the validation dataset is evaluated. When the loss function value is lower than a preset threshold, training is stopped and the parameters of the trained model are obtained. The preset threshold is set based on the distribution of the loss function of the validation dataset or empirical performance indicators during the historical training process, so that it corresponds to the loss level when the model reaches the expected accuracy or stable convergence state on the validation set.

[0050] S39. Based on the trained model parameters, an initial policy model is formed, enabling the model to output corresponding dialogue behavior or policy instructions when the current dialogue state and context are input.

[0051] Furthermore, based on preference data ranking dialogue behaviors by quality, the initial policy model is trained using direct preference optimization to obtain the dialogue policy decision model, including: S310, based on the dialogue data, extract the training dialogue context as the input condition and construct a context sample set, where the training dialogue context includes the historical dialogue content, the current state node, and the corresponding dialogue environment information; S311, based on the context sample set, use the initial policy model to generate at least two different candidate dialogue behaviors for each context sample, obtaining a candidate behavior set; S312, based on the candidate behavior set, evaluate the quality of each candidate dialogue behavior by computing an evaluation score with a preset dialogue evaluation function, obtaining a score for each candidate behavior. The dialogue evaluation function is calculated as follows:

$$S = \lambda_1 S_{proc} + \lambda_2 S_{sem} + \lambda_3 S_{emo}$$

where $S$ is the overall evaluation score of the candidate dialogue behavior. $S_{proc}$ is the process advancement score, which measures whether the behavior propels the dialogue toward the target node; it is obtained by matching the candidate dialogue behavior against the preset state transition conditions of the current state node, taking the value 1 if the transition conditions are met and 0 otherwise. $S_{sem}$ is the semantic accuracy score, which measures the consistency between the behavior and the semantics of the current context; it is obtained by inputting the training dialogue context and the candidate dialogue behavior into a predefined semantic encoding model to obtain vector representations and taking the cosine similarity between the two. $S_{emo}$ is the emotion fit score, which measures how well the candidate dialogue behavior matches the user's emotional state; it is obtained by applying a decreasing function to the absolute value of the difference between the user's current emotional intensity and the emotional tendency of the candidate dialogue behavior (e.g., 1 minus the difference). $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the corresponding weight coefficients. To obtain them, the contribution value of each score is calculated as the absolute value of the Pearson correlation coefficient between that score and the target achievement rate on the historical dialogue sample set; the contribution values are mapped to the [0,1] interval by Min-Max normalization, and each weight is set to its share of the normalized results, satisfying $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
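The evaluation function of S312 and its weight derivation can be sketched in Python as below. The function names, the contribution values, and the uniform-weight handling of the degenerate case are assumptions of this sketch; the emotion fit score uses the "1 minus the difference" example from the text.

```python
def minmax_weights(contributions: list) -> list:
    """Min-Max normalize contribution values, then convert to proportions summing to 1."""
    lo, hi = min(contributions), max(contributions)
    if hi == lo:  # all contributions equal: uniform weights (assumption of this sketch)
        return [1 / len(contributions)] * len(contributions)
    normalized = [(c - lo) / (hi - lo) for c in contributions]
    total = sum(normalized)
    return [n / total for n in normalized]

def evaluation_score(advances: bool, semantic_sim: float, user_emotion: float,
                     behavior_tendency: float, weights: list) -> float:
    """Weighted sum of process advancement, semantic accuracy, and emotion fit."""
    s_proc = 1.0 if advances else 0.0
    s_emo = 1.0 - abs(user_emotion - behavior_tendency)  # "1 minus the difference"
    return weights[0] * s_proc + weights[1] * semantic_sim + weights[2] * s_emo

# Invented |Pearson r| contributions for the three scores.
weights = minmax_weights([0.6, 0.4, 0.2])
score = evaluation_score(True, 0.8, user_emotion=-1, behavior_tendency=0, weights=weights)
```

Two properties of the procedure as described are visible here. Min-Max normalization maps the smallest contribution to 0, so the least-correlated score receives zero weight. And on a scale from -1 to +1, "1 minus the difference" can fall below zero when the two values are at opposite ends.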

[0052] Optionally, the user's current emotional intensity is determined by using a pre-trained emotion classification model (e.g., a fine-tuned BERT-based three-class classifier) to map the text entered by the user in the current dialogue turn to one of three categories, "positive," "neutral," or "negative," and assigning each a numerical value: positive = +1, neutral = 0, and negative = -1. For the emotional tendency, each candidate dialogue behavior type is pre-labeled with a fixed emotional tendency value on the same scale: +1 indicates a positive, appeasing tendency, 0 indicates neutral, and -1 indicates a negative or critical tendency. The labels are determined by domain experts based on behavioral semantics and dialogue strategy and are stored in the system as a lookup table.

[0053] S313. Based on the scoring results, compare the candidate behaviors for the same context sample pairwise, mark the behavior with the higher evaluation score as the preferred behavior and the one with the lower score as the inferior behavior, and construct a preference data sample set of "context, preferred behavior, inferior behavior" triples. S314. Based on the preference data sample set, use the initial policy model to calculate the conditional probabilities of generating the preferred and inferior behaviors under the given context, obtaining the corresponding probability outputs. S315. Based on the probability outputs, construct the objective function for direct preference optimization so that, under a given context, the model increases the probability of generating the preferred behavior and decreases the probability of generating the inferior behavior. S316. Based on the objective function for direct preference optimization, update the parameters of the initial policy model by gradient descent so that the model's output distribution on the preference data gradually shifts toward the preferred behavior. The objective function for direct preference optimization is calculated as follows:

$$L_{DPO} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right)\right]$$

where $L_{DPO}$ is the direct preference optimization objective, $x$ is the training dialogue context, $\mathbb{E}_{(x, y_w, y_l)}$ is the expectation over all preference samples, $y_w$ is the preferred dialogue behavior, $y_l$ is the inferior dialogue behavior, $\pi_\theta(y \mid x)$ is the probability that the model currently being optimized generates behavior $y$ under the given context, $\pi_{ref}$ is the probability distribution of the initial policy model, $\sigma$ is the Sigmoid function, and $\beta$ is a preset scaling factor that adjusts the influence of the log-probability difference between the preferred and inferior behaviors on model optimization. The value of $\beta$ is determined by standard hyperparameter tuning, such as grid search or cross-validation on the validation dataset, selecting the value that gives the best convergence of the direct preference optimization objective.
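For a single preference sample, the direct preference optimization objective in S315 and S316 reduces to a short computation on four log-probabilities. The Python sketch below assumes the standard DPO form, in which the scaling factor multiplies the difference of log-ratios against the initial policy model; the function name, the default scaling factor of 0.1, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair loss: -log(sigmoid(beta * (preferred log-ratio - inferior log-ratio)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before any update the policy equals the reference model, so the margin is zero.
initial = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# After raising the preferred behavior and lowering the inferior one, the loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

Averaging this per-pair value over the preference data sample set gives the expectation in the objective.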

[0054] S317, Based on the trained model parameters, generate a dialogue strategy decision model, enabling the model to preferentially generate the optimal dialogue behavior that conforms to the preference data under a given dialogue context.

[0055] It should be noted that the constructed preference dataset is used to train the model with the DPO (Direct Preference Optimization) algorithm. DPO directly updates the parameters of the large language model, increasing the probability of generating "chosen" responses while decreasing the probability of generating "rejected" responses. In this way, the model implicitly learns the multi-dimensional preferences encoded in the composite evaluation function and internalizes them as its generation strategy. Through DPO training, the model learns, in any given dialogue state, to generate dialogue behaviors that conform to procedural norms, are precise in expression, are emotionally appropriate, and efficiently advance the goal. The final trained model is the "Dialogue Policy Decision Model." Compared with traditional reinforcement learning, this method offers a more stable training process and aligns the model's strategy more directly with complex human preferences (as quantified by the evaluation function).

[0056] S4: Obtain user input information, combine it with historical dialogue to form a real-time dialogue context, and based on information gap-driven phased goal constraints, determine the target-oriented intent through candidate intent semantic matching and scoring, and update the current state node, including: S41, acquire real-time user input information, which includes voice or text; when the input information is voice, perform voice recognition processing to convert it into text form to obtain the user input text for the current round; S42, based on user input text and combined with historical dialogue records, organize and splice the historical dialogue content in chronological order to form a dialogue history sequence; S43, based on the dialogue history sequence, the user input text and historical dialogue content are structured and encoded to generate a real-time dialogue context, which includes the current input, historical interaction information and the current dialogue round identifier; S44, based on the definition of each state node in the preset dialogue target and task flowchart, the information required to achieve the target is structurally modeled to generate a target information requirement set, which includes multiple preset information slots; S45, based on the real-time dialogue context, the information provided by the user is parsed and extracted to obtain an observed information set, which includes the filled information slots; S46, Based on the target information demand set and the observed information set, the information slots that have not yet been acquired are determined by the set difference operation to obtain the information gap set; The set difference operation refers to removing elements already contained in one set from another set, thus obtaining a set of remaining elements that belong only to the former and not to the latter.
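The set difference in S46 maps directly onto Python's set type. In this minimal sketch the slot names and values are invented, and a slot counts as filled only when a value has actually been extracted for it.

```python
# Target information requirement set (S44): slots needed to reach the goal.
required_slots = {"budget", "travel_date", "destination", "party_size"}

# Observed information (S45): slots parsed from the dialogue so far.
observed = {"destination": "Hangzhou", "travel_date": None}
filled_slots = {slot for slot, value in observed.items() if value is not None}

# Information gap set (S46): required slots that have not yet been acquired.
information_gap = required_slots - filled_slots
```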

[0057] S47. Based on the information gap set, the importance of each missing information slot is assessed to obtain a weighted information gap set. Specifically, the importance assessment quantifies the impact of each missing information slot on achieving the dialogue goal. Based on historical dialogue data, it calculates either the probability that filling the information slot triggers an effective state transition, or the increase in the dialogue success rate after the information slot is filled compared with before, and uses that probability or increase as the weight of the information slot, thereby giving a numerical assessment of the importance of each missing information slot.

[0058] S48, based on the weighted information gap set, reverse reasoning is performed on the current stage of the dialogue to determine the phased goal of prioritizing filling the high-weight information gaps; Specifically, the determination of the phased goals is achieved by sorting the weights of each missing information slot in the weighted information gap set, and selecting one or more information slots with the highest weight value or whose cumulative weight reaches a preset threshold as priority filling targets. This quantifies the corresponding information acquisition task into the goal of the current phase. The preset threshold is set according to the statistical results of the cumulative weight distribution of information slots in historical dialogue data, and is selected by quantile analysis or cross-validation to achieve the optimal dialogue goal achievement rate.
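The selection rule in S48 (sort by weight, then take slots until the cumulative weight reaches the preset threshold) can be sketched in Python as follows. The function name, slot weights, and threshold values are illustrative assumptions.

```python
def select_phase_goal(weighted_gap: dict, threshold: float) -> list:
    """Pick the highest-weight missing slots until their cumulative weight reaches `threshold`."""
    selected, cumulative = [], 0.0
    ranked = sorted(weighted_gap.items(), key=lambda kv: kv[1], reverse=True)
    for slot, weight in ranked:
        selected.append(slot)
        cumulative += weight
        if cumulative >= threshold:
            break
    return selected

weighted_gap = {"party_size": 0.2, "budget": 0.5, "travel_date": 0.3}
phase_goal = select_phase_goal(weighted_gap, threshold=0.7)
# phase_goal == ["budget", "travel_date"]
```

Setting the threshold at or below the largest single weight reduces the rule to "select the one slot with the highest weight", the other option named in the text.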

[0059] S49. Based on the determined phased goals, generate multiple candidate user intent sets related to filling the information gap. The candidate user intent set refers to a set of alternative semantic intents that may be used to interpret user input and advance the dialogue process, generated under the constraints of the current phased goals in order to achieve information gap filling. S410, based on the candidate user intent set, perform semantic matching calculation between the current user input and each candidate user intent, and combine the weighted information gap set to perform weighted processing on the matching result to obtain a comprehensive score for each candidate intent; Specifically, the current user input and each candidate user intent are semantically encoded, and the cosine similarity value between them is calculated as the semantic matching score. Secondly, based on the weighted information gap set, the information gap filling score is obtained by matching the information slots associated with each candidate user intent with the information gap set and accumulating the corresponding weights. Finally, the semantic matching score and the information gap filling score are weighted and summed to obtain the comprehensive score of the candidate user intent.
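The comprehensive score in S410 combines a cosine similarity with an accumulated gap weight. The Python sketch below uses two-dimensional toy vectors in place of real sentence embeddings; the equal combination weights `w_sem` and `w_gap`, the intent names, and the slot weights are assumptions, since the text does not specify them.

```python
import math

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intent_score(user_vec: list, intent_vec: list, intent_slots: list,
                 weighted_gap: dict, w_sem: float = 0.5, w_gap: float = 0.5) -> float:
    """Weighted sum of semantic match and information gap filling score."""
    semantic = cosine(user_vec, intent_vec)
    gap_fill = sum(weighted_gap.get(slot, 0.0) for slot in intent_slots)
    return w_sem * semantic + w_gap * gap_fill

weighted_gap = {"budget": 0.7, "travel_date": 0.3}
candidates = {
    "provide_budget": {"vec": [1.0, 0.0], "slots": ["budget"]},
    "provide_date": {"vec": [0.0, 1.0], "slots": ["travel_date"]},
}
user_vec = [0.6, 0.8]  # semantically closer to "provide_date"
scores = {name: intent_score(user_vec, c["vec"], c["slots"], weighted_gap)
          for name, c in candidates.items()}
best_intent = max(scores, key=scores.get)
```

In this toy case the user input is semantically closer to `provide_date`, yet `provide_budget` wins because it fills the higher-weight gap, which is the goal-driven behavior the step is meant to produce.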

[0060] S411, Select the candidate user intent with the highest score as the target-oriented intent of the current user input. The target-oriented intent refers to the user semantic intent with the highest score that is most helpful in filling the key information gap and promoting the dialogue toward the target node in the current dialogue state, based on the comprehensive score. S412, Based on the goal-oriented intent and the set of observed information, match it with the state nodes in the task flowchart to determine the set of candidate state nodes; S413, Based on the candidate state node set, combined with the current state node and the state transition conditions defined in the task flowchart, determine whether the state transition is satisfied, and determine the target state node after the state transition occurs. S414, Based on the determination result, update the current dialogue state to the target state node and record the state transition path, thereby realizing dynamic tracking of the dialogue state.

[0061] It should be noted that by introducing target information demand modeling and information gap calculation into the dialogue state tracking process, with information gap as the core driving factor, and combining importance quantification assessment, phased goal reverse reasoning, and candidate intent weighted scoring mechanism, the transformation from "user expression driven" to "goal-oriented driven" has been achieved. Thus, even when user expression is incomplete or ambiguous, it can still stably identify the semantic intent most conducive to advancing the dialogue goal, and achieve accurate state matching and transition based on the task flowchart, which significantly improves the robustness of dialogue understanding, the accuracy of state updates, and the efficiency and controllability of the overall dialogue process.

[0062] S5. Based on the updated state node, select the dialogue processing path according to the adaptation values of the strategy model mechanism and the rule response mechanism: when the strategy model mechanism has the higher adaptation value, output the corresponding dialogue behavior identifier; otherwise, generate a transitional dialogue response based on preset rules. This includes: S51, obtain the updated current dialogue state node, and extract the corresponding real-time dialogue context and the historical state information of the previous dialogue round; S52, based on the current dialogue state information and the historical state information, concatenate the state information along three dimensions (semantic features, emotional features, and target advancement features) to construct the current multidimensional state vector and the historical multidimensional state vector respectively. Specifically, the dialogue text is input into a pre-trained semantic encoding model (for example, a Transformer-based encoder such as BERT, RoBERTa, or Sentence-BERT) to extract the semantic feature vector; a pre-trained emotion classification model is applied to the text to obtain an emotion probability distribution vector as the emotional feature; and the current state node is mapped to a target advancement scalar or vector according to the progress of that node in the task flowchart.

[0063] S53. Based on the current multidimensional state vector and the historical multidimensional state vector, calculate the state change components corresponding to each dimension, including semantic change, emotion change and target offset. Then, perform dimension alignment and normalization on the features of each dimension and then concatenate them to construct a set of multidimensional state change components. The semantic change is obtained by calculating the cosine distance or Euclidean distance between the current semantic vector and the historical semantic vector; the emotion change is obtained by calculating the KL divergence or difference norm between the two rounds of emotion probability distributions; and the target offset is obtained by calculating the topological distance or stage number difference between the current state node and the historical state node in the task flowchart.
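The three change components in S53 can be computed with standard-library Python, as in this minimal sketch. It picks cosine distance, KL divergence, and stage number difference from the options listed; the direction of the KL divergence (current distribution relative to the previous one) and all example vectors are assumptions, and the normalization step is omitted.

```python
import math

def cosine_distance(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def kl_divergence(p: list, q: list, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Semantic change: current vs. previous semantic vectors (toy 2-d embeddings).
semantic_change = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Emotion change: current vs. previous (positive, neutral, negative) distributions.
emotion_change = kl_divergence([0.7, 0.2, 0.1], [0.2, 0.3, 0.5])
# Target offset: difference in stage number within the task flowchart.
target_offset = abs(3 - 2)

change_components = [semantic_change, emotion_change, target_offset]
```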

[0064] S54. Based on the set of multidimensional state change components, construct a state change coupling function and calculate the comprehensive change intensity value. The state change coupling function is calculated as follows:

$$\Delta = \sum_{i} w_i d_i + \sum_{i<j} \gamma_{ij}\, d_i d_j$$

where $\Delta$ is the comprehensive change intensity value, which represents the overall degree of change of the current dialogue state relative to the previous state in the multidimensional space and uniformly quantifies the combined impact of changes in different dimensions on strategy decisions; $d_i$ and $d_j$ are the change components between the current state and the historical state on the i-th and j-th feature dimensions; $w_i$ is the importance weight of the i-th change dimension in the overall state change; and $\gamma_{ij}$ is the interaction strength between the i-th and j-th change components, which characterizes the coupling between changes in different dimensions and whose value is obtained by fitting historical data.
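As a worked illustration of S54, the Python sketch below assumes that the coupling function is a weighted sum of the individual change components plus pairwise interaction terms over each unordered pair of dimensions. That functional form is inferred from the quantities the step defines (per-dimension weights and pairwise interaction strengths) and is an assumption, as are all numeric values.

```python
def change_intensity(d: list, w: list, gamma: list) -> float:
    """Linear term sum_i w[i]*d[i] plus pairwise term sum_{i<j} gamma[i][j]*d[i]*d[j]."""
    linear = sum(wi * di for wi, di in zip(w, d))
    pairwise = sum(gamma[i][j] * d[i] * d[j]
                   for i in range(len(d)) for j in range(i + 1, len(d)))
    return linear + pairwise

d = [0.2, 0.5, 1.0]          # semantic change, emotion change, target offset
w = [0.5, 0.3, 0.2]          # importance weights (illustrative)
gamma = [[0.0, 0.4, 0.0],    # only entries above the diagonal are used
         [0.0, 0.0, 0.2],
         [0.0, 0.0, 0.0]]
intensity = change_intensity(d, w, gamma)  # 0.45 linear + 0.14 pairwise
```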

[0065] S55. Based on the comprehensive change intensity value, a change driving potential function is constructed and the state change driving potential value is calculated. The state change driving potential value represents the strength of the driving force that the current state change exerts on policy selection.

[0066] S55. Based on the state change driving potential value, adaptation functions are constructed for the rule response mechanism and the strategy model mechanism respectively, and, combined with the current dialogue context and state node information, the adaptation values of the two mechanisms in the current state are calculated. The quantities entering the calculation are as follows. The strategy model adaptation value is the suitability of using the strategy model mechanism to make decisions in the current state; the larger the value, the more suitable it is to call the strategy model. The rule response adaptation value is the suitability of using the rule-based response mechanism in the current state. The semantic similarity score represents the degree of similarity between the current dialogue context and historically successful dialogue strategies and is obtained by cosine similarity. The rule matching degree represents how well the current state matches the rules in the rule base; it is obtained by calculating the similarity between the feature vector of the current dialogue state and the feature vector of the triggering condition of each rule in the preset rule base and taking the maximum matching score. The four weight parameters are obtained by the entropy weight method, the analytic hierarchy process (AHP), or cross-validation optimization on historical dialogue validation data, and are then normalized so that they sum to 1.

[0067] S56. Based on the adaptability value, the rule response mechanism and the strategy model mechanism are evaluated in competition, and the mechanism with higher adaptability is selected as the current dialogue processing path. S57, Execute corresponding processing based on decision results: When the strategy model mechanism is selected, input the real-time dialogue context into the dialogue strategy decision model and output the corresponding dialogue behavior identifier; when the rule response mechanism is selected, match the preset rule base according to the current state node and generate a transitional dialogue response. The pre-defined rule base refers to a set of structured rules pre-built based on the task flowchart and dialogue objectives before the system runs. It describes the standardized response strategies to be adopted under different state nodes and specific conditions. Each rule usually includes triggering conditions (such as the current state node, user intent, or filled information slot) and corresponding response templates or behavioral instructions. When the dialogue is in a non-critical strategy decision situation, the system matches the current dialogue state with the triggering conditions in the rule base, selects the rules that meet the conditions, and generates the corresponding transitional dialogue response, thereby reducing computational complexity and improving response efficiency while ensuring dialogue continuity.
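The competition in S56 and the dispatch in S57 amount to a comparison followed by one of two calls. In this Python sketch the two adaptation values are taken as given inputs, `policy_model` is a hypothetical callable standing in for the dialogue strategy decision model, the rule base is reduced to a state-to-template mapping, and sending ties to the rule path is an assumption.

```python
from typing import Callable

def select_path(fit_model: float, fit_rule: float, state: str, context: str,
                policy_model: Callable[[str], str], rule_base: dict) -> tuple:
    """Return (mechanism, output) for whichever mechanism has the higher adaptation value."""
    if fit_model > fit_rule:
        # Strategy model path: the model returns a dialogue behavior identifier.
        return "policy_model", policy_model(context)
    # Rule path (also taken on ties): look up a transitional response for the state.
    response = rule_base.get(state, "Could you tell me a bit more?")
    return "rule", response

rule_base = {"collect_info": "Got it. What dates are you considering?"}
fake_model = lambda context: "propose_solution"

critical = select_path(0.8, 0.3, "collect_info", "user: it is too expensive", fake_model, rule_base)
routine = select_path(0.2, 0.6, "collect_info", "user: ok", fake_model, rule_base)
```

Only the first call reaches the model; the second is answered from the rule base, which is where the reduction in computational overhead described in the text comes from.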

[0068] It should be noted that by introducing multi-dimensional state change modeling and coupling analysis mechanisms, the dialogue strategy triggering is transformed from traditional static node judgment into a dynamic decision-making process driven by multiple factors such as semantics, emotion, and goal advancement. Combined with the adaptability competition between the strategy model and the rule mechanism, it is possible to effectively reduce computational overhead, improve response efficiency, and enhance dialogue coherence while ensuring the accuracy and flexibility of decision-making in complex scenarios.

[0069] S6, generating natural language output based on the dialogue behavior identifier or the transitional dialogue response, thereby guiding the dialogue along the task flowchart toward the target node, including: S61, obtain the dialogue behavior identifier or the transitional dialogue response, and perform semantic parsing on it, mapping the dialogue behavior identifier to the corresponding basic semantic intent structure; The basic semantic intent structure includes behavior type, target object, and semantic parameters, thereby obtaining a structured semantic expression result.

[0070] S62, based on the basic semantic intent structure, combines user feature information and dialogue state information in the current dialogue context to construct a set of pragmatic constraints; The pragmatic constraint set includes emotion adaptation constraints, which are rules that constrain the generated content to match the emotion type and intensity with the current dialogue context; formality constraints, which are rules that constrain the generated content to conform to the target formality; and context consistency constraints, which are rules that constrain the generated content to maintain semantic and logical coherence. S63, based on the basic semantic intent structure and pragmatic constraint set, uses a structural constraint generation mechanism to generate multiple candidate natural language expression sequences, thus obtaining a candidate expression set; The structural constraint generation mechanism constrains the language generation process by pre-setting a structured output mode, so that the large language model outputs candidate expressions according to a predefined field structure. The field structure includes at least an emotion response field, an information expression field, and a guidance behavior field, and outputs the corresponding content in a structured data format.

[0071] S64, based on the candidate expression set, performs template mapping and structural recombination on the structured output results to obtain an optimized expression set; The process involves mapping each structured field in the candidate expression set to a preset three-part expression structure template of "emotional response + information expression + guiding behavior", and then concatenating them according to the field order. The structured output mode is implemented through a preset syntax constraint template or JSON Schema constraint to limit the large language model to generate content according to fixed fields.

[0072] The system performs a structural integrity check on the generated results to determine whether each field is complete and meets the preset semantic constraints. If there are missing fields, misaligned fields, or semantic inconsistencies, the system triggers regeneration or corrects errors according to the field identification rules, thereby ensuring the accuracy of the structural reorganization.
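The field check and three-part assembly described in S63, S64, and the integrity check can be sketched in Python as follows. The JSON field names are hypothetical, the check covers only presence and non-emptiness (not semantic consistency), and returning `None` stands in for "trigger regeneration".

```python
import json
from typing import Optional

# Hypothetical field names for the emotion response, information expression,
# and guidance behavior fields, in template order.
REQUIRED_FIELDS = ("emotion_response", "information", "guidance")

def assemble(raw_json: str) -> Optional[str]:
    """Return the three-part utterance, or None if the structure check fails."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    parts = []
    for name in REQUIRED_FIELDS:
        value = data.get(name)
        if not isinstance(value, str) or not value.strip():
            return None  # missing or empty field: the caller should regenerate
        parts.append(value.strip())
    return " ".join(parts)  # concatenate in the fixed field order

candidate = json.dumps({
    "guidance": "Could you share your budget?",
    "emotion_response": "I understand this is frustrating.",
    "information": "We have three plans available.",
})
utterance = assemble(candidate)
```

The fields are concatenated in template order regardless of the order in which the model emitted them, which corresponds to the structural recombination step.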

[0073] S65, based on the optimized expression set, performs multi-dimensional scoring on each candidate expression to obtain a comprehensive score result; The comprehensive scoring result is obtained by weighting the semantic consistency score, the emotion fit score, and the context coherence score. The semantic consistency score is obtained by calculating the cosine similarity between the optimized expression set vector and the basic semantic intent vector; the emotion fit score is obtained by calculating the inverse function of the KL divergence between the expression emotion distribution and the target emotion distribution; and the context coherence score is obtained by calculating the semantic similarity between the current expression and the historical dialogue vector.

[0074] S66: Based on the comprehensive scoring results, the candidate expression with the highest score is selected as the target natural language output; S67. Based on the target natural language output, the expression is adjusted by combining preset character style parameters to obtain the final output result. The character style parameters include tone intensity (achieved by inserting tone words or adjusting sentence length), politeness level (achieved by replacing politeness words), and expression habits (achieved by replacing synonym expression templates).

[0075] Referring to Figure 2, this invention provides a goal-oriented proactive dialogue system based on collaboration between large and small models. The system includes a dialogue flow and data generation module, a strategy model training module, an information gap reverse reasoning module, a mechanism adaptation decision module, and an output module, which are interconnected. The dialogue flow and data generation module is used to construct a structured task flowchart and generate policy-oriented dialogue data through multi-agent simulation, where the agents include a facilitator agent for guiding the dialogue flow and a participant agent for simulating user behavior. The strategy model training module is used to perform supervised fine-tuning and preference optimization on the basic large language model based on the dialogue data to obtain the dialogue strategy decision model. The information gap reverse reasoning module is used to obtain user input information, combine it with the historical dialogue to form a real-time dialogue context, and, under information-gap-driven phased goal constraints, determine the goal-oriented intent through candidate intent semantic matching and scoring and update the current state node. The mechanism adaptation decision module is used to select the dialogue processing path, according to the updated state node, based on the adaptation degree between the strategy model mechanism and the rule response mechanism, and to output the dialogue behavior identifier or transitional dialogue response generated by the dialogue strategy decision model. The output module is used to generate natural language output based on the dialogue behavior identifier or the transitional dialogue response.

[0076] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, in the form of a computer program product.

[0077] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0078] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

[0079] In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A goal-oriented proactive dialogue method for collaboration between large and small models, characterized in that it comprises: constructing a structured task flowchart and generating policy-oriented dialogue data through multi-agent simulation, wherein the agents include a facilitator agent for guiding the dialogue process and a participant agent for simulating user behavior; performing supervised fine-tuning and preference optimization on a basic large language model based on the dialogue data to obtain a dialogue strategy decision model; acquiring user input information, combining it with the historical dialogue to form a real-time dialogue context, and, under information-gap-driven phased goal constraints, determining the goal-oriented intent through candidate intent semantic matching and scoring and updating the current state node; selecting, based on the updated state node, a dialogue processing path according to the compatibility between the strategy model mechanism and the rule response mechanism, and outputting the dialogue behavior identifier or transitional dialogue response generated by the dialogue strategy decision model; and generating natural language output based on the dialogue behavior identifier or the transitional dialogue response.

2. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that constructing the structured task flowchart comprises: based on the dialogue objective, breaking down the dialogue process required to complete the dialogue objective into a set of subtasks arranged in logical order, wherein each subtask corresponds to a state node, and a node type and node completion conditions are set for each state node; analyzing the dependencies between state nodes and defining the transition paths and triggering conditions between nodes; identifying the target node and defining the dialogue success and failure conditions for the target node; and performing consistency checks and path integrity verification on the overall task structure to ensure that each state node has at least one path to the target node, thereby generating the structured task flowchart.
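The path integrity verification in claim 2 (each state node has at least one path to the target node) can be sketched as a reverse breadth-first search from the target node. The adjacency-dictionary representation and the function name `nodes_without_path_to_target` are assumptions for illustration; the claim does not prescribe a search method.

```python
from collections import deque
from typing import Dict, List, Set

def nodes_without_path_to_target(transitions: Dict[str, List[str]], target: str) -> Set[str]:
    """Return the state nodes from which the target node cannot be reached."""
    nodes = set(transitions) | {n for succ in transitions.values() for n in succ} | {target}
    # Reverse the edges so the search can walk backwards from the target.
    reverse: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, successors in transitions.items():
        for dst in successors:
            reverse[dst].append(src)
    reached = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse[node]:
            if pred not in reached:
                reached.add(pred)
                queue.append(pred)
    return nodes - reached
```

An empty result means the flowchart passes this check; any returned node is a dead end that would need a new transition or removal.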

3. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 2, characterized in that breaking down, based on the dialogue objective, the dialogue process required to complete the dialogue objective into a set of subtasks arranged in logical order comprises: performing semantic parsing on historical dialogue data to convert it into time-ordered sequences of dialogue behaviors; statistically analyzing the transition relationships between adjacent behaviors and extracting recurring behavior transition segments as an initial behavior pattern set; compressing and optimizing the behavior pattern set using the minimum description length principle, retaining the set of behavior pattern blocks with the greatest expressive power; and mapping the behavior pattern blocks to functional behavior blocks, generating a set of subtasks corresponding to different dialogue stages based on the semantic dependencies and temporal order between the functional behavior blocks, and outputting the task decomposition results in logical order.
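The extraction of recurring behavior transition segments can be illustrated with a simple n-gram count over behavior sequences. This sketch covers only that counting step; the minimum description length compression and the mapping to functional behavior blocks are not implemented, and the parameters `length` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

def recurring_segments(sequences: List[List[str]], length: int = 2,
                       min_count: int = 2) -> Dict[Tuple[str, ...], int]:
    """Count contiguous behavior segments and keep those seen at least min_count times."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[tuple(seq[i:i + length])] += 1
    return {segment: n for segment, n in counts.items() if n >= min_count}
```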

4. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that generating the policy-oriented dialogue data through multi-agent simulation comprises: analyzing the task flowchart to establish a dialogue state space and a set of state transition rules; defining a multi-agent system including a facilitator agent and participant agents, and configuring different user profile parameters for the participant agents to form a set of simulated user behaviors; starting from the initial state, having the facilitator agent select candidate dialogue behaviors according to the state transition rules and the participant agents generate response behaviors based on their user profiles, the agents interacting alternately and updating the dialogue state, while the state, behavior, and transition result of each dialogue round are recorded until the termination condition is met; and structuring the generated dialogue trajectory data to form a policy-oriented dialogue dataset.
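The alternating simulation loop of claim 4 can be sketched in Python as follows. The rule table `RULES`, the single profile parameter `profile_cooperation`, and the behavior labels are illustrative assumptions; in the method itself the agents would be driven by language models rather than a random draw.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical transition rules: state -> (facilitator behavior, next state on a cooperative reply).
RULES: Dict[str, Tuple[str, str]] = {
    "start": ("ask_need", "need_known"),
    "need_known": ("ask_contact", "done"),
}

def simulate(profile_cooperation: float, seed: int = 0, max_turns: int = 10) -> List[dict]:
    """Run one facilitator/participant dialogue and record each round's state and behaviors."""
    rng = random.Random(seed)
    state = "start"
    trajectory: List[dict] = []
    for _ in range(max_turns):  # termination: target reached or turn limit hit
        if state == "done":
            break
        facilitator_act, next_state = RULES[state]
        cooperative = rng.random() < profile_cooperation  # participant profile decides the reply
        participant_act = "provide_info" if cooperative else "deflect"
        new_state = next_state if cooperative else state
        trajectory.append({
            "state": state,
            "facilitator_act": facilitator_act,
            "participant_act": participant_act,
            "next_state": new_state,
        })
        state = new_state
    return trajectory
```

Running the loop over several profile settings yields trajectories of different lengths and outcomes, which can then be structured into a dataset.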

5. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that the supervised fine-tuning comprises: analyzing the dialogue data and constructing a training sample set; preprocessing the training samples to obtain a high-quality training dataset; and selecting a pre-trained basic large language model as the initial model and performing supervised learning training on it using the high-quality training dataset, wherein the model parameters are updated by minimizing the difference between the model's generated output and the target output, and the initial policy model is obtained after the training reaches the preset convergence condition.

6. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that obtaining the dialogue strategy decision model comprises: generating candidate dialogue behaviors for each dialogue context using the initial policy model; evaluating the quality of each candidate dialogue behavior using a predefined dialogue evaluation function; comparing candidate behaviors in the same context pairwise based on the evaluation scores to construct a set of preference data samples; and, based on the set of preference data samples, updating the parameters of the initial policy model using the direct preference optimization objective function, thereby increasing the probability of generating preferred behaviors and decreasing the probability of generating dispreferred behaviors, to obtain the dialogue strategy decision model.
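The pairwise construction of preference samples and the direct preference optimization objective for a single pair can be sketched as follows. The loss is the standard per-pair DPO form, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]); skipping tied scores, the value β = 0.1, and the function names are assumptions, and the log-probabilities are taken as given numbers rather than computed by a model.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

def build_preference_pairs(context: str,
                           scored_candidates: List[Tuple[str, float]]) -> List[Dict[str, str]]:
    """Compare candidates of one context pairwise; the higher-scored one is 'chosen'."""
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_candidates, 2):
        if score_a == score_b:
            continue  # assumption: ties carry no preference signal
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"context": context, "chosen": chosen, "rejected": rejected})
    return pairs

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss, written as log(1 + exp(-margin)) = -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen behavior than the reference model does, which is the direction the parameter update pushes.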

7. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that determining the goal-oriented intent through candidate intent semantic matching and scoring and updating the current state node under the information-gap-driven phased goal constraints comprises: generating a target information requirement set based on the preset dialogue goal and the state nodes in the task flowchart, and extracting an observed information set from the real-time dialogue context; determining the information gap set as the difference between the target information requirement set and the observed information set, and assessing the importance of the information gaps to obtain a weighted information gap set; based on the weighted information gap set, using reverse reasoning to determine the phased goal of preferentially filling high-weight information gaps in the current stage, and generating a set of candidate user intents related to filling the information gaps; semantically matching the user input against the candidate user intents, weighting the matching results by the weighted information gaps, and selecting the intent with the highest score as the goal-oriented intent; and, based on the goal-oriented intent and the observed information, matching the state nodes in the task flowchart, determining the state transition, and updating the current state node.
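The gap computation and weighted intent scoring of claim 7 can be sketched in Python. The slot names, the mapping from intents to the slots they fill, and the use of a product between match score and gap weight are assumptions; the claim states only that the matching results are weighted by the information gaps.

```python
from typing import Dict, Set

def weighted_gaps(required: Dict[str, float], observed: Set[str]) -> Dict[str, float]:
    """Information gap set: required slots not yet observed, with their importance weights."""
    return {slot: weight for slot, weight in required.items() if slot not in observed}

def pick_intent(match_scores: Dict[str, float], intent_to_slot: Dict[str, str],
                gaps: Dict[str, float]) -> str:
    """Weight each semantic match score by the gap it would fill and pick the highest."""
    weighted = {
        intent: score * gaps.get(intent_to_slot[intent], 0.0)
        for intent, score in match_scores.items()
    }
    return max(weighted, key=weighted.get)
```

In the test data below, the intent with the highest raw match score fills a slot that is already observed, so it receives zero weight and a lower-matching intent that fills the most important gap is selected.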

8. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that selecting, based on the updated state node, a dialogue processing path according to the compatibility between the strategy model mechanism and the rule response mechanism, and outputting the dialogue behavior identifier or transitional dialogue response generated by the dialogue strategy decision model, comprises: obtaining the updated current dialogue state node and historical state information, and constructing multi-dimensional state vectors for the current and historical states; calculating the state change component for each dimension, and calculating a comprehensive change intensity value through a state change coupling function; constructing a change driving potential function based on the comprehensive change intensity value, and calculating the state change driving potential value; constructing fitness functions for the rule response mechanism and the strategy model mechanism respectively based on the state change driving potential, and calculating the fitness values of the two mechanisms in the current state; and selecting the mechanism with the higher fitness as the dialogue processing path, wherein if the strategy model mechanism has the higher fitness, the dialogue strategy decision model is called to output the dialogue behavior identifier, and otherwise a transitional dialogue response is generated based on preset rules.
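The decision flow of claim 8 can be sketched in Python under placeholder functional forms. The claim does not specify the coupling function, the potential function, or the fitness functions, so the weighted Euclidean norm, the tanh potential, and the complementary fitness values below are assumptions chosen only to make the flow concrete.

```python
import math
from typing import Sequence

def change_intensity(current: Sequence[float], previous: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Assumed coupling function: weighted Euclidean norm of per-dimension state changes."""
    return math.sqrt(sum(w * (c - p) ** 2 for c, p, w in zip(current, previous, weights)))

def driving_potential(intensity: float) -> float:
    """Assumed potential function: squash the intensity into [0, 1) with tanh."""
    return math.tanh(intensity)

def choose_path(current: Sequence[float], previous: Sequence[float],
                weights: Sequence[float]) -> str:
    """Compare the two assumed fitness values and return the selected mechanism."""
    phi = driving_potential(change_intensity(current, previous, weights))
    policy_fitness = phi        # assumption: large state changes favor the strategy model
    rule_fitness = 1.0 - phi    # assumption: stable states favor the rule response
    return "policy_model" if policy_fitness > rule_fitness else "rule_response"
```

Under these placeholder forms, an unchanged state is routed to the rule response mechanism and a large multi-dimensional change is routed to the dialogue strategy decision model.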

9. The goal-oriented proactive dialogue method for collaboration between large and small models according to claim 1, characterized in that generating natural language output based on the dialogue behavior identifier or the transitional dialogue response comprises: obtaining the dialogue behavior identifier or the transitional dialogue response and mapping it to a basic semantic intent structure; constructing a set of pragmatic constraints by combining the user characteristics and the dialogue state in the current dialogue context; generating multiple candidate natural language expression sequences based on the semantic intent structure and the pragmatic constraint set, and performing template mapping and structural recombination to obtain an optimized expression set; scoring each candidate expression in the optimized expression set along multiple dimensions and selecting the candidate expression with the highest score as the target natural language output; and adjusting the target natural language output according to preset character style parameters to obtain the final output result.

10. A system using the goal-oriented proactive dialogue method for collaboration between large and small models according to any one of claims 1-9, characterized in that it comprises: a dialogue flow and data generation module, used to construct a structured task flowchart and generate policy-oriented dialogue data through multi-agent simulation, wherein the agents include a facilitator agent for guiding the dialogue flow and a participant agent for simulating user behavior; a strategy model training module, used to perform supervised fine-tuning and preference optimization on the basic large language model based on the dialogue data to obtain the dialogue strategy decision model; an information gap reverse reasoning module, used to obtain user input information, combine it with the historical dialogue to form a real-time dialogue context, and, under information-gap-driven phased goal constraints, determine the goal-oriented intent through candidate intent semantic matching and scoring and update the current state node; a mechanism adaptation decision module, used to select the dialogue processing path, according to the updated state node, based on the adaptation degree between the strategy model mechanism and the rule response mechanism, and to output the dialogue behavior identifier or transitional dialogue response generated by the dialogue strategy decision model; and an output module, used to generate natural language output based on the dialogue behavior identifier or the transitional dialogue response.