Collaborative reasoning method, system, and storage medium based on cross-family debate among heterogeneous large language models

By constructing a heterogeneous large language model collaboration group, leveraging model family differences and multi-round debates, role types are identified and adaptive answer aggregation is performed. This addresses the problem of similar error patterns in homogeneous models during complex reasoning tasks, thereby improving the accuracy and reliability of answers.

CN122242775A · Pending · Publication Date: 2026-06-19 · HANGZHOU SHENDU ZHIJIAN TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU SHENDU ZHIJIAN TECHNOLOGY CO LTD
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In complex reasoning tasks, collaborating large language models from the same or similar families tend to share error patterns, so they cannot effectively complement and correct one another. Capability gaps between the participating models further make the aggregated result unreliable.

Method used

A heterogeneous large language model collaboration group is constructed. Models are selected based on differences in model families to ensure heterogeneity in training data, architecture, and strategies. Multi-round debate and role recognition are used to achieve complementary error correction, and answers are adaptively aggregated based on ability scores.

Benefits of technology

It improves the accuracy and reliability of answers for complex reasoning tasks, reduces the impact of error patterns in homogeneous models, and enhances the complementary error correction capabilities of models in collaboration.


Abstract

This invention discloses a collaborative reasoning method, system, and storage medium based on cross-family debate among heterogeneous large language models, belonging to the field of knowledge model reasoning technology. The method includes: acquiring reasoning task data; selecting large language models from different model families to construct a heterogeneous model collaboration group; sending the reasoning task data to each model in a unified prompt format, with each model generating initial reasoning data in isolation; summarizing the initial reasoning data to generate a debate context and distributing it; having each model generate review data based on the reasoning and answers of the other models and revise its own reasoning or answer; identifying the role of each model based on changes in answers, reasoning text, and review text during multiple rounds of interaction; calculating ability scores based on historical performance data and / or current interaction data; and selecting consensus or weighted aggregation to output the final answer according to the ability balance result. Through the technical solution of this invention, complementary error correction can be achieved by exploiting the differences in error patterns of heterogeneous models, improving the reliability of reasoning answers.

Description

Technical Field

[0001] This invention relates to the field of knowledge model reasoning technology, and in particular to a collaborative reasoning method based on cross-family debate among heterogeneous large language models, a corresponding collaborative reasoning system, and a computer-readable storage medium.

Background Technology

[0002] With the development of large language model technology, large language models have been widely applied to complex AI decision-making scenarios such as mathematical reasoning, logical reasoning, common sense question answering, and professional knowledge analysis. For such tasks, the model typically needs to generate a reasoning process and provide an answer based on the input question. The completeness of the reasoning process, the accuracy of the answer, and the reliability of the output directly affect the quality of subsequent decisions.

[0003] In existing technologies, to improve the accuracy of complex reasoning tasks, schemes have emerged that involve multiple large language models or multiple agents collaborating to complete the reasoning. For example, multiple models can answer the same question separately, and then the final answer is generated through voting, evaluation, or multiple rounds of discussion. Such schemes can reduce the impact of random errors by a single model on the results to some extent, but their collaborative effectiveness usually depends on whether the participating models have sufficient differences.

[0004] In practical applications, if multiple participating models come from the same or similar model families, they are prone to developing similar inference biases and error patterns due to similar training data sources, model architectures, training strategies, and optimization processes. When faced with the same type of inference traps or complex problems, these models may simultaneously produce the same or highly similar incorrect answers. In this case, multi-model collaboration easily degenerates into majority voting among homogeneous models, making it difficult to effectively provide complementary error correction information, and ultimately, unreliable answers may still be output.

[0005] Meanwhile, in multi-model collaborative reasoning, different models play different roles in the interaction. Some models tend to absorb the reasoning content of other models and form a comprehensive answer, some models are better at pointing out specific errors in the reasoning of other models, and some models tend to question the reasoning premises or implicit assumptions. If the actual collaborative role of each model cannot be identified based on the changes in answers, reasoning text, and review text during multiple rounds of interaction, it will be difficult to fully utilize the behavioral differences of different models in the debate process.

[0006] Furthermore, different families of large language models may exhibit varying comprehensive reasoning abilities. If indiscriminate consensus or simple majority voting is still employed when there are significant differences in capabilities, the weaker model may unduly influence the final result, reducing the reliability of the final reasoning answer.

Summary of the Invention

[0007] To address the aforementioned issues, this invention provides a collaborative reasoning method, system, and storage medium based on cross-family debate using heterogeneous large language models. By selecting large language models from at least two different model families based on model family differences, a heterogeneous model collaboration group is constructed. This ensures that participating models differ in training data sources, model architecture, and training strategies, reducing the risk of shared misjudgments caused by highly correlated error patterns in homogeneous models. Initial reasoning data is generated by each language model in isolation, followed by multiple rounds of review and revision based on a structured debate context, fully leveraging the reasoning perspectives of different models to achieve complementary error correction. Furthermore, this invention identifies the roles of each model based on changes in answers, reasoning text, and review text during multiple rounds of interaction, and uses ability scores to determine the balance of abilities within the collaboration group. When abilities are balanced, answer consensus is adopted; when abilities are unbalanced, weighted aggregation is used. This reduces the undue influence of weaker models on the final result, improving the accuracy and reliability of the final reasoning answer.

[0008] To achieve the above objectives, this invention provides a collaborative reasoning method based on cross-family debate among heterogeneous large language models, comprising:
acquiring the reasoning task data to be processed, and selecting large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group;
sending the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format, so that each large language model generates initial reasoning data in a mutually isolated state, the initial reasoning data including an initial reasoning process and an initial answer;
aggregating the initial reasoning data generated by each large language model to generate a structured debate context, and distributing the structured debate context to each large language model, so that each large language model generates review data based on the reasoning processes and answers of the other large language models and, based on the review data, updates at least one of its own reasoning process and answer to obtain at least one round of revised reasoning data;
during the multi-round interaction, identifying the role type of each large language model in collaborative reasoning based on its answer change information, reasoning text change information, and review text information in adjacent rounds;
calculating the ability score of each large language model based on its historical reasoning performance data and / or the current multi-round interaction data, and determining, based on the ability scores, whether the heterogeneous model collaboration group meets the ability balance condition; and
when the ability balance condition is met, generating the final reasoning answer based on the consensus result of the answers after multiple rounds of interaction; when the ability balance condition is not met, weighting and aggregating the answers after multiple rounds of interaction based on the ability score of each large language model to generate the final reasoning answer.

[0009] In the above technical solution, preferably, the step of selecting large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group includes: obtaining, for each candidate large language model, its model family information, pre-training data source information, model architecture information, and training strategy information; selecting, as participating models, candidate large language models that belong to different model families and differ in at least one of the pre-training data source information, the model architecture information, and the training strategy information; and configuring the participating models as the heterogeneous model collaboration group, which includes three to five large language models.
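The selection rule in [0009] can be illustrated with a minimal sketch. The `Candidate` record, the greedy pairwise strategy, and the string-valued attributes are assumptions made for illustration; the source only states that participating models belong to different families, differ in at least one further attribute, and number three to five.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record holding the four kinds of candidate information."""
    name: str
    family: str        # model family information
    data_source: str   # pre-training data source information
    architecture: str  # model architecture information
    strategy: str      # training strategy information


def sufficiently_different(a: Candidate, b: Candidate) -> bool:
    """True if a and b are from different families AND differ in one more attribute."""
    if a.family == b.family:
        return False
    return (a.data_source != b.data_source
            or a.architecture != b.architecture
            or a.strategy != b.strategy)


def select_group(candidates, min_size=3, max_size=5):
    """Greedily keep candidates that differ sufficiently from every chosen member."""
    group = []
    for cand in candidates:
        if len(group) == max_size:
            break
        if all(sufficiently_different(cand, member) for member in group):
            group.append(cand)
    if len(group) < min_size:
        raise ValueError("not enough heterogeneous candidates for a group")
    return group
```

A greedy pass is only one possible strategy; it depends on candidate order and does not search for the most diverse subset.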

[0010] In the above technical solution, preferably, the step of sending the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format includes: generating unified prompt data based on the reasoning task data; sending the unified prompt data to each large language model, and ensuring that no large language model receives the reasoning results of the other large language models before generating its initial reasoning data; and receiving the initial reasoning process, initial answer, and initial confidence score returned by each large language model.
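One way to picture the isolated initial stage in [0010] is to dispatch the same prompt to every model concurrently and collect the results only after all calls return. The `ModelClient` callable type and the result keys are hypothetical stand-ins; no real vendor SDK is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical client type: takes the unified prompt and returns a dict with
# "reasoning", "answer" and "confidence". Real model APIs are not shown here.
ModelClient = Callable[[str], dict]


def collect_initial_reasoning(prompt: str,
                              clients: Dict[str, ModelClient]) -> Dict[str, dict]:
    """Send one prompt to every model; no call receives another model's output."""
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = {name: pool.submit(client, prompt)
                   for name, client in clients.items()}
        # Results are gathered only after every call has been submitted.
        return {name: future.result() for name, future in futures.items()}
```

Isolation here follows from the call signature: each client receives only the prompt, so nothing from a peer can leak into the first round.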

[0011] In the above technical solution, preferably, the step of generating a structured debate context and distributing the structured debate context to each large language model includes: writing the initial reasoning data of each large language model into the structured debate context; in each round of interaction, distributing the reasoning data and answer data produced by each large language model in the previous round to the large language models; and receiving from each large language model at least one of error identification, premise questioning, reasoning supplementation, and answer revision, which serves as the revised reasoning data for the current round.
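The structured debate context in [0011] could be serialized as in the sketch below. The JSON layout and the field names (`round`, `peers`, `model`) are assumptions; the source does not specify a concrete format.

```python
import json


def build_debate_context(round_no: int, outputs: dict, recipient: str) -> str:
    """Serialize the previous round's outputs of all other models for one recipient.

    outputs maps a model name to a dict with "reasoning" and "answer" keys.
    """
    peers = [
        {"model": name, "reasoning": out["reasoning"], "answer": out["answer"]}
        for name, out in sorted(outputs.items())
        if name != recipient  # a model reviews its peers, not itself
    ]
    return json.dumps({"round": round_no, "peers": peers},
                      ensure_ascii=False, indent=2)
```

Sorting by model name keeps the context identical across runs, which makes the rounds easier to compare.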

[0012] In the above technical solution, preferably, the step of identifying the role type of each large language model in collaborative reasoning based on the answer change information, the reasoning text change information, and the review text information includes: when the frequency with which a target large language model absorbs the reasoning content of other large language models and modifies its own reasoning process or answer in adjacent rounds satisfies a first preset frequency condition, marking the target large language model as a synthesizer; when the frequency with which the target large language model points out specific reasoning errors of other large language models in the review text information satisfies a second preset frequency condition, marking the target large language model as an error corrector; and when the frequency with which the target large language model questions the reasoning premises or implicit assumptions of other large language models in the review text information satisfies a third preset frequency condition, marking the target large language model as a critic.
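The three frequency conditions in [0012] can be sketched as threshold checks over per-round behaviour labels. The label strings, the 0.5 thresholds, and the choice to let one model hold several roles are assumptions; the source gives no concrete values and does not describe how behaviours are extracted from the text.

```python
from collections import Counter

# Hypothetical stand-ins for the first, second and third preset frequency
# conditions; the source does not state actual threshold values.
THRESHOLDS = {"synthesizer": 0.5, "error_corrector": 0.5, "critic": 0.5}


def identify_roles(events: list[str], rounds: int) -> list[str]:
    """Map one model's behaviour labels to role types.

    events holds labels already extracted for that model, drawn from
    "absorbed_and_revised", "pointed_out_error" and "questioned_premise".
    """
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    counts = Counter(events)
    frequency = {
        "synthesizer": counts["absorbed_and_revised"] / rounds,
        "error_corrector": counts["pointed_out_error"] / rounds,
        "critic": counts["questioned_premise"] / rounds,
    }
    return [role for role, f in frequency.items() if f >= THRESHOLDS[role]]
```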

[0013] In the above technical solution, preferably, calculating the ability score of each large language model based on the historical reasoning performance data and / or the current multi-round interaction data includes: generating a historical capability index based on the accuracy of each large language model on historical calibration reasoning tasks; generating an interaction error correction index based on the frequency with which each large language model is corrected by other large language models during the current multi-round interaction; generating a confidence stability index based on the consistency of each large language model's confidence across the current multi-round interaction; and calculating the ability score based on at least one of the historical capability index, the interaction error correction index, and the confidence stability index.
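A possible way to combine the three indices from [0013] is a weighted sum, sketched below. The weights, the rule that fewer corrections yield a higher index, and the use of the standard deviation of confidence as a stability measure are all assumptions; the source only names the indices.

```python
from statistics import pstdev


def capability_score(hist_accuracy: float, times_corrected: int, rounds: int,
                     confidences: list[float],
                     weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Combine three indices, each scaled to [0, 1], into one score."""
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    # Historical capability index: accuracy on calibration tasks.
    historical = hist_accuracy
    # Interaction error correction index: being corrected less often scores higher.
    correction = 1.0 - min(times_corrected / rounds, 1.0)
    # Confidence stability index: 0.5 is the largest population standard
    # deviation possible for values confined to [0, 1].
    stability = 1.0 - min(pstdev(confidences) / 0.5, 1.0)
    w_hist, w_corr, w_stab = weights
    return w_hist * historical + w_corr * correction + w_stab * stability
```

For example, a model with 0.8 calibration accuracy, corrected once in four rounds, with constant confidence, scores 0.5 × 0.8 + 0.3 × 0.75 + 0.2 × 1.0 = 0.825 under these assumed weights.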

[0014] In the above technical solution, preferably, the step of determining whether the heterogeneous model collaboration group meets the capability balance condition based on the capability scores includes: determining the score difference between the highest and lowest capability scores in the heterogeneous model collaboration group; when the score difference is less than or equal to a preset capability difference threshold, determining that the heterogeneous model collaboration group meets the capability balance condition; and when the score difference is greater than the preset capability difference threshold, determining that the heterogeneous model collaboration group does not meet the capability balance condition, and determining a weighted aggregation weight for each large language model based on its capability score, wherein the weighted aggregation weight is positively correlated with the capability score.
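The balance test and the two aggregation modes in [0014] can be sketched together. The 0.15 threshold, the use of a plain vote as the consensus rule, and normalized scores as weights are assumptions; the source requires only that the weight be positively correlated with the capability score.

```python
from collections import defaultdict


def aggregate(answers: dict, scores: dict, gap_threshold: float = 0.15):
    """Return (mode, final_answer); answers and scores are keyed by model name."""
    gap = max(scores.values()) - min(scores.values())
    balanced = gap <= gap_threshold
    total = sum(scores.values())
    tally = defaultdict(float)
    for model, answer in answers.items():
        # Balanced group: every model counts equally.
        # Unbalanced group: each vote is weighted by the normalized score.
        tally[answer] += 1.0 if balanced else scores[model] / total
    final = max(tally, key=tally.get)
    return ("consensus" if balanced else "weighted"), final
```

With scores of 0.9, 0.3, and 0.35, the gap of 0.6 exceeds the threshold, so a single strong model answering "42" outweighs two weak models answering "41".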

[0015] In the above technical solution, preferably, the collaborative reasoning method based on cross-family debate among heterogeneous large language models further includes: after each round of interaction, determining whether the current answers of the large language models are consistent and whether the current confidence of each large language model has reached a preset confidence threshold; when all current answers are consistent and every current confidence reaches the preset confidence threshold, terminating the multi-round interaction and outputting the current answer; and when the current answers are inconsistent, or the current confidence of at least one large language model has not reached the preset confidence threshold, continuing with the next round of interaction until a preset maximum number of rounds is reached.
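The early-termination rule in [0015] amounts to a bounded loop with a stop check after each round. In this sketch the `step` callable, which would perform one review and revision round, is hypothetical, as are the default threshold and round limit.

```python
def should_stop(state: dict, conf_threshold: float) -> bool:
    """state maps a model name to {"answer": ..., "confidence": ...}."""
    unanimous = len({s["answer"] for s in state.values()}) == 1
    confident = all(s["confidence"] >= conf_threshold for s in state.values())
    return unanimous and confident


def run_debate(state: dict, step, conf_threshold: float = 0.8,
               max_rounds: int = 3):
    """Run rounds until all models agree confidently or max_rounds is reached.

    step(state) -> new state is a hypothetical callable for one debate round.
    Returns the final state and the number of rounds actually used.
    """
    for round_no in range(1, max_rounds + 1):
        state = step(state)
        if should_stop(state, conf_threshold):
            return state, round_no
    return state, max_rounds
```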

[0016] This invention also proposes a collaborative reasoning system based on cross-family debate among heterogeneous large language models, comprising:
a heterogeneous model scheduling module, used to acquire the reasoning task data to be processed and to select large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group;
an independent reasoning control module, used to send the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format, so that each large language model generates initial reasoning data in a mutually isolated state, the initial reasoning data including an initial reasoning process and an initial answer;
a multi-round debate coordination module, used to aggregate the initial reasoning data generated by each large language model, generate a structured debate context, and distribute the structured debate context to each large language model, so that each large language model generates review data based on the reasoning processes and answers of the other large language models and updates at least one of its own reasoning process and answer according to the review data, to obtain at least one round of revised reasoning data;
a role recognition and tracking module, used to identify the role type of each large language model in collaborative reasoning based on its answer change information, reasoning text change information, and review text information in adjacent rounds during the multi-round interaction;
a capability balance detection module, used to calculate the ability score of each large language model based on its historical reasoning performance data and / or the current multi-round interaction data, and to determine, based on the ability scores, whether the heterogeneous model collaboration group meets the ability balance condition; and
an answer aggregation module, used to generate the final reasoning answer based on the consensus result of the answers after multiple rounds of interaction when the ability balance condition is met, and to generate the final reasoning answer by weighting and aggregating the answers after multiple rounds of interaction based on the ability score of each large language model when the ability balance condition is not met.

[0017] The present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the collaborative reasoning method based on cross-family debate among heterogeneous large language models as disclosed in any of the above technical solutions.

[0018] Compared with the prior art, the beneficial effects of the present invention are as follows: (1) By selecting large language models from at least two different model families to construct a heterogeneous model collaboration group based on the preset model family difference conditions, the participating models can form differences in training data sources, model architecture and training strategies, which can reduce the risk of homogeneous models producing the same wrong answer due to similar error patterns, and provide a complementary reasoning basis for subsequent collaborative reasoning.

[0019] (2) By sending the reasoning task data to each language model in a unified prompt format during the initial reasoning stage, and by having each language model generate the initial reasoning data in a state of mutual isolation, it is possible to avoid the influence of the reasoning results of other models on each model during the initial stage, and to preserve the independent reasoning perspectives and judgment results of different model families.

[0020] (3) By summarizing the initial reasoning data, a structured debate context is generated and distributed to various language models. Each model generates review data based on the reasoning process and answers of other models, and revises its own reasoning process or answers accordingly. This enables cross-model family reasoning exchange, error pointing, premise questioning and reasoning supplementation in multi-round interactions, improving the complementary error correction ability in complex reasoning tasks.

[0021] (4) Based on the answer change information, reasoning text change information, and review text information of each large language model in adjacent rounds during the multi-round interaction, the role type of each large language model in collaborative reasoning can be identified. Different collaborative behaviors such as synthesizer, error corrector, and critic can be distinguished, so that the actual role of each model in the debate process can be better understood and utilized.

[0022] (5) The ability score is calculated based on the historical reasoning performance data of each large language model and / or the current multi-round interaction data, and it is determined whether the heterogeneous model collaboration group meets the ability balance condition. When abilities are balanced, the final reasoning answer is generated based on the consensus result of the answers after multi-round interaction. When abilities are unbalanced, weighted aggregation is performed based on the ability scores, which prevents the weaker model from having an undue influence on the final result and improves the accuracy, stability, and reliability of the final reasoning answer.

Attached Figure Description

[0023] Figure 1 is a schematic diagram of the architecture of a collaborative reasoning method based on cross-family debate among heterogeneous large language models, as disclosed in an embodiment of the present invention.
Figure 2 is a flowchart of a collaborative reasoning method based on cross-family debate among heterogeneous large language models, as disclosed in an embodiment of the present invention.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0025] The present invention will now be described in further detail with reference to the accompanying drawings. As shown in Figure 1 and Figure 2, this invention provides a collaborative reasoning method based on cross-family debate among heterogeneous large language models, used to handle reasoning tasks such as mathematical reasoning, logical reasoning, common sense reasoning, or professional knowledge analysis. Existing multi-agent debate frameworks commonly use multiple instances from the same model family as participants. Because models within a family share training data and architecture, their error patterns are highly correlated, and the debate essentially degenerates into a majority vote within the same family. To address these problems, this invention constructs heterogeneous collaboration groups from different model families, leverages the inherent differences in error patterns across families to achieve complementary error correction, and introduces role recognition and an ability-balance adaptive aggregation mechanism, achieving accuracy superior to a single model on diverse reasoning tasks.

[0026] In this method, the implementing entity first acquires the reasoning task data to be processed and, based on pre-defined model family difference conditions, selects large language models from at least two different model families to construct heterogeneous model collaboration groups. Due to differences in pre-training data sources and distribution, model architecture, training objectives, and training strategies, large language models from different model families typically possess different reasoning strengths, blind spots, and error patterns. This results in systematic differences in advantages and disadvantages across different types of reasoning problems, and the error patterns of each family exhibit low correlation. This means that when a model from one family makes an error on a certain type of problem, another family has a higher probability of providing the correct answer. This natural error complementarity is the structural basis for achieving cross-family debate error correction gains in this invention. Therefore, constructing heterogeneous collaboration groups, rather than homogeneous collaboration groups, can provide complementary judgment perspectives for the same reasoning task.

[0027] In the initial reasoning phase, after the heterogeneous model collaboration group is constructed, the reasoning task data is sent to each language model in the collaboration group in a unified prompt format. This allows each language model to generate initial reasoning data in an isolated state, including the initial reasoning process and initial answers. Since each model does not receive reasoning results from other models at this stage, early conformity or mutual influence can be avoided, thus preserving the independent reasoning paths of different model families. This isolated independent reasoning phase is a key design feature of this method to ensure that the initial output of each model truly reflects its family's unique cognitive perspective. If models could see each other's outputs in the initial stage, the reasoning results of the dominant model would prematurely anchor the reasoning direction of other models, destroying error diversity and rendering subsequent debates worthless for error correction.

[0028] Subsequently, the initial inference data generated by various language models are aggregated to create a structured debate context. This structured debate context is then distributed to each language model, enabling them to generate review data based on the reasoning processes and answers of other large language models. Each model then updates at least one aspect of its own reasoning process and answer based on this review data, resulting in at least one round of revised inference data. Through multiple rounds of interaction, different models can identify errors in each other's reasoning, supplement missing information, or correct their own answers, transforming differences in error patterns between cross-family models into complementary error-correction capabilities. The core function of the structured debate context is to organize the initial inference data of each model into input that can be effectively interpreted by other models in a standardized format. This eliminates the inefficiency in information transmission caused by significant differences in reasoning styles and expressions among different model families, enabling truly effective cross-family reasoning exchange.

[0029] During multi-round interactions, the role type of each large language model in collaborative reasoning is identified based on its answer change information, reasoning text change information, and review text information in adjacent rounds. For example, some models may be more inclined to synthesize the viewpoints of other models, some may be better at pointing out specific errors, and some may be better at questioning the premises of reasoning. The identified role types reflect the actual behavioral patterns of the models in cross-family collaborative debate, providing a basis for understanding how each model contributes to the reasoning result, and offering a structured analytical perspective for system monitoring and optimization without relying on pre-labeling or manual intervention.

[0030] Furthermore, based on the historical reasoning performance data and / or current multi-turn interaction data of various language models, the ability scores of various language models are calculated, and the ability scores are used to determine whether the heterogeneous model collaboration group meets the ability balance condition. When the capability balance condition is met, the final inference answer is generated based on the consensus results of multiple rounds of interaction. When the capability balance condition is not met, the answers after multiple rounds of interaction are weighted and aggregated based on the capability scores of various language models to generate the final inference answer, ensuring that the erroneous inferences of the weak model will not affect the final result through numerical advantage. This adaptive strategy selection mechanism enables this method to maintain robust inference quality in real-world deployment environments with diverse capability structures within collaborative groups, achieving accuracy superior to any single member model on multiple inference benchmarks.

[0031] In this implementation, weaker models are prevented from swaying the final result through simple majority voting or persistently incorrect answers, thereby improving the accuracy and reliability of collaborative reasoning answers.

[0032] In the above embodiments, preferably, based on preset model family difference conditions, large language models are selected from at least two different model families to construct heterogeneous model collaboration groups, including: The process involves obtaining information about the candidate large language model's family, pre-training data source, model architecture, and training strategy. The model family information identifies which research institution and product line developed the model, distinguishing candidate models from different model systems, such as Anthropic's Claude series, Google's Gemini series, and DeepSeek's inference series. The pre-training data source information describes the composition of the corpus used for training, including the data source institution, data language distribution, and domain coverage. The model architecture information describes the structural features of the model, such as the Transformer variant type, attention mechanism design, and parameter size. The training strategy information describes the supervised fine-tuning strategy, reinforcement learning feedback method, and alignment target setting used by the model. These four types of information collectively determine the error pattern characteristics of a model and serve as the basis for judging whether two models possess sufficient differences to produce error complementarity.

[0033] The participating models are selected from candidate large language models that belong to different model families and differ in at least one of the following: pre-training data source information, model architecture information, and training strategy information. Among them, different model families are a necessary condition to ensure that the error patterns caused by sharing training processes among models of the same family are fundamentally avoided. At least one additional difference further ensures that the models selected across families have sufficient heterogeneity on a cognitive basis, rather than just being nominally from different families but actually using highly similar training data and strategies.

[0034] The participating models will be configured as heterogeneous model collaboration groups, which include three to five large language models. Three models represent the minimum configuration for achieving cross-family, multi-perspective understanding, and logical chain reasoning dominance. Five models represent the recommended upper limit to balance reasoning diversity and computational overhead. Beyond five models, the marginal error correction gain diminishes while the API call cost increases linearly, resulting in a decrease in overall efficiency.

[0035] This scale strikes a balance between the diversity of inference perspectives and computational overhead: too few models make it difficult to achieve sufficient complementarity, while too many models increase inference latency and invocation costs.
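The selection rule of paragraphs [0032] to [0035] can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Candidate` fields hold placeholder descriptor strings, string inequality stands in for the "preset model family difference conditions", and the greedy selection order is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    # Hypothetical descriptor record; the four fields mirror the four
    # types of information named in paragraph [0032].
    name: str
    family: str
    data_source: str
    architecture: str
    training: str


def differs_enough(a: Candidate, b: Candidate) -> bool:
    """Different family AND at least one further difference ([0033])."""
    if a.family == b.family:
        return False
    return (a.data_source != b.data_source
            or a.architecture != b.architecture
            or a.training != b.training)


def build_group(candidates: List[Candidate],
                min_size: int = 3, max_size: int = 5) -> List[Candidate]:
    """Greedy selection in list order (an assumed policy): keep a candidate
    only if it differs enough from every member already chosen."""
    group: List[Candidate] = []
    for cand in candidates:
        if len(group) == max_size:
            break
        if all(differs_enough(cand, member) for member in group):
            group.append(cand)
    if len(group) < min_size:
        raise ValueError("not enough heterogeneous candidates")
    return group
```

A second candidate from an already selected family is skipped even if its other descriptors differ, because a different family is the necessary condition.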

[0036] In this implementation, the heterogeneous model collaboration group constructed in the above manner, by systematically screening candidate models on four dimensions of information, reduces the problem of highly correlated error patterns among models of the same family. This allows the strengths and weaknesses of different model families to complement each other over multiple rounds of debate, improving the collaboration group's coverage and error correction capabilities for complex reasoning tasks. For example, in a typical collaboration group containing three members (Claude-3.5, Gemini-1.5-Pro, and DeepSeek-R1), the low correlation of error patterns among the three members in mathematical reasoning, logical reasoning, and common-sense reasoning tasks significantly enhances the error correction gain of the debate, resulting in an overall accuracy that surpasses the accuracy of any single model among the three members.

[0037] In the above embodiments, preferably, sending the inference task data to each large language model in the heterogeneous model collaboration group in a unified prompt format includes: generating unified prompt data based on the inference task data. The unified prompt data uses a standardized template to encapsulate the question stem, background information, and answer requirements of the inference problem in the same format. This ensures that the question content and task requirements received by all large language models are consistent, so that each large language model from a different family in the collaboration group receives semantically equivalent input. It eliminates systematic biases that may be introduced by differences in prompt format, making the initial inference results of different models comparable. If models received prompts in different formats, the differences in their initial inference results would be mixed with noise caused by the prompt differences, obscuring the cross-family error complementarity effect.
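As an illustration of the unified prompt data described above, the following sketch fills one standardized template and hands the identical string to every model. The template wording and the function names are assumptions introduced here; the text only requires that the stem, background, and requirements be encapsulated in the same format.

```python
from string import Template
from typing import Dict, List

# Hypothetical template text; only the three-part structure comes from [0037].
UNIFIED_TEMPLATE = Template(
    "Question:\n$stem\n\n"
    "Background:\n$background\n\n"
    "Requirements:\n$requirements\n\n"
    "Reply with your reasoning steps, a final answer, "
    "and a confidence between 0 and 1."
)


def build_unified_prompt(stem: str, background: str, requirements: str) -> str:
    """Encapsulate the three parts of the task in one fixed format."""
    return UNIFIED_TEMPLATE.substitute(
        stem=stem.strip(),
        background=background.strip(),
        requirements=requirements.strip(),
    )


def prompts_for_group(model_ids: List[str], task: Dict[str, str]) -> Dict[str, str]:
    """Build the prompt once and give every model the same string."""
    prompt = build_unified_prompt(**task)
    return {model_id: prompt for model_id in model_ids}
```

Building the prompt once and reusing it, rather than formatting it per model, is what makes the inputs identical by construction.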

[0038] The unified prompt data is sent to each large language model, and each large language model is prevented from receiving the inference results of other large language models before it generates its initial inference data. In other words, during the entire initial inference phase there is no information exchange between the large language models: the system blocks every information exchange channel between models, and each model independently generates its initial inference process, initial answer, and initial confidence level based on its own model family's knowledge distribution and inference capabilities. This mutual isolation mechanism preserves the distinct analytical paths and problem-solving methods of each model family when facing the same inference problem, ensuring that the initial inference results reflect the natural differences in cognitive style between the models and providing a starting point with genuine cognitive diversity for subsequent debates.

[0039] The system receives initial inference processes, initial answers, and initial confidence levels from various language models. The initial inference process records the intermediate steps and reasoning chains each model goes through to derive an answer from the question; the initial answer is the answer each model arrives at based on its own reasoning; and the initial confidence level quantifies each model's confidence in its current answer, used for subsequent early termination condition judgments and competency balance assessments. These three types of information together constitute a complete representation of the initial inference data, serving as the raw input material for generating structured debate contexts.
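The isolated initial inference phase can be sketched as below, under the assumption that each model is reachable through a callable that takes only the prompt. `InitialInference`, `ModelFn`, and `run_isolated` are hypothetical names; real model clients would replace the callables.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class InitialInference:
    # The three items of initial inference data listed in [0039].
    model_id: str
    reasoning: str
    answer: str
    confidence: float


# Hypothetical client signature: a model sees the prompt and nothing else.
ModelFn = Callable[[str], InitialInference]


def run_isolated(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, InitialInference]:
    """Call every model with the same prompt. No call receives another
    model's output, so the first-round results stay independent."""
    if not models:
        return {}
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {mid: pool.submit(fn, prompt) for mid, fn in models.items()}
        return {mid: fut.result() for mid, fut in futures.items()}
```

Isolation here follows from the function signature: results are only gathered after every call has been issued, and no result is passed back into any call.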

[0040] In this implementation, the unified prompt eliminates input difference noise, the mutual isolation mechanism protects the cognitive independence of each model family's initial inference, and the collected initial confidence levels provide additional meta-information for subsequent processes. Together, these three aspects give the initial inference data set entering the debate stage the greatest possible cognitive diversity, releasing the error-correction potential of subsequent cross-family debates and improving both the independence of the initial inference data and the effectiveness of subsequent debate comparisons.

[0041] In the above embodiments, preferably, a structured debate context is generated and distributed to various language models, including: The initial reasoning data of various language models are written into a structured debate context, enabling each model to read the reasoning process and answers of other models. The structured debate context is a standardized information carrier designed specifically for cross-family reasoning exchange. It organizes the initial reasoning processes and initial answers of each model side-by-side into a structured document using a unified template format, and clearly labels each model's content with a model identifier. This structured organization solves the problem that direct interaction between different model families may lead to inefficient information transmission due to significant differences in reasoning styles and expressions. Through standardized formatting, each participating model can efficiently interpret the reasoning content from different family models without dealing with the comprehension barriers caused by heterogeneous reasoning styles.

[0042] In each round of interaction, the reasoning and answer data from the previous round are distributed to each language model, allowing them to review and revise based on the previous results. This round-by-round distribution, rather than distributing the entire history all at once, ensures that each model focuses on the latest round's reasoning changes during each review, avoiding distractions caused by too much historical content. It also maintains a clear temporal structure in the debate process, facilitating subsequent role identification and change tracking analysis.
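One possible rendering of the structured debate context is sketched below. The layout (a header line, then one labeled block per model) and the function names are assumptions; the text specifies only a unified template, side-by-side organization, model identifiers, and distribution of the previous round's data rather than the entire history.

```python
from typing import Dict, List

RoundData = Dict[str, Dict[str, str]]  # model_id -> {"reasoning": ..., "answer": ...}


def build_debate_context(round_no: int, entries: RoundData) -> str:
    """Render one round of results with a fixed layout and a model
    identifier on every block."""
    lines = [f"=== Debate context for round {round_no} ==="]
    for model_id in sorted(entries):  # stable order for every recipient
        entry = entries[model_id]
        lines.append(f"[{model_id}]")
        lines.append(f"Reasoning: {entry['reasoning']}")
        lines.append(f"Answer: {entry['answer']}")
    return "\n".join(lines)


def context_for_next_round(history: List[RoundData]) -> str:
    """Distribute only the latest completed round, not the whole history."""
    return build_debate_context(len(history) + 1, history[-1])
```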

[0043] During the review process, each language model can generate at least one of the following: error pointing, premise questioning, reasoning supplementation, and answer revision, based on the reasoning content of other large language models, and use it as the reasoning data for this round of revision. For example, a model can point out specific errors in the calculation steps of other models; it can also question the implicit assumptions used by other models; and it can also supplement missing reasoning conditions or revise its own answer.

[0044] Specifically, error pointing out involves explicitly identifying specific calculation errors, logical fallacies, or factual errors in the reasoning process of other models; premise questioning involves challenging the implicit assumptions or initial premises upon which other models' reasoning is based, thus challenging the rationality of their reasoning foundation; reasoning supplementation involves providing supplementary arguments or omitted considerations based on the existing reasoning of other models; and answer revision involves updating one's own reasoning process or answer in the previous round based on the review comments. These four types of review data cover all the main modes of reasoning review, where error pointing out and premise questioning represent critical error correction, reasoning supplementation represents constructive collaboration, and answer revision represents self-updating based on review. Together, these four types of content constitute a complete channel for knowledge flow in multi-round debates and are systematically recorded as the reasoning data for this round of revision.

[0045] In this implementation, the structured debate context enables efficient cross-family reasoning exchange, allowing the reasoning processes of different models to be organized in a comparable manner. This facilitates cross-family review and revision of each model, improves the information organization efficiency of multi-round debates, and fully utilizes the complementary error-correction capabilities of different models. Ultimately, this results in high levels of information transmission efficiency and review depth in multi-round cross-family debates, with each round of debate effectively driving the quality of answers towards greater accuracy.

[0046] In the above embodiments, preferably, the role types of each language model in collaborative reasoning are identified based on answer change information, reasoning text change information, and review text information, including: For the target large language model, the frequency with which it absorbs reasoning content from other models and modifies its own reasoning process or answer in adjacent rounds is statistically analyzed. When the frequency with which the target large language model absorbs reasoning content from other large language models and modifies its own reasoning process or answer in adjacent rounds reaches a first preset frequency condition, it indicates that the target large language model tends to integrate multiple viewpoints, and the target large language model is marked as a synthesizer. The core behavioral characteristics of the synthesizer role are: a tendency to synthesize multiple viewpoints, absorb the reasoning advantages from different family models, and generate integrative conclusions based on this; its answer change information shows frequent updates across rounds, and its reasoning text change information shows that while retaining its own reasoning framework, it incorporates a large number of arguments from other models. The existence of the synthesizer role is a key driving force for cross-family debate to converge to high-quality consensus, integrating the scattered correct reasoning fragments of various family models into a consistent optimal reasoning path.

[0047] The frequency with which the target large language model points out specific reasoning errors of other large language models in the review text information is analyzed. When this frequency reaches a second preset frequency condition, it indicates that the target large language model mainly discovers and corrects errors in the collaborative process, and the target large language model is marked as an error corrector. The core behavioral characteristics of the error corrector role are: a tendency to strictly examine the reasoning details of other models using its specialized ability in specific reasoning types, clearly pointing out errors in the review text such as "the third step calculation is wrong" or "conditions are omitted"; its own reasoning process is usually relatively stable over multiple rounds; and the proportion of its review data that points out errors is significantly higher than that of other types. The error corrector role contributes directly to improving reasoning accuracy in cross-family debates, and its high-frequency error-correction behavior enables other models to identify and correct specific reasoning defects in a timely manner.

[0048] The frequency with which the target large language model challenges the reasoning premises or implicit assumptions of other large language models in the review text information is analyzed. When this frequency reaches a third preset frequency condition, it indicates that the target large language model is more inclined to challenge the foundation of the reasoning, and the target large language model is marked as a critic. The core behavioral characteristics of the critic role are: a tendency to examine the starting point of other models' reasoning from a meta-level perspective, raising questions such as "Does this premise hold in this problem?" and "Is implicit assumption X supported by the conditions of the question?"; premise-challenging content dominates its review text, and such questioning can sometimes reveal deep reasoning blind spots that other models have overlooked.

[0049] The above three types of role recognition are automatically inferred through frequency statistics of model behavior analysis, without relying on pre-labeling or human intervention, and the recognition accuracy continues to improve as the debate rounds increase.
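The frequency-based role identification of paragraphs [0046] to [0049] can be sketched as follows. The counters in `BehaviorStats` are assumed to be extracted upstream from the debate records, and the three thresholds of 0.5 are placeholders for the first, second, and third preset frequency conditions, whose values the text does not give. Returning a set is also an assumption, since the text does not say how a model that meets several conditions is handled.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class BehaviorStats:
    # Hypothetical per-model counters gathered over the debate so far.
    transitions: int         # adjacent-round pairs observed
    absorbed_revisions: int  # transitions where others' reasoning was absorbed
    reviews: int             # review texts written
    error_pointings: int     # reviews that point out a specific error
    premise_challenges: int  # reviews that challenge a premise or assumption


def identify_roles(stats: BehaviorStats,
                   synthesizer_min: float = 0.5,
                   corrector_min: float = 0.5,
                   critic_min: float = 0.5) -> Set[str]:
    """Return every role whose frequency condition is met."""
    roles: Set[str] = set()
    if stats.transitions and stats.absorbed_revisions / stats.transitions >= synthesizer_min:
        roles.add("synthesizer")
    if stats.reviews and stats.error_pointings / stats.reviews >= corrector_min:
        roles.add("error corrector")
    if stats.reviews and stats.premise_challenges / stats.reviews >= critic_min:
        roles.add("critic")
    return roles
```

Because the counters only grow with additional rounds, the same function can be re-run after each round to refine the labels.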

[0050] In this implementation, role types are automatically inferred based on behavioral patterns during multiple rounds of debate. The automatic identification of the three types of roles provides a structured understanding of the reasoning contribution methods of each model, enabling the identification of the actual contribution methods of different models in collaborative reasoning, and providing more granular behavioral basis for subsequent capability assessment, interactive analysis and answer aggregation.

[0051] In the above embodiments, preferably, the ability scores of each language model are calculated based on the historical reasoning performance data and / or current multi-turn interaction data of each language model, including: Historical ability indicators are generated based on the accuracy of various language models in historical calibration reasoning tasks. Historical calibration reasoning tasks are a set of reasoning questions with known correct answers that are continuously accumulated through daily operation; these can include mathematical reasoning, logical reasoning, or common sense reasoning questions with standard answers. Accuracy reflects the model's overall reasoning ability level after experiencing sufficiently diverse reasoning tasks and is the most valuable long-term stable indicator in ability scoring.

[0052] Interactive error correction metrics are generated based on the frequency with which each language model is corrected by other large language models during the current multi-round interaction process. If a model is frequently pointed out as having errors by other models in the current debate, it indicates that its current inference reliability is low. The correction frequency refers to the proportion of times in the multi-round debate of the current inference task that the model's inference error is explicitly pointed out by other member models and verified to be correct in subsequent rounds. A model with a high correction frequency indicates that its inference quality in the current task is relatively weak, while a model with a low correction frequency indicates that its inference quality is relatively strong. Interactive error correction metrics are real-time capability signals generated based on the current task, capable of capturing the current performance differences of models on specific question types that are not yet reflected in historical accuracy statistics.

[0053] A confidence stability index is generated based on the consistency of confidence scores across multiple rounds of interaction among various language models. This index reflects the stability of the models' judgments of their own answers during multiple revisions. The confidence stability index measures the consistency of the model's output confidence score across rounds of debate: models whose confidence scores remain stable across multiple rounds (i.e., maintaining high confidence scores on correct answers or whose confidence scores change reasonably after answer updates) are considered to have strong capabilities and high reasoning certainty; models whose confidence scores fluctuate frequently and significantly across multiple rounds and are inconsistent with answer changes are considered to have weak reasoning stability.

[0054] A capability score is calculated based on at least one of the following: historical capability index, interactive error correction index, and confidence stability index. The capability score can utilize both historical calibration data and real-time interactive performance during the current debate process, comprehensively reflecting an estimate of the long-term historical level and current task performance of each model.
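A minimal sketch of the three indicators and their combination is given below. All formulas are assumptions: the text names the indicators but not how they are computed or weighted. In particular, the stability proxy uses only the spread of the confidence values and does not check consistency with answer changes, which the description also mentions.

```python
from statistics import pstdev
from typing import Sequence, Tuple


def historical_indicator(correct: int, total: int) -> float:
    """Accuracy on historical calibration tasks with known answers."""
    return correct / total


def error_correction_indicator(times_corrected: int, rounds: int) -> float:
    """Higher is better: 1 minus the (capped) correction frequency."""
    return 1.0 - min(1.0, times_corrected / rounds)


def confidence_stability(confidences: Sequence[float]) -> float:
    """Simplified proxy: confidences in [0, 1] have a population standard
    deviation of at most 0.5, so this maps the spread onto [0, 1]."""
    return max(0.0, 1.0 - 2.0 * pstdev(confidences))


def ability_score(hist: float, corr: float, stab: float,
                  weights: Tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted sum of the three indicators; the weights are placeholders."""
    w_hist, w_corr, w_stab = weights
    return w_hist * hist + w_corr * corr + w_stab * stab
```

Setting a weight to zero reduces the score to the remaining indicators, which matches the "at least one of" wording in paragraph [0054].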

[0055] In this implementation, three types of capability indicators are used to quantitatively evaluate the capabilities of each model from three dimensions: historical long-term accumulation, current real-time interaction, and confidence stability. This makes the capability score both historically reliable and currently task-specific, significantly improving the accuracy of capability balance judgment and weighted aggregation decision-making. Compared with a single indicator scheme that only uses historical accuracy, it can more accurately reflect the actual capability differences of each model on the current specific reasoning task.

[0056] In the above embodiments, preferably, determining whether a heterogeneous model collaborative group meets the capability balance condition based on capability scores includes: The score difference between the highest and lowest ability scores in the heterogeneous model collaboration group is determined, and this score difference is compared with a preset ability difference threshold. The score difference is the core quantitative indicator for measuring the uniformity of ability distribution within the collaboration group. It uses a single quantitative value within the group to simply reflect the degree of difference in ability patterns among members, avoiding the complexity of comparing multi-dimensional ability indicators one by one.

[0057] When the score difference is less than or equal to the preset ability difference threshold, the heterogeneous model collaboration group is deemed to meet the ability balance condition. An output strategy based on consensus after multiple rounds of debate is adopted, with each member model's reasoning contribution receiving equal weight. When the abilities of the models are similar, the consensus answer reached after multiple rounds of debate has undergone sufficient cross-family complementary error correction. The consensus conclusion of multiple models with similar abilities has high credibility, so simple consensus output is the preferred strategy.

[0058] When the score difference exceeds a preset ability difference threshold, the heterogeneous model collaboration group is deemed not to meet the ability balance condition. Weighted aggregation weights are then determined based on the ability scores of each language model, with these weights being positively correlated with the ability scores. This positive correlation ensures that models with higher ability scores receive greater voting weight in the final answer aggregation, resulting in a greater impact of their reasoning conclusions on the final output. Models with lower ability scores receive less weight, ensuring that even if their answers differ from those of higher-ability models, their numerical advantage will not affect the final result. The weighted aggregation weights can be directly calculated from the ability scores through methods such as Softmax normalization and linear proportional allocation, making the weight allocation clearly interpretable.
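The threshold decision of paragraphs [0056] to [0058] can be sketched as follows, using linear proportional allocation for the unbalanced path. Treating the balanced path as an equal-weight vote is an assumption made for this sketch, and the default threshold of 0.20 is taken from the examples given later in the description.

```python
from typing import Dict


def is_balanced(scores: Dict[str, float], threshold: float) -> bool:
    """Balance condition: highest minus lowest score within the threshold."""
    return max(scores.values()) - min(scores.values()) <= threshold


def linear_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Linear proportional allocation: weight = score / sum of scores."""
    total = sum(scores.values())
    return {model_id: s / total for model_id, s in scores.items()}


def final_answer(answers: Dict[str, str], scores: Dict[str, float],
                 threshold: float = 0.20) -> str:
    """Equal weights when balanced, ability-based weights otherwise.
    Ties are broken by first occurrence, which is an arbitrary choice."""
    if is_balanced(scores, threshold):
        weights = {model_id: 1.0 for model_id in answers}
    else:
        weights = linear_weights(scores)
    tally: Dict[str, float] = {}
    for model_id, answer in answers.items():
        tally[answer] = tally.get(answer, 0.0) + weights[model_id]
    return max(tally, key=tally.get)
```

With scores of 0.9, 0.3, and 0.3, the single strong model holds a weight of 0.6 and outvotes the two weaker models. Whether a strong minority prevails in general depends on the score gap and on the weighting function chosen.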

[0059] In this implementation, the single threshold judgment logic based on the score difference is simple and efficient, avoiding the overhead of complex multi-condition judgment; the adaptive switching between the two paths of balanced and unbalanced abilities enables this method to maintain robust inference quality in actual deployment scenarios with diverse ability patterns of collaborative group members; the positive correlation weight design effectively prevents the weak model from dragging down the correct inference of the strong model through the majority mechanism when the ability gap is large, so that the accuracy of the final answer can still be close to the level of the strongest member model in the collaborative group even in scenarios with unbalanced abilities.

[0060] In the above embodiments, preferably, the collaborative reasoning method based on cross-family debate of heterogeneous large language models further includes: after each round of interaction, checking whether the current answers of all large language models are consistent and whether the current confidence levels of all large language models have reached a preset confidence threshold. Answer consistency is determined by comparing the answers given by each model in the current round. For reasoning questions requiring precise numerical values (such as mathematical calculations), the system assesses the numerical or expression equivalence of the answers. For reasoning questions requiring qualitative choices (such as logical reasoning), the system checks the consistency of the answer categories. The confidence threshold check determines whether the confidence value output by each model in the current round reaches the system's preset confidence threshold.

[0061] When the current answers of all large language models are consistent and the current confidence levels of all large language models reach the preset confidence threshold, the collaboration group has formed a relatively stable consensus on the answer; the multi-round interaction is terminated and the current answer is output. When the current answers of the large language models are inconsistent, or when the current confidence level of at least one large language model fails to reach the preset confidence threshold, disagreement or uncertainty remains in the current results, and the next round of interaction is performed until the preset maximum number of rounds is reached. Requiring both consistent answers and high confidence is stricter than judging by consistency alone. If the answers are consistent but confidence is low, a model may be passively following rather than truly agreeing with the consensus, and continued debate could reverse the answers. Only when both conditions are met is the consensus considered reliable enough to terminate the debate early. For simple questions, this mechanism reduces the average number of debate rounds below the preset maximum, saving API call costs and inference latency.

[0062] The maximum number of rounds is typically set to three. After three rounds of debate, the debate is forcibly terminated regardless of whether the dual termination conditions are met, to avoid wasting computational resources on endless debates. If the dual termination conditions are not met when the maximum number of rounds is reached, an aggregation strategy is selected, and consensus or weighted aggregation output is performed based on the answers of each model in the final round.
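The dual termination condition and the round cap can be sketched as follows. The `step` callable, which performs one round and returns each model's answer and confidence, is a hypothetical stand-in for the debate coordination logic. Answer equivalence is reduced here to numeric parsing with a case-insensitive string fallback, which is a simplification of the equivalence checks described in paragraph [0060].

```python
from typing import Callable, Dict, Tuple, Union

Answers = Dict[str, str]
Confidences = Dict[str, float]


def normalize(answer: str) -> Union[float, str]:
    """Compare numeric answers by value, other answers by cleaned text."""
    try:
        return float(answer)
    except ValueError:
        return answer.strip().lower()


def should_stop(answers: Answers, confidences: Confidences,
                conf_threshold: float) -> bool:
    """Both conditions must hold: unanimous answers and every confidence
    at or above the threshold."""
    unanimous = len({normalize(a) for a in answers.values()}) == 1
    confident = all(c >= conf_threshold for c in confidences.values())
    return unanimous and confident


def run_debate(step: Callable[[int], Tuple[Answers, Confidences]],
               max_rounds: int = 3,
               conf_threshold: float = 0.8) -> Tuple[int, Answers, bool]:
    """Run rounds until the dual condition holds or max_rounds is reached.
    Returns (rounds used, last answers, consensus reached)."""
    answers: Answers = {}
    for round_no in range(1, max_rounds + 1):
        answers, confidences = step(round_no)
        if should_stop(answers, confidences, conf_threshold):
            return round_no, answers, True
    return max_rounds, answers, False
```

When the third element of the result is `False`, the caller falls through to the aggregation strategy applied to the final-round answers, as paragraph [0062] describes.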

[0063] In this implementation, the adaptive debate round control described above ends the interaction early when the models quickly reach consensus on a simple task, reducing unnecessary model calls and inference latency, and continues the interaction when the models disagree significantly on a complex task, improving the reliability of the final answer and balancing inference quality against computational overhead.

[0064] This invention also proposes a collaborative reasoning system based on heterogeneous large language model cross-family debate, used to implement any of the disclosed collaborative reasoning methods based on heterogeneous large language model cross-family debate in the above embodiments. The system includes a heterogeneous model scheduling module, an independent reasoning control module, a multi-round debate coordination module, a role recognition and tracking module, an ability balance detection module, and an answer aggregation module. This system can be deployed on a cloud computing platform or server and can call large language model servers from different model families via network interfaces.

[0065] The heterogeneous model scheduling module acquires inference task data to be processed and selects a large language model from at least two different model families based on preset model family difference conditions to construct a heterogeneous model collaboration group. At the engineering implementation level, this module communicates with the inference servers of each model family through a unified API gateway, maintains the interface configuration, API key management, and request frequency limiting policies for each family's models, and possesses failover capabilities. When a family's server fails, it can automatically switch to an available instance from a backup family model, ensuring that the availability of the collaboration group is not interrupted by a single family service failure. This module is deployed as a microservice, abstracting the interface differences between each family's models through a unified API gateway and providing standardized model call interfaces to upper-layer modules.
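The failover behavior of the scheduling module can be illustrated with the sketch below. It assumes each model instance is wrapped in a callable that raises an exception on failure; `call_with_failover` and the callables are hypothetical, and API key management and rate limiting are omitted.

```python
from typing import Callable, List, Sequence

ModelCall = Callable[[str], str]  # hypothetical client wrapper: prompt -> reply


def call_with_failover(prompt: str, primary: ModelCall,
                       backups: Sequence[ModelCall]) -> str:
    """Try the primary instance first, then each backup in order.
    Raises RuntimeError only if every instance fails."""
    errors: List[Exception] = []
    for call in [primary, *backups]:
        try:
            return call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} model instances failed") from errors[-1]
```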

[0066] The independent inference control module is used to send inference task data in a unified prompt format to the large language models in the heterogeneous model collaboration group, so that the large language models generate initial inference data in a mutually isolated state. Through system-level isolation control, it ensures that no large language model receives the inference results of other large language models before generating its initial inference data. The initial inference data includes the initial inference process and the initial answer. The multi-round debate coordination module operates as a central orchestration service, maintaining a complete debate state machine and controlling the timing of each round. It is used to summarize the initial inference data generated by each large language model, generate a structured debate context, and distribute the structured debate context to each large language model. This allows each large language model to generate review data based on the inference processes and answers of the other large language models, and to update at least one of its own inference process and answer based on the review data, thus obtaining at least one round of revised inference data. The role recognition and tracking module operates as a bypass analysis service, continuously receiving debate history data pushed by the multi-round debate coordination module. During multiple rounds of interaction, it identifies the role type (synthesizer, error corrector, or critic) of each large language model in collaborative reasoning based on the answer change information, reasoning text change information, and review text information of each large language model in adjacent rounds. The role information can be fed back to the system monitoring panel, providing operations and maintenance personnel with an interpretable view of the debate process.

[0067] The capability balance detection module is used to calculate the capability scores of each language model based on the historical reasoning performance data and / or the current multi-turn interaction data, and to determine whether the heterogeneous model collaboration group meets the capability balance condition based on the capability scores. The answer aggregation module is used to generate the final reasoning answer based on the consensus results of multiple rounds of interaction when the ability balance condition is met; when the ability balance condition is not met, it generates the final reasoning answer by weighting and aggregating the answers after multiple rounds of interaction based on the ability scores of various language models.

[0068] In this implementation, through the collaboration of the aforementioned modules, the system can complete the entire process from heterogeneous model selection, independent reasoning, multi-round cross-family debate, role recognition, ability balancing detection to final answer aggregation, providing an engineerable multi-model collaborative reasoning system for highly reliable AI decision-making.

[0069] The present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the collaborative reasoning method based on cross-family debate of heterogeneous large language models as disclosed in any of the above embodiments.

[0070] Computer-readable storage media include, but are not limited to, solid-state drives, cloud storage systems, ROM, RAM, hard disks, optical discs, and other forms. In the engineering deployment scenarios of collaborative inference systems, the storage media is typically object storage or block storage services of cloud computing platforms to support high-availability storage and rapid distribution of code and configuration files for various modules under a microservice architecture.

[0071] The computer program can be deployed on general-purpose servers, cloud computing instances, or edge computing nodes equipped with network communication interfaces. It communicates with the inference servers of various language model families through standard network protocols such as REST API or gRPC. It does not need to run any large language model instances locally. It only undertakes the responsibilities of orchestrating the collaborative inference process, managing the state, and aggregating the results, enabling the system to drive multiple large language models to collaboratively complete complex inference tasks with extremely low local computing power consumption.

[0072] The computer-readable storage medium enables the collaborative reasoning method of this invention to be flexibly deployed in various computing environments in software form. It can be seamlessly integrated into existing AI reasoning service infrastructure as a reasoning quality enhancement layer without requiring any modification to the various family of large language models connected. This provides a significantly higher reasoning accuracy guarantee than a single model for high-reliability reasoning application scenarios such as medical decision support, legal document analysis, and financial risk assessment.

[0073] The collaborative reasoning method and system based on cross-family debate of heterogeneous large language models disclosed in the above embodiments may be implemented with reference to the following examples.

[0074] Example 1: Three-Model Heterogeneous Debate Reasoning Scheme. This embodiment uses a hybrid benchmark combining mathematical and logical reasoning as its application scenario. It selects three large language models from different families to construct a heterogeneous collaboration group: Model A is an enhanced version of the Claude series (excelling in integrating multi-source information and natural language reasoning); Model B is a professional version of the Gemini series (rigorous in the details of logical reasoning); and Model C is a version of the DeepSeek reasoning series (with significant advantages in mathematical reasoning). The three models differ significantly in their pre-training data sources (from the respective corpus construction systems of Anthropic, Google, and DeepSeek), architecture design, and training strategies. They exhibit low correlation in error patterns and provide a structural foundation for cross-family complementary error correction.

[0075] For a multi-step mathematical reasoning problem, the system sends the same prompt to the three models in a unified format. In the first round, the three models independently generate their own reasoning processes and preliminary answers and return initial confidence levels: Model A's answer is X, confidence level 0.7; Model B's answer is Y, confidence level 0.85; Model C's answer is Y, confidence level 0.9. The three reasoning results are written into a structured debate context and distributed to each model. Prompted by Model B's review, which explicitly points out a missing condition in the third step, Model A re-examines its own reasoning, revises its answer to Y, and raises its confidence level to 0.88. After the first round of debate, all three models' answers are Y and their confidence levels exceed the preset threshold of 0.8, triggering the early termination condition, and the consensus answer Y is output. After 2 to 3 rounds of debate, the role identification module records: Model A is the synthesizer (actively revising after absorbing the error pointed out by Model B), Model B is the error corrector (explicitly pointing out a specific error in Model A's reasoning), and Model C is the critic (questioning the boundary conditions of the problem during the review). The ability balance detection is based on accuracy over the most recent 100 calibration questions: Model A is 73%, Model B is 79%, and Model C is 82%. The score difference is 9 percentage points, which is lower than the preset ability difference threshold of 20 percentage points, so the group is judged to have balanced abilities and the debate consensus is used for output.

[0076] Example 2: Weighted Voting Scheme under Ability Imbalance. In one reasoning task, the ability balance detection module calculated the ability scores of the three models as 82% for Model A, 81% for Model B, and 55% for Model C (Model C performed significantly worse on the current task type and was corrected three times by the other two models in the current debate). The difference between the highest and lowest scores was 27 percentage points, which exceeded the preset ability difference threshold of 20 percentage points, so the group was judged to be imbalanced. After three rounds of debate, the maximum number of rounds, Model A and Model B both gave the answer P, while Model C gave the answer Q. The system switched to weighted voting mode, with weights positively correlated with the ability scores: for example, a weight of 0.41 for Model A, 0.41 for Model B, and 0.18 for Model C (obtained by softmax normalization of the ability scores). The weighted vote for answer P was 0.82, and the weighted vote for answer Q was 0.18. The answer with the highest weighted vote, P, was selected as the final reasoning answer, which limits the influence of Model C's incorrect answer Q compared with an unweighted simple-majority vote.
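The weighted vote can be sketched as follows. The specification does not state the input scale or temperature of the softmax, so both are assumptions here: a plain softmax over the fractional scores 0.82, 0.81, and 0.55 yields noticeably flatter weights (about 0.36, 0.36, 0.28), whereas a hypothetical temperature of 0.33 comes close to the 0.41, 0.41, 0.18 quoted in the example. The function names `softmax_weights` and `weighted_vote` are illustrative only.

```python
import math
from collections import defaultdict


def softmax_weights(scores, temperature=1.0):
    """Softmax over ability scores; a lower temperature sharpens the weights."""
    exps = {m: math.exp(s / temperature) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def weighted_vote(answers, weights):
    """Sum the weights behind each answer and return the winner and the tallies."""
    totals = defaultdict(float)
    for model, answer in answers.items():
        totals[answer] += weights[model]
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


scores = {"A": 0.82, "B": 0.81, "C": 0.55}
answers = {"A": "P", "B": "P", "C": "Q"}

# Plain softmax on fractional scores: roughly 0.36 / 0.36 / 0.28.
plain = softmax_weights(scores)

# Assumed temperature of 0.33: roughly 0.41 / 0.40 / 0.18.
sharp = softmax_weights(scores, temperature=0.33)

winner, totals = weighted_vote(answers, sharp)
print(winner, {k: round(v, 2) for k, v in totals.items()})
```

Under either weighting the answer P wins; the temperature only changes how strongly the weaker model is discounted.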

[0077] Example 3: System Microservice Deployment Method. The system is deployed as a microservice architecture on a public cloud platform such as Alibaba Cloud or AWS.

[0078] The heterogeneous model scheduling module is deployed as an independent microservice. It communicates with the inference API endpoints of the Anthropic, Google, and DeepSeek families through a unified API gateway, maintains the API keys of each family, configures request frequency limits (for example, keeping requests within each family's per-minute API rate limit), and configures a backup model instance for each family to support failover.
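The rate limiting and failover behaviour described for the scheduling module can be sketched as below. This is an illustrative outline under stated assumptions, not the deployed implementation: `RateLimiter` and `FamilyEndpoint` are hypothetical names, the model clients are stand-in Python callables rather than real vendor SDK calls, and the 60-second sliding window is an assumed policy.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds."""

    def __init__(self, max_calls, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


class FamilyEndpoint:
    """One model family: primary instance, backup instance, and a rate limit."""

    def __init__(self, family, primary, backup, limiter):
        self.family = family
        self.primary = primary  # hypothetical callable: prompt -> text
        self.backup = backup    # hypothetical callable: prompt -> text
        self.limiter = limiter

    def infer(self, prompt):
        if not self.limiter.try_acquire():
            raise RuntimeError(f"{self.family}: rate limit reached")
        try:
            return self.primary(prompt)
        except Exception:
            # Failover: retry once on the backup instance of the same family.
            return self.backup(prompt)


def failing_primary(prompt):
    raise ConnectionError("primary unavailable")


def backup(prompt):
    return f"backup answer to: {prompt}"


endpoint = FamilyEndpoint("family-a", failing_primary, backup, RateLimiter(max_calls=2))
first = endpoint.infer("q1")  # served by the backup instance
print(first)
```

A production gateway would more likely queue or delay requests that hit the limit instead of raising; raising keeps the sketch short.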

[0079] The multi-round debate coordination module operates as a central orchestration service and, as a stateful service, maintains a debate state machine for each reasoning task. It receives the reasoning results of each model through a message queue and distributes them round by round.
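One possible shape for the per-task debate state machine is sketched below. The state names, the `DebateSession` class, and the default of three debate rounds are assumptions made for illustration; in a deployment the `submit` method would be driven by a message-queue consumer, which is omitted here.

```python
from enum import Enum, auto


class DebateState(Enum):
    INITIAL_REASONING = auto()   # isolated first answers, no cross-model context
    DEBATING = auto()            # review-and-revise rounds
    CONSENSUS = auto()           # early termination condition met
    MAX_ROUNDS_REACHED = auto()  # stopped without consensus


class DebateSession:
    """Per-task debate state, assumed to be fed by a message-queue consumer."""

    def __init__(self, task_id, model_names, max_rounds=3):
        self.task_id = task_id
        self.expected = set(model_names)
        self.max_rounds = max_rounds
        self.round = 0     # 0 = initial reasoning, 1..max_rounds = debate rounds
        self.state = DebateState.INITIAL_REASONING
        self.history = []  # one {model: result} dict per completed round
        self._pending = {}

    def submit(self, model, result):
        """Store one result; return the full round once every model has reported."""
        if self.state in (DebateState.CONSENSUS, DebateState.MAX_ROUNDS_REACHED):
            raise RuntimeError("debate already finished")
        if model not in self.expected:
            raise KeyError(f"unknown model: {model}")
        self._pending[model] = result
        if set(self._pending) != self.expected:
            return None
        completed, self._pending = self._pending, {}
        self.history.append(completed)
        return completed

    def advance(self, consensus_reached):
        """Move to the next state after a completed round."""
        if consensus_reached:
            self.state = DebateState.CONSENSUS
        elif self.round >= self.max_rounds:
            self.state = DebateState.MAX_ROUNDS_REACHED
        else:
            self.round += 1
            self.state = DebateState.DEBATING
        return self.state


session = DebateSession("task-1", ["A", "B", "C"])
for name, answer in [("A", "X"), ("B", "Y"), ("C", "Y")]:
    context = session.submit(name, answer)  # None until the third result arrives
session.advance(consensus_reached=False)    # answers differ: start debate round 1
for name in ["A", "B", "C"]:
    session.submit(name, "Y")
print(session.advance(consensus_reached=True))
```

Keeping the round counter and pending results inside one object per task is what makes the service stateful; the completed-round dictionary returned by `submit` is what would be distributed to the models for the next round.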

[0080] The role recognition and tracking module and the ability balance detection module run as a bypass analysis service and a background evaluation service, respectively. They receive the debate history and output their analysis results asynchronously, without blocking the main debate process.
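As an illustration of what the role recognition service might compute from the debate history, the sketch below applies the frequency conditions of claim 5 to a list of already-classified events. Everything here is hypothetical: the event labels, the `identify_roles` function, and the thresholds of 1 are assumptions, and the step that derives events from the review text is assumed to happen upstream.

```python
from collections import Counter

# Hypothetical event labels mapped to the three role types.
ROLE_BY_EVENT = {
    "revised_after_absorbing": "synthesizer",
    "pointed_out_error": "error corrector",
    "questioned_premise": "critic",
}


def identify_roles(events, min_counts):
    """Label a model with a role once its event count reaches the role's threshold.

    events: list of (model, event_type) tuples taken from the debate history.
    min_counts: per-role frequency thresholds (assumed values).
    """
    counts = Counter(events)
    roles = {}
    for (model, event_type), n in counts.items():
        role = ROLE_BY_EVENT.get(event_type)
        if role is not None and n >= min_counts[role]:
            roles.setdefault(model, []).append(role)
    return roles


# Events corresponding to Example 1.
history = [
    ("A", "revised_after_absorbing"),
    ("B", "pointed_out_error"),
    ("C", "questioned_premise"),
]
thresholds = {"synthesizer": 1, "error corrector": 1, "critic": 1}
roles = identify_roles(history, thresholds)
print(roles)
```

Because the function only reads the recorded history, it can run in a background worker after each round without holding up the coordination service.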

[0081] The answer aggregation module selects an aggregation strategy based on signals from the coordination and detection modules, generates the final reasoning answer along with an accompanying reasoning summary and confidence index, and returns it to the upper-layer application via a REST API.

[0082] The entire system supports horizontal scaling. As the number of concurrent reasoning tasks increases, each microservice can be scaled out independently, so that system throughput grows with the load.

[0083] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A collaborative reasoning method based on cross-family debate of heterogeneous large language models, characterized in that it comprises:
acquiring reasoning task data to be processed, and selecting large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group;
sending the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format, so that the large language models generate initial reasoning data in a mutually isolated state, the initial reasoning data including an initial reasoning process and an initial answer;
aggregating the initial reasoning data generated by the large language models to generate a structured debate context, and distributing the structured debate context to each large language model, so that each large language model generates review data based on the reasoning processes and answers of the other large language models and updates at least one of its own reasoning process and answer according to the review data, to obtain at least one round of revised reasoning data;
during the multi-round interaction, identifying the role type of each large language model in the collaborative reasoning based on the answer change information, reasoning text change information, and review text information of each large language model in adjacent rounds;
calculating an ability score for each large language model based on historical reasoning performance data and/or current multi-round interaction data of the large language models, and determining, based on the ability scores, whether the heterogeneous model collaboration group meets an ability balance condition;
when the ability balance condition is met, generating a final reasoning answer based on the consensus result of the answers after the multi-round interaction; and
when the ability balance condition is not met, performing weighted aggregation of the answers after the multi-round interaction based on the ability scores of the large language models to generate the final reasoning answer.

2. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of selecting large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group comprises:
obtaining, for each candidate large language model, model family information, pre-training data source information, model architecture information, and training strategy information;
selecting, as participating models, candidate large language models that belong to different model families and differ in at least one of the pre-training data source information, the model architecture information, and the training strategy information; and
configuring the participating models as the heterogeneous model collaboration group, which includes three to five large language models.

3. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of sending the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format comprises:
generating unified prompt data based on the reasoning task data;
sending the unified prompt data to each large language model, and controlling each large language model so that it does not receive the reasoning results of the other large language models before generating its initial reasoning data; and
receiving the initial reasoning process, initial answer, and initial confidence returned by each large language model.

4. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of generating the structured debate context and distributing it to each large language model comprises:
writing the initial reasoning data of each large language model into the structured debate context;
in each round of interaction, distributing the reasoning data and answer data of the large language models from the previous round to each large language model; and
receiving from each large language model at least one of an error identification, a premise challenge, a reasoning supplement, and an answer revision, and using it as the revised reasoning data of the current round.

5. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of identifying the role type of each large language model in the collaborative reasoning based on the answer change information, the reasoning text change information, and the review text information comprises:
when the frequency with which a target large language model absorbs the reasoning content of other large language models and modifies its own reasoning process or answer in adjacent rounds reaches a first preset frequency condition, marking the target large language model as a synthesizer;
when the frequency with which the target large language model points out specific reasoning errors of other large language models in the review text information reaches a second preset frequency condition, marking the target large language model as an error corrector; and
when the frequency with which the target large language model questions the reasoning premises or implicit assumptions of other large language models in the review text information reaches a third preset frequency condition, marking the target large language model as a critic.

6. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of calculating the ability score of each large language model based on the historical reasoning performance data and/or the current multi-round interaction data comprises:
generating a historical ability index based on the accuracy of each large language model on historical calibration reasoning tasks;
generating an interaction error correction index based on the frequency with which each large language model is corrected by other large language models during the current multi-round interaction;
generating a confidence stability index based on the consistency of the confidence of each large language model during the current multi-round interaction; and
calculating the ability score based on at least one of the historical ability index, the interaction error correction index, and the confidence stability index.

7. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that the step of determining, based on the ability scores, whether the heterogeneous model collaboration group meets the ability balance condition comprises:
determining the score difference between the highest and lowest ability scores in the heterogeneous model collaboration group;
when the score difference is less than or equal to a preset ability difference threshold, determining that the heterogeneous model collaboration group meets the ability balance condition; and
when the score difference is greater than the preset ability difference threshold, determining that the heterogeneous model collaboration group does not meet the ability balance condition, and determining weighted aggregation weights based on the ability scores of the large language models, wherein each weighted aggregation weight is positively correlated with the corresponding ability score.

8. The collaborative reasoning method based on cross-family debate of heterogeneous large language models according to claim 1, characterized in that it further comprises:
after each round of interaction, determining whether the current answers of the large language models are consistent and whether the current confidence of each large language model has reached a preset confidence threshold;
when the current answers of all large language models are consistent and the current confidence of every large language model reaches the preset confidence threshold, terminating the multi-round interaction and outputting the current answer; and
when the current answers of the large language models are inconsistent, or the current confidence of at least one large language model does not reach the preset confidence threshold, continuing with the next round of interaction until a preset maximum number of rounds is reached.

9. A collaborative reasoning system based on cross-family debate of heterogeneous large language models, characterized in that it comprises:
a heterogeneous model scheduling module, configured to acquire reasoning task data to be processed and to select large language models from at least two different model families according to preset model family difference conditions to construct a heterogeneous model collaboration group;
an independent reasoning control module, configured to send the reasoning task data to each large language model in the heterogeneous model collaboration group in a unified prompt format, so that the large language models generate initial reasoning data in a mutually isolated state, the initial reasoning data including an initial reasoning process and an initial answer;
a multi-round debate coordination module, configured to aggregate the initial reasoning data generated by the large language models, generate a structured debate context, and distribute the structured debate context to each large language model, so that each large language model generates review data based on the reasoning processes and answers of the other large language models and updates at least one of its own reasoning process and answer according to the review data, to obtain at least one round of revised reasoning data;
a role recognition and tracking module, configured to identify the role type of each large language model in the collaborative reasoning based on the answer change information, reasoning text change information, and review text information of each large language model in adjacent rounds during the multi-round interaction;
an ability balance detection module, configured to calculate an ability score for each large language model based on historical reasoning performance data and/or current multi-round interaction data of the large language models, and to determine, based on the ability scores, whether the heterogeneous model collaboration group meets an ability balance condition; and
an answer aggregation module, configured to generate a final reasoning answer based on the consensus result of the answers after the multi-round interaction when the ability balance condition is met, and, when the ability balance condition is not met, to perform weighted aggregation of the answers after the multi-round interaction based on the ability scores of the large language models to generate the final reasoning answer.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the collaborative reasoning method based on cross-family debate of heterogeneous large language models according to any one of claims 1 to 8.