A parallel collaborative evaluation method for large and small models based on short-term confidence
By introducing a parallel collaborative evaluation method for large and small models into federated learning, and utilizing short-term confidence and arbitration mechanisms, this approach addresses the challenges of low resource utilization and advanced semantic evaluation in existing technologies. It achieves efficient and reliable model quality evaluation and incentivizes trustworthy cooperation among computing power providers.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JINAN UNIVERSITY
- Filing Date
- 2026-02-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing federated learning lacks effective monitoring of the integrity of computing power providers and the quality of learning. Existing evaluation methods are difficult to defend against advanced semantic attacks and have low resource utilization efficiency, resulting in high evaluation costs and low efficiency.
A parallel collaborative evaluation method based on short-term confidence levels using small and large models is adopted. Evaluation indicators are dynamically generated through a cloud service center. Small models are used for short-term evaluation, and large models are used for supplementary evaluation to achieve parallel collaboration. Combined with a confidence-driven arbitration mechanism, resource utilization and evaluation accuracy are optimized.
The method improves evaluation efficiency and accuracy, strengthens the evaluation of advanced semantic metrics, optimizes the utilization of computing resources, forms a trustworthy closed loop of computing power cooperation, and incentivizes computing power providers to deliver high-quality learning outcomes.
Figure CN122242648A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence and federated learning, and particularly relates to the quality evaluation of federated learning outcomes in computing power networks.
Background Technology
[0002] With the rapid development of distributed computing and artificial intelligence technologies, computing power networks (CPNs) have become a new type of infrastructure for integrating heterogeneous computing resources. Among them, federated learning (FL), as one of the specific application scenarios of CPNs, has become the mainstream paradigm for collaborative training in data privacy-sensitive fields such as healthcare and finance due to its characteristic of data not leaving the domain.
[0003] Due to the privacy-preserving nature of FL, the cloud service center, acting as the central coordinator of FL, cannot directly inspect the quality and quantity of the raw training data held by computing power providers, nor can it verify the authenticity, completeness, and contribution of the local model updates they submit. This raises a key challenge in the FL field: the cloud service center lacks effective monitoring of the integrity and learning quality of computing power providers. Malicious or dishonest participants can severely damage the performance of the global model and the fairness of collaboration. Therefore, designing a reliable, flexible, and efficient evaluation mechanism that achieves quality-driven incentives and trustworthy collaboration is a current cutting-edge research focus in CPN and FL.
[0004] However, existing methods for defending against malicious or dishonest participants in federated learning rely on numerical statistics: they calculate the geometric distance between model update parameters and eliminate outliers. This approach struggles against sophisticated, targeted collaborative attacks, because attackers can upload model updates that appear statistically normal but are semantically or logically flawed. To address this, another defense has been proposed: evaluating the reasoning ability of the models trained by federated learning participants. Existing methods for evaluating model reasoning ability focus mainly on quantifying model performance, for example by measuring model accuracy or by applying a fixed contribution algorithm such as the Shapley value. However, these methods cannot flexibly evaluate advanced semantic metrics such as model rationality and ethical compliance, which limits the adaptability of the evaluation system to complex federated learning tasks. To solve these semantic evaluation challenges, recent research has begun to explore large language models (LLMs) as evaluators in federated learning environments, for example the FedEval-LLM framework. However, whether an LLM is deployed centrally or in a distributed manner, its large model size, high computational requirements, and high invocation cost make it costly and inefficient to use for every evaluation. To balance accuracy and cost, some studies have proposed collaborative inference architectures that combine large and small models. Most of these, however, use sequential collaboration: the small model must finish executing before the large model can intervene based on its evaluation results, which leaves resources idle.
[0005] Therefore, there is an urgent need in this field for a new technical solution that overcomes the limitations of purely numerical evaluation and the difficulty of evaluating advanced semantics. Such a solution should use the collaboration of large and small models to ensure evaluation accuracy and semantic depth, and should introduce a short-term confidence prediction mechanism to remove the efficiency bottleneck caused by the serial workflow in collaborative evaluation, thereby achieving highly reliable and efficient evaluation of FL quality in CPN.
Summary of the Invention
[0006] To address the shortcomings of the aforementioned FL quality evaluation methods in CPN regarding robustness, efficiency, and resource utilization, this invention proposes a parallel collaborative evaluation method for large and small models based on short-term confidence.
[0007] This invention discloses a parallel collaborative evaluation method for large and small models based on short-term confidence, comprising the following steps:
[0008] S1: The cloud service center provides questions and dynamically generates evaluation indicators and their weights;
[0009] S2: Collect answers from multiple computing power providers based on the question;
[0010] S3: Based on short-term confidence, the large and small models perform a parallel collaborative evaluation of the answers obtained in S2 to generate the final score;
[0011] S4: The cloud service center collects the evaluation results information.
[0012] Preferably, S1 includes:
[0013] S1.1: Deploy the large evaluation model in the cloud service center and select one question from the question library as the question to be evaluated;
[0014] S1.2: The large evaluation model dynamically generates, based on the question to be evaluated, an evaluation indicator set containing 5 evaluation indicators and a corresponding evaluation indicator weight set.
[0015] Preferably, in S2, the cloud service center distributes the question to multiple computing power providers, and each computing power provider uses its local learning model to reason about the question to be evaluated and submits the answer it generates.
[0016] Preferably, S3 includes:
[0017] S3.1: Start the small evaluation model to perform a short-term evaluation of the answer and output the short-term statement confidence $c_i^{s}$ of each evaluation indicator;
[0018] S3.2: Select the "preliminary arbitration set" $A_{pre}$ based on a dynamic short-term arbitration threshold $\tau^{s}$;
[0019] S3.3: Execute the dual-track evaluation in parallel:
[0020] Track 1: The small evaluation model continues to run and produces the complete evaluation score $S^{S}$;
[0021] Track 2: The large evaluation model evaluates only the evaluation indicators in the "preliminary arbitration set" $A_{pre}$ to obtain the preliminary arbitration score $S^{L}_{pre}$;
[0022] S3.4: Collect the statement confidence $c_i^{f}$ of the small evaluation model's complete evaluation of all evaluation indicators, and select the "final low-confidence set" $A_{fin}$ based on a dynamic final arbitration threshold $\tau^{f}$;
[0023] S3.5: Start the large evaluation model for supplementary arbitration and generate the final evaluation result.
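[0024] Preferably, the short-term statement confidence $c_i^{s}$ is calculated as:
[0025] $$c_i^{s} = \frac{1}{k}\sum_{j=1}^{k} \mathrm{logprob}(t_{i,j});$$
[0026] where $k$ is the number of tokens that the small evaluation model outputs for each evaluation indicator during the short-term evaluation, $t_{i,j}$ is the $j$-th of the first $k$ tokens output for evaluation indicator $e_i$, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of that token.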
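The statement-confidence calculation can be illustrated with a minimal Python sketch. It assumes the per-token values are log-probabilities that the small evaluation model returns as a plain list; the function name `statement_confidence` and the sample numbers are hypothetical and not part of the original disclosure.

```python
from statistics import fmean


def statement_confidence(token_logprobs, k=None):
    """Arithmetic mean of token logprobs.

    With k set, only the first k tokens are averaged (short-term confidence);
    with k=None, the whole statement is averaged.
    """
    window = token_logprobs if k is None else token_logprobs[:k]
    if not window:
        raise ValueError("at least one token logprob is required")
    return fmean(window)


# Made-up logprobs for one evaluation statement (illustrative only).
logprobs = [-0.1, -0.3, -0.2, -1.4, -0.5]
short_term = statement_confidence(logprobs, k=3)  # mean of the first 3 values
complete = statement_confidence(logprobs)         # mean of all 5 values
```

Passing `k` gives the confidence over the first `k` tokens, while omitting it averages the whole statement, which is the form needed for the complete-evaluation confidence in S3.4.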
[0027] Preferably, the screening of low statement confidence by the dynamic short-term arbitration threshold specifically includes:
[0028] A dynamic short-term arbitration threshold $\tau^{s}$ is calculated based on the formula:
[0029] $$\tau^{s} = \mu^{s} - \alpha\,\sigma^{s},$$
[0030] where $\mu^{s}$ and $\sigma^{s}$ are the mean and standard deviation of the short-term confidence set $C^{s} = \{c_1^{s}, \dots, c_5^{s}\}$ and $\alpha$ is an adjustable deviation factor; the system selects all evaluation indicators that satisfy $c_i^{s} < \tau^{s}$ to form the "preliminary arbitration set" $A_{pre}$.
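A sketch of the threshold screening follows. It assumes that the threshold is the mean of the five short-term confidences minus the deviation factor times their standard deviation, and that the population standard deviation is used (the text does not say which variant applies). The indicator names, confidence values, and `alpha=1.0` are illustrative only.

```python
from statistics import fmean, pstdev


def arbitration_set(confidences, alpha=1.0):
    """Return (threshold, indicators whose confidence falls below it).

    confidences maps an indicator name to its statement confidence.
    """
    values = list(confidences.values())
    mu = fmean(values)
    sigma = pstdev(values)      # assumption: population standard deviation
    tau = mu - alpha * sigma    # assumption: threshold sits below the mean
    low = {name for name, c in confidences.items() if c < tau}
    return tau, low


# Hypothetical short-term confidences for five indicators.
short_term = {
    "accuracy": -0.20,
    "reasonableness": -0.30,
    "logical_consistency": -0.25,
    "ethical_compliance": -0.90,
    "clinical_applicability": -0.35,
}
tau, preliminary_set = arbitration_set(short_term, alpha=1.0)
```

With these sample values only `ethical_compliance` falls below the threshold, so only that indicator would be handed to the large evaluation model in track two. A larger `alpha` lowers the threshold and sends fewer indicators to arbitration.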
[0031] Preferably, the dynamic final arbitration threshold screening includes:
[0032] Collect the statement confidence $c_i^{f}$ of the small evaluation model's complete evaluation of the answer for every evaluation indicator, where $c_i^{f}$ is calculated as:
[0033] $$c_i^{f} = \frac{1}{n_i}\sum_{j=1}^{n_i} \mathrm{logprob}(t_{i,j});$$
[0034] where $n_i$ is the number of tokens in the small evaluation model's output statement for evaluation indicator $e_i$, $t_{i,j}$ is the $j$-th token in that statement, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of the $j$-th token in that statement;
[0035] According to the statistical distribution of the set $C^{f} = \{c_1^{f}, \dots, c_5^{f}\}$, the dynamic final arbitration threshold $\tau^{f}$ is calculated:
[0036] $$\tau^{f} = \mu^{f} - \beta\,\sigma^{f},$$
[0037] where $\mu^{f}$ and $\sigma^{f}$ are the mean and standard deviation of the set $C^{f}$ and $\beta$ is an adjustable deviation factor; all evaluation indicators that satisfy $c_i^{f} < \tau^{f}$ form the "final low-confidence set" $A_{fin}$.
[0038] Preferably, S3.5 includes:
[0039] Comparing the "preliminary arbitration set" $A_{pre}$ with the "final low-confidence set" $A_{fin}$:
[0040] if $A_{fin} \subseteq A_{pre}$, the short-term prediction is accurate;
[0041] if $A_{fin} \not\subseteq A_{pre}$, the short-term prediction is biased. In this case, the large evaluation model is started sequentially and evaluates only the difference $A_{fin} \setminus A_{pre}$, yielding the supplementary arbitration score $S^{L}_{sup}$ and hence the overall evaluation score of the large model $S^{L} = S^{L}_{pre} \cup S^{L}_{sup}$.
[0042] Preferably, step S3 further includes:
[0043] For every evaluation indicator $e_i$ in the "final low-confidence set" $A_{fin}$, the small-model score $s_i^{S}$ is replaced with the large-model evaluation score $s_i^{L}$;
[0044] The scores $s_i^{S}$ of all evaluation indicators with high statement confidence remain unchanged;
[0045] The final score of indicator $e_i$ after the replacement is $s_i^{*}$;
[0046] Finally, the final scores are weighted and summed according to the evaluation indicator weight set $W = \{w_1, \dots, w_5\}$ to obtain the final total score $S_{total}$:
[0047] $$S_{total} = \sum_{i=1}^{5} w_i\, s_i^{*}.$$
[0048] Preferably, the process by which the cloud service center collects the evaluation results information is as follows:
[0049] The cloud service center takes the weighted total score $S_{total}$ as the final learning quality score of the computing power provider;
[0050] This score can be used for subsequent quality-driven incentive allocation, reputation ranking updates, or client selection in federated learning.
[0051] In summary, the advantages of this invention are as follows:
[0052] 1) To address the FL quality assessment problem in CPN, this invention proposes a parallel collaborative assessment method for large and small models based on short-term confidence. This method allows the computation time of the large and small assessment models to overlap, enabling on-demand, parallel access to expensive LLM resources. This reduces the use of large models, while the parallel mechanism shortens the waiting time of large assessment models, thereby improving assessment accuracy, accelerating the overall assessment process, enhancing the scheduling response capability of CPN, and optimizing the utilization efficiency of computing resources.
[0053] 2) To address the limitations of existing methods in defending against advanced semantic attacks, this invention utilizes the advanced semantic reasoning capabilities of the evaluation model, coupled with a confidence-driven arbitration mechanism. This enables the method to flexibly and specifically evaluate advanced semantic metrics, enhancing the accuracy and robustness of the evaluation results.
[0054] 3) This invention ultimately provides a highly efficient and reliable model quality score (QoSLearning). This score can serve as the basis for "pay-for-quality" in CPN, used for subsequent quality-driven incentive allocation, reputation ranking updates, or client selection in FL. This incentivizes computing power providers to deliver high-quality, trustworthy learning outcomes, thereby forming a reliable closed loop of computing power cooperation.
Description of the Drawings
[0055] The accompanying drawings, which form part of this application, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
[0056] Figure 1 is a schematic diagram of a system model according to a preferred embodiment of the present invention;
[0057] Figure 2 is a schematic diagram of the process of parallel collaborative evaluation of large and small models based on short-term confidence according to a preferred embodiment of the present invention;
[0058] Figure 3 is a flowchart detailing step three in Figure 2.
Detailed Description of the Embodiments
[0059] This invention provides a parallel collaborative evaluation method for large and small models based on short-term confidence.
[0060] This invention provides a method for evaluating the quality of the answer $a$ that a computing power provider's local learning model gives to a question $q$. The evaluation is carried out through the collaboration of a large cloud model $M_L$ and a small model $M_S$. The specific steps are described below:
[0061] Step 1: The cloud service center provides the question and dynamically generates evaluation indicators:
[0062] The CPN under consideration includes a cloud service center and $N$ computing power providers. The cloud service center maintains a question library and deploys a large evaluation model $M_L$. One question $q$ is selected from the question library as the question to be evaluated. To address the issues of evaluation rigidity and domain generalization, the large evaluation model $M_L$ dynamically generates, according to the question $q$, an evaluation indicator set $E = \{e_1, \dots, e_5\}$ containing 5 evaluation indicators and a corresponding evaluation indicator weight set $W = \{w_1, \dots, w_5\}$.
[0063] Step 2: Collect model responses from computing power providers based on the question:
[0064] The cloud service center distributes the question $q$ to be evaluated to the computing power providers. Each computing power provider uses its local learning model to reason about the question and submits the answer $a$ it generates. The subsequent steps of this invention evaluate the quality of this answer $a$.
[0065] Step 3: Parallel collaborative evaluation by the large and small models based on short-term confidence:
[0066] Starting the short-term evaluation by the small model: the small evaluation model $M_S$ deployed in the cloud service center is started to perform a short-term evaluation of the answer $a$ on all evaluation indicators in $E$. That is, the small model evaluates each evaluation indicator $e_i$ separately, and when the evaluation output reaches $k$ smallest output units (tokens), it stops and moves on to the next evaluation indicator.
[0067] Collecting short-term statement confidence: collect the statement confidence $c_i^{s}$ of the small evaluation model $M_S$'s short-term evaluation of the answer $a$ for every evaluation indicator in $E$. The statement confidence is an intrinsic metric that quantifies how certain an evaluation model is about its own evaluation result. It is calculated as the arithmetic mean of the logprob values of all tokens in the evaluation output statement. Therefore, the short-term statement confidence $c_i^{s}$ is calculated as:
[0068] $$c_i^{s} = \frac{1}{k}\sum_{j=1}^{k} \mathrm{logprob}(t_{i,j}),$$
[0069] where $k$ is the number of tokens output by the small evaluation model in the short-term evaluation, $t_{i,j}$ is the $j$-th of the first $k$ tokens output for indicator $e_i$, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of the $j$-th of the first $k$ tokens.
[0070] Screening low statement confidence in the short-term evaluation: a dynamic short-term arbitration threshold $\tau^{s}$ is calculated using the formula:
[0071] $$\tau^{s} = \mu^{s} - \alpha\,\sigma^{s},$$
[0072] where $\mu^{s}$ and $\sigma^{s}$ are the mean and standard deviation of the set $C^{s} = \{c_1^{s}, \dots, c_5^{s}\}$ and $\alpha$ is an adjustable deviation factor. The system selects all evaluation indicators that satisfy $c_i^{s} < \tau^{s}$ to form the "preliminary arbitration set" $A_{pre}$.
[0073] Refining the small model's short-term evaluation and starting the large model's evaluation in parallel:
[0074] 1) Parallel track one, small model refinement: the system lets the small evaluation model $M_S$ continue the remaining evaluation and generate the complete evaluation output statements for all indicators, refining its evaluation results and ultimately obtaining the complete evaluation score $S^{S}$.
[0075] 2) Parallel track two, large model start: the system starts the large evaluation model $M_L$ in parallel, which evaluates only the evaluation indicators in the "preliminary arbitration set" $A_{pre}$ and produces the large model's preliminary arbitration score $S^{L}_{pre}$.
[0076] This parallel design lets the large evaluation model start without waiting for the small evaluation model to finish its output, eliminating the resource idle time of the serial mode.
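The overlap of the two tracks can be sketched with Python's standard `concurrent.futures` module. Both model calls below are hypothetical stubs that sleep briefly and return fixed scores; in a real deployment they would be calls to the small and large evaluation models, and the names `small_model_finish`, `large_model_score`, and `dual_track` are assumptions made for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def small_model_finish(indicators):
    """Stub for track one: the small model completes every indicator."""
    time.sleep(0.05)  # stands in for the remaining generation time
    return {ind: 3.0 for ind in indicators}


def large_model_score(indicators):
    """Stub for track two: the large model scores the given indicators."""
    time.sleep(0.05)  # stands in for large-model inference time
    return {ind: 4.5 for ind in indicators}


def dual_track(indicators, preliminary_set):
    """Run both tracks concurrently and return (small scores, large scores)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        track_one = pool.submit(small_model_finish, indicators)
        track_two = pool.submit(large_model_score, sorted(preliminary_set))
        return track_one.result(), track_two.result()


small_scores, preliminary_scores = dual_track(
    ["accuracy", "logical_consistency", "ethical_compliance"],
    {"ethical_compliance"},
)
```

Because both futures are submitted before either result is awaited, the large-model call runs while the small model is still finishing, which is the overlap that the serial workflow lacks.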
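[0077] Collecting statement confidence for the complete evaluation: collect the statement confidence $c_i^{f}$ of the small evaluation model $M_S$'s complete evaluation of the answer $a$ for every evaluation indicator in $E$:
[0078] $$c_i^{f} = \frac{1}{n_i}\sum_{j=1}^{n_i} \mathrm{logprob}(t_{i,j}),$$
[0079] where $n_i$ is the number of tokens in the small evaluation model's output statement for indicator $e_i$, $t_{i,j}$ is the $j$-th token in the evaluation statement, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of the $j$-th token in the evaluation statement.
[0080] Screening low statement confidence in the complete evaluation: based on the statistical distribution of the set $C^{f} = \{c_1^{f}, \dots, c_5^{f}\}$, the dynamic final arbitration threshold $\tau^{f}$ is calculated:
[0081] $$\tau^{f} = \mu^{f} - \beta\,\sigma^{f},$$
[0082] where $\mu^{f}$ and $\sigma^{f}$ are the mean and standard deviation of the set $C^{f}$ and $\beta$ is an adjustable deviation factor. The system selects all evaluation indicators that satisfy $c_i^{f} < \tau^{f}$ to form the "final low-confidence set" $A_{fin}$.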
[0083] Starting supplementary arbitration by the large model: the system compares the "preliminary arbitration set" $A_{pre}$ with the "final low-confidence set" $A_{fin}$:
[0084] If $A_{fin} \subseteq A_{pre}$, the short-term prediction is accurate and the results can be used directly. If $A_{fin} \not\subseteq A_{pre}$, the short-term prediction is biased (some indicator had high confidence in the short-term evaluation but low confidence in the complete evaluation). In this case, the system sequentially starts a large-model evaluation of only the difference $A_{fin} \setminus A_{pre}$, that is, the evaluation indicators that are in the "final low-confidence set" but not in the "preliminary arbitration set", yielding the supplementary arbitration score $S^{L}_{sup}$ and thus $S^{L} = S^{L}_{pre} \cup S^{L}_{sup}$.
[0085] Collaboration between the large and small models: the scores of the evaluation indicators whose statement confidence is low in the small evaluation model's complete evaluation are replaced by the scores of the large evaluation model. That is, for every evaluation indicator $e_i$ in the "final low-confidence set" $A_{fin}$, the final score $s_i^{*}$ is the large evaluation model's score $s_i^{L}$ in place of the small model's score $s_i^{S}$. The scores of the evaluation indicators whose statement confidence is high in the small evaluation model's complete evaluation remain unchanged; that is, for every evaluation indicator $e_i \notin A_{fin}$, the final score $s_i^{*}$ remains the small model's score $s_i^{S}$. The set of final scores after the replacement is $S^{*} = \{s_1^{*}, \dots, s_5^{*}\}$.
[0086] Finally, using the evaluation indicator weight set $W = \{w_1, \dots, w_5\}$ generated in step 1, the final total score $S_{total}$ is obtained by the weighted summation of all final scores:
[0087] $$S_{total} = \sum_{i=1}^{5} w_i\, s_i^{*}.$$
[0088] Step 4: The cloud service center collects the evaluation results information:
[0089] The cloud service center takes the weighted total score $S_{total}$ as the final learning quality score of the computing power provider's model. This score can be used for subsequent quality-driven incentive allocation, reputation ranking updates, or client selection in FL, thereby incentivizing computing power providers to deliver high-quality, trustworthy learning outcomes and forming a credible closed loop of computing power cooperation.
[0090] Figure 1 is a schematic diagram of the system model of a preferred embodiment of the present invention. The system model is built in a CPN environment and includes at least a cloud service center and $N$ computing power providers.
[0091] The cloud service center acts as the central coordinator of the system and is responsible for task distribution, model aggregation, quality assessment, and incentive allocation. In this embodiment, a large evaluation model $M_L$ is deployed within the cloud service center, which also maintains a question library. In addition, a small evaluation model $M_S$ is also deployed in the cloud service center.
[0092] Computing power providers: As participants in FL, they possess private data and local learning models. They receive evaluation tasks from the cloud service center and submit the responses from their local models.
[0093] Large evaluation model $M_L$: typically a large language model with high computing power and high precision, deployed in the cloud. It has two functions: first, it acts as an evaluation criterion generator, dynamically generating evaluation indicators; second, it acts as a high-level arbitrator, reviewing and evaluating the uncertain results of the small model.
[0094] Small evaluation model $M_S$: typically a small, low-cost, and highly efficient model. It acts as the primary evaluator, providing a quick and comprehensive preliminary evaluation of the answers from all computing power providers together with the confidence of its evaluation results.
[0095] Figure 2 illustrates the parallel collaborative evaluation process of the large and small models based on short-term confidence according to a preferred embodiment of the present invention. The evaluation method mainly includes the following steps:
[0096] Step 1: The cloud service center provides the question and dynamically generates evaluation indicators:
[0097] In this step, the large evaluation model $M_L$ of the cloud service center selects a question $q$ from the maintained question library as the question to be evaluated in this evaluation task. To address the issues of evaluation rigidity and domain generalization, the large evaluation model $M_L$ generates the evaluation indicators dynamically. Specifically, $M_L$ analyzes the domain and intent of the question $q$ and generates an evaluation indicator set $E = \{e_1, \dots, e_5\}$ containing 5 evaluation indicators. This set includes not only traditional objective indicators such as "accuracy" but, more importantly, advanced semantic indicators such as "reasonableness of the response," "logical consistency," "ethical compliance," or "clinical applicability." At the same time, the large evaluation model $M_L$ generates, according to the key points of the question $q$, a corresponding weight set $W = \{w_1, \dots, w_5\}$ for these five evaluation indicators, where $w_i$ is the weight of indicator $e_i$.
[0098] Step 2: Collect responses from computing power providers based on the questions:
[0099] In this step, the cloud service center distributes the question $q$ determined in step one, together with the dynamically generated evaluation indicator set $E$ (and its weights $W$), to the $N$ computing power providers. Each computing power provider uses its local learning model to reason about the question $q$ and generates an answer $a$.
[0100] Step 3: Parallel collaborative evaluation by the large and small models based on short-term confidence:
[0101] This step is the core innovation of this invention and aims to remove the high overhead and process efficiency bottleneck inherent in large-model evaluation. Figure 3 shows the detailed process of step three.
[0102] 1. Initiate a short-term evaluation of the small model:
[0103] The cloud service center passes the collected answer $a$ to the small evaluation model $M_S$, which starts a short-term evaluation of the answer $a$. Specifically, $M_S$ starts evaluating indicator $e_1$; when its evaluation output reaches the preset $k$ tokens, $M_S$ suspends the evaluation of $e_1$ and switches to evaluating $e_2$, and so on, until a $k$-token short-term evaluation has been performed once for each of the 5 evaluation indicators. The parameter $k$ is an adjustable integer, such as 20 or 30, chosen to obtain an effective confidence prediction with minimal computation.
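One way to picture the suspend-and-switch behavior is with Python generators, which can be paused after `k` items and resumed later. The sketch below is only an illustration: `evaluate_stream` is a hypothetical stand-in for the small evaluation model that yields made-up (token, logprob) pairs, and a real inference engine would need its own mechanism for pausing and resuming generation.

```python
from itertools import islice


def evaluate_stream(indicator, answer):
    """Hypothetical small-model stub: yields (token, logprob) pairs lazily."""
    text = f"{indicator} assessment of answer: score 4"
    for i, token in enumerate(text.split()):
        yield token, -0.1 * (i + 1)  # made-up logprob values


def short_term_pass(indicators, answer, k):
    """Take the first k tokens per indicator, keeping each stream paused."""
    streams, prefixes = {}, {}
    for ind in indicators:
        stream = evaluate_stream(ind, answer)
        prefixes[ind] = list(islice(stream, k))  # stop after k tokens
        streams[ind] = stream                    # suspended, not finished
    return streams, prefixes


def resume_all(streams, prefixes):
    """Resume every suspended stream and return the complete outputs."""
    return {ind: prefixes[ind] + list(streams[ind]) for ind in streams}


streams, prefixes = short_term_pass(["accuracy", "logic"], "some answer", k=3)
complete = resume_all(streams, prefixes)
```

The prefixes supply the logprobs for the short-term confidence, and resuming the same streams avoids regenerating the first `k` tokens when the small model later completes its evaluation.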
[0104] 2. Collection of confidence scores for short-term assessment statements:
[0105] While item 1 is being executed, the system collects in real time the logprob values of the $k$ tokens that $M_S$ outputs for each of the 5 indicators. Based on these logprob values, the system calculates 5 short-term statement confidences, forming the set $C^{s} = \{c_1^{s}, \dots, c_5^{s}\}$. The statement confidence is an intrinsic indicator that quantifies how certain the evaluation model (here $M_S$) is about its own evaluation result; it is defined as the arithmetic mean of the logprob values of all tokens in the evaluation statement. Therefore, the short-term statement confidence $c_i^{s}$ is calculated as:
[0106] $$c_i^{s} = \frac{1}{k}\sum_{j=1}^{k} \mathrm{logprob}(t_{i,j}),$$
[0107] where $k$ is the number of tokens output by the small model in the short-term evaluation, $t_{i,j}$ is the $j$-th of the first $k$ tokens output for indicator $e_i$, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of the $j$-th of the first $k$ tokens.
[0108] 3. Screening of low statement confidence in the short-term evaluation (preliminary arbitration decision):
[0109] The system has now collected 5 short-term statement confidences $c_i^{s}$. A dynamic threshold is then needed to decide which indicators are likely to have low confidence. This embodiment uses a statistical method to calculate the dynamic short-term arbitration threshold $\tau^{s}$:
[0110] $$\tau^{s} = \mu^{s} - \alpha\,\sigma^{s},$$
[0111] where $\mu^{s}$ and $\sigma^{s}$ are the mean and standard deviation of the set $C^{s} = \{c_1^{s}, \dots, c_5^{s}\}$, and $\alpha$ is an adjustable deviation factor.
[0112] The system selects all evaluation indicators $e_i$ that satisfy $c_i^{s} < \tau^{s}$ to form the "preliminary arbitration set" $A_{pre}$. This set represents the indicators for which the system predicts that the small model $M_S$ will still have low confidence even after the complete evaluation.
[0113] 4. Refine the small model's short-term evaluation and start the large model's evaluation in parallel:
[0114] This step is the core parallel mechanism of this invention. As shown in the flowchart of Figure 3, the system simultaneously starts two parallel computation tracks:
[0115] 1) Parallel track one (small model refinement): the system lets the small evaluation model $M_S$ resume the evaluation tasks suspended in item 1, that is, generate the remaining output after the short-term evaluation for all evaluation indicators, to refine its evaluation results. The goal of this track is to obtain $M_S$'s complete evaluation score $S^{S}$ and complete statement confidence $c_i^{f}$ (see item 5).
[0116] 2) Parallel track two (large model start): the system starts the large evaluation model $M_L$ in parallel. $M_L$ evaluates only the evaluation indicators in the "preliminary arbitration set" $A_{pre}$ (it does not need to evaluate all evaluation indicators). The goal of this track is to obtain the large model's preliminary arbitration score $S^{L}_{pre}$.
[0117] This parallel design makes the computation time of the large evaluation model $M_L$ (track two) overlap with the remaining computation time of the small evaluation model $M_S$ (track one), eliminating the resource idle time of the serial mode, in which $M_L$ must wait until $M_S$ has completed all its tasks before it can start.
[0118] 5. Collection of statement confidence for the complete evaluation (final confidence):
[0119] After parallel track one is completed, the system obtains $M_S$'s complete evaluation of all evaluation indicators and then calculates the final statement confidence $c_i^{f}$:
[0120] $$c_i^{f} = \frac{1}{n_i}\sum_{j=1}^{n_i} \mathrm{logprob}(t_{i,j}),$$
[0121] where $n_i$ is the number of tokens in the small evaluation model's output statement for indicator $e_i$, $t_{i,j}$ is the $j$-th token in the statement, and $\mathrm{logprob}(t_{i,j})$ is the logprob value of the $j$-th token in the statement.
[0122] 6. Screening of low statement confidence in the complete evaluation (final arbitration decision):
[0123] The system uses an approach similar to item 3, but based on the final statement confidence set $C^{f} = \{c_1^{f}, \dots, c_5^{f}\}$, to calculate the dynamic final arbitration threshold $\tau^{f}$:
[0124] $$\tau^{f} = \mu^{f} - \beta\,\sigma^{f},$$
[0125] where $\mu^{f}$ and $\sigma^{f}$ are the mean and standard deviation of the set $C^{f}$ and $\beta$ is an adjustable deviation factor. The system selects all evaluation indicators $e_i$ that satisfy $c_i^{f} < \tau^{f}$ to form the "final low-confidence set" $A_{fin}$. This set represents the indicators for which $M_S$ actually has low confidence.
[0126] 7. Start the large model's supplementary arbitration (decision verification):
[0127] This step is the decision verification phase, in which the system compares the "preliminary arbitration set" $A_{pre}$ with the "final low-confidence set" $A_{fin}$. Scenario 1: if $A_{fin} \subseteq A_{pre}$, the short-term prediction in item 3 was accurate, and every indicator that the large evaluation model $M_L$ needs to evaluate has already been evaluated in parallel track two of item 4 (yielding $S^{L}_{pre}$); in this case $M_L$ needs no additional computation. Scenario 2: if $A_{fin} \not\subseteq A_{pre}$, which is equivalent to $A_{fin} \setminus A_{pre}$ being non-empty, the short-term prediction was biased. In this case the system must sequentially start the large evaluation model $M_L$ to perform a supplementary evaluation of only the difference $A_{fin} \setminus A_{pre}$ and obtain the supplementary arbitration score $S^{L}_{sup}$. The final overall evaluation score of the large model $S^{L}$ is the union of the preliminary arbitration score and the supplementary arbitration score, that is, $S^{L} = S^{L}_{pre} \cup S^{L}_{sup}$.
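The decision verification step reduces to a set difference, which the following Python sketch illustrates. The function name `supplementary_arbitration` and the callable `large_model_score` are hypothetical; the callable stands in for a large-model call that takes a list of indicators and returns a score per indicator.

```python
def supplementary_arbitration(preliminary_set, final_set,
                              preliminary_scores, large_model_score):
    """Score only the indicators the short-term prediction missed.

    Returns the merged large-model scores and the set of missed indicators.
    """
    missing = final_set - preliminary_set  # in the final set, not in the preliminary set
    # The large model is invoked again only when something was missed.
    supplementary = large_model_score(sorted(missing)) if missing else {}
    return {**preliminary_scores, **supplementary}, missing


calls = []

def fake_large_model(indicators):
    """Stub that records each call and returns a fixed score."""
    calls.append(list(indicators))
    return {ind: 5.0 for ind in indicators}

# Scenario 1: the final set is covered, so no extra call is made.
scores_1, missing_1 = supplementary_arbitration(
    {"ethics", "logic"}, {"ethics"}, {"ethics": 4.5, "logic": 4.0}, fake_large_model)

# Scenario 2: "logic" was missed by the short-term prediction.
scores_2, missing_2 = supplementary_arbitration(
    {"ethics"}, {"ethics", "logic"}, {"ethics": 4.5}, fake_large_model)
```

In scenario 1 the stub is never called, while in scenario 2 it is called once for the single missed indicator, so the extra sequential cost grows only with the size of the prediction error.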
[0128] 8. Collaboration between large and small models (score aggregation):
[0129] The system aggregates all scores to generate the final evaluation result. The aggregation logic is as follows:
[0130] 1) Replace low-confidence scores: for every indicator $e_i$ in the "final low-confidence set" $A_{fin}$, the final score $s_i^{*}$ is replaced with the large model $M_L$'s evaluation score $s_i^{L}$.
[0131] 2) Retain high-confidence scores: for every high-confidence indicator $e_i \notin A_{fin}$, the final score $s_i^{*}$ remains the small model $M_S$'s evaluation score $s_i^{S}$.
[0132] Finally, using the evaluation indicator weight set $W = \{w_1, \dots, w_5\}$ generated in step one, all final scores $s_i^{*}$ (after replacement or retention) are weighted and summed to obtain the final total score $S_{total}$:
[0133] $$S_{total} = \sum_{i=1}^{5} w_i\, s_i^{*}.$$
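The aggregation logic can be written as a short Python sketch. The function name `aggregate`, the three-indicator example, and all score and weight values are hypothetical and chosen only so the arithmetic is easy to follow; the method itself uses five indicators.

```python
def aggregate(small_scores, large_scores, final_low_confidence, weights):
    """Replace low-confidence small-model scores, then take the weighted sum."""
    final = {
        ind: large_scores[ind] if ind in final_low_confidence else score
        for ind, score in small_scores.items()
    }
    total = sum(weights[ind] * final[ind] for ind in final)
    return final, total


small = {"accuracy": 4.0, "logic": 3.0, "ethics": 2.0}
large = {"ethics": 5.0}                      # only arbitrated indicators
weights = {"accuracy": 0.5, "logic": 0.3, "ethics": 0.2}

final_scores, total_score = aggregate(small, large, {"ethics"}, weights)
# total_score = 0.5 * 4.0 + 0.3 * 3.0 + 0.2 * 5.0
```

Only the `ethics` score changes in this example, since it is the sole member of the low-confidence set; the other two indicators keep the small model's scores.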
[0134] Step 4: The cloud service center collects the evaluation results information:
[0135] In this step, the cloud service center takes the weighted total score $S_{total}$ from step three as the final learning quality score of the computing power provider for this task. The score can be applied to the management of the computing power network, for example: quality-driven incentive allocation, where a computing power provider with a higher $S_{total}$ receives more incentives (such as tokens or service priority); reputation ranking updates, where the score serves as the basis for updating the long-term reputation score of the computing power provider; and client selection in FL, where computing power providers with high historical scores are given priority to participate in subsequent FL training.
[0136] Through steps one through four, this invention establishes a trusted closed loop for computing power cooperation, incentivizing computing power providers to deliver high-quality, trustworthy learning outcomes. Through the specific implementation methods described above, this invention achieves the following beneficial effects:
[0137] 1. Significantly improved evaluation efficiency: The parallel collaboration mechanism greatly shortens the waiting time for large models and accelerates the overall evaluation process.
[0138] 2. Enhanced robustness assessment: By leveraging the advanced semantic reasoning capabilities of large models, the limitations of traditional numerical assessments are overcome.
[0139] 3. Resource utilization optimization: It enables on-demand and parallel access to expensive LLM resources, significantly optimizing the efficiency of computing resource utilization.
[0140] 4. Credibility guarantee for the incentive mechanism: The method provides an efficient, reliable, and semantically deep model quality score (QoSLearning), which gives a solid foundation for quality-driven incentives in computing power networks.
[0141] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A parallel collaborative evaluation method for large and small models based on short-term confidence, characterized in that it includes the following steps: S1: The cloud service center provides a question and dynamically generates evaluation indicators and their weights; S2: Answers are collected from multiple computing power providers based on the question; S3: Parallel collaborative evaluation by large and small models based on short-term confidence is performed on the answers obtained in S2 to generate the final score; S4: The cloud service center collects the evaluation result information.
2. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 1, characterized in that S1 includes: S1.1: Deploying the large evaluation model in the cloud service center and selecting one question from the question library as the question to be evaluated; S1.2: The large evaluation model dynamically generates, based on the question to be evaluated, an evaluation indicator set containing 5 evaluation indicators and a corresponding evaluation indicator weight set.
3. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 1, characterized in that, in S2, the cloud service center distributes the question to be evaluated to multiple computing power providers, and each computing power provider uses its local learning model to perform inference on the question and submits the answer it generates.
4. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 2, characterized in that S3 includes: S3.1: Activating the small evaluation model to perform a short-term evaluation of the answers and output the short-term statement confidence of each indicator; S3.2: Selecting the "preliminary arbitration set" based on a dynamic short-term arbitration threshold; S3.3: Executing a dual-track evaluation in parallel: Track 1: the small evaluation model continues to run and produces the complete evaluation scores; Track 2: the large evaluation model evaluates the evaluation indicators in the "preliminary arbitration set" to obtain the preliminary arbitration scores; S3.4: Collecting the statement confidence of the small evaluation model's complete evaluation for every evaluation indicator, and selecting the "final low-confidence set" based on a dynamic final arbitration threshold; S3.5: Starting the large evaluation model to perform supplementary arbitration and generate the final evaluation result.
5. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 4, characterized in that the short-term statement confidence is calculated by a formula whose quantities are: the number of leading elements taken from the output of the small evaluation model, each of those leading elements, and the value of each of those leading elements.
6. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 4, characterized in that the screening for low statement confidence with the dynamic short-term arbitration threshold is as follows: the dynamic short-term arbitration threshold is calculated by a formula that includes an adjustable deviation factor; the system selects all evaluation indicators that satisfy the threshold condition, and these constitute the "preliminary arbitration set".
7. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 4, characterized in that the screening with the dynamic final arbitration threshold includes: collecting the statement confidence of the small evaluation model's complete evaluation of the answers for every evaluation indicator, where the statement confidence is calculated by a formula whose quantities are the number of elements in the statement output by the small evaluation model, each element in the statement, and the value of each element; calculating the dynamic final arbitration threshold from the statistical distribution of the collected set of statement confidences, using an adjustable deviation factor; and selecting all evaluation indicators that satisfy the threshold condition, which constitute the "final low-confidence set".
8. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 4, characterized in that S3.5 includes: comparing the "preliminary arbitration set" with the "final low-confidence set"; if every indicator in the "final low-confidence set" is contained in the "preliminary arbitration set", the short-term prediction is accurate; if the difference set is non-empty, the short-term prediction is biased, in which case the large evaluation model is started sequentially and performs a supplementary evaluation only on the difference set to obtain the supplementary arbitration scores, thereby yielding the overall evaluation scores of the large model.
9. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 8, characterized in that S3 also includes: the scores of all evaluation indicators in the "final low-confidence set" are replaced with the evaluation scores given by the large model; the scores of all evaluation indicators with high statement confidence remain unchanged; and the final scores after replacement are weighted and summed according to the evaluation indicator weight set to obtain the final total score.
10. The parallel collaborative evaluation method for large and small models based on short-term confidence as described in claim 1, characterized in that the process by which the cloud service center collects the evaluation result information is as follows: the cloud service center takes the calculated weighted total score as the final learning quality score of the computing power provider; this score can be used for subsequent quality-driven incentive allocation, reputation ranking updates, or client selection in federated learning.