Meta-evaluation method for model instruction compliance, electronic device, and storage medium

By constructing partial order relations and meta-evaluation samples, the method addresses the inability of existing instruction compliance evaluations of the judge model to reflect its quality ranking ability, enabling accurate evaluation of the judge model and improving the reliability and accuracy of the evaluation results.

CN122198010A · Pending · Publication Date: 2026-06-12 · BEIJING KNOWLEDGE ATLAS TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING KNOWLEDGE ATLAS TECHNOLOGY CO LTD
Filing Date
2026-03-04
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, the instruction compliance evaluation method of the referee model cannot effectively reflect its ability to rank the quality of instruction responses, resulting in inaccurate evaluation results.

Method used

By acquiring instructions and their responses in natural language form, a partial order relation is constructed using the truth value judgment results. Meta-evaluation samples are then constructed to evaluate the instruction compliance of the referee model. The instruction compliance meta-evaluation results of the referee model are obtained, and the quality ranking ability of the referee model is evaluated by using the consistency between the predicted partial order relation and the truth value partial order relation.

Benefits of technology

It enables accurate assessment of the referee model's ability to follow instructions, reflects its ability to rank the quality of instruction responses, and improves the accuracy and reliability of the assessment.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to the technical field of artificial intelligence, and in particular provides a meta-evaluation method for model instruction compliance, an electronic device, and a storage medium, aiming to solve the problem of how to effectively and reliably evaluate a judge model's ability to judge instruction compliance. The method comprises: obtaining an instruction in natural language form; obtaining the replies corresponding to the instruction and a set of truth-value judgment results indicating whether each reply follows each constraint in the instruction; obtaining, according to the truth-value judgment results, a set of truth-value partial order relations between different replies to the instruction; constructing a meta-evaluation sample corresponding to the instruction according to the instruction, its replies, and the set of partial order relations; and evaluating the instruction compliance of the judge model using the meta-evaluation sample. The instruction compliance meta-evaluation result obtained with this method can accurately reflect the judge model's ability to rank the quality of instruction replies.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically providing a meta-evaluation method for model instruction compliance, an electronic device, and a storage medium.

Background Technology

[0002] Instruction compliance is a fundamental capability of Large Language Models (LLMs). Users can explicitly define task requirements in instructions and impose constraints on the model's output (i.e., the model's response or reply to the instruction). The ability to comply with instructions ensures that the model's output, after processing the instruction, satisfies the task requirements and constraints as much as possible. To improve the instruction compliance capability of LLMs, a judge model is currently used to evaluate whether the target model's (the LLM to be evaluated) response to the instruction follows the constraints. The target model is then optimized based on the evaluation results to improve its instruction compliance capability. For example, reinforcement learning can be used to optimize the target model. In this method, the judge model's ability to judge instruction compliance affects the accuracy of the evaluation results. Therefore, it is necessary to evaluate the judge model's ability to judge instruction compliance; this evaluation can be understood as a meta-evaluation of the instruction compliance capability.

[0003] In some application scenarios, the adjudicator model not only needs to evaluate whether the responses comply with the constraints in the instructions (i.e., it needs to have constraint verification capability), but also needs to accurately rank multiple responses according to their compliance quality (i.e., it needs to have quality ranking capability). In other words, the adjudicator model's instruction compliance capability must simultaneously reflect constraint verification capability and quality ranking capability. However, current meta-evaluation methods for model instruction compliance mainly control the adjudicator model to use pairwise comparison methods or the Best-of-N (BoN) method, selecting the best response from multiple responses with different instruction compliance qualities, and then evaluating the adjudicator model's judgment capability on instruction compliance based on the selection result. This method only reflects the adjudicator model's ability to select a single best response, ignoring the complex partial order relationships between multiple responses, and thus failing to reflect the adjudicator model's ability to rank the quality of instruction responses. The partial order relationship indicates whether the instruction compliance quality of one response is superior to that of another.

[0004] Accordingly, a new technical solution is needed in this field to solve the above problems.

Summary of the Invention

[0005] This application aims to solve the above-mentioned technical problems, namely, to solve or at least partially solve the following technical problems: how to effectively and reliably evaluate the judge model's judgment ability on instruction compliance, so that the evaluation results can accurately reflect the judge model's ability to rank the quality of instruction responses, thereby enabling the judge model to be used to conduct accurate and reliable instruction compliance evaluation on the large language model to be evaluated.

[0006] In a first aspect, this application provides a meta-evaluation method for model instruction compliance, the method comprising:

[0007] Obtain instructions in natural language form, the instructions including at least one constraint;

[0008] Obtain the replies corresponding to the instruction and a set of truth-value judgment results, where the replies are indexed from the first reply up to the total number of replies, every reply is in natural language form, and the set comprises the truth-value judgment result of whether each reply to the instruction follows each constraint in the instruction;

[0009] According to the truth-value judgment results in that set, obtain a set of partial order relations corresponding to the instruction, which comprises the truth-value partial order relations between different replies to the instruction, where a partial order relation indicates whether the instruction compliance quality of one reply is superior to that of another reply;

[0010] According to the instruction, its corresponding replies, and the set of partial order relations, construct the meta-evaluation sample corresponding to the instruction, and use the meta-evaluation sample to evaluate the instruction compliance of the judge model, the judge model being a large language model.

[0011] The instruction compliance evaluation of the judge model includes:

[0012] Controlling the judge model to process the meta-evaluation sample to obtain reply ranking results, which include the predicted partial order relations between different replies to the instruction;

[0013] Obtaining the instruction compliance meta-evaluation result of the judge model based on the predicted partial order relations and the truth-value partial order relations represented by the set of partial order relations.

[0014] In one technical solution of the above meta-evaluation method, the step of obtaining the set of partial order relations corresponding to the instruction according to the truth-value judgment results includes:

[0015] Based on the truth-value judgment results, obtain the partial order relations that satisfy a preset condition;

[0016] Obtain the truth-value partial order relations according to those partial order relations;

[0017] The preset condition is expressed over two different replies to the instruction and the relation that the instruction compliance quality of one is higher than that of the other. It is stated in terms of the truth-value judgment results of whether each of the two replies follows each constraint of the instruction, where each judgment result is either "followed" or "not followed", and in terms of the total number of constraints.

[0018] In one technical solution of the above meta-evaluation method, the step of obtaining the truth-value partial order relations according to the partial order relations includes:

[0019] Verify each partial order relation against preset verification conditions to determine whether it is an abnormal partial order relation; if it is determined to be abnormal, remove it.

[0020] In one technical solution of the above meta-evaluation method, the preset verification conditions include at least the following first verification condition and second verification condition:

[0021] The first verification condition is: if neither of the two replies in the partial order relation follows the same constraint, and the degree of non-compliance of one reply is lower than that of the other, the partial order relation is determined to be an abnormal partial order relation;

[0022] The second verification condition is: if the evaluation results of the two replies differ significantly in a preset evaluation dimension, with the evaluation result of one reply being better than that of the other, the partial order relation is determined to be an abnormal partial order relation, where the preset evaluation dimension is another evaluation dimension that is unrelated to instruction compliance evaluation.

[0023] In one technical solution of the above meta-evaluation method, the step of obtaining instructions in natural language form includes:

[0024] Obtain multiple initial instructions, which have different instruction types and / or constraint types, wherein the constraint type indicates the type of constraint condition in the instruction;

[0025] The multiple initial instructions are filtered to obtain the final instructions;

[0026] The filtering of the multiple initial instructions includes:

[0027] For each initial instruction, a first large language model is used to evaluate the quality and complexity of the initial instruction to obtain a quality level and a complexity score. The quality level and complexity score are positively correlated with the quality and complexity of the initial instruction, respectively.

[0028] The initial instruction with the highest quality level and a complexity score greater than a set threshold is obtained, and the initial instruction is used as a candidate instruction.

[0029] The candidate instructions are clustered to obtain multiple clusters;

[0030] The candidate instruction with the highest complexity score in each cluster is obtained, and the final instruction is obtained based on the candidate instruction.

[0031] In one technical solution of the above meta-evaluation method, obtaining the final instruction based on the candidate instructions includes:

[0032] The candidate instruction is validated to determine whether it is an abnormal instruction; if it is determined to be an abnormal instruction, it is removed; otherwise, it is used as the final instruction.

[0033] In one technical solution of the above meta-evaluation method, the instruction type includes single-turn interactive type, multi-turn interactive type, and system prompt word guided type;

[0034] The constraint types include primary constraint types and constraint combination types. The primary constraint types include numerical, format, content, language, style, scene, and action types. The constraint combination types include single, parallel, chained, and selection types.

[0035] The single type indicates that the response needs to comply with one constraint;

[0036] The parallel type indicates that the response needs to comply with multiple constraints simultaneously;

[0037] The chained type indicates that the response requires the completion of multiple tasks in sequence, and each task needs to follow its own corresponding constraints.

[0038] The selection type indicates that the response requires selecting the correct selection branch based on preset conditions and following the constraints of the selection branch.
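The instruction and constraint taxonomy listed above can be written down as a small set of enumerations. The following is only an illustrative sketch: the class names, the `InstructionLabel` record, and the example label are hypothetical and do not appear in the application; only the category names come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class InstructionType(Enum):
    SINGLE_TURN = "single-turn interactive"
    MULTI_TURN = "multi-turn interactive"
    SYSTEM_PROMPT_GUIDED = "system prompt guided"


class PrimaryConstraintType(Enum):
    NUMERICAL = "numerical"
    FORMAT = "format"
    CONTENT = "content"
    LANGUAGE = "language"
    STYLE = "style"
    SCENE = "scene"
    ACTION = "action"


class ConstraintCombinationType(Enum):
    SINGLE = "single"        # the reply must follow one constraint
    PARALLEL = "parallel"    # the reply must follow several constraints at once
    CHAINED = "chained"      # several tasks in sequence, each with its own constraints
    SELECTION = "selection"  # pick the correct branch, then follow that branch's constraints


@dataclass(frozen=True)
class InstructionLabel:
    """Hypothetical record tagging one instruction with its types."""
    instruction_type: InstructionType
    primary_types: frozenset
    combination_type: ConstraintCombinationType


# Example label (an assumption for illustration): a single-turn instruction
# carrying numerical and content constraints that must all hold together.
label = InstructionLabel(
    instruction_type=InstructionType.SINGLE_TURN,
    primary_types=frozenset(
        {PrimaryConstraintType.NUMERICAL, PrimaryConstraintType.CONTENT}
    ),
    combination_type=ConstraintCombinationType.PARALLEL,
)
```

Tagging each initial instruction this way would make it straightforward to check that a collected pool covers different instruction types and constraint types.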

[0039] In one technical solution of the above meta-evaluation method, the step of obtaining the replies corresponding to the instruction includes: processing the instruction with a second large language model to generate multiple different replies to the instruction.

[0040] In a second aspect, an electronic device is provided, comprising at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program that, when executed by the at least one processor, implements the method described in any of the technical solutions provided in the first aspect.

[0041] In a third aspect, a computer-readable storage medium is provided, wherein a plurality of program codes are stored therein, the program codes being adapted to be loaded and executed by a processor to perform the method described in any of the technical solutions provided in the first aspect above.

[0042] The above-described technical solutions of this application have at least one or more of the following beneficial effects:

[0043] In one technical solution for implementing the meta-evaluation method for model instruction compliance provided in this application, the method may include the following steps: obtaining an instruction in natural language form, the instruction including at least one constraint; obtaining the replies corresponding to the instruction, all in natural language form, together with a set of truth-value judgment results indicating whether each reply follows each constraint in the instruction; obtaining, according to the truth-value judgment results in that set, a set of partial order relations corresponding to the instruction, which comprises the truth-value partial order relations between different replies to the instruction, where a partial order relation indicates whether the instruction compliance quality of one reply is superior to that of another; constructing the meta-evaluation sample corresponding to the instruction from the instruction, its replies, and the set of partial order relations; and using the meta-evaluation sample to evaluate the instruction compliance of the judge model, which is a large language model. This evaluation includes controlling the judge model to process the meta-evaluation sample to obtain reply ranking results, which include the predicted partial order relations between different replies to the instruction, and obtaining the instruction compliance meta-evaluation result of the judge model based on the predicted partial order relations and the truth-value partial order relations represented by the set.

[0044] In the above implementation scheme, a higher consistency between the predicted partial order relation and the true partial order relation indicates a better ability of the adjudicator model to rank the quality of instruction responses. Therefore, the instruction compliance meta-evaluation results obtained using the predicted partial order relation and the true partial order relation can reflect the adjudicator model's ability to rank the quality of instruction responses. This overcomes the deficiency of existing technologies where the instruction compliance ability obtained from evaluation cannot reflect the adjudicator model's ability to rank the quality of instruction responses.

Attached Figure Description

[0045] The disclosure of this application will become more readily understood with reference to the accompanying drawings. It will be readily understood by those skilled in the art that these drawings are for illustrative purposes only and are not intended to limit the scope of protection of this application. Wherein:

[0046] Figure 1 is a first schematic flowchart of the main steps of a meta-evaluation method according to an embodiment of this application;

[0047] Figure 2 is a schematic diagram of the instruction, replies, and partial order relations of a meta-evaluation sample according to an embodiment of this application;

[0048] Figure 3 is a first schematic diagram of partial order relations according to an embodiment of this application;

[0049] Figure 4 is a second schematic diagram of partial order relations according to an embodiment of this application;

[0050] Figure 5 is a schematic flowchart of the main steps of obtaining instructions according to an embodiment of this application;

[0051] Figure 6 is a second schematic flowchart of the main steps of a meta-evaluation method according to an embodiment of this application;

[0052] Figure 7 is a third schematic flowchart of the main steps of a meta-evaluation method according to an embodiment of this application;

[0053] Figure 8 is a schematic diagram of the main structure of an electronic device according to an embodiment of this application.

Detailed Implementation

[0054] Some embodiments of this application are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are merely illustrative of the technical principles of this application and are not intended to limit the scope of protection of this application.

[0055] The following describes an embodiment of the meta-evaluation method for model instruction compliance provided in this application.

[0056] First, refer to Figure 1, which is a first schematic flowchart of the main steps of a meta-evaluation method according to an embodiment of this application. As shown in Figure 1, the meta-evaluation method in this embodiment mainly includes the following steps S101 to S105.

[0057] Step S101: Obtain instructions in natural language form, the instructions including at least one constraint.

[0058] In other words, an instruction is a message in natural language form. For example, as shown in Figure 2, the instruction could be "Create a meaningful sentence with a length of at least 8 words and a total number of different letters not exceeding 10. Also, please list all the different letters used in the sentence." This instruction contains four constraints: C1, C2, C3, and C4.

[0059] C1: Create a meaningful sentence

[0060] C2: Sentence length must be at least 8 words.

[0061] C3: The total number of different letters used does not exceed 10.

[0062] C4: List all the different letters used in the sentence.

[0063] Step S102: Obtain the replies corresponding to the instruction and a set of truth-value judgment results, the set comprising the truth-value judgment result of whether each reply to the instruction follows each constraint in the instruction.

[0064] The truth value judgment result is either "follows" or "does not follow".

[0065] An instruction corresponds to a number of different replies, each identified by its index among the replies to that instruction, and all replies are in natural language form. For example, the instruction shown in Figure 2 has the following 4 corresponding replies:

[0066] Reply 1: I see three green trees in the street. e, g, h, i, n, r, s, t

[0067] Reply 2: She sees bees at the tree.

[0068] Reply 3: The small cat sleeps on the warm mat. a, c, e, h, l, m, n, o, p, r, s, t, w

[0069] Reply 4: The brown fox is running.

[0070] This application embodiment uses meta-evaluation samples to evaluate the instruction compliance ability of the referee model, rather than using the referee model to evaluate the instruction compliance ability of the model to be evaluated. Therefore, the above-mentioned response to the instruction is not a response generated by the model to be evaluated in processing the instruction. The model to be evaluated can be a large language model.

[0071] Step S103: According to the truth-value judgment results in the set, obtain the set of partial order relations corresponding to the instruction, which comprises the truth-value partial order relations between different replies to the instruction.

[0072] A partial order relation indicates whether the instruction compliance quality of one reply is superior to that of another. For example, if the instruction compliance quality of one reply is superior to that of another, a partial order relation holds between the two replies. An instruction contains constraints on the reply, and the instruction compliance quality of a reply is determined by whether it follows these constraints: the more constraints a reply follows, the higher its instruction compliance quality. Continuing with the example of Figure 2, the instruction contains four constraints, C1, C2, C3, and C4, and Replies 1, 2, 3, and 4 follow 4, 2, 2, and 1 constraints respectively. Reply 1 therefore has the highest instruction compliance quality, while Replies 2 and 3 each have higher compliance quality than Reply 4. The partial order relations among these four replies are thus: Reply 1 is superior to the other three replies, Reply 2 is superior to Reply 4, and Reply 3 is superior to Reply 4.
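The informal rule in this example, that a reply following more constraints has higher compliance quality, can be sketched in a few lines. This is a simplified illustration of the worked example only, not the preset condition defined in the claims. The per-constraint truth values below are assumptions chosen so that the totals match the counts 4, 2, 2, and 1 given in the text, and `build_partial_order` is a hypothetical helper name.

```python
from itertools import permutations

# Hypothetical truth-value table for the Figure 2 example: one row per reply,
# one boolean per constraint C1..C4 (True means "follows"). Only the per-reply
# totals (4, 2, 2, 1) come from the text; the individual cells are assumed.
truth_values = {
    1: [True, True, True, True],
    2: [True, False, True, False],
    3: [True, True, False, False],
    4: [True, False, False, False],
}


def build_partial_order(truth_values):
    """Return the set of pairs (a, b) meaning reply a is superior to reply b.

    Replies that follow the same number of constraints are left unordered,
    which is why the result is a partial order rather than a full ranking.
    """
    followed = {reply: sum(row) for reply, row in truth_values.items()}
    return {
        (a, b)
        for a, b in permutations(followed, 2)
        if followed[a] > followed[b]
    }


partial_order = build_partial_order(truth_values)
print(sorted(partial_order))
# [(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)]
```

Replies 2 and 3 follow the same number of constraints, so no relation is produced between them, matching the five relations listed in the text.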

[0073] In some implementations, when obtaining the set of partial order relations, preliminary partial order relations (i.e., initial partial order relations) between different replies can first be determined based on the instruction and the compliance quality requirements. These initial partial order relations are then manually verified, abnormal initial partial order relations are removed, and the remaining initial partial order relations are used as the final partial order relations. Taking the partial order relation diagrams shown in Figures 3 and 4 as an example, Figure 3 illustrates the initial partial order relations between five different replies. After manual verification, two of these initial partial order relations were removed. Each relation indicates that the instruction compliance quality of one reply is superior to that of another; the removed relations have the same form, so they are not described further here.

[0074] Step S104: Construct the meta-evaluation sample corresponding to the instruction from the instruction, its corresponding replies, and the set of partial order relations; that is, the meta-evaluation sample includes the instruction, the replies, and the set.
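One way to picture such a sample is as a small immutable record. The sketch below is an assumption for illustration: the class name `MetaEvalSample`, the 0-based indexing of replies, and the consistency checks are not specified in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaEvalSample:
    """Hypothetical container for one meta-evaluation sample."""
    instruction: str
    replies: tuple
    # Pairs (i, j) of 0-based reply indices: reply i is superior to reply j.
    partial_order: frozenset

    def __post_init__(self):
        n = len(self.replies)
        for i, j in self.partial_order:
            if not (0 <= i < n and 0 <= j < n) or i == j:
                raise ValueError(f"pair ({i}, {j}) does not name two distinct replies")
            if (j, i) in self.partial_order:
                raise ValueError(f"pairs ({i}, {j}) and ({j}, {i}) contradict each other")


# Two of the replies from the Figure 2 example: Reply 1 is superior to Reply 4.
sample = MetaEvalSample(
    instruction=(
        "Create a meaningful sentence with a length of at least 8 words and a "
        "total number of different letters not exceeding 10. Also, please list "
        "all the different letters used in the sentence."
    ),
    replies=(
        "I see three green trees in the street. e, g, h, i, n, r, s, t",
        "The brown fox is running.",
    ),
    partial_order=frozenset({(0, 1)}),
)
```

Rejecting contradictory pairs at construction time is a design choice of this sketch; the application instead describes removing abnormal relations through verification.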

[0075] Step S105: Use meta-evaluation samples to evaluate the referee model's compliance with instructions.

[0076] Specifically, the judge model can be controlled to process the meta-evaluation sample to obtain reply ranking results, which include the predicted partial order relations between different replies to the instruction. The instruction compliance meta-evaluation result of the judge model is then obtained from the predicted partial order relations and the truth-value partial order relations represented by the set. The higher the consistency between the predicted and truth-value partial order relations, the better the judge model's ability to rank the quality of instruction replies, so the meta-evaluation result obtained from the two reflects that ranking ability.

[0077] In some implementations, the Kendall correlation coefficient between the predicted partial order relation and the true partial order relation can be obtained. The Kendall correlation coefficient is used to quantify the degree of consistency between the predicted and true partial order relations; a higher Kendall correlation coefficient indicates a higher degree of consistency, and vice versa. In this implementation, conventional methods for obtaining the Kendall correlation coefficient can be used to process the predicted and true partial order relations to obtain the Kendall correlation coefficient between them. This embodiment does not impose specific limitations on this method.
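The embodiment leaves the exact computation open. As a minimal sketch, assume the judge model outputs a full ranking of the replies from best to worst and the ground truth is a set of ordered pairs. A Kendall-style score can then be computed as concordant pairs minus discordant pairs, divided by the number of pairs that the ground truth orders. The function name `kendall_agreement` and the restriction to ground-truth pairs are assumptions of this sketch, not the application's formula.

```python
def kendall_agreement(predicted_ranking, truth_pairs):
    """Kendall-style agreement in [-1, 1] over the pairs ordered by the ground truth.

    predicted_ranking: reply ids from best to worst, as ranked by the judge model.
    truth_pairs: set of (better, worse) reply ids from the ground-truth partial order.
    """
    position = {reply: rank for rank, reply in enumerate(predicted_ranking)}
    concordant = 0
    discordant = 0
    for better, worse in truth_pairs:
        if position[better] < position[worse]:
            concordant += 1  # the judge agrees with the ground truth on this pair
        else:
            discordant += 1  # the judge reverses this pair
    total = concordant + discordant
    if total == 0:
        raise ValueError("truth_pairs is empty, so agreement is undefined")
    return (concordant - discordant) / total


# Ground-truth relations from the Figure 2 example.
truth_pairs = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}

print(kendall_agreement([1, 2, 3, 4], truth_pairs))  # 1.0
print(kendall_agreement([4, 3, 2, 1], truth_pairs))  # -1.0
print(kendall_agreement([2, 1, 3, 4], truth_pairs))  # 0.6
```

Pairs that the ground truth leaves unordered, such as Replies 2 and 3, do not affect the score, so the judge model is not penalised for how it orders them.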

[0078] In addition, in this embodiment, the referee model can be a large language model, and this embodiment does not specifically limit the structure and acquisition method of the large language model.

[0079] Based on the method described in steps S101 to S105 above, the instruction compliance evaluation of the referee model can accurately obtain the quality ranking ability of the referee model in responding to instructions, overcoming the defect that the instruction compliance ability obtained by the existing technology cannot reflect the quality ranking ability.

[0080] The following continues the description of the embodiment of the meta-evaluation method for model instruction compliance provided in this application, specifically the method for obtaining instructions in step S101.

[0081] In some embodiments according to this application, the instruction can be obtained through the following steps S1011 to S1012 shown in Figure 5.

[0082] Step S1011: Obtain multiple initial instructions. These multiple initial instructions have different instruction types and / or constraint types. The constraint type indicates the type of constraint conditions in the instruction.

[0083] This step can increase the diversity of instructions.

[0084] Step S1012: Filter multiple initial instructions to obtain the final instructions.

[0085] Specifically, the final instruction can be obtained by filtering through steps 11 to 14 below.

[0086] Step 11: For each initial instruction, the first large language model is used to evaluate the quality and complexity (or difficulty) of the initial instruction to obtain a quality level and a complexity score. The quality level and complexity score are positively correlated with the quality and complexity of the initial instruction, respectively. That is, the higher the quality level, the higher the quality of the initial instruction; the higher the complexity score, the higher the complexity of the initial instruction.

[0087] In this embodiment, prompt words can be set for evaluating the quality and complexity of instructions. The prompt words and the initial instruction are input into the first large language model for processing, so that the first large language model can evaluate the quality level and complexity score of the initial instruction based on the prompt words.

[0088] For example, prompt word templates used to evaluate instruction quality can be shown in Table 1 below:

[0089] Table 1

[0090] You are an expert specializing in evaluating the quality of user commands. I will provide you with:
1. System prompts (optional): These define the response rules or behavioral guidelines that the AI assistant must follow throughout the conversation.
2. User-AI assistant conversation history (optional): This displays the conversation process between the user and the AI assistant, consisting of multiple rounds of user commands and AI assistant responses.
3. The last round of user commands.
Your task is to evaluate the quality of the final round of user instructions based on the following criteria.
## Evaluation Criteria
1. Low Quality: User instructions have serious problems, such as inconsistent, incomplete, or ambiguous information, making it impossible to determine the true intent behind the instruction.
2. Medium Quality: User instructions may have some inconsistent, incomplete, or ambiguous information, but the overall intent can still be inferred.
3. High Quality: User instructions are logically consistent, complete in content, and clearly expressed, allowing for easy and explicit comprehension of their intent.
## Note
1. You do not need to respond to user commands; you only need to output the evaluation results.
2. If a user command requires additional searching or the use of tools to obtain an answer, it should be rated as low quality.
The following are examples and user instructions to be evaluated: ...
The following are system prompts, conversation history, and the last round of user instructions: ...
## Output Format
Analysis: ...
Prompt Word Quality: Low Quality / Medium Quality / High Quality

[0091] The prompt word templates used to assess the difficulty of instructions are shown in Table 2 below:

[0092] Table 2

[0093] You are an expert specializing in assessing the difficulty of user commands. I will provide you with:
1. System prompts (optional): These define the response rules or behavioral guidelines the AI assistant must follow throughout the conversation.
2. User-AI assistant conversation history (optional): This displays the conversation between the user and the AI assistant, consisting of multiple rounds of user commands and AI assistant responses.
3. The last round of user commands.
4. Constraint checklist: This lists all constraints the AI assistant must meet when generating a response to the last round of user commands.
Your task is to assess the difficulty of the final round of user instructions. The instructions contain multiple constraints that must be followed. Your assessment of the difficulty requires a comprehensive consideration of the number of constraints in the instructions and the difficulty of following these constraints, with the difficulty of following the constraints being more critical. Please analyze the difficulty of the instructions in detail according to the above principles. After the analysis is completed, output a difficulty score of 1–10 in strict accordance with the following format:
Score: [[5]]
The higher the score, the greater the difficulty of the instructions.
Below are system prompts, conversation history, last round of user instructions, and a list: ...
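Both templates ask the model to end its output with a fixed-format line, which makes the quality level and complexity score easy to extract. The sketch below shows one possible way to parse them. The function names, the numeric encoding of the quality levels, and the choice to return `None` on malformed output are assumptions and are not part of the application.

```python
import re

# Hypothetical numeric encoding of the three quality levels in Table 1.
QUALITY_LEVELS = {"Low Quality": 0, "Medium Quality": 1, "High Quality": 2}


def parse_quality(output):
    """Extract the quality level from a Table 1 style output, or None if absent."""
    match = re.search(r"Prompt Word Quality:\s*(Low|Medium|High) Quality", output)
    if match is None:
        return None
    return QUALITY_LEVELS[f"{match.group(1)} Quality"]


def parse_difficulty(output):
    """Extract the 1-10 difficulty score from a 'Score: [[n]]' line, or None."""
    match = re.search(r"Score:\s*\[\[(\d+)\]\]", output)
    if match is None:
        return None
    score = int(match.group(1))
    # Scores outside the range requested by the template are treated as invalid.
    return score if 1 <= score <= 10 else None


print(parse_quality("Analysis: the intent is clear.\nPrompt Word Quality: High Quality"))  # 2
print(parse_difficulty("The constraints interact heavily.\nScore: [[7]]"))  # 7
```

Treating unparseable or out-of-range outputs as missing, rather than guessing a value, keeps malformed model outputs from silently entering the later filtering steps.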

[0094] Step 12: Obtain the initial instruction with the highest quality level and a complexity score greater than a set threshold. This initial instruction is then used as a candidate instruction. A complexity score greater than the set threshold indicates a high level of instruction complexity; therefore, the candidate instructions obtained in this step are high-quality, high-complexity instructions. Those skilled in the art can flexibly adjust the value of the set threshold; this embodiment does not impose specific limitations on this.

[0095] Step 13: Cluster the candidate instructions to obtain multiple clusters, each cluster including at least one candidate instruction.

[0096] Specifically, candidate instructions can be clustered using conventional clustering methods in the art, such as the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. This embodiment does not specifically limit this method.

[0097] Step 14: Obtain the candidate instruction with the highest complexity score in each cluster, and obtain the final instruction based on the candidate instruction.

[0098] In some implementations, the candidate instruction with the highest complexity score within a cluster can be directly used as the final instruction.

[0099] In some implementations, the candidate instruction with the highest complexity score within a cluster can be verified to determine whether it is an anomalous instruction. If the candidate instruction is determined to be an anomalous instruction, it is removed; otherwise, it is used as the final instruction. Based on this implementation, the quality of instructions can be further improved. For example, anomalous instructions can be: (1) instructions with unreasonable constraints, vague descriptions, or contradictions; (2) instructions that exceed the capabilities of a large language model (e.g., image generation tasks); (3) instructions that require highly specialized domain knowledge. This embodiment does not specifically limit the method for judging anomalous instructions.

[0100] Based on the methods described in steps 11 to 14 above, high-quality, highly complex, and representative instructions can be obtained.
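
Steps 12 to 14 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Instruction` fields, the function name, and the `cluster_fn` callback (which stands in for any clustering method, such as DBSCAN, and returns one cluster label per candidate) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instruction:
    text: str
    quality_level: int     # hypothetical encoding: larger means higher quality
    complexity_score: int  # difficulty score in the range 1-10


def select_final_instructions(
    instructions: List[Instruction],
    threshold: int,
    cluster_fn: Callable[[List[Instruction]], List[int]],
) -> List[Instruction]:
    """Filter, cluster, and keep the most complex instruction per cluster."""
    if not instructions:
        return []
    # Step 12: keep top-quality instructions whose score exceeds the threshold.
    top_quality = max(i.quality_level for i in instructions)
    candidates = [
        i for i in instructions
        if i.quality_level == top_quality and i.complexity_score > threshold
    ]
    # Step 13: cluster the candidates (cluster_fn is an assumed callback).
    labels = cluster_fn(candidates)
    clusters: Dict[int, List[Instruction]] = defaultdict(list)
    for label, candidate in zip(labels, candidates):
        clusters[label].append(candidate)
    # Step 14: take the highest-complexity candidate from each cluster.
    return [
        max(members, key=lambda m: m.complexity_score)
        for _, members in sorted(clusters.items())
    ]
```

The optional anomaly check on the selected instruction is not sketched here, since this embodiment does not limit how it is performed.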

[0101] Based on the method described in steps S1011 to S1012 above, it is possible to obtain instructions that are of high quality, highly complex, representative and diverse. By using these instructions to obtain evaluation samples, the comprehensiveness and robustness of the evaluation can be improved as much as possible when using meta-evaluation samples to evaluate the instruction compliance of the referee model.

[0102] The method for obtaining meta-evaluation samples will be explained below, specifically the instruction type and constraint type of the instruction in step S1011.

[0103] 1. Describe the instruction type.

[0104] In some embodiments according to this application, the instruction type may include single-turn interactive, multi-turn interactive, and system prompt-guided.

[0105] Single-turn interactive instructions can be understood as a question-and-answer process. There is no need to consider the instructions in the historical dialogue. When evaluating instruction compliance, only the constraints actually recorded in the instruction need to be used for evaluation, without needing to use the constraints recorded in the instructions in the historical dialogue.

[0106] Multi-turn interactive instructions can be understood as instructions that form a continuous dialogue with the historical dialogue. When evaluating instruction compliance, it is necessary to evaluate not only the constraints actually recorded in the instruction, but also the constraints recorded in the instructions of the historical dialogue.

[0107] System prompt-guided instructions can be understood as instructions that include a system prompt, where the system prompt has higher priority than the user prompt. When the system prompt conflicts with the user prompt, the response should give precedence to the constraints of the system prompt.

[0108] 2. Explain the constraint types.

[0109] In some embodiments according to this application, the constraint type may include a primary constraint type and a constraint combination type, which are described below.

[0110] (1) Explain the main constraint type.

[0111] The main constraint types can include numeric, format, content, language, style, scene, and action types. The meanings and examples of these types are shown in Table 3 below.

[0112] Table 3

[0113]
| Main constraint type | Meaning | Example |
|---|---|---|
| Numerical | Constraints that specify quantitative requirements (such as word count, sentence count, or paragraph count), usually unrelated to the specific content. | Please write a 15-line poem. |
| Format | Constraints relating to the presentation format or structure of the response, such as JSON, Markdown, or bulleted lists. Formatting implicitly specified through contextual examples is also considered a format constraint. | The entire response must be output in JSON format. |
| Content | Constraints relating to the specific content of the response, such as subject, topic, and entity. | The article must / must not contain the keyword "happy". |
| Language | Constraints relating to the language and linguistic features of the response, such as language type, grammar, vocabulary, and rhetoric. | The first paragraph must be in German, the second in Chinese, and the third in Classical Chinese. |
| Style | Constraints relating to the writing style of the response, such as style, emotion, tone, and narrative perspective. | Please write an essay in the style of magical realism. |
| Scene | The model is required to respond or play a specific role in a specific scenario, ensuring that the output conforms to the specified situational conditions or role characteristics. | You live in a world that defies the laws of physics, where gravity fluctuates in strength periodically throughout the day. |
| Action | Constraints that define specific interactive behaviors, logical operations, or task-oriented execution modes. | Please summarize the plot of this novel. When solving math problems, please add 1 to the correct answer before outputting it. |

[0114] (2) Explain the constraint combination type.

[0115] Constraint combination types can include single type, parallel type, chain type, and selection type. In some implementations, nested type can also be included. The meanings and examples of these types are shown in Table 4 below.

[0116] Table 4

[0117]
| Constraint combination type | Meaning | Example |
|---|---|---|
| Single | The response must comply with one constraint. | Please summarize the following news items. |
| Parallel | The response must comply with multiple constraints simultaneously. | Please summarize the following news items. The summary should be output as a bulleted list and should not exceed 100 words. |
| Chain | The response requires completing multiple tasks in sequence, and each task must comply with its own corresponding constraints. | Please briefly introduce the Mona Lisa. First, state the year it was created, then describe the background of its creation, and finally summarize the work's influence. |
| Selection | The response requires selecting the correct branch based on preset conditions and adhering to the constraints of that branch. | Please describe the following painting. If the painting includes any animals, the description should be in English. If it does not include animals, the description should be in Chinese. Painting: Mona Lisa |
| Nested | The above types can be recursively nested to form more complex structures. | Analyze the sentiment of the above user comments and complete the following tasks: 1. If the comment is positive: Extract the products mentioned in the comment. 2. If the comment is negative, analyze the reasons: If the reason is unrelated to the product itself, ... If the reason is related to the product itself, ... |

[0118] The following continues to describe an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for obtaining the responses to the instruction in step S102.

[0119] In some embodiments according to this application, a second large language model can be used to process the instruction and generate different responses to it. "Second" indicates that this large language model is not the same as the first large language model in the aforementioned embodiments.

[0120] In this embodiment, multiple different second large language models can be provided, but all responses to a given instruction are generated using the same second large language model. This effectively controls confounding variables unrelated to instruction compliance, such as the writing quality and style of the text. For example, in some implementations, 16 mainstream large language models with varying performance can be used as the second large language models.

[0121] The following continues to describe an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for obtaining the set of truth value partial order relations in step S103.

[0122] In some embodiments according to this application, the truth value judgment results can be used to obtain the set of truth value partial order relations corresponding to the instruction through the following steps 21 to 22.

[0123] Step 21: Based on the truth value judgment results, obtain the partial order relations that satisfy the preset conditions.

[0124] In this embodiment, the preset conditions are as shown in formula (1):

[0125] $$r_a \succ r_b \iff \left(\forall k \in \{1, \dots, K\}:\ v_{a,k} \ge v_{b,k}\right) \land \left(\exists k \in \{1, \dots, K\}:\ v_{a,k} > v_{b,k}\right) \tag{1}$$

[0126] The meanings of the parameters in formula (1) are as follows:

[0127] $r_a$ and $r_b$ represent two responses to the same instruction; $r_a \succ r_b$ expresses that the instruction-following quality of $r_a$ is higher than that of $r_b$. $r_a$ and $r_b$ can be understood as the positive and negative responses, respectively, in the partial order relation.

[0128] $v_{a,k}$ and $v_{b,k}$ represent the truth value judgment results of whether $r_a$ and $r_b$, respectively, follow the $k$-th constraint of the instruction, taking the value 1 for "followed" and 0 for "not followed". $K$ indicates the total number of constraints.

[0129] $\forall$ indicates "for any". The first condition can be understood as follows: under every constraint, the compliance quality of $r_a$ is no worse than that of $r_b$.

[0130] $\exists$ indicates "there exists". The second condition can be understood as follows: under at least one constraint, the compliance quality of $r_a$ is better than that of $r_b$.

[0131] Step 22: Obtain the truth value partial order relation based on the partial order relation obtained in step 21.

[0132] In some implementations, the partial order relation obtained in step 21 can be used directly as the truth value partial order relation.

[0133] In some implementations, each partial order relation can be checked against preset verification conditions to determine whether it is an abnormal partial order relation. If a partial order relation is determined to be abnormal, it is removed, and the remaining partial order relations obtained in step 21 are used as the truth value partial order relations. Based on this implementation, the reliability of the truth value partial order relation can be further improved.

[0134] The preset verification conditions include at least the first verification condition and the second verification condition.

[0135] The first verification condition is: if the positive response and the negative response of a partial order relation both fail to follow the same constraint, and the degree of non-compliance of the negative response is lower than that of the positive response, the partial order relation is determined to be an abnormal partial order relation.

[0136] The second verification condition is: if the evaluation results of the positive response and the negative response differ significantly in a preset evaluation dimension, and the evaluation result of the negative response is better than that of the positive response, the partial order relation is determined to be an abnormal partial order relation. The preset evaluation dimension is an evaluation dimension unrelated to instruction compliance, for example style, format, or writing quality.

[0137] In some implementations, the partial order relations can be verified manually according to the aforementioned preset verification conditions. Specifically, each partial order relation can be independently verified by two annotators, and only partial order relations that both annotators agree are correct are retained.

[0138] Based on the methods described in steps 21 to 22 above, the truth value judgment results in the set can be used to accurately obtain the truth value partial order relations between different responses to the instruction.
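
Step 21 can be illustrated with a short sketch. It assumes that each truth value judgment result is encoded as 1 ("followed") or 0 ("not followed") and that the responses are identified by hypothetical string keys; one response is placed above another only if it is no worse on every constraint and strictly better on at least one.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def dominates(v_a: Sequence[int], v_b: Sequence[int]) -> bool:
    """True if v_a is no worse on every constraint and better on at least one."""
    if len(v_a) != len(v_b):
        raise ValueError("both responses must be judged on the same constraints")
    no_worse = all(a >= b for a, b in zip(v_a, v_b))
    strictly_better = any(a > b for a, b in zip(v_a, v_b))
    return no_worse and strictly_better


def build_partial_order(truth: Dict[str, List[int]]) -> List[Tuple[str, str]]:
    """Return (positive, negative) response pairs satisfying the preset conditions."""
    return [
        (a, b)
        for a, b in permutations(truth, 2)
        if dominates(truth[a], truth[b])
    ]
```

Responses with identical judgment vectors, or with vectors where each is better on a different constraint, are incomparable and yield no pair. The verification of step 22 is not sketched, since it may be carried out manually.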

[0139] The following describes an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for evaluating the instruction compliance of the referee model in step S105.

[0140] In some embodiments of this application, the task type of the instruction compliance evaluation of the referee model may include an overall evaluation task and a constraint evaluation task. The evaluation methods for the two tasks are described below.

[0141] 1. Explain the overall evaluation task.

[0142] Referring to Figure 6, in some embodiments of this application, the referee model can be evaluated for instruction compliance through the following steps S201 to S205.

[0143] Step S201: Obtain the meta-evaluation samples of the referee model.

[0144] The meta-evaluation sample includes the instruction, the responses to the instruction, and the set of truth value partial order relations.

[0145] Step S202: In the overall evaluation task of the referee model, use the referee model to process the instruction and the responses in the meta-evaluation sample, and obtain the first response ranking result based on the processing result. The first response ranking result includes the predicted partial order relation between different responses to the instruction.

[0146] Step S203: Obtain the first Kendall correlation coefficient between the predicted partial order relation represented by the first response ranking result and the truth value partial order relation represented by the set in the meta-evaluation sample.

[0147] Step S204: Based on the first Kendall correlation coefficient, obtain the first response ranking evaluation result of the referee model. Specifically, this coefficient can be used directly as the first response ranking evaluation result.

[0148] The first Kendall correlation coefficient reflects the consistency between the predicted partial order relation and the truth value partial order relation. A higher value indicates higher consistency, and higher consistency indicates a better ability of the referee model to rank the quality of instruction responses (i.e., instruction compliance quality). Therefore, the first response ranking evaluation result can reflect the referee model's ability to rank the quality of instruction responses.

[0149] Step S205: Based on the first response ranking evaluation result, obtain the instruction compliance meta-evaluation result of the referee model; that is, the instruction compliance meta-evaluation result includes the first response ranking evaluation result.

[0150] Based on the method described in steps S201 to S205 above, the quality ranking ability of the referee model for instruction responses in the overall evaluation task can be quantified using the first response ranking evaluation result, making it convenient for users to intuitively and accurately evaluate the quality ranking ability of the referee model. At the same time, this method also overcomes the deficiency of existing technologies where the instruction compliance ability evaluated cannot reflect the quality ranking ability of the referee model for instruction responses.
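
One possible way to compute a Kendall-style coefficient between a predicted ranking and a truth value partial order is sketched below. It is an illustrative assumption rather than the formula of this application: the coefficient is taken over the truth value pairs only, as concordant pairs minus discordant pairs divided by the number of truth value pairs, and the predicted ranking is represented by hypothetical per-response scores.

```python
from typing import Dict, List, Tuple


def kendall_on_pairs(
    truth_pairs: List[Tuple[str, str]],
    predicted_score: Dict[str, float],
) -> float:
    """Kendall-style agreement in [-1, 1] over (better, worse) truth pairs."""
    if not truth_pairs:
        raise ValueError("at least one truth value pair is required")
    concordant = 0
    discordant = 0
    for better, worse in truth_pairs:
        diff = predicted_score[better] - predicted_score[worse]
        if diff > 0:
            concordant += 1   # prediction agrees with the truth value order
        elif diff < 0:
            discordant += 1   # prediction reverses the truth value order
        # A predicted tie counts as neither concordant nor discordant.
    return (concordant - discordant) / len(truth_pairs)
```

A value of 1 means every truth value pair is ordered correctly by the referee model, and a value of -1 means every pair is reversed.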

[0151] The following describes an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for obtaining the first response ranking result in step S202.

[0152] In some embodiments according to this application, the processing result of the referee model after processing the instruction and responses in the meta-evaluation sample includes a first quality score for each response to the instruction. The first quality score of a response is positively correlated with the degree to which the response complies with the constraints of the instruction; that is, the higher the first quality score, the higher the degree of compliance, and vice versa.

[0153] Based on this, when obtaining the first response ranking result, the responses can be ranked according to their first quality scores to obtain the predicted partial order relation between different responses to the instruction, from which the first response ranking result is obtained. For example, if an instruction has 5 corresponding responses, the responses are ordered from the highest first quality score to the lowest, and this order is their predicted partial order relation.

[0154] Based on the above implementation, the predicted partial order relation between different responses to the instruction can be obtained quickly and accurately using the first quality scores output by the referee model.
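
The score-based ranking described above can be sketched as follows. The response identifiers and score values are hypothetical, and the treatment of equal scores (no predicted pair is emitted for a tie) is an assumption.

```python
from typing import Dict, List, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Order responses from highest to lowest first quality score."""
    return sorted(scores, key=lambda r: scores[r], reverse=True)


def predicted_pairs(scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """All (better, worse) pairs implied by the scores; ties yield no pair."""
    return [
        (a, b)
        for a in scores
        for b in scores
        if scores[a] > scores[b]
    ]
```

For five responses with distinct scores this produces a full ordering and ten predicted pairs.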

[0155] The following continues to describe an embodiment of the model instruction compliance meta-evaluation method provided in this application, focusing on another method for obtaining the first response ranking result in step S202.

[0156] In some embodiments according to this application, the processing result of the referee model after processing the instruction and responses in the meta-evaluation sample includes the partial order relation between each pair of responses to the instruction. For example, the referee model can be controlled to use a pairwise comparison method to determine the partial order relation between each pair of responses.

[0157] Based on this, when obtaining the first response ranking result, the responses to the instruction can be sorted according to the partial order relation between each pair of responses to obtain the predicted partial order relation between different responses to the instruction, from which the first response ranking result is obtained.

[0158] In practical applications, when pairwise comparison is used to determine the partial order relation between each pair of responses, contradictory partial order relations may occur. For example, suppose an instruction has 5 corresponding responses and, for three of them, the pairwise results form a cycle: the first response is better than the second, the second is better than the third, and the third is better than the first. In this case, the partial order relation among these three responses cannot be determined accurately. To address this, in some implementations, after the partial order relation between any two responses is determined, the Elo score of each of those two responses can be obtained. After all partial order relations have been determined, the final Elo score of each response is obtained. The responses are then sorted in descending order of their final Elo scores, and the sorting result is used as the predicted partial order relation. The Elo score is obtained using the Elo rating method, created by the Hungarian-American physicist Árpád Elo to measure the relative skill level of players in competitive games. The principle of this method will not be elaborated in this embodiment. Based on the above implementation, a predicted partial order relation can still be obtained even if the pairwise partial order relations are contradictory.

[0159] Based on the above implementation, the predicted partial order relation between different responses to the instruction can be obtained quickly and accurately using the partial order relation between each pair of responses output by the referee model.
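
A minimal sketch of ranking responses from pairwise judgments with Elo scores is shown below. It uses the standard Elo update with an initial rating of 1000 and a K-factor of 32; both values, the function name, and the response identifiers are assumptions, since this embodiment does not specify them.

```python
from typing import Dict, Iterable, List, Tuple


def elo_rank(
    responses: List[str],
    outcomes: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Tuple[List[str], Dict[str, float]]:
    """Rank responses from (winner, loser) pairwise judgments via Elo scores."""
    ratings = {r: initial for r in responses}
    for winner, loser in outcomes:
        # Expected probability that the winner beats the loser.
        expected_win = 1.0 / (
            1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)
        )
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    ordered = sorted(responses, key=lambda r: ratings[r], reverse=True)
    return ordered, ratings
```

Even a cyclic set of judgments yields a total order here. Note that Elo updates depend on the order in which comparisons are processed, so a fixed or shuffled-and-averaged comparison order may be worth considering.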

[0160] 2. Explain the constraint evaluation task.

[0161] Referring to Figure 7, in some embodiments of this application, the referee model can be evaluated for instruction compliance through the following steps S301 to S305. In addition to the instruction, the responses, and the set of truth value partial order relations, the meta-evaluation sample may also include the constraints of the instruction (each indexed within the total number of constraints in the instruction) and the set of truth value judgment results. In some implementations, for each instruction, the constraints in the instruction can be automatically decomposed using a large language model to generate a constraint checklist.

[0162] Step S301: In the constraint evaluation task of the referee model, use the referee model to process the instruction, the responses, and the constraints in the meta-evaluation sample to obtain the constraint verification results and the second response ranking result.

[0163] The constraint verification results include the predicted judgment result of whether each response to the instruction follows each constraint. Taking one response as an example, its predicted judgment result for each constraint indicates either that the constraint is followed or that it is not followed.

[0164] The second response ranking result includes the predicted partial order relation between different responses to the instruction.

[0165] Step S302: Based on the constraint verification results and the set of truth value judgment results in the meta-evaluation sample, obtain the constraint verification evaluation result of the referee model. This set contains the truth value judgment results of whether each response to the instruction follows each constraint. The predicted judgment results in the constraint verification results are evaluated against these truth value judgment results to obtain the constraint verification evaluation result.

[0166] Step S303: Obtain the second Kendall correlation coefficient between the predicted partial order relation represented by the second response ranking result and the truth value partial order relation represented by the set in the meta-evaluation sample. In this embodiment, a conventional method for obtaining the Kendall correlation coefficient can be used to process the predicted partial order relation and the truth value partial order relation to obtain the second Kendall correlation coefficient between the two. This embodiment does not impose specific limitations on this.

[0167] Step S304: Based on the second Kendall correlation coefficient, obtain the second response ranking evaluation result of the referee model. Specifically, this coefficient can be used directly as the second response ranking evaluation result.

[0168] The second Kendall correlation coefficient reflects the consistency between the predicted partial order relation and the truth value partial order relation. A higher value indicates higher consistency, and higher consistency indicates a better ability of the referee model to rank the quality of instruction responses (i.e., instruction compliance quality). Therefore, the second response ranking evaluation result can reflect the referee model's ability to rank the quality of instruction responses.

[0169] Step S305: Based on the second response sorting evaluation result and constraint verification evaluation result, obtain the instruction compliance meta-evaluation result of the referee model, that is, the instruction compliance meta-evaluation result includes the sorting evaluation result and the constraint verification evaluation result.

[0170] Based on the method described in steps S301 to S305 above, the quality ranking ability of the referee model for instruction responses in the constraint evaluation task can be quantified using the second response ranking evaluation result, making it convenient for users to intuitively and accurately evaluate the quality ranking ability of the referee model. At the same time, this method also overcomes the deficiency of existing technologies where the instruction compliance ability evaluated cannot reflect the quality ranking ability of the referee model for instruction responses.

[0171] The following describes an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for obtaining the second response ranking result in step S301.

[0172] In some embodiments of this application, the second response ranking result can be obtained through the following steps S3011 to S3013.

[0173] Step S3011: Obtain the second quality score for each response based on the constraint verification results.

[0174] The second quality score of a response is calculated from the predicted judgment results of whether that response follows each constraint, where each predicted judgment result indicates either that the constraint is followed or that it is not followed.

[0175] Step S3012: Sort the responses according to their second quality scores to obtain the predicted partial order relation between different responses to the instruction.

[0176] Step S3013: Obtain the second response ranking result based on the predicted partial order relation.

[0177] Based on the methods described in steps S3011 to S3013 above, the constraint verification results output by the referee model can be used to calculate the second quality scores, and the predicted partial order relation between different responses to the instruction can then be obtained quickly and accurately using the second quality scores.
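
Steps S3011 to S3012 can be sketched as follows. The exact formula for the second quality score is not reproduced here, so this sketch assumes one plausible choice: the fraction of constraints that the referee model predicts as followed, with predictions encoded as 1 ("followed") and 0 ("not followed"). The response identifiers are hypothetical.

```python
from typing import Dict, List


def second_quality_scores(verification: Dict[str, List[int]]) -> Dict[str, float]:
    """Assumed score: share of constraints predicted as followed (1 = followed)."""
    scores: Dict[str, float] = {}
    for response, flags in verification.items():
        if not flags:
            raise ValueError(f"no constraint predictions for {response}")
        scores[response] = sum(flags) / len(flags)
    return scores


def rank_responses(verification: Dict[str, List[int]]) -> List[str]:
    """Order responses from highest to lowest second quality score."""
    scores = second_quality_scores(verification)
    return sorted(scores, key=lambda r: scores[r], reverse=True)
```

Because every response to one instruction is checked against the same constraints, using the count of followed constraints instead of the fraction would give the same ranking.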

[0178] The following describes an embodiment of the model instruction compliance meta-evaluation method provided in this application, specifically the method for obtaining the constraint verification evaluation result in step S302.

[0179] In some embodiments according to this application, the constraint verification evaluation results can be obtained through the following steps S3021 to S3024.

[0180] Step S3021: Based on the set of truth value judgment results in the meta-evaluation sample, obtain positive samples and negative samples, where the positive samples are the truth value judgment results indicating that a constraint is followed and the negative samples are the truth value judgment results indicating that a constraint is not followed.

[0181] Step S3022: Based on the positive samples and the predicted judgment results in the constraint verification results that indicate a constraint is followed, obtain the F1 score of the positive samples.

[0182] Step S3023: Based on the negative samples and the predicted judgment results in the constraint verification results that indicate a constraint is not followed, obtain the F1 score of the negative samples.

[0183] Step S3024: Obtain the constraint verification evaluation result based on the F1 scores of the positive and negative samples. Specifically, the F1 scores of the positive and negative samples can be used as the constraint verification evaluation result.

[0184] In this embodiment, the conventional F1 score calculation method in the field of statistical technology can be used to calculate the F1 score of positive samples using positive samples and prediction results that follow the constraints, and to calculate the F1 score of negative samples using negative samples and prediction results that do not follow the constraints. This embodiment does not make specific limitations on this.

[0185] Based on the method described in steps S3021 to S3024 above, the F1 scores of positive and negative samples can be used to quantify the ability of the referee model to verify constraints in the constraint evaluation task, making it convenient for users to intuitively and accurately evaluate the constraint verification ability of the referee model. Specifically, the higher the F1 score of both positive and negative samples, the stronger the constraint verification ability.
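
Steps S3021 to S3024 can be illustrated with the conventional per-class F1 computation. The sketch assumes that truth value and predicted judgment results are flattened into two aligned lists and encoded as 1 ("followed") and 0 ("not followed"); the function name is illustrative.

```python
from typing import List


def f1_for_label(truth: List[int], pred: List[int], label: int) -> float:
    """F1 score treating `label` as the class of interest."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(truth, pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(truth, pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Calling `f1_for_label(truth, pred, 1)` gives the F1 score of the positive samples and `f1_for_label(truth, pred, 0)` gives that of the negative samples. Reporting both guards against a referee model that scores well simply by predicting "followed" for everything.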

[0186] The technical effectiveness of the meta-evaluation method provided in this application is explained below with reference to experimental data. In some embodiments of this application, a comparative experiment is conducted between the meta-evaluation method provided in this application and existing meta-evaluation methods from the two dimensions of difficulty and relevance to downstream tasks.

[0187] 1. Explain the comparative experiment on the difficulty dimension.

[0188] In this embodiment, five referee models and three existing meta-evaluation methods were selected for testing. Referee models 1 to 5 are Skywork-Reward-V2-Llama-3.1-8B, Qwen-3-32B, Deepseek-V3.2, GPT-5-mini, and Gemini-3-Flash, respectively. Existing meta-evaluation methods 1 to 3 are LLMBar-Adversarial, RewardBench-2-IF, and IFBench, respectively.

[0189] Taking one referee model as an example, its instruction compliance capability was assessed using the meta-evaluation method provided in this application and the three existing meta-evaluation methods. The average accuracy of the partial order relations among all responses was calculated based on the evaluation results. The experimental data for the five referee models are shown in Table 5 below.

[0190] Table 5

[0191]
| Referee model | Existing meta-evaluation method 1, average accuracy (%) | Existing meta-evaluation method 2, average accuracy (%) | Existing meta-evaluation method 3, average accuracy (%) | Method provided in this application, average accuracy (%) |
|---|---|---|---|---|
| Referee model 1 | 88.1 | 81.0 | 62.8 | 56.2 |
| Referee model 2 | 82.1 | 67.1 | 64.4 | 54.6 |
| Referee model 3 | 88.1 | 80.8 | 67.8 | 63.4 |
| Referee model 4 | 85.3 | 80.0 | 75.7 | 72.0 |
| Referee model 5 | 88.7 | 86.9 | 80.2 | 74.9 |

[0192] As can be seen from Table 5, every referee model achieves its lowest average accuracy under the method provided in this application. This indicates that the meta-evaluation method provided in this application is more challenging and more rigorous for each referee model: a referee model needs a strong ability to judge instruction compliance in order to achieve a high average accuracy under it.

[0193] 2. Explain the comparative experiment on the relevance dimension of downstream tasks.

[0194] In this embodiment, 15 referee models and 3 existing meta-evaluation methods were selected for testing. The 15 referee models are Gemini-3-Flash, GPT-5-mini, DeepSeek-V3.2, GLM-4.6, GLM-4.5-Air, QwQ-32B, Qwen-3-32B, Qwen-3-8B, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-72B-Instruct, Qwen-2.5-7B-Instruct, Skywork-Reward-V2-Llama-3.1-8B, Llama-3.1-70B-Instruct-RM-RB2, and InternLM2-20B-Reward. The existing meta-evaluation methods 1 to 3 are LLMBar-Adversarial, RewardBench-2-IF, and IFBench, respectively.

[0195] This comparative experiment includes the following two tests:

[0196] (1) First, the referee model is controlled to obtain a first optimal response through the first ranking scheme (i.e., the method described in steps S101 to S105 above), and the manually annotated score of the first optimal response is obtained. The manually annotated score is positively correlated with the referee model's ability to judge instruction compliance.

[0197] Then, the meta-evaluation scores of the referee models are obtained using the three existing meta-evaluation methods, the first ranking scheme, and the second ranking scheme (i.e., the method described in steps S201 to S205 above). These meta-evaluation scores are positively correlated with the referee models' ability to judge instruction compliance. Finally, for each existing meta-evaluation method and for the first and second ranking schemes, the Somers' D coefficient between the meta-evaluation scores of the referee models and the manually annotated scores is obtained.

[0198] (2) First, the referee model is controlled to obtain a second optimal response through the second ranking scheme, and the manually annotated score of the second optimal response is obtained. Then, the meta-evaluation scores of the referee models are obtained using the above three existing meta-evaluation methods, the first ranking scheme, and the second ranking scheme. Finally, for each existing meta-evaluation method and for the first and second ranking schemes, the Somers' D coefficient between the meta-evaluation scores of the referee models and the manually annotated scores is obtained.

[0199] The experimental data obtained from this comparative experiment are shown in Table 6 below.

[0200] Table 6

[0201]
| Meta-evaluation method | Somers' D coefficient between the meta-evaluation score of the referee model and the manually annotated score of the optimal response obtained through the second ranking scheme | Somers' D coefficient between the meta-evaluation score of the referee model and the manually annotated score of the optimal response obtained through the first ranking scheme |
|---|---|---|
| Existing meta-evaluation method 1 | 0.667 | 0.635 |
| Existing meta-evaluation method 2 | 0.692 | 0.615 |
| Existing meta-evaluation method 3 | 0.697 | 0.581 |
| The second ranking scheme provided in this application | 0.758 | 0.790 |
| The first ranking scheme provided in this application | 0.758 | 0.829 |

[0202] As can be seen from Table 6, the Somers' D coefficients obtained by the two ranking schemes provided in this application (0.758 to 0.829) are higher in both columns than those of the three existing meta-evaluation methods (0.581 to 0.697). This indicates that the meta-evaluation scores produced by the method provided in this application agree more closely with the actual performance of the referee model.
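The coefficient used in this comparison can be illustrated with a minimal sketch. It computes Somers' D as the number of concordant pairs minus the number of discordant pairs, divided by the number of pairs that are not tied on the independent variable. Treating the meta-evaluation score as the independent variable is an assumption, since the text does not state the direction; the function name `somers_d` and the scores below are hypothetical and are not the data behind Table 6.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D of y with respect to x.

    D = (concordant - discordant) / (number of pairs not tied on x).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_on_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are left out of the denominator
        untied_on_x += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied only on y count in the denominator but not the numerator
    if untied_on_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_on_x


# Hypothetical values for five referee models (assumption, not patent data).
meta_scores = [0.62, 0.71, 0.55, 0.80, 0.66]   # meta-evaluation scores
human_scores = [3.1, 3.0, 2.7, 3.9, 3.4]       # manually labeled scores
d = somers_d(meta_scores, human_scores)
```

A value close to 1 means that referee models with higher meta-evaluation scores also tend to receive higher manually labeled scores, which is the sense in which a higher coefficient in Table 6 is better.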

[0203] It should be noted that although the steps in the above embodiments are described in a specific order, those skilled in the art will understand that in order to achieve the effect of this application, different steps do not necessarily have to be executed in such an order. They can be executed simultaneously (in parallel) or in other orders. These adjusted solutions are equivalent to the technical solutions described in this application and therefore will also fall within the protection scope of this application.

[0204] Those skilled in the art will understand that all or part of the processes in the methods of the above-described embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium can include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal, or a software distribution medium.

[0205] Another aspect of this application provides a computer-readable storage medium.

[0206] In one embodiment of a computer-readable storage medium according to this application, the computer-readable storage medium may be configured to store a program that performs the meta-evaluation method for model instruction compliance of the above-described method embodiments. This program may be loaded and run by a processor to implement the above-described method. For ease of explanation, only the parts related to the embodiments of this application are shown; for specific technical details not disclosed, please refer to the method section of the embodiments of this application. The computer-readable storage medium may be a storage device formed by any of various electronic devices. Optionally, in the embodiments of this application, the computer-readable storage medium is a non-transitory computer-readable storage medium.

[0207] Another aspect of this application provides an electronic device.

[0208] In one embodiment of an electronic device according to this application, the electronic device may include at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program that, when executed by the at least one processor, implements the methods described in any of the above embodiments. Referring to Figure 8, Figure 8 shows by way of example a memory and a processor that are communicatively connected via a bus.

[0209] The electronic devices described in this application may be, but are not limited to, mobile phones, tablets, desktops, laptops, handheld computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), servers, etc.; the embodiments of this application place no limitation on the type of device.

[0210] In the description of this application, the processor can be a central processing unit, a microprocessor, a graphics processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing capabilities. The processor can be implemented in software, in hardware, or in a combination of both. Computer-readable storage media include any suitable medium capable of storing program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, etc. The term "A and/or B" covers all possible combinations of A and B, namely only A, only B, or both A and B. The terms "at least one A or B" and "at least one of A and B" have a meaning similar to "A and/or B" and can include only A, only B, or both A and B. The singular forms "a" and "the" can also include plural forms.

[0211] The technical solutions of this application have been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of this application is obviously not limited to these specific embodiments. Without departing from the principles of this application, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after these changes or substitutions will all fall within the scope of protection of this application.

Claims

1. A meta-evaluation method for model instruction compliance, characterized in that the method includes: obtaining an instruction in natural language form, the instruction including at least one constraint; obtaining replies corresponding to the instruction and a set of truth value judgment results, wherein the replies run from the first reply of the instruction up to the total number of replies, each reply is in natural language form, and the set includes the truth value judgment results of whether each reply of the instruction follows each constraint in the instruction; obtaining, according to the truth value judgment results in that set, a set of partial order relations corresponding to the instruction, this set including the truth partial order relations between different replies of the instruction, wherein a partial order relation indicates whether the instruction compliance quality of one reply is superior to that of another reply; constructing a meta-evaluation sample corresponding to the instruction according to the instruction and its corresponding replies and set of partial order relations, and performing an instruction compliance evaluation of the referee model using the meta-evaluation sample; wherein the referee model is a large language model, and the instruction compliance evaluation of the referee model includes: controlling the referee model to process the meta-evaluation sample to obtain a reply ranking result, the reply ranking result including predicted partial order relations between different replies of the instruction; and obtaining the instruction compliance meta-evaluation result of the referee model based on the predicted partial order relations and the truth partial order relations represented by the set of partial order relations.
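As an illustration of the final step of claim 1, the following minimal sketch compares a predicted ranking with a set of truth partial order relations. The claim does not fix how the two are compared; scoring by the fraction of truth pairs that the predicted ranking reproduces is an assumption made here, and the names `ranking_to_pairs`, `pairwise_agreement` and the reply identifiers are hypothetical.

```python
def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking of reply ids into (better, worse) pairs."""
    return {
        (ranking[i], ranking[j])
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
    }


def pairwise_agreement(truth_pairs, predicted_pairs):
    """Fraction of truth (better, worse) pairs that also appear in the prediction.

    Assumed consistency measure; the claim only requires that the result be
    based on the predicted and truth partial order relations.
    """
    if not truth_pairs:
        raise ValueError("no truth partial order relations to compare against")
    hits = sum(1 for pair in truth_pairs if pair in predicted_pairs)
    return hits / len(truth_pairs)


# Hypothetical meta-evaluation sample with three replies.
truth_pairs = {("r1", "r2"), ("r1", "r3"), ("r2", "r3")}
predicted_pairs = ranking_to_pairs(["r1", "r3", "r2"])  # referee model output
score = pairwise_agreement(truth_pairs, predicted_pairs)
```

In this example the referee model reproduces two of the three truth relations, so the score is 2/3. Pairs that the truth set leaves incomparable do not affect the score, which matches the idea that only known quality orderings are checked.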

2. The method according to claim 1, characterized in that obtaining the set of partial order relations corresponding to the instruction according to the truth value judgment results includes: obtaining, based on the truth value judgment results, the partial order relations that satisfy a preset condition; and obtaining the truth partial order relations according to these partial order relations; wherein the preset condition is expressed in terms of the following quantities: two different replies of the instruction; the relation that the instruction compliance quality of one of these replies is higher than that of the other; the truth value judgment results of whether each of the two replies follows each constraint in the instruction; the two possible truth value judgment results, namely "followed" and "not followed"; and the total number of constraints.

3. The method according to claim 2, characterized in that obtaining the truth partial order relations according to the partial order relations includes: checking each partial order relation against preset verification conditions to determine whether it is an abnormal partial order relation; and, if the partial order relation is determined to be abnormal, removing it.

4. The method according to claim 3, characterized in that the preset verification conditions include at least the following first verification condition and second verification condition: the first verification condition is: if the two replies in the partial order relation both fail to follow the same constraint and the degree of non-compliance of one reply is lower than that of the other, the partial order relation is determined to be an abnormal partial order relation; the second verification condition is: if the evaluation results of the two replies differ significantly in a preset evaluation dimension and the evaluation result of one reply is better than that of the other, the partial order relation is determined to be an abnormal partial order relation, wherein the preset evaluation dimension is an evaluation dimension other than, and unrelated to, instruction compliance.
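Claims 2 to 4 can be illustrated with a minimal sketch that builds candidate partial order relations from per-constraint truth value judgment results and then removes relations flagged as abnormal. The exact preset condition is not reproduced here; the sketch assumes a dominance rule (one reply follows every constraint the other follows, and at least one more), which is an assumption and not necessarily the condition of the application. The names `dominates`, `build_partial_order`, `remove_abnormal` and all data are hypothetical.

```python
def dominates(a, b):
    """Assumed preset condition: a is at least as compliant as b on every
    constraint (1 = followed, 0 = not followed) and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def build_partial_order(judgments):
    """judgments maps a reply id to its list of 0/1 results, one per constraint.

    Returns the set of (better, worse) pairs under the assumed condition.
    """
    return {
        (i, j)
        for i in judgments
        for j in judgments
        if i != j and dominates(judgments[i], judgments[j])
    }


def remove_abnormal(pairs, is_abnormal):
    """Drop every relation that the verification predicate flags as abnormal."""
    return {pair for pair in pairs if not is_abnormal(pair)}


# Hypothetical judgments for four replies over three constraints.
judgments = {
    "r1": [1, 1, 1],
    "r2": [1, 0, 1],
    "r3": [0, 1, 0],
    "r4": [1, 0, 0],
}
pairs = build_partial_order(judgments)

# Hypothetical verification outcome: r2 and r4 both miss the second constraint
# and a reviewer flagged the relation between them as abnormal.
flagged = {("r2", "r4")}
truth_pairs = remove_abnormal(pairs, lambda pair: pair in flagged)
```

Here `r2` and `r3` remain incomparable because each follows a constraint that the other does not, which is why the result is a partial order and not a full ranking.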

5. The method according to claim 1, characterized in that obtaining the instruction in natural language form includes: obtaining multiple initial instructions that have different instruction types and/or constraint types, wherein the constraint type indicates the type of the constraints in the instruction; and filtering the multiple initial instructions to obtain the final instruction; wherein filtering the multiple initial instructions includes: for each initial instruction, using a first large language model to evaluate the quality and complexity of the initial instruction to obtain a quality level and a complexity score, the quality level and the complexity score being positively correlated with the quality and the complexity of the initial instruction, respectively; taking the initial instructions that have the highest quality level and a complexity score greater than a set threshold as candidate instructions; clustering the candidate instructions to obtain multiple clusters; and obtaining the candidate instruction with the highest complexity score in each cluster, and obtaining the final instruction based on that candidate instruction.
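The filtering in claim 5 can be sketched as follows. The sketch assumes that the quality level, complexity score and cluster id of each instruction have already been produced elsewhere (in the claim they come from a first large language model and a clustering step that runs on the candidate instructions); the `Candidate` record, the function `select_per_cluster`, the threshold and the data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    quality_level: int   # higher means better quality (assumed integer scale)
    complexity: float    # higher means more complex
    cluster: int         # cluster id, assumed to be precomputed


def select_per_cluster(instructions, threshold):
    """Keep top-quality instructions above the complexity threshold, then
    return the most complex one from each cluster, ordered by cluster id."""
    if not instructions:
        return []
    top_level = max(item.quality_level for item in instructions)
    candidates = [
        item for item in instructions
        if item.quality_level == top_level and item.complexity > threshold
    ]
    best = {}
    for item in candidates:
        current = best.get(item.cluster)
        if current is None or item.complexity > current.complexity:
            best[item.cluster] = item
    return [best[key] for key in sorted(best)]


# Hypothetical pool of initial instructions.
pool = [
    Candidate("A", quality_level=5, complexity=0.9, cluster=0),
    Candidate("B", quality_level=5, complexity=0.7, cluster=0),
    Candidate("C", quality_level=5, complexity=0.4, cluster=1),
    Candidate("D", quality_level=4, complexity=0.95, cluster=1),
    Candidate("E", quality_level=5, complexity=0.6, cluster=1),
]
selected = select_per_cluster(pool, threshold=0.5)
```

Instruction `D` is dropped despite its high complexity because it is not at the highest quality level, and `C` is dropped by the threshold, so the selection keeps `A` and `E`. Picking one instruction per cluster keeps the final set diverse while still favoring complex instructions.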

6. The method according to claim 5, characterized in that, The step of obtaining the final instruction based on the candidate instructions includes: The candidate instruction is validated to determine whether it is an abnormal instruction; if it is determined to be an abnormal instruction, it is removed; otherwise, it is used as the final instruction.

7. The method according to claim 5, characterized in that the instruction types include a single-turn interactive type, a multi-turn interactive type, and a system prompt-guided type; the constraint types include primary constraint types and constraint combination types; the primary constraint types include numerical, format, content, language, style, scene, and action types; the constraint combination types include single, parallel, chained, and selection types; the single type indicates that the response needs to comply with one constraint; the parallel type indicates that the response needs to comply with multiple constraints simultaneously; the chained type indicates that the response requires completing multiple tasks in sequence, each task following its own corresponding constraints; and the selection type indicates that the response requires selecting the correct branch based on preset conditions and following the constraints of the selected branch.
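The taxonomy in claim 7 can be represented as a small data structure, for example to label instructions before filtering. The following is a minimal sketch; the class names, member names and the `InstructionLabel` record are hypothetical, and only the category values come from the claim.

```python
from dataclasses import dataclass
from enum import Enum


class InstructionType(Enum):
    SINGLE_TURN = "single-turn interactive"
    MULTI_TURN = "multi-turn interactive"
    SYSTEM_PROMPT = "system prompt-guided"


class PrimaryConstraintType(Enum):
    NUMERICAL = "numerical"
    FORMAT = "format"
    CONTENT = "content"
    LANGUAGE = "language"
    STYLE = "style"
    SCENE = "scene"
    ACTION = "action"


class CombinationType(Enum):
    SINGLE = "single"        # one constraint
    PARALLEL = "parallel"    # several constraints at once
    CHAINED = "chained"      # sequential tasks, each with its own constraints
    SELECTION = "selection"  # pick a branch, then follow its constraints


@dataclass(frozen=True)
class InstructionLabel:
    """Hypothetical label attached to one instruction."""
    instruction_type: InstructionType
    constraint_types: tuple
    combination: CombinationType


# Example: a single-turn instruction with a word limit and a required format.
label = InstructionLabel(
    instruction_type=InstructionType.SINGLE_TURN,
    constraint_types=(PrimaryConstraintType.NUMERICAL, PrimaryConstraintType.FORMAT),
    combination=CombinationType.PARALLEL,
)
```

Keeping the three axes as separate enumerations makes it straightforward to check that a pool of initial instructions covers different instruction types and constraint types.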

8. The method according to claim 1, characterized in that obtaining the replies corresponding to the instruction includes: processing the instruction using a second large language model to generate multiple different replies to the instruction.

9. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program that, when executed by the at least one processor, implements the meta-evaluation method for model instruction compliance according to any one of claims 1 to 8.

10. A computer-readable storage medium storing a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by a processor to perform the meta-evaluation method for model instruction compliance according to any one of claims 1 to 8.