Methods for evaluating and aligning capabilities of large language models, related apparatuses, and computer program products
By evaluating the capabilities of large language models and adjusting the models using various alignment strategies, the problem of incomplete model training in existing technologies is solved, thereby improving the model's performance in natural language processing tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING BAIDU NETCOM SCI & TECH CO LTD
- Filing Date
- 2025-06-04
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, specifically to the fields of artificial intelligence technology such as large language model alignment, model capability assessment, and deep learning, and particularly to methods, apparatuses, electronic devices, computer-readable storage media, and computer program products for assessing the capabilities of large language models and aligning large language models.
Background Technology
[0002] Large Language Models (LLMs) are deep learning-based artificial intelligence models that are trained on large amounts of text data to understand, generate, and reason about natural language. The main goal of LLMs is to generate reasonable, grammatically correct language output given input.
[0003] Because large language models often have a large scale and a wide range of applications, their training is usually achieved through two stages: pre-training and post-training. These two stages each have different goals and tasks, and their collaborative work enables the model to perform well in various natural language processing tasks.
[0004] Against this backdrop, how to better complete the overall training process of large language models is a matter of concern and urgent need.
Summary of the Invention
[0005] This disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for evaluating the capabilities of large language models and aligning large language models.
[0006] In a first aspect, embodiments of this disclosure propose a method for evaluating the capabilities of a large language model, comprising: processing a sample question using the large language model to be evaluated to obtain at least two answers to be evaluated; determining a set of correct answers from the at least two answers to be evaluated using the sample answers corresponding to the sample question; in response to the correct answer set including at least two correct answers, generating a first capability evaluation value based on a similarity comparison result between the correct answers, and generating a second capability evaluation value based on the quantitative relationship between the correct answers in the correct answer set and the answers to be evaluated; and generating a target capability evaluation value for evaluating the model capability of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value.
[0007] In a second aspect, embodiments of this disclosure propose an apparatus for evaluating the capabilities of a large language model, comprising: a unit for generating answers to be evaluated, configured to process a sample question using the large language model to be evaluated to obtain at least two answers to be evaluated; a unit for determining correct answers, configured to determine a set of correct answers from the at least two answers to be evaluated using sample answers corresponding to the sample question; a sub-capability evaluation value generation unit, configured to, in response to the correct answer set including at least two correct answers, generate a first capability evaluation value based on a similarity comparison result between the correct answers, and generate a second capability evaluation value based on the quantitative relationship between the correct answers in the correct answer set and the answers to be evaluated; and a total capability evaluation value generation unit, configured to generate a target capability evaluation value for evaluating the model capability of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value.
[0008] In a third aspect, embodiments of this disclosure propose a method for aligning large language models, comprising: adjusting an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; generating a first target capability evaluation value corresponding to each large language model to be evaluated, wherein the first target capability evaluation value is generated based on the method for evaluating the capability of large language models described in the first aspect; selecting a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; and adjusting the first intermediate large language model using a second alignment strategy to obtain a target large language model.
[0009] In a fourth aspect, embodiments of this disclosure propose an apparatus for aligning large language models, comprising: a first model adjustment unit configured to adjust an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; a first model evaluation unit configured to generate a first target capability evaluation value corresponding to each large language model to be evaluated, wherein the first target capability evaluation value is generated based on the apparatus for evaluating the capability of large language models in the second aspect; an intermediate model selection unit configured to select a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; and a second model adjustment unit configured to adjust the first intermediate large language model using a second alignment strategy to obtain a target large language model.
[0010] In a fifth aspect, embodiments of this disclosure provide an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement a method for evaluating the capabilities of a large language model as described in any implementation of the first aspect and / or a method for aligning a large language model as described in any implementation of the third aspect.
[0011] In a sixth aspect, embodiments of this disclosure provide a non-transitory computer-readable storage medium storing computer instructions that, when executed by a computer, enable the implementation of a method for evaluating the capabilities of a large language model as described in any implementation of the first aspect and / or a method for aligning a large language model as described in any implementation of the third aspect.
[0012] In a seventh aspect, embodiments of this disclosure provide a computer program product including a computer program that, when executed by a processor, can implement the method for evaluating the capabilities of a large language model as described in any implementation of the first aspect and / or the method for aligning a large language model as described in any implementation of the third aspect.
[0013] The method, apparatus, electronic device, computer-readable storage medium, and computer program product for evaluating the capabilities of a large language model provided in this disclosure firstly process a sample question using the large language model to be evaluated, obtaining at least two answers to be evaluated; then, using the sample answers corresponding to the sample question, a set of correct answers is determined from the at least two answers to be evaluated; next, if the set of correct answers includes at least two correct answers, a first capability evaluation value is generated based on the similarity comparison results between the correct answers, and a second capability evaluation value is generated based on the quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers; finally, a target capability evaluation value is generated based on the first capability evaluation value and the second capability evaluation value to evaluate the model capability of the large language model to be evaluated.
[0014] Therefore, this disclosure enables a more comprehensive, high-quality, and efficient evaluation of the model capabilities of large language models.
[0015] The method, apparatus, electronic device, computer-readable storage medium, and computer program product for aligning large language models provided in this disclosure firstly adjust an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; then, a first target capability evaluation value corresponding to each of the large language models to be evaluated is generated using the aforementioned method for evaluating the capabilities of large language models; next, a first intermediate large language model is selected from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; finally, the first intermediate large language model is adjusted using a second alignment strategy to obtain a target large language model.
[0016] Therefore, this disclosure enables more effective alignment of large language models by using at least two alignment strategies in combination, thereby improving alignment quality. Furthermore, to further enhance the alignment effect of each strategy, the large language model entering the next alignment stage and its corresponding alignment strategy can be determined by generating multiple candidates and selecting the best one, based on the more effective, comprehensive, and efficient evaluation method provided above. This effectively improves the alignment quality of large language models.
[0017] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0018] Other features, objects, and advantages of this disclosure will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0019] Figure 1 is an exemplary system architecture to which this disclosure can be applied;
[0020] Figure 2 is a flowchart of a process for evaluating the capabilities of a large language model, provided by an embodiment of this disclosure;
[0021] Figure 3 is a flowchart of another process for evaluating the capabilities of a large language model, provided by an embodiment of this disclosure;
[0022] Figure 4 is a flowchart of yet another process for evaluating the capabilities of a large language model, provided by an embodiment of this disclosure;
[0023] Figure 5 is a flowchart of a process for aligning a large language model, provided by an embodiment of this disclosure;
[0024] Figure 6 is a flowchart of the two processes, evaluating the capabilities of a large language model and aligning the large language model, in an application scenario, provided by an embodiment of this disclosure;
[0025] Figure 7 is a structural block diagram of an apparatus for evaluating the capabilities of a large language model, provided by an embodiment of this disclosure;
[0026] Figure 8 is a structural block diagram of an apparatus for aligning large language models, provided by an embodiment of this disclosure;
[0027] Figure 9 is a schematic structural diagram of an electronic device suitable for performing the method for evaluating the capabilities of a large language model and / or the method for aligning a large language model, provided by an embodiment of this disclosure.
Detailed Implementation
[0028] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description. It should be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.
[0029] Furthermore, the acquisition, storage, use, processing, transportation, provision, and disclosure of user personal information involved in the technical solutions disclosed herein (for example, in some possible scenarios, the "large language model" in this disclosure may be used to process user personal data; correspondingly, in order to train a "large language model" that meets such capabilities, the sample questions and sample answers corresponding to the sample questions involved in this disclosure may also include at least some personal data) all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0030] Figure 1 shows an exemplary system architecture 100 to which embodiments of the methods, apparatuses, electronic devices, and computer-readable storage media for evaluating the capabilities of large language models and aligning large language models can be applied.
[0031] As shown in Figure 1, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc.
[0032] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various applications that enable communication between the two sides can be installed on terminal devices 101, 102, and 103 and on server 105, such as large language model capability assessment applications, large language model training applications, and instant messaging applications.
[0033] Terminal devices 101, 102, and 103 and server 105 can be either hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with displays, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 101, 102, and 103 are software, they can be installed in the aforementioned electronic devices, and can be implemented as multiple software programs or software modules, or as a single software program or software module; no specific limitation is made here. When server 105 is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server. When server 105 is software, it can be implemented as multiple software programs or software modules, or as a single software program or software module; no specific limitation is made here.
[0034] Server 105 can provide various services through its built-in applications. Taking a large language model capability assessment application as an example, when running this application, server 105 can achieve the following: First, server 105 processes a sample question using the large language model to be evaluated, obtaining at least two answers to be evaluated. Then, server 105 uses the sample answers corresponding to the sample question to determine a set of correct answers from the at least two answers to be evaluated. Next, if the set of correct answers includes at least two correct answers, server 105 responds by generating a first capability evaluation value based on the similarity comparison results between the correct answers, and generating a second capability evaluation value based on the quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers. Finally, server 105 generates a target capability evaluation value based on the first and second capability evaluation values to evaluate the model capability of the large language model to be evaluated.
[0035] Similarly, in the case where the above application is, for example, a large language model training application that can provide training and alignment of large language models, the server 105 can achieve the following effects when running the large language model training application: The server 105 adjusts the initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; then, the server 105 generates a first target capability evaluation value corresponding to the large language model to be evaluated, wherein the first target capability evaluation value can be generated based on the above-described process of evaluating the capability of the large language model; next, the server 105 selects a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation value; finally, the server 105 adjusts the first intermediate large language model using a second alignment strategy to obtain the target large language model.
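The alignment flow described above (several candidates from a first strategy, scoring, ranking, then a second strategy) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `align_two_stage`, the idea of producing candidates through several "variants" of the first strategy, and the choice of keeping the single highest-scoring candidate are all hypothetical and are not specified by the disclosure.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")  # opaque stand-in for a large language model


def align_two_stage(
    initial_model: Model,
    first_strategy_variants: Sequence[Callable[[Model], Model]],
    second_strategy: Callable[[Model], Model],
    evaluate: Callable[[Model], float],
) -> Model:
    """Hypothetical two-stage alignment: candidates -> rank -> refine."""
    if len(first_strategy_variants) < 2:
        raise ValueError("at least two candidate models are required")
    # Stage 1: adjust the initial model to obtain at least two candidates.
    candidates = [variant(initial_model) for variant in first_strategy_variants]
    # Rank candidates by their target capability evaluation value
    # (assumption: the top-ranked candidate becomes the intermediate model).
    intermediate = max(candidates, key=evaluate)
    # Stage 2: adjust the intermediate model with the second strategy.
    return second_strategy(intermediate)
```

Passing the strategies and the evaluator as callables keeps the sketch independent of any particular training framework.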
[0036] It should be noted that the large language model, sample questions, sample answers, etc., can all be obtained by server 105 from other devices such as terminal devices 101, 102, and 103 via network 104. Alternatively, they can be pre-stored locally on server 105 in various ways. Therefore, when server 105 detects that this data is already stored locally (e.g., when starting to process a previously retained evaluation task for the large language model to be evaluated), it can retrieve the data directly from local storage. In this case, the exemplary system architecture 100 may not include terminal devices 101, 102, and 103 and network 104.
[0037] Since evaluating and aligning large language models may require significant computing resources and power, the methods for evaluating and aligning large language models provided in the subsequent embodiments of this disclosure are generally executed by a server 105 with strong computing power and abundant computing resources. Correspondingly, the apparatus for evaluating and aligning large language models is also generally located in the server 105. However, it should also be noted that when terminal devices 101, 102, and 103 also possess sufficient computing power and resources, they can also complete the aforementioned calculations performed by the server 105 through large language model capability evaluation and training applications installed on them, thereby outputting the same results as the server 105. Especially when multiple terminal devices with different computing capabilities exist simultaneously, but the terminal device where the large language model capability assessment application and the large language model training application resides has strong computing power and ample remaining computing resources, the terminal device can perform the aforementioned calculations, thereby appropriately reducing the computing pressure on server 105. Correspondingly, the devices for assessing the large language model capability and aligning the large language model can also be located in terminal devices 101, 102, and 103. In this case, the exemplary system architecture 100 may also exclude server 105 and network 104.
[0038] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0039] Next, we will first discuss the process of evaluating the capabilities of large language models.
[0040] Please refer to Figure 2, which is a flowchart of a process for evaluating the capabilities of a large language model, provided by an embodiment of this disclosure, and which includes process 200.
[0041] Process 200 specifically includes the following steps:
[0042] Step 201: Process the sample question using the large language model to be evaluated to obtain at least two answers to be evaluated;
[0043] In embodiments of this disclosure, in this step the entity executing the method for evaluating the capabilities of a large language model (e.g., the server 105 shown in Figure 1) uses the large language model to be evaluated to process the sample question (e.g., calls the large language model to be evaluated to process the sample question) to obtain at least two answers to be evaluated.
[0044] Typically, the large language model to be evaluated can be a partially trained large language model with certain processing capabilities (e.g., the ability to handle sample questions). For example, the large language model to be evaluated can be a pre-trained large language model that is expected to undergo post-training.
[0045] The sample question can correspond to the capabilities that the large language model to be evaluated is expected to acquire after, for example, the pre-training described above. For example, if the pre-training is intended to train the large language model to have the ability to process text (e.g., to extract key information, semantics, or editing / writing errors from the text), the sample question could be the corresponding text and a "prompt" for processing the text accordingly.
[0046] For example, when aiming to extract key information from text, sample questions could be "What does the text say?", "What are the key pieces of information in the text?", and so on.
[0047] For example, in the case where pre-training is intended to train a large language model to have the corresponding answering and execution capabilities for input text, images, audio, etc., the sample question can actually be text, image, or audio information that includes the question to be answered, or instructions to perform the corresponding action.
[0048] For example, the sample question could be "What is XX?", and the large language model to be evaluated can be instructed to find explanatory information about "XX" using the sample question.
[0049] Accordingly, in the embodiments of this disclosure, during the process of processing sample questions using the large language model to be evaluated, different prompts, such as "generate at least two different language forms" or "generate at least two different versions of the answer," can be used to indicate that after the large language model to be evaluated processes the sample question, two or more answers to be evaluated are generated or obtained. For example, for a sample question like "What is XX?", the "explanation information" generated by the large language model to be evaluated as the answer to be evaluated could be "XX is A1", "XX is A2", "XX is A3", etc.
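As a rough illustration of how several answers to be evaluated might be collected for one sample question, consider the sketch below. It assumes a hypothetical `generate` callable that wraps the model to be evaluated, and it issues one prompt per answer; the disclosure instead mentions prompts that request several versions at once, so the per-call approach and the prompt wording are assumptions made for simplicity.

```python
from typing import Callable, List


def collect_answers(
    generate: Callable[[str], str],
    sample_question: str,
    num_answers: int = 3,
) -> List[str]:
    """Ask a (hypothetical) model wrapper for several answers to one question."""
    if num_answers < 2:
        raise ValueError("at least two answers to be evaluated are required")
    answers = []
    for i in range(num_answers):
        # Illustrative instruction only; any wording that elicits a
        # distinct version of the answer could be used instead.
        prompt = (
            f"{sample_question}\n"
            f"Give answer version {i + 1}, phrased differently from other versions."
        )
        answers.append(generate(prompt))
    return answers
```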
[0050] It should be noted that, as discussed above, in some scenarios, even if the large language model to be evaluated is deployed locally on the executing entity, the sample question can be obtained by the executing entity either directly from a local storage device or from a non-local storage device (e.g., the terminal devices 101, 102, and 103 shown in Figure 1), which enriches the sources of sample questions and enables a more context-appropriate evaluation of the capabilities of the large language model to be evaluated. The local storage device can be a data storage module within the executing entity, such as a server hard drive; in this case, the sample questions can be quickly retrieved locally. The non-local storage device can be any other electronic device configured to store data, such as a user terminal; in this case, the executing entity can obtain the required sample questions by sending an acquisition command to that electronic device.
[0051] Step 202: Using the sample answers corresponding to the sample questions, determine the set of correct answers from at least two answers to be evaluated;
[0052] In the embodiments of this disclosure, after obtaining at least two answers to be evaluated based on step 201 above, the executing entity can compare them with the sample answers corresponding to the sample questions in order to determine the set of correct answers from the at least two answers to be evaluated.
[0053] The sample answer can be a predetermined, standard answer that corresponds to the sample question and meets the requirements for answering the sample question. For example, referring to the example discussed in step 201 above, if the sample question is "What is XX?", then the sample answer can correspond to the explanatory information "XX is YY". For example, such explanatory information can be a widely recognized and accepted definition of "XX" (e.g., "YY").
[0054] Accordingly, in this step, the executing entity can use the "sample answer" to determine the correct answer from at least two answers to be evaluated, thus establishing a set of correct answers. For example, the executing entity can use textual similarity and semantic similarity to select those answers to be evaluated that meet predetermined similarity requirements with the sample answer in terms of textual and semantic similarity, and then determine these qualified answers as the aforementioned correct answers to form a set of correct answers.
[0055] For example, in the above example where the "explanatory information" for the answer to be evaluated is "XX is A1", "XX is A2", or "XX is A3", the exemplary executing entity can determine the "correct answer" from these answers to be evaluated by comparing the semantic similarity between "A1", "A2", "A3" and "YY" respectively.
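A minimal sketch of this filtering step is given below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for textual similarity only; semantic similarity would require an additional model and is not covered here. The function name and the threshold of 0.8 are hypothetical choices, not values taken from the disclosure.

```python
from difflib import SequenceMatcher
from typing import List


def select_correct_answers(
    answers: List[str], sample_answer: str, threshold: float = 0.8
) -> List[str]:
    """Keep answers whose textual similarity to the sample answer is high enough."""
    correct = []
    for answer in answers:
        # ratio() returns a value in [0, 1]; 1.0 means the strings are identical.
        score = SequenceMatcher(None, answer.lower(), sample_answer.lower()).ratio()
        if score >= threshold:
            correct.append(answer)
    return correct
```

The number of elements in the returned list is what decides whether the evaluation can continue to step 203.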
[0056] Next, the executing entity can read and determine the number of elements (or more specifically, the number of “correct answers”) included in the set of correct answers.
[0057] If the set of correct answers contains at least two elements, the executing entity can respond to this and continue to step 203.
[0058] Step 203: In response to the set of correct answers including at least two correct answers, generate a first capability evaluation value based on the similarity comparison results between the correct answers, and generate a second capability evaluation value based on the quantitative relationship between the correct answers in the set of correct answers and the answers to be evaluated;
[0059] In the embodiments of this disclosure, if the executing entity determines in step 202 that the set of correct answers includes at least two (i.e., two or more) "correct answers", the executing entity may, in response, generate a first capability evaluation value based on the similarity comparison results between the correct answers, and generate a second capability evaluation value based on the quantitative relationship between the correct answers in the set of correct answers and the answers to be evaluated.
[0060] For the first capability evaluation value, the executing entity can determine the textual similarity of each "correct answer pair" through pairwise comparisons between the correct answers. The first capability evaluation value is then generated by aggregating these textual similarities, for example, by taking their average.
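The pairwise comparison and averaging can be sketched as follows, again using `difflib.SequenceMatcher` as an assumed textual similarity measure. The function name is hypothetical, and how the resulting average is turned into the final evaluation value is left open here.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def mean_pairwise_similarity(correct_answers: List[str]) -> float:
    """Average textual similarity over all unordered pairs of correct answers."""
    if len(correct_answers) < 2:
        raise ValueError("at least two correct answers are required")
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(correct_answers, 2)
    ]
    return sum(scores) / len(scores)
```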
[0061] For the second capability evaluation value, the executing entity can first determine the difference between the number of answers to be evaluated and the number of correct answers included in the set of correct answers (for example, the total number of answers to be evaluated minus the total number of correct answers). Then, based on a predetermined correspondence between the difference and the capability evaluation value, the executing entity maps this difference to the corresponding second capability evaluation value.
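The difference-based mapping might look like the sketch below. The lookup table is passed in by the caller because the disclosure does not give concrete values; the function name and the default for unlisted differences are assumptions.

```python
from typing import Dict


def second_value_from_difference(
    num_answers: int,
    num_correct: int,
    table: Dict[int, float],
    default: float = 0.0,
) -> float:
    """Map (answers - correct answers) to a score via a predetermined table."""
    if not 0 <= num_correct <= num_answers:
        raise ValueError("num_correct must be between 0 and num_answers")
    difference = num_answers - num_correct
    # Differences missing from the table fall back to the default score.
    return table.get(difference, default)
```

For example, a purely illustrative table such as `{0: 1.0, 1: 0.8, 2: 0.5}` would give 0.8 when four of five answers are correct.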
[0062] In some optional implementations of this embodiment, for the first capability evaluation value, which is associated with "similarity", the executing entity may also use the edit distance between the correct answers as the aforementioned "similarity comparison result". That is, the executing entity may take one correct answer as a benchmark and determine the edit distance between it and the other correct answer in the same "correct answer pair" (e.g., the number of character-level or word-level operations required to transform it into the other correct answer).
[0063] The executing entity can then use this edit distance as the similarity comparison result to determine the first capability evaluation value (for example, by mapping the edit distance to the first capability evaluation value based on a predetermined correspondence between the two). Because the edit distance combines semantic and glyph differences, it can more comprehensively reflect the similarity between correct answers.
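For reference, one common definition of edit distance is the Levenshtein distance, sketched below. The disclosure does not name a specific variant, so this choice is an assumption. The function accepts any sequence, so passing strings gives a character-level distance and passing `text.split()` gives a word-level distance.

```python
from typing import Sequence


def edit_distance(source: Sequence[str], target: Sequence[str]) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j items of target.
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(
                min(
                    previous[j] + 1,         # delete s_item
                    current[j - 1] + 1,      # insert t_item
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]
```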
[0064] In practice, the correlation between the degree of similarity and the first capability evaluation value can be set according to different needs. For example, since all answers considered for the first capability evaluation value are "correct answers", in order to encourage stronger divergent thinking and answer-finding abilities in the large language model, the first capability evaluation value can be negatively correlated with the similarity in the similarity comparison results mentioned above. That is, the lower the similarity between the output correct answers, the better the diversity of the "correct answers" generated by the large language model to be evaluated (in other words, the higher its "answer diversity"), and the higher the corresponding first capability evaluation value.
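One way to make this negative relationship concrete, offered purely as an illustrative assumption and not as the formula used in the disclosure, is to take one minus the mean pairwise similarity. Here $k$ is the number of correct answers $a_1, \dots, a_k$, $\mathrm{sim}$ is any similarity measure with values in $[0, 1]$, and $E_1$ denotes the first capability evaluation value; all of these symbols are introduced only for this example.

```latex
E_1 = 1 - \frac{2}{k(k-1)} \sum_{1 \le i < j \le k} \mathrm{sim}(a_i, a_j), \qquad k \ge 2
```

Under this choice, $k$ identical correct answers give $E_1 = 0$, while correct answers with no measurable similarity give $E_1 = 1$.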
[0065] In some optional implementations of this embodiment, for the aforementioned "second capability evaluation value", the executing entity may determine the quantitative relationship between the correct answers in the set of correct answers and the answers to be evaluated based on the ratio of the number of correct answers to the number of answers to be evaluated. A second capability evaluation value is then generated based on this quantitative relationship.
[0066] The executing entity can thus gauge how well the large language model to be evaluated provides "correct answers" from the proportion of correct answers it outputs.
[0067] Correspondingly, the second capability evaluation value is usually positively correlated with this ratio. It can therefore be used to intuitively and directly reflect the ability of the large language model to be evaluated to output correct answers.
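A small sketch of the ratio-based variant follows. Using the ratio itself as the evaluation value is the simplest positively correlated choice and is an assumption; the disclosure only requires that the value be generated from the ratio. The function name is hypothetical.

```python
def correct_answer_ratio(num_correct: int, num_answers: int) -> float:
    """Share of the answers to be evaluated that were judged correct."""
    if num_answers <= 0:
        raise ValueError("num_answers must be positive")
    if not 0 <= num_correct <= num_answers:
        raise ValueError("num_correct must be between 0 and num_answers")
    return num_correct / num_answers
```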
[0068] Step 204: Based on the first capability evaluation value and the second capability evaluation value, generate a target capability evaluation value for evaluating the model capability of the large language model to be evaluated.
[0069] In the embodiments of this disclosure, after determining and generating the first capability evaluation value and the second capability evaluation value based on the above step 203, the executing entity generates a target capability evaluation value for evaluating the model capability of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value.
[0070] For example, the implementing entity can obtain the target capability evaluation value by directly summing the first capability evaluation value and the second capability evaluation value. Alternatively, the implementing entity can combine the first capability evaluation value and the second capability evaluation value based on pre-configured reference coefficients, such as through weighted summation, to generate a target capability evaluation value for evaluating the model capability of the large language model to be evaluated.
[0071] Correspondingly, the target ability evaluation value can provide intuitive feedback on the "overall ability" of the large language model to be evaluated in the dimensions to be evaluated. For example, a large language model with a higher target ability evaluation value can be understood as having better and higher answer diversity, as well as a higher "percentage of correct answers".
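As a non-limiting illustration, the direct summation and the weighted summation mentioned above can be sketched with a single Python function. The parameter names `w_first` and `w_second` stand for the pre-configured reference coefficients and are hypothetical, as are the example weights in the note below.

```python
def target_ability_value(first: float, second: float,
                         w_first: float = 1.0,
                         w_second: float = 1.0) -> float:
    """Combine the two ability evaluation values.

    With the default weights of 1.0 this is the direct sum; other
    weights give the weighted summation.
    """
    return w_first * first + w_second * second
```

With a first value of 0.5 and a second value of 0.75, the direct sum is 1.25, while weights of 0.4 and 0.6 give approximately 0.65, placing more emphasis on the proportion of correct answers.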
[0072] The method for evaluating the capabilities of a large language model provided in this disclosure firstly processes a sample question using the large language model to be evaluated, obtaining at least two answers to be evaluated. Then, using the sample answers corresponding to the sample question, a set of correct answers is determined from the at least two answers to be evaluated. Next, if the set of correct answers includes at least two correct answers, a first capability evaluation value is generated based on the similarity comparison results between the correct answers, and a second capability evaluation value is generated based on the quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers. Finally, a target capability evaluation value is generated based on the first and second capability evaluation values to evaluate the model capability of the large language model to be evaluated. Therefore, this disclosure enables a more comprehensive, high-quality, and efficient evaluation of the model capability of a large language model.
[0073] In some optional implementations of this embodiment, the executing entity may find in step 202 above that only one of the at least two answers to be evaluated is correct, that is, the set of correct answers includes only one correct answer.
[0074] In such a situation, the implementing entity can respond by simply generating and utilizing the "second ability evaluation value" to generate the aforementioned "target ability evaluation value," instead of evaluating it from the perspective of similarity, such as "diversity." That is, if the set of correct answers includes a single correct answer, the implementing entity can respond by generating the second ability evaluation value solely based on the quantitative relationship between the correct answers in the set of correct answers and the answer to be evaluated.
[0075] Then, the executing entity generates a target capability evaluation value for assessing the capabilities of the large language model to be evaluated based solely on the second capability evaluation value, in order to avoid calculation errors caused by the lack of "correct answer pairs" for generating the first capability evaluation value.
[0076] In some embodiments, the executing entity may fail to identify any "correct answer" among the at least two answers to be evaluated in step 202 above (i.e., the executing entity "considers" all answers to be evaluated to be wrong). In such cases, the executing entity can generate a prompt message for this situation and then send the prompt message to the target device (e.g., the terminal device used by the trainer who provides training for the large language model to be evaluated) through a pre-configured communication path. This reports that the large language model to be evaluated may currently be unable to generate a "correct answer," giving the "trainer" a reference for adjusting the large language model to be evaluated in a timely manner (e.g., adjusting the "pre-training" environment).
[0077] In some embodiments, in order to improve the quality of the assessment, the implementing entity may also repeat the above process over multiple "processing rounds" and determine a target capability evaluation sub-value corresponding to each processing round.
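As a non-limiting illustration, the three cases discussed above (no correct answer, a single correct answer, and at least two correct answers) can be sketched in Python as follows. The callables `is_correct`, `diversity`, and `notify` are hypothetical placeholders for the correctness check of step 202, the similarity-based scoring of step 203, and the pre-configured communication path, respectively; combining the two values by direct summation is an assumption.

```python
from typing import Callable, Optional


def evaluate_once(answers: list[str],
                  is_correct: Callable[[str], bool],
                  diversity: Callable[[list[str]], float],
                  notify: Callable[[str], None]) -> Optional[float]:
    """Return a target ability evaluation value, or None if none is correct."""
    correct = [a for a in answers if is_correct(a)]
    if not correct:
        # No correct answer: report the situation instead of scoring.
        notify("the model produced no correct answer for this sample question")
        return None
    second = len(correct) / len(answers)
    if len(correct) == 1:
        # A single correct answer has no pair to compare, so only the
        # ratio-based second value is used.
        return second
    return diversity(correct) + second
```

Returning `None` rather than a score of zero keeps the "no correct answer" case distinguishable from a model that answered correctly but with low diversity.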
[0078] Then, by integrating these target capability evaluation sub-values, the final target capability evaluation value is obtained. This avoids evaluation inaccuracies caused by factors such as random occurrences during individual rounds.
[0079] For this, refer to Figure 3, which is a flowchart of another process for evaluating the capabilities of a large language model provided in an embodiment of this disclosure, including process 300.
[0080] Process 300 specifically includes the following steps:
[0081] Step 301: Repeatedly process the sample problem using the large language model to be evaluated to obtain at least two first answers to be evaluated corresponding to each processing round;
[0082] Specifically, in this step, as discussed above, the executing entity can first determine at least two first answers to be evaluated for each processing round in the form of "processing rounds" (for ease of understanding, the "answers to be evaluated" for each processing round are described as "first answers to be evaluated").
[0083] For example, the executing entity can repeatedly provide sample questions to the large language model to be evaluated in processing rounds, so that the large language model to be evaluated can complete the processing corresponding to the "processing round" and obtain at least two first answers to be evaluated for each processing round.
[0084] Step 302: In each processing round, using the sample answers corresponding to the sample question, determine the set of the first correct answers corresponding to the processing round from at least two first answers to be evaluated;
[0085] Specifically, in this step, the executing entity, in each processing round, similar to the content discussed in step 202 above, determines the (first) correct answer among at least two first answers to be evaluated corresponding to that processing round, so as to determine the set of first correct answers corresponding to the processing round.
[0086] Step 303: In each processing round, in response to the fact that the corresponding first correct answer set includes at least two first correct answers, a first ability evaluation sub-value corresponding to the processing round is generated based on the similarity comparison result between the first correct answers, and a second ability evaluation sub-value corresponding to the processing round is generated based on the quantitative relationship between the first correct answers and the first answer to be evaluated corresponding to the processing round.
[0087] Specifically, in this step, the executing entity can, in each processing round and in a manner similar to that discussed in step 203 above, generate a first ability evaluation sub-value (i.e., determined based on the similarity comparison result between the first correct answers) and a second ability evaluation sub-value (i.e., determined based on the quantitative relationship between the first correct answers and the first answers to be evaluated corresponding to the processing round).
[0088] Accordingly, if the first set of correct answers includes only one first correct answer, the executing entity can similarly determine only the corresponding second ability evaluation sub-value, and in subsequent steps (e.g., step 304), use only this second ability evaluation sub-value as the target ability evaluation sub-value for the corresponding processing round. This will not be repeated here.
[0089] Step 304: Based on the first and second capability evaluation sub-values corresponding to each processing round, generate the target capability evaluation sub-values corresponding to each processing round.
[0090] Specifically, after determining the first capability evaluation sub-value and the second capability evaluation sub-value corresponding to each processing round based on step 303 above, the executing entity can, similar to what was discussed in step 204 above, combine the first capability evaluation sub-value and the second capability evaluation sub-value in each processing round to obtain the target capability evaluation sub-value corresponding to that processing round.
[0091] Step 305: Based on each target capability evaluation sub-value, generate the target capability evaluation value for the large language model to be evaluated.
[0092] Specifically, after obtaining the target capability evaluation sub-values corresponding to each processing round based on step 304 above, the executing entity can similarly generate the target capability evaluation value for the large language model to be evaluated by, for example, summing all the target capability evaluation sub-values.
[0093] Therefore, the implementing entity can reduce the impact of the randomness of the model itself on the effect evaluation by repeatedly evaluating it through multiple processing rounds.
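As a non-limiting illustration, process 300 can be sketched in Python as a loop over processing rounds. The callable `generate` stands for invoking the large language model to be evaluated on the sample question and returning its answers for one round, and `score_round` stands for steps 302 to 304 applied to those answers; both are hypothetical placeholders. Summation is the aggregation named in step 305.

```python
from typing import Callable, Sequence


def evaluate_over_rounds(generate: Callable[[str], Sequence[str]],
                         question: str,
                         score_round: Callable[[Sequence[str]], float],
                         rounds: int) -> float:
    """Sum the target capability evaluation sub-values of all rounds."""
    if rounds < 1:
        raise ValueError("at least one processing round is required")
    sub_values = []
    for _ in range(rounds):
        answers = generate(question)             # step 301
        sub_values.append(score_round(answers))  # steps 302 to 304
    return sum(sub_values)                       # step 305
```

Because each round samples the model afresh, an unusually good or bad round contributes only one term to the total.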
[0094] In some embodiments, in addition to improving quality through multiple "processing rounds", the implementing entity can also choose to use different sample questions in different processing rounds, so that more sample questions are used to test the large language model to be evaluated and the capability assessment becomes more accurate.
[0095] For this, refer to Figure 4, which is a flowchart of another process for evaluating the capabilities of a large language model provided in an embodiment of this disclosure, including process 400.
[0096] Process 400 specifically includes the following steps:
[0097] Step 401: Use the large language model to be evaluated to process at least two sample questions respectively, and obtain at least two second answers to be evaluated corresponding to each sample question;
[0098] Specifically, in this step, the executing entity can use the large language model to be evaluated to process at least two (different) sample questions respectively, and obtain at least two second answers to be evaluated corresponding to each sample question (for ease of understanding, the "answer to be evaluated" for the sample questions is described as the "second answer to be evaluated").
[0099] Step 402: For each sample question, using the sample answers corresponding to the sample question, determine the set of second correct answers corresponding to the sample question from at least two second answers to be evaluated;
[0100] Specifically, in this step, the executing entity can, similar to the content discussed in step 202 above, determine the "(second) correct answers" among the at least two second answers to be evaluated corresponding to each sample question, thereby obtaining the set of second correct answers corresponding to each sample question.
[0101] Step 403: For each sample question, in response to the fact that the corresponding set of second correct answers includes at least two second correct answers, based on the similarity comparison results between the second correct answers, generate a third ability evaluation sub-value corresponding to the sample question, and based on the quantitative relationship between the second correct answer and the second answer to be evaluated, generate a fourth ability evaluation sub-value corresponding to the sample question;
[0102] Specifically, in this step, the executing entity can, similar to what was discussed in step 203 above, determine for each sample question the third ability evaluation sub-value (i.e., determined based on the similarity comparison results between the second correct answers; it is named "third" here only for ease of understanding and to distinguish it from the sub-values above) and the fourth ability evaluation sub-value (i.e., determined based on the quantitative relationship between the second correct answers and the second answers to be evaluated corresponding to the sample question; it is named "fourth" here only for ease of understanding and to distinguish it from the sub-values above).
[0103] Step 404: Based on the third and fourth ability evaluation sub-values corresponding to the sample questions, generate the target ability evaluation sub-values corresponding to each sample question;
[0104] Specifically, after determining the third and fourth capability evaluation sub-values corresponding to each sample problem based on step 403 above, the executing entity can, similar to what was discussed in step 204 above, combine the third and fourth capability evaluation sub-values to obtain the target capability evaluation sub-value corresponding to each sample problem.
[0105] Step 405: Based on each target capability evaluation sub-value, generate the target capability evaluation value for the large language model to be evaluated.
[0106] Specifically, after obtaining the target capability evaluation sub-values corresponding to each sample problem based on step 404 above, the executing entity can similarly generate the target capability evaluation value for the large language model to be evaluated by, for example, summing all the target capability evaluation sub-values.
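As a non-limiting illustration, process 400 can be sketched in Python as a mapping from each sample question to its target capability evaluation sub-value. The callable `generate` stands for the large language model to be evaluated, and `score_question` stands for steps 402 to 404 applied to one question's answers and its sample answer; both are hypothetical placeholders.

```python
from typing import Callable, Mapping, Sequence


def evaluate_over_questions(
        generate: Callable[[str], Sequence[str]],
        samples: Mapping[str, str],
        score_question: Callable[[Sequence[str], str], float],
) -> dict[str, float]:
    """Map each sample question to its target capability sub-value."""
    if len(samples) < 2:
        raise ValueError("at least two sample questions are required")
    return {
        question: score_question(generate(question), sample_answer)
        for question, sample_answer in samples.items()
    }
```

Summing the returned values, for example with `sum(result.values())`, gives the target capability evaluation value of step 405; keeping the per-question sub-values also shows which sample questions the model handles poorly.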
[0107] In some embodiments, during the process of generating target capability evaluation values for the large language model to be evaluated based on various target capability evaluation sub-values, such as during step 305 or step 405, the executing entity may choose to generate target capability evaluation values for the large language model to be evaluated based on at least one of the mean deviation, variance, and range of the various target capability evaluation sub-values. This allows for further computational processing to provide more accurate and reasonable target capability evaluation values for assessing the model capability of the large language model to be evaluated.
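As a non-limiting illustration, the statistics named above can be computed from the target capability evaluation sub-values as in the following Python sketch. It assumes that "mean deviation" denotes the mean absolute deviation from the mean and that the population variance is intended; how these statistics are then folded into the final target capability evaluation value is not specified here and is left to the implementer. The function name `spread_statistics` is hypothetical.

```python
from statistics import fmean, pvariance


def spread_statistics(sub_values: list[float]) -> dict[str, float]:
    """Mean, mean absolute deviation, population variance, and range."""
    if not sub_values:
        raise ValueError("at least one sub-value is required")
    mean = fmean(sub_values)
    return {
        "mean": mean,
        "mean_deviation": fmean([abs(v - mean) for v in sub_values]),
        "variance": pvariance(sub_values, mu=mean),
        "range": max(sub_values) - min(sub_values),
    }
```

A small variance or range across rounds or sample questions suggests that the model's measured capability is stable rather than the result of a few lucky samples.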
[0108] In some embodiments, after the executing agent determines the corresponding second ability evaluation sub-value for each processing round or each sample question (in the latter scenario this is the fourth ability evaluation sub-value; it is referred to here simply as the "second ability evaluation sub-value"), the executing agent may also choose to determine an overall result based on the distribution of the second ability evaluation sub-values, and use it to replace the portion of the target ability evaluation value that corresponds to those sub-values. In this way, the distribution of the "accuracy" reflects both the "correctness" of the answers to be evaluated provided by the large language model and its ability to provide correct answers.
[0109] Next, in the embodiments of this disclosure, a method for aligning a large language model is also provided. The process of "aligning a large language model" can be a process of ensuring that the output of the large language model complies with ethical, legal, and social norms during training and use. For example, the large language model can be trained (or post-trained) by aligning the input and the results of manually labeled samples to complete the "alignment".
[0110] For details, refer to Figure 5, which is a flowchart of a process for aligning a large language model provided in an embodiment of this disclosure, including process 500.
[0111] Process 500 specifically includes the following steps:
[0112] Step 501: Adjust the initial large language model using the first alignment strategy to obtain at least two large language models to be evaluated;
[0113] In embodiments of this disclosure, in this step the execution entity of the method for aligning large language models (e.g., the server 105 shown in Figure 1) adjusts the initial large language model based on a first alignment strategy. This initial large language model can be, for example, a pre-trained large language model that needs to be "aligned".
[0114] The first alignment strategy can be one of the following: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), or Proximal Policy Optimization (PPO).
[0115] Supervised Fine-Tuning (SFT) refers to the process of training a pre-trained initial large language model on a specific task by introducing labeled data, thereby enabling the initial large language model to better adapt to the needs of a particular application or task. This process is typically used to further refine the model parameters of an initial large language model pre-trained on a large dataset through supervised learning methods to achieve better performance in a specific domain or task.
[0116] When aligning the initial large language model, DPO aims to directly optimize the output of the initial large language model to make it more consistent with human preferences.
[0117] PPO is a policy optimization method that can also be applied to training large language models, especially in handling large-scale, highly complex tasks. It can reduce instability during training through more effective policy optimization.
[0118] In this step, the executing entity can also use the first alignment strategy to adjust the initial large language model, resulting in at least two large language models to be evaluated. For example, even with the same first alignment adjustment strategy (e.g., DPO), the executing entity can train and adjust at least two different large language models to be evaluated based on the same initial large language model by using different sample alignment inputs and sample alignment results.
[0119] Step 502: Generate the first target ability evaluation value corresponding to the large language model to be evaluated;
[0120] In the embodiments of this disclosure, based on step 501, the executing entity in this step can generate a first target ability evaluation value corresponding to each of the at least two large language models to be evaluated generated in step 501 (for ease of description, the target ability evaluation value corresponding to each large language model to be evaluated can be referred to as the first target ability evaluation value).
[0121] The “first target ability evaluation value” can actually be generated by the implementing entity based at least on the method and process for evaluating the ability of the large language model discussed above for process 200 (that is, the implementing entity generates the “first target ability evaluation value” by the process for generating the “target ability evaluation value” discussed above), which will not be repeated here.
[0122] Step 503: Based on the ranking results of the first target ability evaluation value, select the first intermediate large language model from at least two large language models to be evaluated;
[0123] In embodiments of this disclosure, after generating each first target capability evaluation value based on step 502 above, the executing entity can sort the first target capability evaluation values in, for example, descending order. Then, the executing entity selects the first intermediate large language model from the sorting results to continue subsequent "alignment".
[0124] For example, after sorting in descending order, the executing entity can select the large language model to be evaluated that ranks first (i.e., the large language model to be evaluated with the highest first target capability evaluation value) as the first intermediate large language model, or it can select the large language model to be evaluated at a preset rank position in the sorted results as the first intermediate large language model.
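As a non-limiting illustration, the descending sort and rank-based selection described above can be sketched in Python as follows. The function name `select_by_rank` and the model identifiers used in the note below are hypothetical.

```python
def select_by_rank(scores: dict[str, float], rank: int = 0) -> str:
    """Return the model identifier at the given 0-based rank position.

    Models are sorted by their first target capability evaluation value
    in descending order, so rank 0 is the highest-scoring model.
    """
    if not 0 <= rank < len(scores):
        raise ValueError("rank is outside the list of candidates")
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[rank]
```

Given scores of 0.8, 1.3, and 1.1 for three candidate models, the default call returns the model scoring 1.3, and `rank=1` returns the model scoring 1.1.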
[0125] Step 504: Adjust the first intermediate large language model using the second alignment strategy to obtain the target large language model.
[0126] In the embodiments of this disclosure, after the executing entity selects the first intermediate large language model based on the above step 503, it can continue to adjust the first intermediate large language model in the next round and stage based on the second alignment strategy to obtain the target large language model.
[0127] In some optional implementations of this embodiment, the "second alignment strategy" can be an alignment strategy other than the first alignment strategy, for example one of the SFT, DPO, and PPO mentioned above that differs from the first alignment strategy.
[0128] Therefore, by combining the first alignment strategy and the second alignment strategy (for example, by using two different alignment strategies in sequence), the executing agent can combine the advantages of each alignment strategy to better complete the alignment and post-training of the large language model.
[0129] In some optional implementations of this embodiment, when the executing entity uses the second alignment strategy for alignment, that is, during the execution of this step, the executing entity may, as an alternative or in addition, similarly choose to use the second alignment strategy to adjust the first intermediate large language model into at least two second intermediate large language models.
[0130] Then, after determining and generating the second target ability evaluation value corresponding to each second intermediate large language model (for example, generating the second target ability evaluation value corresponding to the second intermediate large language model in the same way as discussed in process 200 above), at least based on the ranking result of the second target ability evaluation value, the target large language model is selected from at least two second intermediate large language models.
[0131] This avoids the problem of alignment quality being affected by occasional errors, biases, and the like, which can arise when only a single target large language model is generated directly as the result.
[0132] In some embodiments, when selecting a target large language model from at least two second intermediate large language models based on the ranking result of at least the second target capability evaluation value, the executing entity may, alternatively or additionally, select a target large language model from the first intermediate large language model and the at least two second intermediate large language models based on the joint ranking result of the first target capability evaluation value and the second target capability evaluation values.
[0133] That is, the implementing entity can consider the "first intermediate large language model" before adjustment by the second alignment strategy, along with each of the "second intermediate large language models," as candidates for the target large language model. This allows it to be compared with at least two second intermediate large language models obtained after adjustment by the second alignment strategy to determine the "optimal large language model." This avoids undesirable "alignment" of the large language model due to improper second alignment strategy, thus preventing any impact on model quality.
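As a non-limiting illustration, the joint ranking described above can be sketched in Python as follows. It assumes that the first and second target capability evaluation values were produced by the same evaluation procedure and are therefore directly comparable. The function name `select_target` and the model identifiers in the note below are hypothetical.

```python
def select_target(first_intermediate: tuple[str, float],
                  second_intermediates: dict[str, float]) -> str:
    """Pick the best model among the pre- and post-adjustment candidates."""
    candidates = dict(second_intermediates)  # copy; do not mutate the input
    name, score = first_intermediate
    candidates[name] = score                 # the unadjusted model competes too
    return max(candidates, key=candidates.get)
```

If the first intermediate model scores 1.3 and both second intermediate models score lower, the first intermediate model itself is returned, which guards against a second alignment stage that degraded the model.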
[0134] The method for aligning large language models provided in this disclosure firstly adjusts an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; then, it generates a first target capability evaluation value corresponding to each of the large language models to be evaluated using the aforementioned method for evaluating the capabilities of large language models; next, it selects a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; finally, it adjusts the first intermediate large language model using a second alignment strategy to obtain a target large language model.
[0135] Therefore, this disclosure enables more effective alignment of large language models by employing at least two alignment strategies in combination, thereby improving alignment quality. Furthermore, to further enhance the alignment effect of each strategy, the large language model advancing to the next alignment stage and using the chosen alignment strategy can be determined based on the aforementioned more effective, comprehensive, and efficient evaluation method. This effectively improves the alignment quality of large language models.
[0136] In some embodiments, after the first and second alignment strategies, a third, fourth, or other alignment strategies can be used to perform "alignment" serially and in stages by continuously using multiple alignment strategies, thereby combining "alignment" with different purposes and standards. For example, after obtaining at least two second intermediate large language models, the executing entity can select the second intermediate large language model with the highest corresponding second target capability evaluation value and continue to "align" it using the third alignment strategy, which will not be repeated here.
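As a non-limiting illustration, the serial, staged use of multiple alignment strategies can be sketched in Python as follows. Each strategy is modeled as a hypothetical callable that takes the current model and returns one or more adjusted candidate models, and `evaluate` stands for the capability evaluation method discussed above; selecting the highest-scoring candidate at every stage is one of the selection options described, not the only one.

```python
from typing import Callable, Sequence, TypeVar

M = TypeVar("M")  # any representation of a model (object, checkpoint path, ...)


def staged_alignment(initial: M,
                     strategies: Sequence[Callable[[M], Sequence[M]]],
                     evaluate: Callable[[M], float]) -> M:
    """Apply the alignment strategies in order, keeping the best candidate."""
    current = initial
    for strategy in strategies:
        candidates = strategy(current)  # e.g. SFT, then DPO, then PPO
        if not candidates:
            raise ValueError("a strategy must return at least one candidate")
        current = max(candidates, key=evaluate)
    return current
```

Because the loop does not depend on the number of strategies, a third or fourth alignment stage is added simply by appending another callable to `strategies`.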
[0137] Based on any of the above embodiments, the "target large language model" obtained after alignment can be used to process and complete corresponding tasks. For example, as discussed above, if the target large language model is trained to have text processing capabilities (e.g., the ability to extract key information from text) after pre-training (and post-training, alignment), then in this case, the "target large language model" can, after obtaining the text to be processed, perform text processing actions accordingly based on prompts such as "Please extract the key information in this text" or based on pre-configured processing objectives to obtain the corresponding results. For example, in this scenario, the target large language model can, after obtaining the text to be processed, extract the "key information" included in the "text to be processed," such as a summary or abstract.
[0138] To enhance understanding, this disclosure also provides a specific implementation scheme, using a concrete application scenario, that includes two processes: evaluating the capabilities of a large language model and aligning the large language model. Refer to Figure 6, which is a flowchart illustrating these two processes in an application scenario, provided as an embodiment of the present disclosure. Figure 6 includes process 600.
[0139] In process 600, for ease of explanation, server 105 (not shown in the figure) can also be used as the "execution subject" for both processes.
[0140] First, in process 600, the large language model 610 can be an initial large language model that needs to undergo subsequent "alignment".
[0141] Accordingly, for the large language model 610, the executing entity can adjust the large language model 610 by executing S601 using the first alignment strategy to obtain the large language models 621, 622 to 62N to be evaluated (where N is a positive integer).
[0142] Then, the implementing entity can use sample question 630 and sample answer 635 to "evaluate" the large language models 621, 622 to 62N to be evaluated. For example, based at least on process 200 shown in Figure 2 and discussed above, the implementing entity can invoke each large language model to be evaluated to process sample question 630; then, based on sample answer 635, determine the correct answers in each model's processing results for sample question 630; and then, based on the "correct answer (set)", generate the corresponding target ability evaluation value for that large language model to be evaluated.
[0143] For example, the result of the "evaluation" of the large language model 621 to be evaluated can be the first target ability evaluation value 641. Similarly, the result of the "evaluation" of the large language model 622 to be evaluated can be the first target ability evaluation value 642, and the result of the "evaluation" of the large language model 62N to be evaluated can be the first target ability evaluation value 64N.
[0144] Then, the executing entity executes S603 to select the first intermediate large language model from the large language models to be evaluated 621, 622 to 62N.
[0145] For example, the implementing entity can select the large language model to be evaluated with the highest first target capability evaluation value as the first intermediate large language model based on the descending sorting results of the first target capability evaluation values 641, 642 to 64N.
[0146] For example, in this scenario, the first target capability evaluation value 642 corresponding to the large language model 622 to be evaluated is the highest. Then, after executing S603, the executing entity can use the large language model 622 to be evaluated as the "first intermediate large language model".
[0147] Next, the executing entity can continue to execute S604 to use a second alignment strategy (different from the first alignment strategy) to further "align" the large language model 622 to be evaluated, and obtain the target large language model 650 with the alignment completed.
[0148] With further reference to Figure 7, as an implementation of the method for evaluating the capabilities of large language models described above, this disclosure provides an embodiment of an apparatus for evaluating the capabilities of large language models. This apparatus embodiment corresponds to the method embodiment shown in Figure 2, and the apparatus can be specifically applied to various electronic devices.
[0149] As shown in Figure 7, the apparatus 700 for evaluating the capabilities of a large language model in this embodiment may include: a to-be-evaluated answer generation unit 701, a correct answer determination unit 702, a sub-capability evaluation value generation unit 703, and a total capability evaluation value generation unit 704. Specifically, the to-be-evaluated answer generation unit 701 is configured to process a sample question using the large language model to be evaluated, obtaining at least two answers to be evaluated; the correct answer determination unit 702 is configured to determine a set of correct answers from the at least two answers to be evaluated using sample answers corresponding to the sample question; the sub-capability evaluation value generation unit 703 is configured to, in response to the correct answer set including at least two correct answers, generate a first capability evaluation value based on a similarity comparison result between the correct answers, and generate a second capability evaluation value based on the quantitative relationship between the correct answers in the correct answer set and the answers to be evaluated; the total capability evaluation value generation unit 704 is configured to generate a target capability evaluation value for evaluating the model capability of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value.
[0150] In this embodiment, for the specific processing and technical effects of the to-be-evaluated answer generation unit 701, the correct answer determination unit 702, the sub-capability evaluation value generation unit 703, and the total capability evaluation value generation unit 704 in the apparatus 700 for evaluating the capabilities of large language models, refer to the relevant descriptions of steps 201-204 in the embodiment corresponding to Figure 2, which will not be repeated here.
[0151] In some optional implementations of this embodiment, the apparatus 700 further includes a similarity comparison unit configured to determine the similarity comparison result between the correct answers based on the edit distance between each pair of correct answers.
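As an illustration of the edit-distance comparison in paragraph [0151], the following is a minimal Python sketch. The function names `edit_distance` and `mean_pairwise_similarity`, the normalization by the longer string's length, and the averaging over all pairs are assumptions made for illustration; the disclosure states only that the similarity comparison result is determined from the edit distance between correct answers.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_pairwise_similarity(answers: list[str]) -> float:
    """Average normalized similarity in [0, 1] over all answer pairs.

    Normalizing by the longer string and averaging are assumptions,
    not something the disclosure specifies.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two correct answers")
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b)) or 1  # avoid dividing by zero
        total += 1.0 - edit_distance(a, b) / longest
    return total / len(pairs)
```

Under this sketch, identical correct answers give a similarity of 1.0, and answers that reach the same result through differently worded solutions give a lower value.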
[0152] In some optional implementations of this embodiment, the apparatus 700 further includes: a quantitative relationship determination unit, configured to determine the quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers based on the ratio of the number of correct answers to the number of answers to be evaluated.
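The ratio described in paragraph [0152] can be sketched in a few lines of Python. The function name `correct_ratio` and the input validation are illustrative assumptions; the disclosure specifies only the ratio of the number of correct answers to the number of answers to be evaluated.

```python
def correct_ratio(num_correct: int, num_evaluated: int) -> float:
    """Ratio of correct answers to all answers to be evaluated."""
    if num_evaluated <= 0:
        raise ValueError("at least one answer to be evaluated is required")
    if not 0 <= num_correct <= num_evaluated:
        raise ValueError("num_correct must lie between 0 and num_evaluated")
    return num_correct / num_evaluated


# Example: 3 of 5 generated answers match the sample answer.
ratio = correct_ratio(3, 5)
```

In this example, three correct answers out of five answers to be evaluated give a ratio of 0.6.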
[0153] In some optional implementations of this embodiment, the answer generation unit 701 is further configured to repeatedly process the sample question using the large language model to be evaluated, obtaining at least two first answers to be evaluated corresponding to each processing round; the correct answer determination unit 702 is further configured to determine, in each processing round, a first set of correct answers corresponding to the processing round from the at least two first answers to be evaluated using the sample answers corresponding to the sample question; the sub-ability evaluation value generation unit 703 is further configured to, in each processing round, in response to the corresponding first correct answer set including at least two first correct answers, generate a first ability evaluation sub-value corresponding to the processing round based on the similarity comparison result between the first correct answers, and generate a second ability evaluation sub-value corresponding to the processing round based on the quantitative relationship between the first correct answers and the first answers to be evaluated corresponding to the processing round; and the total ability evaluation value generation unit 704 is further configured to generate a target ability evaluation sub-value corresponding to each processing round based on the first and second ability evaluation sub-values corresponding to that processing round, and generate a target ability evaluation value for the large language model to be evaluated based on each target ability evaluation sub-value.
[0154] In some optional implementations of this embodiment, the answer generation unit 701 is further configured to process at least two sample questions using the large language model to be evaluated, and obtain at least two second answers to be evaluated corresponding to each sample question; the correct answer determination unit 702 is further configured to determine, for each sample question, a set of second correct answers corresponding to the sample question from the at least two second answers to be evaluated, using the sample answers corresponding to the sample question; the sub-ability evaluation value generation unit 703 is further configured to, for each sample question, in response to the fact that the corresponding set of second correct answers includes at least two second correct answers, generate a third ability evaluation sub-value corresponding to the sample question based on the similarity comparison result between the second correct answers, and generate a fourth ability evaluation sub-value corresponding to the sample question based on the quantitative relationship between the second correct answers and the second answers to be evaluated; and the total ability evaluation value generation unit 704 is further configured to generate a target ability evaluation sub-value corresponding to each sample question based on the third and fourth ability evaluation sub-values corresponding to the sample questions; and generate a target ability evaluation value for the large language model to be evaluated based on each target ability evaluation sub-value.
[0155] In some optional implementations of this embodiment, generating a target capability evaluation value for the large language model to be evaluated based on each target capability evaluation sub-value includes: generating a target capability evaluation value for the large language model to be evaluated based on at least one of the mean, variance, and range of the target capability evaluation sub-values.
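To make the aggregation in paragraph [0155] concrete, the sketch below computes summary statistics of the target capability evaluation sub-values (here the mean, population variance, and range) with Python's standard `statistics` module. The disclosure does not say how these statistics are combined, so `target_value` and its penalty weights are purely hypothetical: it rewards a high average and optionally penalizes instability across rounds or questions.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass(frozen=True)
class SubValueSummary:
    mean: float
    variance: float      # population variance of the sub-values
    value_range: float   # max minus min


def summarize(sub_values: list[float]) -> SubValueSummary:
    """Summarize per-round or per-question target sub-values."""
    if not sub_values:
        raise ValueError("at least one sub-value is required")
    return SubValueSummary(
        mean=fmean(sub_values),
        variance=pvariance(sub_values),
        value_range=max(sub_values) - min(sub_values),
    )


def target_value(summary: SubValueSummary,
                 variance_penalty: float = 0.0,
                 range_penalty: float = 0.0) -> float:
    """Hypothetical combination; the weights are not from the disclosure."""
    return (summary.mean
            - variance_penalty * summary.variance
            - range_penalty * summary.value_range)
```

With both penalties left at zero, the target value reduces to the plain mean of the sub-values.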
[0156] In some optional implementations of this embodiment, the sub-ability evaluation value generation unit 703 is further configured to generate a second ability evaluation value based on the quantitative relationship between the correct answer and the answer to be evaluated in response to the correct answer set including a unique correct answer; the total ability evaluation value generation unit 704 is further configured to generate a target ability evaluation value for evaluating the model ability of the large language model to be evaluated based on the second ability evaluation value.
[0157] This embodiment is a device embodiment corresponding to the method embodiment described above. The device provided in this embodiment for evaluating the capabilities of a large language model first processes a sample question using the large language model to be evaluated, obtaining at least two answers to be evaluated. Then, using the sample answers corresponding to the sample question, a set of correct answers is determined from the at least two answers to be evaluated. Next, if the set of correct answers includes at least two correct answers, a first capability evaluation value is generated based on the similarity comparison result between the correct answers, and a second capability evaluation value is generated based on the quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers. Finally, a target capability evaluation value is generated based on the first capability evaluation value and the second capability evaluation value to evaluate the model capability of the large language model to be evaluated. Therefore, this disclosure enables a more comprehensive, high-quality, and efficient evaluation of the model capability of a large language model.
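The evaluation flow summarized above can be sketched end to end in Python. Everything named here is an illustrative assumption: the caller-supplied `is_correct` judge, the use of `difflib.SequenceMatcher` as a stand-in similarity measure (an edit-distance-based score could be passed in instead), the mapping "first value = 1 minus mean similarity" as one way to make the first value fall as similarity rises, and the weighted sum used to merge the two values.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Sequence


def default_similarity(a: str, b: str) -> float:
    # Stand-in measure in [0, 1]; not an edit distance.
    return SequenceMatcher(None, a, b).ratio()


def evaluate_question(
    candidates: Sequence[str],
    sample_answers: Sequence[str],
    is_correct: Callable[[str, Sequence[str]], bool],
    similarity: Callable[[str, str], float] = default_similarity,
    weight_first: float = 0.5,  # hypothetical weighting
) -> float:
    """Target capability value for one sample question (sketch)."""
    if len(candidates) < 2:
        raise ValueError("at least two answers to be evaluated are expected")
    correct = [c for c in candidates if is_correct(c, sample_answers)]
    second = len(correct) / len(candidates)  # ratio of correct answers
    if len(correct) < 2:
        # A unique correct answer uses only the second value ([0156]).
        # The zero-correct case is not covered here; returning the
        # ratio (0.0) is an assumption.
        return second
    pairs = list(combinations(correct, 2))
    mean_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    first = 1.0 - mean_sim  # lower similarity gives a higher first value
    return weight_first * first + (1.0 - weight_first) * second


def contains_reference(answer: str, refs: Sequence[str]) -> bool:
    # Toy judge: an answer is correct if it contains a sample answer.
    return any(ref in answer for ref in refs)


score = evaluate_question(
    ["2 + 2 is 4", "Counting up from 2 twice gives 4", "5"],
    ["4"],
    contains_reference,
)
```

A model that reaches the correct answer through several differently worded solutions scores higher under this sketch than one that repeats a single solution, which matches the stated intent of combining the two values.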
[0158] With further reference to Figure 8, as an implementation of the above-mentioned method for aligning large language models, this disclosure provides an embodiment of an apparatus for aligning large language models. This apparatus embodiment corresponds to the method embodiment shown in Figure 5, and the apparatus can be applied to various electronic devices.
[0159] As shown in Figure 8, the apparatus 800 for aligning large language models in this embodiment may include: a first model adjustment unit 801, a first model evaluation unit 802, an intermediate model selection unit 803, and a second model adjustment unit 804. The first model adjustment unit 801 is configured to adjust an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; the first model evaluation unit 802 is configured to generate a first target capability evaluation value corresponding to each large language model to be evaluated, wherein the first target capability evaluation value is generated by the apparatus for evaluating the capabilities of a large language model described above; the intermediate model selection unit 803 is configured to select a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; and the second model adjustment unit 804 is configured to adjust the first intermediate large language model using a second alignment strategy to obtain a target large language model. In this embodiment, for the specific processing of the first model adjustment unit 801, the first model evaluation unit 802, the intermediate model selection unit 803, and the second model adjustment unit 804 in the apparatus 800 for aligning large language models, and the technical effects they bring, reference may be made to the descriptions of steps 501-504 in the embodiment corresponding to Figure 5, which will not be repeated here.
[0160] In some optional implementations of this embodiment, the second model adjustment unit 804 is further configured to adjust the first intermediate large language model using a second alignment strategy to obtain at least two second intermediate large language models; generate a second target capability evaluation value corresponding to each second intermediate large language model; and select a target large language model from at least two second intermediate large language models based at least on the ranking result of the second target capability evaluation value.
[0161] In some optional implementations of this embodiment, the target large language model is selected from at least two second intermediate large language models based at least on the ranking result of the second target ability evaluation value, including: selecting the target large language model from the first intermediate large language model and at least two second intermediate large language models based on the joint ranking result of the first target ability evaluation value and the second target ability evaluation value.
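One reading of the joint ranking in paragraph [0161] is sketched below in Python. The function name `select_target_model`, the use of model names as identifiers, and the tie-breaking rule that favors the first intermediate model are assumptions; the sketch also assumes that the first and second target ability evaluation values were produced by the same evaluation procedure, so that they are directly comparable.

```python
def select_target_model(
    first_intermediate: tuple[str, float],
    second_intermediates: dict[str, float],
) -> str:
    """Pick the top-ranked model among the first intermediate model
    and the second intermediate models (higher score is better)."""
    if len(second_intermediates) < 2:
        raise ValueError("at least two second intermediate models are expected")
    # The first intermediate model is listed first, so it wins ties (assumption).
    ranked = [first_intermediate, *second_intermediates.items()]
    best_name, _ = max(ranked, key=lambda item: item[1])
    return best_name


# Hypothetical scores: the second alignment stage made both candidates worse.
chosen = select_target_model(("sft-a", 0.70), {"dpo-a": 0.60, "dpo-b": 0.65})
```

Under this reading, including the first intermediate model in the ranking acts as a safeguard: if every second-stage candidate scores lower, the first intermediate model is kept as the target model.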
[0162] In some optional implementations of this embodiment, the first alignment strategy and the second alignment strategy are different. The alignment strategies include: a supervised fine-tuning (SFT) alignment strategy, a direct preference optimization (DPO) alignment strategy, and a proximal policy optimization (PPO) alignment strategy.
[0163] This embodiment exists as a device embodiment corresponding to the above method embodiment. The device for aligning large language models provided in this embodiment first adjusts the initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; then, it generates a first target capability evaluation value corresponding to each of the large language models to be evaluated using the above-described method for evaluating the capability of large language models; next, it selects a first intermediate large language model from the at least two large language models to be evaluated based on the ranking result of the first target capability evaluation values; finally, it adjusts the first intermediate large language model using a second alignment strategy to obtain the target large language model.
[0164] Therefore, this disclosure enables more effective alignment of large language models by using at least two alignment strategies in combination, thereby improving alignment quality. Furthermore, to further enhance the alignment effect of each strategy, the large language model entering the next alignment stage and its corresponding alignment strategy can be determined by generating multiple candidates and selecting the best one, based on the more effective, comprehensive, and efficient evaluation method provided above. This effectively improves the alignment quality of large language models.
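The two-stage flow described in paragraphs [0163] and [0164] can be sketched as a small generic Python function. The callables `first_strategy`, `evaluate`, and `second_strategy` are hypothetical placeholders for real training and evaluation code, and the toy usage treats models as plain strings; none of these names come from the disclosure.

```python
from typing import Callable, Sequence, TypeVar

M = TypeVar("M")  # whatever object represents a model


def align(
    initial: M,
    first_strategy: Callable[[M], Sequence[M]],
    evaluate: Callable[[M], float],
    second_strategy: Callable[[M], M],
) -> M:
    """Generate candidates, keep the best-ranked one, then align it again."""
    candidates = first_strategy(initial)
    if len(candidates) < 2:
        raise ValueError("the first stage should yield at least two candidates")
    first_intermediate = max(candidates, key=evaluate)  # rank by target value
    return second_strategy(first_intermediate)


# Toy usage with strings standing in for models and fixed, made-up scores.
scores = {"base+sft1": 0.4, "base+sft2": 0.9}
target = align(
    "base",
    first_strategy=lambda m: [m + "+sft1", m + "+sft2"],
    evaluate=lambda m: scores[m],
    second_strategy=lambda m: m + "+dpo",
)
```

In practice the first stage might vary hyperparameters or training data to produce its candidates; the sketch leaves that choice to the caller.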
[0165] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0166] Figure 9 shows a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0167] As shown in Figure 9, device 900 includes a computing unit 901, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 902 or a computer program loaded from storage unit 908 into random access memory (RAM) 903. RAM 903 may also store various programs and data required for the operation of device 900. The computing unit 901, ROM 902, and RAM 903 are interconnected via bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
[0168] Multiple components in device 900 are connected to I / O interface 905, including: input unit 906, such as keyboard, mouse, etc.; output unit 907, such as various types of monitors, speakers, etc.; storage unit 908, such as disk, optical disk, etc.; and communication unit 909, such as network card, modem, wireless transceiver, etc. Communication unit 909 allows device 900 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0169] The computing unit 901 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as methods for evaluating large language model capabilities and aligning large language models. For example, in some embodiments, the methods for evaluating large language model capabilities and aligning large language models can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program can be loaded and / or installed on device 900 via ROM 902 and / or communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the methods for evaluating large language model capabilities and aligning large language models described above can be performed. Alternatively, in other embodiments, computing unit 901 may be configured in any other suitable manner (e.g., by means of firmware) to perform methods for evaluating the capabilities of large language models and aligning large language models.
[0170] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0171] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0172] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0173] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0174] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0175] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via communication networks. The client-server relationship is established by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service ecosystem that addresses the management difficulties and weak business scalability inherent in traditional physical hosts and virtual private server (VPS) services. The server can also be a server of a distributed system or a server incorporating blockchain technology.
[0176] According to the technical solution of this disclosure, not only can the model capabilities of large language models be evaluated more comprehensively, with higher quality and efficiency, but the alignment of large language models can also be completed with higher quality by using at least two alignment strategies in combination. Furthermore, in this process, to further improve the alignment effect of each alignment strategy, the large language model entering the next alignment stage and its alignment strategy can be determined by generating multiple candidates and selecting the best one, based on the higher-quality, more comprehensive, and more efficient evaluation method provided above. Thus, the alignment quality of large language models can be effectively improved.
[0177] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution provided in this disclosure can be achieved; no limitation is imposed herein.
[0178] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for evaluating the capabilities of a large language model, comprising: The method involves processing a sample question using a large language model to be evaluated to obtain at least two answers to be evaluated, including: using prompt words to instruct the large language model to be evaluated to generate at least two or more answers to be evaluated, and instructing the large language model to be evaluated to process the sample question to obtain at least two answers to be evaluated corresponding to the same sample question, wherein the sample question includes at least one of the following: text, image, or audio information including a question to be answered; Using sample answers corresponding to the sample question, determine the set of correct answers from at least two answers to be evaluated; In response to the fact that the set of correct answers includes at least two correct answers, a first ability evaluation value is generated based on the similarity comparison result between the correct answers, and a second ability evaluation value is generated based on the quantitative relationship between the correct answers in the set of correct answers and the answer to be evaluated, wherein the level of the first ability evaluation value is inversely proportional to the similarity in the similarity comparison result; Based on the first capability evaluation value and the second capability evaluation value, a target capability evaluation value is generated to evaluate the capability of the large language model to be evaluated.
2. The method according to claim 1, further comprising: The similarity comparison results among the correct answers are determined based on the edit distance between each correct answer.
3. The method according to claim 1, further comprising: The quantitative relationship between the correct answers and the answers to be evaluated in the set of correct answers is determined based on the ratio of the number of correct answers to the number of answers to be evaluated.
4. The method according to claim 1, wherein processing the sample question using the large language model to be evaluated to obtain at least two answers to be evaluated includes: repeatedly processing the sample question using the large language model to be evaluated, obtaining at least two first answers to be evaluated corresponding to each processing round; determining the set of correct answers from the at least two answers to be evaluated using the sample answers corresponding to the sample question includes: in each of the processing rounds, determining a first set of correct answers corresponding to the processing round from the at least two first answers to be evaluated using the sample answers corresponding to the sample question; generating, in response to the set of correct answers including at least two correct answers, the first ability evaluation value based on the similarity comparison result between the correct answers, and the second ability evaluation value based on the quantitative relationship between the correct answers in the set of correct answers and the answers to be evaluated, includes: in each of the processing rounds, in response to the corresponding first correct answer set including at least two first correct answers, generating a first ability evaluation sub-value corresponding to the processing round based on the similarity comparison result between the first correct answers, and generating a second ability evaluation sub-value corresponding to the processing round based on the quantitative relationship between the first correct answers and the first answers to be evaluated corresponding to the processing round; and generating the target capability evaluation value for assessing the capabilities of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value includes: generating a target capability evaluation sub-value corresponding to each of the processing rounds based on the first and second capability evaluation sub-values corresponding to that processing round; and generating a target capability evaluation value for the large language model to be evaluated based on each of the target capability evaluation sub-values.
5. The method according to claim 1, wherein processing the sample question using the large language model to be evaluated to obtain at least two answers to be evaluated includes: processing at least two sample questions using the large language model to be evaluated to obtain at least two second answers to be evaluated corresponding to each sample question; determining the set of correct answers from the at least two answers to be evaluated using the sample answers corresponding to the sample question includes: for each of the sample questions, determining a set of second correct answers corresponding to the sample question from the at least two second answers to be evaluated using the sample answers corresponding to the sample question; generating, using the set of correct answers, a capability evaluation value for assessing the capabilities of the large language model to be evaluated includes: for each of the sample questions, in response to the corresponding set of second correct answers including at least two second correct answers, generating a third ability evaluation sub-value corresponding to the sample question based on the similarity comparison result between the second correct answers, and generating a fourth ability evaluation sub-value corresponding to the sample question based on the quantitative relationship between the second correct answers and the second answers to be evaluated; and generating the target capability evaluation value for assessing the capabilities of the large language model to be evaluated based on the first capability evaluation value and the second capability evaluation value includes: generating a target ability evaluation sub-value corresponding to each of the sample questions based on the third and fourth ability evaluation sub-values corresponding to that sample question; and generating a target capability evaluation value for the large language model to be evaluated based on each of the target capability evaluation sub-values.
6. The method according to claim 4 or 5, wherein, The step of generating a target ability evaluation value for the large language model to be evaluated based on each of the target ability evaluation sub-values includes: Based on at least one of the mean, variance, and range of each of the target ability evaluation sub-values, a target ability evaluation value is generated for the large language model to be evaluated.
7. The method according to claim 1, further comprising: In response to the fact that the set of correct answers includes a unique correct answer, a second ability evaluation value is generated based on the quantitative relationship between the correct answer in the set of correct answers and the answer to be evaluated; Based on the second capability evaluation value, a target capability evaluation value is generated to assess the model capability of the large language model to be evaluated.
8. A method for aligning large language models, comprising: The initial large language model is adjusted using the first alignment strategy to obtain at least two large language models to be evaluated. Generate a first target ability evaluation value corresponding to the large language model to be evaluated, wherein the first target ability evaluation value is generated based on the method for evaluating the ability of a large language model according to any one of claims 1-7; Based on the ranking result of the first target ability evaluation value, a first intermediate large language model is selected from at least two large language models to be evaluated; The first intermediate large language model is adjusted using the second alignment strategy to obtain the target large language model.
9. The method according to claim 8, wherein, The step of adjusting the intermediate large language model using the second alignment strategy to obtain the target large language model includes: The first intermediate large language model is adjusted using a second alignment strategy to obtain at least two second intermediate large language models. Generate the second target ability evaluation value corresponding to each of the second intermediate large language models; Based at least on the ranking results of the second target ability evaluation value, a target large language model is selected from at least two second intermediate large language models.
10. The method according to claim 9, wherein, The selection of the target large language model from at least two second intermediate large language models, based at least on the ranking results of the second target ability evaluation value, includes: Based on the combined ranking results of the first target ability evaluation value and the second target ability evaluation value, a target large language model is selected from the first intermediate large language model and at least two second intermediate large language models.
11. The method according to any one of claims 8-10, wherein, The first alignment strategy and the second alignment strategy are different. The alignment strategies include: supervised fine-tuning alignment strategy, direct preference optimization alignment strategy, and proximal policy optimization alignment strategy.
12. An apparatus for evaluating the capabilities of a large language model, comprising: The answer generation unit is configured to process a sample question using a large language model to be evaluated, and obtain at least two answers to be evaluated, including: using prompt words to instruct the large language model to be evaluated to generate at least two or more answers to be evaluated, instructing the large language model to be evaluated to process the sample question, and obtaining at least two answers to be evaluated corresponding to the same sample question, wherein the sample question includes at least one of the following: text, image, or audio information including a question to be answered; The correct answer determination unit is configured to determine a set of correct answers from at least two answers to be evaluated using sample answers corresponding to the sample question; The ability evaluation value generation unit is configured to, in response to the correct answer set including at least two correct answers, generate a first ability evaluation value based on the similarity comparison result between the correct answers, and generate a second ability evaluation value based on the quantitative relationship between the correct answers in the correct answer set and the answer to be evaluated, wherein the level of the first ability evaluation value is inversely proportional to the similarity in the similarity comparison result; The total capability evaluation value generation unit generates a target capability evaluation value for evaluating the model capability of the large language model to be evaluated, based on the first capability evaluation value and the second capability evaluation value.
13. An apparatus for aligning a large language model, comprising: a first model adjustment unit configured to adjust an initial large language model using a first alignment strategy to obtain at least two large language models to be evaluated; a first model evaluation unit configured to generate a first target capability evaluation value corresponding to each large language model to be evaluated, wherein the first target capability evaluation value is generated by the apparatus for evaluating the capabilities of a large language model according to claim 12; an intermediate model selection unit configured to select a first intermediate large language model from the at least two large language models to be evaluated based on a ranking result of the first target capability evaluation values; and a second model adjustment unit configured to adjust the first intermediate large language model using a second alignment strategy to obtain a target large language model.
14. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method for evaluating the capabilities of a large language model according to any one of claims 1-7, and/or the method for aligning a large language model according to any one of claims 8-11.
15. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for evaluating the capabilities of a large language model according to any one of claims 1-7, and/or the method for aligning a large language model according to any one of claims 8-11.
16. A computer program product comprising a computer program that, when executed by a processor, implements the method for evaluating the capabilities of a large language model according to any one of claims 1-7, and/or the method for aligning a large language model according to any one of claims 8-11.
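The evaluation units of claim 12 and the selection flow of claim 13 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and several details are assumptions the claims leave open: the similarity measure (here `difflib.SequenceMatcher`), the reading of "inversely proportional" as one minus the mean pairwise similarity, the reading of the "quantitative relationship" as the share of correct answers, the weighted sum controlled by the hypothetical `diversity_weight` parameter, and the treatment of fewer than two correct answers as zero diversity. The names `evaluate_capability`, `two_stage_align`, `is_correct`, `first_stage_runs`, and `second_stage` are likewise hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the claims do not name a measure."""
    return SequenceMatcher(None, a, b).ratio()


def evaluate_capability(
    answers: Sequence[str],
    is_correct: Callable[[str], bool],
    similarity: Callable[[str, str], float] = text_similarity,
    diversity_weight: float = 0.5,  # assumed weighting, not from the claims
) -> float:
    """Combine a diversity value and an accuracy value into one score."""
    if len(answers) < 2:
        raise ValueError("at least two answers to be evaluated are required")

    # Correct answer set, decided against the sample answer via is_correct.
    correct = [a for a in answers if is_correct(a)]

    # Second value (assumed form): share of correct answers among all answers.
    accuracy = len(correct) / len(answers)

    # First value (assumed form): falls as the correct answers get more similar.
    if len(correct) >= 2:
        pairs = list(combinations(correct, 2))
        mean_similarity = sum(similarity(a, b) for a, b in pairs) / len(pairs)
        diversity = 1.0 - mean_similarity
    else:
        # Assumption: the claims only define this value for two or more
        # correct answers, so fewer is treated as zero diversity.
        diversity = 0.0

    return diversity_weight * diversity + (1.0 - diversity_weight) * accuracy


def two_stage_align(
    initial_model: Model,
    first_stage_runs: Sequence[Callable[[Model], Model]],
    second_stage: Callable[[Model], Model],
    score: Callable[[Model], float],
) -> Model:
    """Run the first strategy several ways, keep the best, then run the second."""
    if len(first_stage_runs) < 2:
        raise ValueError("at least two candidate models are required")

    # Each run applies the first alignment strategy with a different setup.
    candidates = [run(initial_model) for run in first_stage_runs]

    # Rank by target capability evaluation value and keep the top candidate.
    intermediate = max(candidates, key=score)

    # Apply the second alignment strategy to the selected intermediate model.
    return second_stage(intermediate)
```

In this sketch, `score` would typically wrap `evaluate_capability` over a set of sample questions, and `first_stage_runs` and `second_stage` would wrap real training routines such as supervised fine-tuning followed by direct preference optimization; here they are left as opaque callables so the selection logic stays visible.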