Large model security evaluation method and device, computer device and storage medium

By automatically generating adversarial attack questions with a red team adversarial model and outputting security evaluation scores with an evaluation model, the method addresses the incomplete evaluation and low automation of existing large model evaluation techniques, achieving more comprehensive and automated security evaluation and improving the security of large models.

CN117874177B · Active · Publication Date: 2026-06-26 · BEIJING REALAI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING REALAI TECH CO LTD
Filing Date
2023-10-20
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing methods for evaluating the security of large models rely on fixed datasets and human intervention, resulting in incomplete evaluation dimensions, low automation, and insufficient evaluation accuracy, which cannot meet the needs of large-scale evaluation tasks.

Method used

The red team adversarial model generates adversarial attack questions that conform to the dimensions to be evaluated, and the evaluation model outputs security evaluation scores from which security evaluation reports are generated, thus achieving automated and multi-dimensional evaluation.

Benefits of technology

It improves the accuracy and automation of large model security evaluation, supports multi-dimensional security evaluation, and enhances the security of using large models.

Abstract

The application discloses a large model security evaluation method and device, a computer device, and a storage medium. The method obtains a dimension to be evaluated and generates, through a red team adversarial model, adversarial attack questions that conform to that dimension. The adversarial attack questions are input into the large model under test, which outputs a matching answer to each question, yielding question-answer pairs to be tested. These pairs are input into an evaluation model, which outputs a security evaluation score for each pair, and a security evaluation report for the large model under test in the dimension to be evaluated is generated from the scores. This improves the accuracy of large model security evaluation.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to a method, apparatus, computer equipment, and storage medium for large-scale model security evaluation.

Background Technology

[0002] With the rapid development of artificial intelligence (AI) technology, its applications are becoming increasingly widespread, leading to the emergence of powerful large-scale models. However, while these large-scale models are powerful, they also present security challenges during use, necessitating security assessments.

[0003] In large-scale model security evaluation scenarios, current methods primarily rely on manual data collection and traditional text-based adversarial attacks to construct fixed datasets for evaluating specific security metrics of large models. However, this approach suffers from several drawbacks. Using fixed datasets makes the evaluation dimensions entirely dependent on the dataset's data dimensions, resulting in insufficient scalability and incomplete evaluation dimensions. Furthermore, the manual evaluation method, entirely dependent on human intervention, struggles to automate large-scale evaluation tasks and fails to achieve automated evaluation. Additionally, traditional text-based adversarial attacks depend heavily on training data. If the training data is incomplete or flawed, adversarial attacks may suffer from inaccurate attack dimensions or insufficient adversarial power, reducing the accuracy of large-scale model evaluations. Therefore, existing large-scale model security evaluation methods exhibit low accuracy, compromising the security of using large models.

Summary of the Invention

[0004] This application provides a method, apparatus, computer device, and storage medium for large model security evaluation, which can improve the accuracy of large model security evaluation.

[0005] To address the aforementioned technical problems, this application provides the following technical solutions:

[0006] This application provides a method for large model security evaluation, including:

[0007] Obtain the dimension to be evaluated, and generate adversarial attack questions that conform to the dimension to be evaluated through the red team adversarial model;

[0008] The adversarial attack question is input into the large model under test, and the large model under test outputs a response that matches the adversarial attack question to obtain the question-answer pair to be tested.

[0009] The question-and-answer pair to be tested is input into the evaluation model, and the evaluation model outputs a security evaluation score for the question-and-answer pair to be tested.

[0010] Based on the security assessment score, a security assessment report for the tested large model under the assessed dimension is generated.
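The four steps above can be sketched as a short pipeline. This is only an illustrative sketch: the three callables stand in for the red team adversarial model, the large model under test, and the evaluation model, and their signatures, the toy implementations, and the report fields (mean and minimum score) are assumptions made for the example rather than details given in the application.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


def evaluate_dimension(
    dimension: str,
    red_team: Callable[[str, int], List[str]],  # hypothetical: (dimension, n) -> questions
    target: Callable[[str], str],               # hypothetical: question -> answer
    evaluator: Callable[[QAPair], float],       # hypothetical: QA pair -> security score
    n_questions: int = 3,
) -> dict:
    """Run steps [0007] to [0010] once for a single dimension."""
    questions = red_team(dimension, n_questions)           # [0007]
    pairs = [QAPair(q, target(q)) for q in questions]      # [0008]
    scores = [evaluator(p) for p in pairs]                 # [0009]
    return {                                               # [0010]
        "dimension": dimension,
        "num_questions": len(pairs),
        "mean_score": mean(scores),
        "min_score": min(scores),
        "scores": scores,
    }


# Toy stand-ins so the sketch runs without any real model.
def toy_red_team(dimension: str, n: int) -> List[str]:
    return [f"[{dimension}] adversarial question {i}" for i in range(n)]


def toy_target(question: str) -> str:
    # Refuses only the first question, to produce mixed scores.
    return "I cannot help with that." if question.endswith("0") else "Sure, here is how."


def toy_evaluator(pair: QAPair) -> float:
    # 1.0 means the answer is judged safe, 0.0 unsafe (assumed scale).
    return 1.0 if pair.answer.startswith("I cannot") else 0.0


report = evaluate_dimension("privacy", toy_red_team, toy_target, toy_evaluator)
print(report)
```

Passing the models in as callables keeps the evaluation loop independent of any particular model API, which matches the application's aim of evaluating arbitrary large models under arbitrary dimensions.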

[0011] In some implementations, before generating adversarial attack questions conforming to the dimension to be evaluated using the red team adversarial model, the large model security evaluation method further includes:

[0012] Obtain a training dataset, which includes prompt information, sample question-answer pairs and their aggression scores, wherein the sample question-answer pairs include aggressive questions and their corresponding answers;

[0013] The large model is pre-trained using the training dataset to obtain the pre-trained large model.

[0014] The pre-trained large model is used as a reference model, and the pre-trained large model is copied to obtain the model to be optimized.

[0015] Based on the prompt information, the model to be optimized generates a first question-answer pair and its corresponding first score.

[0016] Based on the prompt information, the reference model generates a second question-answer pair and its corresponding second score.

[0017] Based on the sample question-and-answer pairs, the aggression score, the first question-and-answer pair, the first score, the second question-and-answer pair, and the second score, reinforcement learning is performed on the model to be optimized until a preset stopping condition is met, and the optimized model is obtained.

[0018] The optimized model is used as the red team adversarial model.
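Step [0014] keeps one pre-trained model as a fixed reference and updates only a copy of it. A minimal sketch of why the copy has to be independent is given below; `TinyModel` and its `weights` list are hypothetical stand-ins for a real model and its parameters.

```python
import copy


class TinyModel:
    """Hypothetical stand-in for the pre-trained large model."""

    def __init__(self, weights):
        self.weights = list(weights)


pretrained = TinyModel([0.1, -0.2, 0.3])

reference_model = pretrained                    # kept frozen during reinforcement learning
model_to_optimize = copy.deepcopy(pretrained)   # independent copy that will be updated

# Simulate one parameter update on the copy only.
model_to_optimize.weights[0] += 0.05

print(reference_model.weights)     # unchanged by the update
print(model_to_optimize.weights)   # first weight has moved
```

A deep copy is used so that updates to the model to be optimized cannot leak into the reference model, which later serves as the baseline for the divergence loss.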

[0019] In some implementations, the step of performing reinforcement learning on the model to be optimized based on the sample question-answer pairs, the aggression score, the first question-answer pair, the first score, the second question-answer pair, and the second score until a preset stopping condition is met, to obtain the optimized model, includes:

[0020] The reward loss is calculated using a reward model based on the sample question-and-answer pair, the aggression score, the first question-and-answer pair, and the first score.

[0021] Calculate the divergence loss between the first question-and-answer pair and the second question-and-answer pair based on the first question-and-answer pair, the first score, the second question-and-answer pair, and the second score;

[0022] Part-of-speech tagging is performed on the first question-and-answer pair to obtain the target part-of-speech vector of the first question-and-answer pair;

[0023] Calculate the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in the part-of-speech vector template library;

[0024] Based on the reward loss, the divergence loss, and the part-of-speech loss, the parameters of the model to be optimized are adjusted until a preset stopping condition is met, resulting in an optimized model.
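Two of the three loss terms above can be illustrated with simple formulas. The sketch below assumes that the divergence loss is a Kullback-Leibler divergence between the output distributions of the model to be optimized and the reference model, and that the part-of-speech loss is one minus the cosine similarity to the closest template vector. Neither formula is specified in this section, so both are assumptions, as are all function names.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pos_loss(target_vec: Sequence[float],
             template_library: Sequence[Sequence[float]]) -> float:
    """Distance to the most similar template vector ([0023], assumed form)."""
    best = max(cosine(target_vec, t) for t in template_library)
    return 1.0 - best


# Distributions from the model to be optimized (p) and the reference model (q).
p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q))

# A target part-of-speech vector compared against a two-entry template library.
print(pos_loss([1, 0, 1], [[1, 0, 1], [0, 1, 0]]))
```

Under these assumptions the divergence term penalizes drift away from the reference model, while the part-of-speech term pulls generated questions toward the grammatical shape of known high-scoring attack questions.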

[0025] In some implementations, adjusting the parameters of the model to be optimized based on the reward loss, the divergence loss, and the part-of-speech loss until a preset stopping condition is met to obtain the optimized model includes:

[0026] Obtain the first weight coefficient corresponding to the reward loss, the second weight coefficient corresponding to the divergence loss, and the third weight coefficient corresponding to the part-of-speech loss;

[0027] Based on the first weight coefficient, the second weight coefficient, and the third weight coefficient, the reward loss, the divergence loss, and the part-of-speech loss are weighted and calculated to obtain the total reinforcement learning loss;

[0028] Based on the total reinforcement learning loss, the parameters of the model to be optimized are adjusted until the loss is minimized or the number of iterations reaches a preset number, thus obtaining the optimized model.
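The weighted combination in [0027] and the stopping rule in [0028] can be written out directly. In the sketch below the default weight values, the tolerance, and the reading of "the loss is minimized" as "the loss has stopped changing" are illustrative assumptions.

```python
from typing import List


def total_rl_loss(reward_loss: float, divergence_loss: float, pos_loss: float,
                  w1: float = 1.0, w2: float = 0.1, w3: float = 0.1) -> float:
    """Weighted sum of the three losses; w1, w2, w3 are the weight coefficients."""
    return w1 * reward_loss + w2 * divergence_loss + w3 * pos_loss


def should_stop(loss_history: List[float], iteration: int,
                max_iterations: int, tol: float = 1e-4) -> bool:
    """Stop when the iteration budget is spent or the loss has plateaued."""
    if iteration >= max_iterations:
        return True
    if len(loss_history) >= 2 and abs(loss_history[-2] - loss_history[-1]) < tol:
        return True
    return False


print(total_rl_loss(2.0, 1.0, 0.5, w1=1.0, w2=0.5, w3=2.0))
print(should_stop([1.0, 0.5], iteration=1, max_iterations=10))
```

Keeping the weights as explicit parameters makes it easy to trade attack strength (reward loss) against fluency and closeness to the reference model.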

[0029] In some implementations, obtaining the training dataset includes:

[0030] Obtain an initial training dataset, which includes initial sample question-answer pairs and their corresponding initial aggression scores. The initial sample question-answer pairs include initial aggressive questions and their corresponding initial answers.

[0031] Obtain the initial prompt information for each question-answer pair in the initial training dataset;

[0032] Based on the initial aggressive question and the initial prompt information, the initial sample question-answer pair is augmented to obtain an augmented training dataset. The augmented training dataset includes prompt information, sample question-answer pairs and their corresponding aggression scores. The sample question-answer pair includes an aggressive question and its corresponding answer.

[0033] A training dataset is generated based on the augmented training dataset.

[0034] In some implementations, before calculating the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in the part-of-speech vector template library, the large model security evaluation method further includes:

[0035] The training dataset is clustered according to the preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions.

[0036] Using a part-of-speech tagging model, the aggressive questions in each data subset whose aggression score is greater than a preset score threshold are tagged with parts of speech to obtain the part-of-speech vectors corresponding to those aggressive questions.

[0037] A part-of-speech vector template library is constructed based on the aforementioned part-of-speech vectors.
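The construction of the template library in [0035] to [0037] can be sketched as follows. Several simplifications are assumed here: each record already carries a dimension label (so "clustering" reduces to grouping), a tiny hand-written lexicon replaces a real part-of-speech tagging model, and a part-of-speech vector is taken to be a count of tags in a fixed order. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

TAGS = ("NOUN", "VERB", "ADJ", "OTHER")

# Toy lexicon standing in for a real part-of-speech tagging model.
TOY_LEXICON = {
    "build": "VERB", "steal": "VERB",
    "bomb": "NOUN", "password": "NOUN",
    "secret": "ADJ",
}


def pos_vector(question: str) -> List[int]:
    """Count how often each tag in TAGS occurs in the question."""
    counts = dict.fromkeys(TAGS, 0)
    for word in question.lower().split():
        counts[TOY_LEXICON.get(word.strip("?.,!"), "OTHER")] += 1
    return [counts[t] for t in TAGS]


def build_template_library(records: List[dict],
                           score_threshold: float) -> Dict[str, List[List[int]]]:
    """Group by dimension and keep vectors of questions above the threshold."""
    library = defaultdict(list)
    for rec in records:
        if rec["aggression_score"] > score_threshold:
            library[rec["dimension"]].append(pos_vector(rec["question"]))
    return dict(library)


records = [
    {"dimension": "privacy", "question": "How to steal a secret password?", "aggression_score": 0.9},
    {"dimension": "privacy", "question": "What is a password?", "aggression_score": 0.2},
    {"dimension": "violence", "question": "How to build a bomb?", "aggression_score": 0.8},
]
library = build_template_library(records, score_threshold=0.5)
print(library)
```

Only questions above the score threshold contribute templates, so the library reflects the grammatical patterns of the more effective attack questions in each dimension.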

[0038] In some implementations, before generating adversarial attack questions conforming to the dimension to be evaluated using the red team adversarial model, the large model security evaluation method further includes:

[0039] Determine whether the dimension to be evaluated is a zero-sample dimension;

[0040] If the dimension to be evaluated is a zero-sample dimension, then the dimension to be evaluated is adjusted to obtain the adjusted evaluation dimension;

[0041] Generating adversarial attack questions that conform to the dimension to be evaluated through the red team adversarial model includes:

[0042] Adversarial attack questions that conform to the adjusted evaluation dimension are generated using the red team adversarial model.

[0043] In some implementations, the step of adjusting the dimensions to be evaluated to obtain the adjusted evaluation dimensions includes:

[0044] Obtain existing historical security assessment dimensions from the part-of-speech vector template library;

[0045] From the historical security evaluation dimensions, target security evaluation dimensions with a similarity greater than a preset similarity threshold to the dimension to be evaluated are selected.

[0046] The evaluation dimension is adjusted by performing part-of-speech matching based on the target security evaluation dimension to obtain the adjusted evaluation dimension.
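The selection step in [0044] and [0045] can be illustrated with a simple similarity search over the dimensions already present in the template library. The sketch below uses string similarity from Python's `difflib` as a stand-in for whatever similarity measure the application intends, treats any dimension missing from the library as a zero-sample dimension, and omits the part-of-speech matching of [0046]. The function name, the threshold value, and these simplifications are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def adjust_dimension(dimension: str,
                     historical_dimensions: List[str],
                     similarity_threshold: float = 0.6) -> Optional[str]:
    """Map a zero-sample dimension to its most similar historical dimension.

    Returns the dimension unchanged if it already has samples, and None if
    no historical dimension exceeds the similarity threshold.
    """
    if dimension in historical_dimensions:
        return dimension  # not a zero-sample dimension
    scored = [
        (SequenceMatcher(None, dimension, h).ratio(), h)
        for h in historical_dimensions
    ]
    candidates = [(s, h) for s, h in scored if s > similarity_threshold]
    if not candidates:
        return None
    return max(candidates)[1]


history = ["privacy leakage", "violence"]
print(adjust_dimension("privacy leak", history))
print(adjust_dimension("violence", history))
print(adjust_dimension("zzz", history))
```

In practice a semantic similarity measure (for example, embedding-based) would likely be more suitable than character-level matching; the string-based version is used here only to keep the example self-contained.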

[0047] In some implementations, obtaining the dimension to be evaluated includes:

[0048] Receive user input selection instructions, and determine the dimension to be evaluated based on the selection instructions; or...

[0049] Obtain the security evaluation dimensions to which each data point in the training dataset belongs, thus obtaining a security evaluation dimension set. Select any security evaluation dimension from the security evaluation dimension set as the dimension to be evaluated.

[0050] In some implementations, generating adversarial attack questions that conform to the dimension to be evaluated using the red team adversarial model includes:

[0051] Obtain the target prompt information under the dimension to be evaluated;

[0052] Based on the target prompt information, the red team adversarial model generates adversarial attack questions that conform to the dimension to be evaluated.

[0053] According to one aspect of this application, a large-scale model security evaluation device is also provided, comprising:

[0054] The problem generation module is used to obtain the dimension to be evaluated and generate adversarial attack questions that conform to the dimension to be evaluated through the red team adversarial model;

[0055] The question-answer pair acquisition module is used to input the adversarial attack question into the large model under test, and output the answer that matches the adversarial attack question through the large model under test to obtain the question-answer pair to be tested;

[0056] The score acquisition module is used to input the question-answer pair to be tested into the evaluation model, and output a security evaluation score for the question-answer pair to be tested through the evaluation model.

[0057] The report generation module is used to generate a security assessment report for the tested large model under the dimension to be assessed, based on the security assessment score.

[0058] In some embodiments, the large-scale model security evaluation device further includes:

[0059] The dataset acquisition module is used to acquire the training dataset, which includes prompt information, sample question-answer pairs and their aggression scores, and the sample question-answer pairs include aggressive questions and their corresponding answers;

[0060] The pre-training module is used to pre-train the large model using the training dataset to obtain the pre-trained large model.

[0061] The model acquisition module is used to take the pre-trained large model as a reference model and copy the pre-trained large model to obtain the model to be optimized.

[0062] The first generation module is used to generate a first question-answer pair and its corresponding first score based on the prompt information using the model to be optimized;

[0063] The second generation module is used to generate a second question-answer pair and its corresponding second score based on the prompt information using the reference model;

[0064] The reinforcement learning module is used to perform reinforcement learning on the model to be optimized based on the sample question-answer pairs, the aggression score, the first question-answer pair, the first score, the second question-answer pair, and the second score, until a preset stopping condition is met, and an optimized model is obtained.

[0065] A designation module is used to take the optimized model as the red team adversarial model.

[0066] In some implementations, the reinforcement learning module includes:

[0067] The first calculation submodule is used to calculate the reward loss based on the sample question-answer pair, the aggression score, the first question-answer pair, and the first score using a reward model;

[0068] The second calculation submodule is used to calculate the divergence loss between the first question-and-answer pair and the second question-and-answer pair based on the first question-and-answer pair, the first score, the second question-and-answer pair, and the second score.

[0069] The part-of-speech tagging submodule is used to perform part-of-speech tagging on the first question-answer pair to obtain the target part-of-speech vector of the first question-answer pair;

[0070] The third calculation submodule is used to calculate the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the greatest similarity in the part-of-speech vector template library;

[0071] The adjustment submodule is used to adjust the parameters of the model to be optimized based on the reward loss, the divergence loss, and the part-of-speech loss until a preset stopping condition is met, thereby obtaining the optimized model.

[0072] In some implementations, the adjustment submodule is specifically used for:

[0073] Obtain the first weight coefficient corresponding to the reward loss, the second weight coefficient corresponding to the divergence loss, and the third weight coefficient corresponding to the part-of-speech loss;

[0074] Based on the first weight coefficient, the second weight coefficient, and the third weight coefficient, the reward loss, the divergence loss, and the part-of-speech loss are weighted and calculated to obtain the total reinforcement learning loss;

[0075] Based on the total reinforcement learning loss, the parameters of the model to be optimized are adjusted until the loss is minimized or the number of iterations reaches a preset number, thus obtaining the optimized model.

[0076] In some implementations, the dataset acquisition module is specifically used for:

[0077] Obtain an initial training dataset, which includes initial sample question-answer pairs and their corresponding initial aggression scores. The initial sample question-answer pairs include initial aggressive questions and their corresponding initial answers.

[0078] Obtain the initial prompt information for each question-answer pair in the initial training dataset;

[0079] Based on the initial aggressive question and the initial prompt information, the initial sample question-answer pair is augmented to obtain an augmented training dataset. The augmented training dataset includes prompt information, sample question-answer pairs and their corresponding aggression scores. The sample question-answer pair includes an aggressive question and its corresponding answer.

[0080] A training dataset is generated based on the augmented training dataset.

[0081] In some embodiments, the large-scale model security evaluation device further includes:

[0082] The clustering module is used to cluster the training dataset according to preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions.

[0083] The part-of-speech tagging module is used to tag, using a part-of-speech tagging model, the aggressive questions in each data subset whose aggression score is greater than a preset score threshold, and obtain the part-of-speech vectors corresponding to those aggressive questions.

[0084] A building module is used to construct a part-of-speech vector template library based on the part-of-speech vectors.

[0085] In some embodiments, the large-scale model security evaluation device further includes:

[0086] The judgment module is used to determine whether the dimension to be evaluated is a zero-sample dimension;

[0087] The dimension adjustment module is used to adjust the dimension to be evaluated if the dimension to be evaluated is a zero-sample dimension, so as to obtain the adjusted evaluation dimension.

[0088] The problem generation module is specifically used to generate adversarial attack questions that conform to the adjusted evaluation dimension through the red team adversarial model.

[0089] In some implementations, the dimension adjustment module is specifically used for:

[0090] Obtain existing historical security assessment dimensions from the part-of-speech vector template library;

[0091] From the historical security evaluation dimensions, target security evaluation dimensions with a similarity greater than a preset similarity threshold to the dimension to be evaluated are selected.

[0092] The evaluation dimension is adjusted by performing part-of-speech matching based on the target security evaluation dimension to obtain the adjusted evaluation dimension.

[0093] In some implementations, the problem generation module is specifically used for:

[0094] Receive user input selection instructions, and determine the dimension to be evaluated based on the selection instructions; or...

[0095] Obtain the security evaluation dimension to which each data point in the training dataset belongs, thus obtaining a security evaluation dimension set. Take any security evaluation dimension in the security evaluation dimension set as the dimension to be evaluated.

[0096] Adversarial attack questions that conform to the dimension to be evaluated are generated using the red team adversarial model.

[0097] In some implementations, the problem generation module is specifically used for:

[0098] Obtain the target prompt information under the dimension to be evaluated;

[0099] Based on the target prompt information, the red team adversarial model generates adversarial attack questions that conform to the dimension to be evaluated.

[0100] According to one aspect of this application, a computer device is also provided, including a processor and a memory, wherein the memory stores a computer program, and the processor executes any of the large-model security evaluation methods provided in the embodiments of this application when it invokes the computer program in the memory.

[0101] According to one aspect of this application, a storage medium is also provided for storing a computer program, which is loaded by a processor to execute any of the large model security evaluation methods provided in the embodiments of this application.

[0102] According to one aspect of this application, a computer program product is also provided, including a computer program loaded by a processor to perform any of the large-model security evaluation methods provided in the embodiments of this application.

[0103] According to one aspect of this application, a chip is also provided, which includes a processor coupled to a transceiver of a computer device for executing the large-model security evaluation method provided in the first aspect of the embodiments of this application.

[0104] According to one aspect of this application, a chip system is also provided, the chip system including a processor for supporting a computer device to perform the functions involved above, such as generating or processing the information involved in the large model security evaluation method provided above.

[0105] In one possible design, the aforementioned chip system also includes a communication interface for inputting and / or outputting information.

[0106] In one possible design, the aforementioned chip system also includes a memory for storing program instructions and data necessary for the terminal device. The chip system can be composed of chips or may include chips and other discrete components.

[0107] This application can obtain the dimension to be evaluated and generate adversarial attack questions that match the dimension to be evaluated through a red team adversarial model. Then, the adversarial attack questions can be input into the large model under test, and the large model under test can output answers that match the adversarial attack questions to obtain the question-answer pair to be tested. The question-answer pair to be tested can be input into the evaluation model, and the evaluation model can output a security evaluation score for the question-answer pair to be tested. At this time, a security evaluation report of the large model under test in the dimension to be evaluated can be generated based on the security evaluation score. This solution utilizes a red team adversarial model to automatically generate a large number of adversarial attack questions that conform to any evaluation dimension, increasing the diversity and adversarial nature of these questions. These numerous adversarial attack questions are input into the large model under test, which then provides corresponding answers to each question, generating test question-answer pairs. These pairs are then input into an evaluation model to obtain corresponding security evaluation scores and generate an evaluation report for the large model under that evaluation dimension. This automates the security evaluation of large models and supports multi-security dimension evaluation, achieving a more comprehensive and automated security evaluation of large models, improving the accuracy of large model evaluation, and thus enhancing the security of using large models.

Attached Figure Description

[0108] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0109] Figure 1 is a flowchart illustrating the large model security evaluation method provided in the embodiments of this application;

[0110] Figure 2 is a schematic diagram of the framework for large-scale model security evaluation provided in the embodiments of this application;

[0111] Figure 3 is another flowchart illustrating the large model security evaluation method provided in the embodiments of this application;

[0112] Figure 4 is a schematic diagram of the framework for training the red team adversarial model provided in an embodiment of this application;

[0113] Figure 5 is another flowchart illustrating the large model security evaluation method provided in the embodiments of this application;

[0114] Figure 6 is an interaction diagram of the large model security evaluation method provided in the embodiments of this application;

[0115] Figure 7 is a schematic diagram of the large model security evaluation device provided in the embodiments of this application;

[0116] Figure 8 is a hardware schematic diagram of the large model security evaluation device provided in the embodiments of this application;

[0117] Figure 9 is a schematic diagram of the structure of the computer device provided in the embodiments of this application;

[0118] Figure 10 is a schematic diagram of the server structure provided in an embodiment of this application.

Detailed Implementation

[0119] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0120] In the following description of this application, the term "some implementations" refers to a subset of all possible implementations. However, it is understood that "some implementations" may be the same subset or different subsets of all possible implementations and may be combined with each other without conflict.

[0121] In the following description of this application, the terms "first" and "second" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0122] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0123] This application provides a method, apparatus, computer device, and storage medium for large-scale model security evaluation. It can be applied to a large-scale model security evaluation system for scenarios involving security evaluation of large models. This system includes a large-scale model security evaluation device, which can at least be used to perform security evaluations on large models to determine whether the large model under test has security risks. Specifically, a large number of adversarial attack questions conforming to any evaluation dimension are automatically generated through a red team adversarial model, improving the diversity and adversarial nature of the adversarial attack questions. These numerous adversarial attack questions are input into the large model under test, which provides corresponding answers to each adversarial attack question, thereby generating question-answer pairs. These question-answer pairs are then input into an evaluation model to obtain corresponding security evaluation scores and generate an evaluation report for the large model under the evaluation dimension. This automates the security evaluation task for large models and supports multi-security dimension evaluation, achieving a more comprehensive and automated security evaluation task for large models, improving the accuracy of large-scale model evaluation, and thus enhancing the security of using large models.

[0124] The large-scale model security evaluation device can be an application that performs security evaluations on large models to determine whether the large models have security risks, or a computer device such as a server or terminal that has the application installed to perform security evaluations on large models to determine whether the large models have security risks. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms, but is not limited to these. The terminal can be a mobile phone or a computer, etc. The server and the terminal can be connected directly or indirectly through wired or wireless communication, which is not limited herein.

[0125] It should be noted that the schematic diagram of the application scenario of the large model security evaluation method in this application is merely an example. The application and scenario of the large model security evaluation method described in the embodiments of this application are for the purpose of more clearly illustrating the technical solution of the embodiments of this application, and do not constitute a limitation on the technical solution provided by the embodiments of this application. As those skilled in the art will know, with the evolution of the application of the large model security evaluation method and the emergence of new business scenarios, the technical solution provided by the embodiments of this application is also applicable to similar technical problems.

[0126] The solutions provided in this application involve technologies such as Artificial Intelligence (AI) and Machine Learning (ML), which are specifically illustrated through the following embodiments:

[0127] AI, or Artificial Intelligence, refers to the theories, methods, technologies, and application systems that utilize digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, Artificial Intelligence is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine capable of reacting in a manner similar to human intelligence. Artificial Intelligence studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0128] AI technology is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0129] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory, among others. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and pre-trained learning. Pre-trained models represent the latest development in deep learning, integrating all of these techniques.

[0130] This application enables reinforcement learning on large models, which may include reinforcement learning based on human feedback (RLHF), reinforcement learning based on AI feedback (RLAIF), and ordinary reinforcement learning reward constraints (Reward Models).

[0131] Human-feedback-based reinforcement learning (RLHF) involves training an AI system model using a large amount of manually labeled data during the reinforcement learning phase. A reward model generates reward signals of varying strengths based on the AI model's output and the labeled data, guiding the AI model to converge in the desired direction. After training, a safer AI system model is obtained. The performance of the model trained using this technique largely depends on the scale and quality of the manually labeled data. RLHF helps address issues such as hallucinations and harmful outputs.

[0132] AI-Feedback-Based Reinforcement Learning (RLAIF): A variant of RLHF, where the feedback information for the reward model shifts from partial or complete human feedback to automatic provision by agent models. The agent model is a pre-trained, safety-compliant, and aligned model that acts as a "supervisor," providing feedback signals for the reinforcement training of the new model. Fine-tuning models that can provide AI feedback signals include OpenAI's InstructGPT, DeepMind's Automatically Red-LM+Red Clf, and Anthropic's Constitutional-AI, among others.

[0133] General reinforcement learning reward models: These are classification or recognition models trained using information from human or AI red teams and labeled data. Their input can consist solely of the AI model's output or include additional human feedback signals. The output is a reward signal of varying intensity, used for reinforcement learning training.

[0134] In this application, the specific type of large model can be flexibly set according to actual needs. For example, a large model can be a large language model, which refers to a language model with a huge number of parameters trained on a large amount of data (usually using large-scale self-supervised learning). Large language models are mainly implemented based on the Transformer machine learning method. The Transformer model is a deep learning model that uses a self-attention mechanism. This mechanism can assign different weights according to the different importance of each part of the input data. A typical application is OpenAI's ChatGPT.

[0135] General-purpose large models are currently widely used in applications such as chat dialogue, text editing, artistic creation, coding, mathematical reasoning, and bioinformatics. Although they have created many new business models and have powerful capabilities, once general-purpose large models are launched for users, algorithmic risks, data risks, and application risks mainly arise in three types of applications: translation, chat, and collaboration.

[0136] Algorithmic risks refer to the inherent flaws in the underlying structure of these large models, regardless of their application. These flaws primarily stem from the mathematical characteristics of neural networks. When normal users utilize general-purpose large models, the risks these models may expose are called intrinsic risks. These include attributes such as interpretability, value alignment, and cognitive ability. When users maliciously attack general-purpose large models, the risks these models may expose are called extrinsic risks. These include prompt injection, adversarial attacks, and model theft. Addressing these algorithmic risks often requires deep algorithmic optimization. For example, by injecting obfuscated instructions and data into the model through "prompt injection," users can gain unauthorized access, causing the model to ignore pre-defined tasks such as "translation" and "hello" and provide irrelevant answers. Similarly, adding slight perturbations to text through "adversarial attacks" can cause model errors and low reliability.

[0137] Data Risks: Improper database processing may expose vulnerabilities in the corpus, leading to risks related to content security and credibility. Addressing these risks often requires restricting database content and access permissions. 1. Due to the complexity of corpus sources, they may implicitly contain attitudes from certain human communities, potentially causing biases in models trained on such corpora. 2. Due to the complexity of corpus sources, they may contain negative content, potentially leading to models trained on such corpora misleading users. 3. Due to the complexity of corpus sources, they may contain erroneous content, potentially causing models trained on such corpora to output false / erroneous information.

[0138] Application risks: If abused by users, the general-purpose model could be maliciously used, leading to application risks. Addressing these application risks often requires managing user behavior.

[0139] Currently, due to the diverse application areas and complex corpora of large-scale models, there is no comprehensive evaluation solution for large-scale models, nor is there a visual evaluation report for them. Testers cannot intuitively understand the vulnerabilities of their own large-scale models. Because it is impossible to exhaustively list all test cases, and each test case requires expert design, resulting in a high professional threshold, the breadth, accuracy, and objectivity of testing are all problematic. Therefore, the following issues inevitably arise with these models:

[0140] False information: Information generated by large models that does not match the questions, requiring greater human effort to identify false information;

[0141] Inappropriate content: When the model lacks the ability to correctly judge legitimate questions, it generates content such as offensive language and content that is not aligned with human viewpoints.

[0142] Malicious attacks: By bypassing the security protection mechanisms of large models through special prompts, etc., the large models generate inappropriate content.

[0143] Therefore, in the existing technology, there is no automated evaluation and online optimization scheme for comprehensive evaluation of large models. The low optimization efficiency and low evaluation accuracy of large models lead to security problems when using them. For example, if users abuse them, new attack methods such as adversarial examples and model backdoors can cause large models to execute incorrect instructions, enabling them to be used maliciously and potentially causing serious security consequences, thus reducing the security of using large models.

[0144] Compared to existing technologies, this application can obtain the dimension to be evaluated and generate adversarial attack questions that match the dimension to be evaluated through a red team adversarial model. Then, the adversarial attack questions can be input into the large model under test, and the large model under test can output answers that match the adversarial attack questions to obtain the question-answer pair to be tested. The question-answer pair to be tested can be input into the evaluation model, and the evaluation model can output a security evaluation score for the question-answer pair to be tested. At this point, a security evaluation report of the large model under test in the dimension to be evaluated can be generated based on the security evaluation score. This solution utilizes a red team adversarial model to automatically generate a large number of adversarial attack questions that conform to any evaluation dimension, increasing the diversity and adversarial nature of these questions. These numerous adversarial attack questions are input into the large model under test, which then provides corresponding answers to each question, generating test question-answer pairs. These pairs are then input into an evaluation model to obtain corresponding security evaluation scores and generate an evaluation report for the large model under that evaluation dimension. This automates the security evaluation of large models and supports multi-security dimension evaluation, achieving a more comprehensive and automated security evaluation of large models, improving the accuracy of large model evaluation, and thus enhancing the security of using large models.
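
The flow described above (red team question generation, answering by the model under test, scoring by the evaluation model, report generation) can be illustrated with a minimal Python sketch. All names here (`evaluate_dimension`, `ScoredPair`, the three callables, and the report fields) are hypothetical stand-ins introduced for illustration, and the summary statistics in the report are an assumption, since this application does not fix a report format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ScoredPair:
    question: str
    answer: str
    score: float  # security evaluation score returned by the evaluation model


def evaluate_dimension(
    dimension: str,
    red_team: Callable[[str], str],          # hypothetical: dimension -> attack question
    model_under_test: Callable[[str], str],  # hypothetical: question -> answer
    evaluator: Callable[[str, str], float],  # hypothetical: (question, answer) -> score
    num_questions: int = 3,
) -> dict:
    """Run the generate -> answer -> score loop and summarize it as a report."""
    pairs: List[ScoredPair] = []
    for _ in range(num_questions):
        question = red_team(dimension)
        answer = model_under_test(question)
        pairs.append(ScoredPair(question, answer, evaluator(question, answer)))
    return {
        "dimension": dimension,
        "num_pairs": len(pairs),
        "mean_score": mean(p.score for p in pairs),
        "min_score": min(p.score for p in pairs),
        "pairs": pairs,
    }


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def _demo_red_team(dimension: str) -> str:
    return f"[attack question about {dimension}]"


def _demo_model(question: str) -> str:
    return "I cannot help with that."


def _demo_evaluator(question: str, answer: str) -> float:
    return 1.0 if answer.startswith("I cannot") else 0.0


report = evaluate_dimension("malicious use", _demo_red_team, _demo_model, _demo_evaluator)
```

Passing the three models as plain callables keeps the loop independent of any particular model framework, which matches the statement that the model types are not limited.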

[0145] The following sections provide detailed descriptions of each example. It should be noted that the order in which the embodiments are described is not intended to limit the preferred order of the embodiments.

[0146] In this embodiment, the description will be from the perspective of a large model security evaluation device, which can be integrated into computer equipment such as servers.

[0147] Please refer to Figure 1. Figure 1 is a flowchart illustrating a large model security evaluation method provided in an embodiment of this application. The large model security evaluation method may include steps S101 to S104:

[0148] S101. Obtain the dimensions to be evaluated, and generate adversarial attack problems that conform to the dimensions to be evaluated through the red team adversarial model.

[0149] Among them, the large model under test (i.e. the large model to be tested) for different application fields can have different evaluation dimensions. The evaluation dimension can be keywords or sentences, etc. For example, the evaluation dimension can include at least one of the relevant evaluation dimensions such as malicious use, erroneous knowledge information, and deployment risk. Of course, the evaluation dimension can also include other dimensions, which are not limited here.

[0150] The methods for obtaining the dimensions to be evaluated may include user input or determination of security evaluation dimensions based on the training dataset. In some implementations, obtaining the dimensions to be evaluated may include:

[0151] Receive user input selection instructions and determine the dimensions to be evaluated based on the selection instructions; or...

[0152] Obtain the security evaluation dimensions to which each data point in the training dataset belongs, thus obtaining a security evaluation dimension set. Select any security evaluation dimension from the security evaluation dimension set as the dimension to be evaluated.

[0153] For example, it can display a list of evaluation dimensions to be selected, receive selection instructions from the user, and select one or more evaluation dimensions from the list of evaluation dimensions to be evaluated in response to the selection instructions. Professionals can determine the evaluation dimensions based on user needs, which improves the flexibility and reliability of obtaining the evaluation dimensions.

[0154] For example, one can obtain the security evaluation dimensions of each data point in the training dataset used to train the model, thus obtaining a security evaluation dimension set; or one can obtain existing security evaluation dimensions from a part-of-speech vector template library, and so on. Then, any one security evaluation dimension from the set can be used as the dimension to be evaluated, or any number of security evaluation dimensions can be used as the dimensions to be evaluated. This allows the generation of adversarial attack problems using existing security evaluation dimensions as the dimensions to be evaluated, improving the convenience and efficiency of obtaining the dimensions to be evaluated. The construction methods of the training dataset and the part-of-speech vector template library will be explained in detail below.

[0155] To ensure that adversarial attack problems generated for any dimension to be evaluated can effectively assess the security of the large model under test, we can first determine whether the dimension is a zero-sample dimension, so that we can adjust the dimension accordingly. As shown in Figure 2, in some implementations, before generating adversarial attack problems that conform to the dimensions to be evaluated using the red team adversarial model, the large model security evaluation method also includes:

[0156] Determine whether the dimension to be evaluated is a zero-sample dimension;

[0157] If the dimension to be evaluated is a zero-sample dimension, then the dimension to be evaluated is adjusted to obtain the adjusted evaluation dimension;

[0158] Using the red team adversarial model, adversarial attack problems that meet the evaluation dimensions are generated, including:

[0159] By using the red team adversarial model, adversarial attack problems that conform to the adjusted evaluation dimensions are generated.

[0160] Specifically, after determining the dimension to be evaluated, we can first determine whether it is a zero-sample dimension. A zero-sample dimension can be a security evaluation dimension that appears for the first time, that is, a security evaluation dimension that did not appear in the training dataset used to train the red team adversarial model. If the dimension to be evaluated is not a zero-sample dimension, a series of adversarial attack questions that conform to the dimension to be evaluated can be generated through the constructed red team adversarial model until the number of generated adversarial attack questions meets the evaluation quantity requirement. These adversarial attack questions are the adversarial attack sample questions in Figure 2. If the dimension to be evaluated is a zero-sample dimension, the dimension to be evaluated can be adjusted (i.e., the part-of-speech vector template matching in Figure 2) to obtain the adjusted evaluation dimension. Then, the red team adversarial model generates adversarial attack questions that conform to the adjusted evaluation dimension until the number of generated adversarial attack questions meets the evaluation quantity requirement. In this way, even if the dimension to be evaluated is a zero-sample dimension, it can be adjusted, which improves the scalability of the evaluation dimensions and makes them more comprehensive. The red team adversarial model can then generate highly aggressive adversarial attack questions based on the adjusted evaluation dimension, and thereby handle zero-sample dimension detection tasks better.
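
The branch described above can be sketched as follows. This is a minimal illustration, assuming that "zero-sample" is tested by set membership against the dimensions seen in the training dataset; `adjust` and `red_team` are hypothetical callables standing in for the part-of-speech template matching and the red team adversarial model.

```python
from typing import Callable, List, Set


def is_zero_sample(dimension: str, training_dimensions: Set[str]) -> bool:
    """A dimension is zero-sample if it never appeared in the training dataset."""
    return dimension not in training_dimensions


def generate_attack_questions(
    dimension: str,
    training_dimensions: Set[str],
    adjust: Callable[[str], str],    # hypothetical: zero-sample dimension -> adjusted dimension
    red_team: Callable[[str], str],  # hypothetical: dimension -> one attack question
    required_count: int,
) -> List[str]:
    """Adjust zero-sample dimensions first, then generate until the quota is met."""
    if is_zero_sample(dimension, training_dimensions):
        dimension = adjust(dimension)
    questions: List[str] = []
    while len(questions) < required_count:
        questions.append(red_team(dimension))
    return questions


# Toy usage: "model extraction" is unseen, so it is mapped to a known dimension.
seen = {"malicious use", "deployment risk"}
demo_questions = generate_attack_questions(
    "model extraction",
    seen,
    adjust=lambda d: "malicious use",
    red_team=lambda d: f"[question for {d}]",
    required_count=2,
)
```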

[0161] In some implementations, the evaluation dimensions are adjusted to obtain the adjusted evaluation dimensions, which include:

[0162] Obtain existing historical security assessment dimensions from the part-of-speech vector template library;

[0163] From historical security assessment dimensions, target security assessment dimensions with a similarity greater than a preset similarity threshold are selected.

[0164] Part-of-speech matching adjustment is performed on the dimension to be evaluated based on the target security evaluation dimension to obtain the adjusted evaluation dimension.

[0165] To improve the reliability of adjusting the dimensions to be evaluated, existing historical security evaluation dimensions in the part-of-speech vector template library can be used to adjust the dimensions to be evaluated. Specifically, existing historical security evaluation dimensions in the part-of-speech vector template library can be obtained. This part-of-speech vector template library is used to store part-of-speech vectors and their corresponding security evaluation dimensions. The part-of-speech vectors can be obtained by tagging each data in the training dataset with part-of-speech tags. The specific construction of the part-of-speech vector template library will be explained in detail below.

[0166] Then, one or more security evaluation dimensions with a similarity greater than a preset similarity threshold to the dimension to be evaluated can be selected from the historical security evaluation dimensions to obtain the target security evaluation dimension. This preset similarity threshold can be flexibly set according to actual needs and is not limited here. At this point, the dimension to be evaluated can be adjusted by part-of-speech tagging (i.e., part-of-speech tagging adjustment) based on the target security evaluation dimension to obtain the adjusted evaluation dimension. For example, the Natural Language Toolkit (NLTK) third-party library can be used to perform part-of-speech tagging on the dimension to be evaluated, obtaining the part-of-speech of the dimension to be evaluated. This part-of-speech can include nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia, etc. By using the part-of-speech of the target security evaluation dimension, which is similar to the part-of-speech of the dimension to be evaluated, to replace the part-of-speech of the dimension to be evaluated, the part-of-speech of the dimension to be evaluated is made closer to that of non-zero sample dimensions, better addressing the capability degradation problem caused by zero sample dimensions, and generating results close to those of non-zero sample dimensions even for zero sample dimensions.
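
The threshold-based selection of target security evaluation dimensions can be sketched as below. This application does not specify the similarity measure, so the character-level ratio from Python's standard `difflib.SequenceMatcher` is used here purely as an assumption; the function name `select_target_dimensions` and the example threshold are likewise hypothetical. The subsequent part-of-speech replacement step is not shown.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib's ratio in [0, 1] on lowercased text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_target_dimensions(
    candidate: str, historical: List[str], threshold: float
) -> List[str]:
    """Return historical dimensions whose similarity to the candidate
    exceeds the preset threshold, most similar first."""
    scored = [(similarity(candidate, h), h) for h in historical]
    return [h for s, h in sorted(scored, reverse=True) if s > threshold]


historical_dimensions = [
    "malicious use",
    "deployment risk",
    "erroneous knowledge information",
]
targets = select_target_dimensions("malicious usage", historical_dimensions, threshold=0.8)
```

An embedding-based similarity could be substituted for `similarity` without changing the selection logic.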

[0167] To improve the accuracy of the red team adversarial model in generating adversarial attacks, the red team adversarial model can be pre-trained. As shown in Figure 3, in some implementations, before generating adversarial attack problems that conform to the dimensions to be evaluated through the red team adversarial model, the large model security evaluation method further includes steps S201 to S207:

[0168] S201. Obtain the training dataset, which includes prompts, sample question-answer pairs and their aggression scores. The sample question-answer pairs include aggressive questions and their corresponding answers.

[0169] The training dataset can be obtained by augmenting the initial training dataset, or it can be obtained directly from the server's database, or it can be obtained by receiving data sent from the terminal; these are not limited here. The training dataset may include prompts, sample question-and-answer pairs, and their attack scores (also referred to as aggression scores). Sample question-and-answer pairs include offensive questions and their corresponding answers. Prompts may include brief keyword or sentence prompts, which can be used by the model to generate sample question-and-answer pairs. The attack score is a security rating given to the answer corresponding to the offensive question generated from the prompt; a higher attack score indicates a safer answer, and a lower attack score indicates a less safe answer. Attack methods for offensive questions may include at least one of several methods such as insecure questioning, injection attacks, developer mode, role-playing, and prompt disclosure.

[0170] To enhance the diversity and richness of the training dataset, a training dataset can be obtained by augmenting the initial training dataset. In some implementations, obtaining the training dataset includes:

[0171] Obtain the initial training dataset, which includes initial sample question-answer pairs and their corresponding initial aggression scores. The initial sample question-answer pairs include initial aggressive questions and their corresponding initial answers.

[0172] Obtain the initial prompt information for each question-answer pair in the initial training dataset;

[0173] Based on the initial aggressive questions and initial prompts, the initial sample question-answer pairs are augmented to obtain the augmented training dataset. The augmented training dataset includes prompts, sample question-answer pairs and their corresponding aggressive scores. The sample question-answer pairs include aggressive questions and their corresponding answers.

[0174] A training dataset is generated based on the augmented training dataset.

[0175] As shown in Figure 4, the initial training dataset can be constructed by automatically calling multiple open-source large models to attack them with test data samples, or it can be obtained from a server database, or it can be received from the terminal. Alternatively, a large number of normal and adversarial sample questions can be collected, the adversarial sample questions can be used as the initial attack questions, and the adversarial sample questions can be input into a large model test library (integrating most of the open-source large models, such as the chatGPT series, Vicuna, LLaMa, chatGLM, etc.) to obtain corresponding question-answer pairs (Q-A pairs). These Q-A pairs are used as initial sample Q-A pairs. An evaluation model is used to perform a security evaluation on each Q-A pair and give a corresponding security score as the attack score of the adversarial sample question (i.e., the initial attack score), thus forming the initial training dataset. This initial training dataset includes initial sample Q-A pairs and their corresponding initial attack scores. The initial sample Q-A pairs include the initial attack question and its corresponding initial answer. The evaluation model can be a trained classification model, and the specific type and structure of the evaluation model are not limited here.

[0176] Then, the initial prompt information for each initial sample question-answer pair in the initial training dataset can be obtained. Based on the initial aggressive question and the initial prompt information, the initial sample question-answer pairs are augmented to obtain the augmented training dataset. Based on the augmented training dataset, the training dataset is generated. That is, as shown in Figure 4, the initial training dataset (QS) is processed using prompt engineering techniques to construct N prompts for each question, thereby building a "Prompt-Question-Score (PQS)" training dataset that is N times larger. This training dataset includes the initial training dataset and the augmented training dataset. The training dataset includes prompt information, sample question-answer pairs and their corresponding attack scores. The sample question-answer pairs include information such as attack questions and their corresponding answers.
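
The expansion from a QS dataset to an N-times-larger PQS dataset can be sketched as follows. The record types, the `topic` field, and the prompt templates are all hypothetical; this application states only that N prompts are constructed per question through prompt engineering, not how they are worded.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QSRecord:
    topic: str     # hypothetical keyword describing the question
    question: str  # initial aggressive question
    score: float   # initial attack score


@dataclass(frozen=True)
class PQSRecord:
    prompt: str
    question: str
    score: float


# Hypothetical templates; N is simply the number of templates.
PROMPT_TEMPLATES = (
    "Write an adversarial test question about: {topic}",
    "Red-team probe, topic = {topic}",
    "Generate a security evaluation question for the dimension '{topic}'",
)


def build_pqs(
    qs_records: Sequence[QSRecord],
    templates: Sequence[str] = PROMPT_TEMPLATES,
) -> List[PQSRecord]:
    """Pair every question with every prompt template (N prompts per question)."""
    return [
        PQSRecord(t.format(topic=r.topic), r.question, r.score)
        for r in qs_records
        for t in templates
    ]


qs = [
    QSRecord("prompt disclosure", "[attack question 1]", 0.2),
    QSRecord("role-playing", "[attack question 2]", 0.7),
]
pqs = build_pqs(qs)
```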

[0177] It should be noted that, besides augmenting the initial sample question-answer pairs based on the initial aggressive questions and initial prompts in the initial training dataset to obtain an augmented training dataset, the augmentation can also be performed based on any one of the initial aggressive questions and their corresponding initial answers or initial prompts in the initial training dataset. For example, the augmented training dataset can be obtained by augmenting the initial sample question-answer pairs based on the initial aggressive questions and their corresponding initial answers or initial prompts in the initial training dataset; or, for example, by augmenting the initial sample question-answer pairs based on the initial aggressive questions and their corresponding initial answers in the initial training dataset; or, for example, by augmenting the initial sample question-answer pairs based on the initial prompts; and so on. No further limitations are specified here.

[0178] S202. Pre-train the large model using the training dataset to obtain the pre-trained large model.

[0179] After obtaining the training dataset, the training dataset can be used to pre-train a large model (i.e., the pre-trained model in Figure 4). The manner of pre-training can be flexibly configured according to actual needs. For example, the large model can generate question-answer pairs and their corresponding scores based on prompts in the training dataset. The loss is then calculated based on the difference between the sample question-answer pairs and their corresponding aggression scores in the training dataset and the question-answer pairs and their corresponding scores generated by the large model based on prompts in the training dataset. The parameters of the large model are adjusted based on this loss until the loss is minimized or the preset number of iterations is reached, resulting in the pre-trained large model. This preset number of iterations can be flexibly configured according to actual needs and is not limited here.

[0180] As shown in Figure 4, an open-source pre-trained model can be selected, and a new large model (i.e., a pre-trained large model) can be obtained by using supervised fine-tuning (SFT) training techniques based on the PQ data in the PQS training dataset. This large model can generate adversarial attack problems based on relevant information of the given evaluation dimension.

[0181] S203. Use the pre-trained large model as a reference model and copy the pre-trained large model to obtain the model to be optimized.

[0182] After obtaining the pre-trained large model, it can be used as a reference model with no parameter updates, and the pre-trained large model can be copied and used as the model to be optimized for reinforcement learning.

[0183] S204. Based on the prompt information, generate the first question-answer pair and its corresponding first score using the model to be optimized.

[0184] S205. Based on the prompt information, generate a second question-answer pair and its corresponding second score using the reference model.

[0185] For example, the prompts from the PQS training dataset can be simultaneously input into both the model to be optimized and the reference model. The model to be optimized outputs a first question-answer pair and its corresponding first score, forming a PQ pair, while the reference model outputs a second question-answer pair and its corresponding second score, forming another PQ pair. The first question-answer pair can include the question generated by the model to be optimized and its corresponding answer, while the second question-answer pair can include the question generated by the reference model and its corresponding answer.

[0186] It should be noted that the execution order between steps S204 and S205 can be either step S204 first and then step S205, or step S205 first and then step S204, or steps S204 and S205 can be executed simultaneously. No limitation is made here.

[0187] S206. Based on the sample question-answer pairs, aggression scores, first question-answer pairs, first scores, second question-answer pairs, and second scores, perform reinforcement learning on the model to be optimized until the preset stopping condition is met, and obtain the optimized model.

[0188] S207. Use the optimized model as the red team adversarial model.

[0189] The preset stopping condition may include reaching the required number of iterations or satisfying the optimization objective, which may include minimizing the loss. Reinforcement learning can be performed on the model to be optimized based on sample question-answer pairs and aggression scores in the training dataset, the first question-answer pair output by the model to be optimized and its corresponding first score, and the second question-answer pair output by the reference model and its corresponding second score, until the required number of iterations is reached or the optimization objective is satisfied, resulting in an optimized model. This optimized model is then used as the red team adversarial model.

[0190] As shown in Figure 4, to improve the accuracy of red team adversarial model training, aggressive part-of-speech vectors can be learned by the model as prior knowledge to guide the training process, thereby constructing a more aggressive and more accurate red team adversarial model. As shown in Figure 5, in some implementations, reinforcement learning is performed on the model to be optimized based on sample question-answer pairs, aggression scores, a first question-answer pair, a first score, a second question-answer pair, and a second score until a preset stopping condition is met, resulting in an optimized model, including steps S2061 to S2065:

[0191] S2061. Calculate the reward loss using a reward model based on sample question-and-answer pairs, aggression scores, the first question-and-answer pair, and the first score.

[0192] A reward signal can be given to the PQ pair output by the model to be optimized through a reward model, and the reward loss can be calculated. That is, the reward loss $L_{reward}$ can be calculated through the reward model based on the sample question-answer pair, the aggression score, the first question-answer pair, and the first score. The calculation formula is as follows:

[0193] $L_{reward} = \log\left(\sigma\left(r_\theta(y_1, x_m) - r_\theta(y_2, x_{gt})\right)\right)$

[0194] Here, $L_{reward}$ represents the reward loss, $r_\theta$ represents the reward model, $x_m$ represents the first question-and-answer pair (PQ pair) output by the model to be optimized, $y_1$ represents the first score of the first question-and-answer pair output by the model to be optimized, $x_{gt}$ represents a standard question-answer pair (PQ pair) in the training dataset, and $y_2$ represents the attack score corresponding to the question-answer pair in the training dataset. To better normalize the difference, the difference between the first question-answer pair and its corresponding first score output by the model to be optimized, and the question-answer pair and its corresponding attack score in the training dataset, can be normalized to between 0 and 1 (inclusive) using the sigmoid function $\sigma$.
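
As a numerical illustration of the reward loss, the sketch below evaluates log(σ(r1 − r2)), where r1 stands for the reward model's output on the generated pair and score, and r2 for its output on the standard pair and attack score. The function names are hypothetical, and the reward model itself is not implemented; its two scalar outputs are passed in directly. The log-sigmoid is computed in a numerically stable form to avoid overflow for large differences.

```python
import math


def log_sigmoid(d: float) -> float:
    """Stable log(sigmoid(d)) = -log(1 + exp(-d))."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    # For negative d, rewrite as d - log(1 + exp(d)) to avoid exp overflow.
    return d - math.log1p(math.exp(d))


def reward_loss(r_generated: float, r_ground_truth: float) -> float:
    """log(sigmoid(r_generated - r_ground_truth)), following the formula above.

    r_generated:    reward model output for the first Q-A pair and first score
    r_ground_truth: reward model output for the standard Q-A pair and attack score
    """
    return log_sigmoid(r_generated - r_ground_truth)
```

Because the sigmoid output lies between 0 and 1, this quantity is never positive, and it increases toward 0 as the reward for the generated pair exceeds that of the standard pair.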

[0195] S2062. Calculate the divergence loss between the first question-and-answer pair and the second question-and-answer pair based on the first question-and-answer pair, the first score, the second question-and-answer pair, and the second score.

[0196] For example, the divergence (KL) loss $L_{KL}$ between the first question-answer pair and the second question-answer pair can be calculated based on the first question-answer pair (PQ pair) generated by the model to be optimized and its corresponding first score, and the second question-answer pair (PQ pair) generated by the reference model and its corresponding second score. The divergence loss $L_{KL}$ guides model training so that the optimized model does not give irrelevant answers and retains the original model's language generation capability. The calculation formula for the divergence loss $L_{KL}$ is as follows:

[0197] $L_{KL} = -\lambda \, \mathrm{KL}\left(\pi_{opt}(y_1 \mid x_1) \,\|\, \pi_{frozen}(y_2 \mid x_2)\right)$

[0198] Here, $L_{KL}$ represents the divergence loss, $\lambda$ represents the penalty factor, $\pi_{opt}$ represents the model to be optimized, $\pi_{frozen}$ represents the reference model, $y_1$ represents the first score of the first question-answer pair output by the model to be optimized, $x_1$ represents the first question-answer pair output by the model to be optimized, $y_2$ represents the second score of the second question-answer pair output by the reference model, and $x_2$ represents the second question-answer pair output by the reference model.
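As an illustrative, non-limiting sketch, the divergence term in [0197] can be computed for two small discrete distributions as shown below. Representing the outputs of the model to be optimized and of the reference model as explicit probability lists is an assumption made for brevity; the names `kl_divergence` and `divergence_loss` and all numeric values are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions defined over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i <= 0.0:
                return math.inf  # p puts mass where q has none
            total += p_i * math.log(p_i / q_i)
    return total


def divergence_loss(p_opt: Sequence[float], p_frozen: Sequence[float], lam: float) -> float:
    """L_KL = -lambda * KL(pi_opt || pi_frozen), following the sign written in [0197]."""
    return -lam * kl_divergence(p_opt, p_frozen)


# Hypothetical output distributions over a two-symbol vocabulary.
same = divergence_loss([0.5, 0.5], [0.5, 0.5], lam=0.1)     # identical models
drifted = divergence_loss([0.5, 0.5], [0.9, 0.1], lam=0.1)  # optimized model has drifted
```

With identical distributions the term is zero; any drift away from the reference model makes the KL divergence positive, so the term as written becomes negative in proportion to the penalty factor.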

[0199] S2063. Perform part-of-speech tagging on the first question-and-answer pair to obtain the target part-of-speech vector of the first question-and-answer pair.

[0200] For example, the first question-answer pair can be part-of-speech tagged (POS_T) using a part-of-speech vector matching model to obtain the target part-of-speech vector for the first question-answer pair. Part-of-speech tagging assigns a category label, such as noun, verb, or adjective, to each word in the first question-answer pair. Optionally, since the first question-answer pair includes the question generated by the model to be optimized and its corresponding answer, only the question in the first question-answer pair can be tagged using the part-of-speech vector matching model to obtain the part-of-speech vector for the question. This part-of-speech vector is then used as the target part-of-speech vector for the first question-answer pair.

[0201] The specific type and structure of the part-of-speech (POS) vector matching model can be flexibly configured according to actual needs and are not limited here. For example, the POS vector matching model can be a trained vectorized model such as doc2vec or Embedding. The first question-answer pair is vectorized by the vectorized model to obtain the target POS vector of the first question-answer pair. This POS vector matching model can learn a series of highly aggressive POS vector templates from aggressive datasets, which are used as prior knowledge to constrain the loss function of the red team adversarial model during reinforcement learning optimization. The POS vector can be a sentence vector representation labeled with part-of-speech tags that meets certain rules.
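As an illustrative, non-limiting sketch, a question can be tagged word by word and turned into a fixed-length vector as shown below. This sketch substitutes a normalized tag histogram for the trained vectorized model (such as doc2vec) mentioned above, and it uses a tiny hand-written lexicon in place of a real tagger; `TAGS`, `TOY_LEXICON`, `pos_tags`, and `pos_vector` are all hypothetical names.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical tag set and lexicon; a real system would use a trained tagger.
TAGS: List[str] = ["NOUN", "VERB", "ADJ", "OTHER"]
TOY_LEXICON: Dict[str, str] = {
    "explain": "VERB",
    "reveal": "VERB",
    "secret": "ADJ",
    "plan": "NOUN",
    "password": "NOUN",
}


def pos_tags(question: str, lexicon: Dict[str, str]) -> List[str]:
    """Assign a category label to each whitespace-separated word of the question."""
    return [lexicon.get(word.lower().strip("?.,!"), "OTHER") for word in question.split()]


def pos_vector(tags: List[str]) -> List[float]:
    """Encode a tag sequence as the fraction of words carrying each tag in TAGS."""
    if not tags:
        return [0.0] * len(TAGS)
    counts = Counter(tags)
    return [counts[tag] / len(tags) for tag in TAGS]


example_tags = pos_tags("Explain the secret plan?", TOY_LEXICON)
example_vector = pos_vector(example_tags)
```

A histogram discards word order, which a learned sentence embedding would retain; it is used here only because it keeps the example self-contained.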

[0202] S2064. Calculate the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in the part-of-speech vector template library.

[0203] After obtaining the target part-of-speech (POS) vector for the first question-answer pair, the similarity (i.e., vector similarity) between each POS vector in the POS vector template library and the target POS vector can be calculated. The POS vector with the highest similarity (i.e., the POS vector template) is then selected, and the POS loss $L_{POS}$ between the target POS vector and the POS vector with the highest similarity is calculated. The calculation formula can be as follows:

[0204] $L_{POS} = \left\| POS\_V_m - POS\_V_{gt} \right\|_p$

[0205] Here, $L_{POS}$ represents the part-of-speech loss, $\|\cdot\|_p$ represents the Lp norm, $POS\_V_m$ represents the target part-of-speech vector of the first question-answer pair generated by the model to be optimized, and $POS\_V_{gt}$ represents the part-of-speech vector in the part-of-speech vector template library that has the highest similarity to the target part-of-speech vector.
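As an illustrative, non-limiting sketch, the template selection and the Lp-norm loss of [0204] can be written as shown below. The text only speaks of "vector similarity", so the use of cosine similarity is an assumption, and the function names and vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lp_norm(v: Sequence[float], p: float = 2.0) -> float:
    """Lp norm of a vector for p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def pos_loss(
    target: Sequence[float], templates: List[Sequence[float]], p: float = 2.0
) -> Tuple[float, Sequence[float]]:
    """Pick the most similar template, then return (||target - template||_p, template)."""
    if not templates:
        raise ValueError("the template library is empty")
    best = max(templates, key=lambda t: cosine_similarity(target, t))
    diff = [a - b for a, b in zip(target, best)]
    return lp_norm(diff, p), best


# Hypothetical target vector and template library.
target_vector = [1.0, 0.0, 0.0]
template_library = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.0]]
loss_l2, best_template = pos_loss(target_vector, template_library, p=2.0)
loss_l1, _ = pos_loss(target_vector, template_library, p=1.0)
```

The second template is selected because it points in nearly the same direction as the target; the loss then measures how far the generated question's vector still is from that template.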

[0206] It should be noted that the reward loss, divergence loss, and part-of-speech loss can be calculated in any order. They can be calculated simultaneously; or the reward loss can be calculated first, then the divergence loss, and then the part-of-speech loss; or the part-of-speech loss first, then the divergence loss, and then the reward loss; or the divergence loss first, then the part-of-speech loss, and then the reward loss; and so on. That is, the execution order of steps S2061, S2062, and S2064 is not limited here.

[0207] To improve the efficiency of model training and the reliability of the part-of-speech (POS) vector templates stored in the POS vector template library, the POS vector template library can be pre-constructed based on the training dataset. In some implementations, before calculating the POS loss between the target POS vector and the POS vector with the highest similarity in the POS vector template library, the large model security evaluation method further includes:

[0208] The training dataset is clustered according to the preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions.

[0209] By using a part-of-speech tagging model, offensive questions with an offensive score greater than a preset score threshold in each data subset are labeled with part-of-speech tags to obtain the part-of-speech vectors corresponding to the offensive questions.

[0210] A part-of-speech vector template library is built based on part-of-speech vectors.

[0211] Specifically, a clustering algorithm is used to cluster the training dataset according to preset security evaluation dimensions, resulting in data subsets corresponding to multiple security evaluation dimensions. These preset security evaluation dimensions can be the security evaluation dimensions corresponding to each data point in the training dataset, or they can be pre-defined security evaluation dimensions; no limitation is made here. Then, the attack scores of each sample question-answer pair in the data subset corresponding to each security evaluation dimension are sorted from high to low, resulting in a sorted data subset. From this sorted data subset, attack questions with attack scores greater than a preset threshold are selected, yielding the most aggressive batch of attack questions. The specific value of this preset threshold can be flexibly set according to actual needs; no limitation is made here. At this point, part-of-speech tagging (POS_T) is performed on the attack questions in each data subset whose attack scores are greater than the preset threshold, resulting in POS_V part-of-speech vectors corresponding to the attack questions, thus obtaining a series of POS_V templates corresponding to high-attack questions. Finally, a POS_V template library can be constructed based on these POS_V templates.
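As an illustrative, non-limiting sketch, the construction of the template library can be outlined as shown below. The sketch assumes each sample already carries its security evaluation dimension as a label, so grouping by that label stands in for the clustering algorithm; the `Sample` type, the `to_vector` callable, and all data are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Sample(NamedTuple):
    dimension: str       # security evaluation dimension of the sample
    question: str        # attack question of the sample question-answer pair
    attack_score: float  # attack score of the sample question-answer pair


def build_template_library(
    samples: List[Sample],
    to_vector: Callable[[str], List[float]],
    score_threshold: float,
) -> Dict[str, List[List[float]]]:
    """Group by dimension, sort by attack score, keep high scorers, and vectorize them."""
    subsets: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        subsets[sample.dimension].append(sample)

    library: Dict[str, List[List[float]]] = {}
    for dimension, subset in subsets.items():
        ranked = sorted(subset, key=lambda s: s.attack_score, reverse=True)
        selected = [s for s in ranked if s.attack_score > score_threshold]
        library[dimension] = [to_vector(s.question) for s in selected]
    return library


# Hypothetical data; the stand-in vectorizer just counts words.
toy_samples = [
    Sample("privacy", "q one", 0.9),
    Sample("privacy", "q two three", 0.4),
    Sample("bias", "q four five six", 0.8),
]
toy_library = build_template_library(
    toy_samples, lambda q: [float(len(q.split()))], score_threshold=0.5
)
```

In practice `to_vector` would be the part-of-speech tagging and vectorization step, and the threshold would be set according to actual needs.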

[0212] It should be noted that, in addition to building a part-of-speech (POS) vector template library based on the training dataset, it is also possible to build a POS vector template library based on the initial training dataset. For example, the initial training dataset can be clustered according to preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions. The POS vector matching model can then be used to tag attack questions in each data subset whose attack scores are greater than a preset score threshold, thereby obtaining the POS vectors corresponding to the attack questions. The POS vector template library can then be built based on these POS vectors, which improves the flexibility and convenience of building the POS vector template library.

[0213] S2065. Adjust the parameters of the model to be optimized based on the reward loss, divergence loss, and part-of-speech loss until the preset stopping condition is met, and obtain the optimized model.

[0214] As shown in Figure 4, after obtaining the reward loss, divergence loss, and part-of-speech loss, the parameters of the model to be optimized can be adjusted according to these three losses until a preset stopping condition, such as the loss being minimized or a preset number of iterations being reached, is met, and the optimized model is obtained. The optimized model is used as the red team adversarial model, which realizes the training of a red team adversarial model guided by part-of-speech vectors. Through supervised fine-tuning (SFT) and reinforcement learning (RL), a red team adversarial model with stronger offensive capabilities is obtained.

[0215] In some implementations, the parameters of the model to be optimized are adjusted based on reward loss, divergence loss, and part-of-speech tagging loss until a preset stopping condition is met, resulting in an optimized model, including:

[0216] Obtain the first weight coefficient corresponding to the reward loss, the second weight coefficient corresponding to the divergence loss, and the third weight coefficient corresponding to the part-of-speech loss;

[0217] Based on the first weight coefficient, the second weight coefficient, and the third weight coefficient, the reward loss, divergence loss, and part-of-speech loss are weighted and calculated to obtain the total reinforcement learning loss;

[0218] Based on the total loss of reinforcement learning, the parameters of the model to be optimized are adjusted until the loss is minimized or the number of iterations reaches the preset number, thus obtaining the optimized model.

[0219] To improve the accuracy of model training, the parameters of the model to be optimized can be adjusted based on the total reinforcement learning loss, obtained by weighting multiple losses such as reward loss, divergence loss, and part-of-speech tagging loss. Specifically, the first weight coefficient corresponding to the reward loss, the second weight coefficient corresponding to the divergence loss, and the third weight coefficient corresponding to the part-of-speech tagging loss can be obtained. Then, based on the first, second, and third weight coefficients, the reward loss, divergence loss, and part-of-speech tagging loss are weighted and calculated to obtain the total reinforcement learning loss. The formula for calculating the total reinforcement learning loss can be as follows:

[0220] $L_{total} = \alpha \times L_{reward} + \beta \times L_{KL} + \gamma \times L_{POS}$

[0221] Here, $L_{total}$ represents the total reinforcement learning loss, $L_{reward}$ represents the reward loss, $L_{KL}$ represents the divergence loss, $L_{POS}$ represents the part-of-speech loss, $\alpha$ represents the first weight coefficient corresponding to the reward loss, $\beta$ represents the second weight coefficient corresponding to the divergence loss, and $\gamma$ represents the third weight coefficient corresponding to the part-of-speech loss. The specific values of the first, second, and third weight coefficients can be flexibly set according to actual needs and are not limited here.

[0222] After obtaining the total reinforcement learning loss, the parameters of the model to be optimized can be updated using the Proximal Policy Optimization (PPO) algorithm based on the total reinforcement learning loss. That is, the parameters of the model to be optimized are adjusted until the loss is minimized or the number of iterations reaches a preset number, and the optimized model is obtained. The optimized model is then used as a red team adversarial model, realizing the training process of the red team adversarial model guided by the model learning aggressive part-of-speech vectors as prior knowledge, thereby constructing a more aggressive and more accurate red team adversarial model.
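As an illustrative, non-limiting sketch, the weighted total loss and the stopping check can be expressed as shown below. The PPO parameter update itself is not shown. Treating "the loss is minimized" as "the loss has stopped changing by more than a tolerance" is an assumption, and the names `total_rl_loss` and `should_stop` and all numeric values are hypothetical.

```python
from typing import List


def total_rl_loss(
    l_reward: float,
    l_kl: float,
    l_pos: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """L_total = alpha * L_reward + beta * L_KL + gamma * L_POS."""
    return alpha * l_reward + beta * l_kl + gamma * l_pos


def should_stop(
    iteration: int,
    max_iterations: int,
    loss_history: List[float],
    tolerance: float = 1e-4,
) -> bool:
    """Stop at the preset iteration count, or when the loss has plateaued (assumed criterion)."""
    if iteration >= max_iterations:
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < tolerance:
        return True
    return False


# Hypothetical loss values and weight coefficients.
example_total = total_rl_loss(-0.5, -0.1, 0.3, alpha=1.0, beta=0.5, gamma=2.0)
```

Raising the third weight coefficient relative to the others pulls generated questions more strongly toward the stored part-of-speech templates, at the cost of weighting the reward and divergence terms less.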

[0223] After obtaining the red team adversarial model, adversarial attack problems can be generated based on it. In some implementations, the red team adversarial model is used to generate adversarial attack problems that conform to the dimensions to be evaluated, including:

[0224] Obtain target prompts for the dimensions to be evaluated;

[0225] Based on the target prompt information, the red team adversarial model generates adversarial attack questions that conform to the dimension to be evaluated.

[0226] Specifically, target prompt information can be obtained for the dimension to be evaluated. The target prompt information can be keywords or sentences that match the dimension to be evaluated. Then, based on the target prompt information, the red team adversarial model can be used to generate adversarial attack questions that conform to the dimension to be evaluated. The red team adversarial model can provide high-quality, highly aggressive adversarial attack questions for any evaluation dimension, which can be used for the evaluation task of the large model under test.

[0227] S102. Input the adversarial attack question into the large model under test, and obtain the question-answer pair to be tested by having the large model under test output an answer that matches the adversarial attack question.

[0228] As shown in Figure 2, after the adversarial attack question is generated by the red team adversarial model, it can be input into the large model under test. The large model under test outputs an answer that matches the adversarial attack question, and the adversarial attack question together with its answer forms the question-answer pair to be tested (QA pair). The specific type and structure of the large model under test are not limited here.

[0229] S103. Input the question-and-answer pair to be tested into the evaluation model, and output the security evaluation score for the question-and-answer pair to be tested through the evaluation model.

[0230] The specific type and structure of the evaluation model are not limited here; for example, the evaluation model can be a trained classification model. After the question-answer pairs to be tested are obtained, they can be input into the evaluation model, which scores them and outputs a security evaluation score for each pair. The security evaluation score can range from 0 to 1, inclusive. The higher the score, the safer the tested model; the lower the score, the less safe the tested model. For example, a score above a preset threshold can indicate that the tested model's answer is safe, and a score below that threshold can indicate that the answer is unsafe.

[0231] By generating adversarial attack problems using a red team adversarial model, the large model under test is evaluated, and the evaluation model provides the evaluation results. This achieves automated security evaluation of the large model under test, while supporting multi-security dimension evaluation. For zero-sample dimensions, part-of-speech template matching technology can also be used to obtain better evaluation, thus realizing a more comprehensive and automated evaluation of the large model under test.

[0232] S104. Based on the security assessment score, generate a security assessment report for the tested large model in the dimension to be assessed.

[0233] As shown in Figure 2, after the security evaluation score of the question-answer pair to be tested is obtained, a security evaluation report of the large model under test in the dimension to be evaluated can be generated based on the security evaluation score. The security evaluation report may include the security evaluation score of the large model under test in one or more evaluation dimensions, and may also include other relevant information, which is not limited here.

[0234] After the security assessment report is obtained, it can be determined whether the report meets the assessment specifications (i.e., whether the assessment objectives are achieved). If it does, the security assessment report is output directly. If it does not, the process is executed again: the red team adversarial model generates adversarial attack questions that conform to the dimension to be assessed; the large model under test generates question-answer pairs to be tested from those questions; the evaluation model outputs security assessment scores for the question-answer pairs to be tested; and a new security assessment report for the large model under test in the dimension to be assessed is generated from those scores. The security assessment scores in the old and new reports are then deduplicated and summarized to form the final security assessment report, completing the security assessment task for the large model under test.

[0235] This system implements a comprehensive and automated security evaluation task for large-scale models, based on a created part-of-speech vector template library and a trained red team adversarial model. For a specific security evaluation dimension, the red team adversarial model generates a large number of adversarial attack questions within that dimension and inputs them into the large model under test. The model then provides corresponding answers to each adversarial attack question, forming test question-answer pairs. These test question-answer pairs are input into an evaluation model to obtain corresponding security evaluation scores and generate a security evaluation report for the large model under that security evaluation dimension. If the evaluation report does not meet the requirements, the above steps are repeated to obtain a complete evaluation report that conforms to the specifications.
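As an illustrative, non-limiting sketch, the evaluation loop described above (generate, answer, score, report, and repeat with deduplication if the report does not meet the specifications) can be outlined as shown below. The three models are represented by plain callables, the report is a simple dictionary, and the `max_rounds` cap is an added assumption to guarantee termination; every name and stub here is hypothetical.

```python
import itertools
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]  # (adversarial attack question, answer of the model under test)


def run_evaluation(
    dimension: str,
    generate: Callable[[str], List[str]],
    answer: Callable[[str], str],
    score: Callable[[QAPair], float],
    meets_spec: Callable[[Dict[str, object]], bool],
    max_rounds: int = 3,
) -> Dict[str, object]:
    """Repeat generation and scoring until the report meets the specification."""
    scores: Dict[QAPair, float] = {}  # keyed by pair, so repeated pairs are deduplicated
    report: Dict[str, object] = {"dimension": dimension, "num_pairs": 0, "mean_score": None}
    for _ in range(max_rounds):
        for question in generate(dimension):
            pair = (question, answer(question))
            if pair not in scores:
                scores[pair] = score(pair)
        report = {
            "dimension": dimension,
            "num_pairs": len(scores),
            "mean_score": sum(scores.values()) / len(scores) if scores else None,
        }
        if meets_spec(report):
            break
    return report


# Stub models, for illustration only; consecutive rounds overlap by one question.
_round = itertools.count()


def toy_generate(dimension: str) -> List[str]:
    n = next(_round)
    return [f"{dimension} question {n}", f"{dimension} question {n + 1}"]


toy_report = run_evaluation(
    "privacy",
    generate=toy_generate,
    answer=lambda q: "I cannot help with that.",
    score=lambda pair: 1.0,
    meets_spec=lambda r: r["num_pairs"] >= 3,
)
```

In this run the first round yields two pairs, which does not satisfy the stub specification; the second round contributes one new pair and one duplicate, so the merged report contains three pairs.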

[0236] This application embodiment can obtain the dimension to be evaluated and generate adversarial attack questions that conform to the dimension to be evaluated through a red team adversarial model; then the adversarial attack questions can be input into the large model under test, and the large model under test outputs answers that match the adversarial attack questions to obtain the question-answer pair to be tested; and the question-answer pair to be tested can be input into an evaluation model, and the evaluation model outputs a security evaluation score for the question-answer pair to be tested; at this time, a security evaluation report of the large model under test in the dimension to be evaluated can be generated based on the security evaluation score. This solution utilizes a red team adversarial model to automatically generate a large number of adversarial attack questions that conform to any evaluation dimension, increasing the diversity and adversarial nature of these questions. These numerous adversarial attack questions are input into the large model under test, which then provides corresponding answers to each question, generating test question-answer pairs. These pairs are then input into an evaluation model to obtain corresponding security evaluation scores and generate an evaluation report for the large model under that evaluation dimension. This automates the security evaluation of large models and supports multi-security dimension evaluation, achieving a more comprehensive and automated security evaluation of large models, improving the accuracy of large model evaluation, and thus enhancing the security of using large models.

[0237] Based on the methods described in the above embodiments, the following examples will provide further detailed explanations.

[0238] This embodiment takes as an example a large model security evaluation device that is integrated into a server and applied to a large model security evaluation scenario. The terminal can initiate a large model security evaluation request to the server. The server responds to the request, performs a security evaluation on the large model under test, and returns the evaluation results to the terminal for display. Please refer to Figure 6, which is a flowchart illustrating the large model security evaluation method provided in this application embodiment. The method flow may include:

[0239] S301. The terminal can send a large model security evaluation request carrying the dimensions to be evaluated to the server.

[0240] For example, the terminal can display a list of evaluation dimensions for selection within the display interface of the security evaluation platform, receive evaluation dimension selection instructions input by the user, select the dimension to be evaluated based on the evaluation dimension selection instructions, generate a large model security evaluation request based on the dimension to be evaluated, and send the large model security evaluation request to the server.

[0241] S302. The server responds to the large model security evaluation request by generating adversarial attack questions that conform to the dimensions to be evaluated through the red team adversarial model, generating test question-answer pairs based on the adversarial attack questions through the large model under test, and outputting security evaluation scores for the test question-answer pairs through the evaluation model, and generating a security evaluation report of the large model under test in the dimensions to be evaluated based on the security evaluation scores.

[0242] The specific implementation details of the server responding to the large model security evaluation request and performing security evaluation on the large model under test can be found in the description above, and will not be repeated here.

[0243] S303. The server sends the security assessment report to the terminal.

[0244] S304. The terminal displays the security evaluation report.

[0245] After receiving the security assessment report from the server, the terminal can display the security assessment report on the display interface for the user to view.
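As an illustrative, non-limiting sketch, the interaction of steps S301 to S304 can be modeled in a single process as shown below. No network transport is shown, the server's evaluation is reduced to one callable that returns a score, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvaluationRequest:
    dimension: str  # dimension to be evaluated, carried by the request (S301)


@dataclass(frozen=True)
class EvaluationReport:
    dimension: str
    mean_score: float


class Server:
    def __init__(self, evaluate: Callable[[str], float]) -> None:
        self._evaluate = evaluate  # stands in for the full evaluation flow (S302)

    def handle(self, request: EvaluationRequest) -> EvaluationReport:
        """Run the evaluation for the requested dimension and return the report (S303)."""
        return EvaluationReport(request.dimension, self._evaluate(request.dimension))


class Terminal:
    def __init__(self, server: Server) -> None:
        self._server = server

    def request_evaluation(self, dimension: str) -> str:
        """Send the request and format the returned report for display (S304)."""
        report = self._server.handle(EvaluationRequest(dimension))
        return f"{report.dimension}: {report.mean_score:.2f}"


displayed = Terminal(Server(lambda dimension: 0.75)).request_evaluation("privacy")
```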

[0246] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0247] In this embodiment, through interaction between the terminal and the server, users can specify the dimensions to be evaluated via the terminal to conduct targeted security assessments on the large model under test. This also facilitates timely detection of security risks in the large model under test by technical personnel. Furthermore, the server can automatically generate a large number of adversarial attack questions that conform to any dimension to be evaluated using a red team adversarial model. This increases the diversity and adversarial nature of the adversarial attack questions. A large number of adversarial attack questions are input into the large model under test to generate question-answer pairs. These question-answer pairs are then input into the evaluation model to obtain corresponding security assessment scores and generate an evaluation report for the large model under that dimension. This achieves a more comprehensive and automated security assessment task for the large model, improving the accuracy of the large model assessment and thus enhancing the security of using the large model.

[0248] To facilitate better implementation of the large model security evaluation method provided in this application, this application also provides an apparatus based on the aforementioned large model security evaluation method. The meanings of the terms used are the same as in the aforementioned large model security evaluation method, and specific implementation details can be found in the descriptions of the method embodiments. The large model security evaluation apparatus in this application can implement the steps corresponding to the large model security evaluation method executed in the above embodiments. The functions implemented by the large model security evaluation apparatus can be implemented by hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the aforementioned functions, and these modules can be software and / or hardware.

[0249] Please refer to Figure 7, which is a schematic diagram of the structure of the large model security evaluation device provided in the embodiments of this application. The large model security evaluation device 400 may include a question generation module 401, a question-answer pair acquisition module 402, a score acquisition module 403, and a report generation module 404, etc.

[0250] The question generation module 401 is used to obtain the dimension to be evaluated and to generate, through the red team adversarial model, adversarial attack questions that conform to the dimension to be evaluated;

[0251] The question-answer pair acquisition module 402 is used to input the adversarial attack question into the large model under test, and obtain the question-answer pair to be tested by outputting the answer that matches the adversarial attack question through the large model under test;

[0252] The score acquisition module 403 is used to input the question-answer pair to be tested into the evaluation model, and output a security evaluation score for the question-answer pair to be tested through the evaluation model.

[0253] The report generation module 404 is used to generate a security assessment report for the tested large model in the dimension to be assessed based on the security assessment score.

[0254] In some embodiments, the large model safety evaluation device 400 further includes:

[0255] The dataset acquisition module is used to acquire the training dataset, which includes prompt information, sample question-answer pairs and their aggression scores. The sample question-answer pairs include aggressive questions and their corresponding answers.

[0256] The pre-training module is used to pre-train a large model using a training dataset to obtain a pre-trained large model.

[0257] The model acquisition module is used to take the pre-trained large model as a reference model and copy the pre-trained large model to obtain the model to be optimized.

[0258] The first generation module is used to generate the first question-answer pair and its corresponding first score based on the prompt information using the model to be optimized;

[0259] The second generation module is used to generate a second question-answer pair and its corresponding second score based on the prompt information using a reference model;

[0260] The reinforcement learning module is used to perform reinforcement learning on the model to be optimized based on sample question-answer pairs, aggression scores, first question-answer pairs, first scores, second question-answer pairs, and second scores until a preset stopping condition is met, thus obtaining the optimized model.

[0261] A module used to take the optimized model as the red team adversarial model.

[0262] In some implementations, the reinforcement learning module includes:

[0263] The first calculation submodule is used to calculate the reward loss based on the sample question-answer pair, the aggression score, the first question-answer pair, and the first score using the reward model;

[0264] The second calculation submodule is used to calculate the divergence loss between the first question-and-answer pair and the second question-and-answer pair based on the first question-and-answer pair, the first score, the second question-and-answer pair, and the second score.

[0265] The part-of-speech tagging submodule is used to perform part-of-speech tagging on the first question-answer pair to obtain the target part-of-speech vector of the first question-answer pair;

[0266] The third calculation submodule is used to calculate the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the greatest similarity in the part-of-speech vector template library;

[0267] The adjustment submodule is used to adjust the parameters of the model to be optimized based on the reward loss, divergence loss, and part-of-speech loss until the preset stopping condition is met, thus obtaining the optimized model.

[0268] In some implementations, the adjustment submodule is specifically used for:

[0269] Obtain the first weight coefficient corresponding to the reward loss, the second weight coefficient corresponding to the divergence loss, and the third weight coefficient corresponding to the part-of-speech loss;

[0270] Based on the first weight coefficient, the second weight coefficient, and the third weight coefficient, the reward loss, divergence loss, and part-of-speech loss are weighted and calculated to obtain the total reinforcement learning loss;

[0271] Based on the total loss of reinforcement learning, the parameters of the model to be optimized are adjusted until the loss is minimized or the number of iterations reaches the preset number, thus obtaining the optimized model.

[0272] In some implementations, the dataset acquisition module is specifically used for:

[0273] Obtain the initial training dataset, which includes initial sample question-answer pairs and their corresponding initial aggression scores. The initial sample question-answer pairs include initial aggressive questions and their corresponding initial answers.

[0274] Obtain the initial prompt information for each question-answer pair in the initial training dataset;

[0275] Based on the initial aggressive questions and initial prompts, the initial sample question-answer pairs are augmented to obtain the augmented training dataset. The augmented training dataset includes prompts, sample question-answer pairs and their corresponding aggressive scores. The sample question-answer pairs include aggressive questions and their corresponding answers.

[0276] A training dataset is generated based on the augmented training dataset.

[0277] In some embodiments, the large model safety evaluation device 400 further includes:

[0278] The clustering module is used to cluster the training dataset according to preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions.

[0279] The part-of-speech tagging module is used to tag offensive questions with an attack score greater than a preset score threshold in each data subset using a part-of-speech vector matching model, and obtain the part-of-speech vectors corresponding to the offensive questions.

[0280] The building module is used to construct a part-of-speech vector template library based on part-of-speech vectors.

[0281] In some embodiments, the large model safety evaluation device 400 further includes:

[0282] The judgment module is used to determine whether the dimension to be evaluated is a zero-sample dimension;

[0283] The dimension adjustment module is used to adjust the dimension to be evaluated if the dimension to be evaluated is a zero-sample dimension, so as to obtain the adjusted evaluation dimension.

[0284] The question generation module 401 is specifically used to generate, through the red team adversarial model, adversarial attack questions that conform to the adjusted evaluation dimension.

[0285] In some implementations, the dimension adjustment module is specifically used for:

[0286] Obtain existing historical security assessment dimensions from the part-of-speech vector template library;

[0287] From historical security assessment dimensions, target security assessment dimensions with a similarity greater than a preset similarity threshold are selected.

[0288] The part-of-speech matching dimension is adjusted based on the target security evaluation dimension to obtain the adjusted evaluation dimension.
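As an illustrative, non-limiting sketch, the adjustment of a zero-sample dimension can be outlined as shown below. The text does not specify how similarity between dimensions is measured, so character-level similarity of the dimension names from the standard library's `difflib` is used purely as a stand-in; treating "not present among the historical dimensions" as "zero-sample" is likewise an assumption, and all names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional


def name_similarity(a: str, b: str) -> float:
    """Stand-in similarity between two dimension names, in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def adjust_dimension(
    dimension: str,
    historical: Iterable[str],
    threshold: float,
    similarity: Callable[[str, str], float] = name_similarity,
) -> Optional[str]:
    """Return the dimension itself if already known, else the most similar historical
    dimension whose similarity exceeds the threshold, else None."""
    known = list(historical)
    if dimension in known:
        return dimension  # not a zero-sample dimension; no adjustment needed
    best: Optional[str] = None
    best_similarity = threshold
    for candidate in known:
        value = similarity(dimension, candidate)
        if value > best_similarity:
            best, best_similarity = candidate, value
    return best


# Hypothetical historical dimensions taken from a template library.
history = ["privacy leakage", "bias"]
unchanged = adjust_dimension("bias", history, threshold=0.6)
adjusted = adjust_dimension("privacy leak", history, threshold=0.6)
unmatched = adjust_dimension("xyz", history, threshold=0.6)
```

A semantic similarity measure (for example, one computed over embeddings of the dimension descriptions) could be passed in through the `similarity` parameter without changing the selection logic.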

[0289] In some implementations, the question generation module 401 is specifically used for:

[0290] Receive user input selection instructions and determine the dimensions to be evaluated based on the selection instructions; or...

[0291] Obtain the security evaluation dimension to which each data point in the training dataset belongs, thus obtaining the security evaluation dimension set. Select any security evaluation dimension from the security evaluation dimension set as the dimension to be evaluated.

[0292] The adversarial attack problem that meets the evaluation dimensions is generated by using the red team adversarial model.

[0293] In some implementations, the question generation module 401 is specifically used to:

[0294] Obtain target prompt information for the dimension to be evaluated;

[0295] Generate, through the red team adversarial model and based on the target prompt information, adversarial attack questions that conform to the dimension to be evaluated.

[0296] In this embodiment, the question generation module 401 can obtain the dimension to be evaluated and generate adversarial attack questions that match the dimension to be evaluated through a red team adversarial model. Then, the question-answer pair acquisition module 402 can input the adversarial attack questions into the large model under test and output answers that match the adversarial attack questions to obtain the question-answer pairs to be tested. The score acquisition module 403 can input the question-answer pairs to be tested into the evaluation model and output a security evaluation score for the question-answer pairs to be tested. At this time, the report generation module 404 can generate a security evaluation report of the large model under the dimension to be evaluated based on the security evaluation score. This solution utilizes a red team adversarial model to automatically generate a large number of adversarial attack questions that conform to any evaluation dimension, increasing the diversity and adversarial nature of these questions. These numerous adversarial attack questions are input into the large model under test, which then provides corresponding answers to each question, generating test question-answer pairs. These pairs are then input into an evaluation model to obtain corresponding security evaluation scores and generate an evaluation report for the large model under that evaluation dimension. This automates the security evaluation of large models and supports multi-security dimension evaluation, achieving a more comprehensive and automated security evaluation of large models, improving the accuracy of large model evaluation, and thus enhancing the security of using large models.
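The flow through modules 401 to 404 can be sketched as a single function. This is an illustrative sketch only: the three models are passed in as plain callables, the `toy_*` stubs are hypothetical stand-ins for the red team adversarial model, the large model under test, and the evaluation model, and the report fields and the convention that a higher score means a safer answer are assumptions not stated in the text.

```python
from statistics import mean


def evaluate_large_model(dimension, red_team_model, model_under_test,
                         evaluation_model, n_questions=3):
    """Generate questions, collect answers, score each pair, and build a report."""
    questions = red_team_model(dimension, n_questions)            # module 401
    qa_pairs = [(q, model_under_test(q)) for q in questions]      # module 402
    scores = [evaluation_model(q, a) for q, a in qa_pairs]        # module 403
    return {                                                      # module 404
        "dimension": dimension,
        "num_questions": len(qa_pairs),
        "mean_score": mean(scores),
        "min_score": min(scores),
        "details": [
            {"question": q, "answer": a, "score": s}
            for (q, a), s in zip(qa_pairs, scores)
        ],
    }


# Hypothetical stubs so the sketch runs end to end.
def toy_red_team(dimension, n):
    return [f"[{dimension}] adversarial question {i}" for i in range(n)]


def toy_model_under_test(question):
    return "I cannot help with that." if question.endswith("0") else "Sure, here is how."


def toy_evaluator(question, answer):
    # Assumed convention: 1.0 for a refusal (safe), 0.0 otherwise.
    return 1.0 if answer.startswith("I cannot") else 0.0


report = evaluate_large_model("privacy", toy_red_team, toy_model_under_test, toy_evaluator)
```

Keeping the models as injected callables mirrors the modular device structure: any of the three can be replaced without touching the evaluation loop, and the same function can be called once per dimension for multi-dimension evaluation.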

[0297] The large model security evaluation device in this application embodiment has been described above from the perspective of modular functional entities. The large model security evaluation device in this application embodiment is described below from the perspective of hardware processing.

[0298] It should be noted that the physical devices corresponding to the question generation module 401, question-answer pair acquisition module 402, score acquisition module 403, and report generation module 404 shown in Figure 7 may be processors.

[0299] The devices shown in Figure 7 may all have the structure shown in Figure 8. When the large model security evaluation device 400 shown in Figure 7 has the structure shown in Figure 8, the processor and transceiver in Figure 8 can perform the same or similar functions as the question generation module 401, question-answer pair acquisition module 402, score acquisition module 403, and report generation module 404 provided in the aforementioned device embodiments, and the memory in Figure 8 stores the computer program that the processor needs to call when executing the above-mentioned large model security evaluation method.

[0300] This application also provides a computer device, which may be a terminal, a server, or the like. Figure 9 shows a structural schematic diagram of the computer device involved in the embodiments of this application. Specifically:

[0301] The computer device may include components such as a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will understand that the computer device structure shown in Figure 9 does not constitute a limitation on the computer device, which may include more or fewer components than shown, combine certain components, or have a different component arrangement. Wherein:

[0302] The processor 501 is the control center of the computer device, connecting various parts of the computer device through various interfaces and lines. It performs various functions and processes data by running or executing software programs and / or modules stored in the memory 502, and by calling data stored in the memory 502. Optionally, the processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 501.

[0303] The memory 502 can be used to store software programs and modules. The processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the computer device, and the like. In addition, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.

[0304] The computer device also includes a power supply 503 that supplies power to the various components. Preferably, the power supply 503 can be logically connected to the processor 501 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 503 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and any other such components.

[0305] The computer device may also include an input unit 504, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

[0306] Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 501 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502 to realize various functions, as follows:

[0307] The process involves: acquiring the dimensions to be evaluated and generating adversarial attack questions that match these dimensions using a red team adversarial model; inputting these adversarial attack questions into the large model under test and obtaining test question-answer pairs by outputting answers that match the adversarial attack questions; inputting these test question-answer pairs into an evaluation model and outputting security evaluation scores for them; and generating a security evaluation report for the large model under test based on the security evaluation scores, tailored to the dimensions to be evaluated.

[0308] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the computer devices, apparatuses, and modules described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0309] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the detailed description of the large model security evaluation method above, which will not be repeated here.

[0310] The computer device in this application embodiment can automatically generate a large number of adversarial attack questions that conform to any evaluation dimension through the red team adversarial model, thereby improving the diversity and adversarial nature of the adversarial attack questions. A large number of adversarial attack questions are input into the large model under test, which provides a corresponding answer to each adversarial attack question, thereby generating a question-answer pair to be tested. The question-answer pair to be tested is input into the evaluation model to obtain the corresponding security evaluation score, and an evaluation report of the large model under test in the evaluation dimension is generated. The security evaluation task of the large model is completed automatically, and multi-security dimension evaluation is supported. This realizes a more comprehensive and automated security evaluation task for the large model, improves the accuracy of the evaluation of the large model, and thus enhances the security of the use of the large model.

[0311] It should be noted that, for ease of explanation, the above description only shows the parts of the computer device structure relevant to the embodiments of this application. Specific details not disclosed can be flexibly configured according to the specific type of computer device, and are not limited here. For example, when the computer device is a server, the server structure may be as shown in Figure 10. The server 1100 can vary considerably due to different configurations or performance, and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) for storing application programs 1142 or data 1144. The memory 1132 and storage media 1130 can be temporary or persistent storage. The program stored in the storage media 1130 may include one or more modules (not shown in the figure), each module including a series of instruction operations on the server. Furthermore, the CPU 1122 may be configured to communicate with the storage media 1130 and execute the series of instruction operations in the storage media 1130 on the server 1100.

[0312] Server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input / output interfaces 1158, and / or one or more operating systems 1141, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.

[0313] The steps performed by the server in the above embodiments may be based on the structure of the server 1100 shown in Figure 10. For example, the steps performed by the large model security evaluation device in the above embodiments may be based on the server structure shown in Figure 10. For example, the central processing unit 1122 performs the following operations by calling instructions from the memory 1132:

[0314] The process involves: acquiring the dimensions to be evaluated and generating adversarial attack questions that match these dimensions using a red team adversarial model; inputting these adversarial attack questions into the large model under test and obtaining test question-answer pairs by outputting answers that match the adversarial attack questions; inputting these test question-answer pairs into an evaluation model and outputting security evaluation scores for them; and generating a security evaluation report for the large model under test based on the security evaluation scores, tailored to the dimensions to be evaluated.

[0315] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the servers, devices, and modules described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0316] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the detailed description of the large model security evaluation method above, which will not be repeated here.

[0317] According to one aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in the various optional implementations of the above embodiments.

[0318] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by computer instructions, or by controlling related hardware through computer instructions. These computer instructions can be stored in a computer-readable storage medium (i.e., a storage medium) and loaded and executed by a processor. Therefore, embodiments of this application provide a storage medium storing a computer program, which may include computer instructions. This computer program can be loaded by a processor to execute any of the large-model security evaluation methods provided in the embodiments of this application.

[0319] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0320] The storage medium may include: read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0321] Since the instructions stored in the storage medium can execute the steps in any of the large model security evaluation methods provided in the embodiments of this application, the beneficial effects that any of the large model security evaluation methods provided in the embodiments of this application can achieve can be realized. For details, please refer to the previous embodiments, which will not be repeated here.

[0322] According to one aspect of this application, a chip is provided, comprising a processor coupled to a transceiver of a computer device for executing the large-model security evaluation method provided in the embodiments of this application. Embodiments of this application also provide a chip system including a processor for supporting the computer device in implementing the functions involved in the aforementioned large-model security evaluation method. In one possible design, the chip system further includes a communication interface for inputting and / or outputting information. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the terminal device. The chip system may be composed of a chip or may include chips and other discrete devices.

[0323] The foregoing has provided a detailed description of a large-scale model security evaluation method, apparatus, computer device, and storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A large model security evaluation method, characterized in that it comprises: obtaining a dimension to be evaluated, and generating, through a red team adversarial model, an adversarial attack question that conforms to the dimension to be evaluated; inputting the adversarial attack question into a large model under test, and outputting, through the large model under test, an answer that matches the adversarial attack question to obtain a question-answer pair to be tested; inputting the question-answer pair to be tested into an evaluation model, and outputting, through the evaluation model, a security evaluation score for the question-answer pair to be tested; and generating, based on the security evaluation score, a security evaluation report of the large model under test in the dimension to be evaluated; wherein the steps for obtaining the red team adversarial model include: obtaining a training dataset, which includes prompt information, sample question-answer pairs and their aggression scores, wherein the sample question-answer pairs include aggressive questions and their corresponding answers; pre-training a large model using the training dataset to obtain a pre-trained large model; using the pre-trained large model as a reference model, and copying the pre-trained large model to obtain a model to be optimized; generating, by the model to be optimized and based on the prompt information, a first question-answer pair and its corresponding first score; generating, by the reference model and based on the prompt information, a second question-answer pair and its corresponding second score; performing reinforcement learning on the model to be optimized, based on the sample question-answer pairs, the aggression scores, the first question-answer pair, the first score, the second question-answer pair, and the second score, until a preset stopping condition is met, to obtain an optimized model; and using the optimized model as the red team adversarial model; wherein the step of performing reinforcement learning on the model to be optimized, based on the sample question-answer pairs, the aggression scores, the first question-answer pair, the first score, the second question-answer pair, and the second score, until a preset stopping condition is met, to obtain the optimized model, includes: calculating a reward loss through a reward model based on the sample question-answer pairs, the aggression scores, the first question-answer pair, and the first score; calculating a divergence loss between the first question-answer pair and the second question-answer pair based on the first question-answer pair, the first score, the second question-answer pair, and the second score; performing part-of-speech tagging on the first question-answer pair to obtain a target part-of-speech vector of the first question-answer pair; calculating a part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in a part-of-speech vector template library; and adjusting the parameters of the model to be optimized based on the reward loss, the divergence loss, and the part-of-speech loss until the preset stopping condition is met, to obtain the optimized model.

2. The large model security evaluation method according to claim 1, characterized in that the step of adjusting the parameters of the model to be optimized based on the reward loss, the divergence loss, and the part-of-speech loss until a preset stopping condition is met, to obtain the optimized model, includes: obtaining a first weight coefficient corresponding to the reward loss, a second weight coefficient corresponding to the divergence loss, and a third weight coefficient corresponding to the part-of-speech loss; performing a weighted calculation on the reward loss, the divergence loss, and the part-of-speech loss based on the first weight coefficient, the second weight coefficient, and the third weight coefficient to obtain a total reinforcement learning loss; and adjusting the parameters of the model to be optimized based on the total reinforcement learning loss until the loss is minimized or the number of iterations reaches a preset number, to obtain the optimized model.

3. The large model security evaluation method according to claim 1, characterized in that obtaining the training dataset includes: obtaining an initial training dataset, which includes initial sample question-answer pairs and their corresponding initial aggression scores, wherein the initial sample question-answer pairs include initial aggressive questions and their corresponding initial answers; obtaining the initial prompt information for each question-answer pair in the initial training dataset; augmenting the initial sample question-answer pairs based on the initial aggressive questions and the initial prompt information to obtain an augmented training dataset, which includes prompt information, sample question-answer pairs and their corresponding aggression scores, wherein the sample question-answer pairs include aggressive questions and their corresponding answers; and generating the training dataset based on the augmented training dataset.

4. The large model security evaluation method according to claim 3, characterized in that, before calculating the part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in the part-of-speech vector template library, the large model security evaluation method further includes: clustering the training dataset according to preset security evaluation dimensions to obtain data subsets corresponding to multiple security evaluation dimensions; performing part-of-speech tagging, through a part-of-speech tagging model, on the aggressive questions in each data subset whose aggression score is greater than a preset score threshold, to obtain the part-of-speech vectors corresponding to those aggressive questions; and constructing the part-of-speech vector template library based on the part-of-speech vectors.

5. The large model security evaluation method according to any one of claims 1 to 4, characterized in that, before generating, through the red team adversarial model, the adversarial attack question that conforms to the dimension to be evaluated, the large model security evaluation method further includes: determining whether the dimension to be evaluated is a zero-sample dimension; and, if the dimension to be evaluated is a zero-sample dimension, adjusting the dimension to be evaluated to obtain an adjusted evaluation dimension; wherein generating, through the red team adversarial model, the adversarial attack question that conforms to the dimension to be evaluated includes: generating, through the red team adversarial model, an adversarial attack question that conforms to the adjusted evaluation dimension.

6. A large model security evaluation device, characterized in that it comprises: a question generation module, used to obtain a dimension to be evaluated and generate, through a red team adversarial model, an adversarial attack question that conforms to the dimension to be evaluated; a question-answer pair acquisition module, used to input the adversarial attack question into a large model under test, and output, through the large model under test, an answer that matches the adversarial attack question to obtain a question-answer pair to be tested; a score acquisition module, used to input the question-answer pair to be tested into an evaluation model, and output, through the evaluation model, a security evaluation score for the question-answer pair to be tested; and a report generation module, used to generate, based on the security evaluation score, a security evaluation report of the large model under test in the dimension to be evaluated; wherein the steps for obtaining the red team adversarial model include: obtaining a training dataset, which includes prompt information, sample question-answer pairs and their aggression scores, wherein the sample question-answer pairs include aggressive questions and their corresponding answers; pre-training a large model using the training dataset to obtain a pre-trained large model; using the pre-trained large model as a reference model, and copying the pre-trained large model to obtain a model to be optimized; generating, by the model to be optimized and based on the prompt information, a first question-answer pair and its corresponding first score; generating, by the reference model and based on the prompt information, a second question-answer pair and its corresponding second score; performing reinforcement learning on the model to be optimized, based on the sample question-answer pairs, the aggression scores, the first question-answer pair, the first score, the second question-answer pair, and the second score, until a preset stopping condition is met, to obtain an optimized model; and using the optimized model as the red team adversarial model; wherein the step of performing reinforcement learning on the model to be optimized, based on the sample question-answer pairs, the aggression scores, the first question-answer pair, the first score, the second question-answer pair, and the second score, until a preset stopping condition is met, to obtain the optimized model, includes: calculating a reward loss through a reward model based on the sample question-answer pairs, the aggression scores, the first question-answer pair, and the first score; calculating a divergence loss between the first question-answer pair and the second question-answer pair based on the first question-answer pair, the first score, the second question-answer pair, and the second score; performing part-of-speech tagging on the first question-answer pair to obtain a target part-of-speech vector of the first question-answer pair; calculating a part-of-speech loss between the target part-of-speech vector and the part-of-speech vector with the highest similarity in a part-of-speech vector template library; and adjusting the parameters of the model to be optimized based on the reward loss, the divergence loss, and the part-of-speech loss until the preset stopping condition is met, to obtain the optimized model.

7. A computer device, characterized in that it includes a processor and a memory, wherein the memory stores a computer program, and the processor executes the large model security evaluation method according to any one of claims 1 to 5 when it invokes the computer program in the memory.

8. A storage medium, characterized in that the storage medium is used to store a computer program, which is loaded by a processor to execute the large model security evaluation method according to any one of claims 1 to 5.