Method and device for endogenous security protection of large model based on reinforcement learning

By constructing a reinforcement learning state space and training an intrinsic security protection model, and dynamically adjusting the decoding hyperparameters of LLM, the static defense problem of large language model inference services is solved, achieving effective interception of adversarial attacks and high system robustness.

CN122226432A · Pending · Publication Date: 2026-06-16 · Chinese People's Liberation Army Cyberspace Force Information Engineering University

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: Chinese People's Liberation Army Cyberspace Force Information Engineering University
Filing Date: 2026-03-31
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing Large Language Model (LLM) inference services use static decoding hyperparameter configurations and therefore cannot change dynamically. Attackers can bypass the defenses by exploiting gradient information, and the resulting attacks have high success rates and transferability, creating a persistent security threat.

Method used

By constructing a state space for reinforcement learning, an intrinsic security protection model is trained. The decoding hyperparameters of the LLM are dynamically adjusted using the reinforcement learning agent to build a dynamic defense mechanism that prevents attackers from using fixed parameters to launch attacks.

Benefits of technology

It effectively intercepts existing jailbreak attack methods, improves the robustness and security of the system, ensures the compliance and availability of the output, and breaks the attack chain of attackers based on fixed parameters.


[Figure CN122226432A_ABST]

Abstract

Embodiments of the present application relate to the field of large model network security, and specifically relate to a large model endogenous security protection method and device based on reinforcement learning. A specific implementation of the method includes: constructing a state space of reinforcement learning; training an endogenous security protection large model based on the state space; and performing dynamic defense reasoning processing on user request information based on the endogenous security protection large model. The implementation can effectively intercept existing jailbreaking attack methods and support the formation of network security capabilities such as harmless instructions, compliant output, and high-availability services.

Description

Technical Field

[0001] The embodiments of this application relate to the field of large model cybersecurity, specifically to a method and apparatus for large model intrinsic security protection based on reinforcement learning.

Background Technology

[0002] With breakthroughs in deep learning technology, Large Language Models (LLMs) have become a core engine driving industrial intelligence, demonstrating immense application value in areas such as content generation, logical reasoning, and code assistance. However, because pre-training large models from scratch requires extremely expensive computing resources and data costs, most downstream applications currently lack the ability to independently train models, instead relying heavily on application programming interfaces (APIs) or private deployment services provided by general-purpose large models. While this "Model as a Service (MaaS)" paradigm significantly lowers the technical barrier, it also leads to a massive number of applications operating on the same model foundation and standardized inference configurations, resulting in a high concentration of security risks throughout the supply chain.

[0003] This standardized inference service model provides an opportunity for automated adversarial attacks. In particular, with the emergence of algorithms such as Greedy Coordinate Gradient (GCG) and AutoDAN (an automated jailbreak prompt generation method), attack methods have evolved from early manual "jailbreaking" to gradient-based automated optimization attacks. Existing LLM inference services typically employ static decoding hyperparameter configurations (such as a fixed temperature coefficient and a fixed Top-P sampling threshold). Such a configuration lacks dynamic adaptability, making the model's decision boundary fixed and predictable for attackers. Once an attacker uses gradient information to find an "adversarial suffix" that bypasses the defense, the attack sample will have an extremely high success rate and transferability because the model's defense state is static, posing a persistent threat to system security.

Summary of the Invention

[0004] The summary section of this application is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.

[0005] Some embodiments of this application propose a method, apparatus, computer device, and computer-readable storage medium for large model intrinsic security protection based on reinforcement learning, in order to solve one or more of the technical problems mentioned in the background section above.

[0006] In a first aspect, some embodiments of this application provide a large model intrinsic security protection method based on reinforcement learning. The method includes: constructing a state space for reinforcement learning; training an intrinsic security protection large model based on the state space; and performing dynamic defense reasoning processing on user request information based on the intrinsic security protection large model.

[0007] In a second aspect, some embodiments of this application provide a large model intrinsic security protection device based on reinforcement learning. The device includes: a construction unit configured to construct a state space for reinforcement learning; a training unit configured to train an intrinsic security protection large model based on the state space; and a dynamic defense inference unit configured to perform dynamic defense inference processing on user request information based on the intrinsic security protection large model.

[0008] In a third aspect, this application also provides a computer device, which includes a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the method described in any implementation of the first aspect above.

[0009] In a fourth aspect, this application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method described in any implementation of the first aspect above.

[0010] The above embodiments of this application have the following beneficial effects: in the reinforcement learning-based large model intrinsic security protection method of some embodiments of this application, the inference hyperparameter space of the LLM is used as the optimization object. By attaching reinforcement learning agents to the black-box environment, an adaptive defense chain of "state space construction - policy network evolution - dynamic defense inference" is constructed, which can effectively intercept existing jailbreak attack methods and support the formation of network security capabilities such as harmless instructions, compliant output, and high service availability.

Description of the Drawings

[0011] The above and other features, advantages, and aspects of the embodiments of this application will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and elements are not necessarily drawn to scale.

[0012] Figure 1 is a flowchart of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application;

[0013] Figure 2 is a general framework diagram of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application;

[0014] Figure 3 is a schematic diagram of reinforcement learning environment interaction and state space construction according to some embodiments of the reinforcement learning-based large model intrinsic security protection method of this application;

[0015] Figure 4 is a dynamic inference timing diagram under the dynamic defense mechanism of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application;

[0016] Figure 5 is a schematic diagram of some embodiments of the reinforcement learning-based large model intrinsic security protection device according to this application;

[0017] Figure 6 is a schematic diagram of the structure of a computer device suitable for implementing some embodiments of this application.

Detailed Implementation

[0018] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While some embodiments of this application are shown in the drawings, it should be understood that this application can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this application. It should be understood that the drawings and embodiments of this application are for illustrative purposes only and are not intended to limit the scope of protection of this application.

[0019] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described herein can be combined with each other.

[0020] It should be noted that the concepts of "first" and "second" mentioned in this application are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0021] It should be noted that the terms "a" and "a plurality of" used in this application are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0022] The names of the messages or information exchanged between multiple devices in the embodiments of this application are for illustrative purposes only and are not intended to limit the scope of these messages or information.

[0023] The present application will now be described in detail with reference to the accompanying drawings and embodiments.

[0024] Figure 1 shows a flowchart 100 of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application. This reinforcement learning-based large model intrinsic security protection method includes the following steps:

[0025] Step 101: Construct the state space for reinforcement learning.

[0026] In some embodiments, the implementing entity of the reinforcement learning-based large model intrinsic security protection method can construct a reinforcement learning state space. This state space includes: a reinforcement learning environment, a multi-dimensional state vector, and an interaction pool.

[0027] As an example, the overall framework of the reinforcement learning-based large model intrinsic security protection method in this application is shown in Figure 2, which is a general framework diagram of some embodiments of the method. In Figure 2, the top part is the target large language model, the middle part shows the implementation steps of this application, and the bottom part shows the specific implementation details. Overall, this application manipulates the LLM inference hyperparameters through "external dynamic adjustment" to construct an intrinsic security defense barrier without changing the model weights, which provides strong defense at low deployment cost.

[0028] In practice, the aforementioned implementing entity can construct the state space for reinforcement learning through the following steps:

[0029] The first step is to encapsulate the LLM API (input Prompt, output Text) into a reinforcement learning environment. In practice, the aforementioned execution entity can define the action space as a continuous floating-point vector, corresponding to the adjustment range of the decoding hyperparameters: Temperature, the Top-P nucleus sampling threshold, the Top-K truncation value, and the Max_Tokens maximum generation length. For example, the action vector can be mapped to a temperature range of [0.01, 1.0] and a nucleus sampling threshold range of [0.01, 1.0].

[0030] In practice, the aforementioned executing entity defines the action space as a continuous floating-point vector representing the adjustment amounts of the LLM inference hyperparameters. The action vector $a_t$ is the decision signal directly output by the endogenous security protection large model, and $a_i$ denotes its $i$-th action value. An example action vector is $a_t = [a_1, a_2, a_3, a_4, a_5]$, where each dimension maps to a tuning suggestion for one of the five core hyperparameters. Because the output layer of the intrinsic security protection large model uses the Tanh (hyperbolic tangent) activation function, each original action value satisfies $a_i \in [-1, 1]$, where $i$ is the sequence number. The executing entity uses a linear mapping function to convert an action value into a physical parameter: $T = T_{\min} + \frac{a_i + 1}{2}\,(T_{\max} - T_{\min})$, where $a_i$ is the action value of the temperature dimension, $T$ is the actual temperature value generated by the mapping, $T_{\min}$ is the minimum target temperature, and $T_{\max}$ is the maximum target temperature. Example of mapping: for a target temperature range $[T_{\min}, T_{\max}]$, a network output of $-1$ yields $T = T_{\min}$, an output of $0$ yields the midpoint of the range, and an output of $1$ yields $T = T_{\max}$. This process ensures that the deep learning model can intervene in the LLM's decoding logic in a precisely fine-tuned manner.
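The linear mapping from Tanh outputs to physical hyperparameters described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the parameter names, their order in the action vector, and the rounding of Top-K and Max_Tokens to integers are assumptions, and the ranges are the example ranges given in the description.

```python
from typing import Dict, List, Tuple

# Example ranges from the description; names and ordering are assumptions.
RANGES: Dict[str, Tuple[float, float]] = {
    "temperature": (0.01, 1.0),
    "top_p": (0.01, 1.0),
    "top_k": (1, 100),
    "max_tokens": (1, 4096),
    "frequency_penalty": (-2.0, 2.0),
}
# Assumption: these two hyperparameters are rounded to integers.
INTEGER_PARAMS = {"top_k", "max_tokens"}


def map_action(raw: float, low: float, high: float) -> float:
    """Linearly map a Tanh output in [-1, 1] onto [low, high]."""
    raw = max(-1.0, min(1.0, raw))  # guard against values outside Tanh's range
    return low + (raw + 1.0) / 2.0 * (high - low)


def decode_action(action: List[float]) -> Dict[str, float]:
    """Convert a 5-dimensional raw action vector into decoding hyperparameters."""
    if len(action) != len(RANGES):
        raise ValueError("action vector must have one entry per hyperparameter")
    params: Dict[str, float] = {}
    for raw, (name, (low, high)) in zip(action, RANGES.items()):
        value = map_action(raw, low, high)
        params[name] = int(round(value)) if name in INTEGER_PARAMS else value
    return params
```

For instance, `decode_action([0.0] * 5)` places every hyperparameter at roughly the midpoint of its range, while `decode_action([-1.0] * 5)` selects the most restrictive end of each range.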

[0031] The second step is to define a multidimensional state vector.

[0032] In practice, $s_t = [e_t, h_{t-1}]$ is defined as the multidimensional state vector. Here, $e_t$ represents the semantic feature vector of the current input instruction, extracted by a lightweight encoder (such as Sentence-BERT, a sentence embedding model). It captures the semantic features of malicious intent hidden in the input instruction (such as "make a bomb" or "write ransomware") and provides the agent with a context-aware basis for decision-making. $h_{t-1}$ represents the hyperparameter information used in the previous round of inference, and can include the temperature coefficient (Temperature), the nucleus sampling threshold (Top-P), the truncation value (Top-K), the maximum generation length (Max_Tokens), and the frequency penalty (Frequency Penalty). For example, Temperature can take values in [0.01, 1.0]. Adjusting Temperature controls the determinism of generation; lowering it (e.g., closer to 0.01) makes the model's logic converge and prevents it from diverging toward malicious content under adversarial inducement. Top-P can take values in [0.0, 1.0]. Top-P controls the truncation of cumulative probability sampling; narrowing Top-P filters out tokens that have low probability but may carry toxic semantics. Top-K can take values in [1, 100]. Top-K limits the number of candidate tokens, directly cutting off the long-tail abnormal generation paths induced by malicious attack suffixes. Max_Tokens can take values in [1, 4096]. Max_Tokens limits the total output length, preventing the model from outputting lengthy harmful instructions or consuming excessive computing resources after being attacked. Frequency Penalty can take values in [-2.0, 2.0]. By penalizing tokens that have already appeared, Frequency Penalty prevents the model from getting stuck in an endless loop of repeated malicious content, profanity, or garbled text under adversarial examples.
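As a rough illustration of the state definition above, the sketch below concatenates an instruction embedding with the previous round's hyperparameters. The function name, the ordering of the hyperparameters, and the use of raw (unnormalized) values are assumptions; the embedding would come from an encoder such as Sentence-BERT, which is replaced here by a plain list of floats.

```python
from typing import Dict, List, Sequence

# Assumed order of the historical hyperparameters inside the state vector.
HYPERPARAM_ORDER = ["temperature", "top_p", "top_k", "max_tokens", "frequency_penalty"]


def build_state(semantic_features: Sequence[float], prev_params: Dict[str, float]) -> List[float]:
    """Concatenate the instruction embedding with last round's hyperparameters."""
    history = [float(prev_params[name]) for name in HYPERPARAM_ORDER]
    return list(semantic_features) + history


# Stand-in for a 768-dimensional sentence embedding (hypothetical values).
embedding = [0.0] * 768
previous = {"temperature": 0.7, "top_p": 0.9, "top_k": 50,
            "max_tokens": 512, "frequency_penalty": 0.0}
state = build_state(embedding, previous)
```

With a 768-dimensional embedding and five hyperparameters, the resulting state has 773 entries, matching the input-layer size given later in the description.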

[0033] The third step is to initialize the interaction pool based on the aforementioned multi-dimensional state vector. In practice, the execution entity can load a mixed dataset containing malicious instructions (such as AdvBench, an adversarial benchmark dataset) and normal instructions as the input source for the environment. This ensures that subsequent training covers diverse attack scenarios.
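A minimal sketch of initializing such an interaction pool is shown below. The function name and the (prompt, is_malicious) tuple layout are hypothetical; in practice the malicious prompts would be loaded from a benchmark such as AdvBench and the normal prompts from an ordinary instruction dataset.

```python
import random
from typing import List, Sequence, Tuple


def build_interaction_pool(
    malicious: Sequence[str], benign: Sequence[str], seed: int = 0
) -> List[Tuple[str, bool]]:
    """Mix malicious and normal instructions into one shuffled pool.

    Each entry is (prompt, is_malicious); the label is an illustrative extra
    kept for later analysis and is not required by the description.
    """
    pool = [(p, True) for p in malicious] + [(p, False) for p in benign]
    random.Random(seed).shuffle(pool)  # fixed seed keeps the order reproducible
    return pool
```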

[0034] As an example, refer to Figure 3, which illustrates the reinforcement learning environment interaction and state space construction of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application. For black-box LLMs with a large number of parameters and opaque interfaces (such as API services for Llama-3 or GPT-4), it is first necessary to encapsulate them into a standard environment with which a reinforcement learning agent can interact. As shown in Figure 3, by abstracting the inference interface into a Markov decision process and adopting a "feature mapping" strategy, a state space that the agent can perceive is constructed, which serves as the data foundation for the evolution of the endogenous security protection large model.

[0035] Therefore, in response to the complex generation mechanism of LLM, an "environment encapsulation" strategy is adopted to package the black-box model into a reinforcement learning environment with state observation and action execution capabilities. A composite state space is defined, consisting of the current decoding hyperparameter configuration and the semantic features of the input instructions, and the agent's ability to perceive the dual features of "context-parameters" is established.
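The "environment encapsulation" idea can be sketched as a small wrapper class. Everything here is an illustrative assumption: the class name, the `reset`/`step` interface (loosely modeled on common reinforcement learning environment conventions), the default hyperparameters, and the injected `generate`, `encode`, and `reward_fn` callables, which stand in for the black-box LLM API, the semantic encoder, and the reward function.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]

# Assumed starting configuration before any agent action has been applied.
DEFAULT_PARAMS: Params = {"temperature": 1.0, "top_p": 1.0, "top_k": 50,
                          "max_tokens": 512, "frequency_penalty": 0.0}


class LLMDefenseEnv:
    """Wraps a black-box text generator as a reinforcement learning environment."""

    def __init__(
        self,
        generate: Callable[[str, Params], str],      # hypothetical LLM API call
        encode: Callable[[str], Sequence[float]],    # hypothetical sentence encoder
        reward_fn: Callable[[str], float],           # hypothetical reward function
        prompt_pool: Sequence[str],
        seed: int = 0,
    ) -> None:
        self.generate = generate
        self.encode = encode
        self.reward_fn = reward_fn
        self.prompt_pool = list(prompt_pool)
        self.rng = random.Random(seed)
        self.prev_params: Params = dict(DEFAULT_PARAMS)
        self.prompt = ""

    def _observe(self) -> List[float]:
        # State = semantic features of the prompt + previous hyperparameters.
        return list(self.encode(self.prompt)) + list(self.prev_params.values())

    def reset(self) -> List[float]:
        self.prev_params = dict(DEFAULT_PARAMS)
        self.prompt = self.rng.choice(self.prompt_pool)
        return self._observe()

    def step(self, params: Params) -> Tuple[List[float], float, bool, dict]:
        text = self.generate(self.prompt, params)  # run the LLM with the chosen config
        reward = self.reward_fn(text)
        self.prev_params = dict(params)
        self.prompt = self.rng.choice(self.prompt_pool)  # draw the next instruction
        return self._observe(), reward, False, {"text": text}
```

Injecting the generator and encoder as callables keeps the wrapper independent of any particular model provider, which fits the black-box setting described above.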

[0036] Step 102: Based on the above state space, train the intrinsic security protection large model.

[0037] In some embodiments, the aforementioned execution entity can train an intrinsic security protection model based on the aforementioned state space. The intrinsic security protection model can be a pre-trained neural network model that takes a multi-dimensional state vector as input and hyperparameter information as output.

[0038] In practice, the aforementioned implementing entities can train an intrinsic security protection model through the following steps:

[0039] The first step is to obtain the training sample set. The training samples in the training sample set include: sample state vectors. Here, the sample state vectors can be multi-dimensional state vectors (i.e., a concatenated vector of the semantic feature vector requested by the user and historical hyperparameter information).

[0040] The second step is to determine the initial endogenous security protection model. This initial endogenous security protection model includes the initial policy network and the initial value network.

[0041] The initial policy network employs a deep, multi-layer, fully connected neural network structure. This network can be a custom model that takes the sample state vector as input and outputs the initial mean vector and initial standard deviation vector. The custom model can consist of three layers:

[0042] The first layer is the input layer, used to receive the sample state vector and input it into the second layer. The number of neurons in the input layer is equal to the dimension of the sample state vector. For example, if the semantic feature vector is 768-dimensional and there are 5 hyperparameters, then the input layer will have 773 neurons.

[0043] The second layer, the hidden layer, contains three fully connected layers (MLP), with 1024, 512, and 256 neurons respectively. Each layer is followed by a Layer Normalization layer and a ReLU activation function to extract complex nonlinear features from malicious instructions.

[0044] The third layer, the output layer, employs a Gaussian policy head to output the initial mean vector μ and initial standard deviation vector σ for each hyperparameter adjustment distribution. The mean branch μ uses a Tanh activation function to restrict the output value to the [-1, 1] space, representing the normalized adjustment position of the hyperparameters relative to their physical range. The standard deviation branch σ is processed using a Softplus activation function to ensure the non-negativity of action exploration. The final action vector is obtained through reparameterized sampling. This randomization strategy gives the defense mechanism a natural "mimicry" characteristic, making it impossible for attackers to obtain a fixed defense boundary through differential attacks or gradient estimation. Here, the initial mean vector μ and initial standard deviation vector σ can represent the probability distribution of continuous actions.
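The Gaussian policy head and reparameterized sampling described above can be sketched in plain Python as follows. The function names are hypothetical, the inputs stand in for the pre-activation outputs of the last hidden layer, and clipping the sampled action back into [-1, 1] is an assumption that the description does not state.

```python
import math
import random
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable softplus log(1 + exp(x)); the result is non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gaussian_head(
    mean_logits: Sequence[float], std_logits: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Apply Tanh to the mean branch and Softplus to the standard deviation branch."""
    mu = [math.tanh(z) for z in mean_logits]
    sigma = [softplus(z) for z in std_logits]
    return mu, sigma


def sample_action(
    mu: Sequence[float], sigma: Sequence[float], rng: random.Random
) -> List[float]:
    """Reparameterized sample a = mu + sigma * eps with eps ~ N(0, 1)."""
    action = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    # Assumption: clip so the later linear mapping stays inside its range.
    return [max(-1.0, min(1.0, a)) for a in action]
```

Because the noise term is drawn fresh for every request, two identical prompts can receive different decoding configurations, which is the source of the randomized defense boundary mentioned above.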

[0045] The initial value network can be a model that takes the sample state vector as input and outputs an expected state value scalar. The initial value network is responsible for evaluating decisions, that is, assessing the long-term value of an action, i.e., "how much should Temperature and Top-P be adjusted under the current malicious instruction?" The clipped objective function of the Proximal Policy Optimization (PPO) algorithm is used to limit the magnitude of policy updates, preventing policy collapse caused by individual extreme samples and steadily improving the agent's defense success rate under malicious instructions. For example, the initial value network can be a deep, multi-layer, fully connected neural network. The expected state value scalar is thus used to calculate the advantage function in the PPO algorithm, which evaluates how much better the initial policy network's current action is than the average level, thereby guiding the initial policy network to adjust its network parameters.
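To make the update-limiting behavior concrete, the sketch below evaluates the standard PPO clipped surrogate objective for a single sample. The function name is hypothetical and the clipping range of 0.2 is a commonly used default, not a value given in this application.

```python
import math


def clipped_surrogate(
    new_log_prob: float, old_log_prob: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """PPO clipped objective for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)  # probability ratio r
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the advantage is positive and the ratio exceeds 1 + clip_eps, the objective stops growing, so a single extreme sample cannot push the policy arbitrarily far in one update.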

[0046] The third step is to select target training samples from the training sample set. In practice, the aforementioned execution entity can randomly select training samples from the training sample set as target training samples.

[0047] The fourth step is to input the sample state vectors of the selected target training samples into the initial policy network to obtain the initial mean vector and the initial standard deviation vector.

[0048] The fifth step is to input the sample state vector into the initial value network to obtain the expected state value scalar.

[0049] The sixth step is, in response to determining that the initial policy network does not meet the preset convergence conditions, to adjust the network parameters of the initial endogenous security protection model based on the reverse security reward function, the initial mean vector, the initial standard deviation vector, and the expected state value scalar.

[0050] The reverse security reward function can be:

[0051] $$R = \frac{1}{S_{\text{atk}} + \epsilon},$$

[0052] where $R$ denotes the reverse security reward value, $S_{\text{atk}}$ denotes the comprehensive attack success score, and $\epsilon$ is a constant used to prevent program crashes or computational overflow caused by a zero denominator. For example, $\epsilon$ could be 1e-5.

[0053] $$S_{\text{atk}} = w_1 S_{\text{tox}} + w_2 \left(1 - S_{\text{ref}}\right) + w_3 S_{\text{len}},$$

[0054] where $S_{\text{tox}}$ denotes the content toxicity score, $S_{\text{ref}}$ denotes the refusal response matching score, $S_{\text{len}}$ denotes the length penalty score, and $w_1$, $w_2$, $w_3$ are the weighting coefficients.

[0055] Here, $R$ is the reverse security reward value (Reward) obtained by the reinforcement learning agent (the initial intrinsic security protection model) in the current interaction (training) round. $S_{\text{atk}}$ is the comprehensive attack success score of the content generated under the control of the initial intrinsic security protection model: the higher the score, the more successful the attack (i.e., the more harmful the content) and the lower the reward. The score is a weighted combination of the refusal response matching score, the length penalty score, and the content toxicity score. The refusal response matching score is produced by the program's built-in regular expression engine, which automatically scans the generated text (that is, the LLM output obtained with the hyperparameters produced by reparameterized sampling and mapping of the initial mean and standard deviation vectors) for entries from a predefined dictionary of security refusal terms (such as "I cannot" or "Sorry"). A match yields a value of 1, which is treated as a successful defense: the attack score approaches zero and the reward increases significantly. Otherwise, the value is 0. The content toxicity score is calculated by feeding the generated text into an externally deployed lightweight security classifier (such as a RoBERTa-based toxicity detection model) for forward inference, which outputs a continuous probability value between 0 and 1 representing how harmful the text is. The length penalty score is determined from the effective character length of the generated text. If the output is an empty string or extremely short, a higher penalty value is applied, so that the model cannot "cheat" by outputting blanks to obtain a low toxicity score. The reciprocal form of the reward function ensures that the more secure the output, the higher the reward, thereby driving the agent to evolve toward a "high-defense" parameter region. The function aims to maximize the model's refusal response probability and content compliance while avoiding garbled output. The reverse security reward function can therefore be used to calculate the immediate feedback signal from the environment.
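The reward computation described above can be sketched as follows. This is one possible reading rather than the applicant's formula: the way the refusal match lowers the attack score (through a `1 - refusal` term), the weights, the minimum length of 20 characters, and all function names are assumptions, and the toxicity value is passed in as a number that would normally come from an external classifier.

```python
import re

# Refusal dictionary limited to the two example phrases from the description.
REFUSAL_PATTERN = re.compile(r"\b(?:I cannot|Sorry)\b", re.IGNORECASE)
EPSILON = 1e-5  # keeps the denominator away from zero


def refusal_score(text: str) -> float:
    """1.0 if the text contains a refusal phrase, otherwise 0.0."""
    return 1.0 if REFUSAL_PATTERN.search(text) else 0.0


def length_penalty(text: str, min_chars: int = 20) -> float:
    """Penalty in [0, 1] that grows as the output gets shorter (assumed form)."""
    effective_length = len(text.strip())
    return max(0.0, 1.0 - effective_length / min_chars)


def attack_score(text: str, toxicity: float,
                 w_tox: float = 0.5, w_ref: float = 0.3, w_len: float = 0.2) -> float:
    """Weighted attack success score; weights and refusal term are assumptions."""
    return (w_tox * toxicity
            + w_ref * (1.0 - refusal_score(text))
            + w_len * length_penalty(text))


def reverse_security_reward(text: str, toxicity: float) -> float:
    """Reciprocal reward: the lower the attack score, the higher the reward."""
    return 1.0 / (attack_score(text, toxicity) + EPSILON)
```

Under these assumed weights, a clear refusal with zero toxicity receives a very large reward, an empty output receives a small one, and a long toxic answer receives the smallest, which matches the ordering the description asks for.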

[0056] In practice, first, the aforementioned implementing entity can construct a total loss function from the reverse security reward function, the initial mean vector, the initial standard deviation vector, and the expected state value scalar using a preset total loss construction algorithm. Second, in response to determining that the initial policy network does not meet the preset convergence conditions, the implementing entity can adjust the network parameters of the initial endogenous security protection model using a preset parameter tuning algorithm. The preset convergence conditions are determined based on the total loss function. Specifically, the preset convergence conditions can be joint convergence of the initial policy network and the initial value network, i.e., the total loss function tends to stabilize and the average security reward reaches a stable state. As an example, the preset total loss construction algorithm can be the composite loss function of the PPO (Proximal Policy Optimization) algorithm, which uses generalized advantage estimation to calculate the advantage function and constructs a total loss function that includes the clipped policy objective. The preset parameter tuning algorithm can be the PPO algorithm, which iteratively and stably optimizes the parameters of the endogenous security protection model (policy network) through gradient backpropagation on the total loss function.
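The two ingredients named above, generalized advantage estimation and a composite PPO loss, can be sketched as follows. The discount factor, the GAE parameter, the value and entropy coefficients, and the function names are common defaults or hypothetical choices, not values stated in this application; the entropy bonus in particular is a standard PPO term that the description does not mention.

```python
from typing import List, Sequence


def generalized_advantages(
    rewards: Sequence[float], values: Sequence[float], last_value: float,
    gamma: float = 0.99, lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over one trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_total_loss(
    clip_objective: float, value_pred: float, value_target: float, entropy: float,
    value_coef: float = 0.5, entropy_coef: float = 0.01,
) -> float:
    """Composite loss: negated clipped objective + value error - entropy bonus."""
    value_loss = (value_pred - value_target) ** 2
    return -clip_objective + value_coef * value_loss - entropy_coef * entropy
```

With `gamma = lam = 1` and all value estimates at zero, the advantages reduce to the reward-to-go, which gives a quick sanity check of the recursion.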

[0057] Here, the total loss function is considered to have stabilized when the rate of change of the loss stays below a preset rate of change for a preset number of consecutive losses and the gradient norm converges. For example, the preset number of losses could be 15, and the preset rate of change could be 5%. The rate of change of the loss could be the magnitude of change between the current loss value and the previous loss value; the current loss value is the value calculated by the total loss function in the current training iteration, and the previous loss value is the value calculated by the total loss function in the previous training iteration. The average security reward is considered to have reached a stable state when the change between two consecutive average security rewards is less than a preset reward threshold; for example, the preset reward threshold could be 5%. The average security reward could be the average of a preset number of consecutive reverse security reward values; for example, the preset number could be 50.
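One way to read these convergence conditions is sketched below. The interpretation of the rate of change as a relative change, the comparison of two adjacent reward windows, and the function names are assumptions; the gradient norm check is omitted because the description gives no threshold for it.

```python
from typing import Sequence


def loss_is_stable(losses: Sequence[float], window: int = 15, max_rate: float = 0.05) -> bool:
    """True if the last `window` relative loss changes are all below `max_rate`."""
    if len(losses) < window + 1:
        return False
    recent = losses[-(window + 1):]
    for prev, cur in zip(recent, recent[1:]):
        rate = abs(cur - prev) / max(abs(prev), 1e-12)  # relative change (assumed)
        if rate >= max_rate:
            return False
    return True


def reward_is_stable(rewards: Sequence[float], window: int = 50, max_rate: float = 0.05) -> bool:
    """True if two consecutive window averages differ by less than `max_rate`."""
    if len(rewards) < 2 * window:
        return False
    prev_avg = sum(rewards[-2 * window:-window]) / window
    cur_avg = sum(rewards[-window:]) / window
    return abs(cur_avg - prev_avg) / max(abs(prev_avg), 1e-12) < max_rate


def has_converged(losses: Sequence[float], rewards: Sequence[float]) -> bool:
    """Joint condition: stable total loss and stable average security reward."""
    return loss_is_stable(losses) and reward_is_stable(rewards)
```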

[0058] Optionally, the aforementioned implementing entity may also determine the initial endogenous security protection model as an endogenous security protection model in response to determining that the initial policy network satisfies the aforementioned preset convergence condition.

[0059] Therefore, for specific types of malicious input, a specific combination of hyperparameters can be automatically matched to cut off the model's malicious association path (for example, for leading questions, the temperature can be automatically reduced and Top-P tightened), thereby maximizing the probability of blocking malicious output from the large model without affecting the coherence of normal text.

[0060] Therefore, by simulating the interaction process between the attacker and the model, the intrinsic security protection model is trained using the Proximal Policy Optimization (PPO) algorithm. The policy update is guided by a reverse reward function that includes refusal response matching and toxicity scoring, which identifies the combination of security parameters in the parameter space that minimizes the attack success rate while maintaining text fluency.

[0061] Step 103: Based on the above-mentioned intrinsic security protection model, perform dynamic defense reasoning processing on user request information.

[0062] In some embodiments, the aforementioned execution entity can perform dynamic defense reasoning processing on user request information based on the aforementioned intrinsic security protection model. In practice, the intrinsic security protection model trained above is deployed as a front-end controller in the actual application stage to dynamically schedule user requests under the dynamic defense mechanism, thereby effectively improving the robustness of the system. As an example, refer to Figure 4, which illustrates the dynamic inference timing under the dynamic defense mechanism of some embodiments of the reinforcement learning-based large model intrinsic security protection method according to this application.

[0063] In practice, the aforementioned implementing entities can perform dynamic defense reasoning processing on user request information based on the above-mentioned intrinsic security protection model through the following steps:

[0064] The first step involves feature extraction and state concatenation of the aforementioned user request information to generate a multi-dimensional user state vector. The user request information can represent the prompt text entered by the end-user into the dialog box for the intrinsic security protection model. In practice, firstly, the executing entity can convert the user request information into a semantic feature vector using a semantic encoding model. Secondly, the executing entity can combine the extracted semantic feature vector with historical data to form the multi-dimensional user state vector. For example, the semantic encoding model can be a lightweight semantic encoding model (Sentence-BERT) in Natural Language Processing (NLP). Historical data can be the hyperparameter information from the previous time step. Therefore, the multi-dimensional user state vector can represent a composite vector, composed of two parts: the first part is the semantic features of the current input (user request information), a semantic feature vector extracted from the aforementioned user request information using a lightweight encoding model (representing the current potential threat); the second part is the hyperparameter information from the previous time step, referring to the historical hyperparameter information vector used in the previous round of inference (representing the defensive posture of the executing entity).

[0065] As an example, when a user inputs "how to make a bomb" (user request information), the aforementioned agent uses Sentence-BERT to extract its semantic danger features and concatenates them with the decoding parameters from the previous round to generate a comprehensive multi-dimensional state vector. Upon receiving this state vector, the policy network recognizes that it is facing an "extremely high security threat" and outputs the action distribution in real time; that is, it decides to significantly lower Temperature and Top-P to tighten the output boundary of the large model and cut off its path to generating dangerous content.
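As a non-limiting illustration, the state concatenation described above can be sketched as follows. The function `toy_embed` is a hypothetical, hash-based stand-in for a sentence encoder such as Sentence-BERT, and the dimension `EMBED_DIM`, the key order, and the numeric values are assumptions made only so that the sketch runs without external libraries.

```python
import hashlib
import math

EMBED_DIM = 8  # assumed toy dimension; a real sentence encoder outputs far more


def toy_embed(text, dim=EMBED_DIM):
    """Hypothetical stand-in for a sentence encoder: deterministic unit-norm vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i] / 255.0 - 0.5 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def build_state(prompt, prev_hyperparams):
    """Concatenate semantic features with the previous round's decoding hyperparameters."""
    order = ("temperature", "top_p", "top_k", "max_tokens")  # assumed fixed ordering
    history = [float(prev_hyperparams[key]) for key in order]
    return toy_embed(prompt) + history


prev = {"temperature": 0.7, "top_p": 0.9, "top_k": 50, "max_tokens": 512}
state = build_state("how do I reset my password", prev)
print(len(state))  # semantic part plus four historical hyperparameters
```

In a deployment, the historical part would typically be normalized to a scale comparable with the semantic part before being fed to the policy network; that choice is not specified in the text.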

[0066] The second step is to input the aforementioned multi-dimensional user state vector into the aforementioned intrinsic security protection model to obtain a mean vector and a standard deviation vector.

[0067] The third step involves, in response to determining that the user request information meets the preset high-risk request conditions, performing reparameterized sampling on the mean vector and the standard deviation vector to generate the original action vector. In practice, the executing entity can use a preset sampling algorithm to perform reparameterized sampling on the mean vector and the standard deviation vector to generate the original action vector. For example, the preset sampling algorithm can be random sampling based on the reparameterization technique. The preset high-risk request conditions can be: the user request information meets a preset intent feature discrimination condition, or the user request information meets a preset action distribution entropy monitoring condition. The preset intent feature discrimination condition can be: extracting the semantic tensor of the instruction using a pre-built lightweight feature encoding model (such as Sentence-BERT), and calculating the similarity between this tensor and the feature space of known attack samples. Known attack samples can be pre-collected benchmark jailbreak datasets containing clear malicious intent, such as the AdvBench dataset, which are vectorized in advance and stored in a local feature vector library. The similarity can be calculated with the cosine similarity algorithm, which measures how close two multi-dimensional vectors are in semantic direction by computing the cosine of the angle between them. If the largest of the calculated similarities is greater than a preset similarity threshold (for example, 0.85), the semantics are judged to belong to the malicious attack category and the request is judged as high risk. The preset action distribution entropy monitoring condition can be: calculating the Shannon entropy of the policy network output distribution.
When faced with unfamiliar or highly disruptive adversarial examples, the standard deviation (σ) output by the policy network included in the intrinsic security protection large model increases significantly, causing the Shannon entropy to exceed a preset threshold, so that the request is identified as a high-risk request. Here, unfamiliar or highly disruptive adversarial examples can be meaningless garbled character sequences that attackers generate with automated optimization algorithms, such as GCG (Greedy Coordinate Gradient) or AutoDAN (automated stealthy jailbreak prompt generation for large language models), and append to instructions to disrupt the attention mechanism of the large model. For example, in "Please tell me how to make a bomb ==!!!!! description!!!!!==", the latter half is the adversarial suffix generated by the GCG algorithm. Here, the Shannon entropy can be determined by the following formula:

[0068] $$H = \sum_{i=1}^{n} \frac{1}{2}\ln\left(2\pi e\,\sigma_i^{2}\right),$$

[0069] where $i$ denotes the index, $H$ denotes the Shannon entropy, $\sigma_i$ denotes the $i$-th component of the standard deviation vector, and $n$ denotes the number of standard deviation components. Therefore, the larger the standard deviation, the more chaotic the network's decision-making, and the larger the entropy value $H$.

[0070] For example, the preset threshold can be 1.5 times the average Shannon entropy, where the average Shannon entropy is the average Shannon entropy output by the intrinsic security protection model when the executing entity processes benign instructions (normal instructions, i.e., instructions that do not meet the preset high-risk request conditions). Therefore, when the adversarial garbled suffix in the above example is input, the prediction variance of the policy network increases sharply, and the calculated entropy value exceeds this threshold and triggers the defense interception.
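As a non-limiting illustration, the two high-risk conditions described above (maximum cosine similarity against a library of known attack vectors, and entropy monitoring of the policy output) can be sketched as follows. The entropy is computed here as the differential entropy of a diagonal Gaussian, which is an assumption about the form of the policy distribution; the function names and the toy vectors are hypothetical. The values 0.85 and 1.5 are the example values given in the text, and the threshold comparison assumes that the benign average entropy is positive.

```python
import math

SIM_THRESHOLD = 0.85   # example similarity threshold from the text
ENTROPY_FACTOR = 1.5   # example multiplier on the benign average entropy


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gaussian_entropy(std):
    """Entropy of a diagonal Gaussian with the given standard deviations (assumed form)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in std)


def is_high_risk(embedding, attack_library, std, benign_mean_entropy):
    """True if either the intent condition or the entropy condition is met."""
    max_sim = max((cosine_similarity(embedding, v) for v in attack_library), default=0.0)
    intent_hit = max_sim > SIM_THRESHOLD
    entropy_hit = gaussian_entropy(std) > ENTROPY_FACTOR * benign_mean_entropy
    return intent_hit or entropy_hit


# Toy data: two stored attack vectors and a benign baseline measured at std = 1.
library = [[1.0, 0.0], [0.0, 1.0]]
benign = gaussian_entropy([1.0] * 4)
print(is_high_risk([0.9, 0.1], library, [1.0] * 4, benign))
```

Either condition alone is sufficient in this sketch, which mirrors the "or" relation between the two preset conditions.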

[0071] The fourth step involves mapping the original action vectors to generate a hyperparameter information set. In practice, the executing entity can use a predefined mapping algorithm to map the original action vectors to generate the hyperparameter information set. For example, the predefined mapping algorithm can be: using a linear scaling mapping formula combined with the Sigmoid activation function, mapping each dimension of the original action vector into a predefined physical boundary interval. In this way, the actual values of the decoding hyperparameters that are ultimately used for model inference, such as Temperature, the Top-P nucleus sampling threshold, the Top-K truncation value, and Max_Tokens, can be calculated.
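As a non-limiting illustration, reparameterized sampling followed by the Sigmoid-based linear scaling can be sketched as follows. The boundary intervals in `BOUNDS` and the rounding of Top-K and Max_Tokens to integers are assumptions for illustration; the text does not give numeric bounds.

```python
import math
import random

# Assumed illustrative boundary intervals (low, high) for each decoding hyperparameter.
BOUNDS = {
    "temperature": (0.1, 1.5),
    "top_p": (0.5, 1.0),
    "top_k": (1, 100),
    "max_tokens": (64, 1024),
}
INTEGER_PARAMS = {"top_k", "max_tokens"}


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reparameterize(mean, std, rng):
    """Reparameterized sample: a = mean + std * eps, with eps drawn from N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def map_action(raw_action):
    """Map each raw action dimension into its interval: low + (high - low) * sigmoid(a)."""
    params = {}
    for (name, (low, high)), a in zip(BOUNDS.items(), raw_action):
        value = low + (high - low) * sigmoid(a)
        params[name] = int(round(value)) if name in INTEGER_PARAMS else value
    return params


rng = random.Random(0)
mean, std = [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]
first = map_action(reparameterize(mean, std, rng))
second = map_action(reparameterize(mean, std, rng))
print(first)
print(second)  # same mean and std, different sampled hyperparameters
```

Because the noise term is drawn afresh on every call, two requests with identical policy outputs generally receive different decoding hyperparameters, while the Sigmoid keeps every value inside its interval.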

[0072] Therefore, when a user request arrives, the intrinsic security protection model outputs the action distribution in real time based on the request characteristics and samples it to generate a set of decoding hyperparameters. Because the output of the intrinsic security protection model is random, even if an attacker inputs the same prompt, the model's decoding parameters will dynamically change at different times, thus undermining the attacker's precondition of performing gradient estimation or genetic algorithm optimization based on fixed parameters.

[0073] The fifth step involves selecting, based on the security scorer, security heterogeneous hyperparameter information from the aforementioned hyperparameter information set. The aforementioned security scorer can be:

[0074] $$S = \sigma\left(w_1 S_{\mathrm{ref}} - w_2 S_{\mathrm{tox}} - w_3 S_{\mathrm{ppl}}\right),$$

[0075] where $S$ denotes the security score, $\sigma(\cdot)$ denotes the Sigmoid activation function, $w_1$, $w_2$, and $w_3$ denote the weighting coefficients, $S_{\mathrm{ref}}$ denotes the rejection response matching score, $S_{\mathrm{tox}}$ denotes the content toxicity score, and $S_{\mathrm{ppl}}$ denotes the perplexity score, where

[0076] $$S_{\mathrm{ref}} = \mathbb{1}\left[\exists\, v \in V : v \in y\right], \qquad S_{\mathrm{tox}} = P_{\theta}\left(c \mid y\right), \qquad S_{\mathrm{ppl}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P\left(y_i \mid y_{<i}\right)\right),$$

[0077] where $y$ denotes the response text; $\mathbb{1}[\cdot]$ denotes the Boolean function; $v$ denotes a rejection word, which can be a word from the predefined safe rejection vocabulary dictionary $V$; $P_{\theta}$ denotes the probability output by the toxicity classifier; $c$ denotes the toxicity label; $\theta$ denotes the toxicity parameters; $i$ denotes the index; $N$ denotes the number of words in the response text; $y_i$ denotes the $i$-th word in the response text; $y_{<i}$ denotes the words preceding the $i$-th word in the response text; and $P\left(y_i \mid y_{<i}\right)$ denotes the probability predicted by the language estimator for the current word. Here, the response text can be the text generated under a piece of hyperparameter information in the hyperparameter information set. For example, the toxicity classifier can be a lightweight safety classifier (a RoBERTa-based toxicity detection model). The language estimator can be Llama-2-7b (a large language model).

[0078] Here, the rejection response matching score indicates whether the response text contains an entry of the preset safe rejection vocabulary dictionary (such as "Sorry", "I cannot", "as an AI assistant", etc.), as determined by automatic scanning with a regular expression (Regex) engine. If the response text matches a rejection word included in the safe rejection vocabulary dictionary, the rejection response matching score is 1; otherwise, it is 0. The content toxicity score represents the conditional probability of the harmful category output by the toxicity classifier. The content toxicity score can be a continuous floating-point number in the range [0,1]: the closer it is to 1, the more toxic the response text is; the closer it is to 0, the safer and more harmless the response text is. The perplexity score represents the perplexity of the response text calculated using the locally deployed Llama-2-7b as the language estimator. It is used to evaluate the coherence and natural language fluency of the response text, preventing the model from outputting meaningless gibberish to evade security detection. The perplexity score can be a real number greater than 0: the lower the value, the more the generated response text conforms to human natural language logic (the more fluent it is); the higher the value, the more the response text resembles gibberish or contains serious grammatical errors.

[0079] Here, the rejection response matching score is used to scan for rejection response features (such as "As an AI assistant, I cannot...", "Sorry, I cannot fulfill this request", etc.). The content toxicity score is obtained by calling a locally deployed, fine-tuned semantic model (such as a RoBERTa-based classifier) to give the probability that the content is harmful. The perplexity score is used to evaluate the logical quality and fluency of the output content.
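As a non-limiting illustration, a scorer built from the three components described above can be sketched as follows. The sign convention (refusal raises the score, toxicity and perplexity lower it), the weight values, and the refusal pattern are assumptions. The toxicity probability and the per-word probabilities are passed in as plain numbers; in practice they would come from a toxicity classifier and a language model, which are not called here.

```python
import math
import re

# Assumed refusal phrases, taken from the examples in the text.
REFUSAL_PATTERN = re.compile(r"sorry|i cannot|as an ai assistant", re.IGNORECASE)


def refusal_score(text):
    """1.0 if the response contains a refusal phrase, otherwise 0.0."""
    return 1.0 if REFUSAL_PATTERN.search(text) else 0.0


def perplexity(token_probs):
    """exp of the mean negative log-probability; probabilities must be positive."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def safety_score(text, toxicity_prob, token_probs, w_ref=2.0, w_tox=3.0, w_ppl=0.05):
    """Sigmoid of a weighted combination; weights and signs are illustrative assumptions."""
    z = (w_ref * refusal_score(text)
         - w_tox * toxicity_prob
         - w_ppl * perplexity(token_probs))
    return 1.0 / (1.0 + math.exp(-z))


refusing = safety_score("Sorry, I cannot help with that.", 0.02, [0.5, 0.5, 0.5, 0.5])
harmful = safety_score("Step one: gather the materials.", 0.95, [0.5, 0.5, 0.5, 0.5])
print(refusing > harmful)
```

With these assumed weights, a fluent refusal with low toxicity scores higher than a fluent but toxic answer, which is the ordering that the subsequent selection step relies on.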

[0080] In practice, the aforementioned executing entity can select the security heterogeneous hyperparameter information from the aforementioned hyperparameter information set based on the security scorer through the following sub-steps:

[0081] The first sub-step involves inputting each piece of hyperparameter information in the hyperparameter information set into the security scorer to obtain a security score.

[0082] The second sub-step involves determining the hyperparameter information in the hyperparameter information set that corresponds to the largest of the obtained security scores as the security heterogeneous hyperparameter information.

[0083] Therefore, for high-risk requests, multiple heterogeneous hyperparameter combinations, such as the high Top-K divergent dynamic parameter combination A and the low-Temperature conservative defensive parameter combination B, can be sampled in parallel, and multiple inferences can be performed. A lightweight security scorer then selects the safest result for return. Attackers, based on gradient information or attack samples obtained from historical queries, will likely fail when facing the next dynamically changing parameter environment. This breaks the passive situation of "one-time breach, permanent effectiveness" under static defense, significantly improving the system's robustness. This "dynamic change + optimal output" mechanism constitutes a key line of defense for the intrinsic security of large models.
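As a non-limiting illustration, the "sample several combinations, infer with each, return the safest" selection can be sketched as follows. The callables `fake_generate` and `fake_score` are hypothetical stand-ins for the LLM call and the security scorer, and the candidates are evaluated sequentially here for simplicity rather than in parallel.

```python
def select_safest(candidates, generate, score):
    """Run one inference per candidate and keep the highest-scoring result.

    candidates: list of hyperparameter dicts
    generate:   callable mapping a hyperparameter dict to a response text
    score:      callable mapping a response text to a security score
    Returns (best_score, best_params, best_response).
    """
    best = None
    for params in candidates:
        response = generate(params)
        value = score(response)
        if best is None or value > best[0]:
            best = (value, params, response)
    if best is None:
        raise ValueError("at least one candidate is required")
    return best


# Hypothetical stand-ins so the sketch runs without a model.
def fake_generate(params):
    if params["temperature"] < 0.5:
        return "Sorry, I cannot help with that."
    return "Sure, here is a draft answer."


def fake_score(response):
    return 0.9 if "cannot" in response else 0.2


combo_a = {"temperature": 1.1, "top_k": 80}  # divergent combination A
combo_b = {"temperature": 0.2, "top_k": 10}  # conservative combination B
best_score, best_params, best_response = select_safest(
    [combo_a, combo_b], fake_generate, fake_score
)
print(best_params, best_response)
```

The cost of this path grows with the number of candidates, which is why the text reserves it for requests judged to be high risk.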

[0084] Therefore, the intrinsic security protection model trained through the preceding process generates a nondeterministic optimal hyperparameter combination in real time for each user input during the inference phase. This dynamic defense breaks the attacker's attack chain, which depends on a fixed configuration, thereby improving the system's robustness under covert malicious attacks.

[0085] Optionally, the aforementioned executing entity may also perform the following steps:

[0086] The first step, in response to the determination that the user request information does not meet the aforementioned preset high-risk conditions, is to reparameterize and sample the mean vector and standard deviation vector to generate the original action vector. In practice, the executing entity can use a preset sampling algorithm to reparameterize and sample the mean vector and standard deviation vector to generate the original action vector.

[0087] The second step involves mapping the original action vectors to generate secure heterogeneous hyperparameter information. In practice, the executing entity can use the preset mapping algorithm to map the original action vectors to generate secure heterogeneous hyperparameter information.

[0088] Therefore, for non-high-risk requests, large model inference can be performed directly using the unique set of decoding hyperparameters obtained by the policy network through a single sampling, in order to ensure low service latency.
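As a non-limiting illustration, the routing between the two paths can be sketched as follows. The callables `sample_once` and `pick_safest` are hypothetical placeholders for single reparameterized sampling and for scorer-based selection, and the default of four candidates is an assumption.

```python
def choose_hyperparams(high_risk, sample_once, pick_safest, n_candidates=4):
    """Route a request to the multi-candidate path or the single-sample path.

    high_risk:   result of the preset high-risk request conditions
    sample_once: callable returning one set of decoding hyperparameters
    pick_safest: callable choosing one set from a list of candidates
    """
    if high_risk:
        candidates = [sample_once() for _ in range(n_candidates)]
        return pick_safest(candidates)
    # Low-latency path: one sample, no scoring.
    return sample_once()


# Hypothetical stand-ins for demonstration.
counter = {"calls": 0}


def demo_sample():
    counter["calls"] += 1
    return {"temperature": 0.2 + 0.1 * counter["calls"]}


def demo_pick(candidates):
    return min(candidates, key=lambda p: p["temperature"])


normal = choose_hyperparams(False, demo_sample, demo_pick)
print(counter["calls"])  # one sample drawn on the non-high-risk path
```

Keeping the branch in a single function makes the latency trade-off explicit: the non-high-risk path draws one sample, whereas the high-risk path draws `n_candidates` samples before selection.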

[0089] Therefore, this application proposes an intrinsic security protection method and device for large models based on reinforcement learning, targeting widely used black-box LLMs. Using the decoding hyperparameter space as the carrier, the method constructs an intrinsic security mechanism with "dynamic defense" characteristics by having the reinforcement learning agent dynamically adjust the inference strategy, without modifying the model weight parameters. Its defensive effect is expected to exceed that of static rule filtering by a wide margin, and it provides a highly adaptive, low-overhead defensive deterrent capability.

[0090] The above embodiments of this application have the following beneficial effects: Through the reinforcement learning-based large model intrinsic security protection method of some embodiments of this application, the inference hyperparameter space of LLM is used as the optimization object. By attaching reinforcement learning agents to the black box environment, an adaptive defense chain of "state space construction - policy network evolution - dynamic defense inference" is ingeniously constructed, which can effectively intercept existing jailbreak attack methods and support the formation of network security capabilities such as harmless instructions, compliant output, and high service availability.

[0091] With further reference to Figure 5, as an implementation of the methods shown in the above figures, this disclosure provides some embodiments of a reinforcement learning-based large model intrinsic security protection device. These device embodiments correspond to the method embodiments shown in Figure 1, and the reinforcement learning-based large model intrinsic security protection device can be applied to various electronic devices.

[0092] As shown in Figure 5, the reinforcement learning-based large model intrinsic security protection device 500 of some embodiments includes: a construction unit 501, a training unit 502, and a dynamic defense inference unit 503. The construction unit 501 is configured to construct a reinforcement learning state space; the training unit 502 is configured to train an intrinsic security protection large model based on the aforementioned state space; and the dynamic defense inference unit 503 is configured to perform dynamic defense inference processing on user request information based on the aforementioned intrinsic security protection large model.

[0093] It can be understood that the units described in the reinforcement learning-based large model intrinsic security protection device 500 correspond to the steps of the method described with reference to Figure 1. Therefore, the operations, features, and beneficial effects described above for the method are also applicable to the reinforcement learning-based large model intrinsic security protection device 500 and the units contained therein, and will not be repeated here.

[0094] This application also provides a computer device 600. As shown in Figure 6, the computer device 600 includes: a bus 601, a processor 602, a memory 603, and a communication interface 604. The processor 602, the memory 603, and the communication interface 604 communicate via the bus 601. The computer device 600 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computer device 600.

[0095] The bus 601 can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, the bus 601 is shown as a single line in Figure 6, but this does not mean that there is only one bus or one type of bus. The bus 601 may include a path for transmitting information between the components of the computer device 600 (e.g., the memory 603, the processor 602, and the communication interface 604).

[0096] Processor 602 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0097] Memory 603 may include volatile memory, such as random access memory (RAM). Memory 603 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0098] The memory 603 stores executable program code, which the processor 602 executes to implement the functions of the aforementioned construction unit, training unit, and dynamic defense inference unit, thereby realizing the aforementioned reinforcement learning-based large model intrinsic security protection method. In other words, the memory 603 stores instructions for executing the aforementioned reinforcement learning-based large model intrinsic security protection method.

[0099] The communication interface 604 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computer device 600 and other devices or communication networks.

[0100] This application also provides a chip, which includes a processor and a data interface. The processor reads instructions stored in the memory through the data interface to execute the above-described reinforcement learning-based large model intrinsic security protection method.

[0101] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that a computing device can access, or a data storage device, such as a data center, that contains one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute the aforementioned reinforcement learning-based large model intrinsic security protection method.

[0102] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0103] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of this application.

Claims

1. A method for intrinsic security protection of large models based on reinforcement learning, characterized in that it comprises: constructing a state space for reinforcement learning; training an intrinsic security protection model based on the state space; and performing dynamic defense reasoning processing on user request information based on the intrinsic security protection model.

2. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 1, characterized in that the state space includes: a reinforcement learning environment, a multi-dimensional state vector, and an interaction pool; and constructing the state space for reinforcement learning comprises: wrapping the LLM API into the reinforcement learning environment; defining the multi-dimensional state vector; and initializing the interaction pool based on the multi-dimensional state vector.

3. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 1, characterized in that training the intrinsic security protection model based on the state space comprises: obtaining a training sample set, wherein the training samples in the training sample set include sample state vectors; determining an initial intrinsic security protection model, which includes an initial policy network and an initial value network; selecting a target training sample from the training sample set; inputting the sample state vector included in the selected target training sample into the initial policy network to obtain an initial mean vector and an initial standard deviation vector; inputting the sample state vector into the initial value network to obtain an expected state value scalar; and, in response to determining that the initial policy network does not meet a preset convergence condition, adjusting the network parameters of the initial intrinsic security protection model based on a reverse security reward function, the initial mean vector, the initial standard deviation vector, and the expected state value scalar.

4. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 3, characterized in that the method further comprises: in response to determining that the initial policy network satisfies the preset convergence condition, determining the initial intrinsic security protection model as the intrinsic security protection model.

5. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 1, characterized in that performing dynamic defense reasoning processing on user request information based on the intrinsic security protection model comprises: performing feature extraction and state concatenation on the user request information to generate a multi-dimensional user state vector; inputting the multi-dimensional user state vector into the intrinsic security protection model to obtain a mean vector and a standard deviation vector; in response to determining that the user request information meets preset high-risk request conditions, performing reparameterized sampling on the mean vector and the standard deviation vector to generate an original action vector; mapping the original action vector to generate a hyperparameter information set; and selecting, based on a security scorer, security heterogeneous hyperparameter information from the hyperparameter information set.

6. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 5, characterized in that selecting, based on the security scorer, security heterogeneous hyperparameter information from the hyperparameter information set comprises: for each piece of hyperparameter information in the hyperparameter information set, inputting the hyperparameter information into the security scorer to obtain a security score; and determining the hyperparameter information in the hyperparameter information set that corresponds to the largest of the obtained security scores as the security heterogeneous hyperparameter information.

7. The method for protecting the intrinsic security of large models based on reinforcement learning according to claim 5, characterized in that the security scorer is: $S = \sigma\left(w_1 S_{\mathrm{ref}} - w_2 S_{\mathrm{tox}} - w_3 S_{\mathrm{ppl}}\right)$, where $S$ denotes the security score, $\sigma(\cdot)$ denotes the Sigmoid activation function, $w_1$, $w_2$, and $w_3$ denote the weighting coefficients, $S_{\mathrm{ref}}$ denotes the rejection response matching score, $S_{\mathrm{tox}}$ denotes the content toxicity score, and $S_{\mathrm{ppl}}$ denotes the perplexity score.

8. A large model intrinsic security protection device based on reinforcement learning, characterized in that it comprises: a construction unit configured to construct a state space for reinforcement learning; a training unit configured to train an intrinsic security protection large model based on the state space; and a dynamic defense inference unit configured to perform dynamic defense inference processing on user request information based on the intrinsic security protection large model.

9. A computer device, characterized in that, The computer device includes a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the method as described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1-7.