Large model security defense method and system based on hidden state geometric separability
By adaptively selecting key segments in a large language model and calibrating the security direction vector using Mahalanobis distance, the problems of low detection accuracy and high computational cost in existing technologies are solved, achieving efficient and accurate jailbreak attack detection that is adaptable to different language model structures.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAQIAO UNIVERSITY
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-16
AI Technical Summary
Existing jailbreak attack detection methods for large language models rely on external training data or input/output filtering, resulting in low detection accuracy, weak generalization ability, and high computational cost, making it difficult to quickly adapt to language models with different structures.
By using a method based on the geometric separation of hidden states, key segments are adaptively selected, the security direction vector is calibrated using Mahalanobis distance, and the detection threshold is adaptively calibrated according to the conflict intensity distribution, thereby directly detecting jailbreak attacks within the model.
It achieves high-precision, low-overhead jailbreak attack detection, significantly improves detection robustness and generalization ability, reduces computational overhead and false alarm rate, and adapts to the security needs of different application scenarios.
Figure CN121935925B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a method and system for large-scale model security defense based on the geometric separation of hidden states.
Background Technology
[0002] Large Language Models (LLMs) have transformed human-computer interaction and are widely applied in industries such as healthcare, finance, and education. However, because LLMs are trained on large amounts of text, the training data inevitably contains harmful information such as hate speech and content about illegal manufacturing. Attackers can bypass a model's security mechanisms through carefully designed prompts and cause the model to output harmful content; this type of attack is known as a jailbreak attack. To address this threat, researchers have proposed various security defense methods, including rule-based filtering, security fine-tuning, and input/output monitoring. However, existing methods mainly rely on external training data or on filtering of inputs and outputs, and do not fully exploit the geometric characteristics of the model's internal hidden state space. This results in low detection accuracy and weak generalization ability, making it difficult to defend effectively against increasingly complex jailbreak attacks. Chinese invention patent application CN121119036A discloses a method and apparatus for protecting and defending language models based on dynamic adjustment. That method trains a security offset vector on a specified non-secure layer of a large language model and enhances the robustness of the model by dynamically adjusting the hidden states of the specified layer, thereby achieving security protection and defense. However, it relies on backpropagation to calculate gradients in order to identify sensitive layers, and it requires iterative gradient-descent training with a complex loss function to optimize the security offset vector. This results in extremely high computational overhead for defense deployment and makes it difficult to adapt quickly to language models with different structures.
Therefore, a jailbreak attack detection method is needed that does not rely on additional training and can make deep use of the model's internal representation characteristics to improve the security of large language models.
Summary of the Invention
[0003] The purpose of this invention is to solve the above problems in the prior art, namely low detection accuracy, weak generalization ability, and high computational overhead.
[0004] The technical solution adopted by this invention to solve its technical problem is: to provide a large model security defense method based on the geometric separation of hidden states, including the following steps:
[0005] Based on the differences in the distribution of safe and risky queries in the hidden state space of a large language model, the key segment with the highest discriminative power is adaptively selected.
[0006] Based on the hidden state of key segments, the safety direction vector is calibrated using Mahalanobis distance;
[0007] The detection threshold is adaptively calibrated based on the conflict intensity distribution between security queries and risk queries;
[0008] Calculate the conflict intensity of the input query on the security direction vector, compare it with the detection threshold, and determine whether it is a jailbreak attack.
[0009] Preferably, the adaptive selection of the key segment with the highest discriminative power includes the following steps:
[0010] For each candidate segment of the large language model, the Mahalanobis distance between the hidden states of the safe query set and the risk query set in that segment is calculated as the discriminability of the segment.
[0011] Select the continuous segment with the highest discriminability as the key segment;
[0012] The length n of the key segment is determined based on the total number of layers in the model: when the total number of layers is ≤12, n=3; when the total number of layers is >12 and ≤24, n=4; when the total number of layers is >24, n=max(6, total number of layers / 8), where max represents the maximum value function.
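The length rule above can be sketched as a small function. This is a minimal illustration, not the claimed implementation: the function name `key_segment_length` is hypothetical, and because the text does not state how the quotient (total number of layers / 8) is rounded, floor division is assumed here.

```python
def key_segment_length(num_layers: int) -> int:
    """Return the key-segment length n for a model with `num_layers` layers."""
    if num_layers <= 0:
        raise ValueError("num_layers must be a positive integer")
    if num_layers <= 12:
        return 3
    if num_layers <= 24:
        return 4
    # Assumption: the source does not specify rounding for num_layers / 8,
    # so floor division is used in this sketch.
    return max(6, num_layers // 8)
```

Under this assumption, a 32-layer model gives n=6 (since 32 // 8 = 4 is below the floor of 6), while an 80-layer model gives n=10.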
[0013] Preferably, the calibration of the safety direction vector using Mahalanobis distance specifically involves:
[0014] For each layer k in the selected critical segment K, calculate the safety direction vector, expressed as:
[0015] $$d_k = \Sigma_k^{-1}\left(\mu_k^{R} - \mu_k^{S}\right);$$
[0016] where $d_k$ represents the safety direction vector of the k-th layer; $\mu_k^{S}$ is the average hidden state vector of the safe queries at layer k; $\mu_k^{R}$ is the average hidden state vector of the risk queries at layer k; and $\Sigma_k^{-1}$ is the inverse of the covariance matrix of the hidden state distribution of the safe queries.
[0017] Preferably, the adaptive calibration detection threshold includes the following steps:
[0018] Calculate the conflict intensity of each query in the safe query set, denoted as the set $C_S$;
[0019] Calculate the conflict intensity of each query in the risk query set, denoted as the set $C_R$;
[0020] If $C_S$ and $C_R$ do not overlap, the threshold τ is expressed as:
[0021] $$\tau = \frac{P_{100}^{S} + P_{0}^{R}}{2};$$
[0022] where $P_{100}^{S}$ is the 100th percentile of the safe cases and $P_{0}^{R}$ is the 0th percentile of the risk cases;
[0023] If $C_S$ and $C_R$ overlap, the threshold τ is the α quantile $P_{\alpha}^{S}$ of the safe query set $C_S$, where α is a preset threshold parameter that controls the proportion of safe queries that are allowed to be misclassified as risky queries.
[0024] Preferably, the formula for calculating the conflict intensity is:
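[0025] $$\mathrm{conflict}_k(q) = \operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right);$$
[0026] $$\mathrm{conflict}(q) = \frac{1}{|K|}\sum_{k \in K} \mathrm{conflict}_k(q);$$
[0027] where $h_k(q)$ is the hidden state vector of query q at layer k; $\hat{d}_k$ is the unit vector obtained by normalizing the safety direction vector of layer k; K is the set of key layers; |K| represents the number of key layers, which is numerically equal to n; and cos sim represents the cosine similarity.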
[0028] Preferably, the determination of whether it is a jailbreak attack specifically involves:
[0029] If the conflict intensity conflict(q) of the input query q is greater than the threshold τ, it is determined to be a jailbreak attack and the response is rejected.
[0030] If the conflict intensity conflict(q) of the input query q is less than or equal to the threshold τ, it is determined to be a safe query and a normal response is given.
[0031] This invention also provides a large-model security defense system based on hidden state geometric separation, comprising:
[0032] The key segment selection module adaptively selects the key segment with the highest discriminative power based on the distribution differences between safe queries and risky queries in the hidden state space of a large language model.
[0033] The safety direction calibration module calibrates the safety direction vector based on the hidden states of the key segment using Mahalanobis distance.
[0034] The threshold calibration module adaptively calibrates the detection threshold based on the conflict intensity distribution between security queries and risk queries.
[0035] The attack detection module calculates the conflict intensity of the input query on the security direction vector, compares it with the detection threshold, and determines whether it is a jailbreak attack.
[0036] The present invention also provides an electronic device, comprising:
[0037] One or more processors;
[0038] A storage device for storing one or more programs that, when executed by one or more processors, cause the one or more processors to perform any of the methods described above.
[0039] The present invention also provides a computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements any of the methods described above.
[0040] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements any of the methods described above.
[0041] The present invention has the following beneficial effects:
[0042] (1) This invention utilizes the geometric separation characteristics of security query and risk query in the hidden state space of a large language model. By directly extracting the hidden state in the model inference process and using Mahalanobis distance to calibrate the security direction vector, high-precision detection can be achieved without additional training data or model fine-tuning. This operation not only eliminates the expensive backpropagation and model fine-tuning process and achieves efficient deployment, but also uses statistical properties to eliminate noise interference in the feature space and accurately characterizes the geometric boundaries of security and risk distribution. Thus, while ensuring extremely low computational overhead, it significantly improves the robustness and generalization ability of detecting complex jailbreak attacks.
[0043] (2) The critical layer segment selection (CLSS) mechanism proposed in this invention significantly reduces computational overhead and improves detection efficiency by adaptively selecting the critical layer segment with the highest discriminative power.
[0044] (3) The Safety Direction Calibration (SDC) mechanism proposed in this invention uses Mahalanobis distance to calculate the safety direction vector, takes into account the covariance structure of the hidden state space, makes the direction vector closer to the real distribution, and improves the detection accuracy.
[0045] (4) The adaptive threshold calibration (ATC) mechanism proposed in this invention adaptively determines the threshold based on the conflict intensity distribution, balancing detection accuracy and false alarm rate, and adapting to the security requirements of different application scenarios.
[0046] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments, but the present invention is not limited to the embodiments.
Description of the Drawings
[0047] Figure 1 is a flowchart of a large-model security defense method based on the geometric separation of hidden states according to the present invention;
[0048] Figure 2 is a schematic diagram of the structure of a large-model security defense system based on the geometric separation of hidden states according to the present invention;
[0049] Figure 3 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0050] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0051] As shown in Figure 1, an embodiment of the present invention provides a large model security defense method based on the geometric separation of hidden states, comprising the following steps:
[0052] S101, based on the distribution differences of safe queries and risky queries in the hidden state space of a large language model, adaptively selects the key segment with the highest discriminative power;
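[0053] Specifically, this is achieved through the Critical Layer Segment Selection (CLSS) mechanism. First, a safe example set S={s_1,s_2,...,s_m} and a risk example set R={r_1,r_2,...,r_n} are obtained. Then, the discriminability of each layer k of the model between safe and risky queries is calculated, expressed using the Mahalanobis distance:
[0054] $$D_k = \sqrt{\left(\mu_k^{S} - \mu_k^{R}\right)^{\top} \Sigma_k^{-1} \left(\mu_k^{S} - \mu_k^{R}\right)};$$
[0055] where $\Sigma_k$ represents the covariance matrix of the safe-query hidden states at layer k, calculated using the Ledoit-Wolf estimator:
[0056] $$\Sigma_k = (1-\lambda)\,\hat{\Sigma}_k + \lambda\,\frac{\operatorname{tr}(\hat{\Sigma}_k)}{p}\,I, \qquad \hat{\Sigma}_k = \frac{1}{|S|}\sum_{s \in S}\left(h_k(s) - \mu_k^{S}\right)\left(h_k(s) - \mu_k^{S}\right)^{\top};$$
[0057] where $\hat{\Sigma}_k$ is the sample covariance matrix, p is the hidden state dimension, I is the identity matrix, λ ∈ [0, 1] is the shrinkage coefficient determined by the Ledoit-Wolf estimator, $h_k(s)$ is the hidden state vector of query s at layer k, and $\mu_k^{S}$, the average hidden state vector of the safe queries at layer k, is represented as:
[0058] $$\mu_k^{S} = \frac{1}{|S|}\sum_{s \in S} h_k(s);$$
[0059] The average hidden state vector $\mu_k^{R}$ of the risk queries at layer k is represented as:
[0060] $$\mu_k^{R} = \frac{1}{|R|}\sum_{r \in R} h_k(r);$$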
[0061] Finally, the length n of the key segment is determined based on the total number of layers in the model, and the n consecutive layers with the highest discrimination are selected as the key segment K.
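The CLSS step can be illustrated with the following minimal NumPy sketch. Several points are assumptions rather than details confirmed by the text: hidden states are assumed to be stacked as arrays of shape (layers, samples, dimension); a fixed shrinkage coefficient stands in for the data-driven Ledoit-Wolf coefficient (a library estimator such as scikit-learn's `LedoitWolf` could be substituted); and the discriminability of a segment is assumed to be the sum of its per-layer Mahalanobis distances. The names `shrunk_covariance`, `layer_mahalanobis`, and `select_key_segment` are hypothetical.

```python
import numpy as np


def shrunk_covariance(x: np.ndarray, shrinkage: float = 0.1) -> np.ndarray:
    """Shrink the sample covariance of x (samples x dim) toward a scaled identity.

    Assumption: a fixed shrinkage value replaces the Ledoit-Wolf coefficient.
    """
    centered = x - x.mean(axis=0)
    sample_cov = centered.T @ centered / x.shape[0]
    dim = x.shape[1]
    target = (np.trace(sample_cov) / dim) * np.eye(dim)
    return (1.0 - shrinkage) * sample_cov + shrinkage * target


def layer_mahalanobis(safe: np.ndarray, risk: np.ndarray, shrinkage: float = 0.1) -> float:
    """Mahalanobis distance between safe and risk means under the safe covariance."""
    diff = safe.mean(axis=0) - risk.mean(axis=0)
    cov = shrunk_covariance(safe, shrinkage)
    # solve() avoids forming the explicit inverse of the covariance matrix.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def select_key_segment(safe_states: np.ndarray, risk_states: np.ndarray, n: int):
    """Return (layer indices of the best n consecutive layers, per-layer scores).

    safe_states / risk_states: arrays of shape (layers, samples, dim).
    """
    num_layers = safe_states.shape[0]
    if not 1 <= n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    scores = np.array(
        [layer_mahalanobis(safe_states[k], risk_states[k]) for k in range(num_layers)]
    )
    # Assumption: a segment's score is the sum of its layers' scores.
    window_scores = np.convolve(scores, np.ones(n), mode="valid")
    start = int(np.argmax(window_scores))
    return list(range(start, start + n)), scores
```

Because every window has the same length n, ranking windows by the sum of their layer scores is equivalent to ranking them by the mean.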
[0062] S102, based on the hidden state of the key layer, calibrate the safety direction vector using Mahalanobis distance;
[0063] Specifically, this is achieved through the Safety Direction Calibration (SDC) mechanism. For each layer k in the selected key segment K, the safety direction vector is calculated, expressed as:
[0064] $$d_k = \Sigma_k^{-1}\left(\mu_k^{R} - \mu_k^{S}\right);$$
[0065] Normalization yields a unit vector, represented as:
[0066] $$\hat{d}_k = \frac{d_k}{\lVert d_k \rVert};$$
[0067] The safety direction vector represents the geometric separation direction between safe and risky queries in the hidden state space. The difference of means is oriented from the safe mean toward the risk mean, so that risk queries obtain the larger conflict intensity. Because the difference is weighted by the inverse covariance matrix used in the Mahalanobis distance, the direction takes the covariance structure of the hidden state space into account: it is the direction along which the two means are separated most strongly relative to the spread of the safe queries, which improves detection accuracy. By projecting a query onto this direction, safe and risky queries can be effectively distinguished.
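As an illustration of the SDC step for a single layer, the following sketch computes a covariance-weighted direction and normalizes it. It is a sketch under stated assumptions: the direction is assumed to point from the safe mean toward the risk mean (so that risky queries project higher), a fixed shrinkage coefficient is used in place of the Ledoit-Wolf coefficient, and the function name `safety_direction` is hypothetical.

```python
import numpy as np


def safety_direction(safe: np.ndarray, risk: np.ndarray, shrinkage: float = 0.1) -> np.ndarray:
    """Return the unit safety direction for one layer.

    safe / risk: hidden states of shape (samples, dim) at that layer.
    """
    mu_safe = safe.mean(axis=0)
    mu_risk = risk.mean(axis=0)
    centered = safe - mu_safe
    cov = centered.T @ centered / safe.shape[0]
    dim = safe.shape[1]
    # Assumption: fixed shrinkage toward a scaled identity keeps cov invertible.
    cov = (1.0 - shrinkage) * cov + shrinkage * (np.trace(cov) / dim) * np.eye(dim)
    # Assumption: orientation from the safe mean toward the risk mean.
    direction = np.linalg.solve(cov, mu_risk - mu_safe)
    return direction / np.linalg.norm(direction)
```

Since the shrunk covariance is positive definite, the risk mean always projects higher than the safe mean on the returned direction, whatever the data.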
[0068] S103, Adaptively calibrate the detection threshold based on the conflict intensity distribution between security queries and risk queries;
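[0069] Specifically, this is achieved through an Adaptive Threshold Calibration (ATC) mechanism. First, the conflict intensity of each query in the safe query set and in the risk query set is calculated.
[0070] For a set of safe queries S and a set of risky queries R, the conflict intensity of a query q at key layer k is defined as:
[0071] $$\mathrm{conflict}_k(q) = \operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right);$$
[0072] where $\operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right)$ represents the cosine similarity between the hidden state vector $h_k(q)$ of query q at key layer k and the normalized safety direction vector $\hat{d}_k$, calculated using the following formula:
[0073] $$\operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right) = \frac{h_k(q) \cdot \hat{d}_k}{\lVert h_k(q) \rVert \, \lVert \hat{d}_k \rVert};$$
[0074] where $\lVert \cdot \rVert$ represents the magnitude of a vector;
[0075] The multi-layer average conflict intensity is defined as follows:
[0076] $$\mathrm{conflict}(q) = \frac{1}{|K|}\sum_{k \in K} \mathrm{conflict}_k(q);$$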
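A minimal sketch of the conflict intensity follows, assuming it is the cosine similarity between a query's hidden state and the safety direction at each key layer, averaged over the key segment. The mapping-based interface (layer index to vector) and the names `cosine_similarity` and `conflict_intensity` are illustrative choices, not part of the described method.

```python
from typing import Mapping

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(a @ b / denom)


def conflict_intensity(
    hidden_states: Mapping[int, np.ndarray],
    directions: Mapping[int, np.ndarray],
) -> float:
    """Average per-layer cosine similarity over the key layers.

    hidden_states: layer index -> hidden state vector of the query.
    directions:    layer index -> safety direction vector (key layers only).
    """
    per_layer = [cosine_similarity(hidden_states[k], directions[k]) for k in directions]
    return float(np.mean(per_layer))
```

Because cosine similarity is scale-invariant, the result is the same whether the directions passed in are normalized or not.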
[0077] Theoretically, the conflict intensity conflict(s) of a safe query s is usually concentrated in the lower value range, while the conflict intensity conflict(r) of a risk query r is concentrated in the higher value range. By analyzing the distribution of conflict intensity, a threshold that effectively distinguishes the two types of queries can be determined. The ATC mechanism is implemented as follows. First, a number of safe and risky query samples are collected and their conflict intensities are calculated, giving the sets $C_S$ and $C_R$. Then, if the conflict values of the safe examples and the risk examples do not overlap, the detection threshold τ is determined as:
[0078] $$\tau = \frac{P_{100}^{S} + P_{0}^{R}}{2};$$
[0079] where $P_{100}^{S}$ is the 100th percentile of the safe examples:
[0080] $$P_{100}^{S} = \mathrm{quantile}(C_S, 1);$$
[0081] where the quantile function is defined such that quantile(..., 1) represents the 100th percentile, which is the maximum value in the dataset; and $P_{0}^{R}$ is the 0th percentile of the risk examples:
[0082] $$P_{0}^{R} = \mathrm{quantile}(C_R, 0);$$
[0083] If the conflict values of the safe examples and the risk examples overlap, the following threshold formula is used:
[0084] $$\tau = P_{\alpha}^{S};$$
[0085] where $P_{\alpha}^{S}$ is the α% quantile of the safe examples:
[0086] $$P_{\alpha}^{S} = \mathrm{quantile}(C_S, \alpha/100);$$
[0087] Here, α is a threshold parameter that controls the proportion of safe queries allowed to be falsely identified as risky queries. This dynamic threshold setting adapts to the characteristics of different LLMs and to different security requirements, improving detection accuracy while controlling the false positive rate. For example, if α=95, the threshold is the value at the 95th percentile of the safe-query set $C_S$, so roughly 5% of the safe calibration queries exceed it.
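The two calibration cases can be sketched as follows. This is an illustrative sketch rather than the definitive procedure: it assumes that, in the non-overlapping case, τ is placed midway between the largest safe and the smallest risk conflict intensity, that touching ranges count as overlapping, and that α is given on a 0 to 100 scale. The function name `calibrate_threshold` is hypothetical.

```python
import numpy as np


def calibrate_threshold(safe_conflicts, risk_conflicts, alpha: float = 95.0) -> float:
    """Return the detection threshold tau from calibration conflict intensities."""
    safe = np.asarray(safe_conflicts, dtype=float)
    risk = np.asarray(risk_conflicts, dtype=float)
    p100_safe = np.quantile(safe, 1.0)  # maximum safe conflict intensity
    p0_risk = np.quantile(risk, 0.0)    # minimum risk conflict intensity
    if p100_safe < p0_risk:
        # No overlap. Assumption: tau is the midpoint of the gap.
        return float((p100_safe + p0_risk) / 2.0)
    # Overlap: fall back to the alpha-th percentile of the safe set.
    return float(np.quantile(safe, alpha / 100.0))
```

With α=95 in the overlapping case, about 5% of the safe calibration queries lie above τ and would be flagged, which is the trade-off the parameter controls.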
[0088] S104, calculate the conflict intensity of the input query on the security direction vector, compare it with the detection threshold, and determine whether it is a jailbreak attack.
[0089] Specifically, this is achieved through the Jailbreak Attack Detection (JAD) module. The JAD module performs jailbreak attack detection on user queries based on the key segment, safety direction vectors, and threshold produced by the three core components (CLSS, SDC, and ATC). For an input query q, the JAD module first obtains its hidden state vectors at the key segment K. It then calculates the conflict intensity of the query on the safety direction vectors d_k. Finally, the conflict intensity is compared with the threshold τ obtained from ATC: if the conflict intensity is greater than τ, the query is determined to be a jailbreak attack and the model refuses to respond; otherwise, it is determined to be a safe query and a normal response is given.
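The decision step can be sketched as a small detector object that holds the calibrated directions and threshold. This is a hypothetical illustration: the class `JailbreakDetector`, the helper `respond`, and the refusal string are assumptions, the hidden states are assumed to be supplied by the caller as a mapping from layer index to vector, and the conflict intensity is assumed to be the mean cosine similarity over the key layers.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

import numpy as np


@dataclass(frozen=True)
class JailbreakDetector:
    directions: Mapping[int, np.ndarray]  # key layer -> safety direction vector
    threshold: float                      # calibrated threshold tau

    def conflict(self, hidden_states: Mapping[int, np.ndarray]) -> float:
        """Mean cosine similarity between hidden states and directions."""
        values = []
        for layer, direction in self.directions.items():
            h = hidden_states[layer]
            denom = np.linalg.norm(h) * np.linalg.norm(direction)
            values.append(float(h @ direction / denom) if denom else 0.0)
        return float(np.mean(values))

    def is_jailbreak(self, hidden_states: Mapping[int, np.ndarray]) -> bool:
        # Strictly greater than tau means attack; equal to tau counts as safe.
        return self.conflict(hidden_states) > self.threshold


def respond(detector: JailbreakDetector,
            hidden_states: Mapping[int, np.ndarray],
            generate: Callable[[], str]) -> str:
    """Refuse flagged queries; otherwise call the normal generation function."""
    if detector.is_jailbreak(hidden_states):
        return "REFUSED"  # placeholder refusal message (assumption)
    return generate()
```

Keeping the detector separate from generation means the check needs only the hidden states of the key layers, so it can run before any response tokens are produced.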
[0090] This invention proposes a training-free geometric analysis mechanism that directly extracts hidden states during model inference and uses Mahalanobis distance to calibrate the security direction vector. This not only eliminates the expensive backpropagation and model fine-tuning processes, achieving efficient deployment, but also utilizes statistical properties to eliminate noise interference in the feature space, accurately characterizing the geometric boundaries of security and risk distributions. Thus, while ensuring extremely low computational overhead, it significantly improves the robustness and generalization ability to detect complex jailbreak attacks.
[0091] To verify the effectiveness of the embodiments of the present invention, experiments were conducted on four open-source large language models: Mistral-7B, DeepSeek-R1-7B, Qwen2.5-7B, and Vicuna-7B. The experimental dataset included 300 safe questions (from the Google NaturalQuestions dataset), 300 direct attack samples (from the Hugging Face prompt-injection dataset), and 300 indirect attack samples (from the JailBench benchmark set). The evaluation metrics were false positive rate (FPR) and false negative rate (FNR).
[0092] Three types of questions were designed to evaluate the defense capabilities of the large-model security defense method and system based on the geometric separability of hidden states (ActivationGuard). The 300 safe questions were sourced from Google's NaturalQuestions dataset; the 300 direct attack samples were sourced from Hugging Face's prompt-injection dataset and contain explicit risk queries; and the 300 indirect attack samples were sourced from the JailBench benchmark set and include advanced attack forms such as role-playing and indirect injection. The safe and risk examples used to determine the key layers, safety direction vectors, and thresholds were generated by Qwen3. The four selected open-source large language models (Mistral-7B, DeepSeek-R1-7B, Qwen2.5-7B, and Vicuna-7B) had all undergone a certain degree of security alignment, which allows the generalization ability of ActivationGuard to be evaluated. False Positive Rate (FPR) and False Negative Rate (FNR) were used as evaluation criteria. FPR, the proportion of safe queries misclassified as risky queries, measures how much the system interferes with normal queries; FNR, the proportion of risky queries misclassified as safe queries, measures the effectiveness of the defense. The experimental results are shown in Table 1. On Mistral-7B, ActivationGuard reduced the FNR of direct attacks from 98.34% to 7.00% and reduced the FNR of indirect attacks to 0.33%, successfully intercepting nearly all indirect attacks. On DeepSeek-R1-7B, ActivationGuard reduced the FNR of indirect attacks from 96.67% to 26.00%, indicating that ActivationGuard generalizes across different language models. Across all tested models, the average FPR of ActivationGuard was 2.75%, so its interference with safe queries and its impact on model performance remained within acceptable limits. These experiments show that ActivationGuard achieves high detection rates on all tested models with minimal impact on model performance.
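For reference, the two evaluation metrics can be computed from per-query detector decisions as in the sketch below. The function names and the boolean-flag representation (True means the query was flagged as a jailbreak attack) are assumptions made for illustration; the flags in the usage note are hypothetical and are not the experiment's raw data.

```python
from typing import Sequence


def false_positive_rate(safe_flags: Sequence[bool]) -> float:
    """Fraction of safe queries that were flagged as attacks."""
    if not safe_flags:
        raise ValueError("safe_flags must not be empty")
    return sum(1 for flagged in safe_flags if flagged) / len(safe_flags)


def false_negative_rate(attack_flags: Sequence[bool]) -> float:
    """Fraction of attack queries that were NOT flagged."""
    if not attack_flags:
        raise ValueError("attack_flags must not be empty")
    return sum(1 for flagged in attack_flags if not flagged) / len(attack_flags)
```

As a worked example with 300 samples per category, 21 missed attacks out of 300 would give an FNR of 21/300 = 7.00%, and 6 flagged safe queries out of 300 would give an FPR of 2.00%.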
[0093] Table 1 - Model Generalization Ability Test Results:
[0094]
[0095] As can be seen, this invention provides a training-free jailbreak attack detection method based on the geometric characteristics of the hidden state space. This method analyzes the separability of secure and risky queries in the hidden state space and combines three core components—CLSS, SDC, and ATC—to achieve high-precision, low-overhead jailbreak attack detection, demonstrating excellent generalization ability on multiple large-scale language models.
[0096] As shown in Figure 2, an embodiment of the present invention provides a large-model security defense system based on the geometric separation of hidden states, corresponding to the method embodiment illustrated in Figure 1. The system can be applied to various electronic devices and includes:
[0097] The key segment selection module 201 adaptively selects the key segment with the highest discriminative power based on the distribution differences between safe queries and risk queries in the hidden state space of a large language model.
[0098] The safety direction calibration module 202 calibrates the safety direction vector based on the hidden states of the key segment using Mahalanobis distance.
[0099] The threshold calibration module 203 adaptively calibrates the detection threshold based on the conflict intensity distribution between security queries and risk queries.
[0100] The attack detection module 204 calculates the conflict intensity of the input query on the security direction vector, compares it with the detection threshold, and determines whether it is a jailbreak attack.
[0101] Figure 3 is a schematic diagram of the hardware structure of the electronic device provided in this embodiment of the invention. As shown in Figure 3, the electronic device includes a processor 301 and a memory 302, wherein the memory 302 is used to store computer-executable instructions, and the processor 301 is used to execute the computer-executable instructions stored in the memory to implement the steps performed by the electronic device in the above embodiments. For details, refer to the relevant descriptions in the foregoing method embodiments.
[0102] Alternatively, the memory 302 can be either standalone or integrated with the processor 301.
[0103] When the memory 302 is set up independently, the electronic device also includes a bus 303 for connecting the memory 302 and the processor 301.
[0104] This invention also provides a computer storage medium storing computer execution instructions, which, when executed by a processor, implement the method described above.
[0105] This invention also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0106] In the embodiments provided by this invention, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.
[0107] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to implement the solution of this embodiment according to actual needs.
[0108] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each module can exist physically separately, or two or more modules can be integrated into one unit. The unit composed of the above modules can be implemented in hardware or in the form of hardware plus software functional units.
[0109] The integrated modules described above, implemented as software functional modules, can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods of the various embodiments of this application.
[0110] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.
[0111] The memory may include high-speed RAM, and may also include non-volatile storage (NVM), such as at least one disk storage device, and may also be a USB flash drive, external hard drive, read-only memory, disk or optical disc, etc.
[0112] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0113] The aforementioned storage medium can be implemented from any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The storage medium can be any available medium accessible to general-purpose or special-purpose computers.
[0114] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. Both the processor and the storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device or host device.
[0115] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0116] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A large-model security defense method based on the geometric separation of hidden states, characterized in that it includes the following steps: based on the differences in the distribution of safe and risky queries in the hidden state space of a large language model, the key segment with the highest discriminative power is adaptively selected; based on the hidden states of the key segment, the safety direction vector is calibrated using Mahalanobis distance; the detection threshold is adaptively calibrated based on the conflict intensity distribution of safe queries and risk queries; the conflict intensity of the input query on the safety direction vector is calculated and compared with the detection threshold to determine whether the query is a jailbreak attack.
The adaptive calibration of the detection threshold includes the following steps: calculate the conflict intensity of each query in the safe query set, denoted as the set $C_S$; calculate the conflict intensity of each query in the risk query set, denoted as the set $C_R$; if $C_S$ and $C_R$ do not overlap, the threshold τ is expressed as: $$\tau = \frac{P_{100}^{S} + P_{0}^{R}}{2};$$ where $P_{100}^{S}$ is the 100th percentile of the safe cases and $P_{0}^{R}$ is the 0th percentile of the risk cases; if $C_S$ and $C_R$ overlap, the threshold τ is the α quantile $P_{\alpha}^{S}$ of the safe query set $C_S$, where α is a preset threshold parameter that controls the proportion of safe queries that are allowed to be misjudged as risky queries.
The formula for calculating the conflict intensity is: $$\mathrm{conflict}_k(q) = \operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right);$$ $$\mathrm{conflict}(q) = \frac{1}{|K|}\sum_{k \in K} \mathrm{conflict}_k(q);$$ where $h_k(q)$ is the hidden state vector of query q at layer k; $\hat{d}_k$ is the unit vector obtained by normalizing the safety direction vector of the k-th layer; K is the set of key layers; |K| indicates the number of key layers and is numerically equal to n; and cos sim represents the cosine similarity.
2. The large-model security defense method based on hidden state geometric separation according to claim 1, characterized in that the adaptive selection of the key segment with the highest discriminative power includes the following steps: for each candidate segment of the large language model, the Mahalanobis distance between the hidden states of the safe query set and the risk query set in that segment is calculated as the discriminability of the segment; the continuous segment with the highest discriminability is selected as the key segment; the length n of the key segment is determined based on the total number of layers in the model: when the total number of layers is ≤12, n=3; when the total number of layers is >12 and ≤24, n=4; when the total number of layers is >24, n=max(6, total number of layers / 8), where max represents the maximum value function.
3. The large-model security defense method based on hidden state geometric separation according to claim 2, characterized in that the calibration of the safety direction vector using Mahalanobis distance specifically involves: for each layer k in the selected key segment K, calculating the safety direction vector, expressed as: $$d_k = \Sigma_k^{-1}\left(\mu_k^{R} - \mu_k^{S}\right);$$ where $d_k$ represents the safety direction vector of the k-th layer; $\mu_k^{S}$ is the average hidden state vector of the safe queries at layer k; $\mu_k^{R}$ is the average hidden state vector of the risk queries at layer k; and $\Sigma_k^{-1}$ is the inverse of the covariance matrix of the hidden state distribution of the safe queries.
4. The large-model security defense method based on hidden state geometric separation according to claim 1, characterized in that the determination of whether it is a jailbreak attack is as follows: if the conflict intensity conflict(q) of the input query q is greater than the threshold τ, it is judged to be a jailbreak attack and the response is refused; if the conflict intensity conflict(q) of the input query q is less than or equal to the threshold τ, it is judged to be a safe query and a normal response is given.
5. A large-model security defense system based on hidden state geometric separation, characterized in that it includes: a key segment selection module, which adaptively selects the key segment with the highest discriminative power based on the distribution differences between safe queries and risky queries in the hidden state space of a large language model; a safety direction calibration module, which calibrates the safety direction vector based on the hidden states of the key segment using Mahalanobis distance; a threshold calibration module, which adaptively calibrates the detection threshold based on the conflict intensity distribution of safe queries and risk queries; and an attack detection module, which calculates the conflict intensity of the input query on the safety direction vector and compares it with the detection threshold to determine whether the query is a jailbreak attack.
The adaptive calibration of the detection threshold includes the following steps: calculate the conflict intensity of each query in the safe query set, denoted as the set $C_S$; calculate the conflict intensity of each query in the risk query set, denoted as the set $C_R$; if $C_S$ and $C_R$ do not overlap, the threshold τ is expressed as: $$\tau = \frac{P_{100}^{S} + P_{0}^{R}}{2};$$ where $P_{100}^{S}$ is the 100th percentile of the safe cases and $P_{0}^{R}$ is the 0th percentile of the risk cases; if $C_S$ and $C_R$ overlap, the threshold τ is the α quantile $P_{\alpha}^{S}$ of the safe query set $C_S$, where α is a preset threshold parameter that controls the proportion of safe queries that are allowed to be misjudged as risky queries.
The formula for calculating the conflict intensity is: $$\mathrm{conflict}_k(q) = \operatorname{cos\,sim}\left(h_k(q), \hat{d}_k\right);$$ $$\mathrm{conflict}(q) = \frac{1}{|K|}\sum_{k \in K} \mathrm{conflict}_k(q);$$ where $h_k(q)$ is the hidden state vector of query q at layer k; $\hat{d}_k$ is the unit vector obtained by normalizing the safety direction vector of the k-th layer; K is the set of key layers; |K| indicates the number of key layers and is numerically equal to n; and cos sim represents the cosine similarity.
6. An electronic device, characterized in that it includes: one or more processors; and a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform the method as described in any one of claims 1-4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1-4.
8. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1-4.