A method and device for detecting and defending against jailbreak attacks on large language models based on a sparse autoencoder (SAE)

The method applies a sparse autoencoder (SAE) with a sparsity penalty to a large language model and performs feature selection to detect jailbreak attacks. It targets the high false positive rates and limited defense effectiveness of existing methods, aiming for efficient and accurate jailbreak attack detection.

CN122242510A · Pending · Publication Date: 2026-06-19 · AIR FORCE UNIV PLA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: AIR FORCE UNIV PLA
Filing Date: 2025-07-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing large language model jailbreak attack detection methods have high false positive rates and limited defense effectiveness. They also struggle to accurately distinguish between genuine jailbreak prompts and normal user input, impacting model performance.

Method used

An SAE sparsifies the intermediate activation vectors of the large language model, mapping them into a high-dimensional sparse feature space. The SAE encoder then computes the average feature activation over three datasets: secure rejection, jailbreak success, and security utility. From these averages, a set of security-critical features and a set of insecure-critical features are selected. For an input prompt, the activation frequency of each feature set is computed and compared against a security threshold to decide whether the prompt is a jailbreak attack.
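
The encoding, averaging, and selection steps above can be illustrated with a minimal sketch. It assumes a single-layer ReLU encoder and a margin-based selection rule; the patent summary does not state either, so the names `sae_encode`, `mean_feature_activation`, `select_key_features`, and the `margin` parameter are hypothetical and shown only to make the data flow concrete.

```python
from typing import List, Sequence, Set, Tuple

Vector = List[float]


def sae_encode(x: Sequence[float],
               weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> Vector:
    """Assumed encoder form ReLU(W x + b): one activation vector -> sparse features."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def mean_feature_activation(token_activations: Sequence[Sequence[float]],
                            weights: Sequence[Sequence[float]],
                            bias: Sequence[float]) -> Vector:
    """Average each SAE feature dimension over every word in one dataset."""
    encoded = [sae_encode(x, weights, bias) for x in token_activations]
    return [sum(column) / len(encoded) for column in zip(*encoded)]


def select_key_features(mean_rejection: Vector,
                        mean_jailbreak: Vector,
                        mean_utility: Vector,
                        margin: float = 0.5) -> Tuple[Set[int], Set[int]]:
    """Hypothetical rule (not from the source): a feature is security-critical
    if its mean on the secure rejection data exceeds both other means by
    `margin`, and insecure-critical if its mean on the jailbreak success data
    does. Comparing against the utility mean keeps features that also fire on
    ordinary prompts out of both sets."""
    safe: Set[int] = set()
    unsafe: Set[int] = set()
    for i, (r, j, u) in enumerate(zip(mean_rejection, mean_jailbreak, mean_utility)):
        if r - max(j, u) > margin:
            safe.add(i)
        elif j - max(r, u) > margin:
            unsafe.add(i)
    return safe, unsafe
```

In this sketch the security utility dataset acts as a benign baseline, which is one plausible way to limit false positives; the actual filtering criterion in the application may differ.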

Benefits of technology

Significantly improves the accuracy of jailbreak attack detection, reduces the risk of false positives, maintains the semantic fluency and performance of the model, and dynamically adjusts the security threshold to optimize detection performance.


Figure CN122242510A_ABST

Abstract

This application discloses a method and apparatus for detecting and defending against jailbreak attacks on large language models based on a sparse autoencoder (SAE), relating to the field of network security technology. The method includes: deploying an SAE in a large language model; for each word in the secure rejection dataset, the jailbreak success dataset, and the security utility dataset, calculating its activation value in each feature dimension using the SAE encoder; for each dataset, averaging the activation values in each feature dimension across all words in that dataset to obtain a set of average feature activation values; filtering these sets of average feature activation values to identify a set of secure key features and a set of insecure key features; calculating the secure feature activation frequency and the insecure feature activation frequency of the input prompt sequence; and combining these frequencies with a security threshold to determine whether the input prompt sequence is a jailbreak attack prompt. This application helps to significantly improve the accuracy of jailbreak attack detection and the effectiveness of jailbreak attack defense.
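
The final detection step can be sketched as follows. The abstract does not define how activation frequency is computed or how it is combined with the security threshold, so this example assumes that frequency is the fraction of (word, key feature) pairs whose activation exceeds a cutoff, and that a prompt is flagged when the insecure frequency exceeds the secure frequency by more than the threshold. The names `activation_frequency`, `is_jailbreak_prompt`, `active_above`, and `security_threshold` are hypothetical.

```python
from typing import Sequence, Set


def activation_frequency(prompt_features: Sequence[Sequence[float]],
                         key_features: Set[int],
                         active_above: float = 0.0) -> float:
    """Assumed definition: share of (word, key feature) pairs in the prompt
    whose SAE activation is greater than `active_above`."""
    if not prompt_features or not key_features:
        return 0.0
    hits = sum(
        1
        for word in prompt_features
        for i in key_features
        if word[i] > active_above
    )
    return hits / (len(prompt_features) * len(key_features))


def is_jailbreak_prompt(prompt_features: Sequence[Sequence[float]],
                        safe_features: Set[int],
                        unsafe_features: Set[int],
                        security_threshold: float = 0.2) -> bool:
    """Hypothetical decision rule (not from the source): flag the prompt when
    the insecure frequency exceeds the secure frequency by more than the
    security threshold."""
    safe_freq = activation_frequency(prompt_features, safe_features)
    unsafe_freq = activation_frequency(prompt_features, unsafe_features)
    return (unsafe_freq - safe_freq) > security_threshold


# Toy prompt: three words, three SAE features per word (already encoded).
suspicious = [[0.0, 1.5, 0.0], [0.0, 2.0, 0.3], [0.9, 0.0, 0.0]]
flagged = is_jailbreak_prompt(suspicious, safe_features={0}, unsafe_features={1})
```

Keeping `security_threshold` as a parameter reflects the stated benefit that the threshold can be adjusted to trade detection rate against false positives.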