SAE-based method and device for detecting and defending against jailbreak attacks on large language models
A sparse autoencoder (SAE) trained with a sparsity penalty, combined with feature selection, is applied to a large language model to detect jailbreak attacks. This addresses the high false positive rate and limited defensive effect of existing methods and achieves efficient, accurate jailbreak attack detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- AIR FORCE UNIV PLA
- Filing Date
- 2025-07-15
- Publication Date
- 2026-06-19
AI Technical Summary
Existing jailbreak attack detection methods for large language models have high false positive rates and limited defense effectiveness. They struggle to distinguish genuine jailbreak prompts from normal user input, which degrades model performance.
An SAE sparsifies the intermediate activation vectors of the large language model, mapping them into a high-dimensional sparse feature space. The SAE encoder computes the average feature activation over three datasets: secure rejection, successful jailbreak, and security utility. From these averages, a set of security-critical features and a set of insecure-critical features are selected. For an incoming prompt, the activation frequency of the selected features is computed and compared against the security threshold to determine whether the prompt is a jailbreak attack.
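The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The summary does not give the encoder form, the selection rule, or the scoring formula, so everything concrete here is hypothetical: the ReLU encoder, the identity toy weights, the "top-k contrast" selection rule, the definition of activation frequency, the difference score, and the threshold value 0.5. The three datasets are read here as safe refusals, successful jailbreaks, and benign helpful use, which is also an interpretation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (n_tokens, d) activations -> (n_tokens, m) sparse features."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def top_k_contrast(target_mean, other_means, k):
    """Hypothetical selection rule: the k features whose mean activation on the
    target dataset most exceeds their largest mean on the other datasets."""
    contrast = target_mean - np.max(other_means, axis=0)
    return np.sort(np.argsort(contrast)[-k:])

def activation_frequency(features, idx, eps=1e-6):
    """Fraction of (token, selected feature) entries that are active."""
    return float((features[:, idx] > eps).mean())

def is_jailbreak(x, W_enc, b_enc, safe_idx, unsafe_idx, threshold):
    """Flag a prompt when insecure-critical features fire more often than
    security-critical ones by more than the threshold (assumed scoring rule)."""
    f = sae_encode(x, W_enc, b_enc)
    score = activation_frequency(f, unsafe_idx) - activation_frequency(f, safe_idx)
    return score > threshold, score

def toy_acts(active_dims, n_tokens=8, d=6):
    """Synthetic stand-in for LLM intermediate activations (not real model data)."""
    x = np.full((n_tokens, d), -1.0)  # negative values are zeroed by the ReLU
    x[:, active_dims] = 1.0
    return x

# Toy SAE: identity encoder, so feature i simply mirrors activation dimension i.
W_enc, b_enc = np.eye(6), np.zeros(6)

# Average feature activation on each of the three (synthetic) datasets.
mean_refusal = sae_encode(toy_acts([0, 1]), W_enc, b_enc).mean(axis=0)
mean_jailbreak = sae_encode(toy_acts([2, 3]), W_enc, b_enc).mean(axis=0)
mean_utility = sae_encode(toy_acts([4, 5]), W_enc, b_enc).mean(axis=0)

# Select the security-critical and insecure-critical feature sets.
safe_idx = top_k_contrast(mean_refusal, np.stack([mean_jailbreak, mean_utility]), k=2)
unsafe_idx = top_k_contrast(mean_jailbreak, np.stack([mean_refusal, mean_utility]), k=2)

# Score one jailbreak-like prompt and one benign prompt.
flag_attack, score_attack = is_jailbreak(toy_acts([2, 3]), W_enc, b_enc, safe_idx, unsafe_idx, 0.5)
flag_benign, score_benign = is_jailbreak(toy_acts([4, 5]), W_enc, b_enc, safe_idx, unsafe_idx, 0.5)
```

In a real setting, the toy activations would be replaced by hidden states captured from a chosen layer of the model, and the encoder weights would come from an SAE trained on those activations. The threshold would be tuned on held-out data rather than fixed.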
The method significantly improves the accuracy of jailbreak attack detection and reduces the risk of false positives while maintaining the model's semantic fluency and performance. The security threshold can be adjusted dynamically to optimize detection performance.
Figure CN122242510A_ABST