Expandable large language model jailbreaking attack method, device, medium and product

The method updates a prompt template and adjusts the feedback parameters of the large language model to generate jailbreak attack prompts that meet the format requirements. This addresses the narrow scope of security boundary assessment in large language model jailbreak attacks and makes the method highly scalable across jailbreak tasks.

CN119884311B · Active · Publication Date: 2026-06-26 · HANGZHOU HIGH-TECH ZONE (BINJIANG) INSTITUTE OF BLOCKCHAIN & DATA SECURITY +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU HIGH-TECH ZONE (BINJIANG) INSTITUTE OF BLOCKCHAIN & DATA SECURITY
Filing Date: 2024-12-26
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing jailbreak attack methods assess the security boundary of large language models over only a narrow scope, and effective assessment methods are lacking.

Method used

The method obtains a first prompt corresponding to the jailbreak task, updates the prompt template based on the role description and the format requirements, generates a second prompt that meets the format requirements, and uses the target large language model to generate second answer data. The prompt template is adjusted through feedback parameters to suit different jailbreak tasks.

Benefits of technology

The large language model jailbreak attack method becomes highly scalable: it avoids the narrow content security boundary assessment caused by fixed algorithms and processes, and it adapts to a variety of jailbreak task scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN119884311B_ABST

Abstract

The application relates to an expandable large language model jailbreaking attack method, device, medium and product. The method comprises the following steps: obtaining a first prompt corresponding to a jailbreaking task, and generating first answer data for the first prompt according to a question template; updating the written content of a preset first prompt template according to a role description and/or a scene description corresponding to the jailbreaking task and a preset format requirement; rewriting the first prompt, using the first answer data as an example and incorporating the role description and/or the scene description in the first prompt template, to obtain a second prompt that meets the format requirement; and obtaining second answer data generated by a target large language model based on the second prompt. The method addresses the narrow range over which the security boundary of a large language model is evaluated against jailbreaking attacks.