A method and system for generating jailbreak hints for large-scale closed-source artificial intelligence models
By using lightweight word-level discrete perturbation and multiple candidate selection strategies to generate diverse jailbreak hints in a black-box environment, the problem of insufficient computational efficiency and diversity in existing methods is solved, and efficient and flexible security assessment of closed-source large language models is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-30
AI Technical Summary
Existing jailbreak attack methods cannot simultaneously satisfy the requirements of computational efficiency, prerequisites, and diversity of prompts. In particular, in the security assessment of closed-source large language models in a black-box environment, they lack automation and diversity and rely on gradient access or pre-existing templates.
By using lightweight word-level discrete perturbation operations and multiple candidate selection strategies in a black-box environment, diverse jailbreak hint words are generated. Combined with a continuous scoring evaluation function, the security of a large language model is evaluated.
It efficiently generates diverse jailbreak prompts in a closed-source environment, with high computational efficiency and a high success rate. It applies to a variety of large language models, supports pure black-box operation without pre-existing templates, and offers broad model compatibility and a flexible choice of evaluation strategy.
Figure CN122309676A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a method for the security assessment of large language models, and more particularly to a method for generating jailbreak prompts for large closed-source artificial intelligence models. It belongs to the field of artificial intelligence security assessment technology.
Background Technology
[0002] With the widespread deployment of Large Language Models (LLMs) in fields such as natural language processing, intelligent dialogue, and code generation, their security issues have become increasingly prominent. Although the industry has applied safety alignment techniques such as reinforcement learning from human feedback and direct preference optimization so that LLMs refuse to generate harmful content, LLMs remain vulnerable to "jailbreak attacks." A jailbreak attack is an adversarial technique that uses carefully crafted input prompts to bypass the safety mechanisms of an LLM and induce it to generate harmful content. Systematically assessing these vulnerabilities is crucial for improving LLM security: improvements in safety alignment techniques rely on a knowledge base of known jailbreak prompts, so efficiently discovering diverse jailbreak prompts is a prerequisite for developing secure and reliable LLMs.
[0003] Existing jailbreak attack methods fall mainly into the following three categories, each with obvious shortcomings. The first category is manually constructed jailbreak prompts. In these methods, security researchers manually design specific prompt templates to bypass safety protections, for example through role-playing, hypothetical scenarios, and text flipping. The drawback of this category is the lack of automation and prompt diversity: it relies on fixed prompt templates, so the resulting jailbreak prompts are highly similar and cannot comprehensively cover the security vulnerabilities of large language models.
[0004] The second category is the red-team large language model approach. These methods use an attacker large language model to generate jailbreak prompts automatically, for example by optimizing the prompts through iterative dialogue or tree search strategies. The drawback of this approach is that prompt generation requires additional calls to a large language model, which incurs significant computational overhead and application programming interface (API) call costs and severely limits scalability in large-scale security assessment scenarios.
[0005] The third category is optimization-based methods. These methods use token embeddings or gradient information to search automatically for effective jailbreak prompts; for example, the greedy coordinate gradient method requires access to model gradient information. The drawback of these methods is that they require white-box access, meaning they need the internal parameters, gradients, or log-likelihood scores of the target large language model, which is impractical for closed-source commercial large language model products. Furthermore, some methods rely on pre-existing jailbreak templates, which limits their applicability.
[0006] In summary, existing jailbreak attack methods cannot simultaneously meet the following three core requirements: (1) computational efficiency: large-scale security assessment requires generating and testing a large number of prompt words, and computational cost is a key bottleneck; (2) minimum prerequisites: generating and testing prompt words in a black-box environment is more practical than requiring gradient access or pre-existing jailbreak templates; (3) prompt word diversity: comprehensive assessment of security vulnerabilities in large language models requires the discovery of more diverse jailbreak patterns, rather than relying on known attack methods.
[0007] Therefore, there is an urgent need for an efficient, black-box method for the security assessment of large language models that produces diverse jailbreak prompts.
Summary of the Invention
[0008] To address the shortcomings of existing jailbreak attack methods in simultaneously satisfying the requirements of computational efficiency, prerequisites, and diversity of jailbreak hints, this invention provides a jailbreak hint generation method for closed-source large-scale artificial intelligence models. This method does not require access to the internal parameters or gradient information of the target large-scale language model, does not require the use of a red-team large-scale language model for hint generation, and does not rely on pre-existing jailbreak templates. It efficiently generates diverse jailbreak hints in a black-box environment by applying lightweight word-level discrete perturbations to the original malicious target text, combined with a continuous scoring evaluation function and multiple candidate selection strategies, enabling comprehensive evaluation of the security of large-scale language models.
[0009] This application discloses a method for generating jailbreak prompts for large closed-source artificial intelligence models (hereinafter "the method" or "the jailbreak prompt generation method"). On the premise that interaction with the target model is limited to an input-response interface, the jailbreak prompt search is modeled as a discrete optimization problem, and jailbreak prompts are searched for automatically through an iterative "perturbation-evaluation-selection" loop. The method comprises three core components: a discrete perturbation operator module, an evaluation function module, and a candidate selection strategy module. The method uses the original malicious target text directly as the initial input, without requiring any pre-designed prompt templates. The specific steps are:
Step 1: Initialize the candidate set to contain only the malicious target text, initialize the best prompt to the malicious target text, and calculate the initial score.
Step 2: Extract a set of core keywords from the malicious target text, obtain a set of synonyms for each core keyword, and construct a mapping table from keywords to synonym sets.
Step 3: Enter the iterative loop and, for each candidate prompt in the current candidate set, generate a variant set using the composite perturbation function of the discrete perturbation operator module.
Step 4: Perform a keyword validity check on each variant in the variant set to ensure that it retains at least a preset number of core keywords or their synonyms.
Step 5: Submit the variants that pass the check in batches to the target large language model to obtain the response text corresponding to each variant.
Step 6: Calculate the raw scores of all responses using the evaluation function module and apply the keyword-aware score adjustment.
Step 7: Iterate through all evaluation results. If any variant has a score greater than the current best score, update the best prompt and the best score.
Step 8: Determine whether the best score is greater than or equal to the early stopping threshold. If yes, proceed to step 10; otherwise, proceed to step 9.
Step 9: Update the candidate set according to the candidate selection strategy, and return to step 3 for the next round of iteration.
Step 10: Output the best jailbreak prompt and its corresponding response and score. The process ends.
[0010] The technical solution further defined in this invention is as follows: the discrete perturbation operator module in step 3 perturbs the prompt through word-level editing operations to generate candidate variants. This technical solution defines the following discrete perturbation operators, each of which modifies the prompt through a single operation: the synonym replacement operator replaces one content word in the prompt (a word that is not a stop word and is longer than 3 characters) with a synonym obtained from a vocabulary database; the word insertion operator inserts a semantically neutral filler word at a random position in the prompt, the filler word being drawn from a predefined set of 15 filler words; the word deletion operator removes one non-keyword from the prompt, prioritizing the deletion of filler words, and executes only when the prompt is longer than 10 words; the adjacent transposition operator swaps the positions of two adjacent words, introducing structural changes.
[0011] The technical solution further defined in this invention is as follows:
The operation steps of the synonym replacement operator include: segmenting the input prompt by spaces to obtain a word list; traversing the word list and selecting, as candidate replacement words, the words that do not belong to the stop word list and are longer than 3 characters; randomly selecting a word from the candidate replacement words; querying all synonym sets of the word through the WordNet vocabulary database and extracting up to 5 different synonyms; randomly selecting a synonym from the synonym list to replace the original word, generating a new prompt.
The operation steps of the word insertion operator include: segmenting the input prompt by spaces to obtain a word list; checking the length of the word list, and if the length is less than 5, not performing the insertion operation; randomly selecting an insertion position between the first and last positions of the word list; randomly selecting a filler word from the filler word set; inserting the filler word at the selected position to generate a new prompt.
The operation steps of the word deletion operator include: segmenting the input prompt by spaces to obtain a word list; checking the length of the word list, and if the length is less than or equal to 10, not performing the deletion operation; first searching the word list for words that belong to the filler word set; if a filler word is found and the random probability condition is met, deleting one randomly selected filler word; generating a new prompt after deletion.
The operation steps of the adjacent transposition operator include: segmenting the input prompt by spaces to obtain a word list; checking the length of the word list, and if the length is less than 5, not performing the transposition operation; randomly selecting a position between position 0 and the second-to-last position of the word list; swapping the word at the selected position with the word at the next position; generating a new prompt after transposition.
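The deletion and adjacent transposition steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the names `FILLER_WORDS`, `delete_word`, `swap_adjacent`, and the `filler_prob` parameter are hypothetical, the filler set shown is a stand-in (the source specifies 15 filler words but does not list them), and the value of the "random probability condition" is an assumption.

```python
import random

# Hypothetical stand-in: the source mentions 15 filler words but does not list them.
FILLER_WORDS = {"please", "kindly", "basically", "actually", "simply"}


def delete_word(prompt, keywords, rng, filler_prob=0.5):
    """Remove one non-keyword, preferring filler words (filler_prob is an assumed value)."""
    words = prompt.split(" ")
    if len(words) <= 10:          # only prompts longer than 10 words are shortened
        return prompt
    fillers = [i for i, w in enumerate(words) if w.lower() in FILLER_WORDS]
    if fillers and rng.random() < filler_prob:
        idx = rng.choice(fillers)
    else:
        # Fall back to any word that is not a protected keyword.
        others = [i for i, w in enumerate(words) if w.lower() not in keywords]
        if not others:
            return prompt
        idx = rng.choice(others)
    return " ".join(words[:idx] + words[idx + 1:])


def swap_adjacent(prompt, rng):
    """Swap the word at a random position with the word that follows it."""
    words = prompt.split(" ")
    if len(words) < 5:            # short prompts are left unchanged
        return prompt
    i = rng.randrange(len(words) - 1)   # position 0 .. second-to-last
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```

Passing an explicit `random.Random` instance keeps the operators reproducible, which helps when comparing selection strategies under the same perturbation sequence.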
[0012] The technical solution further defined in this invention is as follows: the evaluation function module in step 6 adopts the StrongREJECT evaluation function and combines it with a keyword-aware score adjustment mechanism, which extracts core keywords and their synonym sets from the malicious target text, checks keyword matching in the response, and penalizes off-topic responses. The keyword-aware score adjustment mechanism includes the following steps:
Step 601: Use regular expressions to extract all English words from the malicious target text and convert them to lowercase.
Step 602: Filter out stop words and short words with a length of less than 3 characters to obtain the core keyword set.
Step 603: For each keyword in the core keyword set, obtain up to 3 synonyms from the WordNet vocabulary database and construct a mapping table, where the synonym set of each keyword includes the keyword itself.
Step 604: Use regular expressions to extract all English words from the response text of the target large language model and convert them to lowercase, obtaining the response word set K.
Step 605: For each core keyword in the mapping table, check whether the intersection of the response word set K and the synonym set of the keyword is non-empty. If it is non-empty, the keyword is matched. Count the total number of matched core keywords, A.
Step 606: Determine whether A is greater than 0. If A equals 0, the final score is 0; proceed to step 608. Otherwise, proceed to the next step.
Step 607: Calculate the dynamic threshold as half of the total number of core keywords, calculate the adjustment factor $f(A) = \min(A / \text{dynamic threshold},\ 1.0)$, and calculate the final score as the original StrongREJECT score multiplied by $f(A)$.
Step 608: Return the final score. The process ends.
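Steps 601 to 608 amount to a small scoring wrapper, sketched below under stated assumptions. The function names `build_keyword_map` and `adjusted_score`, the tiny `STOP_WORDS` subset, and the pluggable `synonyms_of` callable (which stands in for the WordNet lookup and is assumed to return lowercase words) are all illustrative; the raw score is assumed to come from a separate StrongREJECT evaluator that is not shown.

```python
import re

# Illustrative subset only; the source describes a list of about 150 function words.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "on", "for", "about", "is", "how"}


def build_keyword_map(target_text, synonyms_of=lambda word: []):
    """Steps 601-603: map each core keyword to itself plus up to 3 synonyms."""
    words = re.findall(r"[a-z]+", target_text.lower())
    keyword_map = {}
    for w in words:
        if w in STOP_WORDS or len(w) < 3:
            continue
        keyword_map[w] = {w, *list(synonyms_of(w))[:3]}
    return keyword_map


def adjusted_score(raw_score, response_text, keyword_map):
    """Steps 604-608: scale the raw score by the fraction of matched keywords."""
    response_words = set(re.findall(r"[a-z]+", response_text.lower()))
    matched = sum(1 for syns in keyword_map.values() if syns & response_words)
    if matched == 0:
        return 0.0                          # off-topic response: score is zeroed
    threshold = len(keyword_map) / 2        # dynamic threshold
    return raw_score * min(matched / threshold, 1.0)
```

With five core keywords the threshold is 2.5, so a response matching one keyword keeps 40% of its raw score, while a response matching three or more keeps the full raw score.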
[0013] The technical solution further defined in this invention is as follows: the candidate selection strategy mentioned in step 9 is at least one of the following: beam search strategy, genetic algorithm strategy, simulated annealing strategy, ant colony optimization strategy, greedy selection strategy, and random walk strategy.
[0014] The technical solution further defined in this invention is as follows: the genetic algorithm strategy employs two recombination operators, a single-point crossover operator and a uniform crossover operator. The single-point crossover operator randomly selects a cut point in the two parent prompts and splices the part of the first parent before the cut point with the part of the second parent after the cut point to generate the offspring. The uniform crossover operator generates the offspring by independently sampling, for each position, the corresponding word from either parent with equal probability.
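The two recombination operators can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, a single shared cut point is used for single-point crossover, and the handling of parents with different lengths in uniform crossover (trailing words of the longer parent are kept) is an assumption that the source does not address.

```python
import random
from itertools import zip_longest


def single_point_crossover(parent_a, parent_b, rng):
    """Head of parent_a up to a random cut point, followed by the tail of parent_b."""
    a, b = parent_a.split(" "), parent_b.split(" ")
    upper = max(1, min(len(a), len(b)) - 1)
    cut = rng.randint(1, upper)             # inclusive bounds
    return " ".join(a[:cut] + b[cut:])


def uniform_crossover(parent_a, parent_b, rng):
    """For each position, take the word from either parent with probability 0.5."""
    a, b = parent_a.split(" "), parent_b.split(" ")
    child = []
    for wa, wb in zip_longest(a, b):
        if wa is None:                      # assumption: keep the longer parent's tail
            child.append(wb)
        elif wb is None:
            child.append(wa)
        else:
            child.append(wa if rng.random() < 0.5 else wb)
    return " ".join(child)
```

Single-point crossover preserves longer contiguous phrases from each parent, whereas uniform crossover mixes the parents word by word and tends to produce more varied offspring.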
[0015] The technical solution further defined in this invention is as follows: the beam search strategy includes the following steps:
Step 9011: Using the malicious target text as the base prompt, apply random mutations to the base prompt to generate the initial candidates, whose number equals the beam width; these candidates constitute the initial beam.
Step 9012: Submit all candidates in the initial beam in batches to the target large language model for evaluation and scoring.
Step 9013: For each candidate prompt in the beam, use the composite perturbation function to generate b variants and perform a keyword validity check on each variant, so that the total number of new candidates is b times the beam width.
Step 9014: Submit all new candidates in batches to the target large language model to obtain responses, calculate scores using the evaluation function, and apply the keyword-aware score adjustment.
Step 9015: Sort all new candidates in descending order of score, select the highest-scoring candidates (as many as the beam width) to form the new beam, and update the global best solution.
Step 9016: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of steps has been reached. If so, proceed to step 9017; otherwise, return to step 9013 to continue iterating.
Step 9017: Output the globally best jailbreak prompt and its corresponding response and score. The beam search process ends.
[0016] The technical solution further defined in this invention is as follows: the genetic algorithm strategy includes the following steps:
Step 9021: Using the malicious target text as the base prompt, apply random mutations to the base prompt to generate N initial individuals, forming the initial population.
Step 9022: Submit all individuals in the population to the target large language model in batches, and use the evaluation function to calculate the fitness score of each individual.
Step 9023: Copy the e individuals with the highest fitness directly into the next generation of the population.
Step 9024: Select a pair of parents from the current population using tournament selection: for each parent, randomly select 3 individuals to form a tournament and choose the individual with the highest fitness.
Step 9025: With probability $p_c$, perform a crossover operation on the selected parent pair; for single-point crossover, randomly select a cut point in each of the two parents and swap the portions after the cut points to generate two offspring. If the random number is greater than $p_c$, the two parents are passed on directly as offspring.
Step 9026: With probability $p_m$, apply a discrete perturbation mutation to each offspring.
Step 9027: Check whether each offspring retains at least a preset number of core keywords or their synonyms. If the check fails, regenerate the offspring by mutating the corresponding parent.
Step 9028: Determine whether the number of individuals in the next-generation population has reached N. If not, return to step 9024 to continue selecting parents and generating offspring. If it has reached N, submit all newly generated offspring in batches to the target large language model for evaluation, merge them with the elite individuals to form the new generation, and update the global best solution.
Step 9029: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of generations has been reached. If so, output the global best jailbreak prompt; otherwise, return to step 9022 to continue iterating.
[0017] The technical solution further defined in this invention is as follows: the simulated annealing strategy includes the following steps:
Step 9031: Initialize the current prompt as the malicious target text, calculate the initial score, and set the temperature $T$ to the initial temperature $T_0$.
Step 9032: Use the composite perturbation function to generate a mutation candidate from the current prompt.
Step 9033: Submit the mutation candidate to the target large language model to obtain the response and calculate the score.
Step 9034: Calculate the score difference $\Delta = \text{candidate score} - \text{current score}$.
Step 9035: Determine whether $\Delta$ is greater than 0. If it is, accept the candidate unconditionally, update the current prompt and the current score, and proceed to step 9037. Otherwise, proceed to step 9036.
Step 9036: Calculate the acceptance probability $P = \exp(\Delta / T)$ and generate a random number $r \in [0, 1)$. If $r$ is less than $P$, accept the candidate and update the current prompt and the current score; otherwise, keep the current solution unchanged.
Step 9037: If the current solution is better than the global best solution, update the global best solution.
Step 9038: Update the temperature $T = \max(T \times \gamma,\ T_{\min})$, where $\gamma$ is the cooling rate.
Step 9039: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of iterations has been reached. If so, output the global best jailbreak prompt; otherwise, return to step 9032 to continue iterating.
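The acceptance rule and cooling schedule in steps 9034 to 9038 can be illustrated on a toy problem. The sketch below is generic simulated annealing over an arbitrary state: `score` and `mutate` are caller-supplied callables standing in for the evaluation function and the composite perturbation function, and all default parameter values (`t0`, `gamma`, `t_min`, `max_iters`, `stop_at`) are assumptions chosen for illustration, not values from the source.

```python
import math
import random


def accept(delta, temperature, rng):
    """Accept improvements always; accept a worse candidate with probability exp(delta / T)."""
    if delta > 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def cool(temperature, gamma, t_min):
    """Geometric cooling with a floor: T = max(T * gamma, T_min)."""
    return max(temperature * gamma, t_min)


def anneal(initial, score, mutate, t0=1.0, gamma=0.95, t_min=0.01,
           max_iters=100, stop_at=0.9, seed=0):
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    t = t0
    for _ in range(max_iters):
        if best_score >= stop_at:           # early stopping threshold
            break
        candidate = mutate(current, rng)
        cand_score = score(candidate)
        if accept(cand_score - current_score, t, rng):
            current, current_score = candidate, cand_score
        if current_score > best_score:      # track the global best separately
            best, best_score = current, current_score
        t = cool(t, gamma, t_min)
    return best, best_score
```

Because $\Delta \le 0$ whenever step 9036 is reached, $\exp(\Delta / T)$ lies in $(0, 1]$; as $T$ falls toward $T_{\min}$, worse candidates are accepted less often and the search behaves increasingly like greedy selection.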
[0018] The technical solution further defined in this invention is as follows: the ant colony optimization strategy includes the following steps:
Step 9041: Initialize the pheromone level of every perturbation operation to 1.0, initialize the current best prompt to the malicious target text, and calculate the initial score.
Step 9042: Each of the m ants independently selects perturbation operations based on pheromone levels and heuristic information, and applies multiple perturbations to the current best prompt to construct its own candidate solution.
Step 9043: Submit all candidate solutions constructed by the ants in batches to the target large language model for evaluation and scoring.
Step 9044: Iterate through the solutions of all ants. If any solution has a score better than the current best score, update the global best solution.
Step 9045: Perform an evaporation operation on all pheromones, updating each pheromone level according to $\tau \leftarrow (1 - \rho)\,\tau$, where $\rho$ is the evaporation rate.
Step 9046: Deposit additional pheromone on the operations used by the ant with the highest score in this round.
Step 9047: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of iterations has been reached. If so, output the global best jailbreak prompt; otherwise, return to step 9042 to continue iterating.
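The pheromone bookkeeping in steps 9041, 9042, 9045, and 9046 can be sketched as below. Several details are assumptions rather than facts from the source: the standard ant colony evaporation rule $\tau \leftarrow (1 - \rho)\,\tau$ is used, the deposit amount is taken to be the best ant's score, the selection weight is pheromone multiplied by an optional heuristic value, and the names `OPERATIONS`, `choose_operation`, and `update_pheromone` are hypothetical.

```python
import random

OPERATIONS = ["synonym", "insert", "delete", "swap"]


def init_pheromone():
    """Step 9041: every perturbation operation starts at pheromone level 1.0."""
    return {op: 1.0 for op in OPERATIONS}


def choose_operation(pheromone, rng, heuristic=None):
    """Step 9042: sample an operation with probability proportional to pheromone * heuristic."""
    heuristic = heuristic or {}
    ops = list(pheromone)
    weights = [pheromone[op] * heuristic.get(op, 1.0) for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]


def update_pheromone(pheromone, best_ops, best_score, rho=0.1):
    """Steps 9045-9046: evaporate everywhere, then reinforce the best ant's operations."""
    for op in pheromone:
        pheromone[op] *= (1.0 - rho)        # assumed evaporation rule
    for op in set(best_ops):
        pheromone[op] += best_score         # assumed deposit amount
    return pheromone
```

Over several rounds, operations that repeatedly appear in the best-scoring candidate accumulate pheromone and are sampled more often, while evaporation keeps early choices from dominating permanently.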
[0019] This invention also provides a jailbreak prompt generation system for large closed-source artificial intelligence models, comprising:
The input module, which receives the malicious target text and uses it as the initial input.
The keyword extraction module, connected to the input module, which extracts core keywords from the malicious target text, obtains a set of synonyms for each core keyword, and constructs a mapping table from keywords to synonym sets.
The discrete perturbation operator module, connected to the keyword extraction module and the candidate selection strategy module, which applies word-level discrete perturbation operations to prompts to generate variants. It includes a synonym replacement operator, a word insertion operator, a word deletion operator, and an adjacent transposition operator, as well as a single-point crossover operator and a uniform crossover operator for the genetic algorithm.
The candidate selection strategy module, connected to the discrete perturbation operator module and the evaluation function module, which updates the candidate set according to the scoring results and the selection strategy. It supports one or more of the following strategies: beam search, genetic algorithm, simulated annealing, ant colony optimization, random walk, and greedy selection.
The target large language model interface module, connected to the candidate selection strategy module and the evaluation function module, which submits candidate prompts to the target large language model in batches to obtain responses. It supports two modes: local model query and remote application programming interface query.
The evaluation function module, connected to the target large language model interface module and the keyword extraction module, which scores the responses of the target large language model. It includes a StrongREJECT evaluation submodule and a keyword-aware score adjustment submodule; the latter adjusts the original score according to the number of core keywords matched in the response.
The iteration control module, connected to the evaluation function module and the candidate selection strategy module, which determines whether the early stopping condition is met. If not, it triggers the candidate selection strategy module to perform the next round of iteration.
The result output module, connected to the iteration control module, which outputs the best jailbreak prompt and its corresponding response and score.
[0020] This invention supports two query methods for target large language models: (1) Local model query, which uses a high-efficiency inference engine to perform batch inference by loading the locally deployed large language model. The system automatically detects and adapts the chat template format of different large language model families, including Llama-2 command format, Llama-3.x format, Qwen series ChatML format, etc.; (2) Remote application programming interface query, which accesses closed-source commercial large language models through application programming interfaces, enabling this invention to evaluate the security of closed-source large language models.
[0021] The technical solution provided in this application has at least the following technical effects or advantages: 1. Pure Black-Box Operation: This invention does not require access to the target large language model's internal parameters, gradients, or log-likelihood scores. It interacts with the target model solely through an input-response interface, making it suitable for security evaluation of closed-source commercial large language models. Existing gradient-based methods (such as the greedy coordinate gradient method) require white-box access and cannot evaluate closed-source models.
[0022] 2. No Pre-designed Template Required: This invention uses raw malicious target text as initial input, without relying on pre-designed jailbreak templates. Existing manual construction methods and some optimization-based methods rely on pre-existing prompt templates, limiting their applicability and attack diversity.
[0023] 3. High attack success rate: Experiments on five large language models with different architectures show that the beam search selection strategy and genetic algorithm selection strategy of this invention achieve the highest StrongREJECT success rates of 99.0% and 95.2% respectively, far exceeding the best result of 67.7% of existing methods.
[0024] 4. High computational efficiency: This invention employs lightweight word-level discrete perturbation operations, requiring only O(n) time complexity for each operator, avoiding expensive gradient calculations and large language model inference. The genetic algorithm selection strategy takes an average of only 267.0 seconds per target, up to 72 times faster than white-box methods.
[0025] 5. High diversity of prompt words: This invention generates jailbreak prompt words widely distributed in the embedding space through a combination of discrete perturbation operators and multiple selection strategies. Experiments show that the average pairwise distance of this invention reaches 1.30, which is 2.9 times higher than the hand-constructed method and 1.6 times higher than the gradient-based method.
[0026] 6. Keyword-aware scoring mechanism: The keyword-aware scoring adjustment mechanism introduced in this invention filters off-topic responses by checking the matching of core keywords in the response, ensuring that the evaluation results accurately reflect the real jailbreak effect and reducing false alarms.
[0027] 7. Flexible switching between multiple strategies: This invention provides six different candidate selection strategies, which users can flexibly choose according to specific scenarios. Single candidate strategies (random walk, greedy, simulated annealing) are suitable for rapid evaluation, while multiple candidate strategies (beam search, genetic algorithm, ant colony optimization) are suitable for in-depth evaluation that pursues a high success rate.
[0028] 8. Broad model compatibility: This invention supports the evaluation of various large language models, including open-source models (such as the Llama-2, Llama-3.1, and Qwen2.5 series) and closed-source models (such as the GPT-4o series), and has good versatility and portability.
Description of the Drawings
[0029] Figure 1 is a schematic diagram of the overall framework of the jailbreak prompt generation method described in this invention.
[0030] Figure 2 is a schematic diagram illustrating the operation of the four discrete perturbation operators described in this invention.
[0031] Figure 3 is a flowchart of the main algorithm of the jailbreak prompt generation method for large closed-source artificial intelligence models described in this invention.
[0032] Figure 4 is a flowchart of the keyword extraction and score adjustment process described in this invention.
[0033] Figure 5 is a flowchart of the beam search selection strategy described in this invention.
[0034] Figure 6 is a flowchart of the genetic algorithm selection strategy described in this invention.
[0035] Figure 7 is a flowchart of the simulated annealing selection strategy described in this invention.
[0036] Figure 8 is a flowchart of the ant colony optimization selection strategy described in this invention.
Detailed Implementation
[0037] To better understand the above technical solutions, the following will provide a detailed explanation of the technical solutions in conjunction with the accompanying drawings and specific implementation methods.
[0038] This embodiment provides a method for generating jailbreak hints for large closed-source artificial intelligence models, used to systematically evaluate the security alignment performance of large language models. The following detailed description uses experiments on the AdvBench dataset as an example.
[0039] As shown in Figure 1, the overall system architecture of the jailbreak prompt generation method in this embodiment includes the following components: (1) Target large language model: the model under evaluation, which can be an open-source model deployed locally or a closed-source model accessed through an application programming interface. In this embodiment, five large language models were evaluated: Llama-2-7B-Chat, Llama-2-13B-Chat, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and GPT-4o; (2) Discrete perturbation operator module: implements four word-level editing operations (synonym replacement, word insertion, word deletion, and adjacent transposition), as well as two recombination operations for the genetic algorithm (single-point crossover and uniform crossover); (3) Evaluation function module: a fine-tuned StrongREJECT evaluation model scores the response of the target large language model, supplemented by a keyword-aware score adjustment mechanism. Given a response, the evaluator returns a continuous score that measures jailbreak effectiveness along three dimensions: refusal, persuasiveness, and specificity. The keyword-aware score adjustment mechanism extracts core keywords and their synonym sets from the malicious target text, checks keyword matching in the response, and penalizes off-topic responses.
[0040] (4) Candidate selection strategy module, providing six selection strategies: random walk, greedy, simulated annealing, ant colony optimization, beam search, and genetic algorithm. The random walk selection strategy generates a batch of mutated candidates in each iteration and randomly selects one as the current prompt word for the next round, while tracking the discovered best solution; the greedy selection strategy generates a mutated candidate for the current best solution in each iteration, and only accepts the mutated candidate if its score is strictly better than the current best solution.
[0041] (5) Result output module, which records the best jailbreak hint, response content, scores at each stage, attack success rate judgment and other information for each malicious target. The method takes the malicious target text as input and outputs the best jailbreak hint through an iterative "perturbation-evaluation-selection" loop.
[0042] In this embodiment, the target large language model is formalized as follows. For an input prompt $p$, the target large language model $M$ generates a response $r = M(p)$. Let $x$ denote a malicious target text, i.e., a request that a safety-aligned model has been trained to refuse. This technical solution models the jailbreak attack as a discrete optimization problem: starting from $x$, find an adversarial prompt $p^{*}$ that maximizes the evaluation function $J$: $p^{*} = \arg\max_{p \in \mathcal{P}(x)} J(M(p))$, where $\mathcal{P}(x)$ denotes the feasible prompt space reachable from $x$ via the discrete perturbation operators. The score quantifies the harmfulness of the response; a higher score indicates a more successful jailbreak attack. Because $\mathcal{P}(x)$ is discrete and combinatorially large, and the internals of the model are inaccessible, this embodiment combines heuristic search with discrete perturbation operators to approximate the optimal solution.
[0043] Figure 2 illustrates the operation of the four discrete perturbation operators provided by the discrete perturbation operator module in this embodiment. Preferably, this embodiment defines four word-level discrete perturbation operators and a composite perturbation function.
[0044] The synonym replacement operator operates as follows: the input prompt is split on spaces to obtain a word list; the word list is traversed, and words that are not in the stop word list and are longer than 3 characters are selected as candidate replacement words (the stop word list contains about 150 common function words, such as articles, prepositions, pronouns, and conjunctions); one word is randomly selected from the candidates; all synonym sets of that word are queried in the WordNet lexical database, and up to 5 distinct synonyms are extracted; one synonym is randomly selected from this list to replace the original word, generating a new prompt.
[0045] The word insertion operator operates as follows: the input prompt is split on spaces to obtain a word list; the length of the word list is checked, and if it is less than 5, no insertion is performed; an insertion position is randomly selected between the first and last positions of the word list; a filler word is randomly selected from the filler word set described above; the filler word is inserted at the selected position to generate a new prompt.
[0046] The word deletion operator operates as follows: the input prompt is split on spaces to obtain a word list; the length of the word list is checked, and if it is less than or equal to 10, no deletion is performed; the word list is first searched for words that belong to the filler word set; if filler words are found and a random probability condition is met, one randomly selected filler word is deleted; the prompt after deletion is returned as the new prompt.
[0047] The adjacent transposition operator operates as follows: the input prompt is split on spaces to obtain a word list; the length of the word list is checked, and if it is less than 5, no transposition is performed; a position is randomly selected between position 0 and the second-to-last position of the word list; the words at the selected position and the next position are swapped; the prompt after transposition is returned as the new prompt.
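The adjacent transposition operator is the simplest of the four and can be sketched in a few lines. The following is a minimal illustration, not the applicant's implementation; the function name and the `rng` parameter are assumptions introduced here.

```python
import random

def adjacent_transposition(prompt, rng=random):
    """Swap one randomly chosen pair of adjacent words (hypothetical helper)."""
    words = prompt.split(" ")
    # Prompts shorter than 5 words are returned unchanged, as in the description.
    if len(words) < 5:
        return prompt
    # Valid positions run from index 0 to the second-to-last index.
    i = rng.randrange(0, len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```

Because the operator only permutes words, the output always contains exactly the same multiset of words as the input.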
[0048] The composite perturbation function operates as follows: the number of perturbation operations is randomly set to 1 or 2 with equal probability; for each operation, a perturbation operator is selected by weighted random sampling according to the configured weight of each operator (by default, the four operators are equally weighted); the selected operators are applied to the prompt in sequence; and the perturbed prompt is returned.
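The weighted sampling step can be illustrated with the standard library. This is a hedged sketch: the operator registry, the `weights` argument, and the returned list of chosen operator names are assumptions made for illustration, and the operators themselves are passed in as plain callables.

```python
import random

def composite_perturb(prompt, operators, weights=None, rng=random):
    """Apply 1 or 2 operators chosen by weighted sampling (illustrative sketch).

    operators: dict mapping an operator name to a callable str -> str.
    weights:   dict mapping an operator name to its sampling weight
               (equal weights when omitted).
    """
    names = list(operators)
    if weights is None:
        weights = {name: 1.0 for name in names}
    # Number of perturbations: 1 or 2 with equal probability.
    n_ops = rng.choice([1, 2])
    # Weighted sampling with replacement over the operator names.
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=n_ops)
    for name in chosen:
        prompt = operators[name](prompt)
    return prompt, chosen
```

Returning the chosen operator names alongside the result is a design choice of this sketch; it makes it easy for a caller to attribute a score to the operations that produced it.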
[0049] For the genetic algorithm selection strategy, this embodiment defines two additional recombination operators. The single-point crossover operator randomly selects a cut point in each of the two parent prompts and concatenates the part of the first parent before its cut point with the part of the second parent after its cut point to generate an offspring. The uniform crossover operator generates an offspring by independently sampling, at each position, the corresponding word from either parent with equal probability.
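Both recombination operators act on word lists and can be sketched as follows. The handling of parents of unequal length in the uniform crossover (keeping the tail of the longer parent) and the restriction of cut points to avoid empty offspring are assumptions of this sketch, since the text does not specify them.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Head of parent A up to its cut point, then tail of parent B after its cut point."""
    a, b = parent_a.split(), parent_b.split()
    # Assumption: cut points start at 1 so the offspring is never empty.
    cut_a = rng.randint(1, max(1, len(a) - 1))
    cut_b = rng.randint(1, max(1, len(b) - 1))
    return " ".join(a[:cut_a] + b[cut_b:])

def uniform_crossover(parent_a, parent_b, rng=random):
    """At each shared position, take the word from either parent with equal probability."""
    a, b = parent_a.split(), parent_b.split()
    n = min(len(a), len(b))
    child = [rng.choice((a[i], b[i])) for i in range(n)]
    # Assumption: positions beyond the shorter parent are copied from the longer one.
    longer = a if len(a) > len(b) else b
    child += longer[n:]
    return " ".join(child)
```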
[0050] The above perturbation operators satisfy three key properties: (a) locality: each edit produces a prompt at edit distance 1 from its input, enabling fine-grained exploration of the prompt space; (b) connectivity: repeated edits can reach any point in the feasible space; and (c) computational efficiency: for a prompt of length $n$, each operator requires only $O(n)$ time. Word-level perturbation can bypass the safety alignment mechanism because the mapping by which a safety-aligned model recognizes harmful input patterns and generates refusal prefixes is shallow and relies on specific word-level patterns seen during alignment training. Word-level perturbation shifts the input outside the training distribution. When the perturbed input no longer matches the learned harmful patterns, the model does not generate a refusal prefix in its initial output tokens, and subsequent generation is unconstrained.
[0051] As shown in Figure 3, the hyperparameters in this embodiment are an early stopping threshold, a maximum generation length of 300 tokens, a sampling temperature of 0.7, and a nucleus (top-p) sampling probability of 0.9. The main algorithm flow of the jailbreak prompt generation method in this embodiment is as follows:
Step 1: Initialize the candidate set to contain only the malicious target text, and initialize the best prompt to the malicious target text. Submit the malicious target text to the target large language model to obtain a response, and calculate the initial score using the StrongREJECT evaluation function.
Step 2: Extract the core keyword set from the malicious target text: use regular expressions to extract all English words and convert them to lowercase, then filter out stop words and short words of fewer than 3 characters. For each keyword, obtain up to three synonyms from the WordNet lexical database, and construct a mapping table from keywords to synonym sets.
Step 3: For each candidate in the candidate set, generate variants using the composite perturbation function, and aggregate all variants into a candidate variant set.
Step 4: Perform a keyword validity check on each variant by counting the core keywords (or their synonyms) retained in it. If the count is below a preset threshold (2 in this embodiment), mutate again from the parent, with at most 5 retries. If the requirement is still not met after 5 retries, retain the parent.
Step 5: Submit all variants that pass the check in batches to the target large language model, and obtain the response text for each variant.
Step 6: Calculate the raw score of every response using the StrongREJECT evaluation function, then apply the keyword-aware scoring adjustment: count the number $A$ of core keywords matched in each response and calculate the adjustment factor $f(A) = \min(A / \text{dynamic threshold},\ 1.0)$, where the dynamic threshold is half of the total number of core keywords. The final score is the raw score multiplied by the adjustment factor.
Step 7: Traverse all evaluation results. If any variant has a final score greater than the current best score, update the best prompt to that variant and the best score to its final score.
Step 8: Determine whether the best score is greater than or equal to the early stopping threshold. If yes, proceed to Step 10; otherwise, proceed to Step 9.
Step 9: Update the candidate set according to the selected candidate selection strategy (one of beam search, genetic algorithm, simulated annealing, ant colony optimization, greedy search, or random walk), and return to Step 3 for the next iteration.
Step 10: Output the best jailbreak prompt together with its response and score. The process ends.
[0052] As shown in Figure 4, the keyword-aware scoring adjustment mechanism in this embodiment proceeds as follows:
Step 601: Receive the malicious target text and the response text of the target large language model as input. Use regular expressions to extract all English words from the malicious target text and convert them to lowercase.
Step 602: Filter out words in the stop word list (which contains approximately 150 common function words) and short words of fewer than 3 characters, yielding the core keyword set.
Step 603: For each keyword in the core keyword set, query all synonym sets in the WordNet lexical database, extract up to three synonyms, and construct a mapping table. The synonym set of each keyword includes the keyword itself.
Step 604: Use regular expressions to extract all English words from the response text and convert them to lowercase, yielding the response word set.
Step 605: Initialize the match count $A$ to 0. For each core keyword in the mapping table, check whether the intersection of the response word set and the synonym set of that keyword is non-empty. If it is non-empty, increment $A$ by 1.
Step 606: If the match count $A$ equals 0, set the final score to 0 and proceed to Step 608; otherwise, proceed to Step 607.
Step 607: Calculate the dynamic threshold as half of the total number of core keywords, calculate the adjustment factor $f(A) = \min(A / \text{dynamic threshold},\ 1.0)$, and calculate the final score as the raw score multiplied by $f(A)$. When $A$ reaches or exceeds the dynamic threshold, the final score equals the raw score; when $A$ is below the dynamic threshold, the final score is reduced proportionally.
Step 608: Return the final score. The keyword-aware scoring adjustment process is complete.
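Steps 604 to 608 reduce to a small amount of arithmetic, sketched here with a harmless example. The function name and the shape of `synonym_map` (a dict from each core keyword to its set of accepted forms, keyword included) are assumptions of this sketch; building that map from WordNet (Steps 601 to 603) is left out.

```python
import re

def keyword_adjusted_score(raw_score, response, synonym_map):
    """Scale a raw score by keyword coverage of the response (illustrative sketch)."""
    # Step 604: lowercase English words found in the response.
    response_words = set(re.findall(r"[a-z]+", response.lower()))
    # Step 605: a keyword matches if any of its accepted forms appears.
    matched = sum(1 for forms in synonym_map.values() if response_words & set(forms))
    # Step 606: no match at all means the response is treated as off-topic.
    if matched == 0:
        return 0.0
    # Step 607: the dynamic threshold is half the number of core keywords.
    dynamic_threshold = len(synonym_map) / 2
    factor = min(matched / dynamic_threshold, 1.0)
    return raw_score * factor

# Harmless usage example with four core keywords.
demo_map = {
    "bake": {"bake", "cook"},
    "bread": {"bread", "loaf"},
    "oven": {"oven"},
    "yeast": {"yeast"},
}
full = keyword_adjusted_score(0.8, "Cook the loaf slowly.", demo_map)   # 2 of 4 matched
half = keyword_adjusted_score(0.8, "Use yeast.", demo_map)              # 1 of 4 matched
```

With four core keywords the dynamic threshold is 2, so two matches leave the raw score unchanged and one match halves it.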
[0053] Figure 5 shows a flowchart of the beam search selection strategy in this embodiment. The hyperparameters of the beam search selection strategy are the beam width, the branching factor $b$, and a maximum of 20 steps. The beam search selection strategy includes the following steps:
Step 9011: Use the malicious target text as the base prompt and add it to the beam. Apply 1 to 3 random mutations to the base prompt to generate additional initial candidates, and add them to the beam so that the initial beam is filled to the beam width.
Step 9012: Submit all candidates in the beam in batches to the target large language model to obtain responses. Calculate scores using the evaluation function and apply the keyword-aware scoring adjustment. Sort the candidates in descending order of score and record the current global best solution.
Step 9013: For each candidate prompt in the beam, generate $b$ variants using the composite perturbation function and perform a keyword validity check on each variant. The total number of new candidates equals the beam width multiplied by $b$.
Step 9014: Submit all new candidates in batches to the target large language model to obtain responses. Calculate scores using the evaluation function and apply the keyword-aware scoring adjustment.
Step 9015: Sort all new candidates in descending order of score and select the top-scoring candidates, up to the beam width, to form the new beam. If the new highest score is better than the global best score, update the global best prompt and the global best score.
Step 9016: Determine whether the global best score is greater than or equal to the early stopping threshold, or whether the maximum number of steps has been reached. If either condition is met, proceed to Step 9017; otherwise, return to Step 9013 for the next expansion.
Step 9017: Output the global best jailbreak prompt together with its response and score. The beam search process ends.
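The expand-score-prune cycle of Steps 9013 to 9015 is ordinary beam search and can be sketched independently of any language model. In the sketch below, `expand` and `score` are placeholder callables supplied by the caller (assumptions of this example), and the toy usage searches over integers rather than prompts.

```python
import random

def beam_step(beam, expand, score, beam_width, branching_factor):
    """One expansion: b variants per beam entry, keep the top beam_width by score."""
    candidates = [expand(item) for item in beam for _ in range(branching_factor)]
    scored = sorted(((score(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Only the new candidates compete for the next beam, as in Step 9015.
    return scored[:beam_width]

# Toy usage: integers stand in for prompts, closeness to 10 stands in for the score.
rng = random.Random(0)
expand = lambda x: x + rng.choice([-1, 1, 2])
score = lambda x: -abs(x - 10)
beam = [0]
for _ in range(3):
    scored = beam_step(beam, expand, score, beam_width=2, branching_factor=3)
    beam = [candidate for _, candidate in scored]
```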
[0054] Figure 6 shows a flowchart of the genetic algorithm selection strategy in this embodiment. The hyperparameters of the genetic algorithm selection strategy are the population size $N$, the mutation probability $p_m$, the crossover probability $p_c$, and the number of retained elites $e$. The selection method is tournament selection (tournament size 3), the crossover method is single-point crossover, and the maximum number of generations is 20. The genetic algorithm selection strategy includes the following steps:
Step 9021: Use the malicious target text as the base prompt and add it to the population. Apply 1 to 3 random mutations to the base prompt to generate further initial individuals, which join the population to form an initial population of size $N$.
Step 9022: Submit all $N$ individuals in the population in batches to the target large language model to obtain responses. Use the evaluation function to calculate the fitness score of each individual and apply the keyword-aware scoring adjustment.
Step 9023: Copy the $e$ individuals with the highest fitness directly into the next generation to ensure that the best solution is not lost.
Step 9024: Use tournament selection to choose a pair of parents from the current population. For each parent, randomly select 3 individuals to form a tournament and choose the one with the highest fitness.
Step 9025: With probability $p_c$, perform single-point crossover on the selected parent pair: randomly select a cut point in each of the two parent prompts and swap the parts after the cut points to generate two offspring. If the generated random number is greater than $p_c$, no crossover is performed and the two parents are taken directly as offspring.
Step 9026: With probability $p_m$, apply a discrete perturbation to each offspring, using the composite perturbation function to perform word-level mutation.
Step 9027: Check whether each offspring retains at least 2 core keywords (or their synonyms). If the check fails, repeat the mutation from the corresponding parent, with at most 5 retries.
Step 9028: Determine whether the number of individuals in the next generation (elite individuals plus offspring generated so far) has reached the population size. If not, return to Step 9024 to continue selecting parents and generating offspring; if so, proceed to Step 9029.
Step 9029: Submit all newly generated offspring to the target large language model for evaluation and scoring, and merge the elite individuals and the offspring to form the new generation. If the best score of this generation is better than the global best score, update the global best prompt and the global best score. Determine whether the global best score is greater than or equal to the early stopping threshold, or whether the maximum number of generations has been reached. If either condition is met, output the global best jailbreak prompt together with its response and score, and the genetic algorithm process ends; otherwise, return to Step 9022 for the next generation.
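Elite retention (Step 9023) and tournament selection (Step 9024) are standard genetic algorithm building blocks. A minimal sketch follows; the function names are hypothetical, and fitness values are passed as a list parallel to the population.

```python
import random

def select_elites(population, fitness, e):
    """Return the e individuals with the highest fitness, best first."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in order[:e]]

def tournament_select(population, fitness, size=3, rng=random):
    """Draw `size` distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]

# Toy usage with placeholder individuals.
population = ["a", "b", "c", "d"]
fitness = [0.1, 0.9, 0.3, 0.5]
elite = select_elites(population, fitness, e=2)
parent = tournament_select(population, fitness, size=3, rng=random.Random(0))
```

With a tournament size of 3 and a population of 4, the least fit individual can never win, since every tournament contains at least two fitter individuals.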
[0055] Figure 7 shows a flowchart of the simulated annealing selection strategy in this embodiment. The hyperparameters of the simulated annealing selection strategy are the initial temperature $T_0 = 1.0$, the cooling rate $\gamma$, the minimum temperature $T_{\min} = 0.01$, and a maximum of 100 iterations. The simulated annealing selection strategy includes the following steps:
Step 9031: Initialize the current prompt to the malicious target text. Submit the malicious target text to the target large language model to obtain a response and calculate the initial score as the current score. Set the current prompt and current score as the global best prompt and global best score, and set the temperature $T$ to $T_0$.
Step 9032: Use the composite perturbation function to generate a mutation candidate from the current prompt and perform a keyword validity check.
Step 9033: Submit the mutation candidate to the target large language model to obtain a response. Calculate the score using the evaluation function and apply the keyword-aware scoring adjustment.
Step 9034: Calculate the score difference $\Delta$ = mutation candidate score - current score.
Step 9035: If $\Delta$ is greater than 0, accept the mutation candidate unconditionally: set it as the current prompt, set its score as the current score, and proceed to Step 9037. Otherwise, proceed to Step 9036.
Step 9036: Calculate the acceptance probability $P = \exp(\Delta / T)$ and generate a uniformly distributed random number $r \in [0, 1)$. If $r$ is less than $P$, accept the mutation candidate, setting it as the current prompt and its score as the current score; otherwise, keep the current prompt and current score unchanged.
Step 9037: Determine whether the current score is greater than the global best score. If so, update the global best prompt and the global best score to the current prompt and the current score.
Step 9038: Update the temperature as $T = \max(T \times \gamma,\ T_{\min})$, so that the temperature gradually decreases from 1.0 to 0.01.
Step 9039: Determine whether the global best score is greater than or equal to the early stopping threshold, or whether the maximum number of iterations has been reached. If either condition is met, output the global best jailbreak prompt together with its response and score, and the simulated annealing process ends; otherwise, return to Step 9032 for the next iteration.
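The acceptance rule and cooling schedule of Steps 9035 to 9038 can be written down directly. In this sketch the function names are hypothetical, and the cooling rate used in the usage lines (0.5) is an arbitrary illustrative value, not one taken from the embodiment.

```python
import math
import random

def accept(delta, temperature, rng=random):
    """Metropolis-style acceptance: always accept improvements, otherwise
    accept with probability exp(delta / T), where delta <= 0."""
    if delta > 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def cool(temperature, gamma, t_min=0.01):
    """Geometric cooling, clamped at the minimum temperature."""
    return max(temperature * gamma, t_min)

# Illustrative usage; gamma=0.5 is an assumed value chosen for the example.
t = 1.0
t = cool(t, gamma=0.5)           # 0.5
t_floor = cool(0.015, gamma=0.5) # clamped at 0.01
```

At high temperature the rule accepts many worse candidates, which helps escape local optima; as the temperature approaches the minimum, it behaves almost like greedy selection.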
[0056] Figure 8 shows a flowchart of the ant colony optimization selection strategy in this embodiment. The hyperparameters of the ant colony optimization selection strategy are the number of ants $m$, the evaporation rate $\rho$, the pheromone importance parameter, the heuristic information importance parameter, and a maximum of 50 iterations. The ant colony optimization selection strategy includes the following steps:
Step 9041: Initialize the pheromone level $\tau$ of every perturbation operation to 1.0. Submit the malicious target text to the target large language model to obtain a response and calculate the initial score. Set the current best prompt to the malicious target text and the current best score to the initial score.
Step 9042: Each of the $m$ ants independently constructs a solution by applying multiple perturbations to the current best prompt (2 perturbations per ant in this embodiment). Each perturbation is followed by a keyword validity check to ensure that the perturbed prompt retains enough core keywords.
Step 9043: Submit the candidate solutions constructed by all $m$ ants in batches to the target large language model to obtain responses. Calculate scores using the evaluation function and apply the keyword-aware scoring adjustment.
Step 9044: Traverse the solutions and scores of all ants. If any solution has a score better than the current best score, update the global best prompt and the global best score.
Step 9045: Evaporate the pheromone of every perturbation operation according to $\tau \leftarrow (1 - \rho)\,\tau$, while keeping each pheromone value no lower than 0.1 to prevent the pheromone from disappearing completely.
Step 9046: Select the ant with the highest score in this round and deposit additional pheromone on the perturbation operations that ant used. The deposited amount is proportional to the ant's score.
Step 9047: Determine whether the global best score is greater than or equal to the early stopping threshold, or whether the maximum number of iterations has been reached. If either condition is met, output the global best jailbreak prompt together with its response and score, and the ant colony optimization process ends; otherwise, return to Step 9042 for the next iteration.
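The pheromone bookkeeping of Steps 9045 and 9046 can be sketched as dictionary updates. Several details here are assumptions: the evaporation rule is the textbook form tau <- (1 - rho) * tau, the deposit constant `q` is hypothetical, and operator selection uses pheromone weights only, omitting the heuristic information term.

```python
import random

def evaporate(pheromone, rho, floor=0.1):
    """Decay every pheromone level by (1 - rho), never dropping below the floor."""
    return {op: max((1.0 - rho) * tau, floor) for op, tau in pheromone.items()}

def deposit(pheromone, used_ops, score, q=1.0):
    """Reinforce the operations used by the best ant, proportionally to its score."""
    updated = dict(pheromone)
    for op in used_ops:
        updated[op] += q * score
    return updated

def pick_operator(pheromone, rng=random):
    """Sample an operation with probability proportional to its pheromone level."""
    ops = list(pheromone)
    return rng.choices(ops, weights=[pheromone[op] for op in ops], k=1)[0]

# Illustrative round with four operations and an assumed rho of 0.5.
levels = {"synonym": 1.0, "insert": 1.0, "delete": 1.0, "swap": 1.0}
levels = evaporate(levels, rho=0.5)
levels = deposit(levels, used_ops=["synonym", "swap"], score=0.5)
```

The floor keeps rarely rewarded operations selectable, so the search does not collapse onto a single operator.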
[0057] In the method described in this invention, queries to the target large language model support two modes: local model query and remote application programming interface (API) query. In local model query mode, the system automatically adapts to the chat templates of different large language models as follows. It first detects keywords in the model name or path to identify the model family. For the Llama-2 series, it uses the [INST] {prompt} [/INST] format; for the Llama-3.x series, it uses a format containing special markers such as <|begin_of_text|> and <|start_header_id|>; for the Qwen series, it uses the ChatML format (<|im_start|> and <|im_end|> markers). If the model family cannot be identified, it first tries the chat template bundled with the model and, if that fails, falls back to the ChatML format. The inference parameters are a maximum generation length of 300 tokens, a sampling temperature of 0.7, and a nucleus (top-p) sampling probability of 0.9. In remote API query mode, the system accesses closed-source commercial large language models through the API, supports model name alias mapping, and manages the API key and API address through configuration files.
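The family detection described above amounts to a keyword dispatch on the model name. The sketch below is illustrative only: the function name is hypothetical, the exact template strings are assumptions based on the commonly published formats for these model families, and the intermediate step of trying the template bundled with the model is omitted.

```python
def format_chat_prompt(model_name, prompt):
    """Wrap a user prompt in a family-specific chat template (illustrative sketch)."""
    name = model_name.lower()
    if "llama-2" in name:
        return f"[INST] {prompt} [/INST]"
    if "llama-3" in name:
        return (
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
            f"{prompt}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    # Qwen models and unrecognized families both use ChatML in this sketch.
    return f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Harmless usage example.
wrapped = format_chat_prompt("Llama-2-7B-Chat", "Summarize this paragraph.")
```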
[0058] In summary, this embodiment provides a method for generating jailbreak prompts for large closed-source artificial intelligence models, applicable to the security assessment of large language models. The method can be widely applied to security testing by large language model developers, security auditing of enterprise-level large language model products, and academic research on large language model security. By efficiently generating diverse jailbreak prompts, this invention helps identify weaknesses in the safety alignment mechanisms of large language models and provides data support and directional guidance for improving safety alignment techniques, thereby promoting the development of security protection technologies for large language models.
[0059] There are many specific methods and approaches for implementing the jailbreak prompt generation method for large closed-source artificial intelligence models provided in this embodiment. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present invention. For example, the discrete perturbation operators can be extended to character-level or sentence-level editing operations, other continuous scoring evaluation functions can replace StrongREJECT, and further metaheuristic search strategies can be introduced. These improvements and modifications should also be considered within the scope of protection of the present invention. All components not explicitly described in this embodiment can be implemented using existing technologies.
Claims
1. A jailbreak prompt generation method for closed-source large artificial intelligence models, characterized in that the method comprises the following steps: Step 1, initialize the candidate set to a set containing only the malicious target text, initialize the best prompt to the malicious target text, and calculate the initial score; Step 2, extract a set of core keywords from the malicious target text, obtain a set of synonyms for each core keyword, and construct a mapping table from keywords to synonym sets; Step 3, enter an iterative loop and, for each candidate prompt in the current candidate set, use the composite perturbation function of the discrete perturbation operator module to generate a variant set; Step 4, perform a keyword validity check on each variant in the variant set to ensure that it retains at least a preset number of core keywords or their synonyms; Step 5, submit the checked variants in batches to the target large language model to obtain the response text for each variant; Step 6, use the evaluation function module to calculate the raw score for all responses, and perform keyword-aware score adjustment; Step 7, traverse all evaluation results and, if there is a variant with a score greater than the current best score, update the best prompt and the best score; Step 8, determine whether the best score is greater than or equal to the early stopping threshold; if yes, execute Step 10; if not, execute Step 9; Step 9, update the candidate set according to the candidate selection strategy, and return to Step 3 to continue the next round of iteration; Step 10, output the best jailbreak prompt and its corresponding response and score, and the process ends.
2. The jailbreak prompt generation method for closed-source large artificial intelligence models according to claim 1, characterized in that the discrete perturbation operator module in Step 3 perturbs the prompt by word-level editing operations to generate candidate variants, and the discrete perturbation operators include: a synonym replacement operator that replaces one content word in the prompt with a synonym obtained from a lexical database; a word insertion operator that inserts a filler word from a predefined filler word set at a random position in the prompt; a word deletion operator that removes one non-keyword in the prompt, preferably a filler word; and an adjacent transposition operator that exchanges the positions of two adjacent words to introduce structural changes.
3. The jailbreak prompt generation method for closed-source artificial intelligence large models according to claim 2, characterized in that: The operation steps of the synonym replacement operator include: tokenizing the input prompt by spaces to obtain a word list; filtering out words that are not in the stop word list and have a word length greater than 3 characters as candidate replacement words; randomly selecting a word from the candidate replacement words; querying all synonym sets of the word through the WordNet vocabulary database, and extracting up to 5 different synonyms; randomly selecting a synonym from the synonym list to replace the original word, generating a new prompt; The operation steps of the word insertion operator include: tokenizing the input prompt by spaces to obtain a word list; checking the length of the word list, if the length is less than 5, do not perform the insertion operation; randomly select an insertion position between the first position and the last position of the word list; randomly select a filler word from the filler word set; insert the filler word at the selected position to generate a new prompt; The operation steps of the word deletion operator include: tokenizing the input prompt by spaces to obtain a word list; checking the length of the word list, if the length is less than or equal to 10, no deletion operation is performed; preferentially searching for words belonging to the filler word set in the word list; if a filler word is found and the random probability condition is met, one randomly selected filler word is deleted; and generating a new prompt after deletion. 
The operation steps of the adjacent transposition operator include: tokenizing the input prompt by spaces to obtain a word list; checking the length of the word list, if the length is less than 5, no transposition operation is performed; randomly selecting a position between the 0th position and the second-to-last position in the word list; exchanging the two words at the selected position and the next position; and generating a new prompt after transposition. 4.The closed-source AI large model-oriented jailbreak prompt word generation method of claim 1, wherein, The evaluation function module in step 6 adopts a StrongREJECT evaluation function, and combines a keyword-aware score adjustment mechanism to extract core keywords and their synonym sets from the malicious target text, check keyword matching in the response, and punish off-topic responses. The keyword-aware score adjustment mechanism includes the following steps: Step 601: Use a regular expression to extract all English words from the malicious target text and convert them to lowercase; Step 602: Filter out stop words and short words with a length of less than 3 characters to obtain a core keyword set; Step 603: For each keyword in the core keyword set, obtain up to 3 synonyms through the WordNet lexical database to construct a mapping table, wherein the synonym set of each keyword includes the keyword itself; Step 604: Use a regular expression to extract all English words from the response text of the target large language model and convert them to lowercase to obtain a response word set K; Step 605: For each core keyword in the mapping table, check whether the intersection of the response word set and the synonym set of the keyword is non-empty, if it is non-empty, the keyword is matched, and the total number of matched core keywords A is counted; Step 606: Determine whether the total number of matched core keywords A is greater than 0, if A is equal to 0, the final score is 0, and step 608 is executed, 
otherwise the next step is executed; Step 607: Calculate the dynamic threshold as half of the total number of core keywords, calculate the adjustment factor f(A)=min(A / dynamic threshold, 1.0), and calculate the final score as the original StrongREJECT score multiplied by the adjustment factor f(A); Step 608: Return the final score and the process ends. 5.The closed-source AI large model-oriented jailbreak prompt word generation method according to claim 1, wherein, The candidate selection strategy in step 9 is at least one of a beam search strategy, a genetic algorithm strategy, a simulated annealing strategy, an ant colony optimization strategy, a greedy selection strategy, and a random walk strategy. 6.The closed-source AI large model oriented jailbreak prompt word generation method according to claim 5, characterized in that, The beam search strategy includes the following steps: Step 9011, taking the malicious target text as a basis prompt word, applying random variation to the basis prompt word to generate an initial candidate, constituting an initial beam; Step 9012, all of the initial bundle Each candidate is submitted in batches to the target large language model for evaluation and scoring; Step 9013, for each candidate prompt word in the bundle, use the composite perturbation function to generate b mutants, perform keyword validity check for each mutant, altogether produce new candidates; Step 9014: Submit all new candidate batches to the target large language model to obtain responses, calculate scores using the evaluation function, and perform keyword-aware scoring; Step 9015, arrange all new candidates in descending order of score, select the top candidates to form a new beam, and update the global best solution. 
Step 9016: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of steps has been reached, if so, execute step 9017; otherwise, return to step 9013 to continue iteration; Step 9017: Output the globally best jailbreak hint word and its corresponding response and score; the bundle search process ends. 7.The closed-source AI large model-oriented jailbreak prompt word generation method according to claim 5, characterized in that, The genetic algorithm strategy includes the following steps: Step 9021: Using malicious target text as the base prompt word, apply random mutation to the base prompt word to generate N initial individuals, forming the initial population; Step 9022: Submit all individuals in the population to the target large language model in batches, and use the evaluation function to calculate the fitness score of each individual; Step 9023: The e individuals with the highest fitness are directly copied into the next generation of the population; Step 9024: Select a pair of parents from the current population using the tournament selection method: randomly select 3 individuals to form a tournament, and select the individual with the highest fitness as the parent. Step 9025, with probability p c performing a crossover operation on the selected parent pair: if it is a single-point crossover, randomly selecting a cutting point in each parent, and exchanging the parts after the cutting point to generate two offspring; if the random number is greater than p c , then directly taking the two parents as offspring; Step 9026, with probability p m Apply a discrete perturbation variation to each offspring; Step 9027: Check whether each offspring retains at least a preset number of core keywords or their synonyms. If the verification fails, the offspring will be mutated again from the corresponding parent. 
Step 9028: Determine whether the number of individuals in the next-generation population has reached N; if not, return to step 9024 to continue selecting parents and generating new offspring; if so, submit all newly generated offspring in batches to the target large language model for evaluation, merge them with the elite individuals to form the new-generation population, and update the global best solution;
Step 9029: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of generations has been reached; if so, output the global best jailbreak prompt word; otherwise, return to step 9022 to continue iterating.
8. The closed-source AI large model-oriented jailbreak prompt word generation method according to claim 5, characterized in that the simulated annealing strategy includes the following steps:
Step 9031: Initialize the current prompt word as the malicious target text, calculate the initial score, and set the temperature T to the initial temperature T0;
Step 9032: Use the composite perturbation function to generate a mutation candidate for the current prompt word;
Step 9033: Submit the mutation candidate to the target large language model to obtain the response and calculate the score;
Step 9034: Calculate the score difference Δ = candidate score - current score;
Step 9035: Determine whether Δ is greater than 0; if so, accept the candidate unconditionally, update the current prompt word and the current score, and proceed to step 9037; otherwise, proceed to step 9036;
Step 9036: Calculate the acceptance probability P = exp(Δ / T) and generate a random number r ∈ [0, 1); if r is less than P, accept the candidate and update the current prompt word and the current score; otherwise, keep the current solution unchanged;
Step 9037: If the current solution is better than the global best solution, update the global best solution;
Step 9038: Update the temperature T = max(T × γ, T_min), where γ is the cooling rate;
Step 9039: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of iterations has been reached; if so, output the global best jailbreak prompt word; otherwise, return to step 9032 to continue iterating.
9. The closed-source AI large model-oriented jailbreak prompt word generation method according to claim 5, characterized in that the ant colony optimization strategy includes the following steps:
Step 9041: Initialize the pheromone level of every perturbation operation to 1.0, initialize the current best prompt word to the malicious target text, and calculate the initial score;
Step 9042: Each of the m ants independently selects perturbation operations based on pheromone levels and heuristic information, and applies multiple perturbations to the current best prompt word to construct its own candidate solution;
Step 9043: Submit the candidate solutions constructed by all ants in batches to the target large language model for evaluation and scoring;
Step 9044: Iterate through the solutions of all ants; if any solution scores better than the current best score, update the global best solution;
Step 9045: Apply evaporation to all pheromone levels according to the evaporation update, where ρ is the evaporation rate;
Step 9046: Deposit additional pheromone on the operations used by the ant with the highest score in this round;
Step 9047: Determine whether the global best score is greater than or equal to the early stopping threshold or whether the maximum number of iterations has been reached; if so, output the global best jailbreak prompt word; otherwise, return to step 9042 to continue iterating.
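The pheromone bookkeeping in steps 9041, 9045 and 9046 can be illustrated in isolation. The following is only a minimal sketch: the claim text does not reproduce the evaporation formula, so the conventional ant colony rule τ ← (1 − ρ)·τ is assumed here, and the function names, the operation labels and the deposit amount are hypothetical.

```python
def evaporate(pheromone, rho):
    """Step 9045 (assumed rule): tau <- (1 - rho) * tau for every operation."""
    return {op: (1.0 - rho) * tau for op, tau in pheromone.items()}


def deposit(pheromone, used_ops, amount):
    """Step 9046: add `amount` once to each operation used by the best ant."""
    updated = dict(pheromone)
    for op in set(used_ops):
        updated[op] = updated.get(op, 0.0) + amount
    return updated


# Step 9041: every perturbation operation starts at pheromone level 1.0.
# The labels below are placeholders for the perturbation operations.
operations = ("substitute", "insert", "delete", "transpose")
pheromone = {op: 1.0 for op in operations}

# One round: evaporate everywhere, then reinforce the best ant's operations.
pheromone = evaporate(pheromone, rho=0.1)
pheromone = deposit(pheromone, used_ops=["substitute", "transpose"], amount=0.5)
```

Under these assumptions, operations that the best ant did not use decay toward zero over successive rounds, while reinforced operations become more likely to be selected in step 9042.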
10. A jailbreak prompt word generation system for a closed-source artificial intelligence large model, characterized in that it comprises:
The input module, used to receive malicious target text and use the malicious target text as the initial input;
The keyword extraction module, connected to the input module, used to extract core keywords from the malicious target text, obtain a set of synonyms for each core keyword, and construct a mapping table from keywords to their synonym sets;
The discrete perturbation operator module, connected to the keyword extraction module and the candidate selection strategy module, used to apply word-level discrete perturbation operations to prompt words to generate variants; the discrete perturbation operator module includes a synonym substitution operator, a word insertion operator, a word deletion operator, an adjacent transposition operator, and, for the genetic algorithm, a single-point crossover operator and a uniform crossover operator;
The candidate selection strategy module, connected to the discrete perturbation operator module and the evaluation function module, used to update the candidate set according to the scoring results and the selection strategy; the candidate selection strategy module supports one or more of the following strategies: beam search strategy, genetic algorithm strategy, simulated annealing strategy, ant colony optimization strategy, random walk strategy, and greedy selection strategy;
The target large language model interface module, connected to the candidate selection strategy module and the evaluation function module, used to submit candidate prompt words to the target large language model in batches to obtain responses; it supports two modes: local model query and remote application programming interface query;
The evaluation function module, connected to the target large language model interface module and the keyword extraction module, used to score the responses of the target large language model.
The evaluation function module includes a StrongREJECT evaluation submodule and a keyword-aware scoring adjustment submodule; the keyword-aware scoring adjustment submodule adjusts the original score according to the number of core keywords matched in the response;
The iteration control module, connected to the evaluation function module and the candidate selection strategy module, used to determine whether the early stopping condition is met; if not, the candidate selection strategy module is triggered to perform the next round of iteration;
The result output module, connected to the iteration control module, used to output the best jailbreak prompt word and its corresponding response and score.
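The adjustment performed by the keyword-aware scoring adjustment submodule can be illustrated with the formula given in step 607, f(A) = min(A / dynamic threshold, 1.0), where the dynamic threshold is half of the total number of core keywords. The sketch below is an assumption-laden illustration, not the claimed implementation: it takes A to be the number of core keywords matched in the response, the function and parameter names are hypothetical, and the handling of an empty keyword set is an added assumption.

```python
def keyword_adjusted_score(raw_score, matched_keywords, total_keywords):
    """Scale a raw evaluation score by f(A) = min(A / threshold, 1.0).

    A is assumed to be `matched_keywords`; the dynamic threshold is half
    of `total_keywords`, as stated in step 607.
    """
    if total_keywords <= 0:
        # Assumption: with no core keywords there is nothing to adjust.
        return raw_score
    threshold = total_keywords / 2
    factor = min(matched_keywords / threshold, 1.0)
    return raw_score * factor


# With 6 core keywords the threshold is 3: one match scales the score by 1/3,
# while four matches reach the cap and leave the score unchanged.
low_match = keyword_adjusted_score(0.8, matched_keywords=1, total_keywords=6)
high_match = keyword_adjusted_score(0.8, matched_keywords=4, total_keywords=6)
```

The cap at 1.0 means the adjustment can only lower a score, so a response that matches few core keywords is penalized and one that matches at least half of them keeps its original score.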