A model security defense method and system
A malicious feature model is constructed to calculate the similarity between the real-time output Logits sequence of a large language model and malicious output Logits sequences, and the generation of harmful content is intervened in real time. This addresses the inability of existing technologies to cope with unknown threats and achieves efficient, proactive security defense.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU MORESEC TECH CO LTD
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-12
AI Technical Summary
Existing security defense methods for large language models cannot effectively defend against attacks launched through "out-of-distribution inputs" or "unknown inducement methods". They are passive and lagging, and they struggle to deal with unknown threats.
A malicious feature model is constructed and used to calculate the similarity between the real-time output Logits sequence and the malicious output Logits sequences. A classifier determines the risk status of the model output, and real-time intervention, including output softening and forced truncation, is performed when the risk is high.
The method achieves proactive defense against unknown threats, identifies and blocks the generation of harmful content in real time, improves the security robustness of the model, and reduces both the false alarm rate and the computational overhead.
Figure CN122197004A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of model security defense technology, and in particular relates to a model security defense method and system.
Background Technology
[0002] Currently, security defenses for Large Language Models (LLMs) primarily take two approaches: content filtering and safety alignment training. Content filtering, which covers techniques such as keyword matching and classifiers, relies heavily on surface-level analysis of the input / output text and is therefore easily bypassed through adversarial perturbations and semantic obfuscation. Safety alignment training, such as RLHF, generalizes poorly and easily fails against attacks that fall outside the training data distribution, such as novel jailbreak attacks. Neither approach effectively defends against attacks launched through "out-of-distribution inputs" or "unknown inducements"; both are passive, lagging defenses that are ill-suited to unknown threats.
[0003] Therefore, there is an urgent need for a security defense method and system for models that solves the problems in existing technologies.
Summary of the Invention
[0004] The purpose of this invention is to provide a security defense method and system for a model. By using a malicious feature model to calculate the similarity between malicious output Logits sequences and the real-time output Logits sequence, it solves the problems noted in the background art: existing model security defenses are passive and lagging, and they struggle to cope with unknown threats.
[0005] To solve the above-mentioned technical problems, the specific technical solution of the present invention is as follows:
[0006] A security defense method for a model includes the following steps:
[0007] Obtain a malicious feature model; wherein the malicious feature model is used to calculate the similarity between the malicious output Logits sequence and the real-time output Logits sequence;
[0008] Obtain the real-time output Logits sequence of the target model;
[0009] By using a malicious feature model, it is determined whether the real-time output of the target model is in a high-risk state; if it is determined to be in a high-risk state, real-time intervention is performed.
[0010] Furthermore, the malicious feature model includes a classifier and a malicious feature sequence template;
[0011] The malicious feature sequence template is obtained based on the malicious output Logits sequence;
[0012] The classifier is used to generate a comprehensive anomaly score based on the scoring features of the real-time output Logits sequence;
[0013] The scoring features of the real-time output Logits sequence include a distance set and global statistical features;
[0014] The distance set is obtained through the following steps:
[0015] Extract multidimensional features from each Logits vector in the real-time output Logits sequence to convert the real-time output Logits sequence into a real-time feature time series;
[0016] Calculate the feature distance between the real-time feature time series and each malicious feature sequence template to obtain the distance set.
[0017] Furthermore, the scoring features of the real-time output Logits sequence also include: context features; the context features are context signals used to enhance detection accuracy.
[0018] The determination of whether the real-time output of the target model is in a high-risk state includes:
[0019] The actual threshold is dynamically adjusted based on the context. When the comprehensive anomaly score is greater than the actual threshold, the output is judged to be in a high-risk state.
[0020] Furthermore, the malicious feature model is obtained through the following steps:
[0021] Construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts;
[0022] Based on the malicious output variant library, obtain the malicious output Logits sequence when the model generates malicious output;
[0023] A malicious feature model is constructed based on the malicious output Logits sequence.
[0024] Furthermore, the construction of the malicious output variant library includes the following steps:
[0025] Obtain several malicious output seeds;
[0026] Based on the malicious output seed, generate semantically equivalent candidate variants;
[0027] Perform semantic verification on semantically equivalent candidate variants;
[0028] For the semantically equivalent candidate variants that pass the verification, reverse translation or local editing based on the thesaurus is performed to generate semantically equivalent secondary variants;
[0029] Perform semantic verification on the semantically equivalent secondary variants;
[0030] A malicious output variant library is obtained based on malicious output seeds, verified semantically equivalent candidate variants, and verified semantically equivalent secondary variants.
[0031] Furthermore, the malicious feature sequence template is obtained through the following steps:
[0032] Extract multidimensional features from each Logits vector in the malicious output Logits sequence to convert the malicious output Logits sequence into a malicious feature time series;
[0033] Cluster all malicious feature time series to obtain several malicious feature sequence templates;
[0034] The multidimensional features include Top-k probabilities, probability distribution entropy, target concept activation, and distribution skewness / kurtosis.
[0035] Furthermore, the training features of the classifier include: the feature distance between each malicious feature time series and each malicious feature sequence template, and the global statistical features of the malicious feature time series.
[0036] A security defense system for a model, comprising:
[0037] The online defense module is used to obtain the real-time output Logits sequence of the target model; and to determine whether the real-time output of the target model is in a high-risk state through the malicious feature model; if it is determined to be in a high-risk state, real-time intervention is performed.
[0038] Furthermore, it also includes:
[0039] A preparation module is used to construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts;
[0040] The modeling module is used to obtain the malicious output logits sequence when the model generates malicious output based on the malicious output variant library; and to construct a malicious feature model based on the malicious output logits sequence.
[0041] A computer program product includes a computer program that, when executed by a processor, implements the steps of the method.
[0042] The present invention has the following advantages:
[0043] Unlike traditional methods that filter at the input / output text level, this application focuses on the core decision signal when the model generates output—output layer logits. By monitoring the real-time output logit sequence when the model generates text and calculating its similarity to malicious output logit sequences, the risk of the model being in a "malicious output state" can be identified before the harmful content is fully output, and real-time intervention can be performed. This method does not rely on the specific form of attack input, but focuses on the inherent generation pattern of malicious output. This allows it to identify the abnormal internal decision-making patterns exhibited by the model when generating harmful content, thereby effectively intercepting unknown malicious inducements, improving the model's security robustness, and effectively responding to unknown threats.
[0044] Other features and advantages of the present invention will be disclosed in detail in the following detailed description and accompanying drawings.
Attached Figure Description
[0045] Figure 1 is a schematic diagram of the overall process of this application;
[0046] Figure 2 is a flowchart of the process of building the malicious feature model.
Detailed Implementation
[0047] To better understand the purpose, structure, and function of this invention, the invention will be described in further detail below with reference to the accompanying drawings.
[0048] A security defense method for a model, as shown in Figure 1, includes the following steps:
[0049] S1: Obtain a malicious feature model; wherein the malicious feature model is used to calculate the similarity between the malicious output Logits sequence and the real-time output Logits sequence;
[0050] S2: Obtain the real-time output Logits sequence of the target model;
[0051] S3: Use the malicious feature model to determine whether the real-time output of the target model is in a high-risk state; if it is determined to be in a high-risk state, then perform real-time intervention.
[0052] The malicious feature model includes a classifier and a malicious feature sequence template. As shown in Figure 2, the malicious feature model is obtained through the following steps:
[0053] S11: Construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts;
[0054] S12: Based on the malicious output variant library, obtain the malicious output Logits sequence when the model generates malicious output;
[0055] S13: Construct a malicious feature model based on the malicious output Logits sequence.
[0056] Wherein, S11 and S12 are the offline preparation stages of this application; S13 is the offline modeling stage of this application; and S2 and S3 are the online defense stages of this application. Optionally, the offline modeling stage can serve as an independent security configuration center, and the online defense stage can serve as a lightweight plugin or reverse proxy, decoupled from the model service.
[0057] In this embodiment, S11, which constructs a malicious output variant library, includes the following steps:
[0058] S111: Obtain several malicious output seeds;
[0059] The malicious output seed is obtained through known malicious output text, specifically referring to a series of known, typical, and identified large language model output texts that violate security policies or are harmful.
[0060] In this embodiment, the plurality of malicious output seeds are represented by a malicious output seed set S={M1,M2,...}.
[0061] S112: Generate semantically equivalent candidate variants based on malicious output seeds;
[0062] The method for generating semantically equivalent candidate variants includes generation via a prompt-based large language model.
[0063] Specifically, in this embodiment, the text of the malicious output seed is combined with an instruction selected from a predefined "rewrite policy instruction set" and input into a secure large language model to generate semantically equivalent candidate variants.
[0064] The instructions in the rewrite policy instruction set include: style shifting, such as "using an academic tone"; perspective shifting, such as "from a historian's perspective"; structural transformation, such as "changing a list into a paragraph"; and word substitution, such as "using synonyms".
[0065] Optionally, a conditional variational autoencoder (CVAE) can be used to sample and generate semantically equivalent candidate variants in the latent semantic space, or a combination of traditional rule-based sentence reconstruction and synonym replacement tools can be used to generate semantically equivalent candidate variants.
[0066] S113: Perform semantic verification on semantically equivalent candidate variants; in this embodiment, the semantic verification method is to compare the cosine similarity of the semantic embedding vectors of the semantically equivalent candidate variants and the corresponding malicious output seeds. If the cosine similarity of the semantic embedding vectors is lower than the threshold T1, the variant is discarded. In this embodiment, the threshold T1 is 0.85.
[0067] S114: For the semantically equivalent candidate variants that pass the verification, perform reverse translation or local editing based on the thesaurus to generate semantically equivalent secondary variants.
[0068] In this embodiment, reverse translation (back-translation) means first translating the Chinese text into French, and then translating the French text back into Chinese.
[0069] S115: Perform semantic verification on the semantically equivalent secondary variants; wherein the semantic verification method is to compare the cosine similarity of the semantic embedding vectors of the semantically equivalent secondary variant and the corresponding malicious output seed. If the cosine similarity is lower than the threshold T1, the variant is discarded.
[0070] S116: Based on the malicious output seed, the verified semantically equivalent candidate variants, and the verified semantically equivalent secondary variants, a malicious output variant library V is obtained. In this embodiment, the malicious output variant library includes several malicious output texts; the malicious output texts include the malicious output seed, the semantically verified semantically equivalent candidate variants, and the semantically verified semantically equivalent secondary variants.
[0071] In summary, in this embodiment, for each malicious output seed Mi, the above-described combination algorithm is used to generate n high-quality, diverse, semantically equivalent, and validated semantic variants {M_i1, ..., M_in}. The malicious output variant library V = {Mi, M_i1, ..., M_in}.
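The semantic verification in S113 and S115 can be sketched as a cosine-similarity filter. This is a minimal sketch under stated assumptions: the `embed` callable is a hypothetical stand-in for whatever semantic embedding model is used (the application does not name one), and only the threshold T1 = 0.85 is taken from this embodiment.

```python
import math
from typing import Callable, List, Sequence

T1 = 0.85  # similarity threshold stated in this embodiment


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_variants(
    seed: str,
    variants: Sequence[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    threshold: float = T1,
) -> List[str]:
    """Discard variants whose similarity to the seed is lower than the threshold."""
    seed_vec = embed(seed)
    return [v for v in variants if cosine_similarity(seed_vec, embed(v)) >= threshold]
```

The same filter would be applied twice: once to the candidate variants and once to the secondary variants, each time against the originating seed.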
[0072] S12: Based on the malicious output variant library, obtain the malicious output Logits sequence when the model generates malicious output, including the following steps:
[0073] S121: Collect the Logits sequence of each malicious output text in the malicious output variant library to obtain the malicious output Logits sequence; wherein, the Logits sequence includes several Logits vectors.
[0074] In this embodiment, each malicious output text in the malicious output variant library is input into the target model. The target model is required to generate malicious output text in an autoregressive manner and completely record the output layer Logits vector for each time step / Token, so as to obtain several malicious output Logits sequences L_M'=[l_1,l_2,...,l_T] corresponding to each malicious output text. Where T is the number of time steps or Tokens.
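One plausible reading of this collection step is a token-by-token replay of each known text, recording the output-layer Logits vector at every position. The sketch below assumes that reading; `next_token_logits` is a hypothetical callable standing in for the target model's forward pass, since the application does not specify a model interface.

```python
from typing import Callable, List, Sequence

Logits = List[float]


def collect_logits_sequence(
    token_ids: Sequence[int],
    next_token_logits: Callable[[Sequence[int]], Logits],  # hypothetical model hook
) -> List[Logits]:
    """Replay a known text and record one Logits vector per time step.

    At step t the model is given the first t tokens of the text and returns
    the Logits over the vocabulary for the next position.
    """
    sequence: List[Logits] = []
    for t in range(len(token_ids)):
        prefix = token_ids[:t]  # tokens treated as already generated
        sequence.append(list(next_token_logits(prefix)))
    return sequence
```

The returned list has T entries, matching L_M' = [l_1, l_2, ..., l_T] for a text of T tokens.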
[0075] S13: Construct a malicious feature model based on the malicious output Logits sequence, including the following steps:
[0076] S131: Extract the multidimensional features of each Logits vector l_t in the malicious output Logits sequence to form a low-dimensional feature vector f_t.
[0077] The multidimensional features include Top-k probabilities, probability distribution entropy, target concept activation, and distribution skewness / kurtosis.
[0078] Specifically, the Top-k probability sum is the sum of the probabilities the model assigns to the k most likely candidate words at the current step, which reflects the model's confidence.
[0079] The probability distribution entropy is the information entropy of the probability distribution derived from the Logits vector, reflecting the certainty of decision-making.
[0080] The activation of the target concept is the maximum cosine similarity between the Logits vector and a set of malicious core concept embedding vectors. These core concepts are clustered from the semantic embeddings of the variant library V. The malicious core concept embedding vectors refer to a set of anchor vectors in the semantic space that represent different categories of malicious intent. These are not manually defined but are obtained by extracting the centers / centroids of each cluster after unsupervised semantic clustering of the malicious variant library V.
[0081] The distribution skewness / kurtosis is the statistical moment of the Logits vector probability distribution.
[0082] S132: Based on the low-dimensional feature vector f_t, convert each malicious output Logits sequence into a corresponding malicious feature time series F_M' = [f_1, f_2, ..., f_T].
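The per-step feature extraction can be sketched as follows. This is an illustrative sketch, not the application's implementation: it assumes the probability distribution is obtained by softmax, that skewness and kurtosis are computed over the probability values themselves, and that k = 5. Target concept activation is omitted because it depends on concept embeddings that are not specified here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_features(logits: Sequence[float], k: int = 5) -> List[float]:
    """Map one Logits vector l_t to [top-k sum, entropy, skewness, excess kurtosis]."""
    probs = softmax(logits)
    top_k_sum = sum(sorted(probs, reverse=True)[:k])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    n = len(probs)
    mean = sum(probs) / n
    var = sum((p - mean) ** 2 for p in probs) / n
    if var == 0.0:
        skewness, kurtosis = 0.0, 0.0  # flat distribution: report 0 by convention
    else:
        std = math.sqrt(var)
        skewness = sum(((p - mean) / std) ** 3 for p in probs) / n
        kurtosis = sum(((p - mean) / std) ** 4 for p in probs) / n - 3.0
    return [top_k_sum, entropy, skewness, kurtosis]


def to_feature_series(logits_sequence: Sequence[Sequence[float]], k: int = 5) -> List[List[float]]:
    """Convert a Logits sequence [l_1..l_T] into a feature time series [f_1..f_T]."""
    return [extract_features(l, k) for l in logits_sequence]
```

A sharply peaked Logits vector yields a high top-k sum and low entropy, while a flat one yields the opposite, which is the contrast the feature time series is meant to capture.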
[0083] S133: Cluster all malicious feature time series {F_M'} to obtain K representative malicious feature sequence templates {F_mal_1,...,F_mal_K}. Assuming there are dozens of main, distinguishable types of malicious generation patterns, K=20 can be set.
[0084] The clustering methods include K-Means, which is an existing technique and will not be described in detail in this application.
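As an illustration of the template clustering, the sketch below runs plain K-Means over feature time series. The application does not say how series of different lengths are compared during clustering, so this sketch assumes each series is first resampled to a fixed number of steps; the resampling length, the deterministic initialisation, and the function names are all assumptions made for this example.

```python
from typing import List, Sequence

Series = List[List[float]]  # one feature vector f_t per time step


def resample(series: Series, length: int) -> List[float]:
    """Take `length` evenly spaced steps from a series and flatten them."""
    steps = len(series)
    flat: List[float] = []
    for j in range(length):
        flat.extend(series[min(steps - 1, (j * steps) // length)])
    return flat


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_templates(all_series: Sequence[Series], k: int,
                     length: int = 8, iterations: int = 20) -> List[Series]:
    """Cluster resampled series with plain K-Means; return K centroid series."""
    dim = len(all_series[0][0])
    points = [resample(s, length) for s in all_series]
    # Deterministic initialisation: k evenly spaced points from the data set.
    centroids = [list(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[List[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Reshape each flat centroid back into `length` feature vectors.
    return [[c[j * dim:(j + 1) * dim] for j in range(length)] for c in centroids]
```

Each returned centroid is itself a short feature time series, so it can be compared against a live series with a sequence distance such as DTW.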
[0085] S134: Construct a classifier, wherein the training features of the classifier include: the DTW distance between each malicious feature time series and each malicious feature sequence template, and the global statistical features of the malicious feature time series. In the training data, positive samples are the malicious feature time series of malicious output texts, and negative samples are feature time series extracted from normal corpora using the same method. In this embodiment, the classifier is LightGBM. The global statistical features are the mean, variance, etc. of each dimension, which are existing techniques and can be set as needed.
[0086] S135: Based on the malicious feature sequence templates and the classifier, a malicious feature model P_malicious is obtained. In this embodiment, the malicious feature model includes K feature sequence templates and a trained classifier.
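The classifier's input row can be illustrated as the concatenation of the template distances and the global statistical features. The sketch below assumes per-dimension mean and population variance as the global statistics (the text lists "mean, variance, etc.") and takes the distances as already computed; the function names are hypothetical, and the classifier itself (LightGBM in this embodiment) is not shown.

```python
from typing import List, Sequence


def global_statistics(series: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension mean and variance of a feature time series, concatenated."""
    n = len(series)
    dims = len(series[0])
    means = [sum(step[d] for step in series) / n for d in range(dims)]
    variances = [sum((step[d] - means[d]) ** 2 for step in series) / n for d in range(dims)]
    return means + variances


def build_feature_row(
    distances: Sequence[float],           # one distance per template
    series: Sequence[Sequence[float]],    # the feature time series being scored
    context: Sequence[float] = (),        # optional scalar context features
) -> List[float]:
    """Assemble one classifier input row: distances, global statistics, context."""
    return list(distances) + global_statistics(series) + list(context)
```

Using the same row layout for training and for online scoring keeps the offline and online stages consistent.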
[0087] The step S2, which acquires the real-time output Logits sequence of the target model, specifically includes the following steps:
[0088] When the target model generates a response, its output Logits sequence L_live is collected synchronously to obtain the real-time output Logits sequence.
[0089] In step S3, the malicious feature model is used to determine whether the real-time output of the target model is in a high-risk state, including the following steps:
[0090] S31: Convert the real-time output Logits sequence into a real-time feature time series F_live; wherein the conversion method of the real-time feature time series F_live is the same as that of the malicious feature time series. The conversion method of multidimensional features is also the same as that of the malicious feature time series.
[0091] S32: Calculate the feature distance between the real-time feature time series F_live and each malicious feature sequence template F_mal_i obtained in the offline stage, and obtain the distance set {d_i}.
[0092] Optionally, the feature distance can be the Dynamic Time Warping (DTW) distance, the Longest Common Subsequence (LCSS) distance, or the edit distance, or the cosine similarity can be calculated directly after encoding the feature sequence into a vector using a pre-trained sequence encoder (such as LSTM or Transformer).
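Of the distance options listed, DTW can be sketched compactly. The following is a textbook dynamic-programming DTW with Euclidean distance between per-step feature vectors and no warping-window constraint; these choices are assumptions for illustration, as the application does not fix them.

```python
import math
from typing import List, Sequence

Series = Sequence[Sequence[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(series_a: Series, series_b: Series) -> float:
    """Dynamic Time Warping distance between two feature time series."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i steps of A with the first j of B
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclidean(series_a[i - 1], series_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def distance_set(live: Series, templates: Sequence[Series]) -> List[float]:
    """Distance from the live series F_live to each template F_mal_i."""
    return [dtw_distance(live, t) for t in templates]
```

Because DTW allows steps to be stretched, two series that follow the same pattern at different speeds still receive a small distance, which suits generations of differing length.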
[0093] S33: Input the distance set {d_i}, the global statistical features of the real-time feature time series F_live, and the context features of the current dialogue into the offline-trained classifier. The classifier outputs a comprehensive anomaly score S between 0 and 1.
[0094] Optionally, the classifier may include an unsupervised anomaly detection algorithm, such as an anomaly scoring classifier, a one-class support vector machine (SVM), or an isolation forest. When resources are limited, a simple distance-weighted average and threshold comparison may also be used.
[0095] The context features of the current dialogue include a risk score for the user input, which is an optional context signal used to enhance detection accuracy. As a scalar feature, it is input into the final classifier along with the DTW distance features and the sequence statistical features to jointly determine the final "comprehensive anomaly score S". Optionally, the risk score for the user input can be a distilled continuous risk score, etc., which is an existing technique and will not be elaborated upon in this application.
[0096] S34: Set a base threshold θ_base. Dynamically adjust the actual threshold θ based on the context, for example: θ = θ_base - λ * (user input risk score). When the user input is suspicious, the threshold is lowered to increase sensitivity.
[0097] Judgment: If S > θ, then it is determined to be a high-risk state.
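The threshold adjustment and the final judgment can be written out directly. In the sketch below the formula θ = θ_base - λ * (user input risk score) and the rule S > θ come from the text; the default values θ_base = 0.7 and λ = 0.2 are purely illustrative assumptions.

```python
def actual_threshold(theta_base: float, lam: float, input_risk: float) -> float:
    """theta = theta_base - lambda * (user input risk score)."""
    return theta_base - lam * input_risk


def is_high_risk(score: float, theta_base: float = 0.7,
                 lam: float = 0.2, input_risk: float = 0.0) -> bool:
    """High-risk state if the comprehensive anomaly score S exceeds theta."""
    return score > actual_threshold(theta_base, lam, input_risk)
```

With these illustrative values, a score of 0.65 is not flagged for a benign input (threshold 0.7) but is flagged when the input risk score is 1.0 (threshold about 0.5).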
[0098] If a high-risk state is determined in S3, real-time intervention is performed, including the following steps:
[0099] If the situation is determined to be high-risk, immediately perform at least one of the following actions:
[0100] Output softening: Remove the high-risk top candidate words from the Logits at the current time step to steer the model's generation away from harmful content.
[0101] Forced truncation: Stop generation and return a default safe response.
[0102] Alarm source tracing: Trigger an alarm and save the complete Logits sequence, characteristics, and context of this interaction.
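Two of these interventions can be sketched as follows. This is a minimal sketch under stated assumptions: the set of high-risk token ids is taken as given (the application does not describe how it is chosen), removal is modelled by setting the Logits of those tokens to negative infinity, and the wording of the default response is hypothetical.

```python
from typing import List, Sequence, Set

DEFAULT_SAFE_RESPONSE = "Sorry, I can't help with that."  # hypothetical wording


def soften_logits(logits: Sequence[float], risky_token_ids: Set[int]) -> List[float]:
    """Output softening: mask risky candidates so they can no longer be selected."""
    return [float("-inf") if i in risky_token_ids else v for i, v in enumerate(logits)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


def force_truncate(partial_output: str, safe_response: str = DEFAULT_SAFE_RESPONSE) -> str:
    """Forced truncation: drop the partial output and return the safe response."""
    return safe_response
```

After softening, decoding continues from the modified Logits, so the next token is chosen among the remaining candidates rather than the generation being stopped outright.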
[0103] A security defense system for a model, comprising:
[0104] A preparation module is used to construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts;
[0105] The modeling module is used to obtain the malicious output logits sequence when the model generates malicious output based on the malicious output variant library; and to construct a malicious feature model based on the malicious output logits sequence.
[0106] The online defense module is used to obtain the real-time output Logits sequence of the target model; and to determine whether the real-time output of the target model is in a high-risk state through the malicious feature model; if it is determined to be in a high-risk state, real-time intervention is performed.
[0107] A computer program product includes a computer program that, when executed by a processor, implements the steps of the method.
[0108] In summary, this application has the following advantages:
[0109] (1) Active defense against unknown attacks: Based on the generation pattern of malicious output rather than the attack input, detection can effectively deal with new and unknown inducement methods.
[0110] (2) High accuracy and low false alarms: By analyzing the distribution pattern of Logits sequences (DTW distance, entropy, confidence), it is possible to better distinguish between "discussing danger" and "carrying out danger" and thus reduce false alarms.
[0111] (3) Real-time and efficient: Detection occurs during the generation process and can be intercepted in advance. The calculation focuses on the output layer Logits and lightweight feature comparison, with low overhead.
[0112] (4) Evolvability: The malicious variant library and feature model can be continuously expanded and iterated as new attack samples are discovered, and the defense capability evolves on its own.
[0113] It is understood that the present invention has been described through some embodiments, and those skilled in the art will recognize that various changes or equivalent substitutions can be made to these features and embodiments without departing from the spirit and scope of the invention. Furthermore, under the teachings of the present invention, these features and embodiments can be modified to adapt to specific situations and materials without departing from the spirit and scope of the invention. Therefore, the present invention is not limited to the specific embodiments disclosed herein, and all embodiments falling within the scope of the claims of this application are within the protection scope of the present invention.
[0114] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
Claims
1. A security defense method for a model, characterized in that, Includes the following steps: Obtain a malicious feature model; wherein the malicious feature model is used to calculate the similarity between the malicious output Logits sequence and the real-time output Logits sequence; Obtain the real-time output Logits sequence of the target model; By using a malicious feature model, it is determined whether the real-time output of the target model is in a high-risk state; if it is determined to be in a high-risk state, real-time intervention is performed.
2. The security defense method for the model according to claim 1, characterized in that, The malicious feature model includes a classifier and a malicious feature sequence template; The malicious feature sequence template is obtained based on the malicious output Logits sequence; The classifier is used to generate a comprehensive anomaly score based on the scoring features of the real-time output Logits sequence; The scoring features of the real-time output Logits sequence include distance set and global statistical features; The distance set is obtained through the following steps: Extract multidimensional features from each Logits vector in the real-time output Logits sequence to convert the real-time output Logits sequence into a real-time feature time series; Calculate the feature distance between the real-time feature time series and each malicious feature sequence template to obtain the distance set.
3. The security defense method for the model according to claim 2, characterized in that, The scoring features of the real-time output Logits sequence also include: context features; the context features are context signals used to enhance detection accuracy; The determination of whether the real-time output of the target model is in a high-risk state includes: The actual threshold is dynamically adjusted based on the context. When the comprehensive anomaly score is greater than the actual threshold, it is judged as a high-risk state.
4. The security defense method for the model according to any one of claims 1-3, characterized in that, The malicious feature model is obtained through the following steps: Construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts; Based on the malicious output variant library, obtain the malicious output Logits sequence when the model generates malicious output; A malicious feature model is constructed based on the malicious output Logits sequence.
5. The security defense method for the model according to claim 4, characterized in that, The construction of the malicious output variant library includes the following steps: Obtain several malicious output seeds; Based on the malicious output seed, generate semantically equivalent candidate variants; Perform semantic verification on semantically equivalent candidate variants; For the semantically equivalent candidate variants that pass the verification, reverse translation or local editing based on the thesaurus is performed to generate semantically equivalent secondary variants; Perform semantic verification on semantically equivalent secondary variants; A malicious output variant library is obtained based on malicious output seeds, verified semantically equivalent candidate variants, and verified semantically equivalent secondary variants.
6. The security defense method for the model according to claim 5, characterized in that, The malicious feature sequence template is obtained through the following steps: Extract multidimensional features from each Logits vector in the malicious output Logits sequence to convert the malicious output Logits sequence into a malicious feature time series; Cluster all malicious feature time series to obtain several malicious feature sequence templates; The multidimensional features include Top-k probabilities, probability distribution entropy, target concept activation, and distribution skewness / kurtosis.
7. The security defense method for the model according to claim 6, characterized in that, The training features of the classifier include: the feature distance between each malicious feature time series and each malicious feature sequence template, and the global statistical features of the malicious feature time series.
8. A security defense system for a model, characterized in that, it includes: The online defense module is used to obtain the real-time output Logits sequence of the target model; and to determine, through the malicious feature model, whether the real-time output of the target model is in a high-risk state; if it is determined to be in a high-risk state, real-time intervention is performed.
9. The security defense system for the model according to claim 8, characterized in that, Also includes: A preparation module is used to construct a malicious output variant library; wherein, the malicious output variant library includes several malicious output texts; The modeling module is used to obtain the malicious output logits sequence when the model generates malicious output based on the malicious output variant library; and to construct a malicious feature model based on the malicious output logits sequence.
10. A computer program product, comprising a computer program, characterized in that, When executed by a processor, the computer program implements the steps of the method according to any one of claims 1-7.