A semantic-structure collaborative watermark embedding method based on an autoregressive image generation model

By employing graph regularization and cross-modal alignment, the semantic consistency and robustness issues of watermark embedding in autoregressive image generation models are addressed. This achieves a balance between watermark concealment and generation quality without altering model parameters, thereby improving the accuracy of watermark detection and the visual fidelity of image generation.

CN122243712A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2026-03-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies cannot effectively embed watermarks in autoregressive image generation models, resulting in insufficient semantic consistency and robustness, and difficulty in maintaining generation quality.

Method used

We employ graph regularization and cross-modal alignment methods to learn structure-aware features by constructing a codebook KNN graph and a graph convolutional network. We combine cross-modal contrastive learning to achieve semantic-structure collaborative watermark embedding. We utilize red-green list grouping and vector quantization generative adversarial networks for watermark embedding and detection.

Benefits of technology

Without changing the parameters of the autoregressive model, a balance was achieved between the concealment, robustness and generation quality of the watermark, improving the accuracy of watermark detection and the visual fidelity of image generation.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a semantic-structural collaborative watermark embedding method for autoregressive image generation models, belonging to the field of generative artificial intelligence content security technology. Addressing the problems of existing methods' difficulty in adapting to autoregressive sequence generation structures, susceptibility to semantic drift, and insufficient robustness, this invention achieves watermark embedding through two core modules: 1) Graph-regularized structure-aware feature representation learning, capturing the internal topological structure of the codebook through KNN graph construction and GCN aggregation; 2) Global semantic awareness based on cross-modal alignment, achieving alignment and fusion of structural and semantic features through contrastive learning. The codebook index is divided according to a unified feature representation, and a probability bias is applied to specific groups of indices during autoregressive generation to embed the watermark. This invention effectively avoids semantic drift, improving the robustness of the watermark under various image perturbations while maintaining the visual consistency of the generated image. It can be used for copyright protection and content traceability of AI-generated images.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a watermark embedding method for autoregressive (AR) image generation models, belonging to the field of generative artificial intelligence content security technology.

Background Technology

[0002] In recent years, autoregressive image generation models have been widely used in the field of image and video generation. These models generate content based on a sequence prediction mechanism using tokens (discrete feature units). When deployed in open environments, the issues of copyright protection and traceability of the generated content become increasingly prominent, necessitating the support of effective watermark embedding technologies.

[0003] Existing watermarking methods are primarily designed for diffusion models, typically embedding watermarks through continuous latent space perturbations. However, these methods rely on continuous feature modeling mechanisms, making them difficult to directly apply to autoregressive models based on token prediction. Existing solutions for autoregressive models often employ index replacement or probability modulation for embedding. While these methods do not require additional training, they can easily affect the generation distribution during embedding and do not adequately consider the structural constraints between discrete codebook indices and cross-modal semantic consistency issues, making it difficult to balance watermark stability and generation quality.

[0004] Therefore, existing technologies lack a watermark embedding method that, within the token probability modeling framework of autoregressive models, applies structural constraints and semantic coordination while ensuring the semantic consistency and robustness of the generated tokens.

Summary of the Invention

[0005] Existing technologies do not adapt to the sequence generation structure of autoregressive models, and they neglect codebook structure constraints and cross-modal semantic associations. This leads to semantic drift, insufficient robustness, and poor semantic consistency in watermark embedding. To address these problems, this invention provides a semantic-structure co-aware watermark embedding method. This method achieves a balance between concealment, robustness, and semantic consistency without requiring parameter fine-tuning of the autoregressive image generation model. The training described below does not involve parameter updates of the autoregressive image generation model. The specific scheme is as follows:

[0006] 1. Graph Regularization-Based Structure-Aware Feature Representation Learning Module:

[0007] Step 1.1: Construct the codebook KNN graph. Given the discrete codebook V = {U₁, U₂, ..., Uₙ} of the autoregressive image generation model (Uᵢ∈Rᵈ, where d is the dimension of the codebook vector), the neighborhood set N(i) of each codebook index is determined by the K-Nearest Neighbor (KNN) algorithm, and the edge weight Wᵢⱼ is calculated based on cosine similarity to construct the codebook topology graph.
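The graph construction in Step 1.1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `build_knn_graph`, the clipping of negative similarities to zero, and the symmetrization of the graph are choices made here for the example.

```python
import numpy as np


def build_knn_graph(codebook: np.ndarray, k: int) -> np.ndarray:
    """Return a symmetric (n, n) weight matrix of a cosine-similarity KNN graph.

    codebook: (n, d) array of codebook vectors; k must be smaller than n.
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(codebook, axis=1, keepdims=True)
    unit = codebook / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is never its own neighbor

    n = sim.shape[0]
    # Column indices of the k most similar tokens per row: the set N(i).
    neighbors = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    weights = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    # Assumption: negative cosine similarities are clipped to zero weight.
    weights[rows, cols] = np.clip(sim[rows, cols], 0.0, None)
    # Assumption: keep an edge if either endpoint selected the other.
    return np.maximum(weights, weights.T)
```

Symmetrizing the weights is convenient when a graph Laplacian is needed later, but a directed KNN graph would also be consistent with the description.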

[0008] Step 1.2: Define the graph regularization objective function. Through the objective function

[0009] $$\min_{Z} \; \sum_{i=1}^{n} \lVert Z_i - \hat{U}_i \rVert_2^2 + \lambda \sum_{i=1}^{n} \sum_{j \in N(i)} W_{ij} \lVert Z_i - Z_j \rVert_2^2$$

[0010] a structure-aware feature representation Zᵢ (Zᵢ∈Rᵈ) is learned that balances original semantic preservation with structural proximity constraints. Here, Ûᵢ is the L2-normalized original codebook vector, λ>0 is the balancing hyperparameter, the first term is the reconstruction constraint, and the second term is the graph smoothing regularization term.
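To make the trade-off in Step 1.2 concrete, the sketch below evaluates an objective with a squared reconstruction term and a weighted squared-difference smoothing term, and computes its exact minimizer. The squared-norm form, the symmetric weight matrix, and the function names are assumptions for illustration. The closed-form solve is swapped in here only as a reference point; the source approximates the solution with a graph convolutional network instead.

```python
import numpy as np


def graph_reg_objective(Z, U_hat, W, lam):
    """sum_i ||Z_i - U_hat_i||^2 + lam * sum_ij W_ij ||Z_i - Z_j||^2 (assumed form)."""
    recon = np.sum((Z - U_hat) ** 2)
    diff = Z[:, None, :] - Z[None, :, :]            # pairwise differences, (n, n, d)
    smooth = np.sum(W * np.sum(diff ** 2, axis=2))  # weighted squared distances
    return float(recon + lam * smooth)


def solve_graph_reg(U_hat, W, lam):
    """Exact minimizer for symmetric W: (I + 2*lam*L) Z = U_hat, with L = D - W."""
    n = W.shape[0]
    laplacian = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(n) + 2.0 * lam * laplacian, U_hat)
```

The factor of 2 arises because the double sum counts every undirected edge twice. A larger λ pulls neighboring tokens closer together at the cost of drifting further from the original codebook vectors.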

[0011] Step 1.3: Aggregate structural information using a graph convolutional network. An L-layer graph convolutional network (GCN) is used to aggregate features, with the inter-layer propagation rule being

[0012] $$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$

[0013] where Ã=A+I (A is the adjacency matrix, I is the identity matrix), D̃ is the degree matrix of Ã, W⁽ˡ⁾ is the learnable parameter of layer l, σ is the activation function, and the final output is the structure-aware feature representation Z=H⁽ᴸ⁾.
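A forward pass through such a GCN can be sketched as follows. The sketch assumes the widely used symmetric normalization of Ã = A + I by its degree matrix and a ReLU activation on hidden layers; neither the activation nor the names `normalized_adjacency` and `gcn_forward` come from the source, and training of the weights is not shown.

```python
import numpy as np


def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    degree = A_tilde.sum(axis=1)          # > 0 thanks to the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(H0: np.ndarray, A: np.ndarray, weights: list) -> np.ndarray:
    """Apply len(weights) propagation layers and return Z = H^(L)."""
    A_hat = normalized_adjacency(A)
    H = H0
    for layer, W in enumerate(weights):
        H = A_hat @ H @ W
        if layer < len(weights) - 1:
            H = np.maximum(H, 0.0)        # ReLU on hidden layers (assumption)
    return H
```

Each layer mixes a token's features with those of its direct neighbors, so stacking L layers aggregates information from up to L hops away in the codebook graph.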

[0014] 2. Global semantic awareness module based on cross-modal alignment:

[0015] Step 2.1: Obtain cross-modal features. Extract semantic embeddings Sᵢ (Sᵢ∈Rᵈᵗ, where dₜ is the embedding dimension of the language model) from the text labels corresponding to the codebook index using a Generative Pre-trained Transformer (GPT). Simultaneously, input the structural feature representation Zᵢ output from step 1.3.

[0016] Step 2.2: Cross-modal contrastive learning alignment. Contrastive learning is used to align cross-modal features into a unified cross-modal shared feature space. The optimization objective is the InfoNCE loss:

[0017] $$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{D} \sum_{i=1}^{D} \log \frac{\exp\left(\varphi(Z_i, S_i)/T\right)}{\sum_{j=1}^{D} \exp\left(\varphi(Z_i, S_j)/T\right)}$$

[0018] where φ(·,·) is the similarity calculation function, T is the temperature hyperparameter, and D is the number of comparison samples. With this loss, the model learns that the similarity of cross-modal features of the same index in the cross-modal shared feature space is significantly higher than that of other indices.
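The contrastive objective of Step 2.2 can be illustrated with a small NumPy function. This is a sketch under assumptions: cosine similarity is used for φ, Z and S are taken to be already projected into a shared dimension (the source mentions a feature mapping network for this, which is not shown), and the name `info_nce` and the default temperature are hypothetical.

```python
import numpy as np


def info_nce(Z: np.ndarray, S: np.ndarray, temperature: float = 0.1) -> float:
    """Mean InfoNCE loss where (Z[i], S[i]) are positives and (Z[i], S[j]) negatives."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    logits = (Zn @ Sn.T) / temperature                   # phi(Z_i, S_j) / T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each positive pair.
    return float(-np.mean(np.diag(log_softmax)))
```

The loss is small when each structural feature is most similar to the semantic embedding of the same codebook index, and large when the pairing is scrambled.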

[0019] Step 2.3: Adaptive Weighted Fusion. Structural and semantic features are fused using a trainable fusion weight α (where α∈(0,1)), generating a unified feature representation Uᵢ = α·Zᵢ + (1-α)·Sᵢ that simultaneously characterizes the structural proximity and semantic consistency of codebook tokens. Based on the unified feature representation Uᵢ, the global similarity matrix S, with entries Sᵢⱼ = φ(Uᵢ,Uⱼ), is calculated. The global similarity matrix characterizes the semantic-structural joint similarity between codebook tokens and serves as the metric for subsequent red and green list sorting and grouping.
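The fusion and similarity computation of Step 2.3 can be sketched as below. Two choices here are assumptions rather than details from the source: the trainable weight is parameterized through a sigmoid so that α always stays in (0,1), and φ is taken to be cosine similarity. The function names are hypothetical.

```python
import numpy as np


def fusion_weight(raw_alpha: float) -> float:
    """Map an unconstrained trainable scalar to alpha in (0, 1) via a sigmoid."""
    return float(1.0 / (1.0 + np.exp(-raw_alpha)))


def fuse(Z: np.ndarray, S: np.ndarray, raw_alpha: float) -> np.ndarray:
    """Unified representation U_i = alpha * Z_i + (1 - alpha) * S_i."""
    alpha = fusion_weight(raw_alpha)
    return alpha * Z + (1.0 - alpha) * S


def global_similarity(U: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between all rows of U."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    return Un @ Un.T
```

Because Z and S are added element-wise, both must already live in the same aligned feature space with equal dimension.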

[0020] 3. Red and Green List Grouping Module: Based on the semantic-structural joint similarity reflected by the global similarity matrix, the codebook tokens are sorted and grouped to obtain a set of tokens (green list) used to carry watermark information and a set of tokens (red list) used as a probability comparison benchmark.

[0021] Step 3.1: The green list is the watermark carrier group. Tokens are sorted by score from highest to lowest, and the top-ranked tokens up to a preset ratio are selected as the green list. The preset conditions include structural proximity and a high semantic match with the input text prompt.

[0022] Step 3.2: The red list serves as the baseline group. It consists of the remaining tokens and is used as a probability comparison benchmark in the candidate index set during the watermark embedding process.
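The grouping in Steps 3.1 and 3.2 can be sketched as a ranking followed by a split. The source does not state how a per-token score is derived from the global similarity matrix, so the row mean used below is purely an assumption for illustration, as are the function name and the default ratio.

```python
import numpy as np


def split_green_red(similarity: np.ndarray, green_ratio: float = 0.5) -> np.ndarray:
    """Return a boolean mask over codebook tokens: True = green list, False = red list."""
    # Assumption: a token's score is its mean similarity to all codebook tokens.
    scores = similarity.mean(axis=1)
    order = np.argsort(-scores, kind="stable")   # highest score first
    n_green = int(green_ratio * len(scores))
    green_mask = np.zeros(len(scores), dtype=bool)
    green_mask[order[:n_green]] = True
    return green_mask
```

The mask is computed once and then reused unchanged for both embedding and detection, which matches the requirement that the grouping result stays fixed.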

[0023] 4. Watermark Embedding Module:

[0024] Step 4.1: Zero-bit watermark construction. A uniform increment Δ is applied to the logits (unnormalized log-probability values) of all tokens in the green list, raising their sampling probability and achieving the covert embedding of the zero-bit digital watermark. Δ is a positive bias value chosen so that generation quality does not decrease significantly. The zero-bit digital watermark is a redundant watermark achieved by modulating the probability distribution of autoregressive token generation, without requiring an additional carrier.

[0025] Step 4.2: Logits Adjustment. After the model encodes the input text prompt, it generates the index sequence token by token. At each step, after the logits of the candidate index set are calculated and before the softmax probability transformation, the logits of the green list tokens are increased by Δ.

[0026] Step 4.3: Watermark Embedding. The adjusted logits are converted into sampling probabilities using softmax, and the sampling probability of the green list token set is slightly increased compared to the original distribution.
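A single sampling step with the green-list bias of Steps 4.1 to 4.3 can be sketched as follows. The function name `watermark_sample` is hypothetical, and the logits are assumed to come from an autoregressive model that is not shown here.

```python
import numpy as np


def watermark_sample(logits, green_mask, delta, rng):
    """Add delta to green-list logits, apply softmax, and sample one index.

    Returns the sampled index and the adjusted probability vector.
    """
    biased = np.asarray(logits, dtype=np.float64) + delta * green_mask
    biased = biased - biased.max()        # stabilize the exponentials
    probs = np.exp(biased)
    probs /= probs.sum()
    index = int(rng.choice(len(probs), p=probs))
    return index, probs
```

With uniform logits over four tokens, two of them green, and Δ = 1, the green list's total probability mass rises from 0.5 to e/(e+1), roughly 0.73. A smaller Δ keeps the distribution closer to the original at the price of a weaker statistical signal.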

[0027] 5. Vector Quantized Generative Adversarial Network (VQ-GAN) Decoding Module: The watermarked index sequence output by the model is input into the VQ-GAN decoder.

[0028] Step 5.1: Codebook Mapping and Pixel Restoration. The index sequence is mapped to continuous feature vectors through a discrete codebook. After convolution and upsampling operations by the decoder, it is restored to an RGB image with the same resolution as the original generation process.
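The codebook mapping at the start of Step 5.1 amounts to a table lookup followed by a reshape. The sketch below assumes a raster-scan (row-major) token order and uses a hypothetical function name; the decoder's convolution and upsampling layers are not shown.

```python
import numpy as np


def indices_to_feature_map(indices, codebook: np.ndarray, height: int, width: int) -> np.ndarray:
    """Map a flat index sequence to a (height, width, d) grid of codebook vectors."""
    indices = np.asarray(indices)
    if indices.size != height * width:
        raise ValueError("sequence length must equal height * width")
    # Assumption: tokens are laid out in row-major (raster-scan) order.
    return codebook[indices].reshape(height, width, codebook.shape[1])
```

Since the watermark only changes which indices are sampled, this lookup and the decoder operate exactly as they would without a watermark.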

[0029] 6. Watermark Detection Module: The detection process is achieved by statistically analyzing the proportion of green list tokens in the index sequence.

[0030] Step 6.1: Image Encoding and Reconstruction. Input the RGB image to be detected into the VQ-GAN encoder to reconstruct the corresponding index sequence {I₁, I₂, ..., Iₖ} (k is the sequence length).

[0031] Step 6.2: Statistical Test. Count the number c of green list tokens in the index sequence and calculate their proportion r = c / k. Calculate the one-sided p-value p = P(B(k,P0)≥c), that is, the probability of observing at least c green list tokens in a sequence without a watermark, where P0 is the probability that a token falls in the green list when no watermark is present. If the p-value is lower than the preset, extremely low threshold τ, a watermark is determined to exist; otherwise, the result is determined to be uncertain or non-existent.
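The one-sided binomial test of Step 6.2 can be sketched in plain Python. The function names and the default threshold `tau=1e-6` are assumptions for illustration. The probability mass is summed in log space so that long index sequences do not overflow.

```python
from math import exp, lgamma, log


def binomial_tail(k: int, c: int, p0: float) -> float:
    """P(X >= c) for X ~ Binomial(k, p0), with 0 < p0 < 1."""
    total = 0.0
    for i in range(max(c, 0), k + 1):
        log_pmf = (lgamma(k + 1) - lgamma(i + 1) - lgamma(k - i + 1)
                   + i * log(p0) + (k - i) * log(1.0 - p0))
        total += exp(log_pmf)
    return min(total, 1.0)


def detect_watermark(indices, green_set, p0=0.5, tau=1e-6):
    """Return (watermark_detected, p_value) for a recovered index sequence."""
    k = len(indices)
    c = sum(1 for index in indices if index in green_set)
    p_value = binomial_tail(k, c, p0)
    return p_value < tau, p_value
```

For example, with k = 10, c = 8 and P0 = 0.5, the tail probability is (45 + 10 + 1) / 1024 ≈ 0.055, so ten tokens alone give only weak evidence. Longer sequences make the same green-list proportion far more significant.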

[0032] This semantic-structural collaborative watermarking embedding method based on an autoregressive image generation model utilizes graph regularization to capture the codebook topology, ensuring that the set of tokens used for watermark embedding maintains proximity within the codebook's topological space. As a result, the index sequence recovered after perturbation still has a high probability of falling into similar structural regions. Furthermore, cross-modal semantic alignment ensures that the set of tokens used for watermark embedding remains consistent with the input text prompts in the semantic space, thus reducing semantic drift. Simultaneously, contrastive learning ensures that the generated image quality does not significantly decrease after watermark embedding. The training process only affects the structural feature extraction and cross-modal alignment modules, without involving updates to the autoregressive image generation model parameters. Therefore, the method can be directly deployed within existing autoregressive image generation inference processes, avoiding changes to the inference chain and the model performance drift caused by retraining the generation model. This invention provides an innovative solution for watermarking in the field of autoregressive image generation models.

Attached Figure Description

[0033] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the relevant technologies will be introduced below. The accompanying drawings described below are merely embodiments of the present invention.

[0034] Figure 1 is a schematic diagram of the watermark embedding and detection framework for the AR model provided by this invention.

[0035] Figure 2 is a schematic diagram of the structure-aware feature representation learning module provided by this invention.

[0036] Figure 3 is a schematic diagram of the semantic-structural collaborative watermarking embedding method provided by the present invention.

Detailed Implementation Plan

[0037] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of the present invention. The present invention provides a semantic-structural collaborative watermarking embedding method based on an autoregressive image generation model. As shown in Figure 1, it includes a structure-aware feature representation learning module, a global semantic awareness module, a red-green list grouping module, a watermark embedding module, a VQ-GAN decoding module, and a watermark detection module. Its main steps include:

[0038] S101. As shown in Figure 2, the discrete codebook of the model is taken as input to construct a KNN adjacency graph and calculate edge weights. The GCN is initialized, trained based on the graph regularization loss, and outputs structure-aware features Z. As shown in Figure 3, the text semantic embedding S is extracted, a feature mapping network is trained with the cross-modal contrastive loss to align Z with S, and the two are finally weighted and fused to generate a unified feature representation U, thus completing the training.

[0039] S102. Based on the unified feature representation U, calculate the global similarity matrix between codebook tokens, and obtain the semantic-structural joint similarity score corresponding to each token from the global similarity matrix. Sort the codebook tokens according to the scores, select tokens that meet a preset ratio as the green list, and the rest as the red list. The grouping results are fixed for subsequent embedding and detection.

[0040] S103. Input text prompt T, the model encodes T, and starts generating the index sequence per token: First, calculate the logits distribution of the candidate index set for the current token. Then, apply a uniform logits increment Δ to all green-listed tokens in the candidate index set. Convert the adjusted logits into sampling probabilities using softmax, and sample to obtain the index of the current token. Repeat the above steps until a complete index sequence is generated. Input the watermarked index sequence into the VQ-GAN decoder, and decode to output a watermarked RGB image X'.

[0041] S104. Input the RGB image X' to be detected, preprocess it, and then input it into the VQ-GAN encoder to restore the index sequence {I₁, I₂, …, Iₖ}. Count the number of green list tokens in the sequence as c, and calculate the proportion r = c / k. Construct the binomial distribution B(k,0.5) and compute p = P(B(k,0.5)≥c). If p < 0.05, a watermark is determined to exist; otherwise, it is determined to be uncertain or non-existent.

[0042] S105. All experiments can be implemented on conventional deep learning frameworks and GPU computing platforms. The specific training process is as follows:

[0043] Algorithm: Training steps of the semantic-structural co-watermarking embedding algorithm based on the AR model:
Step 1: Learn the codebook structure-aware feature representation through the graph convolutional network.
Step 2: Align visual structure features with text semantic features through contrastive learning.
Step 3: Apply weighted fusion to generate a unified feature representation.
Step 4: Group the codebook tokens into red and green groups.
Step 5: Apply logits increments to the green list token set during the generation process.
Step 6: Use a binomial distribution test to statistically analyze the proportion of the green list token set.

[0044] This invention evaluates performance from two dimensions: in terms of visual fidelity, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and multi-scale structural similarity index (MS-SSIM) are used to evaluate image quality. Regarding watermark robustness, the invention calculates watermark detection accuracy (ACC) under six perturbation scenarios, including random cropping and Gaussian blur.
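Of the fidelity metrics named above, PSNR is the simplest to compute directly. The sketch below uses the standard definition, 10·log10(MAX² / MSE), and assumes 8-bit images with a peak value of 255. The function name is hypothetical, and SSIM and MS-SSIM are not shown.

```python
import numpy as np


def psnr(reference, test, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0.0:
        return float("inf")               # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values mean the watermarked image is closer to the reference image.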

[0045] We used the autoregressive image generation model watermarking framework WMAR as a baseline for comparative experiments, comparing the resulting watermarked RGB images with the watermarked images generated by the model in this embodiment:

[0046]
Model | PSNR↑ | SSIM↑ | MS-SSIM↑
WMAR | 21.79 | 0.8063 | 0.8858
Ours | 27.83 | 0.9183 | 0.9592

[0047] The generated images were evaluated under the following six conditions, five of which are simulated attacks: Normal extraction (no attack); Crop attack with a scaling ratio of 0.75 and an aspect ratio of 1.0 (Crop-0.75); Gaussian blur attack with a Gaussian kernel size of 11 and a standard deviation of 1.0 (Blur-11 / 1.0); Additive noise attack with a noise standard deviation ratio of 0.01 (Noise-0.01); Color jitter attack with a brightness jitter intensity of 0.5 (ColorJitter-0.5); and Random erasing attack with an erasure region ratio of 0.1 (RandomErasing-0.1).

[0048]
Model \ Attack | Clean | Crop | Blur | Noise | Bright | Erase
WMAR | 100.00 | 82.02 | 88.19 | 92.9 | 86.2 | 90.19
Ours | 100.00 | 85.50 | 91.03 | 94.26 | 90.13 | 93.73

[0049] The watermark extraction accuracy under different attacks obtained through experiments is shown in the table, indicating that the images generated by the model of this embodiment maintain higher extraction accuracy than the baseline when facing attacks.

[0050] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A semantic-structural collaborative watermarking embedding method based on an autoregressive image generation model, characterized in that it includes the following steps:
A. Obtain the codebook vectors of each codebook token (discrete feature unit) in the discrete codebook of the autoregressive image generation model. Use the K-Nearest Neighbor (KNN) algorithm based on the discrete codebook, together with graph regularization constraints, to perform structure-enhanced feature modeling, and output the structure-aware feature Z that aggregates multi-order neighborhood topological feature information.
B. Obtain the semantic embedding S of the corresponding text label extracted by the Generative Pre-trained Transformer (GPT), and align the structure-aware feature Z with the semantic embedding S by contrastive learning to a unified cross-modal shared feature space.
C. Fuse the aligned structure-aware features Z with the semantic embedding S to generate a unified feature representation U that simultaneously represents the structural proximity and semantic consistency of the codebook tokens.
D. Calculate the joint semantic-structure similarity between codebook tokens based on the unified feature representation U, and group the codebook tokens according to the similarity and a preset threshold. Apply a preset incremental bias to the logits of a specific token set and then sample them, thereby embedding watermark information during the generation of the index sequence. Inputting the index sequence into the decoder yields a watermarked image.

2. The semantic-structural collaborative watermarking embedding method for the autoregressive image generation model according to claim 1, characterized in that Step A further includes the following steps:
A1. Given a discrete codebook V corresponding to an autoregressive image generation model, construct the adjacency relationship of the codebook tokens using the K-nearest neighbor algorithm, determine the neighborhood set N(i) of token i, and calculate the edge weight Wᵢⱼ based on the cosine similarity.
A2. Construct a graph regularization objective function, using the L2-normalized original codebook vector as the reconstruction constraint term and the weighted neighborhood token feature difference as the graph smoothing regularization term.
A3. The graph regularization objective function is approximately solved by stacking L layers of graph convolutional network (GCN), where L is a positive integer, and the inter-layer propagation adopts the Laplacian smoothing rule. The final output is a structure-aware feature representation Z that aggregates L-order neighborhood information.

3. The semantic-structural collaborative watermarking embedding method for the autoregressive image generation model according to claim 1, characterized in that Step B further includes the following steps:
B1. Obtain the semantic embedding Sᵢ∈Rᵈᵗ of the text label corresponding to the codebook token from the word embedding layer of GPT.
B2. Construct a cross-modal contrastive learning objective function, taking the structure-aware feature Zᵢ and semantic embedding Sᵢ corresponding to the same codebook token as positive sample pairs, and cross-modal combinations of different tokens as negative sample pairs, to guide the model to learn that the similarity of positive sample pairs in the cross-modal shared feature space is significantly higher than that of negative sample pairs.
B3. By optimizing the contrastive learning objective function, the structure-aware features Z and semantic embeddings S are aligned and mapped to a unified cross-modal shared feature space.

4. The semantic-structural collaborative watermarking embedding method for the autoregressive image generation model according to claim 1, characterized in that Step C further includes the following steps:
C1. Adaptive weighted summation is performed on the aligned structure-aware features Zᵢ and semantic embedding Sᵢ using a trainable fusion weight α (where α∈(0,1)).
C2. Generate a unified feature representation Uᵢ = α·Zᵢ + (1-α)·Sᵢ, which simultaneously encodes the local topological structure and cross-modal semantic association of the codebook token.
C3. Calculate the global similarity matrix S, with entries Sᵢⱼ = φ(Uᵢ,Uⱼ), based on the unified feature representation Uᵢ. The global similarity matrix represents the semantic-structural joint similarity between codebook tokens and serves as a metric for grouping subsequent candidate index sets.

5. The semantic-structural collaborative watermarking embedding method for the autoregressive image generation model according to claim 1, characterized in that Step D further includes the following steps:
D1. In the autoregressive token generation stage, the candidate index set is sorted and grouped according to the semantic-structural joint similarity score corresponding to the global similarity matrix to obtain the token set (green list) used to carry watermark information and the token set (red list) used as the benchmark set.
D2. In the token generation stage of the autoregressive image generation model, an incremental bias Δ is applied to the logits (logarithmic probability values) of the green list token set. The logits are then converted to sampling probabilities using a softmax function to adjust the sampling probability distribution of the generated tokens, achieving the covert embedding of a zero-bit digital watermark. The generated index sequence with the watermark is then input into the decoder to obtain the watermarked image. The zero-bit digital watermark is a redundant watermark achieved without additional carriers and based on the probability distribution modulation of the autoregressive token generation.