Image watermark embedding method and device, electronic equipment and storage medium

By combining local pixel statistical features with multi-scale deep semantic priors in a dual-path perception model, adaptive energy allocation of digital image watermarks in complex images is achieved, solving the problem of difficulty in balancing visual imperceptibility and robustness in existing technologies, and improving the watermark embedding effect.

CN122243715A · Pending · Publication Date: 2026-06-19 · ANHUI XINGDUN INTELLIGENT TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ANHUI XINGDUN INTELLIGENT TECHNOLOGY CO LTD
Filing Date: 2026-05-20
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing digital image watermarking technologies, while balancing visual imperceptibility and robustness, lack an understanding of the high-level semantic content of images. This makes it difficult to achieve accurate and adaptive allocation of watermark energy in complex images, and thus cannot effectively overcome the bottleneck between imperceptibility and robustness.

Method used

An image watermarking embedding method based on a dual-path perception model is adopted. By combining local pixel statistical features with multi-scale deep semantic priors, a differentiated watermark embedding intensity map is generated, thereby achieving adaptive energy allocation for different local regions of the target image.

Benefits of technology

While preserving high visual quality in semantically sensitive areas, the method enhances robustness in areas with complex textures, improving the balance between watermark imperceptibility and robustness.



Abstract

This invention provides an image watermark embedding method, apparatus, electronic device, and storage medium, belonging to the field of image processing technology. The method includes: generating an initial watermark perturbation signal based on a target image and watermark information to be embedded; generating a basic embedding intensity map based on local pixel statistical features; extracting multi-scale depth features and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map; fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, which indicates the differentiated watermark embedding intensity in different local regions of the target image; and modulating the perturbation signal using the target embedding intensity map and superimposing it onto the target image to generate a watermarked image. This invention achieves precise energy allocation through dual-path collaborative fusion of low-level pixel statistics and high-level semantic priors, thereby ensuring extremely high imperceptibility in sensitive areas and enhancing robustness in areas with complex textures, effectively overcoming the bottleneck of the difficulty in simultaneously achieving watermark imperceptibility and robustness.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to an image watermark embedding method, apparatus, electronic device, and storage medium.

Background Technology

[0002] The core objective of digital image watermarking technology is to embed copyright information or traceability information into a carrier image without affecting human visual perception, and to ensure accurate extraction even after the image has undergone various signal processing or malicious attacks. In complex real-world applications, the embedded watermark must effectively resist complex attacks such as compression, tampering, and geometric transformations. Therefore, achieving a perfect balance between extremely high visual imperceptibility and strong robustness in images has become a core technological requirement that urgently needs to be met in fields such as digital multimedia copyright protection, generative content traceability, and authenticity authentication.

[0003] To meet these requirements, existing technologies have proposed various technical solutions for robust and invisible image watermarking. Current mainstream localized watermarking embedding methods typically employ mechanisms such as the Just Noticeable Difference (JND) model to guide the spatial allocation of watermark embedding energy. In practical applications, these solutions primarily estimate the human visual perception threshold by calculating basic pixel statistics such as brightness, contrast, and texture complexity of local image regions. Based on these basic statistical characteristics, they determine the watermark embedding intensity for each region of the image, attempting to achieve a compromise between imperceptibility and robustness.

[0004] However, the perceptual modeling logic used in existing technologies to balance imperceptibility and robustness mainly relies on the statistical regularity of pixels in the basic dimension. When dealing with complex images with rich contextual information, it is difficult to accurately capture and understand the high-level semantic content of the image. This makes it difficult for existing solutions to fundamentally overcome the inherent contradiction between watermark robustness and imperceptibility when facing complex images.

Summary of the Invention

[0005] This invention provides an image watermark embedding method, apparatus, electronic device, and storage medium to address the shortcomings of existing technologies where the perceptual modeling logic guiding watermark embedding energy allocation relies solely on low-level pixel statistical features and lacks an understanding of high-level semantic content of the image. It achieves the integration of low-level pixel statistical features and multi-scale deep semantic priors, enabling differentiated and adaptive precise allocation of watermark embedding intensity for different local regions of the target image. This ensures extremely high visual quality in sensitive areas while fully releasing embedding capacity in complex texture areas, effectively overcoming the bottleneck of balancing imperceptibility and robustness of watermarks.

[0006] This invention provides an image watermark embedding method, comprising:
Generating an initial watermark perturbation signal based on the target image and the watermark information to be embedded;
Generating a basic embedding intensity map based on the local pixel statistical features of the target image;
Extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map based on the perceptual guidance map;
Fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, which indicates the differentiated watermark embedding intensity in different local regions of the target image;
Modulating the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal;
Superimposing the target watermark perturbation signal onto the target image to generate a watermarked image.

[0007] According to an image watermark embedding method provided by the present invention, the step of extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map includes:
Extracting multi-scale depth features from the target image;
Performing feature extraction and spatial alignment processing on the multi-scale depth features to obtain a set of spatially aligned feature maps;
Dynamically determining the adaptive fusion weight corresponding to each spatially aligned feature map in the set;
Performing weighted fusion of all spatially aligned feature maps in the set, based on the adaptive fusion weights, to generate the perceptual guidance map.

[0008] According to an image watermark embedding method provided by the present invention, the step of performing feature extraction and spatial alignment processing on the multi-scale depth features to obtain a set of spatially aligned feature maps includes:
Using convolutional operations to compress and reweight the channel dimension of each layer of feature maps in the multi-scale depth features, obtaining a compressed feature map for each layer;
Upsampling each compressed feature map so that its resolution matches that of the target image, obtaining a corresponding single-channel feature map;
Forming the set of spatially aligned feature maps from the single-channel feature maps.
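The compress-and-align step above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the channel compression is modeled as a 1x1 convolution (a weighted sum over channels), the upsampling is nearest-neighbor, and the weights are random placeholders rather than learned parameters. The function names `compress_channels`, `upsample_nearest`, and `align_features` are hypothetical and do not come from the patent.

```python
import numpy as np

def compress_channels(feat, weights):
    """1x1 convolution: weighted sum over the channel axis.
    feat has shape (C, h, w), weights has shape (C,); result is (h, w)."""
    return np.tensordot(weights, feat, axes=([0], [0]))

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of a 2D map to (out_h, out_w)."""
    h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[rows[:, None], cols[None, :]]

def align_features(feats, weights_per_layer, out_h, out_w):
    """Return a (K, out_h, out_w) stack of single-channel aligned maps."""
    return np.stack([
        upsample_nearest(compress_channels(f, w), out_h, out_w)
        for f, w in zip(feats, weights_per_layer)
    ])

# Three hypothetical feature layers at decreasing resolution.
rng = np.random.default_rng(0)
feats = [rng.random((8, 16, 16)), rng.random((16, 8, 8)), rng.random((32, 4, 4))]
weights = [rng.random(f.shape[0]) for f in feats]  # placeholder, not learned
aligned = align_features(feats, weights, 32, 32)
```

Every layer ends up as one channel at the target image resolution, so the maps can be compared and fused pixel by pixel.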

[0009] According to an image watermark embedding method provided by the present invention, the step of dynamically determining the adaptive fusion weight corresponding to each spatially aligned feature map in the set of spatially aligned feature maps includes:
Performing cross-scale interaction processing on the set of spatially aligned feature maps to obtain an interaction feature map;
Applying global pooling to the interaction feature map to generate a global content descriptor corresponding to the target image;
Dynamically calculating, based on the global content descriptor, the weight value corresponding to each spatially aligned feature map, and using these weight values as the adaptive fusion weights.

[0010] According to an image watermark embedding method provided by the present invention, the step of performing cross-scale interaction processing on the set of spatially aligned feature maps to obtain an interaction feature map includes:
Applying convolutional operations to the set of spatially aligned feature maps to extract local consistency features, resulting in an initial interaction feature map;
Applying residual processing that fuses cross-scale feature information to the initial interaction feature map to obtain the interaction feature map.
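A minimal sketch of the weight computation and fusion described in the preceding paragraphs is given below. It assumes, for illustration only, that the cross-scale interaction is a 1x1 convolution across the scale axis with a residual connection, that global pooling is average pooling, and that the weights come from a softmax over the pooled descriptor. The patent does not fix these choices, and the mixing matrix `mix` is a placeholder rather than a trained parameter.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_scale_interaction(aligned, mix):
    """Mix the K scales at every pixel (a 1x1 convolution across the scale
    axis), then add the input back as a residual connection."""
    initial = np.tensordot(mix, aligned, axes=([1], [0]))  # (K, H, W)
    return aligned + initial

def adaptive_fusion_weights(aligned, mix):
    interacted = cross_scale_interaction(aligned, mix)
    descriptor = interacted.mean(axis=(1, 2))  # global average pooling -> (K,)
    return softmax(descriptor)

def perceptual_guidance_map(aligned, mix):
    """Weighted fusion of the aligned maps using the adaptive weights."""
    weights = adaptive_fusion_weights(aligned, mix)
    return np.tensordot(weights, aligned, axes=([0], [0])), weights

rng = np.random.default_rng(1)
aligned = rng.random((3, 32, 32))       # K = 3 spatially aligned maps
mix = np.full((3, 3), 1.0 / 3.0)        # placeholder mixing matrix
guidance, weights = perceptual_guidance_map(aligned, mix)
```

Because the weights are positive and sum to one, the fused map is a per-pixel convex combination of the aligned maps and stays within their value range.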

[0011] According to an image watermark embedding method provided by the present invention, the step of generating a depth prior embedding intensity map based on the perceptual guidance map includes:
Using a preset perception threshold mapping model to map and transform the perceptual guidance map, obtaining the perception threshold corresponding to the perceptual guidance map;
Generating the depth prior embedding intensity map based on the perception threshold;
wherein the depth prior embedding intensity map has the same dimension and numerical distribution as the basic embedding intensity map.

[0012] According to an image watermark embedding method provided by the present invention, the step of generating the depth prior embedding intensity map based on the perception threshold includes:
Determining the perception threshold corresponding to each local region in the target image;
Assigning, based on the perception threshold of each local region, a corresponding embedding strength value to that region to form the depth prior embedding intensity map;
wherein the perception threshold corresponding to a semantically sensitive region or visually smooth region in the target image is lower than that corresponding to a region with complex texture, and accordingly the embedding strength value assigned to the semantically sensitive or visually smooth region is less than that assigned to the region with complex texture.
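The threshold mapping can be illustrated with a simple monotone function. The sketch below assumes that a higher guidance value marks a more sensitive region and uses a linear mapping into a range `[s_min, s_max]` that stands in for the value range of the basic embedding intensity map. Both the mapping and the parameter values are hypothetical; the patent only requires a preset mapping model.

```python
import numpy as np

def depth_prior_strength(guidance, s_min=0.5, s_max=4.0):
    """Map a guidance map (higher = more sensitive, by assumption) to an
    embedding strength in [s_min, s_max]."""
    span = guidance.max() - guidance.min()
    g = (guidance - guidance.min()) / (span + 1e-12)  # normalize to [0, 1]
    threshold = 1.0 - g  # sensitive regions get a lower perception threshold
    return s_min + (s_max - s_min) * threshold

# Top row: sensitive region (for example a face); bottom row: complex texture.
guidance = np.array([[0.9, 0.8], [0.1, 0.2]])
strength = depth_prior_strength(guidance)
```

The sensitive top row receives a lower strength than the textured bottom row, matching the ordering required by the paragraph above.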

[0013] According to an image watermark embedding method provided by the present invention, the step of generating a basic embedding intensity map based on the local pixel statistical features of the target image includes:
Extracting the pixel intensity distribution features and / or pixel spatial variation features of each local region in the target image as the local pixel statistical features of that region;
Calculating, based on the local pixel statistical features of each local region, the basic embedding strength value corresponding to that region;
Forming the basic embedding intensity map from the basic embedding strength values of the local regions.

[0014] According to an image watermark embedding method provided by the present invention, the step of fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map includes:
Obtaining predetermined fusion weight coefficients;
Performing weighted fusion of the basic embedding intensity map and the depth prior embedding intensity map, based on the fusion weight coefficients, to obtain the target embedding intensity map.

[0015] According to an image watermark embedding method provided by the present invention, the fusion weight coefficients include spatially adaptive weight values corresponding to each local region of the target image, and obtaining the predetermined fusion weight coefficients includes:
Obtaining the feature distribution value of the local pixel statistical features corresponding to each local region, and the semantic response value of each local region in the perceptual guidance map;
Calculating the relative proportion between the feature distribution value and the semantic response value of each local region;
Dynamically generating, based on this relative proportion, the spatially adaptive weight value corresponding to each local region;
Forming the fusion weight coefficients from the spatially adaptive weight values of the local regions.
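One possible reading of the "relative proportion" is a per-region ratio, sketched below. The formula `alpha = F / (F + S)` and the convex combination of the two intensity maps are assumptions chosen for illustration; the patent does not give an explicit formula, and the names `spatial_fusion_weights` and `fuse_strength` are hypothetical.

```python
import numpy as np

def spatial_fusion_weights(feature_dist, semantic_resp, eps=1e-8):
    """Per-region weight of the basic map: share of the pixel-statistics
    response relative to the sum of both responses (assumed formula)."""
    return feature_dist / (feature_dist + semantic_resp + eps)

def fuse_strength(basic, depth_prior, alpha):
    """Weighted fusion of the two intensity maps with spatially varying alpha."""
    return alpha * basic + (1.0 - alpha) * depth_prior

feature_dist = np.array([[3.0, 1.0], [1.0, 0.0]])   # pixel-statistics response
semantic_resp = np.array([[1.0, 1.0], [3.0, 2.0]])  # guidance-map response
alpha = spatial_fusion_weights(feature_dist, semantic_resp)

basic = np.full((2, 2), 2.0)
depth_prior = np.full((2, 2), 4.0)
target = fuse_strength(basic, depth_prior, alpha)
```

Where the semantic response dominates, the target map follows the depth prior map; where the pixel statistics dominate, it follows the basic map.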

[0016] According to an image watermark embedding method provided by the present invention, the step of superimposing the target watermark perturbation signal onto the target image to generate a watermarked image includes:
Obtaining the embedding mask corresponding to the target image and determining the target embedding region in the target image;
Superimposing the target watermark perturbation signal only onto the target embedding region to generate the watermarked image.
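Mask-restricted superimposition can be sketched as an element-wise product with a binary mask before the addition. The example assumes an 8-bit grayscale image and a mask with value 1 inside the target embedding region; the rounding and clipping to the 8-bit range are implementation details added here, not taken from the patent.

```python
import numpy as np

def embed_in_region(image, perturbation, mask):
    """Add the perturbation only where mask == 1, then round and clip
    to the valid 8-bit range."""
    out = image.astype(np.float64) + perturbation * mask
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

image = np.full((4, 4), 100, dtype=np.uint8)
perturbation = np.full((4, 4), 3.0)
mask = np.zeros((4, 4))
mask[:2, :] = 1.0  # embed only in the top half
watermarked = embed_in_region(image, perturbation, mask)
```

Pixels outside the mask are returned unchanged, so the watermark is confined to the selected region.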

[0017] The present invention also provides an image watermark embedding device, comprising the following modules:
A signal generation module, used to generate an initial watermark perturbation signal based on the target image and the watermark information to be embedded;
A first path processing module, used to generate a basic embedding intensity map based on the local pixel statistical features of the target image;
A second path processing module, used to extract multi-scale depth features of the target image and fuse them to generate a perceptual guidance map, and to generate a depth prior embedding intensity map based on the perceptual guidance map;
An intensity fusion module, used to fuse the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, which indicates the differentiated watermark embedding intensity in different local regions of the target image;
A perturbation modulation module, used to modulate the initial watermark perturbation signal using the target embedding intensity map to obtain the target watermark perturbation signal;
A watermark overlay module, used to superimpose the target watermark perturbation signal onto the target image to generate a watermarked image.

[0018] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the image watermark embedding method as described above.

[0019] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the image watermark embedding method as described above.

[0020] The image watermark embedding method, apparatus, electronic device, and storage medium provided by this invention overcome the limitations of relying on low-level pixel statistics alone through dual-path collaborative fusion of low-level pixel statistics and high-level semantic priors. This enables precise, content-adaptive allocation of watermark energy based on high-level semantic understanding, thereby ensuring extremely high imperceptibility in sensitive areas and enhancing robustness in areas with complex textures. It effectively overcomes the difficulty of balancing watermark imperceptibility and robustness.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0022] Figure 1 is a flowchart illustrating the image watermark embedding method provided by the present invention.

[0023] Figure 2 is a schematic diagram of the overall architecture of the image watermark embedding system provided by the present invention.

[0024] Figure 3 is a schematic diagram of the generation process of the perceptual guidance map provided by the present invention.

[0025] Figure 4 is a schematic diagram of the dual-path JND architecture provided by the present invention.

[0026] Figure 5 is a schematic diagram of the generation process of the basic embedding intensity map provided by the present invention.

[0027] Figure 6 is a schematic diagram of the process for generating the target embedding intensity map provided by the present invention.

[0028] Figure 7 is a schematic diagram of the image watermark embedding device provided by the present invention.

[0029] Figure 8 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0030] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0031] It should be noted that, in the description of this invention, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Those skilled in the art will understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0032] The terms "first," "second," etc., used in this invention are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, a first object can be one or more. Furthermore, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.

[0033] Digital image watermarking technology is a key research direction in the fields of information hiding and digital rights management. Its core objective is to robustly embed copyright identifiers or traceability information into a carrier image without affecting human visual perception, and to accurately extract the embedded information even after the image has undergone various signal processing operations or malicious attacks. With its strong visual imperceptibility and content traceability, this technology has been widely applied in areas such as digital media copyright protection, identification of Artificial Intelligence Generated Content (AIGC), forensic electronic evidence preservation, and sensitive image leakage prevention. In complex real-world application scenarios, watermarking systems must effectively resist complex attacks, including value transformations (such as brightness, contrast, and saturation adjustments, and JPEG compression), geometric transformations (such as rotation, cropping, and perspective transformations), and local content tampering. Therefore, designing a watermarking scheme that simultaneously achieves high visual quality and strong robustness has become a core challenge of the current technology.

[0034] Currently, the mainstream watermark embedding methods are mainly divided into two categories: spatial domain and frequency domain. Both aim to balance imperceptibility and robustness, but each has its own characteristics and limitations.

[0035] Spatial-domain methods are an early technique that embeds watermarks by directly fine-tuning pixel values. Least Significant Bit (LSB) replacement is the most common such method. It exploits the fact that the human eye is not sensitive to changes in the low-order bits of a pixel to achieve imperceptibility. Its advantages are simple implementation, low computational overhead, and suitability for real-time scenarios. However, its robustness is extremely poor: the watermark is easily destroyed by compression, noise, and similar operations, and artifacts may also appear in smooth areas.
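For reference, LSB replacement can be sketched in a few lines of NumPy. This is a generic illustration of the prior-art technique, not part of the claimed method; the function names are hypothetical, and the image and payload are toy values.

```python
import numpy as np

def lsb_embed(image, bits):
    """Write the payload bits into the least significant bit of the
    first len(bits) pixels (row-major order)."""
    flat = image.flatten()  # flatten() returns a copy
    n = len(bits)
    flat[:n] = (flat[:n] & 0xFE) | bits
    return flat.reshape(image.shape)

def lsb_extract(image, n):
    """Read back the least significant bit of the first n pixels."""
    return image.flatten()[:n] & 1

image = np.arange(16, dtype=np.uint8).reshape(4, 4) * 10
bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
watermarked = lsb_embed(image, bits)
recovered = lsb_extract(watermarked, len(bits))

# A coarse quantization, standing in for lossy compression, zeroes the
# two lowest bits and therefore erases the payload.
quantized = (watermarked // 4) * 4
after_attack = lsb_extract(quantized, len(bits))
```

Each pixel changes by at most one gray level, which explains the imperceptibility, while the quantization step shows why the method is fragile.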

[0036] Frequency-domain methods transform the image to the frequency domain using techniques such as the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT), and embed the watermark in mid-frequency coefficients to balance visual quality and interference resistance. This significantly improves robustness compared to spatial-domain methods and can resist attacks such as JPEG compression. However, these methods suffer from block artifacts, lack content-adaptive embedding strategies, are prone to distortion in semantically sensitive areas, and are vulnerable to geometric attacks such as rotation. Both types of methods lack deep content awareness, have low embedding capacity, and struggle to meet the demands of high-definition media.
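A common way to embed one bit in mid-frequency DCT coefficients is to enforce an ordering between two coefficients of an 8x8 block. The sketch below illustrates this generic prior-art idea under assumptions made here: the coefficient positions `(2, 3)` and `(3, 2)` and the margin are illustrative choices, and rounding the result back to 8-bit pixels is omitted.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

C = dct_matrix()

def embed_bit(block, bit, margin=20.0, p=(2, 3), q=(3, 2)):
    """Encode one bit as the ordering of two mid-frequency coefficients."""
    d = C @ block.astype(np.float64) @ C.T   # forward 2D DCT
    mid = (d[p] + d[q]) / 2.0
    hi, lo = mid + margin / 2.0, mid - margin / 2.0
    d[p], d[q] = (hi, lo) if bit else (lo, hi)
    return C.T @ d @ C                       # inverse 2D DCT

def extract_bit(block, p=(2, 3), q=(3, 2)):
    d = C @ block.astype(np.float64) @ C.T
    return int(d[p] > d[q])

block = np.random.default_rng(0).integers(60, 200, (8, 8))
decoded = [extract_bit(embed_bit(block, b)) for b in (0, 1)]
```

A larger margin makes the ordering survive stronger distortion at the cost of a more visible change, which is the trade-off the paragraph above refers to.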

[0037] In recent years, deep learning has brought new breakthroughs to watermarking technology. Approaches such as end-to-end trained Convolutional Neural Networks (CNNs) can automatically learn the complex mapping from raw pixels to watermark embeddings. Some methods, such as the WAM (Watermark Anything with Localized Messages) model, introduce local embedding masks and reformulate watermark embedding as a pixel-level segmentation task, significantly improving resistance to local attacks such as cropping and splicing. However, these methods still internally use the traditional single-path JND model, based solely on pixel statistics, to guide strength allocation. This fundamental flaw prevents them from utilizing the rich perceptual priors inherent in deep networks, so that embedding strength is sacrificed in regions that could carry a robust watermark when invisibility is pursued. As a result, the joint optimization of robustness and invisibility reaches a bottleneck.

[0038] In summary, existing digital image watermarking technologies, especially the current mainstream methods based on localized embedding and the JND model, have the following limitations: While deep learning methods for global embedding (such as HiDDeN and TrustMark) improve overall visual quality through end-to-end learning or GAN priors, their embedding process lacks fine-grained modeling of the characteristics of local image regions. This makes it impossible to achieve intelligent and adaptive spatial allocation of watermark energy, and it is easy to introduce perceptible artifacts in visually sensitive areas.

[0039] Meanwhile, localized embedding methods (such as WAM, Mask Image Watermark, etc.) effectively enhance resistance to local attacks such as cropping by using embedding masks. However, their JND module, which balances imperceptibility and robustness, relies only on low-level pixel statistics such as local brightness, contrast, and texture to estimate the perception threshold. This has two fundamental defects: (1) it cannot perceive the high-level semantic content of the image (e.g., it cannot distinguish a face from the background), which may lead to over-embedding in semantically important regions; (2) it cannot effectively utilize the rich perceptual priors revealed by modern deep networks, so the high embedding capacity of regions with complex textures, to which the human eye is insensitive, is not fully used, resulting in suboptimal energy allocation.

[0040] These shortcomings prevent existing frameworks from synergistically utilizing low-level spatial redundancy and high-level semantic context information during watermark embedding, making it difficult to achieve a better balance between robustness and imperceptibility, two core contradictory metrics.

[0041] In view of this, the present invention creatively provides a robust, invisible image watermark embedding method, device, electronic device and storage medium based on a dual-path perception model, aiming to achieve highly imperceptible and robust watermark embedding and accurate extraction in images, providing a reliable solution for scenarios such as multimedia copyright protection, AIGC content tracing and authenticity authentication.

[0042] This invention provides an image watermark embedding method, the execution subject of which can be a computer device, such as a personal computer (PC), server, workstation, or other electronic device with image processing and computing capabilities. In some practical application scenarios, the execution subject can also be a computing node deployed in the cloud, or a processor integrated into a specific software system (such as a copyright protection system or an AIGC content tracing system). For the sake of consistency and simplicity, the following description of the embodiments will consistently use electronic devices as the execution subject.

[0043] Figure 1 is a flowchart illustrating the image watermark embedding method provided by the present invention. As shown in Figure 1, the method includes, but is not limited to, the following steps:

Step 11: Generate an initial watermark perturbation signal based on the target image and the watermark information to be embedded.

[0044] In practice, the target image can refer to the original carrier image that needs to be processed for copyright protection, AIGC content tracing, or authenticity verification. It can be a digital image of any format and resolution.

[0045] The watermark information to be embedded is hidden data used to characterize specific attributes, such as copyright identification code, user ID, or anti-counterfeiting tracking serial number.

[0046] In this embodiment, the target image and the watermark information to be embedded are first acquired. Then, through a preset signal encoding mechanism or mapping algorithm, the one-dimensional or structured watermark information is converted into a two-dimensional baseband signal associated with the spatial dimension of the target image, thereby generating an initial watermark perturbation signal.

[0047] The initial watermark perturbation signal can be a basic matrix of tiny numerical changes to be added, which can carry confidential information in space and provide a basic data representation for subsequent steganography.
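As an illustration of Step 11, the sketch below turns a bit string into a two-dimensional signal of the same size as the image. The encoding is an assumption, since the patent leaves the "preset signal encoding mechanism" open: bits are mapped to +1/-1 symbols, tiled over `cell` x `cell` patches, and multiplied by a key-dependent +1/-1 carrier in the style of spread-spectrum watermarking. The function name, `cell`, and `seed` are hypothetical, and the image sides are assumed to be multiples of `cell`.

```python
import numpy as np

def initial_perturbation(bits, height, width, cell=8, seed=42):
    """Tile +1/-1 message symbols over cell x cell patches, then multiply
    by a pseudo-random +1/-1 carrier derived from the seed (the key)."""
    symbols = 2.0 * np.asarray(bits, dtype=np.float64) - 1.0
    rows, cols = height // cell, width // cell
    idx = np.arange(rows * cols) % symbols.size  # repeat the message cyclically
    grid = symbols[idx].reshape(rows, cols)
    base = np.kron(grid, np.ones((cell, cell)))  # expand each symbol to a patch
    carrier = np.random.default_rng(seed).choice([-1.0, 1.0], size=base.shape)
    return base * carrier

bits = [1, 0, 1, 1]
signal = initial_perturbation(bits, 32, 32)
```

The result has unit amplitude everywhere; the spatially varying amplitude is applied later by the embedding intensity map.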

[0048] Step 12: Generate a basic embedding intensity map based on the local pixel statistical features of the target image.

[0049] In order to embed watermarks covertly in images, this embodiment performs spatial redundancy analysis along the underlying visual dimension (referred to as the first processing path).

[0050] The extracted local pixel statistical features refer to low-level physical properties extracted directly from the base pixel level of the target image, such as the brightness distribution, contrast variation, or texture complexity of local regions. By calculating and analyzing these local pixel statistical features, the physical tolerance of the human visual system to pixel value variations in each local region is evaluated, thereby generating a base embedding intensity map.

[0051] The basic embedding intensity map spatially corresponds to the target image, and its value reflects the basic watermark energy threshold that each local region can withstand based solely on the statistical regularity of the underlying pixels. This method effectively utilizes the spatial redundancy of the image's underlying layers, providing physical robustness for watermark embedding against attacks that transform the basic values, such as brightness adjustments and conventional noise.
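A rough sketch of the first processing path is shown below. It assumes non-overlapping 8x8 blocks as the local regions and a simple linear combination of texture masking (block standard deviation) and luminance masking (distance of the block mean from mid-gray). The coefficients `base`, `k_texture`, and `k_luma` are illustrative values, not parameters given in the patent.

```python
import numpy as np

def block_stats(img, block=8):
    """Per-block mean and standard deviation of a grayscale image whose
    sides are multiples of `block`."""
    h, w = img.shape
    tiles = img.reshape(h // block, block, w // block, block).astype(np.float64)
    return tiles.mean(axis=(1, 3)), tiles.std(axis=(1, 3))

def basic_strength_map(img, block=8, base=1.0, k_texture=0.1, k_luma=1.0):
    """Higher strength where texture is complex or the block is very dark
    or very bright (assumed masking model)."""
    mean, std = block_stats(img, block)
    luma_term = np.abs(mean - 128.0) / 128.0
    per_block = base + k_texture * std + k_luma * luma_term
    # Expand the per-block values back to full image resolution.
    return np.kron(per_block, np.ones((block, block)))

flat = np.full((8, 8), 128)                      # smooth mid-gray block
textured = np.random.default_rng(0).integers(0, 256, (8, 8))
img = np.hstack([flat, textured])
strength = basic_strength_map(img)
```

The smooth block receives only the base strength, while the noisy block receives a noticeably higher value.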

[0052] Step 13: Extract multi-scale depth features from the target image and fuse them to generate a perception guidance map, and generate a depth prior embedding intensity map based on the perception guidance map.

[0053] Because low-level pixel features alone cannot capture high-level semantic content, this embodiment performs parallel processing along the deep semantic prior dimension (referred to as the second processing path).

[0054] In practice, pre-trained deep learning networks or feature extraction models can be used to extract multi-scale features from target images, covering different receptive fields and different levels of abstraction. Among them, shallow features usually preserve fine edge and texture details, while deep features capture high-level semantic concepts such as object category and global structure.

[0055] Subsequently, these heterogeneous multi-scale deep features can be dynamically interacted and fused to generate a perceptual guidance map rich in image context information. The resulting perceptual guidance map breaks through the limitations of the single pixel level and can accurately capture and distinguish semantically sensitive regions (such as facial contours) and non-critical regions (such as cluttered backgrounds) in the target image from a high-level cognitive perspective.

[0056] Next, the generated perceptual guidance map is input into a specific mapping model to transform abstract deep semantic information into specific embedding strength constraint values, thereby generating a deep prior embedding strength map. This deep prior embedding strength map can objectively reflect the ideal watermark energy distribution that can be applied to each region of the target image under the guidance of high-level semantic prior knowledge.

[0057] Step 14: Fuse the base embedding intensity map and the depth prior embedding intensity map to obtain the target embedding intensity map. The target embedding intensity map is used to indicate the differentiated watermark embedding intensity in different local regions of the target image.

[0058] Furthermore, this embodiment will make a comprehensive decision based on the evaluation results of the first processing path and the second processing path.

[0059] Specifically, a specific mathematical fusion algorithm, such as weighted summation, can be used to fuse the basic embedding strength map that focuses on the underlying physical redundancy with the deep prior embedding strength map that focuses on the high-level semantic perception, and calculate the final target embedding strength map.
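The weighted summation mentioned above can be illustrated with a minimal sketch. The function name `fuse_intensity_maps` and the scalar `weight_base` are hypothetical; the source only states that a fusion algorithm such as weighted summation may be used and does not specify how the weight is chosen.

```python
import numpy as np

def fuse_intensity_maps(base_map, deep_map, weight_base=0.5):
    """Pixel-wise weighted sum of two embedding intensity maps of equal shape.

    weight_base is an assumed scalar in [0, 1]; the source does not fix its value.
    """
    base_map = np.asarray(base_map, dtype=np.float64)
    deep_map = np.asarray(deep_map, dtype=np.float64)
    if base_map.shape != deep_map.shape:
        raise ValueError("intensity maps must have the same shape")
    if not 0.0 <= weight_base <= 1.0:
        raise ValueError("weight_base must lie in [0, 1]")
    return weight_base * base_map + (1.0 - weight_base) * deep_map

# Toy 2x2 maps: the base map comes from pixel statistics,
# the deep prior map from semantic features.
base = np.array([[0.2, 0.8], [0.4, 1.0]])
deep = np.array([[0.0, 1.0], [0.2, 0.6]])
target = fuse_intensity_maps(base, deep, weight_base=0.5)
```

With equal weights, each value of the target map is the mean of the two inputs at the same position, so a region that both paths rate as sensitive receives a low intensity.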

[0060] The target embedding intensity map is content-adaptive and indicates the differentiated watermark embedding intensity in different local regions of the target image. Compared with current mainstream watermark embedding methods, the resulting target embedding intensity map no longer assigns a single fixed intensity value across the entire image. Instead, it allocates weights at a fine granularity based on both the low-level texture and the high-level semantic attributes of the target image. For example, it indicates a weaker embedding intensity in semantically sensitive smooth regions and a stronger embedding intensity in complex texture regions to which the human eye is less sensitive.

[0061] Step 15: Modulate the initial watermark perturbation signal using the target embedding intensity map to obtain the target watermark perturbation signal.

[0062] In this embodiment, the fused target embedding intensity map is further used as the control matrix of spatial energy. This control matrix is then used to perform pixel-by-pixel modulation on the initial watermark perturbation signal, thereby precisely adjusting the actual amplitude of the perturbation signal at each coordinate position of the target image.

[0063] The modulation method used can be through multiplication of corresponding position values, etc., so that the original basic perturbation substrate is transformed into a target watermark perturbation signal with fine energy distribution.

[0064] Step 16: Superimpose the target watermark perturbation signal onto the target image to generate a watermarked image.

[0065] Finally, in this embodiment, the modulated target watermark perturbation signal is superimposed and fused with the original pixel data of the target image to perform the watermark steganography embedding operation. Through the above processing, the target watermark perturbation signal is deeply hidden in the target image, and the final output is a watermarked image that is visually difficult to detect and robustly carries confidential information.
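Steps 15 and 16 can be sketched as follows, assuming element-wise multiplication as the modulation (the option named in the text) and a single-channel image with pixel values normalized to [0, 1]. The function name `embed_watermark` and the final clipping step are assumptions added for illustration.

```python
import numpy as np

def embed_watermark(image, delta, intensity_map):
    """Modulate the perturbation pixel by pixel, then add it to the image."""
    image = np.asarray(image, dtype=np.float64)
    delta = np.asarray(delta, dtype=np.float64)
    intensity_map = np.asarray(intensity_map, dtype=np.float64)
    modulated = intensity_map * delta      # Step 15: element-wise modulation
    watermarked = image + modulated        # Step 16: superposition
    # Assumption: pixel values live in [0, 1], so clip to stay in range.
    return np.clip(watermarked, 0.0, 1.0)

image = np.full((2, 2), 0.5)
delta = np.array([[0.1, -0.1], [0.1, -0.1]])
intensity = np.array([[0.0, 1.0], [0.5, 2.0]])
watermarked = embed_watermark(image, delta, intensity)
```

Where the intensity map is zero the pixel is left unchanged, which is how a semantically sensitive region can be protected while textured regions carry more watermark energy.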

[0066] Existing localized watermarking embedding methods, while balancing imperceptibility and robustness, typically rely solely on low-level local pixel statistical features such as brightness and contrast for their perceptual modeling logic, lacking an understanding of the image's high-level semantic content. This one-size-fits-all embedding approach easily introduces perceptible artifacts in semantically sensitive areas, or forces a global reduction in embedding capacity to protect sensitive areas, resulting in suboptimal energy allocation.

[0067] In contrast, the image watermarking embedding method provided by this invention innovatively proposes a collaborative modeling framework based on dual-path awareness. While retaining the underlying pixel statistics to obtain basic spatial physical redundancy, it introduces a deep semantic prior path. By deeply fusing the underlying physical statistical features with the high-level multi-scale deep semantic prior, it achieves differentiated and content-adaptive precise allocation of watermark embedding intensity for different local regions of the target image. It can intelligently suppress the embedding intensity in semantically sensitive areas to ensure extremely high visual quality (imperceptibility), while fully releasing the watermark embedding potential in non-sensitive complex texture areas to maintain strong resistance to compression and tampering (robustness). Thus, it fundamentally and effectively breaks through the bottleneck of balancing imperceptibility and robustness in traditional digital image watermarking technology.

[0068] Figure 2 is a schematic diagram of the overall architecture of the image watermark embedding system provided by the present invention. With reference to Figure 2, a specific architecture of an image watermark embedding system for implementing the image watermark embedding method provided in the above embodiments is introduced below.

[0069] The overall architecture of the image watermark embedding system (hereinafter referred to as the system) provided in this embodiment mainly includes the following core components: an embedding layer, an attack layer, an extraction layer, and a discrimination layer. The specific data flow and module collaboration process are as follows. The embedding layer is primarily responsible for transforming hidden information into visually invisible perturbations and adaptively injecting them into the image. In the actual watermark embedding process, it mainly performs the following operations. First, the original target image (denoted as x), the watermark information to be embedded (denoted as msg), and the corresponding watermark index (denoted as indices) are obtained. In the image feature extraction branch, the target image is input into the encoder of a variational autoencoder (VAE encoder), and after dimensionality reduction and feature extraction by the network, the corresponding image embedding features are output.

[0070] In the watermark information encoding branch, the watermark information to be embedded and the watermark index are jointly input into the feature generator (Emb Creator) and transformed into a multi-channel watermark feature representation (denoted as msg_emb).

[0071] Subsequently, the aforementioned image embedding features are concatenated with the watermark feature representation (the "C" node in Figure 2 performs this Concat operation), and the fused feature tensor is input into the VAE decoder to generate the initial watermark perturbation signal (denoted as δ).

[0072] Simultaneously, the target image is also input in parallel to the dual-path processing and intensity fusion module (denoted as Dual-Path JND). Dual-Path JND calculates and fuses local pixel statistical features and multi-scale depth features of the target image, outputting a content-adaptive target embedding intensity map. As shown in Figure 2, the target embedding intensity map is applied to the initial watermark perturbation signal, and its spatial energy is modulated to obtain the modulated target watermark perturbation signal.

[0073] Finally, by combining a preset embedding mask, the target watermark perturbation signal is superimposed on the original target image only within the area specified by the mask, thereby generating a watermarked image.
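One way to write the modulation and masked superposition described in the preceding paragraphs is given below. The symbols x and δ follow the text; α (the target embedding intensity map), M (the preset embedding mask, assumed here to be binary), δ_t (the target watermark perturbation signal), x_w (the watermarked image), and the use of the element-wise product ⊙ are notational assumptions, not taken from the source.

```latex
\delta_t = \alpha \odot \delta, \qquad x_w = x + M \odot \delta_t
```

Under this reading, pixels where M is zero are copied from x unchanged, and pixels where M is one receive the modulated perturbation.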

[0074] To enhance the system's anti-interference capability in practical applications, an attack layer is introduced during the training phase. Specifically, the generated watermarked image is fed into the distortion layer inside the attack layer. The distortion layer applies various simulated destructive operations (such as Gaussian noise, cropping, JPEG compression, equivalent transformations, and geometric transformation attacks) to the watermarked image to simulate transmission distortion or malicious tampering in a real network environment, thereby training the robustness of the extraction layer.
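A minimal sketch of such a distortion layer is shown below, covering only two of the listed operations (Gaussian noise and cropping). JPEG compression and geometric transformations are omitted for brevity. All function names, the noise level, and the crop ratio are illustrative assumptions, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the assumed [0, 1] range."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def center_crop(image, keep_ratio):
    """Keep the central region covering keep_ratio of each spatial side."""
    h, w = image.shape[:2]
    ch, cw = max(1, int(h * keep_ratio)), max(1, int(w * keep_ratio))
    top, left = (h - ch) // 2, (w - cw) // 2
    return image[top:top + ch, left:left + cw]

def random_distortion(image, rng):
    """Apply one randomly chosen simulated attack to a watermarked image."""
    if rng.integers(0, 2) == 0:
        return add_gaussian_noise(image, sigma=0.05, rng=rng)
    return center_crop(image, keep_ratio=0.75)

rng = np.random.default_rng(0)
watermarked = rng.random((64, 64, 3))   # stand-in for a watermarked image
attacked = random_distortion(watermarked, rng)
```

Sampling a different distortion for each training batch exposes the extraction layer to a variety of degradations, which is the purpose the text attributes to the attack layer.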

[0075] The extraction layer's role is to accurately recover the hidden watermark content and location from the damaged image. After the image has been processed by the distortion layer, it enters the extraction layer and first undergoes image patch embedding (referred to as Embed Patch) module processing, which divides the two-dimensional image into sequential image patch features.
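The patch-splitting part of the Embed Patch module can be sketched as follows. This is an assumption-based illustration: a real patch embedding would normally also apply a learned linear projection to each flattened patch, which is omitted here, and the function name `image_to_patches` is hypothetical.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Separate the grid indices from the within-patch indices.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.arange(8 * 8 * 3, dtype=np.float64).reshape(8, 8, 3)
tokens = image_to_patches(image, patch_size=4)   # 4 patches of 48 values each
```

The resulting rows are ordered left to right, top to bottom, which gives the sequential form that a Transformer encoder consumes.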

[0076] Next, the image patch sequence is fed into the ViT encoder (i.e., the Vision Transformer encoder) to fully capture the global and local contextual features of the image using a self-attention mechanism.

[0077] Finally, the extracted features are fed into the pixel decoder for decoding. The system simultaneously outputs two key results: one is the extracted watermark information, which is used for copyright tracing and information recovery; the other is the recovered detection mask, which is used to accurately locate the actual tampered or remaining area of the watermark in the image.

[0078] To further ensure the visual imperceptibility of the generated watermarked image, the system introduces an adversarial training mechanism into the training framework. The original target image and the generated watermarked image are input into the discriminator.

[0079] As shown in Figure 2, the discriminator can be composed of multiple cascaded convolutional network layers, such as 2D convolution + activation layer, 2D convolution + normalization + activation layer, etc. The discriminator distinguishes between real and generated images and assesses their quality, ultimately outputting image patch scores. These scores serve as adversarial loss feedback signals that guide the update of the embedding layer's network weights through backpropagation, thus enabling the system to generate watermarked images that are visually more natural and realistic, with fewer perceptible artifacts.

[0080] Based on the above specific implementation methods, this embodiment of the invention provides an implementation for extracting multi-scale depth features from the target image and fusing them to generate a perceptual guidance map. Figure 3 is a schematic diagram of the generation process of the perception guidance map provided by the present invention. As shown in Figure 3, the main steps include: Step 31: Extract multi-scale depth features from the target image.

[0081] To extract rich depth prior knowledge from target images, pre-trained deep neural networks, such as the EfficientNet-B0 network which achieves a good balance between parameter quantity and perceptual feature extraction capability, can be used as the backbone network to process the target images, including feature extraction from different network layers, thereby obtaining multi-scale depth features.

[0082] The resulting multi-scale deep features constitute a rich perceptual information database. Specifically, the feature maps extracted from the shallow layers of the network mainly encode microscopic information such as fine textures, edges, and corners in the target image, while the feature maps extracted from the deep layers of the network capture macroscopic information such as high-level semantic concepts, object categories, and the global topological structure of objects.

[0083] In one specific embodiment, when the backbone network completes pre-training and participates in watermark embedding, its network weights are set to a frozen state. The purpose of freezing the weights is to preserve the general image perception knowledge acquired by the backbone network during the massive data pre-training stage, and to avoid interference from specific watermark task training, thereby ensuring the stability and high generalization of deep prior feature extraction.

[0084] Step 32: Perform feature extraction and spatial alignment processing on the multi-scale depth features to obtain a set of spatially aligned feature maps.

[0085] Because multi-scale deep features extracted directly from different network layers exhibit significant heterogeneous differences in spatial resolution, number of channels, and statistical distribution, directly fusing these heterogeneous features would lead to extreme instability in model training and feature representation. Therefore, this embodiment performs feature extraction and spatial alignment processing on these multi-scale deep features.

[0086] In this process, feature extraction is performed first. Specific network operations are used to suppress redundant background information in the feature maps of each layer and highlight the core channel features that play a crucial role in visual perception. Then, spatial alignment is performed, adjusting the size of the extracted feature maps so that their resolution is uniformly stretched or compressed to match the original target image. After this dimensionality reduction and size unification process, the heterogeneous features at each layer are transformed into the same spatial coordinate system, resulting in a set of dimensionally unified feature maps—the spatially aligned feature map set.

[0087] Step 33: Dynamically determine the adaptive fusion weights corresponding to each spatial alignment feature map in the spatial alignment feature map set.

[0088] After obtaining a set of feature maps with uniform dimensions, this embodiment further determines the proportion of features at each level during the final fusion. Since the dependence of different image content on feature levels is dynamically changing, for example, in textured areas such as grass with rich details, shallow features should contribute more; while in semantically clear areas such as faces, deep features should dominate.

[0089] To this end, the contextual content of the target image is globally evaluated, and the adaptive fusion weights corresponding to each spatially aligned feature map are dynamically determined based on this evaluation. These weight values are not fixed constants determined during model construction, but are calculated and allocated during inference based on the specific content (texture complexity, semantic distribution, etc.) of the current input image.

[0090] Step 34: Based on the adaptive fusion weights, perform weighted fusion processing on all the spatial alignment feature maps in the spatial alignment feature map set to generate the perception guidance map.

[0091] Finally, using the adaptive fusion weights determined in the previous step, a pixel-by-pixel weighted fusion process is performed on all feature maps in the spatially aligned feature map set, such as a weighted summation operation. Through this fusion process, shallow texture information and deep semantic information from different depths are organically combined in an optimal ratio, ultimately outputting a unified perceptual guidance map with global context awareness. The resulting perceptual guidance map preserves spatial details while deeply integrating the advanced semantic understanding mechanisms of the human visual system.
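The weighted summation in Step 34 can be written compactly as below. The notation is an assumption introduced for illustration: F_k is the k-th spatially aligned feature map, w_k its adaptive fusion weight, K the number of maps, and G the perceptual guidance map. The constraint that the weights are non-negative and sum to one is also an assumption; the source does not state how the weights are normalized.

```latex
G(i,j) = \sum_{k=1}^{K} w_k \, F_k(i,j), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```

Because each w_k is recomputed per input image, the same formula yields a texture-dominated map for one image and a semantics-dominated map for another.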

[0092] Existing multi-scale feature fusion schemes typically employ simple feature splicing or fixed-weight averaging. This fixed-weight mechanism is difficult to flexibly adapt to the content characteristics of different images and the differentiated attributes of each feature level, which can easily lead to poor consistency of fusion results in different scenarios and a serious lack of coordination between semantic and texture information.

[0093] In comparison, the image watermark embedding method provided in this invention introduces a content-adaptive dynamic weighted fusion mechanism built on feature extraction and alignment. On the one hand, feature extraction and spatial alignment processing eliminate the heterogeneity barriers between multi-scale features, ensuring the basic stability of feature fusion. On the other hand, dynamically determining the adaptive fusion weights achieves multi-scale feature integration that adapts to each image and its content, addressing the problem that fixed weights in the prior art cannot accommodate different image characteristics, and helping the generated perceptual guidance map distinguish accurately between semantically sensitive areas that should be left untouched and complex texture areas with high tolerance.

[0094] Based on the above embodiments, as an optional embodiment, a further implementation is provided for performing feature extraction and spatial alignment processing on the multi-scale depth features to obtain a spatially aligned feature map set, mainly including but not limited to: using convolutional operations to compress and reweight the channel dimension of each layer of feature maps in the multi-scale depth features, to obtain the compressed feature map corresponding to each layer; upsampling each compressed feature map so that its resolution is consistent with the resolution of the target image, thereby obtaining the corresponding single-channel feature map; and composing the spatially aligned feature map set from the single-channel feature maps.

[0095] Because multi-scale deep features extracted directly from different layers of a deep network vary significantly in the number of channels and often contain a lot of redundant background information unrelated to watermark visual perception, direct use would increase the computational burden and introduce noise interference. Therefore, this embodiment applies convolution operations independently to each feature map in the multi-scale deep features, for example using a learnable 1×1 convolutional layer. This convolutional operation not only achieves feature reduction and compression in the channel dimension, but also adaptively reweights the features based on their contribution to the final visual perception. This suppresses invalid or redundant feature information and highlights key channel features closely related to image texture and semantics, thereby outputting the compressed feature map corresponding to each layer's feature map.

[0096] After refining the channel dimensions, the compressed feature maps of each layer still exhibit a heterogeneous state across multiple scales in terms of spatial resolution. For example, the size of the feature maps output from deeper layers of the network is usually much smaller than the size of the shallow feature maps and the original target image, making it impossible to directly perform pixel-to-pixel superposition and fusion in the same coordinate system. Therefore, this embodiment chooses to perform upsampling processing on each compressed feature map, such as using bilinear interpolation to stretch the image.

[0097] Through upsampling, compressed feature maps of varying sizes are uniformly enlarged or interpolated until their width and height perfectly match the original target image. After the aforementioned channel compression and spatial resolution stretching and alignment, each original heterogeneous multi-scale feature map is transformed into a single-channel feature map perfectly aligned in spatial scale.

[0098] Finally, this embodiment combines all the single-channel feature maps that have undergone the above-mentioned refinement and size standardization processes to construct a unified spatially aligned feature map set. All feature maps within the resulting spatially aligned feature map set achieve strict standardization and uniformity in physical dimensions such as resolution and number of channels (all single-channel), completely eliminating the heterogeneity of the original deep features and providing a spatially fully aligned high-quality data input tensor for the dynamic weighted fusion of subsequent network layers.
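The channel compression and spatial alignment described above can be sketched in NumPy as follows. A 1×1 convolution that outputs one channel is a per-pixel weighted sum over channels, and bilinear interpolation is the upsampling option named in the text. The function names, the random stand-in weights (which would be learned in practice), and the corner-aligned sampling convention are assumptions.

```python
import numpy as np

def conv1x1_to_single_channel(feature, weights, bias=0.0):
    """1x1 convolution from C channels to one: feature (C, H, W) -> (H, W)."""
    return np.tensordot(weights, feature, axes=([0], [0])) + bias

def bilinear_resize(fmap, out_h, out_w):
    """Bilinear interpolation of an (H, W) map, sampling corner to corner."""
    in_h, in_w = fmap.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]            # vertical interpolation weights
    wx = (xs - x0)[None, :]            # horizontal interpolation weights
    top = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    bottom = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
# Stand-ins for a shallow and a deep backbone feature map; target image is 16x16.
multi_scale = [rng.random((16, 8, 8)), rng.random((32, 4, 4))]
aligned = np.stack([
    bilinear_resize(conv1x1_to_single_channel(f, rng.random(f.shape[0])), 16, 16)
    for f in multi_scale
])
```

After this step every map has one channel and the resolution of the target image, so the maps can be stacked into a single array and fused position by position.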

[0099] The image watermarking embedding method provided in this embodiment incorporates a channel-level refinement and compression mechanism and a spatial-level resolution alignment mechanism. Utilizing convolutional operations for channel compression and reweighting, it accurately removes redundant background information from multi-scale features and strengthens key perceptual channels, significantly reducing the computational overhead of subsequent fusion calculations. Upsampling processing smooths out the resolution gap between multi-scale features, achieving strict alignment of all feature maps in physical space. This mechanism fundamentally solves the system instability problem caused by direct fusion of multi-scale heterogeneous features, facilitating the construction of highly robust and content-adaptive perceptual guidance maps.

[0100] Based on the above embodiments, as a further embodiment, an implementation is provided for dynamically determining the adaptive fusion weights corresponding to each spatially aligned feature map in the spatially aligned feature map set, mainly including but not limited to: performing cross-scale interactive processing on the spatially aligned feature map set to obtain an interactive feature map; performing global pooling on the interactive feature map to generate a global content descriptor corresponding to the target image; and, based on the global content descriptor, dynamically calculating the weight value corresponding to each spatially aligned feature map and using these values as the adaptive fusion weights.

[0101] The features extracted from different network layers have different emphases. Shallow features emphasize texture, while deep features emphasize semantics. In order to enable features of different scales to work together, this embodiment processes the spatially aligned feature map set obtained in the previous embodiment as a whole.

[0102] One alternative approach is to concatenate and overlay the features along the channel dimension before inputting them into a specific hierarchical feature fusion network. The network's internal processing mechanism aims to perform cross-scale interactive processing on the spatially aligned feature map set. This process not only enhances the local spatial consistency of features at each scale but also promotes deep information interaction and integration between shallow microscopic information and deep macroscopic information, thereby outputting an interactive feature map containing rich joint contextual information.

[0103] To obtain core indicators that represent the overall macroscopic characteristics of the current target image, this embodiment performs global pooling on the obtained interactive feature map. An optional global pooling method is global average pooling.

[0104] The purpose of global pooling is to compress and aggregate a spatially distributed two-dimensional feature matrix into a highly condensed one-dimensional feature vector. The resulting one-dimensional feature vector removes complex spatial location information while retaining the most essential global image attributes, thus generating a global content descriptor unique to the current target image. This global content descriptor is equivalent to the content fingerprint of the current image, objectively representing whether the image as a whole leans towards complex textures or explicit semantics.

[0105] After obtaining the global content descriptor that reflects the characteristics of the entire image, the weight values that should be assigned to each spatially aligned feature map in the feature map set can be dynamically calculated from the global content descriptor through specific mapping logic, such as a mapping learned by a fully connected layer network.

[0106] This calculation process breaks away from the traditional fixed ratio setting, instead dynamically allocating weights based on the specific content of the target image. For example, if the global content descriptor indicates that the main subject of the current image is a large area of complex textures such as tree bark or grass, the calculation logic will assign higher weights to shallow feature maps that encode fine edges; conversely, if the descriptor indicates that the current image is a semantically sensitive region containing a clear subject such as a face, it will assign higher weights to deep feature maps that capture high-level concepts. This set of weight values, dynamically generated at inference time from the current image content, constitutes the final adaptive fusion weights, used to guide the on-demand allocation and fusion of multi-scale features in subsequent steps.
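The weight computation can be sketched as global average pooling followed by a fully connected layer. The softmax normalization, the function names, and the random stand-in parameters are assumptions; the source states only that a fully connected network may learn the mapping. For simplicity the aligned maps themselves stand in for the interactive feature map.

```python
import numpy as np

def adaptive_fusion_weights(interaction, fc_weight, fc_bias):
    """Map an interactive feature map (C, H, W) to K fusion weights.

    fc_weight has shape (K, C) and fc_bias shape (K,). Both would be learned
    in practice. Softmax normalization is an assumption, not from the source.
    """
    descriptor = interaction.mean(axis=(1, 2))   # global average pooling -> (C,)
    logits = fc_weight @ descriptor + fc_bias    # fully connected layer -> (K,)
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

def weighted_fusion(aligned, weights):
    """Weighted sum of K aligned maps (K, H, W) -> guidance map (H, W)."""
    return np.tensordot(weights, aligned, axes=([0], [0]))

rng = np.random.default_rng(0)
aligned = rng.random((3, 16, 16))                # three spatially aligned maps
fc_weight = rng.normal(0.0, 1.0, (3, 3))         # random stand-in parameters
fc_bias = np.zeros(3)
weights = adaptive_fusion_weights(aligned, fc_weight, fc_bias)
guidance = weighted_fusion(aligned, weights)
```

Because the descriptor is computed from the input image, two different images generally produce two different weight vectors even though the layer parameters are shared.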

[0107] The refined scheme provided in this embodiment introduces a cross-scale interaction and dynamic weight learning mechanism based on a global content descriptor. Through cross-scale interaction and global pooling processing, it accurately extracts content descriptions that can macroscopically summarize the overall attributes of the image. Furthermore, it utilizes this global content descriptor to achieve dynamic and intelligent allocation of feature importance at each scale. This mechanism effectively solves the industry pain point that fixed weights are difficult to adapt to the characteristics of different image content, enabling the system to dynamically determine the focus of feature fusion. This achieves truly content-adaptive multi-scale feature fusion, providing decisive algorithmic logic support for the subsequent accurate differentiation between semantically sensitive areas and complex texture areas.

[0108] Based on the above embodiments, as an optional embodiment, a further implementation method is provided for performing cross-scale interactive processing on a set of spatially aligned feature maps to obtain interactive feature maps, mainly including but not limited to: The spatially aligned feature map set is subjected to convolutional operations to extract local consistency features, resulting in an initial interaction feature map. The initial interaction feature map is subjected to residual processing that fuses cross-scale feature information to obtain the interaction feature map.

[0109] In this embodiment, since the spatially aligned feature map set is a heterogeneous feature set extracted from multiple network depths and stitched together after being sized uniformly, the spatial responses of each layer's features may exhibit slight misalignment or inconsistency. Therefore, this embodiment applies a preset convolution operation to this set, such as using a lightweight convolutional layer. Using a lightweight convolutional operation allows for feature smoothing and correlation analysis of adjacent pixels in the spatial dimension of multi-scale features within its receptive field, thereby achieving local consistency feature extraction. After the above smoothing and consistency enhancement processing, an initial interactive feature map that preliminarily integrates local structural information from various scales can be obtained, effectively eliminating spatial discontinuities or abrupt changes caused by direct stitching of multi-scale features.

[0110] After obtaining the initial interaction feature map with local consistency, in order to further promote the deep association and connection between the fine texture information from the shallow layer and the macroscopic semantic information from the deep layer, this embodiment will perform specific residual processing on the initial interaction feature map, for example, inputting the feature map into a network structure containing two residual blocks for depth mapping.

[0111] In the residual processing, not only are nonlinear feature transformations performed on cross-scale information through multi-layer convolution to capture high-order combination patterns, but more importantly, the information from the input is directly bypassed and passed to the output for summation through a skip connection mechanism. This residual processing mechanism, which integrates cross-scale feature information, effectively avoids information loss and gradient vanishing in multi-layer networks, while greatly promoting cross-scale interaction between deep and shallow features. Ultimately, it outputs an interactive feature map that is deeply integrated across scales and contains rich joint contextual priors.
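A residual block with a skip connection can be sketched in plain NumPy as below. The 3×3 kernel size, ReLU activation, channel count, and random stand-in parameters are assumptions; the source specifies only multi-layer convolution with a skip connection and gives two residual blocks as an example.

```python
import numpy as np

def conv3x3(x, kernels, bias):
    """3x3 convolution with zero padding and stride 1.

    x: (C_in, H, W); kernels: (C_out, C_in, 3, 3); bias: (C_out,).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernels.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oi,ihw->ohw", kernels[:, :, dy, dx], window)
    return out + bias[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, k1, b1, k2, b2):
    """Two convolutions plus a skip connection that adds the input back."""
    y = relu(conv3x3(x, k1, b1))
    y = conv3x3(y, k2, b2)
    return relu(x + y)   # skip connection: input bypasses the convolutions

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))                        # initial interaction feature map
out = x
for _ in range(2):                               # two cascaded residual blocks
    k1, k2 = rng.normal(0.0, 0.1, (2, 4, 4, 3, 3))
    out = residual_block(out, k1, np.zeros(4), k2, np.zeros(4))
```

If both convolutions output zeros, the block returns its non-negative input unchanged, which shows how the skip connection preserves information that the transformation branch does not model.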

[0112] Existing multi-scale feature fusion processing, when faced with heterogeneous cross-scale features, often only involves simple channel splicing or linear summation, lacking in-depth exploration of local smoothness in the feature space and deep nonlinear correlations between scales. This leads to spatial structural misalignment in the fused features, with shallow textures and deep semantics operating independently, resulting in poor information synergy. To address this, the image watermarking embedding method provided in this embodiment adds a cascaded processing mechanism combining lightweight convolution and residual networks. The initial convolution operation extracts local consistency features, accurately smoothing out any structural abruptness that may remain after spatial alignment of features at different scales, ensuring the spatial coherence of the feature map. Subsequent residual processing leverages the unique skip connections and nonlinear transformation capabilities of residual blocks to effectively promote deep interaction and fusion of cross-scale information. This processing mechanism preserves the independent characteristics of each original scale feature while deeply exploring the joint perception patterns between multiple scales, fundamentally overcoming the structural bottleneck of easy cross-scale feature splicing but difficult deep fusion.

[0113] Figure 4 is a schematic diagram of the dual-path JND architecture provided by the present invention. To illustrate more intuitively the internal data flow and module collaboration mechanism of the dual-path perception model in the embodiments of the present invention, the specific implementation process of the dual-path feature extraction and intensity fusion architecture in the image watermark embedding method provided by this invention is described in detail below with reference to Figure 4.

[0114] First, the original target image is acquired and fed into two parallel processing paths. In the image-driven first processing path, the target image is input to the pixel statistical feature mapping module (i.e., the JND module on the left in Figure 4), where physical redundancy is calculated based on the local pixel statistical features of the target image, thereby generating a basic embedding intensity map.

[0115] Meanwhile, in the second processing path, driven by the depth prior, the target image is input into a pre-trained deep neural network to extract multi-scale depth features covering different network layers. These features then flow into the feature extraction and spatial alignment module. In this module, the following operations are performed sequentially on the multi-scale depth features of each layer: 1×1 convolution operations compress the channel dimension and reweight key features, upsampling unifies the spatial resolution, and feature map concatenation combines the processed single-channel feature maps into a set of spatially aligned feature maps with a unified structure.

[0116] Next, the generated set of spatially aligned feature maps is input into the hierarchical feature fusion module for deep cross-scale information interaction. As shown in Figure 4, within the hierarchical feature fusion module, local consistency extraction and deep fusion of cross-scale information are first performed using two-dimensional convolution operations and residual processing to obtain an interactive feature map. Then, global pooling is performed on the interactive feature map to generate a global content descriptor representing the overall attributes of the image. Finally, a linear fusion calculation is performed based on this descriptor to dynamically derive the adaptive fusion weights and perform weighted fusion on each feature map, thereby outputting a perceptual guidance map containing high-level contextual information.

[0117] Furthermore, in this embodiment, the generated perception guidance map is input into the perception threshold mapping model (i.e., the JND module on the right in Figure 4), where a pre-defined mapping logic transforms it into a deep prior embedding intensity map with the same dimension and distribution as the base embedding intensity map.

[0118] Finally, the target embedding intensity map is obtained by performing pixel-by-pixel weighted fusion of the basic embedding intensity map generated by the first path and the deep prior embedding intensity map generated by the second path.

[0119] The above implementation steps enable the organic synergy between underlying physical features and high-level semantic priors. As can be seen from the visual presentation of the final output target embedding intensity map, this scheme can accurately allocate extremely low embedding intensity in semantically sensitive and visually smooth areas to ensure visual imperceptibility, while fully releasing the embedding potential in complex texture areas to enhance robustness. This effectively breaks through the bottleneck of balancing imperceptibility and robustness in traditional digital watermarking technology.

[0120] Based on the above embodiments, as an optional embodiment, a further implementation method for generating a deep prior embedding intensity map based on the perceptual guidance map is provided, which mainly includes, but is not limited to: A preset perception threshold mapping model is used to map and transform the perception guidance map to obtain the perception threshold corresponding to the perception guidance map; The depth prior embedding strength map is generated based on the perception threshold; The depth prior embedding intensity map has the same dimension and numerical distribution as the basic embedding intensity map.

[0121] After obtaining the perceptual guidance map formed by the fusion of multi-scale deep features, since the perceptual guidance map is essentially an abstract expression of deep semantic features in a high-dimensional feature space, its numerical range and statistical distribution are usually quite different from the pixel statistical features at the bottom layer of the original image. This means that such an abstract feature expression cannot be directly used as the embedding intensity threshold to control the watermark energy.

[0122] Therefore, in this embodiment, the perceptual guidance map is input to a preset perceptual threshold mapping model. For example, the mapping can be performed by a JND processing module with the same structure as the one used to generate the basic embedding intensity map in the first processing path, or by another module with a mature feature-to-perceptual-threshold mapping logic. Through the internal conversion logic of this mapping model, abstract deep semantic information can be decoded into specific numerical values that conform to the laws of human visual psychology, thereby converting the perceptual guidance map pixel by pixel into the corresponding perceptual threshold.

[0123] After obtaining the perception thresholds at each location, a complete deep prior embedding intensity map can be constructed. Specifically, after standardization transformation using the perception threshold mapping model, the generated deep prior embedding intensity map is strictly constrained in terms of physical properties: on the one hand, its spatial resolution (i.e., dimension) is completely consistent with the basic embedding intensity map generated by the first processing path; on the other hand, the dynamic range and statistical characteristics (i.e., numerical distribution) of its internal values ​​are also stretched or restricted to the same scale space as the basic embedding intensity map. This strict alignment of dimension and numerical distribution unifies the deep semantic signals and underlying physical signals, which were originally in different metric spaces, into the same numerical evaluation system, thereby ensuring that the deep prior embedding intensity map can directly participate in subsequent mathematical fusion operations.
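The alignment of dimension and numerical distribution can be pictured with a simple statistical stand-in. The sketch below is not the JND-based mapping model of the embodiment; it only assumes that "same numerical distribution" can be approximated by matching the mean and standard deviation of the basic map and clipping to its value range. The function name `match_distribution` is hypothetical.

```python
import numpy as np

def match_distribution(guidance, base_map, eps=1e-8):
    """Rescale `guidance` into the value scale of `base_map`.

    Assumption: the two maps already share the same spatial resolution.
    """
    if guidance.shape != base_map.shape:
        raise ValueError("maps must share the same spatial resolution")
    # Standardize the abstract guidance values.
    z = (guidance - guidance.mean()) / (guidance.std() + eps)
    # Re-express them with the mean and spread of the basic map.
    mapped = z * base_map.std() + base_map.mean()
    # Restrict to the dynamic range of the basic map.
    return np.clip(mapped, base_map.min(), base_map.max())

rng = np.random.default_rng(2)
guidance = 100.0 * rng.standard_normal((8, 8)) + 50.0  # arbitrary feature scale
base_map = rng.uniform(0.0, 5.0, size=(8, 8))          # threshold-like scale
deep_map = match_distribution(guidance, base_map)
print(deep_map.min(), deep_map.max())
```

After this step the two maps can be combined element-wise without one path dominating purely because of its numeric scale.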

[0124] Based on the above embodiments, as an optional embodiment, a further improved implementation method for generating the depth prior embedding intensity map based on the perception threshold is provided, specifically including but not limited to: Determine the perception threshold corresponding to each local region in the target image; Based on the perception threshold corresponding to each of the local regions, a corresponding embedding strength value is assigned to each local region to form the depth prior embedding strength map. The perception threshold corresponding to the semantically sensitive region or visually smooth region in the target image is lower than the perception threshold corresponding to the textured complex region in the target image. Accordingly, the embedding strength value assigned to the semantically sensitive region or visually smooth region is less than the embedding strength value assigned to the textured complex region.

[0125] In a specific implementation, after transformation by a preset perception threshold mapping model, the perception threshold corresponding to each local region in the target image has been determined in the spatial dimension. Here, the perception threshold is no longer simply the physical difference calculated based on the underlying pixels, but rather an advanced perception tolerance index that deeply integrates multi-scale semantic features. This advanced perception tolerance index precisely quantifies the ability of each local block of the target image to conceal watermark perturbation signals under the advanced visual cognitive mechanisms of humans.

[0126] Once the perceptual tolerance of each local region is clearly defined, a direct mapping mechanism from the perceptual threshold to the actual watermark injection energy can be accurately established. Specifically, based on the perceptual threshold corresponding to each local region, a differentiated allocation of spatial energy can be performed, assigning a corresponding embedding intensity value to each local region, and combining these spatially distributed embedding intensity values ​​to ultimately form the final depth prior embedding intensity map.

[0127] In this allocation process, this embodiment follows strict semantic-driven adjustment logic. Since the previously generated perceptual guidance map is generated under depth prior guidance, it outputs lower and more uniform response values ​​in semantically important or visually flat areas, while outputting higher and more variable response values ​​in textured areas. Therefore, after mapping, the perceptual thresholds corresponding to semantically sensitive areas (such as facial features, text outlines, etc.) or visually smooth areas (such as a solid sky, a flat wall, etc.) in the target image will naturally be lower than the perceptual thresholds corresponding to textured areas (such as tree bark, grass, complex backgrounds, etc.) in the target image.

[0128] Accordingly, during energy allocation, the embedding intensity values assigned to semantically sensitive or visually smooth regions will be strictly less than those assigned to regions with complex textures. This semantics-based energy allocation strategy allows the system to identify semantically critical regions such as faces as highly sensitive and to suppress their embedding intensity even if they contain underlying texture variations such as pores and hair. Conversely, when a region is identified as a complex background with no semantic importance, its embedding potential is fully released.

[0129] In existing localized watermarking embedding methods, the perceptual models used to guide energy allocation often exhibit limitations: due to a lack of understanding of the high-level semantics of the image, traditional models cannot distinguish between complex backgrounds and crucial sensitive subject details, such as equating the pores of a face with the texture of grass, resulting in over-embedding in semantically sensitive areas and causing unacceptable artifacts, or being forced to globally reduce the embedding strength of all textured areas in order to protect the face, severely sacrificing the robustness of the watermark.

[0130] In contrast, the image watermarking embedding method provided in this embodiment establishes a causal mapping relationship between abstract perceptual thresholds and specific image semantic regions. By clearly defining the differentiated allocation logic of low threshold and low intensity in semantically sensitive / smooth areas and high threshold and high intensity in textured complex areas, the watermarking embedding system is endowed with a semantic brain similar to that of humans. This achieves truly content-adaptive and refined energy allocation, enabling strong visual protection in sensitive areas to ensure extremely high imperceptibility, while maximizing embedding energy in safe, high-tolerance areas. Thus, at the underlying logic level, it completely overcomes the core industry problem that traditional methods cannot balance visual quality and anti-attack capabilities.

[0131] Figure 5 is a schematic diagram of the basic embedding strength map generation process provided by the present invention. As an optional embodiment, and as shown in Figure 5, this invention provides a detailed implementation of a method for generating a basic embedding intensity map based on local pixel statistical features of the target image, including but not limited to: Step 51: Extract the pixel intensity distribution features and / or pixel spatial variation features of each local region in the target image, as the local pixel statistical features of each local region; Step 52: Calculate the basic embedding strength value corresponding to each local region based on the local pixel statistical features of each local region; Step 53: Construct the basic embedding strength map from the basic embedding strength values of each of the local regions.

[0132] In one specific implementation, the analysis unfolds along a first processing path focused on the underlying physical information of the image. To comprehensively and objectively quantify the physical visual redundancy of a local image, this embodiment extracts features from two core fundamental dimensions: First, it extracts pixel intensity distribution features, which essentially reflect the overall magnitude and distribution range of pixel values ​​within a local region, such as local average brightness and contrast, used to measure the region's tolerance to changes in global brightness or illumination; second, it extracts pixel spatial variation features, reflecting the drastic degree of numerical change between adjacent pixels, such as texture complexity, edge gradient, or high-frequency energy, used to assess the local region's ability to conceal minute perturbation signals. By summarizing the fundamental physical features extracted from these two dimensions, the local pixel statistical features of each local region can be obtained.

[0133] Furthermore, after obtaining the local pixel statistical features, a pre-defined visual psychology model, such as the classic minimum perceptible difference (JND) mapping logic, can be used to comprehensively calculate and transform these underlying physical features. Specifically, based on the aforementioned local pixel statistical features reflecting brightness and texture complexity, the lower limit of physical insensitivity of the human eye to changes in pixel values ​​in the current local area can be evaluated, and then the maximum safe perturbation threshold corresponding to each local area without causing visual perception, i.e., the basic embedding strength value, can be calculated.

[0134] Finally, by assembling the basic embedding intensity values calculated for all the divided local regions into a matrix according to their original spatial topological relationships, a two-dimensional energy matrix matching the spatial resolution of the target image, namely the basic embedding intensity map, can be constructed. This matrix objectively reflects the watermark carrying capacity of the target image in the underlying physical space.
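Steps 51 to 53 can be sketched as follows for a grayscale image. The rule that turns block statistics into a strength value is purely illustrative: the constants `floor`, `lum_gain`, and `tex_gain`, the use of distance from mid-gray as the brightness term, and the mean gradient magnitude as the texture term are assumptions, not the JND model of the embodiment.

```python
import numpy as np

def block_mean(x, block):
    """Mean of each non-overlapping block x block tile of an (H, W) array."""
    h, w = x.shape
    return x.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def base_strength_map(image, block=8, floor=1.0, lum_gain=1.0, tex_gain=0.1):
    """Block-wise strength map from brightness and gradient statistics."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    if h % block or w % block:
        raise ValueError("image size must be a multiple of the block size")
    # Step 51a: pixel intensity distribution feature (local mean brightness).
    mean_lum = block_mean(img, block)
    # Step 51b: pixel spatial variation feature (mean gradient magnitude).
    gy, gx = np.gradient(img)
    texture = block_mean(np.hypot(gy, gx), block)
    # Step 52: illustrative rule; more tolerance far from mid-gray and in texture.
    strength = (
        floor
        + lum_gain * np.abs(mean_lum - 127.5) / 127.5
        + tex_gain * texture
    )
    # Step 53: expand block values back to the image resolution.
    return np.kron(strength, np.ones((block, block)))

rng = np.random.default_rng(3)
image = np.full((16, 32), 128.0)
image[:, 16:] = rng.integers(0, 256, size=(16, 16))  # textured right half
strength_map = base_strength_map(image, block=8)
print(strength_map[0, 0], strength_map[0, 31])
```

The flat left half receives a value close to `floor`, while the noisy right half receives a clearly larger value, which matches the qualitative behaviour described for smooth versus textured regions.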

[0135] The image watermarking embedding method provided in this embodiment refines the underlying pixel attributes into two fundamental dimensions: pixel intensity distribution features and pixel spatial variation features. This covers the physical redundancies of the target image in terms of brightness contrast and texture variation. The resulting basic embedding intensity map ensures that, even in extreme cases where high-level semantic priors are lacking, the watermark remains robust against basic numerical-transformation attacks such as brightness adjustment, Gaussian noise, and JPEG compression.

[0136] Based on the above embodiments, as an optional embodiment, a further implementation method is provided that fuses the basic embedding strength map and the depth prior embedding strength map to obtain the target embedding strength map, mainly including but not limited to: Obtain the predetermined fusion weight coefficients; Based on the fusion weight coefficients, the base embedding intensity map and the deep prior embedding intensity map are weighted and fused to obtain the target embedding intensity map.

[0137] Before the final merging of the dual-path strengths, this embodiment first obtains the fusion weight coefficient. This fusion weight coefficient is a key parameter used to adjust the relative proportion between the underlying physical redundancy information and the high-level deep semantic prior information. It directly determines whether the final watermark embedding strategy is more inclined to rely on the physical robustness against attacks brought by basic pixel statistics or more inclined to rely on the visual imperceptibility protection brought by deep semantic understanding.

[0138] In one specific implementation, the fusion weight coefficient can be a global constant scalar preset by the system; for example, setting the fusion coefficient λ to 0.5 indicates that the evaluation results of the two paths are given equal trust and importance.

[0139] After obtaining the fusion weight coefficients, specific mathematical fusion operations can be performed using these coefficients. Specifically, based on the fusion weight coefficients, a pixel-by-pixel linear weighted summation can be performed on the basic embedding intensity map generated by the first path and the depth prior embedding intensity map generated by the second path.

[0140] For example, a simple linear weighted fusion formula can be used:

$$S_{\text{target}} = \lambda \cdot S_{\text{base}} + (1 - \lambda) \cdot S_{\text{deep}}$$

[0141] where $S_{\text{target}}$ represents the target embedding intensity map, $S_{\text{base}}$ represents the basic embedding strength map, $S_{\text{deep}}$ represents the depth prior embedding strength map, and $\lambda$ is the fusion weight coefficient. Through this linear weighted summation process, the redundant information of the underlying physical space, which was originally computed independently, is organically superimposed with the prior information of the high-level semantic perception, thus smoothly and stably obtaining a target embedding strength map that takes into account both attributes.
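As a minimal sketch of the pixel-by-pixel linear weighted summation with a global scalar coefficient: assigning λ to the basic path and 1 − λ to the depth prior path is an assumption made here for illustration, and the function name `fuse_strength_maps` is hypothetical.

```python
import numpy as np

def fuse_strength_maps(base_map, deep_map, lam=0.5):
    """Pixel-wise linear fusion of two strength maps with a global scalar."""
    if base_map.shape != deep_map.shape:
        raise ValueError("maps must share the same spatial resolution")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumed convention: lam weights the basic path, 1 - lam the deep path.
    return lam * base_map + (1.0 - lam) * deep_map

base_map = np.full((2, 2), 2.0)
deep_map = np.full((2, 2), 4.0)
print(fuse_strength_maps(base_map, deep_map, lam=0.5))
```

With λ = 0.5 the result is the plain average of the two maps, so the convention for which path receives λ does not matter in that particular case.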

[0142] The image watermarking embedding method provided in this embodiment provides an intuitive, interpretable, and easily adjustable balance tool for the watermarking embedding system by obtaining clear fusion weight coefficients. This allows engineers to flexibly adjust the contribution of the two processing paths according to the preferences of the actual application scenario. Without increasing the additional computing power burden, it perfectly realizes the synergistic complementarity of basic spatial redundancy and deep semantic prior, and provides a stable, reliable, and highly content-adaptive energy allocation benchmark for the final watermark signal modulation.

[0143] In one alternative implementation, the fusion weight coefficients include spatially adaptive weight values corresponding to each local region of the target image. Figure 6 is a schematic diagram of the target embedding intensity map generation process provided by the present invention. As shown in Figure 6, an implementation method for obtaining fusion weight coefficients is provided, which mainly includes, but is not limited to: Step 61: Obtain the feature distribution value of the local pixel statistical features corresponding to each local region, and the semantic response value of each local region corresponding to the perception guidance map; Step 62: Calculate the relative proportion between the feature distribution value and the semantic response value of each local region; Step 63: Based on the relative proportion, dynamically generate the spatial adaptive weight value corresponding to each of the local regions; Step 64: Compose the fusion weight coefficient from the spatial adaptive weight values corresponding to each of the local regions.

[0144] In a specific implementation, to achieve extremely fine-grained control, this embodiment reuses the multi-dimensional feature data extracted in the preceding processing path without introducing additional network overhead. On one hand, the feature distribution values ​​of the local pixel statistical features corresponding to each local region are obtained from the first processing path. These feature distribution values ​​objectively quantify the texture complexity or smoothness of the current region in the underlying physical space. On the other hand, the semantic response values ​​corresponding to each local region in the perception guidance map are obtained from the second processing path. These semantic response values ​​characterize the semantic sensitivity level of the current region in human high-level visual cognition, such as whether it belongs to a key subject that is easily perceived visually, such as a face or text.

[0145] After obtaining the quantitative indicators of the above two dimensions, the relative proportion between the feature distribution value and the semantic response value of each local region can be calculated by performing in-depth cross-comparison at the pixel level or region level. The core purpose of this calculation step is to evaluate whether the underlying physical texture complexity or the higher-level semantic sensitivity dominates in each tiny local space, thereby providing a quantitative mathematical basis for subsequent accurate weighting.

[0146] Furthermore, based on the calculated relative proportions, the rigid constraints of fixed parameters are eliminated, and spatial adaptive weight values ​​corresponding to each local region can be dynamically generated based on the relative proportions.

[0147] For example, if the relative proportions of a local region indicate that its semantic response value is extremely high, such as a close-up of a face, a weight value biased towards the depth prior path will be dynamically generated to ensure that the visual imperceptibility of this local region is protected with absolute priority. Conversely, if the relative proportions indicate that its feature distribution values ​​suggest extremely complex textures without obvious semantics, such as a cluttered background of leaves, a weight value biased towards the basic pixel path will be dynamically generated to maximize the embedding potential of this local region and improve its resistance to attacks.

[0148] Finally, the adaptive weight values ​​calculated independently for all local regions are matrix-concatenated according to the original spatial topology, and the spatial adaptive weight values ​​corresponding to each local region constitute the fusion weight coefficient. At this point, the fusion weight coefficient has been upgraded to a dynamic weight matrix that perfectly matches the spatial resolution of the target image. This dynamic weight matrix is ​​then seamlessly input into the linear weighted summation processing step to guide the final fusion of the intensity of the two processing paths.
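Steps 61 to 64 can be sketched as below. The sketch assumes that both the feature distribution values and the semantic response values have already been normalized to a comparable non-negative scale, and that the relative proportion is the share of the feature value in their sum; the source does not fix either choice. The names `adaptive_weights` and `fuse_adaptive` are hypothetical.

```python
import numpy as np

def adaptive_weights(feature_values, semantic_values, eps=1e-8):
    """Per-region weight of the basic pixel path, in [0, 1).

    Assumed proportion: texture evidence divided by the sum of
    texture evidence and semantic evidence.
    """
    f = np.asarray(feature_values, dtype=float)
    s = np.asarray(semantic_values, dtype=float)
    return f / (f + s + eps)

def fuse_adaptive(base_map, deep_map, w_base):
    """Region-wise linear fusion; the deep path receives 1 - w_base."""
    return w_base * base_map + (1.0 - w_base) * deep_map

# Left region: cluttered texture, little semantics. Right region: face-like.
feature_values = np.array([[0.9, 0.1]])
semantic_values = np.array([[0.1, 0.9]])
w_base = adaptive_weights(feature_values, semantic_values)

base_map = np.array([[4.0, 4.0]])
deep_map = np.array([[1.0, 1.0]])
print(fuse_adaptive(base_map, deep_map, w_base))
```

In the textured region the fused value stays close to the basic map, while in the semantically sensitive region it is pulled toward the lower depth prior value, mirroring the two cases described in the example above.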

[0149] The image watermarking embedding method provided in this embodiment completely upgrades the globally fixed fusion parameters to a spatially adaptive dynamic weight matrix. By directly cross-comparing the feature distribution values ​​of the lower layer with the semantic response values ​​of the higher layer, it truly achieves pixel-level / local region-level fine-grained energy control. This not only makes full use of the information value of existing features, but also balances the invisibility and robustness in the watermark embedding process. This allows the model to intelligently and smoothly switch between two strategies—physical defense priority and semantic protection priority—in different blocks of the same image, thereby significantly improving the overall concealment performance and adversarial robustness of the entire watermarking system in the face of various complex and ever-changing scenarios.

[0150] Based on the above embodiments, as an optional embodiment, a further implementation method is provided to superimpose the target watermark perturbation signal onto the target image to generate a watermarked image, which mainly includes, but is not limited to: Obtain the embedding mask corresponding to the target image, and determine the target embedding region in the target image; The target watermark perturbation signal is superimposed only onto the target embedding region to generate the watermarked image.

[0151] In this specific implementation, after generating the target watermark perturbation signal with fine energy allocation, this embodiment does not directly cover the entire target image globally. Instead, it first obtains the embedding mask corresponding to the target image. This embedding mask is usually a two-dimensional matrix or a binarized mask image that matches the spatial size of the target image (for example, an element value of 1 in the matrix represents that embedding is allowed, and a value of 0 represents that embedding is prohibited).

[0152] Using this embedding mask, the target embedding region in the target image is precisely determined in the spatial coordinate dimension. This operation sets strict physical boundaries for watermark injection, allowing the system to flexibly and controllably restrict the hiding operation to specific local areas of the image. For example, watermark embedding can be performed in a rectangular area at the center of the target image, a specific diagonal area, or a mask area generated based on specific content, while other local areas are isolated and protected.

[0153] After defining the spatial boundaries, a spatially constrained signal fusion operation is performed to superimpose the target watermark perturbation signal only onto the target embedding region. In practice, an embedding mask can be used as a gating switch, adding or fusing the target watermark perturbation signal with the original pixel value of the target image only at pixel coordinates where the mask indicates validity (e.g., a value of 1); while in areas where the mask indicates invalidity (e.g., a value of 0), the original pixels of the target image remain absolutely unmodified.
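The gating behaviour of the embedding mask can be sketched as follows, assuming an 8-bit value range for the image and a binary mask of the same spatial size; the function name `embed_with_mask` and the clipping to [0, 255] are assumptions made for illustration.

```python
import numpy as np

def embed_with_mask(image, perturbation, mask):
    """Add the perturbation only where the mask equals 1."""
    gate = np.asarray(mask).astype(bool)
    out = np.asarray(image, dtype=float).copy()
    # Pixels outside the mask are never written to.
    out[gate] += perturbation[gate]
    # Assumed 8-bit range; only touched pixels can leave it.
    return np.clip(out, 0.0, 255.0)

image = np.full((4, 4), 100.0)
perturbation = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1  # central rectangular target embedding region
watermarked = embed_with_mask(image, perturbation, mask)
print(watermarked)
```

Only the four central pixels change; every pixel where the mask is 0 keeps its original value.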

[0154] Through the aforementioned precise localization and overlay processing, hidden signals can be securely embedded in a controlled local area, ultimately outputting a watermarked image with extremely high overall visual quality and locally concealed markings.

[0155] The image watermarking embedding method provided in this embodiment accurately delimits the target embedding region by acquiring and applying an embedding mask. This allows the system to concentrate and covertly deploy watermark energy in high-security local areas of the image, greatly reducing the overall visual fidelity loss caused by modifying the entire image. Furthermore, because superposition is strictly confined to the local region, the watermarking system can independently and completely recover copyright or traceability information even when facing local destructive attacks such as cropping or splicing, as long as the target embedding region is not completely destroyed. This mechanism significantly enhances the local resilience and survivability of the entire watermarking system in complex tampering scenarios through physical spatial isolation.

[0156] Figure 7 is a schematic diagram of the image watermark embedding device provided by the present invention. As shown in Figure 7, the device mainly includes, but is not limited to, the following components: the signal generation module 71, used to generate an initial watermark perturbation signal based on the target image and the watermark information to be embedded; the first path processing module 72, used to generate a basic embedding intensity map based on the local pixel statistical features of the target image; the second path processing module 73, used to extract multi-scale depth features of the target image and fuse them to generate a perception guidance map, and to generate a depth prior embedding intensity map based on the perception guidance map; the intensity fusion module 74, used to fuse the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, which indicates the differentiated watermark embedding intensity of different local regions in the target image; the perturbation modulation module 75, used to modulate the initial watermark perturbation signal using the target embedding intensity map to obtain the target watermark perturbation signal; and the watermark overlay module 76, used to overlay the target watermark perturbation signal onto the target image to generate a watermarked image.
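The last two modules (75 and 76) can be pictured with a short sketch. It assumes that "modulating" means element-wise multiplication of the initial perturbation by the target embedding intensity map and that the image uses an 8-bit value range; both are assumptions, and the function names `modulate` and `overlay` are hypothetical.

```python
import numpy as np

def modulate(initial_perturbation, strength_map):
    """Module 75 (assumed form): scale the perturbation per pixel."""
    return strength_map * initial_perturbation

def overlay(image, target_perturbation):
    """Module 76: superimpose the perturbation and keep an 8-bit range."""
    return np.clip(np.asarray(image, dtype=float) + target_perturbation, 0.0, 255.0)

image = np.full((2, 2), 100.0)
initial_perturbation = np.array([[1.0, -1.0], [1.0, -1.0]])
# Low strength in the left column (sensitive), high in the right (textured).
strength_map = np.array([[0.5, 2.0], [0.5, 2.0]])
watermarked = overlay(image, modulate(initial_perturbation, strength_map))
print(watermarked)
```

The same unit-amplitude perturbation thus produces a change of 0.5 in the low-strength column and 2.0 in the high-strength column.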

[0157] It should be noted that the image watermark embedding device provided by the present invention can execute the image watermark embedding method described in any of the above embodiments during specific operation, which will not be elaborated in this embodiment.

[0158] The image watermark embedding device provided by this invention achieves precise, content-adaptive energy allocation through dual-path collaborative fusion of low-level pixel statistics and high-level semantic priors. This ensures extremely high imperceptibility in sensitive areas and enhances robustness in areas with complex textures, effectively breaking through the bottleneck of the difficulty in balancing watermark imperceptibility and robustness.

[0159] Figure 8 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 8, the electronic device may include: a processor 810, a communications interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communications interface 820, and the memory 830 communicate with each other through the communication bus 840. The processor 810 can call logical instructions in the memory 830 to execute an image watermark embedding method, which includes: generating an initial watermark perturbation signal based on a target image and watermark information to be embedded; generating a basic embedding intensity map based on local pixel statistical features of the target image; extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map based on the perceptual guidance map; fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, wherein the target embedding intensity map is used to indicate the differentiated watermark embedding intensity in different local regions of the target image; modulating the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal; and superimposing the target watermark perturbation signal onto the target image to generate a watermarked image.

[0160] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0161] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, wherein when the program instructions are executed by a computer, the computer is able to execute the image watermark embedding method provided in the above embodiments, the method comprising: generating an initial watermark perturbation signal based on a target image and watermark information to be embedded; generating a basic embedding intensity map based on local pixel statistical features of the target image; extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map based on the perceptual guidance map; fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, the target embedding intensity map being used to indicate the differentiated watermark embedding intensity in different local regions of the target image; modulating the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal; and superimposing the target watermark perturbation signal onto the target image to generate a watermarked image.

[0162] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, performs the image watermarking embedding method provided in the above embodiments. The method includes: generating an initial watermark perturbation signal based on a target image and watermark information to be embedded; generating a basic embedding intensity map based on local pixel statistical features of the target image; extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map based on the perceptual guidance map; fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, the target embedding intensity map indicating the differentiated watermark embedding intensity in different local regions of the target image; modulating the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal; and superimposing the target watermark perturbation signal onto the target image to generate a watermarked image.

[0163] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0164] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0165] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An image watermark embedding method, characterized by comprising: generating an initial watermark perturbation signal based on a target image and watermark information to be embedded; generating a basic embedding intensity map based on local pixel statistical features of the target image; extracting multi-scale depth features of the target image and fusing them to generate a perceptual guidance map, and generating a depth prior embedding intensity map based on the perceptual guidance map; fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, the target embedding intensity map being used to indicate the differentiated watermark embedding intensity in different local regions of the target image; modulating the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal; and superimposing the target watermark perturbation signal onto the target image to generate a watermarked image.

2. The image watermark embedding method of claim 1, characterized in that extracting the multi-scale depth features of the target image and fusing them to generate the perceptual guidance map comprises: extracting the multi-scale depth features from the target image; performing feature extraction and spatial alignment processing on the multi-scale depth features to obtain a spatially aligned feature map set; dynamically determining an adaptive fusion weight corresponding to each spatially aligned feature map in the spatially aligned feature map set; and, based on the adaptive fusion weights, performing weighted fusion on all the spatially aligned feature maps in the spatially aligned feature map set to generate the perceptual guidance map.

3. The image watermark embedding method of claim 2, characterized in that performing feature extraction and spatial alignment processing on the multi-scale depth features to obtain the spatially aligned feature map set comprises: using convolution operations to compress and reweight the channel dimension of each layer of feature maps in the multi-scale depth features, to obtain a compressed feature map corresponding to each layer; upsampling each compressed feature map so that its resolution is consistent with the resolution of the target image, to obtain a corresponding single-channel feature map; and forming the spatially aligned feature map set from the single-channel feature maps.

4. The image watermark embedding method of claim 2, characterized in that dynamically determining the adaptive fusion weight corresponding to each spatially aligned feature map in the spatially aligned feature map set comprises: performing cross-scale interaction processing on the spatially aligned feature map set to obtain an interaction feature map; performing global pooling on the interaction feature map to generate a global content descriptor corresponding to the target image; and, based on the global content descriptor, dynamically calculating a weight value corresponding to each spatially aligned feature map, the weight values being used as the adaptive fusion weights.

5. The image watermark embedding method of claim 4, characterized in that performing cross-scale interaction processing on the spatially aligned feature map set to obtain the interaction feature map comprises: performing convolution operations on the spatially aligned feature map set to extract local consistency features, to obtain an initial interaction feature map; and performing, on the initial interaction feature map, residual processing that fuses cross-scale feature information, to obtain the interaction feature map.

6. The image watermark embedding method of claim 1, characterized in that generating the depth prior embedding intensity map based on the perceptual guidance map comprises: mapping the perceptual guidance map using a preset perception threshold mapping model to obtain a perception threshold corresponding to the perceptual guidance map; and generating the depth prior embedding intensity map based on the perception threshold; wherein the depth prior embedding intensity map has the same dimensions and numerical distribution as the basic embedding intensity map.

7. The image watermark embedding method of claim 6, characterized in that generating the depth prior embedding intensity map based on the perception threshold comprises: determining the perception threshold corresponding to each local region in the target image; and, based on the perception threshold corresponding to each local region, assigning a corresponding embedding intensity value to each local region to form the depth prior embedding intensity map; wherein the perception threshold corresponding to a semantically sensitive region or a visually smooth region in the target image is lower than the perception threshold corresponding to a region of complex texture in the target image, and, accordingly, the embedding intensity value assigned to the semantically sensitive region or the visually smooth region is less than the embedding intensity value assigned to the region of complex texture.

8. The image watermark embedding method of claim 1, characterized in that generating the basic embedding intensity map based on the local pixel statistical features of the target image comprises: extracting pixel intensity distribution features and/or pixel spatial variation features of each local region in the target image as the local pixel statistical features of that local region; calculating, based on the local pixel statistical features of each local region, a basic embedding intensity value corresponding to that local region; and forming the basic embedding intensity map from the basic embedding intensity values of the local regions.

9. The image watermark embedding method of claim 1, characterized in that fusing the basic embedding intensity map and the depth prior embedding intensity map to obtain the target embedding intensity map comprises: obtaining predetermined fusion weight coefficients; and, based on the fusion weight coefficients, performing weighted fusion on the basic embedding intensity map and the depth prior embedding intensity map to obtain the target embedding intensity map.

10. The image watermark embedding method of claim 9, characterized in that the fusion weight coefficients include a spatially adaptive weight value corresponding to each local region of the target image, and obtaining the predetermined fusion weight coefficients comprises: obtaining a feature distribution value of the local pixel statistical features corresponding to each local region, and a semantic response value of each local region in the perceptual guidance map; calculating a relative proportional relationship between the feature distribution value and the semantic response value of each local region; dynamically generating, based on the relative proportional relationship, the spatially adaptive weight value corresponding to each local region; and forming the fusion weight coefficients from the spatially adaptive weight values corresponding to the local regions.

11. The image watermark embedding method of claim 1, characterized in that superimposing the target watermark perturbation signal onto the target image to generate the watermarked image comprises: obtaining an embedding mask corresponding to the target image and determining a target embedding region in the target image; and superimposing the target watermark perturbation signal only onto the target embedding region to generate the watermarked image.

12. An image watermark embedding apparatus, characterized by comprising: a signal generation module, configured to generate an initial watermark perturbation signal based on a target image and watermark information to be embedded; a first path processing module, configured to generate a basic embedding intensity map based on local pixel statistical features of the target image; a second path processing module, configured to extract multi-scale depth features of the target image and fuse them to generate a perceptual guidance map, and to generate a depth prior embedding intensity map based on the perceptual guidance map; an intensity fusion module, configured to fuse the basic embedding intensity map and the depth prior embedding intensity map to obtain a target embedding intensity map, the target embedding intensity map being used to indicate the differentiated watermark embedding intensity in different local regions of the target image; a perturbation modulation module, configured to modulate the initial watermark perturbation signal using the target embedding intensity map to obtain a target watermark perturbation signal; and a watermark superposition module, configured to superimpose the target watermark perturbation signal onto the target image to generate a watermarked image.

13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the computer program, the image watermark embedding method of any one of claims 1 to 11 is implemented.

14. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the image watermark embedding method of any one of claims 1 to 11 is implemented.
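By way of non-limiting illustration, the spatially adaptive fusion of claim 10 and the mask-restricted superposition of claim 11 can be sketched as follows. The claims do not fix a formula for the relative proportional relationship; the sketch assumes that the weight of the basic map in each region is the share of the feature distribution value in the sum of that value and the semantic response value. Function names and the `eps` parameter are hypothetical.

```python
import numpy as np

def spatial_fusion_weights(feature_values, semantic_responses, eps=1e-8):
    """Assumed rule: per-region weight of the basic map equals the share of
    the local statistic in (local statistic + semantic response)."""
    f = np.abs(feature_values)
    s = np.abs(semantic_responses)
    return f / (f + s + eps)

def fuse_adaptive(basic, prior, weights):
    """Region-wise weighted fusion of the two intensity maps."""
    return weights * basic + (1.0 - weights) * prior

def embed_masked(image, target_perturbation, mask):
    """Superimpose the perturbation only where the embedding mask is set."""
    out = image.astype(float).copy()
    region = mask.astype(bool)
    out[region] += target_perturbation[region]
    return np.clip(out, 0.0, 255.0)
```

With this rule, a region where the semantic response dominates the local statistic is governed mainly by the depth prior embedding intensity map, and pixels outside the target embedding region are left unchanged.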