Image tampering localization method, apparatus, device, and medium based on difference perception
A difference-aware image tampering localization method uses a dual-stream feature extraction network and a high-resolution feature recovery module to accurately locate minute tampered regions in biomedical images, achieving high-resolution identification and segmentation of tampered regions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies struggle to achieve pixel-level precision in locating minute tampered areas in biomedical images, particularly due to issues such as microscopic feature dilution and loss of boundary details.
A difference-aware image tampering localization method is adopted. Multi-scale local and global features are extracted through a dual-stream feature extraction network, and difference calculation and fusion are performed. Combined with a high-resolution feature recovery module, a pixel-level tampering probability mask is generated.
It achieves high-resolution and precise localization and complete boundary segmentation of tiny tampered regions in biomedical images, improves the accuracy of tampered region identification, and can accurately restore the true geometric shape of the tampered region.
Figure CN122313005A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of image processing technology and, more specifically, relates to an image tampering localization method, apparatus, device, and medium based on difference perception.
Background Technology
[0002] In the field of biomedical research, images serve as a crucial medium for visualizing experimental results and validating scientific arguments; their authenticity and integrity directly impact the reliability of scientific discoveries. However, the act of deliberately altering images not only severely erodes the foundation of scientific integrity, leading to numerous paper retractions and wasting research resources, but also profoundly damages public trust in the scientific research system. Although general image forgery detection technologies have made some progress in the field of natural images, biomedical images, due to their unique visual characteristics and semantic structure, still face fundamental challenges in locating tampering.
[0003] While existing technologies can capture traces of forgery in general scenarios, their detection performance is significantly reduced due to the special characteristics of biomedical images—such as the microscale features and dense distribution patterns of tampered targets like cells and colonies—making it difficult to achieve pixel-level precise positioning.
[0004] Specifically, the current technological shortcomings are mainly reflected in the following two aspects: (1) Microscopic feature dilution problem: Tampered targets (such as cells and colonies) in biomedical images are usually small in size and densely distributed. The downsampling operation used by existing models during the encoding process will gradually erode these already weak tampered features, causing the tampered signals in the deep feature space to be submerged by background noise, making it difficult to effectively activate and identify small tampered regions.
[0005] (2) Boundary detail loss problem: Continuous downsampling and pooling operations further destroy the edge details and transition information between the tampered area and normal tissue. Even if the model captures the approximate extent of the tampering through coarse localization, the irreversible loss of high-frequency boundary information means that the final segmentation result often shows blurred boundaries, broken contours, or incomplete regions, and cannot accurately restore the true geometric shape of the tampered area.
Summary of the Invention
[0006] In view of the above problems, this application provides a method, apparatus, device, and medium for locating image tampering based on difference perception, thereby solving or at least alleviating one or more of the above-mentioned problems and other problems existing in the prior art.
[0007] A first aspect of this application provides an image tampering localization method based on difference perception, comprising:
The input target image is preprocessed to obtain the preprocessed image;
Features are extracted from the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features;
For the multi-scale local features and multi-scale global features at each scale level, difference calculation and fusion are performed respectively to obtain multi-scale difference-aware features;
The multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features are upsampled to the resolution of the target image to obtain upsampled local features, upsampled global features, and upsampled difference features;
The upsampled local features, the upsampled global features, and the upsampled difference features are sequentially fused to obtain the fused features;
The fused features are input into the prediction head to generate a tampering probability mask;
The tampered area in the target image is determined based on a threshold.
A second aspect of this application provides an image tampering localization device based on difference perception, comprising:
The preprocessing unit is used to preprocess the input target image to obtain the preprocessed image;
The extraction unit is used to extract features from the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features;
The computing unit is used to perform difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain multi-scale difference-aware features;
The upsampling unit is used to upsample the multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features to the resolution of the target image, respectively, to obtain upsampled local features, upsampled global features, and upsampled difference features;
The fusion unit is used to sequentially fuse the upsampled local features, the upsampled global features, and the upsampled difference features to obtain fused features;
The prediction unit is used to input the fused features into the prediction head and generate a tampering probability mask;
The determination unit is used to determine the tampered area in the target image based on a threshold.
A third aspect of this application provides an electronic device including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the steps of the method described above.
[0008] A fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.
[0009] The beneficial effects of the embodiments of this application are as follows. A dual-stream feature extraction network captures multi-scale local details and global semantic features, and difference calculation and fusion make tampering traces easier to distinguish. All features are uniformly upsampled to the original resolution so that small-scale features keep their fidelity, and the three types of features are then fused in sequence for progressive collaborative optimization. Finally, a pixel-level tampering probability mask is generated and thresholded. This achieves high-resolution, accurate localization and complete boundary segmentation of small tampered areas in biomedical images, improves the accuracy of tampered-area identification, and accurately restores the true geometric shape of the tampered area.
Attached Figure Description
[0010] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1 is a flowchart of an image tampering localization method based on difference perception provided in an embodiment of this application;
Figure 2 is a flowchart of an image tampering localization method based on difference perception provided in an embodiment of this application;
Figure 3 is a schematic diagram of the generation process of difference-aware features provided in an embodiment of this application;
Figure 4 is a schematic diagram of the process of generating a tampering mask according to an embodiment of this application;
Figure 5 is a schematic block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0012] To enable those skilled in the art to better understand the technical solutions in the embodiments of this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments in the specific implementation of this application should fall within the protection scope of the embodiments of this application.
[0013] To keep the drawings concise, each drawing only schematically shows the parts relevant to the disclosure; these do not represent the actual structure of the product. Furthermore, for ease of understanding, in some drawings, only one of components with the same structure or function is schematically shown, or only one is labeled. In this document, "one" not only means "only one," but can also mean "more than one," and "several" includes "two" and "more than two."
[0014] Furthermore, in the description of this application, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0015] It should be understood that, unless the context clearly states otherwise, the terms "comprising," "including," or "having" as used herein refer to the presence of an element, but do not exclude the presence or addition of one or more other elements. Furthermore, "comprising" and / or "including" as used herein specify the presence of shapes, numbers, steps, operations, members, elements, and / or combinations thereof, and do not exclude the presence or addition of one or more other shapes, numbers, operations, elements, and / or combinations thereof. Some embodiments of this application are described in detail below with reference to the accompanying drawings. Where there is no conflict between the embodiments, the following embodiments and features can be combined with each other. The steps in the following method embodiments are for illustrative purposes only and are not intended to limit this application.
[0016] This application proposes an efficient feature fusion mechanism that fully explores the complementary information of different modal features, reduces redundant interference, and enhances the identification of tampered features. It also constructs a high-resolution feature recovery module to solve the problems of tampered feature dilution and edge detail degradation caused by downsampling, and achieves accurate boundary segmentation of the tampered region.
[0017] The proposed difference-aware image tampering localization method is based on a CNN-Transformer hybrid architecture. Its core components are dual-stream feature extraction, a difference-aware feature fusion module (FFM), and a bottom-up attention-guided high-resolution decoder (Region Decoder). Through the synergy of multi-dimensional feature extraction, deep fusion, and high-resolution restoration, it achieves precise localization of tampered regions in biomedical images. The overall workflow is: input image preprocessing → dual-stream feature extraction → difference-aware feature fusion → high-resolution feature restoration and tampering localization → output of pixel-level tampering results. The detailed workflow is shown in Figure 2: input target image → DCT frequency domain decomposition (into low-frequency and high-frequency components) → channel concatenation → dual-stream encoder (the CNN branch extracts local features $f_{ci}$ and the Transformer branch extracts global features $f_{ti}$) → Feature Fusion Module (FFM) → Region Decoder → tampered region localization → region loss calculation.
[0018] This application also provides a method for locating image tampering based on difference perception, including the following steps S101-S107: S101. Preprocess the input target image to obtain the preprocessed image; Specifically, the preprocessing of the input target image to obtain the preprocessed image includes: The input target image is uniformly resized according to the preset dimensions to obtain the preprocessed image.
[0019] Optionally, the aforementioned target image is the image to be examined for tampering, specifically a biomedical RGB image, and the preset size is 512×512. Resizing ensures the consistency and stability of the subsequent feature extraction process. No additional complex preprocessing is required, which avoids damaging the original tampering traces and feature information in the image.
[0020] S102. Based on a dual-stream feature extraction network, feature extraction is performed on the preprocessed image to obtain multi-scale local features and multi-scale global features. Dual-stream feature extraction networks refer to feature extraction networks composed of parallel CNN branches and Transformer branches. This process combines the advantages of CNN's local feature extraction with Transformer's global feature capture capabilities, while also introducing frequency domain features to achieve multi-dimensional and multi-scale feature coverage of tampering traces in biomedical images.
[0021] By capturing the differences between features from CNN branches and Transformer branches, the complementary information of cross-modal features is enhanced. At the same time, redundant information is suppressed through adaptive calibration, thereby improving the ability of fused features to identify tampering traces.
[0022] In some optional embodiments of this application, in the aforementioned step S102, feature extraction is performed on the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features, including the following steps S1021-S1023: S1021. Perform frequency domain decomposition on the preprocessed image to obtain multiple decomposition results; In some optional embodiments, the step of performing frequency domain decomposition on the preprocessed image to obtain multiple decomposition results includes: performing discrete cosine transform (DCT) on the preprocessed image to decompose the image into multiple decomposition results including multiple frequency intervals, with each frequency interval corresponding to one decomposition result.
[0023] Specifically, one frequency range corresponds to the detailed changes in the image, while the other frequency range corresponds to the overall contour information of the image.
[0024] Optionally, multiple frequency ranges include high-frequency component ranges and low-frequency component ranges, where the high-frequency component corresponds to the detail changes in the image and the low-frequency component corresponds to the overall contour information of the image. Through frequency domain decomposition, hidden tampering traces that are not visible in the RGB channels can be discovered.
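The frequency decomposition and the later channel concatenation can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions: the text does not say how the low and high bands are separated, so the `cutoff` parameter and the rule "index sum u + v below the cutoff is low frequency" are hypothetical, and a 16×16 random array stands in for the 512×512 image.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]          # frequency index
    x = np.arange(n)[None, :]          # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)         # the DC row uses a different scale
    return m

def split_frequencies(channel: np.ndarray, cutoff: int):
    """Split one image channel into low- and high-frequency parts.

    ASSUMPTION: `cutoff` and the u + v < cutoff rule are illustrative only.
    """
    h, w = channel.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ channel @ dw.T                      # forward 2-D DCT
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    low_mask = (u + v) < cutoff
    low = dh.T @ (coeffs * low_mask) @ dw             # inverse DCT, low band
    high = dh.T @ (coeffs * ~low_mask) @ dw           # inverse DCT, high band
    return low, high

rng = np.random.default_rng(0)
image = rng.random((3, 16, 16))        # small stand-in for a 3 x 512 x 512 image
bands = [split_frequencies(c, cutoff=8) for c in image]
low = np.stack([b[0] for b in bands])
high = np.stack([b[1] for b in bands])
# Channel concatenation of the image with its decomposition results.
multimodal = np.concatenate([image, low, high], axis=0)
```

Because the two masks partition the DCT coefficients and the transform is linear, the low and high bands sum back to the original channel, so no information is discarded by the split itself.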
[0025] S1022. Perform feature concatenation on the multiple decomposition results to obtain multimodal features; Specifically, the aforementioned multiple decomposition results can be concatenated with the target image to form multimodal features, providing a rich information foundation for subsequent local and global feature extraction. Optionally, the aforementioned multimodal features can also be obtained by concatenating the features of the aforementioned multiple decomposition results with the preprocessed image.
[0026] S1023. Input the multimodal features into the CNN branch and the Transformer branch respectively to obtain the multi-scale local features and the multi-scale global features respectively.
[0027] Optionally, the multimodal features can be fed along two paths, into a CNN branch and a Transformer branch respectively. The CNN branch can use ConvNeXt-Tiny as the backbone network, with 96, 192, 384, and 768 feature channels at its four stages. It extracts local texture anomaly information stage by stage through multi-layer convolution and downsampling and outputs the multi-scale local features $F_c$:
[0028] $$F_c = \{f_{ci}\}_{i=1}^{4} = \{f_{c1}, f_{c2}, f_{c3}, f_{c4}\}, \qquad f_{ci} \in \mathbb{R}^{C_{ci} \times H_i \times W_i},$$
[0029] where $f_{ci}$ is the local feature output by the i-th stage; $C_{ci}$ is the number of channels of $f_{ci}$, with $C_{c1}$, $C_{c2}$, $C_{c3}$, and $C_{c4}$ equal to 96, 192, 384, and 768 respectively; $H_i$ and $W_i$ are the height and width of the feature map of the i-th stage; and H and W are the height and width of the input image, both of which are 512. In this application, i refers to the i-th stage. Optionally, there are four stages, so i is 1, 2, 3, or 4.
[0030] The Transformer branch can adopt a SegFormer structure, with 64, 128, 320, and 512 feature channels at its four stages, corresponding to 1, 2, 5, and 8 attention heads. It models global pixel-to-pixel relationships through a self-attention mechanism, extracts global semantic features, and outputs the multi-scale global features $F_t$:
[0031] $$F_t = \{f_{ti}\}_{i=1}^{4} = \{f_{t1}, f_{t2}, f_{t3}, f_{t4}\}, \qquad f_{ti} \in \mathbb{R}^{C_{ti} \times H_i \times W_i},$$
[0032] where $f_{ti}$ is the global feature output by the i-th stage; $C_{ti}$ is the number of channels of $f_{ti}$, with $C_{t1}$, $C_{t2}$, $C_{t3}$, and $C_{t4}$ equal to 64, 128, 320, and 512 respectively; $H_i$ and $W_i$ are the height and width of the feature map of the i-th stage; and H and W are the height and width of the input image, both of which are 512.
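As a quick bookkeeping aid, the per-stage feature shapes of the two branches can be tabulated as below. The channel counts and head counts are taken from the text. The downsampling strides of 4, 8, 16, and 32 are an assumption (they are customary for these backbone families but are not stated here), so the spatial sizes are indicative only.

```python
H = W = 512
cnn_channels = [96, 192, 384, 768]          # ConvNeXt-Tiny stages (from the text)
transformer_channels = [64, 128, 320, 512]  # SegFormer stages (from the text)
attention_heads = [1, 2, 5, 8]              # heads per stage (from the text)
assumed_strides = [4, 8, 16, 32]            # ASSUMPTION: not stated in the text

def stage_shapes(channels, strides, height, width):
    """Return (C_i, H_i, W_i) for every stage."""
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]

local_shapes = stage_shapes(cnn_channels, assumed_strides, H, W)
global_shapes = stage_shapes(transformer_channels, assumed_strides, H, W)
# Channels handled by each attention head in the Transformer branch.
head_dims = [c // h for c, h in zip(transformer_channels, attention_heads)]

for i, (lc, gc) in enumerate(zip(local_shapes, global_shapes), start=1):
    print(f"stage {i}: local {lc}, global {gc}")
```

The two branches have different channel counts at every stage, which is why the channel dimensions must be aligned before any element-wise comparison of local and global features.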
[0033] Furthermore, by capturing the differences between features in CNN branches and Transformer branches, the complementary information of cross-modal features can be enhanced. At the same time, by suppressing redundant information through adaptive calibration, the ability of fused features to identify tampering traces can be improved.
[0034] S103. For the multi-scale local features and multi-scale global features at each scale level, perform difference calculation and fusion respectively to obtain multi-scale difference-aware features.
In some optional embodiments of this application, taking the feature calculation at the i-th scale level as an example, S103 includes the following steps S1031-S1036:
S1031. For the local and global features of the same stage, align the channel dimensions and calculate the element-wise absolute difference to obtain the feature difference map. Specifically, the channel dimensions of the CNN local feature $f_{ci}$ and the Transformer global feature $f_{ti}$ of the i-th stage are first aligned through a linear layer, so that the two features have the same shape. The element-wise absolute difference is then calculated to obtain the feature difference map $D_i$, which highlights inconsistencies between the two features and therefore tampering-related information:
$$D_i = \left| f_{ci} - f_{ti} \right|,$$
where $f_{ci}$ and $f_{ti}$ here denote the channel-aligned features.
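Step S1031 can be sketched in NumPy as follows. This is a minimal illustration, not the claimed implementation: the common channel count of 64, the random weights, and the helper name `align_channels` are hypothetical, and the linear layer is applied per pixel without a bias term.

```python
import numpy as np

def align_channels(feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Per-pixel linear layer: (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feat)

rng = np.random.default_rng(0)
f_c = rng.standard_normal((96, 8, 8))    # stage-1 local (CNN) feature
f_t = rng.standard_normal((64, 8, 8))    # stage-1 global (Transformer) feature
common = 64                              # ASSUMPTION: shared channel count
w_c = rng.standard_normal((common, 96)) * 0.1
w_t = rng.standard_normal((common, 64)) * 0.1

# Element-wise absolute difference of the channel-aligned features.
diff_map = np.abs(align_channels(f_c, w_c) - align_channels(f_t, w_t))
```

Large entries of `diff_map` mark positions and channels where the local and global views of the image disagree.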
[0035] S1032. Perform channel significance scoring and reweighting on the feature difference map to obtain the filtered difference features.
Optionally, the feature difference map $D_i$ is passed through a difference feature screening module that filters out invalid information such as noise and retains high-value differences, improving the accuracy of subsequent fusion. Channel-wise filtering is achieved through a "significance scoring + feature reweighting" process.
[0036] In some optional embodiments of this application, S1032 includes:
S31. Calculate the significance score of each channel of the feature difference map. Specifically, the significance score of each channel of $D_i$ is calculated first; it measures the importance of the differences in that channel.
[0037] Here $S_i$ is the channel saliency score vector, $C_i$ is the number of channels of $D_i$, and $H_i$ and $W_i$ are the height and width of $D_i$.
[0038] S32. Apply the Sigmoid activation function to the saliency scores to obtain the channel weights, and multiply the channel weights element-wise with the feature difference map to obtain the filtered difference features.
[0039] That is, the filtered difference features $\hat{D}_i$ are obtained by Sigmoid-activated reweighting:
$$\hat{D}_i = \sigma(S_i) \otimes D_i,$$
[0040] where $\sigma(\cdot)$ is the Sigmoid activation function, $\otimes$ denotes element-wise multiplication, $S_i$ is the channel saliency score vector, and $D_i$ is the feature difference map.
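The "significance scoring + feature reweighting" step can be sketched as below. The scoring formula itself is not recoverable from the text, so this sketch assumes the score of a channel is the spatial mean of that channel of the difference map; only the Sigmoid reweighting follows the description directly. The function name `filter_difference` is hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def filter_difference(diff_map: np.ndarray):
    """diff_map: (C, H, W) -> filtered map of the same shape and channel weights."""
    # ASSUMPTION: the saliency score is the spatial mean of each channel.
    scores = diff_map.mean(axis=(1, 2))
    weights = sigmoid(scores)                   # channel weights in (0, 1)
    filtered = weights[:, None, None] * diff_map
    return filtered, weights

rng = np.random.default_rng(0)
diff_map = np.abs(rng.standard_normal((64, 8, 8)))   # stand-in difference map
filtered, weights = filter_difference(diff_map)
```

Channels with small average difference receive smaller weights and are attenuated relative to channels with large average difference.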
[0041] S1033. Concatenate the feature difference map with the local features along the channel dimension to obtain the first concatenated feature, and concatenate the feature difference map with the global features along the channel dimension to obtain the second concatenated feature. Integrate each concatenated feature through a depthwise separable convolution to obtain the calibrated local features and the calibrated global features. Combining the feature difference map $D_i$ with the local features $f_{ci}$ and the global features $f_{ti}$ makes both carry tampering-related clues and improves feature discrimination.
[0042] $D_i$ is concatenated along the channel dimension with $f_{ci}$ and with $f_{ti}$ to obtain the first and second concatenated features. Each is then fed into a depthwise separable convolution (DSConv), which integrates the information at reduced computational cost and outputs the calibrated local features $\hat{f}_{ci}$ and the calibrated global features $\hat{f}_{ti}$:
[0043] $$\hat{f}_{ci} = \mathrm{DSConv}(\mathrm{Concat}(D_i, f_{ci})), \qquad \hat{f}_{ti} = \mathrm{DSConv}(\mathrm{Concat}(D_i, f_{ti})).$$
[0044] DSConv decomposes a standard convolution into a depthwise convolution followed by a pointwise convolution, which reduces the number of parameters and the computational cost while preserving the fusion effect.
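To make the parameter saving concrete, the following sketch implements a depthwise separable convolution in NumPy and compares its weight count with that of a standard convolution. The 3×3 kernel size, the channel counts, and the omission of bias terms are assumptions made for illustration.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise, pointwise):
    """x: (C_in, H, W); depthwise: (C_in, 3, 3); pointwise: (C_out, C_in)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))    # zero padding keeps H x W
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered only with its own 3x3 kernel.
            tap = depthwise[:, dy, dx][:, None, None]
            out += tap * padded[:, dy:dy + h, dx:dx + w]
    # The 1x1 pointwise convolution mixes information across channels.
    return np.einsum("oc,chw->ohw", pointwise, out)

def conv_params(c_in: int, c_out: int, k: int = 3):
    """Weight counts (biases ignored) of a standard vs. separable convolution."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable

rng = np.random.default_rng(0)
diff_map = rng.standard_normal((64, 8, 8))
local_feat = rng.standard_normal((64, 8, 8))
concat = np.concatenate([diff_map, local_feat], axis=0)      # (128, 8, 8)
calibrated = depthwise_separable_conv(
    concat,
    rng.standard_normal((128, 3, 3)) * 0.1,
    rng.standard_normal((64, 128)) * 0.1,
)
standard, separable = conv_params(128, 64)
```

For 128 input channels and 64 output channels, the separable form needs 9,344 weights against 73,728 for a standard 3×3 convolution, roughly an eightfold reduction.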
[0045] S1034. Concatenate the filtered difference features, the calibrated local features, and the calibrated global features along the channel dimension, and refine them through a group attention mechanism to obtain the refined fused features. The filtered difference features $\hat{D}_i$, the calibrated CNN local features $\hat{f}_{ci}$, and the calibrated Transformer global features $\hat{f}_{ti}$ are concatenated along the channel dimension to form the complete input $X_i$ of the feature refinement module, integrating complementary information from the three types of features:
$$X_i = \mathrm{Concat}(\hat{D}_i, \hat{f}_{ci}, \hat{f}_{ti}).$$
[0046] A grouped CBAM operation is applied to the concatenated feature map $X_i$. First, the channels of $X_i$ are divided into G=8 independent feature groups. The input features of the g-th group are denoted $X_i^{g}$, where g = 1, 2, …, 8 and the number of channels of $X_i$ is divisible by 8. Grouping reduces the computational complexity for each group and strengthens the group-specific attention expression of different channel groups.
[0047] Specifically, the CBAM attention mechanism is applied independently to each group $X_i^{g}$. It consists of a channel attention module and a spatial attention module, as follows:
1) Channel attention calculation
First, Global Average Pooling (GAP) and Global Max Pooling (GMP) are applied to the group features $X_i^{g}$ to extract per-channel statistics:
[0048] $$z_{avg} = \mathrm{GAP}(X_i^{g}), \qquad z_{max} = \mathrm{GMP}(X_i^{g}),$$
[0049] where $z_{avg}$ and $z_{max}$ are the average and maximum statistics of each channel.
[0050] The two pooling results are then fed into a shared two-layer fully connected network for feature encoding. The first fully connected layer compresses the number of channels from $C_g$ to $C_g / r$, where r is the compression ratio and $C_g$ must be divisible by r, and is followed by a ReLU activation. The second fully connected layer restores the number of channels to $C_g$ and has no activation function:
[0051] $$\mathrm{MLP}(z) = W_2\,\mathrm{ReLU}(W_1 z).$$
[0052] Finally, the two encoded results are added element-wise and passed through the Sigmoid activation function to generate the channel attention map $M_c$, which weights the channel dimension of the group features to give the channel-weighted features $\tilde{X}_i^{g}$:
[0053] $$M_c = \sigma(\mathrm{MLP}(z_{avg}) + \mathrm{MLP}(z_{max})), \qquad \tilde{X}_i^{g} = M_c \otimes X_i^{g}.$$
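The grouped channel attention described above can be sketched as follows. The feature size (48 channels, so 6 per group), the compression ratio r = 2, and the random weights are hypothetical choices for illustration; bias terms are omitted.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(group, w1, w2):
    """group: (C_g, H, W); w1: (C_g // r, C_g); w2: (C_g, C_g // r)."""
    avg = group.mean(axis=(1, 2))               # global average pooling
    mx = group.max(axis=(1, 2))                 # global max pooling

    def shared_mlp(v):
        # FC (compress) -> ReLU -> FC (restore), no final activation.
        return w2 @ np.maximum(w1 @ v, 0.0)

    attn = sigmoid(shared_mlp(avg) + shared_mlp(mx))
    return attn[:, None, None] * group, attn

rng = np.random.default_rng(0)
features = rng.standard_normal((48, 8, 8))      # concatenated feature map
groups = np.split(features, 8, axis=0)          # G = 8 groups of 6 channels
r = 2                                           # ASSUMPTION: compression ratio

weighted, attentions = [], []
for g in groups:
    c_g = g.shape[0]
    w1 = rng.standard_normal((c_g // r, c_g)) * 0.1   # per-group weights
    w2 = rng.standard_normal((c_g, c_g // r)) * 0.1
    out, attn = channel_attention(g, w1, w2)
    weighted.append(out)
    attentions.append(attn)
```

The two pooled vectors share one set of weights within a group, while each group has its own weights, which matches the statement that the attention is applied independently per group.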
[0054] 2) Spatial attention calculation
The channel-weighted features $\tilde{X}_i^{g}$ are taken as input. Average pooling and max pooling are performed along the channel dimension to extract spatial statistics, and the two pooling results are concatenated along the channel dimension to form the input feature map of the spatial attention:
[0055] $$s_{avg} = \mathrm{AvgPool}_{ch}(\tilde{X}_i^{g}), \qquad s_{max} = \mathrm{MaxPool}_{ch}(\tilde{X}_i^{g}),$$
[0056] $$P_s = \mathrm{Concat}(s_{avg}, s_{max}),$$
[0057] where $s_{avg}$ and $s_{max}$ are the spatial average and maximum statistics, respectively.
[0058] $P_s$ is then passed through a convolutional layer (used to capture a wide range of spatial context) and the Sigmoid activation function to generate the spatial attention map $M_s$, which weights the spatial dimensions of the features to give the weighted output features $Y_i^{g}$:
[0059] $$M_s = \sigma(\mathrm{Conv}(P_s)), \qquad Y_i^{g} = M_s \otimes \tilde{X}_i^{g}.$$
[0060] This weighting highlights the features at important spatial locations and suppresses noise from background areas.
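A NumPy sketch of the spatial attention step follows. The kernel size of the convolution is not recoverable from the text, so the 7×7 kernel used here is an assumption (it is the size used in the original CBAM design); the helper `conv2d_same` and the random weights are likewise illustrative.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """x: (C_in, H, W); kernel: (C_in, k, k), k odd -> (H, W) single-channel map."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))   # zero padding keeps H x W
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            tap = kernel[:, dy, dx][:, None, None]
            out += (tap * padded[:, dy:dy + h, dx:dx + w]).sum(axis=0)
    return out

def spatial_attention(feat, kernel):
    """feat: (C, H, W) -> spatially weighted features and the (H, W) attention map."""
    avg = feat.mean(axis=0, keepdims=True)         # average over channels
    mx = feat.max(axis=0, keepdims=True)           # maximum over channels
    stats = np.concatenate([avg, mx], axis=0)      # (2, H, W)
    attn = sigmoid(conv2d_same(stats, kernel))
    return attn[None] * feat, attn

rng = np.random.default_rng(0)
group = rng.standard_normal((6, 8, 8))             # channel-weighted group features
kernel = rng.standard_normal((2, 7, 7)) * 0.1      # ASSUMPTION: 7x7 kernel
output, attention = spatial_attention(group, kernel)
```

The attention map has one value per pixel and is broadcast over all channels of the group, so it rescales locations rather than channels.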
[0061] The CBAM-weighted output features of the 8 groups, $Y_i^{1}, \dots, Y_i^{8}$, are concatenated along the channel dimension to obtain the final refined fused features $F_i^{ref}$. This integrates the group-specific attention information of all groups and recovers the complete channel-dimension feature representation: $F_i^{ref} = \mathrm{Concat}(Y_i^{1}, Y_i^{2}, \dots, Y_i^{8})$.
[0062] S1035. Generate a gated weight map based on the calibrated local features and the calibrated global features, and use the gated weight map to perform weighted fusion of the refined fusion features to obtain the difference perception features at this scale. In some optional embodiments of this application, the gated weight map includes a local gated weight map and a global gated weight map. The step of generating the gated weight map based on the calibrated local features and the calibrated global features includes: performing a 1×1 convolution operation and a Sigmoid activation on the calibrated local features and the calibrated global features respectively to generate the local gated weight map and the global gated weight map.
[0063] In some optional embodiments of this application, the step of using the gated weight map to perform weighted fusion of the refined fusion features to obtain the difference-aware features at this scale includes: multiplying the refined fusion features element-wise with the local gated weight map and the global gated weight map respectively, adding the product results element-wise, and then integrating the features through depthwise separable convolution to obtain the difference-aware features at this scale.
[0064] The calibrated global features $\hat{f}_{ti}$ and the calibrated local features $\hat{f}_{ci}$ are each taken as a separate input and passed independently through a 1×1 convolutional layer and the Sigmoid activation function to generate the corresponding gated weight maps, which adaptively control the fusion weights of the two types of features:
[0065] $$G_{ci} = \sigma(\mathrm{Conv}_{1\times 1}(\hat{f}_{ci})), \qquad G_{ti} = \sigma(\mathrm{Conv}_{1\times 1}(\hat{f}_{ti})),$$
[0066] where the gate weights $G_{ci}$ and $G_{ti}$ take values in [0, 1], providing a soft selection between the calibrated local features $\hat{f}_{ci}$ and the calibrated global features $\hat{f}_{ti}$.
[0067] In the gated fusion, the refined fused features $F_i^{ref}$ are weighted element-wise by each of the two gate weights, the results are summed, and the sum is integrated and encoded through a depthwise separable convolutional layer to obtain the final output of the module, the difference-aware features $f_{di}$:
$$f_{di} = \mathrm{DSConv}(G_{ci} \otimes F_i^{ref} + G_{ti} \otimes F_i^{ref}).$$
[0068] This fusion lets the refined fused features adaptively absorb effective information from the local and global features according to the gate weights, improving the expressive power and robustness of the final features.
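The gated weighting can be sketched as below. Two points are assumptions rather than statements of the text: the 1×1 convolutions are taken to map the calibrated features to the channel count of the refined fused features so that the element-wise products are well defined, and the final depthwise separable convolution is left out for brevity.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """x: (C_in, H, W); weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def gated_fusion(refined, local_cal, global_cal, w_local, w_global):
    """Weight the refined features by local and global gates and sum the results."""
    gate_local = sigmoid(conv1x1(local_cal, w_local))      # values in (0, 1)
    gate_global = sigmoid(conv1x1(global_cal, w_global))   # values in (0, 1)
    fused = gate_local * refined + gate_global * refined
    return fused, gate_local, gate_global

rng = np.random.default_rng(0)
refined = rng.standard_normal((24, 8, 8))      # refined fused features
local_cal = rng.standard_normal((8, 8, 8))     # calibrated local features
global_cal = rng.standard_normal((8, 8, 8))    # calibrated global features
# ASSUMPTION: the gates are projected to the 24 channels of `refined`.
w_local = rng.standard_normal((24, 8)) * 0.1
w_global = rng.standard_normal((24, 8)) * 0.1
fused, gate_local, gate_global = gated_fusion(
    refined, local_cal, global_cal, w_local, w_global
)
```

Because each gate lies strictly between 0 and 1, the combined weight at any position lies between 0 and 2, so a position can be emphasized when both branches consider it informative and suppressed when neither does.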
[0069] The process of generating the difference-aware features is shown in Figure 3.
[0070] S1036. The difference perception features at each scale are spliced together to obtain the multi-scale difference perception features.
[0071] S104. Upsample the multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features to the resolution of the target image to obtain upsampled local features, upsampled global features, and upsampled difference features. Biomedical image tampering localization faces two key challenges: 1) Tampered targets are often tiny structures such as cells and colonies, whose features are easily diluted during downsampling, making it difficult to detect subtle tampering traces; 2) Downsampling further destroys the edge details of the tampered area, resulting in incomplete segmentation results even if the tampered area is roughly located due to blurred edges. To address these issues, the high-resolution decoding module focuses on "high-resolution feature recovery" and "cross-scale information alignment," employing an upsampling and attention-guided fusion strategy to achieve precise localization of minute tampered areas while preserving complete boundary details.
[0072] This application can receive multi-scale feature maps output by a difference-aware feature fusion module and complete pixel-level prediction of the tampered region through two core operations: first, the multi-scale features are uniformly restored to the input image resolution to avoid loss of details; second, through bottom-up attention-guided fusion, the semantic information and spatial details of features at different levels are integrated to finally generate an accurate tampering mask.
[0073] Because the multi-scale feature maps have undergone different downsampling factors, their spatial resolutions are lower than that of the target image (of size $H \times W$). To preserve the spatial details of minor tampering as far as possible, the feature maps at all scales are uniformly upsampled to the original resolution $H \times W$, and their channel dimension is reduced to a common value $C_u$.
[0074] The upsampling is implemented through transposed convolution, defined as:
$$U_i = \mathrm{TransConv}_i(Z_i), \qquad U_i \in \mathbb{R}^{C_u \times H \times W},$$
[0075] where $\mathrm{TransConv}_i(\cdot)$ denotes the transposed convolution applied to the features $Z_i$ of the i-th level. By learning an adaptive upsampling kernel, it restores the low-resolution feature map $Z_i$ to the $C_u \times H \times W$ feature space. This makes the features at every level consistent in resolution and lays the foundation for cross-scale fusion.
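The effect of a transposed convolution on resolution and channel count can be illustrated with a small NumPy sketch. The kernel size, stride, and channel numbers below are hypothetical, padding is fixed at zero, and the example uses tiny feature maps; the size helper shows that a 16×16 or 128×128 map can reach 512×512 when kernel and stride both equal the downsampling factor.

```python
import numpy as np

def transposed_conv_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def transposed_conv(x, weight, stride):
    """x: (C_in, H, W); weight: (C_in, C_out, k, k); zero padding."""
    _, h, w = x.shape
    _, c_out, k, _ = weight.shape
    out_h = transposed_conv_size(h, k, stride)
    out_w = transposed_conv_size(w, k, stride)
    out = np.zeros((c_out, out_h, out_w))
    for i in range(h):
        for j in range(w):
            # Each input pixel stamps a weighted k x k patch into the output.
            patch = np.einsum("c,cokl->okl", x[:, i, j], weight)
            out[:, i * stride:i * stride + k, j * stride:j * stride + k] += patch
    return out

rng = np.random.default_rng(0)
low_res = rng.standard_normal((8, 4, 4))           # small stand-in feature map
weight = rng.standard_normal((8, 2, 4, 4)) * 0.1   # reduces 8 channels to 2
upsampled = transposed_conv(low_res, weight, stride=4)
```

In a trained network the kernel weights are learned, which is what distinguishes this operation from fixed interpolation such as bilinear upsampling.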
[0076] S105. Sequentially fuse the upsampled local features, the upsampled global features, and the upsampled difference features to obtain the fused features; In some optional embodiments of this application, in S105, the sequential fusion of the upsampled local features, the upsampled global features, and the upsampled difference features to obtain the fused features includes the following steps S1051-S1052: S1051. The upsampled local features are fused with the upsampled global features to obtain a first fused feature; S1052. The first fusion feature is fused with the upsampled difference feature to obtain the fusion feature.
[0077] Furthermore, directly fusing upsampled multi-scale features easily introduces noise and redundant information, and it is difficult to balance the detail advantages of shallow features with the semantic advantages of deep features. To address this, a bottom-up, staged attention-guided fusion mechanism is proposed, starting with the shallowest features and gradually integrating deeper features, dynamically filtering effective information through a spatial attention map.
[0078] For details, see Figure 4.
[0079] The figure takes one Guide Layer and its two input features as an example. First, convolution operations are applied to both input features to extract local details and semantic information. The first input passes through "1×1 convolution → ReLU activation → 3×3 convolution" to yield its refined feature. The second input passes through the same operations, and the result is added to its global average pooling to yield the second refined feature. The corresponding mathematical expressions are:
[0080]
[0081]
[0082] A channel transformation is then performed by "convolution → BN → ReLU", and a spatial attention map is finally generated by sigmoid activation; this map is used to select the effective regions of the feature to be filtered.
[0083] Here, GAP(·) denotes global average pooling, which compresses a feature map into a channel descriptor that captures global relevance, and σ(·) denotes the Sigmoid activation function, which constrains the elements of the attention map to [0, 1]; the larger the value, the more important the feature at the corresponding position.
[0084] The attention map is multiplied element-wise with one of the preprocessed features, which applies a weighted filtering to that feature, and the result is then added element-wise to the other preprocessed feature to obtain the fusion result of the current round.
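The data flow of one guide layer can be sketched as follows. This is a simplified illustration under several assumptions: the feature maps are single-channel and of equal size, the convolutional refinements are replaced by the identity, the attention map is derived from the second input, and the first input is the one being filtered. The names `guide_layer`, `first`, and `second` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def global_average(feature):
    """Global average pooling of a single-channel map: one scalar."""
    values = [v for row in feature for v in row]
    return sum(values) / len(values)


def guide_layer(first, second):
    """Fuse two same-sized single-channel maps with a spatial attention map."""
    # Stand-in for the convolutional refinement: the inputs are used as they are.
    gap = global_average(second)
    second_refined = [[v + gap for v in row] for row in second]
    # Spatial attention in [0, 1], here derived from the refined second input.
    attention = [[sigmoid(v) for v in row] for row in second_refined]
    # Weight the first input by the attention map, then add the refined second input.
    fused = [[a * f + s for a, f, s in zip(a_row, f_row, s_row)]
             for a_row, f_row, s_row in zip(attention, first, second_refined)]
    return attention, fused


first = [[2.0, 4.0],
         [6.0, 8.0]]
second = [[0.0, 0.0],
          [0.0, 0.0]]
attention, fused = guide_layer(first, second)
```

With an all-zero second input every attention value is 0.5, so the fused map is simply the first input scaled by one half; non-zero inputs shift the attention toward 0 or 1 per position.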
[0085] S106. Input the fused features into the prediction head to generate a tampering probability mask; S107. Determine the tampered area in the target image based on the threshold. Optionally, the tampering probability mask can be compared pixel by pixel with a threshold to generate a binary tampering region mask, wherein regions with pixel values greater than the preset threshold are determined to be tampering regions, and regions with pixel values less than or equal to the preset threshold are determined to be normal regions.
[0086] After passing through three guide layers, the final fused feature integrating the information of all scales is obtained (corresponding to the deepest feature fusion result). This fused feature is input into the prediction head, where a convolution compresses the channel dimension to 1 and Sigmoid activation then generates the pixel-level tampering mask.
[0087] Each pixel value of the mask represents the probability that the corresponding location belongs to a tampered area. If the probability is greater than the threshold of 0.5, the pixel is determined to be a tampered pixel.
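The final activation and thresholding can be sketched as below, assuming the prediction head has already produced a single-channel map of raw scores. The function name `tamper_mask` and the example scores are hypothetical; the threshold of 0.5 and the strict "greater than" comparison follow the text.

```python
import math


def tamper_mask(logits, threshold=0.5):
    """Turn raw single-channel scores into probabilities and a binary mask."""
    probs = [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]
    # Strictly greater than the threshold means tampered (1); otherwise normal (0).
    binary = [[1 if p > threshold else 0 for p in row] for row in probs]
    return probs, binary


logits = [[-2.0, 0.0],
          [0.5, 3.0]]
probs, binary = tamper_mask(logits)
```

A score of exactly 0 maps to a probability of exactly 0.5 and is therefore classified as normal, consistent with the rule that values less than or equal to the threshold are normal regions.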
[0088] Module training uses a binary cross-entropy (BCE) loss function L to minimize the difference between the predicted mask and the real tampering mask, in which a pixel value of 0 represents a normal area and 1 represents a tampered area.
[0089] The pixel-wise loss is averaged over the height and width of the image, which guides the model to learn the boundary features and spatial distribution of the tampered region.
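A pixel-averaged binary cross-entropy can be sketched as follows. This uses the standard BCE definition; the clamping constant `eps` is an added numerical safeguard and an assumption, and the function name `bce_loss` is hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean over all pixels of -(t * log(p) + (1 - t) * log(1 - p)).

    pred:   predicted tampering probabilities in [0, 1], shaped [row][col]
    target: ground-truth mask, 0 for normal pixels and 1 for tampered pixels
    """
    height, width = len(pred), len(pred[0])
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0); an assumed safeguard
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / (height * width)


target = [[0, 1],
          [1, 0]]
uninformative = bce_loss([[0.5, 0.5], [0.5, 0.5]], target)
confident_correct = bce_loss([[0.1, 0.9], [0.9, 0.1]], target)
confident_wrong = bce_loss([[0.9, 0.1], [0.1, 0.9]], target)
```

An uninformative prediction of 0.5 everywhere gives a loss of ln 2; predictions that agree with the mask give a lower loss, and predictions that contradict it give a higher one.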
[0090] For details, see Figure 4.
[0091] This method and system are mainly applied to scenarios such as reviewing biomedical research results, verifying images of academic journal papers, tracing biomedical R&D data, and supervising the authenticity of images in research institutions. It can accurately locate common tampering types such as copy-move, splicing, and cleaning in biomedical images, including protein blotting / gel electrophoresis, microscopic images (such as cell and tissue images), and macroscopic laboratory scene images (such as culture dish colony images). It provides technical support for scientific integrity supervision, helps research institutions, journal editorial departments, and biomedical companies identify image tampering behaviors, and maintains the authenticity and reliability of biomedical research.
[0092] Innovation in the difference-aware feature fusion method: A fusion mechanism based on feature difference calculation, depthwise separable convolution, and gated attention calibration is designed. Calculating the absolute differences between features of different modalities highlights complementary information. Difference integration and adaptive weight calibration then achieve efficient feature fusion. This avoids the loss of complementary information and the redundant interference caused by existing simple concatenation or weighted summation methods, and significantly improves the ability of the fused features to identify tampering traces.
[0093] Innovation in the bottom-up attention-guided high-resolution decoding method: This application proposes a decoding strategy that combines multi-scale feature upsampling with cross-level attention fusion. First, the features at all scales are restored to the original resolution. Then, features are fused step by step from shallow to deep layers. This solves the feature dilution and edge-detail degradation caused by downsampling in existing methods, achieves high-resolution segmentation of tampered regions, and improves the completeness and accuracy of the localization results.
[0094] Advantages of difference-aware feature fusion: By calculating and deeply integrating the differences between features of different modalities, the complementary information between local and global features is fully exploited, solving the insufficient use of complementary information in existing fusion methods. The gated attention calibration mechanism effectively suppresses the interference of redundant information with tampering features, making the fused features more discriminative and better able to identify tampering traces covered by noise or publication artifacts, which provides a better feature foundation for subsequent localization. Compared with simple concatenation or weighted summation fusion methods, the feature representation capability is significantly improved.
[0095] Advantages of high-resolution decoding methods: Transposed convolution upsampling restores features to their original resolution, effectively avoiding the problem of diluted tampered features due to downsampling, ensuring that features of minor tampered targets are not lost; the bottom-up attention-guided fusion mechanism can gradually integrate features from shallow to deep layers. Shallow features provide rich detail information, while deep features provide accurate semantic information. The two work together to accurately restore the edge details of the tampered region, solving the problems of blurry and incomplete segmentation in existing models, making the boundaries of the output tampered mask clearer and the positioning more accurate.
[0096] An embodiment of this application also provides an image tampering localization device based on difference perception, the device comprising: a preprocessing unit, used to preprocess the input target image to obtain a preprocessed image; an extraction unit, used to extract features from the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features; a computing unit, used to perform difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain multi-scale difference-aware features; an upsampling unit, used to upsample the multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features to the resolution of the target image, respectively, to obtain upsampled local features, upsampled global features, and upsampled difference features; a fusion unit, used to sequentially fuse the upsampled local features, the upsampled global features, and the upsampled difference features to obtain fused features; a prediction unit, used to input the fused features into the prediction head to generate a tampering probability mask; and a determination unit, used to determine the tampered area in the target image based on a threshold. When extracting features from the preprocessed image based on the dual-stream feature extraction network to obtain the multi-scale local features and the multi-scale global features, the device is specifically used to: decompose the preprocessed image in the frequency domain to obtain multiple decomposition results; concatenate the multiple decomposition results to obtain multimodal features; and input the multimodal features into the CNN branch and the Transformer branch, respectively, to obtain the multi-scale local features and the multi-scale global features.
[0097] When performing difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain the multi-scale difference-aware features, the device is specifically used to: align the channel dimensions of the local features and the global features of the same stage and calculate their element-wise absolute difference to obtain a feature difference map; perform channel saliency scoring and reweighting on the feature difference map to obtain filtered difference features; concatenate the feature difference map and the local features along the channel dimension to obtain a first concatenated feature; concatenate the feature difference map and the global features along the channel dimension to obtain a second concatenated feature; integrate the first concatenated feature and the second concatenated feature by depthwise separable convolution to obtain calibrated local features and calibrated global features; concatenate the filtered difference features, the calibrated local features, and the calibrated global features along the channel dimension, and refine them through a group attention mechanism to obtain refined fused features; generate a gated weight map based on the calibrated local features and the calibrated global features; perform weighted fusion of the refined fused features using the gated weight map to obtain the difference-aware features at this scale; and concatenate the difference-aware features of all scales along the channel dimension to obtain the multi-scale difference-aware features.
[0098] When performing channel saliency scoring and reweighting on the feature difference map to obtain the filtered difference features, the device is specifically used to: calculate a saliency score for each channel of the feature difference map; apply the Sigmoid activation function to the saliency scores to obtain channel weights; and multiply the channel weights element-wise with the feature difference map to obtain the filtered difference features.
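The difference map and its channel reweighting can be sketched as follows. The text does not state how the saliency score is computed, so the per-channel mean used here is an assumption; the function names `abs_difference` and `reweight_channels` are hypothetical, and the features are assumed to be already aligned in channel dimension.

```python
import math


def abs_difference(local_feat, global_feat):
    """Element-wise |local - global| for features shaped [channel][row][col]."""
    return [[[abs(l - g) for l, g in zip(l_row, g_row)]
             for l_row, g_row in zip(l_ch, g_ch)]
            for l_ch, g_ch in zip(local_feat, global_feat)]


def reweight_channels(diff):
    """Score each channel, squash the score with a sigmoid, rescale the channel."""
    weights, filtered = [], []
    for channel in diff:
        values = [v for row in channel for v in row]
        score = sum(values) / len(values)        # assumed scoring rule: channel mean
        weight = 1.0 / (1.0 + math.exp(-score))  # Sigmoid gives the channel weight
        weights.append(weight)
        filtered.append([[weight * v for v in row] for row in channel])
    return weights, filtered


# Two channels, each 1x2. The streams agree on channel 0 and disagree on channel 1.
local_feat = [[[1.0, 2.0]], [[3.0, 0.0]]]
global_feat = [[[1.0, 2.0]], [[0.0, 4.0]]]
diff = abs_difference(local_feat, global_feat)
weights, filtered = reweight_channels(diff)
```

The channel on which the two streams disagree receives the larger weight, so its difference values are passed on more strongly than those of the channel where the streams agree.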
[0099] Optionally, the gated weight map includes a local gated weight map and a global gated weight map. When generating the gated weight map based on the calibrated local features and the calibrated global features, the device is specifically used to: apply 1×1 convolution and sigmoid activation to the calibrated local features and the calibrated global features, respectively, to generate the local gated weight map and the global gated weight map.
[0100] Optionally, when performing weighted fusion of the refined fused features using the gated weight map to obtain the difference-aware features at this scale, the device is specifically used to: multiply the refined fused features element-wise by the local gated weight map and by the global gated weight map, add the two products element-wise, and then integrate the features through depthwise separable convolution to obtain the difference-aware features at this scale.
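The gating step can be sketched as below under simplifying assumptions: all maps are single-channel, so the 1×1 convolution reduces to a scale and shift, and the final depthwise separable convolution is omitted. The names `gate_map` and `gated_fusion` and the default parameters are hypothetical.

```python
import math


def gate_map(feature, weight, bias):
    """1x1 convolution on a single-channel map (scale and shift), then sigmoid."""
    return [[1.0 / (1.0 + math.exp(-(weight * v + bias))) for v in row]
            for row in feature]


def gated_fusion(refined, local_feat, global_feat,
                 w_local=1.0, w_global=1.0, bias=0.0):
    """Weight the refined features by both gate maps and sum the two products."""
    g_local = gate_map(local_feat, w_local, bias)
    g_global = gate_map(global_feat, w_global, bias)
    return [[r * gl + r * gg for r, gl, gg in zip(r_row, l_row, g_row)]
            for r_row, l_row, g_row in zip(refined, g_local, g_global)]


refined = [[2.0, 4.0]]
zeros = [[0.0, 0.0]]
fused = gated_fusion(refined, zeros, zeros)
```

With all-zero calibrated features both gates equal 0.5, so the two products sum back to the refined features; in general each gate scales the refined features per position according to its own stream.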
[0101] Optionally, when sequentially fusing the upsampled local features, the upsampled global features, and the upsampled difference features to obtain the fused features, the device is specifically used to: fuse the upsampled local features with the upsampled global features to obtain a first fused feature; and fuse the first fused feature with the upsampled difference features to obtain the fused features.
[0102] Referring to Figure 5, Figure 5 is a schematic block diagram of an electronic device provided according to an embodiment of this application. The electronic device 300 shown in Figure 5 may include one or more processors 301, one or more input devices 302, one or more output devices 303, and one or more memories 304. The processors 301, input devices 302, output devices 303, and memories 304 communicate with each other via a communication bus 305. The memories 304 store computer programs, including program instructions. The processors 301 execute the program instructions stored in the memories 304.
[0103] It should be understood that, in the embodiments of this application, the processor 301 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.
[0104] Input device 302 may include a touchpad, a fingerprint sensor (for collecting the user's fingerprint information and fingerprint orientation information), a microphone, etc., and output device 303 may include a display (LCD, etc.), a speaker, etc.
[0105] The memory 304 may include read-only memory and random access memory, and provides instructions and data to the processor 301. A portion of the memory 304 may also include non-volatile random access memory.
[0106] In specific implementations, the processor 301, input device 302, and output device 303 described in the embodiments of this application can execute the implementation methods described above in the embodiments of this application, or they can execute the implementation methods of the electronic devices described in the embodiments of this application, which will not be repeated here.
[0107] In another embodiment of this application, a computer-readable storage medium is provided. This computer-readable storage medium stores a computer program, which includes program instructions. When executed by a processor, the program instructions implement all or part of the processes in the methods described above. Alternatively, the computer program can instruct related hardware to complete the process. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0108] The computer-readable storage medium can be an internal storage unit of the electronic device in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium can also be an external storage device of the electronic device, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc., equipped on the electronic device. Furthermore, the computer-readable storage medium can include both internal and external storage units of the electronic device. The computer-readable storage medium is used to store computer programs and other programs and data required by the electronic device. The computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
[0109] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
[0110] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working process of the electronic devices and units described above can be found in the corresponding process in the foregoing method embodiments and will not be repeated here.
[0111] In the several embodiments provided in this application, it should be understood that the disclosed electronic devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces or units, or it may be an electrical, mechanical, or other form of connection.
[0112] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0113] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0114] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for locating image tampering based on difference perception, characterized by comprising: preprocessing an input target image to obtain a preprocessed image; extracting features from the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features; performing difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain multi-scale difference-aware features; upsampling the multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features to the resolution of the target image to obtain upsampled local features, upsampled global features, and upsampled difference features; sequentially fusing the upsampled local features, the upsampled global features, and the upsampled difference features to obtain fused features; inputting the fused features into a prediction head to generate a tampering probability mask; and determining the tampered area in the target image based on a threshold.
2. The method according to claim 1, characterized in that extracting features from the preprocessed image based on the dual-stream feature extraction network to obtain the multi-scale local features and the multi-scale global features comprises: decomposing the preprocessed image in the frequency domain to obtain multiple decomposition results; concatenating the multiple decomposition results to obtain multimodal features; and inputting the multimodal features into a CNN branch and a Transformer branch, respectively, to obtain the multi-scale local features and the multi-scale global features.
3. The method according to claim 1, characterized in that performing difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain the multi-scale difference-aware features comprises: aligning the channel dimensions of the local features and the global features of the same stage and calculating their element-wise absolute difference to obtain a feature difference map; performing channel saliency scoring and reweighting on the feature difference map to obtain filtered difference features; concatenating the feature difference map and the local features along the channel dimension to obtain a first concatenated feature; concatenating the feature difference map and the global features along the channel dimension to obtain a second concatenated feature; integrating the first concatenated feature and the second concatenated feature by depthwise separable convolution to obtain calibrated local features and calibrated global features; concatenating the filtered difference features, the calibrated local features, and the calibrated global features along the channel dimension, and refining them through a group attention mechanism to obtain refined fused features; generating a gated weight map based on the calibrated local features and the calibrated global features; performing weighted fusion of the refined fused features using the gated weight map to obtain the difference-aware features at this scale; and concatenating the difference-aware features of all scales along the channel dimension to obtain the multi-scale difference-aware features.
4. The method according to claim 3, characterized in that performing channel saliency scoring and reweighting on the feature difference map to obtain the filtered difference features comprises: calculating a saliency score for each channel of the feature difference map; applying the Sigmoid activation function to the saliency scores to obtain channel weights; and multiplying the channel weights element-wise with the feature difference map to obtain the filtered difference features.
5. The method according to claim 3, characterized in that the gated weight map includes a local gated weight map and a global gated weight map, and generating the gated weight map based on the calibrated local features and the calibrated global features comprises: applying 1×1 convolution and sigmoid activation to the calibrated local features and the calibrated global features, respectively, to generate the local gated weight map and the global gated weight map.
6. The method according to claim 5, characterized in that performing weighted fusion of the refined fused features using the gated weight map to obtain the difference-aware features at this scale comprises: multiplying the refined fused features element-wise by the local gated weight map and by the global gated weight map, adding the two products element-wise, and then integrating the features through depthwise separable convolution to obtain the difference-aware features at this scale.
7. The method according to claim 1, characterized in that sequentially fusing the upsampled local features, the upsampled global features, and the upsampled difference features to obtain the fused features comprises: fusing the upsampled local features with the upsampled global features to obtain a first fused feature; and fusing the first fused feature with the upsampled difference features to obtain the fused features.
8. An image tampering localization device based on difference perception, characterized by comprising: a preprocessing unit, used to preprocess the input target image to obtain a preprocessed image; an extraction unit, used to extract features from the preprocessed image based on a dual-stream feature extraction network to obtain multi-scale local features and multi-scale global features; a computing unit, used to perform difference calculation and fusion on the multi-scale local features and the multi-scale global features at each scale level to obtain multi-scale difference-aware features; an upsampling unit, used to upsample the multi-scale local features, the multi-scale global features, and the multi-scale difference-aware features to the resolution of the target image, respectively, to obtain upsampled local features, upsampled global features, and upsampled difference features; a fusion unit, used to sequentially fuse the upsampled local features, the upsampled global features, and the upsampled difference features to obtain fused features; a prediction unit, used to input the fused features into the prediction head to generate a tampering probability mask; and a determination unit, used to determine the tampered area in the target image based on a threshold.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 7.