Textual image tamper detection and localization method and related device
By using a dual-stream encoder network structure and feature fusion technology, the problem of inaccurate localization in text image tampering detection was solved, and efficient tampering detection of text regions was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN UNIV
- Filing Date
- 2022-07-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to accurately detect tampered areas in text-based images such as screenshots, documents, credentials, street view images, and cards.
A dual-stream encoder network structure is adopted to process the original image and the text image separately. Features are extracted through downsampling convolution modules and a fusion module. Tampering localization is performed by combining an attention mechanism and a decoder. OCR technology is used to extract text regions for feature fusion. Upsampling convolution modules and SCSE attention modules are used for accurate localization.
It improves the accuracy and performance of text image tampering detection, better identifies tampered areas within text regions, and enhances the effectiveness of tampering detection.
Figure CN115294438B_ABST
Description
Technical Field
[0001] This invention relates to the field of image detection technology, and in particular to a method and related equipment for detecting and locating text-based image tampering.
Background Technology
[0002] With the rapid development of science and technology and the widespread availability of image editing software, people can easily create tampered images that show no obvious traces of tampering. If such forged images are used as evidence or for other purposes, they can easily disrupt normal social order.
[0003] With the rapid development of deep learning in recent years, tamper detection and localization methods based on convolutional neural networks are mainly aimed at tampered images in natural scenes. These images typically have large tampered areas, often involving humans or objects, and have relatively obvious tamper outlines, making the tampered areas relatively easy to detect. However, malicious tampering of text-based images such as screenshots, documents, credentials, street scene images, and cards has been increasing in recent years. The commonly tampered areas in text-based images are areas rich in information, such as text and numbers, and are relatively small, making it difficult to directly apply the aforementioned methods.
[0004] Therefore, existing technologies still need to be improved and enhanced.
Summary of the Invention
[0005] To address the aforementioned shortcomings of existing technologies, the present invention provides a method and related equipment for detecting and locating text-based image tampering, aiming to solve the problem of inaccurate detection and location of text-based image tampering in existing technologies.
[0006] A first aspect of the present invention provides a method for detecting and locating text-based image tampering, comprising:
[0007] A first target image is obtained, and the first target image is processed to obtain a second target image, wherein the second target image is an image that retains only the text portion of the first target image;
[0008] The first target image and the second target image are respectively input into the first encoder and the second encoder to obtain the first initial feature and the second initial feature, respectively.
[0009] The first initial feature and the second initial feature are fused to obtain the first target feature and the second target feature;
[0010] The first target feature and the second target feature are input into the decoder to obtain the tampering location result in the first target image output by the decoder.
[0011] In the text-based image tampering detection and localization method, the first encoder and the second encoder each include at least one downsampling convolution module, and each downsampling convolution module includes two residual modules.
[0012] The text-based image tampering detection and localization method, wherein fusing the first initial feature and the second initial feature includes:
[0013] Perform a convolution operation on the first initial feature to obtain the first feature and the second feature, and perform a convolution operation on the second initial feature to obtain the third feature and the fourth feature;
[0014] Perform matrix multiplication on the second feature and the fourth feature to obtain the first intermediate feature;
[0015] Swap the channels of the first intermediate feature to obtain the second intermediate feature;
[0016] The first target feature and the second target feature are obtained based on the first intermediate feature and the second intermediate feature.
[0017] The text-based image tampering detection and localization method, wherein obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature includes:
[0018] The first intermediate feature is processed to obtain the first attention feature, and the second intermediate feature is processed to obtain the second attention feature;
[0019] Based on the first attention feature and the third feature, a first output feature is obtained; based on the second attention feature and the first feature, a second output feature is obtained.
[0020] The first output feature and the first initial feature are added together to obtain the first target feature;
[0021] The second target feature is obtained by adding the second output feature and the second initial feature.
[0022] The text-based image tampering detection and localization method, wherein processing the first intermediate feature to obtain the first attention feature includes:
[0023] The first intermediate feature is passed through a linear layer and a softmax layer to obtain the first attention feature.
[0024] The step of processing the second intermediate feature to obtain the second attention feature includes: passing the second intermediate feature through a linear layer and a softmax layer to obtain the second attention feature.
[0025] The text-based image tampering detection and localization method, wherein obtaining the first output feature based on the first attention feature and the third feature includes:
[0026] After swapping the first and second channels in the first attention feature, multiply it with the third feature to obtain the first output feature;
[0027] The step of obtaining the second output feature based on the second attention feature and the first feature includes:
[0028] After swapping the first and second channels in the second attention feature, multiply it with the first feature to obtain the second output feature.
[0029] The text-based image tampering detection and localization method, wherein the decoder includes at least one upsampling convolution module;
[0030] Each of the upsampling convolutional modules includes an upsampling module, a feature extraction module, and an SCSE attention module;
[0031] The upsampling module is used for upsampling, the feature extraction module is used for convolution processing, and the SCSE attention module is used to extract spatial attention features and channel attention features.
[0032] A second aspect of the present invention provides a device for detecting and locating text-based image tampering, comprising:
[0033] An image acquisition module is used to acquire a first target image, process the first target image to obtain a second target image, and the second target image is an image that retains only the text portion of the first target image.
[0034] The encoder module is used to input the first target image and the second target image into the first encoder and the second encoder respectively, and obtain the first initial feature and the second initial feature respectively;
[0035] A fusion module is used to fuse the first initial feature and the second initial feature to obtain a first target feature and a second target feature;
[0036] A decoder module is used to input the first target feature and the second target feature into the decoder to obtain the tampering location result in the first target image output by the decoder.
[0037] A third aspect of the present invention provides a terminal, comprising: a processor, a storage medium communicatively connected to the processor, the storage medium being adapted to store a plurality of instructions, and the processor being adapted to invoke the instructions in the storage medium to execute the steps of implementing the text-based image tampering detection and location method described in any of the preceding claims.
[0038] In a fourth aspect, the present invention provides a storage medium storing one or more programs that can be executed by one or more processors to implement the steps of the text-based image tampering detection and location method described in any of the preceding claims.
[0039] Beneficial Effects: Compared with existing technologies, this invention provides a method and related equipment for detecting and locating text-based image tampering. In the method, a first target image is acquired and processed to obtain a second target image, which retains only the text portion of the first target image. The first and second target images are then input to a first encoder and a second encoder, respectively, to obtain a first initial feature and a second initial feature. The first and second initial features are fused to obtain a first target feature and a second target feature. Finally, the first and second target features are input to a decoder to obtain the tampering location result in the first target image output by the decoder. This method improves the performance of detecting text-based image tampering, enabling better identification of tampered text regions.
Attached Figure Description
[0040] Figure 1 is a flowchart of an embodiment of the text-based image tampering detection and localization method provided by the present invention;
[0041] Figure 2 is a structural diagram of the tamper detection and localization network in an embodiment of the text image tamper detection and localization method provided by the present invention;
[0042] Figure 3 is a structural diagram of the dual-stream fusion module in the tamper detection and localization network in an embodiment of the text image tamper detection and localization method provided by the present invention;
[0043] Figure 4 is a structural diagram of the SCSE attention module in the tamper detection and localization network in an embodiment of the text image tamper detection and localization method provided by the present invention;
[0044] Figure 5 is a schematic diagram of an embodiment of the text image tampering detection and localization device provided by the present invention;
[0045] Figure 6 is a schematic diagram of the structure of an embodiment of the terminal provided by the present invention.
Detailed Implementation
[0046] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0047] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” “said,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and/or” as used herein includes all or any units and all combinations of one or more associated listed items.
[0048] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0049] The present invention provides a method for detecting and locating text-based image tampering, which can be applied to terminals with computing capabilities. The terminal can execute the method provided by the present invention to detect and locate the target position in the image to be processed.
[0050] Embodiment 1
[0051] With the rapid development of deep learning in recent years, tamper detection and localization methods based on convolutional neural networks include RGB-N, RRUNET, MANTRNET, DFCN, PSCCNET, and MVSSNET. However, these networks are mainly designed for tampered images in natural scenes, where the tampered areas are relatively large, often involving people or objects. Commonly used datasets for tampered images include CASIA v1.0 & 2.0, CoMoFoD, Coverage, NIST 2016, Columbia, and Ps-boundary. In recent years, malicious tampering of text-based images such as screenshots, documents, credentials, street scene signs, and cards has been increasing. The commonly tampered areas in text-based images are often rich in information, such as text and numbers, and are relatively small, making it difficult to directly apply the aforementioned methods. Therefore, it is necessary to design tamper detection and localization models specifically for text-based images.
[0052] Therefore, this embodiment provides a method for detecting and locating text-based image tampering. As shown in Figure 1, the text-based image tampering detection and localization method provided by the present invention includes the following steps:
[0053] S100. Obtain a first target image, process the first target image to obtain a second target image, wherein the second target image is an image that retains only the text portion of the first target image.
[0054] In this embodiment, the first target image is an RGB stream image. OCR technology is used to process the first target image, retaining only the recognized text regions while setting all other regions to 0, thus obtaining the second target image. The second target image has the same size as the first target image.
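The construction of the second target image can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `keep_text_regions` and the `(x0, y0, x1, y1)` box format are assumptions, and the boxes are taken as given rather than produced by a specific OCR engine.

```python
import numpy as np

def keep_text_regions(image, text_boxes):
    """Return a copy of `image` in which every pixel outside the text boxes is 0.

    image      : (H, W, 3) RGB array (the first target image)
    text_boxes : iterable of (x0, y0, x1, y1) pixel boxes, assumed to come from
                 an OCR engine; the box format is an assumption of this sketch
    """
    mask = np.zeros(image.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in text_boxes:
        mask[y0:y1, x0:x1] = True          # mark recognized text pixels
    text_only = np.zeros_like(image)       # all other regions stay 0
    text_only[mask] = image[mask]
    return text_only

# Toy example: an 8x8 image of ones with one text box 3 pixels wide, 2 pixels tall.
rgb = np.ones((8, 8, 3), dtype=np.uint8)
ocr_image = keep_text_regions(rgb, [(1, 2, 4, 4)])
```

The output has the same shape as the input, matching the statement that the second target image has the same size as the first.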
[0055] S200: Input the first target image and the second target image into the first encoder and the second encoder respectively to obtain the first initial feature and the second initial feature respectively.
[0056] The first encoder and the second encoder each include at least one downsampling convolutional module. In this embodiment, a text tampering detection and localization network is used to implement the text tampering detection and localization method. The network structure includes an encoder module, a fusion module, and a decoder module. The encoder module is a dual-stream encoder, that is, it includes two encoders: a first encoder and a second encoder (see Figure 2). The first encoder and the second encoder have the same structure, each consisting of at least one downsampling convolutional module. In one embodiment, the number of downsampling convolutional modules is denoted M; Figure 2 shows the case M = 4. Each downsampling convolutional module may include two residual modules, which can be taken from existing residual networks (e.g., ResNet18, ResNet50, etc.). Taking M = 4 and the residual modules of ResNet18 as an example, the four downsampling convolutional modules can be composed of the first eight residual modules of ResNet18: block1 through block8. Each downsampling convolutional module consists of two consecutive blocks: the first downsampling convolutional module, down1, consists of block1 and block2; the second, down2, consists of block3 and block4; the third, down3, consists of block5 and block6; and the fourth, down4, consists of block7 and block8.
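The grouping of residual blocks into downsampling modules for the M = 4 case can be sketched as below. The helper `group_blocks` and the string block names are illustrative assumptions; a real encoder would hold network layers rather than names.

```python
def group_blocks(block_names, blocks_per_module=2):
    """Split an ordered list of residual blocks into consecutive downsampling modules."""
    if len(block_names) % blocks_per_module != 0:
        raise ValueError("block count must be a multiple of blocks_per_module")
    return {
        f"down{i // blocks_per_module + 1}": block_names[i:i + blocks_per_module]
        for i in range(0, len(block_names), blocks_per_module)
    }

blocks = [f"block{k}" for k in range(1, 9)]   # the eight residual blocks, in order
encoder_layout = group_blocks(blocks)          # M = 4 downsampling modules
for name, members in encoder_layout.items():
    print(name, members)
```

Because both encoders share the same structure, the same layout would apply to the RGB stream and the OCR stream.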
[0057] The first target image is input into the first encoder and convolved by the four downsampling convolution modules to obtain the first initial feature; the second target image is input into the second encoder and convolved by the four downsampling convolution modules to obtain the second initial feature.
[0058] S300. The first initial feature and the second initial feature are fused to obtain the first target feature and the second target feature.
[0059] The first initial feature and the second initial feature are fused by the fusion module. Referring to Figure 2, the fusion module is the Dual Attention module of the network shown there. By fusing the first and second initial features through the fusion module, the text tampering detection and localization network can better focus on the features of the tampered text region, thereby improving tampering detection and localization performance.
[0060] The step of fusing the first initial feature and the second initial feature includes:
[0061] S310. Perform a convolution operation on the first initial feature to obtain the first feature and the second feature, and perform a convolution operation on the second initial feature to obtain the third feature and the fourth feature.
[0062] In this embodiment, the specific structure of the fusion module is shown in Figure 3. The first initial feature is the RGB-stream feature F ∈ R^(C×H×W), and the second initial feature is the OCR-stream feature F_o ∈ R^(C×H×W). The first initial feature F, of size (C, H, W), is input to the fusion module and processed by two 1×1 convolutions to obtain a Value1 value of size (C, H, W) and a Key1 value of size (C // r, H, W), where // denotes floor division (rounding down). In this embodiment, r is set to 8 to reduce the number of channels in the Key1 value and decrease the computational load. The Value1 value is the first feature, and the Key1 value is the second feature. The second initial feature F_o, also of size (C, H, W), is input to the fusion module and processed by two 1×1 convolutions to obtain a Value2 value of size (C, H, W) and a Key2 value of size (C // r, H, W). Here, r is also set to 8 to reduce the number of channels in the Key2 value and reduce the amount of computation. The Value2 value is the third feature, and the Key2 value is the fourth feature.
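To make the shapes concrete, a 1×1 convolution can be written as one linear map applied to the channel vector at every pixel. The sketch below uses NumPy with random weights standing in for learned parameters; `conv1x1`, the toy sizes, and the zero biases are assumptions made for illustration only.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map applied to the channel vector of every pixel.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 16, 4, 5, 8                      # toy sizes; r = 8 as in this embodiment
F_rgb = rng.standard_normal((C, H, W))        # stand-in for the first initial feature F

# Randomly initialized weights stand in for learned parameters.
w_value = rng.standard_normal((C, C))
w_key = rng.standard_normal((C // r, C))
value1 = conv1x1(F_rgb, w_value, np.zeros(C))        # first feature, (C, H, W)
key1 = conv1x1(F_rgb, w_key, np.zeros(C // r))       # second feature, (C // r, H, W)

print(value1.shape, key1.shape)  # (16, 4, 5) (2, 4, 5)
```

The OCR-stream feature F_o would be projected the same way, with its own weights, to give Value2 and Key2.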
[0063] S320. Perform matrix multiplication on the second feature and the fourth feature to obtain the first intermediate feature.
[0064] The second feature (i.e., the Key1 value of size (C // r, H, W) obtained from the first initial feature by 1×1 convolution) and the fourth feature (i.e., the Key2 value of size (C // r, H, W) obtained from the second initial feature by 1×1 convolution) are matrix-multiplied to obtain an intermediate result energy1 of size (H*W, H*W). In this embodiment, the intermediate result energy1 is the first intermediate feature.
[0065] S330. The first intermediate feature is subjected to channel swapping to obtain the second intermediate feature.
[0066] The first and second channels in the first intermediate feature (i.e., energy1) are interchanged to obtain energy2, i.e., the second intermediate feature, with the size still being (H*W, H*W).
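The sizes in steps S320 and S330 can be checked with a small NumPy sketch. Two points are assumptions not spelled out in the text: the spatial grid is flattened so that each key becomes a (C // r, H*W) matrix, and the RGB-stream key is the transposed operand of the product.

```python
import numpy as np

rng = np.random.default_rng(0)
C_r, H, W = 2, 3, 4                            # C // r channels, toy spatial size
key1 = rng.standard_normal((C_r, H, W))        # second feature (RGB stream)
key2 = rng.standard_normal((C_r, H, W))        # fourth feature (OCR stream)

# Flatten the spatial grid so each key becomes a (C // r, H*W) matrix.
k1 = key1.reshape(C_r, H * W)
k2 = key2.reshape(C_r, H * W)

# Assumption: the RGB key is the transposed operand.
energy1 = k1.T @ k2                            # (H*W, H*W): RGB position i vs. OCR position j
energy2 = energy1.T                            # swap the two dimensions

print(energy1.shape, energy2.shape)  # (12, 12) (12, 12)
```

Under these assumptions, entry (i, j) of energy1 is the dot product between the RGB key at position i and the OCR key at position j, and energy2 holds the same similarities viewed from the OCR stream.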
[0067] S340. Obtain the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature.
[0068] The step of obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature includes:
[0069] S341. Process the first intermediate feature to obtain the first attention feature, and process the second intermediate feature to obtain the second attention feature.
[0070] The process of processing the first intermediate feature to obtain the first attention feature includes:
[0071] The first intermediate feature is passed through a linear layer and a softmax layer to obtain the first attention feature.
[0072] The process of processing the second intermediate feature to obtain the second attention feature includes:
[0073] The second intermediate feature is passed through a linear layer and a softmax layer to obtain the second attention feature.
[0074] Specifically, the first intermediate feature energy1 and the second intermediate feature energy2 are respectively passed through a linear layer and a softmax layer to obtain Attention1 and Attention2, both with a size of (H*W, H*W). Here, Attention1 is the first attention feature, and Attention2 is the second attention feature.
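A minimal NumPy sketch of the linear-plus-softmax step follows. The text does not state the shape of the linear layer or whether the two branches share it; here a hypothetical (H*W, H*W) weight acting on the last dimension is reused for both branches purely to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_from_energy(energy, weight, bias):
    """Linear layer over the last dimension, then softmax over the same dimension."""
    return softmax(energy @ weight.T + bias, axis=-1)

rng = np.random.default_rng(0)
N = 6                                           # N = H*W for a toy 2x3 feature map
energy1 = rng.standard_normal((N, N))
energy2 = energy1.T

# Hypothetical linear-layer parameters; the text does not give their shape.
w, b = rng.standard_normal((N, N)) * 0.1, np.zeros(N)
attention1 = attention_from_energy(energy1, w, b)   # first attention feature
attention2 = attention_from_energy(energy2, w, b)   # second attention feature
```

Each row of the resulting attention maps is a probability distribution over the H*W positions of the other stream.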
[0075] S342. Based on the first attention feature and the third feature, a first output feature is obtained, and based on the second attention feature and the first feature, a second output feature is obtained.
[0076] The step of obtaining the first output feature based on the first attention feature and the third feature includes:
[0077] After swapping the first and second channels in the first attention feature, multiplying it with the third feature, the first output feature is obtained.
[0078] The first attention feature Attention1, after being swapped between the first and second channels, is multiplied by the third feature Value2 to obtain Out1, which has a size of (C, H, W). In this embodiment, Out1 is the first output feature.
[0079] The step of obtaining the second output feature based on the second attention feature and the first feature includes:
[0080] After swapping the first and second channels in the second attention feature, multiply it with the first feature to obtain the second output feature.
[0081] The second attention feature, Attention2, after swapping the first and second channels, is multiplied by the first feature, Value1, to obtain Out2, which has a size of (C, H, W). In this embodiment, Out2 is the second output feature.
[0082] S343. Add the first output feature and the first initial feature to obtain the first target feature;
[0083] S344. Add the second output feature and the second initial feature to obtain the second target feature.
[0084] The first output feature Out1 is added to the first initial feature, i.e., the RGB stream feature, to obtain the first target feature;
[0085] The second output feature Out2 is added to the second initial feature, i.e., the OCR stream feature, to obtain the second target feature.
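Steps S342 to S344 can be sketched in NumPy as below. The helper `apply_attention` is hypothetical, and the operand order of the product (value matrix times transposed attention) is an assumption chosen so that the result has size (C, H, W). Identity attention maps are used so the outcome is easy to verify by hand.

```python
import numpy as np

def apply_attention(attention, value, residual):
    """Transpose the attention map, multiply it with the value feature, add the residual.

    attention: (H*W, H*W), value and residual: (C, H, W)
    """
    C, H, W = value.shape
    out = value.reshape(C, H * W) @ attention.T     # (C, H*W)
    return out.reshape(C, H, W) + residual

rng = np.random.default_rng(0)
C, H, W = 4, 2, 3
F_rgb = rng.standard_normal((C, H, W))    # first initial feature
F_ocr = rng.standard_normal((C, H, W))    # second initial feature
value1, value2 = F_rgb * 0.5, F_ocr * 0.5 # stand-ins for the 1x1-convolution outputs

attention1 = np.eye(H * W)                # identity attention keeps the check simple
attention2 = np.eye(H * W)

target1 = apply_attention(attention1, value2, F_rgb)   # Out1 + RGB stream feature
target2 = apply_attention(attention2, value1, F_ocr)   # Out2 + OCR stream feature
```

Note the cross-wiring: the RGB-stream target feature is built from the OCR-stream value and vice versa, which is how each stream receives information from the other.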
[0086] Referring to Figure 1, the text-based image tampering detection and localization method further includes the following steps:
[0087] S400: Input the first target feature and the second target feature into the decoder to obtain the tampering location result in the first target image output by the decoder.
[0088] The decoder includes at least one upsampling convolutional module.
[0089] Each of the upsampling convolutional modules includes an upsampling module, a feature extraction module, and an SCSE attention module;
[0090] The upsampling module is used for upsampling, the feature extraction module is used for convolution processing, and the SCSE attention module is used to extract spatial attention features and channel attention features.
[0091] The first target feature and the second target feature are input into the decoder, which is a Unet decoder network. The decoder includes at least one upsampling convolutional module, and each upsampling convolutional module includes an upsampling module, a feature extraction module, and an attention module.
[0092] In this embodiment, the upsampling module is a bilinear interpolation upsampling module, the feature extraction module is a dual convolution feature extraction module, and the attention module is an SCSE attention module.
[0093] In the upsampling convolution module, each of the two convolutional feature extraction blocks consists of a 2D convolutional layer with a 3×3 kernel, a BatchNorm module, and a ReLU module. The SCSE attention module mainly consists of a spatial SE attention module and a channel SE attention module; its structure is shown in Figure 4.
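For orientation, a commonly used form of concurrent spatial and channel squeeze-and-excitation is sketched below in NumPy. It is not taken from Figure 4: the reduction ratio, the weight shapes, and the choice to add the two branches are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(x, w1, w2, w_spatial):
    """Concurrent spatial and channel squeeze-and-excitation on a (C, H, W) feature.

    w1: (C // ratio, C) and w2: (C, C // ratio) are the channel-SE weights;
    w_spatial: (C,) is the 1x1 convolution of the spatial-SE branch.
    """
    # Channel SE: global average pool -> FC -> ReLU -> FC -> sigmoid -> rescale channels.
    squeezed = x.mean(axis=(1, 2))                          # (C,)
    channel_gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))
    channel_out = x * channel_gate[:, None, None]

    # Spatial SE: 1x1 convolution to one channel -> sigmoid -> rescale positions.
    spatial_gate = sigmoid(np.einsum("c,chw->hw", w_spatial, x))
    spatial_out = x * spatial_gate[None, :, :]

    # Assumption: the two branches are combined by addition.
    return channel_out + spatial_out

rng = np.random.default_rng(0)
C, ratio = 8, 2
x = rng.standard_normal((C, 4, 4))
y = scse(
    x,
    w1=rng.standard_normal((C // ratio, C)),
    w2=rng.standard_normal((C, C // ratio)),
    w_spatial=rng.standard_normal(C),
)
```

The channel branch reweights whole feature maps, while the spatial branch reweights individual positions, which is useful when the tampered region is small.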
[0094] In one embodiment, the number of upsampling convolutional modules is M, the same as the number of downsampling convolutional modules. Each upsampling convolutional module is connected to a connection layer. The nth connection layer adds the output of the nth upsampling convolutional module to the output of the (M+1-n)th downsampling convolutional module and passes the result to the next upsampling convolutional module. The final convolutional output layer consists of a 2D convolutional layer and a sigmoid activation function, where the 2D convolutional layer has a 1×1 kernel, a stride of 1, and one output channel. Finally, a tamper detection and localization probability map with values between 0 and 1 is output, and the target localization result is obtained based on this probability map.
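The final output layer can be sketched in NumPy as a 1×1 convolution to one channel followed by a sigmoid. The function `output_head` is hypothetical, and the 0.5 threshold used to turn the probability map into a binary localization mask is an assumption; the text does not state how the map is binarized.

```python
import numpy as np

def output_head(features, weight, bias, threshold=0.5):
    """1x1 convolution to a single channel, sigmoid, then an assumed fixed threshold.

    features: (C, H, W), weight: (C,), bias: scalar
    """
    logits = np.einsum("c,chw->hw", weight, features) + bias
    prob_map = 1.0 / (1.0 + np.exp(-logits))        # values in (0, 1)
    return prob_map, prob_map > threshold

features = np.zeros((3, 2, 2))
features[:, 0, 0] = 2.0                              # one strongly activated position
prob_map, mask = output_head(features, weight=np.ones(3), bias=-1.0)
```

In this toy input only the activated position exceeds the threshold, so the mask marks a single pixel as tampered.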
[0095] To verify the effectiveness of the text tampering detection and localization network proposed in this embodiment, its performance was validated on a self-built screenshot tampering image dataset. The proposed network was evaluated in two variants, with and without the fusion module. The network with the fusion module was named Ours_dualAttention, and the network without the fusion module was named Ours_base. Specifically, the self-built screenshot tampering image dataset contained 2000 images, covering operations such as copy-paste, splicing, removal, addition, and replacement, and was divided in a training:validation:test ratio of 8:1:1. During training and validation, the images were cropped into 256×256 image blocks, while testing was performed on entire images. The network was trained by gradient descent with backpropagation, using the Adam optimizer with an initial learning rate of 2×10⁻⁴, for 30 epochs. Cross-entropy and the Dice coefficient were used together as the network's loss function. All convolutional kernels were initialized with Kaiming initialization, and all biases were set to 0.
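A combined cross-entropy and Dice loss of the kind mentioned above can be sketched in NumPy as follows. The equal weighting of the two terms and the smoothing constant `eps` are assumptions; the text does not specify how the terms are combined.

```python
import numpy as np

def bce_dice_loss(prob, target, eps=1e-7):
    """Binary cross-entropy plus Dice loss (1 - Dice coefficient), equally weighted.

    prob: predicted tampering probabilities in (0, 1); target: 0/1 ground-truth mask.
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    intersection = np.sum(prob * target)
    dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return bce + (1.0 - dice)

target = np.array([[1.0, 0.0], [0.0, 0.0]])
good = np.array([[0.9, 0.1], [0.1, 0.1]])
bad = np.array([[0.1, 0.9], [0.9, 0.9]])
print(bce_dice_loss(good, target) < bce_dice_loss(bad, target))  # True
```

The Dice term depends on overlap rather than on per-pixel averages, so it is less dominated by the large untampered background than cross-entropy alone, which is one common reason to pair the two when the positive region is small.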
[0096] Table 1 shows the comparison results of the text tampering detection and localization network proposed in this embodiment through the above dataset:
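[0097]
| Method | F1 | IOU | AUC | MCC |
| --- | --- | --- | --- | --- |
| NoisePrint | 0.01354 | 0.01666 | 0.67914 | 0.052946 |
| Mantrnet | 0.0204 | 0.01106 | 0.52478 | 0.01096 |
| DFCN | 0 | 0 | 0.63952 | 0 |
| Ours_base | 0.16736 | 0.0926 | 0.72904 | 0.14002 |
| Ours_dualAttention | 0.32978 | 0.22528 | 0.880875 | 0.34356 |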
[0098] Table 1
[0099] As can be seen, the text tampering detection and localization network proposed in this embodiment outperforms the other comparative methods on the above dataset regardless of whether the fusion module is present, and the dual-stream network with the fusion module performs even better.
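For readers unfamiliar with the pixel-level metrics in Table 1, the sketch below computes F1, IoU, and MCC from a predicted mask and a ground-truth mask using their standard definitions. Returning 0 when a denominator is zero is a convention assumed here, and the function does not reproduce the evaluation protocol behind the table; AUC is omitted because it requires the probability map rather than a binary mask.

```python
import numpy as np

def localization_metrics(pred, truth):
    """Pixel-level F1, IoU, and MCC for two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = float(np.sum(pred & truth))
    fp = float(np.sum(pred & ~truth))
    fn = float(np.sum(~pred & truth))
    tn = float(np.sum(~pred & ~truth))

    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, iou, mcc

truth = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 0, 0]])
f1, iou, mcc = localization_metrics(pred, truth)
```

In this toy case one of the two tampered pixels is found with no false alarms, giving F1 = 2/3, IoU = 1/2, and MCC = 2/√12.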
[0100] In summary, this embodiment provides a method for detecting and locating text-based image tampering. It involves acquiring a first target image, processing the first target image to obtain a second target image, which is an image retaining only the text portion of the first target image; then inputting the first target image and the second target image into a first encoder and a second encoder, respectively, to obtain a first initial feature and a second initial feature; fusing the first initial feature and the second initial feature to obtain a first target feature and a second target feature; and finally inputting the first target feature and the second target feature into a decoder to obtain the tampering location result in the first target image output by the decoder. The text-based image tampering detection and location method provided by this invention can better identify tampered text regions and improve the tampering detection performance of text-based images.
[0101] It should be understood that although the steps in the flowcharts shown in the accompanying drawings are displayed sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of the steps in this invention, and these steps can be executed in other orders. Moreover, at least a portion of the steps in this invention may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.
[0102] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0103] Embodiment 2
[0104] Based on the above embodiments, the present invention also provides a text-based image tampering detection and location device, whose functional modules are shown in Figure 5. The text-based image tampering detection and location device includes:
[0105] An image acquisition module is used to acquire a first target image, process the first target image to obtain a second target image, and the second target image is an image that retains only the text portion of the first target image, as specifically described in Embodiment 1.
[0106] The encoder module is used to input the first target image and the second target image into the first encoder and the second encoder respectively to obtain the first initial feature and the second initial feature respectively, as described in Embodiment 1;
[0107] A fusion module is used to fuse the first initial feature and the second initial feature to obtain a first target feature and a second target feature, as described in Embodiment 1.
[0108] The decoder module is used to input the first target feature and the second target feature into the decoder to obtain the tampering location result in the first target image output by the decoder, as specifically described in Embodiment 1.
[0109] Example 3
[0110] Based on the text-based image tampering detection and localization method described in Embodiment 1 above, the present invention also provides a terminal, the principle block diagram of which is shown in Figure 6. The terminal includes a memory 10 and a processor 20. The memory 10 stores a text-based image tampering detection and localization program. When the processor 20 executes the program, it performs at least the following steps:
[0111] A first target image is obtained, and the first target image is processed to obtain a second target image, wherein the second target image is an image that retains only the text portion of the first target image;
[0112] The first target image and the second target image are respectively input into the first encoder and the second encoder to obtain the first initial feature and the second initial feature, respectively;
[0113] The first initial feature and the second initial feature are fused to obtain the first target feature and the second target feature;
[0114] The first target feature and the second target feature are input into the decoder to obtain the tampering location result in the first target image output by the decoder.
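The first of the steps above, producing a second target image that retains only the text portion, might look like the following NumPy sketch. It assumes the text regions are already available as axis-aligned pixel boxes (for example, from an OCR engine) and that non-text pixels are simply set to zero; the function name `keep_text_regions` and the box format are hypothetical and not taken from the patent.

```python
import numpy as np


def keep_text_regions(image, boxes):
    """Return a copy of `image` in which pixels outside every box are zero.

    image: array of shape (H, W, C)
    boxes: iterable of (x0, y0, x1, y1) pixel boxes, x1 and y1 exclusive
           (hypothetical format for OCR output)
    """
    mask = np.zeros(image.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    # Broadcast the (H, W) mask over the channel axis.
    return np.where(mask[..., None], image, np.zeros_like(image))


# A 4x6 white image with one assumed text box covering two pixels.
first_image = np.full((4, 6, 3), 255, dtype=np.uint8)
second_image = keep_text_regions(first_image, [(1, 1, 3, 2)])
```

Both images keep the same spatial size, so the two encoders receive spatially aligned inputs.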
[0115] Both the first encoder and the second encoder include at least one downsampling convolution module, and each downsampling convolution module includes two residual modules.
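As a rough illustration of one downsampling convolution module built from two residual modules, the NumPy sketch below halves the spatial size and then applies two residual steps. The patent text here does not give the internal layout, so everything specific is an assumption: 2x2 average pooling stands in for strided convolution, a per-pixel channel mix stands in for the convolutions, and the names `residual_module` and `downsampling_module` are hypothetical.

```python
import numpy as np


def residual_module(x, weight):
    """relu(x + f(x)), with f a per-pixel channel mix (stand-in for convolutions)."""
    fx = np.einsum("oc,chw->ohw", weight, x)
    return np.maximum(x + fx, 0.0)


def downsampling_module(x, w_down, w_res1, w_res2):
    """Halve H and W, change the channel count, then apply two residual modules.

    x: array of shape (C, H, W) with even H and W.
    """
    c, h, w = x.shape
    # 2x2 average pooling (assumed downsampling operator).
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    y = np.einsum("oc,chw->ohw", w_down, pooled)
    y = residual_module(y, w_res1)
    return residual_module(y, w_res2)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
out = downsampling_module(
    x,
    w_down=0.1 * rng.normal(size=(16, 3)),
    w_res1=0.1 * rng.normal(size=(16, 16)),
    w_res2=0.1 * rng.normal(size=(16, 16)),
)
print(out.shape)
```

Stacking several such modules would give each encoder a pyramid of progressively smaller feature maps.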
[0116] The step of fusing the first initial feature and the second initial feature includes:
[0117] Perform a convolution operation on the first initial feature to obtain the first feature and the second feature, and perform a convolution operation on the second initial feature to obtain the third feature and the fourth feature;
[0118] Perform matrix multiplication on the second feature and the fourth feature to obtain the first intermediate feature;
[0119] Swap the channels of the first intermediate feature to obtain the second intermediate feature;
[0120] The first target feature and the second target feature are obtained based on the first intermediate feature and the second intermediate feature.
[0121] The step of obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature includes:
[0122] The first intermediate feature is processed to obtain the first attention feature, and the second intermediate feature is processed to obtain the second attention feature;
[0123] Based on the first attention feature and the third feature, a first output feature is obtained; based on the second attention feature and the first feature, a second output feature is obtained.
[0124] The first output feature and the first initial feature are added together to obtain the first target feature;
[0125] The second target feature is obtained by adding the second output feature and the second initial feature.
[0126] The step of processing the first intermediate feature to obtain the first attention feature includes:
[0127] The first intermediate feature is passed through a linear layer and a softmax layer to obtain the first attention feature.
[0128] The step of processing the second intermediate feature to obtain the second attention feature includes: passing the second intermediate feature through a linear layer and a softmax layer to obtain the second attention feature.
[0129] The step of obtaining the first output feature based on the first attention feature and the third feature includes:
[0130] After swapping the first and second channels in the first attention feature, multiply it with the third feature to obtain the first output feature;
[0131] The step of obtaining the second output feature based on the second attention feature and the first feature includes:
[0132] After swapping the first and second channels in the second attention feature, multiply it with the first feature to obtain the second output feature.
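The fusion steps described above can be sketched in NumPy as follows. This is one possible reading, not the patented implementation. It makes the following assumptions: each initial feature is flattened to an (N, C) matrix with one row per spatial position; the convolutions are 1x1 and therefore reduce to matrix products; "swapping channels" is interpreted as transposing the two axes; and the linear layers have no bias. The function name `fuse` and the parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(init_1, init_2, p):
    """init_1, init_2: (N, C) arrays, one row per spatial position."""
    feat_1, feat_2 = init_1 @ p["w1"], init_1 @ p["w2"]   # first, second feature
    feat_3, feat_4 = init_2 @ p["w3"], init_2 @ p["w4"]   # third, fourth feature

    inter_1 = feat_2 @ feat_4.T               # (N, N) first intermediate feature
    inter_2 = inter_1.T                       # second intermediate: axes swapped

    attn_1 = softmax(inter_1 @ p["lin1"])     # linear layer, then softmax
    attn_2 = softmax(inter_2 @ p["lin2"])

    out_1 = attn_1.T @ feat_3                 # swap axes, multiply with third feature
    out_2 = attn_2.T @ feat_1                 # swap axes, multiply with first feature

    # Residual additions give the two target features.
    return out_1 + init_1, out_2 + init_2


rng = np.random.default_rng(0)
N, C, D = 16, 8, 4                            # positions, channels, projection width
params = {
    "w1": rng.normal(size=(C, C)), "w2": rng.normal(size=(C, D)),
    "w3": rng.normal(size=(C, C)), "w4": rng.normal(size=(C, D)),
    "lin1": rng.normal(size=(N, N)), "lin2": rng.normal(size=(N, N)),
}
x1, x2 = rng.normal(size=(N, C)), rng.normal(size=(N, C))
target_1, target_2 = fuse(x1, x2, params)
print(target_1.shape, target_2.shape)
```

Under this reading each stream is updated with content from the other stream (the first output is built from the third feature, the second from the first feature), while the residual additions preserve each stream's own initial feature.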
[0133] The decoder includes at least one upsampling convolutional module;
[0134] Each of the upsampling convolutional modules includes an upsampling module, a feature extraction module, and an SCSE attention module;
[0135] The upsampling module is used for upsampling, the feature extraction module is used for convolution processing, and the SCSE attention module is used to extract spatial attention features and channel attention features.
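A minimal NumPy sketch of one upsampling convolutional module is given below. SCSE is taken here to mean concurrent spatial and channel squeeze-and-excitation, with the two branches summed; the nearest-neighbour upsampling, the per-pixel channel mix standing in for convolution, the reduction ratio, and all function and parameter names are assumptions for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def scse(x, w_reduce, w_expand, w_spatial):
    """x: (C, H, W). Re-weights x with a channel gate and a spatial gate."""
    # Channel attention: global average pool -> bottleneck -> per-channel gate.
    squeezed = x.mean(axis=(1, 2))                          # (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)           # (C // r,)
    channel_gate = sigmoid(w_expand @ hidden)               # (C,)
    # Spatial attention: 1x1 projection to one channel -> per-pixel gate.
    spatial_gate = sigmoid(np.einsum("c,chw->hw", w_spatial, x))   # (H, W)
    return x * channel_gate[:, None, None] + x * spatial_gate[None, :, :]


def upsampling_conv_module(x, w_conv, scse_params):
    up = x.repeat(2, axis=1).repeat(2, axis=2)              # nearest-neighbour 2x upsampling
    feat = np.maximum(np.einsum("oc,chw->ohw", w_conv, up), 0.0)   # feature extraction
    return scse(feat, *scse_params)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))
decoded = upsampling_conv_module(
    x,
    w_conv=rng.normal(size=(4, 8)),
    scse_params=(rng.normal(size=(2, 4)), rng.normal(size=(4, 2)), rng.normal(size=4)),
)
print(decoded.shape)
```

Because both gates lie between 0 and 1, the module can only rescale the extracted features; it emphasizes or suppresses channels and positions without introducing new ones.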
[0136] Example 4
[0137] The present invention also provides a storage medium storing one or more programs that can be executed by one or more processors to implement the steps of the text-based image tampering detection and localization method described in the above embodiments.
[0138] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for detecting and locating text-based image tampering, characterized in that it includes: obtaining a first target image and processing the first target image to obtain a second target image, wherein the second target image is an image that retains only the text portion of the first target image; inputting the first target image and the second target image into a first encoder and a second encoder, respectively, to obtain a first initial feature and a second initial feature, respectively; fusing the first initial feature and the second initial feature to obtain a first target feature and a second target feature; and inputting the first target feature and the second target feature into a decoder to obtain the tampering localization result for the first target image output by the decoder; wherein the step of fusing the first initial feature and the second initial feature includes: performing a convolution operation on the first initial feature to obtain a first feature and a second feature, and performing a convolution operation on the second initial feature to obtain a third feature and a fourth feature; performing matrix multiplication on the second feature and the fourth feature to obtain a first intermediate feature; swapping the channels of the first intermediate feature to obtain a second intermediate feature; and obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature; wherein the step of obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature includes: processing the first intermediate feature to obtain a first attention feature, and processing the second intermediate feature to obtain a second attention feature; obtaining a first output feature based on the first attention feature and the third feature, and obtaining a second output feature based on the second attention feature and the first feature; adding the first output feature and the first initial feature to obtain the first target feature; and adding the second output feature and the second initial feature to obtain the second target feature; wherein processing the first intermediate feature to obtain the first attention feature includes passing the first intermediate feature through a linear layer and a softmax layer to obtain the first attention feature; and processing the second intermediate feature to obtain the second attention feature includes passing the second intermediate feature through a linear layer and a softmax layer to obtain the second attention feature.
2. The method for detecting and locating text-based image tampering according to claim 1, characterized in that both the first encoder and the second encoder include at least one downsampling convolution module, and each downsampling convolution module includes two residual modules.
3. The method for detecting and locating text-based image tampering according to claim 1, characterized in that the step of obtaining the first output feature based on the first attention feature and the third feature includes: swapping the first and second channels in the first attention feature and then multiplying it with the third feature to obtain the first output feature; and the step of obtaining the second output feature based on the second attention feature and the first feature includes: swapping the first and second channels in the second attention feature and then multiplying it with the first feature to obtain the second output feature.
4. The method for detecting and locating text-based image tampering according to claim 1, characterized in that the decoder includes at least one upsampling convolutional module; each upsampling convolutional module includes an upsampling module, a feature extraction module, and an SCSE attention module; the upsampling module is used for upsampling, the feature extraction module is used for convolution processing, and the SCSE attention module is used to extract spatial attention features and channel attention features.
5. A device for detecting and locating text-based image tampering, characterized in that the device includes: an image acquisition module, used to acquire a first target image and process the first target image to obtain a second target image, wherein the second target image is an image that retains only the text portion of the first target image; an encoder module, used to input the first target image and the second target image into a first encoder and a second encoder, respectively, to obtain a first initial feature and a second initial feature, respectively; a fusion module, used to fuse the first initial feature and the second initial feature to obtain a first target feature and a second target feature; and a decoder module, used to input the first target feature and the second target feature into a decoder to obtain the tampering localization result for the first target image output by the decoder; wherein fusing the first initial feature and the second initial feature includes: performing a convolution operation on the first initial feature to obtain a first feature and a second feature, and performing a convolution operation on the second initial feature to obtain a third feature and a fourth feature; performing matrix multiplication on the second feature and the fourth feature to obtain a first intermediate feature; swapping the channels of the first intermediate feature to obtain a second intermediate feature; and obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature; wherein obtaining the first target feature and the second target feature based on the first intermediate feature and the second intermediate feature includes: processing the first intermediate feature to obtain a first attention feature, and processing the second intermediate feature to obtain a second attention feature; obtaining a first output feature based on the first attention feature and the third feature, and obtaining a second output feature based on the second attention feature and the first feature; adding the first output feature and the first initial feature to obtain the first target feature; and adding the second output feature and the second initial feature to obtain the second target feature; wherein processing the first intermediate feature to obtain the first attention feature includes passing the first intermediate feature through a linear layer and a softmax layer to obtain the first attention feature; and processing the second intermediate feature to obtain the second attention feature includes passing the second intermediate feature through a linear layer and a softmax layer to obtain the second attention feature.
6. A terminal, characterized in that the terminal includes a processor and a storage medium communicatively connected to the processor, wherein the storage medium is adapted to store multiple instructions, and the processor is adapted to call the instructions in the storage medium to execute the steps of the text-based image tampering detection and localization method according to any one of claims 1-4.
7. A storage medium, characterized in that the storage medium stores one or more programs, which can be executed by one or more processors to implement the steps of the text-based image tampering detection and localization method according to any one of claims 1-4.