An image tamper-proofing method based on multi-modal feature fusion and electronic equipment

The image tampering detection method using multimodal feature fusion solves the problems of low detection accuracy and resource waste in existing technologies, and achieves efficient image tampering detection and localization.

CN122244653A · Pending · Publication Date: 2026-06-19 · BEIJING PACTERA JINXIN TECH LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING PACTERA JINXIN TECH LTD
Filing Date: 2026-03-12
Publication Date: 2026-06-19

Patent Timeline

12 Mar 2026: Application
19 Jun 2026: Publication (CN122244653A)

IPC: G06V20/00; G06V10/77; G06V10/80; G06V10/82; G06N3/0464; G06N3/045; G06N3/0499; G06V10/774; G06V10/54; G06V10/28; G06V10/72; G06V10/74


AI Technical Summary

⚠Technical Problem

Existing image tampering detection models have low accuracy when processing computer-generated images, and the tampering detection results depend on the localization results, leading to errors in the detection results and waste of resources.

⚗Method used

A multimodal feature fusion method is adopted, which constructs a noise extraction module and separates the recognition path of tamper detection and localization results, and combines visual and text encoding modules to perform image tamper detection and localization.

🎯Benefits of technology

It improves the accuracy of image tampering detection, reduces computational load and resource consumption, takes into account the differences in noise patterns between photoelectric and non-photoelectric devices, and reduces the impact of errors.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides an image anti-tampering method and electronic device based on multimodal feature fusion, comprising: inputting the image to be identified into a noise extraction module for noise feature extraction to obtain a target frequency domain feature map indicating the noise pattern of the image itself; inputting the image to be identified and the target frequency domain feature map into an image fusion module for feature fusion to obtain an initial feature map; inputting the initial feature map and a pre-stored detection result prompt vector into a pre-trained multimodal model to obtain an optimized feature map, optimized feature vector, and optimized text vector; inputting the optimized feature vector and optimized text vector into a detection module for tamper detection to obtain an output tamper detection result; and inputting the optimized text vector and optimized feature map into a localization module for tamper localization to obtain a localization result of the tampered position in the image to be identified. By separating the identification path of the tamper detection result and the localization result, the recognition accuracy is improved.


Description

Technical Field

[0001] This application relates to the field of image tampering detection and localization technology, and in particular to an image anti-tampering method and electronic device based on multimodal feature fusion.

Background Technology

[0002] Image forgery detection and localization (IFDL) technology is one of the core research directions in the field of digital multimedia security. With the widespread use of electronic editing tools such as Photoshop, and the rapid development of generative artificial intelligence and deepfake technology, malicious image tampering poses a serious threat to the authenticity of news, financial security, judicial fairness and social stability. Developing reliable IFDL technology has become essential for preventing the spread of false information and maintaining the integrity and authenticity of digital images.

[0003] Currently, most tamper detection models use images acquired by optical devices as detection targets or evaluation samples, which means they rely heavily on extracting the "camera fingerprint" from the detection images and cannot effectively deal with tampering cases such as computer-generated images (CGI, Computer Generated Images) and rendered images.

[0004] Furthermore, many existing tamper detection models (such as TruFor) only provide the tamper localization result corresponding to the image, or require the localization result to be determined first and the tamper detection result to be derived from it. This makes the tamper detection result dependent on, and highly sensitive to, the localization result. That is, a deviation in the tamper localization result causes the tamper detection result to be wrong, reducing the accuracy of the tamper detection result.

Summary of the Invention

[0005] In view of this, the purpose of this application is to provide at least one image anti-tampering method and electronic device based on multimodal feature fusion, which improves the recognition accuracy by constructing a novel noise extraction module and separating the recognition path of tampering detection results from the positioning results.

[0006] This application mainly includes the following aspects: In a first aspect, embodiments of this application provide an image anti-tampering method based on multimodal feature fusion, applied to an image anti-tampering network. The method includes: inputting the image to be identified into a noise extraction module for noise feature extraction to obtain a target frequency domain feature map indicating the noise pattern of the image itself; inputting the image to be identified and the target frequency domain feature map into an image fusion module for feature fusion to obtain an initial feature map; inputting the initial feature map and a pre-stored detection result prompt vector into a pre-trained multimodal model to obtain an optimized feature map, optimized feature vector, and optimized text vector; inputting the optimized feature vector and optimized text vector into a detection module for tamper detection to obtain an output tamper detection result, the tamper detection result describing whether the image to be identified has been tampered with; and inputting the optimized text vector and optimized feature map into a localization module for tamper localization to obtain a localization result of the tampered location in the image to be identified.

[0007] In one possible implementation, the noise extraction module performs the following steps: converting the image to be identified into a grayscale feature map; inputting the grayscale feature map into a first two-dimensional convolutional layer with preset operating parameters to obtain an optimized grayscale feature map; inputting the grayscale feature map into a second two-dimensional convolutional layer with preset operating parameters to obtain a constrained feature map; inputting the optimized grayscale feature map into a discrete cosine transform layer to obtain a processed frequency feature map; inputting the processed frequency feature map into a third two-dimensional convolutional layer to obtain an optimized frequency feature map; performing difference processing on the optimized grayscale feature map and the constrained feature map, and inputting the difference processing result into a fourth two-dimensional convolutional layer to obtain an anomaly feature map; and concatenating the anomaly feature map and the optimized frequency feature map along a dimension to obtain a target frequency domain feature map.

[0008] In one possible implementation, the image fusion module performs the following steps: inputting the image to be recognized into a fifth two-dimensional convolutional layer with preset operating parameters to obtain the original feature map; concatenating the original feature map and the target frequency domain feature map along a dimension; and sequentially performing dimension permutation and normalization on the merged result to obtain the initial feature map.

[0009] In one possible implementation, the multimodal model includes a visual encoding module and a text encoding module, wherein the optimized feature map, optimized feature vector, and optimized text vector are obtained by: inputting the initial feature map into the visual encoding module for image feature extraction to obtain the optimized feature map and optimized feature vector; and inputting the pre-stored prompt vector into the text encoding module for text feature vector extraction to obtain the optimized text vector.

[0010] In one possible implementation, the visual encoding module includes a feature transformation structure layer and an attention pooling layer, wherein optimized feature maps and optimized feature vectors are obtained by sequentially inputting the initial feature map into the feature transformation structure layer and the attention pooling layer to obtain optimized feature maps and optimized feature vectors.

[0011] In one possible implementation, the detection module performs the following steps: calculating the cosine similarity between the optimized feature vector and the optimized text vector; normalizing the cosine similarity using a normalization function; and determining the normalized result as the detection result.

[0012] In one possible implementation, the localization module performs the following steps: inputting the optimized text vector into a linear projection layer to obtain attention terms; merging the attention terms with the optimized feature map to obtain localization input features; inputting the localization input features into a pre-trained visual self-attention model to obtain localization result features output by the backbone network corresponding to the visual self-attention model; separating the localization result vector from the localization result features; inputting the localization result vector into a multilayer perceptron to obtain at least one prediction box and the tampering probability corresponding to the prediction box; forming a localization result from the target prediction boxes whose tampering probability is greater than or equal to a given localization recognition threshold, and marking all pixels within the target prediction boxes as image tampering pixels.

[0013] In one possible implementation, the method further includes: training the detection module before performing image tampering detection on the image, including: initializing the noise extraction module, the image fusion module, and the detection module respectively and loading a pre-trained multimodal model; inputting a preset classification result prompt phrase into a tokenizer for byte-pair encoding to obtain a prompt vector; inputting multiple image samples into the image tampering detection network to obtain the tampering detection result corresponding to each image sample output by the detection module; and training and optimizing the detection module by combining the actual tampering result and the tampering detection result corresponding to each image sample to obtain a trained detection module.

[0014] In one possible implementation, after training the detection module is completed, the method further includes: training the localization module: initializing the localization module, loading and freezing the noise extraction module, image fusion module, detection module, and pre-trained multimodal model; inputting multiple image samples into the image tampering detection network to obtain the predicted tampering localization result output by the localization module for each image sample; obtaining the actual tampering localization result corresponding to each image sample; and training and optimizing the localization module by combining the actual tampering localization result and the predicted tampering localization result corresponding to each image sample to obtain a trained localization module.

[0015] In a second aspect, embodiments of this application also provide an image anti-tampering device based on multimodal feature fusion, applied to an image anti-tampering network. The device includes: a noise extraction unit, used to input the image to be identified into a noise extraction module for noise feature extraction, obtaining a target frequency domain feature map indicating the noise pattern of the image itself; a fusion unit, used to input the image to be identified and the target frequency domain feature map into an image fusion module for feature fusion, obtaining an initial feature map; an optimization unit, used to input the initial feature map and a pre-stored detection result prompt vector into a pre-trained multimodal model, obtaining an optimized feature map, optimized feature vector, and optimized text vector; a tampering detection unit, used to input the optimized feature vector and optimized text vector into a detection module for tampering detection, obtaining an output tampering detection result, the tampering detection result describing whether the image to be identified has been tampered with; and a localization unit, used to input the optimized text vector and optimized feature map into a localization module for tampering localization, obtaining a localization result of the tampered location of the image to be identified.

[0016] This application provides an image anti-tampering method and electronic device based on multimodal feature fusion, comprising: inputting the image to be identified into a noise extraction module for noise feature extraction to obtain a target frequency domain feature map indicating the noise pattern of the image itself; inputting the image to be identified and the target frequency domain feature map into an image fusion module for feature fusion to obtain an initial feature map; inputting the initial feature map and a pre-stored detection result prompt vector into a pre-trained multimodal model to obtain an optimized feature map, optimized feature vector, and optimized text vector; inputting the optimized feature vector and optimized text vector into a detection module for tamper detection to obtain an output tamper detection result; and inputting the optimized text vector and optimized feature map into a localization module for tamper localization to obtain a localization result of the tampered position in the image to be identified. By separating the identification path of the tamper detection result and the localization result, the identification accuracy is improved.

[0017] The advantages of this application are: This application separates the tamper detection process from the tamper location process, making tamper detection independent of the tamper location result and avoiding the erroneous impact of tamper location result deviation on the tamper detection result. This effectively improves the tamper recognition capability and tamper detection accuracy of existing methods in real-world scenarios. By effectively extracting noise features from the image to be identified and fusing texture features, specific frequency domain energy features, and object edge features, it effectively takes into account the noise patterns of electronic images acquired by photoelectric imaging devices and the noise differences of electronic images generated or acquired by non-photoelectric imaging devices (e.g., computer software). This overcomes the technical gap in existing tamper recognition models in areas such as tamper recognition of purely computer-generated images. At the same time, using the predicted bounding box as the location result greatly reduces the model's computational load and resource consumption compared to pixel-level result output and prediction processes. By using feature analysis compatible with multiple modalities (including noise, image, and text) to determine the image tamper location and detection results, it effectively improves the final tamper detection and tamper location accuracy.

[0018] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Drawings

[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 shows a flowchart of a method for detecting image tampering based on multimodal feature fusion provided in an embodiment of this application; Figure 2 shows a schematic diagram of the structure of an image tampering detection network provided in an embodiment of this application; Figure 3 shows a schematic diagram of the structure of a noise extraction module provided in an embodiment of this application; Figure 4 shows a schematic diagram of the structure of the visual encoding module provided in an embodiment of this application; Figure 5 shows a schematic diagram of the structure of a positioning module provided in an embodiment of this application; Figure 6 shows a schematic diagram of the structure of an image tampering detection device based on multimodal feature fusion provided in an embodiment of this application; Figure 7 shows a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the drawings in this application are for illustrative and descriptive purposes only and are not intended to limit the scope of protection of this application. Furthermore, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of this application. It should be understood that the operations in the flowcharts may not be implemented in sequence, and steps without logical contextual relationships may be reversed or implemented simultaneously. In addition, those skilled in the art, guided by the content of this application, may add one or more other operations to the flowcharts, or remove one or more operations from the flowcharts.

[0022] Furthermore, the described embodiments are merely some, not all, of the embodiments of this application. The components of the embodiments of this application described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0023] Most existing tamper detection models (such as TruFor) have at least the following drawbacks: 1. In the existing technology, it is necessary to first determine the tampering localization result corresponding to the image, and then derive the tampering detection result from it. This makes the tampering detection result dependent on, and highly sensitive to, the localization result. That is, a deviation in the tampering localization result causes the tampering detection result to be wrong, reducing the accuracy of the tampering detection result.

[0024] 2. Optical noise superimposed on images during the shooting process in optoelectronic imaging devices such as cameras has a fixed pattern. Based on this, the existing tamper detection model introduces a noise extractor. The noise extractor can collect the inconsistencies in noise patterns in electronic images captured by cameras or other optoelectronic imaging devices. However, since images captured by non-optoelectronic devices do not have the same type of noise as those captured by cameras and other optoelectronic devices, the noise extractor provided by the existing technology cannot correctly identify the underlying features corresponding to purely computer-generated electronic images (such as rendered images), thereby reducing the accuracy of tamper detection for images captured by non-optoelectronic devices.

[0025] 3. Most tamper detection models use pixel-level positioning results, which inevitably generates a large amount of computation and resource consumption, resulting in resource waste and reduced tamper detection efficiency.

[0026] Based on this, this application provides an image anti-tampering method based on multimodal feature fusion. By constructing a novel noise extraction module and separating the identification path of tampering detection results from the localization results, the identification accuracy is improved, as detailed below. Figure 1 shows a flowchart of an image anti-tampering method based on multimodal feature fusion provided in an embodiment of this application. Figure 2 shows a schematic diagram of the structure of an image anti-tampering network provided in an embodiment of this application. As shown in Figure 1, the method provided in the embodiments of this application is applied to an image anti-tampering network, which includes a noise extraction module 1, an image fusion module 2, a visual encoding module 3, a text encoding module 4, a detection module 5, and a positioning module 6.

[0027] In a preferred embodiment, as shown in Figure 1 and Figure 2, the method includes the following steps: S100. Input the image to be identified into the noise extraction module to extract noise features and obtain a target frequency domain feature map that indicates the noise pattern of the image itself.

[0028] S200. Input the image to be identified and the target frequency domain feature map into the image fusion module for feature fusion to obtain the initial feature map.

[0029] S300. Input the initial feature map and the pre-stored detection result prompt vector into the pre-trained multimodal model to obtain the optimized feature map, optimized feature vector, and optimized text vector.

[0030] S400. Input the optimized feature vector and optimized text vector into the detection module for tamper detection and obtain the output tamper detection result.

[0031] Preferably, the tampering detection result describes whether the image to be identified has been tampered with.

[0032] S500: The optimized text vector and optimized feature map are input into the localization module for tampering localization, and the localization result of the tampered position of the image to be identified is obtained.
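The data flow of steps S100 to S500 can be sketched as a small wiring class. This is only an illustrative skeleton: the class name `AntiTamperingNetwork`, the field names, and the use of plain callables as stand-ins for the six modules are assumptions, not part of the application. The point it illustrates is that the detection path (M, N to Y) and the localization path (N, Q to Z) read shared encoder outputs but never each other's results.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class AntiTamperingNetwork:
    """Wires the six modules together; every callable is a hypothetical placeholder."""
    noise_extraction: Callable[[Any], Any]            # S100: X -> F
    image_fusion: Callable[[Any, Any], Any]           # S200: (X, F) -> I
    visual_encoder: Callable[[Any], Tuple[Any, Any]]  # S300: I -> (Q, M)
    text_encoder: Callable[[Any], Any]                # S300: P -> N
    detection: Callable[[Any, Any], Any]              # S400: (M, N) -> Y
    localization: Callable[[Any, Any], Any]           # S500: (N, Q) -> Z

    def forward(self, image, prompt_vector):
        f = self.noise_extraction(image)
        i = self.image_fusion(image, f)
        q, m = self.visual_encoder(i)
        n = self.text_encoder(prompt_vector)
        y = self.detection(m, n)      # does not read the localization output
        z = self.localization(n, q)   # does not read the detection output
        return y, z
```

Because the two heads are independent, a caller that only needs the yes/no detection result could skip the localization call entirely in a variant of this sketch.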

[0033] In specific implementation, in step S100, since the noise extractor provided by the existing tamper detection model is incompatible with the noise superimposed on images by non-photoelectric devices, this application provides a noise extraction module 1 that can extract the noise patterns superimposed on images by both photoelectric devices and non-photoelectric devices. Specifically, after the image X to be identified is input into the noise extraction module 1, the target frequency domain feature map F is obtained. Figure 3 shows a schematic diagram of the structure of the noise extraction module provided in an embodiment of this application. As shown in Figure 3, step S100 includes: S1001. Convert the image to be recognized into a grayscale feature map.

[0034] Specifically, the image X to be recognized, with height H and width W, is read and converted into a grayscale feature map with 1 channel, height H, and width W, that is, a three-dimensional array in $\mathbb{R}^{1 \times H \times W}$.

[0035] After the grayscale feature map is obtained, it is input into the frequency domain transformation module (not shown in the figure) to obtain the target frequency domain feature map F. Preferably, the frequency domain transformation module performs the following: S1002. Input the grayscale feature map into the first two-dimensional convolutional layer with preset operation parameters to obtain the optimized grayscale feature map.

[0036] Specifically, the grayscale feature map is input into a first two-dimensional convolutional layer with a kernel size of (S×S) and a stride of S (where S corresponds to the image patch size of the visual encoder) to obtain the optimized grayscale feature map, a three-dimensional array in $\mathbb{R}^{D \times \frac{H}{S} \times \frac{W}{S}}$ with D channels, height H/S, and width W/S, where 3D (that is, 3 × D) is the embedding dimension of the subsequent visual encoder.
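The effect of a kernel size of (S×S) with a stride of S on the feature map size follows from the standard convolution output-size formula. The sketch below is a minimal illustration; the function name `conv2d_output_shape` and the concrete sizes (a 224×224 image, S = 16, D = 256) are hypothetical values chosen for the example, and padding of 0 and no dilation are assumed for the patch-style layers.

```python
def conv2d_output_shape(height, width, out_channels, kernel, stride, padding=0):
    """Output (channels, height, width) of a 2D convolution without dilation."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_channels, out_h, out_w


# Hypothetical sizes: a 224x224 image, patch size S = 16, D = 256 channels.
H, W, S, D = 224, 224, 16, 256
print(conv2d_output_shape(H, W, D, kernel=S, stride=S))  # (256, 14, 14)
```

With these assumed values each output position corresponds to one non-overlapping S×S patch, which is why S is tied to the patch size of the visual encoder. A 3×3 kernel with stride 1 and padding 1 leaves the height and width unchanged under the same formula.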

[0037] S1003. Input the grayscale feature map into the second two-dimensional convolutional layer with preset operation parameters to obtain the constrained feature map.

[0038] Specifically, the grayscale feature map is input into a second two-dimensional convolutional layer with a kernel size of (S×S) and a stride of S to obtain the constrained feature map. The second two-dimensional convolutional layer is a Bayer convolutional layer.

[0039] S1004. Input the optimized grayscale feature map into the discrete cosine transform layer to obtain the processed frequency feature map.

[0040] Specifically, the processed frequency feature map is obtained by applying the discrete cosine transform layer (DCT layer) to the optimized grayscale feature map.

[0041] Here, the DCT layer denotes performing a discrete cosine transform on the optimized grayscale feature map to obtain the processed frequency feature map.
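As an illustration of what a discrete cosine transform layer computes, the following sketch applies a two-dimensional DCT to a single channel. The text does not state which DCT variant, normalization, or axes are used, so the orthonormal DCT-II over the two spatial axes is an assumption here, and the helper names `dct_1d` and `dct_2d` are hypothetical.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1D sequence (assumed variant)."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(grid):
    """Apply the 1D DCT to every row, then to every column of one channel."""
    rows = [dct_1d(row) for row in grid]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to [H][W]
```

For a constant input, all energy ends up in the top-left (zero-frequency) coefficient, while textured or noisy regions spread energy into higher-frequency coefficients, which is the kind of frequency domain cue the subsequent convolutional layer can pick up.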

[0042] S1005. Input the processed frequency feature map into the third two-dimensional convolutional layer to obtain the optimized frequency feature map.

[0043] Specifically, the optimized frequency feature map is obtained by applying the third two-dimensional convolutional layer to the processed frequency feature map.

[0044] Here, the third two-dimensional convolutional layer denotes processing the processed frequency feature map to obtain the optimized frequency feature map.

[0045] S1006. Perform difference processing on the optimized grayscale feature map and the constrained feature map, and input the result into the fourth two-dimensional convolutional layer to obtain the anomaly feature map.

[0046] For example, the fourth two-dimensional convolutional layer has a kernel size of 3×3, a stride of 1, and a padding of 1. The anomaly feature map lies in $\mathbb{R}^{D \times \frac{H}{S} \times \frac{W}{S}}$, that is, it is a feature map with D channels, height H/S, and width W/S.

[0047] Specifically, the anomaly feature map is obtained by applying the fourth two-dimensional convolutional layer to the difference between the optimized grayscale feature map and the constrained feature map.

[0048] S1007. Concatenate the anomaly feature map and the optimized frequency feature map along a dimension to obtain the target frequency domain feature map F.

[0049] Wherein, the target frequency domain feature map $F \in \mathbb{R}^{2D \times \frac{H}{S} \times \frac{W}{S}}$, that is, F is a feature map with 2D channels (2 × D), height H/S, and width W/S.

[0050] Here, the concatenation is performed along the row direction, that is, merging is performed along the dimension of sample quantity: the anomaly feature map and the optimized frequency feature map are joined along that dimension to obtain the target frequency domain feature map F.

[0051] In a preferred embodiment, in step S200, the image fusion module 2 specifically performs the following: S2001. Input the image X to be recognized into the fifth two-dimensional convolutional layer to obtain the original feature map A.

[0052] Preferably, $A \in \mathbb{R}^{D \times \frac{H}{S} \times \frac{W}{S}}$, that is, the original feature map A is a feature map with D channels, height H/S, and width W/S. For example, the fifth two-dimensional convolutional layer has a kernel size of (S×S) and a stride of S.

[0053] S2002. Merge the original feature map A and the target frequency domain feature map F, then perform dimension permutation and normalization on the merged result in sequence to obtain the initial feature map I.

[0054] Wherein, the initial feature map $I \in \mathbb{R}^{\frac{H}{S} \times \frac{W}{S} \times 3D}$, that is, I is a feature map with height H/S, width W/S, and 3D channels (3 × D).

[0055] Here, the formula applies a normalization layer to the merged result, and the concatenation is taken along a specified vector dimension.

[0056] The notation { } indicates that the original feature map A and the target frequency domain feature map F are concatenated along the row direction to obtain the merged result. The normalization layer then normalizes the merged result of A and F to obtain the initial feature map I.
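The merge, permutation, and normalization in step S2002 can be sketched on nested lists. Several details are assumptions for illustration only: feature maps are stored as [channels][height][width], the permutation moves channels last, and the normalization layer is modeled as a layer normalization over the channel vector of each position without learnable parameters. The function names are hypothetical.

```python
import math


def concat_channels(a, f):
    """Join two [C][H][W] feature maps along the channel axis."""
    return a + f


def channels_last(x):
    """Permute [C][H][W] into [H][W][C]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[x[k][i][j] for k in range(c)] for j in range(w)] for i in range(h)]


def layer_norm(vec, eps=1e-5):
    """Normalize one channel vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def fuse(a, f):
    """Merge A and F, move channels last, then normalize each position."""
    merged = channels_last(concat_channels(a, f))
    return [[layer_norm(pixel) for pixel in row] for row in merged]
```

If A has D channels and F has 2 × D channels, the result has 3 × D channels per position, which matches the embedding dimension expected by the visual encoder.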

[0057] In step S300, for example, the pre-trained multimodal model can be a CLIP (Contrastive Language-Image Pre-training) model, which includes a visual encoding module 3 and a text encoding module 4. As shown in Figure 2, step S300 includes: the initial feature map I is input into the visual encoding module 3 for image feature extraction to obtain the optimized feature map Q and the optimized feature vector M; the pre-stored prompt vector P is input into the text encoding module 4 for text feature vector extraction to obtain the optimized text vector N.

[0058] In a preferred embodiment, Figure 4 shows a schematic diagram of the structure of the visual encoding module provided in an embodiment of this application. As shown in Figure 4, the visual encoding module 3 provided in this application includes a feature transformation structure layer 31 (e.g., a transformer layer) and an attention pooling layer 32. In one specific embodiment, the visual encoding module 3 sequentially inputs the initial feature map I into the feature transformation structure layer 31 and the attention pooling layer 32 to obtain the optimized feature map $Q \in \mathbb{R}^{\frac{H}{S} \times \frac{W}{S} \times 3D}$ and the optimized feature vector $M \in \mathbb{R}^{1 \times J}$, where J represents the output dimension of the visual encoder, M is a vector with 1 row and J columns, and Q is a feature map with height H/S, width W/S, and 3D channels (3 × D). In another preferred embodiment, the optimized text vector $N \in \mathbb{R}^{2 \times J}$ is a two-dimensional array in which the number of detection result types is 2 and the output dimension is J.

[0059] Returning to Figure 1, in step S400, the detection module 5 performs the following: the cosine similarity between the optimized feature vector M and the optimized text vector N is calculated, and the similarity is then normalized with a normalization function (Softmax). The normalized result is defined as the detection result Y, where Y is an array with a quantity of 1 and a detection type dimension of 2.

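[0060] In one specific embodiment:

$$Y = \mathrm{Softmax}\left(\frac{M \cdot N^{T}}{\lVert M \rVert \, \lVert N \rVert}\right)$$

[0061] where $\cdot$ denotes the dot product, $\lVert M \rVert$ denotes the norm of M, $\lVert N \rVert$ denotes the norm of N, and $T$ denotes the transpose.

[0062] Softmax denotes the normalization operation.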
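The detection step (cosine similarity followed by Softmax) can be illustrated with a minimal sketch in plain Python. The function names, the toy vectors, and the choice J = 3 are assumptions made for illustration; they do not come from this application. No temperature or scaling factor is applied, because the text does not describe one.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def detect(m, n_rows):
    """Return Y: one normalized score per detection type (one per row of N)."""
    sims = [cosine_similarity(m, row) for row in n_rows]
    return softmax(sims)

# Hypothetical toy values with J = 3; real vectors would come from the encoders.
M = [0.2, 0.9, 0.1]
N = [
    [0.1, 1.0, 0.0],  # stand-in text embedding for "authentic image"
    [1.0, 0.0, 0.3],  # stand-in text embedding for "doctored image"
]
Y = detect(M, N)
```

In this sketch Y has two entries that sum to 1, and the larger entry indicates the detection type whose text embedding is closer in direction to the image feature vector.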

[0063] Returning to Figure 1 and Figure 2, in step S500, the optimized text vector N and the optimized feature map Q are input into the localization module 6 to obtain the localization result Z. Please refer to Figure 5, which shows a schematic diagram of a localization module provided in an embodiment of this application. As shown in Figure 5, the localization module 6 includes a linear projection layer 61, a fusion layer 62, a visual self-attention model 63, and a multilayer perceptron (MLP).

[0064] Preferably, the localization module 6 performs the following: the optimized text vector N is input into the linear projection layer 61 to obtain the attention lexical units (attention tokens), which form an array with 20 prediction boxes and 3D channels. The linear projection layer 61 performs the following: the optimized text vector N is reshaped; the reshaped vector is input into a linear layer whose output dimension is 3D; and the output of the linear layer is then expanded by vector augmentation to obtain the attention lexical units.

[0065] The attention lexical units and the optimized feature map Q are input into the fusion layer 62 and merged to obtain the localization input feature B. The localization input feature B is input into the pre-trained visual self-attention model 63 to obtain the localization result feature E output by the backbone network of the visual self-attention model 63. The localization result vector C is separated from the localization result feature E. The localization result vector C is input into the multilayer perceptron (MLP) to obtain the localization result Z of the tampered location in the image to be identified, where Z is an array of 20 predicted bounding boxes, each with a 5-dimensional localization result vector.

[0066] The multilayer perceptron (MLP) is a neural network with one hidden layer. The feature dimension of the hidden layer is 1/4 of the feature dimension of the input layer (3D), and ReLU is used as the activation function.
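As a rough illustration of the MLP shape described here, the sketch below builds a one-hidden-layer perceptron whose hidden width is a quarter of the input width and applies it to 20 localization result vectors. The concrete input width (24, standing in for 3D), the random weights, and all function names are assumptions for illustration only; the text does not state how the five outputs are scaled, so no output activation is applied.

```python
import random

def make_linear(rng, in_dim, out_dim):
    """Random weights and zero biases; stand-ins for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def linear(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def mlp_forward(x, hidden_layer, output_layer):
    """One hidden layer with ReLU, followed by a linear output layer."""
    return linear(relu(linear(x, hidden_layer)), output_layer)

rng = random.Random(0)
IN_DIM = 24               # hypothetical value standing in for 3D
HIDDEN_DIM = IN_DIM // 4  # hidden width is 1/4 of the input width
OUT_DIM = 5               # one 5-dimensional vector per prediction box
NUM_BOXES = 20

hidden_layer = make_linear(rng, IN_DIM, HIDDEN_DIM)
output_layer = make_linear(rng, HIDDEN_DIM, OUT_DIM)

# C: 20 localization result vectors, one per prediction box (toy values).
C = [[rng.uniform(-1.0, 1.0) for _ in range(IN_DIM)] for _ in range(NUM_BOXES)]
Z = [mlp_forward(c, hidden_layer, output_layer) for c in C]
```

The result Z has 20 rows of 5 values each, matching the array shape given for the localization result.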

[0067] In this application, the localization result vector corresponding to the localization result Z represents a predicted bounding box as a five-element vector consisting, in order, of the tampering probability p, the normalized abscissa of the center point, the normalized ordinate of the center point, the width w, and the height h of the prediction box. When the tampering probability p of a prediction box is greater than or equal to the given localization recognition threshold, all pixels within that prediction box are marked as tampered pixels in the image.
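The thresholding rule can be sketched as follows: each prediction box whose tampering probability reaches the threshold is converted from normalized center/size form to pixel bounds, and every pixel inside it is marked. The names `boxes_to_mask`, `cx`, and `cy`, the threshold value 0.5, the rounding scheme, and the assumption that width and height are also normalized are illustrative choices, not details taken from this application.

```python
def boxes_to_mask(boxes, height, width, threshold=0.5):
    """Mark every pixel inside a box whose tampering probability >= threshold.

    Each box is [p, cx, cy, w, h], with cx, cy, w, h normalized to [0, 1].
    Returns a height x width grid of 0/1 flags (1 = tampered pixel).
    """
    mask = [[0] * width for _ in range(height)]
    for p, cx, cy, w, h in boxes:
        if p < threshold:
            continue
        # Convert the normalized center/size box to clamped pixel bounds.
        x0 = max(0, int(round((cx - w / 2) * width)))
        x1 = min(width, int(round((cx + w / 2) * width)))
        y0 = max(0, int(round((cy - h / 2) * height)))
        y1 = min(height, int(round((cy + h / 2) * height)))
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask

boxes = [
    [0.9, 0.5, 0.5, 0.5, 0.5],  # kept: p >= threshold
    [0.2, 0.1, 0.1, 0.2, 0.2],  # dropped: p < threshold
]
mask = boxes_to_mask(boxes, height=8, width=8, threshold=0.5)
```

With these toy boxes, only the central 4 x 4 block of the 8 x 8 grid is marked, because the second box falls below the threshold.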

[0068] The image tampering detection method based on multimodal feature fusion provided in this application includes a training phase and a prediction phase. Steps S100 to S500 above are the prediction phase of the image tampering detection network, which will not be elaborated here. The training phase of the image tampering detection method includes a detection module training phase and a localization module training phase.

[0069] In a preferred embodiment, the detection module training phase includes: initializing the noise extraction module 1, the image fusion module 2, and the detection module 5 respectively, loading the pre-trained multimodal model (including the visual encoding module 3 and the text encoding module 4), and obtaining the preset classification result. The preset classification result describes the detection classification array output by the detection module 5. For example, the detection classification array is [authentic image, doctored image], where authentic image means that the image has not been tampered with, and doctored image means that the image has been tampered with.

[0070] The preset classification result is input into the tokenizer (word segmenter) for byte-pair encoding (BPE) to obtain the initial trainable prompt vector P.

[0071] Multiple image samples are acquired, each carrying its corresponding actual tampering result. The image samples are input into a network formed by the noise extraction module 1, image fusion module 2, visual encoding module 3, text encoding module 4, and detection module 5 to obtain the tampering detection result for each image sample output by the detection module 5. The detection module 5 is trained and optimized based on each image sample and its corresponding actual tampering result.

[0072] Preferably, during training and optimization, the visual encoding module 3 and the text encoding module 4 are frozen, and the binary cross-entropy (BCE) loss can be used to guide the learning of the noise extraction module 1, the image fusion module 2, and the detection module 5.
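For reference, the binary cross-entropy loss mentioned here can be written as a short plain-Python function. The function name, the clamping constant, and the toy predictions are assumptions for illustration; in an actual training setup the loss would be computed by the deep learning framework in use and backpropagated through the unfrozen modules.

```python
import math

def binary_cross_entropy(predictions, labels, eps=1e-7):
    """Mean BCE between predicted tampering probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(predictions)

# Hypothetical batch: predicted probability that each sample is a doctored image.
predicted = [0.9, 0.2, 0.6]
actual = [1, 0, 1]  # 1 = doctored image, 0 = authentic image
loss = binary_cross_entropy(predicted, actual)
```

The loss shrinks as the predicted probabilities approach the actual labels and grows sharply for confident wrong predictions.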

[0073] In another preferred embodiment, the localization module training phase includes: loading and freezing the noise extraction module 1, image fusion module 2, visual encoding module 3, text encoding module 4, and detection module 5 as trained in the detection module training phase, and initializing the localization module 6. The same image samples used in the detection module training phase are input into the image tampering detection network described above, and the predicted tampering localization result for each image sample is obtained from the output of the localization module 6. The predicted tampering localization result includes at least one prediction box and the tampering result corresponding to each prediction box. The actual tampering localization result corresponding to each image sample is obtained. The localization module 6 is trained and optimized based on the actual tampering localization result and the predicted tampering localization result corresponding to each image sample, yielding a trained localization module.

[0074] Specifically, the loss function is calculated in a supervised manner during the training process of the localization module, and its calculation formula is as follows:

[0075] Here, the total loss L of the localization module 6 is composed of two terms: the regression loss of the predicted bounding box, which is calculated using CIoU (Complete-IoU), and the target loss corresponding to the probability of the predicted bounding box being tampered with, which can be calculated using the binary cross-entropy (BCE) loss.

[0076] When L is less than the preset total loss threshold, the training of the localization module 6 is completed.
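A minimal sketch of such a loss is given below, assuming the two terms are simply added without weights; the actual combination used in this application is not recoverable from the text. The CIoU term follows the commonly published definition (IoU, minus a normalized center-distance penalty, minus an aspect-ratio penalty). The function names, the (cx, cy, w, h) box layout, and the threshold value 0.05 are illustrative assumptions.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """1 - CIoU for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target
    # Corner coordinates of both boxes.
    p_x0, p_x1, p_y0, p_y1 = px - pw / 2, px + pw / 2, py - ph / 2, py + ph / 2
    t_x0, t_x1, t_y0, t_y1 = tx - tw / 2, tx + tw / 2, ty - th / 2, ty + th / 2
    # Intersection over union.
    inter_w = max(0.0, min(p_x1, t_x1) - max(p_x0, t_x0))
    inter_h = max(0.0, min(p_y1, t_y1) - max(p_y0, t_y0))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)
    # Squared center distance over squared diagonal of the enclosing box.
    enclose_w = max(p_x1, t_x1) - min(p_x0, t_x0)
    enclose_h = max(p_y1, t_y1) - min(p_y0, t_y0)
    center_term = ((px - tx) ** 2 + (py - ty) ** 2) / (enclose_w ** 2 + enclose_h ** 2 + eps)
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - center_term - alpha * v)

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability and one 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def total_loss(pred_box, target_box, pred_prob, target_label):
    """Assumed unweighted sum of the box regression loss and the target loss."""
    return ciou_loss(pred_box, target_box) + bce(pred_prob, target_label)

LOSS_THRESHOLD = 0.05  # hypothetical preset total loss threshold
box = (0.5, 0.5, 0.2, 0.2)
L = total_loss(box, box, pred_prob=0.99, target_label=1)
training_done = L < LOSS_THRESHOLD
```

In this toy case the predicted box coincides with the actual box and the tampering probability is close to 1, so L falls below the assumed threshold and training would stop.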

[0077] Based on the same application concept, please refer to Figure 6, which shows a schematic diagram of the structure of an image anti-tampering device based on multimodal feature fusion provided in an embodiment of this application. As shown in Figure 6, the device is applied to an image anti-tampering network and includes: the noise extraction unit 600, which is used to input the image to be identified into the noise extraction module to extract noise features and obtain a target frequency domain feature map that indicates the noise pattern of the image to be identified itself; the fusion unit 610, which is used to input the image to be identified and the target frequency domain feature map into the image fusion module for feature fusion to obtain an initial feature map; the optimization unit 620, which is used to input the initial feature map and the pre-stored detection result prompt vector into the pre-trained multimodal model to obtain the optimized feature map, optimized feature vector, and optimized text vector; the tamper detection unit 630, which is used to input the optimized feature vector and the optimized text vector into the detection module to perform tamper detection and obtain the output tamper detection result, which describes whether the image to be identified has been tampered with; and the localization unit 640, which is used to input the optimized text vector and optimized feature map into the localization module for tamper localization and obtain the localization result of the tampered position of the image to be identified.

[0078] Based on the same application concept, please refer to Figure 7, which shows a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 7, the electronic device 70 includes a processor 701, a memory 702, and a bus 703. The memory 702 stores machine-readable instructions that can be executed by the processor 701. When the electronic device 70 is running, the processor 701 and the memory 702 communicate through the bus 703. The machine-readable instructions are executed by the processor 701 to perform the steps of the image anti-tampering method based on multimodal feature fusion provided in any of the above embodiments.

[0079] Based on the same concept, this application also provides a computer-readable storage medium storing a computer program, which, when run by a processor, executes the steps of the image anti-tampering method based on multimodal feature fusion provided in the above embodiments.

[0080] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems and devices described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and will not be repeated here. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division; in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection displayed or discussed may be implemented through some communication interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.

[0081] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0082] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0083] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0084] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for preventing image tampering based on multimodal feature fusion, characterized in that, The method is applied to an image anti-tampering network, and the method includes: The image to be identified is input into the noise extraction module for noise feature extraction, resulting in a target frequency domain feature map that indicates the noise pattern of the image itself; The image to be identified and the target frequency domain feature map are input into the image fusion module for feature fusion to obtain an initial feature map; The initial feature map and the pre-stored detection result prompt vector are input into the pre-trained multimodal model to obtain the optimized feature map, optimized feature vector, and optimized text vector; The optimized feature vector and optimized text vector are input into the detection module for tamper detection, and the output tamper detection result is obtained, wherein the tamper detection result describes whether the image to be identified has been tampered with; The optimized text vector and optimized feature map are input into the localization module for tampering localization, and the localization result of the tampered position of the image to be identified is obtained.

2. The method according to claim 1, characterized in that, The noise extraction module performs the following: The image to be identified is converted into a grayscale feature map; The grayscale feature map is input into a first two-dimensional convolutional layer with preset operating parameters to obtain an optimized grayscale feature map; The grayscale feature map is input into a second two-dimensional convolutional layer with preset operating parameters to obtain a constrained feature map; The optimized grayscale feature map is input into the discrete cosine transform layer to obtain the processed frequency feature map; The processed frequency feature map is input into the third two-dimensional convolutional layer to obtain the optimized frequency feature map; The optimized grayscale feature map and the constrained feature map are subjected to difference processing, and the difference processing result is input into the fourth two-dimensional convolutional layer to obtain the abnormal feature map; The abnormal feature map and the optimized frequency feature map are combined at the dimensional level to obtain the target frequency domain feature map.

3. The method according to claim 1, characterized in that, The image fusion module performs the following: The image to be identified is input into the fifth two-dimensional convolutional layer with preset operating parameters to obtain the original feature map; The original feature map and the target frequency domain feature map are joined at the dimension level, and the merged result is sequentially subjected to dimension permutation and standardization to obtain the initial feature map.

4. The method according to claim 1, characterized in that, The multimodal model includes a visual encoding module and a text encoding module. The optimized feature map, optimized feature vector, and optimized text vector are obtained through the following methods: The initial feature map is input into the visual encoding module for image feature extraction to obtain the optimized feature map and optimized feature vector. The pre-stored prompt vector is input into the text encoding module to extract text feature vectors, resulting in optimized text vectors.

5. The method according to claim 1, characterized in that, The visual encoding module includes a feature transformation structure layer and an attention pooling layer. The optimized feature map and the optimized feature vector are obtained in the following ways: The initial feature map is sequentially input into the feature transformation structure layer and the attention pooling layer to obtain the optimized feature map and the optimized feature vector.

6. The method according to claim 1, characterized in that, The detection module performs the following: Calculate the cosine similarity between the optimized feature vector and the optimized text vector; The cosine similarity is processed with a normalization function, and the result is determined as the detection result.

7. The method according to claim 1, characterized in that, The localization module performs the following: The optimized text vector is input into a linear projection layer to obtain attention lexical units; The attention lexical units and the optimized feature map are combined to obtain the localization input features; The localization input features are input into a pre-trained visual self-attention model to obtain the localization result features output by the backbone network corresponding to the visual self-attention model; The localization result vector is extracted from the localization result features; The localization result vector is input into a multilayer perceptron to obtain at least one prediction box and the tampering probability corresponding to the prediction box; The localization result is formed by a target prediction box whose tampering probability is greater than or equal to a given localization recognition threshold, and all pixels within the target prediction box are marked as image tampering pixels.

8. The method according to claim 1, characterized in that, The method further includes: Before performing image tampering detection, the detection module is trained, including: Initialize the noise extraction module, image fusion module, and detection module respectively, and load the pre-trained multimodal model; The preset classification result prompt word group is input into the tokenizer (word segmenter) for byte-pair encoding to obtain the prompt vector; Multiple image samples are input into the image tampering detection network to obtain the tampering detection result corresponding to each image sample output by the detection module; The detection module is trained and optimized based on each image sample and its corresponding actual tampering result, resulting in a trained detection module.

9. The method according to claim 8, characterized in that, After training the detection module is complete, the method further includes: Train the localization module: Initialize the localization module, load and freeze the noise extraction module, image fusion module, detection module, and pre-trained multimodal model; The multiple image samples are input into the image tampering detection network to obtain the predicted tampering location result for each image sample output by the localization module; Obtain the actual tamper location result corresponding to each image sample; The localization module is trained and optimized by combining the actual tampering localization result and the predicted tampering localization result corresponding to each image sample, resulting in a well-trained localization module.

10. An image anti-tampering device based on multimodal feature fusion, characterized in that, The device is applied to an image anti-tampering network, wherein the device includes: The noise extraction unit is used to input the image to be identified into the noise extraction module to extract noise features and obtain a target frequency domain feature map that indicates the noise pattern of the image to be identified itself; The fusion unit is used to input the image to be identified and the target frequency domain feature map into the image fusion module for feature fusion to obtain an initial feature map; The optimization unit is used to input the initial feature map and the pre-stored detection result prompt vector into the pre-trained multimodal model to obtain the optimized feature map, optimized feature vector, and optimized text vector; The tamper detection unit is used to input the optimized feature vector and optimized text vector into the detection module to perform tamper detection and obtain the output tamper detection result, which describes whether the image to be identified has been tampered with; The localization unit is used to input the optimized text vector and optimized feature map into the localization module to locate the tampered position in the image to be identified.