Image robust source forensics method based on information interaction

By constructing the Siameformer model and combining CNN and ViT, and utilizing the multi-head self-attention mechanism and the Siamese network weight sharing mechanism, the robustness problem of image source forensics algorithms under complex image tasks and large-scale datasets is solved. This enables effective processing of interfering images and improves the accuracy and robustness of forensics.

CN117671443B · Active · Publication Date: 2026-06-23 · DALIAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DALIAN UNIV OF TECH
Filing Date
2023-12-01
Publication Date
2026-06-23

Abstract

The application discloses an image robust source forensics method based on information interaction, relates to the technical field of artificial intelligence, mainly takes a Vision Transformer based on a multi-head self-attention mechanism as a basic framework, combines a weight value sharing extended twin branch to train a deep learning algorithm model, utilizes the global modeling capability of the Vision Transformer, combines a self-attention mechanism to simulate convolution operation, extracts local feature information through image content feature correlation, realizes global feature information guiding local feature information extraction, and greatly improves model robustness. Then, the weight value sharing mechanism is used to expand the twin branch, and a double-branch information sharing is used to realize a residual-like training strategy. Finally, the obtained image source forensics model can better cope with the image robust source forensics task under the interference and distortion images transmitted through a social network.
Description

TECHNICAL FIELD

[0001] The present application relates to the technical field of artificial intelligence, in particular to an image robust source forensics method based on information interaction.

BACKGROUND

[0002] In the traditional original-scene setting, image source forensics algorithms (source camera identification, SCI) identify the source mainly by extracting the fingerprint features (such as color histograms, texture features, and shape descriptors) left in the image during the imaging process. These fingerprint features must be extracted manually through feature engineering, after which a classifier (such as a support vector machine or random forest) is trained for classification. The key to this approach lies in designing effective features and selecting an appropriate classifier. Traditional machine learning methods are usually suited to small-scale data sets and simple feature representations, and they place high demands on feature engineering. They perform well when the data volume is modest and the features are distinct, and they provide highly interpretable results. However, for complex image tasks and large-scale data sets, their performance is limited. The current mainstream machine learning method based on manually extracted features is:

[0003] The image source forensics scheme based on fingerprint features such as photo response non-uniformity (PRNU) noise. PRNU is a camera fingerprint produced by hardware defects that cannot be eliminated; it is the most commonly used fingerprint and plays a decisive role in the field of SCI. Many algorithms exploit the indelibility and uniqueness of PRNU to distinguish images from different sources. The main approach is to construct a residual map for training.

[0004] The traditional SCI method based on manual feature extraction has limitations and struggles to extract effective features. The development of deep learning provides a new way to extract image features automatically with neural networks. Deep learning SCI methods use deep neural networks to learn image feature representations automatically. Taking the convolutional neural network (CNN) as an example, it learns the features in the image through multiple convolution and pooling operations and classifies through a fully connected layer. Deep learning methods do not require features to be designed and selected manually; through end-to-end learning on large-scale data sets, they learn richer and higher-level feature representations that improve the accuracy of source forensics. The current mainstream deep learning method based on automatically extracted features is:

[0005] The image source forensics scheme based on different network framework optimization uses neural network models (AlexNet, VGGNet, etc.) to automatically extract image features, and converts the input image into a specific vector representation. The vector is jointly trained with the camera information label, and the trained model is evaluated, and the model structure and parameters are optimized.

[0006] Robustness is an important indicator of how well an SCI method resists image interference. Existing robust SCI methods are built on traditional machine learning or deep learning frameworks, so they inherit the limitations of those two directions. The ideas behind robust SCI algorithms in the two directions are as follows:

[0007] 1. Robust SCI algorithm based on traditional machine learning: based on the difference characteristics between the interference image and the original image, an optimization algorithm for extracting image fingerprint features is designed to improve the robustness; or in the case of unchanged feature extraction algorithm, the missing information part of the interference image is recovered.

[0008] 2. Robust SCI algorithm based on deep learning: mainly through data augmentation, multi-scale features and their fusion, reinforcement of intra-domain consistency, multi-modal fusion, and adversarial attack and defense.

[0009] Existing SCI methods mainly rely on the assumption that the test set and the training set are sufficient and undisturbed. The actual situation is not so simple. Mainstream social media platforms now manipulate images during transmission, which interferes with the fingerprint information used for forensics. Although the original images may be available for training the classifier, the images used for testing / forensics cannot be guaranteed to be original. Currently, there are three common interference modes: cropping, noise, and compression. Cropping may delete key features or important regions (such as edge texture features containing important image details and distinguishable information), and it may introduce distortions such as jagged edges, color distortion, or stretching. Noise causes distortion or loss of detail, thereby affecting image features. During transmission, compression is often used to reduce communication costs; however, the price of a small file size is the loss of details and information contained in the image, which directly affects the performance of the SCI algorithm. Most images obtained from the Internet may have been subjected to a mixture of the above modes.

[0010] In summary, current SCI algorithms mainly identify the source based on the original scene, and lack the ability to cope with image interference. The mainstream robust SCI algorithm is a specially designed algorithm or model for a certain interference mode, and lacks adaptive ability.

SUMMARY

[0011] In order to solve the problem of source forensics of interference images, the present application provides an image robust source forensics method based on information interaction, proposes a learning framework for information interaction of interference images by applying a multi-head self-attention mechanism, combines the characteristics of convolutional neural network (CNN) and ViT, applies a residual-like training strategy based on the idea of information interaction, optimizes the fusion of multi-dimensional feature information, and improves the accuracy and robustness of image source forensics.

[0012] To this end, the present application provides the following technical solutions:

[0013] The application discloses an image robust source forensics method based on information interaction, comprising:

[0014] Obtaining a plurality of disturbed images;

[0015] Constructing a double-branch information interaction model; the model comprises a first high-resolution spatiotemporal feature learning model branch and a second high-resolution spatiotemporal feature learning model branch; the first high-resolution spatiotemporal feature learning model branch is connected with a denoising filter; the two branches perform information interaction based on a weight sharing mechanism of a twin network to obtain a double-branch integrated photo response non-uniformity noise feature PRNU which is input into a classifier for learning classification; each high-resolution spatiotemporal feature learning model branch takes a Vision Transformer based on a multi-head self-attention mechanism as a basic framework;

[0016] Training the model by using the disturbed images;

[0017] Performing image robust source forensics by using the trained model.

[0018] Further, the two branches perform information interaction based on a weight sharing mechanism of a twin network to obtain a double-branch integrated photo response non-uniformity noise feature PRNU, comprising:

[0019] The similarity between the feature vectors of the two branches is compared by using a contrastive loss function, the shared weight values are adjusted by weight sharing, and the double-branch integrated PRNU is obtained.

[0020] Further, the shared weight parameters of the branch network are updated by gradient backpropagation.

[0021] Further, the contrastive loss function is:

[0022]

[0023] Wherein, $X_1$ and $X_2$ respectively represent the image features obtained after the images pass through each branch of the network, $L$ represents the model loss function, $L_{Ct}$ represents the contrastive loss function, $L_{Cr}$ represents the cross-entropy loss function, $Y$ represents the similarity label, $N$ represents the number of sample pairs, and $D_w$ represents the Euclidean distance between the image feature pairs.

[0024] Further, each high-resolution spatiotemporal feature learning model branch comprises a local feature attention module and a global feature attention module; the local feature attention module and the global feature attention module have the same structure and comprise dynamic position encoding, a multi-head attention block and a feedforward neural network.

[0025] Furthermore, the dynamic position encoding includes: the tensor $X_{in}$ is expanded by zero-padding, and a two-dimensional depthwise convolution is then performed with a position encoding matrix $K$ of kernel size $k \times k$:

[0026]

[0027] Where $\mathrm{DPE}(\cdot)$ represents dynamic position encoding, $R(X_{in})$ represents zero-padding of $X_{in}$, and $K_{k-1-p,\,k-1-q}$ represents the element in the $(k-1-p)$th row and $(k-1-q)$th column of the position encoding matrix.

[0028] Furthermore, the multi-head attention block applies spatial attention to learn content feature relevance.

[0029] Furthermore, in the multi-head attention block, each head is restricted to a very small window area to extract local features across spatial locations.

[0030] Furthermore, channel attention is used in the global feature attention module.

[0031] Furthermore, the local feature attention module and the global feature attention module are repeatedly stacked in each branch of the high-resolution spatiotemporal representation learning model.

[0032] Advantages and positive effects of this invention: This invention combines practical engineering with a residual-like training strategy that enhances PRNU features and introduces weight sharing and shared branch information to design a Siameformer model that integrates and interacts with multiple feature information. This model is used for image source forensics. This invention studies the relationship between image content and its features and uses this relationship to adaptively extract image features using a combination of multi-head attention and convolution. A local / global feature attention module is used to integrate surface and deep features to achieve interaction between local and global information. In addition, a Siamese network weight sharing mechanism is introduced to emphasize the fingerprint features of the image, thereby improving the model's ability to handle different sources and variations. Through these improvements, this invention provides an advanced solution for image source forensics, solving the problem of robust image source forensics in real-world scenarios facing interfering images, and improving accuracy and robustness.

DESCRIPTION OF THE DRAWINGS

[0033] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0034] Figure 1 shows source forensics of images in the original scenario and in real-world scenarios;

[0035] Figure 2 is a framework diagram of the model used by the source forensics method in an embodiment of the invention;

[0036] Figure 3 shows the local / global feature attention module in an embodiment of the invention;

[0037] Figure 4 shows an original image and its distorted versions after transmission through mainstream social media;

[0038] Figure 5 shows the results of applying the proposed Siameformer model to images taken by some mainstream mobile phone models.

DETAILED DESCRIPTION

[0039] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0040] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0041] This invention is mainly used for robust source forensics of interfered digital images. As shown in Figure 1, original images captured by a camera can be interfered with in real-world scenarios, forming interfered images. Original images call for ordinary source forensics, whereas interfered images require robust source forensics. Because digital images are interfered with during network transmission, image content and fingerprint feature information are lost, and traditional model training methods are no longer feasible. To address this, this invention uses the Vision Transformer (ViT) framework with a multi-head self-attention mechanism to combine local features and global information and train a robust image source forensics model. This model can fit the fingerprint information distribution in interfered digital images well.

[0042] As shown in Figure 2 and Figure 3, an image robust source forensics method based on information interaction in an embodiment of the present invention includes:

[0043] S1. Obtain multiple sample images;

[0044] S2. Construct the two-branch information interaction model Siameformer;

[0045] The Siameformer consists of two branches: a first high-resolution spatiotemporal representation learning model (HSR) Uniformer branch and a second HSR Uniformer branch. The first HSR Uniformer branch is connected to a denoising filter. The two branches interact based on the weight-sharing mechanism of Siamese networks to obtain PRNU-like features after dual-branch integration, which are then input into a classifier for classification. Each HSR Uniformer branch includes a local feature attention module and a global feature attention module.

[0046] Suppose there are $M$ samples $J(x, y)$, and the image content of each sample can be modeled as $J(x, y) = I(x, y) + I(x, y) K_d(x, y) + N(x, y)$. To obtain an estimate of the PRNU, the image $J(x, y)$ is denoised to obtain the residual image $R(x, y) = J(x, y) - F_d[J(x, y)] = I(x, y) K_d(x, y) + N(x, y)$. To estimate the multiplicative noise component $K_d$ accurately, the estimation uses multiple images, and the resulting maximum likelihood estimate is the reference PRNU noise. Image source identification can then be achieved simply by comparing the similarity between the test image and the reference PRNU. Because PRNU is indelible and unique, arising from camera hardware defects, this method selects PRNU features to guide the model in extracting and learning relevant features for source identification. At the same time, this method borrows the idea of residual image extraction to achieve PRNU-guided modeling, extending the high-resolution spatiotemporal representation learning model Uniformer into a two-branch structure. Since PRNU is mainly concentrated in the high-frequency part of the image, this method adds a denoising filter to one branch, so that this model branch focuses mainly on the neighborhood of the PRNU feature domain (as shown in Figure 2).
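The residual extraction and multi-image PRNU estimate described in [0046] can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of this invention: the 3x3 mean filter is only a stand-in for the unspecified denoising filter $F_d$, the estimator is the commonly used maximum likelihood form $\hat{K} = \sum_i R_i \hat{I}_i / \sum_i \hat{I}_i^2$ with the denoised image $\hat{I}_i = F_d[J_i]$ taken as the content estimate, and all function names are hypothetical.

```python
import numpy as np

def mean_filter3(img):
    """3x3 mean filter with edge replication; a stand-in for the denoising filter F_d."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def residual(img, denoise=mean_filter3):
    """R(x, y) = J(x, y) - F_d[J(x, y)]."""
    return img - denoise(img)

def estimate_prnu(images, denoise=mean_filter3, eps=1e-8):
    """Pixel-wise estimate K_hat = sum_i(R_i * I_i) / sum_i(I_i ** 2)."""
    num = np.zeros(images[0].shape, dtype=float)
    den = np.zeros(images[0].shape, dtype=float)
    for img in images:
        content = denoise(img)              # estimate of the image content I(x, y)
        num += (img - content) * content    # residual weighted by content
        den += content ** 2
    return num / (den + eps)

def correlation(a, b):
    """Normalised correlation, e.g. between a test residual and the reference PRNU."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

With a reference estimate per camera, a test image could be attributed to the camera whose reference correlates most strongly with the test residual; the Siameformer replaces this hand-built comparison with learned features.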

[0047] PRNU features alone are insufficient as guidance. To address the complexities of interfered images, more robust features are needed to collaboratively complete the source identification task. At the same time, regions of an interfered image with severe information perturbation or loss should be appropriately discarded, and more computational resources should be allocated to regions with well-preserved information. Therefore, this method introduces a multi-head self-attention mechanism, combining CNN and ViT and integrating local and global information.

[0048] Figure 3 illustrates the structural framework of the local / global feature attention module in the improved ViT model branch designed for the SCI task. The local and global feature attention modules are structurally similar, primarily consisting of Dynamic Position Encoding (DPE), a Multi-Head Attention Block (MHA), and a Feedforward Neural Network (FNN). To help the network model learn the spatial features of the input, making the features of different regions more prominent and accurate and thereby improving the effectiveness of the attention mechanism, convolutional layers of different sizes are added at the module's entry point to extract features at different scales, helping the module identify and use more features. This allows the model to better distinguish different image information and improves its robustness.

[0049] The input feature $X$ passes through the convolutional layer to give the mapping $X_{in}$. Dynamic position encoding is then performed on it; that is, the tensor $X_{in}$ is expanded by zero-padding, and a two-dimensional depthwise convolution is then performed with a position encoding matrix $K$ of kernel size $k \times k$:

[0050]

[0051] Where $\mathrm{DPE}(\cdot)$ represents dynamic position encoding, $R(X_{in})$ represents zero-padding of $X_{in}$, and $K_{k-1-p,\,k-1-q}$ represents the element in the $(k-1-p)$th row and $(k-1-q)$th column of the position encoding matrix.

[0052] First, the dynamic position encoding module based on depthwise convolution is more flexible and applies to input features of any shape. Second, the extra zero padding added by the depthwise convolution lets each token progressively query toward the boundary and thereby obtain its absolute position. This encoding mechanism helps improve the model's performance and generalization ability.
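The dynamic position encoding step can be sketched as a zero-padded depthwise convolution. One reading consistent with the symbol definitions above is $\mathrm{DPE}(X_{in})_{i,j} = \sum_{p=0}^{k-1}\sum_{q=0}^{k-1} R(X_{in})_{i+p,\,j+q}\, K_{k-1-p,\,k-1-q}$, applied per channel. The following is a minimal NumPy sketch under assumptions that are not stated in the text: a channels-first layout `(C, H, W)`, an odd kernel size with padding `k // 2`, one `k x k` kernel per channel, and hypothetical function names.

```python
import numpy as np

def dynamic_position_encoding(x_in, kernel):
    """Zero-pad x_in (C, H, W) and convolve each channel with its own k x k kernel."""
    c, h, w = x_in.shape
    k = kernel.shape[-1]                 # kernel has shape (C, k, k), k assumed odd
    pad = k // 2
    padded = np.pad(x_in, ((0, 0), (pad, pad), (pad, pad)))  # R(X_in): zero padding
    out = np.zeros((c, h, w), dtype=float)
    for p in range(k):
        for q in range(k):
            # The flipped index K[k-1-p, k-1-q] makes this a true convolution.
            weight = kernel[:, k - 1 - p, k - 1 - q][:, None, None]
            out += padded[:, p:p + h, q:q + w] * weight
    return out

def add_position(x_in, kernel):
    """X_S = DPE(X_in) + X_in, the input to the multi-head attention block."""
    return dynamic_position_encoding(x_in, kernel) + x_in
```

Feeding an all-ones input through an all-ones 3x3 kernel gives 4 at a corner and 9 in the interior, which illustrates how the zero padding makes the response depend on the distance to the border, so position information is encoded without a fixed-size embedding table.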

[0053] Next, the obtained position encoding $\mathrm{DPE}(X_{in})$ and the input feature $X_{in}$ are added together to give $X_S = \mathrm{DPE}(X_{in}) + X_{in}$, which is input to the multi-head self-attention module:

[0054] $Y_S = \mathrm{MHA}(\mathrm{Norm}(X_S)) + X_S$;

[0055] $\mathrm{MHA}(X_o) = \mathrm{Concat}(R_1(X_o); R_2(X_o); \ldots; R_N(X_o))\,U$;

[0056] $R_n(X_o) = A_n(X_o) \cdot V_n(X_o)$,

[0057] Among them, $X_S$ represents the output of dynamic position encoding, $Y_S$ represents the output of the multi-head self-attention module, $Z_S$ represents the output of the feedforward neural network, $\mathrm{Norm}(\cdot)$ represents the batch normalization operation, $\mathrm{MHA}(\cdot)$ represents the multi-head self-attention module, $\mathrm{Concat}(\cdot)$ represents the concatenation operation, $U$ represents the $C \times C$ weight parameter matrix, $R_n$ represents the nth attention head, and $A_n(\cdot)$ represents the attention matrix of the nth head.
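The three relations in [0054] to [0056] can be sketched in a few lines of NumPy. This is an illustrative sketch only: the attention matrix $A_n$ is instantiated here as ordinary scaled dot-product softmax attention, `norm` is a per-channel standardisation over tokens standing in for batch normalization, tokens are laid out as rows of an `(L, C)` array, and the weight arrays `wq`, `wk`, `wv` (one projection per head) are hypothetical names.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def norm(x, eps=1e-5):
    """Per-channel standardisation over tokens (stand-in for batch normalization)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mha(x, wq, wk, wv, u):
    """MHA(X) = Concat(R_1(X); ...; R_N(X)) U with R_n(X) = A_n(X) V_n(X)."""
    heads = []
    for wq_n, wk_n, wv_n in zip(wq, wk, wv):     # iterate over the N heads
        q, k, v = x @ wq_n, x @ wk_n, x @ wv_n   # each (L, D_h)
        a_n = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (L, L) attention matrix
        heads.append(a_n @ v)                    # R_n(X)
    return np.concatenate(heads, axis=-1) @ u    # fuse heads with the C x C matrix U

def attention_block(x_s, wq, wk, wv, u):
    """Y_S = MHA(Norm(X_S)) + X_S (residual connection around the attention)."""
    return mha(norm(x_s), wq, wk, wv, u) + x_s
```

The residual term means that a block whose fusion matrix `u` is zero passes its input through unchanged, which is one reason such blocks can be stacked deeply without losing the original features.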

[0058] Multi-head self-attention mechanisms are an important tool in the field of deep learning. They can interact with information in different spatial locations, thereby better processing long sequences and complex information.

[0059] Attention modules targeting local and global features have different goals and functions, so this method modifies $R_n$ differently in each module to make it attend to different regions. First, the local feature attention module is improved primarily on the basis of the correlation between image content and features. In the local feature attention module, MHA applies spatial attention to learn this content feature correlation. The attention operation $A_n$ can be represented as:

[0060]

[0061] where the two projection terms calculate the query vector and key vector of $X$, respectively; $Q_n(\cdot)$ represents the query vector of the nth head, $K_n(\cdot)$ represents the key vector of the nth head, and $V_n(\cdot)$ represents the value vector of the nth head. The three weight matrices represent the query matrix, the key matrix, and the value matrix of the nth head, respectively; $D_h$ represents the dimension size of the nth head, and $C_g$ indicates the size of the channel feature dimension.

[0062] To highlight the relevance of content features rather than the relationships between content elements, and to reduce redundant computation, this method restricts each head to a very small window region to extract local features across spatial locations. This is similar to the convolution operation in CNNs. At this point, because the value transformation is a linear transformation, it can be viewed as a pointwise convolution, and the attention operation $A_n$ can be simplified to:

[0063]

[0064] where the simplified term can be represented as a learnable parameter matrix, and the index $ij$ denotes the relative position of tokens $i$ and $j$. Because this term can be instantiated as the weight parameter matrix applied to $V_n(X)$ in each head, the head operation can be viewed as a depthwise convolution. Finally, concatenating and fusing all attention heads is equivalent to a pointwise convolution. MobileNet has a similar pointwise-depthwise-pointwise convolution pattern. Therefore, the local feature attention module can be interpreted to some extent as a MobileNet block. Ultimately, this method simulates a CNN and implements the convolution operation through multi-head spatial self-attention. This allows the model to extract features more effectively for SCI tasks by relying on content feature relevance in the spatial dimension.
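The equivalence between window-restricted attention and convolution can be checked numerically. The sketch below is an illustration under assumptions, not the formula of this invention: it uses a one-dimensional token sequence, takes the attention weight between tokens $i$ and $j$ to be a learnable value `a[i - j]` that depends only on their relative position inside a small window, and shares the window weights across the channels of one head (a special case of a depthwise convolution). Function names are hypothetical.

```python
import numpy as np

def local_affinity(length, a):
    """Build A with A[i, j] = a[i - j] inside the window and 0 outside it."""
    r = len(a) // 2                      # window radius; offsets run from -r to r
    A = np.zeros((length, length))
    for i in range(length):
        for j in range(max(0, i - r), min(length, i + r + 1)):
            A[i, j] = a[(i - j) + r]
    return A

def window_conv1d(v, a):
    """Apply the window weights to every channel of v (L, D) with zero padding."""
    r = len(a) // 2
    length = len(v)
    padded = np.pad(v, ((r, r), (0, 0)))
    out = np.zeros(v.shape, dtype=float)
    for offset in range(-r, r + 1):      # offset = i - j
        # Token i receives a[offset] * v[i - offset].
        out += a[offset + r] * padded[r - offset:r - offset + length]
    return out

rng = np.random.default_rng(0)
v = rng.standard_normal((6, 3))          # V_n(X): 6 tokens, 3 channels
a = rng.standard_normal(3)               # learnable weights for offsets -1, 0, 1
attention_out = local_affinity(6, a) @ v # R_n(X) = A_n V_n(X)
conv_out = window_conv1d(v, a)
print(np.allclose(attention_out, conv_out))
```

Because the affinity matrix is banded, the convolution form computes the same result while touching only the tokens inside each window, which is the redundancy reduction that [0062] refers to.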

[0065] After effectively extracting local features, the next challenge is to discover the remaining feature information in the interfering image. To address this, ViT's greatest advantage—global information modeling—becomes essential. Therefore, this method uses channel attention in the global feature attention module. For channel attention, each channel token contains global features. On one hand, the attention head naturally considers the left-right spatial position within each channel while calculating the correlation between different channels, thus capturing global information. On the other hand, the attention head can calculate the importance of channel information, strengthening important feature channels and weakening irrelevant ones. This better preserves effective information in the feature map, removes noise and redundant information, and more effectively captures global information. In summary, channel attention can better capture global information, improving model accuracy and generalization ability. Channel attention collects information along the channel dimension for learning, so attention learning is performed after transposing the patch. Each transposed patch can then be considered an abstraction of global information. Therefore, the formula for the global feature-based attention module can be derived as follows:

[0066]
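As an illustration of the channel attention idea in [0065], the sketch below transposes the token matrix so that attention is computed between channels rather than between spatial positions. It is one common formulation of channel attention and is offered only as an assumption; the scaling by the square root of the token count, the single-head layout, and the names `channel_attention`, `wq`, `wk`, `wv` are not taken from the text.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, wq, wk, wv):
    """Attention over channels for tokens x of shape (L, C).

    Returns the attended features (L, C) and the (C, C) channel attention matrix.
    """
    q, k, v = x @ wq, x @ wk, x @ wv             # each (L, C)
    qt, kt, vt = q.T, k.T, v.T                   # transpose: C channel tokens of length L
    a = softmax(qt @ kt.T / np.sqrt(x.shape[0])) # (C, C): correlation between channels
    return (a @ vt).T, a                         # reweight channels, back to (L, C)
```

Each entry of the `(C, C)` matrix sums over all spatial tokens, so every channel token summarises the whole image; the matrix size depends on the channel count rather than on the number of tokens.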

[0067] The final step of both local and global feature attention modules involves a feedforward neural network to further mine and process features, thereby improving the model's expressive power and predictive performance.

[0068] $Z_S = \mathrm{FNN}(\mathrm{Norm}(Y_S)) + Y_S$;

[0069] To achieve both local feature extraction and global information guidance, this method combines CNN and ViT by repeatedly stacking local and global feature attention modules. This ultimately enables the interaction between local and global information, which is then used to guide model learning and improve robustness.
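The module structure (position encoding, attention, feedforward network, each with a residual connection) and the repeated stacking can be sketched generically. In this minimal sketch the three sub-operations are passed in as callables, so the same wrapper serves for a local or a global module; `norm` is a simple per-channel standardisation standing in for batch normalization, the ReLU in `make_ffn` is an assumed activation, and all names are hypothetical.

```python
import numpy as np

def norm(x, eps=1e-5):
    """Per-channel standardisation over tokens (stand-in for batch normalization)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_ffn(w1, w2):
    """Two-layer feedforward network; the ReLU activation is an assumption."""
    return lambda t: np.maximum(t @ w1, 0.0) @ w2

def attention_module(x, dpe, attn, ffn):
    """One local or global module: DPE, attention, and FNN with residuals."""
    x_s = dpe(x) + x                 # X_S = DPE(X_in) + X_in
    y_s = attn(norm(x_s)) + x_s      # Y_S = MHA(Norm(X_S)) + X_S
    return ffn(norm(y_s)) + y_s      # Z_S = FNN(Norm(Y_S)) + Y_S

def stack(x, modules):
    """Apply a sequence of modules, e.g. alternating local and global ones."""
    for module in modules:
        x = module(x)
    return x
```

A branch would be built by binding concrete operations to `attention_module` (a window-restricted attention for local modules, a channel attention for global ones) and passing the resulting list to `stack`.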

[0070] S3. Train the Siameformer using sample images;

[0071] To achieve information exchange between the upper and lower branches of the Siameformer model, this method introduces a weight-sharing mechanism from Siamese networks. This mechanism compares the similarity between the feature vectors of the two branches, and by sharing and adjusting the weights, makes the model more focused on the camera's unique fingerprint—PRNU. This strategy ultimately results in a PRNU-like feature input to a classifier for classification after the two branches are integrated, thus realizing the residual map training strategy of PRNU from another perspective.

[0072] The similarity comparison in the weight-sharing mechanism is mainly achieved through the contrastive loss function, and the shared weight parameters of the branch network are updated through gradient backpropagation. Let the input image pair be $J_1$ and $J_2 = F_d(J_1)$. The pair is fed to the upper and lower network branches to obtain the image feature pair $X_1$ and $X_2$, and the contrastive loss function can be defined as:

[0073]

[0074] $D_w$ is obtained by calculating the Euclidean distance between the image feature pair. The detailed calculation is as follows:

[0075]

[0076] The image feature similarity label $Y$ indicates whether $X_1$ and $X_2$ come from images taken by the same camera: $Y = 1$ indicates that the image pair was taken by the same camera; otherwise, $Y = 0$. $m$ is a predefined threshold. At this point, the loss function $L_{Ct}$ can be simplified to:

[0077]

[0078] As the Euclidean distance between feature vector pairs gradually approaches zero, the contrastive loss function $L_{Ct}$ gradually decreases. When the Euclidean distance reaches its minimum, similar features are closest, the contrastive loss function is minimized, and the model no longer needs optimization. At this point, the model combines local features with global information under PRNU guidance, ultimately obtaining multi-scale fused PRNU-like features for classification. The classification uses the cross-entropy loss function, which penalizes the model for misclassifying samples. Based on this, the final loss function of this method is:

[0079] $L = L_{Ct} + L_{Cr}$;

[0080] Among them, $X_1$ and $X_2$ represent the image features obtained after passing through each branch of the network, $L$ represents the model loss function, $L_{Ct}$ represents the contrastive loss function, $L_{Cr}$ represents the cross-entropy loss function, $Y$ represents the similarity label, $N$ represents the number of sample pairs, $D_w$ represents the Euclidean distance between image feature pairs, $P$ represents the dimension of the image features, $K$ represents the number of classification categories, $y_k$ represents the true label of the k-th category, $p_k$ represents the predicted label for the k-th category, and $X$ represents the input to a single branch of the model.
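The combined loss can be sketched numerically. The contrastive term below uses the widely used form $L_{Ct} = \frac{1}{2N}\sum_{n=1}^{N}\left[Y D_w^2 + (1 - Y)\max(m - D_w, 0)^2\right]$, which matches the symbols defined above ($Y$, $N$, $D_w$, $m$) but is an assumption about the exact expression; the cross-entropy term assumes one-hot labels and predicted probabilities. Function names are hypothetical.

```python
import numpy as np

def euclidean_distance(x1, x2):
    """D_w for each pair: L2 distance over the P feature dimensions."""
    return np.sqrt(((x1 - x2) ** 2).sum(axis=1))

def contrastive_loss(x1, x2, y, m=1.0):
    """Assumed form: mean over pairs of Y*D^2 + (1-Y)*max(m-D, 0)^2, halved."""
    d = euclidean_distance(x1, x2)
    same = y * d ** 2                                  # pull same-camera pairs together
    diff = (1 - y) * np.maximum(m - d, 0.0) ** 2       # push other pairs beyond margin m
    return float((same + diff).sum() / (2 * len(y)))

def cross_entropy(probs, labels, eps=1e-12):
    """L_Cr: mean over samples of -sum_k y_k * log(p_k), with one-hot labels."""
    return float(-(labels * np.log(probs + eps)).sum(axis=1).mean())

def total_loss(x1, x2, y, probs, labels, m=1.0):
    """L = L_Ct + L_Cr."""
    return contrastive_loss(x1, x2, y, m) + cross_entropy(probs, labels)
```

When every pair is an image and its own denoised version ($J_2 = F_d(J_1)$), the pair always comes from the same camera, so $Y = 1$ and only the squared-distance term remains under this assumed form; that is consistent with the statement that the loss falls as the Euclidean distance approaches zero.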

[0081] S4. Use the trained Siameformer for image source forensics.

[0082] By inputting image J into the Siameformer model, we can obtain:

[0083] $M_T(J(x, y)) = M_S(J(x, y)) + M_S(F_d[J(x, y)])$;

[0084] Finally, combining the Siameformer two-branch model function, we can derive:

[0085] $M_T(J(x, y)) = M_S(R(x, y))$;

[0086] Where $(x, y)$ represents the spatial coordinates of each pixel in the image, $J(x, y)$ represents the image generated by the camera, $I(x, y)$ represents the image content, $F_d$ represents a noise reduction filter, $M_S$ represents a single branch of the Siameformer dual-branch model, $M_T$ represents the Siameformer dual-branch model, $K_d$ represents the multiplicative noise in the image, $N(x, y)$ represents the additive noise in the image, $R(x, y)$ represents the residual image, and $M$ represents the number of image samples. The maximum likelihood estimate of the PRNU $K_d$ is the image reference PRNU feature.

[0087] The original / interference image J is input, and a denoising filter and weight sharing mechanism are used to make the model focus more on features near the PRNU feature domain. Weight sharing also makes the model resemble the residual image input, thus guiding Siameformer to better extract local features and combine them with global information for source identification.

[0088] This invention primarily uses the Vision Transformer, based on a multi-head self-attention mechanism, as its basic framework. It combines weight-sharing with extended Siamese branches to train a deep learning algorithm model. Leveraging the global modeling capabilities of the Vision Transformer and combining the self-attention mechanism to simulate convolution operations, it extracts local feature information through the correlation of image content features. This global feature information guides the extraction of local feature information, significantly improving the model's robustness. Then, the Siamese branches are extended through a weight-sharing mechanism, utilizing information sharing between the two branches to implement a residual-like training strategy. The resulting image source forensics model is better able to handle robust source forensics tasks with images that have been transmitted through social networks and are subject to interference and distortion.

[0089] Experiments have confirmed that the Siameformer dual-branch information interaction model established in this invention outperforms existing source forensics algorithms while also mitigating the effect of image interference. The comparison between the method of this invention and existing source forensics algorithms is shown in Figure 5 and Table 1. For images taken by some mainstream mobile phone models (as shown in Figure 4), the Siameformer model proposed in this method achieves an overall accuracy of 97.08%, which is higher than that of current SCI algorithms. Furthermore, the model also demonstrates good forensic performance at the individual-device level (images taken by different individual devices of the same model).

[0090] Table 1

[0091]

[0092]

[0093] The information interaction framework proposed in this invention can be transferred to different image source forensics algorithms, exhibiting strong versatility and flexibility. Compared to traditional methods, this method significantly improves robustness while avoiding performance degradation in the original scenario.

[0094] This invention can be applied to multiple aspects:

[0095] In the field of digital evidence collection, particularly in criminal cases, intellectual property disputes, and fraud cases, the authenticity of digital evidence is paramount. Image source forensics technology can verify the origin and integrity of an image by analyzing its metadata, digital signatures, watermarks, and other information, thereby determining whether the image has been tampered with, forged, or whether its source is genuine and reliable. This is of great significance for criminal investigations, disputes involving suspected forgery, and online fraud.

[0096] Digital copyright protection is crucial in the field of digital content distribution, where piracy and infringement are rampant. Image source forensics techniques can be used to trace and prove the original source of digital content, ensuring the rights of copyright holders. By providing reliable evidence of origin, copyright protection can be strengthened and piracy can be combated.

[0097] In the medical field, image source forensics technology is used to verify and protect the accuracy and integrity of medical images. By analyzing medical images forensically, it is possible to detect whether there has been tampering, modification, or forgery, ensuring the reliability of medical results and the accuracy of treatment effects, thereby safeguarding the patient's health and treatment outcomes.

[0098] Social media verification is crucial because misinformation and online rumors are rampant on social media. Image source verification technology can help verify the authenticity and origin of images posted on social media, preventing the spread of misinformation and maintaining the credibility and trustworthiness of social media platforms.

[0099] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A robust image source forensics method based on information interaction, characterized in that it comprises: acquiring multiple interference images; constructing a dual-branch information interaction model; wherein the dual-branch information interaction model includes a first high-resolution spatiotemporal representation learning model branch and a second high-resolution spatiotemporal representation learning model branch; the first high-resolution spatiotemporal representation learning model branch is connected to a denoising filter; the two branches interact based on the weight sharing mechanism of Siamese networks to obtain the PRNU-like (photo-response non-uniformity noise) feature after dual-branch integration, which is input into a classifier for classification, including: comparing the similarity between the feature vectors of the two branches through a contrastive loss function, and adjusting the weights through weight sharing to obtain the PRNU-like feature after dual-branch integration; each high-resolution spatiotemporal representation learning model branch uses the Vision Transformer based on a multi-head self-attention mechanism as its basic framework; each high-resolution spatiotemporal representation learning model branch includes a local feature attention module and a global feature attention module; the local feature attention module and the global feature attention module have the same structure, including: dynamic position encoding, multi-head attention blocks, and a feedforward neural network; the dynamic position encoding includes: after the tensor … is expanded by zero-padding, performing a two-dimensional depthwise convolution on it with a position encoding matrix … of kernel size …: …; where … denotes the dynamic position encoding, … denotes zero-padding of …, and … denotes the element in row … and column … of the position encoding matrix; training the dual-branch information interaction model with the interference images; and using the trained dual-branch information interaction model for robust source forensics of images.
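The dynamic position encoding step of claim 1 (zero-padding followed by a two-dimensional depthwise convolution) can be illustrated with a small sketch. It is an assumption-laden example rather than the claimed implementation: the 3x3 kernel size used in the usage note is chosen only for illustration because the claim text does not preserve the kernel size, the operation is written as cross-correlation as is usual in deep learning code, and the function names are hypothetical.

```python
# Hypothetical sketch of "zero-pad, then depthwise 2D convolution".
# Kernel size and values are illustrative assumptions, not taken from the claim.

def zero_pad(channel, pad):
    """Surround an H x W channel with `pad` rows and columns of zeros."""
    h, w = len(channel), len(channel[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = channel[r][c]
    return padded


def depthwise_conv2d(x, kernels):
    """x: list of C channels, each H x W. kernels: list of C kernels, each k x k with k odd.

    Each channel is filtered only with its own kernel (depthwise), and zero-padding
    keeps the output at H x W.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        padded = zero_pad(channel, k // 2)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(kernel[i][j] * padded[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(w)]
            for r in range(h)
        ])
    return out
```

For example, calling `depthwise_conv2d` with one 3x3 kernel per channel returns a tensor of the same spatial size as the input, so the encoding can be computed for any input resolution without a fixed-length position table.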

2. The image robust source forensics method based on information interaction according to claim 1, characterized in that the shared weight parameters of the branch networks are updated through gradient backpropagation.
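A toy calculation may help show what updating shared weights by gradient backpropagation means: when the same parameter is used in both branches, its gradient is the sum of the gradients contributed by each branch. The scalar weight, the squared-error loss, and the learning rate below are illustrative assumptions, not values from the claims.

```python
# Hypothetical sketch: a single scalar weight w shared by two branches.

def shared_weight_grad(w, x_a, x_b, t_a, t_b):
    """Gradient of (w*x_a - t_a)^2 + (w*x_b - t_b)^2 with respect to the shared w.

    Each branch contributes its own term, and the two contributions are summed.
    """
    grad_a = 2.0 * (w * x_a - t_a) * x_a
    grad_b = 2.0 * (w * x_b - t_b) * x_b
    return grad_a + grad_b


def sgd_step(w, grad, lr=0.1):
    """One plain gradient-descent update of the shared weight."""
    return w - lr * grad
```

A single call to `sgd_step` therefore moves both branches together, since they read the same parameter.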

3. The image robust source forensics method based on information interaction according to claim 1, characterized in that the contrastive loss function is: …; where the symbols denote, respectively, the image features obtained after the image passes through each branch of the network, the model loss function, the contrastive loss function, the cross-entropy loss function, the similarity label, the number of sample pairs, and the Euclidean distance between a pair of image features.
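The quantities named in claim 3 (branch features, similarity label, number of sample pairs, Euclidean distance, cross-entropy loss) fit the widely used margin-based contrastive loss, so the sketch below assumes that form. The margin, the 1/(2N) normalization, and the unweighted sum of cross-entropy and contrastive terms are assumptions for illustration and are not taken from the claim; all function names are hypothetical.

```python
import math

# Hypothetical sketch of a margin-based contrastive loss combined with cross-entropy.
# Assumed form: L_con = 1/(2N) * sum(y * d^2 + (1 - y) * max(margin - d, 0)^2)


def euclidean(f1, f2):
    """Euclidean distance between the feature vectors of the two branches."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def contrastive_loss(pairs, labels, margin=1.0):
    """pairs: list of (f1, f2) feature pairs; labels: 1 for similar, 0 for dissimilar."""
    n = len(pairs)
    total = 0.0
    for (f1, f2), y in zip(pairs, labels):
        d = euclidean(f1, f2)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * n)


def cross_entropy(probs, target):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[target])


def model_loss(probs, target, pairs, labels, margin=1.0):
    """Assumed combination: plain sum of the classification and contrastive terms."""
    return cross_entropy(probs, target) + contrastive_loss(pairs, labels, margin)
```

Under this assumed form, similar pairs are pulled together (their squared distance is penalized), while dissimilar pairs are penalized only when they are closer than the margin.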

4. The image robust source forensics method based on information interaction according to claim 1, characterized in that the multi-head attention block applies spatial attention to learn content feature relevance.

5. The image robust source forensics method based on information interaction according to claim 4, characterized in that, in the multi-head attention block, each head is restricted to a window area to extract local features across spatial locations.
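Restricting attention to a window, as in claim 5, can be sketched as follows. This is a simplified illustration under stated assumptions: a single head, non-overlapping square windows, and no learned projections (queries, keys, and values are the token features themselves). None of these details is specified in the claim, and the function names are hypothetical.

```python
import math

# Hypothetical sketch: self-attention computed independently inside each window.


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(tokens):
    """Plain self-attention over a list of feature vectors, with Q = K = V = tokens."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def window_attention(grid, ws):
    """grid: H x W list of feature vectors; H and W must be multiples of ws."""
    h, w = len(grid), len(grid[0])
    out = [[None] * w for _ in range(h)]
    for r0 in range(0, h, ws):
        for c0 in range(0, w, ws):
            coords = [(r, c) for r in range(r0, r0 + ws) for c in range(c0, c0 + ws)]
            attended = attend([grid[r][c] for r, c in coords])
            for (r, c), vec in zip(coords, attended):
                out[r][c] = vec
    return out
```

Because each window is processed separately, a token only mixes with its spatial neighbors in the same window, which is what makes the resulting features local.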

6. The image robust source forensics method based on information interaction according to claim 1, characterized in that the global feature attention module uses channel attention.
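One possible reading of channel attention in claim 6 is transposed self-attention, where each channel (rather than each spatial position) acts as a token. The sketch below follows that reading as an assumption; the claim does not say which form of channel attention is used, and the absence of learned projections, the scaling factor, and the function names are likewise illustrative.

```python
import math

# Hypothetical sketch: attention computed across channels instead of spatial positions.


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def channel_attention(tokens):
    """tokens: N x C list (N spatial positions, C channels).

    Each channel, a length-N column, is treated as one token, so the attention
    matrix is C x C and every score depends on all spatial positions.
    """
    n, c = len(tokens), len(tokens[0])
    channels = [[tokens[i][j] for i in range(n)] for j in range(c)]  # C x N
    out_channels = []
    for q in channels:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(n) for k in channels]
        weights = softmax(scores)
        out_channels.append(
            [sum(w * v[i] for w, v in zip(weights, channels)) for i in range(n)]
        )
    return [[out_channels[j][i] for j in range(c)] for i in range(n)]  # back to N x C
```

Since every attention score is a dot product over all positions, changing the features at one location can change the output everywhere, which is the global behavior the module is meant to provide.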

7. The image robust source forensics method based on information interaction according to claim 1, characterized in that, in each high-resolution spatiotemporal representation learning model branch, the local feature attention module and the global feature attention module are repeatedly stacked.