Image-to-video cross-modal adversarial sample generation method based on key frame recognition

By combining sliding window and optical flow, key video segments are identified and adversarial examples are generated, which solves the problems of inaccurate keyframe identification and poor robustness in video adversarial attacks, and achieves stronger attack effects.

WO2026123391A1 · PCT designated stage · Publication Date: 2026-06-18 · SHANGHAI CHENGDIAN FUZHI TECH CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SHANGHAI CHENGDIAN FUZHI TECH CO LTD
Filing Date: 2024-12-17
Publication Date: 2026-06-18

Application Information

IPC: G06V20/40

AI Tagging

Application Domain

Character and pattern recognition


Abstract

Disclosed in the present invention is an image-to-video cross-modal adversarial sample generation method based on key frame recognition. The method comprises the steps of: acquiring a video frame sequence; acquiring key video segments by means of a sliding window strategy and video segment contribution degrees; selecting key frames from the key video segments on the basis of an optical flow method; forming a key frame set from all the key frames in the key video segments; generating a corresponding adversarial sample for each key frame in the key frame set; and replacing the corresponding key frame in the video frame sequence X with the adversarial sample to obtain an adversarial video sequence for the video frame sequence X. By means of improving a key frame extraction method and an adversarial perturbation generation technique, the present invention effectively overcomes deficiencies in the prior art, such as inaccurate key frame localization, insufficient consideration of temporal information during key frame extraction, and poor robustness of perturbation generation methods, and can thus significantly improve the accuracy and robustness of video adversarial attacks, thereby better meeting the requirements of practical applications.


Description

Image-to-video cross-modal adversarial example generation method based on keyframe recognition

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a method for generating cross-modal adversarial examples from images to videos based on keyframe recognition.

Background Technology

[0002] In the research of video adversarial attacks, techniques have expanded from perturbations of single images to video data, targeting tasks such as video classification and action recognition. These tasks typically employ time-aware neural network architectures, such as 3D convolutional networks (C3D), I3D, and SlowFast models, to capture inter-frame dynamic features. Therefore, effective adversarial attack strategies must consider both single-frame perturbations and their cumulative effects over time. Currently, key techniques for implementing video adversarial attacks include: technique (1), using video models to perform gradient analysis on the input video to generate perturbations and selectively applying perturbations to keyframes; and technique (2), using image models such as I2V or BTC to perturb video frames. These methods have collectively advanced the development of video adversarial attack techniques, enhancing the depth and breadth of robustness testing of deep learning models.

[0003] Regarding technique (1): keyframes are first identified with Grad-CAM (Gradient-weighted Class Activation Mapping), and perturbations are then applied to those keyframes. During keyframe identification, Grad-CAM computes the gradient of the score for classification label y with respect to the convolutional layer output and obtains weights by globally averaging the gradients. Each frame is thereby assigned a weight, and the frames are sorted in descending order of weight to determine the keyframes dynamically. This approach can adapt the number of keyframes, which makes the attack more flexible, and it selects keyframes by evaluating gradient contributions. However, methods of this type have the following problems:

[0004] (1) Temporal information is not considered. Keyframe extraction methods based on gradient calculation, such as gradient-weighted class activation mapping (Grad-CAM), usually focus only on the features of a single frame and evaluate the contribution of each frame independently. This ignores the correlation between video frames, so dynamic information across frames, which is crucial for action recognition models, cannot be fully captured.

[0005] (2) Keyframe localization is inaccurate. In action recognition videos, important information is not spread over the whole video. Keyframe extraction based on optical flow relies solely on the changes between adjacent frames and therefore cannot accurately locate the keyframes that affect the model output. It may extract frames that change drastically but contain no effective information.

[0006] Regarding technique (2), the I2V attack method: this method is based on an ImageNet pre-trained image model. By generating adversarial examples on all frames of the video, it improves transferability between models of different modalities and achieves an attack on a black-box video model. The core idea of I2V is to feed video frames one by one into the image model and perturb its intermediate features, so that the intermediate features of the video model are also perturbed with high probability. To do so, I2V optimizes the adversarial perturbation of each frame so that the features of the perturbed frame and the normal frame are as orthogonal as possible in the feature space of the image model. However, this method has poor robustness, because action recognition in video depends not only on the content of a single frame but also on the continuity and temporal correlation between frames. I2V generates adversarial perturbations across modalities from the image model, but because it only maximizes the feature-map difference between the perturbed frame and the original frame, it does not take into account the temporal relationship between consecutive video frames. The perturbation therefore affects only single frames and fails to produce a sustained attack effect on the entire action sequence. This limitation makes the attack perform poorly against complex temporal models.

[0007] Glossary: Typical class activation mapping methods include CAM (Class Activation Mapping) and Grad-CAM (Gradient-weighted Class Activation Mapping). CAM introduces the concept of class activation mapping, while Grad-CAM is an improvement on CAM. Grad-CAM generates a heatmap that highlights key regions of the image by analyzing the gradients flowing into the last convolutional layer of the CNN. By calculating the predicted class score and the gradient of the feature map in the last convolutional layer, Grad-CAM determines the importance of each feature map to a specific class.
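For reference, the glossary entry above can be made concrete with the standard Grad-CAM formulation. This is the general definition from the Grad-CAM literature, not a formula taken from this patent. Here $A^k$ is the k-th feature map of the last convolutional layer, $y^c$ is the score of class c, and Z is the number of spatial positions in a feature map:

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

The weight $\alpha_k^c$ is the global average of the gradients and measures how important feature map k is for class c. The ReLU keeps only the regions that contribute positively to that class.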

[0008] Optical flow is the instantaneous velocity of pixels moving on an image plane. The optical flow method utilizes the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous and current frames, thereby calculating the motion information of objects between adjacent frames. In space, motion can be described by a motion field. On an image plane, the motion of an object is often reflected by the different gray-level distributions of different images in a sequence. Therefore, the motion field in space, transferred to the image, is represented as the optical flow field. The optical flow field is a two-dimensional vector field that reflects the trend of gray-level changes at each point in the image. It can be viewed as the instantaneous velocity field generated by the movement of pixels with gray levels on the image plane. It contains the instantaneous velocity vector information of each pixel. In short, the optical flow field is the collection of optical flows from each pixel in an image.

Summary of the Invention

[0009] The purpose of this invention is to provide a method for generating cross-modal adversarial examples from images to videos based on keyframe recognition, which solves the above-mentioned problems, fully considers temporal information, accurately identifies keyframes, and generates more aggressive adversarial examples.

[0010] To achieve the above objective, the technical solution adopted by the present invention is as follows: a method for generating cross-modal adversarial examples from images to videos based on keyframe recognition, comprising the following steps:

[0011] S1, obtain a video frame sequence $X = \{X_1, X_2, \dots, X_A\}$, where A is the total number of images in the video frame sequence and $X_a$ is the a-th frame of X; preset the maximum sliding window length $L_{\max}$, the minimum sliding window length $L_{\min}$, and the stride; and set the current sliding window length $L = L_{\max}$;

[0012] S2, obtain the key video segment, including steps S21 to S26;

[0013] S21, traverse X with the current sliding window length L to perform sliding window sampling. If there are K samplings in total, the k-th sampling yields the video segment $P_k$, which contains L images; the l-th image of $P_k$ is denoted $x_k^l$, with $x_k^l \in X$, 1 ≤ k ≤ K, 1 ≤ l ≤ L;

[0014] S22, for each image in $P_k$, extract its channel average feature vector, then average the channel average feature vectors of all images in $P_k$ to obtain the comprehensive feature vector $\bar{f}_k$ of $P_k$;

[0015] S23, calculate the contribution $w_k$ of $\bar{f}_k$ to the classification, based on the gradient-based class activation mapping method;

[0016] S24, process each video segment in turn according to steps S22 to S23 to obtain the contributions $w_1$ to $w_K$ corresponding to $P_1$ to $P_K$, and denote the maximum of $w_1$ to $w_K$ as $w_{\max}$;

[0017] S25, preset the current maximum contribution threshold Max;

[0018] If $w_{\max} \ge$ Max, the video segment corresponding to $w_{\max}$ is marked as the key video segment; otherwise, the current sliding window length is reduced by 1 and the process proceeds to step S26;

[0019] S26, if the current sliding window length $L \ge L_{\min}$, repeat S21 to S25; if $L < L_{\min}$, the video segment corresponding to $w_{\max}$ at $L = L_{\min}$ is marked as the key video segment;

[0020] S3, select keyframes from the key video segment; if $P_k$ is the key video segment, extracting its keyframes includes steps S31 to S32;

[0021] S31, calculate the optical flow field and the degree of difference between every two adjacent frames in $P_k$, where, for the adjacent frames $x_k^l$ and $x_k^{l+1}$, the optical flow field is $O_k^l$ and the degree of difference is $D_k^l$;

[0022] S32, preset a difference threshold θ; if $D_k^l > \theta$, the corresponding frame is marked as a keyframe;

[0023] S4, form a keyframe set from all keyframes in the key video segment;

[0024] S5, generate a corresponding adversarial sample for each keyframe in the keyframe set;

[0025] S6, replace the corresponding keyframes in the video frame sequence X with the adversarial samples to obtain the adversarial video sequence of the video frame sequence X.

[0026] As a preferred option, in S22 the channel average feature vector $f_k^l$ of the l-th image $x_k^l$ in $P_k$ is extracted as follows:

[0027] A convolutional neural network extracts from $x_k^l$ a feature map $F_k^l$ with dimensions C×W×H, where C, W, and H are respectively the number of channels, the width, and the height of $F_k^l$. The channel average feature vector $f_k^l = (f_{k,1}^l, \dots, f_{k,C}^l)$ is obtained with the following formula:

[0028] $$f_{k,c}^l = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_{k,c}^l(i, j),$$

[0029] In the formula, $F_{k,c}^l(i, j)$ is the value of the c-th channel of $F_k^l$ at position (i, j), 1 ≤ c ≤ C;

[0030] In S22, $\bar{f}_k = \frac{1}{L} \sum_{l=1}^{L} f_k^l$.
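The averaging in S22 can be illustrated with a minimal sketch. It assumes each feature map is already available as a nested C×W×H list (in practice it would come from a convolutional network, which is not modelled here); the function names `channel_average` and `segment_feature` are hypothetical and do not appear in the source.

```python
from statistics import fmean

def channel_average(feature_map):
    """Average each channel of a C x W x H feature map (nested lists) to one value."""
    return [fmean(v for row in channel for v in row) for channel in feature_map]

def segment_feature(feature_maps):
    """Average the per-image channel vectors of one segment, element-wise."""
    vectors = [channel_average(fm) for fm in feature_maps]
    return [fmean(column) for column in zip(*vectors)]

# Two toy feature maps (2 channels, 2 x 2) standing in for CNN outputs.
fm_a = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
fm_b = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
print(channel_average(fm_a))          # [4.0, 1.0]
print(segment_feature([fm_a, fm_b]))  # [3.0, 2.5]
```

Each image is reduced to a C-dimensional vector, and the segment is represented by the mean of those vectors, so the segment descriptor has the same dimension regardless of the window length L.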

[0031] As a preferred option, in S23, $w_k$ = [formula missing in source], where y is the category label of the target in the image and $\|\cdot\|_2$ is the L2 norm.
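The exact expression for the contribution in S23 is not legible in this text. Purely as an illustration, the sketch below assumes that the contribution is the L2 norm of the gradient of the class-y probability with respect to the segment feature vector, and it uses a toy linear-softmax classifier so that the gradient can be written analytically. Both the assumed form and the names `softmax`, `contribution`, and `weights` are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def contribution(feature, weights, y):
    """ASSUMPTION: w_k = || d p_y / d feature ||_2 for a toy linear-softmax head.

    feature: segment feature vector (length C)
    weights: one weight row per class (num_classes x C)
    y:       index of the target class
    """
    logits = [sum(w * f for w, f in zip(row, feature)) for row in weights]
    p = softmax(logits)
    # Probability-weighted mean of the class weight rows, one entry per feature dim.
    mean_row = [sum(p[j] * weights[j][c] for j in range(len(weights)))
                for c in range(len(feature))]
    # Analytic gradient of p_y with respect to the feature vector.
    grad = [p[y] * (weights[y][c] - mean_row[c]) for c in range(len(feature))]
    return math.sqrt(sum(g * g for g in grad))

W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
print(contribution([0.4, -0.7], W, y=1))
```

With a real network the gradient would come from automatic differentiation rather than a closed form; the point of the sketch is only that each segment is scored by a single non-negative number that can be compared across segments.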

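[0032] As a preferred option, the optical flow field $O_k^l$ and the degree of difference $D_k^l$ are obtained from the following formulas:

[0033] $$O_k^l = \mathrm{FlowNet2}\left(x_k^l, x_k^{l+1}\right),$$

[0034] $$D_k^l = \frac{1}{N} \sum_{(x, y)} \left\| O_k^l(x, y) \right\|_2,$$

[0035] In the formulas, $x_k^l$ and $x_k^{l+1}$ are two adjacent frames of $P_k$, FlowNet2(·) refers to the FlowNet2 optical flow algorithm, N is the total number of pixels in the image, x and y are the horizontal and vertical coordinates of a pixel, respectively, and $\|\cdot\|_2$ is the L2 norm.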
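As a hedged illustration of the degree of difference, the sketch below assumes that it is the mean L2 norm of the per-pixel flow vectors, which is one reading consistent with the symbols defined above (N pixels, coordinates x and y, L2 norm). The flow field is taken as given; FlowNet2 itself is not invoked, and the function name `difference_degree` is hypothetical.

```python
import math
from statistics import fmean

def difference_degree(flow):
    """Mean L2 norm of the flow vectors over an H x W field of (u, v) pairs.

    ASSUMPTION: the degree of difference is the average flow magnitude.
    """
    return fmean(math.hypot(u, v) for row in flow for (u, v) in row)

# Toy 2 x 2 flow field: two static pixels and two moving pixels.
flow = [[(3.0, 4.0), (0.0, 0.0)],
        [(0.0, 0.0), (6.0, 8.0)]]
print(difference_degree(flow))  # (5 + 0 + 0 + 10) / 4 = 3.75
```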

[0036] In S5, generating the adversarial example for any keyframe in the keyframe set includes the following steps:

[0037] S51, based on random optical flow, simulate two different motion trajectories to generate two optical flow distortion images of the keyframe; add a random perturbation to the keyframe to generate a random perturbation sample;

[0038] S52, add an initial adversarial perturbation to the keyframe to generate an initial adversarial sample;

[0039] S53, set the temporal loss function and the spatial loss function [formulas missing in source];

[0040] S54, first minimize the temporal loss function to update the adversarial sample, then minimize the spatial loss function to update it again, obtaining the final adversarial example.

[0041] The main idea of this invention is to identify key video segments from a video frame sequence, select key frames from the key video segments, generate an adversarial sample for each key frame, and replace the corresponding key frame in the video frame sequence with it, thereby generating an adversarial video sequence.

[0042] When identifying key video segments, the comprehensive feature vector calculated for each video segment in step S22 integrates the feature information of all frames and all channels within $P_k$, and so reflects the characteristics of the segment more comprehensively. Step S23 then calculates the contribution $w_k$ of this vector to the classification, which quantifies the contribution of each segment. Identifying key video segments may take multiple rounds of calculation and comparison, with the sliding window length adjusted according to the results. The final length of the key video segment is therefore not fixed but depends on the contribution, so the number of keyframes available for subsequent screening varies and adapts to the video input.
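The round-by-round search described in steps S21 to S26 can be sketched as follows. This is a minimal illustration, not the patented implementation: `contribution(start, length)` is a hypothetical callable standing in for steps S22 to S23, frame indices are zero-based, and the function name `find_key_segment` is invented for the example.

```python
def find_key_segment(num_frames, l_max, l_min, stride, contribution, max_threshold):
    """Shrink the window from l_max to l_min until some segment reaches the threshold.

    Returns (start, length) of the key segment. If no window length reaches
    max_threshold, the best segment found at l_min is returned (step S26).
    """
    best = None
    for length in range(l_max, l_min - 1, -1):
        starts = range(0, num_frames - length + 1, stride)
        scores = {s: contribution(s, length) for s in starts}
        best_start = max(scores, key=scores.get)  # segment with w_max this round
        best = (best_start, length)
        if scores[best_start] >= max_threshold:   # step S25: w_max >= Max
            return best
    return best

# Toy contribution: only the segment starting at frame 3 with length 6 scores highly.
toy = lambda start, length: 0.9 if (start, length) == (3, 6) else 0.5
print(find_key_segment(10, 8, 5, 1, toy, 0.8))  # (3, 6)
```

Starting from the longest window favours long key segments; a shorter segment is accepted only when no longer window reaches the threshold.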

[0043] When selecting keyframes, we further perform optical flow calculations on the frames in the segment to find keyframes with significant inter-frame changes. This step aims to detect motion information between frames through optical flow and select frames with differences greater than a preset threshold as keyframes.
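The thresholding step described above amounts to a simple filter. In this sketch, `differences[l]` is the degree of difference between frames l and l + 1 of the key segment. The source text does not make clear which frame of each pair is marked, so marking the earlier frame is an assumption; the function name `select_keyframes` is hypothetical.

```python
def select_keyframes(differences, theta):
    """Return indices l whose adjacent-frame difference exceeds theta.

    ASSUMPTION: the earlier frame of each adjacent pair is the one marked.
    """
    return [l for l, d in enumerate(differences) if d > theta]

print(select_keyframes([0.2, 1.4, 0.9, 2.1], theta=1.0))  # [1, 3]
```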

[0044] When generating adversarial examples, the keyframes are further optimized to produce adversarial examples with strong destructive capability. The optimization consists of two steps whose objectives are built from optical flow and random perturbation, aiming to disrupt the temporal consistency and the frame feature stability of the video. The first objective is the temporal loss function Loss1, which disrupts temporal consistency: the optimization minimizes the cosine similarity between the adversarial example and the two optical flow distortion images, so that the adversarial example differs as much as possible from the images generated by optical flow. Because the optical flow frames represent continuous motion in the time dimension, disrupting this similarity means that the adversarial example breaks the temporal continuity of the video, which makes it more effective at disrupting action recognition. The second objective is the spatial loss function Loss2, which disrupts frame feature stability: the optimization minimizes the cosine similarity between the adversarial example and the random perturbation sample, so that the two differ more. Through these two optimization steps, the adversarial examples not only disrupt temporal information but also break the stability of local features in keyframes, producing a stronger attack effect at the frame level.
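The two objectives can be sketched as cosine-similarity losses over feature vectors. The exact loss formulas are not legible in this text, so the forms below are assumptions: Loss1 is taken as the sum of the similarities to the two flow-distorted versions, and Loss2 as the similarity to the randomly perturbed sample. The feature vectors would come from an image model, which is not modelled here, and all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def temporal_loss(f_adv, f_flow1, f_flow2):
    """Loss1 (assumed form): similarity to both optical-flow-distorted versions."""
    return cosine(f_adv, f_flow1) + cosine(f_adv, f_flow2)

def spatial_loss(f_adv, f_rand):
    """Loss2 (assumed form): similarity to the randomly perturbed sample."""
    return cosine(f_adv, f_rand)

print(temporal_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # 1.0
print(spatial_loss([1.0, 0.0], [-1.0, 0.0]))              # -1.0
```

Minimizing either loss pushes the adversarial features away from the reference features. The alternating order (temporal first, then spatial) follows step S54.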

[0045] When generating adversarial video sequences, replacing the corresponding keyframes in the video frame sequence with adversarial samples generated in the above steps will severely damage the coherence and recognizability of the resulting adversarial video sequences.

[0046] Compared with the prior art, the advantages of the present invention are as follows:

[0047] (1) Improved keyframe localization accuracy: Through a sliding window strategy and dynamic adjustment based on feature contribution, this invention addresses the problem that traditional methods locate keyframes in non-core regions. Compared with existing methods based on single-frame gradient calculation, this invention considers the contribution of video frames comprehensively, ensuring that keyframes reflect the important information in the video. Keyframe extraction becomes more accurate and avoids the inaccurate localization caused by ignoring temporal information, thereby improving the effectiveness of adversarial attacks.

[0048] (2) Enhanced robustness: The sliding window-based segment selection method proposed in this invention considers not only the feature information of a single frame during feature extraction, but also the comprehensive features of all frames within a segment. Compared with traditional optical flow methods and single-frame gradient calculation methods, this method is more adaptable to the changes in different video segments, improves the robustness and accuracy of key frame selection, and thus performs better when processing complex video data.

[0049] (3) Improved Adversarial Perturbation Generation Method: This invention proposes a novel adversarial perturbation generation method by combining optical flow information and random perturbations. This method not only considers inter-frame motion information but also introduces random perturbations, thereby generating powerful adversarial examples at both the temporal and frame levels. Compared with existing I2V methods, the perturbation generation method of this invention can more effectively disrupt the temporal consistency and frame feature stability of videos, providing a more comprehensive and persistent attack effect.

[0050] (4) Comprehensive Optimization of Attack Effect: In the process of generating adversarial examples, this invention improves the overall effect of adversarial attacks through a dual strategy of optical flow optimization and frame feature optimization. By minimizing the cosine similarity to the optical-flow-distorted frames and the cosine similarity to the randomly perturbed sample, the temporal continuity of the video is disrupted and the stability of intra-frame features is weakened. This dual optimization strategy makes the generated adversarial examples more aggressive overall and can significantly improve the success rate of attacks on action recognition models.

[0051] In summary, by improving the keyframe extraction method and adversarial perturbation generation technology, this invention effectively overcomes the main defects in the prior art, significantly improves the accuracy and robustness of video adversarial attacks, and can better meet the needs of practical applications.

Description of the Drawings

[0052] Figure 1 is a flowchart of the present invention;

[0053] Figure 2 is a flowchart of the process for obtaining key video segments;

[0054] Figure 3 is a flowchart of generating adversarial examples.

Detailed Implementation

[0055] The invention will now be further described with reference to the accompanying drawings.

[0056] Example 1: Referring to Figures 1 to 3, a method for generating cross-modal adversarial examples from images to videos based on keyframe recognition includes the following steps:

[0057] S1, obtain a video frame sequence $X = \{X_1, X_2, \dots, X_A\}$, where A is the total number of images in the video frame sequence and $X_a$ is the a-th frame of X; preset the maximum sliding window length $L_{\max}$, the minimum sliding window length $L_{\min}$, and the stride; and set the current sliding window length $L = L_{\max}$;

[0058] S2, obtain the key video segment, including steps S21 to S26;

[0059] S21, traverse X with the current sliding window length L to perform sliding window sampling. If there are K samplings in total, the k-th sampling yields the video segment $P_k$, which contains L images; the l-th image of $P_k$ is denoted $x_k^l$, with $x_k^l \in X$, 1 ≤ k ≤ K, 1 ≤ l ≤ L;

[0060] S22, for each image in $P_k$, extract its channel average feature vector, then average the channel average feature vectors of all images in $P_k$ to obtain the comprehensive feature vector $\bar{f}_k$ of $P_k$;

[0061] S23, calculate the contribution $w_k$ of $\bar{f}_k$ to the classification, based on the gradient-based class activation mapping method;

[0062] S24, process each video segment in turn according to steps S22 to S23 to obtain the contributions $w_1$ to $w_K$ corresponding to $P_1$ to $P_K$, and denote the maximum of $w_1$ to $w_K$ as $w_{\max}$;

[0063] S25, preset the current maximum contribution threshold Max;

[0064] If $w_{\max} \ge$ Max, the video segment corresponding to $w_{\max}$ is marked as the key video segment; otherwise, the current sliding window length is reduced by 1 and the process proceeds to step S26;

[0065] S26, if the current sliding window length $L \ge L_{\min}$, repeat S21 to S25; if $L < L_{\min}$, the video segment corresponding to $w_{\max}$ at $L = L_{\min}$ is marked as the key video segment;

[0066] S3, select keyframes from the key video segment; if $P_k$ is the key video segment, extracting its keyframes includes steps S31 to S32;

[0067] S31, calculate the optical flow field and the degree of difference between every two adjacent frames in $P_k$, where, for the adjacent frames $x_k^l$ and $x_k^{l+1}$, the optical flow field is $O_k^l$ and the degree of difference is $D_k^l$;

[0068] S32, preset a difference threshold θ; if $D_k^l > \theta$, the corresponding frame is marked as a keyframe;

[0069] S4, form a keyframe set from all keyframes in the key video segment;

[0070] S5, generate a corresponding adversarial sample for each keyframe in the keyframe set;

[0071] S6, replace the corresponding keyframes in the video frame sequence X with the adversarial samples to obtain the adversarial video sequence of the video frame sequence X.

[0072] In this embodiment, in S22 the channel average feature vector $f_k^l$ of the l-th image $x_k^l$ in $P_k$ is extracted as follows. A convolutional neural network extracts from $x_k^l$ a feature map $F_k^l$ with dimensions C×W×H, where C, W, and H are respectively the number of channels, the width, and the height of $F_k^l$. The channel average feature vector $f_k^l = (f_{k,1}^l, \dots, f_{k,C}^l)$ is obtained with the following formula:

[0073] $$f_{k,c}^l = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_{k,c}^l(i, j),$$

[0074] In the formula, $F_{k,c}^l(i, j)$ is the value of the c-th channel of $F_k^l$ at position (i, j), 1 ≤ c ≤ C;

[0075] In S22, $\bar{f}_k = \frac{1}{L} \sum_{l=1}^{L} f_k^l$.

[0076] In S23, $w_k$ = [formula missing in source], where y is the category label of the target in the image and $\|\cdot\|_2$ is the L2 norm.

[0077] In this embodiment, the optical flow field $O_k^l$ and the degree of difference $D_k^l$ are obtained from the following formulas:

[0078] $$O_k^l = \mathrm{FlowNet2}\left(x_k^l, x_k^{l+1}\right),$$

[0079] $$D_k^l = \frac{1}{N} \sum_{(x, y)} \left\| O_k^l(x, y) \right\|_2,$$

[0080] In the formulas, $x_k^l$ and $x_k^{l+1}$ are two adjacent frames of $P_k$, FlowNet2(·) refers to the FlowNet2 optical flow algorithm, N is the total number of pixels in the image, x and y are the horizontal and vertical coordinates of a pixel, respectively, and $\|\cdot\|_2$ is the L2 norm.

[0081] In S5, generating the adversarial example for any keyframe in the keyframe set includes the following steps:

[0082] S51, based on random optical flow, simulate two different motion trajectories to generate two optical flow distortion images of the keyframe; add a random perturbation to the keyframe to generate a random perturbation sample;

[0083] S52, add an initial adversarial perturbation to the keyframe to generate an initial adversarial sample;

[0084] S53, set the temporal loss function and the spatial loss function [formulas missing in source];

[0085] S54, first minimize the temporal loss function to update the adversarial sample, then minimize the spatial loss function to update it again, obtaining the final adversarial example.

[0086] Example 2: Referring to Figures 1 to 3, a method for generating cross-modal adversarial examples from images to videos based on keyframe recognition includes the following steps;

[0087] S1, obtain the video frame sequence $X = \{X_1, X_2, \dots, X_{32}\}$ with A = 32; preset the maximum sliding window length $L_{\max} = 16$, the minimum sliding window length $L_{\min} = 5$, and stride = 1; and set the current sliding window length $L = L_{\max}$;

[0088] S2, obtain the key video segment, the same as step S2 in Example 1; a more specific operation method is given below:

[0089] Round 1:

[0090] Execute step S21: traverse X with L = 16 to perform sliding window sampling, for a total of K = 17 samplings, obtaining 17 video segments $P_1$ to $P_{17}$, each containing 16 images. Because $X = \{X_1, X_2, \dots, X_{32}\}$, $P_1 = \{X_1, X_2, \dots, X_{16}\}$; the other video segments are labeled in the same way;

[0091] Perform steps S22 to S24: calculate the comprehensive feature vectors of $P_1$ to $P_{17}$, calculate their contributions to the classification $w_1$ to $w_{17}$, and denote the maximum of $w_1$ to $w_{17}$ as $w_{\max}$;

[0092] Execute step S25: preset the current maximum contribution threshold Max = 0.8 and assume $w_{\max} = 0.75$, which means that the key video segment cannot be obtained with this window length. The current sliding window length L is reduced by 1 before the second round, so L = 15 in the second round.

[0093] Round 2:

[0094] Execute step S21: perform sliding window sampling with L = 15, obtaining a total of K = 18 video segments;

[0095] Perform steps S22 to S24 to obtain 18 contribution values and find the maximum value $w_{\max}$;

[0096] Execute step S25: assuming $w_{\max} = 0.85 >$ Max in this round, the video segment corresponding to $w_{\max}$ is marked as the key video segment; otherwise, a third round, a fourth round, and so on are performed, until the final round with L = 5.

[0097] Final round: When L = 5, if $w_{\max}$ is still less than Max, no further round is performed. Instead, the video segment corresponding to $w_{\max}$ in the round with L = 5 is marked as the key video segment.
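The segment counts quoted in the rounds above follow from the usual sliding-window count, K = floor((A − L) / stride) + 1. The short check below reproduces them; the function name `num_windows` is hypothetical.

```python
def num_windows(num_frames, length, stride=1):
    """Number of sliding-window segments of the given length over num_frames frames."""
    return (num_frames - length) // stride + 1

print(num_windows(32, 16))  # 17 segments in round 1
print(num_windows(32, 15))  # 18 segments in round 2
print(num_windows(32, 5))   # 28 segments in the final round
```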

[0098] S3, Select keyframes from key video segments;

[0099] S4, form a keyframe set from all keyframes in the key video segment;

[0100] S5, generate a corresponding adversarial sample for each keyframe in the keyframe set. In S51, the method for generating two optical flow distortion images for a keyframe can be found in the paper: Wei Z, Chen J, Wu Z, et al. Cross-modal transferable adversarial attacks from images to videos[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 15064-15073.

[0101] S6, replace the corresponding keyframes in the video frame sequence X with adversarial examples to obtain the adversarial video sequence of video frame sequence X.

[0102] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for generating cross-modal adversarial examples from images to videos based on keyframe recognition, characterized in that it includes the following steps: S1, obtain a video frame sequence $X = \{X_1, X_2, \dots, X_A\}$, where A is the total number of images in the video frame sequence and $X_a$ is the a-th frame of X; preset the maximum sliding window length $L_{\max}$, the minimum sliding window length $L_{\min}$, and the stride; and set the current sliding window length $L = L_{\max}$; S2, obtain the key video segment, including steps S21 to S26; S21, traverse X with the current sliding window length L to perform sliding window sampling, where, if there are K samplings in total, the k-th sampling yields the video segment $P_k$, which contains L images, the l-th image of $P_k$ being denoted $x_k^l$, with $x_k^l \in X$, 1 ≤ k ≤ K, 1 ≤ l ≤ L; S22, for each image in $P_k$, extract its channel average feature vector, then average the channel average feature vectors of all images in $P_k$ to obtain the comprehensive feature vector $\bar{f}_k$ of $P_k$; S23, calculate the contribution $w_k$ of $\bar{f}_k$ to the classification, based on the gradient-based class activation mapping method; S24, process each video segment in turn according to steps S22 to S23 to obtain the contributions $w_1$ to $w_K$ corresponding to $P_1$ to $P_K$, and denote the maximum of $w_1$ to $w_K$ as $w_{\max}$; S25, preset the current maximum contribution threshold Max; if $w_{\max} \ge$ Max, the video segment corresponding to $w_{\max}$ is marked as the key video segment; otherwise, the current sliding window length is reduced by 1 and the process proceeds to step S26; S26, if the current sliding window length $L \ge L_{\min}$, repeat S21 to S25; if $L < L_{\min}$, the video segment corresponding to $w_{\max}$ at $L = L_{\min}$ is marked as the key video segment; S3, select keyframes from the key video segment, where, if $P_k$ is the key video segment, extracting its keyframes includes steps S31 to S32; S31, calculate the optical flow field and the degree of difference between every two adjacent frames in $P_k$, where, for the adjacent frames $x_k^l$ and $x_k^{l+1}$, the optical flow field is $O_k^l$ and the degree of difference is $D_k^l$; S32, preset a difference threshold θ; if $D_k^l > \theta$, the corresponding frame is marked as a keyframe; S4, form a keyframe set from all keyframes in the key video segment; S5, generate a corresponding adversarial sample for each keyframe in the keyframe set; S6, replace the corresponding keyframes in the video frame sequence X with the adversarial samples to obtain the adversarial video sequence of the video frame sequence X.

2. The image-to-video cross-modal adversarial example generation method based on keyframe recognition according to claim 1, characterized in that: in S22, the channel average feature vector $f_k^l$ of the l-th image $x_k^l$ in $P_k$ is extracted as follows: a convolutional neural network extracts from $x_k^l$ a feature map $F_k^l$ with dimensions C×W×H, where C, W, and H are respectively the number of channels, the width, and the height of $F_k^l$, and the channel average feature vector $f_k^l = (f_{k,1}^l, \dots, f_{k,C}^l)$ is obtained with the formula $f_{k,c}^l = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_{k,c}^l(i, j)$, where $F_{k,c}^l(i, j)$ is the value of the c-th channel of $F_k^l$ at position (i, j), 1 ≤ c ≤ C; in S22, $\bar{f}_k = \frac{1}{L} \sum_{l=1}^{L} f_k^l$.

3. The image-to-video cross-modal adversarial example generation method based on keyframe recognition according to claim 1, characterized in that: in S23, $w_k$ = [formula missing in source], where y is the category label of the target in the image and $\|\cdot\|_2$ is the L2 norm.

4. The image-to-video cross-modal adversarial example generation method based on keyframe recognition according to claim 1, characterized in that: the optical flow field $O_k^l$ and the degree of difference $D_k^l$ are obtained from the formulas $O_k^l = \mathrm{FlowNet2}\left(x_k^l, x_k^{l+1}\right)$ and $D_k^l = \frac{1}{N} \sum_{(x, y)} \left\| O_k^l(x, y) \right\|_2$, where FlowNet2(·) refers to the FlowNet2 optical flow algorithm, N is the total number of pixels in the image, x and y are the horizontal and vertical coordinates of a pixel, respectively, and $\|\cdot\|_2$ is the L2 norm.

5. The image-to-video cross-modal adversarial example generation method based on keyframe recognition according to claim 1, characterized in that: in S5, generating the adversarial example for any keyframe in the keyframe set includes the following steps: S51, based on random optical flow, simulate two different motion trajectories to generate two optical flow distortion images of the keyframe, and add a random perturbation to the keyframe to generate a random perturbation sample; S52, add an initial adversarial perturbation to the keyframe to generate an initial adversarial sample; S53, set the temporal loss function and the spatial loss function [formulas missing in source]; S54, first minimize the temporal loss function to update the adversarial sample, then minimize the spatial loss function to update it again, obtaining the final adversarial example.