Cross-architecture video action recognition method and device based on knowledge distillation
By combining visual Transformer and CNN models with a complementary feature distillation method, the problem of underutilization of intermediate layer features in cross-architecture learning is solved, and more efficient video action recognition results are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2024-04-03
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies in the field of video action recognition have not effectively utilized knowledge distillation for cross-architecture learning, especially by not fully mining intermediate layer features, resulting in poor cross-architecture learning performance.
A complementary feature distillation method is adopted, which integrates the cross-attention fusion of the visual Transformer teacher model and the CNN student model, and combines the feature distillation loss of the intermediate layer and the output layer to construct soft label distillation loss and classification cross-entropy loss, so as to achieve cross-architecture video action recognition.
The method improves the effectiveness and recognition accuracy of cross-architecture learning, makes full use of the complementary strengths of CNNs and visual Transformers, and enhances the expressive power and accuracy of video action recognition.
[Figure: abstract drawing, CN118172705B]
Description
Technical Field
[0001] This invention belongs to the technical field of action recognition, specifically relating to a cross-architecture video action recognition method and apparatus based on knowledge distillation.
Background Technology
[0002] Video action recognition is one of the most important research contents in the field of video understanding. The data it processes is a continuous RGB image sequence. Video action recognition not only needs to analyze the content of each video frame, but also needs to extract clues from the temporal changes between video frames. This requires the model to be able to model local spatial information and long-distance temporal dependence at the same time. Therefore, single-architecture methods usually cannot fully capture the action features in the video. At present, most video action recognition models are designed based on two network architectures, namely convolutional neural networks (CNN) and visual Transformers. The typical representative of CNN architecture is 3D CNN, which uses three-dimensional convolution as the processing unit of video data to process the two-dimensional spatial and one-dimensional temporal information of video data at the same time. On the other hand, based on the design concept of visual Transformers proposed in research [3], existing technologies have proposed video action recognition models based on visual Transformer architecture. They use self-attention mechanism to model the global spatiotemporal information of video data. Although both CNN and visual Transformer architectures can be used for video action recognition tasks, they have their own advantages and disadvantages. CNNs excel at extracting local features from data, but due to limitations in their effective receptive field, they struggle to capture global representations, such as long-range dependencies and contextual information. Visual Transformers, on the other hand, excel at modeling global relationships in data, but due to the lack of strong data assumptions (i.e., inductive biases, such as locality and translation invariance), they have limitations in capturing detailed local features. 
Clearly, CNNs and Visual Transformers are complementary; therefore, combining the characteristics of these two network architectures to achieve more accurate action recognition has become a research hotspot in this field.
[0003] Knowledge distillation, a common knowledge transfer and model optimization technique, has been applied in research to cross-architecture learning of CNNs and Transformers. Knowledge distillation methods typically employ a teacher-student training framework. Cross-architecture learning based on knowledge distillation uses heterogeneous teacher models to guide the training of student models, enabling them to master advanced knowledge from different architectures and thus achieve better performance.
[0004] In the field of image classification, existing techniques have proposed a knowledge distillation method specifically for Transformer models. This method adds an extra distillation term to the input sequence of the Transformer student model to receive guidance from the teacher model. Furthermore, they use a CNN as the teacher model to distill the visual Transformer, improving its training efficiency by leveraging the inductive bias of the CNN. This research effectively achieves cross-architecture learning between CNN and visual Transformer models using knowledge distillation, and experimental results verify the effectiveness and superiority of this cross-architecture learning.
[0005] In the field of audio classification, existing technologies have proposed a bidirectional knowledge distillation framework, CMKD, based on CNN and Transformer audio classification models. This framework aims to explore whether cross-architecture knowledge distillation can further improve the performance of these two models. Specifically, CMKD uses knowledge represented by the model's output layer for distillation and designs distillation experiments in two directions: from CNN to Transformer and from Transformer to CNN. Extensive experimental results demonstrate that cross-architecture knowledge distillation not only works in both directions but also allows the student model to outperform the teacher model.
[0006] Deep learning models excel at learning multi-layered abstract representations of data; therefore, the intermediate layer representations contain rich information that can be used for knowledge distillation. Furthermore, different network architectures employ different methods (such as convolution and self-attention) to model data features, and their intermediate layer representations also contain architecture-specific information. The two aforementioned cross-architecture learning studies based on knowledge distillation only considered knowledge based on the output layer representation (i.e., class probability distribution), without further exploring the rich information contained in the intermediate layer features for knowledge distillation, thus failing to fully realize cross-architecture learning. However, using traditional feature distillation methods to directly align student features with heterogeneous teacher features rarely yields improvements. This is mainly due to two reasons: first, the significant differences between heterogeneous teacher and student features make the feature alignment loss difficult to optimize; second, simply emphasizing consistency between student and heterogeneous teacher features may cause the student to lose its own inherent key information. Currently, no research has proposed an effective cross-architecture learning method based on knowledge distillation in the field of video action recognition, especially knowledge distillation based on intermediate layer features. Further exploration is needed to address the difficulty of applying traditional feature distillation methods to cross-architecture scenarios and to achieve effective cross-architecture learning based on knowledge distillation technology.
Summary of the Invention
[0007] The main objective of this invention is to overcome the shortcomings and deficiencies of the prior art and provide a cross-architecture video action recognition method and device based on knowledge distillation. This invention proposes a complementary feature distillation method, which addresses the difficulty traditional feature distillation methods have in yielding improvements in cross-architecture scenarios. It effectively realizes the cross-architecture transfer of intermediate layer feature knowledge and further enhances the gains from cross-architecture learning.
[0008] To achieve the above objectives, the present invention adopts the following technical solution:
[0009] In a first aspect, the present invention provides a cross-architecture video action recognition method based on knowledge distillation, comprising the following steps:
[0010] Choose teacher and student models belonging to different architectures;
[0011] The raw data from the video is obtained, and the raw data is preprocessed to obtain training data for training.
[0012] The same batch of training data is fed into the teacher model and the student model respectively. The intermediate layer features of the teacher model and the student model are extracted to construct a complementary feature distillation loss. The complementary feature distillation loss is as follows: when using the teacher model to distill the student model, cross attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features. The new teacher feature retains its own complete information as well as some key information contained in the student feature. This allows the student model to retain its own advantages while learning the advanced knowledge of the teacher model during feature distillation.
[0013] The output layer representations of the teacher model and the student model are extracted, and a soft-label distillation loss is constructed. The soft-label distillation loss is used to characterize the difference between the prediction results of the student model and the prediction results of the teacher model.
[0014] Construct a classification cross-entropy loss for the student model; the classification cross-entropy loss is used to characterize the difference between the student model's prediction results and the true labels;
[0015] The student model is trained using complementary feature distillation loss, soft label distillation loss, and classification cross-entropy loss. The trained student model is then used to identify actions in the video to be processed.
[0016] As a preferred technical solution, the teacher model adopts a visual Transformer architecture, and the student model adopts a CNN architecture.
[0017] As a preferred technical solution, the step of obtaining the raw data from the video and preprocessing the raw data to obtain training data specifically involves:
[0018] Several frames are sampled at equal intervals from each video sample as the raw data for the model;
[0019] Each sampled video frame is scaled proportionally.
[0020] Perform data augmentation;
[0021] The data is tensorized and normalized to obtain the input for the model.
[0022] As a preferred technical solution, the construction of the complementary feature distillation loss specifically includes:
[0023] Let the features extracted by the teacher model and the student model at the i-th intermediate layer be denoted as [symbol not preserved] and [symbol not preserved], respectively;
[0024] Adjust the shape and dimensions of the student features to match those of the teacher features;
[0025] Let the adjusted teacher and student features of the i-th intermediate layer be denoted as [symbol not preserved] and [symbol not preserved], respectively, and calculate the cross-attention $A_c$ of the teacher features on the student features;
[0026] Add the cross-attention to the original teacher features to obtain the new teacher features;
[0027] Construct the complementary feature distillation loss, whose loss function is defined as follows:
[0028]
[0029] where N is the number of intermediate layers, $\lambda_i$, $i \in [1, N]$, is the feature distillation weight of the i-th intermediate layer, and $\lVert\cdot\rVert_2$ is the L2 norm (Euclidean distance), which measures the difference between two feature vectors.
[0030] As a preferred technical solution, the calculation of the cross-attention of teacher features to student features specifically involves:
[0031] Based on the key-value attention mechanism, a query matrix Q is first generated from the teacher features, and a key matrix K and a value matrix V are generated from the student features:
[0032]
[0033] where $W_Q$, $W_K$, and $W_V$ are the parameter weights of the linear mappings that generate Q, K, and V, respectively; the cross-attention $A_c$ is then calculated using the following formula:
[0034]
[0035] where $C_t$ denotes the channel dimension of the teacher features.
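The formula images for paragraphs [0023] to [0034] did not survive in this text. A reconstruction that is consistent with the surrounding definitions might read as follows. The feature symbols $\tilde{F}_t^{i}$ and $\tilde{F}_s^{i}$ (adjusted teacher and student features), $\hat{F}_t^{i}$ (new teacher feature), the loss name $L_{cfd}$, the layer weight written as $\lambda_i$, and the choice of an unsquared norm are all assumptions made for illustration, not the patent's own notation.

```latex
\begin{align}
Q &= \tilde{F}_t^{i} W_Q, \qquad K = \tilde{F}_s^{i} W_K, \qquad V = \tilde{F}_s^{i} W_V \\
A_c &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{C_t}}\right) V \\
\hat{F}_t^{i} &= \tilde{F}_t^{i} + A_c \\
L_{cfd} &= \sum_{i=1}^{N} \lambda_i \left\lVert \tilde{F}_s^{i} - \hat{F}_t^{i} \right\rVert_2
\end{align}
```

Because the query comes from the teacher and the keys and values come from the student, each teacher token gathers a weighted mix of student tokens, and the residual addition keeps the original teacher information intact.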
[0036] As a preferred technical solution, the construction of soft-label distillation loss specifically includes:
[0037] Let $z_t$ and $z_s$ denote the output layer representations of the teacher model and the student model, respectively;
[0038] Calculate the class probability distributions $p_t$ and $p_s$ predicted by the teacher model and the student model, which are called soft labels; the probability value corresponding to each category i is calculated using the following formula:
[0039]
[0040] where C is the total number of categories and $\tau$ is the temperature parameter used to smooth the output probability distribution;
[0041] The soft-label distillation loss is constructed, and the loss function is defined as follows:
[0042]
[0043] Where N represents the number of input samples, and KL(·||·) represents the KL divergence, which is used to measure the difference between two probability distributions.
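The formulas in paragraphs [0039] and [0042] are missing from this text. Under the definitions given, they would plausibly take the standard form below. The temperature is written as $\tau$ here, and the loss name $L_{kd}$, the sample index $(n)$, the direction of the KL divergence (teacher first), and the absence of any extra temperature scaling factor are assumptions.

```latex
\begin{align}
p_i &= \frac{\exp(z_i / \tau)}{\sum_{j=1}^{C} \exp(z_j / \tau)} \\
L_{kd} &= \frac{1}{N} \sum_{n=1}^{N} \mathrm{KL}\!\left(p_t^{(n)} \,\middle\|\, p_s^{(n)}\right)
\end{align}
```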
[0044] As a preferred technical solution, the construction of the classification cross-entropy loss for the student model specifically includes:
[0045] Let y denote the true label of the input sample and $\hat{y}$ denote the prediction vector output by the student model;
[0046] Construct the classification cross-entropy loss, whose loss function is defined as follows:
[0047]
[0048] where N is the number of input samples; C is the total number of categories; $y_{n,c}$ takes the value 0 or 1, being 1 if the true class of sample n is c and 0 otherwise; and $\hat{y}_{n,c}$ is the predicted probability that sample n belongs to class c.
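The formula in paragraph [0047] is missing from this text. Given the definitions in [0048], it is presumably the usual multi-class cross-entropy; the indexed symbols $y_{n,c}$ and $\hat{y}_{n,c}$ and the loss name $L_{ce}$ are assumed notation.

```latex
\begin{equation}
L_{ce} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log \hat{y}_{n,c}
\end{equation}
```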
[0049] In a second aspect, the present invention provides a cross-architecture video action recognition system based on knowledge distillation, which is applied to the cross-architecture video action recognition method based on knowledge distillation, including a model selection module, a data preparation module, a complementary feature distillation loss construction module, a soft label distillation loss construction module, a classification cross-entropy loss construction module, and a model training module.
[0050] The model selection module is used to select teacher and student models belonging to different architectures;
[0051] The data preparation module is used to acquire raw data from the video and preprocess the raw data to obtain training data for training.
[0052] The complementary feature distillation loss construction module is used to input the same batch of training data into the teacher model and the student model respectively, extract the intermediate layer features of the teacher model and the student model, and construct the complementary feature distillation loss. The complementary feature distillation loss is specifically as follows: when distilling the student model using the teacher model, cross attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features. The new teacher feature retains its own complete information as well as some key information contained in the student feature, so that during feature distillation, the student model can retain its own advantages while learning the advanced knowledge of the teacher model.
[0053] The soft-label distillation loss construction module is used to extract the output layer representations of the teacher model and the student model to construct the soft-label distillation loss; the soft-label distillation loss is used to characterize the difference between the prediction results of the student model and the prediction results of the teacher model.
[0054] The classification cross-entropy loss construction module is used to construct the classification cross-entropy loss of the student model; the classification cross-entropy loss is used to characterize the difference between the prediction results of the student model and the true labels.
[0055] The model training module is used to train a student model based on complementary feature distillation loss, soft label distillation loss, and classification cross-entropy loss, and then uses the trained student model to identify the actions in the video to be processed.
[0056] In a third aspect, the present invention provides an electronic device, the electronic device comprising:
[0057] At least one processor; and,
[0058] A memory communicatively connected to the at least one processor; wherein,
[0059] The memory stores computer program instructions that can be executed by the at least one processor to enable the at least one processor to perform the knowledge distillation-based cross-architecture video action recognition method.
[0060] In a fourth aspect, the present invention provides a computer-readable storage medium storing a program that, when executed by a processor, implements the cross-architecture video action recognition method based on knowledge distillation.
[0061] Compared with the prior art, the present invention has the following advantages and beneficial effects:
[0062] 1. More comprehensive and effective cross-architecture learning: Compared with existing cross-architecture learning methods based on knowledge distillation, this invention further considers knowledge based on intermediate layer features, making cross-architecture learning more comprehensive. In addition, to address the problem that traditional feature distillation methods are difficult to improve performance in cross-architecture scenarios, a complementary feature distillation method that is more suitable for cross-architecture scenarios is designed, ensuring the effectiveness of cross-architecture learning.
[0063] 2. Improved expressive power and recognition accuracy: This invention effectively combines the characteristics and advantages of two network architectures, Convolutional Neural Network (CNN) and Visual Transformer, using knowledge distillation technology. This improves the expressive power of the CNN model, enabling it to capture the spatiotemporal features of video actions more fully. Compared with existing video action recognition methods based on single architectures, it has a higher recognition accuracy.
Description of the Drawings
[0064] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0065] Figure 1 is a flowchart of a cross-architecture video action recognition method based on knowledge distillation according to an embodiment of the present invention;
[0066] Figure 2 is a schematic diagram of the overall framework of the cross-architecture video action recognition method based on knowledge distillation according to an embodiment of the present invention;
[0067] Figure 3 is a schematic diagram of complementary feature distillation based on intermediate layer feature knowledge in an embodiment of the present invention;
[0068] Figure 4 is a block diagram of a cross-architecture video action recognition system based on knowledge distillation according to an embodiment of the present invention;
[0069] Figure 5 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0070] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort are within the scope of protection of the present application.
[0071] In this application, the reference to "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a mutually exclusive, independent, or alternative embodiment. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described in this application can be combined with other embodiments.
[0072] As shown in Figures 1-3, this embodiment provides a cross-architecture video action recognition method based on knowledge distillation, which addresses the difficulty traditional feature distillation methods have in yielding improvements in cross-architecture scenarios. The method specifically includes the following steps:
[0073] S1. Select teacher and student models belonging to different architectures.
[0074] Furthermore, in S1, the teacher model and student model are based on the visual Transformer and CNN architectures, respectively. The visual Transformer architecture excels at modeling long-range dependencies, while the CNN architecture excels at capturing local feature details. That is, the teacher model is configured as a Transformer teacher model, and the student model is configured as a CNN student model. The model architecture in this embodiment is shown in Figure 2.
[0075] Understandably, video action recognition needs not only to analyze the content of each video frame but also to extract clues from the temporal changes between frames, which requires the model to capture local spatial information and long-range temporal dependencies simultaneously. Currently, most video action recognition models are designed based on two network architectures: Convolutional Neural Networks (CNNs) and Visual Transformers. Each has its own advantages and disadvantages in video action recognition tasks. CNNs excel at extracting local features from data, but due to the limitation of the effective receptive field, they struggle to capture global representations, such as long-range dependencies and contextual information. Visual Transformers, on the other hand, excel at modeling global relationships in data, but due to the lack of strong data assumptions (i.e., inductive biases, such as locality and translation invariance), they have limitations in capturing detailed local features. Therefore, existing single-architecture-based methods often fail to adequately capture action features in videos. To combine the characteristics and advantages of both architectures, this embodiment utilizes knowledge distillation technology to achieve cross-architecture learning between CNN models and visual Transformer models. By using a visual Transformer teacher model to guide the training of the CNN student model, the expressive power and performance of the CNN model are improved, resulting in higher recognition accuracy than single-architecture training methods.
[0076] S2. Prepare training data, specifically:
[0077] S21. Sample several frames at equal intervals from each video sample as the raw data for the model;
[0078] S22. Scale each sampled video frame proportionally and adjust the length of the shortest side to 256 pixels.
[0079] S23. Perform a series of data augmentations to obtain training data.
[0080] In one specific embodiment, the data augmentation includes randomly cropping video frames to a size of 224×224 and randomly flipping them horizontally with a probability of 0.5, and finally tensorizing and normalizing the data to obtain the input of the model.
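The geometry of steps S21 to S23 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the function names are hypothetical, picking the middle frame of each equal-length segment is one assumed reading of "equal intervals", and actual image resizing, tensorization, and normalization are left to whichever image library is used.

```python
import random


def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices at equal intervals across the video."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    step = total_frames / num_samples
    # Assumption: take the middle frame of each equal-length segment.
    return [min(total_frames - 1, int(step * k + step / 2)) for k in range(num_samples)]


def scaled_size(width, height, short_side=256):
    """Scale proportionally so that the shorter side equals short_side pixels."""
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)


def random_crop_and_flip(width, height, crop=224, flip_prob=0.5, rng=random):
    """Choose a random crop box (left, top, right, bottom) and a flip decision."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    flip = rng.random() < flip_prob
    return (left, top, left + crop, top + crop), flip
```

For a 1280×720 source video, `scaled_size(1280, 720)` gives a 455×256 frame, from which a 224×224 window is then cropped. In practice the same crop box and flip decision would normally be applied to every frame of one clip so that the frames stay spatially aligned.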
[0081] S3. Input the same batch of training data into the teacher model and the student model respectively, extract the intermediate layer features of the teacher model and the student model, and construct a complementary feature distillation loss.
[0082] Furthermore, as shown in Figure 3, the complementary feature distillation process includes the following steps:
[0083] S31. Let the features extracted by the teacher model and the student model at the i-th intermediate layer be denoted as [symbol not preserved] and [symbol not preserved], respectively.
[0084] S32: Adjust the shape and dimensions of the student features to match those of the teacher features to facilitate distillation.
[0085] Furthermore, step S32 specifically includes:
[0086] S321. Scale the temporal and spatial dimensions of the teacher and student features, with the scaling factor determined by the smaller of the corresponding dimensions of the teacher and student features.
[0087] S322. Flatten the feature maps into feature sequences;
[0088] S323. Adjust the channel dimension of the student features to match that of the teacher features through a fully connected layer.
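Steps S321 to S323 can be illustrated with a small shape-bookkeeping sketch. It assumes a channels-last nested-list layout `[T][H][W][C]` and hypothetical helper names; the actual resizing operator (pooling or interpolation) is not specified in the text and is therefore not shown.

```python
def target_grid(teacher_thw, student_thw):
    """S321: common (T, H, W) grid, the smaller of each corresponding dimension."""
    return tuple(min(t, s) for t, s in zip(teacher_thw, student_thw))


def flatten_to_sequence(feature_map):
    """S322: flatten a [T][H][W][C] nested list into a [T*H*W][C] token sequence."""
    return [pixel for frame in feature_map for row in frame for pixel in row]


def linear(tokens, weight, bias):
    """S323: fully connected layer mapping each token from C_in to C_out channels.

    weight has shape [C_in][C_out]; bias has shape [C_out].
    """
    return [
        [sum(x * w_row[j] for x, w_row in zip(token, weight)) + bias[j]
         for j in range(len(bias))]
        for token in tokens
    ]
```

For example, a teacher feature on an 8×14×14 grid and a student feature on a 16×7×7 grid would both be brought to 8×7×7, giving sequences of 392 tokens each before the channel projection.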
[0089] S33. Let the teacher and student features of the i-th intermediate layer, adjusted as described in S32, be denoted as [symbol not preserved] and [symbol not preserved], respectively, and calculate the cross-attention of the teacher features on the student features.
[0090] Furthermore, step S33 specifically includes:
[0091] S331. Based on the key-value attention mechanism, first generate the query matrix Q from the teacher features, and generate the key matrix K and value matrix V from the student features:
[0092]
[0093] where $W_Q$, $W_K$, and $W_V$ are the parameter weights of the linear mappings that generate Q, K, and V, respectively.
[0094] S332. Next, calculate the cross-attention $A_c$ using the following formula:
[0095]
[0096] where $C_t$ denotes the channel dimension of the teacher features.
[0097] S34. Add the cross-attention described in S33 to the original teacher features to obtain the new teacher features.
[0098] S35. Construct the complementary feature distillation loss function, defined as follows:
[0099]
[0100] where N is the number of intermediate layers, $\lambda_i$, $i \in [1, N]$, is the feature distillation weight of the i-th intermediate layer, and $\lVert\cdot\rVert_2$ is the L2 norm (Euclidean distance), which measures the difference between two feature vectors.
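Steps S33 to S35 can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions rather than the patent's implementation: features are nested lists of shape `[tokens][channels]` that have already been aligned as in S32, the attention is single-head scaled dot-product attention with scaling by the square root of the teacher channel dimension, the norm is taken unsquared over the whole feature, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply matrices given as nested lists: [n][k] x [k][m] -> [n][m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(teacher, student, w_q, w_k, w_v):
    """S33: Q from the teacher, K and V from the student, scaled by sqrt(C_t)."""
    q = matmul(teacher, w_q)
    k = matmul(student, w_k)
    v = matmul(student, w_v)
    c_t = len(teacher[0])  # channel dimension of the teacher features
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(c_t) for s in row]) for row in scores]
    return matmul(weights, v)


def complementary_loss(teachers, students, projections, layer_weights):
    """S34-S35: sum over layers of weight_i * ||student_i - (teacher_i + A_c)||_2."""
    loss = 0.0
    for t, s, (w_q, w_k, w_v), lam in zip(teachers, students, projections, layer_weights):
        a_c = cross_attention(t, s, w_q, w_k, w_v)
        # S34: new teacher feature = original teacher feature + cross-attention.
        new_t = [[tv + av for tv, av in zip(t_row, a_row)] for t_row, a_row in zip(t, a_c)]
        squared = sum((sv - nv) ** 2
                      for s_row, n_row in zip(s, new_t)
                      for sv, nv in zip(s_row, n_row))
        loss += lam * math.sqrt(squared)
    return loss
```

In a real training setup the three projection matrices would be learnable parameters, which matches the description's mention of additional parameters introduced by complementary feature distillation.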
[0101] Understandably, traditional feature distillation methods directly align student features with heterogeneous teacher features, which is unlikely to yield improvements in cross-architecture scenarios. As mentioned in the technical background, there are significant differences between CNN and visual Transformer features. CNNs focus more on local details, while visual Transformers focus more on global context. Aligning CNN features with visual Transformer features may cause the loss of local information (and conversely, it may cause the visual Transformer to lose global information). To address these issues, this invention proposes a complementary feature distillation method. The core idea is to allow the teacher model to integrate some important information already possessed by the student model before guiding its training. Specifically, when distilling CNN students using a visual Transformer teacher, cross-attention is first used to fuse the local features of the CNN students into a new teacher feature with both global and local characteristics. Because of the fusion of some information from the student features, the difference between the new teacher feature and the student features is reduced. Simultaneously, the new teacher feature retains its complete information as well as some key information contained in the student features. This allows the student model to retain its own advantages while learning the advanced knowledge of the teacher model during feature distillation.
[0102] S4. Extract the output layer representations of the teacher and student models, and construct a soft-label distillation loss; specifically:
[0103] S41. Denote the output layer representations of the teacher model and the student model as $z_t$ and $z_s$, respectively;
[0104] S42. Calculate the class probability distributions $p_t$ and $p_s$ predicted by the teacher model and the student model, also known as soft labels; the probability value corresponding to each category i can be calculated using the following formula:
[0105]
[0106] where C is the total number of categories and $\tau$ is the temperature parameter, used to smooth the output probability distribution;
[0107] S43. Construct the soft-label distillation loss function, which is defined as follows:
[0108]
[0109] where N represents the number of input samples, and KL(·||·) represents the KL (Kullback-Leibler) divergence, which is used to measure the difference between two probability distributions.
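A minimal sketch of steps S41 to S43 follows, using only the standard library. The direction of the KL divergence (teacher distribution first), the batch averaging, and the small `eps` guard are assumptions; the function names are hypothetical.

```python
import math


def soft_labels(logits, temperature):
    """S42: temperature-scaled softmax over the output layer representation."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with eps guarding against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def soft_label_loss(teacher_logits, student_logits, temperature):
    """S43: mean KL divergence between teacher and student soft labels."""
    total = 0.0
    for z_t, z_s in zip(teacher_logits, student_logits):
        p_t = soft_labels(z_t, temperature)
        p_s = soft_labels(z_s, temperature)
        total += kl_divergence(p_t, p_s)
    return total / len(teacher_logits)
```

A higher temperature flattens both distributions, so the student also receives signal from the relative ranking of the less likely classes rather than only from the top prediction.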
[0110] S5. Construct the classification cross-entropy loss for the student model, specifically including the following steps:
[0111] S51. Let y denote the true label of the input sample and $\hat{y}$ denote the prediction vector finally output by the student model.
[0112] S52. Construct the classification cross-entropy loss function, as shown below:
[0113]
[0114] where N is the number of input samples; C is the total number of categories; $y_{n,c}$ takes the value 0 or 1, being 1 if the true class of sample n is c and 0 otherwise; and $\hat{y}_{n,c}$ is the predicted probability that sample n belongs to class c.
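The cross-entropy of step S52 can be written out directly from the definitions above. The sketch assumes one-hot label rows and already-normalized prediction rows; the function name and the `eps` guard are illustrative choices.

```python
import math


def cross_entropy_loss(labels, probs, eps=1e-12):
    """Mean over samples of -sum_c y[n][c] * log(p[n][c]) for one-hot labels y."""
    total = 0.0
    for y_row, p_row in zip(labels, probs):
        total -= sum(y * math.log(p + eps) for y, p in zip(y_row, p_row))
    return total / len(labels)
```

Because each label row is one-hot, only the predicted probability of the true class contributes to the loss for each sample.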
[0115] S6. Using the three loss functions constructed in the above steps, train the student model. The overall loss function is shown below:
[0116]
[0117] Here, α and β are both hyperparameters that weight the respective loss terms.
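The overall loss formula in paragraph [0116] is missing from this text. One plausible form, combining the three losses with two weighting hyperparameters, is shown below. Which loss each hyperparameter multiplies is not recoverable from the text, so the assignment here (and the loss names $L_{ce}$, $L_{kd}$, $L_{cfd}$, with the second weight written as $\beta$) is an assumption.

```latex
\begin{equation}
L = L_{ce} + \alpha \, L_{kd} + \beta \, L_{cfd}
\end{equation}
```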
[0118] Furthermore, during training, the teacher model is initialized with pre-trained parameter weights, and the parameters of the student model and the additional parameters introduced by complementary feature distillation are updated using gradient descent. During testing, only the parameters of the student model are used.
[0119] It is understandable that the intermediate layer features and output layer representations of a model contain rich information that can be used for knowledge distillation. However, in cross-architecture learning based on knowledge distillation, most existing research only considers the knowledge based on the output layer representations, without further mining the architecture-specific information contained in the intermediate layer features, resulting in insufficient cross-architecture learning. Furthermore, traditional feature distillation methods directly align student features with heterogeneous teacher features, which is also difficult to improve performance in cross-architecture scenarios. This invention proposes a complementary feature distillation method. Its core idea is to allow the teacher model to integrate some important information already mastered by the student model before guiding the student model's training. This not only reduces the differences between student and teacher features but also allows the student model to retain its own advantages while learning the advanced knowledge of the teacher model, ultimately achieving effective cross-architecture learning.
[0120] It should be noted that, for the sake of simplicity, the aforementioned method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, because according to the present invention, some steps can be performed in other orders or simultaneously.
[0121] Based on the same idea as the knowledge distillation-based cross-architecture video action recognition method in the above embodiments, the present invention also provides a knowledge distillation-based cross-architecture video action recognition system, which can be used to execute the above-described knowledge distillation-based cross-architecture video action recognition method. For ease of explanation, the structural diagrams of the knowledge distillation-based cross-architecture video action recognition system embodiments only show the parts related to the embodiments of the present invention. Those skilled in the art will understand that the illustrated structures do not constitute a limitation on the device, and may include more or fewer components than illustrated, or combine certain components, or have different component arrangements.
[0122] Referring to Figure 4, in another embodiment of this application, a cross-architecture video action recognition system 100 based on knowledge distillation is provided. The system includes a model selection module 101, a data preparation module 102, a complementary feature distillation loss construction module 103, a soft label distillation loss construction module 104, a classification cross-entropy loss construction module 105, and a model training module 106.
[0123] The model selection module 101 is used to select teacher models and student models belonging to different architectures;
[0124] The data preparation module 102 is used to acquire raw data from the video and preprocess the raw data to obtain training data for training.
[0125] The complementary feature distillation loss construction module 103 is used to input the same batch of training data into the teacher model and the student model respectively, extract the intermediate layer features of the teacher model and the student model, and construct the complementary feature distillation loss. The complementary feature distillation loss is specifically as follows: when using the teacher model to distill the student model, cross-attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features. The new teacher feature retains its own complete information as well as some key information contained in the student feature, so that during feature distillation the student model can retain its own advantages while learning the advanced knowledge of the teacher model.
[0126] The soft-label distillation loss construction module 104 is used to extract the output layer representations of the teacher model and the student model to construct the soft-label distillation loss; the soft-label distillation loss is used to characterize the difference between the prediction results of the student model and the prediction results of the teacher model.
[0127] The classification cross-entropy loss construction module 105 is used to construct the classification cross-entropy loss of the student model; the classification cross-entropy loss is used to characterize the difference between the prediction results of the student model and the true labels.
[0128] The model training module 106 is used to train a student model based on complementary feature distillation loss, soft label distillation loss and classification cross-entropy loss, and to use the trained student model to identify the video actions to be processed.
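The two prediction-level losses handled by modules 104 and 105 can be sketched as follows. This is an illustration under assumptions: the soft labels come from a temperature-scaled softmax, the KL divergence is taken from the teacher distribution to the student distribution and averaged over the batch, and the default temperature `tau=4.0` and all function names are hypothetical choices, not values stated in the text.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Turn output-layer values into class probabilities; a larger tau
    gives a smoother distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def soft_label_loss(teacher_logits, student_logits, tau=4.0):
    """Batch mean of the KL divergence between teacher and student
    soft labels."""
    total = 0.0
    for zt, zs in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax_with_temperature(zt, tau),
                               softmax_with_temperature(zs, tau))
    return total / len(teacher_logits)


def cross_entropy_loss(student_logits, labels):
    """Batch mean of -log(probability assigned to the true class)."""
    total = 0.0
    for z, y in zip(student_logits, labels):
        total -= math.log(softmax_with_temperature(z, 1.0)[y])
    return total / len(labels)
```

The soft-label term is zero when the student reproduces the teacher's distribution exactly, while the cross-entropy term depends only on the student and the ground-truth labels, so the two losses pull on the student from different sources of supervision.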
[0129] It should be noted that the cross-architecture video action recognition system based on knowledge distillation of the present invention corresponds one-to-one with the cross-architecture video action recognition method based on knowledge distillation of the present invention. The technical features and beneficial effects described in the method embodiments above apply equally to the system embodiments. For details, please refer to the description in the method embodiments of the present invention, which will not be repeated here.
[0130] Furthermore, in the above embodiments of the cross-architecture video action recognition system based on knowledge distillation, the logical division of each program module is only an example. In actual applications, the above functions can be assigned to different program modules as needed, for example, for the sake of corresponding hardware configuration requirements or software implementation convenience. That is, the internal structure of the cross-architecture video action recognition system based on knowledge distillation is divided into different program modules to complete all or part of the functions described above.
[0131] Referring to Figure 5, in one embodiment, an electronic device is provided that implements a cross-architecture video action recognition method based on knowledge distillation. The electronic device 200 may include a first processor 201, a first memory 202 and a bus, and may also include a computer program stored in the first memory 202 and executable on the first processor 201, such as a cross-architecture video action recognition program 203 based on knowledge distillation.
[0132] The first memory 202 includes at least one type of readable storage medium, including flash memory, portable hard drive, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the first memory 202 can be an internal storage unit of the electronic device 200, such as the portable hard drive of the electronic device 200. In other embodiments, the first memory 202 can also be an external storage device of the electronic device 200, such as a plug-in portable hard drive, smart media card (SMC), secure digital card (SD), flash card, etc., equipped on the electronic device 200. Furthermore, the first memory 202 can include both internal and external storage units of the electronic device 200. The first memory 202 can be used not only to store application software and various types of data installed on the electronic device 200, such as the code of the cross-architecture video action recognition program 203 based on knowledge distillation, but also to temporarily store data that has been output or will be output.
[0133] In some embodiments, the first processor 201 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The first processor 201 is the control unit of the electronic device, connecting various components of the entire electronic device through various interfaces and lines. It executes programs or modules stored in the first memory 202 and calls data stored in the first memory 202 to perform various functions of the electronic device 200 and process data.
[0134] Figure 5 shows only an electronic device with certain components; those skilled in the art will understand that the structure shown in Figure 5 does not constitute a limitation on the electronic device 200, which may include fewer or more components than shown, combine certain components, or have a different component arrangement.
[0135] The knowledge distillation-based cross-architecture video action recognition program 203 stored in the first memory 202 of the electronic device 200 is a combination of multiple instructions, which, when run in the first processor 201, can achieve the following:
[0136] Choose teacher and student models belonging to different architectures;
[0137] The raw data from the video is obtained, and the raw data is preprocessed to obtain training data for training.
[0138] The same batch of training data is fed into the teacher model and the student model respectively. The intermediate layer features of the teacher model and the student model are extracted to construct a complementary feature distillation loss. The complementary feature distillation loss is as follows: when using the teacher model to distill the student model, cross attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features. The new teacher feature retains its own complete information as well as some key information contained in the student feature. This allows the student model to retain its own advantages while learning the advanced knowledge of the teacher model during feature distillation.
[0139] The output layer representations of the teacher model and the student model are extracted, and a soft-label distillation loss is constructed. The soft-label distillation loss is used to characterize the difference between the prediction results of the student model and the prediction results of the teacher model.
[0140] Construct a classification cross-entropy loss for the student model; the classification cross-entropy loss is used to characterize the difference between the student model's prediction results and the true labels;
[0141] The student model is trained using complementary feature distillation loss, soft label distillation loss, and classification cross-entropy loss. The trained student model is then used to identify actions in the video to be processed.
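The data preparation step listed above can be sketched as follows, assuming equal-interval frame sampling followed by normalization. Taking the midpoint of each equal-length segment is one possible reading of "equal intervals", and the names `sample_frame_indices` and `normalize_frame` as well as the `mean` and `std` values are hypothetical. Proportional scaling and data augmentation are omitted from this sketch.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return num_samples frame indices spaced at equal intervals."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    interval = total_frames / num_samples
    # Take the middle of each equal-length segment; clamp to the last frame.
    return [min(total_frames - 1, int(interval * i + interval / 2))
            for i in range(num_samples)]


def normalize_frame(pixels, mean, std):
    """Scale 0..255 pixel values to 0..1, then standardize."""
    return [(p / 255.0 - mean) / std for p in pixels]


# Example: pick 8 frames from a 32-frame clip.
indices = sample_frame_indices(total_frames=32, num_samples=8)
```

When a clip has fewer frames than requested, this sketch repeats frames rather than failing, which keeps the model input length fixed.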
[0142] Furthermore, if the modules / units integrated in the electronic device 200 are implemented as software functional units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
[0143] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0144] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0145] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A cross-architecture video action recognition method based on knowledge distillation, characterized in that it includes the following steps: selecting a teacher model and a student model belonging to different architectures; acquiring raw data from the video and preprocessing the raw data to obtain training data; feeding the same batch of training data into the teacher model and the student model respectively, extracting the intermediate layer features of the teacher model and the student model, and constructing a complementary feature distillation loss, wherein, when the teacher model is used to distill the student model, cross-attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features, the new teacher feature retaining its own complete information as well as some key information contained in the student feature, so that the student model retains its own advantages while learning the advanced knowledge of the teacher model during feature distillation; extracting the output layer representations of the teacher model and the student model and constructing a soft-label distillation loss, which characterizes the difference between the prediction results of the student model and those of the teacher model; constructing a classification cross-entropy loss for the student model, which characterizes the difference between the student model's prediction results and the true labels; and training the student model based on the complementary feature distillation loss, the soft-label distillation loss and the classification cross-entropy loss, and using the trained student model to identify the actions in the video to be processed.
The construction of the complementary feature distillation loss is specifically as follows: the features extracted by the teacher model and the student model at the $i$-th intermediate layer are denoted $F_i^T$ and $F_i^S$, respectively; the shape and dimensions of the student features are adjusted to match those of the teacher features; the adjusted teacher and student features of the $i$-th intermediate layer are denoted $\hat{F}_i^T$ and $\hat{F}_i^S$; the cross-attention of the teacher features to the student features, $\mathrm{CA}(\hat{F}_i^T, \hat{F}_i^S)$, is calculated; the original teacher features are added to the cross-attention to obtain the new teacher features: $$\tilde{F}_i^T = \hat{F}_i^T + \mathrm{CA}(\hat{F}_i^T, \hat{F}_i^S);$$ the complementary feature distillation loss is constructed, with the loss function defined as: $$L_{CFD} = \sum_{i=1}^{M} \lambda_i \left\| \hat{F}_i^S - \tilde{F}_i^T \right\|_2;$$ where $M$ denotes the number of intermediate layers, $\lambda_i$ denotes the feature distillation weight of the $i$-th intermediate layer, and $\|\cdot\|_2$ denotes the L2 norm, also known as the Euclidean distance, used to measure the difference between two feature vectors.
2. The cross-architecture video action recognition method based on knowledge distillation according to claim 1, characterized in that the teacher model uses a visual Transformer architecture, and the student model uses a CNN architecture.
3. The cross-architecture video action recognition method based on knowledge distillation according to claim 1, characterized in that acquiring raw data from the video and preprocessing the raw data specifically involves: sampling several frames at equal intervals from each video sample as the raw data for the model; scaling each sampled video frame proportionally; performing data augmentation; and tensorizing and normalizing the data to obtain the input for the model.
4. The cross-architecture video action recognition method based on knowledge distillation according to claim 1, characterized in that the calculation of the cross-attention of teacher features to student features specifically involves: based on the key-value attention mechanism, a query matrix $Q$ is first generated from the teacher features $\hat{F}_i^T$, and a key matrix $K$ and a value matrix $V$ are generated from the student features $\hat{F}_i^S$: $$Q = \hat{F}_i^T W_Q, \quad K = \hat{F}_i^S W_K, \quad V = \hat{F}_i^S W_V,$$ where $W_Q$, $W_K$ and $W_V$ denote the parameter weights of the linear mappings used to generate $Q$, $K$ and $V$, respectively; next, the cross-attention is calculated using the following formula: $$\mathrm{CA}(\hat{F}_i^T, \hat{F}_i^S) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V;$$ where $d$ denotes the channel dimension of the teacher features.
5. The cross-architecture video action recognition method based on knowledge distillation according to claim 1, characterized in that the construction of the soft-label distillation loss is specifically as follows: the output layer representations of the teacher model and the student model are denoted $z^T$ and $z^S$; the class probability distributions predicted by the teacher model and the student model, $p^T$ and $p^S$, are calculated and are called soft labels, where the probability value corresponding to each category $k$ is calculated using the following formula: $$p_k = \frac{\exp(z_k / \tau)}{\sum_{j=1}^{K} \exp(z_j / \tau)};$$ where $K$ is the total number of categories and $\tau$ is the temperature parameter used to smooth the output probability distribution; the soft-label distillation loss is constructed, with the loss function defined as: $$L_{KD} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{KL}\!\left(p_n^T \,\|\, p_n^S\right);$$ where $N$ denotes the number of input samples and $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL divergence, used to measure the difference between two probability distributions.
6. The cross-architecture video action recognition method based on knowledge distillation according to claim 1, characterized in that the construction of the classification cross-entropy loss of the student model is specifically as follows: the true label of the input sample is denoted $y$ and the prediction vector output by the student model is denoted $p^S$; the classification cross-entropy loss is constructed, with the loss function defined as: $$L_{CE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log p_{n,k}^S;$$ where $N$ is the number of input samples; $K$ is the total number of categories; $y_{n,k}$ takes the value 0 or 1, being 1 if the true category of sample $n$ is $k$ and 0 otherwise; and $p_{n,k}^S$ is the probability that sample $n$ is predicted as category $k$.
7. A cross-architecture video action recognition system based on knowledge distillation, characterized in that it applies the cross-architecture video action recognition method based on knowledge distillation according to any one of claims 1-6 and includes a model selection module, a data preparation module, a complementary feature distillation loss construction module, a soft label distillation loss construction module, a classification cross-entropy loss construction module, and a model training module. The model selection module is used to select teacher and student models belonging to different architectures. The data preparation module is used to acquire raw data from the video and preprocess the raw data to obtain training data. The complementary feature distillation loss construction module is used to input the same batch of training data into the teacher model and the student model respectively, extract the intermediate layer features of the teacher model and the student model, and construct the complementary feature distillation loss. The complementary feature distillation loss is specifically as follows: when distilling the student model using the teacher model, cross-attention is first used to allow the teacher model to fuse the local features of the student model to obtain a new teacher feature with both global and local features. The new teacher feature retains its own complete information as well as some key information contained in the student feature, so that during feature distillation, the student model can retain its own advantages while learning the advanced knowledge of the teacher model.
The soft-label distillation loss construction module is used to extract the output layer representations of the teacher model and the student model to construct the soft-label distillation loss; the soft-label distillation loss is used to characterize the difference between the prediction results of the student model and the prediction results of the teacher model. The classification cross-entropy loss construction module is used to construct the classification cross-entropy loss of the student model; the classification cross-entropy loss is used to characterize the difference between the prediction results of the student model and the true labels. The model training module is used to train a student model based on complementary feature distillation loss, soft label distillation loss, and classification cross-entropy loss, and then uses the trained student model to identify the actions in the video to be processed.
8. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores computer program instructions executable by the at least one processor, which, when executed by the at least one processor, enable the at least one processor to perform the knowledge distillation-based cross-architecture video action recognition method as described in any one of claims 1-6.
9. A computer-readable storage medium storing a program, characterized in that, when the program is executed by a processor, it implements the cross-architecture video action recognition method based on knowledge distillation as described in any one of claims 1-6.