Domain adaptive video classification method, apparatus, device, medium and product
By constructing a private network and maximizing the feature distribution distance, semantically irrelevant information features are extracted, which solves the problem of poor performance of video classification methods in transfer learning in different domains and achieves high-accuracy domain-adaptive video classification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN POWER SUPPLY BUREAU
- Filing Date
- 2023-04-17
- Publication Date
- 2026-06-23
AI Technical Summary
Existing video classification methods have poor transfer learning performance across different domains and struggle to effectively extract and utilize common semantic information, especially in video data containing a large amount of semantically irrelevant interference information.
Construct at least two private networks, extract semantically irrelevant information features, and extract common semantic information features by maximizing the feature distribution distance. Iteratively train the initial video classification model to obtain a domain-general target video classification model.
It improves the domain adaptability and accuracy of video classification, reduces the impact of semantically irrelevant information on classification, and enhances the transfer learning effect in different domains.
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a domain-adaptive video classification method, apparatus, device, medium, and product.
Background Technology
[0002] In recent years, unsupervised domain adaptation has attracted a lot of research attention. Its purpose is to learn a domain-independent feature representation so that models trained on labeled source domain datasets can still maintain good performance in unlabeled target domains with different distributions.
[0003] A video classification model can be obtained by training on a labeled source domain dataset. However, videos in the target domain typically follow a different feature distribution and are unlabeled, so the trained model struggles to classify them well. Transfer learning across domains is therefore required for video classification.
[0004] Current video classification methods for different domains are typically implemented using domain adaptation methods, which learn domain-independent feature representations within a neural network through adversarial learning. Adversarial learning methods add a domain discriminator with a gradient reversal layer to the feature extraction network. The domain discriminator determines which domain the extracted features come from, while the feature extractor learns to extract more common semantic information in order to confuse the domain discriminator.
[0005] However, for video data containing a large amount of semantically irrelevant interference, it is very difficult to learn the common semantic information between two domains, and the domain adaptation effect of traditional adversarial learning matching sample-level feature distribution is poor.
Summary of the Invention
[0006] Therefore, it is necessary to provide a video classification method, apparatus, computer device, computer-readable storage medium, and computer program product that can achieve domain-adaptive capabilities to address the aforementioned technical problems.
[0007] In a first aspect, this application provides a domain-adaptive video classification method. The method includes:
[0008] Obtain video input samples from the source domain and at least one target domain, and classify them based on the features of the source domain video input samples to obtain an initial video classification model;
[0009] Construct at least two private networks, which are used to obtain semantically irrelevant information features of video input samples from each domain respectively;
[0010] Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model; obtain the feature distribution distance between the video classification model and the features extracted by each private network; maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0011] The initial video classification model and each private network are trained iteratively. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0012] In one embodiment, at least two private networks are constructed, including:
[0013] Obtain the background data of the video input samples, and use the background data as a supervision signal for the reconstruction training of the private networks;
[0014] Perform reconstruction training on the video input samples from each domain through the private networks to obtain reconstructed background data;
[0015] Obtain the reconstruction loss between the background data and the reconstructed background data;
[0016] Minimize the reconstruction loss to obtain semantically irrelevant information features.
[0017] In one embodiment, the private network includes a video feature extractor and a reconstruction network. Performing reconstruction training on the video input samples from each domain through the private network to obtain reconstructed background data includes:
[0018] Obtain background features of the video input samples from each domain based on the video feature extractor;
[0019] Reconstruct the background features based on the reconstruction network to obtain the reconstructed background data;
[0020] Obtaining the reconstruction loss between the background data and the reconstructed background data includes:
[0021] Obtain the distance between the background data and the reconstructed background data;
[0022] Calculate the reconstruction loss from that distance using a loss function based on a distance metric.
[0023] In one embodiment, the initial video classification model includes a feature extractor, a domain discriminator, and a classifier. Classifying based on the features of the source domain video input samples to obtain the initial video classification model includes:
[0024] Features of source domain video input samples are obtained using a feature extractor;
[0025] Features are classified using a classifier;
[0026] Obtain the classification loss, which is used to iteratively train the initial video classification model and each private network;
[0027] Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, including:
[0028] Initial feature data of video input samples in the source domain and at least one target domain are obtained through a feature extractor;
[0029] Adversarial training is performed on the initial feature data using the domain discriminator to obtain the target feature data after adversarial training.
[0030] Perform video classification on the target feature data using the classifier;
[0031] Obtain the adversarial training loss of the domain discriminator. The adversarial training loss is used to iteratively train the initial video classification model and each private network.
[0032] In one embodiment, the method further includes:
[0033] Construct a feature source classifier, and determine the source identifier of the input features based on the feature source classifier. The source identifier is used to determine whether the source of the input features is the initial video classification model or a private network.
[0034] Obtain the source classification loss of the feature source classifier. The source classification loss is used to iteratively train the initial video classification model and each private network.
[0035] In one embodiment, the initial video classification model and each private network are trained iteratively. When the iteration stopping condition is met, a domain-general target video classification model is obtained, including:
[0036] The training loss is obtained based on the loss function, and the iteration stopping condition is derived from the training loss.
[0037] The gradient of the loss function is calculated using backpropagation based on the training loss, and the parameters of the initial video classification model and of each private network are updated accordingly.
[0038] When the training loss is stable and the iteration stopping condition is met, a domain-general target video classification model is obtained.
[0039] In a second aspect, this application also provides a video classification device capable of achieving domain-adaptive classification. The device includes:
[0040] The video classification module is used to acquire video input samples from the source domain and at least one target domain, and to classify them based on the features of the source domain video input samples to obtain an initial video classification model.
[0041] A private network module is used to construct at least two private networks, which are used to acquire semantically irrelevant information features of video input samples from different domains.
[0042] The mean difference module is used to obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, obtain the feature distribution distance between the video classification model and the features extracted by each private network, maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0043] The iterative training module is used to iteratively train the initial video classification model and each private network. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0044] In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:
[0045] Obtain video input samples from the source domain and at least one target domain, and classify them based on the features of the source domain video input samples to obtain an initial video classification model;
[0046] Construct at least two private networks, which are used to obtain semantically irrelevant information features of video input samples from each domain respectively;
[0047] Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model; obtain the feature distribution distance between the video classification model and the features extracted by each private network; maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0048] The initial video classification model and each private network are trained iteratively. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0049] In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the following steps:
[0050] Obtain video input samples from the source domain and at least one target domain, and classify them based on the features of the source domain video input samples to obtain an initial video classification model;
[0051] Construct at least two private networks, which are used to obtain semantically irrelevant information features of video input samples from each domain respectively;
[0052] Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model; obtain the feature distribution distance between the video classification model and the features extracted by each private network; maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0053] The initial video classification model and each private network are trained iteratively. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0054] In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:
[0055] Obtain video input samples from the source domain and at least one target domain, and classify them based on the features of the source domain video input samples to obtain an initial video classification model;
[0056] Construct at least two private networks, which are used to obtain semantically irrelevant information features of video input samples from each domain respectively;
[0057] Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model; obtain the feature distribution distance between the video classification model and the features extracted by each private network; maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0058] The initial video classification model and each private network are trained iteratively. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0059] The aforementioned domain-adaptive video classification method, apparatus, computer device, storage medium, and computer program product acquire video input samples from a source domain and at least one target domain. Based on the features of the source domain video input samples, an initial video classification model is obtained. At least two private networks are constructed, each used to acquire semantically irrelevant information features from video input samples of each domain. Feature data from the source domain and at least one target domain video input samples extracted by the initial video classification model are obtained. The feature distribution distance between the video classification model and the features extracted by each private network is obtained. The feature distribution distance is maximized, and the maximum mean difference is calculated to obtain common semantic information features. The initial video classification model and each private network are iteratively trained. When the iteration stopping condition is met, a domain-general target video classification model is obtained. Domain-adaptive characteristics are achieved based on the target video classification model. This video classification method constructs a private network and extracts semantically irrelevant information features. It then obtains the maximum mean difference between these semantically irrelevant information features and the feature data extracted by the initial video classification model. Maximizing this maximum mean difference—that is, maximizing the feature difference between semantically irrelevant information features and the feature data—facilitates the video classification model to acquire common semantic information features while ignoring semantically irrelevant features during the classification process. 
This reduces the extent to which semantically irrelevant information features in target domain videos prevent the initial video classification model from classifying those videos adaptively, and thus improves the domain adaptability of video classification during domain transfer. For video data dominated by semantically irrelevant interference information, the domain adaptation effect is good, and the accuracy and reliability of domain-adaptive video classification are high.
Attached Figure Description
[0060] Figure 1 is a diagram illustrating the application environment of a video classification method in one embodiment;
[0061] Figure 2 is a flowchart illustrating a video classification method in one embodiment;
[0062] Figure 3 is a schematic diagram illustrating the maximum mean difference of a video classification method in one embodiment;
[0063] Figure 4 is a schematic diagram of the structure of a private network in one embodiment;
[0064] Figure 5 is a schematic diagram of the structure of an initial video classification model in one embodiment;
[0065] Figure 6 is a flowchart illustrating a video classification method in another embodiment;
[0066] Figure 7 is a structural block diagram of a video classification device in one embodiment;
[0067] Figure 8 is a structural block diagram of a video classification device in another embodiment;
[0068] Figure 9 is an internal structure diagram of a computer device that is a server in one embodiment;
[0069] Figure 10 is an internal structure diagram of a computer device that is a terminal in one embodiment.
Detailed Implementation
[0070] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0071] The domain-adaptive video classification method provided in this application can be applied to the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process; it can be integrated into server 104 or hosted on a cloud or other network server. Terminal 102 acquires video input samples from a source domain and at least one target domain, and classifies based on the features of the source domain video input samples to obtain an initial video classification model. It constructs at least two private networks, which are used to acquire semantically irrelevant information features of the video input samples from each domain respectively. It acquires the feature data that the initial video classification model extracts from the video input samples of the source domain and at least one target domain, and acquires the feature distribution distance between the features extracted by the video classification model and those extracted by each private network. It maximizes the feature distribution distance and calculates the maximum mean difference to obtain common semantic information features. It iteratively trains the initial video classification model and each private network; when the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model. Terminal 102 can be, but is not limited to, a personal computer, laptop, smartphone, tablet, IoT device, or portable wearable device. IoT devices can be smart speakers, smart TVs, smart air conditioners, smart vehicle devices, etc. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted devices. Server 104 can be implemented as a standalone server or as a server cluster consisting of multiple servers.
[0072] In one embodiment, as shown in Figure 2, a domain-adaptive video classification method is provided. Taking its application to server 104 in Figure 1 as an example, the method includes the following steps:
[0073] Step 202: Obtain video input samples from the source domain and at least one target domain, and classify them based on the features of the source domain video input samples to obtain an initial video classification model.
[0074] Here, the source domain is a domain different from that of the test samples and has rich supervised annotation information; the target domain is the domain of the test samples and has no labels or only a few labels. For example, in transfer learning, the original samples from the source domain carry video category labels, while the original samples from the target domain carry none and follow a different distribution from the source domain samples.
[0075] For example, the input source and target domain video data are preprocessed by frame extraction and downsampling. Downsampling takes 16 frames at a fixed sampling frequency starting from a random position, yielding source and target domain video input samples of the same size. The initial video classification model is then obtained by iteratively training on the source domain video input samples for 3000 iterations of the video classification task.
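The downsampling step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the application: the function name `sample_frames`, the default stride of 2, and the clamping of indices for short videos are all assumptions introduced here.

```python
import numpy as np

def sample_frames(num_frames, clip_len=16, stride=2, rng=None):
    """Return indices of `clip_len` frames taken every `stride` frames
    from a random start position (stride value is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    span = (clip_len - 1) * stride + 1  # frames covered by one clip
    # Latest start that keeps the clip inside the video (0 if the video is short).
    max_start = max(num_frames - span, 0)
    start = int(rng.integers(0, max_start + 1))
    idx = start + stride * np.arange(clip_len)
    # Clamp so short videos repeat their last frame instead of failing.
    return np.minimum(idx, num_frames - 1)

indices = sample_frames(num_frames=300, rng=np.random.default_rng(0))
```

Because every clip has the same length regardless of the original video duration, source and target domain samples end up with the same size, as the paragraph above requires.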
[0076] Step 204: Construct at least two private networks, which are used to obtain semantically irrelevant information features of video input samples from each domain.
[0077] Among them, semantically irrelevant information features refer to background information in video classification that is closely related to the domain but irrelevant to semantics, and are manifested as interference features in video classification.
[0078] For example, for dynamic videos, static background features are the most important semantically irrelevant information features. After constructing a private network, different private networks can be trained through background reconstruction to obtain the background features of video input samples in the source and target domains respectively.
[0079] Step 206: Obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model; obtain the feature distribution distance between the video classification model and the features extracted by each private network; maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0080] The Maximum Mean Difference (MMD), more commonly called the maximum mean discrepancy, measures the distance between different feature distributions. A large MMD indicates that the sampled data come from clearly different distributions. Common semantic information features are features that are domain-independent but semantically relevant, that is, features related to classifying video categories.
[0081] For example, the initial video classification model extracts feature data from the video input samples in the source and target domains. This feature data, the background features of the source domain video input samples extracted by the private network, and the background features of the target domain video input samples have different feature distributions. The distance between these three feature distributions is maximized, with the maximum mean difference (MMD) calculated as follows:
[0082]
[0083] Here, $X_S$ and $X_T$ are the source and target domain video input samples, respectively, and $\phi(x_s)$ and $\phi(x_t)$ denote the corresponding kernel functions.
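For reference, the standard empirical form of this quantity, written with the symbols defined above, is $\mathrm{MMD}(X_S, X_T) = \lVert \frac{1}{|X_S|}\sum_{x_s \in X_S}\phi(x_s) - \frac{1}{|X_T|}\sum_{x_t \in X_T}\phi(x_t) \rVert_{\mathcal{H}}$. Whether the application uses exactly this form, or its square, cannot be confirmed from this text. The sketch below computes the usual biased estimate of the squared MMD with a Gaussian kernel; the kernel choice, the bandwidth `sigma`, and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for row-wise feature vectors."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def mmd_squared(x_s, x_t, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    k_ss = rbf_kernel(x_s, x_s, sigma).mean()
    k_tt = rbf_kernel(x_t, x_t, sigma).mean()
    k_st = rbf_kernel(x_s, x_t, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 8))
same = rng.normal(0.0, 1.0, size=(200, 8))     # same distribution as source
shifted = rng.normal(3.0, 1.0, size=(200, 8))  # clearly different distribution
```

Comparing `mmd_squared(source, same)` with `mmd_squared(source, shifted)` shows the behavior the text relies on: the value grows as the two feature distributions move apart.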
[0084] For example, Figure 3 illustrates the maximum mean difference in the video classification method. Maximizing the feature distribution distance makes the difference between the features as large as possible, so that background features from the source and target domains are effectively ignored in the feature data obtained from the initial video classification model, and the common semantic information features of both domains are obtained. The maximum mean difference loss function of the initial video classification model used while maximizing the feature distribution distance is expressed as follows:
[0085]
[0086] Here, the formula defines the maximum mean difference loss. The background feature distribution of the source domain is $d_{Sp}$, the background feature distribution of the target domain is $d_{Tp}$, and the feature data distribution of the initial video classification model is $d_{main}$. The maximum mean difference between $d_{Sp}$ and $d_{Tp}$ is $\mathrm{MMD}(d_{Sp}, d_{Tp})$, that between $d_{main}$ and $d_{Sp}$ is $\mathrm{MMD}(d_{main}, d_{Sp})$, and that between $d_{main}$ and $d_{Tp}$ is $\mathrm{MMD}(d_{main}, d_{Tp})$.
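The exact way the three pairwise terms are combined is not recoverable from this text. As a rough sketch only, the code below assumes the loss is the negated sum of the three MMD terms, so that minimizing the loss maximizes the distances. It uses the simplest MMD (identity feature map, which reduces to the distance between feature means) to stay short. The function names and the toy features are hypothetical.

```python
import numpy as np

def linear_mmd(a, b):
    """MMD with the identity feature map: distance between feature means."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def mmd_loss(f_main, f_sp, f_tp):
    """Negated sum of pairwise MMD terms (assumed form, not confirmed by the source)."""
    total = (
        linear_mmd(f_sp, f_tp)
        + linear_mmd(f_main, f_sp)
        + linear_mmd(f_main, f_tp)
    )
    return -total

rng = np.random.default_rng(0)
f_main = rng.normal(size=(64, 4))  # stand-in for main-model features
loss_overlapping = mmd_loss(f_main, f_main + 0.1, f_main - 0.1)
loss_separated = mmd_loss(f_main, f_main + 5.0, f_main - 5.0)
```

Under this assumed form, `loss_separated` is lower than `loss_overlapping`, so gradient descent on the loss pushes the main features away from both private-network background features.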
[0087] Step 208: Iteratively train the initial video classification model and each private network. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0088] Iterative training is the process of training a video classification model based on an iterative algorithm. Iteration is an activity of repeated feedback. An iterative algorithm starts from a certain value and continuously calculates the result of the next step based on the result of the previous step.
[0089] For example, the video classification model can be iteratively trained based on the loss function during the training process. When the loss function no longer changes, specifically when the loss function no longer decreases or when the loss function tends to stabilize, the video classification model converges, resulting in a target video classification model that can accurately classify videos in the target domain. Video classification can then be performed based on this target video classification model.
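The stopping rule described above, "the loss no longer decreases or tends to stabilize", can be sketched as a plateau check. The window length, the tolerance, and the synthetic loss curve are assumptions chosen for illustration; the source does not give concrete values.

```python
def has_converged(losses, window=5, tol=1e-4):
    """True when the loss varied by less than `tol` over the last `window` steps."""
    if len(losses) < window + 1:
        return False
    recent = losses[-(window + 1):]
    return max(recent) - min(recent) < tol

history = []
for step in range(1000):
    loss = 1.0 / (step + 1) ** 2  # stand-in for a real training loss
    history.append(loss)
    if has_converged(history):
        break  # iteration stopping condition met
```

In a real training loop the `loss` line would be replaced by one optimization step on the combined training loss.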
[0090] The aforementioned domain-adaptive video classification method acquires video input samples from a source domain and at least one target domain, and classifies based on the features of the source domain video input samples to obtain an initial video classification model. At least two private networks are constructed, each used to acquire semantically irrelevant information features from the video input samples of its domain. Feature data extracted by the initial video classification model from the video input samples of the source domain and at least one target domain are obtained, together with the feature distribution distance between the features extracted by the video classification model and those extracted by each private network. This feature distribution distance is maximized, and the maximum mean difference is calculated to obtain common semantic information features. The initial video classification model and each private network are iteratively trained; when the iteration stopping condition is met, a domain-general target video classification model is obtained, and domain-adaptive video classification is performed based on it. By constructing private networks that extract semantically irrelevant information features, and by maximizing the maximum mean difference between those features and the feature data extracted by the initial video classification model, the method helps the video classification model acquire common semantic information features and ignore semantically irrelevant ones during classification. This reduces the extent to which semantically irrelevant information features in target domain videos prevent those videos from being classified adaptively by the initial video classification model, and improves the domain adaptability of video classification in domain transfer. For video data dominated by semantically irrelevant interference information, the domain adaptation effect is good, and domain-adaptive video classification is accurate and reliable.
[0091] In one embodiment, at least two private networks are constructed, including: acquiring background data of video input samples, using the background data as a supervision signal for reconstruction training of the private networks; performing reconstruction training on video input samples from various domains through the private networks to obtain reconstructed background data; acquiring the reconstruction loss between the background data and the reconstructed background data; and minimizing the reconstruction loss to obtain semantically irrelevant information features.
[0092] In this context, the supervision signal is the expected output value for a video input sample in supervised learning. Image reconstruction (IR) aims to reconstruct an image from information extracted from the ground-truth image. In supervised learning, labeled data take the form (x, t), where x is the input data and t is the label; the correct label t is the ground truth. The reconstruction loss is the loss function used in reconstruction: it measures the difference between the predicted value and the true value in the private network's image reconstruction. Loss functions include distance-based loss functions and probability-distribution-based loss functions.
[0093] For example, as shown in the schematic diagram of the private network in Figure 4, a temporal median filter (TMF) can be applied to the source and target domain video input samples to obtain source and target domain background data. The source domain background data and source domain video input samples are then input into the source domain private network, and the target domain background data and target domain video input samples are input into the target domain private network. The background data serves as the supervision signal: it is the expected output value during the private network's reconstruction training, that is, the ground truth. The reconstruction output of the private network is compared with the background data to calculate the reconstruction loss. Minimizing this reconstruction loss during training reduces the difference between the background data and the reconstructed background data, enabling the private network to learn semantically irrelevant information features.
[0094] In this embodiment, a temporal median filter is used to obtain the background data of the video input sample. The temporal median filter is a simple, intuitive and fast method for extracting video backgrounds. The loss function is used in the training phase of the model. After the loss value between the predicted value and the true value is obtained from a single training step, the parameters of the private network are updated in the direction that minimizes the loss value. This reduces the loss between the true value and the predicted value, moving the predicted value generated by the model closer to the true value. In this way, the private network learns the semantically irrelevant information features of the video input samples in the source and target domains.
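A temporal median filter takes, for each pixel, the median value across the frames of a clip. A moving object covers any given pixel in only a few frames, so the median recovers the static background. The following is a minimal sketch on a synthetic clip; the function name and the toy data are assumptions, not part of the application.

```python
import numpy as np

def temporal_median_background(frames):
    """Per-pixel median over the time axis of a (T, H, W) or (T, H, W, C) clip."""
    return np.median(np.asarray(frames), axis=0)

# Synthetic clip: a constant background with a bright square that moves each frame.
background = np.full((32, 32), 50.0)
clip = np.repeat(background[None], 16, axis=0)
for t in range(16):
    clip[t, 8:12, 2 * t : 2 * t + 4] = 255.0

estimate = temporal_median_background(clip)
```

Here each pixel is covered by the square in at most two of the sixteen frames, so `estimate` equals `background`. In the method described above, such an estimate would serve as the supervision signal for the private network's reconstruction training.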
[0095] In one embodiment, the private network includes a video feature extractor and a reconstruction network. Performing reconstruction training on video input samples from each domain through the private network to obtain reconstructed background data includes: obtaining background features of the video input samples from each domain based on the video feature extractor; and reconstructing the background features based on the reconstruction network to obtain the reconstructed background data. Obtaining the reconstruction loss between the background data and the reconstructed background data includes: obtaining the distance between the background data and the reconstructed background data; and calculating the reconstruction loss from that distance using a loss function based on a distance metric.
[0096] The video feature extractor is used to extract features from the video input samples. The reconstruction network is used to reconstruct the background based on the extracted features.
[0097] For example, Figure 4 shows the structure of a private network. The source domain private network includes a source domain video feature extractor $F_{Sp}$ and a source domain reconstruction network, and the target domain private network includes a target domain video feature extractor $F_{Tp}$ and a target domain reconstruction network. The video feature extractor extracts the background features of the video input sample. Based on these background features, the reconstruction network performs image reconstruction to obtain the predicted reconstructed background data. The L2 loss of the source and target domains is calculated from the distance between the reconstructed background data and the background data. The L2 loss, which is based on the Euclidean distance, is a commonly used distance metric, usually used to measure the similarity between data points. The formula for calculating the source domain reconstruction loss is expressed as:
[0098] $$\mathcal{L}_{rec}^{S} = \lVert b_{Sp} - b_{S} \rVert_2^2$$
[0099] The formula for calculating the target domain reconstruction loss is as follows:
[0100] $$\mathcal{L}_{rec}^{T} = \lVert b_{Tp} - b_{T} \rVert_2^2$$
[0101] The private network reconstruction loss is the sum of the two reconstruction losses, calculated as follows:
[0102] $$\mathcal{L}_{rec} = \mathcal{L}_{rec}^{S} + \mathcal{L}_{rec}^{T}$$
[0103] Here, $b_{Sp}$ is the reconstructed source domain background data, $b_{S}$ is the source domain background data, and $\mathcal{L}_{rec}^{S}$ is the L2 loss of the source domain; $b_{Tp}$ is the reconstructed target domain background data, $b_{T}$ is the target domain background data, and $\mathcal{L}_{rec}^{T}$ is the L2 loss of the target domain; $\mathcal{L}_{rec}$ is the reconstruction loss of the private networks.
[0104] In this embodiment, the distance between the true value of the video input sample and the predicted value of the private network in the feature space is measured by a loss function based on distance metric. The smaller the distance between two points in the feature space, the better the prediction performance of the private network. Moreover, the curve of the L2 loss function is flat enough when approaching the target, so this characteristic can be used to gradually and slowly converge towards the target, which is suitable for image processing.
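The source domain and target domain reconstruction losses and their sum can be sketched as follows. This is a minimal sketch that assumes the L2 loss is the squared Euclidean distance summed over all pixels; the function names are hypothetical, and the exact normalization used in this application is not stated.

```python
import numpy as np

def l2_reconstruction_loss(reconstructed, background) -> float:
    """Squared Euclidean distance between a reconstruction and its target.

    Assumption: the loss is summed over all pixels and channels.
    """
    diff = np.asarray(reconstructed, dtype=float) - np.asarray(background, dtype=float)
    return float(np.sum(diff ** 2))

def private_reconstruction_loss(b_sp, b_s, b_tp, b_t) -> float:
    """Sum of the source domain and target domain reconstruction losses."""
    return l2_reconstruction_loss(b_sp, b_s) + l2_reconstruction_loss(b_tp, b_t)

# Toy backgrounds of shape (h, w, c) = (2, 2, 3).
b_s, b_sp = np.zeros((2, 2, 3)), np.ones((2, 2, 3))
b_t, b_tp = np.zeros((2, 2, 3)), 2.0 * np.ones((2, 2, 3))
print(private_reconstruction_loss(b_sp, b_s, b_tp, b_t))
```

A squared distance has a gradient that shrinks as the reconstruction approaches the target, which matches the flat behavior near the target described above.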
[0105] In one embodiment, the initial video classification model includes a feature extractor, a domain discriminator, and a classifier. The initial video classification model is obtained by classifying features from source domain video input samples, including: acquiring features from source domain video input samples using the feature extractor; classifying the features using the classifier; obtaining a classification loss, which is used to iteratively train the initial video classification model and each private network; and acquiring feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, including: obtaining initial feature data of video input samples from the source domain and at least one target domain using the feature extractor; performing adversarial training on the initial feature data using the domain discriminator to obtain adversarially trained target feature data; obtaining video classification of the target feature data using the classifier; and obtaining the adversarial training loss of the domain discriminator, which is used to iteratively train the initial video classification model and each private network.
[0106] Adversarial training refers to a training process where both the domain discriminator and the image classifier receive inputs from features extracted by the feature extractor during the initial training of the video classification model. The domain discriminator maximizes the domain discrimination loss to confuse the target domain video input data with the source domain video input data, while the image classifier minimizes the image classification loss to achieve accurate image classification. The domain discriminator consists of a gradient inversion layer and two fully connected layers, used to determine whether the features extracted by the feature extractor come from the source domain or the target domain. The gradient of the domain discriminator loss function is opposite in direction to the gradient of the image classification loss function. The gradient inversion layer automatically inverts the gradient of the domain discriminator loss before it propagates back to the parameters of the feature extractor, thus achieving adversarial training. The classifier refers to the image classifier, used to classify the video based on the features extracted from the video input samples.
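The behavior of the gradient inversion layer can be illustrated with a minimal sketch that writes the forward and backward passes by hand. The class name and the `scale` parameter are hypothetical; in practice the layer would be implemented inside an automatic differentiation framework.

```python
import numpy as np

class GradientInversionLayer:
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    def __init__(self, scale: float = 1.0):
        # Hypothetical scaling factor; this application does not specify one.
        self.scale = scale

    def forward(self, features: np.ndarray) -> np.ndarray:
        # Features reach the domain discriminator unchanged.
        return features

    def backward(self, grad_from_discriminator: np.ndarray) -> np.ndarray:
        # The gradient of the domain discrimination loss is inverted before
        # it propagates back to the feature extractor parameters.
        return -self.scale * grad_from_discriminator

layer = GradientInversionLayer()
features = np.array([0.5, -1.0, 2.0])
out = layer.forward(features)
grad_to_extractor = layer.backward(np.array([0.1, 0.2, -0.3]))
print(out, grad_to_extractor)
```

With this sign flip, a single backward pass lets the domain discriminator reduce its own loss while the feature extractor is pushed toward features that the discriminator cannot separate.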
[0107] For example, as shown in the schematic diagram of the initial video classification model in Figure 5, the feature extractor obtains features from the source domain video input samples, which carry video classification labels. Based on the extracted features, the classifier is trained on the video classification task for 3000 iterations to obtain the initial video classification model, which can accurately classify source domain videos that have video classification labels. The difference between the output value and the true value of the classified video input samples during the video classification task yields the video classification loss. The formula for calculating the video classification loss function is expressed as follows:
[0108] $$\mathcal{L}_{cls} = -\sum_{x \in X_S} y \cdot \log \sigma(C(F(x)))$$
[0109] Here, $\mathcal{L}_{cls}$ is the video classification loss; $x$ is an input sample, where $x \in X_S$ is a source domain video input sample; $y$ is the source domain video category label, which is also the true value; $\sigma$ is the softmax function; and $\sigma(C(F(x)))$ is the probability value calculated by the softmax function after the classifier classifies the source domain video input sample.
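A classification loss built from softmax probabilities and one-hot category labels can be sketched as below. The sketch assumes a cross-entropy form averaged over the batch; the function names and the averaging are assumptions and are not confirmed by this application.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classification_loss(logits: np.ndarray, one_hot_labels: np.ndarray) -> float:
    """Cross-entropy between softmax(C(F(x))) and the one-hot label y."""
    probs = softmax(logits)
    # A small constant guards against log(0).
    per_sample = -np.sum(one_hot_labels * np.log(probs + 1e-12), axis=-1)
    return float(np.mean(per_sample))

labels = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
uninformed = classification_loss(np.zeros((2, 4)), labels)
confident = classification_loss(10.0 * labels, labels)
print(uninformed, confident)
```

Uniform classifier outputs over four classes give a loss of about log 4, while outputs that strongly favor the labeled class give a loss close to zero.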
[0110] Initial feature data of source and target domain video input samples is extracted using a feature extractor. A gradient inversion layer for a domain discriminator exists between the feature extractor and the classifier. Maximizing the loss value for domain discrimination and minimizing the loss value for video classification enables adversarial training of the initial feature data, resulting in target feature data that confuses the source and target domains. The source and target domain input videos can then be classified based on this target feature data, yielding the loss value for adversarial training. The formula for calculating the adversarial training loss function is as follows:
[0111] $$\mathcal{L}_{adv} = -\sum_{x \in X_S \cup X_T} y_d \cdot \log \sigma(D(F(x)))$$
[0112] Here, $\mathcal{L}_{adv}$ is the adversarial training loss, and $y_d$ is a two-dimensional vector representing the domain label, that is, the true domain of the input video sample. When the input $x \in X_S$ is an original sample from the source domain, $y_d = \langle 1, 0 \rangle$; when the input $x \in X_T$ is an original sample from the target domain, $y_d = \langle 0, 1 \rangle$. $\sigma$ is the softmax function, and $\sigma(D(F(x)))$ is the probability value calculated by the softmax function from the output of the domain discriminator $D$ after it performs domain discrimination on the source and target domain video input samples.
[0113] In this embodiment, video classification task training and adversarial training of the initial video classification model were implemented.
[0114] In one embodiment, the method further includes: constructing a feature source classifier, determining the source identifier of the input feature based on the feature source classifier, wherein the source identifier is used to determine whether the source of the input feature is an initial video classification model or a private network; obtaining the source classification loss of the feature source classifier, wherein the source classification loss is used to iteratively train the initial video classification model and each private network.
[0115] For example, the features extracted by the feature extractor of the initial video classification model and by the video feature extractors of the private networks are obtained. These features carry a source identifier, which indicates whether the input features come from the feature extractor $F$ of the initial video classification model, the source domain video feature extractor $F_{Sp}$ of the source domain private network, or the target domain video feature extractor $F_{Tp}$ of the target domain private network. The extracted features are input to the feature source classifier, which determines from them whether the feature source is the initial video classification model or a private network. The source classification loss is calculated based on the output value and the ground truth value. The formula for calculating the source classification loss function is expressed as:
[0116] $$\mathcal{L}_{src} = -\sum_{f} y_N \cdot \log \sigma(C_N(f))$$
[0117] Here, $\mathcal{L}_{src}$ is the source classification loss; $y_N$ is the source identifier indicating the feature extractor $F$, the source domain video feature extractor $F_{Sp}$, or the target domain video feature extractor $F_{Tp}$; $f$ is any input extracted feature; and $\sigma(C_N(f))$ is the probability value calculated by the softmax function from the output of the feature source classifier $C_N$ after it judges the input features.
[0118] In this embodiment, by adding a feature source classifier, the features extracted by the video classification model and the private network are distinguished, thereby enhancing the difference in the feature content obtained after training.
[0119] In one embodiment, the method further includes: iteratively training an initial video classification model and each private network, and obtaining a domain-general target video classification model when the iteration stopping condition is met, including: obtaining the training loss based on the loss function, obtaining the iteration stopping condition based on the training loss; calculating the gradient of the loss function based on backpropagation of the training loss, and updating the loss function; and obtaining a domain-general target video classification model when the training loss is stable and the iteration stopping condition is met.
[0120] Backpropagation, short for "error backpropagation," is a common method used in conjunction with optimization techniques to train artificial neural networks. This method calculates the gradient of the loss function for all weights in the network, and this gradient is fed back to the optimization method to update the weights and minimize the loss function.
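[0121] For example, the video classification loss $\mathcal{L}_{cls}$, the adversarial training loss $\mathcal{L}_{adv}$, the reconstruction loss $\mathcal{L}_{rec}$, the maximum mean difference loss $\mathcal{L}_{mmd}$, and the source classification loss $\mathcal{L}_{src}$ are obtained. The initial video classification model is iteratively trained based on the total loss function, which is calculated using the following formula:
[0122] $$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{adv} + \mathcal{L}_{rec} + \mathcal{L}_{mmd} + \mathcal{L}_{src}$$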
[0123] The gradient points in the direction in which the directional derivative of the function is largest, so gradient descent updates the parameters in the direction opposite to the gradient. Backpropagation calculates the gradient of the loss function, and combined with the stochastic gradient descent (SGD) optimization algorithm, the initial video classification model can be iteratively trained to minimize the loss function. That is, the gradient of the obtained total loss function is calculated by backpropagation and fed back to the SGD optimization algorithm, which updates the model parameters of the initial video classification model so that the total loss decreases. Iterative training of the initial video classification model continues until the total loss function stabilizes at its minimum. The model gradually converges, yielding the target video classification model with the lowest total loss, which indicates high accuracy in video classification. Videos in the target domain can therefore also be classified accurately by the video classification model.
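The update rule can be illustrated with a minimal sketch of stochastic gradient descent on a toy problem. The one-parameter model, the data, the batch size, and the learning rate of 0.05 are assumptions chosen for illustration and are unrelated to the settings of this application.

```python
import numpy as np

def sgd_step(params: dict, grads: dict, learning_rate: float) -> dict:
    """Move every parameter a small step against its gradient."""
    return {name: value - learning_rate * grads[name] for name, value in params.items()}

# Toy problem (assumption): fit y = 3x with a mean squared error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = 3.0 * x

params = {"w": np.array(0.0)}
losses = []
for step in range(200):
    batch = rng.choice(256, size=32, replace=False)  # random mini-batch
    residual = params["w"] * x[batch] - y[batch]
    losses.append(float(np.mean(residual ** 2)))
    # Gradient of the mean squared error with respect to w.
    grads = {"w": np.mean(2.0 * residual * x[batch])}
    params = sgd_step(params, grads, learning_rate=0.05)

print(float(params["w"]), losses[0], losses[-1])
```

The loss falls from its initial value toward zero as the parameter approaches 3, which mirrors the stopping rule described above: training ends once the loss has stabilized.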
[0124] Figure 6 is a flowchart of a domain-adaptive video classification method in another embodiment. This domain-adaptive video classification method includes the following steps:
[0125] Step 602: Obtain the original source and target domain video data, perform video frame extraction and downsampling on the video data to obtain source and target domain video input samples.
[0126] Image frames are extracted from the original video data to obtain an RGB video frame sequence, and video frames are then sampled from that sequence. For the original source and target domain videos used to train the initial video classification model, sampling starts from a random position in the RGB frame sequence and takes one frame every four frames as input data. Each sample consists of t frames and serves as a video input sample for training the initial video classification model.
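The sampling scheme can be sketched as follows. The sketch assumes that "one frame every four frames" means a stride of four between sampled frame indices; the function name `sample_clip_indices` and the 300-frame video are hypothetical.

```python
import numpy as np

def sample_clip_indices(num_frames: int, clip_len: int, stride: int = 4, rng=None) -> np.ndarray:
    """Pick clip_len frame indices, stride apart, starting at a random position."""
    rng = rng if rng is not None else np.random.default_rng()
    span = (clip_len - 1) * stride + 1  # number of frames covered by one clip
    if num_frames < span:
        raise ValueError("video is too short for the requested clip")
    # The upper bound is exclusive, so the last index stays inside the video.
    start = int(rng.integers(0, num_frames - span + 1))
    return start + stride * np.arange(clip_len)

# Example: a 300-frame video and clips of t = 16 frames.
indices = sample_clip_indices(num_frames=300, clip_len=16, rng=np.random.default_rng(0))
print(indices)
```

Indexing the decoded RGB frame sequence with these indices yields one t × h × w × c video input sample.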
[0127] Step 604: Construct an initial video classification model, including a feature extractor, a domain discriminator, and a classifier. Obtain the source domain features of the source domain video input samples based on the feature extractor. Classify the videos with the classifier using the obtained source domain features, and obtain the video classification loss function.
[0128] The initial video classification model was pre-trained using source domain video input samples with video classification labels. The initial video classification model adopted the I3D video classification model, and was initially pre-trained for 3000 iterations using only source domain data. The pre-training used the stochastic gradient descent (SGD) optimization algorithm with a learning rate of 0.001.
[0129] Step 606: Construct private networks for the source and target domains. The private networks include a video feature extractor and a reconstruction network. The private networks are used to obtain semantically irrelevant information features of video input samples from each domain.
[0130] Step 608: Obtain the background images of the source and target domain video input samples through a temporal median filter. The background images are used as supervision signals for the reconstruction training of the private network.
[0131] The background image of the video input sample is extracted using a temporal median filter with fixed parameters. The dimensions of the input sample RGB frame sequence are time × height × width × number of channels (t × h × w × c). The extracted background image has the dimensions height × width × number of channels (h × w × c) and no time dimension.
[0132] Step 610: Obtain the background features of the source and target domain video input samples through the video feature extractors of each private network, and obtain the reconstructed background images of the source and target domain video input samples through the reconstruction network of each private network.
[0133] Step 612: Obtain the L2 loss between the source domain background image and the reconstructed source domain background image, and between the target domain background image and the reconstructed target domain background image, to obtain the reconstruction losses of the source domain and target domain; obtain the L2 loss function of the reconstruction loss, minimize the L2 loss function, and obtain the semantically irrelevant information features of the video input samples of the source domain and target domain.
[0134] The source domain video input samples are fed into the source domain private network for background reconstruction training to learn the semantically irrelevant information features of the source domain, and the source domain reconstruction loss is calculated. The target domain video input samples are fed into the target domain private network, and the target domain reconstruction loss is calculated. The reconstruction training loss of the private networks is the sum of these two reconstruction losses.
[0135] Step 614: Obtain initial features of the source and target domain video input samples using the feature extractor; perform adversarial training on the initial features using the domain discriminator to obtain target features; obtain the classification of the target features using the classifier; and obtain the adversarial training loss function.
[0136] The initial video classification model has a feature dimension of 1024. The domain discriminator consists of a gradient inversion layer and a two-layer fully connected classifier. The input is a 1024-dimensional feature vector, the hidden layer dimension is 100, and the output is a 2-dimensional vector. Labeled source domain samples and unlabeled target domain samples are used as input data to train the main network's feature extractor's ability to extract domain-independent features.
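The forward pass of a discriminator with these dimensions (1024 → 100 → 2) can be sketched as below. The random weights, the ReLU activation, and the cross-entropy loss averaged over the batch are assumptions for illustration; the gradient inversion layer is omitted because it acts as the identity in the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fully connected layers: 1024 -> 100 -> 2 (randomly initialized here).
W1, b1 = rng.normal(scale=0.01, size=(1024, 100)), np.zeros(100)
W2, b2 = rng.normal(scale=0.01, size=(100, 2)), np.zeros(2)

def domain_discriminator(features: np.ndarray) -> np.ndarray:
    """Return 2-dimensional domain logits for each 1024-dimensional feature."""
    hidden = np.maximum(features @ W1 + b1, 0.0)  # ReLU is an assumption
    return hidden @ W2 + b2

def domain_loss(logits: np.ndarray, domain_labels: np.ndarray) -> float:
    """Cross-entropy between softmax(logits) and one-hot domain labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.sum(domain_labels * log_probs, axis=1)))

# Stand-in features: 4 source domain samples and 4 target domain samples.
features = rng.normal(size=(8, 1024))
labels = np.array([[1, 0]] * 4 + [[0, 1]] * 4, dtype=float)  # <1,0> source, <0,1> target

logits = domain_discriminator(features)
loss = domain_loss(logits, labels)
print(logits.shape, loss)
```

An untrained discriminator produces nearly uniform domain probabilities, so the loss starts near log 2; features that confuse the two domains keep the loss close to that value.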
[0137] Step 616: Maximize the feature distribution distance between the reconstructed background image and the target features to obtain common semantic information features, and obtain the loss function that maximizes the feature difference value.
[0138] Maximize the MMD distance of the three feature distributions, and calculate the maximum mean difference loss. The kernel function in MMD uses multiple Gaussian kernels.
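A multi-kernel Gaussian MMD between the features of the initial video classification model and those of each private network can be sketched as follows. The bandwidths, the biased estimator, and the choice to negate the distance so that minimizing the loss maximizes the distance are all assumptions; this application does not state them.

```python
import numpy as np

def gaussian_kernel_sum(a: np.ndarray, b: np.ndarray, bandwidths) -> np.ndarray:
    """Sum of Gaussian kernels with several bandwidths between rows of a and b."""
    sq_dist = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives caused by rounding
    return sum(np.exp(-sq_dist / (2.0 * s ** 2)) for s in bandwidths)

def mmd_squared(x: np.ndarray, y: np.ndarray, bandwidths=(1.0, 2.0, 4.0)) -> float:
    """Biased estimate of the squared MMD between two feature sets."""
    k_xx = gaussian_kernel_sum(x, x, bandwidths).mean()
    k_yy = gaussian_kernel_sum(y, y, bandwidths).mean()
    k_xy = gaussian_kernel_sum(x, y, bandwidths).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

rng = np.random.default_rng(0)
shared = rng.normal(size=(32, 8))                  # stand-in for features from F
private_src = rng.normal(loc=3.0, size=(32, 8))    # stand-in for features from F_Sp
private_tgt = rng.normal(loc=-3.0, size=(32, 8))   # stand-in for features from F_Tp

distance = mmd_squared(shared, private_src) + mmd_squared(shared, private_tgt)
mmd_loss = -distance  # assumption: minimizing this term maximizes the distance
print(distance, mmd_loss)
```

Summing kernels of several widths makes the distance sensitive to differences at more than one scale, which is a common reason for using multiple Gaussian kernels.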
[0139] Step 618: Construct a feature source classifier. Input the target features and the background features obtained from the private network video feature extractors into the feature source classifier. Based on the feature source classifier, determine whether the source of the input features is the initial video classification model or a private network, and obtain the source classification loss function.
[0140] Step 620: Iteratively train the initial video classification model and each private network. When the iteration stopping condition is met, obtain the domain-general target video classification model, and perform video classification based on the target video classification model.
[0141] The total loss function is obtained by summing all the obtained loss functions. The gradient of the total loss function is calculated through backpropagation, and the stochastic gradient descent (SGD) optimization algorithm is used to update the parameters of the initial video classification model, with the learning rate set to 0.0001. The reconstruction training and adversarial training process is repeated. After 16,000 iterations, when the total loss function remains stable, the target video classification model is obtained. Based on the target video classification model, video classification of the source domain and the target domain can be achieved.
[0142] Step 622: Obtain test samples to test and train the target video classification model.
[0143] The original test video for the target video classification model is obtained. Samples are taken at five random positions in its RGB frame sequence as input data, and the average of the prediction results of the five samples is taken as the final prediction result. The dimension of each sample RGB frame sequence is time × height × width × number of channels (t × h × w × c). In this embodiment, t is 16, and h and w are 224.
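Averaging the predictions of five samples can be sketched as below. The sketch assumes that each sample yields a vector of class probabilities; the probability values and the function name `predict_video` are made up for illustration.

```python
import numpy as np

def predict_video(clip_probabilities):
    """Average class probabilities over clips; return (class index, mean probabilities)."""
    probs = np.asarray(clip_probabilities, dtype=float)
    mean_probs = probs.mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs

# Hypothetical per-clip probabilities for five clips and three classes.
clip_probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
])
label, mean_probs = predict_video(clip_probs)
print(label, mean_probs)
```

Here two of the five clips favor the second class on their own, but the averaged probabilities still select the first class, which shows how averaging smooths out clip-level disagreement.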
[0144] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0145] Based on the same inventive concept, this application also provides a domain-adaptive video classification apparatus for implementing the domain-adaptive video classification method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in the following embodiments of the domain-adaptive video classification apparatus can be found in the limitations of the domain-adaptive video classification method described above, and will not be repeated here.
[0146] In one embodiment, as shown in Figure 7, a domain-adaptive video classification device 700 is provided, comprising: a video classification module 702, a private network module 704, a mean-difference module 706, and an iterative training module 708, wherein:
[0147] The video classification module 702 is used to acquire video input samples from the source domain and at least one target domain, and to obtain the initial video classification model by classifying based on the features of the source domain video input samples;
[0148] Private network module 704 is used to construct at least two private networks, which are used to acquire semantically irrelevant information features of video input samples from each domain respectively;
[0149] The mean difference module 706 is used to obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, obtain the feature distribution distance between the video classification model and the features extracted by each private network, maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.
[0150] The iterative training module 708 is used to iteratively train the initial video classification model and each private network. When the iteration stopping condition is met, a domain-general target video classification model is obtained, and video classification is performed based on the target video classification model.
[0151] In one embodiment, the private network module 704 is further configured to construct at least two private networks, including: acquiring background data of video input samples, using the background data as a supervision signal for reconstruction training of the private networks; performing reconstruction training on video input samples from various domains through the private networks to obtain reconstructed background data; acquiring the reconstruction loss between the background data and the reconstructed background data; and minimizing the reconstruction loss to obtain semantically irrelevant information features.
[0152] In one embodiment, the private network module 704 is further configured to include a video feature extractor and a reconstruction network in the private network, and to perform reconstruction training on video input samples from various domains through the private network to obtain reconstructed background data, including: obtaining background features of video input samples from various domains based on the video feature extractor; reconstructing the background features based on the reconstruction network to obtain reconstructed background data; and obtaining the reconstruction loss between the background data and the reconstructed background data, including: obtaining the distance between the background data and the reconstructed background data; and calculating the reconstruction loss using a loss function based on a distance metric and the distance.
[0153] In one embodiment, the mean difference module 706 is further used to obtain the initial video classification model, which includes a feature extractor, a domain discriminator, and a classifier, by classifying the features of source domain video input samples. This includes: obtaining features of source domain video input samples through the feature extractor; classifying the features through the classifier; obtaining a classification loss, which is used to iteratively train the initial video classification model and each private network; and obtaining feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model. This includes: obtaining initial feature data of video input samples from the source domain and at least one target domain through the feature extractor; performing adversarial training on the initial feature data according to the domain discriminator to obtain adversarially trained target feature data; obtaining video classification of the target feature data according to the classifier; and obtaining the adversarial training loss of the domain discriminator, which is used to iteratively train the initial video classification model and each private network.
[0154] In one embodiment, as shown in Figure 8, the device also includes a source classification module 810, which is used to construct a feature source classifier and determine the source identifier of the input feature based on the feature source classifier, wherein the source identifier is used to determine whether the source of the input feature is the initial video classification model or a private network; and to obtain the source classification loss of the feature source classifier, wherein the source classification loss is used to iteratively train the initial video classification model and each private network.
[0155] In one embodiment, the iterative training module 708 is further used to iteratively train the initial video classification model and each private network, and obtain a domain-general target video classification model when the iteration stopping condition is met, including: obtaining the training loss based on the loss function, obtaining the iteration stopping condition based on the training loss; calculating the gradient of the loss function based on backpropagation of the training loss, and updating the loss function; and obtaining a domain-general target video classification model when the training loss is stable and the iteration stopping condition is met.
[0156] The modules in the aforementioned adaptive video classification device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0157] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 9. This computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores video classification data. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When executed by the processor, the computer program implements a domain-adaptive video classification method.
[0158] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 10. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements a domain-adaptive video classification method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.
[0159] Those skilled in the art will understand that the aforementioned structure is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than shown in the figure, or combine certain components, or have different component arrangements.
[0160] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method embodiments.
[0161] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method embodiments.
[0162] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the above-described method embodiments.
[0163] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0164] Those skilled in the art will understand that implementing all or part of the processes in the above embodiments can be accomplished by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory may include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0165] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0166] The above embodiments merely illustrate several implementations of this application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A domain-adaptive video classification method, characterized in that the method comprises:
obtaining video input samples from a source domain and at least one target domain, and performing classification based on features of the source-domain video input samples to obtain an initial video classification model;
constructing at least two private networks that receive the video input samples from the source domain and the at least one target domain, wherein the private networks are respectively used to obtain semantically irrelevant information features of the video input samples of each domain;
obtaining feature data of the video input samples of the source domain and the at least one target domain extracted by the initial video classification model; obtaining the feature distribution distance between the features extracted by the initial video classification model and those extracted by each private network; and maximizing the feature distribution distance and calculating the maximum mean difference to obtain common semantic information features;
iteratively training the initial video classification model and each private network, obtaining a domain-general target video classification model when the iteration stopping condition is met, and performing video classification based on the target video classification model;
wherein constructing the at least two private networks comprises:
obtaining background data of the video input samples, and using the background data as a supervision signal for reconstruction training of the private networks;
performing reconstruction training on the video input samples of each domain through the private networks to obtain reconstructed background data;
obtaining the reconstruction loss between the background data and the reconstructed background data; and
minimizing the reconstruction loss to obtain the semantically irrelevant information features.
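The feature distribution distance in claim 1 can be illustrated with a small numerical sketch. The claim's "maximum mean difference" appears to correspond to what the literature usually calls maximum mean discrepancy (MMD). The Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the names `rbf_kernel` and `mmd_squared` are assumptions made for illustration; none of them is taken from the claims.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical feature batches: 64 samples, 8 dimensions each.
rng = np.random.default_rng(0)
shared = rng.normal(0.0, 1.0, size=(64, 8))        # stand-in for shared-model features
private_near = rng.normal(0.0, 1.0, size=(64, 8))  # private features, same distribution
private_far = rng.normal(3.0, 1.0, size=(64, 8))   # private features, shifted distribution

near_gap = mmd_squared(shared, private_near, sigma=2.0)
far_gap = mmd_squared(shared, private_far, sigma=2.0)
print(near_gap, far_gap)
```

In a training loop, one possible use is to subtract such a term from the total loss, so that minimizing the loss pushes the shared features away from each private network's features. How the term is weighted against the other losses is not specified in the claims.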
2. The method according to claim 1, characterized in that the private network comprises a video feature extractor and a reconstruction network;
performing reconstruction training on the video input samples of each domain through the private networks to obtain reconstructed background data comprises:
obtaining background features of the video input samples of each domain based on the video feature extractor; and
reconstructing the background features based on the reconstruction network to obtain the reconstructed background data;
obtaining the reconstruction loss between the background data and the reconstructed background data comprises:
obtaining the distance between the background data and the reconstructed background data; and
calculating the reconstruction loss from the distance using a loss function based on a distance metric.
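As a minimal sketch of the distance-based reconstruction loss in claim 2, the following uses the mean squared error between the background data and its reconstruction. The claim does not name a specific distance metric, so the squared L2 distance and the function name `reconstruction_loss` are assumptions.

```python
import numpy as np


def reconstruction_loss(background, reconstructed):
    """Mean squared error between the background and its reconstruction."""
    background = np.asarray(background, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if background.shape != reconstructed.shape:
        raise ValueError("background and reconstruction must have the same shape")
    return float(np.mean((background - reconstructed) ** 2))


# Toy 2x2 "background" and a reconstruction that is off by one everywhere.
target = np.zeros((2, 2))
estimate = np.ones((2, 2))
print(reconstruction_loss(target, estimate))
```

Other distance metrics, such as the mean absolute error, would fit the claim wording equally well.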
3. The method according to claim 1, characterized in that the initial video classification model comprises a feature extractor, a domain discriminator, and a classifier;
performing classification based on the features of the source-domain video input samples to obtain the initial video classification model comprises:
obtaining the features of the source-domain video input samples using the feature extractor;
classifying the features using the classifier; and
obtaining a classification loss, which is used in the iterative training of the initial video classification model and each private network;
obtaining the feature data of the video input samples of the source domain and the at least one target domain extracted by the initial video classification model comprises:
obtaining, by the feature extractor, initial feature data of the video input samples of the source domain and the at least one target domain;
performing adversarial training on the initial feature data based on the domain discriminator to obtain target feature data after adversarial training;
obtaining the video classification of the target feature data based on the classifier; and
obtaining the adversarial training loss of the domain discriminator, which is used in the iterative training of the initial video classification model and each private network.
4. The method according to claim 1, characterized in that the method further comprises:
constructing a feature source classifier, and determining a source identifier of an input feature based on the feature source classifier, wherein the source identifier indicates whether the input feature comes from the initial video classification model or from a private network; and
obtaining the source classification loss of the feature source classifier, which is used in the iterative training of the initial video classification model and each private network.
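The source classification loss in claim 4 could, for example, be a binary cross-entropy over the two possible sources. The sketch below assumes the feature source classifier outputs one logit per feature, with label 1 meaning "private network" and label 0 meaning "initial video classification model". The loss form, the label convention, and the name `source_classification_loss` are illustrative assumptions, not details from the claims.

```python
import numpy as np


def source_classification_loss(logits, from_private):
    """Binary cross-entropy on logits; label 1 = private network, 0 = shared model."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(from_private, dtype=np.float64)
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logits).
    per_sample = (
        np.maximum(logits, 0.0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_sample.mean())


# Confident and correct predictions give a small loss; swapped labels give a large one.
good = source_classification_loss([5.0, -5.0], [1, 0])
bad = source_classification_loss([5.0, -5.0], [0, 1])
print(good, bad)
```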
5. The method according to claim 1, characterized in that iteratively training the initial video classification model and each private network, and obtaining a domain-general target video classification model when the iteration stopping condition is met, comprises:
obtaining a training loss based on a loss function, and obtaining the iteration stopping condition based on the training loss;
calculating the gradient of the loss function by backpropagating the training loss, and updating the loss function accordingly; and
obtaining the domain-general target video classification model when the training loss is stable and the iteration stopping condition is met.
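Claim 5 stops training once the training loss is stable but does not define stability. One possible reading, sketched below, treats the loss as stable when its last few values vary by no more than a tolerance. The window size, the tolerance, and the names `loss_is_stable`, `train_until_stable`, and `step_fn` (a hypothetical callback standing in for one forward and backward pass) are all assumptions.

```python
def loss_is_stable(losses, window=5, tol=1e-3):
    """Return True when the last `window` losses vary by at most `tol`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) <= tol


def train_until_stable(step_fn, max_iters=1000, window=5, tol=1e-3):
    """Call step_fn() once per iteration until the loss is stable or max_iters is hit."""
    losses = []
    for _ in range(max_iters):
        losses.append(float(step_fn()))
        if loss_is_stable(losses, window, tol):
            break
    return losses


# Toy stand-in for a training step: the loss halves on every call.
decaying = iter(0.5 ** k for k in range(1000))
history = train_until_stable(lambda: next(decaying))
print(len(history), history[-1])
```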
6. The method according to claim 1, characterized in that the background data of the video input samples is obtained using a temporal median filter.
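A temporal median filter, as named in claim 6, takes the per-pixel median over the frames of a clip, which suppresses objects that occupy a pixel only briefly. The sketch below assumes the clip is stored as an array with time as the first axis; the function name `temporal_median_background` and the toy clip are illustrative.

```python
import numpy as np


def temporal_median_background(frames):
    """Per-pixel median over the time axis of a (T, H, W) or (T, H, W, C) clip."""
    frames = np.asarray(frames)
    if frames.ndim < 3:
        raise ValueError("expected a clip with time as the first axis")
    return np.median(frames, axis=0)


# Toy clip: a static background of value 10 and a bright object that
# sits on a different pixel in each of the first four frames.
clip = np.full((5, 4, 4), 10.0)
for t in range(4):
    clip[t, t, t] = 255.0

background = temporal_median_background(clip)
print(background)
```

Because each pixel is bright in at most one of the five frames, the median recovers the static background at every location.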
7. A domain-adaptive device, characterized in that the device comprises:
a video classification module, used to obtain video input samples from a source domain and at least one target domain, and to perform classification based on features of the source-domain video input samples to obtain an initial video classification model;
a private network module, used to construct at least two private networks, wherein the private networks are respectively used to obtain semantically irrelevant information features of the video input samples of each domain;
a mean difference module, used to obtain feature data of the video input samples of the source domain and the at least one target domain extracted by the initial video classification model, obtain the maximum mean difference of the feature distribution distance between the initial video classification model and each private network, and maximize the maximum mean difference to obtain common semantic information features; and
an iterative training module, used to iteratively train the initial video classification model and each private network, obtain a domain-general target video classification model when the iteration stopping condition is met, and perform video classification based on the target video classification model;
wherein, in constructing the at least two private networks, the private network module is further used to:
obtain background data of the video input samples, and use the background data as a supervision signal for reconstruction training of the private networks;
perform reconstruction training on the video input samples of each domain through the private networks to obtain reconstructed background data;
obtain the reconstruction loss between the background data and the reconstructed background data; and
minimize the reconstruction loss to obtain the semantically irrelevant information features.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.