A Weakly Supervised Video Temporal Action Localization Method Based on Slow-Motion Enhancement
By generating a slow-motion-related mask and enhanced features through a slow-motion-related mining module and a localization module, and combining them with a dual-stream localization network, the method addresses the inaccurate localization of slow-motion instances in existing techniques and achieves more accurate video temporal action localization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIHANG UNIV
- Filing Date
- 2022-08-24
- Publication Date
- 2026-06-30
AI Technical Summary
Existing weakly supervised temporal action localization methods struggle to localize slow-motion instances, especially in the slow-motion replay scenes that are common in videos, which results in insufficient localization accuracy.
A method consisting of a slow-motion-related mining module and a localization module is adopted. By generating a slow-motion-related mask and enhanced features and combining them with a dual-stream localization network, the ability to localize slow-motion instances is improved.
The method improves the localization accuracy of slow-motion instance segments and the overall accuracy of video temporal action localization, strengthening weakly supervised temporal action localization.
Figure CN115937733B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer multimedia technology, and more specifically to a weakly supervised video temporal action localization method based on slow-motion enhancement.
Background Technology
[0002] With the explosive growth of video content, learning from and understanding massive amounts of video data has become an active research direction in computer vision. Within this field, temporal action localization is a fundamental yet highly challenging task in video understanding, aiming to locate and classify action instances in untrimmed videos. With the rapid development of deep learning, many fully supervised temporal action localization algorithms have emerged in recent years and have demonstrated considerable performance. These fully supervised methods require precise temporal boundary annotations for each action instance during training. However, annotating the temporal boundaries of action instances in videos is extremely time-consuming and costly, and the quality of these annotations is often difficult to guarantee. Therefore, weakly supervised temporal action localization (W-TAL), which requires only video-level action category labels, is a more practical choice and has received widespread attention. Compared with precise temporal boundary annotations of action instances, video-level action category labels only describe which actions are present in the video; they are easier to collect and help avoid localization biases introduced by human annotators.
[0003] Although existing weakly supervised temporal action localization work has made significant progress, it neglects the fact that action instances occur at different speeds, in particular in slow motion. Slow motion, i.e., an action shown at a slower speed than normal, is very common in temporal action localization tasks, for example in the slow-motion replays frequently seen in sports videos. In the publicly available THUMOS'14 dataset, over 64.0% of the videos and 26.4% of the action instances contain slow-motion segments. Existing work typically samples video frames at a fixed rate to extract features and then processes these features to obtain the final prediction. With a fixed sampling rate, such methods primarily account for actions occurring at normal speed, i.e., normal motion, and overlook slow-motion instance segments. As a result, existing weakly supervised temporal action localization frameworks struggle to localize slow-motion instance segments effectively.
[0004] Therefore, how to enhance a localization network's ability to localize slow-motion instance segments, and thereby better accomplish the weakly supervised temporal action localization task, is a problem that urgently needs to be solved by those skilled in the art. To this end, the present invention proposes a weakly supervised video temporal action localization method based on slow-motion enhancement, namely a slow-motion enhanced localization network.
Summary of the Invention
[0005] In view of this, the present invention provides a weakly supervised video temporal action localization method based on slow-motion enhancement, which consists of a slow-motion-related mining module and a localization module. The slow-motion-related mining module takes the fused video features as input and outputs a slow-motion-related mask. The localization module takes the slow-motion-related mask output by the mining module as input and generates slow-motion enhanced features. Then, taking the slow-motion enhanced features and the fused video features as input, a dual-stream localization network generates the final temporal action localization result.
[0006] To achieve the above objectives, the present invention adopts the following technical solution:
[0007] A weakly supervised video temporal action localization method based on slow-motion enhancement includes: a slow-motion-related mining module and a localization module;
[0008] The slow-motion-related mining module takes the fused video features as input and outputs a slow-motion-related mask. The localization module takes the slow-motion-related mask output by the mining module as input and generates slow-motion enhanced features. Taking the slow-motion enhanced features and the fused video features as input, the localization module generates the temporal action localization result through a dual-stream localization network.
[0009] Preferably, the video temporal action localization method specifically includes:
[0010] S1. Given a video V, extract video features and concatenate the video features along the feature dimension to obtain fused video features X;
[0011] S2. Downsample the fused video features X to generate downsampled features, and input the downsampled features into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Perform max pooling on $CAS_{sub}$ over the category dimension C, then min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$. Smooth $M_{sub}$ to obtain the smoothed downsampling mask $M_{smooth}$, and upsample $M_{smooth}$ to obtain the slow-motion-related mask M;
[0012] S3. Based on the slow-motion-related mask M and the fused video features X, obtain the slow-motion-related video features $X_{slow}$. Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$;
[0013] S4. Fuse the outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$, and calculate the attention-guided class activation sequences;
[0014] S5. Apply multi-instance learning to obtain video-level classification scores. In video V, for each action category, take the classification scores of the K video segments with the highest classification scores for that category, and use the average of these K scores as the classification score of the video. Classify the test video based on a predefined classification threshold, and calculate the localization result from the attention-weighted class activation sequence and the attention score of the action instance.
[0015] Preferably, step S1 specifically includes:
[0016] Given an untrimmed video V, extract the RGB video frames and obtain the optical flow maps of the video using the TVL1 algorithm. Divide the RGB video frames and the optical flow maps into T non-overlapping video slices, each containing 16 video frames, and define the video as the sequence of these slices, where $v_i$ denotes the i-th video slice. Video features are extracted from the RGB video frames and the optical flow maps using the I3D network to obtain the RGB video features and the optical flow video features, where d denotes the feature dimension. The RGB video features and the optical flow video features are concatenated along the feature dimension to form the fused video features X.
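As an illustrative sketch only, the slicing and concatenation in step S1 can be written as follows. The function names `split_into_slices` and `fuse_features` are hypothetical, the I3D feature extractor itself is not shown, and dropping trailing frames that do not fill a 16-frame slice is an assumption that the text does not state.

```python
def split_into_slices(num_frames, slice_len=16):
    """Return (start, end) frame indices of non-overlapping slices.

    Assumption: trailing frames that do not fill a whole slice are dropped.
    """
    num_slices = num_frames // slice_len
    return [(i * slice_len, (i + 1) * slice_len) for i in range(num_slices)]


def fuse_features(rgb_feats, flow_feats):
    """Concatenate per-slice RGB and optical-flow features along the feature dimension.

    Both inputs are lists of T feature vectors (lists of floats).
    """
    if len(rgb_feats) != len(flow_feats):
        raise ValueError("RGB and flow features must cover the same T slices")
    return [rgb + flow for rgb, flow in zip(rgb_feats, flow_feats)]


slices = split_into_slices(50)                 # 3 slices of 16 frames, 2 frames dropped
fused = fuse_features([[1.0, 2.0]], [[3.0, 4.0]])
```

With d-dimensional RGB and flow vectors per slice, each fused vector in this sketch has 2d entries.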
[0017] Preferably, step S2 specifically includes:
[0018] The fused video features X are downsampled by sampling one segment every τ segments to generate the downsampled features. The downsampled features are input into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Max pooling is performed on $CAS_{sub}$ over the category dimension C, followed by min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$:
[0019] $M_{sub} = \mathrm{normalize}(\mathrm{maxpool}(CAS_{sub}))$
[0020] Here, maxpool(·) performs max pooling, and normalize(·) performs min-max normalization.
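The following minimal sketch shows one way to compute the downsampling mask from a class activation sequence. The function names are hypothetical, the baseline network that produces `cas_sub` is not modeled, and returning an all-zero mask when every peak is equal is an assumption made to avoid division by zero.

```python
def downsample(features, tau):
    """Keep one segment out of every tau segments along the time axis."""
    return features[::tau]


def downsampling_mask(cas_sub):
    """Compute M_sub = normalize(maxpool(CAS_sub)).

    cas_sub is a list of T_sub rows, each holding the class scores of one segment.
    """
    peaks = [max(row) for row in cas_sub]      # max pooling over the category dimension
    lo, hi = min(peaks), max(peaks)
    if hi == lo:                               # assumption: flat sequence gives a zero mask
        return [0.0 for _ in peaks]
    return [(p - lo) / (hi - lo) for p in peaks]   # min-max normalization over time


mask = downsampling_mask([[1.0, 3.0], [5.0, 2.0], [4.0, 0.0]])
```

After normalization the mask lies in [0, 1], which is what allows the later power-based smoothing and thresholding to operate on a common scale.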
[0021] A smoothing mechanism based on the coefficient of variation is applied to $M_{sub}$ to suppress potential noise. The coefficient of variation $c_v$ is defined as:
[0022]
[0023] where the two quantities in the definition are, respectively, the variance and the expectation of $M_{sub}$. Based on the coefficient of variation $c_v$ and the scaling factor s, $M_{sub}$ is smoothed:
[0024] $M_{smooth} = (M_{sub})^{\alpha}$
[0025] $\alpha = 1 - s \times c_v$
[0026] where $M_{smooth}$ is the smoothed downsampling mask;
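A hedged sketch of the smoothing step follows. Because the defining formula for $c_v$ is not reproduced in the text, the sketch assumes the textbook coefficient of variation (population standard deviation divided by mean). The function name `smooth_mask` and the handling of a zero mean or a non-positive exponent are likewise assumptions.

```python
import statistics


def smooth_mask(m_sub, s):
    """Apply M_smooth = M_sub ** alpha with alpha = 1 - s * c_v.

    Assumption: c_v is the textbook coefficient of variation (std / mean).
    """
    mean = statistics.fmean(m_sub)
    if mean <= 0.0:
        return list(m_sub)                     # assumption: nothing to smooth
    c_v = statistics.pstdev(m_sub) / mean
    alpha = 1.0 - s * c_v
    if alpha <= 0.0:
        raise ValueError("scaling factor s is too large: alpha must stay positive")
    return [m ** alpha for m in m_sub]


smoothed = smooth_mask([0.0, 0.5, 1.0], s=0.5)
```

For mask values in (0, 1) and an exponent below 1, the power raises intermediate values toward 1 while leaving 0 and 1 fixed, so a more dispersed mask (larger $c_v$) is flattened more strongly.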
[0027] The smoothed downsampling mask $M_{smooth}$ is converted into a binary mask using a predefined threshold θ:
[0028]
[0029] Video segments whose mask value is above the threshold θ are slow-motion segments, while those below the threshold are normal-motion segments and video background segments.
[0030] Nearest-neighbor interpolation is used to align the smoothed downsampling mask $M_{smooth}$ with the original video in the time domain:
[0031] $M = \mathrm{upsample}(M_{smooth})$
[0032] where M is the slow-motion-related mask.
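Thresholding and nearest-neighbor upsampling can be sketched as below. The function names are hypothetical, the strict comparison `m > theta` is an assumption (the text does not say how a value equal to θ is treated), and the index mapping `floor(t * n / T)` is one common nearest-neighbor convention rather than a detail given in the text.

```python
def binarize(m_smooth, theta):
    """Turn the smoothed mask into a 0/1 mask; 1 marks slow-motion segments."""
    return [1.0 if m > theta else 0.0 for m in m_smooth]


def upsample_nearest(mask, target_len):
    """Stretch the mask to target_len entries by nearest-neighbor lookup."""
    n = len(mask)
    return [mask[min(n - 1, (t * n) // target_len)] for t in range(target_len)]


binary = binarize([0.2, 0.8, 0.5], theta=0.5)    # only the middle segment is kept
full = upsample_nearest(binary, target_len=6)    # back to the original T = 6 segments
```

Nearest-neighbor upsampling keeps the mask binary, whereas linear interpolation would introduce fractional values at segment boundaries.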
[0033] Preferably, step S3 specifically includes:
[0034] Based on the slow-motion-related mask M and the fused video features X, the slow-motion-related video features are obtained:
[0035] $X_{slow} = X \odot M$
[0036] Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$. Here, C+1 denotes the action categories plus the background category, $CAS_{t,c}$ denotes the probability that the t-th video slice belongs to the c-th action class, and attn denotes the attention scores for action instance, action context, and background.
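The element-wise product $X \odot M$ can be illustrated with a short sketch. The name `apply_mask` is hypothetical, and broadcasting the per-segment mask value across the feature dimension is an assumption about how the product is applied.

```python
def apply_mask(features, mask):
    """Compute X_slow = X (element-wise product) M.

    features: list of T feature vectors; mask: list of T values in [0, 1].
    Assumption: each mask value is broadcast over its segment's feature vector.
    """
    if len(features) != len(mask):
        raise ValueError("mask must have one entry per video segment")
    return [[x * m for x in row] for row, m in zip(features, mask)]


x_slow = apply_mask([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])
```

With a binary mask, segments not flagged as slow motion are zeroed out, so the slow-motion branch only sees features from the flagged segments.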
[0037] Preferably, step S4 specifically includes:
[0038] The outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, are fused by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$:
[0039] $CAS_{fuse} = \mathrm{maxpool}(CAS_{normal}, CAS_{slow})$
[0040] $attn_{fuse} = \mathrm{maxpool}(attn_{normal}, attn_{slow})$
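A minimal sketch of the branch fusion follows, assuming that max pooling over the two branches means an element-wise maximum of two equally shaped sequences. The function name `fuse_max` is hypothetical.

```python
def fuse_max(normal, slow):
    """Element-wise maximum of two T x K sequences (CAS or attention scores)."""
    if len(normal) != len(slow):
        raise ValueError("both branches must cover the same T segments")
    return [[max(a, b) for a, b in zip(row_n, row_s)]
            for row_n, row_s in zip(normal, slow)]


cas_fuse = fuse_max([[0.1, 0.9], [0.4, 0.2]], [[0.3, 0.5], [0.0, 0.6]])
```

Under this reading, a segment keeps a high score after fusion if either branch responds strongly to it.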
[0041] Based on the fused $CAS_{fuse}$ and $attn_{fuse}$, the attention-guided class activation sequences are calculated:
[0042]
[0043]
[0044]
[0045] Preferably, step S5 specifically includes:
[0046] A multi-instance learning approach is applied to obtain video-level classification scores. In video V, for each action category, the classification scores of the K highest-scoring video segments for that category are taken, and the average of these K scores is used as the classification score for the entire video.
[0047]
[0048]
[0049] Here, the first quantity denotes the class activation of the k-th of the K video segments with the highest classification scores for category c, the second quantity denotes the classification score of video V for the c-th action category, and * stands for any of [ins, con, bac], i.e., action instance, action context, and background;
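The top-K averaging described above can be sketched as follows. The name `video_level_scores` is hypothetical, and clamping K to the range [1, T] is an assumption. Any normalization applied to the averaged scores afterwards, such as a softmax, is not covered, since the corresponding formulas are not reproduced in the text.

```python
def video_level_scores(cas, k):
    """Average, per category, the K highest segment scores of a T x C sequence."""
    num_segments = len(cas)
    num_classes = len(cas[0])
    k = max(1, min(k, num_segments))           # assumption: clamp K to [1, T]
    scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in cas), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores


scores = video_level_scores([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]], k=2)
```

In this sketch the same routine would be applied separately to each of the three attention-weighted sequences (instance, context, background).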
[0050] The test video is classified based on a predefined classification threshold, and the localization result is calculated from the attention-weighted class activation sequence of the action instances, $CAS_{ins}$, and the attention score $attn_{ins}$. Each localization result consists of the start time, end time, action category, and confidence score of an action instance. The confidence score of each predicted action instance is obtained with an inner-outer contrast function.
[0051] $v = (1-\gamma) \cdot CAS_{ins} + \gamma \cdot attn_{ins}$
[0052]
[0053] where γ is a hyperparameter controlling the fusion ratio between the class activation sequence $CAS_{ins}$ and the attention score $attn_{ins}$, and the confidence score is computed with respect to an expanded contrast region.
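The fusion $v = (1-\gamma) \cdot CAS_{ins} + \gamma \cdot attn_{ins}$ for a single action category can be sketched as below. Grouping consecutive above-threshold segments into candidate instances is an assumption added for illustration, as the text does not spell out how segments are formed. The inner-outer contrast confidence is omitted because its formula is not reproduced. Both function names are hypothetical.

```python
def fuse_scores(cas_ins, attn_ins, gamma):
    """Compute v = (1 - gamma) * CAS_ins + gamma * attn_ins per segment."""
    return [(1.0 - gamma) * c + gamma * a for c, a in zip(cas_ins, attn_ins)]


def segments_above(v, threshold):
    """Group consecutive segments with v > threshold into (start, end) index pairs.

    Assumption: end is exclusive; this grouping rule is illustrative only.
    """
    segments, start = [], None
    for t, value in enumerate(v):
        if value > threshold and start is None:
            start = t
        elif value <= threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(v)))
    return segments


v = fuse_scores([0.0, 1.0], [1.0, 1.0], gamma=0.5)
proposals = segments_above([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5)
```

Multiplying the segment indices by the slice duration (16 frames divided by the frame rate) would convert them to timestamps.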
[0054] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a weakly supervised video temporal action localization method based on slow-motion enhancement, which consists of a slow-motion-related mining module and a localization module. The slow-motion-related mining module takes the fused video features as input and outputs a slow-motion-related mask. The localization module takes the slow-motion-related mask output by the mining module as input and generates slow-motion enhanced features. The localization module then takes the slow-motion enhanced features and the fused video features as input and passes them through a dual-stream localization network to generate the final temporal action localization result.
Description of the Drawings
[0055] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0056] Figure 1 is a schematic diagram of video feature extraction provided by an embodiment of the present invention.
[0057] Figure 2 is a schematic diagram of the ACM-NET structure provided by an embodiment of the present invention.
[0058] Figure 3 is a schematic diagram of the slow-motion-related mining module and the localization module provided by an embodiment of the present invention.
Detailed Implementation
[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0060] This invention discloses a weakly supervised video temporal action localization method based on slow-motion enhancement, comprising: a slow-motion-related mining module and a localization module;
[0061] The slow-motion-related mining module takes the fused video features as input and outputs a slow-motion-related mask. The localization module takes the slow-motion-related mask output by the mining module as input and generates slow-motion enhanced features. Taking the slow-motion enhanced features and the fused video features as input, the localization module generates the temporal action localization result through a dual-stream localization network.
[0062] To further optimize the above technical solution, the video temporal action localization method specifically includes:
[0063] S1. Given a video V, extract video features and concatenate the video features along the feature dimension to obtain fused video features X;
[0064] S2. Downsample the fused video features X to generate downsampled features, and input the downsampled features into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Perform max pooling on $CAS_{sub}$ over the category dimension C, then min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$. Smooth $M_{sub}$ to obtain the smoothed downsampling mask $M_{smooth}$, and upsample $M_{smooth}$ to obtain the slow-motion-related mask M;
[0065] S3. Based on the slow-motion-related mask M and the fused video features X, obtain the slow-motion-related video features $X_{slow}$. Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$;
[0066] S4. Fuse the outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$, and calculate the attention-guided class activation sequences;
[0067] S5. Apply multi-instance learning to obtain video-level classification scores. In video V, for each action category, take the classification scores of the K video segments with the highest classification scores for that category, and use the average of these K scores as the classification score of the video. Classify the test video based on a predefined classification threshold, and calculate the localization result from the attention-weighted class activation sequence and the attention score of the action instance.
[0068] To further optimize the above technical solution, step S1 specifically includes:
[0069] Given an untrimmed video V, extract the RGB video frames and obtain the optical flow maps of the video using the TVL1 algorithm. Divide the RGB video frames and the optical flow maps into T non-overlapping video slices, each containing 16 video frames, and define the video as the sequence of these slices, where $v_i$ denotes the i-th video slice. Video features are extracted from the RGB video frames and the optical flow maps using the I3D network to obtain the RGB video features and the optical flow video features, where d denotes the feature dimension. The RGB video features and the optical flow video features are concatenated along the feature dimension to form the fused video features X.
[0070] To further optimize the above technical solution, step S2 specifically includes:
[0071] The fused video features X are downsampled by sampling one segment every τ segments to generate the downsampled features. The downsampled features are input into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Max pooling is performed on $CAS_{sub}$ over the category dimension C, followed by min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$:
[0072] $M_{sub} = \mathrm{normalize}(\mathrm{maxpool}(CAS_{sub}))$
[0073] Here, maxpool(·) performs max pooling, and normalize(·) performs min-max normalization.
[0074] A smoothing mechanism based on the coefficient of variation is applied to $M_{sub}$ to suppress potential noise. The coefficient of variation $c_v$ is defined as:
[0075]
[0076] where the two quantities in the definition are, respectively, the variance and the expectation of $M_{sub}$. Based on the coefficient of variation $c_v$ and the scaling factor s, $M_{sub}$ is smoothed:
[0077] $M_{smooth} = (M_{sub})^{\alpha}$
[0078] $\alpha = 1 - s \times c_v$
[0079] where $M_{smooth}$ is the smoothed downsampling mask;
[0080] The smoothed downsampling mask $M_{smooth}$ is converted into a binary mask using a predefined threshold θ:
[0081]
[0082] Video segments whose mask value is above the threshold θ are slow-motion segments, while those below the threshold are normal-motion segments and video background segments.
[0083] Nearest-neighbor interpolation is used to align the smoothed downsampling mask $M_{smooth}$ with the original video in the time domain:
[0084] $M = \mathrm{upsample}(M_{smooth})$
[0085] where M is the slow-motion-related mask.
[0086] To further optimize the above technical solution, step S3 specifically includes:
[0087] Based on the slow-motion-related mask M and the fused video features X, the slow-motion-related video features are obtained:
[0088] $X_{slow} = X \odot M$
[0089] Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$. Here, C+1 denotes the action categories plus the background category, $CAS_{t,c}$ denotes the probability that the t-th video slice belongs to the c-th action class, and attn denotes the attention scores for action instance, action context, and background.
[0090] To further optimize the above technical solution, step S4 specifically includes:
[0091] The outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, are fused by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$:
[0092] $CAS_{fuse} = \mathrm{maxpool}(CAS_{normal}, CAS_{slow})$
[0093] $attn_{fuse} = \mathrm{maxpool}(attn_{normal}, attn_{slow})$
[0094] Based on the fused $CAS_{fuse}$ and $attn_{fuse}$, the attention-guided class activation sequences are calculated:
[0095]
[0096]
[0097]
[0098] To further optimize the above technical solution, step S5 specifically includes:
[0099] A multi-instance learning approach is applied to obtain video-level classification scores. In video V, for each action category, the classification scores of the K highest-scoring video segments for that category are taken, and the average of these K scores is used as the classification score for the entire video.
[0100]
[0101]
[0102] Here, the first quantity denotes the class activation of the k-th of the K video segments with the highest classification scores for category c, the second quantity denotes the classification score of video V for the c-th action category, and * stands for any of [ins, con, bac], i.e., action instance, action context, and background.
[0103] The test video is classified based on a predefined classification threshold, and the localization result is calculated from the attention-weighted class activation sequence of the action instances, $CAS_{ins}$, and the attention score $attn_{ins}$. Each localization result consists of the start time, end time, action category, and confidence score of an action instance. The confidence score of each predicted action instance is obtained with an inner-outer contrast function.
[0104] $v = (1-\gamma) \cdot CAS_{ins} + \gamma \cdot attn_{ins}$
[0105]
[0106] where γ is a hyperparameter controlling the fusion ratio between the class activation sequence $CAS_{ins}$ and the attention score $attn_{ins}$, and the confidence score is computed with respect to an expanded contrast region.
[0107] The slow-motion enhanced localization network proposed in this application uses ACM-NET as the baseline network and computes the slow-motion-related mask from the generated CAS to finally obtain the temporal action localization result. The structure of ACM-NET is shown in Figure 2.
[0108] ACM-NET first generates an initial CAS using a classification branch. It then generates three sets of attention weights A using three class-agnostic attention modules, which are used to distinguish action instances, action contexts, and non-action background, respectively.
[0109] $CAS = \phi_{cls}(X)$
[0110] $A = \phi_{attn}(X)$
[0111] where CAS gives the classification scores of the video segments, and A gives, for the t-th video segment, the attention scores for action instance, action context, and non-action background.
[0112] Based on the attention weights generated from the three attention branches above, ACM-NET constructs three new sets of CAS values:
[0113] $CAS_{*} = attn_{*} \times CAS, \quad * \in \{ins, con, bac\}$
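As a hedged illustration of how the three attention-weighted sequences could be formed, the sketch below multiplies each segment's class scores by that segment's attention weight. The function name `attention_weighted_cas` and the column order of the attention matrix (instance, context, background) are assumptions, not details taken from ACM-NET.

```python
def attention_weighted_cas(cas, attn):
    """Build CAS_ins, CAS_con, CAS_bac from a T x (C+1) CAS and T x 3 attention weights.

    Assumption: attention columns are ordered (ins, con, bac).
    """
    names = ("ins", "con", "bac")
    return {
        name: [[weights[i] * score for score in row]
               for row, weights in zip(cas, attn)]
        for i, name in enumerate(names)
    }


weighted = attention_weighted_cas([[1.0, 2.0, 3.0]], [[0.5, 0.25, 0.0]])
```

Each resulting sequence keeps the shape of the original CAS, so the same top-K aggregation can be applied to all three.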
[0114] Finally, based on the multi-instance learning method, the CAS values are aggregated into video-level classification prediction scores for action instances, action contexts, and non-action backgrounds.
[0115] These three video-level classification prediction scores are supervised with predefined video-level labels, from which the loss is calculated. That is:
[0116] $Y_{ins} = [y_c = 1,\ y_{C+1} = 0]$
[0117] $Y_{con} = [y_c = 1,\ y_{C+1} = 1]$
[0118] $Y_{bac} = [y_c = 0,\ y_{C+1} = 1]$
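The three label vectors can be built from the set of action categories present in a video, as in the sketch below. The name `build_labels` is hypothetical, and the sketch assumes that $y_c = 1$ applies only to the categories actually present in the video, with index C+1 (the last entry) reserved for the background class.

```python
def build_labels(present, num_classes):
    """Return the (C+1)-dimensional labels for the ins, con, and bac branches.

    present: set of zero-based indices of action categories present in the video.
    """
    actions = [1.0 if c in present else 0.0 for c in range(num_classes)]
    return {
        "ins": actions + [0.0],                # actions on, background off
        "con": actions + [1.0],                # actions on, background on
        "bac": [0.0] * num_classes + [1.0],    # actions off, background on
    }


labels = build_labels({1}, num_classes=3)
```

Under this reading, the context label turns on both the action and the background entries, reflecting that context segments share properties of both.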
[0119] Based on the above labels, ACM-NET uses three binary cross-entropy loss functions to optimize the network for action instances, action contexts, and non-action background, respectively. Furthermore, ACM-NET introduces an attention-guided loss, an action feature separation loss, and a sparse attention loss to enhance the network's robustness.
[0120] $L_{acm\text{-}net} = L_{cls} + L_{add}$
[0121] $L_{cls} = L_{ins} + L_{con} + L_{bac}$
[0122] $L_{add} = \lambda_1 L_{gui} + \lambda_2 L_{feat} + \lambda_3 L_{spa}$
[0123] Table 1 shows the performance of different methods on the THUMOS'14 dataset. Marked entries denote weakly supervised temporal action localization methods that use additional supervision beyond video-level labels.
[0124] Table 1 Performance of Different Methods on the THUMOS'14 Dataset
[0125]
[0126]
[0127] Table 2 Comparison of Experimental Results on ActivityNet v1.3 Validation Set
[0128]
[0129] The method of this application was compared with the latest methods on the THUMOS'14 and ActivityNet v1.3 datasets. The results for these two datasets are shown in Tables 1 and 2. From the results, we conclude that:
[0130] 1) On both datasets, the method in this application has a significant advantage over the state-of-the-art weakly supervised temporal action localization methods in all evaluation metrics;
[0131] 2) On both datasets, the SMEN proposed in this application still has certain advantages compared with some fully supervised temporal action localization methods and temporal action localization methods with additional supervision, which proves the superiority of the SMEN proposed in this application in weakly supervised temporal action localization tasks.
[0132] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to the method section.
[0133] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A weakly supervised video temporal action localization method based on slow-motion enhancement, characterized in that it comprises a slow-motion-related mining module and a localization module. The slow-motion-related mining module takes the fused video features as input and outputs a slow-motion-related mask. The localization module takes the slow-motion-related mask output by the slow-motion-related mining module as input, generates slow-motion-related video features, and, taking the slow-motion-related video features and the fused video features as input, generates the temporal action localization result through a dual-stream localization network. The video temporal action localization method specifically includes:
S1. Given a video V, extract video features and concatenate the video features along the feature dimension to obtain fused video features X;
S2. Downsample the fused video features X to generate downsampled features, and input the downsampled features into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Perform max pooling on $CAS_{sub}$ over the category dimension C, then min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$. Smooth $M_{sub}$ to obtain the smoothed downsampling mask $M_{smooth}$, and upsample $M_{smooth}$ to obtain the slow-motion-related mask M;
S3. Based on the slow-motion-related mask M and the fused video features X, obtain the slow-motion-related video features $X_{slow}$. Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$;
S4. Fuse the outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$, and calculate the attention-guided class activation sequences;
S5. Apply multi-instance learning to obtain video-level classification scores. In video V, for each action category, take the classification scores of the K video segments with the highest classification scores for that category, and use the average of these K scores as the classification score of the video. Classify the test video based on a predefined classification threshold, and calculate the localization result from the attention-weighted class activation sequence and the attention score of the action instance.
2. The weakly supervised video temporal action localization method based on slow-motion enhancement according to claim 1, characterized in that step S1 specifically includes: given an untrimmed video V, extract the RGB video frames and obtain the optical flow maps of the video using the TVL1 algorithm. Divide the RGB video frames and the optical flow maps into T non-overlapping video slices, each containing 16 video frames, and define the video as the sequence of these slices, where $v_i$ denotes the i-th video slice. Video features are extracted from the RGB video frames and the optical flow maps using an I3D network to obtain the RGB video features and the optical flow video features, where d denotes the feature dimension. The RGB video features and the optical flow video features are concatenated along the feature dimension to form the fused video features X.
3. The weakly supervised video temporal action localization method based on slow-motion enhancement according to claim 1, characterized in that step S2 specifically includes: the fused video features X are downsampled by sampling one segment every τ segments to generate the downsampled features. The downsampled features are input into the baseline network to generate the downsampled class activation sequence $CAS_{sub}$. Max pooling is performed on $CAS_{sub}$ over the category dimension C, followed by min-max normalization along the time dimension, to generate the downsampling mask $M_{sub}$: $M_{sub} = \mathrm{normalize}(\mathrm{maxpool}(CAS_{sub}))$, where maxpool(·) performs max pooling and normalize(·) performs min-max normalization. A smoothing mechanism based on the coefficient of variation is applied to $M_{sub}$ to suppress potential noise, the coefficient of variation $c_v$ being defined in terms of the variance and the expectation of $M_{sub}$. Based on the coefficient of variation $c_v$ and the scaling factor s, $M_{sub}$ is smoothed: $M_{smooth} = (M_{sub})^{\alpha}$, $\alpha = 1 - s \times c_v$, where $M_{smooth}$ is the smoothed downsampling mask. The smoothed downsampling mask $M_{smooth}$ is converted into a binary mask using a predefined threshold θ: video segments whose mask value is above the threshold θ are slow-motion segments, while those below the threshold are normal-motion segments and video background segments. Nearest-neighbor interpolation is used to align the smoothed downsampling mask $M_{smooth}$ with the original video in the time domain: $M = \mathrm{upsample}(M_{smooth})$, where M is the slow-motion-related mask.
4. The weakly supervised video temporal action localization method based on slow-motion enhancement according to claim 1, characterized in that step S3 specifically includes: based on the slow-motion-related mask M and the fused video features X, the slow-motion-related video features are obtained as $X_{slow} = X \odot M$. Taking the fused video features X as input to the normal-action branch yields $CAS_{normal}$ and $attn_{normal}$; taking the slow-motion-related video features $X_{slow}$ as input to the slow-motion branch yields $CAS_{slow}$ and $attn_{slow}$. Here, C+1 denotes the action categories plus the background category, $CAS_{t,c}$ denotes the probability that the t-th video slice belongs to the c-th action class, and attn denotes the attention scores for action instance, action context, and background.
5. The weakly supervised video temporal action localization method based on slow-motion enhancement according to claim 1, characterized in that step S4 specifically includes: the outputs of the two branches, $CAS_{normal}$, $attn_{normal}$, $CAS_{slow}$, and $attn_{slow}$, are fused by max pooling to obtain $CAS_{fuse}$ and $attn_{fuse}$: $CAS_{fuse} = \mathrm{maxpool}(CAS_{normal}, CAS_{slow})$, $attn_{fuse} = \mathrm{maxpool}(attn_{normal}, attn_{slow})$. Based on the fused $CAS_{fuse}$ and $attn_{fuse}$, the attention-guided class activation sequences are calculated.
6. The weakly supervised video temporal action localization method based on slow-motion enhancement according to claim 1, characterized in that step S5 specifically includes: multi-instance learning is applied to obtain video-level classification scores. In video V, for each action category, the classification scores of the K video segments with the highest scores for that category are taken, and the average of these K scores is used as the classification score of the video for that category, computed from the class activations of the K highest-scoring video segments for category c, where * stands for any of [ins, con, bac], i.e., action instance, action context, and background. The test video is classified based on a predefined classification threshold, and the localization result is calculated from the attention-weighted class activation sequence of the action instances, $CAS_{ins}$, and the attention score $attn_{ins}$. Each localization result consists of the start time, end time, action category, and confidence score of an action instance. The confidence score of each predicted action instance is obtained with an inner-outer contrast function, with $v = (1-\gamma) \cdot CAS_{ins} + \gamma \cdot attn_{ins}$, where γ is a hyperparameter controlling the fusion ratio between the class activation sequence $CAS_{ins}$ and the attention score $attn_{ins}$, and the confidence score is computed with respect to an expanded contrast region.