Video labeling method, apparatus, device, medium, and product

By acquiring the annotation results of the first and last frames of a video, and combining semi-supervised annotation and forward propagation algorithms, the intermediate frames are automatically annotated, solving the problems of low efficiency and high cost in existing video annotation technologies, and achieving efficient and accurate video annotation.

CN115905622B · Active · Publication Date: 2026-06-19 · BEIJING ZITIAO NETWORK TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2022-11-15
Publication Date: 2026-06-19


IPC: G06F16/783; G06F16/78

AI Tagging

Application Domain

Metadata video data retrieval; Special data processing applications


AI Technical Summary

Technical Problem

Current video annotation technologies rely on manual frame-by-frame annotation, which is inefficient and costly.

Method used

By obtaining the annotation results of the first and last frames of the video, and using semi-supervised annotation algorithms and forward propagation algorithms, the intermediate frames are automatically annotated to generate the annotation results of the target sub-segments.

Benefits of technology

It improves the efficiency and accuracy of video annotation and reduces the cost of manual annotation.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a video annotation method, apparatus, device, medium, and product. The method includes: determining a sub-segment to be annotated in a video to obtain a target sub-segment; acquiring a first-frame annotation result corresponding to the first frame of the target sub-segment; generating a last-frame annotation result corresponding to the last frame of the target sub-segment based on the first-frame annotation result; generating annotation results for intermediate frames of the target sub-segment based on the first-frame annotation result and the last-frame annotation result to obtain the annotation result of the target sub-segment; and generating a target annotation result for the video to be annotated based on the annotation result of the target sub-segment. The technical solution of this disclosure improves video annotation efficiency.


Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to a video annotation method, apparatus, device, medium, and product.

Background Technology

[0002] Video processing can be applied to many technical fields, such as artificial intelligence, intelligent transportation, finance, and content recommendation. Specific technologies involved include object tracking and object detection.

[0003] In related technologies, video annotation is generally done manually frame by frame. However, manual annotation is inefficient and costly.

Summary of the Invention

[0004] This disclosure provides a video annotation method, apparatus, device, medium, and product to overcome the technical problems of low annotation efficiency and high annotation cost when using manual annotation.

[0005] In a first aspect, embodiments of this disclosure provide a video annotation method, including:

[0006] determining a sub-segment to be annotated in a video to be annotated, to obtain a target sub-segment;

[0007] acquiring a first-frame annotation result corresponding to the first frame of the target sub-segment;

[0008] generating, based on the first-frame annotation result, a last-frame annotation result corresponding to the last frame of the target sub-segment;

[0009] generating annotation results for the intermediate frames of the target sub-segment based on the first-frame annotation result and the last-frame annotation result, to obtain the annotation result of the target sub-segment;

[0010] generating a target annotation result for the video to be annotated based on the annotation result of the target sub-segment.

[0011] In a second aspect, embodiments of this disclosure provide a video annotation apparatus, comprising:

[0012] The first determining unit is used to determine the sub-segment to be labeled in the video to be labeled, and to obtain the target sub-segment;

[0013] The first frame annotation unit is used to obtain the first frame annotation result corresponding to the first frame of the target sub-segment;

[0014] The tail frame annotation unit is used to generate the tail frame annotation result corresponding to the tail frame of the target sub-segment based on the first frame annotation result;

[0015] The segment annotation unit is used to generate annotation results for the intermediate frames of the target sub-segment based on the annotation results of the first frame and the annotation results of the last frame, so as to obtain the annotation results of the target sub-segment to be annotated.

[0016] The second determining unit is used to generate target annotation results for the video to be annotated based on the annotation results of the target sub-segment.

[0017] In a third aspect, embodiments of this disclosure provide an electronic device, including: a processor and a memory;

[0018] The memory stores computer-executable instructions;

[0019] The processor executes the computer-executable instructions stored in the memory, so that the processor performs the video annotation method described in the first aspect and the various possible designs of the first aspect.

[0020] In a fourth aspect, embodiments of this disclosure provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the video annotation method described in the first aspect and the various possible designs of the first aspect.

[0021] In a fifth aspect, embodiments of this disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the video annotation method described in the first aspect and the various possible designs of the first aspect.

[0022] In the technical solution provided in this embodiment, the target sub-segments of the video to be labeled are determined at the segment level. To label a target sub-segment, the first-frame labeling result corresponding to its first frame is obtained first. The last-frame labeling result corresponding to its last frame is then generated from the first-frame labeling result. The first-frame and last-frame labeling results are used together to label the intermediate frames, which yields the labeling result of the target sub-segment. The last frame is thus labeled automatically from the first frame, and the intermediate frames are labeled automatically from the first and last frames, so the intermediate frames are labeled efficiently. After the labeling results of the target sub-segments are obtained, the target labeling result of the video to be labeled can be determined. Labeling segments that span a shorter time improves the accuracy of segment labeling, so this method is more efficient and more accurate than labeling the whole video directly.

Description of the Drawings

[0023] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is an example diagram illustrating an application of a video annotation method provided in this disclosure;

[0025] Figure 2 is a flowchart of one embodiment of a video annotation method provided in this disclosure;

[0026] Figure 3 is a flowchart of yet another embodiment of a video annotation method provided in this disclosure;

[0027] Figure 4 is an example diagram of feature propagation provided in an embodiment of this disclosure;

[0028] Figure 5 is a flowchart of yet another embodiment of a video annotation method provided in this disclosure;

[0029] Figure 6 is an example diagram showing an update of the first frame annotation result provided in an embodiment of this disclosure;

[0030] Figure 7 is a flowchart of yet another embodiment of a video annotation method provided in this disclosure;

[0031] Figure 8 is a flowchart of yet another embodiment of a video annotation method provided in this disclosure;

[0032] Figure 9 is an example diagram illustrating the division of a video segment as provided in an embodiment of this disclosure;

[0033] Figure 10 is an example diagram of keyframe extraction provided in an embodiment of this disclosure;

[0034] Figure 11 is a schematic diagram of one embodiment of a video annotation device provided in this disclosure;

[0035] Figure 12 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this disclosure.

Detailed Description

[0036] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0037] The technical solution disclosed herein can be applied to video annotation scenarios. The annotation result of the first frame is obtained and used to automatically annotate the last frame. The remaining image frames are then annotated automatically from the annotation results of the first frame and the last frame, which improves the annotation efficiency of the video.

[0038] In related technologies, training video processing models requires a large number of video samples. Video samples can include the video itself and its labels. Video labels generally refer to the labels of each image frame in the video, and the annotation results of each image frame are usually obtained manually. Performing frame-by-frame manual annotation generally requires a large amount of manual work, resulting in low annotation efficiency and high annotation costs.

[0039] To address the high cost of manual image annotation, this disclosure considers automating the process. However, automatic image annotation typically requires a region recognition model, and directly using such a model often yields inaccurate results. To obtain more accurate annotations, a portion of the images can be annotated manually, and a semi-supervised annotation approach can then use those manually annotated images to annotate the remaining images. This approach results in higher accuracy and significantly improved annotation efficiency.

[0040] The technical solutions of this disclosure and how they solve the aforementioned technical problems will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0041] Figure 1 illustrates an application example of a video annotation method provided in this disclosure. The video annotation method can be applied to an electronic device 1, which may include a display device 2. The display device 2 can display the video to be annotated. The video to be annotated can be divided into at least one video sub-segment based on multiple keyframes. According to the technical solution of this disclosure, the video to be annotated can be annotated sub-segment by sub-segment; for example, segment annotation can be performed on the target sub-segment 3. The electronic device 1 can display the segment annotation result of any image 4 in the target sub-segment 3 on the display device 2. The segment annotation result can be, for example, […]. In Figure 1, the vehicle location area 5 is labeled, while other types of objects, such as streetlights 6, can be left unlabeled, which gives the segment labeling result for image 4. For ease of understanding, the vehicle area 5 in Figure 1 is labeled with a rectangular box. This labeling method is merely exemplary and does not limit the labeling method or type. In practical applications, other shapes, such as the outline of the object to be labeled, circles, and polygons, can also be used for labeling. After the segment labeling results are determined, the labeled target sub-segments can be used to determine the target labeling result of the video to be labeled.

[0042] Figure 2 is a flowchart of one embodiment of a video annotation method provided in this disclosure. The method can be performed by a video annotation apparatus, which can be located in an electronic device. The video annotation method can include the following steps:

[0043] 201: Identify the sub-segment to be labeled in the video to obtain the target sub-segment.

[0044] Optionally, before determining the sub-segment to be labeled in the video to be labeled, the process may further include: in response to a video labeling request, obtaining the video to be labeled.

[0045] The target sub-segment can be a sub-segment to be labeled among the at least one video sub-segment of the video to be labeled. The at least one video sub-segment can be obtained by dividing the video to be labeled into segments.
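As an illustration of dividing a video into sub-segments, the following is a minimal sketch, not the method claimed in this disclosure. It assumes that keyframes are given as frame indices, that consecutive keyframes bound one sub-segment, and that neighbouring sub-segments share their boundary frame. The function name `split_by_keyframes` and all of these conventions are hypothetical.

```python
from typing import List, Tuple


def split_by_keyframes(num_frames: int, keyframes: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame index pairs, one per sub-segment.

    Assumptions (not taken from the disclosure): keyframes are 0-based frame
    indices, the first and last frames of the video always act as boundaries,
    and adjacent sub-segments share their boundary frame.
    """
    if num_frames <= 0:
        return []
    # Keep only valid keyframes and always include both ends of the video.
    valid = {k for k in keyframes if 0 <= k < num_frames}
    bounds = sorted(valid | {0, num_frames - 1})
    if len(bounds) == 1:
        return [(0, 0)]  # single-frame video
    return list(zip(bounds, bounds[1:]))


segments = split_by_keyframes(10, [3, 7])
print(segments)
```

Sharing the boundary frame is only one possible convention; it would let the last-frame result of one sub-segment be reused as the first-frame result of the next.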

[0046] 202: Obtain the first frame annotation result corresponding to the first frame of the target sub-fragment.

[0047] The first frame can be the first image of the target sub-fragment, or any image of the target sub-fragment.

[0048] The first frame annotation result can be obtained through manual annotation or by extracting from an image annotation model. To improve the efficiency of the first frame annotation, it is also possible to first automatically annotate using an image annotation model, and then manually correct the annotation result of the image annotation model to obtain the final first frame annotation result.

[0049] The tail frame can be the last image of the target sub-fragment.

[0050] 203: Based on the annotation results of the first frame, generate the annotation results of the last frame corresponding to the target sub-fragment.

[0051] The annotation result of the last frame can be obtained by combining the annotation result of the first image of the target sub-fragment with a semi-supervised annotation algorithm. The semi-supervised annotation algorithm can use a forward propagation method to propagate the annotation result of the first image of the target sub-fragment to the last frame to obtain the annotation result of the last frame.

[0052] 204: Based on the annotation results of the first frame and the last frame, generate the annotation results of the intermediate frames of the target sub-fragment to obtain the annotation results of the target sub-fragment to be annotated.

[0053] Intermediate frames can include the unlabeled image frames of the target sub-segment. The annotation results of the intermediate frames can be derived from the annotation results of the first and last frames. A target sub-segment can include multiple images or image frames, each of which can be labeled to obtain the labeling result for that image. Once all image frames in the target sub-segment have been labeled, the segment labeling result of the target sub-segment is obtained; it consists of the labeling results corresponding to the multiple image frames of the target sub-segment.

[0054] 205: Based on the annotation results of the target sub-segments, generate the target annotation results for the video to be annotated.

[0055] The video to be annotated can include at least one video segment. Each video segment can be referred to as the target segment during the annotation process. The annotation result of the target segment can be obtained after the annotation is completed. The target annotation result of the video to be annotated can include the annotation results corresponding to multiple video segments respectively.

[0056] In this embodiment of the disclosure, for the video to be labeled, the segments to be labeled can be determined from the segment dimension to obtain the target sub-segments. In the labeling of the target sub-segments, the first frame labeling result corresponding to the first frame of the target sub-segment can be obtained first, and the last frame labeling result corresponding to the last frame can be generated using the first frame labeling result. The first frame labeling result and the last frame labeling result can be used to label the intermediate frames in the target sub-segment respectively, thereby achieving automatic labeling of the target sub-segment and obtaining the labeling result of the target sub-segment. Each image in the target sub-segment can be automatically labeled using its first frame labeling result and last frame labeling result, achieving efficient labeling results. After obtaining the segment labeling results of the target sub-segments, the target labeling result of the video to be labeled can be determined. By labeling segments with a smaller time dimension, the segment labeling efficiency can be improved, and the accuracy is higher compared to directly labeling the video to be labeled.
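The flow of steps 201 to 205 can be sketched as follows. This is a hedged outline under stated assumptions rather than the implementation of this disclosure: the three callables `get_first`, `propagate_to_last`, and `fill_middle` are hypothetical stand-ins for the first-frame annotation, the forward propagation to the last frame, and the intermediate-frame annotation, and frames and annotations are represented as plain strings for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Frame = str       # placeholder for real image data
Annotation = str  # placeholder for a real mask, box list, etc.


def annotate_video(
    frames: Sequence[Frame],
    segments: List[Tuple[int, int]],
    get_first: Callable[[Frame], Annotation],
    propagate_to_last: Callable[[Sequence[Frame], Annotation], Annotation],
    fill_middle: Callable[[Sequence[Frame], Annotation, Annotation], List[Annotation]],
) -> Dict[int, Annotation]:
    """Annotate each sub-segment, then merge the results per frame index."""
    result: Dict[int, Annotation] = {}
    for start, end in segments:                    # step 201: pick a target sub-segment
        clip = frames[start:end + 1]
        first = get_first(clip[0])                 # step 202: first-frame result
        if len(clip) == 1:
            labels = [first]
        else:
            last = propagate_to_last(clip, first)  # step 203: last-frame result
            middle = fill_middle(clip, first, last)  # step 204: intermediate frames
            labels = [first, *middle, last]
        for offset, label in enumerate(labels):
            result[start + offset] = label
    return result                                  # step 205: result for the whole video


# Toy stand-ins so the sketch can run end to end.
frames = [f"f{i}" for i in range(5)]
out = annotate_video(
    frames,
    [(0, 2), (3, 4)],
    get_first=lambda f: f"manual:{f}",
    propagate_to_last=lambda clip, first: f"auto:{clip[-1]}",
    fill_middle=lambda clip, first, last: [f"auto:{f}" for f in clip[1:-1]],
)
print(out)
```

In this toy run only the first frame of each sub-segment carries a "manual" label; every other frame is filled in automatically, which is the efficiency argument made in the paragraph above.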

[0057] In general, the annotation results of the last frame can be obtained manually. However, in order to improve the annotation efficiency of the last frame, the forward propagation algorithm can be used to determine the annotation results of the last frame.

[0058] Figure 3 is a flowchart of another embodiment of a video annotation method provided by this disclosure. It differs from the above embodiments in that generating the last-frame annotation result of the target sub-segment based on the first-frame annotation result includes:

[0059] 301: Obtain the first frame annotation result corresponding to the first frame of the target sub-fragment.

[0060] 302: Based on the annotation results of the first frame, the annotation results of the last frame corresponding to the last frame are determined using the forward propagation algorithm.

[0061] In this embodiment of the disclosure, the annotation result of the last frame can be automatically determined based on the annotation result of the first frame and combined with the forward propagation algorithm. By automatically determining the annotation result of the last frame, the annotation efficiency of the last frame can be effectively improved.

[0062] In one possible design, based on the annotation results of the first frame, the annotation results of the last frame are determined using a forward propagation algorithm, including:

[0063] Using the forward propagation algorithm, the annotation results of the first frame are propagated sequentially to the unannotated image frames in the target sub-fragment to obtain the annotation results of the unannotated image frames in the target sub-fragment;

[0064] The annotation result of the last image frame of the target sub-fragment is obtained as the annotation result of the tail frame corresponding to the tail frame.

[0065] In this embodiment, a forward propagation algorithm is used to propagate the annotation results of the first frame to the unannotated image frames in the target sub-segment sequentially to obtain the annotation results of the unannotated image frames. The forward propagation algorithm propagates the annotation results of the first frame to the unannotated image frames until it reaches the last image frame of the target video segment, thereby obtaining the annotation results of the last frame. Through the propagation of the annotation results, the annotation of the last frame is obtained through continuous propagation, so that the annotation of the last frame references its vicinity, such as the annotation results of the image frame preceding the last frame, thereby improving the efficiency and accuracy of the last frame annotation.
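The sequential propagation described above can be sketched as a simple chain in which each frame's annotation is derived from the annotation of the frame before it. This is a minimal sketch under assumptions: the `step` callable is a hypothetical placeholder for whatever trained propagation model is used, and the toy example reduces a frame to a single object position and an annotation to a single box coordinate.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # frame type
A = TypeVar("A")  # annotation type


def propagate_forward(
    frames: Sequence[F],
    first_annotation: A,
    step: Callable[[A, F, F], A],
) -> List[A]:
    """Propagate the first-frame annotation through every later frame.

    `step(prev_annotation, prev_frame, cur_frame)` is a hypothetical stand-in
    for the propagation model; it returns the annotation of `cur_frame`.
    """
    annotations = [first_annotation]
    for prev, cur in zip(frames, frames[1:]):
        annotations.append(step(annotations[-1], prev, cur))
    return annotations


# Toy data: each "frame" is an object's x position, the annotation is a box x.
toy_frames = [10, 12, 15, 19]
shift = lambda box_x, prev, cur: box_x + (cur - prev)
chain = propagate_forward(toy_frames, 8, shift)
tail_annotation = chain[-1]  # annotation of the last frame
print(chain, tail_annotation)
```

The last element of the chain plays the role of the last-frame annotation result; the elements before it are the provisional results for the unannotated frames passed along the way.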

[0066] In practical applications, a bidirectional propagation method can be used to annotate the intermediate frames of the target sub-segment. For intermediate frames at different positions, the image can be automatically annotated according to the positional differences between the intermediate frames and the first and last frames, thereby improving the annotation accuracy of the image.

[0067] Accordingly, Figure 4 shows a flowchart of another embodiment of a video annotation method provided by this disclosure. It differs from the previous embodiments in that generating the annotation results of the intermediate frames of the target sub-segment, based on the annotation results of the first frame and the last frame, can include:

[0068] 401: Based on the first frame annotation results, combined with the forward propagation algorithm, the forward propagation features of the intermediate frames of the target sub-segment are extracted.

[0069] Optionally, the forward propagation algorithm can include machine learning algorithms, neural network algorithms, etc., which can be obtained through training. The forward propagation algorithm can be used to propagate the first frame annotation results of the first frame to the intermediate frames located after the first frame to obtain the forward propagation features of the intermediate frames.

[0070] The target sub-fragment can include N image frames, each of which can be used as an intermediate frame for annotation. N is a positive integer greater than 1. The first and last frames can be annotated first. Then, starting from the second image frame in the target sub-fragment, each image frame can be used as an intermediate frame in sequence to obtain the annotation result of each intermediate frame, until the annotation result of the image preceding the last frame of the target sub-fragment is obtained. At this point, the annotation of the target sub-fragment ends.

[0071] A forward propagation feature is the feature obtained by propagating the label of the first frame to subsequent images frame by frame, starting from the first frame and stopping when the image with the specified sequence number is reached. The labeling result of the first frame is used as a feature propagation mask in feature calculation. Specifically, the semi-supervised segmentation algorithm described in the following embodiments can be used for feature propagation.

[0072] 402: Based on the last-frame annotation result, combined with the backward propagation algorithm, extract the backward propagation features of the intermediate frames of the target sub-segment.

[0073] Optionally, the backward propagation algorithm can include machine learning algorithms, neural network algorithms, etc., which can be obtained through training. The backward propagation algorithm can propagate the annotation results of the last frame to the intermediate frames before the last frame to obtain the backward propagation features of the intermediate frames.

[0074] A backward propagation feature is the feature obtained by propagating the label of the last frame to the preceding images frame by frame, starting from the last frame and stopping when the image with the specified sequence number is reached. Similarly, the labeling result of the last frame can also be used as a feature propagation mask in feature calculation.

[0075] 403: Perform feature fusion processing on the forward propagation features and the backward propagation features to obtain the target image features of the intermediate frame.

[0076] 404: Determine the annotation results of the intermediate frames based on the features of the target image.

[0077] A target sub-fragment may include one or more intermediate frames, each of which can be annotated to obtain the annotation results for each intermediate frame. The fragment annotation results of the target sub-fragment may include the annotation results of each of the multiple intermediate frames.

[0078] In this embodiment, the forward propagation algorithm can be used to obtain the forward propagation features of the intermediate frames, and the backward propagation algorithm can be used to obtain the backward propagation features of the intermediate frames. The fusion of the forward and backward propagation features allows the target image features to incorporate the annotation results of the first and last frames. The target image features can better characterize the annotation features of the intermediate frames, improving the annotation accuracy and precision of the intermediate frames.

[0079] As an example, the forward propagation feature extraction step may include: determining the forward propagation features of intermediate frames based on the annotation results of the first frame and the image sequence number using a forward propagation algorithm. The backward propagation feature extraction step may include: determining the backward propagation features of intermediate frames based on the annotation results of the last frame and the image sequence number using a backward propagation algorithm.

[0080] In practical applications, image labeling can be categorized according to different usage requirements. Multiple label types can be applied simultaneously in a single labeling operation. For example, in natural image processing scenarios, vehicles and pedestrians in videos can be tracked; therefore, vehicles and pedestrians can be treated as two separate label categories for labeling. During image feature extraction, to better represent different label categories and ensure that each category's label is unaffected by others, label features can be generated separately for each category. The elements of the label features represent the probability of each pixel belonging to that label category. For the same coordinate, the element values can specifically include the probability values corresponding to that coordinate in at least one label category; the label represented by the category with the highest probability value is the label for that coordinate.
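The per-category label features described above can be illustrated with a small sketch: one probability map per label category, and for each pixel the category with the highest probability wins. The category names, the tiny 2x2 maps, and the extra "background" category are assumptions made for illustration only.

```python
from typing import Dict, List

ProbMap = List[List[float]]  # rows of per-pixel probabilities for one category


def labels_from_category_maps(maps: Dict[str, ProbMap]) -> List[List[str]]:
    """Pick, for every pixel, the category whose probability is highest."""
    categories = list(maps)
    first = maps[categories[0]]
    height, width = len(first), len(first[0])
    return [
        [max(categories, key=lambda c: maps[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Illustrative values only; "background" is an assumed catch-all category.
maps = {
    "vehicle":    [[0.9, 0.2], [0.4, 0.1]],
    "pedestrian": [[0.1, 0.7], [0.3, 0.2]],
    "background": [[0.0, 0.1], [0.3, 0.7]],
}
label_grid = labels_from_category_maps(maps)
print(label_grid)
```

Keeping one map per category means a change in the vehicle map never alters the pedestrian map; the categories only interact at the final per-pixel comparison.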

[0081] In one possible design, after obtaining the forward and backward propagation features, feature fusion processing can be performed on them to obtain the target image features of the intermediate frame. The annotation result of the intermediate frame can then be determined through feature recognition based on the target image features. Determining the forward propagation features of the intermediate frame based on the annotation result of the first frame and the image sequence number includes: extracting features from the first frame based on the annotation result of the first frame to obtain the label features corresponding to at least one label category of the first frame; and propagating those label features toward the later frames to obtain the forward label features corresponding to at least one label category of the intermediate frame, thus obtaining the forward propagation features corresponding to at least one label category.

[0082] Optionally, determining the backward propagation features of the intermediate frames based on the last-frame annotation result and the image sequence number includes: extracting features from the last frame according to the last-frame annotation result to obtain the label features corresponding to at least one label category of the last frame; and propagating those label features toward the earlier frames to obtain the backward label features corresponding to at least one label category of the intermediate frames, thus obtaining the backward propagation features corresponding to at least one label category.

[0083] In this embodiment, the forward propagation features are obtained based on the annotation results of the first frame and the image sequence number, thus combining the characteristics of both. Similarly, the backward propagation features are obtained based on the annotation results of the last frame and the image sequence number, combining the characteristics of both. The forward and backward propagation features are the results of image features propagating from the first and last frames, respectively. Feature fusion processing is performed using the forward and backward propagation features to obtain the target image features of the intermediate frames. Since the target image features combine the propagation characteristics in both the forward and backward directions, the annotation results obtained using the target image features are more accurate, improving the annotation efficiency and accuracy of the intermediate frames.

[0084] In one possible design, the forward propagation algorithm may include a semi-supervised segmentation algorithm.

[0085] The backward propagation algorithm may likewise include a semi-supervised segmentation algorithm.

[0086] Based on the annotation results of the first frame, a semi-supervised segmentation algorithm can be used to perform forward feature propagation on the target sub-fragment frame by frame, starting from the first frame, until the forward propagation feature at the image index is obtained. Similarly, based on the annotation results of the last frame, a semi-supervised segmentation algorithm can be used to perform backward feature propagation on the target sub-fragment frame by frame, starting from the last frame, until the backward propagation feature at the image index is obtained.

[0087] Specifically, semi-supervised segmentation algorithms can be categorized as semi-supervised object segmentation algorithms. These algorithms can be used to segment target sub-fragments, starting from the first or last frame, and calculating the image features of the current frame using the image features of the previous frame. This process continues until the forward or backward propagation features corresponding to the image sequence number are obtained.

[0088] In this embodiment, a semi-supervised segmentation algorithm can be used to perform forward feature propagation on the target sub-segment frame by frame, starting from the first frame, until the forward propagation feature at the image sequence number is obtained. The semi-supervised segmentation algorithm can complete the forward propagation of image features, making the calculated forward propagation features more representative by combining the forward features of the first frame and previous images. The semi-supervised segmentation algorithm can also propagate from the last frame, that is, perform backward feature propagation frame by frame, until the backward propagation feature at the image sequence number is obtained. By using a semi-supervised segmentation algorithm, image features can be propagated forward or backward, improving the accuracy of image feature calculation.
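
As an illustration of the frame-by-frame propagation described above, the following is a minimal Python sketch. The function name `propagate`, the 0-based frame indexing, and the `step` callback are assumptions made for illustration only. In particular, `step` stands in for whatever semi-supervised segmentation model computes the current frame's features from the previous frame's features, which this disclosure does not specify; `toy_step` is a placeholder, not a segmentation algorithm.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # hypothetical stand-in for an image feature


def propagate(
    frames: Sequence[Feature],
    start_feature: Feature,
    target_index: int,
    step: Callable[[Feature, Feature], Feature],
    backward: bool = False,
) -> Feature:
    """Carry a feature from an annotated end frame to frames[target_index].

    Forward propagation starts at the first frame (index 0); backward
    propagation starts at the last frame. Each step uses the feature of
    the previously visited frame to compute the feature of the current one.
    """
    n = len(frames)
    if not 0 <= target_index < n:
        raise IndexError("target_index is outside the sub-segment")
    if backward:
        order = range(n - 2, target_index - 1, -1)  # last-1, ..., target
    else:
        order = range(1, target_index + 1)          # 1, ..., target
    feature = start_feature
    for i in order:
        feature = step(feature, frames[i])
    return feature


def toy_step(prev: Feature, frame: Feature) -> Feature:
    # Placeholder update rule (assumption): average previous feature and frame.
    return [0.5 * p + 0.5 * f for p, f in zip(prev, frame)]


frames = [[0.0], [1.0], [2.0], [3.0]]
forward_feature = propagate(frames, frames[0], 2, toy_step)
backward_feature = propagate(frames, frames[-1], 2, toy_step, backward=True)
```

The same routine serves both directions; only the starting frame and the visiting order change.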

[0089] After obtaining the forward propagation features and backward propagation features, feature fusion calculations can be performed based on these features to integrate the image features of the intermediate frame from both forward and backward perspectives. In some embodiments, feature fusion processing is performed on the forward propagation features and backward propagation features to obtain the target image features of the intermediate frame, including:

[0090] Determine the image sequence number of the intermediate frame in the target sub-fragment;

[0091] Determine the sequence number ratio based on the image sequence number;

[0092] The forward propagation weight and the backward propagation weight are determined based on the ratio of the sequence numbers;

[0093] The target image features of the intermediate frames are obtained based on the forward propagation weights, backward propagation weights, forward propagation features, and backward propagation features.

[0094] Optionally, the image sequence number of an intermediate frame can refer to the order in which the intermediate frame appears within the target sub-segment. For example, the image sequence number of the first image in the target sub-segment can be 1, and the image sequence number of the second image can be 2. The position of the intermediate frame within the target sub-segment can be determined by the image sequence number. Each image frame can be assigned a corresponding image sequence number according to its annotation order; for example, the image sequence number of the first frame can be 1, and the image sequence number of the last frame can be N+1.

[0095] In this embodiment, intermediate frames and their image numbers within the target sub-segment can be determined. The intermediate frame number represents its positional relationship with the first and last frames. By combining the annotation results of the first and last frames with the image numbers of the intermediate frames, the annotation results of the intermediate frames can be determined. This correlates the annotation effect of the intermediate frames with their position within the target sub-segment, improving annotation accuracy.

[0096] As one embodiment, determining the sequence number ratio based on the image sequence number may include:

[0097] Calculate the ratio of the image sequence number of the intermediate frame to the sequence number of the tail frame corresponding to the target sub-segment.

[0098] As another embodiment, obtaining the target image features of the intermediate frame based on the forward propagation weights, backward propagation weights, forward propagation features, and backward propagation features may include:

[0099] Based on the forward propagation weights and backward propagation weights, the forward propagation features and backward propagation features are fused and weighted to obtain the target image features of the intermediate frame.

[0100] If the image sequence number is K and the last frame sequence number is N, then the sequence number ratio is K / N. Determining the forward and backward propagation weights based on this sequence number ratio can include: determining the sequence number ratio K / N as the backward propagation weight, and determining the difference between the integer 1 and the sequence number ratio, i.e., 1-K / N, as the forward propagation weight. The weighted summation step of the target image features can include:

[0101] Calculate the product of the forward propagation weight $1 - K/N$ and the forward propagation feature $F_{\text{forward}}$ to obtain the first feature; calculate the product of the backward propagation weight $K/N$ and the backward propagation feature $F_{\text{backward}}$ to obtain the second feature; add the first and second features together to obtain the target image feature $F_{\text{current}}$, that is, $F_{\text{current}} = (1 - K/N)\,F_{\text{forward}} + (K/N)\,F_{\text{backward}}$.

[0102] Optionally, the forward propagation features may include forward label features corresponding to at least one label category. The backward propagation features may include backward label features corresponding to at least one label category. Based on the forward propagation weights and backward propagation weights, the forward and backward label features of each label category are weighted and summed to obtain the fused features corresponding to each label category. The fused features corresponding to each label category are the target image features.

[0103] The weighted summation of the forward and backward label features for each label category can include: for the forward and backward label features of each label category, multiplying the first feature value at each pixel coordinate in the forward label feature by the forward propagation weight, multiplying the second feature value at the same pixel coordinate in the backward label feature by the backward propagation weight, and adding the two products to obtain the feature value of that label category at each pixel coordinate.

[0104] The target image features can be characterized as the feature values of each pixel coordinate in the intermediate frame under different label categories.
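
The weighted fusion described above can be sketched in Python as follows. This is a minimal illustration, assuming each propagation feature is stored as a dictionary from label category to a 2D list of per-pixel feature values; the function name `fuse_features` and this data layout are hypothetical and not taken from the disclosure.

```python
from typing import Dict, List

FeatureMap = List[List[float]]  # per-pixel feature values (rows x columns)


def fuse_features(
    forward: Dict[str, FeatureMap],
    backward: Dict[str, FeatureMap],
    k: int,
    n: int,
) -> Dict[str, FeatureMap]:
    """Fuse forward and backward label features for an intermediate frame.

    k is the image sequence number of the intermediate frame and n is the
    sequence number of the last frame, so the sequence number ratio is k / n.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("expected 0 <= k <= n and n > 0")
    w_backward = k / n           # backward propagation weight
    w_forward = 1.0 - w_backward  # forward propagation weight
    fused: Dict[str, FeatureMap] = {}
    for label, f_map in forward.items():
        b_map = backward[label]
        fused[label] = [
            [w_forward * f + w_backward * b for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(f_map, b_map)
        ]
    return fused


fused = fuse_features(
    forward={"car": [[1.0, 0.0]]},
    backward={"car": [[0.0, 1.0]]},
    k=1,
    n=4,
)
```

With these weights, a frame close to the first frame relies mostly on the forward propagation feature, and a frame close to the last frame relies mostly on the backward propagation feature.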

[0105] For ease of understanding, in the example diagram of feature propagation shown in Figure 5, assume that the first frame annotation result of the first frame 501 is 5011 and the last frame annotation result of the last frame 502 is 5021. The first frame annotation result 5011 corresponds to the forward propagation feature, and the last frame annotation result 5021 corresponds to the backward propagation feature. The intermediate frame 503 can perform feature fusion on the forward and backward propagation features based on its sequence number to obtain the corresponding target image features. The target image features are then identified by the image classification layer to obtain the target region 5031 of the intermediate frame. This target region 5031 can be considered the annotation result of the intermediate frame.

[0106] In this embodiment, the correlation between an image and its forward and backward propagation features can be calculated based on the image sequence number, i.e., the sequence number ratio corresponding to the image sequence number can be calculated. This sequence number ratio can be used to determine the forward and backward propagation weights. Weighting the forward and backward propagation features according to the position of the image in this way increases the accuracy of image feature propagation.

[0107] As an example, determining the annotation results of intermediate frames based on target image features may include:

[0108] Identify, based on the image classification layer, the target region of the target image features;

[0109] Use the target region as the annotation result of the intermediate frame.

[0110] Optionally, identifying the target region of the target image features according to the image classification layer may include: determining the feature values corresponding to each pixel coordinate of the intermediate frame in the target image features in at least one label category; obtaining the maximum feature value among the feature values corresponding to each pixel coordinate in at least one label category; determining the target pixel coordinates corresponding to each label category based on the label category corresponding to the maximum feature value of each pixel coordinate; determining the label region formed by the target pixel coordinates of each label category; and obtaining the target region composed of the label regions corresponding to at least one label category. That is, the label regions corresponding to at least one label category can be the annotation results of the intermediate frame. The image classification layer can be a mathematical model for feature classification of image features.
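
The per-pixel maximum selection described above can be sketched as follows. This is an illustrative sketch under assumptions: the target image features are stored as a dictionary from label category to a 2D list of feature values, the function name `label_regions` is hypothetical, and ties are broken in favor of the first label category listed, which the disclosure does not specify.

```python
from typing import Dict, List, Set, Tuple

FeatureMap = List[List[float]]
Pixel = Tuple[int, int]  # (row, column)


def label_regions(features: Dict[str, FeatureMap]) -> Dict[str, Set[Pixel]]:
    """Assign each pixel to the label category with the maximum feature value.

    Returns, for every label category, the set of pixel coordinates that
    form its label region. Together these regions make up the target region.
    """
    labels = list(features)
    if not labels:
        return {}
    reference = features[labels[0]]
    regions: Dict[str, Set[Pixel]] = {label: set() for label in labels}
    for r in range(len(reference)):
        for c in range(len(reference[r])):
            # max() keeps the first label on ties (assumed tie-breaking rule).
            best = max(labels, key=lambda lab: features[lab][r][c])
            regions[best].add((r, c))
    return regions


regions = label_regions(
    {
        "background": [[0.9, 0.2]],
        "car": [[0.1, 0.8]],
    }
)
```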

[0111] In this embodiment of the disclosure, when determining the annotation result of the intermediate frame, the target region of the target image features can be identified according to the image classification layer, and the target region can be used as the annotation result of the intermediate frame. By using the image classification layer, labels can be accurately extracted from the target image features.

[0112] Figure 6 shows a flowchart of another embodiment of an image annotation method provided by this disclosure. The difference from the previous embodiments is that, after determining the annotation result of the intermediate frame, the method further includes:

[0113] 601: Output the annotation results of the intermediate frames.

[0114] The annotation results can include at least one label area corresponding to each label category.

[0115] 602: If a label confirmation operation performed by the user on the annotation results of the intermediate frame is detected, keep the annotation results of the intermediate frame unchanged.

[0116] 603: If a label modification operation performed by the user on the annotation results of the intermediate frame is detected, obtain the annotation results of the intermediate frame after modification.

[0117] The intermediate frames and their annotation results can be output together, so that users can view the automatic annotation results of the intermediate frames.

[0118] In this embodiment of the disclosure, after the annotation results of the intermediate frames are output, the user can view the annotation results and check the annotation effect. If the annotation is unqualified, the annotation results can be modified; if the annotation is qualified, the annotation results of the intermediate frames can be directly determined. Through interactive display with the user, the annotation results of the intermediate frames can be better matched with the user's annotation needs, and the annotation accuracy can be higher.

[0119] As an example, obtaining the first frame annotation result corresponding to the first frame of the target sub-fragment may include:

[0120] Detect the annotation operations performed by the user on the first frame and obtain the annotation results of the first frame corresponding to the annotation operations.

[0121] Alternatively, obtain the previous video sub-segment of the target sub-segment, and determine the last frame annotation result corresponding to the last frame of the previous video sub-segment as the first frame annotation result corresponding to the first frame of the target sub-segment.

[0122] Optionally, if the first frame is the first image of the target sub-segment and the target sub-segment is the first video sub-segment of the video to be labeled, the user's label setting operation performed on the first frame of the target sub-segment can be detected, and the first frame labeling result at the end of the setting can be obtained. Alternatively, if the target sub-segment is not the first video sub-segment, the last frame labeling result of the last frame of the previous video sub-segment of the target sub-segment can be obtained as the first frame labeling result corresponding to the first frame of the target sub-segment.

[0123] In this embodiment of the disclosure, by detecting the annotation operation performed by the user on the first frame, the annotation result of the first frame corresponding to the annotation operation can be obtained, and the annotation result of the first frame that is more in line with the user's annotation needs can be obtained. Alternatively, the annotation result of the last frame of the previous video sub-slice can be used as the annotation result of the first frame, which can improve the annotation efficiency of the first frame.

[0124] As another embodiment, in addition to the technical solutions provided in the above embodiments, the first frame of the target sub-fragment and the first frame annotation result corresponding to the first frame can also be obtained through the following methods:

[0125] If a label modification operation performed by the user on the annotation results of an intermediate frame is detected, take the intermediate frame whose annotation results were modified as the new first frame;

[0126] Use the modified annotation results of that intermediate frame as the first frame annotation result.

[0127] Figure 7 shows an example of image frame annotation prompts provided in an embodiment of this disclosure. Referring to Figure 7, after the annotation result 7011 of intermediate frame 701 is obtained, if it is detected that the user has modified the annotation result of the intermediate frame, for example by changing it to annotation result 7012, intermediate frame 701 can be used as the first frame, and the original first frame 702 is then no longer used as the first frame. Of course, the annotation prompts for the image frames in Figure 7 are merely illustrative and are not limiting.

[0128] In this embodiment of the disclosure, a label modification operation performed by the user on an intermediate frame indicates that the label propagation accuracy has decreased and that the propagated result no longer matches the user's actual annotation needs well. Using the intermediate frame with modified labels as the first frame, and its modified annotation as the first frame annotation result, makes subsequent propagation more effective and improves propagation efficiency and accuracy.
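
The confirm-or-modify handling and the re-anchoring of the first frame can be sketched as follows. The function name `review_intermediate_frame`, its parameters, and the use of `None` to represent a confirmation are hypothetical choices made for this illustration.

```python
from typing import Optional, Tuple


def review_intermediate_frame(
    first_index: int,
    frame_index: int,
    auto_result: object,
    user_edit: Optional[object] = None,
) -> Tuple[object, int]:
    """Apply the user's review of one automatically annotated intermediate frame.

    Returns the annotation to keep for this frame and the index of the frame
    that should act as the first frame for subsequent propagation.
    """
    if user_edit is None:
        # Confirmation: keep the automatic result and the current first frame.
        return auto_result, first_index
    # Modification: keep the user's result and promote this frame to first frame.
    return user_edit, frame_index


confirmed = review_intermediate_frame(0, 5, "auto mask")
modified = review_intermediate_frame(0, 5, "auto mask", user_edit="fixed mask")
```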

[0129] To obtain accurate video sub-segments, Figure 8 shows a flowchart of another embodiment of a video annotation method provided by this disclosure. The difference from the previous embodiments lies in that determining the sub-segment to be annotated in the video to obtain the target sub-segment includes:

[0130] 801: Extract keyframes from the video to be annotated.

[0131] 802: Take the video interval enclosed by each pair of adjacent keyframes in the video to be labeled as a video sub-segment, obtaining at least one video sub-segment.

[0132] 803: Identify the target sub-segment to be labeled from at least one video sub-segment.

[0133] Optionally, keyframes in the video to be annotated can be grouped. Two adjacent keyframes can be considered as a group, and at least one group of keyframes can be determined from at least one keyframe. A group of keyframes includes an adjacent first keyframe and a second keyframe, with the first keyframe preceding the second keyframe. The second keyframe of the preceding group is the same as the first keyframe of the following group. The video interval enclosed by two adjacent keyframes can be considered as a video sub-segment. That is, a video sub-segment can include two keyframes and an intermediate frame between the two keyframes. Of course, the intermediate frame can be obtained by sampling at a preset sampling frequency.
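
The grouping of adjacent keyframes into video sub-segments can be sketched as follows. This is a minimal illustration that assumes keyframes are given as frame indices and that every frame between two keyframes is kept (no sampling); the function name `split_into_subsegments` is hypothetical.

```python
from typing import List, Sequence


def split_into_subsegments(keyframes: Sequence[int]) -> List[List[int]]:
    """Turn keyframe indices into sub-segments of consecutive frame indices.

    Each pair of adjacent keyframes encloses one sub-segment, which contains
    both keyframes and the intermediate frames between them. The last frame
    of one sub-segment is therefore the first frame of the next.
    """
    ordered = sorted(set(keyframes))
    return [
        list(range(first, second + 1))
        for first, second in zip(ordered, ordered[1:])
    ]


subsegments = split_into_subsegments([0, 3, 5])
```

Fewer than two keyframes yields no sub-segment, since no interval is enclosed.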

[0134] Keyframes can be images that differ significantly from nearby images in the video to be labeled. For example, if there are no vehicles in the image at time t1, but vehicles appear in the image at time t2, and the time difference between t1 and t2 is within the time constraint, then the image at time t2 is determined as the keyframe.

[0135] For ease of understanding, Figure 9 is an example diagram illustrating the division of video sub-segments according to an embodiment of this disclosure. Referring to Figure 9, the keyframes of the video to be annotated are keyframe 1, keyframe 2, keyframe 4, and keyframe 6. Two adjacent keyframes can be grouped together.

[0136] Keyframe 1 and keyframe 2 can be considered as a set of adjacent keyframes, and the image frame enclosed by this set of adjacent keyframes can be video segment 1. Video segment 1 can consist of keyframe 1, keyframe 2, and image frame 3 between keyframes 1 and 2.

[0137] Keyframes 2 and 4 can be considered as a set of adjacent keyframes, and the image frame enclosed by this set of adjacent keyframes can be video segment 2. Video segment 2 can consist of keyframes 2 and 4, and image frame 5 between keyframes 2 and 4.

[0138] Keyframes 4 and 6 can be considered as a group of adjacent keyframes, and the image frame enclosed by this group of adjacent keyframes can be video segment 3. Video segment 3 can consist of keyframes 4 and 6, and the image frame 7 between keyframes 4 and 6.

[0139] Two adjacent keyframe groups share a keyframe. Referring to Figure 9, keyframe 2 can be both the last frame of video segment 1 and the first frame of video segment 2. Keyframe 4 can be both the last frame of video segment 2 and the first frame of video segment 3. Through this keyframe extraction method, each keyframe can be extracted...

[0140] In this embodiment, by extracting keyframes from the video to be labeled, two adjacent keyframes can be obtained based on these keyframes. The video interval enclosed by two adjacent keyframes in the video to be labeled can be a video sub-segment, thereby obtaining at least one video sub-segment corresponding to the video to be labeled. This ensures that the last frame of the preceding video sub-segment is the same as the first frame of the following video sub-segment among two adjacent video sub-segments, thus achieving comprehensive and accurate segmentation of the video to be labeled and improving the segmentation efficiency of at least one video sub-segment.

[0141] In some embodiments, at least one keyframe can be extracted from the video to be labeled according to the keyframe extraction frequency; or, at least one keyframe that satisfies the image change condition can be extracted from the video to be labeled.

[0142] The keyframe extraction frequency can be set according to usage requirements, or it can be preset. It is expressed as a number of frames per extraction: one keyframe is extracted each time that number of image frames has elapsed. For example, when the keyframe extraction frequency is 10, one keyframe can be extracted every 10 frames, and the 1st and 11th frames can both be keyframes.
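
Fixed-interval keyframe extraction can be sketched as follows. The function name `keyframes_by_frequency` is hypothetical, and frames are numbered from 1 to match the example above.

```python
from typing import List


def keyframes_by_frequency(num_frames: int, frequency: int) -> List[int]:
    """Return 1-based keyframe numbers, one keyframe every `frequency` frames."""
    if frequency <= 0:
        raise ValueError("frequency must be a positive number of frames")
    return list(range(1, num_frames + 1, frequency))


keyframes = keyframes_by_frequency(num_frames=25, frequency=10)
```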

[0143] In one possible design, at least one keyframe of the video to be labeled is extracted, including:

[0144] For each image frame in the video to be labeled, calculate the motion amplitude value of each image frame;

[0145] At least one keyframe in the image frame is obtained based on the motion amplitude value.

[0146] Image change conditions may include: the motion amplitude value of the image frame is greater than the index threshold.

[0147] Optionally, obtaining at least one keyframe in the image frame based on the motion amplitude value may include:

[0148] If the motion amplitude value of any image frame is greater than the index threshold, the image frame is determined as a key frame, so as to obtain at least one key frame among multiple image frames.

[0149] Motion amplitude value refers to the difference in amplitude between an image frame and its surrounding frames. The motion amplitude value is calculated by subtracting the amplitude values of the image frame from those of its surrounding frames. If the motion amplitude value is greater than a threshold, it indicates a significant difference between the image frame and its surrounding frames, and this image frame can be used as a keyframe.
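
The threshold test on motion amplitude values can be sketched as follows, assuming the motion amplitude value of every image frame has already been computed. The function name `keyframes_by_motion` and the 0-based indexing are assumptions made for illustration.

```python
from typing import List, Sequence


def keyframes_by_motion(amplitudes: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of frames whose motion amplitude exceeds the threshold."""
    return [
        index
        for index, amplitude in enumerate(amplitudes)
        if amplitude > threshold
    ]


# Frames 2 and 5 differ strongly from their surrounding frames.
motion_keyframes = keyframes_by_motion([0.1, 0.2, 0.9, 0.1, 0.3, 0.8], threshold=0.5)
```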

[0150] For ease of understanding, Figure 10 illustrates an example of keyframe extraction provided in this embodiment. With the motion amplitude of each image frame on the vertical axis and the sequence number of each image frame in the video to be labeled on the horizontal axis, the amplitude of each image frame changes continuously starting from the first image frame 0. The line connecting the amplitude values of each image frame forms curve 1001. The motion amplitude value can represent the amplitude difference between image frames, and the changes in curve 1001 reflect this difference; that is, the image frame corresponding to the key point 1002, where the motion amplitude is greater than the index threshold, can be considered a keyframe.

[0151] In this embodiment of the disclosure, a motion amplitude value can be calculated for each of the multiple image frames in the video to be labeled, and keyframes can be selected based on these values. The keyframes are then used to obtain video sub-segments, so that motion amplitude serves as the basis for dividing the video into sub-segments, which can effectively improve the labeling accuracy of images during automatic image labeling.

[0152] In this embodiment of the present disclosure, the steps for calculating the motion amplitude value of each image frame may include:

[0153] Calculate the inter-frame difference value corresponding to the inter-frame amplitude difference index for each image frame, and determine the inter-frame difference value as the motion amplitude value.

[0154] Alternatively, calculate the inter-frame optical flow change amplitude value corresponding to the inter-frame optical flow change index for each image frame, and determine the inter-frame optical flow change amplitude value as the motion amplitude value.

[0155] Alternatively, based on a pre-trained segmentation model, the intersection-over-union (IoU) ratio of the segmentation results for each image frame can be calculated, and the IoU ratio can be determined as the motion amplitude value.

[0156] Because different types of motion amplitude values can be used, the index threshold can be determined according to the type of motion amplitude value.

[0157] Optionally, the inter-frame difference can refer to the difference in the mean pixel values of two image frames.

[0158] The optical flow variation amplitude value can refer to the difference in optical flow between two or more image frames. The optical flow variation amplitude value corresponding to each image frame can be calculated using the optical flow calculation formula. The intersection-over-union (IoU) ratio is calculated from the segmentation results of the image frames: it is the ratio of the intersection to the union of the segmentation results of an image frame and its surrounding frames after image segmentation processing. If the overlap between the two is high, the IoU value is large; if the overlap is low, the IoU value is small.
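
Two of the measures above, the inter-frame difference of mean pixel values and the intersection-over-union ratio of two segmentation results, can be sketched as follows. The function names and data layouts (frames as 2D lists of pixel values, segmentation results as sets of pixel coordinates) are assumptions for illustration, as is the choice to return 1.0 when both segmentation results are empty. The optical flow measure is omitted because the disclosure does not give its formula.

```python
from typing import List, Set, Tuple

Frame = List[List[float]]
Pixel = Tuple[int, int]


def mean_pixel_value(frame: Frame) -> float:
    """Average of all pixel values in a non-empty frame."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count


def inter_frame_difference(frame_a: Frame, frame_b: Frame) -> float:
    """Absolute difference between the mean pixel values of two frames."""
    return abs(mean_pixel_value(frame_a) - mean_pixel_value(frame_b))


def intersection_over_union(mask_a: Set[Pixel], mask_b: Set[Pixel]) -> float:
    """Ratio of the intersection to the union of two segmentation results."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # assumption: two empty results count as identical
    return len(mask_a & mask_b) / len(union)


difference = inter_frame_difference([[0, 0], [0, 0]], [[4, 4], [4, 4]])
overlap = intersection_over_union({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
```

Note that the two measures move in opposite directions: a large inter-frame difference indicates a large change, whereas a large IoU indicates high overlap, so the index threshold has to be chosen per measure.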

[0159] In this embodiment of the disclosure, the motion amplitude value of each image frame can be accurately calculated in several ways: from the inter-frame difference, from the inter-frame optical flow change amplitude value, or from the intersection-over-union (IoU) ratio of the segmentation results corresponding to the image frames.

[0160] As one example, determining the sub-segment to be labeled in the video to obtain the target sub-segment includes:

[0161] Determine the segment order corresponding to at least one video segment based on the chronological order of at least one video segment;

[0162] According to the segment order corresponding to at least one video segment, starting from the first video segment, select one video segment at a time as the segment to be labeled to obtain the target segment.

[0163] Optionally, the segment order of each video segment can be determined based on the segment number corresponding to at least one video segment. The target segment can be determined sequentially from at least one video segment. After obtaining the target segment, the annotation scheme of the above embodiment can be executed until at least one video segment has been traversed, obtaining the annotation results of all video segments. The annotation results of all video segments are then combined to obtain the annotation result of the video to be annotated.
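
The sequential traversal can be sketched as follows. This is an illustrative sketch, not the disclosed implementation: `annotate_segment` is a hypothetical callback standing in for the per-sub-segment annotation scheme of the above embodiments, and dropping the duplicated boundary frame when combining results is an assumption based on adjacent sub-segments sharing a keyframe.

```python
from typing import Callable, List, Sequence

Annotation = int  # hypothetical stand-in for a per-frame annotation result


def annotate_video(
    segments: Sequence[Sequence[int]],
    first_annotation: Annotation,
    annotate_segment: Callable[[Sequence[int], Annotation], List[Annotation]],
) -> List[Annotation]:
    """Annotate sub-segments in segment order and combine the results.

    The last frame annotation of each sub-segment is reused as the first
    frame annotation of the next sub-segment.
    """
    combined: List[Annotation] = []
    current_first = first_annotation
    for segment in segments:
        result = annotate_segment(segment, current_first)
        # Adjacent sub-segments share a boundary frame; keep it only once.
        combined.extend(result if not combined else result[1:])
        current_first = result[-1]
    return combined


def toy_annotate(segment: Sequence[int], first: Annotation) -> List[Annotation]:
    # Placeholder annotator (assumption): offsets the first annotation per frame.
    return [first + (frame - segment[0]) for frame in segment]


video_result = annotate_video([[0, 1, 2], [2, 3, 4]], 100, toy_annotate)
```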

[0164] When segmenting a video, a segment number can be assigned to each obtained video segment. For example, the segment number of the first obtained video segment is 1, and the segment number of the second video segment is 2.

[0165] In this embodiment of the disclosure, target sub-segments can be selected sequentially from at least one video sub-segment according to the segment order corresponding to each of the at least one video sub-segment. Utilizing the segment order to obtain target sub-segments ensures that the corresponding target sub-segments are obtained sequentially, thereby completing the annotation of each target sub-segment sequentially. This achieves sequential annotation of at least one video sub-segment, improving the comprehensiveness of the video sub-segment annotation.

[0166] Furthermore, the technical solutions disclosed herein can also be applied to the gaming field, specifically including applications such as the design and display of 3D game scenes.

[0167] Figure 11 is a structural schematic of one embodiment of a video annotation device provided in this disclosure. This device can be located in an electronic device and can be configured with the aforementioned video annotation method. The video annotation device 1100 may include:

[0168] The first determining unit 1101 is used to determine the sub-segment to be labeled in the video to be labeled, and obtain the target sub-segment;

[0169] The first frame annotation unit 1102 is used to obtain the first frame annotation result corresponding to the first frame of the target sub-fragment;

[0170] The tail frame annotation unit 1103 is used to generate the tail frame annotation result corresponding to the tail frame of the target sub-fragment based on the first frame annotation result;

[0171] The segment annotation unit 1104 is used to generate the annotation results of the intermediate frames of the target sub-segment based on the annotation results of the first frame and the last frame, so as to obtain the annotation results of the target sub-segment to be annotated.

[0172] The second determining unit 1105 is used to generate target annotation results for the video to be annotated based on the annotation results of the target sub-segments.

[0173] As one embodiment, the first determining unit includes:

[0174] The key extraction module is used to extract key frames from the video to be labeled;

[0175] The segment acquisition module is used to take the video interval enclosed by each pair of adjacent keyframes in the video to be labeled as a video sub-segment, thereby obtaining at least one video sub-segment.

[0176] The target determination module is used to determine the target sub-segment to be labeled from at least one video sub-segment.

[0177] In some embodiments, the key extraction module includes:

[0178] The amplitude calculation submodule is used to calculate the motion amplitude value of each image frame in the video to be labeled.

[0179] The key determination submodule is used to obtain at least one key frame in the image frame based on the motion amplitude value.

[0180] As one embodiment, the tail frame annotation unit may include:

[0181] The first frame acquisition module is used to obtain the first frame annotation result corresponding to the first frame of the target sub-fragment;

[0182] The tail frame generation module is used to determine the tail frame annotation result based on the first frame annotation result using the forward propagation algorithm.

[0183] In one possible design, the tail frame generation module may include:

[0184] The label propagation submodule is used to use the forward propagation algorithm to propagate the labeling results of the first frame to the unlabeled image frames in the target sub-segment in a sequential manner, so as to obtain the labeling results of the unlabeled image frames in the target sub-segment.

[0185] The tail frame annotation submodule is used to obtain the annotation result of the last image frame of the target sub-fragment as the tail frame annotation result.

[0186] As another embodiment, the fragment annotation unit includes:

[0187] The first extraction module is used to extract the forward propagation features of the intermediate frames of the target sub-segment based on the annotation results of the first frame and in combination with the forward propagation algorithm.

[0188] The second extraction module is used to extract the backward propagation features of the intermediate frames of the target sub-segment based on the tail frame annotation result and in combination with the backward propagation algorithm.

[0189] The feature fusion module is used to fuse the forward propagation features and the backward propagation features to obtain the target image features of the intermediate frame;

[0190] The label determination module is used to determine the annotation results of intermediate frames based on the features of the target image.

[0191] In some embodiments, the feature fusion module may include:

[0192] The sequence number determination submodule is used to determine the image sequence number of the intermediate frame in the target sub-fragment;

[0193] The ratio determination submodule is used to determine the sequence ratio based on the image sequence number;

[0194] The weight determination submodule is used to determine the forward propagation weight and the backward propagation weight based on the sequence number ratio.

[0195] The feature weighting submodule is used to obtain the target image features of the intermediate frame based on the forward propagation weights, backward propagation weights, forward propagation features, and backward propagation features.

[0196] As one embodiment, the first frame annotation unit may include:

[0197] The first frame annotation module is used to detect the annotation operations performed by the user on the first frame and obtain the first frame annotation results corresponding to the annotation operations.

[0198] Alternatively, the first frame determination module is used to obtain the previous video sub-segment of the target sub-segment and determine the last frame annotation result corresponding to the last frame of the previous video sub-segment as the first frame annotation result corresponding to the first frame of the target sub-segment.

[0199] As one embodiment, the first determining unit may include:

[0200] The sequence determination module is used to determine the segment order corresponding to at least one video segment based on the chronological order of at least one video segment.

[0201] The segment traversal module is used to select a video segment as the sub-segment to be labeled, starting from the first video segment, according to the segment order corresponding to at least one video sub-segment.

[0202] The apparatus provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar, and will not be described again here.

[0203] To implement the above embodiments, this disclosure also provides an electronic device.

[0204] Referring to Figure 12, a structural schematic of an electronic device 1200 suitable for implementing embodiments of the present disclosure is shown. The electronic device 1200 can be a terminal device or a server. The terminal device can include, but is not limited to, mobile terminals such as mobile phones, laptops, digital radio receivers, personal digital assistants (PDAs), tablet computers (PADs), portable media players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 12 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0205] As shown in Figure 12, the electronic device 1200 may include a processing unit 1201 (e.g., a central processing unit, a graphics processor, etc.), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing unit 1201, ROM 1202, and RAM 1203 are interconnected via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

[0206] Typically, the following devices can be connected to the I/O interface 1205: input devices 1206 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 1207 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1208 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1209. The communication device 1209 allows the electronic device 1200 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 12 shows an electronic device 1200 with various devices, it should be understood that not all of the devices shown need to be implemented or present. More or fewer devices may alternatively be implemented or present.

[0207] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 1209, or installed from the storage device 1208, or installed from the ROM 1202. When the computer program is executed by the processing unit 1201, it performs the functions defined in the methods of embodiments of this disclosure.

[0208] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0209] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.

[0210] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

[0211] This disclosure also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the video annotation method provided in any of the above embodiments.

[0212] This disclosure also provides a computer program product, including a computer program which, when executed by a processor, implements the video annotation method provided in any of the above embodiments.

[0213] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0214] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than that indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0215] The units described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".

[0216] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0217] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0218] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. One example is a technical solution formed by interchanging the above features with technical features having similar functions that are disclosed in (but not limited to) this disclosure.

[0219] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0220] Although the subject matter has been described using language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. A video annotation method, characterized by comprising: determining, based on two adjacent keyframes, a sub-segment to be annotated in a video to be annotated, to obtain a target sub-segment, wherein a keyframe is an image that differs from its neighboring images in the video to be annotated; obtaining a first frame annotation result corresponding to the first frame of the target sub-segment; generating, based on the first frame annotation result, a last frame annotation result corresponding to the last frame of the target sub-segment; determining forward propagation features of the intermediate frames of the target sub-segment based on the first frame annotation result, and determining backward propagation features of the intermediate frames of the target sub-segment based on the last frame annotation result; fusing the forward propagation features and the backward propagation features to obtain target image features of the intermediate frames; generating annotation results of the intermediate frames of the target sub-segment based on the target image features, to obtain an annotation result of the target sub-segment; and generating a target annotation result of the video to be annotated based on the annotation result of the target sub-segment.

2. The method of claim 1, wherein determining, based on two adjacent keyframes, the sub-segment to be annotated in the video to be annotated, to obtain the target sub-segment, comprises: extracting the keyframes from the video to be annotated; dividing the video range enclosed by each pair of adjacent keyframes in the video to be annotated into one video sub-segment, to obtain at least one video sub-segment; and determining the target sub-segment to be annotated from the at least one video sub-segment.

3. The method of claim 2, wherein extracting the keyframes from the video to be annotated comprises: calculating a motion amplitude value for each image frame in the video to be annotated; and obtaining at least one keyframe among the image frames based on the motion amplitude values.
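For illustration only, the keyframe extraction and segmentation of claims 2 and 3 might be sketched as below. The motion amplitude measure (mean absolute pixel difference from the previous frame), the fixed threshold, and the choice to treat the first and last frames of the video as segment boundaries are all assumptions; the claims do not specify them. Frames are represented as flat sequences of pixel intensities to keep the sketch dependency-free.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # hypothetical: a frame as a flat list of intensities


def motion_amplitude(previous: Frame, current: Frame) -> float:
    """Assumed measure: mean absolute difference between consecutive frames."""
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(current)


def extract_keyframes(frames: List[Frame], threshold: float) -> List[int]:
    """Return indices of frames whose motion amplitude exceeds `threshold`.

    Assumption: the first and last frames are always included so that the
    sub-segments cover the whole video.
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        if motion_amplitude(frames[i - 1], frames[i]) > threshold:
            keyframes.append(i)
    last = len(frames) - 1
    if keyframes[-1] != last:
        keyframes.append(last)
    return keyframes


def split_into_subsegments(keyframes: List[int]) -> List[Tuple[int, int]]:
    """Pair each keyframe with the next one to delimit a sub-segment."""
    return list(zip(keyframes, keyframes[1:]))
```

In this sketch adjacent sub-segments share a boundary keyframe, so the last frame of one sub-segment is also the first frame of the next.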

4. The method of claim 1, wherein generating, based on the first frame annotation result, the last frame annotation result corresponding to the last frame of the target sub-segment comprises: obtaining the first frame annotation result corresponding to the first frame of the target sub-segment; and determining, based on the first frame annotation result and using a forward propagation algorithm, the last frame annotation result corresponding to the last frame.

5. The method of claim 4, wherein determining, based on the first frame annotation result and using the forward propagation algorithm, the last frame annotation result corresponding to the last frame comprises: using the forward propagation algorithm, propagating the first frame annotation result sequentially to the unannotated image frames in the target sub-segment, to obtain annotation results of the unannotated image frames in the target sub-segment; and taking the annotation result of the last image frame of the target sub-segment as the last frame annotation result corresponding to the last frame.
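The sequential propagation in claim 5 can be sketched as a simple loop. The `step` callable is a hypothetical stand-in for whatever model carries an annotation from one frame to the next; the claims do not name a concrete algorithm, so the sketch only shows the control flow.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")  # annotation type, left generic in this sketch


def propagate_forward(
    first_annotation: A,
    num_frames: int,
    step: Callable[[A, int], A],
) -> List[A]:
    """Propagate the first frame annotation through the sub-segment in order.

    `step(previous_annotation, frame_index)` is a hypothetical function that
    returns the annotation for frame `frame_index`.
    """
    if num_frames < 1:
        raise ValueError("the sub-segment must contain at least one frame")
    annotations = [first_annotation]
    for frame_index in range(1, num_frames):
        annotations.append(step(annotations[-1], frame_index))
    return annotations
```

The final element of the returned list plays the role of the last frame annotation result. A toy `step` that shifts a bounding box by one pixel per frame is enough to exercise the loop.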

6. The method of claim 1, wherein determining the forward propagation features of the intermediate frames of the target sub-segment based on the first frame annotation result comprises: extracting the forward propagation features of the intermediate frames of the target sub-segment based on the first frame annotation result in combination with a forward propagation algorithm; and determining the backward propagation features of the intermediate frames of the target sub-segment based on the last frame annotation result comprises: extracting the backward propagation features of the intermediate frames of the target sub-segment based on the last frame annotation result in combination with a backward propagation algorithm.

7. The method of claim 6, wherein fusing the forward propagation features and the backward propagation features to obtain the target image features of the intermediate frame comprises: determining an image sequence number of the intermediate frame within the target sub-segment; determining a sequence number ratio based on the image sequence number; determining a forward propagation weight and a backward propagation weight based on the sequence number ratio; and obtaining the target image features of the intermediate frame based on the forward propagation weight, the backward propagation weight, the forward propagation features, and the backward propagation features.

8. The method of claim 1, wherein obtaining the first frame annotation result corresponding to the first frame of the target sub-segment comprises: detecting an annotation operation performed by the user on the first frame, and obtaining the first frame annotation result corresponding to the annotation operation; or, obtaining the preceding video sub-segment of the target sub-segment, and determining the last frame annotation result corresponding to the last frame of the preceding video sub-segment as the first frame annotation result corresponding to the first frame of the target sub-segment.

9. A video annotation apparatus, characterized by comprising: a first determining unit, used to determine, based on two adjacent keyframes, a sub-segment to be annotated in a video to be annotated, to obtain a target sub-segment, wherein a keyframe is an image that differs from its neighboring images in the video to be annotated; a first frame annotation unit, used to obtain a first frame annotation result corresponding to the first frame of the target sub-segment; a last frame annotation unit, used to generate, based on the first frame annotation result, a last frame annotation result corresponding to the last frame of the target sub-segment; a segment annotation unit, used to determine forward propagation features of the intermediate frames of the target sub-segment based on the first frame annotation result, determine backward propagation features of the intermediate frames of the target sub-segment based on the last frame annotation result, fuse the forward propagation features and the backward propagation features to obtain target image features of the intermediate frames, and generate annotation results of the intermediate frames of the target sub-segment based on the target image features, to obtain an annotation result of the target sub-segment; and a second determining unit, used to generate a target annotation result of the video to be annotated based on the annotation result of the target sub-segment.

10. An electronic device, characterized by comprising: a processor and a memory; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory, causing the processor to implement the video annotation method as described in any one of claims 1 to 8.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video annotation method as described in any one of claims 1 to 8.

12. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the video annotation method as described in any one of claims 1 to 8.