Video segmentation method and device, computer device and storage medium

By obtaining the interaction points of the reference frame in the video segmentation algorithm for coarse segmentation and combining them with refined features for fine segmentation, the problem of rough edges in the segmentation mask in the existing technology is solved, and a higher precision video segmentation effect is achieved.

CN115641343B · Active · Publication Date: 2026-06-19 · XIAMEN MEITUZHIJIA TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XIAMEN MEITUZHIJIA TECH
Filing Date: 2022-10-17
Publication Date: 2026-06-19

IPC: G06T7/11; G06V10/82; G06V10/80; G06V10/75; G06V20/40

AI Tagging

Application Domain

Image analysis; Character and pattern recognition

Abstract

This application relates to a video segmentation method, apparatus, computer device, and storage medium, and pertains to the field of video processing technology. The method includes: in response to an interactive operation on a reference frame in a video frame sequence, obtaining reference frame interaction points for the reference frame; performing coarse segmentation on the reference frame based on the reference frame interaction points to obtain a coarse segmentation mask for the reference frame; determining refined features of the reference frame, the refined features being obtained by feature extraction from the image details of the reference frame; performing fine segmentation on the reference frame based on the refined features and the coarse segmentation mask to obtain a segmentation mask for the reference frame; and performing segmentation prediction on each video frame in the video frame sequence based on the segmentation mask of the reference frame to obtain a segmentation mask corresponding to each video frame. This method can improve the precision of the video segmentation mask.

Description

Technical Field

[0001] This application relates to the field of video processing technology, and in particular to a video segmentation method, apparatus, computer device, and storage medium.

Background Technology

[0002] With the continuous development of computer and multimedia technologies, interactive video segmentation methods are widely used in video keying functions during video editing. The main task is for users to interactively select static reference frames from the video and mark the objects to be keyed. The segmentation algorithm then automatically segments the reference frames based on the interaction map generated from the user's interaction information. Currently, most segmentation algorithms directly concatenate the reference frame and the interaction map before feeding them into an image segmentation network for prediction, obtaining a segmentation mask. However, the segmentation mask obtained in this way has relatively coarse edges, which reduces the accuracy of video segmentation.

Summary of the Invention

[0003] Therefore, in view of the above technical problem, it is necessary to provide a video segmentation method, apparatus, computer device, and storage medium that can improve the accuracy of video segmentation.

[0004] In a first aspect, this application provides a video segmentation method. The method includes:

[0005] In response to an interaction with a reference frame in a video frame sequence, obtain the reference frame interaction point for the reference frame.

[0006] The reference frame is coarsely segmented based on the reference frame and the reference frame interaction point to obtain the coarse segmentation mask of the reference frame.

[0007] Determine the refined features of the reference frame; refined features are features obtained by extracting features from the image details of the reference frame.

[0008] The reference frame is finely segmented based on the refined features and the coarse segmentation mask to obtain the segmentation mask of the reference frame;

[0009] Based on the segmentation mask of the reference frame, segmentation prediction is performed on each video frame in the video frame sequence to obtain the segmentation mask corresponding to each video frame.

[0010] In a second aspect, this application also provides a video segmentation apparatus. The apparatus includes:

[0011] The acquisition unit is used to acquire the reference frame interaction point for the reference frame in response to the interaction operation for the reference frame in the video frame sequence.

[0012] The coarse segmentation unit is used to coarsely segment the reference frame based on the reference frame and the reference frame interaction point to obtain the coarse segmentation mask of the reference frame.

[0013] The determination unit is used to determine the refined features of the reference frame; the refined features are the features obtained by extracting features from the image details of the reference frame.

[0014] The fine segmentation unit is used to perform fine segmentation on the reference frame based on the fine features and the coarse segmentation mask to obtain the segmentation mask of the reference frame;

[0015] The prediction unit is used to perform segmentation prediction on each video frame in the video frame sequence according to the segmentation mask of the reference frame, so as to obtain the segmentation mask corresponding to each video frame.

[0016] In some embodiments, the coarse segmentation unit is further configured to convert the interaction points of the reference frame into an interaction point distance map; perform feature fusion on the reference frame and the interaction point distance map to obtain a first fused feature; and perform coarse segmentation on the reference frame according to the first fused feature to obtain a coarse segmentation mask of the reference frame.

[0017] In some embodiments, the coarse segmentation unit is further configured to extract deep features and shallow features of the first fused feature respectively; and perform decoding processing based on the deep features and shallow features to obtain a coarse segmentation mask of the reference frame.

[0018] In some embodiments, the coarse segmentation unit is further configured to extract features from deep features using convolutions with multiple different receptive fields to obtain multiple receptive field features; concatenate the multiple receptive field features to obtain receptive field concatenated features; fuse the receptive field concatenated features and shallow features to obtain a second fused feature; and decode the second fused feature to obtain a coarse segmentation mask of the reference frame.

[0019] In some embodiments, the coarse segmentation mask is obtained by decoding the second fused feature through the decoding module; the fine segmentation unit is further configured to concatenate the coarse segmentation mask and the last layer feature of the decoding module to obtain the first concatenated feature; concatenate the refined feature and the feature obtained by deep feature extraction of the first concatenated feature to obtain the second concatenated feature; and predict the segmentation mask of the reference frame based on the second concatenated feature.

[0020] In some embodiments, the prediction unit is further configured to encode the reference frame and its segmentation mask to obtain reference frame features; store the reference frame features corresponding to the reference frame in a preset storage queue; determine the query frame features for each video frame; the query frame features are features obtained by feature extraction for each video frame; for the query frame features of each video frame, perform feature matching on the query frame features using the reference frame features stored in the storage queue to obtain query frame matching features; and predict the segmentation mask corresponding to each video frame based on the query frame matching features of each video frame.
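The feature matching step above can be sketched as follows. The source does not state how matching is computed, so this sketch assumes a scaled dot-product similarity followed by a softmax-weighted readout of the stored reference frame features. The function name `match_query_to_memory` and the assumption that query and reference features share the same channel count are illustrative only.

```python
import numpy as np

def match_query_to_memory(query_feat, memory_feats):
    """Match one query frame against all stored reference frame features.

    query_feat:   (C, H, W) features extracted from the query frame.
    memory_feats: list of (C, H, W) reference frame features from the queue.
    Returns (C, H, W) query frame matching features.
    Assumption: dot-product similarity + softmax; not specified by the source.
    """
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w)                               # (C, HW)
    m = np.concatenate([f.reshape(c, -1) for f in memory_feats], axis=1)  # (C, N*HW)

    # Similarity of every stored location to every query location.
    sim = m.T @ q / np.sqrt(c)                                     # (N*HW, HW)
    sim = sim - sim.max(axis=0, keepdims=True)                     # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=0, keepdims=True)         # softmax over memory

    matched = m @ weights                                          # (C, HW)
    return matched.reshape(c, h, w)
```

Each query location receives a convex combination of stored reference features, so adding more reference frames to the queue only widens the set of locations it can draw from.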

[0021] In some embodiments, the video segmentation apparatus of this application further includes a queue update module, which is used to, when the queue length of the storage queue exceeds a queue length threshold, extract from the storage queue, at preset intervals, the reference frame features corresponding to the reference frames that need to be retained; and to update the storage queue with the reference frame features corresponding to the reference frames that need to be retained.
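A minimal sketch of the queue update described above, assuming the retained entries are simply every `interval`-th entry plus the most recent one. The source does not say which reference frames "need to be retained", so that rule, the function name `prune_storage_queue`, and its parameters are hypothetical.

```python
from collections import deque

def prune_storage_queue(queue, max_len, interval):
    """Thin out the queue once it grows past `max_len`.

    Assumption (not from the source): keep entries at indices
    0, interval, 2*interval, ... and always keep the newest entry.
    """
    if len(queue) <= max_len:
        return queue
    items = list(queue)
    indices = list(range(0, len(items), interval))
    if indices[-1] != len(items) - 1:
        indices.append(len(items) - 1)
    return deque(items[i] for i in indices)
```

Sampling at a fixed interval bounds memory use while still keeping reference features spread across the whole video.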

[0022] In some embodiments, the acquisition unit of this application is further configured to, in response to a selection operation on a video frame sequence, determine the selected video frame as a reference frame; or determine the reference frame from the video frame sequence according to a preset extraction rule.

[0023] In a third aspect, this application also provides a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described video segmentation method.

[0024] In a fourth aspect, this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the video segmentation method described above.

[0025] In a fifth aspect, this application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the steps in the video segmentation method described above.

[0026] In the aforementioned video segmentation method, apparatus, computer device, storage medium, and computer program product, a video frame sequence comprising multiple video frames is acquired. In response to an interactive operation on a reference frame determined from the multiple video frames, reference frame interaction points are acquired for the reference frame, which allows more localized fine-tuning of the segmentation on the reference frame. Coarse segmentation is performed on the reference frame based on the reference frame interaction points to obtain a coarse segmentation mask for the reference frame. To obtain a finer segmentation result, refined features rich in image details are further extracted from the reference frame, and the reference frame is finely segmented based on the refined features and the coarse segmentation mask to obtain the segmentation mask of the reference frame. Because fine segmentation improves the accuracy of the reference frame's segmentation mask beyond that of the coarse segmentation mask, the segmentation masks predicted for the multiple video frames from that mask are also more accurate.

Description of the Drawings

[0027] Figure 1 is a flowchart of the first video segmentation method provided in an embodiment of this application;

[0028] Figure 2 is a schematic diagram of the structure of the interactive image segmentation module provided in an embodiment of this application;

[0029] Figure 3 is a schematic diagram of the process of performing segmentation prediction on each video frame provided in an embodiment of this application;

[0030] Figure 4 is a flowchart of the second video segmentation method provided in an embodiment of this application;

[0031] Figure 5 is a structural block diagram of a video segmentation apparatus provided in an embodiment of this application;

[0032] Figure 6 is an internal structural diagram of a first type of computer device provided in an embodiment of this application;

[0033] Figure 7 is an internal structural diagram of a second type of computer device provided in an embodiment of this application.

Detailed Description

[0034] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0035] In some embodiments, as shown in Figure 1, a video segmentation method is provided. This embodiment illustrates the method as applied to a computer device. It is understood that the computer device can be a server or a terminal. The method can be implemented independently by the server or the terminal, or through interaction between the server and the terminal. Video segmentation in this application refers to image segmentation of each video frame in a video frame sequence. In this embodiment, the method includes the following steps:

[0036] Step 102: In response to an interactive operation on a reference frame in the video frame sequence, obtain the reference frame interaction point for the reference frame.

[0037] In this context, a video frame sequence is an image sequence that represents a specified video frame by frame, and a reference frame refers to an image of a specific frame in the video frame sequence.

[0038] A reference frame interaction point refers to an interaction point formed within a reference frame during user interaction. In essence, if a user clicks on a pixel in a reference frame, that pixel can be considered a reference frame interaction point.

[0039] Specifically, users can perform interactive operations on reference frames determined by the video frame sequence through the front-end interactive module of this application, and the computer device responds to the interactive operation to obtain the interaction points formed in the reference frames during the user's interactive operation on the reference frames.

[0040] In some embodiments, if the interaction operation refers to a gesture operation, the front-end interaction module receives two inputs: a video frame sequence and a gesture operation. The gesture operation includes at least one of the following: selecting a reference frame from the video frames, zooming, drawing a line, or clicking.

[0041] In some embodiments, the reference frame may be selected by the user from the video frame sequence, or it may be automatically selected from the video frame sequence according to preset rules; this application does not limit this. It is understood that when the reference frame is selected by the user, the user can select a video frame at any time point from a video frame sequence with a total of T frames, or select multiple video frames at different time points. The computer device, in response to the user's selection operation, records the video frame selected by the user as an interactive frame, which may also be referred to as a reference frame.

[0042] In some embodiments, once the user has determined a reference frame, the user can also perform a zoom operation on it. In response to the zoom operation, the computer device changes the content displayed on the canvas of the display interface from the entire reference frame to a local area of the reference frame.

[0043] In some embodiments, the front-end interaction module of this application further includes at least one of the functions of adding or deleting a selection area. Users can select to add or delete a segmented area on the display interface and perform interactions such as drawing lines or clicking on the reference frame. The computer device converts the user's interaction information into interaction points and records them as reference frame interaction points.

[0044] Step 104: Perform coarse segmentation on the reference frame based on the interaction points of the reference frame to obtain the coarse segmentation mask of the reference frame.

[0045] Coarse segmentation refers to the image segmentation operation that performs preliminary division of the cutout region of the reference frame.

[0046] A coarse segmentation mask refers to an image mask obtained by coarsely segmenting a reference frame. An image mask is a selected image, graphic, or object used to occlude the image being processed, thereby controlling the area or process of image processing.

[0047] Specifically, the computer device can directly perform coarse segmentation of the reference frame based on the reference frame's interaction points to obtain a coarse segmentation mask. Alternatively, the computer device can first crop the reference frame based on the reference frame's interaction points to obtain a cropped region image, and then perform coarse segmentation of the cropped region image based on the reference frame's interaction points to obtain a coarse segmentation mask. It can be understood that the cropped region image can be an RGB image. RGB values are obtained by varying the red, green, and blue color channels and superimposing them to produce various colors.

[0048] In some embodiments, the process of cropping from the reference frame is as follows: when the user ends the interactive operation on the reference frame, the computer device obtains the cropping region coordinates of the reference frame based on the image information displayed in the canvas area of the display interface, and performs cropping processing on the reference frame based on the cropping region coordinates to obtain the cropped region image.
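The cropping step above can be sketched as follows. The source only says the crop coordinates come from what the canvas currently displays, so the viewport model here (a zoom factor plus the frame coordinate shown at the canvas's top-left corner) and the names `visible_crop_box` and `crop_frame_and_points` are assumptions for illustration.

```python
import math

def visible_crop_box(frame_w, frame_h, canvas_w, canvas_h, zoom, offset_x, offset_y):
    """Return (x0, y0, x1, y1): the frame region currently shown on the canvas.

    Hypothetical viewport model: `zoom` is canvas pixels per frame pixel and
    (offset_x, offset_y) is the frame coordinate at the canvas's top-left corner.
    """
    x0 = max(0, math.floor(offset_x))
    y0 = max(0, math.floor(offset_y))
    x1 = min(frame_w, math.ceil(offset_x + canvas_w / zoom))
    y1 = min(frame_h, math.ceil(offset_y + canvas_h / zoom))
    return x0, y0, x1, y1

def crop_frame_and_points(frame, points, box):
    """Crop a frame (list of pixel rows) and shift interaction points into crop coordinates.

    points: list of (x, y, is_positive) in full-frame coordinates.
    Points outside the crop box are dropped.
    """
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in frame[y0:y1]]
    shifted = [(x - x0, y - y0, pos) for x, y, pos in points
               if x0 <= x < x1 and y0 <= y < y1]
    return cropped, shifted
```

Shifting the interaction points together with the crop keeps them aligned with the cropped region image that is passed on to coarse segmentation.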

[0049] It is understandable that after the front-end interaction module determines the reference frame interaction points of the reference frame, it can use the reference frame and its corresponding interaction points as the output of the front-end interaction module. Alternatively, it can use the cropped region image obtained by cropping the reference frame and its corresponding interaction points as the output of the front-end interaction module. The interactive segmentation module of this application can then use the output of the front-end interaction module as input, directly performing coarse segmentation of the reference frame based on the reference frame interaction points to obtain the image mask of the reference frame, or performing coarse segmentation of the cropped region image based on the reference frame interaction points to obtain the image mask of the reference frame.

[0050] Step 106: Determine the refined features of the reference frame.

[0051] Here, refined features are features obtained by extracting features from the image details of the reference frame.

[0052] Specifically, the computer device can directly acquire pre-determined refined features of the reference frame, or it can perform multi-layer convolution processing on the reference frame to obtain the refined features of the reference frame.

[0053] In some embodiments, the computer device may pass the reference frame through multiple convolutional layers, each with a 3×3 convolution kernel, to extract refined features rich in fine edge and texture details. These refined features can be used to further refine the result of the coarse segmentation of the reference frame.
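A minimal NumPy sketch of stacked 3×3 convolutions follows. The number of layers, the channel widths, the zero padding, and the ReLU between layers are assumptions; the source only specifies multiple layers with 3×3 kernels. The function names are hypothetical.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """Zero-padded 3x3 convolution that preserves spatial size.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3); bias: (C_out,).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]            # (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out + bias[:, None, None]

def extract_refined_features(frame, layers):
    """Apply each (weight, bias) pair in turn; ReLU between layers is an assumption."""
    x = frame
    for weight, bias in layers:
        x = np.maximum(conv3x3(x, weight, bias), 0.0)
    return x
```

Keeping the spatial resolution unchanged through every layer is what lets these features carry edge and texture detail at full resolution into the fine segmentation stage.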

[0054] Step 108: Perform fine segmentation on the reference frame based on the refined features and the coarse segmentation mask to obtain the segmentation mask of the reference frame.

[0055] Fine segmentation refers to performing further image segmentation on the reference frame by combining its refined features, based on the coarse segmentation of the reference frame.

[0056] Specifically, the computer device can further encode and decode the coarse segmentation mask of the reference frame to mine deeper features, and fuse the deep features mined from the coarse segmentation mask with the refined features of the reference frame. Based on the fused features, the reference frame is further finely segmented to predict the final segmentation mask of the reference frame.

[0057] In some embodiments, after the interactive image segmentation module performs coarse segmentation on the reference frame, it uses the refined features of the reference frame to perform fine segmentation on top of the coarse segmentation. This further improves the segmentation precision of object edges and hair details in the reference frame and yields the final predicted segmentation mask of the reference frame.
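The data flow of fine segmentation (coarse mask concatenated with the decoder's last-layer features, deep feature extraction, concatenation with the refined features, then mask prediction) can be sketched at the level of tensor shapes. The callables `deep_extractor` and `mask_head` stand in for learned sub-networks that the source does not specify, and the final sigmoid is an assumption.

```python
import numpy as np

def fine_segment(coarse_mask, decoder_feat, refined_feat, deep_extractor, mask_head):
    """Shape-level sketch of fine segmentation.

    coarse_mask:  (H, W) coarse segmentation mask.
    decoder_feat: (C_d, H, W) last-layer features of the decoding module.
    refined_feat: (C_r, H, W) refined features of the reference frame.
    deep_extractor, mask_head: hypothetical callables for learned layers.
    """
    # First concatenated feature: coarse mask + decoder features.
    first = np.concatenate([coarse_mask[None], decoder_feat], axis=0)
    deep = deep_extractor(first)                      # (C_x, H, W)
    # Second concatenated feature: refined features + deep features.
    second = np.concatenate([refined_feat, deep], axis=0)
    logits = mask_head(second)                        # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid is an assumption
```

All concatenations are along the channel axis, so every input must share the same spatial size.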

[0058] Step 110: Based on the segmentation mask of the reference frame, perform segmentation prediction on each video frame in the video frame sequence to obtain the segmentation mask corresponding to each video frame.

[0059] Segmentation prediction refers to the operation of predicting the segmentation regions of each video frame using the segmentation mask of the reference frame.

[0060] Specifically, the computer device uses the segmentation mask of the reference frame as a reference for segmentation prediction of each video frame in the video sequence to predict the segmentation region corresponding to each video frame, and obtains the segmentation mask corresponding to each video frame based on the segmentation region corresponding to each video frame.

[0061] In some embodiments, a reference frame and a segmentation mask obtained by coarse and fine segmentation of the reference frame can be input into a video target segmentation module. The video target segmentation module then performs segmentation prediction on all video frames to obtain segmentation masks for all video frames.

[0062] In some embodiments, after predicting the segmentation masks for all video frames, the computer device can display the segmentation masks of all video frames on the display interface. The user can judge whether the segmentation masks of all video frames meet the expected segmentation effect. If the user is not satisfied with the current video segmentation effect, the user can continue to select video frames with poor segmentation effect as reference frames and return to step 102 to re-perform video segmentation until all video frames can achieve the segmentation effect satisfactory to the user. At this time, the computer device can output the final segmentation masks of each video frame, thereby completing the video segmentation process. In another case, if the user is satisfied with the current video segmentation effect, the computer device can directly output the segmentation masks of each current video frame, thereby completing the video segmentation process.
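The review loop described above can be sketched as follows. All four callables are hypothetical placeholders for the front-end interaction module, the interactive image segmentation module, the video target segmentation module, and the user's judgment; `max_rounds` is an added safety limit that the source does not mention.

```python
def interactive_video_segmentation(frames, get_interaction, segment_reference,
                                   propagate, is_satisfied, max_rounds=10):
    """Repeat: pick a reference frame, segment it, propagate, let the user review.

    get_interaction(frames, masks) -> (reference_index, interaction_points)
    segment_reference(frame, points) -> refined mask of that reference frame
    propagate(frames, reference_masks) -> list of masks, one per frame
    is_satisfied(masks) -> bool
    """
    reference_masks = {}          # reference frame index -> segmentation mask
    masks = None
    for _ in range(max_rounds):   # max_rounds is an assumption, not in the source
        ref_idx, points = get_interaction(frames, masks)
        reference_masks[ref_idx] = segment_reference(frames[ref_idx], points)
        masks = propagate(frames, reference_masks)
        if is_satisfied(masks):
            break
    return masks
```

Passing the current masks back into `get_interaction` reflects the described workflow, in which the user chooses poorly segmented frames as the next reference frames.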

[0063] The aforementioned video segmentation method acquires a video frame sequence comprising multiple video frames. In response to an interactive operation on a reference frame determined from the multiple video frames, it acquires reference frame interaction points for the reference frame, which allows more localized fine-tuning of the segmentation on the reference frame. Based on the reference frame interaction points, it performs coarse segmentation on the reference frame to obtain a coarse segmentation mask for the reference frame. To obtain a finer segmentation result, it also extracts refined features rich in image details from the reference frame and, based on these refined features and the coarse segmentation mask, performs fine segmentation on the reference frame to obtain the segmentation mask of the reference frame. Because fine segmentation improves the accuracy of the reference frame's segmentation mask beyond that of the coarse segmentation mask, the segmentation masks predicted for the multiple video frames from that mask are also more accurate.

[0064] In some embodiments, step 104 specifically includes, but is not limited to: converting the interaction points of the reference frame into an interaction point distance map; performing feature fusion on the reference frame and the interaction point distance map to obtain a first fused feature; and performing coarse segmentation on the reference frame based on the first fused feature to obtain a coarse segmentation mask for the reference frame.

[0065] The reference frame interaction points include at least one of positive interaction points or negative interaction points. A positive interaction point indicates a region in the reference frame that needs to be segmented, and a negative interaction point indicates a region in the reference frame that does not need to be segmented.

[0066] In some embodiments, when a user performs an interactive segmentation operation on a reference frame, such as adding or deleting at least one segmented region, the computer device converts the user's interaction information for adding a segmented region into a positive interaction point, which represents the object the user wants to add. Furthermore, the computer device can also convert the user's interaction information for deleting a segmented region into a negative interaction point, which represents the object the user wants to delete.

[0067] Correspondingly, the interaction point distance map includes at least one of a positive interaction distance map or a negative interaction distance map. In the positive interaction distance map, the pixel value of each pixel is determined by the distance from each pixel to the positive interaction point, and in the negative interaction distance map, the pixel value of each pixel is determined by the distance from each pixel to the negative interaction point.

[0068] More precisely, the pixel value of each pixel in the positive interaction distance map is determined by the shortest Euclidean distance from that pixel to any of the positive interaction points. Similarly, the pixel value of each pixel in the negative interaction distance map is determined by the shortest Euclidean distance from that pixel to any of the negative interaction points. The pixel values in both the positive and negative interaction distance maps range from 0 to 255.

[0069] Specifically, the computer device converts the interaction points of the reference frame into at least one of a positive interaction distance map or a negative interaction distance map to obtain an interaction point distance map. Next, the computer device performs feature fusion on the reference frame and the interaction point distance map to obtain a first fused feature. Finally, the computer device performs coarse segmentation on the reference frame based on the first fused feature to obtain a coarse segmentation mask for the reference frame. This application, by utilizing the interaction point distance map, can more clearly identify the areas the user wants to segment and the areas they do not want to segment, thereby ensuring that the coarse segmentation yields more accurate segmented regions.
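The conversion from interaction points to distance maps can be sketched as follows. The source states only that values are shortest Euclidean distances in the range 0 to 255, so clipping at 255, truncation to integers, and filling an empty map with 255 are assumptions, as are the function names.

```python
import numpy as np

def interaction_distance_map(points, height, width):
    """Per-pixel shortest Euclidean distance to any point, clipped to [0, 255].

    points: list of (x, y) pixel coordinates.
    Assumption: with no points, every pixel gets the maximum value 255.
    """
    if not points:
        return np.full((height, width), 255, dtype=np.uint8)
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.asarray(points, dtype=np.float64)
    dist = np.sqrt((xs[None] - pts[:, 0, None, None]) ** 2
                   + (ys[None] - pts[:, 1, None, None]) ** 2)   # (N, H, W)
    return np.clip(dist.min(axis=0), 0, 255).astype(np.uint8)

def build_interaction_maps(points, height, width):
    """points: list of (x, y, is_positive). Returns a (2, H, W) array:
    channel 0 is the positive distance map, channel 1 the negative one."""
    positive = [(x, y) for x, y, pos in points if pos]
    negative = [(x, y) for x, y, pos in points if not pos]
    return np.stack([interaction_distance_map(positive, height, width),
                     interaction_distance_map(negative, height, width)])
```

Pixels near a positive click have small values in channel 0, which gives the segmentation network a dense spatial cue instead of a handful of isolated coordinates.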

[0070] In some embodiments, Figure 2 shows the structure of the interactive image segmentation module provided in this embodiment of the application. The interactive image segmentation module includes a fusion module. After obtaining the interaction point distance map of the reference frame, the computer device feeds the interaction point distance map and the reference frame into the fusion module for feature fusion to obtain the first fused feature. Specifically, the reference frame and the interaction point distance map are concatenated along the channel dimension, and the fusion module then applies, in sequence, a 1×1 convolution, an activation function, batch normalization, and another 1×1 convolution to obtain the first fused feature. The activation function can be a rectified linear unit (ReLU), a commonly used activation function in artificial neural networks that refers to the nonlinear ramp function and its variants.
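A minimal NumPy sketch of the fusion module's layer sequence (1×1 convolution, ReLU, batch normalization, 1×1 convolution) follows. Batch normalization is shown in inference form with stored statistics, and the `params` dictionary keys and channel widths are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W); weight: (C_out, C_in); bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch normalization with per-channel statistics."""
    scale = gamma / np.sqrt(var + eps)
    return (x - mean[:, None, None]) * scale[:, None, None] + beta[:, None, None]

def fuse(frame, distance_maps, params):
    """frame: (3, H, W); distance_maps: (K, H, W). Returns the first fused feature.
    `params` keys (w1, b1, mean, var, gamma, beta, w2, b2) are hypothetical."""
    x = np.concatenate([frame, distance_maps], axis=0)   # concatenate on channels
    x = conv1x1(x, params["w1"], params["b1"])
    x = np.maximum(x, 0.0)                               # ReLU
    x = batch_norm(x, params["mean"], params["var"], params["gamma"], params["beta"])
    return conv1x1(x, params["w2"], params["b2"])
```

Because every operation here is pointwise in space, the fusion module mixes image and interaction channels without changing resolution.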

[0071] In some embodiments, the step "to coarsely segment the reference frame according to the first fusion feature and obtain a coarse segmentation mask of the reference frame" specifically includes, but is not limited to, extracting deep features and shallow features of the first fusion feature respectively; and performing decoding processing based on the deep features and shallow features to obtain a coarse segmentation mask of the reference frame.

[0072] Here, whether a feature of the first fused feature is deep or shallow depends mainly on the number of convolutional layers it has passed through. Shallow features have smaller receptive fields with less overlap between them, which allows the network to capture more detail, while deep features have larger receptive fields.

[0073] Specifically, the computer device extracts deep features that focus on the global aspects from the first fused features, and also extracts shallower features that are more detailed from the first fused features. It then performs decoding processing based on the deep and shallow features to obtain a coarse segmentation mask for the reference frame. This application, by simultaneously extracting both global and more detailed features from the first fused features and combining them with more comprehensive deep and shallow features for decoding processing, can obtain a coarse segmentation mask with better segmentation results.

[0074] In some embodiments, as shown in Figure 2, the interactive image segmentation module also includes a backbone module. This backbone network consists of two residual blocks. The backbone is the feature extraction network, whose function is to extract information from the image for subsequent networks to use. After obtaining the first fused feature, ResNet-50 can be used as the backbone to extract the deep and shallow features of the first fused feature. ResNet-50 is a convolutional neural network consisting of 50 network layers.

[0075] It should be noted that the computer device can also determine the features output by the first residual block in the backbone module as the shallow features of the first fused features, and determine the output of the backbone module, that is, the output of the second residual block in the backbone module, as the deep features.

[0076] In some embodiments, the backbone of the present application may be replaced with other network models that have deep feature extraction capabilities, as long as they can perform deep feature extraction on the first fused features. The present application does not impose any specific restrictions on this.

[0077] In some embodiments, the step "decoding based on deep features and shallow features to obtain a coarse segmentation mask of the reference frame" specifically includes, but is not limited to: extracting features from deep features using convolutions with multiple different receptive fields to obtain multiple receptive field features; concatenating the multiple receptive field features to obtain receptive field concatenated features; fusing the receptive field concatenated features and shallow features to obtain a second fused feature; and decoding the second fused feature to obtain a coarse segmentation mask of the reference frame.

[0078] In convolutional neural networks, the receptive field refers to the region of the input image that a point on the feature map can see. In other words, each point on the feature map is computed from a region of the input image, and the size of that region is the receptive field. A larger receptive field covers a wider range of the original image, so the feature may contain more global and semantically higher-level information; conversely, a smaller receptive field means the feature is more local and detailed.
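As a minimal sketch of how the receptive field grows, the following Python computes the receptive field of a stack of convolution layers using the standard recurrence for kernel size, stride, and dilation. The function names and the example layer configurations are illustrative assumptions and are not taken from this application.

```python
def effective_kernel(kernel_size: int, dilation: int = 1) -> int:
    """Span, in input positions, covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field on the input of a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input towards the output.
    """
    rf = 1    # a single input position sees itself
    jump = 1  # distance on the input between adjacent feature-map points
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Two stacked 3x3 convolutions see a 5x5 region of the input.
two_convs = receptive_field([(3, 1, 1), (3, 1, 1)])
# A single 3x3 convolution with dilation 6 spans 13 input positions.
dilated = effective_kernel(3, 6)
```

A larger dilation widens the region seen by one layer without adding weights, which is why convolutions with several dilation rates can provide several receptive fields in parallel.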

[0079] Specifically, the computer device uses convolutions with multiple different receptive fields in a convolutional neural network to extract features from the deep features, obtaining multiple receptive field features. Next, the computer device concatenates these receptive field features to obtain the receptive field concatenated feature. The receptive field concatenated feature is then fused with the shallow features of the first fused feature to obtain a second fused feature, achieving the fusion of deep and shallow features. Finally, the computer device decodes the second fused feature to obtain a coarse segmentation mask for the reference frame with a better segmentation result.

[0080] In some embodiments, as shown in Figure 2, the interactive image segmentation module of this application also includes an Atrous Spatial Pyramid Pooling (ASPP) module. The computer device uses ASPP to extract features from the deep features using convolutions with multiple receptive fields, and then concatenates the features from the different receptive fields. Skip connections are then used to concatenate or fuse the output features of ASPP with the output features of the first residual block in the backbone module, achieving the fusion of shallow and deep features and thereby further improving the accuracy of video segmentation. Residual blocks are the building units of a residual network, which is very effective at mitigating the vanishing gradient problem.
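The multi-receptive-field extraction and skip-connection fusion can be illustrated with a one-dimensional toy sketch in pure Python. It is not the module of this application: the dilation rates (1, 2, 4), the fixed summing kernel, and the assumption that the deep and shallow features have the same length are all illustrative choices, whereas a real ASPP module uses learned two-dimensional convolutions.

```python
def dilated_conv1d(signal, kernel, rate):
    """Zero-padded 'same' 1D convolution (cross-correlation) with dilation `rate`."""
    half = (len(kernel) // 2) * rate
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j * rate] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def aspp_like(deep, shallow, rates=(1, 2, 4)):
    """Extract features at several dilation rates, concatenate them
    channel-wise, then append the shallow feature via a skip connection."""
    kernel = [1.0, 1.0, 1.0]  # illustrative fixed weights; real weights are learned
    branches = [dilated_conv1d(deep, kernel, r) for r in rates]
    # Channel-wise concatenation: one value per branch at every position.
    concatenated = [list(values) for values in zip(*branches)]
    # Skip connection: the shallow feature becomes an extra channel.
    return [channels + [s] for channels, s in zip(concatenated, shallow)]


deep = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
shallow = [9.0] * 7
fused = aspp_like(deep, shallow)
```

Each position of `fused` holds one channel per dilation rate plus the shallow channel, so the decoder that follows sees context gathered at several scales alongside the detailed shallow feature.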

[0081] In some embodiments, the coarse segmentation mask is obtained by decoding the second fused feature through the decoding module. Step 108 specifically includes, but is not limited to: concatenating the coarse segmentation mask and the last layer feature of the decoding module to obtain a first concatenated feature; concatenating the refined feature and the feature obtained by deep feature extraction of the first concatenated feature to obtain a second concatenated feature; and predicting the segmentation mask of the reference frame based on the second concatenated feature.

[0082] The decoding module includes multiple convolutional layers, and the last layer feature of the decoding module refers to the feature output by the convolutional layer of the decoding module that is closest to its output.

[0083] In some embodiments, as shown in Figure 2, the interactive image segmentation module of this application also includes a decoding module. The process of decoding the second fused feature through the decoding module can be as follows: two depthwise separable convolutions of the decoding module perform feature extraction and fusion on the features output by the multi-scale dilated convolution module. Then, a 1×1 convolution reduces the number of channels to 1, and bilinear interpolation upsamples the result to the original size of the image. Finally, an activation function, such as the sigmoid function, is applied to obtain the final segmentation probability map. This segmentation probability map is the coarse segmentation mask obtained by coarsely segmenting the reference frame. The sigmoid function is a common S-shaped function in biology, also known as an S-shaped growth curve. It is often used as an activation function in neural networks, mapping a variable to a value between 0 and 1.
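The last two operations of the decoding process, bilinear upsampling followed by the sigmoid function, can be sketched in pure Python as below. The sketch assumes the single-channel logits have already been produced by the 1×1 convolution, and it assumes corner-aligned sampling for the interpolation; this application does not specify the sampling convention, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2D list with bilinear interpolation (corner-aligned, an assumption)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def coarse_mask_from_logits(logits, out_h, out_w):
    """Upsample single-channel logits, then apply sigmoid per pixel."""
    upsampled = bilinear_upsample(logits, out_h, out_w)
    return [[sigmoid(v) for v in row] for row in upsampled]


mask = coarse_mask_from_logits([[-4.0, 4.0], [-4.0, 4.0]], 2, 3)
```

In this toy input the left column is strongly background and the right column strongly foreground, so the interpolated middle column lands at a probability of 0.5.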

[0084] Specifically, the computer device concatenates the coarse segmentation mask and the last layer of features from the decoding module to obtain the first concatenated feature. Next, the computer device performs deep feature extraction on the first concatenated feature, that is, it performs multiple encoding and decoding processes on the first concatenated feature to obtain the deep features of the first concatenated feature. Subsequently, the computer device concatenates the refined feature and the deep features of the first concatenated feature to obtain the second concatenated feature. Finally, the computer device uses the second concatenated feature to predict and obtain the segmentation mask of the reference frame.

[0085] In some embodiments, as shown in Figure 2, the interactive image segmentation module of this application also includes a fine segmentation module. This fine segmentation module receives a reference frame, a coarse segmentation mask, and the last layer features of the decoding module as input. The computer device extracts features from the reference frame using a 3×3 convolution of the fine segmentation module to obtain the refined features. Next, the computer device concatenates the coarse segmentation mask with the last layer features of the decoding module to obtain the first concatenated feature. Then, the computer device inputs the first concatenated feature into three encoding layers and three decoding layers for encoding and decoding respectively, obtaining encoded features and decoded features. The computer device fuses the encoded features and the corresponding decoded features through element-wise addition to obtain the deep features of the first concatenated feature. The deep features of the first concatenated feature are then concatenated with the refined features to obtain the second concatenated feature, which is passed through a 3×3 convolution and a sigmoid function to predict the final refined segmentation mask. Finally, the refined segmentation mask is pasted back into the original image, i.e., the original reference frame, according to the cropped region coordinates, to obtain the final segmentation mask of the reference frame after coarse and fine segmentation.
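The final paste-back step can be sketched as follows. This is a minimal illustration in pure Python: the function name, the (top, left) representation of the cropped region coordinates, and filling the area outside the crop with 0 (background) are assumptions made for the example.

```python
def paste_back(full_h, full_w, refined_crop, top, left):
    """Place a refined mask predicted on a cropped region back into a
    full-size mask of the original reference frame.

    Pixels outside the crop are set to 0.0 (assumed background).
    """
    canvas = [[0.0] * full_w for _ in range(full_h)]
    for dy, row in enumerate(refined_crop):
        for dx, value in enumerate(row):
            y, x = top + dy, left + dx
            # Ignore any part of the crop that falls outside the frame.
            if 0 <= y < full_h and 0 <= x < full_w:
                canvas[y][x] = value
    return canvas


full_mask = paste_back(3, 4, [[1.0, 1.0], [1.0, 1.0]], top=1, left=2)
```

Here a 2×2 refined mask is written into rows 1 to 2 and columns 2 to 3 of a 3×4 frame, leaving the rest of the frame as background.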

[0086] In some embodiments, step 110 specifically includes, but is not limited to: performing feature encoding on the reference frame and its segmentation mask to obtain reference frame features; storing the reference frame features corresponding to the reference frame in a preset repository queue; determining the query frame features for each video frame; performing feature matching on the query frame features for each video frame using the reference frame features stored in the repository queue to obtain query frame matching features; and predicting the segmentation mask corresponding to each video frame based on the query frame matching features for each video frame.

[0087] Here, the query frame features are the features obtained by extracting features from each video frame.

[0088] A repository queue refers to a repository that stores, in a queue format, the reference frame features obtained by encoding each reference frame in a video frame sequence together with its segmentation mask.

[0089] Specifically, the computer device encodes the reference frame and its segmentation mask to obtain reference frame features, and stores these features in a preset repository queue. Next, the computer device extracts features from each video frame to obtain the query frame features of each video frame. Then, the computer device reads each video frame in the video frame sequence and designates the currently read video frame as the query frame. When performing segmentation prediction on each query frame, it extracts all reference frame features currently stored in the repository queue and concatenates them to obtain a third concatenated feature. The computer device then performs feature matching on the query frame features based on the third concatenated feature to obtain the query frame matching feature. Finally, the computer device predicts the segmentation mask corresponding to each video frame based on its respective query frame matching feature.
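The matching step can be illustrated with a simplified dot-product attention read over the concatenated reference features. This is a sketch under stated assumptions rather than the matcher of this application: features are plain Python lists, the same vectors serve as both keys and values, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_query(query_feats, repository_queue):
    """Match each query vector against all stored reference vectors.

    `repository_queue` is a list with one entry per reference frame,
    each entry being a list of feature vectors.
    """
    # Third concatenated feature: all stored reference vectors in one list.
    memory = [vec for ref in repository_queue for vec in ref]
    dim = len(memory[0])
    matched = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, m)) for m in memory]
        weights = softmax(scores)
        # Weighted sum of the reference vectors gives the matching feature.
        matched.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
        )
    return matched


queue = [[[10.0, 0.0]], [[0.0, 10.0]]]  # two reference frames, one vector each
matched = match_query([[1.0, 0.0]], queue)
```

Because every query vector is compared with every stored reference vector, the cost of this step grows with the length of the repository queue.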

[0090] In some embodiments, as shown in Figure 3, this application also includes a video target segmentation module, which includes, but is not limited to, a reference frame feature encoder, a repository queue, a query frame feature encoder, a feature matcher, and a query frame feature decoder.

[0091] Specifically, the reference frame feature encoder encodes the reference frame and its segmentation mask, and adds the encoded reference frame features to the repository queue. Next, the computer device obtains the index of the reference frame and reads video frames from the video frame sequence in both directions, before and after the reference frame index; these frames are denoted as query frames. The query frames are then encoded by the query frame feature encoder to obtain query frame features. When performing segmentation prediction on each query frame, all reference frame features currently stored in the repository queue are extracted and concatenated to obtain a third concatenated feature. The feature matcher matches this third concatenated feature against the query frame features, and the matched features are combined to obtain the query frame matching feature. The query frame matching feature is then decoded by the query frame feature decoder and fused with the features from the query frame feature encoder. Finally, a 1×1 convolution and an activation function, such as the sigmoid function, are used to obtain the segmentation mask of the query frame.
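One possible way to read frames in both directions from the reference frame index is sketched below. The alternating order (one frame forward, one frame backward) and the function name are assumptions made for illustration; this application states only that frames are read in both directions.

```python
def bidirectional_order(num_frames, ref_index):
    """Indices of query frames, moving outwards from the reference frame.

    Frames after the reference index and frames before it are
    interleaved by their distance from the reference frame.
    """
    order = []
    for offset in range(1, num_frames):
        forward, backward = ref_index + offset, ref_index - offset
        if forward < num_frames:
            order.append(forward)
        if backward >= 0:
            order.append(backward)
    return order


order = bidirectional_order(6, 2)
```

With six frames and the reference frame at index 2, the query frames are visited as 3, 1, 4, 0, 5, so each query frame is processed after its neighbour that lies closer to the reference frame.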

[0092] The reference frame feature encoder is composed of a residual network, which can be a ResNet18 network, including 17 convolutional layers and 1 fully connected layer, while the feature matcher can be composed of spatiotemporal global attention.

[0093] In some embodiments, the video segmentation method of this application includes, but is not limited to: when the queue length of the repository queue exceeds a queue length threshold, extracting reference frame features corresponding to reference frames to be retained from the repository queue at preset intervals; and updating the repository queue according to the reference frame features corresponding to the reference frames to be retained.

[0094] The queue length threshold refers to the maximum number of reference frame features that the repository queue can store. The reference frame features that need to be retained refer to the portion of reference frame features that can still be saved in the repository queue during queue updates.

[0095] It will be appreciated that most video segmentation algorithms perform segmentation prediction frame by frame. Traditional methods typically cache reference frames at fixed frame intervals and match them with the current video frame. Because the storage space required and the processing time both grow in proportion to the number of video frames, traditional methods are unsuitable for practical applications. For example, traditional methods store the frame features of the corresponding video frame in a reference feature queue every 5 or 10 frames. As the total number of video frames increases, the reference frame queue becomes increasingly long and consumes more and more storage. Because feature matching requires matching the current query frame against the features of all reference frames stored in the queue, the more reference frames there are, the slower feature matching becomes.

[0096] Therefore, in order to improve the speed of feature extraction and matching and to reduce the storage used by the repository queue, this application uses a preset repository update strategy to update the repository queue. Compared with traditional methods, the speed of feature extraction and matching and the storage usage under this repository update strategy are not affected by the total number of frames in the video frame sequence.

[0097] In some embodiments, the repository update strategy may be to extract reference frame features that need to be retained from the repository queue at preset intervals, and update the repository queue based on the reference frame features that need to be retained.

[0098] Specifically, when the length of the repository queue exceeds the queue length threshold, the computer device extracts the reference frame features corresponding to the reference frames to be retained from the repository queue at a preset interval, for example an interval of one frame, so that every other entry is kept. The computer device then saves these reference frame features to a new repository queue and replaces the original repository queue with the new one. This effectively reduces the consumption of computing resources, improves the overall accuracy and inference speed of video segmentation, and enhances the user's interactive experience.

[0099] To facilitate understanding, consider an example. Assume the maximum capacity of the repository queue is 10, and the original repository queue is [N0, N1, N2, N3, N4, N5, N6, N7, N8, N9], with a queue length of 10. If a new reference frame N10 is added directly to the original repository queue, the queue becomes [N0, N1, N2, N3, N4, N5, N6, N7, N8, N9, N10], and its length exceeds the maximum capacity. To reduce the memory occupied by the repository, the reference frame features to be retained are extracted from the repository queue at an interval of one frame, and the queue is updated with them. The repository queue is thus updated to [N0, N2, N4, N6, N8, N10].
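The example above can be reproduced with a short Python sketch. The function name and parameter names are hypothetical; the capacity of 10 and the interval of one frame follow the example, and reference frame features are represented by string labels.

```python
def update_repository(queue, new_feature, max_len=10, interval=1):
    """Append a reference frame feature and, if the queue exceeds
    `max_len`, keep only entries sampled at the preset interval.

    `interval=1` skips one entry between retained entries,
    i.e. every other entry is kept.
    """
    new_queue = queue + [new_feature]
    if len(new_queue) > max_len:
        # Build the new queue from the retained entries only.
        new_queue = new_queue[:: interval + 1]
    return new_queue


original = [f"N{i}" for i in range(10)]
updated = update_repository(original, "N10")
```

After the update the queue holds six entries, so the cost of feature matching stays bounded no matter how long the video is. Note that sampling from the front of the queue retains the newest entry only for some queue lengths, as it does here.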

[0100] It will be appreciated that, since the differences between adjacent video frames are small, the reference frame features in the repository queue used by traditional video segmentation methods usually carry a lot of redundant information. This redundant information does not improve the video segmentation effect, but instead increases the amount of computation and storage.

[0101] Based on this, this application adopts a new repository update strategy and determines the final queue length threshold and preset interval through comparative experiments. For example, while the algorithm runs, reference frame features are added to the repository queue every 20 frames. If the number of reference frame features in the repository queue exceeds the maximum queue length of 10, reference frame features from the old repository queue are saved to the new repository queue at an interval of one frame. This improves the algorithm's running speed and reduces storage usage without affecting the segmentation effect, achieving a balance between speed and accuracy.

[0102] In some embodiments, the video segmentation method of this application includes, but is not limited to: in response to a selection operation on a video frame sequence, determining the selected video frame as a reference frame; and determining the reference frame from the video frame sequence according to a preset extraction rule.

[0103] Specifically, the computer device can respond to a user's selection operation on a video frame sequence and determine the selected video frame as a reference frame. Furthermore, the computer device can also extract video frames at certain intervals from the video frame sequence as reference frames according to preset extraction rules, or extract video frames adjacent to the current query frame from the video frame sequence as reference frames.

[0104] It will be appreciated that, while the computer device reads a video frame sequence, the video frame read every 20 frames and its corresponding segmentation mask can be used as a reference frame. Assuming the index of the current query frame is t, in addition to using the video frames at indices t-20 and t-40, taken every 20 frames, as reference frames, the features of the adjacent frame at index t-1 or t+1 can also be added to the repository queue as a temporary reference frame. This improves the stability and continuity of video segmentation and reduces jitter in the segmentation results of adjacent frames. It should be noted that after the segmentation masks for all video frames have been predicted, the temporary reference frames are removed from the repository queue to save storage space.
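A minimal sketch of such a reference frame schedule is given below for the forward reading direction only. It assumes that the initial reference frame is at index `start`, that a frame becomes a permanent reference every `period` frames, and that only the most recent adjacent frame is kept as the temporary reference for each query. The last point is a simplification, the repository queue update is omitted, and all names are hypothetical.

```python
def reference_schedule(num_frames, start=0, period=20):
    """For each query index t > start, list the frame indices whose
    features would serve as references when predicting frame t."""
    permanent = [start]
    schedule = {}
    for t in range(start + 1, num_frames):
        # The previous frame acts as a temporary reference unless it is
        # already a permanent one.
        temporary = [t - 1] if (t - 1) not in permanent else []
        schedule[t] = permanent + temporary
        if (t - start) % period == 0:
            # Frame t and its predicted mask become a permanent reference.
            permanent.append(t)
    return schedule


schedule = reference_schedule(45)
```

For example, frame 44 is predicted from the permanent references 0, 20, and 40 plus the temporary reference 43.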

[0105] In some embodiments, as shown in Figure 4, the video segmentation method of this application may specifically include the following steps. First, the user's gestures and the video frame sequence are input to the front-end interaction module, which processes them and outputs the reference frame interaction points and the RGB image of the reference frame, i.e., the reference frame RGB. Next, the interactive image segmentation module performs coarse and fine segmentation on the reference frame based on the reference frame interaction points and the reference frame RGB, obtaining the reference frame RGB and the reference frame mask, where the reference frame mask is the segmentation mask of the reference frame. The video target segmentation module then performs segmentation prediction on each video frame in the video frame sequence based on the reference frame RGB and the reference frame mask, obtaining the segmentation masks of all video frames, i.e., the all-frame masks. Finally, if the user is satisfied with the video segmentation effect, the segmentation masks of all video frames, i.e., the video segmentation mask, are output; if the user is not satisfied, the video frames that need to be segmented again are re-input to the front-end interaction module and the video segmentation process is re-executed until the user is satisfied with the video segmentation effect.

[0106] In some embodiments, the video segmentation method of this application further includes a reference frame acquisition step, a coarse segmentation step, a fine segmentation step, and a prediction step. Wherein:

[0107] The reference frame acquisition step, in response to a selection operation on a video frame sequence, identifies the selected video frame as the reference frame, or determines the reference frame from the video frame sequence according to a preset extraction rule. In response to an interaction operation on a reference frame in the video frame sequence, it acquires the reference frame interaction point for that reference frame.

[0108] The coarse segmentation step involves converting the interaction points of the reference frame into an interaction point distance map; fusing features between the reference frame and the interaction point distance map to obtain the first fused feature; extracting deep and shallow features from the first fused feature; using convolutions with multiple receptive fields to extract features from the deep features to obtain multiple receptive field features; concatenating the multiple receptive field features to obtain the receptive field concatenated feature; fusing the receptive field concatenated feature and the shallow feature to obtain the second fused feature; and decoding the second fused feature to obtain the coarse segmentation mask of the reference frame.

[0109] The fine segmentation step involves concatenating the coarse segmentation mask and the last layer features of the decoding module to obtain the first concatenated feature; concatenating the refined features and the features obtained by deep feature extraction from the first concatenated feature to obtain the second concatenated feature; predicting the segmentation mask of the reference frame based on the second concatenated feature; and performing fine segmentation on the reference frame based on the refined features and the coarse segmentation mask to obtain the segmentation mask of the reference frame.

[0110] The prediction steps involve: encoding the reference frame and its segmentation mask to obtain reference frame features; storing the reference frame features corresponding to the reference frame in a preset repository queue; determining the query frame features for each video frame; performing feature matching on the query frame features for each video frame using the reference frame features stored in the repository queue to obtain query frame matching features; and predicting the segmentation mask corresponding to each video frame based on the query frame matching features for each video frame.
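The first operation of the coarse segmentation step, converting the reference frame interaction points into an interaction point distance map, can be sketched as follows. The exact definition of the distance map is not given in this passage, so the sketch assumes one common choice: each pixel stores the Euclidean distance to the nearest interaction point, truncated at a cap. The function name, the (row, column) point format, and the cap value are assumptions.

```python
import math


def distance_map(height, width, points, cap=255.0):
    """Distance from every pixel to its nearest interaction point.

    `points` is a list of (row, column) click positions. Distances are
    truncated at `cap`; with no points, every pixel takes the cap value.
    """
    dmap = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min(
                (math.hypot(y - py, x - px) for py, px in points),
                default=cap,
            )
            row.append(min(nearest, cap))
        dmap.append(row)
    return dmap


dmap = distance_map(3, 3, [(1, 1)])
```

The resulting map has the same height and width as the reference frame, so it can be fused with the reference frame, for example as an additional input channel, to form the first fused feature.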

[0111] In some embodiments, if the queue length of the repository queue exceeds the queue length threshold, the reference frame features corresponding to the reference frames that need to be retained are extracted from the repository queue at preset intervals; and the repository queue is updated according to the reference frame features corresponding to the reference frames that need to be retained.

[0112] It should be noted that the image segmentation method of this application can improve the precision of video segmentation masks while effectively reducing the consumption of computing resources, improving the overall accuracy and computational inference speed of interactive video segmentation, and enhancing the user experience.

[0113] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0114] Based on the same inventive concept, this application also provides an image segmentation apparatus for implementing the image segmentation method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more image segmentation apparatus embodiments provided below can be found in the limitations of the image segmentation method described above, and will not be repeated here.

[0115] In some embodiments, as shown in Figure 5, an image segmentation apparatus is provided, including: an acquisition unit 502, a coarse segmentation unit 504, a determination unit 506, a fine segmentation unit 508, and a prediction unit 510, wherein:

[0116] The acquisition unit 502 is used to acquire the reference frame interaction point for the reference frame in response to the interaction operation for the reference frame in the video frame sequence.

[0117] The coarse segmentation unit 504 is used to coarsely segment the reference frame based on the reference frame interaction points to obtain the coarse segmentation mask of the reference frame.

[0118] The determining unit 506 is used to determine the refined features of the reference frame; the refined features are the features obtained by extracting features from the image details of the reference frame.

[0119] The fine segmentation unit 508 is used to perform fine segmentation on the reference frame based on the fine features and the coarse segmentation mask to obtain the segmentation mask of the reference frame.

[0120] The prediction unit 510 is used to perform segmentation prediction on each video frame in the video frame sequence according to the segmentation mask of the reference frame, so as to obtain the segmentation mask corresponding to each video frame.

[0121] The aforementioned video segmentation apparatus acquires a video frame sequence comprising multiple video frames; in response to an interactive operation on a reference frame determined from the multiple video frames, it acquires reference frame interaction points for the reference frame, enabling more local segmentation fine-tuning on the reference frame; it performs coarse segmentation on the reference frame based on the reference frame and the reference frame interaction points to obtain a coarse segmentation mask for the reference frame; to obtain a more refined segmentation effect, it can also extract more refined features rich in image details from the reference frame, and perform fine segmentation on the reference frame based on the refined features and the coarse segmentation mask to obtain a segmentation mask for the reference frame. This allows the accuracy of the segmentation mask to be improved through the aforementioned fine segmentation processing on the basis of the coarse segmentation mask, thereby making the segmentation mask obtained by segmenting and predicting multiple video frames based on the accurate segmentation mask more accurate.

[0122] In some embodiments, the coarse segmentation unit 504 is further configured to convert the interaction points of the reference frame into an interaction point distance map; perform feature fusion on the reference frame and the interaction point distance map to obtain a first fused feature; and perform coarse segmentation on the reference frame according to the first fused feature to obtain a coarse segmentation mask of the reference frame.

[0123] In some embodiments, the coarse segmentation unit 504 is further configured to extract the deep features and shallow features of the first fused features respectively; and perform decoding processing based on the deep features and shallow features to obtain the coarse segmentation mask of the reference frame.

[0124] In some embodiments, the coarse segmentation unit 504 is further configured to extract features from deep features using convolutions of multiple different receptive fields to obtain multiple receptive field features; concatenate the multiple receptive field features to obtain receptive field concatenated features; fuse the receptive field concatenated features and shallow features to obtain a second fused feature; and decode the second fused feature to obtain a coarse segmentation mask of the reference frame.

[0125] In some embodiments, the coarse segmentation mask is obtained by decoding the second fused feature through the decoding module; the fine segmentation unit 508 is further configured to concatenate the coarse segmentation mask and the last layer feature of the decoding module to obtain the first concatenated feature; concatenate the refined feature and the feature obtained by deep feature extraction of the first concatenated feature to obtain the second concatenated feature; and predict the segmentation mask of the reference frame based on the second concatenated feature.

[0126] In some embodiments, the prediction unit 510 is further configured to perform feature encoding on the reference frame and the segmentation mask of the reference frame to obtain reference frame features; store the reference frame features corresponding to the reference frame in a preset repository queue; determine the query frame features of each video frame; the query frame features are features obtained by feature extraction for each video frame; for the query frame features of each video frame, perform feature matching on the query frame features through the reference frame features stored in the repository queue to obtain query frame matching features; and predict the segmentation mask corresponding to each video frame based on the query frame matching features of each video frame.

[0127] In some embodiments, the video segmentation module of this application further includes a queue update module, which is used to extract reference frame features corresponding to reference frames that need to be retained from the repository queue at preset intervals when the queue length of the repository queue exceeds the queue length threshold; and to update the repository queue with the reference frame features corresponding to the reference frames that need to be retained.

[0128] In some embodiments, the acquisition unit 502 of this application is further configured to, in response to a selection operation on a video frame sequence, determine the selected video frame as a reference frame; and determine the reference frame from the video frame sequence according to a preset extraction rule.

[0129] Each module in the aforementioned image segmentation device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0130] In some embodiments, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database stores data related to image segmentation. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements an image segmentation method.

[0131] In some embodiments, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 7. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements an image segmentation method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0132] Those skilled in the art will understand that the structures shown in Figure 6 and Figure 7 are merely block diagrams of the portion of the structure related to the present application and do not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figures, or combine certain components, or have different component arrangements.

[0133] In some embodiments, a computer device is also provided, the computer device including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps in the above method embodiments.

[0134] In some embodiments, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0135] In some embodiments, a computer program product is provided, which includes a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0136] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties.

[0137] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, but are not limited to, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc.

[0138] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0139] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A video segmentation method, characterized by comprising: in response to an interactive operation on a reference frame in a video frame sequence, obtaining reference frame interaction points for the reference frame; coarsely segmenting the reference frame based on the reference frame interaction points to obtain a coarse segmentation mask of the reference frame, the coarse segmentation mask being obtained by decoding through a decoding module; determining refined features of the reference frame, the refined features being features obtained by feature extraction of image details of the reference frame; concatenating the coarse segmentation mask and the last-layer features of the decoding module to obtain a first concatenated feature; concatenating the refined features and features obtained by deep feature extraction on the first concatenated feature to obtain a second concatenated feature; predicting a refined segmentation mask based on the second concatenated feature; pasting the refined segmentation mask back onto the reference frame according to cropping region coordinates to obtain a segmentation mask of the reference frame; and performing segmentation prediction on each video frame in the video frame sequence based on the segmentation mask of the reference frame to obtain a segmentation mask corresponding to each video frame.
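
As an illustration of the paste-back step only, the following minimal Python sketch places a refined crop mask into a full-frame mask. The function name `paste_back`, the `(top, left, bottom, right)` box convention with exclusive bottom and right edges, and the zero fill outside the crop are assumptions made for the example; the claim does not specify them.

```python
def paste_back(refined_crop, frame_height, frame_width, crop_box):
    """Place a refined crop mask into an all-zero full-frame mask.

    crop_box is (top, left, bottom, right) with bottom/right exclusive;
    this convention is an assumption, not taken from the patent text.
    """
    top, left, bottom, right = crop_box
    if not (0 <= top < bottom <= frame_height and 0 <= left < right <= frame_width):
        raise ValueError("crop_box lies outside the frame")
    if bottom - top != len(refined_crop) or right - left != len(refined_crop[0]):
        raise ValueError("crop mask size does not match crop_box")
    # Pixels outside the crop are assumed to be background (0.0).
    full = [[0.0] * frame_width for _ in range(frame_height)]
    for r, row in enumerate(refined_crop):
        full[top + r][left:right] = row
    return full


# A 2x2 refined crop pasted into a 4x5 frame at rows 1-2, columns 2-3.
mask = paste_back([[1.0, 0.5], [0.0, 1.0]], 4, 5, (1, 2, 3, 4))
```

In practice the crop would usually be resized back to the crop region's size before pasting; the sketch assumes that has already happened.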

2. The method according to claim 1, characterized in that the step of coarsely segmenting the reference frame based on the reference frame and the reference frame interaction points to obtain the coarse segmentation mask of the reference frame comprises: converting the reference frame interaction points into an interaction point distance map; fusing the reference frame and the interaction point distance map to obtain a first fused feature; and coarsely segmenting the reference frame based on the first fused feature to obtain the coarse segmentation mask of the reference frame.
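
The first two steps of this claim can be sketched in plain Python as follows. The sketch assumes a Euclidean distance to the nearest interaction point, clipped at a hypothetical `max_dist`, and it assumes that fusion means appending the distance value as an extra channel; the claim itself fixes neither choice, and the names `distance_map` and `fuse` are illustrative.

```python
import math


def distance_map(height, width, points, max_dist=255.0):
    """Per-pixel Euclidean distance to the nearest (row, col) point, clipped to max_dist."""
    if not points:
        return [[max_dist] * width for _ in range(height)]
    return [
        [min(max_dist, min(math.hypot(y - py, x - px) for py, px in points))
         for x in range(width)]
        for y in range(height)
    ]


def fuse(frame_rgb, dist):
    """Append the distance value to each pixel's channels (assumed form of fusion)."""
    return [
        [tuple(pixel) + (d,) for pixel, d in zip(frame_row, dist_row)]
        for frame_row, dist_row in zip(frame_rgb, dist)
    ]


frame = [[(0, 0, 0)] * 4 for _ in range(3)]   # a 3x4 black RGB frame
dmap = distance_map(3, 4, [(1, 1)])           # one click at row 1, column 1
fused = fuse(frame, dmap)                     # each pixel now has 4 channels
```

A real system would typically build separate maps for positive and negative clicks and feed the fused tensor to the segmentation network; that part is outside this sketch.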

3. The method according to claim 2, characterized in that the step of coarsely segmenting the reference frame based on the first fused feature to obtain the coarse segmentation mask of the reference frame comprises: extracting deep features and shallow features of the first fused feature respectively; and decoding based on the deep features and the shallow features to obtain the coarse segmentation mask of the reference frame.

4. The method according to claim 3, characterized in that the step of decoding based on the deep features and the shallow features to obtain the coarse segmentation mask of the reference frame comprises: performing feature extraction on the deep features using convolutions with multiple different receptive fields to obtain multiple receptive field features; concatenating the multiple receptive field features to obtain a receptive field concatenated feature; fusing the receptive field concatenated feature and the shallow features to obtain a second fused feature; and decoding the second fused feature to obtain the coarse segmentation mask of the reference frame.
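
To make the idea of multiple receptive fields concrete, the sketch below runs a 3x3 averaging filter at several dilation rates over a single-channel feature map and stacks the results per pixel. The averaging filter is a stand-in for learned convolution weights, and the dilation rates `(1, 2, 3)` and function names are assumptions; the claim does not say how the different receptive fields are produced. Fusion with the shallow features and decoding are not shown.

```python
def dilated_mean3x3(feature, dilation):
    """3x3 averaging filter with the given dilation and zero padding.

    Stand-in for a learned convolution; only the sampling pattern matters here.
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-dilation, 0, dilation):
                for dx in (-dilation, 0, dilation):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += feature[yy][xx]
            out[y][x] = total / 9.0
    return out


def multi_receptive_field(feature, dilations=(1, 2, 3)):
    """Run one branch per dilation and stack the branch outputs per pixel."""
    branches = [dilated_mean3x3(feature, d) for d in dilations]
    h, w = len(feature), len(feature[0])
    return [[tuple(b[y][x] for b in branches) for x in range(w)] for y in range(h)]


deep = [[0.0] * 7 for _ in range(7)]
deep[3][3] = 9.0                      # a single activation in the centre
stacked = multi_receptive_field(deep)
```

Each branch responds to the central activation at a different distance, which is the effect that concatenating several receptive fields is meant to capture.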

5. The method according to claim 4, characterized in that the coarse segmentation mask is obtained by decoding the second fused feature through the decoding module.

6. The method according to claim 1, characterized in that the step of performing segmentation prediction on each video frame in the video frame sequence based on the segmentation mask of the reference frame to obtain the segmentation mask corresponding to each video frame comprises: performing feature encoding on the reference frame and its segmentation mask to obtain reference frame features; storing the reference frame features corresponding to the reference frame in a preset repository queue; determining query frame features of each video frame, the query frame features being features obtained by feature extraction of each video frame; for each video frame, matching the query frame features with the reference frame features stored in the repository queue to obtain query frame matching features; and predicting the segmentation mask corresponding to each video frame based on the query frame matching features of each video frame.
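
One possible reading of the matching step is a similarity-weighted readout from the repository queue. The sketch below assumes each stored entry is a `(key, value)` pair and that matching is a softmax over dot-product similarities; both are assumptions, since the claim states only that query frame features are matched with the stored reference frame features. The name `match` and the toy vectors are illustrative.

```python
import math
from collections import deque


def match(query_key, memory):
    """Softmax-weighted average of stored values, weighted by key similarity."""
    scores = [sum(q * k for q, k in zip(query_key, key)) for key, _ in memory]
    peak = max(scores)                          # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(memory[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, memory)) / total
        for i in range(dim)
    ]


repository = deque()
repository.append(([1.0, 0.0], [1.0]))   # (key, value) encoded from a reference frame
repository.append(([0.0, 1.0], [0.0]))   # a second stored entry
readout = match([10.0, 0.0], repository) # query close to the first key
```

A query that resembles the first stored key reads out a value close to that entry's value; a query equally similar to both keys reads out their average.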

7. The method according to claim 6, characterized by further comprising: if the queue length of the repository queue exceeds a queue length threshold, extracting, from the repository queue at preset intervals, the reference frame features corresponding to the reference frames that need to be retained; and updating the repository queue according to the reference frame features corresponding to the reference frames that need to be retained.
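
The queue update can be sketched as interval-based subsampling. The names `prune`, `max_len`, and `interval` are hypothetical, and starting the sampling at the oldest entry (so that the first stored reference frame is always kept) is an assumption rather than something the claim requires.

```python
from collections import deque


def prune(repository, max_len, interval):
    """Keep every `interval`-th entry once the queue grows past max_len."""
    if len(repository) <= max_len:
        return repository
    # Sampling starts at index 0, so the oldest entry is always retained (assumed).
    return deque(list(repository)[::interval])


queue = deque(range(7))                  # stand-ins for seven stored feature entries
queue = prune(queue, max_len=5, interval=2)
```

With a small `interval` the pruned queue can still exceed `max_len`; a fuller implementation might repeat the pruning or derive the interval from the current length.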

8. The method according to any one of claims 1 to 7, characterized by further comprising a step of determining the reference frame from the video frame sequence, the step comprising at least one of the following: in response to a selection operation on the video frame sequence, determining the selected video frame as the reference frame; and determining the reference frame from the video frame sequence according to a preset extraction rule.

9. A video segmentation apparatus, characterized by comprising: an acquisition unit configured to obtain reference frame interaction points for a reference frame in response to an interactive operation on the reference frame in a video frame sequence; a coarse segmentation unit configured to coarsely segment the reference frame based on the reference frame and the reference frame interaction points to obtain a coarse segmentation mask of the reference frame, the coarse segmentation mask being obtained by decoding through a decoding module; a determining unit configured to determine refined features of the reference frame, the refined features being features obtained by feature extraction of image details of the reference frame; a fine segmentation unit configured to concatenate the coarse segmentation mask and the last-layer features of the decoding module to obtain a first concatenated feature, concatenate the refined features and features obtained by deep feature extraction on the first concatenated feature to obtain a second concatenated feature, predict a refined segmentation mask based on the second concatenated feature, and paste the refined segmentation mask back onto the reference frame according to cropping region coordinates to obtain a segmentation mask of the reference frame; and a prediction unit configured to perform segmentation prediction on each video frame in the video frame sequence based on the segmentation mask of the reference frame to obtain a segmentation mask corresponding to each video frame.

10. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.

11. A computer-readable storage medium having stored thereon a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.

12. A computer program product comprising a computer program on a medium, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.