Video frame segmentation
By adapting segmentation masks using previous and subsequent frames, the method improves temporal consistency, addressing visual artifacts in multi-view video depth estimation.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- KONINKLIJKE PHILIPS NV
- Filing Date
- 2024-05-22
- Publication Date
- 2026-06-25
AI Technical Summary
Existing image-based object segmentation algorithms for video frames, particularly for depth estimation in multi-view videos, suffer from visual artifacts due to temporal inconsistencies in segmentation masks, leading to jitter and unstable depth maps.
Generate a temporally consistent segmentation mask by integrating existing algorithms with specific measures such as using previous and subsequent frames to modify the current segmentation mask, leveraging neural networks and visual computing techniques to improve temporal consistency.
The proposed methods enhance the temporal consistency of segmentation masks, reducing visual artifacts and stabilizing depth estimation in multi-view videos.
Abstract
Description
Technical Field
[0001] The present invention relates to object segmentation of frames in a video. The present invention also relates to object segmentation for use in depth estimation of multi-view video.
Background Art
[0002] Recent advances in machine learning have made image-based (semantic) object detection and segmentation possible (for example, with Detectron2). In multi-view video, some algorithms, such as depth estimation, can benefit from the results of these techniques because they provide a means to limit the search space in the estimation process.
[0003] Furthermore, semantic object type information can guide the estimation process toward model-based specialization. For example, one algorithm may be specially created or tuned to perform depth estimation for humans, and another to estimate the depth profile of a ball.
[0004] When performing depth estimation on video input, image-based (semantic) object detection and segmentation can be performed for each frame.
Summary of the Invention
Problems to be Solved by the Invention
[0005] However, it has been recognized that when performing image-based (semantic) object detection for each frame, visual artifacts may occur in the resulting multi-view video when the segmentation is used for depth estimation.
[0006] Therefore, object segmentation needs improvement when it is applied to video frames, especially for depth estimation in multi-view video.
Means for Solving the Problem
[0007] The present invention is defined by the claims.
[0008] According to an embodiment of one aspect of the present invention, a method for object segmentation in a video frame is provided, the method comprising: obtaining a current segmentation mask of an object within a current video frame; and generating a temporally consistent current segmentation mask of the object for the current video frame, using the current segmentation mask and a segmentation obtained from one or more previous and / or subsequent video frames.
[0009] Temporal consistency of segmentation masks is particularly important in some video applications. For example, in 3D video (such as immersive video), object segmentation is recognized as a way to estimate depth. However, if the segmentation mask is not temporally consistent, the edges of the segmentation mask may appear to abruptly change depth, potentially causing jitter in the video. If these edges are not aligned with the edges of the object being segmented, visual artifacts may appear in the rendered video.
[0010] Segmentation algorithms are currently trained on still images. This is because a large number of images are required to successfully train a segmentation algorithm. Theoretically, it is recognized that it is possible to train segmentation algorithms using various video frames to aim for improved temporal consistency. However, the amount of training data and ground truth data required to train such algorithms for general use means that these types of segmentation algorithms are not currently a practical option.
[0011] Therefore, it has been proposed to obtain the segmentation mask of the current frame and adapt / modify the segmentation mask using segmentation information from a temporally different frame (i.e., the previous or subsequent frame). In this way, currently available segmentation algorithms can be used. The temporal consistency of the segmentation mask is improved by adapting / modifying it after comparing it with the segmentation information of the previous / subsequent frame.
[0012] In other words, instead of directly attempting to determine a temporally consistent segmentation mask, we can use currently available segmentation algorithms to generate a segmentation mask and then modify that mask (i.e., adjust it to be temporally consistent) using information available from temporally different frames.
[0013] Various methods for adapting the current segmentation mask to generate a temporally consistent current segmentation mask for the current frame are described below. Some embodiments involve comparing the current segmentation mask with segmentation from previous / next frames and adapting the current segmentation mask based on the comparison. Other embodiments involve using a neural network trained to generate a temporally consistent segmentation mask for the current video frame.
[0014] The objective is to improve object segmentation in video frames. This is achieved by generating a temporally consistent segmentation mask by leveraging existing segmentation techniques across current and temporally different video frames.
[0015] Segmentation from previous / next frames can include segmentation masks, bounding boxes, parts of previous / next frames, object keypoints, etc.
[0016] Preferably, the one or more previous and / or subsequent video frames include video frames that are temporally adjacent to the current video frame.
[0017] Generating a temporally consistent current segmentation mask may involve comparing the current segmentation mask with a segmentation obtained using one or more previous and / or subsequent video frames, and then adapting the current segmentation mask based on that comparison to generate a temporally consistent current segmentation mask.
[0018] Obtaining the current segmentation mask of an object may include obtaining two spatially adjacent current segmentation masks within the current video frame. Comparing the current segmentation masks may then include: generating a combined segmentation mask by combining the two spatially adjacent current segmentation masks; comparing the combined segmentation mask with one or more temporal segmentation masks corresponding to previous and / or subsequent video frames; and determining, based on that comparison, whether the two spatially adjacent current segmentation masks belong to the same object instance. Adapting the current segmentation masks then includes assigning the combined segmentation mask to the corresponding object within the current video frame when the two spatially adjacent current segmentation masks are determined to belong to the same object instance.
[0019] One type of segmentation error that leads to temporal inconsistency occurs when a single object is incorrectly detected and segmented as two separate objects. This solution resolves the error by checking whether the union of two adjacent segmentation masks is more temporally consistent than the adjacent segmentation masks themselves.
[0020] Spatially adjacent segmentation masks are preferably of the same object class. However, in some cases, image regions may be misclassified. In such cases, combining two spatially adjacent masks of different object classes (one or both of which are incorrect) may result in a correct segmentation mask.
[0021] Adjacent segmentation masks of the same class (i.e., segmentation masks with overlapping or close edges) may indicate that two separate objects of the same class are in close proximity to each other, or they may indicate an object that was misdetected, as mentioned above.
[0022] One way to compare temporal consistency is to first compare the number of segmentation masks for an object class in the current frame with the corresponding number in the previous / next frame. This is based on the assumption that the number of objects does not change very frequently between frames that are close in time. A sudden change may indicate an incorrectly detected segmentation.
[0023] If the numbers are not equal, the combined segmentation mask can be compared with the segmentation masks of the preceding and succeeding frames. Alternatively, the adjacent segmentation masks themselves can be compared.
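The count comparison described in [0022] and [0023] can be illustrated with a minimal sketch. It assumes each detection is a (class label, mask) pair; the function names `class_counts` and `classes_needing_check` are hypothetical and do not come from the specification.

```python
from collections import Counter

def class_counts(detections):
    """Count segmentation masks per object class.

    `detections` is a list of (class_label, mask) pairs; the mask itself is
    not inspected here.
    """
    return Counter(label for label, _mask in detections)

def classes_needing_check(current, neighbour):
    """Return the classes whose mask count differs between two frames."""
    cur, ref = class_counts(current), class_counts(neighbour)
    return sorted(c for c in set(cur) | set(ref) if cur[c] != ref[c])

# Two people in the previous frame, three "person" masks in the current frame.
previous_frame = [("person", "mask_a"), ("person", "mask_b")]
current_frame = [("person", "mask_a"), ("person", "mask_b1"), ("person", "mask_b2")]
flagged = classes_needing_check(current_frame, previous_frame)
print(flagged)
```

A flagged class is only a hint that something changed. The combined-mask comparison is still needed to decide whether two masks really belong to one object.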
[0024] Comparing the current segmentation mask may include generating a background image using multiple previous and / or subsequent video frames, determining the difference between the current video frame and the background image to identify foreground regions within the current video frame, and comparing the regions corresponding to the current segmentation mask with the identified foreground regions to detect missing and / or invalid segmentation regions. Adapting the current segmentation mask may include adding missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0025] In some cases, the segmentation mask may be missing a part of the object. Therefore, the background frame (generated using the previous and / or subsequent video frames) can be used to identify the segmentation of the foreground region and compare that segmentation with the current segmentation mask. This enables the detection of missing segmentation regions of the foreground object.
[0026] Adding the missing segmentation region includes adapting the current segmentation mask to include the missing segmentation region.
[0027] Comparing the current segmentation mask may include: obtaining one or more temporal segmentation masks corresponding to the object from the previous and / or subsequent frames; determining the motion field between the current video frame and the previous and / or subsequent video frames; temporally projecting the temporal segmentation masks to the time corresponding to the current video frame using that motion field; and comparing the projected temporal segmentation masks with the current segmentation mask to detect missing and / or invalid segmentation regions. Adapting the current segmentation mask may include adding the missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0028] In an embodiment, a known motion estimation algorithm such as optical flow can be used to determine the motion field.
[0029] Motion fields are known to indicate movement between frames. It is recognized that this information can be used to project segmentation masks from previous / next frames over time. Therefore, the projected segmentation mask essentially provides a prediction of the expected segmentation mask in the current frame. However, edges may not be as accurate as segmentation performed directly in the current frame. Therefore, projected segmentation is used to provide more accurate temporal consistency to the current segmentation mask. This is achieved by comparing the current segmentation mask with the projected segmentation mask to check for any missing segmentation regions.
[0030] Comparing the current segmentation mask may include retrieving one or more temporal segmentation masks from previous and / or subsequent frames, and comparing the shape of the temporal segmentation masks and the textures enclosed by them with the shape of the current segmentation mask and the textures enclosed by it, respectively, in order to detect missing and / or invalid segmentation regions. Adapting the current segmentation mask then includes adding missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0031] For example, a significant difference in shape (e.g., exceeding a threshold) between a temporal segmentation mask and the current segmentation mask may indicate that the current segmentation mask is missing a region and / or contains an invalid region. Similarities and / or differences in texture in the suspected missing / invalid region can be used to identify the actual shape of the missing / invalid segmentation region.
[0032] Adapting the current segmentation mask may further include performing visibility checks for the missing and / or invalid segmentation regions to determine whether they are occluded in the current frame. The adaptation may then be performed only when at least a portion of the missing and / or invalid segmentation regions is not occluded in the current frame.
[0033] Visibility checks involve determining whether an object is obscured within the frame.
[0034] In some cases, an object's region may be missing from the current segmentation mask because it is obscured by another object. In such cases, the current segmentation mask may already be correct, so it is not necessary to adjust the current segmentation mask even if the missing segmentation region is detected.
[0035] Generating a temporally consistent current segmentation mask may involve obtaining, from previous and / or subsequent frames, one or more temporal segmentation masks corresponding to the same object class as the current segmentation mask, and inputting the current segmentation mask and the temporal segmentation masks into a neural network trained to output a segmentation mask adapted for the current frame.
[0036] In this case, the neural network is trained on already determined current and temporal segmentation masks, so the required training data can be much smaller than that needed to train a segmentation model.
[0037] Preferably, the current video frame is also input to the neural network. Preferably, the previous and / or subsequent frames are also input to the neural network.
[0038] The present invention also provides a method for object segmentation in video frames using a prompt-based segmentation model, the method comprising: applying an algorithm to a first video frame to obtain a first indication of the position of an object within the first video frame; inputting the first video frame into the prompt-based segmentation model using the first indication of the object's position as a prompt; applying the same algorithm to a second, subsequent video frame to obtain a second indication of the object's position in the second video frame; and inputting the second video frame into the prompt-based segmentation model using the second indication of the object's position as a prompt.
[0039] Prompt-based segmentation models (such as the Segment Anything Model (SAM)) have recently proven to provide robust and accurate segmentation. However, because such models are trained on still images, they do not provide temporal consistency when used with video frames. SAM often outputs different segmentation masks for a single object based on the prompts used.
[0040] The prompts used in SAM often include manually selected prompts or arrays of arbitrarily placed prompts. Manually placing prompts is not a practical solution for object segmentation in video. Even if prompts are arbitrarily placed, temporal consistency is not achieved because objects may have moved across different frames, and the resulting segmentation mask becomes inconsistent over time because the prompts are inconsistent with respect to the objects.
[0041] It is recognized that temporally consistent prompts are necessary to ensure temporally consistent output from SAM. This can be achieved by using the same algorithm across different frames to obtain indications of the position of objects within a frame. Because the algorithm is applied to each frame, these indications provide a more temporally consistent position even if the object moves between frames.
[0042] The algorithm may include a keypoint detection algorithm configured to determine one or more keypoints of an object in an image, where the first and second indications of the object's position include one or more keypoints of the object determined by the keypoint detection algorithm.
[0043] The algorithm may be configured to segment objects in an input video frame, identify pixels corresponding to the segmented objects, and determine the center of those pixels, where the first and second indications of the object's position include the corresponding determined center.
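As an illustration of the center-based prompt in [0043], the sketch below computes the mean pixel position of a mask. It assumes a mask is represented as a set of (row, col) pixels; the function name `mask_center` is hypothetical, and taking the mean position is only one possible reading of "the center of the pixels".

```python
def mask_center(mask):
    """Mean (row, col) position of the pixels in a mask.

    `mask` is a non-empty set of (row, col) pixel coordinates.
    """
    if not mask:
        raise ValueError("cannot compute the center of an empty mask")
    n = len(mask)
    mean_row = sum(r for r, _ in mask) / n
    mean_col = sum(c for _, c in mask) / n
    return mean_row, mean_col

# A 4 x 2 block of pixels: rows 0..3, columns 0..1.
block = {(r, c) for r in range(4) for c in range(2)}
center = mask_center(block)
print(center)
```

For a concave or partly hidden object, the mean position can fall outside the visible object, which is one reason a visibility check on the prompt position is useful.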
[0044] The method may further include: performing a visibility check on the first video frame to detect whether the first indication of the object's position is obscured by another object; adapting the first indication to an unobscured position if it is obscured; performing a visibility check on the second video frame to detect whether the second indication of the object's position is obscured by another object; and adapting the second indication to an unobscured position if it is obscured.
[0045] The method may further include receiving a confidence score for each of the first and second indications of the object's position, where inputting the second video frame into the prompt-based segmentation model is conditional on both confidence scores being above a minimum confidence threshold.
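The flow of the prompt-based method, including the confidence condition in [0045], can be sketched as follows. This is a minimal illustration under stated assumptions: `locate` and `segment` are hypothetical callables standing in for the position-finding algorithm and the prompt-based segmentation model, and the default threshold of 0.5 is an arbitrary example value.

```python
def segment_two_frames(frame_1, frame_2, locate, segment, min_confidence=0.5):
    """Prompt a segmentation model with positions found by the same algorithm.

    locate(frame)          -> (position, confidence)  (hypothetical stand-in)
    segment(frame, prompt) -> mask                    (hypothetical stand-in)
    The second frame is only segmented when both confidences exceed the threshold.
    """
    prompt_1, confidence_1 = locate(frame_1)
    prompt_2, confidence_2 = locate(frame_2)
    mask_1 = segment(frame_1, prompt_1)
    if confidence_1 > min_confidence and confidence_2 > min_confidence:
        mask_2 = segment(frame_2, prompt_2)
    else:
        mask_2 = None
    return mask_1, mask_2

# Stand-ins used only to make the sketch runnable.
def fake_locate(frame):
    return frame["position"], frame["confidence"]

def fake_segment(frame, prompt):
    return {"frame": frame["name"], "prompt": prompt}

frame_a = {"name": "t0", "position": (12, 40), "confidence": 0.9}
frame_b = {"name": "t1", "position": (13, 42), "confidence": 0.3}
mask_a, mask_b = segment_two_frames(frame_a, frame_b, fake_locate, fake_segment)
print(mask_a, mask_b)
```

What to do when the condition fails (for example, reusing a projected mask from a neighbouring frame) is left open here, as the text does not prescribe it.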
[0046] The algorithm may be configured to perform all steps of any of the methods for object segmentation in a video frame described above, in which case the first and second indications of the object's position include the corresponding temporally consistent current segmentation masks.
[0047] The present invention also provides a computer program carrier that, when executed on a computer, causes the computer to perform all steps according to any of the methods described above.
[0048] A computer program carrier can include computer memory (e.g., random access memory) and computer storage (e.g., hard drives, solid-state drives, etc.). A computer program carrier can also be a bitstream that transmits computer program code.
[0049] The present invention also provides a system comprising a processor configured to perform all steps by any of the methods provided by the present invention as described above.
[0050] The present invention also provides a computer-implemented method comprising all steps of any of the methods provided by the present invention as described above.
[0051] These and other aspects of the present invention will become apparent from, and be explained with reference to, the embodiments described below.
Brief Description of the Drawings
[0052] For a better understanding of the present invention, and to show more clearly how it may be put into practice, reference is made, by way of example only, to the accompanying drawings.
[Figure 1] A diagram showing a scene with two people.
[Figure 2] A diagram showing the scene of Figure 1 with the correct segmentation masks superimposed.
[Figure 3] A diagram showing the scene of Figure 1 with a first type of incorrect segmentation mask superimposed.
[Figure 4] A diagram showing the scene of Figure 1 with a second type of incorrect segmentation mask superimposed.
[Figure 5] A diagram illustrating the creation and use of a background image.
[Figure 6] A diagram showing a scene where a person is obstructed by an object.
[Figure 7] A diagram illustrating a method for object segmentation in video frames.
[Figure 8] A diagram illustrating a method for object segmentation in video frames using a prompt-based segmentation model.
Modes for Carrying Out the Invention
[0053] The present invention will be described with reference to the drawings.
[0054] The detailed descriptions and specific examples illustrate exemplary embodiments of the apparatus, systems, and methods, but should be understood to be for illustrative purposes only and not intended to limit the scope of the invention. These and other features, aspects, and advantages of the apparatus, systems, and methods of the invention will be better understood from the following description, the appended claims, and the appended drawings. The drawings are for illustrative purposes only and are not drawn to a specific scale. Also, the same reference numerals are used throughout the drawings to indicate the same or similar parts.
[0055] The present invention provides a method for object segmentation in a video frame. This method includes obtaining the current segmentation mask of an object in the current video frame and using that segmentation mask and segmentation obtained from one or more previous and / or subsequent video frames to generate a temporally consistent current segmentation mask of the object for the current video frame.
[0056] In principle, it is possible to design and train a segmentation model that takes temporal aspects into account. At present, however, this is impractical because of the increased complexity of the model and the size of the training dataset required.
[0057] On the other hand, observing the temporal inconsistencies between image-based object detection and segmentation, it was recognized that combining the results of image-based instance segmentation with several visual computing techniques could potentially improve the consistency and accuracy of instance segmentation.
[0058] Therefore, it has been proposed to combine the results of frame-by-frame instance segmentation with visual computing techniques to generate more accurate and temporally consistent segmentation. For example, these methods can use temporal context to detect erroneous segmentation results and correct them to more accurate and temporally consistent segmentation.
[0059] In another example, a second network can be added to process the output segmentation maps of a fixed (unchanged) instance segmentation network and give them temporal consistency.
[0060] The temporal consistency of segmentation masks becomes a problem for object segmentation in video frames when they are generated by neural networks trained on still images. Several methods for stabilizing segmentation masks temporally so that they can be used in video applications, particularly depth estimation, are proposed herein.
[0061] Figure 1 shows a scene with two people, 102 and 104. The first person 102 is obscured by the second person 104. If this scene is captured in a video frame used for depth estimation, segmentation can be applied to this video frame. Algorithms that operate on video input, such as depth estimation, require temporally consistent results. Inconsistent input can lead to unstable depth maps, which appear as flickering artifacts when the depth maps are used to synthesize new views.
[0062] Figure 2 shows the scene from Figure 1 with the correct segmentation masks 202 and 204 superimposed. Note that the "correct" segmentation masks 202 and 204 are shown at a distance from the edges of the first person 102 and the second person 104 in order to easily distinguish the segmentation masks from the people. In practice, the correct segmentation masks should be positioned along the edges of the segmented object (e.g., the people shown in Figure 2). In this case, the segmentation is correct because segmentation mask 202 accurately covers the first person 102 and segmentation mask 204 accurately covers the second person 104.
[0063] Figure 3 shows the scene from Figure 1 with the first type of incorrect segmentation masks 304 and 306 superimposed. In this case, the segmentation mask 302 for person 102 is correct, but the segmentation of person 104 is such that one object instance is divided into two segmentation masks 304 and 306. However, in this case, both segmentation masks 304 and 306 align with the contours of person 104.
[0064] It is recognized that a common missegmentation can cause a single object instance to be incorrectly split into two or more mask instances, and the combination of pixels from both incorrectly labeled mask instances may, in some cases, match the contour of a single ground truth object instance.
[0065] To detect when object instances are incorrectly split into different parts, combinations of detected instances that are close to each other in an image can be tested against the hypothesis that they form a single instance. In a basic implementation, the union of multiple adjacent instances of the same class (e.g., humans) is taken, and the Intersection over Union (IoU) metric is evaluated between this union and each overlapping / spatially adjacent instance detected in the previous (or subsequent) frame. If the IoU is higher than a threshold (e.g., 0.95), the adjacent instances are merged into a single instance. This solves the problem shown in Figure 3.
[0066] For example, the union of segmentation masks 304 and 306 can be compared with the segmentation masks in the preceding and / or following frames (e.g., within 5, 10, 20, or 30 frames, or within 0.1, 0.5, or 1 second). The comparison can be performed on one or more temporally different frames. If the comparison determines that the segmentation masks belong to the same object instance, the merged segmentation mask is assigned to the corresponding object in the current video frame.
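The merge test described in [0065] and [0066] can be sketched as below. This is a minimal illustration, assuming masks are stored as sets of (row, col) pixels; the helper names `iou`, `merge_if_consistent`, and `rect` are hypothetical, and the 0.95 threshold is the example value from the text.

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)

def merge_if_consistent(mask_1, mask_2, reference_masks, threshold=0.95):
    """Merge two adjacent current-frame masks when their union matches a
    reference mask from a previous or subsequent frame.

    Returns the merged mask, or None when no reference mask supports the merge.
    """
    combined = mask_1 | mask_2
    best = max((iou(combined, ref) for ref in reference_masks), default=0.0)
    return combined if best > threshold else None

def rect(r0, r1, c0, c1):
    """Rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

upper = rect(0, 10, 0, 10)        # one object wrongly split: upper part
lower = rect(10, 20, 0, 10)       # one object wrongly split: lower part
previous = [rect(0, 20, 0, 10)]   # the whole object in the previous frame
merged = merge_if_consistent(upper, lower, previous)
print(merged is not None)
```

In a real sequence the object moves between frames, so the reference masks would normally be motion-compensated or taken from temporally close frames before the IoU is evaluated.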
[0067] Figure 4 shows the scene from Figure 1 with a second type of incorrect segmentation mask 404 superimposed. In this case, person 102 is correctly detected and segmented by segmentation mask 402. However, although person 104 is correctly detected, segmentation mask 404 misses person 104's feet 406.
[0068] Generally, this type of incorrect segmentation results in incorrect placement of the boundaries of objects represented by the segmentation mask. For example, parts of the object may be missing. This second type of incorrect segmentation is known to cause problems when used in the context of depth estimation and new view synthesis, and it is generally considered the most problematic.
[0069] This specification describes several implementations for resolving the second type of incorrect segmentation. These solutions can be combined.
[0070] The first solution to the second type of incorrect segmentation is to create and use a background image for the scene.
[0071] Figure 5 shows the creation and use of background image 504. Figure 5a shows the current video frame 502 with incorrect segmentation superimposed. In this case, as shown in Figure 4, the incorrect segmentation is missing the person's feet. An incorrect segmentation mask results in temporal inconsistencies across video frames, as the previous / next frames may correctly segment the person's feet.
[0072] In this context, as shown in Figure 5b, the static background image 504 depicts a scene from a video sequence that does not contain any detected or moving objects. This is created by weighted accumulation of unlabeled pixels representing the background in video frames over longer time intervals. This embodiment assumes a stationary camera. For a moving camera, motion correction would need to be applied during the accumulation process. This is applied only to the background (unlabeled) portion, and is easy to apply because the background is robust and far away.
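The weighted accumulation of unlabeled pixels in [0072] can be sketched as a running average. This sketch assumes a stationary camera, grayscale frames stored as 2D lists, and exponential weighting with a hypothetical parameter `alpha`; the specification does not state which weighting is used.

```python
def update_background(background, frame, labeled, alpha=0.1):
    """Blend the unlabeled pixels of `frame` into a running background estimate.

    background: 2D list of floats (None where nothing is known yet),
                or None before the first frame
    frame:      2D list of grayscale values
    labeled:    set of (row, col) pixels covered by any segmentation mask
    alpha:      weight of the new frame (assumed exponential weighting)
    """
    rows, cols = len(frame), len(frame[0])
    if background is None:
        background = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if (r, c) in labeled:
                continue  # labeled (foreground) pixels never contribute
            old = background[r][c]
            new = float(frame[r][c])
            background[r][c] = new if old is None else (1 - alpha) * old + alpha * new
    return background

# An object (value 200) moves from column 1 to column 2 over a uniform background.
bg = update_background(None, [[10, 200, 10]], labeled={(0, 1)})
bg = update_background(bg, [[10, 10, 200]], labeled={(0, 2)})
print(bg)
```

Pixels that stay covered by an object in every accumulated frame remain unknown (`None`), which is why accumulation over a longer time interval helps.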
[0073] Of course, background image 504 can also be obtained by capturing the scene at another point in time when it is known that there are no objects in the scene being captured.
[0074] A static background image 504 is used to identify the foreground region in any given video frame using the absolute frame difference between the video frame and the background image 504. For example, the difference between the current frame 502 and the background image 504 leads to the identification of two people in the current video frame 502.
[0075] If part of the identified foreground region is not covered by any object segmentation mask, a missing segmentation is detected. For example, in the current video frame 502, this “known background” technique can be used to flag the feet of the person in the foreground as “missed.” Figure 5c shows image 506, which contains the portion of the foreground region that is not covered by the segmentation mask of image 502. In this case, it shows the feet of the person in the foreground, which the segmentation mask of image 502 missed.
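The “known background” check in [0074] and [0075] can be sketched as follows. It assumes grayscale frames stored as 2D lists and masks stored as sets of (row, col) pixels; the function names and the difference threshold of 30 are illustrative assumptions.

```python
def foreground_pixels(frame, background, threshold=30):
    """Pixels whose absolute difference from the background exceeds `threshold`."""
    return {
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if abs(value - background[r][c]) > threshold
    }

def compare_with_foreground(mask, foreground):
    """Return (missing, invalid) pixel sets.

    missing: foreground pixels not covered by the mask
    invalid: mask pixels that are not foreground
    `mask` should be the union of all current segmentation masks.
    """
    return foreground - mask, mask - foreground

background = [[10, 10, 10] for _ in range(4)]
frame = [[10, 200, 10] for _ in range(4)]   # a "person" occupies column 1
mask = {(0, 1), (1, 1), (2, 1)}             # the mask misses the last row (the feet)
missing, invalid = compare_with_foreground(mask, foreground_pixels(frame, background))
print(missing, invalid)
```

The same comparison also yields invalid regions, that is, mask pixels that lie on known background.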
[0076] To correct missing regions, visually corresponding regions are determined at adjacent temporal positions, and their object labels (if any) are used for the "missing" regions. The corresponding regions are determined using one of the known methods, such as motion estimation, (SIFT) feature detection, or tracking. A simpler (spatial) approach is also possible: for every pixel with a large temporal difference signal, the spatially nearest known instance pixel is found, and its instance label is used.
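The simpler spatial approach in [0076] can be sketched with a brute-force nearest-pixel search. This is an illustration only, assuming instances are stored as a dictionary from label to pixel set; the function name `assign_nearest_label` is hypothetical, and a practical implementation would use a distance transform or a spatial index instead.

```python
def assign_nearest_label(missing, instances):
    """Give every missing pixel the label of the spatially nearest instance pixel.

    missing:   set of (row, col) pixels flagged as missing
    instances: dict mapping instance label -> set of (row, col) pixels
    Returns a new dict with the missing pixels added to the nearest instance.
    """
    labeled = [(r, c, label) for label, pixels in instances.items() for r, c in pixels]
    result = {label: set(pixels) for label, pixels in instances.items()}
    if not labeled:
        return result  # nothing to attach the missing pixels to
    for mr, mc in missing:
        # Squared Euclidean distance is enough for finding the minimum.
        _, label = min(((r - mr) ** 2 + (c - mc) ** 2, label) for r, c, label in labeled)
        result[label].add((mr, mc))
    return result

instances = {"person_1": {(0, 0), (1, 0)}, "person_2": {(0, 5), (1, 5)}}
fixed = assign_nearest_label({(2, 5)}, instances)
print(sorted(fixed["person_2"]))
```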
[0077] A second solution to the second type of incorrect segmentation involves temporal prediction of the segmentation mask. Regions from nearby points in time (e.g., t(-1) and t(+1)) are temporally projected onto the current video frame at t=t(0), yielding a temporal prediction, i.e., the expected position of the region at t=t(0). A difference between the predicted position and the actual position of the region being verified indicates a temporal inconsistency.
[0078] For the temporal projection, a motion vector field (optical flow) can be determined that represents the pixel-by-pixel displacement from a point in time adjacent to t=t(0). One improvement to the optical flow calculation is to choose the scanning order so that subsequent motion vectors follow the boundaries of the segmentation mask while remaining within them.
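The temporal projection in [0077] and [0078] can be sketched as a forward warp of a mask by a per-pixel displacement field. The sketch assumes the motion field is already available (for example, from an optical flow algorithm) as a dictionary of displacements; the names `project_mask` and `temporal_inconsistency` are hypothetical.

```python
def project_mask(mask, motion_field, shape):
    """Forward-project a mask from a neighbouring frame to the current frame.

    mask:         set of (row, col) pixels in the neighbouring frame
    motion_field: dict mapping (row, col) -> (d_row, d_col) displacement towards
                  the current frame; pixels without an entry are assumed static
    shape:        (rows, cols) of the frame, used to drop out-of-frame pixels
    """
    rows, cols = shape
    projected = set()
    for r, c in mask:
        dr, dc = motion_field.get((r, c), (0, 0))
        nr, nc = round(r + dr), round(c + dc)
        if 0 <= nr < rows and 0 <= nc < cols:
            projected.add((nr, nc))
    return projected

def temporal_inconsistency(current_mask, projected_mask):
    """Return (missing, invalid) regions of the current mask relative to the prediction."""
    return projected_mask - current_mask, current_mask - projected_mask

previous_mask = {(r, 2) for r in range(4)}            # object in column 2 at t(-1)
motion = {pixel: (0, 1) for pixel in previous_mask}   # it moves one column to the right
predicted = project_mask(previous_mask, motion, shape=(4, 6))
current_mask = {(0, 3), (1, 3), (2, 3)}               # the last row is missing at t(0)
missing, invalid = temporal_inconsistency(current_mask, predicted)
print(missing, invalid)
```

Forward warping can leave small holes where the motion diverges, so the projected mask is better treated as a prediction to compare against than as a replacement for the current mask.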
[0079] The correction of the inconsistent areas can be performed in the same way as described in the first solution.
[0080] A third solution to the second type of incorrect segmentation is to compare the regions enclosed by the segmentation masks by scanning their boundaries. Corresponding regions in subsequent video frames are compared on both texture and shape while the boundaries are scanned. Texture similarity is calculated on a kernel that lies entirely within the (foreground) object, using the sum of absolute differences (SAD), the sum of squared differences (SSD), or another image matching measure.
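The two texture measures named in [0080] can be sketched as follows, together with a check that a kernel lies entirely inside the object mask. Frames are assumed to be 2D lists of grayscale values and masks sets of (row, col) pixels; all function names are illustrative.

```python
def sad(patch_a, patch_b):
    """Sum of absolute differences between two equally sized 2D patches."""
    return sum(abs(a - b) for ra, rb in zip(patch_a, patch_b) for a, b in zip(ra, rb))

def ssd(patch_a, patch_b):
    """Sum of squared differences between two equally sized 2D patches."""
    return sum((a - b) ** 2 for ra, rb in zip(patch_a, patch_b) for a, b in zip(ra, rb))

def kernel_inside_mask(mask, top, left, size):
    """True when the size x size kernel at (top, left) lies entirely inside the mask."""
    return all((r, c) in mask for r in range(top, top + size) for c in range(left, left + size))

def extract_patch(frame, top, left, size):
    """Cut a size x size patch out of a 2D list."""
    return [row[left:left + size] for row in frame[top:top + size]]

frame_t0 = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
frame_t1 = [[12, 18, 30], [40, 55, 60], [70, 80, 90]]
mask = {(0, 0), (0, 1), (1, 0), (1, 1)}

if kernel_inside_mask(mask, 0, 0, 2):
    a = extract_patch(frame_t0, 0, 0, 2)
    b = extract_patch(frame_t1, 0, 0, 2)
    print(sad(a, b), ssd(a, b))  # prints 9 33
```

Lower values mean more similar texture. SSD penalizes large single-pixel differences more strongly than SAD.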
[0081] Shape similarity is determined by accumulating the difference in relative positional paths as a 2D vector. Such accumulated 2D vectors remain at the origin if the scanned boundaries have the same shape, and deviate from the origin if the shapes are different. A scan is initialized by first searching for the corresponding starting point. A scan can also consist of a series of small boundary segments. Segmentation inconsistencies can be detected if the scanned metrics are different.
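One possible reading of the shape measure in [0081] is sketched below: the step vectors of two boundary scans are subtracted and accumulated, so the running 2D vector stays at the origin for identical shapes and drifts away otherwise. The function name `shape_deviation`, the point-list representation, and the use of the maximum drift as the score are all assumptions made for illustration.

```python
import math

def shape_deviation(boundary_a, boundary_b):
    """Largest drift of the accumulated step-difference vector from the origin.

    Each boundary is a list of (row, col) points scanned in the same order,
    starting from corresponding points. Identical shapes (even if translated)
    give 0.0.
    """
    acc_r = acc_c = 0
    worst = 0.0
    steps = zip(zip(boundary_a, boundary_a[1:]), zip(boundary_b, boundary_b[1:]))
    for ((r0, c0), (r1, c1)), ((s0, d0), (s1, d1)) in steps:
        acc_r += (r1 - r0) - (s1 - s0)   # difference of the row steps
        acc_c += (c1 - c0) - (d1 - d0)   # difference of the column steps
        worst = max(worst, math.hypot(acc_r, acc_c))
    return worst

square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
shifted = [(r + 5, c + 5) for r, c in square]          # same shape, translated
other = [(0, 0), (0, 1), (0, 2), (1, 2), (0, 0)]       # a different outline
print(shape_deviation(square, shifted), shape_deviation(square, other))
```

Because only relative steps are compared, a pure translation of the object between frames does not register as a shape change.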
[0082] Inconsistencies can also arise due to occlusion. For example, if a subject's arm is hidden behind another object, a segmentation mask that does not include the arm is correct. Occlusion can be verified using additional visibility checks. Such checks can be performed by continuing the scan guided by the path of the region being compared (i.e., ignoring the region's own shape). Continuing the scan also directly provides an opportunity to correct the missing region, provided the visibility check has been passed.
[0083] The solution mentioned for the second type of erroneous segmentation can also be used to resolve the first type of erroneous segmentation, as it allows one of the regions of the divided segmentation mask to be treated as a missing region as described above.
[0084] Furthermore, the solutions mentioned for the second type of incorrect segmentation can also be used for invalid segmentation regions (i.e., parts of the segmentation mask that are not part of the corresponding object).
[0085] A third type of mis-segmentation occurs when part of an object is incorrectly labeled. For example, a bag placed on the ground in front of a person's feet might be mislabeled. Because a bag object class might not exist, the reported detection probability is typically low (e.g., below 50%).
[0086] Another solution to all three types of erroneous segmentation is to modify / add to the segmentation network for temporal consistency. Most instance segmentation neural networks are trained on still images rather than video. There is good reason for this, as adding an additional (temporal) dimension significantly increases the annotation workload. Instead, efforts are made to use larger training sets with more examples and more variations per object category.
[0087] If most of the weights of an existing neural network are fixed, a limited set of weights can be adjusted with new data. These weights are either parameters of the existing layers or parameters of newly added layers. Fixing most of the weights helps avoid overfitting the network when a much smaller new dataset is used for training.
[0088] Alternatively, it has been proposed to feed the output of an instance segmentation network to a second neural network, train it with segmentation masks from video frames of different time intervals, and generate a new, temporally consistent output.
[0089] Consider the known instance segmentation network Mask R-CNN. This network outputs, for each object instance, a label, bounding box, detection confidence, and a 28 x 28 pixel probability mask, where the value per pixel is between 0 and 1 and represents the likelihood that the object is present at that pixel of the mask. As a post-processing step, the 28 x 28 mask is typically scaled to the resolution of the detected bounding box using bilinear interpolation, and then a threshold (usually 0.5) is applied to obtain a binary object mask at the original image resolution.
[0090] To achieve temporal consistency, it has been proposed to provide a second neural network that takes the output from Mask R-CNN (or other segmentation models) for the current frame and the previous and / or subsequent frames as input. The scaled probability mask per object (i.e., at full-frame resolution) can be adjusted using multiple probability masks as input, corresponding to different instances of the same object category in the previous frame. This "multiple" can be fixed to, for example, eight masks, some of which may be empty (all pixels have a probability of zero). For example, if two people were detected in the previous frame, six of the eight input masks would be all zero. On the other hand, if the number of people detected exceeds eight, only the eight instances with the highest confidence levels are used.
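The post-processing step can be sketched in plain Python as follows. This is an illustrative approximation only: the pixel-centre mapping and the strict `>` comparison are assumptions, and actual Mask R-CNN implementations may differ in such details. The function names are hypothetical.

```python
def resize_bilinear(mask, out_h, out_w):
    """Scale a probability mask (list of rows) to out_h x out_w with bilinear interpolation."""
    in_h, in_w = len(mask), len(mask[0])
    out = []
    for oy in range(out_h):
        # Map the output pixel centre back into input coordinates and clamp.
        sy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = mask[y0][x0] * (1 - wx) + mask[y0][x1] * wx
            bottom = mask[y1][x0] * (1 - wx) + mask[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def binarize(prob_mask, threshold=0.5):
    """Threshold a probability mask into a binary object mask."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_mask]
```

For a detection whose bounding box is, say, 56 pixels high and 40 pixels wide, the 28 x 28 output would be passed through `resize_bilinear(mask, 56, 40)` and then `binarize` before being pasted into the full-resolution frame.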
[0091] To train this added neural network, a new ground truth is required. Ground truth masks can be created by manually editing the original prediction masks and using them as ground truth.
[0092] The second network is preferably trained, and run at inference, once per object instance in the current frame. For example, if two dog instances are detected in the current frame, the network is run twice, once for each dog instance. In each run, all dog instance masks from the previous frame are first looked up and the eight input masks are filled. These eight channels are then concatenated with the Mask R-CNN prediction for the instance to be predicted and fed into the convolutional neural network.
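Assembling the network input for one instance could look like the sketch below. It assumes that each previous-frame instance of the same class is available as a (confidence, mask) pair at full-frame resolution; the function name `build_network_input` and the channel ordering (previous masks first, current prediction last) are assumptions made for illustration.

```python
def build_network_input(current_mask, previous_instances, max_masks=8):
    """Stack previous-frame masks of one object class with the current prediction.

    current_mask: probability mask (list of rows) of the instance being refined.
    previous_instances: list of (confidence, mask) pairs from the previous frame.
    Returns max_masks + 1 channels: the previous masks, zero-padded, then the
    current prediction.
    """
    height, width = len(current_mask), len(current_mask[0])
    # Keep only the most confident instances when there are too many.
    ranked = sorted(previous_instances, key=lambda item: item[0], reverse=True)
    channels = [mask for _, mask in ranked[:max_masks]]
    # Pad with empty masks (all probabilities zero) when there are too few.
    while len(channels) < max_masks:
        channels.append([[0.0] * width for _ in range(height)])
    channels.append(current_mask)
    return channels
```

With two dog instances in the current frame, this function would be called twice with the same `previous_instances` list and a different `current_mask` each time.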
[0093] The second network is preferably of type U-Net to obtain sufficient spatial context. The output prediction is an improved probability mask of the instance being processed. The predicted probability values are between 0 and 1, while the ground truth probability mask can consist of binary probabilities of 0 or 1.
[0094] In general, the second network is a post-processing neural network trained to predict a more accurate and therefore temporally stable object probability mask given a probability mask generated by the first segmentation model for time-different frames and the current frame.
[0095] Of course, the second network can be used with multiple temporally different segmentation masks.
[0096] After running the post-processing neural network, the improved mask can be thresholded (for example, with a value of 0.5) to obtain a final temporally consistent binary segmentation mask for the object instances.
[0097] The present invention also proposes utilizing prompt-based segmentation models such as the Segment Anything Model (SAM) (Kirillov et al. Segment Anything. arXiv:2304.02643, 2023), which is a neural network that generates segmentation when given so-called input prompts.
[0098] In the case of SAM, the input prompt can be one or a combination of an object mask, one or more points on an object, a bounding box, and a text string. It is recognized that the prompt strongly determines the output of the segmentation mask for a particular object, and therefore also affects the temporal stability when using this neural network with video.
[0099] To achieve temporally stable prompts across multiple video frames, it has been found that the same algorithm must be used to obtain prompts for video frames that are at different times. The prompts obtained from the algorithm need to provide an index of the object's position. By using the same algorithm to provide an index of the position for video frames that are at different times, the movement of the object between frames can be captured by the prompts.
[0100] A suitable algorithm can be a keypoint detection algorithm. For example, a suitable algorithm would be a human keypoint detection algorithm obtained using a neural network called Mask R-CNN (K. He et al. Mask R-CNN. arXiv:1703.06870v3, 2018).
[0101] Furthermore, during experiments with possible input prompts for SAM, it was found that point visibility is crucial for obtaining accurate and temporally stable segmentation results. If object points are obscured by other objects, the segmentation mask may become incorrect.
[0102] Selecting unoccluded human keypoints with sufficiently high detection scores (confidence levels between 0 and 1) improves the temporal consistency of segmentation results.
[0103] Figure 6 shows a scene where person 602 is obscured by object 604. In this case, person 602 may not be detected by the segmentation model. If detected, it may receive a confidence score of less than 0.5. The reason for this low score is that person 602 is largely obscured by object 604.
[0104] To correctly segment the unoccluded portion of the occluded person 602 using SAM, an unoccluded keypoint of person 602 can be used as a prompt. One such keypoint could be the keypoint 606 on the right ankle. When the keypoint 606 on the right ankle is selected as the prompt for SAM, correct segmentation is obtained. Note that if, for example, the center point of the bounding box or the bounding box itself is used as the prompt, a portion of the occluding object 604 is incorrectly segmented instead of person 602.
[0105] The above insights lead to using time-constrained occlusion-aware input prompts in prompt-based segmentation models to improve the temporal consistency of object segmentation masks within video frames.
[0106] Specifically, when dealing with human segmentation, it has been proposed to use human keypoint detection to detect the keypoints of all people within a frame, and then to create the segmentation masks using a prompt-based segmentation model, processing the people in order of decreasing detection confidence and, for each person, using only the keypoints that are not covered by people detected with higher confidence.
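The ordering and keypoint filtering described above can be sketched as follows. The sketch assumes each detected person is a dictionary with an `id`, a detection `score`, and `keypoints` given as (x, y, score) triples, and that masks are sets of (x, y) pixels. The callable `segment_with_points` stands in for a prompt-based segmentation model and is hypothetical, as is the keypoint score threshold of 0.5.

```python
def segment_people(frame, people, segment_with_points, min_keypoint_score=0.5):
    """Segment people in decreasing order of detection confidence.

    For each person, only keypoints with a sufficient score that are not
    already covered by a more confidently detected person are used as prompts.
    Returns a list of (person id, prompts used, mask) tuples.
    """
    covered = set()  # pixels claimed by people processed so far
    results = []
    ranked = sorted(people, key=lambda person: person["score"], reverse=True)
    for person in ranked:
        prompts = [
            (x, y)
            for x, y, score in person["keypoints"]
            if score >= min_keypoint_score and (x, y) not in covered
        ]
        if not prompts:
            continue  # no reliable, visible keypoint to prompt with
        mask = set(segment_with_points(frame, prompts))
        covered |= mask
        results.append((person["id"], prompts, mask))
    return results
```

In the situation of Figure 6, a keypoint of the occluded person that falls on a pixel already claimed by a more confident detection would be skipped, while a visible keypoint such as the right ankle would remain available as a prompt.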
[0107] To avoid false detections (which are usually unreliable), a temporal correspondence between detected individuals can be established over time using spatial proximity. Low-scoring detections are segmented only if a corresponding high-scoring detection exists temporally close to them in the previous or next video frame.
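A minimal sketch of this filtering rule is given below. It assumes each detection is a dictionary with a `score` and a `center` (x, y), and that the detections of the previous and next frames are pooled into one list. The score threshold and the maximum pixel distance are illustrative assumptions, as is the function name.

```python
import math


def keep_detections(current, neighbours, high_score=0.5, max_distance=50.0):
    """Keep high-scoring detections, plus low-scoring ones with temporal support.

    current: detections in the current frame.
    neighbours: detections from the previous and / or next frame.
    A low-scoring detection is kept only if a high-scoring detection lies
    within max_distance pixels of it in a neighbouring frame.
    """
    kept = []
    for det in current:
        if det["score"] >= high_score:
            kept.append(det)
            continue
        for other in neighbours:
            close = math.dist(det["center"], other["center"]) <= max_distance
            if other["score"] >= high_score and close:
                kept.append(det)
                break
    return kept
```

Using the distance between detection centres is only one way to express spatial proximity; bounding box overlap would be an alternative.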
[0108] This concept can be extended to interactions between other object classes and between point prompts, box prompts, mask prompts, and text prompts. For example, to segment a ball, all higher-scoring objects can first be segmented in descending order of score, the pixels within the ball's bounding rectangle that are not covered by higher-scoring objects can be identified, the centroid of these pixels can be determined, and this centroid can be used as a point prompt for a prompt-based segmentation model.
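The centroid computation can be sketched as below, assuming the bounding rectangle is given as (x0, y0, x1, y1) with exclusive upper bounds and that the pixels covered by higher-scoring objects are available as a set of (x, y) tuples. The function name `centroid_prompt` is hypothetical.

```python
def centroid_prompt(bbox, covered):
    """Centroid of the bounding-box pixels not covered by higher-scoring objects.

    bbox: (x0, y0, x1, y1) with x1 and y1 exclusive.
    covered: set of (x, y) pixels already claimed by higher-scoring objects.
    Returns (cx, cy), or None if every pixel in the box is covered.
    """
    x0, y0, x1, y1 = bbox
    free = [
        (x, y)
        for y in range(y0, y1)
        for x in range(x0, x1)
        if (x, y) not in covered
    ]
    if not free:
        return None
    cx = sum(x for x, _ in free) / len(free)
    cy = sum(y for _, y in free) / len(free)
    return (cx, cy)
```

For a strongly non-convex uncovered region, the centroid can itself fall on a covered pixel, so an implementation may want to verify the resulting point before using it as a prompt.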
[0109] In another embodiment, the algorithm may be configured to determine one or more of the following: the bounding box of an object, text indicating the object's position within a video frame (and optionally indicating the object's texture / shape), and points indicating specific parts of the object (e.g., the center of the object, a human ankle, etc.).
[0110] The algorithm can be configured to determine one or more prompts for an object. By using multiple prompts (such as keypoints and bounding boxes), a more accurate segmentation mask can be obtained.
[0111] Figure 7 illustrates a method for object segmentation within a video frame. This method includes the step (702) of obtaining the current segmentation mask of the objects in the current video frame. This can be achieved by using an existing segmentation model for the current video frame.
[0112] The method further includes the step (704) of using the segmentation mask and segmentation obtained from one or more previous and / or subsequent video frames (i.e., video frames that are different in time) to generate a temporally consistent current segmentation mask of the object for the current video frame.
[0113] In the first embodiment, the segmentation obtained from temporally different frames may be a segmentation mask of objects from temporally different frames. The segmentation mask can be compared to the current segmentation mask to identify missing / invalid regions and adapt / modify the current segmentation mask accordingly. Alternatively, both masks can be input into a neural network trained to output a temporally consistent segmentation for the current video frame.
[0114] In a second embodiment, the segmentation obtained from temporally different frames may be missing / invalid segmentation regions obtained by finding the difference between the background image (obtained using temporally different frames) and the current frame. The current video frame can be corrected by including the missing regions or removing the invalid regions. Alternatively, the current mask and the missing / invalid segmentation regions can be input into a neural network trained to output a temporally consistent segmentation for the current video frame.
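The second embodiment can be illustrated with a small sketch on grayscale frames stored as lists of rows. Building the background as a per-pixel temporal median and using a fixed difference threshold are assumptions made here for illustration; the function names are hypothetical.

```python
from statistics import median


def background_image(frames):
    """Per-pixel temporal median over temporally different frames (assumed model)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


def foreground_regions(frame, background, threshold=20):
    """Mark pixels that differ from the background by more than the threshold."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def correct_mask(mask, foreground):
    """Add missing foreground pixels to the mask and remove invalid mask pixels.

    Returns the corrected mask together with the missing and invalid pixels.
    """
    missing, invalid = [], []
    corrected = [row[:] for row in mask]
    for y, (m_row, f_row) in enumerate(zip(mask, foreground)):
        for x, (m, f) in enumerate(zip(m_row, f_row)):
            if f and not m:
                missing.append((x, y))
                corrected[y][x] = 1
            elif m and not f:
                invalid.append((x, y))
                corrected[y][x] = 0
    return corrected, missing, invalid
```

This sketch treats every foreground pixel as belonging to the object; with several foreground objects in the scene, the comparison would need to be restricted to the neighbourhood of the mask being corrected.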
[0115] In a third embodiment, the segmentation obtained from temporally different frames may be bounding boxes, keypoints, or text indicating the positions of objects within the temporally different frames. A temporally consistent segmentation mask can be generated by inputting the current segmentation mask and bounding boxes, keypoints, or text indicating the positions of objects as prompts in a prompt-based segmentation model.
[0116] In any of the embodiments described above, segmentation obtained from temporally different frames can be projected temporally onto the current frame using optical flow.
[0117] As described above, it will be understood that there are various other methods for generating a temporally consistent segmentation mask using segmentation obtained from frames at different times and the current segmentation mask.
[0118] Figure 8 shows a method for object segmentation within a video frame using a prompt-based segmentation model. This method includes the steps of applying an algorithm to a first video frame to obtain a first index of the object's position within the first video frame (802), and inputting the first video frame into a prompt-based segmentation model that uses the first index of the object's position as a prompt (804).
[0119] Furthermore, this method includes the steps of applying the same algorithm to a second subsequent video frame to obtain a second index of the object's position in the second video frame (806), and inputting the second video frame into a prompt-based segmentation model that uses the second index of the object's position as a prompt (808).
[0120] Those skilled in the art can easily develop a processor to perform any of the methods described herein. Thus, each step in the flowchart represents a different action performed by the processor, which can be performed by a respective module of the processor.
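The structure of this method can be sketched as a short loop in which the same position-finding algorithm is applied to every frame and its output is passed to the segmentation model as the prompt. Both callables are placeholders: `locate_object` stands for any algorithm that returns an index of the object's position (for example a keypoint detector), and `prompt_model` stands for a prompt-based segmentation model. `brightest_pixel` is a toy stand-in included only so that the sketch can be run.

```python
def segment_frames(frames, locate_object, prompt_model):
    """Apply the same prompt-producing algorithm to each frame, then segment.

    frames: video frames at different times (first frame, second frame, ...).
    locate_object: callable returning an index of the object's position.
    prompt_model: callable taking (frame, prompt) and returning a mask.
    """
    masks = []
    for frame in frames:
        prompt = locate_object(frame)  # same algorithm for every frame
        masks.append(prompt_model(frame, prompt))
    return masks


def brightest_pixel(frame):
    """Toy position algorithm: (x, y) of the brightest pixel in a grayscale frame."""
    coords = [(x, y) for y, row in enumerate(frame) for x, _ in enumerate(row)]
    return max(coords, key=lambda p: frame[p[1]][p[0]])
```

Because the prompt is recomputed per frame by one and the same algorithm, movement of the object between frames is reflected in the prompts rather than in a change of prompting strategy.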
[0121] One or more steps of any method described herein may be performed by one or more processors. A processor consists of electronic circuits suitable for processing data. Any method described herein may be computer-implemented, where computer implementation means that the steps of the method are performed by one or more computers, where a computer is defined as a device suitable for data processing. A computer is suitable for processing data according to given instructions.
[0122] As described above, the system utilizes a processor to perform data processing. The processor is implemented in various ways using software and / or hardware to perform the various functions required. The processor typically uses one or more microprocessors programmed to perform the required functions using software (e.g., microcode). The processor may also be implemented as a combination of dedicated hardware for performing some functions and one or more programmed microprocessors and associated circuits for performing other functions.
[0123] Examples of circuits used in various embodiments of this disclosure include, but are not limited to, conventional microprocessors, application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs).
[0124] In various implementations, the processor may be associated with one or more storage media, which are volatile and non-volatile computer memories such as RAM, PROM, EPROM, and EEPROM. These storage media may be encoded with one or more programs that perform the required functions when executed on one or more processors and / or controllers. The various storage media may be mounted within the processor or controller, or they may be transportable so that one or more programs stored in the storage media can be loaded into the processor.
[0125] Modifications of the disclosed embodiments can be understood and implemented by those skilled in the art in carrying out the claimed invention, based on a review of the drawings, disclosures, and appended claims. In the claims, the word “comprising” does not exclude other components or steps, and the indefinite article “a” or “an” does not exclude a plurality.
[0126] A single processor or other unit can perform the functions of several of the items listed in the claims.
[0127] The mere fact that certain means are described in mutually different dependent claims does not indicate that combinations of these means cannot be used advantageously.
[0128] Computer programs can be stored / distributed on suitable media such as optical or solid-state media supplied together with or as part of other hardware, but they can also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems.
[0129] When the term "adapted to" is used in a claim or the specification, the term "adapted to" is intended to be equivalent to the term "configured to".
[0130] No reference numeral in a claim should be construed as limiting the scope.
[0131] Any method described herein excludes methods of performing mental acts.
[0138] The present invention is generally defined by the following embodiments.
[0139] Embodiment 1 - A method for object segmentation within a video frame, the method comprising the steps of: obtaining a current segmentation mask of an object in the current video frame; and using the segmentation mask and segmentations obtained from one or more previous and / or subsequent video frames to generate a temporally consistent current segmentation mask of an object for the current video frame.
[0140] Embodiment 2 - According to Embodiment 1, the step of generating a temporally consistent current segmentation mask includes the steps of comparing the current segmentation mask with a segmentation obtained using one or more previous and / or subsequent video frames, and adapting the current segmentation mask to generate a temporally consistent current segmentation mask based on the comparison.
[0141] Embodiment 3 - In accordance with Embodiment 2, the step of obtaining the current segmentation mask of an object includes the step of obtaining two spatially adjacent current segmentation masks in the current video frame, the step of comparing the current segmentation masks includes the step of generating a combined segmentation mask by combining the two spatially adjacent current segmentation masks, and the step of comparing the combined segmentation mask with one or more temporary segmentation masks corresponding to previous and / or subsequent video frames, and the step of adapting the current segmentation mask includes assigning the combined segmentation mask to the corresponding object in the current video frame.
[0142] Embodiment 4 - In accordance with Embodiment 2 or 3, the step of comparing the current segmentation mask includes the steps of generating a background image using a plurality of previous and / or subsequent video frames, determining the differences between the current video frame and the background image to identify foreground regions within the current video frame, and comparing regions corresponding to the current segmentation mask with the identified foreground regions to detect missing and / or invalid segmentation regions, and the step of adapting the current segmentation mask includes adding missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0143] Embodiment 5 - In accordance with any of Embodiments 2 to 4, the step of comparing segmentation masks includes the steps of: obtaining one or more temporary segmentation masks corresponding to objects from previous and / or subsequent frames; determining a motion field between the current video frame and the previous and / or subsequent video frames; using the motion field to project the temporary segmentation masks temporally to the time corresponding to the current video frame; and comparing the projected temporary segmentation masks to the current segmentation mask to detect missing and / or invalid segmentation regions, wherein the step of adapting the current segmentation mask includes adding missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0144] Embodiment 6 - In accordance with any of Embodiments 2 to 5, the step of comparing the current segmentation mask includes the steps of obtaining one or more temporary segmentation masks from previous and / or subsequent frames, and comparing the shape of the temporary segmentation masks and the textures enclosed by the temporary segmentation masks with the shape of the current segmentation mask and the textures enclosed by the current segmentation mask, respectively, in order to detect missing and / or invalid segmentation regions, and the step of adapting the current segmentation mask includes adding missing segmentation regions to the current segmentation mask and / or removing invalid segmentation regions from the current segmentation mask.
[0145] Embodiment 7 - In accordance with Embodiment 5 or 6, the step of adapting the current segmentation mask further includes performing a visibility check of missing and / or invalid segmentation regions to determine whether they are occluded in the current frame, and the adaptation of the current segmentation mask is performed depending on whether at least a portion of the missing and / or invalid segmentation regions are not occluded in the current frame.
[0146] Embodiment 8 - According to Embodiment 1, the step of generating a temporally consistent current segmentation mask includes the steps of obtaining one or more transient segmentation masks from previous and / or subsequent frames corresponding to the same object class as the current segmentation, and inputting the current segmentation mask and the transient segmentation masks into a neural network trained to output an adapted segmentation mask for the current frame.
[0147] Embodiment 9 - In accordance with Embodiment 8, the input to the neural network further includes the current video frame.
[0148] Embodiment 10 - In accordance with Embodiment 8 or 9, the input to the neural network further includes the previous and / or subsequent frames from which the one or more transient segmentation masks are obtained.
[0149] Embodiment 11 – In accordance with any of Embodiments 8 to 10, the neural network is trained using a training algorithm configured to receive a group of training inputs and known outputs, each training input corresponding to a known output, each training input including the current segmentation mask of an object in the current video frame and one or more transient segmentation masks from previous and / or subsequent frames corresponding to the same object class as the current segmentation, and each known output including the segmentation mask of the current frame of the corresponding training input adapted for temporal consistency.
[0150] Embodiment 12 - In accordance with any of Embodiments 8 to 11, the neural network has a U-Net architecture.
[0151] Embodiment 13 - A method for object segmentation in a video frame using a prompt-based segmentation model, the method comprising the steps of: applying an algorithm to a first video frame to obtain a first index of the position of an object in the first video frame; inputting the first video frame into a prompt-based segmentation model using the first index of the position of the object as a prompt; applying the same algorithm to a second subsequent video frame to obtain a second index of the position of an object in the second video frame; and inputting the second video frame into the prompt-based segmentation model using the second index of the position of the object as a prompt.
[0152] Embodiment 14 - In accordance with Embodiment 13, the algorithm includes a keypoint detection algorithm configured to determine one or more keypoints of an object in an image, wherein first and second indices of the object's position include one or more keypoints of the object determined by the keypoint detection algorithm.
[0153] Embodiment 15 - In accordance with Embodiment 13 or 14, the algorithm is configured to segment objects in an input video frame, identify pixels corresponding to the segmented objects, and determine the center of the pixels, wherein first and second indices of the object's position include the corresponding determined center.
[0154] Embodiment 16 - In accordance with any of Embodiments 13 to 15, a visibility check is performed on the first video frame to detect whether the first position indicator of an object is obscured by another object, and based on whether the first position indicator is obscured, the first position indicator is adapted to an unobscured position. A visibility check is also performed on the second video frame to detect whether the second position indicator of an object is obscured by another object, and based on whether the position indicator is obscured, the second position indicator is adapted to an unobscured position.
[0155] Embodiment 17 - In accordance with any of Embodiments 13 to 16, the method further comprises receiving confidence scores for each of the first and second indices of the object's position, wherein the second video frame is input into the prompt-based segmentation model based on both confidence scores exceeding a minimum confidence threshold.
[0156] Embodiment 18 – In accordance with any of Embodiments 13 to 17, the algorithm is configured to perform any of the steps of Embodiments 1 to 12, where the first and second indices of the object's position include the corresponding temporally consistent current segmentation mask.
[0157] Embodiment 19 - A computer program carrier including computer program code that, when executed on a computer, causes the computer to perform all the steps according to any of Embodiments 1 to 12 or any of Embodiments 13 to 18.
[0158] Embodiment 20 - A system including a processor configured to perform all the steps according to any of Embodiments 1 to 12 or any of Embodiments 13 to 18.
[0159] The present invention is defined more specifically by the appended claims.
Claims
1. A method for object segmentation in a video frame, the method comprising the steps of: obtaining a current segmentation mask of an object within a current video frame; and generating a temporally consistent current segmentation mask of the object for the current video frame using the current segmentation mask and segmentation obtained from one or more previous and / or subsequent video frames, wherein the generating step comprises: comparing the current segmentation mask with a segmentation obtained using the one or more previous and / or subsequent video frames; and adapting the current segmentation mask, based on the comparison, to generate the temporally consistent current segmentation mask.
2. The method according to claim 1, wherein the step of obtaining the current segmentation mask of the object includes obtaining two spatially adjacent current segmentation masks in the current video frame, wherein the step of comparing the current segmentation mask comprises: generating a combined segmentation mask by combining the two spatially adjacent current segmentation masks; comparing the combined segmentation mask with one or more temporary segmentation masks corresponding to the previous and / or subsequent video frames; and determining, based on that comparison, whether the two spatially adjacent current segmentation masks belong to the same object instance, and wherein the step of adapting the current segmentation mask includes assigning the combined segmentation mask to the corresponding object in the current video frame in response to a determination that the two spatially adjacent current segmentation masks belong to the same object instance.
3. The method according to claim 1 or 2, wherein the step of comparing the current segmentation mask comprises: generating a background image using multiple previous and / or subsequent video frames; determining the difference between the current video frame and the background image to identify a foreground region within the current video frame; and detecting missing and / or invalid segmentation regions by comparing the region corresponding to the current segmentation mask with the identified foreground region, and wherein the step of adapting the current segmentation mask includes adding the missing segmentation regions to the current segmentation mask and / or removing the invalid segmentation regions from the current segmentation mask.
4. The method according to claim 1 or 2, wherein the step of comparing the current segmentation mask comprises: obtaining one or more temporary segmentation masks corresponding to the object from previous and / or subsequent frames; determining a motion field between the current video frame and the previous and / or subsequent video frames; projecting the temporary segmentation masks temporally onto the time corresponding to the current video frame using the motion field; and comparing the projected temporary segmentation masks with the current segmentation mask to detect missing and / or invalid segmentation regions, and wherein the step of adapting the current segmentation mask includes adding the missing segmentation regions to the current segmentation mask and / or removing the invalid segmentation regions from the current segmentation mask.
5. The method according to claim 1 or 2, wherein the step of comparing the current segmentation mask comprises: obtaining one or more temporary segmentation masks from the previous and / or subsequent frames; and detecting missing and / or invalid segmentation regions by comparing the shape of the temporary segmentation masks and the texture enclosed by the temporary segmentation masks with the shape of the current segmentation mask and the texture enclosed by the current segmentation mask, respectively, and wherein the step of adapting the current segmentation mask includes adding the missing segmentation regions to the current segmentation mask and / or removing the invalid segmentation regions from the current segmentation mask.
6. The method according to claim 4 or 5, wherein the step of adapting the current segmentation mask further includes performing visibility checks on the missing segmentation regions and / or the invalid segmentation regions to determine whether they are obscured within the current frame.
7. A method for object segmentation within a video frame using a prompt-based segmentation model, the method comprising the steps of: applying an algorithm to a first video frame to obtain a first index of the position of an object within the first video frame; inputting the first video frame into the prompt-based segmentation model using the first index of the object's position as a prompt; applying the same algorithm to a second subsequent video frame to obtain a second index of the position of the object in the second video frame; and inputting the second video frame into the prompt-based segmentation model using the second index of the object's position as a prompt.
8. The method according to claim 7, wherein the algorithm includes a keypoint detection algorithm configured to determine one or more keypoints of an object in an image, and the first and second indices of the object's position include one or more keypoints of the object determined by the keypoint detection algorithm.
9. The method according to claim 7 or 8, wherein the algorithm is configured to segment the object in the input video frame, identify the pixels corresponding to the segmented object, and determine the center of those pixels, and wherein the first and second indices of the position of the object include the corresponding determined center.
10. The method according to claim 7 or 8, further comprising: performing a visibility check on the first video frame to detect whether the first index of the object's position is obscured by another object and, based on whether the first index is obscured, adapting the first index to an unobscured position; and performing a visibility check on the second video frame to detect whether the second index of the object's position is obscured by another object and, based on whether the second index is obscured, adapting the second index to an unobscured position.
11. The method according to claim 7 or 8, further comprising the step of receiving confidence scores for each of the first and second indicators of the position of the object, and inputting the second video frame into the prompt-based segmentation model based on the fact that both confidence scores exceed a minimum confidence threshold.
12. The method according to claim 7 or 8, wherein the algorithm is configured to perform all the steps of claim 1 or 2, and the first and second indices of the object's position include a corresponding temporally consistent current segmentation mask.
13. A computer program which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 6 and / or the method according to any one of claims 7 to 12.
14. A system comprising a processor configured to perform the method according to any one of claims 1 to 6 and / or the method according to any one of claims 7 to 12.