An instance segmentation method and device for intersection traffic monitoring
By introducing static background images and a foreground-background fusion module into the traffic monitoring model, a clear foreground mask is generated, which solves the problem of decreased accuracy of existing models at new intersections and camera positions, and improves the generalization ability of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2024-01-19
- Publication Date
- 2026-06-23
AI Technical Summary
Existing instance segmentation models for traffic monitoring camera images generalize poorly: when they face new intersections and new camera positions, accuracy drops significantly, which makes large-scale deployment difficult.
By introducing a static background image, the foreground features of the current frame are enhanced through a twin feature extractor and a foreground-background fusion module. An attention module is used to generate a clear foreground mask, suppressing background information interference and improving the segmentation accuracy across intersections and camera positions.
It effectively alleviates the problem of accuracy decline of existing models at new intersections and new camera positions, and improves the generalization performance of models in cross-intersection and cross-camera position scenarios.
Figure CN118097134B_ABST (abstract drawing)
Description
Technical Field
[0001] This application relates to the field of traffic monitoring technology, and in particular to an instance segmentation method and apparatus for traffic monitoring at intersections.
Background Technology
[0002] Currently, most instance segmentation models used for traffic monitoring camera images suffer from poor generalization. Existing solutions mainly rely on collecting data from a sufficient number of different camera positions to enrich the sample diversity of the training set, which is not only time-consuming and labor-intensive but also inevitably produces long-tail problems.
Summary of the Invention
[0003] In view of this, this application provides an instance segmentation method and apparatus for intersection traffic monitoring to solve the above-mentioned technical problems.
[0004] In a first aspect, embodiments of this application provide an instance segmentation method for intersection traffic monitoring, including:
[0005] Acquire the current RGB image frame of any surveillance camera at the intersection, as well as the static background image of the surveillance camera; the static background image is a pure background RGB image that does not contain any instances under the camera position.
[0006] The pre-trained instance segmentation model is used to process the current RGB image frame and the static background image to obtain the instance segmentation result.
[0007] Furthermore, the method also includes:
[0008] The RGB image sequence of the monitoring camera preceding the current RGB image frame is acquired at preset time intervals.
[0009] Based on the density clustering algorithm, pixels at the same location in an RGB image sequence are clustered along the time dimension to obtain multiple pixel clusters;
[0010] The average value of all pixel values in the pixel cluster with the largest number of pixels is calculated and used as the pixel value of the corresponding position in the static background image, thus obtaining the static background image.
[0011] Furthermore, the instance segmentation model includes: a twin feature extractor, a foreground / background fusion module, an attention module, an instance segmentation head, and a post-processing module;
[0012] The pre-trained instance segmentation model is used to process the current RGB image frame and the static background image to obtain instance segmentation results, including:
[0013] The twin feature extractor is used to downsample the current RGB image frame layer by layer to obtain a first feature map of N scales, and the static background image is downsampled layer by layer to obtain a second feature map of N scales;
[0014] The foreground and background fusion module is used to process the first feature map and the second feature map of the same scale respectively to generate attention feature maps of the corresponding scale; the attention feature maps of N-1 scales are upsampled to obtain N-1 feature maps of the same size as the attention feature map of the largest scale; the N-1 feature maps are then concatenated with the attention feature map of the largest scale to generate the foreground attention feature map.
[0015] The foreground attention feature map is processed using the attention module to obtain a foreground mask. The first feature map at N scales is multiplied by the foreground mask to obtain foreground feature maps at N scales.
[0016] The instance segmentation head is used to process the foreground feature maps at N scales to obtain M instance center point Gaussian heatmaps and a position offset map from the center point. Each instance center point Gaussian heatmap corresponds to one category, and M is the number of categories.
[0017] The post-processing module is used to process the foreground mask, the Gaussian heatmaps of the center points of M instances, and a position offset map from the center point to obtain the instance segmentation result.
[0018] Furthermore, the twin feature extractor employs ResNet-50; the first feature maps at N scales include: a first feature map at a first scale, a first feature map at a second scale, and a first feature map at a third scale; the first feature map at the first scale is obtained by downsampling the current RGB image frame by 4 times, the first feature map at the second scale is obtained by downsampling the current RGB image frame by 8 times, and the first feature map at the third scale is obtained by downsampling the current RGB image frame by 16 times; the second feature maps at N scales include: a second feature map at a first scale, a second feature map at a second scale, and a second feature map at a third scale; the second feature map at the first scale is obtained by downsampling the static background image by 4 times, the second feature map at the second scale is obtained by downsampling the static background image by 8 times, and the second feature map at the third scale is obtained by downsampling the static background image by 16 times.
[0019] Furthermore, the foreground / background fusion module includes a first max spatial pooling, a first average spatial pooling, a second max spatial pooling, a second average spatial pooling, a first arithmetic unit, a second arithmetic unit, a first multilayer perceptron, a second multilayer perceptron, a first adder, a first multiplier, a second multiplier, a first max channel pooling, a first average channel pooling, a second max channel pooling, a second average channel pooling, a third arithmetic unit, a fourth arithmetic unit, a first convolutional layer, a second convolutional layer, a second adder, a third multiplier, a fourth multiplier, a stitching unit, and a third convolutional layer;
[0020] The foreground-background fusion module processes the first and second feature maps of the same scale to generate attention feature maps of the corresponding scale; including:
[0021] The first feature map is processed by first max spatial pooling and first average spatial pooling respectively to obtain a first intermediate feature map and a second intermediate feature map;
[0022] The second feature map is processed by the second max spatial pooling and the second average spatial pooling respectively to obtain the third intermediate feature map and the fourth intermediate feature map.
[0023] The absolute value of the difference between the first intermediate feature map and the third intermediate feature map is calculated using the first arithmetic unit to obtain the fifth intermediate feature map;
[0024] The absolute value of the difference between the second intermediate feature map and the fourth intermediate feature map is calculated using the second arithmetic unit to obtain the sixth intermediate feature map;
[0025] The fifth intermediate feature map is processed using the first multilayer perceptron to obtain the first attention-guided weight feature map;
[0026] The sixth intermediate feature map is processed using a second multilayer perceptron to obtain a second attention-guided weight feature map;
[0027] The first adder is used to calculate the sum of the first attention-guided weight feature map and the second attention-guided weight feature map to obtain the first weight feature map;
[0028] The first multiplier is used to multiply the first weight feature map and the first feature map element-wise to obtain the first spatial dimension weighted feature map;
[0029] The second multiplier is used to multiply the first weight feature map and the second feature map element-wise to obtain the second spatial dimension weighted feature map.
[0030] The first spatial dimension weighted feature map is processed by the first maximum channel pooling and the first average channel pooling respectively to obtain the seventh intermediate feature map and the eighth intermediate feature map.
[0031] The second spatial dimension weighted feature map is processed by the second maximum channel pooling and the second average channel pooling respectively to obtain the ninth intermediate feature map and the tenth intermediate feature map.
[0032] The eleventh intermediate feature map is obtained by calculating the absolute value of the difference between the seventh and ninth intermediate feature maps using the third arithmetic unit.
[0033] The absolute value of the difference between the eighth and tenth intermediate feature maps is calculated using the fourth arithmetic unit to obtain the twelfth intermediate feature map.
[0034] The eleventh intermediate feature map is processed using the first convolutional layer to obtain the third attention-guided weight feature map;
[0035] The twelfth intermediate feature map is processed using the second convolutional layer to obtain the fourth attention-guided weight feature map;
[0036] The second weight feature map is obtained by using the second adder to calculate the sum of the third attention-guided weight feature map and the fourth attention-guided weight feature map;
[0037] The third multiplier is used to multiply the second weight feature map and the first feature map element-wise to obtain the third spatial dimension weighted feature map.
[0038] The fourth multiplier is used to multiply the second weight feature map and the second feature map element-wise to obtain the fourth spatial dimension weighted feature map.
[0039] The third spatial dimension weighted feature map and the fourth spatial dimension weighted feature map are concatenated using the stitching unit to obtain the fifth spatial dimension weighted feature map;
[0040] The fifth spatial dimension weighted feature map is processed using the third convolutional layer to obtain the attention feature map.
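The two-stage comparison described in paragraphs [0021] to [0040] can be sketched in NumPy as follows. This is only an illustrative sketch: the hidden width of the multilayer perceptrons, the 3x3 kernels, the 1x1 form of the third convolution, and the absence of bias terms and activation functions on the weights are all assumptions, since the text does not specify them. The names `mlp`, `conv3x3`, `fuse`, and `params` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(v, w1, w2):
    """Two-layer perceptron on a channel vector (ReLU hidden layer, no bias)."""
    return w2 @ np.maximum(w1 @ v, 0.0)

def conv3x3(m, k):
    """Same-padded 3x3 cross-correlation on a single-channel H x W map."""
    p = np.pad(m, 1)
    h, w = m.shape
    return sum(k[i, j] * p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def fuse(f1, f2, params):
    """f1: current-frame features, f2: background features, both C x H x W."""
    # Stage 1: pool over space, compare frame vs. background, weight the channels.
    d_max = np.abs(f1.max(axis=(1, 2)) - f2.max(axis=(1, 2)))    # fifth map
    d_avg = np.abs(f1.mean(axis=(1, 2)) - f2.mean(axis=(1, 2)))  # sixth map
    w_first = mlp(d_max, *params["mlp1"]) + mlp(d_avg, *params["mlp2"])
    s1 = f1 * w_first[:, None, None]
    s2 = f2 * w_first[:, None, None]
    # Stage 2: pool over channels, compare again, weight the positions.
    e_max = np.abs(s1.max(axis=0) - s2.max(axis=0))    # eleventh map
    e_avg = np.abs(s1.mean(axis=0) - s2.mean(axis=0))  # twelfth map
    w_second = conv3x3(e_max, params["k1"]) + conv3x3(e_avg, params["k2"])
    t3 = f1 * w_second[None]
    t4 = f2 * w_second[None]
    stacked = np.concatenate([t3, t4], axis=0)          # 2C x H x W
    # Third convolution, taken here as a 1x1 convolution back to C channels.
    return np.einsum("oc,chw->ohw", params["w3"], stacked)

C, H, W = 4, 6, 6
params = {
    "mlp1": (rng.normal(size=(2, C)), rng.normal(size=(C, 2))),
    "mlp2": (rng.normal(size=(2, C)), rng.normal(size=(C, 2))),
    "k1": rng.normal(size=(3, 3)),
    "k2": rng.normal(size=(3, 3)),
    "w3": rng.normal(size=(C, 2 * C)),
}
frame = rng.normal(size=(C, H, W))
background = rng.normal(size=(C, H, W))
attention = fuse(frame, background, params)
```

Under these assumptions, feeding the same tensor as both frame and background gives an all-zero attention feature map, because every weight is derived from an absolute difference that vanishes.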
[0041] Furthermore, the attention module includes: two convolutional layers and an argmax function;
[0042] The foreground mask is obtained by processing the foreground attention feature map using the attention module, including:
[0043] Two convolutional layers are used to process the foreground attention feature map to obtain a mask with two channels, which represent the probability of the position being foreground and the probability of the position being background, respectively.
[0044] The argmax function is used to process the mask to obtain the foreground mask.
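The argmax step can be illustrated with a small NumPy sketch. It assumes the two-channel output of the convolutional layers is already available and that channel 0 holds the foreground probability and channel 1 the background probability, following the order in which paragraph [0043] lists them. The function name `foreground_mask` is hypothetical.

```python
import numpy as np

def foreground_mask(two_channel):
    """two_channel: 2 x H x W scores; channel 0 = foreground, 1 = background."""
    winner = np.argmax(two_channel, axis=0)   # index of the larger channel per pixel
    return (winner == 0).astype(np.float32)   # 1 where foreground wins, else 0

scores = np.array([
    [[0.9, 0.2], [0.6, 0.1]],   # foreground probability
    [[0.1, 0.8], [0.4, 0.9]],   # background probability
])
mask = foreground_mask(scores)
```

Because the result is a hard 0/1 map rather than a soft weight, multiplying it with a feature map removes background positions entirely.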
[0045] Furthermore, the instance segmentation head consists of an atrous spatial pyramid pooling (ASPP) layer, a first sub-convolutional layer, a second sub-convolutional layer, a third sub-convolutional layer, a fourth sub-convolutional layer, a fifth sub-convolutional layer, a sixth sub-convolutional layer, a seventh sub-convolutional layer, an eighth sub-convolutional layer, and a concatenation unit;
[0046] The foreground feature maps at the N scales include: a foreground feature map at the first scale, a foreground feature map at the second scale, and a foreground feature map at the third scale.
[0047] The instance segmentation head is used to process the foreground feature maps at N scales to obtain the category, center point, and center point offset map of each instance, including:
[0048] The foreground feature map at the third scale is processed using the atrous spatial pyramid pooling layer to obtain the first intermediate foreground feature map at the third scale.
[0049] The first sub-convolutional layer is used to process the foreground feature map at the second scale to obtain the second intermediate foreground feature map at the second scale.
[0050] Upsample the first intermediate foreground feature map at the third scale to obtain the third intermediate foreground feature map at the second scale. Then, stitch the second intermediate foreground feature map at the second scale and the third intermediate foreground feature map at the second scale together to obtain the fourth intermediate foreground feature map at the second scale.
[0051] The fourth intermediate foreground feature map at the second scale is processed using the second sub-convolutional layer to obtain the fifth intermediate foreground feature map at the second scale.
[0052] The foreground feature map at the first scale is processed by the third sub-convolutional layer to obtain the sixth intermediate foreground feature map at the first scale.
[0053] Upsample the fifth intermediate foreground feature map at the second scale to obtain the seventh intermediate foreground feature map at the first scale. Then, concatenate the sixth and seventh intermediate foreground feature maps at the first scale to obtain the eighth intermediate foreground feature map at the first scale.
[0054] The eighth intermediate foreground feature map at the first scale is processed using the fourth sub-convolutional layer to obtain the ninth intermediate foreground feature map at the first scale.
[0055] The fifth sub-convolutional layer is used to process the fifth intermediate foreground feature map of the second scale to obtain the tenth intermediate foreground feature map of the second scale.
[0056] The ninth intermediate foreground feature map at the first scale is downsampled to obtain the eleventh intermediate foreground feature map at the second scale. The tenth intermediate foreground feature map at the second scale and the eleventh intermediate foreground feature map at the second scale are then stitched together to obtain the twelfth intermediate foreground feature map at the second scale.
[0057] The twelfth intermediate foreground feature map at the second scale is processed using the sixth sub-convolutional layer to obtain the thirteenth intermediate foreground feature map at the second scale.
[0058] The first intermediate foreground feature map at the third scale is processed using the seventh sub-convolutional layer to obtain the fourteenth intermediate foreground feature map at the third scale.
[0059] Downsampling the thirteenth intermediate foreground feature map at the second scale yields the fifteenth intermediate foreground feature map at the third scale. The fourteenth and fifteenth intermediate foreground feature maps at the third scale are then stitched together to obtain the sixteenth intermediate foreground feature map at the third scale.
[0060] The sixteenth intermediate foreground feature map at the third scale is processed using the eighth sub-convolutional layer to obtain the seventeenth intermediate foreground feature map at the third scale.
[0061] Upsample the seventeenth intermediate foreground feature map at the third scale to obtain the seventeenth intermediate foreground feature map at the first scale;
[0062] Upsample the seventeenth intermediate foreground feature map at the second scale to obtain the seventeenth intermediate foreground feature map at the first scale;
[0063] Upsample the thirteenth intermediate foreground feature map at the second scale to obtain the eighteenth intermediate foreground feature map at the first scale;
[0064] The ninth, seventeenth, and eighteenth intermediate foreground feature maps at the first scale are stitched together to obtain the fused foreground feature map at the first scale.
[0065] The fused foreground feature map at the first scale is processed using a center prediction head to obtain M instance center point Gaussian heatmaps, wherein each heatmap contains the predicted center points of all instances of one category; the fused foreground feature map at the first scale is also processed using a center offset prediction head to obtain the position offset map from the center point, wherein the value of each pixel in the position offset map is the two-dimensional coordinate offset between that pixel and the center point.
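To make the center point Gaussian heatmap concrete, the sketch below renders one heatmap per category from known centers and then reads the centers back as local maxima. It is an illustration under assumptions: the Gaussian width `sigma`, the 3x3 peak test, and the 0.5 threshold are not given in the text, and the functions `center_targets` and `peaks` are hypothetical helpers, not the prediction heads themselves.

```python
import numpy as np

def center_targets(centers, classes, num_classes, h, w, sigma=2.0):
    """centers: list of (row, col); classes: category index per instance.
    Returns num_classes x H x W heatmaps with a Gaussian bump per center."""
    rows, cols = np.mgrid[0:h, 0:w]
    heat = np.zeros((num_classes, h, w))
    for (r, c), k in zip(centers, classes):
        g = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
        heat[k] = np.maximum(heat[k], g)   # overlapping bumps keep the maximum
    return heat

def peaks(heat_k, threshold=0.5):
    """Local maxima of one category heatmap within a 3x3 neighbourhood."""
    h, w = heat_k.shape
    p = np.pad(heat_k, 1, constant_values=-np.inf)
    neigh = np.max([p[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=0)
    keep = (heat_k == neigh) & (heat_k >= threshold)
    return [tuple(rc) for rc in np.argwhere(keep).tolist()]

# Two instances of category 0 on a 16 x 16 map; category 1 stays empty.
heat = center_targets([(3, 4), (10, 12)], [0, 0], num_classes=2, h=16, w=16)
found = peaks(heat[0])
```

With M categories the same peak search is run once per heatmap, so every recovered center point carries its category.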
[0066] Furthermore, the post-processing module is used to process the foreground mask, the Gaussian heatmaps of the center points of M instances, and a position offset map from the center point to obtain the instance segmentation result, including:
[0067] The first center position of the instance to which each pixel belongs is obtained by subtracting the coordinate encoding map from the position offset map from the center point.
[0068] The distance between the first center position of each pixel and each center point is calculated, the center point with the minimum distance is taken as the second center position of the pixel, and a coarse segmentation result image is obtained based on the second center positions of all pixels;
[0069] The final instance segmentation result is obtained by multiplying the foreground mask and the coarse segmentation result.
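The three post-processing steps can be sketched as below. The sign convention is an assumption: the offset map is taken to store (pixel coordinate minus center) in (row, col) order, so the voted center is the coordinate map minus the offset map; a model that stores the opposite sign would flip the subtraction. The function name `group_pixels` and the 1-based instance ids are hypothetical choices.

```python
import numpy as np

def group_pixels(offsets, centers, fg_mask):
    """offsets: 2 x H x W, assumed to store (pixel - center) as (row, col).
    centers: list of (row, col) center points; fg_mask: H x W of 0/1.
    Returns an H x W map of instance ids (1-based), 0 for background."""
    _, h, w = offsets.shape
    coords = np.stack(np.mgrid[0:h, 0:w]).astype(float)   # coordinate encoding map
    voted = coords - offsets                               # first center position
    c = np.asarray(centers, dtype=float)                   # K x 2
    dist = np.linalg.norm(voted[None] - c[:, :, None, None], axis=1)  # K x H x W
    coarse = np.argmin(dist, axis=0) + 1                   # nearest center -> id
    return coarse * fg_mask.astype(int)                    # zero out background

# Toy scene: left half belongs to center (1, 1), right half to center (2, 4).
h, w = 4, 6
centers = [(1, 1), (2, 4)]
owner = np.zeros((h, w), dtype=int)
owner[:, 3:] = 1
grid = np.stack(np.mgrid[0:h, 0:w]).astype(float)
offsets = grid - np.array(centers, dtype=float)[owner].transpose(2, 0, 1)
fg = np.ones((h, w))
fg[0, :] = 0                                               # top row is background
ids = group_pixels(offsets, centers, fg)
```

The final multiplication by the foreground mask is what keeps background pixels, which always have some nearest center, out of every instance.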
[0070] In a second aspect, embodiments of this application provide an instance segmentation device for intersection traffic monitoring, comprising:
[0071] The acquisition unit is used to acquire the current RGB image frame of any surveillance camera at the intersection, as well as the static background image of the surveillance camera; the static background image is a pure background RGB image that does not contain any instances under the camera position of the surveillance camera.
[0072] The instance segmentation unit is used to process the current RGB image frame and the static background image using a pre-trained instance segmentation model to obtain instance segmentation results.
[0073] In a third aspect, embodiments of this application provide an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method of embodiments of this application.
[0074] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the methods of embodiments of this application.
[0075] This application effectively enhances the foreground features of the current frame by introducing a static background image into the instance segmentation model, thus alleviating the generalization problem of the severe accuracy drop when the existing instance segmentation model is applied to new intersections and new camera positions.
Attached Figure Description
[0076] To more clearly illustrate the technical solutions in the specific embodiments of this application or the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0077] Figure 1 is a flowchart illustrating an instance segmentation method for intersection traffic monitoring provided in an embodiment of this application;
[0078] Figure 2 is a schematic diagram of the instance segmentation model provided in the embodiments of this application;
[0079] Figure 3 is a schematic diagram of the foreground / background fusion module provided in an embodiment of this application;
[0080] Figure 4 is a schematic diagram of the attention module provided in an embodiment of this application;
[0081] Figure 5 is a functional structure diagram of an instance segmentation device for intersection traffic monitoring provided in an embodiment of this application;
[0082] Figure 6 is a functional structure diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0083] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0084] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.
[0085] First, a brief introduction to the design concept of the embodiments of this application will be given.
[0086] In advanced intelligent transportation perception systems, traffic monitoring cameras are gradually replacing traditional coils and geomagnetic sensors due to their relatively low cost and rich texture information, becoming the main sensors in intelligent transportation scenarios. Simultaneously, the rapid development of vehicle-to-infrastructure (V2I) technology has placed new demands on vision-based advanced intelligent transportation perception systems, including vehicle ground contact point recognition, axle count recognition, anomaly detection, and 3D bounding box prediction. These perception outputs largely rely on accurate roadside instance segmentation results. Therefore, instance segmentation has gradually become a research hotspot in advanced intelligent transportation perception systems.
[0087] Currently widely used instance segmentation methods are all geared towards general instance segmentation, such as Mask R-CNN, Mask2Former, OneFormer, and MaskDINO. While these methods have indeed demonstrated high accuracy in traffic monitoring, they are difficult to widely adopt. One of the main reasons is the generalization problem. When encountering new intersections and camera positions not present in the algorithm's training set, existing instance segmentation methods experience a precipitous drop in accuracy. This drop is due to the inherent characteristic of roadside perception data: high intra-class similarity and low inter-class similarity. Specifically, for a single frame image, a large proportion of background pixels remain essentially unchanged under the same camera position, resulting in highly similar image data features acquired from the same position. However, background pixels are completely different under different camera positions, leading to very low similarity in image data features acquired from different positions. Therefore, existing general instance segmentation methods inevitably overfit to the camera positions in the training set, and when faced with unknown new intersections and camera positions, the drastic changes in negative sample pixels cause a significant drop in algorithm accuracy.
[0088] To address the aforementioned technical issues, this application provides an instance segmentation method for intersection traffic monitoring. This method effectively enhances the foreground features of the current frame by introducing a static background image, thus alleviating the generalization problem of severe accuracy degradation in existing general instance segmentation algorithms when facing new intersections and new camera positions.
[0089] The advantages of this application are:
[0090] 1. By introducing static background image information into the feature extraction process, the designed foreground-background fusion module changes the task of the traffic monitoring camera feature extractor from directly discriminating positive and negative samples to discriminating non-negative samples given known negative samples, that is, to comparing the current frame image with the background image. This effectively solves the shift of the decision boundary caused by large changes in the distribution of negative (background) samples under new camera positions, thereby improving the segmentation accuracy of this method across intersections and camera positions.
[0091] 2. By applying explicit foreground mask constraints to the attention module, the problem of blurry boundaries and poor interpretability of attention maps generated by existing self-attention mechanisms is solved. By generating a foreground mask with clear attention map boundaries and obvious physical meaning, the foreground mask is multiplied with the extracted multi-scale features, which can suppress background information to the greatest extent and isolate the interference of background changes at new intersections and new camera positions on the instance segmentation head and post-processing module. This also improves the generalization performance of this method in cross-intersection and cross-camera position scenarios.
[0092] After introducing the application scenarios and design concepts of the embodiments of this application, the technical solutions provided by the embodiments of this application will be described below.
[0093] As shown in Figure 1, this application provides an instance segmentation method for intersection traffic monitoring, including the following steps:
[0094] Step 101: Obtain the current RGB image frame of any surveillance camera at the intersection, and the static background image of the surveillance camera; the static background image is a pure background RGB image that does not contain any instances under the camera position.
[0095] Step 102: Use the pre-trained instance segmentation model to process the current RGB image frame and the static background image to obtain the instance segmentation result.
[0096] The method further includes:
[0097] The RGB image sequence of the monitoring camera preceding the current RGB image frame is acquired at a preset time interval; preferably, the preset time interval is 10 seconds.
[0098] Based on the density clustering algorithm, pixels at the same location in an RGB image sequence are clustered along the time dimension to obtain multiple pixel clusters;
[0099] The average value of all pixel values in the pixel cluster with the largest number of pixels is calculated and used as the pixel value of the corresponding position in the static background image, thus obtaining the static background image.
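The background estimation in paragraphs [0097] to [0099] can be sketched per pixel as follows. The text only names "the density clustering algorithm", so the grouping used here (chaining samples whose RGB distance is at most `eps`, which corresponds to a DBSCAN-like rule with a minimum cluster size of one) is a simplified stand-in, and `eps`, `largest_cluster_mean`, and `static_background` are hypothetical names and parameters.

```python
import numpy as np

def largest_cluster_mean(samples, eps=10.0):
    """samples: T x 3 RGB values of one pixel location over time.
    Groups samples that are chained by RGB distances <= eps and returns
    the mean of the most populated group."""
    t = len(samples)
    dist = np.linalg.norm(samples[:, None] - samples[None], axis=-1)
    labels = -np.ones(t, dtype=int)
    current = 0
    for i in range(t):
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = current
        while stack:                       # flood-fill all reachable samples
            j = stack.pop()
            for k in np.flatnonzero((dist[j] <= eps) & (labels < 0)):
                labels[k] = current
                stack.append(k)
        current += 1
    biggest = np.bincount(labels).argmax()
    return samples[labels == biggest].mean(axis=0)

def static_background(frames, eps=10.0):
    """frames: T x H x W x 3 array sampled at a fixed interval (e.g. 10 s)."""
    frames = np.asarray(frames, dtype=float)
    _, h, w, _ = frames.shape
    out = np.empty((h, w, 3))
    for r in range(h):
        for c in range(w):
            out[r, c] = largest_cluster_mean(frames[:, r, c], eps)
    return out

# Eight 2 x 2 frames of a grey road; a red vehicle covers pixel (0, 0) in three.
frames = np.stack([np.full((2, 2, 3), 100.0 + (t % 3) - 1) for t in range(8)])
frames[:3, 0, 0] = (200.0, 30.0, 30.0)
bg = static_background(frames)
```

Taking the mean of the largest cluster rather than the mean of all samples is what keeps passing vehicles from tinting the background, as long as the road is visible in most of the sampled frames.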
[0100] As shown in Figure 2, the instance segmentation model includes: a twin feature extractor, a foreground / background fusion module, an attention module, an instance segmentation head, and a post-processing module;
[0101] Specifically, a pre-trained instance segmentation model is used to process the current RGB image frame and the static background image to obtain instance segmentation results; including:
[0102] The twin feature extractor is used to downsample the current RGB image frame layer by layer to obtain a first feature map of N scales, and the static background image is downsampled layer by layer to obtain a second feature map of N scales;
[0103] The foreground and background fusion module is used to process the first feature map and the second feature map of the same scale respectively to generate attention feature maps of the corresponding scale; the attention feature maps of N-1 scales are upsampled to obtain N-1 feature maps of the same size as the attention feature map of the largest scale; the N-1 feature maps are then concatenated with the attention feature map of the largest scale to generate the foreground attention feature map.
[0104] The foreground attention feature map is processed using the attention module to obtain a foreground mask. The first feature map at N scales is multiplied by the foreground mask to obtain foreground feature maps at N scales.
[0105] The instance segmentation head is used to process the foreground feature maps at N scales to obtain M instance center point Gaussian heatmaps and a position offset map from the center point. Each instance center point Gaussian heatmap corresponds to one category, and M is the number of categories.
[0106] The post-processing module is used to process the foreground mask, the Gaussian heatmaps of the center points of M instances, and a position offset map from the center point to obtain the instance segmentation result.
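The multi-scale bookkeeping in the steps above (upsampling the smaller attention feature maps, concatenating them, and applying one foreground mask to every scale) can be sketched as follows. Nearest-neighbour upsampling and strided subsampling of the mask are assumptions, since the text does not say how the resizing is done; the function names are hypothetical.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling of a C x H x W map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def foreground_attention(att_maps):
    """att_maps: attention feature maps ordered from the largest to the
    smallest resolution. Returns their channel-wise concatenation at the
    largest resolution."""
    big = att_maps[0]
    resized = [upsample(a, big.shape[1] // a.shape[1]) for a in att_maps[1:]]
    return np.concatenate([big] + resized, axis=0)

def apply_mask(features, mask):
    """Multiply each scale's C x H x W features by the mask, subsampled
    to that scale's resolution."""
    out = []
    for f in features:
        step = mask.shape[0] // f.shape[1]
        out.append(f * mask[::step, ::step][None])
    return out

# Three scales standing in for the 4x, 8x and 16x feature maps.
maps = [np.ones((4, 16, 16)), 2 * np.ones((4, 8, 8)), 3 * np.ones((4, 4, 4))]
fused = foreground_attention(maps)        # 12 x 16 x 16
mask = np.zeros((16, 16))
mask[:8] = 1                              # upper half is foreground
masked = apply_mask(maps, mask)
```

The mask is computed once at the largest resolution and reused at every scale, so all N foreground feature maps agree on which positions are background.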
[0107] In the embodiments of this application, the twin feature extractor uses ResNet-50; the first feature maps at N scales include: a first feature map at a first scale, a first feature map at a second scale, and a first feature map at a third scale; the first feature map at the first scale is obtained by downsampling the current RGB image frame by 4 times, the first feature map at the second scale is obtained by downsampling the current RGB image frame by 8 times, and the first feature map at the third scale is obtained by downsampling the current RGB image frame by 16 times; the second feature maps at N scales include: a second feature map at a first scale, a second feature map at a second scale, and a second feature map at a third scale; the second feature map at the first scale is obtained by downsampling the static background image by 4 times, the second feature map at the second scale is obtained by downsampling the static background image by 8 times, and the second feature map at the third scale is obtained by downsampling the static background image by 16 times.
[0108] The feature extraction network uses ResNet-50 as the base network for high-level semantic feature extraction. The two branches that extract features from the static background image and from the current frame have the same structure and identical parameters, and their weights are updated synchronously. Background pixels in the current frame are similar to the pixels of the corresponding static background image, and the features extracted by the twin (Siamese) feature extraction network preserve this similarity. The foreground pixel features of the current frame, however, differ from those of the corresponding static background image, which allows the attention module to recognize the foreground mask efficiently across intersections and camera positions. The twin feature extraction network outputs three pairs of feature maps (six in total) at multiple resolutions, downsampled by 4x, 8x, and 16x, respectively.
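The effect of sharing weights between the two branches can be shown with a toy extractor. The sketch below is not ResNet-50: it replaces the backbone with average pooling followed by a single shared 1x1 projection, purely to show that identical weights make background regions produce identical features at the 4x, 8x, and 16x scales. The names `weights` and `extract` and the image sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 3))          # one shared 1x1 projection: 3 -> 8 channels

def extract(image, factors=(4, 8, 16)):
    """image: 3 x H x W. Returns one feature map per downsampling factor,
    always using the same shared weights."""
    feats = []
    for f in factors:
        c, h, w = image.shape
        pooled = image.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))
        feats.append(np.einsum("oc,chw->ohw", weights, pooled))
    return feats

background = rng.uniform(size=(3, 64, 64))
current = background.copy()
current[:, 16:32, 16:32] = 1.0              # a hypothetical vehicle patch

f_cur, f_bg = extract(current), extract(background)
# Per-scale feature difference, summed over channels.
diff = [np.abs(a - b).sum(axis=0) for a, b in zip(f_cur, f_bg)]
```

In this toy setting the difference maps are exactly zero outside the patch and positive inside it, which is the property the foreground / background fusion module relies on.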
[0109] As shown in Figure 3, the foreground / background fusion module includes a first max spatial pooling, a first average spatial pooling, a second max spatial pooling, a second average spatial pooling, a first arithmetic unit, a second arithmetic unit, a first multilayer perceptron, a second multilayer perceptron, a first adder, a first multiplier, a second multiplier, a first max channel pooling, a first average channel pooling, a second max channel pooling, a second average channel pooling, a third arithmetic unit, a fourth arithmetic unit, a first convolutional layer, a second convolutional layer, a second adder, a third multiplier, a fourth multiplier, a stitching unit, and a third convolutional layer;
[0110] The foreground-background fusion module processes the first and second feature maps of the same scale to generate attention feature maps of the corresponding scale; including:
[0111] The first feature map is processed by first max spatial pooling and first average spatial pooling respectively to obtain a first intermediate feature map and a second intermediate feature map;
[0112] The second feature map is processed by the second max spatial pooling and the second average spatial pooling respectively to obtain the third intermediate feature map and the fourth intermediate feature map.
[0113] The absolute value of the difference between the first intermediate feature map and the third intermediate feature map is calculated using the first arithmetic unit to obtain the fifth intermediate feature map;
[0114] The absolute value of the difference between the second intermediate feature map and the fourth intermediate feature map is calculated using the second arithmetic unit to obtain the sixth intermediate feature map;
[0115] The fifth intermediate feature map is processed using the first multilayer perceptron to obtain the first attention-guided weight feature map;
[0116] The sixth intermediate feature map is processed using a second multilayer perceptron to obtain a second attention-guided weight feature map;
[0117] The first weight feature map is obtained by using the first adder to calculate the sum of the first attention-guided weight feature map and the second attention-guided weight feature map;
[0118] The first multiplier is used to perform element-wise multiplication of the first weight feature map and the first feature map to obtain the first spatial dimension weighted feature map;
[0119] The second multiplier is used to perform element-wise multiplication of the first weight feature map and the second feature map to obtain the second spatial dimension weighted feature map.
[0120] The first spatial dimension weighted feature map is processed by the first max channel pooling and the first average channel pooling respectively to obtain the seventh intermediate feature map and the eighth intermediate feature map.
[0121] The second spatial dimension weighted feature map is processed by the second max channel pooling and the second average channel pooling respectively to obtain the ninth intermediate feature map and the tenth intermediate feature map.
[0122] The eleventh intermediate feature map is obtained by calculating the absolute value of the difference between the seventh and ninth intermediate feature maps using the third arithmetic unit.
[0123] The absolute value of the difference between the eighth and tenth intermediate feature maps is calculated using the fourth arithmetic unit to obtain the twelfth intermediate feature map.
[0124] The eleventh intermediate feature map is processed using the first convolutional layer to obtain the third attention-guided weight feature map;
[0125] The twelfth intermediate feature map is processed using the second convolutional layer to obtain the fourth attention-guided weight feature map;
[0126] The second weight feature map is obtained by using the second adder to calculate the sum of the third attention-guided weight feature map and the fourth attention-guided weight feature map;
[0127] The second weight feature map and the first feature map are multiplied by the third multiplier to obtain the third spatial dimension weighted feature map.
[0128] The second weight feature map and the second feature map are multiplied by the fourth multiplier to obtain the fourth spatial dimension weighted feature map.
[0129] The third spatial dimension weighted feature map and the fourth spatial dimension weighted feature map are concatenated using the concatenation unit to obtain the fifth spatial dimension weighted feature map;
[0130] The fifth spatial dimension weighted feature map is processed using the third convolutional layer to obtain the attention feature map.
[0131] By fusing the features of the current frame with those of the static background image, the foreground-background fusion module suppresses background features as far as possible in both the spatial dimension and the channel dimension, thereby reducing the loss of generalization caused by background changes.
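The fusion steps above can be sketched in NumPy for a single scale. This is only an illustration under stated assumptions, not the implementation of this application: the multilayer perceptrons are bias-free with a ReLU, the first and second convolutional layers are reduced to scalar 1x1 kernels, no sigmoid or other normalization is applied (none is mentioned in the text), and all weights are random. The names `fuse`, `mlp`, `conv1x1`, and `params` are hypothetical.

```python
import numpy as np

def mlp(v, w_in, w_out):
    # Two-layer perceptron without biases, ReLU in between (assumed design).
    return w_out @ np.maximum(w_in @ v, 0.0)

def conv1x1(x, w):
    # 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in).
    return np.einsum("oc,chw->ohw", w, x)

def fuse(cur, bg, p):
    # Stage 1: pool over space, compare the two branches, weight each channel.
    d_max = np.abs(cur.max(axis=(1, 2)) - bg.max(axis=(1, 2)))
    d_avg = np.abs(cur.mean(axis=(1, 2)) - bg.mean(axis=(1, 2)))
    w_first = mlp(d_max, *p["mlp1"]) + mlp(d_avg, *p["mlp2"])   # shape (C,)
    cur1 = cur * w_first[:, None, None]
    bg1 = bg * w_first[:, None, None]
    # Stage 2: pool over channels, compare, weight each spatial position.
    e_max = np.abs(cur1.max(axis=0) - bg1.max(axis=0))
    e_avg = np.abs(cur1.mean(axis=0) - bg1.mean(axis=0))
    w_second = p["conv1"] * e_max + p["conv2"] * e_avg          # shape (H, W)
    # The second weight map is applied to the original feature maps.
    stacked = np.concatenate([cur * w_second, bg * w_second], axis=0)
    return conv1x1(stacked, p["conv3"])                         # shape (C, H, W)

rng = np.random.default_rng(0)
C, H, W, hidden = 8, 6, 6, 4
params = {
    "mlp1": (rng.standard_normal((hidden, C)), rng.standard_normal((C, hidden))),
    "mlp2": (rng.standard_normal((hidden, C)), rng.standard_normal((C, hidden))),
    "conv1": 0.7, "conv2": 0.3,          # scalar stand-ins for 1x1 convolutions
    "conv3": rng.standard_normal((C, 2 * C)),
}
background = rng.standard_normal((C, H, W))
current = background.copy()
current[:, 2:4, 2:4] += 3.0              # a hypothetical foreground object
attention = fuse(current, background, params)
same = fuse(background, background, params)
```

In this bias-free sketch, feeding the same feature map to both branches makes every absolute difference zero, so `same` is all zeros. That is the background-suppression behavior the paragraph above describes, in its simplest form.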
[0132] As shown in Figure 4, the attention module includes: two convolutional layers and an argmax function;
[0133] The foreground mask is obtained by processing the foreground attention feature map using the attention module, including:
[0134] Two convolutional layers are used to process the foreground attention feature map to obtain a mask with two channels, which represent the probability of the position being foreground and the probability of the position being background, respectively.
[0135] The argmax function is used to process the mask to obtain the foreground mask, which is used to obtain multi-scale features containing only the foreground.
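A minimal sketch of the argmax step follows, assuming that channel 0 of the two-channel mask holds the foreground probability and channel 1 the background probability, in the order the text lists them. The two convolutional layers are omitted and the probabilities are hand-written toy values; `foreground_mask` is a hypothetical name. The last line multiplies a feature map by the mask, which is how claim 1 obtains the foreground feature maps.

```python
import numpy as np

def foreground_mask(probs):
    """probs: (2, H, W); channel 0 = foreground, channel 1 = background (assumed order)."""
    return (np.argmax(probs, axis=0) == 0).astype(np.float32)

# Toy two-channel output for a 2x2 image.
probs = np.array([[[0.9, 0.2],
                   [0.4, 0.7]],     # foreground probability
                  [[0.1, 0.8],
                   [0.6, 0.3]]])    # background probability
mask = foreground_mask(probs)

# Multiplying a (C, H, W) feature map by the mask keeps foreground positions only.
features = np.ones((3, 2, 2), dtype=np.float32)
foreground_features = features * mask[None]
```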
[0136] Specifically, the instance segmentation head consists of an atrous spatial pyramid pooling (ASPP) layer, a first sub-convolutional layer, a second sub-convolutional layer, a third sub-convolutional layer, a fourth sub-convolutional layer, a fifth sub-convolutional layer, a sixth sub-convolutional layer, a seventh sub-convolutional layer, an eighth sub-convolutional layer, and a concatenation unit;
[0137] The foreground feature maps at the N scales include: a foreground feature map at the first scale, a foreground feature map at the second scale, and a foreground feature map at the third scale.
[0138] The instance segmentation head is used to process the foreground feature maps at N scales to obtain the category, center point, and center point offset map of each instance, including:
[0139] The foreground feature map at the third scale is processed using the atrous spatial pyramid pooling layer to obtain the first intermediate foreground feature map at the third scale.
[0140] The first sub-convolutional layer is used to process the foreground feature map at the second scale to obtain the second intermediate foreground feature map at the second scale.
[0141] The first intermediate foreground feature map at the third scale is upsampled to obtain the third intermediate foreground feature map at the second scale. The second intermediate foreground feature map at the second scale and the third intermediate foreground feature map at the second scale are then concatenated to obtain the fourth intermediate foreground feature map at the second scale.
[0142] The fourth intermediate foreground feature map at the second scale is processed using the second sub-convolutional layer to obtain the fifth intermediate foreground feature map at the second scale.
[0143] The foreground feature map at the first scale is processed by the third sub-convolutional layer to obtain the sixth intermediate foreground feature map at the first scale.
[0144] Upsample the fifth intermediate foreground feature map at the second scale to obtain the seventh intermediate foreground feature map at the first scale. Then, concatenate the sixth and seventh intermediate foreground feature maps at the first scale to obtain the eighth intermediate foreground feature map at the first scale.
[0145] The eighth intermediate foreground feature map at the first scale is processed using the fourth sub-convolutional layer to obtain the ninth intermediate foreground feature map at the first scale.
[0146] The fifth sub-convolutional layer is used to process the fifth intermediate foreground feature map of the second scale to obtain the tenth intermediate foreground feature map of the second scale.
[0147] The ninth intermediate foreground feature map at the first scale is downsampled to obtain the eleventh intermediate foreground feature map at the second scale. The tenth intermediate foreground feature map at the second scale and the eleventh intermediate foreground feature map at the second scale are then concatenated to obtain the twelfth intermediate foreground feature map at the second scale.
[0148] The twelfth intermediate foreground feature map at the second scale is processed using the sixth sub-convolutional layer to obtain the thirteenth intermediate foreground feature map at the second scale.
[0149] The first intermediate foreground feature map at the third scale is processed using the seventh sub-convolutional layer to obtain the fourteenth intermediate foreground feature map at the third scale.
[0150] The thirteenth intermediate foreground feature map at the second scale is downsampled to obtain the fifteenth intermediate foreground feature map at the third scale. The fourteenth and fifteenth intermediate foreground feature maps at the third scale are then concatenated to obtain the sixteenth intermediate foreground feature map at the third scale.
[0151] The sixteenth intermediate foreground feature map at the third scale is processed using the eighth sub-convolutional layer to obtain the seventeenth intermediate foreground feature map at the third scale.
[0152] Upsample the seventeenth intermediate foreground feature map at the third scale to obtain the seventeenth intermediate foreground feature map at the first scale;
[0153] Upsample the seventeenth intermediate foreground feature map at the second scale to obtain the seventeenth intermediate foreground feature map at the first scale;
[0154] Upsample the thirteenth intermediate foreground feature map at the second scale to obtain the eighteenth intermediate foreground feature map at the first scale;
[0155] The ninth, seventeenth, and eighteenth intermediate foreground feature maps at the first scale are concatenated to obtain the fused foreground feature map at the first scale.
[0156] The fused foreground feature map at the first scale is processed using a center prediction head to obtain M Gaussian heatmaps of instance center points, wherein each Gaussian heatmap of instance center points contains the predicted center points of all instances of one category; the fused foreground feature map at the first scale is also processed using a center offset prediction head to obtain a position offset map from the center point, wherein the value of each pixel in the position offset map is the two-dimensional coordinate offset between that pixel and the center point of the instance to which it belongs.
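The data flow through the instance segmentation head, first from coarse to fine and then from fine to coarse, can be traced with a shape-level sketch. Everything concrete here is an assumption made for illustration: each sub-convolutional layer and the atrous spatial pyramid pooling layer is replaced by a random 1x1 convolution, upsampling is nearest-neighbor, downsampling is 2x2 average pooling, and every intermediate map has `C` channels. The variable names `m1` to `m17` follow the ordinal numbers of the intermediate foreground feature maps in the text; the prediction heads are not included.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 4  # hypothetical channel width for every intermediate map

def conv(x):
    # Stand-in for a sub-convolutional layer: random 1x1 convolution to C channels.
    w = rng.standard_normal((C, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum("oc,chw->ohw", w, x)

def up(x, k=2):
    # Nearest-neighbor upsampling by a factor k.
    return x.repeat(k, axis=1).repeat(k, axis=2)

def down(x):
    # 2x2 average pooling.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def cat(*xs):
    return np.concatenate(xs, axis=0)

def head_fusion(f1, f2, f3):
    m1 = conv(f3)                    # ASPP stand-in, third scale
    m2 = conv(f2)                    # first sub-conv, second scale
    m5 = conv(cat(m2, up(m1)))       # maps 3 and 4, then second sub-conv
    m6 = conv(f1)                    # third sub-conv, first scale
    m9 = conv(cat(m6, up(m5)))       # maps 7 and 8, then fourth sub-conv
    m10 = conv(m5)                   # fifth sub-conv
    m13 = conv(cat(m10, down(m9)))   # maps 11 and 12, then sixth sub-conv
    m14 = conv(m1)                   # seventh sub-conv
    m17 = conv(cat(m14, down(m13)))  # maps 15 and 16, then eighth sub-conv
    # Bring the coarser maps back to the first scale and concatenate.
    return cat(m9, up(m17, 4), up(m13, 2))

f1 = rng.standard_normal((C, 8, 8))   # foreground features, first scale
f2 = rng.standard_normal((C, 4, 4))   # second scale
f3 = rng.standard_normal((C, 2, 2))   # third scale
fused = head_fusion(f1, f2, f3)
```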
[0157] In this embodiment of the application, the post-processing module is used to process the foreground mask, the M Gaussian heatmaps of instance center points, and the position offset map from the center point to obtain the instance segmentation result, including:
[0158] The first center position of the instance to which each pixel belongs is obtained by subtracting the coordinate encoding map from the position offset map from the center point.
[0159] The distance between the first center position of each pixel and each predicted instance center point is calculated, the center point with the minimum distance is taken as the second center position of the pixel, and the coarse segmentation result image is obtained based on the second center positions of all pixels;
[0160] Specifically, instance boundaries are divided into two categories. One category is the boundaries between instances of different categories and between instances and the background. These boundaries are obtained from the foreground mask output by the attention module. The other category is the boundaries between different instances of the same category. These boundaries are naturally obtained from the coarse segmentation results.
[0161] The final instance segmentation result is obtained by multiplying the foreground mask and the coarse segmentation result.
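The grouping and masking steps can be sketched as follows. The sign convention is an assumption: the sketch treats the offset stored at each pixel as (center minus pixel coordinate), so the first center position is the coordinate plus the offset; the sign should be flipped if a trained model uses the opposite convention. The center points are passed in directly rather than extracted from the Gaussian heatmaps, and `group_pixels` is a hypothetical name.

```python
import numpy as np

def group_pixels(offsets, centers, fg_mask):
    """offsets: (2, H, W) as (dy, dx); centers: (K, 2) as (y, x); fg_mask: (H, W) of 0/1."""
    _, h, w = offsets.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys, xs]).astype(np.float64)   # coordinate encoding map
    voted = coords + offsets                         # first center position per pixel
    # Distance from each pixel's voted center to every detected center: (K, H, W).
    dist = np.linalg.norm(voted[None] - centers[:, :, None, None], axis=1)
    coarse = dist.argmin(axis=0) + 1                 # coarse result, instance ids 1..K
    return coarse * fg_mask.astype(np.int64)         # background pixels become 0

# Toy 1x4 image with two detected centers and zero offsets.
centers = np.array([[0.0, 0.0], [0.0, 3.0]])
offsets = np.zeros((2, 1, 4))
fg_mask = np.array([[1, 1, 0, 1]])
result = group_pixels(offsets, centers, fg_mask)
```

Multiplying by the foreground mask supplies the boundaries between instances and the background, while the nearest-center assignment separates neighboring instances, matching the two kinds of boundary described above.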
[0162] In addition, the method also includes the step of training an instance segmentation model;
[0163] Obtain the training set from the RopeIns dataset, which includes multiple training samples;
[0164] The training samples are processed using an instance segmentation model, and the supervision signal consists of three parts:
[0165] The predicted foreground mask, which is supervised by cross-entropy loss;
[0166] The center point heatmap, which is supervised by L2 loss;
[0167] The position offset map from the center point, which is supervised by L1 loss.
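A minimal sketch of the three-part supervision signal is given below. The equal weighting of the three terms, the mean reduction, and the function name `total_loss` are assumptions; the text names only the loss type used for each output.

```python
import numpy as np

def total_loss(fg_prob, fg_gt, heat_pred, heat_gt, off_pred, off_gt,
               weights=(1.0, 1.0, 1.0)):   # hypothetical equal weights
    eps = 1e-7
    p = np.clip(fg_prob, eps, 1.0 - eps)
    # Cross-entropy on the foreground mask (binary form).
    ce = -np.mean(fg_gt * np.log(p) + (1.0 - fg_gt) * np.log(1.0 - p))
    # L2 (mean squared error) on the center point heatmaps.
    l2 = np.mean((heat_pred - heat_gt) ** 2)
    # L1 (mean absolute error) on the position offset map.
    l1 = np.mean(np.abs(off_pred - off_gt))
    return weights[0] * ce + weights[1] * l2 + weights[2] * l1

# Toy inputs: an uninformative mask prediction, perfect heatmap and offsets.
fg_gt = np.array([[1.0, 0.0]])
fg_prob = np.array([[0.5, 0.5]])
heat = np.zeros((1, 1, 2))
off = np.zeros((2, 1, 2))
loss = total_loss(fg_prob, fg_gt, heat, heat, off, off)
```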
[0168] In addition, data augmentation can be performed on the training data. The data augmentation methods fall into two categories: traditional coupled data augmentation and decoupled data augmentation. Coupled data augmentation includes random translation, rotation, cropping, and flipping, and is applied with the same parameters to both the input static background image and the current frame image. Decoupled data augmentation includes color, exposure, and contrast transformations; it is performed independently on the static background image and the current frame image, and the transformation parameters used for the two images differ.
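The difference between the two categories can be sketched with one transformation of each kind. The parameter ranges, the grayscale toy image, and the names `coupled_flip` and `decoupled_photometric` are hypothetical; only the coupled versus decoupled distinction is taken from the text.

```python
import numpy as np

def coupled_flip(current, background, rng):
    """Apply the same random horizontal flip to both images (coupled)."""
    if rng.random() < 0.5:
        return current[:, ::-1], background[:, ::-1]
    return current, background

def decoupled_photometric(image, rng):
    """Apply a contrast gain and exposure bias drawn for this image only."""
    gain = rng.uniform(0.8, 1.2)     # hypothetical contrast range
    bias = rng.uniform(-0.1, 0.1)    # hypothetical exposure range
    return np.clip(image * gain + bias, 0.0, 1.0)

rng = np.random.default_rng(0)
current = np.linspace(0.2, 0.8, 12).reshape(3, 4)   # toy grayscale frame in [0, 1]
background = current.copy()

# Geometry stays aligned between the pair; appearance is perturbed separately.
cur_geo, bg_geo = coupled_flip(current, background, rng)
cur_aug = decoupled_photometric(cur_geo, rng)
bg_aug = decoupled_photometric(bg_geo, rng)
```

Keeping the geometric transformation shared preserves the pixel-to-pixel correspondence that the fusion module relies on, while the independent photometric draws expose the model to lighting differences between the background image and the current frame.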
[0169] Based on the above embodiments, this application provides an instance segmentation device for intersection traffic monitoring. As shown in Figure 5, the instance segmentation device 200 for intersection traffic monitoring provided in this application embodiment includes at least:
[0170] The acquisition unit 201 is used to acquire the current RGB image frame of any surveillance camera at the intersection, as well as the static background image of the surveillance camera; the static background image is a pure background RGB image that does not contain any instances under the camera position of the surveillance camera.
[0171] The instance segmentation unit 202 is used to process the current RGB image frame and the static background image using a pre-trained instance segmentation model to obtain the instance segmentation result.
[0172] It should be noted that the principle of the instance segmentation device 200 for intersection traffic monitoring provided in this application embodiment to solve the technical problem is similar to the method provided in this application embodiment. Therefore, the implementation of the instance segmentation device 200 for intersection traffic monitoring provided in this application embodiment can refer to the implementation of the method provided in this application embodiment, and the repeated parts will not be described again.
[0173] Based on the above embodiments, this application also provides an electronic device. As shown in Figure 6, the electronic device 300 provided in this application embodiment includes at least: a processor 301, a memory 302, and a computer program stored in the memory 302 and executable on the processor 301. When the processor 301 executes the computer program, it implements the instance segmentation method for intersection traffic monitoring provided in this application embodiment.
[0174] The electronic device 300 provided in this application embodiment may further include a bus 303 connecting different components (including processor 301 and memory 302). The bus 303 represents one or more types of bus structures, including memory bus, peripheral bus, local area bus, etc.
[0175] The memory 302 may include a readable medium in the form of volatile memory, such as random access memory (RAM) 3021 and / or cache memory 3022, and may further include read-only memory (ROM) 3023.
[0176] The memory 302 may also include a program tool 3025 having a set (at least one) of program modules 3024. Such program modules 3024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
[0177] The electronic device 300 can also communicate with one or more external devices 304 (e.g., keyboard, remote control, etc.), with one or more devices that enable a user to interact with the electronic device 300 (e.g., mobile phone, computer, etc.), and/or with any device that enables the electronic device 300 to communicate with one or more other electronic devices 300 (e.g., router, modem, etc.). This communication can be performed through the input/output (I/O) interface 305. Furthermore, the electronic device 300 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 306. As shown in Figure 6, the network adapter 306 communicates with other modules of the electronic device 300 via the bus 303. It should be understood that, although not shown in Figure 6, other hardware and/or software modules may be used in conjunction with the electronic device 300, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) subsystems, tape drives, and data backup storage subsystems.
[0178] It should be noted that the electronic device 300 shown in Figure 6 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.
[0179] This application also provides a computer-readable storage medium storing computer instructions. When executed by a processor, these computer instructions implement the instance segmentation method for intersection traffic monitoring provided in this application. Specifically, the executable program can be built into or installed in the electronic device 300, so that the electronic device 300 can implement the instance segmentation method for intersection traffic monitoring provided in this application by executing the built-in or installed executable program.
[0180] The method provided in this application embodiment can also be implemented as a program product, which includes program code. When the program product runs on the electronic device 300, the program code causes the electronic device 300 to execute the instance segmentation method for intersection traffic monitoring provided in this application embodiment.
[0181] The program product provided in this application embodiment can be any combination of one or more readable media, wherein the readable media can be a readable signal medium or a readable storage medium, and the readable storage medium can be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or apparatus, or any combination thereof. Specifically, more specific examples of readable storage media (a non-exhaustive list) include: electrical connections with one or more wires, portable disks, hard disks, RAM, ROM, erasable programmable read-only memory (EPROM), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0182] The program product provided in this application embodiment can be a CD-ROM and include program code, and can also run on a computing device. However, the program product provided in this application embodiment is not limited thereto. In this application embodiment, the readable storage medium can be any tangible medium that contains or stores a program, which can be used by or in conjunction with an instruction execution system, apparatus, or device.
[0183] It should be noted that although several units or sub-units of the device have been mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of this application, the features and functions of two or more units described above can be embodied in one unit. Conversely, the features and functions of one unit described above can be further divided and embodied by multiple units.
[0184] Furthermore, although the operations of the method of this application are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0185] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent substitutions to the technical solutions of this application do not depart from the spirit and scope of the technical solutions of this application, and should all be covered within the scope of the claims of this application.
Claims
1. An instance segmentation method for intersection traffic monitoring, characterized in that the method includes:
Acquiring the current RGB image frame of any surveillance camera at the intersection, as well as the static background image of the surveillance camera; the static background image is a pure background RGB image that does not contain any instances under the camera position of the surveillance camera;
Processing the current RGB image frame and the static background image using a pre-trained instance segmentation model to obtain the instance segmentation result;
The method further includes:
Acquiring, at preset time intervals, the RGB image sequence of the surveillance camera preceding the current RGB image frame;
Clustering, based on a density clustering algorithm, the pixels at the same location in the RGB image sequence along the time dimension to obtain multiple pixel clusters;
Calculating the average value of all pixel values in the pixel cluster with the largest number of pixels and using it as the pixel value of the corresponding position in the static background image, thereby obtaining the static background image;
The instance segmentation model includes: a twin feature extractor, a foreground-background fusion module, an attention module, an instance segmentation head, and a post-processing module;
Processing the current RGB image frame and the static background image using the pre-trained instance segmentation model to obtain the instance segmentation result includes:
The twin feature extractor is used to downsample the current RGB image frame layer by layer to obtain first feature maps at N scales, and to downsample the static background image layer by layer to obtain second feature maps at N scales;
The foreground-background fusion module is used to process the first feature map and the second feature map of the same scale respectively to generate attention feature maps of the corresponding scale; the attention feature maps of N-1 scales are upsampled to obtain N-1 feature maps of the same size as the attention feature map of the largest scale; the N-1 feature maps are then concatenated with the attention feature map of the largest scale to generate the foreground attention feature map;
The foreground attention feature map is processed using the attention module to obtain a foreground mask;
The first feature maps at N scales are multiplied by the foreground mask to obtain foreground feature maps at N scales;
The instance segmentation head is used to process the foreground feature maps at N scales to obtain M Gaussian heatmaps of instance center points and a position offset map from the center point; each Gaussian heatmap of instance center points corresponds to one category, and M is the number of categories;
The post-processing module is used to process the foreground mask, the M Gaussian heatmaps of instance center points, and the position offset map from the center point to obtain the instance segmentation result.
2. The instance segmentation method for intersection traffic monitoring according to claim 1, characterized in that, The twin feature extractor uses ResNet-50; the first feature maps at N scales include: a first feature map at a first scale, a first feature map at a second scale, and a first feature map at a third scale; the first feature map at the first scale is obtained by downsampling the current RGB image frame by 4 times, the first feature map at the second scale is obtained by downsampling the current RGB image frame by 8 times, and the first feature map at the third scale is obtained by downsampling the current RGB image frame by 16 times; the second feature maps at N scales include: a second feature map at a first scale, a second feature map at a second scale, and a second feature map at a third scale; the second feature map at the first scale is obtained by downsampling the static background image by 4 times, the second feature map at the second scale is obtained by downsampling the static background image by 8 times, and the second feature map at the third scale is obtained by downsampling the static background image by 16 times.
3. The instance segmentation method for intersection traffic monitoring according to claim 2, characterized in that, The foreground / background fusion module includes a first max spatial pooling, a first average spatial pooling, a second max spatial pooling, a second average spatial pooling, a first arithmetic unit, a second arithmetic unit, a first multilayer perceptron, a second multilayer perceptron, an adder, a first multiplier, a second multiplier, a first max channel pooling, a first average channel pooling, a second max channel pooling, a second average channel pooling, a third arithmetic unit, a fourth arithmetic unit, a first convolutional layer, a second convolutional layer, a second adder, a third multiplier, a fourth multiplier, a stitching unit, and a third convolutional layer. The foreground-background fusion module processes the first and second feature maps of the same scale to generate attention feature maps of the corresponding scale; including: The first feature map is processed by first max spatial pooling and first average spatial pooling respectively to obtain a first intermediate feature map and a second intermediate feature map; The second feature map is processed by the second max spatial pooling and the second average spatial pooling respectively to obtain the third intermediate feature map and the fourth intermediate feature map. 
The absolute value of the difference between the first intermediate feature map and the third intermediate feature map is calculated using the first arithmetic unit to obtain the fifth intermediate feature map; The absolute value of the difference between the second intermediate feature map and the fourth intermediate feature map is calculated using the second arithmetic unit to obtain the sixth intermediate feature map; The fifth intermediate feature map is processed using the first multilayer perceptron to obtain the first attention-guided weight feature map; The sixth intermediate feature map is processed using the second multilayer perceptron to obtain the second attention-guided weight feature map; The first weight feature map is obtained by using the first adder to calculate the sum of the first attention-guided weight feature map and the second attention-guided weight feature map; The first multiplier is used to perform element-wise multiplication of the first weight feature map and the first feature map to obtain the first spatial dimension weighted feature map; The second multiplier is used to perform element-wise multiplication of the first weight feature map and the second feature map to obtain the second spatial dimension weighted feature map. The first spatial dimension weighted feature map is processed by the first max channel pooling and the first average channel pooling respectively to obtain the seventh intermediate feature map and the eighth intermediate feature map. The second spatial dimension weighted feature map is processed by the second max channel pooling and the second average channel pooling respectively to obtain the ninth intermediate feature map and the tenth intermediate feature map. The eleventh intermediate feature map is obtained by calculating the absolute value of the difference between the seventh and ninth intermediate feature maps using the third arithmetic unit. 
The absolute value of the difference between the eighth and tenth intermediate feature maps is calculated using the fourth arithmetic unit to obtain the twelfth intermediate feature map. The eleventh intermediate feature map is processed using the first convolutional layer to obtain the third attention-guided weight feature map; The twelfth intermediate feature map is processed using the second convolutional layer to obtain the fourth attention-guided weight feature map; The second weight feature map is obtained by using the second adder to calculate the sum of the third attention-guided weight feature map and the fourth attention-guided weight feature map; The second weighted feature map and the first feature map are multiplied by the third multiplier to obtain the third spatial dimension weighted feature map. The second weighted feature map is multiplied by the second feature map using the fourth multiplier to obtain the fourth spatial dimension weighted feature map. The weighted feature map of the third spatial dimension and the weighted feature map of the fourth spatial dimension are concatenated using the concatenation unit to obtain the weighted feature map of the fifth spatial dimension; The fifth spatial dimension weighted feature map is processed using the third convolutional layer to obtain the attention feature map.
4. The instance segmentation method for intersection traffic monitoring according to claim 3, characterized in that, The attention module includes: two convolutional layers and an argmax function; The foreground mask is obtained by processing the foreground attention feature map using the attention module, including: Two convolutional layers are used to process the foreground attention feature map to obtain a mask with two channels, which represent the probability of the position being foreground and the probability of the position being background, respectively. The argmax function is used to process the mask to obtain the foreground mask.
5. The instance segmentation method for intersection traffic monitoring according to claim 1, characterized in that, The instance segmentation head consists of an atrous spatial pyramid pooling (ASPP) layer, a first sub-convolutional layer, a second sub-convolutional layer, a third sub-convolutional layer, a fourth sub-convolutional layer, a fifth sub-convolutional layer, a sixth sub-convolutional layer, a seventh sub-convolutional layer, an eighth sub-convolutional layer, and a concatenation unit; The foreground feature maps at the N scales include: a foreground feature map at the first scale, a foreground feature map at the second scale, and a foreground feature map at the third scale. The instance segmentation head is used to process the foreground feature maps at N scales to obtain the category, center point, and center point offset map of each instance, including: The foreground feature map at the third scale is processed using the atrous spatial pyramid pooling layer to obtain the first intermediate foreground feature map at the third scale. The first sub-convolutional layer is used to process the foreground feature map at the second scale to obtain the second intermediate foreground feature map at the second scale. The first intermediate foreground feature map at the third scale is upsampled to obtain the third intermediate foreground feature map at the second scale. The second intermediate foreground feature map at the second scale and the third intermediate foreground feature map at the second scale are then concatenated to obtain the fourth intermediate foreground feature map at the second scale. The fourth intermediate foreground feature map at the second scale is processed using the second sub-convolutional layer to obtain the fifth intermediate foreground feature map at the second scale. The foreground feature map at the first scale is processed by the third sub-convolutional layer to obtain the sixth intermediate foreground feature map at the first scale. 
Upsample the fifth intermediate foreground feature map at the second scale to obtain the seventh intermediate foreground feature map at the first scale. Then, concatenate the sixth and seventh intermediate foreground feature maps at the first scale to obtain the eighth intermediate foreground feature map at the first scale. The eighth intermediate foreground feature map at the first scale is processed using the fourth sub-convolutional layer to obtain the ninth intermediate foreground feature map at the first scale. The fifth sub-convolutional layer is used to process the fifth intermediate foreground feature map of the second scale to obtain the tenth intermediate foreground feature map of the second scale. The ninth intermediate foreground feature map at the first scale is downsampled to obtain the eleventh intermediate foreground feature map at the second scale. The tenth intermediate foreground feature map at the second scale and the eleventh intermediate foreground feature map at the second scale are then stitched together to obtain the twelfth intermediate foreground feature map at the second scale. The twelfth intermediate foreground feature map at the second scale is processed using the sixth sub-convolutional layer to obtain the thirteenth intermediate foreground feature map at the second scale. The first intermediate foreground feature map at the third scale is processed using the seventh sub-convolutional layer to obtain the fourteenth intermediate foreground feature map at the third scale. Downsampling the thirteenth intermediate foreground feature map at the second scale yields the fifteenth intermediate foreground feature map at the third scale. The fourteenth and fifteenth intermediate foreground feature maps at the third scale are then stitched together to obtain the sixteenth intermediate foreground feature map at the third scale. 
The sixteenth intermediate foreground feature map at the third scale is processed using the eighth sub-convolutional layer to obtain the seventeenth intermediate foreground feature map at the third scale. Upsample the seventeenth intermediate foreground feature map at the third scale to obtain the seventeenth intermediate foreground feature map at the first scale; Upsample the seventeenth intermediate foreground feature map at the second scale to obtain the seventeenth intermediate foreground feature map at the first scale; Upsample the thirteenth intermediate foreground feature map at the second scale to obtain the eighteenth intermediate foreground feature map at the first scale; The ninth, seventeenth, and eighteenth intermediate foreground feature maps at the first scale are stitched together to obtain the fused foreground feature map at the first scale. The fused foreground feature map at the first scale is processed using a center prediction head to obtain M Gaussian heatmaps of instance center points, wherein each Gaussian heatmap of instance center points includes the predicted center points of all instances of a class; the fused foreground feature map at the first scale is processed using a center offset prediction head to obtain a position offset map of the distance from the center point, wherein the pixel value of each pixel in the position offset map of the distance from the center point is the two-dimensional coordinate offset of the distance from the center point.
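The data flow of claim 5 can be traced with a shape-only NumPy sketch. Everything learned is replaced by a placeholder: `conv` stands in for the ASPP layer, each sub-convolutional layer, and both prediction heads. The factor of 2 between adjacent scales, the channel width `c`, nearest-neighbour upsampling, and strided downsampling are assumptions. The claim wording on how the seventeenth map reaches the first scale is ambiguous, so the sketch upsamples it directly by a factor of 4. All function names are hypothetical.

```python
import numpy as np

def conv(x, c_out):
    """Placeholder for a learned layer: a fixed 1x1 projection to c_out channels."""
    c_in = x.shape[0]
    weight = np.full((c_out, c_in), 1.0 / c_in)
    return np.einsum("oc,chw->ohw", weight, x)

def up(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def down(x, factor=2):
    """Strided downsampling of a (C, H, W) map."""
    return x[:, ::factor, ::factor]

def cat(*maps):
    """Channel-wise concatenation (the concatenation unit)."""
    return np.concatenate(maps, axis=0)

def segmentation_head(f1, f2, f3, num_classes, c=8):
    """f1, f2, f3: foreground feature maps at the first (largest) to third scale."""
    m1 = conv(f3, c)             # ASPP placeholder, third scale
    m2 = conv(f2, c)             # first sub-convolutional layer
    m3 = up(m1)                  # third -> second scale
    m5 = conv(cat(m2, m3), c)    # m4 = cat(...), then second sub-convolutional layer
    m6 = conv(f1, c)             # third sub-convolutional layer
    m7 = up(m5)                  # second -> first scale
    m9 = conv(cat(m6, m7), c)    # m8 = cat(...), then fourth sub-convolutional layer
    m10 = conv(m5, c)            # fifth sub-convolutional layer
    m11 = down(m9)               # first -> second scale
    m13 = conv(cat(m10, m11), c) # m12 = cat(...), then sixth sub-convolutional layer
    m14 = conv(m1, c)            # seventh sub-convolutional layer
    m15 = down(m13)              # second -> third scale
    m17 = conv(cat(m14, m15), c) # m16 = cat(...), then eighth sub-convolutional layer
    m17_s1 = up(m17, 4)          # third -> first scale (assumed direct, factor 4)
    m18 = up(m13)                # second -> first scale
    fused = cat(m9, m17_s1, m18) # fused foreground feature map, first scale
    heatmaps = conv(fused, num_classes)  # center prediction head placeholder
    offsets = conv(fused, 2)             # center offset prediction head placeholder
    return fused, heatmaps, offsets
```

With inputs of spatial size 16, 8, and 4 and three categories, the sketch yields a fused map of shape (24, 16, 16), heatmaps of shape (3, 16, 16), and an offset map of shape (2, 16, 16).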
6. The method according to claim 5, characterized in that processing the foreground mask, the M instance center point Gaussian heatmaps, and the position offset map from the center point with the post-processing module to obtain the instance segmentation result comprises: obtaining, for each pixel, a first center position of the instance to which the pixel belongs by performing a subtraction between the coordinate encoding map and the position offset map from the center point; calculating the distance between the first center position of each pixel and each center point, and taking the center point with the minimum distance as the second center position of that pixel; obtaining a coarse segmentation result from the second center positions of all pixels; and multiplying the foreground mask by the coarse segmentation result to obtain the final instance segmentation result.
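The post-processing of claim 6 can be sketched in NumPy as follows. The sketch assumes that the offset map stores, for each pixel, the pixel coordinate minus the coordinate of its instance center, so that the first center position is `coords - offset`; the claim does not fix this sign convention. It also takes the detected center points as given, since extracting peaks from the Gaussian heatmaps is not detailed in the claim. `group_pixels` is a hypothetical name.

```python
import numpy as np

def group_pixels(offset, centers, fg_mask):
    """Assign each foreground pixel to its nearest predicted center.

    offset:  (2, H, W) per-pixel (dy, dx) offset from the instance center.
    centers: (K, 2) detected center points as (y, x).
    fg_mask: (H, W) foreground mask with values 0 or 1.
    Returns an (H, W) integer map: 0 for background, k + 1 for center k.
    """
    _, h, w = offset.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys, xs]).astype(float)   # coordinate encoding map
    # Assumed sign convention: offset = pixel coordinate - center coordinate.
    first_center = coords - offset              # (2, H, W)
    # Distance from each pixel's first center position to every center point.
    diff = first_center[:, None] - centers.T[:, :, None, None]  # (2, K, H, W)
    dist = np.linalg.norm(diff, axis=0)                          # (K, H, W)
    # The nearest center is the second center position; its index labels the pixel.
    coarse = np.argmin(dist, axis=0) + 1
    # Multiplying by the foreground mask zeroes out background pixels.
    return coarse * fg_mask
```

The multiplication by the foreground mask matters because every pixel, including background, has some nearest center; without the mask the coarse result would label the whole image.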
7. An instance segmentation device for intersection traffic monitoring, characterized by comprising:
an acquisition unit, used to acquire the current RGB image frame of any surveillance camera at the intersection and the static background image of that surveillance camera, the static background image being a pure-background RGB image, taken at the camera position of the surveillance camera, that contains no instances;
an instance segmentation unit, used to process the current RGB image frame and the static background image with a pre-trained instance segmentation model to obtain an instance segmentation result;
and a processing unit, specifically used to: acquire, at preset time intervals, an RGB image sequence of the surveillance camera preceding the current RGB image frame; cluster, using a density clustering algorithm, the pixels at the same position in the RGB image sequence along the time dimension to obtain multiple pixel clusters; and calculate the average of all pixel values in the pixel cluster containing the most pixels and use it as the pixel value at the corresponding position of the static background image, thereby obtaining the static background image;
wherein the instance segmentation model comprises a twin feature extractor, a foreground-background fusion module, an attention module, an instance segmentation head, and a post-processing module;
and the instance segmentation unit is specifically used to:
downsample the current RGB image frame layer by layer with the twin feature extractor to obtain first feature maps at N scales, and downsample the static background image layer by layer to obtain second feature maps at N scales;
process, with the foreground-background fusion module, the first feature map and the second feature map of each scale to generate an attention feature map of the corresponding scale; upsample the attention feature maps of N-1 scales to obtain N-1 feature maps of the same size as the attention feature map of the largest scale; and concatenate the N-1 feature maps with the attention feature map of the largest scale to generate the foreground attention feature map;
process the foreground attention feature map with the attention module to obtain a foreground mask;
multiply the first feature maps at the N scales by the foreground mask to obtain foreground feature maps at N scales;
process the foreground feature maps at the N scales with the instance segmentation head to obtain M instance center point Gaussian heatmaps and a position offset map from the center point, each instance center point Gaussian heatmap corresponding to one category and M being the number of categories;
and process the foreground mask, the M instance center point Gaussian heatmaps, and the position offset map from the center point with the post-processing module to obtain the instance segmentation result.
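The static background estimation performed by the processing unit can be sketched as below. The claim names only "a density clustering algorithm"; this sketch substitutes a simplified stand-in that links samples closer than `eps` in RGB space and takes connected groups as clusters (equivalent to DBSCAN with a minimum cluster size of 1). The value of `eps` and the function names are hypothetical.

```python
import numpy as np

def largest_cluster_mean(samples, eps):
    """samples: (T, 3) RGB values observed at one pixel position over time."""
    t = len(samples)
    dist = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    adjacent = dist <= eps                 # eps-neighbourhood graph
    labels = -np.ones(t, dtype=int)        # -1 means not yet assigned
    current = 0
    for seed in range(t):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:                       # flood-fill one connected cluster
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    biggest = np.bincount(labels).argmax() # cluster with the most samples
    return samples[labels == biggest].mean(axis=0)

def static_background(frames, eps=10.0):
    """frames: (T, H, W, 3) RGB sequence sampled at preset time intervals."""
    frames = np.asarray(frames, dtype=float)
    _, h, w, _ = frames.shape
    background = np.empty((h, w, 3))
    for y in range(h):
        for x in range(w):
            background[y, x] = largest_cluster_mean(frames[:, y, x, :], eps)
    return background
```

The idea is that a given position shows the road surface in most frames and a passing vehicle in only a few, so the largest cluster along the time axis is likely to be the background. The per-pixel Python loop is written for readability rather than speed.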
8. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1-6.