A method for training a target segmentation model, a target segmentation method, and related devices
By combining convolutional neural networks, gated recurrent unit networks, and feature fusion decoding networks, the instability problem of human body segmentation in video under complex backgrounds is solved, achieving more accurate and stable target segmentation and improving user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN SHULIAN KANGJIAN INTELLIGENT TECH CO LTD
- Filing Date
- 2023-09-14
- Publication Date
- 2026-06-23
AI Technical Summary
In video human body segmentation, complex background interference leads to unstable target segmentation, obvious artifacts, and poor user experience.
We employ convolutional neural networks, gated recurrent unit networks, and feature fusion decoding networks. By acquiring feature maps from multiple consecutive frames of images, we perform feature fusion and decoding, and use a loss function to iteratively train the target segmentation model, thereby reducing background interference and artifacts.
It improves the accuracy and stability of target segmentation, reduces boundary jitter, and enhances the user experience.
Figure CN117237756B_ABST
Description
Technical Field
[0001] This invention relates to the field of video content understanding technology, and in particular to a method for training a target segmentation model, a target segmentation method, and related apparatus.
Background Technology
[0002] In the digital age, multimedia digital content such as text, audio, images, and video permeates people's daily lives. With the widespread adoption of mobile devices equipped with cameras and sensors, video has become a new form of communication between internet users, for example in online video conferencing and video calls. This trend has led to the rapid development of a range of video content understanding technologies and related applications, enabling face-to-face conversations between users in different locations via communication devices and networks. Human body segmentation in video is a core technology in this field, and it has been widely developed in the sports and health sector. It requires segmenting the subject in real time and separating it from complex backgrounds, so that motion scenes of the subject can be accurately generated against different backgrounds, enhancing entertainment value. In practical applications, however, the motion scenes of people are complex and subject to interference from complex backgrounds. When the subject area is segmented, parts of the complex background are easily segmented along with it, resulting in obvious artifacts, unstable target segmentation, and a poor user experience.
Summary of the Invention
[0003] This invention provides a method for training a target segmentation model, a target segmentation method, and related apparatus. The obtained target segmentation model can accurately segment targets, reduce interference from complex backgrounds and the generation of artifacts, and reduce the jitter of target segmentation boundaries.
[0004] To address the aforementioned technical problems, in a first aspect, embodiments of the present invention provide a method for training a target segmentation model, wherein the target segmentation model includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network, and the method includes:
[0005] Obtain a training set, which includes multiple consecutive frames of original images, each frame of which is labeled with the true label of the target;
[0006] The original images in the training set are input into the convolutional neural network to obtain feature maps of multiple scales for each frame of the original image;
[0007] The first feature map of the current frame original image and the second feature map of the previous frame original image are input into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map. The first feature map and the second feature map are both feature maps with the smallest scale of the corresponding original images.
[0008] The feature maps at multiple scales of the original image of the current frame and the fused feature map are input into the feature fusion decoding network to obtain the predicted label of the target;
[0009] The loss between the true label and the predicted label is calculated based on the loss function, and the target segmentation model is iteratively trained according to the loss until the target segmentation model converges, thus obtaining the trained target segmentation model.
[0010] In some embodiments, inputting a first feature map of the current frame's original image and a second feature map of the previous frame's original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map includes:
[0011] The first feature map and the second feature map are input into the gated recurrent unit network, and feature fusion is performed on the first feature map and the second feature map according to the first formula to obtain a fused feature map, wherein the first formula is:
[0012] $z_t = \sigma(W_z * [h_{t-1}, x_t])$
[0013] $r_t = \sigma(W_r * [h_{t-1}, x_t])$
[0014]
[0015]
[0016] Here, $h_{t-1}$ represents the second feature map, $x_t$ represents the first feature map, $\sigma$ represents the activation function, $z_t$ represents the update gate of the fused feature map, $r_t$ represents the reset gate of the fused feature map, $W_z$, $W_r$, and a third matrix used for the candidate hidden state are the weight matrices, $\tilde{h}_t$ represents the candidate hidden state of the fused feature map, and $h_t$ represents the fused feature map.
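As an illustration of the gated fusion described above, the following is a minimal pure-Python sketch, not the patented implementation. It treats each smallest-scale feature map as a flat vector. The update and reset gates follow the first formula; the expressions for the candidate hidden state and the final blend are not legible in this text, so the sketch assumes the standard GRU form (a tanh candidate and a blend weighted by the update gate). The names `gru_fuse`, `matvec`, `W_c` and all toy values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_fuse(h_prev, x_t, W_z, W_r, W_c):
    """Fuse the previous frame's feature vector h_prev with the current one x_t.

    W_z, W_r: gate weights applied to the concatenation [h_prev, x_t].
    W_c: candidate-state weights (assumed; standard GRU form).
    """
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(W_z, concat)]          # update gate
    r = [sigmoid(v) for v in matvec(W_r, concat)]          # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    cand = [math.tanh(v) for v in matvec(W_c, gated)]      # candidate state (assumed)
    # Assumed blend: keep (1 - z) of the old state, take z of the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

# Toy example: 2-element feature vectors and all-zero weights.
zeros = [[0.0] * 4 for _ in range(2)]
fused = gru_fuse([1.0, -1.0], [0.5, 0.5], zeros, zeros, zeros)
print(fused)  # [0.5, -0.5]
```

With all-zero weights both gates equal 0.5 and the candidate is 0, so the fused vector is simply half of the previous frame's features. Trained weights would instead decide, per element, how much of the previous frame to carry forward.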
[0017] In some embodiments, the feature fusion decoding network includes multiple cascaded decoding layers, and the step of inputting feature maps of multiple scales of the original image of the current frame and the fused feature map into the feature fusion decoding network to obtain the predicted label of the target includes:
[0018] The fused feature map is input into the first-level decoding layer for upsampling to obtain the first-level output feature map of the first-level decoding layer. The first-level output feature map is connected to the first target feature map to obtain the first-level connection feature map. The first target feature map has the same scale as the first-level output feature map, and the first target feature map is the feature map with the smallest scale of the original image of the current frame.
[0019] The first-level connection feature map is input into the next-level decoding layer for upsampling to obtain the next-level output feature map of the next-level decoding layer. The next-level output feature map is connected with the next-target feature map to obtain the next-level connection feature map. The next-target feature map is a feature map with the same scale as the next-level output feature map in the feature map of the original image of the current frame.
[0020] The remaining decoding layers repeat the same process in turn: each next-level decoding layer upsamples the connection feature map of the preceding level to obtain its output feature map, and that output feature map is connected with the feature map of the same scale from the original image of the current frame to obtain the next-level connection feature map. This continues until an output feature map with the same scale as the original image of the current frame is obtained, which gives the predicted label of the target.
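The cascaded decoding described above can be sketched as follows. This is a shape-level illustration under stated assumptions, not the patented network: upsampling is modelled as a nearest-neighbour resize to the size of the matching encoder map, "connection" is modelled as channel-wise concatenation, and the learned convolutions that a real decoding layer would apply are omitted. The function names `resize_nearest`, `decode`, `shape` and `blank` are hypothetical.

```python
def resize_nearest(channel, out_h, out_w):
    """Nearest-neighbour resize of one 2-D channel (a list of rows)."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)] for i in range(out_h)]

def decode(fused, encoder_maps):
    """Cascade: upsample, then connect with the same-scale encoder map.

    fused: list of channels (each a 2-D list) from the recurrent fusion step.
    encoder_maps: current-frame feature maps, ordered smallest scale first.
    """
    current = fused
    for target in encoder_maps:
        h, w = len(target[0]), len(target[0][0])
        upsampled = [resize_nearest(c, h, w) for c in current]  # level output
        current = upsampled + target                            # channel-wise connection
    return current

def shape(fmap):
    """Return (channels, height, width) of a feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

def blank(channels, size):
    """Build an all-zero feature map of the given channel count and side length."""
    return [[[0.0] * size for _ in range(size)] for _ in range(channels)]

fused = blank(4, 2)                                     # 4 channels, 2x2
encoder_maps = [blank(4, 2), blank(2, 4), blank(1, 8)]  # three scales, smallest first
out = decode(fused, encoder_maps)
print(shape(out))  # (11, 8, 8)
```

The channel count grows at every level (4+4, then +2, then +1) because nothing here reduces it; a practical decoder would typically follow each connection with a convolution and finish with a single-channel output that serves as the predicted label.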
[0021] In some embodiments, the target segmentation model further includes an attention network, and the method further includes:
[0022] Feature maps of multiple scales of the original image of the current frame are input into the attention network to obtain the attention score of the feature map at each scale.
[0023] In some embodiments, inputting feature maps at multiple scales of the original image of the current frame into the attention network to obtain attention scores for each scale of feature map includes:
[0024] The feature maps of the original image of the current frame at multiple scales are input into the attention network;
[0025] The attention score is calculated according to the second formula, wherein the second formula is:
[0026] $s_n = W^T * f_n + b$
[0027] $\alpha = \mathrm{Softmax}(s)$
[0028] Here, $W^T$ represents the weight matrix, $b$ represents the bias parameter, $f_n$ represents the feature map of the original image of the current frame, $n$ represents the number of feature maps of the original image of the current frame, $\alpha$ represents the attention score of the feature map of the original image of the current frame, with a value range of [0,1], Softmax() is the normalization function, and $s = s_n$.
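A small worked example may make the second formula concrete. The sketch below assumes that each scale's feature map has already been reduced to a fixed-length descriptor vector (for instance by pooling), since the text does not say how maps of different sizes share one weight vector. The function name `attention_scores` and the toy numbers are hypothetical.

```python
import math

def attention_scores(descriptors, w, b):
    """Return one attention score per scale.

    descriptors: one fixed-length vector f_n per scale (assumed to be pooled).
    w, b: weight vector and bias of the scoring layer.
    """
    # s_n = w^T f_n + b for every scale n
    s = [sum(wi * fi for wi, fi in zip(w, f)) + b for f in descriptors]
    m = max(s)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]        # alpha = Softmax(s)

alphas = attention_scores([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w=[0.5, 0.5], b=0.1)
print(alphas)
```

The raw scores here are 0.6, 0.6 and 1.1, so the third scale receives the largest attention score; all scores lie in [0, 1] and sum to 1, as the Softmax normalization requires.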
[0029] In some embodiments, the loss function is:
[0030]
[0031] Here, $p$ represents the true label of the target, and the loss is computed between it and the predicted label of the target; $\alpha_i$ represents the attention score of the i-th feature map of the current frame's original image, and $n$ represents the number of feature maps of the current frame's original image.
[0032] To address the aforementioned technical problems, in a second aspect, embodiments of the present invention provide a target segmentation method, comprising:
[0033] Obtain the image to be processed;
[0034] The image to be processed is input into the target segmentation model to obtain the predicted label of the target in the image to be processed, wherein the target segmentation model is trained using the method described in any of the above.
[0035] The target image is segmented from the image to be processed based on the predicted label of the target.
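The final segmentation step above can be illustrated with a short sketch. It assumes the predicted label is a per-pixel foreground probability with the same size as the image, which the text does not state explicitly; the function name `segment`, the threshold of 0.5 and the background value are hypothetical choices.

```python
def segment(image, predicted_label, threshold=0.5, background=0):
    """Keep pixels whose predicted label exceeds the threshold; blank out the rest."""
    return [[pix if p > threshold else background
             for pix, p in zip(img_row, lab_row)]
            for img_row, lab_row in zip(image, predicted_label)]

image = [[10, 20], [30, 40]]
label = [[0.9, 0.2], [0.4, 0.8]]
print(segment(image, label))  # [[10, 0], [0, 40]]
```

Replacing `background` with pixels from another image would composite the segmented subject onto a new background, which is the use case mentioned in the background section.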
[0036] To address the aforementioned technical problems, in a third aspect, embodiments of the present invention provide an electronic device, comprising:
[0037] A processor and a memory communicatively connected to the processor;
[0038] The memory stores computer program instructions executable by the processor. When invoked by the processor, the computer program instructions cause the processor to execute the method for training the target segmentation model described above or the target segmentation method described above.
[0039] To address the aforementioned technical problems, in a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer program instructions, wherein a processor executes the computer program instructions to perform the method for training a target segmentation model as described in any of the preceding claims or the target segmentation method as described in any of the preceding claims.
[0040] The beneficial effects of this invention's embodiments are as follows: Unlike existing technologies, in the method for training a target segmentation model provided in this invention, the target segmentation model includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network. The method includes: acquiring a training set comprising multiple consecutive frames of original images, each frame labeled with a true target label; inputting the original images from the training set into the convolutional neural network to obtain feature maps at multiple scales for each frame; inputting the first feature map of the current frame and the second feature map of the previous frame into the gated recurrent unit network to fuse the first and second feature maps, obtaining a fused feature map, where the first and second feature maps are both the smallest-scale feature maps of their respective original images; inputting the feature maps at multiple scales of the current frame and the fused feature map into the feature fusion decoding network to obtain a predicted target label; calculating the loss between the true and predicted labels based on a loss function, and iteratively training the target segmentation model according to the loss until the target segmentation model converges, thus obtaining the trained target segmentation model.
[0041] In this embodiment of the invention, when training the target segmentation model, feature maps at multiple scales of the previous frame's original image are acquired. The smallest feature map from the previous frame's original image is then fused with the smallest feature map from the current frame's original image to obtain a fused feature map. This fused feature map is then used to train the target segmentation model. Consequently, the target segmentation model, during training, places greater emphasis on the correlation between the current and previous frames' original images, strengthening the connection between the target subject in consecutive frames. The resulting target segmentation model can accurately segment the target, reducing interference from complex backgrounds and artifacts, lowering the jitter of the target segmentation boundary, improving the accuracy and stability of target segmentation, and resulting in more accurate and reliable segmentation results, thus enhancing the user experience.
Attached Figure Description
[0042] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below only show some embodiments of the present invention, and therefore should not be regarded as a limitation on the scope of protection. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0043] Figure 1 is a schematic diagram illustrating an application scenario of the method for training a target segmentation model provided in some embodiments of the present invention;
[0044] Figure 2 is a schematic diagram of the structure of an electronic device provided in some embodiments of the present invention;
[0045] Figure 3 is a schematic diagram of the overall network structure of the target segmentation model provided in some embodiments of the present invention;
[0046] Figure 4 is a flowchart illustrating a method for training a target segmentation model according to some embodiments of the present invention;
[0047] Figure 5 is a schematic diagram of the structure of the gated recurrent unit network of the target segmentation model provided in some embodiments of the present invention;
[0048] Figure 6 is a schematic diagram of a sub-process of step S400 in the method for training the target segmentation model shown in Figure 4;
[0049] Figure 7 is a flowchart illustrating a method for training a target segmentation model according to other embodiments of the present invention;
[0050] Figure 8 is a flowchart illustrating the target segmentation method provided in some embodiments of the present invention.
Detailed Implementation
[0051] To make the objectives and advantages of the embodiments of the present invention more readily understood, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. The detailed description of the embodiments of the present invention in the accompanying drawings is not intended to limit the scope of protection claimed by the present invention, but merely to illustrate selected embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0052] It should be noted that, unless otherwise specified, the various features in the embodiments of this invention can be combined with each other, all within the scope of protection of this application. Furthermore, although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than the module division in the device or the order in the flowchart. In addition, the terms "first," "second," and "third" used herein do not limit the data or execution order, but only distinguish identical or similar items with substantially the same function and effect.
[0053] Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. The term "and / or" as used in this specification includes any and all combinations of one or more of the associated listed items. Furthermore, the technical features involved in the various embodiments of the invention described below can be combined with each other as long as they do not conflict with each other.
[0054] To facilitate understanding of the methods provided in the embodiments of the present invention, the terms used in the embodiments of the present invention will first be introduced:
[0055] (1) Neural Network
[0056] A neural network can be composed of neural units, and can be understood as a network with an input layer, hidden layers, and an output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. Neural networks with many hidden layers are called deep neural networks (DNNs). The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). From a physical perspective, the work of each layer can be understood as transforming the input space (the set of input vectors) into the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space: 1. Dimensionality increase / decrease; 2. Magnification / scaling; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are performed by "W·x", operation 4 by "+b", and operation 5 by "a()". The term "space" is used here because the objects being classified are not individual things, but a class of things; space refers to the set of all individuals within that class. W is the weight matrix of each layer in the neural network, where each value represents the weight of a neuron in that layer. This matrix W determines the spatial transformation from the input space to the output space described above; that is, the W of each layer of the neural network controls how the space is transformed. The purpose of training the neural network is to ultimately obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, or more specifically, learning the weight matrices.
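To make the expression y = a(W·x + b) concrete, here is a minimal sketch of a single layer in plain Python. The choice of ReLU as the activation a() and the specific numbers are assumptions made for illustration only; the function names `relu` and `layer` are hypothetical.

```python
def relu(v):
    """Element-wise ReLU: the non-linear 'bending' step a()."""
    return [max(0.0, x) for x in v]

def layer(W, b, x, activation=relu):
    """Compute y = a(W·x + b) for one layer."""
    linear = [sum(w * xi for w, xi in zip(row, x)) + bi   # W·x, then + b
              for row, bi in zip(W, b)]
    return activation(linear)

W = [[2.0, 0.0], [0.0, -1.0], [1.0, 1.0]]   # 3x2 matrix: maps a 2-D input to 3-D
b = [0.0, 0.5, -4.0]                        # translation
y = layer(W, b, [1.0, 2.0])
print(y)  # [2.0, 0.0, 0.0]
```

Here W changes the dimensionality and scales the input, b shifts it, and ReLU zeroes the two negative components (-1.5 and -1.0), which is the non-linear step.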
[0057] It is important to note that in the embodiments of this invention, the models used for machine learning tasks are essentially neural networks. Common components in neural networks include convolutional layers, pooling layers, and normalization layers. A model is designed by assembling these common components. When the model parameters (the weight matrices of each layer) have been determined such that the model error meets a preset condition, or the number of times the model parameters have been adjusted reaches a preset threshold, the model is considered to have converged.
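The two convergence conditions can be illustrated with a toy training loop. The sketch below fits a single weight by gradient descent and stops either when the error falls below a preset tolerance or when a preset number of parameter updates is reached; it reads the second condition as a cap on the number of updates, which is an interpretation. The model, the learning rate and the name `train` are hypothetical and unrelated to the segmentation model itself.

```python
def train(xs, ys, lr=0.1, tol=1e-6, max_steps=1000):
    """Fit y = w * x by gradient descent; stop on either convergence rule."""
    w = 0.0
    for step in range(1, max_steps + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if loss < tol:                  # preset error condition met
            return w, step, "error"
    return w, max_steps, "steps"        # update budget exhausted

w, steps, reason = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w, steps, reason)
```

On this toy data the error condition is reached after a handful of updates, with w close to 2; calling `train` with `max_steps=1` instead returns through the second rule.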
[0058] (2) Convolution
[0059] Convolution is a mathematical operation widely used in signal processing, image processing, and machine learning. Its application in image processing is particularly common. When convolving an image, a small filter or kernel function is applied to each pixel, generating new pixels by weighted summation of neighboring pixels. This process can be used to perform many image processing tasks, such as edge detection, image enhancement, and blurring.
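The sliding-window weighted sum described above can be sketched in a few lines of plain Python. This is an illustrative example rather than anything taken from the patent: it uses "valid" borders (no padding) and does not flip the kernel, which is the convention used in convolutional neural networks. The function name `convolve2d` and the toy image are hypothetical.

```python
def convolve2d(image, kernel):
    """Valid-mode sliding-window weighted sum (no kernel flip, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
edge_kernel = [[-1, 1]]        # responds to left-to-right intensity jumps
print(convolve2d(image, edge_kernel))  # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]
```

The output is non-zero only in the column where the image jumps from 0 to 9, which is a very simple form of edge detection.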
[0060] In machine learning, a Convolutional Neural Network (CNN) is a deep learning model based on convolutional operations. CNNs are widely used in tasks such as image recognition, object detection, and speech processing. Their main advantage lies in their ability to automatically learn and extract features from input data. Convolutional layers are the core component of CNNs. By sliding convolutional kernels across the input data to perform convolution operations, they can effectively capture the local patterns and structural information of the input data.
[0061] (3) Convolutional layer
[0062] A convolutional layer is the core component of a convolutional neural network (CNN) used for feature extraction and convolution operations on input data. A convolutional layer consists of a set of convolutional kernels, each of which can be viewed as a feature detector used to detect a specific feature in the input data. The convolutional layer generates an output feature map by sliding the kernels across the input data, performing convolution operations at different locations. By utilizing convolution operations to extract and map features from the input data, the convolutional layer captures spatially local features and reduces the number of parameters through weight sharing, thereby achieving effective feature learning and representation.
[0063] Specifically, the input to a convolutional layer is a multi-channel feature map (such as an image or the output of the previous layer), with each channel corresponding to a different feature. Convolutional layers support multi-channel input data and multi-channel convolutional kernels, enabling the extraction and integration of various feature information. The convolutional kernel is element-wise multiplied and summed with the input data to obtain a single pixel value on the output feature map. By sliding the convolutional kernel across the input data and performing convolution operations at each location, an output feature map corresponding to the size of the input data can be generated. Each location in the output feature map corresponds to a local region in the input data; through convolution operations, the convolutional layer can extract local patterns and features from the input data.
[0064] Convolutional layers play a crucial role in deep learning. Through multiple convolutional kernels at different locations and scales, they extract features from input data, gradually building a high-level abstract representation and learning the features. Convolutional layers are often combined with other types of neural network layers (such as pooling layers, activation function layers, and fully connected layers) to form a complete convolutional neural network. This network is used to solve computer vision tasks such as image segmentation, image classification, and object detection, and is also widely applied to deep learning tasks in natural language processing and other fields.
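The multi-channel behaviour of a convolutional layer described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (no padding, stride 1, no bias, no activation); the function name `conv_layer` and the toy values are hypothetical.

```python
def conv_layer(inputs, kernels):
    """Apply K kernels to a C-channel input.

    inputs: C channels, each an HxW list of rows.
    kernels: K kernels, each of shape C x kh x kw.
    Returns K output maps, one per kernel, each summed over the input channels.
    """
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out_h = len(inputs[0]) - kh + 1
    out_w = len(inputs[0][0]) - kw + 1
    return [[[sum(inputs[c][i + di][j + dj] * k[c][di][dj]
                  for c in range(len(inputs))
                  for di in range(kh) for dj in range(kw))
              for j in range(out_w)] for i in range(out_h)]
            for k in kernels]

inputs = [[[1, 2], [3, 4]],          # channel 0
          [[10, 20], [30, 40]]]      # channel 1
kernels = [[[[1]], [[0]]],           # kernel 0 passes channel 0 through
           [[[1]], [[1]]]]           # kernel 1 adds both channels
maps = conv_layer(inputs, kernels)
print(maps)  # [[[1, 2], [3, 4]], [[11, 22], [33, 44]]]
```

Each kernel is reused at every spatial position, so the layer has K × C × kh × kw weights (four in this example) regardless of the image size; this is the weight sharing referred to above.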
[0065] (4) Convolution kernel
[0066] A convolutional kernel, also known as a filter or feature detector, is the core component of a convolutional neural network. In image processing and computer vision tasks, a convolutional kernel is a small matrix or tensor used to perform convolution operations on an image. A convolutional kernel is typically a square matrix, the size of which can be defined according to the task requirements; common sizes include 1x1, 3x3, 5x5, and 7x7. The convolutional kernel contains a set of weight parameters that are used to perform a weighted summation with the input data during the convolution operation. In the convolution operation, the convolutional kernel slides across the input data, performing element-wise multiplication with the corresponding region of the input data at each location and summing the results to generate a single pixel in the output. The number of convolutional kernels represents the number of kernels used in each convolutional layer. Multiple convolutional kernels can extract different features, and the number of kernels can be selected based on the specific task and data characteristics to achieve optimal model performance and results.
[0067] Convolutional kernels play a crucial role in Convolutional Neural Networks (CNNs). By designing different kernels, the network can learn various features, such as edges, textures, and corners. Each kernel can be viewed as a feature detector, sensitive to a specific feature of the input data and calculating its presence at different locations using a sliding window approach. In deep learning tasks, the parameters of the convolutional kernels can be learned automatically during training or pre-set empirically. Through backpropagation, the neural network can automatically adjust the weight parameters in the convolutional kernels based on the feedback signal from the loss function, enabling the network to better adapt to the task's requirements and learn higher-level feature representations.
[0068] The following describes exemplary applications of the electronic devices provided in the embodiments of the present invention for training target segmentation models or for target segmentation. The electronic devices provided in the embodiments of the present invention can be various suitable types of devices with certain computing and control capabilities, such as laptops, desktop computers, or mobile devices. For example, refer to Figure 1, which is a schematic diagram illustrating an application scenario of the method for training a target segmentation model provided in some embodiments of the present invention.
[0069] Specifically, when the electronic device 100 is used to train a target segmentation model, it can be used to acquire training image data and construct the target segmentation model. For example, those skilled in the art can download prepared training image data and build the network structure of the target segmentation model on the electronic device 100, and train the target segmentation model once the training image data has been acquired. The training image data includes multiple consecutive frames of original images. It is understood that the electronic device 100 can also be used to acquire image data to be processed. For example, those skilled in the art can package the image data to be processed and send it to the electronic device 100 via a communication network, so that the electronic device 100 acquires the image data to be processed. In some embodiments, when the electronic device 100 is used for human target segmentation, after acquiring the training image data or the image data to be processed, the electronic device 100 sends it to the controller in the electronic device 100 (not shown in Figure 1). The controller then uses the built-in target segmentation model to segment human targets in the training image data or the image data to be processed, and obtains the human target segmentation results.
[0070] In some embodiments, the electronic device 100 can locally execute the method for training a target segmentation model provided in this embodiment of the invention: it trains the designed target segmentation model using the training image data, determines the final model parameters, and then configures the target segmentation model with the final model parameters to obtain the trained target segmentation model. In other embodiments, the electronic device 100 can connect to a server via a communication network and send to the server the training image data and the constructed target segmentation model that those skilled in the art have stored on the electronic device 100. The server receives the training image data and the target segmentation model, iteratively trains the target segmentation model using the training image data, determines the final model parameters, and then sends the final model parameters to the electronic device 100. The electronic device 100 receives and saves the final model parameters, and configures the target segmentation model with them to obtain the trained target segmentation model. It is readily understood that the aforementioned communication network can be a wide area network (WAN), a local area network (LAN), or a combination of both.
[0071] The structure of the electronic device in an embodiment of the present invention is described below. Refer to Figure 2, which is a schematic diagram of the structure of an electronic device 100 provided in some embodiments of the present invention. The electronic device 100 includes at least one processor 110 and a memory 120 that are communicatively connected (Figure 2 takes a bus system connection and one processor as an example). The various components in the electronic device 100 are coupled together through a bus system 130, which is used to realize the connection and communication between these components. It is easy to understand that the bus system 130 includes not only a data bus, but also a power bus, a control bus, and a status signal bus. However, for the sake of clarity and brevity, the various buses are all labeled as bus system 130 in Figure 2. Those skilled in the art will understand that the structure shown in Figure 2 is merely illustrative and does not limit the structure of the electronic device 100. For example, the electronic device 100 may include more or fewer components than shown in Figure 2, or have a configuration different from that shown in Figure 2.
[0072] The processor 110 provides computational and control capabilities to control the electronic device 100 to perform corresponding tasks, such as controlling the electronic device 100 to execute any of the aforementioned methods for training a target segmentation model or any of the aforementioned target segmentation methods. It is understood that the processor 110 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0073] Memory 120, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the method for training a target segmentation model or the program instructions / modules corresponding to the target segmentation method in the embodiments of the present invention. Processor 110 can implement any of the methods for training a target segmentation model or any of the target segmentation methods in the embodiments of the present invention by running the non-transitory software programs, instructions, and modules stored in memory 120. Memory 120 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 120 may also include memory remotely located relative to the processor, and these remote memories can be connected to processor 110 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0074] Refer to Figure 3, which is a schematic diagram of the overall network structure of the target segmentation model provided in some embodiments of the present invention. Specifically, the target segmentation model includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network. After training images are acquired, they are input into the convolutional neural network to obtain feature maps of multiple scales for each frame of the training image. As shown in Figure 3, this example uses feature maps at five scales; understandably, feature maps at other scales can also be obtained. The smallest feature map from the previous training frame and the smallest feature map from the current training frame are then input into a gated recurrent unit (GRU) network to obtain a fused feature map. Subsequently, the fused feature map and the feature maps at various scales from the current training frame are input into the feature fusion decoding network. Multiple decoding layers in the feature fusion decoding network fuse the fused feature map and the feature maps at various scales from the current training frame layer by layer, and the result is decoded to obtain the predicted label corresponding to the target in the current training frame, outputting the target segmentation image.
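The frame-to-frame data flow described above can be sketched as a short driver loop. This is a structural illustration only: `encoder`, `fuse` and `decoder` stand in for the convolutional neural network, the gated recurrent unit network and the feature fusion decoding network, and are passed in as plain callables. The function name `run_sequence`, the ordering of scales (smallest last) and the handling of the first frame, which has no previous frame, are assumptions not specified in the text.

```python
def run_sequence(frames, encoder, fuse, decoder):
    """Process consecutive frames, carrying the previous smallest-scale map forward."""
    prev_smallest = None
    outputs = []
    for frame in frames:
        maps = encoder(frame)            # multi-scale feature maps, smallest last
        smallest = maps[-1]
        if prev_smallest is None:
            fused = smallest             # assumed: no fusion for the first frame
        else:
            fused = fuse(smallest, prev_smallest)
        outputs.append(decoder(maps, fused))
        prev_smallest = smallest         # encoder output of this frame, not the fused map
    return outputs

# Toy stand-ins operating on numbers instead of images.
toy_encoder = lambda f: [4 * f, 2 * f, f]
toy_fuse = lambda cur, prev: (cur + prev) / 2
toy_decoder = lambda maps, fused: sum(maps) + fused
print(run_sequence([1.0, 3.0], toy_encoder, toy_fuse, toy_decoder))  # [8.0, 23.0]
```

The point of the sketch is the carried state: each frame's prediction depends on the previous frame's smallest-scale feature map, which is what links consecutive frames during both training and inference.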
[0075] In some embodiments, the target segmentation model further includes an attention network. After acquiring training images and inputting them into a convolutional neural network to obtain feature maps at multiple scales for each frame of the training image, the feature maps at each scale of the current frame of the training image are input into the attention network (i.e., an MLP network). The attention score for each scale of the feature maps is calculated using the Softmax function. Then, the attention scores of the feature maps at each scale of the current frame of the training image are substituted into a loss function to calculate the loss function value between the true label and the predicted label of the target. The target segmentation model is then iteratively trained based on the loss function value until the target segmentation model converges, thus obtaining the trained target segmentation model.
[0076] As can be understood from the above, the method for training a target segmentation model or the target segmentation method provided in the embodiments of the present invention can be implemented by various suitable types of electronic devices with certain computing and control capabilities, such as the aforementioned electronic devices, or by other devices with computing and control capabilities that are communicatively connected to the electronic devices, such as servers, smart terminals, etc. The following description, in conjunction with exemplary applications and implementations of the electronic devices provided in the embodiments of the present invention, illustrates the method for training a target segmentation model or the target segmentation method provided in the embodiments of the present invention.
[0077] Referring to Figure 4, Figure 4 is a flowchart of a method for training a target segmentation model according to some embodiments of the present invention. Those skilled in the art will understand that the executing entity of this method for training a target segmentation model can be the aforementioned electronic device, and the method for training a target segmentation model includes, but is not limited to, the following steps S100-S500:
[0078] S100: Obtain a training set, which includes multiple consecutive frames of original images, each frame of which is labeled with the target's true label.
[0079] In practical applications, users or trainers can collect multiple original images from various data sources and combine them into a dataset for training the target segmentation model. These original images are multiple consecutive frames from the same video segment. It is understood that this method for training the target segmentation model is applicable to segmenting targets in any original image, and achieves better segmentation results when the target is a moving object. Each frame of the original image includes one or more targets. The targets in the original image can be stationary or moving, and can be animals, human bodies, or other targets. The following specific implementation describes the method for training the target segmentation model provided in this embodiment of the invention using a human body as the target. It is readily understood that the specific execution content and implementation process for segmenting other targets in the original image can refer to the specific implementation for human body target segmentation.
[0080] After the original image data for training the target segmentation model is obtained, the pixels in the original images are annotated with ground truth labels based on the human targets in the original images. In some embodiments, the labels are divided into two categories: human and background. By convention, 0 represents the background category and 1 represents the human target category, so pixels belonging to the human target category are labeled 1 and pixels belonging to the background category are labeled 0. Annotation assigns each pixel a specific category that identifies the semantic category or target type to which it belongs, ensuring that every pixel of every original image in the training set has a corresponding ground truth label. During acquisition of the training set, the original images can be annotated manually by professionals or automatically using labeling techniques. The labels annotated on the original images of the training set are all ground truth labels for the targets in those images. When the original images are input into the target segmentation model, the segmentation result obtained is the predicted label output after segmenting the targets in the original images.
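The labeling convention described above can be illustrated with a minimal sketch. The helper `make_mask` and the toy 4x6 frame are hypothetical and not part of the described method; a rectangle is used only as a stand-in for a real human silhouette.

```python
BACKGROUND, HUMAN = 0, 1  # label convention from the description

def make_mask(height, width, box):
    """Return a per-pixel label grid: HUMAN inside `box`, BACKGROUND elsewhere.

    `box` is (top, left, bottom, right), with bottom and right exclusive.
    A rectangle is only a stand-in for a real human outline.
    """
    top, left, bottom, right = box
    return [
        [HUMAN if top <= r < bottom and left <= c < right else BACKGROUND
         for c in range(width)]
        for r in range(height)
    ]

# Toy 4x6 frame in which the "person" occupies rows 1-3, columns 2-3.
mask = make_mask(4, 6, box=(1, 2, 4, 4))
for row in mask:
    print(row)
```

Every pixel receives exactly one of the two labels, which is the property the training set is required to have.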
[0081] S200: Input the original images in the training set into the convolutional neural network to obtain feature maps at multiple scales for each frame of the original image.
[0082] Target segmentation models are computer models used for semantic segmentation of images or videos. The goal of target segmentation models is to label each pixel in an image with its corresponding target or semantic category, thereby achieving fine-grained classification at the pixel level. For example, in this embodiment of the invention, the target segmentation model is used to classify each pixel in an input image as belonging to either the background category or the human target category. In some embodiments, the target segmentation model consists of a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and a feature fusion decoding network. The CNN has excellent feature extraction capabilities, used to obtain feature maps at multiple scales of the original image. The GRU has excellent sequence modeling capabilities, used to fuse features from consecutive frames to obtain a fused feature map. The feature fusion decoding network has excellent upsampling and semantic fusion capabilities, used to decode and obtain human features to output the target segmentation result of the original image. The synergistic work of these network structures enables the target segmentation model to understand images at the pixel level and achieve accurate target segmentation, playing a crucial role in many practical application areas.
[0083] Specifically, a convolutional neural network (CNN) is a deep learning network architecture suited to image processing tasks. It consists of multiple convolutional and pooling layers. Through convolution and pooling operations, it can effectively extract local and global features from the original image. In target segmentation tasks, the CNN acts as a feature extractor, learning feature information at different levels from the original image to capture its important features.
[0084] Gated Recurrent Unit (GRU) networks are a variant of Recurrent Neural Networks (RNNs). In traditional RNNs, feature information is gradually passed through time steps. However, during backpropagation, due to long-term dependencies, gradients may decay or grow exponentially, making it difficult for the neural network to learn long-term dependencies. Therefore, GRU networks address the gradient vanishing and exploding problems of traditional RNNs by introducing two gating mechanisms: the Reset Gate and the Update Gate, thus better capturing long-range dependencies in time-series image data. In object segmentation tasks, GRU networks can be used to process spatially correlated image data. Since the shape and structure of an object may vary depending on its location, spatial dependencies need to be considered to establish connections between contextual feature information at different locations in the image, fusing features from time-series image data to obtain a fused feature map.
[0085] Feature fusion decoding network is a neural network architecture commonly used in computer vision for image segmentation tasks. It is used to reconstruct the original image from high-level features of networks such as convolutional neural networks (CNNs) and gated recurrent unit networks (GRUs). It gradually restores low-resolution, high-semantic-information features to target segmentation results with the same resolution (same scale) as the original image, outputs pixel-level segmentation masks, classifies each pixel as background or human target, and achieves pixel-level prediction, that is, segmenting human targets in the original image.
[0086] After acquiring the original images, which are typically unprocessed or unmodified real images collected from the data source, preprocessing and cleaning are necessary. This includes image resizing, data augmentation, pixel value normalization, and standardization to ensure the original images meet the input requirements of the convolutional neural network (CNN). The processed original images are then input into the CNN, enabling it to learn abstract feature representations from the images and output feature maps at multiple scales for each frame of the original image. To extract features from each frame of the original image at different scales, the CNN can use multiple convolutional and pooling kernels. Through multiple convolution, pooling, and activation operations, it captures the details and overall information of the image at different levels. That is, it gradually abstracts the features of each frame of the original image into higher-level representations, extracting feature maps at multiple scales for each frame. Each scale feature map represents a different level of image features. In some embodiments, to achieve the best feature extraction effect, five feature maps at different scales are extracted from each frame of the original image. Those skilled in the art will understand that the embodiments of the present invention do not impose any limitations on the number of feature maps extracted from each frame of the original image, nor on the scale of the feature maps extracted from each frame of the original image, and can be selected, adjusted, and transformed according to actual needs.
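As a rough illustration of how one frame yields feature maps at five scales, the sketch below repeatedly halves a toy single-channel frame. The 2x2 average pooling is an assumption standing in for the learned convolution and pooling stages of a real CNN; the function names and the halving factor are hypothetical.

```python
def avg_pool2x2(fmap):
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

def feature_pyramid(image, num_scales=5):
    """Return `num_scales` maps, largest first, each half the size of the previous."""
    maps = []
    current = image
    for _ in range(num_scales):
        current = avg_pool2x2(current)
        maps.append(current)
    return maps

# Toy single-channel 64x64 "frame" holding a simple intensity ramp.
frame = [[float(r + c) for c in range(64)] for r in range(64)]
pyramid = feature_pyramid(frame)
print([(len(m), len(m[0])) for m in pyramid])
```

The last entry of `pyramid` plays the role of the smallest-scale feature map that is later passed to the GRU network.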
[0087] S300: Input the first feature map of the current frame original image and the second feature map of the previous frame original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map, wherein the first feature map and the second feature map are both feature maps with the smallest scale of the corresponding original images.
[0088] Specifically, the Gated Recurrent Unit (GRU) network is a variant of the Recurrent Neural Network (RNN) used for information fusion, with a capacity for memory and information updating. The GRU network introduces two gating mechanisms, a reset gate and an update gate, which determine which information is passed on and remembered and which is ignored or updated. These gating mechanisms help the GRU network better handle long-term dependencies and uncertainties in sequential image data. In some embodiments, the first feature map of the current frame and the second feature map of the previous frame are both the smallest-scale feature maps obtained after feature extraction from the corresponding original images using a Convolutional Neural Network (CNN). The first feature map represents the high-semantic abstract feature information of the current frame, and the second feature map represents the high-semantic abstract feature information of the previous frame. Fusing the high-semantic features of the current frame with those of the previous frame better represents the features that link the frames and improves the accuracy of human target segmentation based on spatiotemporal information. The first feature map of the current frame and the second feature map of the previous frame are input into the GRU network, which automatically learns the feature information in both maps to better capture the changes and correlations between consecutive frames in time-series image data. The feature information from the first and second feature maps is then fused to obtain a fused feature map containing information about target motion, shape changes, and so on. By introducing a gating mechanism, the GRU network can selectively memorize and filter the feature information from the previous frame while fusing it with the feature information from the current frame, generating a more comprehensive and reliable fused feature map. This allows the GRU network to find patterns, changes, and possible target trajectories in time-series image data.
[0089] In some embodiments, referring to Figure 5, Figure 5 is a schematic diagram of the network structure of the gated recurrent unit (GRU) network in the target segmentation model provided in some embodiments of the present invention. After the first feature map of the current frame's original image and the second feature map of the previous frame's original image are input into the GRU network, the GRU network can fuse the features in the first and second feature maps according to a first formula to obtain a fused feature map, wherein the first formula is:
[0090] $Z_t = \sigma(W_z * [h_{t-1}, x_t])$
[0091] $r_t = \sigma(W_r * [h_{t-1}, x_t])$
[0092]
[0093]
[0094] Here, $h_{t-1}$ represents the feature map of the previous frame of the original image, and $x_t$ represents the feature map of the current frame of the original image. $Z_t$ represents the update gate of the fused feature map. The update gate determines how much prior information from the previous frame's original image is passed on; the larger the value of the update gate, the more state information is introduced from the previous frame's original image. $r_t$ represents the reset gate of the fused feature map. The reset gate determines how much prior information from the previous frame is forgotten; the smaller the reset gate value, the less important the information from the previous frame is to the current frame, and the more of it should be ignored. $W_z$, $W_r$, and the weight matrix of the candidate hidden state are the weight matrices. $\tilde{h}_t$ represents the candidate hidden state of the fused feature map, and $h_t$ represents the fused feature map.
[0095] Specifically, the reset and update gates of the GRU network are vectors whose values lie between 0 and 1. These vectors are calculated from the input and the hidden state of the GRU network. Assuming the current time step is $t$, the input is $x_t$, the updated hidden state is $h_t$, the reset gate is denoted $r_t$, and the update gate is denoted $Z_t$, the calculation process is as follows:
[0096] Reset gate: $r_t = \sigma(W_r * [h_{t-1}, x_t])$
[0097] Update gate: $Z_t = \sigma(W_z * [h_{t-1}, x_t])$
[0098] Here, $W_r$ and $W_z$ are learnable weight matrices in the GRU network, and $\sigma$ represents the activation function, which introduces non-linearity into the GRU network so that it can handle complex relationships in image data. It is understood that different neural network structures can use different activation functions. $*$ represents matrix multiplication, and $[h_{t-1}, x_t]$ indicates that $h_{t-1}$ and $x_t$ are concatenated by columns to form a new vector.
[0099] Next, based on the value of the reset gate, the candidate hidden state can be calculated.
[0100] Candidate hidden state
[0101] Here, the weight matrix applied in the candidate hidden state is a learnable weight matrix in the GRU network, and $[r_t * h_{t-1}, x_t]$ indicates that $r_t * h_{t-1}$ and $x_t$ are concatenated by columns to form a new vector.
[0102] Finally, the update gate is used to merge the previous hidden state $h_{t-1}$ (i.e., the feature map of the original image from the previous frame) with the candidate hidden state $\tilde{h}_t$ to obtain the updated hidden state, that is, the fused feature map $h_t$.
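For reference, the four steps described above can be written together in the standard GRU form. This is a sketch under stated assumptions rather than the formula of the present embodiment: the $\tanh$ activation and the symbol $W_{\tilde{h}}$ are taken from the standard GRU formulation, $\odot$ denotes element-wise multiplication (the usual reading of $r_t * h_{t-1}$), and the weighting in the last line is chosen to agree with the statement that a larger update gate introduces more state information from the previous frame.

```latex
\begin{aligned}
Z_t &= \sigma\left(W_z * [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r * [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W_{\tilde{h}} * [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= Z_t \odot h_{t-1} + (1 - Z_t) \odot \tilde{h}_t
\end{aligned}
```

Some references swap the roles of $Z_t$ and $1 - Z_t$ in the last line; the two conventions are equivalent up to relabeling the gate.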
[0103] By introducing update and reset gates, GRU networks can control the amount of information passed from previous time steps to the current time step, reducing the vanishing and exploding gradient problems and more effectively handling long-term dependencies in time-series image data. Due to its relatively simple structure and good performance, GRU networks are widely used in many deep learning tasks. By learning the temporal changes in features, GRU networks can better understand the dynamic changes in image sequences and extract the necessary information. GRU networks play a crucial role in video analysis, action recognition, and behavior prediction, enhancing the model's ability to understand time-series data.
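The gated fusion can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of the present embodiment: feature maps are flattened to short vectors, the weights are random rather than trained, the `tanh` candidate activation and the name `W_h` are assumptions from the standard GRU formulation, and the final weighting assumes that a larger update gate keeps more of the previous frame.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_fuse(h_prev, x_t, W_z, W_r, W_h):
    """Fuse previous-frame features h_prev with current-frame features x_t."""
    concat = h_prev + x_t                      # [h_{t-1}, x_t]
    z = sigmoid(matvec(W_z, concat))           # update gate, values in (0, 1)
    r = sigmoid(matvec(W_r, concat))           # reset gate, values in (0, 1)
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate hidden state (tanh is an assumption, see lead-in).
    h_cand = [math.tanh(a) for a in matvec(W_h, gated + x_t)]
    # Assumed convention: a larger z keeps more of the previous frame.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n = 4  # length of each flattened feature vector (toy size)
W_z, W_r, W_h = (rand_matrix(n, 2 * n) for _ in range(3))
h_prev = [0.2, -0.1, 0.4, 0.0]   # smallest-scale features of the previous frame
x_t = [0.3, 0.1, -0.2, 0.5]      # smallest-scale features of the current frame
fused = gru_fuse(h_prev, x_t, W_z, W_r, W_h)
print(fused)
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the fused vector is exactly half of `h_prev`; this is a convenient sanity check of the gating arithmetic.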
[0104] S400: Input the feature maps of multiple scales of the original image of the current frame and the fused feature map into the feature fusion decoding network to obtain the predicted label of the target.
[0105] The feature maps at multiple scales of the current frame's original image represent abstract features at different levels, ranging from low-level textures and edges to high-level semantic information, encompassing a wealth of information in the image. The smallest fused feature map represents the comprehensive features after spatiotemporal fusion, integrating information from the current frame's original image and the previous frame's original image. The feature fusion decoding network consists of a series of convolutional and deconvolutional layers, used to progressively restore the abstract feature maps to target segmentation results with the same resolution (same scale) as the original image, i.e., obtaining the predicted labels corresponding to human targets in the original image. Through techniques such as upsampling and skip connections, the resolution of the feature maps is gradually increased while preserving semantic information, fusing features from different levels together to extract information related to human targets, and finally outputting the predicted labels corresponding to human targets in the original image, preserving the diversity and richness of features.
[0106] It is understandable that the predicted labels for human targets output by the feature fusion decoding network can be for various tasks, such as object detection, semantic segmentation, and instance segmentation. For example, in this embodiment of the invention, the predicted label represents the target segmentation model's understanding of human targets in the original image, that is, whether each pixel in the original image belongs to the background category or the human target category. Its generation process goes through multiple stages from the original image to abstract features and then to the predicted label.
[0107] Referring to Figure 6, Figure 6 is a schematic diagram of a sub-process of step S400 in the method for training a target segmentation model provided in some embodiments of the present invention. In some embodiments, the feature fusion decoding network includes multiple cascaded decoding layers. Inputting the feature maps at multiple scales of the original image of the current frame and the fused feature map into the feature fusion decoding network to obtain the predicted label of the target specifically includes, but is not limited to, the following steps S410-S430:
[0108] S410: The fused feature map is input into the first-level decoding layer for upsampling to obtain the first-level output feature map of the first-level decoding layer. The first-level output feature map is connected to the first target feature map to obtain the first-level connected feature map. The first target feature map has the same scale as the first-level output feature map, and the first target feature map is the feature map with the smallest scale of the original image of the current frame.
[0109] Specifically, the feature fusion decoding network comprises multiple cascaded decoding layers. This cascaded decoding layer structure consists of a series of convolutional layers, deconvolutional layers, and upsampling layers, forming a path from abstract features to the original image within the network structure. Within this path, each decoding layer recovers the image's details and semantic information at different levels, enabling the network to perform feature reconstruction and semantic understanding at various levels. Information transfer between decoding layers is achieved through skip connections. Skip connections allow connecting low-level features with high-level features, enabling the network to obtain detailed information from lower-level features while simultaneously acquiring more abstract semantic information from higher-level features. This prevents information loss in deep networks and improves the network's stability during feature fusion and decoding. By connecting lower-level convolutional features with upper-level deconvolutional features through skip connections, information can be transferred between feature maps with different spatial resolutions. Skip connections enhance information transfer within the network while preserving low-level details. Connecting upsampling and downsampling features helps the network better learn local and global image features, improving the accuracy of semantic segmentation and accelerating the convergence of the target segmentation model.
[0110] In practical applications, the fused feature map is input into the first-level decoding layer of the feature fusion decoding network. An upsampling operation is performed on the fused feature map, mapping it back to the size of the original image from a lower resolution. Through upsampling, the first-level decoding layer gradually recovers the high-level semantic information captured in the fused feature map, obtaining the processed first-level output feature map. It can be understood that the first-level output feature map has the same scale as the smallest feature map in the current frame's original image, representing a relatively coarse feature reconstruction result, typically containing ambiguous but high-level semantic information, such as the approximate location and shape of human targets. Then, the first-level output feature map is fused with the first target feature map using techniques such as skip connections or concatenation to obtain the first-level connected feature map. Here, the first target feature map has the same scale as the first-level output feature map, and is the smallest feature map in the current frame's original image. By fusing the first-level output feature map with the feature map of the current frame original image at the same scale as the first-level output feature map, that is, by connecting the first-level output feature map with the feature map of the current frame original image at the same scale as the first-level output feature map, the abstract information recovered from the fused feature map can be combined with the detailed features of the current frame original image to obtain a first-level connected feature map with more accurate and detailed feature representation.
[0111] S420: Input the first-level connected feature map into the next-level decoding layer for upsampling to obtain the next-level output feature map of the next-level decoding layer. Connect the next-level output feature map with the next-target feature map to obtain the next-level connected feature map. The next-target feature map is a feature map in the feature map of the current frame original image with the same scale as the next-level output feature map.
[0112] Specifically, the obtained first-level connected feature map is input into the next-level decoding layer of the feature fusion decoding network. An upsampling operation is performed on the first-level connected feature map to further recover the details and semantic information of the fused feature map, mapping the first-level connected feature map from a lower resolution back to the size of the original image. Through the upsampling operation, the next-level decoding layer gradually transforms the coarser first-level connected feature map into a finer next-level output feature map. It can be understood that the next-level output feature map represents a higher-level feature reconstruction result, capturing the semantic information and local details of the original image at a higher level. Then, through techniques such as skip connections or concatenation, the next-level output feature map is fused with the next-target feature map to obtain the next-level connected feature map. Here, the next-target feature map is a feature map in the feature map of the current frame's original image with the same scale as the next-level output feature map. By fusing the features of the next-level output feature map with the features of the current frame original image at the same scale as the next-level output feature map, that is, by connecting the feature map of the next-level output feature map with the feature map of the current frame original image at the same scale as the next-level output feature map, we can comprehensively utilize the high-level semantic information in the fused feature map and the low-level detail information of the current frame original image to obtain a next-level connected feature map with richer and more accurate feature representation.
[0113] S430: For each of the remaining decoding layers, repeat the operation of inputting the connected feature map from the previous level into the next-level decoding layer for upsampling to obtain the next-level output feature map, and connecting the next-level output feature map with the next target feature map to obtain the next-level connected feature map, until an output feature map with the same scale as the original image of the current frame is obtained, and obtain the predicted label of the target from that output feature map.
[0114] Specifically, in each remaining cascaded decoding layer, the connection feature map output by the previous decoding layer is upsampled to gradually restore the details and semantic information in the fused feature map, expand the size of the fused feature map, and gradually restore it to the size of the original image. This effectively fuses the information of the connection feature map output by the previous decoding layer with the feature map in the current frame's original image, preserving image details while adding high-level semantic information from the fused feature map. The upsampling operation is repeated, inputting the connection feature map output by the previous decoding layer into the next decoding layer to obtain the next-level output feature map. Then, the next-level output feature map is concatenated with a feature map in the current frame's original image that has the same scale as the next-level output feature map to obtain the next-level connection feature map. This process continues until an output feature map with the same scale as the current frame's original image is obtained. In other words, after inputting the connection feature map into the decoding layer for upsampling, the resulting output feature map has the same scale as the current frame's original image.
[0115] The remaining decoding layers in the feature fusion decoding network perform layer-by-layer decoding operations on the feature map, merging features from different convolutional layers to synthesize semantic information at different levels. This fully utilizes feature information from various levels, gradually restoring the feature map from high-level abstract features to low-level detailed features closer to the original image, giving the target segmentation model better context awareness. When the final decoding layer of the feature fusion decoding network is reached—that is, the decoding layer with the same scale as the original image—the high-level semantic information from the fused feature map is finally restored to an output feature map with the same scale as the original image of the current frame. Understandably, the final output feature map contains richer and more accurate image details and semantic information. Based on the output feature map, information related to human targets is extracted, and the predicted labels corresponding to the human targets in the original image are finally output, completing the transformation from high-level abstract features of the feature map to specific labels for the target segmentation task.
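Steps S410-S430 can be sketched as a loop over decoding levels. This is a hedged illustration only: nearest-neighbor resizing stands in for learned upsampling or deconvolution, channel averaging followed by a 0.5 threshold stands in for a learned output layer, and the names `resize_nearest`, `decode`, and `constant_map` are hypothetical. Each level resizes its input to the scale of its target feature map, so the first level leaves the scale unchanged when the fused map already matches the smallest feature map.

```python
def resize_nearest(channel, out_h, out_w):
    """Nearest-neighbor resize of one 2D channel to (out_h, out_w)."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def decode(fused, encoder_maps, image_size):
    """Cascade of decoding layers.

    fused        : list of 2D channels at the smallest scale (GRU output)
    encoder_maps : current-frame feature maps, smallest scale first,
                   each a list of 2D channels
    image_size   : (height, width) of the original frame
    """
    connected = fused
    for target in encoder_maps:
        h, w = len(target[0]), len(target[0][0])
        # Upsample to the scale of this level's target feature map.
        output = [resize_nearest(ch, h, w) for ch in connected]
        # Skip connection: concatenate along the channel axis.
        connected = output + target
    # Final layer: bring the result to the scale of the original image.
    height, width = image_size
    final = [resize_nearest(ch, height, width) for ch in connected]
    # Stand-in for a learned output layer: average channels, then threshold.
    n = len(final)
    return [
        [1 if sum(ch[r][c] for ch in final) / n > 0.5 else 0 for c in range(width)]
        for r in range(height)
    ]

def constant_map(value, size):
    """One-channel square feature map filled with `value` (toy data)."""
    return [[[value] * size for _ in range(size)]]

fused = constant_map(0.9, 2)
encoder_maps = [constant_map(0.8, 2), constant_map(0.7, 4), constant_map(0.6, 8)]
labels = decode(fused, encoder_maps, image_size=(16, 16))
print(len(labels), len(labels[0]), labels[0][0])
```

The channel count grows by one target map per level, which mirrors how each connected feature map carries both the upsampled semantics and the same-scale detail of the current frame.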
[0116] S500: Calculate the loss between the true label and the predicted label based on the loss function, and iteratively train the target segmentation model according to the loss until the target segmentation model converges, thus obtaining the trained target segmentation model.
[0117] In target segmentation tasks, the predicted result for each pixel in the original image output by the target segmentation model is compared with the ground truth label. A pre-defined loss function measures the loss between the predicted and ground truth labels. Then, based on the calculated loss function value, an optimization algorithm (such as gradient descent) iteratively trains and optimizes the parameters of the target segmentation model, continuously adjusting the neural network parameters to reduce the loss function value so that the model's predictions move closer to the ground truth labels and become more accurate. As training progresses, the target segmentation model gradually converges, meaning its predictions gradually approach the ground truth labels. When the target segmentation model reaches a certain convergence state, it can be considered that the model has good generalization ability on the image data used for training and can be applied to previously unseen image or video data; that is, an effective target segmentation model has been obtained after training.
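The iterate-until-convergence loop can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden stand-in, not the training procedure of the present embodiment: the "model" is a single-feature logistic classifier over pixel intensity, the loss is mean binary cross-entropy, and the learning rate, tolerance, and iteration cap are arbitrary illustrative values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mean_loss(w, b, pixels, labels, eps=1e-12):
    """Mean binary cross-entropy of the toy per-pixel classifier."""
    total = 0.0
    for x, y in zip(pixels, labels):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pixels)

def train(pixels, labels, lr=1.0, tol=1e-5, max_iters=5000):
    """Gradient descent until the loss improves by less than `tol`."""
    w, b = 0.0, 0.0
    n = len(pixels)
    loss = mean_loss(w, b, pixels, labels)
    for _ in range(max_iters):
        errors = [sigmoid(w * x + b) - y for x, y in zip(pixels, labels)]
        w -= lr * sum(e * x for e, x in zip(errors, pixels)) / n
        b -= lr * sum(errors) / n
        new_loss = mean_loss(w, b, pixels, labels)
        converged = abs(loss - new_loss) < tol
        loss = new_loss
        if converged:
            break
    return w, b, loss

pixels = [0.1, 0.2, 0.8, 0.9]   # toy pixel intensities
labels = [0, 0, 1, 1]           # 0 = background, 1 = human target
w, b, final_loss = train(pixels, labels)
print(final_loss, [round(sigmoid(w * x + b)) for x in pixels])
```

The stopping rule used here, a loss change below a tolerance, is only one possible convergence criterion; a fixed epoch budget or a validation metric are common alternatives.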
[0118] In some embodiments, cross-entropy loss or Dice loss can be used to calculate the loss between the ground truth and predicted labels. Cross-entropy loss measures the difference between the predicted probability distribution and the ground truth label, while Dice loss measures the similarity between two sets, here the predicted and ground truth target regions. It is understood that the smaller the loss between the ground truth and predicted labels, the closer the target segmentation model's prediction is to the ground truth label, and the better optimized the model. It is also easy to understand that other loss functions can be used to calculate the loss between the ground truth and predicted labels according to actual needs.
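Both losses can be written compactly for flattened per-pixel predictions. The sketch below is illustrative: the clamping constant `eps` and the Dice smoothing term `smooth` are common conventions assumed here, not values given in this description.

```python
import math

def cross_entropy(pred, truth, eps=1e-7):
    """Mean binary cross-entropy; `pred` holds probabilities in [0, 1]."""
    total = 0.0
    for p, y in zip(pred, truth):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)

def dice_loss(pred, truth, smooth=1.0):
    """1 minus the (smoothed) Dice coefficient of prediction and ground truth."""
    intersection = sum(p * y for p, y in zip(pred, truth))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(truth) + smooth)

truth = [0, 0, 1, 1]          # flattened ground truth labels
good = [0.1, 0.2, 0.9, 0.8]   # prediction close to the labels
bad = [0.9, 0.8, 0.2, 0.1]    # prediction far from the labels
print(cross_entropy(good, truth), cross_entropy(bad, truth))
print(dice_loss(good, truth), dice_loss(bad, truth))
```

In both cases the prediction closer to the ground truth yields the smaller loss, and a prediction identical to the ground truth drives the Dice loss to zero.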
[0119] The present invention provides a method for training a target segmentation model, wherein the target segmentation model includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network. The method includes: acquiring a training set, which includes multiple consecutive frames of original images, each frame of which is labeled with a ground truth label of the target; inputting the original images in the training set into the convolutional neural network to obtain feature maps of multiple scales for each frame of the original image; inputting a first feature map of the current frame of the original image and a second feature map of the previous frame of the original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map, wherein the first feature map and the second feature map are both the smallest feature maps in the corresponding original images; inputting the feature maps of multiple scales of the current frame of the original image and the fused feature map into the feature fusion decoding network to obtain a predicted label of the target; calculating the loss between the ground truth label and the predicted label based on a loss function, and iteratively training the target segmentation model according to the loss until the target segmentation model converges to obtain the trained target segmentation model.
[0120] In this embodiment of the invention, when training the target segmentation model, feature maps at multiple scales of the previous frame's original image are acquired. The smallest feature map from the previous frame's original image is then fused with the smallest feature map from the current frame's original image to obtain a fused feature map. This fused feature map is then used to train the target segmentation model. Consequently, the target segmentation model, during training, places greater emphasis on the correlation between the current and previous frames' original images, strengthening the connection between the target subject in consecutive frames. The resulting target segmentation model can accurately segment the target, reducing interference from complex backgrounds and artifacts, lowering the jitter of the target segmentation boundary, improving the accuracy and stability of target segmentation, and resulting in more accurate and reliable segmentation results, thus enhancing the user experience.
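The fusion of the current-frame and previous-frame feature maps can be sketched with a textbook gated recurrent unit update. This is only an illustration: it uses the standard GRU equations, which may differ from the first formula of the embodiment, it flattens each feature map into a vector, and it omits biases. The names `gru_fuse`, `W_z`, `W_r`, and `W_h` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fuse(h_prev, x_cur, W_z, W_r, W_h):
    """Fuse previous-frame features h_prev with current-frame features x_cur
    using a standard GRU update (assumed form, biases omitted)."""
    hx = np.concatenate([h_prev, x_cur])
    z = sigmoid(W_z @ hx)  # update gate
    r = sigmoid(W_r @ hx)  # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_cur]))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand  # fused features

rng = np.random.default_rng(0)
d = 8  # length of the flattened smallest-scale feature map (hypothetical)
h_prev = rng.standard_normal(d)  # second feature map (previous frame)
x_cur = rng.standard_normal(d)   # first feature map (current frame)
W_z, W_r, W_h = (0.1 * rng.standard_normal((d, 2 * d)) for _ in range(3))

fused = gru_fuse(h_prev, x_cur, W_z, W_r, W_h)
print(fused.shape)
```

A practical model would more likely apply the gates with convolutions over the spatial feature map; the vector form is used here only to keep the sketch short.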
[0121] Referring to Figure 7, Figure 7 is a flowchart illustrating a method for training a target segmentation model according to other embodiments of the present invention. In these embodiments, the target segmentation model further includes an attention network, and the method for training the target segmentation model further includes, but is not limited to, the following step S350:
[0122] S350: Input feature maps of multiple scales of the original image of the current frame into the attention network to obtain the attention scores of the feature maps at each scale.
[0123] In some embodiments, the target segmentation model further includes an attention network, which employs a Multi-Layer Perceptron (MLP) structure. The MLP is a basic feedforward neural network structure and one of the most common forms of artificial neural networks. An MLP consists of multiple neuron layers, specifically an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is fully connected to the neurons of the previous layer through a weight matrix and bias parameters, but there are no connections between neurons within the same layer. The input layer receives the raw image data as input features, information is then passed and processed layer by layer through the hidden layers, and the output layer finally produces the model's prediction result. The attention network dynamically assigns weight scores (attention scores) to the human target and the background in the target segmentation model, enabling the model to focus on the details and important regions of the human target in the input raw image data, and it outputs attention scores for the feature maps at each scale of the original image.
[0124] Specifically, feature maps at multiple scales of the current frame's original image represent abstract features at different levels, ranging from low-level textures and edges to high-level semantic information, encompassing rich information within the original image. These feature maps at multiple scales are input into the attention network of the target segmentation model. The attention network calculates attention scores for each scale of the feature maps based on the feature content and contextual relationships within each scale. Intuitively, the attention score represents the importance or focus of different regions in each scale of the feature maps of the current frame's original image. Through iterative training, the attention network learns the inherent patterns and relationships in the original image data, automatically adjusting the attention weights of feature maps at different scales, enabling the target segmentation model to better focus on important information. By calculating the attention scores of feature maps at each scale of the current frame's original image, the attention network can weight and highlight important information in the current frame's original image. Understandably, feature maps at different scales in the current frame's original image may contribute differently to different aspects of the target segmentation task, and their attention scores may vary. For example, in some embodiments, smaller-scale feature maps are better suited for capturing texture and details, while larger-scale feature maps are better suited for capturing the overall shape and structure of the human body.
[0125] In some embodiments, after inputting feature maps of multiple scales of the original image of the current frame into the attention network, the attention network can calculate the attention score of the feature maps of each scale of the original image of the current frame according to a second formula, wherein the second formula is:
[0126] $S_n = W^{T} f_n + b$
[0127] $\alpha = \mathrm{Softmax}(s)$
[0128] where $W^{T}$ represents the weight matrix, $b$ represents the bias parameter, $f_n$ represents the feature map of the original image of the current frame, $n$ represents the number of feature maps of the original image of the current frame, $\alpha$ represents the attention score of the feature map of the original image of the current frame, with a value range of [0,1], Softmax() is the normalization function, and $s = S_n$.
[0129] Specifically, the neurons multiply the feature maps at various scales of the input current frame's original image with the weight matrix and add a bias term. Through multiple iterations of training and adjusting the network parameters (weight matrix and bias parameters), the MLP network can learn the patterns and relationships in the original image data, adjust the intermediate variables of the output, and use the normalization function Softmax() to calculate the attention scores of the feature maps at various scales of the current frame's original image.
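The computation described above (a weighted sum plus bias, followed by Softmax normalization) can be sketched as follows. The sketch assumes that each feature map is first reduced to a vector by global average pooling and that all scales share the same channel count; neither detail is stated in the embodiment, and the names `attention_scores`, `W`, and `b` are illustrative.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))  # subtract the max for numerical stability
    return e / e.sum()

def attention_scores(feature_maps, W, b):
    """Return one attention score per scale.

    feature_maps: list of arrays shaped (C, H_i, W_i), one per scale.
    W: weight vector of length C; b: scalar bias (assumed shapes).
    """
    # Assumption: global average pooling turns each map into a length-C vector.
    f = np.stack([fm.mean(axis=(1, 2)) for fm in feature_maps])  # (n, C)
    s = f @ W + b  # intermediate variable, one value per scale
    return softmax(s)

rng = np.random.default_rng(0)
C = 4
maps = [rng.standard_normal((C, 32, 32)),
        rng.standard_normal((C, 16, 16)),
        rng.standard_normal((C, 8, 8))]
W = rng.standard_normal(C)
b = 0.0

alpha = attention_scores(maps, W, b)
print(alpha, alpha.sum())
```

Because of the Softmax normalization, every score lies in [0,1] and the scores across scales sum to 1.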
[0130] In this embodiment, the target segmentation model includes an attention network, which calculates attention scores for feature maps at multiple scales of the original image of the current frame. Then, the attention scores of the feature maps at each scale of the original image of the current frame are substituted into a loss function to calculate the loss between the true and predicted labels. The target segmentation model is iteratively trained based on this loss until it converges, resulting in the trained target segmentation model. Accordingly, the loss function can be a cross-entropy loss function that calculates the loss based on the attention scores to obtain the loss between the true and predicted labels. The cross-entropy loss function is:
[0131]
[0132] Specifically, $p$ represents the true label of the target, the other label term in the formula represents the predicted label of the target, $\alpha_i$ represents the attention score of the $i$-th feature map of the original image of the current frame, and $n$ represents the number of feature maps of the original image of the current frame. Other loss functions can also be used to calculate the loss between the true and predicted labels according to actual needs.
[0133] During the training of the target segmentation model, the original image is input into the model to obtain the predicted label output by the model. The predicted label is then compared with the true label, and the loss function value between the predicted and true labels is calculated using a preset loss function. Optimization algorithms (such as Stochastic Gradient Descent (SGD) or Adam) can be used to adjust the relevant weight parameters in the target segmentation model through backpropagation to reduce the loss function value, thereby optimizing the predictive ability of the target segmentation model. In some embodiments, the Adam optimization algorithm can be used to optimize the network parameters of the target segmentation model, where the number of iterations is set to 100,000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate decays to 1/10 of its previous value every 1,000 iterations until the target segmentation model converges.
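The learning-rate schedule described above can be written as a simple step decay. The function below is a sketch using the stated values (initial rate 0.001, decay to 1/10 every 1,000 iterations); the function and parameter names are assumptions.

```python
def step_decay_lr(iteration, base_lr=0.001, decay_every=1000, factor=0.1):
    """Learning rate after `iteration` training steps under step decay."""
    return base_lr * factor ** (iteration // decay_every)

for it in (0, 999, 1000, 2500):
    print(it, step_decay_lr(it))
```

The rate stays constant within each block of 1,000 iterations and drops by a factor of ten at each block boundary.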
[0134] After multiple iterations of training, the loss function value of the target segmentation model gradually decreases, and the consistency between the predicted and true labels gradually improves until the target segmentation model converges. When the target segmentation model converges, it has reached a relatively stable state, and further training will not bring significant improvement. At this point, the target segmentation model can be considered to fit the image data used for training well and can be applied to unseen image or video data, outputting pixel-level segmentation results that classify each pixel in the image as belonging to the background or to the human target, thus obtaining an effective trained target segmentation model.
[0135] In summary, the method for training a target segmentation model provided in this embodiment of the invention includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network. The method includes: acquiring a training set, which includes multiple consecutive frames of original images, each frame of which is labeled with a ground truth label for the target; inputting the original images in the training set into the convolutional neural network to obtain feature maps at multiple scales for each frame of the original image; inputting the first feature map of the current frame of the original image and the second feature map of the previous frame of the original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map, wherein the first feature map and the second feature map are both the smallest feature maps in the corresponding original images; inputting the feature maps at multiple scales of the current frame of the original image and the fused feature map into the feature fusion decoding network to obtain a predicted label for the target; calculating the loss between the ground truth label and the predicted label based on a loss function, and iteratively training the target segmentation model according to the loss until the target segmentation model converges to obtain the trained target segmentation model.
[0136] In this embodiment of the invention, when training the target segmentation model, feature maps at multiple scales of the previous frame's original image are acquired. The smallest feature map from the previous frame's original image is then fused with the smallest feature map from the current frame's original image to obtain a fused feature map. This fused feature map is then used to train the target segmentation model. Consequently, the target segmentation model, during training, places greater emphasis on the correlation between the current and previous frames' original images, strengthening the connection between the target subject in consecutive frames. The resulting target segmentation model can accurately segment the target, reducing interference from complex backgrounds and artifacts, lowering the jitter of the target segmentation boundary, improving the accuracy and stability of target segmentation, and resulting in more accurate and reliable segmentation results, thus enhancing the user experience.
[0137] Referring to Figure 8, Figure 8 is a flowchart illustrating a target segmentation method provided in some embodiments of the present invention. It is understood that the executing entity of this target segmentation method can be the aforementioned electronic device, and the target segmentation method includes, but is not limited to, the following steps S600-S800:
[0138] S600: Acquire the image to be processed.
[0139] When applying a target segmentation model to segment an image, the first step is to acquire the image to be processed. This image data can be obtained by the user or operator from various data sources for image processing or computer vision tasks. It is understood that the image to be processed can be pixel data represented digitally, image files in formats such as JPEG and PNG, or real-time images acquired through image acquisition devices such as cameras or scanners. In some embodiments, it is readily understood that other methods can also be used to acquire the image to be processed; for example, users can acquire image data uploaded by themselves through applications, such as avatars or albums in social media applications. Acquiring the image to be processed provides a reliable data source for subsequent image data preprocessing, recognition and detection, feature extraction, and target segmentation operations.
[0140] S700: Input the image to be processed into the target segmentation model to obtain the predicted label of the target in the image to be processed, wherein the target segmentation model is trained using any of the above-described methods for training a target segmentation model.
[0141] The acquired image to be processed is loaded as input data into the target segmentation model. After forward propagation through the target segmentation model, each pixel in the image is classified as either a human target or a background target to obtain a predicted label for the target in the image. This target segmentation model is trained using the method described in any of the embodiments of the present invention, and has the same structure and function as the target segmentation model described in the embodiments of the present invention, which will not be elaborated further here. During forward propagation, each pixel in the image to be processed is input into the target segmentation model and undergoes a series of operations such as convolution, pooling, and feature fusion to extract feature information from the input image. Based on the feature representation and weight parameters learned during the training phase, the target segmentation model outputs a corresponding predicted label for each pixel in the image to be processed, classifying each pixel as either a human target or a background target, thereby obtaining a predicted label image with the same scale as the image to be processed. The obtained predicted label image is the predicted label for the human target in the image to be processed. It can be understood that each pixel in the predicted label image is assigned a category label to indicate whether the pixel belongs to the human target category or the background category. In some embodiments, post-processing is required after obtaining the predicted label image. For example, noise in the predicted label image may be removed, and target regions may be filled in to obtain a more accurate and complete human target segmentation result.
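The per-pixel classification step can be sketched as follows, assuming (this is not stated in the embodiment) that the model outputs a two-channel score map with channel 0 for the background and channel 1 for the human target. The name `predict_labels` and the toy scores are hypothetical.

```python
import numpy as np

def predict_labels(scores):
    """Turn a (2, H, W) score map into an (H, W) label image.

    Assumed convention: label 0 = background, label 1 = human target.
    """
    return np.argmax(scores, axis=0).astype(np.uint8)

# Toy 4x5 score map: the human channel wins only in a 2x2 region.
scores = np.zeros((2, 4, 5))
scores[0] = 0.2
scores[1, 1:3, 2:4] = 0.9

labels = predict_labels(scores)
print(labels)
```

The resulting label image has the same height and width as the input, matching the description of a predicted label image with the same scale as the image to be processed.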
[0142] S800: Segment the target image from the image to be processed based on the predicted label of the target.
[0143] After obtaining the predicted labels of human targets in the image to be processed, the human target pixels are extracted from the image to be processed through pixel-level operations based on the predicted labels, resulting in a separate human target image. Clearly, the segmented human target image only contains the human target portion of the image to be processed; the background or other portions are removed. It is readily understood that in some embodiments, the human target image can be extracted simply by traversing the pixels in the image to be processed and selecting target pixels based on the predicted labels, or it can be generated through image masking operations, or other methods can be used to obtain the human target image.
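The image masking operation mentioned above can be sketched with a boolean mask. The function name `extract_target`, the fill value for removed pixels, and the toy data are assumptions for illustration.

```python
import numpy as np

def extract_target(image, labels, fill=0):
    """Keep pixels labeled 1 (human target) and replace all others with `fill`.

    image: (H, W, 3) array; labels: (H, W) array of 0/1 predicted labels.
    """
    mask = labels.astype(bool)
    out = np.full_like(image, fill)
    out[mask] = image[mask]  # copy only the human target pixels
    return out

image = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
labels = np.zeros((4, 5), dtype=np.uint8)
labels[1:3, 2:4] = 1

target_image = extract_target(image, labels)
print(target_image[1, 2], target_image[0, 1])
```

Masking gives the same result as traversing the pixels one by one and selecting those whose predicted label marks the human target.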
[0144] The target segmentation method provided in this invention includes: acquiring an image to be processed; inputting the image to be processed into a target segmentation model to obtain predicted labels for targets in the image to be processed, wherein the target segmentation model is trained using the method described above; and segmenting a target image from the image to be processed based on the predicted labels of the targets. The obtained target segmentation model can accurately segment targets, reduce interference from complex backgrounds and the generation of artifacts, reduce the jitter of target segmentation boundaries, improve the accuracy and stability of target segmentation, and provide more accurate and reliable segmentation results, thus enhancing the user experience.
[0145] This invention provides a computer-readable storage medium storing computer program instructions. A processor executes the computer program instructions to perform any of the methods for training a target segmentation model or any of the target segmentation methods provided in the above-described embodiments.
[0146] In some embodiments, the storage medium may be a flash memory, magnetic surface memory, optical disk, CD-ROM, FRAM, ROM, PROM, EPROM, or EEPROM, or various devices including one or any combination of the above-mentioned memories.
[0147] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
[0148] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborative files (e.g., a file that stores one or more modules, subroutines, or code sections).
[0149] As an example, executable instructions can be deployed to execute on a single computing device (including devices such as smart terminals and servers), or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.
[0150] Those skilled in the art will understand that the embodiments provided by this invention are merely illustrative. The order in which the steps in the methods of the embodiments are written does not imply a strict execution order and does not constitute any limitation on the implementation process. The order can be adjusted, merged, and deleted according to actual needs. Modules or sub-modules, units, or sub-units in the apparatus or system of the embodiments can be merged, divided, and deleted according to actual needs. For example, the division of units is merely a logical functional division, and there may be other division methods in actual implementation. For another example, multiple units or components can be combined or integrated into another device, or some features can be ignored or not executed.
[0151] Those skilled in the art will recognize that all or part of the steps of the methods described in the embodiments of this invention can be implemented directly using electronic hardware or processor-executable computer program instructions, or a combination of both. These computer program instructions can be stored in memory, a hard disk, a register, a removable disk, random access memory (RAM), read-only memory (ROM), a CD-ROM, an electrically programmable ROM, an electrically erasable programmable ROM, or any other form of storage medium known in the art.
[0152] It should be noted that the above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. They should not be used to limit the scope of protection of the present invention. Those skilled in the art can understand that implementations of all or part of the processes of the above embodiments, and all equivalent changes and modifications made in accordance with the claims of the present invention, fall within the scope of the claims of the present invention.
Claims
1. A method for training a target segmentation model, characterized in that the target segmentation model includes a convolutional neural network, a gated recurrent unit network, and a feature fusion decoding network, the feature fusion decoding network including multiple cascaded decoding layers, and the method includes: obtaining a training set, which includes multiple consecutive frames of original images, each frame of which is labeled with the true label of the target; inputting the original images in the training set into the convolutional neural network to obtain feature maps of multiple scales for each frame of the original image; inputting the first feature map of the current frame's original image and the second feature map of the previous frame's original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map, wherein the first feature map and the second feature map are both the feature maps with the smallest scale of the corresponding original images; inputting the feature maps of multiple scales of the current frame's original image and the fused feature map into the feature fusion decoding network to obtain the predicted label of the target, which includes: inputting the fused feature map into a first-level decoding layer for upsampling to obtain a first-level output feature map; connecting the first-level output feature map with a first target feature map to obtain a first-level connected feature map, wherein the first target feature map and the first-level output feature map have the same scale, and the first target feature map is the smallest feature map of the current frame's original image; inputting the first-level connected feature map into the next-level decoding layer for upsampling to obtain the next-level output feature map of the next-level decoding layer; connecting the next-level output feature map with the next-level target feature map to obtain the next-level connected feature map, wherein the next-level target feature map is the feature map, among the feature maps of the current frame's original image, that has the same scale as the next-level output feature map; and, through the remaining decoding layers, inputting the next-level connected feature map into the next-level decoding layer for upsampling to obtain the next-level output feature map of that decoding layer and connecting it with the next-level target feature map to obtain the next-level connected feature map, until an output feature map with the same scale as the current frame's original image is obtained, thereby obtaining the predicted label of the target; and calculating the loss between the true label and the predicted label based on a loss function, and iteratively training the target segmentation model according to the loss until the target segmentation model converges, thereby obtaining the trained target segmentation model.
2. The method according to claim 1, characterized in that the step of inputting the first feature map of the current frame's original image and the second feature map of the previous frame's original image into the gated recurrent unit network to fuse the first feature map and the second feature map to obtain a fused feature map includes: inputting the first feature map and the second feature map into the gated recurrent unit network, and performing feature fusion on the first feature map and the second feature map according to a first formula to obtain the fused feature map, wherein the symbols of the first formula respectively represent the second feature map, the first feature map, the activation function, the update gate of the fused feature map, the reset gate of the fused feature map, the weight matrices, the candidate hidden state of the fused feature map, and the fused feature map.
3. The method according to claim 1, characterized in that the target segmentation model further includes an attention network, and the method further includes: inputting the feature maps of multiple scales of the original image of the current frame into the attention network to obtain the attention scores of the feature maps at each scale.
4. The method according to claim 3, characterized in that the step of inputting feature maps at multiple scales of the original image of the current frame into the attention network to obtain attention scores for each scale feature map includes: inputting the feature maps of the original image of the current frame at multiple scales into the attention network; and calculating the attention score according to a second formula, wherein the second formula is: $S_n = W^{T} f_n + b$, $\alpha = \mathrm{Softmax}(s)$, where $W^{T}$ represents the weight matrix, $b$ represents the bias parameter, $f_n$ represents the feature map of the original image of the current frame, $n$ represents the number of feature maps of the original image of the current frame, $\alpha$ represents the attention score of the feature map of the original image of the current frame, with a value range of [0,1], Softmax() is a normalization function, and $s = S_n$.
5. The method according to claim 4, characterized in that the loss function is defined in terms of the true label of the target, the predicted label of the target, the attention score $\alpha_i$ of the $i$-th feature map of the original image of the current frame, and the number $n$ of feature maps of the original image of the current frame.
6. A target segmentation method, characterized in that it includes: obtaining an image to be processed; inputting the image to be processed into a target segmentation model to obtain the predicted label of the target in the image to be processed, wherein the target segmentation model is trained using the method for training a target segmentation model as described in any one of claims 1-5; and segmenting the target image from the image to be processed based on the predicted label of the target.
7. An electronic device, characterized in that it includes: a processor and a memory communicatively connected to the processor; wherein the memory stores computer program instructions executable by the processor, which, when invoked by the processor, cause the processor to execute the method for training a target segmentation model as described in any one of claims 1-5 or the target segmentation method as described in claim 6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions adapted to be loaded by a processor to perform the method for training a target segmentation model as described in any one of claims 1-5 or the target segmentation method as described in claim 6.