Image object detection and instance segmentation methods, systems, computing devices, and media
By introducing a recursive feature pyramid and switchable dilated convolution into the Mask R-CNN model, the invention addresses the multi-scale problem, improves object detection and instance segmentation performance, and achieves better convergence and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2022-09-14
- Publication Date
- 2026-06-26
Figure: abstract drawing (CN115375901B)
Description
Technical Field
[0001] This invention relates to the field of image data processing, and in particular to an image target detection and instance segmentation method, system, computing device, and medium based on recursive feature pyramids and switchable dilated convolution.
Background Technology
[0002] Instance segmentation is a crucial task in computer vision. Compared with semantic segmentation, it is more fine-grained, segmenting each instance individually. With increasing research investment, instance segmentation has become an important branch of computer vision. Instance segmentation algorithms can be categorized by technical approach into single-stage and two-stage methods. In two-stage methods, the first stage generates candidate boxes using a trainable candidate box generator, and the second stage performs detection and segmentation on these candidate boxes. Single-stage methods perform the whole process in one integrated pass. As a result of this structure, single-stage methods are generally faster, while two-stage methods are generally more accurate. In deep networks, low-level features have high localization accuracy but little semantic information, while high-level features have rich semantic information but lower localization accuracy. Fusing low-level and high-level feature maps to make better use of this information is therefore significant. The multi-scale problem has long been one of the challenges in object detection and instance segmentation, and the Feature Pyramid Network (FPN) was proposed to address it. Prior to FPN, the featurized image pyramid, single feature map, and pyramidal feature hierarchy had been proposed, each attempting to use feature maps from different levels in a different way. FPN introduces top-down connections from the highest-level features to the lowest-level features and performs prediction at each level, exploiting both the rich semantic information of high-level features and the high-resolution localization of low-level features.
[0003] Instance segmentation, as an important and challenging task in computer vision, has attracted widespread attention and been continuously driven by algorithmic advancements. Some have proposed unique structures different from existing algorithms, while others have improved modules of existing algorithms. In the latter case, to prove their effectiveness and improve performance, these mechanisms that modify existing modules are always applied to the latest algorithms. Therefore, how to add the DetectoRS mechanism to the Mask R-CNN model to improve the performance of the classic Mask R-CNN model in object detection and instance segmentation is a pressing technical problem that needs to be solved.
Summary of the Invention
[0004] To address the aforementioned problems, the present invention aims to provide an image target detection and instance segmentation method, system, computing device, and medium that can effectively improve performance and effectiveness, and gain potential application value in a wider range of fields.
[0005] To achieve the above objectives, the present invention adopts the following technical solution: an image target detection and instance segmentation method, comprising: inputting an image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages to obtain multi-scale feature maps at different depths; inputting the multi-scale feature maps into a neck network for recursive fusion of recursive feature pyramids to obtain new feature maps, and generating candidate regions to obtain regions of interest; inputting the regions of interest into a head network for prediction of instance category, category confidence, instance bounding box, and instance mask.
[0006] Furthermore, the recursive fusion of the recursive feature pyramid includes:
[0007] The highest layer features remain unchanged, while the features of each other layer are obtained by fusing the input of the current layer and the input of the layer above it through interpolation scaling and addition at corresponding positions, resulting in a top-down feature fusion result.
[0008] The features fused from top to bottom are then extracted from bottom to top again to obtain a feature map with re-extracted features.
[0009] The feature maps with further extracted features are then fused from top to bottom to obtain new feature maps that have undergone further feature extraction and fusion.
[0010] Furthermore, the generation of candidate regions includes:
[0011] Using the new feature map as input, anchor boxes of different sizes and dimensions are generated at different locations of the feature map at different stages; features are extracted, foreground and background classification is performed on each anchor box and the category confidence is calculated, non-foreground anchor boxes are filtered out according to the confidence threshold, and target category prediction and bounding box prediction are performed on the remaining foreground anchor boxes to obtain the target category information and bounding box location information of the anchor boxes.
[0012] Furthermore, the acquisition of the region of interest includes:
[0013] The ROIAlign strategy is used to align candidate regions onto a new feature map, reducing quantization error and obtaining the region of interest.
[0014] Furthermore, the head network includes an object detection branch and a mask prediction branch;
[0015] The object detection branch predicts the instance category and instance bounding box of the new feature map;
[0016] The mask branch performs fine predictions of instance masks on new feature maps.
[0017] Furthermore, the loss function used in predicting the instance category and instance mask is CrossEntropy Loss; the loss function used in predicting the instance bounding box is L1 Loss.
[0018] Furthermore, before inputting the image into the backbone residual network containing switchable dilated convolutions for feature extraction at different stages, the method further includes a step of loading the network pre-trained weight parameters.
[0019] An image object detection and instance segmentation system includes: a first processing module that inputs an image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages to obtain multi-scale feature maps at different depths; a second processing module that inputs the multi-scale feature maps into a neck network, performs recursive fusion of recursive feature pyramids to obtain new feature maps, and generates candidate regions to obtain regions of interest; and a prediction module that inputs the regions of interest into a head network to predict instance categories, category confidence, instance bounding boxes, and instance masks.
[0020] A computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.
[0021] A computing device includes: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
[0022] The present invention has the following advantages due to the adoption of the above technical solutions:
[0023] 1. The basic framework of this invention adopts the widely used Mask R-CNN architecture, which has a clear and concise basic structure and has more potential application value after the introduction of new structures.
[0024] 2. This invention adds the DetectoRS mechanism to the Mask R-CNN model, introducing a module that currently performs very well and is quite novel in the field of instance segmentation. On the one hand, it improves the performance of classic networks, and on the other hand, it makes classic algorithms still competitive.
Attached Figure Description
[0025] Figure 1 is a schematic diagram of the overall process of the image target detection and instance segmentation method in one embodiment of the present invention;
[0026] Figure 2 is a detailed flowchart of an image target detection and instance segmentation method according to an embodiment of the present invention;
[0027] Figure 3 is a convergence effect diagram of the deep learning part of RSMask R-CNN in one embodiment of the present invention;
[0028] Figure 4 is a diagram showing the total loss convergence effect in one embodiment of the present invention;
[0029] Figure 5 shows a new feature map that has undergone further feature extraction and fusion in one embodiment of the present invention;
[0030] Figure 6 is a diagram showing the target category and bounding box position information of the anchor boxes in one embodiment of the present invention.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the described embodiments of the present invention are within the scope of protection of the present invention.
[0032] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0033] Instance segmentation, as an important and challenging task in computer vision, has attracted widespread attention and been continuously driven by algorithmic advancements. Some have proposed unique structures different from existing algorithms, while others have improved modules of existing algorithms. In the latter case, to prove their effectiveness and improve performance, these mechanisms that modify existing modules are always applied to the latest algorithms. Therefore, how to add the DetectoRS mechanism to the Mask R-CNN model to improve the performance of the classic Mask R-CNN model in object detection and instance segmentation is a pressing technical problem that needs to be solved.
[0034] To address the aforementioned technical problems, this invention aims to introduce efficient detection components into classic deep learning algorithms to improve performance and achieve potential applications in a wider range of fields. Using Mask R-CNN in object detection and instance segmentation as the basic model framework, the detection and segmentation method is defined as three parts: classification, regression, and mask prediction. The macroscopic structure of the model can be divided into three parts: a backbone network, a neck network, and a head network. In the neck network, this invention introduces a recursive feature pyramid into the Mask R-CNN structure. The microscopic operations of the model mainly involve image convolution, pooling, normalization, and region of interest alignment. This invention replaces the original convolution with a switchable dilated convolution. The loss function of this invention is divided into two parts: Cross Entropy Loss is used for instance classification and mask prediction, and L1 Loss is used for instance bounding box prediction. The invention will be described in detail below through embodiments.
[0035] In one embodiment of the present invention, an image target detection and instance segmentation method is provided. In this embodiment, as shown in Figures 1 and 2, the method includes the following steps:
[0036] 1) Input the image into a backbone residual network containing switchable dilated convolutions to extract features at different stages, and obtain multi-scale feature maps at different depths;
[0037] 2) Input the multi-scale feature maps into the neck network, perform recursive fusion of the recursive feature pyramid to obtain new feature maps, and generate candidate regions to obtain regions of interest;
[0038] 3) Input the region of interest into the head network to predict the instance class, class confidence, instance bounding box, and instance mask.
[0039] In step 1) above, in this embodiment, the images are input into the backbone residual network containing switchable dilated convolution in batches. Deep feature extraction is fully performed through the operation of the deep network. By extracting feature maps at different stages of the deep backbone network, multi-scale feature maps at different depths are obtained.
[0040] In step 2) above, the multi-scale feature maps are recursively fused using a recursive feature pyramid to further enhance the fusion of high- and low-order features and feature re-extraction, resulting in new multi-scale feature maps with richer information. The recursive fusion of the recursive feature pyramid includes the following steps:
[0041] 2.1) The highest layer features remain unchanged, while the features of each other layer are obtained by fusing the input of the current layer and the input of the layer above it through interpolation scaling and addition at corresponding positions, resulting in a top-down feature fusion result;
[0042] 2.2) The features fused from top to bottom are then extracted again from bottom to top to obtain a feature map with re-extracted features;
[0043] 2.3) The feature maps with further extracted features are then subjected to top-down feature fusion again to obtain new feature maps that have undergone further feature extraction and fusion.
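The three steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of the invention: nearest-neighbour upsampling stands in for "interpolation scaling", `bottom_up` is a hypothetical placeholder for the bottom-up re-extraction (an elementwise `tanh` that keeps each map's shape), and every level is a single-channel map whose side length halves from one level to the next. Following the wording of step 2.1, each level is summed with the unfused input of the level above; a standard FPN would instead cascade the already fused result downwards.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for interpolation scaling)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def top_down_fuse(feats):
    """Steps 2.1 / 2.3: keep the highest level, add the upsampled level above to every other level."""
    fused = list(feats)
    for i in range(len(feats) - 1):
        fused[i] = feats[i] + upsample2x(feats[i + 1])
    return fused

def bottom_up(feats):
    """Step 2.2: hypothetical placeholder for re-extraction; a real model would rerun backbone stages."""
    return [np.tanh(f) for f in feats]

def recursive_feature_pyramid(feats):
    fused = top_down_fuse(feats)          # 2.1 top-down fusion
    re_extracted = bottom_up(fused)       # 2.2 bottom-up re-extraction
    return top_down_fuse(re_extracted)    # 2.3 top-down fusion again

# Four levels with sides 16, 8, 4, 2 and constant values 1, 2, 3, 4.
feats = [np.full((16 // 2**i, 16 // 2**i), float(i + 1)) for i in range(4)]
out = recursive_feature_pyramid(feats)
print([o.shape for o in out])
```

The output keeps one map per input level with unchanged spatial sizes, so it can be passed to candidate region generation in place of the original pyramid.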
[0044] In step 2) above, candidate regions are generated, specifically as follows:
[0045] Using the new feature map as input, anchor boxes of different sizes and dimensions are generated at different locations of the feature map at different stages; features are extracted, foreground and background classification is performed on each anchor box and the category confidence is calculated, non-foreground anchor boxes are filtered out according to the confidence threshold, and target category prediction and bounding box prediction are performed on the remaining foreground anchor boxes to obtain the target category information and bounding box location information of the anchor boxes.
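As a rough illustration of this step, the sketch below places anchor boxes of several aspect ratios at every location of a small feature map and keeps only those whose foreground confidence passes a threshold. The stride, anchor size, aspect ratios, threshold, and the hand-written `scores` list are all illustrative assumptions; in the method the scores are predicted by the candidate region network.

```python
from itertools import product
from math import sqrt

def generate_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return (x1, y1, x2, y2) anchors centred on every feature-map cell, in image pixels."""
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            # ratio is height / width; the anchor area stays at size ** 2
            w = size / sqrt(ratio)
            h = size * sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def keep_foreground(anchors, scores, threshold=0.5):
    """Filter out anchors whose foreground confidence is below the threshold."""
    return [(a, s) for a, s in zip(anchors, scores) if s >= threshold]

anchors = generate_anchors(feat_h=2, feat_w=2, stride=16, sizes=(32,), ratios=(0.5, 1.0, 2.0))
scores = [0.9 if i % 4 == 0 else 0.1 for i in range(len(anchors))]  # made-up confidences
kept = keep_foreground(anchors, scores)
print(len(anchors), len(kept))
```

Only the surviving foreground anchors would then go on to category and bounding box prediction.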
[0046] In this embodiment, a Region Proposal Network (RPN) is used to generate proposed candidate regions. Feature selection is performed based on the feature maps at different stages, and the RPN parameters are adjusted to output regions of interest (ROIs).
[0047] In step 2) above, the acquisition of Regions of Interest (ROIs) is specifically as follows: the ROIAlign strategy is used to align the candidate regions onto the new feature map, thereby reducing quantization error and obtaining the ROIs.
[0048] In step 3) above, the head network includes an object detection branch and a mask prediction branch. Wherein:
[0049] The object detection branch is used to predict the instance class and instance bounding box of the new feature map;
[0050] The mask branch is used for fine-grained prediction of instance masks on new feature maps.
[0051] In this embodiment, the loss function used for predicting instance categories and instance masks is CrossEntropy Loss; the loss function used for predicting instance bounding boxes is L1 Loss. Stochastic gradient descent is used to learn and update the network parameters, returning to step 1), until the network reaches a pre-set convergence condition.
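The loss combination described here can be sketched for a single instance as follows. The equal weighting of the three terms, the per-pixel binary form of the mask cross entropy, and the sample numbers are assumptions made for illustration; the text only states which loss type each prediction uses.

```python
from math import log

def cross_entropy(probs, target):
    """Multi-class cross entropy for one instance; probs are class probabilities."""
    return -log(probs[target])

def binary_cross_entropy(pred_mask, true_mask, eps=1e-7):
    """Mean per-pixel binary cross entropy between predicted probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(pred_mask, true_mask):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(pred_mask)

def l1_loss(pred_box, true_box):
    """Mean absolute error over the box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(pred_box)

def total_loss(cls_probs, cls_target, pred_box, true_box, pred_mask, true_mask):
    """Sum of classification, regression, and mask losses (equal weights assumed)."""
    return (cross_entropy(cls_probs, cls_target)
            + l1_loss(pred_box, true_box)
            + binary_cross_entropy(pred_mask, true_mask))

loss = total_loss(
    cls_probs=[0.7, 0.2, 0.1], cls_target=0,
    pred_box=[10, 10, 50, 50], true_box=[12, 10, 50, 46],
    pred_mask=[0.9, 0.1, 0.8, 0.2], true_mask=[1, 0, 1, 0],
)
print(round(loss, 4))
```

In training, a scalar of this kind would be minimized by stochastic gradient descent until the preset convergence condition is met.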
[0052] In the above embodiment, before step 1) inputting the image into the backbone residual network containing switchable dilated convolutions for feature extraction at different stages, the method further includes: loading the network pre-trained weight parameters. Model weights θ are loaded from the pre-trained model weight file, and parameters not included in the pre-trained weights are randomly initialized.
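The loading rule described above (reuse pre-trained weights where they exist, randomly initialize the rest) might look like the sketch below. The parameter names, the flat list representation of weights, and the Gaussian initialization with standard deviation 0.01 are hypothetical choices for illustration only.

```python
import random

def load_pretrained(model_params, pretrained, seed=0):
    """Copy weights that match by name and length; randomly initialise all others."""
    rng = random.Random(seed)
    loaded, initialised = {}, []
    for name, length in model_params.items():
        if name in pretrained and len(pretrained[name]) == length:
            loaded[name] = list(pretrained[name])
        else:
            # Parameter absent from the checkpoint, e.g. one added by a new module
            loaded[name] = [rng.gauss(0.0, 0.01) for _ in range(length)]
            initialised.append(name)
    return loaded, initialised

# Hypothetical parameter names mapped to the number of weights each holds.
model_params = {
    "backbone.conv1.weight": 4,
    "backbone.sac.switch.weight": 2,
    "neck.rfp.weight": 3,
}
pretrained = {
    "backbone.conv1.weight": [0.1, 0.2, 0.3, 0.4],
    "head.old.weight": [1.0],  # unused by the new model, so it is ignored
}
weights, fresh = load_pretrained(model_params, pretrained)
print(fresh)
```

Parameters introduced by the added modules end up in `fresh`, which is consistent with fine-tuning both models from the same Mask R-CNN weights.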
[0053] The method of this invention is compared with the existing Mask R-CNN model, and its performance improvement is demonstrated through comparative experiments on major datasets.
[0054] 1. First, the deep learning part of RSMask R-CNN in this invention converges well: the classification loss, regression loss, and mask loss all converge markedly, as shown in Figure 3.
[0055] 2. The total loss, composed of the classification loss, regression loss, and mask loss mentioned above, also converges markedly, as shown in Figure 4.
[0056] 3. Algorithm performance is compared under the same experimental conditions so that the results are presented fairly. Model performance is tested on the COCO dataset, a key benchmark for instance segmentation. Both the new and old models are loaded with the same Mask R-CNN weights and then fine-tuned for the same number of rounds on the dataset. Table 1 shows the comparative performance of Mask R-CNN and RSMask R-CNN on the COCO validation set.
[0057] Table 1 Comparison of Experimental Results
[0058]
[0059] Example: The target detection and instance segmentation method based on recursive feature pyramids and switchable dilated convolution of the present invention includes the following steps:
[0060] 1) Using a typical RGB three-channel image as the object, we input it into the algorithm network and load the pre-trained weight parameters of the network.
[0061] 2) Input the image into a backbone residual network containing switchable dilated convolutions for multi-stage feature extraction, extracting different levels of image information at each stage.
[0062] The backbone network consists of four stages. The feature maps output from each stage are taken, resulting in four feature maps of different sizes and orders, which are then used as input to the next part, the neck network. The feature maps can be represented as $x_i$, $i = 1, 2, 3, 4$, where $i$ denotes the stage to which the feature map belongs. A larger $i$ means that the feature map was generated at a deeper stage.
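The text does not spell out how a switchable dilated convolution operates. In the DetectoRS formulation that it builds on, the same input is convolved at two dilation rates and the two results are blended by a position-dependent switch with values between 0 and 1. The 1-D sketch below illustrates that idea only; sharing one kernel between both branches, the dilation rate of 3, and the hand-set switch values are simplifying assumptions rather than details taken from this document.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1-D cross-correlation with zero padding and the given dilation."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - half + k * dilation  # input index seen by kernel tap k
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def switchable_conv1d(signal, kernel, switch, rate=3):
    """Blend a dilation-1 branch and a dilation-`rate` branch with a per-position switch."""
    small = dilated_conv1d(signal, kernel, 1)
    large = dilated_conv1d(signal, kernel, rate)
    return [s * a + (1 - s) * b for s, a, b in zip(switch, small, large)]

signal = [0, 0, 0, 1, 0, 0, 0]
kernel = [1, 1, 1]
print(switchable_conv1d(signal, kernel, [1.0] * 7))  # only the small receptive field
print(switchable_conv1d(signal, kernel, [0.0] * 7))  # only the large receptive field
print(switchable_conv1d(signal, kernel, [0.5] * 7))  # equal mix of both
```

A switch near 1 favours the small receptive field and a switch near 0 the large one, which is how such a layer can adapt to objects of different scales.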
[0063] 3) Using the extracted four-stage feature maps as input, the recursive feature pyramid first performs top-down feature fusion:
[0064] The highest layer, $x_4$, remains unchanged; the features of each other layer are obtained by fusing the current layer's input $x_i$ with the higher-level input $x_{i+1}$ through interpolation scaling and addition at corresponding positions. This fusion relationship, which yields the top-down feature fusion results, can be expressed as:
[0065]
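One form of the fusion relation that is consistent with the description is given below. It is offered as an assumption, since the relation itself is not reproduced here. $\mathrm{Up}(\cdot)$ is a symbol introduced for illustration, denoting the interpolation that rescales a map to the size of the level below it, and $x_i$ is reused for the fused result as in the surrounding text:

```latex
x_i \leftarrow
\begin{cases}
x_i + \mathrm{Up}(x_{i+1}), & i = 1, 2, 3, \\
x_4, & i = 4.
\end{cases}
```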
[0066] The top-down fused features $x_i$ are then returned to the preceding bottom-up feature extraction module, where feature extraction is performed again. Let $B(x)$ represent this feature extraction operation. The feature map $x_i$ after feature extraction can then be expressed as:
[0067] $x_i = B(x_i)$ (2)
[0068] The feature map that has undergone further feature extraction is then input again into the top-down feature fusion module for fusion, resulting in a new feature map that has undergone further feature extraction and fusion, as shown in Figure 5, where solid arrows represent the feature extraction process, dashed lines represent the feature pyramid network structure, and dotted lines represent the feedback connections of the recursive structure.
[0069] 4) Generate candidate regions from the feature maps of each of the above stages:
[0070] Candidate regions are generated by a candidate region generation network. This network takes a feature map as input and generates anchor boxes of different sizes and dimensions at different locations on the feature map at different stages. Features are extracted using operations such as convolution, activation, and pooling. Foreground and background classification is performed on each anchor box, and class confidence is calculated. Non-foreground anchor boxes are filtered out based on a confidence threshold. The remaining foreground anchor boxes undergo target class prediction and bounding box prediction to obtain the target class information and bounding box location information of the anchor boxes, as shown in Figure 6.
[0071] 5) The bounding box positions generated by the RPN network are generally floating-point values, which contain decimals. Therefore, quantization errors will occur when mapping the RPN region proposals to the feature map. ROIAlign is used to align the candidate boxes to the feature map to reduce this quantization error and obtain the Regions of Interest (ROIs).
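To make the quantization point concrete, the sketch below pools a floating-point box from a small feature map by bilinear interpolation instead of rounding its coordinates to integer cells. It is a simplified illustration: one sample is taken at the centre of each output bin (ROIAlign implementations typically average several samples per bin), the box is assumed to be already expressed in feature-map coordinates, and the 4x4 map and box values are made up.

```python
from math import floor

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at a fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(floor(y)), int(floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, box, out_size=2):
    """Pool a float box (x1, y1, x2, y2) into an out_size x out_size grid without rounding."""
    x1, y1, x2, y2 = box
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    return [[bilinear_sample(feat, y1 + (r + 0.5) * bin_h, x1 + (c + 0.5) * bin_w)
             for c in range(out_size)]
            for r in range(out_size)]

# A 4x4 map whose value at (row, col) is 4 * row + col.
feat = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled = roi_align(feat, (0.4, 0.4, 2.4, 2.4))
print(pooled)
```

Because the sample points stay at fractional positions such as (0.9, 0.9), the pooled values reflect the exact box location. Rounding the box to whole cells first would shift every sample and introduce the quantization error mentioned above.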
[0072] 6) The obtained Regions of Interest (ROIs) are fed into the object detection branch and the mask prediction branch, respectively. The object detection branch will predict the instance class, class confidence, and instance bounding box of the feature map, while the mask branch will perform fine prediction of the instance mask.
[0073] In one embodiment of the present invention, an image target detection and instance segmentation system is provided, comprising:
[0074] The first processing module inputs the image into a backbone residual network containing switchable dilated convolutions to extract features at different stages, thereby obtaining multi-scale feature maps at different depths.
[0075] The second processing module inputs multi-scale feature maps into the neck network, performs recursive fusion of recursive feature pyramids to obtain new feature maps, and generates candidate regions to obtain regions of interest.
[0076] The prediction module takes the region of interest as input to the head network and predicts the instance class, class confidence, instance bounding box, and instance mask.
[0077] The recursive fusion of the recursive feature pyramid in the second processing module mentioned above includes:
[0078] The highest layer features remain unchanged, while the features of each other layer are obtained by fusing the input of the current layer and the input of the layer above it through interpolation scaling and addition at corresponding positions, resulting in a top-down feature fusion result.
[0079] The features fused from top to bottom are then extracted from bottom to top again to obtain a feature map with re-extracted features.
[0080] The feature maps with further extracted features are then fused from top to bottom to obtain new feature maps that have undergone further feature extraction and fusion.
[0081] In the second processing module mentioned above, candidate regions are generated, specifically as follows:
[0082] Using the new feature map as input, anchor boxes of different sizes and dimensions are generated at different locations of the feature map at different stages; features are extracted, foreground and background classification is performed on each anchor box and the category confidence is calculated, non-foreground anchor boxes are filtered out according to the confidence threshold, and target category prediction and bounding box prediction are performed on the remaining foreground anchor boxes to obtain the target category information and bounding box location information of the anchor boxes.
[0083] In the second processing module mentioned above, the acquisition of Regions of Interest (ROIs) is specifically achieved by using the ROIAlign strategy to align candidate regions onto a new feature map, thereby reducing quantization errors and obtaining the ROIs.
[0084] In the prediction module described above, the head network includes an object detection branch and a mask prediction branch. Wherein:
[0085] The object detection branch is used to predict the instance class and instance bounding box of the new feature map;
[0086] The mask branch is used for fine-grained prediction of instance masks on new feature maps.
[0087] In this embodiment, the loss function used in predicting instance categories and instance masks is CrossEntropy Loss; the loss function used in predicting instance bounding boxes is L1 Loss.
[0088] In the above embodiments, before step 1) inputting the image into the backbone residual network containing switchable dilated convolutions for feature extraction at different stages, the method further includes: loading the network pre-trained weight parameters.
[0089] The system provided in this embodiment is used to execute the above-described method embodiments. For specific processes and details, please refer to the above embodiments, which will not be repeated here.
[0090] In one embodiment of the present invention, a computing device structure is provided. This computing device can be a terminal, which may include: a processor, a communication interface, memory, a display screen, and an input device. The processor, communication interface, and memory communicate with each other via a communication bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system and a computer program. When executed by the processor, the computer program implements an image target detection and instance segmentation method. The internal memory provides an environment for the operation of the operating system and computer program in the non-volatile storage medium. The communication interface is used for wired or wireless communication with external terminals. Wireless communication can be achieved through Wi-Fi, a carrier network, NFC (Near Field Communication), or other technologies. The display screen can be a liquid crystal display or an e-ink display. The input device can be a touch layer covering the display screen, or buttons, a trackball, or a touchpad mounted on the casing of the computing device, or an external keyboard, touchpad, or mouse, etc. The processor can call logical instructions in memory to execute the following method: inputting the image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages to obtain multi-scale feature maps at different depths; inputting the multi-scale feature maps into the neck network for recursive fusion of recursive feature pyramids to obtain new feature maps, and generating candidate regions to obtain regions of interest; inputting the regions of interest into the head network for prediction of instance class, class confidence, instance bounding box, and instance mask.
[0091] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0092] Those skilled in the art will understand that the structure of the above-described computing device is only a partial structure related to the solution of this application and does not constitute a limitation on the computing device to which the solution of this application is applied. A specific computing device may include more or fewer components, or combine certain components, or have different component arrangements.
[0093] In one embodiment of the present invention, a computer program product is provided, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, when the program instructions are executed by a computer, the computer is able to perform the methods provided in the above-described method embodiments, for example including: inputting an image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages to obtain multi-scale feature maps at different depths; inputting the multi-scale feature maps into a neck network, performing recursive fusion of recursive feature pyramids to obtain new feature maps, and generating candidate regions to obtain regions of interest; inputting the regions of interest into a head network to predict instance categories, category confidence, instance bounding boxes, and instance masks.
[0094] In one embodiment of the present invention, a non-transitory computer-readable storage medium is provided, which stores server instructions that cause a computer to perform the methods provided in the above embodiments, such as: inputting an image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages to obtain multi-scale feature maps at different depths; inputting the multi-scale feature maps into a neck network, performing recursive fusion of recursive feature pyramids to obtain new feature maps, and generating candidate regions to obtain regions of interest; inputting the regions of interest into a head network to predict instance categories, category confidence, instance bounding boxes, and instance masks.
[0095] The computer-readable storage medium provided in the above embodiments has a similar implementation principle and technical effect to the above method embodiments, and will not be described again here.
[0096] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart... Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0097] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0098] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0099] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for image target detection and instance segmentation, characterized in that it comprises:
inputting an image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages, to obtain multi-scale feature maps at different depths;
inputting the multi-scale feature maps into a neck network, performing recursive fusion with a recursive feature pyramid to obtain new feature maps, and generating candidate regions to obtain regions of interest; and
inputting the regions of interest into a head network to predict instance categories, category confidence, instance bounding boxes, and instance masks;
wherein the recursive fusion of the recursive feature pyramid comprises: keeping the highest-layer features unchanged, and obtaining the features of each other layer by fusing the input of the current layer with the input of the layer above it through interpolation scaling and addition at corresponding positions, to obtain a top-down feature fusion result; extracting features again, from bottom to top, from the top-down fused features to obtain re-extracted feature maps; and fusing the re-extracted feature maps from top to bottom to obtain the new feature maps, which have undergone further feature extraction and fusion; and
wherein the generation of candidate regions comprises: taking the new feature maps as input, generating anchor boxes of different sizes and dimensions at different locations of the feature maps of different stages; extracting features, performing foreground and background classification on each anchor box, and calculating the category confidence; filtering out non-foreground anchor boxes according to a confidence threshold; and performing target category prediction and bounding box prediction on the remaining foreground anchor boxes to obtain the target category information and bounding box location information of the anchor boxes.
2. The image target detection and instance segmentation method as described in claim 1, characterized in that the acquisition of the regions of interest comprises: using the ROIAlign strategy to align the candidate regions onto the new feature maps, thereby reducing quantization error and obtaining the regions of interest.
3. The image target detection and instance segmentation method as described in claim 1, characterized in that the head network includes an object detection branch and a mask prediction branch; the object detection branch predicts the instance categories and instance bounding boxes on the new feature maps; and the mask prediction branch performs fine prediction of the instance masks on the new feature maps.
4. The image target detection and instance segmentation method as described in claim 3, characterized in that the loss function used in predicting the instance category and the instance mask is cross-entropy loss, and the loss function used in predicting the instance bounding box is L1 loss.
5. The image target detection and instance segmentation method as described in claim 1, characterized in that, before inputting the image into the backbone residual network containing switchable dilated convolutions for feature extraction at different stages, the method further comprises: loading pre-trained weight parameters of the network.
6. An image target detection and instance segmentation system for implementing the image target detection and instance segmentation method as described in any one of claims 1 to 5, characterized in that it comprises: a first processing module, configured to input the image into a backbone residual network containing switchable dilated convolutions for feature extraction at different stages, to obtain multi-scale feature maps at different depths; a second processing module, configured to input the multi-scale feature maps into the neck network, perform recursive fusion with a recursive feature pyramid to obtain new feature maps, and generate candidate regions to obtain regions of interest; and a prediction module, configured to input the regions of interest into the head network and predict the instance categories, category confidence, instance bounding boxes, and instance masks.
7. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform the method described in any one of claims 1 to 5.
8. A computing device, characterized in that it comprises: one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method described in any one of claims 1 to 5.
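For illustration only, the top-down fusion step recited in claim 1 (highest layer unchanged; each other layer formed from its own input plus the interpolated input of the layer above, added at corresponding positions) might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: feature maps are simplified to single-channel 2-D lists, nearest-neighbour interpolation stands in for the unspecified "interpolation scaling", the claim wording is read literally so that each layer is fused with the input (not the already-fused result) of the layer above, and the names `upsample_nearest` and `top_down_fuse` are hypothetical.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Resize a 2-D map (list of rows) to out_h x out_w.

    Assumption: nearest-neighbour interpolation; the source only says
    "interpolation scaling" without naming a method.
    """
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def top_down_fuse(inputs):
    """Top-down fusion over pyramid levels.

    inputs: list of 2-D maps ordered from the lowest level (largest map)
    to the highest level (smallest map).
    """
    fused = [None] * len(inputs)
    # The highest-level features are kept unchanged (copied here).
    fused[-1] = [row[:] for row in inputs[-1]]
    # Every other level: own input + rescaled input of the level above,
    # added element-wise at corresponding positions.
    for i in range(len(inputs) - 2, -1, -1):
        h, w = len(inputs[i]), len(inputs[i][0])
        up = upsample_nearest(inputs[i + 1], h, w)
        fused[i] = [
            [inputs[i][r][c] + up[r][c] for c in range(w)] for r in range(h)
        ]
    return fused


# Illustrative two-level pyramid: a 4x4 low-level map and a 2x2 high-level map.
low = [[1.0] * 4 for _ in range(4)]
high = [[10.0, 20.0], [30.0, 40.0]]
fused = top_down_fuse([low, high])
print(fused[0][0])  # [11.0, 11.0, 21.0, 21.0]
```

In the recursive scheme of claim 1, a fusion of this kind would run once, the fused maps would be passed through bottom-up feature extraction again, and the fusion would then be repeated on the re-extracted maps. The bottom-up extraction itself is a learned network and is outside the scope of this sketch.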