Pose estimation method, electronic device, and storage medium

By using a target estimation model for keypoint extraction and instance segmentation, the problem of inaccurate keypoint grouping in multi-object pose estimation is solved, thus improving the accuracy of pose estimation.

CN116797657B
Status: Active
Publication Date: 2026-06-26
Assignee: HANGZHOU KUANGYUN JINZHI TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU KUANGYUN JINZHI TECH CO LTD
Filing Date: 2023-03-20
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies for multi-object pose estimation, the high complexity of the Hungarian matching algorithm and the inaccurate grouping of key points due to object occlusion affect the accuracy of pose estimation.

Method used

A target estimation model is used for key point extraction and instance segmentation. Through the backbone network layer, key point extraction module and instance segmentation module, key point heatmap and instance segmentation results are obtained. Key points are grouped based on the mapping relationship between pixels and key points to improve the accuracy of grouping.

Benefits of technology

By using pixel-level instance segmentation and keypoint mapping, keypoints are accurately grouped, improving the accuracy of pose estimation and avoiding inaccurate grouping caused by occlusion between objects.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to a pose estimation method, an electronic device, a storage medium and a computer program product. The method comprises the following steps: acquiring a to-be-processed image; inputting the to-be-processed image into a target estimation model, performing key point extraction and instance segmentation processing on the to-be-processed image through the target estimation model, and obtaining a key point heat map and an instance segmentation result of the to-be-processed image; the instance segmentation result represents objects to which each pixel point belongs; based on the mapping relationship between each pixel point and each key point contained in the key point heat map and the instance segmentation result of the to-be-processed image, the objects to which each key point belongs are determined; and based on the key points corresponding to each object, the pose estimation processing is performed on the objects respectively, and the pose estimation result of each object is obtained. The method improves the accuracy of the pose estimation result.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to a pose estimation method, electronic device, and storage medium.

Background Technology

[0002] With its continued development, computer vision technology has been widely used in various fields, such as pose estimation.

[0003] When performing pose estimation, for the case of multiple objects in the same image, it is usually necessary to obtain all the key points in the image, and then group all the key points according to the grouping criteria of key points corresponding to each object, and perform pose estimation of the object based on each key point group.

[0004] However, key point grouping currently relies on methods such as Hungarian matching. These methods are highly complex, and the accuracy of key point grouping is easily degraded when multiple objects occlude each other, which in turn affects the accuracy of object pose estimation.

Summary of the Invention

[0005] Therefore, it is necessary to provide a pose estimation method, an electronic device, a computer-readable storage medium, and a computer program product to address the aforementioned technical problems.

[0006] In a first aspect, this application provides a pose estimation method. The method includes:

[0007] Obtain the image to be processed;

[0008] The image to be processed is input into a target estimation model, and the target estimation model is used to extract key points and segment instances of the image to be processed to obtain a key point heatmap and instance segmentation results of the image to be processed; the instance segmentation results represent the object to which each pixel belongs.

[0009] Based on the mapping relationship between each pixel and each key point contained in the key point heatmap, and the instance segmentation result, the object to which each key point belongs is determined;

[0010] The pose estimation process is performed on each object based on its corresponding key points to obtain the pose estimation results for each object.

[0011] In one embodiment, the target estimation model includes a backbone network layer, a key point extraction module, and an instance segmentation module; the step of performing key point extraction and instance segmentation processing on the image to be processed using the target estimation model to obtain a key point heatmap and the instance segmentation result of the image to be processed includes:

[0012] Feature images are obtained by extracting features from the image to be processed through the backbone network layer.

[0013] The key point extraction module performs key point feature extraction processing on the feature image to obtain the key point heatmap, and the instance segmentation module performs image segmentation processing on the feature image to obtain the instance segmentation result.

[0014] In one embodiment, the step of performing key point feature extraction processing on the feature image through the key point extraction module to obtain the key point heatmap includes:

[0015] Key points are extracted from the feature image to obtain an initial key point heatmap;

[0016] The initial key point heatmap and the feature image are concatenated to obtain a first fused feature image;

[0017] The first fused feature image is upsampled to obtain a key point heatmap.

[0018] In one embodiment, the step of performing image segmentation processing on the feature image through the instance segmentation module to obtain the instance segmentation result of the image to be processed includes:

[0019] The center coordinate information of each object contained in the feature image is determined, and the center coordinate information of each object is added to the pixel position corresponding to each object in the feature image to obtain a second fused feature image; the center coordinate information of the objects is used to identify the objects to which they belong.

[0020] For each pixel in the second fused feature image that has had the object center coordinate information added, perform object classification and discrimination, and output the probability of each pixel corresponding to the classification category of each object instance;

[0021] Based on the probability of each pixel corresponding to each of the objects, the target object to which the pixel belongs is determined.

[0022] In one embodiment, determining the object to which each key point belongs based on the mapping relationship between each pixel and each key point contained in the key point heatmap and the instance segmentation result includes:

[0023] Based on the mapping relationship between the target pixel and each key point contained in the key point heatmap and the instance segmentation result of the target pixel, the key points corresponding to the target pixel contained in the target mapping relationship are classified and a classification label is added to the key points.

[0024] Key points with the same classification label are grouped together as the same key point group.

[0025] In one embodiment, before classifying the key points corresponding to the target pixel in the mapping relationship based on the mapping relationship between the target pixel and each key point contained in the key point heatmap and the instance segmentation result of the target pixel, the method further includes:

[0026] Based on the masking rules, determine the key points in the key point heatmap that correspond to each instance segmentation result;

[0027] Determine the position coordinates of the key points corresponding to each instance segmentation result;

[0028] For the position coordinates of each key point contained in the key point heatmap, a target pixel with the same position coordinates as each key point is determined among each pixel, and a mapping relationship between the target pixel and the key point is established.

[0029] In one embodiment, the target estimation model is trained through the following process:

[0030] Obtain training samples, which include sample images, a reference keypoint heatmap corresponding to the sample images, and instance classification labels corresponding to the sample images; the instance classification labels are used to characterize the object to which a pixel belongs.

[0031] The training samples are input into the key point extraction module and the instance segmentation module respectively, and the key point heatmap corresponding to the sample image and the instance segmentation result corresponding to each pixel are output.

[0032] According to the loss algorithm, a first loss value is calculated between the key point heatmap and the reference key point heatmap, and a second loss value is calculated between the instance segmentation result of each pixel and the object to which it belongs as represented by the instance classification label;

[0033] The model accuracy of the key point extraction module and the instance segmentation module is determined based on the first loss threshold, the second loss threshold, the first loss value, and the second loss value. The training of the key point extraction module and the instance segmentation module is considered complete when the model accuracy of both the key point extraction module and the instance segmentation module meets the accuracy conditions.
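The training check in the steps above can be sketched as follows. The text does not name the loss algorithm or the accuracy conditions, so everything specific here is an assumption made for illustration: mean squared error for the first loss, cross-entropy for the second loss, a stopping rule that requires both losses to fall below their thresholds, and all function names.

```python
import numpy as np

def heatmap_loss(pred, ref):
    """First loss: mean squared error between predicted and reference heatmaps (assumed)."""
    return float(np.mean((pred - ref) ** 2))

def segmentation_loss(probs, labels):
    """Second loss: mean cross-entropy of the true object per pixel (assumed).

    probs has shape (num_pixels, num_objects); labels holds zero-based object indices.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def training_done(loss1, loss2, threshold1, threshold2):
    """Assumed stopping rule: both losses fall below their thresholds."""
    return loss1 < threshold1 and loss2 < threshold2

# Toy heatmap prediction and reference.
pred = np.array([[0.9, 0.1], [0.2, 0.8]])
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
loss1 = heatmap_loss(pred, ref)          # (0.01 + 0.01 + 0.04 + 0.04) / 4 = 0.025

# Toy per-pixel object probabilities and instance classification labels.
probs = np.array([[0.2, 0.75, 0.05], [0.6, 0.3, 0.1]])
labels = np.array([1, 0])
loss2 = segmentation_loss(probs, labels)

done = training_done(loss1, loss2, threshold1=0.05, threshold2=0.5)
```

Keeping the two losses separate mirrors the text, which evaluates the key point extraction module and the instance segmentation module against their own thresholds.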

[0034] In a second aspect, this application also provides an electronic device. The electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the pose estimation method described in the first aspect.

[0035] In a third aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, implements the pose estimation method described in the first aspect.

[0036] In a fourth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the pose estimation method described in the first aspect.

[0037] In the pose estimation method provided in this application embodiment, an image to be processed is acquired; the image to be processed is input into a target estimation model, and the target estimation model performs keypoint extraction and instance segmentation processing on the image to be processed to obtain a keypoint heatmap and instance segmentation results of the image to be processed; the instance segmentation results represent the object to which each pixel belongs; based on the mapping relationship between each pixel and each keypoint contained in the keypoint heatmap and the instance segmentation results, the object to which each keypoint belongs is determined; based on each keypoint corresponding to each object, pose estimation processing is performed on the object to obtain the pose estimation results of each object. Using this method, by extracting keypoints and performing pixel-level instance segmentation on the image to be processed, a keypoint heatmap and instance segmentation results of the image to be processed are obtained. Based on the positional mapping relationship between each keypoint and pixel, and the instance segmentation results, keypoints are accurately grouped, avoiding the problem of inaccurate keypoint grouping caused by mutual occlusion between objects in the image to be processed, thereby improving the accuracy of pose estimation results.

Attached Figure Description

[0038] Figure 1 is a flowchart of the pose estimation method in one embodiment;

[0039] Figure 2 is a schematic diagram of the algorithm framework of the target estimation model in one embodiment;

[0040] Figure 3 is a flowchart of the image processing steps in one embodiment;

[0041] Figure 4 is an algorithm framework diagram of the key point extraction module in one embodiment;

[0042] Figure 5 is a flowchart of the key point extraction method in one embodiment;

[0043] Figure 6 is an algorithm framework diagram of the instance segmentation module in one embodiment;

[0044] Figure 7 is a flowchart of the instance segmentation method in one embodiment;

[0045] Figure 8 is a flowchart of the key point grouping steps in one embodiment;

[0046] Figure 9 is a flowchart of establishing the mapping relationship between target pixels and key points in one embodiment;

[0047] Figure 10 is a training flowchart for the key point extraction module and the instance segmentation module in one embodiment;

[0048] Figure 11 is a structural block diagram of a pose estimation device in one embodiment;

[0049] Figure 12 is a diagram of the internal structure of an electronic device in one embodiment.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0051] In applications of computer vision technology, acquired image data is typically processed using machine vision techniques such as recognition and measurement, so that electronic devices can simulate the human eye in analyzing and recognizing target objects within the image data. Analyzing and recognizing a target object in image data includes identifying the pose of that object. Therefore, as shown in Figure 1, this application provides a pose estimation method, which is illustrated by taking the application of this method to an electronic device as an example. The method includes the following steps:

[0052] Step 102: Obtain the image to be processed.

[0053] In practice, in scenarios where pose estimation is performed on a target object in image data, the electronic device acquires an image to be processed. This image can be image data acquired in real time by a camera device (e.g., a webcam), or image data pre-stored in the electronic device's memory. The data source and timeliness of the image to be processed can be determined according to actual needs, and this application embodiment does not impose any limitations on this.

[0054] Step 104: Input the image to be processed into the target estimation model. The target estimation model performs key point extraction and instance segmentation on the image to be processed to obtain the key point heatmap and the instance segmentation result of the image to be processed.

[0055] The instance segmentation result represents the object to which each pixel belongs.

[0056] In implementation, the target estimation model is deployed on an electronic device. This model contains different functional layers and sub-models (also called modules) for processing image data in different dimensions. The target estimation model has been trained using training samples and has established an artificial intelligence learning mechanism for extracting feature information from image data for analysis and recognition. Therefore, when processing an image, the electronic device runs this target estimation model to extract key points from the image, obtaining a key point heatmap containing all key points, and performs pixel-level instance segmentation of the image data to obtain the instance segmentation result of the image to be processed.

[0057] Step 106: Based on the mapping relationship between each pixel and each key point contained in the key point heatmap and the instance segmentation result, determine the object to which each key point belongs.

[0058] In practice, since keypoint extraction and instance segmentation are performed on the same image, each keypoint in the keypoint heatmap has a corresponding pixel (i.e., the pixel at the same location), and each pixel has an instance segmentation result after instance segmentation. Therefore, the target estimation model groups the keypoints based on the instance segmentation result of the pixel and the mapping relationship between the pixel and the keypoint, thus determining the object to which each keypoint belongs.
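The lookup described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `instance_map`, `keypoints`, and `assign_objects` are hypothetical, and the instance segmentation result is assumed to be an integer object id per pixel at the same resolution as the key point heatmap.

```python
import numpy as np

# Hypothetical 4x4 instance segmentation result: each entry is the object id of that pixel.
instance_map = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
])

# Hypothetical key points given as (row, col) positions taken from the heatmap.
keypoints = [(0, 0), (1, 3), (3, 2)]

def assign_objects(keypoints, instance_map):
    """Return the object id of the pixel located at each key point position."""
    return [int(instance_map[r, c]) for r, c in keypoints]

owners = assign_objects(keypoints, instance_map)
```

Each key point costs a single array lookup, so no pairwise matching between key points is needed.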

[0059] Step 108: Perform pose estimation processing on each object based on the key points corresponding to each object to obtain the pose estimation results of each object.

[0060] In implementation, based on the key points corresponding to each object, pose estimation processing is performed on each object (this process can be executed by the target estimation model) to obtain the pose estimation result of the object corresponding to each key point group. This pose estimation result can characterize the two-dimensional pose of the object. The specific content of the pose estimation result can be determined based on the discriminant layer of the target estimation model; however, this application embodiment does not limit the pose estimation result.

[0061] In the above pose estimation method, the electronic device acquires the image to be processed; the image to be processed is input into the target estimation model, and the target estimation model performs key point extraction and instance segmentation processing on the image to be processed to obtain a key point heatmap and instance segmentation results of the image to be processed; the instance segmentation results represent the object to which each pixel belongs; based on the mapping relationship between each pixel and each key point contained in the key point heatmap and the instance segmentation results of the image to be processed, the object to which each key point belongs is determined; based on the key points corresponding to each object, pose estimation processing is performed on the object to obtain the pose estimation results of each object. This method, by extracting key points and performing pixel-level instance segmentation on the image to be processed, obtains a key point heatmap and instance segmentation results of the image to be processed, and accurately groups key points based on the positional mapping relationship between each key point and pixel, as well as the instance segmentation results of the image to be processed, avoiding the problem of inaccurate key point grouping caused by mutual occlusion between objects in the image to be processed, thereby improving the accuracy of pose estimation results.

[0062] In one embodiment, Figure 2 shows the algorithm framework of the target estimation model, which comprises a backbone network layer, a keypoint extraction module, and an instance segmentation module. The functional layer (the backbone network layer) and the functional modules (the keypoint extraction module and the instance segmentation module) of the target estimation model perform keypoint extraction and instance segmentation on the image to be processed. As shown in Figure 3, the specific processing steps of step 104 include:

[0063] Step 302: Extract features from the image to be processed through the backbone network layer to obtain a feature image.

[0064] In implementation, feature images are obtained by extracting features from the image to be processed through the backbone network layer in the target estimation model. These feature images may, but are not limited to, include image attributes and content information such as grayscale values, brightness, edges, texture, and color of the image to be processed. This application embodiment does not limit the feature information contained in the feature images.

[0065] Step 304: The key point extraction module performs key point feature extraction processing on the feature image to obtain a key point heatmap, and the instance segmentation module performs image segmentation processing on the feature image to obtain the instance segmentation result of the image to be processed.

[0066] In implementation, since the keypoint extraction module and instance segmentation module in the target estimation model are parallel processing logics, after the backbone network layer extracts features from the image to be processed to obtain a feature image, the backbone network layer can transmit the feature image to the keypoint extraction module and the instance segmentation module respectively. Then, the keypoint extraction module extracts keypoint features from the feature image to obtain a keypoint heatmap, and the instance segmentation module performs image segmentation processing on the feature image to obtain the instance segmentation result of the image to be processed.

[0067] In this embodiment, based on the algorithm framework of the target estimation model, the backbone network layer first extracts the feature image, and the key point extraction module and the instance segmentation module then each process that feature image to obtain a key point heatmap and pixel-level instance segmentation results. Grouping key points based on the obtained key point heatmap and instance segmentation results can improve the pose estimation accuracy for the image to be processed.

[0068] In one embodiment, Figure 4 shows the algorithm framework of the keypoint extraction module. Specifically, the keypoint extraction module includes a first convolutional layer and a deconvolutional layer, which are executed sequentially and process the feature image in turn. As shown in Figure 5, the specific processing steps in step 304, in which the key point extraction module performs key point feature extraction processing on the feature image to obtain the key point heatmap, include the following:

[0069] Step 502: Extract key points from the feature image to obtain an initial key point heatmap.

[0070] In implementation, key points are extracted from the feature image through the first convolutional layer (conv block) in the key point extraction module to obtain an initial key point heatmap. The output dimension of this initial key point heatmap can be 128×128, and it contains all the key points of each object in the feature image.

[0071] Step 504: Concatenate the initial key point heatmap with the feature image to obtain the first fused feature image.

[0072] In implementation, to enhance the key point information contained in the initial keypoint heatmap, the keypoint extraction module enhances the initial keypoint heatmap using the feature image output by the backbone network layer. Specifically, after the first convolutional layer outputs the initial keypoint heatmap, the concat unit that follows the first convolutional layer in the keypoint extraction module concatenates the initial keypoint heatmap with the feature image to obtain a first fused feature image. This first fused feature image contains not only all the feature information in the feature image but also the keypoint information contained in the initial keypoint heatmap.

[0073] Step 506: Upsample the first fused feature image to obtain a key point heatmap.

[0074] In implementation, in the keypoint extraction module, after the concat unit outputs the first fused feature image, that image is input into the deconvolution block (the deconvolutional layer). The deconvolution block upsamples the first fused feature image to obtain the final keypoint heatmap. Specifically, the deconvolution block interpolates the first fused feature image based on the feature information contained in the feature image, thereby achieving upsampling and obtaining an enhanced keypoint heatmap. This keypoint heatmap has a larger image size and enhanced image clarity; for example, it is enlarged from a 128×128 image to a 256×256 image. The keypoint heatmap can contain 17 channels, meaning 17 keypoints are extracted for each object, with each channel corresponding to one keypoint. This application embodiment does not limit the number of channels in the keypoint heatmap.
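The shape flow of steps 502 to 506 can be sketched with NumPy arrays. This is only an illustration under stated assumptions: the channel count `C` of the feature image is not given in the text, random arrays stand in for the real layer outputs, nearest-neighbour repetition stands in for the learned deconvolution, and slicing stands in for the channel projection a real deconvolution would perform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: C backbone channels, K = 17 key point channels, 128x128 resolution.
C, K, H, W = 8, 17, 128, 128
feature_image = rng.random((C, H, W))
initial_heatmap = rng.random((K, H, W))  # stand-in for the first convolutional layer output

# Step 504: concatenate the initial heatmap and the feature image along the channel axis.
fused = np.concatenate([initial_heatmap, feature_image], axis=0)   # (K + C, 128, 128)

# Step 506: upsample by 2x. Repeating each pixel is a stand-in for the deconvolution block.
upsampled = fused.repeat(2, axis=1).repeat(2, axis=2)              # (K + C, 256, 256)

# Keep the K key point channels; a learned layer would instead map K + C channels to K.
keypoint_heatmap = upsampled[:K]                                   # (17, 256, 256)
```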

[0075] In this embodiment, the key point extraction module in the target estimation model is pre-set with an algorithm framework including a first convolutional layer and a deconvolutional layer. The first convolutional layer and the deconvolutional layer in the algorithm framework are used to extract key points at the pixel level from the feature image to obtain a key point heatmap.

[0076] In one embodiment, Figure 6 shows the algorithm framework of the instance segmentation module, which includes a coordinate convolutional layer and a second convolutional layer. This module performs image segmentation of the objects within the image to be processed. As shown in Figure 7, the specific processing steps in step 304, in which the instance segmentation module performs image segmentation processing on the feature image to obtain the instance segmentation result of the image to be processed, include the following:

[0077] Step 702: Determine the object center coordinate information of each object contained in the feature image, and add the object center coordinate information to the corresponding pixel position of each object in the feature image to obtain the second fused feature image.

[0078] The object center coordinate information identifies the corresponding object. The object center coordinates of each object are obtained by taking the midpoint of the position coordinates of all the key points of that object.

[0079] In implementation, the instance segmentation module uses a coordinate convolutional layer to extract the object center coordinates of each object in the feature image. The coordinate convolutional layer then adds this object center coordinate information to the pixel positions corresponding to each object in the feature image, generating a second fused feature image. That is, the second fused feature image contains not only image features but also the object center coordinate information used for pixel classification. Specifically, the object center coordinates are calculated as follows: if each object corresponds to 17 key points (these can be, but are not limited to, the key points extracted during key point extraction), and each key point has position coordinates determined in the image coordinate system, then the center coordinate value of the object in the pose represented by these 17 key points is calculated from their position coordinates, yielding the object center coordinate information. Each center coordinate value (i.e., object center coordinate information) can be used to distinguish the object to which it belongs. The object center coordinate information includes a horizontal coordinate and a vertical coordinate, which can be used to distinguish objects in the horizontal and vertical directions of the image, respectively. That is, objects are classified once according to the horizontal coordinate and once according to the vertical coordinate, and the object to which each pixel belongs is then determined from the classification results in both directions. This application embodiment does not limit this.
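A minimal sketch of the center calculation follows. The text only says the center is the midpoint of the key point coordinates, so interpreting it as the arithmetic mean is an assumption (the midpoint of the key points' bounding box would be another reading). The function name `object_center` is hypothetical, and four key points are used instead of 17 to keep the example short.

```python
import numpy as np

def object_center(keypoints_xy):
    """Center of one object, taken here as the mean of its key point coordinates."""
    pts = np.asarray(keypoints_xy, dtype=float)   # shape (num_keypoints, 2), as (x, y)
    return pts.mean(axis=0)

# Hypothetical object described by four key points in image coordinates.
kps = [(10, 20), (30, 20), (10, 60), (30, 60)]
cx, cy = object_center(kps)
```

For these four points the horizontal coordinate is 20 and the vertical coordinate is 40; the two values are the pair the text uses to tell objects apart along each axis.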

[0080] Step 704: Perform object classification and discrimination on each pixel in the second fused feature image that has had object center coordinate information added, and output the probability of each pixel corresponding to the classification category of each object instance.

[0081] In implementation, the second fused feature image is input into the second convolutional layer of the instance segmentation module. This second convolutional layer classifies each pixel in the second fused feature image by object and outputs the probability of each pixel corresponding to the classification category of each object instance. For example, if the coordinate convolutional layer obtains the center coordinates of three objects from the feature image corresponding to the image to be processed, the image to be processed contains three objects. Therefore, when the second convolutional layer classifies each pixel in the second fused feature image, it can output the probability of each pixel corresponding to each of these three objects.

[0082] Step 706: Determine the target object to which each pixel belongs based on the probability of each object corresponding to each pixel.

[0083] In implementation, for each pixel, the instance segmentation module takes the probabilities of that pixel for each object and determines the object with the highest probability as the target object to which the pixel belongs. Then, the instance segmentation module adds a label representing the target object to the pixel, as the instance segmentation result for that pixel. For example, if the image to be processed contains three objects (object 1, object 2, and object 3), and the classification probabilities of pixel A for these three objects are 20%, 75%, and 5%, respectively, then the instance segmentation module determines the object with the highest probability of 75% (i.e., object 2) as the target object to which pixel A belongs. Then, the instance segmentation module adds a label representing the classification category of object 2 to pixel A. In the same way, it determines the target objects of the other pixels in the feature image and adds the corresponding label to each pixel, thus obtaining the instance segmentation result for the image to be processed.
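The highest-probability rule can be sketched with an argmax over the per-pixel probabilities. The array layout (one row per pixel, one column per object) and the second row of values are assumptions made for illustration; the first row reproduces pixel A from the example above.

```python
import numpy as np

# Per-pixel probabilities for three objects; rows are pixels, columns are objects 1..3.
probs = np.array([
    [0.20, 0.75, 0.05],   # pixel A from the text
    [0.60, 0.30, 0.10],   # a second, hypothetical pixel
])

# argmax returns a zero-based column index, so add 1 to get the object number.
labels = probs.argmax(axis=1) + 1
```

Here pixel A is labelled with object 2, matching the worked example in the text.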

[0084] Optionally, the label representing the target object can be the center coordinates of the object to which it belongs, which is not limited in this embodiment.

[0085] In this embodiment, the instance segmentation module performs pixel-level instance segmentation on the feature image extracted from the image to be processed, determines the target object to which each pixel belongs, and obtains the instance segmentation result of the image to be processed. This allows for key point grouping based on the instance segmentation result of the image to be processed, thereby improving the accuracy of key point grouping.

[0086] In one embodiment, as shown in Figure 8, the specific processing steps in step 106, in which the object to which each key point belongs is determined based on the mapping relationship between each pixel and each key point contained in the key point heatmap and the instance segmentation result of the image to be processed, include:

[0087] Step 802: Based on the mapping relationship between the target pixel and each key point contained in the key point heatmap, and the instance segmentation result of the target pixel, classify the key points corresponding to the target pixel in the mapping relationship and add classification labels to the key points.

[0088] In implementation, the target estimation model classifies the key points corresponding to the target pixels based on the mapping relationship between key points and target pixels, as well as the instance segmentation results of the target pixels, and adds a classification label to each key point. For example, if key point 'a' has a mapping relationship with target pixel 'A', and the instance segmentation result corresponding to target pixel 'A' is object 2, then, based on this mapping relationship and the instance segmentation result of target pixel 'A', the instance segmentation module determines that the object corresponding to key point 'a' is object 2 and adds a classification label representing the classification category of object 2 to this key point. For example, the classification label could be the center coordinates of object 2.

[0089] Step 804: Group key points with the same classification label into the same key point group.

[0090] In implementation, the target estimation model uses a grouping algorithm to place key points with the same classification label into the same key point group, meaning that the key points in a group belong to the same object classification category. This realizes pixel-level classification of the key points in the key point heatmap.

[0091] In this embodiment, by establishing a mapping relationship between key points and pixels, key points can be grouped based on the instance segmentation results of pixels with which they are mapped. Since there is no overlap or occlusion between pixels, grouping key points based on the instance segmentation results of pixels improves the accuracy of key point grouping.
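As a non-limiting illustration, steps 802 and 804 can be sketched as two dictionary passes. The names `group_keypoints`, `keypoint_to_pixel`, and `pixel_to_object` are hypothetical, and the key points 'b' and 'c' and pixels 'B' and 'C' are invented to show more than one group; only the pairing of key point 'a' with target pixel 'A' and object 2 comes from the example above.

```python
from collections import defaultdict


def group_keypoints(keypoint_to_pixel, pixel_to_object):
    """Group key points by the object label of their mapped target pixel."""
    # Step 802: each key point inherits the label of its target pixel.
    keypoint_labels = {
        kp: pixel_to_object[pixel] for kp, pixel in keypoint_to_pixel.items()
    }
    # Step 804: key points sharing a label form one key point group.
    groups = defaultdict(list)
    for kp, label in keypoint_labels.items():
        groups[label].append(kp)
    return keypoint_labels, dict(groups)


keypoint_to_pixel = {"a": "A", "b": "B", "c": "C"}
pixel_to_object = {"A": "object 2", "B": "object 1", "C": "object 2"}
keypoint_labels, groups = group_keypoints(keypoint_to_pixel, pixel_to_object)
print(groups)
```

Because each key point is labeled by a single lookup, the cost of grouping in this sketch grows linearly with the number of key points.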

[0092] In one embodiment, as shown in Figure 9, prior to step 802, the method further includes:

[0093] Step 902: Based on the masking rule, determine the key points corresponding to each instance segmentation result in the key point heatmap.

[0094] In implementation, the target estimation model determines the key points corresponding to each instance segmentation result in the key point heatmap according to a preset masking rule. Specifically, the key point selection condition represented by the preset masking rule is to select only one target key point at each joint included in each instance segmentation result, and this target key point is the brightest point in the heatmap at that joint.

[0095] Step 904: Determine the position coordinates of the key points corresponding to each instance segmentation result.

[0096] In implementation, the target estimation model determines the position coordinates of each key point selected for each instance segmentation result.

[0097] Step 906: For the position coordinates of each key point contained in the key point heatmap, determine, among the pixels, the target pixel with the same position coordinates as the key point, and establish the mapping relationship between the target pixel and the key point.

[0098] In implementation, key point extraction and pixel-level instance segmentation are both performed on the feature image of the same image to be processed. Therefore, the position of each extracted key point corresponds to the position of a certain pixel in the feature image. For each key point in the key point heatmap, the target estimation model determines the pixel corresponding to the key point based on the position coordinates of the key point and takes that pixel as the target pixel, thereby establishing the mapping relationship between the target pixel and the key point.
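As a non-limiting illustration, steps 902 to 906 can be sketched as below. The sketch assumes one heatmap channel per joint and an instance label grid of the same size, neither of which is fixed by this embodiment; the function name `select_keypoints`, the joint name `left_wrist`, and all numeric values are hypothetical.

```python
def select_keypoints(joint_heatmaps, instance_labels):
    """Pick, per instance and joint, the brightest heatmap position.

    joint_heatmaps: {joint_name: 2D list of scores}, one heatmap per joint.
    instance_labels: 2D list of per-pixel object labels (None = background).
    Returns {(label, joint_name): (row, col)}.
    """
    best = {}  # (label, joint) -> (score, (row, col))
    for joint, heatmap in joint_heatmaps.items():
        for r, row in enumerate(heatmap):
            for c, score in enumerate(row):
                label = instance_labels[r][c]
                if label is None:
                    continue
                key = (label, joint)
                # Masking rule: keep only the brightest point per joint.
                if key not in best or score > best[key][0]:
                    best[key] = (score, (r, c))
    return {key: pos for key, (_, pos) in best.items()}


instance_labels = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
]
joint_heatmaps = {
    "left_wrist": [
        [0.1, 0.9, 0.2, 0.3],
        [0.0, 0.4, 0.8, 0.1],
    ],
}
# Steps 902 and 904: selected key points and their position coordinates.
selected = select_keypoints(joint_heatmaps, instance_labels)
# Step 906: the target pixel is the pixel at the same coordinates.
target_labels = {pos: instance_labels[pos[0]][pos[1]] for pos in selected.values()}
print(selected, target_labels)
```

In this sketch the mapping relationship is simply the shared (row, column) coordinate, so looking up the object of a key point is a single index into the label grid.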

[0099] In one embodiment, the target estimation model needs to be trained before pose estimation. Because the key point extraction module and the instance segmentation module included in the target estimation model process image data in parallel, they can be trained simultaneously. As shown in Figure 10, the target estimation model is trained through the following process:

[0100] Step 1002: Obtain training samples.

[0101] The training samples include sample images, heatmaps of reference keypoints corresponding to the sample images, and instance classification labels corresponding to the sample images. These instance classification labels characterize the object to which a pixel belongs. Optionally, the instance classification labels of the sample images in the training samples can, but are not limited to, be obtained through manual annotation.

[0102] In implementation, the electronic device acquires training samples, which are used to train the key point extraction module and instance segmentation module in the target estimation model.

[0103] Step 1004: Input the training samples into the key point extraction module and the instance segmentation module respectively, and output the key point heatmap corresponding to the sample image and the instance segmentation result corresponding to the image to be processed.

[0104] In implementation, training samples are input into the target estimation model. Through the backbone network layer in the target estimation model, feature extraction is performed on the sample images in the training samples to obtain the feature images corresponding to the sample images. Then, the feature images are input into the key point extraction module and the instance segmentation module respectively. The key point extraction module and the instance segmentation module process the feature images respectively and output the key point heatmap corresponding to the sample image and the instance segmentation result of the image to be processed corresponding to the sample image.

[0105] Specifically, the processing procedures of the key point extraction module and the instance segmentation module are similar to those in steps 502 to 505 and steps 702 to 706 of the above embodiments, respectively, and are not repeated in this embodiment.

[0106] Step 1006: According to the loss algorithm, calculate the first loss value between the key point heatmap and the reference key point heatmap, and the second loss value between the instance segmentation result of the image to be processed and the object represented by the instance classification label.

[0107] In implementation, the target estimation model has a pre-set loss algorithm (or loss function) for loss calculation. After the key point extraction module outputs the key point heatmap of the sample image and the instance segmentation module outputs the instance segmentation result of the image to be processed corresponding to the sample image, the target estimation model performs loss calculation. According to the loss algorithm, it calculates the first loss value between the output key point heatmap and the reference key point heatmap, and the second loss value between the object represented by the instance segmentation result of the image to be processed and the object represented by the pre-labeled instance classification label.

[0108] Step 1008: Determine the model accuracy of the key point extraction module and the instance segmentation module based on the first loss threshold, the second loss threshold, the first loss value, and the second loss value. When the model accuracy of both the key point extraction module and the instance segmentation module meets the accuracy conditions, the training of the key point extraction module and the instance segmentation module is considered complete.

[0109] In implementation, when training the keypoint extraction module and the instance segmentation module, the target estimation model also presets a first loss threshold and a second loss threshold. The first loss threshold is used to determine the model accuracy of the keypoint extraction module, and the second loss threshold is used to determine the model accuracy of the instance segmentation module. Specifically, during model training, when the first loss value and the second loss value are obtained, the target estimation model compares the first loss threshold with the first loss value, and the second loss threshold with the second loss value, to determine the model accuracy of the keypoint extraction module and the instance segmentation module. For example, when the first loss value is less than the first loss threshold and the second loss value is less than the second loss threshold, it is determined that the model accuracy of both the keypoint extraction module and the instance segmentation module meets the accuracy condition, and therefore, it is determined that the training of the keypoint extraction module and the instance segmentation module is complete.

[0110] Optionally, if the model accuracy of the key point extraction module and the instance segmentation module fails to meet the accuracy condition, the model training of the key point extraction module and the instance segmentation module continues until the model accuracy meets the accuracy condition. Specifically, the model training process of the key point extraction module and the instance segmentation module has been described in detail in steps 1002 to 1006, and this embodiment of the application will not repeat this process.
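As a non-limiting illustration, the completion check of step 1008 can be sketched as follows. This embodiment does not name the loss algorithm, so mean squared error for the heatmaps and cross-entropy for the segmentation result are assumptions of this sketch, as are all function names, threshold values, and sample numbers.

```python
import math


def heatmap_loss(predicted, reference):
    """Mean squared error between two flattened heatmaps (assumed loss)."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def segmentation_loss(pixel_probs, pixel_labels):
    """Mean cross-entropy of the labeled object per pixel (assumed loss)."""
    return -sum(
        math.log(probs[label]) for probs, label in zip(pixel_probs, pixel_labels)
    ) / len(pixel_labels)


def training_complete(first_loss, second_loss, first_threshold, second_threshold):
    """Step 1008: each loss value must be below its own loss threshold."""
    return first_loss < first_threshold and second_loss < second_threshold


# Hypothetical outputs for one sample.
first = heatmap_loss([0.9, 0.1, 0.0], [1.0, 0.0, 0.0])
second = segmentation_loss(
    [{"object 1": 0.20, "object 2": 0.75, "object 3": 0.05}], ["object 2"]
)
print(training_complete(first, second, 0.01, 0.5))
print(training_complete(first, second, 0.01, 0.1))
```

With these hypothetical numbers the first call reports completion and the second does not, because the second loss value exceeds the stricter second loss threshold; in that case training would continue as described in paragraph [0110].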

[0111] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0112] Based on the same inventive concept, this application also provides a pose estimation apparatus for implementing the pose estimation method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, for the specific limitations in the one or more pose estimation apparatus embodiments provided below, refer to the limitations of the pose estimation method described above, which are not repeated here.

[0113] In one embodiment, as shown in Figure 11, a pose estimation device 1100 is provided, including an acquisition module 1101, a detection module 1102, a grouping module 1103, and a discrimination module 1104, wherein:

[0114] The acquisition module 1101 is used to acquire the image to be processed.

[0115] The detection module 1102 is used to extract key points and segment instances of the image to be processed through a target estimation model, so as to obtain a key point heatmap and instance segmentation results of the image to be processed; the instance segmentation results represent the object to which each pixel belongs.

[0116] The grouping module 1103 is used to determine the object to which each key point belongs based on the mapping relationship between each pixel and each key point contained in the key point heatmap and on the instance segmentation result of the image to be processed.

[0117] The discrimination module 1104 is used to perform pose estimation processing on the objects based on the key points corresponding to each object, and obtain the pose estimation results of each object.

[0118] In one embodiment, the target estimation model includes a backbone network layer, a key point extraction module, and an instance segmentation module. Specifically, the detection module 1102 is used to extract features from the image to be processed through the backbone network layer to obtain a feature image.

[0119] The feature image is processed by the key point extraction module to obtain the key point heatmap, and the feature image is processed by the instance segmentation module to obtain the instance segmentation result of the image to be processed.

[0120] In one embodiment, the key point extraction module includes a first convolutional layer and a deconvolutional layer. The detection module 1102 is specifically used to extract key points from the feature image to obtain an initial key point heatmap.

[0121] The initial key point heatmap and the feature image are concatenated to obtain the first fused feature image;

[0122] The first fused feature image is upsampled to obtain a key point heatmap.
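As a non-limiting illustration, the concatenation and upsampling performed by the key point extraction module can be sketched on channel-first nested lists. The embodiment uses a deconvolutional layer for upsampling; nearest-neighbour upsampling is substituted here only to keep the sketch dependency-free, and the function names and values are hypothetical.

```python
def concatenate_channels(a, b):
    """Concatenate two channel-first feature maps [C][H][W] along channels."""
    return a + b


def upsample_nearest(feature, factor=2):
    """Nearest-neighbour upsampling of a channel-first map by `factor`."""
    return [
        [
            [value for value in row for _ in range(factor)]
            for row in channel
            for _ in range(factor)
        ]
        for channel in feature
    ]


initial_heatmap = [[[1, 2], [3, 4]]]  # 1 channel, 2 x 2
feature = [[[5, 6], [7, 8]]]          # 1 channel, 2 x 2
fused = concatenate_channels(initial_heatmap, feature)  # 2 channels
upsampled = upsample_nearest(fused)                      # 2 channels, 4 x 4
print(upsampled[0])
```

A learned deconvolution would additionally mix channels and learn its interpolation weights; this sketch only shows how the spatial size and channel count change.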

[0123] In one embodiment, the instance segmentation module includes a coordinate convolutional layer and a second convolutional layer. The detection module 1102 is specifically used to determine the object center coordinate information corresponding to each object contained in the feature image, and add the object center coordinate information to the pixel position corresponding to each object in the feature image to obtain a second fused feature image; wherein, the object center coordinate information is used to identify the object to which it belongs.

[0124] For each pixel in the second fused feature image to which the object center coordinate information has been added, perform object classification and discrimination, and output the probability of each pixel corresponding to the classification category of each object;

[0125] Based on the probability of each pixel corresponding to each object, the target object to which the pixel belongs is determined.

[0126] In one embodiment, the grouping module 1103 is specifically used to classify the key points corresponding to the target pixel in the mapping relationship based on the mapping relationship between the target pixel and each key point contained in the key point heatmap and the instance segmentation result of the target pixel, and to add classification labels to the key points.

[0127] Key points with the same classification label are grouped into the same key point group.

[0128] In one embodiment, the pose estimation device 1100 further includes:

[0129] The first determining module is used to determine the key points corresponding to each instance segmentation result in the key point heatmap according to the masking rules.

[0130] The second determining module is used to determine the location coordinates of key points corresponding to each instance segmentation result;

[0131] The establishing module is used to determine, based on the position coordinates of each key point contained in the key point heatmap, a target pixel with the same position coordinates as each key point, and to establish a mapping relationship between the target pixel and the key point.

[0132] In one embodiment, the pose estimation device 1100 further includes:

[0133] The sample acquisition module is used to acquire training samples, which include sample images, heatmaps of reference key points corresponding to the sample images, and instance classification labels corresponding to the sample images; the instance classification labels are used to characterize the object to which a pixel belongs.

[0134] The training module is used to input training samples into the key point extraction module and the instance segmentation module respectively, and output the key point heatmap corresponding to the sample image and the instance segmentation result corresponding to each pixel.

[0135] The loss calculation module is used to calculate the first loss value between the key point heatmap and the reference key point heatmap, and the second loss value between the instance segmentation result of the image to be processed and the object represented by the instance classification label, according to the loss algorithm.

[0136] The discrimination module is used to determine the model accuracy of the key point extraction module and the instance segmentation module based on the first loss threshold, the second loss threshold, the first loss value, and the second loss value. The training of the key point extraction module and the instance segmentation module is considered complete when the model accuracy of both modules meets the accuracy conditions.

[0137] The modules in the aforementioned pose estimation device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in, or be independent of, the processor of the electronic device in hardware form, or stored in the memory of the electronic device in software form, so that the processor can call and execute the operations corresponding to each module.

[0138] In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure may be as shown in Figure 12. The electronic device includes a processor, memory, communication interface, display screen, and input device connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage medium. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements a pose estimation method. The display screen can be an LCD screen or an e-ink screen. The input device can be a touch layer covering the display screen, buttons, a trackball, or a touchpad mounted on the device's casing, or an external keyboard, touchpad, or mouse.

[0139] Those skilled in the art will understand that the structure shown in Figure 12 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the electronic device to which the present application is applied. A specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.

[0140] In one embodiment, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0141] In one embodiment, a computer-readable storage medium is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0142] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0143] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0144] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0145] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A pose estimation method, characterized in that the method includes: obtaining an image to be processed; inputting the image to be processed into a target estimation model, and performing key point extraction and instance segmentation on the image to be processed through the target estimation model to obtain a key point heatmap and an instance segmentation result of the image to be processed, wherein the instance segmentation result represents the object to which each pixel belongs; determining the object to which each key point belongs based on the mapping relationship between each pixel and each key point contained in the key point heatmap, and on the instance segmentation result; and performing pose estimation on each object based on the key points corresponding to that object to obtain a pose estimation result of each object; wherein determining the object to which each key point belongs based on the mapping relationship between each pixel and each key point contained in the key point heatmap, and on the instance segmentation result, includes: based on the mapping relationship between a target pixel and each key point contained in the key point heatmap and on the instance segmentation result of the target pixel, classifying the key points corresponding to the target pixel in the mapping relationship and adding classification labels to the key points; and grouping key points with the same classification label into the same key point group; and wherein, before classifying the key points corresponding to the target pixel in the mapping relationship based on the mapping relationship between the target pixel and each key point contained in the key point heatmap and on the instance segmentation result of the target pixel, the method further includes: determining, based on a masking rule, the key points corresponding to each instance segmentation result in the key point heatmap; determining the position coordinates of the key points corresponding to each instance segmentation result; and, for the position coordinates of each key point contained in the key point heatmap, determining among the pixels a target pixel with the same position coordinates as the key point, and establishing the mapping relationship between the target pixel and the key point.

2. The method according to claim 1, characterized in that, The target estimation model includes a backbone network layer, a key point extraction module, and an instance segmentation module; the step of performing key point extraction and instance segmentation on the image to be processed using the target estimation model to obtain a key point heatmap and instance segmentation results of the image to be processed includes: Feature images are obtained by extracting features from the image to be processed through the backbone network layer. The key point extraction module performs key point feature extraction processing on the feature image to obtain the key point heatmap, and the instance segmentation module performs image segmentation processing on the feature image to obtain the instance segmentation result.

3. The method according to claim 2, characterized in that, The step of extracting key point features from the feature image using the key point extraction module to obtain the key point heatmap includes: Key points are extracted from the feature image to obtain an initial key point heatmap; The initial key point heatmap and the feature image are concatenated to obtain a first fused feature image; The first fused feature image is upsampled to obtain a key point heatmap.

4. The method according to claim 2, characterized in that, The step of obtaining the instance segmentation result of the image to be processed by performing image segmentation processing on the feature image through the instance segmentation module includes: The center coordinate information of each object contained in the feature image is determined, and the center coordinate information of each object is added to the pixel position corresponding to each object in the feature image to obtain a second fused feature image; the center coordinate information of the objects is used to identify the objects to which they belong; For each pixel in the second fused feature image to which the object center coordinate information has been added, object classification and discrimination are performed, and the probability of each pixel corresponding to the classification category of each object is output; Based on the probability of each pixel corresponding to each of the objects, the target object to which the pixel belongs is determined.

5. The method according to claim 2, characterized in that, The target estimation model is trained through the following process: Obtain training samples, which include sample images, reference key point heatmaps corresponding to the sample images, and instance classification labels corresponding to the sample images; The instance classification label is used to characterize the object to which a pixel belongs; The training samples are input into the key point extraction module and the instance segmentation module respectively, and the key point heatmap corresponding to the sample image and the instance segmentation result corresponding to each pixel are output. According to the loss algorithm, a first loss value is calculated between the key point heatmap and the reference key point heatmap, and a second loss value is calculated between the instance segmentation result of each pixel and the object to which it belongs as represented by the instance classification label; The model accuracy of the key point extraction module and the instance segmentation module is determined based on the first loss threshold, the second loss threshold, the first loss value, and the second loss value. The training of the key point extraction module and the instance segmentation module is considered complete when the model accuracy of both the key point extraction module and the instance segmentation module meets the accuracy conditions.

6. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 5.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.

8. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.