Key point detection method, model training method, and device
By introducing a heatmap prediction network to assist the coordinate regression network in key point detection and using differentiable transformation to achieve gradient backpropagation, the problem of difficulty in balancing detection accuracy and efficiency is solved, thus improving detection accuracy and efficiency.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
- Filing Date: 2022-04-20
- Publication Date: 2026-06-19
AI Technical Summary
Existing key point detection methods struggle to balance detection accuracy and efficiency. Coordinate regression networks offer high detection efficiency but low accuracy, while heatmap prediction networks offer high accuracy but low efficiency.
During training, a heatmap prediction network is used to assist the coordinate regression network. Differentiable transformations are used to convert the predicted coordinates of key points into an initial heatmap, thereby achieving gradient backpropagation and improving the detection accuracy of the coordinate regression network.
It improves the accuracy and efficiency of key point detection, achieving a balance between detection accuracy and efficiency.
Figure CN116993818B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a key point detection method, a model training method, and a device.
Background Technology
[0002] Keypoint detection is a task that uses neural networks to predict key points on a target object in a given image. For example, facial keypoint detection involves using a neural network to predict facial keypoints in a given face image. These keypoints are points that are defined in advance by humans, such as eyebrow points, corner of the eye points, and facial contour points.
[0003] Currently, in keypoint detection tasks, keypoints can be represented as coordinate values, and a keypoint coordinate regression method can be used for keypoint prediction. This method employs a simple coordinate regression network to extract global features from the image, and then predicts the coordinates of keypoints based on these global features. While this method has high detection efficiency, its accuracy is low, making it difficult to balance both efficiency and accuracy in keypoint detection.
[0004] Therefore, how to balance the detection accuracy and efficiency of key points is an urgent problem to be solved.
Summary of the Invention
[0005] This disclosure provides a key point detection method, a model training method, and an apparatus to overcome the difficulty of simultaneously achieving high detection accuracy and high detection efficiency for key points in images.
[0006] In a first aspect, embodiments of this disclosure provide a key point detection method, including:
[0007] Identify the target image for keypoint detection;
[0008] By using a coordinate regression network, the coordinates of key points in the target image are detected, and the predicted coordinates of key points in the target image are obtained.
[0009] The coordinate regression network is trained with the assistance of a heatmap prediction network. The input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0010] In a second aspect, embodiments of this disclosure provide a model training method, including:
[0011] Determine the training data, which includes training images and sample labels corresponding to the training images. The sample labels include the actual coordinates of key points on the training images and the actual heatmap.
[0012] The coordinate regression network is trained based on the training data.
[0013] Specifically, a heatmap prediction network is used to assist the training of the coordinate regression network. The input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0014] In a third aspect, embodiments of this disclosure provide a key point detection device, including:
[0015] The determining unit is used to determine the target image to be detected for key points;
[0016] The detection unit is used to detect the coordinates of key points in the target image through a coordinate regression network, and obtain the predicted coordinates of the key points in the target image.
[0017] The coordinate regression network is trained with the assistance of a heatmap prediction network. The input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0018] In a fourth aspect, embodiments of this disclosure provide a model training device, including:
[0019] A determining unit is used to determine training data, the training data including training images and sample labels corresponding to the training images, the sample labels including the actual coordinates of key points on the training images and the actual heatmap;
[0020] The training unit is used to train the coordinate regression network based on the training data.
[0021] Specifically, a heatmap prediction network is used to assist the training of the coordinate regression network. The input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0022] In a fifth aspect, embodiments of this disclosure provide an electronic device, including: at least one processor and a memory; the memory stores computer-executable instructions; the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to execute the key point detection method provided in the first aspect above, or to execute the model training method provided in the second aspect.
[0023] In a sixth aspect, embodiments of this disclosure provide a computer-readable storage medium storing computer-executable instructions. When a processor executes the computer-executable instructions, it implements the key point detection method provided in the first aspect above, or implements the model training method provided in the second aspect.
[0024] In a seventh aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, the computer program product including computer-executable instructions which, when executed by a processor, implement the key point detection method provided in the first aspect above, or implement the model training method provided in the second aspect.
[0025] In the keypoint detection method, model training method, and device provided in this disclosure, considering that the coordinate regression network has high detection efficiency for keypoints and the heatmap prediction network has high detection accuracy for keypoints, the heatmap prediction network is used to assist the training of the coordinate regression network. During the training process, the predicted coordinates of the keypoints are obtained through the coordinate regression network. Based on these predicted coordinates, a differentiable transformation is performed on the reference heatmap of the keypoints to obtain the initial heatmap used as input to the heatmap prediction network. Due to the use of a differentiable transformation, the gradient generated by the heatmap prediction network can be transferred to the coordinate regression network for parameter optimization, thereby improving the accuracy of the coordinate regression network.
[0026] Therefore, by performing a differentiable transformation based on the predicted coordinates output by the coordinate regression network, a heatmap prediction network is connected after the coordinate regression network. By utilizing the heatmap prediction network, the key point detection accuracy of the coordinate regression network is improved, achieving a balance between the detection accuracy and efficiency of key points on the image.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is an example diagram of an application scenario provided by an embodiment of this disclosure;
[0029] Figure 2 is a first schematic diagram of the model provided by embodiments of this disclosure;
[0030] Figure 3 is a schematic flowchart of the key point detection method provided by embodiments of this disclosure;
[0031] Figure 4 is a schematic flowchart of the model training method provided by an embodiment of this disclosure;
[0032] Figure 5 is an example diagram of a reference heatmap of facial key points provided by embodiments of this disclosure;
[0033] Figure 6 is a first schematic flowchart of a single training process of the coordinate regression network provided by an embodiment of the present disclosure;
[0034] Figure 7 is a second schematic flowchart of a single training process of the coordinate regression network provided by an embodiment of the present disclosure;
[0035] Figure 8 is a second schematic diagram of the model provided by embodiments of this disclosure;
[0036] Figure 9 is a third schematic flowchart of a single training process of the coordinate regression network provided by an embodiment of the present disclosure;
[0037] Figure 10 is a structural block diagram of the key point detection device provided by embodiments of this disclosure;
[0038] Figure 11 is a structural block diagram of the model training device provided by embodiments of this disclosure;
[0039] Figure 12 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of this disclosure.
Detailed Implementation
[0040] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0041] In deep learning tasks involving facial landmark detection, landmarks can be represented in two ways: as coordinate values or as heatmaps. Based on these different representations, facial landmark detection methods can be categorized into two types: methods that obtain landmark locations by regressing coordinate values, and methods that obtain landmark locations by generating heatmaps. Each of these methods has its advantages and disadvantages.
[0042] Method 1, which derives keypoint locations by regressing coordinates, first uses a feature extraction network to extract image features, then uses a fully connected layer to regress coordinates, obtaining N*2 values, i.e., the X and Y coordinates of N keypoints. This method extracts more global features of the image without considering its geometric information. It uses a smaller network structure with lower computational cost, resulting in higher keypoint detection efficiency, but lower detection accuracy.
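As a minimal sketch of the regression step described in [0042], the snippet below maps a global feature vector through one fully connected layer to N*2 values and regroups them into N (x, y) pairs. The function name `regress_coordinates`, the feature length `D`, and the random weights are illustrative assumptions, not details taken from the disclosure.

```python
import random

def regress_coordinates(features, weights, bias, num_keypoints):
    """One fully connected layer: D features -> 2*N values -> N (x, y) pairs."""
    assert len(weights) == 2 * num_keypoints
    assert len(bias) == 2 * num_keypoints
    # Each output value is a weighted sum over the whole (global) feature vector.
    flat = [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]
    # Regroup the flat N*2 outputs into (x, y) pairs, one per key point.
    return [(flat[2 * i], flat[2 * i + 1]) for i in range(num_keypoints)]

random.seed(0)
D, N = 8, 3  # hypothetical feature length and number of key points
features = [random.random() for _ in range(D)]
weights = [[random.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(2 * N)]
bias = [0.0] * (2 * N)
coords = regress_coordinates(features, weights, bias, N)
```

Every output coordinate depends on the entire feature vector, which is one way to read the statement that this method uses global features rather than geometric information.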
[0043] Method two, which obtains keypoint locations by generating heatmaps of keypoints, assigns a separate heatmap to each keypoint. The heatmap's image size is the same as the overall image size. The probability of a pixel being a keypoint is represented by its brightness; higher brightness indicates a greater probability. This method considers both global and geometric features of the image, resulting in high keypoint detection accuracy. However, to maintain image resolution, this method employs a deep network, leading to longer computation and inference times and lower keypoint detection efficiency.
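To illustrate the heatmap representation in [0043], the sketch below reads a key point location out of a single heatmap by taking its brightest pixel. The argmax decoding rule and the name `decode_heatmap` are assumptions for illustration; the disclosure only states that higher brightness means higher probability.

```python
def decode_heatmap(heatmap):
    """Return (x, y) of the brightest pixel in a heatmap given as a list of rows."""
    best_xy = (0, 0)
    best_val = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_val = value
                best_xy = (x, y)
    return best_xy

# One heatmap per key point; brightness stands for the probability of being the key point.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.3, 0.2, 0.0],
    [0.0, 0.2, 0.9, 0.1],
]
location = decode_heatmap(heatmap)
```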
[0044] Therefore, neither Method 1 nor Method 2 can simultaneously achieve both detection efficiency and detection accuracy of key points.
[0045] Given that Method 1 has higher keypoint detection efficiency and Method 2 has higher keypoint detection accuracy, the applicant proposes combining the two methods to achieve a balance between keypoint detection efficiency and accuracy. Specifically, during network training, the network from Method 2 can be used to assist the training of the network from Method 1, thereby improving the keypoint detection accuracy of the network from Method 1. When the network from Method 1 is connected to the network from Method 2, achieving effective gradient backpropagation between the two networks is the key to enabling the network from Method 2 to assist the training of the network from Method 1.
[0046] The output of the network in Method 1 is the predicted coordinates of keypoints, while the input of the network in Method 2 is an image, so connecting the two requires converting keypoint coordinates into an image. One way to do this is to generate a heatmap for each keypoint: centered on the keypoint coordinates, the probability values of the surrounding pixels are determined using a Gaussian distribution. However, this way of generating heatmaps is non-differentiable, so the gradient of the network in Method 2 cannot be backpropagated to the network in Method 1.
[0047] To address the aforementioned issues, this application provides a keypoint detection method, a model training method, and an apparatus. In this application, a coordinate regression network is trained with the assistance of a heatmap prediction network, and the trained coordinate regression network is used for keypoint detection in images. The input data for the heatmap prediction network includes an initial heatmap of keypoints on the training image and feature maps extracted from the training image by the coordinate regression network. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the keypoints based on their predicted coordinates. Because this transformation is differentiable, gradients from the heatmap prediction network can be backpropagated to the coordinate regression network during training. This compensates for the coordinate regression network's insufficient attention to the geometric structural features of the image, so that both global and geometric features are taken into account, while the low computational cost of the coordinate regression network is retained when the model is applied. This achieves a balance between detection efficiency and accuracy for keypoints in images.
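The paragraph above does not say why Gaussian heatmap generation is non-differentiable. One reading, assumed here, is that the Gaussian center is snapped to the pixel grid before rendering, as some implementations do. The sketch below (the function `gaussian_heatmap` and its `sigma` parameter are hypothetical) shows that under this assumption small changes in the coordinates leave the heatmap unchanged, so no useful gradient reaches the coordinates.

```python
import math

def gaussian_heatmap(x, y, width, height, sigma=1.0):
    """Render a Gaussian heatmap whose center is rounded to the nearest pixel."""
    cx, cy = round(x), round(y)  # quantization step: piecewise constant in (x, y)
    return [[math.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2.0 * sigma ** 2))
             for px in range(width)]
            for py in range(height)]

# Two slightly different predicted coordinates give identical heatmaps.
heatmap_a = gaussian_heatmap(3.2, 4.1, width=8, height=8)
heatmap_b = gaussian_heatmap(3.4, 3.9, width=8, height=8)
same = heatmap_a == heatmap_b
```

Because the output does not respond to small coordinate changes, an error measured on the heatmap carries no information about which way to move the predicted coordinates.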
[0048] Refer to Figure 1, which is an example diagram of an application scenario provided by an embodiment of this disclosure.
[0049] As shown in Figure 1, the application scenario is image keypoint detection and includes an image processing device 101. In this scenario, the image processing device 101 uses a neural network to detect keypoints on a given image. These keypoints are points predefined by a person. The image processing device 101 can be a terminal or a server; Figure 1 takes the case where the image processing device 101 is a server as an example.
[0050] Optionally, the application scenario also includes an image acquisition device 102, which communicates with the image processing device 101, for example, via a network. The image acquisition device 102 can send the acquired images to the image processing device 101 for key point detection. The image acquisition device 102 is a terminal device with a camera, such as a camera, mobile phone, tablet computer, smart wearable device, or smart home appliance.
[0051] The application scenario can be real-time online image processing, where the image processing device 101 performs key point detection on images acquired in real-time by the image acquisition device 102. Alternatively, it can be offline image processing, where the image processing device 101 performs key point detection on images not acquired in real-time.
[0052] The image processing device 101 and the image acquisition device 102 can be the same device, for example, performing key point detection on an image taken by a user using a mobile phone. Alternatively, the image processing device 101 and the image acquisition device 102 can be different devices.
[0053] Optionally, this application scenario involves facial landmark detection in images. For example, with user authorization, facial landmark detection can be performed on a user's selfie image to identify the user or generate interesting image effects based on the identified facial landmarks. Facial landmarks include, for example, eyebrow points, corner of the eye points, and facial contour points.
[0054] Optionally, the image detected in this embodiment is a face image, and the key points are facial key points.
[0055] Below, several embodiments of the key point detection method are provided. It should be noted that the execution subject of the method embodiments of this disclosure is an electronic device, which may be a terminal or a server.
[0056] Refer to Figure 2, which is a first schematic diagram of the model provided by embodiments of this disclosure.
[0057] As shown in Figure 2, the coordinate regression network includes a feature extraction layer and a coordinate regression layer, while the heatmap prediction network includes a feature extraction layer (not shown in Figure 2). The feature extraction layer in the coordinate regression network can be regarded as a feature extraction layer shared by the coordinate regression network and the heatmap prediction network. The coordinate regression network is used to detect the coordinates of key points on the image, while the heatmap prediction network assists in the training of the coordinate regression network to improve its key point detection accuracy.
[0058] As shown in Figure 2, an image can be input into the coordinate regression network, where the feature extraction layer extracts features to obtain a feature map. This feature map is then input into the coordinate regression layer to obtain the predicted coordinates of keypoints output by the coordinate regression network.
[0059] As shown in Figure 2, during training, a differentiable transformation can be performed on the reference heatmap of the keypoints based on their predicted coordinates to obtain the initial heatmap of the keypoints. In the heatmap prediction network, the geometric structural features in the image can be further learned based on the initial heatmap of the keypoints and the feature map of the image to obtain the predicted heatmap of the keypoints.
[0060] Since the predicted coordinates of keypoints output by the coordinate regression network and the initial heatmap of keypoints input into the heatmap prediction network are obtained in a differentiable manner, the error in the predicted heatmap of keypoints predicted by the heatmap prediction network can be gradient-propagated back to the coordinate regression network to adjust its parameters and improve its accuracy in detecting keypoints in the image. Thus, the heatmap prediction network assists in the training of the coordinate regression network.
[0061] Based on the model structure shown in Figure 2, refer to Figure 3, which is a schematic flowchart of the key point detection method provided in the embodiments of this disclosure. As shown in Figure 3, the key point detection method includes:
[0062] S301. Determine the target image for which key point detection will be performed.
[0063] The target image is the image from which key point detection of the target object is to be performed.
[0064] Optionally, the target object is a face, the key points are predefined facial key points, such as eyebrow points, corner of the eye points, facial contour points, etc., and the target image is the image to be detected for facial key points.
[0065] In this embodiment, one or more target images can be determined.
[0066] In one example, a target image input by the user can be acquired, or a target image can be received from an image acquisition device, or an image displayed on the user terminal or a video frame in a video played on the user terminal can be identified as the target image, so as to perform real-time online key point detection on the target image.
[0067] In another example, the target image can be obtained from an image database or a video database to perform offline keypoint detection on the target image.
[0068] S302. Using a coordinate regression network, the coordinates of key points in the target image are detected to obtain the predicted coordinates of key points in the target image.
[0069] The coordinate regression network was trained with the assistance of the heatmap prediction network.
[0070] During training, the input data for the coordinate regression network is the training image, while the input data for the heatmap prediction network includes the initial heatmap of key points on the training image and the feature map extracted from the training image by the coordinate regression network. The initial heatmap of key points on the training image is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image (predicted coordinates obtained by key point detection in the training image through the coordinate regression network).
[0071] Based on the predicted coordinates of key points, a differentiable transformation is performed on the reference heatmap of the key points to obtain the initial heatmap of the key points. This initial heatmap is then passed to the heatmap prediction network. Essentially, this adds a differentiable data processing step between the coordinate regression network and the heatmap prediction network, one that takes the predicted coordinates of the key points as input and outputs the initial heatmap of the key points. For example, y = f(x), where x represents the predicted coordinates of the key points, y represents the initial heatmap of the key points, and f(x) is a differentiable function.
[0072] Since differentiability in data processing is a prerequisite for gradient backpropagation, the process of "performing a differentiable transformation on the reference heatmap of key points based on their predicted coordinates to obtain an initial heatmap of the key points, and then transferring this initial heatmap to the heatmap prediction network" enables gradient backpropagation between the heatmap prediction network and the coordinate regression network. Gradient backpropagation is crucial for assisting the training process of the coordinate regression network by the heatmap prediction network.
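The role of differentiability can be made explicit with the chain rule. The notation below is an assumption introduced for illustration and does not appear in the disclosure: $\theta$ denotes the parameters of the coordinate regression layer, $c$ the predicted coordinates, $H_0 = f(c)$ the initial heatmap, $\hat{H}$ the predicted heatmap, and $\mathcal{L}_H$ whatever loss compares $\hat{H}$ with the actual heatmap.

```latex
\frac{\partial \mathcal{L}_H}{\partial \theta}
  = \frac{\partial \mathcal{L}_H}{\partial \hat{H}}
    \cdot \frac{\partial \hat{H}}{\partial H_0}
    \cdot \frac{\partial H_0}{\partial c}
    \cdot \frac{\partial c}{\partial \theta}
```

The third factor, $\partial H_0 / \partial c = \partial f / \partial c$, exists only if $f$ is differentiable. If it is undefined or identically zero, the whole product vanishes and the heatmap error cannot update $\theta$ through this path.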
[0073] In this embodiment, the coordinate regression network can be obtained through multiple training iterations. After obtaining the target image, the target image can be input into the trained coordinate regression network. The coordinate regression network detects the key point locations on the target image and obtains the predicted coordinates of the key points. Due to the assistance of the heatmap prediction network, the trained coordinate regression network has high accuracy in detecting key points on the image. Because of its low computational cost and simple network structure, the coordinate regression network has high efficiency in detecting key points on the image. Therefore, by using the trained coordinate regression network, the detection efficiency and accuracy of key points on the target image can be effectively improved.
[0074] In this embodiment, gradient backpropagation between the heatmap prediction network and the coordinate regression network is achieved by performing a differentiable transformation on the reference heatmap of key points based on their predicted coordinates. This assists the training process of the coordinate regression network, improving the key point detection accuracy and efficiency of the coordinate regression network. Therefore, in the process of key point detection in an image, using the coordinate regression network to detect the coordinates of key points improves both the detection efficiency and accuracy, achieving a balance between key point detection efficiency and accuracy.
[0075] Below are several embodiments of the model training process.
[0076] Based on the model structure shown in Figure 2, refer to Figure 4, which is a schematic flowchart of the model training method provided in an embodiment of this disclosure. As shown in Figure 4, the model training method includes:
[0077] S401. Determine the training data.
[0078] The training data includes training images and their corresponding sample labels. The training images are the training samples, and the training images and their corresponding sample labels are used for supervised training of the keypoint detection model. Specifically, the sample labels corresponding to the training images may include the actual coordinates (i.e., true coordinates) of the keypoints on the training images and the actual heatmap (i.e., true heatmap) of the keypoints on the training images.
[0079] For example, when there are N keypoints in a training image, the sample labels for the training image include the actual coordinates of the N keypoints and the actual heatmaps corresponding to each of the N keypoints. This can be achieved through manual annotation or by processing the training image using a high-precision keypoint detection model to obtain the actual keypoint coordinates and heatmaps.
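One possible way to hold a training sample with the labels described in [0078] and [0079] is sketched below. The class name `TrainingSample`, its field names, and the list-of-rows image representation are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Heatmap = List[List[float]]

@dataclass
class TrainingSample:
    image: List[List[float]]                 # training image (grayscale, list of rows)
    true_coords: List[Tuple[float, float]]   # actual coordinates of the N key points
    true_heatmaps: List[Heatmap]             # one actual heatmap per key point

    def __post_init__(self):
        # The labels must contain exactly one actual heatmap for each key point.
        if len(self.true_coords) != len(self.true_heatmaps):
            raise ValueError("need one actual heatmap per key point")

blank = [[0.0] * 4 for _ in range(4)]
sample = TrainingSample(
    image=blank,
    true_coords=[(1.0, 2.0), (3.0, 0.0)],
    true_heatmaps=[blank, blank],
)
```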
[0080] In this embodiment, training data can be obtained from a database, from an authorized network or other authorized platform, or from user-input training data.
[0081] S402. Train the coordinate regression network based on the training data.
[0082] Here, a heatmap prediction network is used to assist the training of the coordinate regression network. The input data of the heatmap prediction network includes the initial heatmap of key points on the training image and the feature map extracted by the coordinate regression network from the training image. The initial heatmap of key points on the training image is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0083] Key points and reference heatmaps are in one-to-one correspondence, meaning that each key point corresponds to a specific reference heatmap. The reference heatmap corresponding to each key point reflects its reference coordinates.
[0084] Optionally, the reference heatmap for keypoints can be obtained based on images labeled with keypoints in the image dataset. For each keypoint, its actual coordinates on multiple images in the image dataset can be obtained. Based on these actual coordinates, reference coordinates for the keypoint are determined, and a reference heatmap for the keypoint is generated. Thus, by leveraging a large number of images labeled with keypoints, the accuracy of the reference heatmap for keypoints can be improved.
[0085] In determining the reference coordinates of a keypoint based on its actual coordinates across multiple images in an image dataset, the average of these actual coordinates can be calculated to obtain the keypoint's reference coordinates. These reference coordinates are then known as the mean coordinates, and the reference heatmap can be called a mean heatmap. Besides calculating the mean, other parameters such as the median and mode can also be used.
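The averaging described in [0085] can be sketched as follows. The input layout (one dictionary per labeled image, keyed by key point ID) and the name `reference_coordinates` are assumptions for illustration; the `reducer` argument shows how the median could be substituted for the mean.

```python
from statistics import mean, median

def reference_coordinates(annotations, reducer=mean):
    """Combine each key point's actual coordinates across images into reference coordinates.

    annotations: list of dicts, one per image, mapping key point ID -> (x, y).
    """
    ids = annotations[0].keys()
    return {
        k: (reducer([a[k][0] for a in annotations]),
            reducer([a[k][1] for a in annotations]))
        for k in ids
    }

# Three labeled images, two key points with IDs 1 and 2.
annotations = [
    {1: (10.0, 20.0), 2: (30.0, 40.0)},
    {1: (12.0, 22.0), 2: (34.0, 44.0)},
    {1: (14.0, 27.0), 2: (32.0, 42.0)},
]
mean_coords = reference_coordinates(annotations)
median_coords = reference_coordinates(annotations, reducer=median)
```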
[0086] Different key points can correspond to different IDs. For example, if there are M key points in a face, then 1, 2, ..., M can be used as the IDs of the key points to distinguish different key points by ID.
[0087] As an example, when the key points are facial key points, for eyebrow points, the actual coordinates of eyebrow points on multiple face images can be determined, the average of the actual coordinates of the eyebrow points on multiple face images can be calculated to obtain the reference coordinates of the eyebrow points, and a reference heatmap of the eyebrow points can be generated based on the reference coordinates of the eyebrow points. For corner of the eye points, the actual coordinates of corner of the eye points on multiple face images can be determined, the average of the actual coordinates of the corner of the eye points on multiple face images can be calculated to obtain the reference coordinates of the corner of the eye points, and a reference heatmap of the corner of the eye points can be generated based on the reference coordinates of the corner of the eye points.
[0088] As an example, Figure 5 is an example diagram of a reference heatmap of facial key points provided in an embodiment of this disclosure. For ease of representation, the image in Figure 5 is a contour map synthesized by overlaying the reference heatmaps of all facial key points. In the reference heatmap shown in Figure 5, each facial key point has corresponding reference coordinates, which can be obtained by averaging the actual coordinates of that facial key point on multiple face images.
[0089] Furthermore, considering that the position, pose, and size of the target object (such as a face) may differ in different images, to improve the accuracy of the reference heatmap for key points, the target object can be identified from multiple images before determining the reference coordinates of the key points. Based on the identified target object, sub-images of a preset size containing the target object are segmented from the multiple images. In this way, multiple sub-images of the same size containing the target object are obtained. Based on the actual coordinates of each key point in these sub-images, a reference heatmap for each key point is generated.
[0090] In this embodiment, during training, the training image is input into a coordinate regression network, where a feature map is extracted from the training image through a feature extraction layer. The feature map is then input into the coordinate regression layer of the network, where coordinate regression is performed to generate predicted coordinates for key points on the training image. Next, based on the predicted coordinates of the key points, a differentiable transformation is performed on the reference heatmap of the key points to obtain an initial heatmap of the key points.
[0091] After obtaining the initial heatmap of keypoints, the heatmap prediction network learns the geometric structural features of keypoints in the training image based on the initial heatmap and the feature maps of the training image extracted by the feature extraction layer in the coordinate regression network, generating a predicted heatmap of keypoints in the training image. Then, the error in the predicted heatmap generated by the heatmap prediction network can be passed to the coordinate regression network via gradient backpropagation to adjust the parameters of the coordinate regression network, thereby improving the accuracy of keypoint extraction by the coordinate regression network using the heatmap prediction network.
[0092] The following example illustrates a training process for a coordinate regression network.
[0093] Refer to Figure 6, which is a first flowchart of a single training process of the coordinate regression network provided in an embodiment of the present disclosure. As shown in Figure 6, one training process of the coordinate regression network includes:
[0094] S601. Using a coordinate regression network, coordinate detection is performed on key points in the training image to obtain the predicted coordinates of key points in the training image and the feature map of the training image.
[0095] The implementation principle and technical effects of S601 can be referred to in the aforementioned embodiments, and will not be repeated here.
[0096] S602. Based on the predicted coordinates of key points on the training image, perform a differentiable transformation on the reference heatmap of the key points to obtain the initial heatmap of the key points on the training image.
[0097] In this embodiment, for each key point, an image transformation operation can be performed on the reference heatmap of the key point based on the predicted coordinates of the key point. This moves the key point on the reference heatmap from its reference coordinates to its predicted coordinates. The reference heatmap after the image transformation operation is the initial heatmap of the key point. The image transformation operation can include at least one of the following: translation, rotation, and scaling. Transforming the reference heatmap of the key point into the initial heatmap of the key point is a differentiable transformation process. For example, this transformation process can be represented as Y = KX, where X corresponds to the reference heatmap, Y corresponds to the initial heatmap, and K corresponds to the transformation operation applied to the reference heatmap; taking the derivative then amounts to determining K.
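As an illustrative sketch only, the gradient path from the initial heatmap back to the predicted coordinates can be written out for the simplest case. It assumes a pure translation applied with a continuous (for example, bilinear) interpolation of the reference heatmap, and it assumes the offset is taken as the predicted coordinate minus the reference coordinate. Here (x_1, y_1) denotes the predicted coordinates and (x_reference, y_reference) the reference coordinates of a key point; the symbols L (loss), (u, v) (pixel position), and Δx, Δy (offsets) are introduced for illustration and do not appear in the disclosure.

```latex
\begin{aligned}
Y(u, v) &= X(u - \Delta x,\; v - \Delta y),
  \qquad \Delta x = x_1 - x_{\text{reference}},\quad \Delta y = y_1 - y_{\text{reference}} \\
\frac{\partial Y(u, v)}{\partial x_1} &= -\,\frac{\partial X}{\partial u}\bigl(u - \Delta x,\; v - \Delta y\bigr),
  \qquad
\frac{\partial Y(u, v)}{\partial y_1} = -\,\frac{\partial X}{\partial v}\bigl(u - \Delta x,\; v - \Delta y\bigr) \\
\frac{\partial L}{\partial x_1} &= \sum_{u, v} \frac{\partial L}{\partial Y(u, v)}\,\frac{\partial Y(u, v)}{\partial x_1}
\end{aligned}
```

Under these assumptions, any loss computed downstream of the initial heatmap has a well-defined gradient with respect to the predicted coordinates, which is what allows the error to reach the coordinate regression network.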
[0098] Optionally, one possible implementation of S602 includes: determining the transformation matrix of the key points based on the difference between the predicted coordinates of the key points on the training image and the reference coordinates of the key points on the reference heatmap of the key points; and performing image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image.
[0099] Different key points can correspond to different transformation matrices.
[0100] The transformation matrix of keypoints may include at least one of translation, rotation, and scaling. Translation may include vertical and horizontal translation. Therefore, the transformation matrix can be one-dimensional or multi-dimensional. For example, if the image transformation operation only includes translation, the transformation matrix can be one-dimensional, including vertical and horizontal translation. Alternatively, if the image operation includes translation, rotation, and scaling, the transformation matrix can be multi-dimensional, including vertical and horizontal translation, rotation, and scaling.
[0101] In this implementation, the predicted coordinates of keypoints on the training image are compared with the reference coordinates of keypoints on the reference heatmap to obtain the offsets of the predicted coordinates of the keypoints on the X-axis and the Y-axis relative to their reference coordinates. Based on these offsets, the transformation matrix of the keypoints is determined. In determining the transformation matrix, the offsets on the X-axis and Y-axis can be defined as the left-right and up-down offsets in the transformation matrix, respectively. If the transformation matrix also includes rotation and scaling, default rotation or scaling values can be used.
[0102] For example, suppose the predicted coordinates of a keypoint obtained through the coordinate regression network are (x_1, y_1), and the reference coordinates of the keypoint on the reference heatmap are (x_reference, y_reference). The difference between x_1 and x_reference gives the x_offset, and the difference between y_1 and y_reference gives the y_offset. Based on the x_offset and y_offset, the transformation matrix of the keypoint is obtained.
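A minimal sketch of this step is given below. The 3x3 homogeneous form, the identity defaults for rotation and scaling, the sign convention (predicted minus reference), and the name `translation_matrix` are assumptions made for illustration; the disclosure does not fix the layout of the transformation matrix.

```python
import numpy as np

def translation_matrix(predicted_xy, reference_xy):
    """Build a 3x3 homogeneous matrix that moves the reference point
    onto the predicted point (rotation and scaling left at identity)."""
    x_offset = predicted_xy[0] - reference_xy[0]
    y_offset = predicted_xy[1] - reference_xy[1]
    return np.array([
        [1.0, 0.0, x_offset],
        [0.0, 1.0, y_offset],
        [0.0, 0.0, 1.0],
    ])

# Hypothetical values: predicted (25, 17), reference (22, 20).
K = translation_matrix((25.0, 17.0), (22.0, 20.0))
moved = K @ np.array([22.0, 20.0, 1.0])  # reference point in homogeneous form
```

Applying `K` to the reference coordinates yields the predicted coordinates, which is the mapping relationship the transformation matrix is meant to establish.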
[0103] In this implementation, the transformation matrix establishes a mapping relationship between the predicted coordinates and the reference coordinates of keypoints. Therefore, when performing image transformation on the reference heatmap of keypoints based on the transformation matrix, the keypoints can be transformed from the reference coordinates to the predicted coordinates, resulting in the initial heatmap of the keypoints. Thus, the transformation matrix explicitly provides a transformation relationship from the reference heatmap to the initial heatmap of the keypoints, i.e., a transformation process from the predicted coordinates of the keypoints to the initial heatmap, making the entire process differentiable.
[0104] Furthermore, in the process of performing image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain the initial heatmap of key points on the training image, one possible implementation includes: determining the mapping coordinates of multiple points on the reference heatmap of key points based on the transformation matrix of key points; and performing image transformation on the reference heatmap of key points based on the mapping coordinates of multiple points on the reference heatmap of key points to obtain the initial heatmap. Thus, it considers not only the positional transformation of key points on the reference heatmap of key points but also the positional transformation of other points on the reference heatmap of key points, ensuring that each point on the heatmap of key points undergoes a differentiable transformation process.
[0105] In this implementation, given the transformation matrix of each key point, the mapping coordinates of multiple points on the key point reference heatmap can be calculated based on the regression coordinates (i.e., the predicted coordinates) of each key point. This allows for the transformation of multiple points on the key point reference heatmap to obtain the initial heatmap (i.e., the initial heatmap of the key point) corresponding to the regression coordinates of the key point.
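The mapping of multiple points on the reference heatmap can be sketched as follows, assuming a translation-only transformation and bilinear sampling. Both choices, along with the name `warp_translate` and the zero padding outside the image, are assumptions for illustration. NumPy itself does not compute gradients; the sketch only shows that every output pixel is a smooth (piecewise linear) function of the offsets, which is the property an automatic differentiation framework would rely on.

```python
import numpy as np

def warp_translate(reference_heatmap, x_offset, y_offset):
    """Shift a heatmap by a (possibly fractional) offset with bilinear sampling.

    Output pixel (u, v) reads the reference heatmap at the mapped coordinate
    (u - x_offset, v - y_offset); coordinates outside the image read 0.
    """
    height, width = reference_heatmap.shape
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    src_x = xs - x_offset
    src_y = ys - y_offset
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    wx = src_x - x0  # fractional parts act as interpolation weights
    wy = src_y - y0

    def sample(yi, xi):
        inside = (yi >= 0) & (yi < height) & (xi >= 0) & (xi < width)
        values = reference_heatmap[np.clip(yi, 0, height - 1),
                                   np.clip(xi, 0, width - 1)]
        return np.where(inside, values, 0.0)

    return ((1 - wy) * (1 - wx) * sample(y0, x0)
            + (1 - wy) * wx * sample(y0, x0 + 1)
            + wy * (1 - wx) * sample(y0 + 1, x0)
            + wy * wx * sample(y0 + 1, x0 + 1))

# Hypothetical 8x8 reference heatmap with a single peak at (x=3, y=2).
reference = np.zeros((8, 8))
reference[2, 3] = 1.0
shifted = warp_translate(reference, 2.0, 1.0)  # peak moves to (x=5, y=3)
half = warp_translate(reference, 0.5, 0.0)     # peak is split across x=3 and x=4
```

With a fractional offset the peak is spread over neighboring pixels in proportion to the offset, so a small change in the predicted coordinates produces a small change in the initial heatmap.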
[0106] S603. Using a heatmap prediction network, features are extracted from the initial heatmap and feature map of key points on the training image to obtain the predicted heatmap of key points on the training image.
[0107] In this embodiment, the initial heatmaps of key points on the training image can be cascaded to obtain a contour map of the target object. The contour map and the feature map of the training image are then input into the heatmap prediction network. In the heatmap prediction network, image features are further learned from the contour map and the feature map of the training image through a feature extraction layer, especially the geometric structural features of the training image, ultimately obtaining the predicted heatmap of key points on the training image.
[0108] S604. Adjust the parameters of the coordinate regression network based on the differences between the actual coordinates and the predicted coordinates of the key points, as well as the differences between the actual heatmap and the predicted heatmap of the key points.
[0109] The loss function of the keypoint detection model can include a first loss function and a second loss function. The loss value of the first loss function reflects the difference between the actual keypoint coordinates of the training image and the predicted keypoint coordinates of the training image, while the loss value of the second loss function reflects the difference between the actual keypoint heatmap of the training image and the predicted keypoint heatmap of the training image.
[0110] In this embodiment, based on the first loss function, the actual coordinates of keypoints in the training image, and the predicted coordinates of keypoints in the training image, the difference between the actual coordinates and the predicted coordinates of keypoints in the training image is determined, resulting in a first loss value (i.e., the function value of the first loss function). Based on the second loss function, the actual heatmap of keypoints in the training image, and the predicted heatmap of keypoints in the training image, the difference between the actual heatmap and the predicted heatmap of keypoints in the training image is determined, resulting in a second loss value (i.e., the function value of the second loss function). Based on the first and second loss values, the parameters of the coordinate regression network are adjusted using a model optimization algorithm.
[0111] The specific formula for the loss function and the model optimization algorithm are not limited.
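Since the loss formula is left open, the following is only one possible sketch of combining the two loss values. The use of mean squared error for both losses, the weighting factor `heatmap_weight`, and the function names are assumptions introduced here.

```python
import numpy as np

def coordinate_loss(pred_coords, true_coords):
    """First loss: mean squared error between predicted and actual coordinates."""
    return float(np.mean((pred_coords - true_coords) ** 2))

def heatmap_loss(pred_heatmaps, true_heatmaps):
    """Second loss: mean squared error between predicted and actual heatmaps."""
    return float(np.mean((pred_heatmaps - true_heatmaps) ** 2))

def total_loss(pred_coords, true_coords, pred_heatmaps, true_heatmaps,
               heatmap_weight=1.0):
    """Weighted sum of the first and second loss values (assumed combination)."""
    return (coordinate_loss(pred_coords, true_coords)
            + heatmap_weight * heatmap_loss(pred_heatmaps, true_heatmaps))

# Hypothetical values for one key point and a 2x2 heatmap.
first = coordinate_loss(np.array([[1.0, 2.0]]), np.array([[1.0, 4.0]]))
second = heatmap_loss(np.zeros((1, 2, 2)), np.ones((1, 2, 2)))
combined = total_loss(np.array([[1.0, 2.0]]), np.array([[1.0, 4.0]]),
                      np.zeros((1, 2, 2)), np.ones((1, 2, 2)),
                      heatmap_weight=0.5)
```

A weighted sum is a common way to let both differences influence the parameters of the coordinate regression network; other combinations would fit the description equally well.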
[0112] Therefore, compared to training a coordinate regression network solely based on the difference between the predicted coordinates and the actual coordinates of key points in the training image, this embodiment considers the low key point detection accuracy of the coordinate regression network. During training, a data connection between the coordinate regression network and the heatmap prediction network is established by performing a differentiable transformation on the reference heatmap of the key points. The heatmap prediction network is then used to further extract image features to obtain the predicted heatmap of the key points in the training image. The difference between the predicted heatmap and the actual heatmap of the key points in the training image is used to assist the training of the coordinate regression network and improve its key point detection accuracy.
[0113] Optionally, the heatmap prediction network can use a pre-trained network.
[0114] Optionally, during the training of the coordinate regression network, the heatmap prediction network can also be trained to further improve the key point detection accuracy of the coordinate regression network.
[0115] For the case where the heatmap prediction network is also trained during the training of the coordinate regression network, refer to Figure 7, which is a second flowchart of a single training process of the coordinate regression network provided in an embodiment of the present disclosure. As shown in Figure 7, one training process includes:
[0116] S701. Using a coordinate regression network, coordinate detection is performed on key points in the training image to obtain the predicted coordinates of key points in the training image and the feature map of the training image.
[0117] S702. Based on the predicted coordinates of key points on the training image, perform a differentiable transformation on the reference heatmap of the key points to obtain the initial heatmap of the key points on the training image.
[0118] S703. Using a heatmap prediction network, features are extracted from the initial heatmap and feature map of key points on the training image to obtain the predicted heatmap of key points on the training image.
[0119] S704. Adjust the parameters of the coordinate regression network based on the differences between the actual coordinates and the predicted coordinates of the key points, as well as the differences between the actual heatmap and the predicted heatmap of the key points.
[0120] The implementation principles and technical effects of S701 to S704 can be referred to in the aforementioned embodiments, and will not be repeated here.
[0121] S705. Adjust the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap of the key points.
[0122] In this embodiment, the difference between the actual heatmap and the predicted heatmap of the key points in the training image can be determined based on the second loss function, the actual heatmap of the key points in the training image, and the predicted heatmap of the key points in the training image, thus obtaining the second loss value. Based on the second loss value, the parameters of the heatmap prediction network are adjusted using a model optimization algorithm.
[0123] This completes one training process of both the coordinate regression network and the heatmap prediction network. Over repeated training, the continuous optimization of the heatmap prediction network improves the optimization effect on the coordinate regression network, thereby continuously improving the key point detection accuracy of the coordinate regression network.
[0124] Refer to Figure 8, which is a second schematic diagram of the model structure provided in an embodiment of this disclosure.
[0125] As shown in Figure 8, on the basis of the model structure shown in Figure 2, the coordinate regression network includes multiple downsampling layers and a coordinate regression layer, while the heatmap prediction network includes multiple upsampling layers. That is, the feature extraction layer in the coordinate regression network includes multiple downsampling layers, and the feature extraction layer in the heatmap prediction network includes multiple upsampling layers. Therefore, the heatmap prediction network can extract more image features on the basis of the coordinate regression network, achieving more accurate key point detection.
[0126] As shown in Figure 8, the last downsampling layer in the coordinate regression network is connected to both the coordinate regression layer and the first upsampling layer in the heatmap prediction network. During keypoint detection or model training, the feature map extracted by the last downsampling layer in the coordinate regression network can be input into the coordinate regression layer to obtain the predicted coordinates of the keypoints output by the coordinate regression layer (e.g., in Figure 8, the predicted coordinates of n keypoints comprise n*2 coordinate values in total, including the predicted coordinates on the X-axis and on the Y-axis). The feature map extracted by the last downsampling layer of the coordinate regression network can also be input into the first upsampling layer of the heatmap prediction network. Meanwhile, based on the predicted coordinates of the keypoints output by the coordinate regression layer, a differentiable transformation can be performed on the reference heatmap of the keypoints to obtain the initial heatmap of the keypoints. This initial heatmap is then input into at least one upsampling layer in the heatmap prediction network. In the heatmap prediction network, after the multiple upsampling layers upsample the feature map from the last downsampling layer of the coordinate regression network together with the initial heatmap of the keypoints, the final predicted heatmap of the keypoints on the image is obtained.
[0127] Optionally, the coordinate regression layer is a fully connected layer (FC layer).
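The shape relationship between the feature map and the n*2 coordinate values can be sketched with a single fully connected layer. The feature map size, the parameter values, and the name `coordinate_regression_layer` are hypothetical and chosen only to make the shapes concrete.

```python
import numpy as np

def coordinate_regression_layer(feature_map, weight, bias):
    """Fully connected layer mapping a flattened feature map to n*2 values.

    feature_map: (C, H, W); weight: (n*2, C*H*W); bias: (n*2,).
    Returns predicted coordinates of shape (n, 2) in (x, y) order.
    """
    flat = feature_map.reshape(-1)
    values = weight @ flat + bias
    return values.reshape(-1, 2)

n = 3                                   # hypothetical number of key points
feature_map = np.ones((4, 2, 2))        # hypothetical last downsampling output
weight = np.zeros((n * 2, feature_map.size))
bias = np.arange(n * 2, dtype=np.float64)
coords = coordinate_regression_layer(feature_map, weight, bias)
```

With n key points the layer emits n*2 values, which are read here as n pairs of X-axis and Y-axis coordinates.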
[0128] Based on the key point detection model shown in Figure 8, refer to Figure 9, which is a third flowchart of a single training process of the coordinate regression network provided in an embodiment of the present disclosure. As shown in Figure 9, one training process of the key point detection model includes:
[0129] S901. Using a coordinate regression network, coordinate detection is performed on key points in the training image to obtain the predicted coordinates of key points in the training image and the feature map of the training image.
[0130] In this embodiment, the training image is input into a coordinate regression network. Multiple downsampling layers within the network downsample the training image multiple times, resulting in a feature map output by the last downsampling layer, which is the feature map of the training image. This feature map is then input into the coordinate regression layer of the network to obtain the predicted coordinates of key points on the training image.
[0131] S902. Based on the predicted coordinates of key points on the training image, perform a differentiable transformation on the reference heatmap of the key points to obtain the initial heatmap of the key points on the training image.
[0132] The implementation principle and technical effects of S902 can be referred to in the aforementioned embodiments, and will not be repeated here.
[0133] S903. Cascade the initial heatmaps of key points on the training image to obtain the target contour map formed by the key points on the training image.
[0134] In this embodiment, since the initial heatmaps of key points on the training images are of the same size, the initial heatmaps of key points on the training images can be merged, i.e., cascaded, to obtain a target contour map formed by the key points on the training images. The target contour map then displays multiple key points.
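One way to read the cascading step is sketched below. Stacking along a channel axis and collapsing by per-pixel maximum for display are assumptions; the disclosure only states that same-sized initial heatmaps are merged into a target contour map. The names `cascade_heatmaps` and `overlay` are hypothetical.

```python
import numpy as np

def cascade_heatmaps(initial_heatmaps):
    """Stack n same-sized initial heatmaps into an (n, H, W) contour map."""
    shapes = {h.shape for h in initial_heatmaps}
    if len(shapes) != 1:
        raise ValueError("all initial heatmaps must have the same size")
    return np.stack(initial_heatmaps, axis=0)

def overlay(contour_map):
    """Collapse the channel axis by per-pixel maximum, e.g., for display."""
    return contour_map.max(axis=0)

# Two hypothetical 4x4 initial heatmaps, each with one peak.
first = np.zeros((4, 4))
first[1, 1] = 1.0
second = np.zeros((4, 4))
second[2, 3] = 1.0
contour_map = cascade_heatmaps([first, second])
display = overlay(contour_map)
```

Keeping one channel per key point preserves which peak belongs to which key point, while the overlay shows all key points in a single image.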
[0135] S904. After feature fusion of the target contour map and the feature map of the training image, the fused result is input into the first upsampling layer in the heatmap prediction network for upsampling processing.
[0136] S905. After feature fusion of the target contour map and the output data of the first upsampling layer, the data is input into the next upsampling layer for upsampling processing. After multiple upsampling layers, a predicted heatmap of key points on the training image is obtained.
[0137] In this embodiment, during training, the input data of the upsampling layer of the heatmap prediction network may include a feature map resulting from feature fusion of the feature map output from the previous network layer and the target contour map. This provides richer image features for the upsampling process, improves the accuracy of predicting keypoint heatmaps, and consequently enhances the training performance of the coordinate regression network.
[0138] Optionally, during the feature fusion process between the feature map output from the previous network layer of the upsampling layer and the target contour map, the target contour map can be convolved to obtain a corresponding feature map. This feature map can then be merged with the feature map output from the previous network layer. For example, the pixel values of the feature map of the target contour map and the feature map output from the previous network layer can be added or weighted, or each pixel in the merged result can correspond to two pixel values, one from the feature map of the target contour map and the other from the feature map output from the previous network layer.
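The three merging options mentioned above (addition, weighting, and keeping two values per pixel) can be sketched as follows. The 1x1 convolution, the `alpha` weight, and the assumption that both inputs already share the same spatial size are illustrative choices, as are the names `conv1x1` and `fuse`.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse(contour_features, layer_features, mode="add", alpha=0.5):
    """Merge two feature maps of the same spatial size."""
    if mode == "add":
        return contour_features + layer_features
    if mode == "weighted":
        return alpha * contour_features + (1.0 - alpha) * layer_features
    if mode == "concat":
        # Each pixel keeps values from both sources along the channel axis.
        return np.concatenate([contour_features, layer_features], axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

contour_map = np.ones((3, 4, 4))               # hypothetical 3 key points
contour_features = conv1x1(contour_map, np.ones((2, 3)))
layer_features = np.ones((2, 4, 4))            # hypothetical previous-layer output
added = fuse(contour_features, layer_features, "add")
weighted = fuse(contour_features, layer_features, "weighted", alpha=0.5)
stacked = fuse(contour_features, layer_features, "concat")
```

Addition and weighting require matching channel counts, which is why the contour map is first convolved; concatenation instead grows the channel dimension and leaves the combination to the following layer.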
[0139] Optionally, during training, the input data for the heatmap prediction network may also include training images. Specifically, the training images and feature maps extracted from the training images by the coordinate regression network (especially the feature maps output by the last downsampling layer in the coordinate regression network) can be input into the first upsampling layer of the heatmap prediction network. Thus, by inputting the original images into the heatmap prediction network, richer image features are provided for the upsampling processing in the network, improving the accuracy of predicting keypoint heatmaps and consequently enhancing the training performance of the coordinate regression network.
[0140] S906. Adjust the parameters of the coordinate regression network based on the differences between the actual coordinates and the predicted coordinates of the key points on the training image, as well as the differences between the actual heatmap and the predicted heatmap of the key points on the training image.
[0141] The implementation principle and technical effects of S906 can be referred to in the aforementioned embodiments, and will not be repeated here.
[0142] Optionally, a training process may also include: S907, adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap of the key points on the training image and the predicted heatmap of the key points on the training image.
[0143] The implementation principle and technical effects of S907 can be referred to in the aforementioned embodiments, and will not be repeated here.
[0144] Therefore, in this embodiment, based on the predicted coordinates of key points output by the coordinate regression network, a differentiable transformation is performed on the reference heatmap of the key points to obtain an initial heatmap of the key points. The initial heatmap is then concatenated to form a target contour map. In the heatmap prediction network, the target contour map and the feature map output by the network layer are fused, thereby providing more and richer image information to the upsampling layer in the heatmap prediction network, improving the accuracy of key point heatmap prediction, and thus improving the key point detection accuracy of the coordinate regression network.
[0145] In some embodiments, during the keypoint detection process of the target image to be detected using the keypoint detection model, in addition to detecting the predicted coordinates of the keypoints on the target image through the coordinate regression network, a heatmap prediction network can also be used to detect the predicted heatmap of the keypoints on the target image.
[0146] In this embodiment, the training of the coordinate regression network is improved during model training: two types of loss are considered, namely a first loss value determined by the difference between the actual coordinates and the predicted coordinates of keypoints in the training image, and a second loss value determined by the difference between the actual heatmap and the predicted heatmap of keypoints in the training image, and the parameters of the coordinate regression network are adjusted accordingly. This improves the keypoint detection accuracy of the coordinate regression network and, consequently, that of the heatmap prediction network trained alongside it. Therefore, continuing to use the heatmap prediction network after the coordinate regression network to generate predicted heatmaps of keypoints in the target image is less efficient than keypoint detection using only the coordinate regression network, but yields more accurate keypoint heatmaps for the target image.
[0147] Corresponding to the key point detection method in the above embodiments, Figure 10 is a structural block diagram of a key point detection device provided in an embodiment of this disclosure. For ease of explanation, only the parts relevant to the embodiments of this disclosure are shown. Referring to Figure 10, the key point detection device includes: a determination unit 1001 and a detection unit 1002.
[0148] The determination unit 1001 is used to determine the target image on which key points are to be detected.
[0149] The detection unit 1002 is used to detect the coordinates of key points in the target image through a coordinate regression network, and obtain the predicted coordinates of key points in the target image.
[0150] Here, the coordinate regression network is trained with the assistance of the heatmap prediction network. The input data of the heatmap prediction network includes the initial heatmap of key points on the training image and the feature map extracted by the coordinate regression network from the training image. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0151] According to one or more embodiments of this disclosure, the coordinate regression network is trained multiple times. Each training process of the coordinate regression network includes: using the coordinate regression network to detect the coordinates of key points on a training image, obtaining predicted coordinates and feature maps of the key points on the training image; performing a differentiable transformation on a reference heatmap of the key points based on the predicted coordinates of the key points on the training image, obtaining an initial heatmap of the key points on the training image; using a heatmap prediction network to extract features from the initial heatmap and feature maps of the key points on the training image, obtaining a predicted heatmap of the key points on the training image; and adjusting the parameters of the coordinate regression network based on the differences between the actual and predicted coordinates of the key points on the training image and the differences between the actual and predicted heatmaps of the key points on the training image.
[0152] According to one or more embodiments of this disclosure, an initial heatmap of the key points on the training image is obtained by performing a differentiable transformation on a reference heatmap of the key points based on the predicted coordinates of the key points on the training image, including: determining a transformation matrix of the key points based on the difference between the predicted coordinates of the key points on the training image and the reference coordinates of the key points on the reference heatmap of the key points; and performing an image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image.
[0153] According to one or more embodiments of this disclosure, an image transformation is performed on a reference heatmap of key points based on a transformation matrix of key points to obtain an initial heatmap of key points on a training image, including: determining the mapping coordinates of multiple points on the reference heatmap of key points based on the transformation matrix of key points; and performing an image transformation on the reference heatmap of key points based on the mapping coordinates of multiple points on the reference heatmap of key points to obtain an initial heatmap.
[0154] According to one or more embodiments of this disclosure, a coordinate regression network includes multiple downsampling layers and a coordinate regression layer, and a heatmap prediction network includes multiple upsampling layers. The heatmap prediction network extracts features from the initial heatmap and feature map of key points on a training image to obtain a predicted heatmap of the key points on the training image. This includes: cascading the initial heatmaps of key points on the training image to obtain a target contour map formed by the key points on the training image; fusing the target contour map and feature map, and then inputting them into the first upsampling layer of the heatmap prediction network for upsampling processing; fusing the target contour map and the output data of the first upsampling layer, and then inputting them into the next upsampling layer for upsampling processing. After multiple upsampling layers, a predicted heatmap of the key points on the training image is obtained.
[0155] According to one or more embodiments of this disclosure, after adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates and the difference between the actual heatmap and the predicted heatmap, the method further includes: adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
[0156] The key point detection device provided in this embodiment can be used to execute the technical solution of the above-described key point detection method embodiment. Its implementation principle and technical effect are similar, and will not be described again here.
[0157] Corresponding to the model training method in the above embodiments, Figure 11 is a structural block diagram of a model training device provided in an embodiment of this disclosure. For ease of explanation, only the parts relevant to the embodiments of this disclosure are shown. Referring to Figure 11, the model training device includes: a determination unit 1101 and a training unit 1102.
[0158] The determination unit 1101 is used to determine training data, which includes training images and sample labels corresponding to the training images. The sample labels include the actual coordinates and the actual heatmap of key points on the training images.
[0159] The training unit 1102 is used to train the coordinate regression network based on the training data.
[0160] Here, a heatmap prediction network is used to assist the training of the coordinate regression network. The input data of the heatmap prediction network includes the initial heatmap of key points on the training image and the feature map extracted by the coordinate regression network from the training image. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image.
[0161] According to one or more embodiments of this disclosure, the coordinate regression network is trained multiple times. During one training iteration of the coordinate regression network, the training unit 1102 is used to: detect the coordinates of key points on the training image using the coordinate regression network to obtain predicted coordinates and feature maps of the key points on the training image; perform a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image to obtain an initial heatmap of the key points on the training image; extract features from the initial heatmap and feature maps of the key points on the training image using a heatmap prediction network to obtain a predicted heatmap of the key points on the training image; and adjust the parameters of the coordinate regression network based on the differences between the actual coordinates and the predicted coordinates, as well as the differences between the actual heatmap and the predicted heatmap.
[0162] According to one or more embodiments of this disclosure, in the process of performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on a training image to obtain an initial heatmap of key points on the training image, the training unit 1102 is configured to: determine the transformation matrix of key points based on the difference between the predicted coordinates of key points on the training image and the reference coordinates of key points on the reference heatmap of key points; and perform image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain an initial heatmap of key points on the training image.
[0163] According to one or more embodiments of this disclosure, in the process of performing image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain an initial heatmap of key points on a training image, the training unit 1102 is used to: determine the mapping coordinates of multiple points on the reference heatmap of key points based on the transformation matrix of key points; and perform image transformation on the reference heatmap of key points based on the mapping coordinates of multiple points on the reference heatmap of key points to obtain an initial heatmap.
[0164] According to one or more embodiments of this disclosure, the coordinate regression network includes multiple downsampling layers and coordinate regression layers, and the heatmap prediction network includes multiple upsampling layers. In the process of extracting features from the initial heatmap and feature map of key points on the training image through the heatmap prediction network to obtain a predicted heatmap of key points on the training image, the training unit 1102 is used to: cascade the initial heatmap of key points on the training image to obtain a target contour map formed by the key points on the training image; fuse the target contour map and feature map, and then input them into the first upsampling layer of the heatmap prediction network for upsampling processing; fuse the target contour map and the output data of the first upsampling layer, and then input them into the next upsampling layer for upsampling processing. After multiple upsampling layers, a predicted heatmap of key points on the training image is obtained.
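The data flow described above can be sketched as follows. This is a shape-level illustration under several assumptions that the paragraph does not state: the initial heatmaps are merged by pixel-wise maximum (stacking them as channels would be another reading), fusion is done by appending the resized contour map as an extra channel, and each learned upsampling layer is replaced by a 2x nearest-neighbour resize. All names are hypothetical.

```python
def merge_heatmaps(heatmaps):
    """Merge per-key-point heatmaps into one contour map (pixel-wise max, assumed)."""
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [[max(m[y][x] for m in heatmaps) for x in range(w)] for y in range(h)]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a single-channel map."""
    h, w = len(img), len(img[0])
    return [[img[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def fuse(contour, channels):
    """Fuse by appending the contour map, resized to match, as one more channel."""
    h, w = len(channels[0]), len(channels[0][0])
    return channels + [resize_nearest(contour, h, w)]


def upsample_layer(channels):
    """Stand-in for a learned upsampling layer: 2x nearest-neighbour per channel."""
    return [resize_nearest(c, 2 * len(c), 2 * len(c[0])) for c in channels]


def heatmap_branch_flow(initial_heatmaps, feature_channels, num_layers=2):
    """Re-inject the contour map before every upsampling layer."""
    contour = merge_heatmaps(initial_heatmaps)
    x = feature_channels
    for _ in range(num_layers):
        x = upsample_layer(fuse(contour, x))
    return x


# Two 4x4 initial heatmaps and one 2x2 feature channel.
hm_a = [[0.0] * 4 for _ in range(4)]
hm_b = [[0.0] * 4 for _ in range(4)]
hm_a[1][1] = 1.0
hm_b[2][3] = 0.5
contour = merge_heatmaps([hm_a, hm_b])
out = heatmap_branch_flow([hm_a, hm_b], [[[0.1, 0.2], [0.3, 0.4]]])
```

In a trained network each upsampling layer would map its fused input to a fixed number of channels; here the channel count simply grows by one per layer, which keeps the sketch focused on where the contour map enters the pipeline.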
[0165] According to one or more embodiments of the present disclosure, the training unit 1102 is further configured to: adjust the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
[0166] The model training device provided in this embodiment can be used to execute the technical solutions of the above-described model training method embodiments. Its implementation principle and technical effects are similar, and will not be described again here.
[0167] Referring to Figure 12, a structural schematic of an electronic device 1200 suitable for implementing embodiments of the present disclosure is shown. The electronic device 1200 can be a terminal device or a server. The terminal device can include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), portable Android devices (PADs), portable media players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 12 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0168] As shown in Figure 12, the electronic device 1200 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing device 1201, ROM 1202, and RAM 1203 are interconnected via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
[0169] Typically, the following devices can be connected to the I/O interface 1205: input devices 1206 including, for example, a touchscreen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1207 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1208 including, for example, magnetic tape, hard disk, etc.; and communication devices 1209. The communication device 1209 allows the electronic device 1200 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 12 shows an electronic device 1200 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.
[0170] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 1209, or installed from storage device 1208, or installed from ROM 1202. When the computer program is executed by processing device 1201, it performs the functions defined in the methods of embodiments of this disclosure.
[0171] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0172] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.
[0173] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
[0174] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a Local Area Network (LAN) or a Wide Area Network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0175] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0176] The units described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".
[0177] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0178] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0179] In a first aspect, according to one or more embodiments of this disclosure, a keypoint detection method is provided, comprising: determining a target image to be detected; performing keypoint coordinate detection on the target image using a coordinate regression network to obtain predicted coordinates of keypoints on the target image; wherein the coordinate regression network is trained with the assistance of a heatmap prediction network, the input data of the heatmap prediction network includes an initial heatmap of keypoints on the training image and a feature map extracted from the training image by the coordinate regression network, the initial heatmap being obtained by performing a differentiable transformation on a reference heatmap of keypoints based on the predicted coordinates of keypoints on the training image.
[0180] According to one or more embodiments of this disclosure, the coordinate regression network is trained multiple times. Each training process of the coordinate regression network includes: using the coordinate regression network to detect the coordinates of key points on the training image, obtaining predicted coordinates of the key points and the feature map on the training image; performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image, obtaining an initial heatmap of the key points on the training image; using the heatmap prediction network to extract features from the initial heatmap and the feature map of the key points on the training image, obtaining a predicted heatmap of the key points on the training image; and adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates of the key points on the training image, and the difference between the actual heatmap and the predicted heatmap of the key points on the training image.
[0181] According to one or more embodiments of this disclosure, the step of performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on the training image to obtain an initial heatmap of key points on the training image includes: determining a transformation matrix of key points based on the difference between the predicted coordinates of key points on the training image and the reference coordinates of key points on the reference heatmap of key points; and performing an image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain the initial heatmap of key points on the training image.
[0182] According to one or more embodiments of this disclosure, the step of performing image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain an initial heatmap of key points on the training image includes: determining the mapping coordinates of multiple points on the reference heatmap of key points based on the transformation matrix of key points; and performing image transformation on the reference heatmap of key points based on the mapping coordinates of multiple points on the reference heatmap of key points to obtain the initial heatmap.
[0183] According to one or more embodiments of this disclosure, the coordinate regression network includes multiple downsampling layers and coordinate regression layers, and the heatmap prediction network includes multiple upsampling layers. The step of extracting features from the initial heatmap and the feature map of key points on the training image using the heatmap prediction network to obtain a predicted heatmap of key points on the training image includes: cascading the initial heatmaps of key points on the training image to obtain a target contour map formed by the key points on the training image; fusing the target contour map and the feature map, and then inputting them into the first upsampling layer of the heatmap prediction network for upsampling processing; fusing the target contour map and the output data of the first upsampling layer, and then inputting them into the next upsampling layer for upsampling processing; after multiple upsampling layers, the predicted heatmap of key points on the training image is obtained.
[0184] According to one or more embodiments of this disclosure, after adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates and the difference between the actual heatmap and the predicted heatmap, the method further includes: adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
[0185] In a second aspect, according to one or more embodiments of this disclosure, a model training method is provided, comprising: determining training data, the training data including training images and sample labels corresponding to the training images, the sample labels including actual coordinates and actual heatmaps of key points on the training images; training a coordinate regression network based on the training data; wherein a heatmap prediction network is used to assist the training of the coordinate regression network, the input data of the heatmap prediction network including an initial heatmap of key points on the training images and feature maps extracted from the training images by the coordinate regression network, the initial heatmap being obtained by performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on the training images.
[0186] According to one or more embodiments of this disclosure, the coordinate regression network is trained multiple times. Each training process of the coordinate regression network includes: using the coordinate regression network to detect the coordinates of key points on the training image, obtaining predicted coordinates of the key points and the feature map on the training image; performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image, obtaining an initial heatmap of the key points on the training image; using the heatmap prediction network to extract features from the initial heatmap and the feature map of the key points on the training image, obtaining a predicted heatmap of the key points on the training image; and adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates, and the difference between the actual heatmap and the predicted heatmap.
[0187] According to one or more embodiments of this disclosure, the step of performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on the training image to obtain an initial heatmap of key points on the training image includes: determining a transformation matrix of key points based on the difference between the predicted coordinates of key points on the training image and the reference coordinates of key points on the reference heatmap of key points; and performing an image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain the initial heatmap of key points on the training image.
[0188] According to one or more embodiments of this disclosure, the step of performing image transformation on the reference heatmap of key points based on the transformation matrix of key points to obtain an initial heatmap of key points on the training image includes: determining the mapping coordinates of multiple points on the reference heatmap of key points based on the transformation matrix of key points; and performing image transformation on the reference heatmap of key points based on the mapping coordinates of multiple points on the reference heatmap of key points to obtain the initial heatmap.
[0189] According to one or more embodiments of this disclosure, the coordinate regression network includes multiple downsampling layers and coordinate regression layers, and the heatmap prediction network includes multiple upsampling layers. The step of extracting features from the initial heatmap and the feature map of key points on the training image using the heatmap prediction network to obtain a predicted heatmap of key points on the training image includes: cascading the initial heatmaps of key points on the training image to obtain a target contour map formed by the key points on the training image; fusing the target contour map and the feature map, and then inputting them into the first upsampling layer of the heatmap prediction network for upsampling processing; fusing the target contour map and the output data of the first upsampling layer, and then inputting them into the next upsampling layer for upsampling processing; after multiple upsampling layers, the predicted heatmap of key points on the training image is obtained.
[0190] According to one or more embodiments of this disclosure, after adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates and the difference between the actual heatmap and the predicted heatmap, the method further includes: adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
[0191] In a third aspect, according to one or more embodiments of this disclosure, a key point detection device is provided, comprising: a determining unit, configured to determine a target image to be detected; and a detection unit, configured to perform key point coordinate detection on the target image using a coordinate regression network to obtain predicted coordinates of key points on the target image; wherein the coordinate regression network is trained with the assistance of a heatmap prediction network, the input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network, and the initial heatmap is obtained by performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on the training image.
[0192] In a fourth aspect, according to one or more embodiments of this disclosure, a model training device is provided, comprising: a determining unit, configured to determine training data, the training data including a training image and sample labels corresponding to the training image, the sample labels including actual coordinates and actual heatmaps of key points on the training image; and a training unit, configured to train a coordinate regression network based on the training data; wherein a heatmap prediction network is used to assist the training of the coordinate regression network, the input data of the heatmap prediction network including an initial heatmap of key points on the training image and feature maps extracted from the training image by the coordinate regression network, the initial heatmap being obtained by performing a differentiable transformation on a reference heatmap of key points based on the predicted coordinates of key points on the training image.
[0193] In a fifth aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, comprising: at least one processor and a memory; the memory storing computer-executable instructions; the at least one processor executing the computer-executable instructions stored in the memory, such that the at least one processor performs a key point detection method as provided in the first aspect or various possible embodiments of the first aspect, or performs a model training method as provided in the second aspect or various possible embodiments of the second aspect.
[0194] In a sixth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, wherein computer-executable instructions are stored therein, which, when executed by a processor, implement the key point detection method provided by the first aspect and various possible embodiments of the first aspect, or implement the model training method provided by the second aspect or various possible embodiments of the second aspect.
[0195] In a seventh aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, the computer program product comprising computer execution instructions, which, when executed by a processor, implement the key point detection method as provided in the first aspect and various possible embodiments of the first aspect, or implement the model training method as provided in the second aspect and various possible embodiments of the second aspect.
[0196] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. One example is a technical solution formed by replacing the above features with technical features having similar functions, including but not limited to those disclosed in this disclosure.
[0197] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0198] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A key point detection method, comprising: determining a target image for key point detection; and performing key point coordinate detection on the target image using a coordinate regression network to obtain predicted coordinates of key points on the target image; wherein the coordinate regression network is trained with the assistance of a heatmap prediction network; the input data of the heatmap prediction network includes an initial heatmap of key points on a training image and a feature map extracted from the training image by the coordinate regression network; the initial heatmap is obtained by performing a differentiable transformation on a reference heatmap of the key points based on predicted coordinates of the key points on the training image; and the differentiable transformation includes: determining a transformation matrix of the key points based on the difference between the predicted coordinates of the key points on the training image and reference coordinates of the key points on the reference heatmap; and performing image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image.
2. The key point detection method of claim 1, wherein the coordinate regression network is trained multiple times, and one training process of the coordinate regression network includes: detecting the coordinates of key points on the training image using the coordinate regression network to obtain the predicted coordinates of the key points on the training image and the feature map; performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image to obtain the initial heatmap of the key points on the training image; extracting features from the initial heatmap and the feature map of the key points on the training image using the heatmap prediction network to obtain a predicted heatmap of the key points on the training image; and adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates of the key points on the training image, and the difference between the actual heatmap and the predicted heatmap of the key points on the training image.
3. The key point detection method according to claim 2, wherein the step of performing image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image includes: determining mapping coordinates of multiple points on the reference heatmap of the key points based on the transformation matrix of the key points; and performing image transformation on the reference heatmap of the key points based on the mapping coordinates of the multiple points on the reference heatmap of the key points to obtain the initial heatmap.
4. The key point detection method according to claim 2, wherein the coordinate regression network includes multiple downsampling layers and coordinate regression layers, the heatmap prediction network includes multiple upsampling layers, and the step of extracting features from the initial heatmap and the feature map of key points on the training image through the heatmap prediction network to obtain a predicted heatmap of key points on the training image includes: concatenating the initial heatmaps of the key points on the training image to obtain a target contour map formed by the key points on the training image; performing feature fusion on the target contour map and the feature map, and inputting the result into the first upsampling layer of the heatmap prediction network for upsampling processing; performing feature fusion on the target contour map and the output data of the first upsampling layer, and inputting the result into the next upsampling layer for upsampling processing; and obtaining the predicted heatmap of the key points on the training image after the multiple upsampling layers.
5. The key point detection method according to any one of claims 1 to 4, further comprising, after adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates and the difference between the actual heatmap and the predicted heatmap: adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
6. A model training method, comprising: determining training data, the training data including training images and sample labels corresponding to the training images, the sample labels including actual coordinates and actual heatmaps of key points on the training images; and training a coordinate regression network based on the training data; wherein the training of the coordinate regression network is assisted by a heatmap prediction network; the input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted from the training image by the coordinate regression network; the initial heatmap is obtained by performing a differentiable transformation on a reference heatmap of the key points based on predicted coordinates of the key points on the training image; and the differentiable transformation includes: determining a transformation matrix of the key points based on the difference between the predicted coordinates of the key points on the training image and reference coordinates of the key points on the reference heatmap; and performing image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image.
7. The model training method of claim 6, wherein the coordinate regression network is trained multiple times, and one training process of the coordinate regression network includes: detecting the coordinates of key points on the training image using the coordinate regression network to obtain the predicted coordinates of the key points on the training image and the feature map; performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image to obtain the initial heatmap of the key points on the training image; extracting features from the initial heatmap and the feature map of the key points on the training image using the heatmap prediction network to obtain a predicted heatmap of the key points on the training image; and adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates, and the difference between the actual heatmap and the predicted heatmap.
8. The model training method according to claim 7, wherein the step of performing image transformation on the reference heatmap of the key points based on the transformation matrix of the key points to obtain the initial heatmap of the key points on the training image includes: Based on the transformation matrix of the key points, the mapped coordinates of multiple points on the reference heatmap of the key points are determined; Based on the mapped coordinates of the multiple points on the reference heatmap of the key points, the reference heatmap of the key points is transformed to obtain the initial heatmap.
9. The model training method according to claim 7, wherein the coordinate regression network includes multiple downsampling layers and coordinate regression layers, the heatmap prediction network includes multiple upsampling layers, and the step of extracting features from the initial heatmap and the feature map of key points on the training image through the heatmap prediction network to obtain the predicted heatmap of key points on the training image includes: The initial heatmaps of key points on the training image are concatenated to obtain the target contour map formed by the key points on the training image. After feature fusion of the target contour map and the feature map, the data is input into the first upsampling layer of the heatmap prediction network for upsampling processing. After feature fusion of the target contour map and the output data of the first upsampling layer, the data is input into the next upsampling layer for upsampling processing. After multiple upsampling layers, a predicted heatmap of key points on the training image is obtained.
10. The model training method according to any one of claims 6 to 9, further comprising, after adjusting the parameters of the coordinate regression network based on the difference between the actual coordinates and the predicted coordinates and the difference between the actual heatmap and the predicted heatmap: adjusting the parameters of the heatmap prediction network based on the difference between the actual heatmap and the predicted heatmap.
11. A key point detection device, comprising: The determining unit is used to determine the target image to be detected for key points; The detection unit is used to detect the coordinates of key points in the target image through a coordinate regression network, and obtain the predicted coordinates of the key points in the target image. The coordinate regression network is trained with the assistance of the heatmap prediction network. The input data of the heatmap prediction network includes the initial heatmap of key points on the training image and the feature map extracted by the coordinate regression network from the training image. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image. The differentiable transformation includes: The transformation matrix of the key points is determined based on the difference between the predicted coordinates of the key points on the training image and the reference coordinates of the key points on the reference heatmap. Based on the transformation matrix of the key points, the reference heatmap of the key points is transformed to obtain the initial heatmap of the key points on the training image.
12. A model training device, comprising: A determining unit is used to determine training data, the training data including training images and sample labels corresponding to the training images, the sample labels including the actual coordinates of key points on the training images and the actual heatmap; The training unit is used to train the coordinate regression network based on the training data. The training of the coordinate regression network is assisted by a heatmap prediction network. The input data of the heatmap prediction network includes an initial heatmap of key points on the training image and a feature map extracted by the coordinate regression network from the training image. The initial heatmap is obtained by performing a differentiable transformation on the reference heatmap of the key points based on the predicted coordinates of the key points on the training image. The differentiable transformation includes: The transformation matrix of the key points is determined based on the difference between the predicted coordinates of the key points on the training image and the reference coordinates of the key points on the reference heatmap. Based on the transformation matrix of the key points, the reference heatmap of the key points is transformed to obtain the initial heatmap of the key points on the training image.
13. An electronic device, comprising: At least one processor and a memory; The memory stores computer-executable instructions; The at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the key point detection method as described in any one of claims 1 to 5 or the model training method as described in any one of claims 6 to 10.
14. A computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the key point detection method as described in any one of claims 1 to 5 or the model training method as described in any one of claims 6 to 10.
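Illustrative sketch (not part of the claims): the differentiable transformation recited in claims 6 and 8 could be approximated as follows. This is a minimal sketch under stated assumptions, not the patented implementation. It assumes the transformation matrix is a pure translation built from the difference between the predicted and reference coordinates, that the reference heatmap is a Gaussian blob, and that resampling uses bilinear interpolation. The function names (gaussian_heatmap, translation_matrix, bilinear_sample, warp_heatmap) and the sigma value are hypothetical. The sketch maps each output pixel back into the reference heatmap, which is a common way to realize such a warp; the claims themselves describe mapping points of the reference heatmap.

```python
import math


def gaussian_heatmap(size, center, sigma=1.5):
    """Reference heatmap (assumed form): a Gaussian blob at center = (x, y)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def translation_matrix(pred_xy, ref_xy):
    """2x3 affine matrix built from the predicted-minus-reference offset.

    Assumption: a pure translation; the claims only say the matrix is
    determined from the coordinate difference.
    """
    return [
        [1.0, 0.0, pred_xy[0] - ref_xy[0]],
        [0.0, 1.0, pred_xy[1] - ref_xy[1]],
    ]


def bilinear_sample(heatmap, x, y):
    """Sample the heatmap at a real-valued (x, y); zero outside its bounds."""
    height, width = len(heatmap), len(heatmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < width and 0 <= yi < height:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * heatmap[yi][xi]
    return value


def warp_heatmap(ref_heatmap, matrix):
    """Initial heatmap: each output pixel reads the reference at its mapped point."""
    height, width = len(ref_heatmap), len(ref_heatmap[0])
    dx, dy = matrix[0][2], matrix[1][2]
    return [
        [bilinear_sample(ref_heatmap, x - dx, y - dy) for x in range(width)]
        for y in range(height)
    ]


# Reference key point at (4, 4); the regression network predicts (6.0, 2.0).
reference = gaussian_heatmap(9, (4, 4))
matrix = translation_matrix((6.0, 2.0), (4, 4))
initial = warp_heatmap(reference, matrix)
# The peak of the initial heatmap now sits at row 2, column 6.
```

Because the bilinear weights vary continuously with the offset, a sub-pixel prediction such as (6.5, 2.0) spreads the peak across neighboring pixels instead of snapping to one of them. In an automatic differentiation framework, this property is what would allow the heatmap loss to propagate gradients back to the predicted coordinates.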
Citation Information
Patent Citations
Key point detection method and device, equipment and computer readable medium
CN113297973A
Face key point detection method and face key point detection device
CN113688664A
Joint image key point automatic detection method and device based on deep learning, equipment and storage medium
CN113706463A