Kalman filter-based hair key point prediction method and system, and storage medium
By combining Kalman filtering with a mapping network to calculate the inter-frame motion vectors of hair keypoints, the problem of unstable hair keypoint detection in video streams is solved, achieving higher detection accuracy and real-time application capabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN MEITUZHIJIA TECH
- Filing Date
- 2023-05-04
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, the detection of hair key points in video streams is poor, especially when there is irregular jitter. Existing methods usually sacrifice accuracy for stability, resulting in poor overall performance.
A hair keypoint prediction method based on Kalman filtering is adopted. By obtaining the motion vectors of the face keypoints in the current frame and the face keypoints in the previous frame, and combining the mapping network and the Kalman filtering formula, the inter-frame motion vectors of the hair keypoints are calculated, and the hair keypoints are output using the Kalman filtering formula.
It improves the stability and accuracy of hair key point detection in video streams, reduces computational load, and enables effective application in real-time scenarios.
Figure CN116596965B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to a method, system, and storage medium for predicting hair key points based on Kalman filtering.
Background Technology
[0002] Keypoint detection is a computer vision technique that aims to identify meaningful feature points in an image or video and locate each feature point. These feature points are often called keypoints because they are unique in an image and can be used to represent the content or shape of the image.
[0003] Keypoint detection is fundamental to computer vision and is widely used in many fields, including face recognition, gesture recognition, and video analytics. A common problem with keypoint detection in video streams is that keypoints do not change smoothly across consecutive frames but instead jitter irregularly within a certain range. Furthermore, because neural networks are sensitive to numerical values, even when the image is almost static, small pixel-level changes (such as noise) can cause visible instability and jitter.
[0004] In facial landmark detection tasks, the detection quality varies among different landmarks. Some landmarks have relatively clear edges, such as the corners of the eyes and eyebrows; others are more blurred, such as the edges of hair. Landmarks at various locations on the hair are easily affected by factors such as changing hairstyles and scattered hair strands, resulting in poor edge localization quality. In video streams, this often manifests as irregular jitter in localization. Therefore, existing technologies perform poorly on the detection of relatively unstable hair landmarks in video streams.
[0005] In existing technologies, conventional sequence filtering methods (such as smoothing filtering and Gaussian filtering) are usually applied to hair key point detection. In essence, these methods trade accuracy for stability, so overall performance remains poor.
Summary of the Invention
[0006] The main objective of this invention is to provide a hair keypoint prediction method, system, and storage medium based on Kalman filtering, aiming to solve the technical problem of poor detection performance of existing hair keypoint prediction methods.
[0007] To achieve the above objective, this invention provides a hair keypoint prediction method based on Kalman filtering, which includes the following steps: acquiring the current frame as the input image and performing keypoint prediction processing on it, where the keypoint prediction processing includes face keypoint prediction and hair keypoint prediction, to obtain the predicted face keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; obtaining the face keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the face keypoints, which represent the motion of the face keypoints between the two frames; inputting the inter-frame motion vectors of the face keypoints into a pre-trained mapping network B to obtain the predicted inter-frame motion vectors of the hair keypoints; and calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula.
[0008] Optionally, the inter-frame motion vector of the face keypoints is specifically the difference between the face keypoint coordinate information $A_{t-1}$ of the previous frame and the predicted face keypoint coordinate information $A_t$ at time t.
[0009] Optionally, acquiring the current frame as the input image and performing keypoint prediction processing on it specifically means inputting the input image into a keypoint prediction model for keypoint prediction processing. The dataset of the mapping network B is generated as follows: for any original single image I containing keypoint labels, an affine transformation is applied to simulate the motion between frames, yielding the transformed image I'. Each pixel of the image I undergoes a randomly initialized affine transformation according to the following formula:
[0010] $$X_{t+1} = C X_t + B$$
[0011] Where C is the scaling and rotation matrix, B is the translation matrix, $X_t$ is the coordinates of a pixel at time t, and $X_{t+1}$ is the coordinates of that pixel at time t+1, that is, its position in the next frame.
[0012] Optionally, the preset Kalman filter formula is as follows:
[0013]
[0014] Where F, Q, L, and H are all identity matrices; $R_t$ is the noise information; $H^T$ and $F^T$ denote matrix transposes; and K is the Kalman gain. The state terms of the formula are the coordinate information of the hair keypoints in the current frame predicted by combining the face keypoint information, the optimal estimate of the hair keypoints at time t-1, and the optimal estimate of the hair keypoints at time t. The covariance terms P are the noise covariance at time t, the updated noise covariance, and the noise covariance at time t+1.
[0015] Optionally, noise information is obtained by statistically analyzing the output of the key point prediction model and the actual values of the dataset in the mapping network B.
[0016] Optionally, the training process of the mapping network B specifically uses the original single image I containing key point labels and the transformed image I'. The coordinates of the facial key points corresponding to the original single image I and the transformed image I' are subtracted to obtain the inter-frame motion vector. The inter-frame motion vector is used as the input of the mapping network B, and the output of the mapping network B is the predicted hair key point motion vector. Then, a preset loss function is used to guide the network training.
[0017] Optionally, the preset loss function is a cosine similarity function, used to measure the difference between the predicted inter-frame motion vector of the hair keypoint and the true value of the inter-frame motion vector of the hair keypoint. The specific calculation formula is as follows:
[0018] $$\cos\theta = \frac{\sum_{i=1}^{m} \alpha_i \beta_i}{\sqrt{\sum_{i=1}^{m} \alpha_i^2}\,\sqrt{\sum_{i=1}^{m} \beta_i^2}}$$
[0019] Where $\cos\theta$ is a scalar measure of the difference; α is the inter-frame motion vector of the hair keypoints predicted by the mapping network B, with elements $\alpha_i \in \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$; and β is the true value of the inter-frame motion vector of the hair keypoints calculated from the dataset labels, with elements $\beta_i \in \{\beta_1, \beta_2, \ldots, \beta_m\}$.
[0020] Optionally, the mapping network B consists of three fully connected layers.
[0021] Corresponding to the above hair keypoint prediction method based on Kalman filtering, this invention provides a hair keypoint prediction system based on Kalman filtering, comprising: an image acquisition module for acquiring the current frame as an input image; a keypoint prediction module for performing keypoint prediction processing on the input image, where the keypoint prediction processing includes face keypoint prediction and hair keypoint prediction, to obtain the predicted face keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; an inter-frame motion vector calculation module for obtaining the face keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the face keypoints, which represent the motion of the face keypoints between the two frames; a mapping network module for taking the inter-frame motion vectors of the face keypoints as input and obtaining the predicted inter-frame motion vectors of the hair keypoints; and a Kalman filter module for calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula.
[0022] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a hair keypoint prediction program based on Kalman filtering, wherein when the hair keypoint prediction program based on Kalman filtering is executed by a processor, it implements the steps of the hair keypoint prediction method based on Kalman filtering as described above.
[0023] The beneficial effects of this invention are:
[0024] (1) By using key point prediction processing, combined with mapping network B and Kalman filtering, the detection effect of relatively unstable hair key points in video stream can be effectively improved.
[0025] (2) The inter-frame motion vector of the hair key points represents the motion of the hair key points between two frames. Combined with the preset Kalman filter formula, it allows the state of a dynamic system to be estimated from information that carries many uncertainties, which greatly improves the detection of relatively unstable hair key points in video streams.
[0026] (3) Based on the preset Kalman filter formula, the computational load of the present invention is small, which is beneficial for deployment and application in real-time scenarios;
[0027] (4) The difference between the predicted inter-frame motion vector of the hair key points and the true value of the inter-frame motion vector of the hair key points is measured by the cosine similarity function, which improves the training of the mapping network B.
Brief Description of the Drawings
[0028] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings:
[0029] Figure 1 is a simplified flowchart of the hair keypoint prediction method based on Kalman filtering according to the present invention.
Detailed Implementation
[0030] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0031] As shown in Figure 1, the hair keypoint prediction method based on Kalman filtering of the present invention includes the following steps: acquiring the current frame as the input image and performing keypoint prediction processing on it, where the keypoint prediction processing includes face keypoint prediction and hair keypoint prediction, to obtain the predicted face keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; obtaining the face keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the face keypoints, which represent the motion of the face keypoints between the two frames; inputting the inter-frame motion vectors of the face keypoints into a pre-trained mapping network B to obtain the predicted inter-frame motion vectors of the hair keypoints; and calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula.
[0032] This invention effectively improves the detection performance of relatively unstable hair key points in video streams by using key point prediction processing, combined with mapping network B and Kalman filtering.
[0033] In this embodiment, the inter-frame motion vector of the facial key points is specifically the difference between the facial key point coordinate information $A_{t-1}$ of the previous frame and the predicted facial key point coordinate information $A_t$ at time t.
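The difference described in this paragraph can be sketched in a few lines of Python. The text does not state the sign convention, so the sketch assumes current minus previous ($A_t - A_{t-1}$); the function name and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def face_motion_vectors(prev_pts: List[Point], curr_pts: List[Point]) -> List[Point]:
    """Per-keypoint displacement between two consecutive frames.

    Assumption: the motion vector is A_t - A_{t-1} (current minus previous).
    """
    if len(prev_pts) != len(curr_pts):
        raise ValueError("both frames must have the same number of keypoints")
    return [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev_pts, curr_pts)]


# Hypothetical coordinates of two facial keypoints in frames t-1 and t.
a_prev = [(100.0, 120.0), (140.0, 121.0)]
a_curr = [(102.0, 119.0), (143.0, 121.5)]
motion = face_motion_vectors(a_prev, a_curr)
```

Flattening `motion` into a single list gives a vector of length twice the number of facial keypoints, which is one plausible input layout for the mapping network.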
[0034] This invention represents the motion of hair key points between two frames by the inter-frame motion vector of the hair key points. Combined with the preset Kalman filter formula, this allows the state of a dynamic system to be estimated from information that carries many uncertainties, which greatly improves the detection of relatively unstable hair key points in video streams.
[0035] In this embodiment, the current frame is obtained as the input image, and key point prediction processing is performed on it. Specifically, the input image is input into the key point prediction model for key point prediction processing.
[0036] In this embodiment, the dataset of the mapping network B is generated as follows: for any original single image I containing keypoint labels, an affine transformation is applied to simulate the motion between frames, yielding the transformed image I'. Each pixel of the image I undergoes a randomly initialized affine transformation according to the following formula:
[0037] $$X_{t+1} = C X_t + B$$
[0038] Where C is the scaling and rotation matrix, B is the translation matrix, $X_t$ is the coordinates of a pixel at time t, and $X_{t+1}$ is the coordinates of that pixel at time t+1, that is, its position in the next frame.
[0039] Preferably, the key point labels include facial key point labels and hair key point labels.
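As an illustration of the dataset generation step, the sketch below applies an affine map of the form $X_{t+1} = C X_t + B$, consistent with the definitions of C and B above, to labelled keypoint coordinates. The sampling ranges for rotation, scale, and translation are assumptions chosen for illustration; the text only says the transformation is randomly initialized.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix2 = List[List[float]]


def random_affine(max_angle_deg: float = 5.0, max_scale_delta: float = 0.05,
                  max_shift: float = 4.0, rng=random) -> Tuple[Matrix2, Point]:
    """Draw a scaling-and-rotation matrix C and a translation B.

    The ranges are hypothetical; they are meant to mimic small inter-frame motion.
    """
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    s = 1.0 + rng.uniform(-max_scale_delta, max_scale_delta)
    c = [[s * math.cos(theta), -s * math.sin(theta)],
         [s * math.sin(theta), s * math.cos(theta)]]
    b = (rng.uniform(-max_shift, max_shift), rng.uniform(-max_shift, max_shift))
    return c, b


def apply_affine(c: Matrix2, b: Point, pts: Sequence[Point]) -> List[Point]:
    """Compute X_{t+1} = C X_t + B for every labelled point."""
    return [(c[0][0] * x + c[0][1] * y + b[0],
             c[1][0] * x + c[1][1] * y + b[1]) for x, y in pts]


# Identity C with a pure translation moves every point by B.
shifted = apply_affine([[1.0, 0.0], [0.0, 1.0]], (1.0, 2.0), [(3.0, 4.0)])
```

Applying the same (C, B) to both the facial and the hair keypoint labels of image I yields the labels of the transformed image I', so the pair (I, I') behaves like two consecutive frames with known motion.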
[0040] In this embodiment, the preset Kalman filter formula is as follows:
[0041]
[0042] Where F, Q, L, and H are all identity matrices; $R_t$ is the noise information; $H^T$ and $F^T$ denote matrix transposes; and K is the Kalman gain. The state terms of the formula are the coordinate information of the hair keypoints in the current frame predicted by combining the face keypoint information, the optimal estimate of the hair keypoints at time t-1, and the optimal estimate of the hair keypoints at time t. The covariance terms P are the noise covariance at time t, the updated noise covariance, and the noise covariance at time t+1.
[0043] In this embodiment, P is initially an identity matrix, and it is continuously updated over time.
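The Kalman filter formula itself is not reproduced in the text, so the following is only a sketch based on the standard Kalman predict and update equations with F = Q = L = H = identity, as stated above. It assumes that the predicted hair motion vector enters as the control input, that the model's hair prediction $Z_t$ is the measurement, and that $R_t$ is diagonal so the filter decouples per coordinate. The function and argument names are hypothetical.

```python
from typing import List, Tuple


def kalman_step(x_prev: List[float], p_prior: List[float], u: List[float],
                z: List[float], r: List[float],
                q: float = 1.0) -> Tuple[List[float], List[float]]:
    """One filter cycle per coordinate, with F = L = H = 1 and Q = q.

    x_prev : optimal hair keypoint estimate at time t-1
    p_prior: noise covariance at time t (diagonal entries)
    u      : predicted hair motion vector from the mapping network
    z      : hair keypoint coordinates Z_t from the keypoint model
    r      : measurement noise variances R_t (assumed diagonal)
    Returns the optimal estimate at time t and the covariance for time t+1.
    """
    x_post, p_next = [], []
    for xp, pp, ui, zi, ri in zip(x_prev, p_prior, u, z, r):
        x_pred = xp + ui                      # prediction from face-driven motion
        gain = pp / (pp + ri)                 # Kalman gain K
        x_post.append(x_pred + gain * (zi - x_pred))  # optimal estimate at t
        p_upd = (1.0 - gain) * pp             # updated covariance
        p_next.append(p_upd + q)              # covariance carried to t+1
    return x_post, p_next


# P starts as the identity, so each diagonal entry is 1.0.
x_hat, p_next = kalman_step(x_prev=[10.0], p_prior=[1.0], u=[2.0], z=[13.0], r=[1.0])
```

With equal prior and measurement variances the gain is 0.5, so the estimate lands halfway between the motion-based prediction (12.0) and the raw model output (13.0). Each step costs only a handful of scalar operations per coordinate, which fits the low computational load claimed for the method.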
[0044] This invention, based on a preset Kalman filter formula, has a relatively low computational load, which is beneficial for deployment and application in real-time scenarios.
[0045] In this embodiment, noise information is obtained by statistically analyzing the output of the key point prediction model and the actual values of the dataset of the mapping network B.
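The text does not say which statistic is used to obtain the noise information. One simple possibility, shown here purely as an assumption, is the per-coordinate mean squared residual between the keypoint model's outputs and the dataset labels, which could serve as the diagonal of $R_t$. The function name and sample values are hypothetical.

```python
from typing import List, Sequence


def estimate_measurement_noise(predictions: Sequence[Sequence[float]],
                               ground_truth: Sequence[Sequence[float]]) -> List[float]:
    """Per-coordinate mean squared residual over a labelled dataset.

    Each row holds the flattened hair keypoint coordinates of one sample.
    Assumption: errors are treated as zero-mean, so the mean squared
    residual stands in for the variance.
    """
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need the same non-zero number of samples in both inputs")
    n = len(predictions)
    dim = len(predictions[0])
    sums = [0.0] * dim
    for pred, true in zip(predictions, ground_truth):
        for i in range(dim):
            sums[i] += (pred[i] - true[i]) ** 2
    return [s / n for s in sums]


# Two hypothetical samples with two coordinates each.
r_diag = estimate_measurement_noise([[1.0, 2.0], [3.0, 2.0]],
                                    [[0.0, 2.0], [2.0, 4.0]])
```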
[0046] In this embodiment, the training process of the mapping network B specifically uses an original single image I containing key point labels and a transformed image I'. The coordinates of the facial key points corresponding to the original single image I and the transformed image I' are subtracted to obtain the inter-frame motion vector. The inter-frame motion vector is used as the input of the mapping network B, and the output of the mapping network B is the predicted hair key point motion vector. Then, a preset loss function is used to guide the network training.
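A training sample for the mapping network can be assembled from the labels of I and I' as sketched below. The flattening order (x, y per keypoint) and the function name are assumptions; the subtraction direction (transformed minus original) is likewise assumed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def make_training_pair(face_i: Sequence[Point], hair_i: Sequence[Point],
                       face_ip: Sequence[Point], hair_ip: Sequence[Point]
                       ) -> Tuple[List[float], List[float]]:
    """Return (network input, training target) for one image pair (I, I').

    Input : flattened facial keypoint motion between I and I'.
    Target: flattened hair keypoint motion between I and I'.
    """
    def flat_diff(before: Sequence[Point], after: Sequence[Point]) -> List[float]:
        out: List[float] = []
        for (bx, by), (ax, ay) in zip(before, after):
            out.extend((ax - bx, ay - by))
        return out

    return flat_diff(face_i, face_ip), flat_diff(hair_i, hair_ip)


# Hypothetical labels: two facial keypoints and one hair keypoint.
x_in, y_true = make_training_pair(
    face_i=[(0.0, 0.0), (1.0, 1.0)], hair_i=[(5.0, 5.0)],
    face_ip=[(1.0, 0.0), (2.0, 1.0)], hair_ip=[(6.0, 5.0)],
)
```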
[0047] In this embodiment, the preset loss function is specifically a cosine similarity function, which is used to measure the difference between the predicted inter-frame motion vector of the hair keypoint and the true value of the inter-frame motion vector of the hair keypoint. The specific calculation formula is as follows:
[0048] $$\cos\theta = \frac{\sum_{i=1}^{m} \alpha_i \beta_i}{\sqrt{\sum_{i=1}^{m} \alpha_i^2}\,\sqrt{\sum_{i=1}^{m} \beta_i^2}}$$
[0049] Where $\cos\theta$ is a scalar measure of the difference; α is the inter-frame motion vector of the hair keypoints predicted by the mapping network B, with elements $\alpha_i \in \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$; and β is the true value of the inter-frame motion vector of the hair keypoints calculated from the dataset labels, with elements $\beta_i \in \{\beta_1, \beta_2, \ldots, \beta_m\}$.
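For reference, cosine similarity between a predicted and a true motion vector can be computed as below. The text names cosine similarity as the loss but does not say how the scalar is turned into a quantity to minimize; using `1 - cos θ` is a common choice and is an assumption here, as is the small `eps` guard against zero-length vectors.

```python
import math
from typing import Sequence


def cosine_similarity(alpha: Sequence[float], beta: Sequence[float],
                      eps: float = 1e-12) -> float:
    """cos(theta) between the predicted vector alpha and the true vector beta."""
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm_a = math.sqrt(sum(a * a for a in alpha))
    norm_b = math.sqrt(sum(b * b for b in beta))
    return dot / max(norm_a * norm_b, eps)


def cosine_loss(alpha: Sequence[float], beta: Sequence[float]) -> float:
    """Assumed loss form: 0 when the directions agree, 2 when they are opposite."""
    return 1.0 - cosine_similarity(alpha, beta)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Note that cosine similarity compares directions only: two vectors pointing the same way score 1 regardless of their lengths.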
[0050] This invention uses a cosine similarity function to measure the difference between the predicted inter-frame motion vector of hair keypoints and the true value of the inter-frame motion vector of hair keypoints, thereby improving the training effect of the mapping network B.
[0051] In this embodiment, the mapping network B consists of three fully connected layers. Specifically, it is a fully connected network with two hidden layers, and the activation function is ReLU.
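A forward pass through such a network (three fully connected layers, two of them hidden, with ReLU activations) might look like the following pure-Python sketch. The hidden width, the initialization scheme, and all function names are assumptions; the text specifies only the layer count and the activation.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random weights (He-style scale, an assumption) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_mapping_network(n_face: int, n_hair: int, hidden: int = 64,
                         seed: int = 0) -> List[Layer]:
    """Three fully connected layers: input -> hidden -> hidden -> output."""
    rng = random.Random(seed)
    return [init_layer(2 * n_face, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, 2 * n_hair, rng)]


def mapping_network(params: List[Layer], face_motion: List[float]) -> List[float]:
    """Map flattened facial motion to flattened hair motion."""
    h = face_motion
    for index, (weights, biases) in enumerate(params):
        h = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if index < len(params) - 1:  # ReLU on the two hidden layers only
            h = [max(0.0, v) for v in h]
    return h


params = init_mapping_network(n_face=3, n_hair=2, hidden=8)
hair_motion = mapping_network(params, [0.5] * 6)
```

The output has two values per hair keypoint and is left without an activation so that it can represent motion in any direction.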
[0052] Corresponding to the above hair keypoint prediction method based on Kalman filtering, this invention provides a hair keypoint prediction system based on Kalman filtering, comprising: an image acquisition module for acquiring the current frame as an input image; a keypoint prediction module for performing keypoint prediction processing on the input image, where the keypoint prediction processing includes face keypoint prediction and hair keypoint prediction, to obtain the predicted face keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; an inter-frame motion vector calculation module for obtaining the face keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the face keypoints, which represent the motion of the face keypoints between the two frames; a mapping network module for taking the inter-frame motion vectors of the face keypoints as input and obtaining the predicted inter-frame motion vectors of the hair keypoints; and a Kalman filter module for calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula.
[0053] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a hair keypoint prediction program based on Kalman filtering, wherein when the hair keypoint prediction program based on Kalman filtering is executed by a processor, it implements the steps of the hair keypoint prediction method based on Kalman filtering as described above.
[0054] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the device embodiments, equipment embodiments, and storage medium embodiments, since they are basically similar to the method embodiments, the descriptions are relatively simple, and relevant parts can be referred to the descriptions of the method embodiments.
[0055] Furthermore, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0056] The foregoing description illustrates and describes preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept by means of the foregoing teachings or techniques or knowledge in related fields. Any modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.
Claims
1. A hair keypoint prediction method based on Kalman filtering, characterized in that it includes the following steps: acquiring the current frame as the input image and performing keypoint prediction processing on it, where the keypoint prediction processing includes facial keypoint prediction and hair keypoint prediction, to obtain the predicted facial keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; obtaining the facial keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the facial keypoints, which represent the motion of the facial keypoints between the two frames; inputting the inter-frame motion vectors of the facial keypoints into a pre-trained mapping network B' to obtain the predicted inter-frame motion vectors of the hair keypoints; and calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula; wherein acquiring the current frame as the input image and performing keypoint prediction processing on it specifically means inputting the input image into a keypoint prediction model for keypoint prediction processing; the dataset of the mapping network B' is generated as follows: for any original single image I containing keypoint labels, an affine transformation is applied to simulate the motion between frames, yielding the transformed image I'; each pixel of the image I undergoes a randomly initialized affine transformation according to the following formula: $X_{t+1} = C X_t + B$; where C is the scaling and rotation matrix, B is the translation matrix, $X_t$ is the coordinates of a pixel at time t, and $X_{t+1}$ is the coordinates of that pixel at time t+1, that is, its position in the next frame; the training process of the mapping network B' specifically uses the original single image I containing keypoint labels and the transformed image I': the coordinates of the facial keypoints corresponding to the original single image I and the transformed image I' are subtracted to obtain the inter-frame motion vector, the inter-frame motion vector is used as the input of the mapping network B', the output of the mapping network B' is the predicted hair keypoint motion vector, and a preset loss function is then used to guide the network training; the preset loss function is a cosine similarity function, used to measure the difference between the predicted inter-frame motion vector of the hair keypoints and the true value of the inter-frame motion vector of the hair keypoints, with the following calculation formula: $\cos\theta = \frac{\sum_{i=1}^{m} \alpha_i \beta_i}{\sqrt{\sum_{i=1}^{m} \alpha_i^2}\,\sqrt{\sum_{i=1}^{m} \beta_i^2}}$; where $\cos\theta$ is a scalar measure of the difference, α is the inter-frame motion vector of the hair keypoints predicted by the mapping network B', with elements $\alpha_i \in \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$, and β is the true value of the inter-frame motion vector of the hair keypoints calculated from the dataset labels, with elements $\beta_i \in \{\beta_1, \beta_2, \ldots, \beta_m\}$.
2. The hair keypoint prediction method based on Kalman filtering according to claim 1, characterized in that: the inter-frame motion vector of the facial keypoints is specifically the difference between the facial keypoint coordinate information $A_{t-1}$ of the previous frame and the predicted facial keypoint coordinate information $A_t$ at time t.
3. The hair keypoint prediction method based on Kalman filtering according to claim 1, characterized in that: The preset Kalman filter formula is as follows: ; where F, Q, L, and H are all identity matrices; $R_t$ is the noise information; $H^T$ and $F^T$ denote matrix transposes; K is the Kalman gain; the state terms of the formula are the coordinate information of the hair keypoints in the current frame predicted by combining the facial keypoint information, the optimal estimate of the hair keypoints at time t-1, and the optimal estimate of the hair keypoints at time t; and the covariance terms P are the noise covariance at time t, the updated noise covariance, and the noise covariance at time t+1.
4. The hair keypoint prediction method based on Kalman filtering according to claim 3, characterized in that: Noise information is obtained by statistically analyzing the output of the key point prediction model and the actual values of the dataset of the mapping network B'.
5. The hair keypoint prediction method based on Kalman filtering according to claim 1, characterized in that: The mapping network B' consists of three fully connected layers.
6. A hair keypoint prediction system based on Kalman filtering, using the hair keypoint prediction method based on Kalman filtering as described in any one of claims 1-5, characterized in that it includes: an image acquisition module for acquiring the current frame as the input image; a keypoint prediction module for performing keypoint prediction processing on the input image, where the keypoint prediction processing includes facial keypoint prediction and hair keypoint prediction, to obtain the predicted facial keypoint coordinate information $A_t$ at time t and the predicted hair keypoint coordinate information $Z_t$ at time t; an inter-frame motion vector calculation module for obtaining the facial keypoint coordinate information $A_{t-1}$ of the previous frame and, from $A_{t-1}$ and the predicted $A_t$, obtaining the inter-frame motion vectors of the facial keypoints, which represent the motion of the facial keypoints between the two frames; a mapping network module for taking the inter-frame motion vectors of the facial keypoints as input and obtaining the predicted inter-frame motion vectors of the hair keypoints; and a Kalman filter module for calculating the hair keypoint output from the predicted inter-frame motion vectors of the hair keypoints and the preset Kalman filter formula.
7. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a hair keypoint prediction program based on Kalman filtering, which, when executed by a processor, implements the steps of the hair keypoint prediction method based on Kalman filtering as described in any one of claims 1 to 5.