Head pose tracking method and apparatus, electronic device, and storage medium

By converting video into an image sequence and using occlusion recognition and deformation completion networks to fill in the occluded areas, the robustness problem of head pose tracking methods under occlusion conditions is solved, and more accurate head pose tracking is achieved.

CN115830067B · Active · Publication Date: 2026-06-23 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2022-11-21
Publication Date: 2026-06-23

Abstract

The application relates to the technical field of computer vision, and in particular to a head pose tracking method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: converting a video of a target to be tracked into an image sequence, inputting the image sequence into a pre-trained occlusion recognition network, and outputting a mask image of the occlusion in each frame of the image sequence; dividing the image sequence into a reference frame image and a candidate frame image sequence using the mask images, and covering the occluded area of each frame of the candidate frame image sequence with that frame's mask image to generate a sequence of frames to be completed; and inputting the sequence of frames to be completed, the reference frame image, and the mask images into a pre-trained deformation completion network, outputting a completed image sequence, and tracking the head pose of the target based on the completed image sequence. This addresses the false detections and jitter in head pose that arise because video-based head pose tracking methods in the related art are easily affected by occlusion and have poor robustness.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to a head pose tracking method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the development of computer technology, the concept of the metaverse has been proposed, aiming to create a digital living space with a social system. Head pose, as crucial information in this application scenario, is of great significance for the development of related applications. Due to the widespread availability of color cameras, color video is one of the commonly used inputs for head pose tracking.

[0003] Head pose estimation has numerous applications in fields such as face reconstruction and facial animation. For example, accurate head pose estimation provides a good initial value for face reconstruction algorithms, preventing the use of facial shape information to compensate for head pose deviations during the reconstruction process, which could lead to erroneous results. Simultaneously, accurate head pose tracking makes facial animations more realistic and vivid, enhancing user immersion and experience. Furthermore, the accuracy of head pose estimation is crucial for some downstream tasks. For instance, in gaze estimation, the direction of a person's gaze is determined by both head pose and eye rotation; therefore, the accuracy of head pose estimation significantly impacts the final gaze estimation result.

[0004] Related technologies are usually end-to-end methods based on neural networks, or optimization methods that combine facial feature points and 3D facial models to achieve head pose tracking. However, they are easily affected by occlusion, have poor robustness, and can cause false detections and jitter in head pose.

Summary of the Invention

[0005] This application provides a head pose tracking method, apparatus, electronic device, and storage medium to solve the problems of head pose tracking based on video in related technologies, which are easily affected by occlusion, have poor robustness, and lead to false detection and jitter of head pose.

[0006] The first aspect of this application provides a head pose tracking method, comprising the following steps: acquiring a video of a target to be tracked; converting the video into an image sequence, inputting the image sequence into a pre-trained occlusion recognition network, and outputting a mask image of the occlusion in each frame of the image sequence; dividing the image sequence into a reference frame image and a candidate frame image sequence using the mask image, and using the mask image of each frame of the candidate frame image sequence to cover the occluded area, generating a sequence of frames to be completed; and inputting the sequence of frames to be completed, the reference frame image, and the mask image of each frame into a pre-trained deformation completion network, outputting a completed image sequence of the sequence of frames to be completed, and realizing head pose tracking of the target to be tracked based on the completed image sequence.

[0007] Optionally, in one embodiment of this application, dividing the image sequence into a reference frame image and a candidate frame image sequence using the mask image includes: identifying whether there is an unoccluded image in the image sequence based on the mask image; if there is an unoccluded image, selecting any unoccluded image as the reference frame image; otherwise, selecting the image with the least occlusion in the image sequence as the reference frame image, and using the remaining frame images in the image sequence other than the reference frame image as the candidate frame image sequence.

[0008] Optionally, in one embodiment of this application, the training process of the occlusion recognition network includes: acquiring training data for an untrained occlusion recognition network, wherein the training data includes an occlusion image and a real mask image corresponding to the occlusion image; inputting the occlusion image into the untrained occlusion recognition network, outputting a training mask image of the occlusion image, and calculating a training loss value based on the real mask image and the training mask image; if the training loss value is greater than a preset value, continuing iterative training; otherwise, stopping iterative training to obtain a trained occlusion recognition network.

[0009] Optionally, in one embodiment of this application, the training process of the deformation completion network includes: acquiring an image sequence of unoccluded images; using any frame in the image sequence as a reference frame image and the remaining frames as candidate frame images; randomly generating mask images of different shapes, and using the mask images to cover random image regions of the candidate frame images to obtain a frame image to be completed; inputting the frame image to be completed, the reference frame image, and the mask image into an untrained deformation completion network, using an image reconstruction loss to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range, and using a total variation loss to constrain the deformation field between the reference frame image and the frame image to be completed to satisfy a preset rigid deformation condition, thereby obtaining a trained deformation completion network.

[0010] A second aspect of this application provides a head pose tracking device, comprising: an acquisition module for acquiring a video of a target to be tracked; a conversion module for converting the video into an image sequence, inputting the image sequence into a pre-trained occlusion recognition network, and outputting a mask image of the occlusion in each frame of the image sequence; a division module for using the mask image to divide the image sequence into a reference frame image and a candidate frame image sequence, and using the mask image of each frame in the candidate frame image sequence to cover the occluded area, generating a sequence of frames to be completed; and a tracking module for inputting the sequence of frames to be completed, the reference frame image, and the mask image of each frame into a pre-trained deformation completion network, outputting a completed image sequence of the sequence of frames to be completed, and realizing head pose tracking of the target to be tracked based on the completed image sequence.

[0011] Optionally, in one embodiment of this application, the division module is further configured to identify whether there is an unoccluded image in the image sequence based on the mask image; if there is an unoccluded image, any unoccluded frame is selected as the reference frame image; otherwise, the image with the least occlusion in the image sequence is selected as the reference frame image, and the remaining frame images in the image sequence other than the reference frame image are used as the candidate frame image sequence.

[0012] Optionally, in one embodiment of this application, the device further includes: a first training module, configured to acquire training data for an untrained occlusion recognition network, wherein the training data includes an occlusion image and a real mask image corresponding to the occlusion image; input the occlusion image into the untrained occlusion recognition network, output a training mask image of the occlusion image, and calculate a training loss value based on the real mask image and the training mask image; if the training loss value is greater than a preset value, continue iterative training; otherwise, stop iterative training to obtain a trained occlusion recognition network.

[0013] Optionally, in one embodiment of this application, it further includes: a second training module, used to acquire an image sequence of unoccluded images; any frame in the image sequence is used as a reference frame image, and the remaining frames are candidate frame images; mask images of different shapes are randomly generated, and the mask images are used to cover random image regions of the candidate frame images to obtain a frame image to be completed; the frame image to be completed, the reference frame image, and the mask image are input into an untrained deformation completion network, and the image reconstruction loss is used to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range, and the total variation loss is used to constrain the deformation field between the reference frame image and the image to be completed to meet a preset rigid deformation condition, thereby obtaining a trained deformation completion network.

[0014] A third aspect of this application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the head pose tracking method as described in the above embodiments.

[0015] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which is executed by a processor to implement the head pose tracking method as described in the above embodiments.

[0016] Therefore, this application has at least the following beneficial effects:

[0017] By converting the video of the target to be tracked into an image sequence, and utilizing the inter-frame information of the video, a trained network estimates the deformation field between the reference frame and the frame to be completed. The corresponding region of the reference frame is then automatically filled into the occluded region of the frame to be completed, thus ensuring compatibility with head pose estimation under occlusion conditions. This significantly alleviates problems such as false head pose detection and jitter caused by occlusion. Therefore, this method solves the problems of directly tracking head pose based on video, which is easily affected by occlusion, has poor robustness, and suffers from false head pose detection and jitter.

[0018] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Attached Figure Description

[0019] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0020] Figure 1 is a flowchart of a head pose tracking method provided according to an embodiment of this application;

[0021] Figure 2 is a schematic diagram illustrating the steps of head pose tracking according to an embodiment of this application;

[0022] Figure 3 is a block diagram of a head pose tracking device according to an embodiment of this application;

[0023] Figure 4 is a schematic diagram of the structure of an electronic device provided according to an embodiment of this application.

[0024] Explanation of reference numerals in the attached figures: Acquisition module-100, Conversion module-200, Division module-300, Tracking module-400, Memory-401, Processor-402, Communication interface-403.

Detailed Implementation

[0025] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.

[0026] The head pose tracking method, apparatus, electronic device, and storage medium of this application are described below with reference to the accompanying drawings. Addressing the problems mentioned in the background section, this application provides a head pose tracking method. In this method, the video of the target to be tracked is converted into an image sequence. Utilizing the inter-frame information of the video, a trained network is used to estimate the deformation field between a reference frame and the frame to be completed. The corresponding region of the reference frame is automatically filled into the occluded region of the frame to be completed, thus ensuring compatibility with head pose estimation under occlusion conditions. This significantly alleviates problems such as false detections and jitter caused by occlusion. Therefore, this method solves the problems of directly tracking head pose based on video in related technologies, which is easily affected by occlusion, has poor robustness, and suffers from false detections and jitter.

[0027] Specifically, Figure 1 is a flowchart of a head pose tracking method provided in an embodiment of this application.

[0028] As shown in Figure 1, the head pose tracking method includes the following steps:

[0029] In step S101, a video of the target to be tracked is acquired.

[0030] It can be understood that accurate head pose tracking makes facial animations more realistic and vivid, enhancing user immersion and experience. In this embodiment, the video of the target to be tracked is acquired first, so that subsequent operations, such as detecting and completing the occluded areas in the video, can be performed.

[0031] In step S102, the video is converted into an image sequence, the image sequence is input into a pre-trained occlusion recognition network, and the mask image of the occlusion in each frame of the image sequence is output.

[0032] Because occluding objects vary in shape and color, they are difficult to handle with traditional image processing algorithms. The embodiments of this application therefore use a neural network, in a data-driven manner, to recognize occlusions automatically: the video of the target to be tracked is converted into an image sequence, and the network identifies the objects occluding the face and outputs a mask image of the occlusion for each frame.

[0033] In step S103, the image sequence is divided into a reference frame image and a candidate frame image sequence using a mask image, and the mask image of each frame in the candidate frame image sequence is used to cover the occluded area to generate a frame image sequence to be completed.

[0034] Specifically, the embodiments of this application make full use of the inter-frame information of the video and use the mask image of each frame in the candidate frame image sequence to cover the occluded area. Compared with developing a new head pose tracking method on occlusion data, the embodiments of this application can reuse existing head pose tracking methods that do not account for occlusion, together with occlusion-free data, which greatly reduces the cost of solving the occlusion problem.
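
The covering operation described above can be sketched as follows. This is a minimal illustration only, assuming frames are stored as NumPy arrays and each mask is a boolean array that is True at occluded pixels; the function name `cover_occlusions` and the zero fill value are hypothetical choices that this application does not specify.

```python
import numpy as np

def cover_occlusions(frames, masks, fill_value=0):
    """Blank out occluded pixels in each candidate frame.

    frames: array of shape (T, H, W, 3).
    masks:  boolean array of shape (T, H, W), True where a pixel is occluded
            (assumed to come from the occlusion recognition network).
    Returns a new array; the input frames are left unchanged.
    """
    frames = np.asarray(frames)
    masks = np.asarray(masks, dtype=bool)
    covered = frames.copy()
    # Boolean indexing over (T, H, W) selects whole pixels, so the scalar
    # fill value is written to all three color channels.
    covered[masks] = fill_value
    return covered
```

The result plays the role of the sequence of frames to be completed: the occluder's pixels are removed so that the later completion step does not see them.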

[0035] In one embodiment of this application, an image sequence is divided into a reference frame image and a candidate frame image sequence using a mask image, including: identifying whether there is an unoccluded image in the image sequence based on the mask image; if there is an unoccluded image, then any unoccluded image is selected as the reference frame image; otherwise, the image with the least occlusion in the image sequence is selected as the reference frame image, and the remaining frame images in the image sequence other than the reference frame image are selected as the candidate frame image sequence.

[0036] In this embodiment, an unoccluded frame can be selected as the reference frame. If all frames are occluded, the image with the least occlusion is selected as the reference frame image, and the images of other frames are used as candidate frame image sequences. This allows the mask image to cover the occluded areas of other frames, generating a sequence of frames to be completed, thus providing better input conditions for the head pose tracking method.
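
The selection rule above can be sketched in a few lines. This is an illustrative sketch, assuming the mask images are available as a boolean NumPy array; the function name `split_reference_and_candidates` is hypothetical. Picking the frame with the fewest occluded pixels covers both cases: an unoccluded frame has zero occluded pixels and is therefore chosen whenever one exists.

```python
import numpy as np

def split_reference_and_candidates(masks):
    """Choose a reference frame index and the candidate frame indices.

    masks: boolean array of shape (T, H, W), True where a pixel is occluded.
    Returns (ref_idx, candidate_indices).
    """
    masks = np.asarray(masks, dtype=bool)
    # Number of occluded pixels in each frame.
    occluded_pixels = masks.reshape(masks.shape[0], -1).sum(axis=1)
    # argmin returns the first unoccluded frame (count 0) if there is one,
    # otherwise the frame with the least occlusion.
    ref_idx = int(np.argmin(occluded_pixels))
    candidates = [i for i in range(masks.shape[0]) if i != ref_idx]
    return ref_idx, candidates
```

The application allows any unoccluded frame to serve as the reference; taking the first one is simply the choice this sketch makes.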

[0037] In step S104, the image sequence to be completed, the reference frame image, and the mask image of each frame image are input into the pre-trained deformation completion network, and the completed image sequence of the image sequence to be completed is output. The head pose tracking of the target to be tracked is realized based on the completed image sequence.

[0038] The embodiments of this application can use the inter-frame information of the video to fill the corresponding region of the reference frame into the occluded area of the frame to be completed by image deformation. This keeps the facial biometric features consistent, avoids deviations caused by inconsistencies between the completed content and the actual face, provides better input conditions for the head pose tracking method, and thus avoids problems such as false head pose detection and jitter caused by occlusion.
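
To make the fill-by-deformation idea concrete, the sketch below warps the reference frame with a given deformation field and copies the warped pixels into the occluded area only. It is a simplified illustration under stated assumptions: in this application the deformation field is estimated by the trained deformation completion network, whereas here it is passed in as the array `flow`; the nearest-neighbor sampling and the name `complete_frame` are also assumptions.

```python
import numpy as np

def complete_frame(frame, reference, mask, flow):
    """Fill the occluded area of `frame` from a warped `reference`.

    frame, reference: arrays of shape (H, W, 3).
    mask: boolean array (H, W), True where `frame` is occluded.
    flow: array (H, W, 2); flow[y, x] = (dy, dx) is the offset from pixel
          (y, x) of `frame` to its corresponding pixel in `reference`
          (assumed given here; the application's network estimates it).
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup of the corresponding reference pixel,
    # clamped to the image bounds.
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    warped = reference[src_y, src_x]
    completed = frame.copy()
    # Only occluded pixels are replaced; visible pixels are kept as-is.
    completed[mask] = warped[mask]
    return completed
```

Because the filled content comes from the same person's face in another frame, it stays consistent with the subject's actual facial features, which is the property the paragraph above relies on.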

[0039] In actual implementation, the embodiments of this application can take the optimization method of facial feature points and three-dimensional facial models as an example. The existing facial feature point detection technology is used to detect facial feature points on the completed image. The two-dimensional objective function between feature points and model projection points is constructed by combining the three-dimensional facial model. The head pose is solved by the optimization method, which greatly improves the accuracy of head pose tracking under occlusion. The optimization method can be Gauss-Newton, conjugate gradient, or other algorithms.
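
As an illustration of this kind of optimization, the sketch below solves for a 6-parameter head pose by Gauss-Newton iteration on the 2D distance between detected feature points and projected model points. It is a sketch under several assumptions that this application does not specify: a pinhole camera with a known focal length and principal point at the origin, an axis-angle rotation parameterization, a finite-difference Jacobian, and hypothetical function names.

```python
import numpy as np

def rodrigues(rvec):
    """Convert an axis-angle rotation vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(pose, model_points, focal):
    """Pinhole projection of 3D model points under pose = (rvec, tvec)."""
    cam = model_points @ rodrigues(pose[:3]).T + pose[3:]
    return focal * cam[:, :2] / cam[:, 2:3]

def residuals(pose, model_points, landmarks, focal):
    """Stacked 2D differences between projected and detected points."""
    return (project(pose, model_points, focal) - landmarks).ravel()

def solve_head_pose(model_points, landmarks, focal, init_pose, iters=20, eps=1e-6):
    """Gauss-Newton on the 2D reprojection objective.

    model_points: (N, 3) points of the 3D face model.
    landmarks:    (N, 2) feature points detected on the completed image.
    init_pose:    initial guess [rx, ry, rz, tx, ty, tz].
    """
    pose = np.asarray(init_pose, dtype=float).copy()
    for _ in range(iters):
        r = residuals(pose, model_points, landmarks, focal)
        # Forward-difference Jacobian of the residuals w.r.t. the pose.
        J = np.empty((r.size, 6))
        for j in range(6):
            step = np.zeros(6)
            step[j] = eps
            J[:, j] = (residuals(pose + step, model_points, landmarks, focal) - r) / eps
        # Gauss-Newton update: least-squares solution of J @ delta = -r.
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)
        pose = pose + delta
    return pose
```

In a video setting, the pose solved for the previous frame is a natural initial guess for the next one. Undamped Gauss-Newton can diverge from a poor starting point, so a practical implementation may add damping or use one of the other algorithms the paragraph mentions.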

[0040] In one embodiment of this application, the training process of the occlusion recognition network includes: acquiring training data for an untrained occlusion recognition network, wherein the training data includes an occlusion image and a real mask image corresponding to the occlusion image; inputting the occlusion image into the untrained occlusion recognition network and outputting a training mask image of the occlusion image; calculating a training loss value based on the real mask image and the training mask image; if the training loss value is greater than a preset value, continuing iterative training; otherwise, stopping iterative training to obtain a trained occlusion recognition network.

[0041] This application embodiment can automatically identify occlusions using a trained occlusion recognition network and output a mask image of the occlusions. The detailed training process is as follows: by acquiring a face image with occlusions and the corresponding mask image of the occlusions, the image with occlusions is input into an untrained occlusion recognition network, and the cross-entropy loss function is used to constrain its output to be close to the real mask image.
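
The loss and stopping rule described above might look like the following. This is a sketch only, assuming the network outputs a per-pixel occlusion probability and that the cross-entropy is the binary, pixel-averaged form; the application names the cross-entropy loss but does not give its exact formulation, and the function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(pred_mask, true_mask, eps=1e-7):
    """Pixel-averaged binary cross-entropy between two mask images.

    pred_mask: float array in [0, 1], predicted occlusion probability per pixel.
    true_mask: array of the same shape, 1 (or True) where a pixel is occluded.
    """
    p = np.clip(np.asarray(pred_mask, dtype=float), eps, 1.0 - eps)
    t = np.asarray(true_mask, dtype=float)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

def should_continue_training(loss_value, preset_value):
    """Continue iterating while the loss is still above the preset value."""
    return loss_value > preset_value
```

A training loop would evaluate the loss after each update step and stop once `should_continue_training` returns False.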

[0042] In one embodiment of this application, the training process of the deformation completion network includes: acquiring an image sequence of unoccluded images; using any frame in the image sequence as a reference frame image and the remaining frames as candidate frame images; randomly generating mask images of different shapes and using the mask images to cover random image regions of the candidate frame images to obtain a frame image to be completed; inputting the frame image to be completed, the reference frame image, and the mask image into an untrained deformation completion network; using an image reconstruction loss to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range; and using a total variation loss to constrain the deformation field between the reference frame image and the frame image to be completed to satisfy a preset rigid deformation condition, thereby obtaining a trained deformation completion network.

[0043] The embodiments of this application can output a completed image sequence of the image sequence to be completed through a trained deformation completion network, thereby achieving head pose tracking of the target. The detailed training process is as follows: unoccluded color videos of faces are collected; one frame is randomly selected as the reference frame, and the remaining frames are used as candidate frames. Mask images of different shapes are randomly generated and used to cover random image regions of the candidate frames to obtain the frames to be completed. The frame to be completed, the reference frame, and the mask image are input into an untrained deformation completion network. The image reconstruction loss is used to constrain the network output to be similar to the candidate frame before masking, and the total variation loss is used to constrain the deformation field between the reference frame and the frame to be completed to be as close as possible to a rigid deformation.
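
The three ingredients of this training setup can be sketched as below. The sketch makes assumptions that go beyond the text: masks are restricted to random rectangles for brevity (the application calls for masks of different shapes), the reconstruction loss is taken as a mean absolute error, and the total variation loss is the mean absolute difference between neighboring deformation vectors. All function names are hypothetical.

```python
import numpy as np

def random_rect_mask(shape, rng):
    """Random rectangular mask (a simplification of 'different shapes')."""
    h, w = shape
    mh = int(rng.integers(1, h // 2 + 1))
    mw = int(rng.integers(1, w // 2 + 1))
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + mh, left:left + mw] = True
    return mask

def reconstruction_loss(completed, target):
    """Mean absolute error between the network output and the unmasked frame."""
    diff = np.asarray(completed, dtype=float) - np.asarray(target, dtype=float)
    return float(np.abs(diff).mean())

def total_variation_loss(flow):
    """Mean absolute difference between neighboring deformation vectors.

    flow: array (H, W, 2). A spatially constant field (a pure translation,
    the simplest rigid motion) gives a loss of exactly zero.
    """
    dy = np.abs(np.diff(flow, axis=0)).mean()
    dx = np.abs(np.diff(flow, axis=1)).mean()
    return float(dy + dx)
```

Since the training frames are unoccluded, the candidate frame before masking serves as ground truth for the reconstruction loss, so no manually completed images are needed.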

[0044] The head pose tracking method of the embodiments of this application is described in detail below with reference to Figure 2. The steps are as follows:

[0045] Step 1: Input a color video and convert it into a sequence of color images;

[0046] Step 2: Input the color image sequence into the trained occlusion recognition network and output the mask image of the occlusion in each frame;

[0047] Step 3: Select an unoccluded frame as the reference frame (if all frames are occluded, select the frame with the least occlusion);

[0048] Step 4: Use a mask image to cover up the occluded areas of other frames, generating a sequence of frames to be completed;

[0049] Step 5: Input the frame sequence to be completed, the reference frame, and the mask image sequence into the deformation completion network, and output the completed image sequence;

[0050] Step 6: Taking the optimization method combining facial feature points and 3D face model as an example, facial feature point detection technology is used to detect facial feature points in the completed image. A two-dimensional objective function between feature points and model projection points is constructed by combining the 3D face model. The head pose is solved by the optimization method. The facial feature point detection technology can be selected according to the actual situation and is not specifically limited.
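
Steps 2 through 6 above can be tied together in a short pipeline sketch. The two networks and the pose solver are passed in as plain callables because this application does not fix their architectures or interfaces; the name `track_head_pose`, the callable signatures, the zero fill value, and the decision to pass the reference frame to the pose solver unchanged are all assumptions of this sketch. Step 1 (decoding the video into frames) is assumed to have been done already.

```python
import numpy as np

def track_head_pose(frames, occlusion_net, completion_net, pose_solver):
    """Run Steps 2-6 on an already decoded frame sequence.

    frames:         array (T, H, W, 3), the color image sequence.
    occlusion_net:  callable(frame) -> boolean mask (H, W)      [assumed API]
    completion_net: callable(frame_to_complete, reference, mask)
                    -> completed frame (H, W, 3)                [assumed API]
    pose_solver:    callable(image) -> head pose                [assumed API]
    Returns one pose per input frame.
    """
    frames = np.asarray(frames)
    # Step 2: one occlusion mask per frame.
    masks = np.stack([np.asarray(occlusion_net(f), dtype=bool) for f in frames])
    # Step 3: unoccluded frame if available, else the least occluded one.
    ref_idx = int(np.argmin(masks.reshape(len(frames), -1).sum(axis=1)))
    reference = frames[ref_idx]
    poses = []
    for i, frame in enumerate(frames):
        if i == ref_idx:
            completed = frame  # assumption: the reference is used as-is
        else:
            # Step 4: cover the occluded area of the candidate frame.
            to_complete = frame.copy()
            to_complete[masks[i]] = 0
            # Step 5: complete it from the reference frame.
            completed = completion_net(to_complete, reference, masks[i])
        # Step 6: solve the head pose on the completed image.
        poses.append(pose_solver(completed))
    return poses
```

Keeping the pose solver as a separate callable mirrors the point made earlier in the description: an existing tracker that does not account for occlusion can be reused unchanged on the completed images.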

[0051] The head pose tracking method proposed in this application converts the video of the target to be tracked into an image sequence. Utilizing the inter-frame information of the video, a trained network estimates the deformation field between a reference frame and the frame to be completed. The corresponding region of the reference frame is automatically filled into the occluded region of the frame to be completed, thus ensuring compatibility with head pose estimation under occlusion conditions. This significantly alleviates problems such as false detections and jitter caused by occlusion. Therefore, it solves the problems of directly tracking head pose based on video in related technologies, which is easily affected by occlusion, has poor robustness, and suffers from false detections and jitter.

[0052] Next, a head pose tracking device according to an embodiment of this application is described with reference to the accompanying drawings.

[0053] Figure 3 is a block diagram of a head pose tracking device according to an embodiment of this application.

[0054] As shown in Figure 3, the head pose tracking device 10 includes: an acquisition module 100, a conversion module 200, a division module 300, and a tracking module 400.

[0055] The acquisition module 100 is used to acquire the video of the target to be tracked; the conversion module 200 is used to convert the video into an image sequence, input the image sequence into a pre-trained occlusion recognition network, and output the mask image of the occlusion in each frame of the image sequence; the division module 300 is used to divide the image sequence into a reference frame image and a candidate frame image sequence using the mask image, and use the mask image of each frame in the candidate frame image sequence to cover the occluded area, generating a sequence of frames to be completed; the tracking module 400 is used to input the sequence of frames to be completed, the reference frame image, and the mask image of each frame into a pre-trained deformation completion network, output the completed image sequence of the sequence of frames to be completed, and realize the head pose tracking of the target to be tracked based on the completed image sequence.

[0056] Optionally, in one embodiment of this application, the division module 300 is further configured to identify whether there is an unoccluded image in the image sequence based on the mask image; if there is an unoccluded image, any unoccluded frame is selected as the reference frame image; otherwise, the image with the least occlusion in the image sequence is selected as the reference frame image, and the remaining frame images in the image sequence other than the reference frame image are used as the candidate frame image sequence.

[0057] Optionally, in one embodiment of this application, the apparatus 10 of this application embodiment further includes: a first training module.

[0058] The first training module is used to acquire training data for an untrained occlusion recognition network. The training data includes occlusion images and the corresponding real mask images. The occlusion images are input into the untrained occlusion recognition network, which outputs training mask images of the occlusion images. The training loss value is calculated based on the real mask image and the training mask image. If the training loss value is greater than a preset value, iterative training continues; otherwise, iterative training stops, yielding a trained occlusion recognition network.

[0059] Optionally, in one embodiment of this application, the apparatus 10 of this application embodiment further includes: a second training module.

[0060] The second training module is used to acquire an image sequence of unoccluded images. Any frame in the image sequence is used as a reference frame image, and the remaining frames are candidate frame images. Mask images of different shapes are randomly generated and used to cover random image regions of the candidate frame images to obtain the image to be completed. The image to be completed, the reference frame image, and the mask image are input into an untrained deformation completion network. The image reconstruction loss is used to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range. The total variation loss is used to constrain the deformation field between the reference frame image and the image to be completed to meet a preset rigid deformation condition, thus obtaining the trained deformation completion network.

[0061] It should be noted that the foregoing explanation of the head pose tracking method embodiment also applies to the head pose tracking device of this embodiment, and will not be repeated here.

[0062] The head pose tracking device proposed in this application converts the video of the target to be tracked into an image sequence. Utilizing the inter-frame information of the video, a trained network estimates the deformation field between a reference frame and the frame to be completed. The corresponding region of the reference frame is automatically filled into the occluded region of the frame to be completed, thus ensuring compatibility with head pose estimation under occlusion conditions. This significantly alleviates problems such as false detections and jitter caused by occlusion. Therefore, it solves the problems of directly tracking head pose based on video in related technologies, which is easily affected by occlusion, has poor robustness, and suffers from false detections and jitter.

[0063] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device may include:

[0064] A memory 401, a processor 402, and a computer program stored in the memory 401 and executable on the processor 402.

[0065] When the processor 402 executes the program, it implements the head pose tracking method provided in the above embodiments.

[0066] Furthermore, the electronic device also includes:

[0067] A communication interface 403, used for communication between the memory 401 and the processor 402.

[0068] The memory 401 is used to store computer programs that can run on the processor 402.

[0069] The memory 401 may include high-speed RAM (Random Access Memory) memory, and may also include non-volatile memory, such as at least one disk storage.

[0070] If the memory 401, processor 402, and communication interface 403 are implemented independently, then the communication interface 403, memory 401, and processor 402 can be interconnected via a bus to communicate with each other. The bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown as a single thick line in Figure 4, but this does not mean that there is only one bus or one type of bus.

[0071] Optionally, in a specific implementation, if the memory 401, processor 402, and communication interface 403 are integrated on a single chip, then the memory 401, processor 402, and communication interface 403 can communicate with each other through an internal interface.

[0072] Processor 402 may be a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of this application.

[0073] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the head pose tracking method described above.

[0074] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0075] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0076] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or processes. The scope of the preferred embodiments of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.

[0077] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0078] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program performs one or a combination of the steps of the method embodiments.

Claims

1. A head pose tracking method, characterized in that it includes the following steps:
acquiring a video of the target to be tracked;
converting the video into an image sequence, inputting the image sequence into a pre-trained occlusion recognition network, and outputting a mask image of the occlusion in each frame of the image sequence, wherein the training process of the occlusion recognition network includes: acquiring training data for an untrained occlusion recognition network, wherein the training data includes occlusion images and the real mask images corresponding to the occlusion images; inputting the occlusion images into the untrained occlusion recognition network and outputting training mask images of the occlusion images; calculating a training loss value based on the real mask images and the training mask images; if the training loss value is greater than a preset value, continuing iterative training based on the training data; otherwise, stopping the iterative training to obtain the trained occlusion recognition network;
dividing the image sequence into a reference frame image and a candidate frame image sequence using the mask image, and covering the occluded area with the mask image of each frame in the candidate frame image sequence to generate a frame image sequence to be completed;
inputting the frame image sequence to be completed, the reference frame image, and the mask image of each frame into a pre-trained deformation completion network, outputting a completed image sequence of the frame image sequence to be completed, and realizing head pose tracking of the target to be tracked based on the completed image sequence, wherein, using the inter-frame information of the video, the corresponding region of the reference frame is filled into the occluded region of the frame image sequence to be completed by image deformation;
wherein the training process of the deformation completion network includes: obtaining an image sequence of unoccluded images; taking any one frame in the image sequence as the reference frame image, and the remaining frames as candidate frame images; randomly generating mask images of different shapes, and using the mask images to cover random image regions of the candidate frame images to obtain the images to be completed; inputting the image to be completed, the reference frame image, and the mask image into an untrained deformation completion network; using an image reconstruction loss to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range; and using a total variation loss to constrain the deformation field between the reference frame image and the image to be completed to meet a preset rigid deformation condition, thereby obtaining the trained deformation completion network.
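By way of a non-limiting illustration, the two training constraints of claim 1 can be sketched as follows. The sketch assumes a mean absolute error as the image reconstruction loss, a mean absolute difference between neighbouring offsets as the total variation loss, and a hypothetical weighting factor `tv_weight`; none of these specific choices is stated in the claims.

```python
def reconstruction_loss(completed, target):
    """Mean absolute difference between the completed image and the
    candidate frame before masking (assumed form of the loss)."""
    total, count = 0.0, 0
    for row_c, row_t in zip(completed, target):
        for c, t in zip(row_c, row_t):
            total += abs(c - t)
            count += 1
    return total / count


def total_variation(flow):
    """Mean absolute difference between horizontally and vertically
    adjacent (dy, dx) offsets of the deformation field."""
    height, width = len(flow), len(flow[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            dy, dx = flow[y][x]
            if x + 1 < width:  # right-hand neighbour
                ndy, ndx = flow[y][x + 1]
                total += abs(ndy - dy) + abs(ndx - dx)
                count += 1
            if y + 1 < height:  # neighbour below
                ndy, ndx = flow[y + 1][x]
                total += abs(ndy - dy) + abs(ndx - dx)
                count += 1
    return total / count if count else 0.0


def training_loss(completed, target, flow, tv_weight=0.1):
    """Weighted sum of both terms; tv_weight is an illustrative assumption."""
    return reconstruction_loss(completed, target) + tv_weight * total_variation(flow)


uniform_flow = [[(0, -1)] * 2 for _ in range(2)]  # pure translation
jagged_flow = [[(0, 0), (0, 2)],
               [(0, 0), (0, 0)]]                  # one outlier offset
completed = [[1, 2], [3, 4]]
target = [[1, 2], [3, 6]]

tv_uniform = total_variation(uniform_flow)
tv_jagged = total_variation(jagged_flow)
loss = training_loss(completed, target, jagged_flow)
```

A uniform translation has zero total variation, whereas a field with an isolated outlier is penalised, which is one way a smoothness term can push the estimated field toward near-rigid motion.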

2. The method according to claim 1, characterized in that the step of dividing the image sequence into a reference frame image and a candidate frame image sequence using the mask image includes: identifying, based on the mask image, whether there is an unoccluded image in the image sequence; if an unoccluded image exists, selecting any unoccluded image as the reference frame image; otherwise, selecting the image with the least occlusion in the image sequence as the reference frame image; and taking the remaining frame images in the image sequence other than the reference frame image as the candidate frame image sequence.
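The selection rule of claim 2 can be sketched as follows, purely as an example. The sketch assumes binary mask images (1 for an occluded pixel, 0 otherwise) and measures the amount of occlusion by the number of occluded pixels; the function name `split_reference_and_candidates` is hypothetical.

```python
def split_reference_and_candidates(masks):
    """Return (reference_index, candidate_indices) for a list of binary masks.

    An unoccluded frame has zero occluded pixels, so taking the minimum
    count covers both branches of the rule: an unoccluded frame if one
    exists, otherwise the least occluded frame. Ties go to the earliest frame.
    """
    occluded_pixels = [sum(sum(row) for row in mask) for mask in masks]
    reference_index = min(range(len(masks)), key=occluded_pixels.__getitem__)
    candidate_indices = [i for i in range(len(masks)) if i != reference_index]
    return reference_index, candidate_indices


# Every frame is occluded: the least occluded frame (index 1) is the reference.
all_occluded = [[[1, 1], [0, 0]],
                [[0, 1], [0, 0]],
                [[1, 1], [1, 0]]]
ref_a, cand_a = split_reference_and_candidates(all_occluded)

# Frames 1 and 2 are unoccluded: either is acceptable, the sketch takes the first.
some_clear = [[[1, 0]], [[0, 0]], [[0, 0]]]
ref_b, cand_b = split_reference_and_candidates(some_clear)
```

An empty list of masks raises a `ValueError` from `min`, which is left unhandled in this sketch.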

3. A head pose tracking device, characterized in that it includes:
an acquisition module, used to acquire a video of the target to be tracked;
a conversion module, used to convert the video into an image sequence, input the image sequence into a pre-trained occlusion recognition network, and output a mask image of the occlusion in each frame of the image sequence;
a first training module, used to acquire training data for an untrained occlusion recognition network, wherein the training data includes occlusion images and the real mask images corresponding to the occlusion images; input the occlusion images into the untrained occlusion recognition network and output training mask images of the occlusion images; calculate a training loss value based on the real mask images and the training mask images; and, if the training loss value is greater than a preset value, continue iterative training based on the training data, and otherwise stop the iterative training to obtain the trained occlusion recognition network;
a segmentation module, used to divide the image sequence into a reference frame image and a candidate frame image sequence using the mask image, and to cover the occluded area using the mask image of each frame in the candidate frame image sequence to generate a frame image sequence to be completed;
a tracking module, used to input the frame image sequence to be completed, the reference frame image, and the mask image of each frame into a pre-trained deformation completion network, output a completed image sequence of the frame image sequence to be completed, and realize head pose tracking of the target to be tracked based on the completed image sequence, wherein, using the inter-frame information of the video, the corresponding region of the reference frame is filled into the occluded region of the frame image sequence to be completed by image deformation;
a second training module, used to obtain an image sequence of unoccluded images; take any one frame in the image sequence as the reference frame image, and the remaining frames as candidate frame images; randomly generate mask images of different shapes, and use the mask images to cover random image regions of the candidate frame images to obtain the images to be completed; input the image to be completed, the reference frame image, and the mask image into an untrained deformation completion network; use an image reconstruction loss to constrain the loss between the completed image output by the deformation completion network and the candidate frame image before masking to be within a preset range; and use a total variation loss to constrain the deformation field between the reference frame image and the image to be completed to meet a preset rigid deformation condition, thereby obtaining the trained deformation completion network.
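For illustration, the data flow between the conversion, segmentation, and tracking modules of claim 3 can be sketched as below. The three callables `occluder_net`, `completion_net`, and `pose_estimator` are hypothetical stand-ins for the trained networks and the pose estimator, and the toy lambdas used in the example are assumptions made only so that the sketch runs. Passing the reference frame to the pose estimator unchanged is likewise an assumption of this sketch.

```python
def cover_occlusion(frame, mask, fill_value=0):
    """Blank out occluded pixels to form a frame to be completed."""
    return [[fill_value if m else p for p, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


def track_head_pose(frames, occluder_net, completion_net, pose_estimator):
    """Run mask prediction, reference selection, completion, and pose estimation."""
    # Conversion module: one occlusion mask per frame.
    masks = [occluder_net(frame) for frame in frames]
    # Segmentation module: the least occluded frame becomes the reference.
    counts = [sum(map(sum, mask)) for mask in masks]
    ref_index = min(range(len(frames)), key=counts.__getitem__)
    reference = frames[ref_index]
    # Tracking module: complete every candidate frame, then estimate the pose.
    poses = []
    for index, (frame, mask) in enumerate(zip(frames, masks)):
        if index == ref_index:
            completed = frame  # assumption: the reference frame is used as-is
        else:
            completed = completion_net(cover_occlusion(frame, mask), reference, mask)
        poses.append(pose_estimator(completed))
    return poses


# Toy stand-ins (assumptions): negative pixels mark the occluder, completion
# copies the reference pixel, and the "pose" is simply the pixel sum.
toy_occluder_net = lambda frame: [[1 if p < 0 else 0 for p in row] for row in frame]
toy_completion_net = lambda image, ref, mask: [
    [r if m else p for p, r, m in zip(image_row, ref_row, mask_row)]
    for image_row, ref_row, mask_row in zip(image, ref, mask)]
toy_pose_estimator = lambda image: sum(map(sum, image))

frames = [[[1, 2]], [[1, -5]]]
poses = track_head_pose(frames, toy_occluder_net, toy_completion_net, toy_pose_estimator)
```

With these stand-ins the occluded pixel of the second frame is replaced by the reference value, so both frames yield the same toy pose value.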

4. The device according to claim 3, characterized in that the segmentation module is further used to: identify, based on the mask image, whether there is an unoccluded image in the image sequence; if an unoccluded image exists, select any unoccluded image as the reference frame image; otherwise, select the image with the least occlusion in the image sequence as the reference frame image; and take the remaining frame images in the image sequence other than the reference frame image as the candidate frame image sequence.

5. An electronic device, characterized in that it includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the program to implement the head pose tracking method as described in any one of claims 1-2.

6. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the head pose tracking method as described in any one of claims 1-2.