An anti-occlusion eye tracking method

By generating high frame rate image sequences through an event camera and a multi-scale feature extraction network, and combining adaptive templates and occlusion rate awareness strategies, the accuracy and robustness issues of existing eye-tracking methods under occlusion conditions are solved, achieving stable tracking of high-frequency eye movements, which is suitable for virtual reality and augmented reality interaction.

CN122244930A · Pending · Publication Date: 2026-06-19 · XIAN UNIVERSITY OF ARTS & SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIAN UNIVERSITY OF ARTS & SCI
Filing Date
2026-02-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing eye-tracking methods suffer from insufficient accuracy and robustness under high-frequency eye-tracking capture and occlusion conditions, making it difficult to achieve stable tracking.

Method used

An event camera continuously captures images of the eye region. A high frame rate image sequence is generated through an event-driven frame interpolation network. Multi-scale feature extraction networks and feature pyramid networks are used to extract multi-level features. Adaptive templates and a lightweight U-Net architecture are combined to segment the pupil region and perceive the occlusion rate. Processing strategies are dynamically selected for anti-occlusion tracking.

Benefits of technology

It achieves high-precision and continuous eye tracking under occlusion conditions, improves the robustness and practicality of the system under interference such as blinking, and is suitable for application scenarios such as VR/AR interaction and clinical eye movement analysis.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to eye-tracking methods, specifically to an occlusion-resistant eye-tracking method. It addresses the insufficient tracking accuracy and robustness of existing eye-tracking methods and their difficulty in maintaining stable tracking under interference such as continuous blinking. The invention first uses an event-driven frame interpolation network to reconstruct a low-frame-rate sequence into a kilohertz-level high-frame-rate sequence, overcoming the frame rate limitations of traditional sensors. Then, through multi-scale feature fusion and a temporal matching mechanism based on adaptive templates, it uses features from earlier clear frames to compensate for occluded areas. Finally, an occlusion rate perception strategy dynamically selects among three processing strategies, maintaining high-accuracy, continuous tracking under occlusion conditions. This significantly improves robustness and practicality under interference such as blinking, providing a reliable technical foundation for applications requiring high-frequency, interference-resistant eye-tracking data, such as VR / AR interaction and clinical eye-tracking analysis.

Description

Technical Field

[0001] This invention relates to eye-tracking methods, and more specifically to an occlusion-resistant eye-tracking method.

Background Technology

[0002] Eye-tracking technology has significant application value in virtual reality, augmented reality, and assisted interaction. To achieve high-precision eye-tracking analysis, current technologies primarily rely on eye images acquired by traditional frame-based image sensors (such as CCD / CMOS cameras) and utilize image processing or deep learning algorithms for pupil localization and tracking. However, these existing technologies suffer from the following drawbacks in practical applications:

[0003] First, due to the inherent frame rate of traditional image sensors, it is difficult to capture and track rapid eye movements at the millisecond or even microsecond level, such as saccades or tremors, resulting in the loss of high-frequency eye movement information and limiting tracking accuracy.

[0004] Second, when the user blinks or the eyelids completely cover the pupil, the detection method based on single-frame images is prone to tracking drift, positioning errors, or even tracking interruption due to the lack of target information, resulting in insufficient robustness.

[0005] Third, although emerging event sensors such as event cameras can asynchronously output brightness change event streams at microsecond resolution, providing a hardware foundation for high-frequency tracking, unstructured event streams lack semantic information when processed directly, and existing event-based methods still adapt poorly to occlusion, making it difficult to maintain stable tracking under interference such as continuous blinking.

Summary of the Invention

[0006] The purpose of this invention is to solve the technical problems of insufficient tracking accuracy, insufficient robustness, and difficulty in maintaining stable tracking under interference such as continuous blinking in existing eye tracking methods, and to provide an occlusion-resistant eye tracking method.

[0007] To achieve the above objectives, the technical solution provided by this invention is as follows:

[0008] An occlusion-resistant eye-tracking method, characterized by the following steps:

[0009] S1. Use an event camera to continuously capture multiple frames of the subject's eye region to obtain a low frame rate grayscale video sequence of the eye region and the corresponding event stream; input the low frame rate grayscale video sequence of the eye region and the corresponding event stream into a pre-trained event-driven frame interpolation network for processing to obtain a high frame rate eye image sequence.

[0010] S2. Input the images in the high-frame-rate eye image sequence into the multi-scale feature extraction network frame by frame, extract the multi-level spatial features of each frame of image, and obtain four feature maps C1, C2, C3, and C4 of each frame of image, corresponding to the downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image respectively;

[0011] Use the Feature Pyramid Network (FPN) to process the four feature maps C1, C2, C3, and C4 of each frame of image, and correspondingly obtain a set of fused feature maps P2, P3, P4, and P5 that enhance and fuse multi-scale information for each frame of image;

[0012] S3. While obtaining the high-frame-rate eye image sequence, use an event camera to perform image detection on the high-frame-rate eye image sequence frame by frame. When an unoccluded or low-occluded pupil image frame appears in a certain frame of the high-frame-rate eye image sequence, crop out a local feature region centered on the pupil center coordinates from the fused feature map P3 corresponding to this frame of image, normalize the region, and save it as an adaptive template; for each subsequent frame of image to be processed, perform depth cross-correlation calculation between the adaptive template and its fused feature maps P2, P3, P4, and P5 to obtain a spatio-temporal fused feature map for each frame of image;

[0013] S4. Use a segmentation network based on the lightweight U-Net architecture to perform real-time pixel-level pupil region segmentation on the images in each frame of the high-frame-rate eye image sequence, and generate a binary mask image;

[0014] By counting the number of pixels in the pupil region in the binary mask image, obtain the visible area Ps of the current frame pupil, and at the same time, according to the most recently detected complete pupil area Pl, calculate the occlusion rate a of the current frame;

[0015] a = (Pl - Ps) / Pl;

[0016] S5. Based on the occlusion rate a calculated in step S4, compare it with the preset first threshold THR1 and second threshold THR2, and dynamically select the following processing strategies accordingly, so as to complete anti-occlusion eye movement tracking:

[0017] 1) When a < THR1, the frame is judged to be low occlusion. The fused feature map P3 of the current image frame is input directly into the rotated object detector, which estimates the pupil ellipse parameters at this moment, thereby obtaining a smooth eye movement trajectory image;

[0018] 2) When THR1 ≤ a ≤ THR2, the frame is judged to be medium-to-high occlusion and the adaptive template matching mechanism is enabled. The spatio-temporal fused feature map of the current image frame is input into the rotated object detector, which estimates the pupil ellipse parameters at that moment, thereby obtaining a smooth eye movement trajectory image;

[0019] 3) When a > THR2, the frame is judged to be extremely high occlusion and the information in the current image frame is unreliable. Direct detection on this image frame is abandoned. Instead, the nearest valid image frames before and after the timestamp of this image frame are input into the rotated object detector to estimate their pupil ellipse parameters, and time-weighted linear interpolation between those parameters completes the trajectory at the current image frame, thereby obtaining a smooth eye movement trajectory image.
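The time-weighted linear interpolation in strategy 3) can be illustrated with a minimal Python sketch. The `Ellipse` class, its field names, the radian unit, and the function name are all assumptions made for illustration; the text does not say how the tilt angle should be handled at wrap-around, so it is blended here like any other scalar.

```python
from dataclasses import dataclass


@dataclass
class Ellipse:
    """Pupil ellipse parameters named in the text (field names are illustrative)."""
    cx: float      # center x
    cy: float      # center y
    major: float   # major-axis length
    minor: float   # minor-axis length
    angle: float   # tilt angle; radians are an assumption


def interpolate_ellipse(t: float, t_prev: float, e_prev: Ellipse,
                        t_next: float, e_next: Ellipse) -> Ellipse:
    """Blend the ellipses of two valid frames, weighted by temporal distance."""
    if not t_prev < t_next:
        raise ValueError("t_prev must be earlier than t_next")
    if not t_prev <= t <= t_next:
        raise ValueError("t must lie between the two valid frames")
    w = (t - t_prev) / (t_next - t_prev)  # 0 at the previous frame, 1 at the next

    def lerp(a: float, b: float) -> float:
        return (1.0 - w) * a + w * b

    return Ellipse(
        cx=lerp(e_prev.cx, e_next.cx),
        cy=lerp(e_prev.cy, e_next.cy),
        major=lerp(e_prev.major, e_next.major),
        minor=lerp(e_prev.minor, e_next.minor),
        # Simplification: the angle is blended as a plain scalar, which
        # ignores wrap-around at the ends of the angle range.
        angle=lerp(e_prev.angle, e_next.angle),
    )
```

A frame closer in time to the previous valid frame therefore inherits most of its parameters from that frame, and vice versa.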

[0020] Furthermore, step S1 specifically includes the following steps:

[0021] S1.1. Use an event camera to continuously capture multiple frames of the subject's eye region to obtain a low frame rate grayscale video sequence of the eye region at 25 FPS and the corresponding event stream;

[0022] S1.2 Input the 25FPS low frame rate grayscale video sequence and the corresponding event stream into the pre-trained event-driven frame interpolation network to generate a 1000FPS high frame rate eye image sequence.
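As a rough illustration of the frame rates in steps S1.1 and S1.2, the following Python sketch works out how many frames must be synthesized between each pair of captured frames and at which time offsets. The function name is hypothetical, and evenly spaced output timestamps are an assumption.

```python
def interpolation_plan(low_fps: int = 25, high_fps: int = 1000):
    """Return (frames to synthesize per captured pair, their offsets in ms)."""
    if high_fps % low_fps != 0:
        raise ValueError("this sketch assumes an integer upsampling factor")
    factor = high_fps // low_fps       # output intervals per captured interval
    step_ms = 1000.0 / high_fps        # spacing between output frames
    # Offsets are measured from the earlier captured frame of the pair.
    offsets_ms = [k * step_ms for k in range(1, factor)]
    return factor - 1, offsets_ms
```

With the default values, each 40 ms gap between captured frames is filled with 39 synthesized frames spaced 1 ms apart.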

[0023] Furthermore, the event-driven frame interpolation network in step S1.2 includes the TimeLens model;

[0024] The pre-training of the event-driven frame interpolation network includes the following steps:

[0025] A. Take five consecutive frames of images and the event stream between them in a low frame rate grayscale video sequence of the eye region as a sample group. Input the first and last two frames and their corresponding event streams, as well as the event streams corresponding to the three middle frames, into the TimeLens model to predict the three middle frames.

[0026] B. Optimize network parameters by minimizing the L1 loss and perceptual loss between the predicted image and the corresponding image;

[0027] C. Return to step A and repeat the training until the optimized network parameters are less than the preset value, thus completing the pre-training of the event-driven frame interpolation network.
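Steps A and B can be sketched as follows in Python with NumPy. This is only an illustration under several assumptions: the window stride, the `features` callable standing in for whatever fixed network supplies the perceptual features, the use of an L1 distance in that feature space, and the equal weighting of the two terms are not specified in the text. The TimeLens model itself is not reproduced.

```python
import numpy as np


def make_sample_groups(num_frames: int, group_size: int = 5, stride: int = 1):
    """Index windows of consecutive frames as ((first, last), middle frames)."""
    groups = []
    for start in range(0, num_frames - group_size + 1, stride):
        idx = list(range(start, start + group_size))
        groups.append(((idx[0], idx[-1]), idx[1:-1]))
    return groups


def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference between two arrays of the same shape."""
    return float(np.mean(np.abs(pred - target)))


def training_loss(pred, target, features, perceptual_weight: float = 1.0) -> float:
    """L1 loss plus a perceptual term measured in a feature space.

    `features` is a hypothetical callable mapping an image to a feature array.
    """
    perceptual = l1_loss(features(pred), features(target))
    return l1_loss(pred, target) + perceptual_weight * perceptual
```

Each group supplies the first and last frames as inputs and the three middle frames as prediction targets.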

[0028] Furthermore, step S2 specifically includes the following steps:

[0029] S2.1 Input the high frame rate eye image sequence of 1000FPS frame by frame into the multi-scale feature extraction network with the Swin-Transformer-Tiny version as the backbone network;

[0030] First, the multi-scale feature extraction network segments each frame of the image into non-overlapping 4x4 image blocks and embeds them into a 48-dimensional feature vector;

[0031] The image patches are projected to 96 dimensions by the linear embedding layer of the multi-scale feature extraction network and then processed by two Swin-Transformer blocks, which output feature map C1 with a resolution of 1 / 4 of the original image and 96 channels;

[0032] Feature map C1 is then downsampled by a factor of 2 and projected to 192 dimensions by a Patch Merging layer, followed by two Swin-Transformer blocks, which output feature map C2 with a resolution of 1 / 8 of the original image and 192 channels;

[0033] Feature map C2 is then downsampled by a factor of 2 and projected to 384 dimensions by a Patch Merging layer, followed by six Swin-Transformer blocks, which output feature map C3 with a resolution of 1 / 16 of the original image and 384 channels;

[0034] Finally, feature map C3 is downsampled by a factor of 2 and projected to 768 dimensions by a Patch Merging layer, followed by two Swin-Transformer blocks, which output feature map C4 with a resolution of 1 / 32 of the original image and 768 channels;

[0035] Finally, four feature maps C1, C2, C3, and C4 are obtained for each frame of the image, which correspond to downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image, respectively, and integrate information from local details to global semantics.
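The shapes of C1 to C4 follow directly from the downsampling scales and channel counts listed above. The small Python helper below (a hypothetical name, not part of the method) computes them for a given input size; the requirement that both sides be divisible by 32 is an assumption added to keep the arithmetic exact.

```python
def swin_tiny_feature_shapes(height: int, width: int):
    """Return {name: (channels, H, W)} for C1..C4 as described in the text."""
    channels = {"C1": 96, "C2": 192, "C3": 384, "C4": 768}
    strides = {"C1": 4, "C2": 8, "C3": 16, "C4": 32}
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes sides divisible by 32")
    return {name: (channels[name], height // s, width // s)
            for name, s in strides.items()}
```

For a 512×512 input, for example, C1 is 96×128×128 and C4 is 768×16×16.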

[0036] S2.2 Input the four feature maps C1, C2, C3, and C4 of each frame into the Feature Pyramid Network (FPN);

[0037] First, a 1×1 convolution is performed on feature map C4 to reduce the number of channels from 768 to 256, and then a 3×3 convolution is performed to generate fused feature map P5.

[0038] Subsequently, the fused feature map P5 is upsampled by a factor of 2 using nearest-neighbor interpolation, while feature map C3 passes through a 1×1 convolution that reduces its number of channels from 384 to 128. The two feature maps are summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P4' with a resolution of 1 / 16 of the original image and 256 channels;

[0039] Subsequently, the fused feature map P4' is upsampled by a factor of 2, while feature map C2 passes through a 1×1 convolution that reduces its number of channels from 192 to 64. The two are summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P3' with a resolution of 1 / 8 of the original image and 256 channels, which enhances the multi-scale representation of the pupil region;

[0040] Finally, the fused feature map P3' is upsampled by a factor of 2, while feature map C1 passes through a 1×1 convolution that reduces its number of channels from 96 to 32. The two are summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P2' with a resolution of 1 / 4 of the original image and 256 channels, which preserves fine edge localization information;

[0041] Meanwhile, feature map C1 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P2' to generate fused feature map P2;

[0042] The feature map C2 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P3' to generate the fused feature map P3;

[0043] The feature map C3 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P4' to generate the fused feature map P4;

[0044] Finally, a set of enhanced fusion feature maps P2, P3, P4, and P5, which incorporate multi-scale information, are obtained for each frame of the image.
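The upsample-and-add step that recurs in the top-down pathway of S2.2 can be sketched in Python with NumPy as below. This covers only nearest-neighbor 2× upsampling followed by the element-wise sum; the 1×1 and 3×3 convolutions are omitted, the function names are hypothetical, and both inputs are assumed to already share a channel count so that the sum is defined.

```python
import numpy as np


def upsample2x_nearest(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def top_down_merge(coarser: np.ndarray, lateral: np.ndarray) -> np.ndarray:
    """Upsample the coarser map and add it element-wise to the lateral map."""
    up = upsample2x_nearest(coarser)
    if up.shape != lateral.shape:
        raise ValueError(f"shape mismatch: {up.shape} vs {lateral.shape}")
    return up + lateral
```

Each value of the coarser map is copied into a 2×2 block before the sum, which is why the text follows the sum with a 3×3 convolution to reduce aliasing.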

[0045] Furthermore, in step S3, the occlusion rate a_none of an unoccluded pupil image frame is 0, and the occlusion rate a_low of a low-occlusion pupil image frame satisfies 0 < a_low < 0.25;

[0046] The size of the local feature region is 13×13 pixels.
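Template cropping and matching in step S3 can be sketched as follows in Python with NumPy. Several points are assumptions rather than details from the text: "normalize" is taken to mean unit L2 norm per channel, the "depth cross-correlation" is read as a depthwise (per-channel) cross-correlation, no padding is applied so the response is smaller than the input, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def crop_template(feature: np.ndarray, cx: int, cy: int, size: int = 13) -> np.ndarray:
    """Crop a size x size patch centered on (cx, cy) from a (C, H, W) map."""
    half = size // 2
    c, h, w = feature.shape
    if not (half <= cx < w - half and half <= cy < h - half):
        raise ValueError("patch would cross the feature-map border")
    patch = feature[:, cy - half:cy + half + 1, cx - half:cx + half + 1]
    # Assumption: normalization means unit L2 norm per channel.
    norm = np.linalg.norm(patch.reshape(c, -1), axis=1).reshape(c, 1, 1)
    return patch / np.maximum(norm, 1e-12)


def depthwise_xcorr(search: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Correlate each channel of `search` with the same channel of `template`."""
    kh, kw = template.shape[1:]
    # windows has shape (C, H - kh + 1, W - kw + 1, kh, kw).
    windows = sliding_window_view(search, (kh, kw), axis=(1, 2))
    return np.einsum("chwij,cij->chw", windows, template)
```

When the template is correlated with the feature map it was cropped from, the response peaks at the window the template came from, which is the behavior template matching relies on.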

[0047] Furthermore, step S5 specifically includes:

[0048] Based on the occlusion rate 'a' calculated in step S4, it is compared with the preset first threshold THR1 and second threshold THR2, and the following processing strategy is dynamically selected accordingly to complete anti-occlusion eye tracking:

[0049] 1) When a < THR1, it is judged as low occlusion, and the fused feature map P3 of the current image frame is directly used. The fused feature map P3 is input into the rotated object detector improved based on the R3Det architecture. First, rotated anchor boxes with densities of {3, 2, 1} are respectively laid on the fused feature map P3, and multi-angle anchor boxes with 3 scales, 3 aspect ratios and 6 angular directions are preset at each position;

[0050] The fused feature map P3 with preset multi-angle anchor boxes is input into the detection head of the rotated object detector and optimized step by step through 5 stacked refinement stages;

[0051] After the first refinement stage extracts features through a 5×5 convolutional layer, the classification branch predicts the class confidence of the anchor box through a fully connected layer; the regression branch predicts the center point offset, width and height scaling amount and rotation angle offset to obtain the initial parameters of the rotated rectangle box;

[0052] In subsequent refinement stages, the parameters of the rotated rectangle box output by the previous refinement stage are resampled into a feature region of a fixed size and aligned to the current fused feature map P3 through interpolation;

[0053] Finally, through the feature refinement module of the rotated object detector, 3×3 convolution, batch normalization and ReLU activation are performed in sequence to extract finer local features, predict the residual correction amount of the parameters of the rotated rectangle box, and achieve step-by-step regression from coarse to fine. Finally, the parameters of the rotated rectangle box that accurately represent the pupil position and shape are output, including the center coordinates, width and height, and tilt angle;

[0054] Subsequently, based on the pixel information within the parameters of the rotated rectangle box, the least squares method is used for ellipse fitting to obtain pupil ellipse parameters that match the physiological shape of the pupil. The pupil ellipse parameters include the center coordinates, major and minor axis lengths, and tilt angle of the pupil, thereby obtaining a smooth eye movement trajectory image;

[0055] 2) When THR1 ≤ a ≤ THR2, it is determined as medium-high occlusion, and the adaptive template matching mechanism is enabled. The spatio-temporal fused feature map of the current image frame is used, and the spatio-temporal fused feature map is input into the rotated object detector improved based on the R3Det architecture. First, rotated anchor boxes with densities of {3, 2, 1} are respectively laid on the spatio-temporal fused feature map, and multi-angle anchor boxes with 3 scales, 3 aspect ratios and 6 angular directions are preset at each position;

[0056] The spatio-temporal fused feature map with preset multi-angle anchor boxes is input into the detection head of the rotated object detector and optimized step by step through 5 stacked refinement stages;

[0057] In the first refinement stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer, and the regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle box;

[0058] In subsequent refinement stages, the parameters of the rotated rectangle box output by the previous refinement stage are resampled into a feature region of a fixed size and aligned to the current spatio-temporal fused feature map through interpolation;

[0059] Finally, the feature refinement module of the rotated object detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract finer local features and predict the residual correction of the rotated rectangle box parameters, achieving step-by-step regression from coarse to fine. It finally outputs the rotated rectangle box parameters that accurately represent the pupil position and shape, including the center coordinates, width and height, and tilt angle;

[0060] Subsequently, based on the pixel information within the rotated rectangle box, the least squares method is used for ellipse fitting to obtain pupil ellipse parameters that match the physiological shape of the pupil. The pupil ellipse parameters include the center coordinates, major and minor axis lengths, and tilt angle of the pupil, thereby obtaining a smooth eye movement trajectory image;

[0061] 3) When a > THR2, the frame is judged to be extremely high occlusion and the information in the current image frame is unreliable. Direct detection on this image frame is abandoned. Instead, the nearest valid image frames before and after the timestamp of this image frame are input into the rotated object detector improved based on the R3Det architecture. First, rotated anchor boxes with densities of {3, 2, 1} are respectively laid on the valid image frames, and multi-angle anchor boxes with 3 scales, 3 aspect ratios and 6 angular directions are preset at each position;

[0062] The valid image frames with preset multi-angle anchor boxes are input into the detection head of the rotated object detector and optimized step by step through 5 stacked refinement stages;

[0063] In the first refinement stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer, and the regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle box;

[0064] Each subsequent refinement stage resamples the rotated rectangle box parameters output by the previous refinement stage into a feature region of a fixed size and aligns it to the current valid image frame through interpolation;

[0065] Finally, the feature refinement module of the rotated object detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract finer local features and predict the residual correction of the rotated rectangle box parameters, achieving step-by-step regression from coarse to fine. It finally outputs the rotated rectangle box parameters that accurately represent the pupil position and shape, including the center coordinates, width and height, and tilt angle;

[0066] Subsequently, based on the pixel information within the rotated rectangle box, the least squares method is used for ellipse fitting to obtain pupil ellipse parameters that match the physiological shape of the pupil. The pupil ellipse parameters include the center coordinates, major and minor axis lengths, and tilt angle of the pupil. Time-weighted linear interpolation is then performed on these parameters to complete the trajectory at the current image frame and obtain a smooth eye movement trajectory image.
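The least-squares ellipse fitting used in all three strategies can be illustrated with a generic algebraic conic fit in Python with NumPy. This is one common formulation, not necessarily the one used in the method: it assumes the input is a set of pupil boundary points (the text only says "pixel information within the rotated rectangle box"), it returns semi-axis lengths rather than full axis lengths, and the normalization to 1 on the right-hand side fails if the ellipse passes through the coordinate origin. The function name is hypothetical.

```python
import numpy as np


def fit_ellipse(x: np.ndarray, y: np.ndarray):
    """Least-squares conic fit; returns (cx, cy, semi_major, semi_minor, angle)."""
    # Fit A x^2 + B xy + C y^2 + D x + E y = 1 in the least-squares sense.
    design = np.column_stack([x * x, x * y, y * y, x, y])
    A, B, C, D, E = np.linalg.lstsq(design, np.ones(len(x)), rcond=None)[0]

    quad = np.array([[A, B / 2.0], [B / 2.0, C]])
    # The center is where the gradient of the conic vanishes.
    cx, cy = np.linalg.solve(2.0 * quad, np.array([-D, -E]))
    # About the center the conic reads q(p - center) = k.
    k = 1.0 - (A * cx * cx + B * cx * cy + C * cy * cy + D * cx + E * cy)
    if k < 0:
        quad, k = -quad, -k            # make the quadratic form positive

    eigvals, eigvecs = np.linalg.eigh(quad)   # eigenvalues in ascending order
    if k == 0 or eigvals[0] <= 0:
        raise ValueError("points do not describe an ellipse")
    semi_major = np.sqrt(k / eigvals[0])      # smallest eigenvalue, longest axis
    semi_minor = np.sqrt(k / eigvals[1])
    angle = np.arctan2(eigvecs[1, 0], eigvecs[0, 0]) % np.pi
    return float(cx), float(cy), float(semi_major), float(semi_minor), float(angle)
```

The returned angle is the direction of the major axis in radians, reduced to the range [0, π).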

[0067] Furthermore, step S4 specifically includes:

[0068] S4.1 Input the 512×512 RGB 3-channel color image from each frame of the high frame rate eye image sequence into the segmentation network based on the lightweight U-Net architecture. First, process it through its Gradient Aggregation module to transform it into a 32-channel initial feature map D_init with a spatial dimension of 512×512 and a unified channel dimension;

[0069] S4.2 Input the 32-channel initial feature map D_init with a spatial dimension of 512×512 into the max pooling layer for 2× downsampling, and output a 64-channel feature map D0 with a spatial dimension of 256×256.

[0070] S4.3 Input the 64-channel feature map D0 with a spatial dimension of 256×256 into the Ghost_CBAM attention extraction module for channel and spatial attention enhancement processing, and output the 128-channel feature map D1 with a spatial dimension of 128×128.

[0071] S4.4 The 128-channel feature map D1 with a spatial dimension of 128×128 is downsampled by the max pooling layer to obtain a 256-channel feature map with a spatial dimension of 64×64. This feature map is then input into the Ghost_CBAM attention extraction module, which outputs a 256-channel feature map D2 with a spatial dimension of 64×64.

[0072] S4.5 The 256-channel feature map D2 with a spatial dimension of 64×64 is downsampled by the max pooling layer to obtain a 512-channel feature map with a spatial dimension of 32×32. This feature map is then input into the Graph Correlation Attention module for global relation modeling to obtain a 512-channel feature map D3 with a spatial dimension of 32×32.

[0073] S4.6 Upsample the 512-channel feature map D3 with a spatial dimension of 32×32 to obtain a 256-channel feature map with a spatial dimension of 64×64. Then, concatenate and fuse it with the feature map D2. Finally, process it through the Ghost_CBAM attention extraction module to obtain a 256-channel feature map D2′ with a spatial dimension of 64×64.

[0074] S4.7 Upsample the 256-channel feature map D2′ with a spatial dimension of 64×64 to obtain a 128-channel feature map with a spatial dimension of 128×128. Then, concatenate and fuse it with the feature map D1. Finally, process it through the Ghost_CBAM attention extraction module to obtain the 128-channel feature map D1′ with a spatial dimension of 128×128.

[0075] S4.8 Upsample the 128-channel feature map D1′ with a spatial dimension of 128×128 to obtain a 64-channel feature map with a spatial dimension of 256×256. Then, concatenate and fuse it with the feature map D0. Finally, process it through the Ghost_CBAM attention extraction module to obtain a 64-channel feature map D0′ with a spatial dimension of 256×256.

[0076] S4.9 Upsample the 64-channel feature map D0′ with a spatial dimension of 256×256 to obtain a 32-channel feature map with a spatial dimension of 512×512. Then, concatenate and fuse it with the initial feature map D_init from step S4.1. Finally, process it through the Ghost_CBAM attention extraction module to simultaneously output a 32-channel feature map D_init′ with a spatial dimension of 512×512 and a 1-channel binarized mask with a spatial dimension of 512×512;

[0077] S4.10. By counting the number of pixels in the pupil region in the binarized mask image, the visible area Ps of the pupil in the current frame is obtained. At the same time, based on the most recently detected complete pupil area Pl, the occlusion rate a of the current frame is calculated.

[0078] a = (Pl - Ps) / Pl.
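The occlusion rate in S4.10 amounts to a pixel count followed by one division. The Python sketch below uses NumPy; the function name is hypothetical, nonzero mask pixels are assumed to mark the pupil, and the clamp to [0, 1] is an added safeguard rather than part of the formula in the text.

```python
import numpy as np


def occlusion_rate(mask: np.ndarray, full_area: float) -> float:
    """Compute a = (Pl - Ps) / Pl, with Ps counted from a binary pupil mask."""
    if full_area <= 0:
        raise ValueError("Pl must be positive")
    visible_area = int(np.count_nonzero(mask))   # Ps: visible pupil pixels
    rate = (full_area - visible_area) / full_area
    # Clamping is an added safeguard for the case Ps > Pl.
    return float(np.clip(rate, 0.0, 1.0))
```

For example, a mask with 300 pupil pixels and a most recent complete pupil area of 400 pixels gives a = 0.25.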

[0079] Furthermore, in step S5, the first threshold THR1 is 25%, and the second threshold THR2 is 85%.
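With these threshold values, the strategy selection in step S5 reduces to two comparisons. A minimal Python sketch follows; the enum and function names are illustrative.

```python
from enum import Enum


class Strategy(Enum):
    DIRECT = "low occlusion: detect on fused feature map P3"
    TEMPLATE = "medium-high occlusion: detect on spatio-temporal fused feature map"
    INTERPOLATE = "extremely high occlusion: interpolate from neighboring valid frames"


def select_strategy(a: float, thr1: float = 0.25, thr2: float = 0.85) -> Strategy:
    """Map an occlusion rate to one of the three strategies of step S5."""
    if a < thr1:
        return Strategy.DIRECT
    if a <= thr2:          # THR1 <= a <= THR2, both bounds inclusive
        return Strategy.TEMPLATE
    return Strategy.INTERPOLATE
```

Both boundary values, a = 0.25 and a = 0.85, fall into the medium-high occlusion branch, matching the inequalities in the text.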

[0080] Furthermore, in step C, the preset value is 0.05.

[0081] Compared with the prior art, the present invention has the following beneficial effects:

[0082] (1) The present invention provides an occlusion-resistant eye tracking method that uses event-driven frame interpolation to reconstruct low frame rate to kilohertz-level high frequency sequences, breaking through the frame rate limitation of traditional sensors. Through multi-scale feature fusion and a temporal matching mechanism based on adaptive templates, it effectively utilizes historical clear frame features to compensate for occlusion areas and combines an occlusion rate perception strategy to dynamically select processing strategies, thereby maintaining high accuracy and continuous tracking under occlusion conditions. This significantly improves the robustness and practicality of the system under interference such as blinking, providing a reliable technical foundation for application scenarios such as VR / AR interaction and clinical eye movement analysis that require high frequency and interference-resistant eye movement data.

[0083] (2) The occlusion-resistant eye tracking method provided by the present invention builds the adaptive template from the fused feature map P3 because P3 achieves the best balance between resolution and semantic information. Its medium resolution retains the key details of the pupil while providing sufficient semantic information and a moderate receptive field, which matches the medium-scale characteristics of the pupil in the eye image and allows the core features of the pupil to be captured accurately. The fused feature map P3 is also more robust to interference such as changes in illumination and slight occlusion, so using it as a template improves the matching accuracy and stability of the subsequent depth cross-correlation calculation. Compared with the high-resolution fused feature map P2, it requires less computation. Compared with the highly semantic but low-resolution fused feature maps P4 and P5, it ensures sufficient feature discrimination.

Attached Figure Description

[0084] Figure 1 is a schematic diagram illustrating the cooperation between the multi-scale feature extraction network and the feature pyramid network in step S2 of an embodiment of the occlusion-resistant eye tracking method of the present invention;

[0085] Figure 2 is a schematic diagram of the processing flow of the spatio-temporal fused feature map by the rotated object detector in step S5 of an embodiment of the present invention;

[0086] Figure 3 is a schematic diagram of the processing flow of the segmentation network based on the lightweight U-Net architecture in step S4 of an embodiment of the present invention;

[0087] Figure 4 is a line graph showing the F1 scores of an embodiment of the occlusion-resistant eye tracking method of the present invention and other eye tracking methods on the EV-Eye dataset;

[0088] Figure 5 compares the tracking performance of an embodiment of the occlusion-resistant eye tracking method of the present invention with other eye tracking methods on the EV-Eye dataset. Panel (b) shows the tracking performance of this embodiment, while panels (a) and (c)-(f) show the tracking performance of the other eye tracking methods.

Detailed Implementation

[0089] To make the objectives, advantages, and features of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Those skilled in the art should understand that these embodiments are merely used to explain the technical principles of the present invention and are not intended to limit the scope of protection of the present invention.

[0090] As shown in Figures 1-5, this embodiment provides an occlusion-resistant eye-tracking method, including the following steps:

[0091] S1. Use an event camera to continuously capture multiple frames of the subject's eye region to obtain a low frame rate grayscale video sequence of the eye region and the corresponding event stream; input the low frame rate grayscale video sequence of the eye region and the corresponding event stream into a pre-trained event-driven frame interpolation network for processing to obtain a high frame rate eye image sequence.

[0092] Specifically, step S1 includes the following steps:

[0093] S1.1. Use an event camera to continuously capture multiple frames of the subject's eye region to obtain a low frame rate grayscale video sequence of the eye region at 25 FPS and the corresponding event stream;

[0094] The event stream format is (x, y, t, p), where x and y represent pixel coordinates, t represents timestamp, p represents polarity, and the event camera model can be DAVIS346;

[0095] S1.2 Input the 25FPS low frame rate grayscale video sequence and the corresponding event stream into the pre-trained event-driven frame interpolation network to generate a 1000FPS high frame rate eye image sequence.
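To relate the (x, y, t, p) event format to the frame pairs used for interpolation, the Python sketch below stores events in a NumPy structured array and selects those falling between two frame timestamps. The field dtypes, the microsecond time unit, the assumption that events are sorted by time, and the function name are all illustrative and not taken from the text or from any camera SDK.

```python
import numpy as np

# Field names follow the (x, y, t, p) format in the text; the dtypes and the
# microsecond time unit are assumptions.
EVENT_DTYPE = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.int64), ("p", np.int8)])


def events_between(events: np.ndarray, t_start: int, t_end: int) -> np.ndarray:
    """Return events with t_start <= t < t_end from a time-sorted event array."""
    lo = np.searchsorted(events["t"], t_start, side="left")
    hi = np.searchsorted(events["t"], t_end, side="left")
    return events[lo:hi]
```

At 25 FPS, consecutive frames are 40 ms apart, so under these assumptions `events_between(events, t0, t0 + 40_000)` would return the events recorded between the frame captured at `t0` and the next one.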

[0096] In this embodiment, the event-driven frame interpolation network includes the TimeLens model, and the pre-training of the event-driven frame interpolation network includes the following steps:

[0097] A. Take five consecutive frames of images and the event stream between them in a low frame rate grayscale video sequence of the eye region as a sample group. Input the first and last two frames and their corresponding event streams, as well as the event streams corresponding to the three middle frames, into the TimeLens model to predict the three middle frames.

[0098] B. Optimize network parameters by minimizing the L1 loss and perceptual loss between the predicted image and the corresponding image;

[0099] C. Return to step A and repeat the training until the optimized network parameters are less than the preset value, thus completing the pre-training of the event-driven frame interpolation network;

[0100] In this embodiment, the preset value is 0.05. In other embodiments, the preset value can be set according to actual needs.
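The sample grouping of step A and the L1 term of step B can be sketched as follows. This is an illustration under assumptions, not the training code of the embodiment: "the first and last two frames" is read here as the first frame and the last frame of each five-frame group, a sliding window with stride 1 is assumed, the event streams and the perceptual loss are omitted, and the function names are hypothetical.

```python
def make_sample_groups(frames):
    """Slide a window of five consecutive frames over the sequence.

    Each group keeps the two boundary frames as inputs and the three
    middle frames as prediction targets.
    """
    groups = []
    for i in range(len(frames) - 4):
        window = frames[i:i + 5]
        groups.append({"inputs": (window[0], window[4]), "targets": window[1:4]})
    return groups


def l1_loss(pred, target):
    """Mean absolute difference between two equally sized 2D images."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += abs(p - t)
            count += 1
    return total / count
```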

[0101] This step uses an event-driven frame interpolation network to reconstruct low frame rates into kilohertz-level high-frequency sequences, breaking through the frame rate limitations of traditional sensors.

[0102] S2. Input the images in the high frame rate eye image sequence frame by frame into the multi-scale feature extraction network to extract the multi-level spatial features of each frame image, and obtain four feature maps C1, C2, C3, and C4 for each frame image, which correspond to the downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image, respectively.

[0103] The Feature Pyramid Network (FPN) is used to process the four feature maps C1, C2, C3, and C4 of each frame image, resulting in a set of enhanced fused feature maps P2, P3, P4, and P5 that incorporate multi-scale information for each frame image.

[0104] Specifically, step S2 includes the following steps:

[0105] S2.1 Input the high frame rate eye image sequence of 1000FPS frame by frame into the multi-scale feature extraction network with the Swin-Transformer-Tiny version as the backbone network;

[0106] First, the multi-scale feature extraction network segments each frame of the image into non-overlapping 4x4 image blocks and embeds them into a 48-dimensional feature vector;

[0107] The image patches are projected to 96 dimensions through the linear embedding layer of the multi-scale feature extraction network and then input into two Swin-Transformer blocks for processing, outputting feature map C1 with a resolution of 1 / 4 of the original image and 96 channels;

[0108] The feature map C1 is then downsampled by a factor of 2 and projected to 192 dimensions using a Patch Merging layer, followed by processing with two Swin-Transformer blocks, outputting feature map C2 with a resolution of 1 / 8 of the original image and 192 channels;

[0109] The feature map C2 is downsampled by a factor of 2 and projected to 384 dimensions through a Patch Merging layer, and then processed by 6 Swin-Transformer blocks, outputting feature map C3 with a resolution of 1 / 16 of the original image and 384 channels;

[0110] Finally, the feature map C3 is downsampled by a factor of 2 and projected to 768 dimensions using a Patch Merging layer. After processing by two Swin-Transformer blocks, feature map C4 with a resolution of 1 / 32 of the original image and 768 channels is output;

[0111] Finally, four feature maps C1, C2, C3, and C4 are obtained for each frame of the image, which correspond to downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image, respectively, and integrate information from local details to global semantics.
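The resolutions and channel counts listed in step S2.1 can be checked with a short shape walk-through. The sketch below only tracks tensor shapes and does not implement the Swin-Transformer itself. The function name is hypothetical, a 3-channel input is assumed (which is what makes a 4×4 patch 48-dimensional), and the 512×512 input size used in the test is borrowed from step S4.1 as an assumption.

```python
def swin_tiny_stage_shapes(height: int, width: int):
    """Return [(name, H, W, channels, num_blocks)] for the four stages of S2.1."""
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    patch = 4                              # non-overlapping 4x4 patches
    raw_patch_dim = patch * patch * 3      # 48 values per 3-channel patch
    assert raw_patch_dim == 48
    channels = 96                          # after the linear embedding layer
    h, w = height // patch, width // patch
    stages = []
    for name, blocks in (("C1", 2), ("C2", 2), ("C3", 6), ("C4", 2)):
        if name != "C1":
            # Patch Merging: halve the resolution, double the channels.
            h, w, channels = h // 2, w // 2, channels * 2
        stages.append((name, h, w, channels, blocks))
    return stages
```

For a 512×512 input this yields C1 at 128×128×96, C2 at 64×64×192, C3 at 32×32×384, and C4 at 16×16×768, matching the 1 / 4, 1 / 8, 1 / 16, and 1 / 32 scales stated above.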

[0112] S2.2 Input the four feature maps C1, C2, C3, and C4 of each frame into the Feature Pyramid Network (FPN);

[0113] First, a 1×1 convolution is performed on feature map C4 to reduce the number of channels from 768 to 256, and then a 3×3 convolution is performed to generate fused feature map P5.

[0114] Subsequently, the fused feature map P5 is upsampled by a factor of 2 using nearest-neighbor interpolation, while feature map C3 is convolved with a 1×1 convolution, reducing the number of channels from 384 to 128. The two feature maps are then summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P4' with a resolution of 1 / 16 of the original image and 256 channels;

[0115] Subsequently, the fused feature map P4' is upsampled by a factor of 2, while feature map C2 is convolved with a 1×1 convolution, reducing the number of channels from 192 to 64. The two are then summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P3' with a resolution of 1 / 8 of the original image and 256 channels, to enhance the multi-scale representation of the pupil region;

[0116] Finally, the fused feature map P3' is upsampled by a factor of 2, while feature map C1 is convolved with a 1×1 convolution, reducing the number of channels from 96 to 32. The two are then summed element-wise and passed through a 3×3 convolution to eliminate aliasing, yielding fused feature map P2' with a resolution of 1 / 4 of the original image and 256 channels, to preserve fine edge localization information;

[0117] Meanwhile, feature map C1 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P2' to generate fused feature map P2;

[0118] The feature map C2 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P3' to generate the fused feature map P3;

[0119] The feature map C3 is downsampled by a 3×3 convolution with a stride of 2 and then added to the fused feature map P4' to generate the fused feature map P4;

[0120] Finally, a set of enhanced fusion feature maps P2, P3, P4, and P5, which incorporate multi-scale information, are obtained for each frame of the image.
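The core of each top-down fusion step in S2.2, namely 2× nearest-neighbor upsampling followed by element-wise summation, can be sketched on a single channel. This is a simplified illustration: the 1×1 and 3×3 convolutions are omitted, the channel dimension is not modeled (element-wise summation requires both operands to have the same shape), and all function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D maps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("maps must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(coarse, lateral):
    """One top-down step: upsample the coarser map, then add the lateral map."""
    return add_maps(upsample_nearest_2x(coarse), lateral)
```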

[0121] S3. While obtaining the high frame rate eye image sequence, the high frame rate eye image sequence is subjected to frame-by-frame image detection using an event camera. When a frame of the high frame rate eye image sequence is detected to contain an unobstructed or slightly obstructed pupil image, a local feature region of size 13×13 pixels is cropped from the fusion feature map P3 corresponding to that frame image, with the pupil center coordinates as the center. This region is normalized and saved as an adaptive template. For each subsequent frame image to be processed, the adaptive template is subjected to depth cross-correlation calculation with its fusion feature maps P2, P3, P4, and P5 to obtain a spatiotemporal fusion feature map for each frame image.

[0122] Here, the occlusion rate a_none of an unoccluded pupil image frame is 0, and the occlusion rate a_low of a low-occlusion pupil image frame satisfies 0 < a_low < 0.25;
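The template handling of step S3 can be sketched as cropping a window around the pupil center, normalizing it, and sliding it over a feature map. The sketch rests on several assumptions: the "depth cross-correlation" is read as a depth-wise (per-channel) cross-correlation, the unspecified normalization is taken to be L2 normalization, feature maps are plain lists of rows, and all function names are hypothetical.

```python
import math


def crop_template(fmap, center_row, center_col, size=13):
    """Crop a size x size window centred on (center_row, center_col)."""
    half = size // 2
    r0, c0 = center_row - half, center_col - half
    if r0 < 0 or c0 < 0 or r0 + size > len(fmap) or c0 + size > len(fmap[0]):
        raise ValueError("template window falls outside the feature map")
    return [row[c0:c0 + size] for row in fmap[r0:r0 + size]]


def l2_normalize(patch):
    """Scale the patch to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for row in patch for v in row))
    if norm == 0.0:
        return [list(row) for row in patch]
    return [[v / norm for v in row] for row in patch]


def cross_correlate(fmap, template):
    """Valid-mode cross-correlation of one channel with one template channel."""
    th, tw = len(template), len(template[0])
    out_h, out_w = len(fmap) - th + 1, len(fmap[0]) - tw + 1
    return [[sum(fmap[r + i][c + j] * template[i][j]
                 for i in range(th) for j in range(tw))
             for c in range(out_w)]
            for r in range(out_h)]


def depthwise_cross_correlate(channels, templates):
    """Apply cross_correlate independently to each (channel, template) pair."""
    return [cross_correlate(f, t) for f, t in zip(channels, templates)]
```

The response map peaks where the current frame's features best match the stored template, which is how features from an earlier clear frame can help locate a partly occluded pupil.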

[0123] S4. Using a segmentation network based on a lightweight U-Net architecture, perform real-time pixel-level pupil region segmentation on the images in each frame of the high frame rate eye image sequence to generate a binarized mask image.

[0124] By counting the number of pixels in the pupil region in the binarized mask image, the visible area Ps of the pupil in the current frame is obtained. At the same time, based on the most recently detected complete pupil area Pl, the occlusion rate a of the current frame is calculated.

[0125] a = (Pl - Ps) / Pl;

[0126] The occlusion rate 'a' ranges from 0 to 1, which can accurately quantify the severity of pupil occlusion in the current frame, providing a key basis for subsequent dynamic strategy switching.
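The occlusion rate calculation can be sketched directly from the formula a = (Pl - Ps) / Pl. In this illustration the binarized mask is assumed to be a list of rows containing 0 and 1, the function names are hypothetical, and the clamping to the range 0 to 1 is an added safeguard for the case where segmentation noise makes Ps exceed Pl.

```python
def visible_pupil_area(mask):
    """Count pupil pixels (value 1) in a binary mask given as rows of 0/1."""
    return sum(1 for row in mask for v in row if v == 1)


def occlusion_rate(visible_area, full_area):
    """Compute a = (Pl - Ps) / Pl, clamped to the range [0, 1]."""
    if full_area <= 0:
        raise ValueError("full pupil area Pl must be positive")
    a = (full_area - visible_area) / full_area
    return min(1.0, max(0.0, a))
```

For example, with a most recent complete pupil area Pl of 12 pixels and a visible area Ps of 3 pixels, the occlusion rate is (12 - 3) / 12 = 0.75.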

[0127] Specifically, step S4 includes the following steps:

[0128] S4.1. Input the 512×512 RGB 3-channel color image from each frame of the high frame rate eye image sequence into the segmentation network based on the lightweight U-Net architecture. First, process it through the Gradient Aggregation module to transform it into a 32-channel initial feature map D_init with a spatial dimension of 512×512 and a unified channel dimension;

[0129] S4.2. Input the 32-channel initial feature map D_init with a spatial dimension of 512×512 into the max pooling layer for 2× downsampling, and output a 64-channel feature map D0 with a spatial dimension of 256×256.

[0130] S4.3 Input the 64-channel feature map D0 with a spatial dimension of 256×256 into the Ghost_CBAM attention extraction module for channel and spatial attention enhancement processing, and output the 128-channel feature map D1 with a spatial dimension of 128×128.

[0131] S4.4 The 128-channel feature map D1 with a spatial dimension of 128×128 is downsampled by the max pooling layer to obtain a 256-channel feature map with a spatial dimension of 64×64. This feature map is then input into the Ghost_CBAM attention extraction module, which outputs a 256-channel feature map D2 with a spatial dimension of 64×64.

[0132] S4.5 The 256-channel feature map D2 with a spatial dimension of 64×64 is downsampled by the max pooling layer to obtain a 512-channel feature map with a spatial dimension of 32×32. This feature map is then input into the Graph Correlation Attention module for global relation modeling to obtain a 512-channel feature map D3 with a spatial dimension of 32×32.

[0133] S4.6 Upsample the 512-channel feature map D3 with a spatial dimension of 32×32 to obtain a 256-channel feature map with a spatial dimension of 64×64. Then, concatenate and fuse it with the feature map D2. Finally, process it through the Ghost_CBAM attention extraction module to obtain a 256-channel feature map D2′ with a spatial dimension of 64×64.

[0134] S4.7 Upsample the 256-channel feature map D2′ with a spatial dimension of 64×64 to obtain a 128-channel feature map with a spatial dimension of 128×128. Then, concatenate and fuse it with the feature map D1. Finally, process it through the Ghost_CBAM attention extraction module to obtain the 128-channel feature map D1′ with a spatial dimension of 128×128.

[0135] S4.8 Upsample the 128-channel feature map D1′ with a spatial dimension of 128×128 to obtain a 64-channel feature map with a spatial dimension of 256×256. Then, concatenate and fuse it with the feature map D0. Finally, process it through the Ghost_CBAM attention extraction module to obtain a 64-channel feature map D0′ with a spatial dimension of 256×256.

[0136] S4.9. Upsample the 64-channel feature map D0′ with a spatial dimension of 256×256 to obtain a 32-channel feature map with a spatial dimension of 512×512, and concatenate and fuse it with the initial feature map D_init output by S4.1. Then process it through the Ghost_CBAM attention extraction module, and simultaneously output a 32-channel feature map D_init′ with a spatial dimension of 512×512 and a 1-channel binary mask map with a spatial dimension of 512×512;

[0137] S4.10. By counting the number of pixels in the pupil area in the binary mask map, obtain the visible area Ps of the pupil in the current frame. At the same time, based on the complete pupil area Pl detected last time, calculate the occlusion rate a of the current frame;

[0138] a = (Pl - Ps) / Pl.

[0139] Steps S4.1 - S4.5 are the encoder processing flow of the segmentation network based on the lightweight U-Net architecture, and S4.6 - S4.10 are the decoder processing flow. The decoder part uses the Upsample+Conv structure for bottom-up recovery. As shown in Figure 3, its U-shaped structure performs cross-level concatenation and fusion of the four multi-scale feature maps of the encoder with the corresponding feature maps of the decoder through 4 skip connections, forming a complete, symmetric encoder-decoder architecture.
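The symmetry between encoder and decoder can be checked by listing the feature map shapes given in S4.1 to S4.9 and pairing them along the skip connections. The sketch below is a bookkeeping aid only and does not implement the Gradient Aggregation, Ghost_CBAM, or Graph Correlation Attention modules. The name D_init denotes the initial feature map output by S4.1, and the constant and function names are hypothetical.

```python
# (name, channels, height, width) for the encoder outputs of S4.1-S4.5.
ENCODER = [
    ("D_init", 32, 512, 512),
    ("D0", 64, 256, 256),
    ("D1", 128, 128, 128),
    ("D2", 256, 64, 64),
    ("D3", 512, 32, 32),
]

# Decoder outputs of S4.6-S4.9, listed from deepest to shallowest.
DECODER = [
    ("D2'", 256, 64, 64),
    ("D1'", 128, 128, 128),
    ("D0'", 64, 256, 256),
    ("D_init'", 32, 512, 512),
]


def skip_pairs():
    """Pair each decoder output with the encoder map it is concatenated with."""
    # The bottleneck D3 has no skip partner; the rest are matched in reverse order.
    encoder_skips = list(reversed(ENCODER[:-1]))
    return list(zip(DECODER, encoder_skips))


def check_symmetry():
    """Return True if every skip pair agrees in channels and spatial size."""
    return all(dec[1:] == enc[1:] for dec, enc in skip_pairs())
```

The four pairs (D2′ with D2, D1′ with D1, D0′ with D0, and D_init′ with D_init) correspond to the 4 skip connections described above.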

[0140] S5. Based on the occlusion rate a calculated in step S4, compare it with the preset first threshold THR1 and second threshold THR2, and dynamically select the following processing strategies accordingly:

[0141] Based on the occlusion rate a calculated in step S4, compare it with the preset first threshold THR1 and second threshold THR2, and dynamically select the following processing strategies accordingly, and then complete the anti-occlusion eye movement tracking:

[0142] 1) When a < THR1, it is judged as low occlusion. Directly use the fused feature map P3 of the current image frame, and input the fused feature map P3 into the rotated object detector improved based on the R3Det architecture. First, lay rotated anchor boxes with densities of {3, 2, 1} on the fused feature map P3 respectively. Each position presets multi-angle anchor boxes with 3 scales, 3 aspect ratios, and 6 angle directions; in this embodiment, the preset three scales are 8×8, 16×16, 32×32 respectively, the 3 aspect ratios are 1:1, 1:2, 1:4 respectively, and the 6 angle directions are -75°, -45°, -15°, 0°, 30°, 60° respectively;

[0143] The fused feature map P3 with the preset multi-angle anchor boxes is input into the detection head of the rotated object detector and optimized stage by stage through 5 stacked refinement stages;

[0144] In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle.

[0145] In subsequent refining stages, the parameters of the rotated rectangle output from the previous refining stage are resampled into feature regions of a fixed size, and then aligned to the current fused feature map P3 through interpolation.

[0146] Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle.

[0147] Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to fit an ellipse to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the length of its major and minor axes, and the tilt angle, thereby obtaining a smooth eye movement trajectory image.

[0148] 2) When THR1≤a≤THR2, it is determined to be a medium-high occlusion. The adaptive template matching mechanism is enabled. The spatiotemporal fusion feature map of the current image frame is used. The spatiotemporal fusion feature map is input to the rotating target detector based on the R3Det architecture. First, rotating anchor boxes with a density of {3,2,1} are laid on the spatiotemporal fusion feature map. Each position is preset with 3 scales, 3 aspect ratios and 6 angular directions. In this embodiment, the three preset scales are 8×8, 16×16 and 32×32, the three aspect ratios are 1:1, 1:2 and 1:4, and the 6 angular directions are -75°, -45°, -15°, 0°, 30° and 60°.

[0149] The spatiotemporal fusion feature map of the preset multi-angle anchor frame is input into the detection head of the rotating target detector and optimized step by step through 5 stacked refining stages;

[0150] In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle.

[0151] In subsequent refining stages, the parameters of the rotated rectangles output from the previous refining stage are resampled into feature regions of a fixed size, and then aligned to the current spatiotemporal fusion feature map through interpolation.

[0152] Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle.

[0153] Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to fit an ellipse to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the length of its major and minor axes, and the tilt angle, thereby obtaining a smooth eye movement trajectory image.

[0154] 3) When a>THR2, it is determined to be extremely occluded, and the information of the current image frame is unreliable. Direct detection of this image frame is abandoned, and the nearest valid image frame before and after the timestamp of this image frame is input into the rotating target detector based on the improved R3Det architecture. First, rotating anchor frames with a density of {3,2,1} are laid on the valid image frames respectively. Each position is preset with 3 scales, 3 aspect ratios and 6 angular directions. In this embodiment, the three preset scales are 8×8, 16×16 and 32×32, the three aspect ratios are 1:1, 1:2 and 1:4, and the 6 angular directions are -75°, -45°, -15°, 0°, 30° and 60°.

[0155] The effective image frames with preset multi-angle anchor frames are input into the detection head of the rotating target detector and optimized step by step through 5 stacked refining stages;

[0156] In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle.

[0157] Each subsequent refining stage resamples the rotated rectangle parameters output from the previous refining stage into a fixed-size feature region, and then aligns it to the current valid image frame through interpolation.

[0158] Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle.

[0159] Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to perform ellipse fitting to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the lengths of the major and minor axes, and the tilt angle. Time-weighted linear interpolation is then performed on these parameters to complete the trajectory of the current image frame and obtain a smooth eye movement trajectory image.
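The time-weighted linear interpolation used in strategy 3) can be sketched as follows. This is a minimal illustration assuming that the ellipse parameters of the nearest valid frames before and after the occluded frame are already available as (center x, center y, major axis, minor axis, tilt angle in degrees) tuples. The function name is hypothetical, and the tilt angle is interpolated like any other value, without handling angle wrap-around.

```python
def interpolate_ellipse(t, t_prev, params_prev, t_next, params_next):
    """Linearly interpolate ellipse parameters at time t between two valid frames.

    Each params tuple is (cx, cy, major_axis, minor_axis, angle_deg).
    The frame closer in time to t receives the larger weight.
    """
    if not t_prev < t_next:
        raise ValueError("t_prev must be earlier than t_next")
    if not t_prev <= t <= t_next:
        raise ValueError("t must lie between the two valid frames")
    w_next = (t - t_prev) / (t_next - t_prev)
    w_prev = 1.0 - w_next
    return tuple(w_prev * p + w_next * n for p, n in zip(params_prev, params_next))
```

For an occluded frame one quarter of the way between the two valid frames, the earlier frame contributes with weight 0.75 and the later frame with weight 0.25.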

[0160] In this embodiment, the first threshold THR1 is 25% and the second threshold THR2 is 85%.
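The threshold comparison of step S5 can be sketched as a three-way selection. The thresholds 0.25 and 0.85 are those of this embodiment; the enum, its member names, and the function name are hypothetical and serve only to label the three strategies.

```python
from enum import Enum


class Strategy(Enum):
    DIRECT = "use fused feature map P3"
    TEMPLATE = "use spatiotemporal fused feature map"
    INTERPOLATE = "interpolate from nearest valid frames"


THR1 = 0.25  # first threshold in this embodiment
THR2 = 0.85  # second threshold in this embodiment


def select_strategy(a, thr1=THR1, thr2=THR2):
    """Map an occlusion rate a in [0, 1] to one of the three strategies."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("occlusion rate must lie in [0, 1]")
    if a < thr1:
        return Strategy.DIRECT        # 1) low occlusion
    if a <= thr2:
        return Strategy.TEMPLATE      # 2) medium-high occlusion
    return Strategy.INTERPOLATE       # 3) extremely high occlusion
```

Note that both boundary values, a = THR1 and a = THR2, fall into strategy 2), consistent with the condition THR1 ≤ a ≤ THR2.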

[0161] In Figure 4, the purple line representing Robust-Eye indicates the F1 score of the occlusion-resistant eye-tracking method of this invention on the EV-Eye dataset. It can be seen that its score is the highest among the compared eye-tracking methods. In Figure 5, (b) shows the tracking effect of the present invention. The yellow box is the ground truth of the eye movement trajectory, and the green box is the predicted value of the eye movement trajectory. It can be seen that the predicted eye movement trajectory output by the present invention is very close to the ground truth.

[0162] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the present invention.

Claims

1. An occlusion-resistant eye-tracking method, characterized in that, It includes the following steps: S1. Continuously collect multiple frames of the eye region of the subject using an event camera to obtain a low-frame-rate grayscale video sequence of the eye region and the corresponding event stream; input the low-frame-rate grayscale video sequence of the eye region and the corresponding event stream into a pre-trained event-driven frame interpolation network for processing to obtain a high-frame-rate eye image sequence; S2. Input the images in the high-frame-rate eye image sequence into a multi-scale feature extraction network frame by frame, extract multi-level spatial features of each frame of the image, and obtain four feature maps C1, C2, C3, and C4 of each frame of the image, corresponding to the downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image respectively; Use the Feature Pyramid Network (FPN) to process the four feature maps C1, C2, C3, and C4 of each frame of the image, and correspondingly obtain a set of fused feature maps P2, P3, P4, and P5 that enhance and fuse multi-scale information for each frame of the image; S3. While obtaining the high-frame-rate eye image sequence, perform image detection on the high-frame-rate eye image sequence frame by frame using an event camera. When a frame of the high-frame-rate eye image sequence with an unobstructed or low-obstructed pupil image frame is detected, crop a local feature region centered on the pupil center coordinates from the corresponding fused feature map P3 of this frame of the image, normalize this region, and save it as an adaptive template; for each subsequent frame of the image to be processed, perform depth cross-correlation calculation between the adaptive template and its fused feature maps P2, P3, P4, and P5 to obtain a spatio-temporal fused feature map for each frame of the image; S4. 
Use a segmentation network based on the lightweight U-Net architecture to perform real-time pixel-level pupil region segmentation on the images in each frame of the high-frame-rate eye image sequence, generating a binary mask map; By counting the number of pixels in the pupil region in the binary mask map, obtain the visible area Ps of the pupil in the current frame, and at the same time, calculate the occlusion rate a of the current frame according to the most recently detected complete pupil area Pl; a = (Pl - Ps) / Pl; S5. Based on the occlusion rate a calculated in step S4, compare it with the preset first threshold THR1 and second threshold THR2, and dynamically select the following processing strategies accordingly to complete anti-occlusion eye movement tracking: 1) When a < THR1, it is judged as low occlusion, directly use the fused feature map P3 of the current image frame, input it into a rotated object detector for pupil ellipse parameter estimation, generate the pupil ellipse parameters at this moment, and thus obtain a smooth eye movement trajectory image; 2) When THR1 ≤ a ≤ THR2, it is determined as medium-high occlusion, enable the adaptive template matching mechanism, use the spatio-temporal fused feature map of the current image frame, input it into a rotated object detector for pupil ellipse parameter estimation, generate the pupil ellipse parameters at this moment, and thus obtain a smooth eye movement trajectory image; 3) When a>THR2, it is determined to be extremely high occlusion. The information of the current image frame is unreliable. Direct detection of this image frame is abandoned. 
Instead, the nearest valid image frame before and after the timestamp of this image frame is input into the rotating target detector to estimate the pupil ellipse parameters, generate the pupil ellipse parameters at that moment, and perform time-weighted linear interpolation to complete the trajectory of the current image frame and obtain a smooth eye movement trajectory image.

2. The anti-occlusion eye-tracking method according to claim 1, characterized in that, Step S1 specifically includes the following steps: S1.1. Use an event camera to continuously capture multiple frames of the subject's eye region to obtain a low frame rate grayscale video sequence of the eye region at 25 FPS and the corresponding event stream; S1.2. Input the 25FPS low frame rate grayscale video sequence and the corresponding event stream into the pre-trained event-driven frame interpolation network to generate a 1000FPS high frame rate eye image sequence.

3. The anti-occlusion eye-tracking method according to claim 2, characterized in that, The event-driven frame interpolation network in step S1.2 includes the TimeLens model; The pre-training of the event-driven frame interpolation network includes the following steps: A. Take five consecutive frames of images and the event stream between them in a low frame rate grayscale video sequence of the eye region as a sample group. Input the first and last two frames and their corresponding event streams, as well as the event streams corresponding to the three middle frames, into the TimeLens model to predict the three middle frames. B. Optimize network parameters by minimizing the L1 loss and perceptual loss between the predicted image and the corresponding image; C. Return to step A and repeat the training until the optimized network parameters are less than the preset value, thus completing the pre-training of the event-driven frame interpolation network.

4. The anti-occlusion eye-tracking method according to claim 3, characterized in that, Step S2 specifically includes the following steps: S2.1. Input the high frame rate eye image sequence of 1000FPS frame by frame into the multi-scale feature extraction network with the Swin-Transformer-Tiny version as the backbone network; First, the multi-scale feature extraction network segments each frame of the image into non-overlapping 4x4 image blocks and embeds them into a 48-dimensional feature vector; The image patch is projected to 96 dimensions through the linear embedding layer of the multi-scale feature extraction network and then input into two Swin-Transformer blocks for processing, outputting a feature map C1 with a resolution of 1 / 4 of the original image and 96 channels; Subsequently, the feature map C1 is downsampled by 2 times and projected to 192 dimensions through the Patch Merging layer, and then processed by two Swin-Transformer blocks to output a feature map C2 with a resolution of 1 / 8 of the original image and 192 channels. The feature map C2 is then downsampled by 2 times and projected to 384 dimensions through the Patch Merging layer, and then processed by 6 Swin-Transformer blocks to output a feature map C3 with a resolution of 1 / 16 of the original image and 384 channels. Finally, the feature map C3 is downsampled by 2 times and projected to 768 dimensions through the Patch Merging layer. After being processed by two Swin-Transformer blocks, the output feature map C4 has a resolution of 1 / 32 of the original image and 768 channels. Finally, four feature maps C1, C2, C3, and C4 of each frame of image are obtained, which respectively correspond to the downsampling scales of 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the original image, and integrate information from local details to global semantics; S2.2. Input the four feature maps C1, C2, C3, and C4 of each frame of image into the Feature Pyramid Network (FPN); First, perform a 1×1 convolution on the feature map C4 to reduce the number of channels from 768 to 256, and then generate a fused feature map P5 through a 3×3 convolution; Subsequently, perform 2-fold nearest neighbor upsampling on the fused feature map P5. At the same time, perform a 1×1 convolution on the feature map C3 to reduce the number of channels from 384 to 128, and add the two element-wise and then perform a 3×3 convolution to eliminate aliasing effects, obtaining a fused feature map P4' with a resolution of 1 / 16 of the original image and 256 channels; Subsequently, perform 2-fold upsampling on the fused feature map P4'. At the same time, perform a 1×1 convolution on the feature map C2 to reduce the number of channels from 192 to 64, and add the two element-wise and then perform a 3×3 convolution to eliminate aliasing effects, obtaining a fused feature map P3' with a resolution of 1 / 8 of the original image and 256 channels to strengthen the multi-scale representation of the pupil area; Finally, perform 2-fold upsampling on the fused feature map P3'. At the same time, perform a 1×1 convolution on the feature map C1 to reduce the number of channels from 96 to 32, and add the two element-wise and then perform a 3×3 convolution to eliminate aliasing effects, obtaining a fused feature map P2' with a resolution of 1 / 4 of the original image and 256 channels to retain fine edge localization information; At the same time, downsample the feature map C1 through a 3×3 convolution with a stride of 2 and add it to the fused feature map P2' to generate a fused feature map P2; Downsample the feature map C2 through a 3×3 convolution with a stride of 2 and add it to the fused feature map P3' to generate a fused feature map P3; Downsample the feature map C3 through a 3×3 convolution with a stride of 2 and add it to the fused feature map P4' to generate a fused feature map P4; Finally, a set of fused feature maps P2, P3, P4, and P5 that enhance and integrate multi-scale information of each frame of image are correspondingly obtained.

5. The anti-occlusion eye movement tracking method according to claim 4, characterized in that: In step S3, the occlusion rate a_none of the unoccluded pupil image frame is 0, and the occlusion rate a_low of a low-occlusion pupil image frame satisfies 0 < a_low < 0.25; The size of the local feature region is 13×13 pixels.

6. The occlusion-resistant eye-tracking method according to claim 5, characterized in that, Step S5 is specifically as follows: Based on the occlusion rate a calculated in step S4, compare it with a preset first threshold THR1 and a second threshold THR2, and dynamically select the following processing strategies accordingly, thereby completing anti-occlusion eye movement tracking: 1) When a < THR1, it is judged as low occlusion, and directly use the fused feature map P3 of the current image frame. Input the fused feature map P3 into a rotated object detector improved based on the R3Det architecture. First, lay rotated anchor boxes with densities of {3, 2, 1} on the fused feature map P3, and preset 3 scales, 3 aspect ratios, and 6 angular directions of multi-angle anchor boxes at each position; Input the fused feature map P3 with preset multi-angle anchor boxes into the detection head of the rotated object detector and perform stage-by-stage optimization through 5 stacked refinement stages; In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle. In subsequent refining stages, the parameters of the rotated rectangle output from the previous refining stage are resampled into feature regions of a fixed size, and then aligned to the current fused feature map P3 through interpolation. 
Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle. Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to fit an ellipse to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the length of its major and minor axes, and the tilt angle, thereby obtaining a smooth eye movement trajectory image. 2) When THR1≤a≤THR2, it is determined to be medium to high occlusion. The adaptive template matching mechanism is enabled. The spatiotemporal fusion feature map of the current image frame is used. The spatiotemporal fusion feature map is input to the rotating target detector based on the R3Det architecture. First, rotating anchor boxes with a density of {3,2,1} are laid on the spatiotemporal fusion feature map. Each position is preset with 3 scales, 3 aspect ratios and 6 angular directions of multi-angle anchor boxes. The spatiotemporal fusion feature map of the preset multi-angle anchor frame is input into the detection head of the rotating target detector and optimized step by step through 5 stacked refining stages; In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. 
The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle. In subsequent refining stages, the parameters of the rotated rectangles output from the previous refining stage are resampled into feature regions of a fixed size, and then aligned to the current spatiotemporal fusion feature map through interpolation. Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle. Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to fit an ellipse to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the length of its major and minor axes, and the tilt angle, thereby obtaining a smooth eye movement trajectory image. 3) When a>THR2, it is determined to be extremely occluded. The information of the current image frame is unreliable. Direct detection of this image frame is abandoned. Instead, the nearest valid image frame before and after the timestamp of this image frame is input into the rotating target detector based on the improved R3Det architecture. First, rotating anchor frames with a density of {3,2,1} are laid on the valid image frames respectively. Each position is preset with 3 scales, 3 aspect ratios and 6 angular directions of multi-angle anchor frames. 
The effective image frames with preset multi-angle anchor frames are input into the detection head of the rotating target detector and optimized step by step through 5 stacked refining stages; In the first refining stage, features are extracted through a 5×5 convolutional layer. The classification branch predicts the class confidence of the anchor box through a fully connected layer. The regression branch predicts the center point offset, width and height scaling, and rotation angle offset to obtain the initial parameters of the rotated rectangle. Each subsequent refining stage resamples the rotated rectangle parameters output from the previous refining stage into a fixed-size feature region, and then aligns it to the current valid image frame through interpolation. Finally, the feature refinement module of the rotating target detector performs 3×3 convolution, batch normalization and ReLU activation in sequence to extract more refined local features, predict the residual correction amount of the rotating rectangle parameters, realize the stepwise regression from coarse to fine, and finally output the rotating rectangle parameters that accurately represent the position and shape of the pupil, including the center coordinates, width and height and tilt angle. Subsequently, based on the pixel information within the rotating rectangular frame parameters, the least squares method is used to perform ellipse fitting to obtain pupil ellipse parameters that match the physiological morphology of the pupil. These pupil ellipse parameters include the center coordinates of the pupil, the lengths of the major and minor axes, and the tilt angle. Time-weighted linear interpolation is then performed on these parameters to complete the trajectory of the current image frame and obtain a smooth eye movement trajectory image.
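
The threshold-based selection in step S5 and the time-weighted linear interpolation of strategy 3) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `PupilEllipse`, `select_strategy` and `interpolate_ellipse` are hypothetical, and treating the tilt angle as periodic over 180° (so that interpolation takes the shorter rotation) is an assumption that the claim does not state.

```python
from dataclasses import dataclass


@dataclass
class PupilEllipse:
    """Pupil ellipse parameters named in the claim (hypothetical container)."""
    cx: float         # center x coordinate
    cy: float         # center y coordinate
    major: float      # major axis length
    minor: float      # minor axis length
    angle_deg: float  # tilt angle in degrees


def select_strategy(a: float, thr1: float, thr2: float) -> str:
    """Map occlusion rate a to one of the three strategies of step S5."""
    if a < thr1:
        return "low"          # 1) use the fused feature map P3 directly
    if a <= thr2:
        return "medium_high"  # 2) use the spatiotemporal fusion feature map
    return "extreme"          # 3) interpolate from neighboring valid frames


def interpolate_ellipse(prev: PupilEllipse, t_prev: float,
                        nxt: PupilEllipse, t_next: float,
                        t: float) -> PupilEllipse:
    """Time-weighted linear interpolation between two valid frames."""
    if not t_prev < t < t_next:
        raise ValueError("t must lie strictly between t_prev and t_next")
    w = (t - t_prev) / (t_next - t_prev)  # weight of the later frame

    def lerp(p: float, q: float) -> float:
        return (1.0 - w) * p + w * q

    # Assumption: an ellipse orientation repeats every 180 degrees, so the
    # signed difference is wrapped into [-90, 90) before interpolating.
    delta = (nxt.angle_deg - prev.angle_deg + 90.0) % 180.0 - 90.0
    return PupilEllipse(
        cx=lerp(prev.cx, nxt.cx),
        cy=lerp(prev.cy, nxt.cy),
        major=lerp(prev.major, nxt.major),
        minor=lerp(prev.minor, nxt.minor),
        angle_deg=(prev.angle_deg + w * delta) % 180.0,
    )
```

For an occluded frame at timestamp t, a caller would first obtain the ellipses of the nearest valid frames before and after t from the detector, then pass them to `interpolate_ellipse` together with their timestamps.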

7. The occlusion-resistant eye-tracking method according to claim 6, characterized in that step S4 is specifically as follows:
S4.1 Input the 512×512 three-channel RGB color image of each frame of the high frame rate eye image sequence into the segmentation network based on the lightweight U-Net architecture; the image is first processed by the Gradient Aggregation module of the network and transformed into a 32-channel feature map D_init with a spatial dimension of 512×512, thereby unifying the channel dimension;
S4.2 Input the 32-channel feature map D_init with a spatial dimension of 512×512 into the max pooling layer for 2× downsampling, and output a 64-channel feature map D0 with a spatial dimension of 256×256;
S4.3 Input the 64-channel feature map D0 with a spatial dimension of 256×256 into the Ghost_CBAM attention extraction module for channel and spatial attention enhancement, and output a 128-channel feature map D1 with a spatial dimension of 128×128;
S4.4 Downsample the 128-channel feature map D1 with a spatial dimension of 128×128 through the max pooling layer to obtain a 256-channel feature map with a spatial dimension of 64×64, then input this feature map into the Ghost_CBAM attention extraction module, which outputs a 256-channel feature map D2 with a spatial dimension of 64×64;
S4.5 Downsample the 256-channel feature map D2 with a spatial dimension of 64×64 through the max pooling layer to obtain a 512-channel feature map with a spatial dimension of 32×32, then input this feature map into the Graph Correlation Attention module for global relation modeling to obtain a 512-channel feature map D3 with a spatial dimension of 32×32;
S4.6 Upsample the 512-channel feature map D3 with a spatial dimension of 32×32 to obtain a 256-channel feature map with a spatial dimension of 64×64, concatenate and fuse it with the feature map D2, and process the result through the Ghost_CBAM attention extraction module to obtain a 256-channel feature map D2′ with a spatial dimension of 64×64;
S4.7 Upsample the 256-channel feature map D2′ with a spatial dimension of 64×64 to obtain a 128-channel feature map with a spatial dimension of 128×128, concatenate and fuse it with the feature map D1, and process the result through the Ghost_CBAM attention extraction module to obtain a 128-channel feature map D1′ with a spatial dimension of 128×128;
S4.8 Upsample the 128-channel feature map D1′ with a spatial dimension of 128×128 to obtain a 64-channel feature map with a spatial dimension of 256×256, concatenate and fuse it with the feature map D0, and process the result through the Ghost_CBAM attention extraction module to obtain a 64-channel feature map D0′ with a spatial dimension of 256×256;
S4.9 Upsample the 64-channel feature map D0′ with a spatial dimension of 256×256 to obtain a 32-channel feature map with a spatial dimension of 512×512, concatenate and fuse it with the feature map D_init, and process the result through the Ghost_CBAM attention extraction module, which simultaneously outputs a 32-channel feature map D_init′ with a spatial dimension of 512×512 and a 1-channel binarized mask with a spatial dimension of 512×512;
S4.10 Count the number of pixels in the pupil region of the binarized mask to obtain the visible pupil area Ps of the current frame, and, using the most recently detected complete pupil area Pl, calculate the occlusion rate a of the current frame as a = (Pl - Ps) / Pl.
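
The occlusion rate computation a = (Pl - Ps) / Pl can be sketched as below. This is a minimal, dependency-free illustration under stated assumptions: the function name `occlusion_rate` is hypothetical, the mask is represented as nested lists of 0/1 values rather than a tensor, and clamping the result to [0, 1] is an added safeguard that the claim does not mention.

```python
from typing import Sequence


def occlusion_rate(mask: Sequence[Sequence[int]], last_full_area: float) -> float:
    """Return a = (Pl - Ps) / Pl for one binarized pupil mask.

    mask           -- binarized mask, non-zero entries mark pupil pixels
    last_full_area -- Pl, the most recently detected complete pupil area
    """
    if last_full_area <= 0:
        raise ValueError("last_full_area (Pl) must be positive")
    # Ps: visible pupil area, counted as the number of pupil pixels.
    visible_area = sum(1 for row in mask for value in row if value)
    a = (last_full_area - visible_area) / last_full_area
    # Assumption: clamp so that a mask slightly larger than Pl
    # (for example after pupil dilation) does not yield a negative rate.
    return min(max(a, 0.0), 1.0)


# Illustrative 512x512 mask with a 30x40 block of pupil pixels (Ps = 1200).
example_mask = [[0] * 512 for _ in range(512)]
for y in range(100, 130):
    for x in range(200, 240):
        example_mask[y][x] = 1
example_rate = occlusion_rate(example_mask, last_full_area=2000)
```

In practice Pl would be refreshed whenever a frame is judged to show the complete pupil, so that the ratio tracks slow changes in pupil size.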

8. The anti-occlusion eye-tracking method according to claim 1, characterized in that: In step S5, the first threshold THR1 is 25%, and the second threshold THR2 is 85%.
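
As a worked illustration of how these two thresholds partition the occlusion rate, assume hypothetical pixel areas Pl = 2000 and Ps = 1200 (illustrative values, not taken from the patent), with the occlusion rate taken as the fraction of the most recently detected complete pupil area that is no longer visible:

```latex
a = \frac{P_l - P_s}{P_l} = \frac{2000 - 1200}{2000} = 0.40,
\qquad
\mathrm{THR1} = 0.25 \;\le\; a = 0.40 \;\le\; \mathrm{THR2} = 0.85
```

Under these assumed values the frame would fall into the medium to high occlusion range, so the adaptive template matching strategy would apply.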

9. The anti-occlusion eye-tracking method according to claim 3, characterized in that: In step C, the preset value is 0.05.