Eye movement classification method and system based on image feature extraction and spatio-temporal sequence analysis
By extracting eye movement and image features using an improved I-VT filtering algorithm and the semantic segmentation model DeepLabv3, and then fusing them with an LSTM network, the problem of insufficient accuracy in eye movement classification in existing technologies is solved, achieving higher classification accuracy and reliability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGXI NORMAL UNIV
- Filing Date
- 2024-02-06
- Publication Date
- 2026-06-23
Figure CN117789283B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of computer vision, eye tracking, and psychological analysis, and in particular to an eye-tracking classification method and system based on image feature extraction and spatiotemporal sequence analysis.
Background Technology
[0002] Eye-tracking technology is a crucial tool for studying human visual processes, helping us understand visual attention, cognitive processes, and user experience. It allows for a better understanding of the visual cognitive processes behind eye-tracking data, improving human-computer interaction and user interface design. Furthermore, these technologies can be applied to other fields, such as eye-tracking driving behavior recognition and eye-fatigue detection, providing us with more insights and solutions. For example, the eye tracker is considered a revolutionary instrument for eye-tracking research. Its principle involves using a camera to acquire images of the eyes, then processing the images to obtain the pupil position (pixel coordinates). This position information is then used by a built-in algorithm to calculate the point where the user's gaze falls on the interface, i.e., the user's current gaze point on the screen.
[0003] In eye-tracking data analysis, spatiotemporal sequence analysis is a common eye-tracking classification method, based on the temporal and spatial relationships of the data. Eye-tracking data can be viewed as a sequence containing a time dimension, and eye movements exhibit spatial correlation. Using spatiotemporal sequence analysis, models can be built to capture the temporal and spatial features of eye-tracking data. Commonly used spatiotemporal sequence analysis methods include Markov Models, Hidden Markov Models (HMMs), and Recurrent Neural Networks (RNNs). Image feature extraction-based eye-tracking classification methods are also frequently used. These methods typically utilize computer vision techniques to extract specific visual features from eye-tracking images. Common image feature extraction methods include local binary patterns (LBP) and histograms of oriented gradients (HOG). By extracting these features, eye-tracking data can be transformed into a numerical form that computers can process, thereby enabling classifier training and classification prediction. Currently, most existing eye-tracking classification algorithms are based on shallow mathematical analysis, such as statistically analyzing information like fixation point location, fixation duration, and saccade paths within a region of interest, and then grouping the data to calculate the mean and variance. Existing machine learning algorithms typically perform simple serialization of normal samples, such as converting eye-tracking trajectories into specific letter sequences through simple region division.
[0004] However, while spatiotemporal sequence analysis has some mathematical theoretical support, it cannot obtain deeper information from eye-tracking sequences at a higher dimension; image feature extraction-based eye-tracking classification methods focus too much on the image itself, ignoring the spatiotemporal information of eye movement trajectories; existing machine learning algorithms lack specific eye-tracking information determination and discard information such as eye-tracking events. Therefore, existing technologies cannot comprehensively capture the feature information in eye-tracking data, resulting in insufficient accuracy and reliability for eye-tracking classification.
Summary of the Invention
[0005] Based on this, this application proposes an eye-tracking classification method and system based on image feature extraction and spatiotemporal sequence analysis, aiming to solve the problem that existing technologies cannot capture the feature information in eye-tracking data more comprehensively, resulting in insufficient accuracy and reliability of eye-tracking classification.
[0006] A first aspect of the embodiments provides an eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis, comprising:
[0007] Acquire image samples and eye-tracking sequences;
[0008] The eye movement sequence is input into the improved I-VT filtering algorithm, and gap-filling interpolation, noise reduction, velocity calculation, and fixation point merging operations are performed on the eye movement sequence according to the improved I-VT filtering algorithm. The features extracted by each operation are fused to obtain the shallow eye movement features.
[0009] Based on the semantic segmentation model DeepLabv3, image semantic features are obtained from the image samples, and the eye-tracking shallow features are fused with the image semantic features to obtain eye-tracking fusion features;
[0010] The eye-tracking fusion features are input into an LSTM eye-tracking sequence classification network to obtain eye-tracking classification results based on the output.
[0011] As an optional implementation of the first aspect, the step of inputting the eye-tracking sequence into the improved I-VT filtering algorithm, and performing gap-filling interpolation, noise reduction, velocity calculation, and fixation point merging operations on the eye-tracking sequence according to the improved I-VT filtering algorithm, and fusing the features extracted from each operation to obtain shallow eye-tracking features includes:
[0012] The gap-filling interpolation operation fills in special eye movement sequences in the eye movement sequence that are caused by non-sampling errors.
[0013] The gap time of the gap to be filled is determined, and the position data of the last eye movement sample before the gap time and the position data of the first eye movement sample after the gap time are selected. A scaling factor is multiplied by the position data of the first eye movement sample after the gap time, and the result is added to the position data of the last eye movement sample before the gap time to obtain the replacement value for the current special eye movement sample. The scaling factor is calculated using the following formula:
[0014] $S = \frac{t_1 - t_2}{t_4 - t_3}$,
[0015] Where $S$ represents the scaling factor, $t_1$ represents the timestamp of the eye-tracking sample to be replaced, $t_2$ represents the timestamp of the last eye-tracking sample before the gap, $t_3$ represents the timestamp at the start of the gap, and $t_4$ represents the timestamp at the end of the gap.
[0016] The noise reduction operation averages the eye movement positions in the eye movement sequence;
[0017] Taking any moment in the eye-tracking sequence as a time node, and selecting N samples before and after the time node, the average eye-tracking position of the 2N samples is calculated:
[0018] ,
[0019] Where the left-hand side is the denoised eye-tracking data, n is the current denoising position, k is the summation index running from -N to N, and x is the eye-tracking sequence that needs to be denoised.
[0020] The linear velocity of eye movement in the eye movement sequence is obtained through the velocity calculation operation:
[0021] $v = \frac{\Delta d}{\Delta t} = \frac{d_{t_2} - d_{t_1}}{t_2 - t_1}$,
[0022] Where $v$ indicates the velocity of eye movement, $\Delta t$ represents the window length between times t1 and t2, $\Delta d$ represents the difference in the straight-line distance of eye fixation between time t1 and time t2, and $d_{t_1}$ and $d_{t_2}$ represent the straight-line distances at times t1 and t2, respectively.
[0023] The eye-tracking sequence is categorized by merging fixation points.
[0024] Determine whether any of the adjacent fixation points meets the preset merging and classification conditions. The preset merging and classification conditions include that the time interval between the next fixation point and the current fixation point is less than a first preset time threshold, and the distance between the center of the fixation circle of the next fixation point and the current fixation point, after being converted into a viewing angle, is less than a second preset viewing angle threshold, and the continuous fixation time of the fixation point is greater than a third preset time threshold.
[0025] If the currently determined adjacent gaze points meet the preset merging and classification conditions, then the adjacent gaze points are merged and classified as saccades.
[0026] As an optional implementation of the first aspect, the step of obtaining image semantic features from the image samples based on the semantic segmentation model DeepLabv3 includes:
[0027] The semantic segmentation model DeepLabv3 includes a backbone network, a multi-scale fusion module, and a decoder. The backbone network extracts high-level features of the image, the multi-scale fusion module captures the contextual information of the image, and the decoder outputs the image features.
[0028] As an optional implementation of the first aspect, the step of capturing the contextual information of the image through the multi-scale fusion module includes:
[0029] Multiple parallel dilated convolution branches are used to extract features at different sampling rates, and then the extracted features are fused. The formula for dilated convolution is:
[0030] $y[i] = \sum_{k} x[i + r \cdot k] \, w[k]$,
[0031] Where x represents the input, y represents the output, w represents the convolution kernel, k represents the accumulated index value, i represents the position on the input and output, and r represents the dilation coefficient.
[0032] As an optional implementation of the first aspect, the step of outputting image features through the decoder includes:
[0033] DeepLabv3 adds an upsampling layer to the encoder output;
[0034] The feature maps of different layers of the encoder and the feature map of the decoder are fused by skip connections, and then input into the upsampling layer for upsampling operation.
[0035] As an optional implementation of the first aspect, the step of fusing the shallow eye movement features with the image semantic features to obtain eye movement fusion features includes:
[0036] Construct a one-dimensional vector by fusing the shallow eye movement features with the semantic features of the image, wherein the length of the one-dimensional vector is 4+N;
[0037] The 0th index indicates which column of the image the current eye is looking at;
[0038] The first index indicates which row of the image the current eye is looking at;
[0039] The second index indicates whether the current gaze position belongs to a gaze point. If the current gaze position belongs to a gaze point, the third index indicates the gaze duration; otherwise, the value of the third index is 0.
[0040] The 4th to N-1th indexes represent the image features corresponding to the current gaze position.
[0041] A second aspect of this application provides an eye-tracking classification system based on image feature extraction and spatiotemporal sequence analysis, comprising:
[0042] The acquisition module is used to acquire image samples and eye-tracking sequences;
[0043] The feature extraction module is used to input the eye movement sequence into the improved I-VT filtering algorithm, and to perform gap filling interpolation, noise reduction, velocity calculation and fixation point merging operations on the eye movement sequence according to the improved I-VT filtering algorithm, and to fuse the features extracted by each operation to obtain shallow eye movement features.
[0044] The feature fusion module is used to obtain image semantic features from the image samples based on the semantic segmentation model DeepLabv3, and fuse the eye-tracking shallow features with the image semantic features to obtain eye-tracking fusion features;
[0045] The classification module is used to input the eye-tracking fusion features into the LSTM eye-tracking sequence classification network to obtain the eye-tracking classification result based on the output result.
[0046] Compared with existing technologies, this application provides an eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis. First, image samples and eye-tracking sequences are acquired. Through gap-filling interpolation, noise reduction, velocity calculation, and gaze point merging operations using the I-VT filtering algorithm, shallow eye-tracking features are obtained, comprising features such as eye-tracking position, eye-tracking linear velocity, and gaze point category. This ensures that the spatiotemporal information of the eye-tracking trajectory is no longer ignored. Then, through the three components of the DeepLabv3 semantic segmentation model—the backbone network, multi-scale fusion module, and decoder—image semantic features are obtained, fusing high-level image features and contextual information. This allows for better capture of features at different scales, thereby improving the accuracy of segmentation details. Finally, the shallow eye-tracking features and image semantic features are fused, and the resulting fused features are input into an LSTM eye-tracking sequence classification network for classification. Therefore, the method proposed in this application can solve the problem that existing technologies cannot comprehensively capture the feature information in eye-tracking data, leading to insufficient accuracy and reliability in eye-tracking classification.
[0047] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by means of embodiments thereof.
Attached Figure Description
[0048] Figure 1 is a flowchart of the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application;
[0049] Figure 2 is a diagram of the overall network architecture in the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application;
[0050] Figure 3 is a network structure diagram of the improved encoder in the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application;
[0051] Figure 4 is a structural diagram of the LSTM network part in the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application;
[0052] Figure 5 is an image from a video viewed by the experimenter in the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the second embodiment of this application;
[0053] Figure 6 is a schematic diagram of the structure of the eye-tracking classification system based on image feature extraction and spatiotemporal sequence analysis proposed in the third embodiment of this application.
[0054] The following detailed description, in conjunction with the accompanying drawings, will further illustrate this application.
Detailed Implementation
[0055] To facilitate understanding of this application, a more complete description will be provided below with reference to the accompanying drawings, which illustrate several embodiments of the present application. However, the present application can be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of this application will be thorough and complete.
[0056] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the specification of this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0057] To illustrate the technical solution described in this application, specific embodiments are provided below.
[0058] Referring to Figure 1, a flowchart of the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application is shown. The method is described in detail below:
[0059] Step S01: Obtain image samples and eye-tracking sequences.
[0060] For example, eye movement sequence data of the test subject to be identified is recorded using an eye tracker.
[0061] Figure 2 shows the overall network architecture of the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis proposed in the first embodiment of this application. The obtained image samples and eye-tracking sequences are input into the semantic segmentation model DeepLabv3 and the improved I-VT filtering algorithm, respectively.
[0062] Step S02: Input the eye movement sequence into the improved I-VT filtering algorithm, and perform gap filling interpolation, noise reduction, velocity calculation and fixation point merging operations on the eye movement sequence according to the improved I-VT filtering algorithm, and fuse the features extracted by each operation to obtain the shallow eye movement features.
[0063] It should be noted that the shallow eye movement features are formed by the eye movement position, eye movement linear velocity (which can also be converted into angular velocity), and gaze point category obtained through the gap filling interpolation operation, noise reduction operation, velocity calculation and gaze point merging operation of the I-VT filtering algorithm.
[0064] Specifically, the purpose of gap-filling interpolation is to use linear interpolation to fill in data lost due to non-sampling errors such as the subject blinking, turning the head, or being occluded. Since an eye-tracking sequence contains multiple eye-tracking samples, there is a gap time between every two eye-tracking samples. For a special eye-tracking sequence to be replaced, the gap time of the gap to be filled is determined; the default value in the software is generally 75 ms.
[0065] The gap time of the gap to be filled is determined, and the position data of the last eye movement sample before the gap time and the position data of the first eye movement sample after the gap time are selected. A scaling factor is multiplied by the position data of the first eye movement sample after the gap time, and the result is added to the position data of the last eye movement sample before the gap time to obtain the replacement value for the current special eye movement sample. The scaling factor is calculated using the following formula:
[0066] $S = \frac{t_1 - t_2}{t_4 - t_3}$,
[0067] Where $S$ represents the scaling factor, $t_1$ represents the timestamp of the eye-tracking sample to be replaced, $t_2$ represents the timestamp of the last eye-tracking sample before the gap, $t_3$ represents the timestamp at the start of the gap, and $t_4$ represents the timestamp at the end of the gap.
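The gap-filling step can be sketched in a few lines of Python. This is an illustrative sketch rather than the patented implementation: the function names, the `(timestamp_ms, position)` sample layout, and the use of `None` for lost samples are assumptions. It also assumes the usual linear-interpolation reading, in which the scaling factor is applied to the displacement between the two samples bounding the gap, and it takes the gap start and end to be the timestamps of those two samples.

```python
def scaling_factor(t_replace, t_before, t_gap_start, t_gap_end):
    """S = (t1 - t2) / (t4 - t3), using the symbol order given in the text."""
    return (t_replace - t_before) / (t_gap_end - t_gap_start)


def fill_gaps(samples, max_gap_ms=75.0):
    """Fill runs of lost samples by linear interpolation.

    samples: list of (timestamp_ms, position), where position is an (x, y)
    tuple or None for a lost sample. Gaps longer than max_gap_ms, and gaps
    at either end of the recording, are left untouched.
    """
    filled = list(samples)
    i = 0
    while i < len(filled):
        if filled[i][1] is not None:
            i += 1
            continue
        # Find the end of this run of lost samples.
        j = i
        while j < len(filled) and filled[j][1] is None:
            j += 1
        if i > 0 and j < len(filled):
            t_before, p_before = filled[i - 1]
            t_after, p_after = filled[j]
            if t_after - t_before <= max_gap_ms:
                for k in range(i, j):
                    t = filled[k][0]
                    s = scaling_factor(t, t_before, t_before, t_after)
                    # Assumption: scale the displacement between the bounding samples.
                    pos = tuple(b + s * (a - b) for b, a in zip(p_before, p_after))
                    filled[k] = (t, pos)
        i = j
    return filled
```

With samples at 0 ms and 30 ms bounding two lost samples, the lost samples are placed one third and two thirds of the way along the segment; a gap wider than `max_gap_ms` is left as `None`.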
[0068] Specifically, if the acquired eye-tracking data is highly noisy, noise reduction will be very effective. Taking any moment within the eye-tracking sequence as a time node, and selecting N samples before and after that time node, the average eye-tracking position of the 2N samples is calculated:
[0069] ,
[0070] Where the left-hand side is the denoised eye-tracking data, n is the current denoising position, k is the summation index running from -N to N, and x is the eye-tracking sequence that needs to be denoised.
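A minimal sketch of this noise-reduction step, assuming a centered moving average applied to one coordinate at a time. The function name is hypothetical. Two further assumptions are made here that the text does not settle: the window includes the sample at the time node itself (so a full window holds 2N+1 values, matching an index k that runs from -N to N), and the window simply shrinks near the ends of the sequence.

```python
def moving_average(x, n_half):
    """Centered moving average over x with n_half samples on each side.

    x: list of numbers (one coordinate of the eye-tracking sequence).
    The window is truncated at the sequence boundaries.
    """
    out = []
    for n in range(len(x)):
        lo = max(0, n - n_half)
        hi = min(len(x), n + n_half + 1)
        window = x[lo:hi]
        out.append(sum(window) / len(window))
    return out
```

Applying `moving_average` separately to the horizontal and vertical coordinates spreads a single-sample spike over its neighbours, which is the intended smoothing effect.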
[0071] Specifically, the linear velocity of eye movement in the eye movement sequence is obtained through the velocity calculation operation:
[0072] $v = \frac{\Delta d}{\Delta t} = \frac{d_{t_2} - d_{t_1}}{t_2 - t_1}$,
[0073] Where $v$ indicates the velocity of eye movement, $\Delta t$ represents the window length between times t1 and t2, $\Delta d$ represents the difference in the straight-line distance of eye fixation between time t1 and time t2, and $d_{t_1}$ and $d_{t_2}$ represent the straight-line distances at times t1 and t2, respectively.
[0074] Optionally, when using angular velocity classification, it is necessary to combine the longitudinal distance between the eye tracker and the subject, convert it into a viewing angle, and then perform the calculation. The choice of window length is very important. Empirically, setting it to 20ms can handle reasonable noise levels without causing too much distortion to the data, and can provide consistent results regardless of the sampling frequency.
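The velocity calculation and the optional conversion to angular velocity might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, positions are taken to be 2-D coordinates in millimetres on the screen plane, the distance travelled is taken as the Euclidean displacement between the two samples at the ends of the window, and the visual angle uses the standard relation 2·atan(size / (2·distance)).

```python
import math


def linear_velocity(p1, p2, t1_ms, t2_ms):
    """Displacement between two gaze positions divided by elapsed time (units per second)."""
    dist = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return dist / ((t2_ms - t1_ms) / 1000.0)


def visual_angle_deg(size_mm, viewing_distance_mm):
    """Visual angle subtended by a length on the screen at a given viewing distance."""
    return math.degrees(2.0 * math.atan(size_mm / (2.0 * viewing_distance_mm)))


def angular_velocity(p1_mm, p2_mm, t1_ms, t2_ms, viewing_distance_mm):
    """Angular velocity in degrees per second over the window [t1, t2]."""
    dist = math.hypot(p2_mm[0] - p1_mm[0], p2_mm[1] - p1_mm[1])
    angle = visual_angle_deg(dist, viewing_distance_mm)
    return angle / ((t2_ms - t1_ms) / 1000.0)
```

For a 20 ms window, `t1_ms` and `t2_ms` would be chosen 20 ms apart; at a high sampling frequency that spans several samples, at a low one only two.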
[0075] Specifically, for adjacent fixation points, the decision on whether to merge or discard them is based on three parameters: maximum fixation interval, maximum fixation angle interval, and merging adjacent fixation points.
[0076] The eye-tracking sequence is categorized by merging fixation points.
[0077] Determine whether any of the adjacent fixation points meets the preset merging and classification conditions. The preset merging and classification conditions include that the time interval between the next fixation point and the current fixation point is less than a first preset time threshold, and the distance between the center of the fixation circle of the next fixation point and the current fixation point, after being converted into a viewing angle, is less than a second preset viewing angle threshold, and the continuous fixation time of the fixation point is greater than a third preset time threshold.
[0078] If the currently determined adjacent gaze points meet the preset merging and classification conditions, then the adjacent gaze points are merged and classified as saccades.
[0079] For example, the maximum fixation interval is defined as merging the next fixation point into the current fixation point if the time interval between the next fixation point and the current fixation point is less than 75ms.
[0080] The maximum fixation angle interval is calculated by taking the distance between the center of the fixation circle of the next fixation point and the current fixation point. If the distance is less than 0.5° after being converted into a visual angle, the next fixation point is merged into the current fixation point.
[0081] Merging adjacent fixations means that if the fixation time of the current fixation point is less than 60ms, the fixation point is discarded. If multiple consecutive fixations have a fixation time of less than 60ms, these fixations are defined as saccades.
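The three example thresholds above (75 ms, 0.5°, 60 ms) can be combined into a small merging routine. This is a sketch only, with several assumptions: fixations are given as `(start_ms, end_ms, x_deg, y_deg)` tuples whose centers are already expressed in degrees of visual angle, a merged fixation takes the duration-weighted mean of the two centers, and fixations that remain shorter than the minimum duration after merging are simply dropped from the returned list. The function name and data layout are hypothetical.

```python
import math


def merge_fixations(fixations, max_gap_ms=75.0, max_angle_deg=0.5,
                    min_duration_ms=60.0):
    """Merge adjacent fixations that are close in time and space, then drop short ones."""
    merged = []
    for start, end, x, y in fixations:
        if merged:
            p_start, p_end, p_x, p_y = merged[-1]
            close_in_time = (start - p_end) < max_gap_ms
            close_in_space = math.hypot(x - p_x, y - p_y) < max_angle_deg
            if close_in_time and close_in_space:
                d_prev, d_cur = p_end - p_start, end - start
                total = d_prev + d_cur
                if total > 0:
                    # Duration-weighted center of the merged fixation (assumption).
                    p_x = (p_x * d_prev + x * d_cur) / total
                    p_y = (p_y * d_prev + y * d_cur) / total
                merged[-1] = (p_start, end, p_x, p_y)
                continue
        merged.append((start, end, x, y))
    # Discard fixations that are still shorter than the minimum duration.
    return [f for f in merged if (f[1] - f[0]) >= min_duration_ms]
```

Merging before discarding matters: two 40 ms fragments separated by a brief dropout form one 80 ms or longer fixation after merging, whereas discarding first would lose both.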
[0082] Step S03: Obtain image semantic features from the image samples according to the semantic segmentation model DeepLabv3, and fuse the eye movement shallow features with the image semantic features to obtain eye movement fusion features.
[0083] It should be noted that the semantic segmentation model DeepLabv3 includes a backbone network, a multi-scale fusion module, and a decoder. The backbone network extracts high-level features of the image, the multi-scale fusion module captures the contextual information of the image, and the decoder outputs the image features. Specifically:
[0084] The backbone network is responsible for extracting high-level features from the input image; commonly used feature extraction networks include ResNet, Xception, and MobileNet. These networks have deep structures and large receptive fields, enabling them to capture detailed information and global context within the image. In this embodiment, a ResNet deep convolutional neural network is used as the DeepLabv3 backbone. After DeepLabv3 extracts the image features, semantic features with the same dimensions as the original image are obtained.
[0085] The multi-scale fusion module, a key component of DeepLabv3, combines features from different scales. It utilizes multiple parallel dilated convolution branches to extract features at varying sampling rates, then fuses these features. Through multi-scale feature fusion, DeepLabv3 can better handle targets of different sizes and improve the accuracy of semantic segmentation. This module is based on dilated convolution (also known as atrous convolution), which aims to increase the receptive field without introducing additional parameters. By introducing a dilation rate parameter into the convolutional layer, the sampling interval of the convolutional kernel on the input image can be adjusted, thereby capturing broader contextual information.
[0086] The formula for dilated convolution is:
[0087] $y[i] = \sum_{k} x[i + r \cdot k] \, w[k]$,
[0088] Where x represents the input, y represents the output, w represents the convolution kernel, k represents the accumulated index value, i represents the position on the input and output, and r represents the dilation coefficient.
[0089] The dilation coefficient r corresponds to the stride with which the input signal is sampled. This is equivalent to convolving the input x with an upsampled convolution kernel, which is generated by inserting r-1 zeros between two consecutive convolution kernel values in each spatial dimension. Dilated convolution allows control over the size of the receptive field through the dilation coefficient. DeepLabv3 also combines separable convolution and dilated convolution.
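A one-dimensional sketch can make the equivalence concrete. The function names are hypothetical, the output is computed only at positions where the dilated kernel fits entirely inside the input (no padding), and the one-dimensional case stands in for the two-dimensional convolutions used in the actual network.

```python
def dilated_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k], for every i where the kernel fits."""
    span = r * (len(w) - 1)
    return [
        sum(x[i + r * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def upsample_kernel(w, r):
    """Insert r-1 zeros between consecutive kernel values."""
    up = []
    for k, wk in enumerate(w):
        up.append(wk)
        if k < len(w) - 1:
            up.extend([0] * (r - 1))
    return up
```

Calling `dilated_conv1d(x, w, r)` gives the same result as `dilated_conv1d(x, upsample_kernel(w, r), 1)`: the receptive field grows from `len(w)` to `r * (len(w) - 1) + 1` input positions while the number of non-zero weights stays the same.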
[0090] The decoder in DeepLabv3 performs specific upsampling operations on the encoder output to restore resolution and add detail. It also utilizes skip connections to fuse feature maps from different layers of the encoder and the decoder. This design allows the model to better capture features at different scales, thereby improving the accuracy of segmentation details. This method improves the output branch of the decoder, enabling the network to output image features instead of classification information.
[0091] Figure 3 shows the network structure of the improved encoder.
[0092] Step S04: Input the eye movement fusion features into the LSTM eye movement sequence classification network to obtain the eye movement classification result based on the output result.
[0093] For example, Figure 4 shows a partial structure diagram of the LSTM network.
[0094] Specifically, a one-dimensional vector of length 4+N is constructed, where the 0th index indicates which column of the image the current eye gaze position is in; the 1st index indicates which row of the image the current eye gaze position is in; the 2nd index indicates whether the current gaze position is a fixation point. If the current gaze position is a fixation point, the 3rd index indicates the fixation point duration; otherwise, the value of the 3rd index is 0; and the 4th to N-1th indices represent the image features corresponding to the current gaze position.
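The layout of this vector can be sketched as follows. The function names and the `feature_map[row][col]` lookup are assumptions made for illustration, and the sketch assumes that the N image features of the pixel under the gaze occupy all slots after the first four, which is what a total length of 4+N implies.

```python
def fusion_vector(col, row, is_fixation, fixation_ms, image_features):
    """Build one fused vector: [col, row, fixation flag, duration, N image features]."""
    duration = float(fixation_ms) if is_fixation else 0.0
    head = [float(col), float(row), 1.0 if is_fixation else 0.0, duration]
    return head + [float(v) for v in image_features]


def build_sequence(gaze_samples, feature_map):
    """Turn gaze samples into a sequence of fused vectors.

    gaze_samples: list of (col, row, is_fixation, fixation_ms).
    feature_map: feature_map[row][col] is the list of N image features at that pixel.
    """
    return [
        fusion_vector(col, row, fix, dur, feature_map[row][col])
        for (col, row, fix, dur) in gaze_samples
    ]
```

Each gaze sample thus becomes one fixed-length vector, and the whole recording becomes an ordered list of such vectors.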
[0095] Finally, this one-dimensional vector is treated as a single word vector and input into the LSTM network.
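To illustrate how a sequence of such vectors passes through an LSTM, here is a dependency-free sketch of a single LSTM layer followed by a sigmoid output. It is not the network of Figure 4: the gate layout is the standard LSTM formulation, the two-class sigmoid head is an assumption, the weights would come from training, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lstm_step(x, h, c, W, b):
    """One LSTM time step.

    W[g] is a list of rows, each of length len(x) + len(h); b[g] is a list of
    biases, for the gates g in "i" (input), "f" (forget), "o" (output) and
    "g" (candidate cell state).
    """
    z = list(x) + list(h)

    def affine(name):
        return [sum(wj * zj for wj, zj in zip(row, z)) + bias
                for row, bias in zip(W[name], b[name])]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def classify_sequence(seq, W, b, w_out, b_out, hidden):
    """Run the sequence through the LSTM and map the last hidden state to a score in (0, 1)."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
    logit = sum(wk * hk for wk, hk in zip(w_out, h)) + b_out
    return sigmoid(logit)
```

In practice a deep-learning framework would replace this hand-written loop; the sketch only shows that each fused vector plays the role of one time step, in the same way a word vector does in text classification.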
[0096] In another embodiment, eye-tracking data were collected using a computer program in two experimental scenarios: a dataset for watching geometric videos and a dataset for watching human movement. The method described above was then used to classify and predict autism spectrum disorder (ASD) and non-autistic, typically developing (TD) populations. Figure 5 shows an image from a video watched by the experimenter; the left side shows geometric shapes, and the right side shows human movement.
[0097] The classification results based on the algorithm in this application show:
[0098] On the dataset from the social geometric preference experiment, the average classification accuracy was 64.02%, with a classification accuracy of 64.95% for the TD population and 62.42% for the ASD population.
[0099] On the dataset from the object motion type preference experiment of Study 2, the average classification accuracy was 68.72%, with a classification accuracy of 71.71% for the TD population and 62.28% for the ASD population.
[0100] In summary, the eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis provided in this application first acquires image samples and eye-tracking sequences. Through gap-filling interpolation, noise reduction, velocity calculation, and gaze point merging operations of the I-VT filtering algorithm, shallow eye-tracking features, including eye-tracking position, eye-tracking linear velocity, and gaze point category, are obtained, ensuring that the spatiotemporal information of the eye-tracking trajectory is no longer ignored. Then, through the three components of the DeepLabv3 semantic segmentation model—the backbone network, multi-scale fusion module, and decoder—image semantic features, which are fused with high-level image features and contextual information, are obtained, enabling better capture of features at different scales and improving the accuracy of segmentation details. Finally, the shallow eye-tracking features and image semantic features are fused, and the resulting fused features are input into an LSTM eye-tracking sequence classification network for classification.
[0101] Referring to Figure 6, a schematic diagram of the structure of the eye-tracking classification system based on image feature extraction and spatiotemporal sequence analysis proposed in the third embodiment of this application is shown. The system includes:
[0102] Acquisition module 10 is used to acquire image samples and eye-tracking sequences;
[0103] Feature extraction module 20 is used to input the eye movement sequence into the improved I-VT filtering algorithm, so as to perform gap filling interpolation, noise reduction, velocity calculation and fixation point merging operations on the eye movement sequence according to the improved I-VT filtering algorithm, and fuse the features extracted by each operation to obtain shallow eye movement features;
[0104] The feature fusion module 30 is used to obtain image semantic features from the image samples according to the semantic segmentation model DeepLabv3, and fuse the eye-tracking shallow features with the image semantic features to obtain eye-tracking fusion features;
[0105] The classification module 40 is used to input the eye-tracking fusion features into the LSTM eye-tracking sequence classification network to obtain the eye-tracking classification result based on the output result.
[0106] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0107] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. An eye-tracking classification method based on image feature extraction and spatiotemporal sequence analysis, characterized in that the method includes: acquiring image samples and eye-tracking sequences; inputting the eye-tracking sequence into an improved I-VT filtering algorithm to perform gap-filling interpolation, noise reduction, velocity calculation, and fixation point merging operations on the eye-tracking sequence according to the improved I-VT filtering algorithm, and fusing the features extracted from each operation to obtain shallow eye-tracking features, specifically including: the gap-filling interpolation operation fills in the special eye-tracking samples caused by non-sampling errors in the eye-tracking sequence; the gap time to be filled is determined, the position data of the last eye-tracking sample before the gap time and the position data of the first eye-tracking sample after the gap time are selected, the scaling factor is multiplied by the position data of the first eye-tracking sample after the gap time, and the result is added to the position data of the last eye-tracking sample before the gap time to obtain the replacement value of the current special eye-tracking sample. The scaling factor is calculated using the following formula:
$$S = \frac{t_1 - t_2}{t_4 - t_3}$$
where S represents the scaling factor, t1 represents the timestamp of the eye-tracking sample to be replaced, t2 represents the timestamp of the last eye-tracking sample before the gap, t3 represents the timestamp at the start of the gap, and t4 represents the timestamp at the end of the gap.
The eye-tracking positions in the eye-tracking sequence are averaged through the noise reduction operation; taking any moment in the eye-tracking sequence as a time node, N samples before and N samples after the time node are selected, and the average eye-tracking position over the resulting window is calculated:
$$y(n) = \frac{1}{2N+1}\sum_{k=-N}^{N} x(n-k)$$
where y(n) represents the denoised eye-tracking data, n represents the current denoising position, k represents the accumulated index value between -N and N, and x represents the eye-tracking sequence to be denoised. The linear velocity of eye movement in the eye-tracking sequence is obtained through the velocity calculation operation:
$$v = \frac{\Delta d}{\Delta t} = \frac{\lvert d_{t_1} - d_{t_2} \rvert}{\lvert t_1 - t_2 \rvert}$$
where v indicates the velocity of eye movement, Δt represents the window length between times t1 and t2, Δd represents the difference in the straight-line distance of eye fixation between time t1 and time t2, and d_{t1} and d_{t2} represent the straight-line distances at time t1 and time t2, respectively. The eye-tracking sequence is categorized through the fixation point merging operation: it is determined whether any adjacent fixation points meet preset merging criteria, which include: the time interval between the next fixation point and the current fixation point is less than a first preset time threshold; the distance between the fixation center of the next fixation point and that of the current fixation point, converted to a visual angle, is less than a second preset visual angle threshold; and the fixation duration at the fixation point is greater than a third preset time threshold. If the adjacent fixation points currently under determination meet the preset merging criteria, the adjacent fixation points are merged and categorized as a saccade.
Based on the semantic segmentation model DeepLabv3, image semantic features are obtained from the image samples, and the shallow eye-tracking features are fused with the image semantic features to obtain eye-tracking fusion features; the eye-tracking fusion features are input into an LSTM eye-tracking sequence classification network to obtain eye-tracking classification results based on the output.
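The velocity calculation and fixation point merging recited in claim 1 can be illustrated with the following Python sketch, offered as one possible reading rather than the claimed implementation. The `Fixation` record, the function names, the `viewing_distance` parameter, and the duration-weighted merged centre are all hypothetical. The sketch assumes that velocity is the straight-line distance between two gaze positions divided by the elapsed time, that on-screen distance and viewing distance share one length unit, and that the duration criterion applies to the next fixation point. How a merged record is subsequently labelled is left to the caller.

```python
import math
from dataclasses import dataclass


@dataclass
class Fixation:
    start: float  # start time, e.g. in ms
    end: float    # end time, same unit as start
    x: float      # fixation centre on the screen (same length unit as viewing distance)
    y: float


def linear_velocity(p1, p2, t1, t2):
    """Straight-line distance between two gaze positions divided by the elapsed time."""
    return math.dist(p1, p2) / abs(t2 - t1)


def visual_angle_deg(distance, viewing_distance):
    """Angle (degrees) subtended at the eye by an on-screen distance."""
    return math.degrees(2 * math.atan(distance / (2 * viewing_distance)))


def merge_fixations(fixations, max_gap, max_angle, min_duration, viewing_distance):
    """Merge each fixation into its predecessor when all three criteria hold."""
    merged = []
    for fix in fixations:
        if merged:
            cur = merged[-1]
            gap = fix.start - cur.end
            angle = visual_angle_deg(
                math.dist((cur.x, cur.y), (fix.x, fix.y)), viewing_distance
            )
            duration = fix.end - fix.start
            if gap < max_gap and angle < max_angle and duration > min_duration:
                # Hypothetical choice: duration-weighted centre for the merged record.
                d_cur = cur.end - cur.start
                total = d_cur + duration
                merged[-1] = Fixation(
                    cur.start,
                    fix.end,
                    (cur.x * d_cur + fix.x * duration) / total,
                    (cur.y * d_cur + fix.y * duration) / total,
                )
                continue
        merged.append(fix)
    return merged
```

The three thresholds correspond to the first preset time threshold, the second preset visual angle threshold, and the third preset time threshold of the claim. Their values are not given in this section, so any numbers passed in are the caller's own choice.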
2. The eye-tracking classification method according to claim 1, characterized in that, The step of obtaining image semantic features from the image samples based on the semantic segmentation model DeepLabv3 includes: The semantic segmentation model DeepLabv3 includes a backbone network, a multi-scale fusion module, and a decoder. The backbone network extracts high-level features of the image, the multi-scale fusion module captures the contextual information of the image, and the decoder outputs the image features.
3. The eye-tracking classification method according to claim 2, characterized in that the step of capturing the contextual information of the image through the multi-scale fusion module includes: multiple parallel dilated convolution branches are used to extract features at different sampling rates, and the extracted features are then fused. The formula for dilated convolution is:
$$y[i] = \sum_{k} x[i + r \cdot k]\, w[k]$$
where x represents the input, y represents the output, w represents the convolution kernel, k represents the accumulated index value, i represents the position on the input and output, and r represents the dilation coefficient.
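The indexing used by a dilated convolution, in which output position i sums the inputs x[i + r·k] weighted by w[k], can be illustrated with a short one-dimensional Python sketch. The function name is hypothetical, and the sketch evaluates only positions where every kernel tap falls inside the input (no padding). DeepLabv3 itself applies two-dimensional convolutions over feature maps.

```python
def dilated_conv1d(x, w, r):
    """Compute y[i] = sum_k x[i + r*k] * w[k] wherever all taps lie inside x."""
    span = r * (len(w) - 1)  # distance between the first and last kernel tap
    return [
        sum(x[i + r * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]
# Two parallel branches with different dilation coefficients.
branch_r1 = dilated_conv1d(signal, kernel, r=1)  # taps 0, 1, 2 positions apart
branch_r2 = dilated_conv1d(signal, kernel, r=2)  # taps 0, 2, 4 positions apart
```

With the same three-tap kernel, the branch with r=2 spans five input positions instead of three, which is how a larger dilation coefficient widens the receptive field without adding kernel weights.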
4. The eye-tracking classification method according to claim 2, characterized in that, The step of outputting image features through the decoder includes: DeepLabv3 adds an upsampling layer to the encoder output; The feature maps of different layers of the encoder and the feature map of the decoder are fused by skip connections, and then input into the upsampling layer for upsampling operation.
5. The eye-tracking classification method according to claim 1, characterized in that the step of fusing the shallow eye-tracking features with the image semantic features to obtain eye-tracking fusion features includes: constructing a one-dimensional vector by fusing the shallow eye-tracking features with the image semantic features, wherein the length of the one-dimensional vector is 4+N; the 0th index indicates which column of the image the eye is currently looking at; the 1st index indicates which row of the image the eye is currently looking at; the 2nd index indicates whether the current gaze position belongs to a fixation point; if the current gaze position belongs to a fixation point, the 3rd index indicates the fixation duration, and otherwise the value of the 3rd index is 0; the 4th to N-1th indexes represent the image features corresponding to the current gaze position.
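The layout of the fused one-dimensional vector can be sketched in Python as follows. This is an illustrative reading only. The function name and argument names are hypothetical, and the sketch assumes that N denotes the number of image feature values at the gaze position, so that they occupy the positions after the four eye-tracking entries and the total length is 4+N.

```python
def fuse_features(col, row, is_fixation, duration, image_features):
    """Build one fused vector: [column, row, fixation flag, duration, image features...]."""
    return [
        float(col),                               # index 0: image column being looked at
        float(row),                               # index 1: image row being looked at
        1.0 if is_fixation else 0.0,              # index 2: whether this is a fixation point
        float(duration) if is_fixation else 0.0,  # index 3: fixation duration, else 0
        *(float(f) for f in image_features),      # index 4 onward: image features at the gaze position
    ]


vector = fuse_features(col=12, row=7, is_fixation=True, duration=180.0,
                       image_features=[0.1, 0.9])
```

A sequence of such vectors, one per gaze sample, is the kind of input that would then be passed to the LSTM eye-tracking sequence classification network.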
6. An eye-tracking classification system based on image feature extraction and spatiotemporal sequence analysis, characterized in that the eye-tracking classification system is applied to the eye-tracking classification method as described in claim 1, the eye-tracking classification system comprising: an acquisition module, used to acquire image samples and eye-tracking sequences; a feature extraction module, used to input the eye-tracking sequence into the improved I-VT filtering algorithm, to perform gap-filling interpolation, noise reduction, velocity calculation, and fixation point merging operations on the eye-tracking sequence according to the improved I-VT filtering algorithm, and to fuse the features extracted by each operation to obtain shallow eye-tracking features; a feature fusion module, used to obtain image semantic features from the image samples based on the semantic segmentation model DeepLabv3, and to fuse the shallow eye-tracking features with the image semantic features to obtain eye-tracking fusion features; and a classification module, used to input the eye-tracking fusion features into the LSTM eye-tracking sequence classification network to obtain the eye-tracking classification result based on the output result.