Visual information recognition method, system, medium and device based on eye movement attention
By combining forward and reverse saccade experiments with multi-head attention units and local attention units to construct a visual information recognition model, the problem of difficult feature extraction from eye movement data was solved, thus improving the accuracy of depression detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)
- Filing Date
- 2022-10-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies face difficulties in extracting features from eye-tracking data, resulting in low accuracy in depression detection and an inability to accurately reflect visual information.
Eye movement data was obtained by setting up forward and reverse saccade experiments. A visual information recognition model was constructed using multi-head attention units, local attention units, and summation units. Spatial attention features of the eye movement data were extracted, and the model was trained and evaluated to improve detection accuracy.
With limited eye-tracking data, this method maximizes the acquisition of key information from features, improving the accuracy of visual information recognition, particularly in the diagnosis of depression.
Figure CN115393946B_ABST
Description
Technical Field
[0001] This invention relates to the field of visual analysis technology, and in particular to a method, system, medium, and device for visual information recognition based on eye movement attention.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] As society develops, the large amount of information derived from analyzing visual information is used in many fields, which makes accurate recognition of visual information particularly important. Eye-tracking data is a crucial source of visual information. Human eye-tracking data can reveal a variety of information, and its analysis can assist in the diagnosis of various diseases, including depression. Depression is a common mental illness that severely impacts a patient's daily life. Currently, most methods of diagnosing depression are subjective and depend heavily on the physician's skill; because physicians' levels of expertise vary, the possibility of misdiagnosis increases significantly. Therefore, extracting features from eye-tracking data and identifying the visual information they represent can greatly reduce the probability of misdiagnosing depression.
[0004] However, the inventors found that the usual detection approach makes predictions directly from the eye-tracking data. Because eye-tracking data is small in volume but has many feature attributes, feature extraction is difficult, or the extracted features cannot accurately reflect the visual information the data contains, which reduces detection accuracy.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, the present invention aims to provide a visual information recognition method, system, medium, and device based on eye-tracking attention. This method can extract attention features from eye-tracking information and further extract spatial attention features. With limited eye-tracking data, it can maximize the acquisition of key information from the features and improve the accuracy of visual information recognition.
[0006] To achieve the above objectives, the present invention is implemented through the following technical solution:
[0007] The first aspect of this invention provides a visual information recognition method based on eye-tracking attention, comprising the following steps:
[0008] Determine the feature attributes of the eye-tracking data based on the required visual information;
[0009] Obtain eye movement data through saccade experiments, and filter, register, and fit the data;
[0010] Construct a visual information recognition model based on the attention mechanism, and use the fitted data as a dataset to train and evaluate the visual information recognition model;
[0011] Input the data to be detected into the evaluated visual information recognition model, and output the visual information recognition result.
[0012] Furthermore, the eye-tracking system is calibrated using the nine-point calibration method before the saccade experiment begins.
[0013] Further, the specific steps for determining the feature attributes of eye-tracking data are as follows: select some features of the eye-tracking data through random forest, sort them according to the random forest Gini coefficient, and remove attributes that have little or no impact on the classification of the desired target.
[0014] Furthermore, the saccade experiment is divided into forward saccades and reverse saccades. Forward saccades proceed as follows: first, the central fixation point is displayed on the screen; then the fixation point disappears and the target stimulus appears to the left or right of the central fixation point. The subject is asked to look immediately at the location where the target stimulus appears.
[0015] Reverse saccades are similar to forward saccades, but require the subject to look in the direction opposite to where the target stimulus appears.
[0016] Furthermore, the process of filtering, registering, and fitting eye-tracking data is as follows: delete data with more than 30% null values in each experimental record, register the remaining data, and after data registration, treat each independent experiment as a sample, and fit the features of the data records of each independent experiment into a multidimensional data set.
[0017] Furthermore, the specific process of using the fitted data as a dataset to train and evaluate the visual information recognition model is as follows: the eye-tracking dataset is divided into training, testing, and validation datasets, all of which are input into the model. The model is trained using the training set and validated using the validation set. The model is continuously adjusted and the best model is selected. A final model is then trained using the training and validation set data. Finally, the final model is evaluated using the test set.
[0018] Furthermore, the visual information recognition model includes a multi-head attention unit, a local attention (SA) unit, and an add unit. The multi-head attention unit is composed of multiple self-attention layers stacked together. The local attention unit inputs the output data obtained from the multi-head attention unit into a one-dimensional convolutional network to capture local features. The add unit is used to directly add the attention matrices output by the local attention units to obtain the final attention matrix, which is then passed through a fully connected layer to obtain the model result.
[0019] A second aspect of the present invention provides a visual information recognition system based on eye-tracking attention, comprising:
[0020] The feature attribute module is configured to determine the feature attributes of eye-tracking data based on the required visual information.
[0021] The data acquisition module is configured to obtain eye movement data through saccade experiments and to filter, register, and fit the eye movement data.
[0022] The model building module is configured to build a visual information recognition model based on the attention mechanism, and use the fitted data as a dataset to train and evaluate the visual information recognition model.
[0023] The information recognition module is configured to input the data to be detected into the evaluated visual information recognition model and output the visual information recognition result.
[0024] A third aspect of the present invention provides a medium having a program stored thereon, which, when executed by a processor, implements the steps of the visual information recognition method based on eye-tracking attention as described in the first aspect of the present invention.
[0025] A fourth aspect of the present invention provides an apparatus including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the visual information recognition method based on eye-tracking attention as described in the first aspect of the present invention.
[0026] The above one or more technical solutions have the following beneficial effects:
[0027] This technical solution sets up forward and reverse saccade experiments to record people's reactions and thereby acquire eye movement data, and proposes a novel method for visual information recognition based on eye movement attention. In this method, each independent experiment is processed as one data sample. After preprocessing, the eye movement data first enters a multi-head attention unit, which calculates the self-attention weights of the data to perform the first stage of feature selection. The data then enters a local attention unit, which further extracts local attention features and obtains key information from the eye movement data. Finally, the resulting attention information is summed by an addition unit, making the extracted features more comprehensive and accurate. With limited eye movement data, this method maximizes the acquisition of key information from the features and detects visual information more accurately. The method has performed well in the experimental stage and has good application prospects in fields such as medicine.
[0028] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Attached Figure Description
[0029] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
[0030] Figure 1 is a flowchart of visual information recognition in Embodiment 1 of the present invention;
[0031] Figure 2 is a schematic diagram of the saccade experiment in Embodiment 1 of the present invention;
[0032] Figure 3 is a flowchart of constructing a visual information recognition model in Embodiment 1 of the present invention.
Detailed Implementation
[0033] It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of this application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0034] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0035] Terminology Explanation:
[0036] Attention originates from research on human vision. In cognitive science, due to information processing bottlenecks, humans selectively focus on a portion of all information while ignoring other visible information. To make efficient use of limited visual information processing resources, humans need to select specific parts of the visual region and then concentrate on them. For example, when reading, people typically only pay attention to and process a small number of words. In computer science, the core focus of attention is to guide learning networks to focus on what they need most—a mechanism for focusing on localized information.
[0037] Embodiment 1:
[0038] Embodiment 1 of the present invention provides a visual information recognition method based on eye-tracking attention. This embodiment takes depression as an example and recognizes the visual information of patients with depression. As shown in Figure 1, the method includes the following steps:
[0039] The feature attributes of the eye-tracking data are determined based on the required visual information. Specifically, the eye-tracking data extracted by the eye tracker includes over 50 numerical features, such as pupil size and position, fixation position, and corneal reflection position. A subset of these features is selected using a random forest: the features are sorted according to the random forest Gini coefficient, and attributes that have little or no impact on the classification of the desired target are removed. For the visual information of patients with depression, this means removing attributes that have little or no impact on the classification of depression.
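The ranking idea in this step can be illustrated with a minimal sketch. It is not the patented procedure: instead of a full random forest, it scores each feature by the largest Gini impurity decrease of a single threshold split, which is the per-split quantity a random forest accumulates over its trees. The feature names, the toy values, and the `min_decrease` cutoff are assumptions made for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_decrease(values, labels):
    """Largest impurity decrease over all threshold splits of one feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

def select_features(columns, labels, min_decrease):
    """Rank features by Gini decrease and drop those below the cutoff."""
    scores = {name: best_gini_decrease(v, labels) for name, v in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, s in ranked if s >= min_decrease], scores

# Hypothetical toy data: two features, six samples, binary labels.
columns = {
    "pupil_size": [3.1, 3.0, 3.2, 4.5, 4.6, 4.4],
    "fixation_x": [0.5, 0.9, 0.1, 0.6, 0.2, 0.8],
}
labels = [0, 0, 0, 1, 1, 1]
kept, scores = select_features(columns, labels, min_decrease=0.2)
print(kept, scores)
```

In this toy case `pupil_size` separates the two classes perfectly and is kept, while `fixation_x` carries little class information and is dropped. A production pipeline would more likely read the importances from a trained random forest implementation.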
[0040] As a further technical solution, the eye tracking system is calibrated using the nine-point calibration method before the saccade experiment to obtain more accurate experimental results. With the nine-point calibration method, the deviation on the x and y axes is less than 0.5 degrees of visual angle.
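To make the 0.5 degree criterion concrete, the following sketch converts an on-screen gaze offset into degrees of visual angle. The viewing distance and the offsets in millimetres are assumptions; the source does not state the screen geometry, and the function names are hypothetical.

```python
import math

def visual_angle_deg(offset_mm, viewing_distance_mm):
    """Angle subtended at the eye by an on-screen offset from the line of sight."""
    return math.degrees(math.atan2(offset_mm, viewing_distance_mm))

def calibration_ok(dx_mm, dy_mm, viewing_distance_mm, limit_deg=0.5):
    """True if both the x and y deviations stay below the angular limit."""
    return (abs(visual_angle_deg(dx_mm, viewing_distance_mm)) < limit_deg
            and abs(visual_angle_deg(dy_mm, viewing_distance_mm)) < limit_deg)

# Assumed setup: subject seated 600 mm from the screen.
print(visual_angle_deg(5.0, 600.0))      # roughly 0.48 degrees
print(calibration_ok(5.0, 3.0, 600.0))   # within the limit
print(calibration_ok(6.0, 3.0, 600.0))   # x deviation exceeds the limit
```

At the assumed 600 mm distance, the limit corresponds to an on-screen error of a little over 5 mm per axis.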
[0041] First, an eye tracker is used to acquire data from the subjects in the forward and reverse saccade experiments. The saccade experiment is divided into forward saccades and reverse saccades. As shown in Figure 2, forward saccades proceed as follows: first, the central fixation point (a white "+") is displayed on the screen; then the fixation point disappears and the target stimulus (a green dot) appears to the left or right of the central fixation point. The subject is asked to look immediately at the location where the target stimulus appears.
[0042] Reverse saccades are similar to forward saccades, but require the subject to look in the opposite direction to the direction in which the target stimulus appears.
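The difference between the two trial types can be stated as a small rule. The sketch below is only an illustration of the protocol: the source does not say that trials are scored this way, and the task labels, the `center_x` convention, and the function names are hypothetical.

```python
def expected_direction(task, target_side):
    """Direction the subject should look for a given trial type."""
    if task == "forward":
        return target_side
    if task == "reverse":
        return "left" if target_side == "right" else "right"
    raise ValueError(f"unknown task: {task}")

def is_correct(task, target_side, gaze_x, center_x=0.0):
    """Compare the observed horizontal gaze side with the expected one."""
    observed = "left" if gaze_x < center_x else "right"
    return observed == expected_direction(task, target_side)

# Target appears on the left; gaze lands left of centre.
print(is_correct("forward", "left", gaze_x=-3.0))  # correct forward saccade
print(is_correct("reverse", "left", gaze_x=-3.0))  # error in a reverse saccade trial
```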
[0043] Eye movement data is obtained through the saccade experiments and is then filtered, registered, and fitted; the fitted data serves as the dataset for training and evaluation. Data with more than 30% null values in each experimental record is removed, and the remaining data is registered. After registration, each independent experiment is treated as a single sample, and the features of each independent experiment's data records are fitted into a multidimensional dataset.
[0044] Specifically, each saccade experiment yields 100 eye movement data records. Data with more than 30% null values in each record is removed. After data registration, 2018 forward saccade and reverse saccade records are obtained, respectively. Since each experiment is performed independently for each person, each independent experiment is treated as one sample. All features of the 100 data records from each independent experiment are fitted into a 162-dimensional data set using statistics such as the mean, maximum, minimum, variance, median, and quartiles (1/4, 3/4). Ultimately, the forward saccade and reverse saccade experiments yield eye movement data of size 2018 × 162.
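The filtering and fitting step can be sketched as follows, under stated assumptions: a record is a dictionary of feature values with `None` for nulls, records with more than 30% nulls are dropped, and each remaining feature is summarized by seven statistics. The source lists these statistics only as examples and does not say how they map onto the 162 dimensions, so the layout here, the use of population variance, and all names are hypothetical.

```python
import statistics

STATS = ("mean", "max", "min", "var", "median", "q1", "q3")

def keep_record(record, max_null_ratio=0.3):
    """Keep a record only if at most 30% of its values are null."""
    nulls = sum(v is None for v in record.values())
    return nulls / len(record) <= max_null_ratio

def summarize(values):
    """Summary statistics for one feature across an experiment's records."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "var": statistics.pvariance(values),  # population variance (assumption)
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }

def fit_experiment(records, feature_names):
    """Turn one experiment's records into a single flat sample vector."""
    kept = [r for r in records if keep_record(r)]
    sample = []
    for name in feature_names:
        values = [r[name] for r in kept if r[name] is not None]
        stats = summarize(values)
        sample.extend(stats[s] for s in STATS)
    return sample

# Hypothetical experiment with four records and two features.
records = [
    {"pupil": 1.0, "gaze_x": 10.0},
    {"pupil": 2.0, "gaze_x": 20.0},
    {"pupil": 3.0, "gaze_x": 30.0},
    {"pupil": None, "gaze_x": 99.0},  # 50% null, so this record is dropped
]
sample = fit_experiment(records, ["pupil", "gaze_x"])
print(len(sample), sample)
```

Each experiment collapses to one fixed-length vector regardless of how many records survive filtering, which is what allows every independent experiment to act as a single sample.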
[0045] A visual information recognition model is constructed based on the attention mechanism. The fitted data is used as a dataset to train and evaluate the visual information recognition model. The data to be detected is input into the evaluated visual information recognition model, and the visual information recognition result is output.
[0046] As a further technical solution, the specific process of using the fitted data as a dataset to train and evaluate the visual information recognition model is as follows: The eye-tracking dataset is divided into training, testing, and validation datasets in a 6:2:2 ratio. All datasets are input into the model. The model is trained using the training set and validated using the validation set. The model is continuously adjusted based on the results, and the best model is selected. A final model is then trained using the training and validation sets. Finally, the final model is evaluated using the test set.
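The split-select-retrain-evaluate procedure described above can be sketched as follows. The shuffling seed, the rounding of the 6:2:2 split, and the threshold "model" with its `fit` and `score` helpers are hypothetical stand-ins used only to make the workflow runnable; they are not part of the source.

```python
import random

def split_622(samples, seed=0):
    """Shuffle and split samples into train/validation/test at 6:2:2."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.6 * len(idx))
    n_val = int(0.2 * len(idx))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

def select_and_evaluate(configs, fit, score, train, val, test):
    """Pick the best config on validation, retrain on train+val, score on test."""
    best = max(configs, key=lambda c: score(fit(c, train), val))
    final_model = fit(best, train + val)
    return best, score(final_model, test)

# Hypothetical stand-in model: a fixed threshold on one scalar feature.
def fit(threshold, data):
    return threshold  # nothing is learned in this toy example

def score(model, data):
    return sum(int(x >= model) == y for x, y in data) / len(data)

samples = [(i, int(i >= 1000)) for i in range(2018)]
train, val, test = split_622(samples)
best, test_acc = select_and_evaluate([500, 1000, 1500], fit, score, train, val, test)
print(len(train), len(val), len(test), best, test_acc)
```

Keeping the test set out of model selection is the point of the design: the validation set absorbs the tuning decisions, so the final test score is not biased by them.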
[0047] The visual information recognition model includes a multi-head attention unit, a local attention unit, and a summing unit. The multi-head attention unit is composed of multiple self-attention layers stacked together. The local attention unit inputs the output data obtained from the multi-head attention unit into a one-dimensional convolutional network to capture local features. The summing unit is used to directly add the attention matrices output by the local attention units to obtain the final attention matrix, which is then passed through a fully connected layer to obtain the model result.
[0048] As a further technical solution, the multi-head attention unit, as the name suggests, consists of multiple stacked self-attention layers. Self-attention maps a query Q and a set of key-value pairs (K, V) to an output, which is a weighted sum of the values V. The weight assigned to each value is calculated by a correlation function that measures the relevance of Q to the corresponding key K.
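A minimal NumPy sketch of this mechanism follows. It assumes the widely used scaled dot-product form as the correlation function, which the source does not specify, and it uses random projection matrices in place of learned weights. Treating the input as a sequence of 6 tokens with 8 features is also an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Weighted sum of V, with weights from the scaled dot product of Q and K."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(X, heads, rng):
    """Run several attention heads on projections of X and concatenate them."""
    n, d = X.shape
    d_head = d // heads
    outputs = []
    for _ in range(heads):
        # Random projections stand in for learned weight matrices.
        Wq = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Wv = rng.standard_normal((d, d_head)) / np.sqrt(d)
        out, _ = self_attention(X @ Wq, X @ Wk, X @ Wv)
        outputs.append(out)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))   # 6 tokens, 8 features (hypothetical sizes)
MH = multi_head_attention(X, heads=2, rng=rng)
_, w = self_attention(X, X, X)
print(MH.shape, w.sum(axis=1))    # each row of attention weights sums to 1
```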
[0049] Local attention unit: this unit primarily extracts features with stronger local expressive power in space. The output of the multi-head attention unit is fed into a two-layer one-dimensional convolutional network to capture local features. The first convolutional layer reduces the output dimension proportionally; after ReLU activation, a second convolutional layer restores the original dimension. Then, for the convolved data, the maximum and the average over the channels are taken at each feature point, which compresses the channels and yields spatial information. The two resulting maps are stacked and passed through another one-dimensional convolutional layer to learn spatial attention weights, which are finally activated by the Sigmoid function. The attention weights are then multiplied by the original input features to obtain the required attention matrix. The unit is expressed as follows, where MH denotes the output of the multi-head attention unit and Conv1, Conv2, and Conv3 represent the first, second, and third one-dimensional convolutional layers, respectively:
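[0050] $\mathrm{cvs}(MH) = \mathrm{Conv2}(\mathrm{ReLU}(\mathrm{Conv1}(MH)))$
[0051] $\mathrm{catM} = \mathrm{Sigmoid}(\mathrm{Conv3}(\mathrm{concat}(\mathrm{mean}(\mathrm{cvs}), \max(\mathrm{cvs}))))$
[0052] $\mathrm{SA}(MH) = MH \odot \mathrm{catM}$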
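The three expressions above can be traced in a short NumPy sketch. Several details are assumptions, since the source does not give them: a channels-first layout `(channels, length)`, kernel size 1 for Conv1 and Conv2, kernel size 7 for Conv3, a reduction ratio of 2, zero "same" padding, no bias terms, and random weights in place of trained ones.

```python
import numpy as np

def conv1d(x, w):
    """Same-padded 1-D convolution. x: (C_in, L), w: (C_out, C_in, k), k odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero padding along the length axis
    length = x.shape[1]
    out = np.empty((c_out, length))
    for t in range(length):
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(MH, w1, w2, w3):
    """SA(MH) = MH * Sigmoid(Conv3(concat(mean(cvs), max(cvs))))."""
    cvs = conv1d(relu(conv1d(MH, w1)), w2)                   # reduce, then restore channels
    pooled = np.stack([cvs.mean(axis=0), cvs.max(axis=0)])   # (2, L): channel mean and max
    catM = sigmoid(conv1d(pooled, w3))                       # (1, L) spatial weights in (0, 1)
    return MH * catM                                         # broadcast over channels

rng = np.random.default_rng(0)
C, L, r = 8, 6, 2                       # hypothetical sizes and reduction ratio
MH = rng.standard_normal((C, L))
w1 = rng.standard_normal((C // r, C, 1))
w2 = rng.standard_normal((C, C // r, 1))
w3 = rng.standard_normal((1, 2, 7))
SA = local_attention(MH, w1, w2, w3)
print(SA.shape)
```

Because the Sigmoid keeps every weight between 0 and 1, the unit can only attenuate positions of `MH`, never amplify them; it acts as a soft mask over the feature points.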
[0053] Sum Unit: Used to sum the attention matrices output by each encoder unit. The attention matrices are directly summed to obtain the final attention matrix, which is then passed through a fully connected layer to obtain the model's result.
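The final stage can be sketched as below. The claims describe two parallel paths whose attention matrices are added, so two inputs are used here. Flattening before the fully connected layer, a two-class softmax output, and the random weights are assumptions, since the source only says the summed matrix is passed through a fully connected layer.

```python
import numpy as np

def sum_unit(attention_matrices):
    """Element-wise sum of attention matrices of identical shape."""
    return np.sum(attention_matrices, axis=0)

def classify(final_attention, W, b):
    """Flatten, apply one fully connected layer, and return class probabilities."""
    logits = final_attention.reshape(-1) @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
path_a = rng.standard_normal((4, 6))   # hypothetical output of one local attention unit
path_b = rng.standard_normal((4, 6))   # hypothetical output of the parallel unit
final = sum_unit([path_a, path_b])
W = rng.standard_normal((24, 2)) * 0.1  # untrained weights, for shape illustration only
b = np.zeros(2)
probs = classify(final, W, b)
print(final.shape, probs)
```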
[0054] Embodiment 2:
[0055] Embodiment 2 of the present invention provides a visual information recognition system based on eye-tracking attention, comprising:
[0056] The feature attribute module is configured to determine the feature attributes of eye-tracking data based on the required visual information.
[0057] The data acquisition module is configured to obtain eye movement data through saccade experiments and to filter, register, and fit the eye movement data.
[0058] The model building module is configured to build a visual information recognition model based on the attention mechanism, and use the fitted data as a dataset to train and evaluate the visual information recognition model.
[0059] The information recognition module is configured to input the data to be detected into the evaluated visual information recognition model and output the visual information recognition result.
[0060] Embodiment 3:
[0061] Embodiment 3 of the present invention provides a medium on which a program is stored. When the program is executed by a processor, it implements the steps in the visual information recognition method based on eye movement attention as described in Embodiment 1 of the present invention.
[0062] Embodiment 4:
[0063] Embodiment 4 of the present invention provides a device including a memory, a processor, and a program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps in the visual information recognition method based on eye movement attention as described in Embodiment 1 of the present invention.
[0064] The steps and methods involved in the apparatuses of Embodiments 2, 3, and 4 above correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.
[0065] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.
[0066] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A visual information recognition method based on eye-tracking attention, characterized in that it includes the following steps: Determine the feature attributes of the eye-tracking data based on the required visual information. Obtain eye movement data through saccade experiments, and then filter, register, and fit the data. The saccade experiment is divided into forward saccades and reverse saccades. Forward saccades proceed as follows: first, the central gaze point is displayed on the screen; then the gaze point disappears and the target stimulus appears to the left or right of the central gaze point; the subject is asked to look immediately at the location where the target stimulus appears. Reverse saccades are similar to forward saccades, but require the subject to look in the direction opposite to where the target stimulus appears. A visual information recognition model is constructed based on the attention mechanism, and the fitted data is used as a dataset to train and evaluate the visual information recognition model. The visual information recognition model includes a multi-head attention unit, a local attention unit, and a summing unit. The multi-head attention unit is composed of multiple stacked self-attention layers, whose output is fed to the local attention unit via residual connections. The local attention unit inputs the output data obtained from the multi-head attention unit into a one-dimensional convolutional network to capture local features. Specifically, after the eye-tracking attention features are input into the first convolutional network layer, the output dimension is reduced proportionally; after activation by the ReLU function, it is transformed back to the original dimension by another convolutional layer. Then, for the convolved data, the maximum and average values of the channels at each feature point are taken to compress the channels and obtain spatial information. The two sets of data are stacked and then passed through a one-dimensional convolutional network layer to learn spatial attention weights. Finally, the attention weights are activated by the Sigmoid function and multiplied by the original input features to obtain the attention matrix. The summing unit directly adds the attention matrices output by the local attention units of the two parallel paths to obtain the final attention matrix, which is then passed through a fully connected layer to obtain the model result. Input the data to be detected into the evaluated visual information recognition model, and output the visual information recognition result.
2. The visual information recognition method based on eye-tracking attention as described in claim 1, characterized in that, before the saccade experiment, the eye tracking system is calibrated using the nine-point calibration method.
3. The visual information recognition method based on eye-tracking attention as described in claim 1, characterized in that the specific steps for determining the feature attributes of eye-tracking data are as follows: select some features of the eye-tracking data through random forest, sort them according to the Gini coefficient of random forest, and remove attributes that have little or no impact on the classification of the desired target.
4. The visual information recognition method based on eye movement attention according to claim 1, wherein the process of filtering, registering, and fitting eye-tracking data is as follows: delete data with more than 30% null values in each experimental record, register the remaining data, treat each independent experiment as a sample after data registration, and fit the features of the data records of each independent experiment into a multidimensional data set.
5. The visual information recognition method based on eye movement attention according to claim 1, wherein the specific process of using the fitted data as a dataset to train and evaluate the visual information recognition model is as follows: the eye-tracking dataset is divided into training, testing, and validation datasets, all of which are input into the model. The model is trained using the training set and validated using the validation set. The model is continuously adjusted and the best model is selected. A final model is then trained using the training and validation set data. Finally, the final model is evaluated using the test set.
6. A visual information recognition system based on eye movement attention, characterized by comprising: The feature attribute module is configured to determine the feature attributes of eye-tracking data based on the required visual information. The data acquisition module is configured to obtain eye movement data through saccade experiments and to filter, register, and fit the eye movement data. The saccade experiment is divided into forward saccades and reverse saccades. Forward saccades proceed as follows: first, the central gaze point is displayed on the screen; then the gaze point disappears and the target stimulus appears to the left or right of the central gaze point; the subject is asked to look immediately at the location where the target stimulus appears. Reverse saccades are similar to forward saccades, but require the subject to look in the direction opposite to where the target stimulus appears. The model building module is configured to construct a visual information recognition model based on an attention mechanism, and the fitted data is used as the dataset to train and evaluate the visual information recognition model. The visual information recognition model includes a multi-head attention unit, local attention units, and a summing unit. The multi-head attention unit is composed of multiple stacked self-attention layers, whose output is fed to the local attention unit via residual connections. The local attention unit inputs the output data obtained from the multi-head attention unit into a one-dimensional convolutional network to capture local features. Specifically, after the eye-tracking attention features are input into the first convolutional network layer, the output dimension is proportionally reduced; after activation by the ReLU function, it is transformed back to its original dimension by another convolutional layer. Subsequently, for the convolved data, the maximum and average values of the channels at each feature point are taken to compress the channels and obtain spatial information. The two sets of data are stacked and then passed through a one-dimensional convolutional network layer to learn spatial attention weights. Finally, the attention weights are activated by the Sigmoid function and multiplied by the original input features to obtain the attention matrix. The summing unit directly adds the attention matrices output by the local attention units of the two parallel paths to obtain the final attention matrix, which is then passed through a fully connected layer to obtain the model result. The information recognition module is configured to input the data to be detected into the evaluated visual information recognition model and output the visual information recognition result.
7. A computer-readable storage medium, characterized in that it stores multiple instructions adapted to be loaded by a processor of a terminal device to execute the visual information recognition method based on eye movement attention as described in any one of claims 1-5.
8. A terminal device, comprising a processor and a computer-readable storage medium, the processor being used to implement instructions, and the computer-readable storage medium being used to store multiple instructions adapted to be loaded by the processor to execute the visual information recognition method based on eye-tracking attention as described in any one of claims 1-5.