A robot vision operation control method and system based on eye tracking
By constructing a dual-path attention fusion model based on eye-tracking, the problems of computational redundancy and insufficient robustness of traditional robot vision operation models in complex dynamic environments are solved, and efficient task-oriented attention allocation and fine operation capabilities are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional robot vision manipulation models suffer from computational redundancy, slow response speed, and insufficient robustness in complex dynamic environments. They also lack selective attention mechanisms, making it difficult to quickly focus on key areas, especially when dealing with small targets and delicate operations.
An eye-tracking-based approach is adopted, which simultaneously collects eye movement data and scene visual data through a head-mounted device. A dynamic Bayesian network is used to extract gaze-intent coupling features, and a dual-path attention fusion model is constructed. Combined with reinforcement learning optimization strategies, data-driven and knowledge-driven paths are adaptively integrated to generate a task-oriented attention map.
It improves perception efficiency and robustness, reduces computational resource waste, enhances the model's generalization performance in new scenarios, provides interpretability and natural human-computer interaction, and is suitable for a variety of robot operation tasks.
Figure CN121893301B_ABST
Description
Technical Field
[0001] This invention belongs to the field of robot vision and artificial intelligence, and specifically relates to a robot vision operation control method and system based on eye tracking.
Background Technology
[0002] Traditional robot vision operation models (such as grasping detection and target pose estimation models) adopt a "global processing" paradigm, performing indiscriminate feature extraction and intensive computation on the input image. While this approach performs reasonably well in structured, static industrial scenarios, it suffers from the following technical bottlenecks when facing complex, dynamic environments: First, computational redundancy leads to poor real-time performance. The model processes the entire field of view with equal weight, wasting significant computational resources on background areas irrelevant to the task, resulting in slow response times and high energy consumption. Second, when the target is partially occluded or visually disturbed, the lack of a task-oriented attention mechanism makes it difficult for the model to quickly focus on key areas, resulting in insufficient robustness. Third, the detection capability for small targets and fine-grained operations is limited. The root cause of these problems lies in the model's lack of human-like "selective attention" capabilities.
[0003] In contrast, the human visual system exhibits a highly efficient selective processing mechanism. According to cognitive neuroscience research, when performing visual tasks, humans do not process the entire visual field, but rather actively and selectively direct high-resolution visual attention towards areas of interest through rapid fixation-saccade sequences. This mechanism forms an intention-driven attention sequence, making human visual information processing far more efficient than global processing methods.
[0004] In recent years, researchers have begun to explore the application of attention mechanisms in computer vision. However, existing methods have fundamental limitations when applied to robotic manipulation. First, most methods use human gaze points as direct inputs or posterior evaluation criteria for the model, failing to elevate them to transferable "cognitive prior knowledge." Second, these methods lack temporal modeling of eye-tracking sequences, neglecting the dynamic patterns of attention shifts and failing to establish a causal relationship between gaze and action. More importantly, current technologies have failed to deeply integrate eye-tracking knowledge into the perception-decision joint training framework of robotic manipulation models, resulting in a disconnect between attention mechanisms and the manipulation task.
[0005] With technological advancements, head-mounted devices with eye-tracking capabilities have gradually entered the consumer market, providing a convenient platform for eye-tracking data acquisition. These devices integrate high-precision eye-tracking sensors and possess perspective or mixed reality capabilities, enabling the simultaneous acquisition of first-person view scene images and eye-tracking data. However, current technologies still have shortcomings in effectively converting eye-tracking data into knowledge that robots can learn.
[0006] Therefore, a systematic approach is needed to effectively extract, encode, and inject the visual attention patterns of humans when performing tasks into the training process of robot models, so that robots can acquire human-like efficient perception and intelligent decision-making abilities.
Summary of the Invention
[0007] To address the shortcomings of existing technologies, this application proposes a robot vision operation control method and system based on eye tracking, which is particularly suitable for unstructured dynamic scenarios requiring high environmental understanding and fine operation. By systematically recording the visual attention patterns of human experts when performing operational tasks, the method transforms these patterns into computable "attention prior knowledge". Through a specific network architecture and training strategy, this prior knowledge is integrated into the robot model, enabling the model to learn "where to look" and "what to do".
[0008] In a first aspect, the present invention provides a robot vision operation control method based on eye tracking, comprising:
[0009] A head-mounted device with eye-tracking function is used to synchronously collect data streams when the operator performs the task. The collected data streams include: eye-tracking data streams when the operator performs the task and scene visual data streams.
[0010] Preprocess the collected data stream;
[0011] For the preprocessed data stream, a dynamic Bayesian network is used to extract gaze-intent coupling features;
[0012] The preprocessed data stream and gaze-intent coupling features are input into the constructed dual-path attention fusion robot operation model to obtain online attention results. Based on the online attention results, the robot performs corresponding operations. The constructed dual-path attention fusion robot operation model includes: a bottom-up data-driven path and a top-down knowledge-driven path. The data-driven path and the knowledge-driven path are adaptively integrated through a gating mechanism, and the weight ratio of action loss and attention loss is dynamically adjusted. Through reinforcement learning optimization strategy, attention consistency is used as an intrinsic reward.
[0013] The eye-tracking data stream during the operation task performed by the operator includes: fixation point coordinates, saccade event markers, fixation duration, and timestamps.
[0014] The scene visual data stream includes: RGB images or RGB-D images from a first-person perspective.
[0015] The method of extracting gaze-intent coupling features using dynamic Bayesian networks includes:
[0016] A dynamic Bayesian network is constructed based on the preprocessed scene visual data stream;
[0017] Transform the gaze point coordinates to three-dimensional space;
[0018] The size of the perception window for the fixation point is determined based on the duration of fixation.
[0019] Within the perception window, the Bayesian posterior probability is calculated based on the three-dimensional space of the gaze point, the observation likelihood in the dynamic Bayesian network, and the prior probability.
[0020] A multi-scale saliency map sequence is generated based on the Bayesian posterior probability and the amount of uncertainty reduction.
[0021] The gaze-intent coupling feature is obtained by weighted fusion of multi-scale saliency map sequences.
[0022] The formula for converting the gaze point coordinates to three-dimensional space is as follows:
[0023] $\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = d \cdot K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$;
[0024] where $(u, v)$ are the pixel coordinates of the gaze point, $K$ is the intrinsic parameter matrix of the camera in the head-mounted device with eye-tracking capability, $d$ is the corresponding depth value from that camera, and $(X_c, Y_c, Z_c)$ are the coordinates of the gaze point in three-dimensional space.
[0025] The size of the perception window changes exponentially with the duration of gaze, including:
[0026] When the fixation duration is less than the preset minimum fixation time threshold, the maximum window $W_{\max}$ is used as the perception window;
[0027] When the fixation duration exceeds the preset maximum fixation time threshold, the minimum window $W_{\min}$ is used as the perception window;
[0028] When the fixation duration is greater than or equal to the preset minimum fixation time threshold and less than or equal to the preset maximum fixation time threshold, $W_t$ is used as the perception window, calculated as follows:
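[0029] $W_t = W_{\min} + (W_{\max} - W_{\min})\, e^{-\alpha \tau}$;
[0030] where $\tau$ is the fixation duration at the current fixation point, $\alpha$ is the adaptive adjustment coefficient, $W_{\min}$ is the minimum window, $W_{\max}$ is the maximum window, and $W_t$ is the perception window size at time $t$.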
[0031] The multi-scale saliency map sequence includes: spatial saliency map, temporal saliency map, and task relevance map.
[0032] The multi-scale saliency map sequence is weighted and fused to obtain the gaze-intent coupling feature, calculated as follows:
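[0033] $S_{\text{gaze}} = \lambda_1 S_{\text{spatial}} + \lambda_2 S_{\text{temporal}} + \lambda_3 S_{\text{task}}$;
[0034] where $\lambda_1$ is the weighting coefficient of the spatial saliency map, $\lambda_2$ is the weighting coefficient of the temporal saliency map, $\lambda_3$ is the weighting coefficient of the task relevance map, $S_{\text{spatial}}$ is the spatial saliency map, $S_{\text{temporal}}$ is the temporal saliency map, $S_{\text{task}}$ is the task relevance map, and $S_{\text{gaze}}$ is the gaze-intent coupling feature.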
[0035] The bottom-up data-driven path includes: using a convolutional neural network as the backbone network, inputting the preprocessed data stream into the backbone network, and obtaining multi-scale features;
[0036] Multi-scale features are input into a feature pyramid network and fused to obtain fused multi-scale features.
[0037] Attention modules are applied at each scale to weight the fused multi-scale features, resulting in weighted features at different scales.
[0038] Weighted features at different scales are upsampled to obtain features of the same size. Features of the same size are then concatenated and fused to obtain a data-driven attention map.
[0039] The top-down knowledge-driven path includes:
[0040] A sequence encoder is used to encode the gaze-intent coupling features to obtain a global context vector;
[0041] A decoder network is used to decode the global context vector to obtain a knowledge-driven attention map.
[0042] The adaptive integration of data-driven and knowledge-driven paths through a gating mechanism includes:
[0043] Based on the data-driven attention map and the knowledge-driven attention map, a convolutional layer is used to calculate the gating weights and normalize them using an activation function to obtain the fused attention map. The calculation formula is as follows:
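[0044] $A_{\text{fused}} = \text{gate} \odot A_{\text{data}} + (1 - \text{gate}) \odot A_{\text{know}}$;
[0045] where $\odot$ denotes element-wise multiplication, $\text{gate}$ is the gating weight, $A_{\text{data}}$ is the data-driven attention map, $A_{\text{know}}$ is the knowledge-driven attention map, and $A_{\text{fused}}$ is the fused attention map.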
[0046] Furthermore, the weight ratio of action loss to attention loss is dynamically adjusted, and the optimization strategy is implemented through reinforcement learning, using attentional consistency as an intrinsic reward, including:
[0047] Construct the total loss function based on the action loss, attention loss, regularization loss, and their corresponding weights.
[0048] During the attention-dominant phase, the attention loss weight is set to be greater than the first attention weight threshold, and the action loss weight is set to be less than the first action loss threshold.
[0049] During the balanced learning phase, a fixed interval value for the action loss weight is used to gradually increase the action loss weight, while an exponential decay function is used to gradually decrease the attention loss weight.
[0050] During the action-dominant phase, a fixed interval value for the action loss weight is continued to be used, and the action loss weight is gradually increased. The attention loss weight is set to be less than the second attention weight threshold, wherein the first attention weight threshold is greater than the second attention weight threshold.
[0051] A reward function is constructed from the extrinsic and intrinsic rewards, and the dual-path attention fusion robot operation model is optimized using this reward function. The intrinsic reward is obtained using the attention consistency method: the similarity between the model's attention and human attention patterns is measured by intersection over union (IoU), and the attention-guided information gain and gaze efficiency reward are calculated.
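The attention-consistency part of the reward can be illustrated with a minimal sketch. It assumes that both attention maps are 2-D grids of values in [0, 1], that they are binarised at a hypothetical threshold before the overlap is measured, and that the intrinsic term is added to the extrinsic reward with a hypothetical coefficient `beta`. The information gain and gaze efficiency terms mentioned above are omitted, and all function and parameter names are illustrative only.

```python
def attention_iou(model_map, human_map, threshold=0.5):
    """Intersection over union of two attention maps after binarisation.

    model_map, human_map: equally sized 2-D lists with values in [0, 1].
    threshold: hypothetical binarisation level (an assumption of this sketch).
    """
    inter = 0
    union = 0
    for m_row, h_row in zip(model_map, human_map):
        for m, h in zip(m_row, h_row):
            m_on, h_on = m >= threshold, h >= threshold
            inter += int(m_on and h_on)
            union += int(m_on or h_on)
    # Two empty masks are treated as fully consistent (assumption).
    return inter / union if union else 1.0


def total_reward(extrinsic, model_map, human_map, beta=0.1):
    """Extrinsic task reward plus a weighted attention-consistency bonus."""
    return extrinsic + beta * attention_iou(model_map, human_map)


model_attention = [[0.9, 0.8], [0.1, 0.2]]
human_attention = [[0.7, 0.1], [0.6, 0.0]]
print(attention_iou(model_attention, human_attention))
print(total_reward(1.0, model_attention, human_attention, beta=0.3))
```

In this example one cell is attended by both maps and three cells are attended by at least one of them, so the overlap is one third.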
[0052] In a second aspect, the present invention provides a robot vision operation control system based on eye tracking, implemented using the robot vision operation control method based on eye tracking described in the first aspect, comprising:
[0053] The data acquisition module is used to acquire data streams synchronously when the operator performs the operation task using a head-mounted device with eye-tracking function. The acquired data streams include: eye-tracking data streams when the operator performs the operation task and scene visual data streams.
[0054] The data preprocessing module is used to preprocess the acquired data stream;
[0055] The feature extraction module is used to extract gaze-intent coupling features from the preprocessed data stream using a dynamic Bayesian network.
[0056] The action execution module is used to input the preprocessed data stream and gaze-intent coupling features into the constructed dual-path attention fusion robot operation model to obtain online attention results. Based on the online attention results, the robot performs corresponding operations. The constructed dual-path attention fusion robot operation model includes: a bottom-up data-driven path and a top-down knowledge-driven path. The data-driven path and the knowledge-driven path are adaptively integrated through a gating mechanism, and the weight ratio of action loss and attention loss is dynamically adjusted. Through reinforcement learning optimization strategy, attention consistency is used as an intrinsic reward.
[0057] Beneficial effects:
[0058] This application proposes a robot vision operation control method and system based on eye tracking, with the following beneficial effects:
[0059] (1) This invention significantly improves perception efficiency by introducing a human-like selective attention mechanism and extracting gaze-intention coupling features. The model can concentrate computing resources on task-related key areas like a human, reducing redundant computation on background areas, thereby improving response speed and reducing energy consumption. In complex scenarios such as occlusion and interference, the model can quickly focus on key areas, showing stronger robustness than traditional global processing methods and improving task success rate.
[0060] (2) This invention introduces human cognitive priors as initialization and training constraints, accelerating the training convergence process. Compared to completely data-driven methods, this invention can achieve comparable or even better performance with fewer training samples, reducing data acquisition costs. By learning task-oriented attention patterns rather than scene-specific pixel features, the model exhibits better generalization performance and stronger adaptability in new scenes.
[0061] (3) The attention map generated by this invention provides a visual explanation of the model's decision-making process, making debugging and optimization more intuitive. This interpretability not only helps developers understand the model's behavior but also helps improve the naturalness and credibility of human-computer interaction. The method proposed in this invention has strong versatility and is applicable to various robot operation tasks such as grasping, assembly, navigation, and surgical assistance, and has broad application prospects.
Description of the Drawings
[0062] Figure 1 is a flowchart of the robot vision operation control method based on eye tracking according to an embodiment of the present invention;
[0063] Figure 2 is a schematic diagram of the robot vision operation control system according to an embodiment of the present invention;
[0064] Figure 3 is a schematic flowchart of the robot vision operation control method based on eye tracking according to an embodiment of the present invention;
[0065] Figure 4 is a schematic diagram of the multimodal data acquisition process according to an embodiment of the present invention;
[0066] Figure 5 is a schematic diagram of the 3D mapping and coordinate system transformation of the gaze point according to an embodiment of the present invention;
[0067] Figure 6 is a schematic diagram of the dual-path attention fusion neural network architecture according to an embodiment of the present invention;
[0068] Figure 7 is a schematic diagram of the three-stage training strategy according to an embodiment of the present invention;
[0069] In the figures: 1-robot, 2-operator, 3-VR glasses, 4-VR glasses screen, 5-gaze point.
Detailed Implementation
[0070] The specific implementation methods of this application will be further described in detail below with reference to the accompanying drawings and embodiments.
[0071] Addressing the shortcomings of existing technologies, the core of this invention lies in establishing a systematic transfer path from human visual cognition to machine intelligence. First, human gaze behavior is modeled as a Bayesian inference process that actively reduces cognitive uncertainty, rather than a simple information input. By recursively updating scene understanding through a dynamic Bayesian network, the gaze point is modeled as an active observation that reduces state uncertainty, thereby generating an attentional prior representation that integrates spatial, temporal, and task semantics.
[0072] In terms of network architecture design, this invention constructs a dual-stream fusion architecture comprising a bottom-up data-driven path and a top-down knowledge-driven path. The data-driven path extracts features from raw visual data, demonstrating inductive learning capabilities; the knowledge-driven path transforms human cognitive priors into attention-guided signals, demonstrating deductive reasoning capabilities. The two paths are adaptively integrated through a gating mechanism, preserving the flexibility of end-to-end learning while explicitly incorporating the cognitive patterns of human experts.
[0073] Regarding the training strategy, this invention designs a "learn to see first, then learn to do" learning strategy. By dynamically adjusting the weight ratio of action loss and attention loss, the model first learns to identify key areas of the task, establishing task-oriented visual attention, and then gradually transitions to accurate action prediction. In the reinforcement learning stage, this invention uses attention consistency as an intrinsic reward, guiding the strategy to maintain a human-like visual sampling pattern during autonomous exploration, ensuring that the model not only remembers "where to look in which scene," but truly learns the task-oriented attention allocation strategy.
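The "learn to see first, then learn to do" schedule can be sketched as a function that maps the training epoch to a pair of loss weights. This is only an illustration under stated assumptions: the phase boundaries, initial weights, step size, decay rate, and the cap of the action weight at 1.0 are all hypothetical values chosen for the example, and the total loss is assumed to be a weighted sum of action, attention, and regularization losses as described in the summary.

```python
import math


def loss_weights(epoch, phase1_end=10, phase2_end=30,
                 w_action_init=0.1, action_step=0.02,
                 w_attention_init=1.0, decay_rate=0.1, w_attention_late=0.05):
    """Return (action_weight, attention_weight) for a training epoch.

    All default values are hypothetical and only illustrate the three phases.
    """
    if epoch < phase1_end:
        # Attention-dominant phase: high attention weight, low action weight.
        return w_action_init, w_attention_init
    # Balanced and action-dominant phases: fixed-step increase of the action weight
    # (capped at 1.0, which is an assumption of this sketch).
    w_action = min(1.0, w_action_init + action_step * (epoch - phase1_end))
    if epoch < phase2_end:
        # Balanced phase: exponential decay of the attention weight.
        w_attention = w_attention_init * math.exp(-decay_rate * (epoch - phase1_end))
    else:
        # Action-dominant phase: attention weight held at a small value.
        w_attention = w_attention_late
    return w_action, w_attention


def total_loss(l_action, l_attention, l_reg, epoch, w_reg=1e-4):
    """Weighted sum of action, attention, and regularization losses."""
    w_action, w_attention = loss_weights(epoch)
    return w_action * l_action + w_attention * l_attention + w_reg * l_reg


for e in (0, 20, 40):
    print(e, loss_weights(e))
```

With these example values the attention weight never increases and the action weight never decreases as training proceeds, which is the qualitative behaviour the three phases describe.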
[0074] Example 1:
[0075] This embodiment provides a robot vision operation control method based on eye tracking which, as shown in Figure 1 and Figure 3, includes:
[0076] Step S1: Use a head-mounted device with eye-tracking function to synchronously collect data streams while the operator is performing the task. The collected data streams include: eye-tracking data streams and scene visual data streams while the operator is performing the task.
[0077] In this embodiment, as shown in Figure 2, the robot vision operation control system includes robot 1, operator 2, and VR glasses 3. The head-mounted device with eye-tracking capability is the VR (virtual reality) glasses 3. While operator 2 wears VR glasses 3 and performs the operation task, three types of data streams are collected synchronously: the eye-tracking data stream, the scene visual data stream, and the operation motion data stream. The VR glasses screen 4 contains the gaze point 5.
[0078] The eye-tracking data stream during operator task execution includes: gaze point coordinates, saccade event markers, gaze duration, and timestamps. The scene visual data stream includes: a first-person perspective RGB (red, green, blue) image or RGB-D (red, green, blue plus depth) image. The operation motion data stream includes robotic arm joint angles, end effector pose, and gripper state. This operation motion data stream is used in training the dual-path attention fusion robot operation model. The eye-tracking device must meet the following technical specifications: a sampling frequency of at least 90 Hz, an accuracy better than 1.5° of visual angle, and a latency of less than 10 ms.
[0079] The necessary calibration must be completed before data collection, as shown in Figure 4. First, the intrinsic parameters of the camera in the head-mounted device are calibrated to obtain the camera matrix K and the distortion coefficients D. Then, the extrinsic parameters between the eye tracker and the camera are calibrated: calibration points are placed at known locations in the environment, eye-tracking data is recorded while the operator gazes at each point, and an optimization algorithm solves for the 4×4 homogeneous transformation matrix from the eye-tracking coordinate system to the camera coordinate system, ensuring that the calibration error is less than 0.5°. Finally, time synchronization calibration is completed by measuring the time delay between the eye-tracking data stream and the video stream and establishing a precise timestamp alignment function. The eye-tracking data stream, the visual data stream, and the operation motion data stream are input into the data synchronization module, and the synchronized data is stored to obtain a multimodal database. This part is prior art and will not be described in detail in this application.
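As an illustration of the timestamp alignment step, the following sketch matches each video frame to the nearest gaze sample after compensating for a measured delay. It is not the alignment function of this application: the function name, the sign convention of `delay` (gaze clock minus video clock, in seconds), and the `max_gap` rejection threshold are assumptions made for the example.

```python
from bisect import bisect_left


def align_gaze_to_frames(frame_ts, gaze_ts, delay=0.0, max_gap=0.02):
    """For each frame timestamp, return the index of the nearest gaze sample.

    frame_ts: frame timestamps in seconds.
    gaze_ts: sorted gaze timestamps in seconds.
    delay: measured offset added to frame time to obtain gaze time (assumption).
    max_gap: frames whose nearest gaze sample is further away get None.
    """
    matches = []
    for t in frame_ts:
        target = t + delay
        i = bisect_left(gaze_ts, target)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(gaze_ts)]
        best = min(candidates, key=lambda j: abs(gaze_ts[j] - target), default=None)
        if best is not None and abs(gaze_ts[best] - target) > max_gap:
            best = None
        matches.append(best)
    return matches


frames = [0.0, 0.033, 0.5]
gaze = [0.012, 0.023, 0.034, 0.045]
print(align_gaze_to_frames(frames, gaze, delay=0.011))
```

The last frame in the example has no gaze sample within `max_gap`, so it is left unmatched rather than paired with stale data.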
[0080] Step S2: Preprocess the acquired data stream;
[0081] In this embodiment, the collected raw data undergoes systematic preprocessing to ensure quality: all data streams are aligned to a unified time axis; filtering algorithms are applied to smooth the gaze trajectory and to identify saccade and fixation events; gaze points are transformed from normalized coordinates to image pixel coordinates and robot coordinates; trials with excessively high data loss rates or operational failures are eliminated; and consistency checks are performed on multi-expert data.
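The identification of saccade and fixation events can be illustrated with a velocity-threshold classifier. The application does not name a specific algorithm, so the velocity-threshold approach used here, the 30 deg/s threshold, and the sample format (time in seconds, gaze angles in degrees) are all assumptions of this sketch.

```python
import math


def classify_ivt(samples, velocity_threshold=30.0):
    """Label gaze samples as 'fixation' or 'saccade' by angular speed (deg/s).

    samples: time-sorted list of (t_seconds, x_deg, y_deg).
    velocity_threshold: hypothetical cut-off; tune per device and task.
    """
    labels = ["fixation"] if samples else []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        speed = math.hypot(x1 - x0, y1 - y0) / dt if dt > 0 else float("inf")
        labels.append("saccade" if speed > velocity_threshold else "fixation")
    return labels


def fixation_events(samples, labels):
    """Merge runs of fixation samples into (start, duration, mean_x, mean_y)."""
    events, run = [], []
    # A trailing sentinel flushes the last run of fixation samples.
    for sample, label in list(zip(samples, labels)) + [(None, "saccade")]:
        if label == "fixation":
            run.append(sample)
        elif run:
            xs = [s[1] for s in run]
            ys = [s[2] for s in run]
            events.append((run[0][0], run[-1][0] - run[0][0],
                           sum(xs) / len(xs), sum(ys) / len(ys)))
            run = []
    return events


samples = [(0.00, 0.0, 0.0), (0.01, 0.1, 0.0), (0.02, 0.1, 0.1),
           (0.03, 5.0, 0.1), (0.04, 5.05, 0.1), (0.05, 5.1, 0.1)]
labels = classify_ivt(samples)
print(labels)
print(fixation_events(samples, labels))
```

The fixation durations produced this way are the quantity that later determines the size of the perception window.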
[0082] Step S3: For the preprocessed data stream, a dynamic Bayesian network is used to extract gaze-intent coupling features, including:
[0083] Step S3.1: Construct a dynamic Bayesian network based on the preprocessed scene visual data stream;
[0084] In this embodiment, the acquired eye-tracking data stream is transformed into an attentional prior representation rich in cognitive semantics. First, the preprocessed scene visual data stream is modeled as a dynamic Bayesian network. The state space of the Bayesian network includes object pose, scene layout, and task progress, while the observation space includes visual observation and eye-tracking observation. Then, the gaze point is transformed from the image coordinate system to three-dimensional space using the inverse of the camera intrinsic parameter matrix K and the corresponding depth value. How to construct the dynamic Bayesian network is existing technology and will not be elaborated upon in this embodiment.
[0085] Step S3.2: Transform the gaze point coordinates to three-dimensional space. The calculation formula is as follows:
[0086] $\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = d \cdot K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$;
[0087] where $(u, v)$ are the pixel coordinates of the gaze point in the eye-tracking coordinate system, $K$ is the intrinsic parameter matrix of the camera of the head-mounted device with eye-tracking capability, $d$ is the corresponding depth value from that camera, and $(X_c, Y_c, Z_c)$ are the coordinates of the gaze point in three-dimensional space within the camera coordinate system.
[0088] Then, using the depth information and the extrinsic parameter matrix obtained by hand-eye calibration, the point is mapped to the robot coordinate system to obtain the world coordinates $(X_r, Y_r, Z_r)$. This gives the gaze point a physical meaning on which the robot can act. The extrinsic parameter matrix represents the fixed pose transformation between the camera and the robot, as shown in Figure 5.
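The two mapping steps, from pixel to camera coordinates and from camera to robot coordinates, can be sketched as follows. The sketch assumes a zero-skew pinhole camera, so that multiplying by the inverse intrinsic matrix reduces to the closed form below, and it assumes the hand-eye extrinsic matrix is given as a 4×4 row-major nested list. The numeric values are made up for the example.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with a depth value to a 3-D point in the camera frame.

    Equivalent to depth * K^-1 * [u, v, 1]^T for a zero-skew intrinsic matrix
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (assumption of this sketch).
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform_point(T, p):
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3-D point."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


# Hypothetical intrinsics and a hypothetical camera-to-robot transform:
# a 90 degree rotation about the z axis followed by a translation.
p_cam = back_project(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
T_cam_to_robot = [[0.0, -1.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.0, 1.0]]
p_robot = transform_point(T_cam_to_robot, p_cam)
print(p_cam, p_robot)
```

A gaze pixel 100 px to the right of the principal point at 2 m depth lands 0.4 m off the optical axis in the camera frame, and the extrinsic transform then expresses that point in the robot frame.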
[0089] Step S3.3: Determine the perceptual window size of the fixation point based on the fixation duration, including:
[0090] When the fixation duration is less than the preset minimum fixation time threshold, the maximum window $W_{\max}$ is used as the perception window;
[0091] When the fixation duration exceeds the preset maximum fixation time threshold, the minimum window $W_{\min}$ is used as the perception window;
[0092] When the fixation duration is greater than or equal to the preset minimum fixation time threshold and less than or equal to the preset maximum fixation time threshold, $W_t$ is used as the perception window, calculated as follows:
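[0093] $W_t = W_{\min} + (W_{\max} - W_{\min})\, e^{-\alpha \tau}$;
[0094] where $\tau$ is the fixation duration at the current fixation point, $\alpha$ is the adaptive adjustment coefficient, $W_{\min}$ is the minimum window, $W_{\max}$ is the maximum window, and $W_t$ is the perception window size at time $t$.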
[0095] A Bayesian filtering framework is used to implement recursive state estimation. For each time step, the prior state distribution is first predicted based on the motion model, and then a perceptual window is defined around the fixation point as the observation region. The window size is adaptively adjusted according to the fixation duration to reflect human cognitive characteristics.
[0096] The minimum window is used for the detailed observation that accompanies prolonged fixation, and the maximum window is used for the coarse search that accompanies brief scanning. Between these two cases, the window size depends on the duration of the current fixation and on the adaptive adjustment coefficient. After the size of the perception window is determined, the Bayesian posterior distribution is calculated based on the observation likelihood and the prior probability. Finally, the uncertainty is updated and the information gain is quantified.
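The adaptive window can be sketched as a piecewise function of fixation duration. The exponential interpolation used between the two thresholds, of the form minimum + (maximum - minimum) × exp(-coefficient × duration), is one form consistent with the description and is an assumption of this sketch, as are the threshold values, window sizes in pixels, and the coefficient.

```python
import math


def perception_window(duration, t_min=0.1, t_max=0.6,
                      w_min=32.0, w_max=128.0, alpha=5.0):
    """Return the perception window size for a fixation of `duration` seconds.

    Short glances get the largest window (coarse search); long fixations get the
    smallest window (detailed observation). All defaults are hypothetical.
    """
    if duration < t_min:
        return w_max
    if duration > t_max:
        return w_min
    # Assumed exponential shrinkage between the two thresholds.
    return w_min + (w_max - w_min) * math.exp(-alpha * duration)


for d in (0.05, 0.2, 0.5, 1.0):
    print(d, round(perception_window(d), 1))
```

With this form the window size jumps at the two thresholds unless the coefficient and the thresholds are chosen together, so a real system would likely tune them jointly.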
[0097] Step S3.4: In the perception window, calculate the Bayesian posterior probability based on the three-dimensional space of the gaze point, the observation likelihood in the dynamic Bayesian network, and the prior probability.
[0098] In this embodiment, the calculation of the Bayesian posterior probability based on the three-dimensional space of the gaze point, the observation likelihood in the dynamic Bayesian network, and the prior probability is a prior art technique, and will not be described in detail in this embodiment.
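For readers unfamiliar with the update, a minimal discrete sketch is given below. It assumes the state inside the perception window is a small set of hypotheses (for example, candidate target objects), that the prior and the observation likelihood are given as lists, and that uncertainty is measured by Shannon entropy in bits. These modelling choices and all names are illustrative assumptions, not the specific network of this application.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior over discrete hypotheses: posterior is proportional to likelihood * prior."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)
    if evidence == 0:
        raise ValueError("observation has zero probability under the prior")
    return [u / evidence for u in unnormalised]


def entropy(dist):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def uncertainty_reduction(prior, posterior):
    """Information gained from the observation, as a drop in entropy."""
    return entropy(prior) - entropy(posterior)


prior = [0.25, 0.25, 0.25, 0.25]       # four candidate targets, no preference
likelihood = [0.8, 0.1, 0.05, 0.05]    # gaze observation favours the first one
posterior = bayes_update(prior, likelihood)
print(posterior, uncertainty_reduction(prior, posterior))
```

In a recursive filter the posterior of one time step, propagated through the motion model, becomes the prior of the next.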
[0099] Step S3.5: Generate a multi-scale saliency map sequence based on the Bayesian posterior probability and the amount of uncertainty reduction;
[0100] The multi-scale saliency map sequence includes: spatial saliency map, temporal saliency map, and task relevance map.
[0101] In this embodiment, Bayesian filtering is used as a unified framework. At each scale of the feature pyramid, the posterior probability of the state is recursively calculated time-by-time as the basic saliency. The saliency is then adaptively weighted and enhanced using uncertainty reduction (UR), ultimately resulting in a multi-scale saliency map sequence that is updated synchronously with time and scale. This achieves a progressive visual saliency representation from coarse to fine, and from global intent to local fixation.
[0102] In this embodiment, a multi-scale saliency map sequence is generated based on the Bayesian posterior probability and the uncertainty reduction. The spatial saliency map is calculated by combining the posterior probability that a pixel belongs to the target with the reduction in uncertainty in that region. The temporal saliency map weights historical gaze points by time decay to encode the dynamic patterns of attention shifts. The task relevance map assigns weights to different sub-goals according to the task stage. The final gaze-guided saliency map is obtained through weighted fusion.
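The time-decay weighting of historical gaze points can be sketched as follows. The exponential decay with time constant `tau`, the Gaussian spatial footprint with width `sigma`, and the normalisation to a peak of 1 are assumptions chosen for illustration; the application only states that historical gaze points are weighted by time decay.

```python
import math


def temporal_saliency(gaze_history, t_now, height, width, tau=0.5, sigma=1.5):
    """Build a temporal saliency map from past gaze points.

    gaze_history: list of (t_seconds, row, col) gaze samples.
    tau: hypothetical decay time constant in seconds.
    sigma: hypothetical spatial spread in pixels.
    """
    sal = [[0.0] * width for _ in range(height)]
    for t, g_row, g_col in gaze_history:
        weight = math.exp(-(t_now - t) / tau)   # older gaze points count less
        for r in range(height):
            for c in range(width):
                d2 = (r - g_row) ** 2 + (c - g_col) ** 2
                sal[r][c] += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in sal)
    return [[v / peak for v in row] for row in sal] if peak > 0 else sal


history = [(0.0, 1, 1), (1.0, 3, 3)]
saliency = temporal_saliency(history, t_now=1.0, height=5, width=5)
print(round(saliency[3][3], 3), round(saliency[1][1], 3))
```

The most recent gaze location dominates the map, while the older one leaves a weaker trace that still encodes where attention came from.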
[0103] Step S3.6: Weighted fusion of the multi-scale saliency map sequence yields the gaze-intent coupling feature, calculated as follows:
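[0104] $S_{\text{gaze}} = \lambda_1 S_{\text{spatial}} + \lambda_2 S_{\text{temporal}} + \lambda_3 S_{\text{task}}$;
[0105] where $\lambda_1$ is the weighting coefficient of the spatial saliency map, $\lambda_2$ is the weighting coefficient of the temporal saliency map, and $\lambda_3$ is the weighting coefficient of the task relevance map; the three weighting coefficients are non-negative and normalized. $S_{\text{spatial}}$ is the spatial saliency map, $S_{\text{temporal}}$ is the temporal saliency map, $S_{\text{task}}$ is the task relevance map, and $S_{\text{gaze}}$ is the gaze-intent coupling feature.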
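The weighted fusion can be sketched as a per-pixel linear combination. The sketch assumes that "normalized" means the three weights sum to one, so it rescales whatever non-negative weights are passed in; the function name and the example weights are hypothetical.

```python
def fuse_saliency(spatial, temporal, task, weights=(0.5, 0.3, 0.2)):
    """Fuse three equally sized 2-D maps with non-negative, normalised weights."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Rescale so the weights sum to one (assumed meaning of 'normalized').
    w_s, w_t, w_k = (w / total for w in weights)
    return [[w_s * s + w_t * t + w_k * k
             for s, t, k in zip(row_s, row_t, row_k)]
            for row_s, row_t, row_k in zip(spatial, temporal, task)]


spatial = [[1.0, 0.0]]
temporal = [[0.0, 1.0]]
task = [[1.0, 1.0]]
print(fuse_saliency(spatial, temporal, task, weights=(2, 1, 1)))
```

Because the weights are non-negative and sum to one, the fused value at each pixel stays within the range spanned by the three input maps.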
[0106] Finally, the dynamic time warping algorithm is used to align the gaze sequence and the action sequence, identify gaze precursor patterns, and verify the predictive power of gaze on action, providing a basis for later using attention as a precursor signal for action prediction.
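A compact version of dynamic time warping is sketched below to show how a gaze precursor pattern might be detected. The sketch assumes both sequences are one-dimensional signals sampled at the same rate (for example, gaze distance to the target and gripper distance to the target), uses the absolute difference as the local cost, and summarises the alignment by the mean index offset along the warping path. The `mean_lead` measure is a hypothetical summary introduced for this example.

```python
def dtw(seq_a, seq_b, dist=lambda a, b: abs(a - b)):
    """Classic dynamic time warping; returns (total_cost, warping_path)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the aligned index pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def mean_lead(path):
    """Average of (action index - gaze index); positive means gaze leads action."""
    return sum(j - i for i, j in path) / len(path)


gaze = [0, 0, 1, 2, 1, 0, 0]
action = [0, 0, 0, 1, 2, 1, 0]   # same shape as gaze, one step later
cost, path = dtw(gaze, action)
print(cost, mean_lead(path))
```

In the example the action signal repeats the gaze signal one step later, so the alignment cost is zero and the mean offset is positive, which is the signature of gaze preceding action.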
[0107] Step S4: Input the preprocessed data stream and the gaze-intent coupling features into the constructed dual-path attention fusion robot operation model to obtain online attention results, based on which the robot performs the corresponding operations. As shown in Figure 6, the constructed dual-path attention fusion robot operation model includes a bottom-up data-driven path and a top-down knowledge-driven path. The two paths are adaptively integrated through a gating mechanism, the weight ratio of action loss to attention loss is dynamically adjusted, and the strategy is optimized through reinforcement learning with attention consistency as an intrinsic reward.
[0108] The bottom-up data-driven path includes: using a convolutional neural network (CNN) as the backbone network, inputting the preprocessed data stream into the backbone network, and obtaining multi-scale features;
[0109] Multi-scale features are input into a Feature Pyramid Network (FPN) and fused to obtain fused multi-scale features.
[0110] Attention modules are applied at each scale to weight the fused multi-scale features, resulting in weighted features at different scales.
[0111] Weighted features at different scales are upsampled to obtain features of the same size. Features of the same size are then concatenated and fused to obtain a data-driven attention map.
[0112] The top-down knowledge-driven path includes:
[0113] A sequence encoder is used to encode the gaze-intent coupling features to obtain a global context vector;
[0114] A decoder network is used to decode the global context vector to obtain a knowledge-driven attention map.
[0115] In this embodiment, an end-to-end neural network model is designed, and a dual-stream fusion architecture is adopted to integrate data-driven and knowledge-driven learning paradigms.
[0116] The bottom-up path is responsible for extracting features from the preprocessed data stream. This path uses a convolutional neural network as the backbone network, pre-trained on the data stream acquired in step S1 to obtain rich visual priors. After the backbone network outputs multi-scale features, they are fused through a feature pyramid network to integrate semantic information and spatial details at different scales. A spatial attention module is applied at each scale, and feature weighting is achieved by compressing channel dimensions, generating attention maps through convolutional layers, and normalizing them using activation functions. The weighted features at different scales are upsampled to the same size and then concatenated to obtain a data-driven attention map.
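The spatial attention step described above (compress the channel dimension, apply a convolution, normalize with an activation function, weight the features) can be sketched as follows. This is a minimal illustration under stated assumptions: the channel dimension is compressed by mean and max, the convolutional layer is replaced by a hypothetical 1x1 linear map with weights `w_mean`, `w_max`, and `bias`, and the activation is a sigmoid. None of these specifics are given in the patent.

```python
import math


def spatial_attention(features, w_mean=1.0, w_max=1.0, bias=0.0):
    """features: list of C channels, each an H x W nested list of floats.

    Returns (attention_map, weighted_features).
    """
    c = len(features)
    h, w = len(features[0]), len(features[0][0])
    attention = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [features[k][y][x] for k in range(c)]
            # Compress the channel dimension into two descriptors.
            mean_v, max_v = sum(values) / c, max(values)
            # Stand-in for the convolutional layer (assumed 1x1 linear map).
            logit = w_mean * mean_v + w_max * max_v + bias
            # Sigmoid normalization to the range (0, 1).
            attention[y][x] = 1.0 / (1.0 + math.exp(-logit))
    # Weight every channel by the shared spatial attention map.
    weighted = [[[features[k][y][x] * attention[y][x] for x in range(w)]
                 for y in range(h)] for k in range(c)]
    return attention, weighted


# Toy input (assumed): 2 channels, 1 x 2 spatial grid.
feats = [[[0.0, 2.0]], [[0.0, 4.0]]]
att, out = spatial_attention(feats)
```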
[0117] The top-down path is responsible for transforming human cognitive priors into attention-guiding signals. This path uses a sequence encoder to process the gaze saliency map sequence, dividing each frame into patches and embedding them into a high-dimensional space. After adding positional encoding, it models temporal dependencies through a multi-layer attention mechanism, capturing long-distance temporal associations and attention shift patterns. Global context vectors are extracted from the encoded features, and a decoder network generates a knowledge-driven attention map. The decoding process also incorporates the gaze prior of the current frame and adaptively adjusts the fusion ratio based on gaze confidence.
[0118] The dual-path fusion module is the core of the entire architecture and must adaptively trade off the two information sources. This module takes the knowledge-driven attention map and the data-driven attention map as input. The gating weights are calculated by convolutional layers and normalized using an activation function. Adaptive weighted fusion is then performed to generate the fused attention map.
[0119] The adaptive integration of data-driven and knowledge-driven paths through a gating mechanism includes:
[0120] Based on the data-driven attention map and the knowledge-driven attention map, a convolutional layer is used to calculate the gating weights and normalize them using an activation function to obtain the fused attention map. The calculation formula is as follows:
[0121] ;
[0122] Here, ⊙ represents element-wise multiplication and gate is the gating weight. The two inputs are the data-driven attention map and the knowledge-driven attention map, and the output is the fused attention map. The gating weights are data-adaptive: the fusion relies more on the knowledge-driven path in the early stages of training and gradually increases the proportion of the data-driven path as the model's capabilities improve. The fused attention map must satisfy sparsity constraints, entropy constraints, and peak constraints to ensure that attention is concentrated in key regions with a clear focus.
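Because the formula itself is not reproduced in this text, the following is only one plausible reading of the gated fusion: a per-pixel convex combination in which the gate multiplies the data-driven map and (1 - gate) multiplies the knowledge-driven map. Which map receives the gate, the sigmoid activation, and the 1x1 gate parameters `w_data`, `w_know`, and `bias` are all assumptions made for illustration.

```python
import math


def gated_fusion(a_data, a_know, w_data=1.0, w_know=1.0, bias=0.0):
    """Fuse two H x W attention maps with a per-pixel sigmoid gate.

    Assumed form: fused = gate * a_data + (1 - gate) * a_know.
    """
    h, w = len(a_data), len(a_data[0])
    fused = [[0.0] * w for _ in range(h)]
    gates = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Stand-in for a convolution over the concatenated maps (assumed 1x1).
            logit = w_data * a_data[y][x] + w_know * a_know[y][x] + bias
            g = 1.0 / (1.0 + math.exp(-logit))
            gates[y][x] = g
            fused[y][x] = g * a_data[y][x] + (1.0 - g) * a_know[y][x]
    return fused, gates


a_data = [[0.2, 0.8]]
a_know = [[0.6, 0.4]]
# Zero gate weights give gate = 0.5, i.e. a plain average of the two maps.
fused_even, _ = gated_fusion(a_data, a_know, w_data=0.0, w_know=0.0)
# A strongly negative bias drives the gate toward 0, so the knowledge path dominates.
fused_early, _ = gated_fusion(a_data, a_know, w_data=0.0, w_know=0.0, bias=-20.0)
```

Under this reading, a gate near 0 early in training reproduces the described reliance on the knowledge-driven path, and the gate can grow as the data-driven path becomes reliable.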
[0123] In this embodiment, the operation decision network transforms the attention-weighted features into specific operation instructions. The operation decision network includes an RNN (Recurrent Neural Network) and an MLP (Multilayer Perceptron). First, spatial features are extracted using attention-based weighted pooling, while global average pooling is calculated as the global context; the two are concatenated to obtain a comprehensive feature representation. To model the temporal dependencies of actions, a recurrent neural network is used to process the feature sequence. Finally, an action prediction head is constructed using the MLP to predict the pose vector, gripper state, and action uncertainty, outputting the action vector, predicted attention map, uncertainty estimate, and updated hidden state.
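The feature construction at the input of the operation decision network can be sketched as below: attention-weighted pooling and global average pooling are computed per channel and concatenated. The function name `pooled_representation` and the normalization by the attention sum are assumptions; the recurrent network and the MLP head are omitted.

```python
def pooled_representation(features, attention):
    """features: C x H x W nested lists; attention: H x W non-negative weights.

    Returns a vector of length 2C: attention-weighted pooling per channel,
    followed by global average pooling per channel.
    """
    h, w = len(attention), len(attention[0])
    total = sum(sum(row) for row in attention)
    attended, global_avg = [], []
    for channel in features:
        weighted_sum = sum(channel[y][x] * attention[y][x]
                           for y in range(h) for x in range(w))
        # Normalize by the attention mass so the result is a weighted mean.
        attended.append(weighted_sum / total if total > 0 else 0.0)
        global_avg.append(sum(sum(row) for row in channel) / (h * w))
    return attended + global_avg


# Toy input (assumed): one channel on a 1 x 2 grid, attention on the right pixel.
vec = pooled_representation([[[1.0, 3.0]]], [[0.0, 1.0]])
```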
[0124] Furthermore, the weight ratio between action loss and attention loss is dynamically adjusted, and the policy is optimized through reinforcement learning with attention consistency as an intrinsic reward. As shown in Figure 7, this includes:
[0125] Construct the total loss function based on the action loss, attention loss, regularization loss, and their corresponding weights.
[0126] During the attention-dominant phase, the attention loss weight is set to be greater than the first attention weight threshold, and the action loss weight is set to be less than the first action loss threshold.
[0127] During the balanced learning phase, a fixed interval value for the action loss weight is used to gradually increase the action loss weight, while an exponential decay function is used to gradually decrease the attention loss weight.
[0128] During the action-dominant phase, a fixed interval value for the action loss weight is continued to be used, and the action loss weight is gradually increased. The attention loss weight is set to be less than the second attention weight threshold, wherein the first attention weight threshold is greater than the second attention weight threshold.
[0129] A reward function is constructed from external and intrinsic rewards, and the dual-path attention fusion robot operation model is optimized using this reward function. The intrinsic reward is obtained using the attention consistency method: the similarity between the model's attention and human attention patterns is measured by intersection over union (IoU), and the information gain of attention guidance and a gaze efficiency reward are calculated.
[0130] In this embodiment, a three-stage training process is adopted to ensure that the model can learn both human operational skills and human visual attention patterns.
[0131] The first phase employs behavior cloning based on curriculum learning.
[0132] Based on the action loss, attention loss, regularization loss, and their corresponding weights (action loss weight, attention loss weight, regularization loss weight), a total loss function is constructed, calculated as follows:
[0133] ;
[0134] Here, the total loss function combines three loss terms, each scaled by its corresponding weight. The action loss includes pose loss, gripper loss, and temporal consistency loss. The attention loss comprehensively evaluates the consistency between model attention and human attention using multiple metrics such as KL (Kullback-Leibler) divergence, structural similarity, gaze coverage loss, and Wasserstein distance. The regularization loss includes sparsity constraints, entropy constraints, and peak constraints to ensure the rationality of the attention map. The three coefficients are the action loss weight, the attention loss weight, and the regularization loss weight.
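Since the formula is not reproduced in this text, one form consistent with the description is a weighted sum of the three losses. The symbol names below are assumptions chosen for readability and are not the patent's notation:

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{act}}\,\mathcal{L}_{\text{act}}
  + \lambda_{\text{att}}\,\mathcal{L}_{\text{att}}
  + \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}}
```

Here \mathcal{L}_{\text{act}}, \mathcal{L}_{\text{att}}, and \mathcal{L}_{\text{reg}} stand for the action, attention, and regularization losses, and the \lambda terms are their respective weights.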
[0135] The key innovation lies in adopting a "learn to see first, then learn to do" learning strategy that guides the learning process by dynamically adjusting the weight coefficients. In the attention-dominant phase, a higher attention loss weight and a lower action loss weight are set so that the model first learns to identify key task regions and establishes task-oriented visual attention. In the balanced learning phase, the action loss weight is gradually increased while the attention loss weight decays exponentially, so that "seeing" and "doing" are learned simultaneously. In the action-dominant phase, the action loss weight is further increased to improve action prediction accuracy, while appropriate attention supervision is retained to maintain a human-like visual pattern. Training uses a suitable optimizer with an initial learning rate and weight decay to prevent overfitting; the learning rate follows a cosine annealing schedule with a warm restart at the end of each phase; and gradient clipping is used to stabilize training.
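The learning rate schedule with a warm restart at the end of each phase can be sketched as follows. The maximum and minimum learning rates and the phase lengths are placeholder values chosen for illustration; the patent does not state them.

```python
import math


def cosine_restart_lr(step, phase_lengths, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing that restarts at lr_max at the start of each phase.

    phase_lengths: number of steps in each training phase (assumed values).
    """
    start = 0
    for length in phase_lengths:
        if step < start + length:
            progress = (step - start) / length  # 0 at phase start, near 1 at phase end
            return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
        start += length
    return lr_min  # past the last phase


phases = [10, 10, 5]  # hypothetical step counts for the three phases
lr_first = cosine_restart_lr(0, phases)      # start of phase 1
lr_mid = cosine_restart_lr(5, phases)        # halfway through phase 1
lr_restart = cosine_restart_lr(10, phases)   # warm restart at start of phase 2
```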
[0136] The second stage, building upon the basic capabilities established by behavioral cloning, further optimizes the strategy using reinforcement learning. The total reward function is calculated as follows:
[0137] ;
[0138] Here, the total reward function combines the external reward, which includes task success rewards, efficiency rewards, and safety rewards, with the intrinsic reward scaled by its corresponding weight. To provide the intrinsic reward, this embodiment quantifies attention consistency by measuring the similarity between the model's attention and human attention patterns using intersection over union (IoU), and calculates the information gain of attention guidance and the gaze efficiency reward. A policy optimization algorithm is employed to implement reinforcement learning. The policy network directly uses the model trained in the first stage, while the value network estimates state values through a multilayer perceptron. The training objective adds entropy regularization to the standard policy objective and value function loss to encourage exploration, and retains the attention loss to maintain consistency with human patterns, ensuring that the model does not deviate from human cognitive priors during autonomous exploration.
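A minimal sketch of the attention consistency term is given below. It assumes that both attention maps are binarized at a threshold before computing intersection over union, and that the total reward is the external reward plus a weighted intrinsic reward; the threshold, the weight value, and the function names are illustrative assumptions. The information gain and gaze efficiency terms mentioned above are not modeled here.

```python
def attention_iou(model_map, human_map, threshold=0.5):
    """IoU between two H x W attention maps after thresholding (assumed binarization)."""
    inter = union = 0
    for m_row, h_row in zip(model_map, human_map):
        for m, h in zip(m_row, h_row):
            m_on, h_on = m >= threshold, h >= threshold
            inter += int(m_on and h_on)
            union += int(m_on or h_on)
    # Two empty masks are treated as perfectly consistent.
    return inter / union if union else 1.0


def total_reward(r_external, r_intrinsic, intrinsic_weight=0.1):
    """Assumed form: external reward plus weighted intrinsic reward."""
    return r_external + intrinsic_weight * r_intrinsic


model_att = [[0.9, 0.7, 0.1]]
human_att = [[0.8, 0.2, 0.1]]
iou = attention_iou(model_att, human_att)
reward = total_reward(1.0, iou)
```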
[0139] The third stage involves hybrid fine-tuning, combining offline expert demonstration data with online autonomous exploration experience for continuous optimization. Two experience pools are maintained: an expert demonstration buffer and an autonomous exploration buffer. The sampling strategy during training dynamically adjusts the proportion of expert data, gradually decreasing it from a relatively high initial proportion to achieve a smooth transition from imitation to autonomy. Optionally, a discriminator network can be introduced to distinguish expert actions from model actions, further improving performance through generative adversarial training, or a meta-learning framework can be employed to enable the model to learn from few samples, supporting rapid adaptation to new tasks.
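The two-buffer sampling with a decaying expert proportion can be sketched as follows. The linear decay from 0.9 to 0.1 and the function names are assumptions for illustration; the patent states only that the proportion starts relatively high and decreases gradually.

```python
import random


def expert_ratio(step, total_steps, start=0.9, end=0.1):
    """Linearly decay the share of expert samples (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(expert_buffer, explore_buffer, batch_size, ratio, rng):
    """Draw a mixed batch, sampling with replacement from each buffer."""
    n_expert = round(batch_size * ratio)
    batch = rng.choices(expert_buffer, k=n_expert)
    batch += rng.choices(explore_buffer, k=batch_size - n_expert)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
expert = [("expert", i) for i in range(20)]     # placeholder demonstration transitions
explore = [("explore", i) for i in range(20)]   # placeholder exploration transitions
batch = sample_batch(expert, explore, 8, 0.75, rng)
```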
[0140] In this embodiment, the training process is divided in fixed proportions: the attention-dominant phase occupies the first 30% to 40% of the total training rounds, the balanced learning phase occupies the middle 30% to 40%, and the action-dominant phase occupies the last 20% to 30%.
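The three-phase weight schedule can be sketched as follows, assuming a 35% / 35% / 30% split (one choice within the stated ranges). The concrete weight values, the per-round increment, and the decay rate are placeholders; the patent specifies only the qualitative behavior (fixed-interval increase of the action loss weight, exponential decay of the attention loss weight, and a lower attention weight in the final phase).

```python
import math


def loss_weights(epoch, total_epochs, p1_pct=35, p2_pct=35,
                 att_high=1.0, att_low=0.1, act_start=0.2, act_step=0.02):
    """Return (action_weight, attention_weight) for a given training round."""
    e1 = total_epochs * p1_pct // 100             # end of attention-dominant phase
    e2 = total_epochs * (p1_pct + p2_pct) // 100  # end of balanced learning phase
    if epoch < e1:
        # Attention-dominant: high attention weight, low action weight.
        return act_start, att_high
    # From the balanced phase onward the action weight grows by a fixed step.
    w_act = act_start + act_step * (epoch - e1 + 1)
    if epoch < e2:
        # Balanced learning: exponential decay from att_high toward att_low.
        decay = math.log(att_high / att_low) / (e2 - e1)
        return w_act, att_high * math.exp(-decay * (epoch - e1))
    # Action-dominant: attention weight held at the lower level.
    return w_act, att_low


first = loss_weights(0, 100)
act_mid, att_mid = loss_weights(50, 100)
act_last, att_last = loss_weights(99, 100)
```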
[0141] The trained dual-path attention fusion robot operation model is deployed to a real robot system with online adaptation enabled. First, the model is compressed and optimized to meet real-time requirements. Quantization is used to reduce the numerical precision of the model, redundant parameters are removed through pruning, and optionally a lightweight student model can be trained using knowledge distillation or the inference process can be accelerated through compiler optimization. The combined application of these techniques reduces inference latency to within the range required for real-time applications.
[0142] After deployment, the robot operation model with dual-path attention fusion generates online attention. The deployment architecture adopts a modular design. The data flow starts with camera acquisition, passes through a preprocessing module for image distortion correction and depth alignment, then the model inference module performs attention prediction and action decision-making, followed by a motion planning module for trajectory optimization and collision detection, and finally, the robotic arm control module executes the actions. A parallel attention visualization module is also maintained for debugging and human-robot interaction. Each module uses an asynchronous processing pipeline, running in threads at different frequencies to optimize real-time performance.
[0143] The system enables autonomous attention generation. Based on current observations and task context, the model predicts the next attention distribution, generates multiple candidate attention maps, and selects the optimal candidate using a value function. The attention generation process integrates task type, execution stage, and historical attention information, enabling the model to generate reasonable attention patterns even in unseen scenarios.
[0144] Online adaptation is implemented to cope with environmental changes. Domain adaptation is achieved by adjusting batch normalization statistics and aligning feature distributions, enabling the model to quickly adapt to data distributions in new environments. Online fine-tuning with small learning rates is performed using failure cases, so that the model learns and improves from errors. Optionally, human attention can be captured during user intervention, incorporating human corrective guidance into the model through contrastive learning. These mechanisms ensure that the model can continuously learn and improve.
[0145] Based on the adaptively adjusted online attention results, the robot performs the corresponding operation.
[0146] A comprehensive fault detection and safety mechanism is implemented. Anomalies are detected by monitoring the information entropy of the attention map; warnings are triggered when the entropy is too high (indicating excessively dispersed attention) or too low (indicating abnormal concentration). When the uncertainty estimate in the model output exceeds a set threshold, the system automatically switches to a conservative mode, reducing movement speed and increasing the safety distance, or requests human intervention. Collision warnings are issued based on the intersection of the attention map and obstacle masks, proactively avoiding potential hazards.
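The entropy-based monitoring and the uncertainty-triggered fallback can be sketched as below. The entropy bounds `h_low` and `h_high`, the uncertainty threshold `u_max`, and the returned mode names are hypothetical; the patent refers only to set thresholds. The collision warning based on obstacle masks is not modeled here.

```python
import math


def attention_entropy(attention):
    """Shannon entropy (in nats) of an H x W attention map normalized to sum to 1."""
    flat = [v for row in attention for v in row]
    total = sum(flat)
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)


def safety_mode(attention, uncertainty, h_low=0.5, h_high=2.5, u_max=0.3):
    """Map attention entropy and model uncertainty to an operating mode."""
    if uncertainty > u_max:
        return "conservative"        # slow down, widen safety distance, or ask for help
    h = attention_entropy(attention)
    if h > h_high:
        return "warn_dispersed"      # attention spread too widely
    if h < h_low:
        return "warn_overfocused"    # attention abnormally concentrated
    return "normal"


uniform_4x4 = [[1.0] * 4 for _ in range(4)]       # entropy ln(16), above h_high
one_hot = [[1.0, 0.0], [0.0, 0.0]]                # entropy 0, below h_low
uniform_2x2 = [[1.0, 1.0], [1.0, 1.0]]            # entropy ln(4), within bounds
```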
[0147] To more clearly describe the specific implementation process of steps S1 to S4, the following examples are provided:
[0148] Example 1: Industrial bin-picking task:
[0149] This example addresses a bin-picking task in an industrial environment. In a bin containing various workpieces, the robotic arm needs to identify the target workpiece and grasp it. Challenges in this scenario include object stacking and occlusion, varying lighting, and a mix of workpieces with different shapes and materials.
[0150] Several skilled operators were invited to perform demonstration tasks while wearing head-mounted devices with eye-tracking capabilities. The data acquisition equipment met the following specifications: an eye-tracking sampling frequency of 90 Hz or higher, an eye-tracking accuracy better than 1.5°, and an RGB-D camera. Before data acquisition, camera intrinsic parameter calibration, eye-tracking-camera extrinsic parameter calibration, and time synchronization calibration were completed as described in step S1. During calibration, multiple calibration points at known locations were set in a virtual or real environment, and eye-tracking data was recorded when the operator gazed at each point. The calibration parameters were then solved using an optimization algorithm.
[0151] In each demonstration task, the operator's eye movement data, including gaze coordinates, saccade events, and gaze duration, a first-person perspective RGB-D image sequence, and the joint angles and gripper states of the robotic arm, are recorded synchronously. A sufficient number of successful demonstrations and some failed cases are collected to cover different scene variations. Following step S2, the collected raw data is preprocessed: linear interpolation is used to align all data streams to a unified time axis; filtering algorithms are applied to smooth the gaze trajectory; saccades and gaze events are identified and labeled; the gaze point is transformed from normalized coordinates to image pixel coordinates; and trials with poor data quality are discarded.
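The time-alignment step can be illustrated with a minimal linear interpolation sketch that resamples one stream onto a unified time axis. The function name `resample_linear`, the clamping at the ends of the recording, and the sample values are assumptions for illustration.

```python
from bisect import bisect_right


def resample_linear(timestamps, values, target_times):
    """Linearly interpolate (timestamps, values) at each target time.

    timestamps must be strictly increasing; targets outside the recorded
    range are clamped to the first or last value.
    """
    out = []
    for t in target_times:
        if t <= timestamps[0]:
            out.append(values[0])
        elif t >= timestamps[-1]:
            out.append(values[-1])
        else:
            i = bisect_right(timestamps, t)  # first index with timestamps[i] > t
            t0, t1 = timestamps[i - 1], timestamps[i]
            alpha = (t - t0) / (t1 - t0)
            out.append(values[i - 1] + alpha * (values[i] - values[i - 1]))
    return out


# Toy gaze x-coordinates sampled at 1 s intervals, resampled onto a finer axis.
gaze_t = [0.0, 1.0, 2.0]
gaze_x = [0.0, 10.0, 20.0]
aligned = resample_linear(gaze_t, gaze_x, [0.5, 1.0, 1.5, 3.0])
```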
[0152] Feature extraction is performed on the preprocessed eye-tracking sequence as described in step S3. First, the gaze point is combined with depth information to transform from the image coordinate system to 3D space, and then transformed to the robot base coordinate system through a calibrated extrinsic parameter matrix. Then, a dynamic Bayesian network model is established for scene understanding. For each time step, a prediction-update loop is executed. The prior probability distribution of the current state is predicted based on the motion model. A perception window is defined around the gaze point, and the window size is adaptively adjusted according to the gaze duration. The posterior probability distribution is calculated based on the observation likelihood and prior probability. The state uncertainty is updated, and the information gain is calculated.
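The prediction-update loop can be sketched for a discrete belief over candidate target objects. The sketch assumes a simple motion model in which the target stays the same with probability `stay_prob`, and it measures information gain as the entropy reduction from prior to posterior; both choices, along with the likelihood values, are assumptions made for illustration. The adaptive perception window is not modeled.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def predict(belief, stay_prob=0.9):
    """Motion model (assumed): keep the target with stay_prob, else switch uniformly."""
    n = len(belief)
    return [stay_prob * b + (1.0 - stay_prob) / n for b in belief]


def update(prior, likelihood):
    """Bayes rule: posterior is proportional to likelihood times prior."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def step(belief, likelihood):
    """One prediction-update cycle; returns (posterior, information_gain)."""
    prior = predict(belief)
    posterior = update(prior, likelihood)
    return posterior, entropy(prior) - entropy(posterior)


# Three candidate workpieces, uniform initial belief; the gaze observation
# (assumed likelihoods) favors the first candidate.
belief = [1.0 / 3.0] * 3
posterior, gain = step(belief, [0.8, 0.1, 0.1])
```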
[0153] Multi-scale saliency maps are generated based on Bayesian posterior probabilities and uncertainty reduction. The spatial saliency map is calculated by combining the posterior probability of a pixel belonging to the target object with the uncertainty reduction of that region. The temporal saliency map applies time-decay weighting to historical gaze points. The task relevance map assigns weights to different sub-targets based on the current task stage. Weighted fusion yields the final gaze-guided saliency map sequence. A dynamic time warping algorithm is used to align the gaze sequence with the action sequence and to identify gaze precursor patterns, that is, the time by which the gaze point's arrival at the target position precedes the corresponding action, in order to verify the predictive power of gaze on action.
[0154] The dual-path attention fusion neural network model is constructed according to step S4. The bottom-up path uses a convolutional neural network as the backbone network and is pre-trained on a large-scale image dataset. Multi-scale features output by the backbone network are fused using a feature pyramid network. A spatial attention module is applied at each scale to generate a data-driven attention map. The top-down path uses a sequence encoder to process the gaze saliency map sequence. Each frame's saliency map is divided into patches and embedded into a high-dimensional space. After adding positional encoding, temporal dependencies are modeled using a multi-layer self-attention mechanism. Contextual information is extracted from the encoded features, and a knowledge-driven attention map is generated through a decoder network. The dual-path fusion module adaptively integrates the attention maps of the two paths using a gating mechanism. The gating weights are learned from the concatenated features through convolutional layers, normalized by an activation function, and used for weighted fusion. The fused attention map must satisfy sparsity, entropy, and peak constraints. The operation decision network transforms the attention-weighted features into operation instructions. Spatial features are extracted through attention-based weighted pooling, temporal dependencies are modeled using a recurrent neural network, and action vectors, including pose, gripper state, and uncertainty estimation, are predicted using a multi-layer perceptron.
[0155] The collected data is divided into training, validation, and test sets such that data from different operators are evenly distributed across the sets. Three-stage training is performed as described in step S4. The first stage employs behavior cloning, with a loss function comprising action loss, attention loss, and regularization loss. Action loss includes pose loss, gripper loss, and temporal consistency loss; attention loss evaluates the consistency between the model's attention and human attention using metrics such as KL divergence, structural similarity, gaze coverage loss, and Wasserstein distance; regularization loss includes sparsity, entropy, and peak constraints. A curriculum learning strategy dynamically adjusts the loss weights across three phases: a higher attention loss weight is set in the attention-dominant phase, the weights are dynamically adjusted in the balanced learning phase, and a higher action loss weight is set in the action-dominant phase. Training is performed using an appropriate optimizer and learning rate scheduling strategy.
[0156] The second stage involves reinforcement learning in a simulated environment. The total reward function includes external task rewards (success, efficiency, and safety rewards) and intrinsic attention rewards (calculated based on attentional consistency, information gain, and gaze efficiency). A policy optimization algorithm is used for training, preserving attentional loss to maintain consistency with human patterns. The third stage involves hybrid fine-tuning. An expert demonstration buffer and an autonomous exploration buffer are maintained, and the sampling ratio is dynamically adjusted during training to achieve a smooth transition from imitation to autonomy.
[0157] After training, the model is optimized. Quantization is used to reduce the numerical precision of the model, redundant parameters are removed through pruning, and knowledge distillation is optionally used to train a lightweight model. The optimized model exhibits significantly reduced inference latency. The model is deployed on an embedded computing platform configured with a processor with neural network acceleration capabilities, an RGB-D camera, a multi-axis collaborative robotic arm, and an end effector. The software architecture employs a modular design, including modules for data acquisition, preprocessing, model inference, motion planning, and control. Autonomous attention generation is implemented: the next attention distribution is predicted based on current observations and task context, multiple candidate attention maps are generated, and the optimal candidate is selected using a value function. Online adaptation is implemented: the model adapts to new environments by adjusting batch normalization statistics, is fine-tuned online with small learning rates using failure cases, and optionally uses attention captured during user intervention for contrastive learning. A safety mechanism is implemented: the information entropy of the attention map is monitored to detect anomalies, and the system switches to a conservative mode or requests human intervention when model uncertainty exceeds a set threshold.
[0158] The deployed system was tested in a real-world environment. Compared with the training data, the test scenarios included different bin layouts, different workpiece combinations, and different lighting conditions. The system demonstrated good grasping ability and adaptability. The attention map generated by the model was consistent with the gaze patterns of human experts, and the attention transfer patterns conformed to typical cognitive sequences. In scenarios with interfering objects, attention correctly focused on the target workpiece. In terms of computational efficiency, the model deployed on an edge computing platform achieved real-time inference, meeting the real-time requirements of industrial applications.
[0159] Example 2: Precision assembly task:
[0160] This example addresses a precision assembly task, such as bolt tightening in electronic product assembly. Challenges of this task include high positioning accuracy requirements, a workflow involving multiple sequential steps, and the need for different vision and force control strategies for each step.
[0161] To address high-precision requirements, improvements are made to the basic method. A hierarchical attention mechanism is designed, where coarse-scale attention is used to locate the approximate area of the target, while fine-scale attention performs pixel-level precise localization within that area. The two levels of attention are dynamically selected based on the current task stage. Multimodal perception information is fused, adding a force sensor data input channel to the model, and defining tactile and visual attention for fusion. The fusion weights differ at different stages: vision dominates in the coarse localization stage, vision and tactile are fused in the fine alignment stage, and tactile weight increases in the manipulation stage. This progressive strategy simulates the human attention shift pattern from vision to touch.
[0162] The basic process of data acquisition, feature extraction, model building, and training is similar to Example 1, except that multimodal data, namely visual and tactile data, must be collected, and corresponding input channels and processing modules must be added to the model. In assembly task testing, the system demonstrated high-precision positioning and operational capabilities. The positioning error met the task requirements, the operation time was comparable to that of human experts, and good force control accuracy ensured assembly quality. The hierarchical attention mechanism and multimodal fusion are the key factors in the performance improvement.
[0163] Example 3: Mobile robot navigation:
[0164] This example focuses on the autonomous navigation task of a mobile robot in a dynamic environment such as an office, requiring real-time avoidance of pedestrians and dynamic obstacles while efficiently planning a path. By analyzing attention data from human teleoperated navigation, typical attention patterns were identified. Proactive attention is characterized by looking ahead along the path in advance; hazard monitoring is characterized by frequent switching of gaze between potential collision points; and multi-target parallel monitoring is characterized by a periodic alternation between distributed and focused attention.
[0165] A spatiotemporal attention module was designed specifically for navigation tasks. This module not only encodes the attention in the current frame but also predicts the expected attention distribution for several steps ahead, enabling path planning to consider future environmental changes. A risk assessment module estimates collision risk based on attention duration and obstacle distance, adjusting behavioral strategies according to the risk level. In long-distance tests, the system maintained good safety and navigation efficiency, keeping the safe distance within a reasonable range, and its navigation efficiency was superior to traditional methods. Pedestrians showed high acceptance of the robot's behavior, indicating that the robot exhibited a human-like navigation pattern, making it easier for pedestrians to understand and predict.
[0166] All steps not mentioned in this example are the same as steps S1 to S4, and will not be repeated here.
[0167] Example 4: Surgical robot assistance:
[0168] This example focuses on assisted surgical tasks using a surgical robot. The scenario is distinctive because of the use of binocular stereo vision, extremely high precision requirements, the need to handle tissue deformation and occlusion, and stringent safety requirements. Experienced operators are invited to demonstrate the procedure on a simulator or training model. The data acquisition system simultaneously records eye-tracking data for each eye independently, as well as the collaborative gaze patterns of the primary operator and the assistant. Binocular data processing involves handling binocular disparity, generating a stereo attention representation, and calculating depth-aware attention based on the disparity map.
[0169] The attention of the operator and assistant often exhibits a complementary pattern. By calculating the correlation between their attention, a collaborative pattern is identified: high correlation indicates a collaborative attention pattern, while low correlation indicates a division of labor pattern. The model learns to recognize these two patterns and adjusts its attention allocation strategy accordingly. On the validation dataset, the model demonstrates high-precision localization and operational capabilities approaching those of human experts, with no safety incidents occurring. This system can reduce the cognitive load on the operator, provides attention cues to help prevent omissions, and can serve as a training aid.
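The correlation-based pattern identification can be sketched as follows. It assumes that the operator's and the assistant's attention maps are compared with the Pearson correlation coefficient and that a hypothetical threshold of 0.5 separates the two patterns; the patent does not specify the correlation measure or the threshold.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def collaboration_pattern(operator_map, assistant_map, threshold=0.5):
    """Classify the pattern from the correlation of two H x W attention maps."""
    flat_op = [v for row in operator_map for v in row]
    flat_as = [v for row in assistant_map for v in row]
    r = pearson(flat_op, flat_as)
    label = "collaborative_attention" if r >= threshold else "division_of_labor"
    return label, r


same_label, same_r = collaboration_pattern([[1.0, 0.0], [0.0, 0.0]],
                                           [[1.0, 0.0], [0.0, 0.0]])
split_label, split_r = collaboration_pattern([[1.0, 0.0], [0.0, 0.0]],
                                             [[0.0, 0.0], [0.0, 1.0]])
```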
[0170] All steps not mentioned in this example are the same as steps S1 to S4, and will not be repeated here.
[0171] Example 5: Intelligent driving assistance system:
[0172] This example focuses on environmental perception and decision-making tasks in an intelligent driving assistance system. Driving scenarios are highly dynamic, requiring drivers to continuously monitor various elements such as roads, vehicles, pedestrians, and traffic signs, while making rapid decisions based on changing road conditions. The challenges of this scenario include the need to quickly switch attention between multiple targets, anticipate potential hazards, and understand the driver's operational intentions.
[0173] Data acquisition utilized driving simulators or real-vehicle testing platforms equipped with eye-tracking capabilities. Experienced drivers were invited to perform driving tasks in various typical scenarios, including urban roads, highways, and complex intersections. Simultaneously, driver eye-tracking data was collected, including gaze sequences and saccade patterns, visual data from forward-facing and surround-view cameras, vehicle status data such as steering wheel angle, accelerator and brake pedals, and turn signals, as well as onboard sensor data including radar and lidar detection results. Analysis of the driver's eye-tracking data revealed typical attention patterns. Road centerline tracking was characterized by the driver's gaze periodically sweeping across the center of the road ahead; hazard detection was characterized by the gaze rapidly shifting towards suddenly appearing obstacles or unusual vehicles; intent prediction was characterized by the driver checking the rearview mirror and target lane before changing lanes; and intersection decision-making was characterized by the gaze rapidly switching between traffic lights, oncoming vehicles, and pedestrians when approaching an intersection.
[0174] Multimodal fusion modeling is implemented to address the characteristics of driving scenarios. Building upon the basic approach, an onboard sensor data input channel is added to fuse information such as target location detected by radar, LiDAR point cloud, and GPS positioning with visual attention. An intent prediction module is designed to predict the driver's operational intent based on the temporal patterns of gaze sequences. For example, lane-changing intent is predicted based on the frequency and duration of the driver's glances at the rearview mirror, and turning intent is predicted based on the pattern of gaze at intersections. A risk assessment mechanism is introduced, combining the duration of the driver's gaze with the target's hazard level. A warning is triggered when the driver fails to gaze at a high-risk area, such as a vehicle or pedestrian ahead, for an extended period.
[0175] The training strategy employs a multi-task learning framework, simultaneously learning three tasks: environmental perception, intent prediction, and risk assessment. The environmental perception task learns human-like scene understanding, identifying key elements such as roads, vehicles, and pedestrians and predicting their movement trends. The intent prediction task learns to infer driving operations from gaze patterns, providing decision-making references for the autonomous driving system. The risk assessment task learns to identify hazards that the driver may overlook, providing proactive safety warnings. Attention consistency loss ensures that the system's attention allocation remains consistent with that of experienced drivers, and adversarial training improves robustness in complex scenarios.
[0176] Once deployed, the system can perform multiple functions. In assisted driving mode, the attention map generated by the system can be used for augmented reality head-up displays, highlighting key areas to assist the driver in allocating attention. In autonomous driving mode, the human-like attention mechanism allows the system to prioritize high-risk areas like a human driver, improving the interpretability and safety of decision-making. The system can also be used for driver status monitoring, detecting fatigue or distraction by comparing real-time attention with normal attention patterns.
[0177] In various test scenarios, the system demonstrated excellent environmental perception and rational decision-making. At complex intersections, the system correctly identified key elements such as traffic lights, pedestrians crossing the street, and oncoming vehicles, and allocated attention appropriately. On highways, the system promptly detected slow-moving vehicles and lane-changing vehicles. In attention monitoring tests, the system accurately detected driver distraction and provided timely warnings. The generated attention map closely matched human driver gaze patterns, making autonomous driving decisions easier to understand and trust.
[0178] All steps not mentioned in this example are the same as steps S1 to S4, and will not be repeated here.
[0179] Example 2:
[0180] This embodiment provides a robot vision operation control system based on eye tracking, which is implemented using the robot vision operation control method based on eye tracking described in the first aspect. It includes: a data acquisition module, a data preprocessing module, a feature extraction module, and an action execution module. The data acquisition module is connected to the data preprocessing module, the data preprocessing module is connected to the feature extraction module, and the feature extraction module is connected to the action execution module.
[0181] The data acquisition module is used to acquire data streams synchronously when the operator performs the operation task using a head-mounted device with eye-tracking function. The acquired data streams include: eye-tracking data streams when the operator performs the operation task and scene visual data streams.
[0182] The data preprocessing module is used to preprocess the acquired data stream.
[0183] The feature extraction module is used to extract gaze-intent coupling features from the preprocessed data stream using a dynamic Bayesian network.
[0184] The action execution module is used to input the preprocessed data stream and the gaze-intent coupling features into the constructed dual-path attention fusion robot operation model to obtain online attention results. Based on the online attention results, the robot performs the corresponding operations. The constructed dual-path attention fusion robot operation model includes a bottom-up data-driven path and a top-down knowledge-driven path. The data-driven path and the knowledge-driven path are adaptively integrated through a gating mechanism, and the weight ratio of action loss to attention loss is dynamically adjusted. A reinforcement learning optimization strategy is applied, with attention consistency used as an intrinsic reward.
[0185] Example 3:
[0186] This embodiment proposes an electronic device, including: one or more processors, and a memory, wherein the memory is used to store instructions, and when the instructions are executed by the one or more processors, the one or more processors execute the aforementioned eye-tracking-based robot vision operation control method.
[0187] The electronic device can be a mobile phone, computer, tablet computer, or similar device, and includes a memory and a processor. The memory stores a computer program which, when executed by the processor, implements the robot vision operation control method based on eye tracking described in the embodiments. It is understood that the electronic device may also include an input/output (I/O) interface and communication components.
[0188] The processor is used to execute all or part of the steps in the eye-tracking-based robot vision operation control method described in the above embodiments. The memory is used to store various types of data, which may include, for example, instructions for any application or method in the electronic device, as well as application-related data.
[0189] The processor can be implemented as an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is used to execute the eye-tracking-based robot vision operation control method described in the above embodiments.
[0190] Example 4:
[0191] This embodiment proposes a computer-readable storage medium that stores executable instructions which, when executed, implement the aforementioned eye-tracking-based robot vision operation control method. If the method is implemented in the form of software functional units and sold or used as an independent product, it can be stored in a computer-readable storage medium.
[0192] The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the eye-tracking-based robot vision operation control method described in the various embodiments of this application.
[0193] The aforementioned storage media include: flash memory, hard disk, multimedia card, card-type memory (e.g., SD (Secure Digital) or DX memory), random-access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic storage, magnetic disk, optical disk, server, application (APP) store, and other media capable of storing program verification codes. These media store computer programs which, when executed by a processor, can implement the steps of the aforementioned eye-tracking-based robot vision operation control method.
[0194] Example 5:
[0195] This embodiment proposes a computer program product, including a computer program or instructions, which, when executed by a processor, implements the aforementioned eye-tracking-based robot vision operation control method.
[0196] Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a computer program product.
[0197] The various embodiments in this application are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.
[0198] The scope of protection of this application is not limited to the embodiments described above. Obviously, those skilled in the art can make various modifications and variations to this disclosure without departing from the scope and spirit of this disclosure. If such modifications and variations fall within the scope of equivalent technology of this disclosure, then the intent of this disclosure also includes such modifications and variations.
Claims
1. A robot vision operation control method based on eye tracking, characterized in that it comprises:
using a head-mounted device with an eye-tracking function to synchronously collect data streams while an operator performs an operation task, the collected data streams including an eye-tracking data stream and a scene visual data stream;
preprocessing the collected data streams;
extracting gaze-intent coupling features from the preprocessed data streams using a dynamic Bayesian network;
inputting the preprocessed data streams and the gaze-intent coupling features into a constructed dual-path attention fusion robot operation model to obtain online attention results, the robot performing corresponding operations based on the online attention results;
wherein the constructed dual-path attention fusion robot operation model includes a bottom-up data-driven path and a top-down knowledge-driven path; the data-driven path and the knowledge-driven path are adaptively integrated through a gating mechanism; the weight ratio of action loss to attention loss is dynamically adjusted; and a reinforcement learning optimization strategy is applied, with attention consistency used as an intrinsic reward;
wherein extracting the gaze-intent coupling features using the dynamic Bayesian network includes:
constructing a dynamic Bayesian network based on the preprocessed scene visual data stream;
transforming the gaze point coordinates into three-dimensional space;
determining the size of the perception window of the gaze point based on the fixation duration;
within the perception window, calculating the Bayesian posterior probability based on the three-dimensional position of the gaze point, the observation likelihood in the dynamic Bayesian network, and the prior probability;
generating a multi-scale saliency map sequence based on the Bayesian posterior probability and the amount of uncertainty reduction; and
performing weighted fusion on the multi-scale saliency map sequence to obtain the gaze-intent coupling features;
wherein the size of the perception window changes exponentially with the fixation duration, including:
when the fixation duration is less than a preset shortest fixation time threshold, using the maximum window as the perception window;
when the fixation duration is greater than a preset longest fixation time threshold, using the minimum window as the perception window;
when the fixation duration is greater than or equal to the preset shortest fixation time threshold and less than or equal to the preset longest fixation time threshold, calculating the perception window by a formula whose terms are the fixation duration of the current gaze point, an adaptive adjustment coefficient, the minimum window, and the maximum window, and whose result is the perception window size at time t.
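As a non-limiting illustration of the three-branch perception-window rule in claim 1, the following Python sketch may help. The claim text here does not reproduce the formula for the middle branch, so the exponential-decay expression below is an assumption, as are the function name `perception_window` and the parameter names `t_short`, `t_long`, `w_min`, `w_max`, and `alpha`.

```python
import math


def perception_window(fix_duration, t_short, t_long, w_min, w_max, alpha):
    """Return a perception window size for one fixation.

    ASSUMPTION: the middle branch uses an illustrative exponential decay
    from w_max towards w_min; the claimed formula itself is not shown here.
    """
    if fix_duration < t_short:
        return w_max  # brief glance: use the maximum window
    if fix_duration > t_long:
        return w_min  # sustained fixation: use the minimum window
    # Between the two thresholds: shrink the window exponentially with duration.
    return w_min + (w_max - w_min) * math.exp(-alpha * (fix_duration - t_short))


# Hypothetical values: durations in seconds, window sizes in pixels.
for duration in (0.05, 0.3, 0.8, 2.0):
    print(duration, perception_window(duration, 0.1, 1.0, 32.0, 256.0, 3.0))
```

With this assumed form the window is continuous at the shortest threshold and decreases monotonically as the fixation lengthens; it may jump at the longest threshold unless `alpha` is chosen large enough.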
2. The robot vision operation control method based on eye tracking according to claim 1, characterized in that the gaze point coordinates are transformed into three-dimensional space as follows:
$$[X, Y, Z]^{T} = d \cdot K^{-1} [u, v, 1]^{T}$$
where $(u, v)$ are the pixel coordinates of the gaze point, $K$ is the intrinsic parameter matrix of the camera in the head-mounted device with the eye-tracking function, $d$ is the corresponding depth value from that camera, and $(X, Y, Z)$ are the coordinates of the gaze point in three-dimensional space.
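The transformation in claim 2 corresponds to the usual pinhole back-projection of a pixel with known depth. A minimal sketch follows, assuming a zero-skew intrinsic matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], so that the inverse of K can be applied in closed form. The function name `backproject_gaze` and the parameter names are hypothetical.

```python
def backproject_gaze(u, v, depth, fx, fy, cx, cy):
    """Back-project a gaze pixel (u, v) with a depth value to camera-frame 3D.

    ASSUMPTION: pinhole camera with zero skew, so multiplying by the inverse
    intrinsic matrix reduces to subtracting the principal point (cx, cy) and
    dividing by the focal lengths (fx, fy).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical 640x480 scene camera, gaze 100 px right of centre at 2 m depth.
print(backproject_gaze(420.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0))
```

A gaze point at the principal point maps onto the optical axis, which gives a quick sanity check for a calibrated device.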
3. The robot vision operation control method based on eye tracking according to claim 1, characterized in that the multi-scale saliency map sequence includes a spatial saliency map, a temporal saliency map, and a task relevance map; and the multi-scale saliency map sequence is weighted and fused to obtain the gaze-intent coupling feature as follows:
$$F = w_{s} S_{s} + w_{t} S_{t} + w_{r} S_{r}$$
where $w_{s}$ is the weighting coefficient of the spatial saliency map, $w_{t}$ is the weighting coefficient of the temporal saliency map, $w_{r}$ is the weighting coefficient of the task relevance map, $S_{s}$ is the spatial saliency map, $S_{t}$ is the temporal saliency map, $S_{r}$ is the task relevance map, and $F$ is the gaze-intent coupling feature.
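The weighted fusion in claim 3 can be sketched as an element-wise weighted sum of three maps of equal size. The sketch below is illustrative only: the function name `fuse_saliency`, the list-of-rows map representation, and the example weights (chosen here to sum to 1, which the claim does not require) are assumptions.

```python
def fuse_saliency(spatial, temporal, task, w_spatial, w_temporal, w_task):
    """Element-wise weighted sum of three same-sized 2D maps (lists of rows)."""
    return [
        [
            w_spatial * s + w_temporal * t + w_task * r
            for s, t, r in zip(row_s, row_t, row_r)
        ]
        for row_s, row_t, row_r in zip(spatial, temporal, task)
    ]


# Hypothetical 1x2 maps: spatial, temporal, and task relevance.
coupling = fuse_saliency([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]], 0.5, 0.25, 0.25)
print(coupling)
```

Locations that score highly in more than one map end up with the largest fused value, which is the intended effect of combining the three cues.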
4. The robot vision operation control method based on eye tracking according to claim 1, characterized in that the bottom-up data-driven path includes: using a convolutional neural network as the backbone network and inputting the preprocessed data stream into the backbone network to obtain multi-scale features; inputting the multi-scale features into a feature pyramid network for fusion to obtain fused multi-scale features; applying an attention module at each scale to weight the fused multi-scale features, obtaining weighted features at different scales; upsampling the weighted features at different scales to obtain features of the same size; and concatenating and fusing the features of the same size to obtain a data-driven attention map.
5. The robot vision operation control method based on eye tracking according to claim 1, characterized in that the top-down knowledge-driven path includes: using a sequence encoder to encode the gaze-intent coupling features to obtain a global context vector; and using a decoder network to decode the global context vector to obtain a knowledge-driven attention map.
6. The robot vision operation control method based on eye tracking according to claim 1, characterized in that adaptively integrating the data-driven path and the knowledge-driven path through a gating mechanism includes: calculating gating weights from the data-driven attention map and the knowledge-driven attention map using a convolutional layer, normalizing the gating weights with an activation function, and obtaining the fused attention map as follows:
$$A_{fused} = gate \odot A_{data} + (1 - gate) \odot A_{know}$$
where $\odot$ represents element-wise multiplication, $gate$ is the gating weight, $A_{data}$ is the data-driven attention map, $A_{know}$ is the knowledge-driven attention map, and $A_{fused}$ is the fused attention map.
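The gating mechanism of claim 6 can be illustrated with a small sketch under several stated assumptions: the convolutional layer is taken to be a 1x1 convolution over the two stacked maps (a per-pixel linear combination with hypothetical parameters `w_data`, `w_know`, and `bias`), the activation is taken to be a sigmoid, and the fusion is taken to be a per-pixel convex combination of the two maps. None of these specifics is confirmed by the claim text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(a_data, a_know, w_data, w_know, bias):
    """Fuse two same-sized attention maps (lists of rows) with a per-pixel gate.

    ASSUMPTION: gate = sigmoid(1x1 convolution of the two stacked maps), and
    fused = gate * a_data + (1 - gate) * a_know at every pixel.
    """
    fused = []
    for row_d, row_k in zip(a_data, a_know):
        out_row = []
        for d, k in zip(row_d, row_k):
            gate = sigmoid(w_data * d + w_know * k + bias)
            out_row.append(gate * d + (1.0 - gate) * k)
        fused.append(out_row)
    return fused


# With zero weights and bias the gate is 0.5 everywhere, so the maps are averaged.
print(gated_fusion([[1.0, 0.0]], [[0.0, 1.0]], 0.0, 0.0, 0.0))
```

Because the gate lies between 0 and 1, each fused value stays between the two input values at that pixel, and a learned gate can favor the data-driven path in some regions and the knowledge-driven path in others.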
7. The robot vision operation control method based on eye tracking according to claim 1, characterized in that dynamically adjusting the weight ratio of action loss to attention loss, and applying a reinforcement learning optimization strategy with attention consistency as an intrinsic reward, includes:
constructing a total loss function from the action loss, the attention loss, the regularization loss, and their corresponding weights;
during the attention-dominant phase, setting the attention loss weight greater than a first attention weight threshold and the action loss weight less than a first action loss threshold;
during the balanced learning phase, gradually increasing the action loss weight by a fixed interval value while gradually decreasing the attention loss weight with an exponential decay function;
during the action-dominant phase, continuing to gradually increase the action loss weight by the fixed interval value and setting the attention loss weight less than a second attention weight threshold, wherein the first attention weight threshold is greater than the second attention weight threshold;
constructing a reward function from an external reward and an intrinsic reward, and optimizing the dual-path attention fusion robot operation model using the reward function;
wherein the intrinsic reward is obtained using attention consistency: the similarity between the model's attention and human attention patterns is measured by intersection over union (IoU), and the attention-guided information gain and the gaze efficiency reward are calculated.
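By way of illustration only, the three-phase weight schedule and the IoU-based similarity in claim 7 might be sketched as follows. Every numeric default (`w_att_high`, `w_act_low`, `act_step`, `att_decay`, `w_att_low`, `w_act_max`, and the binarization `threshold`) is a hypothetical value chosen for the example, as are the function names and the cap on the action weight. The information gain and gaze efficiency terms of the intrinsic reward are not covered.

```python
import math


def loss_weights(epoch, phase1_end, phase2_end, w_att_high=0.8, w_act_low=0.2,
                 act_step=0.05, att_decay=0.3, w_att_low=0.05, w_act_max=1.0):
    """Return (action_weight, attention_weight) for a training epoch.

    ASSUMPTION: phases are delimited by epoch indices; all defaults are
    illustrative and not taken from the patent.
    """
    if epoch < phase1_end:  # attention-dominant phase
        return w_act_low, w_att_high
    steps = epoch - phase1_end + 1  # epochs elapsed since phase 1 ended
    # Fixed-interval increase of the action weight, capped for stability.
    w_act = min(w_act_max, w_act_low + act_step * steps)
    if epoch < phase2_end:  # balanced learning phase: exponential decay
        w_att = w_att_high * math.exp(-att_decay * steps)
    else:  # action-dominant phase: small fixed attention weight
        w_att = w_att_low
    return w_act, w_att


def attention_iou(model_map, human_map, threshold=0.5):
    """Intersection over union of two maps after binarizing at `threshold`."""
    inter = 0
    union = 0
    for row_m, row_h in zip(model_map, human_map):
        for m, h in zip(row_m, row_h):
            a = m >= threshold
            b = h >= threshold
            inter += int(a and b)
            union += int(a or b)
    # Two empty masks are treated as full agreement (a convention chosen here).
    return inter / union if union else 1.0


for epoch in (0, 10, 15, 25):
    print(epoch, loss_weights(epoch, phase1_end=10, phase2_end=20))
print(attention_iou([[0.9, 0.1], [0.8, 0.2]], [[0.7, 0.6], [0.1, 0.0]]))
```

The IoU value could serve as the attention-consistency component of an intrinsic reward, added to the external task reward with a weighting that would need to be tuned.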
8. A robot vision operation control system based on eye tracking, implemented using the robot vision operation control method based on eye tracking as described in any one of claims 1 to 7, characterized in that it comprises:
a data acquisition module, used to synchronously acquire data streams, by means of a head-mounted device with an eye-tracking function, while an operator performs an operation task, the acquired data streams including an eye-tracking data stream and a scene visual data stream;
a data preprocessing module, used to preprocess the acquired data streams;
a feature extraction module, used to extract gaze-intent coupling features from the preprocessed data streams using a dynamic Bayesian network;
an action execution module, used to input the preprocessed data streams and the gaze-intent coupling features into a constructed dual-path attention fusion robot operation model to obtain online attention results, the robot performing corresponding operations based on the online attention results;
wherein the constructed dual-path attention fusion robot operation model includes a bottom-up data-driven path and a top-down knowledge-driven path; the data-driven path and the knowledge-driven path are adaptively integrated through a gating mechanism; the weight ratio of action loss to attention loss is dynamically adjusted; and a reinforcement learning optimization strategy is applied, with attention consistency used as an intrinsic reward;
wherein extracting the gaze-intent coupling features using the dynamic Bayesian network includes:
constructing a dynamic Bayesian network based on the preprocessed scene visual data stream;
transforming the gaze point coordinates into three-dimensional space;
determining the size of the perception window of the gaze point based on the fixation duration;
within the perception window, calculating the Bayesian posterior probability based on the three-dimensional position of the gaze point, the observation likelihood in the dynamic Bayesian network, and the prior probability;
generating a multi-scale saliency map sequence based on the Bayesian posterior probability and the amount of uncertainty reduction; and
performing weighted fusion on the multi-scale saliency map sequence to obtain the gaze-intent coupling features;
wherein the size of the perception window changes exponentially with the fixation duration, including:
when the fixation duration is less than a preset shortest fixation time threshold, using the maximum window as the perception window;
when the fixation duration is greater than a preset longest fixation time threshold, using the minimum window as the perception window;
when the fixation duration is greater than or equal to the preset shortest fixation time threshold and less than or equal to the preset longest fixation time threshold, calculating the perception window by a formula whose terms are the fixation duration of the current gaze point, an adaptive adjustment coefficient, the minimum window, and the maximum window, and whose result is the perception window size at time t.