A multimodal large-model robot control method and system based on eye-tracking feature enhancement

The eye-tracking-enhanced multimodal large-model robot control method resolves the decision ambiguity that existing systems exhibit under vague commands, enables efficient execution and intent understanding for complex tasks, lowers the barrier to operation for users, and improves the accuracy and adaptability of robot control.

CN122299652A · Pending · Publication Date: 2026-06-30 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SOUTHEAST UNIV
Filing Date
2026-04-30
Publication Date
2026-06-30

Abstract

This application discloses a multimodal large-model robot control method and system based on eye-tracking feature enhancement. The method includes: multimodal teaching data acquisition; spatiotemporal encoding of gaze features; construction of a teleoperation dataset; construction of a multimodal large model; model training; and intent recognition and real-time inference. Built on the OpenVLA large model, the method introduces visual gaze features to achieve a natural interaction mode of "one-time command issuance and long-term intent guidance." Furthermore, through a "Gaze-Guided Loss" function, it applies strong supervision, based on human visual priors, to the large model's attention mechanism at the training level. While retaining the cross-task, zero-shot generalization capabilities of the OpenVLA large model, the method significantly enhances the system's adaptability to complex and unstructured environments.

Description

Technical Field

[0001] This application belongs to the technical fields of embodied intelligence, human-computer interaction, AI vision, and machine learning algorithms. In particular, it relates to a multimodal large-model robot control method and system based on eye-tracking feature enhancement, used to control a robotic arm through human visual gaze features under vague semantic commands to complete daily assistive actions.

Background Technology

[0002] With the increasing demand for elderly and disabled assistance and family services, collaborative robots that can assist humans in completing daily tasks have become a research hotspot in the field of robotics. For people with disabilities or the elderly with motor dysfunction, using eye-tracking technology as an interaction medium to control robots to perform tasks such as grasping and moving through eye gaze is an important technical approach to establish efficient human-robot collaboration and improve self-care abilities.

[0003] Currently, most mainstream eye-tracking robot systems are based on pre-programmed control or pose generation technology based on 3D point clouds. In terms of interaction logic, the system typically maps eye-tracking coordinates to a 3D point cloud space, triggering a pre-set action script through gaze, or generating a grasping pose based on the geometric center of the target object. However, this approach has significant limitations: due to its heavy reliance on structured environment modeling and pre-set action libraries, the system can only complete simple "point-to-point" grasping tasks, making it difficult to handle unstructured daily tasks with continuous and complex movements, such as folding clothes, wiping a table, or fine assembly. Furthermore, traditional solutions lack an understanding of the user's underlying semantic intent; in complex scenarios such as object stacking or target occlusion, coordinate mapping alone is insufficient to achieve high-precision motion planning and real-time interaction.

[0004] In recent years, the emergence of Vision-Language-Action (VLA) large-scale models in the field of embodied intelligence has provided new possibilities for solving the above problems. VLA models, pre-trained on large-scale multimodal data, possess powerful semantic understanding and action generalization capabilities, enabling them to perform various complex daily auxiliary actions such as "folding clothes" and "sorting garbage" based on natural language instructions. However, in practical deployment, VLA models heavily rely on explicit and detailed text input (Prompting). Users typically need to input lengthy instructions containing target attributes, spatial relationships, and action details (such as "grab the red apple on the left and put it in the blue plate") to drive the robot to produce accurate actions.

[0005] For disabled users or those requiring frequent assistance, continuously inputting precise descriptive commands presents a high operational barrier and cognitive load. In actual interactions, users tend to issue vague commands with incomplete semantics, such as "move it," "pick it up," or "put it away." Existing systems, when processing such vague commands, often suffer from ambiguous decision-making due to the lack of spatial anchors and correspondences with specific targets. This results in the robot being unable to make the correct choice among multiple similar targets, and also failing to capture instantaneous changes in user intent during task execution (such as suddenly changing the target during grasping).

[0006] Therefore, there is an urgent need for a new robot control scheme that combines the multi-task generalization advantages of VLA large models with the intuitive spatial attributes of eye-tracking interaction. By introducing visual prior features such as dynamic gaze heatmaps, such a scheme can resolve target disambiguation under ambiguous commands and achieve dynamic intent switching and implicit task reconstruction without continuous command input, thereby breaking through the bottleneck of traditional interaction methods in fine-grained task execution and ease of use.

Summary of the Invention

[0007] To address the aforementioned technical problems, this application provides a multimodal large-model robot control method and system based on eye-tracking feature enhancement.

[0008] The technical solution provided in this application is as follows.

[0009] In a first aspect, a multimodal large-model robot control system based on eye-tracking feature enhancement is provided, including:
a scene camera, used to acquire global-view RGB image information covering the operation area;
a collaborative robotic arm, used to perform specific tasks, whose state is represented by a multi-dimensional end-effector pose vector;
a local camera, mounted on the end effector of the robotic arm, used to acquire close-up local RGB images of the target object and the corresponding depth visual features;
a display device, used to display images;
a screen-based eye tracker, used to acquire the user's gaze behavior within the display area of the display device at high frequency; and
a host computer computing platform, communicatively connected to the scene camera, the local camera, the screen-based eye tracker, and the collaborative robotic arm.

[0010] In a second aspect, a multimodal large-model robot control method based on eye-tracking feature enhancement is provided. The method is implemented on the system described in the first aspect and includes:
Multimodal teaching data acquisition: through teleoperation and multi-task scenario construction, acquire global-view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and the corresponding multi-dimensional robotic arm pose data;
Spatiotemporal encoding of gaze features: perform spatiotemporal encoding of gaze features on the gaze point trajectory maps and fuse the result with the global-view RGB image to obtain a four-channel multimodal input tensor;
Teleoperation dataset construction: construct a multimodal fusion teleoperation dataset;
Multimodal large model construction: use OpenVLA as the backbone network and a ViT encoder as the visual encoder, connect them through connectors, and introduce a LoRA adapter to construct the multimodal large model;
Model training: train the multimodal large model using the teleoperation dataset and the gaze-guided loss function;
Intent recognition and real-time inference: based on the vague control commands issued by the user, use the trained multimodal large model to couple dynamic gaze heatmap features with near-field RGB image features in real time and output robot control commands in real time.

[0011] In one possible implementation, the multi-task scenarios include task scenarios for target disambiguation tasks, dynamic intent switching tasks, and implicit intent tasks. The task scenarios are constructed as follows:
Target disambiguation task scenario: when the workspace contains multiple similar or identical objects and a vague semantic instruction without fine-grained object attributes is received, the target of the vague instruction is resolved by identifying, in the teaching data, the real-time gaze coverage features of the specific target on which the user's gaze point falls.
Dynamic intent switching task scenario: while the robotic arm executes a predetermined task path, the spatial offset of the user's gaze point is monitored in real time. If the gaze point suddenly moves from the initial target to a new target, the robotic arm smoothly changes its trajectory in mid-air, guided by the gaze, to switch the operation target without the command being re-entered.
Implicit intent task scenario: by extracting the temporal logic of the user's gaze point, which looks first at the starting operation point and then at the target placement point, a received minimal and ambiguous instruction with missing action elements is transformed into a complete composite action sequence containing two stages: grasping and placing.

[0012] In one possible implementation, performing spatiotemporal encoding of gaze features on the gaze point trajectory map and fusing the global-view RGB image to obtain a four-channel multimodal input tensor includes:
extracting all gaze point trajectories within a preset duration window preceding the timestamp of the current video frame to obtain multiple trajectory maps;
assigning each gaze point a weighting coefficient that decays over time, based on the time difference between the sampling time of the gaze point and the current frame;
constructing, within a spatial threshold, a two-dimensional Gaussian kernel function centered on the coordinates of each gaze point;
applying the weighted Gaussian kernel function to spatially diffuse the gaze points in each trajectory map, and accumulating all diffusion results in image space to obtain a single-channel gaze heatmap; and
stacking the single-channel gaze heatmap as the fourth channel, in the position of the alpha channel, onto the global-view RGB image, thus constructing a four-channel multimodal input tensor that incorporates human visual attention guidance features.
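
The final stacking step can be sketched as follows. This is a minimal NumPy illustration, not the application's implementation: the function name `fuse_rgb_and_gaze`, the 8-bit RGB input, and the peak normalization of the heatmap to [0, 1] are all assumptions made for the example.

```python
import numpy as np

def fuse_rgb_and_gaze(rgb: np.ndarray, heatmap: np.ndarray) -> np.ndarray:
    """Stack an H x W x 3 RGB frame and an H x W gaze heatmap into H x W x 4."""
    if rgb.shape[:2] != heatmap.shape:
        raise ValueError("heatmap must match the image height and width")
    rgb_f = rgb.astype(np.float32) / 255.0   # assumes 8-bit RGB input
    gaze = heatmap.astype(np.float32)
    peak = float(gaze.max())
    if peak > 0:                             # assumed normalization: scale by the peak
        gaze = gaze / peak
    # The heatmap occupies the fourth (alpha-position) channel.
    return np.concatenate([rgb_f, gaze[..., None]], axis=-1)

rgb = np.full((4, 6, 3), 255, dtype=np.uint8)
heat = np.zeros((4, 6))
heat[1, 2] = 2.0
x = fuse_rgb_and_gaze(rgb, heat)
```

A pretrained visual encoder normally expects three input channels, so feeding it a four-channel tensor would require adapting its first layer; how the application handles this is not stated in the text above.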

[0013] Furthermore, the weighting coefficient is:

$$w_i = \exp\left(-\alpha\,(t_c - t_i)\right)$$

[0014] where $w_i$ is the weighting coefficient, $t_c$ is the current frame time, $t_i$ is the sampling time of the i-th gaze point, and $\alpha$ is the time decay factor.

[0015] Furthermore, the two-dimensional Gaussian kernel function is:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}}\exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)$$

[0016] where $G(x, y)$ is the weight value at position (x, y), (x, y) is the coordinate offset of a position within the kernel relative to the kernel center, and $\sigma$ is the standard deviation of the Gaussian kernel, which controls the spatial diffusion range. The weighted Gaussian kernel function is:

$$H(x, y) = \sum_{i=1}^{N} w_i\, G(x - x_i,\ y - y_i)$$

[0017] where $H(x, y)$ is the accumulated weight value at position (x, y), $(x_i, y_i)$ are the coordinates of the i-th gaze point, $w_i$ is its weighting coefficient, and N is the number of gaze points.
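
The accumulation of weighted Gaussian kernels over all gaze points can be illustrated with a short NumPy sketch. It assumes a normalized isotropic Gaussian and evaluates each kernel over the whole image grid for simplicity (no spatial threshold); the function name `gaze_heatmap` and the sample values are hypothetical.

```python
import numpy as np

def gaze_heatmap(points, weights, height, width, sigma):
    """Accumulate sum_i w_i * G(x - x_i, y - y_i) over the image grid."""
    ys, xs = np.mgrid[0:height, 0:width]      # pixel row and column indices
    heat = np.zeros((height, width), dtype=np.float64)
    norm = 2.0 * np.pi * sigma ** 2           # assumed Gaussian normalization
    for (px, py), w in zip(points, weights):
        d2 = (xs - px) ** 2 + (ys - py) ** 2  # squared offset from the gaze point
        heat += w * np.exp(-d2 / (2.0 * sigma ** 2)) / norm
    return heat

# Two gaze points: an older one with a small weight, a recent one with a large weight.
heat = gaze_heatmap([(10, 5), (20, 15)], [0.2, 1.0], height=24, width=32, sigma=2.0)
peak_row, peak_col = (int(v) for v in np.unravel_index(heat.argmax(), heat.shape))
```

Because the recent gaze point carries the larger weight, the peak of the map sits on it, which is the intended effect of combining the time decay with the spatial diffusion.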

[0018] In one possible implementation, the gaze-guided loss function is:

$$\mathcal{L} = \mathcal{L}_{action} + \lambda\,\mathcal{L}_{gaze}$$

[0019] where $\mathcal{L}_{action}$ is the loss term for predicting the robotic arm's motion; $\lambda$ is the adjustment weight coefficient; and $\mathcal{L}_{gaze}$ is the attention-guided loss term based on human visual priors, calculated by extracting the self-attention weight matrix A from the last Transformer block of the ViT visual encoder and using the Kullback-Leibler (KL) divergence to measure the information difference between the spatial distribution of A and the normalized gaze heatmap distribution H.
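
A combined loss of this kind can be sketched in NumPy as below. Several points are assumptions not fixed by the text: the KL direction (here from the heatmap H to the attention map A), the value of the weight `lam`, the epsilon smoothing, and the premise that A has already been reduced to a single spatial map on the same grid as the downsampled heatmap. The function names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two non-negative maps, each normalized to sum to 1."""
    p = np.asarray(p, dtype=np.float64).ravel() + eps
    q = np.asarray(q, dtype=np.float64).ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def gaze_guided_loss(action_loss, attention, heatmap, lam=0.1):
    """Total loss = action loss + lam * KL(heatmap || attention) (assumed direction)."""
    return action_loss + lam * kl_divergence(heatmap, attention)

attn_uniform = np.full((4, 4), 1.0 / 16)   # attention spread evenly over 16 patches
heat = np.zeros((4, 4))
heat[1, 2] = 1.0                           # gaze concentrated on one patch
loss_mismatch = gaze_guided_loss(0.5, attn_uniform, heat)
loss_match = gaze_guided_loss(0.5, heat, heat)
```

When the attention map already matches the gaze distribution the KL term vanishes and only the action loss remains; the further the attention drifts from the gaze, the larger the penalty.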

[0021] In a third aspect, a multimodal large-model robot control device based on eye-tracking feature enhancement is provided, including:
an acquisition module, used to acquire global-view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and the corresponding multi-dimensional robotic arm pose data through teleoperation and multi-task scenario construction;
an encoding module, used to perform spatiotemporal encoding of gaze features on the gaze point trajectory maps and fuse the result with the global-view RGB image to obtain a four-channel multimodal input tensor;
a dataset building module, used to build the multimodal fusion teleoperation dataset;
a multimodal large model building module, used to build a multimodal large model with OpenVLA as the backbone network and a ViT encoder as the visual encoder, connected through connectors, with a LoRA adapter introduced;
a training module, used to train the multimodal large model using the teleoperation dataset and the gaze-guided loss function; and
an output module, used to infer and output robot control commands in real time by coupling dynamic gaze heatmap features with near-field RGB image features, based on the vague control commands given by the user and the trained multimodal large model.

[0022] In a fourth aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the multimodal large-model robot control method based on eye-tracking feature enhancement as described in the second aspect.

[0023] In a fifth aspect, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multimodal large-model robot control method based on eye-tracking feature enhancement as described in the second aspect.

[0024] This application has the following beneficial effects: (1) It greatly reduces the complexity of interaction and the cognitive load on users: The method provided in this application achieves a natural interaction mode of "one-time command issuance and long-term intention guidance" by introducing visual gaze features. Compared with traditional large-scale model control schemes that require continuous input of detailed and precise descriptive text commands (such as specifying the color, shape, and orientation of objects), the method provided in this application allows users to complete complex target disambiguation and task completion simply by using semantically incomplete and vague commands such as "move it" or "grab it" in conjunction with eye gaze. This not only significantly improves interaction efficiency but also specifically addresses the technical pain point that users with disabilities or in high-frequency usage scenarios find it difficult to provide detailed descriptions through language or text, making human-computer collaboration more intuitive and embodied.

[0025] (2) It significantly improves the model's accuracy and generalization ability in intent understanding: through a "Gaze-Guided Loss" function, the method provided in this application applies strong supervision, based on human visual priors, to the attention mechanism of the large model at the training level. A KL divergence constraint forces the self-attention map within the model to coincide with the real human eye-tracking heatmap, enabling the model to focus on key interactive regions in the scene "like a human." Combined with the LoRA fine-tuning architecture, the method significantly enhances the system's adaptability to complex unstructured environments (such as object stacking and occlusion) while retaining the powerful cross-task, zero-shot generalization capabilities of the OpenVLA large model, ensuring extremely high intent recognition accuracy even when facing unseen objects or unfamiliar scenes.

Attached Figure Description

[0026] Figure 1 is a schematic diagram of the composition of the multimodal large-model robot control system based on eye-tracking feature enhancement provided in an embodiment of this application;
Figure 2 is a schematic diagram of the system composition for constructing a teleoperation dataset provided in an embodiment of this application;
In Figures 1 and 2: 1-collaborative robotic arm, 2-scene camera, 3-screen-based eye tracker, 4-local camera, 5-host computer, 6-eye-tracking heatmap, 7-target object, 8-teleoperation device;
Figure 3 is a flowchart of the multimodal large-model robot control method based on eye-tracking feature enhancement provided in an embodiment of this application;
Figure 4 is a schematic diagram of the three graded task scenarios used during dataset construction, provided in an embodiment of this application;
In Figure 4: 1-target disambiguation task, 2-dynamic intent switching task, 3-implicit intent task;
Figure 5 is a framework diagram of the training and usage process of the multimodal large model based on eye-tracking feature enhancement provided in an embodiment of this application;
Figure 6 is a structural diagram of the multimodal large-model robot control device based on eye-tracking feature enhancement provided in an embodiment of this application;
Figure 7 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.

Detailed Implementation

[0027] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0028] In the description of this application, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the stated features. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0029] In the description of this application, it should also be noted that, unless otherwise expressly specified and limited, the terms "set up," "install," "connect," and "link" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this technology based on the specific circumstances.

[0030] In the description of this application, spatial relation terms such as "below," "under," "beneath," "above," and "over" are used to describe the relationship between one element or feature shown in the figures and other elements or features. It should be understood that, in addition to the orientation shown in the figures, spatial relation terms also cover different orientations of the device in use and operation. For example, if the device in the figures is flipped, an element or feature described as "below" or "under" other elements or features will be oriented "above" those elements or features. Therefore, the exemplary terms "below" and "under" can cover both upper and lower orientations. Furthermore, the device may take other orientations (e.g., rotated 90 degrees), and the spatial descriptive terms used herein are interpreted accordingly.

[0031] In the description of this application, the term "for example" is used to mean "used as an example, illustration, or description." Any embodiment described as "for example" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use this application. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that this application can be made without using these specific details. In other instances, well-known structures and processes will not be described in detail to avoid unnecessarily obscuring the description of this application. Therefore, this application is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.

[0032] Currently, existing robot operation control methods that rely on semantically ambiguous instructions often produce ambiguous robot decisions because they lack a correspondence between spatial anchor points and specific targets. They cannot make the correct choice among multiple similar targets, nor can they capture instantaneous changes in user intent during task execution (such as a sudden change of target during grasping). This results in problems such as poor generalization ability, a high error rate, and a high barrier to use.

[0033] Therefore, embodiments of this application provide a multimodal large-model robot control method and system based on eye-tracking feature enhancement.

[0034] Referring to Figures 1 and 2, the multimodal large-model robot control system based on eye-tracking feature enhancement provided in an embodiment of this application includes:
a scene camera, used to acquire global-view RGB image information covering the operation area;
a collaborative robotic arm, used to perform specific tasks, whose state is represented by a multi-dimensional end-effector pose vector;
a local camera, mounted on the end effector of the robotic arm, used to acquire close-up local RGB images of the target object and the corresponding depth visual features;
a display device, used to display images;
a screen-based eye tracker, used to acquire the user's gaze behavior within the display area of the display device at high frequency; and
a host computer computing platform, communicatively connected to the scene camera, the local camera, the screen-based eye tracker, and the collaborative robotic arm.

[0035] Specifically, the scene camera is an Intel RealSense RGB-D depth camera (for example, the D435 series) mounted on a tripod in the work area to capture a global view of the operation area, including a panoramic view of the robotic arm's operation.

[0036] Specifically, the local camera is mounted above the gripper of the robotic arm's end effector and uses an Intel RealSense (such as the D435 series) RGB-D depth camera. This camera moves synchronously with the robotic arm, i.e., an "Eye-in-hand" deployment method, and is specifically used to acquire close-up local RGB images of the target object to be grasped and the corresponding depth visual features, providing high-resolution object material and fine pose cues for large models.

[0037] Specifically, the screen-based eye tracker uses the Tobii Pro Fusion screen-based eye tracker, which is deployed directly below the computer display screen. This eye tracker uses infrared light source capture technology to synchronously collect user gaze behavior within the screen display area at high frequency (up to 120Hz or 250Hz). The collected data includes the pixel position of the gaze point in the screen coordinate system, the duration of gaze, and the trajectory characteristics of eye saccades.

[0038] Specifically, the collaborative robotic arm adopts the AgileX Piper series six-axis collaborative robotic arm to perform specific precision tasks such as grasping, moving, and palletizing. The real-time state of the robotic arm is characterized by a 7-dimensional end effector pose vector [x,y,z,roll,pitch,yaw,gripper], which includes the three-dimensional coordinates of the end effector in the base coordinate system, the Euler angle pose, and the opening degree of the gripper.
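
For illustration, the 7-dimensional state can be held in a small typed container. This is a sketch only: the class name `EndEffectorState` and the units noted in the comments (metres, radians, gripper opening normalized to [0, 1]) are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass, astuple
from typing import List, Sequence

@dataclass(frozen=True)
class EndEffectorState:
    """[x, y, z, roll, pitch, yaw, gripper] in the robot base coordinate system."""
    x: float        # position, assumed metres
    y: float
    z: float
    roll: float     # Euler angles, assumed radians
    pitch: float
    yaw: float
    gripper: float  # opening degree, assumed normalized to [0, 1]

    def to_vector(self) -> List[float]:
        return list(astuple(self))

    @classmethod
    def from_vector(cls, values: Sequence[float]) -> "EndEffectorState":
        if len(values) != 7:
            raise ValueError("expected a 7-dimensional pose vector")
        return cls(*(float(v) for v in values))

state = EndEffectorState.from_vector([0.3, 0.0, 0.2, 0.0, 1.57, 0.0, 0.5])
```

Keeping the field order fixed in one place avoids silently swapping, for example, pitch and yaw when the same vector is used both as a training label and as a control command.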

[0039] Specifically, the host computer computing platform is a workstation equipped with a high-performance graphics processing unit (GPU), with its operating system installed as Ubuntu 22.04 and equipped with the ROS2 distributed communication framework to coordinate the data flow between various sensor hardware and the robotic arm drive.

[0040] Specifically, the global RGB image captured by the scene camera is transmitted in real time to the display terminal (equipped with a screen-type eye tracker) connected to the host computer, and the host computer processing program overlays the dynamic heat map features of the user's gaze point on the display interface in real time to intuitively present the current gaze guidance area.

[0041] Specifically, a Docker virtualized container environment is built on the host computer computing platform, and the multimodal large model policy network provided in this embodiment is deployed inside this container. The network is based on the OpenVLA (Vision-Language-Action) architecture and incorporates a lightweight LoRA (Low-Rank Adaptation) fine-tuning layer dedicated to eye-tracking-guided task training. The Docker container design ensures the independence and portability of the large model's runtime environment (including the PyTorch deep learning framework, the GPU driver interface, etc.).
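
The idea behind a LoRA layer can be shown with a framework-free NumPy sketch: the pretrained weight stays frozen and only a low-rank update, scaled by alpha / rank, is trained. The class name `LoRALinear`, the rank, alpha, and initialization scale are illustrative assumptions and do not reflect how OpenVLA or any particular library implements it.

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and A, B trainable."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.arange(12, dtype=np.float64).reshape(3, 4)
layer = LoRALinear(W, rank=2, alpha=4.0)
x = np.ones((5, 4))
y_initial = layer(x)          # B is zero, so this equals the frozen layer's output
layer.B[:] = 1.0              # stand-in for a training update
y_adapted = layer(x)
```

Because B starts at zero, fine-tuning begins from exactly the pretrained behaviour, and the number of trainable parameters is rank x (in_dim + out_dim) rather than in_dim x out_dim.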

[0042] During actual operation, the host computer platform receives fuzzy control commands from the user and simultaneously couples gaze heatmap features generated by the scene camera and eye tracker, as well as near-field features captured by local cameras. This multi-dimensional information is then fused in a multimodal tokenized manner within a policy network within a Docker container, enabling end-to-end real-time prediction of the Piper robotic arm's 7-dimensional motion vector. The resulting control commands are sent to the robotic arm controller via serial port, driving the robotic arm to complete precise operational tasks aligned with the user's visual intent.

[0043] The following section introduces a control method for a multimodal large model robot based on the above system.

[0044] Referring to Figure 3, the multimodal large-model robot control method provided in this application includes the following steps:
S301. Multimodal teaching data acquisition.
In one possible implementation, S301 includes:
S301a. Establish task scenarios covering target disambiguation tasks, dynamic intent switching tasks, and implicit intent tasks.
S301b. In the task scenarios, execute grasping tasks by teleoperation, supplemented by vague semantic commands.
S301c. Acquire the global-view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and corresponding multi-dimensional robotic arm pose data during the grasping tasks.

[0045] Furthermore, in S301a, the target disambiguation task scenario involves the identification and confirmation of targets that include multiple objects of the same category but with different appearance attributes (such as color, texture, and size).

[0046] For example, referring to item 1 of Figure 4, red and green apples and blue and red cubes are used as the objects of the operation. The operator inputs a vague command, such as "grab the fruit" or "pick up the cube," and fixates on one specific object (e.g., the red apple) to drive the robotic arm to grasp it. This scenario aims to strengthen the generalization of the subsequent multimodal large model to eye-tracking logic. In subsequent testing, the test set is changed to previously unseen objects, such as a "yellow toy" and a "green toy," with the vague command "take that one." The model is then observed to determine whether it can accurately perform the task based solely on the coverage features of the gaze point on the yellow toy, rather than relying on color priors.

[0047] Furthermore, in S301b, the dynamic intent switching task scenario is as follows: If the gaze point suddenly changes from the initial target to a new target while the robotic arm is executing a predetermined task path, the robotic arm can smoothly change its trajectory in the air based on eye guidance to switch the operation object without having to re-enter ambiguous semantic instructions.

[0048] For example, referring to item 2 of Figure 4, two targets A and B (e.g., a cup and a bottle) are placed in the workspace more than 20 cm apart. The robotic arm initially moves towards target A. When about 50% of the path has been covered, the operator's gaze suddenly switches from target A to target B, and the robotic arm is driven along a smooth deflection curve (an S-shaped trajectory) in mid-air to finally grasp target B. This scenario likewise aims to strengthen the generalization ability of the subsequent multimodal large model. In subsequent testing, the training set uses everyday objects (cups, bottles), while the test set is changed to stationery objects (staplers, pen holders) to test the model's response speed and trajectory interpolation ability under sudden gaze changes across different visual semantics.

[0049] Furthermore, in S301c, in the implicit intent task scenario: based on the temporal sequence logic of the user's gaze point at the starting operation point and the target placement point, the received simplified and ambiguous instructions with missing action elements (such as "move it" or "place it") are transformed into a complete composite action sequence containing two stages: grasping and placing.

[0050] For example, referring to item 3 of Figure 4, multiple starting objects (such as cubes A, B, and C) and their corresponding placement targets (such as plates 1, 2, and 3) are arranged on the table. A minimal, vague instruction such as "move it" or "place it" is input. The operator's gaze follows natural temporal logic: first, the gaze fixes on cube A to form a grasping intention; then, while the robotic arm is moving, the gaze shifts to a plate to form a placement intention, producing a continuous gaze switch between the two. During data acquisition, the system needs to record this "A first, then B" visual transfer sequence, so that the model learns the underlying logic of encoding the first gaze hotspot as the grasping coordinates and the second gaze hotspot as the placement target coordinates.
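
In the application the model is meant to learn this ordering end to end. As a way to sanity-check recorded demonstrations, the "first hotspot, then second hotspot" sequence could be extracted explicitly with a heuristic like the one below. Everything here is an assumption for illustration: the function name, the object centre coordinates in image pixels, the 40-pixel distance threshold, and the minimum of three consecutive samples that counts as a fixation.

```python
from math import hypot

def gaze_target_sequence(gaze_samples, objects, max_dist=40.0, min_samples=3):
    """Return the ordered list of objects the gaze dwelt on.

    gaze_samples: time-ordered (t, x, y) tuples; objects: {name: (x, y)} centres.
    """
    sequence = []
    current, run = None, 0
    for _t, gx, gy in gaze_samples:
        name, dist = min(
            ((n, hypot(gx - ox, gy - oy)) for n, (ox, oy) in objects.items()),
            key=lambda pair: pair[1],
        )
        label = name if dist <= max_dist else None   # None: gaze is on no object
        if label == current:
            run += 1
        else:
            current, run = label, 1
        # Register a dwell once it reaches min_samples consecutive samples.
        if label is not None and run == min_samples and (not sequence or sequence[-1] != label):
            sequence.append(label)
    return sequence

objects = {"cube_A": (100.0, 100.0), "plate_1": (300.0, 120.0)}
samples = (
    [(0.01 * i, 102.0, 98.0) for i in range(4)]          # dwell on cube A
    + [(0.05, 200.0, 400.0)]                             # stray sample (saccade)
    + [(0.06 + 0.01 * i, 298.0, 121.0) for i in range(4)]  # dwell on plate 1
)
sequence = gaze_target_sequence(samples, objects)
```

Under this heuristic, the first entry would be read as the grasp target and the second as the placement target.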

[0051] In one possible implementation, S301b includes: in the task scenarios described above, a teaching system incorporating a teleoperation control link is constructed. The operator controls the AgileX Piper collaborative robotic arm in real time via a Touch teleoperation device, so that the arm performs grasping or transporting actions within the workspace. Before teleoperation begins, a vague semantic command is input into the system once. A vague semantic command is an expression that does not contain explicit target attributes or complete action constraints, including but not limited to "grab it," "move this," "put it over," or "take that." Throughout the teleoperation process, the vague semantic command remains continuously active and does not need to be re-entered during task execution. While performing teleoperation control, the operator visually guides the robot towards the target object or area through natural gaze behavior, establishing a spatial correspondence between eye movement trajectories and the target, and thereby an implicit association between semantic commands, eye movement features, and robotic arm movements.

[0052] In one possible implementation, in S301c, the global view RGB image information, the close-up local RGB image, and the corresponding robotic arm multidimensional pose data are the data at the timestamp of the current video frame; the gaze point trajectory is the gaze point trajectory within a preset duration window before the timestamp of the current video frame; and the semantic instruction is global semantic constraint information that is input once before teleoperation starts and remains unchanged throughout the entire task cycle, uniformly constraining, in the time dimension, the operation behavior in the corresponding video frame sequence.

[0053] S302. Perform spatiotemporal encoding of gaze features on the gaze point trajectory map, and fuse the result with the RGB image of the global view to obtain a four-channel multimodal input tensor. In one possible implementation, S302 includes: projecting the gaze point trajectory onto the global camera image to generate a dynamic gaze heatmap with a time decay effect.

[0054] Specifically, S302 includes: S302a. Extract all gaze point trajectories within a preset duration window (preferably 4 seconds) before the timestamp of the current video frame to obtain multiple trajectory maps.
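The windowing in S302a can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the `GazeSample` type, the function name `gaze_window`, and the sample values are hypothetical; only the 4-second default comes from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GazeSample:
    t: float  # sampling time in seconds (hypothetical unit)
    x: float  # pixel column in the global camera image
    y: float  # pixel row in the global camera image


def gaze_window(samples: List[GazeSample], t_frame: float,
                window_s: float = 4.0) -> List[GazeSample]:
    """Keep the gaze samples with t_frame - window_s <= t <= t_frame."""
    return [s for s in samples if t_frame - window_s <= s.t <= t_frame]


# Hypothetical gaze log; the frame timestamp is 6.0 s.
samples = [
    GazeSample(0.5, 100, 80),
    GazeSample(3.0, 120, 90),
    GazeSample(5.5, 300, 200),
    GazeSample(7.0, 310, 205),
]
recent = gaze_window(samples, t_frame=6.0)
print([s.t for s in recent])
```

Samples later than the frame timestamp are excluded, so the heatmap for a frame never depends on gaze data recorded after that frame.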

[0055] S302b. Based on the time difference between the sampling time of each gaze point and the current frame, a weighting coefficient that decays over time is assigned to each gaze point using an exponential decay function. The calculation formula is as follows:

$$w_i = e^{-\alpha\,(t_c - t_i)}$$

[0056] where $w_i$ is the weighting coefficient, $t_c$ is the current frame time, $t_i$ is the sampling time of the i-th gaze point, and $\alpha$ is the time decay factor.
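A short sketch of the time decay weighting in S302b, assuming the exponential decay takes the form w_i = exp(-alpha * (t_c - t_i)). The function name `decay_weights`, the value `alpha=0.5`, and the timestamps are illustrative assumptions, not values given in this application.

```python
import math


def decay_weights(sample_times, t_current, alpha=0.5):
    """Return w_i = exp(-alpha * (t_current - t_i)) for each sampling time t_i."""
    return [math.exp(-alpha * (t_current - t_i)) for t_i in sample_times]


# Four hypothetical gaze samples inside a 4-second window ending at t = 6.0 s.
weights = decay_weights([2.5, 3.5, 4.5, 5.5], t_current=6.0)
print(weights)
```

Because the exponent is never positive for samples at or before the current frame, every weight lies in (0, 1], and later samples receive larger weights.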

[0057] S302c. Within the spatial threshold, a two-dimensional Gaussian kernel function is constructed centered on the coordinates $(x_i, y_i)$ of each gaze point:

$$G(x, y) = \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

[0058] where $G(x, y)$ is the weight value at position $(x, y)$, $(x, y)$ is the coordinate offset of a position within the kernel relative to the kernel center, and $\sigma$ is the standard deviation of the Gaussian kernel, which controls the spatial diffusion range.
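The kernel in S302c can be sketched with NumPy as below. The sketch assumes an unnormalized Gaussian (peak value 1 at the center) and interprets the spatial threshold as a kernel radius in pixels; both are assumptions, as are the function name `gaussian_kernel` and the parameter values.

```python
import numpy as np


def gaussian_kernel(radius: int, sigma: float) -> np.ndarray:
    """Unnormalized 2-D Gaussian over a (2*radius + 1)^2 grid of center offsets."""
    offsets = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(offsets, offsets)
    return np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2))


kernel = gaussian_kernel(radius=3, sigma=1.5)
print(kernel.shape, kernel[3, 3])
```

A larger `sigma` spreads each gaze point over a wider image area, while `radius` bounds how far the spread is evaluated.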

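[0059] S302d. Spatial diffusion processing is performed on the gaze points in each trajectory map using a weighted Gaussian kernel function, and all diffusion results are accumulated in the image space to obtain a single-channel gaze heatmap. The formula for the weighted Gaussian kernel function is:

$$H(x, y) = \sum_{i=1}^{N} w_i \, G(x - x_i,\; y - y_i)$$

[0060] where $H(x, y)$ is the accumulated weight value at image position $(x, y)$, $w_i$ is the time decay weighting coefficient of the i-th gaze point, $G$ is the two-dimensional Gaussian kernel centered on the gaze point coordinates $(x_i, y_i)$, and $N$ is the number of gaze points.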

[0061] Thus, the discrete set of gaze points is transformed into a continuous spatial attention distribution.
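The accumulation in S302d can be sketched as follows. For brevity the sketch evaluates each Gaussian over the whole image instead of truncating it at a spatial threshold, and it assumes an unnormalized kernel; the function name `gaze_heatmap`, the image size, `sigma`, and the gaze points are all hypothetical.

```python
import numpy as np


def gaze_heatmap(points, weights, height, width, sigma=20.0):
    """Sum w_i * exp(-((x - x_i)^2 + (y - y_i)^2) / (2 * sigma^2)) over all gaze points."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width), dtype=np.float64)
    for (px, py), w in zip(points, weights):
        heat += w * np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma**2))
    return heat


# Two hypothetical gaze points given as (x, y) pixels; the second is more recent.
heat = gaze_heatmap(points=[(40, 30), (120, 90)], weights=[0.3, 0.9],
                    height=120, width=160)
peak_row, peak_col = (int(v) for v in np.unravel_index(heat.argmax(), heat.shape))
print(peak_row, peak_col)
```

The hottest pixel falls on the more recent gaze point, which is the behavior the time decay weighting is meant to produce.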

[0062] For example, assume that four gaze points are captured within a 4-second time window, with timestamps $t_1$, $t_2$, $t_3$, and $t_4$. This yields four trajectory maps, each containing one gaze point, the corresponding gaze points being $p_1$, $p_2$, $p_3$, and $p_4$, with $t_1 < t_2 < t_3 < t_4 \le t_c$, where $t_c$ is the current frame time. The weights calculated using the exponential decay function described above satisfy $w_1 < w_2 < w_3 < w_4$, which indicates that the closer a gaze point is to the current moment, the greater its contribution to the heatmap. Subsequently, weighted Gaussian distributions are generated centered on these four points: each trajectory map is spatially diffused using the weighted Gaussian kernel, and the processed trajectory maps are superimposed in the image space to form a continuous heatmap with time decay characteristics.

[0063] S302e. The single-channel gaze heatmap is used as the fourth channel (Alpha channel) and fused with the global RGB image to construct a four-channel multimodal input tensor.
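A minimal sketch of the channel fusion in S302e. The description does not specify how the heatmap is scaled, so the peak normalization to [0, 1] used here is an assumption, as are the function name `fuse_rgb_gaze` and the toy image size.

```python
import numpy as np


def fuse_rgb_gaze(rgb: np.ndarray, heat: np.ndarray) -> np.ndarray:
    """Stack an HxWx3 uint8 RGB image and an HxW heatmap into an HxWx4 float32 tensor."""
    if rgb.shape[:2] != heat.shape:
        raise ValueError("RGB image and heatmap must share height and width")
    rgb_f = rgb.astype(np.float32) / 255.0
    peak = float(heat.max())
    # Assumed scaling: divide by the peak so the fourth channel lies in [0, 1].
    heat_f = (heat / peak if peak > 0 else heat).astype(np.float32)
    return np.concatenate([rgb_f, heat_f[..., None]], axis=-1)


rgb = np.full((4, 6, 3), 255, dtype=np.uint8)  # toy all-white image
heat = np.zeros((4, 6))
heat[1, 2] = 2.0                                # single gaze hotspot
fused = fuse_rgb_gaze(rgb, heat)
print(fused.shape, fused.dtype)
```

Keeping the heatmap as a separate channel leaves the RGB values untouched, so the gaze prior is added to the input rather than overwriting image content.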

[0064] It can be understood that step S302 transforms discrete eye-tracking points into visual prior features with spatiotemporal continuity, enabling the model to model gaze behavior over time and focus on it in space. These features refine the target search space by highlighting high-attention areas and suppressing background interference, thereby improving target disambiguation. Through time decay, they also make the model more responsive to changes in intent, which improves the real-time performance and stability of dynamic intent switching and implicit intent reasoning while reducing the dependence on precise semantic instructions.

[0065] In the multimodal large model robot control system, the user's gaze point coordinate sequence on the display interface is acquired in real time by a screen-type eye tracker and projected onto the RGB image coordinate system captured by the global camera.

[0066] S303. Teleoperation dataset construction. In one possible implementation, S303 includes: aligning the four-channel multimodal input tensors, the close-up local RGB images, the semantic commands, and the multidimensional (7-dimensional) pose data of the robotic arm, with the robotic arm control frequency as the reference, to construct a multimodal fused teleoperation dataset.

[0067] Specifically, data acquisition uses the control frequency of the robotic arm (preferably 60 Hz) as the synchronization reference. Through timestamp alignment, the four-channel multimodal input tensor, the close-up local RGB image, the semantic instruction, and the 7-dimensional end-effector pose vector of the robotic arm at each sampling moment are packaged together synchronously. By continuously acquiring multiple teaching episodes, a multi-task teaching dataset containing visual information, semantic instructions, and human gaze intentions is constructed.
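One simple way to realize the timestamp alignment described above is nearest-neighbor matching against the control ticks. The sketch below assumes integer millisecond timestamps and a hypothetical camera stream; the function names and all timestamp values are illustrative, and the application does not state which alignment rule it uses.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the entry in sorted `timestamps` closest to t (ties go to the earlier one)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_control_ticks(control_ts, stream_ts):
    """For each control tick, pick the index of the nearest sample in another stream."""
    return [nearest_index(stream_ts, t) for t in control_ts]


control_ts_ms = [0, 17, 33, 50, 67, 83]  # roughly 60 Hz control ticks
camera_ts_ms = [2, 35, 68]               # hypothetical slower camera stream
frame_for_tick = align_to_control_ticks(control_ts_ms, camera_ts_ms)
print(frame_for_tick)
```

With a slower sensor, consecutive control ticks reuse the same frame index, so every control-rate sample still carries a complete set of modalities.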

[0068] Specifically, the data is divided into a training set (80%) and a test set (20%). The physical objects involved in the test set are completely different from those in the training set. The training set focuses on covering rich eye-tracking interaction path features, while the test set evaluates whether the model, when faced with objects it has never seen, can still accurately compute the 7-dimensional actions of the robotic arm from the core feature of "eye-tracking hotspots," thereby demonstrating the task versatility of the method provided in this application.
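An object-disjoint split of this kind can be sketched as below. The sketch assumes each episode is tagged with a single hypothetical `object_id`; the function name `split_by_object`, the seed, and the toy episode list are illustrative only.

```python
import random


def split_by_object(episodes, test_fraction=0.2, seed=0):
    """Split episodes so that no object ID appears in both the training and test sets."""
    objects = sorted({ep["object_id"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(objects)
    n_test = max(1, round(test_fraction * len(objects)))
    test_objects = set(objects[:n_test])
    train = [ep for ep in episodes if ep["object_id"] not in test_objects]
    test = [ep for ep in episodes if ep["object_id"] in test_objects]
    return train, test


# Hypothetical dataset: 10 objects with 5 teaching episodes each.
episodes = [{"object_id": f"obj{k}", "episode": e} for k in range(10) for e in range(5)]
train, test = split_by_object(episodes)
print(len(train), len(test))
```

The split is made over objects, so the episode ratio is exactly 80/20 only when every object contributes the same number of episodes.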

[0069] S304. Multimodal large model construction. In one possible implementation, the multimodal large model is constructed by using OpenVLA as the backbone network and a ViT (Vision Transformer) encoder as the visual encoder, connected through a connector, wherein a LoRA adapter is introduced into the LLM (Large Language Model).

[0070] Furthermore, the LoRA adapter is applied to the query and value matrices in the attention mechanism of the Transformer decoder in the LLM.
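The low-rank bypass that LoRA adds to a projection matrix such as the query or value matrix can be illustrated with a small NumPy sketch. It follows the standard LoRA formulation (frozen weight W plus a scaled product B·A, with B initialized to zero); the class name `LoRALinear`, the rank, the scaling value, and the 16x16 layer size are assumptions and are not taken from this application or from the OpenVLA code.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank update (lora_alpha / rank) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int = 4,
                 lora_alpha: float = 8.0, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                           # frozen backbone weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))               # trainable, zero init
        self.scale = lora_alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Backbone path plus bypass path; only A and B would receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))   # stands in for one query or value projection
layer = LoRALinear(W)
x = rng.normal(size=(2, 16))
y = layer(x)
print(layer.A.size + layer.B.size, W.size)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the adapter here holds 128 trainable values against 256 frozen ones.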

[0071] It can be understood that, by updating gradients only on the low-rank matrix parameters (the LoRA adapter) in the bypass branch, the multimodal large model rapidly adapts to the fusion logic of eye-tracking and visual features with very few trainable parameters. This gives the model the ability to understand human intentions in real time while fully preserving the original knowledge representation of the large-scale embodied pre-trained model, avoids the catastrophic forgetting caused by full-parameter fine-tuning, and ensures robust zero-shot generalization when the system deals with unseen objects and unfamiliar task scenarios.

[0072] During the fine-tuning phase, the original OpenVLA backbone network parameters are kept frozen, and only the lightweight parameters of the LoRA branch are updated. With this architecture, the model uses the visual encoder to extract global features fused with gaze features as well as local close-up features, and transforms them into action tokens, thereby quickly adapting to the user's gaze-guided intentions while retaining general operational knowledge.

[0073] S305. Model training. In one possible implementation, S305 includes: training the multimodal large model using the teleoperation dataset and the gaze-guided loss function to obtain the trained multimodal large model.

[0074] Furthermore, the gaze-guided loss function provides strong supervision for the fine-tuning process, and the total loss function is defined as follows:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}} + \lambda\,\mathcal{L}_{\text{gaze}}$$

[0075] where $\mathcal{L}_{\text{action}}$ is the loss term for predicting the robotic arm's motion, calculated as the L1 norm or mean squared error between the 7-dimensional action vector output by the model and the taught ground-truth sequence, in order to ensure the trajectory accuracy of the robotic arm's end effector; $\lambda$ is the adjustment weight coefficient; and $\mathcal{L}_{\text{gaze}}$ is the attention-guided loss term based on human visual priors, calculated by extracting the self-attention weight matrix $A$ from the last Transformer block of the ViT visual encoder and using the Kullback-Leibler (KL) divergence to measure the information difference between the spatial distribution of $A$ and the normalized gaze heatmap (grayscale image) distribution $H$.

[0076] Through this loss constraint, the attention regions inside the model are forced to coincide with the operator's actual gaze points, thereby improving the interpretability and accuracy of the model's decision-making in complex and unstructured environments.
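The weighted sum of an action term and a gaze-guidance term can be sketched in NumPy as follows. Several choices here are assumptions not fixed by the description: the action term uses the L1 option, the KL divergence is taken in the direction KL(H || A), the attention matrix is assumed to have already been reduced to a spatial map on the same grid as the heatmap, and `lam=0.1` is an arbitrary weight. The function names are hypothetical.

```python
import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    """KL(p || q) for two non-negative maps, each normalized to sum to 1."""
    p = p.ravel() + eps
    q = q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def total_loss(pred_action, true_action, attention, gaze_heat, lam=0.1):
    """Return (L_action + lam * L_gaze, L_action, L_gaze) with an L1 action term."""
    l_action = float(np.mean(np.abs(np.asarray(pred_action) - np.asarray(true_action))))
    l_gaze = kl_divergence(gaze_heat, attention)  # assumed direction: KL(H || A)
    return l_action + lam * l_gaze, l_action, l_gaze


heat = np.zeros((4, 4))
heat[1, 2] = 1.0                 # operator looks at one patch
uniform_attn = np.ones((4, 4))   # model attention spread evenly
action = np.zeros(7)

loss_aligned, _, kl_aligned = total_loss(action, action, attention=heat, gaze_heat=heat)
loss_uniform, _, kl_uniform = total_loss(action, action, attention=uniform_attn, gaze_heat=heat)
print(kl_aligned, kl_uniform)
```

With identical action predictions, the loss is lower when the attention map matches the gaze heatmap than when attention is spread uniformly, which is the pressure the gaze term is intended to apply during fine-tuning.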

[0077] The training process is as follows. First, the multimodal teaching data is organized into training samples in time order, each comprising the visual input, the semantic instruction, and the corresponding ground-truth robotic arm action. The input is fed into a policy network composed of the visual encoder and the large language model for forward propagation, yielding the action prediction and the attention distribution. The action loss term is calculated from the action prediction error, the gaze guidance loss term is calculated from the attention distribution and the gaze heatmap, and the two are weighted and summed to obtain the total loss. The model parameters are updated through backpropagation, wherein the backbone parameters of the visual encoder and the large language model are kept frozen and only the LoRA adapter parameters are trained. This process is executed iteratively on the training dataset until the model converges or the preset number of training epochs is reached.

[0078] After model training is completed, the test set of the dataset described in S303 is used to perform offline inference verification on the trained model. The multimodal input data in the test set is fed into the model to generate the corresponding action prediction sequence, which is compared with the real action sequence in the teleoperation dataset to evaluate the model's ability to perform intent recognition and action generation based on gaze hotspot features for objects absent from the training set, thereby completing the preliminary test of the model's generalization performance.

[0079] S306. Intent recognition and real-time inference. In one possible implementation, based on the fuzzy control commands issued by the user, the trained multimodal large model couples dynamic gaze heatmap features with close-up local RGB image features in real time to perform tasks such as target disambiguation, dynamic intent switching, and implicit intent recognition, and infers and outputs commands for robot (robotic arm) control in real time.

[0080] Specifically, the system enters real-time control mode. After receiving a one-time fuzzy control command (such as "move it") from the user, the host computer keeps it continuously active during the current work cycle.

[0081] The host computer acquires the dual-camera video streams and the eye-tracking trajectory in real time through ROS2 nodes and couples them into multimodal features. These multimodal perception data, comprising the fuzzy semantic instruction, the gaze heatmap, and the dual-camera visual streams, are input in real time into the multimodal large model running in a Docker container, whose policy network performs online prediction and outputs the robotic arm's 7-dimensional action vector in real time.

[0082] The inference process of the multimodal large model is described below.

1. Input preparation: The system acquires the close-up RGB image from the robot's first-person view at the current moment and the four-channel multimodal input tensor fused with the gaze heatmap, and normalizes them in the same way as in the training phase. At the same time, it takes the semantic instruction that was input before the task started and remains unchanged throughout the task, and converts it into the corresponding token sequence using the tokenizer of the large language model.

[0083] 2. Visual feature extraction: The images are input into the visual encoder (ViT) to extract the corresponding visual feature token sequence, wherein the visual encoder parameters are frozen during the inference phase.

3. Feature alignment (connector): The visual features are mapped by a trainable connector module, which is a multilayer perceptron containing nonlinear activation functions, to project the visual tokens into an embedding space consistent with the language model, thereby obtaining a sequence of visual tokens whose dimensions match those of the text tokens.

[0084] 4. LLM autoregressive generation of action tokens: The visual token sequence is concatenated with the text token sequence, and a start-of-action identifier token is appended to the end of the sequence to form a unified multimodal input sequence, which is then input into the large language model. The visual tokens serve as contextual information and participate in the generation of subsequent actions.

[0085] Based on the multimodal input sequence, the large language model uses an autoregressive approach to gradually generate action token sequences. In each generation step, the model predicts the next discrete action token based on the current context until an action sequence of a preset length is generated. During the inference process, LoRA adapter parameters are loaded to enhance the model's response to eye-tracking features while maintaining its original knowledge representation capabilities.

[0086] 5. Action decoding: The generated discrete action token sequence is restored to continuous control quantities through a predefined inverse quantization mapping, yielding the corresponding robotic arm control vector, which includes the end-effector pose and the gripper state.
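An inverse quantization mapping of this kind can be sketched as below, assuming uniform bins per action dimension and decoding each token to its bin center. The bin count of 256, the per-dimension bounds, the 6-pose-plus-1-gripper layout of the 7-dimensional vector, and the function name `detokenize_action` are all assumptions; the application only states that the mapping is predefined.

```python
def detokenize_action(tokens, low, high, n_bins=256):
    """Map integer bin indices in [0, n_bins - 1] to bin-center values in [low_d, high_d]."""
    if not (len(tokens) == len(low) == len(high)):
        raise ValueError("tokens, low and high must have the same length")
    action = []
    for tok, lo, hi in zip(tokens, low, high):
        if not 0 <= tok < n_bins:
            raise ValueError(f"token {tok} outside [0, {n_bins - 1}]")
        action.append(lo + (tok + 0.5) * (hi - lo) / n_bins)
    return action


# Hypothetical bounds: six pose dimensions in [-1, 1], gripper state in [0, 1].
low = [-1.0] * 6 + [0.0]
high = [1.0] * 7
action = detokenize_action([0, 255, 128, 127, 64, 192, 255], low, high)
print(action)
```

Decoding to bin centers bounds the reconstruction error of each dimension by half a bin width, which is one reason uniform binning is a common choice for discretized action spaces.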

[0087] 6. Execution and looping: The control vector is sent to the robotic arm controller for execution, and new visual input data is acquired after one control cycle. The above steps are repeated to achieve closed-loop real-time control.
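The observe, infer, and act cycle described in steps 1 to 6 can be sketched as a fixed-period loop. The callables `get_observation`, `policy`, and `send_action` are hypothetical stand-ins for the camera and eye-tracker interface, the multimodal large model, and the robotic arm controller; the 60 Hz default mirrors the preferred control frequency, and the stub values in the usage are illustrative.

```python
import time
from typing import Callable, Sequence


def control_loop(get_observation: Callable[[], dict],
                 policy: Callable[[dict, str], Sequence[float]],
                 send_action: Callable[[Sequence[float]], None],
                 instruction: str,
                 period_s: float = 1.0 / 60.0,
                 max_steps: int = 100) -> int:
    """Run observe -> infer -> act at a fixed period; return the number of steps executed."""
    steps = 0
    for _ in range(max_steps):
        start = time.monotonic()
        obs = get_observation()
        action = policy(obs, instruction)  # the one-time instruction is reused every cycle
        send_action(action)
        steps += 1
        remaining = period_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    return steps


# Stub components so the loop can be exercised without hardware.
sent = []
n_steps = control_loop(
    get_observation=lambda: {"fused": None, "wrist_rgb": None},
    policy=lambda obs, text: [0.0] * 7,
    send_action=sent.append,
    instruction="move it",
    period_s=0.0,
    max_steps=3,
)
print(n_steps, len(sent))
```

Passing the instruction as a loop argument rather than reading it each cycle reflects the "one-time command issuance" design: only the observation changes between iterations.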

[0088] The method provided in this application introduces visual gaze features, realizing a natural interaction mode of "one-time command issuance and long-term intent guidance." Complex target disambiguation and task completion can be achieved simply by combining ambiguous commands with eye gaze. This not only significantly improves interaction efficiency but also addresses a specific technical pain point: users with disabilities, and users in high-frequency usage scenarios, find it difficult to provide detailed descriptions through speech or text. Human-computer collaboration thus becomes more intuitive. The method also uses a "Gaze-Guided Loss" function so that, at the model training level, human visual priors provide strong supervision of the large model's attention mechanism. Through the KL divergence constraint, the self-attention map inside the model is forced to coincide with the real human eye-tracking heatmap, enabling the model to focus on key interactive areas in the scene "like a human." On this basis, combined with the LoRA fine-tuning architecture, the method provided in this application significantly enhances the system's adaptability to complex unstructured environments (such as object stacking and occlusion) while retaining the cross-task and zero-shot generalization capabilities of the OpenVLA large model, so that it maintains high intent recognition accuracy when facing unseen objects or unfamiliar scenes.

[0089] The following describes the control device for a multimodal large model robot based on eye-tracking feature enhancement provided in this application. The control device for a multimodal large model robot based on eye-tracking feature enhancement described below can be referred to in correspondence with the control method for a multimodal large model robot based on eye-tracking feature enhancement described above.

[0090] Figure 6 is a schematic diagram of the structure of the multimodal large model robot control device based on eye-tracking feature enhancement provided in the embodiments of this application. As shown in Figure 6, the device includes an acquisition module 61, an encoding module 62, a dataset construction module 63, a multimodal large model construction module 64, a training module 65, and an output module 66, wherein: the acquisition module 61 is used to acquire global view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and corresponding robotic arm multidimensional pose data through teleoperation and multi-task scenario construction; the encoding module 62 is used to perform spatiotemporal encoding of gaze features on the gaze point trajectory map and fuse the result with the RGB image of the global view to obtain a four-channel multimodal input tensor; the dataset construction module 63 is used to construct a multimodal fused teleoperation dataset; the multimodal large model construction module 64 is used to construct a multimodal large model with OpenVLA as the backbone network and a ViT encoder as the visual encoder, connected through a connector, with a LoRA adapter introduced; the training module 65 is used to train the multimodal large model using the teleoperation dataset and the gaze-guided loss function; and the output module 66 is used to couple, according to the fuzzy control instructions given by the user, dynamic gaze heatmap features with close-up RGB image features in real time using the trained multimodal large model, and to infer and output instructions for robot control in real time.

[0091] Figure 7 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communications bus 740. The processor 710, communications interface 720, and memory 730 communicate with each other via the communications bus 740. The processor 710 can call logic instructions in the memory 730 to execute the multimodal large model robot control method based on eye-tracking feature enhancement.

[0092] Furthermore, the logical instructions in the aforementioned memory 730 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0093] In another aspect, this application also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the multimodal large model robot control method based on eye-tracking feature enhancement provided by the above embodiments.

[0094] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0095] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0096] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A multimodal large model robot control system based on eye-tracking feature enhancement, characterized in that it comprises: a scene camera, used to acquire RGB image information of the global view including the operation area; a collaborative robotic arm, used to perform specific tasks, whose state is represented by a multidimensional end-effector pose vector; a local camera, installed on the end effector of the robotic arm, used to acquire close-up local RGB images of a target object and its corresponding depth visual features; a display device for image display; a screen-type eye tracker for high-frequency acquisition of user gaze behavior within the display area of the display device; and a host computer computing platform communicatively connected to the scene camera, the local camera, the screen-type eye tracker, and the collaborative robotic arm.

2. A multimodal large model robot control method based on eye-tracking feature enhancement, characterized in that the method is implemented based on the system described in claim 1 and comprises: multimodal teaching data acquisition: through teleoperation and multi-task scenario construction, acquiring global view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and corresponding robotic arm multidimensional pose data; spatiotemporal encoding of gaze features: performing spatiotemporal encoding of gaze features on the gaze point trajectory map and fusing the result with the RGB image of the global view to obtain a four-channel multimodal input tensor; teleoperation dataset construction: constructing a multimodal fused teleoperation dataset; multimodal large model construction: constructing a multimodal large model with OpenVLA as the backbone network and a ViT encoder as the visual encoder, connected through a connector, with a LoRA adapter introduced; model training: training the multimodal large model using the teleoperation dataset and the gaze-guided loss function; and intent recognition and real-time inference: based on the fuzzy control commands issued by the user, using the trained multimodal large model to couple dynamic gaze heatmap features with close-up RGB image features in real time, and inferring and outputting commands for robot control in real time.

3. The method according to claim 2, characterized in that the multi-task scenarios include target disambiguation tasks, dynamic intent switching tasks, and implicit intent tasks, and the method for constructing the task scenarios includes: in the target disambiguation task scenario, when there are multiple similar or identical objects in the workspace and a vague semantic instruction without a description of distinguishing object attributes is received, clarifying the target of the vague instruction by identifying, in the teaching data, the real-time coverage of a specific one of the similar targets by the user's gaze point; in the dynamic intent switching task scenario, monitoring the spatial offset of the user's gaze point in real time while the robotic arm executes the predetermined task path, so that if the gaze point suddenly changes from the initial target to a new target, the robotic arm smoothly changes its trajectory in mid-motion and switches the operation target based on gaze guidance without the command being re-entered; and in the implicit intent task scenario, extracting the temporal logic of the user's gaze point looking first at the starting operation point and then at the target placement point, so that the received minimal, ambiguous instruction with missing action elements is transformed into a complete composite action sequence containing two stages: grasping and placing.

4. The method according to claim 2, characterized in that performing spatiotemporal encoding of gaze features on the gaze point trajectory map and fusing the result with the RGB image of the global view to obtain a four-channel multimodal input tensor comprises: extracting all gaze point trajectories within a preset duration window preceding the timestamp of the current video frame to obtain multiple trajectory maps; based on the time difference between the sampling time of each gaze point and the current frame, assigning each gaze point a weighting coefficient that decays over time; within the spatial threshold, constructing a two-dimensional Gaussian kernel function centered on the coordinates of each gaze point; performing spatial diffusion processing on the gaze points in each trajectory map using a weighted Gaussian kernel function, and accumulating all diffusion results in the image space to obtain a single-channel gaze heatmap; and superimposing the single-channel gaze heatmap, as the fourth channel, onto the Alpha channel of the RGB image of the global view, thereby constructing a four-channel multimodal input tensor that incorporates human visual attention guidance features.

5. The method according to claim 3, characterized in that the weighting coefficients are:

$$w_i = e^{-\alpha\,(t_c - t_i)}$$

where $w_i$ is the weighting coefficient, $t_c$ is the current frame time, $t_i$ is the sampling time of the i-th gaze point, and $\alpha$ is the time decay factor.

6. The method according to claim 3, characterized in that the two-dimensional Gaussian kernel function is:

$$G(x, y) = \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

where $G(x, y)$ is the weight value at position $(x, y)$, $(x, y)$ is the coordinate offset of a position within the kernel relative to the kernel center, and $\sigma$ is the standard deviation of the Gaussian kernel, which controls the spatial diffusion range; and the formula for the weighted Gaussian kernel function is:

$$H(x, y) = \sum_{i=1}^{N} w_i \, G(x - x_i,\; y - y_i)$$

where $H(x, y)$ is the accumulated weight value at image position $(x, y)$, $w_i$ is the weighting coefficient of the i-th gaze point, $(x_i, y_i)$ are the coordinates of the i-th gaze point, and $N$ is the number of gaze points.

7. The method according to claim 2, characterized in that the gaze-guided loss function is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}} + \lambda\,\mathcal{L}_{\text{gaze}}$$

where $\mathcal{L}_{\text{action}}$ is the loss term for predicting the robotic arm's motion; $\lambda$ is the adjustment weight coefficient; and $\mathcal{L}_{\text{gaze}}$ is the attention-guided loss term based on human visual priors, which extracts the self-attention weight matrix $A$ from the last Transformer block of the ViT visual encoder and uses the Kullback-Leibler divergence to calculate the information difference between the spatial distribution of $A$ and the normalized gaze heatmap distribution $H$.

8. A multimodal large model robot control device based on eye-tracking feature enhancement, characterized in that it comprises: an acquisition module, used to acquire global view RGB image information, close-up local RGB images, gaze point trajectory maps, semantic commands, and corresponding robotic arm multidimensional pose data through teleoperation and multi-task scenario construction; an encoding module, used to perform spatiotemporal encoding of gaze features on the gaze point trajectory map and fuse the result with the RGB image of the global view to obtain a four-channel multimodal input tensor; a dataset construction module, used to construct a multimodal fused teleoperation dataset; a multimodal large model construction module, used to construct a multimodal large model with OpenVLA as the backbone network and a ViT encoder as the visual encoder, connected through a connector, with a LoRA adapter introduced; a training module, used to train the multimodal large model using the teleoperation dataset and the gaze-guided loss function; and an output module, used to couple, based on the fuzzy control commands given by the user, dynamic gaze heatmap features with close-up RGB image features in real time using the trained multimodal large model, and to infer and output commands for robot control in real time.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the multimodal large model robot control method based on eye-tracking feature enhancement as described in any one of claims 2 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the multimodal large model robot control method based on eye-tracking feature enhancement as described in any one of claims 2 to 7.