Robot control method and system, electronic device, and storage medium

A multimodal large model fuses features from robot control commands, color images, and depth images to improve the robot's operational accuracy and flexibility in complex environments and to reduce deviations in the robot's recognition of user intent.

WO2026138201A1 · PCT designated stage · Publication Date: 2026-07-02 · BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD
Filing Date: 2025-11-10
Publication Date: 2026-07-02

Abstract

A robot control method and system, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The robot control method comprises: receiving a robot control instruction, and extracting a semantic feature of the robot control instruction; acquiring a color image of an environment where a robot is located, performing image segmentation on the color image to obtain a mask corresponding to image content, and determining a semantic category of the mask; fusing the mask, the semantic category of the mask, and the semantic feature to obtain a fused feature; inputting the fused feature and a preset feature into a multi-modal large model to obtain control information; and controlling the robot on the basis of the control information, so as to complete a task corresponding to the robot control instruction. User intent can thereby be accurately identified, and the flexibility and accuracy of robot control are improved.

Description

A robot control method, system, electronic device, and storage medium

[0001] Cross-references to related applications

[0002] This disclosure claims priority to Chinese Patent Application No. CN202411920719.7, filed on December 25, 2024, entitled "A Robot Control Method, System, Electronic Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This disclosure relates to the field of artificial intelligence technology, and in particular to a robot control method, system, electronic device, and storage medium.

Background Technology

[0004] Robotics is an interdisciplinary science that involves multiple fields such as mechanical engineering, electrical engineering, computer science, and artificial intelligence.

[0005] In related technologies, robots can perform corresponding operations based on control commands input by users. However, when performing complex operations, robots often misunderstand human language commands. Moreover, robots lack the ability to adjust flexibly to real-time changes in the external environment and in their own internal state.

[0006] Therefore, how to accurately identify user intentions and improve the flexibility and accuracy of robot control is a technical problem that needs to be solved by those skilled in the art.

[0007] Summary of the Disclosure

[0008] In view of this, the purpose of this disclosure is to provide a robot control method, system, electronic device, and storage medium that can accurately identify user intentions and improve the flexibility and accuracy of robot control.

[0009] To address the aforementioned technical problems, in a first aspect, this disclosure provides a robot control method, comprising:

[0010] Receiving a robot control command and extracting semantic features of the robot control command;

[0011] Acquiring a color image of the robot's environment, segmenting the color image to obtain masks corresponding to the image content, and extracting the mask features of the masks and the label features of the masks' semantic categories;

[0012] Constructing a joint feature that includes the mask features and the label features, and fusing the joint feature with the semantic features to obtain fused features;

[0013] Inputting the fused features and preset features into a multimodal large model to obtain control information, wherein the preset features include semantic features of the robot control command, color image features of a color image of the robot's environment, depth image features of a depth image of the robot's environment, and robot state features; and

[0014] Controlling the robot according to the control information to complete the task corresponding to the robot control command.

[0015] In one optional implementation, constructing a joint feature comprising the mask feature and the label feature includes:

[0016] Concatenating the mask feature and the label feature corresponding to each mask to obtain the joint feature; and

[0017] Adding a corresponding positional encoding to the joint feature based on the position of the mask in the color image.

[0018] In one optional implementation, fusing the joint feature with the semantic features to obtain the fused features includes:

[0019] Processing the joint feature containing the positional encoding together with the semantic features using a multi-head cross-attention mechanism to obtain an intermediate processing result; and

[0020] Processing the intermediate processing result using the fully connected layer, pooling layer, and normalization layer of a semantic fusion model to obtain the fused features.

[0021] In one optional implementation, the color image features are obtained in the following manner:

[0022] The color image of the robot's environment is encoded using a color image encoding model to extract the color image features;

[0023] The depth image features are obtained in the following manner:

[0024] The depth image of the robot's environment is encoded using a depth image encoding model to extract the depth image features;

[0025] The robot state features are obtained in the following manner:

[0026] The robot's state information is encoded using a robot state encoding model to extract the robot state features, wherein the state information includes at least one of joint angles, end effector state, and positioning information.

[0027] In one optional implementation, the color image of the robot's environment includes a first color image captured by a first camera and a second color image captured by a second camera; wherein the first camera is located on the robot, and the second camera is located within the area where the robot is situated.

[0028] Encoding the color image of the robot's environment using a color image encoding model to extract the color image features includes:

[0029] Encoding the first color image and the second color image using the color image encoding model to extract the color image features.

[0030] In one optional implementation, the depth image of the robot's environment includes a first depth image captured by the first camera and a second depth image captured by the second camera.

[0031] Encoding the depth image of the robot's environment using a depth image encoding model to extract the depth image features includes:

[0032] Encoding the first depth image and the second depth image using the depth image encoding model to extract the depth image features.

[0033] In a second aspect, this disclosure also provides a robot control system, which includes:

[0034] The instruction processing module is configured to receive robot control instructions and extract the semantic features of the robot control instructions;

[0035] The image processing module is configured to acquire a color image of the robot's environment, perform image segmentation on the color image to obtain a mask corresponding to the image content, and extract the mask features of the mask and the label features of the semantic category of the mask.

[0036] The feature fusion module is configured to construct a joint feature containing the mask feature and the label feature, and fuse the joint feature with the semantic feature to obtain the fused feature;

[0037] The feature processing module is configured to input the fused features and preset features into a multimodal large model to obtain control information; wherein, the preset features include semantic features of robot control commands, color image features of color images of the robot's environment, depth image features of depth images of the robot's environment, and robot state features.

[0038] The execution module is configured to control the robot according to the control information in order to complete the task corresponding to the robot control command.

[0039] In a third aspect, this disclosure also provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor invokes the computer program in the memory to implement the steps of the above-described robot control method.

[0040] In a fourth aspect, this disclosure also provides a storage medium on which a computer program is stored, wherein the computer program, when executed, implements the steps of the above-described robot control method.

[0041] This disclosure provides a robot control method. The method extracts semantic features from a robot control command, segments a color image of the environment to generate corresponding masks, and determines the semantic category of each mask. The masks, their semantic categories, and the semantic features are fused to generate fused features, which associate the robot control command with specific objects in the environment. The fused features and preset features are input into a multimodal large model to obtain the control information needed to complete the task. The preset features include at least one of semantic features, color image features, depth image features, and robot state features. Because the fused features and preset features are input together, the multimodal large model can capture the correlations among the features and combine them to generate control information. This disclosure can therefore accurately identify user intentions and improve the flexibility and accuracy of robot control. This disclosure also provides a robot control system, an electronic device, and a storage medium, which have the same beneficial effects and are not described again here.

[0042] To make the above-mentioned objects, features and advantages of this disclosure more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Drawings

[0043] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of this disclosure and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0044] Figure 1 is a flowchart of a robot control method provided in an embodiment of this disclosure;

[0045] Figure 2 is a schematic diagram of the workflow of an embodied intelligent operation large model framework based on multimodal semantic fusion provided in an embodiment of this disclosure;

[0046] Figure 3 is a schematic diagram of the principle of an embodied intelligent operation large model framework based on multimodal semantic fusion provided in the embodiments of this disclosure;

[0047] Figure 4 is a schematic diagram of a mask feature generation principle provided by an embodiment of this disclosure;

[0048] Figure 5 is a schematic diagram of a label feature generation principle provided by an embodiment of this disclosure;

[0049] Figure 6 is a schematic diagram of the structure of a semantic fusion model provided in an embodiment of this disclosure.

Detailed Description

[0050] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0051] Please refer to Figure 1 below. Figure 1 is a flowchart of a robot control method provided in an embodiment of this disclosure.

[0052] Specific steps may include:

[0053] S101: Receive robot control instructions and extract the semantic features of the robot control instructions.

[0054] This embodiment can be applied to robots such as automata, transport robots, and robotic vacuum cleaners. The robot control commands mentioned above contain natural language information, i.e., language commands. Users can issue robot control commands via voice or text, such as, "Please move the book on the table to the sofa." The robot can capture the user's robot control commands through a microphone or input device and convert them into digital signals or text. The robot control commands can also be commands transmitted from other electronic devices (such as smart speakers, smartphones, etc.). This embodiment can perform natural language processing on the robot control commands to obtain semantic features. These semantic features can be vectors, containing the main content and intent of the robot control commands, such as action type, object category, and location information. This step can utilize a large language model to extract the semantic features of the robot control commands.

[0055] S102: Obtain a color image of the robot's environment, perform image segmentation on the color image to obtain a mask corresponding to the image content, and determine the semantic category of the mask.

[0056] In this embodiment, a camera can be mounted on the robot or placed within the robot's environment to acquire a color image of the robot's surroundings. The camera can be a depth camera. This step involves image segmentation of the color image to obtain masks corresponding to the image content. Each mask has a corresponding semantic category. The semantic category describes the classification of the content corresponding to each mask in the color image, such as cars, pedestrians, trees, and houses. This embodiment can use a segmentation model (i.e., an image segmentation model) to obtain the masks corresponding to the image content.
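The output of this step can be pictured as a list of masks, each paired with a semantic category. The following is a minimal sketch under stated assumptions: masks are represented as nested lists of 0/1 values, and `SegmentedObject` and `mask_centroid` are hypothetical names introduced only for illustration. The segmentation model itself is not shown, and the toy values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple

Mask = List[List[int]]  # binary mask with the same height/width as the color image


@dataclass
class SegmentedObject:
    """Hypothetical container: one mask plus its semantic category."""
    mask: Mask
    category: str


def mask_centroid(mask: Mask) -> Tuple[float, float]:
    """Return the (row, col) centroid of the pixels set to 1."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("empty mask")
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


# A toy 4x4 "image" with two segmented objects (illustrative values only).
book = SegmentedObject(
    mask=[[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]],
    category="book",
)
sofa = SegmentedObject(
    mask=[[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]],
    category="sofa",
)
objects = [book, sofa]
centroids = {o.category: mask_centroid(o.mask) for o in objects}
```

The centroid is one simple way to summarize where a mask lies in the image, which is the kind of position information a later positional encoding could use.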

[0057] S103: The mask, the semantic category of the mask, and the semantic features are fused to obtain fused features.

[0058] As one feasible implementation method, this embodiment can use feature concatenation to fuse the mask, semantic category, and semantic features; as another feasible implementation method, this embodiment can utilize a feature fusion network to fuse the mask, semantic category, and semantic features.

[0059] The fusion features obtained in this step can associate user commands with specific objects in the environment, enabling the robot to perform tasks more accurately.

[0060] S104: Input the fused features and preset features into the multimodal large model to obtain control information.

[0061] The aforementioned preset features include at least one of the following: semantic features of robot control commands, color image features of a color image of the robot's environment, depth image features of a depth image of the robot's environment, and robot state features.

[0062] This step inputs the fused features and preset features into the multimodal large model. The multimodal large model can comprehensively process features from different modalities to capture the complex relationships between them. It can fuse features from different modalities through multi-layer neural networks and then generate the control information at its output layer. The control information can be specific action sequences, such as movement, grasping, and navigation, or higher-level strategies, such as path planning and task scheduling.

[0063] Through the above operations, the multimodal large model can generate accurate control information based on features from multiple sources, ensuring that the robot can efficiently and accurately complete the tasks corresponding to the robot control commands.

[0064] S105: Control the robot according to the control information in order to complete the task corresponding to the robot control command.

[0065] This embodiment extracts semantic features from the robot control command, segments the color image of the environment to generate corresponding masks, and determines the semantic category of each mask. The masks, their semantic categories, and the semantic features are fused to generate fused features, which associate the robot control command with specific objects in the environment. The fused features and preset features are input into a multimodal large model to obtain the control information needed to complete the task. The preset features include at least one of semantic features, color image features, depth image features, and robot state features. Because the fused features and preset features are input together, the multimodal large model can capture the correlations among the features and combine them to generate control information. This embodiment can therefore accurately identify user intentions and improve the flexibility and accuracy of robot control.
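The data flow of steps S101 to S105 can be sketched as a single function that chains four pluggable components. This is only an illustration under stated assumptions: `control_step` and all `toy_*` functions are hypothetical stand-ins, not the models of this disclosure, and the numeric values carry no meaning beyond exercising the flow.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]
Mask = List[List[int]]
Segments = List[Tuple[Mask, str]]  # (mask, semantic category) pairs


def control_step(
    command: str,
    color_image: object,
    preset_features: Sequence[Feature],
    extract_semantics: Callable[[str], Feature],
    segment: Callable[[object], Segments],
    fuse: Callable[[Segments, Feature], Feature],
    model: Callable[[Feature, Sequence[Feature]], List[float]],
) -> List[float]:
    """One pass through S101-S104; the caller applies the result (S105)."""
    semantic = extract_semantics(command)          # S101: semantic features
    masks_with_categories = segment(color_image)   # S102: masks and categories
    fused = fuse(masks_with_categories, semantic)  # S103: fused features
    return model(fused, preset_features)           # S104: control information


# Toy stand-ins so the data flow can be exercised end to end.
def toy_semantics(command: str) -> Feature:
    return [float(len(command.split()))]


def toy_segment(image: object) -> Segments:
    return [([[1]], "book"), ([[1]], "sofa")]


def toy_fuse(items: Segments, semantic: Feature) -> Feature:
    return semantic + [float(len(items))]


def toy_model(fused: Feature, presets: Sequence[Feature]) -> List[float]:
    return [sum(fused) + sum(sum(p) for p in presets)]


control = control_step(
    "move the book to the sofa", None, [[1.0], [2.0]],
    toy_semantics, toy_segment, toy_fuse, toy_model,
)
```

Passing the components in as callables keeps the sketch independent of any particular language model, segmentation model, or multimodal model.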

[0066] As a further explanation of the embodiment corresponding to Figure 1, the fusion feature can be generated in the following manner:

[0067] Step A1: Input the mask into the mask feature extractor to obtain the mask features.

[0068] Step A2: Input the semantic category of the mask into the feature extractor of the large language model to obtain the label features.

[0069] Step A3: Construct a joint feature that includes the mask feature and the label feature, and add a corresponding positional encoding to the joint feature.

[0070] In this embodiment, the mask feature and the label feature corresponding to each mask can be concatenated to obtain the joint feature, and a corresponding positional encoding can be added to the joint feature according to the position of the mask in the color image.
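The concatenation and positional encoding described above can be sketched as follows. The source does not specify the encoding scheme, so this sketch assumes a sinusoidal encoding of a normalized (x, y) mask position that is added element-wise to the concatenated feature. The function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence


def positional_encoding(x: float, y: float, dim: int) -> List[float]:
    """Sinusoidal code for a normalized (x, y) position.

    Assumption: the encoding scheme is not given in the source; this is one
    common choice. dim must be a multiple of 4.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be a multiple of 4")
    code: List[float] = []
    for i in range(dim // 4):
        freq = 1.0 / (10000 ** (4 * i / dim))
        code += [math.sin(x * freq), math.cos(x * freq),
                 math.sin(y * freq), math.cos(y * freq)]
    return code


def joint_feature(mask_feat: Sequence[float], label_feat: Sequence[float],
                  x: float, y: float) -> List[float]:
    """Concatenate mask and label features, then add the positional code."""
    joint = list(mask_feat) + list(label_feat)
    code = positional_encoding(x, y, len(joint))
    return [j + p for j, p in zip(joint, code)]


# Illustrative values: 2-d mask feature, 2-d label feature, mask at the image origin.
feat = joint_feature([0.5, -1.0], [2.0, 0.0], x=0.0, y=0.0)
```

At the origin the code is `[0, 1, 0, 1]`, so the example simply shifts the second and fourth components of the concatenated feature by one.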

[0071] Step A4: Fuse the joint feature containing the positional encoding with the semantic feature to obtain the fused feature.

[0072] Specifically, in this embodiment, the joint features containing the positional encoding and the semantic features can be processed based on a multi-head cross-attention mechanism to obtain intermediate processing results; the intermediate processing results can be processed using the fully connected layer, pooling layer and normalization layer of the semantic fusion model to obtain the fused features.
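The fusion path described above (multi-head cross-attention, then a fully connected layer, pooling, and normalization) can be sketched in plain Python. Several points are assumptions rather than details from the source: the semantic features act as queries and the joint features as keys and values, the attention heads use identity projections instead of learned ones, pooling is a mean over tokens, and normalization is layer normalization. All names and values are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix, heads: int) -> Matrix:
    """Multi-head cross-attention with identity projections (a simplification)."""
    dim = len(queries[0])
    if dim % heads != 0:
        raise ValueError("feature dim must be divisible by heads")
    hd = dim // heads  # per-head feature width
    out: Matrix = []
    for q in queries:
        row: Vector = []
        for h in range(heads):
            lo, hi = h * hd, (h + 1) * hd
            # Scaled dot-product scores of this query against every key.
            scores = [sum(a * b for a, b in zip(q[lo:hi], kv[lo:hi])) / math.sqrt(hd)
                      for kv in keys_values]
            weights = softmax(scores)
            # Weighted sum of the value slices for this head.
            row += [sum(w * kv[lo + j] for w, kv in zip(weights, keys_values))
                    for j in range(hd)]
        out.append(row)
    return out


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(w_row, x)) + b
            for w_row, b in zip(weight, bias)]


def mean_pool(rows: Matrix) -> Vector:
    return [sum(col) / len(rows) for col in zip(*rows)]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fuse(semantic: Matrix, joint: Matrix, weight: Matrix, bias: Vector,
         heads: int = 2) -> Vector:
    attended = cross_attention(semantic, joint, heads)
    projected = [linear(row, weight, bias) for row in attended]
    return layer_norm(mean_pool(projected))


# Two command tokens attend over three object features (made-up numbers).
semantic_tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
joint_features = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
fused = fuse(semantic_tokens, joint_features, identity, [0.0] * 4)
```

Using the command tokens as queries lets each part of the instruction weight the objects it refers to, which matches the stated goal of associating commands with specific objects. A trained model would replace the identity projections and the identity weight matrix with learned parameters.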

[0073] As a further description of the embodiment corresponding to Figure 1, if the preset features include color image features, the color image features can be obtained as follows before the preset features are input into the multimodal large model: the color image of the robot's environment is encoded using a color image encoding model to extract the color image features.

[0074] If the preset features include depth image features, the depth image features can be obtained as follows before the preset features are input into the multimodal large model: the depth image of the robot's environment is encoded using a depth image encoding model to extract the depth image features.

[0075] If the preset features include robot state features, then before inputting the preset features into the multimodal large model, the robot state features can be obtained in the following way: The robot state information is encoded and extracted using a robot state coding model to obtain the robot state features; wherein, the state information includes at least one of joint angles, end effector state, positioning information, acceleration, battery level, and sensor readings (such as temperature, humidity, light intensity, etc.). This embodiment can preprocess the acquired state information, for example, by standardizing or normalizing the numerical range to ensure data consistency and comparability. This embodiment can encode the preprocessed state information to obtain the robot state features. This embodiment can use a robot state coding model to extract the above-mentioned robot state features.
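The preprocessing mentioned above can be sketched with min-max normalization, which is one common choice and not necessarily the one used in this disclosure. The field names and value ranges in `STATE_RANGES` are hypothetical; real limits depend on the robot.

```python
from typing import Dict, List, Tuple

# Hypothetical value ranges used for min-max scaling; real limits depend on the robot.
STATE_RANGES: Dict[str, Tuple[float, float]] = {
    "joint_angle_rad": (-3.14, 3.14),
    "gripper_opening": (0.0, 1.0),
    "battery_percent": (0.0, 100.0),
}


def normalize(value: float, low: float, high: float) -> float:
    """Min-max scale a reading into [0, 1], clamping out-of-range values."""
    if high <= low:
        raise ValueError("high must exceed low")
    scaled = (value - low) / (high - low)
    return min(1.0, max(0.0, scaled))


def preprocess_state(state: Dict[str, float]) -> List[float]:
    """Return normalized readings in a fixed (sorted) key order."""
    return [normalize(state[k], *STATE_RANGES[k]) for k in sorted(STATE_RANGES)]


# The battery reading is deliberately out of range to show the clamping.
vector = preprocess_state(
    {"joint_angle_rad": 0.0, "gripper_opening": 0.25, "battery_percent": 150.0}
)
```

A fixed key order matters because a state encoding model would expect each input position to always carry the same quantity.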

[0076] As a further description of the embodiment corresponding to Figure 1, the robot is equipped with a first camera, and a second camera is installed within the area where the robot is located. The second camera, also known as a third-view camera, is independent of the robot and does not move with the robot. If the robot is a robot with a robotic arm, the first camera can be installed at the end of the robot's robotic arm to effectively acquire environmental information about the robot.

[0077] The color images of the robot's environment include: a first color image captured by a first camera and a second color image captured by a second camera. Correspondingly, the process of extracting color image features using a color image coding model includes: encoding and extracting color image features from the first and second color images using the color image coding model.

[0078] The depth image of the robot's environment includes: a first depth image captured by a first camera and a second depth image captured by a second camera. Correspondingly, the process of extracting depth image features using a depth image coding model includes: encoding and extracting depth image features from the first and second depth images using the depth image coding model.
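Encoding two camera views can be sketched as follows. This is a toy illustration under stated assumptions: `toy_encoder` (a per-channel mean) stands in for both the color image encoding model and the depth image encoding model, and the per-view features are combined by concatenation, which the source does not specify.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def toy_encoder(image: Image) -> List[float]:
    """Stand-in for an image encoding model: per-channel mean intensity."""
    channels = len(image[0][0])
    pixels = [px for row in image for px in row]
    return [sum(px[c] for px in pixels) / len(pixels) for c in range(channels)]


def encode_views(first_view: Image, second_view: Image) -> List[float]:
    """Encode both camera views and concatenate the results (an assumption)."""
    return toy_encoder(first_view) + toy_encoder(second_view)


# 1x2 color images (3 channels) and 1x2 depth images (1 channel); values are made up.
rgb_first = [[[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]]
rgb_second = [[[0.25, 0.25, 0.25], [0.75, 0.75, 0.75]]]
depth_first = [[[0.5], [1.5]]]
depth_second = [[[2.0], [4.0]]]

color_features = encode_views(rgb_first, rgb_second)
depth_features = encode_views(depth_first, depth_second)
```

The color and depth inputs differ in channel count (three versus one here), which is one reason separate encoders are used for the two image types.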

[0079] The process described in the above embodiments is illustrated below through examples in practical applications.

[0080] Although the rapid development of large-scale models in recent years has led some methods to attempt to improve robots' task understanding and execution capabilities by directly fusing language instructions and visual signals into these models, these methods have not explicitly provided an effective mechanism for combining language instructions with the visual environment, resulting in insufficient information transmission during the fusion process. When performing complex operations, robots often misunderstand human language instructions, thus affecting the overall success rate. This phenomenon is particularly prominent in practical applications because existing robotic systems often cannot accurately understand or parse complex task descriptions from humans and lack the ability to flexibly adjust to dynamic environments. In contrast, humans typically possess stronger intent understanding and environmental awareness when performing tasks. When a human receives an instruction such as "grab an apple," they first analyze the key information in the instruction, namely, identifying the object "apple," and quickly determining its presence in the current visual environment. Next, the human adjusts their actions based on the object's position and posture information to complete the grasping task. This process involves not only the close integration of language and visual information but also real-time environmental perception and adjustment of operational strategies. Clearly, humans demonstrate a highly intelligent perception and reaction mechanism when performing tasks, which is a key area that current robotics technology urgently needs to improve.

[0081] While current technologies primarily focus on directly fusing multimodal information, they lack in-depth analysis and optimization strategies, failing to effectively address the issue of insufficient integration between language commands and the visual environment. This deficiency limits robots' ability to understand complex instructions, especially in diverse and dynamically changing environments, where the accuracy and success rate of operations still need improvement. To address these challenges, greater emphasis needs to be placed on more closely integrating language commands with visual perception information, enabling robots to understand key elements of tasks like humans and make intelligent decisions and actions based on environmental perception. This requires not only more efficient multimodal fusion algorithms but also further optimization of the robot's learning and reasoning mechanisms to ensure accurate responses and execution of human commands in complex real-world environments, thereby improving overall operational efficiency and accuracy. Therefore, effectively combining language and visual information to enable robots to understand human intentions more deeply and achieve intelligent operation through comprehensive environmental perception remains a pressing issue for future robotics development.

[0082] To address the aforementioned technical challenges, this disclosure provides a framework for an embodied intelligent operation model based on multimodal semantic fusion. This embodiment, through deep fusion of visual and linguistic instruction information, further enhances the robot's understanding and execution capabilities for complex tasks, thereby significantly improving operational accuracy. This system comprehensively utilizes data from multiple modalities, including but not limited to visual perception information, linguistic instructions, and environmental perception information, aiming to help the model fully and meticulously understand the external environment and human intentions, providing precise guidance for robot operation.

[0083] First, this embodiment employs a Large Language Model (LLM) to perform deep intent parsing of human-input language commands and transforms the parsing results into a feature form acceptable to the multimodal large model, enabling the robot to more accurately understand and execute language commands from humans. This transformation process not only allows language information to complement visual information but also enhances the robot's ability to handle commands in complex task scenarios. Simultaneously, this embodiment also uses a depth camera to acquire RGB images and depth information of the environment and introduces a segmentation model to perform detailed localization and segmentation of all objects in the images, thereby generating object categories and corresponding mask outputs.

[0084] The depth camera in this embodiment includes a hand-eye camera and a third-view camera. The hand-eye camera is mounted 3 cm from the end effector of the robotic arm, allowing real-time observation of the dynamics of the gripper at the end effector and its surrounding environment to ensure precise operation. The third-view camera is positioned directly in front of the robotic arm, providing a more comprehensive overview of the entire working environment, encompassing the robotic arm and its operating space. By combining these two perspectives, this embodiment acquires more comprehensive environmental perception information, enabling the robot to make more accurate judgments and operations.

[0085] This embodiment uses a semantic fusion model to deeply fuse the object masking results obtained from instance segmentation with the language command features generated by a large language model, thereby generating multimodal information with semantic integration. This semantic fusion model combines a cross-attention mechanism to perform detailed fusion and attention extraction of language and visual information, ensuring that language commands and visual perception information complement and reinforce each other at multiple levels, resulting in operational commands that the robot can execute.

[0086] Regarding image information, RGB (red, green, blue) images and depth images captured by hand-eye cameras and third-view cameras are used for feature extraction through image coding models. Different types of images (RGB images and depth images) are processed by different encoders to meet the needs of different input dimensions.

[0087] After all features are extracted, this embodiment inputs the multimodal data from vision, language, and depth sensing into a multimodal large model, which is responsible for the final feature fusion and decision output. The multimodal large model can output various types of robot control information, including the robot's joint states at future moments, end-effector pose, and status signals indicating whether the robot has completed its task. The robot's future joint states encompass the angle values of each joint and the state of the gripper, while the end-effector pose is represented by a six-dimensional vector, specifically describing the translation and rotation information of the robot's end-effector in the x, y, and z directions. By combining this information, the robot can perform more precise and efficient operations in complex and changing environments.
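The three kinds of output described above can be collected in a small data structure. The sketch below assumes, purely for illustration, that the model emits one flat vector laid out as joint angles, gripper state, six pose values, and a completion score. `ControlInfo`, `decode_output`, the layout, the threshold, and the seven-joint example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ControlInfo:
    """Hypothetical container for one model output."""
    joint_angles: List[float]       # future angle value of each joint
    gripper_state: float            # future state of the gripper
    end_effector_pose: List[float]  # 6 values: x, y, z translation and rotation
    task_done: bool                 # status signal: has the task been completed?


def decode_output(raw: Sequence[float], num_joints: int,
                  done_threshold: float = 0.5) -> ControlInfo:
    """Split a flat output vector into named fields.

    Assumed layout: [joint angles | gripper | 6-D pose | completion score].
    """
    expected = num_joints + 1 + 6 + 1
    if len(raw) != expected:
        raise ValueError(f"expected {expected} values, got {len(raw)}")
    pose_start = num_joints + 1
    return ControlInfo(
        joint_angles=list(raw[:num_joints]),
        gripper_state=raw[num_joints],
        end_effector_pose=list(raw[pose_start:pose_start + 6]),
        task_done=raw[-1] >= done_threshold,
    )


# Made-up output for a 7-joint arm.
raw_output = [0.1] * 7 + [1.0] + [0.3, 0.0, 0.2, 0.0, 0.0, 1.57] + [0.9]
info = decode_output(raw_output, num_joints=7)
```

Checking the vector length up front makes a layout mismatch fail loudly instead of silently shifting values between fields.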

[0088] Please refer to Figure 2, which is a schematic diagram of the workflow of an embodied intelligent operation large model framework based on multimodal semantic fusion provided in this embodiment of the present disclosure. The framework enables the robot to comprehensively perceive and promptly respond to the environment, language commands, and its own state. It includes a multi-sensor system, a multimodal encoding module, a multimodal fusion module, and a robot control execution module. The multi-sensor system provides 2D (two-dimensional) environment perception, 3D (three-dimensional) environment perception, language perception, and current state perception.

[0089] First, a multi-sensor system comprehensively perceives the external environment and verbal commands. 2D and 3D environmental information is acquired using a depth camera, with color and depth images of the same size. This 2D and 3D environmental perception method comprehensively captures visual information of the surrounding environment, ensuring the robot can obtain rich environmental data in real time. The language perception module processes the input text commands, converting human-provided task descriptions into language features that the system can understand. These text commands typically contain explicit descriptions of the task, such as object recognition, position indication, or operational requirements. Furthermore, the robot's current state information is also perceived at this stage, primarily through a current state perception module directly connected to the robot's control interface, which can acquire key state parameters of the robot in real time during task execution. This state information includes, but is not limited to, the robot's joint angles, end effector positions, and the open or closed state of the gripper. This data effectively assists the robot in performing precise operations, providing real-time feedback on the robot's specific posture and operational capabilities in the environment. In this embodiment, all perceived information can be uniformly input into a multimodal encoding module for processing. The multimodal coding module extracts features from data from different sources, transforming visual, linguistic, and robot state information into high-dimensional feature vectors. Through this feature extraction, the system can uniformly represent heterogeneous data (such as images, text, and state parameters) into a feature space usable for subsequent processing, providing a foundation for multimodal fusion. Once all features have been extracted, they are input into the multimodal fusion module. 
In this module, the system employs a multi-level feature fusion strategy, combining features from different modalities—visual, linguistic, and state perception—to generate a comprehensive fused feature. This fusion process not only aligns and jointly processes information from different modalities but also ensures, through hierarchical feature extraction and attention mechanisms, that the system can capture the most critical details for task execution in each modality. The multimodal fusion module generates a set of specific robot execution instructions, which precisely guide the robot to complete the predetermined task. After the robot executes the instructions, the system re-acquires information about the current environment, updates the 2D and 3D visual perception data, and simultaneously acquires the robot's new state. This closed-loop process enables the system to dynamically adjust according to changes in the environment, ensuring that the robot can always make optimal decisions based on the latest perceived information when performing continuous tasks. Through this iterative cycle of continuous perception-execution-reperception, the system can effectively improve the robot's operational efficiency and accuracy, enabling it to flexibly respond to and complete tasks in complex and ever-changing environments.
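The perception-execution-reperception cycle described above can be sketched as a simple control loop. This is a minimal illustration, not the disclosed implementation: the `Observation` and `Action` containers, the `perceive`, `policy`, and `execute` callables, and the `max_steps` limit are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Hypothetical container for one perception step."""
    rgb: list                  # color image (placeholder)
    depth: list                # depth image, same size as rgb (placeholder)
    joint_angles: List[float]
    gripper_open: bool


@dataclass
class Action:
    """Hypothetical control information produced for one step."""
    joint_targets: List[float]
    gripper_open: bool
    task_done: bool


def run_closed_loop(
    instruction: str,
    perceive: Callable[[], Observation],
    policy: Callable[[str, Observation], Action],
    execute: Callable[[Action], None],
    max_steps: int = 100,
) -> int:
    """Repeat perceive -> decide -> execute until the policy reports completion.

    Returns the number of actions that were executed.
    """
    steps = 0
    for _ in range(max_steps):
        obs = perceive()                    # re-acquire environment and robot state
        action = policy(instruction, obs)   # encode, fuse, and decode control info
        if action.task_done:                # completion signal stops further actions
            break
        execute(action)
        steps += 1
    return steps
```

Because the observation is refreshed on every iteration, each decision is made from the latest perceived state rather than from a plan fixed at the start of the task.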

[0090] Please refer to Figure 3, which is a schematic diagram of the principle of an embodied intelligent operation large model framework based on multimodal semantic fusion provided in this embodiment of the present disclosure. The figure shows the process of acquiring, processing, and fusing multimodal information. The inputs of the framework include human language commands, RGB images captured by hand-eye cameras, RGB images captured by third-view cameras, depth images captured by hand-eye cameras, depth images captured by third-view cameras, and robot states. The models used in the framework include: a large language model, an instance segmentation model, a semantic fusion model, an RGB image encoding model, a depth image encoding model, a robot state encoding model, and a multimodal large model. The outputs of the framework include the robot's future state, the robot's end-effector pose, and the detection result of whether the robot has completed the task.

[0091] This embodiment can acquire human language commands through a text input module. These commands are then input into a large language model to extract language features. During this process, the large language model performs deep semantic analysis on the input text, generating corresponding feature vectors with a preset dimension n. These language features lay the foundation for subsequent multimodal fusion, ensuring that the system can understand and process complex human commands.

[0092] In terms of visual perception, this embodiment employs two depth cameras, positioned at different locations, to capture RGB images and depth data from the environment. First, a hand-eye camera is mounted 3 cm above the end effector of the robotic arm; its unique position ensures real-time monitoring of the operation of the end effector and detailed changes in its surrounding environment. Second, a third-view camera is positioned directly above and in front of the robot, providing a panoramic view of the entire working environment, including the robotic arm and its operating space. The complementary perspectives of the two cameras ensure the system can comprehensively perceive detailed changes in the robot's operating environment from multiple dimensions. Next, the RGB images captured by these cameras are input into a segmentation model for image segmentation prediction. Through the segmentation model, this embodiment can generate corresponding masks based on the image content, each mask corresponding to a specific semantic category. These masks and their category information are further passed to a semantic fusion model for deep fusion with language features extracted from a large language model. The fused feature vector has a preset dimension value n. This multimodal feature fusion step ensures a close integration of language commands and visual perception information, providing more comprehensive input for subsequent robot control decisions.
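As a small illustration of how a segmentation result yields one mask per semantic category, the sketch below splits a per-pixel class-ID map into binary masks paired with category names. The function name `split_into_masks`, the class IDs, and the category names are made up for this example, and it assumes at most one object per category; an instance segmentation model would instead emit one mask per object instance.

```python
import numpy as np


def split_into_masks(label_map: np.ndarray, class_names: dict, background_id: int = 0):
    """Turn a per-pixel class-ID map into (binary mask, category name) pairs."""
    results = []
    for class_id in np.unique(label_map):
        if class_id == background_id:
            continue  # background pixels do not form an object mask
        mask = label_map == class_id
        results.append((mask, class_names[int(class_id)]))
    return results


# Toy 4x6 segmentation output; IDs and names are illustrative only.
label_map = np.array([
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
])
class_names = {1: "apple", 2: "cup"}
masks = split_into_masks(label_map, class_names)
```

Each resulting pair carries both the shape and position of an object (the mask) and its category label, which are the two inputs later passed to the semantic fusion model.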

[0093] This embodiment also performs separate feature extraction on the RGB images. Specifically, the RGB images from the hand-eye camera and the third-view camera are input into the RGB image encoding model, which encodes the images to generate image feature vectors. These image feature vectors can be further processed by a fully connected layer to convert them into n-dimensional feature vectors, aligning them with the dimension of the other features. Similarly, the depth images acquired by the two depth cameras are processed by the depth image encoding model and a fully connected layer to generate n-dimensional depth feature vectors. This process ensures that the system can make full use of color and depth information to support robot operation tasks.
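The dimension alignment step can be illustrated with a fully connected layer that maps encoder outputs of different sizes to the shared dimension n. This is a sketch under stated assumptions: the encoder outputs are replaced by random vectors, the sizes 32 and 16 and the value n = 8 are arbitrary, and sharing one layer between the two cameras is a choice made here, not something the disclosure specifies.

```python
import numpy as np

N = 8  # preset shared feature dimension n (value chosen only for illustration)


def make_linear(in_dim: int, out_dim: int, rng: np.random.Generator):
    """Return (weight, bias) for one fully connected layer with random weights."""
    weight = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    bias = np.zeros(out_dim)
    return weight, bias


def project(features: np.ndarray, layer) -> np.ndarray:
    """Apply a fully connected layer: features @ W + b."""
    weight, bias = layer
    return features @ weight + bias


rng = np.random.default_rng(0)

# Stand-ins for encoder outputs; real encoders would produce these vectors.
rgb_hand_eye = rng.standard_normal(32)      # hypothetical 32-dim RGB encoding
rgb_third_view = rng.standard_normal(32)
depth_hand_eye = rng.standard_normal(16)    # hypothetical 16-dim depth encoding
depth_third_view = rng.standard_normal(16)

rgb_fc = make_linear(32, N, rng)
depth_fc = make_linear(16, N, rng)

# One n-dimensional row per camera for each modality.
color_features = np.stack([project(rgb_hand_eye, rgb_fc),
                           project(rgb_third_view, rgb_fc)])
depth_features = np.stack([project(depth_hand_eye, depth_fc),
                           project(depth_third_view, depth_fc)])
```

After projection, color and depth features have the same width even though their encoders differ, so they can be placed in one input sequence with the language and state features.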

[0094] This embodiment can also extract features of the robot's own state. The robot state features can be obtained through a dedicated robot state encoding model. This model includes two fully connected layers, which encode key parameters such as the robot's joint angles and end effector state to generate a feature vector of length n. Through this state encoding, the system can track the robot's posture and state in real time, providing accurate feedback for subsequent action decisions.
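A state encoder with two fully connected layers might look like the following sketch. The hidden width, the ReLU activation between the layers, the random weights, and the layout of the packed state vector (seven joint angles, a 3-D end effector position, and a gripper flag) are assumptions made for illustration.

```python
import numpy as np


class RobotStateEncoder:
    """Two fully connected layers mapping raw state values to an n-dim feature."""

    def __init__(self, state_dim: int, hidden_dim: int, n: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((state_dim, hidden_dim)) / np.sqrt(state_dim)
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, n)) / np.sqrt(hidden_dim)
        self.b2 = np.zeros(n)

    def __call__(self, state: np.ndarray) -> np.ndarray:
        hidden = np.maximum(state @ self.w1 + self.b1, 0.0)  # ReLU is an assumption
        return hidden @ self.w2 + self.b2


def pack_state(joint_angles, end_effector_position, gripper_open) -> np.ndarray:
    """Flatten joint angles, end effector position, and gripper flag into one vector."""
    return np.concatenate([
        np.asarray(joint_angles, dtype=float),
        np.asarray(end_effector_position, dtype=float),
        [1.0 if gripper_open else 0.0],
    ])


state = pack_state([0.1, -0.4, 0.7, 0.0, 0.3, -0.2, 0.5], [0.40, 0.05, 0.30], True)
encoder = RobotStateEncoder(state_dim=state.size, hidden_dim=32, n=8)
state_feature = encoder(state)
```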

[0095] All extracted feature vectors are input into the multimodal large model in a predetermined order: semantic (language) features, fused features, color image features, depth image features, and robot state features. The multimodal large model, as the core processing module, fuses these features to generate a final feature instruction vector. This feature instruction vector is then processed by a fully connected layer to generate the system's outputs. The outputs cover three aspects: first, the robot's future state information, specifically the angles of all robot joints and the gripper state for the next operation, which ensures continuity and coordination when performing complex tasks; second, the robot's end effector pose information, including its position and orientation in the operating space, which is important for operational accuracy; and third, a signal indicating whether the task is completed. If the task is completed, the system stops subsequent actions, thereby improving task execution efficiency.
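The input ordering and the three-part output can be sketched as follows. The multimodal large model itself is replaced here by a simple mean over the input sequence, purely as a stand-in; the joint count, the pose layout (position plus quaternion), the zero thresholds for the gripper and completion flags, and all names are assumptions.

```python
import numpy as np

N = 8            # shared feature dimension n (illustrative)
NUM_JOINTS = 7   # assumed joint count

# Predetermined input order described in the text.
INPUT_ORDER = ("semantic", "fused", "color", "depth", "state")


def assemble_sequence(features: dict) -> np.ndarray:
    """Stack per-modality features into one sequence in the predetermined order."""
    rows = [np.atleast_2d(features[name]) for name in INPUT_ORDER]
    return np.concatenate(rows, axis=0)


def decode_outputs(instruction_vector: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> dict:
    """Apply a fully connected layer, then split the result into the three outputs."""
    raw = instruction_vector @ weight + bias
    joints_end = NUM_JOINTS
    gripper_end = joints_end + 1
    pose_end = gripper_end + 7   # assumed pose layout: xyz + quaternion
    return {
        "future_joint_angles": raw[:joints_end],
        "future_gripper_open": bool(raw[joints_end] > 0.0),
        "end_effector_pose": raw[gripper_end:pose_end],
        "task_completed": bool(raw[pose_end] > 0.0),
    }


rng = np.random.default_rng(0)
features = {
    "semantic": rng.standard_normal(N),
    "fused": rng.standard_normal(N),
    "color": rng.standard_normal((2, N)),   # hand-eye and third-view cameras
    "depth": rng.standard_normal((2, N)),
    "state": rng.standard_normal(N),
}
sequence = assemble_sequence(features)       # 1 + 1 + 2 + 2 + 1 = 7 rows

# Stand-in for the multimodal large model: mean over the sequence.
instruction_vector = sequence.mean(axis=0)

out_dim = NUM_JOINTS + 1 + 7 + 1
outputs = decode_outputs(instruction_vector,
                         rng.standard_normal((N, out_dim)),
                         np.zeros(out_dim))
```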

[0096] Please refer to Figure 4, which is a schematic diagram of a mask feature generation principle provided in this embodiment. After inputting masks 1 to k into the mask feature extractor, mask features 1 to k can be obtained. In this embodiment, the mask output by the segmentation model can be input into the mask feature extractor, which can perform deep feature extraction on the mask corresponding to each semantic category. This feature extraction process can effectively capture the spatial structure and detail information of each segmented object in the image, ensuring that the mask can accurately reflect the shape and position of objects in the scene, providing a foundation for subsequent feature fusion.

[0097] Please refer to Figure 5, which is a schematic diagram of a label feature generation principle provided in this embodiment. After inputting mask labels 1 to k into the large language model feature extractor, label features 1 to k can be obtained. In this embodiment, the large language model feature extractor can be used to further process the category label corresponding to each mask. The large language model feature extractor can perform semantic analysis on each category label and extract its semantic features. These semantic features reflect the high-level information contained in the category label, such as the function, attributes, and correlation with other categories of the object. The extracted label features are then subjected to dimensionality reduction processing to finally obtain the feature vector of the corresponding dimension.

[0098] Please refer to Figure 6, which is a schematic diagram of the semantic fusion model provided in this embodiment. The semantic fusion model includes a multi-head cross-attention structure, a multilayer perceptron (MLP) layer, a pooling layer, and a normalization layer. The input parameters include mask features 1-k, label features 1-k, positional codes 1-k, and human instruction features (i.e., semantic features of human language instructions). The output parameter is the fused feature. After feature extraction, this embodiment can concatenate the mask features and label features. This concatenation operation combines information from two different modalities to form a joint feature vector, which contains both the visual information of the object and its corresponding semantic information. This embodiment can add its corresponding positional code to each joint feature vector. These positionally encoded joint features are input into the multi-head cross-attention structure. In the multi-head cross-attention mechanism, the system fuses the mask features, label features, and human instruction features previously extracted from the language input through multi-layer cross-attention operations. The cross-attention mechanism can capture long-range dependencies between features, ensuring that the system can understand the association between language instructions and different objects in the scene. For example, when a human instructs a robot to "grab an apple," the cross-attention mechanism can effectively match and associate the linguistic instruction "apple" with the corresponding object features in the visual mask, thereby generating a feature vector that can guide the robot's operation. This approach, through multi-layered feature extraction and fusion, ensures a high degree of integration between semantic and visual information, supporting the robot's understanding and execution of complex tasks. 
After completing the cross-attention fusion, this embodiment inputs the fused feature vector into a series of post-processing modules, including fully connected layers, pooling layers, and normalization layers. The fully connected layer further adjusts and compresses the feature dimensions, ensuring that the fused information can be efficiently transmitted and processed; the pooling layer extracts the most important features through downsampling operations, reducing redundant information; and the normalization layer standardizes the features, making the feature distribution more balanced and avoiding deviations in subsequent operations. Finally, this embodiment generates a semantic fusion feature of dimension n. This feature vector not only contains spatial details and category information from the visual information but also integrates the instruction intent from the linguistic input, providing the robot with a comprehensive reference for task execution.
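The semantic fusion model described above can be sketched in NumPy. This is a minimal illustration with random weights, not the disclosed model: it assumes that the human instruction features form a short token sequence used as attention queries, that the positionally encoded joint features serve as keys and values, that a single ReLU layer stands in for the MLP, and that pooling is a mean over tokens. The source does not specify these choices.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, keys_values, params, num_heads):
    """Queries attend over keys_values; returns one output row per query."""
    t, d = queries.shape
    k = keys_values.shape[0]
    head_dim = d // num_heads
    q = (queries @ params["wq"]).reshape(t, num_heads, head_dim).transpose(1, 0, 2)
    key = (keys_values @ params["wk"]).reshape(k, num_heads, head_dim).transpose(1, 0, 2)
    val = (keys_values @ params["wv"]).reshape(k, num_heads, head_dim).transpose(1, 0, 2)
    scores = q @ key.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, t, k)
    weights = softmax(scores, axis=-1)                        # attention over objects
    heads = weights @ val                                     # (heads, t, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(t, d)
    return merged @ params["wo"], weights


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def semantic_fusion(mask_feats, label_feats, pos_codes, instr_feats, params, num_heads=2):
    # Concatenate mask and label features per object, then add positional codes.
    joint = np.concatenate([mask_feats, label_feats], axis=-1) + pos_codes   # (k, n)
    attended, weights = multi_head_cross_attention(instr_feats, joint, params, num_heads)
    hidden = np.maximum(attended @ params["w_mlp"] + params["b_mlp"], 0.0)   # MLP layer
    pooled = hidden.mean(axis=0)                                             # pooling
    return layer_norm(pooled), weights                                       # normalization


n, k, t = 8, 3, 4   # feature dimension, number of masks, instruction tokens
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((n, n)) / np.sqrt(n)
          for name in ("wq", "wk", "wv", "wo", "w_mlp")}
params["b_mlp"] = np.zeros(n)

fused, attn = semantic_fusion(
    mask_feats=rng.standard_normal((k, n // 2)),
    label_feats=rng.standard_normal((k, n // 2)),
    pos_codes=rng.standard_normal((k, n)),
    instr_feats=rng.standard_normal((t, n)),
    params=params,
)
```

In this sketch, `attn[h, i, j]` is the weight that instruction token `i` places on object `j` in head `h`. In a trained model, the token for "apple" would be expected to place most of its weight on the mask labeled as an apple.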

[0099] The multimodal fusion framework proposed in this embodiment enhances the intelligence of the robot during task execution and improves the interaction between the robot and humans to some extent. This deep fusion approach strengthens the robot's ability to understand and respond to human commands, improving the accuracy of task execution. This embodiment improves the reliability and precision of the robot during task execution, enhances its understanding of the needs in complex application scenarios, makes human-robot collaboration more efficient, and optimizes the user experience.

[0100] The structure of a robot control system provided in this disclosure includes:

[0101] an instruction processing module, configured to receive a robot control instruction and extract a semantic feature of the robot control instruction;

[0102] an image processing module, configured to acquire a color image of the robot's environment, perform image segmentation on the color image to obtain a mask corresponding to the image content, and extract a mask feature of the mask and a label feature of the semantic category of the mask;

[0103] a feature fusion module, configured to construct a joint feature containing the mask feature and the label feature, and fuse the joint feature with the semantic feature to obtain a fused feature;

[0104] a feature processing module, configured to input the fused feature and preset features into a multimodal large model to obtain control information, wherein the preset features include at least one of the semantic feature of the robot control instruction, a color image feature of the color image of the robot's environment, a depth image feature of a depth image of the robot's environment, and a robot state feature; and

[0105] an execution module, configured to control the robot according to the control information so as to complete the task corresponding to the robot control instruction.
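The division of work among the five modules can be sketched as a small class that wires them together for one control step. Every name here is hypothetical, and each module is represented by a plain callable supplied by the caller; the sketch only shows the data flow, not how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RobotControlSystem:
    """Wires the five modules; each is a plain callable supplied by the caller."""
    extract_semantics: Callable[[str], Any]       # instruction processing module
    extract_mask_and_label_features: Callable[[Any], Any]  # image processing module
    fuse: Callable[[Any, Any], Any]               # feature fusion module
    infer_control: Callable[[Any, Any], Any]      # feature processing module
    execute: Callable[[Any], None]                # execution module

    def step(self, instruction: str, color_image: Any, preset_features: Any) -> Any:
        """Run one pass from instruction and image to an executed control output."""
        semantic = self.extract_semantics(instruction)
        mask_and_label = self.extract_mask_and_label_features(color_image)
        fused = self.fuse(mask_and_label, semantic)
        control = self.infer_control(fused, preset_features)
        self.execute(control)
        return control
```

Keeping each module behind a callable boundary makes it possible to swap, for example, the segmentation model without touching the fusion or execution code.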

[0106] In one optional implementation, the process by which the feature fusion module constructs a joint feature comprising the mask feature and the label feature includes: concatenating the mask feature and the label feature corresponding to each mask to obtain the joint feature; and adding a corresponding positional code to the joint feature according to the position of the mask in the color image.
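One way to derive a positional code from a mask's position in the color image is sketched below. The disclosure does not state how the code is computed; this example assumes the position is summarized by the mask's normalized centroid and encoded with sine and cosine terms, which is a common choice substituted here for illustration. Masks are assumed to be non-empty.

```python
import numpy as np


def mask_centroid(mask: np.ndarray) -> tuple:
    """Return the (row, col) centroid of a non-empty binary mask, scaled by image size."""
    rows, cols = np.nonzero(mask)
    height, width = mask.shape
    return rows.mean() / height, cols.mean() / width


def sinusoidal_code(position: tuple, dim: int) -> np.ndarray:
    """Encode a 2-D position as sin/cos values; dim must be divisible by 4."""
    freqs = 2.0 ** np.arange(dim // 4)
    parts = []
    for coord in position:
        angles = 2.0 * np.pi * coord * freqs
        parts.extend([np.sin(angles), np.cos(angles)])
    return np.concatenate(parts)


def build_joint_features(masks, mask_feats, label_feats) -> np.ndarray:
    """Concatenate per-mask features and add a code for each mask's image position."""
    joint = np.concatenate([mask_feats, label_feats], axis=-1)
    codes = np.stack([sinusoidal_code(mask_centroid(m), joint.shape[-1]) for m in masks])
    return joint + codes


masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True    # object in the top-left corner
masks[1, 2:, 2:] = True    # object in the bottom-right corner

rng = np.random.default_rng(0)
joint_features = build_joint_features(
    masks,
    mask_feats=rng.standard_normal((2, 4)),
    label_feats=rng.standard_normal((2, 4)),
)
```

Two objects with identical mask and label features but different image locations receive different joint features, which lets the later attention step distinguish, for example, the apple on the left from the apple on the right.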

[0107] In one optional implementation, the process by which the feature fusion module fuses the joint feature containing the positional code with the semantic feature to obtain the fused feature includes: processing the joint feature containing the positional code and the semantic feature based on a multi-head cross-attention mechanism to obtain an intermediate processing result; and processing the intermediate processing result using the fully connected layer, pooling layer, and normalization layer of the semantic fusion model to obtain the fused feature.

[0108] In one optional implementation, the robot control system further includes:

[0109] a color image feature extraction module, configured to use a color image encoding model to encode the color image of the robot's environment and obtain color image features;

[0110] a depth image feature extraction module, configured to use a depth image encoding model to encode the depth image of the robot's environment and obtain depth image features; and

[0111] a robot state feature extraction module, configured to use a robot state encoding model to encode the robot's state information and obtain robot state features, wherein the state information includes at least one of joint angles, end effector state, and positioning information.

[0112] In one optional implementation, the color image of the robot's environment includes a first color image captured by a first camera and a second color image captured by a second camera; wherein the first camera is located on the robot, and the second camera is located within the area where the robot is situated.

[0113] Accordingly, the color image feature extraction module is configured to use the color image encoding model to encode the first color image and the second color image and obtain the color image features.

[0114] In one optional implementation, the depth image of the robot's environment includes a first depth image captured by the first camera and a second depth image captured by the second camera.

[0115] Accordingly, the depth image feature extraction module is configured to use the depth image encoding model to encode the first depth image and the second depth image and obtain the depth image features.

[0116] Since the system embodiments correspond to the method embodiments, reference may be made to the description of the method embodiments for details of the system embodiments, which are not repeated here.

[0117] This disclosure also provides a computer-readable storage medium storing a computer program thereon, which, when executed, can perform the steps provided in the above embodiments. The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0118] This disclosure also provides an electronic device that may include a memory and a processor. The memory stores a computer program, and when the processor invokes the computer program in the memory, it can perform the steps provided in the above embodiments. Of course, the electronic device may also include various network interfaces, power supplies, and other components.

[0119] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For identical or similar parts, the embodiments may be referred to one another. Because the systems disclosed in the embodiments correspond to the methods disclosed in the embodiments, their descriptions are relatively brief; for relevant details, refer to the description of the methods. It should be noted that those skilled in the art can make various improvements and modifications to this disclosure without departing from its principles, and these improvements and modifications also fall within the protection scope of this disclosure.

[0120] It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0121] The above descriptions are merely various embodiments of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Industrial applicability

[0122] The above scheme extracts semantic features from robot control instructions, segments the color image of the environment to generate corresponding masks, and determines the semantic category of each mask. The masks, their semantic categories, and the semantic features are fused to generate fused features, which associate the robot control instruction with specific objects in the environment. The fused features and preset features are input into a multimodal large model to obtain control information for completing the task. The preset features include at least one of semantic features, color image features, depth image features, and robot state features. Because the fused features and preset features are input together into the multimodal large model, the model can capture the correlations among the various features and combine them to generate control information. Therefore, this disclosure can accurately identify user intentions and improve the flexibility and accuracy of robot control. This disclosure also provides a robot control system, an electronic device, and a storage medium with the above-mentioned beneficial effects, which will not be elaborated further here.

Claims

1. A robot control method, characterized by comprising: receiving a robot control instruction and extracting a semantic feature of the robot control instruction; acquiring a color image of an environment where a robot is located, performing image segmentation on the color image to obtain a mask corresponding to image content, and extracting a mask feature of the mask and a label feature of a semantic category of the mask; constructing a joint feature comprising the mask feature and the label feature, and fusing the joint feature with the semantic feature to obtain a fused feature; inputting the fused feature and preset features into a multimodal large model to obtain control information, wherein the preset features include the semantic feature of the robot control instruction, a color image feature of the color image of the environment where the robot is located, a depth image feature of a depth image of the environment where the robot is located, and a robot state feature; and controlling the robot according to the control information so as to complete a task corresponding to the robot control instruction.

2. The robot control method according to claim 1, characterized in that constructing the joint feature comprising the mask feature and the label feature comprises: concatenating the mask feature and the label feature corresponding to each mask to obtain the joint feature; and adding a corresponding positional code to the joint feature according to the position of the mask in the color image.

3. The robot control method according to claim 2, characterized in that fusing the joint feature with the semantic feature to obtain the fused feature comprises: processing the joint feature containing the positional code and the semantic feature based on a multi-head cross-attention mechanism to obtain an intermediate processing result; and processing the intermediate processing result using a fully connected layer, a pooling layer, and a normalization layer of a semantic fusion model to obtain the fused feature.

4. The robot control method according to claim 1, characterized in that the color image feature is obtained by encoding the color image of the environment where the robot is located using a color image encoding model; the depth image feature is obtained by encoding the depth image of the environment where the robot is located using a depth image encoding model; and the robot state feature is obtained by encoding state information of the robot using a robot state encoding model, wherein the state information includes at least one of joint angles, an end effector state, and positioning information.

5. The robot control method according to claim 4, characterized in that the color image of the environment where the robot is located includes a first color image captured by a first camera and a second color image captured by a second camera, wherein the first camera is located on the robot and the second camera is located within the area where the robot is located; and encoding the color image of the environment where the robot is located using the color image encoding model to obtain the color image feature comprises: encoding the first color image and the second color image using the color image encoding model to obtain the color image feature.

6. The robot control method according to claim 5, characterized in that the depth image of the environment where the robot is located includes a first depth image captured by the first camera and a second depth image captured by the second camera; and encoding the depth image of the environment where the robot is located using the depth image encoding model to obtain the depth image feature comprises: encoding the first depth image and the second depth image using the depth image encoding model to obtain the depth image feature.

7. A robot control system, characterized by comprising: an instruction processing module, configured to receive a robot control instruction and extract a semantic feature of the robot control instruction; an image processing module, configured to acquire a color image of an environment where a robot is located, perform image segmentation on the color image to obtain a mask corresponding to image content, and extract a mask feature of the mask and a label feature of a semantic category of the mask; a feature fusion module, configured to construct a joint feature comprising the mask feature and the label feature, and fuse the joint feature with the semantic feature to obtain a fused feature; a feature processing module, configured to input the fused feature and preset features into a multimodal large model to obtain control information, wherein the preset features include the semantic feature of the robot control instruction, a color image feature of the color image of the environment where the robot is located, a depth image feature of a depth image of the environment where the robot is located, and a robot state feature; and an execution module, configured to control the robot according to the control information so as to complete a task corresponding to the robot control instruction.

8. An electronic device, characterized by comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when invoking the computer program in the memory, implements the steps of the robot control method according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the steps of the robot control method according to any one of claims 1 to 6.