A large-model-driven method and system for holistic environment modeling, navigation, and autonomous path planning of composite robots
By using a large vision-language model for overall environment modeling and path planning, the invention addresses the limited environmental understanding and unnatural human-robot interaction of traditional robot navigation technology, enabling efficient intelligent navigation and collaborative operation of composite robots in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 杭州市余杭区海创人形机器人产业创新中心
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional robot navigation technology has limited environmental understanding capabilities, cannot comprehend semantic information, results in unnatural human-computer interaction, and has poor path planning adaptability. In particular, it is difficult to achieve globally optimal collaborative planning of movement and operation in composite robots.
A large-model-driven approach is adopted, which uses a vision-language large model to model the overall environment and combines multimodal sensor data to build a hierarchical environment model. This enables the understanding of semantic information and the perception of dynamic objects, as well as global path planning to optimize robot motion and operational coordination.
It improves the robot's intelligence and operational efficiency in complex and dynamic environments, and enables autonomous planning of advanced commands and natural human-machine interaction.
Figure CN122299640A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of robotics, specifically to a large-model-driven method and system for holistic environment modeling, navigation, and autonomous path planning of composite robots.
Background Technology
[0002] Traditional mobile robot navigation technologies, such as SLAM (Simultaneous Localization and Mapping) based on LiDAR, can construct accurate 2D or 3D geometric maps and achieve path planning and obstacle avoidance through algorithms such as A* and DWA. However, these technologies also have significant limitations. Due to their limited environmental understanding capabilities, traditional maps are merely geometric depictions of physical space, making it impossible for robots to understand semantically rich commands such as "go to the printer." This lack of semantic understanding also results in rigid and unintuitive human-robot interaction, often requiring users to issue commands through programming or specifying precise coordinates on the map. Furthermore, in complex dynamic environments where humans and robots coexist, traditional planning algorithms have poor dynamic adaptability, and frequent replanning often leads to robot lag or hesitation. For composite robots with robotic arms, navigation and manipulation tasks are usually planned separately, making it difficult to achieve globally optimal integrated "mobility-manipulation" collaborative planning, thus limiting their overall operational efficiency.
[0003] In recent years, the rapid development of deep learning and large artificial intelligence models, in particular vision-language models (VLMs), has provided new possibilities for solving the aforementioned problems. The powerful scene understanding, language parsing, and logical reasoning capabilities of large models promise to endow robots with a higher level of intelligence. Therefore, how to deeply integrate large-model technology with robot navigation tasks, so as to construct a new technological paradigm capable of overall environment modeling and autonomous planning, is a pressing technical problem in the field of robotics.
Summary of the Invention
[0004] To address the technical problems of existing technologies, such as weak robot environmental understanding, complex human-computer interaction, and poor path planning adaptability, this invention provides a large-model-driven overall modeling composite robot navigation and autonomous path planning method and system, the technical solution of which is as follows:
[0005] A large-model-driven holistic modeling method for navigation and autonomous path planning of composite robots includes the following steps:
[0006] Step 1: Acquire real-time multimodal environmental data through at least one sensor mounted on the composite robot, and perform preprocessing on the multimodal environmental data, including timestamp alignment and coordinate system registration, to form an environmental data stream with a unified spatiotemporal reference.
[0007] Step 2: Input the environmental data stream into a pre-trained visual-language large model, which analyzes the environmental data stream and outputs an overall environmental model containing geometric information, semantic information, and dynamic object information.
[0008] Step 3: Receive an externally input natural language navigation instruction, and use the vision-language large model to interpret its intent and decompose it into tasks. Combined with the overall environment model, generate one or more navigation target points, each containing a position, an attitude, and/or an operation state.
[0009] Step 4: Based on the overall environment model and navigation target point, perform global path planning to generate a preliminary path. Then, incorporate the body kinematics model and end effector dynamics model of the composite robot into the path optimization considerations to perform dynamic smoothing and collaborative optimization on the preliminary path, generate an executable body movement trajectory and end effector action sequence, and drive the composite robot to perform navigation tasks according to the movement trajectory and action sequence.
[0010] Furthermore, the overall environment model is a hierarchical data structure, including:
[0011] A geometric layer of an occupancy grid map or voxel map constructed from high-precision lidar point clouds for real-time obstacle avoidance;
[0012] A semantic layer generated by analyzing camera images using a large vision-language model, which assigns semantic labels to different areas in the map;
[0013] A dynamic layer used to identify and track dynamic obstacles such as pedestrians and other robots in the environment and predict their movement trajectories in the next few seconds.
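As a rough illustration of the hierarchical data structure described above, the following minimal sketch keeps the three layers side by side in one container. The class names, field names, and the 2D cell indexing of semantic regions are assumptions made for illustration; the source does not specify a concrete representation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TrackedObject:
    """One entry of the dynamic layer (hypothetical fields)."""
    label: str        # e.g. "pedestrian"
    position: tuple   # (x, y) in metres, map frame
    velocity: tuple   # (vx, vy) in metres per second


@dataclass
class EnvironmentModel:
    """Three-layer environment model: geometric, semantic, dynamic."""
    resolution: float = 0.1                       # voxel edge length in metres
    occupied: set = field(default_factory=set)    # geometric layer: occupied voxel indices
    regions: dict = field(default_factory=dict)   # semantic layer: label -> set of (ix, iy) cells
    tracks: list = field(default_factory=list)    # dynamic layer: TrackedObject entries

    def insert_points(self, points):
        """Mark the voxel containing each LiDAR point (x, y, z) as occupied."""
        for x, y, z in points:
            self.occupied.add(
                tuple(math.floor(v / self.resolution) for v in (x, y, z))
            )

    def label_region(self, label, cells):
        """Attach a semantic label to a set of 2D map cells."""
        self.regions.setdefault(label, set()).update(cells)

    def region_centroid(self, label):
        """Return the metric centre of a labelled region, usable as a navigation target."""
        cells = self.regions[label]
        n = len(cells)
        cx = sum(ix + 0.5 for ix, _ in cells) / n * self.resolution
        cy = sum(iy + 0.5 for _, iy in cells) / n * self.resolution
        return (cx, cy)


model = EnvironmentModel(resolution=0.5)
model.insert_points([(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.0, 0.0)])
model.label_region("meeting room", {(4, 4), (5, 4)})
model.tracks.append(TrackedObject("pedestrian", (3.0, 1.0), (0.5, 0.0)))
```

Keeping the layers in separate fields lets the geometric layer be updated at sensor rate while the semantic and dynamic layers are refreshed only when the model produces new output.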
[0014] Furthermore, the vision-language big model is Qwen3-VL-30B-A3B-Thinking, which is jointly trained with massive amounts of image, text and robot trajectory data and is used for scene description, object recognition and physical common sense reasoning.
[0015] Furthermore, step 2 updates a global voxel map using point cloud data, and then inputs the RGB image into the visual-language large model. The reasoning process of the visual-language large model is formally described as follows:
[0016] $(S_t, D_t) = F_\theta(I_t, M_t)$,
[0017] where $I_t$ is the input RGB image, $M_t$ is the current geometric map, $F_\theta$ is the vision-language large model with parameters $\theta$, and the outputs are the semantic information $S_t$ and the dynamic object information $D_t$.
[0018] Furthermore, in step 2, the dynamic object information output by the visual-language large model includes a prediction of the future movement trajectory of the dynamic object, and the area covered by the predicted trajectory is marked as an area with a high passage risk in the overall environment model.
[0019] Furthermore, in step 4, the global path planning adopts a semantically guided random tree algorithm. When sampling path nodes, this algorithm prioritizes sampling within the passable areas marked in the semantic information of the overall environment model.
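The sampling preference described above can be sketched as a rapidly-exploring random tree whose sampler draws most of its samples from cells labelled passable. This is a minimal sketch under stated assumptions, not the patented algorithm: the unit grid cells, the `bias` value, the 10% goal bias, and the node-only collision check are all hypothetical simplifications.

```python
import math
import random


def cell_of(p):
    """Unit grid cell containing point p."""
    return (math.floor(p[0]), math.floor(p[1]))


def sample(passable_cells, bounds, bias, rng):
    """With probability `bias`, sample inside a passable cell; otherwise sample uniformly."""
    if passable_cells and rng.random() < bias:
        ix, iy = rng.choice(passable_cells)
        return (ix + rng.random(), iy + rng.random())
    (xmin, xmax), (ymin, ymax) = bounds
    return (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))


def semantic_rrt(start, goal, passable, blocked, bounds,
                 step=0.5, bias=0.8, goal_tol=0.5, max_iter=5000, seed=0):
    """Grow a tree from start; return a list of nodes ending near goal, or None."""
    rng = random.Random(seed)
    passable_cells = sorted(passable)
    parent = {start: None}
    for _ in range(max_iter):
        # Small goal bias (assumed value) so the tree is also pulled toward the goal.
        if rng.random() < 0.1:
            target = goal
        else:
            target = sample(passable_cells, bounds, bias, rng)
        near = min(parent, key=lambda n: math.dist(n, target))
        d = math.dist(near, target)
        if d == 0.0:
            continue
        t = min(1.0, step / d)
        new = (near[0] + t * (target[0] - near[0]),
               near[1] + t * (target[1] - near[1]))
        # Simplification: only the new node is checked, not the whole edge.
        if new in parent or cell_of(new) in blocked:
            continue
        parent[new] = near
        if math.dist(new, goal) <= goal_tol:
            path = [new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None


# A wall at x = 5 with a gap at the top; the "passable" label covers a corridor around it.
blocked = {(5, y) for y in range(8)}
corridor = ({(x, y) for x in (0, 1) for y in range(10)}
            | {(x, y) for x in range(10) for y in (8, 9)}
            | {(x, y) for x in (8, 9) for y in range(10)})
path = semantic_rrt((0.5, 0.5), (9.5, 0.5), corridor, blocked, ((0, 10), (0, 10)))
```

Keeping a fraction of uniform samples means the tree can still reach targets outside the labelled area when the semantic layer is incomplete.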
[0020] Furthermore, in step 4, when the navigation task involves interaction with the environment, the path planning algorithm uses the reachable space of the robotic arm and the operation time as constraints to perform optimal collaborative planning between the mobile platform and the robotic arm.
[0021] Furthermore, the optimal collaborative planning in step 4 is achieved by solving an optimization problem, the objective function of which is shown below:
[0022] $J = w_1 E(x, u) + w_2 T + \Phi(x(T))$,
[0023] where $J$ is the total cost, $x$ is the robot state, $u$ is the control input, $E$ represents energy consumption, $T$ represents time, $w_1$ and $w_2$ are weighting coefficients, and $\Phi$ is the terminal-state cost.
[0024] A large-model-driven holistic modeling composite robot navigation and path autonomous planning system, used to implement any of the methods described above, includes:
[0025] Data acquisition module: Deploys the aforementioned visual-language large model for environmental data acquisition and preprocessing;
[0026] Environment Modeling Module: Used for dynamic overall environment modeling;
[0027] Task parsing module: used for navigation task parsing and target generation;
[0028] Path planning and control module: Used for collaborative path planning and control.
[0029] Beneficial effects
[0030] This invention solves the problems of weak environmental understanding and unnatural human-computer interaction in traditional robot navigation technology by introducing a large model for overall environment modeling and task analysis. It achieves deep perception of complex dynamic environments and autonomous planning of advanced instructions, significantly improving the intelligence level and operational efficiency of composite robots in real-world scenarios.
Description of the Attached Figures
[0031] Figure 1 is a schematic diagram of the principle of large-model-driven holistic modeling for composite robot navigation and autonomous path planning;
[0032] Figure 2 is a flowchart of the large-model-driven holistic modeling method for navigation and autonomous path planning of composite robots;
[0033] Figure 3 is an architecture diagram of the large-model-driven holistic modeling system for composite robot navigation and autonomous path planning;
[0034] Figure 4 is a schematic diagram of the hierarchical structure of the overall environment model in an embodiment of the present invention;
[0035] Figure 5 is a schematic diagram of a specific scenario in an embodiment of the present invention.
Detailed Implementation
[0036] The specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for illustration and explanation only and are not intended to limit the present invention.
[0037] As shown in Figure 1, this invention is based on a vision-language large model. It constructs an overall environment model, parses natural language instructions, and combines them with the overall environment model to plan the body movement and end effector control of the composite robot.
[0038] As shown in Figure 2, the large-model-driven holistic modeling method for navigation and autonomous path planning of composite robots of the present invention includes the following steps:
[0039] Step 1: Acquire real-time multimodal environmental data through at least one sensor mounted on the composite robot, and perform preprocessing on the multimodal environmental data, including timestamp alignment and coordinate system registration, to form an environmental data stream with a unified spatiotemporal reference.
[0040] Step 2: Input the environmental data stream into a pre-trained visual-language large model, which analyzes the environmental data stream and outputs an overall environmental model containing geometric information, semantic information, and dynamic object information.
[0041] Step 3: Receive an externally input natural language navigation instruction, and use the vision-language large model to interpret its intent and decompose it into tasks. Combined with the overall environment model, generate one or more navigation target points, each containing a position, an attitude, and/or an operation state.
[0042] Step 4: Perform global path planning using the overall environment model and navigation target points to generate a preliminary path. Then, incorporate the body kinematics model and end effector dynamics model of the composite robot into the path optimization considerations to perform dynamic smoothing and collaborative optimization on the preliminary path, generating an executable body movement trajectory and end effector action sequence. Drive the composite robot to perform navigation tasks according to the movement trajectory and action sequence.
[0043] As shown in Figure 3, the large-model-driven holistic modeling system for composite robot navigation and autonomous path planning of the present invention carries out the above method through the following modules:
[0044] Data acquisition module: Deploys the aforementioned visual-language large model for environmental data acquisition and preprocessing;
[0045] Environment Modeling Module: Used for dynamic overall environment modeling;
[0046] Task parsing module: used for navigation task parsing and target generation;
[0047] Path planning and control module: Used for collaborative path planning and control.
[0048] Specific embodiments of the present invention
[0049] The system is deployed on a TurtleBot4 composite robot, which has been modified to include an NVIDIA Jetson AGX Orin computing platform, a Velodyne VLP-16 LiDAR, and an Orbbec Astra Pro depth camera, as well as a Robotis OpenManipulator-X robotic arm.
[0050] In step 1, the data acquisition module acquires point clouds from the LiDAR at a frequency of 10Hz and RGB-D images from the depth camera at a frequency of 30Hz. All data is timestamped and published through the ROS 2 robot operating system.
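To make the preprocessing concrete, the sketch below pairs each 10 Hz LiDAR scan with the nearest 30 Hz image by timestamp and transforms a sensor-frame point into the robot base frame. It is a simplified stand-in under stated assumptions: in a ROS 2 system this role would typically be filled by the `message_filters` approximate-time synchronizer and `tf2`, and the `max_skew` tolerance and the planar (yaw-only) mounting transform are hypothetical.

```python
import math
from bisect import bisect_left


def nearest_index(sorted_stamps, t):
    """Index of the timestamp in sorted_stamps that is closest to t."""
    i = bisect_left(sorted_stamps, t)
    if i == 0:
        return 0
    if i == len(sorted_stamps):
        return len(sorted_stamps) - 1
    # Choose the closer of the two neighbours around t.
    return i if sorted_stamps[i] - t < t - sorted_stamps[i - 1] else i - 1


def align(lidar_stamps, image_stamps, max_skew=0.02):
    """Pair each LiDAR stamp with the nearest image stamp within max_skew seconds."""
    pairs = []
    for k, t in enumerate(lidar_stamps):
        j = nearest_index(image_stamps, t)
        if abs(image_stamps[j] - t) <= max_skew:
            pairs.append((k, j))
    return pairs


def to_base_frame(point, yaw, translation):
    """Rotate a sensor-frame point about z by yaw, then translate it into the base frame."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)


lidar_stamps = [0.0, 0.1, 0.2]
image_stamps = [0.0, 0.033, 0.067, 0.1, 0.133, 0.167, 0.2]
pairs = align(lidar_stamps, image_stamps)
point_in_base = to_base_frame((1.0, 0.0, 0.0), math.pi / 2, (0.0, 0.0, 0.5))
```

Scans with no image inside the tolerance are dropped rather than paired with stale data, which keeps the fused stream on a single time reference.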
[0051] In step 2, the environment modeling module subscribes to relevant ROS 2 topics. The Visual-Language Large Model (VLM) is encapsulated as a ROS 2 node. Upon receiving new data, this node first updates a global voxel map using point cloud data. Subsequently, it inputs an RGB image into the VLM. The inference process of the VLM can be formally described as follows:
[0052] $(S_t, D_t) = F_\theta(I_t, M_t)$,
[0053] where $I_t$ is the input RGB image, $M_t$ is the current geometric map, and $F_\theta$ is the VLM with parameters $\theta$. The VLM outputs the semantic information $S_t$ and the dynamic object information $D_t$. This information is integrated into the overall global environment model.
[0054] As shown in Figure 4, the model comprises three core layers:
[0055] The bottom layer is the geometry layer, which consists of a voxel map constructed in real time from high-precision LiDAR point clouds. It mainly provides high-frequency geometric spatial information for the robot's underlying path planning and real-time obstacle avoidance.
[0056] The middle layer is the semantic layer, generated by VLM after analyzing the RGB image. This layer assigns high-level semantic labels to different regions and objects in the geometric map, such as identifying and labeling a spatial region as "meeting room" or an object as "table," which provides the foundation for understanding locations and objects in natural language instructions.
[0057] The top layer is the dynamic layer, where the VLM is responsible for identifying and tracking dynamic obstacles (such as pedestrians) in the environment and predicting their trajectories for the next few seconds based on their historical states. These predicted trajectories are marked as areas with high passage costs in the model for the path planning module to actively avoid.
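The way predicted trajectories feed back into planning can be sketched as follows. In the embodiment the prediction comes from the VLM; the constant-velocity rollout here is only a simple stand-in for it, and the horizon, time step, grid resolution, and penalty value are assumptions.

```python
import math


def predict_positions(position, velocity, horizon=3.0, dt=0.5):
    """Constant-velocity rollout of future (x, y) positions over the horizon."""
    steps = int(round(horizon / dt))
    return [
        (position[0] + velocity[0] * k * dt, position[1] + velocity[1] * k * dt)
        for k in range(1, steps + 1)
    ]


def mark_high_cost(cost_map, predicted, resolution=0.5, penalty=10.0):
    """Raise the traversal cost of every grid cell crossed by a predicted trajectory."""
    for x, y in predicted:
        cell = (math.floor(x / resolution), math.floor(y / resolution))
        cost_map[cell] = max(cost_map.get(cell, 0.0), penalty)
    return cost_map


# A pedestrian at the origin walking along +x at 1 m/s.
future = predict_positions((0.0, 0.0), (1.0, 0.0), horizon=2.0, dt=1.0)
costs = mark_high_cost({}, future, resolution=1.0)
```

Using a cost penalty instead of a hard obstacle leaves the planner free to cross a predicted trajectory when no cheaper route exists.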
[0058] In step 3, the user inputs the command "Go to the meeting room and get me the blue water glass on the table." The VLM node in the task parsing module receives this text command and performs the following inference:
[0059] 1. Break down the user command into tasks, with the following task content: {Task 1: "Navigate to the meeting room"}, {Task 2: "Identify and locate the blue water glass"}, {Task 3: "Execute grab"}, and {Task 4: "Return to the starting point"}.
[0060] 2. Query the semantic map to find the area coordinates corresponding to the "Meeting Room" label, and use them as the target point for Task 1.
[0061] 3. After navigating to the conference room, activate the visual search mode, call VLM again to identify the "blue water glass" in the field of view, and calculate its three-dimensional coordinates based on the depth information, which will serve as the target point for Task 2 and Task 3.
[0062] 4. Record the coordinates of the starting point as the target point for Task 4.
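The four inference steps above can be sketched as a small task list whose targets are filled in from the semantic map where possible. The `Task` fields, the task names, and the example coordinates are hypothetical; the source does not define the output format of the task parsing module.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Task:
    name: str                                       # "navigate", "locate", "grasp", "return"
    target_label: Optional[str] = None              # semantic label the task refers to
    target_xy: Optional[Tuple[float, float]] = None


def resolve_targets(tasks, semantic_map, start_xy):
    """Fill in the target coordinates that are already known before execution."""
    resolved = []
    for task in tasks:
        if task.name == "return":
            xy = start_xy                            # item 4: recorded starting point
        elif task.target_label in semantic_map:
            xy = semantic_map[task.target_label]     # item 2: semantic map lookup
        else:
            xy = None                                # item 3: found later by visual search
        resolved.append(Task(task.name, task.target_label, xy))
    return resolved


tasks = [
    Task("navigate", "meeting room"),
    Task("locate", "blue water glass"),
    Task("grasp", "blue water glass"),
    Task("return"),
]
semantic_map = {"meeting room": (12.0, 3.5)}  # assumed region coordinates
plan = resolve_targets(tasks, semantic_map, start_xy=(0.0, 0.0))
```

Targets left as `None` mark the tasks that can only be resolved on site, after the robot has reached the meeting room and the object is in view.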
[0063] In step 4, the path planning module receives the aforementioned sequence of target points. This module first plans a global path from the current location to the meeting room for Task 1. As shown in Figure 5, while the robot is moving, the local planner continuously fine-tunes the path based on the dynamic environment model to avoid pedestrians who appear suddenly. Upon reaching the meeting room, the system switches to a "movement-operation" collaborative planning mode, finely adjusting the chassis position and orientation so that the robotic arm can grasp the water glass in the optimal posture, rather than simply letting the chassis stop in front of the table. This process is achieved by solving an optimization problem, the objective function of which is shown below:
[0064] $J = w_1 E(x, u) + w_2 T + \Phi(x(T))$,
[0065] where $J$ is the total cost, $x$ is the robot state, $u$ is the control input, $E$ represents energy consumption, $T$ represents time, $w_1$ and $w_2$ are weighting coefficients, and $\Phi$ is the terminal-state cost.
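As a worked illustration of the collaborative mode, the sketch below scores candidate chassis positions with a cost made of an energy term, a time term, and a terminal-state term, and discards candidates from which the arm cannot reach the object. Everything numeric here is an assumption: the weighted-sum form $J = w_1 E + w_2 T + \Phi$, the straight-line energy and time models, the use of the remaining arm-to-object gap as the terminal cost, and the reach value.

```python
import math


def total_cost(energy, duration, terminal, w1=1.0, w2=1.0):
    """Assumed weighted-sum objective: J = w1 * E + w2 * T + Phi."""
    return w1 * energy + w2 * duration + terminal


def best_base_position(robot_xy, object_xy, candidates, arm_reach,
                       speed=0.5, energy_per_m=5.0, w1=1.0, w2=1.0):
    """Return (J, position) for the cheapest candidate within arm reach, or None."""
    best = None
    for c in candidates:
        gap = math.dist(c, object_xy)
        if gap > arm_reach:
            continue                       # reachable-space constraint of the arm
        travel = math.dist(robot_xy, c)
        energy = energy_per_m * travel     # hypothetical linear energy model
        duration = travel / speed          # straight-line travel time
        cost = total_cost(energy, duration, terminal=gap, w1=w1, w2=w2)
        if best is None or cost < best[0]:
            best = (cost, c)
    return best


candidates = [(1.0, 0.0), (1.7, 0.0), (1.8, 0.0), (2.0, 1.0)]
choice = best_base_position((0.0, 0.0), (2.0, 0.0), candidates, arm_reach=0.35)
```

With these placeholder weights the nearer of the two feasible positions wins because travel dominates the cost; raising the weight of the terminal term would shift the choice toward positions closer to the object.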
[0066] Through the above steps, the present invention enables the composite robot to efficiently and intelligently complete the entire navigation and operation task from receiving high-level instructions to final execution in complex and dynamic real environments.
[0067] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A large-model-driven holistic modeling method for navigation and autonomous path planning of a composite robot, characterized in that it includes the following steps:
Step 1: Acquire real-time multimodal environmental data through at least one sensor mounted on the composite robot, and perform preprocessing on the multimodal environmental data, including timestamp alignment and coordinate system registration, to form an environmental data stream with a unified spatiotemporal reference.
Step 2: Input the environmental data stream into a pre-trained visual-language large model, which analyzes the environmental data stream and outputs an overall environmental model containing geometric information, semantic information, and dynamic object information.
Step 3: Receive an externally input natural language navigation instruction, and use the vision-language large model to interpret its intent and decompose it into tasks. Combined with the overall environment model, generate one or more navigation target points, each containing a position, an attitude, and/or an operation state.
Step 4: Based on the overall environment model and navigation target point, perform global path planning to generate a preliminary path. Then, incorporate the body kinematics model and end effector dynamics model of the composite robot into the path optimization considerations to perform dynamic smoothing and collaborative optimization on the preliminary path, generate an executable body movement trajectory and end effector action sequence, and drive the composite robot to perform navigation tasks according to the movement trajectory and action sequence.
2. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: The overall environment model is a hierarchical data structure, including: A geometric layer of an occupancy grid map or voxel map constructed from high-precision lidar point clouds for real-time obstacle avoidance; A semantic layer generated by analyzing camera images using a large vision-language model, which assigns semantic labels to different areas in the map; A dynamic layer used to identify and track dynamic obstacles such as pedestrians and other robots in the environment and predict their trajectories in the next few seconds.
3. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: the vision-language large model is Qwen3-VL-30B-A3B-Thinking. This model is jointly trained with massive amounts of image, text, and robot trajectory data and is used for scene description, object recognition, and physical commonsense reasoning.
4. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: Step 2 uses point cloud data to update a global voxel map, and then inputs the RGB image into the visual-language large model. The reasoning process of the visual-language large model is formally described as follows: $(S_t, D_t) = F_\theta(I_t, M_t)$, where $I_t$ is the input RGB image, $M_t$ is the current geometric map, $F_\theta$ is the visual-language large model with parameters $\theta$, and the outputs are the semantic information $S_t$ and the dynamic object information $D_t$.
5. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: In step 2, the dynamic object information output by the visual-language large model includes a prediction of the future movement trajectory of the dynamic object, and the area covered by the predicted trajectory is marked as an area with a high passage risk in the overall environment model.
6. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: In step 4, the global path planning adopts a semantically guided random tree algorithm. When sampling path nodes, this algorithm prioritizes sampling within the traversable areas marked in the semantic information of the overall environment model.
7. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 1, characterized in that: In step 4, when the navigation task involves interaction with the environment, the path planning algorithm uses the reachable space of the robotic arm and the operation time as constraints to perform optimal collaborative planning between the mobile platform and the robotic arm.
8. The large-model-driven overall modeling composite robot navigation and path autonomous planning method as described in claim 7, characterized in that: the optimal collaborative planning in step 4 is achieved by solving an optimization problem, the objective function of which is shown below: $J = w_1 E(x, u) + w_2 T + \Phi(x(T))$, where $J$ is the total cost, $x$ is the robot state, $u$ is the control input, $E$ represents energy consumption, $T$ represents time, $w_1$ and $w_2$ are weighting coefficients, and $\Phi$ is the terminal-state cost.
9. A large-model-driven, holistic modeling composite robot navigation and path autonomous planning system, characterized in that, For implementing the method as described in any one of claims 1 to 8, comprising: Data acquisition module: Deploys the aforementioned visual-language large model for environmental data acquisition and preprocessing; Environment Modeling Module: Used for dynamic overall environment modeling; Task parsing module: used for navigation task parsing and target generation; Path planning and control module: Used for collaborative path planning and control.