Implicit instruction visual language navigation method and system based on multi-modal large model
By generating semantic and occupancy maps using a multimodal large model, and combining cross-modal attention mechanisms and hierarchical learning strategies, the navigation problem in implicit instructions and complex scenarios is solved, resulting in an efficient and safe navigation system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI
- Filing Date
- 2026-04-16
- Publication Date
- 2026-06-16
AI Technical Summary
Existing visual language navigation methods suffer from problems such as insufficient understanding of implicit instructions, lack of scene semantic association modeling, poor robustness, and insufficient scene memory when faced with implicit instructions and complex scenes, resulting in low navigation efficiency and poor safety.
A multimodal large model is used to generate semantic maps and occupancy maps by combining visual RGB images and depth images. Implicit instructions are parsed through a cross-modal attention mechanism, and navigation training is carried out by combining a hierarchical joint optimization strategy, including imitation learning and reinforcement learning, to detect obstacles and avoid collisions in real time.
The method achieves efficient understanding of implicit instructions and persistent navigation in complex scenarios, improving navigation performance and safety. The resulting navigation system is lightweight and efficient, and is suitable for deployment on edge computing devices.
Figure CN122015880B_ABST
Description
Technical Field
[0001] This invention relates to the fields of computer vision, artificial intelligence and robotics, and in particular to an implicit instruction visual language navigation method and system based on a multimodal large model.
Background Technology
[0002] In the rapid development of modern artificial intelligence and robotics, Vision-and-Language Navigation (VLN), a research hotspot in multimodal fusion, has received widespread attention. Its goal is to guide robots to complete autonomous navigation tasks by parsing user language instructions and combining visual perception with environmental modeling. However, traditional VLN methods mainly rely on explicit step-by-step instructions (such as "walk forward 5 meters, turn right 90 degrees") for path planning. This approach faces the following technical bottlenecks in practical applications:
1. Insufficient understanding of implicit instructions: In actual human-robot interaction, users tend to use vague, natural implicit instructions (such as "It's too hot, please get me some drinks"), which requires the navigation system to have strong semantic reasoning and intent parsing capabilities. Traditional methods cannot infer specific navigation goals and paths from users' implicit needs.
2. Lack of scene semantic association modeling: While existing multimodal large language models possess powerful reasoning capabilities, they lack the ability to model specific semantic associations within the navigation scene, making it difficult to guide robots to unknown destinations based solely on prior knowledge.
3. Lack of robustness in complex scenarios: Existing navigation systems struggle to reliably complete navigation tasks in complex environments involving dynamic conditions, occlusion and noise, and are prone to collisions or deadlocks.
4. Lack of scene memory: Most methods rely on instantaneous perception and lack mechanisms for modeling and remembering persistent scenes, resulting in inefficiency in repeated navigation tasks and an inability to improve navigation performance as scene familiarity increases.
[0003] Therefore, designing an efficient navigation method that can combine multimodal large models, understand implicit instructions, and adapt to persistent and complex scenarios has become a research challenge in the current technical field.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing an implicit instruction visual language navigation method and system based on a multimodal large model. By combining user implicit instructions, scene semantic maps and robot real-time observation status, a navigation robot with semantic reasoning and action correction capabilities is designed, which significantly improves the robot's navigation performance in complex scenarios.
[0005] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0006] In one aspect, this invention provides an implicit instruction visual language navigation method based on a multimodal large model, comprising:
[0007] Semantic maps and occupancy maps are generated based on visual RGB images and depth image data, and the semantic information of objects and passable areas in the scene are dynamically recorded.
[0008] A pre-trained multimodal large model parses the implicit instructions input by the user and combines the local scene map with the robot's current observation state to generate semantic navigation reasoning tokens; the next navigation action is then predicted by fusing multi-source information through a cross-modal attention mechanism.
[0009] Real-time obstacle detection is performed based on depth maps and occupancy maps, and collisions and deadlocks in navigation are avoided by using a collision penalty loss term.
[0010] A hierarchical joint optimization strategy is adopted, which combines implicit instruction datasets to optimize the training process of robot navigation through two stages: imitation learning and reinforcement learning.
[0011] Furthermore, the specific method for generating semantic maps and occupancy maps based on visual RGB image and depth image data is as follows:
[0012] Based on the RGB image and depth map observed by the robot, a pre-trained 3D semantic segmentation network extracts the semantic information of objects in the scene, generates a semantically segmented 3D point cloud, and records the position, category and spatial relationships of objects in the scene.
[0013] Using the inverse pinhole projection method together with the robot's own pose, the semantically segmented 3D point cloud is projected onto a 2D plane to generate an occupancy map, which identifies the passable and impassable areas in the scene, and a semantic map, which records the semantic locations of objects.
[0014] Furthermore, the method also performs time-series cumulative updates on the semantic map and the occupancy map, and stores them as a scene memory structure;
[0015] Based on the current robot pose, a local scene map centered on the robot itself is cropped from the global scene map for subsequent inference and navigation decisions.
[0016] Furthermore, the specific method for generating semantic navigation reasoning tokens by parsing implicit user input using a pre-trained multimodal large model and combining the local scene map and the robot's current observation state is as follows:
[0017] Define two special tokens, an RGB observation token and <map>, to represent the scene RGB observation and the local semantic map, respectively; the scene RGB observation and the semantic map are encoded by two pre-trained visual encoders to obtain these two tokens;
[0018] Expand the vocabulary of the large language model with a token <act> that represents the output of a reasoning navigation action, and design a three-segment chain-of-thought prompt template to guide the large language model in instruction reasoning. The template includes the token <instru>, the RGB observation token together with <map>, and the token <act>: <instru> guides the large language model to understand and infer the user's implicit intent; the RGB observation token and <map> guide it to perform spatial reasoning over the current scene RGB observation and the semantic map to determine possible navigation directions; and <act> guides it to output the semantic representation of the next navigation action;
[0019] The implicit instruction input by the user, together with the robot's current local scene map and observation state, is filled into the three-segment chain-of-thought prompt template and input into the large language model for instruction reasoning. When the model generates a text response containing the token <act>, the embeddings at the position of <act> are extracted from hidden layers of different depths and projected into an inference token through two linear transformation layers.
[0020] Furthermore, the specific method for predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism is as follows:
[0021] Use a map encoder to extract the robot's spatial location information from the occupancy map and generate a map token;
[0022] Use a depth encoder to extract spatial distance information from the depth map and generate a depth token;
[0023] The action output head module is used to fuse inference tokens, map tokens, and depth tokens to predict the next navigation action. The action output head module contains two GRU networks with cross-modal attention mechanisms; the first GRU network processes the robot's current multimodal observations to update the robot's hidden state.
[0024] The second GRU network, based on the robot's hidden state output by the first GRU and the previous navigation action, combines an attention mechanism to calculate a weighted representation of the inference token, map token, and depth token, and outputs a hidden state that integrates multimodal information to predict the next navigation action.
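The attention step inside the action output head can be illustrated with a minimal sketch. It assumes scaled dot-product attention, which the text does not specify, and it omits the two GRU updates entirely: the `query` vector stands in for the hidden state produced by the first GRU, and the function name `attend` is hypothetical.

```python
import math

def attend(query, tokens):
    """Scaled dot-product attention of one query vector over a few token vectors.

    query:  hidden state (list of floats), standing in for the first GRU's output
    tokens: [inference_token, map_token, depth_token], each the same length as query
    Returns (weights, fused): softmax weights and the weighted sum of the tokens.
    """
    dim = len(query)
    # Similarity of the query to each modality token (assumed scoring function).
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    # Numerically stable softmax over the three scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted representation that a second recurrent update could consume.
    fused = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return weights, fused

weights, fused = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

In this toy call the first token is aligned with the query and therefore receives the largest weight; in the described system the fused vector would feed the second GRU rather than being used directly.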
[0025] Furthermore, the specific method for real-time obstacle detection based on depth maps and occupancy maps, and for avoiding collisions and deadlocks in navigation through a collision penalty loss term, is as follows:
[0026] Use depth maps and occupancy maps to detect obstacles on the navigation path in real time;
[0027] Define a collision indicator function that determines whether the robot is about to collide. The function uses a depth threshold hyperparameter: when the Euclidean distance from the robot to the nearest obstacle in the environment's obstacle set does not exceed the depth threshold, a collision risk is detected and the indicator function takes the value 1; otherwise it takes the value 0.
[0028] Design a collision penalty loss term for obstacle avoidance learning during the robot's training phase;
[0029] When a collision risk is detected, the robot is forced to choose to perform a steering action to guide it to avoid the obstacle.
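The collision indicator and the forced steering rule described above can be sketched as follows. The action names, the default turning direction and the 2D obstacle coordinates are illustrative assumptions, not details given in the text.

```python
import math

STEERING_ACTIONS = {"turn_left", "turn_right"}  # hypothetical action names

def collision_indicator(robot_xy, obstacles, depth_threshold):
    """Return 1 if the nearest obstacle is within the depth threshold, else 0."""
    if not obstacles:
        return 0
    nearest = min(math.dist(robot_xy, obs) for obs in obstacles)
    return 1 if nearest <= depth_threshold else 0

def corrected_action(predicted, robot_xy, obstacles, depth_threshold,
                     default_turn="turn_left"):
    """Override a non-steering action with a steering action under collision risk."""
    at_risk = collision_indicator(robot_xy, obstacles, depth_threshold) == 1
    if at_risk and predicted not in STEERING_ACTIONS:
        return default_turn  # assumed default; the text only requires a steering action
    return predicted
```

For example, with an obstacle 0.3 m ahead and a 0.5 m threshold, a predicted `"forward"` becomes `"turn_left"`, while a predicted `"turn_right"` is left unchanged because it already steers the robot.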
[0030] Furthermore, the specific method for optimizing the robot navigation training process through a two-stage process of imitation learning and reinforcement learning, using a hierarchical joint optimization strategy combined with an implicit instruction dataset, is as follows:
[0031] In the first training phase, the robot learns basic navigation actions with the DAgger imitation learning algorithm and is trained to master basic navigation skills using expert-annotated corrective actions.
[0032] The objective function for the first training phase of imitation learning is:
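[0033] $\mathcal{L}_{\mathrm{IL}} = \mathcal{L}_{\mathrm{DAgger}} + \mathcal{L}_{\mathrm{col}}$;
[0034] where $\mathcal{L}_{\mathrm{IL}}$ is the imitation learning objective function of the first training phase, $\mathcal{L}_{\mathrm{DAgger}}$ is the loss for learning basic navigation actions with the DAgger imitation learning algorithm, and $\mathcal{L}_{\mathrm{col}}$ is the collision penalty loss;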
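As a rough illustration of how an imitation loss and a collision penalty can be combined in the first training phase, the sketch below uses a cross-entropy term against the expert action plus a penalty on the probability of moving forward while the collision indicator is 1. Both the penalty form and the `weight` parameter are assumptions made for this example; the text does not define the collision penalty at this point.

```python
import math

def stage1_loss(action_probs, expert_action, collision_risk,
                forward="forward", weight=1.0):
    """Imitation loss plus an assumed collision penalty for one time step.

    action_probs:   dict mapping action name -> predicted probability
    expert_action:  action chosen by the expert (DAgger-style supervision)
    collision_risk: value of the collision indicator, 0 or 1
    """
    imitation = -math.log(action_probs[expert_action])    # cross-entropy term
    penalty = collision_risk * action_probs[forward]      # assumed penalty form
    return imitation + weight * penalty

probs = {"forward": 0.5, "turn_left": 0.25, "turn_right": 0.25}
loss_at_risk = stage1_loss(probs, "turn_left", collision_risk=1)
loss_safe = stage1_loss(probs, "turn_left", collision_risk=0)
```

With these numbers the penalty adds 0.5 to the loss only when a collision risk is flagged, which pushes probability mass away from the forward action near obstacles.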
[0035] The second training phase builds upon the navigation action capabilities acquired in the first training phase to further learn semantic-aware reasoning navigation, optimizing end-to-end trajectory-level objectives through a reinforcement learning paradigm.
[0036] $\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} r_t \log \pi_\theta(a_t \mid s_t)\right]$;
[0037] where $\mathcal{L}_{\mathrm{RL}}$ is the objective function of the second-stage reinforcement learning; $r_t$ is the combined reward at step $t$; $\pi_\theta$ is the policy network with parameters $\theta$; $T$ is the total number of steps the robot actually takes to complete the navigation task; $a_t$ is the navigation action at step $t$; and $s_t$ is the environment state of the robot at step $t$;
[0038] Furthermore, the combined reward at step $t$ takes navigation completion, semantic correctness and trajectory efficiency into account, and is a weighted fusion of a trajectory alignment reward, a destination semantic association reward and a step efficiency reward.
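The weighted reward fusion and a trajectory-level policy-gradient loss can be sketched as below. The fusion weights are placeholder values, and the REINFORCE-style loss is one common way to optimize such an objective; neither is stated in the text.

```python
import math

def combined_reward(r_align, r_semantic, r_step, weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of the three per-step rewards (weights are hypothetical)."""
    w_align, w_semantic, w_step = weights
    return w_align * r_align + w_semantic * r_semantic + w_step * r_step

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss over one trajectory of T steps.

    log_probs: log pi(a_t | s_t) for each step t
    rewards:   combined reward r_t for each step t
    """
    return -sum(r * lp for r, lp in zip(rewards, log_probs))

rewards = [combined_reward(1.0, 0.0, 0.5), combined_reward(1.0, 1.0, 1.0)]
loss = policy_gradient_loss([math.log(0.5), math.log(0.5)], rewards)
```

Minimizing this loss raises the log-probability of actions taken at steps with high combined reward, which is how semantic correctness and efficiency would shape the policy in the second phase.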
[0039] In another aspect, the present invention provides an implicit instruction visual language navigation system based on a multimodal large model, including a scene semantic mapping module, an implicit instruction reasoning and navigation action prediction module, an obstacle avoidance module, and a hierarchical learning module;
[0040] The scene semantic mapping module generates semantic maps and occupancy maps based on visual RGB images and depth image data, and dynamically records the semantic information of objects and passable areas in the scene.
[0041] The implicit instruction reasoning and navigation action prediction module uses a pre-trained multimodal large model to parse the implicit instructions input by the user, combines the local scene map and the robot's current observation state to generate a semantic navigation reasoning token, and uses a cross-modal attention mechanism to fuse multi-source information to predict the next navigation action.
[0042] The obstacle avoidance module performs real-time obstacle detection based on depth maps and occupancy maps, and avoids collisions and deadlocks in navigation by using a collision penalty loss term;
[0043] The hierarchical learning module adopts a hierarchical joint optimization strategy, combining implicit instruction datasets to optimize the robot navigation training process through two stages: imitation learning and reinforcement learning.
[0044] In a third aspect, this application proposes a computer program product, including a computer program or instructions that, when executed by a processor, implement the aforementioned implicit instruction visual language navigation method based on a multimodal large model.
[0045] The beneficial effects of adopting the above technical solution are as follows. The implicit instruction visual language navigation method and system based on a multimodal large model provided by the present invention have the following characteristics:
(1) Implicit instruction reasoning ability: The present invention parses implicit instructions with a multimodal large model and encodes the scene semantic map and the real-time observation state in a unified manner by defining special tokens, thereby realizing efficient reasoning and navigation decisions over implicit semantics and overcoming the dependence of traditional methods on explicit step-by-step instructions.
(2) Scene memory and persistent navigation: By combining the semantic map and the occupancy map, the present invention constructs a persistent scene memory module that supports efficient re-navigation tasks in the same scene. As the number of navigation rounds increases, the robot learns scene semantic associations and navigation performance continues to improve.
(3) Hierarchical learning strategy: The two-stage hierarchical learning strategy proposed in the present invention follows the principle of curriculum learning. It first acquires basic navigation ability through imitation learning and then learns semantic reasoning associations through reinforcement learning. This effectively avoids the sparse reward and gradient variance problems of end-to-end training, accelerates convergence and improves the final policy performance.
(4) Dynamic obstacle avoidance: The action correction module detects obstacles in real time, and the collision penalty loss trains the robot to avoid collisions and deadlocks, improving safety and robustness in complex scenarios.
(5) Lightweight and efficient: The module design of this invention requires only about 22 GB of GPU memory at FP16 precision, with a single-step inference time of about 200 ms, corresponding to a real-time response speed of about 5 FPS, making it suitable for deployment on edge computing devices.
(6) Generalization performance: Through the joint optimization strategy of imitation learning and reinforcement learning, this invention significantly improves the generalization ability of the navigation system across diverse implicit instructions and complex scenarios, and also shows good navigation performance in unseen scenarios.
Attached Figure Description
[0046] Figure 1 is a comparative diagram of traditional VLN and implicit instruction navigation provided in Embodiment 1 of the present invention, wherein (a) is traditional VLN navigation and (b) is reasoning navigation based on implicit instructions (ReasonWalker).
[0047] Figure 2 is a structural block diagram of the implicit instruction visual language navigation method based on a multimodal large model provided in Embodiment 1 of the present invention;
[0048] Figure 3 is an example of navigation visualization for the implicit instruction "I'm thirsty, please get me a cold cola", corresponding to the cola-fetching task provided in Embodiment 1 of the present invention. In this example, (a) is the original RGB frame from the color camera at the robot's current viewpoint; (b) is the grayscale depth map corresponding to (a), where brightness represents distance; (c) is a top-down binary map obtained by projecting depth information, where white areas are passable and black areas are obstacles; (d) is a map of unexplored boundary (frontier) points with superimposed color annotations, where the letter "A" marks the robot's current position; and (e) is the global semantic navigation map, where gray represents mapped areas, green dots are historical semantic target positions, the red dot is the current target point, and the blue arrow indicates the robot's orientation.
[0049] Figure 4 is an example of navigation visualization for the implicit instruction "I need a tool to tighten screws, it may be on the tool table", corresponding to the task of fetching a screw-tightening tool provided in Embodiment 1 of the present invention. In this example, (a) is the original RGB frame from the color camera at the robot's current viewpoint, showing a wooden indoor workshop; (b) is the grayscale depth map corresponding to (a), where brightness represents the distance of each object in the scene from the robot; (c) is the top-down obstacle projection map obtained by projecting depth information, where white areas are passable and black areas are obstacles; (d) is the frontier exploration point map with superimposed color annotations, where the colored area represents the distribution of boundary points to be explored and the letter "A" marks the robot's current position; and (e) is the current global semantic navigation map, where gray represents mapped areas, green dots are historical semantic target positions, the red dot is the current target point, and the blue arrow indicates the robot's orientation.
[0050] Figure 5 is an example of navigation visualization for the implicit instruction "The clothes are washed, please put them on the table on the other side of the corridor", corresponding to the clothing handling task provided in Embodiment 1 of the present invention. In this example, (a) is the original RGB frame from the color camera at the robot's current viewpoint, showing a laundry room containing a washing machine; (b) is the grayscale depth map corresponding to (a), where brightness represents the distance of each object in the scene from the robot; (c) is the top-down obstacle projection map obtained by projecting depth information, where white areas are passable, black areas are obstacles, and the letter "A" marks the robot's current position; (d) is the frontier exploration point map with superimposed color annotations, where the colored points represent the distribution of boundary points to be explored and the letter "A" marks the robot's current position; and (e) is the current global semantic navigation map, where gray represents mapped areas, green dots are historical semantic target positions, the red dot is the current target point, and the blue arrow indicates the direction the robot is facing.
[0051] Figure 6 is a radar chart of the ablation experiments on the depth threshold and collision decay rate parameters provided in Embodiment 1 of the present invention, where (a) shows the depth threshold and (b) shows the collision decay rate. The indicators in the chart have the following meanings: Seen-OS is the Oracle Success rate in seen scenes; Seen-nDTW is the normalized Dynamic Time Warping score in seen scenes, which measures the similarity between the trajectory and the reference path; Seen-SR is the success rate in seen scenes; Seen-SPL is the success rate weighted by path length in seen scenes; Seen-tnDTW is the target-weighted DTW in seen scenes; Unseen-SR is the success rate in unseen scenes; Unseen-SPL is the success rate weighted by path length in unseen scenes; Unseen-OS is the Oracle Success rate in unseen scenes; Unseen-nDTW is the normalized DTW in unseen scenes; and Unseen-tnDTW is the target-weighted DTW in unseen scenes. For the pair of values shown with each indicator, for example Seen-SR (32, 29), the first number is the full-scale normalization baseline of the indicator (i.e., the actual value corresponding to the maximum tick of the coordinate axis), and the second number is the actual score of the method of this invention on that indicator.
[0052] Figure 7 is an analysis diagram of navigation failure cases in a multi-kitchen ambiguous scenario provided in Embodiment 1 of the present invention;
[0053] Figure 8 is a schematic diagram of the deployment of a real robot navigation system provided in Embodiment 1 of the present invention.
Detailed Implementation
[0054] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.
[0055] Example 1:
[0056] As shown in Figures 1-8, in this embodiment the implicit instruction visual language navigation method based on a multimodal large model specifically includes the following steps:
[0057] Step S1: Construct a scene semantic mapping module to generate semantic maps and occupancy maps based on visual RGB image and depth image data, and dynamically record the semantic information of objects and passable areas in the scene;
[0058] To achieve persistent and efficient navigation, this invention uses a Scene Semantic Mapping Module (SSMM) to construct an explicit scene map. The scene map includes two complementary representations, a semantic map and an occupancy map, which record semantic associations and traversable area information across navigation rounds, respectively. Unlike previous map-based navigation methods that rely on topological maps, this invention constructs a continuous 2D visual scene map and explicitly records object semantic information.
[0059] Step S11: Based on the RGB image and depth map observed by the robot, a pre-trained 3D semantic segmentation network (such as RedNet) extracts the semantic information of objects in the scene, generates a semantically segmented 3D point cloud, and records the position, category and spatial relationships of objects in the scene.
[0060] In this embodiment, a pre-trained 3D semantic segmentation network (RedNet) is first used to perform pixel-level semantic annotation on the RGB image and extract the semantic information of objects in the scene; the semantic annotations are then back-projected into 3D space using the depth map to generate a 3D point cloud with semantic labels, recording the position, category and spatial relationships of objects in the scene.
[0061] In this embodiment, at each time step $t$, real-time scene semantic annotation is performed using the pre-trained 3D semantic segmentation network to generate a semantic segmentation point cloud $P_t^{sem}$ and an occupancy point cloud $P_t^{occ}$:
[0062] $(P_t^{sem}, P_t^{occ}) = \mathrm{SM}(I_t, D_t)$;
[0063] Here, $\mathrm{SM}(\cdot)$ represents the pre-trained 3D semantic segmentation network, which uses RedNet as its backbone, takes the RGB image $I_t$ and the depth map $D_t$ as input, and outputs the semantic segmentation point cloud $P_t^{sem}$ and the occupancy point cloud $P_t^{occ}$.
[0064] Step S12: Using the inverse pinhole projection method together with the robot's own pose, the semantically segmented 3D point cloud is projected onto a 2D plane to generate an occupancy map, which identifies the passable and impassable areas in the scene, and a semantic map, which records the semantic locations of objects;
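[0065] In this embodiment, the semantically segmented 3D point cloud generated in step S11 is projected onto a 2D plane to construct a scene map comprising a semantic map $M_t^{sem}$ and an occupancy map $M_t^{occ}$, as shown in the formula below:
[0066] $M_t^{sem} = \Pi^{-1}(P_t^{sem}), \quad M_t^{occ} = \Pi^{-1}(P_t^{occ})$;
[0067] where $\Pi^{-1}(\cdot)$ represents the inverse pinhole projection method, which converts 3D spatial point information into a top-view 2D representation, and $P_t^{sem}$ and $P_t^{occ}$ are the semantic segmentation point cloud and the occupancy point cloud at time step $t$.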
[0068] This invention introduces two complementary map representations: a semantic map and an occupancy map. The semantic map records the top-down spatial positions of common objects in the navigation scene and encodes object-centric semantic relationships to facilitate reasoning navigation based on implicit instructions. The occupancy map represents the passable area: each cell in the map is assigned a binary label, where 1 indicates occupied space and 0 indicates free space.
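The projection from labelled depth pixels to the two top-down maps can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: a fixed camera height, a height band that decides which points count as obstacles, a square grid with a fixed cell size, and the function names `backproject` and `project_frame`.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Inverse pinhole model: pixel (u, v) with depth -> camera-frame point."""
    x_right = (u - cx) * depth / fx
    y_down = (v - cy) * depth / fy
    return x_right, y_down, depth  # z points forward along the optical axis

def project_frame(pixels, pose, intrinsics, size=8, cell=0.5,
                  cam_height=1.0, band=(0.1, 1.5)):
    """Build per-frame top-down occupancy and semantic grids.

    pixels:     iterable of (u, v, depth_m, label), label from segmentation
    pose:       (x, y, yaw) of the robot in the map frame, yaw in radians
    intrinsics: (fx, fy, cx, cy) of the depth camera
    """
    fx, fy, cx, cy = intrinsics
    rx, ry, yaw = pose
    occupancy = [[0] * size for _ in range(size)]    # 1 = occupied, 0 = free
    semantic = [[None] * size for _ in range(size)]  # object label per cell
    for u, v, depth, label in pixels:
        x_right, y_down, z_fwd = backproject(u, v, depth, fx, fy, cx, cy)
        height = cam_height - y_down                 # height above the floor
        # Rotate the (forward, right) offsets by the robot yaw, then translate.
        wx = rx + z_fwd * math.cos(yaw) + x_right * math.sin(yaw)
        wy = ry + z_fwd * math.sin(yaw) - x_right * math.cos(yaw)
        row, col = int(wy // cell), int(wx // cell)
        if not (0 <= row < size and 0 <= col < size):
            continue                                 # outside the mapped area
        if band[0] <= height <= band[1]:             # assumed obstacle height band
            occupancy[row][col] = 1
            if label is not None:
                semantic[row][col] = label
    return occupancy, semantic

occ, sem = project_frame(
    [(50, 50, 2.0, "fridge"), (50, 150, 1.0, None)],  # one object pixel, one floor pixel
    pose=(1.0, 1.0, 0.0),
    intrinsics=(100.0, 100.0, 50.0, 50.0),
)
```

In the call above the first pixel lies at camera height two metres ahead, so it marks a cell as occupied and labels it; the second pixel back-projects onto the floor and leaves its cell free.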
[0069] Step S13: Perform time-series cumulative updates on the semantic map and the occupancy map, and store them as a scene memory structure to support persistent navigation and multiple re-navigation tasks;
[0070] As shown in Figure 2, the scene map is incrementally refined as the robot moves and new observations are added, eventually yielding a complete global scene map.
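One simple way to realize the cumulative update is to merge each new per-frame grid into a persistent memory grid. The merge rule below (occupancy is sticky, the newest semantic label wins) is an assumption chosen for illustration; the text does not state the update rule, and `merge_into_memory` is a hypothetical name.

```python
def merge_into_memory(memory_occ, memory_sem, frame_occ, frame_sem):
    """Accumulate one frame's grids into the persistent scene memory in place."""
    for r, (occ_row, sem_row) in enumerate(zip(frame_occ, frame_sem)):
        for c, (occ, label) in enumerate(zip(occ_row, sem_row)):
            # Once a cell has been seen as occupied it stays occupied (assumed rule).
            memory_occ[r][c] = max(memory_occ[r][c], occ)
            # A newly observed label overwrites the stored one (assumed rule).
            if label is not None:
                memory_sem[r][c] = label
    return memory_occ, memory_sem

memory_occ = [[0, 1], [0, 0]]
memory_sem = [[None, "sofa"], [None, None]]
merge_into_memory(memory_occ, memory_sem,
                  [[0, 0], [1, 0]], [[None, None], ["fridge", None]])
```

After the merge the memory still contains the earlier "sofa" cell and also the newly observed "fridge" cell, which is the behavior a scene memory needs for repeated navigation in the same environment.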
[0071] Step S14: Based on the current robot pose, crop a local scene map centered on the robot itself from the global scene map for subsequent inference and navigation decisions.
[0072] To improve computational efficiency, this invention further crops a local scene map centered on the robot itself from the global scene map based on the current robot pose. This reduces redundant information.
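Cropping the robot-centered local map can be sketched as a window extraction with padding. The padding value (treating cells outside the global map as occupied), the square window and the absence of rotation to the robot's heading are simplifying assumptions of this example.

```python
def crop_local_map(global_map, robot_rc, half, pad=1):
    """Return a (2*half+1) x (2*half+1) window of global_map centered on the robot.

    global_map: 2D list of cell values
    robot_rc:   (row, col) of the robot in the global map
    pad:        value used for cells that fall outside the global map
    """
    r0, c0 = robot_rc
    rows, cols = len(global_map), len(global_map[0])
    local = []
    for r in range(r0 - half, r0 + half + 1):
        window_row = []
        for c in range(c0 - half, c0 + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            window_row.append(global_map[r][c] if inside else pad)
        local.append(window_row)
    return local

grid = [[0] * 5 for _ in range(5)]
grid[2][3] = 1                      # one obstacle to the right of the robot
local = crop_local_map(grid, (2, 2), half=1)
corner = crop_local_map(grid, (0, 0), half=1)
```

The local window has a fixed size regardless of how large the global map grows, which is what keeps the downstream encoder input small.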
[0073] Step S2: Design an implicit instruction reasoning navigation action module, use a pre-trained multimodal large model to parse the implicit instructions input by the user, combine the local scene map and the robot's current observation state to generate a semantic navigation reasoning token, and use a cross-modal attention mechanism to fuse multi-source information to predict the next navigation action;
[0074] Step S21: Define two special tokens, an RGB observation token and <map>, to represent the scene RGB observation and the local semantic map, respectively; the scene RGB observation and the semantic map are encoded by two pre-trained visual encoders to obtain these two tokens;
[0075] To achieve reasoning navigation based on implicit user instructions, this invention uses an implicit instruction reasoning navigation action module (RNAM) that combines the local map, the user instruction and the robot's observation state to predict navigation actions.
[0076] Specifically, to facilitate unified reasoning by the large language model, this invention first encodes the implicit instruction input by the user, the local scene map and the robot's observation state uniformly in token format.
[0077] To reason over the implicit instructions input by the user, this invention uses a pre-trained large language model (such as Qwen) as the implicit instruction reasoning navigation action module to interpret user instructions and provide navigation guidance. Since standard large language models lack direct understanding of visual and spatial information, this invention draws on vision-language alignment models (such as LLaVA) to achieve multimodal reasoning. Unlike LLaVA, which integrates visual information using only a single image token, this invention introduces two visual inputs at the same time, the scene RGB observation and the semantic map, each encoded by an independent visual encoder, achieving richer multimodal alignment. This invention defines two special tokens, an RGB observation token and <map>, to integrate the scene RGB observation and the semantic map. The scene RGB observation $I_t$ and the local semantic map $M_t^{loc}$ are encoded by two pre-trained visual encoders $E_{rgb}$ and $E_{map}$, respectively, to obtain the tokens as shown in the following formulas:
[0078] $T_{rgb} = E_{rgb}(I_t)$;
[0079] $T_{map} = E_{map}(M_t^{loc})$;
[0080] where $T_{rgb}$ and $T_{map}$ are the tokens obtained by encoding the scene RGB observation $I_t$ and the local semantic map $M_t^{loc}$, i.e., the RGB observation token and <map>;
[0081] Step S22: Expand the vocabulary of the large language model with a token <act> that represents the output of a reasoning navigation action, and design a three-segment chain-of-thought prompt template to guide the large language model in instruction reasoning. The template includes the token <instru>, the RGB observation token together with <map>, and the token <act>: <instru> guides the large language model to understand and infer the user's implicit intent; the RGB observation token and <map> guide it to perform spatial reasoning over the current scene RGB observation and the semantic map to determine possible navigation directions; and <act> guides it to output the semantic representation of the next navigation action;
[0082] To construct the navigation reasoning structure, this invention first expands the original large language model vocabulary by adding the token <act>, which represents a request to output a navigation action. This invention also introduces a three-segment chain-of-thought prompt template for large language model instruction reasoning: "User: You are a navigation agent assistant. You need to infer the user's intention from the instruction <instru>, combine the current scene and the scene map <map> to perform navigation reasoning, and output the next location to proceed to, in the following format: <act>."
[0083] The logical structure of this three-segment chain-of-thought prompt template is as follows: the first segment (the <instru> segment) guides the large language model to understand and infer the user's implicit intent; the second segment (the scene observation and <map> segment) guides the model to perform spatial reasoning by combining the current scene RGB observation and the semantic map to determine possible navigation directions; the third segment (the <act> segment) guides the large language model to output the semantic representation of the next navigation action. The three segments progress sequentially, forming a complete reasoning chain of "intent understanding → scene perception → action decision".
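Assembling the three-segment prompt and locating the action token in the model output can be sketched as below. The wording follows the template quoted above; the token name `<rgb>` is a hypothetical stand-in because the name of the RGB observation token is not recoverable from the text, and the integer token ids are illustrative only.

```python
PROMPT_TEMPLATE = (
    "User: You are a navigation agent assistant. "
    "You need to infer the user's intention from the instruction {instru}, "   # segment 1
    "combine the current scene {rgb} and the scene map {map} "                 # segment 2
    "to perform navigation reasoning, and output the next location "
    "to proceed to, in the following format: {act}."                           # segment 3
)

def build_prompt(instruction, rgb_token="<rgb>", map_token="<map>", act_token="<act>"):
    """Fill the three-segment chain-of-thought template with one instruction."""
    return PROMPT_TEMPLATE.format(
        instru=instruction, rgb=rgb_token, map=map_token, act=act_token
    )

def act_positions(token_ids, act_id):
    """Indices in a generated id sequence where the action token appears."""
    return [i for i, tok in enumerate(token_ids) if tok == act_id]

prompt = build_prompt("It's too hot, please get me some drinks")
positions = act_positions([5, 9, 7, 9], act_id=9)
```

The positions returned by `act_positions` are where hidden states would be read out to form the inference token in the following step.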
[0084] Step S23: The implicit instruction input by the user, together with the robot's current local scene map and observation state, is filled into the three-segment chain-of-thought prompt template and input into the large language model for instruction reasoning. When the model generates a text response containing the token <act>, the embeddings at the position of <act> are extracted from hidden layers of different depths and projected into an inference token through two linear transformation layers;
[0085] In this embodiment, the implicit instruction input by the user and the robot's current local scene map and observation state are filled into the three-segment chain-of-thought prompt template and input into the large language model, which reasons over them and outputs a text response. When the large language model is ready to generate the possible next direction, its output contains the token <act>. The embeddings at the position of <act> are then retrieved from hidden layers of different depths of the large language model, concatenated, and projected to generate the inference token. Specifically, the inference token is generated through two linear transformations and a nonlinear activation:
[0086] ;
[0087] ;
[0088] ;
[0089] in, Embed concatenated multi-layer hidden state vectors for hidden layers of different depths. These represent the shallow, middle, and deep layers of the large language model in the token. Location-based hidden layer embedding; This is a hidden representation in the middle; All are learnable weight matrices; All are learnable bias vectors; It is a ReLU nonlinear activation function; For the hidden layer dimension of a large language model, For the intermediate projection dimension, Output dimension for inference token; This is the final generated inference token, used for subsequent cross-modal attention computation.
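The projection described above (concatenate the shallow, middle, and deep embeddings, apply a linear layer with ReLU, then a second linear layer) can be sketched with plain Python lists. The function names, the tiny dimensions, and the weight values are illustrative assumptions; a practical system would use a tensor library and learned parameters.

```python
def linear(weights, bias, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def project_inference_token(h_shallow, h_middle, h_deep, w1, b1, w2, b2):
    """Concatenate three hidden-layer embeddings and apply the two-layer projection."""
    e = list(h_shallow) + list(h_middle) + list(h_deep)  # concatenation
    z = relu(linear(w1, b1, e))                          # first linear layer + ReLU
    return linear(w2, b2, z)                             # second linear layer


# Toy example: hidden dimension 1 per layer, intermediate and output dimension 1.
token = project_inference_token(
    [1.0], [-2.0], [3.0],
    w1=[[1.0, 1.0, 1.0]], b1=[0.0],
    w2=[[2.0]], b2=[1.0],
)
```

In the toy call, the concatenated vector sums to 2.0, passes through the ReLU unchanged, and is mapped by the second layer to a one-dimensional token.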
[0090] Step S24: Use the map encoder to extract the robot's spatial location information from the occupancy map and generate a map token;
[0091] In addition to obtaining the inference token, this invention also uses a map encoder to extract spatial location information from the occupancy map and generate a map token:
[0092] $\mathbf{t}^{m}_{t} = E_{map}(M^{occ}_{t})$;
[0093] where $E_{map}$ is the map encoder; $M^{occ}_{t}$ is the robot's current local occupancy map at step $t$; and $\mathbf{t}^{m}_{t}$ is the map token output by the map encoder, used for subsequent cross-modal attention fusion with the inference token.
[0094] In this embodiment, the map encoder consists of four CBACB blocks (Conv-BN-AvgPool-Conv-BN blocks). Each CBACB block includes a first convolutional layer, a first batch normalization layer, an average pooling layer, a second convolutional layer, and a second batch normalization layer, connected sequentially in series. The four CBACB blocks are stacked in series, and the average pooling layer in each block downsamples the feature map, so that the spatial resolution of the feature map decreases block by block, thereby achieving efficient spatial encoding of the occupancy map at low computational cost.
[0095] Step S25: Use a depth encoder to extract spatial distance information from the depth map and generate a depth token;
[0096] To enable the navigation robot to perceive horizontal spatial distances and avoid obstacles, this invention also uses a depth encoder to encode the depth map and obtain a depth token. In this embodiment, the depth encoder is a pre-trained, modified version of ResNet50 that is fine-tuned for point-goal navigation tasks.
[0097] Step S26: The action output head module is used to fuse inference tokens, map tokens and depth tokens to predict the next navigation action. The action output head module contains two GRU networks with cross-modal attention mechanisms.
[0098] To achieve navigation action prediction, the action output head module of this invention uses two GRU networks with cross-modal attention mechanisms to predict the next navigation action.
[0099] The first GRU network processes the robot's current multimodal observations to update the robot's hidden state. Specifically, the first GRU network encodes the current multimodal observation and the previous action into the hidden state $\mathbf{h}_t$:
[0100] $\mathbf{h}_t = \mathrm{GRU}_1([\mathbf{o}_t; \mathbf{a}_{t-1}], \mathbf{h}_{t-1})$;
[0101] where $\mathbf{o}_t$ is the encoding of the current multimodal observation; $\mathbf{a}_{t-1}$ is the encoding vector of the previous action; $\mathbf{h}_{t-1}$ is the robot's hidden state at the previous time step; and $\mathbf{h}_t$ is the updated hidden state of the robot.
[0102] The second GRU network takes the robot's hidden state (the output of the first GRU network) and the previous navigation action and, through an attention mechanism, computes weighted representations of the inference token, the map token, and the depth token. It outputs a hidden state that incorporates multimodal information, which is used to predict the next navigation action:
[0103] $\mathbf{h}'_t = \mathrm{GRU}_2([\tilde{\mathbf{t}}^{r}_t; \tilde{\mathbf{t}}^{m}_t; \tilde{\mathbf{t}}^{d}_t; \mathbf{h}_t; \mathbf{a}_{t-1}], \mathbf{h}'_{t-1})$;
[0104] $a_t = \arg\max_{a \in \mathcal{A}} \mathrm{softmax}(\mathbf{W}_a \mathbf{h}'_t + \mathbf{b}_a)$;
[0105] where $\tilde{\mathbf{t}}^{r}_t$, $\tilde{\mathbf{t}}^{m}_t$, and $\tilde{\mathbf{t}}^{d}_t$ are the weighted representations of the inference token, the map token, and the depth token after cross-modal attention fusion, respectively; $\mathbf{h}_t$ is the robot's hidden state output by the first GRU network; $\mathbf{a}_{t-1}$ is the encoding vector of the previous action; $\mathbf{h}'_{t-1}$ is the hidden state output by the second GRU network at the previous time step; $\mathbf{h}'_t$ is the hidden state output by the second GRU network, which fuses multimodal information and is used for action prediction; $\mathcal{A}$ is the robot's complete action space, encompassing all candidate actions such as moving forward (MOVE_FORWARD), turning left (TurnLeft), and turning right (TurnRight); $\mathbf{W}_a \mathbf{h}'_t + \mathbf{b}_a$ is a linear transformation of the hidden state that outputs the unnormalized score of each action; $\mathrm{softmax}(\cdot)$ converts the unnormalized scores into a probability distribution over the actions; and $\arg\max_{a \in \mathcal{A}}$ selects the action with the highest probability from the action space $\mathcal{A}$ as the action $a_t$ executed at step $t$.
[0106] After the predicted next navigation action is executed, the robot moves to a new state and obtains new observations; this process is repeated until navigation is complete.
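The fusion and selection steps of the action output head can be illustrated without the recurrent update. The sketch below assumes scaled dot-product attention with the robot's hidden state as the query and the three tokens as keys and values (the text does not specify the attention form), omits both GRU networks, and uses hypothetical toy weights; only the action names come from the description.

```python
import math

ACTIONS = ("MOVE_FORWARD", "TurnLeft", "TurnRight")


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, tokens):
    """Weight each token by its scaled dot-product similarity to the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    weighted = [[w * x for x in t] for w, t in zip(weights, tokens)]
    return weighted, weights


def select_action(hidden, w, b):
    """Linear layer -> softmax -> highest-probability action."""
    logits = [dot(row, hidden) + bi for row, bi in zip(w, b)]
    probs = softmax(logits)
    best = max(range(len(ACTIONS)), key=probs.__getitem__)
    return ACTIONS[best], probs


hidden = [1.0, 0.0]                                # stand-in for the robot's hidden state
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # inference, map, depth tokens (toy values)
weighted, weights = attend(hidden, tokens)
action, probs = select_action(hidden, [[0.0, 0.0], [2.0, 0.0], [-1.0, 0.0]], [0.0, 0.0, 0.0])
```

In this toy setting the token most similar to the hidden state receives the largest attention weight, and the action with the largest logit is chosen.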
[0107] Step S3: Develop a motion correction module to perform real-time obstacle detection based on depth map and occupancy map, and avoid collisions and deadlocks in navigation through collision penalty loss.
[0108] Step S31: Use depth map and occupancy map to detect obstacles on the navigation path in real time;
[0109] Step S32: Define a collision indication function to determine whether the robot is about to collide;
[0110] The collision indication function uses the depth threshold hyperparameter $\tau$ to determine the function value, as shown in the following formula:
[0111] $\mathbb{I}(s_t) = \begin{cases} 1, & \min_{o_i \in \mathcal{O}} \lVert s_t - o_i \rVert_2 \le \tau \\ 0, & \text{otherwise} \end{cases}$;
[0112] where $\mathbb{I}(\cdot)$ is the collision indication function; $s_t$ is the robot's position state at step $t$; $\mathcal{O}$ is the set of obstacles in the environment; $o_i$ is the position of the $i$-th obstacle in the obstacle set $\mathcal{O}$; $\tau$ is the preset depth safety distance threshold hyperparameter; and $\min_{o_i \in \mathcal{O}} \lVert s_t - o_i \rVert_2$ is the Euclidean distance from the robot to the nearest obstacle. When this distance does not exceed $\tau$, a collision risk is detected and the collision indication function value is set to 1; otherwise, it is set to 0.
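A minimal sketch of the collision indication function follows, assuming the robot and obstacle positions are available as coordinate tuples (in practice they would be derived from the depth map and occupancy map). The function names and the handling of an empty obstacle set are illustrative assumptions.

```python
import math


def nearest_obstacle_distance(position, obstacles):
    """Euclidean distance from the robot to the closest obstacle."""
    return min(math.dist(position, o) for o in obstacles)


def collision_indicator(position, obstacles, threshold):
    """Return 1 when the nearest obstacle is within the threshold, else 0."""
    if not obstacles:          # assumption: no obstacles means no collision risk
        return 0
    return 1 if nearest_obstacle_distance(position, obstacles) <= threshold else 0
```

For example, a robot at (0, 0) with a single obstacle at (3, 4) is exactly 5 units away, so the indicator is 1 for a threshold of 5 and 0 for any smaller threshold.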
[0113] Step S33: Design a collision penalty loss term for obstacle avoidance learning during the robot training phase;
[0114] This invention observes that robots inevitably collide and even get stuck in corners during the initial training phase. To alleviate the collision problem in the early stages of navigation and motion learning, this invention proposes a collision penalty loss for robot obstacle avoidance and improved motion training:
[0115] $\mathcal{L}_{col} = \frac{1}{T} \sum_{t=1}^{T} \lambda_t \, e^{-\alpha d_t}$;
[0116] where $\mathcal{L}_{col}$ is the collision penalty loss; $T$ is the total number of steps the robot actually took to complete the navigation task; $d_t$ is the Euclidean distance from the robot to the nearest obstacle in the obstacle set at step $t$; $\alpha$ controls the decay rate of the distance-based penalty: the larger $\alpha$ is, the faster the penalty decreases with increasing distance, and its value is preset in this embodiment; $\lambda_t = \frac{1}{d_t + \epsilon}$ is the penalty coefficient related to the distance to the obstacle; $\epsilon$ is a smoothing constant that prevents the denominator from being zero; and $e^{-\alpha d_t}$ is the distance decay term: the closer the obstacle, the greater the penalty, and the farther the obstacle, the closer the penalty approaches zero.
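The collision penalty can be sketched as follows. The exact functional form is an assumption made for illustration: each step contributes a coefficient 1/(d + eps) multiplied by an exponential decay term exp(-alpha * d), and the contributions are averaged over the trajectory. The default `eps` and the example `alpha` values are arbitrary.

```python
import math


def collision_penalty(distances, alpha, eps=1e-6):
    """Average per-step penalty that grows as obstacles get closer.

    distances: distance to the nearest obstacle at each step of one trajectory.
    alpha:     decay rate; larger values make the penalty vanish faster with distance.
    eps:       smoothing constant that keeps the denominator non-zero.
    """
    if not distances:
        return 0.0
    total = 0.0
    for d in distances:
        coefficient = 1.0 / (d + eps)              # distance-related coefficient
        total += coefficient * math.exp(-alpha * d)  # distance decay term
    return total / len(distances)


near = collision_penalty([0.2, 0.2, 0.2], alpha=1.0)
far = collision_penalty([2.0, 2.0, 2.0], alpha=1.0)
```

Under these assumptions a trajectory that hugs obstacles (`near`) is penalized far more heavily than one that keeps its distance (`far`), which is the behavior the loss term is meant to encourage.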
[0117] Step S34: When a collision risk is detected (i.e., the collision indication function value is 1), the motion correction module masks the probability distribution output by the action output head module, setting the probability of the forward action (MOVE_FORWARD) to zero and renormalizing the probabilities of the remaining actions. This forces the robot to choose between the turning actions (TurnLeft or TurnRight) to avoid the obstacle. The obstacle avoidance navigation action is given by the following formula:
[0118] $a^{avoid}_t = \arg\max_{a \in \mathcal{A}'} \mathrm{softmax}(\mathbf{W}_a \mathbf{h}'_t + \mathbf{b}_a)$, with $\mathcal{A}' = \mathcal{A} \setminus \{\text{MOVE\_FORWARD}\}$;
[0119] where $a^{avoid}_t$ is the final obstacle avoidance navigation action executed at step $t$; $\arg\max$ selects the action with the highest probability; $\mathcal{A}'$ is the set of candidate actions after excluding the forward action; $\mathcal{A}$ is the complete action space; $\setminus$ denotes the set difference operation; $\mathbf{W}_a \mathbf{h}'_t + \mathbf{b}_a$ is a linear transformation of the hidden state $\mathbf{h}'_t$ that outputs the unnormalized score (logits) of each action; and $\mathrm{softmax}(\cdot)$ is the normalized exponential function, which converts the output of the linear layer into a probability distribution over the actions.
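The masking step can be illustrated with a dictionary of action probabilities. This is a sketch under the assumption that the probabilities have already been produced by the action output head; the function names and the uniform fallback for a degenerate distribution are illustrative choices.

```python
def mask_forward(probs, collision_risk, forward="MOVE_FORWARD"):
    """Zero out the forward action and renormalize when a collision is imminent."""
    if not collision_risk:
        return dict(probs)
    remaining = {a: p for a, p in probs.items() if a != forward}
    total = sum(remaining.values())
    if total == 0.0:
        # Degenerate case (assumption): spread probability evenly over the turns.
        masked = {a: 1.0 / len(remaining) for a in remaining}
    else:
        masked = {a: p / total for a, p in remaining.items()}
    masked[forward] = 0.0
    return masked


def choose(probs):
    """Pick the action with the highest probability."""
    return max(probs, key=probs.get)


probs = {"MOVE_FORWARD": 0.6, "TurnLeft": 0.3, "TurnRight": 0.1}
safe = mask_forward(probs, collision_risk=True)
```

With the example distribution, the unmasked choice is MOVE_FORWARD, while the masked distribution assigns 0.75 to TurnLeft and 0.25 to TurnRight, so the robot turns left.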
[0120] Step S4: Employ a hierarchical joint optimization strategy, combining implicit instruction datasets, to optimize the robot navigation training process through two stages of imitation learning and reinforcement learning, thereby improving the robot navigation's generalization ability and robustness.
[0121] This invention proposes a new two-stage hierarchical learning paradigm, which enables robots to first acquire low-level navigation and motion capabilities through imitation learning, and then refine high-level semantic reasoning capabilities through reinforcement learning.
[0122] Step S41: In the first training phase, the robot learns basic navigation actions using the DAgger imitation learning algorithm, and trains the robot to master basic navigation skills using expert-annotated correction actions.
[0123] The first training phase: This invention learns basic navigation actions using the DAgger (Dataset Aggregation) imitation learning algorithm, an interactive imitation learning algorithm that corrects accumulated errors by iteratively aggregating expert demonstrations. Specifically, an initial set of expert demonstration actions $\mathcal{D}_0$ is collected. The robot then follows the policy network $\pi_\theta$ to generate a trajectory $\{(s_t, a_t)\}$, where $a_t$ is the action output by the policy network at time step $t$ and $s_t$ is the robot's environmental state at time step $t$; correction actions $a^*_t$ are provided through expert annotation, and the navigation action imitation learning dataset is updated:
[0124] $\mathcal{D}_i = \mathcal{D}_{i-1} \cup \{(s_t, a^*_t)\}$;
[0125] where $\mathcal{D}$ is the navigation action imitation learning dataset, $i$ is the iteration number of the DAgger algorithm, and $\mathcal{D}_0$ is the initial set of expert demonstration actions. In each iteration, the robot collects new trajectories by interacting with the environment under the current policy network $\pi_\theta$, and experts annotate the correction actions. The new data $\{(s_t, a^*_t)\}$ are then merged into the navigation action imitation learning dataset to obtain the dataset $\mathcal{D}_i$ of the $i$-th round, and the policy network parameters $\theta$ are updated on $\mathcal{D}_i$. This process is repeated until the robot training converges.
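The aggregation loop can be sketched on a toy problem. Everything below except the loop structure is a hypothetical stand-in: states are integers, `expert` is a hand-written labeling rule, `fit` memorizes the dataset as a lookup table instead of training a network, and `rollout` is a trivial environment.

```python
def dagger(initial_demos, rollout, expert, fit, iterations):
    """Iteratively aggregate expert-relabeled states visited by the learner."""
    dataset = list(initial_demos)                    # initial expert demonstrations
    policy = fit(dataset)
    for _ in range(iterations):
        states = rollout(policy)                     # states reached by the current policy
        dataset += [(s, expert(s)) for s in states]  # expert supplies correction actions
        policy = fit(dataset)                        # retrain on the aggregated dataset
    return policy, dataset


def expert(state):
    """Toy expert: turn at every third state, otherwise go forward."""
    return "TurnLeft" if state % 3 == 0 else "MOVE_FORWARD"


def fit(dataset):
    """Toy learner: memorize labels, default to MOVE_FORWARD for unseen states."""
    table = {s: a for s, a in dataset}
    return lambda s: table.get(s, "MOVE_FORWARD")


def rollout(policy, start=1, steps=6):
    """Toy environment: forward advances one state, a turn advances two."""
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s += 1 if policy(s) == "MOVE_FORWARD" else 2
    return states


policy, data = dagger([(0, "TurnLeft")], rollout, expert, fit, iterations=2)
```

The point of the loop is that the learner is corrected on the states it actually visits under its own policy, rather than only on states from the expert's demonstrations, which is how accumulated errors are addressed.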
[0126] The loss for learning basic navigation actions using the DAgger imitation learning algorithm is:
[0127] $\mathcal{L}_{DAgger} = -\sum_{(s_t, a^*_t) \in \mathcal{D}_i} \log \pi_\theta(a^*_t \mid s_t)$;
[0128] where $\mathcal{L}_{DAgger}$ is the loss function for learning basic navigation actions using the DAgger imitation learning algorithm; the sum runs over all state-action pairs $(s_t, a^*_t)$ in the updated dataset $\mathcal{D}_i$; and $\log \pi_\theta(a^*_t \mid s_t)$ is the log-probability that the policy network $\pi_\theta$ outputs the expert action $a^*_t$ in state $s_t$. By maximizing the probability that the policy network predicts the expert's actions, this loss function drives the policy network to imitate expert behavior, thereby correcting accumulated errors.
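The imitation loss is a negative log-likelihood summed over the aggregated state-action pairs. The sketch below assumes the policy is exposed as a function returning a dictionary of action probabilities for a state; the two toy policies are hypothetical.

```python
import math


def imitation_loss(dataset, policy_probs):
    """Negative log-likelihood of the expert actions under the policy.

    dataset:      list of (state, expert_action) pairs.
    policy_probs: function mapping a state to {action: probability}.
    """
    return -sum(math.log(policy_probs(s)[a]) for s, a in dataset)


ACTIONS = ("MOVE_FORWARD", "TurnLeft", "TurnRight")
pairs = [(0, "TurnLeft"), (1, "MOVE_FORWARD")]


def uniform(state):
    """Policy that has learned nothing: equal probability for every action."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}


def confident(state):
    """Policy that mostly agrees with the expert labels in `pairs`."""
    best = "TurnLeft" if state == 0 else "MOVE_FORWARD"
    return {a: (0.8 if a == best else 0.1) for a in ACTIONS}


loss_uniform = imitation_loss(pairs, uniform)
loss_confident = imitation_loss(pairs, confident)
```

A policy that assigns higher probability to the expert's actions yields a lower loss, so minimizing this quantity is equivalent to maximizing the probability of predicting the expert's actions.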
[0129] The objective function for the first training phase of imitation learning is:
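[0130] $\mathcal{L}_{IL} = \mathcal{L}_{DAgger} + \mathcal{L}_{col}$;
[0131] where $\mathcal{L}_{IL}$ is the imitation learning objective function of the first training phase, $\mathcal{L}_{DAgger}$ is the loss function for learning basic navigation actions using the DAgger imitation learning algorithm, and $\mathcal{L}_{col}$ is the collision penalty loss;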
[0132] Step S42: In the second stage, reinforcement learning is used to further learn semantic reasoning navigation and optimize the end-to-end trajectory-level objective;
[0133] The second training phase: Based on the navigation action capabilities acquired in the first training phase, semantic-aware reasoning navigation is further learned. Unlike the first training phase, which updates the policy network in a supervised, stepwise manner, the second training phase optimizes the end-to-end trajectory-level objective through a reinforcement learning paradigm.
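[0134] $\mathcal{L}_{RL} = -\mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} R_t \log \pi_\theta(a_t \mid s_t)\right]$;
[0135] where $\mathcal{L}_{RL}$ is the reinforcement learning objective function of the second training phase; $R_t$ is the combined reward at step $t$; $\pi_\theta$ is the policy network with parameters $\theta$; $T$ is the total number of steps the robot actually took to complete the navigation task; $a_t$ is the navigation action at step $t$; and $s_t$ is the robot's environmental state at step $t$. The objective function follows the REINFORCE policy gradient framework and optimizes the policy network parameters $\theta$ by maximizing the expected cumulative reward.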
[0136] The combined reward $R_t$ at step $t$ takes into account navigation completion, semantic accuracy, and trajectory efficiency:
[0137] $R_t = \lambda_1 R_{align} + \lambda_2 R_{sem} + \lambda_3 R_{eff}$;
[0138] where $R_t$ is the combined reward at step $t$; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of the reward terms, which are constrained so as to balance the contributions of the different rewards; $R_{align}$ is the trajectory alignment reward, which measures how closely the robot's trajectory follows the expert trajectory; $R_{sem}$ is the destination semantic association reward, which measures the semantic consistency between the navigation destination and the implicit instruction; and $R_{eff}$ is the step efficiency reward, which encourages the robot to complete the task in fewer steps: the smaller the number of steps, the larger the reward.
[0139] The trajectory alignment reward encourages the trajectory generated by the robot following the policy network $\pi_\theta$ to align with the expert trajectory:
[0140] $R_{align} = \frac{1}{1 + \sum_{t=1}^{T} d^{traj}_t}$, with $d^{traj}_t = \lVert p_t - p^*_t \rVert_2$;
[0141] where $d^{traj}_t$ is the Euclidean distance between the robot position $p_t$ at step $t$ and the corresponding position $p^*_t$ on the expert trajectory; $p^*_t$ is the reference position of the expert trajectory at step $t$; the denominator $1 + \sum_{t=1}^{T} d^{traj}_t$ sums the trajectory deviations over all time steps and adds 1 to prevent the denominator from being zero; a larger value of $R_{align}$ indicates that the robot's trajectory is closer to the expert trajectory.
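The trajectory alignment reward described above (one divided by one plus the summed per-step deviation from the expert trajectory) can be sketched as follows. The sketch assumes both trajectories are lists of coordinate tuples paired step by step; if their lengths differ, the extra steps of the longer one are ignored, which is a simplification introduced here.

```python
import math


def trajectory_alignment_reward(robot_path, expert_path):
    """Reward in (0, 1]; equals 1 when the robot follows the expert exactly."""
    deviation = sum(math.dist(p, q) for p, q in zip(robot_path, expert_path))
    return 1.0 / (1.0 + deviation)


expert_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
perfect = trajectory_alignment_reward(expert_path, expert_path)
drifted = trajectory_alignment_reward([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], expert_path)
```

Here the drifted trajectory deviates by 0, 1, and 2 units at the three steps, so its reward is 1 / (1 + 3) = 0.25, compared with 1.0 for the perfectly aligned trajectory.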
[0142] Step S43: Design a destination semantic association reward to encourage navigation destinations to be consistent with the locations indicated by implicit instructions.
[0143] Furthermore, to encourage navigation destinations to align with the locations indicated by implicit instructions, this invention proposes a destination semantic association reward, which uses the cosine similarity between the destination embedding of the [MAP] token and the semantic embedding of the implicit instruction as the reward:
[0144] $R_{sem} = \frac{\langle \hat{\mathbf{e}}_{map}, \mathbf{e}_{ins} \rangle}{\lVert \hat{\mathbf{e}}_{map} \rVert \, \lVert \mathbf{e}_{ins} \rVert}$;
[0145] where $\hat{\mathbf{e}}_{map}$ is the normalized destination embedding vector of the [MAP] token, i.e., the result of applying L2 normalization to the original destination embedding $\mathbf{e}_{map}$; $\mathbf{e}_{ins}$ is the semantic embedding vector of the implicit instruction; $\langle \hat{\mathbf{e}}_{map}, \mathbf{e}_{ins} \rangle$ is the inner product of the two, which measures the similarity between the destination representation and the instruction semantics; and $\lVert \hat{\mathbf{e}}_{map} \rVert$ and $\lVert \mathbf{e}_{ins} \rVert$ are the L2 norms of the corresponding vectors. The above formula is essentially a cosine similarity calculation, i.e., $R_{sem} = \cos(\hat{\mathbf{e}}_{map}, \mathbf{e}_{ins})$, with values in the range $[-1, 1]$. A larger value indicates that the destination is more consistent with the semantics of the implicit instruction.
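The destination semantic association reward is a cosine similarity between two embedding vectors, which can be computed as below. The function name and the convention of returning 0.0 for a zero-length vector are assumptions; the embeddings in the example are toy values rather than outputs of a real encoder.

```python
import math


def semantic_reward(destination_embedding, instruction_embedding):
    """Cosine similarity between the destination and instruction embeddings."""
    inner = sum(a * b for a, b in zip(destination_embedding, instruction_embedding))
    norm_d = math.sqrt(sum(a * a for a in destination_embedding))
    norm_i = math.sqrt(sum(b * b for b in instruction_embedding))
    if norm_d == 0.0 or norm_i == 0.0:
        return 0.0  # assumption: undefined similarity is treated as no reward
    return inner / (norm_d * norm_i)


aligned = semantic_reward([1.0, 0.0], [2.0, 0.0])
unrelated = semantic_reward([1.0, 0.0], [0.0, 3.0])
opposite = semantic_reward([1.0, 0.0], [-1.0, 0.0])
```

Because the similarity depends only on direction, rescaling either embedding leaves the reward unchanged; the three toy cases cover the upper bound, the midpoint, and the lower bound of the reward's range.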
[0146] This end-to-end reward ensures that the learned spatial representation is aligned with the semantic description in the instructions, so that the navigation destination is consistent with the location indicated by the implicit instructions.
[0147] This embodiment was also validated experimentally in a continuous environment based on the Habitat simulator, using an extended version of the IR2R-CE dataset. The training set contains 142,940 episodes, 60 scenes, and 222 paths. The validation set includes val-seen (50 scenes, 747 episodes, 156 paths) and val-unseen (10 scenes, 1824 episodes, 36 paths). The large language model used is Qwen2-7b, and the hidden states at layers 3, 16, and 32 of this large language model are extracted. Depth threshold hyperparameters are also used. Setting it to 0.01 controls the decay rate based on distance penalty. Weighting coefficients of each reward item , , .
[0148] Experimental results show that the method of this invention achieves significantly better performance than existing methods in implicit instruction navigation tasks. In the val-unseen scenario, the success rate (SR) reaches 28%, which is 5% higher than the existing best method; the oracle success rate (OS) reaches 41%, which is 10% higher than the existing best method; and the success weighted by path length (SPL) reaches 27%, which is 5% higher than the existing best method.
[0149] Ablation experiments demonstrated the effectiveness of each step: the semantic map input to the large language model plays a key role in implicit instruction reasoning navigation; extracting and combining embeddings from the shallow, medium and deep hidden layers of the large language model can achieve the best navigation performance; the two-stage hierarchical learning strategy is significantly better than the single-stage end-to-end training.
[0150] This embodiment was also deployed and tested on a real robot platform (DeepRobot Doglite2 with a RealSense SR300 RGB-D camera). The test results further verified the practicality of the method of the present invention. The robot was able to successfully complete the reasoning and navigation task based on the user's implicit instructions (such as "I have a stomachache, please take me to the bathroom").
[0151] Example 2:
[0152] The implicit instruction visual language navigation system based on a multimodal large model includes a scene semantic mapping module, an implicit instruction reasoning and navigation action prediction module, an obstacle avoidance module, and a hierarchical learning module.
[0153] The scene semantic mapping module generates semantic maps and occupancy maps based on visual RGB images and depth image data, and dynamically records the semantic information of objects and passable areas in the scene.
[0154] The implicit instruction reasoning and navigation action prediction module uses a pre-trained multimodal large model to parse the implicit instructions input by the user, combines the local scene map and the robot's current observation state to generate a semantic navigation reasoning token, and uses a cross-modal attention mechanism to fuse multi-source information to predict the next navigation action.
[0155] The obstacle avoidance module performs real-time obstacle detection based on depth maps and occupancy maps, and avoids collisions and deadlocks in navigation by using collision penalty loss items;
[0156] The hierarchical learning module adopts a hierarchical joint optimization strategy, combining implicit instruction datasets to optimize the robot navigation training process through two stages: imitation learning and reinforcement learning.
[0157] Example 3:
[0158] This embodiment proposes a computer program product, including a computer program or instructions, which, when executed by a processor, implements the implicit instruction visual language navigation method based on a multimodal large model.
[0159] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the present invention.
Claims
1. An implicit instruction visual language navigation method based on a multimodal large model, characterized in that it includes: generating semantic maps and occupancy maps based on visual RGB image and depth image data, and dynamically recording the semantic information of objects and passable areas in the scene; using a pre-trained multimodal large model to parse the implicit instructions input by the user, combining the local scene map and the robot's current observation state to generate semantic navigation reasoning tokens, and predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism; wherein the specific method for generating the semantic navigation reasoning tokens by using the pre-trained multimodal large model to parse the implicit instructions input by the user, in combination with the local scene map and the robot's current observation state, is as follows: define two special tokens, a scene observation token and <map>, to represent the scene RGB observation and the local semantic map, respectively, and encode the scene RGB observation and the semantic map with two pre-trained visual encoders to obtain the two tokens; expand the vocabulary of the large language model by adding an <act> token to represent the output of a reasoning navigation action, and design a three-stage chain-of-thought prompt template to guide the large language model in instruction reasoning, the template including the <instru> token, the scene observation token and the <map> token, and the <act> token; the <instru> token guides the large language model to understand and infer the user's implicit intent; the scene observation token and the <map> token guide the large language model to perform spatial reasoning by combining the current scene RGB observation and the semantic map to determine possible navigation directions; the <act> token guides the large language model to output the semantic representation of the next navigation action; the implicit instructions input by the user, together with the robot's current local scene map and observation state, are integrated into the three-stage chain-of-thought prompt template and input into the large language model for instruction reasoning; when the model generates a text response containing the <act> token, the hidden-layer embeddings at different depths at the position of the <act> token are extracted and projected into an inference token through two linear transformation layers; performing real-time obstacle detection based on depth maps and occupancy maps, and avoiding collisions and deadlocks in navigation through a collision penalty loss term; and adopting a hierarchical joint optimization strategy that combines implicit instruction datasets to optimize the robot navigation training process through two stages, imitation learning and reinforcement learning.
2. The implicit instruction visual language navigation method based on a multimodal large model according to claim 1, characterized in that the specific method for generating semantic maps and occupancy maps based on visual RGB image and depth image data is as follows: based on the RGB image and depth map data observed by the robot, a pre-trained 3D semantic segmentation network is used to extract semantic information of objects in the scene, generate a semantically segmented 3D point cloud, and record the position, category, and spatial relationships of objects in the scene; using the inverse pinhole projection method combined with the robot's own pose, the semantically segmented 3D point cloud is projected onto a 2D plane to generate an occupancy map, which identifies passable and impassable areas in the scene, and a semantic map, which records the semantic locations of objects.
3. The implicit instruction visual language navigation method based on a multimodal large model according to claim 2, characterized in that, The method also performs time-series cumulative updates on the semantic map and the occupancy map, and stores them as a scene memory structure; Based on the current robot pose, a local scene map centered on the robot itself is cropped from the global scene map for subsequent inference and navigation decisions.
4. The implicit instruction visual language navigation method based on a multimodal large model according to claim 3, characterized in that the specific method for predicting the next navigation action by fusing multi-source information through a cross-modal attention mechanism is as follows: use a map encoder to extract the robot's spatial location information from the occupancy map and generate a map token; use a depth encoder to extract spatial distance information from the depth map and generate a depth token; use an action output head module to fuse the inference token, the map token, and the depth token to predict the next navigation action, the action output head module containing two GRU networks with cross-modal attention mechanisms; the first GRU network processes the robot's current multimodal observations to update the robot's hidden state; the second GRU network, based on the robot's hidden state output by the first GRU network and the previous navigation action, uses an attention mechanism to compute weighted representations of the inference token, the map token, and the depth token, and outputs a hidden state that integrates multimodal information to predict the next navigation action.
5. The implicit instruction visual language navigation method based on a multimodal large model according to claim 4, characterized in that the specific method for performing real-time obstacle detection based on depth maps and occupancy maps, and for avoiding collisions and deadlocks in navigation through a collision penalty loss term, is as follows: use depth maps and occupancy maps to detect obstacles on the navigation path in real time; define a collision indication function to determine whether the robot is about to collide; the collision indication function uses a depth threshold hyperparameter $\tau$ to determine the function value: when the Euclidean distance from the robot to the nearest obstacle in the obstacle set of the environment does not exceed $\tau$, a collision risk is detected and the collision indication function value is set to 1; otherwise, it is set to 0; design a collision penalty loss term for obstacle avoidance learning during the robot's training phase; when a collision risk is detected, the robot is forced to choose a turning action so that it avoids the obstacle.
6. The implicit instruction visual language navigation method based on a multimodal large model according to claim 5, characterized in that the specific method for optimizing the robot navigation training process through the two stages of imitation learning and reinforcement learning, using the hierarchical joint optimization strategy combined with the implicit instruction dataset, is as follows: in the first training phase, basic navigation actions are learned using the DAgger imitation learning algorithm, and the robot is trained to master basic navigation skills using expert-annotated correction actions; the imitation learning objective function of the first training phase is: $\mathcal{L}_{IL} = \mathcal{L}_{DAgger} + \mathcal{L}_{col}$; where $\mathcal{L}_{IL}$ is the imitation learning objective function of the first training phase, $\mathcal{L}_{DAgger}$ is the loss function for learning basic navigation actions using the DAgger imitation learning algorithm, and $\mathcal{L}_{col}$ is the collision penalty loss; the second training phase builds on the navigation action capabilities acquired in the first training phase to further learn semantic-aware reasoning navigation, optimizing the end-to-end trajectory-level objective through a reinforcement learning paradigm: $\mathcal{L}_{RL} = -\mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} R_t \log \pi_\theta(a_t \mid s_t)\right]$; where $\mathcal{L}_{RL}$ is the reinforcement learning objective function of the second training phase; $R_t$ is the combined reward at step $t$; $\pi_\theta$ is the policy network with parameters $\theta$; $T$ is the total number of steps the robot actually took to complete the navigation task; $a_t$ is the navigation action at step $t$; and $s_t$ is the robot's environmental state at step $t$.
7. The implicit instruction visual language navigation method based on a multimodal large model according to claim 6, characterized in that the combined reward at step $t$ takes into account navigation completion, semantic correctness, and trajectory efficiency, and is a weighted fusion of the trajectory alignment reward, the destination semantic association reward, and the step efficiency reward.
8. A visual language navigation system based on a multimodal large model, implemented based on the implicit instruction visual language navigation method based on a multimodal large model as described in claim 1, characterized in that, It includes a scene semantic mapping module, an implicit instruction reasoning and navigation action prediction module, an obstacle avoidance module, and a hierarchical learning module; The scene semantic mapping module generates semantic maps and occupancy maps based on visual RGB images and depth image data, and dynamically records the semantic information of objects and passable areas in the scene. The implicit instruction reasoning and navigation action prediction module uses a pre-trained multimodal large model to parse the implicit instructions input by the user, combines the local scene map and the robot's current observation state to generate a semantic navigation reasoning token, and uses a cross-modal attention mechanism to fuse multi-source information to predict the next navigation action. The obstacle avoidance module performs real-time obstacle detection based on depth maps and occupancy maps, and avoids collisions and deadlocks in navigation by using collision penalty loss items; The hierarchical learning module adopts a hierarchical joint optimization strategy, combining implicit instruction datasets to optimize the robot navigation training process through two stages: imitation learning and reinforcement learning.
9. A computer program product for executing the implicit instruction visual language navigation method based on a multimodal large model as described in any one of claims 1-7, characterized in that, This includes computer programs or instructions that, when executed by a processor, implement the implicit instruction visual language navigation method based on a multimodal large model.