Visual language navigation method and device for a robot, controller, and robot
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-12
Figure CN122194972A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of embodied artificial intelligence technology, and more specifically, to a visual language navigation method, device, controller, and robot for robots.
Background Technology
[0002] Embodied Artificial Intelligence (Embodied AI), an important branch of artificial intelligence, is dedicated to developing intelligent agents capable of autonomously interacting and performing tasks in real physical or simulated environments. Its core objective is to achieve natural collaboration and efficient adaptation between the agent and its environment. Vision-and-Language Navigation (VLN), as one of the core tasks in Embodied AI, plays a crucial role in connecting language understanding and spatial navigation. It requires intelligent agents to accurately parse natural language commands input by humans and, combined with real-time visual observation, complete autonomous navigation in diverse, unknown environments that were not seen during training. It is widely used in scenarios such as indoor service robots and autonomous mobile devices.
[0003] In vision-language navigation tasks, agents need to simultaneously process cross-modal fusion of linguistic and visual information. They must accurately understand the path logic, spatial relationships, and target location descriptions contained in instructions, while also adjusting navigation strategies in real-time based on environmental visual observations. However, to ensure the consistency and accuracy of decision-making, traditional methods generally rely heavily on historical observation data (including past visual frames, pose information, etc.) as the spatiotemporal context for decision-making, inferring environmental structure and navigation paths by accumulating historical data.
[0004] This design approach, which relies on historical observations, leads to two key problems: First, as the navigation process progresses, the amount of historical observation data continues to accumulate, requiring a large amount of storage resources to preserve this data. This significantly increases the storage overhead of the agent, especially for mobile devices or robots with limited hardware resources, severely impacting their deployment feasibility. Second, during the decision-making process, the agent needs to process and analyze massive amounts of historical observation data in real time to extract effective spatiotemporal context information. This process consumes a large amount of computing resources, leading to increased decision-making delays, reduced navigation real-time performance and response efficiency, and making it difficult to meet the dual requirements of navigation accuracy and speed in practical applications.
[0005] Therefore, in current visual-language navigation technology, the excessive storage and computational overhead caused by relying on historical observation data has become a key bottleneck restricting its large-scale application in the field of embodied artificial intelligence. A new navigation technology solution is urgently needed to overcome this limitation.
Summary of the Invention
[0006] This application addresses the shortcomings of the prior art by providing a visual language navigation method, device, controller, and robot for robots, in order to solve the problems existing in the prior art.
[0007] The technical solution adopted in the embodiments of this application is as follows:
In a first aspect, embodiments of this application provide a visual language navigation method for a robot, comprising:
Acquire natural language navigation commands for the robot in a 3D indoor scene, as well as environmental observation images and depth images of the robot at the current time step;
Based on the environmental observation image and the depth image, an annotated semantic map for the current time step is constructed; each semantic region in the annotated semantic map is marked with corresponding text annotation information, which is used to indicate the object category within the corresponding semantic region;
The environmental observation image is encoded using a visual encoder in a preset visual language navigation model to obtain image features;
The map encoder in the preset visual language navigation model is used to encode the annotated semantic map to obtain map features;
Based on the image features, map features, and natural language navigation instructions, the language model in the preset visual language navigation model is used for processing to obtain the robot's navigation action at the next time step;
Based on the navigation action at the next time step, the robot is controlled to move at the next time step.
[0008] In one embodiment, constructing an annotated semantic map for the current time step based on the environmental observation image and the depth image includes:
The current semantic mask of the three-dimensional indoor scene is determined based on the environmental observation image;
The point cloud data of the three-dimensional indoor scene is determined based on the depth image;
Align the current semantic mask with the point cloud data to generate the initial semantic map for the current time step;
Add corresponding text annotation information to each semantic region in the initial semantic map to obtain the annotated semantic map.
[0009] In one embodiment, adding corresponding text annotation information to each semantic region in the initial semantic map to obtain the annotated semantic map includes:
Connected component analysis is performed on the initial semantic map to obtain each semantic region;
Obtain the centroid position of each semantic region;
Add corresponding text annotation information to the centroid position of each semantic region in the initial semantic map to obtain the annotated semantic map.
[0010] In one embodiment, the step of processing the image features, map features, and natural language navigation instructions using the language model in the preset visual language navigation model to obtain the robot's navigation action at the next time step includes:
A preset multimodal projector is used to project the image features and the map features onto the feature space of the language model to obtain the corresponding image token and map token;
Using a preset analyzer, the natural language navigation instructions are projected onto the feature space of the language model to obtain text tokens;
Based on the image token, the map token, and the text token, the language model is used to predict the action, thereby obtaining the robot's navigation action at the next time step.
[0011] In one embodiment, the step of using the language model to predict the robot's navigation action at the next time step based on the image token, the map token, and the text token includes:
The input encoding module in the language model is used to concatenate the image token, the map token, and the text token, and add corresponding token tags to generate multimodal input information;
The action prediction module in the language model is used to predict the action based on the multimodal input information to obtain the navigation action for the next time step.
[0012] In one embodiment, the step of using the action prediction module in the language model to predict actions based on the multimodal input information to obtain the navigation action for the next time step includes:
The action prediction module in the language model is used to perform action matching on the multimodal input information and the matching rule set of multiple preset action types to obtain the navigation action for the next time step.
[0013] In one embodiment, aligning the current semantic mask and the point cloud data to generate an initial semantic map for the current time step includes:
Align the current semantic mask and the point cloud data to construct the current obstacle map and the current explored area map, respectively;
The current obstacle map, the currently explored area map, the robot's current position, and past positions are stored in multiple channels of a preset image to generate an initial semantic map for the current time step.
[0014] In a second aspect, embodiments of this application provide a visual language navigation device for a robot, comprising:
The acquisition module is used to acquire the robot's natural language navigation commands in a 3D indoor scene, as well as the robot's environmental observation image and depth image at the current time step;
A construction module is used to construct an annotated semantic map for the current time step based on the environmental observation image and the depth image; each semantic region in the annotated semantic map is marked with corresponding text annotation information, which is used to indicate the object category within the corresponding semantic region;
The first encoding module is used to encode the environmental observation image using a visual encoder in a preset visual language navigation model to obtain image features;
The second encoding module is used to encode the annotated semantic map using the map encoder in the preset visual language navigation model to obtain map features;
The processing module is used to process the image features, map features, and natural language navigation instructions using the language model in the preset visual language navigation model to obtain the robot's navigation action at the next time step;
The control module is used to control the robot to move in the next time step according to the navigation action in the next time step.
[0015] In a third aspect, embodiments of this application provide a controller, including: a processor, a storage medium, and a bus. The storage medium stores program instructions executable by the processor. When the controller is running, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to implement the visual language navigation method for a robot described in any of the above embodiments.
[0016] In a fourth aspect, embodiments of this application provide a robot, including at least: a robot body and a controller disposed within the robot body, the controller being used to execute the visual language navigation method of the robot described in any of the above embodiments.
[0017] The beneficial effects of this application are: it provides a visual language navigation method for robots, which breaks through the technical limitations of traditional methods that rely on historical observation frames. It uses an annotated semantic map to replace historical data frames as the core spatial memory representation, eliminating the need to store accumulated historical environmental data throughout the process. This significantly reduces storage overhead and computational redundancy during robot navigation, and the map memory usage is constant and unaffected by the number of navigation time steps.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the overall architecture of an embodiment of this application;
Figure 2 is the first schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 3 is the second schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 4 is a schematic diagram of the architecture for constructing annotated semantic maps provided in an embodiment of this application;
Figure 5 is the third schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 6 is the fourth schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 7 is the fifth schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 8 is the sixth schematic flowchart of the visual language navigation method for a robot provided in the embodiments of this application;
Figure 9 is a schematic diagram of the structure of the visual language navigation device for a robot provided in the embodiments of this application;
Figure 10 is a schematic diagram of the controller provided in an embodiment of this application.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are some embodiments of this application, but not all embodiments.
[0021] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.
[0022] Furthermore, the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Additionally, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0023] It should be noted that, where there is no conflict, the features in the embodiments of this application can be combined with each other.
[0024] This application provides a visual language navigation method for robots, which can be executed by any controller with computing and processing capabilities. Figure 1 is a schematic diagram of the overall architecture of an embodiment of this application. Figure 1 shows the entire process of robot visual-language navigation: first, an annotated semantic map is generated through map construction, while the current environmental observation image and the natural language navigation commands are acquired; then, the environmental observation image and the annotated semantic map are processed by a visual encoder and a map encoder, respectively, to obtain the corresponding features, which are converted into image tokens and map tokens by a multimodal projector; next, a text analyzer and a tokenizer (word segmenter) convert the navigation commands into text tokens; finally, the three types of tokens, marked with the <task> (text), <obs> (image), and <map> (map) labels, are concatenated and input into the language model, which outputs navigation actions such as forward, left turn, right turn, and stop to control the robot's movement.
[0025] The visual language navigation method for robots provided in this application is described below with reference to Figure 1, through multiple accompanying figures and specific examples.
[0026] Figure 2 is the first schematic flowchart of the visual language navigation method for robots provided in this application embodiment. As shown in Figure 2, the method includes:
S101. Acquire natural language navigation commands for the robot in a 3D indoor scene, as well as environmental observation images and depth images of the robot at the current time step.
[0027] After the visual-language navigation task in the 3D indoor scene is started, the natural language navigation instructions input by the user are acquired in real time. The instructions are natural language text containing spatial relationships, action logic and target areas, such as "walk past the white chair on the left to enter the kitchen, turn right in front of the refrigerator and then turn left, and stop by the door on the left". At the same time, the first-person environmental observation image (RGB image) at the current time step is acquired by the RGB camera on the robot, and the depth image at the same scene perspective is acquired by the depth camera. The environmental observation image is used to capture the visual features and semantic target information of the scene, and the depth image is used to capture the 3D spatial depth information of the scene.
[0028] S102. Based on the environmental observation image and depth image, construct an annotated semantic map for the current time step.
[0029] Based on environmental observation images and depth images, an annotated semantic map (ASM) of the scene where the robot is located at the current time step is generated through steps such as semantic segmentation, point cloud construction, spatial alignment and text annotation.
[0030] Annotated semantic maps are top-down two-dimensional semantic maps that include spatial distribution information of physical obstacles, explored areas, robot current position, historical position, and various semantic targets. Each semantic region in the map is marked with corresponding text annotation information, which are natural language labels that match the semantic region and are used to indicate the object category in the corresponding semantic region, such as "chair", "refrigerator", "bed", "potted plant", etc.
[0031] S103. The environmental observation image is encoded using the visual encoder in the preset visual language navigation model to obtain image features.
[0032] The environmental observation image at the current time step is input into the visual encoder of the preset visual language navigation model. The visual encoder uses the SigLIP visual encoding network to perform patch partitioning, feature extraction, and spatial perception transformation on the environmental observation image, and outputs high-dimensional image features.
[0033] Image features include visual textures, target outlines, and scene layouts in environmental observation images, and the feature dimensions are adapted to the mapping requirements of subsequent multimodal projectors.
[0034] S104. The map encoder in the preset visual language navigation model is used to encode the annotated semantic map to obtain map features.
[0035] The annotated semantic map constructed at the current time step is input into the map encoder of the preset visual language navigation model. The map encoder performs spatial feature extraction, semantic region feature fusion and structured information encoding on the annotated semantic map, and outputs high-dimensional map features.
[0036] Map features include information such as obstacle distribution, explored area range, robot pose, semantic target location, and semantic association of text annotations in the annotated semantic map, and the map features maintain the same dimensionality as the image features.
[0037] S105. Based on image features, map features, and natural language navigation instructions, the language model in the preset visual language navigation model is used for processing to obtain the robot's navigation action at the next time step.
[0038] Image and map features are mapped to the feature space of the language model using a multimodal projector, transforming them into image and map tokens that the language model can parse. Simultaneously, natural language navigation instructions are encoded and converted into text tokens. The image, map, and text tokens are then structurally concatenated and input into the language model. Through cross-modal fusion and reasoning of the language model, the robot's navigation action for the next time step is output. The navigation action is a discrete basic movement action, including one of forward, left turn, right turn, or stop.
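The structured concatenation described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `build_multimodal_input`, the representation of tokens as plain lists of floats, and the `tag_embeddings` lookup are hypothetical, and the group order (<task>, <obs>, <map>) simply follows the order in which the overall architecture description lists the tags.

```python
from typing import Dict, List, Sequence

Vector = List[float]  # one embedding vector per token (illustrative representation)


def build_multimodal_input(
    text_tokens: Sequence[Vector],
    image_tokens: Sequence[Vector],
    map_tokens: Sequence[Vector],
    tag_embeddings: Dict[str, Vector],
) -> List[Vector]:
    """Concatenate the three token groups, each preceded by its tag embedding."""
    groups = (
        ("<task>", text_tokens),   # natural language navigation instruction
        ("<obs>", image_tokens),   # current environmental observation image
        ("<map>", map_tokens),     # annotated semantic map
    )
    dim = len(tag_embeddings["<task>"])
    sequence: List[Vector] = []
    for tag, tokens in groups:
        sequence.append(tag_embeddings[tag])
        for token in tokens:
            # All tokens must already live in the language model's embedding space.
            if len(token) != dim:
                raise ValueError(f"token under {tag} has dimension {len(token)}, expected {dim}")
            sequence.append(token)
    return sequence
```

The dimension check reflects the requirement that the projected image, map, and text tokens share the embedding dimension of the language model before they are concatenated.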
[0039] S106. Based on the navigation action of the next time step, control the robot to move in the next time step.
[0040] The navigation action output by the language model for the next time step is converted into motion control instructions for the robot and sent to the robot's motion execution module. This module controls the robot to perform the corresponding movement actions in the 3D indoor scene. After completing the movement for that time step, the robot's pose information is updated, and the navigation process for the next time step is entered. S101~S106 are repeated until the robot completes all the navigation tasks specified by the natural language navigation instructions.
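The repetition of S101 to S106 can be sketched as a simple control loop. The sketch below is illustrative only: the callables `sense`, `build_map`, `predict`, and `act` are hypothetical stand-ins for the sensors, the map construction, the visual language navigation model, and the motion execution module, and treating a "stop" action as task completion, as well as the `max_steps` safety limit, are assumptions not stated in the text.

```python
from enum import Enum
from typing import Any, Callable, Tuple


class Action(Enum):
    FORWARD = "forward"
    TURN_LEFT = "left turn"
    TURN_RIGHT = "right turn"
    STOP = "stop"


def navigate(
    instruction: str,
    sense: Callable[[], Tuple[Any, Any]],
    build_map: Callable[[Any, Any], Any],
    predict: Callable[[Any, Any, str], Action],
    act: Callable[[Action], None],
    max_steps: int = 500,
) -> int:
    """Repeat S101-S106; return the number of decisions made."""
    steps = 0
    for _ in range(max_steps):
        rgb, depth = sense()                     # S101: current RGB and depth images
        asm = build_map(rgb, depth)              # S102: annotated semantic map
        action = predict(rgb, asm, instruction)  # S103-S105: encode, project, predict
        steps += 1
        if action is Action.STOP:                # assumed termination condition
            break
        act(action)                              # S106: execute the movement
    return steps
```

Note that the loop keeps only the current image and the current map in scope at each step; no list of past frames is accumulated, which mirrors the memory argument made in the text.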
[0041] In summary, this embodiment provides a visual language navigation method for robots, which breaks through the technical limitations of traditional methods that rely on historical observation frames. It uses annotated semantic maps (ASM) to replace historical data frames as the core spatial memory representation, eliminating the need to store accumulated historical environmental data throughout the process. This significantly reduces storage overhead and computational redundancy during robot navigation, and the map memory usage remains constant, unaffected by the number of navigation time steps.
[0042] Figure 3 is the second schematic flowchart of the visual language navigation method for robots provided in this application. Figure 4 is a schematic diagram of the architecture for constructing annotated semantic maps provided in an embodiment of this application. As shown in Figures 3 and 4, step S102, which involves constructing an annotated semantic map for the current time step based on environmental observation images and depth images, includes:
S201. Determine the current semantic mask of the 3D indoor scene based on the environmental observation image.
[0043] The environmental observation image at the current time step is input into the pre-trained semantic segmentation module (Mask2Former). The semantic segmentation algorithm performs semantic classification on each pixel in the environmental observation image, identifies the pixel regions corresponding to various semantic targets in the image, and generates the current semantic mask.
[0044] The current semantic mask is a binary mask image that matches the size of the environmental observation image. Each mask region corresponds to a preset object category, such as chair, refrigerator, wall, ground, etc., and is used to mark the spatial distribution of different semantic targets in the environmental observation image.
[0045] S202. Determine the point cloud data of the 3D indoor scene based on the depth image.
[0046] After preprocessing the depth image to remove noise points and invalid depth values, the two-dimensional pixel coordinates of the depth image are converted into three-dimensional world coordinates by combining the intrinsic parameters of the depth camera and the current pose information of the robot, thus generating point cloud data of the three-dimensional indoor scene.
[0047] Point cloud data is a three-dimensional point set containing a large number of spatial points. Each spatial point carries three-dimensional coordinate information, which can accurately represent the three-dimensional spatial structure of the scene and the spatial position of objects.
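The conversion from depth pixels to world coordinates in S202 can be sketched with the standard pinhole camera model. The sketch makes several assumptions that the text does not specify: the camera frame has x to the right, y down, and z forward; the robot pose is planar, given as (x, y, yaw); the camera looks along the robot's heading at height `cam_height`; and depths outside (0, `max_depth`] are treated as invalid. The function name and parameters are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def depth_to_points(
    depth: Sequence[Sequence[Optional[float]]],
    fx: float, fy: float, cx: float, cy: float,
    pose: Tuple[float, float, float] = (0.0, 0.0, 0.0),
    cam_height: float = 0.0,
    max_depth: float = 10.0,
) -> List[Tuple[float, float, float]]:
    """Back-project a depth image (metres) into world-frame 3D points."""
    px, py, yaw = pose
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Preprocessing: drop missing, non-positive, NaN, or too-distant depths.
            if z is None or not (0.0 < z <= max_depth):
                continue
            # Pinhole model: pixel (u, v) with depth z -> camera coordinates.
            x_cam = (u - cx) * z / fx
            y_cam = (v - cy) * z / fy
            # Camera frame -> robot frame (forward, left, up), by assumption.
            forward, left, up = z, -x_cam, -y_cam
            # Robot frame -> world frame using the planar pose.
            wx = px + cos_y * forward - sin_y * left
            wy = py + sin_y * forward + cos_y * left
            points.append((wx, wy, cam_height + up))
    return points
```

A real system would typically use a full 6-DoF camera-to-world transform; the planar pose is used here only to keep the example short.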
[0048] S203. Align the current semantic mask and point cloud data to generate the initial semantic map for the current time step.
[0049] A three-dimensional spatial coordinate system is established with the robot's current pose as the origin. The two-dimensional semantic information of the current semantic mask is spatially registered and aligned with the three-dimensional spatial information of the point cloud data. Various semantic target regions in the semantic mask are mapped to the corresponding three-dimensional spatial positions in the point cloud data. The aligned three-dimensional point cloud data is then projected onto a two-dimensional plane to generate an initial semantic map from top to bottom. The initial semantic map is a semantic map in the form of a multi-channel tensor, which includes the spatial distribution information of physical obstacles, explored areas, the robot's current position, and historical positions.
[0050] Figure 5 is the third schematic flowchart of the visual language navigation method for robots provided in this application embodiment. As shown in Figure 5, S203 specifically includes:
S301. Align the current semantic mask and point cloud data to construct the current obstacle map and the current explored area map, respectively.
[0051] After spatially aligning the current semantic mask with the point cloud data, based on the 3D spatial distance information of the point cloud data and the semantic classification information of the semantic mask, the physical obstacle areas (such as walls, furniture, cabinets, etc.) that hinder the robot's movement in the 3D indoor scene are identified. These obstacles are then projected onto a 2D plane and binarized to construct a current obstacle map. Simultaneously, based on the field of view of the environmental observation image and depth image acquired at the current time step, combined with the robot's pose information, the robot's scene exploration range at the current time step is determined. This range is then projected onto a 2D plane and marked to construct a map of the currently explored area. The map of the explored area includes all spatial areas that the robot can currently perceive.
[0052] S302. Store the current obstacle map, the current explored area map, the robot's current position and past positions into multiple channels of the preset image to generate the initial semantic map for the current time step.
[0053] A pre-defined multi-channel two-dimensional tensor image is constructed as the carrier of the initial semantic map. The channels of the two-dimensional tensor image comprise basic channels and semantic target channels. There are four basic channels, which store physical obstacles, explored areas, the robot's current position, and the robot's past positions, respectively. The current obstacle map and the current explored area map constructed in S301 are stored in the corresponding basic channels. At the same time, the three-dimensional coordinates of the current position are obtained through the robot's pose sensor, projected onto the two-dimensional plane, and marked in the corresponding basic channel. The position coordinates of each historical time step during the robot's navigation process are connected in sequence and marked in the corresponding basic channel to form the robot's historical position trajectory. The spatial distribution information of the various semantic targets in the semantic mask is stored in the semantic target channels of the tensor image, and finally, a multi-channel initial semantic map of the current time step is generated.
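The multi-channel layout described in S301 and S302 can be sketched as a small top-down grid. This is an illustrative sketch under stated assumptions: the grid size, cell size, the map origin at the world origin, and the height band (`min_h`, `max_h`) used to decide whether a point counts as an obstacle are all hypothetical choices; the text itself derives obstacles from distance and semantic information without giving thresholds.

```python
import math
from typing import List, Optional, Sequence, Tuple

OBSTACLE, EXPLORED, CURRENT, PAST = 0, 1, 2, 3  # the four basic channels

Point = Tuple[float, float, float, Optional[str]]  # x, y, z, category (or None)


def build_initial_map(
    points: Sequence[Point],
    current_xy: Tuple[float, float],
    past_xy: Sequence[Tuple[float, float]],
    categories: Sequence[str],
    size: int = 8,
    cell: float = 0.5,
    min_h: float = 0.1,
    max_h: float = 1.5,
) -> List[List[List[int]]]:
    """Return a [channel][row][col] grid: 4 basic channels + 1 per category."""
    categories = list(categories)
    grid = [[[0] * size for _ in range(size)] for _ in range(4 + len(categories))]

    def mark(channel: int, x: float, y: float) -> None:
        row, col = math.floor(y / cell), math.floor(x / cell)
        if 0 <= row < size and 0 <= col < size:  # ignore points outside the map
            grid[channel][row][col] = 1

    for x, y, z, category in points:
        mark(EXPLORED, x, y)                     # every observed point is explored
        if min_h < z < max_h:                    # assumed obstacle height band
            mark(OBSTACLE, x, y)
        if category in categories:               # semantic target channel
            mark(4 + categories.index(category), x, y)
    mark(CURRENT, *current_xy)
    for x, y in past_xy:
        mark(PAST, x, y)
    return grid
```

For example, `build_initial_map(points, (0.2, 0.2), [(0.7, 0.2)], ["chair"])` yields five channels, with the chair occupying channel index 4.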
[0054] S204. Add corresponding text annotation information to each semantic region in the initial semantic map to obtain an annotated semantic map.
[0055] The semantic regions in the initial semantic map are identified and clustered to determine the object category corresponding to each semantic region. Natural language text annotation information matching the object category is then added to the feature positions of each semantic region, transforming the abstract semantic features into text labels that the language model can understand, and finally obtaining the annotated semantic map at the current time step.
[0056] The text annotation information corresponds one-to-one with the semantic region. For example, the semantic region corresponding to the chair is labeled "chair", and the semantic region corresponding to the refrigerator is labeled "refrigerator".
[0057] This embodiment establishes an ASM construction process. During the execution of the robot's visual language navigation method provided in this application, a new ASM is constructed from the current environmental data at each time step. This realizes real-time dynamic updates of the map and keeps the map information consistent with the robot's current environment, thus avoiding stale map information caused by environmental changes or robot movement.
[0058] Figure 6 is the fourth schematic flowchart of the visual language navigation method for robots provided in this application embodiment. As shown in Figure 6, S204 may specifically include:
S401. Perform connected component analysis on the initial semantic map to obtain each semantic region.
[0059] A connected component analysis algorithm is applied to the semantic target channels of the initial semantic map. Based on the connectivity of pixels and the consistency of semantic category, connected pixel regions belonging to the same object category in the initial semantic map are clustered and segmented to obtain multiple independent semantic regions. Each semantic region corresponds to a physical semantic target in the 3D indoor scene, such as the spatial region corresponding to a single chair, refrigerator, or potted plant.
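Connected component analysis of this kind can be sketched with a breadth-first flood fill. The sketch assumes the semantic target channels have been collapsed into a single 2D grid of integer category IDs with 0 as background, and it uses 4-connectivity; both choices, and the function name `connected_regions`, are illustrative assumptions rather than details from the application.

```python
from collections import deque
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def connected_regions(
    label_grid: Sequence[Sequence[int]], background: int = 0
) -> List[Tuple[int, List[Cell]]]:
    """Group 4-connected cells of the same category into separate regions."""
    rows = len(label_grid)
    cols = len(label_grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions: List[Tuple[int, List[Cell]]] = []
    for r in range(rows):
        for c in range(cols):
            category = label_grid[r][c]
            if category == background or seen[r][c]:
                continue
            # Flood fill from this unvisited cell, staying within one category.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells: List[Cell] = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and label_grid[ny][nx] == category):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((category, cells))
    return regions
```

Two chairs that are not adjacent on the map therefore come out as two separate regions with the same category, which is what allows each object instance to receive its own label.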
[0060] S402. Obtain the centroid position of each semantic region.
[0061] Geometric feature calculation is performed on each independent semantic region obtained in S401, and the geometric centroid coordinates of each semantic region are solved. The centroid coordinates are used as the text annotation feature positions of the semantic region.
[0062] The centroid position is the center of the semantic region, which ensures that the annotation position of the text annotation information is visually salient and does not obscure the core features of the semantic region.
[0063] S403. Add corresponding text annotation information to the centroid of each semantic region in the initial semantic map to obtain an annotated semantic map.
[0064] For each semantic region, a corresponding natural language text annotation is matched. The text annotation is a concise label that matches the object category of the semantic region.
[0065] Text annotation information is visualized and marked at the centroid of the corresponding semantic region in the initial semantic map, thus merging the text annotation with the semantic map to obtain the annotated semantic map for the current time step. This map contains both spatial structured information of the scene and linguistic semantic annotation information, which can be directly parsed by the visual language model, greatly improving the model's efficiency in parsing the map.
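Steps S402 and S403 can be sketched as computing the mean cell coordinate of each region and pairing it with the category's text label. The sketch is an assumption-laden illustration: regions are given as (category ID, list of (row, col) cells), the `names` lookup from ID to label is hypothetical, and the result is a list of label placements rather than text rendered onto an image, which would require a drawing library.

```python
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def region_centroid(cells: Sequence[Cell]) -> Tuple[float, float]:
    """Geometric centroid: the mean row and mean column of the region's cells."""
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


def annotate(
    regions: Sequence[Tuple[int, Sequence[Cell]]], names: Dict[int, str]
) -> List[Dict[str, object]]:
    """Return one text annotation per region, anchored at its centroid cell."""
    annotations = []
    for category, cells in regions:
        row, col = region_centroid(cells)
        annotations.append({
            "label": names[category],  # e.g. "chair", "refrigerator"
            "row": round(row),         # nearest grid cell to the centroid
            "col": round(col),
        })
    return annotations
```

One caveat worth keeping in mind: for strongly concave regions, such as an L-shaped sofa, the centroid can fall outside the region itself, so an implementation may need a fallback placement.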
[0066] Figure 7 is the fifth schematic flowchart of the visual language navigation method for robots provided in this application embodiment. As shown in Figure 7, step S105, in which the language model in the preset visual language navigation model processes the image features, map features, and natural language navigation instructions to obtain the robot's navigation action at the next time step, includes:
S501. Using a preset multimodal projector, image features and map features are projected onto the feature space of the language model to obtain corresponding image tokens and map tokens.
[0067] The image features obtained in S103 and the map features obtained in S104 are respectively input into a preset multimodal projector. The multimodal projector contains two independent two-layer MLP projection networks, namely the image feature projection network and the map feature projection network, both configured with the GELU activation function. The multimodal projector performs linear transformation and nonlinear mapping on the image features and map features respectively, mapping the high-dimensional image features and map features from their respective feature spaces to the feature space of the language model, so that the feature dimensions after mapping are consistent with the embedding dimensions of the language model, and generates image tokens and map tokens respectively. The tokens are vector-form feature labels that can be parsed by the language model.
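The projector described above can be sketched as two independent two-layer MLPs. The following is a toy, pure-Python illustration and not the implementation of this application: the layer sizes, the random initialization, and the placement of GELU between the two linear layers are assumptions, and `MLPProjector` is a hypothetical name.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(layer, vec):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

class MLPProjector:
    """Two-layer MLP: Linear -> GELU -> Linear (activation placement is assumed)."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(in_dim, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, out_dim, rng)

    def __call__(self, features):
        # One output token per input feature vector.
        return [linear(self.fc2, [gelu(h) for h in linear(self.fc1, f)])
                for f in features]

EMBED_DIM = 8  # toy stand-in for the language model's embedding dimension

# Separate networks for image features and map features.
image_projector = MLPProjector(in_dim=6, hidden_dim=8, out_dim=EMBED_DIM, seed=1)
map_projector = MLPProjector(in_dim=4, hidden_dim=8, out_dim=EMBED_DIM, seed=2)

image_features = [[0.1 * i] * 6 for i in range(3)]  # 3 toy image feature vectors
map_features = [[0.2 * i] * 4 for i in range(2)]    # 2 toy map feature vectors

image_tokens = image_projector(image_features)
map_tokens = map_projector(map_features)
```

Although the two inputs have different feature dimensions (6 and 4 here), both projectors emit vectors of length `EMBED_DIM`, which is the property the paragraph above relies on.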
[0068] S502. Using a preset analyzer, the natural language navigation instructions are projected onto the feature space of the language model to obtain text tokens.
[0069] The natural language navigation instructions are input into a preset text analyzer. First, a tokenizer segments and encodes the natural language navigation instructions to generate an initial text feature vector. Then, the projection layer of the text analyzer maps the initial text feature vector to the feature space of the language model, so that the feature dimension after mapping is consistent with the dimension of the image tokens and the map tokens, and text tokens are generated. The text tokens contain feature information such as the semantic logic, spatial relationships, and action requirements of the natural language navigation instructions.
[0070] S503. Based on the image token, map token, and text token, a language model is used to predict the action and obtain the robot's navigation action in the next time step.
[0071] The image tokens, map tokens, and text tokens are input into the language model of the visual language navigation model. The cross-modal fusion capability of the language model fuses the three types of tokens, and its reasoning capability is then used to infer a navigation action from the fused multimodal features. At the same time, the pre-configured matching rule sets for the preset action types are invoked, and the inference result is matched against these rule sets to determine the action. Finally, the navigation action of the robot in the next time step is output. The preset action types include forward, left turn, right turn, and stop, and each matching rule set is a set of natural language synonyms for one type of action.
[0072] This embodiment projects image features and map features onto the feature space of the language model using a multimodal projector, achieving dimensional unification and spatial alignment of different modal features (visual, spatial, and linguistic). This solves the core technical challenge of traditional methods where visual / map features and linguistic features cannot be effectively fused due to differences in dimension and space. The image tokens, map tokens, and text tokens generated after projection are all vector-based feature markers that the language model can parse, enabling the model to process multimodal features in a unified manner and significantly improving the model's cross-modal fusion and reasoning capabilities.
[0073] Figure 8 is the sixth flowchart of the visual language navigation method for robots provided in this embodiment of the application. As shown in Figure 8, S503 specifically includes: S601. Using the input encoding module in the language model, the image tokens, map tokens, and text tokens are concatenated and corresponding token tags are added to generate multimodal input information.
[0074] The language model's built-in input encoding module is invoked to concatenate the image tokens, map tokens, and text tokens obtained in steps S501-S502 in a preset order. At the same time, a preset unique token tag is added to each of the three types of tokens. These token tags are predefined special identifier tokens, including <task>, <obs>, and <map>: <task> corresponds to the text tokens and marks the features of the natural language navigation instructions; <obs> corresponds to the image tokens and marks the environmental observation image features; and <map> corresponds to the map tokens and marks the annotated semantic map features.
[0075] The concatenated and labeled feature sequences are encoded to generate multimodal input information that can be directly processed by the language model. This information preserves the independence of each modality feature and the cross-modal correlation.
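A minimal sketch of the concatenation and tagging in S601 follows. Plain strings stand in for the vector tokens, and the order task, observation, map is an assumption, since the text only states that a preset order is used. `build_multimodal_input` is a hypothetical helper name.

```python
# Tag assigned to each modality, as named in the description.
TAGS = {"text": "<task>", "image": "<obs>", "map": "<map>"}

# Assumed concatenation order; the description only says "a preset order".
ORDER = ("text", "image", "map")

def build_multimodal_input(tokens_by_modality):
    """Concatenate the three token groups, each preceded by its modality tag."""
    sequence = []
    for modality in ORDER:
        sequence.append(TAGS[modality])
        sequence.extend(tokens_by_modality[modality])
    return sequence

# Strings stand in for embedding vectors in this toy example.
demo_input = build_multimodal_input({
    "image": ["img0", "img1"],
    "map": ["map0"],
    "text": ["go", "to", "the", "kitchen"],
})
```

Because each group is introduced by its own tag, a downstream consumer can recover which span of the sequence belongs to which modality regardless of how many tokens each modality contributes.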
[0076] S602. Using the action prediction module in the language model, the action is predicted based on the multimodal input information to obtain the navigation action for the next time step.
[0077] The multimodal input information generated by S601 is input into the action prediction module of the language model. The action prediction module first performs deep feature extraction and cross-modal reasoning on the multimodal input information, and parses the core requirements of the natural language navigation instructions, the visual features of the current environment, and the spatial structure information of the annotated semantic map. Then, it calls the pre-configured matching rule set of multiple preset action types and matches the inferred navigation action intention with the matching rule set sentence by sentence. The matching rule set contains multiple natural language expressions corresponding to various preset actions, such as "moveforward" and "straight ahead" for forward, and "turnleft" for left turn. The action matching is completed by using case-insensitive regular expressions to determine the action type most suitable for the current navigation scenario. Finally, the robot's navigation action in the next time step is output, which is one of forward, left turn, right turn, or stop.
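The case-insensitive regular-expression matching described above can be sketched as follows. The synonym lists are illustrative and only partly taken from the examples in the text; the priority order of the rules and the fallback to "stop" when nothing matches are assumptions, and `match_action` is a hypothetical name.

```python
import re

# One rule set per preset action type. \s* lets "moveforward" and
# "move forward" match the same rule; re.IGNORECASE ignores letter case.
ACTION_PATTERNS = [
    ("stop", re.compile(r"\b(stop|halt)\b", re.IGNORECASE)),
    ("left turn", re.compile(r"\b(turn\s*left|left\s*turn)\b", re.IGNORECASE)),
    ("right turn", re.compile(r"\b(turn\s*right|right\s*turn)\b", re.IGNORECASE)),
    ("forward", re.compile(r"\b(move\s*forward|straight(\s*ahead)?|forward)\b",
                           re.IGNORECASE)),
]

def match_action(model_output, default="stop"):
    """Map free-form model output to one of the four preset action types."""
    for action, pattern in ACTION_PATTERNS:
        if pattern.search(model_output):
            return action
    return default  # assumed fallback when no rule matches

examples = {
    "Move Forward 25 cm": match_action("Move Forward 25 cm"),
    "TURNLEFT": match_action("TURNLEFT"),
    "go straight ahead": match_action("go straight ahead"),
    "Please stop here.": match_action("Please stop here."),
}
```

Checking rules in a fixed order means the first matching action wins when the output mentions several actions; a real system might need a more careful tie-breaking policy.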
[0078] This embodiment adds dedicated token tags (<task>, <obs>, <map>) and concatenates and encodes the multimodal tokens to achieve structured labeling and orderly fusion of features from different modalities. This allows the language model to accurately distinguish between text tokens (language commands), image tokens (visual observations), and map tokens (spatial maps), avoiding information confusion during multimodal feature fusion and improving the model's efficiency in recognizing and parsing information from different modalities. In addition, it uses a set of regular expression matching rules for action matching, supporting multiple natural language expressions of the same navigation action (such as "forward", "straight", "moveforward"), improving the model's robustness to variations in natural language command expressions, and solving the problem of action prediction errors caused by different command expressions.
[0079] In all the above steps, after completing the navigation action at each time step, the robot updates its own pose information and repeats the above steps based on the new pose information until the navigation task specified by the natural language navigation instruction is completed. Moreover, during the entire navigation process, an annotated semantic map is reconstructed and updated in real time at each time step, without the need to store historical observation frames, which effectively reduces the robot's storage and computational overhead.
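The per-time-step loop described above can be sketched with placeholder callables. Everything here is hypothetical scaffolding: `observe`, `build_map`, `predict_action`, and `execute` stand in for the sensing, map construction, model inference, and motion control stages, and the `max_steps` safety limit is an assumption not stated in the text.

```python
def navigate(instruction, observe, build_map, predict_action, execute, max_steps=100):
    """Run the observe -> map -> predict -> act loop until "stop" or max_steps."""
    actions = []  # kept only so the demo can be inspected, not used for decisions
    for _ in range(max_steps):
        rgb, depth, pose = observe()
        # The annotated semantic map is rebuilt at every step; no past frames are stored.
        annotated_map = build_map(rgb, depth, pose)
        action = predict_action(rgb, annotated_map, instruction)
        actions.append(action)
        if action == "stop":
            break
        execute(action)  # moving the robot changes the pose seen by the next observe()
    return actions

# Toy stand-ins so the loop can run without a robot or a model.
scripted = iter(["forward", "left turn", "forward", "stop"])
pose = {"steps": 0}

def observe():
    return "rgb", "depth", dict(pose)

def build_map(rgb, depth, current_pose):
    return {"pose": current_pose}

def predict_action(rgb, annotated_map, instruction):
    return next(scripted)

def execute(action):
    pose["steps"] += 1

actions = navigate("go to the kitchen", observe, build_map, predict_action, execute)
```

The loop passes only the current observation and the current map to the predictor, which mirrors the claim that no historical observation frames need to be stored.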
[0080] It should also be noted that in this application, the visual encoder is preferably a SigLIP-so400M-patch14-384 encoding network, the language model is preferably a Qwen2-7B-Instruct large language model, the semantic segmentation module is preferably a Mask2Former segmentation network, the MLP network of the multimodal projector is configured with the GELU activation function, and the robot is preferably equipped with an Intel RealSense D435i depth camera and an RGB camera, with the camera installed at a height of 40cm above the robot body to ensure that the acquired images and depth data match the robot's first-person perspective.
[0081] In summary, the embodiments of this application provide a visual language navigation method for robots, with the following advantages: 1) By constructing an annotated semantic map to replace the traditional historical observation frames, the storage and computational overhead during navigation is significantly reduced, and the map maintains a constant memory footprint, unaffected by the number of navigation steps; 2) The annotated semantic map integrates spatial structured information and linguistic text annotations, achieving deep alignment between visual and linguistic features, improving the robot's understanding of natural language navigation instructions and the accuracy of spatial navigation; 3) An end-to-end visual language navigation model is adopted, realizing integrated processing from environmental perception, map construction to action prediction, improving the real-time performance and coherence of the navigation process; 4) By using a multimodal projector to map features of different modalities to a unified language model feature space, and combining token tags to achieve structured input of multimodal features, the model's cross-modal fusion capability is improved; 5) Action prediction, combined with a preset matching rule set, improves the robustness of navigation action output and can adapt to various forms of natural language instructions.
[0082] Optionally, this application also provides a robot, which includes at least a robot body and a controller disposed within the robot body. The controller is used to execute the visual language navigation method of the robot described in any of the above embodiments.
[0083] The following will continue to explain the apparatus, device, and storage medium for implementing the visual language navigation method for robots provided in any of the above embodiments of this application. The specific implementation process and the resulting technical effects are the same as those in the corresponding method embodiments. For the sake of brevity, the parts not mentioned in the following embodiments can be referred to the corresponding content in the method embodiments.
[0084] Figure 9 is a schematic diagram of the structure of the visual language navigation device for a robot provided in the embodiments of this application. As shown in Figure 9, this application provides a visual language navigation device for a robot, comprising: The acquisition module 10 is used to acquire the robot's natural language navigation instructions in a three-dimensional indoor scene, as well as the robot's environmental observation image and depth image at the current time step.
[0085] The construction module 20 is used to construct an annotated semantic map for the current time step based on the environmental observation image and the depth image; each semantic region in the annotated semantic map is marked with corresponding text annotation information, which is used to indicate the object category in the corresponding semantic region.
[0086] The first encoding module 30 is used to encode the environmental observation image using a visual encoder in a preset visual language navigation model to obtain image features.
[0087] The second encoding module 40 is used to encode the annotated semantic map using the map encoder in the preset visual language navigation model to obtain map features.
[0088] The processing module 50 is used to process the image features, the map features, and the natural language navigation instructions using the language model in the preset visual language navigation model to obtain the navigation action of the robot in the next time step.
[0089] The control module 60 is used to control the robot to move in the next time step according to the navigation action of the next time step.
[0090] Optionally, the construction module 20 is configured to determine the current semantic mask of the three-dimensional indoor scene based on the environmental observation image; determine the point cloud data of the three-dimensional indoor scene based on the depth image; align the current semantic mask and the point cloud data to generate an initial semantic map for the current time step; and add corresponding text annotation information to each semantic region in the initial semantic map to obtain the annotated semantic map.
[0091] Optionally, the construction module 20 is used to perform connected component analysis on the initial semantic map to obtain each semantic region; obtain the centroid position of each semantic region; and add corresponding text annotation information at the centroid position of each semantic region in the initial semantic map to obtain the annotated semantic map.
[0092] Optionally, the processing module 50 is configured to use a preset multimodal projector to project the image features and the map features into the feature space of the language model to obtain corresponding image tokens and map tokens; use a preset analyzer to project the natural language navigation instructions into the feature space of the language model to obtain text tokens; and use the language model to predict actions based on the image tokens, the map tokens, and the text tokens to obtain the navigation actions of the robot in the next time step.
[0093] Optionally, the processing module 50 is used to use the input encoding module in the language model to concatenate the image token, the map token, and the text token, and add corresponding token tags to generate multimodal input information; and to use the action prediction module in the language model to perform action prediction based on the multimodal input information to obtain the navigation action for the next time step.
[0094] Optionally, the processing module 50 is used to perform action matching on the multimodal input information and a set of matching rules for multiple preset action types using the action prediction module in the language model, so as to obtain the navigation action for the next time step.
[0095] Optionally, the construction module 20 is used to align the current semantic mask and the point cloud data to construct the current obstacle map and the current explored area map respectively; and to store the current obstacle map, the current explored area map, the robot's current position and past positions into multiple channels of a preset image to generate the initial semantic map of the current time step.
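The multi-channel storage described for the construction module can be sketched as a stack of binary grids. The channel order, the grid size, and the use of 0/1 values are assumptions for illustration, and `build_initial_map` is a hypothetical name.

```python
# Assumed channel order of the preset image.
CHANNELS = ("obstacle", "explored", "current_position", "past_positions")

def build_initial_map(size, obstacles, explored, current, past):
    """Return len(CHANNELS) binary size x size grids, one per channel."""
    image = {name: [[0] * size for _ in range(size)] for name in CHANNELS}
    for r, c in obstacles:
        image["obstacle"][r][c] = 1
    for r, c in explored:
        image["explored"][r][c] = 1
    image["current_position"][current[0]][current[1]] = 1
    for r, c in past:
        image["past_positions"][r][c] = 1
    return [image[name] for name in CHANNELS]

# Toy 4 x 4 scene: a wall segment on the right, robot at (2, 1), came from (3, 1).
initial_map = build_initial_map(
    size=4,
    obstacles=[(0, 3), (1, 3)],
    explored=[(r, c) for r in range(4) for c in range(3)],
    current=(2, 1),
    past=[(3, 1)],
)
```

The grid has a fixed shape however many steps the robot has taken, because past positions are written into an existing channel rather than appended to a growing list.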
[0096] The above-described device is used to execute the method provided in the foregoing embodiments, and its implementation principle and technical effect are similar, so they will not be described again here.
[0097] These modules can be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more microprocessors, or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, when a module is implemented in the form of program code scheduled by a processing element, the processing element can be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. Furthermore, these modules can be integrated together and implemented as a system-on-a-chip (SoC).
[0098] Figure 10 is a schematic diagram of the controller provided in the embodiments of this application. As shown in Figure 10, this application also provides a controller, including a processor 100, a storage medium 200, and a bus 300. The storage medium stores program instructions executable by the processor. When the controller is running, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to implement the visual language navigation method for robots described in any of the above embodiments.
[0099] This application also provides a readable storage medium storing program instructions, which, when executed by a processor, implement the visual language navigation method for a robot described in any of the above embodiments.
[0100] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0101] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0102] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in a combination of hardware and software functional units.
[0103] The integrated units implemented as software functional units described above can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0104] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A visual language navigation method for a robot, characterized in that it comprises: acquiring natural language navigation instructions for the robot in a three-dimensional indoor scene, as well as an environmental observation image and a depth image of the robot at the current time step; constructing, based on the environmental observation image and the depth image, an annotated semantic map for the current time step, wherein each semantic region in the annotated semantic map is marked with corresponding text annotation information, which is used to indicate the object category within the corresponding semantic region; encoding the environmental observation image using a visual encoder in a preset visual language navigation model to obtain image features; encoding the annotated semantic map using a map encoder in the preset visual language navigation model to obtain map features; processing the image features, the map features, and the natural language navigation instructions using a language model in the preset visual language navigation model to obtain the robot's navigation action at the next time step; and controlling the robot to move at the next time step based on the navigation action at the next time step.
2. The method according to claim 1, characterized in that the step of constructing an annotated semantic map for the current time step based on the environmental observation image and the depth image includes: determining the current semantic mask of the three-dimensional indoor scene based on the environmental observation image; determining the point cloud data of the three-dimensional indoor scene based on the depth image; aligning the current semantic mask with the point cloud data to generate the initial semantic map for the current time step; and adding corresponding text annotation information to each semantic region in the initial semantic map to obtain the annotated semantic map.
3. The method according to claim 2, characterized in that the step of adding corresponding text annotation information to each semantic region in the initial semantic map to obtain the annotated semantic map includes: performing connected component analysis on the initial semantic map to obtain each semantic region; obtaining the centroid position of each semantic region; and adding corresponding text annotation information at the centroid position of each semantic region in the initial semantic map to obtain the annotated semantic map.
4. The method according to claim 1, characterized in that the step of processing the image features, the map features, and the natural language navigation instructions using the language model in the preset visual language navigation model to obtain the robot's navigation action at the next time step includes: using a preset multimodal projector to project the image features and the map features into the feature space of the language model to obtain corresponding image tokens and map tokens; using a preset analyzer to project the natural language navigation instructions into the feature space of the language model to obtain text tokens; and using the language model to predict the action based on the image tokens, the map tokens, and the text tokens, thereby obtaining the robot's navigation action at the next time step.
5. The method according to claim 4, characterized in that the step of using the language model to predict the action based on the image tokens, the map tokens, and the text tokens to obtain the robot's navigation action at the next time step includes: using the input encoding module in the language model to concatenate the image tokens, the map tokens, and the text tokens and to add corresponding token tags, thereby generating multimodal input information; and using the action prediction module in the language model to predict the action based on the multimodal input information to obtain the navigation action for the next time step.
6. The method according to claim 5, characterized in that the step of using the action prediction module in the language model to predict the action based on the multimodal input information to obtain the navigation action for the next time step includes: using the action prediction module in the language model to perform action matching between the multimodal input information and the matching rule sets of multiple preset action types to obtain the navigation action for the next time step.
7. The method according to claim 2, characterized in that the step of aligning the current semantic mask with the point cloud data to generate an initial semantic map for the current time step includes: aligning the current semantic mask and the point cloud data to construct the current obstacle map and the current explored area map, respectively; and storing the current obstacle map, the current explored area map, the robot's current position, and its past positions in multiple channels of a preset image to generate the initial semantic map for the current time step.
8. A visual language navigation device for a robot, characterized in that it comprises: an acquisition module, used to acquire the robot's natural language navigation instructions in a three-dimensional indoor scene, as well as the robot's environmental observation image and depth image at the current time step; a construction module, used to construct an annotated semantic map for the current time step based on the environmental observation image and the depth image, wherein each semantic region in the annotated semantic map is marked with corresponding text annotation information, which is used to indicate the object category within the corresponding semantic region; a first encoding module, used to encode the environmental observation image using a visual encoder in a preset visual language navigation model to obtain image features; a second encoding module, used to encode the annotated semantic map using a map encoder in the preset visual language navigation model to obtain map features; a processing module, used to process the image features, the map features, and the natural language navigation instructions using a language model in the preset visual language navigation model to obtain the robot's navigation action at the next time step; and a control module, used to control the robot to move in the next time step according to the navigation action for the next time step.
9. A controller, characterized in that it comprises: a processor, a storage medium, and a bus, wherein the storage medium stores program instructions executable by the processor; when the controller is running, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to implement the visual language navigation method for the robot according to any one of claims 1 to 7.
10. A robot, characterized in that it comprises at least: a robot body and a controller disposed within the robot body, the controller being used to execute the visual language navigation method of the robot according to any one of claims 1 to 7.