Visual language navigation data enhancement method and device for indoor mobile service robot and storage medium
By training an instruction generation model in a simulation environment, fine-tuning the navigation model in the deployment environment, and combining a Transformer architecture with simultaneous localization and mapping (SLAM), the method addresses the small size of visual language navigation datasets and improves the adaptability and accuracy of the model on indoor mobile service robots.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2024-12-23
- Publication Date
- 2026-06-23
AI Technical Summary
Existing visual language navigation datasets are small in scale and cannot cover the diversity of indoor navigation, resulting in limited generalization ability of trained models in the real world. Furthermore, the annotation process is time-consuming and complex, affecting the comprehensiveness and adaptability of the models.
The instruction generation model is trained in a simulation environment to generate an expanded simulation environment dataset. The navigation model is then fine-tuned in the deployment environment. Image and text information are fused using a sequence-to-sequence Transformer structure and a contrastive learning method. A new environment is generated through an environmental visual feature occlusion method, and navigation instructions are automatically generated. Combined with simultaneous localization and mapping (SLAM) technology, the model's adaptability is improved.
It effectively improves the generalization of navigation models in deployment environments, reduces data collection costs, improves the performance and accuracy of navigation models, reduces computing resource requirements, and is suitable for edge computing devices.
Figure CN119762714B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of visual language navigation technology, and in particular to a visual language navigation data enhancement method, apparatus and storage medium for indoor mobile service robots.
Background Technology
[0002] In the field of visual-language navigation, dataset annotation is a time-consuming and complex task. Annotating language commands requires considering information from both visual and linguistic modalities, as well as ensuring that the language descriptions match the actual environment. This process necessitates expertise and often involves subjective judgment, increasing the uncertainty of the annotation. In the complex deployment environment of indoor mobile service robots, ensuring the accuracy of annotated language commands is crucial for training reliable navigation models.
[0003] Currently available visual language navigation datasets are relatively small, which limits the training of deep learning models. In the field of computer vision, the COCO (Common Objects in Context) dataset is one of the most widely used datasets, mainly for tasks such as object detection, segmentation, and image captioning. The COCO dataset contains 118,287 training images, 5,000 validation images, and 40,670 test images, annotating approximately 1.23 million object instances across 80 common categories in daily life. Each image contains an average of 7.2 object instances and is provided with detailed bounding boxes, segmentation masks, and keypoint annotations. In the field of natural language processing, dataset sizes range from tens of thousands to millions of samples, and some exceed hundreds of millions. For example, the SQuAD (Stanford Question Answering Dataset) 2.0 dataset contains 150,000 question-answer pairs, and the training data of OpenAI's GPT-3 contains hundreds of gigabytes of text from various sources on the Internet. BERT was trained on large-scale corpora such as Wikipedia (approximately 2.5 billion words) and BooksCorpus (approximately 800 million words). By contrast, the Room-to-Room (R2R) dataset commonly used in visual language navigation contains only 21,567 navigation instructions across 90 building scenes. Such a small dataset struggles to encompass the diverse scenarios encountered in indoor navigation, resulting in limited generalization ability of trained models in the real world. Furthermore, due to the diversity of indoor environments, a single dataset often cannot comprehensively cover all the different styles of navigation scenarios, which affects the model's comprehensiveness and adaptability.
Summary of the Invention
[0004] The purpose of this invention is to overcome the limitations of existing technologies where models trained on a single dataset have limited generalization capabilities, and to provide a visual language navigation data augmentation method, apparatus, and storage medium for indoor mobile service robots. This method can provide richer and more diverse training data for deep learning models, effectively improving the comprehensiveness and adaptability of the models.
[0005] The objective of this invention can be achieved through the following technical solutions:
[0006] According to a first aspect of the present invention, a visual language navigation data enhancement method for indoor mobile service robots is provided, comprising the following steps:
S1, training a preset instruction generation model using a pre-acquired visual language navigation dataset;
S2, generating a new environment based on a preset simulation environment;
S3, acquiring a first navigation path in the new environment;
S4, generating a first navigation instruction from the first navigation path using the instruction generation model trained in S1, thereby obtaining an expanded simulation environment dataset;
S5, training a preset navigation model using the expanded simulation environment dataset;
S6, performing simultaneous localization and mapping (SLAM) in the actual deployment environment of the indoor mobile service robot, and constructing a topological navigation map based on the localization and mapping results;
S7, acquiring panoramic views in the actual deployment environment based on the topological navigation map;
S8, acquiring a second navigation path in the actual deployment environment based on the panoramic views, and generating a second navigation instruction using the instruction generation model trained in S1, thereby obtaining a deployment environment dataset;
S9, fine-tuning the navigation model trained in S5 using the deployment environment dataset to complete the visual language navigation data enhancement.
[0007] As a preferred technical solution, in S4 and S8, the trained instruction generation model is used to automatically generate the first navigation instruction and the second navigation instruction.
[0008] As a preferred technical solution, the instruction generation model is constructed using a sequence-to-sequence Transformer structure. The Transformer structure includes a connected spatial encoder and a temporal encoder. The spatial encoder is used to fuse spatial information at each time stage, and the temporal encoder is used to learn the intrinsic relationship between different time stages after the spatial encoder has completed the fusion.
[0009] As a preferred technical solution, the Transformer structure further includes a text decoder connected to the temporal encoder, the text decoder including an instruction-generating word predictor for performing prediction tasks.
[0010] As a preferred technical solution, the instruction-generating word predictor is optimized using a preset cross-entropy loss, the expression of which is:
[0011]
[0012] In the formula, θ is a preset parameter, l is the number of words in a text instruction from the dataset, l≤L, and $f_{\theta}(\cdot)$ represents the probability, predicted by the instruction generation model from the action A, the environmental observation E, and the first i-1 words of the instruction, that the target word appears at the i-th position of the instruction.
[0013] As a preferred technical solution, before inputting the image to be processed into the instruction generation model, the image to be processed is encoded using a trained image encoder to extract corresponding image features. The image encoder is trained using a contrastive learning method.
[0014] As a preferred technical solution, S2 specifically includes: using an environmental visual feature masking method to generate a new environment based on visual feature masking that is consistent with the view and the viewpoint.
[0015] As a preferred technical solution, the view feature $f'_{t,i}$ observed in the new environment is obtained as the Hadamard product of the original feature $f_{t,i}$ and the environment occlusion mask $\xi^{E}$:
[0016] $f'_{t,i} = f_{t,i} \odot \xi^{E}$
[0017]
[0018] In the formula, E represents the preset simulation environment, and each element of the environment occlusion mask $\xi^{E}$ is a sample of a Bernoulli-distributed random variable, where Ber(·) represents the Bernoulli distribution and p is the probability of masking a feature of the view.
[0019] According to a second aspect of the present invention, a visual language navigation data enhancement device for indoor mobile service robots is provided, comprising a memory, a processor, and a program stored in the memory, wherein the processor executes the program to implement the method described above.
[0020] According to a third aspect of the present invention, a storage medium is provided having a program stored thereon, which, when executed, implements the method described above.
[0021] Compared with the prior art, the present invention has the following beneficial effects:
[0022] 1. The data augmentation method provided by this invention first trains an instruction generation model on an existing visual language navigation dataset in a simulation environment, trains a navigation model on an expanded simulation environment dataset, and then fine-tunes the trained navigation model on a deployment environment dataset. This method only requires a small amount of cost to collect the dataset of the actual deployment environment, which can effectively improve the generalization of the navigation model in the deployment environment.
[0023] 2. In this invention, the navigation instructions for the second navigation path collected in the actual deployment environment of the indoor mobile service robot are automatically generated by the trained instruction generation model. This effectively reduces the large amount of manual labor otherwise required to collect visual language navigation datasets. Moreover, experimental verification shows that fine-tuning the navigation model with instructions generated by the instruction generation model achieves performance close to that of fine-tuning with manually labeled instructions.
[0024] 3. This invention uses an image encoder trained by contrastive learning to encode the input image, which enables the representation of images and text in one embedding space, effectively fusing information from the two modalities of images and text, and improving the accuracy of instruction generation.
[0025] 4. The present invention adopts an environmental visual feature masking method to generate a new environment based on view- and viewpoint-consistent visual feature masking. This environmental visual feature enhancement method and the navigation model require fewer computing resources, which facilitates widespread use and deployment on edge computing devices.
Attached Figure Description
[0026] Figure 1 is a flowchart of the method of this invention;
[0027] Figure 2 is a schematic diagram showing the position, heading, and elevation angle of the agent in the Matterport3D simulator in this embodiment of the invention;
[0028] Figure 3 is a schematic diagram illustrating the division of a panoramic image into 36 sub-images in this embodiment of the invention;
[0029] Figure 4 shows the changes in key performance metrics during the training of the instruction generation model in this embodiment of the invention;
[0030] Figure 5 shows the definition of the heading angle in the Matterport3D coordinate system (left) and the ROS coordinate system (right) in this embodiment of the invention (top view);
[0031] Figure 6 is an example of an instruction-navigation path pair from the dataset collected in the deployment environment in this embodiment of the invention;
[0032] Figure 7 shows the changes in the main performance metrics while fine-tuning the model on the dataset collected in the deployment environment in this embodiment of the invention.
Detailed Implementation
[0033] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0034] Example
[0035] Data augmentation is a method that transforms and expands existing data to generate more training samples. In the field of visual language navigation, data augmentation can alleviate the problems of difficult dataset labeling and insufficient data volume, providing richer and more diverse training data for deep learning models and thereby improving their performance. The method provided in this invention targets indoor mobile service robots and performs data augmentation on visual language navigation data. The method first trains an instruction generation model on an existing visual language navigation dataset in a simulation environment, then trains a navigation model on an expanded simulation environment dataset, and finally fine-tunes the trained navigation model on a deployment environment dataset. Only a small cost is required to collect the dataset from the actual deployment environment, which improves the generalization ability of the visual language navigation model and enhances its performance in the robot deployment environment. As shown in Figure 1, this embodiment provides a visual language navigation data enhancement method for indoor mobile service robots.
[0036] This embodiment uses the Matterport3D simulator as the simulation environment for the experiment and uses the Room-to-Room (R2R) dataset as the base dataset (i.e., the pre-acquired visual language navigation dataset) to train the model.
[0037] In the Matterport3D simulator, a virtual agent can virtually "move" throughout the scene by adopting poses that coincide with the panoramic viewpoints. As shown in Figure 2, the agent's pose consists of a 3D position $v \in V$, a heading $\psi \in [0, 2\pi)$, and an elevation angle $\alpha \in [-0.5\pi, 0.5\pi]$, where V is the set of 3D points corresponding to the panoramic viewpoints in the scene. At each step t, the simulator outputs an RGB image observation $o_t$, which corresponds to the first-person camera view of the agent.
[0038] At each step t, the simulator also outputs the set of viewpoints $W_{t+1}$ that can be reached in the next step. The agent interacts with the simulator by selecting a new viewpoint $v_{t+1} \in W_{t+1}$ and specifying the camera heading change $\Delta\psi_{t+1}$ and the elevation angle change $\Delta\alpha_{t+1}$. Therefore, the total number of actions the agent can take is fixed.
[0039] To determine $W_{t+1}$, the simulator generates a weighted undirected graph over the panoramic viewpoints of each scene, i.e., $G = \langle V, E \rangle$. The weight of an edge reflects the straight-line distance between the two viewpoints it connects.
[0040] Given a navigation graph G, the set of viewpoints reachable in the next step is given by the following formula:
[0041] $W_{t+1} = \{v_t\} \cup \{v_i \in V \mid \langle v_t, v_i \rangle \in E \wedge v_i \in P_t\}$ (1)
[0042] In the formula, $v_t$ is the current viewpoint, $v_i$ is the i-th 3D point, and $P_t$ is the spatial region enclosed by the left and right sides of the camera's view cone at step t.
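Formula (1) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the simulator's implementation: viewpoints are reduced to 2D coordinates, the view cone $P_t$ is approximated by a horizontal half-angle `half_fov` (a hypothetical parameter), and headings are measured clockwise from the +y axis. The function and variable names are illustrative only.

```python
import math

def reachable_viewpoints(current, heading, positions, edges, half_fov=math.radians(30)):
    """Return W_{t+1}: the current viewpoint plus graph neighbours inside the view cone.

    Assumptions: 2D positions, heading measured clockwise from +y,
    and a cone described only by its horizontal half-angle.
    """
    cx, cy = positions[current]
    reachable = {current}  # the {v_t} term of formula (1)
    for a, b in edges:
        if current not in (a, b):
            continue  # only edges <v_t, v_i> in E are considered
        other = b if a == current else a
        ox, oy = positions[other]
        # Bearing of the neighbour, clockwise from the +y axis.
        bearing = math.atan2(ox - cx, oy - cy)
        # Smallest signed angle between the bearing and the camera heading.
        diff = (bearing - heading + math.pi) % (2 * math.pi) - math.pi
        if abs(diff) <= half_fov:  # v_i lies inside P_t
            reachable.add(other)
    return reachable

positions = {"a": (0.0, 0.0), "b": (0.0, 2.0), "c": (2.0, 0.0), "d": (0.0, -2.0)}
edges = [("a", "b"), ("a", "c"), ("a", "d")]
print(reachable_viewpoints("a", 0.0, positions, edges))
```

With the camera at "a" facing +y, only the neighbour straight ahead joins the current viewpoint in the reachable set; turning the heading by π/2 selects the neighbour on the +x side instead.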
[0043] The R2R task requires an agent to navigate from a starting position to a target position in the Matterport3D simulator by following natural language instructions. At the start of the task, the agent receives a natural language instruction $\langle x_1, x_2, \ldots, x_L \rangle$, where L is the length of the instruction and $x_i$ is a word token. The agent first observes the initial RGB image $o_0$, which is determined by the agent's initial pose, a triple $s_0 = \langle v_0, \psi_0, \alpha_0 \rangle$ comprising the 3D position, heading, and elevation angle. The agent must execute a sequence of actions $\langle s_0, a_0, s_1, a_1, \ldots, s_T, a_T \rangle$ to complete the task, where each action $a_t$ leads to a new pose $s_{t+1} = \langle v_{t+1}, \psi_{t+1}, \alpha_{t+1} \rangle$. The task ends when the agent selects the special stop action. If the action sequence brings the agent close to the intended target position $v^{*}$, the task is completed successfully.
[0044] Suppose the task objective is to learn the mapping X→Y using paired data {(X,Y)} and unpaired data {Y′}. In this case, the back translation method first trains the forward model $P_{X \to Y}$ and the backward model $P_{Y \to X}$ on the paired data {(X,Y)}. The backward model $P_{Y \to X}$ then generates additional data X′ from the unpaired data Y′, and the resulting pairs {(X′,Y′)} are used as additional training data to fine-tune the forward model $P_{X \to Y}$.
[0045] In visual language navigation tasks, the instruction generation-navigation path generation method applies the idea of back translation. The forward model is a navigation model $P_{E,d \to r}$, which generates the correct navigation route r in environment E from a given instruction d. The backward model is an instruction generation model $P_{E,r \to d}$, which generates an instruction d in environment E from a given navigation route r.
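The back translation loop described above can be sketched as follows. This is only a toy illustration: the "models" are stand-ins (a string template and a lookup table), and every function name here (`back_translate`, `train_speaker`, `train_follower`, `fine_tune`) is hypothetical rather than taken from the patent.

```python
def train_speaker(pairs):
    # Toy backward model P_{E,r->d}: describe a route by naming its waypoints in order.
    return lambda route: "go " + " then ".join(route)

def train_follower(pairs):
    # Toy forward model P_{E,d->r}: a lookup table from instruction to route.
    return {instruction: route for instruction, route in pairs}

def fine_tune(follower, pairs):
    # Toy fine-tuning: extend a copy of the lookup table with the synthetic pairs.
    updated = dict(follower)
    updated.update({instruction: route for instruction, route in pairs})
    return updated

def back_translate(paired, unpaired_routes):
    """Routes play the role of Y, instructions the role of X."""
    speaker = train_speaker(paired)
    follower = train_follower(paired)
    # The backward model labels every unpaired route with a synthetic instruction.
    synthetic = [(speaker(route), route) for route in unpaired_routes]
    return fine_tune(follower, synthetic), synthetic

paired = [("go hall then kitchen", ("hall", "kitchen"))]
unpaired_routes = [("hall", "office")]
follower, synthetic = back_translate(paired, unpaired_routes)
print(synthetic)
```

The point of the sketch is the data flow: only routes are needed from the new data, because the backward model supplies the missing instructions.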
[0046] Based on this, the specific execution flow of the method provided in this embodiment is as follows:
[0047] Step S1: Train a pre-defined instruction generation model using a pre-acquired visual language navigation dataset. This model can generate corresponding navigation instructions based on visual input, providing a foundation for subsequent generation of navigation instruction data. Specifically:
[0048] The Matterport3D simulator is used to generate a connectivity graph G = {P, ξ}, where P represents the navigable viewpoints and ξ represents the connections between these viewpoints. In the R2R dataset, each sample is a pair consisting of a trajectory $\tau = \{p_1, p_2, \ldots, p_N\}$ and an instruction $I = \{w_1, w_2, \ldots, w_L\}$, where $p_i$ and $w_i$ represent the visited nodes and the words, respectively, and N and L represent the lengths of the path and the instruction, respectively.
[0049] At each step, the robot observes a panoramic view of the environment covering three elevation levels, each divided into 12 images spanning 30° of heading, with each image at a resolution of 640×480. Let $d_v$ and $d_o$ denote the dimensions of the image features and the orientation features, respectively. Each step of the robot's path lies in one of these images; that image and its offset direction are defined as the action feature $A = \{V_a; \gamma_a\}$, where $V_a$ denotes the image features over the N steps and $\gamma_a$ denotes the orientation features. Similarly, the environmental feature $E = \{V_e; \gamma_e\}$ is composed of the feature set $V_e$ of the panoramic images and the orientation feature set $\gamma_e$, where M is the number of images into which each panoramic image is divided.
[0050] In this embodiment, the panoramic image is divided into 36 sub-images, and image features are extracted from each sub-image, as shown in Figure 3.
[0051] Optionally, before inputting the image into the model, the image is encoded using an image encoder trained with a contrastive learning method. Specifically, the CLIP image encoder, trained using contrastive learning, extracts image features. CLIP jointly trains the image encoder and text encoder to predict the matching degree between images and text, projecting relevant image-text pairs into a shared embedding space. In this space, relevant image-text pairs are close together, while unrelated image-text pairs are far apart. The main body of the image encoder adopts a Transformer Encoder architecture with a normalization layer added before the Transformer; the text encoder adopts a Transformer architecture, with the output of the highest layer of the Transformer at the [EOS] marker serving as the feature representation of the text, which is linearly projected into the multimodal embedding space after layer normalization. During training, CLIP learns joint representations by maximizing the similarity between the image and its corresponding text description while minimizing the similarity between the image and unrelated text.
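The contrastive objective described above can be illustrated with a short sketch of a symmetric image-text cross-entropy over cosine similarities. It is a minimal illustration, not CLIP's actual training code: the embeddings are toy vectors, and the fixed `temperature=0.07` is an assumption made for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: the k-th image should match the k-th text and vice versa."""
    n = len(image_embs)
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    image_to_text = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, matched_texts) < contrastive_loss(images, swapped_texts))
```

Matched pairs sit close together in the shared space and yield a small loss, while mismatched pairs are penalized, which is the behaviour the paragraph attributes to the shared embedding space.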
[0052] In this embodiment, the core of the instruction generation model is built on a sequence-to-sequence Transformer structure, which includes a connected spatial encoder, a temporal encoder, and a text decoder.
[0053] To better utilize action and environmental information in both time and space throughout the navigation process, a spatiotemporal Transformer encoder structure designed based on a cross-modal attention mechanism module is employed. Two encoders are used to fuse spatiotemporal information; specifically, a spatial encoder is used to fuse spatial information from each time stage, and a temporal encoder is used to learn the intrinsic connections between different time stages.
[0054] (1) Spatial encoder
[0055] The input to the instruction generation model consists of two types of observation data: a set of action features A and a set of environmental observations E. Proper fusion of these visual features is crucial, as it affects the model's cognitive performance. Since action features represent the main changes during navigation, A is used as the query, and E is used as the key and value. With a multi-head attention mechanism, different linear projections are applied to the query, key, and value, allowing the model to attend to different representation subspaces at different positions. The outputs of the attention heads are concatenated, a residual connection is applied, and layer normalization yields the final result. To improve generalization ability, additional feature masking is applied in the model. The formula for the spatial encoder is as follows:
[0056]
[0057] $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^{O}$ (6)
[0058]
[0059] After aggregating the action features and the environmental features, the spatial encoder outputs the spatial fusion features $Z = \{z_1, z_2, \ldots, z_N\}$. The multi-head attention mechanism improves the model's ability to capture important semantic information in the environment and fuse it into the image features of a specific direction, reducing useless noise and interference.
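The query/key/value roles in the spatial encoder can be sketched for a single navigation step. This is a deliberately reduced illustration under stated assumptions: one attention head, identity projections instead of learned weight matrices, and no feature masking. The names `spatial_fuse`, `softmax`, and `layer_norm` are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def spatial_fuse(action_feat, env_feats):
    """Fuse one step: the action feature (query) attends over the M view features
    of the panorama (keys and values). Single head, no learned projections."""
    d = len(action_feat)
    # Scaled dot-product scores between the query and every view feature.
    scores = [sum(q * k for q, k in zip(action_feat, view)) / math.sqrt(d)
              for view in env_feats]
    weights = softmax(scores)
    attended = [sum(w * view[j] for w, view in zip(weights, env_feats))
                for j in range(d)]
    # Residual connection followed by layer normalization.
    fused = layer_norm([a + c for a, c in zip(action_feat, attended)])
    return fused, weights

action = [1.0, 0.0, 0.0, 0.0]
views = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
fused, weights = spatial_fuse(action, views)
print(weights)
```

The view whose feature aligns with the action feature receives the largest attention weight, which is the sense in which the action feature selects relevant environmental context.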
[0060] (2) Time encoder
[0061] After the spatial encoder has fused the spatial information of each time stage, a temporal encoder is used to learn the intrinsic connections between different time stages. Extracting these connections is crucial for the instruction generation model, as instructions are generated based on the agent's entire navigation process. Specifically, the spatial fusion features $\{z_1, z_2, \ldots, z_N\}$ generated by the spatial encoder at the different navigation points are fed as independent tokens into the L-layer temporal encoder. Because the trajectory instructions generated by the instruction generation model are closely tied to the actual navigation order, a position encoding PE(·) is added to Z to preserve positional information. Each encoder layer consists of a multi-head self-attention layer (MSA) and a small feedforward neural network (FFN). As in the spatial encoder, a residual connection is applied around each sub-layer, followed by layer normalization. Let l denote the l-th layer, where l = 1…L; the formula for the temporal encoder is then:
[0062] $Z_0 = \mathrm{PE}(Z)$ (8)
[0063] $Z'_l = \mathrm{LayerNorm}(\mathrm{MSA}(Z_{l-1}) + Z_{l-1})$ (9)
[0064] $Z_l = \mathrm{LayerNorm}(\mathrm{FFN}(Z'_l) + Z'_l)$ (10)
[0065] (3) Text Decoder
[0066] The decoder employs a Transformer decoder architecture. Since language generation is an autoregressive process, each predicted word must depend only on the previously predicted words. Therefore, the word embeddings are offset by one position using the special <bos> token, and a mask is applied to the attention matrix to block illegal positions. Positional encoding is also added to the embedding vectors to capture the relative positions of tokens in the sequence. Let the size of the target vocabulary be $d_{vocab}$, and let each instruction contain at most L words. The output of the last hidden layer is passed through a linear layer and a softmax to convert it into predicted probabilities. The output head for this task is called the speaker word projector (SWP), which is optimized using the cross-entropy loss shown in the following formula:
[0067]
[0068] In the formula, θ is a preset parameter, l is the number of words in a text instruction from the dataset, l≤L, and $f_{\theta}(\cdot)$ represents the probability, predicted by the instruction generation model from the action A, the environmental observation E, and the first i-1 words of the instruction, that the target word appears at the i-th position of the instruction.
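As a rough illustration of the two ingredients just described, the sketch below builds a causal attention mask and evaluates a cross-entropy loss for one instruction. It assumes the per-position word distributions have already been produced by teacher-forced decoding, and it uses the summed, unnormalized negative log-likelihood; whether the loss is summed or averaged is an assumption, and the function names are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def instruction_nll(step_probs, target_ids):
    """Sum over the l positions of -log f_theta(target word at position i).

    step_probs[i] is the predicted distribution over the vocabulary at
    position i, given the action features, the environment observations,
    and the first i-1 ground-truth words.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

step_probs = [[0.5, 0.5], [0.25, 0.75]]   # toy vocabulary of size 2, l = 2
target_ids = [0, 1]
loss = instruction_nll(step_probs, target_ids)
print(loss, causal_mask(3))
```

A perfect predictor that puts probability 1 on every target word would give a loss of 0; the loss grows as probability mass moves away from the target words.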
[0069] Figure 4 shows the changes in key performance metrics during the training of the instruction generation model. Figure 4(a), (b), and (c) show the curves of BLEU, BLEU-4, and Loss, respectively, on the seen-environment validation set. Figure 4(d), (e), and (f) show the curves of BLEU, BLEU-4, and Loss, respectively, on the unseen-environment validation set.
[0070] Step S2: Generate a new environment based on the preset simulation environment.
[0071] This step simulates and generates a new environment by randomly occluding some visual features within the simulation environment, thereby enhancing visual features and increasing the model's adaptability to different environments. Specifically:
[0072] The instruction generation-navigation path generation method fine-tunes the model on additional training data generated through back translation. This is typically achieved with an instruction generation model that synthesizes additional route instructions in the existing environments. However, the bottleneck of this semi-supervised learning approach is the limited variety of the given environments, which limits generalization to unfamiliar environments. To overcome this problem, an environmental visual feature masking method is employed. New environments are generated by viewpoint-consistent visual feature masking, and new navigation routes are then collected in these new environments. Finally, the instruction generation model generates new navigation instructions for these routes, and the model is fine-tuned on this augmented data.
[0073] A new environment E′ is generated by applying environment occlusion to the existing environment E (the preset simulation environment):
[0074] $E' = \mathrm{EnvDrop}_p(E)$ (12)
[0075] The view feature $f'_{t,i}$ observed by the navigation agent in the new environment E′ is obtained as the Hadamard product of the original feature $f_{t,i}$ and the environment occlusion mask $\xi^{E}$:
[0076] $f'_{t,i} = f_{t,i} \odot \xi^{E}$ (13)
[0077]
[0078] In the formula, each element of the environment occlusion mask $\xi^{E}$ is a sample of a Bernoulli-distributed random variable, where Ber(·) represents the Bernoulli distribution and p is the probability of masking a feature of the view.
[0079] To preserve the spatial relationships of the viewpoints, environmental visual feature masking is applied only to the image features, while the orientation features $(\cos\alpha_{t,i}, \sin\alpha_{t,i}, \cos\varphi_{t,i}, \sin\varphi_{t,i})$ remain unchanged.
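The masking step can be sketched as follows. This is a minimal illustration under assumptions: the mask is sampled once per environment and reused for every view (which is what makes it consistent across views and viewpoints), each channel is dropped independently with probability p, and any rescaling of the kept channels is omitted because the text does not specify one. The dictionary layout and function names are hypothetical.

```python
import random

def sample_env_mask(feature_dim, p, rng):
    """One keep/drop decision per image-feature channel: 0.0 with probability p, else 1.0."""
    return [0.0 if rng.random() < p else 1.0 for _ in range(feature_dim)]

def env_drop(view_features, mask):
    """Apply the same mask to the image part of every view (element-wise product);
    the orientation part is copied unchanged."""
    return [
        {
            "image": [f * m for f, m in zip(view["image"], mask)],
            "orientation": list(view["orientation"]),
        }
        for view in view_features
    ]

rng = random.Random(0)
mask = sample_env_mask(feature_dim=8, p=0.5, rng=rng)
views = [
    {"image": [1.0] * 8, "orientation": [1.0, 0.0, 1.0, 0.0]},
    {"image": [1.0] * 8, "orientation": [0.0, 1.0, 0.0, 1.0]},
]
new_env = env_drop(views, mask)
print(mask)
```

Because one mask is shared by all views, the same feature channels vanish everywhere in the scene, so the result behaves like a different but internally consistent environment rather than like per-image noise.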
[0080] Step S3: In the new environment, collect the first navigation path. This process effectively expands the existing dataset to cover more path variations, specifically:
[0081] When the training data covers only a limited number of instruction-route pairs, i.e., $D = \{(d_1, r_1), \ldots, (d_N, r_N)\}$, the training data can be augmented so that the agent adapts better to new routes: an instruction generation model generates instructions for new routes sampled in the training environment. To this end, M routes are collected in the training environment using the same shortest-path method as the original dataset. These routes constitute the first navigation path.
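Route collection by shortest path can be sketched on a small weighted graph. This is an illustrative sketch only: it uses Dijkstra's algorithm with uniformly random start and goal viewpoints, whereas the sampling rules of the original dataset (for example, constraints on path length) are not given in the text and are not modelled here. The graph and the function names are hypothetical.

```python
import heapq
import random

def shortest_path(graph, start, goal):
    """Dijkstra on a weighted undirected graph given as {node: {neighbour: distance}}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return path, dist
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in visited:
                heapq.heappush(queue, (dist + weight, nxt, path + [nxt]))
    return None, float("inf")

def sample_routes(graph, m, rng):
    """Collect m shortest-path routes between random pairs of distinct viewpoints.
    Assumes at least one pair of viewpoints is connected."""
    nodes = sorted(graph)
    routes = []
    while len(routes) < m:
        start, goal = rng.sample(nodes, 2)
        path, _ = shortest_path(graph, start, goal)
        if path is not None:
            routes.append(path)
    return routes

graph = {
    "a": {"b": 1.0, "c": 3.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 3.0, "b": 1.0},
}
routes = sample_routes(graph, 3, random.Random(0))
print(shortest_path(graph, "a", "c"))
```

In this toy graph the direct edge from "a" to "c" is longer than the detour through "b", so the shortest route passes through the intermediate viewpoint.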
[0082] Step S4: Based on the first navigation path, generate the first navigation instruction using the instruction generation model trained in step S1, thus obtaining the expanded simulation environment dataset. Specifically:
[0083] By greedy decoding with the instruction generation model, a text instruction similar to a human instruction is generated for each navigation path collected in step S3:
[0084]
[0085] These M navigation paths and their generated instructions, denoted S, are merged with the original training dataset D to form an augmented training set S∪D, which is the expanded simulation environment dataset. During training, the navigation model is first trained on this augmented training set and then further fine-tuned on the original training set D. This data augmentation, centered on the instruction generation model, aims to overcome the data scarcity problem in visual language navigation datasets, allowing the navigation model to be trained on newly acquired routes and synthesized instructions.
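Greedy decoding and the construction of the augmented set can be sketched as below. The speaker here is a toy stand-in that follows a fixed script, and the token names `<bos>` and `<eos>`, the `max_len` limit, and all function names are assumptions made for the illustration.

```python
def greedy_decode(next_word_probs, route, max_len=20, bos="<bos>", eos="<eos>"):
    """Pick the most probable next word at every step until <eos> or max_len."""
    words = [bos]
    for _ in range(max_len):
        probs = next_word_probs(route, words)   # dict: word -> probability
        word = max(probs, key=probs.get)
        if word == eos:
            break
        words.append(word)
    return words[1:]  # drop the <bos> marker

def toy_speaker(route, prefix):
    # Stand-in for the trained instruction generation model.
    script = ["walk", "to", route[-1], "<eos>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {w: (0.9 if w == target else 0.1 / 3) for w in set(script)}

original = [("walk to kitchen", ["hall", "kitchen"])]                 # dataset D
new_routes = [["hall", "office"], ["kitchen", "bedroom"]]             # sampled routes
synthetic = [(" ".join(greedy_decode(toy_speaker, r)), r) for r in new_routes]  # set S
augmented = synthetic + original                                      # S ∪ D
print(augmented)
```

Greedy decoding keeps exactly one instruction per route, so the size of S equals the number of sampled routes M.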
[0086] Step S5: Train the pre-defined navigation model using the expanded simulation environment dataset. Specifically:
[0087] The basic navigation path generation model, also known as the navigation model, adopts an encoder-decoder structure.
[0088] The encoder is a bidirectional LSTM-RNN network with an embedding layer:
[0089]
[0090] In the formula, $u_j$ is the j-th word in the instruction, and L is the length of the instruction.
[0091] The decoder is an LSTM-RNN network with an attention mechanism. At each decoding step t, the agent first applies attention weighting to the view features $\{f_{t,i}\}$ to obtain the attention-weighted visual feature.
[0092]
[0093] The input to the decoder is the concatenation of the attention-weighted visual feature and the previous action. The LSTM hidden layer output is combined with the attention-weighted instruction feature to obtain an instruction-aware hidden layer output. The probability $p_t(a_{t,k})$ of moving to the k-th navigation point is computed by aligning the navigation point feature $g_{t,k}$ with the instruction-aware hidden layer output, followed by a softmax:
[0094]
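One common way to realize such an alignment is a bilinear score between each navigation point feature and the hidden output, normalized with a softmax. The sketch below shows that variant; the bilinear form, the weight matrix `weight`, and the function name are assumptions, since the exact expression is not legible in this text.

```python
import math

def action_probabilities(candidate_feats, hidden, weight):
    """softmax over k of g_k^T (W h): one probability per navigable candidate."""
    # Project the instruction-aware hidden output: W h.
    projected = [sum(w_ij * h_j for w_ij, h_j in zip(row, hidden)) for row in weight]
    # Alignment score of each candidate feature g_k with the projected hidden output.
    logits = [sum(g_i * p_i for g_i, p_i in zip(g, projected)) for g in candidate_feats]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

candidates = [[1.0, 0.0], [0.0, 1.0]]   # features g_{t,k} of two navigable points
hidden = [2.0, 0.0]                     # instruction-aware hidden output
identity = [[1.0, 0.0], [0.0, 1.0]]     # toy weight matrix
probs = action_probabilities(candidates, hidden, identity)
print(probs)
```

The candidate whose feature aligns best with the hidden output receives the highest probability, and the probabilities over all candidates sum to one.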
[0095] Step S6: In the actual deployment environment of the indoor mobile service robot, perform simultaneous localization and map building, and construct a topological navigation map based on the localization and mapping results. Specifically:
[0096] Currently, visual language navigation methods without mapping still have low accuracy and are prone to collisions with environmental objects, making them unsuitable for deployment in real-world environments. Therefore, it is necessary to use Simultaneous Localization and Mapping (SLAM) methods to build maps, restricting the movement of intelligent robots within a safe range where collisions will not occur, while simultaneously providing the robot with location information.
[0097] This embodiment uses a self-developed sampling mobile robot to complete scene mapping and image acquisition. The mobile robot is equipped with hardware such as a mobile chassis, high-precision LiDAR, and panoramic camera, enabling it to move and capture panoramic images on relatively flat ground.
[0098] After mapping, select navigation points on the map. These navigation points should ideally cover all areas the robot is expected to reach, and the distance between adjacent navigation points should be approximately 2 meters. Based on these navigation points, the topology navigation map can be determined.
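One way to derive a topological map from the selected navigation points can be sketched as follows. The rule used here, connecting two navigation points when their straight-line distance is below a threshold, is an assumption for illustration only: the source does not state how edges are determined, and a real deployment would also need to check each edge against obstacles in the SLAM map. The name `build_topological_map` and the 2.5 m threshold are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def build_topological_map(
    waypoints: Dict[str, Point],
    max_edge_m: float = 2.5,  # hypothetical threshold, slightly above the ~2 m spacing
) -> Dict[str, List[str]]:
    """Return an undirected adjacency list over navigation points.

    Assumption: an edge exists when two points are within max_edge_m of each
    other; obstacle checks against the occupancy map are not modeled here.
    """
    graph: Dict[str, List[str]] = {name: [] for name in waypoints}
    for a, b in combinations(waypoints, 2):
        if math.dist(waypoints[a], waypoints[b]) <= max_edge_m:
            graph[a].append(b)
            graph[b].append(a)
    return graph
```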
[0099] Step S7: Based on the topology navigation map, acquire a panoramic view in the actual deployment environment. Specifically:
[0100] After the waypoints are generated, the intelligent robot proceeds sequentially to each waypoint in the environment, capturing a panoramic view upon arrival. If the positioning is inaccurate, the robot's position needs to be manually adjusted. During panoramic image capture, the robot's position and attitude information returned by the SLAM program needs to be recorded for adjusting the topology map and panoramic image preprocessing.
[0101] The image acquisition robot uses the ROS coordinate system for mapping and localization. The definition of the heading angle differs between the ROS coordinate system and the Matterport3D simulator coordinate system. As shown in Figure 5, in the Matterport3D coordinate system the positive direction of the y-axis corresponds to heading angle 0, and the heading angle increases to the right over the range [0, 2π]. In the ROS coordinate system, the positive direction of the x-axis corresponds to heading angle 0, and the heading angle increases to the left over the range [-π, π]. When the Matterport3D simulator slices sub-images, it starts at a pitch angle of -π/3 and a heading angle of 0 and advances in steps of π/3, first to the right and then upward (the pitch angle is increased only after all heading angles have been traversed). The horizontal and vertical span of each sub-image is π/3. Therefore, after the captured panoramic image has been rotated so that its center lies at heading angle 0 in the ROS coordinate system, it needs to be rotated to the right by a further π/3, and the sub-images are then sliced starting from the lower left. The slicing result is shown in Figure 3. After slicing, features are extracted from each sub-image in turn, and the feature vectors are then concatenated.
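The difference between the two heading conventions can be made concrete with a small conversion sketch. It assumes that the x- and y-axes of the two coordinate systems coincide, which the source does not state, and it covers only the angle convention, not the panorama rotation and slicing procedure. The function names are hypothetical.

```python
import math


def ros_to_matterport_heading(yaw_ros: float) -> float:
    """Convert a ROS heading to the Matterport3D convention.

    ROS:          0 along +x, increasing to the left (counterclockwise), [-pi, pi].
    Matterport3D: 0 along +y, increasing to the right (clockwise), [0, 2*pi).
    Assumption: both frames share the same x- and y-axes.
    """
    return (math.pi / 2 - yaw_ros) % (2 * math.pi)


def matterport_to_ros_heading(heading_mp: float) -> float:
    """Inverse conversion, wrapped back into [-pi, pi]."""
    yaw = math.pi / 2 - heading_mp
    return math.atan2(math.sin(yaw), math.cos(yaw))
```

Under these assumptions, a robot facing along the +y axis has a ROS heading of π/2 and a Matterport3D heading of 0.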
[0102] Step S8: Based on the panoramic image, collect the second navigation path in the actual deployment environment, and use the instruction generation model trained in S1 to generate the second navigation instructions, thus obtaining the expanded deployment environment dataset. Specifically:
[0103] First, the environment is divided into multiple regions, typically one room per region, with doors serving as the connecting passageways. Then, each region is taken in turn as the start region and the end region of a path. All cross-region paths are traversed, and paths with fewer than 4 or more than 10 navigation points, as well as paths shorter than 3 meters, are filtered out. The instruction generation model trained on an existing dataset is then used to automatically generate navigation language instructions (i.e., the second navigation instructions) for the collected paths (i.e., the second navigation paths), resulting in the deployment environment dataset.
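The path filtering rule above can be sketched as follows. The thresholds (4 to 10 navigation points, at least 3 meters) come from the paragraph above; the function names and the representation of a path as a list of 2D coordinates are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def path_length_m(points: Sequence[Point]) -> float:
    """Sum of straight-line distances between consecutive navigation points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def keep_path(
    points: Sequence[Point],
    min_points: int = 4,
    max_points: int = 10,
    min_length_m: float = 3.0,
) -> bool:
    """Keep a path only if it has 4 to 10 navigation points and is at least 3 m long."""
    if not (min_points <= len(points) <= max_points):
        return False
    return path_length_m(points) >= min_length_m


def filter_paths(paths: Sequence[Sequence[Point]]) -> List[Sequence[Point]]:
    return [p for p in paths if keep_path(p)]
```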
[0104] Figure 6 shows an example of generating navigation instructions from the navigation paths collected in the deployment environment using the instruction generation model in S1. In part (a) of Figure 6:
[0105] Generated instructions include: exit the office and turn right. walk down the hallway and turn right at the exit sign. wait in the doorway of the room on the right.
[0106] Human instructions include: Turn right, leave the room through the door closest to you, then turn right and walk down the corridor until you pass the black chair in the hallway, then turn right, enter the room with the basketball by the door and wait at the door.
[0107] In part (b) of Figure 6:
[0108] Generated instructions include: walk around the glass table and turn right. Walk past the glass wall and turn left. Walk into the first door on the left and stop.
[0109] Human instructions include: Turn around the wooden counter facing the restaurant and go straight to the wall, then turn left and follow the wall to the restaurant door and turn right out of the restaurant, with the cardboard boxes in the storage room on the left and the corridor on the right. Turn left and go straight to the door of the storeroom and turn left into the storeroom.
[0110] In part (c) of Figure 6:
[0111] Generated instructions include: walk forward and exit the room. Turn right and enter the first door on the right. Wait near the desk.
[0112] Human instructions include: Go ahead to the edge of the three vases, then turn right and follow the black counter. After passing the black sprinkler, you will see the open glass door on the left. Go through this door to the hallway, then turn right into the next door. Wait at the study desk with computers and books.
[0113] Advantages of the instruction generation model: ① Automatic generation of language instructions without manual intervention. The model can therefore be used to automatically generate instructions for collected image datasets, enabling path-instruction pairs to be acquired automatically in the deployment environment, so that the model can be fine-tuned and its performance in the deployment environment improved. ② Roughly accurate navigation directions. The directional descriptors such as "turn left" and "turn right" and the action instructions such as "leave room" and "enter room" generated by the instruction generation model are consistent with the actual navigation path. ③ Ability to identify room types and common items in the environment, for example "office", "hallway", and "exit sign" in the instruction of part (a) of Figure 6; "glass table", "glass wall", and "door" in the instruction of part (b) of Figure 6; and "door" and "desk" in the instruction of part (c) of Figure 6.
[0114] In steps S4 and S8 above, the trained instruction generation model is used to automatically generate the first navigation instruction and the second navigation instruction.
[0115] Step S9: Fine-tune the navigation model trained in S5 using the expanded deployment environment dataset, thereby completing the visual language navigation data enhancement. This process further optimizes the model so that it performs better in the specific real-world deployment environment. Specifically:
[0116] The model is fine-tuned using the dataset collected in the deployment environment (i.e., the deployment environment dataset). Figure 7 shows how the key performance metrics change during this fine-tuning. It can be seen that the fine-tuned navigation model improves significantly on the key performance metrics for the test set collected in the deployment environment. Specifically, part (a) of Figure 7 is the Loss curve, part (b) is the NE curve, part (c) is the PL curve, part (d) is the SR curve, and part (e) is the SPL curve.
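For reference, the metrics named above can be computed as in the following sketch, assuming the abbreviations carry their usual meaning in visual language navigation: NE (navigation error), PL (path length), SR (success rate), and SPL (success weighted by path length). The 3-meter success radius and the use of straight-line rather than shortest-path distance for NE are assumptions, as are the `Episode` structure and the function names.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Episode:
    trajectory: List[Point]  # positions visited by the agent, in order
    goal: Point              # target position
    shortest_path_m: float   # length of the reference shortest path


def path_length_m(points: Sequence[Point]) -> float:
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def evaluate(episodes: Sequence[Episode], success_radius_m: float = 3.0) -> Dict[str, float]:
    """Average NE, PL, SR, and SPL over a set of episodes.

    Assumptions: NE is the straight-line distance from the final position to
    the goal; an episode succeeds when NE <= success_radius_m.
    """
    ne, pl, sr, spl = [], [], [], []
    for ep in episodes:
        length = path_length_m(ep.trajectory)
        error = math.dist(ep.trajectory[-1], ep.goal)
        success = 1.0 if error <= success_radius_m else 0.0
        denom = max(length, ep.shortest_path_m)
        ne.append(error)
        pl.append(length)
        sr.append(success)
        spl.append(success * ep.shortest_path_m / denom if denom > 0 else 0.0)
    n = len(episodes)
    return {"NE": sum(ne) / n, "PL": sum(pl) / n, "SR": sum(sr) / n, "SPL": sum(spl) / n}
```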
[0117] Furthermore, this embodiment also provides a visual language navigation data enhancement device for indoor mobile service robots, including a memory, a processor, and a program stored in the memory. When the processor executes the program, it implements one or more steps of the aforementioned method. The specific execution flow is basically the same as that of the method and is not repeated here. The processor of the device includes a central processing unit (CPU), which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) or loaded from the storage unit into random access memory (RAM). Various programs and data required for device operation can also be stored in the RAM. The CPU, ROM, and RAM are interconnected via a bus. An input/output (I/O) interface is also connected to the bus. Multiple components of the device are connected to the I/O interface, including: input units, such as keyboards and mice; output units, such as various types of displays and speakers; storage units, such as magnetic disks and optical discs; and communication units, such as network cards, modems, and wireless transceivers. The communication units allow the device to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks. The processing unit performs the various methods and processes described above, such as one or more steps of the aforementioned method. For example, in some embodiments, one or more steps of the aforementioned method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed on the device via the ROM and/or a communication unit.
When the computer program is loaded into RAM and executed by the CPU, one or more steps of the aforementioned method may be performed. Alternatively, in other embodiments, the CPU may be configured to perform one or more steps of the aforementioned method by any other suitable means (e.g., by means of firmware). The functions described above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0118] Furthermore, this embodiment also provides a storage medium on which a program is stored. When the program is executed, it implements one or more steps of the aforementioned method. The specific execution flow is basically the same as the method execution flow, and will not be described again here.
[0119] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A visual-language navigation data enhancement method for indoor mobile service robots, characterized in that it includes the following steps:
S1: train a pre-defined instruction generation model using a pre-acquired visual language navigation dataset;
S2: generate a new environment based on the preset simulation environment;
S3: in the new environment, acquire the first navigation path;
S4: based on the first navigation path, use the instruction generation model trained in S1 to generate the first navigation instruction, and obtain the expanded simulation environment dataset;
S5: use the expanded simulation environment dataset to train the preset navigation model;
S6: perform simultaneous localization and map building in the actual deployment environment of the indoor mobile service robot, and construct a topology navigation map based on the localization and mapping results;
S7: based on the topology navigation map, acquire a panoramic view in the actual deployment environment;
S8: based on the panoramic view, collect a second navigation path in the actual deployment environment, and generate a second navigation instruction using the instruction generation model trained in S1 to obtain the deployment environment dataset;
S9: fine-tune the navigation model trained in S5 using the deployment environment dataset to complete the visual-language navigation data enhancement.
S2 specifically includes: using an environmental visual feature occlusion method to generate a new environment in which the visual feature occlusion is consistent across views and viewpoints; in order to maintain the spatial relationship of the viewpoints, only the image features are occluded when performing environmental visual feature occlusion, while the directional features remain unchanged; the view features observed in the new environment are obtained as the Hadamard product of the original features and an environment occlusion mask for the preset simulation environment, where each element of the environment occlusion mask is a sample of a single random variable that follows a Bernoulli distribution determined by the probability of occluding view features.
2. The visual-language navigation data enhancement method for indoor mobile service robots according to claim 1, characterized in that, in S4 and S8, the trained instruction generation model is used to automatically generate the first navigation instruction and the second navigation instruction.
3. The visual-language navigation data enhancement method for indoor mobile service robots according to claim 1, characterized in that the instruction generation model is constructed using a sequence-to-sequence Transformer structure, which includes a spatial encoder and a temporal encoder connected to each other. The spatial encoder is used to fuse spatial information at each time stage, and the temporal encoder is used to learn the intrinsic connections between different time stages after the spatial encoder has completed the fusion.
4. The visual-language navigation data enhancement method for indoor mobile service robots according to claim 3, characterized in that the Transformer structure also includes a text decoder connected to the temporal encoder, the text decoder including an instruction-generating word predictor for performing prediction tasks.
5. The visual-language navigation data enhancement method for indoor mobile service robots according to claim 4, characterized in that the instruction-generating word predictor is optimized using a preset cross-entropy loss, the expression of which is defined in terms of: the preset parameters; a text instruction in the dataset; the number of words contained in the text instruction; and the probability, predicted by the instruction generation model based on the action, the environmental observation, and the preceding instruction words, that the target word appears at a given position of the instruction.
6. The visual-language navigation data enhancement method for indoor mobile service robots according to claim 3, characterized in that, before the image to be processed is input into the instruction generation model, it is encoded using a trained image encoder to extract the corresponding image features. The image encoder is trained using a contrastive learning method.
7. A visual-language navigation data enhancement device for indoor mobile service robots, comprising a memory, a processor, and a program stored in the memory, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1-6.
8. A storage medium having a program stored thereon, characterized in that, when the program is executed, it implements the method as described in any one of claims 1-6.