A path obstacle avoidance method based on multimodal deep learning
By fusing data from cameras, LiDAR, and sonar using multimodal deep learning, and combining it with deep reinforcement learning, the problem of obstacle recognition difficulties for amphibious vehicles in complex environments has been solved. This has enabled comprehensive perception and accurate identification of both land and water, improving autonomous obstacle avoidance capabilities and mission completion rates.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUHU SHIPYARD CO LTD
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-30
AI Technical Summary
Current amphibious vehicles rely on a single sensor for obstacle recognition in complex and ever-changing environments, resulting in insufficient obstacle avoidance capabilities, especially in environments with strong light, fog, water, transparent obstacles, and complex underwater conditions.
Employing a multimodal deep learning approach, this method integrates data from cameras, LiDAR, and multibeam sonar. It extracts features using ResNet, PointNet++, and one-dimensional convolutional neural networks, and provides vehicle status information through an inertial measurement unit and a global positioning system. Combined with deep reinforcement learning, it performs path planning to achieve comprehensive perception and accurate identification of obstacles.
It enhances the amphibious vehicle's autonomous obstacle avoidance capabilities in complex environments, ensuring safety, path smoothness, and energy efficiency, while also improving environmental understanding and robustness.
Figure CN122308366A_ABST
Description
Technical Field
[0001] This invention belongs to the field of environmental perception technology and, more specifically, relates to a path obstacle avoidance method based on multimodal deep learning.
Background Technology
[0002] With the widespread application of amphibious vehicles in military reconnaissance, emergency rescue, land and water transportation, and environmental monitoring, existing amphibious vehicles often face complex and ever-changing environments in actual use, including the transition zone between land and water, wetlands, shoals, floating obstacles, and irregular terrain. The obstacle avoidance capability in these environments directly affects the safety and mission completion efficiency of amphibious vehicles.
[0003] Currently, the obstacle avoidance capability of amphibious vehicles relies on the accurate identification of obstacles in the environment. Existing obstacle identification mainly relies on a single sensor (such as a lidar, a camera, or an ultrasonic sensor) for environmental perception and obstacle detection. However, single-modal sensors have limitations in different environments. For example, visual sensors are easily disturbed by strong light, fog, or water reflections; lidar is prone to failure when facing water surfaces and transparent obstacles; and sonar has a low signal-to-noise ratio in complex underwater environments, making it difficult to accurately identify obstacles. This greatly limits the autonomous obstacle avoidance capability of existing amphibious vehicles in changing environments.
Summary of the Invention
[0004] This invention provides a path obstacle avoidance method based on multimodal deep learning, which aims to solve at least one of the above-mentioned problems.
[0005] This invention is implemented as follows: a path obstacle avoidance method based on multimodal deep learning, the method being as follows:
[0006] (1) Collect multimodal environmental perception information and perform time and space alignment on the multimodal environmental perception information;
[0007] (2) Extract environmental features from the environmental perception information of each modality, fuse the environmental features of each modality, and output the fusion vector;
[0008] (3) Based on the fusion vector, identify obstacles in the current environment and check whether every identified obstacle already exists in the map. If so, execute step (4); if not, complete obstacle avoidance based on deep reinforcement learning and then execute step (4);
[0009] (4) Continue driving along the planned driving route and perform step (1) until the target location is reached.
[0010] Furthermore, environmental perception information includes: environmental images captured by cameras, environmental point clouds captured by lidar, and sonar data captured by multibeam sonar.
[0011] Furthermore, the environmental image is input into a ResNet network to extract image features $F_{img}$; the environmental point cloud is input into a PointNet++ network to extract point cloud features $F_{pc}$; and the sonar data is input into a one-dimensional convolutional neural network to extract sonar features $F_{son}$.
[0012] Furthermore, the inertial measurement unit (IMU) measures the vehicle's acceleration, angular velocity, and attitude changes in real time during its motion, while the global positioning system (GPS) collects the vehicle's global position and speed information.
[0013] Furthermore, the acceleration, angular velocity, attitude changes, global position information, and velocity information measured during motion are input into a multi-layer fully connected network or a lightweight temporal network to extract the vehicle's current state features $F_{veh}$.
[0014] Furthermore, the process of obtaining the fusion vector is as follows:
[0015] (21) Align the image features $F_{img}$, point cloud features $F_{pc}$, sonar features $F_{son}$, and vehicle state features $F_{veh}$, giving the aligned features $\hat{F}_{img}$, $\hat{F}_{pc}$, $\hat{F}_{son}$, and $\hat{F}_{veh}$;
[0016] (22) Calculate the weight of each modality feature through the channel attention mechanism, namely the weights $\alpha_{img}$, $\alpha_{pc}$, $\alpha_{son}$, and $\alpha_{veh}$ of the image features, point cloud features, sonar features, and vehicle state features;
[0017] (23) Based on these weights, perform a weighted summation of the aligned image features, point cloud features, sonar features, and vehicle state features to form the multimodal fusion feature $F_{fuse}$;
[0018] (24) Combine the multimodal fusion features of multiple consecutive frames into a multimodal fusion feature sequence $X$, and input $X$ into the fusion module, which outputs the fusion vector;
[0019] The fusion module consists of a Transformer encoder and a multilayer perceptron (MLP).
[0020] Furthermore, the multimodal fusion feature $F_{fuse}$ is computed as follows:
[0021] $F_{fuse} = \alpha_{img} \odot \hat{F}_{img} + \alpha_{pc} \odot \hat{F}_{pc} + \alpha_{son} \odot \hat{F}_{son} + \alpha_{veh} \odot \hat{F}_{veh}$, where $\odot$ denotes element-wise multiplication.
[0022] Furthermore, the optimal obstacle avoidance path from the current position to the target point is determined based on deep reinforcement learning. The method for determining the target point is as follows: a series of aiming points are set on the planned driving path. The new obstacle farthest from the current position is read. The nearest aiming point behind the new obstacle is determined. It is checked whether the distance between the aiming point and the new obstacle is greater than the set safe distance threshold. If the detection result is yes, the aiming point is taken as the target point. If the detection result is no, the next aiming point of the aiming point is taken as the target point.
[0023] Furthermore, the immediate reward $r_t$ in deep reinforcement learning is expressed as follows:
[0024] $r_t = -(\lambda_1 C_{obs} + \lambda_2 C_{smooth} + \lambda_3 C_{energy}) + \lambda_4 R_{goal}$;
[0025] where $\lambda_1$ to $\lambda_4$ denote weights; $C_{obs}$ denotes the obstacle avoidance cost: within the safe distance, the closer the path is to an obstacle, the more rapidly $C_{obs}$ increases; $C_{smooth}$ denotes the path smoothness cost: the greater the curvature of the path, the larger $C_{smooth}$; $C_{energy}$ denotes the energy cost: the longer the path, the larger $C_{energy}$; and $R_{goal}$ denotes the positive reward given for approaching the target point: the closer to the target point, the larger $R_{goal}$.
[0026] This invention integrates data from multiple sensors, including vision, lidar, and sonar, to achieve comprehensive perception and accurate identification of obstacles on land and in water. Furthermore, the path planning decision based on deep reinforcement learning effectively balances multiple factors such as safety, path smoothness, and energy consumption, thereby improving the amphibious vehicle's driving efficiency and mission completion rate.
Attached Figure Description
[0027] Figure 1 is a flowchart of the path obstacle avoidance method based on multimodal deep learning provided in an embodiment of the invention.
Detailed Implementation
[0028] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings, so as to help those skilled in the art to have a more complete, accurate and in-depth understanding of the inventive concept and technical solution of the present invention.
[0029] Figure 1 is a flowchart of the path obstacle avoidance method based on multimodal deep learning provided in an embodiment of the present invention. The method comprises the following steps:
[0030] (1) Collect multimodal environmental perception information and perform time and space alignment on the multimodal environmental perception information. The environmental perception information includes: environmental images collected by cameras, environmental point clouds collected by lidar, and sonar data collected by multibeam sonar.
[0031] In this embodiment of the invention, the sensors used for environmental perception include: a camera, a lidar (LiDAR), and a multibeam sonar. These sensors complement each other to improve the perception capability and environmental modeling accuracy in complex scenes. The camera is primarily used to collect visible light image information from the scene, effectively identifying visual features such as terrain textures, obstacle outlines, water boundaries, shoreline distribution, and road access areas, providing crucial information for semantic understanding and target recognition. The lidar, by actively emitting laser beams and receiving echo information, can obtain high-precision three-dimensional point cloud data, used to accurately describe ground undulations, obstacle sizes, spatial relationships, and the geometric features of the surrounding environment. It exhibits high distance measurement accuracy and environmental mapping capabilities in land scenes. Addressing the issue that traditional optical sensors are significantly affected by light attenuation and turbidity in underwater environments, multibeam sonar is introduced to detect the underwater environment. Multibeam sonar can acquire information such as water depth, underwater terrain changes, underwater obstacle distribution, and echo intensity in local areas, making it particularly suitable for environmental perception tasks in low visibility, turbid water, or nighttime conditions.
[0032] To ensure consistency of multi-source heterogeneous sensor data in both the temporal and spatial domains, a high-precision time synchronization device is employed to uniformly trigger clocks and align timestamps for each sensor, ensuring that different sensors observe the same environmental conditions simultaneously. Furthermore, rigorous extrinsic parameter calibration is used to obtain the pose relationships of each sensor relative to the vehicle's coordinate system, achieving spatial registration and unified representation of multi-sensor data. All raw perception data is transmitted in real-time to the main control unit via the vehicle's high-speed communication bus, forming a continuous and stable multimodal data stream. This data stream includes not only visual texture information, spatial geometric structure information, and underwater echo characteristics, but also vehicle motion state and global position information, providing a comprehensive and rich information foundation for subsequent data preprocessing, multimodal feature extraction, cross-modal information fusion, and path planning decisions.
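The timestamp alignment described above can be illustrated in software. The following is a minimal sketch, assuming each sensor delivers a sorted list of timestamps in seconds; the function names, the `max_skew` tolerance, and the nearest-timestamp matching rule are illustrative assumptions and are not taken from the patent, which relies on a hardware synchronization device.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the entry in the sorted list `timestamps` closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose between the neighbours on either side of t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align_streams(reference, others, max_skew):
    """Pair each reference timestamp with the nearest sample of every other
    stream; drop frames in which any stream is further away than max_skew."""
    frames = []
    for t in reference:
        frame = {"t": t}
        complete = True
        for name, stamps in others.items():
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > max_skew:
                complete = False
                break
            frame[name] = j  # index of the matched sample in that stream
        if complete:
            frames.append(frame)
    return frames

camera = [0.00, 0.10, 0.20]  # hypothetical camera timestamps
streams = {"lidar": [0.01, 0.11, 0.35], "sonar": [0.00, 0.09, 0.21]}
frames = align_streams(camera, streams, max_skew=0.02)
```

In this example the third camera frame is discarded because no lidar sample lies within 0.02 s of it, so only fully populated multimodal frames reach the later stages.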
[0033] The acquired multimodal environment perception information is preprocessed, followed by temporal and spatial alignment.
[0034] The preprocessing of multimodal environmental perception information includes the following: for environmental image data, distortion correction algorithms are used to eliminate lens distortion, and color correction is performed; for environmental point clouds, filtering algorithms (such as Voxel Grid and Statistical Outlier Removal) are used to remove noise, combined with IMU data for ground correction; for sonar data, denoising and echo enhancement signal processing techniques are used to improve the accuracy of underwater obstacle detection. Through an extrinsic parameter matrix, the environmental information perceived by different sensors is uniformly transformed into the vehicle's body coordinate system, achieving spatial alignment of the multimodal environmental perception information and ensuring spatial consistency for subsequent feature extraction and fusion.
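The spatial alignment step can be sketched as a homogeneous transform applied to each sensor point. This is a minimal illustration, assuming the extrinsic calibration reduces to a yaw rotation plus a translation; `make_extrinsic`, `to_body_frame`, and the mounting values are hypothetical, and a real calibration would supply a full 3D rotation.

```python
import math

def make_extrinsic(yaw_rad, tx, ty, tz):
    """4x4 homogeneous transform: rotation about the z axis by yaw_rad,
    followed by a translation (tx, ty, tz)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c,  -s,  0.0, tx],
        [s,   c,  0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def to_body_frame(T, point):
    """Map a 3D point from sensor coordinates into body coordinates."""
    h = (point[0], point[1], point[2], 1.0)  # homogeneous coordinates
    return tuple(sum(T[r][k] * h[k] for k in range(4)) for r in range(3))

# Hypothetical mounting: sensor rotated 90 degrees, 1 m forward, 0.5 m up.
T_lidar = make_extrinsic(math.pi / 2, 1.0, 0.0, 0.5)
p_body = to_body_frame(T_lidar, (1.0, 0.0, 0.0))
```

Applying one such matrix per sensor places camera, lidar, and sonar observations in the same body coordinate system before feature extraction.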
[0035] (2) Extract environmental features from the environmental perception information of each modality, fuse the environmental features of each modality, and output the fusion vector F;
[0036] Since image, point cloud, sonar, and inertial navigation positioning data differ significantly in data structure, representation, and physical meaning, appropriate deep learning network structures need to be adopted for different modalities of environmental perception information in order to fully explore the features in various types of data.
[0037] For environmental image data, a ResNet network is used as a feature extractor to extract image features. The ResNet network can automatically learn edges, textures, local structures, and high-level semantic information from raw pixels through multi-layer convolution and non-linear activation operations, thereby extracting image features such as terrain type, road boundaries, land-water boundaries, obstacle appearance, and passable areas. To adapt to scale variations and background interference in complex environments, feature pyramid structures or attention mechanisms can be introduced to enhance the model's ability to represent multi-scale targets and key regions.
[0038] For environmental point cloud data, the PointNet++ network model, suitable for processing unstructured 3D data, is adopted. Starting from an unordered point set, it can learn local neighborhood relationships and global geometric distribution features, thereby extracting point cloud features such as obstacle height, ground slope, surface roughness, spatial connectivity, and scene geometry. Point cloud features can accurately reflect the spatial structure of the environment and have significant advantages in terrain undulation analysis, obstacle size estimation, and accessibility assessment.
[0039] For sonar data, which typically exhibits significant temporal and signal characteristics, a one-dimensional convolutional neural network (1D-CNN) is used to extract sonar features such as underwater obstacle echo morphology, water depth variation trends, and spatial distribution characteristics of local areas. This enhances the ability to perceive underwater scenes.
[0040] This invention utilizes an inertial measurement unit (IMU) to measure the vehicle's acceleration, angular velocity, and attitude changes in real time during motion, and a Global Positioning System (GPS) to provide global position and velocity information. The combination of these two methods constructs a stable vehicle motion reference coordinate system, providing fundamental support for subsequent positioning, mapping, and path planning. While IMU and GPS data have relatively low dimensionality, they possess significant state constraint implications. Therefore, the current vehicle state features can be extracted using a multi-layer fully connected network or a lightweight temporal network. These include speed, acceleration, heading angle, attitude angle changes, and global position information. These features reflect the vehicle's own motion state and serve as an important bridge between environmental perception and decision-making, providing support for dynamic scene understanding and path feasibility analysis.
[0041] After feature extraction from each modality is completed, feature vectors from different sources need to undergo unified dimensionality mapping and normalization to reduce differences in scale distribution, numerical range, and statistical characteristics between different modalities, thereby improving the stability and effectiveness of subsequent fusion processes. By fusing multimodal features, the complementary relationships between modalities can be fully explored, enhancing scene understanding capabilities in complex environments. Since different modalities exhibit significant differences in data dimension, statistical distribution, feature representation, and noise characteristics, direct splicing or superposition can easily lead to feature redundancy, information imbalance, or weakening of key semantics. Therefore, this invention designs a multi-layer feature fusion network composed of feature alignment, attention weighting, and deep fusion encoding to achieve efficient integration of heterogeneous features.
[0042] First, to ensure that features from different modalities can interact within a unified semantic space, a feature alignment module performs linear mapping, dimensionality unification, and normalization. Let the image features, point cloud features, sonar features, and vehicle state features be denoted as $F_{img}$, $F_{pc}$, $F_{son}$, and $F_{veh}$. The alignment process can then be expressed as:
[0043] $\hat{F}_{img} = \phi_{img}(F_{img})$;
[0044] $\hat{F}_{pc} = \phi_{pc}(F_{pc})$;
[0045] $\hat{F}_{son} = \phi_{son}(F_{son})$;
[0046] $\hat{F}_{veh} = \phi_{veh}(F_{veh})$;
[0047] where $\phi_{img}$, $\phi_{pc}$, $\phi_{son}$, and $\phi_{veh}$ denote the alignment operations for the image features, point cloud features, sonar features, and vehicle state features, respectively, and $\hat{F}_{img}$, $\hat{F}_{pc}$, $\hat{F}_{son}$, and $\hat{F}_{veh}$ denote the aligned image features, point cloud features, sonar features, and vehicle state features. Each alignment operation consists of fully connected layers, 1×1 convolutions, layer normalization, and nonlinear activation functions, and maps features from different modalities to a latent representation space of the same dimension. The aligned features are not only more uniform in numerical scale but also facilitate subsequent cross-modal interactions and joint modeling.
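As an illustration of the alignment operation, the sketch below maps feature vectors of different lengths into a shared dimension using one fully connected layer, layer normalization, and a ReLU activation. It is a simplified stand-in under stated assumptions: the 1×1 convolution is omitted, the weights are fixed by hand rather than learned, and all function names are hypothetical.

```python
import math

def linear(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def align(x, W, b):
    """Alignment sketch: linear map -> layer normalization -> ReLU."""
    return relu(layer_norm(linear(x, W, b)))

# Hypothetical 3-dim image feature and 4-dim sonar feature, both mapped to 2 dims.
img_aligned = align([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
son_aligned = align([0.5, 0.1, 0.9, 0.3], [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]], [0.0, 0.0])
```

After this step both modalities have the same length and a comparable numerical scale, which is what the subsequent weighted summation requires.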
[0048] To highlight key modal information that contributes more to the understanding of the current environment, a channel attention mechanism is introduced to adaptively weight the features of each modality. The weight coefficient of each modality is calculated as follows:
[0049] $\alpha_{img} = \sigma(W_{img} \hat{F}_{img})$;
[0050] $\alpha_{pc} = \sigma(W_{pc} \hat{F}_{pc})$;
[0051] $\alpha_{son} = \sigma(W_{son} \hat{F}_{son})$;
[0052] $\alpha_{veh} = \sigma(W_{veh} \hat{F}_{veh})$;
[0053] where $\sigma$ is the Sigmoid activation function; $W_{img}$, $W_{pc}$, $W_{son}$, and $W_{veh}$ are learnable parameter matrices; and $\hat{F}_{img}$, $\hat{F}_{pc}$, $\hat{F}_{son}$, and $\hat{F}_{veh}$ denote the aligned image features, point cloud features, sonar features, and vehicle state features. The weights of each modality are dynamically adjusted based on the current scene. For example, in land areas, image and LiDAR features typically have a higher weight, while in murky water or nighttime environments, sonar information becomes significantly more important. After attention weighting, the enhanced multimodal fusion feature $F_{fuse}$ is obtained as follows:
[0054] $F_{fuse} = \alpha_{img} \odot \hat{F}_{img} + \alpha_{pc} \odot \hat{F}_{pc} + \alpha_{son} \odot \hat{F}_{son} + \alpha_{veh} \odot \hat{F}_{veh}$;
[0055] where $\odot$ denotes element-wise multiplication. The weighted summation of features from different modalities highlights key information and suppresses invalid noise, thereby enhancing the discriminative ability of the fused features.
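The attention weighting and weighted summation can be sketched as a sigmoid gate per modality followed by an element-wise weighted sum. This is a minimal illustration, assuming each weight is a sigmoid of a learnable matrix applied to the aligned feature; the dictionary keys, the zero-valued parameter matrices, and the function names are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_weights(W, f):
    """Per-channel weights: sigmoid of a learnable matrix applied to the feature."""
    return [sigmoid(sum(w * x for w, x in zip(row, f))) for row in W]

def fuse(aligned, params):
    """Element-wise weighted sum of aligned modality features."""
    dim = len(next(iter(aligned.values())))
    fused = [0.0] * dim
    for name, feature in aligned.items():
        alpha = channel_weights(params[name], feature)
        for k in range(dim):
            fused[k] += alpha[k] * feature[k]
    return fused

# Hypothetical aligned features; zero matrices give every channel a weight of 0.5.
aligned = {"img": [1.0, 2.0], "son": [0.5, -1.0]}
params = {"img": [[0.0, 0.0], [0.0, 0.0]], "son": [[0.0, 0.0], [0.0, 0.0]]}
fused = fuse(aligned, params)
```

With trained parameters the gates would differ per scene, for instance raising the sonar weights in turbid water; here the untrained gates simply average the two modalities.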
[0056] To further model the long-range dependencies and cross-modal semantic associations among multimodal features, this invention builds on the attention fusion result and introduces a fusion module composed of a Transformer encoder and a multilayer perceptron (MLP). The multimodal fusion features of consecutive frames are combined into a multimodal fusion feature sequence $X$, and $X$ is input into the fusion module, expressed as:
[0057] $F = \mathrm{MLP}(\mathrm{TransformerEncoder}(X))$;
[0058] where $F$ denotes the fusion vector; $\mathrm{TransformerEncoder}(\cdot)$ is the Transformer encoder, used to learn the global relationships between modalities; and $\mathrm{MLP}(\cdot)$ is the multilayer perceptron, responsible for mapping the fused high-dimensional representation into a unified vector used for environmental understanding. The above fusion method not only retains the unique advantages of each modality, but also explicitly models the deep coupling relationship between "image texture - point cloud geometry - sonar echo - vehicle state", giving the system a stronger ability to represent complex terrain, dynamic obstacles, water and land boundaries, and local environmental changes.
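The fusion module can be illustrated with scaled dot-product self-attention, the core operation of a Transformer encoder, followed by a single linear layer with ReLU. This is a heavily reduced sketch under stated assumptions: identity query, key, and value projections, one head, no residual connections or feed-forward sublayer, and mean pooling over frames; none of these choices is specified in the patent, and all names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(w[j] * seq[j][c] for j in range(len(seq))) for c in range(d)])
    return out

def fusion_vector(seq, W, b):
    """Attend across frames, mean-pool over time, then apply linear + ReLU."""
    attended = self_attention(seq)
    d = len(seq[0])
    pooled = [sum(frame[c] for frame in attended) / len(attended) for c in range(d)]
    return [max(0.0, sum(w * p for w, p in zip(row, pooled)) + bi)
            for row, bi in zip(W, b)]

# Two identical hypothetical frames of a 2-dim fused feature.
sequence = [[1.0, 0.0], [1.0, 0.0]]
F = fusion_vector(sequence, [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [0.0, 0.5, 0.0])
```

A practical implementation would use a trained multi-layer encoder from a deep learning framework; the sketch only shows how a frame sequence is reduced to one fixed-length vector.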
[0059] (3) Based on the fusion vector F, identify obstacles in the current environment and detect whether all obstacles exist in the map. If they exist, proceed to step (4). If they do not exist, perform obstacle avoidance based on deep reinforcement learning and proceed to step (4).
[0060] Before obstacle avoidance, a map is constructed, which mainly contains static obstacles. Based on the map, a driving path from the current location to the target location is planned. During the driving process, obstacles that do not exist in the map may appear in the environment, such as dynamic obstacles or added static obstacles. Therefore, during the driving process based on the driving path, it is necessary to scan the surrounding environment. When an obstacle that does not exist in the map is detected, local obstacle avoidance planning is required for the current road segment.
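The check for obstacles that are absent from the map can be sketched as a distance test against the mapped obstacle positions. This is a minimal illustration, assuming obstacles are reduced to 2D centre points and that a match within a tolerance `tol` counts as "already in the map"; both the representation and the tolerance value are assumptions.

```python
import math

def is_in_map(obstacle, map_obstacles, tol):
    """True if a mapped obstacle lies within tol of the detected one."""
    return any(math.dist(obstacle, m) <= tol for m in map_obstacles)

def new_obstacles(detected, map_obstacles, tol=1.0):
    """Detected obstacles that have no counterpart in the map."""
    return [o for o in detected if not is_in_map(o, map_obstacles, tol)]

map_obstacles = [(0.0, 0.0), (10.0, 5.0)]   # hypothetical static map
detected = [(0.2, 0.1), (4.0, 4.0)]         # hypothetical detections
unmapped = new_obstacles(detected, map_obstacles)
needs_local_avoidance = bool(unmapped)
```

When `needs_local_avoidance` is true, local obstacle avoidance planning is triggered for the current road segment; otherwise the vehicle keeps following the planned path.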
[0061] (4) Continue driving along the planned driving route while performing step (1) until the target location is reached.
[0062] The obstacle information of the current environment, the vehicle status information, and the target pose are used as inputs to the decision network. Let the state at time t be $s_t$, defined as:
[0063] $s_t = \{O_t, V_t, G_t\}$;
[0064] where $O_t$ denotes the information about obstacles in the environment at time t, including their location and size; $V_t$ denotes the vehicle's own status information, including position, speed, heading angle, acceleration, and attitude; and $G_t$ denotes the target point. The target point is determined as follows: a series of aiming points is set on the planned driving path; the newly added obstacle farthest from the current position is read, and the nearest aiming point behind that obstacle is determined; it is then checked whether the distance between this aiming point and the newly added obstacle is greater than the set safe distance threshold. If so, this aiming point is taken as the target point; if not, the next aiming point after it is taken as the target point.
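The target point selection rule can be sketched as follows. This is one possible reading, assuming aiming points are stored in driving order as 2D points and that "behind the obstacle" means the aiming point immediately after the one closest to the obstacle; that interpretation, the function name, and the sample coordinates are assumptions.

```python
import math

def select_target(aiming_points, new_obstacles, current_pos, safe_dist):
    """Pick the local planning target from the ordered aiming points."""
    # Newly added obstacle farthest from the current position.
    obstacle = max(new_obstacles, key=lambda o: math.dist(o, current_pos))
    # Assumption: the obstacle's place along the path is that of its closest
    # aiming point, and "behind" means later in driving order.
    closest = min(range(len(aiming_points)),
                  key=lambda i: math.dist(aiming_points[i], obstacle))
    last = len(aiming_points) - 1
    idx = min(closest + 1, last)
    if math.dist(aiming_points[idx], obstacle) > safe_dist:
        return aiming_points[idx]
    # Too close to the obstacle: fall back to the next aiming point.
    return aiming_points[min(idx + 1, last)]

aiming_points = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0), (8.0, 0.0)]
obstacles = [(1.9, 0.5), (3.8, 0.2)]
target = select_target(aiming_points, obstacles, (0.0, 0.0), safe_dist=1.5)
```

Here the obstacle at (3.8, 0.2) is the farthest one, the aiming point (6.0, 0.0) lies more than 1.5 m beyond it, and it therefore becomes the target of the local avoidance plan.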
[0065] Based on the current state $s_t$, the policy network outputs the control action $a_t$, expressed as:
[0066] $a_t = \pi_\theta(s_t)$;
[0067] where $\theta$ denotes the current parameters of the policy network, and the control action $a_t$ includes steering angle, propulsion speed, acceleration, and braking amount. For continuous control tasks, deep reinforcement learning algorithms such as DDPG, TD3, SAC, or PPO can be used for training, enabling the agent to learn the optimal control strategy that balances safety and efficiency in complex dynamic environments.
[0068] Based on the current environment and mission objective, the optimal obstacle avoidance path sequence from the current location to the target point is searched for. Let the path sequence be $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ denotes the i-th trajectory point in the path sequence. The optimal obstacle avoidance path $P^*$ can then be expressed as:
[0069] $P^* = \arg\min_{P} J(P)$;
[0070] where $J(P)$ denotes the cost function under the obstacle constraints of the current environment. To ensure that the planning results balance safety, smoothness, energy efficiency, and passage efficiency, the specific cost function is as follows:
[0071] $J(P) = \lambda_1 C_{obs} + \lambda_2 C_{smooth} + \lambda_3 C_{energy}$;
[0072] where $\lambda_1$, $\lambda_2$, and $\lambda_3$ denote weights; $C_{obs}$ denotes the obstacle avoidance cost: within the safe distance, the closer the path is to an obstacle, the more rapidly $C_{obs}$ increases; $C_{smooth}$ denotes the path smoothness cost: the greater the curvature of the path, the larger $C_{smooth}$; and $C_{energy}$ denotes the energy cost: the longer the path, the larger $C_{energy}$.
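The weighted cost can be made concrete with simple stand-ins for each term. This is a sketch under stated assumptions: the obstacle term uses an inverse-distance penalty inside the safe distance, the smoothness term sums squared heading changes as a proxy for curvature, and the energy term is the path length; these specific formulas, the default weights, and the function names are illustrative and do not come from the patent.

```python
import math

def obstacle_cost(path, obstacles, safe_dist):
    """Grows rapidly as a path point approaches an obstacle inside safe_dist."""
    cost = 0.0
    for p in path:
        for o in obstacles:
            d = math.dist(p, o)
            if d < safe_dist:
                cost += (1.0 / max(d, 1e-6) - 1.0 / safe_dist) ** 2
    return cost

def smoothness_cost(path):
    """Sum of squared heading changes between successive segments."""
    cost = 0.0
    for a, b, c in zip(path, path[1:], path[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        turn = math.atan2(math.sin(h2 - h1), math.cos(h2 - h1))  # wrap to [-pi, pi]
        cost += turn ** 2
    return cost

def energy_cost(path):
    """Path length as a proxy for energy consumption."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def path_cost(path, obstacles, safe_dist, lambdas=(1.0, 1.0, 1.0)):
    l1, l2, l3 = lambdas
    return (l1 * obstacle_cost(path, obstacles, safe_dist)
            + l2 * smoothness_cost(path)
            + l3 * energy_cost(path))

straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
detour = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
best = min([straight, detour], key=lambda p: path_cost(p, [(1.0, 0.0)], 0.5))
```

Taking the minimum over candidate paths mirrors the selection of the lowest-cost path: with an obstacle sitting on the straight line, the longer and less smooth detour still wins because the obstacle term dominates.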
[0073] Correspondingly, in the reinforcement learning framework, the cost function described above can also be transformed into an immediate reward function, and the optimal policy is learned by maximizing the cumulative reward. The immediate reward, denoted $r_t$, is expressed as follows:
[0074] $r_t = -(\lambda_1 C_{obs} + \lambda_2 C_{smooth} + \lambda_3 C_{energy}) + \lambda_4 R_{goal}$;
[0075] where $\lambda_1$ to $\lambda_4$ denote weights; $C_{obs}$, $C_{smooth}$, and $C_{energy}$ are the obstacle avoidance, path smoothness, and energy costs; and $R_{goal}$ denotes the positive reward given for approaching the target point: the closer to the target point, the larger $R_{goal}$. This reward helps the agent learn comprehensive decision-making strategies such as "avoiding obstacles, moving towards the goal, moving smoothly, and reducing energy consumption" during training.
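The conversion of the cost into an immediate reward can be sketched as the negative weighted cost plus a goal term. This is a minimal illustration, assuming the goal term is 1 / (1 + distance to target) so that it grows as the vehicle approaches the target; that shaping, the default weights, and the function name are assumptions rather than values from the patent.

```python
def immediate_reward(c_obs, c_smooth, c_energy, dist_to_target,
                     lambdas=(1.0, 0.5, 0.1, 2.0)):
    """Negative weighted cost plus a positive term for being near the target."""
    l1, l2, l3, l4 = lambdas
    r_goal = 1.0 / (1.0 + dist_to_target)  # larger when closer to the target
    return -(l1 * c_obs + l2 * c_smooth + l3 * c_energy) + l4 * r_goal

r_far = immediate_reward(0.0, 0.0, 0.0, dist_to_target=9.0)
r_near = immediate_reward(0.0, 0.0, 0.0, dist_to_target=0.0)
r_costly = immediate_reward(1.0, 2.0, 10.0, dist_to_target=1.0)
```

With equal costs, the step that ends closer to the target receives the larger reward, while proximity to obstacles, sharp turns, and long paths all pull the reward down.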
[0076] To address the unique application scenarios faced by amphibious vehicles, this invention further incorporates an environmental pattern recognition and navigation strategy switching mechanism. First, based on fused features, it determines whether the current area belongs to land, water, or a transitional zone. Then, it selects appropriate planning parameters and control constraints. For example, in land areas, greater emphasis is placed on terrain slope, tire adhesion conditions, and obstacle avoidance; in water, greater attention is paid to heading stability, water depth changes, current disturbances, and buoyancy control; and in transitional zones, both shoreline passability, attitude stability, and mode switching safety must be considered simultaneously. This mechanism enables amphibious vehicles to automatically adjust their navigation strategies according to environmental changes, thereby improving their passability and robustness in complex scenarios.
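The strategy switching mechanism can be sketched as a region classifier that selects a parameter profile. This is an illustration only: the scalar `water_fraction` is a hypothetical stand-in for a classifier applied to the fused features, and every threshold and profile value below is a placeholder, not a figure from the patent.

```python
# Hypothetical planning profiles; every number here is a placeholder.
PROFILES = {
    "land": {
        "max_speed": 8.0, "safe_dist": 2.0,
        "checks": ("terrain_slope", "tire_adhesion", "obstacle_avoidance"),
    },
    "water": {
        "max_speed": 4.0, "safe_dist": 4.0,
        "checks": ("heading_stability", "water_depth", "current", "buoyancy"),
    },
    "transition": {
        "max_speed": 1.5, "safe_dist": 3.0,
        "checks": ("shoreline_passability", "attitude_stability", "mode_switch_safety"),
    },
}

def classify_region(water_fraction, low=0.2, high=0.8):
    """Map an estimated share of water in view to a region label."""
    if water_fraction < low:
        return "land"
    if water_fraction > high:
        return "water"
    return "transition"

def select_profile(water_fraction):
    """Return the region label and the planning profile that goes with it."""
    region = classify_region(water_fraction)
    return region, PROFILES[region]

region, profile = select_profile(0.5)
```

Keeping the profiles in a table separates the recognition step from the planning constraints, so the planner can reload its limits whenever the recognized region changes.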
[0077] The planned obstacle avoidance path and control commands are output to the amphibious vehicle's execution unit. The control unit then drives the vehicle to perform steering, acceleration, deceleration, and braking operations according to the commands, achieving autonomous obstacle avoidance. The system monitors the vehicle's performance in real time and adjusts the control strategy through sensor feedback in a closed loop to ensure the actual trajectory matches the planned path. In case of emergencies (such as sudden obstacle collisions or system malfunctions), the system can automatically switch to a safety mode or allow remote manual intervention. All operational data and decision-making processes are recorded for subsequent analysis and continuous model optimization, enabling the system's self-evolution and performance improvement.
[0078] The path avoidance method based on multimodal deep learning proposed in this invention significantly improves the autonomous navigation and obstacle avoidance capabilities of amphibious vehicles in complex and variable environments. Compared with traditional single-sensor or shallow learning methods, this invention achieves comprehensive perception and accurate identification of land-water boundaries, complex terrain, and dynamic obstacles by fusing data from multiple sources such as vision, lidar, and sonar. The introduction of multimodal feature fusion and deep learning technology gives the system stronger environmental understanding and adaptability, maintaining high robustness under extreme conditions such as severe weather, changing lighting, and water reflection. Simultaneously, the path planning decision based on deep reinforcement learning effectively balances multiple factors such as safety, path smoothness, and energy consumption, improving the amphibious vehicle's driving efficiency and mission completion rate.
[0079] The present invention has been described by way of example. Obviously, the specific implementation of the present invention is not limited to the above-described manner. Any non-substantial improvements made using the inventive concept and technical solution of the present invention, or the direct application of the inventive concept and technical solution of the present invention to other occasions without modification, are all within the protection scope of the present invention.
Claims
1. A path obstacle avoidance method based on multimodal deep learning, characterized in that, The method is as follows: (1) Collect multimodal environmental perception information and perform time and space alignment on the multimodal environmental perception information; (2) Extract environmental features from the environmental perception information of each modality, fuse the environmental features of each modality, and output the fusion vector; (3) Based on the fusion vector, identify obstacles in the current environment and detect whether all obstacles exist in the map. If they exist, execute step (4). If they do not exist, perform obstacle avoidance based on deep reinforcement learning and then execute step (4). (4) Continue driving along the planned driving route and perform step (1) until you reach the target location.
2. The path obstacle avoidance method based on multimodal deep learning as described in claim 1, characterized in that, Environmental perception information includes: environmental images captured by cameras, environmental point clouds captured by lidar, and sonar data captured by multibeam sonar.
3. The path obstacle avoidance method based on multimodal deep learning as described in claim 2, characterized in that, the environmental image is input into a ResNet network to extract image features $F_{img}$; the environmental point cloud is input into a PointNet++ network to extract point cloud features $F_{pc}$; and the sonar data is input into a one-dimensional convolutional neural network to extract sonar features $F_{son}$.
4. The path obstacle avoidance method based on multimodal deep learning as described in claim 1, characterized in that, The inertial measurement unit (IMU) measures the vehicle's acceleration, angular velocity, and attitude changes in real time during its motion, while the global positioning system (GPS) collects the vehicle's global position and speed information.
5. The path obstacle avoidance method based on multimodal deep learning as described in claim 4, characterized in that, the acceleration, angular velocity, attitude changes, global position information, and velocity information measured during motion are input into a multi-layer fully connected network or a lightweight temporal network to extract the current vehicle state features $F_{veh}$.
6. The path obstacle avoidance method based on multimodal deep learning as described in claim 1, characterized in that, the process of obtaining the fusion vector is as follows: (21) Align the image features $F_{img}$, point cloud features $F_{pc}$, sonar features $F_{son}$, and vehicle state features $F_{veh}$, giving the aligned features $\hat{F}_{img}$, $\hat{F}_{pc}$, $\hat{F}_{son}$, and $\hat{F}_{veh}$; (22) Calculate the weight of each modality feature through the channel attention mechanism, namely the weights $\alpha_{img}$, $\alpha_{pc}$, $\alpha_{son}$, and $\alpha_{veh}$ of the image features, point cloud features, sonar features, and vehicle state features; (23) Based on these weights, perform a weighted summation of the aligned image features, point cloud features, sonar features, and vehicle state features to form the multimodal fusion feature $F_{fuse}$; (24) Combine the multimodal fusion features of multiple consecutive frames into a multimodal fusion feature sequence $X$, and input $X$ into the fusion module, which outputs the fusion vector; the fusion module consists of a Transformer encoder and a multilayer perceptron (MLP).
7. The path obstacle avoidance method based on multimodal deep learning as described in claim 4, characterized in that, the multimodal fusion feature $F_{fuse}$ is computed as follows: $F_{fuse} = \alpha_{img} \odot \hat{F}_{img} + \alpha_{pc} \odot \hat{F}_{pc} + \alpha_{son} \odot \hat{F}_{son} + \alpha_{veh} \odot \hat{F}_{veh}$, where $\odot$ denotes element-wise multiplication.
8. The path obstacle avoidance method based on multimodal deep learning as described in claim 1, characterized in that, the optimal obstacle avoidance path from the current position to the target point is determined based on deep reinforcement learning. The method for determining the target point is as follows: a series of aiming points is set on the planned driving path; the new obstacle farthest from the current position is read, and the nearest aiming point behind that obstacle is determined; it is checked whether the distance between this aiming point and the new obstacle is greater than the set safe distance threshold. If so, this aiming point is taken as the target point; if not, the next aiming point after it is taken as the target point.
9. The path obstacle avoidance method based on multimodal deep learning as described in claim 1, characterized in that, the immediate reward $r_t$ in deep reinforcement learning is expressed as follows: $r_t = -(\lambda_1 C_{obs} + \lambda_2 C_{smooth} + \lambda_3 C_{energy}) + \lambda_4 R_{goal}$; where $\lambda_1$ to $\lambda_4$ denote weights; $C_{obs}$ denotes the obstacle avoidance cost: within the safe distance, the closer the path is to an obstacle, the more rapidly $C_{obs}$ increases; $C_{smooth}$ denotes the path smoothness cost: the greater the curvature of the path, the larger $C_{smooth}$; $C_{energy}$ denotes the energy cost: the longer the path, the larger $C_{energy}$; and $R_{goal}$ denotes the positive reward given for approaching the target point: the closer to the target point, the larger $R_{goal}$.