Ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation

By employing a multimodal training method that combines boundary adversarial enhancement and uncertainty distillation, the robustness and adaptability issues of multimodal fusion in intelligent ships are addressed. This method enables robust learning and adaptive decision-making in complex maritime environments, thereby improving the robustness and predictive stability of the model.

CN122309922A · Pending · Publication Date: 2026-06-30 · SHANGHAI MARITIME UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI MARITIME UNIVERSITY
Filing Date
2026-03-31
Publication Date
2026-06-30

Figure CN122309922A_ABST

Abstract

This invention discloses a ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation. It collects heterogeneous sensor data from the entire ship domain and constructs a multimodal dataset through spatiotemporal synchronization. The data undergoes distortion correction, normalization, and filtering to generate standardized feature tensors and construct a modal availability statistical model. A dual-channel guided learning network containing a dominant channel and an adaptive channel is constructed. A shared cross-modal attention module is used for feature mapping, and an environment-driven gating mechanism is used to dynamically adjust modal weights. A dynamic memory matrix is constructed based on a moving average algorithm to quantify the stability of modal combinations and output a reweighted sample index. Boundary hard examples are constructed based on the weak modality index, and the distillation temperature coefficient is dynamically calculated using an uncertainty metric. Network parameters are updated synchronously through backpropagation. The model performance is tested under various sea states. During deployment, an online monitoring mechanism is introduced to detect input distribution drift and trigger fine-tuning, while the adaptive channel is retained for online inference. This invention effectively addresses sensor failure, low visibility, signal interference, and modal imbalance issues through joint modeling and missing modality statistical simulation, thereby improving the robustness of ship perception.

Description

Technical Field

[0001] This invention relates to the field of intelligent ships and autonomous maritime navigation technology, and in particular to a multimodal training method for ships based on boundary adversarial enhancement and uncertainty distillation.

Background Technology

[0002] With the rapid development of intelligent ships and autonomous navigation technologies, shipborne multimodal sensors (including IMU, GPS, radar, sonar, visual cameras, lidar, and AIS) are widely used in perception, localization, and environmental understanding tasks. However, most existing multimodal fusion methods are optimized only for single tasks or specific sensor combinations, lacking a systematic design for time synchronization, spatial calibration, and metadata association between multimodal sensors. Due to the significant dynamism and complexity of the marine environment, sensors are susceptible to factors such as weather, waves, and obstruction, leading to problems such as data temporal drift, information mismatch, and modal missingness, thereby reducing the robustness of perception and decision-making systems. This makes it difficult for existing models to achieve stable feature fusion and reliable scene perception when faced with multi-source heterogeneous information.

[0003] Traditional multimodal learning frameworks often rely on fixed modal inputs and a unified learning structure. When some modalities fail or noise increases, model performance degrades significantly. Furthermore, existing knowledge distillation and transfer learning methods often assume consistent training and inference environments, failing to adequately consider the variations in modal availability and environmental uncertainties in maritime scenarios. In this context, the model lacks dynamic adjustment mechanisms to adapt to modal absence and environmental changes, resulting in insufficient generalization ability, poor adaptability, and limited capacity to handle weak modalities and anomalous samples in practical deployments.

[0004] Current model training and validation processes in the field of intelligent ships generally lack dynamic feedback mechanisms and uncertainty measurement. Traditional distillation methods use fixed temperature parameters and static sample sampling strategies, making it difficult to adaptively adjust according to environmental changes, modal reliability, and sample complexity. Meanwhile, although adversarial training can improve model robustness, the lack of physical constraints and environmental consistency regularization often leads to overfitting or feature drift, weakening the model's reliability under real sea conditions.

[0005] Therefore, developing a multimodal dual-channel guided learning network architecture that achieves robust learning and safe decision-making on multimodal data, through collaborative distillation between the dominant and adaptive channels, boundary constraints, and uncertainty-driven mechanisms, is an important problem that urgently needs to be solved.

Summary of the Invention

[0006] To address the aforementioned problems in existing technologies, the purpose of this invention is to provide a ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation. The method described in this application introduces an adversarial mechanism into a dual-channel guided architecture to enhance the diversity of modal features, guiding the model to focus on features in the discrimination boundary region. Simultaneously, it constructs an adaptive distillation temperature adjustment strategy to improve adaptability to different sample uncertainty distributions, thereby achieving robust multimodal learning under various combinations of missing modalities. This invention not only improves the practicality and reliability of the model in complex scenarios but also lays a key technological foundation for the practical deployment and promotion of multimodal artificial intelligence systems.

[0007] To address the aforementioned problems, this invention employs the following technical solution: a ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation. This method, based on the operating environment of an intelligent ship, deploys multiple types of sensors to collect multimodal data. The method includes the following steps:

[0008] S1. Intelligent ship multimodal data acquisition and metadata construction: Collect raw data and operational status parameters from heterogeneous sensors across the entire ship domain. After eliminating spatiotemporal deviations through clock synchronization and spatial calibration, construct a multimodal dataset with spatiotemporal consistency through metadata encapsulation.

[0009] S2. Data preprocessing and modality loss statistical modeling: Based on the data collected in step S1, distortion correction, normalization and filtering are performed on visual, radar, sonar and time series data to generate standardized feature tensors. At the same time, the signal integrity is analyzed to construct a modal availability statistical model and output a set of quantitative modality loss distribution parameters.

[0010] S3. Construction of a multimodal dual-channel guided learning network: Receive the standardized feature tensor, missing mode distribution parameter set, and metadata from step S2. Construct a dual-channel guided learning network structure that includes a dominant channel for performing full-modal high-confidence feature extraction and an adaptive channel for simulating missing scenarios. Utilize a shared cross-modal attention module for feature mapping and compensation, and dynamically adjust modal weights using an environment-driven gating mechanism.

[0011] S4. Memory matrix construction and multiple modal identification: During the training process in step S3, multidimensional state observation data and performance feedback signals are received. A dynamic memory matrix is constructed based on the moving average algorithm to quantify the stability of each modal combination. Weak regions are marked by threshold retrieval and inverse performance weights are calculated. The reweighted sample index is output to adjust the subsequent sampling strategy.

[0012] S5. Boundary adversarial feature attack and uncertainty-driven temperature multi-task distillation: Based on the weak modality index, directional perturbation or linear interpolation is applied to the adaptive channel fusion layer to construct boundary hard example samples. At the same time, the uncertainty metric value output by the dominant channel is used to dynamically calculate the distillation temperature coefficient to scale the KL divergence or mean square error between the two channels. A multi-task joint objective function including adversarial loss and variable temperature distillation loss is constructed. The network parameters are updated synchronously through backpropagation, and the optimized final model parameters, environment adaptive robust fusion feature vector and convergence state log are output.

[0013] S6. Validation, evaluation and model deployment: After completing the distillation training, the model recall, false negative rate and response latency are tested under various sea conditions and modal combinations to quantify reliability. During the deployment phase, an online monitoring mechanism is introduced to detect input distribution drift to trigger fine-tuning, and the adaptive channel network is retained to perform online inference.

[0014] Further, step S1 specifically involves: receiving raw observation data streams from the ship's global heterogeneous sensor array, as well as multi-dimensional operational status parameters; simultaneously acquiring clock synchronization references and spatial calibration reference information; employing a high-precision clock synchronization mechanism to eliminate timing deviations in data acquisition from each node; and performing spatial calibration operations to unify the spatial reference systems of different sensors, thereby establishing a spatiotemporal reference for the multimodal data; and using metadata encapsulation technology to structurally map the raw observation signals and multi-dimensional operational status parameters to form a multimodal dataset with spatiotemporal consistency and semantic integrity, specifically including the following steps;

[0015] S11. Deployment and data acquisition of multiple sensors for intelligent ships: To meet the perception needs in the operating environment of intelligent ships, the deployment, calibration and time synchronization of multiple types of sensors are completed; through reasonable layout, the surrounding environment of the ship is monitored and multi-dimensional data is collected.

[0016] Based on the above multi-sensor setup and standardized data acquisition, the full modality set is denoted $\mathcal{M} = \{1, \dots, M\}$. The observation of the $m$-th modality at time $t$ is denoted $x_m(t)$. Each record also stores a metadata vector $c(t)$, whose typical components include GPS position, platform attitude, sea state indicators (class, visibility, wave height), timestamp, and sensor health. Spatial coordinates and camera / sensor calibration are represented using extrinsic and intrinsic parameter matrices. The rigid transformation from world coordinates $X_w$ to camera coordinates $X_c$, with extrinsic rotation $R$ and translation $T$, is as follows:

[0017] $$X_c = R\,X_w + T$$

[0018] The camera pixel projection (homogeneous representation) is as follows:

[0019] $$s\,[u, v, 1]^\top = K\,X_c$$

[0020] where $K$ is the camera intrinsic parameter matrix, $[u, v, 1]^\top$ represents homogeneous pixel coordinates, and $s$ is the projective scale. Marine radar polar coordinates $(r, \theta)$ are converted to planar coordinates by $x = r\cos\theta$, $y = r\sin\theta$. A reference time $t_0$ is selected, and the low / high frame rate modalities are aligned to it using linear / higher-order interpolation between adjacent frames at times $t_1 \le t_0 \le t_2$. The linear interpolation form is as follows:

[0021] $$x_m(t_0) = x_m(t_1) + \frac{t_0 - t_1}{t_2 - t_1}\,\bigl(x_m(t_2) - x_m(t_1)\bigr)$$

[0022] For each modality, a signal-to-noise ratio (SNR) and a health score are defined, specifically expressed as follows:

[0023] $$\mathrm{SNR}_m = 10\log_{10}\frac{P_{\mathrm{signal},m}}{P_{\mathrm{noise},m}}, \qquad h_m = g(\mathrm{SNR}_m)$$

[0024] where $g(\cdot)$ is a mapping that normalizes the health level to $(0, 1)$;
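The temporal alignment described in S11 can be illustrated with a minimal Python sketch. It linearly interpolates one scalar sensor stream onto a reference instant. The function name `align_linear`, the sample values, and the choice to clamp outside the recorded range are assumptions made for illustration and are not taken from the application.

```python
from bisect import bisect_right

def align_linear(timestamps, values, t_ref):
    """Linearly interpolate a scalar sensor stream at reference time t_ref.

    timestamps must be strictly increasing. Outside the recorded range the
    nearest sample is returned (clamping is an assumption of this sketch).
    """
    if t_ref <= timestamps[0]:
        return values[0]
    if t_ref >= timestamps[-1]:
        return values[-1]
    j = bisect_right(timestamps, t_ref)   # first index whose timestamp > t_ref
    t1, t2 = timestamps[j - 1], timestamps[j]
    w = (t_ref - t1) / (t2 - t1)          # fractional position between frames
    return values[j - 1] + w * (values[j] - values[j - 1])

# A 1 Hz stream sampled at a reference instant halfway between two frames.
gps_t = [0.0, 1.0, 2.0]
gps_x = [10.0, 12.0, 16.0]
aligned = align_linear(gps_t, gps_x, 1.5)
```

In practice every modality would be resampled onto the same reference clock before the metadata record is written; vector-valued streams can be handled by applying the same weight to each component.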

[0025] S12. Statistical modeling is performed on sensor dropouts in the multimodal data from step S11. The model is used during the training phase to simulate the modality loss that may occur in actual maritime environments. First, for each modality $m$, a modal availability indicator variable $a_m \in \{0, 1\}$ is defined, where $a_m = 1$ indicates that the modality is available at the current moment and $a_m = 0$ indicates that it is missing. A conditional probability model based on sea state is then established:

[0026] $$P(a_m = 1 \mid c) = \sigma\!\left(w_m^\top c + b_m\right)$$

[0027] where the state vector $c$ represents metadata including sea state (wind speed, wave height, visibility), time (day / night, season), and platform attitude (pitch angle, roll angle, heading angle), and $\sigma(\cdot)$ is the sigmoid function; the coefficients $w_m$ and $b_m$ are learnable parameters obtained through maximum likelihood estimation on historical logs; the model is used to generate the sampling distribution for missing modalities;
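A minimal sketch of such a sea-state-conditioned availability model is given below, assuming the logistic-regression form that the description names for conditional sampling. The coefficient values, the two-component context vector, and the function names are hypothetical; in the described method the coefficients would be fitted to historical logs.

```python
import math
import random

def availability_prob(weights, bias, context):
    """P(modality available | context) under a logistic model."""
    score = bias + sum(w * c for w, c in zip(weights, context))
    return 1.0 / (1.0 + math.exp(-score))

def sample_availability(params, context, rng):
    """Draw one 0/1 availability flag per modality."""
    return {name: int(rng.random() < availability_prob(w, b, context))
            for name, (w, b) in params.items()}

# Hypothetical coefficients; context = (wave height in m, visibility in km).
params = {
    "camera": ([-0.5, 0.8], -1.0),   # worse with waves, better with visibility
    "radar": ([-0.1, 0.0], 3.0),     # almost always available
}
clear = availability_prob(*params["camera"], (0.5, 10.0))
fog = availability_prob(*params["camera"], (0.5, 0.2))
mask = sample_availability(params, (0.5, 0.2), random.Random(0))
```

With these illustrative numbers the camera is far more likely to be marked available in clear conditions than in fog, which is the behaviour the training-time simulation is meant to reproduce.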

[0028] S13. For each modality in step S12, a preprocessing and feature transformation process is adopted to ensure consistency of subsequent model inputs. For the visual (RGB / IR) modality, distortion correction, color normalization, and local contrast adaptation are first performed. The standardization formula is $\hat{x} = (x - \mu_x) / \sigma_x$, where $\mu_x$ and $\sigma_x$ are the mean and standard deviation. For optical degradation modeling, a fog / haze imaging model is used to synthesize training samples. A commonly used atmospheric scattering model is written as follows:

[0029] $$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)}$$

[0030] where $I$ is the degraded image, $J$ is the clear image, $A$ is the ambient light, and the transmittance $t(x)$ is determined by the distance $d(x)$ and the attenuation coefficient $\beta$. Based on this model, training samples under different visibility conditions are generated and used for robust training;
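The fog synthesis step can be sketched as follows, using the standard atmospheric scattering form I = J·t + A·(1 − t) with t = exp(−β·d). The pixel intensities, distances, attenuation coefficient, and ambient-light value are illustrative assumptions, and a real pipeline would operate on image arrays rather than Python lists.

```python
import math

def add_fog(clear_pixels, depths_m, beta, airlight):
    """Apply I = J*t + A*(1 - t) with t = exp(-beta * d) to each pixel."""
    foggy = []
    for j, d in zip(clear_pixels, depths_m):
        t = math.exp(-beta * d)           # transmittance falls with distance
        foggy.append(j * t + airlight * (1.0 - t))
    return foggy

# Grayscale intensities in [0, 1]; distances in metres (illustrative values).
clear = [0.2, 0.2, 0.2]
depth = [0.0, 100.0, 1000.0]
foggy = add_fog(clear, depth, beta=0.01, airlight=0.9)
```

Sweeping `beta` over a range of values yields training samples for different visibility levels: nearby pixels stay close to the clear image while distant pixels converge to the ambient light.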

[0031] Radar / sonar modalities typically undergo logarithmic amplitude transformation and normalization to compress the dynamic range. Acoustic / sonar signals are converted into time-frequency spectra (matrices) using the STFT and input into the convolutional network. The STFT expression is:

[0032] $$X(\tau, \omega) = \sum_{n} x[n]\,w[n - \tau]\,e^{-j\omega n}$$

[0033] where $w[\cdot]$ is the window function. Low-pass or exponential smoothing filters are used to de-jitter the time-series modalities. The exponential smoothing formula is $s_t = \alpha x_t + (1 - \alpha)\,s_{t-1}$, with the coefficient $\alpha$ chosen to balance smoothness and responsiveness. In terms of annotation, a unified annotation standard is defined for each task: detection is represented by bounding boxes with category labels, and pixel-level segmentation is represented by masks. Consistency of annotations is quantified using IoU and Cohen's kappa:

[0034] $$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

[0035] where $p_o$ is the observed agreement rate and $p_e$ is the chance agreement rate. The confidence scores of multiple annotators are aggregated using simple voting or probabilistic fusion. Given the set of labels assigned to a sample by the annotators, the label confidence score is defined as follows:

[0036]

[0037] Alternatively, weighted voting can be employed. Regions with strong radar / sonar response can be used to generate visual candidate boxes, which are then provided for human confirmation; this is formalized as conditional probability propagation with maximum a posteriori estimation. In terms of data augmentation, in addition to the fog model, a physics-driven water droplet / splash occlusion generator is designed: it generates a spatial occlusion mask $\alpha(x)$ (a local alpha channel) and simulates observations by alpha blending the image $I$ with an occluder layer $O$:

[0038] $$I'(x) = \bigl(1 - \alpha(x)\bigr)\,I(x) + \alpha(x)\,O(x)$$

[0039] All preprocessing and enhancement processes record their transformation parameters for reversibility or traceability, and the annotation and preprocessing results, along with metadata, are written into the sample record.

[0040] Further, in step S2, distortion correction and dehazing operations are performed on the visual data, amplitude normalization is performed on the radar and sonar echo signals, and a low-pass smoothing filter algorithm is applied to the time-series sensor data to generate a standardized multimodal feature tensor. Simultaneously, the mapping relationship between metadata and sensor signal integrity is analyzed, a modal availability statistical model is constructed, and a quantized modality loss distribution parameter set is output. This parameter set and the standardized feature tensor serve as input data for the distillation training of the dual-channel guided learning network model. Specifically:

[0041] S21. Based on the multimodal data acquisition and preprocessing completed in step S1, the labeled sample set is split into training / validation / test sets by voyage segment / time / scene. Availability vectors (AC) are introduced to simulate and train on scenarios with missing modalities. For a sample $i$, the masked input is represented as $\tilde{x}_i = a_i \odot x_i$;

[0042] The original multimodal input of the i-th sample, $x_i = \{x_i^{(1)}, \dots, x_i^{(M)}\}$, is usually formed by concatenating or collecting the features of multiple modalities.

[0043] Here $x_i^{(m)}$ represents the input feature of the i-th sample in the m-th modality, and M represents the total number of modalities (such as vision, radar, sonar, AIS, etc.).

[0044] The modality availability vector (AC) $a_i = (a_i^{(1)}, \dots, a_i^{(M)}) \in \{0, 1\}^M$ describes whether each modality is available, where $a_i^{(m)} = 1$ indicates that the m-th modality is available and $a_i^{(m)} = 0$ indicates that it is missing.

[0045] The operator $\odot$ denotes per-modality masking, that is, a structured modality-level mask rather than an element-wise one.

[0046] S22. To handle missing modalities, three parallel strategies are employed to generate missing-modality combinations during the training phase:

[0047] (1) Empirical independent sampling: for each modality $m$, independent Bernoulli sampling is performed based on its historical availability rate $p_m$: $a_m \sim \mathrm{Bernoulli}(p_m)$;

[0048] (2) Conditional sampling (MAR): $a_m$ is sampled conditioned on sea state / time / attitude information $c$. Logistic regression is often used for fitting:

[0049] $$P(a_m = 1 \mid c) = \sigma\!\left(w_m^\top c + b_m\right)$$

[0050] where $w_m$ and $b_m$ are estimated from historical data and reflect, for example, the loss of optical modalities in foggy weather;

[0051] (3) Worst-case sampling: several critical ACs are selected manually or according to safety requirements, and their sampling ratio in the training samples is increased to enhance the model's robustness in extreme cases. For local spatial occlusion, random fields are used to generate spatial occlusion masks that are applied to visual images:

[0052]

[0053] Here, AC represents a specific modal combination, and the spatial occlusion mask takes values in [0, 1], where 0 means complete occlusion and 1 means no occlusion.

[0054] To facilitate training strategies and resource allocation, the number of samples and the baseline performance of each AC are counted, and a weak modal memory matrix is constructed whose entries store these statistics. The training samples are weighted or resampled to prioritize weak modalities. The weighting function is defined as a linear or smooth function. Linear weighting:

[0055]

[0056] Or exponential resampling probability:

[0057]

[0058] This increases the probability that poorly performing ACs are sampled during training, thereby accelerating performance improvement on them.
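The resampling idea in S22 can be sketched in Python as below. Because the exact weighting formulas are not recoverable from this text, the exponential form (a softmax over negative performance), the sharpness parameter `gamma`, and the baseline performance numbers are all assumptions chosen for illustration.

```python
import math
import random

def resampling_probs(performance, gamma=5.0):
    """Softmax over negative performance: weaker ACs get more probability.

    The exponential form and gamma are assumptions made for illustration.
    """
    scores = {ac: math.exp(-gamma * p) for ac, p in performance.items()}
    total = sum(scores.values())
    return {ac: s / total for ac, s in scores.items()}

# Hypothetical baseline performance per availability combination
# (camera, radar, sonar); 1 = available, 0 = missing.
performance = {(1, 1, 1): 0.92, (0, 1, 1): 0.80, (0, 1, 0): 0.55}
probs = resampling_probs(performance)

# Draw the availability combinations for one training batch.
rng = random.Random(0)
acs = list(probs)
batch = rng.choices(acs, weights=[probs[a] for a in acs], k=8)
```

A larger `gamma` concentrates sampling on the weakest combinations, while `gamma = 0` reduces to uniform sampling over the combinations.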

[0059] Further, step S3 receives the standardized multimodal feature tensor, quantized missing mode distribution parameter set, and metadata from steps S1 and S2; constructs a dual-path network architecture including a dominant channel and an adaptive channel. The dominant channel performs full-modal high-confidence feature extraction, and the adaptive channel simulates missing scenarios based on the missing mode distribution parameter set. A shared cross-modal attention fusion module performs spatial mapping and difference compensation on the dual-channel features. An environment-driven gating mechanism dynamically adjusts the modal weights according to real-time sea conditions, outputting a distillation-trained environment-adaptive robust fusion feature vector and network model parameters. Specifically, the steps include:

[0060] S31. Based on steps S1 and S2, a trainable dataset is obtained and the dual-channel guided learning network is constructed. The dominant channel network is trained with full-modality input, and the adaptive channel network is trained with missing-modality input; through distillation learning, the adaptive channel achieves prediction performance close to that of the dominant channel even when some modalities are missing. For each modality $m$, let its dedicated encoder be $E_m$, with output feature $f_m = E_m(x_m)$;

[0061] S32. To handle missing modalities, a mask gating mechanism with gate values $g_m$, driven by the metadata recorded in S1, is used, where $a_m$ indicates modal availability. The sigmoid activation function is used to suppress interference from missing modalities by dynamically adjusting feature weights. The multimodal fusion feature $z$ is represented as a weighted sum:

[0062] $$z = \sum_{m=1}^{M} g_m\,W_m f_m$$

[0063] where $W_m$ is a matrix (or convolution) that aligns the modal features to a common dimension. To enhance complementary information across modalities, a cross-modal multi-head attention fusion module is used: each modal feature is first mapped to queries, keys, and values $Q$, $K$, $V$, and attention is then calculated over all modal tokens:

[0064] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

[0065] The head outputs are concatenated and linearly transformed to obtain the fused feature, which is then sent to the task head to obtain the logits;

[0066] S33. During training, the gating is adaptively adjusted based on the sample's sea state metadata, which allows higher radar / AIS feature weights under adverse sea conditions. A modality reconstruction auxiliary head is then introduced: a decoder is added to the adaptive channel network to reconstruct missing modal features, and the corresponding reconstruction objective enhances the adaptive channel network's learning of inter-modal correlations. In addition, for complex maritime tasks, dynamic priors (AIS track velocity vectors) are embedded into the attention weight calculation, and a track distance penalty term is added when scoring attention, in order to improve the stability of target trajectory maintenance and prediction.
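The availability-masked, gated weighted sum described in S32 can be sketched as follows. The features are assumed to be already aligned to a common dimension, the gate logits stand in for the output of the environment-driven gating network, and the renormalisation of the gates is an assumption of this sketch rather than something the application states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(features, available, gate_logits):
    """Weighted sum of same-length modality features.

    A missing modality (available == 0) contributes nothing; the remaining
    gates are renormalised so the fused feature keeps a comparable scale
    (the renormalisation is an assumption of this sketch).
    """
    gates = {m: available[m] * sigmoid(gate_logits[m]) for m in features}
    total = sum(gates.values())
    if total == 0.0:
        raise ValueError("no modality available")
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for m, feat in features.items():
        for k in range(dim):
            fused[k] += gates[m] / total * feat[k]
    return fused

features = {"camera": [1.0, 0.0], "radar": [0.0, 1.0]}
gate_logits = {"camera": 0.0, "radar": 0.0}
both = gated_fusion(features, {"camera": 1, "radar": 1}, gate_logits)
fog = gated_fusion(features, {"camera": 0, "radar": 1}, gate_logits)
```

When the camera is masked out, the fused feature falls back entirely on radar. Raising the radar gate logit under adverse sea conditions would shift the weighting in the same direction while both modalities are still present.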

[0067] Further, step S4 specifically involves establishing a memory matrix during the dual-channel guided learning network structure training process to record the performance and sample distribution of different modality combinations; inputting the multidimensional state observation data and performance feedback signals from the training process in step S3, and processing them by constructing a dynamic memory matrix based on a moving average algorithm, mapping the input states and signals to the matrix space for exponentially weighted moving average calculation to quantify the long-term stability of each modality combination in a specific context, and automatically marking weak performance regions through threshold retrieval, thereby calculating inverse performance weights to dynamically adjust the sampling strategy of subsequent training batches, and outputting the updated dynamic memory matrix state, reweighted sample index, and optimized network parameters; specifically,

[0068] S41. Before entering the main training, weak modal combinations are identified through rapid pre-training and a memory matrix is constructed. First, for each AC $a$, a performance metric is defined:

[0069] $$P(a) = \frac{1}{|D_a|}\sum_{i \in D_a} \mathrm{metric}(\hat{y}_i, y_i)$$

[0070] where metric refers to the task evaluation indicator and $D_a$ is the sample set that satisfies AC $a$; if $P(a) < \tau$ (the weak mode threshold), the combination is recorded in the matrix, and the combination weight is defined as:

[0071]

[0072] S42. To avoid excessive bias towards rare combinations, the weights are bounded above. A dynamic memory update strategy is adopted: during training, the performance estimate is updated at fixed epoch intervals and the memory matrix $\mathcal{M}$ is smoothed:

[0073] $$\mathcal{M}_t(a) = \beta\,\mathcal{M}_{t-1}(a) + (1 - \beta)\,\hat{P}_t(a)$$

[0074] where $\beta$ is the attenuation coefficient and $\hat{P}_t(a)$ is the performance estimate for the current training epoch. To improve cross-region generalization ability, the memory matrix is expanded into a three-dimensional tensor that records performance under different sea state conditions, and domain weights are introduced to ensure that data from different sea areas are covered during training.
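The memory update in S42 amounts to an exponentially weighted moving average per modality combination, followed by a threshold check. The sketch below assumes a dictionary keyed by combination name; the decay value, the threshold, and the performance numbers are illustrative only.

```python
def update_memory(memory, epoch_perf, decay=0.9):
    """Exponentially weighted moving average of per-AC performance."""
    for ac, perf in epoch_perf.items():
        prev = memory.get(ac, perf)       # first observation initialises the entry
        memory[ac] = decay * prev + (1.0 - decay) * perf
    return memory

def weak_combinations(memory, threshold):
    """ACs whose smoothed performance falls below the weak threshold."""
    return sorted(ac for ac, perf in memory.items() if perf < threshold)

memory = {}
update_memory(memory, {"cam+radar": 0.90, "radar_only": 0.60})
update_memory(memory, {"cam+radar": 0.80, "radar_only": 0.70})
weak = weak_combinations(memory, threshold=0.75)
```

The three-dimensional extension mentioned in the description could be represented by keying the dictionary on (combination, sea state) pairs instead of on the combination alone.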

[0075] Furthermore, step S5 specifically involves:

[0076] First, based on the weak mode combination index and adversarial boundary feature enhancement mechanism of the input, the original features are subjected to directional perturbation or linear interpolation in the feature fusion layer of the adaptive channel to construct synthetic hard example samples located at the decision boundary to expand the training set and enhance the model's ability to discriminate fuzzy regions.

[0077] Subsequently, the distillation temperature coefficient is dynamically calculated using the uncertainty metric output by the dominant channel, and a positive correlation mapping function between uncertainty and temperature is established. This allows high uncertainty samples to correspond to higher temperature values to smooth their probability distribution and retain more dark knowledge, while low uncertainty samples correspond to lower temperature values to maintain prediction sharpness.

[0078] Finally, a multi-task joint objective function is constructed, which includes adversarial loss terms and variable-temperature distillation loss terms. The variable-temperature distillation loss term uses a dynamic temperature coefficient to scale the KL divergence or mean square error between the adaptive channel and the dominant channel. The dual-channel network parameters are updated synchronously through the backpropagation algorithm. The output consists of the final model parameters of the dual-channel guided learning network after boundary adversarial enhancement and uncertainty-driven temperature multi-task distillation iterative optimization, the environment-adaptive robust fusion feature vector generated by distillation training, and the performance index log that records the convergence status of this training round.

[0079] Furthermore, step S5 specifically includes:

[0080] S51. Using the memory matrix from S4 and the conditional availability model from step S2, modal availability combinations (ACs) are generated for the training phase. For a given combination, the sampling priority is determined by both performance and sample scarcity, and is defined as follows:

[0081]

[0082] Here the priority depends on the moving average performance of the combination and on the number of historical samples for that combination. To balance performance and sample sparsity, the training sampling probability combines this priority with the conditional missing-modality distribution:

[0083]

[0084] Here the conditional term is derived from the conditional probability model of modal availability in step S2. This allows the sampling to closely follow the actual missing-modality statistics while also specifically strengthening weak ACs.

[0085] S52. Boundary samples, i.e., features that lie close to the decision boundary or reduce the original class margin, are generated in the intermediate / fusion feature layer of the adaptive channel network. The fusion-layer output of the adaptive channel network (the higher-order output of either the modality encoders or the shared encoder), denoted $z$, is the target of the adversarial feature enhancement operation. The boundary loss for category $k$ is defined as:

[0086] $$\ell_k(z) = s_y(z) - s_k(z)$$

[0087] where $s_c(z)$ represents the logit of class $c$ and $y$ is the original class. The loss is 0 at the decision boundary, and movement towards a negative value indicates that the sample is approaching or has crossed the target class boundary. At each iteration, the perturbation is updated with a steering factor:

[0088]

[0089] The update uses a step size, a steering factor, and a perturbation projection. Iteration stops when the predicted class becomes the target class or the maximum number of iterations is reached, which yields the boundary enhancement features. To ensure that the perturbation is consistent with the real environment, a regularization term based on S1 clutter statistics is introduced:

[0090]

[0091] This term uses a clutter projection operator and is incorporated into the ABFA objective with a weighting coefficient in order to constrain the direction and magnitude of the perturbation; the clutter statistics are estimated from historical statistics of S1 under similar sea state / sensor conditions. Updates stop when the margin becomes negative or the maximum number of iterations is reached.
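The iterative boundary-seeking perturbation in S52 can be illustrated with a small feature-space sketch. It assumes a linear task head so that the margin gradient is available in closed form, and it uses a plain L2-ball projection in place of the clutter-statistics regularizer; the function names, step size, and radius are hypothetical.

```python
def logits(weights, feature):
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]

def boundary_feature(weights, feature, true_cls, target_cls,
                     step=0.1, max_iter=50, radius=2.0):
    """Push a feature toward the true/target decision boundary.

    margin = logit[true] - logit[target]; for a linear head its gradient
    with respect to the feature is weights[true] - weights[target].
    Stops once the margin is no longer positive or max_iter is reached.
    """
    grad = [a - b for a, b in zip(weights[true_cls], weights[target_cls])]
    norm = sum(g * g for g in grad) ** 0.5
    if norm == 0.0:
        return list(feature)              # no direction separates the classes
    delta = [0.0] * len(feature)
    for _ in range(max_iter):
        z = [f + d for f, d in zip(feature, delta)]
        s = logits(weights, z)
        if s[true_cls] - s[target_cls] <= 0.0:
            break
        delta = [d - step * g / norm for d, g in zip(delta, grad)]
        size = sum(d * d for d in delta) ** 0.5
        if size > radius:                 # project back onto the L2 ball
            delta = [d * radius / size for d in delta]
    return [f + d for f, d in zip(feature, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical 2-class linear head
z0 = [1.0, 0.0]                           # classified as class 0, margin 1.0
z_adv = boundary_feature(W, z0, true_cls=0, target_cls=1)
margin = logits(W, z_adv)[0] - logits(W, z_adv)[1]
```

With a deep network the gradient would come from automatic differentiation instead of the closed form, and the projection step is where an environment-consistency constraint such as the clutter regularizer would act.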

[0092] S53. The distillation temperature is adaptively adjusted sample by sample, based on the confidence / uncertainty of the dominant channel network for each sample. Samples with high uncertainty use higher temperatures, which reduces the dependence of the adaptive channel network on the dominant channel network's high-confidence but potentially noise-affected predictions. The softmax probability of the dominant channel network for sample $i$ is defined as $p_i = \mathrm{softmax}(s_i)$, where $s_i$ are its logits. Uncertainty is defined by entropy:

[0093] $$H(p_i) = -\sum_{k} p_{i,k}\,\log p_{i,k}$$

[0094] At the same time, the distance between the logits and the class centers is used as a supplementary metric. The final temperature mapping is:

[0095]

[0096] The mapping uses a reference temperature and a modulation factor calculated from S1 sea state metadata; the sample-level distillation loss is:

[0097]
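A minimal sketch of uncertainty-driven temperature distillation is shown below. The entropy definition is standard, but the specific temperature mapping (a reference temperature scaled by normalised entropy), its parameters, and the omission of the class-center distance term and sea-state modulation are simplifying assumptions, since the exact mapping is not recoverable from this text.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                    # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample_temperature(teacher_logits, t_base=2.0, scale=2.0):
    """Hypothetical mapping: T = t_base * (1 + scale * H / log K)."""
    h = entropy(softmax(teacher_logits))
    return t_base * (1.0 + scale * h / math.log(len(teacher_logits)))

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) at the per-sample temperature."""
    t = sample_temperature(teacher_logits)
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

confident = sample_temperature([8.0, 0.0, 0.0])
uncertain = sample_temperature([0.1, 0.0, 0.0])
loss = kd_loss([2.0, 0.5, 0.0], [1.0, 1.0, 0.0])
```

Here the dominant channel plays the teacher role and the adaptive channel the student role. Many distillation implementations also multiply the loss by the squared temperature to keep gradient magnitudes comparable; that factor is left out of this sketch.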

[0098] S54. The adversarial features generated by adversarial feature enhancement are passed through the adaptive channel network to obtain logits, and boundary distillation is performed against the logits of the dominant channel network:

[0099]

[0100] If the system contains multiple task heads, the KD or feature distillation loss is calculated separately for each head and the results are combined by a weighted average; for a given task set, the overall distillation term is:

[0101]

[0102] Here the weighting coefficients are derived from the memory matrix, and the hyperparameters control the relative contributions of the different distillation terms; for sample i, the AC-dependent weight is defined as:

[0103]

[0104] This weight amplifies the KD signal and the boundary distillation ratio for weak ACs in the loss. The adaptive channel network also minimizes the supervised loss and the modal reconstruction loss. The overall loss is:

[0105]

[0106] Its four terms are, respectively, the supervised loss, the per-sample temperature-dependent distillation loss, the KD loss calculated for the adversarial / boundary samples generated by ABFA, and the modal reconstruction loss.

[0107] Further, step S6 specifically involves: performing system verification and deployment evaluation of the model; testing performance under various sea conditions and modal combinations to establish a safety evaluation index centered on recall, false negative rate, and response latency to quantify model reliability; wherein, during the deployment phase, an online monitoring mechanism is introduced to detect input distribution drift in real time and trigger model fine-tuning, while retaining the adaptive channel network for online inference, including the following steps:

[0108] S61. After training, perform multi-dimensional evaluation on the validation set, first for each AC and sea state condition. Calculation indicators:

[0109]

[0110] Plot the performance curve as a function of modal quantity. To measure robustness to missing models; significance is calculated using paired t-tests or Wilcoxon tests. - value, if A score <0.05 is considered a significant improvement; additional key safety indicators are evaluated to form a comprehensive safety risk score.

[0111]

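[0112] S62. If the RiskScore decreases, only the adaptive channel network and gating mechanism are retained during deployment to ensure low latency; the dominant feature channel and ABFA are only enabled during offline updates. The input distribution is monitored at runtime, and the KL divergence between the deployment feature layer distribution $P_{\text{deploy}}$ and the training distribution $P_{\text{train}}$ is calculated:

[0113] $$D_{\mathrm{KL}}(P_{\text{deploy}} \,\|\, P_{\text{train}}) = \sum_{x} P_{\text{deploy}}(x)\,\log\frac{P_{\text{deploy}}(x)}{P_{\text{train}}(x)}$$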

[0114] An alarm is triggered if the divergence exceeds a threshold. Online performance monitoring estimates runtime performance by collecting a small number of manually labeled samples; if the performance on a critical AC degrades beyond a set tolerance, the fine-tuning process is started. The memory matrix is updated online:

[0115]

[0116] The sampling weights for the next round of training are adjusted accordingly.
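The drift check in S62 can be illustrated with a minimal Python sketch. It assumes a single scalar feature summarized by a fixed-bin histogram with add-one smoothing; the function names, the bin edges, and the threshold value are hypothetical and are not specified in this application.

```python
import math

def histogram(values, edges):
    """Normalized histogram with add-one smoothing so that no bin is empty.
    Values outside [edges[0], edges[-1]] are ignored in this sketch."""
    counts = [1.0] * (len(edges) - 1)
    for v in values:
        for b in range(len(edges) - 1):
            is_last = b == len(edges) - 2
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def drift_alarm(train_feats, deploy_feats, edges, threshold):
    """Return the divergence and whether it exceeds the alarm threshold."""
    p = histogram(deploy_feats, edges)
    q = histogram(train_feats, edges)
    d = kl_divergence(p, q)
    return d, d > threshold
```

In a deployed system the same comparison would be run per feature channel or on a low-dimensional projection of the feature layer; the scalar case is used here only to keep the example short.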

[0117] Compared with the prior art, the beneficial technical effects of the present invention are as follows:

[0118] 1. This application effectively addresses common problems in maritime operations, such as sensor failure, low visibility, signal interference, and modal imbalance, by jointly modeling multimodal data and introducing statistical simulations with missing modal conditions during the training phase. By combining a priority sampling strategy with dynamic memory matrix scheduling, weak modal combinations receive sufficient attention during training, thereby significantly improving the model's perception robustness and generalization ability under complex sea conditions.

[0119] 2. This invention significantly improves prediction stability and safety in high-uncertainty and weak-modality scenarios. By combining Adversarial Boundary Feature Enhancement (ABFA) with Uncertainty-Driven Temperature Distillation (UdT), adversarial samples close to the decision boundary are generated in the feature layer of the adaptive channel network, effectively enhancing the model's discrimination ability in the boundary region. At the same time, the distillation temperature is dynamically adjusted according to the uncertainty entropy of the dominant channel network and the actual sea conditions, making knowledge transfer more targeted and efficient than traditional fixed-temperature distillation.

[0120] 3. This invention not only optimizes model resource allocation during the training phase through risk assessment and performance memory mechanisms, but also introduces an input distribution drift monitoring and risk scoring system during the deployment phase, enabling real-time detection of input data changes and model performance degradation. Upon detection of anomalies, data backhaul and retraining are automatically triggered, achieving closed-loop adaptive optimization. Compared to existing static models, this invention possesses long-term adaptive capabilities, continuously ensuring the safety, reliability, and stable operation of intelligent ships in dynamic marine environments.

Brief Description of the Drawings

[0121] Figure 1 is a flowchart illustrating the specific steps of the ship multimodal data training method based on boundary-guided adversarial feature enhancement and uncertainty-driven distillation, as described in this application embodiment.

[0122] Figure 2 is a diagram illustrating the overall framework for training ship multimodal data based on boundary-guided adversarial feature enhancement and uncertainty-driven distillation, as described in this application embodiment.

[0123] Figure 3 is a schematic diagram of the adversarial boundary feature enhancement (ABFA) module described in the embodiments of this application.

Detailed Description

[0124] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0125] Example 1

[0126] As shown in Figure 1, this invention provides a method for training ship multimodal data based on boundary-guided adversarial feature enhancement and uncertainty-driven distillation, specifically including the following steps:

[0127] S1: Ship Multimodal Data Acquisition and Metadata Construction. Step S1, tailored to the operational scenarios of intelligent ships, involves the rational deployment, calibration, and time synchronization of multimodal sensors to ensure the integrity and consistency of the acquired data. By arranging sensors such as IMU, GPS, radar, sonar, visual cameras, lidar, and AIS at different locations on the hull according to their functional characteristics, comprehensive perception of the surrounding environment can be achieved. Simultaneously, global time alignment is performed using PPS signals, and each sensor is assigned a unique identifier and spatial parameter matrix to ensure the uniformity of multi-source data in both spatial and temporal dimensions. The final output structured sample includes not only the raw observation data but also metadata such as timestamps, ship attitude, meteorological information, and wave characteristics, providing a solid data foundation for subsequent preprocessing, statistical modeling of missing modes, and environmentally relevant distillation temperature adjustment. Therefore, S1 establishes a high-quality input data source for the entire process, serving as a prerequisite and guarantee for subsequent steps.

[0128] S2: Data Preprocessing and Statistical Modeling of Missing Modes. Based on the data collected in S1, specialized preprocessing and normalization transformations are performed on various modes to obtain stable and comparable feature representations. For visual modes, distortion correction and atmospheric scattering models are used to normalize and simulate image degradation, thereby improving the system's robustness under low visibility conditions. For radar and sonar modes, logarithmic amplitude or power spectrum transformations are employed to enhance the separability of echo features. For temporal modes such as AIS, exponential smoothing is used to reduce jitter and preserve motion trends. Based on this, S2 utilizes metadata to establish a conditional mode availability model, thereby statistically modeling the mode failure probability under different sea states and meteorological conditions. This modeling result is not only used to simulate missing sample patterns but also determines the priority sampling distribution during training, providing input standards for the multimodal dual-channel guided learning network architecture in S3, and laying the theoretical foundation for memory matrix updates in S4 and sampling strategies in S5.

[0129] S3: Construction of a Multimodal Dual-Channel Guided Learning Network

[0130] Based on the standardized features and mode-deficit masks obtained in step S2, a multimodal dual-channel guided learning network is constructed to achieve robust learning. The "dominant channel" handles high-confidence feature extraction and knowledge generation under full-modal input, providing a complete prediction distribution and soft labels. The "adaptive channel" operates under partial mode-deficit conditions, simulating real-world deployment scenarios and generating adaptive outputs. Features for each modality are extracted using a dedicated modality encoder and combined with environmental metadata from S1 to generate dynamic gating coefficients, enabling adaptive adjustment of modality importance under different sea states. Simultaneously, the adaptive channel incorporates a modality reconstruction module, which can infer features of masked modalities under mode-deficit conditions, thereby enhancing the network's generalization ability. The fused features and prediction results output from this step can serve as the target for subsequent adversarial perturbation generation and also provide an input channel for uncertainty-driven transfer processes, enabling knowledge transfer to consider different modalities and environmental conditions, achieving robust adaptive learning.

[0131] S4: Memory Matrix Construction and Weak Modality Recognition

[0132] Building upon the multimodal dual-channel guided learning network in S3, a dynamic memory matrix is constructed to track the performance of different modality availability combinations. The performance metrics of each combination are updated using an exponential moving average method, smoothing out random fluctuations and maintaining long-term performance trends. The memory matrix not only records the average performance and sampling count of each modality combination but also reflects the network's learning progress under different environments and modality-deficient conditions. Its core function is to provide a priority basis for training samples in step S5: weak modality combinations are assigned higher weights during sampling and contribute more significantly to loss calculation through weighting factors. Thus, S4 achieves the linkage between performance evaluation and training scheduling, ensuring that resources in subsequent transfer and distillation stages are focused on weaker areas, achieving targeted optimization and robustness improvement.

[0133] S5: Boundary Adversarial Feature Attacks and Uncertainty-Driven Temperature Multitasking Distillation

[0134] Step S5, building upon the data preparation and statistical support of S1–S4, completes the core process of distillation training. First, based on the dynamic update results of the memory matrix and the missing modality statistical model in S2, a priority sampling distribution is constructed to ensure greater coverage of weak modality combinations in the training batch. Second, Adversarial Boundary Feature Enhancement (ABFA) is introduced to generate perturbation samples on the fusion feature layer of the adaptive channel network, forcing the model to receive stronger supervision at the discrimination boundary, thereby enhancing its boundary discrimination ability. Simultaneously, a clutter regularization term based on the metadata of S1 is introduced to ensure that the adversarial perturbation is consistent with the actual physical environment. Furthermore, Uncertainty-Driven Temperature Distillation (UdT) is proposed, dynamically assigning distillation temperature to each sample based on the uncertainty of the dominant channel prediction and sea state conditions, thereby enhancing the knowledge transfer effect on high-uncertainty samples and boundary samples. Finally, by uniformly weighting the supervision loss, distillation loss, reconstruction loss, and regularization term, the overall training objective balances global performance while strengthening the learning effect of weak modalities and key samples, achieving robust distillation in complex environments.

[0135] S6: Model Validation, Evaluation, and Deployment

[0136] Step S6 involves system validation and online monitoring of the trained model. Performance evaluation is conducted on the validation set under different modal combinations and sea state conditions. Statistical tests are used to determine the significance of the improvement effect, and a risk assessment system centered on key safety indicators is established. For example, a risk score is constructed by combining false negative rate, recall rate, and response latency to quantify the model's reliability in safety-critical scenarios. Simultaneously, this step introduces an online monitoring mechanism to compare the input feature distribution in the deployment environment with the reference distribution during training in real time. When significant drift is detected, sample backtesting and model fine-tuning are triggered. Thus, S6 not only statically validates the model's performance but also provides a mechanism to support dynamic adaptation and risk control during the deployment phase, ensuring the model's robustness and safety in long-term operation.

[0137] Furthermore, as shown in Figure 2, the specific steps of step S1 are as follows:

[0138] S11. The specific steps for the deployment and data acquisition of multiple sensors on intelligent ships are as follows:

[0139] In the multimodal data acquisition phase of intelligent ships, the first step is to rationally deploy and install various types of sensors. An inertial measurement unit (IMU) and a global positioning system (GPS) are positioned at the center of the hull to provide attitude angles, acceleration, angular velocity, and high-precision positioning information. Radar and sonar are installed on the hull bottom and sides, respectively, to sense the distance and relative motion of surrounding obstacles. Cameras are installed at the bow, stern, and both sides to acquire omnidirectional visual images. A lidar system (LiDAR) is deployed on the deck to generate high-resolution 3D point clouds. After installation, each sensor is assigned a unique identifier, and its spatial position and orientation are recorded to facilitate subsequent multimodal coordinate alignment and fusion. Subsequently, each sensor is initialized and its functions are tested. The IMU and GPS are warmed up to eliminate initial drift. The system uses GPS pulse-per-second (PPS) signals for global time synchronization, ensuring that all acquired data are aligned on the same time reference, and appropriate sampling frequencies are set for different types of sensors. After completing the above steps, the sensors are activated to begin multimodal data acquisition, including attitude and acceleration data from the IMU, GPS position information, target distance and velocity measured by radar, sonar echo depth, camera image sequences, and LiDAR point clouds. The acquired data is stored in real-time on the ship's onboard storage medium and uploaded to the shore-based monitoring center or cloud platform via a wireless communication module. Simultaneously, navigation status parameters such as speed, heading, and rudder angle are recorded to provide auxiliary information for subsequent data synchronization, error correction, and fusion modeling.

[0140] Based on the above multi-sensor deployment and standard data acquisition, let the full set of modalities be $\mathcal{M} = \{1, \dots, M\}$, and let the observation of the $m$-th modality at time $t$ be denoted $x_m(t)$. Each record should also store a metadata vector whose typical components include GPS position, platform attitude, sea state indicators (sea state class, visibility, wave height), a timestamp, and sensor health. Spatial coordinates and camera / sensor calibration are represented using extrinsic and intrinsic parameter matrices. The rigid transformation from world coordinates $X_w$ to camera coordinates $X_c$, with rotation $R$ and translation $T$, is as follows:

[0141] $$X_c = R\,X_w + T$$

[0142] The camera pixel projection (homogeneous representation) is as follows:

[0143] $$s\,\tilde{u} = K\,[R \mid T]\,\tilde{X}_w$$

[0144] where $K$ is the camera intrinsic parameter matrix, $s$ is a scale factor, and the tilde denotes homogeneous coordinates. Marine radar measurements are transformed from polar coordinates to planar coordinates. To ensure multimodal time alignment, a reference time $t_0$ is selected, and the low / high frame rate modalities are aligned using linear or higher-order interpolation between adjacent frames. For adjacent frames at times $t_1 \le t_0 \le t_2$, the linear interpolation form is as follows:

[0145] $$x_m(t_0) = x_m(t_1) + \frac{t_0 - t_1}{t_2 - t_1}\,\bigl(x_m(t_2) - x_m(t_1)\bigr)$$

[0146] For each modality, a signal-to-noise ratio (SNR) and a health score are defined, specifically expressed as follows:

[0147]

[0148] Here, the mapping function normalizes the health level to the interval (0,1).
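The linear interpolation used for time alignment in [0144] can be sketched in Python as follows. The function name `interpolate_at` and the scalar-valued series are assumptions made for illustration; real sensor streams would carry vectors or images, and higher-order interpolation is not shown.

```python
from bisect import bisect_right

def interpolate_at(timestamps, values, t_ref):
    """Linearly interpolate a scalar series at the reference time t_ref.
    timestamps must be sorted in ascending order."""
    if not timestamps[0] <= t_ref <= timestamps[-1]:
        raise ValueError("t_ref lies outside the recorded interval")
    j = bisect_right(timestamps, t_ref)
    if j == len(timestamps):  # t_ref coincides with the last timestamp
        return values[-1]
    t1, t2 = timestamps[j - 1], timestamps[j]
    w = (t_ref - t1) / (t2 - t1)  # fraction of the way from t1 to t2
    return values[j - 1] + w * (values[j] - values[j - 1])
```

A low-rate modality (for example AIS) would be resampled this way onto the reference timestamps of a high-rate modality before the samples are assembled.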

[0149] S12. Statistical modeling is performed on the sensor dropouts in the multimodal data from step S11. The model is used during training to simulate the modality-missing conditions that may occur in real maritime environments, thereby improving the robustness of the model. First, for each modality $m$, define a modal availability indicator variable $a_m \in \{0, 1\}$, where $a_m = 1$ indicates that the modality is available at the current moment and $a_m = 0$ indicates that it is missing. A conditional probability model based on sea state is then established:

[0150] $$P(a_m = 1 \mid s) = \sigma\!\left(w_m^{\top} s + b_m\right)$$

[0151] where the vector $s$ is the state vector of metadata including sea state (wind speed, wave height, visibility), time (day / night, season), and platform attitude (pitch angle, roll angle, heading angle); $\sigma$ is the Sigmoid function; and the coefficients $w_m$ and $b_m$ are learnable parameters obtained by maximum likelihood estimation on historical logs. This formula describes how modal availability varies with sea state, time, and attitude, and can be used to generate sampling distributions for missing modalities. To improve the practicality of this modeling method, image stabilization is first performed in real time for modalities containing image data, using the attitude angles measured by the IMU: a rotation matrix is applied at the pixel level to geometrically transform the image coordinates and compensate for jitter caused by the ship's motion, so that subsequent missing-modality sampling more closely resembles real scenes. Then, for the optical sensor (visible-light camera), the angle between the solar incidence vector and the camera's line of sight is calculated to obtain a geometric visibility index, which is used for subsequent selective enhancement or sample weight adjustment. Finally, a consistency scoring function is defined for the AIS / trajectory data:

[0152]

[0153] This score is used to automatically label potential AIS spoofing / abnormal situations, and the result is stored in the metadata. Each collected sample is ultimately recorded and stored in a database together with a flag indicating whether a missing modality in the sample is a true dropout. The output of this step is a database containing complete metadata, time alignment, health scores, and annotated missing-modality mechanisms, for use in subsequent preprocessing and AC training.

[0154] S13. For each modality in the above steps, a dedicated preprocessing and feature transformation process is adopted to ensure the consistency of subsequent model inputs. For the visual (RGB / IR) modality, distortion correction, color normalization, and local contrast adaptation (e.g., CLAHE) are performed first, with the standardization formula $\hat{x} = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the pixel values. For optical degradation modeling, a fog / haze imaging model (Koschmieder model) is used to synthesize training samples. A commonly used atmospheric scattering model is written as:

[0155] $$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)}$$

[0156] where $I$ is the degraded image, $J$ is the clear image, $A$ is the ambient light, and the transmittance $t$ is determined by the distance $d$ and the attenuation coefficient $\beta$. Based on this model, training samples under different visibility conditions can be generated and used for robust training. Radar / sonar modalities usually undergo logarithmic amplitude transformation and normalization to compress the dynamic range; acoustic / sonar signals are converted into time-frequency spectrograms (matrices) using the STFT and input into the convolutional network. The STFT expression is:

[0157] $$X(\tau, \omega) = \sum_{n} x[n]\, w[n - \tau]\, e^{-j \omega n}$$

[0158] where $w$ is a window function. Low-pass or exponential smoothing filters are used to de-jitter the time-series modalities (AIS / trajectory). The exponential smoothing formula is $s_t = \alpha x_t + (1 - \alpha)\, s_{t-1}$, where the smoothing coefficient $\alpha$ is chosen to balance smoothness and responsiveness. For annotation, a unified annotation standard is defined for each task: detection is represented by bounding boxes with category labels, and pixel-level segmentation is represented by masks. Annotation consistency is quantified using IoU and Cohen's kappa:

[0159] $$\mathrm{IoU}(B_1, B_2) = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$

[0160] where $p_o$ is the observed agreement rate and $p_e$ is the chance agreement rate. Confidence aggregation across multiple annotators can be achieved using simple voting or probabilistic fusion. Given the set of labels assigned to a sample by its annotators, the label confidence can be defined as

[0161]

[0162] Alternatively, weighted voting (assigning larger weights to more reliable annotators) can be used. To accelerate annotation and improve cross-modal consistency, this embodiment proposes a "cross-modal semi-automatic annotation" process: regions of strong radar / sonar response are used to generate visual candidate boxes, which are then provided for human confirmation. This is formalized as conditional probability propagation with maximum a posteriori estimation. For data augmentation, in addition to the fog model, a physics-driven water droplet / splash occlusion generation method is designed: a spatial occlusion mask (a local alpha channel) is generated, and the occluded observation is simulated using alpha blending:

[0163]

[0164] where the blending weight is the local occlusion intensity distribution. To ensure the traceability and reproducibility of the data processing, this step records all key transformation parameters in the preprocessing and enhancement processes and stores them together with the sample and annotation information.
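The fog synthesis and occlusion augmentation described in S13 can be sketched in Python as follows. The fog function follows the standard atmospheric scattering model; the alpha-blend form `(1 - alpha) * image + alpha * occluder` is an assumption, since the exact blending expression is not reproduced in the text. Images are plain nested lists of grayscale values in [0, 1], and all function names are hypothetical.

```python
import math

def add_fog(clear, depth, beta, ambient):
    """Per pixel: I = J * t + A * (1 - t), with transmittance t = exp(-beta * d)."""
    foggy = []
    for row_j, row_d in zip(clear, depth):
        out_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)
            out_row.append(j * t + ambient * (1.0 - t))
        foggy.append(out_row)
    return foggy

def alpha_blend(image, occluder, alpha):
    """Per pixel: (1 - alpha) * image + alpha * occluder (assumed blending form)."""
    return [
        [(1.0 - a) * i + a * o for i, o, a in zip(row_i, row_o, row_a)]
        for row_i, row_o, row_a in zip(image, occluder, alpha)
    ]
```

At zero distance the image is unchanged, and at very large distance every pixel converges to the ambient light value, which matches the behaviour expected from low-visibility conditions.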

[0165] In practice, the parameters involved in preprocessing are recorded for each sample, including but not limited to: the attenuation coefficient and ambient light parameters of the atmospheric scattering model, the mask and weight distribution used in occlusion enhancement, and the parameters related to normalization and contrast enhancement. This information, together with the processed data and annotation results, constitutes a complete sample record:

[0166]

[0167] where the metadata entry represents the metadata associated with the sample, including timestamps, sea state parameters, and sensor status information. During subsequent model training and analysis, the sample can be reconstructed or the process reproduced based on the recorded transformation parameters. For example, the recorded attenuation coefficient and ambient light parameters allow the image degradation process to be reconstructed, so that the model's performance under specific sea conditions (such as dense fog or strong occlusion) can be analyzed; when false detections or missed detections occur, the corresponding combination of enhancement parameters can be located quickly and optimized in a targeted manner.

[0168] All preprocessing and enhancement processes should record their transformation parameters (such as the attenuation coefficient, ambient light, and occlusion mask listed above) to enable reversibility or traceability; the annotation and preprocessing results, along with the metadata, are written into the sample record.
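One possible shape for such a sample record is sketched below in Python. Every field name is hypothetical; the application only states which kinds of parameters are stored (degradation parameters, occlusion parameters, metadata, and annotations), not a concrete schema. Storing a random seed instead of the full occlusion mask is likewise an assumption made to keep the record small.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class SampleRecord:
    sample_id: str
    timestamp: float
    sea_state: int
    fog_beta: float      # attenuation coefficient used for fog synthesis
    fog_ambient: float   # ambient light used for fog synthesis
    occlusion_seed: int  # seed from which the occlusion mask can be regenerated
    labels: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record so the augmentation can be replayed later."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SampleRecord":
        """Rebuild a record from its serialized form."""
        return cls(**json.loads(text))
```

A round trip through `to_json` and `from_json` returns an equal record, which is the property needed to reproduce a degraded sample when a missed detection is investigated.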

[0169] In some embodiments, the specific steps of S2 are described as follows:

[0170] S21. Based on the multimodal data acquisition and preprocessing completed in step S1, the labeled sample set is split into training / validation / test sets by voyage segment / time / scene. To simulate and train on missing-modality scenarios, an availability combination (AC) vector $a \in \{0,1\}^{M}$ is introduced, whose $m$-th entry $a_m$ indicates whether modality $m$ is available. For a sample $x_i$, the masked input is represented as $\tilde{x}_i = \{a_m\, x_{i,m}\}_{m=1}^{M}$.

[0171] S22. To address missing modalities, three parallel strategies are employed to generate missing-modality combinations during the training phase:

[0172] (1) Empirical independent sampling: for each modality $m$, independent Bernoulli sampling is performed based on its historical availability rate $p_m$: $a_m \sim \mathrm{Bernoulli}(p_m)$;

[0173] (2) Conditional sampling (MAR): $a_m$ is sampled based on the sea state / time / attitude information $s$. Logistic regression is commonly used for fitting:

[0174] $$P(a_m = 1 \mid s) = \sigma\!\left(w_m^{\top} s + b_m\right)$$

[0175] where $w_m$ and $b_m$ are estimated from historical data; this reflects, for example, the greater likelihood of the optical modality being missing in foggy weather;

[0176] (3) Worst-case sampling: several key ACs are selected manually or according to safety requirements (e.g., vision and radar missing at the same time), and the sampling ratio of these key ACs in the training samples is increased. To further enhance the model's robustness in extreme cases, random fields (Perlin noise or a wave-height-based threshold can be used) are used to generate spatial occlusion masks for local spatial occlusion (such as wave occlusion), which are applied to the visual images:

[0177]

[0178] To facilitate training strategy and resource allocation, the number of samples and the baseline performance (e.g., mAP / mIoU measured on the validation set) are counted for each AC, and a weak-modality memory matrix is constructed to store these entries. A weighting or resampling strategy is applied to the training samples to prioritize weak modalities. The weighting function can be defined as a linear or smooth function, such as linear weighting:

[0179]

[0180] Or exponential resampling probability:

[0181]

[0182] This increases the probability that poorly performing ACs are sampled during training, thereby improving performance on those ACs. Through the above steps, the statistical characteristics of the true missing-modality distribution are preserved, and the proportion of weak modality combinations in the training set is increased through conditional and adversarial sampling, thus improving robustness and generalization ability in rare and extreme missing-modality scenarios.
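The conditional sampling strategy of S22 can be sketched in Python as follows, assuming a per-modality logistic model of availability. The parameter layout, the function names, and the single-feature state vector in the usage note are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def availability_prob(weights, bias, state):
    """P(a_m = 1 | s) under a logistic model with the given weights and bias."""
    return sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)

def sample_mask(params, state, rng):
    """Draw one availability combination.
    params maps a modality name to a (weights, bias) pair."""
    return {
        m: int(rng.random() < availability_prob(w, b, state))
        for m, (w, b) in params.items()
    }
```

With a negative weight on a fog-density feature, the camera becomes less likely to be available as fog increases, which is the behaviour the conditional model is meant to capture. Empirical independent sampling is the special case in which all weights are zero and the bias encodes the historical availability rate.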

[0183] In some embodiments, the specific steps of S3 are as follows:

[0184] S31. Based on steps S1 and S2, a trainable dataset is obtained and a dual-channel guided learning network model is constructed. The dominant channel network is trained with full-modality input, and the adaptive channel network is trained with missing-modality input; through distillation learning, the adaptive channel is expected to achieve prediction performance close to that of the dominant channel even when some modalities are missing. For each modality $m$, let its dedicated encoder be $E_m$, with output features $h_m = E_m(x_m)$.

[0185] S32. To handle missing modalities, a mask gating mechanism driven by the metadata recorded in S1 is used. Each modality receives a gate $g_m$ that combines the availability indicator $a_m$ with a Sigmoid-activated weight, so that feature weights are adjusted dynamically and the contribution of missing modalities is suppressed, avoiding their negative impact on the fused features. The multimodal fused feature $F$ can be represented as a weighted sum:

[0186] $$F = \sum_{m} g_m\, W_m\, h_m$$

[0187] where $W_m$ is the modal feature alignment matrix (or convolution). To further exploit the complementary information across modalities, this invention uses a cross-modal multi-head attention fusion module: each modal feature is first mapped to a query $Q$, key $K$, and value $V$, and attention is then computed over all modal tokens:

[0188] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

[0189] where $d_k$ is the key dimension. The head outputs are concatenated and linearly transformed to obtain the fused features, which are then sent to the task head to obtain the logits.

[0190] S33. During training, the gating is adaptively adjusted based on the sample's sea state metadata, so that radar / AIS features receive higher weights under adverse sea conditions. A modal reconstruction auxiliary head is also introduced: a decoder added to the adaptive channel network reconstructs the features of missing modalities, and the corresponding reconstruction objective further strengthens the adaptive channel network's learning of inter-modal correlations. Furthermore, for complex maritime tasks, this invention embeds a dynamic prior (the AIS trajectory velocity vector) into the attention weight calculation and adds a track distance penalty term to the attention score, which improves the stability of target trajectory maintenance and prediction.
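The gated weighted-sum fusion of S32 can be sketched in Python as follows. The gate form `g_m = a_m * sigmoid(u_m)` is an assumption consistent with the description (availability indicator combined with a Sigmoid-activated weight); the gate logits `u_m`, which would be produced from the sea state metadata, are passed in directly here, and the attention module is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gated_fusion(features, align, gate_logits, available):
    """F = sum_m g_m * W_m h_m with the assumed gate g_m = a_m * sigmoid(u_m)."""
    dim = len(next(iter(align.values())))  # output dimension = rows of W_m
    fused = [0.0] * dim
    for m, h in features.items():
        gate = available[m] * sigmoid(gate_logits[m])
        projected = matvec(align[m], h)
        fused = [f + gate * p for f, p in zip(fused, projected)]
    return fused
```

Because the availability indicator multiplies the gate, a missing modality contributes exactly zero to the fused feature regardless of what placeholder values its encoder receives.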

[0191] In some embodiments, the specific steps of S4 are as follows:

[0192] S41. Before the main training, weak modality combinations are identified through rapid pre-training and a memory matrix is constructed. First, for each AC $a$, a performance metric is defined:

[0193] $$P_a = \mathrm{metric}(D_a)$$

[0194] where metric is the task evaluation indicator (e.g., classification accuracy, mAP, mIoU) and $D_a$ is the set of samples that satisfy AC $a$. If $P_a < \tau$, where $\tau$ is the weak-modality threshold, the combination is recorded in the matrix, and the combination weight is defined as:

[0195]

[0196] S42. To avoid excessive bias towards rare combinations, an upper bound can be imposed on the weights. In addition, a dynamic memory update strategy is adopted: during training, the performance estimate is refreshed at regular epoch intervals and the memory matrix is smoothed:

[0197] $$M_a \leftarrow \rho\, M_a + (1 - \rho)\, \hat{P}_a$$

[0198] where $\rho$ is the attenuation coefficient and $\hat{P}_a$ is the performance estimate for the current training epoch. To improve cross-region generalization ability, the memory matrix is expanded into a three-dimensional tensor that records the performance under different sea state conditions, and domain weights are introduced to ensure that data from different sea areas are covered during training.
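The exponential moving average update of the memory matrix can be sketched in Python as follows. Availability combinations are represented as tuples of 0/1 flags; the dictionary layout, the function names, and the rule of initializing an unseen combination with its first estimate are assumptions.

```python
def update_memory(memory, epoch_perf, rho):
    """Return a new memory in which each observed AC is smoothed as
    rho * old + (1 - rho) * new; unseen ACs start at their first estimate."""
    updated = dict(memory)
    for ac, perf in epoch_perf.items():
        if ac in updated:
            updated[ac] = rho * updated[ac] + (1.0 - rho) * perf
        else:
            updated[ac] = perf
    return updated

def weak_combinations(memory, tau):
    """List the ACs whose smoothed performance is below the weak threshold tau."""
    return sorted(ac for ac, perf in memory.items() if perf < tau)
```

A larger attenuation coefficient makes the memory respond more slowly to a single noisy epoch, at the cost of reacting later when a combination genuinely degrades.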

[0199] Furthermore, the specific training steps for S5 are as follows:

[0200] Step S5 aims to effectively transfer knowledge from the dominant channel network (high-quality logical capabilities trained on all modes) to the adaptive channel network (operating under modal absence conditions) through targeted sampling strategies, adversarial boundary feature enhancement (ABFA), and instance-level uncertainty temperature distillation (UdT). This will maintain or approach the performance of the dominant channel network under multimodal absence and complex sea conditions, and ensure the physical consistency and reliability of training perturbations.

[0201] S51. Using the memory matrix from S4 and the conditional availability model from S2, modal availability combinations (ACs) are generated during the training phase. For a given combination, the sampling priority is determined jointly by its performance and its sample scarcity, and is defined as follows:

[0202]

[0203] where the quantities involved are the moving-average performance of the combination, the number of historical samples for the combination, and a coefficient that balances performance against sample sparsity. The training sampling probability further combines this priority with the conditional missing-modality distribution:

[0204]

[0205] where the conditional distribution is derived from the modal availability probability model of S2. This strategy keeps the sampling close to the actual missing-modality statistics while specifically reinforcing weak ACs: weak modality combinations obtain more samples in each training batch, which strengthens their learning during distillation.
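A minimal Python sketch of such a priority-weighted sampling distribution is given below. The priority form `(1 - performance) + lam / sqrt(count)` and the multiplicative combination with the conditional probability are assumptions chosen only to illustrate the idea (lower performance and fewer samples raise the priority); they are not the formulas of this application.

```python
import math

def sampling_distribution(perf, counts, cond_prob, lam):
    """Normalize priority(a) * p_cond(a) over all availability combinations.
    Assumed priority: (1 - perf[a]) + lam / sqrt(counts[a])."""
    scores = {}
    for ac in perf:
        priority = (1.0 - perf[ac]) + lam / math.sqrt(counts[ac])
        scores[ac] = priority * cond_prob[ac]
    total = sum(scores.values())
    return {ac: s / total for ac, s in scores.items()}
```

For example, with a full-modality combination at performance 0.9 (100 samples, conditional probability 0.8) and a camera-missing combination at performance 0.5 (25 samples, conditional probability 0.2), a balance coefficient of 0.5 raises the sampling share of the weak combination from 0.2 to about 0.5.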

[0206] S52. As shown in Figure 3, "boundary samples" (features close to the decision boundary, or features that reduce the original class margin) are generated in the intermediate / fusion feature layer of the adaptive channel network, to increase the proportion of boundary samples during training and improve the robustness of the decision boundary. The fusion layer of the adaptive channel network (the high-level output of either the modality encoders or the shared encoder) is taken as the target of the adversarial feature enhancement operation. The boundary loss for category k is defined as:

[0207]

[0208] where the loss is expressed in terms of the logits of the corresponding classes. The loss is 0 at the decision boundary, and a value moving towards the negative indicates that the sample is closer to, or beyond, the boundary of the target class. At each iteration, the perturbation is updated with a steering factor (equivalent to a normalized step along the gradient direction):

[0209]

[0210] where the update uses a step size, a steering factor, and a projection operator that constrains the perturbation. The iteration stops when the predicted class changes to the target class or the maximum number of iterations is reached, yielding the boundary-enhanced features. To ensure that the perturbation is consistent with the real environment, a regularization term based on the S1 clutter statistics is introduced:

[0211]

[0212] where a clutter projection operator is used, and a weighting coefficient incorporates the term into the ABFA objective to constrain the direction and magnitude of the perturbation. The clutter statistics are estimated from the historical S1 data under similar sea state / sensor conditions. Updates cease when the margin becomes negative or the maximum number of iterations is reached. The hyperparameters are set to experimentally recommended values to maintain training stability.
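The iterative boundary perturbation can be illustrated with a simplified Python sketch. It assumes a linear task head so that the margin gradient is available in closed form, uses the margin `z_y - z_rival` as the boundary loss, and projects the perturbation onto an L2 ball of radius `eps`. The steering factor and the clutter regularization term are omitted, and all names are hypothetical.

```python
import math

def logits(W, b, h):
    """Logits of a linear head: z_k = W[k] . h + b[k]."""
    return [sum(w * x for w, x in zip(row, h)) + bk for row, bk in zip(W, b)]

def boundary_perturb(W, b, h, y, eps, step, max_iter):
    """Move the fused feature h towards the nearest rival class boundary,
    keeping the perturbation inside an L2 ball of radius eps."""
    delta = [0.0] * len(h)
    for _ in range(max_iter):
        z = logits(W, b, [x + d for x, d in zip(h, delta)])
        rival = max((k for k in range(len(z)) if k != y), key=lambda k: z[k])
        if z[y] - z[rival] <= 0:  # margin is no longer positive: boundary reached
            break
        grad = [wy - wr for wy, wr in zip(W[y], W[rival])]  # d(margin)/dh
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        delta = [d - step * g / norm for d, g in zip(delta, grad)]
        size = math.sqrt(sum(d * d for d in delta))
        if size > eps:  # projection step
            delta = [d * eps / size for d in delta]
    return [x + d for x, d in zip(h, delta)]
```

With a sufficiently large radius the returned feature sits on or just past the decision boundary; with a small radius the margin shrinks but stays positive, which gives a hard but still correctly labeled example.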

[0213] S53. Adaptively adjust the distillation temperature based on the confidence / uncertainty of each sample according to the dominant channel network. (Sample-by-sample temperature) allows samples with high uncertainty to use higher temperatures (smoother softlabels), reducing the dependence of the adaptive channel network on the dominant channel network's high-confidence but potentially noise-affected predictions, and improving the knowledge transfer efficiency of boundary samples. The softmax probability of the dominant channel network is defined as: Uncertainty is defined by entropy:

[0214]

[0215] In addition, the distance between the logits and the class centers is used as a supplementary metric, and the final temperature mapping is:

[0216]

[0217] where the mapping uses a reference temperature and a modulation factor calculated from the S1 sea state metadata. The sample-level distillation loss is:

[0218]
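The uncertainty-driven temperature of S53 can be sketched as follows. This is an illustration under stated assumptions, not the mapping of the invention: the temperature is taken to grow linearly with the normalized entropy of the dominant channel prediction, the class-center distance term and the sea-state modulation factor are omitted, and the distillation loss is assumed to be the usual temperature-scaled KL divergence multiplied by T squared. The names `t0`, `gamma` and `t_max` are hypothetical hyperparameters.

```python
import math

def log_softmax(z, T=1.0):
    """Numerically stable log-softmax of logits z at temperature T."""
    s = [v / T for v in z]
    m = max(s)
    lse = m + math.log(sum(math.exp(v - m) for v in s))
    return [v - lse for v in s]

def entropy(z):
    """Shannon entropy (nats) of softmax(z)."""
    lp = log_softmax(z)
    return -sum(math.exp(l) * l for l in lp)

def sample_temperature(teacher_logits, t0=2.0, gamma=1.0, t_max=8.0):
    """Assumed mapping: higher teacher entropy gives a higher temperature.

    Requires at least two classes so that log(K) is non-zero.
    """
    h_norm = entropy(teacher_logits) / math.log(len(teacher_logits))
    return min(t_max, t0 * (1.0 + gamma * h_norm))

def kd_loss(teacher_logits, student_logits, T):
    """T^2 * KL(teacher || student), both softened at temperature T."""
    lp = log_softmax(teacher_logits, T)
    lq = log_softmax(student_logits, T)
    return T * T * sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
```

A confident dominant channel output such as `[10.0, 0.0, 0.0]` stays close to the reference temperature, while a flat output such as `[0.0, 0.0, 0.0]` receives the largest temperature allowed by the mapping.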

[0219] S54. The adversarial features generated by the adversarial feature enhancement network are passed through the adaptive channel network to obtain logits, and boundary distillation is performed against the logits of the dominant channel network:

[0220]

[0221] If the system contains multiple heads (e.g., a detection head, a navigation risk prediction head, a modal reconstruction head, etc.), the KD or feature distillation loss is calculated separately for each head and the results are combined as a weighted sum. Over the task set, the overall distillation term is:

[0222]

[0223] where the weighting coefficients are derived from the memory matrix, and the hyperparameters control the relative contributions of the different distillation terms. For the AC of sample i, define:

[0224]

[0225] This weight amplifies the KD signal and the boundary distillation contribution of weak ACs in the loss. The adaptive channel network must also minimize the supervision loss (e.g., cross-entropy, detection loss, or a segmentation term based on inverse mIoU) and the modal reconstruction loss (enabled only when a missing modality requires reconstruction). The overall loss is:

[0226]

[0227] where the four terms denote, respectively, the supervision loss (e.g., cross-entropy, detection loss, etc.); the per-sample temperature-dependent distillation loss; the KD loss calculated on the adversarial / boundary samples generated by ABFA (using the same or different methods); and the modal reconstruction loss (used in the adaptive channel network).
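One way to read the combination of the four loss terms is sketched below. The exact weighting is not recoverable from the text, so this assumes a plain weighted sum in which an AC-dependent weight scales both distillation terms and the reconstruction term is switched on only when a modality is missing. The function name and the coefficients `lam_kd`, `lam_bd` and `lam_rec` are hypothetical.

```python
def total_loss(l_sup, l_kd, l_bd, l_rec, ac_weight=1.0,
               lam_kd=1.0, lam_bd=0.5, lam_rec=0.1,
               needs_reconstruction=False):
    """Assumed overall loss for one sample of the adaptive channel network.

    l_sup: supervision loss; l_kd: temperature-dependent distillation loss;
    l_bd: KD loss on ABFA boundary samples; l_rec: modal reconstruction loss.
    ac_weight: weight of the sample's AC (larger for weak ACs).
    """
    distill = ac_weight * (lam_kd * l_kd + lam_bd * l_bd)
    rec = lam_rec * l_rec if needs_reconstruction else 0.0
    return l_sup + distill + rec
```

Under these assumptions a sample from a weak AC (larger `ac_weight`) contributes a proportionally larger distillation gradient, while its supervision term is left unchanged.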

[0228] S6: Verification, Risk Assessment and Online Monitoring:

[0229] S61. After training, a multi-dimensional evaluation is performed on the validation set. First, the following indicators are calculated for each AC and each sea state condition:

[0230]

[0231] The performance curve is then plotted as a function of the number of available modalities to measure robustness to modality loss. Significance is assessed with a paired t-test or a Wilcoxon test; a p-value < 0.05 is considered a significant improvement. Key safety metrics are additionally evaluated, such as the collision warning false alarm rate, the critical false negative rate (FNcritical), and the average response time, and these are combined into a comprehensive safety risk score:

[0232]
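The evaluation in S61 can be illustrated with a short sketch. It assumes that the risk score is a weighted sum of the false alarm rate, the critical false negative rate, and a response time normalized by a reference value; the weights and the reference time are hypothetical. For the significance check, only the paired t statistic is computed here, and the resulting value would still have to be compared against the t distribution with n - 1 degrees of freedom to obtain a p-value.

```python
import math
import statistics

def paired_t_statistic(baseline, improved):
    """Paired t statistic for per-condition scores (e.g., one score per AC).

    Assumes at least two pairs and non-identical differences (stdev > 0).
    """
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def risk_score(false_alarm_rate, fn_critical, mean_response_s,
               weights=(0.2, 0.6, 0.2), response_ref_s=5.0):
    """Assumed safety risk score in [0, 1]; lower is safer."""
    latency = min(1.0, mean_response_s / response_ref_s)  # normalized latency
    w_fa, w_fn, w_rt = weights
    return w_fa * false_alarm_rate + w_fn * fn_critical + w_rt * latency
```

For four paired conditions (three degrees of freedom), a t statistic above roughly 3.18 corresponds to a two-sided p-value below 0.05.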

[0233] S62. A decrease in the RiskScore indicates that the method has improved safety. During deployment, only the adaptive channel network and the gating mechanism are retained to ensure low latency; the dominant channel and ABFA are enabled only during offline updates. At runtime, the input distribution is monitored by calculating the KL divergence between the feature layer distribution and the training distribution:

[0234]

[0235] An alarm is triggered if the divergence exceeds a threshold. Online performance monitoring estimates runtime performance from a small number of manually labeled samples. If the performance on a critical AC degrades, the fine-tuning process is started. The memory matrix is updated online:

[0236]

[0237] The sampling weights for the next round of training are adjusted accordingly.
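The runtime monitoring in S62 can be sketched as follows. It assumes that the feature layer distribution is summarized by histogram counts over fixed bins, that drift is measured as KL(runtime || training), and that the memory matrix entry of an AC is refreshed with an exponential moving average. The threshold, the smoothing constant `eps`, and the decay `beta` are hypothetical values.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two histograms given as bin counts.

    eps is added to every bin so that empty bins do not produce log(0).
    """
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def drift_alarm(runtime_counts, train_counts, threshold=0.1):
    """True when the runtime feature histogram has drifted past the threshold."""
    return kl_divergence(runtime_counts, train_counts) > threshold

def ema_update(memory, ac, perf, beta=0.9):
    """Online moving-average update of the memory entry for one AC."""
    prev = memory.get(ac, perf)  # first observation initializes the entry
    memory[ac] = beta * prev + (1.0 - beta) * perf
    return memory[ac]
```

In this sketch an AC is represented by a tuple of available modality names, for example `("radar", "ais")`, and the updated entries would feed the sampling weights of the next training round.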

[0238] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the embodiments described above. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.

[0239] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0240] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation, characterized in that the method, based on the operating environment of intelligent ships, deploys multiple types of sensors to collect multimodal data, and includes the following steps: S1. Intelligent ship multimodal data acquisition and metadata construction: collect raw data and operational status parameters from heterogeneous sensors across the entire ship domain; after eliminating spatiotemporal deviations through clock synchronization and spatial calibration, construct a multimodal dataset with spatiotemporal consistency through metadata encapsulation. S2. Data preprocessing and modality loss statistical modeling: based on the data collected in step S1, perform distortion correction, normalization and filtering on visual, radar, sonar and time series data to generate standardized feature tensors; at the same time, analyze signal integrity to construct a modal availability statistical model and output a quantized set of modality-missing distribution parameters. S3. Construction of a multimodal dual-channel guided learning network: receive the standardized feature tensors, the modality-missing distribution parameter set, and the metadata from step S2; construct a dual-channel guided learning network structure that includes a dominant channel for full-modal high-confidence feature extraction and an adaptive channel for simulating missing-modality scenarios; use a shared cross-modal attention module for feature mapping and compensation, and dynamically adjust modal weights using an environment-driven gating mechanism. S4. Memory matrix construction and weak modality identification: during the training process in step S3, receive multidimensional state observation data and performance feedback signals; construct a dynamic memory matrix based on a moving average algorithm to quantify the stability of each modal combination; mark weak regions by threshold retrieval, calculate inverse performance weights, and output the reweighted sample index to adjust the subsequent sampling strategy. S5. Boundary adversarial feature attack and uncertainty-driven temperature multi-task distillation: based on the weak modality index, apply directional perturbation or linear interpolation at the adaptive channel fusion layer to construct boundary hard example samples; at the same time, use the uncertainty metric output by the dominant channel to dynamically calculate the distillation temperature coefficient, which scales the KL divergence or mean square error between the two channels; construct a multi-task joint objective function including an adversarial loss and a variable-temperature distillation loss; update the network parameters synchronously through backpropagation, and output the optimized final model parameters, the environment-adaptive robust fusion feature vector, and the convergence state log. S6. Validation, evaluation and model deployment: after completing the distillation training, test the model recall, false negative rate and response latency under various sea conditions and modal combinations to quantify reliability; during the deployment phase, introduce an online monitoring mechanism that detects input distribution drift to trigger fine-tuning, and retain the adaptive channel network to perform online inference.

2. The ship multimodal training method based on boundary adversarial augmentation and uncertainty distillation according to claim 1, characterized in that, Step S1 specifically involves: receiving raw observation data streams from the ship's global heterogeneous sensor array, as well as multi-dimensional operational status parameters; simultaneously acquiring clock synchronization references and spatial calibration reference information; employing a high-precision clock synchronization mechanism to eliminate timing deviations in data acquisition from each node; and performing spatial calibration operations to unify the spatial reference systems of different sensors, thereby establishing a spatiotemporal reference for the multimodal data; and using metadata encapsulation technology to structurally map the raw observation signals and multi-dimensional operational status parameters to form a multimodal dataset with spatiotemporal consistency and semantic integrity, specifically including the following steps; S11. Deployment and data acquisition of multiple sensors for intelligent ships: To meet the perception needs in the operating environment of intelligent ships, the deployment, calibration and time synchronization of multiple types of sensors are completed; through reasonable layout, the surrounding environment of the ship is monitored and multi-dimensional data is collected. Based on the aforementioned intelligent ship multi-sensor settings and standard data acquisition, the total multimodal set is set as follows: , No. Each mode in time The observation is recorded as ; Each record also saves a metadata vector. Its typical components include GPS location. 
Platform posture Sea state indicators (classification) ,visibility , wave height ), timestamp With sensor health Spatial coordinates and camera / sensor calibration are represented using extrinsic and intrinsic parameter matrices: world coordinates to camera coordinates The change in rigidity is as follows: The camera pixel projection (homogeneous representation) is as follows: Where K is the camera intrinsic parameter. Representing homogeneous coordinates; conversion of marine radar polar coordinates to planar coordinates. Select reference time The low / high frame rate modes are aligned using linear / high-order interpolation between adjacent frames. The linear interpolation form is as follows: For each modality, a signal-to-noise ratio (SRN) and health score are defined, specifically expressed as follows: in for The mapping normalizes the health level to (0,1); S12. Statistical modeling is performed on the sensor missing data in the multimodal data from step S11 above. This model is used to simulate possible modal missing data in actual maritime environments during the training phase. First, for each modality... Define modal availability indicator variables ,in This indicates that the mode is available at the current moment. Indicates missing information; establish a conditional probability model based on sea state: Where vector The state vector represents information about sea state, time, and platform attitude; the coefficients... These are learnable parameters, obtained through maximum likelihood estimation trained from historical logs; used to generate the sampling distribution for missing modes; S13. For each modality in step S12, a preprocessing and feature transformation process is adopted to ensure consistency of subsequent model inputs: For the visual (RGB / IR) modality, distortion correction, color normalization, and local contrast adaptation are first performed. 
The standardization formula is as follows: ; In optical modal degradation modeling, a fog / haze imaging model is used to synthesize training samples. A commonly used atmospheric scattering model is written as follows: in, For degraded images, For a clear image, For ambient lighting, Transmittance is determined by distance With attenuation coefficient Decision; Based on the model, generate training samples under different visibility conditions and use them for robust training; Radar / sonar modes often undergo amplitude logarithmic transformation and normalization: To compress the dynamic range; acoustic / sonar signals are converted into time-spectrum (matrix) signals using STFT and input into the convolutional network. The STFT expression is: in, As a window function, low-pass or exponential smoothing filters are used to dejitter the timing modes. The exponential smoothing formula is: Commonly used To balance smoothness and responsiveness; in terms of annotation, a unified annotation standard is defined for each task: detection is based on bounding boxes. With category tags This indicates that pixel-level segmentation is based on Indicated; consistency of annotations is quantified using IoU and Cohen's kappa: in, To observe the consistency rate, To achieve a random consistency rate, the confidence scores of multiple annotators are aggregated using simple voting or probabilistic fusion. The labeler for the sample The tag set is The label confidence score is defined as follows: Alternatively, weighted voting can be employed; areas with strong radar / sonar response can be utilized. Generating visual candidate boxes and providing them for human confirmation is formalized as conditional probability propagation. Maximum a posteriori estimation; in terms of data augmentation, in addition to the fog model, a physics-driven water droplet / splash occlusion generation was designed: generating spatial occlusion masks. 
(Local alpha channel), and simulated observations using alpha blending. All preprocessing and enhancement processes record transformation parameters for reversibility or traceability, and annotation and preprocessing results, along with metadata, are written into the sample record.

3. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 1, characterized in that, Step S2 involves performing distortion correction and dehazing operations on the visual data, amplitude normalization processing on the radar and sonar echo signals, and applying a low-pass smoothing filter algorithm to the time-series sensor data to generate a standardized multimodal feature tensor. Simultaneously, it analyzes the mapping relationship between metadata and sensor signal integrity, constructs a modal availability statistical model, and outputs a quantized set of mode-deficient distribution parameters. This parameter set and the standardized feature tensor serve as input data for distillation training of the dual-channel guided learning network model. Specifically: S21. Based on the multimodal data acquisition and preprocessing completed in step S1, the labeled sample set is then sorted by flight segment / time / scene. Perform training / validation / test segmentation; introduce availability vectors (AC) to simulate and train scenarios lacking modalities. For a certain sample The mask input is represented as ; The original multimodal input representation of the i-th sample is usually composed of multiple modal features concatenated or set together. Let M represent the input features of the i-th sample in the m-th modality, and M represent the total number of modalities. The modality availability vector (AC) describes whether each modality is available: where This indicates that the m-th mode is available; otherwise, 0 indicates that the m-th mode is missing. This indicates a per-modal masking effect, but here it is a structured modal-level mask. S22. 
To address the missing model situation, three parallel strategies are employed to generate missing model combinations during the training phase: (1) Empirical independent sampling, for each modality based on historical availability Perform independent Bernoulli sampling: ; (2) Conditional sampling (MAR), based on sea state / time / attitude information sampling Logistic regression is often used for fitting: in Estimated from historical data to reflect the lack of optical modes in foggy weather; (3) Worst-case sampling: Select several key ACs manually or based on security requirements and increase the sampling ratio of these key ACs in the training samples. To enhance the robustness of the model under extreme conditions; Spatial occlusion masks are generated using random fields for localized spatial occlusion. Applications in visual images: Here, AC represents a specific combination of modes. Indicates spatial occlusion mask, The range of values ​​is ∈ ; 0: Complete occlusion; 1: No occlusion; To facilitate training strategies and resource allocation, the number of samples for each AC is counted. and baseline performance And construct a weak modal memory matrix. , storage entries A weighted or resampling strategy is used for the training samples to prioritize weak modes. The weighting function is defined as a linear or smooth function, with linear weighting as follows: Or exponential resampling probability: This increases the probability of poorer ACs being sampled during training, thus improving accelerator performance.

4. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 1, characterized in that, Step S3 receives the standardized multimodal feature tensor, quantized mode-deficient distribution parameter set, and metadata from steps S1 and S2; constructs a dual-path network architecture including a dominant channel and an adaptive channel. The dominant channel performs full-modal high-confidence feature extraction, while the adaptive channel simulates missing scenarios based on the mode-deficient distribution parameter set. A shared cross-modal attention fusion module performs spatial mapping and difference compensation on the dual-channel features. An environment-driven gating mechanism dynamically adjusts the modal weights based on real-time sea conditions, outputting a distillation-trained environment-adaptive robust fusion feature vector and network model parameters. Specifically, the steps include: S31. Based on steps S1 and S2, obtain a trainable dataset and construct dual-channel guided learning network models, where the dominant channel network... Training with full modality input, adaptive channel network Training with missing modal inputs, and using distillation learning, the adaptive channel achieves prediction performance close to that of the dominant channel even when some modalities are missing; for each modality... Let its dedicated encoder be Output features ; S32. To handle the lack of modes, the metadata record mask gating mechanism of S1 is used. ,in Indicates modal availability. The Sigmoid activation function is used to suppress interference from missing modalities by dynamically adjusting feature weights; multimodal fusion features. 
Represented as a weighted sum: in, To align the modal features to a matrix (or convolution), and to enhance complementary information across modalities, a cross-modal multi-head attention fusion module is used: for each modal feature to be mapped, it is first mapped to... Then calculate the attention on all modal tokens: The output is concatenated and linearly transformed to obtain the fused features. Then send it to the mission head. Get logits: ; S33. During training, based on sample sea state metadata Adaptive adjustment gating This allows for higher radar / AIS feature weights under adverse sea conditions; then, a mode reconstruction auxiliary head is introduced, and a decoder is added to the adaptive channel network. Reconstructing missing modal features, training objective This enhances the adaptive channel network's learning of inter-modal correlations; simultaneously, for complex maritime tasks, it utilizes dynamic priors (AIS track velocity vectors) Embed attention weight calculation and add a track distance penalty term when scoring attention. This ensures that the target trajectory remains stable in line with the prediction.

5. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 1, characterized in that step S4 specifically involves establishing a memory matrix during the training of the dual-channel guided learning network structure to record the performance and sample distribution of different modality combinations. The multi-dimensional state observation data and performance feedback signals from the training process in step S3 are taken as input, and a dynamic memory matrix is constructed based on a moving average algorithm: the input states and signals are mapped to the matrix space and an exponentially weighted moving average is calculated to quantify the long-term stability of each modality combination in a specific context. Weak performance regions are automatically marked using threshold retrieval, and inverse performance weights are calculated to dynamically adjust the sampling strategy for subsequent training batches. The output is the updated dynamic memory matrix state, the reweighted sample index, and the optimized network parameters. S41. Before the main training, identify weak modal combinations through rapid pre-training and construct a memory matrix: first, define a performance metric for each AC, where the metric is the task evaluation indicator computed over the sample set that satisfies the AC; if the performance falls below the weak mode threshold, record the combination in the matrix and define its combination weight. S42. To avoid excessive bias towards rare combinations, limit the upper bound of the weights. A dynamic memory update strategy is adopted: during training, the memory matrix is updated and smoothed at each epoch using a decay coefficient and the performance estimate for the current training epoch; to improve cross-region generalization ability, the memory matrix is expanded into a three-dimensional tensor that records the performance under different sea state conditions, and domain weights are introduced to ensure that data from different sea areas are covered during training.

6. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 1, characterized in that, Step S5 specifically involves: First, based on the weak mode combination index and adversarial boundary feature enhancement mechanism of the input, the original features are subjected to directional perturbation or linear interpolation in the feature fusion layer of the adaptive channel to construct synthetic hard example samples located at the decision boundary to expand the training set and enhance the model's ability to discriminate fuzzy regions. Subsequently, the distillation temperature coefficient is dynamically calculated using the uncertainty metric output by the dominant channel, and a positive correlation mapping function between uncertainty and temperature is established. This allows high uncertainty samples to correspond to higher temperature values ​​to smooth their probability distribution and retain more dark knowledge, while low uncertainty samples correspond to lower temperature values ​​to maintain prediction sharpness. Finally, a multi-task joint objective function is constructed, which includes adversarial loss terms and variable-temperature distillation loss terms. The variable-temperature distillation loss term uses a dynamic temperature coefficient to scale the KL divergence or mean square error between the adaptive channel and the dominant channel. The dual-channel network parameters are updated synchronously through the backpropagation algorithm. The output consists of the final model parameters of the dual-channel guided learning network after boundary adversarial enhancement and uncertainty-driven temperature multi-task distillation iterative optimization, the environment-adaptive robust fusion feature vector generated by distillation training, and the performance index log that records the convergence status of this training round.

7. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 6, characterized in that step S5 specifically includes: S51. Using the memory matrix from S4 and the conditional availability model from step S2, generate modal availability combinations (AC) for the training phase; for a given combination, the sampling priority is determined jointly by performance and sample scarcity, and is defined in terms of the moving-average performance of the combination and the number of historical samples for the combination. The training sampling probability combines this priority with the conditional modality-missing distribution, which is derived from the conditional probability model of modal availability in step S2, to balance performance and sample sparsity; this allows the sampling to closely resemble the actual modality-missing statistics while specifically strengthening weak ACs. S52. Generate boundary samples, i.e., features that lie close to the decision boundary or that reduce the original class margin, in the intermediate / fusion feature layer of the adaptive channel network; the output of the fusion layer of the adaptive channel network (denoted as ...), or the higher-order output of an encoder or of the shared encoder, is taken as the target of the adversarial feature enhancement operation (denoted as ...). The boundary loss for category k is defined in terms of the logits of the corresponding classes; the loss is 0 at the decision boundary, and a negative value indicates that the sample has moved closer to or beyond the target class boundary. The perturbation for the next iteration is updated using a steering factor, with a step size and a perturbation projection operator; the iteration stops when the predicted class becomes the target class or when the maximum number of iterations is reached, yielding the boundary-enhanced features. To keep the perturbation consistent with the real environment, a regularization term based on the clutter statistics from S1 is introduced, using a clutter projection operator, and is added to the ABFA objective with a weighting coefficient to constrain the direction and magnitude of the perturbation; the clutter statistics are estimated from historical S1 data recorded under similar sea state / sensor conditions, and updating stops when the margin becomes negative or the maximum number of iterations is reached. S53. Adjust the distillation temperature adaptively for each sample (sample-by-sample temperature) according to the confidence / uncertainty of the dominant channel network, so that samples with high uncertainty use higher temperatures, which reduces the dependence of the adaptive channel network on high-confidence but potentially noise-affected predictions of the dominant channel network. Given the softmax probability output of the dominant channel network, the uncertainty is defined by its entropy; in addition, the distance between the logits and the class centers is used as a supplementary metric, and the final temperature mapping uses a reference temperature and a modulation factor calculated from the S1 sea state metadata; the sample-level distillation loss is then defined accordingly. S54. The adversarial features generated by the adversarial feature enhancement network are passed through the adaptive channel network to obtain logits, and boundary distillation is performed against the logits of the dominant channel network. If the system contains multiple task heads, the KD or feature distillation loss is calculated separately for each head and the results are combined as a weighted sum over the task set to form the overall distillation term, where the weighting coefficients are derived from the memory matrix and the hyperparameters control the relative contributions of the different distillation terms. A weight is defined for the AC of sample i to amplify the KD signal and the boundary distillation contribution of weak ACs in the loss. The adaptive channel network also minimizes the supervision loss and the modal reconstruction loss, and the overall loss combines four terms: the supervision loss, the per-sample temperature-dependent distillation loss, the KD loss calculated on the adversarial / boundary samples generated by ABFA, and the modal reconstruction loss.

8. The ship multimodal training method based on boundary adversarial enhancement and uncertainty distillation according to claim 1, characterized in that step S6 specifically involves: performing system verification and deployment evaluation of the model; testing performance under various sea states and modal combinations to establish a safety evaluation index centered on recall, false negative rate, and response latency to quantify model reliability; wherein, during the deployment phase, an online monitoring mechanism is introduced to detect input distribution drift in real time and trigger model fine-tuning, while the adaptive channel network is retained for online inference; including the following steps: S61. After training, perform a multi-dimensional evaluation on the validation set: first calculate the indicators for each AC and each sea state condition; plot the performance curve as a function of the number of available modalities to measure robustness to missing modalities; assess significance using a paired t-test or a Wilcoxon test, where a p-value < 0.05 is considered a significant improvement; additionally evaluate key safety indicators to form a comprehensive safety risk score. S62. If the RiskScore decreases, only the adaptive channel network and gating mechanism are retained during deployment to ensure low latency; the dominant channel and ABFA are enabled only during offline updates; the input distribution is monitored at runtime by calculating the KL divergence between the feature layer distribution and the training distribution, and an alarm is triggered if the divergence exceeds a threshold; online performance monitoring estimates runtime performance from a small number of manually labeled samples; if the performance on a critical AC degrades, the fine-tuning process is started; the memory matrix is updated online, and the sampling weights for the next round of training are adjusted accordingly.