A monocular video-based wave field reconstruction and monitoring system and method

A monocular video-based wave field reconstruction system combines a high-fidelity ocean dynamics synthesis environment, a physically embedded spatiotemporal transformer network, and a teacher-student distillation training framework. It addresses the high cost, low accuracy, and lack of spatiotemporal consistency of existing wave field reconstruction and monitoring, achieving low-cost, high-precision real-time wave monitoring.

CN122199854A · Pending · Publication Date: 2026-06-12 · SHANGHAI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI UNIV
Filing Date
2026-04-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for wave field reconstruction and monitoring are costly, difficult to deploy, have low accuracy, and lack spatiotemporal consistency. Traditional methods cannot meet the needs of marine engineering for high-precision, real-time monitoring.

Method used

A wave field reconstruction and monitoring system based on monocular video is adopted, including a high-fidelity marine dynamics synthetic environment construction subsystem, a physically embedded spatiotemporal transformer wave reconstruction network, and a teacher-student distillation training framework. By combining synthetic dataset generation, a physical rendering engine, and transfer learning techniques, high-precision and spatiotemporally consistent wave field reconstruction is achieved.

Benefits of technology

It reduces hardware costs and deployment complexity, improves reconstruction accuracy and spatiotemporal consistency, and enables real-time high-frame-rate wave field monitoring on a lightweight platform, meeting the real-time monitoring needs of marine engineering.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a monocular video-based wave field reconstruction and monitoring system and method. The system comprises: a high-fidelity marine dynamics synthetic environment construction subsystem configured to generate a synthetic dataset with absolute physical height ground truth according to a physical wave spectrum model and a physical rendering engine; a physically embedded spatiotemporal transformer wave reconstruction network configured to receive a monocular video stream and camera geometry parameters and output an absolute wave field in metric units; and a teacher-student distillation training framework configured to pretrain a teacher network using the synthetic dataset, such that, when facing real sea surface videos, the teacher network generates pseudo labels to guide a student network in learning the light and shadow distribution characteristics of waves. The proposed system significantly reduces hardware cost and payload weight, is more robust, and can be easily deployed on lightweight platforms such as unmanned surface vessels and unmanned aerial vehicles.

Description

Technical Field

[0001] This invention relates to the fields of computer vision, photogrammetry and marine engineering technology, and specifically to a wave field reconstruction and monitoring system and method based on monocular video.

Background Technology

[0002] High-precision 3D reconstruction and monitoring of sea surface wave fields have significant application value in marine resource development, maritime safety, marine engineering, and autonomous vehicles. Accurate and real-time wave height information is crucial for ensuring the safety of maritime operations, optimizing vehicle path planning, and verifying marine engineering structural designs.

[0003] Currently, wave field reconstruction and monitoring technologies fall mainly into the following categories, each with significant drawbacks. Traditional instrumental measurement: this category covers in-situ contact measurement and remote sensing using wave buoys, radar, and satellites. In-situ devices such as buoys and acoustic Doppler current profilers (ADCPs) provide only single-point time series, so the data are sparse, the spatial resolution is low, and no dense spatial field of the sea surface can be acquired, which makes it difficult to describe the two-dimensional propagation of waves. Satellite synthetic aperture radar (SAR) is limited by revisit periods and cannot meet the needs of minute-level real-time monitoring. Shipborne X-band radar covers a wide area, but its large size, high power consumption, and expensive installation make it unsuitable for lightweight platforms such as unmanned surface vessels (USVs) or small unmanned aerial vehicles.

[0004] Photogrammetry based on binocular stereo vision: The sea surface, as a weakly textured target, exhibits high homogeneity and specular reflection. In areas of strong illumination or smooth waves, traditional stereo matching algorithms heavily rely on image texture features, making it difficult to effectively extract feature points. This results in numerous holes in the reconstruction results and many depth calculation errors. Traditional solutions, such as the WASS system, must rely on binocular or multi-camera arrays, placing strict requirements on baseline stability and synchronous triggering between cameras. Hardware deployment is complex and lacks robustness; ship vibrations can easily cause calibration parameter drift, leading to the failure of the entire reconstruction system. Furthermore, existing pipelines like the WASS system typically process frames independently, ignoring the hydrodynamic laws of wave motion. This results in non-physical flickering on the reconstructed wave surface over time, lacking physical temporal consistency and smooth, continuous dynamic evolution characteristics.

[0005] Existing general-purpose large vision model solutions: These models are trained to predict relative depth (the distance from an object to the camera), which is dominated by perspective projection, whereas marine engineering requires absolute wave height (vertical height relative to mean sea level). The geometric task definition of these models is therefore misaligned with the application. Applying them directly leads to severe scale ambiguity and projection distortion, so no physically meaningful metric data can be obtained. Furthermore, existing models are mostly trained on rigid scenes (indoor and urban) and tend to output smooth surfaces. When faced with complex high-frequency details on the sea surface (such as capillary waves and foam textures), they often treat these as noise and suppress them, failing to reconstruct fine-scale wave topography. In addition, pixel-level dense wave height ground truth cannot be obtained from sensors on real sea surfaces, which makes supervised training on real data impossible, and models trained solely on synthetic data perform poorly on real videos because of severe domain shift.

[0006] Therefore, there is an urgent need for a low-cost, easy-to-deploy, high-precision wave field reconstruction and monitoring solution that can guarantee spatiotemporal consistency.

Summary of the Invention

[0007] The objective of this invention is to provide a wave field reconstruction and monitoring system and method based on monocular video. Through the system and / or method, the problems of high cost, difficult deployment, low accuracy, and lack of spatiotemporal consistency in the existing wave field reconstruction and monitoring technologies are solved. It can achieve high-precision and high spatiotemporal consistency reconstruction of the absolute wave height field of the sea surface without the need for expensive sensors and complex hardware deployment.

[0008] In a first aspect of the invention, the aforementioned task is solved by a wave field reconstruction and monitoring system based on monocular video, the system comprising: a high-fidelity marine dynamics synthetic environment construction subsystem configured to generate synthetic datasets with absolute physical height ground truth based on a physical wave spectrum model and a physical rendering engine; a physically embedded spatiotemporal transformer-based wave reconstruction network configured to receive a monocular video stream and camera geometry parameters and output an absolute wave field in metric units; and a teacher-student distillation training framework configured to pre-train a teacher network using the synthetic dataset, wherein, when faced with real sea surface videos, the teacher network generates pseudo-labels to guide the student network in learning the light and shadow distribution characteristics of waves.

[0009] In one embodiment of the present invention, the high-fidelity marine dynamics synthetic environment construction subsystem includes: an empirical spectrum-based ocean dynamics solution unit, with built-in JONSWAP and Pierson-Moskowitz wave spectrum models, configured to generate nonlinear wave surface meshes conforming to hydrodynamic laws from randomized physical parameters; and a physically based rendering multimodal data generation unit configured to simulate real light transport, render video streams, and output absolute physical height maps, surface normal maps, and sea-sky segmentation masks aligned with the RGB pixels.

[0010] In one embodiment of the present invention, the physically embedded spatiotemporal transformer wave reconstruction network includes: a geometry-aware feature encoding module configured to encode the input camera geometry parameters into a high-dimensional vector and inject it into the visual features of the video frames; a decoupled spatiotemporal transformer module consisting of alternately stacked temporal attention modules and spatial attention modules and used to capture wave dynamics; a physics-prior-guided wave refinement module configured to recover high-frequency details of waves; and a multi-task geometry decoding module including multiple parallel prediction heads that output the absolute height field, the surface normal field, and the sea-sky mask, respectively.

[0011] In one embodiment of the present invention, the geometry-aware feature encoding module maps the camera's focal length and/or principal point and/or installation height and/or pitch angle into a geometric embedding vector through a multilayer perceptron, and fuses it with the visual feature map by element-wise addition or channel concatenation.
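The two fusion options named in this embodiment can be illustrated with a minimal NumPy sketch. The layer sizes, the random weights, the ReLU activation, and the function names `mlp_embed`, `fuse_add`, and `fuse_concat` are assumptions made for illustration; the patent text does not specify them.

```python
import numpy as np


def mlp_embed(geom, w1, b1, w2, b2):
    """Two-layer perceptron mapping camera geometry (G,) to an embedding (C,)."""
    hidden = np.maximum(0.0, geom @ w1 + b1)  # ReLU non-linearity (assumed)
    return hidden @ w2 + b2


def fuse_add(feat, emb):
    """Element-wise addition: broadcast the (C,) embedding over a (C, H, W) map."""
    return feat + emb[:, None, None]


def fuse_concat(feat, emb):
    """Channel concatenation: tile the embedding spatially, then stack on channels."""
    tiled = np.broadcast_to(emb[:, None, None], (emb.shape[0],) + feat.shape[1:])
    return np.concatenate([feat, tiled], axis=0)


rng = np.random.default_rng(0)
# Hypothetical, pre-normalized inputs: focal length, principal point (cx, cy),
# mounting height, pitch angle.
geom = np.array([1.2, 0.5, 0.5, 0.8, 0.35])
C, hidden_dim = 8, 16
w1, b1 = rng.normal(size=(5, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, C)), np.zeros(C)

feat = rng.normal(size=(C, 4, 6))      # visual feature map of one frame
emb = mlp_embed(geom, w1, b1, w2, b2)  # geometric embedding vector, shape (8,)
added = fuse_add(feat, emb)            # shape (8, 4, 6)
stacked = fuse_concat(feat, emb)       # shape (16, 4, 6)
```

Addition keeps the channel count unchanged but requires the embedding width to match the feature channels, while concatenation leaves the visual features untouched at the cost of a wider tensor for the following layer.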

[0012] In one embodiment of the present invention, in the decoupled spatiotemporal transformer module, the temporal attention module calculates self-attention weights only along the time axis to model wave propagation; the spatial attention module calculates self-attention weights only within the spatial dimensions of a single frame to maintain the topology of the wave surface; and the temporal attention modules and spatial attention modules are stacked alternately to form a serial processing pipeline.
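A minimal NumPy sketch of this alternating scheme follows. It assumes single-head attention with identity query/key/value projections and residual connections, none of which are stated in the text; the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head self-attention over the second-to-last axis; x has shape (..., L, D).

    Identity projections are used for brevity (an assumption of this sketch).
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x


def temporal_attention(tokens):
    """tokens: (T, N, D). Each spatial location attends only across the T frames."""
    per_location = np.swapaxes(tokens, 0, 1)               # (N, T, D)
    return np.swapaxes(self_attention(per_location), 0, 1)  # back to (T, N, D)


def spatial_attention(tokens):
    """tokens: (T, N, D). Each frame attends only across its own N locations."""
    return self_attention(tokens)


def factorized_block(tokens, depth=2):
    """Alternate temporal and spatial attention in series, with residual adds."""
    for _ in range(depth):
        tokens = tokens + temporal_attention(tokens)
        tokens = tokens + spatial_attention(tokens)
    return tokens
```

For T frames of N tokens each, one temporal plus one spatial layer evaluates N·T² + T·N² attention scores, whereas joint attention over all T·N tokens would evaluate (T·N)².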

[0013] In one embodiment of the present invention, the physics-prior-guided wave refinement module employs a residual structure, including: a feature pass-through path used to preserve the original feature information; and a physical sharpening path that includes a parameter-frozen Laplacian convolutional layer for extracting high-frequency gradient information and a learnable channel-level gain coefficient for adaptively controlling the sharpening intensity. The final output is the sum of the output of the feature pass-through path and the output of the physical sharpening path.
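The residual structure described here can be sketched in NumPy as follows. The 4-neighbour 3x3 Laplacian stencil, the edge-replicating padding, and the names `laplacian_filter` and `wave_refine` are assumptions; the patent text only states that the Laplacian layer is frozen and that the gain is learned per channel.

```python
import numpy as np


def laplacian_filter(feat):
    """Fixed (non-learnable) 3x3 Laplacian applied to every channel.

    feat: (C, H, W). Uses the 4-neighbour stencil [[0,1,0],[1,-4,1],[0,1,0]]
    with edge-replicating padding (both are assumptions of this sketch).
    """
    p = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return (
        p[:, :-2, 1:-1] + p[:, 2:, 1:-1]      # up and down neighbours
        + p[:, 1:-1, :-2] + p[:, 1:-1, 2:]    # left and right neighbours
        - 4.0 * p[:, 1:-1, 1:-1]              # centre pixel
    )


def wave_refine(feat, gain):
    """Pass-through path plus gain-scaled Laplacian path.

    gain: (C,) channel-level coefficients, which would be learned in training.
    With this stencil, a negative gain sharpens and a positive gain smooths.
    """
    return feat + gain[:, None, None] * laplacian_filter(feat)


# A single spike is amplified by a negative gain.
spike = np.zeros((1, 5, 5))
spike[0, 2, 2] = 1.0
sharpened = wave_refine(spike, np.array([-0.5]))
```

Because the Laplacian of a constant map is zero, flat regions pass through unchanged and only locations with curvature are modified, so a zero gain reduces the module to the identity.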

[0014] In one embodiment of the invention, within the teacher-student distillation training framework: the teacher network is built on a large-scale pre-trained visual model and undergoes fully supervised training on the synthetic dataset; the student network is the physically embedded spatiotemporal transformer wave reconstruction network; and during the distillation stage, the teacher network generates pseudo-labels for real sea surface videos, the output of the student network is constrained through a loss function, and the branch weights in the student network used to predict normals and masks are frozen. The loss function $\mathcal{L}_{total}$ is calculated as follows:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{height} + \beta \mathcal{L}_{normal} + \gamma \mathcal{L}_{mask}$$

where $\mathcal{L}_{height}$ denotes the absolute physical height loss, $\mathcal{L}_{normal}$ denotes the surface normal vector loss, and $\mathcal{L}_{mask}$ denotes the sea-sky mask loss; $\alpha$ is the attention weight assigned to the wave absolute physical height prediction task, $\beta$ is the attention weight assigned to the surface normal vector task, and $\gamma$ is the attention weight assigned to the sea-sky segmentation task.

The absolute physical height loss $\mathcal{L}_{height}$ is calculated as follows:

$$\mathcal{L}_{height} = \frac{\sum_i M_i \left| \hat{H}_i - H_i \right|}{\sum_i M_i}$$

where $i$ indexes the pixels, $M_i$ is the effective water surface mask for the i-th pixel, $\hat{H}_i$ is the predicted absolute height of the i-th pixel, and $H_i$ is the true absolute height of the i-th pixel.

The surface normal vector loss $\mathcal{L}_{normal}$ is calculated as follows:

$$\mathcal{L}_{normal} = \frac{\sum_i M_i \left\| \hat{\mathbf{n}}_i - \mathbf{n}_i \right\|_1}{\sum_i M_i}$$

where $i$ indexes the pixels, $M_i$ is the effective water surface mask for the i-th pixel, $\hat{\mathbf{n}}_i$ is the predicted surface normal vector of the i-th pixel, and $\mathbf{n}_i$ is the true surface normal vector of the i-th pixel.

The sea-sky mask loss $\mathcal{L}_{mask}$ is calculated as follows:

$$\mathcal{L}_{mask} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log\left(1 - \sigma(z_i)\right) \right]$$

where $i$ indexes the pixels, $N$ is the total number of pixels, $y_i$ is the true value of the mask for the i-th pixel, $z_i$ is the logit predicted by the network for the i-th pixel, and $\sigma$ is the sigmoid activation function.
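The freezing of the normal and mask branches during distillation can be illustrated with a small parameter-update sketch. The parameter names, the prefix convention, and the plain SGD rule are hypothetical and serve only to show the idea that frozen branches receive no update while the rest of the student is trained on pseudo-labels.

```python
import numpy as np

# Hypothetical name prefixes for the branches that stay frozen during distillation.
FROZEN_PREFIXES = ("normal_head", "mask_head")


def distill_step(params, grads, lr=1e-3, frozen=FROZEN_PREFIXES):
    """One plain SGD step on the student; frozen branches are returned unchanged."""
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen):
            updated[name] = value                     # frozen: no update
        else:
            updated[name] = value - lr * grads[name]  # trainable: gradient step
    return updated


params = {
    "backbone.w": np.ones(3),
    "height_head.w": np.ones(2),
    "normal_head.w": np.ones(2),
    "mask_head.w": np.ones(2),
}
# Stand-in gradients; in practice they would come from the loss on pseudo-labels.
grads = {name: np.full_like(value, 0.5) for name, value in params.items()}
new_params = distill_step(params, grads, lr=0.1)
```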

[0015] In a second aspect of the invention, the aforementioned task is further addressed by a wave field reconstruction and monitoring method based on monocular video. The method is implemented using the monocular video-based wave field reconstruction and monitoring system and includes the following steps: generating, by the high-fidelity ocean dynamics synthetic environment construction subsystem, a synthetic dataset with absolute physical height ground truth; constructing a physically embedded spatiotemporal transformer wave reconstruction network that includes a geometry-aware feature encoding module, a decoupled spatiotemporal transformer module, a physics-prior-guided wave refinement module, and a multi-task geometric decoding module; pre-training the teacher network on the synthetic dataset to obtain a pre-trained teacher network; using the pre-trained teacher network to generate pseudo-labels for unlabeled real sea surface videos; training the student network, which is the physically embedded spatiotemporal transformer wave reconstruction network, by updating its parameters according to the loss function; and deploying the trained student network on the target platform to receive real-time monocular video streams and camera geometry parameters and to output wave fields.

[0016] In a third aspect, the present invention also provides an electronic device comprising: a processor configured to execute machine-readable instructions; a graphics card configured to train the networks used in the monocular video-based wave field reconstruction and monitoring method; and a memory configured to store machine-readable instructions that, when executed by the processor and/or the graphics card, perform the steps of the monocular video-based wave field reconstruction and monitoring method.

[0017] In a fourth aspect, the present invention also provides a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, perform the steps of the wave field reconstruction and monitoring method based on monocular video.

[0018] The technical solution provided by this invention has the following advantages: 1. Traditional wave visual reconstruction (such as the WASS system) relies on binocular or even multi-view camera arrays, and has extremely strict mechanical requirements for camera synchronization and baseline calibration. The wave field reconstruction and monitoring system based on monocular video proposed in this invention only requires monocular camera input, eliminating the dependence on expensive hardware such as binocular / multi-view cameras and LiDAR, as well as complex calibration processes. This significantly reduces hardware costs and payload weight, enhances robustness, and can be easily deployed on lightweight platforms such as unmanned surface vessels and drones.

[0019] 2. The wave field reconstruction and monitoring system based on monocular video proposed in this invention connects a physics-prior-guided wave refinement module in series at the end of the decoder. Using a fixed Laplacian operator and learnable channel coefficients, this module effectively recovers fine-scale sea surface topography such as capillary waves and broken foam, thereby improving the geometric realism of the reconstruction.

[0020] 3. The wave field reconstruction and monitoring system based on monocular video proposed in this invention introduces a decoupled spatiotemporal transformer in its structure to capture the evolution of waves in time and space, ensuring the smoothness and continuity of the reconstructed wave field over time and eliminating the flicker artifacts common in traditional methods.

[0021] 4. The wave field reconstruction and monitoring system based on monocular video proposed in this invention combines computer graphics rendering technology with transfer learning technology. First, a synthetic dataset is constructed using a high-fidelity physics engine, providing the model with perfect physical supervision signals. Second, through a teacher-student distillation strategy, a teacher network pre-trained on virtual data guides the student network to adapt to the real sea surface illumination distribution. This allows ordinary monocular cameras to acquire physical measurement capabilities previously only available to active sensors through algorithmic compensation, without requiring the collection of any height-labeled training data in real sea areas, greatly reducing the threshold and cost of algorithm implementation.

[0022] 5. The wave field reconstruction and monitoring system based on monocular video proposed in this invention significantly reduces the number of parameters and computational complexity compared to using computationally intensive 3D convolutional neural networks (3D-CNN) to process video data. This enables the algorithm to achieve high frame rate real-time inference on resource-constrained edge computing devices (such as unmanned shipborne industrial control computers) while ensuring the capture of long-range spatiotemporal dependencies, thus meeting the urgent need for real-time sea condition monitoring in marine operations.

Attached Figure Description

[0023] To further illustrate the above and other advantages and features of the various embodiments of the present invention, a more specific description of the various embodiments of the present invention will be presented with reference to the accompanying drawings. It is to be understood that these drawings depict only typical embodiments of the invention and are therefore not intended to limit its scope. In the drawings, identical or corresponding parts will be indicated by identical or similar reference numerals for clarity.

[0024] Figure 1 shows a schematic diagram of a wave field reconstruction and monitoring system based on monocular video according to an embodiment of the present invention; Figure 2 shows a schematic diagram of the overall architecture of a wave field reconstruction and monitoring system based on monocular video according to an embodiment of the present invention; Figure 3 shows a schematic diagram of the architecture of a high-fidelity marine dynamics synthesis environment construction subsystem according to an embodiment of the present invention; Figure 4 shows a schematic diagram of the architecture of a physically embedded spatiotemporal transformer wave reconstruction network according to an embodiment of the present invention; Figure 5 shows a detailed structural block diagram of a physics-prior-guided wave refinement module according to an embodiment of the present invention; Figure 6 shows a schematic diagram of a teacher-student distillation training framework according to an embodiment of the present invention; and Figure 7 shows a flowchart of a wave field reconstruction and monitoring method based on monocular video according to an embodiment of the present invention.

Detailed Implementation

[0025] In the following description, the invention is described with reference to various embodiments. However, those skilled in the art will recognize that the embodiments may be practiced without one or more specific details or with other alternatives and / or additional methods or components. In other instances, well-known structures or operations are not shown or described in detail so as not to obscure the inventive points of the invention. Similarly, for illustrative purposes, specific numbers and configurations are set forth to provide a comprehensive understanding of the embodiments of the invention. However, the invention is not limited to these specific details.

[0026] In this specification, references to "an embodiment" or "this embodiment" mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in at least one embodiment of the invention. The phrase "in one embodiment" appearing throughout this specification does not necessarily refer to the same embodiment in all instances.

[0027] It should be noted that the embodiments of the present invention describe the method steps in a specific order; however, this is only for illustrating the specific embodiment and not for limiting the order of the steps. On the contrary, in different embodiments of the present invention, the order of the steps can be adjusted according to actual needs.

[0028] In this invention, the various networks, modules, or units of the system according to the invention can be implemented using software, hardware, firmware, or a combination thereof. When a module is implemented using software, its function can be implemented through computer program flow. For example, the module can be implemented using code segments (such as code segments in languages like C and C++) stored in a storage device (such as a hard disk, memory, etc.), wherein the corresponding function of the module can be implemented when the code segment is executed by a processor. When a module is implemented using hardware, its function can be implemented by setting a corresponding hardware structure. For example, the module's function can be implemented by hardware programming a programmable device such as a field-programmable gate array (FPGA), or by designing an application-specific integrated circuit (ASIC) that includes multiple transistors, resistors, capacitors, and other electronic devices. When a module is implemented using firmware, the module's function can be written in the form of program code into a read-only memory such as an EPROM or EEPROM of the device, and the corresponding function of the module can be implemented when the program code is executed by a processor. In addition, some functions of the module may need to be implemented by separate hardware or by working in cooperation with the hardware. For example, the detection function is implemented by a corresponding sensor (such as a proximity sensor, accelerometer, gyroscope, etc.), the signal transmission function is implemented by a corresponding communication device (such as a Bluetooth device, infrared communication device, baseband communication device, Wi-Fi communication device, etc.), the output function is implemented by a corresponding output device (such as a display, speaker, etc.), and so on.

[0029] This invention aims to address the problems of high cost and difficult deployment of existing sea surface wave monitoring equipment, as well as the lack of spatiotemporal consistency and loss of high-frequency details in general visual models. This invention proposes a wave field reconstruction method and monitoring protocol based on monocular video, which can achieve high-density, physically accurate, and dynamically continuous wave height field reconstruction using ordinary monocular cameras without the need for expensive binocular arrays or lidar.

[0030] The technical solution proposed in this invention is not just a single neural network model, but a complete three-dimensional reconstruction and monitoring system for sea surface waves that integrates physical simulation environment construction, deep learning network architecture design, and a cross-domain transfer training strategy. Figures 1 and 2 show schematic diagrams of a wave field reconstruction and monitoring system based on monocular video according to an embodiment of the present invention. As shown in Figures 1 and 2, the wave field reconstruction and monitoring system based on monocular video includes: a high-fidelity ocean dynamics synthesis environment construction subsystem (WaveScapeGenerator) 101, a physically embedded spatiotemporal transformer-based wave reconstruction network (WaveFormer) 102, and a teacher-student distillation training framework 103. These three components logically form a closed-loop data-algorithm-application chain. The high-fidelity ocean dynamics synthesis environment construction subsystem, as the source of physical knowledge, encodes fluid dynamics laws into ground-truth data through the rendering pipeline; the physically embedded spatiotemporal transformer-based wave reconstruction network, as the feature carrier, learns these physical laws through a specific network topology; and the teacher-student distillation training framework, as a bridge for knowledge transfer, carries the geometric perception capabilities from the simulation environment over to real-world applications through parameter coupling.

[0031] The main key parameters in Figure 2 include the following. The input sequence length T is the number of video frames input to the network. T can be, for example, any natural number greater than 1, so that multiple frames are available for understanding the movement of waves over time.

[0032] The geometric parameters ($K$, $h$, $\theta$) comprise the camera's intrinsic parameters $K$ (focal length, etc.) and its extrinsic parameters, where $h$ denotes the mounting height and $\theta$ denotes the pitch angle.

[0033] The optimization objective of the physically embedded spatiotemporal transformer-based wave reconstruction network integrates multiple constraints, including absolute physical height, surface normal vector, and sea-sky mask. The loss function $\mathcal{L}_{total}$ is:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{height} + \beta \mathcal{L}_{normal} + \gamma \mathcal{L}_{mask}$$

where $\mathcal{L}_{height}$ denotes the absolute physical height loss, $\mathcal{L}_{normal}$ denotes the surface normal vector loss, and $\mathcal{L}_{mask}$ denotes the sea-sky mask loss; $\alpha$ is the attention weight assigned to the absolute physical height prediction task of waves, wave height prediction being the primary task; $\beta$ is the attention weight assigned to the surface normal vector task, used to constrain the local slope and smoothness of waves; and $\gamma$ is the attention weight assigned to the sea-sky segmentation task, used to accurately remove sky background interference.

[0034] The loss function $\mathcal{L}_{total}$ is the global objective function (or total error) that the physically embedded spatiotemporal transformer wave reconstruction network (WaveFormer) optimizes during the training phase. It is a comprehensive indicator of the overall difference between network predictions and ground truth, formed as a weighted combination of the losses of three sub-tasks: the absolute physical height loss $\mathcal{L}_{height}$, the surface normal vector loss $\mathcal{L}_{normal}$, and the sea-sky mask loss $\mathcal{L}_{mask}$. Its core function is to guide the update direction of all neural network weights. By introducing different attention weight coefficients (in specific embodiments, $\alpha = 1.0$, $\beta = 0.3$, and $\gamma = 0.3$ for the height, normal, and mask terms respectively), the loss function $\mathcal{L}_{total}$ forces the network to balance three constraints: accurately predicting absolute height, recovering smooth local slopes with high-frequency details, and precisely removing background sky interference. This prevents the network from overfitting to a single task during training and helps ensure that the output three-dimensional wave field is physically realistic in morphology.
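As a small worked example of the weighting, the sketch below combines three sub-task losses using the embodiment's coefficients 1.0, 0.3, and 0.3. The function name, the argument names, and the sample loss values are hypothetical.

```python
def total_loss(l_height, l_normal, l_mask, w_height=1.0, w_normal=0.3, w_mask=0.3):
    """Weighted sum of the height, normal, and sea-sky mask losses."""
    return w_height * l_height + w_normal * l_normal + w_mask * l_mask


# Hypothetical per-batch sub-task losses: 1.0*0.20 + 0.3*0.10 + 0.3*0.50
example = total_loss(0.20, 0.10, 0.50)
```

With these weights the height term contributes 0.20 of the total while the two auxiliary terms together contribute 0.18, so the primary task dominates the gradient unless an auxiliary loss becomes much larger.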

[0035] The absolute physical height loss $\mathcal{L}_{height}$ uses a masked L1 loss, so the height error is calculated only in the effective water surface area. The calculation formula is as follows:

$$\mathcal{L}_{height} = \frac{\sum_i M_i \left| \hat{H}_i - H_i \right|}{\sum_i M_i}$$

where $i$ indexes the pixels, $M_i$ is the effective water surface mask for the i-th pixel (1 for water surfaces and 0 for non-water surfaces), $\hat{H}_i$ is the predicted absolute height of the i-th pixel, and $H_i$ is the true absolute height of the i-th pixel.
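A masked L1 height loss of this kind can be sketched in NumPy as below. Normalizing by the number of valid water pixels and the small `eps` guard against an empty mask are assumptions of this sketch, as is the function name.

```python
import numpy as np


def masked_l1_height(pred, target, mask, eps=1e-8):
    """Mean absolute height error over pixels where mask == 1 (water surface).

    pred, target: (H, W) absolute heights in metres; mask: (H, W) of 0/1.
    eps avoids division by zero if a frame contains no water pixels (assumed).
    """
    mask = mask.astype(float)
    return float((mask * np.abs(pred - target)).sum() / (mask.sum() + eps))


pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.zeros((2, 2))
mask = np.array([[1, 0], [0, 1]])  # only the two diagonal pixels are water
height_loss = masked_l1_height(pred, target, mask)  # averages |1.0| and |4.0|
```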

[0036] The surface normal vector loss $\mathcal{L}_{normal}$ uses a masked L1 loss instead of the traditional cosine similarity. The calculation formula is as follows:

$$\mathcal{L}_{normal} = \frac{\sum_i M_i \left\| \hat{\mathbf{n}}_i - \mathbf{n}_i \right\|_1}{\sum_i M_i}$$

where $i$ indexes the pixels, $M_i$ is the effective water surface mask for the i-th pixel (1 for water surfaces and 0 for non-water surfaces), $\hat{\mathbf{n}}_i$ is the predicted surface normal vector of the i-th pixel, and $\mathbf{n}_i$ is the true surface normal vector of the i-th pixel. The reason for using a masked L1 loss instead of the traditional cosine similarity is that the sea surface tends to be horizontal, and the cosine loss suffers from vanishing gradients for small angular deviations, whereas the L1 loss imposes a stricter linear penalty on small angular deviations, which is beneficial for recovering fine wave geometry.
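The vanishing-gradient argument can be checked numerically with a two-dimensional toy case: the true normal is vertical, (0, 1), and the predicted unit normal is tilted by an angle θ, (sin θ, cos θ). The setup and function names are assumptions chosen for illustration. Analytically, the cosine loss 1 − cos θ has slope sin θ, which tends to zero with θ, while the L1 loss |sin θ| + |cos θ − 1| has slope cos θ + sin θ for θ > 0, which stays close to 1.

```python
import math


def cosine_loss(theta):
    """1 - cos(angle) between the tilted prediction and the vertical true normal."""
    return 1.0 - math.cos(theta)


def l1_loss(theta):
    """L1 distance between (sin t, cos t) and the true normal (0, 1)."""
    return abs(math.sin(theta)) + abs(math.cos(theta) - 1.0)


def numeric_grad(f, x, h=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for deg in (0.1, 1.0, 5.0):
    theta = math.radians(deg)
    print(deg, numeric_grad(cosine_loss, theta), numeric_grad(l1_loss, theta))
```

For a one-degree tilt the cosine-loss slope is below 0.02 while the L1 slope remains above 0.9, which matches the linear-penalty argument made above.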

[0037] The sea-sky mask loss $\mathcal{L}_{mask}$ uses the binary cross-entropy (BCE) loss. The calculation formula is as follows:

$$\mathcal{L}_{mask} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log\left(1 - \sigma(z_i)\right) \right]$$

where $i$ indexes the pixels, $N$ is the total number of pixels, $y_i$ is the true value of the mask for the i-th pixel, $z_i$ is the logit predicted by the network for the i-th pixel, and $\sigma$ is the sigmoid activation function.
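Because the network outputs logits, the BCE loss is usually evaluated in a numerically stable form rather than by applying the sigmoid and the logarithm separately. The sketch below uses the standard identity max(z, 0) − z·y + log(1 + exp(−|z|)), which is algebraically equal to the per-pixel cross-entropy; the function name is hypothetical.

```python
import numpy as np


def bce_with_logits(logits, target):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(z, 0) - z*y + log1p(exp(-|z|)), which equals
    -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))] without overflow.
    """
    z = np.asarray(logits, dtype=float)
    y = np.asarray(target, dtype=float)
    per_pixel = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_pixel.mean())


logits = np.array([2.0, -1.0, 0.5])  # hypothetical sea-sky logits
labels = np.array([1.0, 0.0, 1.0])   # 1 = sea, 0 = sky (assumed convention)
mask_loss = bce_with_logits(logits, labels)
```

The stable form stays finite for very confident predictions, where a naive `log(sigmoid(z))` would evaluate `log(0)`.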

[0038] The high-fidelity marine dynamics synthetic environment construction subsystem 101 serves as the data foundation and physical prior source for the monocular video-based wave field reconstruction and monitoring system. Essentially, it is a parametric simulation platform based on computer graphics and fluid physics. Its core function is to overcome the practical impossibility of obtaining pixel-level absolute height ground truth in real marine environments. The subsystem 101 generates synthetic datasets with absolute physical height ground truth based on physical wave spectrum models and a physical rendering engine (Blender Cycles). It simulates a degree of volume absorption and scattering (i.e., turbidity) of seawater by adjusting the water transmission parameter (Transmission in the range [0.75, 0.85]); it simulates complex solar glare by loading a real-world high dynamic range image (HDRI) environment light map; and it programmatically generates dynamic surface foam layers (whitecaps) based on wave kinetic energy. These physical optical features provide crucial high-frequency texture supervision signals for the network.

[0039] Figure 3 shows a schematic diagram of the architecture of the high-fidelity marine dynamics synthetic environment construction subsystem according to an embodiment of the present invention. As shown in Figure 3, the subsystem 101 includes two key processing units: an ocean dynamics solution unit 201 based on empirical spectra and a multimodal data generation unit 202 based on physically based rendering (PBR).

[0040] The empirical spectrum-based ocean dynamics solution unit 201 incorporates the JONSWAP (Joint North Sea Wave Project) and Pierson-Moskowitz empirical wave spectrum models. By randomizing physical parameters such as wind speed, wind direction, spectral peakedness, and swell direction, this unit can programmatically generate nonlinear wave surface meshes that conform to the laws of hydrodynamics.
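As an illustrative, non-limiting sketch, the standard JONSWAP frequency spectrum and a randomized draw of its parameters can be written as follows. The sketch assumes NumPy and parameterizes the spectrum directly by peak frequency rather than by wind speed and fetch; the sampling ranges and function names are hypothetical, and the synthesis of a surface mesh from the spectrum is not covered.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def jonswap_spectrum(f, f_peak, alpha=0.0081, gamma=3.3):
    """One-sided JONSWAP frequency spectrum S(f) in m^2/Hz.

    f: frequencies in Hz (> 0); f_peak: peak frequency in Hz.
    alpha: Phillips constant; gamma: peak enhancement factor.
    With gamma = 1 the expression reduces to the Pierson-Moskowitz shape.
    """
    f = np.asarray(f, dtype=float)
    sigma = np.where(f <= f_peak, 0.07, 0.09)
    r = np.exp(-((f - f_peak) ** 2) / (2.0 * sigma ** 2 * f_peak ** 2))
    pm = alpha * G ** 2 * (2.0 * np.pi) ** -4 * f ** -5 * np.exp(-1.25 * (f_peak / f) ** 4)
    return pm * gamma ** r

def sample_sea_state(rng):
    """Draw randomized spectral parameters (ranges are illustrative only)."""
    return {
        "f_peak": rng.uniform(0.08, 0.25),     # Hz
        "gamma": rng.uniform(1.0, 5.0),        # peakedness
        "wave_dir_deg": rng.uniform(0.0, 360.0),
    }

rng = np.random.default_rng(0)
state = sample_sea_state(rng)
freqs = np.linspace(0.03, 1.0, 400)
S = jonswap_spectrum(freqs, state["f_peak"], gamma=state["gamma"])

# Zeroth spectral moment (rectangle rule) and the corresponding wave height.
m0 = float(np.sum(S) * (freqs[1] - freqs[0]))
print("significant wave height of this draw (m):", 4.0 * np.sqrt(m0))
```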

[0041] The multimodal data generation unit 202 based on physically based rendering simulates light transport in the real physical world, including Fresnel reflection, refraction, and the scattering characteristics of sea foam. This unit not only renders realistic RGB video streams; more importantly, using the rendering engine's depth buffer (Z-buffer), it simultaneously outputs an absolute physical height map (Metric Height Map), a surface normal map (Surface Normal Map), and a sea-sky separation mask, all strictly aligned with the RGB pixels. These data collectively constitute a training benchmark with absolute physical scale.

[0042] The physically embedded spatiotemporal transformer-based wave reconstruction network 102 is the online execution engine of the wave field reconstruction and monitoring system based on monocular video proposed in this invention. It receives the monocular video stream and the camera geometric parameters, converts the video stream into a 3D sea surface model in real time, and outputs the absolute wave field in metric units. Structurally, network 102 abandons the traditional indirect parallax-to-depth calculation and adopts an end-to-end direct regression architecture. Internally, network 102 is subdivided into four core functional modules: the geometrically aware feature encoding module (Geometric-Aware Encoder), the decoupled spatiotemporal transformer module (Factorized Spatio-Temporal Transformer), the physical prior-guided wave refinement module (Wave Refinement Module, WRM), and the multi-task geometric decoding module (Multi-Task Geometric Decoder).

[0043] Figure 4 shows a schematic diagram of the architecture of the physically embedded spatiotemporal transformer wave reconstruction network according to an embodiment of the present invention. As shown in Figure 4, the feature embedding dimension (e.g., 128, 256, or 512) represents the richness of the features. There are multiple attention heads: some are responsible for tracking the direction in which wave crests move over time, and others are responsible for the wavefront topology within a single frame. These attention heads work together to capture complex fluid dynamics.

[0044] The geometrically aware feature encoding module is responsible for extracting visual features from video frames. It is located at the input of the physically embedded spatiotemporal transformer wave reconstruction network 102 and, unlike conventional encoders, integrates a geometric embedding layer. In addition to the T frames of RGB images, it takes as a synchronous input a geometric parameter vector containing the camera's focal length, principal point coordinates, mounting height, and pitch angle. The module encodes the camera's intrinsic parameters (focal length, optical center, etc.) and extrinsic parameters (mounting height, pitch angle, etc.) into high-dimensional vectors and explicitly injects them into the visual features. This design enables the network to learn the mapping between image pixel intensity and real physical dimensions (metric units).

[0045] In one embodiment of the present invention, the geometrically aware feature encoding module maps these geometric parameters into geometric embeddings with the same dimension as the image features using a multilayer perceptron (MLP), and explicitly fuses them into the visual feature map through element-wise summation or concatenation. The benefit of this structural design is that it resolves the scale ambiguity of monocular vision. Traditional monocular depth estimation networks typically learn only relative depth, because the apparent size of objects in an image varies with distance and absolute reference points are lacking. By explicitly injecting the absolute physical quantity of camera mounting height into the feature layer, the weight distribution of the wave reconstruction network is modulated by the geometric parameters during convolution operations. This forces the wave reconstruction network to establish a nonlinear mapping from pixel brightness and texture gradient to absolute physical height. The solution provided by the present invention can therefore directly output wave height in metric units (meters), without the cumbersome scale calibration steps that traditional optical flow or stereo vision methods subsequently require.
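As an illustrative, non-limiting sketch, the mapping of camera parameters to an embedding and its fusion with the visual features by summation or concatenation can be expressed as follows. The sketch assumes NumPy, a two-layer MLP with random (untrained) weights, a six-element parameter vector, and a simple input scaling; all names, sizes, and values are hypothetical.

```python
import numpy as np

def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a two-layer MLP (untrained, for shape illustration)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)), "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)), "b2": np.zeros(out_dim),
    }

def geometry_embedding(params, geom):
    """Map a camera parameter vector to a C-dimensional embedding."""
    hidden = np.maximum(geom @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]

def inject_geometry(features, embedding, mode="sum"):
    """Fuse an embedding of shape (C,) into features of shape (T, C, H, W)."""
    emb = embedding[None, :, None, None]
    if mode == "sum":
        return features + emb                        # broadcast over T, H, W
    if mode == "concat":
        tiled = np.broadcast_to(emb, features.shape)
        return np.concatenate([features, tiled], axis=1)
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
T, C, H, W = 10, 64, 8, 8
# Hypothetical values: fx, fy, cx, cy (pixels), mounting height (m), pitch (rad).
geom = np.array([1200.0, 1200.0, 640.0, 360.0, 8.0, 0.35])
geom_scaled = geom / np.array([1000.0, 1000.0, 1000.0, 1000.0, 10.0, 1.0])

mlp = init_mlp(rng, in_dim=6, hidden_dim=32, out_dim=C)
feats = rng.normal(size=(T, C, H, W))
embedding = geometry_embedding(mlp, geom_scaled)
fused_sum = inject_geometry(feats, embedding, "sum")
fused_cat = inject_geometry(feats, embedding, "concat")
print(fused_sum.shape, fused_cat.shape)
```

Summation keeps the channel count unchanged, whereas concatenation doubles it and leaves the mixing of geometry and appearance to the following layers.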

[0046] The decoupled spatiotemporal transformer module, composed of alternately stacked temporal and spatial attention modules, is the core component for capturing the dynamic patterns of waves. This module employs an attention mechanism that is decoupled in time and space. Specifically, the visual feature map first passes through a temporal attention module (Temporal Attention Block), which computes self-attention weights only along the time dimension, i.e., across consecutive time steps of the frame sequence at the same spatial pixel position, while holding the spatial dimension fixed. Immediately afterwards, the output features pass through a spatial attention module (Spatial Attention Block), which computes self-attention only within the spatial dimension of a single frame, while holding the temporal dimension fixed. These two modules are stacked alternately, forming a serial processing pipeline.
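As an illustrative, non-limiting sketch, attention that alternates between the time axis and the spatial axis can be written as follows. The sketch assumes NumPy, a single attention head, identity query/key/value projections, and residual connections around each block; these simplifications and all names are assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention over the second-to-last axis.

    tokens: (..., L, D). Identity Q/K/V projections are used for brevity.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def temporal_attention(x):
    """x: (T, N, D). Attend across the T frames, separately per spatial token."""
    per_location = np.swapaxes(x, 0, 1)               # (N, T, D)
    return np.swapaxes(self_attention(per_location), 0, 1)

def spatial_attention(x):
    """x: (T, N, D). Attend across the N spatial tokens, separately per frame."""
    return self_attention(x)

def factorized_block(x):
    x = x + temporal_attention(x)   # residual connection (assumption)
    x = x + spatial_attention(x)
    return x

rng = np.random.default_rng(0)
T, N, D = 10, 16, 32                # frames, spatial tokens per frame, channels
x = rng.normal(size=(T, N, D))
y = factorized_block(factorized_block(x))   # two alternating stacks
print(y.shape)
```

In this sketch the temporal block never mixes different spatial positions and the spatial block never mixes different frames; information spreads across the whole sequence only through the alternation of the two.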

[0047] The rationale for this alternating structure lies in the physical properties of fluid waves: the motion of ocean waves is highly correlated in space and time (i.e., it follows the propagation characteristics of the wave equation).

[0048] The role of temporal attention: By establishing long-range dependencies along the time axis, the wave reconstruction network can infer the direction and speed of wave propagation. Viewed in a single frame, a wave crest might be misjudged as noise or a reflection; by considering how it evolves across the preceding and following frames, the wave reconstruction network can confirm that it is a moving physical wave packet. This mechanism addresses at its source the flickering and jittering commonly found in traditional frame-by-frame prediction.

[0049] The role of spatial attention: The sea surface is a continuous surface, and the shape of waves is globally constrained by gravity and surface tension. The spatial attention mechanism allows the wave reconstruction network to infer local geometry from global texture information, so that a reasonable surface shape can be inferred from the surroundings even in areas with weak texture (such as calm water).

[0050] Compared to using 3D convolutions (3D-CNN), this decoupled Transformer structure significantly reduces computation while maintaining the global receptive field, making real-time inference possible on edge devices.
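As an illustrative, non-limiting sketch, the saving from decoupling can be gauged by counting query-key pairs. The comparison below is against joint attention over all space-time tokens rather than against a 3D-CNN, and the token grid size is a hypothetical choice.

```python
def attention_pairs_joint(T, H, W):
    """Query-key pairs when every space-time token attends to every other."""
    n = T * H * W
    return n * n

def attention_pairs_factorized(T, H, W):
    """Pairs when each token attends to T frames, then to H*W positions."""
    n = T * H * W
    return n * T + n * H * W

T, H, W = 10, 32, 32   # 10 frames on a hypothetical 32 x 32 token grid
joint = attention_pairs_joint(T, H, W)
factorized = attention_pairs_factorized(T, H, W)
print(joint, factorized, round(joint / factorized, 2))
```

For this grid the decoupled form evaluates roughly one tenth of the pairs, and the gap widens as the number of frames grows.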

[0051] Spatiotemporal consistency can be evaluated, for example, by pixel-level temporal coherence analysis. This involves plotting the absolute height at fixed pixels of the predicted wave field as a curve over time and comparing it with the ground truth. For a model with good consistency, the curve closely matches the phase and amplitude of the true wave propagation and is smooth and continuous. The reduction of non-physical jitter shows up as the absence of abrupt high-frequency spikes on the curve and as the smooth propagation of wave crests along a specific direction across consecutive video frames, without the intermittent flickering artifacts seen in single-frame predictions.
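As an illustrative, non-limiting sketch, such a pixel-level analysis can be carried out on synthetic data as follows. The sketch assumes NumPy and height sequences of shape (T, H, W); the jitter measure (mean absolute second difference of the trace), the 30 FPS frame rate, and the test signals are assumptions chosen for illustration.

```python
import numpy as np

def pixel_trace(height_seq, row, col):
    """Height over time at one fixed pixel; height_seq has shape (T, H, W)."""
    return height_seq[:, row, col]

def temporal_coherence(pred_seq, true_seq, row, col):
    """Compare predicted and true height traces at a fixed pixel."""
    p = pixel_trace(pred_seq, row, col)
    g = pixel_trace(true_seq, row, col)
    return {
        "pearson": float(np.corrcoef(p, g)[0, 1]),             # phase agreement
        "jitter": float(np.mean(np.abs(np.diff(p, n=2)))),      # frame-to-frame spikes
        "ref_jitter": float(np.mean(np.abs(np.diff(g, n=2)))),  # same measure on truth
    }

t = np.arange(100) / 30.0   # 100 frames at an assumed 30 FPS
truth = np.broadcast_to(0.5 * np.sin(2 * np.pi * 0.2 * t)[:, None, None], (100, 4, 4))

rng = np.random.default_rng(0)
smooth_pred = truth * 0.95                              # consistent, slightly damped
flicker_pred = truth + rng.normal(0.0, 0.1, truth.shape)  # independent per-frame noise

smooth = temporal_coherence(smooth_pred, truth, 2, 2)
flicker = temporal_coherence(flicker_pred, truth, 2, 2)
print(smooth)
print(flicker)
```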

[0052] The Physics-Prior-Guided Wave Refinement Module is an innovative component located at the end of the network decoder, specifically designed to address the loss of high-frequency details caused by the inherent low-pass filtering effect in deep neural networks. Embedded after the upsampling layer of the decoder, the Physics-Prior-Guided Wave Refinement Module employs a special residual structure. Essentially, it is a residual connection with physical prior constraints.

[0053] Figure 5 shows a detailed structural block diagram of the physical prior-guided wave refinement module according to an embodiment of the present invention. As shown in Figure 5, let the input feature map be $F_{in} \in \mathbb{R}^{C \times H \times W}$, which in computer vision feature extraction is a three-dimensional tensor. C is the number of feature channels, which carry the various extracted features; H is the height of the feature map, corresponding to the number of pixel rows in the vertical direction of the image or feature space; W is the width of the feature map, corresponding to the number of pixel columns in the horizontal direction of the image or feature space. The input feature map $F_{in}$ is processed along two paths: the main path (Identity Path), which passes the features through directly and preserves the original feature information; and the physical sharpening path. In the physical sharpening path, a Laplacian convolution layer (Laplacian ConvLayer) with frozen parameters is deployed, whose kernel weights are hard-coded as a second-order differential operator (e.g., [[0,1,0],[1,-4,1],[0,1,0]]). The output of this convolutional layer is multiplied by a learnable channel-level gain coefficient and combined with the main path. The output feature $F_{out}$ is calculated as follows:
$$F_{out} = F_{in} + \alpha \odot \mathrm{DWConv}(F_{in};\, K_{lap})$$
where $\mathrm{DWConv}$ denotes depthwise convolution; $K_{lap}$ denotes the strictly frozen second-order Laplacian differential operator (e.g., [[0,1,0],[1,-4,1],[0,1,0]]), used to explicitly extract the high-frequency roughness of the wave surface; and $\alpha$ denotes the channel-wise learnable parameter, whose dimension matches the number of feature channels C, for example a [1, C, 1, 1] tensor.
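As an illustrative, non-limiting sketch, the two-path structure (identity path plus a frozen Laplacian depthwise convolution scaled by a per-channel gain) can be written as follows. The sketch assumes NumPy, a (C, H, W) feature map, and edge-replicating padding; the padding mode, the gain values, and the function names are assumptions.

```python
import numpy as np

# Frozen second-order differential kernel; never updated during training.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def depthwise_laplacian(features):
    """Apply the same 3x3 Laplacian kernel to every channel independently.

    features: (C, H, W). Edge-replicating padding is an assumption.
    """
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = features.shape[1:]
    out = np.zeros_like(features)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return out

def wave_refinement(features, alpha):
    """Identity path plus per-channel gain times the Laplacian response.

    alpha: (C,) learnable gains, one per feature channel.
    """
    features = np.asarray(features, dtype=float)
    return features + alpha[:, None, None] * depthwise_laplacian(features)

# A single-channel map with one sharp crest in the middle.
spike = np.zeros((1, 5, 5))
spike[0, 2, 2] = 1.0
refined = wave_refinement(spike, np.array([-0.5]))
print(refined[0])
```

With the kernel sign convention shown (center weight of -4), a negative gain accentuates the crest and a positive gain smooths it, so in this sketch the learned sign and magnitude of each channel's gain determine how strongly that channel is sharpened.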

[0054] When a conventional convolutional neural network is trained to minimize the overall mean squared error, it tends to output blurred, over-smoothed results, so that sharp crests and fine foam textures are filtered out. The Laplacian convolution operator, mathematically a second derivative, is well suited to extracting high-frequency abrupt changes (edges and roughness) in an image. Forcibly implanting this physical operator into the network, and not allowing the network to modify it, amounts to giving the network a permanent high-frequency extractor. The physical prior-guided wave refinement module does not simply add the high-frequency information back; instead it learns a coefficient $\alpha$ with dimension, for example, [1, C, 1, 1], meaning that the wave reconstruction network can learn different sharpening intensities for different feature channels (representing different wave frequencies or patterns). This gives the wave reconstruction network adaptive capability: it automatically increases $\alpha$ in areas with large waves (more high-frequency information) and reduces $\alpha$ in areas with swell (less high-frequency information). This approach restores microscopic details while avoiding the introduction of image noise. Rather than involving the entire convolution kernel in learning, this application freezes the 3×3 Laplacian kernel weights and leaves only the channel-level gain coefficients trainable. The resulting parameter space is very small (e.g., only 64, 128, or 256 parameters), which fundamentally reduces the risk of overfitting. Furthermore, the AdamW optimizer, combined with L2 regularization and gradient clipping, effectively prevents the channel-level gain coefficients from exploding numerically.

[0055] The multi-task geometric decoding module is responsible for decoding deep features into the final physical parameters. It contains three parallel prediction heads that output the absolute height field, the surface normal field (used to constrain local smoothness), and the sea-sky mask (used to remove interference from the sky background).

[0056] The teacher-student distillation training framework 103 serves as a bridge connecting virtual simulation and real-world applications. It belongs to the model training and generation system of the wave field reconstruction and monitoring system based on monocular video proposed in this invention. The teacher-student distillation training framework 103 consists of two asymmetric network entities: the teacher network and the student network.

[0057] The teacher network is a high-capacity network built on a large-scale pre-trained vision model (such as DINOv2) and trained with full supervision on the synthetic dataset, giving it extremely strong wave geometry perception capabilities.

[0058] The student network is a lightweight network, namely the aforementioned physically embedded spatiotemporal transformer wave reconstruction network 102.

[0059] Figure 6 shows a schematic diagram of the teacher-student distillation training framework according to an embodiment of the present invention. As shown in Figure 6, the working logic of the teacher-student distillation training framework 103 lies in establishing a cross-domain pseudo-label generation and transfer mechanism. When faced with unlabeled real sea surface videos, the teacher network generates pseudo-labels, and the student network is guided by a loss function to learn the light and shadow distribution characteristics of waves in the real image domain. This framework enables the technical solution of this invention to achieve high-precision reconstruction of the real sea surface even without any real sensor data (such as LiDAR) as a reference.

[0060] The teacher-student distillation training framework 103 establishes a unidirectional supervision connection during the training phase. The teacher network is in inference mode (Eval Mode), and its weights are frozen. Real-world sea surface video is input to the teacher and student networks simultaneously. The height map output by the teacher network is treated as the pseudo-label, and the L1 loss is computed between it and the prediction of the student network. A key structural constraint is that, during this training, the weights of the student network's normal prediction head and mask prediction head are frozen, and only the weights of the encoder and the height prediction head are updated by backpropagation.
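As an illustrative, non-limiting sketch, one distillation update with frozen auxiliary heads can be written as follows. The sketch assumes NumPy, a toy linear student (a linear encoder followed by a linear height head), a fixed array standing in for the frozen teacher's output, and plain gradient descent in place of the actual optimizer; all names and values are hypothetical.

```python
import numpy as np

def student_height(params, frames):
    """Toy student: linear encoder followed by a linear height head."""
    feats = frames @ params["encoder"]
    return feats, feats @ params["height_head"]

def distill_step(params, frames, pseudo_height, lr,
                 trainable=("encoder", "height_head")):
    """One L1 distillation update; heads not listed in `trainable` stay frozen."""
    feats, pred = student_height(params, frames)
    loss = float(np.mean(np.abs(pred - pseudo_height)))
    d_pred = np.sign(pred - pseudo_height) / pred.size   # subgradient of mean L1
    grads = {
        "height_head": feats.T @ d_pred,
        "encoder": frames.T @ np.outer(d_pred, params["height_head"]),
    }
    for name in trainable:
        params[name] = params[name] - lr * grads[name]
    return loss

params = {
    "encoder": np.full((3, 2), 0.1),
    "height_head": np.ones(2),
    "normal_head": np.ones((2, 3)),   # frozen during distillation
    "mask_head": np.ones(2),          # frozen during distillation
}
frames = np.eye(3)             # three toy "pixels" with one-hot inputs
pseudo_height = np.ones(3)     # stands in for the frozen teacher's height output

frozen_before = {k: params[k].copy() for k in ("normal_head", "mask_head")}
loss_before = distill_step(params, frames, pseudo_height, lr=0.1)
_, pred_after = student_height(params, frames)
loss_after = float(np.mean(np.abs(pred_after - pseudo_height)))
print(loss_before, loss_after)
```

The teacher appears only through the fixed `pseudo_height` array, which reflects that supervision flows in one direction and no gradient reaches the teacher.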

[0061] It should be noted that the quality of the pseudo-labels is ensured through structural constraints and a final real-world evaluation. When generating pseudo-labels, the teacher network uses the DINOv2 large vision model as its backbone, which has strong representational capabilities. To prevent unknown reflective noise in the real-world environment from contaminating training through the pseudo-labels, the student network freezes its auxiliary branches (normals and mask) during distillation training, so that only the encoder and the height prediction head are updated. The final effectiveness of the pseudo-labels was quantitatively verified on the WASS real sea surface test set. Although the pseudo-labels contain a certain amount of error, freezing the auxiliary branches and using a low learning rate keeps the fine-tuning and adaptation of the network to real-world data from being biased.

[0062] The improvement in this step lies in structured knowledge transfer. There is a significant domain gap between real and synthetic sea surface images (e.g., the light absorption rate of real water bodies, the complexity of sky reflection). Directly applying a model trained on synthetic data to real-world scenes will fail due to the different texture distributions.

[0063] However, the geometric patterns of waves (such as the relationship between wavelength and wave height, and the sharpness of wave crests) are shared between virtual and real environments. The teacher network, trained in a virtual world with perfect ground truth, possesses strong discriminative power regarding geometric structures. Distillation effectively forces the student network to extract the same wave geometry that the teacher network sees under ideal conditions, regardless of the complexity of real-world lighting. Freezing the auxiliary branches (which predict normals and masks) further restricts the student network's search space and prevents it from overfitting to noise in real-world data (such as strong sunlight reflections), thus preserving the model's generalization ability even without real ground truth.

[0064] Figure 7 shows a flowchart of a wave field reconstruction and monitoring method based on monocular video according to an embodiment of the present invention. As shown in Figure 7, the method is carried out using the monocular video-based wave field reconstruction and monitoring system and includes the following steps: Step 701: Generate a synthetic dataset with absolute physical height ground truth through the high-fidelity marine dynamics synthetic environment construction subsystem.

[0065] Step 702: Construct a physical embedding-based spatiotemporal transformer wave reconstruction network, which includes a geometrically aware feature encoding module, a decoupled spatiotemporal transformer module, a physical prior-guided wave refinement module, and a multi-task geometric decoding module.

[0066] Step 703: Use the synthetic dataset to pre-train the teacher network to obtain the pre-trained teacher network.

[0067] Step 704: Use the pre-trained teacher network to generate pseudo-labels for unlabeled real sea surface videos, and train the student network, which is the physical embedding-based spatiotemporal transformer wave reconstruction network, by updating its parameters according to the loss function.

[0068] Step 705: Deploy the trained student network on the target platform, receive real-time monocular video streams and camera geometric parameters, and output wave field.

[0069] The system and / or method for wave field reconstruction and monitoring based on monocular video proposed in this invention can be applied to the following scenarios: 1. Marine Robots and Autonomous Unmanned Systems: Provide unmanned surface vessels (USVs) and autonomous underwater vehicles (AUVs) with environmental awareness capabilities for near-surface operations, assisting them in achieving automatic berthing, hovering, and path planning and maneuver control in complex sea conditions.

[0070] 2. Maritime Operation Safety Assistance: Applied to offshore operation platforms or maintenance vessels, it monitors key parameters such as significant wave height (SWH) in real time. Especially in high-risk operations such as crew transfer and resupply, it provides operators with a basis for deciding whether sea conditions meet safe operation standards (such as wave height limits).

[0071] 3. Special Amphibious Missions: Provide wave micro-topography data for amphibious landing missions in shallow waters and on beaches; at the same time assist low-altitude aircraft in determining the dynamic altitude of the sea surface to prevent flight safety accidents caused by sudden changes in waves.

[0072] 4. Marine engineering monitoring: Used for design verification and long-term health monitoring of coastal defense facilities and offshore structures, replacing expensive traditional radar or buoy arrays as a low-cost, easy-to-deploy distributed sea state monitoring solution.

[0073] Example 1: The wave field reconstruction and monitoring system and / or method based on monocular video proposed in this invention can be applied to safety auxiliary monitoring of boarding operations on offshore wind power maintenance vessels.

[0074] The application scenario for Example 1 is as follows: A maintenance vessel is attempting to approach the tower base of an offshore wind turbine, preparing to transfer maintenance personnel via a boarding ladder. The on-site lighting conditions are complex, including overcast skies, strong sea surface reflections, and broken foam caused by the disturbance of the wind turbine foundation. Operational safety standards require the significant wave height (SWH) to be less than 1.5 meters.

[0075] The configuration of Example 1 is as follows: the hardware deployment is to fix a monitoring camera on the side of the bridge of the maintenance vessel (about 8.0 meters above the water surface), with the field of view covering the water area connecting the ship's side and the tower base; the system configuration is to load the student network weights after simulation and real-world distillation fine-tuning.

[0076] In the operation of Example 1, the optimal working state of the system and/or method for wave field reconstruction and monitoring based on monocular video proposed in this invention is as follows. Cross-domain interference resistance: Although the ambient lighting is dim and there are complex foam textures (which may not be fully covered in the virtual training data), thanks to the teacher-student distillation strategy the network has learned to ignore changes in lighting and focus on extracting the geometric structure of the waves. It correctly identifies the foam as high-frequency texture on the wave surface rather than misjudging it as an abrupt change in height.

[0077] Detail Restoration: At this point, the physics-guided wave refinement module is operating at its best. The fixed Laplace operator precisely captures the subtle wave fragments and capillary waves near the pile foundation. Combined with channel-level adaptive coefficients, the network clearly reconstructs the sharp peak shapes of the breaking waves, rather than a blurry mess.

[0078] Operation decision: The system calculates the significant wave height (SWH) within the current field of view in real time. Although the sea surface appears to be undulating to the naked eye, the system's quantitative analysis shows that the SWH is 1.3 meters and the wave period is relatively long, which falls within the safe operation window.
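As an illustrative, non-limiting sketch, a significant wave height can be derived from a reconstructed height field and compared with the 1.5 meter limit of this example. The original text does not state how SWH is computed; the sketch assumes the common spectral definition (four times the standard deviation of the surface elevation), NumPy, and a synthetic regular wave as input.

```python
import numpy as np

def significant_wave_height(height_field):
    """Estimate SWH as 4 x the standard deviation of surface elevation (meters)."""
    eta = np.asarray(height_field, dtype=float)
    return 4.0 * float(np.std(eta))

def operation_allowed(height_field, swh_limit_m=1.5):
    """True when the estimated SWH is below the operational limit."""
    return significant_wave_height(height_field) < swh_limit_m

# Synthetic height field: a regular wave of 0.46 m amplitude on a 256 x 256 grid.
x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
X, _ = np.meshgrid(x, x)
eta = 0.46 * np.sin(4.0 * X)

print(round(significant_wave_height(eta), 2), operation_allowed(eta))
```

In a deployment, the input would be the metric height field output by the network, and the estimate would typically be accumulated over a time window rather than taken from a single frame.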

[0079] The optimal performance of the wave field reconstruction and monitoring system and/or method based on monocular video proposed in this invention is as follows: Under these conditions, the technical solution proposed in this invention demonstrates its Sim-to-Real generalization capability and the detail recovery capability of the physical prior-guided wave refinement module. This shows that the proposed system can be deployed and used directly (zero-shot or few-shot) without expensive ground truth acquisition and retraining in specific sea areas, and can provide quantitative physical parameters that go beyond human visual judgment, supporting the safety of maritime operations. In benchmark tests on the real-world WASS binocular stereo vision dataset, the technical solution proposed in this invention, using only monocular input, achieved a mean absolute error (MAE) of 0.039 meters and a Pearson correlation coefficient of 0.681. This indicates that, under complex real-world sea conditions, the error of the significant wave height (SWH) output by the proposed system is very small, meeting the stringent safety standard for maritime boarding operations of controlling the SWH error to the centimeter level.
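As an illustrative, non-limiting sketch, the two metrics named above can be computed from a predicted and a reference height field as follows. The sketch assumes NumPy; the function names are hypothetical, and the sample values are toy numbers, not results from the WASS benchmark.

```python
import numpy as np

def mean_absolute_error(pred, ref):
    """Mean absolute difference between predicted and reference heights (meters)."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.mean(np.abs(pred - ref)))

def pearson_correlation(pred, ref):
    """Pearson correlation coefficient between flattened height fields."""
    p = np.asarray(pred, dtype=float).ravel()
    r = np.asarray(ref, dtype=float).ravel()
    p = p - p.mean()
    r = r - r.mean()
    return float((p @ r) / np.sqrt((p @ p) * (r @ r)))

# Toy reference heights and a prediction with a constant 5 cm offset.
reference = np.array([0.10, -0.20, 0.30, 0.00])
prediction = reference + 0.05

print(mean_absolute_error(prediction, reference),
      pearson_correlation(prediction, reference))
```

The toy case shows that the two metrics are complementary: a constant offset raises the MAE while leaving the correlation at one, so both are needed to judge amplitude and shape agreement.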

[0080] Example 2: The system and / or method for wave field reconstruction and monitoring based on monocular video proposed in this invention can be applied to wave field reconstruction of unmanned ships / drones equipped with stabilization gimbals.

[0081] The application scenario for Example 2 is as follows: An unmanned surface vessel (USV) or a low-altitude drone, equipped with a three-axis mechanical stabilization gimbal, is performing a maritime search and rescue or hydrological survey mission. The mission requires rapidly acquiring the topography of the sea area ahead in order to plan a route.

[0082] The configuration of Example 2 is as follows: for the hardware deployment, the camera is mounted on the stabilization gimbal, which filters out most of the severe high-frequency shaking of the hull or airframe and keeps the image stable; the data input is the real-time video stream together with the current camera attitude provided by the inertial measurement unit (IMU).

[0083] The optimal operating state of the wave field reconstruction and monitoring system and / or method based on monocular video proposed in this invention, as described in Example 2, is as follows: Sim-to-Real generalization advantages: In open ocean areas, water color and reflectivity may differ from the virtual environment used during training. Thanks to the teacher-student distillation strategy, the model maintains its sensitivity to geometric structure when facing real deep blue ocean or turbid nearshore waters, and its predictions do not fail due to differences in water color.

[0084] End-to-end real-time inference: When the computing power of the unmanned equipment is limited (such as Jetson Xavier or Orin), the technical solution of this invention has significant advantages over heavyweight networks such as 3D-CNN. It can output the wave height of the sea ahead in real time at 30 FPS, allowing the unmanned vessel to "see" the swell pattern ahead.

[0085] Dynamic capture capability: Although the camera is moving, the network can focus on the relative motion of the waves because the image is relatively stable. The model can accurately predict a large wave rising 50 meters ahead, assisting the unmanned vessel in slowing down or adjusting its wave-cutting angle in advance.

[0086] The optimal performance of the wave field reconstruction and monitoring system and / or method based on monocular video proposed in this invention is as follows: Under these conditions, the technical solution proposed in this invention demonstrates the advantages of lightweight real-time inference and strong generalization ability. The technical solution proposed in this invention solves the problem of difficult installation (baseline limitation) of traditional binocular vision on unmanned devices. As long as the video image does not jitter excessively, it can provide high-density 3D environmental perception data. Facing the high real-time requirements of mobile deployments, the technical solution proposed in this invention exhibits extremely high computational efficiency. With a 10-frame video sequence as input, the model parameter count is only 13.78M, and the single inference time is as low as 4.95 milliseconds. Compared to the traditional 3D U-Net architecture (inference time 73.89 milliseconds), the technical solution proposed in this invention improves the running speed by nearly 15 times, making it extremely suitable for edge computing platforms of unmanned vehicles with limited computing power.
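As an illustrative, non-limiting sketch, the figures quoted above can be cross-checked with simple arithmetic. The latency values are taken from the text; the 30 FPS frame budget is used only as a reference point, and whether the quoted latencies were measured on the edge platform itself is not stated and is not assumed here.

```python
PROPOSED_MS = 4.95            # single inference time quoted in the text
BASELINE_3D_UNET_MS = 73.89   # 3D U-Net inference time quoted in the text

def speedup(baseline_ms, proposed_ms):
    """How many times faster the proposed network is than the baseline."""
    return baseline_ms / proposed_ms

def max_inferences_per_second(latency_ms):
    """Upper bound on back-to-back inferences per second."""
    return 1000.0 / latency_ms

def frame_budget_fraction(latency_ms, fps):
    """Fraction of one frame interval consumed by a single inference."""
    return latency_ms / (1000.0 / fps)

print(round(speedup(BASELINE_3D_UNET_MS, PROPOSED_MS), 2))
print(round(max_inferences_per_second(PROPOSED_MS), 1))
print(round(frame_budget_fraction(PROPOSED_MS, fps=30), 3))
```

The ratio of the two latencies is just under 15, consistent with the "nearly 15 times" stated above.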

[0087] The wave field reconstruction and monitoring system based on monocular video proposed in this invention completely eliminates the complex binocular parallax calculation module from its structure, adopting a geometrically embedded monocular regression architecture. An explicit geometric parameter injection interface is designed at the bottleneck layer of the network encoder, directly mapping the camera's mounting height, pitch angle, and intrinsic parameters to the feature space. This improvement allows the system to operate with only a single ordinary monocular camera. Functionally, it no longer outputs unitless relative depth, but directly establishes a mapping between pixel intensity and physical metric height (meters). The technical benefits are a significant reduction in hardware cost and payload weight, enabling the system to be easily deployed on small unmanned aerial vehicles (UAVs) or unmanned surface vessels (USVs) without being affected by binocular calibration parameter drift caused by hull vibration, thus significantly improving the system's robustness and engineering applicability.

[0088] This invention proposes a wave field reconstruction and monitoring system based on monocular video that overcomes the over-smoothing drawback of deep learning regression tasks and accurately recovers microscopic sea surface topography. Existing depth estimation networks typically use standard convolutional layers for upsampling, which is mathematically equivalent to a low-pass filter and tends to smooth out high-frequency details. The technical solution of this invention incorporates a physical prior-guided wave refinement module (WRM) at the end of the decoder. This module contains a special parallel residual branch: the branch holds a fixed, non-trainable second-order Laplacian differential operator, and its output is multiplied by a channel-wise learnable alpha coefficient vector. Functionally, the fixed Laplacian operator forces the network to extract physically meaningful surface roughness and edge gradients, while the learnable alpha coefficients allow the network to adaptively determine how much high-frequency texture to inject based on the characteristic response of the current channel. The technical effect is that the design counteracts the smoothing bias of neural networks, so that the reconstruction results not only include macroscopic wave undulations but also clearly present high-frequency microstructures such as capillary waves and broken foam, which significantly improves the geometric realism of the wave height model.

[0089] This invention proposes a wave field reconstruction and monitoring system based on monocular video that solves the temporal flicker problem of single-frame prediction and generates a continuous wave field consistent with hydrodynamics. For dynamic fluid surfaces, traditional frame-by-frame prediction methods ignore the laws of wave propagation. The technical solution of this invention introduces a decoupled spatiotemporal transformer into its structure. This structure decomposes the attention mechanism along the physical dimensions of time and space, alternately stacking temporal and spatial attention modules. Functionally, the temporal attention module is specifically responsible for tracking the phase shift of waves along the time axis, while the spatial attention module is responsible for maintaining the wave surface topology within a single frame. The technical effect is that the model implicitly learns the hydrodynamic relations (such as the wave dispersion relation) and can use information from previous and subsequent frames to constrain the current frame. This makes the output wave height field highly consistent in time, eliminating non-physical flicker and jitter artifacts and providing reliable dynamic wave parameters for marine engineering.

[0090] This invention proposes a wave field reconstruction and monitoring system based on monocular video, overcoming the bottleneck of training with ground truth data from real sea surfaces and achieving zero-cost transfer from simulation to reality. In the field of data-driven marine remote sensing, obtaining pixel-aligned ground truth wave heights is almost an impossible task (requiring extremely expensive array-type LiDAR and difficult to implement over large areas). The technical solution of this invention combines graphics rendering technology with transfer learning technology. First, a large-scale synthetic dataset is constructed using a high-fidelity physics engine, providing the model with perfect physical supervision signals. Second, through a teacher-student distillation strategy, a teacher model pre-trained on virtual data guides the student model to adapt to the illumination distribution of the real sea surface. This strategy of using the virtual to assist the real produces unexpected technical effects: it allows ordinary monocular cameras to acquire physical measurement capabilities that were originally only available to active sensors through algorithmic compensation, without the need to collect any height-labeled training data in real sea areas, greatly reducing the threshold and cost of algorithm implementation.

[0091] The monocular video-based wave field reconstruction and monitoring system proposed by this invention improves computational efficiency and real-time performance, meeting the deployment requirements of edge computing. Compared with processing video data using computationally intensive 3D convolutional neural networks (3D-CNNs), the decoupled transformer structure adopted in this invention significantly reduces the number of parameters and the computational cost (FLOPs). This design enables high-frame-rate real-time inference on resource-constrained edge computing devices (such as industrial control computers aboard unmanned vessels) while still capturing long-range spatiotemporal dependencies, thus meeting the urgent need for real-time sea state monitoring in marine operations.

[0092] In one embodiment of the present invention, an electronic device is also provided, comprising a processor, a graphics card, and a memory. The memory is configured to store machine-readable instructions, the graphics card is configured to carry out the training in the monocular video-based wave field reconstruction and monitoring method, and the processor is configured to execute the machine-readable instructions. When the processor and/or graphics card executes the machine-readable instructions, the following processing steps are implemented: generating a synthetic dataset with absolute physical height ground-truth values through a high-fidelity ocean dynamics synthesis environment construction subsystem; constructing a physically embedded spatiotemporal transformer-based wave reconstruction network, which includes a geometry-aware feature encoding module, a decoupled spatiotemporal transformer module, a physical-prior-guided wave refinement module, and a multi-task geometric decoding module; pre-training a teacher network using the synthetic dataset to obtain a pre-trained teacher network; generating pseudo-labels for unlabeled real sea surface videos using the pre-trained teacher network, the student network being the physically embedded spatiotemporal transformer-based wave reconstruction network; updating the parameters of the student network according to a loss function and training the student network; and deploying the trained student network on a target platform to receive real-time monocular video streams and camera geometric parameters and to output the wave field.

[0093] The graphics card is preferably a model with a GPU compute capability higher than 5.0. Because the volume of training data is large, providing a graphics card significantly improves training speed.

[0094] The memory includes various media capable of storing machine-readable instructions, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0095] It is understood that, in addition to the memory and processor mentioned above, the computer system described above also includes other hardware and software components not listed in this specification. The specific components can be determined according to the model of the specific data processing equipment in different application scenarios, and will not be listed and described in detail in this specification.

[0096] In one embodiment of the present invention, a computer-readable storage medium is also provided, on which machine-readable instructions are stored. When executed by a processor, the machine-readable instructions perform the following processing steps: generating a synthetic dataset with absolute physical height ground-truth values through a high-fidelity ocean dynamics synthesis environment construction subsystem; constructing a physically embedded spatiotemporal transformer wave reconstruction network, which includes a geometry-aware feature encoding module, a decoupled spatiotemporal transformer module, a physical-prior-guided wave refinement module, and a multi-task geometric decoding module; pre-training a teacher network using the synthetic dataset to obtain a pre-trained teacher network; generating pseudo-labels for unlabeled real sea surface videos using the pre-trained teacher network, wherein the student network is the physically embedded spatiotemporal transformer wave reconstruction network; updating the parameters of the student network according to a loss function and training the student network; and deploying the trained student network on a target platform to receive real-time monocular video streams and camera geometric parameters and to output wave fields.

[0097] Although various embodiments of the present invention have been described above, it should be understood that they are presented by way of example only and not as limitations. It will be apparent to those skilled in the art that various combinations, modifications, and alterations can be made without departing from the spirit and scope of the invention. Therefore, the breadth and scope of the invention disclosed herein should not be limited by the exemplary embodiments disclosed above, but should be defined according to the technical solutions of the invention and their equivalents.

Claims

1. A wave field reconstruction and monitoring system based on monocular video, characterized in that it comprises: a high-fidelity marine dynamics synthetic environment construction subsystem, configured to generate synthetic datasets with absolute physical height ground-truth values based on a physical spectrum model and a physical rendering engine; a physically embedded spatiotemporal transformer-based wave reconstruction network, configured to receive a monocular video stream and camera geometry parameters and to output an absolute wave field in metric units; and a teacher-student distillation training framework, configured to pre-train a teacher network using the synthetic dataset and, for real sea surface videos, to have the teacher network generate pseudo-labels that guide the student network in learning the light and shadow distribution characteristics of waves.

2. The wave field reconstruction and monitoring system based on monocular video according to claim 1, characterized in that the high-fidelity marine dynamics synthetic environment construction subsystem includes: an empirical spectrum-based ocean dynamics solution unit, with built-in JONSWAP and Pierson-Moskowitz wave spectrum models, configured to generate nonlinear wave surface meshes conforming to hydrodynamic laws through randomized physical parameters; and a physically based rendering multimodal data generation unit, configured to simulate real light transport, render video streams, and output absolute physical height maps, surface normal maps, and sea-sky segmentation masks aligned pixel-by-pixel with the RGB frames.

3. The wave field reconstruction and monitoring system based on monocular video according to claim 1, characterized in that the physically embedded spatiotemporal transformer-based wave reconstruction network includes: a geometry-aware feature encoding module, configured to encode the input camera geometry parameters into a high-dimensional vector and inject it into the visual features of the video frames; a decoupled spatiotemporal transformer module, consisting of alternately stacked temporal attention modules and spatial attention modules and used to capture wave dynamics; a physical-prior-guided wave refinement module, configured to recover high-frequency details of waves; and a multi-task geometric decoding module, including multiple parallel prediction heads that output the absolute height field, the surface normal field, and the sea-sky mask, respectively.

4. The wave field reconstruction and monitoring system based on monocular video according to claim 3, characterized in that the geometry-aware feature encoding module maps the camera's focal length and/or principal point and/or installation height and/or pitch angle into geometric embedding vectors through a multilayer perceptron, and fuses them with the visual feature map by element-wise addition or channel concatenation.
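
As an illustrative, non-limiting sketch of the geometric embedding and the two fusion options recited in claim 4, the following Python code maps a camera parameter vector through a small multilayer perceptron and fuses the result with a feature map stored as `feat[channel][row][column]`. The layer sizes, the random weight initialization, the ReLU activation, and all function names are assumptions made for illustration only.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Create a two-layer perceptron with random weights (illustrative only)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w1": layer(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": layer(out_dim, hidden_dim), "b2": [0.0] * out_dim}


def linear(w, b, x):
    """Affine map: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def geometry_embedding(mlp, camera):
    """Map camera parameters to an embedding vector via linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in linear(mlp["w1"], mlp["b1"], camera)]
    return linear(mlp["w2"], mlp["b2"], hidden)


def fuse_add(feat, emb):
    """Element-wise addition: broadcast one embedding value over each channel."""
    assert len(feat) == len(emb), "embedding size must equal channel count"
    return [[[v + e for v in row] for row in ch] for ch, e in zip(feat, emb)]


def fuse_concat(feat, emb):
    """Channel concatenation: append one constant channel per embedding value."""
    h, w = len(feat[0]), len(feat[0][0])
    return feat + [[[e] * w for _ in range(h)] for e in emb]


# Hypothetical, pre-normalized inputs: focal length, principal point (x, y),
# installation height, pitch angle.
camera = [1.2, 0.5, 0.5, 0.8, 0.26]
embedding = geometry_embedding(make_mlp(5, 8, 3), camera)
```

Addition keeps the channel count unchanged but requires the embedding length to match it, whereas concatenation places no such constraint and increases the channel count by the embedding length.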

5. The wave field reconstruction and monitoring system based on monocular video according to claim 3, characterized in that, in the decoupled spatiotemporal transformer module, the temporal attention module calculates self-attention weights only along the time axis to model wave propagation; the spatial attention module calculates self-attention weights only within the spatial dimensions of a single frame to maintain the topology of the wave surface; and the temporal attention modules and spatial attention modules are stacked alternately to form a serial processing pipeline.

6. The wave field reconstruction and monitoring system based on monocular video according to claim 3, characterized in that the physical-prior-guided wave refinement module employs a residual structure comprising: a feature pass-through path, used to preserve the original feature information; and a physical sharpening path, comprising a parameter-frozen Laplacian convolutional layer for extracting high-frequency gradient information and a learnable channel-level gain coefficient for adaptively controlling the sharpening intensity; wherein the final output is the sum of the output of the feature pass-through path and the output of the physical sharpening path.
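
As an illustrative, non-limiting sketch of the residual structure recited in claim 6, the following Python code adds a gain-scaled, fixed-kernel Laplacian response to each feature channel. The specific 3x3 kernel (a negated 4-neighbour Laplacian, so that a positive gain sharpens), the replicate padding at the borders, and the function names are assumptions; the claim specifies only a frozen Laplacian layer, a per-channel learnable gain, and summation of the two paths.

```python
# Fixed (non-trainable) high-pass kernel; the sign convention is an assumption.
KERNEL = [[0.0, -1.0, 0.0],
          [-1.0, 4.0, -1.0],
          [0.0, -1.0, 0.0]]


def laplacian(channel):
    """Apply the frozen 3x3 kernel to one channel with replicate padding."""
    h, w = len(channel), len(channel[0])

    def at(y, x):
        # Clamp indices so border pixels reuse their nearest neighbour.
        return channel[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[sum(KERNEL[dy + 1][dx + 1] * at(y + dy, x + dx)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)]
            for y in range(h)]


def refine(feat, gains):
    """Residual refinement: pass-through path plus gain-scaled sharpening path."""
    assert len(feat) == len(gains), "one gain per channel"
    out = []
    for ch, g in zip(feat, gains):
        high = laplacian(ch)
        out.append([[v + g * d for v, d in zip(row, hrow)]
                    for row, hrow in zip(ch, high)])
    return out
```

With a gain of zero the module reduces to the identity, so the learnable gain controls how much high-frequency detail is added back to each channel.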

7. The wave field reconstruction and monitoring system based on monocular video according to claim 1, characterized in that, within the teacher-student distillation training framework: the teacher network is built on a large-scale pre-trained visual model and undergoes fully supervised training on the synthetic dataset; the student network is the physically embedded spatiotemporal transformer-based wave reconstruction network; during the distillation stage, the teacher network generates pseudo-labels for real sea surface videos, the output of the student network is constrained through a loss function, and the branch weights in the student network used to predict normals and masks are frozen; wherein the loss function [formula not reproduced in this text] combines the absolute physical height loss, the surface normal vector loss, and the sea-sky mask loss, with weights β (subscripts not reproduced in this text) assigned to the absolute physical height prediction task, the surface normal vector task, and the sea-sky segmentation task, respectively; the absolute physical height loss [formula not reproduced in this text] is computed over pixels i from the effective water surface mask of the i-th pixel, the predicted absolute height of the i-th pixel, and the true absolute height of the i-th pixel; the surface normal vector loss [formula not reproduced in this text] is computed over pixels i from the effective water surface mask of the i-th pixel, the predicted surface normal vector of the i-th pixel, and the true surface normal vector of the i-th pixel; and the sea-sky mask loss [formula not reproduced in this text] is computed over pixels i, where N is the total number of pixels, from the true mask value of the i-th pixel, the logit predicted by the network for the i-th pixel, and the sigmoid activation function.
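
As an illustrative, non-limiting sketch of how the three loss terms described in claim 7 could be computed and combined, the following Python code uses a masked mean absolute error for height, a masked cosine distance for normals, and a binary cross-entropy on logits for the sea-sky mask, summed with per-task weights. These specific functional forms, the parameter names, and the default weights are assumptions chosen to be consistent with the quantities the claim names (per-pixel water mask, predictions, true values, logits, sigmoid, pixel count N); they are not the formulas of the claim.

```python
import math


def height_loss(pred, target, water_mask):
    """Masked mean absolute height error over valid water pixels (assumed form)."""
    valid = sum(water_mask)
    total = sum(m * abs(p - t) for p, t, m in zip(pred, target, water_mask))
    return total / max(valid, 1)


def normal_loss(pred, target, water_mask):
    """Masked mean of (1 - cosine similarity) between normals (assumed form)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-12)

    valid = sum(water_mask)
    total = sum(m * (1.0 - cosine(p, t)) for p, t, m in zip(pred, target, water_mask))
    return total / max(valid, 1)


def mask_loss(logits, target):
    """Binary cross-entropy on logits, averaged over all N pixels (assumed form).

    Uses the stable identity -[y*log(s(z)) + (1-y)*log(1-s(z))]
    = max(z, 0) - z*y + log(1 + exp(-|z|)), where s is the sigmoid.
    """
    return sum(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
               for z, y in zip(logits, target)) / len(logits)


def total_loss(l_height, l_normal, l_mask, w_height=1.0, w_normal=1.0, w_mask=1.0):
    """Weighted sum of the three task losses; the weights are hypothetical."""
    return w_height * l_height + w_normal * l_normal + w_mask * l_mask
```

Masking the height and normal terms with the water surface mask keeps sky pixels from contributing to the geometric losses, while the mask term is averaged over every pixel.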

8. A wave field reconstruction and monitoring method based on monocular video, implemented using the wave field reconstruction and monitoring system based on monocular video as described in any one of claims 1 to 7, characterized in that it includes the following steps: generating, by the high-fidelity ocean dynamics synthetic environment construction subsystem, a synthetic dataset with absolute physical height ground-truth values; constructing a physically embedded spatiotemporal transformer-based wave reconstruction network, which includes a geometry-aware feature encoding module, a decoupled spatiotemporal transformer module, a physical-prior-guided wave refinement module, and a multi-task geometric decoding module; pre-training the teacher network using the synthetic dataset to obtain a pre-trained teacher network; generating pseudo-labels for unlabeled real sea surface videos using the pre-trained teacher network, wherein the student network is the physically embedded spatiotemporal transformer-based wave reconstruction network; updating the parameters of the student network according to the loss function and training the student network; and deploying the trained student network on the target platform to receive real-time monocular video streams and camera geometry parameters and to output wave fields.

9. An electronic device, characterized in that it comprises: a processor, configured to execute machine-readable instructions; a graphics card, configured to carry out the training in the wave field reconstruction and monitoring method based on monocular video as described in claim 8; and a memory, configured to store machine-readable instructions that, when executed by the processor and/or graphics card, perform the steps of the wave field reconstruction and monitoring method based on monocular video according to claim 8.

10. A computer-readable storage medium storing computer-readable instructions thereon, characterized in that, when the computer-readable instructions are executed by a processor, they perform the steps of the wave field reconstruction and monitoring method based on monocular video as described in claim 8.