A deep learning-based near-shore sea current flow velocity video measurement method and system
By using deep learning-based optical flow detection and a window self-attention model, the method achieves non-contact, wide-coverage, high-precision real-time monitoring of coastal current velocity. It addresses the high equipment cost, environment-sensitive accuracy, and difficulty of obtaining flow field distributions in traditional methods, and suits the current velocity monitoring needs of complex coastal environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG HAIQIXING MARINE TECH CO LTD
- Filing Date
- 2026-02-04
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional methods for measuring coastal current velocity are expensive, complex to operate, and their accuracy is greatly affected by the environment. Furthermore, it is difficult for them to obtain the distribution of flow fields over a wide area. Existing video flow measurement technology is insufficiently accurate and stable in complex coastal environments.
A deep learning-based video method for measuring nearshore ocean current velocity was adopted. Video image sequences were acquired through image acquisition equipment, and feature extraction and analysis were performed using an optical flow detection model and a window self-attention model, including confidence screening and feature fusion of optical flow features, to output the amplitude and direction of ocean current velocity.
It enables non-contact, wide-coverage, and high-precision real-time monitoring of ocean current velocity, improving the accuracy and stability of velocity prediction, adapting to complex environments, and reducing equipment deployment and maintenance costs.
Figure CN122244746A_ABST
Description
Technical Field
[0001] This invention relates to the field of marine monitoring technology, specifically to a deep learning-based video method and system for measuring nearshore ocean current velocity.
Background Technology
[0002] Currently, coastal video current velocity measurement is an important technical means in marine engineering, environmental monitoring, and coastal management. Traditional current velocity measurement methods suffer from high equipment costs, complex operation, and measurement accuracy that is strongly affected by the environment. With the rapid development of deep learning, video analysis-based current velocity measurement has provided a new solution for coastal flow field monitoring. Traditional coastal current velocity measurement mainly relies on physical sensors and invasive equipment, such as acoustic Doppler current profilers (ADCPs) and electromagnetic current meters. Although these methods are accurate, the equipment is expensive, difficult to deploy, costly to maintain, and susceptible to corrosion in the marine environment. In addition, traditional methods usually provide only point measurements, making it difficult to obtain large-scale flow field distributions and limiting a comprehensive understanding of coastal dynamic processes. Video current measurement, as a non-contact method, was initially applied mainly to river current velocity monitoring. Early video current measurement methods were mainly based on particle image velocimetry (PIV) and surface feature tracking algorithms, which estimate flow velocity by analyzing the motion trajectories of floating objects or artificially released tracer particles. However, these methods face many challenges in complex coastal environments, such as wave interference, changes in illumination, and a lack of tracers, resulting in insufficient measurement accuracy and stability.
Summary of the Invention
[0003] To address the aforementioned shortcomings, this invention discloses a video method for measuring nearshore ocean current velocity based on deep learning, which enables accurate monitoring of coastal current velocity.
[0004] The first aspect of this invention discloses a deep learning-based video method for measuring nearshore ocean current velocity, comprising: acquiring a video image sequence of the water area to be monitored using image acquisition equipment; extracting multiple consecutive frames from the video image sequence to form an image frame group; inputting the image frame group into an ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels; and associating the image frame group with the corresponding optical flow features as detection input features, and inputting the detection input features into a window self-attention model for processing to output predicted ocean current monitoring results, which include ocean current velocity amplitude and ocean current direction.
[0005] As an optional implementation, in a first aspect of the present invention, inputting the image frame group into an ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group includes: An optical flow extraction network is used to encode features of the image frame group to determine the optical flow features between adjacent frames in the image frame group. A feature alignment layer is added after the convolutional block of the optical flow extraction network. The feature alignment layer includes a 1×1 convolutional layer and a layer normalization layer. The optical flow extraction network is designed based on the SEA-RAFT architecture, and its feature extraction backbone network is ConvNeXt-V2.
[0006] As an optional implementation, in a first aspect of the present invention, the step of associating the image frame group and the corresponding optical flow features as detection input features, and inputting the detection input features into a window self-attention model for processing to output predicted ocean current monitoring results, includes: The image frame group and the extracted optical flow features are time-aligned and channel-sequentially concatenated to form a fused feature sequence; The fused feature sequence is input into an improved Video Swin Transformer network, wherein the improved Video Swin Transformer network is a window self-attention model. In the self-attention calculation mechanism of the Video Swin Transformer network, an attention mask derived from the confidence in the optical flow features is introduced to apply penalty weights to the features of low confidence regions. The fused features output by the Video Swin Transformer network are then used to output the corresponding ocean current monitoring results via a dual-branch prediction head.
[0007] As an optional implementation, in the first aspect of the present invention, inputting the fused feature sequence into the improved Video Swin Transformer network includes: processing the fused image-optical flow features with an optical flow orientation coding layer to generate orientation enhancement features; spatially downsampling and dimensionally embedding the orientation enhancement features using a Patch Embedding layer; processing the embedded features through a sequence of Video Swin Transformer blocks, where the first two stages use a confidence-guided attention mechanism; aggregating the features through a global spatiotemporal pooling layer; and outputting the velocity amplitude and direction vector via the dual-branch predictor.
[0008] As an optional implementation, in the first aspect of the present invention, the optical flow extraction network and the improved Video Swin Transformer network are trained in the following manner: A training dataset is constructed by synchronously collecting measured flow velocity data from a corresponding acoustic current meter and a sequence of continuous video frames obtained by an image acquisition device. The optical flow extraction network is used to extract pixel-level optical flow training features from adjacent frame pairs in the continuous video frame sequence. The optical flow training features include horizontal displacement components, vertical displacement components, and confidence levels. The continuous video frame sequence and the extracted optical flow training features are temporally aligned and concatenated to form a fused training feature sequence. The fused training feature sequence is input into the improved Video Swin Transformer network, wherein an attention mask derived from the confidence in the optical flow training features is introduced into the self-attention calculation mechanism of the Video Swin Transformer network to penalize feature interactions in low-confidence regions. The fused features output by the Video Swin Transformer network are used to predict the velocity amplitude and direction angle via the prediction head. The ocean optical flow detection model and the improved Video Swin Transformer network are optimized by a joint loss function until both networks meet the set requirements. The joint loss function includes an optical flow error term based on the optical flow features and their true values, a flow velocity prediction error term based on the predicted and measured flow velocities, and a loss term that constrains the consistency between the flow velocity derived from optical flow and the directly predicted flow velocity.
The optical flow extraction network and the improved Video Swin Transformer network are trained using a phased training strategy: In the first training phase, the network parameters of the ocean optical flow detection model are frozen, and only the improved Video Swin Transformer network and the subsequent prediction head are trained. In the second training phase, all network parameters are unfrozen, and the overall model is jointly fine-tuned at a learning rate lower than that in the first phase.
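The joint loss described in this paragraph can be sketched as a weighted sum of its three terms. This is only an illustration under stated assumptions: the source does not give the form of each term or any weights, so the use of mean squared error, the weights `w_flow`, `w_vel`, `w_cons`, and all function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("length mismatch")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(flow_pred, flow_true, vel_pred, vel_meas, vel_from_flow,
               w_flow=1.0, w_vel=1.0, w_cons=0.1):
    """Weighted sum of the three error terms (MSE and weights are assumptions)."""
    l_flow = mse(flow_pred, flow_true)      # optical flow vs. its true values
    l_vel = mse(vel_pred, vel_meas)         # predicted vs. measured flow velocity
    l_cons = mse(vel_from_flow, vel_pred)   # flow-derived vs. directly predicted
    return w_flow * l_flow + w_vel * l_vel + w_cons * l_cons
```

The consistency term is zero whenever the velocity derived from optical flow agrees with the directly predicted velocity, so it only penalizes disagreement between the two estimates.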
[0009] As an optional implementation, in the first aspect of the present invention, the method further includes: A bidirectional transfer mechanism is constructed between the optical flow extraction network and the improved Video Swin Transformer network; The bidirectional transmission mechanism includes: An optical flow extraction network processes adjacent image frame pairs to generate intermediate layer feature maps. The intermediate layer feature map is dimensionality reduced and then input into the improved Video Swin Transformer network as a priori for the motion region; The improved Video Swin Transformer network processes fused sequences containing original image and optical flow features; Feature maps are extracted from the intermediate layers of the improved Video Swin Transformer network and upsampled to a predetermined resolution; The upsampled feature map is fed back to the iterative optimization module of the optical flow extraction network; The optical flow estimation process is adjusted based on the feedback feature map.
[0010] As an optional implementation, in the first aspect of the present invention, after inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the method further includes: converting the pixel displacement components into physical velocity components using a scale conversion factor, where the scale conversion factor is the conversion factor between image pixels and physical space. After outputting the predicted ocean current monitoring results, the method further includes: applying a moving average to the prediction results at multiple consecutive time points.
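A moving average over consecutive predictions might look like the sketch below. The window length, the class name `VelocitySmoother`, and the choice to average velocity components rather than raw angles are assumptions; the source only states that a moving average is applied. Averaging components avoids wrap-around problems near 0°/360°.

```python
import math
from collections import deque


class VelocitySmoother:
    """Moving average over the last `window` predictions (window is hypothetical)."""

    def __init__(self, window=5):
        self.buffer = deque(maxlen=window)

    def update(self, amplitude, direction_deg):
        """Add one prediction and return the smoothed (amplitude, direction_deg)."""
        theta = math.radians(direction_deg)
        # Store the prediction as Cartesian components so angles average safely.
        self.buffer.append((amplitude * math.cos(theta), amplitude * math.sin(theta)))
        mean_x = sum(x for x, _ in self.buffer) / len(self.buffer)
        mean_y = sum(y for _, y in self.buffer) / len(self.buffer)
        smoothed_amp = math.hypot(mean_x, mean_y)
        smoothed_dir = math.degrees(math.atan2(mean_y, mean_x)) % 360.0
        return smoothed_amp, smoothed_dir
```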
[0011] A second aspect of this invention discloses a video measurement system for nearshore ocean current velocity based on deep learning, comprising: Acquisition module: Used to acquire video image sequences of the water area to be monitored through image acquisition equipment; Extraction module: used to extract multiple consecutive frames of images from the video image sequence as an image frame group; Optical flow detection module: used to input the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels; The result output module is used to associate the image frame group and the corresponding optical flow features as detection input features, and input the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, which include ocean current velocity amplitude and ocean current direction.
[0012] A third aspect of the present invention discloses an electronic device, comprising: a memory storing executable program code; a processor coupled to the memory; the processor calling the executable program code stored in the memory to execute the deep learning-based video measurement method for nearshore ocean current velocity disclosed in the first aspect of the present invention.
[0013] A fourth aspect of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the deep learning-based video measurement method for nearshore ocean current velocity disclosed in the first aspect of the present invention.
[0014] Compared with the prior art, the embodiments of the present invention have the following beneficial effects: The method in this embodiment of the invention achieves refined extraction and targeted analysis of ocean current features through hierarchical collaborative processing by the optical flow detection model and the window self-attention model. First, the optical flow detection model accurately captures the pixel displacement components and confidence levels between adjacent image frames, locking onto the dynamic motion trajectory of the ocean current at the level of low-level visual features. Then, the window self-attention model applies attention weighting to the associated image frame group and optical flow features, automatically focusing on the core effective features of ocean current motion and weakening background noise features. This avoids measurement deviations caused by single-feature analysis and significantly improves the accuracy of ocean current velocity amplitude and direction prediction.
Attached Figure Description
[0015] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0016] Figure 1 is a flowchart of the deep learning-based video measurement method for nearshore ocean current velocity disclosed in an embodiment of the present invention;
Figure 2 is a schematic diagram of the recognition process of the improved Video Swin Transformer network disclosed in an embodiment of the present invention;
Figure 3 is a schematic flowchart of a specific video method for measuring ocean current velocity disclosed in an embodiment of the present invention;
Figure 4 is a diagram of the existing SEA-RAFT network structure disclosed in an embodiment of the present invention;
Figure 5 is a network structure diagram of the improved SEA-RAFT disclosed in an embodiment of the present invention;
Figure 6 is a schematic diagram of the structure of the ConvNeXt-V2 Block disclosed in an embodiment of the present invention;
Figure 7 shows the end-to-end deeply coupled architecture of the improved Video Swin Transformer and SEA-RAFT disclosed in an embodiment of the present invention;
Figure 8 is a schematic diagram of a deep learning-based nearshore ocean current velocity video measurement system provided in an embodiment of the present invention;
Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0018] It should be noted that the terms "first," "second," "third," "fourth," etc., in the specification and claims of this invention are used to distinguish different objects, not to describe a specific order. The terms "comprising" and "having," and any variations thereof, in the embodiments of this invention are intended to cover non-exclusive inclusion. Exemplarily, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices.
[0019] Example 1
Referring to Figure 1, Figure 1 is a flowchart of the deep learning-based video measurement method for nearshore ocean current velocity disclosed in this invention. The execution entity of the method described in this embodiment is composed of software and/or hardware. This execution entity can receive relevant information via wired and/or wireless means and can send certain instructions. It may also have certain processing and storage functions. This execution entity can control multiple devices, such as remote physical servers or cloud servers and related software, or local hosts or servers and related software that perform related operations on devices located in a certain location. In some scenarios, multiple storage devices can also be controlled; these storage devices may be placed in the same location as the devices or in different locations. As shown in Figure 1 and Figure 3, this deep learning-based video method for measuring nearshore ocean current velocity includes the following steps:
S101: Acquire video image sequences of the water area to be monitored using image acquisition equipment;
S102: Extract multiple consecutive frames of images from the video image sequence as an image frame group;
S103: Input the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels;
S104: Associate the image frame group and the corresponding optical flow features as detection input features, and input the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, the ocean current monitoring results including ocean current velocity amplitude and ocean current direction.
[0020] This invention provides a deep learning-based method for measuring coastal current velocity via video, overcoming the core bottlenecks of existing coastal current velocity measurement technologies. Traditional contact-based methods (such as acoustic Doppler current meters) suffer from high deployment and maintenance costs, limited coverage, and susceptibility to interference from harsh environments such as wind, waves, and tides. Conventional non-contact video measurement methods lack a collaborative optimization mechanism for optical flow calculation and current velocity prediction, leading to accumulated errors and difficulty in simultaneously ensuring the accuracy of current velocity amplitude and direction predictions. Furthermore, they lack stability under complex hydrological and meteorological conditions (such as variable tidal states and weather interference), failing to meet the demands for high-precision real-time current velocity data in scenarios such as coastal engineering monitoring and flood warning. Therefore, this invention constructs a systematic data acquisition and calibration system, designs a deeply coupled fusion network, and innovatively introduces a multi-task collaborative optimization loss function to achieve non-contact, wide-coverage, and high-precision real-time measurement of coastal water current velocity, while simultaneously improving the model's adaptability to complex environments and the consistency of prediction results.
[0021] In practical implementation, the confidence index is introduced into the optical flow features to quantify the validity of pixel displacement. The window self-attention model can screen highly valid optical flow features based on the confidence index for subsequent processing, eliminate invalid / low-quality pixel displacement information, further improve the reliability and stability of ocean current monitoring results, and reduce the impact of outliers on the measurement results.
[0022] To address the characteristics of nearshore currents, such as complex flow fields, rapid velocity changes, and significant influence from waves, tides, and shoreline topography, this study employs continuous multi-frame image groups as the basis for analysis, rather than single-frame or dual-frame images. This approach captures the dynamic movement trends of ocean currents over short periods, enabling dynamic and continuous monitoring of nearshore currents. It adapts to the unsteady motion characteristics of nearshore currents and solves the problem that traditional single-frame analysis cannot reflect the continuity of ocean current movement.
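Splitting a video into multi-frame groups can be sketched as a sliding window over the frame list. The group size of 16 is an assumption, chosen so that each group yields the 15 adjacent-frame pairs mentioned later in the description; the `stride` parameter and function names are hypothetical.

```python
def frame_groups(frames, group_size=16, stride=16):
    """Yield lists of `group_size` consecutive frames.

    group_size=16 is an assumption chosen so that each group yields
    15 adjacent-frame pairs; stride controls overlap between groups.
    """
    for start in range(0, len(frames) - group_size + 1, stride):
        yield frames[start:start + group_size]


def adjacent_pairs(group):
    """Return the (frame_t, frame_t+1) pairs within one group."""
    return list(zip(group[:-1], group[1:]))
```

A stride smaller than the group size gives overlapping groups, which trades extra computation for more frequent velocity estimates.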
[0023] This deep learning-based end-to-end feature processing approach eliminates the need for manually designed ocean current feature extraction rules. The model learns from data to adapt to different nearshore water environments (such as varying water quality, lighting, and wave conditions), exhibiting strong environmental resilience and solving the problem of traditional manual feature design methods in complex nearshore environments. Using video image sequences as the monitoring data source, data acquisition can be achieved with conventional image acquisition equipment, eliminating the need to deploy underwater physical sensors (such as current meters and flow meters) in nearshore waters. This enables contactless ocean current monitoring, avoiding equipment damage and data drift issues caused by nearshore sediment deposition, marine organism attachment, and wave impact on physical sensors. It also significantly reduces the hardware costs of equipment deployment and maintenance.
[0024] More preferably, as shown in Figures 4 to 6, inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group includes: An optical flow extraction network is used to encode features of the image frame group to determine the optical flow features between adjacent frames in the image frame group. A feature alignment layer is added after the convolutional block of the optical flow extraction network. The feature alignment layer includes a 1×1 convolutional layer and a layer normalization layer. The optical flow extraction network is designed based on the SEA-RAFT architecture, and its feature extraction backbone network is ConvNeXt-V2.
[0025] The solution of this invention revolves around an optical flow extraction network based on the SEA-RAFT architecture, the ConvNeXt-V2 backbone network, and the addition of a feature alignment layer. It optimizes the core problems of optical flow feature extraction in nearshore ocean current video monitoring, such as low accuracy, feature misalignment, insufficient robustness under complex sea conditions, and low extraction efficiency. It achieves high accuracy, high stability, and high adaptability in acquiring optical flow features (pixel displacement components and confidence levels) from the bottom layer of feature extraction, laying a high-quality feature foundation for the subsequent prediction of ocean current velocity and direction using the window self-attention model.
[0026] Using ConvNeXt-V2 as the feature extraction backbone of the optical flow extraction network, its deep feature extraction capability based on a pure convolutional architecture can accurately capture fine-grained dynamic visual features of water flow in nearshore ocean current video images (such as pixel displacement of small flow velocities and gradual features of flow field boundaries). Compared with traditional backbone networks (such as ResNet and VGG), it can better adapt to the texture characteristics of ocean surface images, effectively avoid displacement component detection bias caused by shallow feature extraction, and improve the quantitative detection accuracy of pixel displacement components.
[0027] The optical flow extraction network is designed based on the SEA-RAFT architecture, which is specifically designed for feature matching and iterative optimization in optical flow detection. It can achieve adaptive feature aggregation and fine-grained flow field iteration. In specific implementation, an improved version of SEA-RAFT is used, which replaces ResNet with ConvNeXt-V2 to extract fine-grained features. The input to the front-end SEA-RAFT optical flow extraction sub-network (improved version) is 15 pairs of adjacent frames $(I_t, I_{t+1})$. The output is the corresponding 15 three-channel optical flow features $(u_{t \to t+1}, v_{t \to t+1}, c_{t \to t+1})$ (pixel-level displacement components plus confidence level), where $u$ and $v$ are converted into physical velocity components through $\alpha$ and $\Delta t$: $v_x = u \times \alpha / \Delta t$, $v_y = v \times \alpha / \Delta t$. Here $\alpha$ is the pixel-to-distance conversion coefficient, and $\Delta t$ is the time interval between adjacent frames.
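The pixel-to-physical conversion in this paragraph is a single scaling step. The sketch below assumes α is expressed in metres per pixel and Δt in seconds, which the source does not state; the function name and the sample values are hypothetical.

```python
def pixel_to_velocity(u, v, alpha, dt):
    """Convert a pixel displacement (u, v) between adjacent frames to velocity.

    alpha: pixel-to-distance coefficient (assumed metres per pixel).
    dt: time interval between adjacent frames (assumed seconds).
    Returns (v_x, v_y) = (u * alpha / dt, v * alpha / dt).
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return u * alpha / dt, v * alpha / dt


# Hypothetical example: 3 px right and 4 px down between frames 0.04 s apart,
# with each pixel covering 0.02 m of water surface.
vx, vy = pixel_to_velocity(3.0, 4.0, alpha=0.02, dt=0.04)
```

In practice α generally varies across the image for an oblique camera view, so a single scalar is only a reasonable simplification after the image has been rectified or over a small region.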
[0028] The iterative optimization framework is retained, and the original ResNetFPN feature extraction backbone is replaced with the lightweight ConvNeXt-V2 feature encoder. A feature alignment layer (1×1 convolution + LayerNorm) is added after each convolutional block to ensure that the output features match the input dimension of the Video Swin Transformer (256 channels). Its core advantage is that the number of parameters is reduced by 30%-40% while maintaining comparable feature extraction capabilities. ConvNeXt-V2 enhances the ability to capture fine-grained textures by replacing BatchNorm with LayerNorm and optimizing the inverted residual structure.
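The feature alignment layer can be illustrated at the level of a single pixel: a 1×1 convolution is a linear map across channels applied independently at each spatial position, and layer normalization then standardizes the resulting channel vector. The sketch below uses plain Python lists and omits LayerNorm's learned scale and shift for brevity; the weights and the small channel counts in the usage note are hypothetical.

```python
import math


def conv1x1(pixel, weight, bias):
    """1x1 convolution at one pixel: a linear map from C_in to C_out channels."""
    return [sum(w * x for w, x in zip(row, pixel)) + b
            for row, b in zip(weight, bias)]


def layer_norm(vec, eps=1e-6):
    """Normalize a channel vector to zero mean and unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def feature_alignment(pixel, weight, bias):
    """Apply the 1x1 convolution followed by layer normalization."""
    return layer_norm(conv1x1(pixel, weight, bias))
```

For example, a 2-channel pixel `[1.0, 2.0]` with a 4×2 weight matrix produces a 4-channel normalized vector; in the network described here the output width would be 256 channels.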
[0029] The solution of this invention features lightweight and efficient feature processing. Compared with traditional large convolutional kernels, the 1×1 convolutional layer significantly reduces the computational load and parameter count of the network while achieving feature fusion, avoiding computational redundancy caused by deep backbone networks. The pure convolutional architecture adopted by ConvNeXt-V2 itself has efficient parallel computing capabilities, which is suitable for the real-time processing requirements of video image sequences and solves the problems of low efficiency of optical flow detection in deep networks and inability to adapt to real-time monitoring.
[0030] More preferably, as shown in Figure 2 and Figure 7, the step of associating the image frame group and the corresponding optical flow features as detection input features, and inputting the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, includes:
S1041: The image frame group and the extracted optical flow features are time-aligned and concatenated along the channel dimension to form a fused feature sequence;
S1042: The fused feature sequence is input into the improved Video Swin Transformer network, wherein the improved Video Swin Transformer network is a window self-attention model, and in the self-attention calculation mechanism of the Video Swin Transformer network, an attention mask derived from the confidence in the optical flow features is introduced to apply penalty weights to the features of low-confidence regions;
S1043: The fused features output by the Video Swin Transformer network are passed through a dual-branch prediction head to output the corresponding ocean current monitoring results.
[0031] The solution of this invention addresses insufficient feature fusion, interference from invalid features, and single-dimensional monitoring results in nearshore ocean current monitoring. It achieves accurate, efficient, and targeted prediction of ocean current velocity amplitude and direction from multi-source features. It not only fully exploits the correlation between image frame groups and optical flow features, but also avoids interference from invalid features through multi-stage optimization, while ensuring that the output monitoring results are reliable and adaptable.
[0032] By associating image frame groups with optical flow features into a fusion feature sequence through temporal alignment and channel stitching, high-quality fusion of multi-source features is achieved from two dimensions, solving the problem of feature fusion failure caused by temporal misalignment and channel dimension mismatch in traditional feature stitching: temporal alignment ensures that the visual features of image frame groups and the motion features of optical flow features are completely matched in the time dimension, so that the visual features of each pixel position can correspond to its accurate displacement / confidence features, avoiding the disconnect between visual features and motion features caused by temporal deviation, and allowing the fusion features to truly reflect the motion state of nearshore currents at a specific time. Channel stitching deeply integrates the spatial visual features of an image with the motion quantization features of optical flow in the feature channel dimension without losing any effective feature information from either side. It achieves complementarity between spatial texture features and dynamic motion features, allowing subsequent window self-attention models to analyze both the visual representation and motion patterns of ocean currents simultaneously. This fully explores the correlation value of multi-source features and lays a high-quality feature foundation for accurate prediction.
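Temporal alignment followed by channel concatenation can be sketched with nested lists. A group of T+1 frames yields T flow fields, so some pairing rule is needed; pairing the flow for (frame t, frame t+1) with frame t and dropping the last frame is an assumption, as the source does not specify it. The function name is hypothetical.

```python
def fuse(frames, flows):
    """Concatenate each frame's channels with the flow computed from that frame.

    frames: T+1 items, each a list of image channels (e.g. [R, G, B]).
    flows:  T items, each a list [u, v, c] for the pair (frame t, frame t+1).
    Pairing flow t with frame t and dropping the last frame is an assumption.
    """
    if len(frames) != len(flows) + 1:
        raise ValueError("expected one more frame than flow fields")
    return [frame + flow for frame, flow in zip(frames, flows)]
```

With 3-channel images and 3-channel flow features (u, v, confidence), each fused time step carries 6 channels.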
[0033] In the self-attention computation of the Video Swin Transformer network, an attention mask derived from optical flow confidence is introduced, and a penalty weight is applied to low-confidence regions. This enables precise feature selection and focus, fundamentally avoiding the interference of invalid / low-quality features on the analysis results.
[0034] The targeted filtering of attention masks uses the confidence level of optical flow features as an important basis for self-attention allocation, allowing the model to automatically focus on effective fusion features in high-confidence regions (i.e., features that can truly reflect ocean current movement) and weaken invalid features in low-confidence regions (such as false features caused by water surface reflection, floating objects, and abrupt changes in light and shadow). This allows the model's computational resources and analytical focus to concentrate on core effective features, improving analytical efficiency and targeting. Applying penalty weights to low-confidence regions, rather than directly eliminating them, avoids the loss of flow field feature integrity caused by the elimination of local low-confidence features, and reduces their impact on the overall analysis results through quantitative penalties. This achieves a balance between preserving feature integrity and weakening interference from invalid features, adapting to the characteristics of continuous and locally susceptible nearshore ocean current fields. The window self-attention model's inherent local window analysis characteristics enable refined mining of local flow field features in nearshore currents. Combined with confidence attention masks, it further achieves dual optimization of refined local analysis and focused effective feature analysis, making the model's analysis of local current flow fields (such as nearshore eddies and low-velocity currents in shallow waters) more accurate.
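One way to penalize low-confidence regions without removing them is to add a confidence-dependent bias to the attention logits before the softmax. The sketch below shows this for a single query within one window. The specific penalty form, `log(confidence + eps)`, is an assumption and not taken from the source; it has the effect of scaling each key's weight by its confidence.

```python
import math


def confidence_masked_attention(scores, confidence, eps=1e-6):
    """Softmax attention weights for one query with a confidence penalty.

    scores: raw attention logits, one per key token in the window.
    confidence: optical flow confidence in [0, 1] for each key token.
    Adding log(confidence + eps) to the logits is an assumed penalty form:
    it scales each key's weight by its confidence without hard-removing it.
    """
    logits = [s + math.log(c + eps) for s, c in zip(scores, confidence)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the penalty is finite for any positive `eps`, a low-confidence token still receives a small non-zero weight, which matches the stated preference for penalizing rather than eliminating such regions.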
[0035] The dual-branch differentiated design allows for the design of dedicated network layers and loss functions for the numerical regression of flow velocity amplitude and the angle determination of direction, respectively. This enables independent training and prediction of the two monitoring results, avoiding mutual interference between the numerical calculation of amplitude and the angle analysis of direction, and significantly improving the quantitative accuracy of flow velocity amplitude and the accuracy of direction determination. Compared to traditional global self-attention, window self-attention restricts self-attention computation to a local window, significantly reducing the computational cost and parameter count of the model. It avoids computational redundancy in deep Transformer networks. Combined with feature fusion and pre-optimization of attention masks, the model can quickly process fused feature sequences, meeting the real-time requirements of video measurement methods.
[0036] The upper-level analysis optimization and the lower-level feature extraction optimization of the preceding ocean optical flow detection model form a complete technical closed loop in this embodiment of the invention: the preceding optical flow detection model outputs high-precision, confidence-based optical flow features, and this part of the solution fully utilizes the high-quality feature results of the lower level through the whole-process optimization of precise fusion, mask screening, and bi-branch prediction. Through the targeted design of each link, the accuracy loss in the feature transfer process is avoided. From the lower-level feature extraction to the upper-level result output, the overall accuracy, stability and consistency of the entire deep learning-based nearshore current velocity video measurement method are guaranteed, so that the measurement results can truly and accurately reflect the actual motion state of the nearshore current.
[0037] More preferably, the step of inputting the fused feature sequence into the improved Video Swin Transformer network includes: processing the fused image-optical flow features through an optical flow orientation coding layer to generate orientation enhancement features; spatially downsampling and dimensionally embedding the orientation enhancement features using the PatchEmbedding layer; processing the embedded features through a sequence of VideoSwinTransformer blocks, where the first two stages use a confidence-guided attention mechanism; aggregating the features through a global spatiotemporal pooling layer; and outputting the velocity amplitude and direction vector through the dual-branch predictor.
[0038] This invention addresses issues such as insufficient nearshore current feature mining, disconnected spatiotemporal feature fusion, non-targeted attention allocation, and incomplete feature aggregation. It achieves refined, targeted, and comprehensive feature analysis and accurate prediction, from fused feature sequences to current velocity amplitude and direction vectors. This process, through multi-module collaborative design and progressive processing, fully mines the directional features, spatiotemporal correlation features, and core effective features of ocean currents. It further improves the accuracy, stability, and professionalism of ocean current monitoring from the core stage of upper-level model analysis, while also considering model computational efficiency. By employing a dedicated optical flow orientation coding layer to process fused image-optical flow features and generate orientation enhancement features, targeted feature enhancement is achieved to address the core needs of nearshore ocean current monitoring. This solves the problems of weakened orientation features and poor integration with spatial / temporal features in traditional feature processing. The optical flow direction coding layer can extract, encode, and enhance the optical flow direction information in the fused features, highlighting the motion direction features of ocean currents from the multi-source fused features. This allows the model to accurately capture the direction change patterns of nearshore ocean currents (such as the rotation direction of eddies and the deflection trend of nearshore currents), making up for the deficiency in the mining of direction features in conventional feature processing. The orientation enhancement features are not generated in isolation, but are processed based on fused image-optical flow features. 
This achieves deep binding between orientation features and image spatial features and optical flow displacement features, allowing orientation features to work in synergy with other features to reflect the motion state of ocean currents and avoid the disconnect between orientation features and overall features.
[0039] A confidence-guided attention mechanism is introduced in the first two core stages of the VideoSwinTransformer block sequence. This mechanism combines the confidence of optical flow features with the self-attention allocation depth of the VideoSwin Transformer, solving the problems of indiscriminate allocation of computational resources and interference analysis of invalid features in traditional attention mechanisms. The first two stages are the basic feature extraction and preliminary fusion stages of the VideoSwinTransformer network. In this stage, a confidence-guided attention mechanism is introduced, which can achieve effective feature focusing from the source of feature analysis. The model prioritizes the deep extraction and fusion of core features (features that truly reflect ocean current movement) in high-confidence areas, and applies attention suppression to invalid features in low-confidence areas (such as water surface reflection and false features caused by floating objects), thereby minimizing the interference of invalid features on subsequent analysis.
[0040] The confidence-guided attention mechanism works synergistically with the window self-attention and cross-window fusion characteristics of VideoSwinTransformer itself. This enables refined analysis of high-confidence features within a local window and captures the spatiotemporal correlation of ocean current features in different regions through cross-window fusion. At the same time, it avoids the computational redundancy of global attention, achieving a dual optimization of targeted focus and global correlation, making the model more accurate in analyzing the features of complex nearshore flow fields.
[0041] By aggregating the features output from the VideoSwinTransformer block sequence through a global spatiotemporal pooling layer, a comprehensive fusion of ocean current features in both temporal and spatial dimensions is achieved. This solves the problems of spatiotemporal feature disconnect and the inability of local features to reflect the global flow field in traditional feature processing. Spatial dimension aggregation can integrate local ocean current features in different areas of nearshore waters, and explore the spatial correlation of flow fields in different areas (such as the velocity / direction correlation between shallow and deep water areas, and the flow field correlation between the shoreline and the far shore). The output is aggregated features that can reflect the global spatial characteristics of the entire monitored water area, avoiding the one-sidedness of monitoring results caused by analyzing only local features.
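As a rough illustration of global spatiotemporal pooling, the sketch below averages a T×H×W×C feature volume over time and space to obtain one C-dimensional vector. Mean pooling is an assumption; the text does not say whether the pooling layer uses the mean, the maximum, or another reduction, and the function name is hypothetical.

```python
from typing import List


def global_spatiotemporal_mean(features: List[List[List[List[float]]]]) -> List[float]:
    """Average a T x H x W x C feature volume over T, H and W.

    Returns one value per channel, so local features from every frame
    and every image region contribute to the aggregated descriptor.
    """
    channels = len(features[0][0][0])
    sums = [0.0] * channels
    count = 0
    for frame in features:          # temporal dimension
        for row in frame:           # spatial rows
            for pixel in row:       # spatial columns
                for c, value in enumerate(pixel):
                    sums[c] += value
                count += 1
    return [s / count for s in sums]
```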
[0042] The solution in this embodiment of the invention implements an improved VideoSwinTransformer and SEA-RAFT end-to-end deep coupling architecture, which adds an optical flow confidence-guided attention mechanism and applies a penalty term. The input consists of 16 original images and a fusion sequence of 15 optical flow features: the original images retain their 3 RGB channels, and the optical flow features are inserted into the sequence in time alignment to form an input tensor with dimensions (16,3840,2160,6) (3-channel original texture and 3-channel optical flow); it is then converted into sequence features (16,600×338,256) through PatchEmbedding (4×4 pixels / patch). In the optical flow confidence-guided attention mechanism, SEA-RAFT outputs a confidence map c_{t,t+1} for each frame pair, which is converted to an attention mask: mask_t = 1 - exp(-2 × c_{t,t+1}) (the mask for high-confidence regions approaches 1, and the mask for low-confidence regions approaches 0). When VideoSwinTransformer calculates self-attention, regions where mask_t < 0.5 are penalized by a weight (multiplied by 0.3) to reduce the interference of low-reliability optical flow features on global fusion.
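The confidence-to-mask conversion and the penalty rule can be sketched as below, using the constants stated in the embodiment (factor 2, threshold 0.5, penalty 0.3). The function names are hypothetical. How the penalty enters the attention computation is not specified in the text; `reweight_attention` shows one possible reading, assumed here, in which the normalized attention weights over the keys are scaled and then renormalized.

```python
import math
from typing import List


def confidence_to_mask(conf: List[List[float]], k: float = 2.0) -> List[List[float]]:
    """mask = 1 - exp(-k * c); k = 2 in the embodiment."""
    return [[1.0 - math.exp(-k * c) for c in row] for row in conf]


def penalty_weights(mask: List[List[float]],
                    threshold: float = 0.5,
                    penalty: float = 0.3) -> List[List[float]]:
    """Weight 0.3 where the mask is below 0.5, otherwise 1.0."""
    return [[penalty if m < threshold else 1.0 for m in row] for row in mask]


def reweight_attention(attn_row: List[float], key_weights: List[float]) -> List[float]:
    """Assumed application: scale attention over keys, then renormalize to sum to 1."""
    scaled = [a * w for a, w in zip(attn_row, key_weights)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

With these constants, a mask value below 0.5 corresponds to a confidence below ln(2)/2 ≈ 0.347, so only pixels with fairly low optical flow confidence are penalized.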
[0043] The Stage2 output of VideoSwinTransformer (32×32×1024) is upsampled (bilinear interpolation) to 256×256×256 and fed back to the SEA-RAFT iterative update module. For each frame pair, the optical flow optimization direction is dynamically adjusted to suppress local noise optical flow (such as bubble random motion) that conflicts with the global motion trend.
[0044] A new optical flow direction coding layer is added, which converts the direction angle θ = arctan2(v,u) into a sine-cosine code (sinθ, cosθ) and concatenates it with the velocity amplitude √(u²+v²) to form a 5-channel feature, thereby enhancing the representation of direction information. The 4096-dimensional fused features received from the Video Swin Transformer are processed through a 3-layer MLP (hidden units 2048→1024→512) with dual-branch output: the main branch predicts the velocity amplitude (v_pred), and the auxiliary branch predicts the flow velocity direction (θ_pred). Consistency is ensured through joint loss optimization.
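A minimal sketch of the direction coding for a single pixel follows. The text does not list all five channels of the resulting feature, so the hypothetical `encode_direction` returns only the three quantities it names (sinθ, cosθ, amplitude).

```python
import math
from typing import Tuple


def encode_direction(u: float, v: float) -> Tuple[float, float, float]:
    """Return (sin(theta), cos(theta), magnitude) for one flow vector.

    theta = atan2(v, u). Encoding the angle as a sine/cosine pair avoids
    the discontinuity at +/- pi: two nearly identical directions on either
    side of the wrap-around point map to nearly identical codes.
    """
    theta = math.atan2(v, u)
    return math.sin(theta), math.cos(theta), math.hypot(u, v)
```

For example, the vectors (-1, 1e-6) and (-1, -1e-6) have angles close to +π and -π respectively, yet their codes differ only slightly in the sine component.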
[0045] More preferably, the optical flow extraction network and the improved Video Swin Transformer network are trained in the following manner: A training dataset is constructed by synchronously collecting measured flow velocity data from a corresponding acoustic current meter and a sequence of continuous video frames obtained by an image acquisition device. An optical flow extraction network is used to extract pixel-level optical flow training features from adjacent frame pairs in the continuous video frame sequence. The optical flow training features include horizontal displacement components, vertical displacement components, and confidence levels. The continuous video frame sequence and the extracted optical flow features are temporally aligned and concatenated to form fused training features; The fused training feature sequence is input into an improved Video Swin Transformer network, wherein an attention mask derived from the confidence in the optical flow training features is introduced into the self-attention calculation mechanism of the Video Swin Transformer network to penalize feature interactions in low-confidence regions. The fused features output by the Video Swin Transformer network are used to predict the velocity amplitude and direction angle via the prediction head. The ocean optical flow detection model and the improved Video Swin Transformer network are optimized by a joint loss function, which includes an optical flow error term based on optical flow characteristics and real values, a flow velocity prediction error term based on predicted flow velocity and measured flow velocity, and a loss term used to constrain the consistency between the flow velocity derived from optical flow and the directly predicted flow velocity. 
Training continues until the optical flow extraction network and the improved Video Swin Transformer network meet the set requirements. The two networks are trained using a phased training strategy: in the first training phase, the network parameters of the ocean optical flow detection model are frozen, and only the improved Video Swin Transformer network and the subsequent prediction head are trained; in the second training phase, all network parameters are unfrozen, and the overall model is jointly fine-tuned at a learning rate lower than that in the first phase.
[0046] This invention focuses on a dual-network collaborative training scheme that utilizes measured data-driven training dataset construction, multi-dimensional joint loss functions, and a phased network training strategy. Addressing issues such as data disconnect from real-world scenarios, model fitting bias due to single-loss constraints, unstable end-to-end training gradients, and poor dual-network synergy in deep learning model training, this invention achieves accurate fitting, efficient collaboration, and stable convergence between the optical flow extraction network and the improved Video Swin Transformer network. This allows the trained dual networks to be highly adaptable to the actual scenarios of nearshore current video measurement, ensuring the monitoring accuracy, stability, and engineering adaptability of the entire measurement method from the underlying model training stage. Simultaneously, it significantly improves model training efficiency and reduces the risk of overfitting.
[0047] The training dataset is constructed based on synchronous video frame sequences from image acquisition equipment and measured flow velocity data from acoustic current meters. This solves the problem of model fitting being out of touch with the actual scene caused by traditional simulation data / asynchronous data training from the data source. It ensures that the feature distribution of the model training is highly consistent with the feature distribution of actual nearshore current monitoring. Data synchronization ensures that the visual features and optical flow features of the video frames are completely matched with the measured flow velocity data in the temporal and spatial dimensions. This allows the model to learn the real correlation between the visual performance of the current, the motion features of the optical flow, and the actual flow velocity amplitude / direction. It avoids the distortion of feature mapping relationships caused by asynchronous data and greatly improves the model's fitting accuracy to the actual nearshore current scene. The multi-source data fusion dataset contains multi-dimensional information such as vision, optical flow, and measured flow velocity, which perfectly matches the input / output features of the model's actual inference. This achieves consistency between training features and inference features, allowing the parameter optimization of model training to directly serve the actual inference needs and improve the effectiveness and relevance of model inference.
[0048] Specifically, a joint loss function is designed, which includes optical flow error terms, velocity prediction error terms, and velocity consistency constraint terms. This function imposes end-to-end multi-dimensional constraints on the two networks, solving the problems of traditional single loss functions that only focus on the final output, ignore intermediate feature fitting, and have poor synergy between the two networks. This achieves a triple optimization: accurate intermediate optical flow features, accurate final velocity prediction, and consistent feature mapping between the two networks. The optical flow error term directly constrains the output accuracy of the optical flow extraction network, enabling the model to learn more accurate pixel-level horizontal / vertical displacement components and confidence levels. This ensures the authenticity of optical flow features from the bottom up, avoids the transfer of intermediate feature fitting bias to subsequent networks, and lays a high-quality feature foundation for the final flow velocity prediction. The direct constraint of the velocity prediction error term improves the deviation between the final prediction result and the measured velocity of the Video Swin Transformer network. This is the core objective constraint of model training, ensuring that the velocity amplitude and direction angle of the final output of the model can accurately match the actual ocean current velocity, thereby achieving the accuracy of the monitoring results. The velocity consistency constraint establishes a correlation constraint between the optical flow-derived velocity and the velocity directly predicted by the model, forcing the model to learn the consistency of the feature mapping between the two, avoiding the unreasonable situation where the motion state reflected by the optical flow features is completely disconnected from the predicted velocity. 
At the same time, it realizes the feature co-optimization of the optical flow extraction network and the Video Swin Transformer network, allowing the two networks to adapt to and reinforce each other during training, thereby improving the overall synergy and stability of the model. Multi-loss weighted collaboration allows the three loss terms to be dynamically balanced and optimized together during training. This avoids model overfitting caused by a single loss term (such as optical flow feature distortion caused by focusing only on flow velocity prediction) and enables synchronous improvement of accuracy in each stage. This allows the overall model to form a virtuous feature mapping relationship of accurate optical flow → effective fusion → accurate prediction.
[0049] In this embodiment of the invention, the initial parameters of the optical flow extraction network are first fixed, allowing the improved Video Swin Transformer network to focus on learning the mapping relationship between optical flow features, fusion features, and measured flow velocity. This enables accurate local fitting of the upper-layer prediction network and avoids gradient disorder in the upper-layer network caused by drastic fluctuations in the parameters of the lower-layer optical flow network when the two networks are trained simultaneously. This significantly reduces the optimization difficulty in the early stages of training and allows the model to converge quickly to a local optimum. At the same time, this stage can quickly select the optimal initial parameters of the Video Swin Transformer network, laying the foundation for subsequent joint fine-tuning and improving the overall training efficiency.
[0050] Unfreeze all parameters of the optical flow extraction network and perform global joint fine-tuning of the two networks with a learning rate lower than that of the first stage. This achieves synergistic optimization of the two networks, allowing the parameters of the optical flow network to be finely adjusted according to the feature requirements of the upper Transformer network, achieving a high degree of adaptation between the lower-level features and the upper-level analysis. At the same time, the low learning rate avoids drastic fluctuations of parameters near local optima, ensuring stable convergence of model training and ultimately obtaining the globally optimal parameters of the two networks. The low learning rate fine-tuning strategy can also effectively reduce the risk of overfitting, allowing the model to fit the patterns in the training data while retaining its generalization ability to unknown real-world ocean current scenarios. The solution in this invention breaks away from the traditional model of separate training and independent inference of the optical flow extraction network and the improved Video Swin Transformer network, achieving end-to-end joint training and collaborative optimization of the two networks. This transforms the two networks from independent modules into a highly adapted, integrated model.
[0051] During training, the parameter optimizations of the two networks are interconnected and mutually constrained. The output features of the optical flow extraction network can be dynamically adjusted according to the feature requirements of the upper-layer Video Swin Transformer network. The feature analysis of the Video Swin Transformer network can guide the feature extraction optimization of the optical flow network in reverse, realizing deep collaboration between the two networks. This makes the feature processing efficiency and fitting accuracy of the overall model much higher than that of the two networks trained separately.
[0052] The end-to-end training mode allows the model to form a complete optimization loop from video frame input to flow velocity amplitude / direction output. The parameter optimization of all links is centered around the final actual monitoring needs, avoiding the problem of poor feature adaptability between modules caused by individual training, greatly improving the overall model's integrated inference performance, and ensuring the smoothness and accuracy of feature extraction to result output in actual monitoring.
[0053] Specifically, for training data collection, cameras and ADCP (Acoustic Doppler Current Profiler) devices need to be deployed. Equipment configuration: 8-megapixel industrial camera, equipped with infrared fill light (suitable for nighttime scenes). Deployment parameters: installed on a coastal observation tower (height ≥ 15 meters), with the shooting angle at 25°-30° to the horizontal plane, ensuring that the image includes the entire monitored water area and part of the shoreline (for coordinate calibration). Acquisition settings: frame rate 25fps, resolution 3840×2160, bit rate 8Mbps, H.265 encoding format, continuous acquisition period ≥7 days (covering the complete tidal monthly cycle).
[0054] Synchronous measured data acquisition: Two acoustic Doppler current meters (ADCPs) were deployed in the center of the video acquisition area, with a sampling frequency of 1Hz, to collect measured flow velocity data (including amplitude v_meas and direction θ_meas). RTK calibration and physical unit conversion: N well-defined points are selected in the real-world scene such that they are clearly identifiable in the image. The actual straight-line distance between these points is then precisely measured using an RTK device and denoted as L (in meters).
[0055] The pixel positions corresponding to these real-world points are located in the image, with accurate correspondence ensured through manual marking, feature matching, or similar means. The straight-line distance between the two corresponding pixels is measured on the image and denoted as d (unit: pixels).
[0056] Based on the correspondence between actual distance and pixel distance, the pixel-to-meter conversion factor α is calculated using the formula α=L / d. Its physical meaning is that 1 pixel in the image corresponds to α meters in the actual scene.
[0057] Based on a frame rate of 25fps, the time interval is calculated as Δt = 1 / 25 = 0.04s between adjacent frames.
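The calibration arithmetic can be illustrated with a short sketch. The conversion α = L / d and Δt = 1 / 25 s come from the text; the velocity formula α × √(u²+v²) / Δt is an assumption that follows the optical-flow-derived velocity used in the consistency loss, and a single α for the whole image ignores perspective effects of the oblique camera view. The function names and numeric inputs are hypothetical.

```python
import math


def pixel_to_meter_factor(length_m: float, length_px: float) -> float:
    """alpha = L / d: meters represented by one image pixel."""
    if length_px <= 0:
        raise ValueError("pixel distance must be positive")
    return length_m / length_px


def flow_to_velocity(u_px: float, v_px: float, alpha: float, fps: float = 25.0) -> float:
    """Assumed conversion of a per-frame pixel displacement to m/s."""
    dt = 1.0 / fps  # 0.04 s at 25 fps
    return alpha * math.hypot(u_px, v_px) / dt


# Hypothetical numbers: a 50 m baseline spanning 1000 pixels.
alpha = pixel_to_meter_factor(50.0, 1000.0)   # 0.05 m per pixel
speed = flow_to_velocity(0.3, 0.4, alpha)     # 0.5 px per frame
```

Under these hypothetical inputs, a displacement of 0.5 pixels per frame works out to roughly 0.625 m/s.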
[0058] Preprocessing and labeled dataset construction. Sample construction: Sixteen consecutive frames are extracted from the preprocessed video as the base sequence, denoted as I_1, I_2, ..., I_16. Fifteen pairs of adjacent frames (I_1 and I_2, I_2 and I_3, ..., I_15 and I_16) are extracted from the base sequence. The 16-frame base sequence and the optical flow features (pre-generated u, v, c) corresponding to the 15 frame pairs are combined as an input sample, and the mean value of the ADCP data for the corresponding time period is used as the label (v_meas, θ_meas). A total of 50,000 samples were constructed. Dataset partitioning: The dataset is divided into a training set (40,000 samples), a validation set (5,000 samples), and a test set (5,000 samples) in an 8:1:1 ratio to ensure that each subset includes weather conditions such as sunny, cloudy, and rainy days, as well as tidal states such as high tide, low tide, and slack tide, thus ensuring data diversity.
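The sample construction and partitioning can be sketched as follows. The frame-pair indexing and the 8:1:1 split follow the text; averaging the direction with a circular mean is an assumption (the text only says the mean of the ADCP data is used), chosen because an arithmetic mean of angles fails near the 0°/360° boundary. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def adjacent_pairs(n_frames: int = 16) -> List[Tuple[int, int]]:
    """1-based index pairs (I_1, I_2), ..., (I_15, I_16)."""
    return [(i, i + 1) for i in range(1, n_frames)]


def mean_label(speeds: Sequence[float], directions_rad: Sequence[float]) -> Tuple[float, float]:
    """Mean speed and circular-mean direction over one time window."""
    v = sum(speeds) / len(speeds)
    s = sum(math.sin(d) for d in directions_rad)
    c = sum(math.cos(d) for d in directions_rad)
    return v, math.atan2(s, c)


def split_counts(total: int, ratios: Sequence[int] = (8, 1, 1)) -> Tuple[int, ...]:
    """Sizes of train / validation / test sets for the given ratio."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)  # any remainder goes to the training set
    return tuple(counts)
```

Stratifying each subset by weather and tidal state, as the text requires, would need per-sample condition labels and is not shown here.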
[0059] Build improved versions of SEA-RAFT and VideoSwinTransformer.
[0060] A joint loss is constructed, and an optimization objective integrating optical flow error, velocity prediction error, and consistency error is established.
[0061] Loss_total = 0.4 × Loss_flow + 0.5 × Loss_vel + 0.1 × Loss_consist. Loss_flow is the SEA-RAFT optical flow error (endpoint error EPE = √((u_pred - u_gt)² + (v_pred - v_gt)²)); by constraining the accuracy of optical flow extraction through endpoint error, high-quality fundamental features are provided for flow velocity calculation. Loss_vel is the velocity prediction error (MSE(v_pred, v_meas) + (1 - cos(θ_pred - θ_meas))); by combining MSE and angle cosine error, the prediction accuracy of flow velocity amplitude and direction is simultaneously optimized. Loss_consist is the optical flow-velocity consistency loss (L1(|√(u²+v²) / Δt - v_pred|)); the L1 loss function constrains the consistency between the optical flow derivation and the directly predicted flow velocity, effectively reducing error accumulation. Through scientific weight allocation, this loss function achieves multi-task collaborative optimization, significantly improving the model's adaptability to different tidal states (high tide, low tide, slack tide) and weather conditions (sunny, cloudy, rainy).
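The joint loss can be written out for a single sample as a minimal sketch. The weights 0.4, 0.5 and 0.1 and the three terms follow the text; averaging the per-pixel quantities over the image is an assumption. The consistency term as printed has no pixel-to-meter factor, so the sketch exposes a hypothetical `alpha` parameter that defaults to 1 and reproduces the printed form.

```python
import math
from typing import Sequence, Tuple

Flow = Sequence[Tuple[float, float]]  # (u, v) per pixel


def endpoint_error(pred: Flow, gt: Flow) -> float:
    """Mean endpoint error between predicted and ground-truth flow."""
    return sum(math.hypot(up - ug, vp - vg)
               for (up, vp), (ug, vg) in zip(pred, gt)) / len(pred)


def velocity_loss(v_pred: float, v_meas: float,
                  theta_pred: float, theta_meas: float) -> float:
    """Squared amplitude error plus (1 - cos) of the angular error."""
    return (v_pred - v_meas) ** 2 + (1.0 - math.cos(theta_pred - theta_meas))


def consistency_loss(pred: Flow, v_pred: float,
                     dt: float = 0.04, alpha: float = 1.0) -> float:
    """L1 gap between the flow-derived speed and the predicted speed."""
    mean_mag = sum(math.hypot(u, v) for u, v in pred) / len(pred)
    return abs(alpha * mean_mag / dt - v_pred)


def total_loss(pred: Flow, gt: Flow, v_pred: float, v_meas: float,
               theta_pred: float, theta_meas: float) -> float:
    return (0.4 * endpoint_error(pred, gt)
            + 0.5 * velocity_loss(v_pred, v_meas, theta_pred, theta_meas)
            + 0.1 * consistency_loss(pred, v_pred))
```

The (1 - cos) direction term is 0 when the predicted and measured angles agree and reaches its maximum of 2 when they are opposite, so it is insensitive to angle wrap-around.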
[0062] Phased training: Phase 1 (Pre-training, 20 epochs): Freeze SEA-RAFT weights, train only VideoSwinTransformer and the prediction head, learning rate 5e-5, batch size=8; Phase 2 (Joint Fine-tuning, 50 epochs): Unfreeze all layers, train with mixed precision (FP16), learning rate 1e-5, batch size=16, and use the AdamW optimizer (weight decay 1e-4). Early stopping mechanism: If the validation set Loss_total does not decrease for 10 consecutive epochs, training is stopped and the best model is saved (named "FusionNet_best.pth").
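The two-phase schedule and the early stopping rule can be captured in a framework-independent sketch. The hyperparameters are those stated above; the `Phase` and `EarlyStopping` names are hypothetical, and the actual training loop, optimizer, and checkpoint writing are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    epochs: int
    lr: float
    batch_size: int
    freeze_flow_net: bool  # True: SEA-RAFT weights stay fixed


PHASES = [
    Phase("pretrain", epochs=20, lr=5e-5, batch_size=8, freeze_flow_net=True),
    Phase("joint_finetune", epochs=50, lr=1e-5, batch_size=16, freeze_flow_net=False),
]


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one validation result; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # the caller would save the best checkpoint here
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A training loop would call `step` once per epoch with the validation Loss_total and break out of the loop when it returns True.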
[0063] More preferably, the method further includes: A bidirectional transfer mechanism is constructed between the optical flow extraction network and the improved Video Swin Transformer network; The bidirectional transmission mechanism includes: An optical flow extraction network processes adjacent image frame pairs to generate intermediate layer feature maps. The intermediate layer feature map is dimensionality reduced and then input into the improved Video Swin Transformer network as a priori for the motion region; The improved Video Swin Transformer network processes fused sequences containing original image and optical flow features; Feature maps are extracted from the intermediate layers of the improved Video Swin Transformer network and upsampled to a predetermined resolution; The upsampled feature map is fed back to the iterative optimization module of the optical flow extraction network; The optical flow estimation process is adjusted based on the feedback feature map.
[0064] Specifically, in this embodiment of the invention, the bidirectional transmission works as follows: for each pair of adjacent frames (e.g., I_t and I_{t+1}), the C3-layer features of SEA-RAFT (512×512×384) are extracted and reduced to 256 channels through 1×1 convolution. These features are then input into Stage 1 of VideoSwinTransformer as spatial feature priors for that time period, guiding VideoSwinTransformer to focus on the water movement region.
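A 1×1 convolution is a per-pixel linear map across channels, which is why it can reduce the channel count (384 to 256 in the embodiment) without changing the spatial resolution. The sketch below shows the operation on nested lists with tiny, hypothetical dimensions; a real implementation would use a tensor library, and the weights would be learned.

```python
from typing import List, Optional


def conv1x1(feature_map: List[List[List[float]]],
            weight: List[List[float]],
            bias: Optional[List[float]] = None) -> List[List[List[float]]]:
    """Apply a 1x1 convolution.

    feature_map: H x W x C_in
    weight:      C_out x C_in (one row of weights per output channel)
    returns:     H x W x C_out, with the same H and W as the input
    """
    c_out = len(weight)
    if bias is None:
        bias = [0.0] * c_out
    return [
        [
            [sum(w * x for w, x in zip(weight[o], pixel)) + bias[o]
             for o in range(c_out)]
            for pixel in row
        ]
        for row in feature_map
    ]
```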
[0065] This invention focuses on the bidirectional transmission mechanism between the optical flow extraction network and the improved Video Swin Transformer network. It overcomes the technical limitations of traditional one-way feature transmission and independent analysis and processing by the front and rear networks, and constructs a dual-network dynamic collaborative system in which the optical flow network empowers the front and the Video Swin Transformer network provides reverse guidance. It addresses problems such as inaccurate positioning of motion areas, susceptibility of optical flow estimation to noise interference, and insufficient feature adaptability of the two networks in nearshore current monitoring. It achieves bidirectional complementarity, dynamic optimization, and accurate adaptation of features, further improving the accuracy of optical flow feature extraction and the ability to mine ocean current motion features from the network collaboration level. Ultimately, it qualitatively improves the monitoring accuracy, anti-interference ability, and scene adaptability of the entire measurement method.
[0066] The bidirectional transfer mechanism constructs a closed-loop feature interaction channel between the optical flow extraction network and the improved Video Swin Transformer network. This transforms the two networks from a unidirectional sequential relationship of optical flow extraction and Video Swin Transformer analysis into a mutually supportive and mutually optimizing collaborative relationship. This completely solves the problems of information silos caused by unidirectional feature transfer in traditional architectures and the inability of subsequent networks to correct front-end feature biases. Forward empowerment: The dimensionality reduction of the intermediate layer feature map of the optical flow extraction network is used as the prior input of the motion region to the VideoSwin Transformer network, providing accurate localization of ocean current motion regions for upper-layer analysis. This allows the VideoSwin Transformer network to focus directly on the effective motion region for refined analysis without having to mine motion features from scratch, thus reducing meaningless feature calculations. Reverse guidance: The upsampled feature maps of the intermediate layers of the Video Swin Transformer network are fed back to the iterative optimization module of the optical flow extraction network, providing global ocean current motion patterns mined from the upper layers for optical flow estimation. This allows the optical flow extraction network to correct local optical flow estimation biases based on global features, avoiding optical flow distortion caused by local pixel-level analysis. The bidirectional feature transfer and interaction enable the feature information of the two networks to complement each other across layers and networks. 
The pixel-level motion features of the bottom layer of the optical flow network are fused with the global spatiotemporal features of the upper layer of the Video Swin Transformer network, fully mining the feature value of the two networks and achieving a synergistic effect between the fine features at the bottom layer and the global features at the upper layer.
[0067] The motion region prior provides a clear analysis boundary for the Video Swin Transformer network, avoiding ineffective attention calculations and feature mining in non-motion regions, reducing redundant computations, improving the feature processing efficiency of the entire model, and further enhancing the engineering adaptability of real-time monitoring. Based on prior orientation analysis of the motion region, the Video Swin Transformer network can more accurately mine the spatiotemporal correlation features within the motion region (such as the motion patterns of local eddies and the correlation changes of flow velocity in different regions), avoid interference from background noise features, and improve the accuracy of ocean current velocity and direction prediction.
[0068] The feedback feature map is upsampled to a predetermined resolution and precisely matched with the feature dimensions of the optical flow extraction network. It can directly participate in the iterative optimization process of optical flow estimation, realize the dynamic adjustment and fine optimization of optical flow features, and make the output optical flow features (displacement components, confidence) fit the motion state of local pixels and conform to the motion law of global ocean currents, greatly improving the authenticity and consistency of optical flow features. The reverse-guided iterative optimization mechanism enables the optical flow extraction network to have self-correction capabilities. It can dynamically adjust its feature extraction process based on the analysis results of subsequent networks, avoiding the problem that traditional optical flow networks cannot correct their output features once they are generated and that deviations continue to propagate. This ensures high-quality output of optical flow features from the bottom up.
[0069] In the feature interaction process, the bidirectional transfer mechanism uses targeted processing such as dimensionality reduction and upsampling so that the features of the optical flow extraction network and the Video Swin Transformer network are matched in dimensionality, resolution, and semantics. This addresses the fusion difficulties and poor synergy that differences in feature dimensions and semantic mismatch cause in traditional dual networks. Dimensionality reduction of the optical flow intermediate layer feature map maps the pixel-level high-dimensional features to low-dimensional features compatible with the Video Swin Transformer network, which avoids the computational pressure that high-dimensional features would place on the upper network and matches the dimensions of the low-level motion features to those of the upper-level analysis features. Upsampling of the Video Swin Transformer intermediate layer feature map restores the global low-resolution features to a predetermined resolution that matches the iterative optimization module of the optical flow extraction network, allowing the upper-layer global features to participate directly in local pixel-level optical flow estimation and aligning the upper-layer global semantics with the lower-layer local semantics. Matching dimensions and resolution avoids dimension conflicts and limits information loss when features pass between the two networks. Matching at the semantic level lets each network interpret the other's features, enabling effective feature interaction and fusion and improving the overall synergy of the two networks.
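The two adaptation steps described for the bidirectional transfer mechanism can be illustrated with a minimal sketch. It assumes that the dimensionality reduction is a per-pixel linear projection (the operation a 1×1 convolution computes, bias omitted) and that upsampling uses nearest-neighbour interpolation; the specification does not state the interpolation mode, and the function names, weights, and tiny feature maps below are hypothetical.

```python
from typing import List

# A feature map stored as nested lists: [channel][row][col]
FeatureMap = List[List[List[float]]]


def reduce_channels(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Per-pixel linear projection across channels (1x1 convolution, no bias).

    weights[o][i] is the contribution of input channel i to output channel o.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(w * fmap[i][r][c] for i, w in enumerate(out_w)) for c in range(width)]
            for r in range(height)
        ]
        for out_w in weights
    ]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling: repeat every value `factor` times per axis."""
    return [
        [
            [row[c // factor] for c in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)
        ]
        for channel in fmap
    ]


if __name__ == "__main__":
    # Hypothetical 2-channel, 2x2 optical flow intermediate feature map
    flow_features = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
    # Hypothetical projection from 2 channels down to 1 (motion-region prior)
    prior = reduce_channels(flow_features, [[0.5, 0.5]])
    print("reduced prior:", prior)

    # Hypothetical low-resolution feedback map restored to 4x4
    feedback = upsample_nearest([[[1.0, 2.0], [3.0, 4.0]]], factor=2)
    print("upsampled feedback:", feedback)
```

In a trained model the projection weights would be learned, and a smoother interpolation (for example bilinear) might be preferred for the feedback path; both choices are left open here.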
[0070] The bidirectional feature optimization, combined with the parameter optimization of the preceding joint training and the constraint optimization of multiple loss functions, creates a synergistic effect. This achieves comprehensive optimization of the entire model from three levels: parameters, features, and network collaboration. It minimizes the deviation between the final output current velocity amplitude and direction vector, significantly improving the final monitoring accuracy of the entire measurement method. The closed-loop optimization mechanism throughout the entire process ensures that each link of the measurement method is no longer an independent optimization unit, but rather forms an integrated optimization system centered on the accurate monitoring of nearshore currents. This achieves layer-by-layer precision and step-by-step optimization from the underlying features to the final result.
[0071] More preferably, after inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the method further includes: The pixel displacement component is converted into a physical velocity component using a scale conversion factor; the scale conversion factor is the conversion factor between image pixels and physical space. Following the output of the predicted ocean current monitoring results, the following is also included: The prediction results at multiple consecutive time points are processed by a moving average.
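The scale conversion step can be sketched as follows. This is an illustrative example only: it assumes a single uniform scale factor (metres per pixel) valid across the analysed region, a per-frame displacement, and a direction angle measured counter-clockwise from the image x-axis. The function names and the numeric scale value are hypothetical; the 25 fps frame rate is the one given in the practical example of this embodiment.

```python
import math
from typing import Tuple


def pixel_flow_to_velocity(
    dx_px: float,
    dy_px: float,
    metres_per_pixel: float,
    fps: float,
) -> Tuple[float, float]:
    """Convert a per-frame pixel displacement into velocity components in m/s."""
    # pixels/frame * metres/pixel * frames/second = metres/second
    vx = dx_px * metres_per_pixel * fps
    vy = dy_px * metres_per_pixel * fps
    return vx, vy


def magnitude_and_direction(vx: float, vy: float) -> Tuple[float, float]:
    """Return speed (m/s) and direction in degrees within [0, 360)."""
    speed = math.hypot(vx, vy)
    direction_deg = math.degrees(math.atan2(vy, vx)) % 360.0
    return speed, direction_deg


if __name__ == "__main__":
    # Hypothetical values: 3 px right and 4 px down per frame, 2 cm per pixel
    vx, vy = pixel_flow_to_velocity(3.0, 4.0, metres_per_pixel=0.02, fps=25.0)
    speed, direction = magnitude_and_direction(vx, vy)
    print(f"vx={vx:.3f} m/s, vy={vy:.3f} m/s, speed={speed:.3f} m/s, dir={direction:.1f} deg")
```

If the camera views the water obliquely, one constant factor is unlikely to hold over the whole image, and a position-dependent factor obtained from camera calibration would be needed. That refinement is outside this sketch.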
[0072] This invention adds two post-processing stages: scale conversion of the pixel displacement components and moving average processing of the prediction results. They address two core issues in nearshore current video measurement: the disconnect between pixel-level features and actual physical current velocities, and the susceptibility of single-moment predictions to transient noise. Scale conversion turns visual features into physically measured values, and the moving average turns discrete single-moment results into continuous, stable monitoring data. Without altering the core network model, these stages improve the physical applicability, stability, and reliability of the measurement results. Specifically, the current prediction results at multiple consecutive time points are processed by a moving average. Because instantaneous noise on the water surface (such as sudden ripples, local floating objects, and instantaneous changes in light and shadow) causes fluctuations and outliers in the prediction at a single time point, this temporal smoothing filters the noise so that the output current velocity and direction data better reflect the continuous and stable motion of nearshore currents.
[0073] In practice, the camera's real-time stream (25fps) is accessed via the RTSP protocol and decoded into RGB images (3840×2160) using the FFmpeg library, with each frame appended with a precise timestamp.
[0074] Input preparation: cache 16 consecutive frames of images to form the model input tensor (1,16,3840,2160,3). FusionNet inference: load the FusionNet_best.pth model for inference and output the flow velocity magnitude $v_{\mathrm{pred}}$ and direction $\theta_{\mathrm{pred}}$. Result smoothing: a 3-frame moving average, $v_{\mathrm{smooth}} = (v_{t-2} + v_{t-1} + v_{t})/3$, is used to suppress instantaneous fluctuations.
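The 3-frame moving average of the velocity magnitude can be sketched with a fixed-length buffer. The class name is hypothetical, and the warm-up behaviour (averaging over however many values are available until three have arrived) is an assumption, since the specification only gives the steady-state formula.

```python
from collections import deque
from typing import Deque


class MovingAverage:
    """Running mean over the most recent `window` values."""

    def __init__(self, window: int = 3) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) drops the oldest value automatically
        self._values: Deque[float] = deque(maxlen=window)

    def update(self, value: float) -> float:
        """Add a new prediction and return the smoothed value."""
        self._values.append(value)
        return sum(self._values) / len(self._values)


if __name__ == "__main__":
    smoother = MovingAverage(window=3)
    # Hypothetical per-inference velocity magnitudes in m/s
    for v_pred in [0.9, 1.2, 1.5, 0.6]:
        print(f"v_pred={v_pred:.2f}  v_smooth={smoother.update(v_pred):.3f}")
```

The same arithmetic mean should not be applied directly to the direction angle, because angles wrap around at 0°/360°; a circular mean over unit vectors would be one option, but the specification does not say how direction is smoothed.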
[0075] This invention designs a network architecture that deeply couples optical flow with spatiotemporal features. Through a bidirectional feature transfer mechanism between SEA-RAFT and the Video Swin Transformer, it achieves collaborative learning of optical flow details and global spatiotemporal trends. The optical flow features extracted by SEA-RAFT guide the Video Swin Transformer to focus on water movement regions, and global feature feedback from the Video Swin Transformer optimizes the optical flow calculation to suppress local noise. Combined with an optical flow confidence attention mechanism, this enhances the reliability of feature fusion and lays a structural foundation for multi-task collaborative optimization. In addition, it proposes a targeted multi-task joint loss function, constructing a three-part optimization objective Loss_total that integrates optical flow error, velocity prediction error, and consistency error, which reduces error accumulation. This loss function assigns weights based on prior importance, achieving multi-task collaborative optimization and improving the model's adaptability to different tidal states (high tide, low tide, slack tide) and weather conditions (sunny, cloudy, rainy).
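The structure of the three-part objective Loss_total can be illustrated as a weighted sum. This is a sketch under assumptions: the specification names the three terms but this passage does not give their exact form or weights, so the use of mean squared error for each term, the function names, and the weight values below are all hypothetical.

```python
from typing import Sequence, Tuple


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of squared differences between two equally long sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    flow_pred: Sequence[float],
    flow_true: Sequence[float],
    v_pred: Sequence[float],
    v_measured: Sequence[float],
    v_from_flow: Sequence[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 0.5),  # assumed values
) -> float:
    """Weighted sum of optical flow, velocity prediction, and consistency errors."""
    w_flow, w_vel, w_cons = weights
    loss_flow = mean_squared_error(flow_pred, flow_true)    # optical flow error
    loss_vel = mean_squared_error(v_pred, v_measured)       # velocity prediction error
    loss_cons = mean_squared_error(v_from_flow, v_pred)     # flow-derived vs predicted
    return w_flow * loss_flow + w_vel * loss_vel + w_cons * loss_cons


if __name__ == "__main__":
    total = joint_loss(
        flow_pred=[1.0, 2.0],
        flow_true=[1.0, 4.0],
        v_pred=[1.0],
        v_measured=[1.5],
        v_from_flow=[2.0],
    )
    print("Loss_total =", total)
```

The consistency term ties the velocity implied by the optical flow to the directly predicted velocity, so a disagreement between the two branches is penalised even when each is individually close to its own target.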
[0076] The network components and training strategies were optimized. The ConvNeXt-V2 lightweight backbone was adopted to improve the fine-grained feature capture capability of SEA-RAFT and reduce the number of parameters. The adaptability of Video Swin Transformer to different flow rates was enhanced by dynamic window partitioning. The model convergence was achieved by combining phased training (pre-training + joint fine-tuning) and mixed precision training with a multi-task loss function, ensuring high accuracy and stability of flow rate prediction.
[0077] An efficient real-time processing workflow is established: the video stream is accessed via the RTSP protocol and the predictions are smoothed with a moving average. While sustaining real-time operation at a frame rate of 25 fps, the model, whose performance is optimized through the multi-task loss function, suppresses instantaneous fluctuations and outputs stable and reliable flow velocity results, balancing the real-time and accuracy requirements of engineering applications.
[0078] The method in this embodiment of the invention achieves refined extraction and targeted analysis of ocean current features through hierarchical collaborative processing of optical flow detection model and window self-attention model. First, the optical flow detection model accurately captures the pixel displacement components and confidence levels of adjacent frames of the image, locking the dynamic motion trajectory of the ocean current from the underlying visual feature level. Then, the window self-attention model performs attention weighting processing on the associated image frame group and optical flow features, automatically focusing on the core effective features of ocean current motion and weakening background noise features, effectively avoiding measurement deviations caused by single feature analysis, and significantly improving the accuracy of ocean current velocity amplitude and direction prediction.
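One way to picture how attention weighting can weaken low-confidence features is to add a confidence-dependent penalty to the attention scores before the softmax. The sketch below is illustrative only: the additive log-confidence penalty, the function name, and the sample values are assumptions, since the specification states that penalty weights are applied to low-confidence regions but does not fix their form.

```python
import math
from typing import List


def confidence_guided_attention(
    scores: List[float],
    confidence: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """Softmax over attention scores with an additive log-confidence penalty.

    A key with confidence 1.0 is unpenalised; lower confidence lowers its weight.
    """
    if len(scores) != len(confidence) or not scores:
        raise ValueError("scores and confidence must be non-empty and equal length")
    # Clamp confidence away from zero so the logarithm stays finite
    penalised = [s + math.log(max(c, eps)) for s, c in zip(scores, confidence)]
    peak = max(penalised)  # subtract the maximum for numerical stability
    exps = [math.exp(p - peak) for p in penalised]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Three keys with identical raw scores but different optical flow confidence
    weights = confidence_guided_attention([0.0, 0.0, 0.0], [1.0, 0.5, 0.5])
    print("attention weights:", [round(w, 3) for w in weights])
```

With equal raw scores, the resulting weights are proportional to the confidences, so regions where the optical flow estimate is unreliable contribute less to the aggregated feature.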
[0079] Example 2 Please refer to Figure 8, which is a schematic diagram of the deep learning-based nearshore ocean current velocity video measurement system disclosed in an embodiment of the present invention. As shown in Figure 8, this deep learning-based nearshore current velocity video measurement system may include: Acquisition module 21: used to acquire video image sequences of the water area to be monitored through image acquisition equipment; Extraction module 22: used to extract multiple consecutive frames of images from the video image sequence as an image frame group; Optical flow detection module 23: used to input the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels; Result output module 24: used to associate the image frame group and the corresponding optical flow features as detection input features, and input the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, the ocean current monitoring results including ocean current velocity amplitude and ocean current direction.
[0080] The method in this embodiment of the invention achieves refined extraction and targeted analysis of ocean current features through hierarchical collaborative processing of optical flow detection model and window self-attention model. First, the optical flow detection model accurately captures the pixel displacement components and confidence levels of adjacent frames of the image, locking the dynamic motion trajectory of the ocean current from the underlying visual feature level. Then, the window self-attention model performs attention weighting processing on the associated image frame group and optical flow features, automatically focusing on the core effective features of ocean current motion and weakening background noise features, effectively avoiding measurement deviations caused by single feature analysis, and significantly improving the accuracy of ocean current velocity amplitude and direction prediction.
[0081] Example 3 Please refer to Figure 9, which is a schematic diagram of the structure of an electronic device disclosed in an embodiment of the present invention. The electronic device can be a computer, a server, etc. In certain cases, it can also be a mobile phone, tablet computer, monitoring terminal, or other smart device, as well as an image acquisition device with processing capabilities. As shown in Figure 9, the electronic device may include: a memory 510 storing executable program code; and a processor 520 coupled to the memory 510. The processor 520 calls the executable program code stored in the memory 510 to execute some or all of the steps in the deep learning-based video measurement method for nearshore ocean current velocity in Embodiment 1.
[0082] This invention discloses a computer-readable storage medium storing a computer program that causes a computer to perform some or all of the steps in the deep learning-based video measurement method for nearshore ocean current velocity in Embodiment 1.
[0083] This invention also discloses a computer program product, wherein when the computer program product is run on a computer, the computer performs some or all of the steps in the deep learning-based nearshore current velocity video measurement method in Embodiment 1.
[0084] This invention also discloses an application publishing platform, which is used to publish computer program products. When the computer program products are run on a computer, the computer performs some or all of the steps in the deep learning-based nearshore ocean current velocity video measurement method in Embodiment 1.
[0085] In various embodiments of the present invention, it should be understood that the sequence number of each process does not necessarily imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0086] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0087] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0088] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which can be a personal computer, server, or network device, specifically a processor in the computer device) to execute some or all of the steps of the methods described in the various embodiments of the present invention.
[0089] In the embodiments provided by this invention, it should be understood that "B corresponding to A" means that B is associated with A, and B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information.
[0090] Those skilled in the art will understand that some or all of the steps in the various methods of the embodiments described can be implemented by a program instructing related hardware. This program can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
[0091] The foregoing has provided a detailed description of the deep learning-based video measurement method, system, electronic device, and storage medium for nearshore ocean current velocity disclosed in the embodiments of the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A video-based method for measuring nearshore ocean current velocity based on deep learning, characterized in that it comprises: acquiring video image sequences of the water area to be monitored using image acquisition equipment; extracting multiple consecutive frames from the video image sequence to form an image frame group; inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels; and associating the image frame group and the corresponding optical flow features as detection input features, and inputting the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, which include ocean current velocity amplitude and ocean current direction.
2. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 1, characterized in that, The step of inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group includes: An optical flow extraction network is used to encode features of the image frame group to determine the optical flow features between adjacent frames in the image frame group. A feature alignment layer is added after the convolutional block of the optical flow extraction network. The feature alignment layer includes a 1×1 convolutional layer and a layer normalization layer. The optical flow extraction network is designed based on the SEA-RAFT architecture, and its feature extraction backbone network is ConvNeXt-V2.
3. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 2, characterized in that, The step of associating the image frame group and the corresponding optical flow features as detection input features, and inputting the detection input features into a window self-attention model for processing to output predicted ocean current monitoring results includes: The image frame group and the extracted optical flow features are time-aligned and channel-sequentially concatenated to form a fused feature sequence; The fused feature sequence is input into an improved Video Swin Transformer network, wherein the improved Video Swin Transformer network is a window self-attention model. In the self-attention calculation mechanism of the Video Swin Transformer network, an attention mask derived from the confidence in the optical flow features is introduced to apply penalty weights to the features in the low confidence region. The fused features output by the Video Swin Transformer network are then used to output the corresponding ocean current monitoring results via a dual-branch prediction head.
4. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 3, characterized in that, The step of inputting the fused feature sequence into the improved Video Swin Transformer network includes: The fused image-optical flow features are processed by an optical flow orientation coding layer to generate orientation enhancement features; The orientation enhancement features are spatially downsampled and dimensionally embedded using a Patch Embedding layer. Embedded features are processed through Video Swin Transformer block sequences, where the first two stages use a confidence-guided attention mechanism; Features are aggregated through a global spatiotemporal pooling layer; The dual-branch predictor outputs the velocity amplitude and direction vector.
5. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 3, characterized in that, The optical flow extraction network and the improved Video Swin Transformer network are trained as follows: A training dataset is constructed by synchronously collecting measured flow velocity data from a corresponding acoustic current meter and a sequence of continuous video frames obtained by an image acquisition device. An optical flow extraction network is used to extract pixel-level optical flow training features from adjacent frame pairs in the continuous video frame sequence. The optical flow training features include horizontal displacement components, vertical displacement components, and confidence levels. The continuous video frame sequence and the extracted optical flow features are temporally aligned and concatenated to form fused training features; The fused training feature sequence is input into an improved Video Swin Transformer network, wherein an attention mask derived from the confidence in the optical flow training features is introduced into the self-attention calculation mechanism of the Video Swin Transformer network to penalize feature interactions in low-confidence regions. The fused features output by the Video Swin Transformer network are used to predict the velocity amplitude and direction angle via the prediction head. The ocean optical flow detection model and the improved Video Swin Transformer network are optimized by a joint loss function, which includes an optical flow error term based on the optical flow features and their true values, a flow velocity prediction error term based on predicted flow velocity and measured flow velocity, and a loss term used to constrain the consistency between the flow velocity derived from optical flow and the directly predicted flow velocity, until the optical flow extraction network and the improved Video Swin Transformer network meet the set requirements; The optical flow extraction network and the improved Video Swin Transformer network are trained using a phased training strategy, including: In the first training phase, the network parameters of the ocean optical flow detection model are frozen, and only the improved Video Swin Transformer network and subsequent prediction heads are trained. In the second training phase, all network parameters are unfrozen, and the overall model is jointly fine-tuned at a learning rate lower than that in the first phase.
6. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 3, characterized in that, The method further includes: A bidirectional transfer mechanism is constructed between the optical flow extraction network and the improved Video Swin Transformer network; The bidirectional transmission mechanism includes: An optical flow extraction network processes adjacent image frame pairs to generate intermediate layer feature maps. The intermediate layer feature map is dimensionality reduced and then input into the improved Video Swin Transformer network as a priori for the motion region; The improved Video Swin Transformer network processes fused sequences containing original image and optical flow features; Feature maps are extracted from the intermediate layers of the improved Video Swin Transformer network and upsampled to a predetermined resolution; The upsampled feature map is fed back to the iterative optimization module of the optical flow extraction network; The optical flow estimation process is adjusted based on the feedback feature map.
7. The method for measuring nearshore ocean current velocity based on deep learning as described in claim 1, characterized in that, After inputting the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the method further includes: The pixel displacement component is converted into a physical velocity component using a scale conversion factor; the scale conversion factor is the conversion factor between image pixels and physical space. Following the output of the predicted ocean current monitoring results, the following is also included: The prediction results at multiple consecutive time points are processed by a moving average.
8. A video measurement system for nearshore ocean current velocity based on deep learning, characterized in that it comprises: Acquisition module: used to acquire video image sequences of the water area to be monitored through image acquisition equipment; Extraction module: used to extract multiple consecutive frames of images from the video image sequence as an image frame group; Optical flow detection module: used to input the image frame group into the ocean optical flow detection model to determine the optical flow features between adjacent frames in the image frame group, the optical flow features including pixel displacement components and confidence levels; Result output module: used to associate the image frame group and the corresponding optical flow features as detection input features, and input the detection input features into a window self-attention model for processing to output the predicted ocean current monitoring results, which include ocean current velocity amplitude and ocean current direction.
9. An electronic device, characterized in that it comprises: a memory containing executable program code; and a processor coupled to the memory; wherein the processor calls the executable program code stored in the memory to execute the deep learning-based video measurement method for nearshore ocean current velocity as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein the computer program causes a computer to execute the deep learning-based video measurement method for nearshore ocean current velocity as described in any one of claims 1 to 7.