A weakly supervised dense video event detection and structured log generation method and system for ship monitoring scenarios
By employing a weakly supervised dense video event detection method, utilizing a visual language model and a multi-instance learning framework, the problems of high frame-level annotation cost and low positioning accuracy in ship safety monitoring are solved, enabling efficient, stable, and structured recording of ship safety events.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI MARITIME UNIVERSITY
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-12
AI Technical Summary
When existing technologies rely on fully supervised video description methods for ship safety monitoring, frame-level temporal boundary annotation is costly, while weakly supervised methods have low positioning accuracy in the absence of temporal boundary supervision, making it difficult to adapt to safety events of varying complexity and resulting in poor automatic recording performance.
We employ a weakly supervised dense video event detection method, which utilizes a visual language model to extract multimodal features, performs multi-scale temporal modeling through a deformable encoder, generates initial pseudo-boundaries by combining semantic guidance and non-uniform window initialization, and refines and fuses the boundaries through a multi-instance learning framework to generate structured logs.
It significantly reduces annotation costs without relying on large-scale, finely annotated data, improves the accuracy of temporal location and recording stability of ship safety incidents, reduces the burden on safety management personnel, and promotes the real-time and intelligent management of ship safety.
Figure CN122200515A_ABST
Description
Technical Field
[0001] This invention relates to the fields of computer vision and cross-modal video understanding, and in particular to a method and system for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios.
Background Technology
[0002] With the intelligent transformation of the global shipping industry, ship safety monitoring has become a core element in ensuring the safety of life and property at sea. Modern large ships are typically equipped with comprehensive closed-circuit television monitoring systems covering the bridge, deck, and engine room to record crew work activities and sudden safety incidents. However, faced with the massive amounts of video data generated around the clock, current safety management still mainly relies on manual spot checks or post-event review. This approach is not only costly and inefficient, but also prone to missed inspections due to fatigue, failing to meet the need for real-time recording and structured archiving of dynamic safety incidents such as "not wearing lifesaving equipment," "unauthorized open flame operations," or "personnel physical conflicts."
[0003] To automate surveillance, computer vision technology has been widely adopted. Early technologies focused primarily on object detection or action recognition. While these could identify specific objects (such as safety helmets) or simple actions, they struggled to understand continuous events with complex temporal logic and could not generate natural language logs that match how people describe events. The emergence of dense video captioning (DVC) technology has provided a new path to solve this problem. It can automatically locate multiple event segments in a video, whether overlapping or not, and generate descriptive text for each.
[0004] However, applying DVC technology to ship safety faces significant challenges. First, training a fully supervised DVC model relies on expensive frame-level temporal boundary annotations, and large-scale, finely annotated data is extremely difficult to obtain in the highly specialized field of ship safety. Second, existing weakly supervised methods require only event-level text descriptions for training, which significantly reduces annotation costs, but the lack of explicit temporal boundary supervision often makes it difficult for models to accurately infer the start and end times of events, easily leading to localization drift or to background noise being misdetected as events. Furthermore, existing multi-instance learning frameworks typically employ fixed aggregation strategies and ignore the vast differences in event complexity within ship scenarios (e.g., the semantic density of "walking" is completely different from that of "group work"). This results in inaccurate boundary localization for complex, long events and introduces excessive noise for simple, short events. Therefore, there is an urgent need for an automatic description and recording method for ship safety events that can utilize weakly supervised signals, adaptively guide temporal localization through text semantics, and flexibly adapt to events of varying complexity.
Summary of the Invention
[0005] The purpose of this invention is to address two problems in existing ship safety monitoring: fully supervised video description methods depend heavily on frame-level temporal boundary annotations, which makes annotation costly; and, when temporal boundary supervision is lacking, localization accuracy is low and safety events of varying complexity (e.g., simple actions and complex group events) are difficult to record automatically. This invention provides a method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios. This method can utilize a large amount of weakly supervised data with only event-level text annotations to achieve accurate temporal localization and natural language description of ship safety events.
[0006] This invention provides a method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios, comprising the following steps: Step 1, extracting multimodal features from ship monitoring videos and safety event text descriptions using a visual language model, and generating enhanced video coding features through multi-scale temporal modeling using a deformable encoder; Step 2, constructing a temporal distribution based on the similarity between video coding features and text features, and generating initial pseudo-boundaries by weighted temporal moment regression after semantically guided adaptive enhancement and non-uniform window initialization; Step 3, mapping the initial pseudo-boundaries to reference points and inputting them into the decoder for layer-by-layer interaction and non-linear updates, and achieving boundary refinement and joint supervision using a multi-term loss including segment consistency constraints; Step 4, constructing candidate boundaries and calculating weakly supervised loss based on a multi-instance learning (MIL) framework, and generating the final safety event record through semantic adaptive calibration and dynamic K-value weighted fusion.
[0007] Preferably, step 1 includes: step 1-1, using a pre-trained vision-language model, processing the video through the vision branch to extract frame-level multimodal features, and processing the descriptive text through the text branch to extract text feature vectors; step 1-2, inputting the frame-level multimodal feature sequence into the baseline encoder, performing temporal downsampling through multi-level one-dimensional convolution to construct a multi-scale temporal feature representation, and superimposing positional encoding; step 1-3, inputting the multi-scale temporal feature representation after superimposing positional encoding into the deformable Transformer encoder, and generating enhanced video temporal coding features through a multi-scale deformable self-attention mechanism.
[0008] Preferably, step 2 includes: step 2-1, calculating the similarity between the safety event text embedding and the video frame-level features, constructing a similarity temporal distribution, and computing the vector norm of the text embedding to characterize semantic strength; step 2-2, adjusting the decision threshold and sliding window scale inversely according to the semantic strength; step 2-3, performing local temporal analysis on the similarity temporal distribution within the adjusted sliding window, identifying significant change points and performing weighted enhancement to obtain an enhanced temporal response distribution; step 2-4, initializing non-uniform time windows based on the enhanced temporal response distribution, selecting a set of key frames within each window, calculating weighted temporal moments, and obtaining an initial pseudo-boundary box.
[0009] Preferably, step 3 includes: Step 3-1, mapping the initial pseudo-boundary box to an initial reference point and initializing the event query vector, using the initial reference point and the event query vector as input to the first-layer decoder; Step 3-2, executing a self-attention mechanism and a multi-scale deformable cross-attention mechanism in each layer of the decoder, sampling key feature points on the multi-scale feature pyramid of the video coding features based on the current reference point position, and realizing the alignment and interaction between the event query vector and the video temporal features; Step 3-3, predicting the coordinate offset using the bounding box prediction head, and performing nonlinear coordinate updates using the inverse Sigmoid function and the Sigmoid function to obtain the updated reference point coordinates; Step 3-4, using the updated reference point coordinates as input to the next decoding layer, blocking the backpropagation of gradients during this transfer, and collecting the output states of each decoding layer to generate a hierarchical query feature sequence; Step 3-5, calculating, for the output of each decoding layer, the detection-related loss, which includes at least classification loss, boundary regression loss, generalized intersection-over-union loss, and counting loss, with the loss terms combined according to preset weights and calculated separately in each decoding layer; Step 3-6, sorting multiple time segments predicted in the same video temporally according to their center time positions, calculating the difference in foreground confidence between temporally adjacent segments, constructing a time weighting factor based on the time interval between adjacent segments, and introducing the segment consistency loss to apply a smoothing constraint to temporally adjacent predicted segments.
[0010] Preferably, step 4 includes: Step 4-1, performing temporal jitter enhancement based on intermediate event time segments, constructing a multi-instance bag containing multiple candidate bounding boxes, calculating instance confidence and description generation score, and constructing description generation loss and weakly supervised multi-instance learning loss; Step 4-2, constructing a semantic adaptive module, fusing the last-layer decoder features with the text embedding, generating adaptive weights and boundary fine-tuning offsets, reweighting the initial scores and updating the center and width of the candidate boxes to obtain a semantically calibrated candidate bounding box set; Step 4-3, statistically analyzing the distribution characteristics of the comprehensive quality score in the calibrated candidate bounding box set, and dynamically calculating the optimal aggregation quantity K; Step 4-4, selecting the K target candidate boxes with the highest comprehensive quality score from the calibrated candidate bounding box set, performing weighted summation and fusion on the time coordinates to generate the final safety event time interval, and associating it with the event description text, event confidence, and event type to generate a structured ship monitoring safety event record.
[0011] Preferably, the dynamic calculation of the optimal aggregation quantity K includes: computing the range factor, the distribution uniformity factor, and the peak confidence factor of the calibrated scores; establishing a positive correlation between K and both the range factor and the distribution uniformity factor, and a negative correlation between K and the peak confidence factor; dynamically calculating the K value based on this relationship, and constraining the K value within a preset minimum and maximum quantity range.
[0012] This invention provides a weakly supervised dense video event detection and structured log generation system for ship monitoring scenarios, comprising: a multimodal feature extraction module, used to extract multimodal features from ship monitoring videos and safety event text descriptions using a visual language model, and to generate enhanced video coding features through multi-scale temporal modeling using a deformable encoder; a pseudo-boundary generation module, used to construct a temporal distribution based on the similarity between video coding features and text features, and to generate initial pseudo-boundaries using weighted temporal moment regression through semantically guided adaptive enhancement and non-uniform window initialization; a boundary refinement module, used to map the initial pseudo-boundaries as reference points into the decoder for layer-by-layer interaction and non-linear updates, and to achieve boundary refinement and joint supervision using a multivariate loss including segment consistency constraints; and a structured log generation module, used to construct candidate boundaries and calculate weakly supervised loss under a multi-instance learning framework, and to generate the final safety event record through semantic adaptive calibration and dynamic K-value weighted fusion.
[0013] The present invention provides a computer storage medium storing computer-executable instructions thereon, which, when executed by a processor, implement the steps of the method described above.
[0014] Technical effect
[0015] Compared with existing technologies, the beneficial effects of this invention are as follows. It significantly lowers the threshold for intelligent ship retrofitting and the long-term data maintenance cost, because it does not rely on large-scale, finely labeled data. Through a weakly supervised learning framework, model training can be completed using only existing nautical logs or event-level semantic descriptions, enabling rapid deployment on existing ship CCTV systems. To address practical issues such as complex sea conditions, ship swaying, and variable operational scenarios, it introduces segment consistency constraints and an iterative decoding mechanism that suppress event boundary jitter and fragmented recording, improving the stability and robustness of recording continuous safety events. Furthermore, an adaptive adjustment strategy based on semantic strength enables the system to capture both high-risk emergencies and low-dynamic daily violations, reducing false alarms caused by environmental noise while maintaining high sensitivity. Finally, through multi-instance fusion and semantic alignment mechanisms, it achieves automatic temporal localization and structured semantic recording of safety events, promoting the transformation of ship safety management from traditional post-event manual review to a real-time, standardized, and intelligent event recording mode, and reducing the monitoring and log reporting burden on safety management personnel.
Attached Figure Description
[0016] The above and other objects, features, and advantages of this application will become more apparent from the following detailed description of the embodiments in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this application and form part of the specification. They are used together with the embodiments of this application to explain the application and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps.
[0017] Figure 1 is a schematic diagram of the overall network architecture of the weakly supervised dense video event detection and structured log generation method for ship monitoring scenarios in this embodiment of the invention.
Detailed Implementation
[0018] To make the technical means, creative features, objectives and effects of this invention easy to understand, the following embodiments, in conjunction with the accompanying drawings, specifically illustrate a method and system for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios.
[0019] This embodiment provides a method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios.
[0020] Figure 1 is a schematic diagram of the overall network architecture of the weakly supervised dense video event detection and structured log generation method for ship monitoring scenarios in this embodiment of the invention.
[0021] As shown in Figure 1, the method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios in this embodiment includes the following steps:
[0022] Step S1: Use a visual language model to extract multimodal features from ship monitoring videos and text descriptions of safety incidents, and use a deformable encoder to perform multi-scale temporal modeling to generate enhanced video coding features.
[0023] In this embodiment, the specific implementation process of step S1 is as follows:
[0024] Step S1-1: Using a pre-trained vision-language model, the video is processed through the vision branch to extract frame-level multimodal features containing latent semantics, while the descriptive text is processed through the text branch to extract text feature vectors representing the overall semantics, thereby constructing the multimodal input required for subsequent semantic-guided analysis.
[0025] Let the input be a ship monitoring video sequence together with its corresponding safety event text description. A frame-level feature sequence is extracted from the video using a pre-trained action recognition network and serves as the input; it is characterized by the length of the video and the feature dimension. Simultaneously, the text description of the safety event is fed into the text encoding branch of the vision-language multimodal model to extract a text feature vector that represents the overall semantic information of the text.
[0026] Step S1-2: The frame-level feature sequence
[0027]
[0028] is input into a baseline encoder, which includes one-dimensional convolutional layers and group normalization layers. Multi-level one-dimensional convolutions are used to perform temporal downsampling on the frame-level feature sequence, constructing multi-scale temporal feature representations with different temporal resolutions:
[0029]
[0030] Here, each level corresponds to a different time scale and applies a one-dimensional convolution at that scale followed by a group normalization operation. Furthermore, sinusoidal positional encodings are superimposed on the features at each scale to incorporate temporal position information.
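The multi-scale construction in step S1-2 can be sketched in NumPy as follows. This is a minimal illustration, not the embodiment's implementation: the kernel size of 3, the stride of 2 at every level, the number of groups, the random stand-in weights, and the function names (`conv1d_stride2`, `group_norm`, `sinusoidal_pe`, `build_pyramid`) are all assumptions introduced here.

```python
import numpy as np

def conv1d_stride2(x, weight):
    """Stride-2 1D convolution over time. x: (T, D_in); weight: (k, D_in, D_out)."""
    k = weight.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))            # zero-pad the time axis
    starts = range(0, x.shape[0], 2)                # stride 2 halves the resolution
    return np.stack([np.einsum("kd,kde->e", xp[s:s + k], weight) for s in starts])

def group_norm(x, groups=4, eps=1e-5):
    """Normalize each channel group over time and over the channels in the group."""
    T, D = x.shape
    g = x.reshape(T, groups, D // groups)
    mean = g.mean(axis=(0, 2), keepdims=True)
    var = g.var(axis=(0, 2), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(T, D)

def sinusoidal_pe(T, D):
    """Standard sinusoidal positional encoding (D assumed even)."""
    pos = np.arange(T)[:, None]
    i = np.arange(0, D, 2)[None, :]
    angle = pos / (10000 ** (i / D))
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def build_pyramid(frames, num_levels=3, seed=0):
    """Return one feature map per scale, each with positional encoding added."""
    rng = np.random.default_rng(seed)
    D = frames.shape[1]
    levels, x = [], frames
    for _ in range(num_levels):
        w = rng.normal(scale=0.1, size=(3, D, D))   # random stand-in for learned weights
        x = group_norm(conv1d_stride2(x, w))
        levels.append(x + sinusoidal_pe(x.shape[0], D))
    return levels

# 64 frames with 16-dimensional features (synthetic data)
pyramid = build_pyramid(np.random.default_rng(1).normal(size=(64, 16)))
```

Each level halves the temporal resolution under these assumptions, so coarse levels summarize long events while fine levels keep short ones resolvable.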
[0031] Step S1-3: The multi-scale temporal feature representation with superimposed positional encoding is input into the deformable Transformer encoder. A multi-scale deformable self-attention mechanism sparsely samples features at different time scales, calculates the attention weights at the sampling positions, and aggregates the corresponding temporal context information to generate an enhanced video temporal coding feature sequence:
[0032]
[0033] Step S2: Construct a temporal distribution based on video-text similarity, and generate initial pseudo-boundaries by weighted temporal moment regression after semantically guided adaptive enhancement and non-uniform window initialization.
[0034] In this embodiment, the specific implementation process of step S2 is as follows:
[0035] Step S2-1: Calculate the similarity sequence between the safety event text embedding and the video frame-level features, and construct a temporal distribution of similarity that reflects the semantic response intensity of the event over time; simultaneously, compute the vector norm of the text embedding and standardize it to obtain the semantic strength of the text;
[0036] Step S2-2: Establish an adaptive adjustment relationship between semantic strength and temporal analysis parameters. Adjust the decision threshold and sliding window size for temporal change analysis inversely based on semantic strength, so that high semantic strength events correspond to lower change decision thresholds and smaller temporal analysis windows, while low semantic strength events correspond to higher change decision thresholds and larger temporal analysis windows. The specific implementation process is as follows:
[0037] Given a preset base threshold and base window size, the adaptively adjusted decision threshold and sliding window size are calculated as follows:
[0038]
[0039]
[0040] Here, the semantic strength is passed through a standardization function and scaled by a preset adjustment coefficient; the higher the semantic strength, the lower the decision threshold and the smaller the window.
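Steps S2-1 and S2-2 can be illustrated with a short NumPy sketch. The formulas of the embodiment are not reproduced in this text, so everything specific below is an assumption: cosine similarity as the similarity measure, `tanh` as the standardization function, the linear form `base * (1 - beta * strength)`, and all default values and function names. Only the direction of the adjustment (higher semantic strength gives a lower threshold and a smaller window) comes from the description.

```python
import numpy as np

def similarity_sequence(frame_feats, text_emb):
    """Cosine similarity between each frame feature (T, D) and the text embedding (D,)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return f @ t

def semantic_strength(text_emb, scale=1.0):
    """Assumed standardization: squash the L2 norm of the embedding into [0, 1)."""
    return float(np.tanh(np.linalg.norm(text_emb) / scale))

def adaptive_params(strength, base_thresh=0.05, base_window=9, beta=0.5, min_window=3):
    """Inverse adjustment: stronger semantics -> lower threshold, smaller window."""
    thresh = base_thresh * (1.0 - beta * strength)
    window = int(round(base_window * (1.0 - beta * strength)))
    window = max(min_window, window | 1)   # force an odd window so it has a center
    return thresh, window

thresh_strong, window_strong = adaptive_params(0.9)
thresh_weak, window_weak = adaptive_params(0.1)
```

With these hypothetical defaults, a strongly worded event description is analyzed with a narrower window and a more sensitive threshold than a weakly worded one.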
[0041] Step S2-3: Perform local time series analysis on the similarity time series distribution within the adjusted sliding window, calculate the response fluctuation degree at each time position, and identify the time positions that exceed the decision threshold as significant change points; perform weighted enhancement on the similarity response signals in the neighborhood of the change points to obtain the enhanced time series response distribution.
[0042] Specifically, this embodiment uses vectorized convolution to calculate the local variance of the similarity sequence; time positions whose local variance exceeds the decision threshold are identified as change points, and Gaussian enhancement is applied around them to obtain the enhanced temporal response distribution.
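A minimal sketch of this change-point step is given below. The local variance is computed with two moving-average convolutions (mean of squares minus squared mean). The edge padding, the multiplicative form of the Gaussian enhancement, and the parameters `sigma` and `gain` are assumptions made for illustration; the window is assumed to be odd.

```python
import numpy as np

def local_variance(sim, window):
    """Sliding-window variance via convolution: E[x^2] - (E[x])^2 (odd window)."""
    pad = window // 2
    padded = np.pad(sim, pad, mode="edge")       # repeat border values at the edges
    kernel = np.ones(window) / window
    mean = np.convolve(padded, kernel, mode="valid")
    mean_sq = np.convolve(padded ** 2, kernel, mode="valid")
    return np.maximum(mean_sq - mean ** 2, 0.0)  # clamp tiny negative rounding errors

def enhance(sim, window, thresh, sigma=2.0, gain=0.5):
    """Boost the similarity response in a Gaussian neighborhood of each change point."""
    var = local_variance(sim, window)
    change_points = np.flatnonzero(var > thresh)
    t = np.arange(len(sim))
    boost = np.zeros(len(sim))
    for c in change_points:
        boost = np.maximum(boost, np.exp(-0.5 * ((t - c) / sigma) ** 2))
    return sim * (1.0 + gain * boost), change_points

# Synthetic similarity curve: background 0.1, event response 0.9 on frames 15..24
sim = np.full(40, 0.1)
sim[15:25] = 0.9
enhanced, change_points = enhance(sim, window=5, thresh=0.05)
```

In this toy curve the detected change points cluster around the two edges of the plateau, which is where an event boundary would be expected.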
[0043] Step S2-4: Perform non-uniform time window initialization based on the enhanced temporal response distribution, and within each initialization window select the set of key frames with higher response values. Based on the temporal positions of the key frames and their corresponding response intensities, weighted temporal moments are calculated, from which the center and width of the candidate segment are computed:
[0044]
[0045]
[0046] Here, the mapping coefficients are preset, and the resulting center and width form the initial pseudo-boundary box that is output.
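One plausible reading of the weighted temporal moment computation is sketched below: the response-weighted mean of the key-frame times gives the center, and the square root of the weighted second central moment, scaled by a mapping coefficient, gives the width. The exact formulas are not reproduced in this text, so this form, the key-frame selection ratio `top_ratio`, and the coefficient `width_scale` are assumptions. Responses are assumed to be positive.

```python
import numpy as np

def pseudo_boundary(response, start, end, top_ratio=0.5, width_scale=2.0):
    """Initial (center, width) for the window [start, end) from weighted temporal moments."""
    t = np.arange(start, end)
    r = response[start:end]
    k = max(1, int(np.ceil(top_ratio * len(r))))
    keep = np.argsort(r)[-k:]                   # key frames: the k highest responses
    tk, wk = t[keep], r[keep]
    wk = wk / wk.sum()                          # normalize responses into weights
    center = float((wk * tk).sum())             # first weighted moment
    spread = float(np.sqrt((wk * (tk - center) ** 2).sum()))  # root of second central moment
    return center, width_scale * spread

# Synthetic response: a bump centered on frame 20
frames = np.arange(41)
response = np.exp(-0.5 * ((frames - 20) / 3.0) ** 2)
center, width = pseudo_boundary(response, start=10, end=31)
```

Dividing the center and width by the video length would give normalized coordinates in the unit interval, which is the form a reference point for the decoder typically takes.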
[0047] Step S3: The initial pseudo-boundaries are mapped to reference points and input into the deformable decoder, where they interact layer by layer through attention mechanisms and a nonlinear coordinate update strategy. For the predictions of each layer, a multi-term loss including classification, regression, counting, and segment consistency constraints is calculated, and deep joint supervision is applied to the decoder to achieve layer-by-layer refinement and stable optimization of event boundaries.
[0048] In this embodiment, the specific implementation process of step S3 is as follows:
[0049] Step S3-1, Query Initialization and Reference Point Mapping: Map the center coordinates and temporal width of the initial pseudo-boundary box generated in step S2 to initial reference points, and initialize a set of learnable event query vectors; use the initial reference points and event query vectors as input to the first-layer decoder. In this embodiment, the initial pseudo-boundary box is mapped to the initial reference point.
[0050] Step S3-2, Multi-scale Temporal Feature Interaction: In each network layer of the decoder, a self-attention mechanism is first executed to aggregate the contextual association information between different event query vectors. Subsequently, a multi-scale deformable cross-attention mechanism is executed, which samples key feature points on the multi-scale feature pyramid of the video coding features based on the current reference point position, thereby achieving alignment and interaction between event query vectors and video temporal features.
[0051] In this embodiment, each decoder layer aggregates video features using the multi-scale deformable cross-attention mechanism to obtain the query features of that layer:
[0052]
[0053] Step S3-3, Coordinate Iterative Update Based on Nonlinear Transformation: A bounding box prediction head predicts a coordinate offset from the query features of the current layer. A nonlinear coordinate update strategy is then applied: the reference point coordinates of the current layer are transformed by the inverse Sigmoid function, the coordinate offset is added, and the Sigmoid function is applied to the sum to obtain the updated reference point coordinates:
[0054] $r^{(l+1)} = \sigma\left(\sigma^{-1}\left(r^{(l)}\right) + \Delta r^{(l)}\right)$
[0055] where $r^{(l)}$ is the reference point of the current layer $l$, $\Delta r^{(l)}$ is the predicted coordinate offset, $\sigma$ is the Sigmoid activation function, and $\sigma^{-1}$ is the inverse Sigmoid function.
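The update in step S3-3 can be sketched directly from its verbal description: inverse Sigmoid of the current reference point, plus the predicted offset, followed by Sigmoid. The clipping constant `eps`, the example offsets, and the function names are assumptions added for numerical safety and illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    """Logit function; clipping keeps log() finite at exactly 0 or 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def update_reference(ref, delta):
    """Add the offset in logit space, then map back into (0, 1)."""
    return sigmoid(inverse_sigmoid(ref) + delta)

# (center, width) normalized to [0, 1], refined by two hypothetical layer offsets
ref = np.array([0.40, 0.10])
for delta in [np.array([0.3, 0.5]), np.array([-0.1, 0.2])]:
    ref = update_reference(ref, delta)
```

Working in logit space means any finite offset leaves the coordinates inside the unit interval, so no explicit clamping of the boundary is needed between layers. In an automatic differentiation framework, the gradient blocking of step S3-4 would correspond to detaching `ref` before it is passed to the next layer.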
[0056] Step S3-4, Gradient blocking and hierarchical feature output: The updated reference point coordinates are used as the input to the next decoding layer, and the backpropagation of gradients is blocked during the transmission process. The output states of all decoding layers are collected to generate a hierarchical query feature sequence containing rich semantic and temporal information. The query features of the last decoder layer are used as the input basis for subsequent semantic refinement modules.
[0057] Step S3-5: For the output of each decoding layer, construct the detection-related loss for supervising event prediction. The detection-related loss includes a classification loss for event category prediction, a boundary regression loss for temporal boundary regression, a generalized intersection-over-union (GIoU) loss for measuring the degree of overlap of predicted boundaries, and a counting loss for constraining the prediction of the number of events. These loss terms are combined according to preset weights and calculated separately at each decoding layer to achieve deep supervision.
[0058]
[0059] Here, the classification term is a focal loss that addresses class imbalance; the boundary terms compare the predicted bounding box with the pseudo ground-truth bounding box; and each term has its own weighting coefficient.
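The per-layer loss combination can be sketched as follows. The one-dimensional GIoU and the focal loss follow their standard definitions; the L1 form of the boundary regression term, the weight values, and the way the counting loss is passed in as a precomputed number are assumptions, since the embodiment's formula is not reproduced in this text.

```python
import numpy as np

def to_interval(center, width):
    return center - width / 2.0, center + width / 2.0

def temporal_giou(pred, target):
    """Generalized IoU of two 1D segments given as (center, width), widths > 0."""
    ps, pe = to_interval(*pred)
    ts, te = to_interval(*target)
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    hull = max(pe, te) - min(ps, ts)            # smallest segment enclosing both
    return inter / union - (hull - union) / hull

def focal_loss(p_true, gamma=2.0):
    """Focal loss for the probability assigned to the correct class."""
    return float(-((1.0 - p_true) ** gamma) * np.log(p_true))

def layer_loss(p_true, pred, target, count_loss,
               w_cls=2.0, w_l1=5.0, w_giou=2.0, w_cnt=1.0):
    """Weighted sum of the four detection terms for one decoder layer."""
    l1 = abs(pred[0] - target[0]) + abs(pred[1] - target[1])
    giou_loss = 1.0 - temporal_giou(pred, target)
    return (w_cls * focal_loss(p_true) + w_l1 * l1
            + w_giou * giou_loss + w_cnt * count_loss)

def deep_supervision(per_layer):
    """Apply the same loss to every decoder layer's prediction and sum the results."""
    return sum(layer_loss(*args) for args in per_layer)

pseudo_gt = (0.50, 0.20)
total = deep_supervision([
    (0.6, (0.40, 0.30), pseudo_gt, 0.1),   # early layer: rough prediction
    (0.9, (0.49, 0.21), pseudo_gt, 0.0),   # last layer: refined prediction
])
```

Unlike plain IoU, GIoU stays informative for disjoint segments: it becomes negative in proportion to the gap between them, so the loss still provides a direction for refinement.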
[0060] Step S3-6: Sort the multiple time segments predicted in the same video temporally according to their center time positions; calculate the foreground confidence difference between temporally adjacent segments and construct a time weighting factor based on the time interval between adjacent segments; and introduce the segment consistency loss, which applies a smoothing constraint to temporally adjacent predicted segments to reduce the fluctuation of prediction results in the time dimension:
[0061] Assume a single video contains several predicted segments, each with a center point and a foreground confidence. First, the predicted segments are sorted by center point to obtain the temporal ordering index; then the score difference and the center time difference between adjacent segments after sorting are calculated; time weights are computed using an exponential decay function of the center time difference; finally, the batch-averaged weighted consistency loss is calculated:
[0062]
[0063] Here, the loss involves a time decay coefficient and a loss weighting coefficient. The loss forces predicted segments that are close in time (i.e., with a smaller center time difference) to have similar foreground confidence, thereby smoothing the boundary predictions.
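The segment consistency loss can be sketched for a single video as below. The sorting, the exponential time weight, and the averaging follow the description; the use of an absolute (rather than squared) confidence difference and the default values of `decay` and `weight` are assumptions.

```python
import numpy as np

def segment_consistency_loss(centers, scores, decay=5.0, weight=1.0):
    """Penalize confidence jumps between temporally adjacent predicted segments."""
    centers = np.asarray(centers, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if len(centers) < 2:
        return 0.0
    order = np.argsort(centers)             # temporal ordering by center point
    c, s = centers[order], scores[order]
    score_diff = np.abs(np.diff(s))         # assumed: absolute confidence difference
    time_diff = np.diff(c)                  # gap between adjacent centers (>= 0)
    time_weight = np.exp(-decay * time_diff)
    return float(weight * np.mean(time_weight * score_diff))

near = segment_consistency_loss([0.10, 0.12], [0.9, 0.1])   # close in time
far = segment_consistency_loss([0.10, 0.90], [0.9, 0.1])    # far apart in time
```

The same confidence jump costs far more when the two segments are nearly adjacent than when they are separated by most of the video, which is the smoothing behavior the step describes. A batch version would average this value over videos.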
[0064] Step S4: Construct candidate boundaries and calculate weakly supervised loss based on a multi-instance learning framework, and generate the final safety event record through semantic adaptive calibration and dynamic K-value weighted fusion.
[0065] In this embodiment, the specific implementation process of step S4 is as follows:
[0066] Step S4-1: Perform temporal jitter enhancement based on intermediate event time segments, construct a multi-instance bag containing multiple candidate bounding boxes, calculate the instance confidence and the normalized description generation score, and weight them to obtain a comprehensive score. Simultaneously, a description generation loss and a weakly supervised MIL loss are constructed. The bag aggregation score constrains the model in the absence of temporal boundary supervision, so that high-quality pseudo-boundaries are identified automatically.
[0067] In this embodiment, a multi-instance candidate bag is constructed, and a dual score is calculated for each candidate box:
[0068]
[0069] Here, the score combines the classification probability of the instance and the description generation probability, with the description length and a temperature coefficient as parameters.
[0070] Simultaneously, the description generation loss is calculated to optimize caption generation capability:
[0071]
[0072] A weakly supervised multi-instance learning loss is constructed to constrain the aggregation score of the multi-instance bag, so that high-quality pseudo-boundaries are discovered automatically without temporal boundary supervision:
[0073]
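Step S4-1 can be illustrated with the sketch below. None of the three formulas is reproduced in this text, so the concrete forms are assumptions: Gaussian jitter of the center and log-normal jitter of the width, a dual score that blends the classification probability with a length-normalized caption likelihood (with `gamma` playing the role of the coefficient applied to the description length), and a MIL loss that pools instance scores with a softmax and takes the negative log of the bag score.

```python
import numpy as np

def jitter_bag(center, width, n=8, scale=0.1, seed=0):
    """Bag of n candidate (center, width) boxes around an intermediate segment."""
    rng = np.random.default_rng(seed)
    c = np.clip(center + rng.normal(0.0, scale * width, n), 0.0, 1.0)
    w = np.clip(width * np.exp(rng.normal(0.0, scale, n)), 1e-3, 1.0)
    bag = np.stack([c, w], axis=1)
    bag[0] = (center, width)                 # keep the un-jittered segment in the bag
    return bag

def dual_score(cls_prob, token_logprobs, gamma=1.0, alpha=0.5):
    """Blend classification probability with a length-normalized caption score."""
    length = len(token_logprobs)
    caption = np.exp(np.sum(token_logprobs) / (length ** gamma))
    return float(alpha * cls_prob + (1.0 - alpha) * caption)

def mil_loss(instance_scores, temperature=0.1):
    """Softmax-pooled bag score; the loss is low when some instance scores highly."""
    s = np.asarray(instance_scores, dtype=float)
    a = np.exp((s - s.max()) / temperature)
    a /= a.sum()
    bag_score = float((a * s).sum())
    return float(-np.log(np.clip(bag_score, 1e-8, 1.0)))

bag = jitter_bag(center=0.5, width=0.2)
score = dual_score(0.8, [-0.1, -0.2, -0.3])   # hypothetical token log-probabilities
```

Because the bag-level loss only requires that at least one candidate explain the event well, the model is free to discover which jittered boundary fits best without ever seeing a true start or end time.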
[0074] Step S4-2: Construct a semantic adaptive module that fuses the last-layer decoder features with the text embedding and, through a lightweight network, generates adaptive weights and boundary offsets used to calibrate the scores and boundaries:
[0075]
[0076] The initial scores are reweighted using the adaptive weights, and the center and width of each candidate bounding box are updated according to the offsets, yielding a semantically calibrated set of candidate bounding boxes together with their calibrated scores.
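A minimal sketch of such a semantic adaptive module is shown below. The architecture of the lightweight network is not given in this text, so the design here is an assumption: concatenation as the fusion, one hidden ReLU layer, a Sigmoid gate for the adaptive weight, and a `tanh` head bounded by `max_shift` for the offsets. The parameter names and `init_params` are hypothetical, and the random weights stand in for learned ones.

```python
import numpy as np

def init_params(dim, hidden, seed=0, max_shift=0.05):
    """Random stand-ins for the learned weights of the lightweight network."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(scale=0.1, size=(2 * dim, hidden)),
        "b1": np.zeros(hidden),
        "w_gate": rng.normal(scale=0.1, size=hidden),
        "w_off": rng.normal(scale=0.1, size=(hidden, 2)),
        "max_shift": max_shift,
    }

def semantic_calibrate(query_feats, text_emb, boxes, scores, params):
    """query_feats: (N, D); text_emb: (D,); boxes: (N, 2) as (center, width); scores: (N,)."""
    n = query_feats.shape[0]
    fused = np.concatenate([query_feats, np.tile(text_emb, (n, 1))], axis=1)  # (N, 2D)
    hidden = np.maximum(fused @ params["w1"] + params["b1"], 0.0)             # ReLU
    weight = 1.0 / (1.0 + np.exp(-(hidden @ params["w_gate"])))               # (N,) in (0, 1)
    offset = params["max_shift"] * np.tanh(hidden @ params["w_off"])          # (N, 2), bounded
    new_scores = weight * scores                       # reweight the initial scores
    new_boxes = boxes.copy()
    new_boxes[:, 0] = np.clip(boxes[:, 0] + offset[:, 0], 0.0, 1.0)           # shift center
    new_boxes[:, 1] = np.clip(boxes[:, 1] * np.exp(offset[:, 1]), 1e-3, 1.0)  # rescale width
    return new_boxes, new_scores

rng = np.random.default_rng(1)
boxes = np.array([[0.50, 0.20], [0.30, 0.10], [0.80, 0.10]])
scores = np.array([0.9, 0.5, 0.2])
params = init_params(dim=4, hidden=8)
cal_boxes, cal_scores = semantic_calibrate(rng.normal(size=(3, 4)), rng.normal(size=4),
                                           boxes, scores, params)
```

Bounding the offsets keeps the calibration a fine adjustment: the module can nudge a boundary toward what the text implies but cannot relocate a candidate entirely.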
[0077] Step S4-3: Statistically analyze the distribution characteristics of the comprehensive quality scores in the calibrated candidate bounding box set, and calculate the range factor, the distribution uniformity factor, and the peak confidence factor. The optimal aggregation quantity K is positively correlated with the range factor and the distribution uniformity factor and negatively correlated with the peak confidence factor; the number K of candidates participating in the fusion is dynamically calculated from this relationship, and the K value is constrained within the preset minimum and maximum range:
[0078]
[0079] Here, the formula combines the range factor, the distribution uniformity factor, and the peak confidence factor with a baseline quantity, and the result is bounded by the minimum and maximum quantity constraints.
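The dynamic K calculation can be sketched as below. Only the signs of the three correlations and the final clamping come from the description. The linear combination, the coefficient defaults, and the way each factor is computed (score range, normalized entropy, maximum score) are assumptions, since the formula itself is not reproduced in this text.

```python
import numpy as np

def score_factors(scores):
    """Assumed definitions of the three statistics of the calibrated scores."""
    s = np.asarray(scores, dtype=float)
    range_factor = float(s.max() - s.min())
    p = s / s.sum()
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    uniformity = entropy / np.log(len(s)) if len(s) > 1 else 0.0   # 1.0 = perfectly even
    peak = float(s.max())
    return range_factor, uniformity, peak

def dynamic_k(range_factor, uniformity, peak, k_base=3, k_min=1, k_max=8,
              a=1.0, b=1.0, c=1.0):
    """K grows with range and uniformity, shrinks with peak confidence, then is clamped."""
    k = k_base * (1.0 + a * range_factor + b * uniformity - c * peak)
    return int(np.clip(round(k), k_min, k_max))

k_confident = dynamic_k(0.5, 0.5, 0.9)   # one dominant candidate -> fewer boxes fused
k_uncertain = dynamic_k(0.5, 0.5, 0.2)   # no clear winner -> more boxes fused
```

The intent is that a clear, confident winner is trusted mostly on its own, while an ambiguous set of candidates is averaged more broadly to reduce noise.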
[0080] Step S4-4: Select the K target candidate boxes with the highest comprehensive quality scores from the calibrated candidate bounding box set. The comprehensive quality scores of the K target candidate boxes are normalized to obtain the fusion weights, and the time coordinates of the target candidate boxes are fused by weighted summation to generate the final safety event time interval:
[0081]
[0082] By associating the final time interval with the event description text, event confidence level, and event type, a structured record of ship monitoring safety events is generated.
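The fusion and record generation of step S4-4 can be sketched as follows. Normalizing the top-K scores by their sum, taking the mean of the top-K scores as the event confidence, and the field names of the record are assumptions; the description only specifies that the scores are normalized into weights and that the interval is associated with the description text, confidence, and event type.

```python
import json
import numpy as np

def fuse_top_k(boxes, scores, k):
    """Score-weighted average of the k best (center, width) boxes -> (start, end, confidence)."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(scores)[::-1][:k]               # indices of the k highest scores
    w = scores[top] / scores[top].sum()              # assumed: sum normalization
    center, width = (w[:, None] * boxes[top]).sum(axis=0)
    confidence = float(scores[top].mean())           # assumed confidence definition
    return float(center - width / 2), float(center + width / 2), confidence

def make_record(start, end, confidence, description, event_type, duration_s):
    """Structured log entry; normalized times are converted to seconds."""
    return {
        "event_type": event_type,
        "description": description,
        "start_s": round(start * duration_s, 2),
        "end_s": round(end * duration_s, 2),
        "confidence": round(confidence, 3),
    }

start, end, conf = fuse_top_k(
    boxes=[[0.50, 0.20], [0.52, 0.20], [0.90, 0.10]],
    scores=[0.8, 0.8, 0.1],
    k=2,
)
record = make_record(start, end, conf,
                     description="Crew member on deck without a life jacket",  # example text
                     event_type="missing_lifesaving_equipment",                # hypothetical label
                     duration_s=100.0)
log_line = json.dumps(record)
```

Serializing each record as one JSON line makes the log easy to append to and to filter later by event type or time range.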
[0083] This embodiment also provides a weakly supervised dense video event detection and structured log generation system for ship monitoring scenarios, including:
[0084] The multimodal feature extraction module is used to execute the steps in step S1 of this embodiment to achieve multimodal feature extraction.
[0085] The pseudo-boundary generation module is used to execute the steps in step S2 of this embodiment to generate pseudo-boundaries.
[0086] The boundary refinement module is used to execute the steps in step S3 of this embodiment to achieve boundary refinement.
[0087] The structured log generation module is used to execute the steps in step S4 of this embodiment to generate structured logs.
[0088] Although embodiments of the present invention have been disclosed above, the invention is not limited to the applications listed in the specification and embodiments, and it can be applied to any field for which it is suitable. Those skilled in the art can readily make further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details and illustrations shown and described herein.
Claims
1. A method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios, characterized in that it includes the following steps: Step 1: Extract multimodal features from ship monitoring videos and safety event text descriptions using a visual language model, and generate enhanced video coding features by performing multi-scale temporal modeling with a deformable encoder; Step 2: Construct a temporal distribution based on the similarity between the video coding features and the text features, and generate initial pseudo-boundaries through semantically guided adaptive enhancement, non-uniform window initialization, and weighted temporal moment regression; Step 3: Map the initial pseudo-boundaries to reference points and input them to a decoder for layer-by-layer interaction and nonlinear updating, and achieve boundary refinement and joint supervision using multiple losses that include a segment consistency constraint; Step 4: Construct candidate boundaries and calculate a weakly supervised loss based on a multi-instance learning framework, and generate the final safety event record through semantic adaptive calibration and dynamic K-value weighted fusion.
2. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 1, characterized in that Step 1 includes: Step 1-1: Using a pre-trained vision-language model, process the video through the vision branch to extract frame-level multimodal features, and process the descriptive text through the text branch to extract text feature vectors; Step 1-2: Input the frame-level multimodal feature sequence into the baseline encoder, perform temporal downsampling through multi-level one-dimensional convolutions to construct a multi-scale temporal feature representation, and add positional encoding; Step 1-3: Input the position-encoded multi-scale temporal feature representation into the deformable Transformer encoder, and generate enhanced video temporal coding features through the multi-scale deformable self-attention mechanism.
3. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 1, characterized in that Step 2 includes: Step 2-1: Calculate the similarity between the safety event text embedding and the video frame-level features, construct the similarity temporal distribution, and compute the vector norm of the text embedding to quantify its semantic strength; Step 2-2: Adjust the decision threshold and the sliding window scale inversely according to the semantic strength; Step 2-3: Perform local temporal analysis on the similarity temporal distribution within the adjusted sliding window, identify significant change points, and apply weighted enhancement to obtain the enhanced temporal response distribution; Step 2-4: Based on the enhanced temporal response distribution, perform non-uniform time window initialization, filter the key frame set within each window, and calculate weighted temporal moments to obtain the initial pseudo bounding boxes.
4. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 1, characterized in that Step 3 includes: Step 3-1: Map the initial pseudo bounding boxes to initial reference points and initialize the event query vectors, and use the initial reference points and the event query vectors as the input to the first decoder layer; Step 3-2: In each decoder layer, execute the self-attention mechanism and the multi-scale deformable cross-attention mechanism, sampling key feature points on the multi-scale feature pyramid of the video coding features based on the current reference point positions, so that the event query vectors are aligned with and interact with the video temporal features; Step 3-3: Use the bounding box prediction head to predict the coordinate offsets, and perform a nonlinear coordinate update using the inverse Sigmoid function and the Sigmoid function to obtain the updated reference point coordinates; Step 3-4: Use the updated reference point coordinates as the input to the next decoder layer, block gradient backpropagation during this transfer, and collect the output states of every decoder layer to generate a hierarchical query feature sequence.
5. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 4, characterized in that Step 3 further includes: Step 3-5: For the output of each decoder layer, calculate a detection-related loss that includes at least a classification loss, a boundary regression loss, a generalized intersection-over-union (GIoU) loss, and a counting loss; the loss terms are combined according to preset weights and calculated separately for each decoder layer.
6. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 4, characterized in that Step 3 further includes: Step 3-6: Sort the predicted time segments of the same video by their center time positions, calculate the difference in foreground confidence between temporally adjacent segments, construct a time weighting factor based on the time interval between adjacent segments, and introduce a segment consistency loss that applies a smoothing constraint to temporally adjacent predicted segments.
7. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 1, characterized in that Step 4 includes: Step 4-1: Perform temporal jitter augmentation based on the intermediate event time segments, construct a multi-instance bag containing multiple candidate bounding boxes, calculate the instance confidence and the description generation score, and construct the description generation loss and the weakly supervised multi-instance learning loss; Step 4-2: Construct a semantic adaptive module, fuse the features of the last decoder layer with the text embedding, generate adaptive weights and boundary fine-tuning offsets, reweight the initial scores, and update the centers and widths of the candidate boxes to obtain a set of semantically calibrated candidate bounding boxes; Step 4-3: Statistically analyze the distribution of the comprehensive quality scores in the calibrated candidate bounding box set, and dynamically calculate the optimal aggregation quantity K; Step 4-4: Select the K target candidate boxes with the highest comprehensive quality scores from the calibrated candidate bounding box set, fuse their time coordinates by weighted summation to generate the final safety event time interval, and associate it with the event description text, event confidence, and event type to generate a structured record of ship monitoring safety events.
8. The method for weakly supervised dense video event detection and structured log generation for ship monitoring scenarios according to claim 7, characterized in that dynamically calculating the optimal aggregation quantity K includes: computing the range factor, the distribution uniformity factor, and the peak confidence factor of the calibrated scores; establishing a positive correlation between the optimal aggregation quantity K and both the range factor and the distribution uniformity factor, and a negative correlation between K and the peak confidence factor; and dynamically calculating K based on this relationship while constraining K within a preset minimum and maximum range.
9. A weakly supervised dense video event detection and structured log generation system for ship monitoring scenarios, characterized in that it includes: a multimodal feature extraction module, used to extract multimodal features from ship monitoring videos and safety event text descriptions using a visual language model, and to generate enhanced video coding features through multi-scale temporal modeling with a deformable encoder; a pseudo-boundary generation module, used to construct a temporal distribution based on the similarity between the video coding features and the text features, and to generate initial pseudo-boundaries through semantically guided adaptive enhancement, non-uniform window initialization, and weighted temporal moment regression; a boundary refinement module, used to map the initial pseudo-boundaries to reference points and input them to a decoder for layer-by-layer interaction and nonlinear updating, and to achieve boundary refinement and joint supervision using multiple losses that include a segment consistency constraint; and a structured log generation module, used to construct candidate boundaries and calculate a weakly supervised loss under a multi-instance learning framework, and to generate the final safety event record through semantic adaptive calibration and dynamic K-value weighted fusion.
10. A computer storage medium storing computer-executable instructions, characterized in that, when the computer-executable instructions are executed by a processor, they implement the steps of the method according to any one of claims 1-8.