Operational status analysis method and system for port multi-view monitoring video

Unified temporal coding and a decoupling guidance mechanism are used to divert semantic information, and multi-granularity fusion modeling is then applied. This addresses the difficulty of distinguishing overall operational status from local behavior in multi-camera videos and enables accurate and stable analysis of port operational status.

CN122244768A · Pending · Publication Date: 2026-06-19 · SHANGHAI MARITIME UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: SHANGHAI MARITIME UNIVERSITY
Filing Date: 2026-04-21
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, the semantic information of multi-camera videos is mixed, making it difficult to effectively distinguish between the overall operation status and local behavior. The lack of structural constraints in the fusion judgment process leads to unstable recognition results.

Method used

By using unified temporal coding to obtain feature representations of multi-source monitoring videos, a decoupling guidance mechanism is introduced for semantic diversion, and a multi-granularity fusion mechanism is used for differentiated modeling to generate fused semantic representations for analyzing port operation status.

Benefits of technology

It effectively distinguishes between overall operational status and local behavioral information in multi-view videos, improving the accuracy and stability of analysis in complex port operation scenarios.


Abstract

This invention provides a method and system for analyzing the operational status of port multi-view monitoring videos. The method comprises the following steps: acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in a port or terminal environment, and determining video feature representations based on the multi-source monitoring video sequences; determining a global operation semantic representation and a view-specific behavior semantic representation based on the video feature representations; performing differentiated modeling on the global operation semantic representation and the view-specific behavior semantic representation, and obtaining a fused semantic representation for operational status analysis through a multi-granularity fusion mechanism; and determining the port operation intelligent perception results based on the fused semantic representation. The invention has the following beneficial effects: it effectively distinguishes overall operational status information from local behavioral information in multi-view monitoring videos, improving the accuracy and stability of operational status analysis and risk perception in complex port operation scenarios without increasing the inference complexity of the system.

Description

Technical Field

[0001] This invention relates to the field of computer vision and intelligent video analysis technology, and in particular to a method and system for analyzing the operational status of multi-view monitoring videos in ports.

Background Technology

[0002] With the continuous expansion of port and terminal scale, port operations are increasingly characterized by wide distribution of operating areas, numerous operational processes, and frequent changes in operational rhythm. To ensure port operation safety and operational order, video surveillance systems have been widely used in port operation management. The monitoring range typically covers operating areas, storage yards, passageways, and related auxiliary areas, becoming an important means of information acquisition in port operation management.

[0003] In actual operation, port surveillance video is typically captured simultaneously by multiple cameras. Influenced by factors such as the layout of the work area, equipment installation conditions, and varying monitoring needs, different cameras differ significantly in installation location, height, and shooting angle. The area covered and focus of each monitoring frame also differs. This deployment naturally gives port surveillance video the characteristics of multiple cameras and multiple perspectives, resulting in significant differences in spatial information and visual representation among different video frames. Simultaneously, port operations are continuous and phased, often involving frequent interactions between personnel, vehicles, and equipment. While recording changes in the operational process, surveillance video contains both macroscopic information reflecting the overall operational status and phases, as well as local behavioral information related to specific work areas or perspectives. This interweaving of information across temporal and spatial scales gives port surveillance video data a diverse semantic hierarchy and complex structure.

[0004] In the high-intensity operation and complex working environment of ports, management's demand for surveillance video is gradually shifting from post-event review to continuous perception of operational status and risks. This requires the ability to stably reflect changes in operational status and identify potential risks within multi-view, long-term video data. However, limited by the complexity of the port operating environment and the inherent differences in the characteristics of surveillance video data, effectively distinguishing between different levels of operational status and behavioral information within multi-view, long-term video data while maintaining overall consistency remains a real challenge in port surveillance video applications.

[0005] Therefore, it is necessary to further study analysis methods for port surveillance videos in light of their application environment and data characteristics, so as to provide a more reliable technical foundation for subsequent operational status analysis and risk management.

Summary of the Invention

[0006] In view of the shortcomings of the prior art described above, the purpose of this invention is to provide a method and system for analyzing the operational status of port multi-view surveillance videos, in order to solve the problems of mixed semantic information in multi-camera videos, difficulty in effectively distinguishing between overall operational status and local behavior, and unstable recognition results due to the lack of structural constraints in the fusion judgment process.

[0007] To achieve the above and other related objectives, the present invention provides the following technical solution:

[0008] A method for analyzing the operational status of port multi-view monitoring videos includes the following steps: acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in the port or terminal environment, and determining video feature representations based on the multi-source monitoring video sequences; determining a global operational semantic representation for characterizing the overall operational status and a view-specific behavioral semantic representation for characterizing local behavioral features under different monitoring perspectives based on the video feature representations; performing differentiated modeling on the global operational semantic representation and the view-specific behavioral semantic representation, and obtaining a fused semantic representation for operational status analysis through a multi-granularity fusion mechanism; and determining port operational intelligent perception results based on the fused semantic representations, wherein the port operational intelligent perception results include port operational status identification results, abnormal behavior judgment results, and operational risk assessment results.

[0009] A port operation status analysis system based on multi-view monitoring videos includes: a video acquisition module for acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in a port or terminal environment, and determining video feature representations based on the multi-source monitoring video sequences; a semantic representation determination module for determining a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives based on the video feature representations; a differential modeling module for differentially modeling the global operation semantic representation and the view-specific behavior semantic representation, and obtaining a fused semantic representation for operation status analysis through a multi-granularity fusion mechanism; and an analysis result output module for determining port operation intelligent perception results based on the fused semantic representations, wherein the port operation intelligent perception results include port operation status identification results, abnormal behavior judgment results, and operation risk assessment results.

[0010] An electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the above-described method for analyzing the operational status of port multi-view surveillance video.

[0011] In one embodiment of the present invention, the step of acquiring a multi-source monitoring video sequence collected by a multi-source monitoring device deployed in a port or terminal environment, and determining video feature representation based on the multi-source monitoring video sequence, includes: acquiring a multi-source monitoring video sequence collected by a multi-source monitoring device deployed in a port or terminal environment; preprocessing the multi-source monitoring video sequence, and extracting video feature representation from the preprocessed multi-source monitoring video sequence using a unified temporal coding method, wherein the preprocessing includes video frame extraction, scale normalization, and cross-viewpoint time alignment.

[0012] In one embodiment of the present invention, the unified temporal coding method adopts a temporal feature extraction structure with shared parameters to map the multi-source surveillance video sequence to a feature representation space with consistent structure.

[0013] In one embodiment of the present invention, determining a global operation semantic representation for characterizing the overall operation state and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives based on the video feature representation includes: introducing a decoupling guidance mechanism to the video feature representation, performing split modeling of video semantics at the feature dimension level, thereby obtaining a global operation semantic representation for characterizing the overall operation state and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives.

[0014] In one embodiment of the present invention, the decoupling guidance mechanism uses a gating allocation method with a mutual-exclusion constraint to guide the diversion of video semantics at the feature dimension level, so that the global operation semantic representation and the view-specific behavior semantic representation enter their corresponding semantic modeling channels.

[0015] In one embodiment of the present invention, performing differentiated modeling on the global operation semantic representation and the view-specific behavior semantic representation, and obtaining a fused semantic representation for operational status analysis through a multi-granularity fusion mechanism, includes: performing temporal modeling on the global operation semantic representation and the view-specific behavior semantic representation, and obtaining, based on the temporal modeling results, overall evolution features that characterize the operation process in the time dimension and dynamic change features that characterize local behaviors; and performing temporal fusion on the overall evolution features and the dynamic change features, and obtaining the fused semantic representation for operational status analysis based on the temporal fusion results.

[0016] In one embodiment of the present invention, the temporal modeling of the global operation semantic representation and the view-specific behavior semantic representation is based on an attention-driven temporal modeling mechanism that characterizes the dependencies between different time steps within the sequence, thereby enabling the model to focus on time segments that have a key impact on changes in operational status; in the temporal fusion of the overall evolution features and the dynamic change features, the representations from the different semantic branches are jointly modeled using a sample-adaptive weight allocation method.

[0017] As described above, the present invention provides a method and system for analyzing the operational status of port multi-view monitoring videos, which has the following beneficial effects: building on unified temporal coding, the present invention introduces a decoupling-guided semantic diversion mechanism and combines it with a sample-adaptive multi-granularity fusion modeling strategy, so that the global operation status semantics and the view-specific local behavior semantics can be effectively distinguished and collaboratively modeled at the representation level. This enables overall operational status information to be effectively distinguished from local behavior information in multi-view monitoring videos, and improves the accuracy and stability of operational status analysis and risk perception in complex port operation scenarios without increasing the inference complexity of the system.

Brief Description of the Drawings

[0018] Figure 1 is a flowchart of the operational status analysis method for port multi-view monitoring video according to the first embodiment of the present invention.

[0019] Figure 2 is a schematic diagram of the operational status analysis process, based on decoupling guidance and multi-granularity fusion, of the operational status analysis method for port multi-view monitoring video according to the first embodiment of the present invention.

[0020] Figure 3 is a schematic diagram of the operational status analysis system for port multi-view monitoring video according to the second embodiment of the present invention.

[0021] Figure 4 is a schematic diagram of an electronic device according to the third embodiment of the present invention.

Detailed Implementation

[0022] The following specific examples illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. It should be noted that, unless otherwise specified, the following embodiments and features described herein can be combined with each other.

[0023] The first embodiment of the present invention relates to a method for analyzing the operational status of port multi-view monitoring videos, specifically a method based on decoupling guidance and multi-granularity fusion. The process is shown in Figure 1 and Figure 2 and is detailed as follows:

[0024] Step 101: Acquire the multi-source monitoring video sequence collected by multi-source monitoring devices deployed in the port or terminal environment, and determine the video feature representation based on the multi-source monitoring video sequence.

[0025] Specifically, the process begins by acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments. Then, these sequences are preprocessed, and video feature representations are extracted using a unified temporal coding method. The preprocessing includes frame extraction, scale normalization, and cross-viewpoint time alignment to ensure consistency in the temporal dimension of video data from different monitoring perspectives or operational areas. The unified temporal coding method employs a shared-parameter temporal feature extraction structure to map the multi-source monitoring video sequences to a structurally consistent feature representation space.

[0026] It should also be noted that the multi-source surveillance videos can come from different monitoring perspectives or different work areas. To address the differences in temporal and spatial scales between the multi-view videos, the video data undergoes unified preprocessing. After preprocessing, a unified temporal coding method is used to extract features from the multi-source video sequences, mapping the video frame sequences from different monitoring perspectives to a structurally consistent feature representation space. The uniformly encoded video feature representation can be expressed as $F \in \mathbb{R}^{T \times D}$, where $T$ denotes the number of time steps and $D$ denotes the feature dimension; $F$ characterizes the dynamic changes of a video over time.

[0027] In practical applications, continuous video sequences are first obtained from multi-source monitoring devices deployed in port or terminal environments. These monitoring videos can come from different monitoring angles, different work areas, or different installation heights, and their content covers personnel work behavior, equipment operating status, vehicle running trajectories, and environmental change information.

[0028] The acquired video data is then preprocessed, including frame extraction, resolution and scale normalization, and time alignment, to eliminate differences in temporal and spatial scales between different video sources. The video frame sequence obtained after preprocessing is $X = \{x_1, x_2, \dots, x_T\}$, where $T$ denotes the number of time steps.
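The three preprocessing operations named above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the nearest-neighbour resize, the nearest-timestamp alignment rule, and the synthetic camera data are all assumptions, since the source does not specify how each operation is carried out.

```python
import numpy as np

def sample_frames(frames, timestamps, stride):
    """Frame extraction: keep every `stride`-th frame and its timestamp."""
    return frames[::stride], timestamps[::stride]

def normalize_scale(frames, out_hw):
    """Scale normalization: nearest-neighbour resize to out_hw, pixels scaled to [0, 1]."""
    _, h, w = frames.shape[:3]
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    resized = frames[:, rows][:, :, cols]
    return resized.astype(np.float32) / 255.0

def align_views(views, num_steps):
    """Cross-view time alignment on a shared grid covering the overlap of all views."""
    start = max(ts[0] for _, ts in views)
    end = min(ts[-1] for _, ts in views)
    grid = np.linspace(start, end, num_steps)
    aligned = []
    for frames, ts in views:
        # For each grid time, pick the frame whose timestamp is closest.
        nearest = np.abs(ts[None, :] - grid[:, None]).argmin(axis=1)
        aligned.append(frames[nearest])
    return np.stack(aligned), grid

# Two synthetic cameras with different resolutions, frame rates, and start times.
rng = np.random.default_rng(0)
cam_a = rng.integers(0, 256, size=(40, 48, 64, 3), dtype=np.uint8)
ts_a = np.arange(40) / 20.0
cam_b = rng.integers(0, 256, size=(25, 72, 96, 3), dtype=np.uint8)
ts_b = 0.1 + np.arange(25) / 12.5

views = []
for frames, ts, stride in [(cam_a, ts_a, 2), (cam_b, ts_b, 1)]:
    kept, kept_ts = sample_frames(frames, ts, stride)
    views.append((normalize_scale(kept, (24, 32)), kept_ts))

X, grid = align_views(views, num_steps=8)
print(X.shape)  # axes: (views, T, height, width, channels)
```

After this step every view shares the same number of time steps and the same spatial size, which is what allows a single encoder to be applied to all of them.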

[0029] Subsequently, a unified temporal coding process is performed on the video frame sequence, mapping the original video data to a feature representation space with a consistent structure. Specifically, a temporal coding network $E(\cdot)$ maps the video frame $x_t$ at each time step into a feature vector $f_t = E(x_t)$, where $f_t \in \mathbb{R}^{D}$. The feature vectors from all time steps are stacked in chronological order to obtain the uniformly encoded video feature representation $F = [f_1; f_2; \dots; f_T] \in \mathbb{R}^{T \times D}$. This unified encoding stage does not distinguish between different kinds of semantic information; its purpose is to provide unified and comparable feature inputs for subsequent semantic splitting and modeling.
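A minimal sketch of shared-parameter temporal coding is given below. The source states only that one temporal feature extraction structure with shared parameters maps every view into a consistent feature space; the concrete architecture here (a linear projection per frame, a moving average over neighbouring time steps, and a tanh) and the class name `SharedTemporalEncoder` are assumptions chosen for brevity.

```python
import numpy as np

class SharedTemporalEncoder:
    """One set of parameters, reused for every camera view (hypothetical design)."""

    def __init__(self, in_dim, feat_dim, kernel=3, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, feat_dim))
        self.taps = np.full(kernel, 1.0 / kernel)  # temporal averaging weights

    def __call__(self, frames):
        # frames: (T, H, W, C); each frame is flattened and projected to D dims.
        T = frames.shape[0]
        x = frames.reshape(T, -1) @ self.proj
        pad = len(self.taps) // 2
        xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        # Mix each step with its neighbours so every feature vector carries
        # short-range temporal context.
        f = sum(w * xp[i:i + T] for i, w in enumerate(self.taps))
        return np.tanh(f)  # (T, D)

rng = np.random.default_rng(1)
views = rng.random((3, 16, 8, 8, 3))          # 3 cameras, T = 16 aligned steps
encoder = SharedTemporalEncoder(in_dim=8 * 8 * 3, feat_dim=32)
F = np.stack([encoder(v) for v in views])     # same encoder for all views
print(F.shape)
```

Because the same `encoder` object processes every view, features from different cameras land in one comparable space, which is the property the later splitting step relies on.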

[0030] Step 102: Based on the video feature representation, determine the global operation semantic representation used to characterize the overall operation status and the view-specific behavior semantic representation used to characterize the local behavioral features under different monitoring perspectives.

[0031] Specifically, a decoupling guidance mechanism is introduced into the video feature representation to perform split modeling of video semantics at the feature dimension level, thereby obtaining a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives. The decoupling guidance mechanism uses a mutually exclusive constraint gating allocation method to guide the splitting of video semantics at the feature dimension level, so that the global operation semantic representation and the view-specific behavior semantic representation enter the corresponding semantic modeling channel.

[0032] It should also be noted that, to distinguish the different semantic components contained in the video features, a decoupling guidance mechanism is introduced. This mechanism uses a mutually exclusive gating allocation method to guide the distribution of video semantics at the feature dimension level. Based on the feature content, it assigns gating weights to the different semantic channels, guiding semantic components related to the overall operation status into the global operation semantic channel and semantic components related to view-specific behavior changes into the view-specific behavior semantic channel, thus achieving semantic separation at the representation level. Furthermore, during gating allocation, the weights of the different semantic channels are normalized so that the channels are mutually exclusive on each feature dimension, preventing the same feature component from being modeled repeatedly or in mixed form. Based on this gating guidance, the split global operation semantic representation and view-specific behavior semantic representation are obtained and used for the subsequent differentiated modeling.

[0033] In practical applications, a decoupling guidance mechanism is applied to the uniformly encoded video feature representation $F$ to perform split modeling of video semantics at the feature dimension level, distinguishing global operation semantics from view-specific behavior semantics. Specifically, a mutually exclusive gating mechanism generates two types of gating weights that characterize how the different semantic components are distributed over the feature dimensions: $G_g = \sigma(F W_g + b_g)$, $G_v = \sigma(F W_v + b_v)$, where $W_g$, $W_v$, $b_g$, $b_v$ are learnable parameters and $\sigma(\cdot)$ denotes the Sigmoid activation function. To prevent the same feature dimension from responding strongly to both types of semantics at once, the gating weights are normalized to form a mutually exclusive assignment: $\tilde{G}_g = G_g / (G_g + G_v + \epsilon)$, $\tilde{G}_v = G_v / (G_g + G_v + \epsilon)$, where $\epsilon$ is a very small constant. The uniformly encoded features are then weighted element-wise to obtain the split semantic representations: $Z_g = \tilde{G}_g \odot F$, $Z_v = \tilde{G}_v \odot F$, where $Z_g$ and $Z_v$ denote the global operation semantic representation and the view-specific behavior semantic representation, respectively. Through this step, the different types of video semantic information are explicitly distinguished at the representation level.
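The gating step can be illustrated with a short sketch: two Sigmoid gates are computed from the encoded features, normalized so that they sum to approximately one on every feature dimension, and applied element-wise. The symbol names (`W_g`, `W_v`, `b_g`, `b_v`, `eps`) and the use of a linear map inside each gate are assumptions made for illustration; random matrices stand in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decouple(F, W_g, b_g, W_v, b_v, eps=1e-8):
    """Split F (T, D) into a global part and a view-specific part."""
    g_g = sigmoid(F @ W_g + b_g)       # raw gate for global operation semantics
    g_v = sigmoid(F @ W_v + b_v)       # raw gate for view-specific behavior semantics
    denom = g_g + g_v + eps            # eps guards against division by zero
    G_g, G_v = g_g / denom, g_v / denom  # normalized gates, G_g + G_v ~= 1
    return G_g * F, G_v * F, G_g, G_v

rng = np.random.default_rng(0)
T, D = 6, 4
F = rng.normal(size=(T, D))
W_g, W_v = rng.normal(size=(D, D)), rng.normal(size=(D, D))
b_g, b_v = np.zeros(D), np.zeros(D)

Z_g, Z_v, G_g, G_v = decouple(F, W_g, b_g, W_v, b_v)
print(np.abs(G_g + G_v - 1.0).max())   # deviation from an exact partition
```

Because the normalized gates sum to about one per dimension, a feature component that is routed strongly to one branch is necessarily suppressed in the other, and the two branches together still account for the full input (`Z_g + Z_v` is approximately `F`).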

[0034] Step 103: Differentiated modeling is performed on the global operation semantic representation and the view-specific behavior semantic representation, and a fused semantic representation for operational status analysis is obtained through a multi-granularity fusion mechanism.

[0035] Specifically, temporal modeling is first performed on the global operation semantic representation and the view-specific behavior semantic representation. Based on the temporal modeling results, the overall evolution features of the operation process in the time dimension and the dynamic change features of local behavior are obtained. The overall evolution features and the dynamic change features are then fused temporally, and the fused semantic representation for operational status analysis is obtained based on the temporal fusion results.

[0036] It should be noted that the global operation semantic representation and the view-specific behavior semantic representation are input into a multi-granularity fusion module, where the different semantic components are fused and modeled using a sample-adaptive weight allocation method to generate a fused semantic representation for operational status analysis. Specifically, for the semantic representations obtained from the splitting, a multi-granularity fusion modeling structure is constructed to perform temporal modeling on both the global operation semantic representation and the view-specific behavior semantic representation. The modeling is based on an attention-driven temporal modeling mechanism that characterizes the dependencies between different time steps within the sequence, enabling the model to focus on time segments that have a key impact on changes in operational status. Through this differentiated modeling, the global operation semantic branch focuses on the stable evolution of port operation processes and stages, while the view-specific behavior semantic branch focuses on the fine-grained dynamic changes in the behavior of personnel, equipment, or vehicles under different monitoring perspectives, thereby achieving collaborative modeling of multi-level operation information.

[0037] In the semantic fusion stage, the representations from different semantic branches are jointly modeled using a sample-adaptive weight allocation method. The model dynamically evaluates the importance of each semantic component in the current scene based on the feature content of the current video sample, and adjusts its contribution ratio in the fusion representation accordingly to generate a fusion semantic representation for subsequent analysis and decision-making. Through the above adaptive fusion method, the model can flexibly adjust the influence of global operation semantics and local behavioral semantics on the final analysis results according to different operation scenarios and operating conditions, thereby improving the stability and adaptability of operation status analysis and anomaly identification.

[0038] In practical applications, the different semantic representations obtained from decoupling are modeled in a differentiated manner, and a fused representation for decision-making is generated through a multi-granularity fusion mechanism. Specifically, temporal modeling is first performed on the global operation semantic representation $Z_g$ and the view-specific behavior semantic representation $Z_v$. In this embodiment, a modeling method based on a self-attention mechanism is adopted. Let the input of any semantic branch be $Z \in \{Z_g, Z_v\}$; its query, key, and value vectors are calculated as $Q = Z W_Q$, $K = Z W_K$, $V = Z W_V$, where $W_Q$, $W_K$, $W_V$ are learnable projection matrices. Based on this mapping, the scaled dot-product attention is calculated: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^{\top} / \sqrt{d_k}) V$, where $d_k$ is the dimension of the key vectors. Through this modeling process, the output $H_g$ of the global operation semantic branch characterizes the overall evolution of the operation process over time, and the output $H_v$ of the view-specific behavior semantic branch characterizes the dynamic changes in local behavior.
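The self-attention step applied to each branch can be sketched as below. Scaled dot-product attention itself is standard; what is assumed here is the single-head form, the projection size `d_k`, the variable names, and the choice to give each branch its own projection matrices (one reading of "differentiated modeling"). Random matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the time axis of Z (T, D)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) step-to-step affinities
    A = softmax(scores, axis=-1)              # each row is a distribution over time steps
    return A @ V, A

rng = np.random.default_rng(0)
T, D, d_k = 5, 8, 4
Z_g, Z_v = rng.normal(size=(T, D)), rng.normal(size=(T, D))

# Separate (hypothetical) parameters per branch.
params_g = [rng.normal(size=(D, d_k)) for _ in range(3)]
params_v = [rng.normal(size=(D, d_k)) for _ in range(3)]

H_g, A_g = self_attention(Z_g, *params_g)
H_v, A_v = self_attention(Z_v, *params_v)
print(H_g.shape, A_g.shape)
```

Row `t` of the attention matrix shows which time steps the model draws on when forming the output at step `t`, which is how the mechanism can concentrate on segments that matter for a status change.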

[0039] Subsequently, the outputs of each semantic branch are aggregated over time to obtain semantic-level vector representations: $h_g = \mathrm{Agg}(H_g)$, $h_v = \mathrm{Agg}(H_v)$, where $\mathrm{Agg}(\cdot)$ denotes an average or weighted aggregation operation over the time dimension. During the fusion phase, learnable scoring vectors $w_g$ and $w_v$ are introduced to score the different semantic representations: $s_g = w_g^{\top} h_g$, $s_v = w_v^{\top} h_v$. The fusion weights are calculated using the Softmax function: $[\alpha_g, \alpha_v] = \mathrm{softmax}([s_g, s_v])$. Based on the fusion weights, the final fused representation is generated: $z = \alpha_g h_g + \alpha_v h_v$, where $z$ is the fused representation used for subsequent analysis and decision-making.
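The aggregation and sample-adaptive fusion described above can be sketched in a few lines. Mean pooling is used for the time aggregation (the source allows either an average or a weighted aggregation), and the use of one scoring vector per branch, named `w_g` and `w_v` here, is an assumption about how the learnable scoring vectors are arranged.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(H_g, H_v, w_g, w_v):
    """Pool each branch over time, score it, and mix the branches with softmax weights."""
    h_g = H_g.mean(axis=0)                     # semantic-level vector, global branch
    h_v = H_v.mean(axis=0)                     # semantic-level vector, view-specific branch
    scores = np.array([w_g @ h_g, w_v @ h_v])  # one scalar score per branch
    alpha = softmax(scores)                    # fusion weights, sum to 1
    z = alpha[0] * h_g + alpha[1] * h_v
    return z, alpha

# Toy inputs: the global branch is all ones, the view-specific branch all zeros.
H_g = np.ones((4, 3))
H_v = np.zeros((4, 3))
z, alpha = fuse(H_g, H_v,
                w_g=np.array([1.0, 0.0, 0.0]),
                w_v=np.array([0.0, 1.0, 0.0]))
print(alpha, z)
```

Since the scores depend on the pooled features of the current sample, the weights change from sample to sample; that dependence is what makes the fusion sample-adaptive rather than a fixed mixing ratio.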

[0040] Step 104: Based on the fused semantic representation, obtain the port operation status identification result, abnormal behavior judgment result, and operation risk assessment result.

[0041] In practical applications, a task output module is constructed on the fused representation $z$ to generate the intelligent perception results for port operations. These results include port operation status identification results, abnormal behavior judgment results, and operation risk assessment results. Specifically, during the training phase, the corresponding task predictions are output based on the fused semantic representation, and the model parameters are optimized using ground-truth annotations to ensure the model's analytical performance across different port operation scenarios. A main task loss function is constructed according to the task type. When the output is an operation status identification or abnormal behavior judgment result, cross-entropy loss is used as the main task loss: $\mathcal{L}_{task} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$, where $C$ is the number of categories, $y_c$ is the ground-truth label for category $c$, and $\hat{y}_c$ is the predicted probability for category $c$. When the output is a risk assessment or continuous scoring result, the mean squared error loss is used: $\mathcal{L}_{task} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, averaged over the $N$ training samples.
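The two main-task losses named above can be sketched as follows. The linear output heads (`W_status` and the risk weight vector) and the toy values are assumptions for illustration; the source specifies only that cross-entropy is used for classification-type outputs and mean squared error for continuous scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean of -log p(true class); probs is (N, C), labels holds N class ids."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def mse(pred, target):
    """Mean squared error for continuous risk scores."""
    return float(np.mean((pred - target) ** 2))

# Fused representations for N = 2 samples (toy values).
z = np.array([[0.5, -0.5],
              [0.0,  0.0]])

# Status head: zero weights give a uniform distribution over C = 4 classes.
W_status = np.zeros((2, 4))
status_probs = softmax(z @ W_status)
ce = cross_entropy(status_probs, np.array([1, 3]))

# Risk head: a linear score per sample, compared against target scores.
risk_pred = z @ np.array([1.0, 0.0])
risk = mse(risk_pred, np.array([0.0, 1.0]))
print(ce, risk)
```

With a uniform prediction over four classes the cross-entropy equals log 4, a useful sanity value: a trained status head should score below it.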

[0042] To prevent the decoupled semantics from degrading or becoming mixed again during training, a reconstruction-based decoupling constraint strategy is introduced during the training phase. The global operation semantic representation $Z_g$ and the view-specific behavior semantic representation $Z_v$ are concatenated along the feature dimension and passed through a mapping function $\phi(\cdot)$ to reconstruct the uniformly encoded feature representation: $\hat{F} = \phi([Z_g; Z_v])$. The reconstruction loss is defined as $\mathcal{L}_{rec} = \| F - \hat{F} \|_2^2$, and the overall training objective of the model is defined as $\mathcal{L} = \mathcal{L}_{task} + \lambda \mathcal{L}_{rec}$, where $\lambda$ is a weighting coefficient. The reconstruction constraint is used only during the training phase and is not involved in online inference or deployment.
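A small sketch of the reconstruction constraint follows. The mapping function is taken to be a single linear layer (`W_r`, `b_r`), the loss is a sum of squared errors, and the weighting coefficient is called `lam`; all three are assumptions, as the source does not give the form of the mapping or the value of the coefficient.

```python
import numpy as np

def reconstruction_loss(F, Z_g, Z_v, W_r, b_r):
    """Squared error between F and its reconstruction from the concatenated branches."""
    F_hat = np.concatenate([Z_g, Z_v], axis=-1) @ W_r + b_r   # (T, 2D) -> (T, D)
    return float(np.sum((F - F_hat) ** 2))

def total_loss(task_loss, rec_loss, lam):
    """Overall training objective: main task loss plus weighted reconstruction loss."""
    return task_loss + lam * rec_loss

rng = np.random.default_rng(0)
T, D = 6, 4
F = rng.normal(size=(T, D))
Z_g, Z_v = 0.3 * F, 0.7 * F                 # a perfectly complementary split

# A mapping that simply adds the two halves back together.
W_sum = np.vstack([np.eye(D), np.eye(D)])   # shape (2D, D)
b_zero = np.zeros(D)

rec_ok = reconstruction_loss(F, Z_g, Z_v, W_sum, b_zero)
# If one branch collapses to zero, information is lost and the loss rises.
rec_bad = reconstruction_loss(F, Z_g, np.zeros_like(Z_v), W_sum, b_zero)
loss = total_loss(task_loss=1.2, rec_loss=rec_bad, lam=0.1)
print(rec_ok, rec_bad, loss)
```

The comparison between `rec_ok` and `rec_bad` shows the intent of the constraint: a branch that degenerates can no longer help reproduce the encoded features, so the penalty discourages that outcome during training while adding nothing to inference.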

[0043] The second embodiment of the present invention relates to an operational status analysis system for port multi-view monitoring video. Referring to Figure 3, the system includes:

[0044] The video acquisition module is used to acquire multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments, and to determine video feature representations based on the multi-source monitoring video sequences;

[0045] The semantic representation determination module is used to determine, based on the video feature representation, a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives;

[0046] The differential modeling module is used to perform differentiated modeling of the global operation semantic representation and the view-specific behavior semantic representation, and to obtain the fused semantic representation for operational status analysis through a multi-granularity fusion mechanism;

[0047] The analysis result output module is used to determine the port operation intelligent perception results based on the fused semantic representation. The port operation intelligent perception results include port operation status identification results, abnormal behavior judgment results, and operation risk assessment results.

[0048] It is not difficult to see that this embodiment is a system implementation corresponding to the first embodiment, and this embodiment can be implemented in conjunction with the first embodiment. The relevant technical details mentioned in the first embodiment are still valid in this embodiment, and will not be repeated here to reduce repetition. Accordingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.

[0049] It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units. Furthermore, to highlight the innovative aspects of this invention, this embodiment does not introduce units that are not closely related to solving the technical problem proposed by this invention; however, this does not mean that other units are absent from this embodiment.
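The four logical modules of this embodiment can be pictured as interchangeable callables wired in sequence. The sketch below is purely structural: the class name `PortAnalysisSystem`, the field names, and the toy stand-in functions (simple averages and thresholds on lists of numbers) are hypothetical and carry none of the actual modeling.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PortAnalysisSystem:
    """Four logical modules chained in the order described in the embodiment."""
    video_acquisition: Callable          # sources -> features
    semantic_representation: Callable    # features -> (global, view-specific)
    differential_modeling: Callable      # (global, view-specific) -> fused
    result_output: Callable              # fused -> perception results

    def run(self, sources):
        features = self.video_acquisition(sources)
        z_global, z_view = self.semantic_representation(features)
        fused = self.differential_modeling(z_global, z_view)
        return self.result_output(fused)

def toy_split(feats):
    # Toy stand-in: the mean plays the "global" role, deviations the "view" role.
    mean = sum(feats) / len(feats)
    return mean, [f - mean for f in feats]

system = PortAnalysisSystem(
    video_acquisition=lambda sources: [sum(s) / len(s) for s in sources],
    semantic_representation=toy_split,
    differential_modeling=lambda g, v: g + max(abs(x) for x in v),
    result_output=lambda fused: {
        "status": "busy" if fused > 1.0 else "idle",
        "abnormal": fused > 2.0,
        "risk": round(min(fused / 3.0, 1.0), 2),
    },
)

result = system.run([[0.2, 0.4], [1.0, 1.4]])
print(result)
```

Keeping the modules as plain callables mirrors the point made above that they are logical units: each one could be a single physical unit, part of one, or a combination of several, without changing how they are connected.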

[0050] The third embodiment of the present invention relates to an electronic device. Referring to Figure 4, the electronic device includes:

[0051] At least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the above-described method for analyzing the operational status of port multi-view surveillance video.

[0052] The memory and processor are connected via a bus, which can include any number of interconnecting buses and bridges, connecting various circuits of one or more processors and memories. The bus can also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and will not be described further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over the wireless medium via an antenna, which further receives data and transmits it to the processor.

[0053] The processor manages the bus and general processing, and also provides various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory is used to store data used by the processor during operation.

[0054] The fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method embodiments.

[0055] That is, those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. This program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0056] In summary, this invention introduces a decoupling-guided semantic diversion mechanism on top of unified temporal coding and combines it with a sample-adaptive multi-granularity fusion modeling strategy, so that global operation status semantics and view-specific behavior semantics are distinguished and collaboratively modeled at the representation level. Overall operation status information and local behavior information in multi-view monitoring videos can thus be effectively told apart, improving the accuracy and stability of operation status analysis and risk perception in complex port operation scenarios without increasing the system's inference complexity.
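
The semantic diversion summarized above can be pictured with a minimal Python sketch. It assumes that the mutually exclusive gating allocation is realized with complementary per-dimension gates `g` and `1 - g`, which is one common way to route each feature dimension between two channels and is not confirmed by the text. The function name `decouple` and the `gate_logits` values are hypothetical; in practice the gate scores would come from a learned layer.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def decouple(features, gate_logits):
    """Split one feature vector into a global part and a view-specific part.

    For every feature dimension, a gate g in (0, 1) sends the share g to
    the global channel and the remaining share 1 - g to the view-specific
    channel, so the two parts always add back up to the input.
    """
    if len(features) != len(gate_logits):
        raise ValueError("features and gate_logits must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    global_part = [g * x for g, x in zip(gates, features)]
    view_part = [(1.0 - g) * x for g, x in zip(gates, features)]
    return global_part, view_part


# Toy input: large positive logits route a dimension to the global channel,
# large negative logits route it to the view-specific channel.
features = [0.8, -1.2, 0.5, 2.0]
gate_logits = [4.0, -4.0, 0.0, 2.0]
global_part, view_part = decouple(features, gate_logits)
print(global_part)
print(view_part)
```

With complementary gates, no part of a feature dimension is counted in both channels, which is the sense in which the allocation is mutually exclusive in this sketch. A hard (0 or 1) gate would be a stricter variant of the same idea.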

[0057] The above embodiments are merely illustrative of the principles and effects of the present invention and are not intended to limit the invention. All equivalent modifications or alterations made by those skilled in the art without departing from the spirit and technical concept disclosed in this invention should still be covered by the claims of this invention.

Claims

1. An operation state analysis method for port multi-view monitoring video, characterized in that it includes the following steps: acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments, and determining video feature representations based on the multi-source monitoring video sequences; determining, based on the video feature representations, a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives; performing differential modeling on the global operation semantic representation and the view-specific behavior semantic representation, and obtaining a fused semantic representation for operation status analysis through a multi-granularity fusion mechanism; and determining port operation intelligent perception results based on the fused semantic representation, wherein the port operation intelligent perception results include port operation status identification results, abnormal behavior judgment results, and operation risk assessment results.

2. The method of claim 1, wherein the method is a port-oriented multi-view monitoring video operation state analysis method. The process of acquiring multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments, and determining video feature representations based on the multi-source monitoring video sequences, includes: Acquire multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments; The multi-source surveillance video sequence is preprocessed, and video feature representations are extracted from the preprocessed multi-source surveillance video sequence using a unified temporal coding method. The preprocessing includes video frame extraction, scale normalization, and cross-viewpoint time alignment.

3. The operation state analysis method for port multi-view monitoring video according to claim 2, characterized in that the unified temporal coding method adopts a temporal feature extraction structure with shared parameters to map the multi-source monitoring video sequences into a feature representation space with a consistent structure.

4. The operation state analysis method for port multi-view monitoring video according to claim 1, characterized in that the step of determining, based on the video feature representations, a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives includes: introducing a decoupling guidance mechanism into the video feature representations to perform split modeling of video semantics at the feature dimension level, thereby obtaining the global operation semantic representation for characterizing the overall operation status and the view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives.

5. The operation state analysis method for port multi-view monitoring video according to claim 4, characterized in that the decoupling guidance mechanism guides the flow of video semantics at the feature dimension level through a mutually exclusive gating allocation method, so that the global operation semantic representation and the view-specific behavior semantic representation each enter their corresponding semantic modeling channel.

6. The operation state analysis method for port multi-view monitoring video according to claim 1, characterized in that the step of performing differential modeling on the global operation semantic representation and the view-specific behavior semantic representation, and obtaining a fused semantic representation for operation status analysis through a multi-granularity fusion mechanism, includes: performing temporal modeling on the global operation semantic representation and the view-specific behavior semantic representation; obtaining, based on the temporal modeling results, overall evolution features of the operation process in the time dimension and dynamic change features of local behavior; and performing temporal fusion on the overall evolution features and the dynamic change features, and obtaining the fused semantic representation for operation status analysis based on the temporal fusion result.

7. The operation state analysis method for port multi-view monitoring video according to claim 6, characterized in that the temporal modeling of the global operation semantic representation and the view-specific behavior semantic representation is based on an attention-driven temporal modeling mechanism that characterizes the dependencies between different time steps within the sequence, so that the model can focus on the time segments that have a key impact on changes in operation status; and in the temporal fusion of the overall evolution features and the dynamic change features, the representations from the different semantic branches are jointly modeled using a sample-adaptive weight allocation method.

8. An operation state analysis system for port multi-view monitoring video, characterized in that it includes: a video acquisition module, used to acquire multi-source monitoring video sequences collected by multi-source monitoring devices deployed in port or terminal environments, and to determine video feature representations based on the multi-source monitoring video sequences; a semantic representation determination module, used to determine, based on the video feature representations, a global operation semantic representation for characterizing the overall operation status and a view-specific behavior semantic representation for characterizing local behavioral features under different monitoring perspectives; a differential modeling module, used to perform differential modeling on the global operation semantic representation and the view-specific behavior semantic representation, and to obtain a fused semantic representation for operation status analysis through a multi-granularity fusion mechanism; and an analysis results output module, used to determine port operation intelligent perception results based on the fused semantic representation, wherein the port operation intelligent perception results include port operation status identification results, abnormal behavior judgment results, and operation risk assessment results.

9. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the operation state analysis method for port multi-view monitoring video according to any one of claims 1 to 7.