Time sequence self-supervised learning method and system for rail transit engineering video images

By combining dynamic frame rate sampling and multi-scale random occlusion reconstruction with reconstruction loss and frequency loss, the problem of insufficient temporal modeling adaptability and multi-scale target adaptability in rail transit engineering videos is solved, improving video feature representation and recognition capabilities, and making it suitable for construction safety monitoring and equipment fault early warning.

CN120930713B · Active · Publication Date: 2026-06-16 · BEIJING URBAN CONSTRUCTION DESIGN & DEVELOPMENT GROUP CO LIMITED +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING URBAN CONSTRUCTION DESIGN & DEVELOPMENT GROUP CO LIMITED
Filing Date: 2025-06-23
Publication Date: 2026-06-16


IPC: G06N3/0895; G06N3/084; G06N3/0455; G06N3/0464; G06N3/045; G06V10/82; G06V20/40; G06V10/26

CPC: G06N3/0895; G06N3/084; G06N3/0455; G06N3/0464; G06N3/045; G06V10/82; G06V20/40; G06V10/26

AI Tagging

Application Domain

Character and pattern recognition Neural learning methods


Abstract

The application discloses a temporal self-supervised learning method and system for rail transit engineering video images. The method includes: collecting urban rail transit engineering construction videos, splitting them into video segments, and generating segmentation masks to construct a self-supervised pre-training dataset; selecting a pre-trained model and initializing it; applying dynamic frame rate sampling to the input video; generating a corresponding binary mask image for each video frame through random occlusion; downsampling the processed video at multiple scales and feeding it into an encoder-decoder network, which performs spatiotemporal feature encoding and generates the reconstructed video and the predicted mask; and constructing the loss function from the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask. This addresses the insufficient adaptability of existing methods to temporal modeling and to multi-scale targets.


Description

Technical Field

[0001] This invention belongs to the field of computer vision and intelligent transportation engineering construction technology, and more specifically, relates to a temporal self-supervised learning method and system for video images of rail transit engineering.

Background Technology

[0002] In recent years, self-supervised learning has made some progress in the field of computer vision. Existing video self-supervised learning methods can be mainly divided into three categories: First, video representation learning based on temporal contrastive learning, such as TCLR, which uses a strategy of positive samples from adjacent frames and negative samples from different video segments to learn a general representation of video content; Second, video representation learning based on temporal prediction tasks, such as Jigsaw and TimeSformer, which achieves video representation learning by requiring the model to restore the shuffled order of video frames or generate motion representations for future frames; Third, video modeling methods based on spatiotemporal masking, such as VideoMAE, which constructs a video in which a portion of the time sequence is masked, and then reconstructs the video from the unmasked portion using an encoder-decoder structure.

[0003] However, the aforementioned video self-supervised learning methods have significant shortcomings in urban rail transit engineering construction scenarios. Contrastive learning methods, which use positive samples from adjacent frames and negative samples from different video segments, are prone to misjudgment in complex engineering scenarios such as periodic mechanical movements, and fail to accurately capture the temporal characteristics of the engineering scene. Self-supervised learning methods based on temporal prediction tend to overlook key moving targets because of the complex background of static tracks and dynamic machinery, and struggle to model long-term processes with phased changes, such as concrete pouring. Mask-based modeling methods use a fixed mask ratio; because target sizes in engineering videos differ greatly (e.g., workers and excavators), this easily loses key information about small targets, and these methods also model the temporal continuity of dynamic targets insufficiently.

[0004] The core technical problems of existing methods are: feature misjudgment caused by periodic mechanical motion in engineering scenarios, which they do not account for; insufficient feature representation caused by large differences in target size in engineering videos; and a weak ability to model long-term temporal dependencies. Statistical analysis of target size and target action duration in a large number of engineering construction videos reveals that target sizes span a wide range and that the duration of continuous actions varies greatly between targets. Existing methods lack strategies for these characteristics and therefore adapt poorly to temporal modeling and multi-scale targets in rail transit engineering scenarios. A technical solution that effectively addresses these problems is urgently needed.

Summary of the Invention

[0005] This invention aims to address the shortcomings of existing video self-supervised learning methods in temporal modeling and multi-scale target adaptation in rail transit engineering scenarios. By employing dynamic frame rate sampling, a TimeSformer backbone network with embedded deformable convolutional layers, and multi-scale random occlusion reconstruction, combined with optimizations of reconstruction loss, dynamic frequency loss, and segmentation loss, the invention enhances the model's ability to represent the spatiotemporal features of targets with different action durations and sizes. This makes the invention suitable for scenarios such as construction safety monitoring and equipment fault early warning.

[0006] To address the aforementioned deficiencies or improvement needs of existing technologies, as a first aspect of this invention, the present invention provides a temporal self-supervised learning method for video images in rail transit engineering, comprising:

[0007] S1. Collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset;

[0008] S2. Use a pre-trained TimeSformer and initialize the self-supervised learning parameters and backbone model parameters for the rail transit engineering scenario; for position encoding initialization, initialize the time dimension using a sine function;

[0009] S3. Perform dynamic frame rate sampling on the input video; generate a corresponding binary mask image for each video frame through random occlusion processing; after multi-scale downsampling, input the processed video into an encoder-decoder network to complete spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;

[0010] S4. The loss function is constructed by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask.
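Step S4 names three loss terms but this text does not state how they are combined. One common choice, assumed here purely for illustration, is a weighted sum; the weights \lambda are hypothetical hyperparameters that do not appear in the source:

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{DF}}\,\mathcal{L}_{DF}
  + \lambda_{\text{seg}}\,\mathcal{L}_{\text{seg}}
```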

[0011] Furthermore, the specific process of generating a segmentation mask for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network in S1 is as follows:

[0012] Specifically, all frames of the video are read first. For the first frame, the segmentation network SAM is used to obtain the segmentation mask of the first frame. For the second frame, the target tracking network Xmem uses the segmentation mask of the first frame as a cue to obtain the segmentation mask of the second frame. The above steps are repeated until the segmentation mask of the last frame is obtained.
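The frame-by-frame procedure in [0012] can be sketched as a simple loop. This is only an illustration: `segment_first` and `propagate` are hypothetical callables standing in for the SAM and Xmem networks (they are not the real APIs of either project), and the sketch assumes each frame is cued by the mask of the frame immediately before it.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def generate_masks(
    frames: Sequence[Frame],
    segment_first: Callable[[Frame], Mask],    # hypothetical stand-in for SAM
    propagate: Callable[[Frame, Mask], Mask],  # hypothetical stand-in for Xmem
) -> List[Mask]:
    """Return one segmentation mask per frame of a video segment."""
    if not frames:
        return []
    # The first frame is segmented directly.
    masks = [segment_first(frames[0])]
    # Every later frame uses the previous frame's mask as its cue.
    for frame in frames[1:]:
        masks.append(propagate(frame, masks[-1]))
    return masks
```

Passing the networks in as callables keeps the loop independent of any particular model wrapper.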

[0013] Furthermore, the specific method for initializing the time dimension using a sine function in S2 is as follows:

[0014] A unique and smooth position representation is generated for each time step, enabling the model to capture temporal order and relative distance. Specific implementation methods and formulas are as follows:

[0015]

[0016] Where t is the time step position; d_model is the total dimension of the position encoding; i is the index of the current dimension; F is the hyperparameter for adjusting the wavelength range.
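The formula referenced in [0014] is not present in this text. The symbols defined in [0016] match the standard Transformer sinusoidal position encoding, so a form consistent with them would be the following. This is an assumption based on that standard encoding, not the patent's verified formula; in particular, the pairing of sine with even dimensions and cosine with odd dimensions is taken from the standard form.

```latex
PE(t,\,2i)   = \sin\!\left(\frac{t}{F^{\,2i/d_{\text{model}}}}\right), \qquad
PE(t,\,2i+1) = \cos\!\left(\frac{t}{F^{\,2i/d_{\text{model}}}}\right)
```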

[0017] Furthermore, the specific process of dynamic frame rate sampling in S3 is as follows:

[0018] Read the video frame rate, set the target frame rate range to 5-25 fps, and randomly generate a sampling frame rate value f based on a uniform or Gaussian distribution; then calculate the corresponding sampling interval s:

[0019]

[0020] Where f_base is the original video frame rate;

[0021] The video frame sequence is sampled at equal intervals according to the calculated sampling interval s to generate video segments with different frame rates; for segments that are less than the minimum length requirement after sampling, the frame number is supplemented by linear interpolation.
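The sampling step can be sketched as below. The interval formula is not present in this text, so the sketch assumes s = round(f_base / f), clamped to at least 1; that formula, the function name, and the choice of the uniform distribution are illustrative assumptions.

```python
import random
from typing import List, Tuple


def sample_frame_indices(
    num_frames: int,
    f_base: float,
    rng: random.Random,
    f_min: float = 5.0,   # lower bound of the 5-25 fps target range
    f_max: float = 25.0,  # upper bound of the 5-25 fps target range
) -> Tuple[float, int, List[int]]:
    """Draw a target frame rate and pick equally spaced frame indices."""
    # Uniform draw; the text also allows a Gaussian distribution.
    f = rng.uniform(f_min, f_max)
    # ASSUMPTION: the interval is the rounded ratio of original to target rate.
    s = max(1, round(f_base / f))
    return f, s, list(range(0, num_frames, s))
```

For a 30 fps source, the interval under this assumption ranges from 1 (target near 25 fps) to 6 (target near 5 fps).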

[0022] Furthermore, the specific process of the random occlusion is as follows:

[0023] For each video frame, at least one shape, including rectangle, circle, and polygon, is randomly selected as the occlusion shape; an occlusion area of the corresponding size is generated according to a preset occlusion ratio range of 10%-70%; for each video frame, occlusion is applied at a randomly selected spatial location, and the pixels in the occlusion area are filled with a pixel value of 0; at the same time, a corresponding binary mask image is generated.
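A minimal NumPy sketch of the occlusion step follows. It covers only rectangles and circles (polygons are omitted for brevity), draws the occlusion ratio uniformly from 10%-70%, and lets circles be clipped at the image border, so the realised ratio can fall below the drawn one. The function name and these simplifications are assumptions, not details from the source.

```python
import numpy as np


def random_occlusion(frame, rng, ratio_range=(0.10, 0.70)):
    """Zero out one random shape; return (occluded frame, binary mask)."""
    h, w = frame.shape[:2]
    target_area = rng.uniform(*ratio_range) * h * w
    yy, xx = np.mgrid[0:h, 0:w]
    if rng.choice(["rectangle", "circle"]) == "rectangle":
        aspect = rng.uniform(0.5, 2.0)  # width / height, illustrative range
        rh = int(min(h, max(1, round(np.sqrt(target_area / aspect)))))
        rw = int(min(w, max(1, round(target_area / rh))))
        top = int(rng.integers(0, h - rh + 1))
        left = int(rng.integers(0, w - rw + 1))
        occluded = (yy >= top) & (yy < top + rh) & (xx >= left) & (xx < left + rw)
    else:
        radius = np.sqrt(target_area / np.pi)
        cy, cx = rng.uniform(0, h), rng.uniform(0, w)
        # The circle may extend past the border and be clipped there.
        occluded = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    out = frame.copy()
    out[occluded] = 0  # fill the occluded pixels with value 0
    return out, occluded.astype(np.uint8)
```

Calling it with `rng = np.random.default_rng(seed)` once per frame gives each frame its own shape and location, as described in [0023].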

[0024] Furthermore, the method for calculating the reconstruction loss in S4 is as follows:

[0025] The reconstruction loss uses a weighted L1 norm, focusing on calculating the pixel differences in the occluded regions, and incorporates a gradient difference term of 0.1; the formula is as follows:

[0026]

[0027] Where Ω represents the occluded area and ∇ is the gradient operator.
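The reconstruction-loss formula itself is not present in this text. One form consistent with the description (an L1 term restricted to the occluded region Ω plus a gradient difference term weighted by 0.1) is sketched below. The symbols \hat{x} (reconstructed video) and x (real video), the normalisation by |Ω|, and the use of an L1 norm on the gradient difference are assumptions.

```latex
\mathcal{L}_{\text{rec}}
  = \frac{1}{|\Omega|}\sum_{p \in \Omega} \bigl|\hat{x}_p - x_p\bigr|
  + 0.1\,\bigl\lVert \nabla \hat{x} - \nabla x \bigr\rVert_1
```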

[0028] Furthermore, the method for calculating the dynamic frequency loss in S4 is as follows:

[0029] Dynamic frequency loss constrains the feature consistency of video content at different frequencies through frequency domain analysis; the formula is as follows:

[0030]

[0031] L_DF = c(t)·L_LF + (1 - c(t))·L_HF

[0032] Where L_LF represents the low-frequency loss in the Fourier domain between the reconstructed video and the real video; L_HF represents the high-frequency loss between the reconstructed video and the real video in the Fourier domain; c(t) represents the weight value that changes with the training rounds.
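The text does not give the form of c(t). Since the low-frequency loss is described as having a high weight early in training, with the high-frequency weight increasing later, a linearly decreasing schedule is one plausible choice. The linear shape and the endpoint values 0.9 and 0.1 below are illustrative assumptions.

```python
def low_freq_weight(epoch: int, total_epochs: int,
                    c_start: float = 0.9, c_end: float = 0.1) -> float:
    """Hypothetical c(t): decays linearly from c_start to c_end."""
    if total_epochs <= 1:
        return c_end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return c_start + (c_end - c_start) * progress


def dynamic_frequency_loss(l_lf: float, l_hf: float, c: float) -> float:
    """L_DF = c(t) * L_LF + (1 - c(t)) * L_HF, as in paragraph [0031]."""
    return c * l_lf + (1.0 - c) * l_hf
```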

[0033] Furthermore, the method for calculating the segmentation loss in S4 is as follows:

[0034]

[0035] Where p_i represents the predicted probability of the i-th pixel; g_i represents the true label value of the i-th pixel; C represents the number of categories; ε is a smoothing term to prevent the denominator from being zero.
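The segmentation-loss formula is not present in this text, but the symbols above match a multi-class Dice loss. The sketch below assumes the common form 1 - (1/C) Σ_c (2 Σ_i p_i g_i + ε) / (Σ_i p_i + Σ_i g_i + ε); placing ε in the numerator as well as the denominator is an assumption.

```python
import numpy as np


def dice_loss(probs, onehot, eps: float = 1e-6) -> float:
    """Multi-class Dice loss.

    probs, onehot: arrays of shape (C, ...) holding the per-class predicted
    probabilities p_i and the one-hot true labels g_i.
    """
    probs = np.asarray(probs, dtype=np.float64)
    onehot = np.asarray(onehot, dtype=np.float64)
    c = probs.shape[0]
    p = probs.reshape(c, -1)
    g = onehot.reshape(c, -1)
    intersection = (p * g).sum(axis=1)
    denominator = p.sum(axis=1) + g.sum(axis=1)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice_per_class.mean())
```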

[0036] As a second aspect of the present invention, the present invention provides a temporal self-supervised learning system for video images of rail transit engineering, comprising:

[0037] The dataset construction unit is used to collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset.

[0038] The model initialization unit is used to initialize the self-supervised learning parameters and backbone model parameters of the rail transit engineering scenario using the pre-trained TimeSformer; for position encoding initialization, the time dimension is initialized using a sine function.

[0039] The video reconstruction and mask prediction unit is used to perform dynamic frame rate sampling processing on the input video; and generate a corresponding binary mask image for each video frame through random occlusion processing; after multi-scale downsampling, the processed video is input into the encoder-decoder network to complete the spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;

[0040] The loss function construction unit is used to construct the loss function by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask.

[0041] As a third aspect of the invention, the invention also provides a computer-readable storage medium having a computer program stored thereon; when executed by a processor, the computer program implements the steps of any of the described temporal self-supervised learning methods for video images of rail transit engineering.

[0042] In summary, compared with the prior art, the above-described technical solutions conceived by this invention can achieve the following beneficial effects:

[0043] 1. The temporal self-supervised learning method for video images in rail transit engineering of this invention utilizes dynamic frame rate sampling video preprocessing technology. After reading the video frame rate, it randomly generates sampling frame rates within the range of 5-25fps according to a uniform or Gaussian distribution. It calculates the sampling interval and samples video frames at equal intervals. For segments shorter than the minimum length, it uses linear interpolation to supplement the number of frames, thereby improving the model's ability to represent the temporal features of continuous actions of different targets. This technology can dynamically adjust the sampling frequency according to the diversity of target action durations in engineering videos, enabling the model to effectively capture the temporal dimension features of different scenarios such as periodic mechanical movements and concrete pouring, avoiding the loss of temporal information caused by fixed frame rate sampling. In practical applications, for the long-duration tunneling process of a tunnel boring machine, the system can automatically reduce the sampling frame rate to reduce computational load; for brief violations by workers, the sampling frame rate is increased to capture key details, achieving efficient utilization of computing resources and accurate extraction of temporal features.

[0044] 2. The temporal self-supervised learning method for video images in rail transit engineering of this invention employs a multi-scale spatiotemporal random mask reconstruction method with a TimeSformer core embedded with deformable convolutional layers. For each frame of video, rectangles, circles, or polygons are randomly selected as occlusion shapes, generating occlusion regions at a ratio of 10%-70% and filling them with pixel values of 0. Simultaneously, a binary mask is generated. The processed video is input into an encoder-decoder network for spatiotemporal feature encoding reconstruction and segmentation prediction, improving the model's ability to represent the spatiotemporal features of targets of different sizes. This method enhances feature extraction from irregular targets through deformable convolutional layers and, combined with a multi-scale random occlusion strategy, solves the problem of lost key information due to fixed mask ratios caused by large size differences between targets such as workers and excavators in engineering videos, achieving effective representation of targets at different scales. When dealing with complex scenes, the system uses a larger proportion of occlusion regions for large equipment, prompting the model to learn its overall structural features; for small tools or parts, a smaller proportion of occlusion is used to preserve detailed information, significantly improving the model's ability to identify and locate targets at multiple scales.

[0045] 3. The temporal self-supervised learning method for rail transit engineering video images of this invention optimizes the objective from multiple dimensions, including reconstruction loss, dynamic frequency loss, and segmentation loss. The reconstruction loss uses a weighted L1 norm, focusing on pixel differences in occluded areas, plus a gradient difference term weighted by 0.1. The dynamic frequency loss constrains feature consistency through low-frequency and high-frequency losses in the Fourier domain, with weights adjusted dynamically as the training epochs progress. The segmentation loss uses DiceLoss to calculate the difference between the predicted mask and the true mask. Together these improve the model's accuracy in reconstructing the overall video structure and high-frequency details. The joint multi-loss optimization strategy constrains model training from three dimensions: pixel-level reconstruction, frequency domain feature consistency, and target segmentation accuracy, so that the model accurately understands video content and extracts effective features in complex engineering scenarios. In the early stages of training, the system guides the model to learn the overall video structure using a high-weight low-frequency loss. As training progresses, the weight of the high-frequency loss is gradually increased to strengthen the learning of detailed features, such as those needed for weld inspection and for detecting loose bolts, ultimately achieving a comprehensive and detailed understanding of the engineering video.

Attached Figure Description

[0046] Figure 1 is a flowchart of the temporal self-supervised learning method for video images of rail transit engineering according to an embodiment of the present invention;

[0047] Figure 2 is a flowchart illustrating the dataset construction process in an embodiment of the present invention;

[0048] Figure 3 is a schematic diagram illustrating the generation of a video segmentation mask according to an embodiment of the present invention;

[0049] Figure 4 is a flowchart illustrating the model training process according to an embodiment of the present invention;

[0050] Figure 5 is a schematic diagram of video reconstruction and mask prediction according to an embodiment of the present invention;

[0051] Figure 6 is a system unit diagram of an embodiment of the present invention.

Detailed Implementation

[0052] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0053] Example 1

[0054] Please refer to Figure 1. Embodiment 1 provides a temporal self-supervised learning method for video images in rail transit engineering, including:

[0055] S1. Collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset;

[0056] S2. Use a pre-trained TimeSformer and initialize the self-supervised learning parameters and backbone model parameters for the rail transit engineering scenario; for position encoding initialization, initialize the time dimension using a sine function;

[0057] S3. Perform dynamic frame rate sampling on the input video; generate a corresponding binary mask image for each video frame through random occlusion processing; after multi-scale downsampling, input the processed video into an encoder-decoder network to complete spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;

[0058] S4. The loss function is constructed by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask.

[0059] This embodiment 1 further elaborates on the above steps.

[0060] (1) Dataset Construction

[0061] To implement a multi-scale spatiotemporal coding self-supervised learning method for video images of urban rail transit engineering construction, a high-quality video pre-training dataset must first be established. For the specific steps, please refer to Figure 2.

[0062] S101. Video Acquisition: To cover the complex scenarios at urban rail transit construction sites, high-resolution imaging equipment is used to acquire video from multiple angles, at multiple time periods, and under multiple lighting conditions. The captured video content covers as much of the construction situation as possible.

[0063] S102. Video Segmentation: Professionals divide the captured raw footage into video segments of 2-10 s, ensuring that no scene jumps occur within a segment;

[0064] S103. Video Segmentation Mask Generation: Please refer to Figure 3. For each video segment from S102, the segmentation network SAM and the target tracking network Xmem are used to generate masks. Specifically, all frames of the video are read first. For the first frame, the segmentation network SAM is used to obtain the segmentation mask of the first frame. For the second frame, the target tracking network Xmem uses the segmentation mask of the first frame as a cue to obtain the segmentation mask of the second frame. The above steps are repeated until the segmentation mask of the last frame is obtained. At this point, the self-supervised pre-training dataset for urban rail transit engineering construction videos is complete.

[0065] (2) Model initialization

[0066] Please refer to Figure 4. Specifically, in this embodiment 1, a parameter initialization and dynamic position encoding strategy based on a pre-trained TimeSformer is used to improve the model's ability to capture temporal features in complex engineering scenarios. During the model construction phase, a TimeSformer pre-trained on ImageNet22k is used as the backbone network. This network has learned general visual feature representations on large-scale image datasets and can be effectively transferred to the field of rail transit engineering. In particular, the position encoding in the time dimension is specially designed to address the temporal characteristics of engineering videos. Position encoding is generated using a sine function; the core idea is to generate a unique and smooth position representation for each time step, enabling the model to capture temporal order and relative distance. The specific formula is:

[0067]

[0068] Where t is the time step position; d_model is the total dimension of the positional encoding, usually consistent with the model feature dimension, and determines the representational power of the positional encoding; i is the index of the current dimension, traversing each dimension of the positional encoding; F is a hyperparameter that adjusts the wavelength range, and in a preferred embodiment, F = 10000. This design allows the positional encoding to have different periods in different dimensions, thus representing the relative relationships between time steps. For example, when processing consecutive video frames during tunnel construction, the differences in positional encoding between adjacent frames can reflect small changes in construction progress, while the differences between distant frames reflect significant transitions in construction stages.
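A small NumPy sketch of such an encoding, using the preferred F = 10000, is given below. Because the formula is not present in this text, the sketch assumes the standard Transformer layout (sine on even dimensions, cosine on odd dimensions); the function name is illustrative.

```python
import numpy as np


def temporal_position_encoding(num_steps: int, d_model: int,
                               F: float = 10000.0) -> np.ndarray:
    """Return a (num_steps, d_model) sinusoidal encoding of time steps."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for the sin/cos pairing")
    t = np.arange(num_steps, dtype=np.float64)[:, None]    # time step position
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]  # dimension-pair index
    angles = t / np.power(F, 2.0 * i / d_model)             # period grows with i
    pe = np.empty((num_steps, d_model), dtype=np.float64)
    pe[:, 0::2] = np.sin(angles)  # ASSUMPTION: even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # ASSUMPTION: odd dimensions use cosine
    return pe
```

With `d_model = 16`, the encodings of frames 0 and 1 lie closer together than those of frames 0 and 4, which is the adjacent-versus-distant behaviour the paragraph describes.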

[0069] In addition, pre-training parameters, including AdamW optimizer parameters, initial learning rate parameters, training batch size parameters, weight decay parameters, and training epoch parameters, need to be initialized.

[0070] In terms of pre-training parameter initialization, in addition to position encoding, key parameters such as optimizer, learning rate, and batch size were carefully configured. The AdamW optimizer, combined with its adaptive learning rate adjustment and weight decay mechanism, effectively prevents model overfitting and improves training stability.

[0071] In practical engineering applications, this parameter initialization and position encoding strategy demonstrates significant advantages. For example, in subway station construction monitoring scenarios, the system can accurately identify feature changes at different construction stages, such as the transition from earthwork excavation to main structure construction. For the tunnel boring machine (TBM) excavation process, the model can capture the dynamic evolution of key indicators such as cutter wear and tunneling speed changes through temporal position encoding, providing data support for construction safety assessment. When dealing with tunnel environments with drastic lighting changes, the combination of pre-trained visual features and dynamic position encoding enables the model to stably extract target features, reducing the impact of environmental interference on recognition accuracy. Furthermore, setting a DropPath rate of 0.1 for regularization further enhances the model's generalization ability, allowing it to maintain good performance even when facing unseen engineering scenarios.

[0072] This initialization strategy, based on pre-training and dynamic position encoding, provides a solid model foundation for video analysis in rail transit engineering. By combining large-scale image pre-training knowledge with the temporal characteristics of the engineering field, the model can quickly adapt to the complexity of engineering scenarios, accurately capture key information during construction, and provide strong technical support for achieving intelligent engineering monitoring and management.

[0073] (3) Video reconstruction and mask prediction

[0074] Specifically, read the video frame rate, set the target frame rate range to 5-25fps, and randomly generate sampling frame rate values f based on a uniform or Gaussian distribution; calculate the corresponding sampling interval:

[0075]

[0076] Where f_base is the original video frame rate;

[0077] The video frame sequence is sampled at equal intervals according to the calculated sampling interval s to generate video segments with different frame rates; for segments that are less than the minimum length requirement after sampling, the frame number is supplemented by linear interpolation.
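The padding rule in [0077] can be sketched as follows: a clip that is too short after sampling is resampled to the minimum length by linearly blending neighbouring frames. The function name and the choice to spread the new frames evenly over the clip are assumptions; the text only states that linear interpolation is used.

```python
import numpy as np


def pad_by_linear_interpolation(clip, min_len: int) -> np.ndarray:
    """Stretch a (T, ...) clip to min_len frames by linear interpolation."""
    clip = np.asarray(clip, dtype=np.float64)
    n = clip.shape[0]
    if n == 0:
        raise ValueError("clip must contain at least one frame")
    if n >= min_len:
        return clip
    if n == 1:
        return np.repeat(clip, min_len, axis=0)
    # Fractional source positions for each output frame.
    pos = np.linspace(0.0, n - 1, num=min_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    # Blend weight, reshaped to broadcast over the frame dimensions.
    w = (pos - lo).reshape(-1, *([1] * (clip.ndim - 1)))
    return (1.0 - w) * clip[lo] + w * clip[hi]
```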

[0078] For each video frame, at least one occlusion shape, including rectangle, circle, and polygon, is randomly selected; an occlusion area of corresponding size is generated according to a preset occlusion ratio range of 10%-70%; for each video frame, occlusion is applied at a randomly selected spatial location, and the pixels in the occlusion area are filled with a pixel value of 0; at the same time, a corresponding binary mask image is generated for subsequent loss calculation.

[0079] Please refer to Figure 5. Embodiment 1 achieves accurate reconstruction and segmentation of engineering scenes through a multi-scale spatiotemporal encoder-decoder network. When processing videos subject to random spatiotemporal occlusion, the system first performs multi-scale downsampling, a process employing a pyramid structure to generate video representations at different scales, such as the original resolution, 1/2 resolution, and 1/4 resolution. This multi-scale strategy enables the model to simultaneously capture large-scale scene structures (such as the overall layout of a tunnel) and small-scale detailed features (such as bolts, cables, and other components), effectively addressing the problem of significant differences in target size in rail transit engineering.
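The pyramid described above can be sketched with 2x2 average pooling applied repeatedly. The text does not name a downsampling filter, so average pooling and the function names here are assumptions.

```python
import numpy as np


def downsample_2x(video: np.ndarray) -> np.ndarray:
    """Halve height and width of a (T, H, W[, C]) video by 2x2 averaging."""
    _, h, w = video.shape[:3]
    h2, w2 = h - h % 2, w - w % 2  # drop a trailing odd row or column
    v = video[:, :h2, :w2].astype(np.float64)
    return (v[:, 0::2, 0::2] + v[:, 1::2, 0::2]
            + v[:, 0::2, 1::2] + v[:, 1::2, 1::2]) / 4.0


def build_pyramid(video, levels: int = 3):
    """Return [original, 1/2 resolution, 1/4 resolution, ...]."""
    pyramid = [np.asarray(video, dtype=np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```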

[0080] The downsampled video is fed into a TimeSformer encoder embedded with deformable convolutional layers for spatiotemporal feature extraction. The deformable convolutional layers, by adaptively adjusting the sampling position of the convolutional kernels, can better capture the geometric features of irregular targets in engineering scenes, such as curved pipes and inclined support structures. TimeSformer's temporal modeling capabilities enable the model to analyze dynamic changes during construction, such as soil deformation during tunnel boring machine (TBM) advancement and the continuous movements of workers operating tools. During encoding, the model converts the video frame sequence into feature tensors containing spatiotemporal information. Each feature vector not only represents the visual features of a spatial location but also implicitly contains information about the evolution of that location over time.

[0081] The extracted spatiotemporal features are then fed in parallel into two decoders for video reconstruction and segmentation prediction, respectively. The VideoMAE decoder is responsible for reconstructing the original video from the encoded features, a process that requires the model to learn content information about occluded areas. By optimizing the reconstruction loss, the model is forced to learn the underlying structure and semantic information in the video, such as distinguishing the texture features of different construction materials (concrete, steel) and identifying key components of equipment (such as the cutterhead of a tunnel boring machine and ventilation ducts). The Mask2former segmentation prediction network maps the spatiotemporal features into pixel-level segmentation masks, achieving accurate classification and localization of engineering targets. This network adopts a Transformer architecture, which can effectively handle long-distance dependencies and accurately segment overlapping targets in complex scenes, such as workers and machinery, and different types of construction materials.

[0082] During decoding, the two decoders share spatiotemporal features but perform different tasks. This multi-task learning mechanism enables the model to understand video content from different perspectives, achieving complementary enhancement of features. For example, the video reconstruction task prompts the model to learn global scene information, while the segmentation task strengthens its focus on local details. By jointly optimizing the reconstruction loss and segmentation loss, the model can more comprehensively capture the spatiotemporal features of the engineering scene and improve its ability to perceive the construction status.

[0083] In practical engineering applications, this encoder-decoder architecture demonstrates superior performance. In subway tunnel construction monitoring, the system can reconstruct obscured equipment components in real time and accurately segment different target categories such as workers, machinery, and materials. For critical operations during construction, such as rebar tying and concrete pouring, the model can identify whether the operations comply with specifications through temporal feature analysis, promptly detecting potential safety hazards. When dealing with tunnel environments with drastic lighting changes, the model can stably reconstruct the scene and segment targets through multi-scale feature fusion, reducing the impact of light interference on the analysis results. Furthermore, this architecture is highly adaptable to different construction stages, maintaining high reconstruction and segmentation accuracy from initial earthwork excavation to later equipment installation.

[0084] (4) Loss function construction

[0085] Specifically, the loss function includes three components: reconstruction loss, dynamic frequency loss, and segmentation loss. The reconstruction loss and dynamic frequency loss are calculated by comparing the reconstructed video with the original video. The reconstruction loss uses a weighted L1 norm that focuses on the pixel differences in occluded regions, plus a gradient difference term weighted by 0.1. The formula is as follows:

[0086]

[0087] Where Ω represents the occluded area and ∇ is the gradient operator.
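
The reconstruction loss described in paragraph [0085] can be sketched as follows. This is a minimal illustration, not the patented formula: it assumes a weight of 1 inside the occluded region Ω and 0 outside it, approximates the gradient operator with forward finite differences over the whole frame, and works on a single grayscale frame stored as nested lists. The function name `reconstruction_loss` and the parameter `grad_weight` are hypothetical.

```python
def reconstruction_loss(recon, target, mask, grad_weight=0.1):
    """Masked L1 loss plus a weighted gradient-difference term.

    recon, target: 2D lists of floats (one grayscale frame each).
    mask: 2D list of 0/1, where 1 marks an occluded pixel (the region Omega).
    grad_weight: weight of the gradient term (0.1 in the description).
    """
    h, w = len(target), len(target[0])

    # L1 term: mean absolute pixel error restricted to the occluded region.
    occluded = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    l1 = sum(abs(recon[y][x] - target[y][x]) for y, x in occluded)
    l1 /= max(len(occluded), 1)

    # Gradient term (assumed form): mean absolute difference between the
    # forward finite differences of the two frames, horizontal and vertical.
    grad_sum, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                d_recon = recon[y][x + 1] - recon[y][x]
                d_target = target[y][x + 1] - target[y][x]
                grad_sum += abs(d_recon - d_target)
                count += 1
            if y + 1 < h:
                d_recon = recon[y + 1][x] - recon[y][x]
                d_target = target[y + 1][x] - target[y][x]
                grad_sum += abs(d_recon - d_target)
                count += 1
    grad = grad_sum / max(count, 1)

    return l1 + grad_weight * grad
```

Restricting the L1 term to Ω keeps the loss focused on content the encoder never saw, while the gradient term penalizes blurred edges in the reconstruction.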

[0088] Dynamic frequency loss constrains the feature consistency of video content at different frequencies through frequency domain analysis; the formula is as follows:

[0089]

[0090] $L_{DF} = c(t)\,L_{LF} + (1 - c(t))\,L_{HF}$

[0091] Where $L_{LF}$ represents the low-frequency loss in the Fourier domain between the reconstructed video and the real video; $L_{HF}$ represents the high-frequency loss in the Fourier domain between the reconstructed video and the real video; and $c(t)$ represents a weight value that changes with the training rounds.
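
The weighting between low-frequency and high-frequency terms can be illustrated with a small sketch. Several details here are assumptions, because the text does not define them: the low/high split uses a hypothetical `cutoff` on the wrapped frequency index, each band loss is the mean magnitude of the spectral difference, and c(t) decays linearly from 1 to 0 over training so that low frequencies dominate early. A naive DFT is used to keep the example dependency-free; it is only practical for very small frames.

```python
import cmath
import math


def dft2(frame):
    """Naive 2D discrete Fourier transform of a small 2D list of floats."""
    h, w = len(frame), len(frame[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    phase = -2j * math.pi * (u * y / h + v * x / w)
                    total += frame[y][x] * cmath.exp(phase)
            out[u][v] = total
    return out


def dynamic_frequency_loss(recon, target, epoch, total_epochs, cutoff=1):
    """Return c(t) * L_LF + (1 - c(t)) * L_HF for one frame pair."""
    h, w = len(target), len(target[0])
    spec_recon, spec_target = dft2(recon), dft2(target)

    low_sum = high_sum = 0.0
    n_low = n_high = 0
    for u in range(h):
        for v in range(w):
            # Wrapped index: distance of this bin from the zero frequency.
            fu, fv = min(u, h - u), min(v, w - v)
            err = abs(spec_recon[u][v] - spec_target[u][v])
            if max(fu, fv) <= cutoff:  # assumed low-frequency band
                low_sum += err
                n_low += 1
            else:
                high_sum += err
                n_high += 1

    l_lf = low_sum / max(n_low, 1)
    l_hf = high_sum / max(n_high, 1)

    # Assumed schedule: c(t) goes linearly from 1 (start) to 0 (end).
    c = 1.0 - epoch / total_epochs
    return c * l_lf + (1.0 - c) * l_hf
```

With this schedule a uniform brightness error, which lives entirely in the zero-frequency bin, is penalized at the start of training and ignored at the end; a different c(t) would shift that balance.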

[0092] The segmentation loss is calculated by comparing the predicted mask with the mask in the dataset. Dice loss is used as the segmentation loss, and the specific calculation formula is as follows:

[0093]

[0094] Where $p_i$ represents the predicted probability of the i-th pixel; $g_i$ represents the true label value of the i-th pixel; $C$ represents the number of categories; and $\epsilon$ is a smoothing term that prevents the denominator from being zero.
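
A Dice loss using these symbols can be sketched as follows. The exact formula of paragraph [0093] is not reproduced in this text, so the code assumes a common multi-class form: the Dice coefficient is computed per category, averaged over the C categories, and subtracted from 1, with ε added to both numerator and denominator. The function name `dice_loss` and the data layout are hypothetical.

```python
def dice_loss(pred, truth, eps=1e-6):
    """Multi-class Dice loss (assumed form).

    pred[c][i]: predicted probability p_i of pixel i for category c.
    truth[c][i]: true label g_i of pixel i for category c (one-hot, 0 or 1).
    eps: smoothing term that keeps the denominator away from zero.
    """
    num_classes = len(pred)
    dice_total = 0.0
    for p_c, g_c in zip(pred, truth):
        intersection = sum(p * g for p, g in zip(p_c, g_c))
        denominator = sum(p_c) + sum(g_c)
        dice_total += (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_total / num_classes
```

Because the Dice coefficient is a ratio of overlap to total area, small targets such as tools or fasteners contribute as much per category as large ones, which a per-pixel loss would not guarantee.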

[0095] Specifically, the AdamW optimizer was used for end-to-end training, with an initial learning rate of 1e-4 that was gradually reduced to 1e-6 by cosine annealing. The batch size was set to 32, the total number of training epochs was 300, and a DropPath rate of 0.1 was used for regularization. All parameters of the backbone network were updated during training by stochastic gradient-based optimization.
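
The learning-rate decay can be illustrated with the standard closed-form cosine annealing schedule. This sketch assumes one update of the learning rate per epoch and no warm-up, neither of which is stated in the text; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_max=1e-4, lr_min=1e-6):
    """Cosine-annealed learning rate for a given epoch (0 .. total_epochs)."""
    # The cosine factor moves smoothly from 1 at epoch 0 to 0 at the end.
    factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * factor
```

Under these assumptions the rate is 1e-4 at epoch 0, about 5.05e-5 at epoch 150, and 1e-6 at epoch 300.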

[0096] Through the specific implementation described above, Embodiment 1 effectively handles videos that contain targets of different sizes and action scenes of different durations in rail transit engineering construction, and shows significant performance improvements in multiple downstream tasks. Those skilled in the art will understand that appropriate adjustments can be made to the above implementation details without departing from the core idea of Embodiment 1, and all such adjustments should be included within the protection scope of Embodiment 1.

[0097] Example 2

[0098] Referring to Figure 6, Embodiment 2 provides a temporal self-supervised learning system for video images in rail transit engineering, including...

[0099] The dataset construction unit is used to collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset.

[0100] The model initialization unit is used to initialize the self-supervised learning parameters and backbone model parameters of the rail transit engineering scenario using the pre-trained TimeSformer; for position encoding initialization, the time dimension is initialized using a sine function.

[0101] The video reconstruction and mask prediction unit is used to perform dynamic frame rate sampling processing on the input video; and generate a corresponding binary mask image for each video frame through random occlusion processing; after multi-scale downsampling, the processed video is input into the encoder-decoder network to complete the spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;

[0102] The loss function construction unit is used to construct the loss function by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask.

[0103] Example 3

[0104] This embodiment 3 also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it can implement any step of the temporal self-supervised learning method for video images of rail transit engineering.

[0105] The computer-readable storage medium may include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0106] For a description of the computer-readable storage medium provided in this application, please refer to the above method embodiments; further details will not be repeated here.

[0107] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A temporal self-supervised learning method for video images in rail transit engineering, characterized in that it comprises:
S1. Collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset;
S2. Use a pre-trained TimeSformer and initialize the self-supervised learning parameters and backbone model parameters for the rail transit engineering scenario; for position encoding initialization, initialize the time dimension using a sine function;
S3. Perform dynamic frame rate sampling on the input video; generate a corresponding binary mask image for each video frame through random occlusion processing; after multi-scale downsampling, input the processed video into an encoder-decoder network to complete spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;
S4. Construct the loss function by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask;
wherein, in handling tunnel environments with drastic lighting changes, target features are extracted by combining pre-trained visual features with dynamic position coding, reducing the impact of environmental interference on recognition accuracy; dynamic position coding allows position codes to have different periods in different dimensions, thus representing the relative relationship between time steps; when processing continuous video frames during tunnel construction, the differences in position coding between adjacent frames are used to reflect changes in construction progress, while the differences between distant frames are used to reflect the transition between construction stages;
the specific process of random occlusion is as follows: for each video frame, at least one of a rectangle, a circle and a polygon is randomly selected as the occlusion shape; an occlusion area of corresponding size is generated according to a preset occlusion ratio range of 10%-70%; for each video image frame, occlusion is applied at a randomly selected spatial position, and the pixels in the occlusion area are filled with a pixel value of 0; at the same time, a corresponding binary mask image is generated;
the dynamic frequency loss in S4 is calculated as follows: the dynamic frequency loss constrains the feature consistency of video content at different frequencies through frequency domain analysis; the formula is as follows: $L_{DF} = c(t)\,L_{LF} + (1 - c(t))\,L_{HF}$; where $L_{LF}$ indicates the low-frequency loss in the Fourier domain between the reconstructed video and the real video; $L_{HF}$ represents the high-frequency loss in the Fourier domain between the reconstructed video and the real video; and $c(t)$ represents the weight value that changes with the training rounds.
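
The random occlusion step of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: only the rectangle and circle shapes are shown (polygons are omitted for brevity), the rectangle keeps the aspect ratio of the frame, and a circle placed near the border is clipped, so its realized occlusion ratio can fall below the sampled value. The function name `random_occlude` and its parameters are hypothetical.

```python
import math
import random


def random_occlude(frame, ratio_range=(0.1, 0.7), rng=random):
    """Occlude one frame (2D list) and return (occluded_frame, binary_mask)."""
    h, w = len(frame), len(frame[0])
    ratio = rng.uniform(*ratio_range)      # occlusion ratio in 10%-70%
    area = ratio * h * w                   # target occluded area in pixels
    shape = rng.choice(["rectangle", "circle"])
    mask = [[0] * w for _ in range(h)]

    if shape == "rectangle":
        # Rectangle with roughly the target area, placed fully inside.
        rect_h = min(h, max(1, round(math.sqrt(area * h / w))))
        rect_w = min(w, max(1, round(area / rect_h)))
        top = rng.randint(0, h - rect_h)
        left = rng.randint(0, w - rect_w)
        for y in range(top, top + rect_h):
            for x in range(left, left + rect_w):
                mask[y][x] = 1
    else:
        # Circle with the target area around a random center (may be clipped).
        radius_sq = area / math.pi
        cy, cx = rng.uniform(0, h - 1), rng.uniform(0, w - 1)
        for y in range(h):
            for x in range(w):
                if (y - cy) ** 2 + (x - cx) ** 2 <= radius_sq:
                    mask[y][x] = 1

    # Fill occluded pixels with 0; the mask records where that happened.
    occluded = [[0 if mask[y][x] else frame[y][x] for x in range(w)]
                for y in range(h)]
    return occluded, mask
```

Returning the mask alongside the frame lets a reconstruction loss be restricted to exactly the pixels that were hidden.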

2. The temporal self-supervised learning method for video images in rail transit engineering according to claim 1, characterized in that the specific process of generating segmentation masks for each frame of video sequentially using the SAM segmentation network and the Xmem tracking network in S1 is as follows: all frames of the video are read first; for the first frame, the segmentation network SAM is used to obtain the segmentation mask of the first frame; for the second frame, the target tracking network Xmem uses the segmentation mask of the first frame as a cue to obtain the segmentation mask of the second frame; the above steps are repeated until the segmentation mask of the last frame is obtained.
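
The control flow of claim 2 can be sketched with two placeholder callables. `segment_first` and `propagate` are hypothetical stand-ins for the SAM and Xmem networks; they are not the real interfaces of either project. The sketch also assumes that each frame is cued by the mask of the immediately preceding frame.

```python
def build_masks(frames, segment_first, propagate):
    """Generate one segmentation mask per frame, in order.

    segment_first(frame) -> mask            (stand-in for SAM)
    propagate(frame, previous_mask) -> mask (stand-in for Xmem)
    """
    if not frames:
        return []
    # First frame: segmented directly, with no temporal cue.
    masks = [segment_first(frames[0])]
    # Remaining frames: tracked, using the previous mask as the cue.
    for frame in frames[1:]:
        masks.append(propagate(frame, masks[-1]))
    return masks
```

Passing the networks in as callables keeps the loop independent of how either model is loaded or invoked.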

3. The temporal self-supervised learning method for video images in rail transit engineering according to claim 1, characterized in that the specific method for initializing the time dimension using a sine function in S2 is as follows: a unique and smooth position representation is generated for each time step, enabling the model to capture temporal order and relative distance; the formula is as follows: [formula not reproduced in this text]; where the symbols of the formula denote, in order, the time step position; the total dimension of the position encoding; the index of the current dimension; and a hyperparameter for adjusting the wavelength range.
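
One encoding that matches the four quantities named in claim 3 is the sinusoidal position encoding of the original Transformer. The sketch below assumes that form, with sine on even dimensions and cosine on odd dimensions, and a wavelength hyperparameter of 10000; the claim itself only states that a sine function is used, so the cosine half and the default value are assumptions. The function name `temporal_position_encoding` is hypothetical.

```python
import math


def temporal_position_encoding(num_steps, dim, base=10000.0):
    """Return a num_steps x dim table of sinusoidal position codes.

    num_steps: number of time steps (frame positions).
    dim: total dimension of the position encoding.
    base: hyperparameter controlling the wavelength range (assumed 10000).
    """
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share one wavelength, base ** (2k / dim).
            angle = pos / base ** (2 * (i // 2) / dim)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Each pair of dimensions oscillates with a different period, so nearby time steps receive similar codes and distant ones diverge.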

4. The temporal self-supervised learning method for video images in rail transit engineering according to claim 1, characterized in that the specific process of dynamic frame rate sampling in S3 is as follows: read the video frame rate, set the target frame rate range to 5-25 fps, and randomly generate a sampled frame rate value based on a uniform or Gaussian distribution; calculate the corresponding sampling interval: [formula not reproduced in this text], in which the original video frame rate appears; according to the calculated sampling interval, sample the video frame sequence at equal intervals to generate video segments with different frame rates; for segments shorter than the minimum length requirement after sampling, supplement the number of frames by linear interpolation.
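
The sampling procedure of claim 4 can be sketched as follows. The interval formula is not reproduced in this text, so the code assumes the interval is the original frame rate divided by the sampled frame rate, rounded to an integer of at least 1. The uniform distribution is used, frames are flat lists of floats, and the names `dynamic_frame_rate_sample`, `interpolate_frames` and `min_len` are hypothetical.

```python
import random


def interpolate_frames(frames, length):
    """Linearly interpolate a frame sequence along time to `length` frames."""
    if len(frames) == 1:
        return [list(frames[0]) for _ in range(length)]
    out = []
    for k in range(length):
        t = k * (len(frames) - 1) / (length - 1)   # fractional source index
        lo = int(t)
        hi = min(lo + 1, len(frames) - 1)
        a = t - lo
        out.append([(1 - a) * p + a * q
                    for p, q in zip(frames[lo], frames[hi])])
    return out


def dynamic_frame_rate_sample(frames, original_fps, fps_range=(5, 25),
                              min_len=8, rng=random):
    """Sample a clip at a random target frame rate within fps_range."""
    target_fps = rng.uniform(*fps_range)
    # Assumed interval formula: ratio of frame rates, at least one frame.
    interval = max(1, round(original_fps / target_fps))
    sampled = [list(f) for f in frames[::interval]]   # equal-interval sampling
    # Clips that end up too short are lengthened by linear interpolation.
    if sampled and len(sampled) < min_len:
        sampled = interpolate_frames(sampled, min_len)
    return sampled, interval
```

Because the interval is rounded to an integer, the effective frame rate only approximates the sampled target.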

5. The temporal self-supervised learning method for video images in rail transit engineering according to claim 1, characterized in that the method for calculating the reconstruction loss in S4 is as follows: the reconstruction loss uses a weighted L1 norm that focuses on the pixel differences in the occluded regions, plus a gradient difference term weighted by 0.1; the formula is as follows: [formula not reproduced in this text]; where Ω indicates the occluded area and ∇ is the gradient operator.

6. The temporal self-supervised learning method for video images in rail transit engineering according to claim 1, characterized in that the method for calculating the segmentation loss in S4 is as follows: [formula not reproduced in this text]; where $C$ indicates the number of categories and $\epsilon$ is a smoothing term that prevents the denominator from being zero.

7. A temporal self-supervised learning system for video images in rail transit engineering, used to implement the temporal self-supervised learning method for video images in rail transit engineering as described in claim 1, characterized in that it comprises:
a dataset construction unit, used to collect videos of urban rail transit engineering construction, segment the videos, and generate segmentation masks for each frame of video sequentially through the SAM segmentation network and the Xmem tracking network to construct a self-supervised pre-training dataset;
a model initialization unit, used to initialize the self-supervised learning parameters and backbone model parameters of the rail transit engineering scenario using the pre-trained TimeSformer; for position encoding initialization, the time dimension is initialized using a sine function;
a video reconstruction and mask prediction unit, used to perform dynamic frame rate sampling on the input video; generate a corresponding binary mask image for each video frame through random occlusion processing; and, after multi-scale downsampling, input the processed video into the encoder-decoder network to complete the spatiotemporal feature encoding and the generation of reconstructed video and predicted mask;
a loss function construction unit, used to construct the loss function by calculating the reconstruction loss between the reconstructed video and the real video, the dynamic frequency loss based on the consistency of low-frequency and high-frequency features in the Fourier domain, and the segmentation loss between the predicted mask and the real mask;
wherein, in handling tunnel environments with drastic lighting changes, target features are extracted by combining pre-trained visual features with dynamic position coding, reducing the impact of environmental interference on recognition accuracy; dynamic position coding allows position codes to have different periods in different dimensions, thus representing the relative relationship between time steps; that is, when processing continuous video frames during tunnel construction, the differences in position coding between adjacent frames are used to reflect changes in construction progress, while the differences between distant frames are used to reflect the transition between construction stages.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the temporal self-supervised learning method for video images in rail transit engineering as described in any one of claims 1-6.