Method for surveillance video stitching and image stabilization based on semantic constraint alignment and spatiotemporal modeling

A method based on semantic constraint alignment and spatiotemporal modeling addresses the non-robust feature extraction and limited motion modeling capability of highway surveillance video stitching. It generates high-precision, highly stable panoramic surveillance videos while preserving the authenticity of key targets and the integrity of the video.

CN122243738A · Pending · Publication Date: 2026-06-19 · QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES) +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)
Filing Date: 2026-05-15
Publication Date: 2026-06-19

IPC: G06T3/4038; G06T7/33; G06T5/70; G06T5/77; G06V10/26; G06V10/28; G06V10/52; G06V10/77; G06V10/766; G06V10/80; G06V10/82; G06V20/70; G06N3/0455; G06N3/0464

AI Tagging

Application Domain

Image enhancement; Image analysis

AI Technical Summary

Technical Problem

Existing techniques for stitching highway surveillance videos suffer from non-robust feature extraction, limited motion modeling capability, and stitching-boundary processing that damages data authenticity. These problems make it difficult to generate high-precision, highly stable panoramic surveillance videos.

Method used

A method based on semantic constraint alignment and spatiotemporal modeling is adopted. Through lane line mask preprocessing, semantic constraint spatial alignment, spatiotemporal sequence modeling and smoothing, an initial stitched panoramic video frame with irregular boundaries is generated. Then, through semantically aware boundary correction and rectangularization processing, a regular rectangular panoramic video is output.

Benefits of technology

The method generates high-precision, highly stable panoramic surveillance videos that strictly maintain the authenticity of key targets, addressing the alignment errors and limited motion smoothing capability of existing technologies.

Abstract

This invention relates to the fields of intelligent traffic monitoring and computer vision, and in particular provides a method for surveillance video stitching and image stabilization based on semantic constraint alignment and spatiotemporal modeling. The method includes: data acquisition and preprocessing, in which preprocessed dual-channel video frames and their lane line masks are obtained; semantic constraint spatial alignment, in which a frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images is obtained from the preprocessed dual-channel video frames and their lane line masks; spatiotemporal sequence modeling and smoothing, in which a smoothed local homography grid sequence is obtained from the frame-by-frame local homography grid sequence; image deformation and fusion, in which initial stitched panoramic video frames with irregular boundaries are generated from the smoothed local homography grid sequence; and semantic-aware boundary correction and rectangularization, in which regular rectangular panoramic video frames are obtained and output by correcting the irregular boundaries of the initial stitched panoramic video frames. The method can generate high-precision, high-stability panoramic surveillance videos that strictly maintain the authenticity of key targets.

Description

Technical Field

[0001] This invention relates to the fields of intelligent traffic monitoring and computer vision technology, and in particular to a method for stitching and stabilizing surveillance videos based on semantic constraint alignment and spatiotemporal modeling.

Background Technology

[0002] In the fields of computer vision and intelligent transportation systems, wide-area video surveillance plays a crucial role in improving traffic safety management, real-time accident response, and overall situational awareness. Traditionally, achieving high-definition monitoring with a wide field of view mainly relies on physical PTZ cameras, but this faces limitations such as high hardware costs, complex installation and maintenance processes, and difficulties in achieving multi-node video stream synchronization and seamless integration.

[0003] Despite advancements in image and video stitching technology, existing advanced image stabilization and stitching frameworks still face significant fundamental technical bottlenecks in the highly challenging scenario of highways. Firstly, regarding feature extraction and spatial alignment, highway scenes typically feature elongated, repetitive, and low-contrast lane lines and road markings, accompanied by high-speed vehicle movement and drastic changes in lighting. Existing stitching models largely rely on shallow, general-purpose feature extraction backbone networks like ResNet, whose limited receptive field makes it difficult to extract robust semantic features to capture fine-grained linear textures. Furthermore, existing grid-based unsupervised stitching methods primarily rely on pixel-level photometric errors for optimization. When faced with repetitive linear textures, these methods are prone to feature matching ambiguities, leading to grid collapse or overstretching in textureless areas. This results in severe local misalignment, artifacts, and geometric distortions at stitching boundaries or around moving objects, failing to meet the spatial alignment requirements of high-fidelity surveillance.

[0004] Secondly, limitations also exist in temporal smoothing and motion modeling. To generate smooth panoramic videos and filter out high-frequency jitter caused by mechanical vibrations, existing image stabilization methods generally employ smoothing networks based on local convolutional operations of 3D-CNNs. However, the temporal receptive field of local convolutional kernels is severely limited, making it difficult to effectively simulate and model long-distance inter-frame motion dependencies and understand the global logic of the camera in complex motion. This results in its inability to distinguish between low-frequency intentional camera scans and high-frequency mechanical vibrations, making the trajectory smoothing process prone to getting stuck in local optima: either over-smoothing leads to image lag and loss of true motion intent, or under-smoothing results in residual high-frequency jitter, making it difficult to achieve the optimal balance between registration accuracy and visual stability.

[0005] Finally, the stitched panoramic video inevitably has irregular, jagged edges due to the cumulative effect of camera motion compensation and projection alignment. When converting it to a standard rectangular format, existing boundary completion techniques struggle to balance visual integrity and data accuracy. Purely geometric rectangularization methods must overstretch edge pixels to fill large missing areas, which easily leads to severe distortion of traffic participants such as vehicles near the edges and thus misrepresents their true speed or size. On the other hand, blind video inpainting using Generative Adversarial Networks (GANs) samples from a learned probability distribution and is prone to hallucinations, such as generating non-existent vehicles in empty lanes or erasing real anomalies. This unconstrained generative inpainting introduces the risk of data forgery, significantly undermining the system's credibility.

Summary of the Invention

[0006] In view of this, the present invention provides a method for stitching and stabilizing surveillance videos based on semantic constraint alignment and spatiotemporal modeling, in order to generate high-precision, high-stability panoramic surveillance videos that can strictly maintain the authenticity of key targets.

[0007] In a first aspect, the present invention provides a method for stitching and stabilizing surveillance videos based on semantic constraint alignment and spatiotemporal modeling, the method comprising:

[0008] Step 1: Data acquisition and preprocessing: obtain the preprocessed dual-channel video frames and their lane line masks;
Step 2: Semantic constraint spatial alignment: based on the preprocessed dual-channel video frames and their lane line masks, obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images;
Step 3: Spatiotemporal sequence modeling and smoothing: based on the frame-by-frame local homography grid sequence, obtain a smoothed local homography grid sequence;
Step 4: Image deformation and fusion: based on the smoothed local homography grid sequence, generate initial stitched panoramic video frames with irregular boundaries;
Step 5: Semantic-aware boundary correction and rectangularization: for the irregular boundaries of the initial stitched panoramic video frames, obtain and output regular rectangular panoramic video frames.

[0009] Optionally, step 1 includes: acquiring dual-channel original video frames, extracting the corresponding lane line semantic masks, and performing image preprocessing to obtain the preprocessed dual-channel video frames and their lane line masks.
Step 11: Obtain dual-channel raw video frames synchronously acquired by multiple fixed cameras with overlapping fields of view. The left-view video stream and the right-view video stream are synchronously acquired by at least two fixed monitoring cameras and have overlapping fields of view. The dual-channel raw video frames corresponding to the same moment are extracted from the left-view and right-view video streams as input for subsequent spatial registration and panoramic reconstruction.
Step 12: Extract the corresponding lane line semantic masks. Lane line information is treated as an independent semantic modality: each of the two original video frames is input into a lightweight semantic segmentation network to obtain the lane line semantic mask corresponding to that frame. The lane line semantic mask is a binary mask or a probabilistic mask; in the binary mask, the lane line area is marked as foreground and the non-lane-line area as background.
Step 13: Using an asymmetric resolution strategy, preprocess the dual-channel original video frames and the lane line semantic masks to obtain the preprocessed dual-channel video frames and their lane line masks.

[0010] Optionally, the preprocessing includes:
Step 131: Perform semantic segmentation on the dual-channel original video frames to extract high-resolution binary lane line masks;
Step 132: Using the nearest neighbor interpolation algorithm, downsample the high-resolution binary lane line mask to a preset low-resolution target size to generate a low-resolution binary lane line mask; nearest neighbor interpolation is used to preserve lane line category boundaries during spatial dimensionality reduction;
Step 133: Input the low-resolution original video frame corresponding to the low-resolution target size and the low-resolution binary lane line mask synchronously into the ConvNeXt network backbone module to maintain semantic guidance.

[0011] Optionally, step 132 includes:
a. For any target pixel on the low-resolution binary lane line mask with coordinates $(x_d, y_d)$, first map the target pixel back to floating-point source coordinates $(x_s, y_s)$ on the high-resolution binary lane line mask: $x_s = x_d \cdot \frac{W_h}{W_l}$, $y_s = y_d \cdot \frac{H_h}{H_l}$; where $W_h$ and $H_h$ are the width and height of the high-resolution mask, respectively, and $W_l$ and $H_l$ are the width and height of the low-resolution target, respectively.
b. Round the floating-point source coordinates to the nearest integer source pixel coordinates: $\hat{x}_s = \mathrm{round}(x_s)$, $\hat{y}_s = \mathrm{round}(y_s)$;
c. Assign the pixel value of the high-resolution binary lane line mask at coordinates $(\hat{x}_s, \hat{y}_s)$ to the low-resolution binary lane line mask at coordinates $(x_d, y_d)$: $M_l(x_d, y_d) = M_h(\hat{x}_s, \hat{y}_s)$; where $M_h$ is the high-resolution binary lane line mask and $M_l$ is the generated low-resolution binary lane line mask.

[0012] Optionally, step 2 includes: the preprocessed dual-channel video frames and their lane line masks are input into a semantically constrained encoder-decoder spatial alignment network. A ConvNeXt encoder with shared weights extracts multi-scale structural features of the dual-channel images, and the lane line semantic masks are fused with the multi-scale structural features. Based on the fused features, a correlation representation is constructed, and the local homography grid is regressed frame by frame by the decoder to obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images.
Step 21: Let the target view image and the reference view image be $I_{\mathrm{tgt}}$ and $I_{\mathrm{ref}}$, respectively, and let their corresponding lane line semantic masks be $M_{\mathrm{tgt}}$ and $M_{\mathrm{ref}}$, respectively;
Step 22: A weight-sharing Siamese encoding structure extracts features from the target view image, the reference view image, and their corresponding lane line semantic masks. With the encoder denoted as $E$, the image features and semantic features at the $l$-th scale are: $F^{a}_{v,l} = E_l(I_v)$; $F^{s}_{v,l} = E_l(M_v)$, $v \in \{\mathrm{tgt}, \mathrm{ref}\}$; where $l$ indexes the levels of the encoder, $F^{a}$ denotes appearance features, and $F^{s}$ denotes semantic features;
Step 23: At a predetermined resolution scale, concatenate the appearance features and the corresponding semantic features along the channel dimension and then apply a feature compression mapping to complete cross-modal fusion, yielding the target view fusion features and the reference view fusion features: $\tilde{F}_{\mathrm{tgt}} = \phi(\mathrm{Concat}(F^{a}_{\mathrm{tgt},l}, F^{s}_{\mathrm{tgt},l}))$; $\tilde{F}_{\mathrm{ref}} = \phi(\mathrm{Concat}(F^{a}_{\mathrm{ref},l}, F^{s}_{\mathrm{ref},l}))$; where $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension and $\phi$ denotes a feature compression mapping consisting of convolution, normalization, and activation functions;
Step 24: Based on the target view fusion features and the reference view fusion features, construct a correlation representation, or matching cost volume. The correlation between location $p$ and candidate displacement $d$ is defined as: $C(p, d) = \langle \tilde{F}_{\mathrm{tgt}}(p), \tilde{F}_{\mathrm{ref}}(p + d) \rangle$; traversing all candidate displacements within a given search window $\mathcal{D}$ yields the matching cost volume: $\mathcal{C} = \{ C(p, d) \mid d \in \mathcal{D} \}$. The matching cost volume characterizes the pixel similarity of the two views within a local neighborhood and uses lane line semantic priors to improve matching discrimination in regions of repeated texture.
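As a non-limiting illustration, the matching cost volume of step 24 can be sketched in Python with NumPy. The sketch assumes the correlation is a channel-averaged inner product between fused features and that the search window is a square of half-width `radius`; the function name, the zero padding, and the toy one-hot features are hypothetical choices made for this example, not details stated in the specification.

```python
import numpy as np

def cost_volume(f_tgt: np.ndarray, f_ref: np.ndarray, radius: int) -> np.ndarray:
    """Correlate fused features of shape (C, H, W) over a square search window.

    Returns an array of shape (D, H, W) with D = (2 * radius + 1) ** 2, where
    plane index (dy + radius) * (2 * radius + 1) + (dx + radius) holds the
    correlation between f_tgt at p and f_ref at p + (dx, dy).
    """
    c, h, w = f_tgt.shape
    # Zero-pad the reference features so shifted windows stay in bounds.
    padded = np.pad(f_ref, ((0, 0), (radius, radius), (radius, radius)))
    planes = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = padded[:, radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            # Channel-averaged inner product (assumed correlation measure).
            planes.append((f_tgt * shifted).sum(axis=0) / c)
    return np.stack(planes)

# Toy features: every pixel carries a unique one-hot descriptor, and the
# reference view is the target view shifted one pixel to the right.
feat_tgt = np.eye(36).reshape(36, 6, 6)
feat_ref = np.roll(feat_tgt, shift=1, axis=2)
vol = cost_volume(feat_tgt, feat_ref, radius=2)
best = int(vol[:, 3, 3].argmax())  # index of the best displacement at (3, 3)
```

In this toy case the strongest response at pixel (3, 3) falls on the plane for displacement (dx, dy) = (1, 0), which is how a decoder could read local motion out of the volume.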

[0013] Optionally, step 24 includes: performing local homography mesh regression and image deformation based on the matching cost volume. Let the initial regular mesh be $G_0$. A regression subnetwork $R$ takes the matching cost volume $\mathcal{C}$ as input and predicts the vertex displacement relative to the initial regular mesh: $\Delta G = R(\mathcal{C})$; the vertex displacement is superimposed on the initial regular mesh to obtain the target local homography mesh: $G = G_0 + \Delta G$; for the $k$-th mesh cell, the direct linear transformation (DLT) algorithm computes the corresponding local homography matrix $H_k$ from the coordinates of its four corner points before and after deformation, giving the set of local homography matrices for the entire image: $\mathcal{H} = \{H_k\}_{k=1}^{K}$; a pixel with coordinates $p = (x, y)$ lying in the $k$-th mesh cell is transformed to $p' = (x', y')$ satisfying: $(x', y', 1)^\top \sim H_k (x, y, 1)^\top$; where $\sim$ denotes equality in the sense of homogeneous coordinates. Based on the set of local homography matrices $\mathcal{H}$, a spatial deformation field from the reference view image to the target view image is constructed, and a differentiable deformation operation is applied to the reference view image $I_{\mathrm{ref}}$ and its corresponding lane line semantic mask $M_{\mathrm{ref}}$, respectively: $I^{w}_{\mathrm{ref}} = \mathcal{W}(I_{\mathrm{ref}}; \mathcal{H})$, $M^{w}_{\mathrm{ref}} = \mathcal{W}(M_{\mathrm{ref}}; \mathcal{H})$; where $\mathcal{W}$ denotes a differentiable deformation operation based on the set of local homography matrices. Spatial registration of the dual-view images is achieved through this differentiable deformation operation.
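For illustration only, the per-cell homography computation by direct linear transformation can be sketched as follows. The sketch solves the standard DLT system for one mesh cell from its four corner points before and after deformation; the function names and the example matrix are hypothetical, and the normalization by the bottom-right entry assumes that entry is nonzero.

```python
import numpy as np

def homography_dlt(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Estimate the 3x3 homography H with dst ~ H @ src from four (x, y) pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the scale (assumes h[2, 2] != 0)

def apply_homography(h: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Map (N, 2) points through H using homogeneous coordinates."""
    homo = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homo[:, :2] / homo[:, 2:3]

# One 10x10 mesh cell: corners before deformation and after a known warp.
h_true = np.array([[1.10, 0.05, 3.0],
                   [-0.02, 0.95, -2.0],
                   [1e-3, 2e-3, 1.0]])
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
warped = apply_homography(h_true, corners)
h_est = homography_dlt(corners, warped)
```

Repeating this for every mesh cell gives the set of local homography matrices from which the deformation field is built.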

[0014] Optionally, step 3 includes: the frame-by-frame local homography grid sequence is converted into a spatiotemporal feature sequence, and a three-dimensional position encoding is introduced to represent adjacency in the temporal and spatial dimensions. The spatiotemporal feature sequence with the three-dimensional position encoding is input into the spatiotemporal attention smoothing module, and the frame-by-frame local homography grid sequence is smoothed in a globally consistent manner through the self-attention mechanism to obtain the smoothed grid sequence.
Step 31: Let the length of the video sequence be $T$. The frame-by-frame local homography mesh is represented as: $G_t \in \mathbb{R}^{N \times 2}$, $t = 1, \dots, T$; where $N$ is the number of grid vertices and each grid vertex contains a two-dimensional coordinate offset;
Step 32: Perform vectorized expansion on the frame-by-frame local homography mesh to obtain: $g_t = \mathrm{vec}(G_t)$; where $g_t \in \mathbb{R}^{2N}$. The vectorized result is then projected into a high-dimensional feature space through a linear mapping: $e_t = W_e g_t + b_e$; where $e_t \in \mathbb{R}^{D}$, $D$ is the embedding dimension, and $W_e$ and $b_e$ are learnable parameters;
Step 33: Concatenate the embedded features from all time steps to form the input spatiotemporal feature sequence: $X = [e_1, e_2, \dots, e_T]$. The spatiotemporal feature sequence is input into the spatiotemporal attention smoothing module to perform global temporal modeling on the frame-by-frame local homography grid sequence. The spatiotemporal attention smoothing module employs three-dimensional position encoding and a spatiotemporal multi-head self-attention mechanism. To characterize the adjacency of the spatiotemporal feature sequence in the temporal and spatial dimensions, a three-dimensional position code $PE(t, h, w)$ is constructed for the time index $t$ and the spatial position $(h, w)$ of the grid vertices. It is obtained by superimposing the time-dimension, spatial-height-dimension, and spatial-width-dimension position encodings: $PE(t, h, w) = PE_T(t) + PE_H(h) + PE_W(w)$; the time-dimension position encoding satisfies: $PE_T(t, 2i) = \sin\left(t / 10000^{2i/D}\right)$, $PE_T(t, 2i+1) = \cos\left(t / 10000^{2i/D}\right)$; the spatial-height-dimension position encoding $PE_H(h)$ and the spatial-width-dimension position encoding $PE_W(w)$ adopt the same construction as the time-dimension position encoding. Adding the three-dimensional position code to the corresponding embedding features yields: $Z = X + PE$. The feature $Z$ with position encoding is input into the Transformer encoder, which computes the query, key, and value matrices at each layer: $Q = Z W_Q$, $K = Z W_K$, $V = Z W_V$; where $W_Q$, $W_K$, and $W_V$ are learnable parameters. The attention weights are calculated as: $A = \mathrm{softmax}\left(Q K^\top / \sqrt{d_k}\right)$, where $d_k$ is the key dimension, and the corresponding attention output is: $\mathrm{Attn}(Q, K, V) = A V$. A multi-head mechanism models the spatiotemporal dependencies in different subspaces in parallel, and its output is: $\mathrm{MHA}(Z) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W_O$; where $W_O$ is the output projection matrix. The spatiotemporal multi-head self-attention mechanism establishes a global dependency between all video frames and all grid vertices.
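The following Python sketch illustrates, under stated assumptions, how a three-dimensional position encoding and one self-attention pass over spatiotemporal tokens could be computed. It assumes the standard sinusoidal form for each of the three encodings, one token per (frame, grid vertex) pair, and a single attention head instead of the multi-head mechanism; the random weights stand in for learned parameters and all names are hypothetical.

```python
import numpy as np

def sinusoid(pos: np.ndarray, dim: int) -> np.ndarray:
    """Sinusoidal encoding of integer positions; dim is assumed to be even."""
    i = np.arange(dim // 2)
    angles = pos[:, None] / (10000.0 ** (2 * i / dim))
    enc = np.zeros((len(pos), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def pos_encoding_3d(t_len: int, gh: int, gw: int, dim: int) -> np.ndarray:
    """Sum of time, height, and width encodings for every (t, h, w) token."""
    t, h, w = np.meshgrid(np.arange(t_len), np.arange(gh), np.arange(gw),
                          indexing="ij")
    return (sinusoid(t.ravel(), dim) + sinusoid(h.ravel(), dim)
            + sinusoid(w.ravel(), dim))

def self_attention(z, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = z @ wq, z @ wk, z @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

T_LEN, GH, GW, DIM = 4, 3, 3, 8            # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
pe = pos_encoding_3d(T_LEN, GH, GW, DIM)
tokens = rng.normal(size=pe.shape) + pe    # embeddings plus position code
wq, wk, wv = (0.1 * rng.normal(size=(DIM, DIM)) for _ in range(3))
out, attn = self_attention(tokens, wq, wk, wv)
```

Because every token attends to every other token, each grid vertex in each frame can draw on the whole sequence, which is the global dependency the smoothing module relies on.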

[0015] Optionally, step 3 further includes: the spatiotemporal attention smoothing module uses residual regression to predict the smoothed local homography grid and is trained using temporal consistency constraints. After the Transformer encoder, the spatiotemporal feature representation $Z' = [z'_1, \dots, z'_T]$ is obtained; a regression head $\psi$ predicts the smoothed offset at each time step: $\Delta \hat{G}_t = \psi(z'_t)$; where $\psi$ is a regression head composed of a multilayer perceptron. The smoothed offset is added to the original local homography mesh $G_t$ to obtain the smoothed local homography mesh: $\hat{G}_t = G_t + \Delta \hat{G}_t$. To constrain the temporal continuity of the smoothed grid sequence, a temporal consistency loss is introduced: $L_{tc} = \sum_{t=1}^{T-1} \lVert \hat{G}_{t+1} - \hat{G}_t \rVert$. To keep the smoothing result from deviating too far from the original motion trend, a fidelity loss is introduced: $L_{fid} = \sum_{t=1}^{T} \lVert \hat{G}_t - G_t \rVert$. The total loss function of the spatiotemporal attention smoothing module is: $L = \lambda_{tc} L_{tc} + \lambda_{fid} L_{fid}$; where $\lambda_{tc}$ and $\lambda_{fid}$ are the weighting coefficients of the temporal consistency loss and the fidelity loss, respectively. The total loss function jointly constrains the temporal continuity and motion fidelity of the smoothed local homography grid sequence.
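By way of example, the trade-off between the temporal consistency loss and the fidelity loss can be sketched in Python. The sketch assumes mean squared differences for both terms and arbitrary weighting coefficients; the specification names the two losses and their weighted sum, while the norm, the weights, and the function name here are assumptions.

```python
import numpy as np

def smoothing_losses(g_raw: np.ndarray, g_smooth: np.ndarray,
                     lam_tc: float = 1.0, lam_fid: float = 0.1):
    """Losses for mesh sequences of shape (T, N, 2).

    Returns (temporal consistency, fidelity, weighted total).
    """
    # Temporal consistency: penalize change between adjacent smoothed meshes.
    tc = float(np.mean((g_smooth[1:] - g_smooth[:-1]) ** 2))
    # Fidelity: penalize deviation from the original per-frame meshes.
    fid = float(np.mean((g_smooth - g_raw) ** 2))
    return tc, fid, lam_tc * tc + lam_fid * fid

# Toy sequence: vertex offsets that flip sign every frame (pure jitter).
signs = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
jitter = signs[:, None, None] * np.ones((6, 4, 2))
no_smoothing = smoothing_losses(jitter, jitter)
full_smoothing = smoothing_losses(jitter, np.zeros_like(jitter))
```

Leaving the jitter untouched gives zero fidelity loss but a large temporal term, while flattening it does the opposite; the weighting coefficients decide where the balance falls.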

[0016] Optionally, step 4 includes: based on the smoothed grid sequence, a corresponding set of smooth local homography matrices is constructed, and a differentiable deformation operation is performed on the dual-channel original video frames. The target view image and the deformed reference view image are projected onto a unified coordinate system, weighted fusion is performed in the overlapping area, and the corresponding image content is retained in the non-overlapping area to generate an initial stitched panoramic video frame with irregular boundaries.
Step 41: Image warping. For the dual-channel raw video frames at the $t$-th time, construct the corresponding set of smooth local homography matrices from the smoothed local homography grid $\hat{G}_t$: $\hat{\mathcal{H}}_t = \{\hat{H}_{t,k}\}_{k=1}^{K}$; where $\hat{H}_{t,k}$ denotes the local homography matrix corresponding to the $k$-th grid cell at the $t$-th time. Based on this set of smooth local homography matrices, perform a differentiable deformation operation on the reference view image at the $t$-th time to obtain the deformed reference view image: $\tilde{I}_{\mathrm{ref},t} = \mathcal{W}(I_{\mathrm{ref},t}; \hat{\mathcal{H}}_t)$; where $I_{\mathrm{ref},t}$ denotes the reference view of the original video frame at the $t$-th time and $\mathcal{W}$ denotes a differentiable deformation operation based on the set of smooth local homography matrices. The target view image $I_{\mathrm{tgt},t}$ and the deformed reference view image $\tilde{I}_{\mathrm{ref},t}$ are projected onto a unified coordinate system, yielding aligned image pairs for subsequent stitching and fusion;
Step 42: Image fusion. Let the target view image in the unified coordinate system at the $t$-th time be $I_{\mathrm{tgt},t}$ and the deformed reference view image be $\tilde{I}_{\mathrm{ref},t}$, with effective areas denoted $\Omega_{\mathrm{tgt},t}$ and $\Omega_{\mathrm{ref},t}$, respectively. The overlapping region is: $\Omega^{o}_{t} = \Omega_{\mathrm{tgt},t} \cap \Omega_{\mathrm{ref},t}$; the non-overlapping region is: $\Omega^{n}_{t} = (\Omega_{\mathrm{tgt},t} \cup \Omega_{\mathrm{ref},t}) \setminus \Omega^{o}_{t}$. Within the overlapping region $\Omega^{o}_{t}$, weighted fusion is performed on the target view image and the deformed reference view image: $S_t(p) = \alpha_t(p)\, I_{\mathrm{tgt},t}(p) + (1 - \alpha_t(p))\, \tilde{I}_{\mathrm{ref},t}(p)$; where $\alpha_t(p)$ is the fusion weight at position $p$. Within the non-overlapping region $\Omega^{n}_{t}$, the original image content of the corresponding view is preserved, thus generating the initial stitched panoramic video frame $S_t$ at the $t$-th time.
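The fusion rule of step 42 can be illustrated with a small Python sketch. It assumes grayscale images already projected onto the unified coordinate system, boolean validity masks for the effective areas, and a per-pixel weight map supplied by the caller; the function name and the constant weight of 0.5 are hypothetical choices for the example.

```python
import numpy as np

def fuse_views(img_tgt, img_ref, valid_tgt, valid_ref, alpha):
    """Blend two aligned views; all arrays share the canvas shape (H, W).

    Returns the stitched canvas and the boolean overlap mask. Pixels covered
    by neither view stay at zero, which produces the irregular boundary.
    """
    overlap = valid_tgt & valid_ref
    only_tgt = valid_tgt & ~valid_ref
    only_ref = valid_ref & ~valid_tgt
    out = np.zeros(img_tgt.shape, dtype=float)
    # Non-overlapping regions keep the content of the view that covers them.
    out[only_tgt] = img_tgt[only_tgt]
    out[only_ref] = img_ref[only_ref]
    # Overlapping region: weighted fusion of the two views.
    out[overlap] = (alpha[overlap] * img_tgt[overlap]
                    + (1.0 - alpha[overlap]) * img_ref[overlap])
    return out, overlap

# Toy 1x6 canvas: the target covers columns 0-3, the reference columns 2-5.
img_tgt = np.array([[10.0, 10.0, 10.0, 10.0, 0.0, 0.0]])
img_ref = np.array([[0.0, 0.0, 20.0, 20.0, 20.0, 20.0]])
valid_tgt = np.array([[True, True, True, True, False, False]])
valid_ref = np.array([[False, False, True, True, True, True]])
alpha = np.full((1, 6), 0.5)
pano, overlap = fuse_views(img_tgt, img_ref, valid_tgt, valid_ref, alpha)
```

A weight map that ramps smoothly across the overlap, rather than a constant, is a common way to hide the seam; the specification leaves the choice of weights open.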

[0017] Optionally, step 5 includes: for the irregular boundaries of the initial stitched panoramic video frame, the boundary regions are classified based on semantic information; regions containing key traffic targets are corrected using a geometrically constrained mesh stretching method, and background regions are repaired using a generative completion method, so that regular rectangular panoramic video frames are obtained and output.
Step 51: Let the panoramic image after spatial alignment, spatiotemporal smoothing, and stitching be $S$, and let its corresponding semantic mask be $M_S$. The goal is to map the panoramic image onto a regular rectangular canvas $\Omega_R$. The effective region in the panoramic image is defined as $\Omega_v$, and the missing region that needs to be filled is defined as: $\Omega_m = \Omega_R \setminus \Omega_v$;
Step 52: Based on the semantic mask $M_S$, perform semantic classification on the boundary region and construct a semantic decision function: $D(p) = 1$ if $p \in \mathcal{R}_{\mathrm{risk}}$, and $D(p) = 0$ if $p \in \mathcal{R}_{\mathrm{safe}}$; where $\mathcal{R}_{\mathrm{risk}}$ denotes a risk region containing key traffic targets, including vehicles, pedestrians, or traffic signs, and $\mathcal{R}_{\mathrm{safe}}$ denotes a safe background region that does not contain such key traffic targets. When $D(p) = 1$, a geometrically constrained mesh stretching strategy is applied to the risk region; when $D(p) = 0$, a generative repair strategy is adopted for the safe background region. The semantic decision function enables adaptive selection between the geometric deformation path and the generative repair path.
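As one possible illustration of the semantic decision function, the Python sketch below labels every missing pixel of the rectangular canvas. It assumes a missing pixel belongs to the risk region when a key traffic target lies within `band` pixels of it, and to the safe background region otherwise; the class ids, the band width, and the function names are hypothetical and not taken from the specification.

```python
import numpy as np

KEY_CLASSES = (1, 2, 3)  # hypothetical ids: vehicle, pedestrian, traffic sign

def dilate(mask: np.ndarray, r: int) -> np.ndarray:
    """Grow a boolean mask by r pixels using a square neighborhood."""
    h, w = mask.shape
    padded = np.pad(mask, r)  # pads with False
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= padded[r + dy:r + dy + h, r + dx:r + dx + w]
    return out

def decide_fill(sem_mask: np.ndarray, valid: np.ndarray, band: int = 1):
    """Per-pixel decision: 1 = mesh stretching, 0 = generative repair,
    -1 = valid pixel that needs no filling."""
    key = np.isin(sem_mask, KEY_CLASSES) & valid
    near_key = dilate(key, band)
    decision = np.full(sem_mask.shape, -1, dtype=int)
    missing = ~valid
    decision[missing & near_key] = 1
    decision[missing & ~near_key] = 0
    return decision

# Toy 4x6 canvas: columns 0-3 are valid, columns 4-5 must be filled,
# and a vehicle pixel sits at the right edge of the valid area in row 1.
valid = np.zeros((4, 6), dtype=bool)
valid[:, :4] = True
sem = np.zeros((4, 6), dtype=int)
sem[1, 3] = 1
decision = decide_fill(sem, valid, band=1)
```

Missing pixels next to the vehicle are routed to geometric stretching so that no synthetic content is generated near a key target, while the remaining gaps go to generative repair.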

[0018] The technical solution provided by this invention includes: data acquisition and preprocessing, to obtain preprocessed dual-channel video frames and their lane line masks; semantic constraint spatial alignment, to obtain a frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images based on the preprocessed dual-channel video frames and their lane line masks; spatiotemporal sequence modeling and smoothing, to obtain a smoothed local homography grid sequence based on the frame-by-frame local homography grid sequence; image deformation and fusion, to generate initial stitched panoramic video frames with irregular boundaries based on the smoothed local homography grid sequence; and semantic-aware boundary correction and rectangularization, to obtain and output regular rectangular panoramic video frames for the irregular boundaries of the initial stitched panoramic video frames. The method overcomes alignment errors caused by repeated road textures and camera shake, and addresses the limited motion smoothing capability of existing technologies and the damage their boundary processing does to data authenticity. It can generate high-precision, high-stability panoramic surveillance videos that strictly maintain the authenticity of key targets.

Brief Description of the Drawings

[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a flowchart of the surveillance video stitching and stabilization method based on semantic constraint alignment and spatiotemporal modeling provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of lane line mask preprocessing provided in an embodiment of the present invention, wherein (a) is a lane line mask image at the high resolution of 2867×2160, (b) is a lane line mask image at the low resolution of 640×480, and (c) is a 640×480 lane line mask image after processing with the asymmetric resolution strategy;
Figure 3 is a schematic diagram of semantic constraint spatial alignment provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of spatiotemporal sequence modeling and smoothing provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of semantic-aware boundary correction and rectangularization provided in an embodiment of the present invention.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0022] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms “a,” “an,” and “the” used in the embodiments of this invention are also intended to include the plural forms unless the context clearly indicates otherwise.

[0023] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0024] Depending on the context, the word "if" as used here can be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" can be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)."

[0025] Figure 1 is a flowchart of the surveillance video stitching and stabilization method based on semantic constraint alignment and spatiotemporal modeling provided in the embodiments of the present invention. As shown in Figure 1, the method includes: Step 1: Data acquisition and preprocessing, obtaining the preprocessed dual-channel video frames and their lane line masks.

[0026] In this embodiment of the invention, step 1 includes: acquiring dual-channel original video frames, extracting the corresponding lane line semantic masks, and performing image preprocessing to obtain the preprocessed dual-channel video frames and their lane line masks.
Step 11: Obtain dual-channel raw video frames synchronously acquired by multiple fixed cameras with overlapping fields of view. The left-view video stream and the right-view video stream are synchronously acquired by at least two fixed monitoring cameras installed on highway monitoring poles or high-position fixed brackets, and the two streams have overlapping fields of view. The dual-channel raw video frames corresponding to the same moment are extracted from the left-view and right-view video streams as input for subsequent spatial registration and panoramic reconstruction.
Step 12: Extract the corresponding lane line semantic masks. Lane line information is treated as an independent semantic modality: each of the two original video frames is input into a lightweight semantic segmentation network to obtain the lane line semantic mask corresponding to that frame. The lane line semantic mask is a binary mask or a probabilistic mask; in the binary mask, the lane line area is marked as foreground and the non-lane-line area as background.
Step 13: Using an asymmetric resolution strategy, preprocess the dual-channel original video frames and the lane line semantic masks to obtain the preprocessed dual-channel video frames and their lane line masks.

[0027] In embodiments of the present invention, as shown in Figure 2 (a), (b), and (c), the preprocessing includes:
Step 131: Perform semantic segmentation on the dual-channel original video frames at their original ultra-high resolution to extract accurate and continuous high-resolution binary lane line masks, avoiding the aliasing and loss of fine-grained lane line features that direct downsampling would cause.
Step 132: Using the nearest neighbor interpolation algorithm, downsample the high-resolution binary lane line mask to a preset low-resolution target size to generate a low-resolution binary lane line mask; nearest neighbor interpolation is used to preserve the lane line category boundary and prevent pixel intensity decay during spatial dimensionality reduction.
Step 133: Input the low-resolution original video frame corresponding to the low-resolution target size and the low-resolution binary lane line mask synchronously into the ConvNeXt network backbone module, reducing computational complexity while maintaining continuous and clear semantic guidance.

[0028] In this embodiment of the invention, the resolution of the high-resolution original video frame is higher than the network input resolution, which is the target resolution used for training or inference of the spatial alignment network; after the corresponding video frame is scaled to the target resolution, it is paired with the lane line semantic mask after nearest neighbor interpolation downsampling and input into the spatial alignment network.

[0029] In this embodiment of the invention, the size specifications in the asymmetric resolution strategy are limited as follows: The original ultra-high-definition resolution has a width and height of 2867×2160, and the lane line has a pixel width of 5 to 10 pixels at this resolution; the preset low-resolution target size has a width and height of 640×480; after processing by the nearest neighbor interpolation algorithm, the resolution of the video frame and semantic mask input into the ConvNeXt network backbone module is uniformly maintained at 640×480.

[0030] In this embodiment of the invention, step 132 includes: a. For any target pixel on the low-resolution binary lane line mask with coordinates $(x_l, y_l)$, first map the target pixel back to floating-point source coordinates on the high-resolution binary lane line mask: $x_s = x_l \cdot W_h / W_l$, $y_s = y_l \cdot H_h / H_l$; where $W_h$ and $H_h$ are the width and height of the high-resolution mask, respectively, and $W_l$ and $H_l$ are the width and height of the low-resolution target, respectively. b. Round the floating-point source coordinates to the nearest integer source pixel coordinates: $x_h = \mathrm{round}(x_s)$, $y_h = \mathrm{round}(y_s)$; c. Assign the pixel value of the high-resolution binary lane line mask at $(x_h, y_h)$ to the low-resolution binary lane line mask at $(x_l, y_l)$: $M_l(x_l, y_l) = M_h(x_h, y_h)$; where $M_h$ is the high-resolution binary lane line mask and $M_l$ is the generated low-resolution binary lane line mask.
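The three sub-steps of step 132 can be illustrated with a short NumPy sketch. The function name `downsample_mask_nearest` and the clipping of rounded indices to the valid range are assumptions added for illustration; the source text describes only the mapping, rounding, and assignment.

```python
import numpy as np

def downsample_mask_nearest(mask_hr, out_w, out_h):
    """Nearest-neighbour downsampling of a binary mask (hypothetical helper)."""
    in_h, in_w = mask_hr.shape
    # a. Map each target pixel back to floating-point source coordinates.
    xs = np.arange(out_w) * (in_w / out_w)
    ys = np.arange(out_h) * (in_h / out_h)
    # b. Round to the nearest integer source pixel.
    #    Clipping is an added safeguard, not part of the source description.
    xi = np.clip(np.rint(xs).astype(int), 0, in_w - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, in_h - 1)
    # c. Copy the source value; no new intensity levels are created,
    #    so a binary mask stays binary.
    return mask_hr[np.ix_(yi, xi)]
```

Because values are copied rather than averaged, the output contains only values already present in the input, which is the property the text relies on to keep lane line boundaries crisp.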

[0031] Step 2: Semantic constraint spatial alignment. Based on the preprocessed dual-channel video frames and their lane line masks, obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images.

[0032] In embodiments of the present invention, as shown in Figure 3, step 2 includes: The preprocessed dual-channel video frames and their lane line masks are input into a semantically constrained encoder-decoder spatial alignment network. A ConvNeXt encoder with shared weights extracts multi-scale structural features of the dual-channel images, and the lane line semantic mask is fused with the multi-scale structural features. Based on the fused features, a correlation representation is constructed, and the local homography grid is regressed frame by frame through the decoder to obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images. Step 21: Let the target view image and the reference view image be $I^{tgt}$ and $I^{ref}$, respectively. The corresponding binary or probabilistic lane line semantic masks extracted by the semantic segmentation network are $M^{tgt}$ and $M^{ref}$, respectively; Step 22: A weight-sharing Siamese encoding structure extracts features from the target view image, the reference view image, and their corresponding lane line semantic masks, respectively. The encoder is denoted as $E$. The image features and semantic features at the $l$-th scale are then represented as: $F_l^{tgt} = E_l(I^{tgt})$, $F_l^{ref} = E_l(I^{ref})$; $S_l^{tgt} = E_l(M^{tgt})$, $S_l^{ref} = E_l(M^{ref})$; where $l$ indexes the levels of the encoder, $F$ denotes appearance features, and $S$ denotes semantic features; Step 23: At a predetermined resolution scale, concatenate the appearance features and the corresponding semantic features along the channel dimension, and then perform feature compression mapping. This completes the cross-modal fusion and yields the target view fusion features and the reference view fusion features: $\hat{F}^{tgt} = \phi(\mathrm{Concat}(F_l^{tgt}, S_l^{tgt}))$; $\hat{F}^{ref} = \phi(\mathrm{Concat}(F_l^{ref}, S_l^{ref}))$; where $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $\phi(\cdot)$ denotes a feature compression mapping consisting of convolution, normalization, and activation functions; Step 24: Based on the target view fusion features and the reference view fusion features, construct a correlation representation, or matching cost volume. For a position $p$ and a candidate displacement $d$, the correlation is defined as: $c(p, d) = \langle \hat{F}^{tgt}(p), \hat{F}^{ref}(p + d) \rangle$; traversing all candidate displacements within a given search window $\mathcal{D}$ yields the matching cost volume: $C(p) = \{ c(p, d) \mid d \in \mathcal{D} \}$; The matching cost volume characterizes the pixel similarity of the dual-channel views within a local neighborhood and uses lane line semantic priors to improve matching discrimination in regions of repeated texture.
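As a rough illustration of step 24, the sketch below builds a local correlation volume from two fused feature maps. The inner-product similarity, the division by the channel count, the zero padding at the borders, and the name `cost_volume` are all assumptions made for this example; the source does not fix these details.

```python
import numpy as np

def cost_volume(feat_tgt, feat_ref, radius):
    """Correlate feat_tgt(p) with feat_ref(p + d) for every d in a square window.

    feat_tgt, feat_ref: arrays of shape (C, H, W).
    Returns an array of shape ((2 * radius + 1) ** 2, H, W).
    """
    c, h, w = feat_tgt.shape
    # Zero-pad the reference features so that p + d is always addressable.
    padded = np.pad(feat_ref, ((0, 0), (radius, radius), (radius, radius)))
    planes = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[:, y, x] equals feat_ref[:, y + dy, x + dx] (zero outside).
            shifted = padded[:, radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            # Channel-wise inner product, normalised by the channel count.
            planes.append((feat_tgt * shifted).sum(axis=0) / c)
    return np.stack(planes)
```

Each output plane corresponds to one candidate displacement, ordered row by row over the search window, so the central plane holds the zero-displacement correlation.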

[0033] In this embodiment of the invention, step 24 includes: performing local homography mesh regression and image deformation based on the matching cost volume; Let the initial regular grid be $G_0$. The regression subnetwork $R$ takes the matching cost volume $C$ as input and predicts the vertex displacement relative to the initial regular grid: $\Delta G = R(C)$; The vertex displacement is superimposed on the initial regular grid to obtain the target local homography mesh: $G = G_0 + \Delta G$; For the $k$-th mesh cell, based on the coordinates of its four corner points before and after deformation, a direct linear transformation algorithm is used to calculate the corresponding local homography matrix $H_k$. The set of local homography matrices for the entire image is thus obtained: $\mathcal{H} = \{H_k\}_{k=1}^{K}$, where $K$ is the number of mesh cells; Given pixel coordinates $p$ inside the $k$-th grid cell, the transformed coordinates $p'$ satisfy: $\tilde{p}' \sim H_k \tilde{p}$; where $\tilde{p}$ and $\tilde{p}'$ are the homogeneous coordinates of $p$ and $p'$, and $\sim$ indicates equality in the sense of homogeneous coordinates; Based on the set of local homography matrices $\mathcal{H}$, a spatial deformation field is constructed from the reference view image to the target view image, and differentiable deformation operations are used to transform the reference view image and its corresponding lane line semantic mask, respectively: $I^{ref}_{w} = \mathcal{W}(I^{ref}; \mathcal{H})$, $M^{ref}_{w} = \mathcal{W}(M^{ref}; \mathcal{H})$; where $\mathcal{W}(\cdot)$ represents a differentiable deformation operation based on the set of local homography matrices; Spatial registration of the dual-view images is achieved through this differentiable deformation operation.
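The per-cell homography estimation can be sketched with the standard direct linear transformation (DLT) for four point correspondences. The helper names `homography_dlt` and `apply_homography` are hypothetical, and the sketch assumes a non-degenerate cell (no three corners collinear) whose solution has a non-zero bottom-right entry.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate a 3x3 homography from four corner correspondences.

    src, dst: arrays of shape (4, 2) holding the corners of one mesh cell
    before and after deformation.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear constraints per correspondence on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    a = np.asarray(rows, dtype=float)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the free scale (assumes h[2, 2] != 0)

def apply_homography(h, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ h.T
    return out[:, :2] / out[:, 2:3]
```

Applying the estimated matrix to the original corners reproduces the displaced corners up to numerical precision, which is a convenient sanity check when assembling the full set of per-cell matrices.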

[0034] In this embodiment of the invention, the training process introduces a geometric-semantic joint loss function composed of content consistency loss, semantic consistency loss, and grid regularization loss, so as to simultaneously constrain appearance consistency, semantic consistency, and geometric continuity of local grids within overlapping regions.

[0035] The training process employs a joint geometric-semantic loss function to constrain the semantically constrained spatial alignment network. The joint geometric-semantic loss function is: $\mathcal{L} = \lambda_{c} \mathcal{L}_{content} + \lambda_{s} \mathcal{L}_{sem} + \lambda_{g} \mathcal{L}_{grid}$; where $\lambda_{c}$, $\lambda_{s}$, and $\lambda_{g}$ are the weight coefficients for the content consistency loss, the semantic consistency loss, and the grid regularization loss, respectively. The content consistency loss $\mathcal{L}_{content}$ is evaluated over $\Omega$, the effective overlap area between the target view image and the deformed reference view image; The semantic consistency loss $\mathcal{L}_{sem}$ is evaluated on the lane line semantic masks of the target view and the deformed reference view; The mesh regularization loss $\mathcal{L}_{grid}$ is evaluated on the vertex coordinates $v_i$, where $v_i$ denotes the coordinates of the $i$-th vertex of the target local homography mesh; The joint geometric-semantic loss function simultaneously constrains the appearance consistency, the lane line semantic consistency, and the geometric continuity of the local homography mesh in the overlapping area between the deformed reference view image and the target view image.
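A minimal sketch of how such a weighted three-term loss might be assembled is shown below. The source text does not preserve the individual loss expressions, so the L1 differences over the overlap and the second-difference mesh regularizer used here are assumptions chosen for illustration, as are the function name `joint_loss` and the default weights.

```python
import numpy as np

def joint_loss(img_tgt, img_warp, mask_tgt, mask_warp, overlap, grid,
               w_content=1.0, w_sem=1.0, w_grid=1.0):
    """Weighted sum of content, semantic, and mesh terms (assumed forms).

    img_*, mask_*: float arrays of shape (H, W); overlap: bool array (H, W);
    grid: vertex coordinates of shape (rows, cols, 2) with rows, cols >= 3.
    """
    n = max(int(overlap.sum()), 1)
    # Assumed content term: mean absolute image difference inside the overlap.
    content = np.abs(img_tgt - img_warp)[overlap].sum() / n
    # Assumed semantic term: mean absolute mask difference inside the overlap.
    semantic = np.abs(mask_tgt - mask_warp)[overlap].sum() / n
    # Assumed mesh term: second differences along grid rows and columns,
    # which vanish for a uniformly spaced mesh.
    d2_rows = grid[2:] - 2 * grid[1:-1] + grid[:-2]
    d2_cols = grid[:, 2:] - 2 * grid[:, 1:-1] + grid[:, :-2]
    reg = np.mean(d2_rows ** 2) + np.mean(d2_cols ** 2)
    return w_content * content + w_sem * semantic + w_grid * reg
```

With perfectly aligned inputs and an undeformed regular mesh every term is zero, so the loss only grows when appearance, lane line masks, or mesh smoothness degrade.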

[0036] Step 3: Spatiotemporal sequence modeling and smoothing. Based on the frame-by-frame local homography grid sequence, a smoothed local homography grid sequence is obtained.

[0037] In embodiments of the present invention, as shown in Figure 4, step 3 includes: The frame-by-frame local homography grid sequence is converted into a spatiotemporal feature sequence, and a three-dimensional position encoding is introduced to represent adjacency in the temporal and spatial dimensions. The spatiotemporal feature sequence with the three-dimensional position encoding is input into the spatiotemporal attention smoothing module, and the frame-by-frame local homography grid sequence is smoothed in a globally consistent manner through the self-attention mechanism to obtain the smoothed grid sequence. Step 31: Let the length of the video sequence be $T$. The frame-by-frame local homography mesh sequence is represented as: $\{G_t\}_{t=1}^{T}$, $G_t \in \mathbb{R}^{N \times 2}$; where $N$ is the number of grid vertices, and each grid vertex contains a two-dimensional coordinate offset; Step 32: Perform vectorized expansion on the frame-by-frame local homography mesh to obtain: $g_t = \mathrm{vec}(G_t)$; where $g_t \in \mathbb{R}^{2N}$; Then project the vectorized result into a high-dimensional feature space through a linear mapping: $z_t = W_e g_t + b_e$, $z_t \in \mathbb{R}^{d}$; where $d$ is the embedding dimension, and $W_e$ and $b_e$ are learnable parameters; Step 33: Concatenate the embedded features from all time steps to form the input spatiotemporal feature sequence: $Z = [z_1; z_2; \dots; z_T]$; The spatiotemporal feature sequence is input into the spatiotemporal attention smoothing module to perform global temporal modeling on the frame-by-frame local homography grid sequence; The spatiotemporal attention smoothing module employs three-dimensional positional encoding and a spatiotemporal multi-head self-attention mechanism. To characterize the adjacency relationships of the spatiotemporal feature sequence in the temporal and spatial dimensions, a three-dimensional position code $P(t, h, w)$ is constructed for the time index $t$ and the spatial position $(h, w)$ of the grid vertices; The three-dimensional position encoding is obtained by superimposing the time-dimensional, spatial height-dimensional, and spatial width-dimensional position encodings: $P(t, h, w) = P_{time}(t) + P_{height}(h) + P_{width}(w)$; The spatial height-dimensional position encoding $P_{height}$ and the spatial width-dimensional position encoding $P_{width}$ adopt the same construction form as the time-dimensional position encoding $P_{time}$; Adding the three-dimensional position code to the corresponding embedding features yields: $\tilde{Z} = Z + P$; The features $\tilde{Z}$ with position encoding are input into the Transformer encoder, which computes the query matrix, key matrix, and value matrix at each layer: $Q = \tilde{Z} W_Q$, $K = \tilde{Z} W_K$, $V = \tilde{Z} W_V$; where $W_Q$, $W_K$, and $W_V$ are learnable parameters; The attention weights are calculated as: $A = \mathrm{softmax}(Q K^{\top} / \sqrt{d_k})$, where $d_k$ is the key dimension; the corresponding attention output is: $\mathrm{Attn}(Q, K, V) = A V$; A multi-head mechanism models the spatiotemporal dependencies in different subspaces in parallel, and its output is: $\mathrm{MultiHead}(\tilde{Z}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W_O$; where $W_O$ is the output projection matrix; The spatiotemporal multi-head self-attention mechanism establishes a global dependency between all video frames and all grid vertices.
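The position encoding and attention computation can be sketched as follows. Several choices here are assumptions not stated in the source: a sinusoidal form for each one-dimensional encoding, one token per (frame, vertex) pair, a single attention head, and the helper names `sinusoid`, `position_code_3d`, and `self_attention`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinusoid(positions, dim):
    """Assumed sinusoidal 1-D position code; dim must be even."""
    i = np.arange(dim // 2)
    angles = positions[:, None] / (10000.0 ** (2 * i / dim))
    pe = np.zeros((len(positions), dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def position_code_3d(t_idx, h_idx, w_idx, dim):
    """Superimpose time, height, and width codes built the same way."""
    return sinusoid(t_idx, dim) + sinusoid(h_idx, dim) + sinusoid(w_idx, dim)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights
```

Because every token attends to every other token, each output row mixes information from all frames and all vertices, which is the global dependency the paragraph describes; a multi-head variant would run several such heads on separate projections and concatenate their outputs.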

[0038] In this embodiment of the invention, the spatiotemporal attention smoothing module uses residual regression to predict the smoothing offset and adds the smoothing offset to the original local homography grid to obtain the smoothed local homography grid; at the same time, temporal consistency loss and fidelity loss are introduced to jointly constrain the continuity of the smoothing result and the ability to maintain the original motion trend.

[0039] In this embodiment of the invention, the method includes: The spatiotemporal attention smoothing module uses residual regression to predict the smoothed local homography grid and is trained using temporal consistency constraints. After passing through the Transformer encoder, the spatiotemporal feature representation $Y$ is obtained, with $y_t$ denoting its component at time step $t$; A regression head $\mathrm{MLP}$ predicts the smoothing offset at each time step: $\Delta \hat{G}_t = \mathrm{MLP}(y_t)$; where $\mathrm{MLP}$ is a regression head composed of a multilayer perceptron; The smoothing offset is added to the original local homography mesh $G_t$ to obtain the smoothed local homography mesh: $\hat{G}_t = G_t + \Delta \hat{G}_t$; To constrain the temporal continuity of the smoothed grid sequence, a temporal consistency loss $\mathcal{L}_{temp}$ is introduced; To keep the smoothing result from deviating too far from the original motion trend, a fidelity loss $\mathcal{L}_{fid}$ is introduced; The total loss function of the spatiotemporal attention smoothing module is: $\mathcal{L}_{smooth} = \lambda_{temp} \mathcal{L}_{temp} + \lambda_{fid} \mathcal{L}_{fid}$; where $\lambda_{temp}$ and $\lambda_{fid}$ are the weighting coefficients for the temporal consistency loss and the fidelity loss, respectively; the total loss function jointly constrains the temporal continuity and motion fidelity of the smoothed local homography grid sequence.
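The residual update and the two competing losses can be sketched as below. The source does not preserve the loss expressions, so the mean squared frame-to-frame difference used for temporal consistency and the mean squared deviation used for fidelity are assumptions, as are the function name `smooth_and_losses` and the default weights.

```python
import numpy as np

def smooth_and_losses(grids, offsets, w_temp=1.0, w_fid=0.1):
    """Apply predicted offsets and evaluate the two losses (assumed forms).

    grids, offsets: arrays of shape (T, N, 2).
    """
    # Residual formulation: smoothed mesh = original mesh + predicted offset.
    smoothed = grids + offsets
    # Assumed temporal consistency: penalise change between adjacent frames.
    temporal = np.mean((smoothed[1:] - smoothed[:-1]) ** 2)
    # Assumed fidelity: penalise deviation from the original meshes.
    fidelity = np.mean((smoothed - grids) ** 2)
    total = w_temp * temporal + w_fid * fidelity
    return smoothed, temporal, fidelity, total
```

The two terms pull in opposite directions: zero offsets give perfect fidelity but keep all jitter, while collapsing every frame to the mean mesh removes jitter at the cost of fidelity. The weights set where the trade-off settles.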

[0040] Step 4, Image Deformation and Fusion: Based on the smoothed local homography grid sequence, generate an initial stitched panoramic video frame with irregular boundaries.

[0041] In this embodiment of the invention, step 4 includes: Based on the smoothed grid sequence, a corresponding set of smoothed local homography matrices is constructed, and a differentiable deformation operation is performed on the dual-channel original video frames. The target view image and the deformed reference view image are projected onto a unified coordinate system, weighted fusion is performed in the overlapping area, and the corresponding image content is retained in the non-overlapping area, generating an initial stitched panoramic video frame with irregular boundaries. Step 41: Image deformation; For the dual-channel original video frames at time $t$, based on the smoothed local homography grid $\hat{G}_t$, construct the corresponding set of smoothed local homography matrices: $\hat{\mathcal{H}}_t = \{\hat{H}_{t,k}\}_{k=1}^{K}$; where $\hat{H}_{t,k}$ denotes the local homography matrix corresponding to the $k$-th grid cell at time $t$; Based on this set of smoothed local homography matrices, perform a differentiable deformation operation on the reference view image at time $t$ to obtain the deformed reference view image: $I^{ref}_{w,t} = \mathcal{W}(I^{ref}_t; \hat{\mathcal{H}}_t)$; where $I^{ref}_t$ denotes the reference view original video frame at time $t$, and $\mathcal{W}(\cdot)$ represents a differentiable deformation operation based on the set of smoothed local homography matrices; Projecting the target view image $I^{tgt}_t$ and the deformed reference view image $I^{ref}_{w,t}$ onto a unified coordinate system yields aligned image pairs for subsequent stitching and fusion; Step 42: Image fusion; Let the target view image in the unified coordinate system at time $t$ be $I^{tgt}_t$ and the deformed reference view image be $I^{ref}_{w,t}$, with effective areas denoted as $\Omega^{tgt}$ and $\Omega^{ref}$, respectively. The overlapping region is: $\Omega_{o} = \Omega^{tgt} \cap \Omega^{ref}$; The non-overlapping region is: $\Omega_{n} = (\Omega^{tgt} \cup \Omega^{ref}) \setminus \Omega_{o}$; Within the overlapping region $\Omega_{o}$, weighted fusion is performed on the target view image and the deformed reference view image: $P_t(p) = \alpha(p) I^{tgt}_t(p) + (1 - \alpha(p)) I^{ref}_{w,t}(p)$; where $\alpha(p)$ denotes the fusion weight at position $p$; Within the non-overlapping region $\Omega_{n}$, the original image content of the corresponding view is preserved; this generates the initial stitched panoramic video frame $P_t$ at time $t$.
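Step 42 can be illustrated with a small sketch for single-channel images already placed on the unified canvas. The convex blend with a single weight `alpha` is an assumed form of the weighted fusion, and the function name `fuse` is hypothetical; how the weights are chosen (for example, distance-based feathering) is not specified in the source.

```python
import numpy as np

def fuse(img_tgt, valid_tgt, img_warp, valid_warp, alpha):
    """Blend two aligned single-channel images on a shared canvas.

    img_*: float arrays (H, W); valid_*: bool arrays (H, W) marking each
    view's effective area; alpha: scalar or (H, W) weight for the target view.
    """
    overlap = valid_tgt & valid_warp
    only_tgt = valid_tgt & ~valid_warp
    only_ref = valid_warp & ~valid_tgt
    pano = np.zeros_like(img_tgt, dtype=float)
    # Non-overlapping regions keep the content of the view that covers them.
    pano[only_tgt] = img_tgt[only_tgt]
    pano[only_ref] = img_warp[only_ref]
    # Overlapping region: assumed convex combination of the two views.
    blend = alpha * img_tgt + (1 - alpha) * img_warp
    pano[overlap] = blend[overlap]
    return pano, overlap
```

Pixels covered by neither view stay at zero, which corresponds to the irregular boundary that step 5 later fills or straightens.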

[0042] Step 5, Semantic-aware boundary correction and rectangularization: For the irregular boundaries of the initial stitched panoramic video frames, obtain and output regular rectangular panoramic video frames.

[0043] In embodiments of the present invention, as shown in Figure 5, step 5 includes: For the irregular boundaries of the initial stitched panoramic video frame, the boundary regions are classified based on semantic information; regions containing key traffic targets are corrected using a geometrically constrained mesh stretching method, and background regions are repaired using a generative completion method, so that regular rectangular panoramic video frames are obtained and output. Step 51: Let the panoramic image after spatial alignment, spatiotemporal smoothing, and stitching be $I^{pano}$, and let its corresponding semantic mask be $M^{sem}$. The goal is to map the panoramic image onto a regular rectangular canvas $\Omega_{rect}$. The effective region in the panoramic image is defined as $\Omega_{valid}$, and the missing region that needs to be filled is defined as: $\Omega_{miss} = \Omega_{rect} \setminus \Omega_{valid}$; Step 52: Based on the semantic mask $M^{sem}$, perform semantic classification on the boundary region and construct a semantic decision function: $D(p) = 1$ if $p \in \Omega_{risk}$, and $D(p) = 0$ if $p \in \Omega_{safe}$; where $\Omega_{risk}$ denotes a risk area containing key traffic targets, including vehicles, pedestrians, or traffic signs, and $\Omega_{safe}$ denotes a safe background area that does not contain these key traffic targets; when $D(p) = 1$, a geometrically constrained mesh stretching strategy is applied to the risk area; when $D(p) = 0$, a generative repair strategy is applied to the safe background area; The semantic decision function thus enables adaptive selection between the geometric deformation path and the generative repair path.

[0044] In this embodiment of the invention, the geometrically constrained mesh stretching strategy applied to the risk area $\Omega_{risk}$ specifically includes: Let the target mesh after rectangularization be $V^{*}$, obtained by solving: $V^{*} = \arg\min_{V} E(V)$; The overall geometric energy function is: $E(V) = \lambda_{shape} E_{shape}(V) + \lambda_{line} E_{line}(V) + \lambda_{sem} E_{sem}(V)$; where $\lambda_{shape}$, $\lambda_{line}$, and $\lambda_{sem}$ are the weight coefficients for the shape preservation term, the line preservation term, and the semantic consistency term, respectively. The shape preservation term $E_{shape}$ is defined on the vertex parameters of each mesh cell in the initial grid and the vertex parameters of the corresponding mesh cell after deformation; The line preservation term $E_{line}$ is defined on a geometric mapping operator for linear structures and the geometric representation of the original linear structures; The semantic consistency term $E_{sem}$ is defined on the semantic mask after rectangular deformation; Based on the target mesh $V^{*}$, the panoramic image is geometrically deformed so that the geometric structure of the key traffic targets is preserved while the risk area is mapped to the rectangular boundary.

[0045] In this embodiment of the invention, the generative repair strategy applied to the safe background area $\Omega_{safe}$ specifically includes: Construct a masked input image $I^{in}$ by element-wise multiplication, denoted $\odot$, of the panoramic image with a mask; Input the masked input image into the generator $\mathcal{G}$ to produce the repair result: $I^{gen} = \mathcal{G}(I^{in})$; The training objective function of the generator is: $\mathcal{L}_{\mathcal{G}} = \lambda_{perc} \mathcal{L}_{perc} + \lambda_{adv} \mathcal{L}_{adv}$; where $\mathcal{L}_{perc}$ is the perceptual loss, $\mathcal{L}_{adv}$ is the adversarial loss, and $\lambda_{perc}$ and $\lambda_{adv}$ are the weighting coefficients; The perceptual loss is defined using a feature extraction network $\phi$ and a real reference image $I^{gt}$; The adversarial loss is defined using a discriminator $D_{adv}$; The rectangularized output image combines the repair result $I^{gen}$ with $I^{geo}$, the output of the panoramic image after geometric deformation based on the target mesh $V^{*}$; By performing region-aware fusion of the geometrically constrained deformation result for the risk area and the generated repair result for the safe background area, a regular rectangular panoramic video frame is obtained.

[0046] The technical solution provided by this invention includes data acquisition and preprocessing to obtain preprocessed dual-channel video frames and their lane line masks; semantic constraint spatial alignment to obtain a frame-by-frame local homography grid sequence corresponding to the dual-view image spatial registration based on the preprocessed dual-channel video frames and their lane line masks; spatiotemporal sequence modeling and smoothing to obtain a smoothed local homography grid sequence based on the frame-by-frame local homography grid sequence; image deformation and fusion to generate an initial stitched panoramic video frame with irregular boundaries based on the smoothed local homography grid sequence; and semantically perceptual boundary correction and rectangularization to obtain and output a regular rectangular panoramic video frame for the irregular boundaries of the initial stitched panoramic video frame. This method overcomes alignment errors caused by repeated road textures and camera shake, and solves problems such as limited motion smoothing capabilities and boundary processing damage to data authenticity in existing technologies. It can generate high-precision, high-stability panoramic surveillance videos that strictly maintain the authenticity of key targets.

[0047] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for stitching and stabilizing surveillance videos based on semantic constraint alignment and spatiotemporal modeling, characterized in that, The method includes: Step 1: Data acquisition and preprocessing, obtaining the preprocessed dual-channel video frames and their lane line masks; Step 2: Semantic constraint spatial alignment. Based on the preprocessed dual-channel video frames and their lane line masks, obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images. Step 3: Spatiotemporal sequence modeling and smoothing. Based on the frame-by-frame local homography grid sequence, a smoothed local homography grid sequence is obtained. Step 4, Image Deformation and Fusion: Based on the smoothed local homography grid sequence, generate initial stitched panoramic video frames with irregular boundaries; Step 5, Semantic-aware boundary correction and rectangularization: For the irregular boundaries of the initial stitched panoramic video frames, obtain and output regular rectangular panoramic video frames.

2. The method according to claim 1, characterized in that, Step 1 includes: Acquire dual-channel original video frames, extract the corresponding lane line semantic masks, and perform image preprocessing to obtain preprocessed dual-channel video frames and their lane line masks. Step 11: Obtain dual-channel raw video frames synchronously acquired by multiple fixed cameras with overlapping fields of view. The left-view video stream and the right-view video stream are synchronously acquired by at least two fixed monitoring cameras, and the left-view video stream and the right-view video stream have overlapping fields of view. Extract the dual-channel raw video frames corresponding to the same moment from the left-view video stream and the right-view video stream as input for subsequent spatial registration and panoramic reconstruction. Step 12: Extract the corresponding lane line semantic mask. Treat the lane line information as an independent semantic modality and input it into the lightweight semantic segmentation network for each of the two original video frames to obtain the lane line semantic mask corresponding to each original video frame. The lane line semantic mask is a binary mask or a probabilistic mask, wherein the lane line area is marked as the foreground and the non-lane line area is marked as the background in the binary mask. Step 13: Using an asymmetric resolution strategy, preprocess the dual-channel original video frames and lane line semantic masks to obtain the preprocessed dual-channel video frames and their lane line masks.

3. The method according to claim 2, characterized in that, The preprocessing includes: Step 131: Perform semantic segmentation on the dual-channel original video frames to extract high-resolution binary lane line masks; Step 132: Using the nearest neighbor interpolation algorithm, the high-resolution binary lane line mask is downsampled to a preset low-resolution target size to generate a low-resolution binary lane line mask; the nearest neighbor interpolation algorithm is used to preserve lane line category boundaries during spatial dimensionality reduction. Step 133: The low-resolution original video frame corresponding to the low-resolution target size and the low-resolution binary lane line mask are synchronously input into the ConvNeXt network backbone module to maintain semantic guidance.

4. The method according to claim 3, characterized in that, Step 132 includes: a. For any target pixel on the low-resolution binary lane line mask with coordinates $(x_l, y_l)$, first map the target pixel back to floating-point source coordinates on the high-resolution binary lane line mask: $x_s = x_l \cdot W_h / W_l$, $y_s = y_l \cdot H_h / H_l$; where $W_h$ and $H_h$ are the width and height of the high-resolution mask, respectively, and $W_l$ and $H_l$ are the width and height of the low-resolution target, respectively. b. Round the floating-point source coordinates to the nearest integer source pixel coordinates: $x_h = \mathrm{round}(x_s)$, $y_h = \mathrm{round}(y_s)$; c. Assign the pixel value of the high-resolution binary lane line mask at $(x_h, y_h)$ to the low-resolution binary lane line mask at $(x_l, y_l)$: $M_l(x_l, y_l) = M_h(x_h, y_h)$; where $M_h$ is the high-resolution binary lane line mask and $M_l$ is the generated low-resolution binary lane line mask.

5. The method according to claim 2, characterized in that, Step 2 includes: The preprocessed dual-channel video frames and their lane line masks are input into a semantically constrained encoder-decoder spatial alignment network. A ConvNeXt encoder with shared weights extracts multi-scale structural features of the dual-channel images, and the lane line semantic mask is fused with the multi-scale structural features. Based on the fused features, a correlation representation is constructed, and the local homography grid is regressed frame by frame through the decoder to obtain the frame-by-frame local homography grid sequence corresponding to the spatial registration of the dual-view images. Step 21: Let the target view image and the reference view image be $I^{tgt}$ and $I^{ref}$, respectively, and let their corresponding lane line semantic masks be $M^{tgt}$ and $M^{ref}$, respectively; Step 22: A weight-sharing Siamese encoding structure extracts features from the target view image, the reference view image, and their corresponding lane line semantic masks, respectively. The encoder is denoted as $E$. The image features and semantic features at the $l$-th scale are then represented as: $F_l^{tgt} = E_l(I^{tgt})$, $F_l^{ref} = E_l(I^{ref})$; $S_l^{tgt} = E_l(M^{tgt})$, $S_l^{ref} = E_l(M^{ref})$; where $l$ indexes the levels of the encoder, $F$ denotes appearance features, and $S$ denotes semantic features; Step 23: At a predetermined resolution scale, concatenate the appearance features and the corresponding semantic features along the channel dimension, and then perform feature compression mapping. This completes the cross-modal fusion and yields the target view fusion features and the reference view fusion features: $\hat{F}^{tgt} = \phi(\mathrm{Concat}(F_l^{tgt}, S_l^{tgt}))$; $\hat{F}^{ref} = \phi(\mathrm{Concat}(F_l^{ref}, S_l^{ref}))$; where $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $\phi(\cdot)$ denotes a feature compression mapping consisting of convolution, normalization, and activation functions; Step 24: Based on the target view fusion features and the reference view fusion features, construct a correlation representation, or matching cost volume. For a position $p$ and a candidate displacement $d$, the correlation is defined as: $c(p, d) = \langle \hat{F}^{tgt}(p), \hat{F}^{ref}(p + d) \rangle$; traversing all candidate displacements within a given search window $\mathcal{D}$ yields the matching cost volume: $C(p) = \{ c(p, d) \mid d \in \mathcal{D} \}$; The matching cost volume characterizes the pixel similarity of the dual-channel views within a local neighborhood and uses lane line semantic priors to improve matching discrimination in regions of repeated texture.

6. The method according to claim 5, characterized in that, Step 24 includes: performing local homography mesh regression and image deformation based on the matching cost volume; Let the initial regular grid be $G_0$. The regression subnetwork $R$ takes the matching cost volume $C$ as input and predicts the vertex displacement relative to the initial regular grid: $\Delta G = R(C)$; The vertex displacement is superimposed on the initial regular grid to obtain the target local homography mesh: $G = G_0 + \Delta G$; For the $k$-th mesh cell, based on the coordinates of its four corner points before and after deformation, a direct linear transformation algorithm is used to calculate the corresponding local homography matrix $H_k$. The set of local homography matrices for the entire image is thus obtained: $\mathcal{H} = \{H_k\}_{k=1}^{K}$, where $K$ is the number of mesh cells; Given pixel coordinates $p$ inside the $k$-th grid cell, the transformed coordinates $p'$ satisfy: $\tilde{p}' \sim H_k \tilde{p}$; where $\tilde{p}$ and $\tilde{p}'$ are the homogeneous coordinates of $p$ and $p'$, and $\sim$ indicates equality in the sense of homogeneous coordinates; Based on the set of local homography matrices $\mathcal{H}$, a spatial deformation field is constructed from the reference view image to the target view image, and differentiable deformation operations are used to transform the reference view image and its corresponding lane line semantic mask, respectively: $I^{ref}_{w} = \mathcal{W}(I^{ref}; \mathcal{H})$, $M^{ref}_{w} = \mathcal{W}(M^{ref}; \mathcal{H})$; where $\mathcal{W}(\cdot)$ represents a differentiable deformation operation based on the set of local homography matrices; Spatial registration of the dual-view images is achieved through this differentiable deformation operation.

7. The method according to claim 5, characterized in that, Step 3 includes: The frame-by-frame local homography grid sequence is converted into a spatiotemporal feature sequence, and a three-dimensional position encoding is introduced to represent adjacency in the temporal and spatial dimensions. The spatiotemporal feature sequence with the three-dimensional position encoding is input into the spatiotemporal attention smoothing module, and the frame-by-frame local homography grid sequence is smoothed in a globally consistent manner through the self-attention mechanism to obtain the smoothed grid sequence. Step 31: Let the length of the video sequence be $T$. The frame-by-frame local homography mesh sequence is represented as: $\{G_t\}_{t=1}^{T}$, $G_t \in \mathbb{R}^{N \times 2}$; where $N$ is the number of grid vertices, and each grid vertex contains a two-dimensional coordinate offset; Step 32: Perform vectorized expansion on the frame-by-frame local homography mesh to obtain: $g_t = \mathrm{vec}(G_t)$; where $g_t \in \mathbb{R}^{2N}$; Then project the vectorized result into a high-dimensional feature space through a linear mapping: $z_t = W_e g_t + b_e$, $z_t \in \mathbb{R}^{d}$; where $d$ is the embedding dimension, and $W_e$ and $b_e$ are learnable parameters; Step 33: Concatenate the embedded features from all time steps to form the input spatiotemporal feature sequence: $Z = [z_1; z_2; \dots; z_T]$; The spatiotemporal feature sequence is input into the spatiotemporal attention smoothing module to perform global temporal modeling on the frame-by-frame local homography grid sequence; The spatiotemporal attention smoothing module employs three-dimensional positional encoding and a spatiotemporal multi-head self-attention mechanism. To characterize the adjacency relationships of the spatiotemporal feature sequence in the temporal and spatial dimensions, a three-dimensional position code $P(t, h, w)$ is constructed for the time index $t$ and the spatial position $(h, w)$ of the grid vertices; The three-dimensional position encoding is obtained by superimposing the time-dimensional, spatial height-dimensional, and spatial width-dimensional position encodings: $P(t, h, w) = P_{time}(t) + P_{height}(h) + P_{width}(w)$; The spatial height-dimensional position encoding $P_{height}$ and the spatial width-dimensional position encoding $P_{width}$ adopt the same construction form as the time-dimensional position encoding $P_{time}$; Adding the three-dimensional position code to the corresponding embedding features yields: $\tilde{Z} = Z + P$; The features $\tilde{Z}$ with position encoding are input into the Transformer encoder, which computes the query matrix, key matrix, and value matrix at each layer: $Q = \tilde{Z} W_Q$, $K = \tilde{Z} W_K$, $V = \tilde{Z} W_V$; where $W_Q$, $W_K$, and $W_V$ are learnable parameters; The attention weights are calculated as: $A = \mathrm{softmax}(Q K^{\top} / \sqrt{d_k})$, where $d_k$ is the key dimension; the corresponding attention output is: $\mathrm{Attn}(Q, K, V) = A V$; A multi-head mechanism models the spatiotemporal dependencies in different subspaces in parallel, and its output is: $\mathrm{MultiHead}(\tilde{Z}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W_O$; where $W_O$ is the output projection matrix; The spatiotemporal multi-head self-attention mechanism establishes a global dependency between all video frames and all grid vertices.

8. The method according to claim 7, characterized in that, the method includes: The spatiotemporal attention smoothing module uses residual regression to predict the smoothed local homography grid and is trained using temporal consistency constraints. After passing through the Transformer encoder, the spatiotemporal feature representation $Y$ is obtained, with $y_t$ denoting its component at time step $t$; A regression head $\mathrm{MLP}$ predicts the smoothing offset at each time step: $\Delta \hat{G}_t = \mathrm{MLP}(y_t)$; where $\mathrm{MLP}$ is a regression head composed of a multilayer perceptron; The smoothing offset is added to the original local homography mesh $G_t$ to obtain the smoothed local homography mesh: $\hat{G}_t = G_t + \Delta \hat{G}_t$; To constrain the temporal continuity of the smoothed grid sequence, a temporal consistency loss $\mathcal{L}_{temp}$ is introduced; To keep the smoothing result from deviating too far from the original motion trend, a fidelity loss $\mathcal{L}_{fid}$ is introduced; The total loss function of the spatiotemporal attention smoothing module is: $\mathcal{L}_{smooth} = \lambda_{temp} \mathcal{L}_{temp} + \lambda_{fid} \mathcal{L}_{fid}$; where $\lambda_{temp}$ and $\lambda_{fid}$ are the weighting coefficients for the temporal consistency loss and the fidelity loss, respectively; the total loss function jointly constrains the temporal continuity and motion fidelity of the smoothed local homography grid sequence.

9. The method according to claim 7, characterized in that Step 4 includes: based on the smoothed grid sequence, constructing a corresponding set of smooth local homography matrices and performing a differentiable deformation operation on the dual-channel original video frames; projecting the target view image and the deformed reference view image onto a unified coordinate system and performing weighted fusion in the overlapping region; and retaining the corresponding image content in the non-overlapping regions, so as to generate an initial stitched panoramic video frame with irregular boundaries.
Step 41: Image deformation. For the dual-channel original video frames at any given time, the corresponding set of smooth local homography matrices is constructed from the smoothed local homography grid sequence, where each matrix in the set is the local homography matrix corresponding to one grid cell at that time. Based on this set, a differentiable deformation operation is performed on the reference view image of the original video frame at that time to obtain the deformed reference view image. The target view image and the deformed reference view image are projected onto a unified coordinate system, yielding aligned image pairs for subsequent stitching and fusion.
Step 42: Image fusion. In the unified coordinate system, the effective areas of the target view image and of the deformed reference view image at a given time are determined; the overlapping region is the intersection of the two effective areas, and the non-overlapping regions are the parts of each effective area outside that intersection. Within the overlapping region, weighted fusion is performed on the target view image and the deformed reference view image using a fusion weight defined at each position; within the non-overlapping regions, the original image content of the corresponding view is preserved; the initial stitched panoramic video frame at that time is thereby generated.
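
The fusion rule of Step 42 can be sketched on small grayscale arrays. This is an illustration under stated assumptions: both images are already warped into the unified coordinate system, the effective areas are given as binary masks, and the per-pixel weight is applied to the target view with its complement applied to the reference view. The function name `fuse` and all pixel values are hypothetical.

```python
def fuse(target, warped, mask_t, mask_r, weight):
    """Blend in the overlap, copy the single valid view elsewhere.

    Pixels covered by neither view stay None, which is what gives the
    initial stitched frame its irregular boundary.
    """
    h, w = len(target), len(target[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            in_t, in_r = mask_t[y][x], mask_r[y][x]
            if in_t and in_r:                      # overlapping region
                a = weight[y][x]
                out[y][x] = a * target[y][x] + (1 - a) * warped[y][x]
            elif in_t:                             # target view only
                out[y][x] = target[y][x]
            elif in_r:                             # reference view only
                out[y][x] = warped[y][x]
    return out

target = [[10, 10, 10, 0],
          [10, 0, 0, 0]]
mask_t = [[1, 1, 1, 0],
          [1, 0, 0, 0]]
warped = [[0, 20, 20, 20],
          [0, 0, 20, 20]]
mask_r = [[0, 1, 1, 1],
          [0, 0, 1, 1]]
# Weight of the target view; it falls off toward the reference side.
weight = [[1.0, 0.75, 0.25, 0.0],
          [1.0, 0.75, 0.25, 0.0]]

fused = fuse(target, warped, mask_t, mask_r, weight)
```

The second row contains one pixel that neither view covers, so it remains empty; such gaps are the irregular boundary that Step 5 later regularizes.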

10. The method according to claim 8, characterized in that Step 5 includes: for the irregular boundaries of the initial stitched panoramic video frame, classifying the boundary regions based on semantic information; correcting the regions containing key traffic targets with a geometrically constrained mesh stretching method; and repairing the background regions with a generative completion method, so as to obtain and output regular rectangular panoramic video frames.
Step 51: Take the panoramic image obtained after spatial alignment, spatiotemporal smoothing, and stitching, together with its corresponding semantic mask; the goal is to map the panoramic image onto a regular rectangular canvas. The effective region is the part of the canvas covered by the panoramic image, and the missing region is the part that needs to be filled.
Step 52: Based on the semantic mask, perform semantic classification on the boundary region and construct a semantic decision function that distinguishes a risk region, which contains key traffic targets including vehicles, pedestrians, or traffic signs, from a safe background region, which does not contain these key traffic targets. When the decision function indicates a risk region, a geometrically constrained mesh stretching strategy is applied to that region; when it indicates a safe background region, a generative repair strategy is applied. The semantic decision function thus enables adaptive selection between the geometric deformation path and the generative repair path.
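
The routing logic of Steps 51 and 52 can be sketched as below. The sketch makes several assumptions not stated in the text: the decision function returns 1 for a risk region and 0 for a safe background region, each boundary region is summarized by the list of semantic labels found in it, and the label strings, region names, and function names are hypothetical.

```python
# Assumed label names for the key traffic targets listed in the claim.
KEY_CLASSES = {"vehicle", "pedestrian", "traffic_sign"}

def missing_region(valid_mask):
    """Missing region = rectangular canvas minus the effective (valid) region."""
    return [[not bool(v) for v in row] for row in valid_mask]

def decide(region_labels):
    """Semantic decision function: 1 = risk region, 0 = safe background (assumed encoding)."""
    return 1 if any(label in KEY_CLASSES for label in region_labels) else 0

def choose_strategy(region_labels):
    """Route a boundary region to geometric stretching or generative repair."""
    return "mesh_stretch" if decide(region_labels) == 1 else "generative_inpaint"

# Effective region of a tiny 2x4 canvas (1 = covered by the stitched panorama).
valid = [[0, 1, 1, 0],
         [1, 1, 1, 1]]
holes = missing_region(valid)

# Hypothetical boundary regions with the semantic labels found inside them.
boundary_regions = {
    "top_left": ["sky", "road"],
    "top_right": ["road", "vehicle"],
}
plan = {name: choose_strategy(labels) for name, labels in boundary_regions.items()}
```

Routing regions that contain key targets to geometric stretching keeps those pixels derived from real image content, while synthesized content is confined to background, which matches the stated aim of preserving the authenticity of key targets.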