A method and system for multi-camera rectification and multi-scale seamless boundary stitching

By fitting a nonlinear geometric mapping function and aligning multi-scale features in a multi-camera monitoring system, the problem of inconsistent video images in the system is solved, enabling real-time panoramic stitching and a unified spatiotemporal base, thus improving the efficiency and availability of the security system.

CN122023587B · Active · Publication Date: 2026-06-23 · UNIV OF SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: UNIV OF SCI & TECH OF CHINA
Filing Date: 2026-04-10
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, the video footage of multi-camera monitoring systems lacks a unified spatial reference and a continuous global perspective, which leads to cross-camera target localization and event tracing relying on human experience, making it difficult to achieve real-time fusion with geometric consistency, brightness continuity, and temporal stability.

Method used

By acquiring the set of control-point pairs between the video frames of each camera and the reference top-view base map, a nonlinear geometric mapping function is fitted, and the video frames are resampled and rectified in real time into a unified top-view base map coordinate system. Multi-scale feature alignment and frequency band fusion are then performed in the overlapping areas to generate geometrically consistent and brightness-continuous panoramic stitched video frames.

Benefits of technology

It enables real-time panoramic output of multi-camera video in a unified top-down coordinate system, providing a unified spatiotemporal base and immersive interactive capabilities at the park level, thereby improving the efficiency of security patrol and emergency command.


Abstract

The application relates to the technical field of image processing and discloses a multi-camera rectification and boundary multi-scale seamless stitching method and system. The method comprises: acquiring a set of point pairs between the video frames of each deployed camera and a reference top-view base map; obtaining, by fitting a basis function model, a nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system; using the geometric mapping function to resample and rectify the video frames of each camera in real time into the reference top-view base map coordinate system; performing multi-scale feature alignment on the resulting top-view rectified frames in the overlapping areas to estimate a local refinement transformation; and, after one of each two adjacent top-view rectified frames has undergone the local refinement transformation, performing multi-scale frequency band fusion to generate a panoramic stitched video frame. The application thus provides a multi-camera rectification and boundary multi-scale seamless stitching method based on sparse point constraints, with which a real-time panoramic "one map" view can be realized.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, specifically to a method and system for multi-camera correction and seamless stitching of boundaries at multiple scales.

Background Technology

[0002] With the large-scale deployment of security systems, multi-source surveillance video has become a high-density spatiotemporal data stream. However, in reality, the images from each camera are in their own independent pixel coordinate system, lacking a unified spatial reference and a continuous global perspective. This makes cross-camera target localization, event tracing, and coordinated response highly dependent on human experience and memory of locations, making it difficult to form an intuitive and holistic understanding of the situation.

[0003] The core problem addressed in this application is: given a reference top-view base map (or orthophoto) and multiple wide-angle videos, to learn a dense geometric mapping from the pixel coordinates of each video to a unified top-view coordinate system, and to achieve real-time fusion output of geometrically consistent, brightness-continuous, and temporally stable multi-video feeds under the unified coordinate system. This problem faces three key challenges. First, wide-angle distortion and viewing angle differences lead to significant nonlinearity in the mapping, making it difficult for traditional homography / calibration models to balance accuracy and generalization in complex scenes. Second, supervision information typically comes from a small number of sparse corresponding points and is noisy, so achieving robust parameter estimation that generalizes to a full-frame dense mapping is crucial to the stitching accuracy. Third, multiple videos exhibit exposure differences, dynamic target occlusion, and boundary artifacts in overlapping areas, and achieving seamless multi-scale fusion while suppressing temporal flicker under real-time constraints directly affects the usability of the "what you see is what you get" output.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides a method and system for multi-camera correction and seamless multi-scale stitching of boundaries.

[0005] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:

[0006] In a first aspect, the present invention provides a method for multi-camera correction and seamless multi-scale stitching of boundaries, comprising:

[0007] Obtain the set of control-point pairs between the video frames of each deployed camera and the reference top-view base map;

[0008] Based on the set of control-point pairs, a nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system is obtained by fitting a basis function model.

[0009] Using a geometric mapping function, the video frames of each camera are resampled and corrected in real time to the reference top-view base map coordinate system to obtain the top-view corrected frames of each camera.

[0010] For the top-view corrected frames of the cameras, multi-scale feature alignment is performed in the overlapping area to estimate a local refinement transformation. After one of each two adjacent top-view corrected frames has undergone the local refinement transformation, multi-scale frequency band fusion is performed to generate a geometrically consistent and brightness-continuous panoramic stitched video frame.

[0011] In one embodiment, acquiring the set of control-point pairs between the video frames of each deployed camera and the reference top-view base map specifically includes:

[0012] $N$ cameras are deployed, and the image frame of the $i$-th camera at time $t$ is $I_i^t$. Assume there exists a reference top-view base map $B$, whose coordinate system is denoted as the reference top-view base map coordinate system $\Omega_B$. A camera pixel in the camera pixel coordinate system $\Omega_i$ is $\mathbf{u} = (u, v)^\top$, where $u$ and $v$ are the horizontal and vertical coordinates of $\mathbf{u}$. A reference point in the reference top-view base map coordinate system $\Omega_B$ is $\mathbf{p} = (x, y)^\top$, where $x$ and $y$ are the horizontal and vertical coordinates of $\mathbf{p}$;

[0013] The set of control-point pairs of the $i$-th camera is $\mathcal{C}_i = \{(\mathbf{u}_k, \mathbf{p}_k, w_k)\}_{k=1}^{K_i}$, where $k$ is the index of the $k$-th control-point pair and $K_i$ is the total number of control-point pairs of the $i$-th camera;

[0014] where $\mathbf{u}_k = (u_k, v_k)^\top$ is the camera pixel in the $k$-th control-point pair, $\mathbf{p}_k = (x_k, y_k)^\top$ is the reference point on the top-view base map in the $k$-th control-point pair, $w_k$ is the confidence weight of the $k$-th control-point pair, and $\top$ denotes transpose; $u_k, v_k$ are the horizontal and vertical coordinates of $\mathbf{u}_k$, and $x_k, y_k$ are the horizontal and vertical coordinates of $\mathbf{p}_k$.

[0015] In one embodiment, obtaining, by fitting a basis function model based on the set of control-point pairs, the nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system specifically includes:

[0016] Normalize the camera pixel coordinates and the coordinates of the reference top-view map;

[0017] The geometric mapping function is represented by a basis function vector containing polynomial terms;

[0018] The parameters of the basis function vector are solved by minimizing the objective function that combines a robust loss function and a regularization term.

[0019] In one embodiment, representing the geometric mapping function using a basis function vector containing polynomial terms specifically includes:

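[0020] The basis function vector $\boldsymbol{\phi}(\tilde{u}, \tilde{v})$ includes polynomial terms up to a specified order $d$:

[0021] $\boldsymbol{\phi}(\tilde{u}, \tilde{v}) = [1,\ \tilde{u},\ \tilde{v},\ \tilde{u}^2,\ \tilde{u}\tilde{v},\ \tilde{v}^2,\ \ldots,\ \tilde{v}^d]^\top$;

[0022] where $\tilde{u}$ and $\tilde{v}$ are the horizontal and vertical coordinates of the camera pixel after normalization.

[0023] The geometric mapping function is written as:

[0024] $x = \mathbf{a}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v})$;

[0025] $y = \mathbf{b}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v})$;

[0026] where $\mathbf{a}_i$ and $\mathbf{b}_i$ are the parameters (coefficient vectors) of the basis function vector to be estimated for the $i$-th camera.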

[0027] In one embodiment, solving for the parameters of the basis function vector by minimizing the objective function that combines a robust loss function and a regularization term specifically includes:

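[0028] The objective function is:

[0029] $\min_{\mathbf{a}_i, \mathbf{b}_i} \sum_{k=1}^{K_i} w_k\, \rho\big(\lVert f_i(\tilde{u}_k, \tilde{v}_k) - \mathbf{p}_k \rVert_2\big) + \lambda \big(\lVert \mathbf{a}_i \rVert_2^2 + \lVert \mathbf{b}_i \rVert_2^2\big)$;

[0030] where $\rho(\cdot)$ is the robust loss function, $\lambda$ is the regularization coefficient, $\lVert \cdot \rVert_2$ is the $\ell_2$ norm, $f_i(\tilde{u}, \tilde{v}) = (\mathbf{a}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v}),\ \mathbf{b}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v}))^\top$ is the geometric mapping function of the $i$-th camera, and $(\tilde{u}_k, \tilde{v}_k)$ are the normalized horizontal and vertical coordinates of the camera pixel in the $k$-th control-point pair;

[0031] The objective function is solved by the iteratively reweighted least squares method. The order of the polynomial terms of the basis function vector is selected adaptively, based on the number and distribution of the control-point pairs, through a model order adaptation mechanism.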

[0032] In one embodiment, the step of using a geometric mapping function to resample and correct the video frames of each camera to a reference top-view base map coordinate system in real time, thereby obtaining the top-view corrected frames of each camera, specifically includes:

[0033] Based on the geometric mapping function or the inverse transformation of the geometric mapping function, a dense mapping lookup table from the reference top-view base map coordinates to the camera pixel coordinates is pre-calculated. The top-view corrected frame is generated by querying the dense mapping lookup table and performing interpolation operations.

[0034] In one embodiment, pre-computing a dense mapping lookup table from the reference top-view base map coordinates to the camera pixel coordinates based on the geometric mapping function or its inverse transformation, and generating the top-view corrected frame by querying the dense mapping lookup table and performing an interpolation operation, specifically includes:

[0035] In the reference top-view base map coordinate system $\Omega_B$, define a discrete grid $\mathcal{G}$, and define the inverse sampling function $g_i$: $\mathbf{u} = g_i(\mathbf{p}) = f_i^{-1}(\mathbf{p})$; where $\mathbf{p} = (x, y)^\top$ is a reference point in $\Omega_B$ with horizontal and vertical coordinates $x$ and $y$, $\mathbf{u}$ is a pixel in the camera pixel coordinate system $\Omega_i$ of the $i$-th camera, and $f_i$ is the geometric mapping function corresponding to the $i$-th camera;

[0036] Construct a dense deformation field for each camera:

[0037] $\mathcal{M}_i(x, y) = \big(u_i(x, y),\ v_i(x, y)\big) = g_i(x, y), \quad (x, y) \in \mathcal{G}$,

[0038] where $u_i(x, y)$ is the continuous sampling abscissa in the image frame $I_i^t$ of the $i$-th camera at time $t$, $v_i(x, y)$ is the continuous sampling ordinate in $I_i^t$, and $\mathcal{M}_i$ is the dense mapping lookup table;

[0039] The top-view corrected frame $\hat{I}_i^t$ is:

[0040] $\hat{I}_i^t(x, y) = \mathcal{S}\big(I_i^t,\ \mathcal{M}_i(x, y)\big)$, where $\mathcal{S}$ denotes resampling of the image at continuous coordinates by interpolation.

[0041] In one embodiment, the top-view corrected frames from each camera are subjected to multi-scale feature alignment in overlapping regions to estimate local refinement transformation, specifically including:

[0042] Construct a multi-scale feature pyramid for each top-view corrected frame;

[0043] The parameters of the local refinement transformation are sequentially optimized at multiple scales from coarse to fine in the feature pyramid; wherein the local refinement transformation is a translation model, an affine model, a homography model, or a block local model.
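
A minimal sketch of coarse-to-fine estimation for the simplest case named above, the translation model, might look as follows. It is an illustration under stated assumptions, not the patent's implementation: the inputs are grayscale overlap crops of equal size, 2x2 block averaging stands in for a proper Gaussian or feature pyramid, candidates are scored by mean squared intensity difference, and shifts are treated as circular via `np.roll`. The names `downsample`, `build_pyramid`, `estimate_translation`, `levels` and `radius` are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 block averaging (crude pyramid stand-in)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img, levels):
    """Return [finest, ..., coarsest] images."""
    pyr = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def estimate_translation(fixed, moving, levels=3, radius=2):
    """Find integer (ty, tx) such that np.roll(fixed, (ty, tx)) matches moving.

    The search starts at the coarsest level; each finer level doubles the
    current estimate and refines it within +/- radius pixels.
    """
    pf, pm = build_pyramid(fixed, levels), build_pyramid(moving, levels)
    ty, tx = 0, 0
    for lvl in range(levels - 1, -1, -1):
        if lvl != levels - 1:
            ty, tx = 2 * ty, 2 * tx  # propagate estimate to the finer level
        best = None
        for dy in range(ty - radius, ty + radius + 1):
            for dx in range(tx - radius, tx + radius + 1):
                shifted = np.roll(pf[lvl], (dy, dx), axis=(0, 1))
                err = np.mean((shifted - pm[lvl]) ** 2)
                if best is None or err < best[0]:
                    best = (err, dy, dx)
        _, ty, tx = best
    return ty, tx
```

With three levels and a radius of 2, the search covers shifts of roughly ±14 pixels at full resolution while evaluating only 25 candidates per level. Affine, homography, or block-local models would replace the exhaustive search with a parametric optimizer at each level.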

[0044] In one embodiment, the step of performing multi-scale frequency band fusion, after one of each two adjacent top-view corrected frames has undergone the local refinement transformation, to generate a geometrically consistent and brightness-continuous panoramic stitched video frame specifically includes:

[0045] The two adjacent top-view corrected frames are $\hat{I}_a$ and $\hat{I}_b$. Applying the local refinement transformation to the top-view corrected frame $\hat{I}_b$ yields the refined top-view corrected frame $\hat{I}_b'$. A weight map $W$ is constructed;

[0046] Laplacian pyramids $\{L_a^l\}$ and $\{L_b^l\}$ are constructed for $\hat{I}_a$ and $\hat{I}_b'$ respectively, and a Gaussian pyramid $\{G_W^l\}$ is constructed for the weight map; $l$ is the pyramid layer index, $L_a^l$ is the $l$-th layer image of the Laplacian pyramid of $\hat{I}_a$, $L_b^l$ is the $l$-th layer image of the Laplacian pyramid of $\hat{I}_b'$, and $G_W^l$ is the $l$-th layer image of the Gaussian pyramid of the weight map;

[0047] Weighted fusion is performed at each pyramid layer: $L_F^l = G_W^l \odot L_a^l + (1 - G_W^l) \odot L_b^l$; where $L_F^l$ is the $l$-th layer image of the fused Laplacian pyramid and $\odot$ denotes element-wise multiplication;

[0048] The fused Laplacian pyramid is reconstructed to obtain the final panoramic stitched video frame.
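
The per-layer weighted fusion and reconstruction described above can be sketched as follows. This is a simplified illustration, assuming grayscale frames whose sides are divisible by $2^{\text{levels}-1}$; 2x2 block averaging and nearest-neighbour upsampling stand in for the Gaussian filtering a production pyramid would use, and all function names (`down`, `up`, `gaussian_pyramid`, `laplacian_pyramid`, `reconstruct`, `blend`) are hypothetical.

```python
import numpy as np

def down(img):
    """Halve resolution by 2x2 block averaging (stand-in for blur + decimate)."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def up(img):
    """Double resolution by nearest-neighbour replication."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def gaussian_pyramid(img, levels):
    pyr = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    """Band-pass layers plus the coarsest low-pass residual as the last layer."""
    g = gaussian_pyramid(img, levels)
    lap = [g[l] - up(g[l + 1]) for l in range(levels - 1)]
    lap.append(g[-1])
    return lap

def reconstruct(lap):
    """Collapse a Laplacian pyramid from coarsest to finest."""
    img = lap[-1]
    for band in reversed(lap[:-1]):
        img = band + up(img)
    return img

def blend(img_a, img_b, weight, levels=4):
    """Fuse two aligned frames; weight is 1 where img_a should dominate."""
    la = laplacian_pyramid(img_a, levels)
    lb = laplacian_pyramid(img_b, levels)
    gw = gaussian_pyramid(weight, levels)
    fused = [gw[l] * la[l] + (1.0 - gw[l]) * lb[l] for l in range(levels)]
    return reconstruct(fused)
```

Because the weights at every layer sum to one, blending a frame with itself returns that frame unchanged, which is a convenient sanity check. Low-frequency layers are mixed over a wide transition zone and high-frequency layers over a narrow one, which is what hides exposure differences without blurring detail.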

[0049] In a second aspect, the present invention provides a computer system including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method of any embodiment of the first aspect.

[0050] Compared with the prior art, the beneficial technical effects of the present invention are:

[0051] The integrated algorithm framework for multi-camera video twins constructed in this invention takes "robust geometric mapping estimation, dense resampling and multi-scale boundary fusion" as its core. It proposes a multi-camera correction and seamless multi-scale boundary stitching method based on sparse control-point constraints, which can realize a real-time panoramic image at the park level, providing a unified spatiotemporal foundation and immersive interactive capabilities for security patrol, emergency command and process management.

Attached Figure Description

[0052] Figure 1 is a flowchart of the method in an embodiment of the present invention.

[0053] Figure 2 is a schematic diagram of basis function geometric mapping model fitting and inverse resampling in an embodiment of the present invention.

[0054] Figure 3 is a schematic diagram illustrating the construction of a multi-scale feature pyramid and coarse-to-fine registration in an embodiment of the present invention.

[0055] Figure 4 is a schematic diagram of the Laplacian pyramid multi-band fusion process in an embodiment of the present invention.

Detailed Implementation

[0056] A preferred embodiment of the present invention will now be described in detail with reference to the accompanying drawings.

[0057] As shown in Figure 1, the multi-camera correction and seamless multi-scale boundary stitching method of this invention includes the following steps:

[0058] S1, obtain the set of control-point pairs between the video frames of each deployed camera and the reference top-view base map;

[0059] S2, based on the set of control-point pairs, obtain a nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system through basis function model fitting;

[0060] S3, using the geometric mapping function, the video frames of each camera are resampled and corrected to the reference top-view base map coordinate system in real time to obtain the top-view corrected frames of each camera;

[0061] S4, perform multi-scale feature alignment on the top-view corrected frames of the cameras in the overlapping area to estimate a local refinement transformation; after one of each two adjacent top-view corrected frames has undergone the local refinement transformation, perform multi-scale frequency band fusion to generate a geometrically consistent and brightness-continuous panoramic stitched video frame.

[0062] This invention addresses the problems of inconsistent spatial coordinate systems, significant image distortion, and brightness discontinuities and stitching seams in multi-camera wide-angle surveillance videos. It proposes a method for top-view correction of multi-camera wide-angle videos based on sparse control-point constraints and seamless multi-scale stitching of boundary strips, enabling real-time panoramic output of multiple videos in a unified top-view coordinate system.

[0063] 1. Problem definition and symbol conventions.

[0064] The application scenario of this embodiment is a park, in which $N$ cameras are deployed. The image frame of the $i$-th camera at time $t$ is $I_i^t \in \mathbb{R}^{H_i \times W_i \times 3}$, where $H_i$ and $W_i$ are the height and width of $I_i^t$ and $\mathbb{R}$ denotes the set of real numbers. Assume there exists a reference top-view base map $B$ (or an orthophoto / map projection), whose coordinate system is denoted as the reference top-view base map coordinate system $\Omega_B$. A camera pixel in the camera pixel coordinate system $\Omega_i$ is $\mathbf{u} = (u, v)^\top$, and a reference point in the reference top-view base map coordinate system $\Omega_B$ is $\mathbf{p} = (x, y)^\top$.

[0065] The goal is to estimate, for each camera, a nonlinear geometric mapping function $f_i$ from the camera pixel domain to the reference plane:

[0066] $f_i: \Omega_i \to \Omega_B, \quad \mathbf{p} = f_i(\mathbf{u})$;

[0067] Based on this geometric mapping function, the video frames are resampled to the reference plane to obtain the top-view corrected frames:

[0068] $\hat{I}_i^t(\mathbf{p}) = \mathcal{S}\big(I_i^t,\ f_i^{-1}(\mathbf{p})\big)$.

[0069] Here $\mathcal{S}$ denotes the image resampling operation; in a preferred embodiment, bilinear or bicubic interpolation can be used. However, real-world scenes exhibit wide-angle distortion, pitch differences, installation deviations, and non-planar terrain, so $f_i$ usually cannot be described stably by a single homography matrix.

[0070] This invention uses sparse control-point pairs as weakly supervised constraints for estimating $f_i$. For the $i$-th camera, a set of control-point pairs is given:

[0071] $\mathcal{C}_i = \{(\mathbf{u}_k, \mathbf{p}_k, w_k)\}_{k=1}^{K_i}$;

[0072] where $\mathbf{u}_k = (u_k, v_k)^\top$ is the camera pixel in the $k$-th control-point pair, $\mathbf{p}_k = (x_k, y_k)^\top$ is the reference point on the top-view base map in the $k$-th control-point pair, $w_k$ is the confidence weight (used to suppress mismatches and unstable points), and $\top$ denotes transpose; $u_k, v_k$ are the horizontal and vertical coordinates of $\mathbf{u}_k$, and $x_k, y_k$ are the horizontal and vertical coordinates of $\mathbf{p}_k$.

[0073] Therefore, the core algorithmic problem of this invention can be formalized as: robustly estimate $f_i$ under the constraints of the sparse control-point pairs $\mathcal{C}_i$, such that the reprojection error at the control points is minimal, the estimate is insensitive to outliers and noise, and the mapping can be evaluated efficiently online to meet the requirements of real-time resampling.

[0074] 2. Nonlinear geometric correction model based on sparse control-point pair constraints.

[0075] (1) Control-point pairs and weight construction:

[0076] For each camera, $K_i$ control-point pairs are constructed from $I_i^t$ and $B$. The control points should cover the entire field of view (center and edges) to constrain the global shape of the distortion. To improve robustness, this invention introduces confidence weights $w_k$. The weight can be determined in any of the following ways: structural points such as corners or intersections are given priority and a higher weight; feature matching confidence (such as the nearest neighbor ratio or matching consistency) is mapped to the weight; and the weight of points in areas with repeated or low texture is reduced.

[0077] (2) Coordinate normalization:

[0078] Higher-order basis functions can easily lead to numerical instability; therefore, coordinate normalization is performed, scaling the camera pixel coordinates to a normalized range such as $[-1, 1]$:

[0079] $\tilde{u} = \dfrac{2u}{W_i - 1} - 1$;

[0080] $\tilde{v} = \dfrac{2v}{H_i - 1} - 1$;

[0081] where $\tilde{u}$ and $\tilde{v}$ are the normalized horizontal and vertical coordinates of the camera pixel; the coordinates of the reference top-view base map can be normalized in the same way, or scaled proportionally, to improve fit stability and transferability. Higher-order polynomial terms (such as $u^d$) are highly sensitive to numerical scale. When camera pixel coordinates are on the order of $10^3$, the magnitude of a term of order $d$ is on the order of $10^{3d}$, which can easily lead to: ill-conditioned linear equations (high condition number), making the solution abnormally sensitive to noise; unstable gradient or incremental updates, causing iterative optimization to diverge easily; and higher-order terms dominating the fit, so that lower-order geometric structure is swamped. After the coordinates are normalized to a bounded range, the magnitude of each order term is compressed to a controllable range, making the optimization more stable and the parameters more interpretable, and making it easier to transfer the model across resolutions.
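
The effect of normalization on conditioning can be checked numerically. The sketch below is illustrative only: it assumes the target range $[-1, 1]$, a 1920x1080 camera, 60 random points, and a monomial design matrix of order 5; `normalize_pixels` and `poly_basis` are hypothetical helper names.

```python
import numpy as np

def normalize_pixels(u, v, width, height):
    """Map pixel coordinates from [0, W-1] x [0, H-1] to [-1, 1] x [-1, 1].

    The [-1, 1] target range is an assumption made for this example.
    """
    u_n = 2.0 * np.asarray(u, dtype=np.float64) / (width - 1) - 1.0
    v_n = 2.0 * np.asarray(v, dtype=np.float64) / (height - 1) - 1.0
    return u_n, v_n

def poly_basis(u, v, order):
    """Design matrix whose columns are all monomials u^a * v^b with a + b <= order."""
    cols = [u ** (s - a) * v ** a for s in range(order + 1) for a in range(s + 1)]
    return np.stack(cols, axis=1)

# Compare the condition number of the design matrix before and after normalization.
rng = np.random.default_rng(0)
W, H = 1920, 1080
u = rng.uniform(0, W - 1, 60)
v = rng.uniform(0, H - 1, 60)
u_n, v_n = normalize_pixels(u, v, W, H)
cond_raw = np.linalg.cond(poly_basis(u, v, 5))
cond_norm = np.linalg.cond(poly_basis(u_n, v_n, 5))
print(f"raw: {cond_raw:.3e}  normalized: {cond_norm:.3e}")
```

The raw design matrix mixes a column of ones with columns on the order of $10^{16}$, so its condition number is far beyond what double precision can resolve, while the normalized matrix stays well within a usable range.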

[0082] (3) Unified basis function mapping model:

[0083] This invention uses a unified basis function expansion to represent the geometric mapping function. Let the basis function vector $\boldsymbol{\phi}(\tilde{u}, \tilde{v})$ include polynomial terms (including cross terms) up to a specified order $d$, for example:

[0084] $\boldsymbol{\phi}(\tilde{u}, \tilde{v}) = [1,\ \tilde{u},\ \tilde{v},\ \tilde{u}^2,\ \tilde{u}\tilde{v},\ \tilde{v}^2,\ \ldots]^\top$;

[0085] The geometric mapping function is written as:

[0086] $x = \mathbf{a}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v})$;

[0087] $y = \mathbf{b}_i^\top \boldsymbol{\phi}(\tilde{u}, \tilde{v})$;

[0088] where $\mathbf{a}_i$ and $\mathbf{b}_i$ are the parameters (coefficient vectors) of the basis function vector to be estimated.

[0089] The order $d$ provides adjustable complexity: a larger $d$ gives stronger fitting ability but makes overfitting more likely, so it needs to be used together with robust regularization. The order $d$ determines the number of parameters and hence the model capacity. Low orders are used for scenes with slight distortion and few points, while high orders are used for scenes with severe edge distortion, thus covering various camera angles and mounting conditions. Whether second-order, third-order, or higher-order terms are chosen, they are different instances of the same framework, which allows the same robust optimizer to be used. Lower-order terms primarily characterize global translation, scaling, and non-orthogonal shearing, while higher-order terms characterize edge curvature and local nonlinear distortion.

[0090] (4) Robust regularization objective function:

[0091] Considering the presence of noise and outliers in the control points, this invention estimates the parameters with an objective function of the form "weight + robust loss + regularization term":

[0092] $\min_{\mathbf{a}_i, \mathbf{b}_i} \sum_{k=1}^{K_i} w_k\, \rho\big(\lVert f_i(\tilde{u}_k, \tilde{v}_k) - \mathbf{p}_k \rVert_2\big) + \lambda \big(\lVert \mathbf{a}_i \rVert_2^2 + \lVert \mathbf{b}_i \rVert_2^2\big)$;

[0093] where $\rho(\cdot)$ is the robust loss function (for example the Huber, Cauchy, or Tukey loss), used to reduce the impact of outliers, and $\lambda$ is the regularization coefficient, used to suppress overfitting in higher-order models and improve generalization.

[0094] Ordinary least squares is equivalent to assuming the error follows a Gaussian distribution. However, the actual punctuation error usually exhibits a mixed distribution of small noise and a few large outliers. For this long-tailed distribution, least squares will be dominated by large residuals, causing the overall model to shift. The robust loss function is equivalent to: in the small residual region, it still approximates a quadratic penalty (maintaining high-precision fitting); in the large residual region, the penalty increases more slowly (suppressing the influence of outliers). Theoretically, this is a typical M-estimation approach: replacing the Gaussian assumption with a log-likelihood that better fits the long-tailed noise, thus obtaining a more stable estimate.

[0095] The $\ell_2$ regularization term $\lambda(\lVert \mathbf{a}_i \rVert_2^2 + \lVert \mathbf{b}_i \rVert_2^2)$ is used because high-order models have many degrees of freedom and are prone to oscillatory fitting under sparse points; especially when the point distribution is uneven, the edge regions become excessively curved. This regularization is equivalent to placing a zero-mean Gaussian prior on the parameters, favoring a "smoother and less curved" mapping and thereby improving generalization. From a geometric perspective, regularization suppresses excessively large coefficients of higher-order terms, preventing unreasonable deformation of the mapping in unconstrained regions.

[0096] (5) Solution strategy:

[0097] In this invention, since the control-point correspondences may contain mismatches and long-tailed noise, a robust loss function $\rho(\cdot)$ is employed to suppress the influence of outliers. As a result, the overall objective function no longer has the standard linear least squares form, and solving it directly generally requires nonlinear optimization. To balance robustness and computational stability, this invention employs Iteratively Reweighted Least Squares (IRLS) to solve the robust M-estimation problem. The basic idea is to use an equivalent transformation from robust statistics: by introducing adaptive weights that depend on the residuals, the optimization problem with a robust loss is transformed into a sequence of weighted least squares subproblems with successively updated weights, each of which can be solved efficiently with mature linear algebra methods. The specific process is as follows. First, the parameters are initialized by weighted least squares using the confidence weights $w_k$. Then, in the $n$-th iteration, the residuals $r_k^{(n)} = \lVert f_i(\tilde{u}_k, \tilde{v}_k) - \mathbf{p}_k \rVert_2$ are computed from the current model, and the weights are updated according to the weighted form of the robust loss, $\omega_k^{(n)} \propto w_k\, \rho'(r_k^{(n)}) / r_k^{(n)}$. A weighted least squares problem is then solved again under the updated weights to obtain a new parameter estimate. When the change in the parameters or the decrease in the objective function between adjacent iterations is less than a preset threshold, convergence is declared and the iteration stops. The advantages of IRLS are that, while ensuring robust estimation performance, each step remains a least squares problem that can be solved analytically or stably, the computation is numerically stable and simple to implement, and it is suitable for online or quasi-online engineering deployment in real-time systems.
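
The IRLS procedure can be sketched as follows for the Huber loss with a ridge penalty. This is a minimal illustration rather than the patent's implementation: it assumes normalized pixel and reference coordinates, a Huber threshold `delta` expressed in normalized reference units, and hypothetical names (`poly_basis`, `huber_weights`, `fit_mapping_irls`).

```python
import numpy as np

def poly_basis(u, v, order):
    """Design matrix whose columns are all monomials u^a * v^b with a + b <= order."""
    cols = [u ** (s - a) * v ** a for s in range(order + 1) for a in range(s + 1)]
    return np.stack(cols, axis=1)

def huber_weights(r, delta):
    """IRLS weights for the Huber loss: 1 inside delta, delta / |r| outside."""
    r = np.maximum(np.abs(r), 1e-12)
    return np.where(r <= delta, 1.0, delta / r)

def fit_mapping_irls(uv, xy, conf, order=2, lam=1e-3, delta=0.02,
                     iters=20, tol=1e-8):
    """Robustly fit both output coordinates of the polynomial mapping.

    uv:   (K, 2) normalized camera pixel coordinates
    xy:   (K, 2) reference points on the base map
    conf: (K,)   confidence weights of the control-point pairs
    Returns theta of shape (m, 2); column 0 maps to x, column 1 to y.
    """
    Phi = poly_basis(uv[:, 0], uv[:, 1], order)
    m = Phi.shape[1]
    w = np.asarray(conf, dtype=np.float64).copy()  # first pass: plain weighted LS
    theta = np.zeros((m, 2))
    for _ in range(iters):
        # Weighted ridge least squares subproblem (closed-form normal equations).
        A = Phi.T @ (w[:, None] * Phi) + lam * np.eye(m)
        b = Phi.T @ (w[:, None] * xy)
        new_theta = np.linalg.solve(A, b)
        # Re-weight each pair from its current reprojection residual.
        resid = np.linalg.norm(Phi @ new_theta - xy, axis=1)
        w = conf * huber_weights(resid, delta)
        converged = np.linalg.norm(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

Calling the function with `iters=1` returns the ordinary weighted ridge solution, which is a convenient baseline for judging how much the reweighting reduces the pull of mismatched pairs. Cauchy or Tukey losses would only change `huber_weights`.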

[0098] (6) Model order adaptation:

[0099] To avoid amplifying noise and overfitting by blindly using high-order models when the control points are few or unevenly distributed, this invention introduces a model order selection mechanism that automatically selects the polynomial order with the minimum sufficient complexity. The basic idea is as follows. A mapping model is fitted for each order in the candidate set $\mathcal{D} = \{2, 3, 5, 8\}$ (the low orders $d = 2, 3$ cover light distortion scenes dominated by affine and homography effects, while the high orders $d = 5, 8$ cover severe nonlinear distortion at the edges of wide-angle lenses; intermediate orders are skipped to reduce computational overhead while still covering typical distortion levels), and the corresponding validation error $E(d)$ is computed. The validation error can be obtained by the leave-one-out method, cross-validation, or the control-point reprojection error. The smallest order that satisfies the error threshold condition is then preferred as $d^*$. Alternatively, a complexity penalty term can be introduced, choosing the order that minimizes $E(d) + \gamma\, m(d)$ as $d^*$, where $m(d)$ is the model parameter dimension at order $d$ and $\gamma$ is a tradeoff coefficient. The rationale is that low-order models have limited fitting ability but good stability, while high-order models fit strongly but are susceptible to overfitting and mapping jitter caused by noise and outliers. By selecting the minimum sufficient complexity $d^*$, this invention obtains more stable and more generalizable geometric mappings for the same control-point quality and quantity, while reducing the cost of evaluating higher-order terms in the online stage, thereby further improving real-time processing efficiency.
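
Both selection rules can be sketched as below. The example is a simplification under stated assumptions: validation uses 5-fold cross-validation with a plain ridge fit instead of the robust solver, the threshold and tradeoff coefficient are arbitrary illustrative values, and the names `cv_error`, `select_order`, `num_params`, `threshold` and `gamma` are hypothetical.

```python
import numpy as np

def poly_basis(u, v, order):
    """Design matrix whose columns are all monomials u^a * v^b with a + b <= order."""
    cols = [u ** (s - a) * v ** a for s in range(order + 1) for a in range(s + 1)]
    return np.stack(cols, axis=1)

def num_params(order):
    """Number of monomials of total degree <= order in two variables."""
    return (order + 1) * (order + 2) // 2

def cv_error(uv, xy, order, lam=1e-6, folds=5):
    """Mean held-out reprojection error of a ridge fit, using interleaved folds."""
    idx = np.arange(len(uv))
    errs = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        Phi = poly_basis(uv[train, 0], uv[train, 1], order)
        theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                                Phi.T @ xy[train])
        pred = poly_basis(uv[test, 0], uv[test, 1], order) @ theta
        errs.append(np.linalg.norm(pred - xy[test], axis=1))
    return float(np.mean(np.concatenate(errs)))

def select_order(uv, xy, candidates=(2, 3, 5, 8), threshold=None, gamma=1e-3):
    """Return (chosen order, {order: validation error}).

    If threshold is given, pick the smallest order whose error meets it;
    otherwise (or if none qualifies) minimize error + gamma * parameter count.
    """
    errors = {d: cv_error(uv, xy, d) for d in candidates}
    if threshold is not None:
        ok = [d for d in candidates if errors[d] <= threshold]
        if ok:
            return min(ok), errors
    best = min(candidates, key=lambda d: errors[d] + gamma * num_params(d))
    return best, errors
```

For the candidate orders 2, 3, 5 and 8 the parameter count per output coordinate is 6, 10, 21 and 45, so with only a few dozen control-point pairs the highest order is already close to interpolating the data, which is the situation the penalty term is meant to guard against.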

[0100] (7) Mapping field generation and online correction:

[0101] After completing the geometric fitting under the control-point constraints and obtaining the geometric mapping function $f_i$, the present invention converts $f_i$ into a pixel-level dense warp field to support real-time video stream processing. The reason is that the $f_i$ fitted from the control points is essentially just an evaluable function. If high-order polynomials (or basis functions) were evaluated pixel by pixel for every frame during the online phase, this would cause a large amount of repeated computation and increase the real-time processing overhead. Furthermore, if forward mapping (from source image pixels to the output plane) were used, it would inevitably produce many-to-one overlaps and output holes, requiring additional conflict handling and hole-filling steps and affecting stability. Therefore, this invention employs backward warping for online correction: for each output reference point $\mathbf{p}$ of the reference top-view base map, the inverse transformation $g_i = f_i^{-1}$ is used to trace back to the sampling coordinates in the source image and sample there, giving the top-view corrected frame:

[0102] $\hat{I}_i^t(\mathbf{p}) = \mathcal{S}\big(I_i^t,\ g_i(\mathbf{p})\big)$.

[0103] As an optional embodiment, $g_i$ can be obtained directly by fitting the control-point pairs in the reverse coordinate direction (from reference coordinates to pixel coordinates), using a basis function model isomorphic to that of $f_i$. After the dense mapping field has been generated, only the regular "table lookup + interpolation" operations are performed in the online stage. Specifically, bilinear or bicubic interpolation is used to sample at continuous coordinates, ensuring the spatial continuity and edge smoothness of the corrected frames. With this design, offline model parameter estimation and online dense resampling are decoupled: parameter estimation is performed only when the control points are updated or recalibrated, while the real-time video frame processing stage only needs fast table lookup and interpolation. This turns online computation into regular memory access and multiply-accumulate operations, which are naturally suited to SIMD / GPU parallel acceleration. In addition, inverse resampling actively samples for every output pixel, which avoids the output holes caused by forward mapping and guarantees complete output coverage by construction, significantly improving the stability and engineering usability of real-time correction.

[0104] 3. Real-time generation and inverse resampling of dense mapping fields.

[0105] (1) From parametric mapping to pixel-level deformation field:

[0106] The geometric mapping $f_i$ obtained by robust fitting in the previous section is essentially an evaluable continuous function. However, directly evaluating higher-order basis functions online for each frame and each pixel leads to a large amount of redundant computation; more importantly, if a forward mapping is used:

[0107] $\mathbf{u} \mapsto \mathbf{p} = f_i(\mathbf{u})$;

[0108] many-to-one coverage will occur (multiple $\mathbf{u}$ are mapped to the same $\mathbf{p}$), leading to conflicts, as well as holes or cracks (where some output pixels have no source pixels falling into them). This requires additional hole-filling strategies and makes stability difficult to guarantee.

[0109] Therefore, this invention unifies geometric correction into reverse resampling: actively returning to the source image for sampling for each output pixel, thereby naturally ensuring complete coverage of the output domain without holes, and facilitating GPU texture unit acceleration.

[0110] (2) Definition of inverse mapping and dense deformation field:

[0111] In the reference top-view base map coordinate system $\Omega_B$, define a discrete grid $\mathcal{G}$ (with resolution $W_B \times H_B$), where $W_B$ and $H_B$ are the width and height of the reference top-view base map, respectively.

[0112] Define the inverse sampling function:

[0113] $g_i: \Omega_B \to \Omega_i, \quad \mathbf{u} = g_i(\mathbf{p}) = f_i^{-1}(\mathbf{p})$;

[0114] And based on this, a dense deformation field is constructed for each camera:

[0115] $\mathcal{M}_i(x, y) = \big(u_i(x, y),\ v_i(x, y)\big) = g_i(x, y), \quad (x, y) \in \mathcal{G}$,

[0116] where $(u_i(x, y), v_i(x, y))$ are the continuous sampling coordinates (floating-point coordinates) in the source image $I_i^t$. $\mathcal{M}_i$ is the dense mapping lookup table, which is computed only once each time the control points or model parameters are updated; in the online phase, only table lookup and interpolation sampling are performed, achieving high throughput and low latency.

[0117] (3) Inverse sampling function Construction method:

[0118] To cover different computing power and accuracy requirements at the algorithm level, this invention constructs $g_i$ by numerical inversion. For each output reference point $q = (x, y)^{\top}$, solve the equation $f_i(p) = q$ for the camera pixel $p = (u, v)^{\top}$; Newton's method can be used:

[0119] $p^{(n+1)} = p^{(n)} - J_{f_i}\big(p^{(n)}\big)^{-1}\big(f_i(p^{(n)}) - q\big)$;

[0120] where $J_{f_i}$ is the Jacobian matrix of $f_i$, and $p^{(n)}$ is the estimated camera pixel coordinate at the $n$-th iteration. This method is applicable to any differentiable $f_i$. Generating the dense mapping lookup table offline in this way is more expensive, but the computation is required only when the control points are updated.
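
The Newton iteration can be sketched for a single output point as follows. The forward mapping `f_example` and its Jacobian `jac_example` are toy functions invented for the illustration, and the iteration count and tolerance are arbitrary.

```python
import numpy as np

def newton_invert(f, jac, q, p0, iters=20, tol=1e-10):
    """Solve f(p) = q for p with Newton's method, starting from p0."""
    p = np.asarray(p0, dtype=float).copy()
    q = np.asarray(q, dtype=float)
    for _ in range(iters):
        residual = f(p) - q
        if np.linalg.norm(residual) < tol:
            break
        p = p - np.linalg.solve(jac(p), residual)   # p <- p - J^{-1} (f(p) - q)
    return p

def f_example(p):
    """Toy differentiable forward mapping (illustration only)."""
    u, v = p
    return np.array([u + 0.1 * u * v, v + 0.05 * u**2])

def jac_example(p):
    """Analytic Jacobian of f_example."""
    u, v = p
    return np.array([[1.0 + 0.1 * v, 0.1 * u],
                     [0.1 * u, 1.0]])
```

A reasonable starting point is the solution found for the neighboring grid point, since the mapping varies smoothly across the base map.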

[0121] (4) Reverse resampling:

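[0122] For any image frame $I_i^t$, its top-view corrected frame is:

[0123] $\tilde{I}_i^t(x, y) = I_i^t\big(u_i(x, y),\ v_i(x, y)\big)$;

[0124] where $(u_i(x, y), v_i(x, y))$ are the sampling coordinates stored in the dense mapping lookup table. Because these are typically floating-point coordinates, an interpolation kernel is required for resampling. Taking bilinear interpolation as an example, let $(u, v) = (u_i(x, y), v_i(x, y))$ and

[0125] $u_0 = \lfloor u \rfloor,\ v_0 = \lfloor v \rfloor,\ \alpha = u - u_0,\ \beta = v - v_0$;

[0126] where $u_0$, $v_0$ are the integer parts of the floating-point coordinates (pixel indices), and $\alpha$, $\beta$ are the fractional parts of the coordinates (spatial weights).

[0127] Then:

[0128] $\tilde{I}_i^t(x, y) = (1-\alpha)(1-\beta)\, I_i^t(u_0, v_0) + \alpha(1-\beta)\, I_i^t(u_0+1, v_0) + (1-\alpha)\beta\, I_i^t(u_0, v_0+1) + \alpha\beta\, I_i^t(u_0+1, v_0+1)$;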

[0129] Bicubic interpolation can be used to obtain smoother edges, but at a higher cost.
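
The bilinear case can be sketched in NumPy as below. Coordinates are clamped to the image border, which is an assumption of this sketch; masking of out-of-bounds samples is treated separately. The name `remap_bilinear` is hypothetical, and the source image is assumed to be at least 2 x 2 pixels.

```python
import numpy as np

def remap_bilinear(src, map_u, map_v):
    """Sample src at float coordinates (map_u = column, map_v = row), bilinearly.

    src: (H, W) or (H, W, C) array with H, W >= 2.
    Coordinates outside the image are clamped to the border.
    """
    h, w = src.shape[:2]
    u = np.clip(map_u, 0, w - 1)
    v = np.clip(map_v, 0, h - 1)
    u0 = np.minimum(np.floor(u).astype(int), w - 2)   # keep u0 + 1 inside the image
    v0 = np.minimum(np.floor(v).astype(int), h - 2)
    a = u - u0                                        # horizontal fractional weight
    b = v - v0                                        # vertical fractional weight
    if src.ndim == 3:                                 # broadcast weights over channels
        a = a[..., None]
        b = b[..., None]
    src = src.astype(np.float64)
    top = (1 - a) * src[v0, u0] + a * src[v0, u0 + 1]
    bottom = (1 - a) * src[v0 + 1, u0] + a * src[v0 + 1, u0 + 1]
    return (1 - b) * top + b * bottom
```

Each output pixel reads four neighbors and performs a handful of multiply-adds, which is the regular access pattern that GPU texture units implement in hardware.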

[0130] (5) Valid domain and fold detection:

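[0131] Inverse sampling requires that the sampling coordinates $(u_i(x, y), v_i(x, y))$ fall within the valid range of the source image. Define the valid domain:

[0132] $V_i(x, y) = \mathbb{1}\big[\,0 \le u_i(x, y) \le W_i - 1\ \wedge\ 0 \le v_i(x, y) \le H_i - 1\,\big]$;

[0133] where $\mathbb{1}[\cdot]$ is the indicator function, which takes the value 1 when the condition within the square brackets is met and 0 otherwise; $W_i \times H_i$ is the pixel size of the $i$-th camera image; and $V_i$ is a binary mask used to remove out-of-bounds regions.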

[0134] Furthermore, to avoid texture tearing caused by local folding (a non-injective mapping), the Jacobian determinant can be used as an optional criterion:

[0135] $V_i(x, y) \leftarrow V_i(x, y) \cdot \mathbb{1}\big[\det J_{g_i}(x, y) > 0\big]$;

[0136] This eliminates untrusted areas during the fusion phase and improves stability. Here $\det(\cdot)$ denotes the determinant, and $J_{g_i}$ denotes the Jacobian matrix of the inverse mapping function $g_i$.
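
Both tests can be computed directly from the dense lookup table. The sketch below assumes the Jacobian of the inverse mapping is approximated by finite differences of the table, and that a positive determinant threshold `det_min` (a hypothetical parameter) marks non-folded regions.

```python
import numpy as np

def valid_mask(map_u, map_v, src_w, src_h, det_min=0.0):
    """Binary mask of output pixels whose samples are in bounds and not folded.

    The Jacobian of the inverse mapping is approximated by finite differences
    of the dense lookup table (np.gradient returns d/drow first, then d/dcol).
    """
    inside = ((map_u >= 0) & (map_u <= src_w - 1) &
              (map_v >= 0) & (map_v <= src_h - 1))
    du_dy, du_dx = np.gradient(map_u)
    dv_dy, dv_dx = np.gradient(map_v)
    det = du_dx * dv_dy - du_dy * dv_dx       # det [[du/dx, du/dy], [dv/dx, dv/dy]]
    return inside & (det > det_min), det
```

A mirrored or folded region yields a non-positive determinant and is masked out even when its samples lie inside the source image.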

[0137] (6) Real-time analysis:

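[0138] During the offline or update phase (when control points or model parameters change), the dense mapping lookup table is computed; the cost is $O(W_B H_B)$ evaluations of the inverse sampling function, one per base-map pixel, where $W_B \times H_B$ is the base-map resolution. In the online frame-by-frame stage, only table lookup and interpolation sampling are performed, with a complexity of $O(W_B H_B)$ per frame, that is, a constant amount of work per output pixel. Furthermore, the regularity and data parallelism of these operations make them well suited to GPUs and vectorization.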

[0139] Because control points are updated far less often than the video frame rate (usually only during the calibration phase or occasional recalibration), the overall system can sustain stable real-time operation over long periods.

[0140] When the local scale is compressed (the local area scale factor given by the Jacobian determinant of the mapping is relatively small), direct sampling from the source image may cause aliasing. A "filter first, then sample" strategy can be used: estimate the local scale from the Jacobian determinant, select either a Gaussian pre-filter radius or an image pyramid level (mipmap), and then perform inverse sampling to reduce high-frequency aliasing and flicker.
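
One possible way to turn the local scale into a pyramid level is sketched below. It is an assumption of this sketch, not stated in the text, that the determinant used is that of the inverse sampling map, so that its absolute value is the source area covered by one output pixel; `select_mip_level` is a hypothetical name.

```python
import numpy as np

def select_mip_level(det_inverse_jac, max_level):
    """Pick a (fractional) pyramid level per output pixel.

    |det| of the inverse-map Jacobian is the source area covered by one output
    pixel; its square root is the linear scale, and each pyramid level halves
    the resolution, hence level = log2(sqrt(area)) = 0.5 * log2(area).
    """
    area = np.maximum(np.abs(det_inverse_jac), 1.0)   # no pre-filter when magnifying
    return np.clip(0.5 * np.log2(area), 0.0, max_level)
```

A fractional level can be rounded, or used to interpolate between the two nearest pyramid levels in the manner of trilinear texture filtering.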

[0141] The basis-function geometric mapping model fitting and inverse resampling process of this invention is shown in Figure 2.

[0142] 4. Feature pyramid stitching based on multi-scale fusion.

[0143] After each video stream has been geometrically corrected to the reference top-view plane (yielding the corrected frames $\tilde{I}_i^t$ with valid domains $V_i$), subpixel-level misalignment still exists even with proper calibration, owing to control-point errors, local non-planarity, residual lens distortion, and time synchronization deviations. It is compounded by illumination discontinuities caused by cross-camera exposure differences and white balance discrepancies. Directly estimating the transformation in the pixel domain and performing linear fusion often results in ghosting, breaks, and / or flickering at the boundaries.

[0144] As shown in Figure 3 and Figure 4, this invention proposes a coarse-to-fine stitching framework based on feature pyramids: robust registration is completed in a multi-scale feature space, and seamless fusion is completed across multi-scale frequency bands, balancing robustness and real-time performance.

[0145] (1) Construction of a multi-scale feature pyramid network:

[0146] For each corrected frame $\tilde{I}_i^t$, construct a coarse-to-fine feature pyramid $\{F_i^{l}\}_{l=0}^{L}$, where $l$ is the index of the scale level. The larger $l$ is, the lower the resolution and the larger the receptive field.

[0147] In a preferred embodiment, the feature pyramid can be generated in any of the following ways:

[0148] Classical multi-scale representation: the image levels are obtained by Gaussian pyramid downsampling of the image. Local descriptors such as gradient, corner response, and HOG are then extracted from each layer and combined into the feature layer $F_i^{l}$ of level $l$.

[0149] Learning-based feature pyramid: a lightweight backbone network extracts multiple layers of features and fuses them top-down to obtain feature layers of uniform semantic strength, which are used as $F_i^{l}$.

[0150] Theoretical motivation: coarse-scale layers can suppress local texture interference and improve matching convexity (avoiding getting trapped in local optima), while fine-scale layers are responsible for restoring high-precision alignment of the boundaries; this constitutes a typical coarse-to-fine optimization strategy.
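
The classical variant can be sketched as follows. A separable [1, 2, 1] / 4 binomial kernel stands in for the Gaussian filter, and the per-level feature is simply intensity plus gradient magnitude; both choices, and the names `blur_downsample` and `feature_pyramid`, are assumptions made for the illustration.

```python
import numpy as np

def blur_downsample(img):
    """Separable [1, 2, 1] / 4 blur (a small Gaussian stand-in), then 2x decimation."""
    p = np.pad(img, 1, mode="edge")
    horiz = 0.25 * p[:, :-2] + 0.5 * p[:, 1:-1] + 0.25 * p[:, 2:]      # (H + 2, W)
    both = 0.25 * horiz[:-2] + 0.5 * horiz[1:-1] + 0.25 * horiz[2:]    # (H, W)
    return both[::2, ::2]

def feature_pyramid(img, levels):
    """Return [F^0, ..., F^(levels-1)], finest first; each level is (h, w, 2).

    Channel 0 is the smoothed intensity, channel 1 the gradient magnitude.
    """
    pyramid = []
    current = np.asarray(img, dtype=float)
    for _ in range(levels):
        gy, gx = np.gradient(current)
        pyramid.append(np.stack([current, np.hypot(gx, gy)], axis=-1))
        current = blur_downsample(current)
    return pyramid
```

Richer descriptors such as corner response or HOG would be appended as further channels in the same way.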

[0151] (2) Multi-scale registration, robust geometric alignment from coarse to fine:

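[0152] Let the two adjacent top-view corrected frames to be stitched be $\tilde{I}_a$ and $\tilde{I}_b$, with valid domains $V_a$ and $V_b$. The overlapping region $\Omega_{ov}$ is determined by the valid domains on the reference plane: $\Omega_{ov} = \{(x, y) \mid V_a(x, y) = 1 \wedge V_b(x, y) = 1\}$.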

[0153] The goal is to estimate a local refinement transformation $T$ within the overlapping region (which can be a translation / affine / homography or a block-wise local model) that further aligns the two streams geometrically. To enhance robustness, the transformation is estimated in the feature space rather than directly from pixels:

[0154] $T^{*} = \arg\min_{T} \sum_{l=0}^{L} \lambda_l \sum_{\mathbf{x} \in \Omega_{ov}^{l}} \rho\big(F_a^{l}(\mathbf{x}) - F_b^{l}(T(\mathbf{x}))\big) + \mu\, R(T)$;

[0155] where $F_a^{l}$ and $F_b^{l}$ are the level-$l$ feature layers of the two frames; $\Omega_{ov}^{l}$ is the region corresponding to the overlapping region in the downsampled $l$-th layer; $\lambda_l$ are the layer weights (generally larger for coarse layers and smaller for fine layers, so that coarse positioning is stable before fine adjustment); $\rho(\cdot)$ is a robust loss (Huber loss or Cauchy loss can be used) that suppresses abnormal matches caused by dynamic targets and occlusion; $R(T)$ is a regularization term used to constrain the smoothness of the transformation or limit its degrees of freedom (to prevent overfitting to local noise); and $\mu$ is the weighting coefficient of the regularization term, used to control the smoothness of the transformation field.

[0156] At the coarsest layer $l = L$, first estimate $T^{L}$ (e.g., by maximizing phase correlation / cross-correlation, or by sparse matching + RANSAC). The estimate of each layer is then upsampled as the initial value for the next finer layer, $T^{l-1}_{\mathrm{init}} = \mathrm{Up}(T^{l})$, and local iterative optimization is performed at that layer (e.g., Gauss-Newton iteration, the LM algorithm, or iterative optical flow updates) until $l = 0$. Here $\mathrm{Up}(\cdot)$ denotes the upsampling operation, which maps the transformation parameters of the $l$-th layer to the higher-resolution layer $l - 1$; bilinear interpolation or nearest-neighbor interpolation can be used at each layer.

[0157] Multi-scale initialization expands the convergence region: coarse layers provide a globally consistent alignment trend, and fine layers resolve sub-pixel misalignment. The robust terms ensure that dynamic regions such as pedestrians and vehicles do not dominate the geometric estimation.
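
The coarse-to-fine schedule can be sketched for the simplest transformation model, an integer translation. This sketch makes several simplifying assumptions: it works on raw intensities rather than feature layers, treats shifts as circular, and reuses phase correlation at every level where a real system would run a local iterative refinement. All function names are hypothetical.

```python
import numpy as np

def phase_correlation(a, b):
    """Integer shift (dy, dx) such that np.roll(b, (dy, dx), axis=(0, 1)) ~= a."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.maximum(np.abs(cross), 1e-12)          # keep phase only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = a.shape
    if dy > h // 2:                                    # map wrapped peaks to negative shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

def downsample2(img):
    """2x2 box-average downsampling (odd trailing row/column is dropped)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def coarse_to_fine_shift(a, b, levels=3):
    """Estimate the translation from b to a, coarsest pyramid level first."""
    pyr_a, pyr_b = [np.asarray(a, float)], [np.asarray(b, float)]
    for _ in range(levels - 1):
        pyr_a.append(downsample2(pyr_a[-1]))
        pyr_b.append(downsample2(pyr_b[-1]))
    dy = dx = 0
    for level_a, level_b in zip(reversed(pyr_a), reversed(pyr_b)):
        dy, dx = 2 * dy, 2 * dx                        # Up(): rescale the coarser estimate
        aligned = np.roll(level_b, (dy, dx), axis=(0, 1))
        ry, rx = phase_correlation(level_a, aligned)   # residual shift at this level
        dy, dx = dy + ry, dx + rx
    return dy, dx
```

For a translation, the upsampling operation reduces to doubling the parameters; for an affine or homography model, only the translational entries are rescaled.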

[0158] To cover different scene complexities, this invention uses a grid-based local deformation strategy: the overlapping region is divided into grid blocks, a local affine transformation is estimated for each block, and a smoothness regularization is applied to constrain adjacent blocks to be continuous, which improves adaptability to non-planar scenes and residual distortion.

[0159] (3) Fusion weight estimation and multi-scale frequency band fusion:

[0160] After completing the geometric alignment, one of the two streams is transformed by $T$ into the common coordinate system:

[0161] $\tilde{I}_b' = \mathcal{W}(\tilde{I}_b;\ T)$, $V_b' = \mathcal{W}(V_b;\ T)$;

[0162] where $\tilde{I}_b$ is the source image to be stitched; $T$ is the local refinement transformation estimated in step (2); $\mathcal{W}(\cdot\,;\ T)$ denotes inverse mapping and bilinear interpolation resampling of an image according to $T$; $\tilde{I}_b'$ is the image to be stitched after geometric correction; and $V_b'$ is the corresponding valid region mask.

[0163] To reduce seam visibility and ghosting, this invention constructs pixel-level fusion weights that jointly take the following factors into account: a smooth window over the distance to the seam boundary; structural consistency / gradient difference (preferring the more consistent side at texture conflict points); and the valid domain and geometric confidence (reducing the weight of mapping edges and low-confidence regions).

[0164] The pixel-level fusion weights are calculated using the following energy minimization form:

[0165] ;

[0166] In this energy, the structural cost is defined as the magnitude of the image gradient and is used to penalize seams that cross high-frequency texture areas. The smoothing term and the structural cost term each carry a weighting coefficient. The gradient operator is applied to the weight map to obtain its spatial gradient, and the L1 norm of that gradient constrains its sparsity so that boundaries stay sharp. The formulation is equivalent to a "soft seam" optimization, which can markedly reduce ghosting.

[0167] The final fusion is not completed in a single step in the pixel domain but across multiple scale frequency bands. Laplacian pyramids $\{L_a^{l}\}$ and $\{L_b^{l}\}$ are built for the two aligned frames, a Gaussian pyramid $\{G_w^{l}\}$ is built for the weight map $w$, and the layers are then merged one by one:

[0168] $L_F^{l} = G_w^{l} \odot L_a^{l} + (1 - G_w^{l}) \odot L_b^{l}$;

[0169] where $\odot$ denotes element-wise multiplication, and inverse pyramid reconstruction of $\{L_F^{l}\}$ yields the fused result.

[0170] The low-frequency layers primarily determine the continuity of illumination and tone, while the high-frequency layers determine edge and texture details. Multi-band blending is equivalent to using a different "seam transition width" in each frequency band, which simultaneously suppresses abrupt brightness changes and texture breaks, making it a classic and well-established structure for seamless stitching.
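
Layer-wise multi-band fusion can be sketched as follows. To keep the sketch short, a 2 x 2 box average stands in for the Gaussian low-pass filter and nearest-neighbor replication for the upsampling; image sides are assumed divisible by 2 at every level. All function names are hypothetical.

```python
import numpy as np

def down(img):
    """2x2 box average (a stand-in for Gaussian blur + decimation)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img):
    """Nearest-neighbor 2x upsampling."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def lowpass_pyramid(img, levels):
    """Low-pass pyramid, finest level first."""
    pyramid = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(down(pyramid[-1]))
    return pyramid

def laplacian_pyramid(img, levels):
    """Band-pass layers plus the coarsest low-pass layer as the last element."""
    g = lowpass_pyramid(img, levels)
    return [g[l] - up(g[l + 1]) for l in range(levels - 1)] + [g[-1]]

def multiband_blend(img_a, img_b, weight, levels=3):
    """Fuse per level with the weight pyramid, then collapse the fused pyramid."""
    lap_a = laplacian_pyramid(img_a, levels)
    lap_b = laplacian_pyramid(img_b, levels)
    w_pyr = lowpass_pyramid(weight, levels)
    fused = [w * a + (1.0 - w) * b for w, a, b in zip(w_pyr, lap_a, lap_b)]
    out = fused[-1]
    for lap in reversed(fused[:-1]):
        out = up(out) + lap
    return out
```

Because the weight map is low-pass filtered more strongly at coarser levels, low frequencies are blended over a wide transition zone and high frequencies over a narrow one.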

[0171] (4) Temporal consistency and online update strategy:

[0172] Since the video consists of consecutive frames, estimating the transformation $T$ and the weight map $w$ independently for each frame can easily introduce inter-frame jitter and cause flickering. Temporal regularization can be added:

[0173] $T_t = \gamma\, T_{t-1} + (1 - \gamma)\, \hat{T}_t$, $w_t = \gamma\, w_{t-1} + (1 - \gamma)\, \hat{w}_t$;

[0174] where $\gamma$ is the smoothing coefficient; $\hat{T}_t$ is the geometric transformation parameter computed independently for the current frame $t$ (i.e., the instantaneous estimate without temporal filtering); and $\hat{w}_t$ is the pixel-level fusion weight map computed independently for the current frame $t$ (i.e., the original weights obtained by minimizing the energy equation based on the gradient of the current frame image). This strategy improves video stability without significantly increasing computation.
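
A minimal sketch of such temporal filtering, assuming simple exponential smoothing in which the coefficient weights the previous smoothed value; the class name `TemporalSmoother` and the default `gamma` are hypothetical.

```python
import numpy as np

class TemporalSmoother:
    """Exponential smoothing of per-frame estimates (parameters or weight maps)."""

    def __init__(self, gamma=0.8):
        self.gamma = gamma      # weight of the previous smoothed state
        self.state = None

    def update(self, estimate):
        """Blend the new instantaneous estimate into the running state."""
        estimate = np.asarray(estimate, dtype=float)
        if self.state is None:
            self.state = estimate.copy()              # first frame: no history yet
        else:
            self.state = self.gamma * self.state + (1.0 - self.gamma) * estimate
        return self.state
```

One smoother instance would be kept per camera pair for the transformation parameters and another for the weight map; the state costs one array of storage and one multiply-add per element per frame.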

[0175] (5) Complexity and real-time performance:

[0176] The main computational cost of this framework lies in the multi-scale and local-region computations. Registration is performed on the feature pyramid from coarse to fine: the coarse layers are cheap and have a large convergence region, while the fine layers perform only small-range refinement.

[0177] The fusion uses a pyramidal multi-band structure and can be performed only in the overlapping / boundary strip regions, reducing the complexity from that of the full image, $O(W_B H_B)$, down to $O(|\Omega_{ov}|)$, where $W_B \times H_B$ is the base-map resolution and $|\Omega_{ov}|$ is the number of pixels in the overlapping region. The overall operation consists of convolution, interpolation, and pixel-wise multiply-add, which is well suited to GPU parallelism and enables online real-time output.

[0178] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0179] It should be understood that although the steps in the flowcharts of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the flowcharts of the accompanying drawings may include multiple steps or stages, which are not necessarily completed at the same time, but may be executed at different times, and the execution order of these steps or stages is not necessarily sequential, but may be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0180] In one embodiment, the present invention provides a computer system, which may be a server. The computer system includes a processor, memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database stores data used in the methods described above. The network interface communicates with external terminals via a network connection. The computer program is executed by the processor to implement the methods described above.

[0181] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0182] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention, and no reference numerals in the claims should be construed as limiting the scope of the claims.

[0183] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

Claims

1. A method for multi-camera correction and seamless multi-scale stitching of boundaries, characterized in that it comprises: obtaining the set of control-point pairs between the video frames of each deployed camera and the reference top-view base map; based on the set of control-point pairs, obtaining a nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system by fitting a basis function model; using the geometric mapping function, resampling and correcting the video frames of each camera in real time to the reference top-view base map coordinate system to obtain the top-view corrected frame of each camera; for the top-view corrected frames of the cameras, performing multi-scale feature alignment in the overlapping area to estimate a local refinement transformation; and, for each pair of adjacent top-view corrected frames, applying the local refinement transformation to one of the two frames and then performing multi-scale frequency band fusion to generate a geometrically consistent and brightness-continuous panoramic stitched video frame.

2. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 1, characterized in that obtaining the set of control-point pairs between the video frames of each deployed camera and the reference top-view base map specifically includes: $N$ cameras are deployed and numbered $i = 1, \dots, N$; the image frame of the $i$-th camera at time $t$ is $I_i^t$; a reference top-view base map is assumed to exist, and its coordinate system is denoted the reference top-view base map coordinate system $\Omega_B$; in the camera pixel coordinate system $\Omega_i$, a camera pixel is $p = (u, v)^{\top}$, and in the reference top-view base map coordinate system $\Omega_B$, a reference point of the top-view base map is $q = (x, y)^{\top}$, where $u$, $v$ are the horizontal and vertical coordinates of $p$ and $x$, $y$ are the horizontal and vertical coordinates of $q$; the set of control-point pairs of the $i$-th camera is $S_i = \{(p_k, q_k, \omega_k)\}_{k=1}^{K_i}$, where $k$ is the index of the $k$-th control-point pair and $K_i$ is the total number of control-point pairs of the $i$-th camera; $p_k$ is the camera pixel in the $k$-th control-point pair, $q_k$ is the reference point on the top-view base map in the $k$-th control-point pair, $\omega_k$ is the confidence weight of the $k$-th control-point pair, and $\top$ denotes transpose.

3. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 2, characterized in that obtaining, based on the set of control-point pairs, the nonlinear geometric mapping function from the camera pixel coordinate system to the reference top-view base map coordinate system by fitting a basis function model specifically includes: normalizing the camera pixel coordinates and the coordinates of the reference top-view base map; representing the geometric mapping function by a basis function vector containing polynomial terms; and solving for the parameters of the basis function vector by minimizing an objective function that combines a robust loss function and a regularization term.

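4. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 3, characterized in that representing the geometric mapping function by a basis function vector containing polynomial terms specifically includes: the basis function vector $\phi(\tilde{u}, \tilde{v})$ includes polynomial terms up to a specified degree $d$: $\phi(\tilde{u}, \tilde{v}) = [1,\ \tilde{u},\ \tilde{v},\ \tilde{u}^{2},\ \tilde{u}\tilde{v},\ \tilde{v}^{2},\ \dots,\ \tilde{v}^{d}]^{\top}$; $\tilde{u}$, $\tilde{v}$ are the horizontal and vertical coordinates of the camera pixel after normalization; the geometric mapping function is written as: $x = \theta_x^{\top} \phi(\tilde{u}, \tilde{v})$; $y = \theta_y^{\top} \phi(\tilde{u}, \tilde{v})$; where $\theta_x$, $\theta_y$ are the parameters of the basis function vector to be estimated.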

5. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 4, characterized in that solving for the parameters of the basis function vector by minimizing the objective function that combines the robust loss function and the regularization term specifically includes: the objective function is: $\min_{\theta_x, \theta_y} \sum_{k=1}^{K_i} \omega_k\, \rho\big(\lVert f_i(\tilde{u}_k, \tilde{v}_k) - q_k \rVert_2\big) + \lambda\big(\lVert \theta_x \rVert_2^{2} + \lVert \theta_y \rVert_2^{2}\big)$; where $\rho(\cdot)$ is the robust loss function, $\lambda$ is the regularization coefficient, $\lVert \cdot \rVert_2$ is the L2 norm, and $\tilde{u}_k$, $\tilde{v}_k$ are the normalized horizontal and vertical coordinates of the camera pixel in the $k$-th control-point pair; the objective function is solved by the iteratively reweighted least squares method; and the degree of the polynomial terms of the basis function vector is adaptively selected, based on the number and distribution of the control-point pairs, through a model order adaptive mechanism.

6. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 1, characterized in that using the geometric mapping function to resample and correct the video frames of each camera in real time to the reference top-view base map coordinate system, thereby obtaining the top-view corrected frame of each camera, specifically includes: based on the geometric mapping function or the inverse transformation of the geometric mapping function, pre-computing a dense mapping lookup table from the reference top-view base map coordinates to the camera pixel coordinates; and generating the top-view corrected frame by querying the dense mapping lookup table and performing interpolation.

7. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 6, characterized in that pre-computing, based on the geometric mapping function or its inverse transformation, the dense mapping lookup table from the reference top-view base map coordinates to the camera pixel coordinates, and generating the top-view corrected frame by querying the dense mapping lookup table and performing interpolation, specifically includes: in the reference top-view base map coordinate system $\Omega_B$, defining a discrete grid $\mathcal{G}$; defining the inverse sampling function $g_i$: $g_i : \Omega_B \to \Omega_i,\ g_i(q) = f_i^{-1}(q)$; where $q = (x, y)^{\top}$ is a reference point of the reference top-view base map in $\Omega_B$ with horizontal and vertical coordinates $x$, $y$, $\Omega_i$ is the camera pixel coordinate system of the $i$-th camera, and $f_i$ is the geometric mapping function corresponding to the $i$-th camera; constructing a dense deformation field for each camera: $M_i(x, y) = g_i(x, y) = \big(u_i(x, y),\ v_i(x, y)\big)$, where $u_i(x, y)$ is the continuous sampling abscissa in the image frame $I_i^t$ of the $i$-th camera at time $t$, $v_i(x, y)$ is the continuous sampling ordinate in the image frame $I_i^t$ of the $i$-th camera at time $t$, and $M_i$ is the dense mapping lookup table; the top-view corrected frame $\tilde{I}_i^t$ is: $\tilde{I}_i^t(x, y) = I_i^t\big(M_i(x, y)\big)$.

8. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 1, characterized in that performing multi-scale feature alignment on the top-view corrected frames of the cameras in the overlapping area to estimate the local refinement transformation specifically includes: constructing a multi-scale feature pyramid for each top-view corrected frame; and sequentially optimizing the parameters of the local refinement transformation at multiple scales of the feature pyramid from coarse to fine; wherein the local refinement transformation is a translation model, an affine model, a homography model, or a block local model.

9. The multi-camera correction and seamless multi-scale stitching method for boundaries according to claim 1, characterized in that applying the local refinement transformation to one of each pair of adjacent top-view corrected frames and then performing multi-scale frequency band fusion to generate a geometrically consistent and brightness-continuous panoramic stitched video frame specifically includes: the two adjacent top-view corrected frames are $\tilde{I}_a$, $\tilde{I}_b$; the top-view corrected frame $\tilde{I}_b$ undergoes the local refinement transformation to obtain the refined top-view corrected frame $\tilde{I}_b'$; constructing a weight map $w$; constructing Laplacian pyramids $\{L_a^{l}\}$, $\{L_b^{l}\}$ for $\tilde{I}_a$ and $\tilde{I}_b'$ respectively, and constructing a Gaussian pyramid $\{G_w^{l}\}$ for the weight map; $l$ is the index of the pyramid layer, $L_a^{l}$ is the $l$-th layer image of the Laplacian pyramid of $\tilde{I}_a$, $L_b^{l}$ is the $l$-th layer image of the Laplacian pyramid of $\tilde{I}_b'$, and $G_w^{l}$ is the $l$-th layer image of the Gaussian pyramid of the weight map; performing weighted fusion at each level of the pyramid: $L_F^{l} = G_w^{l} \odot L_a^{l} + (1 - G_w^{l}) \odot L_b^{l}$; where $L_F^{l}$ is the $l$-th layer image of the merged Laplacian pyramid; and reconstructing the merged Laplacian pyramid to obtain the final panoramic stitched video frame.

10. A computer system comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 9.