Deep learning composite video processing method and system based on three-dimensional spectral features
By employing a deep learning method based on three-dimensional spectral features, and utilizing a three-dimensional Fourier transform and convolutional neural networks, the problems of cross-color and dot interference in composite video signals are solved. This achieves high-precision luminance-chrominance separation and a smooth transition between dynamic and static scenes, making the method suitable for various signal quality scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PEKING UNIV
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies suffer from cross-color and dot interference, motion detection that is susceptible to noise, and poor robustness when processing composite video signals, especially with high-frequency textures and non-standard signals.
A deep learning method based on three-dimensional spectral features is adopted. By constructing a frequency domain mask through three-dimensional Fourier transform and convolutional neural network, signal separation is achieved. Combined with a multi-scale spatiotemporal frequency window fusion strategy, high-precision separation of luminance and chrominance signals is realized.
It achieves high-precision luminance-chrominance separation, eliminates cross-color and dot interference, enables seamless transition between dynamic and static scenes, improves robustness to noise and non-standard signals, and is suitable for various signal quality scenarios.
Figure CN122226953A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of video signal processing technology, and in particular relates to a deep learning composite video processing method and system based on three-dimensional spectral features.
Background Technology
[0002] The composite video signal (CVBS, Composite Video Baseband Signal) is an analog video format that transmits a mixture of luminance (Y), chrominance (C), and synchronization signals on the same channel. The luminance signal occupies the entire video bandwidth, with its energy concentrated mainly in the low-frequency band. The chrominance signal is loaded onto a color subcarrier via quadrature amplitude modulation (QAM), typically located at the high-frequency end of the video spectrum, and therefore overlaps with the high-frequency components of the luminance signal in the frequency domain.
[0003] To separate clean Y and C signals from CVBS, traditional techniques primarily rely on comb filters. Early techniques (such as notch filters and bandpass filters) were simple but produced severe cross-color and cross-luminance artifacts, along with significant loss of high-frequency detail. Later developments, such as 2D comb filters, exploited inter-line correlation but introduced artifacts at abrupt vertical color transitions. The current mainstream technique is the 3D motion-adaptive comb filter, which combines 2D (spatial) and 3D (temporal) comb filtering and adaptively switches between them using a motion detector.
[0004] However, the existing technology still has the following drawbacks. Because it relies on fixed phase-correlation assumptions and hard threshold decisions, it is prone to producing "rainbow patterns" and "dot interference" when processing high-frequency textures or specific luminance signals. Its motion detection is susceptible to noise and struggles to achieve a smooth transition between static sharpness and dynamic edge integrity. It is also not robust to long-distance transmission or non-standard signals (such as those with phase jitter or noise).
[0005] Therefore, there is an urgent need for a new video signal processing solution that can overcome the above-mentioned defects and achieve high-precision, highly robust luminance-chrominance separation.
Summary of the Invention
[0006] In order to overcome the shortcomings of the prior art, this application provides a deep learning-based composite video processing method and system based on three-dimensional spectral features, which improves the efficiency and accuracy of video signal processing.
[0007] The technical solution is as follows. In one aspect, a deep learning composite video processing method based on three-dimensional spectral features is provided, which mainly includes the following steps:
Sampling: sample or resample the composite video signal.
Preprocessing: preprocess the sampled composite video signal to obtain preprocessed video data blocks. The preprocessing includes: dividing the sampled composite video signal into three-dimensional overlapping blocks; calculating and removing the DC component of the effective pixel area of each video block; padding the blanking region and the interlaced region with zeros; and applying a three-dimensional window function.
Frequency domain transformation: perform a three-dimensional Fourier transform on each preprocessed video data block to obtain three-dimensional spectrum data.
Feature construction and inference: extract the amplitude spectrum of the three-dimensional spectrum data, and construct a mirror spectrum symmetrical about the subcarrier frequency based on the subcarrier frequency characteristics of the composite video signal; input the amplitude spectrum and the mirror spectrum into a pre-trained convolutional neural network to generate a spectral mask corresponding to the target signal.
Signal reconstruction: multiply the spectral mask with the three-dimensional spectrum data to obtain the target signal spectrum; perform a three-dimensional inverse Fourier transform on the target signal spectrum, apply a three-dimensional window function, reconstruct the spatiotemporal video signal by overlap-add, and output the final image.
[0008] Preferably, the convolutional neural network includes: The symmetry comparison module is used to perform point-to-point comparisons between the amplitude spectrum and the mirror spectrum. By learning the difference or ratio between the two, it can preliminarily determine whether the current frequency point conforms to the symmetry characteristics of the chroma signal. The spectrum attention module is used to generate a frequency domain attention weight map that focuses on the subcarrier frequency band; A context-aware module is used to extract local contextual features of the spectrum; The mask prediction module is used to output a normalized 3D spectral mask.
[0009] Preferably, the spectral attention module includes a global average pooling layer, a gated subnetwork, and a Sigmoid activation function, and employs a squeeze-and-excitation mechanism. Global average pooling compresses the three-dimensional spectral features along the spatiotemporal dimensions to extract channel statistical vectors describing the global energy distribution. The gated subnetwork, through its channel compression and channel restoration stages, automatically learns the dependencies between different feature channels and generates attention weight vectors that reflect the importance of the channels. The Sigmoid activation function generates normalized weights that reflect the importance of each frequency point. The weights are dynamically applied to the main-path features by a multiplier to achieve adaptive focusing on the chroma subcarrier frequency.
[0010] Preferably, the gated subnetwork includes two convolutional layers and a ReLU activation function; the gated subnetwork consists of a compression channel and a recovery channel.
[0011] Preferably, the method further includes a multi-scale spatiotemporal window fusion step: the composite video signal to be processed is input into two parallel preprocessing modules, namely the static fine branch and the dynamic response branch, to obtain spectral data. The spectral data of the two branches are input into two independently trained convolutional neural networks to obtain a fine mask and a dynamic mask. The two masks are dynamically fused based on the amount of motion to generate the final mask. The original spectrum is filtered based on the final mask, a three-dimensional window function is applied, and the final image is output by overlap-add.
[0012] Preferably, the mirror spectrum is obtained by symmetrically flipping the amplitude spectrum about the center of the subcarrier frequency in the horizontal, vertical and time dimensions.
[0013] Preferably, the composite video signal is in CVBS, NTSC, or PAL format.
[0014] On the other hand, a deep learning-based composite video processing system based on three-dimensional spectral features is provided, including: Memory, used to store computer programs; A processor is used to implement the steps of the deep learning composite video processing method based on three-dimensional spectral features described above when executing the computer program.
[0015] Preferably, the system further includes: analog-to-digital converters, used to convert analog composite video signals into digital signals; frame memory and line memory, used to provide spatiotemporally delayed signals; a frequency domain transformation module, used to perform the three-dimensional Fourier transform and inverse transform; and a neural network inference module, used to perform spectral mask prediction.
[0016] Preferably, the neural network inference module is a three-dimensional convolutional neural network.
[0017] The technical solution includes at least the following technical effects: Achieving pixel-level high-fidelity luminance-chrominance separation: Traditional comb filters rely on fixed linear operations and hard threshold decisions. When processing high-frequency textures or specific luminance signals, they cannot effectively distinguish between luminance and chrominance components with overlapping spectra, leading to persistent "cross-color" (rainbow patterns) and "dot interference." This invention maps the signal to the three-dimensional frequency domain and explicitly constructs physical symmetry features about the subcarrier as input to the neural network, transforming the separation problem into a "nonlinear symmetry comparison" problem. Deep networks can accurately learn and identify the unique spectral patterns of chrominance signals, thereby outputting extremely pure chrominance signals while preserving all high-frequency luminance details, fundamentally solving the persistent problems of cross-color and dot interference that are difficult to overcome by traditional methods.
[0018] Breaking the performance trade-off between static and dynamic scenes to achieve seamless image quality transitions: Existing 3D motion adaptive filters require hard switching between 2D mode (anti-ghosting but incomplete separation) and 3D mode (complete separation but susceptible to motion), which is prone to jagged edges in static images or ghosting of moving objects due to motion detection errors. This invention proposes a multi-scale spatiotemporal window fusion strategy that processes high-frequency resolution static branches and high-temporal resolution dynamic branches in parallel, and combines this with motion information for adaptive soft fusion. This allows the system to simultaneously achieve accurate separation of complex static textures and clean processing of fast-moving edges without sacrificing the advantages of either approach, resulting in a smooth, seamless transition between static and dynamic scenes and avoiding abrupt changes in image quality.
[0019] Significantly improves robustness to non-ideal and noisy signals: Traditional methods make strong assumptions about the phase accuracy and amplitude standardization of signals, and lack effective means to handle interference such as phase jitter, amplitude attenuation, and noise introduced by long-distance transmission. This invention operates in the frequency domain, and the neural network (especially the spectrum attention module) can learn the distribution characteristics of noise and automatically suppress the weights of noisy frequency bands. In addition, channel interference simulation and data augmentation are introduced during training, and a loss function including edge consistency constraints is used to force the network to learn to generate spatiotemporally continuous outputs. Therefore, this invention exhibits stronger adaptability and stability for scenarios with poor signal quality, such as old video sources and analog video transmissions, and produces cleaner and more continuous output images.
[0020] Balancing high performance with low computational requirements, this invention is feasible for engineering implementation: Unlike completely end-to-end "black box" deep learning solutions, this invention deeply integrates physical priors in signal processing (such as 3D FFT and subcarrier symmetry). This "physical information-guided" design allows the neural network to focus on learning the complex nonlinear mapping relationships of residuals without having to learn basic modulation and demodulation principles from scratch. Therefore, it can achieve or even surpass the performance of traditional algorithms with a lighter network structure (fewer layers and parameters). This high efficiency makes the solution easy to implement in real-time on FPGAs, embedded DSPs, or mobile chips, giving it high practical application value and market potential.
[0021] It should be understood that the above general description and the following detailed description are merely exemplary and do not limit this application.
Attached Figure Description
[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0023] Figure 1 is a flowchart of a deep learning composite video processing method based on three-dimensional spectral features, provided in a preferred embodiment of this application;
Figure 2 is a flowchart of a deep learning composite video processing method based on three-dimensional spectral features, provided in another preferred embodiment of this application;
Figure 3 is a flowchart of a deep learning composite video processing method based on three-dimensional spectral features that employs a multi-scale spatiotemporal window fusion strategy, provided in another preferred embodiment of this application;
Figure 4 is a block diagram of a deep learning composite video processing system based on three-dimensional spectral features, provided in a preferred embodiment of this application;
Figure 5 is a block diagram of a deep learning composite video processing system based on three-dimensional spectral features, provided in another preferred embodiment of this application;
Figure 6 is a block diagram of a neural network structure provided in a preferred embodiment of this application;
Figure 7 is a block diagram of a two-stream neural network structure provided in a preferred embodiment of this application;
Figure 8 is a flowchart of composite video signal acquisition provided in a preferred embodiment of this application;
Figure 9 is a spatial alignment and zero-padding diagram provided in a preferred embodiment of this application;
Figure 10 shows the processing effect of a traditional 1D (spatial) filter on a zone plate test pattern;
Figure 11 shows the processing effect of a traditional 2D (spatial) filter on a zone plate test pattern;
Figure 12 shows the processing effect of a traditional 3D (spatial-temporal domain) filter on a zone plate test pattern;
Figure 13 shows the processing effect of the method proposed in this invention on the zone plate test pattern.
Detailed Implementation
[0024] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0025] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0026] Definitions: The composite video signal (CVBS, Composite Video Baseband Signal) is a video format that transmits a mixture of luminance (Y), chrominance (C), and synchronization signals on the same signal line.
[0027] The National Television System Committee (NTSC) standard is an analog color television broadcasting standard primarily used in North America, Japan, and other regions. Its features include 525 scan lines, a field rate of 60 fields per second, and a chroma subcarrier frequency of approximately 3.579545 MHz.
[0028] Luminance / Chrominance Separation (Y / C Separation): This refers to the signal processing procedure that decomposes a composite video signal (CVBS) into independent luminance (Y) and chrominance (C) components.
[0029] Color subcarrier: In analog color television systems, this is a sinusoidal carrier wave used to carry chrominance information. In the NTSC standard, the chrominance signal is loaded onto this subcarrier via quadrature amplitude modulation (QAM).
[0030] Three-Dimensional Fast Fourier Transform (3D FFT) is an efficient discrete Fourier transform algorithm used to transform video signal data blocks from the spatiotemporal domain (horizontal, vertical, and time) to the three-dimensional frequency domain in order to analyze the energy distribution characteristics of the signal in three dimensions.
[0031] The overlap-add method is a digital signal processing technique used to reconstruct continuous time-domain signals from segmented frequency-domain data. It eliminates boundary effects introduced by segmentation by preserving overlapping regions during segmentation and adding the overlapping portions during reconstruction.
[0032] As shown in Figure 1 and Figure 2, a deep learning composite video processing method based on three-dimensional spectral features is provided, including the following steps: Step S1: Sample or resample the composite video signal. Preferably, the composite video signal is in CVBS, NTSC, or PAL format.
[0033] Step S2: Preprocess the sampled composite video signal; In a preferred embodiment, preprocessing the sampled composite video signal includes the following steps: The sampled composite video signal is subjected to three-dimensional overlapping block processing; for each video block, the DC component of the effective pixel area is calculated and removed, and then zeros are padded in the blanking area and interlaced area, and a three-dimensional window function is applied to obtain the preprocessed video data block.
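The per-block preprocessing in Step S2 can be sketched roughly as follows. This is a minimal NumPy illustration, not the applicant's implementation: the boolean array `valid` (marking effective pixels), the helper names, and the exact window form sin(π(n+0.5)/N) are assumptions introduced here.

```python
import numpy as np

def sine_window_3d(shape):
    """Separable 3D sine window (assumed form): product of sin(pi*(n+0.5)/N) per axis."""
    axes = [np.sin(np.pi * (np.arange(n) + 0.5) / n) for n in shape]
    return axes[0][:, None, None] * axes[1][None, :, None] * axes[2][None, None, :]

def preprocess_block(block, valid):
    """Remove the DC of the valid pixels, zero everything else, then apply the window."""
    block = np.asarray(block, dtype=np.float64)
    dc = block[valid].mean() if valid.any() else 0.0
    centered = np.where(valid, block - dc, 0.0)
    return centered * sine_window_3d(block.shape)

rng = np.random.default_rng(0)
block = rng.normal(loc=0.5, scale=0.1, size=(4, 16, 16))  # (fields, rows, columns)
valid = np.ones(block.shape, dtype=bool)
valid[1::2] = False       # assumption: every other field is treated as missing
valid[:, :, :2] = False   # assumption: the two leftmost columns lie in blanking
out = preprocess_block(block, valid)
```

Computing the mean over the valid pixels only keeps the zero-padded positions from biasing the DC estimate.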
[0034] Step S3: Frequency Domain Transformation. Perform a three-dimensional Fourier transform on the preprocessed video data block to obtain three-dimensional spectrum data.
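As a small illustration of Step S3, the sketch below applies `numpy.fft.fftn` to a toy block containing a single horizontal tone and extracts the amplitude spectrum. The block size and the test tone are assumptions chosen only to make the result easy to check.

```python
import numpy as np

block = np.zeros((4, 16, 16))            # (fields, rows, columns)
n = np.arange(16)
block += np.cos(2 * np.pi * 4 * n / 16)  # horizontal tone at bin 4 of 16 (a quarter of the sampling rate)

spectrum = np.fft.fftn(block)            # complex 3D spectrum, same shape as the block
amplitude = np.abs(spectrum)             # amplitude spectrum used as network input

peak = np.unravel_index(np.argmax(amplitude), amplitude.shape)
```

For a real-valued input the energy appears at the tone's bin and at its conjugate-symmetric partner, so the peak lands on horizontal bin 4 or 12 with zero vertical and temporal frequency.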
[0035] Step S4: Feature Construction and Inference. Extract the amplitude spectrum of the 3D spectral data and construct a mirror spectrum symmetrical about the subcarrier frequency based on the subcarrier frequency characteristics of the composite video signal. Use both the amplitude spectrum and the mirror spectrum as input data to a pre-trained neural network to predict and generate a spectral mask corresponding to the target signal. The architecture of the neural network and the dataset required for training can be flexibly selected based on the application scenario and channel characteristics.
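One way to read the mirror-spectrum construction in Step S4 is as a reflection of the amplitude spectrum about the subcarrier's bin along each axis. The sketch below assumes that reading; the function name, the wrap-around at the spectrum edges, and the `center` bin are hypothetical, since the source does not give the bin position.

```python
import numpy as np

def mirror_spectrum(amplitude, center):
    """Reflect a 3D amplitude spectrum about `center` (bin indices), with wrap-around.

    mirrored[k] = amplitude[(2*center - k) mod N] along each axis.
    """
    idx = [(2 * c - np.arange(n)) % n for c, n in zip(center, amplitude.shape)]
    return amplitude[np.ix_(*idx)]

rng = np.random.default_rng(1)
amplitude = rng.random((4, 16, 16))
center = (1, 4, 4)   # hypothetical subcarrier bin (temporal, vertical, horizontal)
mirrored = mirror_spectrum(amplitude, center)
features = np.stack([amplitude, mirrored])   # two input channels for the network
```

Stacking the two spectra as channels lets a 1×1×1 convolution compare each frequency point with its mirror image directly.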
[0036] Step S5: Signal Reconstruction. The spectral mask and the three-dimensional spectral data are multiplied to obtain the target signal spectrum; a three-dimensional inverse Fourier transform is performed on the target signal spectrum, and a three-dimensional window function is applied. The spatiotemporal video signal is then recovered by overlap-add.
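The reconstruction path can be sketched end to end as below. This is an illustrative NumPy version under stated assumptions: a separable sine window applied both before the FFT and after the inverse FFT, a 50% hop, DC removal omitted for brevity, and `mask_fn` as a hypothetical stand-in for the trained network.

```python
import numpy as np
from itertools import product

def sine_window_3d(shape):
    axes = [np.sin(np.pi * (np.arange(n) + 0.5) / n) for n in shape]
    return axes[0][:, None, None] * axes[1][None, :, None] * axes[2][None, None, :]

def separate(video, mask_fn, block=(4, 16, 16)):
    """Window, FFT, mask, inverse FFT, window again, and overlap-add (50% hop)."""
    hop = tuple(b // 2 for b in block)
    win = sine_window_3d(block)
    out = np.zeros_like(video, dtype=np.float64)
    starts = [range(0, s - b + 1, h) for s, b, h in zip(video.shape, block, hop)]
    for t, y, x in product(*starts):
        sl = (slice(t, t + block[0]), slice(y, y + block[1]), slice(x, x + block[2]))
        spec = np.fft.fftn(video[sl] * win)
        spec = spec * mask_fn(np.abs(spec))        # real mask scales real and imaginary parts
        out[sl] += np.fft.ifftn(spec).real * win   # synthesis window, then accumulate
    return out

rng = np.random.default_rng(2)
video = rng.random((8, 32, 32))
passthrough = separate(video, lambda amp: np.ones_like(amp))  # all-ones mask stand-in
```

With this window and hop, the squared windows of neighbouring blocks sum to one, so an all-ones mask returns the input unchanged wherever blocks fully overlap. That is a convenient sanity check before plugging in a real mask.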
[0037] As shown in Figure 3, in another preferred embodiment, the method further includes multi-scale spatiotemporal window fusion, comprising the following steps: The composite video signal to be processed is input into two parallel preprocessing modules, the static fine branch and the dynamic response branch, to obtain spectral data. The preprocessing of each branch includes: performing three-dimensional overlapping block processing on the sampled composite video signal; for each video block, calculating and removing the DC component of the effective pixel area, then padding with zeros in the blanking area and interlaced area, and applying a three-dimensional window function to obtain the preprocessed video data block.
[0038] The spectral data of the two branches are input into two independently trained convolutional neural networks (i.e., a two-stream neural network, as shown in Figure 7) to obtain the fine mask and the dynamic mask. The motion-guided fusion module dynamically fuses the two masks based on the amount of motion to generate the final mask. Post-processing is performed based on the final mask to output the final image; it includes filtering the original spectrum, applying a three-dimensional window function, and summing the results by overlap-add.
[0039] The principle of multi-scale spatiotemporal window fusion is as follows: Static / Fine Branch: Employs larger 3D data blocks (e.g., 32×32×8). Larger windows offer higher resolution in the frequency domain, enabling precise separation of dense brightness textures near subcarriers and completely eliminating "dots" and "rainbow patterns," but they exhibit hysteresis in response to moving objects.
[0040] Dynamic / Fast Branch: Employs smaller 3D data blocks (e.g., 8×8×2). Smaller windows offer higher sensitivity in the time domain, enabling rapid response to abrupt changes in object edges and avoiding motion blur, but have weaker frequency domain resolution.
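The resolution trade-off between the two branches can be made concrete with a short calculation. The sketch assumes NTSC and a sampling rate of four times the subcarrier frequency, and it considers only the horizontal axis; the function name is illustrative.

```python
FSC_HZ = 3_579_545.0          # NTSC chroma subcarrier frequency
FS_HZ = 4 * FSC_HZ            # assumed sampling rate: four times the subcarrier

def horizontal_bin_width(n_samples, fs_hz=FS_HZ):
    """Spacing between adjacent FFT bins along the horizontal axis, in Hz."""
    return fs_hz / n_samples

fine = horizontal_bin_width(32)   # static/fine branch, 32 samples wide
fast = horizontal_bin_width(8)    # dynamic/fast branch, 8 samples wide
print(f"fine: {fine / 1e3:.1f} kHz per bin, fast: {fast / 1e3:.1f} kHz per bin")
```

Under these assumptions the fine branch resolves frequencies four times more finely than the fast branch, while the fast branch spans a quarter as many fields in time.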
[0041] Adaptive fusion: By detecting the motion properties of local areas, the weights of the two outputs are dynamically adjusted to achieve complementary advantages.
[0042] Detailed operating steps. Step S10: Multi-scale parallel input. The composite video signal to be processed is sent to two parallel preprocessing modules. Branch A undergoes 32×32×8 block partitioning and 3D FFT transformation; Branch B undergoes 8×8×2 block partitioning and 3D FFT transformation (as shown in Figure 3).
[0043] Step S20: Two-stream feature inference. The spectral data of the two branches are input into independently trained neural networks (i.e., a two-stream neural network, as shown in Figure 7). The network architecture adopts the structure described in Embodiment 1, which includes a "spectral attention module". Branch A outputs a fine mask Mask_fine, and branch B outputs a dynamic mask Mask_fast.
[0044] Step S30: Motion-guided fusion. Calculate the motion index V of the current video block (obtainable via frame difference or optical flow). Fuse the branch A block with the 64 branch B blocks within its spatial range using a fusion coefficient α generated from the motion index. When V is small (the block is at rest), Mask_fine is mainly adopted; when V is large (intense motion), Mask_fast is mainly adopted.
[0045] Fusion formula: Mask_final is an α-weighted combination of Mask_fine and Mask_fast.
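A possible form of the motion-guided fusion is a convex blend of the two masks. The sketch below is only an assumption about that form: the frame-difference motion index, the thresholds `v_low` and `v_high`, and the convention that α weights the fast mask are all illustrative choices, not values from the source.

```python
import numpy as np

def motion_index(curr, prev):
    """Mean absolute frame difference of a block (one possible motion index V)."""
    return float(np.mean(np.abs(curr - prev)))

def fusion_coefficient(v, v_low=0.02, v_high=0.10):
    """Map V to alpha in [0, 1]; the thresholds are illustrative assumptions."""
    return float(np.clip((v - v_low) / (v_high - v_low), 0.0, 1.0))

def fuse_masks(mask_fine, mask_fast, alpha):
    """Soft blend: alpha = 0 keeps the fine mask, alpha = 1 keeps the fast mask."""
    return (1.0 - alpha) * mask_fine + alpha * mask_fast

rng = np.random.default_rng(3)
mask_fine = rng.random((8, 32, 32))
mask_fast = rng.random((8, 32, 32))   # assumed already assembled onto the same grid
prev = rng.random((32, 32))
still = fuse_masks(mask_fine, mask_fast, fusion_coefficient(motion_index(prev, prev)))
moving = fuse_masks(mask_fine, mask_fast, fusion_coefficient(motion_index(prev + 0.5, prev)))
```

A soft ramp between the two thresholds avoids the hard mode switch that the description criticises in conventional motion-adaptive filters.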
[0046] Step S40: Signal Reconstruction. The original spectrum is filtered using the fused Mask_final, a 3D IFFT is performed, and the blocks are combined by overlap-add to output the final image.
[0047] As shown in Figure 4, in one embodiment, a deep learning composite video processing system based on three-dimensional spectral features is provided, comprising: a memory, used to store a computer program; and a processor, used to implement the steps of the deep learning composite video processing method based on three-dimensional spectral features described above when executing the computer program.
[0048] As shown in Figure 5, the system also includes: analog-to-digital converters, used to convert analog composite video signals into digital signals; frame memory and line memory, used to provide spatiotemporally delayed signals; a frequency domain transformation module, used to perform the three-dimensional Fourier transform and inverse transform; and a neural network inference module, used to perform spectral mask prediction.
[0049] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the above-described deep learning composite video processing method based on three-dimensional spectral features.
[0050] In a preferred embodiment, as shown in Figure 6, a single-branch neural network is employed; in another embodiment, as shown in Figure 7, a two-stream neural network with two branches is employed.
[0051] The neural network inference module employs a lightweight convolutional neural network, which includes the following core functional modules: The symmetry comparison module (Entry Module) mainly consists of 1×1×1 three-dimensional convolutional layers. It does not focus on neighborhood information, but instead concentrates on point-to-point comparison of the "original amplitude" and "mirror amplitude" in the input channel. By learning the difference or ratio between the two, it makes a preliminary judgment on whether the current frequency point conforms to the symmetry characteristics of the chroma signal.
[0052] The Spectral Attention Module primarily consists of a Global Average Pooling layer, a gated subnetwork (containing two 1×1×1 convolutional layers and a ReLU activation function), and a Sigmoid activation function. It generates a frequency domain attention weight map, focusing on the subcarrier frequency band. The gated subnetwork comprises compressed and restored channels.
[0053] It introduces a "squeeze-and-excitation" mechanism: Squeeze: First, the three-dimensional spectral features are compressed in the spatiotemporal dimension through global average pooling to extract the channel statistical vectors describing the global energy distribution; Excitation: The gating subnetwork automatically learns the dependencies between different feature channels and generates attention weight vectors that reflect the importance of the channels.
[0054] The spectral attention module enables the network to adaptively "focus" on the chroma subcarrier frequency and its sideband region, giving higher weights to these channels containing key chroma information, while automatically suppressing channels with high-frequency noise, thereby significantly improving the model's separation purity under complex texture interference.
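The squeeze-and-excitation path of the spectral attention module can be sketched as a NumPy forward pass. On a pooled (C, 1, 1, 1) tensor a 1×1×1 convolution reduces to a matrix product over channels, which is what the code uses. The channel counts, the random weights, and the omission of biases are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(features, w_squeeze, w_restore):
    """Squeeze-and-excitation over channels for features shaped (C, T, H, W)."""
    squeezed = features.mean(axis=(1, 2, 3))        # global average pooling -> (C,)
    hidden = np.maximum(w_squeeze @ squeezed, 0.0)  # channel compression + ReLU
    weights = sigmoid(w_restore @ hidden)           # channel restoration + Sigmoid
    return features * weights[:, None, None, None], weights

rng = np.random.default_rng(4)
channels, reduced = 16, 4                           # illustrative sizes
features = rng.normal(size=(channels, 4, 16, 16))
w_squeeze = rng.normal(size=(reduced, channels)) * 0.1   # random stand-in weights
w_restore = rng.normal(size=(channels, reduced)) * 0.1
scaled, weights = spectral_attention(features, w_squeeze, w_restore)
```

Each channel is rescaled by a single weight between 0 and 1, so the module can only attenuate or pass a channel, never amplify it.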
[0055] The Context-Aware Body Module primarily consists of cascaded residual blocks (ResBlock3D). Each residual block contains two 3×3×3 convolutional layers, a batch normalization layer (BatchNorm3D), and a LeakyReLU activation function. Furthermore, a skip connection is introduced.
[0056] The context-aware module utilizes a 3×3×3 receptive field to capture the continuity and texture features of the spectrum along the frequency axis. This module can identify the true chroma signal and use the surrounding spectral energy distribution to smooth noise, further correcting the results of symmetry comparisons.
[0057] The head module mainly consists of a 1×1×1 convolutional layer and a sigmoid activation function. It is used to map high-dimensional features back to the [0,1] interval to generate the final three-dimensional spectral mask; a value close to 1.0 indicates that the frequency point is a chromaticity signal, and a value close to 0.0 indicates that it is a luminance signal or noise.
[0058] The model training process in this embodiment aims to minimize the difference between the predicted chromaticity and the true chromaticity. Mask application: during the training forward pass, the network outputs a prediction mask $M$. The mask is multiplied element-wise with the original noisy amplitude spectrum of the input, $|X_{\mathrm{in}}|$, to yield the predicted chromaticity amplitude spectrum: $|\hat{C}| = M \odot |X_{\mathrm{in}}|$. Note: because $|X_{\mathrm{in}}|$ contains channel interference, the mask generated by the network must be able to identify that interference.
[0059] Loss function: the L1 loss is used, $\mathcal{L} = \lVert\, |\hat{C}| - |C_{\mathrm{gt}}| \,\rVert_1$, where $|C_{\mathrm{gt}}|$ is the true amplitude spectrum of the pure chroma signal and $|\hat{C}|$ is the chromaticity amplitude spectrum obtained by multiplying the mask output by the neural network by the original amplitude spectrum. The L1 loss is more robust to outliers and can produce masks with sharper edges and higher sparsity, which is crucial for spectrum separation tasks.
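The training objective can be illustrated with a few lines of NumPy. The averaging over frequency points (rather than summing), the function name, and the toy spectra are assumptions; real amplitude spectra do not add this simply, so the data below only exercise the arithmetic.

```python
import numpy as np

def l1_chroma_loss(mask, input_amplitude, target_amplitude):
    """Mean absolute error between the masked input amplitude and the true chroma amplitude."""
    predicted = mask * input_amplitude
    return float(np.mean(np.abs(predicted - target_amplitude)))

rng = np.random.default_rng(5)
chroma_amp = rng.random((4, 16, 16))
luma_amp = rng.random((4, 16, 16))
input_amp = chroma_amp + luma_amp            # toy stand-in, not a real spectrum
ideal_mask = chroma_amp / input_amp          # mask that recovers the target in this toy case
loss_ideal = l1_chroma_loss(ideal_mask, input_amp, chroma_amp)
loss_ones = l1_chroma_loss(np.ones_like(input_amp), input_amp, chroma_amp)
```

In this toy setup the ideal mask drives the loss to zero, while an all-pass mask leaves the whole luminance amplitude as error.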
[0060] Optimizer: The Adam optimizer is used, with an initial learning rate of 0.001, which is dynamically adjusted as the training rounds increase.
[0061] During the inference phase, the video to be processed undergoes preprocessing, FFT, and feature construction before being fed into the trained convolutional neural network. The mask output by the convolutional neural network is applied directly to the original complex spectrum (both the real and imaginary parts are multiplied by the mask). Finally, the spectrum is transformed back to the spatiotemporal domain using a 3D IFFT, and the same three-dimensional sinusoidal window as before is applied. Overlap-add is then used to eliminate block boundary effects, outputting the final high-fidelity chroma signal. The luminance signal is calculated as $Y = S_{\mathrm{CVBS}} - C$, where $C$ is the pure chroma component (video chroma signal), $S_{\mathrm{CVBS}}$ is the composite video signal, and $Y$ is the video luminance signal.
[0062] The chromaticity C is then coherently demodulated into U and V components.
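Coherent demodulation at a sampling rate of four times the subcarrier can be sketched as follows. The convention that U rides on the sine and V on the cosine, the four-tap moving-average low-pass filter, and the function name are assumptions made for the illustration.

```python
import numpy as np

def demodulate_uv(chroma, phase=0.0, taps=4):
    """Coherent demodulation of chroma sampled at 4x the subcarrier frequency."""
    n = np.arange(chroma.shape[-1])
    angle = np.pi * n / 2 + phase                 # 2*pi*f_sc*n/f_s with f_s = 4*f_sc
    kernel = np.ones(taps) / taps                 # crude low-pass: one subcarrier period
    u = np.convolve(2 * chroma * np.sin(angle), kernel, mode="valid")
    v = np.convolve(2 * chroma * np.cos(angle), kernel, mode="valid")
    return u, v

n = np.arange(64)
u_true, v_true = 0.3, -0.2                        # constant test colour
chroma = u_true * np.sin(np.pi * n / 2) + v_true * np.cos(np.pi * n / 2)
u, v = demodulate_uv(chroma)
```

Multiplying by the reference carrier shifts each component to baseband plus a term at twice the subcarrier, and averaging over one subcarrier period cancels that term.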
[0063] To train the convolutional neural network described in this invention, a paired dataset containing "composite video signal (Input)" and "pure chroma signal (Target)" needs to be constructed first. Considering the channel interference present in actual composite video signal transmission, this embodiment employs an anti-interference data augmentation strategy, the specific steps of which are as follows: Step S100: Dataset Construction and Data Augmentation.
[0064] S101. Base sample generation, as shown in Figure 8: a set of 88 high-resolution progressive-scan component videos was selected as source material. The videos have rich texture detail and include both moving and still content. From each video, 20 consecutive frames were randomly selected, and the selections were concatenated so that the sequence switches to a different video every 20 frames.
[0065] According to the standard composite video signal encoding method, the YUV or RGB components are modulated into a composite video signal sampled at four times the subcarrier frequency (CVBS), while retaining the pure chroma component (C) as the ground truth.
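[0066] $S_{\mathrm{CVBS}}(t) = Y(t) + C(t)$; $C(t) = U(t)\sin(2\pi f_{sc} t + \theta) + V(t)\cos(2\pi f_{sc} t + \theta)$; where $C$ is the pure chroma component (video chroma signal), $S_{\mathrm{CVBS}}$ is the composite video signal, $Y$ is the video luminance signal, $U$ and $V$ are the color-difference components, $f_{sc}$ is the chroma subcarrier frequency, and $\theta$ is the phase rotation angle.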
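A toy encoder for generating training pairs might look like the sketch below. It assumes sampling at four times the subcarrier, U on the sine and V on the cosine, and it leaves out sync, blanking, and the line-to-line subcarrier phase advance of a real NTSC or PAL encoder. All names and array sizes are illustrative.

```python
import numpy as np

def encode_composite(y, u, v, phase=0.0):
    """Modulate U/V onto a subcarrier sampled at 4x f_sc and add luminance."""
    n = np.arange(y.shape[-1])
    angle = np.pi * n / 2 + phase
    chroma = u * np.sin(angle) + v * np.cos(angle)
    return y + chroma, chroma                       # (CVBS, ground-truth C)

rng = np.random.default_rng(6)
y = rng.random((4, 16, 64))                         # (fields, lines, samples), toy sizes
u = rng.uniform(-0.3, 0.3, size=y.shape)
v = rng.uniform(-0.3, 0.3, size=y.shape)
cvbs, chroma = encode_composite(y, u, v)
```

Returning the chroma alongside the composite signal gives the input and target pair directly, and subtracting one from the other recovers the luminance.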
[0067] S102. Channel interference simulation (data augmentation): To improve the model's robustness in real-world video signal transmission environments, random channel interference simulation is applied to the composite video signal before it is fed into the preprocessing module. Specifically, this includes: Additive white Gaussian noise (AWGN): simulates the thermal noise of a channel. The formula is $S_n(t) = S(t) + n(t)$, $n(t) \sim \mathcal{N}(\mu, \sigma^2)$, where the noise level $\sigma$ is randomly selected from a preset range, $S_n(t)$ is the composite video signal after adding Gaussian white noise, $S(t)$ is the aforementioned channel-interference-free composite video signal, and $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
[0068] Amplitude gain disturbance: simulates signal attenuation or over-amplitude caused by analog transmission cables. The signal amplitude is multiplied by a random factor.
[0069] Through the aforementioned augmentations, the network is forced to learn to extract chroma spectral features from signals with channel interference, rather than simply memorizing specific amplitude values.
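The two disturbances can be sketched as follows. The ranges `sigma_range` and `gain_range`, the zero noise mean, and the order of application (noise first, then gain) are assumptions chosen for illustration; the source only states that the values are drawn at random from preset ranges. The chroma target is left untouched.

```python
import numpy as np

def augment_cvbs(cvbs, rng, sigma_range=(0.0, 0.05), gain_range=(0.8, 1.2)):
    """Apply AWGN and a random amplitude gain to a composite signal.

    sigma_range and gain_range are hypothetical presets; a zero-mean
    Gaussian is assumed for the thermal-noise model.
    """
    cvbs = np.asarray(cvbs, dtype=np.float64)
    sigma = rng.uniform(*sigma_range)            # noise standard deviation
    gain = rng.uniform(*gain_range)              # cable attenuation / over-amplitude
    noisy = cvbs + rng.normal(0.0, sigma, size=cvbs.shape)
    return gain * noisy, sigma, gain

rng = np.random.default_rng(0)
clean = np.ones(1000)
disturbed, sigma, gain = augment_cvbs(clean, rng)
```

Drawing a fresh `sigma` and `gain` per sample is what keeps the network from keying on absolute amplitude.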
[0070] S103: Random selection of sample blocks. Because of the overlapping-block strategy, saving every spectral block would introduce a 50% data repetition rate, leading to high sample redundancy and problems such as overfitting, learning bias, and inefficient training. Therefore, only 5% of the data blocks are randomly retained. To keep the composite video block and the pure chroma spectral block paired consistently, a hash seed controls the random process, giving deterministic, repeatable sampling.
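One way to obtain such deterministic sampling is to derive the keep/drop decision from a hash of the block position, so that the composite block and its chroma counterpart always receive the same decision. The sketch below is an assumption about how this could be done; the function `keep_block`, the key layout and the use of SHA-256 are not taken from the source.

```python
import hashlib

def keep_block(video_id, t, y, x, keep_ratio=0.05, seed=0):
    """Return True for roughly keep_ratio of all block positions.

    The decision depends only on (seed, video_id, t, y, x), so calling it
    for the composite block and for the chroma block at the same position
    gives the same answer on every run.
    """
    key = f"{seed}:{video_id}:{t}:{y}:{x}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return value < keep_ratio

# Count how many positions of a hypothetical clip survive the 5% filter.
positions = [(t, y, x) for t in range(20)
             for y in range(0, 320, 8) for x in range(0, 200, 8)]
kept = sum(keep_block("clip_a", t, y, x) for t, y, x in positions)
```

Changing `seed` produces a different but equally repeatable subset.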
[0074] Step S200: Data preprocessing. The generated paired data is preprocessed to meet the input requirements of the convolutional neural network. Blocking and DC removal: the composite video is divided into overlapping blocks of size T×H×W (e.g., 4×16×16), each covering a spatially contiguous 16×16 square region over four temporally consecutive fields. Adjacent blocks overlap by 50%, which corresponds to traversing the entire composite video with a spatial step of 8 samples and a temporal step of 2 fields. The mean of the effective pixel region is calculated and subtracted, which normalizes the signal to zero mean and thereby removes the DC component. Spatial alignment and zero padding: as shown in Figure 9, the blanking region in the video block and the regions belonging to fields that are not of the same parity as the current block are filled with 0.0, to achieve spatial alignment and eliminate the edge step noise caused by data truncation.
[0075] Three-dimensional window function: Apply a three-dimensional sinusoidal window to suppress spectral leakage.
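The preprocessing steps can be sketched as below, assuming the video is held as a `(fields, lines, samples)` array and that a boolean `valid` mask marks the effective pixels (False for blanking and for lines of the other field). The function names and the exact sine-window definition `sin(pi*(n+0.5)/N)` are illustrative assumptions.

```python
import numpy as np

def sine_window_3d(shape):
    """Separable 3D sine window: product of sin(pi*(n+0.5)/N) along each axis."""
    w = [np.sin(np.pi * (np.arange(n) + 0.5) / n) for n in shape]
    return w[0][:, None, None] * w[1][None, :, None] * w[2][None, None, :]

def extract_blocks(video, block=(4, 16, 16), step=(2, 8, 8)):
    """Yield (origin, block) pairs with 50% overlap along every axis."""
    T, H, W = video.shape
    for t in range(0, T - block[0] + 1, step[0]):
        for y in range(0, H - block[1] + 1, step[1]):
            for x in range(0, W - block[2] + 1, step[2]):
                yield (t, y, x), video[t:t + block[0], y:y + block[1], x:x + block[2]]

def preprocess_block(blk, valid=None):
    """Remove the mean of the valid pixels, zero the rest, then apply the window."""
    blk = np.asarray(blk, dtype=np.float64)
    if valid is None:
        valid = np.ones(blk.shape, dtype=bool)
    mean = blk[valid].mean() if valid.any() else 0.0
    centered = np.where(valid, blk - mean, 0.0)   # DC removal + zero padding
    return centered * sine_window_3d(blk.shape)
```

For an 8×32×32 input, `extract_blocks` yields 3×3×3 = 27 blocks of shape 4×16×16.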
[0076] Step S300: Frequency-domain transformation. A 3D FFT converts the spatiotemporal-domain signal into three-dimensional spectral data, from which the amplitude spectrum is extracted.
[0077] Step S400: Symmetry feature construction. The mirror spectrum is obtained by symmetrically flipping the original amplitude spectrum about the center of the subcarrier frequency in the horizontal, vertical, and time dimensions.
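A minimal sketch of the transform using `numpy.fft.fftn` is given below. The test tone is a toy assumption made only so that the peak location is easy to predict: it advances a quarter cycle per horizontal sample and inverts on every line, with no change from field to field.

```python
import numpy as np

def amplitude_spectrum(block):
    """Return the complex 3D spectrum of a (T, Y, X) block and its magnitude."""
    spectrum = np.fft.fftn(block, axes=(0, 1, 2))
    return spectrum, np.abs(spectrum)

# Toy subcarrier-like tone on a 4 x 16 x 16 block.
t, y, x = np.meshgrid(np.arange(4), np.arange(16), np.arange(16), indexing="ij")
tone = np.cos(0.5 * np.pi * x + np.pi * y)
spectrum, mag = amplitude_spectrum(tone)

# The energy sits in two conjugate bins: (t=0, y=8, x=4) and (t=0, y=8, x=12).
peak = mag[0, 8, 4]
```

The complex `spectrum` is kept alongside the magnitude because the phase is needed again when a mask is applied and the signal is transformed back.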
[0078] This embodiment does not feed the raw spectrum directly into the network, but instead explicitly constructs physical prior features: 1. Define the center of symmetry: according to the composite video standard, the chroma subcarrier is located at the high end of the horizontal frequency axis, at the Nyquist limit of the vertical frequency axis, and at a specific fundamental frequency position on the temporal frequency axis.
[0079] 2. Construct the mirror spectrum. Horizontal flip: for each horizontal frequency index, calculate its symmetric position with respect to the subcarrier frequency.
[0080] Table 1. X (horizontal) axis mapping relationship. The entry in parentheses in the table marks the subcarrier center.
[0081] Vertical flip: for each vertical frequency index, calculate its symmetric position with respect to the subcarrier frequency.
[0082] Table 2. Y (vertical) axis mapping relationship. The entry in parentheses in the table marks the subcarrier center.
[0083] Temporal flip: for each temporal frequency index, calculate its symmetric position with respect to the subcarrier frequency.
[0084] Table 3. t (time) axis mapping relationship. The entry in square brackets in the table marks the subcarrier center.
[0085] Using the above index mappings, the original amplitude spectrum is mirror-flipped to obtain the mirror amplitude spectrum.
[0086] 3. Feature stacking: the original amplitude spectrum and the mirror amplitude spectrum are concatenated along the channel dimension to form a network input tensor of shape [Batch, 2, T, X, Y].
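The flip and stacking can be sketched with circular index arithmetic: an index k reflected about a center c maps to (2c − k) mod N. The center indices used below are placeholders, since the mapping tables are not reproduced in this text; the function names are likewise hypothetical.

```python
import numpy as np

def mirror_index(n, center):
    """Index map that reflects 0..n-1 about `center` on a circular axis."""
    return (2 * center - np.arange(n)) % n

def build_network_input(mag, centers):
    """Stack an amplitude spectrum with its mirror about the subcarrier center.

    mag     : (T, X, Y) amplitude spectrum of one block
    centers : (ct, cx, cy) bin indices of the subcarrier (assumed values)
    returns : array of shape (1, 2, T, X, Y)
    """
    idx = [mirror_index(n, c) for n, c in zip(mag.shape, centers)]
    mirrored = mag[np.ix_(*idx)]                    # flip all three axes at once
    return np.stack([mag, mirrored], axis=0)[None, ...]

rng = np.random.default_rng(0)
mag = rng.random((4, 16, 16))
centers = (1, 4, 8)                                 # hypothetical subcarrier bin
net_in = build_network_input(mag, centers)
```

Because the reflection is its own inverse, mirroring the mirrored channel returns the original spectrum, and the bin at `centers` is left where it is.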
[0087] Output effect comparison: as shown in Figure 10, when a traditional 1D (spatial-domain) filter processes a zone plate test pattern, the image is filled with cross-color interference (rainbow patterns), and the high-frequency attenuation of the luminance plane is severe.
[0088] As shown in Figure 11, when a traditional 2D (spatial-domain) filter processes the zone plate test pattern, cross-color interference is reduced compared with the 1D filter but is still widespread. There is dot interference at the boundaries of color blocks, and luminance discontinuities appear at the cross-color locations.
[0089] As shown in Figure 12, when a traditional 3D (spatiotemporal-domain) filter processes the zone plate test pattern, cross-color interference is reduced compared with the 2D filter but is still widespread, with dot interference at the boundaries of color blocks.
[0090] As shown in Figure 13, when the method proposed in this embodiment processes the zone plate test pattern, there is no dot interference and no cross-color interference.
[0091] The deep learning-based composite video processing method and system based on three-dimensional spectral features proposed in this invention have the following technical advantages: 1. A frequency-domain deep learning architecture based on three-dimensional physical priors is constructed. It addresses the persistent "cross-color" and "dot" artifacts that traditional algorithms produce because they rely on hard thresholds, and achieves high-precision soft separation. The architecture maps signals from the spatiotemporal domain to the three-dimensional frequency domain and explicitly constructs "full-dimensional physical symmetry features" about the subcarrier frequency as the network input. Exploiting the sparsity of NTSC signals in the three-dimensional frequency domain, the spectrum is extracted by a three-dimensional Fourier transform. Unlike conventional end-to-end learning, this invention explicitly calculates the three-dimensional symmetry indices from the modulation principle of NTSC signals. The network receives the "original amplitude spectrum" and the "mirror amplitude spectrum" as dual-channel inputs, which turns the complex signal separation problem into a "nonlinear symmetry comparison" problem that the deep network can resolve with its nonlinear fitting capability.
[0092] 2. This invention employs a convolutional neural network to address the image quality loss and artifacts caused by motion-adaptive switching, achieving seamless integration of dynamic and static scenes. It uses a frequency-domain attention weighting mechanism: a spectral attention module is embedded in the network after the symmetry comparison module. The module learns the energy distribution patterns of the spectrum and automatically generates a three-dimensional attention weight map that adaptively focuses on the chroma subcarrier frequency band. The weight map assigns higher weights to the chroma subcarrier (3.58 MHz) and its sideband regions, while suppressing DC and aliasing regions far from the subcarrier. When energy is detected at a specific frequency, the attention mechanism amplifies the feature response of that region, helping the network capture weak chroma signals more sensitively and suppress background noise.
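The claims describe the attention module as global average pooling, a two-layer gated subnetwork with ReLU, and a sigmoid. A forward pass of that squeeze-and-excitation pattern can be sketched in NumPy as below. The weights are random placeholders, the channel counts are arbitrary, and the 1×1×1 convolutions on the pooled vector are written as matrix products. Note that this form yields one weight per channel; producing a weight per frequency point would need an additional head that the sketch does not attempt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(features, w_reduce, w_expand):
    """Channel attention over 3D spectral features.

    features : (C, T, X, Y) feature maps from the main path
    w_reduce : (C_r, C) compression weights (hypothetical, normally learned)
    w_expand : (C, C_r) restoration weights (hypothetical, normally learned)
    """
    squeezed = features.mean(axis=(1, 2, 3))          # global average pooling -> (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)     # compress + ReLU
    weights = sigmoid(w_expand @ hidden)              # restore + sigmoid, in (0, 1)
    return features * weights[:, None, None, None], weights

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 16, 16))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
attended, weights = squeeze_excite(feats, w1, w2)
```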
[0093] 3. A multi-scale spatiotemporal frequency window fusion strategy is adopted. A parallel processing architecture uses 3D FFT windows of different sizes to extract high-frequency texture features and fast-motion features respectively, and then fuses them adaptively. Two parallel branches are constructed: Branch A uses a large 3D window (32×32×8) to provide very high frequency resolution for separating fine static textures; Branch B uses a small 3D window (8×8×2) to provide very high temporal resolution for capturing fast motion. Finally, a gated fusion unit dynamically weights and combines the two sets of features according to the motion attributes of local pixels, so that both static and dynamic scenes are handled well. On the training side, an edge consistency constraint loss function is introduced: in addition to minimizing pixel error (L1 loss), a regularization term penalizes the prediction differences between adjacent data blocks in their overlapping region. This forces the network to learn a globally smooth feature representation, making the reconstructed video spatiotemporally continuous and free of jumps.
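The training objective can be illustrated for a single pair of horizontally adjacent blocks. The sketch assumes the predictions are compared after being transformed back to the spatiotemporal domain; the weighting factor `lam` and the function name are hypothetical, and the source does not state the exact form of the regularization term.

```python
import numpy as np

def edge_consistency_loss(pred_a, pred_b, target_a, target_b, overlap, lam=0.1):
    """L1 pixel error plus a penalty on disagreement in the shared columns.

    pred_a, pred_b : predictions for two horizontally adjacent blocks, (T, H, W)
    overlap        : number of columns the two blocks share
    lam            : weight of the seam term (assumed value)
    """
    l1 = 0.5 * (np.abs(pred_a - target_a).mean() + np.abs(pred_b - target_b).mean())
    seam = np.abs(pred_a[..., -overlap:] - pred_b[..., :overlap]).mean()
    return l1 + lam * seam

rng = np.random.default_rng(0)
signal = rng.standard_normal((4, 16, 24))
block_a, block_b = signal[..., :16], signal[..., 8:]      # 8 shared columns

perfect = edge_consistency_loss(block_a, block_b, block_a, block_b, overlap=8)
shifted = edge_consistency_loss(block_a, block_b + 0.5, block_a, block_b, overlap=8)
```

With perfect predictions both terms vanish; offsetting one block raises the pixel term and the seam term together.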
[0094] By using energy difference as a strong criterion in a frequency-domain deep learning architecture based on three-dimensional physical priors, and combining it with the frequency-domain attention weighting mechanism, the convolutional neural network can "focus" on subcarrier features, eliminate stubborn interference that traditional filters cannot handle, and output a clean chroma signal while preserving high-frequency details. This eliminates "rainbow patterns" and "dots" and achieves pixel-level high-fidelity separation.
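How a predicted mask becomes an output signal follows the reconstruction step recited in claim 1: the mask multiplies the 3D spectrum, the result is inverse-transformed, windowed again and overlap-added. The sketch below wires these steps together with a stand-in `predict_mask` callable in place of the trained network. It assumes the sine window `sin(pi*(n+0.5)/N)` for both analysis and synthesis and omits DC removal for brevity.

```python
import numpy as np

BLOCK, STEP = (4, 16, 16), (2, 8, 8)

def sine_window_3d(shape):
    w = [np.sin(np.pi * (np.arange(n) + 0.5) / n) for n in shape]
    return w[0][:, None, None] * w[1][None, :, None] * w[2][None, None, :]

def block_origins(shape):
    return [(t, y, x)
            for t in range(0, shape[0] - BLOCK[0] + 1, STEP[0])
            for y in range(0, shape[1] - BLOCK[1] + 1, STEP[1])
            for x in range(0, shape[2] - BLOCK[2] + 1, STEP[2])]

def separate(video, predict_mask):
    """Window -> FFT -> mask -> inverse FFT -> window -> overlap-add."""
    win = sine_window_3d(BLOCK)
    out = np.zeros(video.shape)
    for t, y, x in block_origins(video.shape):
        sl = (slice(t, t + BLOCK[0]), slice(y, y + BLOCK[1]), slice(x, x + BLOCK[2]))
        spectrum = np.fft.fftn(video[sl] * win)
        mask = predict_mask(np.abs(spectrum))       # stand-in for the CNN
        # Taking the real part assumes a mask that keeps conjugate symmetry.
        piece = np.fft.ifftn(mask * spectrum).real
        out[sl] += piece * win
    return out

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32))
passthrough = separate(video, lambda mag: np.ones_like(mag))
```

With an all-ones mask the interior of the output reproduces the input, because squared sine windows at 50% overlap sum to one along each axis; samples near the borders are covered by fewer blocks and are attenuated.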
[0095] With the multi-scale fusion strategy, features of different spatiotemporal resolutions are processed in parallel. This avoids the hard switching that existing technologies require between 2D mode (sacrificing sharpness for anti-ghosting) and 3D mode (sacrificing anti-ghosting for sharpness), which easily produces jagged edges or ghosting. The system thus has very high frequency resolution (no dots) when processing complex static textures and very fast transient response (no trailing) when processing fast-moving objects, giving a smooth, seamless transition between static and dynamic scenes.
[0096] This invention is particularly suitable for scenarios with poor signal quality, such as FPV analog video transmission and aging surveillance cabling. It improves robustness to non-standard and noisy signals, and outputs high-quality, temporally stable, spatiotemporally continuous video streams with reduced blockiness and no edge halos.
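The soft transition can be illustrated with a per-pixel gate. This is a simplification: the claims fuse the two branch masks, whereas the sketch blends the two branch outputs in the pixel domain, because masks computed with different window sizes do not share a frequency grid. The motion measure, `threshold` and `sharpness` are hypothetical.

```python
import numpy as np

def gated_fusion(static_out, dynamic_out, motion, threshold=0.5, sharpness=20.0):
    """Blend the fine (static) and fast (dynamic) branch outputs per pixel.

    motion : non-negative per-pixel motion measure, e.g. a normalized
             field-to-field absolute difference (assumed definition)
    The logistic gate moves smoothly from 0 (use static branch) to
    1 (use dynamic branch) instead of switching at a hard threshold.
    """
    gate = 1.0 / (1.0 + np.exp(-sharpness * (motion - threshold)))
    return gate * dynamic_out + (1.0 - gate) * static_out, gate

static_out = np.zeros(3)
dynamic_out = np.ones(3)
motion = np.array([0.0, 0.5, 1.0])       # still, borderline, fast
fused, gate = gated_fusion(static_out, dynamic_out, motion)
```

At the borderline value the two branches contribute equally, which is what removes the visible seam of a hard 2D/3D switch.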
[0097] Compared to general end-to-end neural networks, the method and system of this invention do not require convolutional neural networks to learn the "basic principles of signal processing." Instead, they directly feed physical rules into the network through three-dimensional Fourier transform and explicit feature construction. This allows the network to achieve state-of-the-art (SOTA) performance with only a very small number of layers (lightweight), making it valuable for real-time deployment on FPGAs, embedded chips, or mobile terminals.
[0098] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0099] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0100] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0101] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0102] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
Claims
1. A deep learning-based composite video processing method based on three-dimensional spectral features, characterized in that it comprises: Sampling: sampling or resampling a composite video signal; Preprocessing: preprocessing the sampled composite video signal to obtain preprocessed video data blocks, the preprocessing including: performing three-dimensional overlapping block division on the sampled composite video signal; calculating and removing the DC component of the effective pixel area for each video block; padding the blanking region and the interlaced region with zeros; and applying a three-dimensional window function; Frequency domain transformation: performing a three-dimensional Fourier transform on the preprocessed video data blocks to obtain three-dimensional spectrum data; Feature construction and inference: extracting the amplitude spectrum of the three-dimensional spectrum data, and constructing a mirror spectrum symmetrical about the subcarrier frequency based on the subcarrier frequency characteristics of the composite video signal; inputting the amplitude spectrum and the mirror spectrum into a pre-trained convolutional neural network to generate a spectrum mask corresponding to the target signal; Signal reconstruction: multiplying the spectrum mask by the three-dimensional spectrum data to obtain the target signal spectrum; performing a three-dimensional inverse Fourier transform on the target signal spectrum, applying a three-dimensional window function, reconstructing the spatiotemporal video signal by overlap-add, and outputting the final image.
2. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 1, characterized in that the convolutional neural network includes: a symmetry comparison module, used to perform point-to-point comparison between the amplitude spectrum and the mirror spectrum, and to preliminarily determine whether the current frequency point conforms to the symmetry characteristics of the chroma signal by learning the difference or ratio between the two; a spectrum attention module, used to generate a frequency domain attention weight map that focuses on the subcarrier frequency band; a context-aware module, used to extract local contextual features of the spectrum; and a mask prediction module, used to output a normalized 3D spectral mask.
3. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 2, characterized in that the spectrum attention module includes a global average pooling layer, a gated subnetwork, and a sigmoid activation function, and employs a squeeze-and-excitation mechanism: by using global average pooling, the three-dimensional spectral features are compressed in the spatiotemporal dimensions to extract channel statistical vectors describing the global energy distribution; by using the gated subnetwork, the dependencies between different feature channels are automatically learned through channel compression and channel restoration, generating attention weight vectors that reflect the importance of the channels; the sigmoid activation function is used to generate normalized weights that reflect the importance of each frequency point; and the weights are dynamically applied to the main path features by a multiplier to achieve adaptive focusing on the chroma subcarrier frequency.
4. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 3, characterized in that the gated subnetwork includes two convolutional layers and a ReLU activation function; the gated subnetwork consists of a compression channel and a recovery channel.
5. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 1, characterized in that it further includes a multi-scale spatiotemporal window fusion step: the composite video signal to be processed is input into two parallel preprocessing modules, namely a static fine branch and a dynamic response branch, to obtain spectral data; the spectral data of the two branches are input into two independently trained convolutional neural networks to obtain a fine mask and a dynamic mask; the two masks are dynamically fused based on the amount of motion to generate the final mask; the original spectrum is filtered based on the final mask, a three-dimensional window function is applied, and the final image is output by overlap-add.
6. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 1, characterized in that the mirror spectrum is obtained by symmetrically flipping the original amplitude spectrum about the center of the subcarrier frequency in the horizontal, vertical, and time dimensions.
7. The deep learning-based composite video processing method based on three-dimensional spectral features according to claim 1, characterized in that the composite video signal is in composite video broadcast signal (CVBS), NTSC, or PAL format.
8. A deep learning-based composite video processing system based on three-dimensional spectral features, characterized in that it comprises: a memory, used to store a computer program; and a processor, configured to execute the computer program to implement the steps of the deep learning composite video processing method based on three-dimensional spectral features as described in any one of claims 1 to 7.
9. The deep learning composite video processing system based on three-dimensional spectral features according to claim 8, characterized in that it further includes: an analog-to-digital converter, used to convert analog composite video signals into digital signals; a frame memory and a line memory, used to provide spatiotemporal delay signals; a frequency domain transformation module, used to perform the three-dimensional Fourier transform and its inverse; and a neural network inference module, used to perform spectral mask prediction.
10. The deep learning composite video processing system based on three-dimensional spectral features according to claim 9, characterized in that the neural network inference module is a convolutional neural network.