Neural network-based illicit video processing method, system, device and medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA RES INST OF FILM SCI & TECH
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from perspective distortion, image jitter, and edge occlusion when processing illegally recorded videos, resulting in low success rates and accuracy of digital watermark extraction. Existing methods such as feature point matching, optical flow, and manual keyframe marking are inefficient and have poor robustness.
A deep convolutional neural network is used to construct a screen corner detection neural network, which automatically detects the four corners of the screen in the illegally recorded video, and eliminates image jitter and distortion through temporal smoothing and perspective correction, so as to achieve stable detection and correction of the screen area.
The method fully automates detection of the four screen corners, which improves processing efficiency, and its detection accuracy exceeds that of traditional feature point matching and edge detection. It is suitable for high-precision watermark extraction, is robust and adaptable, and supports incremental training to refine model accuracy.
Figure CN122243831A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the fields of computer vision and video processing technology, specifically to a method, system, device, and medium for processing illegally recorded videos based on neural networks.
Background Technology
[0002] With the rapid development of the film and television industry and the continuous improvement of the digital copyright protection system, using digital watermarking for tracing the source of films that have been illegally recorded during their theatrical release has become a mainstream method and standard procedure. Digital watermarking involves embedding invisible tracing information into the screen of the film being shown. After obtaining the illegally recorded video, the watermark information can be extracted to determine key tracing data such as the cinema, screening room, and screening time where the illegal recording occurred.
[0003] However, in practical applications, illegally recorded videos often have quality defects that affect the success rate and accuracy of digital watermark extraction, mainly in the following aspects: First, perspective distortion. Piracy often involves handheld shooting from an angle that is not directly facing the screen, resulting in severe geometric distortions such as trapezoidal or irregular quadrilateral shapes in the recorded video. The four edges of the screen area cannot maintain parallelism and perpendicularity, distorting the image content and directly preventing watermark extraction algorithms from locating effective watermark areas.
[0004] Second, image shakiness. During the recording process, the person filming cannot maintain the absolute stability of the filming equipment. Even slight shaking of the handheld device, breathing tremors, and body movements can cause continuous and irregular inter-frame displacement and shaking in the recorded video, which disrupts the spatial continuity and temporal stability of the watermark information and significantly reduces the robustness of watermark extraction.
[0005] Third, edge occlusion. In a movie theater, there are often obstructing objects such as audience heads, hands, seats, and other filming equipment, which can easily cause partial occlusion of the screen's edge area in the recorded video, making it impossible for traditional image algorithms to fully identify the effective screen area.
[0006] Based on the above issues, before performing digital watermark extraction on illegally recorded videos, the original illegally recorded videos must be processed in a professional and standardized manner. The core objective is to eliminate perspective distortion, suppress image jitter, remove edge occlusion interference, and restore a stable, regular, and distortion-free screen image. This provides high-quality input data for subsequent digital watermark extraction, ensuring the success rate, accuracy, and robustness of the watermark extraction.
[0007] Current technologies for video stabilization and distortion correction mainly include feature point matching, optical flow, manual keyframe marking, and traditional edge detection. However, each method has significant technical shortcomings and application limitations. For example, feature point matching is difficult to extract features in low-light and low-contrast scenes and has poor robustness; optical flow is computationally intensive, has poor real-time performance, and is sensitive to large motions; manual keyframe marking requires manual intervention frame by frame, which is extremely inefficient; and traditional edge detection has poor robustness to complex backgrounds, content, and lighting changes.
[0008] Therefore, there is an urgent need for an intelligent processing method that can automatically identify the screen area and perform jitter stabilization and perspective correction.
Summary of the Invention
[0009] To address the aforementioned issues, this application provides a neural network-based method, system, device, and medium for processing illegally recorded videos. A deep convolutional neural network detects the four corners of the screen in the illegally recorded video automatically, accurately, and in real time. Based on the detection results, temporal smoothing and perspective correction are performed to eliminate image jitter and distortion.
[0010] The technical solution adopted in this application is as follows. Firstly, this application provides a neural network-based method for processing illegally recorded videos, including:
building and training a screen four-corner detection neural network, and loading the trained screen four-corner detection neural network onto the GPU;
sequentially and continuously reading each frame of the video stream, processing the frame into an RGB format image of a preset size, and transmitting it to the GPU;
on the GPU, using the screen four-corner detection neural network to run prediction on the image, and transmitting the predicted four corner point coordinates to the CPU;
on the CPU, performing coordinate transformation and temporal smoothing on the predicted four corner point coordinates, and performing perspective transformation on the corresponding image based on the processed coordinates to obtain a distortion-corrected image; and
sequentially and continuously outputting the distortion-corrected frames of the video stream.
[0011] Secondly, this application also provides a neural network-based system for processing illegally recorded videos, comprising:
a model providing unit, used to build and train the screen four-corner detection neural network and to load the trained network onto the GPU;
an image providing unit, used to sequentially and continuously read each frame of the video stream, process the frame into an RGB format image of a preset size, and transmit it to the GPU;
a coordinate prediction unit, used on the GPU to run prediction on the image with the screen four-corner detection neural network and to transmit the predicted four corner point coordinates to the CPU;
a distortion correction unit, used on the CPU to perform coordinate transformation and temporal smoothing on the predicted four corner point coordinates, and to perform perspective transformation on the corresponding image based on the processed coordinates to obtain a distortion-corrected image; and
an output unit, used to sequentially and continuously output the distortion-corrected frames of the video stream.
[0012] Thirdly, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the above-described neural network-based method for processing illegally recorded videos.
[0013] Fourthly, this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described neural network-based method for processing illegally recorded videos.
[0014] The above-mentioned technical solution adopted in this application can achieve the following beneficial effects: The screen corner detection neural network achieves fully automatic detection of the four corners of the screen, eliminating the need for manual frame-by-frame labeling, thus reducing preprocessing costs and improving processing efficiency. Trained on a specialized dataset and using a combined loss function, the neural network achieves a corner point coordinate prediction error of less than 10 pixels, demonstrating significantly higher detection accuracy than traditional feature point matching or edge detection algorithms, making it suitable for high-precision watermark extraction. With approximately 200MB of weights, the neural network is well-suited for engineering applications.
[0015] It adaptively matches smoothing parameters based on different jitter levels, balancing jitter suppression with image detail preservation, and is universally applicable to various piracy scenarios. It has strong robustness to various interference factors (such as occlusion, lighting changes, etc.) in cinema piracy scenarios, and can stably complete screen area detection and correction. It supports incremental annotation and incremental training of newly added piracy video samples, continuously optimizing model accuracy and constantly improving its adaptability to new piracy scenarios.
Attached Figure Description
[0016] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
Figure 1 is a flowchart of a neural network-based method for processing illegally recorded videos according to an embodiment of this application;
Figure 2 is a schematic diagram of the structure of a screen four-corner detection neural network according to an embodiment of this application;
Figure 3 is a schematic diagram of manual annotation according to an embodiment of this application;
Figure 4 is a schematic diagram of a temporal smoothing algorithm according to an embodiment of this application;
Figure 5 is a schematic diagram of perspective transformation according to an embodiment of this application, wherein (a) shows the image before perspective transformation and (b) shows the image after perspective transformation;
Figure 6 is a flowchart of a neural network-based method for processing illegally recorded videos according to another embodiment of this application;
Figure 7 is a schematic diagram of the structure of a screen distortion correction system according to an embodiment of this application;
Figure 8 is a schematic diagram of the overall architecture of a screen distortion correction system according to an embodiment of this application;
Figure 9 is a schematic diagram of the structure of an electronic device according to an embodiment of this application.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0018] Figure 1 shows a schematic flowchart of a neural network-based method for processing illegally recorded videos according to an embodiment of this application. As shown in Figure 1, the method proposed in this embodiment includes steps S110 to S150.
Step S110: Construct and train the screen four-corner detection neural network, and load the trained screen four-corner detection neural network onto the GPU.
[0019] The core of this step is to build a dedicated neural network model for accurately locating the coordinates of the four corner points of the screen in a video stream. This model is trained and optimized using labeled data to achieve stable and high-precision corner detection capabilities. The trained model is then loaded onto GPU hardware, leveraging the GPU's computational advantages to achieve high-speed inference for corner detection. This provides an accurate foundation for subsequent image distortion correction, ensuring both efficiency and detection accuracy in the video processing workflow.
[0020] In some optional implementations, constructing and training the screen four-corner detection neural network in step S110 includes the following. The screen four-corner detection neural network includes an input layer, an encoding layer, a global average pooling layer, a fully connected regression layer, and an output layer connected in sequence. The input layer receives an RGB format image of size 256×256×3 and passes it to the encoding layer. The encoding layer includes 5 convolutional blocks; each block includes a downsampling convolutional layer with a stride of 2 and a feature extraction convolutional layer with a stride of 1, and each convolutional layer is followed by batch normalization and a ReLU activation function. The encoding layer processes the image into an 8×8×512 feature map. The global average pooling layer compresses the feature map into a 1×1×512 feature vector. The fully connected regression layer includes three linear transformation layers, with a ReLU activation function and a dropout (random deactivation) layer between them to prevent overfitting, and processes the feature vector into an 8-dimensional coordinate vector. The output layer reshapes the coordinate vector into (4,2) format, normalizes it to the [0,1] interval with the Sigmoid activation function, and outputs the predicted coordinates of the four corner points in the order top left, top right, bottom right, bottom left.
[0021] Figure 2 shows a schematic diagram of the structure of a screen four-corner detection neural network according to an embodiment of this application. As shown in Figure 2, the ScreenCornerNet neural network constructed in this embodiment includes an input layer, an encoding layer, a global average pooling layer, a fully connected regression layer, and an output layer connected in sequence. The structure and function of each layer are described in detail below.
[0022] The input layer is used to receive RGB format images with a size of 256×256×3 and transmit the images to the encoder.
[0023] The encoding layer consists of five convolutional blocks. Each block contains two consecutive convolutional operations: the first has a stride of 2 for image downsampling, and the second has a stride of 1 for deep feature extraction. Each convolutional layer is followed by Batch Normalization (BN) and ReLU activation. BN accelerates network training convergence and mitigates internal covariate shift, while ReLU introduces a non-linear transformation to enhance feature representation. The encoding layer ultimately outputs a feature map of size 8×8×512.
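By way of non-limiting illustration, the spatial arithmetic behind the 8×8×512 feature map can be checked with a short sketch. The 3×3 kernel, the padding of 1, and the per-block channel widths (32, 64, 128, 256, 512) are assumptions made for this example; the description fixes only the two strides per block, the number of blocks, and the final 512 channels.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_encoder(size: int = 256, channels=(32, 64, 128, 256, 512)):
    """Return the (height, width, channels) shape after each convolutional block.

    Channel widths are hypothetical except for the final 512.
    """
    shapes = []
    for ch in channels:
        size = conv_out(size, stride=2)  # downsampling convolution, halves the size
        size = conv_out(size, stride=1)  # feature extraction convolution, keeps the size
        shapes.append((size, size, ch))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(trace_encoder(), start=1):
        print(f"block {index}: {shape}")
```

Five halvings take the 256-pixel input side to 256 / 2^5 = 8, which is consistent with the stated output size.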
[0024] The Global Avg Pool is used to compress the 8×8×512 feature map output by the encoding layer into a 1×1×512 feature vector, which significantly reduces the number of parameters in the fully connected layer while preserving global feature information.
[0025] The fully connected regression layer (FC Regression) employs a three-layer sequential linear transformation structure, with the number of neurons decreasing from 512 to 256 to 128 to 8. A ReLU activation function is used to introduce non-linearity between the three linear transformation layers, and a Dropout (0.3) random deactivation layer is added to discard 30% of the neurons, further preventing overfitting. The FC Regression layer ultimately outputs an 8-dimensional coordinate vector, corresponding to the horizontal and vertical coordinates of the four corners of the screen.
[0026] The output layer reshapes the 8-dimensional coordinate vector output by the fully connected regression layer into a two-dimensional matrix in (4,2) format, corresponding to the coordinate information of the four corner points of the screen: top left, top right, bottom right, and bottom left. The coordinate information is normalized to the [0,1] interval using the Sigmoid activation function, achieving a standardized output of the predicted corner point coordinates.
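As a minimal sketch of the output layer only, the following example reshapes eight raw regression values into four (x, y) pairs and applies the Sigmoid function to each value. The interleaved layout (x, y, x, y, ...) is an assumption that follows from a row-major reshape to (4,2); the function and constant names are hypothetical.

```python
import math

# Corner order stated in the description.
CORNER_ORDER = ("top_left", "top_right", "bottom_right", "bottom_left")


def sigmoid(value: float) -> float:
    """Numerically stable logistic function."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def to_corners(raw):
    """Map 8 raw regression outputs to four (x, y) pairs in the [0, 1] interval."""
    if len(raw) != 8:
        raise ValueError("expected an 8-dimensional coordinate vector")
    squashed = [sigmoid(v) for v in raw]
    # Row-major reshape to (4, 2): consecutive values form one (x, y) pair.
    return [(squashed[i], squashed[i + 1]) for i in range(0, 8, 2)]


if __name__ == "__main__":
    corners = to_corners([-2.0, -2.0, 2.0, -2.0, 2.0, 2.0, -2.0, 2.0])
    for name, point in zip(CORNER_ORDER, corners):
        print(name, point)
```

Because the Sigmoid function is applied element by element, applying it before or after the reshape gives the same result.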
[0027] In some optional implementations, step S110 constructs and trains a screen four-corner detection neural network, including: extracting training frames from the training video stream every preset number of frames, manually labeling the normalized four-corner coordinates of the training frames in the order of top left, top right, bottom right, and bottom left, and constructing a training dataset in JSON format; constructing a combined loss function; the combined loss function includes: a basic regression loss used to calculate the error between the predicted four-corner coordinates and the normalized four-corner coordinates, and a corner order constraint loss used to constrain the spatial topological relationship of the four corner points; using a batch size of 8, Adam as the optimizer, an initial learning rate of 0.001, dynamically adjusting the learning rate through the ReduceLROnPlateau strategy, and completing a preset number of training rounds on the GPU to obtain the trained screen four-corner detection neural network.
[0028] To ensure the detection accuracy and generalization ability of the screen four corner detection neural network, this embodiment constructs a professional training dataset to optimize and train the screen four corner detection neural network.
[0029] Videos with different shooting angles, lighting conditions, degrees of obstruction, and film types were selected from a real cinema piracy video library as training video streams. One training frame was extracted every 30 to 60 frames as a training sample, to ensure the diversity and representativeness of the samples and to cover various complex piracy scenarios. Figure 3 shows a schematic diagram of manual annotation according to an embodiment of this application. As shown in Figure 3, professional annotators annotated the four corners of the screen in each extracted training frame, strictly following the fixed order top left, top right, bottom right, bottom left. The annotated corner coordinates were then converted to normalized coordinates within the [0,1] interval, consistent with the output of the screen four-corner detection neural network. The training frames and the corresponding normalized corner coordinates were stored in JSON format to construct a structured training dataset.
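The sampling and annotation storage described above could look roughly like the following sketch. The JSON field names (`image`, `corners`), the file name, and the example pixel coordinates are hypothetical, since the description does not disclose the dataset schema.

```python
import json


def sample_frame_indices(total_frames: int, step: int = 30):
    """Indices of the frames kept as training samples (one every `step` frames)."""
    return list(range(0, total_frames, step))


def normalize_corners(pixel_corners, width: int, height: int):
    """Convert pixel coordinates to the [0, 1] interval, preserving corner order."""
    return [[x / width, y / height] for x, y in pixel_corners]


def make_record(frame_file: str, pixel_corners, width: int, height: int):
    """Build one dataset entry; the field names are assumptions for illustration."""
    return {
        "image": frame_file,
        # Order: top left, top right, bottom right, bottom left.
        "corners": normalize_corners(pixel_corners, width, height),
    }


if __name__ == "__main__":
    record = make_record(
        "frame_000030.jpg",
        [(192, 108), (1728, 162), (1700, 972), (210, 918)],
        width=1920,
        height=1080,
    )
    print(json.dumps([record], indent=2))
```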
[0030] This embodiment constructs a combined loss function that considers both the accuracy of corner coordinate regression and the constraints of corner spatial topology. The combined loss function consists of two parts: the basic regression loss and the corner order constraint loss.
[0031] Basic regression loss: Smooth L1 loss is used as the basic regression loss to calculate the distance error between the predicted four corner point coordinates and the manually labeled normalized four corner point coordinates.
[0032] Corner point order constraint loss: Used to constrain the spatial topological relationship of the four corner points, ensuring that the predicted coordinates of the four corner points conform to the geometric structure rules of the screen. The specific constraints are: x-coordinate of the top left corner point < x-coordinate of the top right corner point, x-coordinate of the bottom left corner point < x-coordinate of the bottom right corner point, y-coordinate of the top left corner point < y-coordinate of the bottom left corner point, and y-coordinate of the top right corner point < y-coordinate of the bottom right corner point.
[0033] Combined loss function: The combined loss function is a weighted sum of the basic regression loss and the corner order constraint loss. The calculation formula can be:
$$L_{total} = L_{base} + 0.1 \times L_{order}$$, formula (1);
where $L_{base}$ represents the basic regression loss, $L_{order}$ represents the corner order constraint loss, and $L_{total}$ represents the combined loss function. The weight of the corner order constraint loss is set to 0.1 to ensure topological constraints while avoiding interference with the dominance of the basic regression loss.
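A minimal sketch of formula (1) in plain Python follows. The Smooth L1 term uses the common definition with a transition point of 1.0. The form of the order constraint term is an assumption: the description lists the four inequalities but not how violations are penalized, so a hinge penalty on each violated inequality is used here for illustration.

```python
def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def base_loss(pred, target) -> float:
    """Mean Smooth L1 over all eight coordinates."""
    terms = [
        smooth_l1(p, t)
        for pred_pt, target_pt in zip(pred, target)
        for p, t in zip(pred_pt, target_pt)
    ]
    return sum(terms) / len(terms)


def order_loss(pred) -> float:
    """Hinge penalty per violated ordering constraint (assumed form)."""
    tl, tr, br, bl = pred  # top left, top right, bottom right, bottom left
    violations = [
        tl[0] - tr[0],  # want x_tl < x_tr
        bl[0] - br[0],  # want x_bl < x_br
        tl[1] - bl[1],  # want y_tl < y_bl
        tr[1] - br[1],  # want y_tr < y_br
    ]
    return sum(max(0.0, v) for v in violations)


def combined_loss(pred, target, order_weight: float = 0.1) -> float:
    """L_total = L_base + 0.1 * L_order, as in formula (1)."""
    return base_loss(pred, target) + order_weight * order_loss(pred)
```

With this form, a prediction that respects all four inequalities incurs no order penalty, so the regression term alone drives training in the common case.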
[0034] To improve the training speed and efficiency of the screen four-corner detection neural network, this embodiment adopts a GPU-accelerated training strategy based on CUDA (Compute Unified Device Architecture).
[0035] Hardware configuration: It adopts NVIDIA series GPUs that support CUDA parallel computing architecture, with a video memory capacity of ≥8GB to ensure the video memory requirements for training and adapt to training tasks with large batches of high-resolution samples.
[0036] Batch training: Set the batch size to 8, load training samples into GPU memory in batches, and use the parallel computing capabilities of the GPU to achieve synchronous training of batch samples, which greatly improves training efficiency.
[0037] Optimizer selection: The Adam adaptive learning rate optimizer is adopted, with the initial learning rate set to 0.001. The Adam optimizer can adaptively adjust the parameter update step size, balancing training speed and convergence accuracy, and is suitable for training deep convolutional neural networks.
[0038] Learning rate scheduling: The ReduceLROnPlateau learning rate dynamic adjustment strategy is adopted. When the validation set loss stops decreasing, the learning rate is automatically reduced to finely adjust the network parameters and improve the model convergence accuracy.
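The reduce-on-plateau behaviour can be illustrated with a simplified stand-in written in plain Python. This is not the PyTorch `ReduceLROnPlateau` implementation; the reduction factor of 0.5, the patience of 3 epochs, and the minimum learning rate are assumptions, as the description states only the initial learning rate of 0.001.

```python
class PlateauScheduler:
    """Cut the learning rate when the monitored validation loss stops improving."""

    def __init__(self, lr: float = 0.001, factor: float = 0.5,
                 patience: int = 3, min_lr: float = 1e-6):
        self.lr = lr
        self.factor = factor      # hypothetical reduction factor
        self.patience = patience  # hypothetical number of tolerated stalled epochs
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    scheduler = PlateauScheduler()
    for epoch, loss in enumerate([0.068, 0.030, 0.030, 0.031, 0.030, 0.030], start=1):
        print(epoch, scheduler.step(loss))
```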
[0039] Training process: Load data into GPU memory → Forward propagation of the network → Calculate the combined loss function → Backward propagation to update network weights → Save training checkpoints, and iterate until the preset number of training rounds (e.g., 50 epochs) is reached.
[0040] Deploy the trained screen four-corner detection neural network to a GPU environment. For example, use the torch.load function to load the trained .pth format model file, and use the model.to("cuda") call to transfer the model parameters to GPU memory.
[0041] Step S120: Read each frame of the video stream sequentially and process it into an RGB image of a preset size and transmit it to the GPU.
[0042] This step continuously reads the video stream frame by frame, converts each frame into an RGB image of a preset size, and transmits the image to the GPU memory. This provides uniform, high-speed image data for subsequent GPU-accelerated screen corner detection neural network detection, ensuring the continuity of video frame processing and computational efficiency.
[0043] In some optional implementations, step S120, processing the image into an RGB format image of a preset size and transmitting it to the GPU, includes: using OpenCV's VideoCapture to sequentially and continuously read each frame of the video stream; converting the image from BGR format to RGB format and adjusting the image size to 256×256; and converting the image into a tensor image and transmitting it to the GPU.
[0044] The VideoCapture module of the OpenCV computer vision library is used to read the video stream of the illegally recorded video frame by frame and acquire each frame. Each frame is converted from BGR format to RGB format, resized to the standard input size of 256×256, converted into a tensor, and transmitted to GPU memory.
[0045] Step S130: The screen corner detection neural network is used in the GPU to predict the image and the predicted corner coordinates are transmitted to the CPU.
[0046] This embodiment performs network forward propagation calculations in a GPU environment to quickly predict the coordinates of the four corner points of the image, and then transmits the prediction results from the GPU back to the CPU for further processing.
[0047] Step S140: In the CPU, coordinate transformation and temporal smoothing are performed on the predicted four corner point coordinates. Based on the processed predicted four corner point coordinates, perspective transformation is performed on the corresponding image to obtain the distortion-corrected image.
[0048] This step involves the CPU performing coordinate format conversion and timing stability optimization on the predicted corner coordinates output by the screen corner detection neural network to eliminate coordinate fluctuations and abnormal deviations. Then, based on the processed, accurate, and stable predicted corner coordinates, perspective transformation is performed on the original image to correct distorted and tilted images into standard rectangular images, ultimately obtaining a distortion-free and regular image.
[0049] In some optional implementations, step S140 performs coordinate transformation and temporal smoothing processing on the predicted four corner point coordinates in the CPU, including: converting the predicted four corner point coordinates into actual pixel coordinates according to the original size of the image; performing temporal smoothing processing using an exponential moving average algorithm; and adaptively adjusting the smoothing coefficient and time window of the temporal smoothing processing according to the degree of video stream jitter.
[0050] The predicted corner coordinates, which are normalized to the [0,1] interval output by the screen corner detection neural network, are multiplied by the width and height of the original image to convert them into actual pixel coordinates, matching the size ratio of the original image.
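The conversion amounts to one multiplication per coordinate, as the following sketch shows. The function name and the 1920×1080 frame size are chosen only for illustration.

```python
def to_pixel_coords(norm_corners, width: int, height: int):
    """Scale normalized (x, y) pairs in [0, 1] to pixel coordinates of the original frame."""
    return [(x * width, y * height) for x, y in norm_corners]


if __name__ == "__main__":
    predicted = [(0.10, 0.10), (0.90, 0.15), (0.88, 0.90), (0.11, 0.85)]
    print(to_pixel_coords(predicted, width=1920, height=1080))
```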
[0051] The exponential moving average (EMA) algorithm is used to perform temporal smoothing on the predicted four-corner coordinate sequence of continuous images (continuous frames), eliminating random jitter of the predicted four-corner coordinates between frames and ensuring the continuity and stability of the predicted four-corner coordinates.
[0052] The smoothing coefficient can range from [0.01, 1.0). A smaller smoothing coefficient results in better smoothing but a slower response, potentially causing a "trailing" effect. A larger smoothing coefficient results in a faster response but a worse smoothing effect, retaining more jitter. A larger time window considers more historical frames, leading to more stable smoothing.
[0053] The recursive formula is:
$$S^{(t)} = S^{(t-1)} \times (1 - \alpha) + C^{(t)} \times \alpha$$, formula (2);
where $S^{(t)}$ represents the predicted coordinates of the four corner points after smoothing in frame t, $S^{(t-1)}$ represents the predicted coordinates of the four corner points after smoothing in frame t-1, $C^{(t)}$ represents the predicted coordinates of the four corner points in frame t, and $\alpha$ represents the smoothing coefficient.
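A minimal sketch of the recursion in formula (2) follows. Initializing the smoothed state with the first frame's prediction is an assumption, since the description does not state the starting value, and the time window is not used in this sketch because its role in the recursion is not specified.

```python
def ema_smooth(corner_frames, alpha: float = 0.10):
    """Apply S(t) = S(t-1) * (1 - alpha) + C(t) * alpha to a sequence of corner sets.

    corner_frames: one entry per frame, each a list of four (x, y) tuples.
    """
    if not 0.01 <= alpha < 1.0:
        raise ValueError("alpha is expected in the range [0.01, 1.0)")
    smoothed = []
    state = None
    for corners in corner_frames:
        if state is None:
            # Assumed initialization: the first smoothed value equals the first prediction.
            state = [tuple(point) for point in corners]
        else:
            state = [
                (sx * (1 - alpha) + cx * alpha, sy * (1 - alpha) + cy * alpha)
                for (sx, sy), (cx, cy) in zip(state, corners)
            ]
        smoothed.append(state)
    return smoothed


if __name__ == "__main__":
    jittery = [[(100.0, 50.0)] * 4, [(110.0, 60.0)] * 4, [(95.0, 45.0)] * 4]
    for frame in ema_smooth(jittery, alpha=0.10):
        print(frame[0])
```

With α = 0.10, a 10-pixel jump in the raw prediction moves the smoothed corner by only 1 pixel in that frame, which is the jitter suppression the formula is meant to provide.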
[0054] In some optional implementations, adaptively adjusting the smoothing coefficient and time window of the temporal smoothing process according to the degree of video stream jitter includes: a smoothing coefficient of 0.05 and a time window of 7 frames for covert recording scenarios with strong shaking; a smoothing coefficient of 0.10 and a time window of 5 frames for handheld shooting; and a smoothing coefficient of 0.15 and a time window of 3 frames for slight shaking.
[0055] Based on the degree of image jitter in different recording scenarios, the core parameters of the temporal smoothing algorithm are adaptively adjusted to ensure optimal jitter suppression in various scenarios. Specifically:
Covert recording scenario (strong shaking): smoothing coefficient α=0.05, time window=7 frames; a strong smoothing strategy is adopted to greatly suppress severe shaking;
Handheld shooting scenario (moderate shaking): smoothing coefficient α=0.10, time window=5 frames; a moderate smoothing strategy is adopted to balance shake suppression and image fidelity;
Slightly shaky scenario (weak shaking): smoothing coefficient α=0.15, time window=3 frames; a weak smoothing strategy is adopted to preserve image details while suppressing minor shaking.
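One possible way to select among the three parameter sets is sketched below. The α and window values come from the description; the jitter measure (mean absolute frame-to-frame displacement of one coordinate) and the pixel thresholds are hypothetical, since the description does not say how the degree of jitter is measured.

```python
from typing import NamedTuple


class SmoothingPreset(NamedTuple):
    alpha: float
    window: int


# Values taken from the three scenarios in the description.
PRESETS = {
    "strong": SmoothingPreset(alpha=0.05, window=7),
    "moderate": SmoothingPreset(alpha=0.10, window=5),
    "weak": SmoothingPreset(alpha=0.15, window=3),
}


def mean_displacement(track) -> float:
    """Mean absolute frame-to-frame change of one coordinate track, in pixels."""
    steps = [abs(b - a) for a, b in zip(track, track[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def pick_preset(track, moderate_at: float = 2.0, strong_at: float = 6.0) -> SmoothingPreset:
    """Choose a preset from measured jitter; both thresholds are illustrative assumptions."""
    jitter = mean_displacement(track)
    if jitter >= strong_at:
        return PRESETS["strong"]
    if jitter >= moderate_at:
        return PRESETS["moderate"]
    return PRESETS["weak"]


if __name__ == "__main__":
    print(pick_preset([100.0, 110.0, 100.0, 110.0]))
```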
[0056] Figure 4 shows a schematic diagram of a temporal smoothing algorithm according to an embodiment of this application. As shown in Figure 4, the core parameters for temporal smoothing are: smoothing coefficient α = 0.10, and time window = 5 frames. The horizontal axis represents the frame number (i.e., each frame in the continuously read video stream), and the vertical axis represents the y-coordinate of a corner point in the image. The dashed line represents the y-coordinate curve of a corner point after temporal smoothing, and the solid line represents the predicted y-coordinate curve of a corner point output by the screen four-corner detection neural network. The diagram in Figure 4 is merely an example of the temporal smoothing algorithm and does not represent an actual processing result.
[0057] In some optional implementations, step S140 performs perspective transformation on the corresponding image based on the processed predicted four corner point coordinates, including: calculating the perspective transformation matrix with the processed predicted four corner point coordinates as the source point and the standard rectangle as the target point; and using the perspective transformation matrix to process the corresponding image into a distortion-corrected image.
[0058] The perspective transformation matrix is calculated based on the processed predicted four-corner coordinates. Based on the perspective transformation matrix, the original distorted image is geometrically transformed so that the trapezoidal or irregular quadrilateral screen area is corrected into a standard rectangular, distortion-corrected image.
[0059] Figure 5 shows a schematic diagram illustrating the principle of perspective transformation according to an embodiment of this application. As shown in Figure 5, (a) is the schematic diagram before perspective transformation, and (b) is the schematic diagram after perspective transformation.
[0060] First, we need to clarify two sets of key points. The source point is the predicted coordinates of the four corner points before perspective transformation and after temporal smoothing (in (a) it is shown as a trapezoid, and the predicted coordinates of the four corner points are: top left (x1,y1), top right (x2,y2), bottom right (x3,y3), bottom left (x4,y4)). The target point is the coordinates of the four corner points of a standard rectangle (in (b) it is shown as a standard rectangle, and the coordinates of the four corner points are: top left (0,0), top right (W,0), bottom right (W,H), bottom left (0,H)).
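[0061] The formula for perspective transformation is:
$$\begin{bmatrix} w x' \\ w y' \\ w \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$, formula (3);
where $x$ and $y$ represent the coordinates of a source point, $x'$ and $y'$ represent the coordinates of the corresponding target point, $m_{ij}$ represents the elements of the 3×3 perspective transformation matrix M, and $w$ represents the homogeneous coordinate normalization coefficient.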
[0062] Based on the source point, target point, and perspective transformation formula, a 3×3 perspective transformation matrix M is calculated using a geometric algorithm. Then, the perspective transformation matrix M is used to process the corresponding image into a distortion-corrected image.
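The calculation described in [0062] can be illustrated with a minimal sketch. It assumes that formula (3) has the usual homogeneous form with m33 fixed to 1, so that the four point pairs yield eight linear equations in the remaining eight unknowns. The function names `solve_homography` and `apply_homography` and the sample trapezoid coordinates are illustrative assumptions and are not taken from this application.

```python
import numpy as np

def solve_homography(src, dst):
    """Solve the 3x3 matrix M that maps four src points onto four dst points (m33 = 1)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (m11*x + m12*y + m13) / (m31*x + m32*y + 1), rearranged to be linear in m_ij
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        # v = (m21*x + m22*y + m23) / (m31*x + m32*y + 1)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    m = np.linalg.solve(np.array(rows, dtype=np.float64),
                        np.array(rhs, dtype=np.float64))
    return np.append(m, 1.0).reshape(3, 3)

def apply_homography(M, point):
    """Map one (x, y) point through M and divide by the normalization coefficient w."""
    x, y = point
    xw, yw, w = M @ np.array([x, y, 1.0])
    return xw / w, yw / w

# Illustrative trapezoid in the order: top left, top right, bottom right, bottom left
W, H = 1920, 1080
src = [(412.0, 230.0), (1510.0, 180.0), (1600.0, 900.0), (350.0, 860.0)]
dst = [(0.0, 0.0), (W, 0.0), (W, H), (0.0, H)]

M = solve_homography(src, dst)
mapped = [apply_homography(M, p) for p in src]  # each source corner lands on its target corner
```

The code embodiment in [0078] obtains the same kind of matrix from OpenCV's `cv2.getPerspectiveTransform` rather than solving the system by hand.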
[0063] Step S150: Continuously output each frame of the video stream with distortion correction.
[0064] This step continuously outputs the frames after distortion correction, following the original playback order and frame rate of the video stream, forming a stable and distortion-free complete video stream, providing standardized, high-quality video data for subsequent digital watermark extraction.
[0065] The following specific embodiments illustrate the neural network-based method for processing illegally recorded videos proposed in this application.
[0066] I. Hardware Configuration.
[0067] GPU: NVIDIA GeForce RTX 5080 / 4080 / 3080 (15.92GB GDDR6X); Video memory: ≥8GB, supports CUDA 13.0; Memory: 16GB DDR4-3200; Processor: Intel Core i7-12700K (or equivalent).
[0068] II. Software Environment.
[0069] Python 3.14.2; PyTorch 2.10.0+cu130 (CUDA 13.0); PyQt6 6.6.1; OpenCV 4.9.0.80; NumPy 1.26.4.
[0070] III. Model Training.
[0071] Training cycles: 50 epochs; batch size: 8; learning rate: 0.001; loss function: L_total = L_base + 0.1 × L_order. The loss during training changed as follows: Epoch 1, Loss = 0.067857; Epoch 10, Loss = 0.005579; Epoch 20, Loss = 0.000174; Epoch 30, Loss = 0.000073; Epoch 50, Loss = 0.000068.
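The weighting in the loss function of [0071] can be illustrated as follows. This section does not give the exact forms of the two terms, so the sketch assumes a mean squared error for the basic regression loss and a hinge-style penalty for the corner order constraint loss; both forms, and all function names, are assumptions made for illustration only. Only the weight 0.1 is taken from the text.

```python
import numpy as np

ORDER_WEIGHT = 0.1  # weight of L_order stated in [0071]

def base_loss(pred, target):
    """Assumed form of L_base: mean squared error over the 8 normalized coordinates."""
    return float(np.mean((pred - target) ** 2))

def order_loss(pred):
    """Assumed form of L_order: hinge penalties that vanish when the corner order holds."""
    tl, tr, br, bl = pred  # top left, top right, bottom right, bottom left
    violations = [
        tl[0] - tr[0],  # top left should lie to the left of top right
        bl[0] - br[0],  # bottom left should lie to the left of bottom right
        tl[1] - bl[1],  # top left should lie above bottom left (y grows downward)
        tr[1] - br[1],  # top right should lie above bottom right
    ]
    return float(sum(max(0.0, v) for v in violations))

def total_loss(pred, target):
    """L_total = L_base + 0.1 * L_order."""
    return base_loss(pred, target) + ORDER_WEIGHT * order_loss(pred)

target = np.array([[0.2, 0.2], [0.8, 0.2], [0.8, 0.8], [0.2, 0.8]])
good = target + 0.01            # small offset, corner order preserved
swapped = target[[1, 0, 3, 2]]  # left and right corners exchanged

loss_good = total_loss(good, target)
loss_swapped = total_loss(swapped, target)
```

Under these assumptions a prediction that keeps the corner order pays only the regression term, while a prediction with left and right corners exchanged pays an additional order penalty.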
[0072] IV. GPU inference capability test.
[0073] The inference performance of different NVIDIA GPU models was tested, and the results are as follows: RTX 5080: processing speed 60 FPS, inference latency 8 ms, VRAM usage 4 GB; RTX 4080: processing speed 50 FPS, inference latency 10 ms, VRAM usage 4 GB; RTX 3080: processing speed 40 FPS, inference latency 15 ms, VRAM usage 4 GB.
[0074] Test results show that real-time processing speeds of over 30 FPS can be achieved on mainstream consumer-grade GPUs, meeting the needs of engineering applications.
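The relation between the reported processing speed and inference latency can be checked with simple arithmetic. The sketch below uses only the figures from [0073] and the 30 FPS threshold from [0074]; it assumes the reported FPS is an end-to-end throughput, so that the per-frame time budget is 1000 / FPS milliseconds. The names `REPORTED` and `frame_budget_ms` are illustrative.

```python
# Figures reported in [0073]: (processing speed in FPS, inference latency in ms)
REPORTED = {
    "RTX 5080": (60, 8),
    "RTX 4080": (50, 10),
    "RTX 3080": (40, 15),
}
REAL_TIME_FPS = 30  # threshold named in [0074]

def frame_budget_ms(fps):
    """Time available per frame at a given throughput."""
    return 1000.0 / fps

summary = {}
for gpu, (fps, latency_ms) in REPORTED.items():
    budget = frame_budget_ms(fps)
    summary[gpu] = {
        "budget_ms": round(budget, 2),                      # per-frame time budget
        "inference_share": round(latency_ms / budget, 2),   # fraction spent in inference
        "real_time": fps > REAL_TIME_FPS,
    }
```

Under that assumption, inference takes roughly half of the per-frame budget on each tested GPU, and the remainder is available for decoding, smoothing, and perspective transformation.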
[0075] V. Code implementation of image processing, GPU inference and coordinate transformation.
[0076] The Python code implementations for image processing, GPU inference, and coordinate transformation are provided as follows:

    import cv2
    import torch

    def _detect_corners(self, frame, model, device, width, height):
        # Image preprocessing: BGR -> RGB, resize to the 256x256 network input
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame_resized = cv2.resize(frame_rgb, (256, 256))
        # HWC uint8 -> CHW float in [0, 1]
        frame_tensor = torch.from_numpy(frame_resized).float().permute(2, 0, 1) / 255.0
        frame_tensor = frame_tensor.unsqueeze(0).to(device)  # add batch dimension, transfer to GPU

        # GPU inference
        with torch.no_grad():
            pred_corners = model(frame_tensor)  # computed on the GPU
        pred_corners = pred_corners.cpu().numpy()[0]  # return to CPU, shape (4, 2)

        # Convert normalized coordinates back to original pixel coordinates
        pred_corners[:, 0] *= width
        pred_corners[:, 1] *= height
        return pred_corners

VI. Code implementation of timing smoothing.
[0077] The Python code implementation for timing smoothing is provided below:

    import numpy as np

    class CornerSmoother:
        def __init__(self, window_size: int = 5, alpha: float = 0.1):
            self.window_size = window_size  # sliding window size
            self.alpha = alpha              # smoothing coefficient
            self.history = []               # list of historical corner arrays

        def smooth(self, corners: np.ndarray) -> np.ndarray:
            # Add the current corner points to the history
            self.history.append(corners.copy())
            if len(self.history) > self.window_size:
                self.history.pop(0)
            if len(self.history) < 2:
                return corners
            # Exponential moving average over the window, oldest frame first
            smoothed = self.history[0].copy()
            for h in self.history[1:]:
                smoothed = smoothed * (1 - self.alpha) + h * self.alpha
            return smoothed

VII. Code implementation of perspective transformation.
[0078] The Python code implementation of perspective transformation is provided as follows:

    import cv2
    import numpy as np

    def _perspective_transform(self, frame, corners, output_width, output_height):
        # Source points: the four detected corners
        src_pts = corners.astype(np.float32)
        # Target points: standard rectangle
        dst_pts = np.array([
            [0, 0],
            [output_width - 1, 0],
            [output_width - 1, output_height - 1],
            [0, output_height - 1],
        ], dtype=np.float32)
        # Calculate the perspective transformation matrix
        M = cv2.getPerspectiveTransform(src_pts, dst_pts)
        # Apply the perspective transformation
        warped = cv2.warpPerspective(frame, M, (output_width, output_height))
        return warped

Figure 6 is a flowchart of a neural network-based method for processing illegally recorded videos according to another embodiment of the present application. As shown in Figure 6, the method proposed in this embodiment includes the following steps: Start.
[0079] Step S601, load the screen corner detection neural network to the GPU (torch.load + model.to(cuda)). Go to step S602.
[0080] Step S602, load the video file (VideoCapture(video_path)). Go to step S603.
[0081] Step S603, obtain the video parameters (resolution / frame rate / total number of frames). Go to step S604.
[0082] Step S604, determine whether there is a next frame (frame_num < total). If so, go to step S605; if not, end.
[0083] Step S605, read the video frame (cap.read()). Go to step S606.
[0084] Step S606, preprocess the image and transfer it to the GPU (BGR → RGB + resize(256, 256)). Go to step S607.
[0085] Step S607, perform GPU inference to detect the screen corners and return the results to the CPU (model.forward() → predict the coordinates of the four corner points). Go to step S608.
[0086] Step S608, coordinate transformation (predicted corner coordinates → actual pixel coordinates). Proceed to step S609.
[0087] Step S609, timing smoothing (CornerSmoother.smooth()). Proceed to step S610.
[0088] Step S610: Calculate the perspective transformation matrix (getPerspectiveTransform()). Proceed to step S611.
[0089] Step S611, perspective transformation (warpPerspective(frame,M)). Proceed to step S612.
[0090] Step S612: Output the processed frame (frame_processed signal). Proceed to step S613.
[0091] Step S613, update progress. Proceed to step S604.
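Steps S604 to S613 can be sketched as a single loop over the frames. To keep the example runnable without a GPU or a video file, the detection, smoothing, and warping stages are passed in as callables and replaced by stand-ins; the names `process_stream`, `to_pixels`, and the stand-in functions are hypothetical and are not part of this application.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Corners = List[Tuple[float, float]]

def to_pixels(norm_corners: Corners, width: int, height: int) -> Corners:
    """Step S608: scale normalized [0, 1] corners to pixel coordinates."""
    return [(x * width, y * height) for x, y in norm_corners]

def process_stream(
    frames: Iterable[object],
    width: int,
    height: int,
    detect: Callable[[object], Corners],
    smooth: Callable[[Corners], Corners],
    warp: Callable[[object, Corners], object],
) -> Iterator[object]:
    """Steps S604 to S613: yield one corrected frame per input frame, in order."""
    for frame in frames:                                   # S604, S605
        norm_corners = detect(frame)                       # S606, S607
        corners = to_pixels(norm_corners, width, height)   # S608
        corners = smooth(corners)                          # S609
        yield warp(frame, corners)                         # S610 to S612

# Stand-in stages so the control flow can be exercised without a model.
def fake_detect(frame):
    return [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9)]

def identity_smooth(corners):
    return corners

def tag_warp(frame, corners):
    # Returns the frame together with its bottom-right corner instead of warping.
    return frame, corners[2]

results = list(process_stream(["f0", "f1"], 1920, 1080,
                              fake_detect, identity_smooth, tag_warp))
```

Writing the loop as a generator keeps the output in the original frame order, which matches the sequential output described for step S612.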
[0092] Figure 7 is a schematic diagram of a neural network-based system for processing illegally recorded videos according to an embodiment of this application. As shown in Figure 7, the neural network-based system 700 for processing illegally recorded videos includes: a model providing unit 710, used to build and train the screen four-corner detection neural network and to load the trained screen four-corner detection neural network onto the GPU; an image providing unit 720, used to sequentially and continuously read each frame image of the video stream, process the image into an RGB format image of a preset size, and transmit it to the GPU; a coordinate prediction unit 730, used to predict on the image in the GPU using the screen four-corner detection neural network and to transmit the predicted four corner point coordinates to the CPU; a distortion correction unit 740, used to perform coordinate transformation and timing smoothing on the predicted four corner point coordinates in the CPU, and to perform perspective transformation on the corresponding image based on the processed predicted four corner point coordinates to obtain the distortion-corrected image; and an output unit 750, used to sequentially and continuously output each distortion-corrected frame image of the video stream.
[0093] In some optional implementations, in the above system, the screen four-corner detection neural network includes an input layer, an encoding layer, a global average pooling layer, a fully connected regression layer, and an output layer connected in sequence. The input layer receives a 256×256×3 RGB image and transmits the image to the encoding layer. The encoding layer includes five convolutional blocks, each including a downsampling convolutional layer with a stride of 2 and a feature extraction convolutional layer with a stride of 1; each convolutional layer is followed by batch normalization and a ReLU activation function. The encoding layer processes the image into an 8×8×512 feature map. The global average pooling layer compresses the feature map into a 1×1×512 feature vector. The fully connected regression layer includes three linear transformation layers, with a ReLU activation function and a dropout (random deactivation) layer between them to prevent overfitting, and processes the feature vector into an 8-dimensional coordinate vector. The output layer reshapes the coordinate vector into (4,2) format, normalizes it to the [0,1] interval with a Sigmoid activation function, and outputs the predicted coordinates of the four corner points in the order of top left, top right, bottom right, and bottom left.
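The spatial sizes in [0093] can be checked with a short calculation: five stride-2 downsampling layers reduce a 256×256 input to 8×8. The sketch assumes 3×3 kernels with a padding of 1, which this section does not specify; `conv_out` and `encoder_sizes` are illustrative names.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(input_size=256, num_blocks=5):
    """Spatial size after each block: a stride-2 convolution, then a stride-1 convolution."""
    sizes = [input_size]
    for _ in range(num_blocks):
        size = conv_out(sizes[-1], stride=2)  # downsampling convolution halves the size
        size = conv_out(size, stride=1)       # feature extraction convolution keeps the size
        sizes.append(size)
    return sizes

sizes = encoder_sizes()  # one entry for the input plus one per block
```

With these assumed kernel settings, each block halves the resolution (256, 128, 64, 32, 16, 8), which agrees with the 8×8×512 feature map stated for the encoding layer.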
[0094] In some optional implementations, in the above system, the model providing unit 710 is used to: extract training frames from the training video stream every preset number of frames, manually label the normalized four-corner coordinates of the training frames in the order of top left, top right, bottom right, and bottom left, and construct a training dataset in JSON format; construct a combined loss function; the combined loss function includes: a basic regression loss for calculating the error between the predicted four-corner coordinates and the normalized four-corner coordinates, and a corner order constraint loss for constraining the spatial topological relationship of the four corner points; with a batch size of 8, Adam as the optimizer, an initial learning rate of 0.001, and the learning rate dynamically adjusted through the ReduceLROnPlateau strategy, complete a preset number of training rounds in the GPU to obtain a trained screen four-corner detection neural network.
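The dataset construction in [0094] can be illustrated with a small example. The application states only that the training dataset is in JSON format and that the corners are normalized and ordered top left, top right, bottom right, bottom left; the field names `frame` and `corners`, the file name, and the interval of 30 frames below are hypothetical.

```python
import json

# Hypothetical annotation layout; the field names are assumptions, not from the application.
SAMPLE = """
[
  {"frame": "frame_000030.jpg",
   "corners": [[0.21, 0.18], [0.79, 0.15], [0.83, 0.84], [0.18, 0.80]]}
]
"""

def load_annotations(text):
    """Parse the JSON text and keep only records with four normalized (x, y) corners."""
    records = json.loads(text)
    valid = []
    for record in records:
        corners = record["corners"]
        if len(corners) != 4:
            continue
        if all(len(c) == 2 and 0.0 <= c[0] <= 1.0 and 0.0 <= c[1] <= 1.0 for c in corners):
            valid.append(record)
    return valid

def sample_indices(total_frames, interval):
    """Indices of the frames extracted every `interval` frames."""
    return list(range(0, total_frames, interval))

annotations = load_annotations(SAMPLE)
indices = sample_indices(total_frames=100, interval=30)
```

Checking the value range at load time is one simple way to catch labels that were stored in pixel units instead of normalized units.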
[0095] In some alternative implementations, in the above system, the image providing unit 720 is used to: sequentially and continuously read each frame of the video stream using OpenCV's VideoCapture; convert the image from BGR format to RGB format and adjust the image size to 256×256; and convert the image into a tensor image and transmit it to the GPU.
[0096] In some alternative implementations, in the above system, the distortion correction unit 740 is used to: convert the predicted four corner coordinates into actual pixel coordinates according to the original size of the image; perform temporal smoothing processing using an exponential moving average algorithm; and adaptively adjust the smoothing coefficient and time window of the temporal smoothing processing according to the degree of video stream jitter.
[0097] In some alternative implementations, in the above system, for secretly recorded video the smoothing coefficient is 0.05 and the time window is 7 frames; for handheld shooting the smoothing coefficient is 0.10 and the time window is 5 frames; and for slight shaking the smoothing coefficient is 0.15 and the time window is 3 frames.
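The effect of these parameter pairs can be examined by computing how much each frame in a full window contributes to the smoothed result. The sketch follows the recursion used by the CornerSmoother code in [0077], which seeds the average with the oldest frame in the window; the dictionary keys and the function names are illustrative labels, and the sample coordinate history is made up.

```python
# Presets from [0097]; the dictionary keys are illustrative labels.
PRESETS = {
    "secret_recording": {"alpha": 0.05, "window": 7},
    "handheld": {"alpha": 0.10, "window": 5},
    "slight_shake": {"alpha": 0.15, "window": 3},
}

def ema_weights(alpha, window):
    """Weight of each frame in a full window, oldest first, when the EMA is seeded with the oldest frame."""
    weights = [(1 - alpha) ** (window - 1)]
    for k in range(1, window):
        weights.append(alpha * (1 - alpha) ** (window - 1 - k))
    return weights

def smooth_value(history, alpha):
    """Apply the same recursion to a list of scalar coordinates."""
    smoothed = history[0]
    for value in history[1:]:
        smoothed = smoothed * (1 - alpha) + value * alpha
    return smoothed

handheld = PRESETS["handheld"]
weights = ema_weights(handheld["alpha"], handheld["window"])
history = [100.0, 102.0, 98.0, 101.0, 110.0]  # made-up x coordinate of one corner
smoothed = smooth_value(history, handheld["alpha"])
weighted = sum(w * v for w, v in zip(weights, history))  # same result via explicit weights
```

With this recursion the weights always sum to 1, and a smaller smoothing coefficient combined with a longer window leaves more weight on older frames, which is consistent with assigning the smallest coefficient to the scene with the strongest jitter.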
[0098] In some optional embodiments, in the above system, the distortion correction unit 740 is further configured to: calculate the perspective transformation matrix using the processed predicted four corner point coordinates as the source point and the standard rectangle as the target point; and process the corresponding image into a distortion-corrected image using the perspective transformation matrix.
[0099] It should be noted that the aforementioned neural network-based system 700 for processing illegally recorded videos can implement all of the aforementioned neural network-based methods for processing illegally recorded videos, which will not be elaborated upon further.
[0100] Figure 8 is a general architecture diagram of a neural network-based system for processing illegally recorded videos according to an embodiment of this application. As shown in Figure 8, the system adopts a four-layer architecture design, from top to bottom: user interface layer, image processing engine layer, inference layer, and computing layer. The layers work together to complete real-time processing of the illegally recorded video.
[0101] The user interface layer provides interactive functions such as video loading, parameter setting, real-time preview, progress display, and process control. It lets users import videos, configure smoothing parameters, view processing results and progress, and start or stop processing through control buttons.
[0102] The image processing engine layer, as the core processing layer, is responsible for video decoding, coordinate transformation, temporal smoothing, perspective transformation, and video encoding threads, realizing the complete processing from the original video frame to the corrected video frame.
[0103] The inference layer carries a trained screen corner detection neural network, which is responsible for loading model weights into GPU memory, performing batch inference and tensor calculations based on the GPU, and outputting standardized screen prediction corner coordinates.
[0104] The computing layer is based on the PyTorch deep learning framework and CUDA parallel computing architecture. It relies on NVIDIA GPU hardware to implement memory management and parallel computing acceleration, providing underlying computing power support for model training and real-time inference, and ensuring high-speed and stable operation of the system.
[0105] Figure 9 is a schematic diagram of the structure of an electronic device according to an embodiment of the present application. As shown in Figure 9, the electronic device includes a processor, memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and / or volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The network interface is used for communication with external devices via a network connection. When the computer program is executed by the processor, it implements the functions or steps of a neural network-based method for processing illegally recorded videos.
[0106] In one embodiment, the electronic device provided in this application includes a memory and a processor. The memory stores a database and a computer program that can run on the processor. When the processor executes the computer program, it implements the steps of a neural network-based method for processing illegally recorded videos.
[0107] The method executed by the neural network-based system for processing illegally recorded videos, as disclosed in the embodiment shown in Figure 7 of this application, can be applied to a processor or implemented by a processor. During implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by software instructions. The processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method.
[0108] In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of a neural network-based method for processing illegally recorded videos.
[0109] It should be noted that the functions or steps that the above-mentioned electronic devices or computer-readable storage media can achieve can be referred to the relevant descriptions in the foregoing method embodiments. To avoid repetition, they will not be described one by one here.
[0110] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0111] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0112] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A method for processing illegally recorded videos based on neural networks, characterized in that it comprises: building and training a screen four-corner detection neural network, and loading the trained screen four-corner detection neural network onto the GPU; sequentially and continuously reading each frame image of the video stream, processing the image into an RGB format image of a preset size, and transmitting it to the GPU; in the GPU, predicting on the image using the screen four-corner detection neural network, and transmitting the predicted four corner point coordinates to the CPU; in the CPU, performing coordinate transformation and timing smoothing on the predicted four corner point coordinates, and performing perspective transformation on the corresponding image based on the processed predicted four corner point coordinates to obtain a distortion-corrected image; and sequentially and continuously outputting each distortion-corrected frame image of the video stream.
2. The method for processing illegally recorded videos based on neural networks according to claim 1, characterized in that the building and training of the screen four-corner detection neural network includes: the screen four-corner detection neural network comprises an input layer, an encoding layer, a global average pooling layer, a fully connected regression layer, and an output layer connected in sequence; the input layer is used to receive a 256×256×3 RGB format image and transmit the image to the encoding layer; the encoding layer comprises five convolutional blocks, each of which includes a downsampling convolutional layer with a stride of 2 and a feature extraction convolutional layer with a stride of 1, each convolutional layer being followed by batch normalization and a ReLU activation function, and the encoding layer is used to process the image into an 8×8×512 feature map; the global average pooling layer is used to compress the feature map into a 1×1×512 feature vector; the fully connected regression layer comprises three linear transformation layers, with a ReLU activation function and a dropout (random deactivation) layer between the three linear transformation layers to prevent overfitting, and is used to process the feature vector into an 8-dimensional coordinate vector; and the output layer is used to reshape the coordinate vector into (4,2) format, normalize it to the [0,1] interval with a Sigmoid activation function, and output the predicted coordinates of the four corner points in the order of top left, top right, bottom right, and bottom left.
3. The method for processing illegally recorded videos based on neural networks according to claim 1, characterized in that the building and training of the screen four-corner detection neural network includes: extracting training frames from the training video stream at a preset frame interval, manually labeling the normalized four corner point coordinates of the training frames in the order of top left, top right, bottom right, and bottom left, and constructing a training dataset in JSON format; constructing a combined loss function, the combined loss function including a basic regression loss used to calculate the error between the predicted four corner point coordinates and the normalized four corner point coordinates, and a corner point order constraint loss used to constrain the spatial topological relationship of the four corner points; and, with a batch size of 8, Adam as the optimizer, an initial learning rate of 0.001, and the learning rate dynamically adjusted using the ReduceLROnPlateau strategy, completing a preset number of training rounds on the GPU to obtain the trained screen four-corner detection neural network.
4. The method for processing illegally recorded videos based on neural networks according to claim 1, characterized in that the step of sequentially and continuously reading each frame image of the video stream, processing the image into an RGB format image of a preset size, and transmitting it to the GPU includes: sequentially and continuously reading each frame image of the video stream using OpenCV's VideoCapture; converting the image from BGR format to RGB format and adjusting the image size to 256×256; and converting the image into a tensor and transmitting it to the GPU.
5. The method for processing illegally recorded videos based on neural networks according to claim 1, characterized in that performing coordinate transformation and timing smoothing on the predicted four corner point coordinates in the CPU includes: converting the predicted four corner point coordinates into actual pixel coordinates according to the original size of the image; performing timing smoothing using an exponential moving average algorithm; and adaptively adjusting the smoothing coefficient and time window of the timing smoothing according to the degree of jitter of the video stream.
6. The method for processing illegally recorded videos based on neural networks according to claim 5, characterized in that adaptively adjusting the smoothing coefficient and time window of the timing smoothing according to the degree of jitter of the video stream includes: for illegally recorded video, the smoothing coefficient is 0.05 and the time window is 7 frames; for handheld shooting, the smoothing coefficient is 0.10 and the time window is 5 frames; and for slight shaking, the smoothing coefficient is 0.15 and the time window is 3 frames.
7. The method for processing illegally recorded videos based on neural networks according to claim 1, characterized in that performing perspective transformation on the corresponding image based on the processed predicted four corner point coordinates includes: calculating the perspective transformation matrix using the processed predicted four corner point coordinates as the source points and the standard rectangle as the target points; and processing the corresponding image into a distortion-corrected image using the perspective transformation matrix.
8. A neural network-based system for processing illegally recorded videos, characterized in that it comprises: a model providing unit, used to build and train the screen four-corner detection neural network and to load the trained screen four-corner detection neural network onto the GPU; an image providing unit, used to sequentially and continuously read each frame image of the video stream, process the image into an RGB format image of a preset size, and transmit it to the GPU; a coordinate prediction unit, used to predict on the image in the GPU using the screen four-corner detection neural network and to transmit the predicted four corner point coordinates to the CPU; a distortion correction unit, used to perform coordinate transformation and timing smoothing on the predicted four corner point coordinates in the CPU, and to perform perspective transformation on the corresponding image based on the processed predicted four corner point coordinates to obtain the distortion-corrected image; and an output unit, used to sequentially and continuously output each distortion-corrected frame image of the video stream.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method for processing illegally recorded videos based on neural networks according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method for processing illegally recorded videos based on neural networks according to any one of claims 1 to 7.