Video conversion method and apparatus

By employing a multi-stage transformation of the target neural network model and a temporal consistency loss function, the problems of poor conversion effects for different video styles and long computation time for high-resolution videos are solved, achieving efficient video format conversion and superior viewing experience.

WO2026137964A1 · PCT designated stage · Publication Date: 2026-07-02 · SHANGHAI HODE INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SHANGHAI HODE INFORMATION TECH CO LTD
Filing Date: 2025-09-04
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing video conversion methods cannot effectively balance the conversion effects of different video styles, and high-resolution video conversion is computationally time-consuming and prone to inter-frame flickering.

Method used

A target neural network model is used for video conversion. Through multi-stage conversion of global information network, mapping network and local enhancement network, combined with multi-sample video training and temporal consistency loss function, the best conversion effect of video style is achieved and flickering is reduced.

Benefits of technology

It achieves optimal conversion effects for different video styles, improves the coverage of high dynamic range videos, reduces computational complexity and inter-frame flicker, and enhances the user experience.

✦ Generated by Eureka AI based on patent content.


Abstract

A video conversion method, which belongs to the technical field of video processing. The video conversion method comprises: acquiring a first video in a first format; and inputting the first video into a pre-trained target neural network model, so as to output a second video in a second format, wherein the target neural network model is obtained by performing training on the basis of a plurality of sample video pairs, each sample video pair comprises a first sample video in the first format and a second sample video in the second format, and the second sample video is obtained by performing corresponding adjustment and format conversion on the first sample video on the basis of the video style of the first sample video. Thus, the first video can be converted into the second video by means of the target neural network model, and the optimal conversion effect of a corresponding video style can be achieved.

Description

Video conversion method and apparatus

[0001] This application claims priority to Chinese patent application No. 202411917551.4, filed on December 23, 2024, entitled "Video Conversion Method and Apparatus", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of video processing technology, and in particular to a video conversion method, apparatus, computer equipment, computer-readable storage medium, and computer program product.

Background Technology

[0003] With advancements in display technology, more and more display devices support high dynamic range (HDR) video, which offers a higher dynamic range and a wider color gamut. While significant improvements have been made at the device level, HDR video still accounts for a relatively small share of the market. Therefore, converting standard dynamic range (SDR) video into HDR video can enhance the user's viewing experience.

[0004] However, the inventors have found that existing video conversion methods tend to produce an averaged result, leading to poor conversion effects for different video styles.

[0005] It should be noted that the above content is not necessarily prior art, nor is it intended to limit the scope of patent protection of this application.

Summary of the Invention

[0006] This application provides a video conversion method, apparatus, computer device, computer-readable storage medium, and computer program product to solve or alleviate one or more of the technical problems mentioned above.

[0007] One aspect of this application provides a video conversion method, the method comprising:

[0008] Obtain the first video in the first format;

[0009] The first video is input into a pre-trained target neural network model to output a second video in a second format;

[0010] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0011] Optionally, the target neural network model includes:

[0012] A global information network is used to receive the downsampled first video and output global information.

[0013] A mapping network is used to receive the first video and the global information to output an intermediate video in the second format;

[0014] A local enhancement network is used to receive the intermediate video and the global information to output a second video in the second format.

[0015] Optionally, the global information network includes:

[0016] Several global information modules, an M*M convolutional layer, and a global pooling layer; each global information module includes an M*M convolution and a pooling layer;

[0017] Where M is a natural number greater than 0.

[0018] Optionally, the mapping network includes:

[0019] Several M*M convolutions and first fine-tuning modules, with each M*M convolution corresponding to one first fine-tuning module; each first fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; the mapping network is used to output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the first video in the first format.

[0020] Optionally, the local enhancement network includes: several N*N convolutions and second fine-tuning modules, with each N*N convolution corresponding to one second fine-tuning module; each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; the local enhancement network is used to output a second video in the second format based on the vector representing amplitude, the vector representing offset, and the intermediate information obtained after convolutional calculation of the intermediate video;

[0021] Where N is a natural number greater than 0, and N > M.

[0022] Optionally, the target neural network model is trained through the following operations:

[0023] During the training process:

[0024] The first frame sequence in the first sample video is input into the target neural network model to output a predicted frame sequence, which corresponds to the second format.

[0025] Obtain the first inter-frame difference between each adjacent frame in the predicted frame sequence;

[0026] Obtain the second inter-frame difference between each adjacent frame in the second frame sequence of the second sample video;

[0027] Obtain the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference;

[0028] The network parameters in the target neural network model are adjusted based on the sum of the absolute values of the multiple differences.

[0029] Another aspect of this application provides a video conversion apparatus, the apparatus comprising:

[0030] The acquisition module is used to acquire the first video in the first format;

[0031] The output module is used to input the first video into a pre-trained target neural network model to output a second video in a second format;

[0032] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0033] Another aspect of this application provides a computer device, including:

[0034] At least one processor; and

[0035] A memory that is communicatively connected to the at least one processor;

[0036] Wherein: the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the following operations:

[0037] Obtain the first video in the first format;

[0038] The first video is input into a pre-trained target neural network model to output a second video in a second format;

[0039] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0040] Another aspect of this application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the following operations:

[0041] Obtain the first video in the first format;

[0042] The first video is input into a pre-trained target neural network model to output a second video in a second format;

[0043] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0044] Another aspect of this application provides a computer program product including computer-readable instructions that, when executed by a processor, perform the following operations:

[0045] Obtain the first video in the first format;

[0046] The first video is input into a pre-trained target neural network model to output a second video in a second format;

[0047] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0048] The embodiments of this application employing the above-described technical solution may have the following advantages:

[0049] The target neural network model is trained using a first sample video in a first format and a second sample video in a second format obtained after video style adjustment and format conversion as sample video pairs. This enables the first video to be converted into the second video through the target neural network model, and achieves the best conversion effect of the corresponding video style.

[0050] I. Description of the attached figures

[0051] The accompanying drawings exemplify embodiments and form part of the specification, serving together with the textual description to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.

[0052] Figure 1 schematically illustrates the operating environment of the video conversion method according to Embodiment 1 of this application;

[0053] Figure 2 schematically illustrates a flowchart of a video conversion method according to Embodiment 1 of this application;

[0054] Figure 3 schematically illustrates a flowchart of additional steps of the video conversion method according to Embodiment 1 of this application;

[0055] Figure 4 schematically illustrates an application example of the video conversion method according to Embodiment 1 of this application;

[0056] Figure 5 schematically shows a block diagram of a video conversion device according to Embodiment 2 of this application; and

[0057] Figure 6 schematically illustrates a hardware architecture diagram of a computer device according to Embodiment 3 of this application.

[0058] II. Detailed Implementation

[0059] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application. All other embodiments obtained by those skilled in the art based on the embodiments in this application without inventive effort are within the scope of protection of this application.

[0060] It should be noted that the descriptions involving "first," "second," etc., in the embodiments of this application are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature. Furthermore, the technical solutions of the various embodiments can be combined with each other, but this must be based on the ability of those skilled in the art to implement them. If the combination of technical solutions is contradictory or impossible to implement, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed in this application.

[0061] In the description of this application, it should be understood that the numerical labels before the steps do not indicate the order of the steps, but are only used to facilitate the description of this application and to distinguish each step, and therefore should not be construed as a limitation of this application.

[0062] First, a definition of the terminology used in this application is provided:

[0063] SDR, or Standard Dynamic Range, is a technology for displaying video whose brightness, contrast, and color characteristics are based on the limitations of CRT (cathode ray tube) displays.

[0064] HDR, or High Dynamic Range, is an upgrade to SDR and a technology that improves video display quality. HDR changes the way brightness and color information in video and images are represented in the signal, thereby supporting a wider range of brightness, a broader color gamut, and higher precision quantization.

[0065] Convolutional Neural Networks (CNNs) are a type of feedforward neural network that incorporates convolutional computations and has a deep structure. They are one of the representative algorithms in deep learning. A CNN consists of multiple convolutional layers, each with a different number of convolutional kernels, and each kernel corresponds to one convolutional channel.

[0066] Secondly, to facilitate understanding of the technical solutions provided in the embodiments of this application by those skilled in the art, the relevant technologies are described below:

[0067] The applicant understands that video conversion schemes based on fixed function mappings, limited by their number of parameters, can only achieve a global mapping: they cannot apply customized mappings to different regions of the video content, nor adapt to different content scenarios. Learnable video conversion schemes can achieve local and scenario-specific adaptive mapping, giving a higher ceiling for the quality of the converted video. However, if a batch of sample video pairs is simply collected without considering the consistency of their content and style, the resulting model averages over the whole distribution and fails to achieve optimal results for individual content categories. Furthermore, models using learnable schemes often require significant computation time when processing high-resolution videos, especially those above 4K, because of the large number of pixels, and the output video content is prone to inter-frame flickering.

[0068] Therefore, this application provides a video conversion technical solution. In this solution: (1) model inapplicability caused by differing video content styles is avoided, because conversion is performed using models corresponding to different video styles, achieving the best conversion effect for each style; (2) the model is trained in multiple stages to achieve the best balance between efficiency and effect; (3) through a temporally stable training method and a temporal consistency loss function, the video content output by the model is coherent and not prone to flickering. See below for details.

[0069] Finally, for ease of understanding, an exemplary operating environment is provided below.

[0070] Figure 1 shows a diagram of the operating environment.

[0071] As shown in Figure 1, the operating environment includes: service platform 2 and clients (4A, 4B, ..., 4N).

[0072] Service platform 2 can connect to clients (4A, 4B, ..., 4N) via the network.

[0073] Service platform 2 can be a single server, a server cluster, or a cloud computing service center.

[0074] Service platform 2 can provide video conversion services to clients, etc.

[0075] The video conversion service can convert a first video in a first format into a second video in a second format. For example, it can convert an SDR video into an HDR video.

[0076] Service platform 2 may be located in a data center, such as a single location, or distributed across different geographical locations (e.g., multiple locations). Service platform 2 may provide services via a network. The network includes various network devices such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, proxy devices, and / or similar devices. The network may include physical links, such as coaxial cable links, twisted-pair cable links, fiber optic links, or combinations thereof, or wireless links, such as cellular links, satellite links, Wi-Fi links, etc.

[0077] Clients (4A, 4B, ..., 4N) can be configured to access the content and services of service platform 2. Clients (4A, 4B, ..., 4N) can include electronic devices with built-in or external display panels, such as mobile devices, tablets, laptops, workstations, virtual reality devices, gaming devices, digital streaming media devices, vehicle terminals, smart TVs, set-top boxes, etc., and can also include virtualized computing instances. Virtualized computing instances can include virtual machines, such as simulations of computer systems, operating systems, servers, etc. The computing device can load the virtual machine based on the virtual image and / or other data defining specific software (e.g., operating system, dedicated applications, servers) used for simulation. As the demand for different types of processing services changes, different virtual machines can be loaded and / or terminated on one or more computing devices.

[0078] A client (4A, 4B, ..., 4N) can be associated with one or more users. A single user can also use one or more of the clients (4A, 4B, ..., 4N) to access service platform 2. Clients (4A, 4B, ..., 4N) can travel to various locations and use different networks to access service platform 2.

[0079] The technical solutions of this application are described below through several embodiments. It should be understood that these embodiments can be implemented in many different forms and should not be construed as being limited to the embodiments set forth herein.

[0080] Embodiment 1

[0081] Figure 2 schematically illustrates a flowchart of a video conversion method according to Embodiment 1 of this application.

[0082] As shown in Figure 2, the video conversion method may include steps S200 to S202, wherein:

[0083] Step S200: Obtain the first video in the first format.

[0084] Step S202: Input the first video into a pre-trained target neural network model to output a second video in a second format; wherein, the target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, and the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0085] The video conversion method provided in this embodiment uses a target neural network model trained with a first sample video in a first format and a second sample video in a second format obtained after video style adjustment and format conversion as sample video pairs. This allows the first video to be converted into the second video through the target neural network model, achieving the best conversion effect for the corresponding video style.

[0086] The following, with reference to Figure 2, elaborates on each step in steps S200 to S202, as well as other optional steps.

[0087] Step S200: Obtain the first video in the first format.

[0088] The first format can be SDR or other video formats; there is no limitation here. The first video can be a user-uploaded video or a video stored on the server; there is no limitation here. In some embodiments, the first video can be an SDR video that needs to be converted to HDR video. By acquiring the SDR video and converting it to HDR video, the coverage of HDR video can be improved at a very low cost, thereby improving the user's viewing experience.

[0089] Step S202: Input the first video into the pre-trained target neural network model to output the second video in the second format;

[0090] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

[0091] The video style can be determined based on the different types of videos, such as movies, animations, and TV series, or based on the content of the videos, such as horror content and comedy content; no limitation is made here. In some embodiments, the first video is an SDR video and the second video is an HDR video. The SDR video is converted into the corresponding style HDR video using a pre-trained target neural network model (trained based on videos of various styles).

[0092] During the training process:

[0093] The first sample video can be an SDR sample video, and the second sample video can be an HDR sample video obtained by a colorist adjusting and converting the SDR sample video to match its video style. The adjustment and format conversion process may involve converting the color space of the SDR sample video, mapping its content to a color space that conforms to the HDR standard, and adjusting the color and brightness of the content and areas according to the colorist's artistic taste to achieve the best viewing experience for the second sample video.

[0094] In this embodiment, the target neural network model is trained using a first sample video in a first format and a second sample video in a second format obtained after video style adjustment and format conversion as sample video pairs. This allows the first video to be converted into the second video through the target neural network model, achieving the best conversion effect for the corresponding video style.

[0095] In an optional embodiment, as shown in Figure 6, the target neural network model includes: a global information network, a mapping network, and a local enhancement network.

[0096] A global information network is used to receive the downsampled first video and output global information.

[0097] A mapping network is used to receive the first video and the global information to output an intermediate video in the second format;

[0098] A local enhancement network is used to receive the intermediate video and the global information to output a second video in the second format.

[0099] Downsampling is a method of "compressing" or "simplifying" data to reduce its size or processing complexity. The first video can be converted and processed frame by frame. The global information network generates global information related to the entire image, which can be represented as vectors. This global information guides the mapping network and the local enhancement network in adaptively processing the input content. The mapping network receives the first video and, using the global information, performs an initial mapping of its content to the color space and brightness range of the second format standard, producing an intermediate video. The local enhancement network then uses the global information to adaptively fine-tune the content of the intermediate video, producing a more refined second-format video.

[0100] In this embodiment, the optimal balance between efficiency and effectiveness is achieved through multi-stage transformation of the global information network, the mapping network, and the local enhancement network.
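The data flow among the three networks can be sketched as follows. This is a minimal illustration in plain Python, not the application's implementation: the names `downsample` and `convert_frame` and the three toy stand-in networks are hypothetical, a frame is assumed to be a single-channel grid of floats, and the global information is reduced to a single number for brevity.

```python
from typing import Callable, List

Frame = List[List[float]]  # assumption: one single-channel frame as rows of pixels


def downsample(frame: Frame, factor: int = 2) -> Frame:
    """Keep every `factor`-th pixel in each direction (one simple downsampling choice)."""
    return [row[::factor] for row in frame[::factor]]


def convert_frame(
    frame: Frame,
    global_net: Callable[[Frame], float],
    mapping_net: Callable[[Frame, float], Frame],
    enhance_net: Callable[[Frame, float], Frame],
    factor: int = 2,
) -> Frame:
    """Run the three stages on one frame of the first video."""
    # Stage 1: global information is computed from the downsampled frame only.
    info = global_net(downsample(frame, factor))
    # Stage 2: the mapping network sees the full-resolution frame plus the global information.
    intermediate = mapping_net(frame, info)
    # Stage 3: the local enhancement network refines the intermediate frame.
    return enhance_net(intermediate, info)


# Toy stand-ins (hypothetical, used only to show the wiring).
def toy_global(small: Frame) -> float:
    pixels = [p for row in small for p in row]
    return sum(pixels) / len(pixels)


def toy_mapping(frame: Frame, info: float) -> Frame:
    return [[p * (1.0 + info) for p in row] for row in frame]


def toy_enhance(frame: Frame, info: float) -> Frame:
    return [[p + info for p in row] for row in frame]


result = convert_frame([[1.0, 2.0], [3.0, 4.0]], toy_global, toy_mapping, toy_enhance)
```

In this arrangement the global stage works on fewer pixels than the other two stages, which is one plausible reading of why the text associates the multi-stage design with efficiency.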

[0101] The global information network, mapping network, and local enhancement network are further illustrated with more examples below.

[0102] In an optional embodiment, the global information network includes:

[0103] Several global information modules, an M*M convolutional layer, and a global pooling layer; each global information module includes an M*M convolution and a pooling layer;

[0104] Where M is a natural number greater than 0.

[0105] A pooling layer is a commonly used neural network layer in deep learning. It downsamples the input data, reducing the data size while retaining important features. In this global information network, pooling reduces the computational cost and the number of parameters and enhances the model's robustness. In some embodiments, M can be 1, and the stride of the average pooling layer can be 2. The stride is the distance the pooling window moves across the input data at each step. In the global information network, the computation passes through several global information modules in turn. Each global information module first performs a 1*1 convolution and then applies an average pooling layer. Since the stride of the average pooling layer is 2, the resolution of the first video image is progressively reduced during the computation, while the receptive field of the global information network increases. The receptive field is the area of the input image that a neuron in a neural network can "see" or "perceive." Finally, the feature information is compressed into a 1*1 vector by the 1*1 convolutional layer and the global pooling layer, yielding the global information. The feature information can be color space information, brightness information, etc., and is not limited here.

[0106] In this embodiment, the global information network outputs global information that serves as a basis for the computations of the mapping network and the local enhancement network. Furthermore, as the computation proceeds through the global information network, its receptive field continuously increases while the resolution of the first video continuously decreases, enabling the global information network to capture more information and thus improving its accuracy and generalization ability.
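The structure described for the global information network (repeated 1*1 convolution plus stride-2 pooling, then a final 1*1 convolution and global pooling) can be sketched in plain Python. Several details here are assumptions not stated in the text: a 2*2 averaging window, no activation functions, feature maps stored as height x width x channels nested lists, and the hypothetical function names `conv1x1`, `avg_pool2`, `global_pool`, and `global_information`.

```python
def conv1x1(fmap, weight, bias):
    """1*1 convolution: an independent linear map of the channels at each pixel."""
    return [[[sum(w * v for w, v in zip(w_row, px)) + b
              for w_row, b in zip(weight, bias)]
             for px in row]
            for row in fmap]


def avg_pool2(fmap):
    """Average pooling with an assumed 2*2 window and stride 2 (halves the resolution)."""
    h, w = len(fmap) // 2 * 2, len(fmap[0]) // 2 * 2
    c = len(fmap[0][0])
    return [[[(fmap[i][j][k] + fmap[i][j + 1][k]
               + fmap[i + 1][j][k] + fmap[i + 1][j + 1][k]) / 4.0
              for k in range(c)]
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def global_pool(fmap):
    """Global average pooling: one value per channel, i.e. a 1*1 feature vector."""
    n = len(fmap) * len(fmap[0])
    c = len(fmap[0][0])
    return [sum(px[k] for row in fmap for px in row) / n for k in range(c)]


def global_information(fmap, modules, head):
    """Apply each (weight, bias) module, then the final 1*1 convolution and global pooling."""
    for weight, bias in modules:
        fmap = avg_pool2(conv1x1(fmap, weight, bias))
    head_weight, head_bias = head
    return global_pool(conv1x1(fmap, head_weight, head_bias))


# A 4x4 single-channel input holding the values 0..15.
image = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
info = global_information(image, modules=[([[1.0]], [0.0])], head=([[2.0]], [1.0]))
```

With these toy weights the single module halves the 4x4 input to 2x2, and the final stage returns one number per channel that summarizes the whole image.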

[0107] In an optional embodiment, the mapping network includes:

[0108] Several M*M convolutions and first fine-tuning modules, with each M*M convolution corresponding to one first fine-tuning module; each first fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; the mapping network is used to output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the first video in the first format.

[0109] In some embodiments, M can be 1. First, a 1*1 convolution is performed on the first video. Then, the first fine-tuning module computes the amplitude vector and the offset vector from the global information using its two fully connected layers, respectively. These vectors are then applied to the output of the 1*1 convolution to adjust the result, for example by evaluating a formula (not reproduced in this text) that relates the output of the 1*1 convolution to the adjusted result, where scale and shift denote the amplitude vector and the offset vector, respectively.

[0110] In this embodiment, through the calculation of the mapping network, the first video can be initially mapped to the color space and brightness range under the second format standard, resulting in an intermediate video in the second format. Furthermore, by generating amplitude and offset vectors using global information, the mapping network can dynamically adjust the output results in conjunction with global context information, thereby achieving a more accurate mapping from the first format to the second format.
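A fine-tuning module of this kind can be sketched as follows. The text does not reproduce the adjustment formula, so the multiply-then-add form `x * scale + shift` used here is an assumption, chosen because it matches the names "scale" and "shift"; the function names `fully_connected` and `modulate` and all weights are hypothetical.

```python
def fully_connected(vec, weight, bias):
    """One fully connected layer: weight @ vec + bias, written with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def modulate(features, global_info, fc_scale, fc_shift):
    """Adjust convolution output channel by channel using the global information.

    features    : height x width x channels nested lists (output of a convolution)
    global_info : vector produced by the global information network
    fc_scale    : (weight, bias) of the layer that outputs the amplitude vector
    fc_shift    : (weight, bias) of the layer that outputs the offset vector
    """
    scale = fully_connected(global_info, *fc_scale)
    shift = fully_connected(global_info, *fc_shift)
    # Assumed form of the adjustment: y = x * scale + shift, per channel.
    return [[[x * s + t for x, s, t in zip(px, scale, shift)] for px in row]
            for row in features]


features = [[[1.0, 2.0]]]  # one pixel with two channels
global_info = [1.0, 0.5]
fc_scale = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])   # yields scale = [2.0, 1.0]
fc_shift = ([[0.0, 0.0], [0.0, 0.0]], [0.5, -1.0])  # yields shift = [0.5, -1.0]
adjusted = modulate(features, global_info, fc_scale, fc_shift)
```

Because `scale` and `shift` are recomputed from the global information for every input, the same convolution weights can produce different mappings for different content, which is the adaptive behavior the paragraph describes.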

[0111] In an optional embodiment, the local enhancement network includes:

[0112] The network comprises several N*N convolutional layers and a second fine-tuning module, with one N*N convolutional layer corresponding to one second fine-tuning module. Each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset. The local enhancement network is used to output a second video in the second format based on the vector representing amplitude, the vector representing offset, and intermediate information obtained after convolutional calculation of the intermediate video.

[0113] Where N is a natural number greater than 0, and N > M.

[0114] Setting N greater than M gives the local enhancement network a larger receptive field, so that it acquires pixel information around each pixel and can adaptively fine-tune the intermediate video content, for example its brightness and color. In some embodiments, N can be 3 and M can be 1. The local enhancement network processes the feature information of the intermediate video through a 3*3 convolution, and the second fine-tuning module computes the amplitude vector and the offset vector from the global information using its two fully connected layers. The amplitude and offset vectors are then applied to the output of the 3*3 convolution, for example by evaluating a formula (not reproduced in this text) that relates the output of the 3*3 convolution to the adjusted result, where scale and shift denote the amplitude vector and the offset vector, respectively.

[0115] In this embodiment, by using a larger convolutional kernel than the mapping network in the local enhancement network, the local enhancement network has a larger receptive field to acquire pixel information around each pixel, thereby adaptively fine-tuning the content of the intermediate video and obtaining a more refined second format video.

[0116] In an optional embodiment, as shown in Figure 3, the target neural network model is trained through the following operations, which may include:

[0117] Step S300: Input the first frame sequence from the first sample video into the target neural network model to output a predicted frame sequence, which corresponds to the second format.

[0118] Step S302: Obtain the first inter-frame difference between each pair of adjacent frames in the predicted frame sequence.

[0119] Step S304: Obtain the second inter-frame difference between each pair of adjacent frames in the second frame sequence of the second sample video.

[0120] Step S306: Obtain the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference.

[0121] Step S308: Adjust the network parameters in the target neural network model based on the sum of the absolute values of the multiple differences.
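Steps S302 to S308 can be sketched as a small loss computation. This is an illustrative sketch under assumptions: frames are represented as flat lists of pixel values, and the absolute value in step S306 is taken per pixel and summed, which the text does not specify. The names `frame_diff` and `temporal_consistency_loss` are hypothetical, and the parameter update itself (step S308) is left to whatever optimizer is used.

```python
def frame_diff(earlier, later):
    """Element-wise difference between two adjacent frames (flat pixel lists)."""
    return [b - a for a, b in zip(earlier, later)]


def temporal_consistency_loss(predicted, target):
    """Sum of absolute differences between predicted and target inter-frame differences."""
    if len(predicted) != len(target):
        raise ValueError("both sequences must contain the same number of frames")
    total = 0.0
    for t in range(len(predicted) - 1):
        d_pred = frame_diff(predicted[t], predicted[t + 1])  # S302
        d_tgt = frame_diff(target[t], target[t + 1])         # S304
        # S306: absolute value of the difference, accumulated for S308
        total += sum(abs(p - q) for p, q in zip(d_pred, d_tgt))
    return total


target_seq = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
flicker_seq = [[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
offset_seq = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]

print(temporal_consistency_loss(flicker_seq, target_seq))  # 2.0
print(temporal_consistency_loss(offset_seq, target_seq))   # 0.0
```

The second call shows that a constant offset between prediction and reference does not change this loss, since only frame-to-frame changes are compared.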

[0122] During training, the first frame sequence from the first sample video is input into the target neural network model, which outputs a predicted frame sequence. In each training epoch, one first frame is input into the target neural network model, which outputs one predicted frame. The network parameters of the target neural network model are adjusted so that the first inter-frame difference, taken between adjacent predicted frames, is as close as possible to the corresponding second inter-frame difference, taken between adjacent frames of the second frame sequence. In some embodiments, a temporal consistency loss function can be used for this purpose. The temporal consistency loss function is expressed in terms of inter-frame differences, such as the difference between the first and second frames and the difference between the second and third frames. Besides the temporal consistency loss function, the L1 loss function can also be used to evaluate the target neural network model and guide the adjustment of network parameters. L1 loss is a commonly used loss function in machine learning and deep learning that measures the error between the model's predictions and the true values. Its core idea is to calculate the absolute differences between the predicted and true values and then average them. The network parameters can include the convolution kernels of the convolutional layers and can also include other parameters; this is not limited here.
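Read together with steps S302 to S308, the two loss functions mentioned above can be written compactly as follows. This is a reconstruction under assumptions rather than the formula of the original filing: the symbols $\hat{Y}_t$ (the $t$-th predicted frame), $Y_t$ (the $t$-th frame of the second frame sequence), $T$ (the number of frames), $K$ (the number of compared values), and the per-pixel L1 norm are notation introduced here for illustration.

```latex
\begin{aligned}
L_{\mathrm{tc}} &= \sum_{t=1}^{T-1} \bigl\| (\hat{Y}_{t+1} - \hat{Y}_{t}) - (Y_{t+1} - Y_{t}) \bigr\|_{1}, \\
L_{1} &= \frac{1}{K} \sum_{k=1}^{K} \bigl| \hat{y}_{k} - y_{k} \bigr| .
\end{aligned}
```

In this form, $L_{\mathrm{tc}}$ constrains only the change from one frame to the next, while $L_{1}$ constrains the absolute values of the predicted frames.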

[0123] In this embodiment, the network parameters of the target neural network model are adjusted so that the first inter-frame difference of the output is as close as possible to the second inter-frame difference, thereby ensuring that the second video output by the target neural network model is temporally continuous and less prone to flickering, which improves the user's viewing experience.

[0124] To make this application easier to understand, an exemplary application is provided below with reference to Figure 4.

[0125] In this exemplary application, the process of training the target neural network model (an SDR-HDR lightweight network) based on an original SDR video is as follows:

[0126] S1. Based on the video style of the original SDR video, the colorist creates an HDR version of the video.

[0127] S2. After downsampling the original SDR image, input it into the global information network of the target neural network to obtain global information.

[0128] S3. Input the original SDR image and global information into the mapping network of the target neural network, and output the intermediate video through the mapping network.

[0129] S4. Input the intermediate video and global information into the local enhancement network of the target neural network, and output the predicted HDR result through the local enhancement network.

[0130] S5. Compute a loss function between the predicted HDR result and the HDR version of the video created by the colorist, and use it to optimize and adjust the network parameters of the target neural network model.
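The data flow of steps S2 to S5 can be sketched as below. This is only a sketch of how the stages connect: the three sub-networks are passed in as callables, the toy stand-ins are hypothetical and untrained, downsampling by pixel skipping and the use of L1 loss in S5 are assumptions, and the parameter update is omitted.

```python
def downsample(frame, factor=2):
    """Keep every `factor`-th pixel in both directions (assumed downsampling method)."""
    return [row[::factor] for row in frame[::factor]]


def l1_loss(pred, ref):
    """Mean absolute difference between two frames of equal size."""
    diffs = [abs(p - r) for pr, rr in zip(pred, ref) for p, r in zip(pr, rr)]
    return sum(diffs) / len(diffs)


def forward_and_loss(sdr, hdr_ref, global_net, mapping_net, enhance_net):
    global_info = global_net(downsample(sdr))            # S2
    intermediate = mapping_net(sdr, global_info)         # S3
    predicted = enhance_net(intermediate, global_info)   # S4
    return predicted, l1_loss(predicted, hdr_ref)        # S5 (loss only)


# Toy stand-ins for the three sub-networks (hypothetical, not trained models).
toy_global = lambda small: sum(map(sum, small)) / (len(small) * len(small[0]))
toy_mapping = lambda frame, g: [[v * (1.0 + g) for v in row] for row in frame]
toy_enhance = lambda frame, g: [row[:] for row in frame]

sdr_frame = [[0.5, 0.5], [0.5, 0.5]]
hdr_frame = [[1.0, 1.0], [1.0, 1.0]]
predicted, loss = forward_and_loss(
    sdr_frame, hdr_frame, toy_global, toy_mapping, toy_enhance
)
print(predicted, loss)  # [[0.75, 0.75], [0.75, 0.75]] 0.25
```

Note that the global information is computed once from the downsampled frame and then passed to both later stages, while the mapping stage still receives the full-resolution frame.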

[0131] In this exemplary application, the problem of model inapplicability caused by the different styles of SDR videos is avoided. At the same time, the multi-stage transformation through the global information network, the mapping network, and the local enhancement network achieves a favorable balance between efficiency and conversion effect.

[0132] Example 2

[0133] Figure 5 schematically illustrates a block diagram of a video conversion device according to Embodiment 2 of this application. This device can be divided into one or more program modules. One or more program modules are stored in a storage medium and executed by one or more processors to complete the embodiment of this application. The program module referred to in this embodiment is a series of computer program instruction segments capable of performing a specific function. The following description will specifically introduce the functions of each program module in this embodiment. As shown in Figure 5, the device 1600 may include: an acquisition module 1610 and an output module 1620, wherein:

[0134] The acquisition module 1610 is used to acquire the first video in the first format;

[0135] The output module 1620 is used to input the first video into a pre-trained target neural network model to output a second video in a second format;

[0136] The target neural network model is trained based on multiple sample video pairs; each sample video pair includes a first sample video in a first format and a second sample video in a second format, wherein the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.
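The division into an acquisition module and an output module can be sketched as two small classes. This is an illustrative sketch only: the class and method names are hypothetical, a video is represented as a list of frames, and the pre-trained model is assumed to be a callable applied frame by frame.

```python
class AcquisitionModule:
    """Acquires the first video (here: copies frames from any iterable source)."""

    def acquire(self, source):
        return [list(frame) for frame in source]


class OutputModule:
    """Feeds the first video to a pre-trained model and returns the second video."""

    def __init__(self, model):
        self.model = model  # assumed: a callable mapping one frame to one frame

    def convert(self, first_video):
        return [self.model(frame) for frame in first_video]


class VideoConversionDevice:
    """Combines the two modules in the order described in the text."""

    def __init__(self, model):
        self.acquisition = AcquisitionModule()
        self.output = OutputModule(model)

    def run(self, source):
        return self.output.convert(self.acquisition.acquire(source))


def toy_model(frame):
    """Hypothetical stand-in for the trained network: doubles every pixel value."""
    return [2 * v for v in frame]


device = VideoConversionDevice(toy_model)
result = device.run([[1, 2], [3]])
print(result)  # [[2, 4], [6]]
```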

[0137] As an optional embodiment, the target neural network model includes:

[0138] A global information network is used to receive the downsampled first video and output global information.

[0139] A mapping network is used to receive the first video and the global information to output an intermediate video in the second format;

[0140] A local enhancement network is used to receive the intermediate video and the global information to output a second video in the second format.

[0141] As an optional embodiment, the global information network includes:

[0142] Several global information modules, M*M convolutional layers, and global pooling layers; the global information modules include M*M convolutional and pooling layers;

[0143] Where M is a natural number greater than 0.
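A minimal sketch of such a global information network for M = 1 is given below. It rests on several assumptions not stated in the text: the pooling layers use 2x2 average pooling, the global pooling layer averages each channel to a single value, and activation functions are omitted. All function names are hypothetical.

```python
def conv1x1(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels per pixel."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wt * ch[y][x] for wt, ch in zip(row_w, channels)) for x in range(w)]
         for y in range(h)]
        for row_w in weights
    ]


def avg_pool2x2(channel):
    """2x2 average pooling with stride 2 (assumed pooling type)."""
    h, w = len(channel) // 2, len(channel[0]) // 2
    return [
        [(channel[2 * y][2 * x] + channel[2 * y][2 * x + 1]
          + channel[2 * y + 1][2 * x] + channel[2 * y + 1][2 * x + 1]) / 4.0
         for x in range(w)]
        for y in range(h)
    ]


def global_avg_pool(channels):
    """Reduce every channel to one value, giving the global information vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels]


def global_information(channels, module_weights, final_weights):
    x = channels
    for w in module_weights:            # each global information module: conv + pooling
        x = [avg_pool2x2(ch) for ch in conv1x1(x, w)]
    x = conv1x1(x, final_weights)       # trailing M*M convolutional layer
    return global_avg_pool(x)           # global pooling layer


# Two constant 4x4 input channels, one module, then a 1-to-2 channel convolution.
inputs = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
info = global_information(inputs, [[[1.0, 1.0]]], [[2.0], [0.5]])
print(info)  # [6.0, 1.5]
```

The output is a short vector whose length equals the number of output channels, independent of the input resolution.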

[0144] As an optional embodiment, the mapping network includes:

[0145] The network comprises several M*M convolutions and a first fine-tuning module, with one M*M convolution corresponding to one first fine-tuning module. Each first fine-tuning module includes two fully connected layers: one fully connected layer receives the global information to output a vector representing amplitude, and the other fully connected layer receives the global information to output a vector representing offset. The mapping network is used to: output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained after convolution of the first video in the first format.
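One first fine-tuning module can be sketched as two small fully connected layers that both read the global information. This is a sketch under assumptions: the vectors are applied as "feature times amplitude plus offset" per channel, and the weights shown are arbitrary example values. The names `fully_connected` and `fine_tune` are hypothetical.

```python
def fully_connected(vector, weights, bias):
    """Plain fully connected layer: one weighted sum plus bias per output value."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def fine_tune(features, global_info, fc_scale, fc_shift):
    """Derive amplitude and offset vectors from the global information and apply them.

    `features` holds the convolution output as a list of channels (2D lists);
    `fc_scale` and `fc_shift` are (weights, bias) pairs for the two layers.
    """
    scale = fully_connected(global_info, *fc_scale)   # vector representing amplitude
    shift = fully_connected(global_info, *fc_shift)   # vector representing offset
    # Assumed affine form: channel * scale + shift, one scalar pair per channel.
    return [
        [[v * s + b for v in row] for row in ch]
        for ch, s, b in zip(features, scale, shift)
    ]


global_info = [1.0, 2.0]
fc_scale = ([[0.5, 0.5]], [0.5])    # produces scale = [2.0]
fc_shift = ([[1.0, -1.0]], [0.0])   # produces shift = [-1.0]
conv_output = [[[1.0, 2.0]]]        # one channel, one row, two pixels
tuned = fine_tune(conv_output, global_info, fc_scale, fc_shift)
print(tuned)  # [[[1.0, 3.0]]]
```

Because the amplitude and offset depend only on the global information, the same convolution weights can produce different mappings for videos with different overall content.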

[0147] As an optional embodiment, the local enhancement network includes:

[0148] The network comprises several N*N convolutional layers and a second fine-tuning module, with one N*N convolutional layer corresponding to one second fine-tuning module. Each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset. The local enhancement network is used to output a second video in the second format based on the vector representing amplitude, the vector representing offset, and intermediate information obtained after convolutional calculation of the intermediate video.

[0149] Where N is a natural number greater than 0, and N > M.

[0150] As an optional embodiment, the target neural network model is trained through the following operations:

[0151] During the training process:

[0152] The first frame sequence in the first sample video is input into the target neural network model to output a predicted frame sequence, which corresponds to the second format.

[0153] Obtain the first inter-frame difference between each pair of adjacent frames in the predicted frame sequence;

[0154] Obtain the second inter-frame difference between each pair of adjacent frames in the second frame sequence of the second sample video;

[0155] Obtain the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference;

[0156] The network parameters in the target neural network model are adjusted based on the sum of the absolute values of the multiple differences.

[0157] Example 3

[0158] Figure 6 schematically illustrates the hardware architecture of a computer device 10000 suitable for implementing a video conversion method according to Embodiment 3 of this application. In some embodiments, the computer device 10000 may be a smartphone, wearable device, tablet computer, personal computer, vehicle terminal, game console, virtual device, workbench, digital assistant, set-top box, robot, or other terminal device. In other embodiments, the computer device 10000 may be a rack server, blade server, tower server, or cabinet server (including independent servers or server clusters composed of multiple servers). As shown in Figure 6, the computer device 10000 includes, but is not limited to: a memory 10010, a processor 10020, and a network interface 10030 that can communicate and be linked to each other via a system bus. Wherein:

[0159] The memory 10010 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 10010 may be an internal storage module of a computer device 10000, such as the hard disk or memory of the computer device 10000. In other embodiments, the memory 10010 may also be an external storage device of the computer device 10000, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device 10000. Of course, the memory 10010 may also include both the internal storage module and the external storage device of the computer device 10000. In this embodiment, the memory 10010 is typically used to store the operating system and various application software installed on the computer device 10000, such as program code for video conversion methods. In addition, the memory 10010 can also be used to temporarily store various types of data that have been output or will be output.

[0160] In some embodiments, processor 10020 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other chip. Processor 10020 is typically used to control the overall operation of computer device 10000, such as performing control and processing related to data interaction or communication with computer device 10000. In this embodiment, processor 10020 is used to run program code stored in memory 10010 or process data.

[0161] Network interface 10030 may include a wireless network interface or a wired network interface, which is typically used to establish a communication link between computer device 10000 and other computer devices. For example, network interface 10030 is used to connect computer device 10000 to an external terminal via a network, establishing a data transmission channel and communication link between computer device 10000 and the external terminal. The network may be an intranet, the Internet, Global System for Mobile Communication (GSM), Wideband Code Division Multiple Access (WCDMA), 4G network, 5G network, Bluetooth, Wi-Fi, or other wireless or wired networks.

[0162] It should be noted that Figure 6 only shows a computer device with components 10010-10030, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.

[0163] In this embodiment, the video conversion method stored in memory 10010 can also be divided into one or more program modules and executed by one or more processors (such as processor 10020) to complete the embodiment of this application.

[0164] Example 4

[0165] This application also provides a computer-readable storage medium storing computer-readable instructions thereon, wherein the computer-readable instructions, when executed by a processor, implement the steps of the video conversion method in the above embodiments.

[0166] In this embodiment, the computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as the hard disk or memory of the computer device. In other embodiments, the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device. Of course, the computer-readable storage medium may include both the internal storage unit and the external storage device of the computer device. In this embodiment, the computer-readable storage medium is typically used to store the operating system and various application software installed on the computer device, such as the program code of the video conversion method in the embodiment. In addition, the computer-readable storage medium can also be used to temporarily store various types of data that have been output or will be output.

[0167] Example 5

[0168] This application also provides a computer program product, including computer-readable instructions that, when executed by a processor, implement the methods described in the above embodiments.

[0169] Obviously, those skilled in the art should understand that the modules or steps of the embodiments of this application described above can be implemented using general-purpose computer devices. They can be centralized on a single computer device or distributed across a network of multiple computer devices. Optionally, they can be implemented using computer-executable program code, thereby storing them in a storage device for execution by a computer device. In some cases, the steps shown or described can be performed in a different order than those presented here, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the embodiments of this application are not limited to any particular combination of hardware and software.

[0170] It should be noted that the above are merely preferred embodiments of this application and do not limit the scope of patent protection of this application. Any equivalent structural or procedural changes made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of this application.

Claims

1. A video conversion method, wherein the method comprises: acquiring a first video in a first format; and inputting the first video into a pre-trained target neural network model to output a second video in a second format; wherein the target neural network model is trained based on multiple sample video pairs, each sample video pair includes a first sample video in the first format and a second sample video in the second format, and the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

2. The method of claim 1, wherein the target neural network model includes: a global information network, used to receive the downsampled first video and output global information; a mapping network, used to receive the first video and the global information to output an intermediate video in the second format; and a local enhancement network, used to receive the intermediate video and the global information to output the second video in the second format.

3. The method of claim 2, wherein the global information network includes: several global information modules, M*M convolutional layers, and global pooling layers; the global information modules include M*M convolutional layers and pooling layers; where M is a natural number greater than 0.

4. The method of claim 3, wherein the mapping network includes: several M*M convolutions and first fine-tuning modules, with one M*M convolution corresponding to one first fine-tuning module; each first fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the mapping network is used to output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the first video in the first format.

5. The method of claim 4, wherein the local enhancement network includes: several N*N convolutional layers and second fine-tuning modules, with one N*N convolutional layer corresponding to one second fine-tuning module; each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the local enhancement network is used to output the second video in the second format based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the intermediate video; where N is a natural number greater than 0, and N > M.

6. The method according to any one of claims 1 to 5, wherein the target neural network model is trained through the following operations: inputting the first frame sequence in the first sample video into the target neural network model to output a predicted frame sequence, which corresponds to the second format; obtaining the first inter-frame difference between each pair of adjacent frames in the predicted frame sequence; obtaining the second inter-frame difference between each pair of adjacent frames in the second frame sequence of the second sample video; obtaining the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference; and adjusting the network parameters in the target neural network model based on the sum of the absolute values of the multiple differences.

7. A video conversion apparatus, wherein the apparatus includes: an acquisition module, used to acquire a first video in a first format; and an output module, used to input the first video into a pre-trained target neural network model to output a second video in a second format; wherein the target neural network model is trained based on multiple sample video pairs, each sample video pair includes a first sample video in the first format and a second sample video in the second format, and the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

8. A computer device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the following operations: acquiring a first video in a first format; and inputting the first video into a pre-trained target neural network model to output a second video in a second format; wherein the target neural network model is trained based on multiple sample video pairs, each sample video pair includes a first sample video in the first format and a second sample video in the second format, and the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

9. The computer device of claim 8, wherein the target neural network model includes: a global information network, used to receive the downsampled first video and output global information; a mapping network, used to receive the first video and the global information to output an intermediate video in the second format; and a local enhancement network, used to receive the intermediate video and the global information to output the second video in the second format.

10. The computer device of claim 9, wherein the global information network includes: several global information modules, M*M convolutional layers, and global pooling layers; the global information modules include M*M convolutional layers and pooling layers; where M is a natural number greater than 0.

11. The computer device of claim 10, wherein the mapping network includes: several M*M convolutions and first fine-tuning modules, with one M*M convolution corresponding to one first fine-tuning module; each first fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the mapping network is used to output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the first video in the first format.

12. The computer device of claim 11, wherein the local enhancement network includes: several N*N convolutional layers and second fine-tuning modules, with one N*N convolutional layer corresponding to one second fine-tuning module; each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the local enhancement network is used to output the second video in the second format based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the intermediate video; where N is a natural number greater than 0, and N > M.

13. The computer device of any one of claims 8 to 12, wherein the target neural network model is trained through the following operations: inputting the first frame sequence in the first sample video into the target neural network model to output a predicted frame sequence, which corresponds to the second format; obtaining the first inter-frame difference between each pair of adjacent frames in the predicted frame sequence; obtaining the second inter-frame difference between each pair of adjacent frames in the second frame sequence of the second sample video; obtaining the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference; and adjusting the network parameters in the target neural network model based on the sum of the absolute values of the multiple differences.

14. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions that, when executed by a processor, perform the following operations: acquiring a first video in a first format; and inputting the first video into a pre-trained target neural network model to output a second video in a second format; wherein the target neural network model is trained based on multiple sample video pairs, each sample video pair includes a first sample video in the first format and a second sample video in the second format, and the second sample video is obtained by adjusting and converting the format of the first sample video according to the video style of the first sample video.

15. The computer-readable storage medium according to claim 14, wherein the target neural network model includes: a global information network, used to receive the downsampled first video and output global information; a mapping network, used to receive the first video and the global information to output an intermediate video in the second format; and a local enhancement network, used to receive the intermediate video and the global information to output the second video in the second format.

16. The computer-readable storage medium according to claim 15, wherein the global information network includes: several global information modules, M*M convolutional layers, and global pooling layers; the global information modules include M*M convolutional layers and pooling layers; where M is a natural number greater than 0.

17. The computer-readable storage medium according to claim 16, wherein the mapping network includes: several M*M convolutions and first fine-tuning modules, with one M*M convolution corresponding to one first fine-tuning module; each first fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the mapping network is used to output the intermediate video based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the first video in the first format.

18. The computer-readable storage medium according to claim 17, wherein the local enhancement network includes: several N*N convolutional layers and second fine-tuning modules, with one N*N convolutional layer corresponding to one second fine-tuning module; each second fine-tuning module includes two fully connected layers, one of which receives the global information to output a vector representing amplitude, and the other of which receives the global information to output a vector representing offset; and the local enhancement network is used to output the second video in the second format based on the vector representing amplitude, the vector representing offset, and intermediate information obtained by convolutional calculation of the intermediate video; where N is a natural number greater than 0, and N > M.

19. The computer-readable storage medium according to any one of claims 14 to 18, wherein the target neural network model is trained through the following operations: inputting the first frame sequence in the first sample video into the target neural network model to output a predicted frame sequence, which corresponds to the second format; obtaining the first inter-frame difference between each pair of adjacent frames in the predicted frame sequence; obtaining the second inter-frame difference between each pair of adjacent frames in the second frame sequence of the second sample video; obtaining the absolute value of the difference between each of the multiple first inter-frame differences and the corresponding second inter-frame difference; and adjusting the network parameters in the target neural network model based on the sum of the absolute values of the multiple differences.

20. A computer program product comprising computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 6.