Method and apparatus for cross-component prediction
A cross-component prediction method based on deep neural networks predicts chroma from the luminance component and updates the chroma prediction model parameters with low bit precision. This addresses the insufficient compression efficiency and quality of existing technologies and achieves more efficient video compression.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT AMERICA LLC
- Filing Date
- 2022-05-31
- Publication Date
- 2026-06-16
AI Technical Summary
Existing video coding methods cannot effectively exploit spatiotemporal information at different layers, which limits compression efficiency and quality. In particular, the cross-component linear prediction mode in intra-frame prediction performs poorly.
A cross-component prediction method based on deep neural networks is adopted. Chroma prediction is performed by a pre-trained DNN model whose parameters are updated with low bit precision during encoding or decoding. Chroma components are predicted and reconstructed from the luminance component, and processing time is reduced.
It improves video compression quality and efficiency, reduces processing time, enhances the ability to analyze nonlinear and nonlocal spatiotemporal correlations, and optimizes compression performance.
Figure CN116670686B_ABST
Description
[0001] Cross-reference to related applications
[0002] This application is based on and claims priority to U.S. Provisional Patent Application No. 63/210,751, filed June 15, 2021, the disclosure of which is incorporated herein by reference in its entirety.
Technical Field
[0003] Embodiments of this disclosure relate to neural network-based cross-component prediction during the encoding or decoding of images and/or video sequences.
Background Technology
[0004] Video encoding and decoding reduce redundancy in the input video signal through compression. Both lossless and lossy compression help reduce bandwidth or storage space requirements, in some cases by two orders of magnitude or more. Lossless compression refers to the technique of reconstructing an exact copy of the original signal from the compressed original signal. When using lossy compression, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is small enough that the reconstructed signal is useful for the intended application. Lossy compression is widely used in video encoding or decoding. The permissible amount of distortion may depend on the application. For example, users of some consumer streaming applications may tolerate higher distortion than users of television streaming applications.
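As a non-limiting illustration of the distinction drawn above, the following Python sketch contrasts a lossless round trip with a lossy one. It is not part of the disclosure: the function names are hypothetical, zlib stands in for any lossless coder, and uniform scalar quantization stands in for any lossy step.

```python
import zlib


def lossless_roundtrip(samples: bytes) -> bytes:
    """Compress and decompress with zlib; the output equals the input exactly."""
    return zlib.decompress(zlib.compress(samples))


def lossy_roundtrip(samples: bytes, step: int) -> bytes:
    """Quantize each 8-bit sample to the nearest multiple of `step`.

    The reconstruction error per sample is at most step // 2.
    """
    out = bytearray()
    for s in samples:
        level = (s + step // 2) // step      # quantization index
        out.append(min(255, level * step))   # dequantized sample, clipped to 8 bits
    return bytes(out)


samples = bytes([16, 17, 19, 200, 201, 203, 90, 91])
lossy = lossy_roundtrip(samples, step=4)
max_err = max(abs(a - b) for a, b in zip(samples, lossy))
print(lossless_roundtrip(samples) == samples)  # True
print(list(lossy), max_err)                    # [16, 16, 20, 200, 200, 204, 92, 92] 2
```

A larger `step` lowers the number of distinct levels to be coded at the cost of higher distortion, which is the application-dependent trade-off the paragraph describes.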
[0005] Traditional video coding standards (e.g., H.264/AVC, High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC)) are all designed on a similar (recursive) block-based hybrid prediction/transform framework, in which individual coding tools (e.g., intra-/inter-frame prediction, integer transform, and context-adaptive entropy coding) are carefully handcrafted to optimize overall efficiency. Essentially, spatiotemporal pixel neighborhoods are used to construct the prediction signal and obtain the corresponding residuals for subsequent transform, quantization, and entropy coding. However, this approach cannot extract different levels of spatiotemporal stimuli by analyzing spatiotemporal information at different layers. Therefore, for better compression efficiency and quality, methods and devices that explore nonlinear and nonlocal spatiotemporal correlations are needed.
Summary of the Invention
[0006] According to one aspect of this disclosure, a method for performing neural network (NN)-based cross-component prediction (CCP) with low bit precision during encoding or decoding can be provided. The method may include: using a pre-trained deep neural network (DNN) CCP model for chroma prediction to reconstruct chroma components based on received luminance components; updating one or more parameters of the pre-trained DNN CCP model with low bit precision; generating an updated DNN CCP model based on at least one video sequence, wherein the updated DNN CCP model is used for chroma prediction with low bit precision; and performing cross-component prediction of at least one video sequence using the updated DNN CCP model within a reduced processing time.
[0007] According to one aspect of this disclosure, an apparatus for performing cross-component prediction based on a neural network (NN) with low bit precision during encoding or decoding can be provided. The apparatus may include: at least one memory configured to store program code; and at least one processor configured to read the program code and operate according to the instructions of the program code. The program code may include: reconstruction code configured to cause the at least one processor to reconstruct chroma components based on received luma components using a pre-trained DNN CCP model for chroma prediction; update code configured to cause the at least one processor to update one or more parameters of the pre-trained DNN CCP model with low bit precision; generation code configured to cause the at least one processor to generate an updated DNN CCP model based on at least one video sequence, wherein the updated DNN CCP model is used for chroma prediction with low bit precision; and prediction code configured to cause the at least one processor to perform cross-component prediction of at least one video sequence using the updated DNN CCP model within a reduced processing time.
[0008] According to one aspect of this disclosure, a non-transitory computer-readable medium may be provided for storing instructions for performing cross-component prediction based on a neural network (NN) with low bit precision during encoding or decoding. When executed, the instructions may cause at least one processor to: reconstruct chroma components based on received luma components using a pre-trained DNN CCP model for chroma prediction; update one or more parameters of the pre-trained DNN CCP model with low bit precision; generate an updated DNN CCP model based on at least one video sequence, wherein the updated DNN CCP model is used for chroma prediction with low bit precision; and perform cross-component prediction of at least one video sequence using the updated DNN CCP model within a reduced processing time.
Attached Figure Description
[0009] Further features, properties, and various advantages of the disclosed subject matter will become more apparent from the following detailed description and accompanying drawings, wherein:
[0010] Figure 1 is a simplified block diagram of a communication system according to one embodiment;
[0011] Figure 2 is a block diagram of example components of one or more devices of Figure 1;
[0012] Figure 3 is a diagram illustrating exemplary cross-component prediction based on a deep neural network (DNN) with low bit precision during encoding or decoding, according to an embodiment;
[0013] Figure 4 is a flowchart, according to one embodiment, of a method for performing cross-component prediction based on a deep neural network (DNN) with low bit precision during encoding or decoding;
[0014] Figure 5 is a diagram of a streaming environment according to one embodiment;
[0015] Figure 6 is a block diagram of a video decoder according to an embodiment;
[0016] Figure 7 is a block diagram of a video encoder according to an embodiment.
Detailed Implementation
[0017] As mentioned above, methods in related technologies can utilize spatiotemporal pixel neighborhoods to construct predicted signals to obtain corresponding residuals for subsequent transformation, quantization, and entropy coding. However, this method cannot extract spatiotemporal stimuli at different levels by analyzing spatiotemporal information at different layers. Therefore, to achieve better compression efficiency and quality, it is necessary to explore methods and devices that address nonlinear and nonlocal spatiotemporal correlations.
[0018] By utilizing information from different components and additional auxiliary information, non-neural-network-based encoders can predict other components to achieve better compression performance. However, their performance is not as good as that of neural network-based encoders. For example, the cross-component linear prediction mode in intra-frame prediction performs less well and less efficiently than methods based on deep neural networks (DNNs).
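For context, the linear baseline mentioned above can be sketched as follows. This is a simplified illustration only: it assumes an ordinary least-squares fit over neighbouring reconstructed samples, which is not necessarily how any particular standard derives its linear-model parameters, and the function names `fit_linear_ccp` and `predict_chroma` are hypothetical.

```python
def fit_linear_ccp(neigh_luma, neigh_chroma):
    """Least-squares fit of chroma ~= alpha * luma + beta over neighbouring samples."""
    n = len(neigh_luma)
    mean_l = sum(neigh_luma) / n
    mean_c = sum(neigh_chroma) / n
    var_l = sum((l - mean_l) ** 2 for l in neigh_luma)
    if var_l == 0:
        return 0.0, mean_c  # flat luma neighbourhood: fall back to the mean chroma
    cov = sum((l - mean_l) * (c - mean_c) for l, c in zip(neigh_luma, neigh_chroma))
    alpha = cov / var_l
    beta = mean_c - alpha * mean_l
    return alpha, beta


def predict_chroma(luma_block, alpha, beta, bit_depth=8):
    """Apply the linear model per sample and clip to the valid sample range."""
    max_val = (1 << bit_depth) - 1
    return [[min(max_val, max(0, round(alpha * l + beta))) for l in row]
            for row in luma_block]


alpha, beta = fit_linear_ccp([100, 120, 140, 160], [60, 70, 80, 90])
print(alpha, beta)                                             # 0.5 10.0
print(predict_chroma([[110, 130], [150, 170]], alpha, beta))   # [[65, 75], [85, 95]]
```

A single (alpha, beta) pair per block cannot represent nonlinear or nonlocal luma-chroma relationships, which is the limitation the DNN-based approach is intended to address.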
[0019] DNNs are essentially programmed to extract stimuli at different levels and have the ability to explore highly nonlinear and nonlocal correlations. This offers promising opportunities for high compression quality.
[0020] According to embodiments of this disclosure, a low-bit-precision content-adaptive online training method for cross-component prediction (CCP) can be provided. Online training may include real-time training of one or more models. Embodiments may be based on deep neural networks (DNNs) for video processing, adjusting the model's precision during the online training phase, and improving video compression quality for different video inputs through a series of processing steps.
[0021] Figure 1 shows a simplified block diagram of a communication system (100) according to an embodiment of the present disclosure. The communication system (100) may include at least two terminals (130, 140) interconnected via a network (150). For unidirectional data transmission, a first terminal (140) may encode video data at a local location for transmission to a second terminal (130) via the network (150). The second terminal (130) may receive the encoded video data of the first terminal from the network (150), decode the encoded data, and display the recovered video data. Unidirectional data transmission is common in media service applications and the like.
[0022] Figure 1 also shows a second pair of terminals (110, 120) provided to support bidirectional transmission of encoded video, for example, during video conferencing. For bidirectional data transmission, each terminal (110, 120) can encode video data captured at a local location for transmission to the other terminal via the network (150). Each terminal (110, 120) can also receive encoded video data transmitted by the other terminal, can decode the encoded data, and can display the recovered video data on a local display device.
[0023] In Figure 1, the terminals (110-140) may be illustrated as servers, personal computers, and smartphones, but the principles of this disclosure are not limited thereto. Embodiments of this disclosure are applicable to laptop computers, tablet computers, media players, and/or dedicated video conferencing equipment. The network (150) represents any number of networks that transmit encoded video data between the terminals (110-140), including, for example, wired and/or wireless communication networks. The communication network (150) may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks (LANs), wide area networks (WANs), and/or the Internet. For the purposes of this discussion, the architecture and topology of the network (150) may be immaterial to the operation of this disclosure unless explained below.
[0024] Figure 2 is a block diagram of example components of one or more devices of Figure 1.
[0025] Device 200 can correspond to any of the terminals (110-140). As shown in Figure 2, device 200 may include a bus 210, a processor 220, a memory 230, a storage component 240, an input component 250, an output component 260, and a communication interface 270.
[0026] Bus 210 includes components that allow communication between components of device 200. Processor 220 is implemented in hardware, firmware, or a combination of hardware and software. Processor 220 is a central processing unit (CPU), graphics processing unit (GPU), accelerated processing unit (APU), microprocessor, microcontroller, digital signal processor (DSP), field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), or other type of processing component. In some implementations, processor 220 includes one or more processors that can be programmed to perform functions. Memory 230 includes random access memory (RAM), read-only memory (ROM), and / or another type of dynamic or static storage device (e.g., flash memory, magnetic storage, and / or optical storage) that stores information and / or instructions for use by processor 220.
[0027] Storage component 240 stores information and/or software related to the operation and use of device 200. For example, storage component 240 may include hard disks (e.g., magnetic disks, optical disks, magneto-optical disks, and/or solid-state disks), compact discs (CDs), digital versatile discs (DVDs), floppy disks, cassette tapes, magnetic tapes, and/or another type of non-transitory computer-readable media and corresponding drives.
[0028] Input component 250 includes components that allow device 200 to receive information, for example, via user input (e.g., a touchscreen display, keyboard, keypad, mouse, buttons, switches, and / or microphone). Additionally or alternatively, input component 250 may include sensors for sensing information (e.g., a Global Positioning System (GPS) component, accelerometer, gyroscope, and / or actuator). Output component 260 includes components that provide output information from device 200 (e.g., a display, speaker, and / or one or more light-emitting diodes (LEDs)).
[0029] Communication interface 270 includes transceiver-like components (e.g., a transceiver and / or separate receiver and transmitter) that enable device 200 to communicate with other devices, for example, via a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 270 can allow device 200 to receive information from and / or provide information to another device. For example, communication interface 270 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, etc.
[0030] Device 200 can perform one or more of the processes described herein. Device 200 can perform these processes in response to processor 220 executing software instructions stored on a non-transitory computer-readable medium such as memory 230 and / or storage component 240. Computer-readable medium is defined herein as a non-transitory storage device. Storage devices include storage space within a single physical storage device or storage space distributed across multiple physical storage devices.
[0031] Software instructions may be read into memory 230 and / or storage component 240 from another computer-readable medium or another device via communication interface 270. When executed, the software instructions stored in memory 230 and / or storage component 240 may cause processor 220 to perform one or more processes described herein. Additionally or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Therefore, the implementations described herein are not limited to any particular combination of hardware circuitry and software.
[0032] The number and arrangement of components shown in Figure 2 are provided as an example. In practice, device 200 may include more components, fewer components, different components, or differently arranged components than those shown in Figure 2. Additionally or alternatively, a set of components of device 200 (e.g., one or more components) may perform one or more functions described as being performed by another set of components of device 200.
[0033] A video compression framework can be described as follows. The input video x can include multiple image frames x1, ..., xT, where T represents the total number of frames in the video. Frames can be segmented into spatial blocks, and each block can be iteratively segmented into smaller blocks. Any suitable segmentation method can be used. As an example, 3D tree encoding (e.g., octree segmentation) can be used. The segmented blocks can contain luma and chroma components. During intra-frame prediction, the luma component can be predicted first, and then the two chroma channels can be predicted later. According to embodiments, the predictions for the two chroma channels can be generated jointly or separately. The reconstructed chroma components can be generated by DNN-based models in the encoder and decoder. In some embodiments, the reconstructed chroma components can be generated by a DNN-based model only in the decoder.
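The iterative block segmentation described above can be illustrated with a two-dimensional quadtree. The sketch below is an assumption-laden example, not the disclosed method: the split criterion (sample spread above a threshold) and the name `split_blocks` are hypothetical, and the paragraph itself permits any suitable segmentation method.

```python
def split_blocks(frame, x, y, size, min_size, threshold):
    """Recursively split a square region into four quadrants.

    A region becomes a leaf when it reaches `min_size` or when the spread of its
    samples (max - min) is at most `threshold`. Returns a list of (x, y, size).
    """
    samples = [frame[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    spread = max(samples) - min(samples)
    if size <= min_size or spread <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_blocks(frame, x + dx, y + dy, half, min_size, threshold)
    return leaves


# Only the top-left quadrant of this 4x4 frame is textured.
frame = [
    [0, 50, 10, 10],
    [90, 20, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
leaves = split_blocks(frame, 0, 0, 4, min_size=1, threshold=5)
print(leaves)
# [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1), (2, 0, 2), (0, 2, 2), (2, 2, 2)]
```

The textured quadrant is split down to single samples while the flat quadrants remain 2x2 leaves, so the leaves always tile the original region exactly.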
[0034] According to embodiments, one or more processes including signal processing, spatial or temporal filtering, scaling, weighted averaging, upsampling / downsampling, pooling, recursive processing with memory, linear system processing, nonlinear system processing, neural network processing, deep learning-based processing, AI processing, pre-trained network processing, machine learning-based processing, or combinations thereof can be used as modules for preprocessing and / or post-processing image frames.
[0035] Figure 3 is a diagram illustrating an exemplary deep neural network (DNN) based cross-component prediction process (300) performed with low bit precision during encoding or decoding according to an embodiment. As shown in Figure 3, process 300 may include a neural network model (302) and a reconstruction quality calculation (304).
[0036] The neural network model (302) can be trained and can perform inference jointly. Its inputs may include the luminance component (e.g., during encoding) or the reconstructed luminance component (e.g., during decoding), as well as certain auxiliary information or information associated with adjacent luminance reference blocks and adjacent chrominance reference blocks.
[0037] In some embodiments, the neural network model (302) may be a pre-trained model that is fine-tuned before or after encoding or decoding using the neural network model (302). In some embodiments, the neural network model (302) may be pre-trained, but may be continuously updated during the corresponding encoding or decoding by leveraging inference acceleration and continuous adjustment. For continuous updates, in some embodiments, the neural network model (302) may be supported by a custom hardware processor and may also be supported by a lower-precision floating-point representation used during training.
[0038] According to an embodiment, additional auxiliary information may include image attributes and information provided by the encoder, including but not limited to luminance components, block size, block components, quantization parameter (QP) values, etc.
[0039] The output of the neural network model (302) can be the predicted chroma components. These two chroma channels can use different neural network-based models, or they can use the same model. Embodiments of this disclosure allow for arbitrary variations in the combination, connection, or order in which these components are used as inputs.
[0040] The predicted chroma components can be used as input to the reconstruction quality calculation (304) to generate a reconstructed chroma block. In some embodiments, the reconstruction quality calculation (304) may also use chroma blocks from other prediction modes as input. In some embodiments, the reconstruction quality calculation (304) may receive the original chroma block associated with the reconstructed chroma block to determine the compression quality and to determine whether one or more parameters of the neural network model need to be updated or can be updated, thereby updating the neural network model.
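One way to picture the reconstruction quality calculation (304) is as a comparison of candidate chroma blocks against the original block. The sketch below is illustrative only: the disclosure does not name a quality metric, so mean squared error and PSNR are assumptions, as are the function names and the candidate labels.

```python
import math


def mse(block_a, block_b):
    """Mean squared error between two equally sized 2-D blocks."""
    diffs = [(a - b) ** 2 for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def psnr(original, candidate, bit_depth=8):
    """Peak signal-to-noise ratio in dB; infinite for an exact match."""
    err = mse(original, candidate)
    if err == 0:
        return math.inf
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak * peak / err)


def select_reconstruction(original, candidates):
    """Pick the candidate (by name) with the lowest MSE against the original block."""
    best = min(candidates, key=lambda name: mse(original, candidates[name]))
    return best, candidates[best]


original = [[100, 102], [104, 106]]
candidates = {
    "dnn": [[100, 103], [104, 105]],    # hypothetical DNN CCP prediction
    "other": [[103, 103], [103, 103]],  # hypothetical block from another prediction mode
}
name, block = select_reconstruction(original, candidates)
print(name)  # dnn
```

The same quality figure could also serve as the signal for deciding whether the model parameters should be updated, as the paragraph above describes.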
[0041] According to one embodiment, updating some (or all) parameters of a pre-trained neural network-based model with low bit precision can optimize the compression performance of one or more reconstructed components for the input video. While the default model parameter precision for most current neural networks is FP32 (some hardware may support FP64 model training), during the inference phase, specific hardware platforms may support low bit precision, such as FP16, INT8, INT4, INT2, and INT1. Using low bit precision involves a trade-off between compression performance and overall processing time.
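To make the notion of low bit precision concrete, the following sketch maps floating-point parameters to INT8 and back. It assumes symmetric per-tensor quantization with a single scale factor, which is one common scheme and is not stated in the disclosure; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 100]
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for smaller parameters and integer arithmetic.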
[0042] To improve the learning speed and accuracy of the neural network-based cross-component prediction model, additional parameters can be added to the model. These one or more additional parameters can be added as learnable parameters during initial training, fine-tuning, or continuous adjustment. During training, the additional parameters can be learned by optimizing the rate-distortion loss based on the input video sequence.
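A rate-distortion loss is commonly written as distortion plus a Lagrange multiplier times rate. The sketch below uses that common form to choose among update strategies; the candidate labels, the numbers, and the idea of counting the bits needed to convey updated parameters as the rate term are all assumptions made for illustration.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost: J = D + lambda * R."""
    return distortion + lam * rate_bits


def pick_by_rd(candidates, lam):
    """candidates: list of (label, distortion, rate_bits); return the cheapest label."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]


# Hypothetical figures for one video sequence.
candidates = [
    ("keep pre-trained", 12.0, 0),
    ("update bias only", 9.0, 40),
    ("update all layers", 7.5, 400),
]
print(pick_by_rd(candidates, lam=0.02))   # update bias only
print(pick_by_rd(candidates, lam=0.001))  # update all layers
```

A larger multiplier penalizes rate more heavily and favors smaller updates, while a smaller one favors the lowest distortion.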
[0043] According to one embodiment, the neural network-based model for cross-component prediction can be fine-tuned or can be continuously updated based on a single video sequence. According to another embodiment, the neural network-based model for cross-component prediction can be fine-tuned or can be continuously updated based on a set of video sequences.
[0044] According to embodiments, neural network-based models can be pre-trained. In one embodiment, one or more parameters can be updated only in one layer or certain types of layers of the neural network model, generating a new model. In other preferred embodiments, parameters are updated across multiple layers or all layers of the neural network model. In one embodiment, only one or more bias terms / parameters can be optimized and updated with low bit precision. In one embodiment, one or more weight (coefficient) terms / parameters can be optimized and updated with low bit precision. In one embodiment, one or more bias parameters and one or more weight terms / parameters can be jointly optimized or optimized together with low bit precision.
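The bias-only embodiment can be sketched with a deliberately tiny model. The example assumes a single frozen weight and one bias term, plain gradient descent on mean squared error, and a fixed-point grid with `frac_bits` fractional bits as the low-precision representation; none of these choices come from the disclosure, and `finetune_bias` is a hypothetical name.

```python
def finetune_bias(weight, bias, luma, chroma, lr=0.1, steps=100, frac_bits=4):
    """Fine-tune only the bias of y = weight * x + bias; the weight stays frozen.

    After training, the bias is rounded to a fixed-point grid with `frac_bits`
    fractional bits to emulate a low-bit-precision parameter update.
    """
    n = len(luma)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the bias.
        grad = sum(2.0 * (weight * l + bias - c) for l, c in zip(luma, chroma)) / n
        bias -= lr * grad
    step = 1.0 / (1 << frac_bits)
    return round(bias / step) * step


# Content-specific samples (hypothetical): chroma is roughly 0.5 * luma + 12.33.
new_bias = finetune_bias(0.5, 0.0, [100, 120, 140], [62, 72, 83])
print(new_bias)  # 12.3125
```

Because only one scalar changes, the update is cheap to compute and to convey; updating weights, or weights and biases jointly, follows the same pattern with more parameters.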
[0045] At the end of training and / or fine-tuning, updated parameters can be computed. In one embodiment, compression performance can be computed between the updated parameters and existing pre-trained parameters. In one embodiment, the updated parameters are the fine-tuned parameters; that is, the neural network model is updated with the fine-tuned parameters, and existing pre-trained parameters can be replaced. In other preferred embodiments, the updated parameters are some specific transformation of the fine-tuned parameters.
[0046] According to one embodiment, data compression can be performed on the updated parameters; for example, the LZMA2 algorithm can be used to compress the updated parameters. In another embodiment, compression may not be performed.
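As an illustration of compressing updated parameters with LZMA2, the sketch below uses Python's standard `lzma` module, whose `.xz` container format applies the LZMA2 filter by default. The serialization of parameters as signed 8-bit integers and the helper names are assumptions for the example.

```python
import lzma
import struct


def pack_int8(params):
    """Serialize signed 8-bit parameters, one byte each."""
    return struct.pack(f"{len(params)}b", *params)


def compress_params(params):
    """Compress serialized parameters in the .xz container (LZMA2 filter)."""
    return lzma.compress(pack_int8(params), format=lzma.FORMAT_XZ)


def decompress_params(blob):
    """Invert compress_params and return the parameters as a list of ints."""
    raw = lzma.decompress(blob, format=lzma.FORMAT_XZ)
    return list(struct.unpack(f"{len(raw)}b", raw))


params = [3, -2, 0, 0, 0, 1, -128, 127] * 32   # hypothetical INT8 parameter update
blob = compress_params(params)
print(decompress_params(blob) == params)  # True
```

For very small updates the container overhead can outweigh the savings, which is consistent with the alternative embodiment in which no compression is performed.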
[0047] Compared with the cross-component prediction method in intra-frame prediction mode, the neural network-based cross-component prediction method described herein can achieve better compression quality. According to some embodiments, one or more parameters to be optimized can be updated with low bit precision, using a specific video as online training input, to improve compression performance for that video. Furthermore, by updating and/or fine-tuning parameters with low bit precision, the updated parameters can accelerate the inference process and reduce processing time.
[0048] Figure 4 shows a flowchart of a process 400 for performing cross-component prediction based on a deep neural network (DNN) with low bit precision during encoding or decoding. Process 400 can be performed using an encoder or a decoder, or both.
[0049] In operation 405, a pre-trained deep neural network (DNN) cross-component prediction (CCP) model for chroma prediction can be used to reconstruct the chroma component based on the received luminance component. In some embodiments, the luminance component can be received in operation 405. In some examples, the luminance component may have already been reconstructed.
[0050] In operation 410, one or more parameters of a pre-trained DNN CCP model can be updated with low bit precision.
[0051] In operation 415, the pre-trained neural network model for chroma prediction can be updated with low bit precision. In some embodiments, updating the pre-trained neural network model may include optimizing one or more parameters of the pre-trained neural network model with low bit precision. In some embodiments, updating the pre-trained neural network model for chroma prediction with low bit precision may also include updating one or more parameters of the pre-trained neural network model based on a single video sequence or a set of video sequences.
[0052] Reconstructed chroma components can be generated based on predicted chroma components and one or more chroma components encoded using a set of prediction modes. According to an embodiment, generating reconstructed chroma components can be based on a quality computation of the predicted chroma components, wherein the quality computation can be based on one or more chroma components from other prediction modes and the original chroma components associated with the predicted chroma components.
[0053] In operation 420, the updated DNN CCP model can be used to perform cross-component prediction for at least one video sequence in a reduced processing time. Here, the processing time for performing cross-component prediction for at least one video sequence using the updated DNN CCP model is less than the processing time for performing cross-component prediction for at least one video sequence using the pre-trained DNN CCP model; that is, the processing time for performing cross-component prediction for at least one video sequence can be reduced compared to the pre-trained DNN CCP model.
[0054] In some embodiments, updating a pre-trained neural network model may include optimizing one or more parameters from one or more layers of the pre-trained neural network model with low bit precision. In some embodiments, the one or more parameters optimized with low bit precision may include one or more bias parameters. In some embodiments, the one or more parameters optimized with low bit precision may include one or more weight parameters. In some embodiments, the one or more parameters optimized with low bit precision may include one or more bias parameters and one or more weight parameters that are jointly optimized, and the one or more bias parameters and one or more weight parameters are parameters that are jointly updated.
[0055] In some embodiments, one or more layers may include one or more convolutional layers of a pre-trained neural network model. In some embodiments, one or more layers may include a set of final layers of a pre-trained neural network model.
[0056] In some embodiments, the update may include: calculating a first compression performance of an updated DNN CCP model, where the updated DNN CCP model may include the one or more parameters updated with low bit precision; calculating a second compression performance of the pre-trained DNN CCP model, where the pre-trained DNN CCP model may include one or more relevant parameters; and determining whether to update the pre-trained neural network model to include the one or more parameters optimized with one or more scaling factors, based on a comparison of the first compression performance and the second compression performance against a threshold.
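The decision step above can be sketched as a simple gate. The example adopts one possible reading of the comparison, namely that the gain of the updated model over the pre-trained model must exceed a threshold; that reading, the use of PSNR in dB as the performance measure, and the name `should_update` are assumptions.

```python
def should_update(perf_updated, perf_pretrained, threshold):
    """Return True when the low-precision update should replace the pre-trained
    parameters, i.e. when its performance gain exceeds `threshold`."""
    return (perf_updated - perf_pretrained) > threshold


# Hypothetical PSNR figures in dB for one video sequence.
print(should_update(38.4, 38.0, threshold=0.25))  # True
print(should_update(38.1, 38.0, threshold=0.25))  # False
```

A positive threshold keeps the pre-trained parameters in place unless the update yields a gain large enough to justify switching models.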
[0057] As an example of an application of the disclosed subject matter, Figure 5 illustrates the placement of a video encoder and decoder in a streaming environment. The disclosed subject matter can also be applied to other video-enabled applications, including, for example, video conferencing, digital television, and storing compressed video on digital media including CDs, DVDs, Memory Sticks, etc.
[0058] The streaming system may include a capture subsystem (513), which may include a video source (501), such as a digital camera, that creates, for example, an uncompressed video sample stream (502). This sample stream (502), depicted as a thick line to emphasize its high data volume compared to the encoded video bitstream, may be processed by an encoder (503) coupled to the camera (501). The encoder (503) may include hardware, software, or a combination thereof to enable or implement aspects of the disclosed subject matter as described in more detail below. The encoded video bitstream (504), depicted as a thin line to emphasize its lower data volume compared to the sample stream, may be stored on a streaming server (505) for future use. One or more streaming clients (506, 508) may access the streaming server (505) to retrieve copies (507, 509) of the encoded video bitstream (504). The client (506) may include a video decoder (510) that decodes the incoming copy (507) of the encoded video bitstream and creates an output video sample stream (511) that can be rendered on a display (512) or other rendering device (not shown). In some streaming systems, the video bitstreams (504, 507, 509) may be encoded according to a specific video coding/compression standard. Examples of such standards include H.265 (HEVC). A video coding standard under development is informally known as Versatile Video Coding (VVC). The disclosed subject matter can be used in the context of VVC.
[0059] Figure 6 may be a functional block diagram of a video decoder (510) according to an embodiment of the present invention.
[0060] The receiver (610) can receive one or more encoded video sequences to be decoded by the decoder (510); in the same or another embodiment, one encoded video sequence is received at a time, wherein the decoding of each encoded video sequence is independent of the other encoded video sequences. The encoded video sequences can be received from a channel (612), which can be a hardware/software link to a storage device storing the encoded video data. The receiver (610) can receive the encoded video data together with other data, such as encoded audio data and/or auxiliary data streams, which can be forwarded to their respective using entities (not shown). The receiver (610) can separate the encoded video sequences from the other data. To combat network jitter, a buffer memory (615) can be coupled between the receiver (610) and the entropy decoder/parser (620) (hereinafter referred to as the "parser"). The buffer (615) may be unnecessary or may be small when the receiver (610) receives data from a store-and-forward device with sufficient bandwidth and controllability or from an isochronous network. For use on best-effort packet networks (e.g., the Internet), the buffer (615) may be required, may be relatively large, and may have an adaptive size.
[0061] The video decoder (510) may include a parser (620) to reconstruct symbols (621) from the entropy-coded video sequence. As shown in Figure 6, these symbols include information for managing the operation of the decoder (510) and, potentially, information for controlling a presentation device (e.g., a display (512)) that is not part of the decoder but may be coupled to it. The control information for the presentation device may be in the form of supplemental enhancement information (SEI) messages or video usability information (VUI) parameter set fragments (not shown). The parser (620) can parse/entropy-decode the received encoded video sequence. The encoding of the encoded video sequence may be based on video coding techniques or standards and may follow principles known to those skilled in the art, including variable-length coding, Huffman coding, arithmetic coding with or without context sensitivity, etc. The parser (620) may extract, from the encoded video sequence, a set of subgroup parameters for at least one subgroup of pixels, based on at least one parameter corresponding to the group. The subgroups may include groups of pictures (GOPs), pictures, tiles, slices, macroblocks, coding units (CUs), blocks, transform units (TUs), prediction units (PUs), etc. The entropy decoder/parser may also extract information from the encoded video sequence, such as transform coefficients, quantizer parameter (QP) values, motion vectors, etc.
[0062] The parser (620) can perform entropy decoding / parsing operations on the video sequence received from the buffer (615) to create symbols (621). The parser (620) can receive encoded data and selectively decode specific symbols (621). Furthermore, the parser (620) can determine whether a specific symbol (621) will be provided to the motion compensation prediction unit (653), the scaler / inverse transform unit (651), the intra-frame prediction unit (652), or the loop filter (656).
[0063] Depending on the type of the encoded video picture or its portions (e.g., inter- and intra-pictures, inter- and intra-blocks) and other factors, the reconstruction of the symbol (621) can involve multiple different units. Which units are involved and how they are involved can be controlled by subgroup control information parsed from the encoded video sequence by the parser (620). For clarity, the flow of this subgroup control information between the parser (620) and the multiple units below is not described.
[0064] In addition to the functional blocks already mentioned, the decoder (510) can be conceptually subdivided into several functional units as described below. In practical implementations operating under commercial constraints, many of these units interact closely with each other and can be integrated at least partially with each other. However, for the purpose of describing the disclosed subject matter, it is appropriate to conceptually subdivide it into the following functional units.
[0065] The first unit is the scaler / inverse transform unit (651). The scaler / inverse transform unit (651) receives the quantized transform coefficients and control information, including which transform to use, block size, quantization factor, quantization scaling matrix, etc., as symbols (621) from the parser (620). The scaler / inverse transform unit can output blocks containing sample values, which can be input into the aggregator (655).
[0066] In some cases, the output samples of the scaler / inverse transform unit (651) may belong to intra-coded blocks; that is, blocks that do not use prediction information from previously reconstructed images but can use prediction information from previously reconstructed portions of the current image. This prediction information can be provided by the intra-picture prediction unit (652). In some cases, the intra-picture prediction unit (652) uses surrounding reconstructed information obtained from the current (partially reconstructed) image (666) to generate blocks of the same size and shape as the blocks in the reconstruction. In some cases, the aggregator (655) adds the prediction information already generated by the intra-picture prediction unit (652) to the output sample information provided by the scaler / inverse transform unit (651) based on each sample.
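The per-sample addition performed by the aggregator (655) can be illustrated with a minimal sketch. The function names, the DC predictor, and the clipping to the valid sample range are assumptions made for illustration only; the disclosure does not specify a particular intra-frame predictor or clipping rule.

```python
import numpy as np

def dc_intra_prediction(top, left, size):
    """Hypothetical DC predictor: fill the block with the mean of the
    already reconstructed neighbouring samples above and to the left."""
    dc = int(round((top.sum() + left.sum()) / (len(top) + len(left))))
    return np.full((size, size), dc, dtype=np.int32)

def aggregate(prediction, residual, bit_depth=8):
    """Add prediction and residual sample by sample.
    Clipping to [0, 2^bit_depth - 1] is an assumption of this sketch."""
    recon = prediction.astype(np.int32) + residual.astype(np.int32)
    return np.clip(recon, 0, (1 << bit_depth) - 1)

# Neighbouring samples taken from the partially reconstructed current image.
top = np.array([100, 102, 104, 106])
left = np.array([98, 100, 102, 104])
# Residual block as it might leave the scaler / inverse transform unit.
residual = np.array([[3, -2, 0, 1]] * 4)

prediction = dc_intra_prediction(top, left, 4)
reconstructed = aggregate(prediction, residual)
```

In this sketch the prediction block has the same size and shape as the block under reconstruction, matching the description above.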
[0067] In other cases, the output samples of the scaler / inverse transform unit (651) may belong to inter-frame coded blocks and may be motion-compensated. In this case, the motion compensation prediction unit (653) can access the reference picture memory (657) to obtain samples for prediction. After motion compensation of the extracted samples according to the symbols (621) associated with the block, these samples can be added by the aggregator (655) to the output of the scaler / inverse transform unit (referred to in this case as residual samples or residual signals) to generate output sample information. The addresses in the reference picture memory from which the motion compensation unit obtains the prediction samples can be controlled by motion vectors, which are available to the motion compensation unit in the form of symbols (621) that may have, for example, X, Y, and reference picture components. Motion compensation may also include interpolation of sample values obtained from the reference picture memory when sub-sample-accurate motion vectors are used, motion vector prediction mechanisms, etc.
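As a hedged illustration of motion compensation with sub-sample interpolation, the sketch below fetches a prediction block from a reference picture using a motion vector expressed in half-sample units. The half-sample unit, the bilinear filter, and the function name are assumptions of this sketch; practical codecs typically use longer interpolation filters, and the displaced block is assumed to lie inside the reference picture with a one-sample margin.

```python
import numpy as np

def motion_compensate(reference, x, y, mv_x, mv_y, size):
    """Return a size x size prediction block for the block at (x, y),
    displaced by (mv_x, mv_y) given in half-sample units (assumption)."""
    # Split each component into an integer offset and a half-sample fraction.
    ix, fx = mv_x >> 1, 0.5 * (mv_x & 1)
    iy, fy = mv_y >> 1, 0.5 * (mv_y & 1)
    x0, y0 = x + ix, y + iy
    # One extra row and column are needed for the interpolation.
    patch = reference[y0:y0 + size + 1, x0:x0 + size + 1].astype(np.float64)
    a, b = patch[:-1, :-1], patch[:-1, 1:]
    c, d = patch[1:, :-1], patch[1:, 1:]
    # Bilinear interpolation between the four neighbouring integer samples.
    return (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)

reference = np.arange(64).reshape(8, 8)   # stand-in reference picture
residual = np.ones((2, 2))                # stand-in residual samples

prediction = motion_compensate(reference, x=2, y=2, mv_x=1, mv_y=0, size=2)
reconstructed = prediction + residual     # the aggregator step
```

With an even motion vector component the fraction is zero and the function reduces to a plain block copy from the displaced position.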
[0068] The output samples of the aggregator (655) can undergo various loop filtering techniques in the loop filter unit (656). The video compression techniques may include loop filtering techniques controlled by parameters contained in the encoded video bitstream and available to the loop filter unit (656) as symbols (621) from the parser (620), but may also be in response to metadata obtained during the decoding of a previous (in the order of decoding) portion of the encoded picture or encoded video sequence and to previously reconstructed and loop-filtered sample values.
[0069] The output of the loop filter unit (656) can be a sample stream that can be output to the presentation device (512) and stored in the reference picture memory (657) for future inter-frame image prediction.
[0070] Once fully reconstructed, some coded images can be used as reference images for future predictions. Once the coded images have been fully reconstructed and have been identified as reference images (e.g., by the parser (620)), the current reference image (666) can become part of the reference image buffer (657), and a new current image memory can be reallocated before the reconstruction of the next coded image begins.
[0071] The video decoder (510) can perform decoding operations according to a predetermined video compression technique, which may be documented in a standard such as H.265 HEVC. The encoded video sequence may conform to the syntax specified by the video compression technique or standard in use, in the sense that it adheres to the syntax of the video compression technique or standard as specified in the corresponding document or standard, and in particular in the profiles document therein. Conformance also requires the complexity of the encoded video sequence to be within the bounds defined by the level of the video compression technique or standard. In some cases, levels restrict the maximum picture size, maximum frame rate, maximum reconstruction sample rate (measured in, for example, megasamples per second), maximum reference picture size, etc. In some cases, the limits set by levels can be further restricted through hypothetical reference decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.
[0072] In one embodiment, the receiver (610) may receive additional (redundant) data with encoded video. This additional data may be included as part of the encoded video sequence. The video decoder (510) may use the additional data to correctly decode the data and / or more accurately reconstruct the original video data. The additional data may be, for example, temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant images, forward error correction codes, etc.
[0073] Figure 7 may be a functional block diagram of a video encoder (503) according to an embodiment of the present disclosure.
[0074] The encoder (503) can receive video samples from a video source (501) (not part of the encoder) that can capture video images to be encoded by the encoder (503).
[0075] The video source (501) can provide the source video sequence to be encoded by the encoder (503) in the form of a digital video sample stream. This digital video sample stream can have any suitable bit depth (e.g., 8-bit, 10-bit, 12-bit, ...), any color space (e.g., BT.601 YCrCb, RGB, ...), and any suitable sampling structure (e.g., YCrCb 4:2:0, YCrCb 4:4:4). In a media service system, the video source (501) can be a storage device storing previously prepared video. In a video conferencing system, the video source (501) can be a camera capturing local image information as a video sequence. Video data can be provided as multiple individual pictures that impart motion when viewed in sequence. The pictures themselves can be organized as a spatial array of pixels, where each pixel can include one or more samples, depending on the sampling structure, color space, etc., used. Those skilled in the art will readily understand the relationship between pixels and samples. The following description focuses on samples.
[0076] According to one embodiment, the video encoder (503) can encode and compress images of a source video sequence into an encoded video sequence (743) in real time or under any other time constraints required by the application. Implementing an appropriate encoding rate is a function of the controller (750). The controller (750) controls and is functionally coupled to other functional units described below. For clarity, coupling is not described. Parameters set by the controller may include rate control-related parameters (image skipping, quantizer, λ value of rate-distortion optimization techniques, etc.), image size, group of pictures (GOP) layout, maximum motion vector search range, etc. Other functions of the controller (750) can be readily identified by those skilled in the art as they may be related to the video encoder (503) optimized for a particular system design.
[0077] Some video encoders operate within an “encoding loop” readily recognizable to those skilled in the art. As an oversimplification, the encoding loop can consist of the encoding portion of an encoder (730) (hereinafter referred to as the “source encoder”), responsible for creating symbols based on the input picture to be encoded and the reference picture(s), and a (local) decoder (733) embedded in the encoder (503) that reconstructs the symbols to create the sample data that a (remote) decoder would also create (since any compression between the symbols and the encoded video bitstream is lossless in the video compression techniques considered in the disclosed subject matter). This reconstructed sample stream is fed into a reference picture memory (734). Since the decoding of the symbol stream leads to bit-exact results independent of the decoder location (local or remote), the contents of the reference picture buffer are also bit-exact between the local encoder and the remote decoder. In other words, the prediction portion of the encoder “sees” as reference picture samples exactly the same sample values as a decoder would “see” when using prediction during decoding. The basic principles of reference picture synchronization (and the resulting drift, if synchronization cannot be maintained, for example, due to channel errors) are well known to those skilled in the art.
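The synchronization between the reference picture held by the encoder and the one held by a remote decoder can be sketched as follows. This is a simplified illustration under stated assumptions: the uniform quantizer step `QSTEP`, the whole-frame prediction from the previous reconstruction, and all function names are hypothetical and are not taken from the disclosure.

```python
import numpy as np

QSTEP = 8  # hypothetical quantizer step; this is the only lossy stage

def encode_frame(frame, reference):
    """Source encoder: predict from the reference and quantize the
    residual into symbols."""
    residual = frame.astype(np.int32) - reference
    return np.round(residual / QSTEP).astype(np.int32)

def decode_frame(symbols, reference):
    """Decoder (local or remote): rebuild samples from the symbols and
    the same reference picture."""
    return reference + symbols * QSTEP

frames = [np.full((2, 2), value) for value in (100, 109, 121)]
encoder_reference = np.zeros((2, 2), dtype=np.int32)
decoder_reference = np.zeros((2, 2), dtype=np.int32)

for frame in frames:
    symbols = encode_frame(frame, encoder_reference)
    # The local decoder inside the encoder reconstructs from the symbols...
    encoder_reference = decode_frame(symbols, encoder_reference)
    # ...and the remote decoder performs the identical reconstruction.
    decoder_reference = decode_frame(symbols, decoder_reference)
```

Because the encoder predicts from its own reconstruction rather than from the original frame, both references stay identical and the quantization error does not accumulate from frame to frame.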
[0078] The operation of the "local" decoder (733) can be the same as that of the "remote" decoder (510), which has already been described in detail above in conjunction with Figure 6. Referring briefly also to Figure 6, however, since symbols are available and the encoding / decoding of symbols of the encoded video sequence by the entropy encoder (745) and the parser (620) can be lossless, the entropy decoding parts of the decoder (510) (including the channel (612), receiver (610), buffer (615) and parser (620)) may not be fully implemented in the local decoder (733).
[0079] At this point, it can be observed that any decoder technique, except the parsing / entropy decoding present in the decoder, must also exist in essentially the same functional form in the corresponding encoder. The description of encoder techniques can therefore be abbreviated, as these techniques are the inverse of the fully described decoder techniques. A more detailed description is required only in certain areas and is provided below.
[0080] As part of its operation, the source encoder (730) can perform motion-compensated predictive coding, which predictively encodes the input frame by referencing one or more previously encoded frames from the video sequence designated as "reference frames". In this way, the encoding engine (732) encodes the differences between pixel blocks of the input frame and pixel blocks of the reference frame, which can be selected as the predictive reference for the input frame.
[0081] The local video decoder (733) can decode encoded video data of frames that can be designated as reference frames, based on symbols created by the source encoder (730). The operation of the encoding engine (732) can advantageously be a lossy process. When the encoded video data is decoded at a video decoder (not shown in Figure 7), the reconstructed video sequence can typically be a copy of the source video sequence with some errors. The local video decoder (733) replicates the decoding process that can be performed on the reference frame by the video decoder, and can store the reconstructed reference frame in a reference picture cache (734). In this way, the encoder (503) can locally store copies of the reconstructed reference frames that have the same content as the reconstructed reference frames that will be obtained by the remote video decoder (absent transmission errors).
[0082] The predictor (735) can perform prediction searches for the encoding engine (732). That is, for a new frame to be encoded, the predictor (735) can search the reference image memory (734) for sample data (as candidate reference pixel blocks) or certain metadata, such as reference image motion vectors, block shapes, etc., which can serve as appropriate prediction references for the new image. The predictor (735) can operate on a sample-block-by-pixel-block basis to find suitable prediction references. In some cases, as determined by the search results obtained by the predictor (735), the input image can have prediction references drawn from multiple reference images stored in the reference image memory (734).
[0083] The controller (750) can manage the encoding operations of the source encoder (730), including, for example, the setting of parameters and subgroup parameters for encoding video data.
[0084] The outputs of all the aforementioned functional units can undergo entropy encoding in the entropy encoder (745). The entropy encoder converts the symbols generated by the various functional units into an encoded video sequence by losslessly compressing the symbols, using techniques known to those skilled in the art, such as Huffman coding, variable-length coding, arithmetic coding, etc.
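To illustrate why the entropy coding stage is lossless, the following sketch builds a Huffman code for a short list of symbols and checks that decoding restores them exactly. It is an illustrative sketch only: the symbol values and function names are hypothetical, and Huffman coding is just one of the techniques named above.

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol case
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c1.items()}
        merged.update({s: "1" + bits for s, bits in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def entropy_encode(symbols, code):
    return "".join(code[s] for s in symbols)

def entropy_decode(bits, code):
    inverse = {b: s for s, b in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:                 # prefix-free, so first match wins
            decoded.append(inverse[current])
            current = ""
    return decoded

symbols = [0, 0, 0, 0, 1, 1, -1, 2]            # e.g., quantized coefficients
code = build_huffman_code(symbols)
bits = entropy_encode(symbols, code)
restored = entropy_decode(bits, code)
```

Frequent symbols receive shorter code words, so the bit string is shorter than a fixed-length representation while the symbols remain exactly recoverable.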
[0085] The transmitter (740) can buffer the encoded video sequence created by the entropy encoder (745) in preparation for transmission via a communication channel (760), which may be a hardware / software link to a storage device where the encoded video data will be stored. The transmitter (740) can combine the encoded video data from the source encoder (730) with other data to be transmitted, such as encoded audio data and / or auxiliary data streams (sources not shown).
[0086] The controller (750) can manage the operation of the encoder (503). During encoding, the controller (750) can assign a specific encoded image type to each encoded image, which can affect the encoding techniques that can be applied to the corresponding image. For example, an image can typically be specified as one of the following frame types:
[0087] An intra-frame picture (I-picture) can be a picture that is encoded and decoded without using any other frame in the sequence as a prediction source. Some video codecs allow different types of intra-frame pictures, including, for example, independent decoder refresh (IDR) pictures. Those skilled in the art are aware of these variants of I-pictures and their respective applications and characteristics.
[0088] A predicted image (P-image) can be an image that is encoded and decoded using intra-frame prediction or inter-frame prediction, with at most one motion vector and one reference index used to predict the sample values of each block.
[0089] A bidirectional prediction image (B-image) can be an image that is encoded and decoded using intra-frame prediction or inter-frame prediction, with up to two motion vectors and reference indices used to predict the sample values of each block. Similarly, a multi-prediction image can use more than two reference images and associated metadata to reconstruct a single block.
[0090] The source image can typically be spatially subdivided into multiple sample blocks (e.g., blocks of 4×4, 8×8, 4×8, or 16×16 samples each) and encoded on a block-by-block basis. Blocks can be predictively encoded with reference to other (already encoded) blocks, as determined by the encoding assignment applied to the blocks' corresponding image. For example, blocks of an I-image can be encoded non-predictively, or predictively with reference to already encoded blocks of the same image (spatial prediction or intra-frame prediction). Pixel blocks of a P-image can be encoded non-predictively, via spatial prediction, or via temporal prediction with reference to one previously encoded reference image. Blocks of a B-image can be encoded non-predictively, via spatial prediction, or via temporal prediction with reference to one or two previously encoded reference images.
[0091] The video encoder (503) can perform encoding operations according to a predetermined video coding technique or standard (e.g., H.265 HEVC). In its operation, the video encoder (503) can perform various compression operations, including predictive coding operations that utilize temporal and spatial redundancy in the input video sequence. Therefore, the encoded video data can conform to the syntax specified by the video coding technique or standard being used.
[0092] In one embodiment, the transmitter (740) may transmit additional data along with the encoded video. The source encoder (730) may include such data as part of the encoded video sequence. The additional data may include temporal / spatial / SNR enhancement layers, other forms of redundant data (e.g., redundant images and slices), supplemental enhancement information (SEI) messages, video usability information (VUI) parameter set fragments, etc.
[0093] This disclosure relates to several block segmentation methods in which motion information is considered during tree splitting for video coding. More specifically, the techniques in this disclosure relate to tree splitting methods based on flexible tree structures with motion field information. The techniques proposed in this disclosure can be applied to homogeneous and heterogeneous derived motion fields.
[0094] The derived motion field of a block is defined as homogeneous if it is available to all sub-blocks within the block, and all motion vectors within the derived motion field are similar (e.g., the motion vectors share the same reference frame, and the absolute differences between motion vectors are all below a certain threshold). This threshold can be signaled in the bitstream or predefined.
[0095] If the derived motion field is available for all sub-blocks of the block, and the motion vectors in the derived motion field are dissimilar (e.g., at least one motion vector references a reference frame that is not referenced by the other motion vectors, or at least one absolute difference between two motion vectors in the field is greater than a signaled or predefined threshold), then the derived motion field of the block is defined as heterogeneous.
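The two definitions above can be expressed as a minimal classification sketch. The data layout, the function name, the component-wise measure of the absolute difference, and the treatment of a difference exactly equal to the threshold as "similar" are assumptions of this sketch; the disclosure leaves these details open.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence

@dataclass(frozen=True)
class MotionVector:
    dx: int        # horizontal displacement
    dy: int        # vertical displacement
    ref_idx: int   # index of the referenced frame

def classify_motion_field(field: Sequence[Optional[MotionVector]],
                          threshold: int) -> str:
    """Classify the derived motion field of a block.
    `field` holds one entry per sub-block; None marks a sub-block for
    which no derived motion vector is available."""
    if not field or any(mv is None for mv in field):
        return "unavailable"
    for a, b in combinations(field, 2):
        if a.ref_idx != b.ref_idx:
            return "heterogeneous"    # different reference frames
        # Assumption: differences are compared per component.
        if abs(a.dx - b.dx) > threshold or abs(a.dy - b.dy) > threshold:
            return "heterogeneous"    # motion vectors too far apart
    return "homogeneous"

similar = [MotionVector(4, 1, 0), MotionVector(5, 1, 0), MotionVector(4, 2, 0)]
mixed_refs = [MotionVector(4, 1, 0), MotionVector(4, 1, 1)]
far_apart = [MotionVector(0, 0, 0), MotionVector(9, 0, 0)]
```

For example, `classify_motion_field(similar, threshold=2)` yields `"homogeneous"`, whereas the other two fields are classified as `"heterogeneous"` with the same threshold.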
[0096] While several exemplary embodiments have been described in this disclosure, variations, substitutions, and various alternative equivalents fall within the scope of this disclosure. Therefore, it should be understood that those skilled in the art will be able to design numerous systems and methods that, although not expressly shown or described herein, embody the principles of this disclosure and are therefore within its spirit and scope.
Claims
1. A method for cross-component prediction, said method employing low bit precision during encoding or decoding, executed by one or more processors, characterized in that, The method includes: Reconstructing the chromaticity components based on the received luminance components using a pre-trained deep neural network (DNN) cross-component prediction (CCP) model for chromaticity prediction; Updating one or more parameters of the pre-trained DNN CCP model with low bit precision; Generating an updated DNN CCP model based on at least one video sequence, wherein the updated DNN CCP model is used for chroma prediction with low bit precision; and Performing cross-component prediction of the at least one video sequence using the updated DNN CCP model within a reduced processing time, wherein the updated DNN CCP model uses the low bit precision during inference, the low bit precision format being one of FP16, INT8, INT4, INT2, or INT1.
2. The method according to claim 1, characterized in that, Updating one or more parameters of the pre-trained DNN CCP model includes updating one or more parameters from one or more layers of the pre-trained DNN CCP model with low bit precision.
3. The method according to claim 1, characterized in that, Updating one or more parameters of the pre-trained DNN CCP model further includes updating one or more parameters of the pre-trained DNN CCP model based on multiple video sequences, wherein the pre-trained DNN CCP model is used for chroma prediction with low bit precision.
4. The method according to claim 1, characterized in that, The one or more parameters updated with low bit precision include one or more bias parameters.
5. The method according to claim 1, characterized in that, The one or more parameters updated with low bit precision include one or more weight parameters.
6. The method according to claim 1, characterized in that, The one or more parameters updated with low bit precision include one or more bias parameters and one or more weight parameters, which are parameters that are jointly updated.
7. The method according to claim 2, characterized in that, The one or more layers include one or more convolutional layers of the pre-trained DNN CCP model.
8. The method according to claim 2, characterized in that, The one or more layers include a set of final layers of the pre-trained DNN CCP model.
9. The method according to claim 2, characterized in that, The one or more layers include all layers in the pre-trained DNN CCP model that have the same layer properties.
10. The method according to claim 1, characterized in that, The quality calculation of the reconstructed chromaticity components is based on one or more chromaticity components from other prediction modes and the original chromaticity components associated with the reconstructed chromaticity components.
11. The method according to claim 1, characterized in that, The updating of one or more parameters of the pre-trained DNN CCP model further includes: Calculating the first compression performance of the updated DNN CCP model, wherein the updated DNN CCP model includes one or more parameters updated with low bit precision; Calculating the second compression performance of the pre-trained DNN CCP model, wherein the pre-trained DNN CCP model includes one or more relevant parameters; and Determining, based on a comparison of the first compression performance and the second compression performance, whether to update the pre-trained DNN CCP model to include one or more parameters updated with low bit precision, wherein the first compression performance and the second compression performance are above a threshold.
12. An apparatus for cross-component prediction, characterized in that, The apparatus includes: At least one memory configured to store program code; and At least one processor configured to read the program code and operate according to the instructions of the program code, the program code comprising: Reconstruction code configured to cause the at least one processor to reconstruct the chromaticity components based on the received luminance components using a pre-trained DNN CCP model for chromaticity prediction; Update code configured to cause the at least one processor to update one or more parameters of the pre-trained DNN CCP model with low bit precision; Generation code configured to cause the at least one processor to generate an updated DNN CCP model based on at least one video sequence, the updated DNN CCP model being used for chroma prediction with low bit precision; and Prediction code configured to cause the at least one processor to perform cross-component prediction of the at least one video sequence using the updated DNN CCP model within a reduced processing time, wherein the updated DNN CCP model uses the low bit precision during inference, the low bit precision format being one of FP16, INT8, INT4, INT2, or INT1.
13. The apparatus according to claim 12, characterized in that, Updating one or more parameters of the pre-trained DNN CCP model includes updating one or more parameters from one or more layers of the pre-trained DNN CCP model with low bit precision.
14. The apparatus according to claim 12, characterized in that, The one or more parameters updated with low bit precision include one or more bias parameters and one or more weight parameters, wherein the one or more bias parameters and one or more weight parameters are parameters that are jointly updated.
15. The apparatus according to claim 13, characterized in that, The one or more layers include one or more convolutional layers, a set of final layers, or all layers with the same layer properties in the pre-trained DNN CCP model.
16. The apparatus according to claim 12, characterized in that, The quality calculation of the reconstructed chromaticity components is based on one or more chromaticity components from other prediction modes and the original chromaticity components associated with the reconstructed chromaticity components.
17. A non-transitory computer-readable medium storing instructions, characterized in that, When the instructions are executed by at least one processor used for low-bit-precision cross-component prediction based on a deep neural network (DNN) during encoding or decoding, the instructions cause the at least one processor to: Reconstruct the chromaticity components based on the received luminance components using a pre-trained DNN CCP model for chromaticity prediction; Update one or more parameters of the pre-trained DNN CCP model with low bit precision; Generate an updated DNN CCP model based on at least one video sequence, wherein the updated DNN CCP model is used for chroma prediction with low bit precision; and Perform cross-component prediction of the at least one video sequence using the updated DNN CCP model within a reduced processing time, wherein the updated DNN CCP model uses the low bit precision during inference, the low bit precision format being one of FP16, INT8, INT4, INT2, or INT1.
18. The non-transitory computer-readable medium according to claim 17, characterized in that, Updating one or more parameters of the pre-trained DNN CCP model includes updating one or more parameters from one or more layers of the pre-trained DNN CCP model with low bit precision.