'Roll' sample group in VVC video coding

By modifying the VVC decoder configuration record and the parameters of the 'roll' sample group, the disclosure resolves the semantic ambiguity of the layer identifier method identification code parameter (layer_id_method_idc), so that VVC video files can be decoded correctly and the decoder handles such files more reliably.

CN116547971B · Active · Publication Date: 2026-06-16 · DOUYIN VISION CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DOUYIN VISION CO LTD
Filing Date
2021-10-26
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

The existing VVC video file format does not correctly specify the semantics of the layer identifier method identification code parameter (layer_id_method_idc) in the signalling of decoder configuration information and the 'roll' sample group, so a decoder may be unable to decode the video file effectively. In particular, when layer_id_method_idc is equal to 1 or 3, the layers to which an access point applies are not clearly defined.

Method used

Modify the VVC decoder configuration record and explicitly specify the layer identifier method identification code parameter. The value of layer_id_method_idc indicates whether an access point applies to all layers or only to the related layers. Modify the grouping type parameter of the 'roll' sample group so that it clearly describes the correspondence between access points and layers.

Benefits of technology

VVC video files can be decoded effectively: the decoder can interpret layer_id_method_idc unambiguously, which improves the efficiency and accuracy of decoding such files.

✦ Generated by Eureka AI based on patent content.


Abstract

A mechanism for processing video data is disclosed. A conversion is performed between visual media data and a visual media data file. The visual media data file includes pictures in layers, a random access recovery point ('roll') sample group specifying access points in the layers, and a grouping type parameter. The grouping type parameter specifies the correspondence between the access points and the related layers among the layers. The grouping type parameter includes a layer identifier method identification code parameter, which specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) pictures; and one or more mixed network abstraction layer (NAL) unit type pictures with both intra random access point (IRAP) subpictures and non-IRAP subpictures.

Description

[0001] Cross-references to related applications

[0002] This patent application claims the benefit of International Application No. PCT/CN2020/123540, filed October 26, 2020, entitled “Signalling of Decoder Configuration Information and the ‘Roll’ Sample Group in VVC Video Files,” which is incorporated herein by reference.

Technical Field

[0003] This patent document relates to the generation, storage, and consumption of digital audio and video media information in file formats.

Background Technology

[0004] Digital video accounts for the largest share of bandwidth usage on the internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video increases, the bandwidth demand for digital video is likely to continue to grow.

Summary of the Invention

[0005] The first aspect relates to a method for processing video data, comprising: performing a conversion between visual media data and a visual media data file, the visual media data file including pictures in layers, a random access recovery point ('roll') sample group specifying access points in the layers, and a grouping type parameter that specifies the correspondence between the access points and related layers and that includes a layer identifier method identification code parameter, the layer identifier method identification code parameter specifying that the access points include one or more of the following: one or more gradual decoding refresh (GDR) pictures; and one or more mixed network abstraction layer (NAL) unit type pictures having both intra random access point (IRAP) subpictures and non-IRAP subpictures.

[0006] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the conversion includes: encoding the pictures into layers in a visual media file; determining a 'roll' sample group specifying access points in the layers; encoding the grouping type parameter into the media file; and storing the visual media file.

[0007] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the conversion includes: receiving a visual media file including pictures coded in layers; obtaining a 'roll' sample group from the media file, the 'roll' sample group specifying access points in the layers; obtaining a grouping type parameter from the media file; and decoding the media file based on the grouping type parameter.

[0008] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the grouping type parameter includes a target layers parameter, which includes multiple bits, each bit specifying one of the related layers.
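The per-layer bit mask described in the preceding paragraph can be pictured with a short Python sketch. The text only states that each bit of the target layers parameter identifies one related layer; the field width (`num_bits`), the choice of bit 0 as the least significant bit, and the function name are assumptions made purely for illustration.

```python
def layers_from_target_layers(target_layers: int, num_bits: int) -> list[int]:
    """Return the positions of the bits set in a target-layers mask.

    Assumption: bit 0 is the least significant bit and bit i marks the i-th
    related layer. The real bit order and field width are not given here.
    """
    if target_layers < 0 or target_layers >= (1 << num_bits):
        raise ValueError("target_layers does not fit in num_bits")
    return [i for i in range(num_bits) if (target_layers >> i) & 1]


# Hypothetical 4-bit mask with bits 0 and 2 set.
print(layers_from_target_layers(0b0101, 4))  # [0, 2]
```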

[0009] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the layer identifier method identification code parameter specifies that the access points apply only to the related layers.

[0010] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the layer identifier method identification code parameter specifies that the access points apply to all layers.

[0011] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the grouping type parameter is denoted grouping_type_parameter, the target layers parameter is denoted target_layers, and the layer identifier method identification code parameter is denoted layer_id_method_idc.

[0012] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when all access points in the related layers are GDR pictures and the access points apply to all layers, layer_id_method_idc is set to 0.

[0013] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when all access points in the related layers are GDR pictures and the access points apply only to the related layers, layer_id_method_idc is set to 1.

[0014] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when the access points in the related layers are GDR pictures, mixed NAL unit type pictures, or a combination thereof, and the access points apply to all layers, layer_id_method_idc is set to 2.

[0015] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when the access points in the related layers are GDR pictures, mixed NAL unit type pictures, or a combination thereof, and the access points apply only to the related layers, layer_id_method_idc is set to 3.
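The four cases above amount to a two-property lookup: whether the access points may include mixed NAL unit type pictures in addition to GDR pictures, and whether they apply to all layers or only to the related layers. The following Python sketch encodes that table. The function and parameter names are hypothetical, and choosing the narrower value (0 or 1) whenever every access point is a GDR picture is an assumption of this sketch rather than something the text mandates.

```python
def choose_layer_id_method_idc(may_include_mixed: bool, related_layers_only: bool) -> int:
    """Pick a layer_id_method_idc value (0..3) from the four cases in the text."""
    return (2 if may_include_mixed else 0) + (1 if related_layers_only else 0)


# Reverse lookup: value -> (kind of access point, layers it applies to).
_MEANING = {
    0: ("GDR pictures only", "all layers"),
    1: ("GDR pictures only", "related layers only"),
    2: ("GDR and/or mixed NAL unit type pictures", "all layers"),
    3: ("GDR and/or mixed NAL unit type pictures", "related layers only"),
}


def describe_layer_id_method_idc(idc: int) -> tuple[str, str]:
    """Return a human-readable reading of a layer_id_method_idc value."""
    if idc not in _MEANING:
        raise ValueError(f"value {idc} is not covered by the four cases in the text")
    return _MEANING[idc]


print(choose_layer_id_method_idc(False, True))   # 1
print(describe_layer_id_method_idc(2))
```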

[0016] The second aspect relates to an apparatus for processing video data, including a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to: perform a conversion between visual media data and a visual media data file, the visual media data file including pictures in layers, a random access recovery point ('roll') sample group specifying access points in the layers, and a grouping type parameter that specifies the correspondence between the access points and related layers and that includes a layer identifier method identification code parameter, the layer identifier method identification code parameter specifying that the access points include one or more of the following: one or more gradual decoding refresh (GDR) pictures; and one or more mixed network abstraction layer (NAL) unit type pictures having both intra random access point (IRAP) subpictures and non-IRAP subpictures.

[0017] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the conversion includes: encoding the pictures into layers in a visual media file; determining a 'roll' sample group specifying access points in the layers; encoding the grouping type parameter into the media file; and storing the visual media file.

[0018] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the conversion includes: receiving a visual media file including pictures coded in layers; obtaining a 'roll' sample group from the media file, the 'roll' sample group specifying access points in the layers; obtaining a grouping type parameter from the media file; and decoding the media file based on the grouping type parameter.

[0019] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the grouping type parameter includes a target layers parameter, which includes multiple bits, each bit specifying one of the related layers.

[0020] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the layer identifier method identification code parameter specifies that the access points apply only to the related layers.

[0021] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the layer identifier method identification code parameter specifies that the access points apply to all layers.

[0022] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that the grouping type parameter is denoted grouping_type_parameter, the target layers parameter is denoted target_layers, and the layer identifier method identification code parameter is denoted layer_id_method_idc.

[0023] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when all access points in the related layers are GDR pictures and the access points apply to all layers, layer_id_method_idc is set to 0.

[0024] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when all access points in the related layers are GDR pictures and the access points apply only to the related layers, layer_id_method_idc is set to 1.

[0025] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when the access points in the related layers are GDR pictures, mixed NAL unit type pictures, or a combination thereof, and the access points apply to all layers, layer_id_method_idc is set to 2.

[0026] Optionally, in any of the foregoing aspects, another implementation of that aspect specifies that when the access points in the related layers are GDR pictures, mixed NAL unit type pictures, or a combination thereof, and the access points apply only to the related layers, layer_id_method_idc is set to 3.

[0027] The third aspect relates to a non-transitory computer-readable medium comprising a computer program product for use by a video codec apparatus, the computer program product comprising computer-executable instructions stored on the non-transitory computer-readable medium, such that when executed by a processor, the computer-executable instructions cause the video codec apparatus to perform any of the methods described in the preceding aspects.

[0028] For clarity, any of the foregoing embodiments may be combined with any one or more other foregoing embodiments to form new embodiments within the scope of this disclosure.

[0029] These and other features will become clearer from the following detailed description taken in conjunction with the accompanying drawings and claims.

Brief Description of the Drawings

[0030] For a more complete understanding of this disclosure, reference is made to the following brief description, taken in conjunction with the accompanying drawings and the detailed description, wherein the same reference numerals denote the same parts.

[0031] Figure 1 is a schematic diagram of an example media file containing a Versatile Video Coding (VVC) bitstream that includes video data.

[0032] Figure 2 is a flowchart of an example method for encoding a 'roll' sample group.

[0033] Figure 3 is a flowchart of an example method for decoding a 'roll' sample group.

[0034] Figure 4 is a block diagram showing an example video processing system.

[0035] Figure 5 is a block diagram of an example video processing device.

[0036] Figure 6 is a flowchart of an example method for video processing.

[0037] Figure 7 is a block diagram illustrating an example video encoding/decoding system.

[0038] Figure 8 is a block diagram of an example encoder.

[0039] Figure 9 is a block diagram of an example decoder.

[0040] Figure 10 is a schematic diagram of an example encoder.

Detailed Implementation

[0041] First, it should be understood that although illustrative implementations of one or more embodiments are provided below, the disclosed systems and / or methods can be implemented using any number of techniques, whether currently known or yet to be developed. This disclosure should not be limited in any way to the exemplary implementations, drawings, and techniques shown below, including the exemplary designs and implementations shown and described herein, but modifications can be made within the full scope of the appended claims and their equivalents.

[0042] VVC, also known as H.266, is used in some descriptions for ease of understanding only and not to limit the scope of the disclosed technology. Therefore, the technology described herein is also applicable to other video codec protocols and designs. In this document, textual edits relative to the VVC specification or the current draft of the International Organization for Standardization (ISO) base media file format (ISOBMFF) specification are indicated by strikethrough for deleted text and italics for added text.

[0043] The example implementations of the above aspects are described below.

[0044] This document relates to video file formats. Specifically, it relates to the signalling of decoder configuration information and the 'roll' sample group in media files that carry Versatile Video Coding (VVC) video bitstreams and are based on the ISO base media file format (ISOBMFF). These ideas can be applied individually or in various combinations to video bitstreams coded by any codec (e.g., the VVC standard) and to any video file format (e.g., the VVC video file format under development).

[0045] Adaptive Colour Transform (ACT), Adaptive Loop Filter (ALF), Adaptive Motion Vector Resolution (AMVR), Adaptation Parameter Set (APS), Access Unit (AU), Access Unit Delimiter (AUD), Advanced Video Coding (Rec. ITU-T H.264 | ISO/IEC 14496-10) (AVC), Bi-predictive (B), Bi-prediction with CU-level Weights (BCW), Bi-Directional Optical Flow (BDOF), Block-based Delta Pulse Code Modulation (BDPCM), Buffering Period (BP), Context-based Adaptive Binary Arithmetic Coding (CABAC), Coding Block (CB), Constant Bit Rate (CBR), Cross-Component Adaptive Loop Filter (CCALF), Coded Layer Video Sequence (CLVS), Coded Picture Buffer (CPB), Clean Random Access (CRA), Cyclic Redundancy Check (CRC), Coding Tree Block (CTB), Coding Tree Unit (CTU), Decoding Capability Information (DCI), Dependent Random Access Point (DRAP), Decoding Unit (DU), Decoding Unit Information (DUI), Exponential-Golomb (EG), k-th order Exponential-Golomb (EGk), End of Bitstream (EOB), End of Sequence (EOS), Filler Data (FD), First-In, First-Out (FIFO), Fixed-Length (FL), Green, Blue, and Red (GBR), General Constraint Information (GCI), Gradual Decoding Refresh (GDR), Geometric Partitioning Mode (GPM), High Efficiency Video Coding (also known as Rec. ITU-T H.265 | ISO/IEC 23008-2) (HEVC), Hypothetical Reference Decoder (HRD), Hypothetical Stream Scheduler (HSS), Intra (I), Intra Block Copy (IBC), Instantaneous Decoding Refresh (IDR), Inter-Layer Reference Picture (ILRP), Intra Random Access Point (IRAP), Low-Frequency Non-Separable Transform (LFNST), Least Probable Symbol (LPS), Least Significant Bit (LSB), Long-Term Reference Picture (LTRP), Luma Mapping with Chroma Scaling (LMCS), Matrix-based Intra Prediction (MIP), Most Probable Symbol (MPS), Most Significant Bit (MSB), Multiple Transform Selection (MTS), Motion Vector Prediction (MVP), Network Abstraction Layer (NAL), Output Layer Set (OLS), Operation Point (OP), Operation Point Information (OPI), Predictive (P), Picture Header (PH), Picture Order Count (POC), Picture Parameter Set (PPS), Prediction Refinement with Optical Flow (PROF), Picture Timing (PT), Picture Unit (PU), Quantization Parameter (QP), Random Access Decodable Leading (RADL), Random Access Skipped Leading (RASL), Raw Byte Sequence Payload (RBSP), Red, Green, and Blue (RGB), Reference Picture List (RPL), Sample Adaptive Offset (SAO), Sample Aspect Ratio (SAR), Supplemental Enhancement Information (SEI), Slice Header (SH), Subpicture Level Information (SLI), String of Data Bits (SODB), Sequence Parameter Set (SPS), Short-Term Reference Picture (STRP), Step-wise Temporal Sublayer Access (STSA), Truncated Rice (TR), Variable Bit Rate (VBR), Video Coding Layer (VCL), Video Parameter Set (VPS), Versatile Supplemental Enhancement Information (also known as Rec. ITU-T H.274 | ISO/IEC 23002-7) (VSEI), Video Usability Information (VUI), Versatile Video Coding (also known as Rec. ITU-T H.266 | ISO/IEC 23090-3) (VVC), and Wavefront Parallel Processing (WPP).

[0046] Video coding standards have evolved primarily through the development of ITU-T and ISO/IEC standards. ITU-T developed the H.261 and H.263 standards, while ISO/IEC developed the MPEG-1 and MPEG-4 Visual standards. The two organizations jointly developed the H.262/MPEG-2 Video standard, the H.264/MPEG-4 Advanced Video Coding (AVC) standard, and the H.265/HEVC standard. Since H.262, video coding standards have been based on a hybrid video coding structure that combines temporal prediction with transform coding. To explore video coding technologies beyond HEVC, the Video Coding Experts Group (VCEG) and MPEG jointly established the Joint Video Exploration Team (JVET). JVET adopted many methods and incorporated them into reference software called the Joint Exploration Model (JEM). When the Versatile Video Coding (VVC) project officially started, JVET was renamed the Joint Video Experts Team (JVET). VVC is a coding standard that targets a 50% bitrate reduction compared to HEVC. VVC has been finalized by JVET.

[0047] The Versatile Video Coding (VVC) standard (ITU-T H.266 | ISO/IEC 23090-3) and the related Versatile Supplemental Enhancement Information (VSEI) standard (ITU-T H.274 | ISO/IEC 23002-7) are designed for a wide range of applications, including television broadcasting, video conferencing, and playback from storage media, as well as more advanced use cases such as adaptive bitrate streaming, video region extraction, composition and merging of content from multiple coded video bitstreams, multi-view video, scalable layered coding, and viewport-adaptive 360° immersive media.

[0048] Media streaming applications are typically based on Internet Protocol (IP), Transmission Control Protocol (TCP), and Hypertext Transfer Protocol (HTTP) transport, and often rely on file formats such as the ISO base media file format (ISOBMFF). One such streaming system is Dynamic Adaptive Streaming over HTTP (DASH). To use a video format with ISOBMFF and DASH, a file format specification specific to that video format, such as the AVC file format or the HEVC file format, is used to encapsulate the video content in ISOBMFF tracks and in DASH representations and segments. Information about the video bitstream, such as the profile, tier, and level, and much other information, is exposed as file-format-level metadata and/or in the DASH Media Presentation Description (MPD) for content selection purposes, such as selecting appropriate media segments both for initialization at the start of a streaming session and for stream adaptation during the session.

[0049] Similarly, to use an image format with ISOBMFF, a file format specification specific to that image format is used, such as the AVC image file format or the HEVC image file format. MPEG is developing the VVC video file format, which is an ISOBMFF-based file format for storing VVC video content. MPEG is also developing an ISOBMFF-based VVC image file format for storing image content coded using VVC.

[0050] The following is a design based on the VVC image file format and some VVC file format features of MPEG. This sub-clause specifies the decoder configuration information for ISO/IEC 23090-3 video content. This record contains the size of the length field used in each sample to indicate the length of its contained NAL units, as well as the parameter sets, DCI, OPI, and SEI NAL units (if stored in the sample entry). This record is externally framed (its size is supplied by the structure that contains it). This record contains a version field. The specification defines version 1 of this record. Incompatible changes to the record are indicated by a change of the version number. If the version number is not recognized, the reader should not attempt to decode this record or the streams to which it applies. Compatible extensions to this record extend it and do not change the configuration version code. Readers should be prepared to ignore unrecognized data beyond the definition of the data they understand.

[0051] When a track contains a VVC bitstream natively or through resolving 'subp' track references, a VVC profile, tier, and level record (VvcPTLRecord) should be present in the decoder configuration record, and in this case the specific output layer set of the VVC bitstream is indicated by the field output_layer_set_idx. If ptl_present_flag is equal to 0 in the decoder configuration record of a track, the track should have an 'oref' track reference to an ID, which can refer to a VVC track or an 'opeg' entity group. The values of the syntax elements of VvcPTLRecord, chroma_format_idc, and bit_depth_minus8 are valid for all parameter sets referenced when decoding the stream described by this record (referred to as "all parameter sets" in the following sentences of this paragraph). Specifically, the following restrictions apply:

[0052] The profile indication `general_profile_idc` should indicate a profile to which the output layer set identified by `output_layer_set_idx` in this configuration record conforms. If different profiles are marked for different CVSs of that output layer set, the stream may need to be examined to determine which profile, if any, the entire stream conforms to. If the entire stream is not examined, or the examination shows that there is no profile to which the entire stream conforms, the entire stream should be split into two or more sub-streams with separate configuration records in which these rules can be met. The tier indication `general_tier_flag` should indicate a tier equal to or greater than the highest tier indicated in all `profile_tier_level()` syntax structures (in all parameter sets) to which the output layer set identified by `output_layer_set_idx` in this configuration record conforms.

[0053] Each bit in `general_constraint_info` can be set only if that bit is set in all `general_constraints_info()` syntax structures of all `profile_tier_level()` syntax structures (in all parameter sets) to which the output layer set identified by `output_layer_set_idx` in this configuration record conforms. The level indication `general_level_idc` should indicate the highest capability level among all `profile_tier_level()` syntax structures (in all parameter sets) to which the output layer set identified by `output_layer_set_idx` in this configuration record conforms.
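One way to read the tier, level, and constraint-bit rules above is as a fold over every `profile_tier_level()` structure that applies to the selected output layer set: take the highest tier, take the highest level, and keep only the constraint bits that are set everywhere. The Python sketch below illustrates that reading. The `Ptl` class, its field names, and the representation of `general_constraint_info` as a plain integer are assumptions for illustration, not the record's actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ptl:
    tier_flag: int        # general_tier_flag (0 or 1)
    level_idc: int        # general_level_idc
    constraint_bits: int  # general_constraint_info, modelled as an integer bit field


def aggregate_ptl(ptls: list[Ptl]) -> Ptl:
    """Combine per-parameter-set PTL values into one record-level value."""
    if not ptls:
        raise ValueError("at least one profile_tier_level() structure is needed")
    bits = ptls[0].constraint_bits
    for p in ptls[1:]:
        bits &= p.constraint_bits  # a bit survives only if it is set in every structure
    return Ptl(
        tier_flag=max(p.tier_flag for p in ptls),  # lowest value that is >= the highest tier
        level_idc=max(p.level_idc for p in ptls),  # highest level seen
        constraint_bits=bits,
    )


# Hypothetical values for two parameter sets.
print(aggregate_ptl([Ptl(0, 51, 0b1101), Ptl(1, 83, 0b0111)]))
```

Bitwise AND yields the largest set of constraint bits the rule permits; a writer could also set fewer bits without violating it.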

[0054] The following constraints apply to the chroma format identifier (chroma_format_idc). If the VVC stream to which the configuration record applies is a single-layer bitstream, the value of sps_chroma_format_idc as defined in ISO/IEC 23090-3 should be the same in all SPSs referenced by the VCL NAL units in the samples to which the current sample entry description applies, and the value of chroma_format_idc should be equal to the value of sps_chroma_format_idc. Otherwise (if the VVC stream to which the configuration record applies is a multi-layer bitstream), the value of vps_ols_dpb_chroma_format[MultiLayerOlsIdx[output_layer_set_idx]] should be the same for all CVSs to which the current sample entry description applies, and the value of chroma_format_idc should be equal to the value of vps_ols_dpb_chroma_format[MultiLayerOlsIdx[output_layer_set_idx]].

[0055] The following constraints apply to bit_depth_minus8. If the VVC stream to which the configuration record applies is a single-layer bitstream, the value of sps_bitdepth_minus8 should be the same in all SPSs referenced by the VCL NAL units in the samples to which the current sample entry description applies, and the value of bit_depth_minus8 should be equal to the value of sps_bitdepth_minus8. Otherwise (if the VVC stream to which the configuration record applies is a multi-layer bitstream), the value of vps_ols_dpb_bitdepth_minus8[MultiLayerOlsIdx[output_layer_set_idx]] should be the same for all CVSs to which the current sample entry description applies, and the value of bit_depth_minus8 should be equal to the value of vps_ols_dpb_bitdepth_minus8[MultiLayerOlsIdx[output_layer_set_idx]].

[0056] The following constraints apply to `picture_width`. If the VVC stream to which the configuration record applies is a single-layer bitstream, the value of `sps_pic_width_max_in_luma_samples` as defined in ISO/IEC 23090-3 should be the same in all SPSs referenced by the VCL NAL units in the samples to which the current sample entry description applies, and the value of `picture_width` should be equal to the value of `sps_pic_width_max_in_luma_samples`. Otherwise (if the VVC stream to which the configuration record applies is a multi-layer bitstream), the value of `vps_ols_dpb_pic_width[MultiLayerOlsIdx[output_layer_set_idx]]` should be the same for all CVSs to which the current sample entry description applies, and the value of `picture_width` should be equal to the value of `vps_ols_dpb_pic_width[MultiLayerOlsIdx[output_layer_set_idx]]`.

[0057] The following constraints apply to `picture_height`. If the VVC stream to which the configuration record applies is a single-layer bitstream, the value of `sps_pic_height_max_in_luma_samples` should be the same in all SPSs referenced by the VCL NAL units in the samples to which the current sample entry description applies, and the value of `picture_height` should be equal to the value of `sps_pic_height_max_in_luma_samples`. Otherwise (if the VVC stream to which the configuration record applies is a multi-layer bitstream), the value of `vps_ols_dpb_pic_height[MultiLayerOlsIdx[output_layer_set_idx]]` should be the same for all CVSs to which the current sample entry description applies, and the value of `picture_height` should be equal to the value of `vps_ols_dpb_pic_height[MultiLayerOlsIdx[output_layer_set_idx]]`.
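The four constraints above share one pattern: pick the SPS-level values for a single-layer bitstream or the VPS-level values for a multi-layer bitstream, require them to agree, and copy the common value into the record. A minimal Python sketch of that pattern follows. The function name and the idea of passing in pre-collected value lists are assumptions; extracting those values from real SPS and VPS NAL units is out of scope here.

```python
def derive_record_field(single_layer: bool,
                        sps_values: list[int],
                        vps_ols_dpb_values: list[int]) -> int:
    """Return the value a configuration-record field should carry.

    sps_values: the SPS syntax element from every referenced SPS (single-layer case).
    vps_ols_dpb_values: the vps_ols_dpb_* element for the chosen OLS from every CVS
    (multi-layer case).
    """
    values = sps_values if single_layer else vps_ols_dpb_values
    if not values:
        raise ValueError("no candidate values supplied")
    if len(set(values)) != 1:
        # The constraint requires one common value under a given sample entry.
        raise ValueError(f"values differ under one sample entry: {sorted(set(values))}")
    return values[0]


# Hypothetical example: picture_width for a single-layer stream with two SPSs.
print(derive_record_field(True, [1920, 1920], []))  # 1920
```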

[0058] The VVC decoder configuration record provides explicit indications of the chroma format, bit depth, and other formatting information used by the VVC video elementary stream. If two sequences differ in their color space or bit depth indications in their VUI information, then two different VVC sample entries are also used.

[0059] There is a set of arrays to carry initialization non-VCL NAL units. The NAL unit types are restricted to indicate DCI, OPI, VPS, SPS, PPS, prefix APS, and prefix SEI NAL units only. Reserved NAL unit types may be defined in the future, and readers should ignore arrays with reserved or disallowed NAL unit type values. This tolerant behavior is designed so that errors are not raised, allowing backward-compatible extensions of these arrays in future specifications. The NAL units carried in a sample entry are included immediately after the AUD and OPI NAL units (if any) in, or otherwise at the beginning of, the access unit reconstructed from the first sample that references the sample entry.

[0060] It is recommended that the arrays be arranged in the order of DCI, OPI, VPS, SPS, PPS, prefix APS, and prefix SEI.
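The recommended ordering, together with the rule that readers ignore arrays of reserved or disallowed types, can be sketched as follows in Python. The arrays are keyed by symbolic type names rather than numeric nal_unit_type values, and the names and the `order_arrays` helper are illustrative assumptions.

```python
# Recommended array order from the text, using symbolic names (assumed labels).
RECOMMENDED_ORDER = ["DCI", "OPI", "VPS", "SPS", "PPS", "PREFIX_APS", "PREFIX_SEI"]


def order_arrays(arrays: dict[str, list[bytes]]) -> list[tuple[str, list[bytes]]]:
    """Return the allowed arrays in the recommended order.

    Arrays whose type is not in RECOMMENDED_ORDER are skipped without raising,
    mirroring the tolerant reader behaviour described for reserved or
    disallowed NAL unit types.
    """
    return [(name, arrays[name]) for name in RECOMMENDED_ORDER if name in arrays]


# Hypothetical input: three arrays, one with an unknown type.
parsed = {"SPS": [b"sps0"], "RESERVED_X": [b"??"], "DCI": [b"dci0"]}
print([name for name, _ in order_arrays(parsed)])  # ['DCI', 'SPS']
```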

[0061] The example syntax for VvcPTLRecord and VvcDecoderConfigurationRecord is as follows:

[0062]

[0063]

[0064] Example semantics of the above syntax elements are as follows.

[0065] `num_bytes_constraint_info` specifies the length of the `general_constraint_info` field. The length of the `general_constraint_info` field is `num_bytes_constraint_info * 8 - 2` bits. This value should be greater than 0. A value of 1 indicates that `gci_present_flag` in the `general_constraint_info()` syntax structure represented by the `general_constraint_info` field is equal to 0.
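As a worked instance of the length formula above, the following Python sketch computes the bit length of `general_constraint_info` from `num_bytes_constraint_info`. The function name is hypothetical; only the formula and the greater-than-zero requirement come from the text.

```python
def general_constraint_info_bits(num_bytes_constraint_info: int) -> int:
    """Bit length of general_constraint_info: num_bytes_constraint_info * 8 - 2."""
    if num_bytes_constraint_info <= 0:
        raise ValueError("num_bytes_constraint_info should be greater than 0")
    return num_bytes_constraint_info * 8 - 2


print(general_constraint_info_bits(1))  # 6
print(general_constraint_info_bits(2))  # 14
```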

[0066] general_profile_idc, general_tier_flag, general_level_idc, ptl_frame_only_constraint_flag, ptl_multilayer_enabled_flag, general_constraint_info, sublayer_level_present[j], sublayer_level_idc[i], num_sub_profiles, and general_sub_profile_idc[j] contain the matching values for the fields or syntax structures general_profile_idc, general_tier_flag, general_level_idc, ptl_frame_only_constraint_flag, ptl_multilayer_enabled_flag, general_constraint_info(), ptl_sublayer_level_present[i], sublayer_level_idc[i], ptl_num_sub_profiles, and general_sub_profile_idc[j] for the stream to which this configuration record applies.

[0067] `lengthSizeMinusOne` plus 1 indicates the length in bytes of the `NALUnitLength` field in a VVC video stream sample in the stream to which this configuration record applies. For example, a size of one byte is indicated by the value 0. The value of this field should be one of 0, 1, or 3, corresponding to a length coded with 1, 2, or 4 bytes, respectively.
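To illustrate how a reader might use this field, the Python sketch below splits a sample into NAL units by reading a length prefix of `lengthSizeMinusOne + 1` bytes before each unit. It assumes the sample consists solely of length-prefixed NAL units and that the length is stored big-endian, as is usual for ISOBMFF fields; the function name and the test bytes are made up.

```python
def split_nal_units(sample: bytes, length_size_minus_one: int) -> list[bytes]:
    """Split a sample into NAL unit payloads using the NALUnitLength prefix."""
    if length_size_minus_one not in (0, 1, 3):
        raise ValueError("lengthSizeMinusOne should be 0, 1, or 3")
    prefix = length_size_minus_one + 1
    units: list[bytes] = []
    pos = 0
    while pos < len(sample):
        if pos + prefix > len(sample):
            raise ValueError("truncated NALUnitLength field")
        size = int.from_bytes(sample[pos:pos + prefix], "big")  # assumed big-endian
        pos += prefix
        if pos + size > len(sample):
            raise ValueError("truncated NAL unit")
        units.append(sample[pos:pos + size])
        pos += size
    return units


# Two made-up NAL units with 2-byte length prefixes (lengthSizeMinusOne == 1).
print(split_nal_units(b"\x00\x02\xaa\xbb\x00\x01\xcc", 1))  # [b'\xaa\xbb', b'\xcc']
```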

[0068] A ptl_present_flag value of 1 indicates that the track contains a VVC bitstream corresponding to the operation point specified by output_layer_set_idx and numTemporalLayers, and all NAL units in the track belong to that operation point. A ptl_present_flag value of 0 indicates that the track may not contain a VVC bitstream corresponding to a specific operation point, but may instead contain a VVC bitstream corresponding to multiple output layer sets, or may contain one or more individual layers that do not form an output layer set, or individual sublayers other than the sublayer with TemporalId equal to 0.

[0069] track_ptl specifies the profile, tier, and level of the output layer set represented by the VVC bitstream contained in the track.

[0070] `output_layer_set_idx` specifies the index of the output layer set represented by the VVC bitstream contained in the track. The value of `output_layer_set_idx` can be used as the value of the `TargetOlsIdx` variable, provided to the VVC decoder by external means or by an OPI NAL unit, for decoding the bitstream contained in the track.

[0071] `avgFrameRate` gives the average frame rate, in units of frames/(256 seconds), for the stream to which this configuration record applies. A value of 0 indicates an unspecified average frame rate. When the track contains multiple layers and samples are reconstructed for the operation point specified by `output_layer_set_idx` and `numTemporalLayers`, this gives the average access unit rate of the bitstream at that operation point.
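Because the unit is frames per 256 seconds, dividing the stored value by 256 yields frames per second. A small Python sketch of that conversion follows; the function name is hypothetical, and returning `None` for the unspecified case is a design choice of this sketch.

```python
from typing import Optional


def average_fps(avg_frame_rate: int) -> Optional[float]:
    """Convert avgFrameRate (frames per 256 seconds) to frames per second.

    Returns None when the field is 0, which signals an unspecified rate.
    """
    if avg_frame_rate == 0:
        return None
    return avg_frame_rate / 256.0


print(average_fps(7680))  # 30.0
print(average_fps(0))     # None
```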

[0072] A constantFrameRate value of 1 indicates that the stream to which this configuration record is applied has a constant frame rate. A value of 2 indicates that the representation of each temporal layer in the stream has a constant frame rate. A value of 0 indicates that the stream may or may not have a constant frame rate. When a track contains multiple layers and samples are reconstructed for operation points specified by output_layer_set_idx and numTemporalLayers, this provides an indication of whether the bitstream at the operation point has a constant access unit rate.

[0073] A numTemporalLayers value greater than 1 indicates that the track to which this configuration record applies is temporally scalable, and that the number of temporal layers (also known as temporal sublayers or sublayers) it contains is equal to numTemporalLayers. A value of 1 indicates that the track to which this configuration record applies is not temporally scalable. A value of 0 indicates that it is unknown whether the track to which this configuration record applies is temporally scalable.

[0074] chroma_format_idc indicates the chroma format that applies to this track.

[0075] picture_width indicates the maximum picture width, in units of luma samples, that applies to this track.

[0076] picture_height indicates the maximum picture height, in units of luma samples, that applies to this track.

[0077] bit_depth_minus8 indicates the bit depth that applies to this track.

[0078] numArrays represents the number of arrays of NAL units of the indicated type.

[0079] array_completeness equal to 1 indicates that all NAL units of the given type are in the following array and none are in the stream; array_completeness equal to 0 indicates that additional NAL units of the indicated type may be in the stream. The allowed values are constrained by the sample entry name.

[0080] NAL_unit_type indicates the type of the NAL units in the following array (all of which shall be of this type); it is restricted to one of the values indicating a DCI, OPI, VPS, SPS, PPS, prefix APS, or prefix SEI NAL unit.

[0081] `numNalus` indicates the number of NAL units of the indicated type included in the configuration record for the stream to which this configuration record applies. The SEI array shall contain only SEI messages of a declarative nature, i.e., those that provide information about the stream as a whole. An example of such an SEI message is a user data SEI message.

[0082] nalUnitLength indicates the byte length of a NAL unit.

[0083] nalUnit contains DCI, OPI, VPS, SPS, PPS, APS, or declarative SEI NAL units.

[0084] The random access recovery point sample group, also known as the 'roll' sample group, is used to provide information on recovery points for gradual decoding refresh (GDR). When the 'roll' sample group is used with VVC tracks, the syntax and semantics of grouping_type_parameter are specified to be the same as those for the 'sap' sample group.

[0085] layer_id_method_idc equal to 0 and 1 are used when the pictures of the target layers in a sample mapped to the 'roll' sample group are GDR pictures. When layer_id_method_idc is equal to 0, the 'roll' sample group specifies the behavior of all layers present in the track.

[0086] The semantics of layer_id_method_idc equal to 1 are specified in clause 9.5.7.

[0087] When not all pictures of the target layers in a sample mapped to the 'roll' sample group are GDR pictures, layer_id_method_idc equal to 2 and 3 are used, and for the pictures of the target layers that are not GDR pictures, the following applies: the referenced PPS has pps_mixed_nalu_types_in_pic_flag equal to 1, and for each subpicture index i in the range of 0 to sps_num_subpics_minus1 (inclusive), both of the following are true: sps_subpic_treated_as_pic_flag[i] is equal to 1, and there is at least one IRAP subpicture with the same subpicture index i in the current sample or a subsequent sample in the same CLVS. When layer_id_method_idc is equal to 2, the 'roll' sample group specifies the behavior of all layers present in the track. The semantics of layer_id_method_idc equal to 3 are specified in clause 9.5.7. When a reader starts decoding from a sample marked with layer_id_method_idc equal to 2 or 3, the reader needs to further modify the SPS, PPS, and PH NAL units of the bitstream so that the bitstream starting with the sample marked as belonging to a sample group with layer_id_method_idc equal to 2 or 3 is a conforming bitstream, as follows: any SPS referenced by the sample has sps_gdr_enabled_flag equal to 1; any PPS referenced by the sample has pps_mixed_nalu_types_in_pic_flag equal to 0; all VCL NAL units of the AU have nal_unit_type equal to GDR_NUT; and any picture header of the AU has ph_gdr_pic_flag equal to 1 and a ph_recovery_poc_cnt value that corresponds to the roll_distance of the 'roll' sample group to which the AU belongs. When a 'roll' sample group concerns a dependent layer but not its reference layer(s), the sample group indicates the characteristics that apply when all reference layers of the dependent layer are available and decoded. The sample group can be used to start decoding of the predicted layer.

[0088] When layer_id_method_idc equals 1, each bit in the target_layers field represents a layer carried in the track. Since this field is only 28 bits long, the SAP indication in the track is limited to a maximum of 28 layers. Each bit of this field, starting from the least significant bit (LSB), is mapped, in ascending order of layer_id value, to the list of layer_id values signaled in the layer information sample group ('linf') associated with that sample.
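The bit-to-layer mapping described above can be sketched as follows. The function name `target_layer_ids` and the representation of the 'linf' contents as a plain list of integers are assumptions made for illustration only.

```python
from typing import Iterable, List

TARGET_LAYERS_BITS = 28  # width of the target_layers field


def target_layer_ids(target_layers: int, linf_layer_ids: Iterable[int]) -> List[int]:
    """Return the layer_id values selected by the target_layers bit mask.

    Bit 0 (the LSB) maps to the smallest layer_id listed in 'linf',
    bit 1 to the next smallest, and so on.
    """
    if not 0 <= target_layers < (1 << TARGET_LAYERS_BITS):
        raise ValueError("target_layers must fit in 28 bits")
    # Only the first 28 layers, in ascending layer_id order, are addressable.
    ordered = sorted(linf_layer_ids)[:TARGET_LAYERS_BITS]
    return [layer_id for bit, layer_id in enumerate(ordered)
            if (target_layers >> bit) & 1]
```

For example, with layers 0, 2, and 4 listed in 'linf', a mask of 0b101 selects layers 0 and 4.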

[0089] The following are example technical problems solved by the disclosed technical solutions. The latest design of the VVC video file format has the following problems regarding the signaling of the decoder configuration information and of the 'roll' sample group. First, in VvcDecoderConfigurationRecord, when profile, tier, and level (PTL) information is signaled, picture format parameters, including chroma format, bit depth, picture width, and picture height, are also signaled. This information can be used for content selection. However, other parameters that can also be used for content selection are not signaled in the decoder configuration record, such as the required decoded picture buffer size, maximum picture output reordering, maximum latency, GDR picture enabled flag, CRA picture enabled flag, reference picture resampling enabled flag, enabled flag for spatial resolution change within a CLVS, subpicture partitioning enabled flag, maximum number of subpictures per picture, WPP enabled flag, tile partitioning enabled flag, maximum number of tiles per picture, slice partitioning enabled flag, rectangular slice enabled flag, raster-scan slice enabled flag, and maximum number of slices per picture.

[0090] Second, in VvcDecoderConfigurationRecord, when the PTL information is signaled, the numTemporalLayers field is also signaled, but only after the PTL information. However, the syntax structure that carries the PTL information depends on the numTemporalLayers field.

[0091] Third, the semantics of the field layer_id_method_idc equal to 1 or 3 are not correctly specified in the description of the random access recovery point sample group, i.e., the 'roll' sample group. Specifically, the layers to which the signaled information applies should be specified when layer_id_method_idc is equal to 1 or 3, but this is left unspecified.

[0092] This document discloses mechanisms for addressing one or more of the problems listed above. In one example, the VVC decoder configuration record is modified to position the number of sublayers before the PTL record. This allows the decoder to first obtain the number of sublayers and then use that number to obtain the PTL record for each sublayer. In another example, the grouping type parameter of the 'roll' sample group is modified to describe more clearly the correspondence between the access points in the 'roll' sample group and the layers to which these access points apply. For example, the target layers parameter can indicate the layers associated with the access points. Furthermore, the layer identifier method identifier code can be set to indicate whether the access points apply to all layers or only to the layers specified in the target layers parameter. Additionally, the layer identifier method identifier code can be set to indicate whether the access points consist only of GDR pictures or include a combination of GDR pictures and mixed NAL unit pictures.
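The reason the sublayer count has to come first can be sketched with a deliberately simplified byte layout. The layout below is hypothetical and is not the VvcDecoderConfigurationRecord syntax: it assumes one byte for numTemporalLayers, one byte for general_level_idc, and then one level byte for each sublayer below the highest. The point is only that the parser cannot know how many sublayer entries to read until it has read the count.

```python
from typing import List, Tuple


def parse_toy_record(buf: bytes) -> Tuple[int, int, List[int]]:
    """Parse a toy record whose PTL part depends on the sublayer count.

    Hypothetical layout, for illustration only:
      byte 0      numTemporalLayers
      byte 1      general_level_idc
      bytes 2..   one sublayer level byte per sublayer below the highest
    """
    if len(buf) < 2:
        raise ValueError("record too short")
    num_temporal_layers = buf[0]  # read first: the PTL length depends on it
    general_level_idc = buf[1]
    num_sublayer_entries = max(num_temporal_layers - 1, 0)
    end = 2 + num_sublayer_entries
    if len(buf) < end:
        raise ValueError("truncated PTL information")
    sublayer_levels = list(buf[2:end])
    return num_temporal_layers, general_level_idc, sublayer_levels
```

If the count were stored after the variable-length part, a single forward pass over the bytes could not determine where that part ends.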

[0093] To address the aforementioned and other issues, the following summarized methods are disclosed. These items should be considered as examples for explaining general concepts, and not interpreted in a narrow way. Furthermore, these items can be applied individually or in combination in any way.

[0094] Example 1

[0095] To address the first issue, one or more of the following parameters can be signaled in VvcDecoderConfigurationRecord: maximum required size of the decoded picture buffer, maximum picture output reordering (e.g., the maximum number of pictures that may precede any picture in decoding order and follow that picture in output order), maximum latency (e.g., the maximum number of pictures that may precede any picture in output order and follow that picture in decoding order), GDR picture enabled flag, CRA picture enabled flag, reference picture resampling enabled flag, enabled flag for spatial resolution change within a CLVS, subpicture partitioning enabled flag, maximum number of subpictures per picture, WPP enabled flag, tile partitioning enabled flag, maximum number of tiles per picture, slice partitioning enabled flag, rectangular slice enabled flag, raster-scan slice enabled flag, and maximum number of slices per picture.

[0096] (a) In one example, one or more of the above parameters are signaled in VvcDecoderConfigurationRecord only when the PTL information is signaled.

[0097] (b) In one example, one or more of the parameters may be present before the PTL information. Furthermore, byte alignment may be required for all parameters signaled before the PTL information. In one example, reserved bits may additionally be signaled.

[0098] (c) In one example, one or more of the parameters may be present after the PTL information. Furthermore, byte alignment may be required for all parameters signaled after the PTL information. In one example, reserved bits may additionally be signaled.

[0099] (d) In one example, a subset of the parameters may be present before the PTL information, while the remainder may be present after it. Furthermore, byte alignment may be required for all parameters signaled before the PTL information. In one example, reserved bits may additionally be signaled.

[0100] (e) Additionally, byte alignment may be required for all parameters signaled after the PTL information. In one example, reserved bits may additionally be signaled.

[0101] Example 2

[0102] To address the second issue, VvcDecoderConfigurationRecord is modified so that, when the PTL information is signaled, the numTemporalLayers field is signaled before the PTL information.

[0103] (a) In one example, when PTL information is signaled in VvcDecoderConfigurationRecord, it is signaled after the fields chroma_format_idc, bit_depth_minus8, numTemporalLayers, and constantFrameRate. In another example, PTL information is signaled directly after all the above fields and some reserved bits.

[0104] (b) In one example, when signaling PTL information in VvcDecoderConfigurationRecord, it is signaled after the fields numTemporalLayers and constantFrameRate. In another example, the PTL information is signaled directly after all the aforementioned fields and some reserved bits. Furthermore, additional reserved bits are signaled after the PTL information.

[0105] (c) In another example, when the PTL information is signaled in VvcDecoderConfigurationRecord, it is signaled as the last of all fields that are conditional on "if(ptl_present_flag)".

[0106] (d) In one example, reserved bits are signaled before the PTL information.

[0107] Example 3

[0108] To address the third issue, one or more of the following changes are made:

[0109] (a) The sentence "The semantics of layer_id_method_idc equal to 1 are specified in clause 9.5.7." is amended as follows: "When layer_id_method_idc equals 1, the layers whose behavior is specified by the 'roll' sample group are specified in clause 9.5.7." As used herein, clause 9.5.7 refers to the correspondingly numbered clause of the document ISO / IEC 14496-15:2021(E), entitled "Information technology - Coding of audio-visual objects - Part 15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format".

[0110] (b) The sentence "The semantics of layer_id_method_idc equal to 3 are specified in clause 9.5.7." is amended as follows: "When layer_id_method_idc equals 3, the layers whose behavior is specified by the 'roll' sample group are specified in the same way as when layer_id_method_idc equals 1, as specified in clause 9.5.7."

[0111] Example 4

[0112] To address the third issue, alternatively, one or more of the following changes can be made:

[0113] (a) The following sentence in clause 9.5.7: "When layer_id_method_idc equals 1, each bit in the target_layers field represents a layer carried in the track." is amended as follows: "When layer_id_method_idc equals 1 or 3, each bit in the target_layers field represents a layer carried in the track."

[0114] (b) The following sentence: "The semantics of layer_id_method_idc equal to 1 are specified in clause 9.5.7." is amended as follows: "When layer_id_method_idc equals 1, the layers whose behavior is specified by the 'roll' sample group are specified in clause 9.5.7."

[0115] (c) The following sentence: "The semantics of layer_id_method_idc equal to 3 are specified in clause 9.5.7." is amended as follows: "When layer_id_method_idc equals 3, the layers whose behavior is specified by the 'roll' sample group are specified in clause 9.5.7."

[0116] The following are some example implementations of aspects summarized above, which can be applied to the standard specification of the VVC video file format. The revised text is based on the latest draft specification of the aforementioned related features. Added or modified parts are indicated by underline and bold, and deleted parts are indicated by bold italics.

[0117] In one example, the syntax of VvcDecoderConfigurationRecord is modified as follows:

[0118]

[0119]

[0120] In one example, the semantics of VvcDecoderConfigurationRecord are modified as follows:

[0121] A ptl_present_flag value of 1 indicates that the track contains a VVC bitstream corresponding to the operation point specified by output_layer_set_idx and numTemporalLayers, and all NAL units in the track belong to that operation point. A ptl_present_flag value of 0 indicates that the track may not contain a VVC bitstream corresponding to a specific operation point, but may instead contain a VVC bitstream corresponding to multiple output layer sets, or may contain one or more individual layers that do not form an output layer set, or individual sublayers other than the sublayer with TemporalId equal to 0.

[0122]

[0123] track_ptl specifies the profile, tier, and level of the output layer set represented by the VVC bitstream contained in the track.

[0124] `output_layer_set_idx` specifies the output layer set index of the output layer set represented by the VVC bitstream contained in the track. The value of `output_layer_set_idx` can be used as the value of the `TargetOlsIdx` variable provided to the VVC decoder by external means or by an OPI NAL unit, as specified in ISO / IEC 23090-3, for decoding the bitstream contained in the track.

[0125]

[0126] picture_width indicates the maximum picture width, in units of luma samples, that applies to this track.

[0127] picture_height indicates the maximum picture height, in units of luma samples, that applies to this track.

[0128]

[0129]

[0130] numArrays represents the number of arrays of NAL units of the indicated type.

[0131] In one example, the description of the random access recovery point sample group is modified as follows: The random access recovery point sample group 'roll' is used to provide information on recovery points for gradual decoding refresh. When the 'roll' sample group is used with a VVC track, the syntax and semantics of grouping_type_parameter are specified to be the same as those for the 'sap' sample group. layer_id_method_idc equal to 0 and 1 are used when the pictures of the target layers in a sample mapped to the 'roll' sample group are GDR pictures. When layer_id_method_idc is equal to 0, the 'roll' sample group specifies the behavior of all layers present in the track.

[0132] When layer_id_method_idc is equal to 1, the layers whose behavior is specified by the 'roll' sample group are specified in clause 9.5.7.

[0133] When not all pictures of the target layers in a sample mapped to the 'roll' sample group are GDR pictures, layer_id_method_idc equal to 2 and 3 are used, and for the pictures of the target layers that are not GDR pictures, the following applies: the referenced PPS has pps_mixed_nalu_types_in_pic_flag equal to 1, and for each subpicture index i in the range of 0 to sps_num_subpics_minus1 (inclusive), both of the following are true: sps_subpic_treated_as_pic_flag[i] is equal to 1, and there is at least one IRAP subpicture with the same subpicture index i in the current sample or a subsequent sample in the same CLVS. When layer_id_method_idc is equal to 2, the 'roll' sample group specifies the behavior of all layers present in the track.

[0134] When layer_id_method_idc is equal to 3, the layers whose behavior is specified by the 'roll' sample group are specified in the same way as when layer_id_method_idc is equal to 1, as specified in clause 9.5.7.

[0135] When a reader starts decoding from a sample marked with layer_id_method_idc equal to 2 or 3, the reader needs to further modify the SPS, PPS, and PH NAL units of the bitstream reconstructed according to clause 11.6 of ISO / IEC 14496-15:2021(E), so that the bitstream starting with the sample marked as belonging to a sample group with layer_id_method_idc equal to 2 or 3 is a conforming bitstream, as follows: any SPS referenced by the sample has sps_gdr_enabled_flag equal to 1; any PPS referenced by the sample has pps_mixed_nalu_types_in_pic_flag equal to 0; all VCL NAL units of the AU have nal_unit_type equal to GDR_NUT; and any picture header of the AU has ph_gdr_pic_flag equal to 1 and a ph_recovery_poc_cnt value that corresponds to the roll_distance of the 'roll' sample group to which the AU belongs.

[0136] When a 'roll' sample group concerns a dependent layer but not its reference layer(s), the sample group indicates the characteristics that apply when all reference layers of the dependent layer are available and decoded. The sample group can be used to start decoding of the predicted layer.

[0137] In one example, the syntax of VvcDecoderConfigurationRecord is modified as follows:

[0138]

[0139]

[0140] In one example, the semantics of VvcDecoderConfigurationRecord are modified as follows:

[0141] A ptl_present_flag value of 1 indicates that the track contains a VVC bitstream corresponding to the operation point specified by output_layer_set_idx and numTemporalLayers, and all NAL units in the track belong to that operation point. A ptl_present_flag value of 0 indicates that the track may not contain a VVC bitstream corresponding to a specific operation point, but may instead contain a VVC bitstream corresponding to multiple output layer sets, or may contain one or more individual layers that do not form an output layer set, or individual sublayers other than the sublayer with TemporalId equal to 0.

[0142]

[0143] `output_layer_set_idx` specifies the output layer set index of the output layer set represented by the VVC bitstream contained in the track. The value of `output_layer_set_idx` can be used as the value of the `TargetOlsIdx` variable provided to the VVC decoder by external means or by an OPI NAL unit, as specified in ISO / IEC 23090-3, for decoding the bitstream contained in the track.

[0144] `avgFrameRate` gives the average frame rate, in units of frames/(256 seconds), for the stream to which this configuration record applies. A value of 0 indicates an unspecified average frame rate. When the track contains multiple layers and samples are reconstructed for the operation point specified by `output_layer_set_idx` and `numTemporalLayers`, this gives the average access unit rate of the bitstream at the operation point.

[0145] A constantFrameRate value of 1 indicates that the stream to which this configuration record is applied has a constant frame rate. A value of 2 indicates that the representation of each temporal layer in the stream has a constant frame rate. A value of 0 indicates that the stream may or may not have a constant frame rate. When a track contains multiple layers and samples are reconstructed for operation points specified by output_layer_set_idx and numTemporalLayers, this provides an indication of whether the bitstream at the operation point has a constant access unit rate.

[0146] A `numTemporalLayers` value greater than 1 indicates that the track to which this configuration record applies is temporally scalable, and that the number of temporal layers it contains (also called temporal sublayers or sublayers in ISO / IEC 23090-3) is equal to `numTemporalLayers`. A value of 1 indicates that the track to which this configuration record applies is not temporally scalable. A value of 0 indicates that it is unknown whether the track to which this configuration record applies is temporally scalable.

[0147] chroma_format_idc indicates the chroma format that applies to this track.

[0148]

[0149] picture_width indicates the maximum picture width, in units of luma samples, that applies to this track.

[0150] picture_height indicates the maximum picture height, in units of luma samples, that applies to this track.

[0151]

[0152]

[0153] numArrays represents the number of arrays of NAL units of the indicated type.

[0154] Figure 1 is a schematic diagram of an example media file 100 containing a VVC bitstream 127 of video data. The media file includes pictures 125 that can be displayed to create a video sequence. The pictures 125 are compressed in the VVC bitstream 127. The bitstream 127 also includes various parameter sets 123 that indicate to the decoder the parameters used to compress the pictures 125. The parameter sets 123 may include a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), and an adaptation parameter set (APS), which respectively include parameters for the entire video, parameters for a video sequence, parameters for one or more pictures, and parameters for regions of one or more pictures.

[0155] Compression can include intra prediction and inter prediction. In intra prediction, picture 125 is divided into blocks, and each block is coded relative to other blocks in the same picture 125. In inter prediction, picture 125 is divided into blocks, and each block is coded relative to blocks in other pictures 125. A picture 125 coded by inter prediction or intra prediction can be referred to as an inter-coded picture or an intra-coded picture, respectively. One advantage of inter-coded pictures is that they are compressed substantially more than intra-coded pictures. However, because inter-coded pictures are coded relative to other pictures 125, the video decoder cannot begin decoding the video sequence at an inter-coded picture. Instead, the video decoder can begin decoding the video at any intra-coded picture. Intra-coded pictures can also be referred to as IRAP pictures, because any intra-coded picture can serve as an access point 135 into the video stream. Access point 135 is any point in the video stream at which the decoder can begin decoding without encountering decoding errors, such as those due to missing information, except in the case of GDR pictures as described below.

[0156] In some cases, picture 125 can be divided into subpictures. A subpicture is a rectangular region within picture 125. The advantage of subpictures is that they can be processed separately during decoding and display. For example, in drawing applications, virtual reality applications, etc., a subpicture can be displayed instead of the entire picture 125. Furthermore, in video calling applications, subpictures can be rearranged and stitched together in different configurations. In some cases, the set of access points 135 can differ between subpictures within the same picture 125. For example, a subpicture with less important video can have fewer access points 135 to increase compression. When this occurs, picture 125 can include both intra-coded subpictures and inter-coded subpictures, also known as IRAP subpictures and non-IRAP subpictures. Bitstream 127 is a set of Network Abstraction Layer (NAL) units, which are video data elements sized to fit packets in a communication network. Parameter sets 123 and pictures 125 are therefore carried as NAL units in bitstream 127. Accordingly, a picture 125 that has both IRAP subpictures and non-IRAP subpictures can be referred to as a mixed NAL unit picture.

[0157] Another access point 135 scheme involves the use of GDR pictures. A GDR picture comprises an intra-coded region and one or more inter-coded regions. Access points 135 are created using GDR pictures in groups. Specifically, a first GDR picture contains an intra-coded region in the leftmost portion of picture 125, and the remainder of the picture is inter-coded. A second GDR picture contains an intra-coded region that is shifted to the right, to a position adjacent to but not overlapping the intra-coded region of the first GDR picture. The remainder of the second GDR picture is inter-coded. In this way, the intra-coded region sweeps across multiple pictures from left to right. A constraint on the GDR pictures is that an inter-coded region to the left of the intra-coded region can only refer back to a previous GDR picture in the current GDR picture group. The decoder can start decoding from the first GDR picture in the group. In this case, the decoder can decode the intra-coded region but cannot decode the inter-coded region. The decoder can then proceed to the second GDR picture, in which both the intra-coded region and the inter-coded region to the left of the intra-coded region can be decoded. Once the decoder reaches the last GDR picture, all regions can be decoded and the video can be displayed. GDR pictures can produce errors when used as access point 135, but these errors do not persist beyond the last GDR picture in the group. Therefore, GDR pictures are typically not displayed when the group is used as access point 135. The advantage of GDR pictures is that each GDR picture is smaller than a complete IRAP picture, which reduces the data burst associated with each access point 135. When the decoder does not use a GDR picture as access point 135, the video preceding the GDR picture group is available, so the decoder can decode all GDR pictures in the group without errors in the inter-coded regions. It should be noted that GDR pictures are generally prohibited from being used with mixed NAL unit pictures.
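The left-to-right sweep described above can be modeled with a small simulation. The sketch below is an idealized model, not decoder code: it assumes the picture is split into equal vertical regions, that picture k of the group intra-codes region k, and that decoding starts at the first picture of the group. The function name `simulate_gdr_sweep` is hypothetical.

```python
from typing import List, Set


def simulate_gdr_sweep(num_regions: int) -> List[List[int]]:
    """Return, for each GDR picture in the group, the region indices that
    decode correctly when decoding starts at the first picture of the group."""
    history: List[List[int]] = []
    clean_prev: Set[int] = set()
    for k in range(num_regions):
        # Region k is intra-coded in picture k, so it needs no reference.
        clean_now = {k}
        # Regions left of k are inter-coded but refer only to earlier pictures
        # of the group, so they are correct if they were correct there.
        clean_now |= {r for r in range(k) if r in clean_prev}
        # Regions right of k refer to pictures before the group, which a
        # decoder starting here never received, so they stay incorrect.
        history.append(sorted(clean_now))
        clean_prev = clean_now
    return history
```

In this model, only the last picture of the group has every region correct, which matches the observation that the video can be displayed once the last GDR picture is reached.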

[0158] Pictures 125 and parameter sets 123 can be organized into layers 120 and / or sublayers 121. Layer 120 is a grouping of pictures 125 and parameter sets 123 that can be decoded and output as part of an output layer set. For example, different layers 120 can be coded at different resolutions. In another example, the output layer set may include a base layer and enhancement layers. This allows the decoder to decode the base layer and obtain video at a first resolution, and then decode the desired number of enhancement layers to increase the resolution based on device and network capabilities. Sublayer 121 is a layer 120 that allows for temporal scalability. For example, pictures 125 can be assigned to different sublayers 121 based on a temporal identifier (TemporalId). In this way, each sublayer 121 contains a subset of the pictures 125. This allows the decoder to decode and display selected sublayers 121 to achieve a desired frame rate.
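Selecting sublayers to reach a desired frame rate amounts to keeping only the pictures whose temporal identifier does not exceed a chosen maximum. The sketch below is a minimal illustration; representing each picture as a `(picture_order, temporal_id)` tuple and the function name `select_sublayers` are assumptions made for this example.

```python
from typing import List, Tuple

Picture = Tuple[int, int]  # (picture_order, temporal_id), illustrative only


def select_sublayers(pictures: List[Picture], max_temporal_id: int) -> List[Picture]:
    """Keep the pictures of all sublayers up to and including max_temporal_id."""
    return [pic for pic in pictures if pic[1] <= max_temporal_id]
```

Dropping the highest sublayer in a three-sublayer stream, for instance, removes the pictures with temporal identifier 2 and lowers the frame rate accordingly.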

[0159] Layers 120 and / or sublayers 121 of bitstream 127 may be arranged in track 110. Track 110 contains a sequence of temporal samples of a specific type that can be decoded and displayed by a decoder. In this context, a sample is a unit of media data. For example, track 110 may include a set of temporal compressed video samples (e.g., picture 125 over time), compressed audio samples, cue data samples, parameter samples, etc. It should be noted that the term sample can also refer to the color value of a pixel, but this is not the intended definition in this context. Track 110 may contain any number of layers 120 and / or any number of sublayers 121 containing such samples.

[0160] As can be understood from the preceding description, the data in media file 100 can be arranged in various ways. Therefore, media file 100 also includes a sample table box 130, which contains parameters describing the samples (e.g., media data) contained in track 110. For example, a decoder can read sample table box 130 to determine how to begin processing the data contained in the various tracks 110. Among many other parameters, sample table box 130 may contain a roll sample group 131 and a VVC decoder configuration record 141.

[0161] Roll sample group 131 is also called the random access recovery point sample group. Roll sample group 131 is a data unit that signals the access points 135 into VVC bitstream 127 for layers 120, and is primarily used for signaling access points 135 that occur at GDR pictures. It should be noted that random access point (RAP) sample groups can be used to signal access points occurring at other IRAP pictures, such as IDR, CRA, BLA, etc. Therefore, roll sample group 131 contains a list of the access points 135 located at GDR pictures within VVC bitstream 127. Each access point 135 is considered a sample of roll sample group 131. In some example implementations, the operation of roll sample group 131 is unclear. This disclosure addresses these issues by providing parameters that clearly describe the relationship between the access points 135 in roll sample group 131 and layers 120.

[0162] Roll sample group 131 includes a grouping type parameter 137, which can also be denoted `grouping_type_parameter`. Grouping type parameter 137 specifies the correspondence between access points 135 and layers 120. It should be noted that when access point 135 applies to a layer 120, that layer 120 can be referred to as a related layer. Therefore, layers 120 include a set of related layers, which can be the same as the set of all layers 120 or a subset of layers 120. Grouping type parameter 137 includes a target layer parameter 136 and a layer identifier method identifier code 138, which can be denoted `target_layers` and `layer_id_method_idc`, respectively. In the example implementation, target layer parameter 136 includes multiple bits, each specifying one of the related layers. In one example, target layer parameter 136 can be 28 bits long, thus allowing up to 28 related layers to be specified.

[0163] Layer identifier method identifier code 138 specifies the nature of access point 135 and clarifies the correspondence between access point 135 and the layers. In the example, layer identifier method identifier code 138 may be a four-bit value. In a particular implementation, layer identifier method identifier code 138 may be set to 0 or 2 to indicate that access point 135 applies to all layers 120. In this case, all layers are related layers, and target layer parameter 136 may be omitted from media file 100 and / or ignored by the decoder. Furthermore, layer identifier method identifier code 138 may be set to 1 or 3 to indicate that access point 135 applies only to the related layers specified by target layer parameter 136. Additionally, layer identifier method identifier code 138 may indicate the nature of the picture 125 present at access point 135. For example, layer identifier method identifier code 138 may be set to 0 or 1 to indicate that access point 135 is a GDR picture. Furthermore, layer identifier method identifier code 138 can be set to 2 or 3 to indicate that access point 135 can be a GDR picture or a mixed NAL unit picture with both IRAP subpictures and non-IRAP subpictures.

[0164] In a specific implementation, when all access points in the relevant layer are GDR images and the access points are applied to all layers, `layer_id_method_idc` is set to 0. Furthermore, when all access points in the relevant layer are GDR images and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 1. Additionally, when the access points in the relevant layer are GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied to all layers, `layer_id_method_idc` is set to 2. Finally, when the access points in the relevant layer are GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 3. Thus, the decoder can parse access point 135, grouping type parameter 137, target layer parameter 136, and layer identifier method identifier code 138 to determine the correspondence between access point 135 in the rolling sample group 131 and layer 120. The decoder can then use access point 135 to begin decoding image 125 in the relevant layer.
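The four `layer_id_method_idc` values described above can be read as two independent properties: whether the access points apply to all layers, and whether they may include mixed NAL unit images. The following Python sketch is illustrative only. The names `RollGroupingInfo` and `interpret_roll_grouping` are hypothetical, and the assumption that bit i of `target_layers` marks the i-th related layer is made for the example rather than taken from the text.

```python
from typing import List, NamedTuple


class RollGroupingInfo(NamedTuple):
    """Hypothetical container for the decoded meaning of the parameter."""
    applies_to_all_layers: bool
    may_contain_mixed_nal_images: bool
    related_layers: List[int]  # empty when the access points apply to all layers


def interpret_roll_grouping(layer_id_method_idc: int, target_layers: int) -> RollGroupingInfo:
    """Map layer_id_method_idc (0..3) and target_layers to their described meaning."""
    if layer_id_method_idc not in (0, 1, 2, 3):
        raise ValueError("only values 0..3 are described in the text")

    # 0 and 2: access points apply to all layers; 1 and 3: only to related layers.
    applies_to_all = layer_id_method_idc in (0, 2)
    # 0 and 1: GDR images only; 2 and 3: GDR and/or mixed NAL unit images.
    may_be_mixed = layer_id_method_idc in (2, 3)

    if applies_to_all:
        related: List[int] = []  # target_layers is ignored in this case
    else:
        # Assumption: bit i set means the i-th layer is a related layer.
        related = [bit for bit in range(target_layers.bit_length())
                   if (target_layers >> bit) & 1]
    return RollGroupingInfo(applies_to_all, may_be_mixed, related)
```

For instance, under these assumptions a value of 3 with `target_layers` equal to `0b101` would describe access points that may be GDR or mixed NAL unit images and that apply only to the layers at bit positions 0 and 2.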

[0165] Furthermore, the sample table frame 130 may include a VVC decoder configuration record 141, which may be represented as a VVCDecoderConfigurationRecord. The VVC decoder configuration record 141 contains data that the decoder can use to select content. For example, the VVC decoder configuration record 141 may contain data describing the output layer set and the corresponding layers 120 in track 110. The decoder can then use such data to select the track 110 that should be decoded and displayed. For example, the VVC decoder configuration record 141 may contain data describing the VVC profile, tier, and level (PTL) record 143, output layer set index, frame rate, number of sublayers 121, bit depth, chroma format, image size, etc.

[0166] VVC PTL record 143 indicates the profile, tier, and level information of layer 120 and / or sublayer 121. The profile, tier, and level define constraints on the bitstream and thus limit the capabilities required to decode it. Profiles, tiers, and levels can also be used to indicate interoperability points between various decoder implementations. A profile is a defined set of coding tools used to create compliant or conforming bitstreams. Each profile specifies a subset of algorithmic features and constraints that all decoders conforming to that profile should support. A level is a set of constraints on the bitstream (e.g., maximum luma sample rate, maximum bit rate for a resolution, etc.). For example, a level can be a set of constraints (e.g., hardware constraints) indicating the decoder performance required to play back a bitstream of a specified profile. Levels are divided into two tiers: Main and High. The Main tier is lower than the High tier. The tiers are used to handle applications that differ in maximum bit rate. The Main tier is designed for most applications, while the High tier is designed for very demanding applications. For any given profile, a level of a tier typically corresponds to a specific decoder processing load and memory capability. Therefore, the decoder should select layer 120 and / or sublayer 121 for playback by determining which layer 120 and / or sublayer 121 has PTL information that matches the decoder's capabilities.

[0167] In some example implementations, the VVC decoder configuration record 141 is unclear because the number of sublayers 145 is signaled in the VVC decoder configuration record 141 after the VVC PTL record 143. This is problematic because the decoder needs the number of sublayers 145 before it can interpret the VVC PTL record 143. In this disclosure, the number of sublayers 145 is signaled in the VVC decoder configuration record 141 before the VVC PTL record 143. The decoder can then parse the VVC decoder configuration record to obtain the number of sublayers 145 and use the number of sublayers 145 to determine the number of VVC PTL records for sublayer 121. In one example, the VVC decoder configuration record 141 includes a constant frame rate syntax element, a chroma format identifier syntax element, and a bit depth minus eight syntax element. The VVC PTL record 143 may be located after the constant frame rate syntax element, the chroma format identifier syntax element, and the bit depth minus eight syntax element in the VVC decoder configuration record 141. In addition, the number of sublayers 145 can be located before the constant frame rate syntax element, the chroma format identifier syntax element, and the bit depth minus eight syntax element in the VVC decoder configuration record 141.
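The ordering constraint can be illustrated with a deliberately simplified parser. The byte layout below is a toy invented for this sketch and is not the actual VVCDecoderConfigurationRecord syntax: it assumes one byte for the number of sublayers, followed by one general level byte and one level byte per additional sublayer. The point is only that the length of the PTL portion cannot be known until the sublayer count has been read.

```python
from typing import Dict, List, Union


def parse_toy_config_record(data: bytes) -> Dict[str, Union[int, List[int], bytes]]:
    """Parse a hypothetical record: [num_sublayers][general_level][sublayer levels...][rest]."""
    if not data:
        raise ValueError("empty record")

    # The sublayer count comes first, so the parser knows how much PTL data follows.
    num_sublayers = data[0]
    if num_sublayers < 1:
        raise ValueError("at least one sublayer is expected in this toy layout")

    # Toy PTL portion: 1 general level byte + (num_sublayers - 1) sublayer level bytes.
    ptl_length = num_sublayers
    if len(data) < 1 + ptl_length:
        raise ValueError("record is too short for the signaled number of sublayers")

    general_level = data[1]
    sublayer_levels = list(data[2:1 + ptl_length])
    remaining = data[1 + ptl_length:]  # whatever follows the PTL portion
    return {
        "num_sublayers": num_sublayers,
        "general_level": general_level,
        "sublayer_levels": sublayer_levels,
        "remaining": remaining,
    }
```

If the count were instead stored after the PTL portion, a parser of this toy layout would have no way to tell where the sublayer level bytes end, which mirrors the problem the paragraph describes.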

[0168] In a specific implementation, the VVC decoder configuration record 141 can be configured to position the number of sublayers 145 before the VVC PTL record 143 to determine the PTL information of track 110, layer 120 and / or sublayer 121.

[0169]

[0170] In another example, various additional information may be included in the VVC decoder configuration record 141 to support the selection of track 110, layer 120, and / or sublayer 121 in the decoder. This information may include the maximum required size of the decoded picture buffer, maximum picture output reordering, maximum latency, GDR picture enable flag, CRA picture enable flag, reference picture resampling enable flag, enable flag for spatial resolution change within a CLVS, subpicture partitioning enable flag, maximum number of subpictures per picture, WPP enable flag, tile partitioning enable flag, maximum number of tiles per picture, slice partitioning enable flag, rectangular slice enable flag, raster scan slice enable flag, maximum number of slices per picture, or combinations thereof. In some examples, such information may be included only if the VVC decoder configuration record 141 includes the VVC PTL record 143.

[0171] By including such information and / or by rearranging the order of the data, the VVC decoder configuration record 141 is improved to allow the decoder to select tracks 110, layers 120, and / or sub-layers 121, along with any additional features, more efficiently.

[0172] Figure 2 is a flowchart of an example method 200 for encoding a rolling sample group, for example, by encoding the rolling sample group into a media file 100. In step 201, the encoder encodes images into layers in a media file (e.g., media file 100).

[0173] In step 203, the encoder determines a rolling sample group that specifies access points in the layers. As mentioned above, the rolling sample group is intended to be used in conjunction with GDR images. However, in addition to GDR images, various layers may include other types of access points, such as mixed NAL unit images with both IRAP sub-images and non-IRAP sub-images. The encoder encodes the rolling sample group into the media file.

[0174] In step 205, the encoder encodes the grouping type parameter into the media file, for example, into the rolling sample group. The grouping type parameter specifies the correspondence between the access points that are samples of the rolling sample group and the associated layers. As mentioned above, an associated layer is any layer referenced by the access point. The grouping type parameter includes a layer identifier method identifier parameter that specifies the nature of the access point. For example, the layer identifier method identifier parameter can be set to indicate that the access point includes one or more of the following: (1) one or more GDR images; and (2) one or more mixed NAL unit images with both IRAP sub-images and non-IRAP sub-images. For example, the layer identifier method identifier parameter can be set to a first value to indicate that all access points are GDR images, and to a second value to indicate that the access points are a combination of GDR images and mixed NAL unit images (or only mixed NAL unit images).

[0175] In one example, the grouping type parameter includes a target layer parameter. The target layer parameter consists of multiple bits, each specifying one of the relevant layers. In one example, the layer identifier method identifier parameter can be set to specify that the access point applies only to the relevant layers. In another example, the layer identifier method identifier parameter can be set to specify that the access point applies to all layers. It should be noted that in some examples, the target layer parameter can be omitted in this case. In some examples, the grouping type parameter is represented as `grouping_type_parameter`, the target layer parameter as `target_layers`, and the layer identifier method identifier parameter as `layer_id_method_idc`.

[0176] In one example, when all access points in the relevant layer are specified as GDR images and the access points are applied to all layers, `layer_id_method_idc` is set to zero. In another example, when all access points in the relevant layer are specified as GDR images and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 1. In yet another example, when the access points in the relevant layer are specified as GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied to all layers, `layer_id_method_idc` is set to 2. In yet another example, when the access points in the relevant layer are specified as GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 3.
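On the encoder side, the four cases above reduce to a two-way decision on each of two properties. The Python sketch below is a minimal illustration with a hypothetical function name. It assumes the encoder already knows whether every access point in the related layers is a GDR image and whether the access points apply to all layers, and it chooses values 0 or 1 whenever all access points are GDR images.

```python
def choose_layer_id_method_idc(all_access_points_are_gdr: bool,
                               applies_to_all_layers: bool) -> int:
    """Return the layer_id_method_idc value for the described combination."""
    if all_access_points_are_gdr:
        # GDR images only: 0 for all layers, 1 for related layers only.
        return 0 if applies_to_all_layers else 1
    # GDR images, mixed NAL unit images, or a combination:
    # 2 for all layers, 3 for related layers only.
    return 2 if applies_to_all_layers else 3
```

A writer following this sketch would also populate `target_layers` only for the values 1 and 3, since the parameter can be omitted or ignored when the access points apply to all layers.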

[0177] In step 207, the encoder stores the media file. In this embodiment, the media file is sent to the decoder.

[0178] Figure 3 is a flowchart of an example method 300 for decoding a rolling sample group, for example, from a media file 100 received as a result of method 200. In step 301, the decoder receives a media file including images encoded into layers. The media file also includes a rolling sample group.

[0179] In step 303, the decoder obtains the rolling sample group from the media file. The rolling sample group specifies the access points in the layers.

[0180] In step 305, the decoder obtains a grouping type parameter from the media file, for example, from a rolling sample group. The grouping type parameter specifies the correspondence between the access point of the sample group and the associated layer. As mentioned above, the associated layer is any layer referenced by the access point. The grouping type parameter includes a layer identifier method identifier parameter that specifies the nature of the access point. For example, the layer identifier method identifier parameter can be set to indicate that the access point includes one or more of the following: (1) one or more GDR images; and (2) one or more mixed NAL unit images having both IRAP sub-images and non-IRAP sub-images. For example, the layer identifier method identifier parameter can be set to a first value to indicate that all access points are GDR images, and to a second value to indicate that the access point is a combination of GDR images and mixed NAL unit images (or only mixed NAL unit images).

[0181] In one example, the grouping type parameter includes a target layer parameter. The target layer parameter consists of multiple bits, each specifying one of the relevant layers. In one example, the layer identifier method identifier parameter can be set to specify that the access point applies only to the relevant layers. In another example, the layer identifier method identifier parameter can be set to specify that the access point applies to all layers. It should be noted that in some examples, the target layer parameter can be omitted in this case. In some examples, the grouping type parameter is represented as `grouping_type_parameter`, the target layer parameter is represented as `target_layers`, and the layer identifier method identifier parameter is represented as `layer_id_method_idc`.

[0182] In one example, when all access points in the relevant layer are specified as GDR images and the access points are applied to all layers, `layer_id_method_idc` is set to zero. In another example, when all access points in the relevant layer are specified as GDR images and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 1. In yet another example, when the access points in the relevant layer are specified as GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied to all layers, `layer_id_method_idc` is set to 2. In yet another example, when the access points in the relevant layer are specified as GDR images, mixed NAL unit images, or a combination thereof, and the access points are applied only to the relevant layer, `layer_id_method_idc` is set to 3.

[0183] In step 307, the decoder decodes the media file based on the grouping type parameter. The decoder can then forward the decoded media file or portions thereof (e.g., specific layers and / or sublayers) to a display for the user to view.

[0184] Figure 4 is a block diagram of an example video processing system 400 that can implement the various techniques disclosed herein. Various implementations may include some or all of the components in system 400. System 400 may include an input 402 for receiving video content. The video content may be received in a raw or uncompressed format (e.g., 8-bit or 10-bit multi-component pixel values), or in a compressed or encoded format. Input 402 may represent a network interface, a peripheral bus interface, or a storage interface. Examples of network interfaces include wired interfaces (such as Ethernet, Passive Optical Network (PON), etc.) and wireless interfaces (such as Wi-Fi or cellular interfaces).

[0185] System 400 may include a coding component 404 capable of implementing the various coding or encoding methods described in this document. Coding component 404 can reduce the average bit rate of the video from input 402 to the output of coding component 404 to produce a coded representation of the video. Therefore, coding techniques are sometimes referred to as video compression or video transcoding techniques. The output of coding component 404 may be stored or transmitted via a connected communication link, as represented by component 406. The stored or communicated bitstream (or coded) representation of the video received at input 402 may be used by component 408 to generate pixel values or displayable video that is sent to display interface 410. The process of generating user-viewable video from the bitstream representation is sometimes referred to as video decompression. Furthermore, although some video processing operations are referred to as “coding” operations or tools, it should be understood that the coding tools or operations are used at the encoder, and the corresponding decoding tools or operations that reverse the results of the coding are performed by the decoder.

[0186] Examples of peripheral bus interfaces or display interfaces may include Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or DisplayPort. Examples of storage interfaces include SATA (Serial Advanced Technology Attachment), PCI, IDE, etc. The technologies described in this document can be implemented in a variety of electronic devices, such as mobile phones, laptops, smartphones, or other devices capable of digital data processing and / or video display.

[0187] Figure 5 is a block diagram of an example video processing apparatus 500. Apparatus 500 can be used to implement one or more of the methods described herein. Apparatus 500 can be implemented in smartphones, tablets, computers, Internet of Things (IoT) receivers, etc. Apparatus 500 may include one or more processors 502, one or more memories 504, and video processing hardware 506. The processor(s) 502 may be configured to implement one or more methods described herein. The memory (memories) 504 may be used to store data and code used to implement the methods and techniques described herein. The video processing hardware 506 may be used to implement some of the techniques described herein in hardware circuitry. In some embodiments, the video processing hardware 506 may be at least partially included in the processor 502, such as a graphics coprocessor.

[0188] Figure 6 is a flowchart of an example method 600 for video processing. Method 600 performs a conversion between visual media data and a file storing information corresponding to the visual media data, based on a video file format. In the context of an encoder, this conversion can be performed by encoding the visual media data into a visual media data file in a video file format. In the context of a decoder, this conversion can be performed by decoding the visual media data file in a video file format to obtain visual media data for display.

[0189] Figure 7 is a block diagram illustrating an example video codec system 700 that can utilize the techniques disclosed herein. As shown in Figure 7, the video codec system 700 may include a source device 710 and a destination device 720. The source device 710 generates encoded video data and may be referred to as a video encoding device. The destination device 720 can decode the encoded video data generated by the source device 710 and may be referred to as a video decoding device.

[0190] The source device 710 may include a video source 712, a video encoder 714, and an input / output (I / O) interface 716.

[0191] Video source 712 may include sources such as a video capture device, an interface for receiving video data from a video content provider, and / or a computer graphics system that generates video data, or a combination of these sources. Video data may include one or more pictures. Video encoder 714 encodes the video data from video source 712 to generate a bitstream. The bitstream may include a sequence of bits forming a coded representation of the video data. The bitstream may include coded pictures and associated data. A coded picture is a coded representation of a picture. Associated data may include sequence parameter sets, picture parameter sets, and other syntax elements. I / O interface 716 may include a modulator / demodulator (modem) and / or a transmitter. Encoded video data may be transmitted directly to destination device 720 via network 730 through I / O interface 716. Encoded video data may also be stored on storage medium / server 740 for access by destination device 720.

[0192] Destination device 720 may include I / O interface 726, video decoder 724, and display device 722. I / O interface 726 may include a receiver and / or a modem. I / O interface 726 may acquire encoded video data from source device 710 or storage medium / server 740. Video decoder 724 may decode the encoded video data. Display device 722 may display the decoded video data to a user. Display device 722 may be integrated with destination device 720, or may be external to destination device 720, in which case destination device 720 is configured to connect to the external display device.

[0193] The video encoder 714 and the video decoder 724 can operate according to video compression standards such as High Efficiency Video Coding (HEVC), Versatile Video Coding (VVC), and other current and / or future standards.

[0194] Figure 8 is a block diagram illustrating an example of a video encoder 800, which can be the video encoder 714 in the system 700 shown in Figure 7. The video encoder 800 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 8, the video encoder 800 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video encoder 800. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0195] The functional components of the video encoder 800 may include a segmentation unit 801, a prediction unit 802 (which may include a mode selection unit 803, a motion estimation unit 804, a motion compensation unit 805, and an intra-frame prediction unit 806), a residual generation unit 807, a transform processing unit 808, a quantization unit 809, an inverse quantization unit 810, an inverse transform unit 811, a reconstruction unit 812, a buffer 813, and an entropy coding unit 814.

[0196] In other examples, the video encoder 800 may include more, fewer, or different functional components. In one example, the prediction unit 802 may include an intra-block copy (IBC) unit. The IBC unit may perform prediction in IBC mode, where at least one reference picture is the picture in which the current video block is located.

[0197] Furthermore, some components, such as the motion estimation unit 804 and the motion compensation unit 805, can be highly integrated, but are shown separately in the example of Figure 8 for purposes of explanation.

[0198] The segmentation unit 801 can segment an image into one or more video blocks. The video encoder 800 and video decoder 900 can support various video block sizes.

[0199] The mode selection unit 803 can, for example, select one of the intra-frame or inter-frame coding modes based on error results, and provide the resulting intra-coded or inter-coded block to the residual generation unit 807 to generate residual block data and to the reconstruction unit 812 to reconstruct the coded block for use as a reference image. In some examples, the mode selection unit 803 can select a combined intra-frame and inter-frame prediction (CIIP) mode, where the prediction is based on an inter-frame prediction signal and an intra-frame prediction signal. The mode selection unit 803 can also select the resolution of the motion vector (e.g., sub-pixel or integer pixel precision) for a block in the inter-frame prediction case.

[0200] To perform inter-frame prediction for the current video block, motion estimation unit 804 can generate motion information for the current video block by comparing one or more reference frames from buffer 813 with the current video block. Motion compensation unit 805 can determine the predicted video block for the current video block based on the motion information and decoded samples of images from buffer 813 other than the image associated with the current video block.
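One common way to compare a current block against a reference frame is an exhaustive block-matching search that minimizes the sum of absolute differences (SAD). The text does not state which search the motion estimation unit 804 uses, so the Python sketch below is an assumed illustration with hypothetical function names, operating on frames stored as lists of rows of integer samples.

```python
from typing import List, Tuple

Frame = List[List[int]]


def extract_block(frame: Frame, top: int, left: int, size: int) -> Frame:
    """Return the size x size block whose top-left sample is at (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a: Frame, block_b: Frame) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def full_search(current_block: Frame, reference: Frame, top: int, left: int,
                search_range: int) -> Tuple[Tuple[int, int], int]:
    """Return ((dx, dy), cost) of the best match within +/- search_range samples."""
    size = len(current_block)
    height, width = len(reference), len(reference[0])
    best_cost = None
    best_mv = (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(current_block, extract_block(reference, y, x, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    if best_cost is None:
        raise ValueError("no candidate block fits inside the reference frame")
    return best_mv, best_cost
```

The returned displacement plays the role of the motion vector, and the matched reference block is what a motion compensation step would use as the predicted video block.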

[0201] The motion estimation unit 804 and the motion compensation unit 805 can perform different operations on the current video block depending on, for example, whether the current video block is in an I slice, a P slice, or a B slice.

[0202] In some examples, motion estimation unit 804 can perform unidirectional prediction of the current video block, and can search for a reference video block for the current video block in the reference images of list 0 or list 1. Motion estimation unit 804 can then generate a reference index indicating which reference image in list 0 or list 1 contains the reference video block, and a motion vector indicating the spatial displacement between the current video block and the reference video block. Motion estimation unit 804 can output the reference index, a prediction direction indicator, and the motion vector as motion information for the current video block. Motion compensation unit 805 can generate a predicted video block for the current video block based on the reference video block indicated by the motion information of the current video block.

[0203] In other examples, motion estimation unit 804 can perform bidirectional prediction of the current video block. Motion estimation unit 804 can search for a reference video block for the current video block in the reference images of list 0 and can also search for another reference video block for the current video block in the reference images of list 1. Motion estimation unit 804 can then generate reference indexes indicating which reference images in list 0 and list 1 contain the reference video blocks, and motion vectors indicating the spatial displacements between the reference video blocks and the current video block. Motion estimation unit 804 can output the reference indexes and the motion vectors of the current video block as the motion information of the current video block. Motion compensation unit 805 can generate a predicted video block for the current video block based on the reference video blocks indicated by the motion information of the current video block.

[0204] In some examples, the motion estimation unit 804 may output the complete set of motion information for decoding processing by the decoder. In other examples, the motion estimation unit 804 may not output the complete set of motion information for the current video block. Instead, the motion estimation unit 804 may refer to the motion information of another video block to signal the motion information of the current video block. For example, the motion estimation unit 804 may determine that the motion information of the current video block is sufficiently similar to the motion information of an adjacent video block.

[0205] In one example, the motion estimation unit 804 may indicate in the syntax structure associated with the current video block that the current video block has the same motion information value as another video block.

[0206] In another example, motion estimation unit 804 can identify another video block and a motion vector difference (MVD) in the syntax structure associated with the current video block. The motion vector difference indicates the difference between the motion vector of the current video block and the motion vector of the indicated video block. Video decoder 900 can use the motion vector of the indicated video block and the motion vector difference to determine the motion vector of the current video block.
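The relationship between the two motion vectors and the MVD is simple component-wise arithmetic. The following Python sketch uses hypothetical helper names and represents a motion vector as an (x, y) pair of integers; the unit of those integers (for example, a sub-pixel precision) is left open, as the paragraph does not specify it.

```python
from typing import Tuple

MotionVector = Tuple[int, int]


def compute_mvd(current_mv: MotionVector, indicated_mv: MotionVector) -> MotionVector:
    """Encoder side: difference between the current and the indicated block's vectors."""
    return (current_mv[0] - indicated_mv[0], current_mv[1] - indicated_mv[1])


def recover_mv(indicated_mv: MotionVector, mvd: MotionVector) -> MotionVector:
    """Decoder side: add the signaled difference back to the indicated block's vector."""
    return (indicated_mv[0] + mvd[0], indicated_mv[1] + mvd[1])
```

Because neighboring blocks often move similarly, the difference tends to have small components, which is why signaling it can cost fewer bits than signaling the full vector.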

[0207] As discussed above, the video encoder 800 can predictively signal motion vectors. Two examples of predictive signaling techniques that can be implemented by the video encoder 800 are Advanced Motion Vector Prediction (AMVP) and merge mode signaling.

[0208] The intra-prediction unit 806 can perform intra-prediction on the current video block. When performing intra-prediction on the current video block, the intra-prediction unit 806 can generate prediction data for the current video block based on the decoded samples of other video blocks in the same frame. The prediction data for the current video block can include the predicted video block and various syntax elements.

[0209] The residual generation unit 807 can generate residual data for the current video block by subtracting (e.g., as indicated by a minus sign) the predicted video block(s) of the current video block from the current video block. The residual data for the current video block can include residual video blocks corresponding to different sample components of the samples in the current video block.

[0210] In other examples, such as in skip mode, residual data for the current video block may not exist, and the residual generation unit 807 may not perform a subtraction operation.

[0211] The transform processing unit 808 can generate one or more transform coefficient video blocks of the current video block by applying one or more transforms to the residual video block associated with the current video block.

[0212] After the transform processing unit 808 generates a transform coefficient video block associated with the current video block, the quantization unit 809 can quantize the transform coefficient video block associated with the current video block based on one or more quantization parameter (QP) values associated with the current video block.

[0213] The inverse quantization unit 810 and the inverse transform unit 811 can apply inverse quantization and inverse transform to the transform coefficient video block respectively to reconstruct the residual video block from the transform coefficient video block. The reconstruction unit 812 can add the reconstructed residual video block to the corresponding sample points of one or more predicted video blocks generated by the prediction unit 802 to produce a reconstructed video block associated with the current block for storage in the buffer 813.
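The residual, quantization, and reconstruction steps described in the preceding paragraphs can be traced end to end with a small numeric sketch. The Python code below makes several simplifying assumptions that are not from the text: the transform is skipped (treated as identity), quantization is a uniform scalar quantizer with an integer step size rather than a QP-driven design, and all function names are hypothetical.

```python
from typing import List

Block = List[List[int]]


def generate_residual(current: Block, predicted: Block) -> Block:
    """Residual = current block minus predicted block, sample by sample."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, predicted)]


def quantize(block: Block, step: int) -> Block:
    """Uniform scalar quantization with rounding to the nearest level."""
    def level(value: int) -> int:
        sign = -1 if value < 0 else 1
        return sign * ((abs(value) + step // 2) // step)
    return [[level(v) for v in row] for row in block]


def dequantize(levels: Block, step: int) -> Block:
    """Inverse quantization: scale each level back by the step size."""
    return [[lvl * step for lvl in row] for row in levels]


def reconstruct(predicted: Block, residual: Block) -> Block:
    """Reconstruction = predicted block plus (dequantized) residual."""
    return [[p + r for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(predicted, residual)]
```

With a step size of 1 the round trip is lossless in this sketch, while larger step sizes introduce a bounded reconstruction error, which is the loss that quantization trades for a lower bit rate.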

[0214] After the video block is reconstructed by reconstruction unit 812, a loop filtering operation can be performed to reduce video block artifacts in the video block.

[0215] The entropy encoding unit 814 can receive data from other functional components of the video encoder 800. When the entropy encoding unit 814 receives data, it can perform one or more entropy encoding operations to generate entropy-coded data and output a bitstream including the entropy-coded data.

[0216] Figure 9 is a block diagram illustrating an example of a video decoder 900, which can be the video decoder 724 in the system 700 shown in Figure 7.

[0217] The video decoder 900 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 9, the video decoder 900 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video decoder 900. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0218] In the example of Figure 9, the video decoder 900 includes an entropy decoding unit 901, a motion compensation unit 902, an intra-frame prediction unit 903, an inverse quantization unit 904, an inverse transform unit 905, a reconstruction unit 906, and a buffer 907. In some examples, the video decoder 900 can perform a decoding process that is generally the inverse of the encoding process described with respect to the video encoder 800 (Figure 8).

[0219] The entropy decoding unit 901 can retrieve the encoded bitstream. The encoded bitstream may include entropy-encoded video data (e.g., encoded blocks of video data). The entropy decoding unit 901 can decode the entropy-encoded video data, and based on the entropy-decoded video data, the motion compensation unit 902 can determine motion information including motion vectors, motion vector precision, reference image list indexes, and other motion information. The motion compensation unit 902 can determine this information, for example, by performing AMVP and merge mode.

[0220] The motion compensation unit 902 can generate motion compensation blocks, possibly based on interpolation filters. The identifier of the interpolation filter to be used at sub-pixel precision can be included in the syntax element.

[0221] The motion compensation unit 902 can use the interpolation filters used by the video encoder 800 during the encoding of the video block to calculate interpolated values for sub-integer pixels of the reference block. The motion compensation unit 902 can determine the interpolation filters used by the video encoder 800 based on the received syntax information and use the interpolation filters to generate the prediction block.

[0222] The motion compensation unit 902 can use some of the syntax information to determine: the sizes of the blocks used to encode the frame(s) and / or slice(s) of the encoded video sequence, partition information describing how each macroblock of an image of the encoded video sequence is partitioned, modes indicating how each partition is encoded, one or more reference frames (and reference frame lists) for each inter-coded block, and other information for decoding the encoded video sequence.

[0223] Intra-prediction unit 903 can use, for example, an intra-prediction mode received in the bitstream to form prediction blocks from spatially adjacent blocks. Inverse quantization unit 904 inverse quantizes (i.e., dequantizes) the quantized video block coefficients provided in the bitstream and decoded by entropy decoding unit 901. Inverse transform unit 905 applies the inverse transform.

[0224] The reconstruction unit 906 can sum the residual block with the corresponding prediction block generated by the motion compensation unit 902 or the intra-prediction unit 903 to form a decoded block. If desired, a deblocking filter can also be applied to filter the decoded block in order to remove blocking artifacts. The decoded video block is then stored in a buffer 907, which provides reference blocks for subsequent motion compensation / intra-prediction and also produces the decoded video for presentation on a display device.
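
The summation performed by the reconstruction unit 906 can be sketched as follows. This is a minimal illustration only, not the decoder's actual implementation: the function name reconstruct_block, the list-of-rows block representation, and the clipping of each sum to the sample range implied by an assumed bit_depth are all assumptions introduced for the example.

```python
def reconstruct_block(residual, prediction, bit_depth=8):
    """Add a residual block to a prediction block, sample by sample.

    Clipping to [0, 2**bit_depth - 1] is an assumption of this sketch;
    the description only states that the two blocks are summed.
    """
    if len(residual) != len(prediction):
        raise ValueError("blocks must have the same number of rows")
    max_val = (1 << bit_depth) - 1
    decoded = []
    for res_row, pred_row in zip(residual, prediction):
        if len(res_row) != len(pred_row):
            raise ValueError("rows must have the same length")
        # Sum each residual/prediction pair and clip to the valid range.
        decoded.append(
            [min(max(r + p, 0), max_val) for r, p in zip(res_row, pred_row)]
        )
    return decoded


# Hypothetical 2x2 blocks, chosen so that two of the sums need clipping.
residual = [[-3, 4], [10, -20]]
prediction = [[100, 250], [250, 5]]
decoded = reconstruct_block(residual, prediction)
```

In this sketch the decoded block is what would be stored in buffer 907; any deblocking would operate on it afterwards.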

[0225] Figure 10 is a schematic diagram of an example encoder 1000. Encoder 1000 is suitable for implementing VVC techniques. Encoder 1000 includes three loop filters: a deblocking filter (DF) 1002, a sample adaptive offset (SAO) 1004, and an adaptive loop filter (ALF) 1006. Unlike DF 1002, which uses predefined filters, SAO 1004 and ALF 1006 use the original samples of the current image to reduce the mean square error between the original and reconstructed samples, by adding an offset and by applying a finite impulse response (FIR) filter, respectively, with coded side information signaling the offsets and filter coefficients. ALF 1006 is located at the last processing stage of each image and can be regarded as a tool that tries to catch and fix artifacts created by the previous stages.
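
The idea of reducing the mean square error by adding an offset can be illustrated with a small sketch. It is deliberately simplified and is not the SAO process of encoder 1000: it treats all samples as a single group and uses an unquantized offset, whereas an actual SAO stage classifies samples into categories and signals an offset per category. The function names and sample values are hypothetical. For one group of samples, the offset that minimizes the mean square error is the mean of the differences between original and reconstructed samples.

```python
def mean_squared_error(original, reconstructed):
    """Mean of squared differences between two equal-length sample lists."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def best_offset(original, reconstructed):
    """Offset minimizing the MSE for one group: mean(original - reconstructed)."""
    return sum(o - r for o, r in zip(original, reconstructed)) / len(original)


def apply_offset(reconstructed, offset):
    """Add the same offset to every reconstructed sample."""
    return [r + offset for r in reconstructed]


# Hypothetical samples: the reconstruction is biased low.
original = [10, 12, 14, 16]
reconstructed = [8, 11, 12, 13]

offset = best_offset(original, reconstructed)        # derived at the encoder
corrected = apply_offset(reconstructed, offset)      # applied in the loop
mse_before = mean_squared_error(original, reconstructed)
mse_after = mean_squared_error(original, corrected)
```

Only the encoder has the original samples, which is why the offset has to be conveyed to the decoder as side information.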

[0226] The encoder 1000 also includes an intra-prediction component 1008 and a motion estimation / compensation (ME / MC) component 1010, both configured to receive input video. The intra-prediction component 1008 is configured to perform intra-prediction, while the ME / MC component 1010 is configured to perform inter-prediction using reference images obtained from a reference image buffer 1012. Residual blocks from inter-prediction or intra-prediction are fed into a transform component 1014 and a quantization component 1016 to generate quantized residual transform coefficients, which are then fed into an entropy coding component 1018. The entropy coding component 1018 entropy codes the prediction results and the quantized transform coefficients and sends them to a video decoder (not shown). The quantized components output from the quantization component 1016 can be fed into an inverse quantization component 1020, an inverse transform component 1022, and a reconstruction (REC) component 1024. The REC component 1024 can output images to DF 1002, SAO 1004, and ALF 1006 for filtering before these images are stored in the reference image buffer 1012.

[0227] The following provides a list of preferred solutions for some embodiments.

[0228] The following solutions illustrate examples of the techniques discussed in this article.

[0229] 1. A visual media processing method (e.g., method 600 shown in Figure 6), comprising: performing (602) a conversion between visual media data and a file storing information corresponding to the visual media data according to a video file format; wherein the video file format includes a decoder configuration record configured with information for content selection, wherein the decoder configuration record includes one or more of the following fields: required decoded picture buffer size, maximum picture output reordering, maximum latency, gradual decoding refresh picture enabled flag, clean random access picture enabled flag, reference picture resampling enabled flag, spatial resolution change within a coded layer video sequence enabled flag, sub-picture partitioning enabled flag, maximum number of sub-pictures in each picture, wavefront parallel processing enabled flag, tile partitioning enabled flag, maximum number of tiles in each picture, slice partitioning enabled flag, rectangular slice enabled flag, raster-scan slice enabled flag, and maximum number of slices in each picture.

[0230] 2. A visual media processing method, comprising: performing a conversion between visual media data and a file storing information corresponding to the visual media data according to a video file format; wherein a rule specifies that a field indicating the number of temporal layers is included in a decoder configuration record based on whether the file includes profile-tier-level (PTL) information of the visual media data; wherein the rule further specifies that the field is included before the PTL information.

[0231] 3. The method according to Solution 2, wherein the rule further specifies the order in which the PTL information appears in the video file format relative to one or more additional information fields.

[0232] 4. The method according to Solution 3, wherein the one or more additional information fields include a chroma format indication field, a bit depth field, a field indicating the number of temporal layers, or a field indicating whether a constant frame rate is used for the visual media data.

[0233] 5. The method according to Solution 3, wherein the one or more additional information fields include reserved bit fields.

[0234] 6. The method according to any one of solutions 2-5, wherein the rule specifies that the PTL information is included as the last field of the decoder configuration record.

[0235] 7. The method according to any one of solutions 1-6, wherein the conversion includes generating a bitstream representation of the visual media data and storing the bitstream representation to a file according to format rules.

[0236] 8. The method according to any one of solutions 1-6, wherein the conversion includes parsing the file according to format rules to recover visual media data.

[0237] 9. A video decoding apparatus, comprising a processor configured to implement one or more of the methods described in solutions 1 to 8.

[0238] 10. A video encoding apparatus, comprising a processor configured to implement one or more of the methods described in solutions 1 to 8.

[0239] 11. A computer program product having computer code stored thereon, which, when executed by a processor, causes the processor to implement the method of any one of solutions 1 to 8.

[0240] 12. A computer-readable medium storing a bitstream representation that conforms to a file format generated according to any one of solutions 1 to 8.

[0241] 13. The methods, apparatus, or systems described in this document. In the solutions described herein, the encoder conforms to the format rules by generating a codec representation according to the format rules. In the solutions described herein, the decoder can use the format rules to parse the syntax elements in the codec representation and determine the presence or absence of the syntax elements according to the format rules to generate decoded video.
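
The way a format rule lets a parser determine the presence or absence of fields can be sketched as follows. The byte layout here is a toy one invented for illustration and is not the decoder configuration record of any file format: one byte per field, a leading presence flag, and the names write_record, parse_record, and level_info are all hypothetical. The sketch only mirrors the ordering idea of Solutions 2 to 6, in which the temporal-layer count is written only when the level information is present, precedes it, and the level information comes last.

```python
def write_record(level_info=None, num_temporal_layers=1, chroma_format=1,
                 bit_depth=8, constant_frame_rate=0):
    """Serialize a toy record (hypothetical layout, one byte per field)."""
    out = bytearray()
    out.append(1 if level_info is not None else 0)  # presence flag
    if level_info is not None:
        # These fields exist only when the level information is present,
        # and are written before it.
        out += bytes([num_temporal_layers, chroma_format,
                      bit_depth, constant_frame_rate])
        out += level_info  # level information is the last field
    return bytes(out)


def parse_record(data):
    """Parse the toy record, using the flag to decide which fields exist."""
    if not data:
        raise ValueError("empty record")
    present = data[0] == 1
    record = {"level_info_present": present}
    if not present:
        return record  # conditional fields are absent by rule
    if len(data) < 5:
        raise ValueError("record too short for the conditional fields")
    record["num_temporal_layers"] = data[1]
    record["chroma_format"] = data[2]
    record["bit_depth"] = data[3]
    record["constant_frame_rate"] = data[4]
    record["level_info"] = bytes(data[5:])  # everything after is level info
    return record


# Round trip with made-up level bytes, and a record without them.
with_level = parse_record(write_record(level_info=b"\x01\x33", num_temporal_layers=3))
without_level = parse_record(write_record())
```

Because the level information is last, the parser in this sketch can take the remainder of the record as that field without needing a separate length.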

[0242] In this document, the term "video processing" can refer to video encoding, video decoding, video compression, or video decompression. For example, a video compression algorithm can be applied during conversion from the pixel representation of a video to a corresponding bitstream representation, or vice versa. As defined by the syntax, the bitstream representation of a current video block can, for example, correspond to bits that are either co-located or spread across different positions within the bitstream. For example, a macroblock can be encoded in terms of transformed and coded error residual values, and also using bits in headers and other fields in the bitstream. Furthermore, during conversion, a decoder can parse the bitstream with the knowledge that some fields may be present or absent, based on the determination described in the solutions above. Similarly, an encoder can determine whether certain syntax fields are to be included and generate the codec representation accordingly by including or excluding those syntax fields.

[0243] The disclosed and other solutions, examples, embodiments, modules, and functional operations described in this document can be implemented in digital electronic circuitry or in computer software, firmware, or hardware, including the structures disclosed in this document and their structural equivalents, or in combinations of one or more of them. The disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by a data processing apparatus or for controlling the operation of a data processing apparatus. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver device.

[0244] Computer programs (also known as programs, software, software applications, scripts, or code) can be written in any programming language, including compiled or interpreted languages, and can be deployed in any form, including standalone programs or modules, components, subroutines, or other units suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple co-located files (e.g., a file storing one or more modules, subroutines, or code portions). A computer program can be deployed to execute on one computer or on multiple computers located at a single site or distributed across multiple sites and interconnected by a communications network.

[0245] The processes and logic flows described in this document can be performed by one or more programmable processors executing one or more computer programs to perform functions by manipulating input data and generating outputs. The processes and logic flows can also be performed by special-purpose logic circuitry (e.g., field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs)), and the apparatus can be implemented as special-purpose logic circuitry (e.g., FPGAs or ASICs).

[0246] Processors suitable for executing computer programs include, for example, both general-purpose and special-purpose microprocessors, and any one or more processors of any type of digital computer. Typically, a processor receives instructions and data from read-only memory or random access memory, or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include one or more mass storage devices (e.g., magnetic, magneto-optical, or optical disks) for storing data, or be operatively coupled to receive data from or transfer data to such mass storage devices, or both. However, a computer does not necessarily need to have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by or incorporated into special-purpose logic circuitry.

[0247] While this patent document contains numerous details, these details should not be construed as limiting any subject matter or the scope of the claims, but rather as descriptions of features specific to particular embodiments of a particular art. In this patent document, certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented separately in multiple embodiments or in various suitable sub-combinations. Furthermore, although features may be described above as operating in certain combinations and even initially claimed in the same manner, in certain circumstances one or more features from the claimed combination may be removed from the combination, and the claimed combination may be for sub-combinations or variations thereof.

[0248] Similarly, although operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring such operations to be performed in the specific order or sequence shown, or to perform all the operations shown, in order to achieve the desired result. Furthermore, the separation of various system components in the embodiments described in this patent document should not be construed as requiring such separation in all embodiments.

[0249] Only a few implementations and examples are described, and other implementations, enhancements and variations can be made based on what is described and shown in this patent document.

[0250] A first component is directly coupled to a second component when there is no intermediate component between them other than a line, trace, or other medium. The first component is indirectly coupled to the second component when an intermediate component other than a line, trace, or other medium exists between them. The term "coupled" and its variations cover both direct coupling and indirect coupling. Unless otherwise stated, the term "about" denotes a range that includes ±10% of the number that follows it.

[0251] While several embodiments have been provided in this disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of this disclosure. The present examples are intended to be illustrative rather than limiting and are not intended to be limited to the details given herein. For example, various elements or components may be combined or integrated into another system, or certain features may be omitted or not implemented.

[0252] Furthermore, without departing from the scope of this disclosure, the discrete or individual technologies, systems, subsystems, and methods described and illustrated in the various embodiments may be combined or integrated with other systems, modules, technologies, or methods. Other items shown or discussed as coupled may be directly connected or indirectly coupled or communicated via some interface, device, or intermediate component in an electrical, mechanical, or other manner. Those skilled in the art can identify other examples of changes, substitutions, and alterations, and these changes, substitutions, and alterations may be made without departing from the spirit and scope of the disclosure herein.

Claims

1. A method for processing visual media data, comprising: performing a conversion between visual media data and a visual media data file based on a random access recovery point sample group that specifies access points in layers and on a grouping type parameter; wherein the visual media data file includes images in the layers; wherein the grouping type parameter specifies the correspondence between the access points and the relevant layers among the layers and includes a layer identifier method identifier parameter; wherein the layer identifier method identifier parameter specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) images; and one or more hybrid network abstraction layer (NAL) unit images containing both intra random access point (IRAP) sub-images and non-IRAP sub-images; and wherein the layer identifier method identifier parameter is denoted as layer_id_method_idc, and layer_id_method_idc equal to 1 or 3 indicates that the access points apply only to the relevant layers.

2. The method according to claim 1, wherein the conversion includes: encoding the images into the layers in the visual media data file; determining the random access recovery point sample group that specifies the access points in the layers; encoding the grouping type parameter into the visual media data file; and storing the visual media data file.

3. The method according to claim 1, wherein the conversion includes: receiving the visual media data file including the images coded into the layers; obtaining the random access recovery point sample group from the visual media data file, the random access recovery point sample group specifying the access points in the layers; obtaining the grouping type parameter from the visual media data file; and decoding the visual media data file based on the grouping type parameter.

4. The method according to claim 1, wherein, The grouping type parameter includes a target layer parameter, which includes multiple bits, each bit specifying one of the relevant layers.

5. The method according to claim 1, wherein, The value of the layer identifier method identifier parameter specifies whether the access point applies only to the relevant layer or whether the access point applies to all layers.

6. The method according to claim 4, wherein, The group type parameter is denoted as group_type_parameter, and the target layer parameter is denoted as target_layers.

7. The method according to claim 6, wherein, When all access points in the relevant layer are specified as GDR images and the access points are applied to all layers, the layer_id_method_idc is set to 0.

8. The method according to claim 6, wherein, When all access points in the relevant layer are specified as GDR images and the access points are applied only to the relevant layer, the layer_id_method_idc is set to 1.

9. The method according to claim 6, wherein, When the access point in the relevant layer is specified as a GDR image, a hybrid NAL unit image, or a combination thereof, and the access point is applied to all layers, the layer_id_method_idc is set to 2.

10. The method according to claim 6, wherein, When the access point in the relevant layer is specified as a GDR image, a hybrid NAL unit image, or a combination thereof, and the access point is applied only to the relevant layer, the layer_id_method_idc is set to 3.

11. An apparatus for processing video data, comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to: perform a conversion between visual media data and a visual media data file based on a random access recovery point sample group that specifies access points in layers and on a grouping type parameter; wherein the visual media data file includes images in the layers; wherein the grouping type parameter specifies the correspondence between the access points and the relevant layers among the layers and includes a layer identifier method identifier parameter; wherein the layer identifier method identifier parameter specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) images; and one or more hybrid network abstraction layer (NAL) unit images containing both intra random access point (IRAP) sub-images and non-IRAP sub-images; and wherein the layer identifier method identifier parameter is denoted as layer_id_method_idc, and layer_id_method_idc equal to 1 or 3 indicates that the access points apply only to the relevant layers.

12. The apparatus according to claim 11, wherein the conversion includes: encoding the images into the layers in the visual media data file; determining the random access recovery point sample group that specifies the access points in the layers; encoding the grouping type parameter into the visual media data file; and storing the visual media data file.

13. The apparatus according to claim 11, wherein the conversion includes: receiving the visual media data file including the images coded into the layers; obtaining the random access recovery point sample group from the visual media data file, the random access recovery point sample group specifying the access points in the layers; obtaining the grouping type parameter from the visual media data file; and decoding the visual media data file based on the grouping type parameter.

14. The apparatus according to claim 11, wherein, The grouping type parameter includes a target layer parameter, which includes multiple bits, each bit specifying one of the relevant layers.

15. The apparatus according to claim 11, wherein, The value of the layer identifier method identifier parameter specifies whether the access point applies only to the relevant layer or whether the access point applies to all layers.

16. The apparatus according to claim 14, wherein the grouping type parameter is denoted as group_type_parameter and the target layer parameter is denoted as target_layers; wherein, when all access points in the relevant layers are specified as GDR images and the access points apply to all layers, layer_id_method_idc is set to 0; wherein, when all access points in the relevant layers are specified as GDR images and the access points apply only to the relevant layers, layer_id_method_idc is set to 1; wherein, when the access points in the relevant layers are specified as GDR images, hybrid NAL unit images, or a combination thereof, and the access points apply to all layers, layer_id_method_idc is set to 2; and wherein, when the access points in the relevant layers are specified as GDR images, hybrid NAL unit images, or a combination thereof, and the access points apply only to the relevant layers, layer_id_method_idc is set to 3.

17. A non-transitory computer-readable medium comprising a computer program product for use by a video codec apparatus, the computer program product comprising computer-executable instructions stored on the non-transitory computer-readable medium such that, when executed by a processor, the computer-executable instructions cause the video codec apparatus to: perform a conversion between visual media data and a visual media data file based on a random access recovery point sample group that specifies access points in layers and on a grouping type parameter; wherein the visual media data file includes images in the layers; wherein the grouping type parameter specifies the correspondence between the access points and the relevant layers among the layers and includes a layer identifier method identifier parameter; wherein the layer identifier method identifier parameter specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) images; and one or more hybrid network abstraction layer (NAL) unit images containing both intra random access point (IRAP) sub-images and non-IRAP sub-images; and wherein the layer identifier method identifier parameter is denoted as layer_id_method_idc, and layer_id_method_idc equal to 1 or 3 indicates that the access points apply only to the relevant layers.

18. The non-transitory computer-readable medium according to claim 17, wherein, The grouping type parameter includes a target layer parameter, which includes multiple bits, each bit specifying one of the relevant layers.

19. The non-transitory computer-readable medium according to claim 17, wherein, The value of the layer identifier method identifier parameter specifies whether the access point applies only to the relevant layer or whether the access point applies to all layers.

20. The non-transitory computer-readable medium according to claim 18, wherein the grouping type parameter is denoted as group_type_parameter and the target layer parameter is denoted as target_layers; wherein, when all access points in the relevant layers are specified as GDR images and the access points apply to all layers, layer_id_method_idc is set to 0; wherein, when all access points in the relevant layers are specified as GDR images and the access points apply only to the relevant layers, layer_id_method_idc is set to 1; wherein, when the access points in the relevant layers are specified as GDR images, hybrid NAL unit images, or a combination thereof, and the access points apply to all layers, layer_id_method_idc is set to 2; and wherein, when the access points in the relevant layers are specified as GDR images, hybrid NAL unit images, or a combination thereof, and the access points apply only to the relevant layers, layer_id_method_idc is set to 3.

21. A non-transitory computer-readable recording medium storing instructions and a visual media data file, wherein the instructions, when executed, cause a processor to: generate the visual media data file based on a random access recovery point sample group that specifies access points in layers and on a grouping type parameter; wherein the visual media data file includes images in the layers; wherein the grouping type parameter specifies the correspondence between the access points and the relevant layers among the layers and includes a layer identifier method identifier parameter; wherein the layer identifier method identifier parameter specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) images; and one or more hybrid network abstraction layer (NAL) unit images containing both intra random access point (IRAP) sub-images and non-IRAP sub-images; and wherein the layer identifier method identifier parameter is denoted as layer_id_method_idc, and layer_id_method_idc equal to 1 or 3 indicates that the access points apply only to the relevant layers.

22. A method for storing a visual media data file, comprising: generating the visual media data file based on a random access recovery point sample group that specifies access points in layers and on a grouping type parameter; and storing the visual media data file in a non-transitory computer-readable storage medium; wherein the visual media data file includes images in the layers; wherein the grouping type parameter specifies the correspondence between the access points and the relevant layers among the layers and includes a layer identifier method identifier parameter; wherein the layer identifier method identifier parameter specifies that the access points include one or more of the following: one or more gradual decoding refresh (GDR) images; and one or more hybrid network abstraction layer (NAL) unit images containing both intra random access point (IRAP) sub-images and non-IRAP sub-images; and wherein the layer identifier method identifier parameter is denoted as layer_id_method_idc, and layer_id_method_idc equal to 1 or 3 indicates that the access points apply only to the relevant layers.
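
The relationship between the grouping type parameter, target_layers, and layer_id_method_idc recited in the claims can be illustrated with a short sketch. It is not part of the claims and rests on assumptions that the claims do not state: the parameter is taken to be a 32-bit value with target_layers in the upper 28 bits and layer_id_method_idc in the lower 4 bits, and bit i of target_layers is taken to denote the i-th relevant layer. All function names are hypothetical. The table of values 0 to 3 follows the conditions recited in claims 7 to 10.

```python
TARGET_LAYERS_BITS = 28  # assumed width, not stated in the claims
IDC_BITS = 4             # assumed width, not stated in the claims

# (kinds of access point, scope) for each value, per claims 7 to 10.
IDC_SEMANTICS = {
    0: ("GDR images only", "all layers"),
    1: ("GDR images only", "relevant layers only"),
    2: ("GDR images and/or hybrid NAL unit images", "all layers"),
    3: ("GDR images and/or hybrid NAL unit images", "relevant layers only"),
}


def pack_parameter(target_layers, layer_id_method_idc):
    """Combine both fields into one integer (assumed bit layout)."""
    if not 0 <= target_layers < (1 << TARGET_LAYERS_BITS):
        raise ValueError("target_layers out of range")
    if not 0 <= layer_id_method_idc < (1 << IDC_BITS):
        raise ValueError("layer_id_method_idc out of range")
    return (target_layers << IDC_BITS) | layer_id_method_idc


def unpack_parameter(value):
    """Split the integer back into (target_layers, layer_id_method_idc)."""
    return value >> IDC_BITS, value & ((1 << IDC_BITS) - 1)


def relevant_layers(target_layers):
    """Indices of the bits set in target_layers (bit-to-layer mapping assumed)."""
    return [i for i in range(TARGET_LAYERS_BITS) if (target_layers >> i) & 1]


def applies_only_to_relevant_layers(layer_id_method_idc):
    """True for the values 1 and 3, as recited in the independent claims."""
    if layer_id_method_idc not in IDC_SEMANTICS:
        raise ValueError("value not covered by claims 7 to 10")
    return IDC_SEMANTICS[layer_id_method_idc][1] == "relevant layers only"


# Example: two relevant layers (bits 0 and 2) and layer_id_method_idc = 3.
packed = pack_parameter(0b101, 3)
layers_field, idc = unpack_parameter(packed)
layers = relevant_layers(layers_field)
restricted = applies_only_to_relevant_layers(idc)
```

A reader of the file would unpack the parameter first and then use layer_id_method_idc to decide whether the listed access points can be relied on for every layer or only for the layers flagged in target_layers.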