Level indicator for sub-picture entity group

By introducing level indicators and signaling notification mechanisms into the sub-image entity group, the problem of low efficiency in sub-image management in multi-view and 360-degree immersive media in existing video codec standards is solved, and efficient video data transmission and codec are achieved.

CN115225904B · Active · Publication Date: 2026-06-19 · FACE CUTE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: FACE CUTE CO LTD
Filing Date: 2022-04-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing video codec standards struggle to efficiently manage and transmit sub-image information when handling multi-view video and viewport-adaptive 360-degree immersive media, resulting in low bandwidth utilization efficiency.

Method used

By introducing level indicators in sub-picture entity groups and signaling, in the entity-to-group box, which track contains the sample group, the conversion between visual media data and media data files is realized, and multi-track management of Versatile Video Coding (VVC) reference tracks is supported.

Benefits of technology

It improves the efficiency of video data transmission and bandwidth utilization, supports efficient encoding, decoding and transmission of multi-view video and 360-degree immersive media, and adapts to the dynamic changes of different viewpoints.

✦ Generated by Eureka AI based on patent content.

Abstract

A mechanism for processing video data. It determines the level indicator for a sub-picture set. The sub-picture set is included in one or more sub-picture tracks. The sub-picture tracks are included in sub-picture entity groups. When the level indicator is included in a sample group, a signaling indication is given in an entity-to-group box to indicate the track containing the sample group. A conversion between visual media data and media data files is performed based on the level indicator.

Description

[0001] Cross-references to related applications

[0002] This patent application claims the benefit of U.S. Provisional Patent Application No. 63/175,421, filed April 15, 2021, entitled “Signaling Notification of Information for Sub-Picture Track Sets,” which is incorporated herein by reference.

Technical Field

[0003] This patent document relates to the generation, storage, and consumption of digital audio and video media information in file formats.

Background Technology

[0004] Digital video still accounts for the largest share of bandwidth usage on the Internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video increases, the bandwidth required for digital video usage is expected to continue to grow.

Summary of the Invention

[0005] A first aspect relates to a method for processing video data, comprising: determining a level indicator for a set of subpictures included in one or more subpicture tracks, wherein the subpicture tracks are included in a subpicture entity group, and wherein, when the level indicator is included in a sample group, an indication of the track that includes the sample group is signaled in an entity-to-group box; and performing a conversion between visual media data and a media data file based on the level indicator.

[0006] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the indication is a track identifier (ID) signaled in a level information track ID (level_information_track_id) field.

[0007] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group should include picture header network abstraction layer (NAL) units.

[0008] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must include picture header network abstraction layer (NAL) units.

[0009] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must be a Versatile Video Coding (VVC) reference track.

[0010] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the entity-to-group box defines the subpicture entity group.

[0011] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the sample group carries information about the set of subpictures.

[0012] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the conversion includes encoding the visual media data into the media data file.

[0013] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the conversion includes decoding the visual media data from the media data file.
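As a non-limiting illustration of the first aspect, the following Python sketch shows how a file reader might locate the level indicator. Only the field name level_information_track_id comes from this disclosure; the classes SampleGroup, Track, and SubpicEntityGroup, the field level_idc, and the function resolve_level are hypothetical names chosen for the example and do not reproduce the actual box syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SampleGroup:
    # Hypothetical: carries level information about a set of subpictures.
    level_idc: int


@dataclass
class Track:
    track_id: int
    sample_groups: List[SampleGroup] = field(default_factory=list)


@dataclass
class SubpicEntityGroup:
    # Hypothetical model of an entity-to-group box that defines
    # a subpicture entity group.
    entity_track_ids: List[int]
    level_idc: Optional[int] = None  # level carried directly in the box
    # Track ID of the track holding the sample group (name from the disclosure).
    level_information_track_id: Optional[int] = None


def resolve_level(group: SubpicEntityGroup, tracks: Dict[int, Track]) -> int:
    """Return the level indicator for the subpicture set of `group`."""
    if group.level_information_track_id is None:
        if group.level_idc is None:
            raise ValueError("no level indicator available")
        return group.level_idc
    # The box tells the reader which track to open, so no search is needed.
    track = tracks[group.level_information_track_id]
    if not track.sample_groups:
        raise ValueError("signaled track has no sample group")
    return track.sample_groups[0].level_idc
```

In this sketch the reader follows the signaled track ID directly instead of scanning every track of the entity group for the sample group, which is the practical effect of the indication.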

[0014] A second aspect relates to an apparatus for processing video data, comprising: a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to: determine a level indicator for a set of subpictures included in one or more subpicture tracks, wherein the subpicture tracks are included in a subpicture entity group, and wherein, when the level indicator is included in a sample group, an indication of the track that includes the sample group is signaled in an entity-to-group box; and perform a conversion between visual media data and a media data file based on the level indicator.

[0015] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the indication is a track identifier (ID) signaled in a level information track ID (level_information_track_id) field.

[0016] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group should include picture header network abstraction layer (NAL) units.

[0017] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must include picture header network abstraction layer (NAL) units.

[0018] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must be a Versatile Video Coding (VVC) reference track.

[0019] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the entity-to-group box defines the subpicture entity group.

[0020] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the sample group carries information about the set of subpictures.

[0021] A third aspect relates to a non-transitory computer-readable medium comprising a computer program product for use by a video coding device, the computer program product including computer-executable instructions stored on the non-transitory computer-readable medium such that, when executed by a processor, the instructions cause the video coding device to: determine a level indicator for a set of subpictures included in one or more subpicture tracks, wherein the subpicture tracks are included in a subpicture entity group, and wherein, when the level indicator is included in a sample group, an indication of the track that includes the sample group is signaled in an entity-to-group box; and perform a conversion between visual media data and a media data file based on the level indicator.

[0022] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the indication is a track identifier (ID) signaled in a level information track ID (level_information_track_id) field.

[0023] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group should include picture header network abstraction layer (NAL) units.

[0024] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must include picture header network abstraction layer (NAL) units.

[0025] Optionally, in any of the foregoing aspects, another implementation of the aspect provides a rule stipulating that the track including the sample group must be a Versatile Video Coding (VVC) reference track.

[0026] Optionally, in any of the foregoing aspects, another implementation of the aspect provides that the entity-to-group box defines the subpicture entity group, and that the sample group carries information about the set of subpictures.

[0027] For clarity, any of the foregoing embodiments may be combined with any one or more other foregoing embodiments to form new embodiments within the scope of this disclosure.

[0028] These and other features will become clearer from the following detailed description taken in conjunction with the accompanying drawings and claims.

Brief Description of the Drawings

[0029] For a more complete understanding of this disclosure, reference is made to the following brief description in conjunction with the accompanying drawings and detailed description, wherein the same or similar reference numerals are used to refer to the same or similar parts.

[0030] Figure 1 is a schematic diagram illustrating the partitioning of an example picture into slices, subpictures, tiles, and coding tree units (CTUs).

[0031] Figure 2 is a schematic diagram of an example subpicture-based viewport-dependent 360° video delivery scheme.

[0032] Figure 3 is a schematic diagram of an example mechanism for extracting subpictures from a bitstream.

[0033] Figure 4 is a schematic diagram of a media file stored in the International Organization for Standardization (ISO) base media file format (ISOBMFF).

[0034] Figure 5 is a block diagram illustrating an example video processing system.

[0035] Figure 6 is a block diagram of an example video processing device.

[0036] Figure 7 is a flowchart illustrating an example video processing method.

[0037] Figure 8 is a block diagram of an example video coding system.

[0038] Figure 9 is a block diagram of an example encoder.

[0039] Figure 10 is a block diagram of an example decoder.

[0040] Figure 11 is a schematic diagram of an example encoder.

Detailed Description

[0041] First, it should be understood that although illustrative implementations of one or more embodiments are provided below, the disclosed systems and/or methods can be implemented using any number of techniques, whether currently known or yet to be developed. This disclosure should in no way be limited to the illustrative implementations, drawings, and techniques shown below, including the exemplary designs and implementations shown and described herein, but may be modified within the full scope of the appended claims and their equivalents.

[0042] This patent document relates to video file formats. Specifically, this document relates to the signaling of information about certain sets of subpicture tracks within a subpicture entity group. This supports carrying a Versatile Video Coding (VVC) video bitstream in multiple tracks of a media file based on the International Organization for Standardization (ISO) base media file format (ISOBMFF). The ideas described herein can be applied individually or in various combinations to video bitstreams coded by any codec (e.g., the VVC standard) and to any video file format (e.g., the VVC video file format).

[0043] This disclosure includes the following abbreviations: Adaptive Color Transform (ACT), Adaptive Loop Filter (ALF), Adaptive Motion Vector Resolution (AMVR), Adaptation Parameter Set (APS), Access Unit (AU), Access Unit Delimiter (AUD), Advanced Video Coding (AVC, also known as Rec. ITU-T H.264 | ISO/IEC 14496-10), Bi-predictive (B), Bi-prediction with Coding Unit (CU)-level Weights (BCW), Bi-Directional Optical Flow (BDOF), Block-based Delta Pulse Code Modulation (BDPCM), Buffering Period (BP), Context-based Adaptive Binary Arithmetic Coding (CABAC), Coding Block (CB), Constant Bit Rate (CBR), Cross-Component Adaptive Loop Filter (CCALF), Coded Picture Buffer (CPB), Clean Random Access (CRA), Cyclic Redundancy Check (CRC), Coding Tree Block (CTB), Coding Tree Unit (CTU), Coding Unit (CU), Coded Video Sequence (CVS), Decoding Capability Information (DCI), Decoding Initialization Information (DII), Decoded Picture Buffer (DPB), Dependent Random Access Point (DRAP), Decoding Unit (DU), Decoding Unit Information (DUI), Exponential-Golomb (EG), k-th order Exponential-Golomb (EGk), End of Bitstream (EOB), End of Sequence (EOS), Filler Data (FD), First-In First-Out (FIFO), Fixed-Length (FL), Green, Blue, and Red (GBR), General Constraints Information (GCI), Gradual Decoding Refresh (GDR), Geometric Partitioning Mode (GPM), High Efficiency Video Coding (HEVC, also known as Rec. ITU-T H.265 | ISO/IEC 23008-2), Hypothetical Reference Decoder (HRD), Hypothetical Stream Scheduler (HSS), Intra (I), Intra Block Copy (IBC), Instantaneous Decoding Refresh (IDR), Inter-Layer Reference Picture (ILRP), Intra Random Access Point (IRAP), Low-Frequency Non-Separable Transform (LFNST), Least Probable Symbol (LPS), Least Significant Bit (LSB), Long-Term Reference Picture (LTRP), Luma Mapping with Chroma Scaling (LMCS), Matrix-based Intra Prediction (MIP), Most Probable Symbol (MPS), Most Significant Bit (MSB), Multiple Transform Selection (MTS), Motion Vector Prediction (MVP), Network Abstraction Layer (NAL), Output Layer Set (OLS), Operation Point (OP), Operating Point Information (OPI), Predictive (P), Picture Header (PH), Picture Order Count (POC), Picture Parameter Set (PPS), Prediction Refinement with Optical Flow (PROF), Picture Timing (PT), Picture Unit (PU), Quantization Parameter (QP), Random Access Decodable Leading picture (RADL), Random Access Skipped Leading picture (RASL), Raw Byte Sequence Payload (RBSP), Red, Green, and Blue (RGB), Reference Picture List (RPL), Sample Adaptive Offset (SAO), Sample Aspect Ratio (SAR), Supplemental Enhancement Information (SEI), Slice Header (SH), Subpicture Level Information (SLI), String of Data Bits (SODB), Sequence Parameter Set (SPS), Short-Term Reference Picture (STRP), Step-wise Temporal Sublayer Access (STSA), Truncated Rice (TR), Variable Bit Rate (VBR), Video Coding Layer (VCL), Video Parameter Set (VPS), Versatile Supplemental Enhancement Information (VSEI, also known as Rec. ITU-T H.274 | ISO/IEC 23002-7), Video Usability Information (VUI), and Versatile Video Coding (VVC, also known as Rec. ITU-T H.266 | ISO/IEC 23090-3).

[0044] Video coding standards have evolved primarily through the work of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and the International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC). ITU-T produced H.261 and H.263, ISO/IEC produced Moving Picture Experts Group (MPEG)-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC), and H.265/HEVC standards. Since H.262, video coding standards have been based on a hybrid video coding structure that combines temporal prediction with transform coding. To explore future video coding technologies beyond HEVC, the Video Coding Experts Group (VCEG) and MPEG jointly founded the Joint Video Exploration Team (JVET). JVET adopted many new methods and put them into reference software named the Joint Exploration Model (JEM). JVET was later renamed the Joint Video Experts Team (JVET) when the Versatile Video Coding (VVC) project officially started. VVC is a coding standard that targets a 50% bitrate reduction compared to HEVC, and it has been finalized by JVET.

[0045] The VVC standard, also known as ITU-T H.266 | ISO/IEC 23090-3, and the associated Versatile Supplemental Enhancement Information (VSEI) standard, also known as ITU-T H.274 | ISO/IEC 23002-7, are designed for a wide range of applications such as television broadcasting, video conferencing, playback from storage media, adaptive bitrate streaming, video region extraction, composition and merging of content from multiple coded video bitstreams, multi-view video, scalable layered coding, and viewport-adaptive 360° immersive media.

[0046] File format standards are discussed below. Media streaming applications are typically based on Internet Protocol (IP), Transmission Control Protocol (TCP), and Hypertext Transfer Protocol (HTTP) transport methods, and often rely on a file format such as ISOBMFF. One such streaming system is Dynamic Adaptive Streaming over HTTP (DASH). Video can be encoded in a video format such as AVC and/or HEVC. The encoded video can be encapsulated in ISOBMFF tracks and included in DASH representations and segments. For content selection purposes, important information about the video bitstream, such as the profile, tier, and level, can be exposed as file format level metadata and/or in the DASH Media Presentation Description (MPD). For example, this information can be used to select appropriate media segments both for initialization at the start of a streaming session and for stream adaptation during the streaming session.

[0047] Similarly, when an image format is used with ISOBMFF, a file format specification specific to that image format can be used, such as the AVC image file format and the HEVC image file format. MPEG is developing the VVC video file format, which is an ISOBMFF-based file format for storing VVC video content. MPEG is also developing an ISOBMFF-based VVC image file format for storing image content coded using VVC.

[0048] The picture partitioning schemes of HEVC are now discussed. HEVC includes four different picture partitioning schemes: regular slices, dependent slices, tiles, and wavefront parallel processing (WPP). These partitioning schemes can be applied for maximum transmission unit (MTU) size matching, parallel processing, and reduced end-to-end latency.

[0049] Regular slices are similar to the slices in H.264/AVC. Each regular slice is encapsulated in its own NAL unit, and in-picture prediction (intra sample prediction, motion information prediction, and coding mode prediction) and entropy coding dependencies across slice boundaries are disabled. Therefore, a regular slice can be reconstructed independently of other regular slices within the same picture (although interdependencies may still exist due to in-loop filtering operations).

[0050] Regular slices can be used for parallelization and are also available in H.264/AVC. Parallelization based on regular slices does not require much inter-processor and/or inter-core communication. An exception is the inter-processor or inter-core data sharing needed for motion compensation when decoding a predictively coded picture. Such sharing is typically much heavier than the inter-processor or inter-core data sharing caused by in-picture prediction. However, for the same reason, the use of regular slices can incur substantial coding overhead due to the bit cost of the slice header and the lack of prediction across slice boundaries. Furthermore, in contrast to the other tools mentioned below, regular slices also serve as a mechanism for bitstream partitioning to match MTU size requirements. This is due to the in-picture independence of regular slices and because each regular slice is encapsulated in its own NAL unit. In many cases, the goals of parallelization and MTU size matching place conflicting demands on the slice layout in a picture. Recognition of this conflict led to the development of the parallelization tools mentioned below.

[0051] Dependent slices have short slice headers and allow partitioning of the bitstream at treeblock boundaries without breaking in-picture prediction. Dependent slices fragment a regular slice into multiple NAL units. This reduces end-to-end latency by allowing part of a regular slice to be sent before encoding of the entire regular slice is complete.

[0052] In WPP, the picture is partitioned into single rows of coding tree blocks (CTBs). Entropy decoding and prediction are allowed to use data from CTBs in other partitions. Parallel processing is possible by decoding CTB rows in parallel. The start of decoding of a CTB row is delayed by two CTBs to ensure that data related to the CTB above and to the right of the subject CTB is available before the subject CTB is decoded. With this staggered start (which looks like a wavefront when represented graphically), parallelization is possible with up to as many processors/cores as the picture contains CTB rows. Because in-picture prediction between neighboring treeblock rows within a picture is allowed, the inter-processor/inter-core communication needed to enable in-picture prediction can be substantial. WPP partitioning does not produce additional NAL units. Therefore, WPP is not a tool for MTU size matching. However, if MTU size matching is required, regular slices can be used together with WPP at the cost of some coding overhead.
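As a non-limiting illustration of the staggered start, the following Python sketch computes the earliest time step at which each CTB can begin decoding. It assumes one thread per CTB row and one CTB decoded per time step; these assumptions and the function name wpp_start_times are illustrative only and are not part of any standard.

```python
def wpp_start_times(ctb_rows, ctb_cols, lag=2):
    """Earliest start step for every CTB under a wavefront schedule.

    Assumes one thread per CTB row and one CTB per time step. A CTB may
    start once the previous CTB in its row and the CTB above and to the
    right of it (clamped at the right picture edge) are finished.
    """
    start = [[0] * ctb_cols for _ in range(ctb_rows)]
    for r in range(ctb_rows):
        for c in range(ctb_cols):
            t = 0
            if c > 0:
                t = start[r][c - 1] + 1
            if r > 0:
                dep_col = min(c + lag - 1, ctb_cols - 1)
                t = max(t, start[r - 1][dep_col] + 1)
            start[r][c] = t
    return start


if __name__ == "__main__":
    schedule = wpp_start_times(4, 8)
    total_steps = schedule[-1][-1] + 1
    print("steps with WPP:", total_steps, "serial steps:", 4 * 8)
```

Under these assumptions each row starts two steps after the row above it, so a picture with R rows and C columns of CTBs finishes in C + 2(R − 1) steps instead of R × C.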

[0053] Tiles define horizontal and vertical boundaries that partition a picture into tile columns and tile rows. Tile columns run from the top of the picture to the bottom. Likewise, tile rows run from the left of the picture to the right. The number of tiles in a picture is the number of tile columns multiplied by the number of tile rows.

[0054] The scan order of CTBs can be local within a tile, following a CTB raster scan of that tile. All CTBs within a tile are therefore decoded before the top-left CTB of the next tile, where tiles are taken in the order of a tile raster scan of the picture. Similar to regular slices, tiles break in-picture prediction dependencies as well as entropy decoding dependencies. However, tiles do not need to be included in individual NAL units, which is the same as WPP in this regard. Therefore, tiles are not used for MTU size matching. Each tile can be processed by one processor/core. When a slice spans more than one tile, the inter-processor/inter-core communication needed for in-picture prediction between processing units decoding neighboring tiles is limited to conveying the shared slice header. This communication may also include sharing of reconstructed samples and metadata related to in-loop filtering. When more than one tile or WPP segment is included in a slice, the entry point byte offset of each tile or WPP segment in the slice, other than the first one, is signaled in the slice header.
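As a non-limiting illustration, the tile-local scan order described above can be sketched in Python as follows. The function name tile_scan_order and its arguments (tile column widths and tile row heights, both in CTBs) are assumptions made for this example.

```python
def tile_scan_order(col_widths, row_heights):
    """Return CTB raster-scan addresses in tile scan order.

    col_widths:  width of each tile column, in CTBs
    row_heights: height of each tile row, in CTBs
    Tiles are visited in tile raster scan order; within a tile, CTBs are
    visited in CTB raster scan order of that tile.
    """
    pic_width = sum(col_widths)
    order = []
    y0 = 0
    for h in row_heights:
        x0 = 0
        for w in col_widths:
            for y in range(y0, y0 + h):
                for x in range(x0, x0 + w):
                    order.append(y * pic_width + x)
            x0 += w
        y0 += h
    return order


if __name__ == "__main__":
    # A picture 4 CTBs wide and 2 CTBs high, split into two tile columns.
    print(tile_scan_order([2, 2], [2]))
```

For the 4 × 2 example, the left tile (addresses 0, 1, 4, 5) is fully scanned before the right tile (addresses 2, 3, 6, 7).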

[0055] For simplicity, HEVC specifies restrictions on the application of the four different picture partitioning schemes. For most of the profiles specified in HEVC, a coded video sequence may not include both tiles and wavefronts. For each slice and tile, one or both of the following conditions must be met. The first condition is that all coded treeblocks in a slice belong to the same tile. The second condition is that all coded treeblocks in a tile belong to the same slice. Finally, a wavefront segment contains exactly one CTB row. Furthermore, when WPP is in use, a slice that starts within a CTB row must end in the same CTB row.
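As a non-limiting illustration, the slice and tile conditions above can be checked with the following Python sketch. It applies one reading of the rule, namely that every slice and tile that share a CTB must be nested one inside the other. The function name and the per-CTB input lists are assumptions made for this example.

```python
from collections import defaultdict


def slices_and_tiles_compatible(slice_of_ctb, tile_of_ctb):
    """Check the slice/tile conditions for one picture.

    slice_of_ctb[i] and tile_of_ctb[i] give the slice and tile that CTB i
    belongs to. For every slice and tile that overlap, either the slice
    lies entirely in the tile or the tile lies entirely in the slice.
    """
    slice_ctbs = defaultdict(set)
    tile_ctbs = defaultdict(set)
    for ctb, (s, t) in enumerate(zip(slice_of_ctb, tile_of_ctb)):
        slice_ctbs[s].add(ctb)
        tile_ctbs[t].add(ctb)
    for s_set in slice_ctbs.values():
        for t_set in tile_ctbs.values():
            if s_set & t_set and not (s_set <= t_set or t_set <= s_set):
                return False
    return True


if __name__ == "__main__":
    # Two slices inside one tile: allowed.
    print(slices_and_tiles_compatible([0, 0, 1, 1], [0, 0, 0, 0]))
    # A slice that covers one tile and half of another: not allowed.
    print(slices_and_tiles_compatible([0, 0, 0, 1], [0, 0, 1, 1]))
```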

[0056] An example amendment to HEVC specifies three SEI messages related to motion-constrained tile sets (MCTSs). These are the temporal MCTSs SEI message, the MCTSs extraction information set SEI message, and the MCTSs extraction information nesting SEI message. The temporal MCTSs SEI message indicates the presence of MCTSs in the bitstream and signals the MCTSs. For each MCTS, motion vectors are restricted to point to full-sample locations inside the MCTS and to fractional-sample locations that require only full-sample locations inside the MCTS for interpolation. Furthermore, motion vector candidates for temporal motion vector prediction that are derived from blocks outside the MCTS are not allowed. This allows each MCTS to be decoded independently without reference to tiles outside the MCTS.
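As a non-limiting illustration of the motion vector restriction, the following Python sketch tests whether a motion vector keeps all reference samples inside an MCTS. It assumes quarter-sample motion vector units and an interpolation filter that reads three full samples before and four full samples after the block along any axis with a fractional component, as with an 8-tap luma filter. The function name, the tuple layouts, and these parameters are assumptions made for this example.

```python
def mv_allowed_in_mcts(block, mv, mcts, taps_before=3, taps_after=4):
    """Return True if the reference block stays inside the MCTS.

    block: (x, y, width, height) in luma samples
    mv:    (mvx, mvy) in quarter-sample units (assumed)
    mcts:  (x0, y0, x1, y1), with x1 and y1 exclusive, in luma samples
    """
    bx, by, bw, bh = block
    x0, y0, x1, y1 = mcts

    def span(pos, size, mv_q):
        first = pos + (mv_q >> 2)  # integer part, floored
        last = first + size - 1
        if mv_q & 3:  # fractional part: interpolation needs extra samples
            first -= taps_before
            last += taps_after
        return first, last

    first_x, last_x = span(bx, bw, mv[0])
    first_y, last_y = span(by, bh, mv[1])
    return x0 <= first_x and last_x < x1 and y0 <= first_y and last_y < y1


if __name__ == "__main__":
    mcts = (0, 0, 64, 64)
    print(mv_allowed_in_mcts((56, 16, 8, 8), (0, 0), mcts))  # full-sample, inside
    print(mv_allowed_in_mcts((56, 16, 8, 8), (2, 0), mcts))  # half-sample near edge
```

The second call is rejected in this sketch because half-sample interpolation would read full samples to the right of the MCTS boundary, even though the displaced block itself still lies inside it.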

[0057] The MCTSs extraction information set SEI message provides supplemental information that can be used in MCTS sub-bitstream extraction to generate a conforming bitstream for an MCTS set. MCTS sub-bitstream extraction is specified as part of the semantics of the MCTSs extraction information set SEI message. The information includes a number of extraction information sets. Each extraction information set defines a number of MCTS sets and includes the RBSP bytes of the replacement VPSs, SPSs, and PPSs to be used during the MCTS sub-bitstream extraction process. When a sub-bitstream is extracted according to the MCTS sub-bitstream extraction process, parameter sets such as VPSs, SPSs, and PPSs are rewritten or replaced. Furthermore, slice headers are updated because one or more slice address related syntax elements, such as first_slice_segment_in_pic_flag and slice_segment_address, should take different values.

[0058] Picture partitioning and subpictures in VVC are now discussed. In VVC, a picture can be divided into one or more tile rows and one or more tile columns. A tile is a sequence of CTUs covering a rectangular region of the picture. CTUs within a tile are scanned in raster scan order within that tile. A slice consists of an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile of the picture. VVC supports a raster scan slice mode and a rectangular slice mode. In raster scan slice mode, a slice contains a sequence of complete tiles in a tile raster scan of the picture. In rectangular slice mode, a slice contains either multiple complete tiles that together form a rectangular region of the picture, or multiple consecutive complete CTU rows of one tile that together form a rectangular region of the picture. Tiles within a rectangular slice are scanned in tile raster scan order within the rectangular region corresponding to that slice. A subpicture contains one or more slices that together cover a rectangular region of the picture.

[0059] Figure 1 is a schematic diagram illustrating the partitioning of an example picture into slices, subpictures, tiles, and CTUs. In schematic diagram 100, the picture is partitioned into 18 tiles, 24 slices, 24 subpictures, and 120 CTUs. The concept and function of subpictures are now discussed. In VVC, each subpicture comprises one or more complete rectangular slices that collectively cover a rectangular region of the picture, as shown in schematic diagram 100. A subpicture can be specified as extractable. An extractable subpicture is coded independently of other subpictures of the same picture and of earlier pictures in decoding order. A subpicture can also be specified as non-extractable, in which case it is not coded independently of other subpictures. Regardless of whether a subpicture is extractable, the encoder can control, individually for each subpicture, whether in-loop filtering is applied across the subpicture boundaries. In-loop filtering includes the application of deblocking filters, SAO filters, and/or ALF filters.

[0060] Subpictures are similar to MCTSs in HEVC. Both allow independent coding and extraction of a rectangular subset of a sequence of coded pictures, for use cases such as viewport-dependent 360° video streaming optimization and region of interest (ROI) applications.

[0061] In streaming of 360° video, also known as omnidirectional video, only a subset of the entire omnidirectional video sphere is presented to the user at any given moment. This subset is referred to as the current viewport. The user can change viewing direction at any time by turning their head, which changes the current viewport. At least a lower-quality representation of the area not covered by the current viewport can be made available to the client, so that the area outside the viewport is ready to be presented if the user suddenly changes viewing direction to anywhere on the sphere. A high-quality representation of the omnidirectional video is needed only for the current viewport presented to the user at any given moment. This optimization is achieved by splitting the high-quality representation of the entire omnidirectional video into subpictures at an appropriate granularity, as illustrated in schematic diagram 100. In this example, the twelve subpictures on the left are higher-resolution subpictures used for the current viewport and are therefore depicted as containing more CTUs. The remaining twelve subpictures on the right are lower-resolution subpictures used outside the current viewport of the omnidirectional video.

[0062] Figure 2 is a schematic diagram of an example subpicture-based viewport-dependent 360° video delivery scheme. The scheme includes encoded and stored video 200 and transmitted and decoded video 210. Encoded and stored video 200 includes the entire video at both high resolution and low resolution. Transmitted and decoded video 210 includes a portion of encoded and stored video 200. For example, transmitted and decoded video 210 may include the same low-resolution pictures as encoded and stored video 200. Furthermore, transmitted and decoded video 210 may include the high-resolution subpictures associated with the current viewport displayed to the user, and may exclude the high-resolution subpictures outside the current viewport.

[0063] Therefore, Figure 2 illustrates an example subpicture-based viewport-dependent 360° video delivery scheme that uses subpictures only for the higher-resolution representation of the video. The lower-resolution representation of the complete video does not use subpictures and can be coded with less frequent random access points (RAPs) than the higher-resolution representation. RAPs compress less efficiently than other pictures, so reducing the number of RAPs reduces the bitstream size. The client receives the complete video at the lower resolution. For the higher-resolution video, the client receives and decodes only the subpictures covering the current viewport.
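As a non-limiting illustration of the client side of such a scheme, the following Python sketch selects the high-resolution subpictures that overlap the current viewport. It assumes a uniform grid of equally sized subpictures and a rectangular viewport in picture coordinates, and it ignores wrap-around at the edges of the projected sphere. The function name and parameters are assumptions made for this example.

```python
def select_subpictures(grid_cols, grid_rows, subpic_w, subpic_h, viewport):
    """Return indices (raster order) of subpictures overlapping the viewport.

    viewport: (x, y, width, height) in luma samples of the high-resolution
    picture. Wrap-around across picture edges is not handled.
    """
    vx, vy, vw, vh = viewport
    selected = []
    for row in range(grid_rows):
        for col in range(grid_cols):
            x, y = col * subpic_w, row * subpic_h
            overlaps = (x < vx + vw and vx < x + subpic_w and
                        y < vy + vh and vy < y + subpic_h)
            if overlaps:
                selected.append(row * grid_cols + col)
    return selected


if __name__ == "__main__":
    # 4 x 3 grid of 100 x 100 subpictures, viewport of 200 x 100 samples.
    high_res = select_subpictures(4, 3, 100, 100, (150, 50, 200, 100))
    print("high-resolution subpictures to request:", high_res)
    print("plus the complete low-resolution video")
```

In this sketch the client would request only the listed high-resolution subpictures in addition to the complete lower-resolution video, so a sudden viewport change can still be rendered from the lower-resolution data until new subpictures arrive.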

[0064] The differences between subpictures and MCTSs are now discussed. There are several design differences between subpictures and MCTSs. First, the subpicture feature of VVC allows the motion vectors of a coding block to point outside the subpicture, even when the subpicture is extractable. This is achieved by applying sample padding at subpicture boundaries in a manner similar to the padding at picture boundaries. Second, additional changes are introduced for the selection and derivation of motion vectors in merge mode and in the decoder-side motion vector refinement process of VVC. This allows higher coding efficiency than the non-normative motion constraints applied at the encoder side for MCTSs. Third, when one or more extractable subpictures are extracted from a sequence of pictures to create a sub-bitstream that is a conforming bitstream, the existing SHs and PH NAL units do not need to be rewritten. In sub-bitstream extraction based on HEVC MCTSs, SHs may need to be rewritten. Note that in both HEVC MCTS extraction and VVC subpicture extraction, SPSs and PPSs may need to be rewritten. However, a bitstream typically contains only a few parameter sets, whereas each picture has at least one slice. Therefore, rewriting SHs can be a significant burden for application systems. Fourth, slices of different subpictures within a picture are allowed to have different NAL unit types. As discussed in more detail below, this is often referred to as mixed NAL unit types or mixed subpicture types within a picture. Fifth, VVC specifies HRD and level definitions for subpicture sequences. Therefore, the encoder can ensure the conformance of the sub-bitstream of each extractable subpicture sequence.

[0065] Mixed subpicture types within a picture are now discussed. In AVC and HEVC, all VCL NAL units in a picture may be required to have the same NAL unit type. VVC introduces the option to mix subpictures with certain different VCL NAL unit types within a picture. This provides support for random access not only at the picture level but also at the subpicture level. In VVC, the VCL NAL units within a subpicture may still be required to have the same NAL unit type.

[0066] The ability to perform random access from IRAP at the subpicture level is beneficial for 360° video applications. In viewport-dependent 360° video delivery schemes similar to the one shown in Figure 2, the content of spatially adjacent viewports largely overlaps. Therefore, during a viewport orientation change, only a small fraction of the subpictures within the viewport are replaced by new subpictures, and most subpictures remain within the viewport. A newly introduced subpicture sequence must begin with IRAP slices, but a significant reduction in the overall transmission bit rate can be achieved by allowing the remaining subpictures to continue using inter prediction across the viewport change.

[0067] An indication of whether a picture includes only a single type of NAL unit or more than one type can be provided in the PPS referenced by the picture, for example by using the PPS mixed NAL unit types in picture flag (pps_mixed_nalu_types_in_pic_flag). This allows a picture to simultaneously include subpictures containing IRAP slices and subpictures containing trailing slices. Other combinations of different NAL unit types are also permitted within a picture. For example, a picture with mixed NAL unit types can include leading picture slices of NAL unit types RASL and RADL, which allows subpicture sequences with open group of pictures (GOP) and closed GOP coding structures, extracted from different bitstreams, to be merged into a single bitstream.
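
The relationship between per-subpicture NAL unit types and the picture-level mixed-type indication can be sketched as follows. This is a minimal illustration, not part of the disclosure: the function name analyze_picture and the (subpicture ID, NAL unit type) tuple representation are assumptions made for the example, and the NAL unit type names are used only as labels.

```python
def analyze_picture(slices):
    """slices: list of (subpic_id, nal_unit_type) pairs for one picture.

    Returns (uniform_within_subpictures, mixed_in_picture):
    - uniform_within_subpictures: every subpicture uses a single NAL unit type
    - mixed_in_picture: the picture as a whole uses more than one type,
      which is the condition a mixed-NAL-unit-types flag would describe
    """
    types_per_subpic = {}
    for subpic_id, nal_type in slices:
        types_per_subpic.setdefault(subpic_id, set()).add(nal_type)

    uniform_within = all(len(t) == 1 for t in types_per_subpic.values())
    all_types = set()
    for t in types_per_subpic.values():
        all_types |= t
    return uniform_within, len(all_types) > 1


# One subpicture restarts with IRAP slices while the other keeps trailing slices.
viewport_switch = [(0, "IDR_W_RADL"), (1, "TRAIL_NUT"), (1, "TRAIL_NUT")]
steady_state = [(0, "TRAIL_NUT"), (1, "TRAIL_NUT")]
```

In the viewport_switch picture, each subpicture is internally uniform while the picture as a whole is mixed, which is the case the paragraph above describes.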

[0068] Subpicture layout and subpicture identifier (ID) signaling are now discussed. The subpicture layout in VVC is signaled in the SPS and is therefore constant throughout a coded layer video sequence (CLVS). Each subpicture is signaled by the position of its top-left CTU and by its width and height in units of CTUs. The signaling therefore ensures that each subpicture covers a rectangular region of the picture at CTU granularity. The order in which the subpictures are signaled in the SPS determines the index of each subpicture within the picture.
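
As a worked illustration of CTU-granular layout signaling, the following sketch converts a subpicture position and size given in CTUs into a rectangle in luma samples. The class and function names, the CTU size of 128, and the 1920x512 picture size are assumptions chosen for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubpicLayout:
    ctu_x: int        # column of the top-left CTU
    ctu_y: int        # row of the top-left CTU
    width_ctus: int   # subpicture width in CTUs
    height_ctus: int  # subpicture height in CTUs


def luma_rect(sp, ctu_size, pic_width, pic_height):
    """Return (x, y, width, height) in luma samples, clipped to the picture."""
    x0 = sp.ctu_x * ctu_size
    y0 = sp.ctu_y * ctu_size
    x1 = min(x0 + sp.width_ctus * ctu_size, pic_width)
    y1 = min(y0 + sp.height_ctus * ctu_size, pic_height)
    return (x0, y0, x1 - x0, y1 - y0)


# The list order plays the role of the signaling order, so the list
# position of each entry is its subpicture index.
layout = [SubpicLayout(0, 0, 8, 4), SubpicLayout(8, 0, 7, 4)]
rects = [luma_rect(sp, 128, 1920, 512) for sp in layout]
```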

[0069] Figure 3 is a schematic diagram 300 illustrating an example mechanism for extracting subpictures from a bitstream. To enable the extraction and merging of subpicture sequences without rewriting the SHs or PHs, the slice addressing scheme of VVC associates slices with subpictures based on subpicture IDs and subpicture-specific slice indices. The subpicture ID of the subpicture containing the slice and the subpicture-level slice index are signaled in the SH. The value of the subpicture ID of a specific subpicture can differ from the value of the corresponding subpicture index. The mapping between subpicture IDs and subpicture indices is either signaled in the SPS or the PPS, but not in both, or is implicitly inferred. When present, the subpicture ID mapping is rewritten or added when the SPS and PPS are rewritten during the subpicture sub-bitstream extraction process. Together, the subpicture ID and the subpicture-level slice index indicate to the decoder the exact location of the first decoded CTU of the slice within the DPB slot of the decoded picture. As shown in diagram 300, after sub-bitstream extraction the subpicture ID of a subpicture remains unchanged, while its subpicture index can change. Even when the raster scan CTU address of the first CTU in a slice of a subpicture has changed compared with its value in the original bitstream, the unchanged values of the subpicture ID and subpicture-level slice index in the corresponding SH still correctly indicate the position of each CTU in the decoded picture of the extracted sub-bitstream. Diagram 300 illustrates subpicture extraction using the subpicture ID, the subpicture index, and the subpicture-level slice index, in an example that includes two subpictures and four slices.
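
The behavior described above, where subpicture IDs survive extraction while subpicture indices are reassigned, can be sketched as follows. The ID values, the tuple representation of slice headers, and the function name are hypothetical and serve only to illustrate the idea.

```python
def extract_subpictures(subpic_ids, keep_ids):
    """Return the index -> ID mapping of the extracted sub-bitstream.

    subpic_ids: list where position i holds the ID of the subpicture with index i.
    keep_ids:   set of subpicture IDs to keep.
    """
    return [sid for sid in subpic_ids if sid in keep_ids]


# Original mapping: index 0 -> ID 10, index 1 -> ID 20, and so on (hypothetical IDs).
original_ids = [10, 20, 30, 40]
# Slice headers modeled as (subpicture ID, subpicture-level slice index).
slice_headers = [(10, 0), (20, 0), (20, 1), (30, 0), (40, 0)]

keep = {20, 40}
extracted_ids = extract_subpictures(original_ids, keep)
# Slice headers are copied unchanged; only the parameter-set mapping is rebuilt.
kept_headers = [sh for sh in slice_headers if sh[0] in keep]
new_index_of = {sid: idx for idx, sid in enumerate(extracted_ids)}
```

Subpicture 20 moves from index 1 to index 0, yet its slice headers still carry ID 20 and slice indices 0 and 1, which is why they need no rewriting in this model.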

[0070] Similar to subpicture extraction, the subpicture signaling allows several subpictures from different bitstreams to be merged into a single bitstream by rewriting only the SPS and PPS. This mechanism may require the generation of the different bitstreams to be coordinated, for example by using distinct subpicture IDs and otherwise largely aligned SPS, PPS, and PH parameters, such as CTU size, chroma format, and coding tools. Although subpictures and slices are signaled independently in the SPS and PPS, respectively, there are inherent mutual constraints between the subpicture and slice layouts in order to form a conforming bitstream. First, the presence of subpictures may require the use of rectangular slices and may prohibit raster scan slices. Second, the slices of a given subpicture should be consecutive NAL units in decoding order, which constrains the order of the coded slice NAL units in the bitstream according to the subpicture layout.

[0071] This section discusses some fundamentals of the VVC video file format. For example, it discusses track types that carry the VVC elementary stream. This document defines the following types of tracks that carry the VVC elementary stream. A VVC track represents the VVC elementary stream by including NAL units in the track's samples and / or sample entries. A VVC track can also be associated with other VVC tracks that include other layers and / or sublayers of the VVC elementary stream via the "vvcb" entity group, the "vopi" sample group, the "opeg" entity group, or a combination thereof. Furthermore, a VVC track can be associated with other VVC tracks by referencing a VVC subpicture track. When a VVC track references a VVC subpicture track, the VVC track is called a VVC reference track. A VVC reference track should not include VCL NAL units, nor should it be referenced by a VVC track via a "vvcN" track reference.

[0072] A VVC non-VCL track is a track that includes only non-VCL NAL units and is referenced by a VVC track via the "vvcN" track reference. A VVC non-VCL track may include APSs, which carry ALF, LMCS, or scaling list parameters, with or without other non-VCL NAL units. These parameters are therefore stored in, and transmitted through, a track that is separate from the track that includes the VCL NAL units. A VVC non-VCL track may also include picture header NAL units, with or without APS NAL units and/or other non-VCL NAL units. The picture header NAL units may therefore likewise be stored in, and transmitted through, a track that is separate from the track that includes the VCL NAL units.

[0073] A VVC subpicture track includes either a sequence of one or more VVC subpictures forming a rectangular region, or a sequence of one or more complete slices forming a rectangular region. A sample of a VVC subpicture track includes either one or more complete subpictures forming a rectangular region, or one or more complete slices forming a rectangular region.

[0074] VVC non-VCL tracks and VVC subpicture tracks enable the delivery of VVC video in streaming applications. Each of these tracks can be carried in its own DASH representation. To decode and render a subset of the tracks, the client can request, segment by segment, the DASH representations containing the subset of VVC subpicture tracks together with the DASH representation containing the non-VCL track. This avoids redundant transmission of APSs and other non-VCL NAL units, as well as transmission of unnecessary subpictures. Furthermore, when a VVC subpicture track includes one or more complete slices, but not all slices of a subpicture, all slices in the subpicture track belong to the same subpicture. In this case, any VVC reference track that references the subpicture track, for example via a "subp" track reference, also references the subpicture track(s) that include the remaining slices of the same subpicture.

[0075] Subpicture entity groups are now discussed. Subpicture entity groups are defined to provide level information indicating the conformance of a merged bitstream created from several VVC subpicture tracks. The VVC baseline track provides another mechanism for merging VVC subpicture tracks. The implicit reconstruction process based on subpicture entity groups may require modification of the parameter sets. Subpicture entity groups provide guidance that simplifies the generation of the parameter sets for the reconstructed bitstream. When the subpicture tracks within a group that are to be jointly decoded are interchangeable, the SubpicCommonGroupBox indicates the combination rules and the level_idc of the resulting combination when decoded jointly. Tracks are interchangeable when the player can select any set of num_active_tracks subpicture tracks from a group in which all tracks make the same level contribution. When coded subpictures with different properties (e.g., different resolutions) are selected for joint decoding, the SubpicMultipleGroupsBox indicates the combination rules and the level_idc of the resulting combination when decoded jointly. All entity_id values included in a subpicture entity group should identify VVC subpicture tracks. When present, SubpicCommonGroupBox and SubpicMultipleGroupsBox should be included in the GroupsListBox of the file-level MetaBox, and not in MetaBoxes at other levels.

[0076] The following is an example syntax of the subpicture common group box.

[0077]

[0078] Example semantics of the subpicture common group box are as follows. `level_idc` specifies the level to which any selection of `num_active_tracks` entities from the entity group conforms. `num_active_tracks` specifies the number of tracks for which the `level_idc` value is provided.
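
The semantics above can be illustrated with a small in-memory model. This is a sketch only: it models the two named fields and the entity list, does not reproduce the box syntax or field widths, and the class name, track IDs, and level_idc value are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class SubpicCommonGroup:
    entity_ids: list        # IDs of the interchangeable subpicture tracks
    level_idc: int          # level that any valid selection conforms to
    num_active_tracks: int  # number of tracks selected for joint decoding

    def valid_selections(self):
        """Every choice of num_active_tracks entities is covered by level_idc."""
        return list(combinations(self.entity_ids, self.num_active_tracks))


# Four interchangeable subpicture tracks, two decoded jointly (illustrative values).
common_group = SubpicCommonGroup(entity_ids=[1, 2, 3, 4], level_idc=83,
                                 num_active_tracks=2)
selections = common_group.valid_selections()
```

A player in this model can pick any of the six pairs without recomputing the level, since a single level_idc applies to all of them.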

[0079] The following is an example syntax of the subpicture multiple groups box.

[0080]

[0081] Example semantics of the subpicture multiple groups box are as follows. `level_idc` specifies the level to which the combination obtained by selecting any `num_active_tracks[i]` tracks from the subgroup with ID equal to `i`, for all values of `i` in the range 0 to `num_subgroup_ids-1` (inclusive), conforms. `num_subgroup_ids` specifies the number of separate subgroups, where each subgroup is identified by a common `track_subgroup_id[i]` value and different subgroups are identified by different `track_subgroup_id[i]` values. `track_subgroup_id[i]` specifies the subgroup ID of the `i`-th track in this entity group. Subgroup ID values should be in the range 0 to `num_subgroup_ids-1`, inclusive. `num_active_tracks[i]` specifies the number of tracks from the subgroup with ID equal to `i` that is reflected in `level_idc`.
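
A minimal sketch of how a player might check a candidate track selection against these semantics is shown below. The function name and the example values are hypothetical; only the field names track_subgroup_id and num_active_tracks come from the text.

```python
def selection_matches(track_subgroup_id, num_active_tracks, chosen):
    """Check that a selection takes exactly num_active_tracks[i] tracks
    from each subgroup i, so that the signaled level_idc applies to it.

    track_subgroup_id: subgroup ID of the i-th track in the entity group
    num_active_tracks: required count per subgroup ID
    chosen:            indices (into the entity group) of the selected tracks
    """
    counts = [0] * len(num_active_tracks)
    for track_index in chosen:
        counts[track_subgroup_id[track_index]] += 1
    return counts == list(num_active_tracks)


# Six tracks: three in subgroup 0 (e.g., higher resolution), three in subgroup 1.
track_subgroup_id = [0, 0, 0, 1, 1, 1]
num_active_tracks = [2, 1]  # two tracks from subgroup 0, one from subgroup 1
```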

[0082] The following are example technical problems solved by the disclosed technical solutions. Subpicture entity groups are suitable for cases where the related subpicture information remains constant throughout the entire duration of the tracks. However, this is not always the case. For example, for a specific subpicture sequence, different CVSs may have different levels. In such cases, sample groups should be used to carry essentially the same information, while allowing certain information to differ between different sample groups (e.g., between CVSs).

[0083] Figure 4 is a schematic diagram of a media file 400 stored in ISOBMFF. For example, media file 400 may include a coded bitstream and may be stored in ISOBMFF for transmission to a decoder. By organizing media file 400 in ISOBMFF, subsets of media file 400, such as sub-bitstreams of a specific resolution, screen size, frame rate, etc., can be selected and transmitted to the decoder. Furthermore, organizing media file 400 in ISOBMFF allows the decoder to locate the relevant portions of media file 400 for decoding and display. An ISOBMFF media file 400 is stored in multiple boxes that carry objects and/or data associated with the media content or media presentation. For example, media file 400 may include a file type box (e.g., ftyp) 430, a movie box (e.g., moov) 410, and a media data box (e.g., mdat) 420. Such boxes may further include other boxes in a nested manner to store all relevant data in the media file 400.

[0084] The file type box 430 can carry data describing the entire file, and therefore can carry file-level data. Thus, a file-level box is any box that includes data related to the entire media file 400. For example, the file type box 430 may include a file type indicating compatibility information and / or a version number of an ISO specification for the media file 400.

[0085] The movie box 410 may carry data describing the movie included in the media file, and therefore may carry movie-level data. A movie-level box is any box that includes data describing the entire movie included in the media file 400. The movie box 410 may include various sub-boxes carrying data for various purposes. For example, the movie box 410 includes track boxes 411, denoted as trak, which carry metadata describing the tracks of the media presentation. For example, a track box 411 may carry temporal and/or spatial information describing how the corresponding samples 421 are arranged in the video for display. It should be noted that data describing a track is track-level data, and therefore any box describing a track is a track-level box. A track box 411 may carry many different types of boxes specific to the track described by that track box 411. For example, the track box 411 may include a sample table box 412. The sample table box 412, denoted as stbl, is a box that includes the time and data indexing of the media samples 421 associated with the track. Among other items, the sample table box 412 may include sample group entries 413. A sample group entry 413 includes data describing the properties of a sample group. A sample group is any grouping of samples 421 associated with the corresponding track described by the track box 411. Thus, as described in the sample table box 412, a sample group may carry information for (e.g., describing) a set of pictures/subpictures, or may specify a set of pictures/subpictures.

[0086] The movie box 410 may also include a MetaBox 415, which is a structure for carrying untimed metadata. As shown, the MetaBox 415 can be included in the movie box 410, and can therefore be considered a movie-level box when the metadata relates to the entire movie. In some examples, a MetaBox 415 may also be included in a track box 411, and can therefore be considered a track-level box when the metadata relates to the corresponding track. The MetaBox 415 may include various boxes carrying metadata. For example, the MetaBox 415 may include an entity-to-group box 417, which is a box that includes metadata describing a corresponding entity group, such as the grouping type of the entity group. Thus, the entity-to-group box 417 can specify a subpicture entity group. An entity group is a grouping of items, such as the tracks in track boxes 411, that share specific characteristics and/or have a specific relationship.

[0087] The media data box 420 includes interleaved and time-ordered media data of the media presentation (e.g., coded video pictures and/or audio in one or more media tracks). For example, the media data box 420 may include a bitstream of video data coded according to VVC, AVC, HEVC, etc. The media data box 420 may include video pictures, audio, text, or other media data for display to a user. Such video pictures, audio, text, or other media data may be collectively referred to as samples 421.
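
The nested box organization described above can be illustrated with a short sketch that walks ISOBMFF boxes using the general box header layout (a 32-bit big-endian size followed by a four-character type, with size 1 indicating a 64-bit size and size 0 indicating a box that extends to the end of the data). The helper names and the tiny in-memory sample file are assumptions for illustration; the sample file is not a valid playable media file.

```python
import struct


def iter_boxes(data):
    """Yield (type, payload) for each box found back to back in data."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # 64-bit size follows the type field
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:  # box extends to the end of the data
            size = len(data) - pos
        if size < header or pos + size > len(data):
            raise ValueError("invalid box size")
        yield box_type.decode("ascii"), data[pos + header:pos + size]
        pos += size


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size field (helper for the example only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Toy file: ftyp, then moov containing one empty trak, then mdat.
sample_file = (
    make_box(b"ftyp", b"isom" + b"\x00" * 4)
    + make_box(b"moov", make_box(b"trak"))
    + make_box(b"mdat", b"\x00" * 4)
)
top_level = [box_type for box_type, _ in iter_boxes(sample_file)]
moov_payload = dict(iter_boxes(sample_file))["moov"]
nested = [box_type for box_type, _ in iter_boxes(moov_payload)]
```

Applying iter_boxes again to the payload of a container box, as done for moov here, is how the nesting of track-level boxes inside movie-level boxes is traversed in this sketch.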

[0088] As described above, this disclosure relates to the partitioning of a sequence of pictures into different spatial regions called subpictures. For example, the top region of each picture may be included in a first subpicture, while the bottom region of each picture may be included in a second subpicture. This allows different regions to be displayed independently. It also allows different regions to be coded differently, for example by applying different coding tools, constraints, etc. The subpictures are included in separate track boxes 411, creating subpicture tracks. Furthermore, pictures and subpictures are coded according to profiles, tiers, and levels that describe the coding constraints applied to them. A profile, tier, and level thus indicate that the corresponding video can be decoded by any decoder that includes hardware sufficient to decode at the indicated profile, tier, and level. For example, the profile may indicate the set of coding tools used to code the video, the tier may indicate the maximum bit rate of the video, and the level may indicate various additional constraints applied to the video, such as maximum sample rate, maximum luma picture size, maximum number of slice segments per picture, etc. In some systems, the level information of the subpicture tracks in track boxes 411 is signaled at the file level. However, this approach may not allow different subpicture tracks, and/or different points within the same subpicture track, to have different levels.

[0089] This document discloses mechanisms that address one or more of the aforementioned problems by employing a level indicator 431 for the subpicture tracks included in track boxes 411. For example, subpicture tracks in track boxes 411 can be grouped into entity groups in the entity-to-group box 417. Furthermore, samples in subpicture tracks can be grouped into sample groups via sample group entries 413. The level indicator 431 can then be used to indicate the level of a sample group of a track via sample group signaling, or to indicate the level of an entire track via entity group signaling. For example, the level indicator 431 can indicate the level information of a group of subpicture samples 421 that are organized into a track and therefore described in the corresponding track box 411. Furthermore, subpictures with different levels can be organized into different sample groups. The level indicator 431 can then be included in the sample group entry 413 describing the corresponding sample group. This allows different sample groups within the same track to have different levels, as indicated by the level indicator 431. In an example implementation, subpictures with the same level can be organized into a subpicture level information sample group described by a corresponding sample group entry 413. A level indicator 431 can then be included in the sample group entry 413 to describe the level information of that group of subpictures. In one example, a subpicture entity group type indicator 435 can be used to indicate the type of the subpicture entity group associated with the sample group. The subpicture entity group type indicator 435 can be signaled in the sample group entry 413 of the sample group. For example, the subpicture entity group type indicator 435 may include a one-bit flag. In another example, the subpicture entity group type indicator 435 may include a 24-bit field specifying the grouping type of the associated subpicture entity group, as indicated by the entity-to-group box 417. In yet another example, the subpicture entity group type indicator 435 may include a grouping type parameter associated with the sample group indicated in the sample group entry 413. For example, the sample table box 412 may include sample-to-group boxes (SampleToGroupBox), each of which describes subpicture level information for a sample group. The SampleToGroupBox may include a `grouping_type_parameter` that is used as the subpicture entity group type indicator 435. In one example, the `grouping_type_parameter` is set to "acgl" when the grouping type of the subpicture entity group indicates a subpicture common group, and to "amgl" when the grouping type indicates subpicture multiple groups. In another example, the `grouping_type_parameter` can include 32 bits, and one bit of the `grouping_type_parameter` can be used to signal a subpicture common group flag (`subpicture_common_group_flag`) that indicates whether the subpicture entity group is a subpicture common group or a subpicture multiple groups. In yet another example, the group identifier (ID) of the entity-to-group box 417 can also be signaled in the sample group entry 413 of the sample group to associate the sample group of samples 421 with the entity group.
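
The two alternatives for using a 32-bit grouping type parameter can be sketched as follows. This assumes the four-character code is stored as its four ASCII bytes in big-endian order, as is usual for four-character codes in ISOBMFF. The bit position chosen for the flag (COMMON_GROUP_FLAG_BIT) is purely hypothetical, since the text says only that one bit is used, and the function names are illustrative.

```python
def fourcc_to_u32(code):
    """Encode a four-character code such as 'acgl' as a 32-bit integer."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("four-character code expected")
    return int.from_bytes(raw, "big")


def u32_to_fourcc(value):
    """Decode a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("ascii")


# Alternative 1: the parameter carries the grouping type of the entity group.
param_common = fourcc_to_u32("acgl")
param_multiple = fourcc_to_u32("amgl")

# Alternative 2: a single bit of the 32-bit parameter carries the flag.
COMMON_GROUP_FLAG_BIT = 0  # hypothetical bit position, not given in the text


def pack_common_group_flag(is_common_group):
    return int(is_common_group) << COMMON_GROUP_FLAG_BIT


def unpack_common_group_flag(param):
    return bool((param >> COMMON_GROUP_FLAG_BIT) & 1)
```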

[0090] In another example, the set of subpicture tracks in the corresponding track boxes 411 is organized into a subpicture entity group, as indicated in the entity-to-group box 417. The level indicator 431 of the subpictures can be signaled either in the entity-to-group box 417 of the subpicture entity group or in the sample group entry 413 of a sample group. For example, when the level is static for all samples 421 in the set of subpicture tracks described by the track boxes 411, the level indicator 431 can be signaled in the entity-to-group box 417 of the subpicture entity group. In another example, when the level may not be static for all samples 421 in the set of subpicture tracks described by the track boxes 411, the level indicator 431 can be signaled in the sample group entry 413 of the sample group. In one example, an indication such as a flag can be signaled in the entity-to-group box 417 of the subpicture entity group. This flag can be used together with the level indicator 431 to indicate whether the level is static for all samples 421 in the set of subpicture tracks described by the track boxes 411.

[0091] In another example, the set of subpicture tracks described by the track boxes 411 can be organized into a subpicture entity group, as indicated in the entity-to-group box 417. When the level indicator 431 of the subpictures is signaled in a sample group, the track ID 433 of the track that includes the sample group can be signaled in the entity-to-group box 417 of the subpicture entity group, for example via a level information track ID field (level_info_track_id). For example, a rule could specify that the track with track ID 433 equal to level_info_track_id should, or must, be a track that includes picture header network abstraction layer (NAL) units, such as a VVC baseline track. In another example, a rule could specify that the track with track ID 433 equal to level_info_track_id should be a VVC baseline track.
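
The choice between entity-group signaling and sample-group signaling, together with the level_info_track_id indirection, can be sketched as a lookup. The data model below is an assumption made for illustration: the class names, the level_is_static attribute standing in for the flag, and the representation of sample groups as (first sample, last sample, level) runs do not reflect the actual box syntax.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubpicEntityGroup:
    level_is_static: bool                      # models the flag in the entity-to-group box
    level_idc: Optional[int] = None            # used when the level is static
    level_info_track_id: Optional[int] = None  # used when the level varies


@dataclass
class Track:
    track_id: int
    # Sample groups modeled as (first_sample, last_sample, level_idc) runs.
    level_sample_groups: list = field(default_factory=list)


def level_for_sample(group, tracks, sample_number):
    """Resolve the level that applies at a given sample number."""
    if group.level_is_static:
        return group.level_idc
    track = tracks[group.level_info_track_id]
    for first, last, level in track.level_sample_groups:
        if first <= sample_number <= last:
            return level
    raise LookupError("no level information for this sample")


static_group = SubpicEntityGroup(level_is_static=True, level_idc=83)
dynamic_group = SubpicEntityGroup(level_is_static=False, level_info_track_id=7)
tracks = {7: Track(7, [(1, 30, 83), (31, 60, 99)])}  # illustrative values
```

In this model the entity group never duplicates per-sample data; it either holds the level directly or points to the one track whose sample groups hold it.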

[0092] To address the aforementioned and other issues, methods as summarized below are disclosed. These items should be considered examples that explain the general concepts and should not be interpreted narrowly. Furthermore, these items can be applied individually or in any combination.

[0093] Example 1

[0094] In one example, certain information, such as the level indicator, of the set of subpicture tracks in a subpicture entity group is signaled using one or more sample groups.

[0095] Example 2

[0096] In one example, certain information, such as the level indicator, of the set of subpicture tracks in a subpicture entity group can be signaled either in the entity-to-group box of the subpicture entity group or in the sample group entry of a sample group.

[0097] Example 3

[0098] In one example, when certain information, such as the level indicator, is static for all samples in the set of subpicture tracks, that information is signaled in the subpicture entity group. In another example, when the information may not be static for all samples in the set of subpicture tracks, that information is signaled in a sample group. In yet another example, an indication, such as a flag, indicates whether certain information, such as the level indicator, is static for all samples in the set of subpicture tracks. This indication can be signaled in the entity-to-group box of the subpicture entity group.

[0099] Example 4

[0100] In one example, when certain information, such as the level indicator, of the set of subpicture tracks in a subpicture entity group is signaled in a sample group, the track ID of the track that includes the sample group is signaled in the entity-to-group box of the subpicture entity group, for example via the field level_info_track_id.

[0101] Example 5

[0102] In one example, a rule could specify that the track whose track identifier (track_ID) equals the level information track ID (level_info_track_id) should be a track that includes picture header NAL units, such as a VVC baseline track. In another example, a rule could specify that the track whose track_ID equals level_info_track_id must be a track that includes picture header NAL units, such as a VVC baseline track. In yet another example, a rule could specify that the track whose track_ID equals level_info_track_id must be a VVC baseline track.

[0103] Example 6

[0104] In one example, a sample group, such as a subpicture level information sample group, is specified to signal certain information, such as the level indicator, of the set of subpicture tracks in a subpicture entity group.

[0105] Example 7

[0106] In one example, information such as the level indicator is signaled in the sample group entry of the sample group. In another example, an indication of the type of the subpicture entity group associated with the sample group is signaled in the sample group entry of the sample group. In one example, a one-bit flag is used to signal this indication. In another example, a 24-bit field specifying the grouping type of the associated subpicture entity group is used to signal this indication. In another example, the grouping type parameter (grouping_type_parameter) of the sample group is used to signal the indication of the type of the subpicture entity group associated with the sample group. In one example, a rule could specify that all SampleToGroupBoxes of the subpicture level information sample group should include the grouping_type_parameter. The value of grouping_type_parameter is set to "acgl" to specify that the grouping type of the associated subpicture entity group is "acgl", and to "amgl" to specify that the grouping type of the associated subpicture entity group is "amgl".

[0107] In one example, a rule could specify that all SampleToGroupBoxes of the subpicture level information sample group should include the grouping_type_parameter. One of the 32 bits of the grouping_type_parameter can be used to signal the subpic_common_group_flag. In one example, the group_id of the EntityToGroupBox of the associated subpicture entity group is signaled in the sample group entry of the sample group.

[0108] The following are some example implementations of aspects summarized above, some of which can be applied to the standard specifications of the VVC video file format. Additions or modifications are indicated by underlined bold text, and deletions are indicated by bold italics.

[0109] The first embodiment of the aforementioned example is as follows.

[0110]

[0111] 11.5.1 Subpicture entity groups. 11.5.1.1 Overview. Subpicture entity groups are defined to provide level information indicating the conformance of a merged bitstream created from several VVC subpicture tracks. Note: The VVC baseline track provides an alternative mechanism for merging VVC subpicture tracks. The implicit reconstruction process based on subpicture entity groups requires modification of the parameter sets. Subpicture entity groups provide guidance that simplifies the generation of the parameter sets for the reconstructed bitstream. When the subpicture tracks within a group that are to be jointly decoded are interchangeable, i.e., the player can select any set of num_active_tracks subpicture tracks from a group in which all tracks make the same level contribution, the SubpicCommonGroupBox indicates the combination rules and the level_idc of the resulting combination when decoded jointly. When coded subpictures with different properties (e.g., different resolutions) are selected for joint decoding, the SubpicMultipleGroupsBox indicates the combination rules and the level_idc of the resulting combination when decoded jointly. All entity_id values included in a subpicture entity group should identify VVC subpicture tracks. When present, SubpicCommonGroupBox and SubpicMultipleGroupsBox should be included in the GroupsListBox of the file-level MetaBox, and not in MetaBoxes at other levels.

[0112] 11.5.1.2 Syntax of subpicture common group box.

[0113]

[0114] 11.5.1.3 Semantics of subpicture common group box. level_idc specifies the level to which any selection of num_active_tracks entities from the current entity group conforms. num_active_tracks specifies the number of tracks for which the level_idc value is provided.

[0115] 11.5.1.4 Syntax of subpicture multiple groups box.

[0116]

[0117] 11.5.1.5 Semantics of subpicture multiple groups box.

[0118] level_idc specifies, for all values of i in the range 0 to num_subgroup_ids - 1 (inclusive), the level to which the combination of any num_active_tracks[i] tracks in the subgroup with ID equal to i ... conforms. `num_subgroup_ids` specifies the number of separate subgroups, each identified by a common `track_subgroup_id[i]` value. Different subgroups are identified by different `track_subgroup_id[i]` values. `track_subgroup_id[i]` specifies the subgroup ID of the i-th track in this entity group. Subgroup ID values should be in the range 0 to `num_subgroup_ids-1`, inclusive. `num_active_tracks[i]` specifies ... the number of tracks in the subgroup with ID equal to i.

[0119] Figure 5 is a block diagram illustrating an example video processing system 500 in which the various techniques disclosed herein can be implemented. Various implementations may include some or all of the components of system 500. System 500 may include an input 502 for receiving video content. The video content may be received in a raw or uncompressed format, such as 8-bit or 10-bit multi-component pixel values, or in a compressed or encoded format. Input 502 may represent a network interface, a peripheral bus interface, or a storage interface. Examples of network interfaces include wired interfaces such as Ethernet and passive optical network (PON), and wireless interfaces such as Wi-Fi or cellular interfaces.

[0120] Video processing system 500 may include a coding component 504 that implements the various coding or encoding methods described in this document. The coding component 504 can reduce the average bit rate of the video between input 502 and the output of the coding component 504 to produce a coded representation of the video. Coding techniques are therefore sometimes referred to as video compression or video transcoding techniques. The output of the coding component 504 can be stored, or transmitted via a connected communication link, as represented by component 506. The stored or transmitted bitstream (or coded) representation of the video received at input 502 can be used by component 508 to generate pixel values or displayable video that is sent to display interface 510. The process of generating user-viewable video from the bitstream representation is sometimes referred to as video decompression. Furthermore, although certain video processing operations are referred to as "coding" operations or tools, it should be understood that the coding tools or operations are used at the encoder, and the corresponding decoding tools or operations that reverse the results of the coding are performed by the decoder.

[0121] Examples of peripheral bus interfaces or display interfaces include Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), and DisplayPort. Examples of storage interfaces include SATA (Serial Advanced Technology Attachment), PCI, IDE, etc. The techniques described herein can be implemented in a variety of electronic devices, such as mobile phones, laptops, smartphones, or other devices capable of performing digital data processing and/or video display.

[0122] Figure 6 is a block diagram of an example video processing apparatus 600. Apparatus 600 can be used to implement one or more methods described herein. Apparatus 600 can be implemented in smartphones, tablets, computers, Internet of Things (IoT) receivers, etc. Apparatus 600 may include one or more processors 602, one or more memories 604, and video processing circuitry 606. The processor(s) 602 can be configured to implement one or more methods described herein. The memory (or memories) 604 can be used to store data and code for implementing the methods and techniques described herein. The video processing circuitry 606 can be used to implement some of the techniques described herein in hardware circuitry. In some embodiments, the video processing circuitry 606 may be implemented at least partially within the processor 602, such as a graphics coprocessor.

[0123] Figure 7 is a flowchart of an example method 700 for video processing. Method 700 includes, in step 702, determining a level indicator for a subpicture set included in one or more subpicture tracks. The subpicture tracks are included in a subpicture entity group. The subpicture entity group is specified in an entity-to-group box. The subpicture set may be part of a sample group specified in a sample group entry. The sample group carries information about the subpicture set. When the level indicator is included in a sample group, an indication is signaled in the entity-to-group box to indicate the track that includes the sample group. In one example, this indication is the track ID signaled in the level information track ID (level_information_track_id) field. In one example, a rule specifies that a track including the sample group must include a Picture Header Network Abstraction Layer (NAL) unit. In one example, a rule specifies that a track including the sample group must be a Versatile Video Coding (VVC) reference track.

[0124] In step 704, based on the level indicator, a conversion between visual media data and a media data file is performed. When method 700 is executed on the encoder, the conversion includes generating a media data file from the visual media data. When method 700 is executed on the decoder, the conversion includes parsing and / or decoding the media data file to obtain visual media data.
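The determination in step 702 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the file-format syntax: the classes `SampleGroupEntry`, `Track`, and `SubpicEntityGroup`, the function `resolve_level`, and the field name `level_idc` are hypothetical, as is the option of carrying the level directly in the box. Only the idea that the entity-to-group box signals the ID of the track holding the sample group is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SampleGroupEntry:
    """Hypothetical sample group entry carrying sub-picture set information."""
    level_idc: int  # level indicator for the sub-picture set


@dataclass
class Track:
    track_id: int
    sample_group: Optional[SampleGroupEntry] = None


@dataclass
class SubpicEntityGroup:
    """Hypothetical model of an entity-to-group box for sub-picture tracks."""
    entity_ids: List[int]                      # sub-picture tracks in the group
    level_idc: Optional[int] = None            # assumption: level carried in the box
    level_info_track_id: Optional[int] = None  # track that holds the sample group


def resolve_level(group: SubpicEntityGroup, tracks: Dict[int, Track]) -> int:
    """Determine the level indicator for the sub-picture set (step 702)."""
    if group.level_idc is not None:
        return group.level_idc
    if group.level_info_track_id is None:
        raise ValueError("no level indicator and no track ID signaled")
    track = tracks[group.level_info_track_id]
    if track.sample_group is None:
        raise ValueError("signaled track carries no sample group")
    return track.sample_group.level_idc


# Tracks 1 and 2 are sub-picture tracks; track 3 holds the sample group.
# The value 83 is an arbitrary illustrative level indicator.
tracks = {
    1: Track(1),
    2: Track(2),
    3: Track(3, SampleGroupEntry(level_idc=83)),
}
group = SubpicEntityGroup(entity_ids=[1, 2], level_info_track_id=3)
level = resolve_level(group, tracks)
print(level)
```

A reader of the file in this sketch follows the signaled track ID directly instead of scanning every track for a sample group, which is the purpose of the indication described above.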

[0125] It should be noted that method 700 can be implemented in an apparatus for processing video data that includes a processor and a non-transitory memory with instructions thereon, such as a video encoder 900, a video decoder 1000, and / or an encoder 1100. In this case, the instructions, when executed by the processor, cause the processor to perform method 700. Furthermore, method 700 can be performed by means of a non-transitory computer-readable medium comprising a computer program product for use by a video encoding / decoding device. This computer program product includes computer-executable instructions stored on the non-transitory computer-readable medium such that, when executed by a processor, they cause the video encoding / decoding device to perform method 700.

[0126] Figure 8 is a block diagram illustrating an example video codec system 800 that can utilize the techniques disclosed herein. As shown in Figure 8, the video codec system 800 may include a source device 810 and a destination device 820. The source device 810 generates encoded video data and can be referred to as a video encoding device. The destination device 820 can decode the encoded video data generated by the source device 810 and can be referred to as a video decoding device.

[0127] Source device 810 may include video source 812, video encoder 814, and input / output (I / O) interface 816. Video source 812 may include sources such as video capture devices, interfaces for receiving video data from video content providers, and / or computer graphics systems for generating video data, or combinations of these sources. Video data may include one or more images. Video encoder 814 encodes the video data from video source 812 to generate a bitstream. The bitstream may include bit sequences that form a codec representation of the video data. The bitstream may include codec images and associated data. A codec image is a codec representation of an image. Associated data may include sequence parameter sets, image parameter sets, and other syntax structures. I / O interface 816 may include a modulator / demodulator (modem) and / or a transmitter. Encoded video data may be transmitted directly to destination device 820 via network 830 through I / O interface 816. Encoded video data may also be stored on storage medium / server 840 for access by destination device 820.

[0128] The destination device 820 may include an I / O interface 826, a video decoder 824, and a display device 822. The I / O interface 826 may include a receiver and / or a modem. The I / O interface 826 may acquire encoded video data from the source device 810 or the storage medium / server 840. The video decoder 824 can decode the encoded video data. The display device 822 can display the decoded video data to a user. The display device 822 may be integrated with the destination device 820, or it may be external to the destination device 820, in which case the destination device 820 is configured to interface with the external display device.

[0129] The video encoder 814 and video decoder 824 can operate according to video compression standards, such as the High Efficiency Video Coding (HEVC) standard, the Versatile Video Coding (VVC) standard, and other current and / or further standards.

[0130] Figure 9 is a block diagram illustrating an example of a video encoder 900, which may be the video encoder 814 in the video codec system 800 shown in Figure 8. The video encoder 900 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 9, the video encoder 900 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video encoder 900. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0131] The functional components of the video encoder 900 may include a segmentation unit 901, a prediction unit 902 (which may include a mode selection unit 903, a motion estimation unit 904, a motion compensation unit 905, and an intra-frame prediction unit 906), a residual generation unit 907, a transform unit 908, a quantization unit 909, an inverse quantization unit 910, an inverse transform unit 911, a reconstruction unit 912, a buffer 913, and an entropy coding unit 914.

[0132] In other examples, the video encoder 900 may include more, fewer, or different functional components. In one example, the prediction unit 902 may include an intra-block copy (IBC) unit. The IBC unit can perform prediction in IBC mode, where at least one reference picture is the picture containing the current video block.

[0133] Furthermore, some components, such as the motion estimation unit 904 and the motion compensation unit 905, can be highly integrated, but are shown separately in the example of Figure 9 for purposes of explanation.

[0134] The segmentation unit 901 can segment an image into one or more video blocks. The video encoder 900 and the video decoder 1000 can support various video block sizes.

[0135] The mode selection unit 903 can select one of the coding modes (intra-frame or inter-frame), e.g., based on error results, and provide the resulting intra-frame or inter-frame coded block to the residual generation unit 907 to generate residual block data, and to the reconstruction unit 912 to reconstruct the coded block for use as a reference picture. In some examples, the mode selection unit 903 can select a combination of intra-frame and inter-frame prediction (CIIP) mode, in which prediction is based on an inter-frame prediction signal and an intra-frame prediction signal. In the case of inter-frame prediction, the mode selection unit 903 can also select the precision of the motion vector (e.g., sub-pixel or integer pixel precision) for the block.

[0136] To perform inter-frame prediction on the current video block, motion estimation unit 904 can generate motion information for the current video block by comparing one or more reference frames from buffer 913 with the current video block. Motion compensation unit 905 can determine the predicted video block for the current video block based on motion information and decoded samples from images other than those associated with the current video block from buffer 913.

[0137] The motion estimation unit 904 and the motion compensation unit 905 can perform different operations on the current video block, for example, depending on whether the current video block is in an I slice, a P slice, or a B slice.

[0138] In some examples, motion estimation unit 904 may perform unidirectional prediction for the current video block, and may search reference images in list 0 or list 1 to find a reference video block for the current video block. Motion estimation unit 904 may then generate a reference index indicating a reference image in list 0 or list 1 that includes a reference video block, and a motion vector indicating a spatial shift between the current video block and the reference video block. Motion estimation unit 904 may output the reference index, prediction direction indicator, and motion vector as motion information for the current video block. Motion compensation unit 905 may generate a predicted video block for the current block based on the reference video block indicated by the motion information of the current video block.

[0139] In other examples, motion estimation unit 904 can perform bidirectional prediction for the current video block. Motion estimation unit 904 can search for a reference video block for the current video block in the reference images in list 0, and can also search for another reference video block for the current video block in the reference images in list 1. Motion estimation unit 904 can then generate reference indices indicating the reference images in lists 0 and 1, which include reference video blocks and motion vectors indicating spatial shifts between the reference video blocks and the current video block. Motion estimation unit 904 can output the reference index and motion vector of the current video block as motion information for the current video block. Motion compensation unit 905 can generate a predicted video block for the current video block based on the reference video blocks indicated by the motion information of the current video block.
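The search for a reference video block described in the two preceding paragraphs can be illustrated with a small block-matching sketch for the unidirectional case. The exhaustive search, the sum-of-absolute-differences (SAD) cost, and the names `sad` and `motion_search` are assumptions chosen for illustration; the description above does not prescribe a particular search strategy or cost measure.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # a frame as rows of sample values


def sad(cur: Frame, ref: Frame, bx: int, by: int,
        dx: int, dy: int, size: int) -> int:
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def motion_search(cur: Frame, ref: Frame, bx: int, by: int,
                  size: int, search_range: int) -> Tuple[Tuple[int, int], int]:
    """Exhaustive search returning the best (dx, dy) and its SAD cost."""
    height, width = len(ref), len(ref[0])
    best_mv: Tuple[int, int] = (0, 0)
    best_cost: Optional[int] = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    assert best_cost is not None  # the zero displacement is always valid
    return best_mv, best_cost


# Reference frame with distinct sample values; the current 2x2 block at (2, 2)
# is a copy of the reference block located at column 4, row 3.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[0] * 8 for _ in range(8)]
for y in range(2):
    for x in range(2):
        cur[2 + y][2 + x] = ref[3 + y][4 + x]

mv, cost = motion_search(cur, ref, bx=2, by=2, size=2, search_range=2)
print(mv, cost)
```

In this sketch the motion vector is expressed as the displacement from the current block to the matching reference block. For bidirectional prediction, the same search would be run once per reference list.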

[0140] In some examples, the motion estimation unit 904 may output a complete set of motion information for the decoder's decoding process. In other examples, the motion estimation unit 904 may not output a complete set of motion information for the current video block. Instead, the motion estimation unit 904 may signal the motion information of the current video block with reference to the motion information of another video block. For example, the motion estimation unit 904 may determine that the motion information of the current video block is sufficiently similar to the motion information of a neighboring video block.

[0141] In one example, the motion estimation unit 904 may indicate a value in the syntactic structure associated with the current video block that indicates to the video decoder 1000 that the current video block has the same motion information as another video block.

[0142] In another example, motion estimation unit 904 can identify another video block and motion vector difference (MVD) within the syntactic structure associated with the current video block. The motion vector difference indicates the difference between the motion vector of the current video block and the motion vector of the indicated video block. Video decoder 1000 can use the motion vector of the indicated video block and the motion vector difference to determine the motion vector of the current video block.
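The motion vector difference relationship above amounts to a subtraction at the encoder and an addition at the decoder. A minimal sketch follows; the function names `encode_mvd` and `decode_mv` and the sample values are hypothetical.

```python
from typing import Tuple

MotionVector = Tuple[int, int]  # (horizontal, vertical) components


def encode_mvd(mv: MotionVector, predictor: MotionVector) -> MotionVector:
    """Encoder side: difference between the current block's motion vector
    and the motion vector of the indicated video block."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def decode_mv(predictor: MotionVector, mvd: MotionVector) -> MotionVector:
    """Decoder side: recover the motion vector from the indicated block's
    motion vector and the signaled difference."""
    return (predictor[0] + mvd[0], predictor[1] + mvd[1])


neighbor_mv = (12, -3)  # motion vector of the indicated video block
current_mv = (14, -4)   # motion vector found for the current video block

mvd = encode_mvd(current_mv, neighbor_mv)
recovered = decode_mv(neighbor_mv, mvd)
print(mvd, recovered)
```

When neighboring blocks move similarly, the difference has small components, which is why signaling it can take fewer bits than signaling the full motion vector.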

[0143] As discussed above, the video encoder 900 can predictively signal motion vectors. Two examples of predictive signaling techniques that can be implemented by the video encoder 900 are advanced motion vector prediction (AMVP) and merge mode signaling.

[0144] Intra-prediction unit 906 can perform intra-prediction on the current video block. When intra-prediction unit 906 performs intra-prediction on the current video block, it can generate prediction data for the current video block based on decoded samples from other video blocks in the same frame. The prediction data for the current video block can include the predicted video block and various syntax elements.

[0145] The residual generation unit 907 can generate residual data for the current video block by subtracting the predicted video block from the current video block. The residual data for the current video block may include residual video blocks that correspond to different sample components of the samples in the current video block.

[0146] In other examples, residual data for the current video block may not exist, such as in skip mode, and residual generation unit 907 may not perform subtraction operations.

[0147] Transform unit 908 can generate one or more transform coefficient video blocks for the current video block by applying one or more transforms to the residual video block associated with the current video block.

[0148] After the transform unit 908 generates a transform coefficient video block associated with the current video block, the quantization unit 909 can quantize that transform coefficient video block based on one or more quantization parameter (QP) values associated with the current video block.

[0149] The inverse quantization unit 910 and the inverse transform unit 911 can apply inverse quantization and inverse transform to the transform coefficient video block, respectively, to reconstruct the residual video block from the transform coefficient video block. The reconstruction unit 912 can add the reconstructed residual video block to the corresponding samples of one or more predicted video blocks generated by the prediction unit 902 to generate a reconstructed video block associated with the current block and store it in the buffer 913.
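The residual, quantization, inverse quantization, and reconstruction steps described above can be traced with a small round trip. This sketch makes simplifying assumptions: the transform and inverse transform are omitted, and a uniform scalar quantizer with a fixed step size stands in for the QP-driven quantization. All function names are hypothetical.

```python
from typing import List

Block = List[List[int]]


def residual(cur: Block, pred: Block) -> Block:
    """Residual generation: current block minus predicted block."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(cur, pred)]


def quantize(block: Block, step: int) -> Block:
    """Uniform scalar quantization, rounding to the nearest level."""
    def q(v: int) -> int:
        sign = -1 if v < 0 else 1
        return sign * ((abs(v) + step // 2) // step)
    return [[q(v) for v in row] for row in block]


def dequantize(levels: Block, step: int) -> Block:
    """Inverse quantization: scale levels back by the step size."""
    return [[level * step for level in row] for row in levels]


def reconstruct(pred: Block, res: Block) -> Block:
    """Reconstruction: predicted block plus reconstructed residual."""
    return [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(pred, res)]


cur = [[52, 55], [61, 59]]
pred = [[50, 50], [60, 60]]
step = 4  # assumed fixed step size; not derived from a QP value here

res = residual(cur, pred)          # [[2, 5], [1, -1]]
levels = quantize(res, step)       # [[1, 1], [0, 0]]
recon = reconstruct(pred, dequantize(levels, step))
print(recon)                       # [[54, 54], [60, 60]]
```

The reconstructed block differs from the original by at most half the step size per sample in this sketch. The encoder stores the reconstructed block rather than the original so that its reference data matches what a decoder can rebuild.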

[0150] After the reconstruction unit 912 reconstructs the video block, a loop filtering operation can be performed to reduce blocking artifacts in the video block.

[0151] The entropy encoding unit 914 can receive data from other functional components of the video encoder 900. When the entropy encoding unit 914 receives data, it can perform one or more entropy encoding operations to generate entropy-encoded data and output a bit stream including the entropy-encoded data.

[0152] Figure 10 is a block diagram illustrating an example of a video decoder 1000, which may be the video decoder 824 in the system 800 shown in Figure 8. The video decoder 1000 can be configured to perform any or all of the techniques disclosed herein. In the example of Figure 10, the video decoder 1000 includes multiple functional components. The techniques described in this disclosure can be shared among the various components of the video decoder 1000. In some examples, a processor can be configured to perform any or all of the techniques described in this disclosure.

[0153] In the example of Figure 10, the video decoder 1000 includes an entropy decoding unit 1001, a motion compensation unit 1002, an intra-frame prediction unit 1003, an inverse quantization unit 1004, an inverse transform unit 1005, a reconstruction unit 1006, and a buffer 1007. In some examples, the video decoder 1000 may perform a decoding pass that is generally the reverse of the encoding pass described with respect to the video encoder 900 (e.g., Figure 9).

[0154] The entropy decoding unit 1001 can retrieve the encoded bitstream. The encoded bitstream may include entropy-coded video data (e.g., encoded video data blocks). The entropy decoding unit 1001 can decode the entropy-coded video data, and based on the entropy-coded video data, the motion compensation unit 1002 can determine motion information including motion vectors, motion vector precision, reference image list index, and other motion information. For example, the motion compensation unit 1002 can determine this information by executing AMVP and merge modes.

[0155] The motion compensation unit 1002 can generate motion compensation blocks, possibly performing interpolation based on an interpolation filter. The identifier of the interpolation filter used at sub-pixel precision can be included in the syntax element.

[0156] The motion compensation unit 1002 can use the interpolation filter used by the video encoder 900 during the encoding of the video block to calculate the interpolated values of the sub-integer pixels of the reference block. The motion compensation unit 1002 can determine the interpolation filter used by the video encoder 900 based on the received syntax information and use the interpolation filter to generate the prediction block.
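Interpolation at sub-integer sample positions can be illustrated in one dimension. The sketch below assumes a two-tap bilinear filter at half-sample precision purely for simplicity; the filters actually used by an encoder and identified in the syntax are typically longer and are not specified here. The function name `interpolate_half_pel` is hypothetical.

```python
from typing import List


def interpolate_half_pel(row: List[int], pos_half: int) -> int:
    """Return the sample at a position given in half-sample units.

    Even positions fall on integer samples and are returned as-is; odd
    positions are the rounded average of the two neighboring samples.
    """
    index, frac = divmod(pos_half, 2)
    if frac == 0:
        return row[index]
    return (row[index] + row[index + 1] + 1) // 2


reference_row = [10, 20, 40]

print(interpolate_half_pel(reference_row, 1))  # halfway between 10 and 20
print(interpolate_half_pel(reference_row, 2))  # integer position, sample 20
print(interpolate_half_pel(reference_row, 3))  # halfway between 20 and 40
```

The decoder has to apply the same filter as the encoder so that both sides derive identical prediction blocks, which is why the filter choice is conveyed in the syntax information.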

[0157] The motion compensation unit 1002 can use some of the syntax information to determine the sizes of the blocks used to encode the frames and / or slices of the encoded video sequence, partition information describing how each macroblock of a picture of the encoded video sequence is partitioned, modes indicating how each partition is encoded, one or more reference frames (and reference frame lists) for each inter-frame coded block, and other information for decoding the encoded video sequence.

[0158] Intra-prediction unit 1003 can use, for example, an intra-prediction mode received in the bitstream to form prediction blocks from spatially neighboring blocks. Inverse quantization unit 1004 inverse quantizes, i.e., de-quantizes, the quantized video block coefficients provided in the bitstream and decoded by entropy decoding unit 1001. Inverse transform unit 1005 applies an inverse transform.

[0159] The reconstruction unit 1006 can add the residual block to the corresponding prediction block generated by the motion compensation unit 1002 or the intra-frame prediction unit 1003 to form a decoded block. If necessary, a deblocking filter can also be applied to filter the decoded block to remove block artifacts. The decoded video block is then stored in a buffer 1007, which provides a reference block for subsequent motion compensation / intra-frame prediction and also generates decoded video for display device rendering.

[0160] Figure 11 is a schematic diagram of an example encoder 1100. Encoder 1100 is suitable for implementing VVC techniques. Encoder 1100 includes three in-loop filters: a deblocking filter (DF) 1102, a sample adaptive offset (SAO) 1104, and an adaptive loop filter (ALF) 1106. Unlike DF 1102, which uses predefined filters, SAO 1104 and ALF 1106 utilize the original samples of the current image to reduce the mean square error between the original and reconstructed samples, by adding an offset and by applying a finite impulse response (FIR) filter, respectively, with coded side information signaling the offsets and filter coefficients. ALF 1106 is located at the final processing stage of each image and can be regarded as a tool that attempts to capture and repair artifacts generated by the previous stages.

[0161] The encoder 1100 also includes an intra-frame prediction component 1108 and a motion estimation / compensation (ME / MC) component 1110 configured to receive input video. The intra-frame prediction component 1108 is configured to perform intra-frame prediction, while the ME / MC component 1110 is configured to perform inter-frame prediction using a reference image obtained from a reference image buffer 1112. Residual blocks from inter-frame or intra-frame prediction are fed to a transform (T) component 1114 and a quantization (Q) component 1116 to generate quantized residual transform coefficients, which are then fed to an entropy codec component 1118. The entropy codec component 1118 entropy codes the prediction results and quantized transform coefficients and sends them to a video decoder (not shown). The quantized components output from the quantization component 1116 can be fed to an inverse quantization (IQ) component 1120, an inverse transform component 1122, and a reconstruction (REC) component 1124. REC component 1124 can output images to DF 1102, SAO 1104, and ALF 1106 for filtering before these images are stored in reference image buffer 1112.

[0162] The following provides a list of preferred solutions as examples.

[0163] The following solutions illustrate examples of the techniques discussed in this article.

[0164] 1. A method for processing visual media (e.g., method 700 shown in Figure 7), comprising: performing a conversion between visual media information and a digital representation of the visual media information according to rules, wherein the digital representation includes a set of multiple sub-picture tracks logically grouped into one or more sub-picture entity groups, and wherein the rules specify that information associated with these sets is included in one or more sample groups in the digital representation.

[0165] 2. According to the method of Solution 1, the information includes a level indicator that indicates the encoding / decoding level of each sub-image in the set.

[0166] 3. The method according to any one of solutions 1-2, wherein information related to the set is included in the group box entry.

[0167] 4. The method according to any one of solutions 1-2, wherein information related to the set is included in the sample group entry.

[0168] 5. The method according to any one of solutions 1-4, wherein the rule specifies that the track identifier of the track containing the one or more sample groups is included in the entity-to-group box of the one or more sub-picture entity groups.

[0169] 6. According to the method described in Solution 5, the track identifier is included in the level information track identifier (level_info_track_id) field.

[0170] 7. The method according to any one of solutions 1-6, wherein one or more sample groups comprise a single sample group, wherein the single sample group is dedicated to indicating sub-picture level information.

[0171] 8. The method according to solution 7, wherein the information is a single bit flag.

[0172] 9. A media data processing method, comprising: obtaining a digital representation of visual media information, wherein the digital representation is generated according to the method of any one of solutions 1-8; and streaming the digital representation.

[0173] 10. A media data processing method, comprising: receiving a digital representation of visual media information, wherein the digital representation is generated according to the method of any one of solutions 1-8; and generating visual media information from the digital representation.

[0174] 11. The method according to any one of solutions 1-10, wherein the conversion includes generating a bitstream representation of visual media data and storing the bitstream representation to a file according to format rules.

[0175] 12. The method according to any one of solutions 1-10, wherein the conversion includes parsing the file according to format rules to recover visual media data.

[0176] 13. A video decoding apparatus, comprising a processor configured to implement one or more of the methods described in solutions 1-12.

[0177] 14. A video encoding apparatus, comprising a processor configured to implement one or more of the methods described in solutions 1-12.

[0178] 15. A computer program product having computer code stored thereon, which, when executed by a processor, causes the processor to implement the method described in any one of solutions 1-12.

[0179] 16. A computer-readable medium having recorded a bitstream representation thereon conforming to a file format generated according to any one of solutions 1-12.

[0180] 17. A method, apparatus or system described in this document.

[0181] In the solutions described in this document, the encoder conforms to the format rules by generating a coded representation according to those rules. In the solutions described in this document, the decoder can use the format rules to parse the syntax elements in the coded representation, with knowledge of the presence or absence of syntax elements according to the format rules, to generate the decoded video.

[0182] In this document, the term "video processing" can refer to video encoding, video decoding, video compression, or video decompression. For example, a video compression algorithm can be applied during the conversion from a pixel representation of a video to a corresponding bitstream representation, or vice versa. The bitstream representation of the current video block can, for example, correspond to bits that are either co-located or spread across different positions within the bitstream, as defined by the syntax. For example, a macroblock can be encoded in terms of transformed and coded error residual values, and also using bits in headers and other fields in the bitstream. Furthermore, during the conversion, the decoder can parse the bitstream with the knowledge that some fields may be present or absent, based on the determination described in the solutions above. Similarly, the encoder can determine whether certain syntax fields are to be included, and generate the coded representation accordingly by including or excluding those syntax fields.

[0183] The disclosed and other solutions, examples, embodiments, modules, and functional operations described in this document can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this document and their structural equivalents, or in combinations thereof. The disclosed and other embodiments can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing apparatus. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination thereof. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, or a combination thereof. A propagated signal is an artificially generated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver device.

[0184] Computer programs (also known as programs, software, software applications, scripts, or code) can be written in any programming language (including compiled or interpreted languages) and can be deployed in any form, including as standalone programs or as modules, components, subroutines, or other units suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or portions of code). Computer programs can be deployed to execute on a single computer or on multiple computers located at one site or distributed across multiple sites and interconnected via a communication network.

[0185] The processes and logic described in this document can be executed by one or more programmable processors to execute one or more computer programs, thereby performing functions by manipulating input data and generating output. The processes and logic can also be executed by dedicated logic circuits, and can be implemented as dedicated logic circuits, such as FPGAs (field-programmable gate arrays) or ASICs (application-specific integrated circuits).

[0186] For example, processors suitable for executing computer programs include general-purpose and special-purpose microprocessors, as well as any one or more processors in any kind of digital computer. Typically, the processor receives instructions and data from read-only memory or random access memory, or both. The basic components of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to one or more mass storage devices, or both. However, a computer does not necessarily have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and memory may be supplemented by or incorporated into special-purpose logic circuitry.

[0187] Although this patent document includes numerous details, these details should not be construed as limiting any invention or the scope of the claim, but rather as a description of features that may be specific to particular embodiments of a particular invention. Certain features described in this patent document in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented separately in multiple embodiments or in any suitable sub-combination. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed in this way, in some cases one or more features from the claimed combination may be removed from the combination, and the claimed combination may involve sub-combinations or variations of sub-combinations.

[0188] Similarly, although the operations are depicted in a specific order in the figures, this should not be construed as requiring such operations to be performed in the specific order or sequence shown, or requiring all illustrated operations to achieve the desired result. Furthermore, the separation of various system components in the embodiments described in this patent document should not be construed as requiring such separation in all embodiments.

[0189] Only a few implementation methods and examples have been described. Other implementations, improvements and modifications can be made based on the description and instructions in this patent document.

[0190] When there is no intermediate component other than a line, trace, or other medium between the first and second components, the first component is directly coupled to the second component. When there is an intermediate component other than a line, trace, or other medium between the first and second components, the first component is indirectly coupled to the second component. The term "coupling" and its variations include direct coupling and indirect coupling. Unless otherwise stated, the use of the term "approximately" means including a range of 10% of the subsequent value.

[0191] While several embodiments have been provided in this disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the spirit or scope of this disclosure. The present examples are intended to be illustrative rather than limiting and are not intended to be limited to the details given herein. For example, various elements or components may be combined or integrated into another system, or certain features may be omitted or not implemented.

[0192] Furthermore, without departing from the scope of this disclosure, the techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods. Other items shown or discussed as coupled may be directly connected, or may be indirectly coupled or in communication via some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Those skilled in the art can identify other examples of changes, substitutions, and modifications, and these can be made without departing from the spirit and scope of the disclosure herein.

Claims

1. A method for processing video data, comprising: determining a level indicator for a set of sub-images included in one or more sub-image tracks, wherein the sub-image tracks are included in a sub-image entity group, wherein, when the level indicator is included in a sample group, an indication is signaled in an entity-to-group box to indicate the track including the sample group, wherein the indication is a track identifier (ID) signaled in a level information track ID field, and wherein a rule specifies that the track including the sample group includes a picture header network abstraction layer (NAL) unit; and performing a conversion between visual media data and a media data file based on the level indicator.

2. The method of claim 1, wherein the rule specifies that the track including the sample group shall be a Versatile Video Coding (VVC) reference track.

3. The method of claim 1, wherein the entity-to-group box defines the sub-image entity group.

4. The method of claim 1, wherein the sample group carries information about the set of sub-images.

5. The method of any one of claims 1-4, wherein the conversion includes encoding the visual media data into the media data file.

6. The method of any one of claims 1-4, wherein the conversion includes decoding the visual media data from the media data file.

7. An apparatus for processing video data, comprising: a processor; and a non-transitory memory storing instructions, wherein the instructions, when executed by the processor, cause the processor to: determine a level indicator for a set of sub-images included in one or more sub-image tracks, wherein the sub-image tracks are included in a sub-image entity group, wherein, when the level indicator is included in a sample group, an indication is signaled in an entity-to-group box to indicate the track including the sample group, wherein the indication is a track identifier (ID) signaled in a level information track ID field, and wherein a rule specifies that the track including the sample group includes a picture header network abstraction layer (NAL) unit; and perform a conversion between visual media data and a media data file based on the level indicator.

8. The apparatus of claim 7, wherein the rule specifies that the track including the sample group shall be a Versatile Video Coding (VVC) reference track.

9. The apparatus of claim 7, wherein the entity-to-group box defines the sub-image entity group.

10. The apparatus of claim 7, wherein the sample group carries information about the set of sub-images.

11. A non-transitory computer-readable medium comprising a computer program product for use by a video codec apparatus, the computer program product including computer-executable instructions stored on the non-transitory computer-readable medium such that, when executed by a processor, the computer-executable instructions cause the video codec apparatus to: determine a level indicator for a set of sub-images included in one or more sub-image tracks, wherein the sub-image tracks are included in a sub-image entity group, wherein, when the level indicator is included in a sample group, an indication is signaled in an entity-to-group box to indicate the track including the sample group, wherein the indication is a track identifier (ID) signaled in a level information track ID field, and wherein a rule specifies that the track including the sample group includes a picture header network abstraction layer (NAL) unit; and perform a conversion between visual media data and a media data file based on the level indicator.

12. The non-transitory computer-readable medium of claim 11, wherein the rule specifies that the track including the sample group shall be a Versatile Video Coding (VVC) reference track.

13. The non-transitory computer-readable medium of claim 11, wherein the entity-to-group box defines the sub-image entity group, and the sample group carries information about the set of sub-images.
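
Illustrative sketch (not part of the claims): the lookup recited in claims 1, 7, and 11 can be pictured roughly as follows, in Python. This is a minimal sketch under stated assumptions, not the file-format syntax. The class names `Track` and `SubpicEntityToGroupBox`, the flag `level_in_sample_group`, the field `level_idc`, and the function `resolve_level` are hypothetical names introduced here for illustration; only the idea of a level information track ID field and the rule that the indicated track includes a picture header NAL unit come from the claims.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Track:
    """Hypothetical, simplified view of one track in the media data file."""
    track_id: int
    has_picture_header_nal: bool               # rule from claim 1
    sample_group_level: Optional[int] = None   # level carried in a sample group, if any


@dataclass
class SubpicEntityToGroupBox:
    """Hypothetical, simplified entity-to-group box for a sub-image entity group."""
    level_in_sample_group: bool                # assumed flag: level lives in a sample group
    level_idc: Optional[int] = None            # assumed field: level signaled in the box itself
    level_info_track_id: Optional[int] = None  # level information track ID field


def resolve_level(box: SubpicEntityToGroupBox, tracks: Dict[int, Track]) -> int:
    """Return the level indicator for the set of sub-images described by `box`."""
    if not box.level_in_sample_group:
        # Level is carried directly in the box (assumed alternative path).
        if box.level_idc is None:
            raise ValueError("box carries no level indicator")
        return box.level_idc

    # Level is in a sample group: the box names the track by its track ID.
    track = tracks.get(box.level_info_track_id)
    if track is None:
        raise ValueError("level information track ID does not match any track")
    # Enforce the rule: the indicated track includes a picture header NAL unit.
    if not track.has_picture_header_nal:
        raise ValueError("indicated track has no picture header NAL unit")
    if track.sample_group_level is None:
        raise ValueError("indicated track has no sample group carrying a level")
    return track.sample_group_level
```

In this sketch, a file reader would call `resolve_level` once per sub-image entity group before selecting tracks for decoding. Signaling a track ID directly, rather than an index into a list of entities, is what lets the lookup be a single dictionary access.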