Method and apparatus for applying adaptive neural network-based filtering
The adaptive filtering skip determination method addresses complexity issues in neural network-based video encoding by optimizing filtering decisions, enhancing encoding efficiency and quality, and improving applicability to mobile environments.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- UNIVERSITY INDUSTRY COOPERATION GROUP OF KYUNG HEE UNIVERSITY
- Filing Date
- 2025-12-09
- Publication Date
- 2026-06-25
Figure KR2025021060_25062026_PF_FP_ABST
Abstract
Description
Method and apparatus for applying adaptive neural network-based filtering
[0001] The present invention relates to video compression technology, specifically to a method and apparatus for applying adaptive neural network-based filtering, and to standard video coding, AI-based feature generation and transmission, video compression and transmission, and video codecs.
[0002] The present invention may correspond to a technical field identical to at least one of the digital video compression technology standards known by standard names such as MPEG-2, MPEG-4 Video, H.263, H.264 / AVC, H.265 / HEVC, H.266 / VVC, VC-1, AV1, QuickTime, VP-9, VP-10, and Motion JPEG, a technical field for improving the inherent efficiency of the standard, or a technical field for improving or replacing the standard.
[0003] Digital video encoding and decoding are widely utilized in various digital video applications. For example, the following all depend on video encoding and decoding technology: digital television broadcasting; video transmission via communication networks; video calls, video conversations, and video chats; recording and provision of video content using optical media such as VCDs (video compact discs), DVDs (digital versatile discs), and Blu-ray discs; all procedures for the production, editing, collection, and distribution of video content; and video recording for personal, commercial, industrial, and security purposes using devices such as video recording equipment and camcorders.
[0004] Accordingly, embodiments that can be referred to as digital video encoders and decoders may constitute a part of a wide range of devices related to the creation, recording, and provision of digital video, including digital television, digital broadcasting systems, wireless broadcasting systems, computers in the form of notebooks / desktops / tablets, e-book readers, digital cameras, digital recording devices, digital multimedia playback devices, video game devices / terminals / consoles, mobile phones with multimedia playback functions (including smartphones), equipment for video conferencing, and other devices.
[0005] Digital video encoders and decoders as described above can be implemented by digital video compression standards that are understood by and widely used by people skilled in the art. The digital video compression standards may include at least one of the compression standards known by standard names such as MPEG-2, MPEG-4 Video, H.263, H.264 / AVC, H.265 / HEVC, H.266 / VVC, VC-1, AV1, QuickTime, VP-9, VP-10, and Motion JPEG.
[0006] Video encoders and decoders can be implemented to encode or decode digital video information more efficiently while complying with the above specifications, or by improving or modifying them. Attempts to modify the above specifications may also lead to the development of new specifications. Among well-known examples is the so-called enhanced compression model (ECM), which is an attempt to improve and replace the conventional H.266 / VVC specifications, currently being developed by the Joint Video Experts Team (JVET), a joint international standardization group of ISO, IEC, and ITU-T.
[0007] In conventional standards, including H.266 / VVC, sufficient research has not been conducted on filtering methods utilizing neural networks. Although filtering methods using neural networks have recently been proposed, problems have arisen regarding implementation difficulties due to significantly increased complexity. In particular, the increased complexity of the decoder makes it unsuitable for mobile environments, creating a need for a solution to address this issue.
[0008] In addition, conventional neural network-based filtering techniques primarily relied on the encoder's decision to determine whether to skip filtering; however, there was a problem where encoding efficiency was reduced because the parameters used in this decision process were not optimized. In particular, there was a limitation in that the parameters used for rate-distortion optimization or the threshold values for deciding to skip filtering could not be adaptively adjusted to the characteristics of the slice, thus failing to provide consistent performance for various video content.
[0009] Furthermore, due to a lack of coordination between the hierarchical structure used in video encoding and neural network-based filtering, there was a problem where filtering decision information from lower layers was not effectively utilized for encoding in upper layers. Consequently, unnecessary filtering operations were performed in upper layers, or conversely, necessary filtering was omitted, which could lead to a degradation in overall encoding efficiency and restoration quality.
[0010] To solve the aforementioned technical problems, the present invention aims to provide an efficient filtering skip determination method that adaptively determines, based on a hierarchical structure, whether to apply neural network-based filtering at various units, and that optimizes and transmits the parameters required for filtering skip determination at the slice level.
[0011] A method for decoding an image using an information processing device according to an embodiment of the present invention for solving the aforementioned technical problem comprises the steps of: acquiring a bit sequence in which an image is encoded; acquiring a restored sample of a current picture from the bit sequence; determining whether to apply neural network-based filtering to the restored sample for at least one decoding unit; and selectively applying the neural network-based filtering based on the determination. The step of determining may be characterized by determining whether to apply the neural network-based filtering in a lower decoding unit only when it is determined to apply the neural network-based filtering in an upper decoding unit.
[0012] The above upper decoding unit may be characterized as a slice, and the above lower decoding unit may be characterized as a coding tree unit (CTU).
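The hierarchical gating described above, with a slice as the upper unit and CTUs as the lower units, can be illustrated with a minimal Python sketch. The function name, the flag representation, and the error handling are assumptions made for illustration only; the specification does not define a concrete syntax at this point.

```python
from typing import List, Optional


def decide_ctu_filtering(slice_flag: bool,
                         ctu_flags: Optional[List[bool]],
                         num_ctus: int) -> List[bool]:
    """Return one apply/skip decision per CTU of a slice.

    Hypothetical helper: CTU-level flags are only consulted when the
    slice-level (upper unit) decision is to apply the filtering.
    """
    if not slice_flag:
        # Upper unit says "skip": no lower-unit decision is made at all.
        return [False] * num_ctus
    if ctu_flags is None or len(ctu_flags) != num_ctus:
        raise ValueError("one CTU flag is required per CTU when the slice flag is set")
    return list(ctu_flags)
```

In this sketch a cleared slice flag means that no CTU-level information is needed at all, which is where the saving in signaling and decoder computation would come from.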
[0013] Whether or not to apply the neural network-based filtering in the above lower decoding unit may be determined in block units based on a depth specified by a first threshold value.
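As a rough illustration of how a depth value could define the block unit of the decision, the following Python sketch assumes a quadtree-style split in which each depth level halves the block side. This mapping, the function name, and the example sizes are assumptions; the specification only states that the depth is specified by the first threshold value.

```python
from typing import Tuple


def filter_block_grid(ctu_size: int, depth: int) -> Tuple[int, int]:
    """Return (block_size, blocks_per_ctu) for a given decision depth.

    Assumption: quadtree-style splitting, so the block side is
    ctu_size / 2**depth and a CTU holds 4**depth decision blocks.
    """
    if depth < 0 or (ctu_size >> depth) == 0:
        raise ValueError("depth is out of range for this CTU size")
    block_size = ctu_size >> depth
    blocks_per_ctu = (1 << depth) ** 2
    return block_size, blocks_per_ctu
```

Under these assumptions, a 128x128 CTU with depth 2 would be decided in sixteen 32x32 blocks.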
[0014] The above-mentioned determining step may be characterized by determining whether to apply the neural network-based filtering for each layer based on a temporal identifier.
[0015] The above-mentioned determining step may be characterized by determining to apply the neural network-based filtering only to layers having temporal identifiers below a second threshold.
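The temporal-identifier rule can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the sketch assumes that "below" means strictly less than the second threshold.

```python
def apply_nn_filter_for_layer(temporal_id: int, tid_threshold: int) -> bool:
    """Apply the filtering only to layers whose temporal ID is below the threshold.

    Assumption: "below" is interpreted as a strict comparison.
    """
    return temporal_id < tid_threshold


# Example: with a threshold of 3, layers 0 to 2 are filtered and layers 3+ are skipped.
decisions = [apply_nn_filter_for_layer(tid, 3) for tid in range(5)]
```

Pictures with higher temporal identifiers are typically referenced less often, so skipping them is one plausible way to cut decoder complexity with limited quality loss.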
[0016] The above-mentioned determining step may be characterized by determining to apply the neural network-based filtering in the upper layer when the application rate of the neural network-based filtering in the lower layer exceeds a third threshold.
[0017] The above-mentioned determining step may be characterized by determining not to apply the neural network-based filtering to the entire slice of the upper layer when the ratio of blocks to which the neural network-based filtering is not applied in the lower layer exceeds a fourth threshold.
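The two statistics-based rules above can be sketched together as follows. Combining them in a single function, the return labels, and the fallback when neither rule fires are illustrative assumptions; the specification presents them as separate characterizations.

```python
from typing import List


def upper_layer_decision(lower_flags: List[bool],
                         apply_rate_threshold: float,
                         skip_rate_threshold: float) -> str:
    """Derive an upper-layer decision from lower-layer block decisions.

    lower_flags holds one apply (True) / skip (False) decision per block
    of the lower layer. Returns "skip_slice", "apply", or "undecided".
    """
    total = len(lower_flags)
    if total == 0:
        return "undecided"  # no statistics available (assumed fallback)
    applied = sum(lower_flags)
    apply_rate = applied / total
    skip_rate = (total - applied) / total
    if skip_rate > skip_rate_threshold:      # fourth threshold
        return "skip_slice"
    if apply_rate > apply_rate_threshold:    # third threshold
        return "apply"
    return "undecided"
```

Because the encoder and decoder both know the lower-layer decisions, a rule of this kind could be evaluated identically on both sides.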
[0018] The above-mentioned determining step may be characterized by being performed based on whether the neural network-based filtering is applied in the reference block referenced by the current decoding unit.
[0019] The above-mentioned determining step may be characterized by generating an activation map based on whether the neural network-based filtering was applied to the reference block, and determining whether to apply the neural network-based filtering to the current decoding unit based on the activation map.
[0020] The above activation map may be characterized by being generated by utilizing at least one of a motion vector or a block vector.
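One way to picture the activation map is as the reference picture's per-block filtering flags, displaced by each current block's motion vector. The Python sketch below assumes integer-pel vectors, a single reference picture, a common block grid for both pictures, and clamping at picture borders; none of these details are stated in the specification, and all names are hypothetical.

```python
from typing import List, Tuple


def build_activation_map(ref_flags: List[List[bool]],
                         motion_vectors: List[List[Tuple[int, int]]],
                         block_size: int) -> List[List[bool]]:
    """Build an activation map for the current picture.

    ref_flags[y][x] says whether the filtering was applied to block (x, y)
    of the reference picture; motion_vectors[y][x] is the (dx, dy) vector,
    in pixels, of the current block (x, y).
    """
    rows, cols = len(ref_flags), len(ref_flags[0])
    activation = []
    for by in range(rows):
        row = []
        for bx in range(cols):
            dx, dy = motion_vectors[by][bx]
            # Top-left pixel of the current block, displaced by its vector.
            ref_x = bx * block_size + dx
            ref_y = by * block_size + dy
            # Map back to a block index, clamped to the picture area.
            rbx = min(max(ref_x // block_size, 0), cols - 1)
            rby = min(max(ref_y // block_size, 0), rows - 1)
            row.append(ref_flags[rby][rbx])
        activation.append(row)
    return activation
```

Since the decoder already has the vectors and the reference picture's filtering decisions, a map of this kind could be derived without additional signaling.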
[0021] The above-mentioned determining step may be characterized by determining whether to apply the neural network-based filtering to the entire current decoding unit based on whether the neural network-based filtering was applied to the reference block referenced by the pixel at a predetermined position of the current decoding unit.
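A minimal sketch of the single-position rule follows. It assumes that the predetermined position is the center pixel of the decoding unit and that the vector is given in integer pixels; both are illustrative choices, and the function name and argument layout are hypothetical.

```python
from typing import List, Tuple


def decide_from_anchor_pixel(ref_flags: List[List[bool]],
                             unit_x: int, unit_y: int, unit_size: int,
                             vector: Tuple[int, int],
                             block_size: int) -> bool:
    """Decide for the whole unit from one representative pixel.

    The center pixel of the unit (assumed position) is displaced by the
    vector, and the filtering flag of the reference block it lands in is
    inherited by the entire current unit.
    """
    rows, cols = len(ref_flags), len(ref_flags[0])
    px = unit_x + unit_size // 2 + vector[0]
    py = unit_y + unit_size // 2 + vector[1]
    rbx = min(max(px // block_size, 0), cols - 1)
    rby = min(max(py // block_size, 0), rows - 1)
    return ref_flags[rby][rbx]
```

Checking a single pixel instead of every pixel of the unit keeps the derivation cheap, at the cost of a coarser decision.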
[0022] The above-mentioned determining step may be characterized by being performed based on boundary strength information within a coding tree unit (CTU).
[0023] The above-mentioned determining step may be characterized by being performed by counting the number of pixels of a block distinguished by the boundary strength.
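The pixel-counting idea can be sketched as below. The specification does not state how the count is turned into a decision, so the per-pixel boundary-strength map, the minimum strength, the count threshold, and the direction of the comparison are all assumptions made for illustration.

```python
from typing import List


def decide_by_boundary_strength(bs_map: List[List[int]],
                                min_bs: int,
                                pixel_count_threshold: int) -> bool:
    """Decide whether to filter a CTU from its boundary-strength statistics.

    bs_map[y][x] is the boundary strength assigned to pixel (x, y) of the
    CTU (assumed representation). The filtering is applied when the number
    of pixels with strength >= min_bs exceeds the threshold.
    """
    strong_pixels = sum(1 for row in bs_map for bs in row if bs >= min_bs)
    return strong_pixels > pixel_count_threshold
```

Boundary strength is already computed for deblocking in conventional codecs, so a rule of this form would reuse information available to both encoder and decoder.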
[0024] At least one threshold value used in the above decision may be characterized as being included in the bit sequence and transmitted in slice units.
[0025] The above-mentioned determining step may be characterized by being performed implicitly based on partitioning information within a coding tree unit (CTU).
[0026] A video encoding method by an information processing device according to an embodiment of the present invention for solving the aforementioned technical problem comprises: a step of acquiring a current picture of a video to be encoded; a step of encoding the current picture to generate a reconstructed sample; a step of determining whether to apply neural network-based filtering to the reconstructed sample for at least one encoding unit; and a step of selectively applying the neural network-based filtering based on the determination. The step of determining may be characterized by determining whether to apply the neural network-based filtering in a lower encoding unit only when it is determined to apply the neural network-based filtering in an upper encoding unit.
[0027] The above method may be characterized by further including the step of including information regarding the above decision in a bit sequence.
[0028] The above-mentioned determining step may be characterized by being performed based on rate-distortion optimization.
[0029] The lambda value used in the above rate-distortion optimization may be characterized by being adaptively set on a slice basis.
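Rate-distortion optimization conventionally compares Lagrangian costs of the form J = D + λR, where D is the distortion, R is the rate in bits, and λ is the Lagrange multiplier. The Python sketch below shows how the choice between filtering and skipping could depend on a slice-level λ; the function names and the numbers in the example are hypothetical, and the way λ is derived per slice is not specified here.

```python
def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_filtering(d_on: float, r_on: float,
                     d_off: float, r_off: float,
                     lam: float) -> bool:
    """Return True when applying the filtering has the lower RD cost."""
    return rd_cost(d_on, r_on, lam) < rd_cost(d_off, r_off, lam)


# Same candidate measurements, two different slice-level lambda values.
filter_at_low_lambda = choose_filtering(100.0, 10.0, 120.0, 1.0, lam=1.0)   # 110 vs 121
filter_at_high_lambda = choose_filtering(100.0, 10.0, 120.0, 1.0, lam=5.0)  # 150 vs 125
```

With identical distortion and rate measurements, the decision flips as λ grows, which is why adapting λ to each slice can change which blocks are filtered.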
[0030] An image decoding device according to an embodiment of the present invention for solving the aforementioned technical problem comprises a processor configured to execute at least one instruction and a memory for storing said instruction, wherein the processor, upon executing said instruction, is configured to acquire a bit sequence in which an image is encoded, acquire a reconstructed sample of a current picture from said bit sequence, determine whether to apply neural network-based filtering to said reconstructed sample for at least one decoding unit, and selectively apply said neural network-based filtering based on said determination, and the processor may be characterized by determining whether to apply said neural network-based filtering in a lower decoding unit only when it is determined to apply said neural network-based filtering in an upper decoding unit.
[0031] According to the present invention, at least one of the following effects can be achieved in video encoding and decoding: improvement of encoding efficiency, improvement of decoding efficiency, improvement of video quality, reduction of computational load, reduction of software size, reduction of hardware size, and other improvements in performance related to encoding and decoding.
[0032] According to the present invention, by explicitly or implicitly determining whether to apply neural network-based filtering at various units such as sequences, GOPs, frames, pictures, slices, and CTUs, it is possible to effectively reduce the computational complexity of the decoder while maintaining the restored image quality. In particular, by skipping neural network-based filtering for areas with low filtering gain, unnecessary computations can be prevented, and the applicability to various decoding environments, including mobile environments, can be increased.
[0033] According to the present invention, by adaptively adjusting and transmitting the threshold value and rate-distortion optimization parameters required for filtering skip determination at the slice level, it is possible to make an optimized filtering skip determination for video content having various characteristics. Through this, improvements in encoding efficiency and restoration quality can be expected compared to the conventional method using fixed parameters.
[0034] According to the present invention, by efficiently determining whether to apply filtering to an upper layer using filtering decision statistics of a lower layer in a hierarchical structure, it is possible to reduce bit sequence overhead while maintaining encoding consistency between layers. In addition, by utilizing an activation map based on the filtering status of a reference block, the same filtering skip decision can be derived in the encoder and decoder without separate signaling, thereby improving transmission efficiency.
[0035] FIG. 1 is a conceptual diagram of a video communication system according to an embodiment of the present invention,
[0036] FIG. 2 is a conceptual diagram of the arrangement of an encoder and a decoder in a real-time video streaming environment according to an embodiment of the present invention.
[0037] FIG. 3 is a conceptual diagram of a functional unit of a video decoder according to an embodiment of the present invention,
[0038] FIG. 4 is a conceptual diagram of a functional unit of a video encoder according to an embodiment of the present invention,
[0039] FIG. 5 is a conceptual diagram of a frame type according to an embodiment of the present invention,
[0040] FIG. 6 is a conceptual diagram showing the structure of a video encoder according to the H.266 / VVC standard, and
[0041] FIG. 7 is a conceptual diagram of the structure of an upper layer and a lower layer according to one embodiment of the present invention.
[0042] The present invention is capable of various modifications and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the invention to specific embodiments, and it should be understood that the invention includes all modifications, equivalents, and substitutions that fall within the spirit and scope of the invention.
[0043] Terms such as "first," "second," etc., may be used to describe various components, but said components should not be limited by said terms. These terms are used solely for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be named the second component, and similarly, the second component may be named the first component. The term "and / or" includes a combination of multiple related described items or any one of the multiple related described items, and is non-exclusive unless otherwise indicated. When items are listed in this application, they are merely illustrative descriptions intended to facilitate the explanation of the spirit of the present invention and possible methods of implementation, and are therefore not intended to limit the scope of the embodiments of the present invention.
[0044] In this specification, "A or B" may mean "only A," "only B," or "both A and B." Alternatively, in this specification, "A or B" may be interpreted as "A and / or B." For example, in this specification, "A, B or C" may mean "only A," "only B," "only C," or "any combination of A, B and C."
[0045] A slash ( / ) or a comma used in this specification may mean "and / or." For example, "A / B" may mean "A and / or B." Accordingly, "A / B" may mean "only A," "only B," or "both A and B." For example, "A, B, C" may mean "A, B or C."
[0046] In this specification, "at least one of A and B" may mean "only A," "only B," or "both A and B." Additionally, in this specification, the expressions "at least one of A or B" or "at least one of A and / or B" may be interpreted as synonymous with "at least one of A and B."
[0047] Additionally, in this specification, "at least one of A, B and C" may mean "only A," "only B," "only C," or "any combination of A, B and C." Also, "at least one of A, B or C" or "at least one of A, B and / or C" may mean "at least one of A, B and C."
[0048] When it is stated that one component is "connected" or "coupled" to another component, it should be understood that while it may be directly connected or coupled to that other component, there may also be other components in between. On the other hand, when it is stated that one component is "directly connected" or "directly coupled" to another component, it should be understood that there are no other components in between.
[0049] The terms used in this application are used merely to describe specific embodiments and are not intended to limit the invention. The singular expression includes the plural expression unless the context clearly indicates otherwise. In this application, terms such as "comprising" or "having" are intended to specify the presence of the features, numbers, steps, actions, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.
[0050] Unless otherwise defined, all terms used herein, including technical or scientific terms, are used with the same meaning as generally understood by those skilled in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this application.
[0051] In describing the invention in this application, embodiments may be described or illustrated in terms of the described functions or unit blocks that perform the functions. The blocks may be expressed in this application as one or more devices, units, modules, parts, etc. The blocks may be implemented in hardware by a method of implementing one or more logic gates, integrated circuits, processors, controllers, memory, electronic components, or information processing hardware, which are not limited thereto. Alternatively, the blocks may be implemented in software by a method of implementing application software, operating system software, firmware, or information processing software, which are not limited thereto. A single block may be implemented by being separated into multiple blocks that perform the same function, or conversely, a single block may be implemented to perform the functions of multiple blocks simultaneously. The blocks may also be implemented by being physically separated or combined according to any criteria. The blocks may be implemented to operate in an environment where their physical locations are not specified and they are spaced apart from each other by a communication network, the Internet, a cloud service, or a communication method not limited thereto. Since all of the above-mentioned methods of implementation fall within the scope of various embodiments that a person skilled in the art familiar with the field of information and communication technology can adopt to realize the same technical concept, any detailed methods of implementation should be interpreted as being included within the scope of the technical concept of the invention in this application.
[0052] Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the attached drawings. In describing the present invention, to facilitate overall understanding, the same reference numerals are used for identical components in the drawings, and redundant descriptions of identical components are omitted. Furthermore, it is assumed that multiple embodiments are not mutually exclusive and that some embodiments may be combined with one or more other embodiments to form new embodiments.
[0053]
[0054] Digital video codecs
[0055] FIG. 1 is a conceptual diagram of a video communication system according to an embodiment of the present invention. The video communication system (100) may be configured to include at least two terminals (110, 120) connected to each other through a network (105).
[0056] In one embodiment of the present invention, FIG. 1 may represent a block diagram for configuring a unidirectional video communication network. A first terminal (110) among the terminals may encode the video data in order to transmit (111) the video data through a network (105). A second terminal (120) among the terminals may be configured to receive (121) the encoded video data through a network, decode it, and display it.
[0057] In another embodiment of the present invention, FIG. 1 may represent a block diagram for configuring a bidirectional video communication network. For the bidirectional video communication, each terminal (110, 120) may be configured to encode video data acquired by itself for transmission (112, 122) to the other terminal through the network. Each terminal may also be configured to receive (113, 123) video data transmitted through the network by the other terminal, decode it, and display the decoded video data.
[0058] Each terminal (110, 120) shown in FIG. 1 may be exemplified as a device such as a server computer, a personal computer, a portable computer, and a smartphone, depending on the embodiment, but is not limited thereto and may be any device corresponding to a commonly used computing device. For example, according to an embodiment of the present invention, each terminal (110, 120) may mean a desktop computer, a laptop computer, a tablet PC, a mobile phone, a smartphone, a PDA (personal digital assistant), a workstation, an electronic calculator, a server computer, a cloud computer, a virtualization computer, a quantum computer, or any other electronic, electrical, or quantum computing device implemented in a movable or immovable form. In particular, among such devices, it may be interpreted as any device that is designed to operate as a terminal device according to an embodiment of the present invention, is capable of operating as a terminal device according to an embodiment of the present invention, or is capable of installing and / or executing a computer program that enables it to operate as a terminal device according to an embodiment of the present invention or perform a method corresponding to such operation.
[0059] Each of the above terminals (110, 120) may be implemented by a plurality of functional units configured to exchange information within each of the above terminals (110, 120) by being interconnected in various forms, such as a bus, a circuit, or a relationship between a routine and a subroutine. Additionally, through the interconnection, for the purpose of executing or supporting the operation of a functional unit among the above functional units that primarily requires the performance of calculations, the terminals may be configured to include a processor (130) having a computational function and a memory (140) connected to the processor.
[0060] The processor (130) described in this specification may mean one or more general-purpose computers or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
[0061] Even if the processor (130) is expressed in the singular for ease of understanding, a person of ordinary knowledge in the relevant technical field will know that the processor (130) may include a plurality of processing elements and / or a plurality of types of processing elements. For example, a device according to one embodiment of the present invention may include a plurality of processors or one processor and one controller as the processor (130). In addition, the processor (130) may be implemented by various processing configurations, such as a parallel processor or a multi-core processor.
[0062] The processor (130) may be configured to execute an operating system (OS) and one or more software executed on the operating system. Additionally, the processor may access, store, manipulate, process, and generate data in response to the execution of the software. The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing unit to operate as desired or command the processing unit independently or collectively. The software may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave to be interpreted by the processor (130) or to provide instructions or data to the processor. The software may be distributed among a plurality of computer systems connected to the network (105), such as the terminals (110, 120), and may be stored or executed in a distributed manner.
[0063] The software may also be implemented in the form of program instructions that can be executed through various computer means and may be recorded or stored in the memory (140). The memory (140) may be a computer-readable recording medium, and program instructions, data files, data structures, etc. may be recorded in the computer-readable recording medium alone or in combination. The program instructions stored in the memory (140) may be based on a command system specifically designed and configured for an embodiment of the present invention, or may follow a command system known and available to those skilled in the art of computer software, such as Assembly, C, C++, Java, Python, etc. It should be understood that the command system and the program instructions thereunder include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by the device and / or the processor (130) according to an embodiment of the present invention using an interpreter, etc.
[0064] A computer-readable recording medium constituting an apparatus according to an embodiment of the present invention, including the memory (140) described in this specification, may include a temporary or volatile recording medium that is maintained only while the processor is operating, such as a processor cache, RAM, or flash memory; or may include a relatively non-volatile or long-term recording medium such as magnetic media such as a hard disk, floppy disk, and magnetic tape; an optical recording medium such as a CD-ROM or DVD; magneto-optical media such as a floptical disk; or a solid state memory; or may include a read-only recording medium such as a ROM placed on hardware. Furthermore, the hardware itself may be configured to perform a series of program instructions and equivalent operations by means of a hard-wired structure of circuit wiring. Since each step of the operation implementing the embodiment of the present invention can then be considered to be recorded in the connection and arrangement of the hardware components, it is obvious to a person skilled in the art that this method of connection and arrangement can be seen as equivalent to the memory (140).
[0065] The embodiments described above with respect to the processor (130) and the memory (140) are not mutually exclusive and may be selected or combined as needed. For example, a hardware device may be configured to operate as a module composed of one or more of the software to perform the operation of the embodiment of the present invention, and vice versa. As another example, in this specification, all or part of the operation assigned to a function may be implemented by one or more of the software stored in a device according to an embodiment of the present invention (preferably in a recording medium falling within the category of the memory) and configured to be executed by the processor, in which case such a function may be referred to as a function "included" in the processor.
[0066] The present invention is applicable to any environment for establishing a unidirectional or bidirectional video communication network, and the network (105) should be understood as being able to be established by any means for carrying encoded video data between the terminals (110, 120).
[0067] In one embodiment of the present invention, the network (105) may refer to a wired or wireless communication network. In this case, depending on the embodiment, the network may be configured to communicate information using any communication standard, and the communication standard may include packet-based communication. The packet communication may be understood to mean, for example, packets known as TCP or UDP. The wired communication method of the network (105) may be a method of connecting to an external communication network via an RJ-11 standard telephone line, an Ethernet cable belonging to various categories of the RJ-45 standard, other coaxial cables, metal cables, optical cables, and other various wired media. According to the embodiment, the wireless communication method of the above network (105) may include a short-range wireless communication method including Bluetooth, Wi-Fi, Zigbee, and NFC (near field communication), or a long-range wireless communication method that can be referred to by the name of a wireless communication technology commonly called by the names of communication standard generations such as Wibro, WiMax, Global Systems for Mobile Communications (GSM), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long-Term Evolution (LTE), New Radio (NR), and other 2G, 3G, 4G, 5G, 6G, and other international standard wireless communication specifications such as IMT-2000, IMT-Advanced, IMT-2020, and IMT-2030. 
Of course, even if any other conventional or newly developed wired or wireless communication means, method, standard, and protocol are applied to the implementation of the network (105), as long as they are means configured to perform transmission and reception in information communication devices such as terminals (110, 120), there is no impediment to achieving the purpose of the present invention. It is also obvious that a network (105) can be configured in combination with one or more wired and / or wireless standards.
[0068] However, in another embodiment of the present invention, the network (105) may be understood to mean a process of information transmission using a computer-readable recording medium. In this case, the configuration of the network is not limited to communication media only, but should be understood to include a process of temporarily storing and physically transporting information in computer-readable memory and / or recording media. The computer-readable recording medium used for information transmission may be understood to mean a recording medium that is relatively non-volatile or capable of long-term recording, such as magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, or solid-state memory, which is primarily used as a means of transporting data between computing devices.
[0069] Any other means of information communication or transport applied may be considered to fall within the scope of embodiments of the present invention as long as it is structured to support decoding by transmitting video data in an encoded state. Accordingly, in addition to the examples listed above, all means of information communication or transport known in the prior art or newly provided may fall within the scope of application of the present invention.
[0070] FIG. 2 is a conceptual diagram of the arrangement of encoders and decoders in a real-time video streaming environment according to an embodiment of the present invention. The streaming system (200) exemplified by FIG. 2 can be seen as applicable to a video data communication network including, for example, digital broadcasting, video telephone, and video conferencing. However, it should be seen that a technical structure identical or similar to the streaming system can be equally applied even when information transport via a recording medium is involved, as described above.
[0071] According to one embodiment of the present invention, the streaming system may include a video source (210) that generates a video stream. The video source may include a digital video acquisition means (212) that acquires uncompressed raw video, which may be composed of, for example, a digital camera or other equipment. The raw video stream (215) may have a massive capacity and may therefore be compressed by a video encoder (217) coupled to or connected to the video source.
[0072] The above encoder (217) may be composed of means including hardware, software, or a combination of both, configured to implement an image encoding method and / or a method of implementing the same according to an embodiment of the present invention.
[0073] By passing through the encoder (217), an encoded bitstream (219) with a reduced capacity compared to the original video stream can be output. The bitstream (219) can be provided in real-time via communication by a relay device, for example, which may be referred to as a streaming server (220), and / or can be stored in a recording medium (225) of the streaming server (220) for subsequent use.
[0074] The streaming system (200) may include at least one streaming client (230, 240) that connects to the streaming server (220) to receive the encoded bitstream (229) in real time or to acquire it subsequently. The streaming client may include a video decoder (232) that acquires the encoded bitstream (229) (which may also be considered a copy of the bitstream (219) received by the streaming server), decodes it, and outputs the resulting video data in a form that can be presented on a display (235) or other visual, auditory, or other sensory output means.
[0075] As mentioned above, the functions for encoding and decoding video data are collectively referred to as the coder-and-decoder system, or video codec.
[0076] FIG. 3 is a conceptual diagram of a functional unit of a video decoder according to an embodiment of the present invention. As shown in FIG. 3, a receiver (310) can receive at least one encoded video data to be decoded by a decoder (305). In an embodiment of the present invention, the encoded video data may be independent for each reception, and the decoding procedure of each independent video data may be independent from the decoding procedure of other video data. The encoded video data may be received by the receiver (310) through a hardware or software connection (315) to a device that stores the data, and as described above, the device that stores the data may be a type of streaming server located on the opposite side of a communication network or may refer to a physical recording medium, but is not limited thereto.
[0077] The receiver (310) can receive the encoded video data along with other accompanying data, such as encoded audio data or other auxiliary data, and each item of accompanying data can be separated from the video data and provided to an appropriate processing unit (312) other than the video decoder.
[0078] When the video data is received through a communication network, a buffer memory (320) may be combined between the receiver (310) and the decoder (305) to minimize delays and interruptions caused by the network environment. The buffer memory (320) may refer to a computer-readable recording medium that temporarily stores the received video data and reliably supplies it to a parser (330) corresponding to the input terminal of the decoder (305). However, the buffer memory may be unnecessary in environments where the bandwidth of the communication network is sufficient, where video data is being read from a recording medium located at a local location that is not physically separated, or where the possibility of communication delay is not predicted.
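By way of a non-limiting illustration, the role of the buffer memory (320) can be sketched as a bounded first-in, first-out queue placed between the receiver (310) and the parser (330). The class name ReceiveBuffer, its capacity parameter, and the overflow and underflow behavior are assumptions introduced only for this sketch and are not taken from the embodiment.

```python
from collections import deque


class ReceiveBuffer:
    """Hypothetical model of buffer memory (320): a bounded FIFO of encoded chunks."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of chunks held at once (assumed)
        self.chunks = deque()

    def push(self, chunk):
        """Called on the receiver side; returns False when the buffer is full."""
        if len(self.chunks) >= self.capacity:
            return False
        self.chunks.append(chunk)
        return True

    def pop(self):
        """Called on the parser side; returns None on underflow (nothing buffered)."""
        return self.chunks.popleft() if self.chunks else None
```

In such a model, a larger capacity absorbs longer network stalls at the cost of added start-up delay, which is consistent with the buffer being unnecessary when no communication delay is expected.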
[0079] The video decoder (305) may include the parser (330) at its input to interpret the encoded video data. The parser separates (parses) the pieces of information stored in the form of bit sequences in the encoded video data according to a predetermined rule and, if necessary, performs entropy decoding (335) on the entropy-coded video data, thereby reconstructing the symbols (338), which are units of video encoding information. The symbols (338) may include all information for controlling the operation of the decoder (305), and / or may further include information for controlling a device that can operate in conjunction with the decoder (305), such as a display device. Control information for controlling the above-mentioned display device may include information in a format referred to as supplemental enhancement information (SEI) or video usability information (VUI).
[0080] As described above, the parser (330) may be configured to perform entropy decoding (335) of the encoded video data. The method of entropy encoding of the encoded video data may vary depending on the standard of the encoding, and decoding may be performed accordingly. Representative examples of the entropy encoding standard may include variable length coding, Huffman coding, and arithmetic coding, and each of the encoding methods may be context-adaptive or context-sensitive depending on the standard, or may be based on principles widely known to a person skilled in the art.
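As a minimal sketch of one of the entropy coding methods named above, the following builds a Huffman code table from symbol frequencies and shows that decoding inverts encoding without loss. It is a generic textbook construction, not the entropy coder of any particular standard; the function names and the use of character strings for bits are assumptions made for readability.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code table {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)      # two least frequent subtrees
        w1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]


def encode(symbols, table):
    """Concatenate the code word of every symbol."""
    return "".join(table[s] for s in symbols)


def decode(bits, table):
    """Read bits until a complete code word is seen, then emit its symbol."""
    inverse = {c: s for s, c in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Frequent symbols receive shorter code words, so the encoded length falls below that of a fixed-length code whenever the symbol distribution is uneven.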
[0081] The parser (330) may be configured to extract at least one partial image from the encoded video data. The definition of the partial image may vary depending on the encoding standard, and depending on the standard, one or more of the examples listed below may apply simultaneously. The partial image may be defined in units such as, for example, groups of pictures (GOPs), pictures / frames, tiles, slices, macroblocks, blocks, subblocks, transform units (TUs), and prediction units (PUs).
[0082] The parser (330) may be configured to extract encoding information, such as transform coefficients, quantization parameters (QPs), and / or motion vectors, from the encoded video data. The parser (330) may be configured to perform entropy decoding (335) and parsing operations on the video data received from the buffer memory, and to selectively decode symbols (338) representing the encoding information. Additionally, the parser (330) may be configured to selectively supply specific symbols (338) to specific decoding function units within the decoder (305), such as inverse quantization and inverse transform units (340), intra prediction units (350), inter prediction units (355), or loop filter units (360). The control of such information supply may be determined by the information permutation contained in the encoded video and may vary depending on the encoding standard; as such, it is not limited to the scope of the embodiments of the present invention and is not described in detail in this conceptual diagram.
[0083] The above decoder (305) may be composed of a plurality of conceptual functional units that receive and process the encoding information from the parser (330). It is obvious that these conceptual functional units may be combined with one another or further subdivided according to implementation needs. For example, they may be further separated for ease of implementation, or integrated into one for operational efficiency. In any case, each functional unit may be configured to perform close interaction with one another. However, despite the possibility of such integration or separation, the decoding procedure of video data applied as an embodiment of the present invention will be described as a combination of conceptual functional units as described below.
[0084] The above decoder may include an inverse quantization and inverse transform unit (340). The inverse quantization and inverse transform unit (340) may be configured to receive encoding information from the parser (330), including the transform method to be used, a block size, quantization coefficients for recovering quantized information, and separation information of a quantization matrix representing the quantization coefficients in a simplified manner, and may be configured to output, as a result of processing the encoding information, block values (341) that can be input to a merging unit (370).
[0085] In one embodiment of the present invention, the output values of the inverse quantization and inverse transform unit (340) may include intra-frame prediction encoded block values. An intra-frame predicted block value may mean a value that can be decoded using prediction information within the partial image currently being decoded, such as the current frame, without using prediction information from a previously decoded partial image, such as a previous frame.
[0086] The prediction information within the current partial image may be provided by the intra-frame prediction unit (350). According to an embodiment of the present invention, the intra-frame prediction unit (350) generates, as prediction information, a block value of the same form as the block being decoded, using image information of a spatially adjacent, already decoded area of the partial image currently being decoded. The partial image information may be provided (381) from a current image buffer, the so-called line buffer (380). According to an embodiment, the merging unit (370) may be configured to merge the prediction information (351) generated by the intra-frame prediction unit (350) with the block values (341) provided by the inverse quantization and inverse transform unit (340).
[0087] In another embodiment, the output values of the inverse quantization and inverse transform unit (340) are inter-frame prediction encoded block values, and in some cases, may include block values for which motion compensation has been performed. In this case, the inter-frame prediction unit (355) may extract and use sample information (386) used for motion-based prediction from a reference image buffer (385). The information (356) derived by performing motion compensation on the sample information based on the symbols (338) relating to the block may be merged by the merging unit (370) with the block values (341) provided by the inverse quantization and inverse transform unit (340). In this case, the block values (341) may be referred to as so-called difference or residual values.
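The merging performed by the merging unit (370) can be illustrated as a sample-by-sample addition of the prediction and the residual block values (341). The sketch below is an assumption-laden simplification: the function name reconstruct_block is hypothetical, blocks are plain nested lists, and the clipping of the sum to the valid sample range is added here as a common practice rather than something stated in the embodiment.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add residual to prediction sample by sample and clip to the valid range.

    Clipping to [0, 2**bit_depth - 1] is an assumption of this sketch.
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]
```

The same addition applies whether the prediction comes from the current partial image or from a motion-compensated reference image; only the source of the prediction block differs.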
[0088] The memory position used by the inter-frame prediction unit (355) to extract the sample information from the reference image may be determined by a motion vector provided to the inter-frame prediction unit (355), which is composed of, for example, a combination of X and Y components and other symbols (338) identifying specific points of the reference image. The inter-frame prediction unit (355) may also include a function to interpolate the sample values when a motion vector with so-called sub-sample precision is provided, and may further include a function to predict and refine the value of the motion vector.
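The interpolation of sample values for a motion vector with fractional precision can be sketched as follows. This example deliberately substitutes bilinear interpolation for the longer separable interpolation filters that practical codecs typically use, and the function name fetch_sample, the quarter-sample default precision, and the edge-clamping rule are all assumptions of the sketch.

```python
def fetch_sample(ref, x, y, mv_x, mv_y, precision=4):
    """Fetch a predicted sample at (x, y) displaced by a motion vector expressed
    in 1/precision sample units, using bilinear interpolation (an assumption)."""
    height, width = len(ref), len(ref[0])
    fx = x * precision + mv_x                    # position in fractional units
    fy = y * precision + mv_y
    ix, iy = fx // precision, fy // precision    # integer part (floor)
    ax, ay = fx % precision, fy % precision      # fractional part

    def at(cx, cy):
        # Clamp coordinates so that references outside the picture repeat the edge.
        cx = min(max(cx, 0), width - 1)
        cy = min(max(cy, 0), height - 1)
        return ref[cy][cx]

    top = at(ix, iy) * (precision - ax) + at(ix + 1, iy) * ax
    bottom = at(ix, iy + 1) * (precision - ax) + at(ix + 1, iy + 1) * ax
    total = top * (precision - ay) + bottom * ay
    # Normalise by precision**2 with rounding to the nearest integer.
    return (total + precision * precision // 2) // (precision * precision)
```

With an integer-valued motion vector the fractional parts are zero and the function simply returns the displaced reference sample.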
[0089] The output values (371) of the merging unit (370) are provided to the loop filter unit (360) and can be processed by various loop filtering methods. The loop filter unit (360) may be configured to receive not only the block-unit output (371) of the merging unit (370) but also the symbols (338) provided by the parser (330) to control its operation. The output of the loop filter unit (360) may be output to an external display means, such as the display device, through an output connection (390), and may also be stored (361) in the line buffer (380) for use in prediction when decoding subsequent intra-frame or inter-frame prediction encoded block values, and from there may be stored in the reference image buffer (385).
[0090] Specific partial images, such as frames, can be utilized as reference images for performing predictive decoding during a subsequent decoding process once their decoding is completed. A single partial image, such as a frame, can be accumulated step by step in a line buffer (380) and decoding can proceed. When a frame is decoded, the contents of the line buffer (380) are transferred (383) to the reference image buffer (385), and a new line buffer (380) can be allocated for decoding a new frame.
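The interplay between the line buffer (380) and the reference image buffer (385) can be sketched as below. The class name PictureBuffers, the max_references parameter, and the rule of discarding the oldest reference when the buffer is full are assumptions of this sketch; the embodiment does not specify a buffer size or an eviction rule.

```python
class PictureBuffers:
    """Hypothetical model of line buffer (380) and reference image buffer (385)."""

    def __init__(self, max_references):
        self.max_references = max_references   # assumed capacity of buffer (385)
        self.current = []                      # rows of the picture being decoded
        self.references = []                   # completed pictures, oldest first

    def store_rows(self, rows):
        """Accumulate decoded rows of the current picture, as in step (361)."""
        self.current.extend(rows)

    def finish_picture(self):
        """Transfer the finished picture, as in step (383), and start a new line buffer."""
        self.references.append(self.current)
        if len(self.references) > self.max_references:
            self.references.pop(0)             # assumed rule: drop the oldest reference
        self.current = []
```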
[0091] The above video decoder (305) may be configured to perform a decoding operation according to a predetermined video compression technology that may be documented by various international standard specifications or commercial specifications. The specifications may include, for example, H.264, H.265, and H.266, which are international standard recommendations defined by the ITU Telecommunication Standardization Sector (ITU-T). A person skilled in the art will understand that each of these recommendations is equivalent to an international standard jointly defined by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The encoded video data may comply with the bitstream syntax defined by the applicable specification and, where required, with a profile and level specified in the corresponding video compression specification or standard document. In addition, the complexity of the encoded video data may be limited to a certain level for compliance with the profile and level. For example, a profile or level may limit the maximum image size, the maximum decoding rate, and the maximum reference image size. These limitations may also, in some embodiments, be further restricted through the hypothetical reference decoder (HRD) specification and through metadata for HRD buffer management signaled in the encoded video data.
[0092] According to one embodiment of the present invention, the receiver (310) may receive additional redundant data along with the encoded video. The additional data may be considered as part of the encoded video data. The additional data may include information that can be used by the decoder (305) to properly decode the data or to more accurately reconstruct an image that is close to the image before encoding. The additional data may be provided in the form of, for example, temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant images, and forward error correction codes.
[0093] FIG. 4 is a conceptual diagram of a functional unit of a video encoder according to an embodiment of the present invention. The encoder (405) may be configured to receive original video information (402) from a video source (401) and perform encoding.
[0094] The original video information (402) may have any suitable bit depth, for example, 8 bits, 10 bits, 12 bits, etc. Additionally, the original video information (402) may have any suitable color space, for example, R / G / B, Y / U / V, Y / Cb / Cr, etc. Additionally, the original video source may have any suitable sampling structure corresponding to the color space, for example, in the form of Y / Cb / Cr 4:2:0, Y / Cb / Cr 4:4:4. The original video source having such a predetermined format may be provided to the encoder in the form of a digital video stream.
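To make the effect of bit depth and sampling structure concrete, the following sketch computes the uncompressed size of a single Y / Cb / Cr frame. The function name raw_frame_bytes is hypothetical, the 4:2:2 entry is added for completeness although it is not listed above, and the sketch assumes that each sample occupies a whole number of bytes (so 10-bit and 12-bit samples take two bytes each).

```python
def raw_frame_bytes(width, height, chroma_format="4:2:0", bit_depth=8):
    """Return the uncompressed size of one Y/Cb/Cr frame in bytes."""
    # Number of chroma samples (Cb and Cr together) per luma sample.
    chroma_per_luma = {"4:4:4": 2.0, "4:2:2": 1.0, "4:2:0": 0.5}[chroma_format]
    samples = width * height * (1 + chroma_per_luma)
    bytes_per_sample = (bit_depth + 7) // 8    # assumes whole-byte sample storage
    return int(samples * bytes_per_sample)
```

For example, under these assumptions a 1920x1080 frame in 8-bit 4:2:0 occupies 3,110,400 bytes, while the same frame in 10-bit 4:4:4 occupies 12,441,600 bytes, which illustrates why the raw video stream requires compression.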
[0095] In a unidirectional video communication network, the original video information (402) may be obtained from a recording medium that stores a pre-prepared original video. In a bidirectional video communication network, the original video information (402) may be obtained from a video acquisition device, such as a camera, that generates at least one video transmission stream included in the bidirectional video communication.
[0096] The video data containing the original video information (402) may be composed of a plurality of partial images configured to simulate motion by playing them in chronological order. The partial images may be expressed as concepts such as, for example, a picture or a frame. The partial images may include one or more samples depending on the type of sampling structure, color space, etc. used. A person skilled in the art will understand that the term "sample" is closely related to "pixel" in digital images. The operation of the encoder will be explained below with reference to such samples.
[0097] According to one embodiment of the present invention, the encoder (405) may be configured to encode and compress (partial) images constituting the original video information (402) into the form of encoded video information in real time (or according to other temporal requirements as needed depending on the method of implementation).
[0098] In the above encoder (405), the control unit (450) may be a functional unit configured to control an appropriate encoding speed. The control unit (450) may be configured to control other functional units and to be functionally coupled to the functional units described below. The parameters set by the control unit (450) may include parameters related to bitrate control, such as image skip, quantizer, and variable values for applying image quality optimization techniques, and may also include values such as image size, the structure of a group of pictures (GOP), and the maximum search range of motion vectors. A person skilled in the art will be able to understand the various other functions that the control unit (450) may have, and such other functions may be added or removed according to the design of the video encoder optimized for the individual system design.
[0099] According to an embodiment of the present invention, the encoder (405) may be configured to operate in a structure such as a "coding loop" that is well known to a person skilled in the art. As a simplified example, the coding loop may consist of an internal encoder (so-called "source coder") (410) responsible for receiving an image to be encoded and generating symbols based on at least one reference image that has been previously encoded, and an internal decoder (so-called "local decoder") (420) connected to the internal encoder. The internal decoder (420) may be configured to receive the output of the internal encoder (410) and to reproduce the sample data that would be generated by a decoder (490) located at an actual remote location that receives the video information encoded by the encoder (405).
[0100] Video data composed of sample data reconstructed by the internal decoder (420) can be configured to be input into the reference image buffer of the encoder (405). As described above, since the internal decoder (420) is implemented to reproduce the result output by the encoder (405) to be decoded at a remote decoder, the video data stored in the reference image buffer can also be identical in bit unit to the information of the reference image buffer held by the remote decoder. That is, the prediction function unit that may be included in the encoder (405) can read values identical to the sample values of the previous frame that the decoder will refer to during the decoding process from the reference image buffer of the encoder (405).
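The matching of encoder-side and decoder-side references described above can be illustrated with a one-dimensional closed-loop predictive coder. This is a deliberately reduced sketch: it predicts each sample from the previous reconstructed sample rather than from a reference image, and the function names and the uniform quantizer are assumptions. The point it shows is that the encoder updates its reference from the quantized levels, exactly as the remote decoder will, so quantization error does not accumulate.

```python
def quantize(value, step):
    """Lossy step: map a residual to an integer level (round half away from zero)."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + step // 2) // step)


def encode_closed_loop(samples, step):
    """Predict each sample from the encoder's own reconstruction of the previous one."""
    levels, reference = [], 0
    for s in samples:
        level = quantize(s - reference, step)
        levels.append(level)
        # Internal decoder: update the reference from what the decoder will see.
        reference = reference + level * step
    return levels


def decode_levels(levels, step):
    """Remote decoder: rebuild samples from the transmitted levels only."""
    out, reference = [], 0
    for level in levels:
        reference = reference + level * step
        out.append(reference)
    return out
```

If the encoder instead predicted from the original samples, its reference would differ from the decoder's and the reconstruction error could grow from sample to sample.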
[0101] As described above, the principle of achieving a match between the encoder (405) and the decoder (490) by means of an internal decoder (420) on the side of the encoder (405) is widely known to a person skilled in the art, and the method of responding to an environment where such an environment is not guaranteed (e.g., loss of information due to communication failure) may also follow what is known to a person skilled in the art.
[0102] An example of the operation of the internal decoder (420) has been explained in detail above with reference to FIG. 3. The decoder of FIG. 3 can be considered as the decoder (490) at the "remote location" described above. The internal decoder (420) may be implemented without lossless coding sections such as the parser (330) or entropy decoding (335): because the internal decoder (420) is implemented simply to reproduce the operation of the decoder located at the remote location, it can decode the symbols immediately, without a process of compressing and restoring them. Therefore, the parser and entropy decoder shown in FIG. 3, and the functional parts preceding them, need not be provided, or may be implemented only partially.
[0103] As described above, according to a preferred embodiment of the present invention, any decoder function part present in the decoder (excluding the parser and entropy decoder) may naturally exist as substantially the same function part in the corresponding encoder (405).
[0104] The operation of the encoding function unit that may be included in the above encoder (405) can be considered as the inverse of the above decoder function unit. Therefore, the embodiment can generally be explained by performing the operation of the above decoder function unit in reverse. For example, a quantization and transform function unit corresponding to the inverse quantization and inverse transform unit may be provided, and an inter-frame prediction encoding unit corresponding to the inter-frame prediction unit may be provided. In addition to this, some additional explanations will be added.
[0105] The internal encoder (410) may be configured to encode input image information, e.g., an input frame, by a predictive encoding method executed by a prediction encoding unit (440), which references at least a portion of at least one image encoded earlier in time, e.g., video data designated as a reference frame, held in a reference image buffer (430). In this case, the encoder (405) may be configured to encode a difference between blocks of samples constituting the input image and blocks of samples constituting the reference image.
[0106] The internal decoder (420) can decode video data that can be designated as the reference image from the symbols generated by the internal encoder (410). As described above, since this decoding is identical to the decoding operation performed by the remote decoder, the video data used as the reference image may be provided to the encoder (405) in a form that has undergone lossy compression and carries some distortion; this is intended to keep the operation of the encoder matched to that of the decoder.
[0107] The prediction encoding unit (440) may be configured to perform a prediction search operation within the encoder (405). The prediction search operation may refer to an operation corresponding to the inter-frame prediction or intra-frame prediction described in the description of the decoder. For image information that is input and scheduled to be newly encoded, the prediction encoding unit (440) may access the reference image buffer (430) to obtain information indicating points of reference images that can function as prediction reference information suitable for the new image information, such as motion vectors, block shapes, metadata that may include the same, and the actual sample blocks to be referenced. The prediction encoding unit (440) may operate on a so-called "sample block-by-pixel block" basis to obtain appropriate prediction reference information. According to one embodiment of the present invention, at least one item of prediction reference information pointing to at least one reference image stored in the reference image buffer (430) may be designated for the input image, as determined based on the search results obtained by the prediction encoding unit (440).
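One simple way to realize such a prediction search is exhaustive block matching, sketched below. The embodiment does not prescribe a search strategy or a matching cost; the full search, the sum of absolute differences (SAD) cost, and the function names are assumptions chosen for clarity.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a current block and a reference block."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def motion_search(cur, ref, bx, by, size, search_range):
    """Full search: return ((mv_x, mv_y), cost) minimising SAD within the range."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates whose block would fall outside the reference image.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The search_range argument plays the role of the maximum search range of motion vectors that the control unit (450) may set; widening it raises the chance of finding a good match at a quadratic increase in the number of candidates.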
[0108] In one embodiment of the present invention, the control unit (450) may be configured to manage the overall encoding operation of the internal encoder (410), including the setting of parameters used to encode video data.
[0109] The outputs of all the aforementioned functional units may be subject to entropy coding (460) for final output. The entropy coding (460) may include various entropy coding techniques such as variable length coding, Huffman coding, and arithmetic coding as described above for the symbols generated by the various functional units, and each of the coding methods may be context-adaptive or context-sensitive methods depending on the standard, or may be based on principles widely known to a person skilled in the art. Such entropy coding (460) can typically achieve lossless compression and may be configured to convert at least one symbol generated by the functional units into encoded video data.
[0110] The control unit (450) may apply a type of encoding of a specific partial image during the encoding interval to each partial image, such as a picture or a frame, in controlling the operation of the encoder (405). Depending on the type, the method of encoding the partial image may be affected. Depending on the embodiment, the type may include a classification of "frame types" as follows.
[0111] FIG. 5 is a conceptual diagram of a frame type according to an embodiment of the present invention. The following description will be explained together with reference to FIG. 5.
[0112] An intra-frame prediction ("I") encoded image (510) may refer to an image that can be encoded and decoded using only its own information without referencing other image information within the video data through prediction encoding. The "I" image may be designated by names such as key frame, independent / instantaneous decoder refresh (IDR) frame, and clean random-access (CRA) frame according to the video encoding standard. The "I" image designated by such various names may have various modifications and application methods as permitted by each standard and may differ partially from one another. In addition to those listed above, various application methods for implementing the "I" image may be based on various methods that are already known to a person skilled in the art or may be newly provided.
[0113] The inter-frame prediction ("P") encoded image (520) may mean an image that can be encoded and decoded via intra-frame or inter-frame prediction based on at least one prediction information and / or motion vector that designates at least one reference image to predict sample values of blocks constituting the image. The "P" image may be configured to reference only one reference frame according to the video encoding standard, or configured to reference one or more reference frames. In the case of referencing one or more reference frames, sample information and / or associated metadata derived from multiple reference images may be used for the reconstruction of a single block. However, in common cases, the image designated as the "P" image may be understood as an image that performs a reference limited to the temporally preceding image.
[0114] A bidirectional prediction ("B") image (530) may mean an image that can be encoded and decoded through intra-frame or inter-frame prediction based on at least one prediction information and / or motion vector pointing to at least two reference images to predict sample values of blocks constituting the image. In common cases, the image designated as the "B" image is distinguished from the image designated as the "P" image in that its references are not limited to images that precede it in time.
[0115] Video data is spatially divided into multiple sample blocks during the encoding and decoding process, and encoding can proceed in block units. The block units include, for example, sizes such as 4x4, 8x8, 4x8, or 16x16 in units of horizontal / vertical pixels, as is widely known, but are not limited thereto. The blocks may be encoded by a predictive encoding method by referencing any other (already encoded) blocks, as allowed and / or restricted by the type specified for each partial image in which the blocks are included. For example, blocks of the "I" image (510) may not use a predictive encoding method, or may be encoded by referencing blocks that have already been encoded within the same partial image. That is, only the so-called intra-frame prediction method may be used. In contrast, for the "P" image (520), at least one reference image encoded in a previous time unit may be referenced, and thus, inter-frame prediction may also be used in encoding along with intra-frame prediction. In the case of the "B" image (530), reference can be performed even among reference images that are encoded earlier in the encoding order but follow later in terms of time unit. However, it is widely known that there may be blocks encoded within the "P" image or the "B" image that do not rely on predictive encoding.
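Because a "B" image may reference an image that follows it in display time, that later image must be encoded first. The sketch below derives one possible encoding order from a sequence of frame types given in display order. It assumes a simple, non-hierarchical arrangement in which every "B" image references the nearest following "I" or "P" image; the function name coding_order and this arrangement are assumptions, and actual standards permit more flexible reference structures.

```python
def coding_order(display_types):
    """Return display indices in an order where every B image follows the next
    I/P image (its later-in-time reference) of the display sequence."""
    order, pending_b = [], []
    for idx, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(idx)        # wait until the following I/P image is coded
        else:
            order.append(idx)            # the I or P image is coded first
            order.extend(pending_b)      # then the B images that precede it in time
            pending_b = []
    return order + pending_b             # trailing B images with no later I/P image
```

For the display sequence I, B, B, P, B, B, P this yields the encoding order 0, 3, 1, 2, 6, 4, 5.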
[0116] The above video encoder (405) may be configured to perform encoding operations according to a predetermined video compression technology that may be documented by various international standard specifications or commercial specifications. Examples of the specifications may include all those described in the decoder.
[0117] According to one embodiment of the present invention, a transmitting unit (470) may buffer the encoded video data generated by the entropy encoding in order to provide / transmit the video data (ultimately to a decoder (490) at a remote location) through a hardware or software connection (495) to a device storing the encoded video data. According to an embodiment, when providing / transmitting the encoded video data from the video encoder (405), the transmitting unit (470) may receive and merge other data accompanying the encoded video data, such as encoded audio data or other auxiliary data, from a separate source (480).
[0118] According to one embodiment of the present invention, the transmitting unit (470) may be configured to transmit additional data along with the encoded video. The additional data may be considered as part of the encoded video data. The additional data may include information that can be used by a decoder to properly decode the data, or to more accurately reconstruct an image that is close to the image before encoding. Examples of the additional data may include all the examples previously given in relation to the receiver (310) of the decoder.
[0119] The present invention can be implemented by digital video compression standards that are understood by and widely used by people skilled in the art, as described above. The digital video compression standards may include at least one of the compression standards known by standard names such as MPEG-2, MPEG-4 Video, H.263, H.264 / AVC, H.265 / HEVC, H.266 / VVC, VC-1, AV1, QuickTime, VP-9, VP-10, and Motion JPEG.
[0120] FIG. 6 is a conceptual diagram showing the structure of a video encoder according to the H.266 / VVC standard. What is shown in FIG. 6 corresponds to the approximate structure of a video encoder widely known by standard codes such as ITU-T H.266 and ISO / IEC 23090-3, and by the designation MPEG-I Part 3 or the common name versatile video coding (VVC).
[0121] According to FIG. 6, a video encoder (605) may be configured to receive uncompressed and unencoded original video data (601) as input and output an encoded bit sequence (602). The video data (601) may be supplied directly to a luminance signal mapping unit (610a) when intra-frame prediction encoding is performed, or supplied to a luminance signal mapping unit (610b) via an inter-frame prediction unit (620) that includes motion vector extraction. When intra-frame prediction encoding is performed, the mapped luminance signal may be supplied to an output merger (606) by selecting (608) at least one of the intra-frame prediction encoded signal via the intra-frame prediction unit (625) or the inter-frame prediction encoded signal output from the luminance signal mapping unit (610b) via the inter-frame prediction unit (620). The output of the output merger can be applied to a chroma scaling unit (615). (The operation of the luminance signal mapping unit (610) and the operation of the chroma scaling unit (615) are collectively referred to as the luma mapping / chroma scaling (LMCS) process.) The scaled chroma signal can be provided to a transform unit (630), and the transform unit (630) can perform an adaptive color transform, particularly on the chroma signal. The coefficients derived as a result of the transform are applied to a quantization unit (640) and quantized. This quantization constitutes lossy compression, and its result can be output as the bit sequence (602) after passing through a multi-hypothesis context-adaptive arithmetic coding unit (650), which performs lossless compression.
[0122] Meanwhile, the result of the lossy compression above can enter the decoding procedure by undergoing inverse quantization (645), inverse transform (635), and luminance signal expansion (617) processes to generate a coding loop. The result of the luminance signal expansion above can be supplied to an internal merger (607) along with the result of selecting (608) at least one of the previously generated intra-frame prediction coded signal or inter-frame prediction coded signal. The result of the internal merger above can undergo inverse luminance signal mapping (617) and then undergo processing such as a deblocking filter (660), a sample adaptive offset (SAO) (670), and an adaptive loop filter (ALF) to reproduce the image quality improvement process in the decoder. The result of reproducing the operation in the decoder as described above is applied to the reference image buffer (690) and can be recycled for prediction encoding by the inter-frame prediction unit (620).
[0123] The present invention may also be utilized by or combined with the enhanced compression model (ECM), which is an implementation of a next-generation video codec being developed by the Joint Video Experts Team (JVET), an international standardization expert organization, to improve H.266 / VVC. According to the JVET standardization progress document, document number ISO / IEC JTC 1 / SC 29 / WG 5 N 190 (also document number JVET AC2025-v1), the enhanced compression model may include an improved intra-frame predictive coding method, an improved inter-frame predictive coding method, an improved transform and transform coefficient coding method, an improved adaptive loop filtering method, a bilateral filtering method, a new sample adaptive offset (SAO) method for image quality improvement, an extended entropy coding method, and an improved gradual decoding refresh (GDR) technique, and such techniques may be included in the present invention as background technology for implementing the present invention.
[0124]
[0125] Existing neural network filtering skip method
[0126] Whether to skip neural network-based filtering can be determined on the encoder side. The neural network-based filtering can be referred to as a neural network-based loop filter (NNLF), and the filter can be applied in conjunction with or as a replacement for conventional deblocking filters, sample adaptive offset (SAO), adaptive loop filters (ALF), etc. Since the neural network-based loop filter significantly increases the computational complexity of the decoder, applying it uniformly to all areas may be inefficient. Therefore, it is necessary to reduce complexity by selectively applying neural network-based filtering only to areas where a benefit from filtering is expected, and skipping filtering in areas where it is not.
[0127] For example, it is checked for each CTU whether there is a substantial gain from applying filtering, and if the gain is small, neural network-based filtering may be skipped for the CTU and information about this may be transmitted to the decoder. In this case, the filtering gain is measured using the Euclidean distance and can be calculated from the filtered result, the reconstructed result, and the original CTU. More specifically, the gain can be determined by calculating and comparing two Euclidean distances: the difference between the sample values of the original CTU and those of the reconstructed (restored) CTU, and the difference between the sample values of the original CTU and those of the filtered CTU. That is, if the Euclidean distance after applying filtering is smaller than the Euclidean distance before applying filtering, it can be judged that filtering improves image quality, and the greater the reduction, the greater the filtering gain.
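The distance comparison described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names and the `min_gain` threshold are assumptions introduced here, and the sketch assumes that filtering is skipped when the measured gain falls below that threshold, in line with the stated goal of filtering only where a benefit is expected.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally sized lists of sample values."""
    if len(a) != len(b):
        raise ValueError("sample lists must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtering_gain(original, reconstructed, filtered):
    """Reduction in distance to the original achieved by filtering.

    A positive value means the filtered CTU is closer to the original
    than the unfiltered reconstruction.
    """
    d_before = euclidean_distance(original, reconstructed)
    d_after = euclidean_distance(original, filtered)
    return d_before - d_after


def skip_nn_filter(original, reconstructed, filtered, min_gain):
    """Hypothetical rule: skip filtering when the gain is below min_gain."""
    return filtering_gain(original, reconstructed, filtered) < min_gain


original = [100, 102, 98, 101]
reconstructed = [97, 105, 95, 104]
filtered = [99, 103, 97, 102]
gain = filtering_gain(original, reconstructed, filtered)
skip = skip_nn_filter(original, reconstructed, filtered, min_gain=1.0)
```

Because the original samples are only available at the encoder, a decision of this kind would have to be made there and conveyed to the decoder.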
[0128] According to one embodiment of the present invention, rate-distortion optimization (RDO) calculations may be utilized in the process of determining whether to apply neural network-based filtering. The Lagrangian multiplier, i.e., the lambda (λ) value used in the rate-distortion optimization, may be adaptively adjusted on a slice basis. For example, different lambda values may be set depending on the quantization parameter (QP) of the slice or the slice type (intra-frame slice, inter-frame slice, etc.), and the lambda value may be explicitly transmitted from the encoder to the decoder for each encoding / decoding unit such as a sequence, GOP, frame, picture, slice, etc., or a predefined fixed value may be used.
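As a rough sketch of how a rate-distortion comparison of this kind might look, the following uses the conventional Lagrangian cost J = D + λ·R. The function names and the example distortion, rate, and lambda values are assumptions chosen for illustration; the source does not specify how distortion or rate are measured.

```python
def rd_cost(distortion, rate_bits, lam):
    """Conventional Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_nn_filter(d_off, r_off, d_on, r_on, lam):
    """Return True when filtering 'on' has the strictly lower RD cost."""
    return rd_cost(d_on, r_on, lam) < rd_cost(d_off, r_off, lam)


# Large distortion reduction at equal rate: filtering is chosen.
decision_a = choose_nn_filter(d_off=1000, r_off=1, d_on=900, r_on=1, lam=10)
# Small distortion reduction that does not pay for one extra bit.
decision_b = choose_nn_filter(d_off=1000, r_off=1, d_on=995, r_on=2, lam=10)
```

A larger lambda weights the signaling cost more heavily, which is one reason a per-slice lambda could shift the decision between slice types or QP values.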
[0129] According to one embodiment of the present invention, the threshold value used for the filtering skip determination can be transmitted on a slice basis. Slices can be classified into intra-frame slices (I-slices), inter-frame prediction slices (P-slices), bidirectional prediction slices (B-slices), etc., and the characteristics of the reconstructed image and the filtering gain may differ depending on each slice type. Therefore, by transmitting different threshold values for each slice type through a slice header, it is possible to make a filtering skip determination that is adaptive to the characteristics of the slice.
[0130] As another example, it is checked for each CTU whether there is a substantial benefit to applying filtering, and if the benefit is small, neural network-based filtering may be skipped for the CTU and information about this may be transmitted to the decoder. In this case, the criterion for determining the presence or absence of the filtering benefit may be based on the number of partitions indicated by the boundary strength (BS). If the boundary strength is below a specific threshold, it may be determined that skipping the filtering is beneficial. In such a case, the threshold may be signaled to the decoder in an explicit or implicit manner, for example, as a bit sequence syntax signal included in a slice header.
[0131] In one embodiment of the present invention, the boundary strength is a value derived at block boundaries to determine the application strength of a deblocking filter, and can be determined based on the prediction modes, reference images, and motion vectors of the adjacent blocks and on the existence of transform coefficients. A map visualizing block boundaries within a CTU based on the boundary strength can be called a boundary strength map (BS Map), and the boundary strength map can be used as information indicating the partitioning state within the CTU.
[0132] The number of blocks or block boundaries within a CTU can be determined through the boundary strength map described above. Generally, the more partitions a CTU has, the higher the likelihood that the CTU contains complex textures or movements; for such areas, applying neural network-based filtering can contribute to image quality improvement. Conversely, CTUs with fewer partitions are more likely to be relatively flat areas, and for these areas, skipping neural network-based filtering may not result in significant image quality degradation. However, since judgment based on the number of partitions has the limitation of treating large and small partitioned blocks identically, a method of counting the number of pixels occupied by each block instead of the number of blocks can be utilized for a more precise judgment.
[0133] The threshold value used for the boundary strength-based filtering skip determination described above may preferably be transmitted at the slice level. This is because the distribution of boundary strengths varies depending on the slice type and quantization parameters, and encoding efficiency can be maximized by using a threshold value optimized at the slice level.
[0134]
[0135] Adaptive neural network-based filtering skip method: Explicit method
[0136] According to one embodiment of the present invention, a method for adaptively performing neural network-based filtering skips may be provided to reduce the complexity of the neural network and maximize coding gains during use. This method can eliminate coding inefficiencies arising from simple skip methods and provide optimized decoder complexity. According to one embodiment of the present invention, the adaptive neural network-based filtering skip may be performed by at least one of an explicit method and an implicit method. The explicit method may refer to a method of transmitting skip status information determined by the encoder to the decoder as a bit sequence syntax, and the implicit method may refer to a method in which the encoder and the decoder each derive whether to skip according to the same rule.
[0137] A syntax structure for adaptive filtering skip according to one embodiment of the present invention may be as follows. For example, an adaptive neural network-based filtering skip method may be set independently for encoding / decoding units such as sequence, GOP (group of pictures), frame, picture, slice, CTU, patch, CU, TU, etc. In this case, an adaptive neural network-based filtering skip may be performed for each unit.
[0138] As another example, adaptive neural network-based filtering skips can be performed on individual units based on conditions. That is, the method for adaptively skipping neural network-based filtering can be configured to be set on units such as sequences, GOPs, frames, pictures, slices, CTUs, patches, CUs, and TUs. When a skip-related signal is set to off in a higher unit such as an individual sequence, GOP, frame, picture, slice, CTU, patch, CU, or TU, the bit sequence syntax element determining the on / off status of the signal does not appear in the lower unit, and the syntax element appears in the lower unit only when it is set to on in the higher unit.
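The conditional presence of lower-level flags can be sketched as a simple parser. This is an assumed illustration: the level names, the one-bit-per-flag layout, and the function name `parse_skip_flags` are hypothetical and do not correspond to any syntax defined in the source.

```python
def parse_skip_flags(bits, levels=("sequence", "picture", "slice", "ctu")):
    """Read one on/off flag per level, from the highest unit downward.

    A flag is read from the bit sequence only while every higher level
    is 'on'. After the first 'off', lower-level flags are not present
    and are inferred to be 'off'.

    Returns the flag per level and the number of bits consumed.
    """
    flags = {}
    pos = 0
    enabled = True
    for level in levels:
        if enabled:
            flags[level] = bool(bits[pos])  # flag is present, read it
            pos += 1
            enabled = flags[level]
        else:
            flags[level] = False  # inferred, nothing is read
    return flags, pos


flags_a, used_a = parse_skip_flags([1, 0, 1, 1])
flags_b, used_b = parse_skip_flags([1, 1, 1, 0])
```

In the first call only two bits are consumed, because the picture-level flag is off and the slice-level and CTU-level flags are therefore absent.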
[0139] As another example, among units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, and TU, an adaptive neural network filtering skip method can be set up to the CTU and patch size stages, and in the unit partitioning structure, a block for the final adaptive neural network filtering skip method can be set based on depth at the final leaf node. At this time, the depth can be set based on a quadtree or a multi-type tree (MTT). For instance, if an arbitrary CTU size is 256x256, the adaptive neural network filtering skip method up to that CTU is "on," and depth 1 is set based on a quadtree, the block for the final adaptive neural network filtering skip method can be configured with a size of 128x128, and whether filtering is skipped at that unit can be transmitted to a decoder.
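The 256x256 example above can be reproduced with a short sketch, assuming a pure quadtree in which each depth level halves the block width and height. The function name and the tuple layout are assumptions for illustration; multi-type tree splits are not modeled.

```python
def quadtree_leaf_blocks(ctu_size, depth):
    """List (x, y, size) for the leaf blocks of a quadtree at a given depth."""
    leaf = ctu_size >> depth  # each quadtree level halves the side length
    if leaf == 0:
        raise ValueError("depth is too large for this CTU size")
    return [
        (x, y, leaf)
        for y in range(0, ctu_size, leaf)
        for x in range(0, ctu_size, leaf)
    ]


leaves = quadtree_leaf_blocks(256, 1)
```

With a CTU size of 256 and depth 1 this yields four 128x128 blocks, each of which would carry its own skip decision.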
[0140] At this time, information regarding which level to treat as a leaf, and whether to set it based on a quadtree or MTT, etc., can be determined explicitly or implicitly at stages such as sequence, GOP, frame, picture, slice, CTU, and patch. If explicitly determined, such information can be transmitted from the encoder to the decoder in units such as sequence, GOP, frame, picture, slice, CTU, and patch using bit sequence syntax.
[0141] According to one embodiment of the present invention, when an adaptive neural network-based filtering skip method is explicitly set at stages such as sequence, GOP, frame, picture, slice, CTU, and patch as described above, it can be configured to quickly determine whether to apply a higher unit based on information determined at a lower unit by utilizing a hierarchical structure. The hierarchical structure may refer to a hierarchical structure provided to support temporal scalability in encoding / decoding, and each unit in the hierarchical structure may refer to temporal scalability layers distinguished by a temporal identifier (TID). For example, in a bidirectional prediction structure, a layer with a temporal identifier of 0 (lower layer), a layer with a temporal identifier of 1, a layer with a temporal identifier of 2, and layers above that (upper layer) may be defined by a reference relationship according to the encoding order.
[0142] According to one embodiment of the present invention, whether to apply neural network-based filtering can be specified on a layer-by-layer basis according to the temporal identifier. Generally, lower layers (lower temporal identifiers) have lower quantization parameters (QP), so the benefit of applying filtering may be greater in terms of image quality. For example, a depth value can be signaled, and neural network-based filtering can be applied only to layers having temporal identifiers less than or equal to the depth value. For instance, when the depth value is 0, filtering is applied only to the layer with a temporal identifier of 0 (the lowest layer), and as the depth value increases, filtering can be applied to progressively more layers, starting from the lowest.
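The depth-based rule above reduces to a single comparison. The sketch below assumes four temporal layers (identifiers 0 to 3) purely for the example; the function name is hypothetical.

```python
def nn_filter_enabled_for_layer(temporal_id, signaled_depth):
    """Filtering is applied only to layers with TID <= the signaled depth."""
    return temporal_id <= signaled_depth


# Layers that are filtered for each signaled depth, assuming TIDs 0..3.
filtered_layers = {
    depth: [tid for tid in range(4) if nn_filter_enabled_for_layer(tid, depth)]
    for depth in range(4)
}
```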
[0143] According to one embodiment of the present invention, if the usage rate of the adaptive neural network-based filtering skip method is high in the lower layer, it is determined that the use of the method is also high in the upper layer, and thus the use of the adaptive neural network-based filtering skip method can be set to "on" in the slices of the upper layer. In this case, the high usage of the adaptive neural network-based filtering skip method in the lower layer may mean that a certain threshold ratio is exceeded among the total CTU (or specific unit) of the lower layer, or that a certain threshold ratio is exceeded among the total lower slices. According to another embodiment of the present invention, if the slices of the lower layer that are frequently referenced in the upper layer are mostly set to the adaptive neural network-based filtering skip method, the method can also be set to "on" in the slices of the upper layer. In such cases, the value of the threshold ratio can be transmitted from the encoder to the decoder in units such as sequences, GOPs, frames, pictures, and slices.
[0144] According to one embodiment of the present invention, after the on / off decision of neural network-based filtering in the lower layer is completed in block units, statistical information regarding the decision result may be derived. The statistical information may include the ratio or number of blocks to which neural network-based filtering is applied within the lower layer. When encoding a slice of an upper layer, if the ratio of blocks in the lower layer referenced by the slice of the upper layer to which neural network-based filtering is set to "off" exceeds a predetermined threshold, neural network-based filtering may be forcibly set to "off" for the entire slice of the upper layer. In this case, filtering may be skipped collectively at the slice level without the need to separately provide block-unit on / off signals for the slice of the upper layer. Whether the forced setting is applied may be explicitly signaled by the encoder, or may be implicitly derived by the encoder and decoder based on the same statistical information.
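The statistics-based forced setting described above might be sketched as follows. The representation of lower-layer decisions as a list of booleans, the function name, and the example threshold of 0.5 are assumptions; the source only states that a predetermined threshold on the "off" ratio is used.

```python
def force_slice_off(ref_block_filter_on, off_ratio_threshold):
    """Decide whether to force filtering 'off' for a whole upper-layer slice.

    ref_block_filter_on holds one boolean per lower-layer block referenced
    by the slice: True if neural network-based filtering was applied to
    that block, False if it was skipped.
    """
    if not ref_block_filter_on:
        return False  # no reference statistics, so nothing is forced
    off_count = sum(1 for on in ref_block_filter_on if not on)
    off_ratio = off_count / len(ref_block_filter_on)
    return off_ratio > off_ratio_threshold


mostly_off = force_slice_off([True, False, False, False], 0.5)
evenly_split = force_slice_off([True, True, False, False], 0.5)
```

Since the rule depends only on decisions that are already known on both sides, an encoder and a decoder applying it to the same statistics would reach the same result, which is what allows the implicit variant.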
[0145] FIG. 7 is a conceptual diagram of the structure of an upper layer and a lower layer according to an embodiment of the present invention. Referring to FIG. 7, in a bidirectional prediction structure for allowing random access (RA), images (711, 712, 713) that are encoded first according to the coding order (CO) may constitute a lower layer (710), and images (721, 722, 723, 724, 725, 726) encoded by referencing the images of the lower layer (710) may constitute an upper layer (720). The images of the lower layer (710) may have a low temporal identifier (TID) value (e.g., TID=0 or TID=1), and the images of the upper layer (720) may have a high temporal identifier value (e.g., TID=2, TID=3, etc.). The above temporal identifier is a value assigned to each image to support temporal scalability, and even if only images with low temporal identifiers are decoded, basic restoration of the image is possible, and as images with high temporal identifiers are additionally decoded, the frame rate is improved, making more precise image playback possible.
[0146] Generally, images in lower layers with low temporal identifiers are referenced by images in upper layers, so relatively low quantization parameters (QPs) are applied, allowing them to be encoded with high quality. Conversely, images in upper layers with high temporal identifiers are not referenced by other images or have a low reference frequency, so relatively high quantization parameters may be applied. For example, slice quantization parameters can be determined by adding an offset based on the temporal identifier (e.g., +1, +2, +3, etc.) to the sequence quantization parameters. Due to these characteristics, images in lower layers may benefit relatively more from the application of neural network-based filtering in terms of quality, while images in upper layers may benefit relatively less from the filtering.
[0147] Meanwhile, each of the above images (711-713, 721-726) can be played back according to a display order or picture order count (PoC) separate from the coding order (CO). For example, if coding is performed in the order of PoC 0, 8, 4, 2, 1, 3, 6, 5, 7, images corresponding to PoC 0 and PoC 8 are coded first to form a lower layer, an image corresponding to PoC 4 is coded by referencing the lower layer, and subsequently, images of the upper layer can be coded progressively in the order of PoC 2, 1, 3, 6, 5, 7. In this hierarchical coding structure, whether to apply filtering to the upper layer can be efficiently determined based on the statistics of the neural network-based filtering application of the lower layer.
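The coding order in the example above can be generated by a recursive midpoint split of the group of pictures. The sketch below reproduces that order for a GOP size of 8; assigning the recursion depth as the temporal identifier is an assumption made for illustration, and an actual encoder configuration may assign identifiers differently.

```python
def hierarchical_coding_order(gop_size):
    """Return (PoC, temporal id) pairs in coding order for one GOP.

    The two end pictures are coded first with temporal id 0. Each
    interval is then split at its midpoint, and the midpoint picture
    is coded before the pictures on either side of it.
    """
    order = [(0, 0), (gop_size, 0)]

    def visit(lo, hi, tid):
        if hi - lo < 2:
            return  # no picture lies strictly between lo and hi
        mid = (lo + hi) // 2
        order.append((mid, tid))
        visit(lo, mid, tid + 1)
        visit(mid, hi, tid + 1)

    visit(0, gop_size, 1)
    return order


order = hierarchical_coding_order(8)
poc_order = [poc for poc, _ in order]
tid_by_poc = dict(order)
```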
[0148] According to one embodiment of the present invention, when an adaptive neural network-based filtering skip method is explicitly set at stages such as sequence, GOP, frame, picture, slice, CTU, and patch, the decision to skip filtering can be made quickly by utilizing reference information. For example, in the case of an inter-frame predicted slice, an inter-frame prediction method is utilized, and a reference block can be searched to obtain previously encoded information. At this time, the information of the block referenced by the current unit block (e.g., CTU, patch, etc.) is checked, and if neural network-based filtering was skipped for the referenced block, it can be decided to skip neural network-based filtering in the same way for the current unit block.
[0149] According to one embodiment of the present invention, an activation map can be generated based on the reference information. The activation map is information indicating whether neural network filtering is applied to each region within the current slice or current unit block, and can be configured based on whether neural network filtering is applied in the reference block referenced by each region. For example, if neural network filtering is applied in the reference block referenced by the motion vector of the current block, the corresponding location in the activation map can be set to "on," and if it is not applied, it can be set to "off." Since the activation map can be derived identically from the encoder and the decoder, it can be generated implicitly without a separate signal.
[0150] The above activation map can be generated using motion vectors (MV) in inter-frame prediction blocks, and can be generated using block vectors (BV) in blocks using intra-block copy (IBC) mode. That is, by constructing the activation map using both motion vectors and block vectors, reference-based adaptive filtering skip decisions can be made not only in inter-frame prediction slices but also in slices that utilize block vectors.
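A minimal sketch of activation map construction is given below. It assumes integer vectors in sample units, a reference-side on/off map stored on a regular block grid, and clamping of out-of-range positions to the picture; these choices, along with all names, are assumptions for illustration. The same lookup is used whether the vector is a motion vector or a block vector.

```python
def build_activation_map(blocks, ref_filter_map, block_size):
    """Derive an on/off entry for each current block from its reference.

    blocks: list of (x, y, vec_x, vec_y), where (x, y) is the top-left
        sample of a current block and (vec_x, vec_y) its motion or block
        vector in whole samples.
    ref_filter_map: 2D list indexed [row][col] on a block_size grid, True
        where neural network filtering was applied in the reference.
    """
    rows, cols = len(ref_filter_map), len(ref_filter_map[0])
    activation = {}
    for x, y, vec_x, vec_y in blocks:
        ref_x, ref_y = x + vec_x, y + vec_y
        # Clamp to the grid so vectors pointing outside stay valid.
        col = min(max(ref_x // block_size, 0), cols - 1)
        row = min(max(ref_y // block_size, 0), rows - 1)
        activation[(x, y)] = ref_filter_map[row][col]
    return activation


ref_map = [[True, False],
           [False, True]]
current_blocks = [(0, 0, 16, 0), (16, 16, 0, 0), (0, 16, -4, -4)]
activation = build_activation_map(current_blocks, ref_map, block_size=16)
```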
[0151] According to one embodiment of the present invention, the size of the current unit block may not be the same as the unit in which the motion vector (MV) is defined, and the reference block may also not match the size of the block having the MV. In this case, the motion vector value of a specific location of the current unit block (center, top-left, bottom-left, top-right, bottom-right, etc. in the block) may be set as a representative value for reference, and a method may be utilized to check whether neural network-based filtering is skipped at a specific location (center, top-left, bottom-left, top-right, bottom-right, etc. in the reference block). For example, if neural network-based filtering is set to 'on' in the reference block referenced by the pixel corresponding to the center position of the CTU, it may be determined to be 'on' for the entire CTU, and if it is set to 'off', it may be determined to be 'off' for the entire CTU. Alternatively, the reference information of the pixel corresponding to the top-left position of the CTU may be utilized as a representative value.
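The representative-position approach above can be sketched as follows. The two callables `vector_at` and `ref_filter_on_at` stand in for the motion field and the reference-side filtering map; they, the position names, and the function names are hypothetical placeholders rather than part of any codec interface.

```python
def representative_sample(x0, y0, size, position="center"):
    """Coordinates of the chosen representative sample inside a CTU."""
    offsets = {
        "center": (size // 2, size // 2),
        "top-left": (0, 0),
        "top-right": (size - 1, 0),
        "bottom-left": (0, size - 1),
        "bottom-right": (size - 1, size - 1),
    }
    dx, dy = offsets[position]
    return x0 + dx, y0 + dy


def ctu_filter_on(x0, y0, size, vector_at, ref_filter_on_at, position="center"):
    """Apply the reference block's on/off state to the entire CTU."""
    sx, sy = representative_sample(x0, y0, size, position)
    vec_x, vec_y = vector_at(sx, sy)  # vector covering that sample
    return ref_filter_on_at(sx + vec_x, sy + vec_y)


# Toy inputs: a constant vector, and a reference that is 'on' for x < 128.
decision_center = ctu_filter_on(
    0, 0, 128,
    vector_at=lambda x, y: (8, 0),
    ref_filter_on_at=lambda x, y: x < 128,
)
decision_top_right = ctu_filter_on(
    0, 0, 128,
    vector_at=lambda x, y: (8, 0),
    ref_filter_on_at=lambda x, y: x < 128,
    position="top-right",
)
```

The two calls illustrate that the choice of representative position can change the outcome for the same CTU, which is why the position would need to be fixed by rule or signaled.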
[0152] According to one embodiment of the present invention, the size of the current unit block may not be the same as the unit where the motion vector is defined, and the reference block may not coincide with the block having the motion vector. In this case, information on the motion vector block referenced by the current unit block (e.g., CTU) is utilized, and if a certain proportion or more of the reference block is set as a neural network-based filtering skip, it may be determined that the entire reference block applies the neural network-based filtering skip. Additionally, if the ratio or number of blocks among the reference blocks that are set as neural network-based filtering skips exceeds a certain threshold, the reference block may be determined to be a block where a skip occurs. For example, if 50% or more of the multiple sub-blocks having motion vectors within a CTU refer to a reference block to which neural network-based filtering is applied, the neural network-based filtering may be determined to be "on" for the entire CTU.
[0153] In cases as described above, information such as the method by which judgment is to be made (e.g., vector reference position, ratio, threshold, etc.) and specific values required in conjunction with the judgment method can be transmitted in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc. As described above, if an adaptive neural network-based filtering skip method is explicitly set at the stages of sequence, GOP, frame, picture, slice, CTU, patch, etc., the filtering skip can be quickly determined by utilizing reference information.
[0154]
[0155] Adaptive neural network-based filtering skip method: Implicit method
[0156] According to one embodiment of the present invention, an adaptive neural network-based filtering skip can be implicitly determined based on prediction block information at the level of a sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc. The implicit determination may refer to a method in which an encoder and a decoder each derive whether to skip based on the same rule and the same input information; in this case, since there is no need to separately signal whether to skip itself, bit sequence overhead can be reduced. Information that can be utilized for the implicit determination may include partitioning information, prediction block information, motion vector information, and transformation coefficient information. Even when the implicit determination is applied, all or part of the signal indicating the rule used for the implicit determination or the encoding / decoding information that substantially contributes to the implicit determination may be explicitly transmitted from the encoder to the decoder.
[0157] According to one embodiment of the present invention, an adaptive neural network filtering skip can be implicitly determined based on partitioning information. For example, by utilizing CU, PU, and TU information existing within a CTU (patch or specific unit), whether to skip neural network filtering can be determined according to whether the number of partitions is greater than or equal to a specific ratio (threshold). Generally, the more partitions there are, the more likely the CTU is to be an area containing complex textures or movements, and for such areas, the application of neural network filtering can contribute to image quality improvement. Conversely, a CTU with fewer partitions is more likely to be a relatively flat area, and for such areas, even if neural network filtering is skipped, the degradation of image quality may not be significant.
[0158] According to one embodiment of the present invention, a neural network-based filtering skip can be determined by utilizing CU, PU, and TU information existing within a CTU (patch or specific unit) to identify the characteristics of the partitioning. Since the determination based on the number of partitionings has the limitation of treating large partitioned blocks and small partitioned blocks as identical, various partitioning characteristics may be utilized to distinguish them. For example, a neural network-based filtering skip can be determined based on the size of the largest partitioned block, the size of the smallest block, the number of the largest partitioned blocks, the number of the smallest blocks, the ratio occupied by the largest partitioned block, the ratio occupied by the smallest block, and the ratio of the largest partitioned block to the smallest partitioned block.
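The partitioning characteristics listed above can be collected as in the sketch below. Block size is measured here as area in samples, and "ratio occupied" is interpreted as the share of the CTU area covered by all blocks of the largest (or smallest) size; both interpretations, and the dictionary keys, are assumptions made for the example.

```python
def partition_features(blocks, ctu_area):
    """Summarize the partitioning of a CTU from its (width, height) blocks."""
    areas = [w * h for w, h in blocks]
    largest, smallest = max(areas), min(areas)
    num_largest = areas.count(largest)
    num_smallest = areas.count(smallest)
    return {
        "largest_area": largest,
        "smallest_area": smallest,
        "num_largest": num_largest,
        "num_smallest": num_smallest,
        "largest_share": largest * num_largest / ctu_area,
        "smallest_share": smallest * num_smallest / ctu_area,
        "largest_to_smallest": largest / smallest,
    }


# A 64x64 CTU split into three 32x32 blocks and four 16x16 blocks.
blocks = [(32, 32)] * 3 + [(16, 16)] * 4
features = partition_features(blocks, ctu_area=64 * 64)
```

A plain block count would report seven partitions here, whereas the features show that three quarters of the area is covered by large blocks.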
[0159] According to one embodiment of the present invention, by utilizing boundary strength (BS) information existing within a CTU (patch or specific unit), whether to skip neural network-based filtering can be determined according to whether the amount of partitioning is higher than a specific number (ratio). The number of blocks or block boundaries existing within the CTU can be identified through the boundary strength map, and whether to skip the filtering can be determined based on this. For example, a neural network-based filtering skip can be determined by verifying the characteristics of the boundary strength partitioning existing within the CTU (patch or specific unit). In the boundary strength-based partitioning analysis, a neural network-based filtering skip can also be determined based on the size of the largest partition block, the size of the smallest block, the number of the largest partition blocks, the number of the smallest blocks, the ratio occupied by the largest partition block, the ratio occupied by the smallest block, and the ratio of the largest partition block to the smallest partition block.
[0160] According to one embodiment of the present invention, the number of pixels appearing as boundary strength partitioning within a CTU (patch or specific unit) can be counted, and a neural network-based filtering skip can be determined based on the information. As described above, judgment based on the number of partitions has the limitation of treating large divided blocks and small divided blocks equally. For example, counting one large block and one small block as one unit may not accurately reflect the complexity of the actual area. Therefore, for a more precise judgment, a method of counting the number of pixels occupied by each block instead of the number of blocks can be utilized, and through this, it is possible to determine a filtering skip that more accurately reflects the actual complexity within the CTU.
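A pixel-level count over a boundary strength map might look like the sketch below. The map is assumed to hold one integer per sample, with 0 meaning that no block boundary passes through that sample; this layout, the function names, and the threshold-based rule are assumptions. The rule skips filtering when few boundary pixels are present, following the earlier statement that a boundary strength below a threshold favors skipping.

```python
def boundary_pixel_count(bs_map):
    """Number of samples marked as lying on a block boundary (BS > 0)."""
    return sum(1 for row in bs_map for bs in row if bs > 0)


def boundary_pixel_ratio(bs_map):
    """Share of samples in the unit that lie on a block boundary."""
    total = sum(len(row) for row in bs_map)
    return boundary_pixel_count(bs_map) / total


def skip_by_boundary_pixels(bs_map, min_ratio):
    """Hypothetical rule: skip filtering when boundary pixels are scarce."""
    return boundary_pixel_ratio(bs_map) < min_ratio


# Toy 4x4 unit with one vertical and one horizontal boundary.
bs_map = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [0, 0, 1, 0],
]
count = boundary_pixel_count(bs_map)
ratio = boundary_pixel_ratio(bs_map)
```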
[0161] According to one embodiment of the present invention, determined prediction information (e.g., CU, PU, TU information) existing within a CTU (patch or specific unit) can be utilized. The prediction information may refer to the intra-frame and inter-frame prediction information that has been determined within the CTU (patch or specific unit). In this case, since the characteristics of the restored image differ depending on the prediction mode, the information can be utilized to implicitly determine an adaptive neural network-based filtering skip. To this end, information including the number of blocks coded in an intra-frame mode, the number of blocks coded in an inter-frame mode, the number of blocks coded in a mode utilizing a block vector (BV), the ratio of blocks coded in an intra-frame mode, the ratio of blocks coded in an inter-frame mode, and the ratio of blocks coded in a mode utilizing a block vector may be utilized.
[0162] According to one embodiment of the present invention, after classifying how blocks of the intra-frame mode, inter-frame mode, and block vector utilization mode are adjacent within the current CTU (patch or specific unit), a neural network-based filtering skip can be determined by utilizing the number of adjacent blocks, the number of adjacent pixels, etc. For example, the number of cases where intra-frame blocks adjoin each other, or the number of adjacent pixels along the adjoining part, may be used. Since there is a high probability that visual artifacts will occur at the block boundaries when blocks encoded in different prediction modes are adjacent, this adjacency information can be usefully utilized to determine the necessity of neural network-based filtering.
[0163] According to one embodiment of the present invention, in the case of inter-frame blocks, motion vector information can be utilized to implicitly determine whether to skip adaptive neural network filtering. For example, motion vector information existing within a CTU (patch or specific unit) can be utilized. Since the characteristics of the reconstructed image differ depending on the magnitude of the motion vector, the neural network filtering skip can be implicitly determined in an adaptive manner through this information. To this end, the intra-frame coding, the magnitude of the motion vector, and the magnitude of the sub-block motion vector within the current CTU (patch or specific unit) can be utilized. Generally, when the magnitude of the motion vector is large, it indicates an area with significant motion; in such areas, the error in motion compensation prediction may increase, which may lead to a high need for filtering.
[0164] According to one embodiment of the present invention, whether to skip adaptive neural network filtering can be implicitly determined by utilizing the number of non-zero coefficients. The distribution of non-zero coefficients existing within a CTU (patch or specific unit) can be interpreted as representing the characteristics of the reconstructed image.
[0165] Specifically, when non-zero coefficients appear frequently within a single block, the reconstructed pixel values of that block exhibit great variability. Since discontinuities at block boundaries can be visually prominent in such areas, neural network-based filtering is often required. Conversely, in the case of blocks with few non-zero coefficients, the reconstructed pixel values are relatively monotonous; in these areas, image quality degradation may not be significant even without applying neural network-based filtering. Meanwhile, if multiple blocks exist within a CTU (patch or specific unit) and many of these blocks have few non-zero coefficients, the boundaries between blocks may be visually prominent even if the interior of each individual block is flat. In such cases, the application of neural network-based filtering may actually be necessary.
[0166] Accordingly, based on information such as the total number of non-zero coefficients within the CTU, the number of blocks whose non-zero coefficient count is below a certain number, or the proportion of such blocks, it is possible to determine whether to apply neural network filtering. The threshold value used for the above determination can be transmitted in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc.
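The three coefficient statistics named above can be gathered as in the following sketch. Each block is assumed to be given as a flat list of quantized transform coefficients, and `few_threshold` is a hypothetical parameter standing in for the "certain number" mentioned in the text.

```python
def nonzero_coefficient_stats(blocks, few_threshold):
    """Statistics on non-zero coefficients across the blocks of a CTU.

    Returns (total non-zero coefficients, number of blocks with fewer
    than few_threshold non-zero coefficients, proportion of such blocks).
    """
    counts = [sum(1 for c in coeffs if c != 0) for coeffs in blocks]
    total = sum(counts)
    few_blocks = sum(1 for n in counts if n < few_threshold)
    return total, few_blocks, few_blocks / len(counts)


ctu_blocks = [
    [0, 0, 3, 0],   # one non-zero coefficient
    [1, 2, 0, 4],   # three non-zero coefficients
    [0, 0, 0, 0],   # no non-zero coefficients
]
total, few_blocks, few_ratio = nonzero_coefficient_stats(ctu_blocks, few_threshold=2)
```

How these statistics map to an on or off decision is left open here, since the text notes that many sparse blocks can argue either for skipping or for filtering depending on how visible the block boundaries are.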
[0167] As described above, since the implicit decision determines whether to skip based on the same rule between the encoder and the decoder, parameters such as threshold values used in the rule may need to be explicitly transmitted so that both sides can use the same values.
[0168] According to one embodiment of the present invention, when an adaptive neural network-based filtering skip is implicitly determined based on the partitioning information described above, the necessary information can be explicitly determined in the encoder and transmitted to the decoder. For example, when the CU, PU, and TU information existing within a CTU (patch or specific unit) is utilized and the method is based on the number of partitionings reaching a specific ratio (threshold), the specific ratio (threshold) can be transmitted by bit sequence syntax in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, and TU. In addition, the condition may be determined using one or more of the following elements among the CU, PU, and TU existing within the CTU (patch or specific unit): the size of the largest partition block, the size of the smallest block, the number of the largest partition blocks, the number of the smallest blocks, the ratio occupied by the largest partition block, the ratio occupied by the smallest block, and the ratio of the largest partition block to the smallest partition block. In that case, information regarding at least one of the elements used to determine the condition can also be transmitted by bit sequence syntax in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc.
[0169] According to one embodiment of the present invention, information regarding the ratio (threshold) of boundary strength (BS) partitioning existing within a CTU (patch or specific unit) can be transmitted by bit sequence syntax in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc. Additionally, the corresponding condition may be determined using one or more of the following elements: the size of the largest partition block, the size of the smallest block, the number of the largest partition blocks, the number of the smallest blocks, the ratio occupied by the largest partition block, the ratio occupied by the smallest block, and the ratio of the largest partition block to the smallest partition block. In that case, information regarding at least one of the elements used to determine the condition can also be transmitted by bit sequence syntax in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc. If the number of pixels of the blocks partitioned by boundary strength within a CTU (patch or specific unit) is utilized, that number of pixels can also be transmitted by bit sequence syntax in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc.
[0170] According to one embodiment of the present invention, when determining an adaptive neural network-based filtering skip based on the prediction block information described above, the necessary information can be explicitly determined in the encoder and transmitted to the decoder. Information regarding at least one of the following: the number of blocks coded in an intra-frame mode, the number of blocks coded in an inter-frame mode, the number of blocks coded in a block vector-based mode, the ratio of blocks coded in an intra-frame mode, the ratio of blocks coded in an inter-frame mode, and the ratio of blocks coded in a block vector-based mode, can be transmitted in units such as sequences, GOPs, frames, pictures, slices, CTUs, patches, CUs, and TUs by means of bit sequence syntax. Additionally, when determining a neural network-based filtering skip based on at least one of the following: the number of adjacent blocks and the number of adjacent pixels, after classifying how the blocks of the intra-frame mode, inter-frame mode, and BV-based mode are adjacent within the CTU (Patch or specific unit), the information can be transmitted in units such as sequences, GOPs, frames, pictures, slices, CTUs, patches, CUs, and TUs.
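The prediction-mode statistics named above (counts and ratios of intra-frame, inter-frame, and block vector-based blocks) could be computed as in the following sketch. The mode labels, the function names, and the rule "apply the filter when the intra ratio reaches a signaled minimum" are assumptions made for illustration; the text leaves the actual rule and its direction open.

```python
from collections import Counter
from typing import Dict, Sequence

MODES = ("intra", "inter", "bv")  # hypothetical labels for the three mode classes


def mode_ratios(block_modes: Sequence[str]) -> Dict[str, float]:
    """Share of blocks in a CTU coded in each prediction mode."""
    counts = Counter(block_modes)
    total = len(block_modes)
    return {m: (counts[m] / total if total else 0.0) for m in MODES}


def apply_nn_filter_by_mode(block_modes: Sequence[str],
                            min_intra_ratio: float) -> bool:
    """Assumed rule: filter when the intra share reaches the signaled minimum."""
    return mode_ratios(block_modes)["intra"] >= min_intra_ratio
```

Here `min_intra_ratio` stands in for a threshold carried in the bit sequence, so that the decoder can repeat the encoder's computation on the same block modes.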
[0171] According to one embodiment of the present invention, in the case of the aforementioned inter-frame block, information required to implicitly determine an adaptive neural network filtering skip based on motion vectors can be explicitly determined by the encoder and transmitted to the decoder. Information such as the number of encoded blocks within a frame, the magnitude of motion vectors, and the magnitude of sub-block motion vectors within a CTU (patch or specific unit) can be transmitted in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc. According to one embodiment of the present invention, information required to implicitly determine whether to skip an adaptive neural network filtering based on the aforementioned number of non-zero coefficients can be explicitly determined and transmitted to the decoder. Information required to determine whether to skip, such as the number of non-zero coefficients existing within a CTU (patch or specific unit), the number of blocks with non-zero coefficients below a specific number, and the ratio occupied by blocks with non-zero coefficients below a specific number, can be transmitted in units such as sequence, GOP, frame, picture, slice, CTU, patch, CU, TU, etc.
[0172]
[0173] Encoder and decoder
[0174] It is evident that the method according to the present invention can be applied equally to an encoder and a decoder. Additionally, as illustrated in FIG. 4 through an internal decoder (420) and a coding loop including it, this decoding process can be implemented equally within the encoder to predict the state of the decoder.
[0175] The encoding method according to the present invention described above can be implemented through an encoder as a device. The encoder as a device may be implemented by maintaining the conventional encoder structure exemplified in FIGS. 1 to 6 or by applying a certain change therefrom, but the form of implementation is not necessarily limited to that exemplified, and any form of encoder structure capable of functioning as a video encoder should be considered an encoder established by the present invention as long as it embodies the technical concept of the present invention.
[0176] In addition, the decoding method of the encoded result according to the present invention described above can be implemented through a decoder as a device. The decoder as a device may be implemented in a form that maintains the conventional decoder structure exemplified in FIGS. 1 to 6 or applies a certain change therefrom, but the form of implementation is not necessarily limited to that exemplified, and any form of decoder structure capable of functioning as a video decoder should be considered a decoder established by the present invention as long as it embodies the technical concept of the present invention.
[0177] A person skilled in the art will readily understand that a bit sequence encoded by the method and apparatus described above can be decoded by applying a method symmetric and / or in reverse order to the encoding method. In one embodiment, when reading information for decoding from the encoded bit sequence, at least one variable-length coded phrase included in the encoded bit sequence may be interpreted, and in one embodiment, the variable-length coding may be performed by an entropy coding method. The technical details and application methods of implementing such a decoding procedure will be readily understood from the encoding procedure described above.
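To make the reading of a variable-length coded value concrete, the sketch below decodes an unsigned Exp-Golomb code. The text does not name a specific code; Exp-Golomb is chosen here only because it is a familiar variable-length code used for header syntax elements in H.264/AVC and H.265/HEVC. Representing the bit sequence as a string of '0' and '1' characters, and the function name `read_ue`, are simplifications assumed for illustration.

```python
from typing import Tuple


def read_ue(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one unsigned Exp-Golomb code starting at `pos`.

    Returns (value, position of the next unread bit).
    """
    zeros = 0
    # Count leading zero bits up to the '1' marker.
    while pos + zeros < len(bits) and bits[pos + zeros] == "0":
        zeros += 1
    start = pos + zeros + 1   # first suffix bit, just after the '1' marker
    end = start + zeros       # the suffix has as many bits as the zero prefix
    if end > len(bits):
        raise ValueError("bit sequence ended inside a code word")
    suffix = bits[start:end]
    value = (1 << zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, end
```

A decoder could call `read_ue` repeatedly, feeding the returned position back in, to read several signaled values such as thresholds from a header.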
[0178] The processor that may be included in the encoder and / or decoder described herein may mean one or more general-purpose computers or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
[0179] Even if the above processor is expressed in the singular for ease of understanding, a person skilled in the art will understand that the above processor may include a plurality of processing elements and / or a plurality of types of processing elements. For example, an apparatus according to one embodiment of the present invention may include a plurality of processors or one processor and one controller as the processor. In addition, the processor may be implemented by various processing configurations, such as a parallel processor or a multi-core processor.
[0180] The processor may be configured to execute an operating system (OS) and one or more software applications executed on the operating system. Additionally, the processor may access, store, manipulate, process, and generate data in response to the execution of the software.
[0181] The software may include a computer program, code, instructions, or a combination of one or more of these, and may be configured to control the processor to operate as desired and to issue instructions to the processor independently or collectively. The software may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave in order to be interpreted by the processor or to provide instructions or data to the processor. The software may be distributed over networked computer systems and may be stored or executed in a distributed manner.
[0182] The software described above may also be implemented in the form of program instructions that can be executed through various computer means and may be recorded or stored in the memory. The memory may be a computer-readable recording medium, and program instructions, data files, data structures, etc., may be recorded on the computer-readable recording medium alone or in combination. The program instructions stored in the memory may be based on a command system specifically designed and configured for embodiments of the present invention, or may follow a command system known and available to those skilled in the art of computer software, such as Assembly, C, C++, Java, Python, etc. It should be understood that the command system and the program instructions derived therefrom include not only machine code such as that generated by a compiler, but also high-level language code that can be executed by a device and / or processor according to an embodiment of the present invention using an interpreter, etc.
[0183] A computer-readable recording medium constituting a device according to an embodiment of the present invention, including the memory described herein, may include a temporary or volatile recording medium that is maintained only while the processor is operating, such as a processor cache, RAM, or flash memory; a relatively non-volatile or long-term recording medium, such as magnetic media (a hard disk, floppy disk, or magnetic tape), optical recording media (a CD-ROM or DVD), magneto-optical media (a floptical disk), or solid-state memory; or a read-only recording medium such as a ROM placed on hardware. Furthermore, it will be obvious to a person of ordinary skill in the art that hardware configured by hard-wired circuitry to perform operations equivalent to a series of program instructions may itself be regarded as having each step of the operations implementing an embodiment of the present invention recorded in the connection and arrangement of its hardware components, so that this connection and arrangement is equivalent to the memory.
[0184] The embodiments described above with respect to the processor and the memory are not mutually exclusive and may be selected or combined as needed. For example, a hardware device may be configured to operate as a module composed of one or more of the software to perform the operation of an embodiment of the present invention, and vice versa. As another example, in this specification, all or part of the operation assigned to a certain functional unit may be implemented by one or more of the software stored in a device according to an embodiment of the present invention (preferably in a recording medium falling within the category of the memory) and configured to be executed by the processor, in which case such a functional unit may be referred to as a functional unit "included" in the processor.
[0185]
[0186] Although the present invention has been described above with reference to the drawings and embodiments, as previously stated, the scope of protection of the present invention is not limited by the drawings or embodiments presented above, and those skilled in the art will understand that various modifications and changes can be made to the present invention without departing from the spirit and scope of the invention as described in the claims of the present invention.
Claims
1. An image decoding method using an information processing device, the method comprising: a step of acquiring a bit sequence in which an image is encoded; a step of obtaining a restored sample of a current picture from the bit sequence; a step of determining, for at least one decoding unit, whether to apply neural network-based filtering to the restored sample; and a step of selectively applying the neural network-based filtering based on the determination, wherein the determining step determines whether to apply the neural network-based filtering in a lower decoding unit only when it is determined to apply the neural network-based filtering in an upper decoding unit.
2. The image decoding method of claim 1, wherein the upper decoding unit is a slice and the lower decoding unit is a coding tree unit (CTU).
3. The image decoding method of claim 2, wherein whether to apply the neural network-based filtering in the lower decoding unit is determined in block units based on a depth specified by a first threshold value.
4. The image decoding method of claim 1, wherein the determining step determines, layer by layer, whether to apply the neural network-based filtering based on a temporal identifier.
5. The image decoding method of claim 4, wherein the determining step decides to apply the neural network-based filtering only to layers having temporal identifiers below a second threshold.
6. The image decoding method of claim 4, wherein the determining step decides to apply the neural network-based filtering in an upper layer when the application rate of the neural network-based filtering in a lower layer exceeds a third threshold.
7. The image decoding method of claim 4, wherein the determining step decides not to apply the neural network-based filtering to an entire slice of an upper layer when the ratio of blocks to which the neural network-based filtering is not applied in a lower layer exceeds a fourth threshold.
8. The image decoding method of claim 1, wherein the determining step is performed based on whether the neural network-based filtering was applied to a reference block referenced by a current decoding unit.
9. The image decoding method of claim 8, wherein the determining step generates an activation map based on whether the neural network-based filtering was applied to the reference block, and determines whether to apply the neural network-based filtering to the current decoding unit based on the activation map.
10. The image decoding method of claim 9, wherein the activation map is generated by utilizing at least one of a motion vector or a block vector.
11. The image decoding method of claim 8, wherein the determining step determines whether to apply the neural network-based filtering to the entire current decoding unit based on whether the neural network-based filtering was applied to the reference block referenced by a pixel at a predetermined position of the current decoding unit.
12. The image decoding method of claim 1, wherein the determining step is performed based on boundary strength information within a coding tree unit (CTU).
13. The image decoding method of claim 12, wherein the determining step is performed by counting the number of pixels of blocks distinguished by the boundary strength.
14. The image decoding method of claim 1, wherein at least one threshold value used in the determination is included in the bit sequence and transmitted in slice units.
15. The image decoding method of claim 1, wherein the determining step is performed implicitly based on partitioning information within a coding tree unit (CTU).
16. An image encoding method using an information processing device, the method comprising: a step of acquiring a current picture of an image to be encoded; a step of encoding the current picture to generate a restored sample; a step of determining, for at least one encoding unit, whether to apply neural network-based filtering to the restored sample; and a step of selectively applying the neural network-based filtering based on the determination, wherein the determining step determines whether to apply the neural network-based filtering in a lower encoding unit only when it is determined to apply the neural network-based filtering in an upper encoding unit.
17. The image encoding method of claim 16, further comprising a step of including information regarding the determination in a bit sequence.
18. The image encoding method of claim 16, wherein the determining step is performed based on rate-distortion optimization.
19. The image encoding method of claim 18, wherein the lambda value used in the rate-distortion optimization is adaptively set on a slice basis.
20. An image decoding device comprising: a processor configured to execute at least one instruction; and a memory for storing the at least one instruction, wherein the processor, by executing the at least one instruction, is configured to: acquire a bit sequence in which an image is encoded; obtain a restored sample of a current picture from the bit sequence; determine, for at least one decoding unit, whether to apply neural network-based filtering to the restored sample; and selectively apply the neural network-based filtering based on the determination, and wherein the processor determines whether to apply the neural network-based filtering in a lower decoding unit only when it is determined to apply the neural network-based filtering in an upper decoding unit.
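For illustration only, and without limiting the claims, the hierarchical determination recited above can be sketched as follows, assuming the upper unit is a slice and the lower unit is a CTU (as in claim 2) and combining it with the temporal identifier condition of claim 5. The function name `effective_ctu_flags` and its parameters are hypothetical, and the per-CTU flags stand in for whatever explicit or implicit lower-level decision an implementation uses.

```python
from typing import List, Sequence


def effective_ctu_flags(slice_enabled: bool,
                        ctu_flags: Sequence[bool],
                        temporal_id: int,
                        temporal_id_limit: int) -> List[bool]:
    """Effective per-CTU filtering decisions for one slice.

    The CTU-level decision is consulted only when the slice-level decision
    is to apply the filter and the temporal identifier is below the limit.
    """
    if temporal_id >= temporal_id_limit or not slice_enabled:
        # Upper-level decision is "do not apply": lower level is never examined.
        return [False] * len(ctu_flags)
    return list(ctu_flags)
```

Gating the lower-level decision on the upper-level one means that no per-CTU work is needed for slices in which the filter is disabled.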