Feature map encoding device, feature map encoding method, feature map decoding device, and feature map decoding method

The feature map encoding and decoding devices efficiently manage large feature maps by identifying active channels and encoding relevant information, reducing processing load and enhancing transmission and storage efficiency.

WO2026141580A1 · PCT designated stage · Publication Date: 2026-07-02 · JVC KENWOOD CORP

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
JVC KENWOOD CORP
Filing Date
2025-12-25
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

The enormous amount of information in feature maps makes them unsuitable for efficient transmission and storage in existing neural network systems.

Method used

A feature map encoding device that includes a packing unit to determine active channels and generate a single packing feature frame, and a feature frame parameter set encoding unit to encode information about active channels, along with a decoding device that unpacks and generates feature maps for active and inactive channels using predetermined procedures.

Benefits of technology

Enables efficient encoding and decoding of feature maps with minimal processing load, improving transmission and storage efficiency.



Abstract

The present invention is provided with: a packing unit that determines an active channel to be encoded from feature maps of a plurality of channels, and generates one packed feature frame by packing only the feature map of the active channel; and a feature frame parameter set encoding unit that encodes, for each packed feature frame, a feature frame parameter set including information indicating whether or not each feature map in the packed feature frame corresponds to the active channel.

Description

Feature Map Encoding Device, Feature Map Encoding Method, Feature Map Decoding Device, Feature Map Decoding Method

[0001] The present invention relates to the encoding and decoding of feature maps in a neural network.

[0002] As a neural network technology used for image recognition such as detecting objects of various scales in an image, dividing regions for each object, or tracking objects, FPN (Feature Pyramid Network) of Non-Patent Document 1 is known. In FPN, a plurality of feature maps of various scales are generated from the image to be processed, and various image recognitions are performed using the feature maps.

[0003] An FPN used for image recognition generates a plurality of feature maps from an image and is built on a CNN (Convolutional Neural Network). A CNN takes an image as input, is composed of convolution and pooling layers, and can be divided into a feature extraction part (backbone) that generates feature maps and a discrimination part (head) that is composed of hierarchical fully connected layers and generates an output suited to tasks such as object detection, instance segmentation, and object tracking. The FPN uses the CNN backbone.

[0004] The feature extraction part of the FPN is typically structured hierarchically by repeating a basic unit composed of a convolution processing unit 301, an activation processing unit 302, and a pooling processing unit 303, as shown in FIG. 3.

[0005] Figure 4 shows the structure of the FPN. The FPN consists of a bottom-up processing unit 322 that uses the CNN backbone to generate a multi-scale feature map composed of multiple hierarchical layers, and a top-down processing unit 324 that uses the inverse configuration of the CNN backbone to aggregate features from the feature maps of deeper layers into those of shallower layers. The bottom-up processing unit 322 repeatedly applies the basic unit of Figure 3 (the convolution processing unit 301, activation processing unit 302, and pooling processing unit 303), halving the resolution of the feature map each time to generate a pyramid of multi-layer feature maps. The top-down processing unit 324, in turn, adds the feature maps of corresponding resolution from the bottom-up processing unit 322 while expanding the resolution of the feature map up to that of the input image, generating a pyramid of feature maps. In other words, the FPN generates multiple feature maps for each layer from the image 326 that is the target of feature extraction processing.

[0006] The convolution processing unit 301 performs convolution on the data to be processed (an image or a feature map) using a plurality of predetermined filters (kernels). In this convolution processing, a predetermined filter is applied across the entire data to be processed while the filter slides at a predetermined interval. This sliding interval is called the stride. The convolution processing unit 301 may determine the stride based on the amount of data to be processed. For example, it may set the stride to 1 if the amount of data to be processed is less than a predetermined value, and to 2 if it is greater than or equal to the predetermined value. Multiple feature maps are generated by preparing a plurality of predetermined filters at each layer and generating one feature map per filter. The unit of a feature map is called a channel. If there are N predetermined filters (N types), then N feature maps (N channels) are generated.
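The relationship between filters, stride, and channels described above can be illustrated with a minimal NumPy sketch. It is not taken from the patent: the function name `conv2d_multi_filter`, the absence of padding, and the all-ones filters are assumptions made purely for illustration.

```python
import numpy as np

def conv2d_multi_filter(image, filters, stride=1):
    """Slide each k x k filter over a 2-D input without padding.

    image:   (H, W) array
    filters: (N, k, k) array, one kernel per output channel
    returns: (N, H_out, W_out) array, one feature map (channel) per filter
    """
    n, k, _ = filters.shape
    h_out = (image.shape[0] - k) // stride + 1
    w_out = (image.shape[1] - k) // stride + 1
    out = np.zeros((n, h_out, w_out), dtype=np.float64)
    for c in range(n):                      # one output channel per filter
        for i in range(h_out):
            for j in range(w_out):
                top, left = i * stride, j * stride
                window = image[top:top + k, left:left + k]
                out[c, i, j] = np.sum(window * filters[c])
    return out

# Hypothetical input: an 8 x 8 image and N = 4 filters of size 3 x 3.
image = np.arange(64, dtype=np.float64).reshape(8, 8)
filters = np.ones((4, 3, 3))
maps_s1 = conv2d_multi_filter(image, filters, stride=1)  # 4 channels, 6 x 6
maps_s2 = conv2d_multi_filter(image, filters, stride=2)  # 4 channels, 3 x 3
```

With N = 4 filters the output has 4 channels, and moving from stride 1 to stride 2 roughly halves the spatial size of each channel.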

[0007] The activation processing unit 302 performs an activation process that non-linearly transforms the feature map output from the convolution processing unit 301. Here, the function used for the activation process is called the activation function. The activation processing unit 302 uses the ReLU (Rectified Linear Unit) function or the sigmoid function, etc., as the activation function.

[0008] The pooling processing unit 303 downsamples the feature map by replacing local values of the feature map output from the activation processing unit 302 with representative values.

[0009] When classification is performed using a neural network, the task can be executed using the multi-channel feature maps of each layer.
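As a rough illustration of the activation and pooling steps, the sketch below applies ReLU or sigmoid element-wise and then downsamples with 2 x 2 max pooling. The choice of the maximum as the "representative value" and the 2 x 2 window are assumptions; the text does not fix either.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative values become 0."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def max_pool_2x2(fmap):
    """Downsample a (C, H, W) feature map by taking the max of each 2 x 2 block.

    H and W are assumed to be even.
    """
    c, h, w = fmap.shape
    return fmap.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# Hypothetical single-channel 4 x 4 feature map.
fmap = np.array([[[-1.0,  2.0,  3.0, -4.0],
                  [ 5.0, -6.0,  7.0,  8.0],
                  [ 0.0,  0.0, -1.0, -2.0],
                  [ 1.0,  0.0, -3.0, -4.0]]])
pooled = max_pool_2x2(relu(fmap))   # shape (1, 2, 2)
```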

[0010] In image recognition, the multi-channel feature maps of each hierarchical level are subjected to a convolution process at predetermined size intervals based on the scale of the feature maps, and the probability of the object's class is calculated for each pixel.

[0011] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.

[0012] The amount of information in feature maps is enormous, making them unsuitable for transmission and storage. In view of the above problems, this embodiment aims to provide a technology for encoding and decoding feature maps.

[0013] To solve the above problems, the feature map encoding device of this embodiment includes a packing unit that determines the active channels to be encoded from the feature maps of multiple channels and generates a single packing feature frame by packing only the feature maps of the active channels; and a feature frame parameter set encoding unit that encodes, for each packing feature frame, a feature frame parameter set including information indicating, for each feature map, whether or not it is an active channel.

[0014] The feature map decoding device of this embodiment includes a feature frame parameter set decoding unit that decodes, for each packing feature frame, a feature frame parameter set containing information indicating, for each feature map, whether or not it is an active channel; and an unpacking unit that generates the feature maps of active channels by unpacking the packing feature frame and generates the feature maps of inactive channels, that is, channels that are not active, using a predetermined procedure.

[0015] According to this embodiment, feature maps can be encoded and decoded efficiently with minimal processing load.

[0016] Brief description of the drawings:
  • A block diagram illustrating the configuration of the feature map encoding device 100.
  • A block diagram illustrating the configuration of the feature map decoding device 200.
  • A block diagram illustrating the basic unit processing at each layer of the FPN.
  • A diagram illustrating the structure of the FPN.
  • A block diagram illustrating the detailed configuration of the feature map reduction unit 102.
  • A block diagram illustrating the detailed configuration of the feature map restoration unit 206.
  • A block diagram illustrating the detailed configuration of the feature map conversion unit 103.
  • A block diagram illustrating the detailed configuration of the feature map inverse conversion unit 205.
  • A block diagram illustrating the detailed configuration of the feature map internal encoding unit 104.
  • A block diagram illustrating the detailed configuration of the feature map internal decoding unit 204.
  • A diagram illustrating the number of channels, width, and height of feature maps x1, x2, and x3.
  • A diagram illustrating the state in which multiple channel feature maps are packed into one frame.
  • A diagram illustrating flipping when packing multiple channel feature maps into one frame.
  • A diagram illustrating the layers and units handled by the feature map encoding device and feature map decoding device of this embodiment.
  • A flowchart illustrating the encoding procedure of the first embodiment.
  • A flowchart illustrating the decoding procedure of the first embodiment.
  • An example of the syntax rules for the feature sequence parameter set in the first embodiment.
  • An example of the syntax rules for the feature picture parameter set in the first embodiment.
  • An example of the syntax rules for the encoded video bitstream in the first embodiment.
  • An example of the syntax rules for the feature picture parameter set and the encoded video bitstream in the third embodiment.

[0017] This section defines the technologies and technical terms used in this embodiment.

[0018] <Features and Feature Maps> In a convolutional neural network (CNN), a filter is slid across the target image (the input-layer data), and the data obtained by convolving each scanned portion with the filter coefficients is called a feature or a feature map.

[0019] <Packing> Frame packing is the process of combining two or more frames (pictures) into a single frame (picture) by arranging them in a tile-like manner. In this application, packing refers to the process of combining feature maps of multiple channels into a single frame. Figure 12 shows an example of frame packing.

[0020] <Data Types> Data types that represent integer values are referred to as integer types, and data types that represent decimal values are referred to as decimal types.

[0021] <Layers and Units> The layers and units handled by the feature map encoding device and feature map decoding device of this embodiment are explained using Figure 14, which shows four of them. A sequence of temporally consecutive feature frames, or of feature maps for all channels, is referred to as a sequence layer or sequence unit. A sequence of temporally consecutive feature maps for one channel is referred to as a sequence layer for each channel or a sequence unit for each channel. A feature frame, or the feature maps of all channels, at the same time is referred to as a frame layer or frame unit. A feature map for one channel at a certain time is referred to as a feature map layer for each channel (a feature map unit for each channel).

[0022] (First Embodiment) A feature map encoding device 100 and a feature map decoding device 200 according to the first embodiment of the present invention will be described.

[0023] Figure 1 is a block diagram of a feature map encoding device 100 according to the first embodiment. The feature map encoding device 100 of this embodiment includes a feature map reduction unit 102, a feature map conversion unit 103, a feature map internal encoding unit 104, a feature sequence parameter set encoding unit 105, a feature frame parameter set encoding unit 106, and a multiplexing unit 107. The feature map encoding device 100 is a device that encodes the feature map generated by the neural network feature extraction unit 101 to generate a bitstream and output it.

[0024] The neural network feature extraction unit 101 reads the image to be feature extracted, generates a feature map through FPN convolution, activation, and pooling processes, and supplies it to the feature map reduction unit 102. In this embodiment, a three-layer multi-scale feature map x1, x2, and x3 is generated.

[0025] The feature map reduction unit 102 converts the three-layer multi-scale feature maps x1, x2, and x3 obtained from the neural network feature extraction unit 101 into a single-layer single-scale feature map xf and supplies it to the feature map conversion unit 103. The feature map reduction unit 102 is described in detail with reference to Figure 5.

[0026] The feature map conversion unit 103 takes the decimal-type single-scale feature map xf supplied from the feature map reduction unit 102, performs packing and quantization processing to convert it into an integer-type packing feature frame, and supplies it to the feature map internal encoding unit 104.

[0027] The feature map conversion unit 103 is described in detail with reference to Figure 7.

[0028] The feature map internal encoding unit 104 encodes the integer-type packing feature frames supplied from the feature map conversion unit 103 using image encoding standards such as VVC, HEVC, and AV1 to generate an encoded video bitstream, which is then supplied to the multiplexing unit 107.

[0029] The feature sequence parameter set encoding unit 105 encodes a feature sequence parameter set, which is a set of parameters common to the sequence of single-scale feature maps xf supplied from the feature map reduction unit 102, and supplies it to the multiplexing unit 107. The feature sequence parameter set includes information such as the size of the packing feature frame and the size of the channels of the feature maps packed into the packing feature frame.

[0030] The feature frame parameter set encoding unit 106 encodes the feature frame parameter set, which is a set of parameters set in each single-scale feature map xf, supplied from the feature map reduction unit 102, and supplies it to the multiplexing unit 107. The encoding procedure for the feature frame parameter set of each channel will be described later.

[0031] The multiplexing unit 107 multiplexes the feature sequence parameter set supplied from the feature sequence parameter set encoding unit 105, the feature frame parameter set supplied from the feature frame parameter set encoding unit 106, and the encoded video bitstream supplied from the feature map internal encoding unit 104 to create a multiplexed bitstream. The created multiplexed bitstream is then output to the feature map decoding device 200 via a network or the like.

[0032] The feature map internal encoding unit 104 is described in detail with reference to Figure 9.

[0033] Figure 2 is a block diagram showing the configuration of a feature map decoding device 200 according to an embodiment of the present invention, corresponding to the feature map encoding device 100 in Figure 1. The feature map decoding device 200 of this embodiment includes a demultiplexing unit 201, a feature sequence parameter set decoding unit 202, a feature frame parameter set decoding unit 203, a feature map internal decoding unit 204, a feature map inverse conversion unit 205, and a feature map restoration unit 206. The feature map decoding device 200 receives a multiplexed bitstream encoded by the feature map encoding device 100, decodes it to generate three-layer multi-scale feature maps x1up, x2up, and x3up, and supplies them to the neural network discrimination unit 207.

[0034] The demultiplexing unit 201 demultiplexes the multiplexed bitstream supplied from the feature map encoding device 100 via a network or the like into the feature sequence parameter set, the feature frame parameter set, and the encoded video bitstream in which the packing feature frame is encoded. It then supplies the feature sequence parameter set to the feature sequence parameter set decoding unit 202, the feature frame parameter set to the feature frame parameter set decoding unit 203, and the encoded video bitstream to the feature map internal decoding unit 204. The feature sequence parameter set decoding unit 202 decodes the feature sequence parameter set supplied from the demultiplexing unit 201 and supplies the decoded information to the feature map inverse conversion unit 205 and the feature map restoration unit 206. The feature frame parameter set decoding unit 203 decodes the feature frame parameter set supplied from the demultiplexing unit 201 and supplies the decoded information to the feature map inverse conversion unit 205.

[0035] The feature map internal decoding unit 204 decodes the encoded video bitstream encoded by the feature map internal encoding unit 104 of the feature map encoding device 100 according to an image encoding standard such as VVC, HEVC, or AV1, generates an integer-type packing feature frame, and supplies it to the feature map inverse conversion unit 205.

[0036] The feature map internal decoding unit 204 is described in detail with reference to Figure 10.

[0037] The feature map inverse conversion unit 205 performs inverse quantization and unpacking on the integer-type packing feature frame supplied from the feature map internal decoding unit 204, converts it into a decimal-type single-scale feature map xr, and supplies it to the feature map restoration unit 206.

[0038] The feature map inverse conversion unit 205 is described in detail with reference to Figure 8.

[0039] The feature map restoration unit 206 converts the single-scale feature map xr supplied from the feature map inverse conversion unit 205 into three-layer multi-scale feature maps x1up, x2up, and x3up, and supplies them to the neural network discrimination unit 207 as the output of the feature map decoding device 200.

[0040] The feature map restoration unit 206 is described in detail with reference to FIG. 6.

[0041] Based on the three-layer multi-scale feature maps x1up, x2up, and x3up supplied by the feature map restoration unit 206, the neural network discrimination unit 207 performs discrimination processing such as discrimination of objects in the discrimination target image, discrimination of a location or a landscape, and discrimination of a person or a living thing.

[0042] <Regarding Feature Map Reduction and Feature Map Restoration> The feature map reduction unit 102 has a function of converting a multi-scale feature map of a plurality of layers obtained from the neural network feature amount extraction unit 101 into a single-layer single-scale feature map.

[0043] Using FIG. 5, the details of the feature map reduction unit 102 will be described. The feature map reduction unit 102 includes a first feature map reduction unit 501, a first channel coupling unit 502, a second feature map reduction unit 503, a second channel coupling unit 504, a third feature map reduction unit 505, a first padding unit 506, a second padding unit 507, and a third padding unit 508. The feature map reduction unit 102 in FIG. 5 is an example of a configuration for converting a three-layer multi-scale feature map into a single-scale feature map.

[0044] The feature map reduction unit 102 takes as input the three-layer multi-scale feature maps, namely the first feature map x1, the second feature map x2, and the third feature map x3, converts them into a single-layer single-scale feature map xf, and supplies it to the feature map conversion unit 103. Here, let the index indicating the layer be n, the number of channels of the nth layer be Cn, the width of the feature map be Wn, and the height of the feature map be Hn. In the present embodiment, the values of Cn, Wn, and Hn for each layer are as shown in FIG. 11, where W and H are the width and height, respectively, of the image on which feature extraction is performed.

[0045] The first padding unit 506 has a function of performing padding on the first feature map x1 to generate a first padded feature map x1pad. In the first padding unit 506, the padding size is determined such that the width and height of x1pad are multiples of 64. The number of channels of x1pad is the same as that of x1 and is 256.

[0046] The second padding unit 507 has a function of performing padding by folding on the second feature map x2 to generate a second padded feature map x2pad. In the second padding unit 507, the padding size is determined such that the width and height of x2pad are multiples of 32. The number of channels of x2pad is the same as that of x2 and is 256.

[0047] The third padding unit 508 has a function of performing padding by folding on the third feature map x3 to generate a third padded feature map x3pad. In the third padding unit 508, the padding size is determined such that the width and height of x3pad are multiples of 16. The number of channels of x3pad is the same as that of x3 and is 256.

[0048] In the first padding unit 506, the second padding unit 507, and the third padding unit 508, it is assumed that the left padding size and the right padding size are the same, and the upper padding size and the lower padding size are the same. That is, it is assumed that the feature maps x1, x2, and x3 are respectively arranged at the centers of x1pad, x2pad, and x3pad.
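The padding rule of the three padding units can be sketched as follows. This is a sketch under stated assumptions rather than the patent's procedure: "padding by folding" is interpreted as NumPy's `reflect` mode, the helper name `pad_to_multiple` is hypothetical, and when the total padding is odd the extra row or column is placed at the bottom or right, a case the text does not address.

```python
import numpy as np

def pad_to_multiple(fmap, multiple, mode="reflect"):
    """Pad a (C, H, W) feature map so H and W become multiples of `multiple`.

    The original map is kept at the center of the padded map.
    Returns the padded map and the (top, bottom, left, right) padding sizes.
    """
    _, h, w = fmap.shape
    pad_h = (-h) % multiple          # rows needed to reach the next multiple
    pad_w = (-w) % multiple          # columns needed to reach the next multiple
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left
    padded = np.pad(fmap, ((0, 0), (top, bottom), (left, right)), mode=mode)
    return padded, (top, bottom, left, right)

# Hypothetical x1 with 2 channels (the embodiment uses 256) and size 100 x 150.
x1 = np.arange(2 * 100 * 150, dtype=np.float64).reshape(2, 100, 150)
x1pad, pads = pad_to_multiple(x1, 64)   # padded to 128 x 192
```

Here the height grows from 100 to 128 and the width from 150 to 192, with 14 rows added above and below and 21 columns added on each side.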

[0049] The first feature map reduction unit 501 performs convolution in the spatial direction and the channel direction on the first padded feature map x1pad obtained from the first padding unit 506 to generate a first intermediate feature map y1. The number of channels of y1 is 192, the width is Wx1pad / 2, and the height is Hx1pad / 2. Here, Wx1pad and Hx1pad are the width and height of the first padded feature map x1pad, respectively.

[0050] The first channel coupling unit 502 has the function of concatenating the first intermediate feature map y1 obtained from the first feature map reduction unit 501 and the second padded feature map x2pad obtained from the second padding unit 507 in the channel direction to generate an intermediate feature map y1Cx2pad. Since y1 has 192 channels and x2pad has 256 channels, the intermediate feature map y1Cx2pad has 448 channels (192 + 256).

[0051] The second feature map reduction unit 503 performs convolution in the spatial and channel directions on the intermediate feature map y1Cx2pad obtained from the first channel coupling unit 502 to generate a second intermediate feature map y2. The number of channels in y2 is 192, the width is Wy1Cx2pad / 2, and the height is Hy1Cx2pad / 2. Here, Wy1Cx2pad and Hy1Cx2pad are the width and height of the intermediate feature map y1Cx2pad, respectively.

[0052] The second channel coupling unit 504 has the function of concatenating the second intermediate feature map y2 obtained from the second feature map reduction unit 503 and the third padded feature map x3pad obtained from the third padding unit 508 in the channel direction to generate the intermediate feature map y2Cx3pad. Since y2 has 192 channels and x3pad has 256 channels, the intermediate feature map y2Cx3pad has 448 channels (192 + 256).

[0053] The third feature map reduction unit 505 performs convolution in the spatial and channel directions on the intermediate feature map y2Cx3pad obtained from the second channel coupling unit 504 to generate a third intermediate feature map y3. The number of channels in y3 is 192, the width is Wy2Cx3pad / 2, and the height is Hy2Cx3pad / 2. Here, Wy2Cx3pad and Hy2Cx3pad are the width and height of the intermediate feature map y2Cx3pad, respectively.

[0054] The feature map reduction unit 102 outputs the third intermediate feature map y3 as a single-scale feature map xf and supplies it to the feature map conversion unit 103.
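The channel counts and resolutions through the reduction path can be traced with a small shape calculator. It is only bookkeeping, not the learned convolutions themselves. The function name `reduction_shapes` is hypothetical, and the sketch assumes that x2pad and x3pad have exactly half and a quarter of the width and height of x1pad, which channel concatenation requires.

```python
def reduction_shapes(x1pad_hw, c_in=256, c_mid=192):
    """Trace (channels, height, width) through the three reduction stages.

    x1pad_hw: (height, width) of x1pad, both multiples of 64.
    """
    h, w = x1pad_hw
    if h % 64 or w % 64:
        raise ValueError("x1pad height and width must be multiples of 64")
    return {
        "x1pad":    (c_in, h, w),
        "y1":       (c_mid, h // 2, w // 2),          # first reduction unit 501
        "x2pad":    (c_in, h // 2, w // 2),
        "y1Cx2pad": (c_mid + c_in, h // 2, w // 2),   # channel coupling unit 502
        "y2":       (c_mid, h // 4, w // 4),          # second reduction unit 503
        "x3pad":    (c_in, h // 4, w // 4),
        "y2Cx3pad": (c_mid + c_in, h // 4, w // 4),   # channel coupling unit 504
        "y3":       (c_mid, h // 8, w // 8),          # third reduction unit 505 (= xf)
    }

# Hypothetical x1pad of 128 x 192.
shapes = reduction_shapes((128, 192))
```

For a 128 x 192 x1pad, both concatenated maps have 448 channels and the single-scale output y3 (xf) is 192 channels at 16 x 24, one eighth of the x1pad resolution in each direction.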

[0055] The feature map reconstruction unit 206 has the function of converting the single-scale feature map xr obtained from the feature map inverse conversion unit 205 into three-layer multi-scale feature maps x1up, x2up, and x3up.

[0056] The details of the feature map restoration unit 206 will be explained using Figure 6. The feature map restoration unit 206 consists of an 8x magnification unit 601, a 4x magnification unit 602, a 2x magnification unit 603, a first feature map mixing unit 604, a second feature map mixing unit 605, a first padding removal unit 606, a second padding removal unit 607, and a third padding removal unit 608.

[0057] The 8x magnification unit 601 has the function of expanding the feature map and reducing the number of channels by performing transposed convolution in the spatial direction and convolution in the channel direction on the single-scale feature map xr obtained from the feature map inverse conversion unit 205, thereby generating an intermediate feature map z1. The number of channels in z1 is 196. If the width and height of the single-scale feature map xr are xrwidth and xrheight, respectively, then the width and height of z1 are xrwidth × 8 and xrheight × 8, respectively. Here, xrwidth × 8 and xrheight × 8 are the same as the width and height of the first padded feature map x1pad, which is the output of the first padding unit 506 of the feature map reduction unit 102.

[0058] The 4x magnification unit 602 has the function of expanding the feature map and reducing the number of channels by performing transposed convolution in the spatial direction and convolution in the channel direction on the single-scale feature map xr obtained from the feature map inverse conversion unit 205, thereby generating an intermediate feature map z2. The number of channels in z2 is 196. The width and height of z2 are xrwidth × 4 and xrheight × 4, respectively. Here, xrwidth × 4 and xrheight × 4 are the same as the width and height of the second padded feature map x2pad, which is the output of the second padding unit 507 of the feature map reduction unit 102.

[0059] The 2x magnification unit 603 has the function of expanding the feature map and reducing the number of channels by performing transposed convolution in the spatial direction and convolution in the channel direction on the single-scale feature map xr obtained from the feature map inverse conversion unit 205, thereby generating an intermediate feature map z3. The number of channels in z3 is 196. The width and height of z3 are xrwidth × 2 and xrheight × 2, respectively. Here, xrwidth × 2 and xrheight × 2 are the same as the width and height of the third padded feature map x3pad, which is the output of the third padding unit 508 of the feature map reduction unit 102.

[0060] The first feature map mixing unit 604 has the function of generating an intermediate feature map z2up, which is an improved version of the intermediate feature map z2 obtained from the 4x magnification unit 602, using the intermediate feature map z1 obtained from the 8x magnification unit 601.

[0061] The second feature map mixing unit 605 has the function of generating an intermediate feature map z3up, which is an improved version of the intermediate feature map z3 obtained from the 2x magnification unit 603, using the intermediate feature map z2up obtained from the first feature map mixing unit 604.

[0062] The first padding removal unit 606 has the function of removing padding from the intermediate feature map z1 obtained from the 8x magnification unit 601 and generating a first output feature map x1up. The width and height of x1up are the same as the width and height of the first feature map x1 input to the feature map reduction unit 102.

[0063] The second padding removal unit 607 has the function of removing padding from the intermediate feature map z2up obtained from the first feature map mixing unit 604 and generating a second output feature map x2up. The width and height of x2up are the same as the width and height of the second feature map x2 input to the feature map reduction unit 102.

[0064] The third padding removal unit 608 has the function of removing padding from the intermediate feature map z3up obtained from the second feature map mixing unit 605 and generating a third output feature map x3up. The width and height of x3up are the same as the width and height of the third feature map x3 input to the feature map reduction unit 102.

[0065] In the first padding removal unit 606, the second padding removal unit 607, and the third padding removal unit 608, as in the first padding unit 506, the second padding unit 507, and the third padding unit 508 of the feature map reduction unit 102, the left and right padding sizes are the same and the upper and lower padding sizes are the same. That is, the output feature maps x1up, x2up, and x3up are assumed to be positioned at the centers of the intermediate feature maps z1, z2up, and z3up, respectively, and padding is removed from the top, bottom, left, and right.
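Because the original map is assumed to sit at the center of the padded map, padding removal amounts to a center crop. The sketch below shows this; the function name `remove_center_padding` is hypothetical, and an odd total padding is split with the smaller half on the top and left, which is an assumption.

```python
import numpy as np

def remove_center_padding(fmap, out_h, out_w):
    """Crop a (C, H, W) map to (C, out_h, out_w), keeping the central region."""
    _, h, w = fmap.shape
    if out_h > h or out_w > w:
        raise ValueError("output size must not exceed the padded size")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return fmap[:, top:top + out_h, left:left + out_w]

# Hypothetical round trip: pad a 4 x 6 map symmetrically, then crop it back.
x = np.arange(2 * 4 * 6, dtype=np.float64).reshape(2, 4, 6)
padded = np.pad(x, ((0, 0), (2, 2), (1, 1)), mode="reflect")   # 8 x 8
restored = remove_center_padding(padded, 4, 6)
```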

[0066] <About Feature Map Conversion and Inverse Feature Map Conversion> The feature map conversion unit 103 has the function of performing packing and quantization processing on the multi-channel decimal-type single-scale feature map xf supplied from the feature map reduction unit 102, and converting it into an integer-type packing feature frame for supply to the feature map internal encoding unit 104.

[0067] The details of the feature map conversion unit 103 on the encoding side will be explained using Figure 7. The feature map conversion unit 103 consists of a packing unit 701 and a feature map quantization unit 702.

[0068] The packing unit 701 has the function of generating a packed feature frame by combining the input feature maps of multiple channels into a single frame.

[0069] In this embodiment, channel truncation is performed on the multi-channel decimal-type single-scale feature map xf supplied from the feature map reduction unit 102. Channel truncation is a method of classifying each channel as either an active channel, which is encoded, or an inactive channel, whose encoding is omitted.

[0070] Channels that are encoded by the encoding-side feature map encoding device 100 and decoded by the decoding-side feature map decoding device 200 are designated as active channels. Conversely, channels that are not encoded by the encoding side but are generated by the decoding side are designated as inactive channels. Inactive channels are treated as feature maps with predetermined elements on the decoding side.
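A minimal sketch of channel truncation with packing and unpacking follows. Several details are assumptions, since this passage does not specify them: the selection criterion (here, a channel is inactive when its largest absolute value is at or below a hypothetical threshold), the row-major tile layout, and the use of zeros as the "predetermined elements" for inactive channels. All function names are illustrative.

```python
import numpy as np

def select_active_channels(fmaps, threshold=1e-3):
    """Return one boolean flag per channel of a (C, H, W) array; True = active."""
    return np.abs(fmaps).max(axis=(1, 2)) > threshold

def pack_active(fmaps, active, cols):
    """Tile only the active channels, row-major, into a single 2-D frame."""
    _, h, w = fmaps.shape
    kept = fmaps[active]
    rows = max(1, -(-len(kept) // cols))        # ceiling division
    frame = np.zeros((rows * h, cols * w), dtype=fmaps.dtype)
    for idx, fmap in enumerate(kept):
        r, q = divmod(idx, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = fmap
    return frame

def unpack(frame, active, h, w, cols, fill_value=0.0):
    """Rebuild all channels: active ones from the frame, inactive ones filled."""
    out = np.full((len(active), h, w), fill_value, dtype=frame.dtype)
    for idx, ch in enumerate(np.flatnonzero(active)):
        r, q = divmod(idx, cols)
        out[ch] = frame[r * h:(r + 1) * h, q * w:(q + 1) * w]
    return out

# Hypothetical 6-channel input in which channels 1 and 4 carry no signal.
fmaps = np.arange(36, dtype=np.float64).reshape(6, 2, 3) + 1.0
fmaps[[1, 4]] = 0.0
active = select_active_channels(fmaps)          # flags sent as side information
frame = pack_active(fmaps, active, cols=2)      # 4 active maps in a 2 x 2 grid
decoded = unpack(frame, active, h=2, w=3, cols=2)
```

The `active` flags play the role of the per-channel information carried in the feature frame parameter set: the decoder needs them to know which tile belongs to which channel and which channels to fill in.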

[0071] Furthermore, the packing unit 701 performs flipping based on the in-frame position at which each channel's feature map is placed. A flag indicating whether or not flipping is performed may be provided and transmitted to the feature map decoding device 200 via the multiplexed bitstream.

[0072] Figure 13 illustrates the flipping process when packing multiple channel feature maps into a single frame. Flipping involves inverting the position of each channel's feature map elements (pixels) horizontally (left / right), vertically (up / down), or both horizontally and vertically (up / down / left / right) when packing the feature maps for each channel. In Figure 13, the four channel feature maps A (top left), B (top right), C (bottom left), and D (bottom right) are considered a single set. No flipping is performed at position A in Figure 13. At position B, the feature map is inverted horizontally (left / right). At position C, the feature map is inverted vertically (up / down). At position D, the feature map is inverted horizontally and vertically (up / down / left / right). When the distribution of elements in each channel's feature map is similar, performing flipping based on the in-frame position where the channels are placed reduces the boundaries between each channel's feature map, improving encoding efficiency.
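The flipping rule for positions A to D can be sketched with NumPy slicing. Extending the 2 x 2 pattern to larger frames through the parity of the tile row and column is an assumption, as are the function names.

```python
import numpy as np

def flip_for_position(fmap, row, col):
    """Flip a 2-D map according to its tile position inside the frame.

    Even row, even col (A): no flip.     Even row, odd col (B): horizontal flip.
    Odd row, even col (C): vertical flip. Odd row, odd col (D): both.
    """
    if col % 2 == 1:
        fmap = fmap[:, ::-1]     # left/right inversion
    if row % 2 == 1:
        fmap = fmap[::-1, :]     # up/down inversion
    return fmap

def pack_with_flipping(fmaps, cols):
    """Tile (C, H, W) maps row-major into one frame, flipping by position."""
    c, h, w = fmaps.shape
    rows = -(-c // cols)         # ceiling division
    frame = np.zeros((rows * h, cols * w), dtype=fmaps.dtype)
    for idx in range(c):
        r, q = divmod(idx, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = flip_for_position(fmaps[idx], r, q)
    return frame

def unpack_with_flipping(frame, c, h, w, cols):
    """Inverse of pack_with_flipping; each flip is its own inverse."""
    out = np.zeros((c, h, w), dtype=frame.dtype)
    for idx in range(c):
        r, q = divmod(idx, cols)
        tile = frame[r * h:(r + 1) * h, q * w:(q + 1) * w]
        out[idx] = flip_for_position(tile, r, q)
    return out

# Hypothetical set of four identical 2 x 2 maps packed as A, B, C, D.
base = np.array([[1.0, 2.0], [3.0, 4.0]])
fmaps = np.stack([base, base, base, base])
frame = pack_with_flipping(fmaps, cols=2)
restored = unpack_with_flipping(frame, c=4, h=2, w=2, cols=2)
```

With identical maps, the packed frame is mirror-symmetric about both axes, so the values on either side of each tile boundary match. This is the boundary-smoothing effect the paragraph attributes to flipping when channel distributions are similar.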

[0073] The packing unit 701 supplies the packing feature frame to the feature map quantization unit 702, and also supplies information such as an index for identifying the active channel, as a feature frame parameter set, to the feature frame parameter set encoding unit 106.

[0074] The feature map quantization unit 702 converts the elements of a decimal-type packing feature frame (feature map of all channels) into an N-bit integer type within a predetermined range (N is an integer of approximately 8 to 16) and outputs an integer-type packing feature frame. In this embodiment, the elements are assumed to be converted into a 10-bit integer type ranging from 0 to 1023. The feature map quantization unit 702 detects the minimum and maximum values of the elements of the decimal-type packing feature frame and transmits these values to the decoding side as metadata. To convert a decimal-type packing feature frame into an integer-type packing feature frame, a linear transformation is performed in which the minimum decimal value corresponds to the minimum integer value and the maximum decimal value corresponds to the maximum integer value. For example, when the integer range is represented by 10 bits, the minimum value of the elements of the integer-type packing feature frame is 0 and the maximum value is 1023 (2^10 - 1). Values between the minimum and maximum are linearly quantized.
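The min-max linear quantization described above can be sketched as follows. This is an illustrative sketch only: the function name quantize_frame is hypothetical, rounding to the nearest integer is an assumption (the text only states that the quantization is linear), and the handling of a frame whose elements are all equal is also assumed.

```python
def quantize_frame(frame, bits=10):
    """Linearly map float elements to integers in [0, 2**bits - 1].

    Returns the integer frame and the (min, max) pair that would be
    sent to the decoding side as metadata.
    """
    flat = [v for row in frame for v in row]
    lo, hi = min(flat), max(flat)
    max_int = (1 << bits) - 1          # 1023 for 10 bits
    span = hi - lo
    if span == 0:
        # Assumption: a constant frame is mapped to all zeros.
        return [[0 for _ in row] for row in frame], (lo, hi)
    # Assumption: round to the nearest integer level.
    q = [[round((v - lo) / span * max_int) for v in row] for row in frame]
    return q, (lo, hi)


frame = [[-1.0, 0.0],
         [0.5, 1.0]]
q, meta = quantize_frame(frame)
print(q, meta)
```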

[0075] Next, the feature map inverse transform unit 205 has the function of performing inverse quantization and unpacking on integer-type packing feature frames decoded by VVC, HEVC, AV1, etc. supplied from the feature map internal decoding unit 204, and converting them into a decimal-type single-scale feature map xr for supply to the feature map reconstruction unit 206.

[0076] Figure 8 will be used to explain the details of the feature map inverse transform unit 205 on the decoding side. The feature map inverse transform unit 205 performs the inverse of the processing of the feature map transform unit 103 and is composed of a feature map inverse quantization unit 801 and an unpacking unit 802.

[0077] The feature map inverse quantization unit 801 performs the inverse processing of the encoding-side feature map quantization unit 702 and has the function of converting the elements of the integer-type packing feature frame from integer type to decimal type. The feature map inverse quantization unit 801 converts the integer-type packing feature frame decoded by the feature map internal decoding unit 204 into a decimal-type packing feature frame using the minimum and maximum decimal values transmitted as metadata. A linear transformation is performed to make the minimum integer value equivalent to the minimum decimal value and the maximum integer value equivalent to the maximum decimal value. For values between the minimum and maximum values, linear inverse quantization is performed.
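The linear inverse quantization can be sketched as below, assuming a 10-bit integer range and that the minimum and maximum decimal values arrive as metadata. The function name dequantize_frame is hypothetical and not taken from the specification.

```python
def dequantize_frame(int_frame, min_val, max_val, bits=10):
    """Map integers in [0, 2**bits - 1] back to floats in [min_val, max_val]."""
    max_int = (1 << bits) - 1
    scale = (max_val - min_val) / max_int   # size of one quantization step
    return [[min_val + v * scale for v in row] for row in int_frame]


restored = dequantize_frame([[0, 1023], [512, 767]], -1.0, 1.0)
print(restored)
```

Integer 0 maps back to the minimum decimal value and integer 1023 to the maximum; intermediate values are recovered only to within one quantization step, since quantization is lossy.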

[0078] The unpacking unit 802 extracts the feature maps for each channel from the packing feature frames arranged in a single frame in the order of the raster scan, and supplies them to the feature map reconstruction unit 206 as a single-scale feature map xr.

[0079] <About Internal Feature Map Encoding and Decoding> Figure 9 will be used to explain the details of the feature map internal encoding unit 104. The feature map internal encoding unit 104 consists of a switch 901, a VVC encoding unit 902, a HEVC encoding unit 903, and an AV1 encoding unit 904. The switch 901 selects an encoding standard for internally encoding the feature map converted by the feature map conversion unit 103. The VVC encoding unit 902 encodes the feature map using the VVC standard and outputs a bitstream compliant with the VVC standard. The HEVC encoding unit 903 encodes the feature map using the HEVC standard and outputs a bitstream compliant with the HEVC standard. The AV1 encoding unit 904 encodes the feature map using the AV1 standard and outputs a bitstream compliant with the AV1 standard.

[0080] In the VVC, HEVC, and AV1 standards, images are divided into predetermined block sizes for encoding.

[0081] It is also possible to implement only one of the following: VVC, HEVC, or AV1. Furthermore, it is possible to use image encoding schemes other than VVC, HEVC, and AV1.

[0082] Next, the details of the feature map internal decoding unit 204 will be explained using Figure 10. The feature map internal decoding unit 204 consists of a switch 1001, a VVC decoding unit 1002, a HEVC decoding unit 1003, and an AV1 decoding unit 1004. The switch 1001 selects the decoding unit used for internal decoding based on the information in the input bitstream that identifies the encoding standard. The VVC decoding unit 1002 decodes the feature map using the VVC standard. The HEVC decoding unit 1003 decodes the feature map using the HEVC standard. The AV1 decoding unit 1004 decodes the feature map using the AV1 standard.

[0083] In the VVC, HEVC, and AV1 standards, decoding is performed for each predetermined block size.

[0084] It is also possible to implement only one of the following: VVC, HEVC, or AV1. Furthermore, it is possible to use image encoding schemes other than VVC, HEVC, and AV1.

[0085] <Feature Sequence Parameter Set> The feature sequence parameter set feature_sequence_parameter_set is described below. Figure 17 shows an example of the syntax rules for the feature sequence parameter set feature_sequence_parameter_set. The syntax elements packing_frame_width and packing_frame_height indicate the horizontal and vertical dimensions of the packing feature frame, respectively. The syntax elements channel_width and channel_height indicate the horizontal and vertical dimensions of each channel of the feature map packed into the packing feature frame, respectively. The syntax element num_feature_channel indicates the number of channels in the feature map packed into the packing feature frame.

[0086] The feature_sequence_parameter_set is a parameter set containing parameters common to all packing feature frames. Therefore, it is not necessary to transmit the feature_sequence_parameter_set with every packing feature frame. It is sufficient to transmit it at appropriate intervals to cover cases where the feature map decoding device 200 is unable to obtain the feature_sequence_parameter_set due to a transmission error, or where the feature map decoding device 200 starts receiving the multiplexed bitstream partway through.

[0087] <Feature Frame Parameter Set> The feature frame parameter set feature_picture_parameter_set is explained below. Figure 18 shows an example of the syntax rules for the feature frame parameter set feature_picture_parameter_set. The syntax element truncates_features indicates whether or not channel truncation is performed on the packing feature frame. truncates_features=0 indicates that channel truncation is not performed on the packing feature frame. truncates_features=1 indicates that channel truncation is performed on the packing feature frame. The syntax element is_active[n] indicates whether channel n is an active channel or an inactive channel. When is_active[n]=0, channel n is an inactive channel. When is_active[n]=1, channel n is an active channel. When truncates_features=0, channel truncation is not performed, so transmission of is_active[n] is unnecessary. When truncates_features=1, is_active[n] is transmitted for each of the num_feature_channel channels. num_feature_channel has already been transmitted in feature_sequence_parameter_set. The feature_picture_parameter_set is a parameter set containing parameters specific to each packing feature frame. Therefore, the feature_picture_parameter_set is transmitted with every packing feature frame.
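The conditional presence of is_active[n] can be illustrated with the sketch below. Figure 18 is not reproduced here, so the bit widths are an assumption: each syntax element is treated as a one-bit flag held in a Python list. The function names encode_fpps and decode_fpps are hypothetical.

```python
def encode_fpps(truncates_features, is_active):
    """Serialize a feature_picture_parameter_set as a list of 1-bit flags."""
    bits = [1 if truncates_features else 0]
    if truncates_features:
        # is_active[n] is present only when channel truncation is performed.
        bits.extend(1 if a else 0 for a in is_active)
    return bits


def decode_fpps(bits, num_feature_channel):
    """Parse the flags; num_feature_channel comes from the sequence parameter set."""
    truncates_features = bits[0]
    if truncates_features == 0:
        # First embodiment: no truncation, so every channel is active.
        return 0, [1] * num_feature_channel
    return 1, list(bits[1:1 + num_feature_channel])


print(encode_fpps(1, [1, 0, 1, 1]))
print(decode_fpps([1, 1, 0, 1, 1], 4))
print(decode_fpps([0], 4))
```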

[0088] <Encoded Video Bitstream> The encoded video bitstream will be explained below. Figure 19 shows an example of the syntax rules for the encoded video bitstream coded_video_data_unit. The encoded video bitstream coded_video_data_unit consists of two layers: the encoded video header coded_video_header and the encoded video data coded_video_data. The encoded video header coded_video_header contains auxiliary information about the encoded video data coded_video_data. The encoded video data coded_video_data is the bitstream of the image encoding standard handled by the feature map internal encoding unit 104 and the feature map internal decoding unit 204. The syntax element codec_index of the encoded video header coded_video_header is encoding information that identifies the image encoding standard. codec_index=0 indicates VVC, codec_index=1 indicates HEVC, and codec_index=2 indicates AV1. The feature map internal decoding unit 204 switches the switch 1001 according to the value of codec_index to select one of the VVC decoding unit 1002, HEVC decoding unit 1003, or AV1 decoding unit 1004. The correspondence between the value of codec_index and each image encoding standard may be changed as needed.
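The role of codec_index as a selector for the switch 1001 can be sketched as a simple lookup. The table below uses the correspondence given in this paragraph; the function name select_codec and the option of passing a different table are illustrative assumptions, and no actual VVC, HEVC, or AV1 decoder is invoked.

```python
# Correspondence stated in the text: 0 = VVC, 1 = HEVC, 2 = AV1.
CODEC_BY_INDEX = {0: "VVC", 1: "HEVC", 2: "AV1"}


def select_codec(codec_index, table=CODEC_BY_INDEX):
    """Return the name of the standard selected by codec_index."""
    if codec_index not in table:
        raise ValueError(f"unsupported codec_index: {codec_index}")
    return table[codec_index]


print(select_codec(0))
print(select_codec(2))
```

Passing the table as a parameter reflects the remark that the correspondence may be changed as needed, provided the encoding side and the decoding side agree on it.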

[0089] <Feature Map Coding Processing Procedure> The feature map coding processing procedure performed by the feature map conversion unit 103, feature map internal coding unit 104, feature sequence parameter set coding unit 105, feature frame parameter set coding unit 106, and multiplexing unit 107 of the feature map coding device 100 will be described below. Figure 15 is a flowchart illustrating the feature map coding processing procedure according to the first embodiment.

[0090] The feature sequence parameter set encoding unit 105 creates a feature sequence parameter set (feature_sequence_parameter_set) according to the user settings (step S1000 in Figure 15).

[0091] The following steps are processed frame by frame (steps S1001 and S1008 in Figure 15).

[0092] First, the packing unit 701 of the feature map conversion unit 103 evaluates the importance of the feature map for each channel and determines active and inactive channels (step S1002 in Figure 15). By omitting the encoding of inactive channels, which are channels of little importance, encoding efficiency is improved, and encoding degradation of high-importance channels, which have a significant impact on the image recognition result, is suppressed. Statistical information such as the mean, median, and variance of the elements in each channel can be used to determine importance. For example, the packing unit 701 may calculate the statistical value of the elements contained in a certain channel, and if that statistical value is greater than a predetermined value, that channel may be determined to be an active channel. If the packing unit 701 determines that all channels are active channels, it does not discard any channels.
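The importance test in step S1002 can be sketched as follows. This is only one possible instance: it assumes the variance is the statistic used (the text also mentions the mean and median), and the function name classify_channels and the threshold value are hypothetical.

```python
from statistics import pvariance


def classify_channels(channels, threshold):
    """Mark a channel active when the variance of its elements exceeds threshold.

    Returns (truncates_features, is_active). truncates_features is 0 when
    every channel is active, i.e. nothing is discarded.
    """
    is_active = []
    for ch in channels:
        flat = [v for row in ch for v in row]
        is_active.append(1 if pvariance(flat) > threshold else 0)
    truncates_features = 0 if all(is_active) else 1
    return truncates_features, is_active


channels = [
    [[0.0, 0.0], [0.0, 0.0]],   # flat channel, variance 0
    [[0.0, 1.0], [2.0, 3.0]],   # varied channel, variance 1.25
]
print(classify_channels(channels, threshold=0.5))
```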

[0093] Next, the packing unit 701 packs only the feature maps of the determined active channels to generate a packing feature frame (step S1003 in Figure 15). Inactive channels, which are not to be encoded, are discarded. If the feature maps of the active channels do not fill the packing feature frame and gaps remain, the packing unit 701 sets the elements in the gaps to a predetermined value (such as 0).
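Packing only the active channels into one frame, with gap filling, can be sketched as below. The sketch assumes raster-scan placement of equally sized channels and omits flipping; the function name pack_active and its parameters are hypothetical.

```python
def pack_active(channels, is_active, frame_w, frame_h, fill=0):
    """Place active channels into a frame_h x frame_w grid in raster-scan order."""
    ch_h = len(channels[0])
    ch_w = len(channels[0][0])
    cols = frame_w // ch_w                 # channel slots per row of the frame
    rows = frame_h // ch_h
    active = [ch for ch, a in zip(channels, is_active) if a]
    if len(active) > cols * rows:
        raise ValueError("active channels do not fit in the packing frame")
    # Start from a frame filled with the predetermined gap value.
    frame = [[fill] * frame_w for _ in range(frame_h)]
    for slot, ch in enumerate(active):
        top = (slot // cols) * ch_h
        left = (slot % cols) * ch_w
        for y in range(ch_h):
            frame[top + y][left:left + ch_w] = ch[y]
    return frame


c0 = [[1, 1], [1, 1]]
c1 = [[2, 2], [2, 2]]
c2 = [[3, 3], [3, 3]]
packed = pack_active([c0, c1, c2], [1, 0, 1], frame_w=4, frame_h=4)
for row in packed:
    print(row)
```

In this example the inactive channel c1 is dropped, so c2 occupies the second slot and the unused lower half of the frame keeps the fill value.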

[0094] The packing unit 701 supplies the packing feature frame to the feature map quantization unit 702, and also supplies the feature frame parameter set encoding unit 106 with truncates_features, which indicates whether or not to perform channel truncation, and is_active, which identifies the active channel if channel truncation is performed.

[0095] Next, the feature map quantization unit 702 of the feature map conversion unit 103 quantizes the packing feature frame and converts the packing feature frame from a decimal type to an integer type (step S1004 in Figure 15).

[0096] Next, the feature frame parameter set encoding unit 106 encodes truncates_features and is_active obtained from the packing unit 701 according to the rules shown in Figure 18 to generate the feature frame parameter set feature_picture_parameter_set (step S1005 in Figure 15).

[0097] Next, the feature map internal encoding unit 104 selects an image encoding standard for encoding the packing feature frame. According to the selected image encoding standard, the packing feature frame is encoded by the VVC encoding unit 902, the HEVC encoding unit 903, or the AV1 encoding unit 904, generating encoded video data in which the packing feature frame is encoded. Furthermore, a coded_video_header containing a codec_index that identifies the image encoding standard is created, and the coded_video_header and the encoded video data are combined to create an encoded video bitstream (step S1006 in Figure 15).

[0098] Next, the multiplexing unit 107 multiplexes the feature sequence parameter set, the feature frame parameter set, and the encoded video bitstream to generate a multiplexed bitstream (step S1007 in Figure 15). However, as mentioned above, the feature sequence parameter set does not need to be transmitted every frame, so it is transmitted only in the first frame and at predetermined frame intervals set by the user.
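The periodic transmission of the feature sequence parameter set can be sketched as follows. The ordering of the units within a frame and the function name units_for_frame are assumptions made for illustration; the text only states that the sequence parameter set is sent in the first frame and at a user-set interval, while the other two are sent for every frame.

```python
def units_for_frame(frame_idx, sps_interval):
    """List the units multiplexed for one frame (ordering is assumed)."""
    if sps_interval < 1:
        raise ValueError("sps_interval must be at least 1")
    units = []
    # Sent in the first frame and then every sps_interval frames.
    if frame_idx % sps_interval == 0:
        units.append("feature_sequence_parameter_set")
    units.append("feature_picture_parameter_set")
    units.append("coded_video_data_unit")
    return units


for i in range(5):
    print(i, units_for_frame(i, sps_interval=4))
```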

[0099] <Feature Map Decoding Processing Procedure> The feature map decoding processing procedure performed by the multiplexing and decoupling unit 201, feature sequence parameter set decoding unit 202, feature frame parameter set decoding unit 203, feature map internal decoding unit 204, and feature map inverse conversion unit 205 of the feature map decoding device 200 will be described below. Figure 16 is a flowchart illustrating the feature map decoding processing procedure according to the first embodiment.

[0100] The multiplexing and decoupling unit 201 demultiplexes the multiplexed bitstream supplied from the feature map encoding device 100, separates the feature sequence parameter set, and supplies it to the feature sequence parameter set decoding unit 202. The feature sequence parameter set decoding unit 202 decodes the acquired feature sequence parameter set (step S2000 in Figure 16).

[0101] The following steps are processed frame by frame (steps S2001 and S2007 in Figure 16).

[0102] The multiplexing and decoupling unit 201 demultiplexes the multiplexed bitstream supplied from the feature map encoding device 100, separating the feature frame parameter set from the encoded video bitstream. The feature frame parameter set is supplied to the feature map inverse conversion unit 205, and the encoded video bitstream is supplied to the feature map internal decoding unit 204 (step S2002 in Figure 16).

[0103] The feature map internal decoding unit 204 decodes the codec_index from the coded_video_header of the encoded video bitstream. According to the value of the codec_index, the VVC decoding unit 1002, HEVC decoding unit 1003, or AV1 decoding unit 1004 decodes the bitstream in which the packing feature frame is encoded and generates the packing feature frame (step S2003 in Figure 16).

[0104] Next, the feature frame parameter set decoding unit 203 decodes the feature frame parameter set according to the rules shown in Figure 18 (step S2004 in Figure 16).

[0105] In the feature map inverse conversion unit 205, the feature map inverse quantization unit 801 converts the packing feature frame from integer type to decimal type (step S2005 in Figure 16).

[0106] Next, the unpacking unit 802 of the feature map inverse transform unit 205 unpacks the packing feature frame using the feature sequence parameters obtained from the feature sequence parameter set decoding unit 202 and the feature frame parameters obtained from the feature frame parameter set decoding unit 203 to generate a feature map for each active channel (step S2006 in Figure 16). For each inactive channel, a feature map is generated by setting all elements (each pixel) to a predetermined value such as 0.
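Unpacking with regeneration of inactive channels can be sketched as below. The sketch assumes raster-scan placement of the active channels, no flipping, and a fill value of 0 for inactive channels; the function name unpack_frame is hypothetical, and channel_width and channel_height are taken as given by the feature sequence parameter set.

```python
def unpack_frame(frame, is_active, channel_width, channel_height, fill=0):
    """Rebuild one feature map per channel from a packing feature frame."""
    cols = len(frame[0]) // channel_width   # channel slots per frame row
    channels = []
    slot = 0                                # index among active channels only
    for active in is_active:
        if active:
            top = (slot // cols) * channel_height
            left = (slot % cols) * channel_width
            channels.append([frame[top + y][left:left + channel_width]
                             for y in range(channel_height)])
            slot += 1
        else:
            # Inactive channel: generated on the decoding side.
            channels.append([[fill] * channel_width
                             for _ in range(channel_height)])
    return channels


frame = [[1, 1, 3, 3],
         [1, 1, 3, 3],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
maps = unpack_frame(frame, [1, 0, 1], channel_width=2, channel_height=2)
for m in maps:
    print(m)
```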

[0107] By adopting this embodiment, it is possible to appropriately determine whether or not to discard channels, and to set active and inactive channels on a frame-by-frame basis, thereby improving encoding efficiency.

[0108] (Second Embodiment) In the first embodiment, truncates_features is a syntax element that indicates whether or not to perform channel truncation in the packing feature frame. In another embodiment, truncates_features is defined as a syntax element that indicates whether or not to update the settings for active and inactive channels in the packing feature frame.

[0109] In this embodiment, similar to the first embodiment, the feature frame parameter set encoding unit 106 determines active and inactive channels when truncates_features=1, and transmits the syntax element is_active for each channel in the packing feature frame. The feature frame parameter set decoding unit 203 decodes the syntax element is_active for each channel and determines whether each channel is an active or inactive channel.

[0110] On the other hand, when truncates_features=0, the feature frame parameter set encoding unit 106 does not transmit the syntax element is_active, but instead inherits the is_active from the previous frame. In this case, the feature frame parameter set decoding unit 203 does not decode the syntax element is_active, but instead inherits the is_active from the previous frame. Therefore, in that frame, it is possible to determine the active channel and the inactive channel without is_active.
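The inheritance rule of this embodiment can be sketched on the decoding side as follows. Each frame's parameter set is modeled as a list of one-bit flags, which is an assumption about the layout. The text does not say what happens if the very first frame carries truncates_features=0, so this sketch treats that case as an error; the function name decode_is_active_stream is hypothetical.

```python
def decode_is_active_stream(frames, num_feature_channel):
    """frames: one flag list per packing feature frame, truncates_features first."""
    previous = None
    result = []
    for bits in frames:
        if bits[0] == 1:
            # Update: is_active is present in this frame.
            previous = list(bits[1:1 + num_feature_channel])
        elif previous is None:
            # Assumption: nothing to inherit from, so reject the stream.
            raise ValueError("no earlier is_active to inherit")
        # truncates_features == 0: reuse the previous frame's is_active.
        result.append(list(previous))
    return result


stream = [[1, 1, 0, 1], [0], [1, 0, 0, 1], [0]]
print(decode_is_active_stream(stream, 3))
```

Frames two and four carry a single flag instead of one flag per channel plus the update flag, which is where the code-amount saving comes from.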

[0111] By adopting this embodiment, the active and inactive channels can be determined without the syntax element is_active, thus reducing the amount of code and improving encoding efficiency.

[0112] (Third Embodiment) This embodiment determines whether to update the active and inactive channel settings in a frame without transmitting the truncates_features of the feature_picture_parameter_set.

[0113] Figure 20 shows an example of the syntax rules for the feature_picture_parameter_set and the coded_video_header in this embodiment. The coded_video_header in this embodiment includes the syntax element frame_type, which indicates the frame type in the internal coding. Generally, in image coding standards, each frame takes one of the following frame types: an intraframe (IntraFrame), which is coded using only information within the frame itself; a uni-predictive frame (PFrame), which references a single coded picture; or a bi-predictive frame (BFrame), which references multiple coded pictures. Of these, the IntraFrame is the frame type typically used for random access.

[0114] The feature sequence parameter set encoding unit 105 transmits a feature sequence parameter set including the syntax element frame_type. The feature sequence parameter set decoding unit 202 decodes the feature sequence parameter set including the syntax element frame_type.

[0115] Instead of transmitting truncates_features in the feature frame parameter set feature_picture_parameter_set, the feature frame parameter set encoding unit 106 determines the active and inactive channels and transmits the syntax element is_active for each channel when the syntax element frame_type=IntraFrame, i.e., when the internally encoded frame is an intraframe. The feature frame parameter set decoding unit 203 decodes the syntax element is_active for each channel when the syntax element frame_type=IntraFrame, i.e., when the internally encoded frame is an intraframe, and determines the active and inactive channels.

[0116] On the other hand, when the syntax element frame_type is not IntraFrame, i.e., when the internally encoded frame is not an intraframe, the feature frame parameter set encoding unit 106 does not transmit the syntax element is_active, but instead inherits the is_active from the previous frame. In this case, the feature frame parameter set decoding unit 203 does not decode the syntax element is_active, but instead inherits the is_active from the previous frame. Therefore, in that frame, it is possible to determine the active channel and the inactive channel without is_active.
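The frame-type-driven update of this embodiment can be sketched as follows. Frame types are modeled as the strings "IntraFrame", "PFrame", and "BFrame", and each frame as a (frame_type, is_active or None) pair; both are assumptions about representation, as is rejecting a stream that does not begin with an intraframe. The function name track_is_active is hypothetical.

```python
def track_is_active(frames, num_feature_channel):
    """frames: iterable of (frame_type, is_active flags or None)."""
    current = None
    result = []
    for frame_type, flags in frames:
        if frame_type == "IntraFrame":
            # is_active is transmitted only with intraframes.
            current = list(flags[:num_feature_channel])
        elif current is None:
            # Assumption: a stream must start with an intraframe.
            raise ValueError("no earlier is_active to inherit")
        # Non-intra frames inherit the setting of the previous frame.
        result.append(list(current))
    return result


seq = [("IntraFrame", [1, 0]), ("PFrame", None), ("BFrame", None),
       ("IntraFrame", [0, 1]), ("PFrame", None)]
print(track_is_active(seq, 2))
```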

[0117] Information corresponding to the frame type is described within the coded video data (coded_video_data). However, the frame type described within the coded video data cannot be retrieved without interpreting the coded video data. By including it in the coded video header (coded_video_header), the frame type can be easily obtained without scanning the coded video data (coded_video_data).

[0118] By adopting this embodiment, active and inactive channels can be determined simply without the syntax element is_active, without the need for complicated settings for each frame. This reduces the amount of processing required for channel truncation, reduces the amount of code, and improves encoding efficiency.

[0119] In all the embodiments described above, the bitstream output by the feature map encoding device has a specific data format so that it can be decoded according to the encoding method used in the embodiment. Furthermore, the feature map decoding device corresponding to this feature map encoding device can decode the bitstream of this specific data format.

[0120] When a wired or wireless network is used to exchange bitstreams between a feature map encoding device and a feature map decoding device, the bitstream may be converted to a data format suitable for the transmission mode of the communication channel before transmission. In this case, a transmitting device is provided that converts the bitstream output by the feature map encoding device into encoded data in a data format suitable for the transmission mode of the communication channel and transmits it to the network, and a receiving device is provided that receives the encoded data from the network, restores it to a bitstream, and supplies it to the feature map decoding device. The transmitting device includes a memory for buffering the bitstream output by the feature map encoding device, a packet processing unit for packetizing the bitstream, and a transmitting unit for transmitting the packetized encoded data over the network. The receiving device includes a receiving unit for receiving the packetized encoded data over the network, a memory for buffering the received encoded data, and a packet processing unit for depacketizing the encoded data to generate a bitstream and providing it to the feature map decoding device.

[0121] The above encoding and decoding processes may be implemented not only as hardware-based transmission, storage, and receiving devices, but also by firmware stored in ROM (read-only memory) or flash memory, or by software on a computer. The firmware program or software program may be recorded on a recording medium readable by a computer and provided, provided from a server via a wired or wireless network, or provided as data broadcasting on terrestrial or satellite digital broadcasting.

[0122] The present invention has been described above based on embodiments. The embodiments are illustrative, and it will be understood by those skilled in the art that various modifications are possible in combinations of their components and processing processes, and that such modifications also fall within the scope of the present invention.

[0123] This invention can be used in feature map encoding and decoding techniques.

[0124] 100 Feature map encoding unit, 101 Neural network feature extraction unit, 102 Feature map reduction unit, 103 Feature map transformation unit, 104 Feature map internal encoding unit, 105 Feature sequence parameter set encoding unit, 106 Feature frame parameter set encoding unit, 107 Multiplexing unit, 200 Feature map decoding unit, 201 Multiplexing / decoupling unit, 202 Feature sequence parameter set decoding unit, 203 Feature frame parameter set decoding unit, 204 Feature map internal decoding unit, 205 Feature map inverse transformation unit, 206 Feature map restoration unit, 207 Neural network identification unit, 301 Convolution processing unit, 302 Activation processing unit, 303 Pooling processing unit, 322 Bottom-up processing unit, 324 Top-down processing unit, 326 Image to be processed for feature extraction, 501 First feature map reduction unit, 502 First channel joining unit, 503 Second feature map reduction unit, 504 505 Second channel coupling section, 506 Third feature map reduction section, 506 First padding section, 507 Second padding section, 508 Third padding section, 601 8x magnification section, 602 4x magnification section, 603 2x magnification section, 604 First feature map mixing section, 605 Second feature map mixing section, 606 First padding removal section, 607 Second padding removal section, 608 Third padding removal section, 701 Packing section, 702 Feature map quantization section, 801 Feature map inverse quantization section, 802 Unpacking section, 901 Switch, 902 VVC encoding section, 903 HEVC encoding section, 904 AV1 encoding section, 1001 Switch, 1002 VVC decoding section, 1003 HEVC decoding section, 1004 AV1 decoding section.

Claims

1. A feature map encoding device comprising: a packing unit that determines the active channel to be encoded from the feature maps of multiple channels, and generates a single packing feature frame by packing only the feature map of the active channel; and a feature frame parameter set encoding unit that encodes a feature frame parameter set containing information indicating whether or not it is an active channel for each feature map of the packing feature frame, on a packing feature frame basis.

2. A feature map encoding method comprising: a packing step of determining an active channel to be encoded from feature maps of multiple channels, and generating a single packing feature frame by packing only the feature map of the active channel; and a feature frame parameter set encoding step of encoding a feature frame parameter set containing information indicating whether or not it is an active channel for each feature map of the packing feature frame, for each packing feature frame.

3. A feature map encoding program characterized by causing a computer to perform the following steps: a packing step of determining the active channel to be encoded from the feature maps of multiple channels, and generating a single packing feature frame by packing only the feature map of the active channel; and a feature frame parameter set encoding step of encoding a feature frame parameter set containing information indicating whether or not it is an active channel for each feature map of the packing feature frame for each packing feature frame.

4. A feature map decoding device comprising: a feature frame parameter set decoding unit that decodes a feature frame parameter set containing information indicating whether or not a feature map is an active channel for each feature map of a packing feature frame on a packing feature frame basis; and an unpacking unit that generates the feature map by unpacking the packing feature frame for feature maps of active channels, and generates the feature map for inactive channels that are not active channels using a predetermined procedure.

5. A feature map decoding method comprising: a feature frame parameter set decoding step of decoding a feature frame parameter set that includes information indicating whether or not it is an active channel for each feature map of a packing feature frame on a packing feature frame basis; and an unpacking step of generating the feature map by unpacking the packing feature frame for the feature map of an active channel, and generating the feature map in a predetermined procedure for inactive channels that are not active channels.

6. A feature map decoding program characterized by causing a computer to perform the following steps: a feature frame parameter set decoding step of decoding a feature frame parameter set that includes information indicating whether or not it is an active channel for each feature map of a packing feature frame on a packing feature frame basis; and an unpacking step of unpacking the packing feature frame to generate the feature map for an active channel, and generating the feature map for an inactive channel that is not an active channel using a predetermined procedure.