Group convolution operation method, group convolution operation device and artificial intelligence processor

By generating a recombined weight tensor and calling the standard convolution function, the problem of low computational efficiency of group convolution is solved, improving the computing performance and hardware resource utilization of GPGPU, and realizing efficient group convolution operation.

CN121882136B · Active · Publication Date: 2026-06-23 · SHANGHAI BIREN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI BIREN TECH CO LTD
Filing Date
2026-03-06
Publication Date
2026-06-23


Abstract

Embodiments of the present disclosure provide a group convolution operation method, a group convolution operation device and an artificial intelligence processor. The group convolution operation method comprises: obtaining at least one group convolution operator, wherein the group convolution operator comprises an input channel number, an output channel number, a group number and a corresponding original weight tensor; in response to the input channel number and the output channel number being divisible by the group number, performing a weight rearrangement operation on the original weight tensor to generate a reorganized weight tensor for standard convolution operation, the reorganized weight tensor being a block-diagonal sparse weight tensor; and calling a standard convolution function to perform convolution operation based on an input tensor and the reorganized weight tensor. The group convolution operation method fully utilizes the parallel computing capability of a general-purpose graphics processor by combining multiple small group convolutions into a single large standard convolution, reduces the number of kernel launches, and improves the overall computing performance of the system.

Description

Technical Field

[0001] The embodiments disclosed herein relate to the field of artificial intelligence processing, and specifically to a group convolution operation method, a group convolution operation apparatus, and an artificial intelligence processor.

Background Technology

[0002] Currently, in deep learning frameworks and GPU acceleration libraries, group convolution is mainly implemented either with dedicated group convolution kernels or by splitting the group convolution into multiple ordinary convolution kernels that are called separately. A typical implementation divides the input feature map into multiple sub-tensors according to the number of groups. Each sub-tensor is convolved independently with the weight tensor of its corresponding group, and the results are then concatenated. Specifically, when the number of groups is G, a GPU acceleration library might, for example, launch an independent convolution task for each group, with each task processing a sub-convolution that has C_in / G input channels and C_out / G output channels.

Summary of the Invention

[0003] At least one embodiment of this disclosure provides a group convolution operation method, comprising: obtaining at least one group convolution operator, wherein the group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor; in response to the number of input channels and the number of output channels being divisible by the number of groups, performing a weight rearrangement operation on the original weight tensor to generate a recombined weight tensor for standard convolution operations; and invoking a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor.

[0004] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the recombined weight tensor is a block diagonal sparse weight tensor. The original weight tensor is subjected to a weight rearrangement operation to generate a recombined weight tensor for standard convolution operations. This includes: inserting a weight rearrangement operator before the group convolution operator, wherein the weight rearrangement operator is used to create a target weight tensor with all zeros, and placing the sub-weight blocks corresponding to each group in the original weight tensor in the group order at the block diagonal position of the target weight tensor to generate a block diagonal sparse weight tensor.

[0005] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the horizontal dimension of the all-zero target weight tensor is determined based on the number of input channels, and the vertical dimension of the all-zero target weight tensor is determined based on the number of output channels.

[0006] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the input tensor includes input feature map data, and calling a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor includes: setting the group number parameter of the standard convolution function to 1, and the weight parameter of the standard convolution function to the recombined weight tensor; and calling the standard convolution function to perform convolution calculation based on the input feature map data and the recombined weight tensor.

[0007] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the group convolution operator includes at least one of a 2D group convolution operator, a 3D group convolution operator, and a transposed group convolution operator.

[0008] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the group convolution operator further includes a depthwise separable convolution operator, wherein the number of groups of the depthwise separable convolution operator is equal to the number of output channels of the depthwise separable convolution operator.

[0009] For example, in the group convolution operation method provided in at least one embodiment of this disclosure, the 2D group convolution operator is used for at least one of image data processing, text classification, and speech recognition; the 3D group convolution operator is used for at least one of medical image analysis, video understanding and behavior recognition, and autonomous driving and robot perception; and the transposed group convolution operator is used for at least one of upsampling or generation tasks.

[0010] For example, at least one embodiment of the present disclosure provides a group convolution operation method that further includes: in response to at least one of the number of input channels and the number of output channels not being divisible by the number of groups, performing a padding operation on the group convolution operator so that the number of input channels and the number of output channels after padding are both divisible by the number of groups.

[0011] At least one embodiment of this disclosure also provides a group convolution operation apparatus, comprising: an acquisition module configured to acquire at least one group convolution operator, wherein the group convolution operator includes an input channel number, an output channel number, a group number, and a corresponding original weight tensor; a weight rearrangement module configured to perform a weight rearrangement operation on the original weight tensor in response to both the input channel number and the output channel number being divisible by the group number, to generate a recombined weight tensor for standard convolution operations; and a calling module configured to call a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor.

[0012] At least one embodiment of this disclosure also provides an artificial intelligence processor, including the group convolution operation apparatus provided in the at least one embodiment above.

[0013] At least one embodiment of this disclosure also provides an electronic device, including: a processor; and a memory including one or more computer program modules; wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules are used to perform a group convolution operation method according to the at least one embodiment described above.

[0014] At least one embodiment of this disclosure also provides a non-transitory storage medium for storing computer-executable instructions, wherein when the computer-executable instructions are executed by a computer, the group convolution operation method according to the at least one embodiment described above is performed.

[0015] The group convolution operation method of at least one embodiment of this disclosure merges multiple small group convolutions into a single large convolution (i.e., a standard convolution). This makes full use of the parallel computing capability of, for example, a GPGPU, reduces the number of kernel launches, and improves the computing performance of the system. The method can also make full use of acceleration units in modern GPGPUs, such as tensor cores and tensor acceleration engines, and thus supports advanced hardware features.

Description of the Drawings

[0016] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings of the embodiments will be briefly described below. Obviously, the drawings described below only relate to some embodiments of this disclosure and are not intended to limit this disclosure.

[0017] Figure 1 is a schematic diagram of an exemplary general-purpose graphics processor;

[0018] Figure 2A is a 3D schematic diagram of a standard convolution operation, given as an example;

[0019] Figure 2B is a 3D schematic diagram of a group convolution operation corresponding to Figure 2A;

[0020] Figure 2C is a 2D schematic diagram of the standard convolution operation corresponding to Figure 2A;

[0021] Figure 2D is a 2D schematic diagram of the group convolution operation corresponding to Figure 2B;

[0022] Figure 3 is a flowchart of a group convolution operation method provided in at least one embodiment of this disclosure;

[0023] Figure 4 is a schematic diagram of a group convolution operation provided in at least one embodiment of this disclosure;

[0024] Figure 5 is a schematic diagram of a group convolution operation apparatus provided in at least one embodiment of this disclosure;

[0025] Figure 6 is a schematic diagram of an electronic device provided in at least one embodiment of this disclosure;

[0026] Figure 7 is a schematic diagram of a storage medium provided in at least one embodiment of this disclosure; and

[0027] Figure 8 is a schematic diagram of another electronic device provided in at least one embodiment of this disclosure.

Detailed Description

[0028] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the described embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.

[0029] Unless otherwise defined, the technical or scientific terms used in this disclosure shall have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms “first,” “second,” and similar terms used in this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as “comprising” or “including” mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as “connected” or “linked” are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as “upper,” “lower,” “left,” and “right” are used only to indicate relative positional relationships, and these relative positional relationships may change accordingly when the absolute position of the described objects changes.

[0030] Figure 1 is a schematic diagram of an exemplary general-purpose graphics processor.

[0031] As shown in Figure 1, a general-purpose graphics processing unit (GPGPU) is essentially an array of programmable multiprocessors. For example, a programmable multiprocessor can be a streaming processor cluster (SPC), such as streaming processor clusters 1, ..., M shown in Figure 1, where M is a positive integer greater than 1. In a general-purpose graphics processor, one streaming processor cluster handles one computing task, or multiple streaming processor clusters handle one computing task. Multiple streaming processor clusters share data through a global cache or global memory.

[0032] As shown in Figure 1, taking streaming processor cluster 1 as an example, one streaming processor cluster includes multiple computing units, such as computing unit 1, computing unit 2, ..., computing unit N shown in Figure 1, where N is a positive integer. Each computing unit (CU) performs arithmetic and logical operations, such as accumulation, reduction, and ordinary addition, subtraction, multiplication, and division. A computing unit includes multiple cores (also called computing cores), each of which includes an arithmetic logic unit (ALU), a floating-point unit, and so on; these cores execute specific computing tasks. The computing unit also includes registers (e.g., the register file shown in Figure 1) and shared memory, which store the source and destination data of computing tasks in a hierarchical manner. The shared memory of a computing unit is used to share data among the cores of that computing unit.

[0033] As shown in Figure 1, each streaming processor cluster also provides a buffer for caching data across the N computing units in that streaming processor cluster.

[0034] In parallel computing, a computing task is typically executed by multiple threads. Before execution in a general-purpose graphics processor (or parallel computing processor), these threads are divided into multiple thread blocks, which a thread block distribution module (not shown in Figure 1) then distributes to the computing units. All threads in a thread block must be assigned to the same computing unit for execution. A thread block is further broken down into thread bundles (warps), the minimum unit of execution, each containing a fixed number of threads (or fewer than that fixed number), for example 32 threads. Multiple thread blocks can execute in the same computing unit or in different computing units.
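
The split of a thread block into warps can be illustrated with a short sketch. The warp width of 32 is the example value from the paragraph above; the function name `split_into_warps` is hypothetical, and real hardware may use a different width.

```python
import math

WARP_SIZE = 32  # example warp width taken from the text; real hardware may differ


def split_into_warps(threads_per_block: int, warp_size: int = WARP_SIZE) -> list:
    """Return the number of active threads in each warp of one thread block."""
    n_warps = math.ceil(threads_per_block / warp_size)
    return [min(warp_size, threads_per_block - i * warp_size)
            for i in range(n_warps)]


# A block of 100 threads needs four warps; the last one is only partly filled.
print(split_into_warps(100))  # [32, 32, 32, 4]
```

The partly filled last warp is the case the text describes as containing fewer than the fixed number of threads.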

[0035] In each computing unit, a thread bundle scheduling/distribution module (not shown in Figure 1) schedules and allocates thread bundles so that the multiple computing cores of the computing unit can run them. Depending on the number of computing cores in the computing unit, the thread bundles of a thread block can execute concurrently or in a time-shared manner. The threads within each thread bundle execute the same instruction. Memory access instructions are issued to the shared memory within the computing unit, or further to an intermediate-level cache, the global cache, or the global memory (e.g., the high bandwidth memory (HBM) shown in Figure 1), for read and write operations.

[0036] The core value of a GPGPU lies in its massively parallel architecture, which matches the computational characteristics of convolution well. Convolution is a cornerstone of fields such as computer vision and deep learning (e.g., for extracting image features in convolutional neural networks). It essentially slides a small filter window across the input data (such as an image) and performs many independent dot-product summations, one at each location. A central processing unit (CPU) that computes these windows one after another is slow, whereas a GPGPU can distribute the work for hundreds or even thousands of filter window locations to different computing cores for parallel execution. Convolution on a GPGPU is therefore no longer computed window by window; many windows are computed at once, which can improve performance by up to hundreds of times.

[0037] Below, we will first introduce the specific operation process of standard convolution and group convolution.

[0038] Figure 2A is a 3D schematic diagram of a standard convolution operation, given as an example.

[0039] As shown in Figure 2A, a standard convolution operation performs a convolution over the entire input data. For the input tensor a, C_in denotes the total number of input channels of the input data (i.e., the number of "layers" of the input data), H_in denotes the height of the input data for each input channel, and W_in denotes the width of the input data for each input channel. Each input channel (i.e., each "layer") is a two-dimensional array (matrix) of shape [H_in × W_in], and the C_in matrices are stacked to form a three-dimensional block [C_in × H_in × W_in]. Note that the input tensor also has a sample-count dimension N, so it actually has four dimensions, [N, C_in, H_in, W_in]. This description assumes N = 1, so N is omitted and the input tensor is treated as three-dimensional.

[0040] The weight tensor b (i.e., the convolution kernels) is a four-dimensional tensor of shape [C_out, C_in, K_h, K_w], where C_out denotes the number of output channels (i.e., the number of convolution kernels, each of which generates one output channel). Each convolution kernel must cover all input channels, so the depth of each kernel is also C_in. K_h denotes the height of the convolution kernel, and K_w denotes its width.

[0041] The output tensor c is likewise essentially a four-dimensional tensor [N, C_out, H_out, W_out]. Since the sample-count dimension N of the input tensor a equals 1, the sample-count dimension N of the output tensor c also equals 1 and is omitted here. Specifically, the convolution kernel for each output channel, after convolution with the input tensor a, yields a two-dimensional array (matrix) of shape [H_out × W_out], and the C_out matrices are stacked to form a three-dimensional block [C_out × H_out × W_out].

[0042] Because standard convolution requires each output channel to be convolved with all input channels, it results in a large number of parameters and computational overhead. Therefore, to reduce computational overhead and improve computational efficiency, group convolution can be used to divide the input channels into G groups, with each group performing convolution operations independently. This reduces the computational overhead to 1 / G of the original, and the number of parameters is also reduced to 1 / G of the original.
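
The 1/G reduction can be checked with simple counting. The sketch below assumes a bias-free 2D convolution and counts one multiply-accumulate (MAC) per weight per output position; the function name `conv_cost` and the example sizes are illustrative, not taken from the disclosure.

```python
def conv_cost(c_in, c_out, k_h, k_w, h_out, w_out, groups=1):
    """Weight count and MAC count of a bias-free 2D convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each of the c_out kernels only spans the c_in / groups channels of its group.
    params = c_out * (c_in // groups) * k_h * k_w
    # Every weight is used once per output position.
    macs = params * h_out * w_out
    return params, macs


std_params, std_macs = conv_cost(64, 128, 3, 3, 56, 56, groups=1)
grp_params, grp_macs = conv_cost(64, 128, 3, 3, 56, 56, groups=4)
print(std_params // grp_params, std_macs // grp_macs)  # 4 4
```

With G = 4, both the parameter count and the MAC count are one quarter of the standard convolution's.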

[0043] Figure 2B is a 3D schematic diagram of a group convolution operation corresponding to Figure 2A.

[0044] As shown in Figure 2B, the input tensor a is divided into two groups, G1 and G2 (number of groups G = 2). The input of a single group then has the dimension [C_in / 2, H_in, W_in]. Note that this grouping only divides the input along the channel (depth) dimension: certain input channels are assigned to the same group, and the number of channels in each group is C_in / 2.

[0045] Because the input tensor a has been divided into groups, the weight tensor b (i.e., the convolution kernels) must change accordingly. The depth of each group's convolution kernels becomes C_in / 2, while the kernel size is unchanged: the height and width remain K_h and K_w. The number of convolution kernels in each group becomes C_out / 2 instead of the original C_out. Therefore, the weight tensor of a single group has the dimension [C_out / 2, C_in / 2, K_h, K_w].

[0046] Then, the convolution kernels of each group are convolved with the input tensor of the corresponding group (C_in / 2 channels) to obtain the output tensor of each group (C_out / 2 channels), and the group outputs are combined by concatenation. The final output tensor c still has C_out output channels. Note, however, that the output of a single group here has the dimension [C_out / 2, H_out, W_out].
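
The split, convolve, and concatenate procedure described above can be sketched in plain Python. This is a minimal illustration, assuming stride 1, no padding, no bias, and N = 1; the function names and the nested-list layout are choices made for this example and are not part of the disclosure.

```python
def conv2d(x, w):
    """Standard convolution (stride 1, no padding, no bias).

    x: nested list [C_in][H_in][W_in]; w: nested list [C_out][C_in][K_h][K_w].
    Returns a nested list [C_out][H_out][W_out].
    """
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    h_out, w_out = h_in - k_h + 1, w_in - k_w + 1
    return [[[sum(w[o][c][i][j] * x[c][y + i][z + j]
                  for c in range(c_in) for i in range(k_h) for j in range(k_w))
              for z in range(w_out)]
             for y in range(h_out)]
            for o in range(len(w))]


def group_conv2d(x, w, groups):
    """Split channels into groups, convolve each group on its own, concatenate."""
    c_in_g = len(x) // groups    # input channels per group
    c_out_g = len(w) // groups   # output channels (kernels) per group
    out = []
    for g in range(groups):
        x_g = x[g * c_in_g:(g + 1) * c_in_g]
        w_g = w[g * c_out_g:(g + 1) * c_out_g]  # each kernel has depth c_in_g
        out.extend(conv2d(x_g, w_g))            # concatenate along channels
    return out


# C_in = 4, C_out = 2, G = 2, 3x3 input, 2x2 kernels of ones.
x = [[[c + 1] * 3 for _ in range(3)] for c in range(4)]
w = [[[[1] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
out = group_conv2d(x, w, groups=2)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

Each loop iteration corresponds to one independent sub-convolution, which is the per-group task that the background section describes.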

[0047] For ease of understanding, this disclosure also provides 2D schematic diagrams corresponding to Figure 2A and Figure 2B.

[0048] Figure 2C is a 2D schematic diagram of the standard convolution operation corresponding to Figure 2A. Figure 2D is a 2D schematic diagram of the group convolution operation corresponding to Figure 2B.

[0049] As shown in Figure 2C, compared with the standard convolution operation of Figure 2A, the input tensor a and the output tensor c are simplified to one-dimensional data, and the weight tensor b is simplified to a two-dimensional matrix. The input dimension is [C_in] (H_in and W_in omitted); the convolution kernel dimension is [C_out, C_in] (K_h and K_w omitted); and the output dimension is [C_out] (H_out and W_out omitted). The standard convolution process itself is the same as in Figure 2A and is not repeated here.

[0050] As shown in Figure 2D, compared with the group convolution operation of Figure 2B (number of groups G = 2), and as in Figure 2C, the input tensor a and the output tensor c are simplified to one-dimensional data, and the weight tensor b is simplified to a two-dimensional matrix. The input of a single group has the dimension [C_in / 2] (H_in and W_in omitted); the convolution kernels of a single group have the dimension [C_out / 2, C_in / 2] (K_h and K_w omitted); and the output of a single group has the dimension [C_out / 2] (H_out and W_out omitted). The group convolution process itself is the same as in Figure 2B and is not repeated here.

[0051] Although group convolution operations have advantages over standard convolution operations in reducing computation and parameter count, implementing group convolution on general-purpose graphics processing units (GPGPUs) has several drawbacks that directly affect computational efficiency, hardware resource utilization, and overall system performance.

[0052] (1) Low utilization of computational resources by group convolution kernels. Group convolution typically divides the input and output channels into multiple independent subgroups, each of which performs its own convolution. Because each group's computation is small (fewer channels), it is difficult to keep the thread blocks of the GPGPU fully occupied. For example, when the number of groups G is large and each group has few input channels, the per-group tasks are too fragmented to fill the massively parallel computing units of the GPGPU. As another example, dedicated computing units in modern GPGPUs, such as tensor cores, are usually designed to process large data blocks (for example, a granularity of at least 32 along the accumulation direction); group convolution cannot match this optimal granularity, which leaves hardware computing power idle. The result is lower computational density and limited overall throughput.

[0053] (2) Multiple kernel launches introduce additional overhead. In current implementations, group convolution is usually split into multiple independent convolution kernel calls, with one kernel launch per group. Each kernel launch involves synchronization, context switching, and task scheduling between the central processing unit (CPU) and the GPGPU, which can add significant latency. When the number of groups is large, frequent kernel launches accumulate into a non-negligible system overhead, especially for small-batch or real-time processing tasks in inference scenarios.

[0054] (3) Difficulty in fully utilizing the advanced features of modern GPGPUs. Currently, group convolution kernels are generally not deeply optimized for the features of next-generation GPGPU architectures. For example, the Tensor Core is designed for mixed-precision matrix operations and is suitable for large-scale intensive computations, while the fragmented computational patterns of group convolutions make it difficult to effectively call upon the Tensor Core and thus fail to leverage its high-performance advantages. Furthermore, the Tensor Memory Accelerator (TMA) is used to efficiently manage data transfer between global memory and shared memory, but the memory access patterns in group convolutions are scattered, making it difficult to utilize TMA for efficient batch transfers.

[0055] In view of at least one of the above-mentioned problems, at least one embodiment of the present disclosure provides a group convolution operation method, the method comprising: obtaining at least one group convolution operator, wherein the group convolution operator includes an input channel number, an output channel number, a group number, and a corresponding original weight tensor; in response to both the input channel number and the output channel number being divisible by the group number, performing a weight rearrangement operation on the original weight tensor to generate a recombined weight tensor for standard convolution operation; and invoking a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor.

[0056] At least one embodiment of this disclosure also provides a group convolution operation apparatus and an artificial intelligence processor including the group convolution operation apparatus.

[0057] The group convolution operation method of at least one embodiment of this disclosure merges multiple small group convolutions into a single large convolution (i.e., a standard convolution). This makes full use of the parallel computing capability of, for example, a GPGPU, reduces the number of kernel launches, and improves the computing performance of the system. The method can also make full use of acceleration units in modern GPGPUs, such as tensor cores and tensor acceleration engines, and thus supports advanced hardware features.

[0058] The group convolution operation method of this disclosure will be described below with reference to specific embodiments.

[0059] Figure 3 is a flowchart of a group convolution operation method provided in at least one embodiment of this disclosure. As shown in Figure 3, the group convolution operation method includes steps S100-S120.

[0060] Step S100: Obtain at least one group convolution operator, wherein the group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor.

[0061] Step S110: In response to the fact that both the number of input channels and the number of output channels are divisible by the number of groups, perform a weight rearrangement operation on the original weight tensor to generate a recombined weight tensor for standard convolution operations.

[0062] Step S120: Call the standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor.
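
Steps S100-S120 can be sketched end to end in the simplified 2D view of Figures 2C and 2D, where kernel height and width are omitted and a convolution reduces to a matrix-vector product. This is a sketch under that assumption; `GroupConvOp`, `rearrange`, `standard_conv`, and `per_group_reference` are hypothetical names introduced for illustration and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class GroupConvOp:   # hypothetical container for what step S100 obtains
    c_in: int
    c_out: int
    groups: int
    weight: list     # [c_out][c_in // groups], groups stacked along the rows


def rearrange(op):
    """Step S110: expand the stacked weight into a [c_out][c_in] matrix."""
    if op.c_in % op.groups or op.c_out % op.groups:
        raise ValueError("channel counts are not divisible by the group number")
    ci, co = op.c_in // op.groups, op.c_out // op.groups
    full = [[0.0] * op.c_in for _ in range(op.c_out)]
    for o in range(op.c_out):
        g = o // co  # group that output channel o belongs to
        full[o][g * ci:(g + 1) * ci] = op.weight[o]
    return full


def standard_conv(x, w):
    """Step S120 stand-in: with kernel size omitted, convolution is a mat-vec."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def per_group_reference(op, x):
    """Baseline: one small convolution per group, results concatenated."""
    ci, co = op.c_in // op.groups, op.c_out // op.groups
    out = []
    for g in range(op.groups):
        out += standard_conv(x[g * ci:(g + 1) * ci],
                             op.weight[g * co:(g + 1) * co])
    return out


op = GroupConvOp(c_in=4, c_out=4, groups=2,
                 weight=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
x = [1.0, 10.0, 100.0, 1000.0]
merged = standard_conv(x, rearrange(op))
print(merged)  # [21.0, 43.0, 6500.0, 8700.0]
```

The single merged call produces the same values as the per-group baseline, because the zero entries outside the diagonal blocks cancel every cross-group contribution.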

[0063] For step S100, the group convolution operators include, for example, 2D group convolution operators, 3D group convolution operators, transposed group convolution operators (i.e., deconvolution), and depthwise separable convolution operators (a special case of group convolution in which the number of groups equals the number of input channels). Among these, the 2D group convolution operator is the most common form. It is used for applications such as image processing (including high-resolution remote sensing and aerial image analysis), multispectral and hyperspectral imaging, person and biometric recognition, text classification (e.g., where text is mapped into a two-dimensional structure by word embedding), and speech recognition. The 3D group convolution operator is used for 3D data, for example in medical image analysis (including complete three-dimensional data generated from CT, MRI, and PET scans), video understanding and behavior recognition, and autonomous driving and robot perception. The transposed group convolution operator is used for applications such as upsampling or generation tasks. Unless otherwise specified, the following description uses the 2D group convolution operator as an example.

[0064] For example, a group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor. Here, the number of input channels, the number of output channels, and the number of groups are parameter configuration items for this group convolution operator, mainly used to define the behavior and constraints of the operator.

[0065] For example, the group convolution operator also includes an input tensor and an output tensor. The original weight tensor, the input tensor, and the output tensor are the core data items of the group convolution operator. These items include the shape, layout, and sparsity pattern of the data.

[0066] For example, the group convolution operator may also include the code and algorithm that actually perform the computation, as well as configuration items such as data layout, memory allocation, and cache optimization, so as to form a complete software module. The remaining configuration items are not described in detail here. Those skilled in the art can configure them according to actual needs, and the embodiments of this disclosure do not limit this.

[0067] For step S110, it is first necessary to determine whether the number of input channels and the number of output channels in the group convolution operator are divisible by the number of groups, that is, to determine whether the basic conditions for group convolution operation are met. If it is determined that both the number of input channels and the number of output channels are divisible by the number of groups, it is determined that the group convolution operator meets the conditions for group convolution operation. Then, the original weight tensor corresponding to the group convolution operator is subjected to a weight rearrangement operation to generate a recombined weight tensor for standard convolution operation.

[0068] It should be noted that the original weight tensor is a weight tensor formed by stacking the corresponding sub-weight tensors of each group together, and the corresponding sub-weight tensors of each group do not affect each other.

[0069] For example, in one possible implementation, the recombined weight tensor is a block-diagonal sparse weight tensor. The original weight tensor is subjected to a weight rearrangement operation to generate a recombined weight tensor for standard convolution operations. This includes: inserting a weight rearrangement operator before the group convolution operator, wherein the weight rearrangement operator is used to create a target weight tensor with all zeros, and placing the sub-weight blocks corresponding to each group in the original weight tensor in the group order at the block diagonal position of the target weight tensor to generate a block-diagonal sparse weight tensor.

[0070] Here, the block-diagonal sparse weight tensor is a tensor with a specific sparse structure, where non-zero elements appear only in a number of disjoint sub-blocks (or sub-matrices) arranged along the "diagonal," with all other positions being zero. This structure is applied, for example, in computational scenarios such as deep learning, graph neural networks, multi-task learning, model compression, and efficient computation.

[0071] At least one embodiment of this disclosure inserts a weight rearrangement operator before the group convolution operator. This weight rearrangement operator rearranges the dimensions of the original weight tensor corresponding to the group convolution operator to generate a recombined weight tensor for standard convolution operations. For example, in one specific implementation, the weight rearrangement operator can first create a target weight tensor with all zeros, and then place the sub-weight blocks corresponding to each group in the original weight tensor corresponding to the group convolution operator in the block diagonal position of the target weight tensor according to the grouping order (i.e., the order of the corresponding input channels), so that the remaining positions except the block diagonal position are all kept as zero, and finally the above-mentioned block diagonal sparse weight tensor (i.e., recombined weight tensor) is generated.

[0072] For example, in one possible implementation, the horizontal dimension of the all-zero target weight tensor created by the weight rearrangement operator is determined based on the number of input channels, and the vertical dimension is determined based on the number of output channels.

[0073] For example, if the number of input channels is 32 and the number of output channels is 64, then the horizontal dimension of the all-zero target weight tensor created by the weight rearrangement operator is 32 and the vertical dimension is 64. The embodiments of this disclosure do not limit the other two dimensions of the target weight tensor (convolution kernel height and width).
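The weight rearrangement described above can be sketched with NumPy as follows. The sketch assumes that the original weight tensor is stored as `(C_out, C_in / G, kH, kW)` with the per-group sub-weight tensors stacked along the first axis; this layout, the function name `rearrange_weights`, and the choice of 4 groups are assumptions made for illustration and are not stated in the disclosure.

```python
import numpy as np


def rearrange_weights(w_group: np.ndarray, groups: int) -> np.ndarray:
    """Build a block-diagonal weight tensor of shape (C_out, C_in, kH, kW).

    w_group is assumed to have shape (C_out, C_in // groups, kH, kW).
    """
    c_out, c_in_per_group, kh, kw = w_group.shape
    if c_out % groups != 0:
        raise ValueError("C_out must be divisible by the group number")
    c_out_per_group = c_out // groups
    c_in = c_in_per_group * groups

    # Step 1: create the all-zero target weight tensor.
    w_dense = np.zeros((c_out, c_in, kh, kw), dtype=w_group.dtype)

    # Step 2: copy each group's sub-weight block onto the block diagonal,
    # in group order; every other position stays zero.
    for g in range(groups):
        rows = slice(g * c_out_per_group, (g + 1) * c_out_per_group)
        cols = slice(g * c_in_per_group, (g + 1) * c_in_per_group)
        w_dense[rows, cols] = w_group[rows]
    return w_dense


# 32 input channels, 64 output channels, 4 groups, 3x3 kernels.
w_group = np.ones((64, 8, 3, 3), dtype=np.float32)
w_dense = rearrange_weights(w_group, groups=4)
print(w_dense.shape)
```

With 64 output channels and 32 input channels, the target tensor has 64 rows and 32 columns of kernels, matching the vertical and horizontal dimensions given in the example above.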

[0074] For step S120, after obtaining the recombined weight tensor, the standard convolution function can be called to perform a standard convolution operation based on the input tensor and the recombined weight tensor in the group convolution operator. Here, since the off-diagonal block positions in the recombined weight tensor are all zero, when the standard convolution operation is used to process the recombined weight tensor (sparse weights), the zero-value weights will not contribute to the calculation result. Therefore, the standard convolution operation is mathematically equivalent to the original group convolution operation.
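The equivalence can be written out for a single output channel. The notation below is introduced only for illustration: $o$ is an output channel, $g(o)$ is the group it belongs to, $I_{g}$ is the set of input channels of group $g$, $W_{g}[o,c]$ is the original sub-weight kernel, $W_{\mathrm{padded}}$ is the recombined weight tensor, and $*$ denotes 2D convolution.

```latex
Y_o \;=\; \sum_{c=1}^{C_{\mathrm{in}}} W_{\mathrm{padded}}[o,c] * X_c
    \;=\; \sum_{c \in I_{g(o)}} W_{g(o)}[o,c] * X_c
        \;+\; \sum_{c \notin I_{g(o)}} 0 * X_c
    \;=\; \sum_{c \in I_{g(o)}} W_{g(o)}[o,c] * X_c
```

The last expression is exactly the group convolution result for output channel $o$, since the terms outside the group's own input channels vanish.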

[0075] For example, in one possible implementation, the input tensor includes input feature map data, and calling a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor includes: setting the number of groups parameter of the standard convolution function to 1, and the weight parameter of the standard convolution function to the recombined weight tensor; and calling the standard convolution function to perform convolution calculation based on the input feature map data and the recombined weight tensor.

[0076] For example, the input tensor in the group convolution operator can be input feature map data (for example, for image processing), the group number parameter of the standard convolution function is 1, and the weight parameter is the recombined weight tensor. The convolution operation is implemented by calling the standard convolution function to perform a standard convolution calculation based on the input feature map data and the recombined weight tensor. For example, the standard convolution function can be represented by the formula Y=Conv2d(X, W_padded), where Y represents the output channel data, Conv2d represents the 2D standard convolution operation, X represents the input tensor, and W_padded represents the recombined weight tensor.
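The call Y=Conv2d(X, W_padded) can be checked numerically against a per-group reference. The following is a sketch under stated assumptions: `conv2d` is a naive stand-in (stride 1, no padding, no batch dimension) for whatever standard convolution function a platform provides, the grouped weight layout `(C_out, C_in / G, kH, kW)` is assumed, and all function names are hypothetical.

```python
import numpy as np


def conv2d(x, w):
    """Naive standard 2D convolution (groups = 1), stride 1, no padding.

    x: (C_in, H, W); w: (C_out, C_in, kH, kW).
    """
    c_out, _, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract over input channels and the kernel window.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def group_conv2d(x, w_group, groups):
    """Reference group convolution: one small convolution per group."""
    c_out, c_in_g, _, _ = w_group.shape
    c_out_g = c_out // groups
    outs = []
    for g in range(groups):
        x_g = x[g * c_in_g:(g + 1) * c_in_g]
        w_g = w_group[g * c_out_g:(g + 1) * c_out_g]
        outs.append(conv2d(x_g, w_g))
    return np.concatenate(outs, axis=0)


def to_block_diagonal(w_group, groups):
    """Place each group's sub-weight block on the block diagonal of a zero tensor."""
    c_out, c_in_g, kh, kw = w_group.shape
    c_out_g = c_out // groups
    w_padded = np.zeros((c_out, c_in_g * groups, kh, kw), dtype=w_group.dtype)
    for g in range(groups):
        rows = slice(g * c_out_g, (g + 1) * c_out_g)
        cols = slice(g * c_in_g, (g + 1) * c_in_g)
        w_padded[rows, cols] = w_group[rows]
    return w_padded


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))           # 4 input channels
w_group = rng.standard_normal((6, 2, 3, 3))  # 6 output channels, G = 2

y_ref = group_conv2d(x, w_group, groups=2)            # G separate convolutions
y = conv2d(x, to_block_diagonal(w_group, groups=2))   # one standard convolution
print(np.allclose(y, y_ref))
```

The single call to `conv2d` replaces the loop over groups in `group_conv2d`, which is the merging of several small convolutions into one standard convolution that the method relies on.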

[0077] For example, in one possible implementation, the above-described group convolution operation method further includes: in response to at least one of the number of input channels and the number of output channels not being divisible by the group number, performing a padding operation on the group convolution operator so that the padded number of input channels and the padded number of output channels are both divisible by the group number. Corresponding to step S110 above, if either the number of input channels or the number of output channels is not divisible by the group number, the channels that do not meet the group convolution condition (the input channels, the output channels, or both) can, for example, first be padded with zeros so that the padded channel numbers meet the group convolution operation condition, after which the group convolution operation method of at least one embodiment of this disclosure is executed.
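The target channel counts after padding can be computed as in the sketch below. The disclosure does not specify how the zero channels are distributed among the groups, so the sketch only rounds each channel count up to the next multiple of the group number; the function name `padded_channels` is hypothetical.

```python
def padded_channels(channels: int, groups: int) -> int:
    """Round a channel count up to the nearest multiple of the group number."""
    if channels <= 0 or groups <= 0:
        raise ValueError("channels and groups must be positive integers")
    # Ceiling division written with integer arithmetic only.
    return -(-channels // groups) * groups


c_in, c_out, groups = 30, 64, 4
c_in_padded = padded_channels(c_in, groups)    # 30 is rounded up to 32
c_out_padded = padded_channels(c_out, groups)  # 64 is already a multiple of 4
print(c_in_padded, c_out_padded)
```

The difference between the padded and original counts (here 2 input channels) is the number of zero channels that the padding operation would have to add.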

[0078] For example, if either the number of input channels or the number of output channels is not divisible by the number of groups, it can be directly determined that the convolution operator does not meet the operation conditions of group convolution, thereby suspending the execution of the group convolution operation method of at least one embodiment of this disclosure.

[0079] It should be noted that the group convolution operation method of at least one embodiment of this disclosure can be naturally extended to variants such as 3D group convolution and transposed group convolution, and can also be applied to any computational scenario of multidimensional structured data types with "channel dimension", such as 2D image data (such as RGB images, multispectral images), 3D video frame sequences, etc. The embodiments of this disclosure do not limit this.

[0080] The group convolution operation method of at least one embodiment of this disclosure fully utilizes the parallel computing capabilities of, for example, GPGPUs by merging multiple small group convolutions into a single large standard convolution, thereby reducing the number of kernel launches. Furthermore, this group convolution operation method reuses existing high-performance standard convolution kernels and requires only the addition of a weight rearrangement operator, so the implementation complexity is low and there is no need to develop and maintain dedicated group convolution kernels for different hardware platforms, which reduces development costs. For example, at least one embodiment of this disclosure can also make full use of acceleration units of modern GPGPUs, such as tensor acceleration cores and tensor acceleration engines, to support advanced hardware features.

[0081] Figure 4 is a schematic diagram of a group convolution operation provided by at least one embodiment of the present disclosure.

[0082] As shown in Figure 4, in contrast to Figure 2D, the weight tensor b is reorganized into a (two-dimensional) block-diagonal sparse weight tensor. The sub-weight matrices corresponding to each group (G1, G2) are placed at the block-diagonal positions of this tensor, and the remaining positions are set to zero. Therefore, during the standard convolution operation, the zero-weight portions do not contribute to the calculation result of any output channel. The calculation result of the standard convolution operation (e.g., output tensor c) is mathematically equivalent to, that is, the same as, the calculation result obtained by group convolution (e.g., the output tensor c in Figure 2D).

[0083] It should be noted, however, that the example shown in Figure 4 is only one example with two groups. The group convolution operation method of this disclosure can also be applied to cases where the number of groups is greater than two. The embodiments of this disclosure are not limited in this regard.

[0084] At least one embodiment of this disclosure also provides a group convolution operation apparatus corresponding to the above-described group convolution operation method. Figure 5 is a schematic diagram of a group convolution operation apparatus provided by at least one embodiment of the present disclosure.

[0085] As shown in Figure 5, the group convolution operation device 200 includes an acquisition module 210, a weight rearrangement module 220, and a calling module 230. The group convolution operation device 200 can be used to implement the above-mentioned group convolution operation method.

[0086] The acquisition module 210 is configured to acquire at least one group convolution operator, wherein the group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor. The specific operation process is described in the above-mentioned step S100, and will not be repeated here.

[0087] The weight rearrangement module 220 is configured to perform a weight rearrangement operation on the original weight tensor in response to both the number of input channels and the number of output channels being divisible by the number of groups, to generate a recombined weight tensor for standard convolution operations. The specific operation process is described in the explanation of step S110 above, and will not be repeated here.

[0088] The calling module 230 is configured to invoke a standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor. The specific operation process is described in the explanation of step S120 above, and will not be repeated here.

[0089] The various modules included in the group convolution operation apparatus provided in at least one embodiment of this disclosure can be implemented, for example, at least partially, by hardware, firmware, or software, and this disclosure does not limit this.

[0090] For example, in one possible implementation, the recombined weight tensor is a block-diagonal sparse weight tensor. The weight rearrangement module 220 includes an insertion submodule (not shown in the figure), which is configured to insert a weight rearrangement operator before the group convolution operator. The weight rearrangement operator is used to create a target weight tensor with all zeros and to place the corresponding subweight blocks of each group in the original weight tensor in the block diagonal position of the target weight tensor according to the group order, so as to generate a block-diagonal sparse weight tensor. The specific operation process is described in the above description of step S110, and will not be repeated here.

[0091] It should be noted that, for clarity and brevity, the embodiments of this disclosure do not show all the constituent units of the group convolution operation device 200 described above. To achieve the necessary functions of the group convolution operation device 200, those skilled in the art can provide and set other constituent units (not shown) according to specific needs, and the embodiments of this disclosure do not impose any limitations on this.

[0092] At least one embodiment of this disclosure also provides an artificial intelligence processor, which includes the group convolution operation device 200 provided in any of the above embodiments. For example, the artificial intelligence processor may be a graphics processing unit (GPU), a general-purpose graphics processing unit (GPGPU), a tensor processing unit (TPU), a neural processing unit (NPU), an intelligence processing unit (IPU), or a language processing unit (LPU), and the embodiments of this disclosure are not limited thereto.

[0093] For example, the structure of the general-purpose graphics processing unit (GPGPU) shown in Figure 1 above is taken as an example. Here, the GPGPU is used to execute the group convolution operation method of at least one of the above embodiments.

[0094] For streaming processor clusters (SPCs) and thread blocks, thread block tasks that originally required G (e.g., G=2) independent dispatches are merged into a single dispatch. In group convolution, each convolution group independently occupies one or more SPCs, and the small computational scale of a single group cannot effectively fill them. By contrast, at least one embodiment of this disclosure improves the parallel filling degree of the SPCs, reduces the scheduling pressure on the thread block dispatch module to 1 / G of the original, effectively eliminates the idle waiting time of the SPCs, and thus improves the overall throughput of the GPGPU.

[0095] For computational units and thread bundles, each thread bundle can process larger output feature map blocks, and a single instruction issue can drive more arithmetic logic units (ALUs) and floating-point computation units to execute concurrently. In group convolution, dividing the number of input channels into Cin / G per group leads to fragmented computational granularity, low utilization of instruction issue slots, and frequent pipeline stalls. By contrast, at least one embodiment of this disclosure shifts the thread-bundle-level parallelism within the computational unit from "task-level parallelism" to "data-level parallelism," thereby improving the instruction issue efficiency of the computational core and the actual utilization rate of the integer / floating-point arithmetic units.

[0096] For the memory hierarchy and data transfer engine, as shown in Figure 1, the GPGPU memory hierarchy includes High Bandwidth Memory (HBM), global cache, shared memory, and register file. When group convolution is implemented conventionally, G independent convolution kernels each need to load the input feature map and weight tensor from HBM. This results in a discrete, small-granularity memory access pattern, which leads to a low global cache hit rate and makes it difficult to initiate batch DMA transfers using the Tensor Acceleration Engine (TMA). At least one embodiment of this disclosure loads the complete input feature map into shared memory in a single standard convolution kernel, increasing the data reuse rate of the input feature map by a factor of G. Furthermore, at least one embodiment of this disclosure reduces HBM bandwidth usage, improves the global cache hit rate, and alleviates the pressure on the SPC internal buffer.

[0097] At least one embodiment of this disclosure also provides an electronic device. Figure 6 is a schematic diagram of an electronic device provided by at least one embodiment of the present disclosure.

[0098] For example, as shown in Figure 6, the electronic device 300 includes a processor 310 and a memory 320. The memory 320 is used to store non-transitory computer-readable instructions (e.g., one or more computer program modules). The processor 310 is used to execute the computer-readable instructions, which, when executed by the processor 310, perform the group convolution operation method provided in any embodiment of this disclosure. The memory 320 and the processor 310 can be interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0099] The processor 310 includes the group convolution operation apparatus of any embodiment of this disclosure, and may include devices with data processing capabilities and / or program execution capabilities such as a central processing unit (CPU), tensor processor (TPU), network processor (NP), or graphics processing unit (GPU). It may also be a digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. For example, the central processing unit (CPU) may be an x86 or ARM architecture, such as a system-on-a-chip (SOC). The processor 310 may be a general-purpose processor or a special-purpose processor, and can control other components in the electronic device 300 to perform desired functions.

[0100] For example, memory 320 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, etc. One or more computer program modules may be stored on the computer-readable storage medium, and processor 310 may run one or more computer program modules to implement various functions of electronic device 300. Various application programs and various data, as well as various data used and / or generated by the application programs, may also be stored in the computer-readable storage medium.

[0101] At least one embodiment of this disclosure also provides a computer storage medium for storing non-transitory computer program executable code (e.g., computer executable instructions). When executed by a computer (e.g., including one or more processors), the non-transitory computer program executable code can implement the group convolution operation method of any embodiment of this disclosure.

[0102] Figure 7 is a schematic diagram of a storage medium provided by at least one embodiment of this disclosure. For example, as shown in Figure 7, the computer storage medium 400 stores computer-executable instructions 410 in a non-transitory manner.

[0103] For example, one or more computer instructions may be stored on the storage medium 400. Some of the computer instructions stored on the storage medium 400 may be instructions for implementing one or more steps in the above-described group convolution operation method, for example, when executed by an instruction processing device (e.g., a processor).

[0104] For example, storage medium 400 may include storage components of a tablet computer, hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), optical disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, or other suitable storage media.

[0105] Figure 8 is a schematic diagram of another electronic device provided by at least one embodiment of the present disclosure. The electronic device 500 shown in Figure 8 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0106] As shown in Figure 8, in some examples, electronic device 500 includes a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which includes the group convolution operation apparatus of any embodiment of the present disclosure and can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 502 or a program loaded from storage device 508 into random access memory (RAM) 503. Various programs and data required for the operation of the computer system are also stored in RAM 503. The processing device 501, ROM 502, and RAM 503 are connected to one another via bus 504. Input / output (I / O) interface 505 is also connected to bus 504.

[0107] For example, the following components can be connected to I / O interface 505: input devices 506 including, for example, touch screens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 508 including, for example, magnetic tapes, hard disks, etc.; and communication devices 509, such as network interface cards like LAN cards and modems, etc. Communication device 509 allows electronic device 500 to communicate wirelessly or wiredly with other devices to exchange data and perform communication processing via networks such as the Internet. Drive 510 is also connected to I / O interface 505 as needed. Removable media 511, such as disks, optical disks, magneto-optical disks, semiconductor memories, etc., are installed on drive 510 as needed so that computer programs read from them can be installed into storage device 508 as needed.

[0108] Although Figure 8 shows an electronic device 500 including various devices, it should be understood that it is not required to implement or include all of the devices shown. More or fewer devices may alternatively be implemented or included.

[0109] For example, the electronic device 500 may further include a peripheral interface (not shown in the figure). This peripheral interface can be various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 509 can communicate wirelessly with a network and other devices, such as the Internet, an intranet, and / or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and / or a metropolitan area network (MAN). Wireless communication can use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and / or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for email, instant messaging, and / or Short Message Service (SMS), or any other suitable communication protocol.

[0110] For example, the electronic device 500 may include any device such as a mobile phone, tablet computer, laptop computer, e-book, game console, television, digital photo frame, navigator, server, etc., or any combination of data processing device and hardware. The embodiments disclosed herein do not limit this.

[0111] Although the present disclosure has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to the embodiments of the present disclosure, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present disclosure are within the scope of protection claimed by the present disclosure.

[0112] The following points should be noted regarding this disclosure:

[0113] (1) The accompanying drawings of the embodiments of this disclosure relate only to the structures involved in the embodiments of this disclosure; for other structures, reference may be made to common designs.

[0114] (2) For clarity, the thickness of layers or regions in the drawings used to describe embodiments of the present disclosure is enlarged or reduced, i.e., these drawings are not drawn to actual scale.

[0115] (3) Where there is no conflict, the embodiments of this disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.

[0116] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. The scope of protection of this disclosure should be determined by the scope of protection of the claims.

Claims

1. A group convolution operation method, characterized in that the group convolution operation method includes: obtaining at least one group convolution operator, wherein the group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor, wherein the original weight tensor is a weight tensor formed by stacking the sub-weight tensors corresponding to each group together; in response to both the number of input channels and the number of output channels being divisible by the number of groups, performing a weight rearrangement operation on the original weight tensor to generate a recombined weight tensor for standard convolution operations, wherein the recombined weight tensor is a block-diagonal sparse weight tensor; and invoking a standard convolution function to perform a convolution operation based on an input tensor and the recombined weight tensor; wherein performing the weight rearrangement operation on the original weight tensor to generate the recombined weight tensor for standard convolution operations includes: inserting a weight rearrangement operator before the group convolution operator, wherein the weight rearrangement operator is used to create a target weight tensor with all zeros and to place the sub-weight blocks corresponding to each group in the original weight tensor at the block-diagonal positions of the target weight tensor in group order, so as to generate the block-diagonal sparse weight tensor.

2. The group convolution operation method according to claim 1, characterized in that the horizontal dimension of the all-zero target weight tensor is determined based on the number of input channels, and the vertical dimension of the all-zero target weight tensor is determined based on the number of output channels.

3. The group convolution operation method according to claim 1, characterized in that the input tensor includes input feature map data, and invoking the standard convolution function to perform a convolution operation based on the input tensor and the recombined weight tensor includes: setting the group number parameter of the standard convolution function to 1 and the weight parameter of the standard convolution function to the recombined weight tensor; and invoking the standard convolution function to perform the convolution operation based on the input feature map data and the recombined weight tensor.

4. The group convolution operation method according to any one of claims 1-3, characterized in that the group convolution operator includes at least one of the following: a 2D group convolution operator, a 3D group convolution operator, and a transposed group convolution operator.

5. The group convolution operation method according to claim 4, characterized in that the group convolution operator also includes a depthwise separable convolution operator, wherein the number of groups of the depthwise separable convolution operator is equal to the number of output channels of the depthwise separable convolution operator.

6. The group convolution operation method according to claim 4, characterized in that the 2D group convolution operator is used in at least one of image data processing, text classification, and speech recognition; the 3D group convolution operator is used in at least one of medical image analysis, video understanding and behavior recognition, and autonomous driving and robot perception; and the transposed group convolution operator is used for at least one of upsampling tasks and generation tasks.

7. The group convolution operation method according to any one of claims 1-3, characterized in that the group convolution operation method also includes: in response to at least one of the number of input channels and the number of output channels not being divisible by the group number, performing a padding operation on the group convolution operator so that the padded number of input channels and the padded number of output channels are both divisible by the group number.

8. A group convolution operation apparatus, characterized in that the group convolution operation apparatus includes: an acquisition module configured to acquire at least one group convolution operator, wherein the group convolution operator includes the number of input channels, the number of output channels, the number of groups, and the corresponding original weight tensor, wherein the original weight tensor is a weight tensor formed by stacking the sub-weight tensors corresponding to each group together; a weight rearrangement module configured to perform a weight rearrangement operation on the original weight tensor in response to both the number of input channels and the number of output channels being divisible by the number of groups, so as to generate a recombined weight tensor for standard convolution operations, wherein the recombined weight tensor is a block-diagonal sparse weight tensor; and a calling module configured to invoke a standard convolution function to perform a convolution operation based on an input tensor and the recombined weight tensor; wherein performing the weight rearrangement operation on the original weight tensor to generate the recombined weight tensor for standard convolution operations includes: inserting a weight rearrangement operator before the group convolution operator, wherein the weight rearrangement operator is used to create a target weight tensor with all zeros and to place the sub-weight blocks corresponding to each group in the original weight tensor at the block-diagonal positions of the target weight tensor in group order, so as to generate the block-diagonal sparse weight tensor.

9. An artificial intelligence processor, characterized in that the artificial intelligence processor includes the group convolution operation apparatus according to claim 8.

10. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory including one or more computer program modules; wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules are used to perform the group convolution operation method according to any one of claims 1-7.

11. A non-transitory storage medium, characterized in that it stores computer-executable instructions which, when executed by a computer, perform the group convolution operation method according to any one of claims 1-7.