Feature data pooling method, electronic device, storage medium, and program

By segmenting the input feature map and establishing an index mapping table, parallel pooling processing is performed in local memory, which solves the problems of memory access fragmentation and insufficient computational granularity in existing pooling methods, and improves the computational efficiency and throughput of the pooling operator.

CN122244646A · Pending · Publication Date: 2026-06-19 · SHANGHAI SUIYUAN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHANGHAI SUIYUAN TECH CO LTD
Filing Date: 2026-05-19
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing pooling computation methods, a single pooling kernel cannot read and process the complete window data in one vectorized operation. This leads to poor spatial locality of memory access and insufficient computational intensity, leaves the hardware's parallel processing capabilities underused, and fails to meet the real-time processing requirements of high-resolution vision tasks.

Method used

The target original input feature map is segmented, the resulting local input feature maps are moved to local memory, an input-output index mapping table is determined, and parallel pooling processing is performed in the vector registers of each sub-thread in local memory, which optimizes pooling computation scheduling and the memory access pattern.

Benefits of technology

It improves the computational efficiency and throughput of the pooling operator, addresses fragmented memory access and insufficient computational granularity, and improves hardware resource utilization and overall computing performance.


Abstract

This invention discloses a feature data pooling method, electronic device, storage medium, and program. The method includes: segmenting the original target input feature map; transferring each segmented local input feature map to local memory via a data transformation engine; determining an input-output index mapping table for each local input feature map in local memory; wherein the input-output index mapping table is stored in local memory; and, based on the input-output index mapping table of each local input feature map, performing parallel pooling processing on the current local input feature data blocks required for pooling processing of the current pooling kernel of each local input feature map in the vector registers of each sub-thread in local memory, to obtain the current output feature data blocks in the output feature map. The technical solution of this invention can optimize pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of the pooling operator.

Description

Technical Field

[0001] This invention relates to the field of data processing, and more particularly to a method for pooling feature data, an electronic device, a storage medium, and a program.

Background Technology

[0002] Pooling, as a core fundamental operator of deep convolutional neural networks, is widely used in computer vision tasks such as image classification, object detection, and semantic segmentation. The mainstream implementation forms include max pooling and average pooling, which achieve feature dimensionality reduction and key information preservation by performing maximum value extraction or mean calculation within a sliding window of the input feature map.

[0003] In existing pooling computation processes, constrained by the data arrangement format of the input feature map, a single pooling kernel struggles to perform a one-time vectorized read and operation on the complete window data, easily leading to fragmented access to window data and significantly reducing memory locality. Simultaneously, a single pooling window generates only one output data point after computation, resulting in low unit computation density and insufficient computational intensity. This fails to fully utilize the parallel processing capabilities of the hardware vector operation units, leading to low hardware resource utilization, limited overall throughput, and high computational latency, making it difficult to meet the real-time processing requirements of high-resolution vision tasks.

Summary of the Invention

[0004] This invention provides a method, apparatus, electronic device, storage medium, and program for pooling feature data, which can optimize pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of pooling operators.

[0005] According to one aspect of the present invention, a feature data pooling method is provided, comprising: segmenting a target original input feature map and transferring each segmented local input feature map to local memory through a data transformation engine; determining an input-output index mapping table for each of the local input feature maps in the local memory, wherein the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map; and, based on the input-output index mapping table of each local input feature map, performing parallel pooling processing, in the vector registers of each sub-thread in the local memory, on the current local input feature data blocks required by the current pooling kernel of each local input feature map, to obtain the current output feature data blocks in the output feature map.

[0006] According to another aspect of the present invention, a feature data pooling apparatus is provided, comprising: an input feature map transport module, used to segment the target original input feature map and then transfer each segmented local input feature map to local memory through the data transformation engine; an input-output index mapping table determination module, used to determine an input-output index mapping table for each of the local input feature maps in the local memory, wherein the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map; and an input feature map pooling processing module, used to perform, according to the input-output index mapping table of each local input feature map, parallel pooling processing in the vector registers of each sub-thread in the local memory on the current local input feature data blocks required by the current pooling kernel of each local input feature map, so as to obtain the current output feature data blocks in the output feature map.

[0007] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the feature data pooling method according to any embodiment of the present invention.

[0008] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement the feature data pooling method described in any embodiment of the present invention.

[0009] According to another aspect of the present invention, a computer program product is also provided, comprising a computer program that, when executed by a processor, implements the feature data pooling method described in any embodiment of the present invention.

[0010] In the embodiments of this invention, the target original input feature map is segmented and the resulting local input feature maps are moved to local memory via a data transformation engine. An input-output index mapping table is then determined for each local input feature map in local memory. This mapping table, stored in local memory, includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map. Then, based on the input-output index mapping table of each local input feature map, the current local input feature data blocks required by the current pooling kernel of each local input feature map are pooled in parallel in the vector registers of each sub-thread in local memory, yielding the current output feature data blocks in the output feature map. This addresses the fragmented memory access, insufficient computational granularity, and low utilization of vector parallel resources found in existing pooling methods. It optimizes pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of the pooling operator.

[0011] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Brief Description of the Drawings

[0012] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a flowchart of a feature data pooling method provided in Embodiment 1 of the present invention; Figure 2 is a flowchart of a feature data pooling method provided in Embodiment 2 of the present invention; Figure 3 is a schematic diagram of a feature data pooling device provided in Embodiment 3 of the present invention; Figure 4 is a schematic diagram of the structure of an electronic device provided in Embodiment 4 of the present invention.

Detailed Description

[0014] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0015] It should be noted that the terms "first," "second," "target," and "local," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way can be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or apparatus.

Embodiment 1

[0016] Figure 1 is a flowchart of a feature data pooling method provided in Embodiment 1 of the present invention. This embodiment is applicable to situations where the multiple input feature data items required for pooling are located at once according to an input-output index mapping table, and pooling processing is then performed on that feature data. The method can be executed by a feature data pooling device, which can be implemented in software and/or hardware and is generally integrated into an electronic device. The electronic device can be a terminal device or a server device, as long as it can execute the feature data pooling method. The embodiments of the present invention do not limit the specific device type of the electronic device. Accordingly, as shown in Figure 1, the method includes the following operations:

S110. After segmenting the target original input feature map, transfer each segmented local input feature map to local memory through the data transformation engine.

[0017] The target original input feature map can be two-dimensional original feature data to be pooled, serving as the original data source for pooling. For example, the target original input feature map can include, but is not limited to, image type feature data, temporal and frequency domain feature data of speech signals, or feature data formed by text embedding vectors. This embodiment of the invention does not limit the specific type of the target original input feature map. The local input feature map can be an input feature map obtained by segmenting the target original input feature map. The data transformation engine can be an asynchronous DMA (Direct Memory Access) unit. Local memory is a high-speed storage unit with extremely low access latency in the hardware architecture performing pooling operations, and is a core data temporary storage area in high-performance computing scenarios.

[0018] In hardware architectures that perform pooling processing, the device's global memory has a large capacity but relatively limited bandwidth. Therefore, during pooling processing, the target original input feature map can be segmented along preset dimensions to obtain multiple local input feature maps, which are then moved to local memory by the data transformation engine for pooling processing. For example, the target original input feature map can be segmented along the batch and channel dimensions, or along the batch, channel, and height dimensions.

[0019] S120. Determine an input-output index mapping table for each of the local input feature maps in the local memory; wherein, the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map.

[0020] The input-output index mapping table can be a table representing the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map. A local input feature data block can be a data block obtained by partitioning the local input feature map. The output feature map can be a feature map obtained by pooling the local input feature map. Output feature data blocks can be data blocks obtained by partitioning the output feature map.

[0021] Correspondingly, after the segmented local input feature maps are transferred to local memory via the data transformation engine, the mapping relationship between each local input feature data block in each local input feature map and each output feature data block in the output feature map can be determined. An input-output index mapping table is then generated based on this mapping relationship and stored in local memory. It should be noted that if the target original input feature map is segmented using the same segmentation method, the mapping relationship between each local input feature data block in each local input feature map and each output feature data block in the output feature map is the same; that is, each local input feature map can share a single input-output index mapping table.

[0022] S130. Based on the input-output index mapping table of each local input feature map, the current local input feature data block required for pooling the current pooling kernel of each local input feature map is pooled in parallel in the vector register of each sub-thread in the local memory to obtain the current output feature data block in the output feature map.

[0023] Here, the current pooling kernel can be the pooling kernel used to perform pooling processing on each local input feature map. The current local input feature data block can be the local input feature data block to be pooled. The current output feature data block can be the output feature data block obtained by pooling the current local input feature data block.

[0024] Accordingly, after determining the input-output index mapping table for each local input feature map in local memory, batch data reading can be achieved based on the input-output index mapping table of each local input feature map. Specifically, multiple current local input feature data blocks in the local input feature map can be directly located and loaded in batches at once through the input-output index mapping table, avoiding repeated address calculations and multiple scattered memory accesses. On this basis, the vector register resources of multiple parallel sub-threads in local memory can be used to simultaneously perform parallel pooling calculations on the multiple loaded current local input feature data blocks, thereby obtaining the current output feature data block in the output feature map corresponding to the local input feature map.

[0025] Therefore, the feature data pooling method provided in this embodiment of the invention can load multiple current local input feature data blocks in batches at once through the input-output index mapping table of each local input feature map. This eliminates scattered and multiple memory addressing operations, reducing the additional overhead caused by index calculation, address jumps, and non-contiguous memory access. Simultaneously, it avoids repetitive calculation of data correspondences at runtime, significantly reducing control logic complexity and instruction redundancy. By uniformly batch reading data, it can fully utilize the bandwidth advantages of local memory and vector registers, improving data reading and cache utilization, thereby significantly improving the parallelism and execution efficiency of the overall pooling computation.
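
The batched gather described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: it assumes max pooling, a flattened H×W plane, a table laid out as indices[k][o], and a hypothetical sentinel of -1 for positions that fall in padding. The function name max_pool_with_table is made up for this example. Each table row is gathered in one pass and combined element-wise, which is the access pattern a vector unit would exploit.

```python
def max_pool_with_table(plane, indices):
    """Max-pool a flat H*W plane using a precomputed index table.

    plane   : flat list of input values (row-major H*W plane)
    indices : indices[k][o] is the flat input index read by kernel
              position k for output element o, or -1 for padding
              (the -1 sentinel is an assumption of this sketch)
    """
    neg_inf = float("-inf")
    out = [neg_inf] * len(indices[0])
    for row in indices:  # one kernel position per iteration
        # Gather every output's operand for this kernel position at once.
        gathered = [plane[i] if i >= 0 else neg_inf for i in row]
        # Element-wise max across all outputs, a stand-in for a vector op.
        out = [max(a, b) for a, b in zip(out, gathered)]
    return out
```

For a 4×4 plane holding 0..15 and a 2×2 window with stride 2, the table has four rows (one per kernel position) and four columns (one per output), and no per-element address arithmetic is needed at pooling time.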

[0026] In the embodiments of this invention, the target original input feature map is segmented and the resulting local input feature maps are moved to local memory via a data transformation engine. An input-output index mapping table is then determined for each local input feature map in local memory. This mapping table, stored in local memory, includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map. Then, based on the input-output index mapping table of each local input feature map, the current local input feature data blocks required by the current pooling kernel of each local input feature map are pooled in parallel in the vector registers of each sub-thread in local memory, yielding the current output feature data blocks in the output feature map. This addresses the fragmented memory access, insufficient computational granularity, and low utilization of vector parallel resources found in existing pooling methods. It optimizes pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of the pooling operator.

Embodiment 2

[0027] Figure 2 is a flowchart of a feature data pooling method provided in Embodiment 2 of the present invention. This embodiment refines the above embodiment. It gives specific optional implementations for determining the input-output index mapping table for each local input feature map in local memory, and for performing parallel pooling processing, in the vector registers of each sub-thread in local memory, on the current local input feature data blocks required by the current pooling kernel of each local input feature map. Correspondingly, as shown in Figure 2, the method in this embodiment may include:

S210. After segmenting the target original input feature map, move each segmented local input feature map to local memory through the data transformation engine.

[0028] In an optional embodiment of the present invention, the data layout format of the target original input feature map is NCHW, and the segmentation of the target original input feature map may include: determining a segmentation strategy for the target original input feature map based on the memory capacity of the local memory and the size relationship between the target original input feature map; determining a segmentation coordinate set for the target original input feature map based on the segmentation strategy; and segmenting the target original input feature map based on the segmentation coordinate set.

[0029] The segmentation strategy of the target original input feature map can be the rule used to segment the high-dimensional target original input feature map. The segmentation coordinate set can be an ordered set of the start and end coordinates of each local input feature map when the target original input feature map is divided into regions according to the segmentation strategy.

[0030] In this embodiment of the invention, the target original input feature map adopts the NCHW (Number-Channels-Height-Width) data layout format. Under this data layout format, the spatial location data of the corresponding feature map within the same channel dimension are stored contiguously, while the data between different channels are distributed non-contiguously in memory. During the pooling calculation process, in order to generate a single output feature data block, multiple related input feature data points need to be traversed and accessed one by one, making it difficult to form a contiguous memory access block, resulting in low memory access efficiency. Due to the non-contiguous storage of data between channels, if vectorization calculation is performed directly at the channel dimension, an additional data layout transpose operation needs to be introduced, resulting in unnecessary data handling and storage overhead. In addition, during each feature data access process, the input feature map boundary out-of-bounds judgment logic needs to be executed. Frequent conditional branch instructions will interrupt the smooth execution of the processor instruction pipeline, significantly reducing the overall computational throughput.

[0031] Therefore, in this embodiment of the invention, when segmenting the target original input feature map, the segmentation strategy can first be determined from the relationship between the capacity of local memory and the size of the target original input feature map. Specifically, if the memory available to each sub-thread in local memory can hold a complete H×W feature plane for one (N, C) index pair, the target original input feature map can be segmented along the (N, C) dimensions; if it cannot, the target original input feature map can be segmented along the (N, C, OH) dimensions, where OH is the height of the output feature map. Based on this, the segmentation coordinate set formed by the start and end coordinates of each local input feature map can be determined according to the segmentation strategy. The target original input feature map can then be segmented according to the segmentation coordinate set to obtain the multiple local input feature maps corresponding to it.
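
One way to picture the capacity-driven choice between the two segmentation strategies is the sketch below. It rests on several assumptions that the source does not spell out: capacity is counted in elements rather than bytes, dilation is 1, the height-direction split is expressed as ranges of output rows whose required input rows are then derived, and the names plan_tiles and capacity_elems are hypothetical.

```python
def plan_tiles(n, c, h, w, kh, stride_h, pad_h, capacity_elems):
    """Return segmentation coordinates as (n_idx, c_idx, row_start, row_end).

    Sketch only: one tile per (n, c) plane when the plane fits in
    capacity_elems; otherwise each plane is split along the height so
    that every tile covers a whole number of output rows.
    """
    out_h = (h + 2 * pad_h - kh) // stride_h + 1
    if h * w <= capacity_elems:
        return [(ni, ci, 0, h) for ni in range(n) for ci in range(c)]

    max_in_rows = capacity_elems // w
    # Largest number of output rows whose input window fits in the buffer:
    # (rows - 1) * stride_h + kh <= max_in_rows
    rows_per_tile = (max_in_rows - kh) // stride_h + 1
    if rows_per_tile < 1:
        raise ValueError("local buffer cannot hold one pooling window row band")

    tiles = []
    for ni in range(n):
        for ci in range(c):
            for o0 in range(0, out_h, rows_per_tile):
                o1 = min(o0 + rows_per_tile, out_h)
                r0 = max(o0 * stride_h - pad_h, 0)            # first input row needed
                r1 = min((o1 - 1) * stride_h - pad_h + kh, h)  # one past the last
                tiles.append((ni, ci, r0, r1))
    return tiles
```

Splitting by output rows rather than input rows keeps every pooling window entirely inside one tile, at the cost of overlapping input rows between neighbouring tiles when the stride is smaller than the kernel height.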

[0032] In an optional embodiment of the present invention, each sub-thread in the local memory is provided with a dual-input buffer area, the dual-input buffer area including a first input buffer area and a second input buffer area; the method may further include: performing a pooling operation using the first input buffer area, and simultaneously performing an asynchronous data transfer operation in parallel using the second input buffer area during the pooling operation in the first input buffer area; after determining that the pooling operation in the first input buffer area has been completed, swapping the region indices of the first input buffer area and the second input buffer area.

[0033] The dual-input buffer region consists of two independent on-chip cache spaces in local memory. For example, the dual-input buffer region may include, but is not limited to, a first input buffer region and a second input buffer region. The region index may be an index number used to uniquely identify the first input buffer region and the second input buffer region.

[0034] Specifically, a dual-input buffer area can be set up for each sub-thread in local memory. Pooling operations are performed using the first input buffer area while, at the same time, asynchronous data transfer operations are performed using the second input buffer area, so that computation and data transfer execute as a parallel pipeline. After the pooling operation in the first input buffer area is completed, the region indices of the first and second input buffer areas are swapped: the second input buffer area, previously used for data transfer, becomes the current computation buffer, and the first input buffer area, previously used for computation, becomes the prefetch buffer for the next set of data. The process of pooling from one buffer while asynchronously filling the other is then repeated. Because only the indices are swapped, computation and prefetching are chained without additional data copying, which reduces data transfer overhead and local memory usage. Furthermore, having multiple sub-threads each run an independent double-buffered pipeline can further improve the parallelism and throughput of the overall pooling computation. Especially for high-resolution input feature maps and large-batch processing, this can reduce overall inference latency and make better use of the hardware's available computing power.
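
The buffer swap can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not device code: a single worker thread stands in for the asynchronous DMA engine, and pooled_stream, load, and compute are hypothetical names. Only the two region indices are exchanged after each step; no buffer contents are copied.

```python
from concurrent.futures import ThreadPoolExecutor


def pooled_stream(tiles, load, compute):
    """Process tiles with two input buffers used in ping-pong fashion.

    load(tile)    : stands in for the asynchronous transfer into a buffer
    compute(data) : stands in for the pooling operation on a filled buffer
    """
    results = []
    if not tiles:
        return results

    buffers = [None, None]
    cur, nxt = 0, 1  # region indices of the compute and prefetch buffers
    with ThreadPoolExecutor(max_workers=1) as dma:
        buffers[cur] = load(tiles[0])  # prime the first buffer
        for i in range(len(tiles)):
            # Start fetching the next tile while the current one is pooled.
            pending = dma.submit(load, tiles[i + 1]) if i + 1 < len(tiles) else None
            results.append(compute(buffers[cur]))
            if pending is not None:
                buffers[nxt] = pending.result()  # wait for the transfer to land
            cur, nxt = nxt, cur  # swap region indices, not data
    return results
```

For example, pooled_stream([[1, 5, 2], [7, 3], [4]], list, max) pools three tiles while each following tile is loaded in the background.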

[0035] S220. Calculate the size of each output feature map based on each of the local input feature maps.

[0036] Correspondingly, after the segmented local input feature maps are moved to local memory through the data transformation engine, the size of the output feature map corresponding to each local input feature map can be calculated based on the size of each local input feature map and the size of the current pooling kernel.
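
The source does not state the size formula itself. A minimal sketch, assuming the conventional floor-rounded relation with stride, padding, and dilation (parameters that appear later in this embodiment), is shown below; pooled_size is a hypothetical helper name and is applied once per spatial dimension.

```python
def pooled_size(in_size, kernel, stride, pad=0, dilation=1):
    """Output extent along one dimension, assuming floor rounding."""
    effective_kernel = dilation * (kernel - 1) + 1  # span covered by the window
    return (in_size + 2 * pad - effective_kernel) // stride + 1


# Height and width are computed independently.
out_h = pooled_size(8, kernel=2, stride=2)
out_w = pooled_size(7, kernel=3, stride=2, pad=1)
```

A framework that rounds up instead of down (often called ceil mode) would yield a different size for inputs that the stride does not divide evenly, so the rounding rule is worth confirming against the target runtime.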

[0037] S230. Determine the pooling kernel position index of each output feature data block in each output feature map according to the size of each output feature map and the size of the current pooling kernel.

[0038] The pooling kernel position index of the output feature data block can be used to characterize the pooling kernel sampling position information corresponding to each output feature data block in the output feature map.

[0039] Accordingly, after calculating the dimensions of each output feature map, the pooling kernel position index of each output feature data block in each output feature map can be determined based on the dimensions of each output feature map and the current pooling kernel size. In a specific example, the pooling kernel position index of the output feature data block can be [k][o], representing the k-th pooling kernel position of the o-th output feature data block, where:

$$o = \mathit{oh} \cdot \mathit{OW} + \mathit{ow}$$

$$k = \mathit{kh} \cdot \mathit{KW} + \mathit{kw}$$

Here, $\mathit{oh}$ is the row coordinate in the output feature map, $\mathit{OW}$ is the width of the output feature map, $\mathit{ow}$ is the column coordinate in the output feature map, $\mathit{kh}$ is the row coordinate within the current pooling kernel, $\mathit{KW}$ is the width of the current pooling kernel, and $\mathit{kw}$ is the column coordinate within the current pooling kernel.
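
The [k][o] position index can be read as two row-major flattenings, one over the output plane and one over the pooling window. The sketch below assumes that reading; the function names are hypothetical and exist only for illustration.

```python
def flat_output_index(oh, ow, out_width):
    """Flatten an output (row, column) pair in row-major order."""
    return oh * out_width + ow


def flat_kernel_index(kh, kw, kernel_width):
    """Flatten a kernel (row, column) pair in row-major order."""
    return kh * kernel_width + kw


def unflatten(index, width):
    """Recover (row, column) from a row-major flat index."""
    return divmod(index, width)
```

Because both flattenings are invertible with a single divmod, a table addressed by [k][o] carries the same information as one addressed by four separate coordinates, in a form that is contiguous along the output axis.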

[0040] S240. Calculate the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each output feature data block.

[0041] The target local input feature data block can be the local input feature data block covered by the current pooling kernel during the calculation of the current output feature data block.

[0042] Accordingly, after determining the pooling kernel position index of each output feature data block in each output feature map, the current input-output index corresponding to the target local input feature data block currently participating in the pooling operation can be located and calculated in the corresponding local input feature map based on the pooling kernel position index.

[0043] In an optional embodiment of the present invention, the step of calculating the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each of the output feature data blocks may include: determining the number of available threads in the local memory; determining the index calculation interval of each available thread in the local memory according to the size of the output feature map and the number of available threads; and calculating the current input-output index of the target local input feature data block using the computing processing unit of each available thread in the local memory.

[0044] The number of available threads can be the total number of threads currently available in local memory for executing computational tasks. The index computation interval can be a contiguous range of indexes allocated to each sub-thread for the index mapping computations that need to be completed.

[0045] In this embodiment of the invention, multiple sub-threads can be started and run in local memory. These sub-threads then compute in parallel the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each output feature data block. Specifically, the number of available threads in local memory currently allocated for executing computational tasks can be determined first. Further, the total range of the index to be computed can be determined based on the size of the output feature map, and this total range can be evenly divided according to the number of available threads. This allocates independent and non-overlapping index computation intervals to the computational processing units of each available thread in local memory, enabling each thread to execute index computation in parallel, avoiding computational conflicts and redundant computations, thereby improving the overall index generation efficiency. In a specific example, the computational processing units of the available threads may include, but are not limited to, ALU (Arithmetic Logic Unit), SIMD (Single Instruction, Multiple Data) execution units, and AGU (Address Generation Unit). This embodiment of the invention does not limit the specific structure of the computational processing units of the available threads.
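
The even division of the index range among threads can be sketched as follows. The source says only that the range is divided evenly into non-overlapping intervals; giving the remainder to the lowest-numbered threads is an assumption of this sketch, and index_intervals is a hypothetical name.

```python
def index_intervals(total, num_threads):
    """Split range(total) into contiguous, non-overlapping [start, end) intervals.

    When total is not a multiple of num_threads, the first
    (total % num_threads) threads each take one extra index.
    """
    base, extra = divmod(total, num_threads)
    intervals = []
    start = 0
    for t in range(num_threads):
        length = base + (1 if t < extra else 0)
        intervals.append((start, start + length))
        start += length
    return intervals
```

Each thread then computes table entries only for the output indices in its own interval, so no two threads write the same entry and no synchronization is needed while the table is being filled.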

[0046] In an optional embodiment of the present invention, calculating the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each of the output feature data blocks may include: calculating the current input-output index of the target local input feature data block based on the following formulas: $i_h = o_h \cdot s_h - p_h + k_h \cdot d_h$; $i_w = o_w \cdot s_w - p_w + k_w \cdot d_w$; where $i_h$ is the row coordinate of the current input-output index of the target local input feature data block; $o_h$ is the row coordinate in the output feature map; $s_h$ is the pooling stride in the height direction; $p_h$ is the padding amount in the height direction of the local input feature map; $k_h$ is the row coordinate within the current pooling kernel; $d_h$ is the dilation rate in the height direction; $i_w$ is the column coordinate of the current input-output index of the target local input feature data block; $o_w$ is the column coordinate in the output feature map; $s_w$ is the pooling stride in the width direction; $p_w$ is the padding amount in the width direction of the local input feature map; $k_w$ is the column coordinate within the current pooling kernel; and $d_w$ is the dilation rate in the width direction.
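As a worked illustration of the quantities listed above, the following sketch computes one input coordinate from an output coordinate, a kernel coordinate, the stride, the padding, and the dilation rate, using the conventional pooling relation (output coordinate times stride, minus padding, plus kernel coordinate times dilation). The function names are hypothetical, and the bounds check for padded positions is an assumption added for illustration.

```python
def input_coord(out_coord: int, kernel_coord: int,
                stride: int, pad: int, dilation: int) -> int:
    """Input row (or column) read by one kernel position of one output element."""
    return out_coord * stride - pad + kernel_coord * dilation


def in_bounds(coord: int, size: int) -> bool:
    """Assumption: coordinates outside [0, size) fall in the padding region."""
    return 0 <= coord < size


# Output row 2, kernel row 1, stride 2, padding 1, dilation 1: 2*2 - 1 + 1*1
ih = input_coord(out_coord=2, kernel_coord=1, stride=2, pad=1, dilation=1)

# Output column 0, kernel column 0 with padding 1 lands left of the input.
iw = input_coord(out_coord=0, kernel_coord=0, stride=2, pad=1, dilation=1)
```

The same function serves both the height and the width direction, since the two formulas differ only in which parameters are supplied.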

[0047] S250. Generate the input-output index mapping table based on the pooling kernel position index of each output feature data block and the current input-output index of the target local input feature data block.

[0048] Correspondingly, after obtaining the pooling kernel position index of each output feature data block and the current input-output index of the target local input feature data block corresponding to the current output feature data block, a one-to-one correspondence between the pooling kernel position index of the current output feature data block and the current input-output index of the target local input feature data block can be established, thereby generating an input-output index mapping table.

[0049] In a specific example, the input-output index mapping table can be a two-dimensional integer array of shape [KHW, OHW], stored in local memory. In the table, indices[k][o] is the flat index in the input HW plane that corresponds to the k-th pooling kernel position of the o-th output feature data block, where $KHW = KH \times KW$ and $OHW = OH \times OW$. Here, $KHW$ is the number of positions in the pooling window of the current pooling kernel, $KH$ is the height of the current pooling kernel, $KW$ is the width of the current pooling kernel, $OHW$ is the total number of spatial locations in the output feature map, $OH$ is the height of the output feature map, and $OW$ is the width of the output feature map.
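A table of shape [KHW, OHW] of this kind could be built as in the sketch below. Several details are assumptions made for illustration and are not stated in the text: the function name `build_index_table`, the floor-mode output size formula, row-major flattening of the HW plane, and the use of -1 as a sentinel for positions that fall in the padding region.

```python
def build_index_table(in_h, in_w, kernel, stride, pad, dilation):
    """Return (table, out_h, out_w), where table[k][o] is a flat input index."""
    (k_h, k_w), (s_h, s_w) = kernel, stride
    (p_h, p_w), (d_h, d_w) = pad, dilation
    # Assumed floor-mode output size, as used by common pooling operators.
    out_h = (in_h + 2 * p_h - d_h * (k_h - 1) - 1) // s_h + 1
    out_w = (in_w + 2 * p_w - d_w * (k_w - 1) - 1) // s_w + 1
    table = []
    for kh in range(k_h):              # k = kh * KW + kw
        for kw in range(k_w):
            row = []
            for oh in range(out_h):    # o = oh * OW + ow
                for ow in range(out_w):
                    ih = oh * s_h - p_h + kh * d_h
                    iw = ow * s_w - p_w + kw * d_w
                    inside = 0 <= ih < in_h and 0 <= iw < in_w
                    row.append(ih * in_w + iw if inside else -1)
            table.append(row)
    return table, out_h, out_w


# 4x4 input plane, 2x2 kernel, stride 2, no padding, dilation 1.
table, out_h, out_w = build_index_table(
    4, 4, kernel=(2, 2), stride=(2, 2), pad=(0, 0), dilation=(1, 1))
```

In this example the table has KHW = 4 rows and OHW = 4 columns, and row k lists, for every output position, the input element seen by kernel position k.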

[0050] S260. In each sub-thread of the local memory, determine from the input-output index mapping table, according to each current output feature data block, the current input-output index that has a mapping relationship with that current output feature data block.

[0051] The current output feature data block is the output feature data block that is currently to be calculated. It should be noted that the current output feature data block can include multiple output feature data blocks.

[0052] Correspondingly, for each sub-thread partitioned in local memory, after determining the input-output index mapping table of the local input feature map that the thread is responsible for, the pooling kernel position index of the current output feature data block in the thread can be further determined based on the calculation logic of the output feature data block and the pooling kernel traversal rules. The pooling kernel position index of the current output feature data block can then be used as the key for a lookup in the input-output index mapping table. The lookup yields the current input-output indices of the local input feature data blocks that correspond to the same pooling kernel position across different output feature data blocks, providing an accurate index basis for subsequent data reading and computation. With this index mapping and lookup mechanism, the element-by-element scalar accesses of the original inner loop are replaced by table lookups, so that execution can be fully vectorized along the OHW dimension of the output feature map space.

[0053] S270. Determine multiple current local input feature data blocks from the current local input feature map according to the current input-output index of each local input feature map.

[0054] Correspondingly, for each sub-thread partitioned in local memory, after determining the current input-output index corresponding to the pooling kernel position index of the current output feature data block in the thread, a vector gather (batch read) instruction can be used, with the current input-output indices of the thread serving as the discrete memory access addresses, to locate and read the multiple current local input feature data blocks in the local input feature map that the thread is responsible for. This maps the discrete indices onto contiguous vector data and loads it efficiently, providing the local input feature data required for subsequent parallel computation.
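The gather step can be illustrated in software as follows. This is only a scalar emulation of what a hardware vector gather instruction would do; the function name `gather`, the -1 sentinel for padded positions, and the fill value are assumptions introduced for the example.

```python
def gather(flat_input, indices, fill):
    """Read flat_input at each index; indices of -1 (assumed padding) yield fill."""
    return [flat_input[i] if i >= 0 else fill for i in indices]


# A 4x4 input plane stored row-major; each value is 10 times its flat index.
flat_input = [10 * v for v in range(16)]

# Index table of shape [KHW, OHW] for a 2x2 kernel with stride 2, no padding.
table = [
    [0, 2, 8, 10],
    [1, 3, 9, 11],
    [4, 6, 12, 14],
    [5, 7, 13, 15],
]

# One gathered vector per kernel position, each covering all output positions.
gathered = [gather(flat_input, row, fill=float("-inf")) for row in table]
```

Each row of `gathered` is a contiguous vector even though the source addresses are scattered, which is the property the subsequent pooling step relies on.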

[0055] S280. The current local input feature data blocks are pooled in parallel by each of the current pooling kernels to obtain the current output feature data blocks in the output feature map.

[0056] Accordingly, after batch reading multiple current local input feature data blocks, parallel pooling operations can be synchronously performed on each loaded current local input feature data block using the hardware parallel computing resources in local memory and the current pooling kernel bound to each sub-thread. Specifically, for each sub-thread partitioned in local memory, after batch reading multiple current local input feature data blocks, the thread can use its bound current pooling kernel to sequentially perform regular pooling operations on the local input feature data blocks obtained in this batch, complete feature aggregation and downsampling processing according to the preset pooling window and calculation strategy, and then generate and output the output feature map corresponding to the local input feature map that the thread is responsible for.
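The reduction over gathered blocks can be sketched as below: one vector per pooling kernel position is combined element-wise, so every output position is updated in the same step. The function name `pool_from_blocks` is hypothetical, the list comprehensions stand in for vector instructions, and the average mode divides by the full kernel size (that is, it counts padded positions), which is an assumption.

```python
def pool_from_blocks(blocks, mode="max"):
    """Combine KHW vectors of length OHW into a single output vector."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out = list(blocks[0])
    for block in blocks[1:]:
        if mode == "max":
            out = [max(a, b) for a, b in zip(out, block)]
        else:
            out = [a + b for a, b in zip(out, block)]
    if mode == "avg":
        # Assumption: divide by the number of kernel positions, padding included.
        out = [v / len(blocks) for v in out]
    return out


# Four gathered blocks (2x2 kernel), each holding values for four outputs.
blocks = [
    [0, 2, 8, 10],
    [1, 3, 9, 11],
    [4, 6, 12, 14],
    [5, 7, 13, 15],
]
max_out = pool_from_blocks(blocks, mode="max")
avg_out = pool_from_blocks(blocks, mode="avg")
```

The loop runs over the KHW kernel positions only; the work along OHW is expressed as whole-vector operations, matching the vectorization direction described in the text.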

[0057] Optionally, the pooling type, pooling kernel type, and boundary type parameters and execution strategy for the pooling method of the aforementioned feature data can be determined during compilation using specific programming language template parameters to eliminate all runtime branches. Simultaneously, vector instructions can be used in all inner loop operations to fully utilize the vector width supported by the hardware.

[0058] In this embodiment of the invention, after segmenting the original target input feature map, the resulting local input feature maps are transferred to local memory via a data transformation engine. The size of each output feature map is calculated based on each local input feature map, and then the pooling kernel position index of each output feature data block in each output feature map is determined based on the size of each output feature map and the size of the current pooling kernel. Further, the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each output feature data block is calculated, and an input-output index mapping table is generated based on the pooling kernel position index of each output feature data block and the current input-output index of the target local input feature data block. After generating the input-output index mapping table, in each sub-thread within local memory, the current input-output index corresponding to each current output feature data block is determined from the input-output index mapping table. Then, based on the current input-output index of each local input feature map, multiple current local input feature data blocks are determined from the current local input feature map. Finally, each current local input feature data block is pooled in parallel using each current pooling kernel to obtain the current output feature data block in the output feature map. This method addresses the problems of memory fragmentation, insufficient computational granularity, and low utilization of vector parallel resources in existing pooling methods. It optimizes pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of the pooling operator.

[0059] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information in this technical solution comply with relevant laws and regulations and do not violate public order and good morals.

[0060] It should be noted that all information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant regions.

[0061] It should be noted that any arrangement or combination of the technical features in the above embodiments also falls within the protection scope of this invention.

[0062] Embodiment 3. Figure 3 is a schematic diagram of a feature data pooling device provided in Embodiment 3 of the present invention. As shown in Figure 3, the device includes a data transfer module 310, an input-output index mapping table determination module 320, and a data pooling processing module 330, wherein: the data transfer module 310 is used to segment the target original input feature map and then transfer the segmented local input feature maps to local memory through the data transformation engine.

[0063] The input-output index mapping table determination module 320 is used to determine an input-output index mapping table for each of the local input feature maps in the local memory, wherein the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map.

[0064] The data pooling processing module 330 is used to perform parallel pooling processing on the current local input feature data blocks required for pooling processing of the current pooling kernel of each local input feature map in the vector registers of each sub-thread in the local memory, according to the input-output index mapping table of each local input feature map, so as to obtain the current output feature data block in the output feature map.

[0065] This invention, in its embodiments, segments the original target input feature map and then moves the resulting local input feature maps to local memory via a data transformation engine. An input-output index mapping table is then established for each local input feature map in local memory. This mapping table, stored in local memory, includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map. Further, based on the input-output index mapping table for each local input feature map, the current local input feature data blocks required for pooling processing by the current pooling kernel of each local input feature map are subjected to parallel pooling processing in the vector registers of each sub-thread in local memory, resulting in the current output feature data blocks in the output feature map. This method addresses the problems of memory fragmentation, insufficient computational granularity, and low utilization of vector parallel resources in existing pooling methods. It optimizes pooling computation scheduling and memory access patterns, thereby improving the computational efficiency and throughput of the pooling operator.

[0066] Optionally, the data layout format of the target original input feature map is NCHW, and the data transfer module 310 is specifically used to: determine the segmentation strategy of the target original input feature map according to the size relationship between the memory capacity of the local memory and the target original input feature map; determine the segmentation coordinate set of the target original input feature map according to the segmentation strategy; and segment the target original input feature map according to the segmentation coordinate set.

[0067] Optionally, the input-output index mapping table determination module 320 is specifically used for: calculating the size of each output feature map based on each of the local input feature maps; determining the pooling kernel position index of each output feature data block in each of the output feature maps based on the size of each output feature map and the size of the current pooling kernel; calculating the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each output feature data block; and generating the input-output index mapping table based on the pooling kernel position index of each output feature data block and the current input-output index of the target local input feature data block.

[0068] Optionally, the input-output index mapping table determination module 320 is further configured to: determine the number of available threads in the local memory; determine the index calculation interval of each available thread in the local memory based on the size of the output feature map and the number of available threads; and calculate the current input-output index of the target local input feature data block using the computing processing unit of each available thread in the local memory.

[0069] Optionally, the input-output index mapping table determination module 320 is further configured to: calculate the current input-output index of the target local input feature data block based on the following formulas: $i_h = o_h \cdot s_h - p_h + k_h \cdot d_h$; $i_w = o_w \cdot s_w - p_w + k_w \cdot d_w$; where $i_h$ is the row coordinate of the current input-output index of the target local input feature data block; $o_h$ is the row coordinate in the output feature map; $s_h$ is the pooling stride in the height direction; $p_h$ is the padding amount in the height direction of the local input feature map; $k_h$ is the row coordinate within the current pooling kernel; $d_h$ is the dilation rate in the height direction; $i_w$ is the column coordinate of the current input-output index of the target local input feature data block; $o_w$ is the column coordinate in the output feature map; $s_w$ is the pooling stride in the width direction; $p_w$ is the padding amount in the width direction of the local input feature map; $k_w$ is the column coordinate within the current pooling kernel; and $d_w$ is the dilation rate in the width direction.

[0070] Optionally, the data pooling processing module 330 is specifically used to: in each sub-thread of the local memory, determine the current input-output index that has a mapping relationship with each current output feature data block from the input-output index mapping table according to each current output feature data block; determine multiple current local input feature data blocks from the current local input feature map according to the current input-output index of each local input feature map; and perform pooling processing on each current local input feature data block in parallel through each current pooling kernel.

[0071] Optionally, each sub-thread in the local memory is provided with a dual-input buffer area, which includes a first input buffer area and a second input buffer area. The above device may also include a dual-buffer module, which is used to perform pooling operations using the first input buffer area and, while those pooling operations are in progress, to perform asynchronous data transfer operations in parallel using the second input buffer area. After the pooling operations in the first input buffer area are completed, the region indices of the first input buffer area and the second input buffer area are swapped.
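The dual-buffer scheme can be sketched in software as a ping-pong loop. This is an illustrative emulation under stated assumptions: a single worker thread stands in for the asynchronous data transfer engine, `load_tile` and `pool_tile` are hypothetical placeholders for the transfer and the pooling operation, and the pooling here is simply a maximum over each tile.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tile(tile):
    """Placeholder for an asynchronous transfer into a local input buffer."""
    return list(tile)


def pool_tile(buf):
    """Placeholder for the pooling operation on one buffered tile."""
    return max(buf)


def run_double_buffered(tiles):
    """Pool tile i from one buffer while tile i + 1 is loaded into the other."""
    if not tiles:
        return []
    buffers = [None, None]
    compute_idx, load_idx = 0, 1
    results = []
    with ThreadPoolExecutor(max_workers=1) as transfer:
        buffers[compute_idx] = load_tile(tiles[0])       # prologue load
        for i in range(len(tiles)):
            pending = None
            if i + 1 < len(tiles):
                pending = transfer.submit(load_tile, tiles[i + 1])
            results.append(pool_tile(buffers[compute_idx]))
            if pending is not None:
                buffers[load_idx] = pending.result()     # wait for the transfer
            # Swap the region indices of the two input buffers.
            compute_idx, load_idx = load_idx, compute_idx
    return results


outputs = run_double_buffered([[1, 5, 2], [7, 3], [4]])
```

Swapping only the two indices, rather than copying data between buffers, keeps the hand-over between transfer and computation free of extra memory traffic.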

[0072] The aforementioned feature data pooling device can execute the feature data pooling method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method. Technical details not described in detail in this embodiment can be found in the feature data pooling method provided in any embodiment of the present invention.

[0073] Since the feature data pooling device described above is an apparatus capable of executing the feature data pooling method in the embodiments of the present invention, those skilled in the art can understand the specific implementation and various variations of the feature data pooling device in this embodiment based on the feature data pooling method described in the embodiments of the present invention. Therefore, how the feature data pooling device implements the feature data pooling method in the embodiments of the present invention will not be described in detail here. Any apparatus used by those skilled in the art to implement the feature data pooling method in the embodiments of the present invention falls within the scope of protection of this application.

[0074] Embodiment 4. Figure 4 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.

[0075] As shown in Figure 4, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

[0076] Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or optical disk; and a communication unit 19, such as a network card, modem, or wireless transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0077] The processor 11 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the feature data pooling method.

[0078] In some embodiments, the feature data pooling method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the feature data pooling method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the feature data pooling method by any other suitable means (e.g., by means of firmware).

[0079] Optionally, the feature data pooling method may include: segmenting the target original input feature map, and then transferring the segmented local input feature maps to local memory via a data transformation engine; determining an input-output index mapping table for each of the local input feature maps in the local memory; wherein the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map; and performing parallel pooling processing on the current local input feature data blocks required for pooling processing of the current pooling kernel of each local input feature map in the vector registers of each sub-thread in the local memory, to obtain the current output feature data blocks in the output feature map.

[0080] Various embodiments of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0081] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0082] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0083] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0084] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0085] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.

[0086] This application also discloses a computer program product, which includes a computer program that, when executed by a processor, implements the feature data pooling method provided in any embodiment of this application. This program product and the feature data pooling methods disclosed in the embodiments of this application belong to the same inventive concept, and therefore will not be described in detail here.

[0087] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0088] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A pooling method for feature data, characterized in that the method comprises: after segmenting the target original input feature map, transferring the segmented local input feature maps to local memory through a data transformation engine; determining an input-output index mapping table for each of the local input feature maps in the local memory, wherein the input-output index mapping table is stored in the local memory and includes the mapping relationship between each local input feature data block in the local input feature map and each output feature data block in the output feature map; and, based on the input-output index mapping table of each local input feature map, performing parallel pooling processing, in the vector registers of each sub-thread in the local memory, on the current local input feature data blocks required for pooling processing by the current pooling kernel of each local input feature map, to obtain the current output feature data blocks in the output feature map.

2. The method according to claim 1, characterized in that the data layout format of the target original input feature map is batch-channel-height-width (NCHW), and segmenting the target original input feature map includes: determining the segmentation strategy of the target original input feature map based on the size relationship between the memory capacity of the local memory and the target original input feature map; determining the segmentation coordinate set of the target original input feature map according to the segmentation strategy; and segmenting the target original input feature map according to the segmentation coordinate set.

3. The method according to claim 1, characterized in that, Determining the input-output index mapping table for each of the local input feature maps in the local memory includes: Calculate the size of each output feature map based on each of the local input feature maps; The pooling kernel position index of each output feature data block in each output feature map is determined based on the size of each output feature map and the size of the current pooling kernel; Calculate the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each output feature data block; The input-output index mapping table is generated based on the pooling kernel position index of each output feature data block and the current input-output index of the target local input feature data block.

4. The method according to claim 3, characterized in that calculating the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each of the output feature data blocks includes: determining the number of available threads in the local memory; determining the index calculation interval of each available thread in the local memory based on the size of the output feature map and the number of available threads; and calculating the current input-output index of the target local input feature data block using the computing processing unit of each available thread in the local memory.

5. The method according to claim 3 or 4, characterized in that calculating the current input-output index of the target local input feature data block in the local input feature map corresponding to the pooling kernel position index of each of the output feature data blocks includes: calculating the current input-output index of the target local input feature data block based on the following formulas: $i_h = o_h \cdot s_h - p_h + k_h \cdot d_h$; $i_w = o_w \cdot s_w - p_w + k_w \cdot d_w$; where $i_h$ is the row coordinate of the current input-output index of the target local input feature data block; $o_h$ is the row coordinate in the output feature map; $s_h$ is the pooling stride in the height direction; $p_h$ is the padding amount in the height direction of the local input feature map; $k_h$ is the row coordinate within the current pooling kernel; $d_h$ is the dilation rate in the height direction; $i_w$ is the column coordinate of the current input-output index of the target local input feature data block; $o_w$ is the column coordinate in the output feature map; $s_w$ is the pooling stride in the width direction; $p_w$ is the padding amount in the width direction of the local input feature map; $k_w$ is the column coordinate within the current pooling kernel; and $d_w$ is the dilation rate in the width direction.

6. The method according to claim 1, characterized in that performing parallel pooling processing, in the vector registers of each sub-thread in the local memory, on the current local input feature data blocks required for pooling processing of the current pooling kernel of each local input feature map includes: in each sub-thread of the local memory, determining, from the input-output index mapping table, the current input-output index that has a mapping relationship with each current output feature data block; determining multiple current local input feature data blocks from the current local input feature map based on the current input-output index of each local input feature map; and pooling each of the current local input feature data blocks in parallel using each of the current pooling kernels.
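
The lookup-then-pool flow of claim 6 can be sketched as follows. This is a sequential Python illustration under stated assumptions: max pooling is used as the reduction (the claim does not fix the pooling type), the mapping table is a dictionary from output position to input positions, and `max_pool_with_table` is a hypothetical name. On the claimed hardware, the gather and reduction would run in vector registers across sub-threads rather than in a Python loop.

```python
def max_pool_with_table(feature_map, table, out_h, out_w):
    """Pool by looking up each output position's input positions in the table."""
    output = [[None] * out_w for _ in range(out_h)]
    for (oh, ow), taps in table.items():
        # Gather the input values named by the table, then reduce them.
        output[oh][ow] = max(feature_map[ih][iw] for ih, iw in taps)
    return output


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Mapping table for a 2x2 kernel with stride 2 and no padding.
table = {
    (oh, ow): [(2 * oh + kh, 2 * ow + kw) for kh in range(2) for kw in range(2)]
    for oh in range(2)
    for ow in range(2)
}

pooled = max_pool_with_table(feature_map, table, 2, 2)
```

Because every output entry depends only on its own table row, the iterations are independent and can be distributed across sub-threads without synchronization.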

7. The method according to claim 1, characterized in that each sub-thread in the local memory is provided with a dual-input buffer area, the dual-input buffer area including a first input buffer area and a second input buffer area; and the method further includes: performing pooling operations using the first input buffer area and, during the pooling operations in the first input buffer area, performing asynchronous data transfer operations in parallel using the second input buffer area; and after the pooling operation of the first input buffer area is completed, swapping the area indices of the first input buffer area and the second input buffer area.
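
The dual-input buffer scheme of claim 7 resembles ping-pong (double) buffering. The Python sketch below is an analogy only: a single-worker `ThreadPoolExecutor` stands in for the asynchronous data transfer engine, `load_tile` is a hypothetical placeholder for the transfer into local memory, and a plain `max` stands in for the pooling operation.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tile(tiles, i):
    # Placeholder for an asynchronous transfer of tile i into local memory.
    return list(tiles[i])


def pool_double_buffered(tiles):
    """Pool tile i from one buffer while tile i + 1 loads into the other."""
    if not tiles:
        return []
    results = []
    buffers = [None, None]
    compute_idx, load_idx = 0, 1
    with ThreadPoolExecutor(max_workers=1) as transfer:
        buffers[compute_idx] = load_tile(tiles, 0)  # prime the first buffer
        for i in range(len(tiles)):
            pending = None
            if i + 1 < len(tiles):
                # Start the next transfer before pooling the current tile.
                pending = transfer.submit(load_tile, tiles, i + 1)
            results.append(max(buffers[compute_idx]))  # pooling stand-in
            if pending is not None:
                buffers[load_idx] = pending.result()
            # Swap the buffer indices instead of copying data.
            compute_idx, load_idx = load_idx, compute_idx
    return results


maxima = pool_double_buffered([[1, 5, 2], [7, 3], [0, 4]])
```

Swapping the two indices rather than the buffer contents keeps the hand-off constant-time, which is the point of exchanging area indices after each pooling step.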

8. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program that is executable by the at least one processor to enable the at least one processor to perform the feature data pooling method according to any one of claims 1-7.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a processor to perform the feature data pooling method according to any one of claims 1-7.

10. A computer program product, characterized in that it includes a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the feature data pooling method according to any one of claims 1-7.