Data processing apparatus, data processing method and related devices
By employing a circular cache space mapping relationship in the neural network processor for distributed storage and retrieval of data, the high power consumption problem of the neural network processor is solved, and low-power data processing is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP LTD
- Filing Date
- 2021-12-22
- Publication Date
- 2026-06-26
AI Technical Summary
Existing neural network processors consume a lot of power during data processing, and traditional chip designs increase chip area and power consumption.
By employing a specific storage space architecture and mapping relationships within a ring-shaped cache space, distributed storage and data retrieval are achieved, reducing the area of the cache control circuitry and lowering the power consumption for data writing and reading.
It effectively reduces the power consumption of neural network processors, reduces energy consumption for data reading and writing, and does not increase the physical area of the chip.
Figure CN116362304B_ABST
Description
Technical Field
[0001] This application relates to the field of neural network processor technology, and in particular to a data processing apparatus, a data processing method, and related devices.
Background Technology
[0002] With the development of technology, neural network processing units (NPUs) are often integrated into systems to enhance the artificial intelligence capabilities of devices. NPUs typically employ a "data-driven parallel computing" architecture to accelerate neural network operations and address the inefficiency of traditional chips in neural network computation. Reducing the power consumption of the NPU has become a significant challenge.
Summary of the Invention
[0003] In view of this, this application provides a data processing apparatus, a data processing method, and related apparatus, which can reduce the area of the cache control circuit in a neural network processor and reduce the power consumption of data writing and reading through a specific storage space architecture.
[0004] In a first aspect, embodiments of this application provide a data processing apparatus, including a neural network processor, the neural network processor including a processing unit array and M storage modules, the processing unit array including M columns of processing unit sets, where M is a positive even number;
[0005] Each storage module includes ring cache spaces in one-to-one correspondence with the layers of the neural network. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces.
[0006] Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows include a head address space formed by the first N-1 rows and a tail address space formed by the last N-1 rows, where NxN is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. Within a circular cache space, the tail shadow space (the N-1 rows following the tail address space) of any ring cache space has a first mapping relationship with the head address space (the first N-1 rows) of the next ring cache space, and the head shadow space (the N-1 rows preceding the head address space) of any ring cache space has a second mapping relationship with the tail address space (the last N-1 rows) of the previous ring cache space.
[0007] The M storage modules are used for distributed storage of the first data, and the M columns of processing unit sets are used to read the first data stored in distributed fashion in the M storage modules.
[0008] In a second aspect, embodiments of this application provide a data processing method, applied to the data processing apparatus described in the first aspect of embodiments of this application, the method comprising:
[0009] The first data is divided into n first sub-data based on the number of channels n. The first data represents the data to be stored, and n is a positive integer less than or equal to M / 2.
[0010] Each first sub-data is written into a circular cache space consisting of INT(M / n) ring cache spaces, and any one circular cache space is used to store any one of the first sub-data.
[0011] In a third aspect, embodiments of this application provide a system-on-a-chip (SoC) including a neural network processor. The neural network processor includes a processing unit array and M storage modules. The processing unit array includes M columns of processing unit sets, where M is a positive even number. Each storage module includes ring cache spaces in one-to-one correspondence with the layers of the neural network. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces. Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows of cache address space include a head address space formed by the first N-1 rows and a tail address space formed by the last N-1 rows, where NxN is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. The tail shadow space (the N-1 rows following the tail address space) of any ring cache space in a circular cache space has a first mapping relationship with the head address space (the first N-1 rows) of the next ring cache space, and the head shadow space (the N-1 rows preceding the head address space) of any ring cache space in a circular cache space has a second mapping relationship with the tail address space (the last N-1 rows) of the previous ring cache space. The neural network processor is used for:
[0012] The first data is divided into n first sub-data based on the number of channels n. The first data represents the data to be stored, and n is a positive integer less than or equal to M / 2.
[0013] Each first sub-data is written into a circular cache space consisting of INT(M / n) ring cache spaces, and any one circular cache space is used to store any one of the first sub-data.
[0014] In a fourth aspect, embodiments of this application provide an electronic device, including a memory and a processor. The memory is used to store a program, and the processor executes the program stored in the memory. When the program stored in the memory is executed, the processor performs the steps of any method described in the second aspect of embodiments of this application.
[0015] In a fifth aspect, embodiments of this application provide a computer storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, cause the processor to perform any method described in the second aspect of embodiments of this application.
[0016] In a sixth aspect, embodiments of this application provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any method of the second aspect of this application. The computer program product may be a software installation package.
[0017] As can be seen, the data processing apparatus, data processing method, and related devices described above involve a neural network processor. The neural network processor comprises a processing unit array and M storage modules. The processing unit array includes M columns of processing unit sets, where M is a positive even number. Each storage module includes ring cache spaces in one-to-one correspondence with the layers of the neural network. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces. Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows of cache address space include a head address space formed by the first N-1 rows and a tail address space formed by the last N-1 rows, where NxN is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. The tail shadow space (the N-1 rows following the tail address space) of any ring cache space in a circular cache space has a first mapping relationship with the head address space of the next ring cache space, and the head shadow space (the N-1 rows preceding the head address space) of any ring cache space in a circular cache space has a second mapping relationship with the tail address space of the previous ring cache space. The M storage modules are used for distributed storage of the first data, and the M columns of processing unit sets are used to read the first data stored in distributed fashion in the M storage modules. This reduces the area of the cache control circuit in the neural network processor and lowers the power consumption for data writing and reading.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the architecture of a data processing device provided in an embodiment of this application;
[0020] Figure 2 is a schematic diagram of the structure of a storage module provided in an embodiment of this application;
[0021] Figure 3 is a schematic diagram of the structure of a ring cache space provided in an embodiment of this application;
[0022] Figure 4 is an example structural diagram of a circular cache space provided in an embodiment of this application;
[0023] Figure 5 is a schematic flowchart of a data processing method provided in an embodiment of this application;
[0024] Figure 6 is a schematic diagram of the architecture of a neural network processor provided in an embodiment of this application;
[0025] Figure 7 is a power consumption comparison diagram provided in an embodiment of this application;
[0026] Figure 8 is a schematic diagram of the structure of a system-on-a-chip provided in an embodiment of this application;
[0027] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0028] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present application.
[0029] The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.
[0030] It should be understood that the term "and / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this document indicates that the preceding and following related objects are in an "or" relationship. In the embodiments of this application, "multiple" refers to two or more.
[0031] In this application, the term "connection" refers to various connection methods, such as direct connection or indirect connection, to achieve communication between devices. This application does not impose any limitations on this.
[0032] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0033] The background technology and related terms of this application are explained below.
[0034] Related background technology:
[0035] In NPU architecture design, multi-level caches are typically used to improve system bandwidth. The basic computational unit in an NPU is usually an arithmetic and logic unit (ALU) with storage, also known as a processing element (PE). The storage in the PE is called the level 0 cache. Furthermore, the data processed by neural networks is often multi-channel data, such as image data with multiple color channels. In NPU systolic array architectures for convolutional neural networks, a common approach is to split the buffer into independent small static random access memory (SRAM) buffers corresponding to different channels, each serving different PEs to increase bandwidth. Each SRAM corresponds to one column of PEs. When a PE column in the PE array needs to access data in a different SRAM, one existing method is to add an extra bus between the buffers of the different SRAMs to support data access across SRAMs, which increases chip area and power consumption. Another method is to combine multiple SRAMs into one large SRAM. However, because the data held in different SRAMs overlaps, the overlapping data must be fetched from dynamic random access memory (DRAM) multiple times. Each DRAM read consumes about 100 times the power of an SRAM access, which greatly increases power consumption.
[0036] To address the aforementioned issues, this application provides a data processing apparatus, a data processing method, and related devices, which can employ a new storage space architecture to reduce the area of the cache control circuit in the neural network processor and lower the power consumption for data writing and reading.
[0037] A data processing device according to an embodiment of this application will be described below with reference to Figure 1. Figure 1 is a schematic diagram of the architecture of a data processing device provided in an embodiment of this application. The data processing device 100 includes a processing unit array 110 and M storage modules 120. The processing unit array 110 includes M columns of processing unit sets 111, where M is a positive even number.
[0038] Each storage module 120 includes ring cache spaces in one-to-one correspondence with the layers of the neural network. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces.
[0039] Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows include a head address space formed by the first N-1 rows and a tail address space formed by the last N-1 rows, where NxN is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. Within a circular cache space, the tail shadow space (the N-1 rows following the tail address space) of any ring cache space has a first mapping relationship with the head address space (the first N-1 rows) of the next ring cache space, and the head shadow space (the N-1 rows preceding the head address space) of any ring cache space has a second mapping relationship with the tail address space (the last N-1 rows) of the previous ring cache space.
[0040] The M storage modules 120 are used for distributed storage of the first data, and the M columns of processing unit sets 111 are used to read the first data stored in distributed fashion in the M storage modules 120.
[0041] The aforementioned storage module 120 can be an SRAM, and the aforementioned processing unit set 111 can include multiple PEs.
[0042] For ease of understanding, each storage module 120 in the embodiments of this application will be described with reference to Figure 2. Figure 2 is a schematic diagram of the structure of a storage module provided in an embodiment of this application. The storage module includes α ring cache spaces, namely ring cache space 0 to ring cache space α-1 in the figure. Different ring cache spaces correspond to different layers of the neural network.
[0043] In one possible embodiment, when allocating the circular cache space for each storage module 120, the space can be allocated in units of the space occupied by even-numbered rows of pixels in the image, so as to ensure that the cache addresses of different storage modules 120 are aligned in the same layer of the neural network.
[0044] Furthermore, the ring cache space in the embodiments of this application will be described with reference to Figure 3. Figure 3 is a schematic diagram of the structure of a ring cache space provided in an embodiment of this application. The ring cache space includes x rows of cache address space, namely rows 0 to x-1 in the figure. When the convolution kernel is NxN, the first N-1 rows can be set as the head address space and the last N-1 rows as the tail address space.
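The head and tail row ranges described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `head_tail_rows` and its parameters are hypothetical and do not appear in the application; `x` is the number of rows in one per-module cache space and `n` is the kernel size N.

```python
from __future__ import annotations


def head_tail_rows(x: int, n: int) -> tuple[range, range]:
    """Return (head_rows, tail_rows) for a cache space of x rows and an n x n kernel.

    Hypothetical helper: the head is the first n-1 rows, the tail the last n-1 rows.
    """
    if x <= 2:
        raise ValueError("x must be greater than 2")
    if not 1 < n < x:
        raise ValueError("kernel size must satisfy 1 < n < x")
    halo = n - 1
    return range(0, halo), range(x - halo, x)


# Example values taken from the 8-row, 3x3-kernel case used later in the text.
head, tail = head_tail_rows(x=8, n=3)
print(list(head), list(tail))  # [0, 1] [6, 7]
```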
[0045] Furthermore, the circular cache space in the embodiments of this application will be described with reference to Figure 4. Figure 4 is an example structural diagram of a circular cache space provided in an embodiment of this application. As shown, two ring cache spaces form one circular cache space, and each ring cache space includes 8 rows of cache address space: rows 0 to 7 in the first ring cache space and rows 8 to 15 in the second ring cache space (the middle rows are not shown). The convolution kernel size is set to 3x3, so N-1 = 2, and the first two rows and the last two rows of each ring cache space are mapped to shadow spaces of the other ring cache space. That is, rows 6 and 7 are mapped to the two rows of head shadow space before row 8, and rows 8 and 9 are mapped to the two rows of tail shadow space after row 7. Similarly, rows 14 and 15 are mapped to the two rows of head shadow space before row 0, and rows 0 and 1 are mapped to the two rows of tail shadow space after row 15. This cyclical structure forms the circular cache space. It is understood that the shadow space does not actually store data; it merely represents an address space for which a mapping relationship exists. When data in a shadow space is needed, it is obtained from the source space to which the shadow space is mapped. For example, when data for rows 7, 8, and 9 is needed, the data for row 7 is read from the first ring cache space, while the data for rows 8 and 9 is read from the second ring cache space according to the mapping relationship.
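The Figure 4 example can be modelled as a small address-resolution function. The sketch below is an assumption-laden illustration rather than the patented circuit: `resolve_row` and its parameters are hypothetical names, `ring` indexes a per-module cache space inside one circular cache space of `k` such spaces, and `offset` is a row position relative to that space, where negative values fall in the head shadow space and values of `x` or more fall in the tail shadow space.

```python
from __future__ import annotations


def resolve_row(ring: int, offset: int, x: int, k: int, n: int) -> tuple[int, int]:
    """Map a (possibly shadow) row to the (ring, local_row) that really stores it."""
    halo = n - 1  # number of shadow rows on each side
    if not -halo <= offset < x + halo:
        raise ValueError("offset outside the cache space and its shadow rows")
    if offset < 0:
        # Head shadow space: backed by the tail rows of the previous ring.
        return (ring - 1) % k, x + offset
    if offset >= x:
        # Tail shadow space: backed by the head rows of the next ring.
        return (ring + 1) % k, offset - x
    # Ordinary row: stored locally.
    return ring, offset


# Two rings of 8 rows each, 3x3 kernel, as in the Figure 4 description.
for off in (7, 8, 9):
    print(off, resolve_row(ring=0, offset=off, x=8, k=2, n=3))
# 7 (0, 7)
# 8 (1, 0)
# 9 (1, 1)
```

In this model, local row 0 of ring 1 is global row 8, so the output agrees with the statement that rows 8 and 9 are served from the second cache space.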
[0046] The aforementioned data processing device enables distributed data storage, while consuming relatively low power when reading data from the distributed storage.
[0047] A data processing method according to an embodiment of this application will be described below with reference to Figure 5. Figure 5 is a schematic flowchart of a data processing method provided in an embodiment of this application, which specifically includes the following steps:
[0048] Step 501: Divide the first data into n first sub-data according to the number of channels n of the first data.
[0049] Wherein, the first data represents the data to be stored, and n is a positive integer less than or equal to M / 2. For example, the first data can be image data with n channels. In order to ensure that the data of each channel is evenly distributed and stored in M storage modules, n needs to be less than or equal to M / 2.
[0050] Step 502: Write each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces.
[0051] Here, any one circular cache space is used to store any one of the first sub-data.
[0052] Specifically, the number of rows y of each first sub-data can be obtained, and then each first sub-data can be written sequentially into the circular cache space composed of INT(M / n) ring cache spaces, in y / (INT(M / n)*x) rounds, to store the first data.
[0053] For example, when n is 2 and M is 4, there are 2 first sub-data, and 2 circular cache spaces can be constructed to store them respectively. That is, each first sub-data is written into a circular cache space composed of INT(4 / 2) = 2 ring cache spaces. When n is 2 and M is 9, there are again 2 first sub-data and 2 circular cache spaces, and each first sub-data is written into a circular cache space composed of INT(9 / 2) = 4 ring cache spaces. This ensures the maximum degree of distributed storage.
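The grouping and round count in step 502 reduce to integer arithmetic, sketched below. The function names are hypothetical, and rounding the round count up with `math.ceil` is an assumption for the case where y is not an exact multiple of INT(M / n)*x; the text only gives the quotient y / (INT(M / n)*x). The values y = 2800 and x = 20 are illustrative.

```python
import math


def rings_per_channel(m: int, n: int) -> int:
    """INT(M / n): how many per-module cache spaces back one channel."""
    if n < 1 or 2 * n > m:
        raise ValueError("n must satisfy 1 <= n <= M / 2")
    return m // n


def write_rounds(y: int, m: int, n: int, x: int) -> int:
    """Rounds needed to write y rows when each round holds INT(M / n) * x rows."""
    rows_per_round = rings_per_channel(m, n) * x
    return math.ceil(y / rows_per_round)  # assumption: a partial round counts as one


print(rings_per_channel(4, 2))               # 2
print(rings_per_channel(9, 2))               # 4
print(write_rounds(y=2800, m=4, n=2, x=20))  # 70
```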
[0054] Step 503: Read the first sub-data in each circular cache space through each processing unit set.
[0055] Specifically, each first sub-data can be read sequentially from the INT(M / n) ring cache spaces in y / (INT(M / n)*x) rounds.
[0056] In one possible embodiment, when it is necessary to read the first sub-data of any head shadow space, the first sub-data in the tail address space (the last N-1 rows) of the previous ring cache space corresponding to the head shadow space can be read according to the first mapping relationship.
[0057] In one possible embodiment, when it is necessary to read the first sub-data of any tail shadow space, the first sub-data in the head address space (the first N-1 rows) of the next ring cache space corresponding to the tail shadow space is read according to the second mapping relationship.
[0058] In one possible embodiment, when it is not necessary to read any head shadow space or any tail shadow space, the first sub-data of each row is read sequentially.
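The three read cases above can be exercised on a toy software model in which each per-module cache space is a Python list of rows. This is a sketch under stated assumptions, not the hardware read path: `write_rows` and `read_window` are hypothetical helpers, and a "window" stands for the N consecutive rows a convolution needs.

```python
from __future__ import annotations


def write_rows(rows: list[str], k: int, x: int) -> list[list[str | None]]:
    """Fill k cache spaces of x rows each with one round of data, in row order."""
    rings: list[list[str | None]] = [[None] * x for _ in range(k)]
    for i, row in enumerate(rows[: k * x]):
        rings[i // x][i % x] = row
    return rings


def read_window(rings: list[list[str | None]], ring: int, start: int, n: int) -> list[str | None]:
    """Read n consecutive rows beginning at local row `start` of `ring`."""
    k, x = len(rings), len(rings[ring])
    out = []
    for off in range(start, start + n):
        if off < 0:            # head shadow row: previous ring's tail
            out.append(rings[(ring - 1) % k][x + off])
        elif off >= x:         # tail shadow row: next ring's head
            out.append(rings[(ring + 1) % k][off - x])
        else:                  # ordinary row: read locally
            out.append(rings[ring][off])
    return out


rings = write_rows([f"row{i}" for i in range(16)], k=2, x=8)
print(read_window(rings, ring=0, start=7, n=3))  # ['row7', 'row8', 'row9']
print(read_window(rings, ring=1, start=6, n=3))  # ['row14', 'row15', 'row0']
```

Every row is stored exactly once in this model; windows that straddle a boundary are assembled from two lists rather than from duplicated rows.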
[0059] The above data processing method realizes distributed data storage. When the number of channels is 2 and the number of SRAMs is 8, four SRAMs can be combined for one channel, so the image size that can be stored is 4 times that without evenly distributed storage. When reading data, the data is read from the source space corresponding to the shadow space according to the mapping relationship, without accessing memory, which greatly reduces power consumption.
[0060] To facilitate understanding, the data processing apparatus and data processing method in this application will be illustrated below with examples. For instance, suppose there are four storage modules, namely SRAM0, SRAM1, SRAM2 and SRAM3. The first data to be stored is two-channel image data. Existing methods generally store one channel of data in SRAM0 and the other channel of data in SRAM1, while SRAM2 and SRAM3 are left unused, which is a waste of resources.
[0061] This solution establishes a mapping relationship between SRAM0 and SRAM2, whose head and tail address spaces are mapped to each other; similarly, a mapping relationship exists between SRAM1 and SRAM3, whose head and tail address spaces are mapped to each other. This architecture enables distributed storage, in which the data of one channel is stored in SRAM0 and SRAM2 and the data of the other channel is stored in SRAM1 and SRAM3, as shown in Figure 6; this is not elaborated further here.
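The SRAM0/SRAM2 and SRAM1/SRAM3 pairing in this example matches a simple stride-n assignment of storage modules to channels. The sketch below assumes that interleaved pattern purely for illustration; `assign_modules` is a hypothetical name, and the application notes elsewhere that other groupings are equally possible.

```python
from __future__ import annotations


def assign_modules(m: int, n: int) -> dict[int, list[int]]:
    """Assign INT(m / n) storage modules to each of n channels with stride n."""
    if n < 1 or 2 * n > m:
        raise ValueError("n must satisfy 1 <= n <= M / 2")
    per_channel = m // n
    return {c: [c + i * n for i in range(per_channel)] for c in range(n)}


print(assign_modules(4, 2))  # {0: [0, 2], 1: [1, 3]}
print(assign_modules(8, 2))  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```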
[0062] Assume the data to be stored is an image with 2 channels and 2800 rows of data. The existing method stores channel 1 in ring cache space 1 of the allocated storage module 1. Ring cache space 1 can store 20 rows of data, and the corresponding ring cache space 2 of the other storage module 2 can also store 20 rows of data, so 40 rows of data can be retrieved at a time. According to the previous analysis, with a 7x7 convolution kernel, retrieving 40 rows of data involves 24 rows of overlapping data. A total of 2800 / 40 = 70 rounds are needed to retrieve all the image data. These 70 rounds require an additional 70 x 24 = 1680 rows of overlapping data to be retrieved, which is 1680 / 2800 = 0.6 of the total image rows. In other words, an amount of data equal to 60% of the image is overlapping data that requires repeated transfer. If the mapping strategy proposed in this application is adopted, the power consumption caused by this 60% of data transfer can be saved.
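The arithmetic in this example can be checked with a short calculation. The function name `overlap_stats` is hypothetical, and the 24 overlapping rows per round are taken directly from the text as an input rather than derived.

```python
from __future__ import annotations


def overlap_stats(total_rows: int, rows_per_round: int, overlap_per_round: int) -> tuple[int, int, float]:
    """Return (rounds, extra_overlap_rows, overlap_fraction_of_image)."""
    rounds = total_rows // rows_per_round        # 2800 / 40
    extra_rows = rounds * overlap_per_round      # rows fetched again because of overlap
    return rounds, extra_rows, extra_rows / total_rows


rounds, extra_rows, fraction = overlap_stats(2800, 40, 24)
print(rounds, extra_rows, fraction)  # 70 1680 0.6
```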
[0063] Figure 7 is a power consumption comparison diagram provided in an embodiment of this application. As shown in Figure 7, the shaded area represents the power consumed in reading the overlapping area. It can be seen that the power consumption of this solution is significantly lower than that of the existing solutions.
[0064] It is understood that the above is an illustrative example. The data of a channel can be evenly distributed among several storage modules, such as storing the data of a channel in SRAM0 and SRAM4, or SRAM1 and SRAM2, etc. No specific limitation is made here.
[0065] The scenarios in which this application's embodiments can be applied are as follows:
[0066] n × 2 ≤ M
[0067] where n is the number of channels of the data to be stored, and M is the number of storage modules.
[0068] Through the aforementioned data processing device and data processing method, in the architecture of a neural network processor that does not support access between SRAMs at the same cache level, multiple SRAMs can be connected in series through mapping to fully utilize the space in SRAM, thereby increasing the data storage capacity without increasing power consumption and chip area.
[0069] A system-on-a-chip (SoC) 800 according to an embodiment of this application will be described below with reference to Figure 8. The SoC 800 includes a neural network processor 810. The neural network processor 810 includes a processing unit array and M storage modules. The processing unit array includes M columns of processing unit sets, where M is a positive even number. Each storage module includes ring cache spaces in one-to-one correspondence with the layers of the neural network. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces. Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows of cache address space include a head address space formed by the first N-1 rows and a tail address space formed by the last N-1 rows, where NxN is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. The tail shadow space (the N-1 rows following the tail address space) of any ring cache space in a circular cache space has a first mapping relationship with the head address space (the first N-1 rows) of the next ring cache space; the head shadow space (the N-1 rows preceding the head address space) of any ring cache space in a circular cache space has a second mapping relationship with the tail address space (the last N-1 rows) of the previous ring cache space. The neural network processor 810 is used for:
[0070] The first data is divided into n first sub-data based on the number of channels n. The first data represents the data to be stored, and n is a positive integer less than or equal to M / 2.
[0071] Each first sub-data is written into a circular cache space consisting of INT(M / n) ring cache spaces, and any one circular cache space is used to store any one of the first sub-data.
[0072] In one possible embodiment, in terms of writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces, the neural network processor 810 is specifically configured to:
[0073] Obtain the number of rows y of each first sub-data;
[0074] Write each first sub-data sequentially into the circular cache space composed of INT(M / n) ring cache spaces, in y / (INT(M / n)*x) rounds, to store the first data.
[0075] In one possible embodiment, after writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces, the neural network processor 810 is further configured to:
[0076] Read the first sub-data in each circular cache space through each processing unit set.
[0077] In one possible embodiment, in terms of reading the first sub-data in each circular cache space through each processing unit set, the neural network processor 810 is specifically configured to:
[0078] Read each first sub-data sequentially from the INT(M / n) ring cache spaces in y / (INT(M / n)*x) rounds.
[0079] In one possible embodiment, in terms of reading each first sub-data sequentially from the INT(M / n) ring cache spaces in y / (INT(M / n)*x) rounds, the neural network processor 810 is specifically configured to:
[0080] When it is necessary to read the first sub-data of any head shadow space, read the first sub-data in the tail address space (the last N-1 rows) of the previous ring cache space corresponding to the head shadow space according to the first mapping relationship;
[0081] When it is necessary to read the first sub-data of any tail shadow space, read the first sub-data in the head address space (the first N-1 rows) of the next ring cache space corresponding to the tail shadow space according to the second mapping relationship;
[0082] When it is not necessary to read any head shadow space or any tail shadow space, read the first sub-data of each row in sequence.
[0083] As can be seen, this can reduce the area of the cache control circuit in the neural network processor and reduce the power consumption for data writing and reading.
[0084] An electronic device according to an embodiment of this application will be described below with reference to Figure 9. Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 9, the electronic device 900 includes a processor 901, a communication interface 902, and a memory 903, which are interconnected. The electronic device 900 may also include a bus 904, through which the processor 901, communication interface 902, and memory 903 are interconnected. The bus 904 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The bus 904 can be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus is represented in Figure 9 by a single thick line, but this does not indicate that there is only one bus or one type of bus. The memory 903 stores a computer program, which includes program instructions. The processor is configured to call the program instructions and execute all or part of the method described above with reference to Figure 5.
[0085] The above primarily describes the solutions of the embodiments of this application from the perspective of the method execution process. It is understood that, in order to achieve the above functions, the electronic device includes corresponding hardware structures and / or software modules for executing each function. Those skilled in the art should readily recognize that, in conjunction with the units and algorithm steps of the various examples described in the embodiments provided herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0086] In the embodiments of this application, the electronic device may be divided into functional units according to the above method examples. For example, each function may be assigned to a separate functional unit, or two or more functions may be integrated into one processing unit. The integrated unit can be implemented in hardware or as a software functional unit. It should be noted that the unit division in the embodiments of this application is illustrative and represents only one logical functional division. In actual implementation, there may be other division methods.
[0087] This application also provides a computer storage medium storing a computer program for electronic data interchange, which causes a computer to perform some or all of the steps of any of the methods described in the above method embodiments.
[0088] This application also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods described in the above method embodiments. The computer program product may be a software installation package, and the computer may include an electronic device.
[0089] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.
[0090] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0091] In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of the units described above is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical or take other forms.
[0092] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0093] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0094] If the integrated units described above are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0095] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.
[0096] The embodiments of this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A data processing apparatus, characterized in that, The neural network processor includes a processing unit array and M storage modules. The processing unit array includes M columns of processing unit sets, where M is a positive even number. Each storage module includes ring cache spaces in one-to-one correspondence with the number of neural network layers. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces. Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows of cache address space include the first N-1 rows of head address space and the last N-1 rows of tail address space, where N×N is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. The N-1 rows of tail shadow space following the tail address space of any ring cache space that makes up a circular cache space have a first mapping relationship with the first N-1 rows of head address space of the next ring cache space. The N-1 rows of head shadow space preceding the head address space of any ring cache space that makes up a circular cache space have a second mapping relationship with the last N-1 rows of tail address space of the previous ring cache space. The first mapping relationship and the second mapping relationship represent the mapping relationship between the shadow space and the address space. When data in the shadow space is needed, it is obtained from the address space to which the shadow space is mapped. The M storage modules are used for distributed storage of the first data, and the M columns of processing unit sets are used to read the first data from the distributed storage of the M storage modules.
2. A data processing method, characterized in that, Applied to the data processing apparatus of claim 1, the method comprises: Dividing the first data into n first sub-data based on the number of channels n, where the first data represents the data to be stored and n is a positive integer less than or equal to M / 2; Writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces, where any one circular cache space is used to store any one first sub-data.
3. The method according to claim 2, characterized in that, The step of writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces includes: Obtaining the number of rows y of each first sub-data; Sequentially writing each first sub-data into the circular cache space consisting of INT(M / n) ring cache spaces, in y / (INT(M / n)×x) rounds, to store the first data.
4. The method according to claim 2, characterized in that, After writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces, the method further includes: Reading the first sub-data in each circular cache space through each processing unit set.
5. The method according to any one of claims 2-4, characterized in that, The step of reading the first sub-data in each circular cache space through each processing unit set includes: Sequentially reading the circular cache space in y / (INT(M / n)×x) rounds to read each first sub-data.
6. The method according to claim 5, characterized in that, The step of sequentially reading the circular cache space in y / (INT(M / n)×x) rounds to read each first sub-data includes: When the first sub-data of any head shadow space needs to be read, the first sub-data in the last N-1 rows of tail address space of the previous ring cache space corresponding to the head shadow space is read according to the first mapping relationship; When the first sub-data of any tail shadow space needs to be read, the first sub-data in the first N-1 rows of head address space of the next ring cache space corresponding to the tail shadow space is read according to the second mapping relationship; When neither a head shadow space nor a tail shadow space needs to be read, the first sub-data of each row is read in sequence.
7. A system-on-a-chip, characterized in that, The system includes a neural network processor, which comprises a processing unit array and M storage modules. The processing unit array comprises M columns of processing unit sets, where M is a positive even number. Each storage module includes ring cache spaces in one-to-one correspondence with the number of neural network layers. The M ring cache spaces of each layer form at most M / 2 circular cache spaces, and each circular cache space includes at least 2 and at most M ring cache spaces. Each ring cache space includes x rows of cache address space, where x is a positive integer greater than 2. The x rows of cache address space include the first N-1 rows of head address space and the last N-1 rows of tail address space, where N×N is the size of the convolution kernel and N is a positive integer greater than 1 and less than x. The N-1 rows of tail shadow space following the tail address space of any ring cache space that makes up a circular cache space have a first mapping relationship with the first N-1 rows of head address space of the next ring cache space. The N-1 rows of head shadow space preceding the head address space of any ring cache space that makes up a circular cache space have a second mapping relationship with the last N-1 rows of tail address space of the previous ring cache space. The first mapping relationship and the second mapping relationship represent the mapping relationship between the shadow space and the address space. When data in the shadow space is needed, it is obtained from the address space to which the shadow space is mapped. The neural network processor is used for: Dividing the first data into n first sub-data based on the number of channels n, where the first data represents the data to be stored and n is a positive integer less than or equal to M / 2; Writing each first sub-data into a circular cache space consisting of INT(M / n) ring cache spaces, where any one circular cache space is used to store any one first sub-data.
8. The system-on-a-chip according to claim 7, characterized in that, The neural network processor is used to read the first sub-data in each circular cache space through each processing unit set, and the neural network processor is specifically used for: Sequentially reading the circular cache space in y / (INT(M / n)×x) rounds to read each first sub-data.
9. An electronic device, characterized in that, The electronic device includes a memory and a processor, the memory being used to store a program and the processor being used to execute the program stored in the memory; when the program stored in the memory is executed, the processor performs the data processing method as described in any one of claims 2 to 6.
10. A computer storage medium, characterized in that, The computer storage medium stores a computer program, the computer program including program instructions, which, when executed by a processor, cause the processor to perform the method as described in any one of claims 2-6.
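The grouping, round count, and shadow-space lookup recited in claims 1 to 6 can be illustrated with a minimal Python sketch. It is not the claimed hardware, only a software model under stated assumptions: the function names (group_size, write_rounds, resolve_row) are hypothetical, the ring cache spaces within one circular cache space are indexed 0 to g-1 with the last one wrapping around to the first, the round count is rounded up when y is not an exact multiple of INT(M / n)×x, and shadow rows are addressed as row indices just outside the range 0 to x-1.

```python
from math import ceil


def group_size(M: int, n: int) -> int:
    """INT(M / n): ring cache spaces per circular cache space."""
    if M <= 0 or M % 2 != 0:
        raise ValueError("M must be a positive even number")
    if not 1 <= n <= M // 2:
        raise ValueError("n must satisfy 1 <= n <= M / 2")
    return M // n


def write_rounds(y: int, M: int, n: int, x: int) -> int:
    """Rounds needed to pass y rows through one circular cache space.

    The claims state y / (INT(M / n) * x); rounding up when the division
    is not exact is an assumption of this sketch.
    """
    return ceil(y / (group_size(M, n) * x))


def resolve_row(ring: int, row: int, g: int, x: int, N: int) -> tuple[int, int]:
    """Return the (ring, row) that physically holds the requested row.

    Rows 0..x-1 are real address space. Rows x..x+N-2 model the tail
    shadow space and rows -(N-1)..-1 model the head shadow space.
    Wrap-around between ring g-1 and ring 0 is an assumption.
    """
    if not 1 < N < x:
        raise ValueError("N must satisfy 1 < N < x")
    if 0 <= row < x:
        return ring, row  # ordinary row, no mapping needed
    if x <= row < x + N - 1:
        # Tail shadow: first N-1 head rows of the next ring cache space.
        return (ring + 1) % g, row - x
    if -(N - 1) <= row < 0:
        # Head shadow: last N-1 tail rows of the previous ring cache space.
        return (ring - 1) % g, x + row
    raise IndexError("row lies outside the address and shadow spaces")


if __name__ == "__main__":
    g = group_size(M=8, n=2)
    print(g, write_rounds(y=48, M=8, n=2, x=6))
    print(resolve_row(0, 6, g, x=6, N=3), resolve_row(0, -1, g, x=6, N=3))
```

Because a shadow row only redirects to rows already stored in a neighbouring ring cache space, the sketch never copies data. This matches the statement in the claims that data in the shadow space is obtained from the address space to which it is mapped.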