Methods, apparatus, and corresponding circuitry for mapping data to a processing core array
By using bit reversal and mapping techniques, sensor data is stored in the integrated circuit processing core array, solving the problem that integrated circuits cannot meet the requirements of high-performance real-time computing, and realizing efficient sensor data processing and sensing functions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CCORE TECH CO LTD
- Filing Date
- 2025-12-15
- Publication Date
- 2026-06-30
AI Technical Summary
Existing integrated circuits cannot meet the high-performance real-time computing requirements of sensor signal data, and cannot realize the efficient sensing function of robots or sensing machines.
By reversing and rearranging the input bits to generate a bit-inverted input array, and storing it in the computing unit of the processing core array according to the input bit mapping relationship, the fast Fourier transform matrix multiplication calculation is performed, thus optimizing the data storage and calculation process.
It improves sensor data processing capabilities, enhances the perception functions of robots and machines, shortens computation time, reduces data transmission latency, and achieves balanced allocation of storage resources.
Figure CN122309897A_ABST
Description
Technical Field
[0001] This application relates to the field of integrated circuit technology, and in particular to a method, apparatus, and corresponding integrated circuit for mapping data to an integrated circuit processing core array.
Background Technology
[0002] In current technologies, artificial intelligence and machine learning increasingly require the support of high computing power. However, the basic processing circuits used to process various sensor signal data from sensors are still insufficient in terms of corresponding high-performance processing capabilities, and cannot meet the high-performance real-time computing requirements of sensor signal data.
[0003] Therefore, the integrated circuit field urgently needs an advanced integrated circuit and processing technology to perform high-performance real-time processing and calculation of conventional and advanced sensor signals, thereby enabling the sensing functions of robots or any type of sensing machine.
Summary of the Invention
[0004] To address the problem that existing integrated circuits cannot meet the high-performance real-time computing requirements of sensor signal data, this application mainly provides a method, apparatus, storage medium, electronic device, and computer program product for mapping data to an integrated circuit processing core array.
[0005] To achieve the above objectives, the first technical solution adopted in this application is: a method for mapping data to an integrated circuit processing core array, comprising: reversing and rearranging multiple input bits of an input array input to an integrated circuit to obtain a bit-inverted input array; determining an input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array within the processing core array based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method; storing each inverted input bit in the storage circuit of a corresponding different computing unit in the processing core array according to the input bit mapping relationship; and performing fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuit of the computing unit in the corresponding different computing units.
[0006] Optionally, obtaining a bit-inverted input array by bit-reversing and rearranging multiple input bits of the input array of the input integrated circuit includes: reversing the input bit index corresponding to each input bit in the input array to generate a corresponding bit-reversal index; and rearranging the input bits in the input array according to the bit-reversal index to obtain the bit-inverted input array.
[0007] Optionally, a bit-inverted input bit index is generated based on the correspondence between the sequence of input bits of the input array and the rearranged sequence of input bits of the bit-inverted input array.
[0008] Optionally, determining the input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array within the processing core array, based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method, includes: when the number of inverted input bits is greater than the number of computing units, storing the inverted input bits in the storage circuit of the corresponding sequence of computing units according to their bit order, and storing the excess inverted input bits again in the storage circuit of the corresponding sequence of computing units according to their bit order; or, based on the storage capacity of the storage circuit of the computing unit, storing multiple inverted input bits in the same computing unit at once.
[0009] Optionally, performing fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuit of the computing unit in different computing units includes: determining the storage location of the multiple inverted input bits according to the position of the multiple inverted input bits required for fast Fourier transform matrix multiplication calculation of each weight level, and determining the computing unit to transmit the multiple inverted input bits to perform fast Fourier transform matrix multiplication calculation of the weight level according to the storage location of the multiple inverted input bits.
[0010] Optionally, determining the computational unit for transmitting the multiple inverted input bits to perform weighted fast Fourier transform matrix multiplication based on the storage location of the multiple inverted input bits includes: synchronously exchanging and storing the first inverted input bit stored in the first computational unit with the second inverted input bit stored in the second computational unit.
[0011] Optionally, determining the computational unit to transmit the multiple inverted input bits to perform the weighted fast Fourier transform matrix multiplication calculation based on the storage positions of the multiple inverted input bits includes: transmitting the first inverted input bit stored on the first computational unit to the second computational unit by rotating it by a predetermined angle, based on the positions of the first computational unit and the second computational unit in the processing core array.
[0012] Optionally, determining the computational unit to transfer the multiple inverted input bits to perform the weighted fast Fourier transform matrix multiplication calculation based on the storage positions of the multiple inverted input bits includes: transferring the first inverted input bit stored on the first computational unit to the second computational unit by moving it a predetermined number of times in the rows and / or columns of computational units in the processing core array, based on the positions of the first computational unit and the second computational unit in the processing core array.
[0013] The second technical solution adopted in this application is: a device for mapping data to an integrated circuit processing core array, comprising: a bit-inverted input array acquisition module, used to obtain a bit-inverted input array by bit-inverting and rearranging multiple input bits of the input array of the integrated circuit; an input bit mapping relationship determination module, used to determine the input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array in the processing core array based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method; a data storage module, used to store each inverted input bit in the storage circuit of a corresponding different computing unit in the processing core array according to the input bit mapping relationship; and a calculation module, used to perform fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuit of the computing unit in different computing units.
[0014] The third technical solution adopted in this application is: an integrated circuit that stores a computer program / instruction, which is operated to execute the method of mapping data to the integrated circuit processing core array in the first solution.
[0015] The beneficial effects that the technical solution of this application can achieve are: This application designs an integrated circuit architecture and processing technology, which can enhance the sensor data processing capabilities and better realize the perception functions of robots and various machines.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a schematic diagram of a specific implementation of a method for mapping data to an integrated circuit processing core array according to this application; Figure 2 is a schematic diagram of the integrated circuit component structure of this application; Figure 3 is a schematic diagram illustrating how the weight matrix is decomposed into multiple different weight levels according to this application; Figure 4 is a schematic diagram of the bit reversal of the input array and the bit-inverted input bit index of this application; Figure 5 is a schematic diagram of the input bit mapping of the bit-inverted input array of this application; Figure 6 is a schematic diagram of the first-level calculation of the fast Fourier transform matrix multiplication operation in this application; Figure 7 is a schematic diagram illustrating the shifting of inverted input bits within the processing core array of this application; Figure 8 is a bit wrap-around diagram of the bit-inverted input array of this application; Figure 9 is a schematic diagram of a specific embodiment of an apparatus for mapping data to an integrated circuit processing core array according to this application.
[0018] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0019] The preferred embodiments of this application will now be described in detail with reference to the accompanying drawings, so that the advantages and features of this application can be more easily understood by those skilled in the art, thereby providing a clearer and more definite definition of the scope of protection of this application.
[0020] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
[0021] The technical solution of this application and how it solves the above-mentioned technical problems will be described in detail below with specific embodiments. The specific embodiments described below can be combined with each other to form new embodiments. The same or similar ideas or processes described in one embodiment may not be repeated in other embodiments.
[0022] Figure 1 shows an embodiment of the method of this application for mapping data to an integrated circuit processing core array.
[0023] As shown in Figure 1, the method for mapping data to an integrated circuit processing core array includes: step S101, reversing and rearranging multiple input bits of the input array of the integrated circuit to obtain a bit-inverted input array; step S102, determining the input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array within the processing core array based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method; step S103, storing each inverted input bit in the storage circuit of a corresponding different computing unit in the processing core array according to the input bit mapping relationship; and step S104, performing fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuit of the computing unit in different computing units.
[0024] This specific implementation transforms complex sensor data computation into matrix multiplication operations by performing FFT processing on the input data. Utilizing the parallel computing capabilities of the processing core array, it simultaneously performs matrix multiplication calculations at multiple weight levels, significantly reducing computation time. Furthermore, by reversing and rearranging the input bits and using intelligent mapping, it reduces the number of times the input bits move between computing units, lowering data transmission latency and further improving computational efficiency. Finally, by rationally distributing the reversed input bits of the bit-reversed input array across the various computing units of the processing core array, it achieves a balanced allocation of storage resources, avoiding situations where some computing units have idle storage resources while others have insufficient storage resources.
[0025] Specifically, the Fast Fourier Transform (FFT) matrix multiplication operation of the input data is first identified, wherein this operation takes a bit-inverted input array as its input. The processing core array is then configured based on the bit-inverted input array: the input bits of the bit-inverted input array are stored in the storage circuits of different processing cores in the integrated circuit processing core array according to the input bit mapping relationship, which determines the preset storage location of each input bit of the bit-inverted input array in the processing core array. Finally, a matrix multiplication calculation is performed between each weight level of the Fast Fourier Transform matrix multiplication operation and the input bits of the bit-inverted input array stored in the storage circuits of the different processing cores.
[0026] In the embodiment shown in Figure 1, the method of mapping data to the integrated circuit processing core array includes step S101, which involves reversing and rearranging multiple input bits of the input array input to the integrated circuit to obtain a bit-inverted input array.
[0027] In one specific embodiment of this application, the bit-inverted input array is obtained by bit-inverting and rearranging multiple input bits of the input array of the input integrated circuit. This includes: inverting the input bit index corresponding to each input bit in the input array to generate a corresponding bit-inverted index; and rearranging the input bits in the input array according to the bit-inverted index to obtain the bit-inverted input array.
[0028] In a specific example of this application, to achieve efficient storage and computation of input data from the integrated circuit within the integrated circuit processing core array, the input array needs to be converted into a bit-inverted input array, and the inverted input bits of the bit-inverted input array are stored in different processing cores based on mapping rules. The specific processing procedure is as follows: Before mapping the input data, a Fast Fourier Transform (FFT) is first performed on the input data to transform it from the time domain to the frequency domain, thereby simplifying subsequent matrix multiplication operations. The FFT is an efficient Fourier transform algorithm that reduces the Fourier transform operation from O(N²) time complexity to O(N log N) time complexity, where N is the length of the input data. This is significant for processing large-scale perceptual data such as high-resolution image data and high-sampling-rate audio data. After performing the FFT, the input data can be represented as an input array containing multiple input bits. FFT processing transforms existing input data processing into FFT matrix multiplication. This operation requires a bit-inverted input array and multiple weight levels. The bit-inverted input array is obtained by reversing and rearranging the input bits of the input array. The multiple weight levels are obtained by decomposing the weight matrix required in the FFT operation. Each weight level corresponds to a computational stage of the FFT operation. The method for decomposing the weight matrix into multiple different weight levels is shown in Figure 3.
[0029] Specifically, a bit-inverted input array is obtained by reversing and rearranging the input bit sequence of the input array. The specific generation steps are as follows: for each input bit in the input array, bit reversal is applied to its input bit index to generate the corresponding bit reversal index. Then, the input bits are rearranged according to the bit reversal index to obtain the bit-inverted input array.
[0030] For example, suppose the length of the input array is 8, meaning it contains 8 input bits. The input bit indices of these 8 input bits are 000, 001, 010, 011, 100, 101, 110, and 111 in binary. Performing a bit reversal operation on each input bit index yields the following bit reversal indices: 000's bit reversal index is 000, 001's is 100, 010's is 010, 011's is 110, 100's is 001, 101's is 101, 110's is 011, and 111's is 111. Based on these bit reversal indices, the input bits in the input array are rearranged to obtain the bit-inverted input array. The 0th bit of the bit-inverted input array corresponds to the 0th bit of the input array, the 1st bit corresponds to the 4th bit of the input array, the 2nd bit corresponds to the 2nd bit of the input array, the 3rd bit corresponds to the 6th bit of the input array, the 4th bit corresponds to the 1st bit of the input array, the 5th bit corresponds to the 5th bit of the input array, the 6th bit corresponds to the 3rd bit of the input array, and the 7th bit corresponds to the 7th bit of the input array.
[0031] In the embodiment shown in Figure 1, the method for mapping data to the integrated circuit processing core array includes step S102, which determines the input bit mapping relationship that reflects the storage location of each inverted input bit in the bit-inverted input array within the processing core array based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method.
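The length-8 example above can be reproduced with a short Python sketch. This is only an illustration of the index arithmetic, not the circuit implementation described in this application; the function names `bit_reverse_index` and `bit_reverse_array` are hypothetical, and the sketch assumes the array length is a power of two.

```python
def bit_reverse_index(index: int, num_bits: int) -> int:
    """Reverse the num_bits-wide binary representation of index."""
    reversed_index = 0
    for _ in range(num_bits):
        # Shift the result left and append the lowest remaining bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reverse_array(values):
    """Return a list whose position i holds values[bit_reverse_index(i)]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two (assumption of this sketch)")
    num_bits = n.bit_length() - 1
    return [values[bit_reverse_index(i, num_bits)] for i in range(n)]


# Use the original indices 0..7 as the input so the output shows which
# original position ends up at each position of the rearranged array.
print(bit_reverse_array(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The printed list reads as "position 1 holds original bit 4, position 3 holds original bit 6", and so on.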
[0032] In one specific embodiment of this application, a bit-inverted input bit index is generated based on the correspondence between the sequence of input bits of the input array and the rearranged sequence of input bits of the bit-inverted input array.
[0033] Specifically, Figure 4 is a schematic diagram of the bit reversal of the input array and the bit-inverted input bit index of this application. As shown in Figure 4, during the generation of the bit-inverted input array, a bit-inverted input bit index also needs to be generated. This index maps the correspondence between the input bit sequence of the input array and the inverted input bits of the bit-inverted input array. The bit-inverted input bit index can be stored in tabular form, where each row of the table corresponds to an input bit in the input array, containing the original index of the input bit in the input array, the corresponding bit-inverted index, and the position index of the input bit in the bit-inverted input array. Using the bit-inverted input bit index, the position of each input bit in the input array after bit inversion can be quickly queried, providing a basis for subsequent input bit mapping.
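A tabular index of this kind might look like the following Python sketch. Figure 4 is not reproduced here, so the layout is an assumption: the field names `original_index`, `index_bits`, `reversed_bits`, and `position` and the function name `build_bit_reversal_table` are hypothetical, and the array length is assumed to be a power of two.

```python
def build_bit_reversal_table(length: int):
    """Build one table row per input bit of an array of the given length."""
    if length < 2 or length & (length - 1):
        raise ValueError("length must be a power of two >= 2 (assumption of this sketch)")
    num_bits = length.bit_length() - 1
    table = []
    for original in range(length):
        index_bits = format(original, f"0{num_bits}b")  # e.g. 1 -> "001"
        reversed_bits = index_bits[::-1]                # e.g. "001" -> "100"
        table.append({
            "original_index": original,
            "index_bits": index_bits,
            "reversed_bits": reversed_bits,
            # Position of this input bit in the rearranged array.
            "position": int(reversed_bits, 2),
        })
    return table


table = build_bit_reversal_table(8)
for row in table:
    print(row)
```

Looking up `table[4]["position"]` answers the query "where does original bit 4 end up after rearrangement" without recomputing the reversal.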
[0034] Furthermore, based on the bit-inverted input bit index, a mapping relationship is generated between each inverted input bit in the bit-inverted input array and a corresponding storage location in the processing core array.
[0035] In the embodiment shown in Figure 1, the method of mapping data to the integrated circuit processing core array includes step S103, which involves storing each inverted input bit into the storage circuit of the corresponding different computing unit in the processing core array according to the input bit mapping relationship.
[0036] Specifically, after the bit-inverted input array and the bit-inverted input bit index are generated, the input bits of the bit-inverted input array are stored into the storage circuits of different computing units within the integrated circuit processing core array according to the input bit mapping shown in Figure 5. The input bit mapping is used to determine the storage location of each input bit of the bit-inverted input array within the processing core array.
[0037] The method for generating the input bit mapping shown in Figure 5 includes: first, numbering the computing units in row-priority order, column-priority order, or a similar order to form a computing unit sequence, thereby determining the sequence order of the computing units in the processing core array; then, sequentially allocating the inverted input bit sequence of the bit-inverted input array to the storage circuits of the corresponding computing unit sequence, that is, the first input bit of the bit-inverted input array is allocated to the first computing unit of the computing unit sequence, the second input bit is allocated to the second computing unit, and so on, until all input bits are allocated.
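The numbering and sequential allocation described above can be sketched as follows. The zero-based `(row, column)` coordinates and the function names `number_units` and `allocate_sequentially` are assumptions made for illustration; the application does not prescribe a coordinate convention.

```python
def number_units(rows: int, cols: int, order: str = "row"):
    """Return the computing unit sequence as a list of (row, col) coordinates."""
    if order == "row":      # row-priority numbering
        return [(r, c) for r in range(rows) for c in range(cols)]
    if order == "column":   # column-priority numbering
        return [(r, c) for c in range(cols) for r in range(rows)]
    raise ValueError("order must be 'row' or 'column'")


def allocate_sequentially(num_bits: int, unit_sequence):
    """Map position i of the rearranged array to the i-th unit in the sequence."""
    if num_bits > len(unit_sequence):
        raise ValueError("more bits than units; a wrap-around or multi-bit scheme is needed")
    return {position: unit_sequence[position] for position in range(num_bits)}


# Hypothetical 2 x 4 processing core array holding an 8-element array.
row_major = allocate_sequentially(8, number_units(2, 4, "row"))
col_major = allocate_sequentially(8, number_units(2, 4, "column"))
print(row_major[5], col_major[5])  # (1, 1) (1, 2)
```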
[0038] In one specific embodiment of this application, determining the input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array within the processing core array, based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method, includes: when the number of inverted input bits is greater than the number of computing units, storing the inverted input bits in the storage circuit of the corresponding sequence of computing units in order of bit order, and storing the excess inverted input bits in the storage circuit of the corresponding sequence of computing units again in order of bit order; or, based on the storage capacity of the storage circuit of the computing unit, storing multiple inverted input bits in the same computing unit at once.
[0039] Specifically, if the number of inverted input bits in the inverted input bit sequence of the bit-inverted input array exceeds the number of computing units in the computing unit sequence, the excess inverted input bits can be handled in one of the following two ways. Wrap-around storage: when the allocation of inverted input bits reaches the last computing unit in the computing unit sequence, the allocation starts again from the first computing unit in the sequence, that is, the excess inverted input bits are stored sequentially in the storage circuits of the computing units, starting over from the first one. For example, if the computing unit sequence contains four computing units numbered 1 to 4, and the bit-inverted input array contains six input bits numbered 1 to 6, then inverted input bit 1 is allocated to computing unit 1, inverted input bit 2 to computing unit 2, inverted input bit 3 to computing unit 3, and inverted input bit 4 to computing unit 4. At this point, every computing unit has been allocated an inverted input bit, but some inverted input bits remain unallocated. Therefore, inverted input bit 5 is allocated to computing unit 1, and inverted input bit 6 is allocated to computing unit 2. This completes the storage allocation of all inverted input bits. This allocation method can fully utilize the storage resources of each computing unit, avoiding the waste caused by some computing units having idle storage resources.
[0040] Multiple-input bit storage: This method stores multiple inverted input bits in the same computing unit's storage circuit. For example, inverted input bits 1-2 are stored in computing unit 1, inverted input bits 3-4 in computing unit 2, inverted input bit 5 in computing unit 3, and inverted input bit 6 in computing unit 4. The above allocation method is merely exemplary. In practical applications, more input bits can be stored in the same computing unit or fewer inverted input bits can be stored in a single computing unit, depending on the storage capacity of each computing unit. Multiple-input bit storage is suitable for scenarios with large computing unit storage capacities, reducing the number of times inverted input bits move between computing units and thus reducing data transmission latency.
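The two ways of handling excess bits described above (starting over from the first unit, and storing several bits per unit) can be compared in a small Python sketch that reproduces the six-bit, four-unit example. The function names are hypothetical, and the rule used in `multi_bit_mapping` (give the first units one extra bit) is just one assumed way to arrive at the distribution given in the text.

```python
def wrap_around_mapping(num_bits: int, num_units: int):
    """Assign bit k (1-based) to unit ((k - 1) mod num_units) + 1."""
    return {bit: (bit - 1) % num_units + 1 for bit in range(1, num_bits + 1)}


def multi_bit_mapping(num_bits: int, num_units: int, capacity: int):
    """Store consecutive bits in the same unit, spreading them as evenly as possible."""
    if num_bits > num_units * capacity:
        raise ValueError("total storage capacity exceeded")
    base, extra = divmod(num_bits, num_units)
    mapping = {}
    bit = 1
    for unit in range(1, num_units + 1):
        # The first `extra` units hold one more bit than the remaining units.
        count = base + (1 if unit <= extra else 0)
        for _ in range(count):
            mapping[bit] = unit
            bit += 1
    return mapping


print(wrap_around_mapping(6, 4))   # {1: 1, 2: 2, 3: 3, 4: 4, 5: 1, 6: 2}
print(multi_bit_mapping(6, 4, 2))  # {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 4}
```

Both mappings place two bits in units 1 and 2, but the multi-bit variant keeps neighboring positions together, which is what reduces movement when a later calculation needs adjacent bits.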
[0041] In one specific embodiment of this application, after storing the inverted input bits in the storage circuit of the computing unit, the storage location of the inverted input bits is verified and adjusted to ensure that each inverted input bit is stored in the correct location and that the resources stored by the computing unit do not exceed its maximum capacity. For example, by reading the inverted input bits in the storage circuit of the computing unit and comparing them with the original inverted input bits in the bit inversion input array, if the two data are found to be inconsistent, it indicates that there is an error in the storage location, and the input bit mapping relationship needs to be readjusted and the inverted input bits re-stored; if the data failure is caused by insufficient storage capacity of a certain computing unit, some of the inverted input bits in that computing unit can be transferred to other computing units with sufficient storage capacity to ensure that all inverted input bits can be stored normally.
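A read-back check of this kind could be sketched as below. The storage model (one dictionary of `position: value` slots per unit), the `capacity` parameter, and the function names `store_bits` and `verify_storage` are all assumptions for illustration; real storage circuits would be accessed differently.

```python
def store_bits(values, mapping, num_units):
    """Place each value (1-based position) into the unit chosen by the mapping."""
    storage = {unit: {} for unit in range(1, num_units + 1)}
    for position, value in enumerate(values, start=1):
        storage[mapping[position]][position] = value
    return storage


def verify_storage(values, mapping, storage, capacity):
    """Return (positions whose stored value differs, units holding too many bits)."""
    mismatched = [
        position
        for position, value in enumerate(values, start=1)
        if storage.get(mapping[position], {}).get(position) != value
    ]
    overfull = [unit for unit, slots in storage.items() if len(slots) > capacity]
    return mismatched, overfull


values = [10, 50, 30, 70, 20, 60]                  # hypothetical sample data
mapping = {1: 1, 2: 2, 3: 3, 4: 4, 5: 1, 6: 2}     # wrap-around allocation
storage = store_bits(values, mapping, num_units=4)
print(verify_storage(values, mapping, storage, capacity=2))  # ([], [])
```

A non-empty first list would trigger re-storing under an adjusted mapping, and a non-empty second list would trigger moving bits to units with spare capacity.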
[0042] In the embodiment shown in Figure 1, the method of mapping data to the integrated circuit processing core array includes step S104, which involves performing fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuit of the computing unit in different computing units.
[0043] Specifically, after the inverted input bits of the bit-inverted input array are stored in the computing units, the computing units need to perform the fast Fourier transform matrix multiplication operation shown in Figure 6, which includes matrix multiplication calculations between multiple weight levels and different inverted input bits. Since the matrix multiplication calculations corresponding to different weight levels may require inverted input bits stored in different computing units, the inverted input bits must be moved to the corresponding computing units during the calculation so that each computing unit can obtain the inverted input bits it needs.
[0044] In one specific embodiment of this application, the storage locations of the multiple inverted input bits are determined according to the order of the multiple inverted input bits required for the fast Fourier transform matrix multiplication calculation of each weight level, and the computing unit to which the multiple inverted input bits are transmitted to perform the fast Fourier transform matrix multiplication calculation of that weight level is determined according to those storage locations.
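To make the role of the weight levels concrete, the following Python sketch applies a standard radix-2 decimation-in-time FFT one stage at a time to a bit-reversed input and checks the result against a direct DFT. Figures 3 and 6 are not reproduced here, so treating each "weight level" as one butterfly stage of the Cooley-Tukey factorization is an assumption of this sketch, and all function names are hypothetical.

```python
import cmath


def bit_reverse_array(values):
    """Rearrange values into bit-reversed order (length must be a power of two)."""
    n = len(values)
    num_bits = n.bit_length() - 1
    return [values[int(format(i, f"0{num_bits}b")[::-1], 2)] for i in range(n)]


def apply_stage(data, stage):
    """Apply one butterfly stage (stage = 1 .. log2(n)) and return a new list."""
    span = 1 << stage      # size of each butterfly group at this stage
    half = span // 2       # distance between the two elements of a butterfly
    out = list(data)
    for start in range(0, len(data), span):
        for k in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * k / span)  # stage weight
            upper = data[start + k]
            lower = data[start + k + half] * twiddle
            out[start + k] = upper + lower
            out[start + k + half] = upper - lower
    return out


def staged_fft(values):
    """Bit-reverse the input, then apply the log2(n) stages in sequence."""
    n = len(values)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two >= 2")
    data = bit_reverse_array([complex(v) for v in values])
    for stage in range(1, n.bit_length()):
        data = apply_stage(data, stage)
    return data


def direct_dft(values):
    """O(N^2) reference transform used only to check the staged result."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


samples = [1, 2, 3, 4, 0, -1, -2, -3]  # hypothetical sample data
difference = max(abs(a - b) for a, b in zip(staged_fft(samples), direct_dft(samples)))
print(difference < 1e-9)
```

In this formulation, stage `s` pairs positions that are `2**(s - 1)` apart, so later stages need operands that sit farther apart in the stored sequence, which is why bits must be moved between computing units as the calculation proceeds.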
[0045] Specifically, based on the matrix multiplication computation requirements corresponding to each weight level, the inverted input bits required for each computing unit in the processing core array to perform the corresponding matrix multiplication computation, as well as the current storage location of these inverted input bits, are determined. This generates instructions that cause the corresponding inverted input bits to be transferred from the original computing unit to the target computing unit. That is, firstly, the matrix multiplication computation task corresponding to each weight level is analyzed to determine the target inverted input bits required by each computing unit for computation at that weight level. For example, for a matrix multiplication operation at a certain weight level, computing unit A needs to use inverted input bits a, b, and c to perform the computation, while computing unit B needs to use inverted input bits d, e, and f. Then, the storage status of the inverted input bits in the inverted input array is queried to determine the information of the computing unit where each target inverted input bit is located. For example, through the input bit mapping table, it is found that inverted input bit a is stored in computing unit A, inverted input bit b is stored in computing unit D, inverted input bit c is stored in computing unit E, inverted input bit d is stored in computing unit B, inverted input bit e is stored in computing unit G, and inverted input bit f is stored in computing unit H. Finally, it is determined whether the target inverted input bit is stored locally in the target computing unit. If the target inverted input bit is already stored locally in the target computing unit, no input bit shift instruction needs to be generated. If the target inverted input bit is not stored locally in the target computing unit, an inverted input bit shift instruction is generated. 
This shift instruction must explicitly specify the computing unit in which the inverted input bit is currently stored, the computing unit to which it needs to be moved, the inverted input bit to be moved, and the transmission method to be used, such as parallel transmission or serial transmission. For example, it can be specified that inverted input bit b is moved from computing unit D to computing unit A, inverted input bit c is moved from computing unit E to computing unit A, inverted input bit e is moved from computing unit G to computing unit B, and inverted input bit f is moved from computing unit H to computing unit B. Inverted input bit d is already stored in computing unit B and does not need to be moved, and inverted input bit a is already stored in computing unit A and also does not need to be moved. At the same time, it can be specified that inverted input bits b and c are moved serially, as are inverted input bits e and f, while the group of instructions that moves bits to computing unit A and the group that moves bits to computing unit B are transmitted in parallel with each other.
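The instruction-generation logic in this example can be sketched in Python as follows. The `ShiftInstruction` record, its field names, and the two helper functions are hypothetical; the application specifies what an instruction must contain but not a concrete encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftInstruction:
    bit: str      # inverted input bit to move
    source: str   # computing unit that currently stores the bit
    target: str   # computing unit that needs the bit
    mode: str     # transmission method within the target's group


def generate_shift_instructions(requirements, locations, mode="serial"):
    """Emit one instruction for every required bit not already stored locally."""
    instructions = []
    for target, needed_bits in requirements.items():
        for bit in needed_bits:
            source = locations[bit]
            if source != target:  # locally stored bits need no instruction
                instructions.append(ShiftInstruction(bit, source, target, mode))
    return instructions


def group_by_target(instructions):
    """Group instructions per target; separate groups may run in parallel."""
    groups = {}
    for instruction in instructions:
        groups.setdefault(instruction.target, []).append(instruction)
    return groups


requirements = {"A": ["a", "b", "c"], "B": ["d", "e", "f"]}
locations = {"a": "A", "b": "D", "c": "E", "d": "B", "e": "G", "f": "H"}
instructions = generate_shift_instructions(requirements, locations)
for target, group in group_by_target(instructions).items():
    print(target, [(i.bit, i.source) for i in group])
```

Running it yields moves for b, c, e, and f only; a and d produce no instruction because they are already in their target units.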
[0046] Specifically, when calculating the inverted input bit shift instruction, data transmission efficiency and timing issues must be considered. For example, inverted input bit shift instructions with shorter transmission distances can be prioritized to reduce data transmission time. Multiple inverted input bit shift instructions can be combined into a single batch instruction to achieve simultaneous transmission of multiple inverted input bits, thereby improving transmission efficiency. Simultaneously, when generating the inverted input bit shift instruction, it is also necessary to ensure the matching of the execution timing of the inverted input bit shift instruction with the matrix multiplication calculation timing to avoid computational pauses due to input bits not arriving at the target computing unit in time. Furthermore, the inverted input bit shift instruction can be optimized based on the load of the processing core array. For example, if the current computing load of a certain computing unit is heavy and its data receiving capacity decreases, the generation of the instruction for transmitting the inverted input bits to that computing unit can be appropriately delayed, or the corresponding inverted input bits can be temporarily stored in other computing units until the computing load of that unit is reduced before transmission. Alternatively, if multiple computing units need to obtain the inverted input bits from the same source computing unit at the same time, this phenomenon will cause a transmission bottleneck in the source computing unit. In this case, the inverted input bits in the source computing unit can be copied to multiple other intermediate computing units, and then the intermediate computing units can transmit the corresponding inverted input bit data to the target computing unit to alleviate the transmission pressure on the source computing unit.
[0047] In one specific embodiment of this application, the first inverted input bit stored on the first computing unit and the second inverted input bit stored on the second computing unit are synchronously exchanged and stored.
[0048] Specifically, as shown in Figure 7, the input bit swapping operation is suitable for scenarios where two computing units need to exchange input bits. In this case, inverted input bit 1 stored in the first computing unit of the processing core array is swapped with inverted input bit 2 stored in the second computing unit. After the swap, inverted input bit 1 is stored in the storage circuit of the second computing unit, and inverted input bit 2 is stored in the storage circuit of the first computing unit. When performing the swap, it is necessary to ensure that the communication link between the two computing units is functioning normally and that the timing of their swapping operations is synchronized. For example, timing synchronization between the two computing units can be achieved through a handshake signal. The first computing unit sends a swap request signal to the second computing unit. If the second computing unit is idle when it receives the request, it sends an acknowledgment signal to the first computing unit. After receiving the acknowledgment signal, the first computing unit begins transmitting inverted input bit 1 to the second computing unit, while the second computing unit begins transmitting inverted input bit 2 to the first computing unit. Once both inverted input bits have been transmitted, both units send transmission completion signals to indicate that the swap is complete. The input bit swapping operation is simple, has low transmission latency, and is suitable for small-scale swapping of inverted input bits.
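The request/acknowledge handshake described here can be modeled with a short Python sketch. The class `ComputeUnit` and the function `swap_bits` are hypothetical names, and the simultaneous transfer over the communication link is abstracted as a single exchange of values.

```python
class ComputeUnit:
    """Toy model of a computing unit that holds one inverted input bit."""

    def __init__(self, name, stored_bit, busy=False):
        self.name = name
        self.stored_bit = stored_bit
        self.busy = busy

    def acknowledge_swap(self):
        """Acknowledge a swap request only when the unit is idle."""
        return not self.busy


def swap_bits(first, second):
    """Run the handshake, then exchange the stored bits.

    Returns True when the swap completed, False when the second unit
    did not acknowledge (the caller may retry later).
    """
    if not second.acknowledge_swap():
        return False
    # Both transfers are modeled as happening in the same step.
    first.stored_bit, second.stored_bit = second.stored_bit, first.stored_bit
    return True


unit_1 = ComputeUnit("unit 1", "inverted input bit 1")
unit_2 = ComputeUnit("unit 2", "inverted input bit 2")
completed = swap_bits(unit_1, unit_2)
print(completed, unit_1.stored_bit, "|", unit_2.stored_bit)
```

If the second unit is busy, `swap_bits` leaves both units unchanged, which corresponds to the case where no acknowledgment signal is returned.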
[0049] In one specific embodiment of this application, the first inverted input bit stored on the first computing unit is transmitted to the second computing unit by rotating it by a predetermined angle, based on the position of the first computing unit in the processing core array and the position of the second computing unit in the processing core array.
[0050] Specifically, the input bit rotation instruction is applicable to scenarios where the inverted input bit needs to be moved between different storage circuits within the same computing unit. This instruction transfers the inverted input bit from a first storage circuit to a second storage circuit by rotating it through an angle between 0 and 360 degrees among the storage circuits inside the computing unit. In practical scenarios, the storage circuits of each computing unit are arranged in structures such as circles, squares, and rings. The rotation direction of the inverted input bit can be clockwise or counterclockwise, and the rotation angle can be set flexibly according to the number and arrangement of the storage circuits. For example, if the storage circuits of the computing unit are arranged in a circular structure containing a total of 8 storage cells evenly distributed on the circumference, the rotation angle can be set to an integer multiple of 360 degrees / 8 = 45 degrees. The inverted input bit moves from one storage cell to the next adjacent storage cell for every 45 degrees of rotation, thus completing the data transmission. After receiving a rotation instruction, the local controller of the computing unit parses the rotation direction and angle in the instruction. Then, based on the rotation direction and angle, it controls the inverted input bit to rotate through the storage circuits accordingly. Finally, the inverted input bit is stored in the target storage circuit. The advantage of the input bit rotation instruction is that it does not occupy the communication link between computing units; the input bit is moved using only the storage circuits inside the processing core, resulting in low transmission latency.
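The 8-cell example can be expressed as a small index calculation. The sketch below is illustrative only: the function name `rotate_cell` is hypothetical, cells are numbered 0 to 7, and treating clockwise rotation as increasing cell index is an assumption.

```python
def rotate_cell(start_cell, angle_degrees, num_cells=8, clockwise=True):
    """Return the cell reached after rotating by angle_degrees.

    The angle must be an integer multiple of 360 / num_cells degrees.
    Clockwise rotation is assumed to increase the cell index.
    """
    if (angle_degrees * num_cells) % 360 != 0:
        raise ValueError("angle is not a multiple of the cell spacing")
    steps = angle_degrees * num_cells // 360
    if not clockwise:
        steps = -steps
    return (start_cell + steps) % num_cells


print(rotate_cell(0, 90))                  # two 45-degree steps from cell 0
print(rotate_cell(7, 45))                  # wraps around the ring
print(rotate_cell(0, 45, clockwise=False))
```

Integer arithmetic is used for the divisibility check so that cell counts which do not divide 360 evenly are still handled without floating-point rounding.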
[0051] In one specific embodiment of this application, based on the position of the first computing unit in the processing core array and the position of the second computing unit in the processing core array, the first inverted input bit stored on the first computing unit is transferred to the second computing unit by moving it a predetermined number of times in the rows and / or columns of computing units in the processing core array.
[0052] Specifically, as shown in Figure 8, the core jump instruction for input bits is applicable to scenarios where the inverted input bit needs to move multiple positions within a row or column of the processing core array. This instruction transmits data by specifying the number and positions of the computing units that the inverted input bit passes through during its movement within the processing core array. For example, if the processing core array is arranged in a 3x4 configuration, the inverted input bit is currently stored in the computing unit in the 1st row and 1st column, and the core jump instruction specifies a jump of 2 steps along the row direction, then the inverted input bit moves from the computing unit in the 1st row and 1st column to the computing unit in the 1st row and 3rd column. The core jump instruction is executed as follows: First, based on the jump step and movement direction in the instruction, the transmission path of the inverted input bit is determined, such as a straight path along the row direction or a straight path along the column direction. Then, the source computing unit transmits the inverted input bit sequentially through the intermediate computing units along the transmission path to the target computing unit. The intermediate computing units act only as data forwarders and do not perform any processing on the inverted input bit. Finally, the inverted input bit arrives at the target computing unit and is stored in its storage circuit.
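The path determination step can be sketched as follows. The function name `core_jump_path` is hypothetical, positions are 1-based `(row, column)` pairs as in the text, and only jumps toward increasing column or row indices are modeled.

```python
def core_jump_path(start, steps, direction, rows=3, cols=4):
    """List the units visited by a straight-line core jump.

    direction "row" moves along the row (the column index grows);
    direction "column" moves along the column (the row index grows).
    """
    row, col = start
    path = [(row, col)]
    for _ in range(steps):
        if direction == "row":
            col += 1
        elif direction == "column":
            row += 1
        else:
            raise ValueError("direction must be 'row' or 'column'")
        if not (1 <= row <= rows and 1 <= col <= cols):
            raise ValueError("jump would leave the processing core array")
        path.append((row, col))
    return path


path = core_jump_path((1, 1), steps=2, direction="row")
print("source:", path[0], "forwarders:", path[1:-1], "target:", path[-1])
```

For the 3x4 example, the single intermediate unit at row 1, column 2 only forwards the bit, and the bit ends up at row 1, column 3.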
[0053] In one example of this application, when executing core jump instructions, care must be taken to avoid data conflicts between computing units along the transmission path. For example, if the transmission paths of two inverted input bits pass through the same intermediate computing unit at the same time, a data conflict occurs because the intermediate computing unit cannot forward two inverted input bits simultaneously. To solve this problem, time-division multiplexing can be used to allocate different time slices to the transmission of different inverted input bits, ensuring that the same intermediate computing unit handles different inverted input bits in different time slices; alternatively, a path planning algorithm can be used to plan a unique transmission path for each inverted input bit, avoiding path overlap. Core jump instructions enable fast, long-distance movement of input bits within the processing core array and are suitable for data transmission in large-scale processing core arrays.
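One simple way to realize the time-division approach is a greedy slot assignment, sketched below. This is not the scheduling method of this application, which is not specified; the function name `assign_time_slots` and the rule that hop number h of a transfer occupies its unit in slot `start + h` are assumptions made for illustration.

```python
def assign_time_slots(paths):
    """Give each transfer the earliest start slot that avoids conflicts.

    paths maps a transfer name to the ordered list of units it visits.
    Hop number h of a transfer occupies its unit during slot start + h.
    """
    occupied = set()  # (unit, time slot) pairs that are already reserved
    start_slots = {}
    for name, path in paths.items():
        delay = 0
        # Postpone the transfer until none of its hops collides.
        while any((unit, delay + hop) in occupied for hop, unit in enumerate(path)):
            delay += 1
        for hop, unit in enumerate(path):
            occupied.add((unit, delay + hop))
        start_slots[name] = delay
    return start_slots


# Two transfers that would both use unit (1, 2) during slot 1.
paths = {
    "bit_x": [(1, 1), (1, 2), (1, 3)],
    "bit_y": [(2, 2), (1, 2)],
}
slots = assign_time_slots(paths)
print(slots)
```

Here the second transfer is delayed by one slot so that unit (1, 2) forwards the two bits in different time slices.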
[0054] In particular, the execution of input bit shift instructions includes multiple operation modes. Developers can choose the appropriate operation mode according to the transmission requirements of the inverted input bits and the structure of the processing core array. This application does not limit the specific operation mode.
[0055] In one specific embodiment of this application, after the storage and movement of the inverted input bits are completed, the processing core array begins to perform FFT matrix multiplication operations. This operation includes matrix multiplication calculations using multiple weight levels, wherein the calculation of each weight level is a matrix multiplication operation based on the inverted input bits in the current computing unit and the corresponding weight matrix.
[0056] Specifically, the weight matrix corresponding to each weight level is pre-stored in the main memory shown in Figure 2. Before the calculation for a given weight level is executed, the scheduler sends instructions to the peripheral controller to load the weight matrix from main memory into the processing core array. Weight matrix loading can be divided into parallel loading and serial loading. Parallel loading refers to multiple computing units simultaneously loading different parts of the weight matrix from main memory; this method is suitable for scenarios with large weight matrices and shortens loading time. Serial loading refers to computing units loading portions of the weight matrix from main memory sequentially; this method is suitable for scenarios with small weight matrices and reduces the bandwidth requirements of main memory. After the weight matrix is loaded, each computing unit stores the loaded weight matrix locally in a corresponding dedicated weight storage area, such as the weight storage area in a register file, so that the computing unit can read the weight matrix as needed. To ensure the accuracy of the weight matrix loading, the computing unit performs checks such as parity checking and cyclic redundancy checking on the weight data after loading. If an error is found during the check, an error signal is sent to the scheduler, which instructs the computing unit to reload the weight matrix.
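The check-and-reload behavior can be illustrated with Python's standard `zlib.crc32` and a simple even-parity bit. The function names, the retry limit, and the idea that reference check values are available alongside the weight data are assumptions for this sketch; the application does not specify them.

```python
import zlib


def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 when the number of set bits in data is odd."""
    return sum(bin(byte).count("1") for byte in data) % 2


def verify_weights(data: bytes, expected_crc: int, expected_parity: int) -> bool:
    """Accept the block only when both the CRC-32 and the parity match."""
    return zlib.crc32(data) == expected_crc and parity_bit(data) == expected_parity


def load_with_retry(read_block, expected_crc, expected_parity, max_attempts=3):
    """Reload the weight block until it verifies or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        data = read_block()
        if verify_weights(data, expected_crc, expected_parity):
            return data, attempt
        # In the described system, an error signal would go to the scheduler here.
    raise RuntimeError("weight block failed verification")


good_block = bytes([1, 2, 3, 4])
reads = iter([bytes([1, 2, 3, 5]), good_block])  # first read is corrupted
loaded, attempts = load_with_retry(
    lambda: next(reads), zlib.crc32(good_block), parity_bit(good_block)
)
print(loaded == good_block, attempts)
```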
[0057] Then, each computation unit performs matrix multiplication based on the locally stored inverted input bits and weight matrix portion. Within the computation unit, the arithmetic logic unit (ALU) first reads the inverted input bits and corresponding weight values from the register file. Then, it performs multiplication, multiplying the inverted input bits by the weight values to obtain the product. Next, it performs addition, summing multiple products to obtain a partial result. Finally, the partial result is stored in the intermediate result storage area of the register file. During matrix multiplication, the computation unit can employ pipelining techniques to overlap multiplication and addition steps, improving computational efficiency. For example, while the ALU is performing addition on the first group of inverted input bits and weight values, the register file can simultaneously transmit the second group of inverted input bits and weight values to the ALU. Therefore, after completing the addition on the first group, the ALU can immediately begin multiplication on the second group without waiting for the register file transmission to complete, thus improving computational efficiency.
[0058] Furthermore, for calculations involving multiple weight levels, the calculation unit executes them sequentially according to the weight level order. The result of the previous weight level is used as input data for the next weight level in subsequent calculations. For example, the result of the first weight level is stored in the intermediate result storage area of the register file. When executing the calculation of the second weight level, the calculation unit reads this result from the intermediate result storage area, then performs matrix multiplication with the weight matrix of the second weight level, and so on, until all weight levels have been calculated.
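One way to read this level-by-level chaining is as the butterfly stages of a radix-2 decimation-in-time FFT, with each stage written as a sparse matrix applied to the output of the previous stage. The application does not spell out the weight matrices, so the stage matrices below are an assumption based on the standard FFT factorization, and all function names are hypothetical. The sketch checks the staged result against a direct DFT.

```python
import cmath


def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def stage_matrix(n, stage):
    """Butterfly matrix for one stage (stage = 1 .. log2 n), assumed weight level."""
    m = 1 << stage
    half = m // 2
    mat = [[0j] * n for _ in range(n)]
    for block in range(0, n, m):
        for j in range(half):
            w = cmath.exp(-2j * cmath.pi * j / m)  # twiddle factor
            top, bottom = block + j, block + j + half
            mat[top][top] = 1
            mat[top][bottom] = w
            mat[bottom][top] = 1
            mat[bottom][bottom] = -w
    return mat


def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def fft_by_levels(x):
    """Bit-reverse the input, then apply one stage matrix per level in order."""
    n = len(x)
    num_bits = n.bit_length() - 1
    if n == 0 or (1 << num_bits) != n:
        raise ValueError("length must be a power of two")
    data = [x[bit_reverse(i, num_bits)] for i in range(n)]
    for stage in range(1, num_bits + 1):
        data = matvec(stage_matrix(n, stage), data)  # result feeds the next level
    return data


def naive_dft(x):
    """Direct O(n^2) DFT used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
staged = fft_by_levels(signal)
reference = naive_dft(signal)
print(max(abs(a - b) for a, b in zip(staged, reference)) < 1e-9)
```

Each stage matrix has only two nonzero entries per row, which is why the required inputs for a level can be gathered with a small number of input bit moves.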
[0059] Once the calculations for all weight levels are complete, the processing core array obtains the final result of the FFT matrix multiplication operation. Since this result may be distributed across multiple processing cores, the partial results need to be merged to obtain the complete output. The specific result merging process is as follows: First, the scheduler determines the order and method of result merging, such as merging by row or by column. Then, each computing unit, according to the scheduler's instructions, transmits its local calculation results to the designated merging computing unit, which is typically a specific processing core or boundary core in the processing core array. After receiving the calculation results transmitted from all computing units, the merging computing unit combines the data according to preset merging rules to obtain the complete output result. Finally, the merging computing unit transmits the complete output result to main memory or an external device through the boundary core and peripheral controller.
[0060] Specifically, during the result merging process, it is necessary to ensure the integrity and accuracy of the partial calculation results transmitted by each computing unit. Therefore, after receiving the partial calculation results transmitted by each computing unit, the merging computing unit verifies the partial calculation results. If it finds that the received partial calculation results are incomplete or contain errors, it sends a retransmission command to the corresponding computing unit, which then retransmits the calculation results. At the same time, the merging computing unit also performs an overall verification of the merged complete result to ensure that the result meets expectations.
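The merging and verification steps can be sketched as follows. The function name `merge_partial_results`, the representation of a partial result as a list keyed by unit index, and the use of a missing entry to stand for an incomplete transmission are all assumptions for illustration.

```python
def merge_partial_results(partials, merge_order):
    """Concatenate partial results in merge_order.

    partials maps a unit index to its list of values, or to None when the
    transmission was incomplete. Returns (merged, units_to_retransmit).
    """
    missing = [unit for unit in merge_order if partials.get(unit) is None]
    if missing:
        # The merging unit would send a retransmission command to these units.
        return None, missing
    merged = []
    for unit in merge_order:
        merged.extend(partials[unit])
    return merged, []


merge_order = [0, 1, 2]
first_try = merge_partial_results({0: [1, 2], 1: None, 2: [5, 6]}, merge_order)
second_try = merge_partial_results({0: [1, 2], 1: [3, 4], 2: [5, 6]}, merge_order)
print(first_try)
print(second_try)
```

The first call reports that unit 1 must retransmit; the second call, after retransmission, returns the complete output in the scheduler-defined order.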
[0061] This application transforms complex calculations into matrix multiplication operations by performing FFT processing on the input data. Utilizing the parallel computing capabilities of the processing core array, it performs the matrix multiplication calculations of multiple weight levels across the computing units in parallel, significantly reducing computation time. Furthermore, by reversing and rearranging the input bits and mapping them according to the input bit mapping relationship, it reduces the number of times input bits move between processing cores, lowering data transmission latency and further improving computational efficiency. Based on the input bit mapping relationship, the input bits of the bit-inverted input array are rationally allocated to each processing core of the processing core array, achieving a balanced distribution of storage resources and avoiding situations where some processing cores have idle storage resources while others have insufficient storage resources. In addition, when the number of input bits exceeds the number of processing cores, it employs cyclic (wrap-around) storage or multi-input-bit storage to fully utilize the storage capacity of each processing core.
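The two strategies for the case where input bits outnumber processing cores can be written as simple index formulas. The sketch below is a reading of the method described in this application, not a definitive layout: the function names are hypothetical, and each position is returned as a `(core index, slot within core)` pair.

```python
def cyclic_mapping(num_bits, num_cores):
    """Place bits core by core, then start again at core 0 for the excess."""
    return {i: (i % num_cores, i // num_cores) for i in range(num_bits)}


def multi_bit_mapping(num_bits, capacity):
    """Fill each core with up to capacity consecutive bits before moving on."""
    return {i: (i // capacity, i % capacity) for i in range(num_bits)}


print(cyclic_mapping(6, 4))
print(multi_bit_mapping(6, 2))
```

With 6 bits and 4 cores, the cyclic layout puts bits 4 and 5 into the second slot of cores 0 and 1; with a capacity of 2 bits per core, the multi-bit layout uses 3 cores and leaves no core partially filled.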
[0062] This application supports various input bit movement methods such as swapping, core jumping, and rotation, allowing an appropriate method to be selected based on the structure and computational requirements of the processing core array. Furthermore, the loading method of the weight matrix can be flexibly adjusted according to its size, employing parallel loading, serial loading, or other data loading methods to suit different application scenarios. In addition, this application incorporates verification mechanisms at each stage of data storage, transmission, and computation, including storage location verification, data transmission verification, and computation result verification, enabling timely detection and correction of data errors and ensuring stable system operation. Real-time monitoring of the processing core array's operational status by the scheduler also allows component failures and other anomalies to be handled promptly, improving system reliability.
[0063] Figure 9 illustrates a specific embodiment of an apparatus of this application for mapping data to an integrated circuit processing core array.
[0064] In the specific embodiment shown in Figure 9, the apparatus for mapping data to the integrated circuit processing core array mainly includes: a bit-inverted input array acquisition module 901, which reverses and rearranges multiple input bits of the input array input to the integrated circuit to obtain a bit-inverted input array; an input bit mapping relationship determination module 902, which determines the input bit mapping relationship reflecting the storage location of each inverted input bit in the bit-inverted input array within the processing core array based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method; a data storage module 903, which stores each inverted input bit in the storage circuit of a corresponding different computing unit in the processing core array according to the input bit mapping relationship; and a calculation module 904, which performs, in the different computing units, fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuits of the computing units.
[0065] The apparatus for mapping data to an integrated circuit processing core array provided in this application can be used to perform the method for mapping data to an integrated circuit processing core array described in any of the above embodiments. Its implementation principle and technical effect are similar, and will not be repeated here.
[0066] In one specific embodiment of this application, the functional modules in the apparatus for mapping data to an integrated circuit processing core array may be implemented directly in hardware, in software modules executed by a processor, or in a combination of both.
[0067] Software modules may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium.
[0068] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor can be a microprocessor, but alternatively, it can be any conventional processor, controller, microcontroller, or state machine. The processor can also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors incorporating a DSP core, or any other such configuration. Alternatively, the storage medium can be integrated with the processor. The processor and storage medium can reside in an ASIC. The ASIC can reside in the user terminal. Alternatively, the processor and storage medium can reside as discrete components in the user terminal.
[0069] In another specific embodiment of this application, a hardware circuit is provided that stores a computer program / instruction, which is operated to perform the method described in the above embodiments for mapping data to an integrated circuit processing core array.
[0070] Specifically, as shown in Figure 2, the integrated circuit for performing sensing processing includes multiple array cores, multiple boundary cores, a scheduler acting as the main controller, a first set of peripheral controllers, a second set of peripheral controllers, and a main memory. The integrated circuit may also include a first peripheral load memory unit, a second peripheral load memory unit, a first peripheral memory, a second peripheral memory, a first set of dual first-in-first-out (FIFO) memories, and a second set of dual FIFO memories. The first set of peripheral controllers controls the operation of the first peripheral load memory unit, the first peripheral memory, and the first set of dual FIFO memories. The second set of peripheral controllers operates on the same principle and performs the same functions as the first set of peripheral controllers.
[0071] In one specific embodiment of this application, the main function of the integrated circuit is to realize real-time and efficient computation of perceived data and sensor data. The overall configuration of the integrated circuit includes multiple array cores, i.e., computing units, which constitute central signal and data processing nodes. Each node has a large-capacity register file, which reduces, in some cases significantly, the clock cycles required for the array cores to retrieve processing data from memory and push it back. The integrated circuit generates control instructions such as computation instructions, execution instructions, and data movement instructions through, for example, a scheduler or compiler module. These instructions enable data to flow continuously within the integrated circuit, particularly between multiple array cores and boundary cores. The array cores are preferably used as data or signal processing nodes or as processing circuits, and are selected from array core devices that have large-capacity register files and arithmetic logic units.
[0072] The arithmetic logic unit of the array core can be configured to perform arithmetic operations such as addition, subtraction, multiplication, and division, and logical operations such as AND, OR, NOT, and XOR. It can also perform more complex operations, such as vector operations, matrix multiplication operations, and fast Fourier transform (FFT) operations, to support the processing of sensing data and sensor data.
[0073] Each array core also includes a local controller, which can be configured to control the overall operation of the array core, including controlling the storage and retrieval of data in the register file, controlling the types of operations performed by the arithmetic logic unit, and controlling the transmission of data between the array core and other components such as other array cores and boundary cores. The local controller can implement the above control functions by receiving instructions from the scheduler. These instructions may include calculation instructions that specify the operation the arithmetic logic unit is to perform and data movement instructions that specify the data transmission path and method.
[0074] In one specific embodiment of this application, the boundary core is configured to connect the array core to other peripheral groups of the integrated circuit, such as a first group of peripheral controllers and a second group of peripheral controllers. The boundary core acts as a data relay and protocol conversion mechanism. Each boundary core may include a boundary register file and a boundary controller. The boundary register file is used to temporarily store data received from the array core or data to be transmitted to the array core. The boundary controller is used to control the transmission of data between the boundary core and the array core, and between the boundary core and peripheral components, to ensure that the timing and format of data transmission meet the requirements.
[0075] In one specific embodiment of this application, the scheduler, acting as the main controller of the integrated circuit, can be configured to coordinate the operations of various components, including generating instructions and sending instructions to the array core, boundary core, peripheral controllers, etc., scheduling data transmission between components, and monitoring the overall operating status of the integrated circuit, such as computation progress, data transmission status, and component failures. The scheduler can allocate computing and transmission resources based on preset algorithms such as task priority algorithms and load balancing algorithms to ensure that the integrated circuit can efficiently and stably execute sensing data processing tasks.
[0076] In one specific embodiment of this application, the first set of peripheral controllers and the second set of peripheral controllers can be configured to interact with external devices such as sensors, external memory, and communication modules. They receive data information such as environmental data collected by sensors from external devices, and simultaneously send data processed by the integrated circuit to the external devices. The first set of peripheral controllers can be connected to a first peripheral memory via a first peripheral loading storage unit, and the second set of peripheral controllers can be connected to a second peripheral memory via a second peripheral loading storage unit. The peripheral loading storage unit is used to implement data loading and storage operations, and the peripheral memory is used to temporarily store data received from external devices or data to be sent to external devices, thereby alleviating bottleneck problems in the data transmission process.
[0077] The first and second sets of dual first-in-first-out (FIFO) memories buffer data, ensuring that data is not lost or misaligned during transmission. For example, when the speed at which the first set of peripheral controllers receives data from external devices exceeds the speed at which they transmit data to the array core, the excess data can be temporarily stored in the first set of FIFOs and gradually transmitted to the array core when the transmission speeds match. Conversely, when the speed at which the first set of peripheral controllers transmits data to the array core exceeds the speed at which they receive data from external devices, the first set of FIFOs provides a buffer to prevent the array core from pausing computation due to insufficient data.
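The rate-matching role of a FIFO can be illustrated with a small simulation using `collections.deque`. The function name `run_fifo` and the per-cycle rate model are assumptions for this sketch; a hardware FIFO would additionally have a fixed depth and back-pressure signals, which are not modeled here.

```python
from collections import deque


def run_fifo(incoming, produce_rate, consume_rate):
    """Simulate a FIFO between a producer and a consumer.

    Each cycle, up to produce_rate items enter the FIFO and up to
    consume_rate items leave it. Returns (delivered items, peak depth).
    """
    if produce_rate < 1 or consume_rate < 1:
        raise ValueError("rates must be at least 1 item per cycle")
    pending = deque(incoming)
    fifo = deque()
    delivered = []
    max_depth = 0
    while pending or fifo:
        for _ in range(min(produce_rate, len(pending))):
            fifo.append(pending.popleft())
        max_depth = max(max_depth, len(fifo))
        for _ in range(min(consume_rate, len(fifo))):
            delivered.append(fifo.popleft())
    return delivered, max_depth


delivered, max_depth = run_fifo(list(range(6)), produce_rate=3, consume_rate=1)
print(delivered, max_depth)
```

With data arriving three times faster than it is consumed, nothing is lost or reordered; the surplus simply accumulates in the buffer until the consumer catches up.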
[0078] In one specific embodiment of this application, the main memory may be configured to store program code such as perceptual data processing algorithms, initial data such as weight data of neural networks, and intermediate and final result data generated during the calculation process, all required for the integrated circuit to perform computations. The main memory may employ a high-speed storage medium such as dynamic random access memory (DRAM) or static random access memory (SRAM) to improve data read and write speeds, thereby enhancing the overall performance of the integrated circuit.
[0079] In one specific embodiment of this application, the compiler module has the ability to convert perceptual data processing programs written in high-level programming languages such as C and Python into machine instructions that the integrated circuit can recognize and execute. During the compilation process, the compiler module can perform code optimizations such as code simplification, computation optimization, and data layout optimization to reduce instruction execution cycles, improve data access efficiency, and thus further enhance the computational performance of the integrated circuit. For example, the compiler module can decompose large matrix multiplication operations into multiple smaller matrix multiplication operations based on the number and structure of the array cores, and allocate them to different array cores for parallel execution to shorten the computation time.
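The decomposition of a large matrix multiplication into smaller ones can be sketched with a tiled loop. This is a generic blocking scheme rather than the compiler module's actual strategy, which is not described; the function names and the tile size are hypothetical, and each tile-level product stands for work that could be assigned to one array core.

```python
def matmul(a, b):
    """Reference matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def tiled_matmul(a, b, tile):
    """Compute a @ b as a sum of tile-sized sub-products."""
    n, inner, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, inner, tile):
                # One sub-product; independent (i0, j0) tiles could run in parallel.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        c[i][j] += sum(
                            a[i][k] * b[k][j]
                            for k in range(k0, min(k0 + tile, inner))
                        )
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b, tile=2) == matmul(a, b))
```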
[0080] In one specific embodiment of this application, a computer device includes a memory, a processor, and a computer program stored in the memory. The processor executes the computer program to implement the method described in the above embodiments for mapping data to an integrated circuit processing core array.
[0081] In one specific embodiment of this application, a computer program product includes a computer program / instruction that, when executed by a processor, implements the method described in the above embodiments for mapping data to an integrated circuit processing core array.
[0082] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0083] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0084] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A method for mapping data to an integrated circuit processing core array, characterized in that the method comprises: reversing and rearranging multiple input bits of the input array input to the integrated circuit to obtain a bit-inverted input array; determining, based on the sequence of the multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method, an input bit mapping relationship reflecting the storage location of each inverted input bit of the bit-inverted input array within the processing core array; storing, according to the input bit mapping relationship, each of the inverted input bits in the storage circuit of a corresponding different computing unit in the processing core array; and performing, in the different computing units, fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuits of the computing units.
2. The method for mapping data to an integrated circuit processing core array according to claim 1, characterized in that the step of reversing and rearranging multiple input bits of the input array input to the integrated circuit to obtain a bit-inverted input array includes: for each input bit, reversing the corresponding input bit index in the input array to generate a corresponding bit reversal index; and rearranging the input bits in the input array according to the bit reversal index to obtain the bit-inverted input array.
3. The method for mapping data to an integrated circuit processing core array according to claim 2, characterized in that a bit-inverted input bit index is generated based on the correspondence between the sequence of input bits in the input array and the rearranged sequence of inverted input bits in the bit-inverted input array.
4. The method for mapping data to an integrated circuit processing core array according to claim 1, characterized in that the step of determining the input bit mapping relationship reflecting the storage location of each inverted input bit of the bit-inverted input array within the processing core array, based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method, includes: when the number of inverted input bits is greater than the number of computing units, storing the inverted input bits in the storage circuits of the corresponding sequence of computing units according to their bit order, and storing the excess inverted input bits, again according to their bit order, in the storage circuits of the corresponding sequence of computing units; or, according to the storage capacity of the storage circuit of a computing unit, storing multiple inverted input bits in the same computing unit at once.
5. The method for mapping data to an integrated circuit processing core array according to claim 1, characterized in that the step of performing, in different computing units, fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuits of the computing units includes: determining the storage positions of the multiple inverted input bits based on the positions of the multiple inverted input bits required for the fast Fourier transform matrix multiplication calculation of each weight level; and determining, based on the storage positions of the multiple inverted input bits, to transmit the multiple inverted input bits to the computing unit that performs the fast Fourier transform matrix multiplication calculation of the weight level.
6. The method for mapping data to an integrated circuit processing core array according to claim 5, characterized in that the step of determining, based on the storage positions of the multiple inverted input bits, to transmit the multiple inverted input bits to the computing unit that performs the fast Fourier transform matrix multiplication calculation of the weight level includes: synchronously exchanging and storing the first inverted input bit stored in the first computing unit and the second inverted input bit stored in the second computing unit.
7. The method for mapping data to an integrated circuit processing core array according to claim 5, characterized in that the step of determining, based on the storage positions of the multiple inverted input bits, to transmit the multiple inverted input bits to the computing unit that performs the fast Fourier transform matrix multiplication calculation of the weight level includes: based on the positions of the first computing unit and the second computing unit in the processing core array, transferring the first inverted input bit stored in the first computing unit to the second computing unit by rotating it by a predetermined angle.
8. The method for mapping data to an integrated circuit processing core array according to claim 5, characterized in that the step of determining, based on the storage positions of the multiple inverted input bits, to transmit the multiple inverted input bits to the computing unit that performs the fast Fourier transform matrix multiplication calculation of the weight level includes: based on the positions of the first computing unit and the second computing unit in the processing core array, transferring the first inverted input bit stored in the first computing unit to the second computing unit by moving it a predetermined number of times along the rows and/or columns of computing units in the processing core array.
9. An apparatus for mapping data to an integrated circuit processing core array, characterized in that the apparatus comprises: a bit-inverted input array acquisition module, used to reverse and rearrange multiple input bits of the input array input to the integrated circuit to obtain a bit-inverted input array; an input bit mapping relationship determination module, used to determine, based on the sequence of multiple input bits of the input array, the rearranged sequence of the inverted input bits of the bit-inverted input array, and a predetermined data storage method, the input bit mapping relationship reflecting the storage location of each inverted input bit of the bit-inverted input array within the processing core array; a data storage module, used to store each of the inverted input bits in the storage circuit of the corresponding computing unit in the processing core array according to the input bit mapping relationship; and a calculation module, used to perform, in different computing units, fast Fourier transform matrix multiplication calculations between multiple weight levels and the inverted input bits stored in the storage circuits of the computing units.
10. An integrated circuit storing a computer program/instructions, characterized in that the computer program/instructions are executed to perform the method for mapping data to an integrated circuit processing core array according to any one of claims 1-7.
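For orientation only, the relationship between the bit-inverted input array of claim 1 and the subsequent fast Fourier transform calculation can be sketched with a textbook iterative radix-2 decimation-in-time FFT. This is an assumption-laden illustration, not the claimed circuit: it treats each "weight level" as one butterfly stage with its own twiddle factors, runs on a single processor rather than on a processing core array, and uses hypothetical function names (`bit_reverse_array`, `fft_from_bit_reversed`, `naive_dft`). At each stage, position `p` is combined with position `p + half`, which is the kind of pairwise data movement that claims 5 to 8 address.

```python
import cmath


def bit_reverse_array(values):
    """Rearrange values into bit-reversed index order (length must be a power of two)."""
    n = len(values)
    num_bits = n.bit_length() - 1
    out = []
    for i in range(n):
        rev, idx = 0, i
        for _ in range(num_bits):
            rev = (rev << 1) | (idx & 1)
            idx >>= 1
        out.append(values[rev])
    return out


def fft_from_bit_reversed(data):
    """Radix-2 decimation-in-time FFT; expects input already in bit-reversed order."""
    x = [complex(v) for v in data]
    n = len(x)
    half = 1  # distance between the two operands of a butterfly at this stage
    while half < n:
        step = cmath.exp(-1j * cmath.pi / half)  # twiddle increment for this stage
        for start in range(0, n, 2 * half):
            w = 1 + 0j
            for k in range(half):
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b          # upper butterfly output
                x[start + k + half] = a - b   # lower butterfly output
                w *= step
        half *= 2
    return x


def naive_dft(values):
    """Direct O(n^2) discrete Fourier transform, used only as a reference."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    fast = fft_from_bit_reversed(bit_reverse_array(samples))
    slow = naive_dft(samples)
    print(all(abs(f - s) < 1e-9 for f, s in zip(fast, slow)))
```

Because the input is already in bit-reversed order, every stage reads and writes its operands in place, so the output appears in natural frequency order without a final reordering pass.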