FPGA matrix transposing method and device based on AXI protocol

By employing a hybrid coupling scheme that combines loose and tight coupling methods, and utilizing the AXI protocol bus for matrix transposition of image data, the problem of limited resources under the AXI protocol is solved, achieving efficient matrix transposition and storage space utilization.

CN116804754B · Active · Publication Date: 2026-06-23 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2022-12-13
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies cannot effectively utilize the three-dimensional storage space of DDR for matrix transposition under the AXI protocol. The resulting technical problem is that existing technologies cannot achieve efficient matrix transposition under the AXI protocol, especially when resources are limited.

Method used

A hybrid coupling scheme is adopted, which divides the image data into small matrix blocks and uses the AXI protocol bus for storage and retrieval. The matrix transpose is achieved by combining loose coupling and tight coupling methods.

Benefits of technology

The AXI protocol enables efficient matrix transposition, reducing resource usage and improving storage space utilization, making it suitable for image data of different sizes.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The embodiment of the application discloses an FPGA matrix transposition method and device based on the AXI protocol. For data transposition in on-board (satellite) SAR imaging processing, the AXI protocol is used, and a mixed coupling scheme combining a loose coupling scheme and a tight coupling scheme is proposed to reduce resource usage in the matrix transposition process. The mixed coupling scheme combines the external storage of the tight coupling scheme with the internal transposition of the loose coupling scheme. It uses more storage space than the tight coupling scheme but, like the loose coupling scheme, is suitable for larger image sizes. Compared with the loose coupling scheme, the mixed coupling scheme reduces cache resources, so that limited cache resources yield a greater efficiency gain. In practical applications, the mixed coupling scheme therefore offers a degree of adaptability.

Description

Technical Field

[0001] This invention relates to the field of data transposition in on-board SAR imaging processing, and more particularly to an FPGA matrix transposition method and apparatus based on the AXI protocol.

Background Technology

[0002] Synthetic Aperture Radar (SAR) is an all-weather, high-resolution imaging radar. Spaceborne SAR systems have gradually evolved into multi-band, high-resolution, multi-mode, and multi-polarization satellite systems, placing higher demands on on-board SAR processing. With the development of FPGA technology, FPGAs offer advantages in high computing power and low power consumption compared to DSPs and GPUs. Data transposition is an essential step in SAR imaging processing. Because memory is most efficient when accessing contiguous data, the efficiency of accessing a two-dimensional image differs significantly between the range and azimuth directions. This imbalance affects subsequent data processing.

[0003] Based on whether the original image and the data organization in the storage space are the same, existing technical solutions mainly fall into two categories: in-situ transposition and transpose memory. The in-situ transposition scheme refers to transposing the matrix within the original storage space without occupying more storage. Conversely, the transpose memory scheme requires more storage space than the data size.
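
The distinction between the two categories can be illustrated with a minimal Python sketch. It is not taken from the patent; the function names and the flat row-major list standing in for the storage space are assumptions made purely for illustration.

```python
def transpose_in_place(flat, n):
    """In-situ style: swap elements inside the original n x n storage."""
    for r in range(n):
        for c in range(r + 1, n):
            i, j = r * n + c, c * n + r
            flat[i], flat[j] = flat[j], flat[i]
    return flat


def transpose_to_new_memory(flat, n):
    """Transpose-memory style: write the result into a second buffer."""
    out = [None] * (n * n)
    for r in range(n):
        for c in range(n):
            out[c * n + r] = flat[r * n + c]
    return out


n = 4
image = list(range(n * n))            # hypothetical 4 x 4 image, row-major
copied = transpose_to_new_memory(image, n)
swapped = transpose_in_place(list(image), n)
assert copied == swapped              # both give the same transposed data
```

The in-place version needs no storage beyond the image itself, while the second version needs a buffer as large as the image in addition to the original.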

[0004] Existing technologies employing in-situ matrix transpose techniques mainly include:

[0005] A matrix transposition and efficient 2D FFT calculation scheme is implemented using SRAM, achieving a matrix transposition efficiency of 90%. Alternatively, a matrix partitioning and 2D mapping method is used to arrange 2D data in DDR storage space, improving access efficiency. There are also similar storage methods involving matrix partitioning and 2D mapping. A 3D-FFT mapping scheme has also been proposed, a generalized matrix partitioning data mapping scheme. Furthermore, there is a storage method involving matrix partitioning and 3D mapping, and based on this, a 3D cross mapping method and a 3D cross multi-channel mapping method have been proposed.

[0006] Existing technologies employing transpose memory techniques mainly include:

[0007] The feasibility of transposed memory for images of different dimensions was verified.

[0008] In the field of electronic information, the technology of using ASICs as the physical carrier for chip design is called System-on-Chip (SoC). The technology of using FPGAs as the physical carrier for chip design is called System-on-a-Programmable-Chip (SoPC). Systems built on SoPC technology are highly flexible: they are customizable, expandable, and programmable in both software and hardware, which makes them widely used in industrial fields, especially communications.

[0009] As chips become increasingly larger and market demands for shorter product design cycles rise, IP reuse technology plays an increasingly important role in SOPC development. IP, or intellectual property core, in the microelectronics field mainly refers to design components that have been verified and whose functional characteristics can be guaranteed. However, IP reuse technology also faces many challenges, among which the lack of standardized IP interfaces and the difficulty of reuse are the most serious. This has led to another important technology in SOPC design—On-Chip Bus technology.

[0010] On-chip buses, unlike off-chip buses, are a technology used for interconnection within a chip, enabling communication between various IPs. This allows IP cores to define their external interconnect interfaces during the initial design phase and adapt them to commonly used on-chip bus protocols, enabling integration with other IP cores based on that protocol and rapid chip development. Currently, leading global IP vendors have launched their own independently developed bus standards. Among these companies, ARM's AMBA on-chip bus has gradually become the industry standard due to its high performance, low latency, and high bandwidth. However, while the AMBA bus protocol can theoretically achieve extremely high performance by increasing the bus width and frequency, the PPA (performance, power, area) limitation means higher power consumption. Users desire to achieve high bandwidth requirements in product design with lower bus width and frequency, reaching the limit of protocol transmission efficiency. Therefore, the AXI bus, as a high-performance, low-latency, and high-bandwidth bus protocol, was introduced.

[0011] Regarding matrix transposition implementation, existing technical solutions are all based on the traditional DDR interface protocol (Native Interface). Except for those using SRAM to achieve efficient 2D FFT calculations, other solutions are not suitable for the AXI protocol. Under the existing Native interface protocol, matrix partitioning and 2D mapping are used to arrange 2D data in the DDR storage space, while 3D mapping matrix transposition technology fully utilizes the different storage efficiencies of DDR's 3D space to determine the matrix transposition mapping scheme. However, under the AXI protocol, the DDR's 3D space is masked, and the storage space is treated as a 2D space. Therefore, existing 3D mapping matrix transposition technology cannot be directly used.

[0012] Secondly, the matrix partitioning and two-dimensional mapping method, which arranges two-dimensional data in DDR storage space, can be used with the AXI protocol, but it uses more cache resources.

Summary of the Invention

[0013] To address the aforementioned shortcomings, this invention focuses on solving the problem of a general matrix transpose structure under the AXI protocol with limited resources. To balance the efficiency of storage space access to range and azimuth data, embodiments of this invention provide an FPGA matrix transpose method based on the AXI protocol, the method comprising:

[0014] Obtain P rows × P columns of image data, and determine the size N² of the transpose matrix block based on the image data size P and the first operation length. Store the M columns of data cached in a single pass into the FPGA cache to obtain the first data; the size of the first data is P×M.

[0015] The first N data from M rows of data are retrieved from the cache to form a data group with a length of the first operation length. Multiple data groups form a data segment, which completely occupies one row of the first storage space. Using the AXI protocol bus, the data segment is written into the first storage space with a second operation length until the image data of P rows × P columns is stored. The second operation length is limited by the AXI protocol bus.

[0016] Using the AXI protocol bus, data is read from the first storage space row by row with the first operation length until data of size N² has been read from the first data, yielding the second data;

[0017] The second data is divided into N×N matrix blocks, and the matrix blocks are transposed to obtain the third data.

[0018] The third data is grouped into multiple data groups with the first operation length and written into the second storage space for storage; until the transposition of the P row × P column image data is completed;

[0019] Using the AXI protocol bus, data is read from the second storage space with a second operation length and output line by line to obtain the transposed image data.

[0020] In some embodiments, the mathematical expression for the column data M cached in a single instance is:

[0021]

[0022] Where A is the first operation length and N is the length or width of the matrix block.

[0023] In some embodiments, the size of the transpose matrix block is N², where N is the length or width of the matrix block, and the mathematical expression for N is:

[0024]

[0025] Where A is the first operation length and P is the size of the image data.

[0026] In one possible embodiment, when the image data has a small number of points, the method includes:

[0027] Get the image data of P rows × P columns;

[0028] M rows of data are stored in the FPGA cache, and the length of each row of data is P. The first I data points of each row of data are extracted to form a data segment, and the length of each data segment is M×I. The M rows of data are divided into a total of [number] segments. Each data segment is arranged from top to bottom in the cache to form a data block;

[0029] The data block is mapped to the first storage space to complete the data mapping of M rows × P columns, until the data mapping of P rows × P columns is completed;

[0030] Using the AXI protocol bus, the data blocks are read from the first storage space sequentially with a second operation length and output until the reading and output of P rows × P columns of image data are completed.

[0031] In one possible embodiment, when the image data has a very large number of points, the method includes:

[0032] Get the image data of P rows × P columns;

[0033] Using the AXI protocol bus, image data is stored unit by unit in the first storage space according to the row direction;

[0034] Read N data points from the first storage space by skipping rows to obtain an N×N matrix block;

[0035] Transpose the N×N matrix blocks;

[0036] The transposed matrix blocks are skipped and stored in the second storage space until the image data transposed by P rows × P columns is completed;

[0037] Using the AXI protocol bus, transposed data is sequentially read from the second storage space and output with a second operation length.

[0038] On the other hand, embodiments of the present invention disclose a device for FPGA matrix transposition based on the AXI protocol, comprising:

[0039] The grouping unit is used to group the image data in the FPGA buffer according to the grouping principle to obtain the first data;

[0040] The first input unit is used to write the first data into the first storage space with a first operation length via the AXI protocol bus;

[0041] The first output unit is used to read the second data sequentially from the first storage space with a first operation length using the AXI protocol bus;

[0042] The transpose unit is used to divide the second data into matrix blocks, and transpose the matrix blocks to obtain the third data;

[0043] The second input unit is used to sequentially write the third data into the second storage space to complete the transposition of the image data;

[0044] The second output unit is used to read data from the second storage space with a second operation length via the AXI protocol bus, and output the data line by line to obtain the transposed image data.

[0045] To reduce resource consumption and implement a matrix transposition method under the AXI protocol, we combine the aforementioned loosely coupled and tightly coupled schemes to propose a hybrid coupling scheme, thereby reducing resource consumption during the matrix transposition process. Although the research is conducted using the AXI protocol, DDR is a three-dimensional storage space, so one of its dimensions can be masked and it can be treated as two-dimensional. Whether the two dimensions are row and col or row and bank, only the address positions differ, which does not change the content of the FPGA matrix transposition method disclosed in this invention. Therefore, the coupling scheme disclosed in this invention is still applicable.

Attached Figure Description

[0046] Figure 1 is a flowchart of the transpose method for the hybrid coupling scheme;

[0047] Figure 2 is a schematic diagram of the general structure of the hybrid coupling scheme;

[0048] Figure 3 is a flowchart of the transpose method for the tightly coupled scheme;

[0049] Figure 4 is a schematic diagram of data mapping under the tightly coupled scheme;

[0050] Figure 5 is a flowchart of the transpose method for the loosely coupled scheme;

[0051] Figure 6 is a schematic diagram of the transpose process of matrix partitioning;

[0052] Figure 7 is a schematic diagram of data mapping under the loosely coupled scheme;

[0053] Figure 8 is a schematic diagram comparing the loosely coupled and hybrid schemes for 2k×2k image data;

[0054] Figure 9 is a schematic diagram comparing the loosely coupled and hybrid schemes for 16k×16k image data;

[0055] Figure 10 is a schematic diagram of the FPGA matrix transpose based on the AXI protocol.

Detailed Implementation

[0056] This invention discloses an AXI-based matrix transpose method, including three schemes: tightly coupled, loosely coupled, and hybrid coupled schemes. Among these, there are no existing schemes for loosely coupled and tightly coupled methods based on the AXI protocol. Therefore, based on these loosely coupled and tightly coupled schemes, this invention proposes a hybrid coupled structure. This FPGA matrix transpose method is compatible with both tightly coupled and loosely coupled schemes, specifically designs the image data read / write timing process under the AXI protocol, and proposes a general AXI-based matrix transpose structure.

[0057] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0058] 1) Hybrid Coupling Scheme

[0059] This invention discloses an FPGA matrix transpose method based on the AXI protocol. Figure 1 is a flowchart of the transpose method for the hybrid coupling scheme, and the specific implementation is as follows:

[0060] S110: Acquire image data and store it in the FPGA cache.

[0061] Obtain P rows × P columns of image data, and determine the size N² of the transpose matrix block and the number M of columns cached in a single pass based on the size P of the image data and the continuous operation length A (first operation length). Store the M columns of data (first data) into the FPGA cache; the M columns of data constitute a data block of size P × M.

[0062] S120: Store image data in DDR1

[0063] The first N data from M rows of data are retrieved from the cache to form a data group with a length of the first operation length. Multiple data groups form a data segment. The data segment completely occupies one row in DDR1 (first storage space). Using the AXI protocol bus, the data segment is written to DDR1 with a continuous operation length B (second operation length) until the image data of P rows × P columns is stored. The continuous operation length B is limited by the AXI protocol bus.

[0064] S130: Read the pre-transposition matrix block data from DDR1

[0065] Using the AXI protocol bus, data is read from DDR1 with continuous operation length A until data of size N² has been read, giving the matrix block data before transposition (second data).

[0066] S140: Obtain the matrix blocks and perform data transpose.

[0067] Divide the pre-transposition matrix data into N×N matrix blocks and transpose each block to obtain the transposed matrix data (third data).

[0068] S150: Store the transposed matrix block data into DDR2.

[0069] The transposed matrix blocks are grouped into data groups with a continuous operation length A and written into DDR2 (second storage space) to store the transposed matrix blocks; this process continues until the transposition of the P rows × P columns of image data is completed.

[0070] S160: Read transposed image data from DDR2

[0071] Using the AXI protocol bus, M rows of transposed data are sequentially read from DDR2 with a continuous operation length of B, and output row by row to obtain the transposed image data.

[0072] Example 1:

[0073] To reduce resource consumption and implement a matrix transpose method under the AXI protocol, we combine the above loosely coupled and tightly coupled schemes to propose a hybrid coupling scheme. Figure 2 is a schematic diagram of the general hybrid coupling scheme. P represents the size of the image to be processed. A is the continuous operation length; the larger A is, the higher the data transmission efficiency. N is the length or width of the square matrix block processed during transposition, and N² is the size of that square block.

[0074] To clearly illustrate the data flow of the technical solution, in one possible embodiment, we use 2048×2048 64-bit data as an example to illustrate the matrix transpose process. For a 2048×2048 image, P = 2048. The continuous operation length A is 128. The matrix block processed during transposition has size N² = 1024, i.e., N = 32. Therefore, buffering is required. The data is processed as follows:

[0075] S210: Store image data in DDR1

[0076] The FPGA caches four rows of data, each containing 2048 data points. The first 256 data points from each of the four rows, totaling 1024 data points, are grouped into a single data segment. The four rows of 2048 data points therefore yield eight data segments. Within each data segment, 32 data points from each row are sequentially reassembled into data groups of 128 data points (4 × 32), so each data segment contains eight data groups. Using the AXI protocol bus, data is written to DDR1 with a continuous operation length of 1024, maximizing write efficiency. This process continues until all 2048×2048 data points are stored.

[0077] S220: Read data from DDR1

[0078] Data is read from the storage space. First, the first 128 data entries are read. Since DDR1 has 1024 data entries per line, the process jumps to the third line, reads another 128 data entries, jumps to another line, and so on, until all 1024 data entries have been read.

[0079] S230: Obtain matrix blocks and perform data transpose.

[0080] The 1024 data points are divided into 32×32 matrix blocks, and the data of each submatrix is transposed.

[0081] S240: Store the transposed data into DDR2

[0082] The transposed data is adjusted and written to DDR2 using the AXI protocol bus. Groups of 4 × 32 data points (128 data points each) are written to DDR2, continuing until all data in the 32 × 32 matrix block is written. Steps S220 and S230 are repeated until the matrix transposition of the 2048 × 2048 image data is complete.

[0083] S250: Read transposed image data from DDR2

[0084] Data is read from DDR2 in consecutive operation lengths of 1024, containing 8 data groups, each with 4 x 32 data elements. By skipping rows during the read operation, 4 rows of data are obtained, which are then output line by line to obtain the transposed matrix data.
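
The data flow of Example 1 can be modeled end to end at a reduced scale. The following Python sketch is an illustrative model only, not the patented implementation: it assumes that the number of cached rows satisfies M = A / N (which matches Example 1, where 128 / 32 = 4), that segments are stored in DDR1 one per row in batch-major order, and that DDR2 can be modeled as a plain two-dimensional array. The function name and the scaled-down parameters (P = 16, N = 4, A = 8, B = 16) are hypothetical.

```python
def hybrid_transpose(image, N, A, B):
    """Toy model of the hybrid scheme on a P x P list-of-lists image."""
    P = len(image)
    M = A // N   # rows cached at once (assumption: M * N = A, as in Example 1)
    I = B // M   # data points taken from each cached row per segment
    S = P // I   # segments, i.e. DDR1 rows, per batch of M cached rows

    # Store phase: pack each batch of M rows into segments of length B.
    ddr1 = []
    for b in range(P // M):
        rows = image[b * M:(b + 1) * M]
        for s in range(S):
            segment = []
            for g in range(I // N):          # one data group of length A per g
                col = s * I + g * N
                for row in rows:
                    segment.extend(row[col:col + N])
            ddr1.append(segment)             # one full DDR1 row (assumed layout)

    # Read / transpose / write phase, one N x N block at a time.
    ddr2 = [[None] * P for _ in range(P)]    # simplified model of DDR2
    for br in range(P // N):
        for bc in range(P // N):
            col = bc * N
            s, g = col // I, (col % I) // N
            block = []
            for k in range(N // M):          # N / M bursts of length A
                b = br * (N // M) + k
                burst = ddr1[b * S + s][g * A:(g + 1) * A]
                block.extend(burst[i * N:(i + 1) * N] for i in range(M))
            for r in range(N):
                for c in range(N):
                    ddr2[bc * N + c][br * N + r] = block[r][c]
    return ddr2


P = 16
image = [[r * P + c for c in range(P)] for r in range(P)]
result = hybrid_transpose(image, N=4, A=8, B=16)
assert result == [list(col) for col in zip(*image)]
```

In this model only M rows are buffered before each write to DDR1, while every read from DDR1 is still a contiguous burst of length A, which is the property the hybrid scheme relies on.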

[0085] Example 2:

[0086] In another possible embodiment, for a 2048×2048 image, i.e., P=2048, the hybrid coupling scheme caches 8 rows of data. The first 128 data points from each row are taken, and after combining them into 1024 data points, they are written to the storage space. During matrix transposition, 1024 data points are read from the storage space, containing 128 data points in the column direction; therefore, the matrix block is 128×128. The execution process is as follows:

[0087] 1) The FPGA caches 8 rows of data, each row containing 2048 data points. The first 128 data points from each of the 8 rows are taken, totaling 1024 data points, to form a data segment. The 8 rows of 2048 data points therefore comprise 16 data segments. Within each data segment, 16 data points from each row are sequentially reassembled into data groups of 128 data points (8 × 16), so each data segment contains 8 data groups. Using the AXI protocol bus, the data is written to DDR1 with a continuous operation length of 1024, achieving the highest write efficiency. This process continues until all 2048×2048 data points are stored. This corresponds to step S210.

[0088] 2) Read data from DDR1 using the AXI protocol bus. First, read the first 128 data entries. Since each row of the storage space contains 1024 data entries, then jump to the third row and read another 128 data entries, then jump to the next row, until all 1024 data entries have been read. This corresponds to step S220.

[0089] 3) Transpose the 128×128 matrix block, which contains 16384 data points. This corresponds to step S230.

[0090] 4) Adjust the transposed data and write it to DDR2 using the AXI protocol bus. Combine 4×128 data points into 128 data points and write them to the storage space until all 128×128 data points have been written, corresponding to step S240. Repeat steps 2 and 3 until the 2048×2048 data matrix is transposed and written to the storage space.

[0091] 5) Read data from DDR2 using the AXI protocol bus in continuous operation lengths of 1024, containing 8 data groups, each containing 4 x 32 data. By skipping rows during reading, obtain 4 rows of data, output them line by line to obtain the transposed matrix data. This corresponds to step S250.
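
The segment and group counts quoted in the write steps of Examples 1 and 2 follow from simple integer arithmetic. The Python helper below is a hypothetical convenience function (its name and parameter names are not from the patent) that reproduces those counts from the number of cached rows, the segment length, and the number of data points taken from each row per group.

```python
def layout_counts(P, rows_cached, segment_len, chunk):
    """Return (points per row per segment, segments per batch,
    group length, groups per segment)."""
    cols_per_row = segment_len // rows_cached
    segments_per_batch = P // cols_per_row
    group_len = rows_cached * chunk
    groups_per_segment = segment_len // group_len
    return cols_per_row, segments_per_batch, group_len, groups_per_segment


print(layout_counts(2048, 4, 1024, 32))   # Example 1 -> (256, 8, 128, 8)
print(layout_counts(2048, 8, 1024, 16))   # Example 2 -> (128, 16, 128, 8)
```

Both examples fill one 1024-point segment per write and use groups of 128 data points; they differ only in how many rows are cached and how many points each row contributes.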

[0092] 2) Tightly Coupled Scheme

[0093] This invention discloses an FPGA matrix transpose method based on the AXI protocol. When the image data has a small number of points, the hybrid coupling scheme can degenerate into a tightly coupled data transpose scheme. Figure 3 is a flowchart of the transpose method for the tightly coupled scheme, and the specific implementation is as follows:

[0094] S310: Obtain the image data of P rows × P columns;

[0095] S320: Arrange data into data blocks according to the block division principle.

[0096] M rows of data are stored in the FPGA cache, and the length of each row of data is P. The first I data points of each row of data are extracted to form a data segment, and the length of each data segment is M×I. The M rows of data are divided into a total of [number] segments. Each data segment is arranged from top to bottom in the cache to form a data block;

[0097] S330: Map the image data to FPGA external storage.

[0098] The data block is mapped to DDR1 to complete the data mapping of M rows × P columns, until the data mapping of P rows × P columns is completed;

[0099] S340: Read the image data from the FPGA external storage, transpose it, and output the result.

[0100] Using the AXI protocol bus, the data blocks are read from the DDR1 sequentially with a continuous operation length of B and output until the reading and output of P rows × P columns of image data are completed.

[0101] Example 3:

[0102] To clearly illustrate the data flow of the technical solution, we will use 512×512 64-bit data as an example to explain the matrix transpose process. Figure 4 is a schematic diagram of data mapping under the tightly coupled scheme. The execution process is as follows:

[0103] S410: The FPGA caches 16 rows of data. From each row, 16 columns of data are taken, for a total of 256 data points.

[0104] Each data segment consists of 256 data points.

[0105] S420: The 16 rows of 512 data points each yield 32 data segments, which are arranged from top to bottom to form a data block.

[0106] S430: Map the 32×256 data block from step S420 to the storage space to complete the mapping of 16×512 points. Repeat steps S410, S420, and S430 until the original 512×512 image is stored.

[0107] S440: Data reading process. Data is read from DDR1 one row at a time, skipping rows between reads, as shown by the red arrow in process D. This continues until 8×1024 data points have been read, as shown in process E. The data order is then adjusted, and 16 columns of data are output.

[0108] S450: Repeat S440, reading out 8×1024 data points each time as in process D, until all 512×512 data points have been read.
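
The store and read steps of the tightly coupled scheme can be sketched at a reduced scale as follows. This is an illustrative Python model under stated assumptions rather than the patented implementation: DDR1 is modeled as a list of segments stored in batch-major order, and the function names and the small test size (P = 8, M = 2, I = 2) are hypothetical. In Example 3 the corresponding values would be P = 512, M = 16, and I = 16.

```python
def tight_store(image, M, I):
    """Cache M rows at a time and pack I columns of each row into one segment."""
    P = len(image)
    ddr1 = []
    for b in range(0, P, M):
        rows = image[b:b + M]
        for s in range(0, P, I):
            ddr1.append([x for row in rows for x in row[s:s + I]])
    return ddr1


def tight_read_transposed(ddr1, P, M, I):
    """Read the same segment index from every batch (skipping rows in DDR1),
    then reorder the data into I output columns of length P."""
    segs_per_batch = P // I
    out = []
    for s in range(segs_per_batch):
        cols = [[] for _ in range(I)]
        for b in range(P // M):
            seg = ddr1[b * segs_per_batch + s]
            for r in range(M):
                for c in range(I):
                    cols[c].append(seg[r * I + c])
        out.extend(cols)
    return out


P, M, I = 8, 2, 2
image = [[r * P + c for c in range(P)] for r in range(P)]
ddr1 = tight_store(image, M, I)
transposed = tight_read_transposed(ddr1, P, M, I)
assert transposed == [list(col) for col in zip(*image)]
```

Note that no second storage space appears in this model: the transposition is obtained purely from the packing order on the way in and the reordering on the way out, at the cost of buffering M full rows.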

[0109] Because this method uses buffer space to tightly couple the data together, it is called tightly coupled. As can be seen, the tightly coupled scheme requires storage space equal to the data size, eliminating the need for additional storage space, and is a representative method of in-situ transpose. However, the required buffer space (16 rows of data) is related to the data dimension: as the image dimension increases, the buffer space also gradually increases. Therefore, tight coupling is suitable for matrix transposes with smaller image sizes. When the image size increases, the FPGA buffer pressure also increases.

[0110] In Example 3, the data segment length is 256. In fact, if 32 lines of data are cached, the data segment length becomes 1024, which will improve storage efficiency, but at the cost of occupying more cache space.

[0111] 3) Loosely coupled scheme

[0112] This invention discloses an FPGA matrix transpose method based on the AXI protocol. When the image data has a very large number of points, the hybrid coupling scheme can degenerate into a loosely coupled data transpose scheme. Figure 5 is a flowchart of the loosely coupled transpose method, and the specific implementation is as follows:

[0113] S510: Obtain image data of P rows × P columns;

[0114] S520: Using the AXI protocol bus, image data is stored in DDR1 cell by cell in the row direction.

[0115] S530: Read N data points from DDR1 by skipping rows to obtain an N×N matrix block.

[0116] S540: Figure 6 is a schematic diagram of the matrix block transpose process. After data storage is complete, the matrix blocks are read out as shown in Figure 5 and then stored in the transposed positions. For example, the element read from position (m,n) is stored at position (n,m), which completes the transpose of the N×N matrix block.

[0117] S550: Store the transposed matrix blocks row-by-row into DDR2. Repeat S530, S540, and S550 until the P rows × P columns of image data have been transposed and written into DDR2.

[0118] S560: Using the AXI protocol bus, sequentially read and output the transposed data from DDR2 with continuous operation length B.
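
Steps S520 to S550 amount to a block-wise transposition between two storage spaces. The Python sketch below models both DDR1 and DDR2 as flat row-major lists, which is an assumption of this illustration; the function name and the small test size are hypothetical.

```python
def loose_transpose(ddr1, P, N):
    """Block-wise transpose from flat row-major DDR1 into a second buffer."""
    ddr2 = [None] * (P * P)
    for br in range(0, P, N):
        for bc in range(0, P, N):
            # S530: read N data points per row, skipping rows (stride P).
            block = [ddr1[(br + r) * P + bc:(br + r) * P + bc + N]
                     for r in range(N)]
            # S540: element (m, n) of the block moves to (n, m).
            t = [list(col) for col in zip(*block)]
            # S550: write the block to the mirrored block position in DDR2.
            for r in range(N):
                start = (bc + r) * P + br
                ddr2[start:start + N] = t[r]
    return ddr2


P, N = 8, 4
ddr1 = list(range(P * P))
ddr2 = loose_transpose(ddr1, P, N)
assert all(ddr2[c * P + r] == ddr1[r * P + c]
           for r in range(P) for c in range(P))
```

Only one N×N block is held on chip at a time, but every block read and write touches N separate rows of the storage space.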

[0119] Example 4

[0120] To clearly illustrate the data flow of the technical solution, we will use an 8192×8192 image of 64-bit data and a 32×32 matrix block as an example to illustrate the loosely coupled process. Figure 7 is a schematic diagram of data mapping under the loosely coupled scheme. The execution process is as follows:

S610: Store image data into the first storage space.

[0121] Using the AXI protocol bus, the image is stored cell by cell in the row direction in DDR1. The first row of the image, consisting of 8192 data points, is stored in the first 8 rows of DDR1. Similarly, the second row of the image is stored starting from the 9th row of the storage space, until the entire image is stored.

[0122] S620: Obtain matrix blocks

[0123] Read 32 data points from DDR1 by skipping rows until a 32×32 matrix block is read.

[0124] S630: Perform data transpose of matrix blocks.

[0125] Transpose the data of a 32×32 matrix block.

[0126] S640: Store the transposed image data into DDR2

[0127] The transposed 32×32 matrix blocks are stored in DDR2 with skipping rows. The transposition of the matrix blocks is repeated until the 8192×8192 image data is transposed and written to DDR2.

[0128] S650: Obtain the matrix transpose result

[0129] Using the AXI protocol bus, data is read sequentially from DDR2 with a continuous operation length of 1024, and the data is combined with the row data length to output 8192 matrix transposed data.
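
The address mapping described in S610 and the skip-row reads of S620 can be written out explicitly. The Python sketch below assumes a DDR1 row length of 1024 data points, inferred from the statement that one 8192-point image row occupies 8 rows of DDR1; the function names and the zero-based indexing are choices made for this illustration.

```python
def ddr1_location(r, c, P=8192, ddr_row_len=1024):
    """Map image element (r, c) to a zero-based (DDR1 row, offset) pair."""
    rows_per_image_row = P // ddr_row_len
    return r * rows_per_image_row + c // ddr_row_len, c % ddr_row_len


def block_read_addresses(block_row, block_col, N=32, P=8192, ddr_row_len=1024):
    """Start address of each of the N short reads that make up one N x N block."""
    return [ddr1_location(block_row * N + i, block_col * N, P, ddr_row_len)
            for i in range(N)]


print(ddr1_location(0, 8191))   # (7, 1023): end of the first image row
print(ddr1_location(1, 0))      # (8, 0): the second image row starts in the 9th DDR1 row
addrs = block_read_addresses(0, 0)
print(addrs[1][0] - addrs[0][0])  # 8: DDR1 rows skipped between consecutive reads
```

Under this mapping, each 32×32 block is assembled from 32 reads of 32 data points, with consecutive reads 8 DDR1 rows apart.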

[0130] 4) Comparison of Schemes

[0131] Unlike tight coupling, loose coupling does not require much buffer space to couple data, but it does require additional storage space and is a representative method of transposed memory. Because it does not require buffer space to couple data, it uses very little cache, has high read / write efficiency, does not affect backend processing, and is suitable for large pixel count images.

[0132] It should also be noted that shortening the matrix transpose time requires increasing the dimension of the matrix blocks, i.e., increasing the FPGA cache space.
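
The trade-off can be made concrete with a rough count. The Python sketch below is an illustration under simplifying assumptions that are not stated in the patent: the on-chip buffer is taken to hold exactly one N×N block of 64-bit data, and the number of blocks is used as a crude stand-in for transposition effort. The function name is hypothetical.

```python
def block_tradeoff(P, N, bits_per_point=64):
    """Return (number of N x N blocks in a P x P image, bytes for one block)."""
    blocks = (P // N) ** 2
    buffer_bytes = N * N * bits_per_point // 8
    return blocks, buffer_bytes


print(block_tradeoff(8192, 32))    # (65536, 8192)
print(block_tradeoff(8192, 128))   # (4096, 131072)
```

Going from 32×32 to 128×128 blocks cuts the number of blocks by a factor of 16 and raises the per-block buffer by the same factor.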

[0133] To maximize data read and write efficiency when moving data into and out of storage, the tightly coupled scheme uses a data segment length of 1024. A comparison of the loosely coupled and tightly coupled schemes is shown in Table 1. It can be seen that the cache resources required by the tightly coupled scheme increase with the image size. Furthermore, if the matrix block size remains unchanged, the cache resources required by the loosely coupled scheme are independent of the image size.

[0134] Table 1 Comparison of Tightly Coupled and Loosely Coupled Schemes

[0135]

[0136] The hybrid coupling scheme combines a tightly coupled external cache with a loosely coupled internal transpose. Compared to the tightly coupled scheme, it uses an additional storage space, but like the loosely coupled scheme, it is suitable for larger image sizes. Compared to the loosely coupled scheme, it reduces cache resources, allowing for greater efficiency improvements with limited cache resources. Analysis of 64-bit data of size 2048×2048 is shown in Table 2.

[0137] To compare the tightly coupled and hybrid coupled schemes, data enters and exits the storage space according to the optimal read and write efficiency, corresponding to steps S120 and S150 of the hybrid coupled scheme and step S340 of the tightly coupled scheme.

[0138] To compare the loosely coupled and hybrid-coupled schemes, after data enters the storage space, it is moved in and out of the storage space according to the optimal read / write efficiency. This corresponds to steps S530 and S550 in the loosely coupled scheme, and steps S130 and S150 in the hybrid-coupled scheme. The cache resources used by the schemes are analyzed in detail below:

[0139] Table 2 Comparison of Tightly Coupled, Loosely Coupled, and Hybrid Coupled Schemes

[0140]

[0141] Referring to Example 2, in this case, the hybrid coupling scheme caches 8 rows of data. The first 128 data points of each row are taken, and the resulting 1024 data points are written to the storage space. During the matrix transpose process, 1024 data points are read from the storage space, which includes data in the 128 column directions. Therefore, the size of the matrix block is 128×128.
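The grouping in this example can be sketched in a few lines of Python. This is only an illustrative model, not the FPGA implementation: the function name `group_rows_into_bursts` and the row length P = 2048 are assumptions chosen for the sketch, while M = 8 and N = 128 are taken from the example above.

```python
def group_rows_into_bursts(rows, n):
    """Interleave N-point slices of the M cached rows into bursts of length M*N.

    Hypothetical helper for illustration; 'rows' models the M cached rows.
    """
    p = len(rows[0])
    bursts = []
    for col in range(0, p, n):
        burst = []
        for row in rows:
            # Take N consecutive points of this row for the current column chunk.
            burst.extend(row[col:col + n])
        bursts.append(burst)
    return bursts


# Sizes from the example: 8 cached rows, 128 points taken from each row.
M, N, P = 8, 128, 2048          # P is an assumed row length
rows = [[r * P + c for c in range(P)] for r in range(M)]
bursts = group_rows_into_bursts(rows, N)
print(len(bursts), len(bursts[0]))
```

Each burst holds 8 × 128 = 1024 points, so one continuous write of length 1024 carries 128 columns of 8 different rows, which is why a later read of the same length returns data along 128 column directions.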

[0142] In terms of moving data into and out of the storage space, hybrid coupling and loose coupling have the same efficiency, and neither affects the preceding or subsequent processing flow. During the transposition of matrix blocks, the efficiency of hybrid coupling is the same as that of loose coupling, but its resource usage is significantly lower.

[0143] The above examples illustrate the effectiveness of the hybrid coupling method. In actual engineering applications, the parameters P, N, and A should be configured according to specific needs. This will be discussed in detail below.

[0144] Figure 2 shows the general structure of the hybrid coupling scheme. From the figure, the mathematical expression for the cache resource usage of the hybrid coupling scheme can be obtained:

[0145]

[0146] Here, $R_{\mathrm{mix}}$ denotes the BRAM resources used in the hybrid coupling scheme, A is the continuous operation length, N is the size of the block matrix, and P is the size of the image data. In practical applications, the number of input points, i.e., P, is usually fixed. Therefore, the transpose block size N can be determined based on A. Once the parameters A and P are determined, the resource usage of the two schemes can be calculated and the better scheme selected.

[0147] For example, when P = 2048 and A = 128, substituting into the formula for $R_{\mathrm{mix}}$ gives:

[0148]

[0149] The above expression attains its minimum if and only if the equality condition holds. Substituting the values of A and P into that condition gives N = 64. Therefore, in the improved CTM scheme, provided the real-time requirement is satisfied, resource usage is minimized at N = 64.
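The minimization step can be sketched as follows. The formula for $R_{\mathrm{mix}}$ is not reproduced in this text, so the form used here, $R_{\mathrm{mix}} = N^2 + 2AP/N + 2A$, is an assumption; it is chosen because it reproduces both the stated result N = 64 and the intercept $A^2 - N^2$ and break-even value 192K quoted later in the description.

```latex
\begin{aligned}
R_{\mathrm{mix}} &= N^2 + \frac{AP}{N} + \frac{AP}{N} + 2A \\
                 &\ge 3\sqrt[3]{N^2 \cdot \frac{AP}{N} \cdot \frac{AP}{N}} + 2A
                  = 3\sqrt[3]{A^2 P^2} + 2A, \\
\text{with equality iff}\quad N^2 &= \frac{AP}{N}
  \;\Longleftrightarrow\; N = \sqrt[3]{AP}, \\
A = 128 = 2^{7},\; P = 2048 = 2^{11} &\;\Longrightarrow\; N = \sqrt[3]{2^{18}} = 2^{6} = 64 .
\end{aligned}
```

Under this assumed form, the inequality is the arithmetic-geometric mean inequality applied to the three terms $N^2$, $AP/N$, and $AP/N$.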

[0150] At this point, the resource usage $R_{\mathrm{mix}}$ of the hybrid coupling scheme is:

[0151]

[0152] If a loosely coupled approach is adopted, the mathematical expression for resource consumption is:

[0153] $R_{\mathrm{CTM}} = A^2 + 2A = 128^2 + 256 = 16\mathrm{K} + 256$

[0154] Here, $R_{\mathrm{CTM}}$ denotes the BRAM resources used in the loosely coupled scheme. Comparing $R_{\mathrm{mix}}$ with $R_{\mathrm{CTM}}$ shows that the hybrid coupling scheme uses fewer resources and is the most efficient.
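The comparison can be checked numerically with a short sketch. `r_ctm` follows the loosely coupled formula given above; `r_mix` uses an assumed model, $N^2 + 2AP/N + 2A$, because the hybrid formula itself is not reproduced in this text. The function names and the search over power-of-two block sizes are illustrative choices, not part of the described method.

```python
def r_ctm(a):
    """BRAM usage of the loosely coupled scheme: A^2 + 2A (formula from the text)."""
    return a * a + 2 * a


def r_mix(a, n, p):
    """ASSUMED hybrid model: N^2 block buffer + 2*(A/N)*P row cache + 2A."""
    return n * n + 2 * a * p // n + 2 * a


def best_block_size(a, p):
    """Search power-of-two block sizes N <= A for the smallest assumed R_mix."""
    candidates = []
    n = 1
    while n <= a:
        candidates.append(n)
        n *= 2
    return min(candidates, key=lambda size: r_mix(a, size, p))


A, P = 128, 2048
N = best_block_size(A, P)
print(N, r_mix(A, N, P), r_ctm(A))
```

Under this assumed model the search selects N = 64, where the hybrid estimate is 12K + 256 against 16K + 256 for the loosely coupled scheme.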

[0155] In practical engineering, hybrid coupling schemes have more advantages than loose coupling methods, and the comparison results are as follows.

[0156] Take the difference between the resource usage of the loosely coupled scheme and that of the hybrid coupling scheme:

[0157]

[0158] With the number of points P as the main variable, the above formula can be transformed into:

[0159]

[0160] This is a linear function of the number of points P with a negative slope, so $\Delta R(P)$ is monotonically decreasing for P > 0. Its intercept is $\Delta R(0) = A^2 - N^2$. When $\Delta R(0) < 0$, hybrid coupling consumes more resources than loose coupling. In practice, A ≥ N, so the discussion proceeds under the condition $\Delta R(0) \ge 0$. Since $\Delta R(0) \ge 0$ and $\Delta R(P)$ is decreasing, there must be a zero point $P_0$.

[0161]

[0162] That is, when $P \le P_0$, $\Delta R(P) \ge 0$ and the hybrid coupling scheme ($R_{\mathrm{mix}}$) uses fewer resources; when $P \ge P_0$, $\Delta R(P) \le 0$ and the loosely coupled scheme uses fewer resources.
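A worked form of the difference and its zero point is sketched below. It assumes the hybrid model $R_{\mathrm{mix}} = N^2 + 2AP/N + 2A$, which is not stated in this text; the assumption is checked only against the values that are stated, namely the intercept $A^2 - N^2$ and the later results $P_0(1024, 512) = 192\mathrm{K}$ and $P_0(1024, 1024) = 0$.

```latex
\begin{aligned}
\Delta R(P) &= R_{\mathrm{CTM}} - R_{\mathrm{mix}}
             = \left(A^2 + 2A\right) - \left(N^2 + \frac{2AP}{N} + 2A\right)
             = \left(A^2 - N^2\right) - \frac{2A}{N}\,P, \\
\Delta R(P_0) &= 0 \;\Longrightarrow\; P_0 = \frac{N\left(A^2 - N^2\right)}{2A}, \\
P_0(1024, 512) &= \frac{512\,\left(1024^2 - 512^2\right)}{2048} = 196608 = 192\mathrm{K}.
\end{aligned}
```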

[0163] Further analysis of P0 reveals that P0 is also affected by the continuous operation length A and the internal processing structure N, both of which can influence the applicability of the hybrid coupling scheme.

[0164] First, consider the effect of the operation length A. $P_0$ is rewritten as follows:

[0165]

[0166] P0 increases monotonically with A. This indicates that the larger the image data size and the higher the real-time performance (the higher the A), the more suitable hybrid coupling is than loose coupling.

[0167] However, this growth is bounded, because the value of A itself has an upper limit, that is:

[0168]

[0169] Differentiate the above equation with N as the main variable and analyze:

[0170]

[0171] This yields the extreme points. Since N > 0, only the positive extreme point, denoted $N_2$, is relevant: $P_0(1024, N)$ is increasing on $(0, N_2)$ and decreasing on $(N_2, +\infty)$, with its maximum at $N_2$. In practice, N is a power of 2, i.e., $N = 2^k$ ($k \in \mathbb{N}^*$). Taking the two neighboring points $N_{t1} = 512$ and $N_{t2} = 1024$ and substituting into $P_0(1024, N)$ gives:

[0172] $P_0(1024, N_{t1}) = 192 \times 1024 = 192\mathrm{K}$

[0173] $P_0(1024, N_{t2}) = 0$

[0174] In this case:

[0175]

[0176] The above equation holds true when A = 1024 and N = 512.
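The two quoted values can be reproduced with a small numerical sketch. The break-even expression used here, $P_0 = N(A^2 - N^2)/(2A)$, is an assumption, since the formula is not reproduced in this text; it is adopted because it matches the stated values at N = 512 and N = 1024. The function name `p0` is illustrative.

```python
def p0(a, n):
    """ASSUMED break-even image size: P0 = N * (A^2 - N^2) / (2A)."""
    return n * (a * a - n * n) / (2 * a)


A_MAX = 1024                                 # upper limit of A used in the text
sizes = [2 ** k for k in range(1, 11)]       # power-of-two block sizes 2 .. 1024
table = {n: p0(A_MAX, n) for n in sizes}
best_n = max(table, key=table.get)           # block size with the largest P0

for n in sizes:
    print(n, table[n])
```

Among power-of-two block sizes, the largest break-even size under this assumed expression occurs at N = 512, in line with the statement that the bound is reached when A = 1024 and N = 512.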

[0177] When the data size exceeds 192K, hybrid coupling no longer has an advantage, and loose coupling solutions consume fewer resources.

[0178] When the data size is less than 192K, there always exist values of N and A that make hybrid coupling more resource-efficient than loose coupling.

[0179] The above analysis provides a general overview of the applicability of the two schemes. In practical applications, the data size (P) and the continuous operation length (A) are usually fixed first, and the internal structure (N) is then defined and designed accordingly. Once the parameters A and P are determined, the resource consumption of hybrid coupling and loose coupling can be calculated and compared.

[0180] For example, for 2K×2K data, that is, when P = 2K and A = 128, substituting into the formula for $R_{\mathrm{mix}}$ gives:

[0181]

[0182] The above expression attains its minimum if and only if the equality condition holds. Substituting the values of A and P into that condition gives N = 64. Therefore, in the hybrid coupling scheme, when the real-time condition A = 128 is met, resource usage is minimized at N = 64. The resource usage of the hybrid coupling scheme is then as follows:

[0183]

[0184] If a loosely coupled approach is adopted, the resource consumption will be as follows:

[0185] $R_{\mathrm{CTM}} = A^2 + 2A = 128^2 + 256 = 16\mathrm{K} + 256$

[0186] It is obvious that, under the same data and real-time conditions, the hybrid coupling scheme has a greater resource advantage.

[0187] The above analysis addresses the applicability of hybrid coupling when $R_{\mathrm{mix}}$ is at its minimum. Figure 8 compares the loosely coupled and hybrid coupled schemes for 2k×2k image data. In fact, for 2k×2k data, hybrid coupling also has advantages over loose coupling when the first operation length A and the matrix transpose structure size N take other values.

[0188] As can be seen, the real-time performance continuously improves with increasing continuous operation length A. Under the same matrix transpose structure size, hybrid coupling uses fewer resources than loose coupling. Therefore, hybrid coupling can ensure high real-time performance while saving a significant amount of storage resources.

[0189] The above analysis shows that under the condition P0 = 2048, as the number of points increases, hybrid coupling requires data caching, causing a rapid increase in resource consumption. Therefore, the applicability of hybrid coupling is somewhat limited. Figure 9 compares the loosely coupled and hybrid coupled schemes for 16k×16k image data. As can be seen from the figure, when P0 = 16384, the resource advantage of the hybrid coupling scheme over the loosely coupled scheme is smaller than when P0 = 2048.

[0190] According to another aspect of the present invention, an FPGA matrix transposition device based on the AXI-Stream protocol is provided. Figure 10 is a schematic diagram of an FPGA matrix transposition device based on the AXIS protocol. As shown in the figure, the device includes:

[0191] Grouping unit 11: Groups the image data in the FPGA buffer according to the grouping principle to obtain the first data;

[0192] First input unit 12: Writes the first data into the first storage space with a continuous operation length using the AXI protocol bus;

[0193] First output unit 13: Using the AXI protocol bus, reads the second data sequentially from the first storage space in a continuous operation length;

[0194] Transpose unit 14: Divides the second data into matrix blocks and transposes the matrix blocks to obtain the third data;

[0195] Second input unit 15: sequentially writes the third data into the second storage space to complete the transposition of the image data;

[0196] Second output unit 16: Using the AXI protocol bus, it reads data from the second storage space in continuous operation length and outputs it line by line to obtain the transposed image data.
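The data flow through the six units can be modeled in software as a rough sketch. This is a behavioral illustration only, under simplifying assumptions: the two storage spaces are plain Python lists, one burst of length A holds M = A/N row slices of N points, and the function name `hybrid_transpose` and the small test sizes are hypothetical. It does not model AXI signaling, DDR row mapping, or timing.

```python
def hybrid_transpose(image, n, a):
    """Behavioral model: group -> write -> read -> block transpose -> write -> read."""
    p = len(image)
    m = a // n                                # rows cached per group, M = A / N
    assert a % n == 0 and p % n == 0 and n % m == 0 and p % m == 0
    blocks_per_row = p // n

    # Grouping unit + first input unit: each burst holds M row slices of N points.
    mem1 = []
    for row0 in range(0, p, m):
        cached = image[row0:row0 + m]
        for bj in range(blocks_per_row):
            burst = []
            for row in cached:
                burst.extend(row[bj * n:(bj + 1) * n])
            mem1.append(burst)

    # First output unit + transpose unit + second input unit.
    mem2 = [[None] * p for _ in range(p)]
    groups_per_block = n // m                 # bursts needed for one N x N block
    for bi in range(blocks_per_row):
        for bj in range(blocks_per_row):
            block = []
            for k in range(groups_per_block):
                group = bi * groups_per_block + k
                burst = mem1[group * blocks_per_row + bj]
                for r in range(m):
                    block.append(burst[r * n:(r + 1) * n])
            transposed = [list(col) for col in zip(*block)]
            for r in range(n):
                # Block (bi, bj) of the input lands at block (bj, bi) of the output.
                mem2[bj * n + r][bi * n:(bi + 1) * n] = transposed[r]

    # Second output unit: read the second storage space row by row.
    return [list(row) for row in mem2]


P, N, A = 16, 4, 8                            # small illustrative sizes
image = [[r * P + c for c in range(P)] for r in range(P)]
result = hybrid_transpose(image, N, A)
print(result[0][:4])
```

With these sizes, M = 2 rows are cached at a time and two bursts are read to assemble each 4×4 block before it is transposed and written to its mirrored block position.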

[0197] The above examples illustrate that hybrid coupling schemes are more suitable for large-scale images than tight coupling schemes, and also show that they require less cache resources than loose coupling schemes.

[0198] Because it combines loosely coupled and tightly coupled structures, this hybrid coupling scheme can degenerate into either loosely coupled or tightly coupled structures. However, the hybrid approach also possesses advantages that loosely coupled or tightly coupled structures lack. Therefore, hybrid coupling can be called a general structure for matrix transpose.

[0199] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. An FPGA matrix transpose method based on the AXI protocol, comprising: acquiring P rows × P columns of image data, and determining, according to the size P of the image data and the first operation length, the size N² of the transpose matrix sub-block and the number M of column data items for a single buffering; storing the M pieces of column data into the FPGA cache to obtain first data, the size of the first data being P × M; retrieving the first N data of the M rows of data from the cache to form a data group whose length is the first operation length, multiple data groups forming a data segment that completely occupies one row of the first storage space; writing, using the AXI protocol bus, the data segment into the first storage space with a second operation length until the P rows × P columns of image data are stored, the second operation length being limited by the AXI protocol bus; reading, using the AXI protocol bus, data from the first storage space by rows with the first operation length until data of size N² has been read from the first data, to obtain second data; dividing the second data into N×N matrix blocks and transposing the matrix blocks to obtain third data; grouping the third data into multiple data groups of the first operation length and writing them into the second storage space for storage, continuing until the transposition of the P rows × P columns of image data is completed; and reading, using the AXI protocol bus, data from the second storage space with the second operation length and outputting it line by line to obtain the transposed image data.

2. The FPGA matrix transposition method of claim 1, wherein the number M of column data items in a single buffering is expressed as M = A / N, where A is the first operation length and N is the length or width of the matrix block.

3. The FPGA matrix transposition method of claim 1, wherein, for the size N² of the transpose matrix sub-block, N is the length or width of the matrix sub-block, and N is expressed mathematically in terms of A and P, where A is the first operation length and P is the size of the image data.

4. The FPGA matrix transposition method of claim 1, wherein, when the image data is an image with a small pixel count, the method includes: acquiring the image data of P rows × P columns; storing M pieces of row data in the FPGA cache, the length of each piece of row data being P; taking the first I data of the row data to form a data segment of length M×I, so that the M pieces of row data are divided into a total of P/I data segments; arranging the data segments in the cache from top to bottom to form a data block; mapping the data block to the first storage space to complete the data mapping of M rows × P columns, and continuing until the data mapping of P rows × P columns is completed; and, using the AXI protocol bus, sequentially reading the data blocks from the first storage space with a second operation length and outputting them until the reading and output of the P rows × P columns of image data are completed.

5. The FPGA matrix transposition method of claim 1, wherein, when the image data is an image with a very large pixel count, the method includes: acquiring the image data of P rows × P columns; storing, using the AXI protocol bus, the image data unit by unit in the first storage space along the row direction; reading N data points at a time from the first storage space with row skipping to obtain an N×N matrix block; transposing the N×N matrix block; storing the transposed matrix blocks in the second storage space with row skipping until the transposition of the P rows × P columns of image data is completed; and, using the AXI protocol bus, sequentially reading the transposed data from the second storage space with a second operation length and outputting it.

6. An FPGA matrix transpose device based on the AXI protocol, comprising: The grouping unit is used to group the image data in the FPGA buffer according to the grouping principle to obtain the first data; The first input unit is used to write the first data into the first storage space with a first operation length via the AXI protocol bus; The first output unit is used to read the second data sequentially from the first storage space with a first operation length using the AXI protocol bus; The transpose unit is used to divide the second data into matrix blocks, and transpose the matrix blocks to obtain the third data; The second input unit is used to sequentially write the third data into the second storage space to complete the transposition of the image data; The second output unit is used to read data from the second storage space with a second operating length via the AXI protocol bus and output it line by line to obtain the transposed image data.