A code generation method, a code generation device, and a storage medium

By generating software code that flexibly specifies operator connections and data flow in the hardware accelerator, and by using a shared cache pool for data interaction between operators, the conflict between application flexibility on the one hand and high performance and low power consumption on the other is resolved for hardware accelerators in radar signal processing, thereby improving processing efficiency.

CN122308799A · Pending · Publication Date: 2026-06-30 · CALTERAH SEMICON TECH (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CALTERAH SEMICON TECH (SHANGHAI) CO LTD
Filing Date: 2025-03-14
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing hardware accelerators present a trade-off between application flexibility and high performance and low power consumption in radar signal processing, making it difficult to meet the needs of complex applications.

Method used

By generating software code, users can flexibly specify the connection relationships and data flow between multiple operators in the hardware accelerator, utilize a shared cache pool to realize data interaction between operators, form a data channel, improve application flexibility and reduce power consumption.

Benefits of technology

It realizes soft connections between operators in hardware accelerators, improves application flexibility and processing efficiency, reduces power consumption, and meets the requirements of high performance and low power consumption.

✦ Generated by Eureka AI based on patent content.


Abstract

A code generation method, a code generation device, and a storage medium are disclosed, belonging to the field of data processing. The code generation method generates software code used by a hardware accelerator and comprises: determining configuration information based on user input, the configuration information including multiple operators participating in the processing and their algorithm parameters, information on the data flow when the multiple operators perform the processing, and information on the caches of the multiple operators; automatically generating software code based on the configuration information; and, when multiple cache allocation instructions in the software code are executed, allocating caches used for data access by the multiple operators from a shared cache pool to form a data channel through which the data flow passes through the multiple operators. The software code generated by this code generation method can run on a hardware accelerator, enabling the hardware accelerator to possess both high performance and application flexibility.

Description

[0001] Cross-reference to related applications

[0002] This disclosure claims priority to Chinese Patent Application No. 202412000007.X, filed with the China National Intellectual Property Administration on December 31, 2024, entitled "A Code Generation Method, Apparatus and Storage Medium", the contents of which are incorporated herein by reference.

Technical Field

[0003] This disclosure relates to, but is not limited to, the field of data processing, and more specifically, to a code generation method, a code generation apparatus, and a storage medium.

Background Technology

[0004] To achieve application flexibility, the industry commonly uses general-purpose processors such as Digital Signal Processors (DSPs) for data processing in radar signal processing. With the development of radar technology, radar signal processing algorithms have become increasingly complex, and radar signal processor operators have become increasingly diverse. Multi-channel, high-precision radar applications have created a demand for high-performance, low-power processing of echo signals. Mainstream DSP solutions have accelerated the shift towards radar hardware accelerators. However, the processing flow of each operator in a hardware accelerator is simple, lacks flexibility, and fails to fully realize its performance advantages.

Summary of the Invention

[0005] One embodiment of this disclosure provides a code generation method for generating software code used by a hardware accelerator. The hardware accelerator includes a hardware scheduler that executes the software code, a shared cache pool, and various types of operators. The code generation method includes:

[0006] Display an interactive interface for generating software code, and receive user input through the interactive interface;

[0007] Configuration information is determined based on user input. The configuration information includes: multiple operators involved in the processing and their algorithm parameters, information about the data flow when the multiple operators perform processing, and information about the cache of the multiple operators; the multiple operators include at least two types of operators.

[0008] Software code is automatically generated based on the configuration information. The software code includes multiple cache allocation instructions. When the multiple cache allocation instructions are executed, caches used for accessing data are allocated from the shared cache pool for the multiple operators respectively, so as to form a data channel through which the data flow passes through the multiple operators.
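The steps in [0006] to [0008] can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `OperatorConfig` layout, the `ALLOC` / `CFG` / `START` mnemonics, and the `offset:size` notation are all hypothetical, and user input is assumed to have already been collected into a list of operator configurations. The sketch chains operators by reusing each operator's output cache as the next operator's input cache.

```python
from dataclasses import dataclass, field


@dataclass
class OperatorConfig:
    name: str                                   # operator type, e.g. "FFT"
    params: dict = field(default_factory=dict)  # algorithm parameters
    buf_size: int = 0                           # size requested for the output cache


def fmt(buf):
    """Render a cache as 'offset:size', or '-' when no cache is allocated."""
    return "-" if buf is None else f"{buf[0]}:{buf[1]}"


def generate_code(operators):
    """Emit hypothetical cache allocation, configuration and start instructions."""
    if len({op.name for op in operators}) < 2:
        raise ValueError("at least two types of operators are required")
    code, offset, prev_out = [], 0, None
    for index, op in enumerate(operators):
        last = index == len(operators) - 1
        # Every operator except the last gets an output cache from the pool.
        out = None if last else (offset, op.buf_size)
        if out is not None:
            offset += op.buf_size
        # The previous operator's output cache becomes this operator's input.
        code.append(f"ALLOC {op.name} IN={fmt(prev_out)} OUT={fmt(out)}")
        for key, value in sorted(op.params.items()):
            code.append(f"CFG {op.name} {key}={value}")
        prev_out = out
    code.extend(f"START {op.name}" for op in operators)
    return code


program = generate_code([
    OperatorConfig("DMAI", buf_size=1024),
    OperatorConfig("FFT", {"points": 256}, buf_size=2048),
    OperatorConfig("DMAO"),
])
print("\n".join(program))
```

Changing the order of the list, or adding and removing entries, changes the generated allocations and therefore the data channel, without any change to the hardware.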

[0009] An embodiment of this disclosure also provides a code generation apparatus, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to execute the code generation method described in any embodiment of this disclosure.

[0010] An embodiment of this disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can implement the code generation method described in any embodiment of this disclosure.

[0011] Based on user input, the embodiments of this disclosure can determine configuration information, including the multiple operators participating in the processing and their algorithm parameters, data flow information during the processing of the multiple operators, and cache information of the multiple operators. Software code is then automatically generated based on the configuration information. Multiple cache allocation instructions in the software code can allocate the caches that each operator uses for data access, forming a data channel through which the data flow passes. Therefore, based on this embodiment, users can specify multiple operators participating in the processing through input operations, and can also customize the required data flow by changing the connection relationship and data flow direction between the multiple operators. The software code generated based on this embodiment can realize soft connections between operators in the hardware accelerator, improving application flexibility. Data interaction between operators is achieved through a shared cache pool in the hardware accelerator, eliminating the need to transfer data between the inside and outside of the accelerator, achieving high performance and low power consumption, and resolving the conflict between application flexibility and high performance / low power consumption in hardware accelerators.

[0012] Other features and advantages of this disclosure will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the disclosure. Other advantages of this disclosure may be realized and obtained by means of the embodiments described in the description, claims, and drawings.

Attached Figure Description

[0013] The accompanying drawings are provided to illustrate the technical solutions of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure. The shapes and sizes of the components in the drawings do not reflect actual proportions and are only intended to illustrate the content of this disclosure.

[0014] Figure 1 is a schematic diagram of an exemplary radar system according to one embodiment;

[0015] Figure 2 is a structural diagram of an exemplary radar system according to one embodiment;

[0016] Figure 3 is a schematic diagram of electromagnetic waves emitted by a radar system in one embodiment;

[0017] Figure 4 is a schematic diagram of the digital signal processing flow of a radar system in one embodiment;

[0018] Figure 5 is a schematic diagram of a hardware accelerator of a radar system in one embodiment, which supports multiple hard-connected operators;

[0019] Figure 6A and Figure 6B are schematic diagrams of hardware accelerators of radar systems in two other embodiments, each supporting a single operator;

[0020] Figure 7 is a schematic structural diagram of a computing system including a hardware accelerator according to an embodiment of the present disclosure;

[0021] Figure 8A, Figure 8B, Figure 8C and Figure 8D are schematic diagrams of the specified multiple operators and the shared cache pool in four hardware accelerators according to embodiments of this disclosure;

[0022] Figure 9A is a schematic diagram of the caches used by the multiple operators in Figure 8B;

[0023] Figure 9B is a schematic diagram of data flow loopback according to an embodiment of this disclosure;

[0024] Figure 10 is a flowchart of a multi-operator control method according to an embodiment of the present disclosure;

[0025] Figure 11A and Figure 11B are schematic diagrams of two startup and synchronization control methods, over multiple loop iterations, for a hardware accelerator including three operators according to an embodiment of the present disclosure;

[0026] Figure 12 is a schematic diagram of a hardware accelerator according to an embodiment of the present disclosure;

[0027] Figure 13 is a schematic diagram of an integrated circuit according to an embodiment of the present disclosure;

[0028] Figure 14 is a schematic diagram of a device provided with the integrated circuit of Figure 13 according to an embodiment of the present disclosure;

[0029] Figure 15 is a schematic diagram of the data flow between multiple operators and the time-step switching buffer according to an embodiment of this disclosure;

[0030] Figure 16 is a schematic diagram of a control device according to an embodiment of the present disclosure;

[0031] Figure 17 is a schematic diagram of a hardware scheduler according to an embodiment of the present disclosure;

[0032] Figure 18 is a schematic diagram of the structure of an operator according to an embodiment of this disclosure;

[0033] Figure 19 is a flowchart of a multi-operator control method according to another embodiment of this disclosure;

[0034] Figure 20 is a flowchart of a code generation method according to an embodiment of this disclosure;

[0035] Figure 21 is a block diagram of a software development platform according to an embodiment of the present disclosure;

[0036] Figure 22 is a schematic diagram of a code generation apparatus according to an embodiment of the present disclosure.

Detailed Implementation

[0037] This disclosure describes several embodiments, but these descriptions are exemplary and not restrictive, and it will be apparent to those skilled in the art that more embodiments and implementations are possible within the scope of the embodiments described herein.

[0038] In the description of this disclosure, words such as "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment described as "exemplary" or "for example" in this disclosure should not be construed as being more preferred or advantageous than other embodiments. The word "and / or" in this document describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. "Multiple" refers to two or more. Furthermore, to facilitate a clear description of the technical solutions of the embodiments of this disclosure, the terms "first" and "second" are used to distinguish identical or similar items with substantially the same function and effect. Those skilled in the art will understand that the terms "first" and "second" do not limit the quantity or execution order, and that "first" and "second" do not necessarily imply differences.

[0039] In describing representative exemplary embodiments, the specification may have presented methods and / or processes as a specific sequence of steps. However, the method or process should not be limited to the specific order of steps described herein, as other sequences of steps are possible, to the extent that the method or process does not depend on the specific order of steps described herein. Therefore, the specific order of steps set forth in the specification should not be construed as a limitation on the scope of protection. Furthermore, the scope of protection for the method and / or process should not be limited to the steps performed in the written order, and those skilled in the art will readily understand that these orders can be varied while still remaining within the spirit and scope of the embodiments disclosed herein.

[0040] The main function of radar digital signal processing is to perform fast Fourier transform on millimeter-wave echo signals, convert the time-domain signal into a frequency-domain signal, obtain the energy distribution of the spectrum, and then remove and suppress noise according to the algorithm to complete the target identification and obtain information such as the target's distance, speed and angle.

[0041] Figure 1 illustrates a radar system, taking a frequency-modulated continuous wave (FMCW) millimeter-wave radar as an example. This radar system includes an RF chip 110, a transmitting antenna 120, a receiving antenna 130, and a main processing chip 140. The RF chip 110 is configured to generate a detection signal and transmit it through the transmitting antenna array 120. The transmitting antenna 120 is connected to the RF chip 110, forming a transmitting antenna array to transmit the detection signal, such as an FMCW electromagnetic wave signal. Multiple receiving antennas 130 are configured to receive the echo signal formed by the detection signal reflected from the detected object (also called the target object, or simply the target). The main processing chip 140 is connected to the receiving antenna array 130 and the RF chip 110, and is configured to process the echo signal obtained by the receiving antenna array 130 to obtain information such as the target's distance, velocity, and angle. In one example, the transmitting antenna 121 and the receiving antenna 131 can be integrated with the RF chip 110 to form an RF transceiver chip, which, together with the main processing chip 140, constitutes a radar signal transceiver processing system (referred to as a radar system). In another example, the RF chip 110 and the main processing chip 140 are integrated into a single system-on-a-chip (SoC), enabling the transmission, reception, and processing of RF signals with a single chip. The transmitting antenna 120 and the receiving antenna 130 can also be integrated with this SoC chip to form a structure similar to an antenna-in-package (AiP) chip.

[0042] The RF chip 110 and the main processing chip 140, whether integrated as one unit or provided separately, can constitute an integrated circuit in a radar system. The integrated circuit, the transmitting antenna, and the receiving antenna can be mounted on a carrier and, together with the carrier, form an electromagnetic wave device. The electromagnetic wave device can be installed on equipment such as a car.

[0043] The radar system shown in Figure 2 includes a transmitting antenna 11, a power amplifier 21, a signal generator 23, a receiving antenna 13, a low-noise amplifier 31, a mixer 33, an analog-to-digital converter (ADC) 41, and a digital signal processing module 51. The signal generator 23 can be a millimeter-wave generator implemented with an oscillator. The detection signal generated by the signal generator 23 is amplified by the power amplifier 21 and then transmitted through one or more transmitting antennas 11. The radar system typically transmits a series of chirps in frames. The detection signal transmitted by an FMCW-based radar system can use the sawtooth waveform shown in Figure 3, with multiple chirps in each frame. Each chirp includes an up-modulation band, a down-modulation band, and a frequency hold band. The period of each chirp is Tc.

[0044] The detection signal is reflected and / or refracted by the target to form an echo signal. The echo signal received by the receiving antenna 13 is amplified by the low-noise amplifier 31, and then mixed with the corresponding local oscillator signal in the mixer 33 to obtain the intermediate frequency signal. The intermediate frequency signal is sent to the ADC 41 for sampling, and the resulting digital signal is further processed in the digital signal processing module 51. There are usually multiple receiving antennas 13, and the radar system can have multiple receiving channels. Unless otherwise specified, "channel" in this text refers to a receiving channel.

[0045] Figure 4 illustrates an example of a digital signal processing flow. After sampling by the ADC, the digital signals from one or more channels undergo 1D-FFT processing, that is, data processing in the range dimension such as DC removal (DC), windowing (Win), one-dimensional fast Fourier transform (1D-FFT), and spatially variable apodization (SVA). The resulting 1D-FFT data is buffered and accumulated to a predetermined number of frames, such as 128 frames. Then 2D-FFT processing is performed, that is, data processing in the Doppler dimension (velocity dimension) such as windowing, clutter suppression, and two-dimensional fast Fourier transform (2D-FFT), yielding the range-Doppler two-dimensional spectrum, or RD (Range-Doppler) spectrum, of the digital signal. Target detection can be performed based on the RD spectrum, for example by constant false alarm rate (CFAR) detection, which determines the target's spectral peaks in the range and Doppler dimensions, followed by range, velocity, and direction of arrival (DoA: Direction of Arrival) estimation, etc.
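The range-dimension and Doppler-dimension steps described above can be illustrated with a small software model. This is only a sketch under stated assumptions: a naive DFT stands in for the hardware FFT, a Hann window is chosen arbitrarily, the rows accumulated for the second transform are treated as chirps, and SVA, clutter suppression, and CFAR are omitted. The function names and the test signal are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, standing in for a hardware FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def range_doppler(frame):
    """frame[chirp][sample] -> magnitude spectrum rd[doppler_bin][range_bin]."""
    n_samples = len(frame[0])
    win = hann(n_samples)
    # Range dimension: DC removal, windowing, then a transform per chirp.
    range_spectra = []
    for chirp in frame:
        mean = sum(chirp) / n_samples
        range_spectra.append(dft([(s - mean) * w for s, w in zip(chirp, win)]))
    # Doppler dimension: a second transform across chirps for each range bin.
    columns = [dft([row[r] for row in range_spectra]) for r in range(n_samples)]
    return [[abs(columns[r][d]) for r in range(n_samples)]
            for d in range(len(frame))]


# Synthetic single target placed at range bin 3 and Doppler bin 2.
N_SAMPLES, N_CHIRPS = 16, 8
frame = [[math.cos(2 * math.pi * (3 * s / N_SAMPLES + 2 * c / N_CHIRPS))
          for s in range(N_SAMPLES)] for c in range(N_CHIRPS)]
rd = range_doppler(frame)
# Search only the first half of the range bins (the input is real-valued).
peak = max(((d, r) for d in range(N_CHIRPS) for r in range(N_SAMPLES // 2)),
           key=lambda dr: rd[dr[0]][dr[1]])
print(peak)  # (2, 3): Doppler bin 2, range bin 3
```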

[0046] Radar digital signal processing can be implemented, for example, on the main processing chip shown in Figure 1. Depending on the application requirements for performance, cost, and application flexibility, different engineering implementations exist. Running these processes in software on the CPU cannot meet the high real-time requirements of some applications. In these scenarios, hardware accelerators can be placed on the chip to perform these processes and improve computational efficiency. These chips can be further divided into two types:

[0047] One type of hardware acceleration solution provides customized acceleration for a specific application, as shown in Figure 5. The hardware accelerator includes multiple operators that perform the following processes: DC removal, which removes the DC component of the signal; windowing, which applies a window function to the signal to reduce the spectral leakage caused by signal truncation; FFT, which performs a fast Fourier transform on the intermediate frequency signal data after analog-to-digital conversion, converting the time-domain signal into a frequency-domain representation; and SVA, which applies different weighting to pixels at different spatial locations to effectively suppress sidelobes. These operators are connected serially, processing the data to be processed from the Radar Cube in sequence, and the processing results are stored back to the Radar Cube. The operators can be controlled through configuration registers. This scheme uses dedicated ASIC circuits, which have strong processing capabilities and maximize chip area utilization for specific applications. However, the operators are hard-connected, and the interaction direction and processing order between the operators are fixed. Firmware development methods are limited, and it is impossible to adapt to new application requirements through software development or firmware iteration. If complex application designs are to be supported, they need to be determined in the early stages of chip design. After the chip design is completed, it can only support relatively fixed data processing flows and is difficult to adapt to new application requirements.

[0048] These solutions all present a trade-off between application flexibility and high performance and low power consumption, and a solution is needed to resolve this conflict.

[0049] Another hardware acceleration solution, to adapt to different applications, uses software instructions to achieve data interaction between hardware accelerators, as shown in Figure 6A and Figure 6B. The input data to be processed by the hardware accelerator is loaded from the Radar Cube via a Direct Memory Access (DMA) controller, represented as DMAI in the figures. The processing result of the hardware accelerator, i.e., the output data, is stored in the Radar Cube via another DMA controller, represented as DMAO in the figures. The internal cache capacity of the hardware accelerator is small, and the input data is processed by a single operator (such as DC in Figure 6A or FFT in Figure 6B). The hardware accelerator of this scheme can be implemented based on a digital signal processor (DSP) and algorithm logic. Different processing flows required by different applications can be achieved by changing the data interaction between multiple hardware accelerators. The data of each operator can interact with the CPU or upper-layer applications, which is flexible. However, this requires the DSP / CPU to have strong computing power, and there needs to be a high-speed data communication interface between the upper-layer application and the DSP. The cost is high, and the data needs to be frequently moved between the hardware accelerator and the Radar Cube, resulting in low utilization efficiency of the hardware accelerator, high power consumption, a long development cycle, and difficulty in meeting high-performance and real-time processing requirements.

[0050] Although the above example uses a radar system, other applications that require hardware acceleration also need to effectively resolve the conflict between the application flexibility of hardware accelerators and high performance and low power consumption.

[0051] To address this, this disclosure proposes a novel hardware-software interaction method based on software code (such as radar-specific microinstructions). Users can develop different software code for different applications according to their needs, customizing the connection relationships and data flows between multiple operators in the hardware accelerator. During actual operation, both the software code and data flow can be processed by hardware circuits such as Application-Specific Integrated Circuits (ASICs). This solves the problem of limited development flexibility while also offering high performance.

[0052] To this end, one embodiment of this disclosure provides a hardware accelerator, including a hardware scheduler, a plurality of designated operators, a shared cache pool providing cache space for the plurality of operators, and a register group, as shown in Figure 7, where:

[0053] The hardware scheduler is configured to allocate caches from the shared cache pool to the plurality of operators based on a set connection order to form a data interaction channel between the plurality of operators, save the information of the allocated caches to the register group; and control the plurality of operators to start sequentially according to the set connection order, complete the processing of the loaded data to be processed in a pipeline manner, and store the processed data in external memory.

[0054] The register group is configured to store configuration information for the plurality of operators, the configuration information including information on the cache allocated to the operators;

[0055] The plurality of operators are configured to, upon startup, obtain information about the cache allocated to the operator from the register group; if an input cache is allocated, read the data block to be processed from the input cache and process it; if an output cache is allocated, write the processed data into the output cache.
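The division of roles in [0053] to [0055] can be modelled in a few lines. This is a hedged behavioural sketch, not the hardware design: the class names, the fixed-slot pool, and the toy operators `DC`, `SCALE`, and `ABS` are assumptions made for illustration, and the model processes one data block sequentially instead of pipelining several blocks.

```python
class SharedCachePool:
    """Toy pool that hands out cache slots identified by an index."""

    def __init__(self, n_caches):
        self.caches = [None] * n_caches
        self.free = list(range(n_caches))

    def allocate(self):
        if not self.free:
            raise MemoryError("shared cache pool exhausted")
        return self.free.pop(0)


class Operator:
    def __init__(self, name, func):
        self.name, self.func = name, func

    def start(self, registers, pool, external_in=None):
        regs = registers[self.name]  # cache info written by the scheduler
        # Read from the input cache if one was allocated, else take external data.
        data = pool.caches[regs["in"]] if regs["in"] is not None else external_in
        result = self.func(data)
        if regs["out"] is not None:
            pool.caches[regs["out"]] = result  # hand over through the pool
            return None
        return result  # last operator: result goes to external memory


def run(operators, pool, data):
    """Scheduler: allocate caches in connection order, then start operators."""
    registers, prev_out = {}, None
    for i, op in enumerate(operators):
        out = pool.allocate() if i < len(operators) - 1 else None
        registers[op.name] = {"in": prev_out, "out": out}
        prev_out = out
    result = None
    for op in operators:
        result = op.start(registers, pool, external_in=data)
    return result, registers


ops = [
    Operator("DC", lambda x: [v - sum(x) / len(x) for v in x]),
    Operator("SCALE", lambda x: [2 * v for v in x]),
    Operator("ABS", lambda x: [abs(v) for v in x]),
]
result, registers = run(ops, SharedCachePool(4), [1.0, 2.0, 3.0])
print(result)     # [2.0, 0.0, 2.0]
print(registers)
```

Intermediate results stay in the pool; only the final result leaves the accelerator model, which mirrors the stated goal of avoiding repeated transfers to external memory.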

[0056] The multiple operators in this embodiment are those specified by the software code to process the data to be processed. They can be all or some of the operators provided in the hardware accelerator. Operators herein can also be referred to as computing engines, computing units, etc.

[0057] Based on the hardware accelerator of this disclosure, users can flexibly specify multiple operators to implement computing tasks according to different usage scenarios. Not only can parameters be set for the specified multiple operators, but the connection relationships between multiple operators and the data flow direction during operation can also be configured to adapt to the requirements of different application scenarios for the required operators and processing flows. Furthermore, multiple operators can be controlled to process in parallel in a pipelined manner, maximizing the processing power per unit chip area. This enables systems equipped with this hardware accelerator, such as radar signal processors, to achieve optimal performance in different scenarios.

[0058] The shared cache pool (also written as memory pool) in this embodiment can use storage media such as Static Random-Access Memory (SRAM), Dynamic Random-Access Memory (DRAM), and Phase-Change Memory (PCM). This shared cache pool can be implemented using a memory with multiple independently accessible memory regions, which can be accessed simultaneously by different operators, such as a physical memory comprising multiple banks.

[0059] In one example of this embodiment, multiple register groups can be set up, with each register group corresponding to a specific operator. However, in other examples, the register group can be a set of registers shared by multiple operators, or it can include multiple register groups, with at least some of these register groups being shared by multiple operators.

[0060] The connection between the multiple operators specified in this embodiment is a "soft" connection. Users can generate software code according to the needs of the computing task, and select some or all of the operators set by the hardware accelerator through the software code to execute the computing task (the selected multiple operators are the specified multiple operators). The connection order of the specified multiple operators can be defined by the positional relationship between the caches allocated to the multiple operators in the software code, as well as the order of scheduling instructions sent to the operators. Data interaction between multiple operators is carried out through a shared cache pool, and the start time and synchronization method of different operators can also be defined by the software code.
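One way to picture how cache positions define a soft connection: an operator follows whichever operator writes the cache that it reads. The sketch below recovers the connection order from a table of allocations. It assumes a simple linear chain with one cache per hop; the cache labels `B0` to `B3` and the function name are hypothetical, and loopback flows are rejected instead of modelled.

```python
def connection_order(allocations):
    """Return operators in data-flow order.

    allocations maps operator name -> (input_cache, output_cache),
    with None where no cache is allocated.
    """
    by_input = {inp: op for op, (inp, _) in allocations.items() if inp is not None}
    heads = [op for op, (inp, _) in allocations.items() if inp is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one operator without an input cache")
    order = [heads[0]]
    while True:
        out = allocations[order[-1]][1]
        nxt = by_input.get(out) if out is not None else None
        if nxt is None:
            return order
        if nxt in order:
            raise ValueError("loopback is not handled by this sketch")
        order.append(nxt)


# Same operators, listed in arbitrary order: the caches alone fix the chain.
full = connection_order({
    "SVA": ("B2", "B3"),
    "DMAI": (None, "B0"),
    "FFT": ("B1", "B2"),
    "DC": ("B0", "B1"),
    "DMAO": ("B3", None),
})
# Re-pointing DMAO's input cache at B2 drops SVA from the chain.
short = connection_order({
    "DMAI": (None, "B0"),
    "DC": ("B0", "B1"),
    "FFT": ("B1", "B2"),
    "DMAO": ("B2", None),
})
print(full)   # ['DMAI', 'DC', 'FFT', 'SVA', 'DMAO']
print(short)  # ['DMAI', 'DC', 'FFT', 'DMAO']
```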

[0061] Figure 7 shows some operators used when this embodiment is applied to a radar system, such as FFT, SVA, CFAR, and DoA, which are merely exemplary. These operators can be implemented in multiple hardware accelerators. The DMA controller shown in the figure can also be used as an operator, namely a DMA operator. The DMA operator can load data under the control of the hardware scheduler, and can also store processed data in memory. This embodiment can support a variety of operators, including but not limited to dedicated hardware operators defined by chip manufacturers, such as FFT operators, CMB operators, etc. Dedicated hardware operators include logic circuits that implement certain calculations (such as complex number multiplication and addition operations), and may also include other auxiliary circuits such as multiplexers, internal lookup tables, internal memory, filters, etc. In other embodiments, it can also be extended to general-purpose operators, and the hardware scheduler can schedule processors such as DSPs, CPUs, DPUs, APUs, etc., as operators.

[0062] In this embodiment, how buffers are allocated to an operator and how the operator is scheduled depend on the characteristics of the operator. Some operators, such as FFT and SVA, process the input data and produce processed data that differs from the input data. These operators can be allocated both an input buffer and an output buffer. Other operators, such as the Digital Front End (DFE) operator, pass the input data straight through while performing statistical analysis on it. These operators can be allocated only an output buffer. Still other operators are used for data movement, such as the DMAO operator, which moves the processed data it receives to memory. These operators can be allocated only an input buffer or only an output buffer. The operators in this embodiment are not limited to the types listed here.
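The allocation rules above can be written down as a small lookup. This is an illustrative sketch only: the `Kind` names are hypothetical, and the split of data-movement operators into a loading kind with only an output buffer and a storing kind with only an input buffer is an assumption inferred from the described roles of DMAI and DMAO.

```python
from enum import Enum


class Kind(Enum):
    TRANSFORM = "transform"        # e.g. FFT, SVA: output differs from input
    PASS_THROUGH = "pass_through"  # e.g. DFE: forwards data, gathers statistics
    LOAD = "load"                  # e.g. DMAI: brings external data in (assumed)
    STORE = "store"                # e.g. DMAO: moves processed data out (assumed)


def buffers_needed(kind):
    """Return (needs_input_buffer, needs_output_buffer) for an operator kind."""
    return {
        Kind.TRANSFORM: (True, True),
        Kind.PASS_THROUGH: (False, True),
        Kind.LOAD: (False, True),
        Kind.STORE: (True, False),
    }[kind]


for kind in Kind:
    print(kind.name, buffers_needed(kind))
```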

[0063] The hardware scheduler in this embodiment can also be represented as SEQ (sequencer), and can be implemented using an ASIC, a Field Programmable Gate Array (FPGA), a microcontroller unit (MCU), etc. This hardware scheduler can be provided with an internal cache to load the software code generated for the current computing task. By executing instructions in the software code, it performs operations such as allocating caches for operators and scheduling operators. Operator scheduling can be implemented based on scheduling instructions. However, this does not mean that the hardware scheduler must send scheduling instructions to every operator. For example, it may not send scheduling instructions to the DFE operator that passes data through. When the DFE operator detects data input, it can automatically transmit the input data to the output cache and perform statistics. The sequential startup of multiple operators described herein does not mean that only one operator can be started at a time step; it also covers the case where a DFE operator and another operator start at the same time step.

[0064] Radar signal processing can be divided into multiple stages, such as 1D-FFT, 2D-FFT, and CFAR, each of which can be accelerated by its corresponding hardware accelerator. Each stage's hardware accelerator can be pre-configured with the operators required for that stage. The function of an operator is determined by how the processing flow within that stage is divided. The granularity of an operator is decided not simply by its size but by the combinations it must support. If the parts of an operator never need to be recombined in different ways, then even a large operator (e.g., high computational load, numerous processing steps, large circuit size) does not need further subdivision. Conversely, if the parts of an operator need to be combined in multiple ways, it can be further subdivided into more operators. The hardware accelerator can be configured with operators required for various applications. When executing a computational task for a specific application, some operators are enabled via software code, while others can be disabled.

[0065] In an exemplary embodiment of this disclosure, the specified plurality of operators are operators in the 1D-FFT processing stage, including, in sequence, the DFE operator, the Chirp Quality Monitor Detection (CQMD) operator, the Fast Fourier Transform (FFT) operator, and the spatially variable apodization (SVA) operator, as shown in Figure 8A, where the DFE operator is optional.

[0066] In another embodiment, the specified plurality of operators are operators of the 2D-FFT processing stage, including, in sequence, the DC removal (DC: Direct Current) operator, the FFT operator, and the SVA operator. Figure 8B shows an example of this embodiment, which includes two SVA operators, denoted SVA1 and SVA2; other examples may include only one SVA operator. In addition to the above operators, the example also includes a DMA operator (denoted the DMAI operator) for loading external data to be processed into the hardware accelerator and a DMA operator (denoted the DMAO operator) for storing data processed by the hardware accelerator into external memory.

[0067] In another embodiment, the specified plurality of operators are operators of the CFAR processing stage, including, in sequence, the Controller of Combination (CMB) operator, the Statistic (STAS) operator, the Histogram (HIST) operator, the CFAR operator, and the STAS operator. Figure 8C shows an example of this embodiment. In addition to the above operators, the figure also includes two DMA operators, denoted the DMAI operator and the DMAO operator respectively.

[0068] The processing performed by the hardware accelerator is not necessarily divided according to the above-described 1D-FFT, 2D-FFT, and CFAR stages. In another example of this embodiment, the specified plurality of operators include, in sequence, the DC operator, the FFT operator, the SVA operator, and the CMB operator, as shown in Figure 8D.

[0069] Depending on the application requirements, other operators can be inserted between the multiple operators in the above example.

[0070] In the exemplary hardware accelerator described above, the cache for each operator can be allocated independently. This can be achieved through cache allocation instructions in the software code. The hardware scheduler parses the cache allocation instruction for an operator and writes the allocated cache information into the register group corresponding to that operator. The operator can read data from the input cache based on the information read from the corresponding register group, such as the location and size of the input and output caches, process the data, and write it to the output cache.
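The mechanism in this paragraph can be illustrated with a minimal software sketch. It is not the disclosed hardware. The names `CacheInfo`, `RegisterGroup`, `Scheduler.execute_alloc`, and `run_operator` are hypothetical, and modeling the shared cache pool as a byte array addressed by offset and size is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheInfo:
    offset: int  # assumed: start position of the cache inside the shared pool
    size: int    # assumed: length of the cache in bytes


@dataclass
class RegisterGroup:
    input_cache: Optional[CacheInfo] = None
    output_cache: Optional[CacheInfo] = None


class Scheduler:
    """Software stand-in for the hardware scheduler (hypothetical model)."""

    def __init__(self, pool_size: int, operator_names):
        self.pool = bytearray(pool_size)  # shared cache pool
        self.registers: Dict[str, RegisterGroup] = {
            name: RegisterGroup() for name in operator_names
        }

    def execute_alloc(self, operator: str, input_cache, output_cache) -> None:
        # Stand-in for parsing one cache allocation instruction: the
        # allocated cache information is written to the operator's registers.
        regs = self.registers[operator]
        regs.input_cache = input_cache
        regs.output_cache = output_cache


def run_operator(sched: Scheduler, name: str,
                 process: Callable[[bytes], bytes]) -> None:
    # The operator learns where its caches are from its register group.
    regs = sched.registers[name]
    src, dst = regs.input_cache, regs.output_cache
    data = bytes(sched.pool[src.offset:src.offset + src.size])
    result = process(data)
    assert len(result) <= dst.size, "result does not fit the output cache"
    sched.pool[dst.offset:dst.offset + len(result)] = result


sched = Scheduler(pool_size=16, operator_names=["FFT"])
sched.pool[0:4] = bytes([1, 2, 3, 4])
sched.execute_alloc("FFT", CacheInfo(0, 4), CacheInfo(8, 4))
run_operator(sched, "FFT", lambda block: bytes(2 * v for v in block))
print(list(sched.pool[8:12]))
```

The doubling function only stands in for an operator's processing; the point of the sketch is that the operator itself holds no fixed buffer addresses and takes them from the register group at run time.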

[0071] Taking Figure 8B as an example, the data interaction between the operators is explained further. In the figure, the DC operator, FFT operator, SVA1 operator, and SVA2 operator are connected sequentially. Each of these operators reads input data from its own input buffer, processes it, and saves the processed data to its own output buffer, where it serves as the input data for the next operator. The detailed data flow is shown in Figure 9A. Taking the processing and flow of one data block of the data to be processed as an example, the DMAI operator loads the data block from an external source (such as memory) into Cache0 or Cache1 in the shared cache pool. Cache0 and Cache1 are the two output caches allocated for the DMAI operator and also the two input caches allocated for the DC operator. An operator switches the cache it uses between adjacent time steps. The DC operator reads the data block from Cache0 or Cache1 for processing and writes the processed data block to Cache2 or Cache3. Cache2 and Cache3 are the two output caches allocated for the DC operator and the two input caches allocated for the FFT operator. The FFT operator reads the data block from Cache2 or Cache3 for processing and writes the processed data block to Cache4 or Cache5. Cache4 and Cache5 are the two output caches allocated for the FFT operator and the two input caches allocated for the SVA1 operator. The subsequent processing by the SVA1 and SVA2 operators is similar. After the SVA2 operator writes the processed data block to Cache8 or Cache9, the DMAO operator stores the processed data block externally (e.g., in memory), which completes the processing of the data block by the hardware accelerator. Cache0 to Cache9 are all caches allocated from the shared cache pool. During operation, data blocks to be processed are continuously loaded into the hardware accelerator, which processes multiple data blocks in parallel in a pipelined manner until all the data has been processed.
In this embodiment, a data block can be the data of one chirp or the data of multiple chirps at one range gate.
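As a rough illustration of how such a data channel could be derived from the connection order, the following sketch assigns two caches to each link of the Figure 8B chain. The function name `build_channel`, the dictionary layout, and the sequential numbering of caches are assumptions for illustration; the disclosure defines the allocation through cache allocation instructions in the software code, not through this helper.

```python
def build_channel(operators, caches_per_link=2):
    """Assign caches so each operator's outputs are the next one's inputs.

    Returns {operator: {"in": [...], "out": [...]}}. The first operator has
    no input cache (it loads from outside) and the last has no output cache
    (it stores to outside), matching the DMAI/DMAO roles in the text.
    """
    plan = {}
    next_id = 0
    prev_out = []
    for position, name in enumerate(operators):
        is_last = position == len(operators) - 1
        out = []
        if not is_last:
            out = [f"Cache{next_id + k}" for k in range(caches_per_link)]
            next_id += caches_per_link
        plan[name] = {"in": list(prev_out), "out": out}
        prev_out = out
    return plan


chain = ["DMAI", "DC", "FFT", "SVA1", "SVA2", "DMAO"]
for name, caches in build_channel(chain).items():
    print(name, caches)
```

With the six operators above, the sketch uses Cache0 through Cache9, and the caches written by one operator are exactly the caches read by the next.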

[0072] In another embodiment, the hardware accelerator may not include a DMA operator. Instead, the processor loads the data to be processed into the shared cache pool and stores the data processed by all operators from the shared cache pool to external storage, as shown in Figure 8D. For example, after the hardware scheduler completes processing a batch of data blocks, it can notify the processor to retrieve the data from the output cache (e.g., Cache9) of the last operator. Upon receiving the notification, the processor configures the DMA controller to move the data from Cache9 to memory. In this case, the DMA controller is not an operator controlled by the hardware scheduler. It is easy to understand that in other embodiments the hardware accelerator may have only a DMAI operator or only a DMAO operator, as shown in Figure 8A. In Figure 8A, external data to be processed can be loaded directly into the input of the DFE operator and passed through to the output buffer of the DFE operator, so there is no need to allocate an input buffer for the DFE operator.

[0073] In addition to the sequential connection of multiple operators, the hardware accelerator of this disclosure also supports other connection methods. As shown in Figure 9B, operator C reads the data processed by operator B only after operator B has processed the input data multiple times; that is, the data stream can loop back. When an operator needs to process a data block multiple times, for example when operator B needs to process the data block output by operator A twice in succession, the software code can be designed so that the hardware scheduler schedules operators and reallocates buffers as follows to loop the data stream back. After operator B completes its first processing of the data block at time step N, the output buffer of operator B at time step N is reallocated as the input buffer of operator B at time step N+1, and the input buffer of operator B at time step N is reallocated as the output buffer of operator B at time step N+1. At time step N+1, operator B processes the data block again. Before the start of time step N+2, the output buffer of operator B at time step N+1 is allocated as the input buffer of operator C at time step N+2. In this way, at time step N+2, operator C can read the data block from its allocated input buffer for subsequent processing, which realizes one loop-back of the data stream at operator B, where N is a positive integer.
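The buffer swap described above can be sketched in a few lines. This is a software analogy under stated assumptions: the shared pool is modeled as a dictionary keyed by hypothetical buffer names, `loop_back` is a hypothetical helper, and one loop iteration stands for one time step of operator B.

```python
def loop_back(pool, in_buf, out_buf, process, passes):
    """Run one operator `passes` times on the same data block.

    After each pass the input and output buffers are exchanged, so the
    previous output becomes the next input. Returns the name of the buffer
    holding the final result, which would then be allocated as the input
    buffer of the following operator.
    """
    src, dst = in_buf, out_buf
    for _ in range(passes):
        pool[dst] = process(pool[src])
        src, dst = dst, src  # reallocate: output -> input, input -> output
    return src


pool = {"X": [1, 2, 3], "Y": None}  # "X" holds the block written by operator A
final = loop_back(pool, "X", "Y", lambda block: [2 * v for v in block], passes=2)
print(final, pool[final])
```

With two passes, the final result lands in the buffer that served as operator B's input in the first pass, which matches the allocation to operator C described in the paragraph.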

[0074] By designing the software code, the operators at which to loop back and the number of loops can be specified, so that the data stream loops back at the specified operators a specified number of times. The loop-back in this embodiment can be between the input and output of a single operator, or between the input and output of a computational unit composed of multiple operators. When multiple operators of the same type exist, for example when there are two operators B in Figure 9B, the software code can connect the four operators in the order of operator A, operator B, operator B, and operator C, which has the same effect as looping back once at operator B (performing the processing twice).

[0075] Besides changing the connection relationships between operators, different operators can be specified to implement different applications by generating corresponding software code. For example, if the current application's computation task does not require operator A, operator A can be omitted and only operators B and C specified to participate in the computation task. During operation, caches are allocated only for operators B and C, and only these two operators are scheduled. In this case, the data flow of the hardware accelerator is shown by the dashed line in Figure 9B.

[0076] The hardware accelerator solution of this disclosure can decompose the radar signal processing flow into multiple computational operations, each implemented by multiple operators. All operators are connected to a shared memory pool, and data interaction between operators is achieved through the memory pool. Furthermore, the memory pool's space allocation and operator scheduling are exposed and defined by software instructions, thus being relatively decoupled from the hardware design. Therefore, the hardware acceleration solution of this disclosure has high performance and application flexibility.

[0077] This disclosure also provides a computing system including a processor, a memory, and a hardware accelerator, as shown in Figure 7, where:

[0078] The processor is configured to load the data to be processed stored in the memory into the hardware accelerator;

[0079] The memory is configured to store data to be processed, and to store data obtained by the hardware accelerator after processing the data to be processed.

[0080] The hardware accelerator is configured to allocate caches to multiple operators based on a set connection order to form a data interaction channel between the multiple operators; and to control the multiple operators to start sequentially according to the set connection order, to complete the processing of the loaded data to be processed in a pipeline manner, and to store the processed data in the memory.

[0081] The computing system of this disclosure can be integrated on a SoC chip. The hardware accelerator can allocate caches to multiple operators based on a set connection order to form data interaction channels between the operators; and control the multiple operators to start sequentially according to the set connection order, completing the processing of the loaded data in a pipeline manner. Therefore, users can generate software code according to computing tasks in different usage scenarios. The hardware scheduler runs the software code, flexibly selecting multiple operators required for the computing task and setting the connection relationships between the multiple operators and the data flow direction when the multiple operators are working, to adapt to the processing flow requirements of the computing task.

[0082] As mentioned above, the radar signal processing flow includes multiple stages, each of which can be accelerated by its respective hardware accelerator. Therefore, the computing system in this embodiment can have multiple hardware accelerators. For example, a first hardware accelerator can be set to accelerate the 1D-FFT stage, a second hardware accelerator to accelerate the 2D-FFT stage, and a third hardware accelerator to accelerate the CFAR stage. More accelerators can also be set to perform other processing such as target recognition and target tracking. During software code development, the parallel use of multiple hardware accelerators can be fully considered to utilize the maximum data processing capacity per unit chip area.

[0083] In a hardware accelerator, one or more operators of the same type can be configured. Multiple hardware accelerators can be started sequentially. For example, after the first hardware accelerator, which is used to accelerate the 1D-FFT stage processing, processes the data output by the ADC, it stores the processed 1D-FFT data in memory, loads it into the second hardware accelerator, which is used to accelerate the 2D-FFT stage processing, and starts the second hardware accelerator. After the second hardware accelerator processes the 1D-FFT data, it stores the obtained data (such as the RD map) in memory, loads it into the third hardware accelerator, which is used to accelerate the CFAR stage processing, and so on, until the final processing result is obtained.

[0084] The processors shown in Figure 7 include a CPU and a DSP, which are merely exemplary. The processor in this embodiment can be a general-purpose processor, such as any one or a combination of a central processing unit (CPU), a DSP, a data processing unit (DPU), and an accelerated processing unit (APU), or another conventional processor; the processor can also be an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of the above devices. That is, the processor in the above embodiments can be any processing device or combination of devices that implements the methods, steps, and logic block diagrams disclosed in the embodiments of this disclosure.

[0085] In this embodiment, the memory can use storage media such as SRAM, DRAM, NAND Flash, or a combination of multiple storage media. The DRAM can be a Double Data Rate (DDR) chip, a High-Bandwidth Memory (HBM), or the like.

[0086] In this embodiment, the hardware accelerator includes multiple operators connected in a set connection order. If an input buffer is allocated for the first operator from a shared buffer pool, the processor can load the data block to be processed in memory into the input buffer of the first operator. However, if the first operator is an operator that passes through the input data (such as a DFE operator), the processor can directly load the data block to be processed in memory into the input of the operator.

[0087] In an exemplary embodiment of this disclosure, the hardware accelerator in the computing system employs the hardware accelerator described in any embodiment of this disclosure. The computing system can be a system-on-a-chip (SoC), which integrates the hardware accelerator, processor, and memory into a single SoC. In one example of this embodiment, the SoC is a millimeter-wave chip or sensor chip in a radar system, but this disclosure is not limited to this. Although the computing system in this embodiment adopts an SoC architecture, in other embodiments, a separate hardware accelerator can also be used. For example, the hardware accelerator can be implemented using the control chip of a CXL device, and the processor can access the hardware accelerator through the CXL interface. The memory can be a DDR chip on the CXL device connected to the control chip.

[0088] One embodiment of this disclosure also provides a multi-operator control method, applied to a hardware accelerator including multiple operators and a hardware scheduler. As shown in Figure 10, the method includes: scheduling the multiple operators sequentially according to a set connection order through time-step scheduling to complete the processing of the data to be processed, wherein the scheduling of each time step includes:

[0089] Step 100: Send a start signal to each activation operator to trigger it to start one round of processing; wherein an activation operator is an operator among the plurality of operators that needs to be started for one round of processing in this time step;

[0090] Step 101: In response to the fact that all the activation operators have completed the current processing, the scheduling of this time step ends. If the processing of the data to be processed has not been completed, the scheduling of the next time step continues.

[0091] The method in this embodiment can be implemented by executing software code generated for the application using a hardware scheduler. The hardware scheduler can decode the startup instruction in the software code, send a startup signal to the activation operator by executing the decoded startup instruction; and decode the synchronization instruction, determine whether all the activation operators have completed the current processing by executing the decoded synchronization instruction, and end the scheduling of the current time step if all the activation operators have completed the current processing. A scheduling operation can be completed by the cooperation of the startup instruction and the synchronization instruction. In this embodiment, a combination of a startup instruction and a synchronization instruction is also referred to as a set of scheduling instructions. As mentioned above, some operators, such as the DFE operator, also perform corresponding calculations, but do not require a control signal to start the processing and have a small computational load. These operators may not be scheduled as activation operators.
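The cooperation of a start instruction and a synchronization instruction can be sketched in software as follows. The sketch is an analogy only: Python threads stand in for hardware operators, `threading.Event` stands in for the end-of-processing signal, and the names `Operator` and `run_time_steps` are hypothetical.

```python
import threading


class Operator:
    """Software stand-in for a hardware operator (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.done = threading.Event()  # models the end-of-processing signal
        self.log = []                  # time steps in which this operator ran

    def start(self, step):
        # Models the start signal: kick off one round of processing.
        self.done.clear()
        threading.Thread(target=self._process, args=(step,)).start()

    def _process(self, step):
        self.log.append(step)  # placeholder for processing one data block
        self.done.set()        # notify the scheduler that processing ended


def run_time_steps(time_steps):
    """time_steps: one list of activation operators per time step."""
    for step, active in enumerate(time_steps, start=1):
        for op in active:      # start instruction
            op.start(step)
        for op in active:      # synchronization instruction
            op.done.wait()     # the time step ends only when all are done


a, b = Operator("A"), Operator("B")
run_time_steps([[a], [a, b], [b]])
print(a.log, b.log)
```

Because the scheduler waits for every activation operator before moving on, an operator never starts a new round while an operator from the previous time step is still running.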

[0092] In the embodiments of the present disclosure, the time step is used as the scheduling time unit. By sending a start signal (such as a pulse signal) to the activation operator, the activation operator is triggered to start a processing once, and the scheduling of this time step ends after all activation operators have completed this processing. The synchronous processing of the activation operator in units of time steps is realized, making the processing of multiple operators orderly, and the control method is simple, which can avoid conflicts such as the subsequent operator reading data while the previous operator has not output the data required by the subsequent operator.

[0093] In an exemplary embodiment of the present disclosure, sequentially starting the multiple operators according to the set connection order through time-step scheduling to complete the processing of the data to be processed includes performing loop control multiple times. In each loop, through the scheduling of multiple time steps, the multiple operators are controlled to process multiple data blocks of the data to be processed in a pipelined manner according to the set connection order. Each activation operator processes one data block in one time step and, after finishing the data block, issues a signal indicating the end of this processing to notify the hardware scheduler. In this embodiment, all the data to be processed is processed through multiple loops, and within each loop the scheduling lets multiple operators process data concurrently in a pipelined manner, which makes the processing efficient.

[0094] For a hardware accelerator, different computing tasks have different processing flows. The loop processing process defined in the embodiments of the present disclosure does not mean that all computing tasks of this hardware accelerator have to use such a process.

[0095] In an example of this embodiment, each loop process includes M time steps, M = m1 + m2 - 1, where m1 is the number of the multiple operators, m2 is the number of data blocks processed in each loop, m1 ≥ 2, and m2 ≥ 2;

[0096] When m < m1, the activation operators in the m-th time step are the 1st to the m-th operators, m = 1, 2,..., M;

[0097] In the example shown, the operators of the hardware accelerator include a Bandwidth Extension (BWE) operator, an FFT operator, and an SVA operator that are connected in sequence. The processing of the data to be processed is completed through multiple loops. The radar system includes 4 receiving channels, and the data to be processed includes the data on the 4 receiving channels. Assuming that the data on each channel includes 128 data blocks, the processing of the data to be processed can be completed through 128 loops. Each loop processes 4 data blocks, one on each of the 4 channels, and the 4 data blocks can be the data corresponding to the same chirp signal.

[0100] Figure 11A shows the pipeline of the 3 operators. Only the processing of each operator at each time step in the first two loops is shown. Each row in the figure represents an operator and each column represents a time step. The number in the small square at the intersection of a row and a column is the serial number of the channel of the data block that the operator of that row processes at the time step of that column. As shown in the figure, only the BWE operator starts at the first time step of the first loop; the BWE operator and the FFT operator start at the second time step; the BWE operator, the FFT operator, and the SVA operator start at the third and fourth time steps; the FFT operator and the SVA operator start at the fifth time step; and the SVA operator starts at the sixth time step, which completes one loop. The start order of the operators and the active operators at each time step are the same in every loop. For a single operator, there are 2 idle time steps between consecutive loops. However, in another embodiment shown in Figure 11B, the operators have no idle time steps between two loops; inactive operators (operators that have not yet started or have already finished processing) occur only in the first two time steps of the first loop and the last two time steps of the last loop.

[0101] Still taking Figure 11A as an example, corresponding to this embodiment, m1 is the number of the multiple operators, which is equal to 3, m2 is the number of data blocks processed in each loop, which is equal to the number of receiving channels, 4. Then M = m1 + m2 - 1 = 6, that is, there are 6 time steps in each loop. In each loop:

[0102] When m < m1, the active operators at the m-th time step are the first to the m-th operators; corresponding to this example, when m < 3, the active operator at the first time step is the first operator, that is, the BWE operator, and the active operators at the second time step are the first operator and the second operator, that is, the BWE operator and the FFT operator;

[0103] When m1 ≤ m ≤ m2, the activation operators at the m-th time step are all the operators among the multiple operators. Corresponding to this example, when 3 ≤ m ≤ 4, all the operators at the 3rd and 4th time steps, namely the BWE operator, the FFT operator, and the SVA operator, are activation operators.

[0104] When m2 < m ≤ M, the activation operators at the m-th time step are the operators from the (m - m2 + 1)-th to the m1-th operator; corresponding to this example, when 4 < m ≤ 6, the activation operators at the 5th time step are the 2nd operator and the 3rd operator, namely the FFT operator and the SVA operator, and the activation operator at the 6th time step is the 3rd operator, namely the SVA operator.

[0105] It can be understood that when m1 and m2 are other values, the determination rule of the activation operator in this embodiment is still satisfied.
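The rule for determining the activation operators can be checked with a short sketch. The function name `activation_operators` is hypothetical. The closed form using `max` and `min` is a compact restatement that coincides with the three cases above whenever m1 ≤ m2 (as in the example, where m1 = 3 and m2 = 4); applying it when m1 > m2 is an assumption that goes beyond the text.

```python
def activation_operators(m, m1, m2):
    """Return the 1-based indices of the operators started at time step m.

    m1: number of operators; m2: number of data blocks per loop.
    One loop has M = m1 + m2 - 1 time steps.
    """
    M = m1 + m2 - 1
    if not 1 <= m <= M:
        raise ValueError(f"time step must be in 1..{M}")
    first = max(1, m - m2 + 1)  # operators before this one have finished
    last = min(m, m1)           # operators after this one have not started
    return list(range(first, last + 1))


names = ["BWE", "FFT", "SVA"]
m1, m2 = len(names), 4
for m in range(1, m1 + m2):
    active = [names[i - 1] for i in activation_operators(m, m1, m2)]
    print(m, active)
```

For the example, the sketch reproduces the schedule described for Figure 11A: one operator at time step 1, two at time step 2, all three at time steps 3 and 4, two at time step 5, and one at time step 6.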

[0106] In this example, 4 pairs of caches can be allocated for each operator. Each pair of caches includes an input cache and an output cache. In each loop, each operator performs 4 processes at 4 time steps. The 4 processes are for data blocks of 4 channels respectively. Each process uses one pair of the 4 pairs of caches allocated, so that the data of the 4 channels are independent of each other. That is, for an operator that needs to independently process the data of each channel, the number of pairs of caches allocated to it can be equal to the number of channels of the radar system. For other operators that do not have such requirements, more than 2 pairs of caches can be allocated. Whether an operator independently processes the data of each channel can be determined according to the requirements of the current application.

[0107] In an exemplary embodiment of the present disclosure, the control method further includes: respectively allocating caches used for accessing data to the multiple operators, and the allocated caches include input caches and / or output caches; wherein, the caches allocated to the multiple operators all come from a shared cache pool, and the multiple caches allocated to the multiple operators for use at the same time step are located in different and independently accessible regions in the shared cache pool, and for adjacent operators, the output cache allocated to the previous operator for use at the current time step is the input cache allocated to the subsequent operator for use at the next time step, so as to form a data interaction channel between adjacent operators. In this application, adjacent operators among the multiple operators can be determined according to the set connection order of the multiple operators. From the perspective of data interaction, when the output data of one operator is the input data of another operator, these two operators are adjacent operators, where the operator with output data is the previous operator, and the operator that takes the output data of the previous operator as input data is the subsequent operator.
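The two allocation conditions in this paragraph can be expressed as a small consistency check. This is an illustrative sketch only: `check_plan` is a hypothetical helper, caches are identified by hypothetical names, and "different and independently accessible regions" is simplified to "different cache names".

```python
def check_plan(plan, order):
    """Check a per-time-step cache plan against the two conditions.

    plan:  list indexed by time step; plan[t] maps an operator name to an
           (input_cache, output_cache) pair, with None for "not allocated".
    order: operator names in the set connection order.
    """
    for t, step in enumerate(plan):
        # Condition 1: caches used in the same time step are all different.
        used = [c for pair in step.values() for c in pair if c is not None]
        if len(used) != len(set(used)):
            return False
        # Condition 2: the output cache of the preceding operator at step t
        # is the input cache of the following operator at step t + 1.
        if t + 1 < len(plan):
            for prev, nxt in zip(order, order[1:]):
                if prev in step and nxt in plan[t + 1]:
                    if step[prev][1] != plan[t + 1][nxt][0]:
                        return False
    return True


good = [
    {"A": (None, "C0")},
    {"A": (None, "C1"), "B": ("C0", "C2")},
    {"B": ("C1", "C3")},
]
bad = [
    {"A": (None, "C0")},
    {"A": (None, "C1"), "B": ("C1", "C2")},  # B reads while A writes C1
    {"B": ("C0", "C3")},
]
print(check_plan(good, ["A", "B"]), check_plan(bad, ["A", "B"]))
```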

[0108] In one example of this embodiment, allocating caches for data access to the plurality of operators includes: before performing the multiple loop control operations, allocating caches for the first processing of each of the plurality of operators, wherein, for adjacent operators among the plurality of operators, the output cache allocated to the preceding operator is the input cache allocated to the following operator. In this example, caches are allocated to specified operators before the start of loop control. The information of the allocated caches may include row information and column information, which can be represented by offset, size, block information, etc. Here, the caches allocated are for the first processing of each operator, not for use in the same time step; therefore, the output cache allocated to the preceding operator can be the input cache allocated to the following operator.

[0109] In one example of this embodiment, after the caches for the first processing of the multiple operators have been allocated before the loops start, allocating caches for the multiple operators to access data further includes: between two adjacent time steps, reallocating caches for the operators that are active in both the previous and the next time step; wherein, for adjacent operators among the multiple operators, the output cache allocated (including by reallocation) to the preceding operator for the next time step is different from the input cache allocated to the following operator for the next time step. Reallocation switches the cache used by an operator among the multiple caches, or multiple pairs of caches, allocated to it. These caches can be the collection of caches allocated at different times during the loops and do not need to be allocated all at once. Because the output cache of the preceding operator and the input cache of the following operator differ in the next time step, the two operators never access the same cache simultaneously, which avoids conflicts. The switching can be achieved by the hardware scheduler performing the cache reallocation operation, or by the operator's hardware logic combined with configuration information (such as the total number of allocated caches or the total number of pairs).

[0110] In one example of this embodiment, for each of the plurality of operators, K groups of buffers are allocated and reallocated for the operator during the multiple loops, where K ≥ 2. Each group includes an input buffer and / or an output buffer. Starting from the first processing of the operator and taking K time steps as one period, the buffers used by the operator in the k-th time step of different periods are the same, and the buffers used in different time steps of the same period are different.
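A minimal sketch of this periodic selection follows, assuming the operator is active in consecutive time steps from its first processing and that the groups are used in a fixed round-robin order. Both the round-robin order and the names `group_index` and `usage_table` are assumptions for illustration; the text only requires that the pattern repeats every K time steps and that the groups within one period differ.

```python
def group_index(time_step, first_step, K):
    """Index (0..K-1) of the buffer group an operator uses at a time step."""
    if K < 2:
        raise ValueError("K must be at least 2")
    if time_step < first_step:
        raise ValueError("the operator has not started yet")
    return (time_step - first_step) % K


def usage_table(first_step, K, periods):
    """One row per period; entry k is the group used in the k-th time step."""
    return [
        [group_index(first_step + p * K + k, first_step, K) for k in range(K)]
        for p in range(periods)
    ]


# An operator that first runs at time step 2 and rotates over K = 3 groups.
for row in usage_table(first_step=2, K=3, periods=2):
    print(row)
```

Every row of the table is identical (same group in the k-th time step of each period), and the entries within a row are pairwise different.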

[0111] In an exemplary embodiment of this disclosure, the data to be processed is a frame of data obtained by the radar system processing the received signal; the plurality of operators include operators performing 1D-FFT stage processing, and one data block in the data to be processed is the data of a chirp signal received on a channel; or, the plurality of operators include operators performing 2D-FFT stage processing, and one data block in the data to be processed is the data at the same range gate in the 1D-FFT data, wherein the 1D-FFT data is the data obtained after the 1D-FFT stage processing, including the data of multiple chirps.

[0112] One embodiment of this disclosure also provides a hardware scheduler. As shown in Figure 12, it includes a memory 50 and a control device 60, wherein the memory stores software code for controlling multiple operators, and the control device is configured to run the software code to execute the multi-operator control method described in any embodiment of this disclosure (including the multi-operator control method of the embodiment corresponding to Figure 10 and the multi-operator control method of the embodiment corresponding to Figure 19). The control device can be any device capable of implementing control functions according to software code.

[0113] One embodiment of this disclosure also provides a hardware accelerator. Referring to Figure 7, it includes a hardware scheduler and a plurality of specified operators, wherein the hardware scheduler adopts the hardware scheduler described in any embodiment of this disclosure.

[0114] In an exemplary embodiment of this disclosure, the hardware accelerator further includes a register group and a shared cache pool that provides cache space for the operators; wherein:

[0115] The hardware scheduler controls the plurality of operators by: allocating caches from the shared cache pool for the plurality of operators to use when accessing data, wherein the caches include input caches and / or output caches;

[0116] The register group is configured to store configuration information for the plurality of operators. The configuration information includes information on the cache allocated to the operators, and may also include other information such as parameters configured for the operators.

[0117] The plurality of operators are configured to, upon startup, obtain information about the cache allocated to the operator from the register group; if an input cache is allocated, read the data block to be processed from the input cache and process it; if an output cache is allocated, write the processed data into the output cache.

[0118] In an exemplary embodiment of this disclosure, the plurality of operators sequentially include: a CQMD operator, an FFT operator, and an SVA operator; or, the plurality of operators sequentially include: a DC operator, an FFT operator, and an SVA operator; or, the plurality of operators sequentially include: a CMB operator, a STAS operator, a HIST operator, a CFAR operator, and a STAS operator; or, the plurality of operators sequentially include: a DC operator, an FFT operator, an SVA operator, and a CMB operator.

[0119] One embodiment of this disclosure also provides an integrated circuit. As shown in Figure 13, the integrated circuit includes a radio frequency module 2011, an analog signal processing module 2012, and a digital signal processing module 2013 connected in sequence, wherein:

[0120] The radio frequency module 2011 is configured to generate and transmit electromagnetic wave signals, and to receive echo signals;

[0121] The analog signal processing module 2012 is configured to down-convert the echo signal to obtain an intermediate frequency signal; and

[0122] The digital signal processing module 2013 is configured to perform analog-to-digital conversion on the intermediate frequency signal to obtain a digital signal; wherein the digital signal processing module includes the hardware accelerator described in any embodiment of this disclosure.

[0123] In an exemplary embodiment of this disclosure, the integrated circuit is a millimeter-wave chip or sensor chip used in a radar system.

[0124] One embodiment of this disclosure also provides an electromagnetic wave device. As shown in Figure 14, it includes: a carrier 4; an integrated circuit 5 according to any embodiment of this disclosure, disposed on the carrier 4; and an antenna 6 disposed on the carrier 4, either integrated with the integrated circuit 5 as a single device or disposed separately; wherein the integrated circuit 5 is connected to the antenna 6 and is used to transmit the electromagnetic wave signal and / or receive the echo signal.

[0125] An embodiment of this disclosure also provides an apparatus, including: an apparatus body; and an electromagnetic wave device disposed on the apparatus body as described in any embodiment of this disclosure; wherein the electromagnetic wave device is configured to perform target detection and / or communication to provide reference information to the operation of the apparatus body.

[0126] The following describes an exemplary hardware accelerator that enables a 1D-FFT stage processing flow and a corresponding multi-operator control method.

[0127] The 1D-FFT stage primarily involves performing noise compensation on the echo ADC data at the front end, followed by interference detection, a fast Fourier transform (FFT), and spatially variant apodization (SVA), to obtain frequency-domain data. This data is then moved from the accelerator buffer to external memory via DMA. Operators used in the 1D-FFT stage include: a Digital Front End (DFE) operator for noise compensation; a CQMD operator for interference detection; an FFT operator for performing the FFT; an SVA operator for sidelobe suppression; and a DMA operator for storing data in memory. Data exchange between operators is sequential; for each operator, the input data is the data to be processed, and the output data is the processed data.

[0128] In this embodiment, the cache allocation and scheduling of each operator are as follows:

[0129] DFE operator: This is a data pass-through module that is not controlled by the start instruction or the synchronization instruction, but a buffer must still be allocated to hold the pass-through data. The DFE operator can pass the input data through to the output buffer and compute statistics on the input data. The statistical results are very small in volume and can be written to the output buffer or the register group for use by subsequent operators.

[0130] CQMD operator: Controlled by start instructions and synchronization instructions, and requires the allocation of buffers for the CQMD operator to read input data and write output data;

[0131] FFT operator: Controlled by start instructions and synchronization instructions, and requires the allocation of an input buffer and an output buffer for the FFT operator to read input data and write output data;

[0132] SVA operator: Controlled by start instructions and synchronization instructions, and requires the allocation of an input buffer and an output buffer for the SVA operator to read input data and write output data;

[0133] DMA operator: Controlled by start and synchronization instructions, it requires an input buffer to read input data. DMA moves the input data to the chip's memory and does not require an output buffer.

[0134] In terms of cache physical space, the data flow between operators in this embodiment is sequential: the input data of CQMD is the output data of the DFE operator, the input data of FFT is the output data of CQMD, the input data of SVA is the output data of FFT, and the input data of DMA is the output data of SVA. As shown in Figure 15, the caches allocated to the operators are divided into four levels, and each level includes two caches: the first level is the output cache allocated to the DFE operator and the input cache allocated to the CQMD operator; the second level is the output cache allocated to the CQMD operator and the input cache allocated to the FFT operator; the third level is the output cache allocated to the FFT operator and the input cache allocated to the SVA operator; and the fourth level is the output cache allocated to the SVA operator and the input cache allocated to the DMA operator.

[0135] In the time dimension, in this embodiment, when a later operator accesses data in its input buffer, it must be ensured that the data processed by the previous operator has been completely written to the output buffer, so as to avoid access conflicts. To this end, this embodiment divides the time dimension into two alternating kinds of time steps (also called time slices or time units). In odd-numbered time steps (time steps 1, 3, 5, ...), the DFE operator processes the ADC data of chirps with odd indices and writes the output data to DFE Cache0 (DFE output cache 0 / CQMD input cache 0 in the diagram). The CQMD operator processes the data in DFE Cache1 (DFE output cache 1 / CQMD input cache 1 in the diagram) and writes the output data to CQMD Cache1 (CQMD output cache 1 / FFT input cache 1 in the diagram). The FFT operator processes the data in CQMD Cache0 (CQMD output cache 0 / FFT input cache 0 in the diagram) and writes the output data to FFT Cache0 (FFT output cache 0 / SVA input cache 0 in the diagram). The SVA operator processes the data in FFT Cache1 (FFT output cache 1 / SVA input cache 1 in the diagram) and writes the output data to SVA Cache1 (SVA output cache 1 / DMA input cache 1 in the diagram). The DMA operator moves the data in SVA Cache0 (SVA output cache 0 / DMA input cache 0 in the diagram) to external memory. The statement that odd-numbered time steps process data for chirps with odd indices (1, 3, 5, ...) assumes that chirp indices start from 1. If chirp indices start from 0, then odd-numbered time steps process data for chirps with even indices (0, 2, 4, ...).

[0136] The data flow for the odd-numbered time steps is shown by the solid arrows in the figure. Similarly, in the even-numbered time steps (the 2nd, 4th, 6th, ... time steps), each operator processes the data of other chirps. The dashed arrows in the figure represent the data flow direction for each operator to read and write data, which will not be elaborated here.
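The alternation between the two caches of each level can be sketched as follows. This is a minimal illustration only, assuming that time steps are numbered from 1 and that each level holds exactly two caches numbered 0 and 1; the names `OPERATORS`, `output_cache`, and `input_cache` are hypothetical and do not appear in this disclosure.

```python
# Illustrative sketch only; operator order follows the 1D-FFT example,
# and every function and variable name here is an assumption.
OPERATORS = ["DFE", "CQMD", "FFT", "SVA", "DMA"]


def output_cache(op_index, time_step):
    """Return 0 or 1: which output cache the operator writes in this time step."""
    if op_index == len(OPERATORS) - 1:
        return None  # DMA writes to external memory, not to a pooled cache
    # DFE (index 0) writes Cache0 in odd time steps; each later stage is
    # shifted by one time step, which flips the parity.
    return (time_step + 1 + op_index) % 2


def input_cache(op_index, time_step):
    """Return 0 or 1: which input cache the operator reads in this time step."""
    if op_index == 0:
        return None  # DFE takes the ADC data stream as its input in this sketch
    # An operator reads the cache its predecessor filled one time step earlier.
    return output_cache(op_index - 1, time_step - 1)
```

For time step 1 this reproduces the assignment described above: `input_cache(1, 1)` is 1 (CQMD reads DFE Cache1) and `output_cache(1, 1)` is 1 (CQMD writes CQMD Cache1). Within any one time step, the cache being written at a level always differs from the cache being read at that level.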

[0137] Although this embodiment allocates two input buffers and / or two output buffers (dynamically allocated) to each operator and switches between them with a period of two time steps (odd time steps and even time steps), it is easy to understand that it is also feasible to allocate three or more input buffers and / or output buffers to each operator and to switch between them with a period of three or more time steps. For example, when the radar system has four receiving channels, four input buffers and / or four output buffers can be allocated to each operator, switching between them with a period of four time steps and processing the data of one receiving channel at each time step. This allows for independent processing of the data from each channel. This embodiment avoids access conflicts between operators, such as those caused by simultaneous reading and writing of the same cache, in two ways: different caches are allocated to each operator at odd and even time steps; and, for adjacent operators, the output cache allocated to the preceding operator for the next time step is different from the input cache allocated to the following operator for the next time step.

[0138] In a time step where multiple operators are started, the processing bandwidth and processing time of each operator differ. Upper-level instruction control must ensure synchronization of operators within the time step. In this embodiment, the hardware scheduler uses a set of scheduling instructions to synchronize operators. Each set of instructions includes a start instruction and a synchronization instruction. The start instruction controls the operator to begin processing, and the synchronization instruction determines whether the operator has completed its processing. The execution of the synchronization instruction ends when all operators have completed their processing. The time step proposed in this application begins with the execution of the start instruction in a set of scheduling instructions and ends when the execution of the synchronization instruction in that set of scheduling instructions is completed.

[0139] The table below shows the scheduling process for each operator in the hardware accelerator at each time step, where \(L_1, L_2, \ldots, L_n\) represent the 1st, 2nd, and up to the nth time step. Each time step can be divided into a start area (starting the operators) and a sync area (synchronization between operators) to ensure the synchronization of different operators. As mentioned earlier, the DFE operator does not need to be controlled by scheduling instructions, so it is not shown in the table.

[0140]

[0141]

[0142] As shown in the table above, the start and synchronization instructions for the first time step \(L_1\) are sent to the CQMD operator to start it, and the synchronization for this time step ends after the CQMD operator finishes processing (that is, completes processing of all data in the input buffer and writes the processed data to the output buffer). The start and synchronization instructions for the second time step \(L_2\) are sent to both the CQMD and FFT operators to start them simultaneously, and the synchronization for this time step ends after both the CQMD and FFT operators have finished processing. The start and synchronization instructions for the third time step \(L_3\) are sent to the CQMD, FFT, and SVA operators to start them simultaneously, and the synchronization for this time step ends after all three operators have finished processing. From the fourth time step \(L_4\) to the (n-3)th time step \(L_{n-3}\), the start and synchronization instructions for each time step are sent to the CQMD, FFT, SVA, and DMA operators to start them simultaneously, and the synchronization for the time step ends after all four operators have finished processing. In the (n-2)th time step \(L_{n-2}\), the start and synchronization instructions are sent to the FFT, SVA, and DMA operators to start them simultaneously, and the synchronization for this time step ends after all three operators have finished processing. In the (n-1)th time step \(L_{n-1}\), the start and synchronization instructions are sent to both the SVA and DMA operators to start them simultaneously, and the synchronization for this time step ends after both operators have finished processing. In the nth time step \(L_n\), the start and synchronization instructions are sent to the DMA operator to start it, and the synchronization for this time step ends after the DMA operator has finished processing. One cycle of processing is completed in n time steps.

[0143] In this embodiment, the aforementioned n time steps are the time steps taken to obtain the processing result of the input dataset (such as a frame of ADC data) through multiple loop processing. It can be seen that, except for the first three time steps where each operator starts processing sequentially and the last three time steps where each operator exits sequentially, all operators in other time steps process synchronously. Each loop can process one data block at the corresponding position of all channels, and the next loop can begin before the processing of one loop is completely finished. The entire process is carried out in a pipeline manner. Taking its application in a radar system as an example, considering the characteristics of radar data processing, an application typically has thousands of data blocks to process. For over 99% of the entire processing time, all operators and buffers are 100% utilized, thus maximizing the processing capacity of the chip per unit area.
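As a rough check of the utilization figure above, the share of time steps in which every scheduled operator is running can be estimated as follows. This is a sketch under stated assumptions: all time steps are taken to have equal duration, four operators (CQMD, FFT, SVA, DMA) are scheduled, and the function name `full_pipeline_fraction` is hypothetical.

```python
def full_pipeline_fraction(n_time_steps, n_scheduled_ops=4):
    """Fraction of time steps in which all scheduled operators are active.

    The first (n_scheduled_ops - 1) time steps fill the pipeline and the last
    (n_scheduled_ops - 1) time steps drain it; all other time steps are full.
    """
    ramp = n_scheduled_ops - 1
    if n_time_steps <= 2 * ramp:
        raise ValueError("too few time steps for the pipeline to fill")
    return (n_time_steps - 2 * ramp) / n_time_steps


# With 1000 time steps, 994 of them run all four operators.
fraction = full_pipeline_fraction(1000)
```

Under these assumptions, 1000 time steps give a fraction of 0.994, which is consistent with the statement that all operators are busy for over 99% of the processing time when thousands of data blocks are processed.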

[0144] In the design of hardware accelerators, it is possible to make the processing time of multiple operators consistent as much as possible in order to maximize resource utilization. In this embodiment, the CQMD operator, FFT operator, and SVA operator process roughly the same amount of data per run, and their processing time within a time step is consistent. Even with a simple scheduling method, they still achieve high efficiency.

[0145] In another embodiment, the n time steps in the table above can be processed in a single loop. For example, when the radar system includes 8 receiving channels, the data from the 8 chirps received by the 8 receiving channels can be processed in 11 time steps. In this loop, all operators are executing processing in time steps 4 to 8. Starting from time step 9, the CQMD operator, FFT operator, and SVA operator exit the processing of this loop one by one. In time step 11, only the DMA operator works. After time step 11 ends, the next round of loop processing begins. In this embodiment, multiple operators still work in a pipeline manner during each loop processing, and this application still refers to it as a pipeline operation mode.

[0146] As described above, the multiple operators specified in this embodiment can be soft-connected through software code. Based on the same hardware accelerator, different software codes can be generated when facing different computing tasks. Through cache allocation instructions in the software code, caches are allocated to the specified multiple operators based on the set connection order to form a data interaction channel between operators. Simultaneously, data processing in this embodiment is executed through hardware-accelerated operators. Scheduling instructions can control multiple operators to start sequentially and process in parallel in a pipeline manner. Therefore, the hardware accelerator in this embodiment has application flexibility, does not require frequent data transfer, and also has good performance and low power consumption, effectively resolving the contradiction between application flexibility and high performance / low power consumption in hardware accelerators. The hardware accelerator in this embodiment can be applied not only to radar systems but also to other systems that require hardware accelerators.

[0147] This disclosure also provides a control device (referred to as a first control device when it needs to be distinguished from the control devices of other embodiments), for controlling a plurality of operators configured to process data to be processed according to a set connection order, the control device comprising:

[0148] The instruction decoding circuit is configured to decode software code, which includes multiple sets of scheduling instructions. Each set of scheduling instructions includes a start instruction and a synchronization instruction. The start instruction includes operands for indicating which operator needs to be started, and the synchronization instruction includes operands for indicating which operator needs to be synchronized.

[0149] The instruction execution circuit, coupled to the instruction decoding circuit, is configured to, in response to a start instruction in a set of decoded scheduling instructions, control the operator indicated by the operand to start a processing operation; and, in response to a synchronization instruction in the set of decoded scheduling instructions, determine that all operators indicated by the operands have completed the current processing operation, and then terminate the execution of the synchronization instruction.

[0150] The control device in this embodiment can be the logic control circuit in the aforementioned hardware scheduler. For example, as shown in Figure 16, the instruction decoding circuit 10 reads the instructions in the software code and decodes them. The decoded instructions are executed by the instruction execution circuit 20 and converted into corresponding operations, such as sending a start signal, reading or writing registers, etc.

[0151] The following is a set of pseudocode examples of scheduling instructions:

[0152] start_eng_q0 BWE+FFT

[0153] sync_q0 BWE+FFT

[0154] The pseudocode for the start instruction is "start_eng_q0 BWE+FFT". "start_eng_q0" indicates that this instruction is a start instruction, and "BWE+FFT" indicates that the operators to be started are the BWE and FFT operators. The pseudocode for the synchronization instruction is "sync_q0 BWE+FFT". "sync_q0" indicates that this instruction is a synchronization instruction, and "BWE+FFT" indicates that the operators to be synchronized are the BWE and FFT operators. During the execution of the synchronization instruction, the instruction execution circuit can monitor the status of the BWE and FFT operators. When it determines that both operators have completed processing (e.g., when it has received a completion signal from each of the two operators), the execution of the synchronization instruction ends, and subsequent instructions can then be executed.

[0155] This embodiment decodes multiple sets of scheduling instructions through an instruction decoding circuit and then executes them through an instruction execution circuit. Each set of scheduling instructions can start one or more operators and finishes only after all the started operators have completed processing. The operators can therefore be controlled to process the data step by step, one time step at a time, ensuring that operators connected in the set connection order process the data blocks in the correct time order. This avoids conflicts such as a subsequent operator reading data before the previous operator has finished outputting it.
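The start/synchronization behavior of one time step can be illustrated with a small timing model. This is a sketch only: the `Operator` class, the `run_time_step` function, and the cycle counts are assumptions made for illustration and do not describe the actual circuit.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    cycles_per_block: int  # assumed processing time, in arbitrary cycles
    busy_until: int = 0


def run_time_step(now, started):
    """Model one set of scheduling instructions; return the time it finishes."""
    # Start instruction: every indicated operator begins processing at `now`.
    for op in started:
        op.busy_until = now + op.cycles_per_block
    # Synchronization instruction: it completes only when the slowest of the
    # indicated operators has finished its processing.
    return max(op.busy_until for op in started)


cqmd = Operator("CQMD", cycles_per_block=90)
fft = Operator("FFT", cycles_per_block=100)
t1 = run_time_step(0, [cqmd])        # only CQMD runs in the first time step
t2 = run_time_step(t1, [cqmd, fft])  # both run; the step lasts as long as FFT
```

In this model the second time step cannot end before the slower FFT operator finishes, which is why the FFT output is guaranteed to be complete before any operator is started in the next time step.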

[0156] In an exemplary embodiment of this disclosure, the start instruction includes an opcode and operands represented by a bitmap. Each of the plurality of operators corresponds to one bit in the bitmap, and the value of this bit indicates whether the operator needs to be started. The synchronization instruction includes an opcode and operands represented by a bitmap. Each of the plurality of operators corresponds to one bit in the bitmap, and the value of this bit indicates whether the operator needs to be synchronized. After the start instruction is decoded and executed, a start signal can be generated for each operator that needs to be started. This signal is sent to the operator through a dedicated interface such as a signal line to trigger the operator to begin a processing cycle. The pseudocode mentioned above is not the actual executed instruction; the format of the actual instruction is described here.

[0157] In an exemplary embodiment of the present disclosure, a single instruction in the software code is a short instruction of 16 bits, 32 bits, or 64 bits. Some designs use 128-bit instructions, whose relatively wide bit width makes instruction encoding and parsing cumbersome. Moreover, some instructions cannot make full use of the 128 bits; for example, the STOP instruction uses only the high 6 bits, which wastes instruction storage space. The embodiments of the present disclosure adopt short instructions of fewer than 128 bits, such as 32-bit short instructions, which are convenient to encode and parse and help improve instruction execution efficiency. The advantages include, but are not limited to, the following: the 32 bits can be used effectively, avoiding a large number of wasted bit positions and saving instruction storage space; instructions can be combined more flexibly, making it easier for users to implement various functions; the hardware implementation of the scheduler is simpler, reducing the possibility of errors; and short instructions are transferred more efficiently, reducing the power consumption and running time of the radar signal processor.

[0158] In an exemplary embodiment of the present disclosure, the software code decoded by the instruction decoding circuit further includes a loop start instruction and a loop end instruction. The loop start instruction includes an operand representing the number of loops; the multiple groups of scheduling instructions are located between the loop start instruction and the loop end instruction; correspondingly, the instruction execution circuit performs multiple loop processes in response to the decoded software code, and controls the multiple operators to complete the processing of the data to be processed in a pipeline manner according to the set connection sequence; wherein, each loop process includes multiple time steps, and each time step starts from the start instruction in a group of scheduling instructions and ends when the synchronization instruction in the group of scheduling instructions is executed. Each operator started in each time step completes the processing of a data block in the data to be processed in that time step.

[0159] The following shows an example of loop-related instructions in pseudocode:

[0160]

[0161]

[0162] Instructions for reallocating the cache can also be inserted between the above-mentioned groups of scheduling instructions, as described below.
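The effect of the loop start and loop end instructions can be sketched with a small expander that flattens a program into the sequence actually executed. The tuple-based instruction representation, the mnemonics `loop_start`, `loop_end`, `start`, and `sync`, and the function name `expand_loops` are assumptions for illustration; nested loops are not handled.

```python
def expand_loops(program):
    """Flatten a program containing non-nested loops into the executed sequence."""
    executed = []
    body = None      # instructions collected inside the current loop, if any
    remaining = 0    # loop count taken from the loop start operand
    for instr in program:
        kind = instr[0]
        if kind == "loop_start":
            body, remaining = [], instr[1]
        elif kind == "loop_end":
            executed.extend(body * remaining)  # repeat the body `remaining` times
            body = None
        elif body is not None:
            body.append(instr)
        else:
            executed.append(instr)
    return executed


program = [
    ("loop_start", 2),
    ("start", "CQMD"),
    ("sync", "CQMD"),
    ("loop_end",),
]
trace = expand_loops(program)
```

Here the one set of scheduling instructions in the loop body is executed twice, so the trace contains two time steps.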

[0163] In an exemplary embodiment of the present disclosure, there are \(M\) groups of scheduling instructions between the loop start instruction and the loop end instruction. Each loop process includes \(M\) time steps, \(M = m_1 + m_2 - 1\), where \(m_1\) is the number of the multiple operators, \(m_2\) is the number of data blocks processed in each loop process, \(m_1 \geq 2\), and \(m_2 \geq 2\). For the \(m\)-th time step, \(m = 1, 2, \ldots, M\):

[0164] When \(m < m_1\), the activated operators at the \(m\)-th time step are the 1st to the \(m\)-th operators;

[0165] When \(m_1\leq m\leq m_2\), the activated operators at the \(m\)-th time step are all of the multiple operators;

[0166] When \(m_2 < m\leq M\), the activated operators at the \(m\)-th time step are the \((m - m_2 + 1)\)-th to the \(m_1\)-th operators;

[0167] Here, the operator to be started indicated by the start instruction in the \(m\)-th group of scheduling instructions is the activated operator at the \(m\)-th time step.
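The three cases above can be written as a single function. This is a minimal sketch that assumes \(m_2 \geq m_1\) (the cases as stated do not cover a loop with fewer data blocks than operators); the function name `active_operators` is hypothetical.

```python
def active_operators(m, m1, m2):
    """1-based indices of the operators started in time step m.

    m1 is the number of operators and m2 the number of data blocks per loop.
    Assumes m2 >= m1, as in the three cases this sketch follows.
    """
    total_steps = m1 + m2 - 1  # M in the text
    if not 1 <= m <= total_steps:
        raise ValueError("time step out of range")
    if m < m1:
        return list(range(1, m + 1))        # pipeline filling
    if m <= m2:
        return list(range(1, m1 + 1))       # all operators active
    return list(range(m - m2 + 1, m1 + 1))  # pipeline draining


# Assumed example: 4 scheduled operators and 8 data blocks give 11 time steps.
schedule = [active_operators(m, 4, 8) for m in range(1, 12)]
```

In this example all four operators are active in time steps 4 to 8, only the last operator is active in time step 11, and each operator is started exactly 8 times, once per data block.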

[0168] For the activated operators (i.e., the operators that need to be started) at each of the above time steps, refer to the description of the example shown in Figure 11A above. Specifically, this can be implemented by generating software code.

[0169] In an exemplary embodiment of the present disclosure, the software code decoded by the instruction decoding circuit further includes a plurality of cache allocation instructions; the instruction execution circuit, in response to the plurality of cache allocation instructions, allocates caches for the multiple operators to use when accessing data respectively;

[0170] wherein the caches allocated for the multiple operators all come from a shared cache pool, and the multiple caches allocated for the multiple operators to use at the same time step are located in different and independently accessible regions in the shared cache pool to avoid access conflicts caused by two operators accessing the same cache at the same time;

[0171] wherein the allocated caches include input caches and / or output caches, and for adjacent operators among the multiple operators, the output cache allocated for the previous operator to use at the current time step is the input cache allocated for the subsequent operator to use at the next time step, so as to form a data interaction channel between adjacent operators.
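The requirement that caches used in the same time step occupy different regions of the shared cache pool can be checked with a simple overlap test. This sketch models each cache as a rectangle of rows and columns in the pool, which is an assumption; the `Region` class and the functions `overlaps` and `conflict_free` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    first_row: int
    n_rows: int
    first_col: int
    n_cols: int


def overlaps(a, b):
    """True if two rectangular regions share at least one row/column cell."""
    rows = a.first_row < b.first_row + b.n_rows and b.first_row < a.first_row + a.n_rows
    cols = a.first_col < b.first_col + b.n_cols and b.first_col < a.first_col + a.n_cols
    return rows and cols


def conflict_free(regions):
    """True if no two caches used in the same time step overlap."""
    return not any(overlaps(a, b) for a, b in combinations(regions, 2))


# Two caches side by side in the column direction do not conflict.
ok = conflict_free([Region(0, 36, 16, 16), Region(0, 36, 32, 16)])
# Shifting the second cache left by 8 columns makes them overlap.
bad = conflict_free([Region(0, 36, 16, 16), Region(0, 36, 24, 16)])
```

A code generator could run such a check on the caches assigned for each time step before emitting the cache allocation instructions.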

[0172] In an exemplary embodiment of the present disclosure, the plurality of cache allocation instructions decoded by the instruction decoding circuit include a plurality of first cache allocation instructions located before the loop start instruction. The first cache allocation instructions include: a first operand, used to indicate the location and size of the cache initially allocated for an operator; a second operand, used to indicate the storage space to which the first operand is to be written;

[0173] The instruction execution circuit is further configured to, in response to the decoded plurality of first cache allocation instructions, write the first operand in each first cache allocation instruction into the storage space indicated by the second operand in the cache allocation instruction, so as to allocate caches for the multiple operators to use when processing for the first time respectively.

[0174] The cache space of the shared cache pool can be a storage array, including multiple rows and columns. The aforementioned first cache allocation instruction can be a single instruction or multiple related instructions. In one example of this embodiment, where a single instruction in the software code uses a short instruction, the first cache allocation instruction includes multiple short instructions.

[0175] The first operand, used to indicate the location and size of the cache, can include multiple immediate values representing the position of the first row, the number of rows, the position of the first column, and the number of columns. The position of the first row and the position of the first column can be represented by the index of the first row and the index of the first column, respectively. The number of rows and columns can be represented in various ways. For example, the number of rows can be represented directly by the value of the number of rows, or by the combination of the index of the first row and the index of the last row, or by the size and number of banks (areas, which can be understood as a set of rows) in the row direction. That is, one or more numbers can be used to represent them directly or indirectly. The second operand, used to indicate the storage space to which the first operand is to be written, can be a register address. However, not every immediate value is written to this register address itself; it is sufficient that the register address for each immediate value can be derived from this register address.

[0176] Here is an example of a first cache allocation instruction represented in pseudocode:

[0177]

[0178] In the pseudocode above, each line represents an instruction. The first line, wr_que0, indicates writing to registers: 0xb00 is the starting register address, and 29 indicates that 30 values are to be written. The first value is written to address 0xb00, the second to address 0xb04, the third to address 0xb08, and so on, with the address advancing by 4 for each value.

[0179] In each line, Wdat represents the written data. The immediate value in line 2 is defined as the position of the first column in the SVA input buffer, with a configured value of 16, indicating that the index of the first column is 16. The immediate value in line 3 is defined as the position of the first row in the SVA input buffer, with a configured value of 0, indicating that the index of the first row is 0. The immediate value in line 4 is defined as the size of the BANK (set of columns) in the column direction of the SVA input buffer, with a configured value of 3, indicating that a BANK in the column direction has 4 columns. The immediate value in line 5 is defined as the size of the BANK in the row direction of the SVA input buffer, with a configured value of 11, indicating that a BANK in the row direction has 12 rows. The immediate value in line 6 is defined as the number of BANKs included in the column direction of the SVA input buffer, with a configured value of 3, indicating that there are 4 BANKs. The immediate value in line 7 is defined as the number of BANKs included in the row direction of the SVA input buffer, with a configured value of 2, indicating that there are 3 BANKs. By combining multiple instructions, the input buffer allocated to the SVA operator has a first column index of 16, a total of 4×4 columns, a first row index of 0, and a total of 12×3 rows. The information for the output buffer allocated to the SVA operator can be configured in the same way.
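The arithmetic in this example can be reproduced with two small helpers. This is a sketch based only on the values quoted above; the function names `burst_addresses` and `buffer_extent` are hypothetical, and the "field value plus one" convention is inferred from the example (29 means 30 values, 3 means 4 columns, and so on).

```python
def burst_addresses(base, count_field, stride=4):
    """Register addresses written by one burst.

    The count field stores (number of values - 1); addresses advance by
    `stride` bytes per value, as in 0xb00, 0xb04, 0xb08, ...
    """
    return [base + stride * i for i in range(count_field + 1)]


def buffer_extent(bank_size_field, bank_count_field):
    """Total rows (or columns) of a buffer; both fields store (value - 1)."""
    return (bank_size_field + 1) * (bank_count_field + 1)


addresses = burst_addresses(0xB00, 29)  # 30 register addresses
sva_in_cols = buffer_extent(3, 3)       # 4 columns per BANK x 4 BANKs
sva_in_rows = buffer_extent(11, 2)      # 12 rows per BANK x 3 BANKs
```

With the configured values, the SVA input buffer spans 16 columns starting at column index 16 and 36 rows starting at row index 0, matching the 4×4 columns and 12×3 rows stated above.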

[0180] In the example above, the multiple instructions used to configure the registers can also be called a parameter set, where each line can be 32 bits, but a single instruction cannot independently implement a specific function.

[0181] In an exemplary embodiment of this disclosure, the multiple cache allocation instructions decoded by the instruction decoding circuit further include a second cache allocation instruction located between two adjacent sets of scheduling instructions. The second cache allocation instruction includes: a first operand, used to indicate the location of the cache to be reallocated for the operator in the next time step; and a second operand, used to indicate the storage space to be written to by the first operand.

[0182] The instruction execution circuit is further configured to, in response to each decoded second cache allocation instruction, write the first operand therein to the storage space indicated by the second operand therein, to allocate a cache for use in the next time step for an operator using the storage space; wherein, for adjacent operators among the plurality of operators, the output cache allocated for use in the next time step for the preceding operator is different from the input cache allocated for use in the next time step for the following operator.

[0183] This embodiment uses a second cache allocation instruction between two sets of scheduling instructions to reallocate caches for operators. This allows each operator to switch between multiple caches, so that the output cache used by an operator in a given time step differs from the input cache used by the following operator in that same time step. Adjacent operators can therefore process data in parallel (including reading and writing data) and work in a pipelined manner to improve efficiency. The two sets of scheduling instructions in this embodiment can be located in different loop processes.

[0184] The second cache allocation instruction can reallocate caches by changing a few parameters based on the cache allocated by the first cache allocation instruction. An example is as follows:

[0185]

[0186] In the pseudocode above, line 1, `wr_que0`, indicates writing to the register, where 0xb00 is the register address and 0 indicates that one value is to be written. Line 2, `Wdat`, indicates the data to be written; the immediate value is defined as the index of the first column of the SVA operator's input buffer and is configured to 32. The other buffer information is the same as that configured by the first cache allocation instruction and does not need to be reconfigured. Therefore, when reallocating the buffer, updating the index of the first column from 16 to 32 is sufficient to determine the location and size of the input buffer reallocated to the SVA operator. The same applies to the reallocation of the output buffer. Placing the reallocation instruction inside a loop reduces the number of instructions required.

[0187] In one example of this embodiment, among the multiple sets of scheduling instructions, the second cache allocation instruction between two adjacent sets of scheduling instructions is used to reallocate the cache position for each operator that performed processing in the previous time step and still needs to perform processing in the next time step; for each of the multiple operators, starting from the time step when the operator is first started, with a period of K time steps, the cache allocated by the instruction execution circuit for the operator in the K time steps of the same period is different, and the cache allocated for the operator in the k-th time step of different periods is the same, k = 0, 1, ..., K-1, K ≥ 2.

[0188] Switching the cache used by an operator during a loop does not necessarily require instructions; it can also be achieved through the operator's hardware logic in conjunction with the configured number of caches. In another exemplary embodiment of this disclosure, the first operand in the first cache allocation instruction is further used to indicate the number K of cache groups used by the operator in the multiple loop processes. This instructs the operator to update the row or column position of the allocated cache step by step, with a period of K time steps, so that the operator uses different caches in the K time steps of the same period and uses the same cache in the k-th time step of different periods, where k = 0, 1, ..., K-1, K ≥ 2. The "number K of cache groups" here is equal to the K in "K cache groups are allocated and reallocated for the operator" mentioned above. When only an input buffer is allocated to an operator, K represents the number of input buffers; when only an output buffer is allocated to an operator, K represents the number of output buffers; when both input and output buffers are allocated to an operator (i.e., allocated in pairs), K represents the number of pairs of input and output buffers, so that, for example, K = 4 means that 4 input buffers and 4 output buffers are allocated.

[0189] For example, in the example of the first cache allocation instruction mentioned above, another instruction could be added after the last instruction:

[0190]

[0191] The immediate value in this pseudocode is defined as the number of caches allocated to the SVA operator. A configured value of 3 indicates an allocation of 4 caches: when only the input cache is allocated, there are 4 input caches; when only the output cache is allocated, there are 4 output caches; and when both input and output caches are allocated, as in this example, there are 4 of each. However, the input and output caches can also be allocated separately. The number of caches allocated here equals the number of time steps in a cache allocation cycle. Based on this configuration, the SVA operator operates in a cycle of 4 time steps, and within each cycle the 4 time steps use the 4 pairs of caches in turn. These 4 pairs of caches can be defined as having the same size and being sequentially connected in the column direction. The operator updates the position of the cache used in each time step based on the position and size of the first pair of caches and the number of received start signals (which can be counted cyclically using a counter), while keeping the cache size unchanged, thus reallocating the cache on the operator side. This method eliminates the need for a second cache allocation instruction, simplifying the software code.

[0192] In an exemplary embodiment of this disclosure, the instruction execution circuit responds to a start instruction in a decoded set of scheduling instructions by sending a start signal to the operator indicated by the operand to control the operator to start a processing operation; the operator receiving the scheduling instruction among the plurality of operators is configured to: upon receiving the start signal, read the input data block from the input buffer and process it, and save the processing result to the output buffer; and, in response to the data block obtained after processing (i.e. the processing result) being saved to the output buffer, generate a signal indicating that the current processing is complete to notify the instruction execution circuit.

[0193] In this embodiment, the start signal is transmitted through a dedicated interface (such as a line between the hardware scheduler and the operator). In other embodiments, it can also be implemented through a configuration register, for example by setting the register corresponding to an operator to 1 and then to 0 to generate a pulse signal, and using this pulse signal as the start signal that triggers the operator's processing.

[0194] In an exemplary embodiment of this disclosure, the data to be processed is a frame of data obtained by the radar system from processing received radar signals; the plurality of operators are operators performing 1D-FFT stage processing, and one data block in the data to be processed is the data of a chirp signal received on a channel; or, the plurality of operators are operators performing 2D-FFT stage processing, and one data block in the data to be processed is the data at the same range gate in the 1D-FFT data. In application fields outside of radar, and in different processing stages within a radar system, the data blocks to be processed by the operators and the amount of data contained in the data to be processed, as well as the physical meaning of the data to be processed, can be different.

[0195] One embodiment of this disclosure provides a hardware scheduler which, as shown in Figure 17, includes a control device and an internal memory. The control device includes the control device described in any embodiment of this disclosure (such as a first control device or a second control device). The internal memory is configured to store the software code (also referred to as an instruction queue or instruction set) to be decoded and executed by the control device.

[0196] The hardware scheduler may include one or more memories and one or more control devices. The example shown in Figure 17 includes two memories and two control devices and can control multiple hardware accelerators simultaneously. The internal memories can use high-speed storage media such as SRAM to cache the software code. The software code can be loaded into the internal memories in several ways: first, it can be written to the internal memories via a bus, for example an Advanced Peripheral Bus (APB), and this write can be performed by the processor; second, it can be initially stored in main memory (see Figure 7) and moved from main memory to the internal memories via DMA; third, the hardware scheduler can actively read the software code from a designated address. A multiplexer can be configured to select the data path for each of these three methods. As shown in the figure, the hardware scheduler connects to the register bank and the shared buffer pool, allowing it to read and write to both. The hardware scheduler can also send start signals to operators and receive completion signals (represented as "done" in the figure), each of which indicates that an operator has completed its current processing. The control device can determine whether to end the synchronization instruction based on whether the "done" signals of all operators active in the current time step are valid.
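The completion check described at the end of the preceding paragraph can be sketched as follows. This is a minimal illustration only: the function name `sync_complete`, the operator names, and the dictionary representation of the "done" signals are assumptions made for the example and do not come from the disclosure.

```python
def sync_complete(active_ops, done):
    """Return True once every operator active in this time step has raised 'done'.

    active_ops: names of the operators started in the current time step (hypothetical).
    done:       mapping from operator name to the state of its 'done' signal.
    """
    return all(done.get(op, False) for op in active_ops)


# Hypothetical time step with two active operators.
active = ["op1", "op2"]
done = {"op1": True, "op2": False}
still_waiting = not sync_complete(active, done)  # op2 has not finished yet
done["op2"] = True
step_finished = sync_complete(active, done)      # both operators are done
```

In this sketch the synchronization instruction would keep waiting while `sync_complete` returns False and would end once it returns True.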

[0197] One embodiment of this disclosure provides an operator which, as shown in Figure 18, includes a data reading unit, a processing unit, and a data writing unit, wherein:

[0198] The data reading unit is configured to acquire information about the input buffer allocated to the operator, and based on the information about the input buffer, read out the data block in the input buffer and input it into the processing unit;

[0199] The processing unit is configured to process the data block to obtain a processed data block;

[0200] The data writing unit is configured to obtain information about the output cache allocated to the operator, and write the processed data block into the output cache based on the information about the output cache.

[0201] The operator in this embodiment can be applied to the hardware accelerator described in any embodiment of this disclosure, enabling data to be passed between operators through a cache and allowing information of the allocated cache to be read from the outside. This achieves soft connections (configurable connections) and fast data interaction between operators. While ensuring processing efficiency, the connection relationship between operators and the design of data flow can be changed according to application needs, improving the application flexibility of the hardware accelerator using the operator.
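The read, process, and write flow of such an operator, and the way two operators are soft-connected through a shared cache region, can be illustrated with a minimal sketch. The shared cache pool is modeled here as a small two-dimensional list; `CacheInfo`, `read_block`, `write_block`, `run_operator`, and the two example processing functions are hypothetical names chosen for illustration and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class CacheInfo:
    """Position and size of a cache region inside the shared pool."""
    row: int
    num_rows: int
    col: int
    num_cols: int


def read_block(pool, info):
    """Data reading unit: copy the block described by `info` out of the pool."""
    rows = pool[info.row:info.row + info.num_rows]
    return [r[info.col:info.col + info.num_cols] for r in rows]


def write_block(pool, info, block):
    """Data writing unit: store `block` into the region described by `info`."""
    for i, row in enumerate(block):
        pool[info.row + i][info.col:info.col + info.num_cols] = row


def run_operator(pool, in_info, out_info, process):
    """One processing step: read, process, write, then report 'done'."""
    result = process(read_block(pool, in_info))
    write_block(pool, out_info, result)
    return True  # stands in for the 'done' signal


# Shared pool of 2 rows x 12 columns; the first 4 columns hold the input data.
pool = [[1, 2, 3, 4] + [0] * 8 for _ in range(2)]

a_in = CacheInfo(0, 2, 0, 4)
a_out = CacheInfo(0, 2, 4, 4)
b_in = a_out                      # soft connection: B reads what A wrote
b_out = CacheInfo(0, 2, 8, 4)

run_operator(pool, a_in, a_out, lambda blk: [[2 * x for x in r] for r in blk])
run_operator(pool, b_in, b_out, lambda blk: [[x + 1 for x in r] for r in blk])
```

Changing which `CacheInfo` is handed to which operator changes the connection between operators without touching the operators themselves, which is the point of the configurable connection described above.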

[0202] In an exemplary embodiment of this disclosure, the operator further includes any one or more of the following interfaces:

[0203] The scheduling interface receives a start signal, which is used as a trigger signal for this operator to begin processing.

[0204] The synchronization interface is used to write the processed data block into the output buffer and then send a signal indicating that the processing is complete.

[0205] As shown in Figure 18, the operator can initiate a processing step (which includes reading, processing, and writing data) based on a start signal, such as a pulse signal, sent by the hardware scheduler through the scheduling interface. After processing, the operator returns a "done" signal to the hardware scheduler through the synchronization interface to indicate that the processing is complete. The scheduling interface and synchronization interface can be dedicated interfaces between the hardware scheduler and the operator, such as dedicated lines, or they can be implemented through register configuration.

[0206] In an exemplary embodiment of this disclosure, the data reading unit obtains information about the input buffer allocated to the operator from the register set, and the data writing unit obtains information about the output buffer allocated to the operator from the register set; the register set can be a register set used by one operator or a register set shared by multiple operators. The hardware scheduler is not limited to allocating buffers for operators; it can also write operator parameters, statistical data generated during processing, or other configuration information into the corresponding register set.

[0207] In an exemplary embodiment of this disclosure, the operator further includes a cache update unit;

[0208] The cache update unit is configured to: read from the register group the location and size of the cache initially allocated to the operator, and the number of cache groups K allocated to the operator, K≥2; and cyclically count the number of times the scheduling interface receives the start signal, with the count value ranging from 0 to K-1; when the cumulative number of cycles is k, calculate the location of the cache used by the operator in the next time step based on the location and size of the initially allocated cache and the value of k, and update the cache location in the register group to the calculated cache location; where k=0,1,…,K-1, and the calculated cache location is different when the value of k is different.

[0209] In the example shown in Figure 11A, each operator uses 4 caches during the loop. When updating the cache position on the operator side, taking the input cache of the SVA operator as an example, assume that the first column index of the input cache initially allocated to the SVA operator in the register set is 16, and the total number of columns in the cache is 16. The SVA operator's cache update unit monitors the number of times the start signal is received by the scheduling interface. The initial value of k is 0. In the first loop:

[0210] The SVA operator receives the start signal for the first time in the 3rd time step, uses the initially allocated input buffer whose first column index is 16, increments k to 1, and updates the first column index stored in the register group to 16 + 16 × 1 = 32;

[0211] The SVA operator receives the start signal for the second time in the 4th time step, uses the input buffer whose first column index is 32, increments k to 2, and updates the first column index stored in the register group to 16 + 16 × 2 = 48;

[0212] The SVA operator receives the start signal for the third time in the 5th time step, uses the input buffer whose first column index is 48, increments k to 3, and updates the first column index stored in the register group to 16 + 16 × 3 = 64;

[0213] The SVA operator receives the start signal for the fourth time in the 6th time step, uses the input buffer whose first column index is 64, and resets k to 0 because incrementing it would exceed the maximum value K - 1 = 3; the first column index stored in the register group is updated to 16 + 16 × 0 = 16.

[0214] The updates to the input buffer are the same in subsequent loops, and the updates to the output buffer are similar, so they are not repeated here.
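The walk-through above can be reproduced with a small behavioral model of the cache update unit. This is a sketch under stated assumptions: the class name `CacheUpdateUnit`, its attribute names, and the method `on_start` are hypothetical, and the model only tracks the first column index, as in the SVA input buffer example (initial column 16, 16 columns per cache, K = 4).

```python
class CacheUpdateUnit:
    """Behavioral model of an operator-side cache position updater (illustrative)."""

    def __init__(self, base_col, num_cols, num_groups):
        if num_groups < 2:
            raise ValueError("K must be at least 2")
        self.base_col = base_col      # first column of the initially allocated cache
        self.num_cols = num_cols      # cache width in columns, kept unchanged
        self.num_groups = num_groups  # K, the number of cache groups
        self.k = 0                    # cyclic count of received start signals
        self.current_col = base_col   # value currently held in the register group

    def on_start(self):
        """Handle one start signal and return the column used in this time step."""
        used_col = self.current_col
        # Count cyclically in the range 0 .. K-1, then compute the next position.
        self.k = (self.k + 1) % self.num_groups
        self.current_col = self.base_col + self.num_cols * self.k
        return used_col


unit = CacheUpdateUnit(base_col=16, num_cols=16, num_groups=4)
columns = [unit.on_start() for _ in range(8)]  # two full periods of K = 4
print(columns)  # [16, 32, 48, 64, 16, 32, 48, 64]
```

The second period repeats the first, which matches the statement that the operator uses the same cache in the k-th time step of different periods.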

[0215] This embodiment updates the cache on the operator side. When the number of caches allocated to the operator is large, it is not necessary to update the cache through many instructions, which can simplify instruction design and speed up the operation.

[0216] Another embodiment of this disclosure provides a control device (when it is necessary to distinguish it from the control device of other embodiments, the control device of this embodiment is referred to as a second control device). The control device of this embodiment is used to control a plurality of specified operators, the plurality of operators being configured to process data to be processed according to a set connection order, the control device comprising:

[0217] An instruction decoding circuit is configured to decode software code, the software code including a plurality of first cache allocation instructions, each first cache allocation instruction including: a first operand, used to indicate the location and size of the cache allocated for the operator; and a second operand, used to indicate the storage space to be written to the first operand;

[0218] The instruction execution circuit, coupled to the instruction decoding circuit, is configured to, in response to a plurality of decoded first cache allocation instructions, write a first operand in each first cache allocation instruction into the storage space indicated by a second operand in the cache allocation instruction, so as to allocate caches for the plurality of operators to use when accessing data.

[0219] The caches allocated to the multiple operators are all from a shared cache pool.

[0220] The control device in this embodiment can be the logic control circuit in the aforementioned hardware scheduler. It can adopt the structure shown in Figure 16: the instruction decoding circuit 10 reads the instructions in the software code and decodes them, and the instruction execution circuit 20 then converts the decoded instructions into corresponding operations, such as reading or writing registers and sending start signals.

[0221] This embodiment decodes multiple first cache allocation instructions through an instruction decoding circuit and then executes them through an instruction execution circuit. It allocates caches to multiple operators from a shared cache pool, enabling data to be passed between operators through the cache. This achieves soft connections (configurable connections) and fast data interaction between operators. While ensuring processing efficiency, the connection relationship between operators can be changed according to application needs, improving the application flexibility of hardware accelerators using multiple operators.

[0222] In an exemplary embodiment of this disclosure, the first operand in the first cache allocation instruction is used to indicate the location and size of the cache initially allocated to the operator. The cache initially allocated to the operator is the cache used when the operator performs its first processing. The cache includes an input cache and/or an output cache.

[0223] The multiple operators are started sequentially according to a set connection order. For adjacent operators, the output buffer initially allocated to the preceding operator is the same as the input buffer initially allocated to the following operator. In this embodiment, the first buffer allocation instruction indicates the location and size of the buffer initially allocated to the operator. When the operator performs subsequent processing, the buffer can be reallocated using instructions, but this disclosure is not limited to this. In another embodiment, the initially allocated buffer can be used throughout the entire computation task. In this case, for adjacent operators, while the preceding operator is writing processed data to its output buffer, the following operator can pause processing. After the preceding operator finishes writing the data, it notifies the hardware scheduler or the following operator, and the following operator then reads the data from the preceding operator's output buffer (i.e., the following operator's input buffer) for processing. Although the operators cannot process in parallel in this case, soft connections between operators can still be achieved, improving application flexibility. Furthermore, the buffer space occupied is small, making it suitable for situations with limited buffer space. This embodiment can schedule operators without using start and synchronization instructions.

[0224] In an exemplary embodiment of this disclosure, a single instruction in the software code is a short instruction of 16 bits, 32 bits, or 64 bits. The aforementioned first cache allocation instruction may include multiple short instructions to implement the cache allocation function.

[0225] In an exemplary embodiment of this disclosure, the software code decoded by the instruction decoding circuit further includes multiple sets of scheduling instructions; the instruction execution circuit is further configured to schedule one or more operators in response to each set of decoded scheduling instructions, so that the scheduled operators start and complete one processing of a data block in a time step; wherein each time step starts when a set of scheduling instructions is executed and ends when the execution of the set of scheduling instructions ends.

[0226] This embodiment can be implemented using start and synchronization instructions, similar to the previous embodiments. However, other scheduling methods can also be used. For example, during the design phase, the processing time of multiple operators can be balanced. When generating the software code, a set time margin can be added to the maximum processing time of multiple operators each time, serving as the set time for each time step. Each time step begins when the start instruction is executed and ends after the set time (a delay instruction can be designed to implement this function). This control method is slightly less efficient, but it simplifies control and the corresponding hardware circuitry.

[0227] In an exemplary embodiment of this disclosure, the software code decoded by the instruction decoding circuit further includes a loop start instruction and a loop end instruction. The loop start instruction includes an operand representing the number of loop iterations. The first cache allocation instruction is located before the loop start instruction, and the multiple sets of scheduling instructions are located between the loop start instruction and the loop end instruction.

[0228] The instruction execution circuit responds to the decoded software code by performing multiple loop processes, controlling the multiple operators to complete the processing of the data to be processed in a pipeline manner according to a set connection order; each loop process includes multiple time steps, and each operator completes one processing of a data block in the data to be processed in one time step.

[0229] The instructions for looping and scheduling in this embodiment can adopt the loop-related instructions shown in the pseudocode above. They can also be modified. For example, the loop start instruction (Lp_start_q0), loop end instruction (lp_end_q0), and start instruction (Start_eng_q0) remain unchanged, but the synchronization instruction (sync_q0) is updated to a delay instruction (delay_q0). The operand of the delay instruction can be set to the delay time, which can be configured to be the maximum processing time of multiple operators plus a set time margin.
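The delay-based variant can be sketched as a small generator for the instructions between the loop start and loop end instructions. The mnemonics `Start_eng_q0` and `delay_q0` are taken from the text; everything else is an assumption for illustration, including the function name `build_loop_body`, the representation of an instruction as a (mnemonic, operand) tuple, the operator names, and the cycle counts.

```python
def build_loop_body(active_per_step, processing_cycles, margin):
    """Build the scheduling groups of one loop using delay instead of sync.

    active_per_step:   list of operator-name lists, one per time step.
    processing_cycles: hypothetical worst-case processing time per operator.
    margin:            set time margin added to the maximum processing time.
    """
    # Every time step lasts as long as the slowest operator plus the margin.
    delay = max(processing_cycles.values()) + margin
    program = []
    for ops in active_per_step:
        program.append(("Start_eng_q0", tuple(ops)))  # start the active operators
        program.append(("delay_q0", delay))           # wait out the time step
    return program


# Hypothetical two-operator pipeline processing two data blocks per loop.
cycles = {"op1": 120, "op2": 90}
steps = [["op1"], ["op1", "op2"], ["op2"]]
body = build_loop_body(steps, cycles, margin=10)
```

Because the delay is fixed at the worst case, a time step in which only a fast operator runs still takes the full duration, which is the efficiency cost noted in the text.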

[0230] In an exemplary embodiment of the present disclosure, there are M groups of scheduling instructions between the loop start instruction and the loop end instruction. Each loop process includes M time steps, where M = m1 + m2 - 1, m1 is the number of the plurality of operators, m2 is the number of data blocks processed in each loop, m1 ≥ 2, and m2 ≥ 2;

[0231] When m < m1, the activated operators in the m-th time step are the 1st to the m-th operators, where m = 1, 2,..., M;

[0232] When m1 ≤ m ≤ m2, the activated operators in the m-th time step are all the operators among the plurality of operators;

[0233] When m2 < m ≤ M, the activated operators in the m-th time step are the (m - m2 + 1)-th to the m1-th operators;

[0234] Among them, the operator to be started indicated by the start instruction in the m-th group of scheduling instructions is the activated operator in the m-th time step.

[0235] For this embodiment, reference can be made to the detailed description above.
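The three cases above can be captured in a single expression. The following is a minimal sketch, assuming 1-based operator and time-step numbering as in the text; the function name `active_operators` is hypothetical. When m1 ≤ m2, the closed form max(1, m − m2 + 1) to min(m1, m) gives the same result as the three cases.

```python
def active_operators(m, m1, m2):
    """Return the 1-based indices of the operators activated in the m-th time step.

    m1 is the number of operators, m2 the number of data blocks per loop,
    and each loop has M = m1 + m2 - 1 time steps.
    """
    total_steps = m1 + m2 - 1
    if not 1 <= m <= total_steps:
        raise ValueError("m must be between 1 and M")
    first = max(1, m - m2 + 1)  # operators before this one have finished all blocks
    last = min(m1, m)           # operators after this one have not started yet
    return list(range(first, last + 1))


# Example: 3 operators, 4 data blocks per loop, so M = 6 time steps.
schedule = {m: active_operators(m, m1=3, m2=4) for m in range(1, 7)}
print(schedule)
# {1: [1], 2: [1, 2], 3: [1, 2, 3], 4: [1, 2, 3], 5: [2, 3], 6: [3]}
```

In this example each operator is active in exactly m2 = 4 consecutive time steps, one per data block, and the pipeline fills during the first m1 − 1 steps and drains during the last m1 − 1 steps.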

[0236] In an exemplary embodiment of the present disclosure, the first cache allocation instruction includes a write register instruction and N' write data instructions associated with the write register instruction. The write register instruction is associated with a total of N write data instructions, where:

[0237] The write register instruction includes: an operation code, the second operand representing the register address Add, and an operand representing the number of immediate numbers N, where N ≥ N' ≥ 2;

[0238] The write data instruction includes an operation code and an immediate number. The first operand includes N' immediate numbers among the N' write data instructions. The N' immediate numbers among the N' write data instructions include information on the row position, number of rows, column position, and number of columns of the cache allocated for the operator;

[0239] The instruction execution circuit writes the first operand in each first cache allocation instruction into the storage space indicated by the second operand in the cache allocation instruction, including: for each of the N' write data instructions, writing the immediate number therein into the register with the address Add + n - 1 for the operator to read, where n is the serial number of the instruction among the N write data instructions, indicating that the instruction is the n-th instruction under the write register instruction, and 1 ≤ n ≤ N.
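The address rule Add + n − 1 can be illustrated with a minimal model of how the instruction execution circuit might handle one write register instruction and its associated write data instructions. The register file is modeled as a dictionary; the function name `execute_write_register` and the immediate values are hypothetical, and only the base address 0xb00 is borrowed from the instruction mentioned in the text.

```python
def execute_write_register(registers, add, immediates):
    """Write the n-th immediate (n = 1 .. N) into the register at address add + n - 1."""
    for n, imm in enumerate(immediates, start=1):
        registers[add + n - 1] = imm


registers = {}
# Hypothetical immediates: row position, number of rows, column position, number of columns.
execute_write_register(registers, add=0xB00, immediates=[0, 8, 16, 16])
```

After execution, consecutive registers starting at the base address hold the cache description, where the operator can read it.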

[0240] This embodiment can also adopt an example of the first cache allocation instruction represented by the pseudo code above:

[0241]

[0242] For detailed explanation, please refer to the aforementioned embodiments, where Wr_que0[0xb00],29 is a write register instruction, and the other instructions are write data instructions. It should be noted that an operator usually needs to configure other parameters besides the cache, and the configuration of these other parameters can also be accomplished through write data instructions. Therefore, the write data instructions for implementing cache allocation mentioned above do not necessarily follow the write register instruction immediately; they can also be separated from the write register instruction by one or more other write data instructions.

[0243] In one example of this embodiment, the N' immediate values in the N' write data instructions also include the number K of cache groups used by the operator in multiple loops, to instruct the operator to update the row or column position of the allocated cache step by step with a period of K time steps, so that the operator uses different caches in the K time steps of the same period and the same cache in the k-th time step of different periods, k = 0, 1, ..., K-1, K ≥ 2.

[0244] An example of this embodiment is as follows:

[0245] Wdat 3 // CFG_SVA_BBUF_PPG_NUM

[0246] The instruction has been explained in detail above using the SVA operator as an example, so it will not be repeated here.

[0247] In an exemplary embodiment of this disclosure, the software code decoded by the instruction decoding circuit further includes a second cache allocation instruction located between two adjacent sets of scheduling instructions. The second cache allocation instruction includes: a first operand for indicating the location of the cache to be reallocated for the operator in the next time step; and a second operand for indicating the storage space to be written to by the first operand.

[0248] The instruction execution circuit is further configured to, in response to each decoded second cache allocation instruction, write the first operand therein to the storage space indicated by the second operand therein, so as to allocate a cache for the operator using the storage space to use in the next time step; wherein the caches allocated for the plurality of operators to use in the same time step are located in different and independently accessible regions in the shared cache pool, and for adjacent operators among the plurality of operators, the output cache allocated for the preceding operator to use in the current time step is the input cache allocated for the following operator to use in the next time step.

[0249] This embodiment uses a second buffer allocation instruction between two sets of scheduling instructions to reallocate buffers for operators. This allows operators to switch between multiple buffers, staggering the output buffer used by the operator at the same time step with the input buffer used by the operator at the next time step. This enables two operators to process data in parallel (including reading and writing data), thus working in a pipelined manner to improve efficiency. The two sets of scheduling instructions in this embodiment can be located in different loop processes.

[0250] In one example of this embodiment, among the multiple sets of scheduling instructions, the second cache allocation instruction between two adjacent sets of scheduling instructions is used to reallocate the cache position for each operator that performed processing in the previous time step and still needs to perform processing in the next time step; for each of the multiple operators, starting from the time step when the operator is first started, with a period of K time steps, the cache allocated by the instruction execution circuit for the operator in the K time steps of the same period is different, and the cache allocated for the operator in the k-th time step of different periods is the same, k = 0, 1, ..., K-1, K ≥ 2.
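The rotation over K cache groups, and the hand-over of a cache from the preceding operator to the following operator one time step later, can be checked with a short sketch. It assumes 0-based time steps, that operator j is first started in time step j, and that output group k of operator j is the same region as input group k of operator j + 1; the function names `buffer_index` and `check_pipeline` are hypothetical.

```python
def buffer_index(step, first_step, num_groups):
    """Index k of the cache group used at `step` by an operator first started at `first_step`."""
    if step < first_step:
        raise ValueError("operator has not been started yet")
    return (step - first_step) % num_groups


def check_pipeline(num_ops, num_steps, num_groups):
    """Check hand-over and non-conflict between every pair of adjacent operators."""
    for j in range(num_ops - 1):
        for t in range(j + 1, num_steps):
            written_now = buffer_index(t, j, num_groups)        # operator j output, step t
            read_next = buffer_index(t + 1, j + 1, num_groups)  # operator j+1 input, step t+1
            read_now = buffer_index(t, j + 1, num_groups)       # operator j+1 input, step t
            # The group written now must be read next step, and must differ
            # from the group the following operator is reading right now.
            if written_now != read_next or written_now == read_now:
                return False
    return True


ok_with_two_groups = check_pipeline(num_ops=3, num_steps=10, num_groups=2)
ok_with_one_group = check_pipeline(num_ops=3, num_steps=10, num_groups=1)
rotation = [buffer_index(t, first_step=2, num_groups=4) for t in range(2, 8)]
```

With a single group the preceding operator would write the cache that the following operator is reading in the same time step, which is why the check fails for K = 1 and why the text requires K ≥ 2.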

[0251] In an exemplary embodiment of this disclosure, the second cache allocation instruction includes a write register instruction and M write data instructions associated with the write register instruction, where M ≥ 1, wherein:

[0252] The write register instruction includes: an opcode, a second operand representing the register address Add', and an operand representing the number of immediate values M;

[0253] The write data instruction includes an opcode and an immediate value. The first operand includes M immediate values from the M write data instructions. The M immediate values from the M write data instructions include information about the row or column position of the cache reallocated for the operator.

[0254] The instruction execution circuit responds to each second cache allocation instruction by writing the first operand therein into the storage space indicated by the second operand, including: for each of the M write data instructions, writing the immediate value therein into a register at address Add'+m-1 for the operator to read, where m is the sequence number of the instruction in the M write data instructions, indicating that the instruction is the m-th instruction under the write register instruction, 1≤m≤M.

[0255] An example of a second cache allocation instruction (M=1) is as follows:

[0256]

[0257] The example has been explained in detail above, so it will not be repeated here.

[0258] In an exemplary embodiment of this disclosure, the data to be processed is a frame of data obtained by the radar system from processing received radar signals; the plurality of operators are operators performing 1D-FFT stage processing, and one data block in the data to be processed is data of a chirp signal received on a channel; or, the plurality of operators are operators performing 2D-FFT stage processing, and one data block in the data to be processed is data at the same range gate in the 1D-FFT data. Similar to the foregoing embodiments, when the control device of this disclosure is applied to a radar system, it can use a frame of data as the dataset to be processed in a set of multiple loops.

[0259] This disclosure also provides an embodiment of a multi-operator control method applied to a hardware accelerator, the hardware accelerator including a hardware scheduler, a specified number of operators, and a shared cache pool. As shown in Figure 19, the control method includes:

[0260] Step 200: Allocate caches from the shared cache pool for the multiple operators to use when accessing data, so as to form a data channel between the multiple operators. The allocated caches include input caches and/or output caches.

[0261] Step 201: By scheduling time steps, the multiple operators are started sequentially according to the set connection order to complete the processing of the data to be processed.

[0262] Among the plurality of operators, the output buffer used by the preceding operator at the current time step is the input buffer used by the following operator at the next time step.

[0263] This embodiment allocates caches to multiple operators from a shared cache pool, enabling data transfer between operators via the cache. This achieves soft connections and rapid data interaction between operators, and through time-step scheduling, the multiple operators can be started sequentially according to a set connection order to process the data to be processed. While ensuring processing efficiency, the connection relationship between operators can be changed according to application needs, improving the application flexibility of hardware accelerators using multiple operators. From the perspective of data interaction, the term "adjacent operators" in this disclosure refers to two operators in which the output data of one operator is the input data of the other.

[0264] In an exemplary embodiment of the present disclosure, allocating caches from the shared cache pool for the multiple operators to use when accessing data includes: allocating, from the shared cache pool, the caches to be used by the multiple operators during their first processing; wherein, for adjacent operators among the multiple operators, the output cache allocated for the preceding operator to use during its first processing is the input cache allocated for the following operator to use during its first processing (since the operators are started sequentially, the two will not read from and write to this cache in the same time step). In another embodiment, the initially allocated cache can be used throughout the operator's entire computation task, as described above, and this is not elaborated further here.

[0265] In an exemplary embodiment of the present disclosure, starting the multiple operators in sequence according to a set connection order through time-step scheduling to complete the processing of the data to be processed includes: performing multiple loop controls to control the multiple operators to complete the processing of multiple data blocks in the data to be processed in a pipeline manner according to the set connection order; each loop process includes multiple time steps, and each operator completes one processing of one data block in the data to be processed in one time step. The caches allocated for the multiple operators to use in the same time step are located in different and independently accessible areas in the shared cache pool; and for adjacent operators among the multiple operators, the output cache allocated for the preceding operator to use in the current time step is the input cache allocated for the following operator to use in the next time step. This embodiment can avoid access conflicts among multiple operators through multiple loop controls in combination with cache allocation.

[0266] In an exemplary embodiment of the present disclosure, each loop process includes M time steps, M = m1 + m2 - 1, where m1 is the number of the multiple operators, m2 is the number of data blocks processed in each loop, m1 ≥ 2, and m2 ≥ 2;

[0267] During each loop process:

[0268] When m < m1, the activated operators at the m-th time step are the 1st to the m-th operators, m = 1, 2,..., M;

[0269] When m1 ≤ m ≤ m2, the activated operators at the m-th time step are all the operators among the multiple operators;

[0270] When m2 < m ≤ M, the activated operators at the m-th time step are the (m - m2 + 1)-th to the m1-th operators.

[0271] For this embodiment, reference can be made to the relevant embodiments above.

[0272] In an exemplary embodiment of this disclosure, the method further includes: allocating a number K of buffer groups for each of the plurality of operators to be used in the multiple loops, to instruct the operators to update the position of the allocated buffers step by step over a period of K time steps, so that the operators use different buffers in the K time steps of the same period, and use the same buffer in the k-th time step of different periods, where k = 0, 1, ..., K-1, K ≥ 2. This embodiment configures the number K of buffer groups for the operators, enabling switching between K pairs of buffers (or K buffers) on the operator side to avoid conflicts. In one example, the number K can be the number of receiving channels of the radar system.

[0273] In an exemplary embodiment of this disclosure, the allocation of caches from the shared cache pool for the plurality of operators to access data includes: reallocating caches for operators that process data in both the previous and next time steps between two adjacent time steps; wherein, for each of the plurality of operators, there are K groups of caches allocated and reallocated for that operator during the multiple loops, where K ≥ 2, and each group includes an input cache and / or an output cache; starting from the first processing of the operator, with a period of K time steps, the operator uses the same cache in the k-th time step of different periods, and uses different caches in different time steps of the same period, k = 0, 1, ..., K-1. This embodiment reallocates caches through instructions, eliminating the need for operators to design circuits to update cache information, thus simplifying hardware design.

[0274] In an exemplary embodiment of this disclosure, the data to be processed is a frame of data obtained by the radar system from processing the received radar signal; the plurality of operators are operators that perform 1D-FFT stage processing, and one data block in the data to be processed is the data of a chirp signal received on a channel; or, the plurality of operators are operators that perform 2D-FFT stage processing, and one data block in the data to be processed is the data at the same range gate in the 1D-FFT data.

[0275] One embodiment of this disclosure provides a hardware scheduler, including a control device and an internal memory. The control device is the control device described in any embodiment of this disclosure (such as a first control device or a second control device). The internal memory is configured to store software code to be decoded and executed by the control device. See also Figure 17.

[0276] This disclosure also provides a hardware accelerator, including a hardware scheduler, a specified plurality of operators, a shared cache pool providing cache space for the operators, and a register set, wherein:

[0277] The hardware scheduler adopts the second hardware scheduler described in any embodiment of this disclosure, wherein the cache information allocated to each operator is saved in the register group, and the processed data is stored in external memory;

[0278] The register group is configured to store configuration information for the plurality of operators, the configuration information including information on the cache allocated to the operators;

[0279] The plurality of operators are configured to, upon startup, obtain information about the cache allocated to the operator from the register group; if an input cache is allocated, read the data block to be processed from the input cache and process it; if an output cache is allocated, write the processed data into the output cache.

[0280] The structure of the hardware accelerator in this embodiment is shown in Figure 7.

[0281] In an exemplary embodiment of this disclosure, the operator includes:

[0282] The read data unit is configured to acquire information about the input buffer allocated to the operator, and based on the information about the input buffer, sequentially read out the data blocks in the input buffer and input them into the processing unit;

[0283] The processing unit is configured to process the data block to obtain a processed data block;

[0284] The write data unit is configured to obtain information about the output buffer allocated to the operator, and write the processed data block into the output buffer based on the information about the output buffer.

[0285] The scheduling interface is configured to receive a start signal and use it as a trigger signal to begin processing this operator.

[0286] The synchronization interface is configured to send a signal indicating that the processing is complete after the write data unit writes the processed data block into the output buffer.

[0287] The structure of the operator in this embodiment is shown in Figure 18.

[0288] In one example of this embodiment: the read data unit obtains information about the input buffer allocated to the operator from the register group, and the write data unit obtains information about the output buffer allocated to the operator from the register group;

[0289] The operator also includes a cache update unit;

[0290] The cache update unit is configured to: read from the register group the location and size of the cache initially allocated to the operator, and the number of cache groups K allocated to the operator, where K ≥ 2; and cyclically count the number of times the scheduling interface receives the start signal, starting a new cycle after the accumulated count reaches K times in each cycle, and when the accumulated count in each cycle is k times, calculate the location of the cache used by the operator in the next time step based on the location and size of the initially allocated cache and the value of k, and update the cache location in the register group to the calculated cache location; where k = 0, 1, ..., K-1, and the calculated cache location is different when the value of k is different.
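The behavior of the cache update unit can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the class name `CacheUpdateUnit` is hypothetical, and the address formula `base + k * size` assumes that the K caches of an operator are laid out contiguously, which the disclosure does not require (it only requires that different values of k give different locations).

```python
from dataclasses import dataclass


@dataclass
class CacheUpdateUnit:
    """Software model of the cache update unit of one operator."""

    base: int       # location of the initially allocated cache
    size: int       # size of one cache
    K: int          # number of cache groups, K >= 2
    count: int = 0  # start signals counted in the current period

    def __post_init__(self) -> None:
        if self.K < 2:
            raise ValueError("K must be at least 2")

    def on_start(self) -> int:
        """Count one start signal and return the cache location for the next time step.

        The counter wraps to 0 after K start signals, so a new period begins.
        Assumption: cache k is located at base + k * size.
        """
        self.count = (self.count + 1) % self.K
        return self.base + self.count * self.size


if __name__ == "__main__":
    unit = CacheUpdateUnit(base=0x1000, size=0x100, K=3)
    # The first time step uses the initially allocated cache at 0x1000.
    for _ in range(6):
        print(hex(unit.on_start()))
```

In this model the operator uses three different caches within one period of three time steps and returns to the same cache at the same position of every period.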

[0291] In an exemplary embodiment of this disclosure, the plurality of operators sequentially include: a CQMD operator, an FFT operator, and an SVA operator; or the plurality of operators sequentially include: a DC operator, an FFT operator, and an SVA operator; or the plurality of operators sequentially include: a CMB operator, a STAS operator, a HIST operator, a CFAR operator, and a STAS operator; or the plurality of operators sequentially include: a DC operator, an FFT operator, an SVA operator, and a CMB operator.

[0292] One embodiment of this disclosure also provides a computing system, including a processor, memory, and a hardware accelerator, wherein:

[0293] The processor is configured to load the data to be processed stored in the memory into the hardware accelerator;

[0294] The memory is configured to store data to be processed, and to store data obtained by the hardware accelerator after processing the data to be processed.

[0295] The hardware accelerator employs the hardware accelerator described in any of the disclosed embodiments.

[0296] In an exemplary embodiment of this disclosure, the computing system is a system-on-a-chip, which may be a millimeter-wave chip or a sensor chip in a radar system.

[0297] One embodiment of this disclosure also provides an automated method for generating operating instructions for a radar digital signal processing hardware accelerator.

[0298] As mentioned above, to effectively resolve the contradiction between application flexibility and high performance / low power consumption in hardware accelerators, this disclosure proposes a novel hardware-software interaction scheme based on software code implementation. Users can develop different software codes for different applications according to their needs, customizing the connection relationships and data flows between multiple operators in the hardware accelerator. During actual operation, both the software code and the data flow can be processed by hardware circuitry. This solves the problem of limited development flexibility while also possessing high performance. The hardware accelerator that runs this software code to perform corresponding processing and the software code it executes have been described in detail above. To facilitate development, a method for automatically generating the software code for the hardware accelerator of this disclosure embodiment is also provided.

[0299] Taking a hardware accelerator used in a radar system as an example, before generating software code, the hardware accelerator running the software code needs to be designed first, determining the operators to be designed in the hardware accelerator. To this end, the hardware required for radar signal processing can be broken down into multiple subdivided operators. Each operator, as an independent radar signal processing unit, can implement a basic computational function and can be controlled and scheduled by a hardware scheduler. For example, the hardware scheduler can perform operations such as register configuration, register reading, startup, and synchronization related to the operators through instructions in the software code. When designing operators, the smallest computational units can be identified based on the computational characteristics required by the radar application, such as FFT, complex multiplication, complex addition, complex subtraction, and real number comparators, as candidates for operators in the hardware accelerator to meet possible combination requirements. Furthermore, if a computational task is implemented by multiple operators, and multiple operators may be working simultaneously at a certain time step, when splitting operators, it can also be considered that the amount of data to be computed by multiple operators should be approximately the same at a certain time step. If multiple operators have roughly the same processing capacity, they can complete data processing almost simultaneously at this time step, thus ensuring the highest efficiency of the application of multiple operators.

[0300] The foregoing embodiments of this disclosure have proposed solutions for operator cache allocation and scheduling, including corresponding control devices, hardware schedulers, hardware accelerators, computing systems, and multi-operator control methods. In order to facilitate the design and development for different applications, a code generation method is also needed to automatically generate the software code required for the hardware accelerator of this disclosure to achieve efficient cache allocation and scheduling.

[0301] Therefore, one embodiment of this disclosure provides a code generation method for generating software code used by a hardware accelerator, wherein the hardware accelerator includes a hardware scheduler that executes the software code, a shared cache pool, and various types of operators; as shown in Figure 21, the code generation method includes:

[0302] Step 400: Display an interactive interface for generating software code, and receive user input through the interactive interface;

[0303] Step 401: Determine configuration information based on user input. The configuration information includes: multiple operators involved in the processing and their algorithm parameters, information on the data flow when the multiple operators perform the processing, and information on the cache of the multiple operators.

[0304] Step 402: Automatically generate software code based on the configuration information. The software code includes multiple cache allocation instructions. When the multiple cache allocation instructions are executed, caches used for accessing data are allocated from the shared cache pool for the multiple operators respectively, so as to form a data channel through which the data flow passes through the multiple operators.
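Steps 401 and 402 can be sketched as a small Python generator that turns configuration information into a textual instruction listing. Everything specific here is an assumption made for illustration: the class `OperatorConfig`, the function `generate_code`, and the mnemonics `WRITE_REG`, `ALLOC_IN`, and `ALLOC_OUT` are hypothetical and are not the instruction set of the disclosure. The data flow is taken to be the order of the operator list, and the output cache of each operator is reused as the input cache of the next operator to form the data channel.

```python
from dataclasses import dataclass


@dataclass
class OperatorConfig:
    """Configuration information for one operator (hypothetical layout)."""

    name: str               # operator type, e.g. "FFT"
    params: dict[str, int]  # algorithm parameters, written to registers
    out_size: int           # size of the output cache to allocate


def generate_code(ops: list[OperatorConfig], pool_base: int = 0) -> list[str]:
    """Generate a listing with register writes and cache allocation instructions."""
    listing: list[str] = []
    next_free = pool_base  # simple bump allocation inside the shared cache pool
    prev_out: tuple[int, int] | None = None
    for op in ops:
        for reg, value in op.params.items():
            listing.append(f"WRITE_REG {op.name}.{reg}, {value}")
        if prev_out is not None:
            # The input cache is the output cache of the preceding operator.
            listing.append(f"ALLOC_IN {op.name}, addr=0x{prev_out[0]:X}, size={prev_out[1]}")
        listing.append(f"ALLOC_OUT {op.name}, addr=0x{next_free:X}, size={op.out_size}")
        prev_out = (next_free, op.out_size)
        next_free += op.out_size
    return listing


if __name__ == "__main__":
    config = [
        OperatorConfig("DC", {"enable": 1}, 256),
        OperatorConfig("FFT", {"window": 1}, 512),
    ]
    print("\n".join(generate_code(config)))
```

A real generator would additionally emit scheduling instructions and would run the legality checks before producing any code; the sketch only shows how a data channel follows from the cache allocation instructions.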

[0305] The code generation method of this embodiment can be applied to a software development platform with a software development kit installed. The hardware accelerator using the software code generated in this embodiment can be any of the hardware accelerators described in the above embodiments of this disclosure; see also Figure 7. The interactive interface in this embodiment is a visual interactive interface, but it is not limited to this. The software development platform can provide multiple interactive interfaces, and users can enter different types of information through different interactive interfaces. This disclosure does not impose any restrictions on the specific design of the interactive interface. The hardware accelerator can contain one or more operators of each type. The multiple operators involved in processing, determined according to user input (i.e., the multiple operators specified above), include at least two types, and there can be one or more operators of each type. For example, if the processing time of one operator is significantly longer than that of the other operators, the user can set multiple operators of the same type to process the data output by the previous operator in parallel to improve efficiency. The shared cache pool in the hardware accelerator does not necessarily have to provide shared cache for all operators.

[0306] After the configuration information is determined, it can be input into a trained code generation model to generate the corresponding software code. The model can be an artificial intelligence model or a model built on set code generation rules; this disclosure does not impose any restrictions on this. The content of the software code generated in this embodiment can be found in the pseudocode above. Its actual format can be binary software code and / or assembly language software code. The binary code can be loaded into the hardware scheduler in various ways (see Figure 17 and its description) and is decoded and executed by the hardware scheduler in the hardware accelerator, which then schedules the multiple operators participating in the processing, including starting the operators; after the operators are started, they process their respective input data according to the configured algorithm parameters, cache information, etc. Taking the application of the hardware accelerator in radar digital signal processing as an example, the scheduling of each basic function of radar signal processing is controlled by the hardware scheduler; the hardware scheduler executes a segment of instructions (i.e., the software code described above) generated by software according to the application requirements, and sequentially schedules and controls the operators that execute the basic functions of radar signal processing.

[0307] In this embodiment, the software development platform can determine the configuration information according to the user input, namely the multiple operators participating in the processing and their algorithm parameters, the information on the data flow when the multiple operators execute the processing, and the information on the caches of the multiple operators; it then automatically generates software code, in which multiple cache allocation instructions respectively allocate caches used for data access to the multiple operators, forming a data channel through which the data flow passes through the multiple operators. Therefore, based on this embodiment, the user can specify the multiple operators participating in the processing through input operations, and can also customize the required data flow by changing the connection relationships and data flow direction between the multiple operators. The software code generated based on this embodiment can implement soft connections between operators in the hardware accelerator, improving application flexibility; and because data interaction between operators is realized through the shared cache pool in the hardware accelerator, without transferring data between the inside and outside of the accelerator, high performance and low power consumption can be achieved, resolving the contradiction between application flexibility and high performance and low power consumption in the hardware accelerator.

[0308] In this embodiment, the software development platform can pre-save the types and quantities of operators supported by the hardware accelerator and display them to the user through a corresponding visual interaction interface. For example, the hardware accelerator in a certain chip may have M1 operators, but only the functions of M2 operators (M2 < M1) are required in some applications. In one example, the user can specify multiple operators participating in the processing by selecting the displayed operators, so that when the generated software code is executed, only the specified multiple operators are configured, cached, and scheduled, etc., that is, enabling (enable) multiple operators participating in the processing in the hardware accelerator and disabling (disable) other operators. However, in another example, the interaction interface can also support the user to directly input the types and quantities of the specified operators. When there are operators of this type and sufficient quantity in the hardware accelerator, the specified quantity of operators of this type is used as the operators participating in the processing. When the hardware accelerator does not support the operators of this type or the quantity is insufficient, the user is reminded and an error message is given.

[0309] The configuration information also includes algorithm parameters for multiple operators. For example, for the FFT operator, algorithm parameters can be whether the input data is real or complex, and whether windowing is enabled or disabled. For the Peak Search operator, algorithm parameters can be the peak search coefficients, the parameters of the reference window and mask. For the DoA operator, algorithm parameters can be the angle resolution grouping method, angle resolution accuracy, and resolution. When an operator can employ multiple algorithms, the algorithm parameters can also include the type of the selected algorithm. Users can directly input text or numbers, or select the corresponding options to configure algorithm parameters, or the software development platform can calculate the algorithm parameters based on user input. The operator's algorithm parameters can be written to the corresponding register set of the operator using write register instructions in the software code.

[0310] The information about the data flow during the processing by the multiple operators involves the connection relationships and data flow direction between the operators. It can be obtained from user input on a visual interface. For example, the user can first determine the computational task to be performed by the hardware accelerator, and then specify the operators that participate in the processing according to the computational task. This can be done by dragging the icons of the specified operators to the area of the visual interface for setting the data flow, arranging them in the expected order of the data flow, and adding markers between the operators to indicate the data flow direction; data flow loops are supported (the number of loops can be determined by user input). The resulting graph contains the connection relationships and data flow direction between the multiple operators, and this graph can be used as user input to determine the data flow information. However, this is only an example, and the disclosure does not limit the method of user input, as long as the required configuration information can be obtained. For example, if the user does not adjust some default parameters in the software development platform, these parameters are also considered part of the user input.

[0311] In one exemplary embodiment of this disclosure, the plurality of operators are dedicated hardware operators; or, a portion of the plurality of operators are dedicated hardware operators, and another portion are general-purpose operators implemented based on a general-purpose processor.

[0312] Based on the method of this embodiment, users can generate the required software code through a software development platform and define the data flow between hardware accelerator operators through the software code, including interspersing general-purpose operators among dedicated hardware operators and exchanging computational data between dedicated hardware operators and general-purpose operators. The general-purpose processor on which a general-purpose operator is based can be one or more of a DSP, CPU, DPU, and APU, and the general-purpose processor can run software to perform computation.

[0313] Data exchange between general-purpose operators and dedicated hardware operators can be achieved through external memory (such as the memory in Figure 7 that is located on the same chip as the hardware accelerator but outside of it). For example, the hardware scheduler can use DMA to move data processed by a dedicated hardware operator to external memory and notify a general-purpose operator. After processing the moved data, the general-purpose operator notifies the hardware scheduler, which then reads the processed data from external memory and writes it to the input cache of a dedicated hardware operator in the shared cache pool. This dedicated hardware operator then performs further processing on the data. Dedicated hardware operators and general-purpose operators can exchange data any number of times.

[0314] In one exemplary embodiment of this disclosure, the software code is used for radar digital signal processing. The instructions in the software code are dedicated microinstructions for radar digital signal processing, with instruction lengths of 16 bits, 32 bits, or 64 bits. Using short instructions can increase flexibility and save cache space in the software code.

[0315] In an exemplary embodiment of this disclosure, the software code is used for radar digital signal processing. The plurality of operators sequentially include: CQMD operator, FFT operator, and SVA operator; or sequentially include DC operator, FFT operator, and SVA operator; or sequentially include CMB operator, STAS operator, HIST operator, CFAR operator, and STAS operator; or sequentially include DC operator, FFT operator, SVA operator, and CMB operator. However, this is merely exemplary and not exhaustive. The plurality of operators involved in the processing may only be a portion of these operators, or may include one or more of the following: a multi-channel parallel complex multiply-accumulate operator, a multi-channel parallel real multiply-accumulate operator, a peak search operator, a common angle resolution operator, and an ADC interference detection operator.

[0316] In an exemplary embodiment of this disclosure, the plurality of operators includes one or more of the following operators:

[0317] A first DMA operator is used to load data to be processed from an external source into the shared cache pool. The algorithm parameters of the first DMA operator include the size and storage location of the data to be processed. The cache allocated to the first DMA operator is an output cache.

[0318] A second DMA operator is used to store the data obtained after the processing of the data to be processed by the plurality of operators to an external source. The algorithm parameters of the second DMA operator include the size and storage location of the processed data, and the cache allocated to the second DMA operator is an input cache.

[0319] Through the scheduling of DMA operators, this embodiment loads the data to be processed from an external source (such as the memory in the system shown in Figure 7) into the shared cache pool and stores the processed data externally. The external storage locations of the data to be processed and the processed data can be obtained directly from the user input or automatically allocated based on the user input data (such as the size of the data to be processed). This location information can be saved as algorithm parameters of the corresponding DMA operator to the registers of the DMA operator. The DMA operator uses this location information to load and store the data.

[0320] When applied to a radar system in this embodiment, the data to be processed may include ADC data, peak search raw data, raw FFT data, etc. The results obtained by the hardware accelerator after processing the data may include interference detection results, FFT results such as multi-dimensional FFT processing results, peak search results, raw FFT data after peak search, DoA angle results (the DoA angle determined after processing), complex or real number multiplication and addition results, etc. In one example, the configuration information may also include relevant parameters of the data to be processed, such as FFT windowing coefficients, peak search direction parameters, DoA antenna calibration parameters, and angle resolution direction weight parameters, which are generally parameters obtained through calibration. These parameters can serve as algorithm parameters for the multiple operators.

[0321] In an exemplary embodiment of this disclosure, before automatically generating software code based on the configuration information, the method further includes: performing any one or more of the following legality checks on the configuration information, and automatically generating the software code only after the checks pass;

[0322] Whether the hardware accelerator supports the multiple operators and their algorithm parameters;

[0323] Whether the hardware accelerator supports the processing flow corresponding to the data stream;

[0324] Whether the number of operators activated at the same time step meets the requirement of maximum parallelism of multiple operators;

[0325] Whether each of the plurality of operators satisfies the minimum parallelism requirement for data input and output;

[0326] Whether the shared cache pool has enough space to allocate caches for the multiple operators, and whether there are any cache allocation conflicts.

[0327] The rules for the validity check in this embodiment can be predefined by the user based on the hardware accelerator running the software code. These rules may include supported operator types, maximum parallelism of multiple operators (the number of operators that can be activated simultaneously), minimum parallelism of a single operator (the minimum amount of data that a single operator can process in parallel), and the maximum capacity of the shared cache pool available for allocation. The validity check can constrain the amount of data exchanged between operators, the connection relationships between adjacent operators, the number of simultaneously activated operators, and the size and location of the cache allocated to operators.

[0328] For example, the check prevents operators from using the same or overlapping caches, or caches located in areas that cannot be accessed simultaneously. Also, if the total size of the caches for multiple operators exceeds the available cache size in the chip's shared cache pool, an error can be flagged, specifying the cause. Furthermore, for multiple operators and data streams specified by the user for processing, the processing capacity required for each time step (stage) needs to be analyzed. If the number of operators to be activated (i.e., parallel processing operators) at a certain time step exceeds the maximum parallelism requirement, an error can be flagged, specifying the cause. Validity checks ensure that software code can run correctly on hardware accelerators and guarantee performance.
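Part of the legality check described above can be sketched in Python as follows. The names `Cache` and `check_config` and the wording of the error messages are hypothetical. The sketch covers only three of the rules (pool capacity, overlapping caches, and maximum parallelism), and it assumes that a cache shared by two adjacent operators is listed once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cache:
    """One cache region requested in the shared cache pool (hypothetical model)."""

    name: str
    start: int
    size: int


def check_config(caches: list[Cache], pool_size: int,
                 active_per_step: list[int], max_parallel: int) -> list[str]:
    """Return a list of error messages; an empty list means the checks passed."""
    errors: list[str] = []
    total = sum(c.size for c in caches)
    if total > pool_size:
        errors.append(f"total cache size {total} exceeds pool size {pool_size}")
    for c in caches:
        if c.start < 0 or c.start + c.size > pool_size:
            errors.append(f"cache {c.name} lies outside the shared cache pool")
    # After sorting by start address, any overlap shows up between neighbors.
    ordered = sorted(caches, key=lambda c: c.start)
    for a, b in zip(ordered, ordered[1:]):
        if a.start + a.size > b.start:
            errors.append(f"caches {a.name} and {b.name} overlap")
    for step, count in enumerate(active_per_step, start=1):
        if count > max_parallel:
            errors.append(
                f"time step {step} activates {count} operators, maximum is {max_parallel}")
    return errors


if __name__ == "__main__":
    caches = [Cache("dc_out", 0, 128), Cache("fft_out", 64, 128)]
    for message in check_config(caches, 256, [1, 2, 3], 2):
        print("error:", message)
```

Returning every violation with its cause, instead of stopping at the first one, matches the intent that the user is told why the configuration was rejected.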

[0329] Legality checks are not limited to the above-mentioned methods. For example, they can check whether the memory used to store data to be processed is out of bounds. Another example is when software code is used for radar digital signal processing, the user can input application parameters for the radar system, and the software development platform can check whether the hardware accelerator supports these application parameters. If the check fails, the user is prompted to modify them. These application parameters can be used directly as configuration information or to obtain configuration information.

[0330] In an exemplary embodiment of this disclosure, the software code further includes multiple sets of scheduling instructions. When the multiple sets of scheduling instructions are executed, the multiple operators are scheduled step by step, so that the multiple operators are started sequentially according to the order in which the data stream passes through the operators, and the processing of the data to be processed is completed in a pipeline manner. Each time step starts when a set of scheduling instructions is executed and ends when the execution of that set of scheduling instructions ends.

[0331] This embodiment uses a set of scheduling instructions to schedule multiple operators. This set of scheduling instructions may include a start instruction and a synchronization instruction. Alternatively, it may include a start instruction and a delay instruction, in which case each time step begins when the start instruction is executed and ends after a set delay according to the delay instruction. See the description of the foregoing embodiment.

[0332] Besides scheduling operators via a hardware scheduler executing scheduling instructions, other embodiments can also achieve data interaction between operators through communication. For example, each operator participating in the processing can be configured with information about adjacent operators and written into the corresponding register set. The preceding operator in an adjacent operator obtains the information of the following operator from the corresponding register set, writes the processed data block to the output buffer, and then notifies the following operator through a pre-defined communication line between operators. The following operator then reads data from its input buffer (i.e., the output buffer of the preceding operator) for processing. At this time, the preceding operator can process the next data block through another set of buffers. This method can also realize data interaction between multiple operators in a pipeline manner, with simpler software code but relatively more complex hardware.

[0333] In an exemplary embodiment of this disclosure, when the multiple cache allocation instructions are executed, the caches allocated to the multiple operators respectively satisfy the following: among adjacent operators, the output cache used by the preceding operator at the current time step is the same as the input cache used by the subsequent operator at the next time step, and the caches used by the multiple operators at the same time step are located in different independently accessible regions of the shared cache pool.

[0334] By satisfying the two conditions above when allocating caches, this embodiment forms a data channel between adjacent operators, so that data is exchanged according to the defined data flow and errors caused by a subsequent operator being unable to read data or reading incorrect data are avoided. Furthermore, pipelined parallel processing by multiple operators can be achieved through concurrent access to different cache areas. Adjacent operators in this disclosure are defined from the perspective of data interaction. When an operator has a loop, it is equivalent to connecting multiple operators of the same type in series. If the output data of the current operator is used as the input data for iterative calculation, then the current operator and the subsequent operator do not constitute the aforementioned adjacent operators.
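The two allocation conditions can be checked on a small model. The Python sketch below assumes a ping-pong arrangement with two caches per connection between adjacent operators, which is one possible allocation and not necessarily the one used in the disclosure. The names `link_buffer`, `buffers_at`, and `allocation_is_valid` are hypothetical, each (link, slot) pair is assumed to map to its own independently accessible region, and all operators are assumed to be active (steady state of the pipeline).

```python
from typing import Optional

Buffer = tuple[int, int]  # (link index, slot index): hypothetical region identifier


def link_buffer(link: int, step: int) -> Buffer:
    """Cache written at time step `step` on the link from operator `link` to `link + 1`."""
    return (link, step % 2)  # two slots per link, used alternately


def buffers_at(step: int, n_ops: int) -> dict[int, tuple[Optional[Buffer], Optional[Buffer]]]:
    """Map each operator (0-based) to its (input cache, output cache) at a time step."""
    alloc = {}
    for i in range(n_ops):
        # The input is what the preceding operator wrote one time step earlier.
        inp = link_buffer(i - 1, step - 1) if i > 0 else None
        out = link_buffer(i, step) if i < n_ops - 1 else None
        alloc[i] = (inp, out)
    return alloc


def allocation_is_valid(n_ops: int, n_steps: int) -> bool:
    """Check both conditions of paragraph [0333] for time steps 1..n_steps."""
    for t in range(1, n_steps + 1):
        now, nxt = buffers_at(t, n_ops), buffers_at(t + 1, n_ops)
        # Condition 1: output cache now equals the successor's input cache next step.
        for i in range(n_ops - 1):
            if now[i][1] != nxt[i + 1][0]:
                return False
        # Condition 2: no cache is used twice within the same time step.
        used = [b for pair in now.values() for b in pair if b is not None]
        if len(used) != len(set(used)):
            return False
    return True


if __name__ == "__main__":
    print(allocation_is_valid(n_ops=3, n_steps=10))
```

In this model the first operator has only an output cache and the last operator has only an input cache, similar to the first and second DMA operators.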

[0335] In an exemplary embodiment of this disclosure, the configuration information further includes a loop count Q; Q ≥ 2; the software code further includes a loop start instruction preceding the multiple sets of scheduling instructions and a loop end instruction following the multiple sets of scheduling instructions, the loop start instruction including an operand representing the loop count Q; when the software code is executed, the processing of the data to be processed is completed through Q loops.

[0336] This embodiment divides the data processing into multiple loops. In each loop, multiple operators are scheduled using scheduling instructions, which greatly simplifies the software code. The loop count Q can be directly input by the user as an application parameter of the radar system, or it can be determined by other application parameters of the radar system. In the loop start instruction, the loop count Q can be represented by an immediate value or derived from a general-purpose register. Since multiple sets of scheduling instructions are executed in each loop, each loop includes multiple time steps, and one set of scheduling instructions can activate one or more operators.
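The loop structure can be sketched as a listing generator in Python. The mnemonics `LOOP_START`, `START`, `SYNC`, and `LOOP_END` and the function `emit_loop` are hypothetical placeholders, the loop count Q is encoded as an immediate value, and each set of scheduling instructions is assumed to consist of one start instruction and one synchronization instruction.

```python
def emit_loop(Q: int, steps: list[list[str]]) -> list[str]:
    """Emit a loop that runs the given sets of scheduling instructions Q times.

    `steps` lists, for every time step of one loop, the operators to activate.
    """
    if Q < 2:
        raise ValueError("loop count Q must be at least 2")
    listing = [f"LOOP_START {Q}"]  # Q as an immediate operand
    for ops in steps:
        names = ", ".join(ops)
        listing.append(f"START {names}")  # the time step begins
        listing.append(f"SYNC {names}")   # the time step ends when all operators finish
    listing.append("LOOP_END")
    return listing


if __name__ == "__main__":
    steps = [["DC"], ["DC", "FFT"], ["FFT"]]
    code = emit_loop(128, steps)
    print("\n".join(code))
    # The loop needs 2 * len(steps) + 2 instructions, independent of Q;
    # a fully unrolled listing would need 2 * len(steps) * Q instructions.
    print(len(code), "instructions instead of", 2 * len(steps) * 128)
```

The size comparison printed at the end indicates why expressing the repetition with a loop start instruction and a loop end instruction keeps the software code short.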

[0337] In an exemplary embodiment of this disclosure, the data to be processed is a frame of data obtained by the radar system from processing the echo signal, and the loop count Q is equal to Num or to Num+1, where Num is the number of chirps in a frame of data; for example, if the number of chirps in a frame of data is 128, the loop count Q can be determined to be 128 or 129.

[0338] The plurality of operators are operators that perform 1D-FFT stage processing. Each operator completes the processing of data from one chirp signal on one channel in one time step, that is, the data from one chirp signal constitutes a data block processed by the 1D-FFT stage operator in one time step; or, the plurality of operators are operators that perform 2D-FFT stage processing. Each operator completes the processing of data at the same distance gate in the 1D-FFT data in one time step, that is, the data at the same distance gate in the 1D-FFT data constitutes a data block processed by the 2D-FFT stage operator in one time step.

[0339] In an exemplary embodiment of this disclosure, the cache information includes the size and location of the cache, and the plurality of cache allocation instructions include a plurality of first cache allocation instructions located before the loop start instruction;

[0340] When the plurality of first cache allocation instructions are executed, the shared cache pool allocates caches for the plurality of operators to be used for the first processing, and in adjacent operators, the output cache allocated for the first processing of the preceding operator is the input cache allocated for the first processing of the following operator.

[0341] The hardware accelerator also includes a register set to store the operator's configuration parameters; the first cache allocation instruction includes: a first operand, used to indicate the size and location of the cache allocated to the operator; and a second operand, used to indicate the address of the register to which the first operand is to be written for the operator to read.

[0342] The first cache allocation instruction in this embodiment has been described in detail above and will not be repeated here.

[0343] In one example of this embodiment, determining the configuration information based on user input includes: obtaining the application parameters of the system to which the plurality of operators belong, and the type and size of the data to be processed, based on user input; for each operator, obtaining the amount of data accessed by the operator when performing one processing operation based on the type of the operator, the application parameters, and the type and size of the data to be processed; determining the size of the cache allocated to the operator and allocating the cache location based on the amount of data; wherein, the cache of each operator includes an input cache and / or an output cache, and the amount of data accessed by the operator when performing one processing operation includes the amount of input data in the input cache and / or the amount of output data in the output cache.

[0344] Taking a radar system as an example, given the application parameters of the radar system, the type and size of the data to be processed, and the type of operator, the size of the input and output data of an operator can be calculated or obtained by the software development platform based on this information. The application parameters of the radar system can include any one or more of the following: ADC sampling frequency, FMCW waveform parameters, the number of chirp signals in a frame of data, the number of transmitting antennas, the number of receiving antennas, and the number of receiving channels, etc. The size of the input buffer used by the operator can be determined based on the size of the input data to be read by the operator, and can be greater than or equal to the size of the input data; similarly, the size of the output buffer used by the operator can be determined based on the size of the output data to be written by the operator, and can be greater than or equal to the size of the output data. However, in other embodiments, the buffer size can also be directly determined based on user input, that is, the user can directly set the buffer size of a certain operator. But this requires sufficient user experience; the method in this embodiment is more user-friendly.
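The sizing rule above can be sketched as follows. This is a minimal illustration only: the helper name `buffer_bytes` and the allocation granularity `align` are assumptions made for the example, since the disclosure only requires that a buffer be greater than or equal to the amount of data the operator reads or writes in one processing operation.

```python
def buffer_bytes(num_elements: int, bytes_per_element: int, align: int = 64) -> int:
    """Return a buffer size >= the data amount, rounded up to a multiple of `align`.

    `align` is an assumed allocation granularity, not a value from the disclosure.
    """
    if num_elements < 0 or bytes_per_element <= 0 or align <= 0:
        raise ValueError("element count must be >= 0; element size and align must be > 0")
    data_bytes = num_elements * bytes_per_element
    # Ceiling division keeps the result greater than or equal to data_bytes.
    return -(-data_bytes // align) * align
```

For instance, an operator that reads 512 samples of 4 bytes each in one time step would get a 2048-byte input buffer under these assumptions.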

[0345] After the user customizes the operators and data streams involved in the processing, data interaction between operators is achieved through caching. The cache allocated to the operators can be invisible to the user. The software development platform determines the size of the cache allocated to the multiple operators. When a validity check confirms that the shared cache pool has sufficient space for allocation, the platform can automatically allocate caches to the multiple operators, including the cache size and location (such as the initial location). The software development platform can display the size and / or location of the cache automatically allocated to each operator. Users can adjust the size and / or location of the cache automatically allocated by the platform to ensure efficient application of operators and caches. If the user makes adjustments and passes the validity check, a corresponding cache allocation instruction is generated based on the user-adjusted cache size and / or location. The cache location is described according to the cache physical space description method defined by the software development platform, such as row, column, offset, and bank.
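The validity check on cache allocation described above might look like the sketch below. It assumes, purely for illustration, that the shared cache pool is a rectangle of `pool_rows` by `pool_cols` units and that each cache is a rectangular region described by a row offset, row size, column offset, and column size; the `Region` type and `check_allocation` function are hypothetical names. A cache shared by two adjacent operators would appear here as a single region.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    """Assumed rectangular description of one cache in the shared pool."""
    row_offset: int
    row_size: int
    col_offset: int
    col_size: int


def fits(pool_rows: int, pool_cols: int, r: Region) -> bool:
    """True if the region is non-empty and lies entirely inside the pool."""
    return (r.row_offset >= 0 and r.col_offset >= 0
            and r.row_size > 0 and r.col_size > 0
            and r.row_offset + r.row_size <= pool_rows
            and r.col_offset + r.col_size <= pool_cols)


def overlaps(a: Region, b: Region) -> bool:
    """True if the two rectangles share at least one unit of the pool."""
    return (a.row_offset < b.row_offset + b.row_size
            and b.row_offset < a.row_offset + a.row_size
            and a.col_offset < b.col_offset + b.col_size
            and b.col_offset < a.col_offset + a.col_size)


def check_allocation(pool_rows: int, pool_cols: int,
                     regions: Dict[str, Region]) -> List[str]:
    """Return a list of problems; an empty list means the check passes."""
    errors = []
    names = list(regions)
    for name in names:
        if not fits(pool_rows, pool_cols, regions[name]):
            errors.append(f"{name}: outside the shared pool")
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlaps(regions[a], regions[b]):
                errors.append(f"{a}/{b}: regions overlap")
    return errors
```

A platform could run such a check both on its automatic allocation and again after the user adjusts a size or location, generating cache allocation instructions only when the returned list is empty.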

[0346] In one example of this embodiment, the first cache allocation instruction includes a write register instruction and N' write data instructions from among the N write data instructions associated with the write register instruction, wherein:

[0347] The write register instruction includes: an opcode, a second operand representing the register address Add, and an operand representing the number of immediate values N, where N ≥ N' ≥ 2;

[0348] Each write data instruction includes an opcode and an immediate value, the first operand including the N' immediate values from the N' write data instructions, which represent the row position, number of rows, column position, and number of columns of the cache allocated to the operator;

[0349] When the first cache allocation instruction is executed, the immediate value of each instruction in the N' write data instructions is written to the register at address Add+n-1 for the operator to read, where n is the sequence number of the instruction in the N write data instructions, and 1≤n≤N.

[0350] For the format of the first cache allocation instruction in this example, please refer to the corresponding description above in conjunction with the pseudocode.
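The addressing rule of this example (the n-th write data instruction writes its immediate value to address Add+n-1) can be sketched as below. The register file is modelled as a plain dictionary, and the function name `apply_reg_write` and the concrete address and values are assumptions chosen only for illustration.

```python
from typing import Dict, List


def apply_reg_write(registers: Dict[int, int], add: int,
                    immediates: List[int]) -> Dict[int, int]:
    """Write the n-th immediate value (n counted from 1) to address add + n - 1."""
    for n, value in enumerate(immediates, start=1):
        registers[add + n - 1] = value
    return registers


# Hypothetical cache description: row position 0, 16 rows, column position 0, 512 columns.
regs = apply_reg_write({}, 0x100, [0, 16, 0, 512])
```

After this call, the four consecutive registers starting at the assumed address 0x100 hold the cache description for the operator to read.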

[0351] In one example of this embodiment, the cache information further includes the number of cache groups (pingpong-group) K used by the operator, where K ≥ 2; the software code also includes one or more second cache allocation instructions located between two adjacent groups of scheduling instructions. The second cache allocation instruction includes: a first operand, indicating the location of the cache to be reallocated to an operator in the next time step; and a second operand, indicating the address of the register to which the first operand is to be written. When the first and second cache allocation instructions for the same operator are executed, the operator switches in sequence among the K groups of caches during processing, with a period of K time steps. This example reallocates operator caches through the second cache allocation instructions in the software code. The location indicated in the first cache allocation instruction can be used as the initial location; shifting this initial location one or more times in the row or column direction yields the location of the reallocated cache, which is then used as the operand of the second cache allocation instruction. The operand of the second cache allocation instruction only needs to indicate the first row or first column position of the reallocated cache.

[0352] In another example of this embodiment, the cache information further includes the number of cache groups (pingpong-group) K used by the operator, where K ≥ 2; the first operand is further used to indicate the number of groups K. During processing, each operator switches in sequence among the K groups of caches, with a period of K time steps, based on the size of the allocated cache and the number of groups K. In this example, the operator side configures the cache location and the number of cache groups K according to the first cache allocation instruction and implements the cache switching with its own logic circuit during processing, so no second cache allocation instruction needs to be generated.

[0353] In the above two examples, the number of groups K is the number of cache groups used by the operator in the Q loops, which can be the number of input caches used by the operator during the Q loops, the number of output caches used by the operator, or the number of pairs of input and output caches used by the operator. One cycle of allocating and using caches for the operator, which includes K time steps, can fall within one loop or extend beyond one loop. K is determined according to user input or is a default value. For example, when the data of each receiving channel is processed independently, K is equal to the number of receiving channels.
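The periodic switching among K cache groups can be illustrated with a short sketch. It assumes, for illustration only, that the K groups are stacked contiguously in the row direction, each `row_size` rows tall, and that time steps are counted from 0; the function name is hypothetical.

```python
def pingpong_row_offset(initial_row: int, row_size: int, k: int, time_step: int) -> int:
    """First row of the cache group used at `time_step` (0-based), with period k."""
    if k < 2:
        raise ValueError("K must be at least 2")
    # Shift the initial location by a whole number of groups, wrapping every k steps.
    return initial_row + (time_step % k) * row_size


offsets = [pingpong_row_offset(0, 16, 4, t) for t in range(6)]
```

With K = 4 the operator returns to its initial cache on the fifth time step. In the first example above these offsets would be carried by second cache allocation instructions, whereas in the second example the operator's own logic would derive them.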

[0354] In an exemplary embodiment of the present disclosure, between the loop start instruction and the loop end instruction, there are M groups of scheduling instructions. Each group of scheduling instructions includes a start instruction and a synchronization instruction. The operand of the start instruction in the m-th group of scheduling instructions is used to indicate the activated operator at the m-th time step, that is, the operator to be started at the m-th time step, where m = 1, 2,..., M; among them, the m-th time step starts from the execution of the start instruction in the m-th group of scheduling instructions and ends when the synchronization instruction in the m-th group of scheduling instructions is executed. M is the number of time steps included in each loop.
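The loop structure with M groups of scheduling instructions can be sketched as a small generator of pseudocode lines. The `start_eng_qn` and `sync_qn` mnemonics follow the pseudocode used in this disclosure; the `loop_start` and `loop_end` mnemonics and the function name are assumptions, since the text does not give the mnemonics of the loop instructions.

```python
from typing import List


def emit_loop_program(q: int, steps: List[List[str]]) -> List[str]:
    """Emit pseudocode for Q loops; `steps[m-1]` lists the operators activated at time step m."""
    if q < 2:
        raise ValueError("the loop count Q must be at least 2")
    lines = [f"loop_start {q}"]          # hypothetical mnemonic, operand is Q
    for operators in steps:
        selected = "+".join(operators)
        lines.append(f"start_eng_qn {selected}")  # time step begins
        lines.append(f"sync_qn {selected}")       # time step ends when all are done
    lines.append("loop_end")             # hypothetical mnemonic
    return lines


program = emit_loop_program(128, [["fft"], ["fft", "cfar"]])
```

Each pair of start and synchronization lines corresponds to one time step, so the number of pairs equals M.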

[0355] The synchronization instruction of this embodiment can also carry an operand indicating the activated operators at the m-th time step, that is, the operators to be started at the m-th time step. Since the synchronization instruction and the start instruction appear in pairs, this operand can also be omitted, in which case the operand of the start instruction in the same group is used by default. The exemplary pseudocode of the start instruction and the synchronization instruction, as well as the specific operations performed when the hardware accelerator decodes and executes these two instructions, have been given above and will not be repeated here.

[0356] In an example of this embodiment, the activated operators at the m-th time step are determined as follows: during each loop, when m < m1, the activated operators at the m-th time step are the 1st to m-th operators; when m1 ≤ m ≤ m2, the activated operators are all of the multiple operators; when m2 < m ≤ M, the activated operators are the (m - m2 + 1)-th to m1-th operators; where M = m1 + m2 - 1, m1 is the number of operators to be scheduled among the multiple operators, and m2 is the number of data blocks processed in each loop. The scheduling method of this example can achieve the pipeline shown in Figure 11A and described above. In this example, the activated operators at each time step are determined according to the same rule in every loop. Between two adjacent loops, the same operator has one or more idle time steps.
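The per-loop rule of this example can be written out as a short function. This is a sketch that takes m1 as the number of operators to be scheduled and m2 as the number of data blocks per loop, numbers operators and time steps from 1, and assumes m2 ≥ m1 as the middle range m1 ≤ m ≤ m2 implies; the function name is hypothetical.

```python
from typing import List


def active_operators_11a(m: int, m1: int, m2: int) -> List[int]:
    """Operators (1-based) activated at time step m of a loop with M = m1 + m2 - 1 steps."""
    if m2 < m1:
        raise ValueError("this sketch assumes m2 >= m1")
    big_m = m1 + m2 - 1
    if not 1 <= m <= big_m:
        raise ValueError("time step out of range")
    if m < m1:                       # pipeline fills: operators 1..m
        return list(range(1, m + 1))
    if m <= m2:                      # pipeline full: all m1 operators
        return list(range(1, m1 + 1))
    return list(range(m - m2 + 1, m1 + 1))   # pipeline drains


schedule = [active_operators_11a(m, 3, 4) for m in range(1, 7)]
```

With three operators and four data blocks per loop, every operator is active for exactly four of the six time steps, one per data block.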

[0357] In another example of this embodiment, the activated operators at the m-th time step are determined as follows: in the first loop, when m ≤ m1, the activated operators at the m-th time step are the 1st to the m-th operators, and when m1 < m ≤ M, the activated operators are all of the multiple operators; in the second to the second-to-last loops, the activated operators at all time steps are all of the multiple operators; in the last loop, the activated operators at the m'-th time step are the (m'+1)-th to the m1-th operators; where M is equal to the number of receiving channels, m1 is the number of operators to be scheduled among the multiple operators, M ≥ m1, and m' = 1, 2,..., m1 - 1. The scheduling method of this example can achieve the pipeline shown in Figure 11B and described in detail above.
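The three-case rule of this example can likewise be sketched in code. Loops, time steps, and operators are numbered from 1; the function name is hypothetical, and time steps of the last loop beyond m1 - 1 are treated as having no activated operator, which is an interpretation of the text rather than something it states.

```python
from typing import List


def active_operators_11b(loop: int, m: int, q: int, m1: int) -> List[int]:
    """Operators (1-based) activated at time step m of loop `loop`, out of q loops."""
    if q < 2 or not 1 <= loop <= q or m < 1:
        raise ValueError("invalid loop or time step")
    if loop == 1:                    # first loop: operators join one by one
        return list(range(1, min(m, m1) + 1))
    if loop < q:                     # middle loops: all operators at every step
        return list(range(1, m1 + 1))
    return list(range(m + 1, m1 + 1))   # last loop: operators exit one by one


# Example with Q = 3 loops, M = 4 time steps per loop, m1 = 3 operators.
busy = {op: sum(op in active_operators_11b(loop, m, 3, 3)
                for loop in range(1, 4) for m in range(1, 5))
        for op in (1, 2, 3)}
```

In this small case each operator is active for 8 time steps, which equals (Q - 1) × M, so every operator processes the same number of data blocks without any idle step between its activation and its exit.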

[0358] As can be seen from the figure, in this example, over the Q loops, the multiple operators are activated in sequence during the first few time steps of the first loop. Once activated, an operator remains active in every subsequent time step until the multiple operators exit the process in sequence in the last loop. In this way, the number of time steps M included in each loop can be equal to the number of receiving channels. When M is greater than the number of operators to be scheduled m1, the loop count Q is 1 more than in the example corresponding to Figure 11A. For example, when a frame of data includes 128 chirps, the loop count of the example corresponding to Figure 11A is 128, while the loop count of this example, corresponding to Figure 11B, is 129. However, in this example an operator has no idle time between activation and exit, and the total number of time steps required to complete the entire process is smaller than in the example corresponding to Figure 11A, giving higher efficiency.
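The difference in total time steps can be checked with a small calculation. The sketch below assumes that the number of data blocks per loop in the Figure 11A example equals the number of receiving channels, and it counts for the Figure 11B example only the time steps in which at least one operator is active; both are assumptions made for this illustration, as is the function name.

```python
from typing import Tuple


def total_time_steps(num_chirps: int, m1: int, channels: int) -> Tuple[int, int]:
    """Return (steps for the Figure 11A scheme, steps for the Figure 11B scheme)."""
    if channels < m1:
        raise ValueError("this sketch assumes channels >= m1")
    # Figure 11A: Q = Num loops, each with M = m1 + channels - 1 time steps.
    steps_a = num_chirps * (m1 + channels - 1)
    # Figure 11B: Num full loops of `channels` steps plus m1 - 1 draining steps
    # in the extra (Num + 1)-th loop.
    steps_b = num_chirps * channels + (m1 - 1)
    return steps_a, steps_b


steps_a, steps_b = total_time_steps(128, 3, 4)
```

For 128 chirps, 3 operators, and 4 receiving channels this gives 768 time steps for the first scheme and 514 for the second under the stated assumptions.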

[0359] Because the rules for determining the activated operators differ between the first and last loops in this example, the activated operators can be determined through a judgment instruction in each loop. Specifically, the judgment instruction determines whether the current loop is the first loop, a middle loop, or the last loop. If it is the first loop, the first branch is entered, and the operands indicating the activated operators in the start instructions are generated according to the rule for the first loop. If it is a middle loop (i.e., the second to the penultimate loop), the second branch is entered, and the operands are generated according to the rule for the middle loops (the activated operators at all time steps are all of the aforementioned multiple operators). If it is the last loop, the third branch is entered, and the operands are generated according to the rule for the last loop. If either of the first two branches is entered, a jump instruction can be used to jump to the subsequent processing flow.

[0360] The following explains the format of the exemplary start instruction (start_engine instruction) and synchronization instruction (sync instruction):

[0361] The `start_engine` instruction is used to start one or more operators, and its exemplary format is shown in the table below:

[0362]

[0363] Bits [31:28] represent the op_code of the instruction; the hardware scheduler has two instruction queues, and bit [27] selects which of the two instruction queues executes the instruction; bits [26:20] are currently unused and are reserved for expansion if the number of operators increases; bits [19:0], 20 bits in total, correspond to a maximum of 20 operators, such as fft, cfar, etc. The pseudocode example corresponding to this instruction is as follows:

[0364] start_eng_qn fft

[0365] start_eng_qn fft+cfar

[0366] The `start_engine` instruction can start a single operator, as in the first pseudocode line, or multiple operators simultaneously, as in the second. After the pseudocode is written, a script converts it into binary instructions that the hardware scheduler can parse. When the hardware scheduler executes this instruction, it sends a start signal to each corresponding operator based on bits [19:0]. This start signal is high for only one clock cycle. When an operator receives the start signal, it begins working according to the configured register information. This register information includes the range of values the operator can access, such as `col_offset`, `col_size`, `row_offset`, `row_size`, etc. These registers can be configured before the `start_engine` instruction using the `reg_write` instruction, as shown in the following figure:

[0367] The `sync` instruction is used by the hardware scheduler to check whether the corresponding operator has completed its task. If it has, the hardware scheduler continues executing subsequent instructions; otherwise, it waits for the corresponding operator to complete its task. An example format is as follows:

[0368]

[0369] In the table above, bits [31:28] represent the op_code of the sync instruction; bit [27] selects which of the instruction queues executes the instruction; bits [26:20] are reserved, as in the start_eng instruction, and can be used for subsequent expansion; and bits [19:0] correspond to a maximum of 20 operators, as in the start_eng instruction. The pseudocode example for this instruction is as follows:

[0370] sync_qn fft

[0371] sync_qn fft+cfar

[0372] The `sync` instruction can check whether a single operator has completed its work, as in the first pseudocode line, or whether multiple operators have completed their work, as in the second. When an operator completes its work, it sends the hardware scheduler a "done" signal that is high for one clock cycle. After receiving the "done" signal, the hardware scheduler (SEQ) considers that operator to have entered the idle state. When all operators selected in bits [19:0] have entered the idle state, the `sync` instruction ends, and the hardware scheduler continues to execute subsequent instructions.
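The shared 32-bit layout of the start and sync instructions, and the completion condition of `sync`, can be sketched as below. The op_code values and the bit positions assigned to `fft` and `cfar` are assumptions made for this illustration; the disclosure fixes only the field boundaries (bits [31:28], bit [27], bits [26:20], bits [19:0]).

```python
from typing import Iterable

OPERATOR_BITS = {"fft": 0, "cfar": 1}   # assumed positions within bits [19:0]
OP_START_ENG = 0x1                      # assumed op_code value
OP_SYNC = 0x2                           # assumed op_code value


def encode(op_code: int, queue: int, operators: Iterable[str]) -> int:
    """Pack op_code into [31:28], queue into [27], operator mask into [19:0]."""
    if not 0 <= op_code <= 0xF or queue not in (0, 1):
        raise ValueError("op_code must fit in 4 bits and queue must be 0 or 1")
    mask = 0
    for name in operators:
        mask |= 1 << OPERATOR_BITS[name]
    return (op_code << 28) | (queue << 27) | mask   # bits [26:20] stay 0 (reserved)


def sync_complete(word: int, idle_mask: int) -> bool:
    """True once every operator selected in bits [19:0] is reported idle."""
    wanted = word & 0xFFFFF
    return (idle_mask & wanted) == wanted


start_word = encode(OP_START_ENG, 0, ["fft"])
sync_word = encode(OP_SYNC, 1, ["fft", "cfar"])
```

A script converting pseudocode such as `sync_qn fft+cfar` into binary could follow this pattern once the real op_code values and operator bit assignments are known.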

[0373] In an exemplary embodiment of this disclosure, the hardware accelerator further includes general-purpose registers and dedicated registers for each operator, and the software code further includes one or more of the following instructions:

[0374] The register read instruction (reg_read instruction), when executed, reads the value of the dedicated register at a specified address;

[0375] The register load instruction (reg_load instruction), when executed, writes data to a general-purpose register;

[0376] The arithmetic instruction (rf_op instruction), when executed, performs operations on data in general-purpose registers, including addition, subtraction, AND, OR, NOT, shift, etc.;

[0377] The register store instruction (reg_store instruction), when executed, writes the value of a general-purpose register to the dedicated register at a specified address.

[0378] In this embodiment, one or more dedicated registers corresponding to each operator are called the register group corresponding to that operator.

[0379] The aforementioned instructions enable operators to exchange data other than the data to be processed. For example, statistical data obtained by one operator during processing can be written to a designated dedicated register. The hardware scheduler retrieves this statistical data from the dedicated register using a register read instruction; the data, optionally after calculation, can be written to a general-purpose register using a register load instruction; arithmetic instructions can operate on data in multiple general-purpose registers, and the results can be written to another operator's dedicated register using a register store instruction. This establishes a channel for information transfer between operators. These instructions in the software code can be generated according to specific application requirements.
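The information channel described above can be modelled in a few lines. The sketch treats dedicated registers as a dictionary keyed by address and general-purpose registers as a list; the addresses, the choice of a left shift as the arithmetic step, and the function name are all assumptions for illustration.

```python
from typing import Dict


def run_channel(dedicated: Dict[int, int], stat_addr: int,
                dest_addr: int, shift: int) -> Dict[int, int]:
    """Move a statistic from one operator's register to another's, scaling it on the way."""
    gpr = [0] * 16
    # reg_read / reg_load: bring the statistic into general-purpose register r0.
    gpr[0] = dedicated[stat_addr]
    # rf_op: a shift between general-purpose registers, kept to 32 bits.
    gpr[1] = (gpr[0] << shift) & 0xFFFFFFFF
    # reg_store: write r1 to the dedicated register of the receiving operator.
    dedicated[dest_addr] = gpr[1]
    return dedicated


result = run_channel({0x200: 5, 0x300: 0}, 0x200, 0x300, 2)
```

Here a count of 5 produced by one operator is scaled to 20 and handed to another operator through its dedicated register.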

[0380] In addition to these instructions, software code may also include the following instructions, such as: wait instructions, used to stop the execution of instructions until the processor configures the corresponding event bit, which can be used for debugging; delay instructions, used to delay for a preset time period; break instructions, used to jump out of the current loop or branch; and clear instructions (pc_clear instructions), used to clear the value of the program counter.

[0381] In one example, the reg_load instruction has the following exemplary format:

[0382]

[0383] The table above shows the bit allocation of the reg_load instruction. Bits [31:28] represent the instruction ID; bit [27] selects which of the instruction queues executes the instruction; bits [26:25] select the source of the value written to the general-purpose register: 0 indicates that the value at the address given in bits [19:0] is written to the general-purpose register; 1 indicates that the immediate value is written to the general-purpose register; 2 indicates that the address is held in another general-purpose register, so the data at that address is read according to the value of the other general-purpose register and then written to the current general-purpose register; bit [24] selects whether the immediate value is written to the high or low 16 bits of the general-purpose register; bits [23:20] represent the index of the current general-purpose register; bits [19:0] represent the address, the immediate value, or the index of another general-purpose register.
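The field boundaries of the reg_load instruction can be illustrated with a decoder sketch. The field names in the returned dictionary are hypothetical labels, and the example word uses an arbitrary instruction ID, since the disclosure does not state its value or which value of bit [24] selects the high half.

```python
from typing import Dict


def decode_reg_load(word: int) -> Dict[str, int]:
    """Split a 32-bit reg_load word into the fields described above."""
    return {
        "instr_id": (word >> 28) & 0xF,      # bits [31:28]
        "queue": (word >> 27) & 0x1,         # bit [27]
        "source": (word >> 25) & 0x3,        # bits [26:25]: 0 address, 1 immediate, 2 indirect
        "half_select": (word >> 24) & 0x1,   # bit [24]: high or low 16 bits
        "gpr_index": (word >> 20) & 0xF,     # bits [23:20]
        "operand": word & 0xFFFFF,           # bits [19:0]
    }


# Arbitrary example word: ID 3, queue 1, immediate source, half select 1, r5, operand 0x1234.
example = (0x3 << 28) | (1 << 27) | (1 << 25) | (1 << 24) | (5 << 20) | 0x1234
fields = decode_reg_load(example)
```

The same shifts and masks, applied in reverse, would let a conversion script assemble the word from pseudocode operands.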

[0384] The cache allocation instruction in the aforementioned embodiment is implemented through the register write instruction (reg_write instruction), which writes a value to the register corresponding to a certain address; the reg_load instruction here writes a value or the value of the register corresponding to a certain address to a general-purpose register.

[0385] The `reg_read` instruction reads the value of the register corresponding to the specified address. The results of operator processing can be of several types: one is data obtained from calculations on the input data, which can be stored in the output buffer; another is statistical results of the input data, which are smaller in size, such as the count of valid targets. This statistical result can be stored in a result register, typically 32 bits, and the application can obtain these statistical results by reading the register.

[0386] The `reg_store` instruction writes the value of a general-purpose register to the register corresponding to a specified address. General-purpose registers can be used to cache intermediate results or perform simple mathematical operations, and the resulting value can be assigned to the register at a specific address. The `reg_store` instruction can be used to achieve this.

[0387] The operand event_idx of the wait instruction can be configured from 0 to 7. When the wait instruction is executed, the hardware scheduler (SEQ) stops and does not execute subsequent instructions. SEQ resumes executing subsequent instructions only when the CPU configures ctl_seq_event_set to match event_idx.

[0388] An example of the delay instruction is delay 0x3, which means that the execution of the next instruction will continue after a delay of 3 cycles.

[0389] The pc_clear instruction is used when the program counter (pc) needs to restart counting from 0.

[0390] The rf_op instruction can perform simple calculations on intermediate results during instruction execution and then allocate them to a register.

[0391] The software code (short instruction set) generated in the above embodiments can be considered as firmware stored in the chip. Once configuration data has been written into the radar signal processing registers and the configuration data required for each engine's operation has been written into that engine's internal registers, the hardware acceleration chip has been customized for the intended functions and can begin operation. During chip startup, the CPU and the hardware scheduler (SEQ) on the chip start independently. The software code is written into the internal memory (instruction queue 0 or 1) of the hardware scheduler. The controllers in the hardware scheduler retrieve the software code (such as 32-bit short instructions) from their respective queues, decode it, determine its instruction function, and then execute the corresponding instruction function.

[0392] The above embodiments of this disclosure provide a method and tool for users to customize and use hardware accelerators. The software code generated by the code generation method according to the above embodiments of this disclosure can be executed by the hardware scheduler in the hardware accelerator of the above embodiments, realizing functions such as configuration of multiple operators in the hardware accelerator, cache allocation, and time-sharing scheduling, and achieving the effects described in the above embodiments.

[0393] One embodiment of this disclosure also provides a software development platform for generating software code used by a hardware accelerator, the hardware accelerator including a hardware scheduler that executes the software code, various types of operators, and a shared cache pool. As shown in Figure 22, the software development platform includes:

[0394] The human-computer interaction module 500 is configured to display an interactive interface for generating software code and to receive user input through the interactive interface;

[0395] The information determination module 501 is configured to determine configuration information based on user input. The configuration information includes: multiple operators involved in the processing and their algorithm parameters, information on the data flow when the multiple operators perform processing, and information on the cache of the multiple operators; the multiple operators include at least two types of operators.

[0396] The code generation module 502 is configured to automatically generate software code based on the configuration information. The software code includes multiple cache allocation instructions. When the multiple cache allocation instructions are executed, they allocate caches from the shared cache pool for the multiple operators to use when accessing data, so as to form a data channel through which the data flow passes through the multiple operators.

[0397] The software development platform of this disclosure can automatically generate software code for a hardware accelerator to run. This hardware accelerator can be applied to digital signal processing in radar systems. The software code can use radar-specific microinstructions, such as 32-bit short instructions. This disclosure adopts a new hardware-software interaction method, allowing users to define different instruction sets to develop different applications, change specified operators and their connections, and modify data flow. This solves the problem of low development flexibility of hardware accelerators. Furthermore, during actual operation, all instructions and data flows can be processed through hardware circuit-based operators (such as ASIC dedicated circuits), exhibiting high-performance processing characteristics. This software development platform can be any hardware device or apparatus equipped with a corresponding software development kit.

[0398] One embodiment of this disclosure also provides a code generation apparatus, including a memory 50' and a processor 60'; see Figure 22. The memory 50' stores a computer program, and the processor 60' is configured to run the computer program to execute the code generation method described in any embodiment of this disclosure. The code generation apparatus can be any hardware device with a corresponding software development kit installed, such as a computer, server, cloud platform, etc. The software development platform described above can be the code generation apparatus of this embodiment.

[0399] An embodiment of this disclosure also provides a non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the code generation method described in any embodiment of this disclosure.

[0400] It will be understood by those skilled in the art that all or some of the steps, systems, or apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

Claims

1. A code generation method for generating software code used by a hardware accelerator, the hardware accelerator including a hardware scheduler for executing the software code, a shared cache pool, and various types of operators, the code generation method comprising: displaying an interactive interface for generating software code, and receiving user input through the interactive interface; determining configuration information based on the user input, the configuration information including: multiple operators involved in the processing and their algorithm parameters, information on the data flow when the multiple operators perform the processing, and information on the cache of the multiple operators, the multiple operators including at least two types of operators; and automatically generating software code based on the configuration information, the software code including multiple cache allocation instructions, wherein, when the multiple cache allocation instructions are executed, caches used for accessing data are allocated from the shared cache pool to the multiple operators respectively, so as to form a data channel through which the data flow passes through the multiple operators.

2. The code generation method as described in claim 1, characterized in that: The plurality of operators are dedicated hardware operators; or, a portion of the plurality of operators are dedicated hardware operators, and another portion are general-purpose operators implemented based on a general-purpose processor.

3. The code generation method as described in claim 1, characterized in that: The software code is used for radar digital signal processing. The instructions in the software code are dedicated micro-instructions for radar digital signal processing, and the instruction length is 16 bits, 32 bits, or 64 bits. The plurality of operators sequentially include: CQMD operator, FFT operator, SVA operator; or sequentially include DC operator, FFT operator, SVA operator; or sequentially include CMB operator, STAS operator, HIST operator, CFAR operator, STAS operator; or sequentially include DC operator, FFT operator, SVA operator and CMB operator.

4. The code generation method as described in claim 1, characterized in that: The plurality of operators includes one or more of the following operators: A first DMA operator is used to load data to be processed from an external source into the shared cache pool. The algorithm parameters of the first DMA operator include the size and storage location of the data to be processed. The cache allocated to the first DMA operator is an output cache. A second DMA operator is used to store the data obtained after the plurality of operators have completed the processing of the data to be processed to an external location. The algorithm parameters of the second DMA operator include the size and storage location of the processed data, and the cache allocated to the second DMA operator is an input cache.

5. The code generation method as described in claim 1, characterized in that: Before automatically generating software code based on the configuration information, the method further includes: performing one or more of the following legality checks on the configuration information, and automatically generating the software code only after the checks pass: whether the hardware accelerator supports the multiple operators and their algorithm parameters; whether the hardware accelerator supports the processing flow corresponding to the data stream; whether the number of operators activated in the same time step meets the maximum parallelism requirement for the multiple operators; whether each of the plurality of operators satisfies the minimum parallelism requirement for data input and output; and whether the shared cache pool has enough space to allocate caches for the multiple operators and whether any cache allocation conflicts exist.

6. The code generation method as described in claim 1, characterized in that: The software code also includes multiple sets of scheduling instructions. When these multiple sets of scheduling instructions are executed, the multiple operators are scheduled time step by time step, so that the multiple operators are started sequentially according to the order in which the data stream passes through the operators, and the data to be processed is processed in a pipelined manner. Each time step starts when execution of a set of scheduling instructions begins and ends when execution of that set of scheduling instructions ends.
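The last legality check of claim 5 (enough space in the shared cache pool, no allocation conflicts) can be illustrated with a minimal sketch. It assumes that each cache is a rectangle described by row position, row number, column position, and column number, and that caches deliberately shared by adjacent operators are listed only once. The names `Region`, `overlaps`, and `check_allocation` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular cache inside the shared cache pool."""
    name: str
    row: int   # first row
    rows: int  # number of rows
    col: int   # first column
    cols: int  # number of columns


def overlaps(a: Region, b: Region) -> bool:
    """True when the two rectangles share at least one cell."""
    return (a.row < b.row + b.rows and b.row < a.row + a.rows
            and a.col < b.col + b.cols and b.col < a.col + a.cols)


def check_allocation(regions: list[Region], pool_rows: int, pool_cols: int) -> list[str]:
    """Return a list of problems; an empty list means the check passes."""
    errors = []
    for r in regions:
        inside = (r.rows > 0 and r.cols > 0 and r.row >= 0 and r.col >= 0
                  and r.row + r.rows <= pool_rows
                  and r.col + r.cols <= pool_cols)
        if not inside:
            errors.append(f"{r.name}: outside the shared pool")
    for a, b in combinations(regions, 2):
        if overlaps(a, b):
            errors.append(f"{a.name} overlaps {b.name}")
    return errors
```

A generator built along these lines would emit code only when `check_allocation` returns an empty list.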

7. The code generation method as described in claim 6, characterized in that: When the multiple cache allocation instructions are executed, the caches allocated to the multiple operators respectively satisfy the following: among adjacent operators, the output cache used by the preceding operator in the current time step is the same as the input cache used by the following operator in the next time step, and the caches used by the multiple operators in the same time step are located in different independently accessible regions of the shared cache pool.

8. The code generation method as described in claim 6, characterized in that: The configuration information also includes the number of loops Q, where Q ≥ 2; The software code also includes a loop start instruction preceding the multiple sets of scheduling instructions and a loop end instruction following the multiple sets of scheduling instructions. The loop start instruction includes an operand representing the number of loops Q. When the software code is executed, the processing of the data to be processed is completed through Q loops.

9. The code generation method as described in claim 8, characterized in that: The data to be processed is a frame of data obtained by the radar system from processing the echo signal. The number of loops Q is equal to Num or to Num+1, where Num is the number of chirps in a frame of data. The plurality of operators are operators that perform 1D-FFT stage processing, with each operator processing the data of a chirp signal on a channel in one time step; or, the plurality of operators are operators that perform 2D-FFT stage processing, with each operator processing the data at the same range gate in the 1D-FFT data in one time step.

10. The code generation method as described in claim 8, characterized in that: The cache information includes the size and location of the cache, and the multiple cache allocation instructions include multiple first cache allocation instructions located before the loop start instruction; When the plurality of first cache allocation instructions are executed, caches to be used for the first processing are allocated from the shared cache pool to the plurality of operators, and in adjacent operators, the output cache allocated for the first processing of the preceding operator is the input cache allocated for the first processing of the following operator. The hardware accelerator also includes a register set for storing the operators' configuration parameters; the first cache allocation instruction includes: a first operand, used to indicate the size and location of the cache allocated to the operator; and a second operand, used to indicate the address of the register to which the first operand is to be written for the operator to read.

11. The code generation method as described in claim 10, characterized in that: The step of determining configuration information based on user input includes: obtaining application parameters of the system to which the multiple operators belong, as well as the type and size of the data to be processed, based on user input; for each operator, obtaining the amount of data accessed when the operator performs one processing operation based on the type of the operator, the application parameters, and the type and size of the data to be processed; determining the size of the cache allocated to the operator based on the amount of data and allocating the location of the cache. Each operator's cache includes an input cache and / or an output cache. The amount of data accessed when the operator performs one processing operation includes the amount of input data in the input cache and / or the amount of output data in the output cache.

12. The code generation method as described in claim 10, characterized in that: The first cache allocation instruction includes a write register instruction and N' write data instructions from among the N write data instructions associated with the write register instruction, wherein: The write register instruction includes: an opcode, a second operand representing the register address Add, and an operand representing the number of immediate values N, where N ≥ N' ≥ 2; Each write data instruction includes an opcode and an immediate value, the first operand including the N' immediate values from the N' write data instructions to represent information about the row position, row number, column position, and column number of the cache allocated to the operator; When the first cache allocation instruction is executed, the immediate value of each instruction in the N' write data instructions is written to the register at address Add+n-1 for the operator to read, where n is the sequence number of the instruction in the N write data instructions, and 1 ≤ n ≤ N.
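The register-write behaviour of claim 12 can be sketched as a small simulation. It is only an illustration under stated assumptions: the claim does not give opcode values or bit layouts, so instructions are modelled as plain Python objects, N' is taken equal to N = 4, and the order of the four immediates (row position, row number, column position, column number) is assumed. The names `WriteRegister`, `WriteData`, `build_cache_allocation`, and `execute` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRegister:
    address: int  # register address Add (second operand)
    count: int    # number of immediate values N


@dataclass(frozen=True)
class WriteData:
    immediate: int  # one immediate value of the first operand


def build_cache_allocation(address, row_pos, row_num, col_pos, col_num):
    """Build one first cache allocation instruction (assumed N' = N = 4)."""
    immediates = [row_pos, row_num, col_pos, col_num]
    head = WriteRegister(address, len(immediates))
    return [head] + [WriteData(v) for v in immediates]


def execute(instructions, registers):
    """Write the n-th immediate to the register at address Add + n - 1."""
    base, remaining, n = None, 0, 0
    for ins in instructions:
        if isinstance(ins, WriteRegister):
            base, remaining, n = ins.address, ins.count, 0
        elif isinstance(ins, WriteData):
            if base is None or n >= remaining:
                raise ValueError("write data without a matching write register")
            n += 1  # n is the 1-based sequence number
            registers[base + n - 1] = ins.immediate
    return registers
```

For example, `execute(build_cache_allocation(0x40, 0, 8, 16, 256), {})` fills four consecutive registers starting at address `0x40`.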

13. The code generation method as described in claim 10, characterized in that: The cache information also includes the number of cache groups K used by the operator, where K ≥ 2; The software code also includes one or more second cache allocation instructions located between two adjacent sets of scheduling instructions. The second cache allocation instruction includes: a first operand indicating the location of a cache reallocated for use in the next time step for an operator; and a second operand indicating the address of the register to which the first operand is to be written. When the first and second cache allocation instructions for the same operator are executed, each operator, during processing, sequentially switches the location of the cache used in the K sets of caches every K time steps. The first operand is also used to indicate the number of groups K. During the processing, each operator switches the position of the cache used in the K groups of caches in turn, based on the size of the allocated cache and the number of groups K, with a period of K time steps.
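The cache switching of claim 13, combined with the adjacency condition of claim 7, can be sketched as follows. The sketch assumes that the K cache groups are laid out contiguously by row, that time steps are counted from 0, and that the following operator reads in step t+1 what the preceding operator wrote in step t. The function names are hypothetical.

```python
def producer_output_row(base_row: int, rows_per_group: int, k: int, step: int) -> int:
    """First row of the cache group the preceding operator writes in `step`."""
    if k < 2:
        raise ValueError("claim 13 requires K >= 2")
    # The position cycles through the K groups with a period of K time steps.
    return base_row + (step % k) * rows_per_group


def consumer_input_row(base_row: int, rows_per_group: int, k: int, step: int) -> int:
    """First row of the cache group the following operator reads in `step`.

    It is the group the preceding operator wrote one time step earlier.
    """
    return producer_output_row(base_row, rows_per_group, k, step - 1)
```

With K = 2 this reduces to ping-pong buffering: within one time step the two operators touch different groups, and the group written in one step is read in the next.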

14. The code generation method as described in claim 6, characterized in that: M sets of scheduling instructions are located between the loop start instruction and the loop end instruction. Each set of scheduling instructions includes a start instruction and a synchronization instruction. The operand of the start instruction in the m-th set of scheduling instructions indicates the activation operator of the m-th time step, that is, the operator to be started in the m-th time step, where m = 1, 2, ..., M. The m-th time step starts when the start instruction in the m-th set of scheduling instructions is executed and ends when the synchronization instruction in the m-th set of scheduling instructions is executed. M is the number of time steps included in each loop.

15. The code generation method according to claim 14, wherein: The activation operator of the m-th time step is determined as follows: During each loop, when m < m1, the activation operator of the m-th time step is the 1st to m-th operators; when m1 ≤ m ≤ m2, the activation operator of the m-th time step is the plurality of operators; when m2 < m ≤ M, the activation operator of the m-th time step is the (m - m2 + 1)-th to m1-th operators; wherein M = m1 + m2 - 1, m1 is the number of operators to be scheduled among the plurality of operators, and m2 is the number of data blocks processed in each loop.
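The three cases of claim 15 describe a pipeline that fills, runs at full width, and then drains within each loop. A minimal sketch of that rule follows; it assumes m2 ≥ m1 so that the three ranges do not overlap, and the function names are hypothetical.

```python
def active_operators(m: int, m1: int, m2: int) -> list[int]:
    """1-based indices of the operators started in time step m of one loop."""
    if m2 < m1:
        raise ValueError("this sketch assumes m2 >= m1")
    total_steps = m1 + m2 - 1  # M in claim 15
    if not 1 <= m <= total_steps:
        raise ValueError("m must satisfy 1 <= m <= M")
    if m < m1:                 # pipeline filling
        return list(range(1, m + 1))
    if m <= m2:                # all operators running
        return list(range(1, m1 + 1))
    return list(range(m - m2 + 1, m1 + 1))  # pipeline draining


def pipeline_schedule(m1: int, m2: int) -> list[list[int]]:
    """Activation operators for every time step m = 1..M of one loop."""
    return [active_operators(m, m1, m2) for m in range(1, m1 + m2)]
```

For m1 = 3 operators and m2 = 4 data blocks, the schedule has M = 6 time steps and every operator is started exactly four times, once per data block.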

16. The code generation method according to claim 14, wherein: The activation operator of the m-th time step is determined as follows: During the first loop, when m ≤ m1, the activation operator of the m-th time step is the 1st to m-th operators, and when m1 < m ≤ M, the activation operator of the m-th time step is the plurality of operators; During the second to the second-to-last loop, the activation operators of all time steps are the plurality of operators; During the last loop, the activation operator of the m'-th time step is the (m' + 1)-th to m1-th operators; wherein M is equal to the number of receiving channels, m1 is the number of operators to be scheduled among the plurality of operators, M ≥ m1, and m' = 1, 2, ..., m1 - 1.
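In claim 16 the pipeline fills only in the first loop and drains only in the last one. The sketch below tabulates the activation operators for all Q loops. It assumes that no operator is started in the time steps of the last loop beyond m' = m1 - 1, which the claim leaves unspecified; the name `activation_table` and its parameters are hypothetical.

```python
def activation_table(num_loops: int, steps_per_loop: int, num_ops: int) -> list[list[list[int]]]:
    """table[q][m] lists the operators started in step m+1 of loop q+1.

    num_loops is Q, steps_per_loop is M, num_ops is m1.
    """
    if num_loops < 2 or steps_per_loop < num_ops:
        raise ValueError("requires Q >= 2 and M >= m1")
    everyone = list(range(1, num_ops + 1))
    table = []
    for loop in range(1, num_loops + 1):
        steps = []
        for m in range(1, steps_per_loop + 1):
            if loop == 1:
                # First loop: operators 1..m until all m1 are running.
                steps.append(everyone[:min(m, num_ops)])
            elif loop == num_loops:
                # Last loop: operators (m + 1)..m1, then nothing (assumption).
                steps.append(everyone[m:])
            else:
                steps.append(list(everyone))
        table.append(steps)
    return table
```

With Q = 3, M = 4, and m1 = 3, each operator is started (Q - 1) × M = 8 times in total, which is consistent with the last loop serving only to flush the pipeline.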

17. The code generation method according to claim 1, wherein: The hardware accelerator further includes a general register and a dedicated register for each operator, and the software code further includes one or more of the following instructions: Read register instruction, which reads the value of the dedicated register corresponding to the specified address when executed; Write general register instruction, which writes data into the general register when executed; Arithmetic instruction, which performs arithmetic operations between the data in the general register when executed; Register save instruction, which is used to write the value of the general register into the dedicated register corresponding to the specified address.

18. A code generation apparatus, comprising a memory and a processor, characterized in that: the memory stores a computer program, and the processor is configured to run the computer program to execute the code generation method according to any one of claims 1 to 17.

19. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the code generation method according to any one of claims 1 to 17.