Application method and system for a specific accelerator, and the specific accelerator

By using a custom RISC-V instruction architecture to select the delay feedback modules and twiddle factor storage unit of a specific accelerator, the low flexibility of traditional FFT accelerators is addressed, enabling flexible FFT computation and efficient instruction extension.

CN117349580B · Active · Publication Date: 2026-06-26 · INST OF MICROELECTRONICS CHINESE ACAD OF SCI LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INST OF MICROELECTRONICS CHINESE ACAD OF SCI LTD
Filing Date
2023-10-18
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Traditional FFT accelerators based on the radix-2² SDF architecture have low flexibility when handling FFT calculations with different point counts and cannot meet the needs of scenarios that require multiple point counts.

Method used

Using a custom RISC-V instruction architecture, the delay feedback modules and twiddle factor storage unit of a specific accelerator are selected according to custom instruction information to form the target accelerator circuit, supporting FFT calculations with different numbers of sampling points.

Benefits of technology

It extends the instructions, improves the structural flexibility and computational efficiency of specific accelerators, and reduces the number of instructions and clock cycles required for each FFT acceleration calculation.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an application method and system for a specific accelerator, and the specific accelerator. A first processor comprises the specific accelerator, which is an FFT accelerator. The method comprises the following steps: obtaining instruction information, the instruction information being an instruction that meets a custom encoding rule of a reduced instruction set; obtaining the required number of sampling points and the to-be-processed data based on the instruction information; selecting at least two delay feedback modules from a preset delay feedback module set of the specific accelerator according to the number of sampling points, and selecting a target twiddle factor storage unit from a twiddle factor storage module; enabling the at least two delay feedback modules and the target twiddle factor storage unit to obtain a target accelerator circuit; and inputting the to-be-processed data into the target accelerator circuit to obtain a calculation result. Because the specific accelerator supports custom instructions for calculations with different numbers of sampling points, instruction extension is realized, the instruction extension matches the structural extension of the specific accelerator, and the structural form of the specific accelerator is improved.

Description

Technical Field

[0001] This application relates to the field of information technology, and more specifically, to an application method and system for a specific accelerator, and the specific accelerator.

Background Technology

[0002] The Fast Fourier Transform (FFT), a fast algorithm of the Discrete Fourier Transform (DFT), has become one of the most important algorithms in 5G communication due to its low computational cost, and is widely used in various communication technologies. Furthermore, FFT is also widely applied in spectrum analysis, image processing, machine learning, and other fields. With the increase in bandwidth, transport streams, and the number of antennas, FFT operations face increasing challenges in terms of computational complexity and latency. Implementing FFT operations through software programming is insufficient to meet the high-speed, high-performance, and low-overhead requirements of IoT communication scenarios. Hardware acceleration of the FFT algorithm is a reliable solution to address these challenges.

[0003] Figure 1 is a schematic diagram of the hardware implementation structure of a 64-point radix-2² SDF (Single-path Delay Feedback) FFT accelerator in the prior art. As shown there, the entire circuit consists of three stages of SDF operation units connected in series. Each stage of the SDF operation unit consists of a butterfly operation unit (BF) 101, a delay unit (D) 102, a multiplication unit 103, and a twiddle factor storage unit 104.

[0004] Figure 2 shows the structure of an SDF operation unit. Each SDF operation unit includes two butterfly operation units (BF) 201, two delay units (D) 202, a multiplication unit (M) 203, and a twiddle factor (rotation factor) storage unit (T) 204. The difference between the two butterfly operation units 201 is that the data output from the first-stage butterfly operation unit undergoes a multiplication operation before being transmitted to the input of the second stage. The delay unit (Delay) 202 delays the input data so that it satisfies the input format required by the radix-2² butterfly operation. The multiplication unit (Multiply) 203 multiplies the output data of the butterfly operation by the twiddle factor (Twiddle) supplied by the storage unit 204. The calculation process is as follows. First, the data is input sequentially. The first-stage butterfly operation unit buffers the first N/2 data in a delay unit, then performs the butterfly operation on them together with the subsequently input N/2 data, and passes the result to the second-stage butterfly operation unit. The second-stage butterfly operation unit likewise groups, buffers, and calculates the results passed from the previous stage, and passes its results to the multiplication unit. The multiplication unit multiplies the result of the second-stage butterfly operation unit by the twiddle factor and passes the product to the next-stage SDF operation unit.
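The buffer-then-butterfly flow described above can be illustrated with a small functional model. This is only a sketch under stated assumptions: it uses a plain radix-2 decimation-in-frequency recursion rather than the radix-2² twiddle arrangement of the figures, it is not cycle-accurate, and the function names (`dif_fft`, `bit_reverse`, `dft`) are illustrative and do not come from the patent.

```python
import cmath

def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT. Output is in bit-reversed order."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # The first N/2 samples play the role of the delay-unit contents;
    # the second N/2 samples are the ones that arrive afterwards.
    buffered, incoming = x[:half], x[half:]
    # Butterfly: sums feed the next stage directly, differences are
    # multiplied by the twiddle factor W_N^k before the next stage.
    sums = [a + b for a, b in zip(buffered, incoming)]
    diffs = [(a - b) * cmath.exp(-2j * cmath.pi * k / n)
             for k, (a, b) in enumerate(zip(buffered, incoming))]
    return dif_fft(sums) + dif_fft(diffs)

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    return int(format(i, f"0{bits}b")[::-1], 2)

def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

Comparing `dif_fft(x)[p]` with `dft(x)[bit_reverse(p, bits)]` for a 16-point input checks the model. The bit-reversed output order is a property of this recursion, not something the patent text specifies.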

[0005] However, because of its inherently cascaded structure, a traditional FFT accelerator based on the radix-2² SDF structure requires a different number of cascaded stages for FFT calculations with different numbers of points. This results in low flexibility and fails to meet the needs of scenarios that require FFT calculations with multiple point counts.

Summary of the Invention

[0006] In view of this, this application provides an application method and system for a specific accelerator, and the specific accelerator, as follows:

[0007] An application method for a specific accelerator, applied to a first processor, the first processor including the specific accelerator, the method comprising:

[0008] Obtain instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of the reduced instruction set;

[0009] Based on the instruction information, the required number of sampling points and the data to be processed are obtained;

[0010] Based on the number of sampling points, at least two delay feedback modules are selected from the set of delay feedback modules of a preset specific accelerator, and a target rotation factor storage unit is selected from the rotation factor storage module. The specific accelerator is a fast Fourier transform accelerator. The set of delay feedback modules includes at least six delay feedback modules, and the rotation factor storage module includes at least eight rotation factor storage units. Any one of the at least eight rotation factor storage units can be connected to any delay feedback module.

[0011] Enable the at least two delay feedback modules and the target rotation factor storage unit to obtain the target accelerator circuit;

[0012] The data to be processed is input into the target accelerator circuit to obtain the calculation result.

[0013] Optionally, in the above method, selecting at least two delay feedback modules from a preset set of delay feedback modules for a specific accelerator based on the number of sampling points includes:

[0014] Determine the number of cascaded stages of the delay feedback module required to calculate the number of sampling points;

[0015] Based on the cascade number of the delay feedback modules, at least two target delay feedback modules are selected from the set of delay feedback modules of a preset specific accelerator. The target delay feedback modules include a first delay feedback module and / or a second delay feedback module. The first delay feedback module includes at least two butterfly operation units, and the second delay feedback module includes at least one butterfly operation unit.

[0016] Optionally, in the above method, the instruction information further includes a source operand, and the method further includes:

[0017] Based on the source operand in the instruction information, the data to be processed is read from the address corresponding to the source operand in the external memory.

[0018] Optionally, in the above method, obtaining the required number of sampling points based on the instruction information includes:

[0019] Analyze the instruction information to obtain the instruction code recorded in the first field of the instruction information;

[0020] Based on the correspondence between the instruction code and the number of sampling points, the number of sampling points corresponding to the instruction information is obtained.

[0021] A specific accelerator includes:

[0022] Delay feedback module set and rotation factor storage module;

[0023] The delay feedback module set includes at least six delay feedback modules, and the at least six delay feedback modules can be combined to support at least eight different numbers of sampling points;

[0024] The rotation factor storage module includes at least eight rotation factor storage units, and the rotation factor storage module is connected to the delay feedback module set; the rotation factor storage module provides the delay feedback module with rotation factor storage units corresponding to the number of sampling points.

[0025] Optionally, in the aforementioned specific accelerator, the delay feedback module set includes: at least five first delay feedback modules and one second delay feedback module;

[0026] The first delay feedback module includes at least two butterfly operation units, four delay units, and a multiplication unit, while the second delay feedback module includes at least one butterfly operation unit and one delay unit. The delay units in any two delay feedback modules have different delay times.

[0027] Optionally, in the aforementioned specific accelerator, the rotation factor module includes at least eight rotation factor storage units, each rotation factor storage unit being connected to a corresponding first delay feedback module.

[0028] Optionally, in the aforementioned specific accelerator, the at least five first delay feedback modules and the second delay feedback module are arranged sequentially;

[0029] The output of the second delay feedback module and the output of the last target first delay feedback module among the at least five first delay feedback modules are respectively connected to the data output of the specific accelerator.

[0030] The data input terminal of the specific accelerator is connected to the input terminals of at least four remaining first delay feedback modules, excluding the target first delay feedback module.

[0031] The at least four first delay feedback modules, the target first delay feedback module, and the second delay feedback module are connected in sequence, and a multiplexer is provided between any two adjacent first delay feedback modules among the at least four, so that the input terminal of any one of the at least four first delay feedback modules can be connected either to the output terminal of the preceding first delay feedback module or to the data input terminal of the specific accelerator.

[0032] An application system for a specific accelerator includes: a first processor and a second processor;

[0033] The second processor receives instruction information, determines by analysis that the instruction information belongs to a preset custom information type, and sends the instruction information to the first processor. The instruction information is an instruction that satisfies the custom encoding rules of the reduced instruction set.

[0034] The first processor reads the instruction information to determine the number of sampling points, selects at least two delay feedback modules from a preset set of delay feedback modules for a specific accelerator based on the number of sampling points, and selects a target rotation factor storage unit from the rotation factor storage module, enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; requests data to be processed from the second processor; inputs the data to be processed fed back by the second processor into the target accelerator circuit to obtain a calculation result; and sends the calculation result to the second processor, wherein the specific accelerator is a Fast Fourier Transform accelerator.

[0035] Optionally, in the above system, the first processor includes: a communication module, a decoding module, a specific accelerator, a first memory, and a second memory;

[0036] The communication module receives the instruction information sent by the second processor. The decoding module parses the instruction information to obtain the number of sampling points recorded in the first field and the first and second source register information recorded in the second field; the number of sampling points can be determined from the instruction encoding. Based on the first source register information, the first memory requests the data to be processed from the second processor, so that the second processor obtains the data to be processed from the storage address in external storage corresponding to the first source register information. The first memory receives and stores the data to be processed and inputs it into the specific accelerator. The specific accelerator processes the data to be processed to obtain a calculation result and writes the calculation result into the second memory. Based on the second source register information, the second memory sends the calculation result to the second processor through the communication module, so that the second processor stores the calculation result at the storage address in external storage corresponding to the second source register information.

[0037] In summary, this application provides an application method and system for a specific accelerator, and the specific accelerator. The first processor includes the specific accelerator. The method includes: obtaining instruction information, which is an instruction that satisfies the custom encoding rules of a reduced instruction set (RISC); obtaining the number of sampling points to be calculated and the data to be processed based on the instruction information; selecting at least two delay feedback modules from a preset delay feedback module set of the specific accelerator according to the number of sampling points, and selecting a target rotation factor storage unit from a rotation factor storage module, wherein the specific accelerator is a Fast Fourier Transform (FFT) accelerator, the delay feedback module set includes at least six delay feedback modules, the rotation factor storage module includes at least eight rotation factor storage units, and any one of the at least eight rotation factor storage units is connectable to any delay feedback module; enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; and inputting the data to be processed into the target accelerator circuit to obtain a calculation result. In this embodiment, custom instruction information for FFT calculations of various point counts is defined based on the RISC-V custom instruction architecture. The number of sampling points to be calculated and the data to be processed are determined from the custom instruction information. Based on the number of FFT sampling points, multiple SDF modules and the corresponding target twiddle factor storage unit are determined in the preset FFT accelerator circuit, and the target accelerator circuit they form performs the calculation on the data to be processed. Because custom instructions are used, the instructions can be extended and are no longer restricted to a fixed set of instructions. Moreover, the extension of the instructions can match the structural extension of the specific accelerator, improving the structural form of the specific accelerator.

Attached Figure Description

[0038] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0039] Figure 1 is a schematic diagram of the hardware implementation structure of a 64-point radix-2² SDF FFT accelerator in the prior art;

[0040] Figure 2 is a schematic diagram of the structure of an SDF operation unit in the prior art;

[0041] Figure 3 is a flowchart of Embodiment 1 of the application method for a specific accelerator provided in this application;

[0042] Figure 4 is a flowchart of Embodiment 2 of the application method for a specific accelerator provided in this application;

[0043] Figure 5 is a flowchart of Embodiment 3 of the application method for a specific accelerator provided in this application;

[0044] Figure 6 is a flowchart of Embodiment 4 of the application method for a specific accelerator provided in this application;

[0045] Figure 7 is a schematic diagram of the structure of Embodiment 1 of the specific accelerator provided in this application;

[0046] Figure 8 is a schematic diagram of the structure of the delay feedback module set in Embodiment 2 of the specific accelerator provided in this application;

[0047] Figure 9 is another schematic diagram of the structure of the delay feedback module set in Embodiment 2 of the specific accelerator provided in this application;

[0048] Figure 10 is a schematic diagram of the structure of Embodiment 3 of the specific accelerator provided in this application;

[0049] Figure 11 is a schematic diagram of the structure of Embodiment 1 of the application system for a specific accelerator provided in this application;

[0050] Figure 12 is a schematic diagram of the structure of the first processor in Embodiment 2 of the application system for a specific accelerator provided in this application;

[0051] Figure 13 is a schematic diagram of the encoding information of eight custom instructions in Embodiment 2 of the application system for a specific accelerator provided in this application;

[0052] Figure 14 is a schematic diagram of an application scenario of the application system for a specific accelerator provided in this application.

Detailed Implementation

[0053] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0054] Figure 3 is a flowchart of Embodiment 1 of the application method for a specific accelerator provided in this application. The method is applied to a first processor containing a specific accelerator, which is specifically a Fast Fourier Transform (FFT) accelerator. The method includes the following steps:

[0055] Step S301: Obtain instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of the reduced instruction set;

[0056] Specifically, this instruction information refers to a RISC-V custom instruction, where RISC stands for Reduced Instruction Set Computer.

[0057] RISC-V is a new open-source reduced instruction set architecture that draws on the development experience of mature instruction set architectures, which makes its design more refined and modern. It supports custom extended instruction sets and can adapt to customized processor designs ranging from low-power embedded devices to high-performance computing. The RISC-V architecture is flexibly configurable and highly extensible, which makes it easy to implement domain-specific accelerators on top of the general RISC-V architecture. The standard instruction set defined by the RISC-V architecture uses only a small portion of the instruction encoding space, reserving further encoding space for users to define extended instructions. To facilitate such extensions, the RISC-V architecture predefines four groups of custom instruction types in the 32-bit instruction set. Each custom instruction type has its own opcode, and users can use these four instruction types to extend the instruction set of a custom processor.

[0058] In this application, the custom RISC-V instruction architecture is adopted to customize instructions related to FFT calculations of various point counts, so as to control the FFT calculations of the corresponding point counts.

[0059] It should be noted that, because custom instructions are used, the instructions can be extended and are no longer restricted to a fixed set of instructions. Moreover, the extension of the instructions can match the structural extension of the specific accelerator, thereby improving the structural form of the specific accelerator.

[0060] Step S302: Analyze the instruction information to obtain the number of sampling points to be calculated and the data to be processed;

[0061] Specifically, by parsing the instruction information, the number of sampling points carried in it is obtained, together with the starting address in memory of the data to be processed; the data to be processed is then read from memory at that starting address.

[0062] Specifically, this number of sampling points refers to the number of FFT sampling points, and all subsequent references to the number of sampling points in this application refer to the number of FFT sampling points.

[0063] The number of sampling points can be directly determined by the information carried in the instruction information; the data to be processed can be obtained from the memory based on the starting address in the memory corresponding to the instruction information.

[0064] Specifically, this instruction is a 32-bit instruction, and according to the RISC-V encoding rules, different fields in this instruction carry different information.
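As an illustration of how fields in a 32-bit instruction can carry this information, the following sketch decodes the standard RISC-V R-type layout (funct7, rs2, rs1, funct3, rd, opcode) under the custom-0 major opcode. The field positions and the custom-0 opcode value follow the RISC-V specification. The choice of funct7 as the field that selects the point count, and the mapping table `FUNCT7_TO_POINTS`, are assumptions made purely for this example; the actual encodings of the eight custom instructions are those shown in Figure 13.

```python
CUSTOM0_OPCODE = 0b0001011  # RISC-V custom-0 major opcode

# Hypothetical mapping from funct7 to FFT size (0 -> 16, 1 -> 32, ..., 7 -> 2048).
# The real instruction codes are defined in Figure 13 of the patent.
FUNCT7_TO_POINTS = {code: 16 << code for code in range(8)}

def encode(funct7, rs2, rs1, funct3=0, rd=0, opcode=CUSTOM0_OPCODE):
    """Pack R-type fields into a 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def decode(instr):
    """Split a 32-bit R-type word into fields and look up the point count."""
    opcode = instr & 0x7F
    if opcode != CUSTOM0_OPCODE:
        raise ValueError("not a custom-0 instruction")
    funct7 = (instr >> 25) & 0x7F
    if funct7 not in FUNCT7_TO_POINTS:
        raise ValueError("unknown FFT instruction code")
    return {
        "points": FUNCT7_TO_POINTS[funct7],
        "rs1": (instr >> 15) & 0x1F,   # first source register index
        "rs2": (instr >> 20) & 0x1F,   # second source register index
        "rd": (instr >> 7) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
    }
```

For instance, `decode(encode(2, rs2=11, rs1=10))` yields a 64-point request with source registers 10 and 11 under this assumed mapping.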

[0065] It should be noted that the above process will be described in detail in subsequent embodiments, but will not be described in detail in this embodiment.

[0066] Step S303: Select at least two delay feedback modules from the set of delay feedback modules for a preset specific accelerator according to the number of sampling points, and select a target rotation factor storage unit from the rotation factor storage module;

[0067] The circuit of the preset specific accelerator includes at least six delay feedback modules, and the rotation factor storage module includes at least eight rotation factor storage units (Twiddle). Any one of the at least eight rotation factor storage units can be connected to any delay feedback module.

[0068] Specifically, the delay feedback module is the SDF module, and all subsequent delay feedback modules mentioned in this application refer to the SDF module.

[0069] The detailed explanation of the circuit structure of the preset specific accelerator will be provided in subsequent specific accelerator embodiments, and will not be described in detail in this embodiment.

[0070] Specifically, based on the number of FFT sampling points determined from the instruction information, the corresponding SDF modules and the corresponding twiddle factor storage unit are determined in the preset FFT accelerator circuit.

[0071] Step S304: Enable the at least two delay feedback modules and the target rotation factor storage unit to obtain the target accelerator circuit;

[0072] In this process, after selecting the delay feedback module and the target rotation factor storage unit, the selected module and unit are enabled to form the target accelerator circuit.

[0073] Step S305: Input the data to be processed into the target accelerator circuit to obtain the calculation result.

[0074] Once the target accelerator circuit is determined, its input terminal is identified. The data to be processed is then input into the target accelerator circuit through this input terminal, enabling the target accelerator circuit to perform calculations and obtain the results.

[0075] It should be noted that since this FFT accelerator can support custom instructions for calculating different numbers of sampling points, it can be mounted on an open-source processor. This allows the relevant custom instructions to be directly called and executed when performing FFT acceleration calculations. Furthermore, the relevant information in the instructions is connected to the accelerator through combinational logic, eliminating the need to consume additional time cycles for pre-configuration of the accelerator. This reduces the number of instructions to be executed and the number of clock cycles consumed for each FFT acceleration calculation, thereby improving work efficiency.

[0076] In summary, this embodiment provides an application method for a specific accelerator, applied to a first processor including the specific accelerator. The method includes: obtaining instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set; obtaining the number of sampling points to be calculated and the data to be processed based on the instruction information; selecting at least two delay feedback modules from a preset delay feedback module set of the specific accelerator according to the number of sampling points, and selecting a target rotation factor storage unit from a rotation factor storage module, wherein the specific accelerator is a Fast Fourier Transform accelerator, the delay feedback module set includes at least six delay feedback modules, the rotation factor storage module includes at least eight rotation factor storage units, and any one of the at least eight rotation factor storage units is connectable to any delay feedback module; enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; and inputting the data to be processed into the target accelerator circuit to obtain a calculation result. In this embodiment, custom instruction information for FFT calculations of various point counts is defined based on the RISC-V custom instruction architecture. The number of sampling points to be calculated and the data to be processed are determined from the custom instruction information. Based on the number of FFT sampling points, multiple SDF modules and the corresponding target twiddle factor storage unit are determined in the preset FFT accelerator circuit, and the target accelerator circuit they form performs the calculation on the data to be processed. Because custom instructions are used, the instructions can be extended and are no longer restricted to a fixed set of instructions. Moreover, the extension of the instructions can match the structural extension of the specific accelerator, improving the structural form of the specific accelerator.

[0077] Figure 4 is a flowchart of Embodiment 2 of the application method for an FFT accelerator provided in this application. The method includes the following steps:

[0078] Step S401: Obtain instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of the reduced instruction set;

[0079] Step S402: Analyze the instruction information to obtain the number of sampling points to be calculated and the data to be processed;

[0080] Steps S401-S402 are the same as the corresponding steps in Embodiment 1 and are not described again in this embodiment.

[0081] Step S403: Select the target rotation factor storage unit in the rotation factor storage module according to the number of sampling points;

[0082] The rotation factor storage module contains at least eight rotation factor storage units, and each rotation factor storage unit stores the corresponding rotation factor.

[0083] Specifically, based on the number of sampling points, a rotation factor storage unit with the same number of sampling points is selected as the target rotation factor storage unit in the rotation factor storage module.

[0084] For example, the eight rotation factor storage units in this rotation factor storage module are TW16, TW32, TW64, TW128, TW256, TW512, TW1024, and TW2048. If the number of sampling points is 16, then rotation factor storage unit TW16 is selected; if the number of sampling points is 1024, then rotation factor storage unit TW1024 is selected.
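The selection in this example amounts to a lookup keyed by the number of sampling points. A minimal sketch follows, assuming that each storage unit holds the N values W_N^k = exp(-2πik/N) for k = 0..N-1. The patent text does not state the exact contents or size of each unit, so that layout, and the names `TWIDDLE_MODULE` and `select_twiddle_unit`, are illustrative assumptions.

```python
import cmath

SUPPORTED_POINTS = (16, 32, 64, 128, 256, 512, 1024, 2048)

def build_twiddle_unit(n):
    """Assumed contents of one storage unit: W_n^k for k = 0..n-1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]

# The storage module modelled as a dictionary of named units TW16..TW2048.
TWIDDLE_MODULE = {f"TW{n}": build_twiddle_unit(n) for n in SUPPORTED_POINTS}

def select_twiddle_unit(points):
    """Return the name and contents of the unit matching the point count."""
    name = f"TW{points}"
    if name not in TWIDDLE_MODULE:
        raise ValueError(f"unsupported number of sampling points: {points}")
    return name, TWIDDLE_MODULE[name]
```

A hardware implementation would likely store fewer values by exploiting symmetry; the full table is used here only to keep the sketch short.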

[0085] Step S404: Determine the number of cascaded stages of the delay feedback module required to calculate the number of sampling points;

[0086] The set of delay feedback modules for a specific accelerator includes two types of delay feedback modules: a first delay feedback module and a second delay feedback module.

[0087] The first delay feedback module includes at least two butterfly operation units, and the second delay feedback module includes at least one butterfly operation unit.

[0088] In practice, the first delay feedback module can be a radix-4 SDF module, and the second delay feedback module can be a radix-2 SDF module.

[0089] The number of cascaded stages of the delay feedback module required to perform the calculation based on the number of sampling points is obtained.

[0090] Specifically, the required number of cascaded stages of the delay feedback module is calculated based on the number of sampling points.

[0091] Specifically, the number of sampling points is factored into a product of 4s and 2s, and the corresponding numbers of first and second delay feedback modules are selected accordingly.

[0092] For example, if the instruction information indicates that a 64-point FFT calculation is to be performed, three stages of radix-4 SDF modules are required. Accordingly, three radix-4 SDF modules are determined in the FFT accelerator, along with the corresponding twiddle factor storage unit TW64.

[0093] As further examples, a 64-point FFT calculation requires three stages of radix-4 SDF modules; a 1024-point FFT calculation requires five stages of radix-4 SDF modules; and a 32-point FFT calculation requires two stages of radix-4 SDF modules and one stage of a radix-2 SDF module.
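The stage counts in these examples follow from writing the number of sampling points as 4^a × 2^b with b equal to 0 or 1. A minimal sketch of that calculation is given below; the function name `cascade_stages` is illustrative and does not appear in the patent.

```python
def cascade_stages(points):
    """Return (radix-4 stages, radix-2 stages) for a power-of-two point count."""
    if points < 2 or points & (points - 1):
        raise ValueError("number of sampling points must be a power of two")
    log2n = points.bit_length() - 1
    # Each radix-4 stage accounts for two factors of 2; an odd leftover
    # factor of 2 needs a single radix-2 stage.
    return log2n // 2, log2n % 2
```

With this helper, 64 points gives (3, 0), 1024 points gives (5, 0), and 32 points gives (2, 1), matching the examples in the text.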

[0094] Step S405: Select at least two target delay feedback modules from the set of delay feedback modules for a specific accelerator based on the number of cascaded stages of the delay feedback modules;

[0095] The target delay feedback module includes a first delay feedback module and / or a second delay feedback module.

[0096] In this preset set of delay feedback modules for a specific accelerator, the positions of each delay feedback module are fixed and the connection method is selectable. Based on the determined number of series stages, several delay feedback modules that can be connected in series and can be connected to the input and output ends are selected from the set of delay feedback modules as target delay feedback modules.

[0097] It should be noted that this application does not restrict the order of the steps of selecting the target rotation factor storage unit and selecting the delay feedback module. It is possible to select the target delay feedback module first based on the number of sampling points and then select the target rotation factor storage unit, or to select the rotation factor storage unit first based on the number of sampling points and then select the target delay feedback module, or to perform both steps simultaneously.

[0098] Step S406: Enable the at least two delay feedback modules and the target rotation factor storage unit to obtain the target accelerator circuit;

[0099] In this process, the at least two target SDF modules and the target twiddle factor storage unit determined in the preceding steps are connected through multiplexers, and the target SDF modules are connected to the output terminal through a multiplexer.

[0100] Once the target SDF module and the target rotation factor storage unit are determined, the target SDF module and the target rotation factor storage unit are enabled, so that the circuit they form is turned on, thus obtaining the target accelerator circuit.

[0101] Correspondingly, the input data subsequently enters at the first SDF module of the target accelerator circuit and is processed by each SDF module of the target accelerator circuit in turn. Once the last SDF module has processed the data, the calculation result is output through the output terminal of that last SDF module.

[0102] Step S407: Input the data to be processed into the target accelerator circuit to obtain the calculation result.

[0103] Step S407 is the same as the corresponding step in Example 1, and will not be described again in this example.

[0104] In summary, this embodiment provides an application method for a specific accelerator, comprising: determining the number of cascaded stages of delay feedback modules required for calculating the number of sampling points; selecting at least two target delay feedback modules from a preset set of delay feedback modules for a specific accelerator based on the number of cascaded stages of the delay feedback modules, wherein the target delay feedback modules include a first delay feedback module and / or a second delay feedback module, the first delay feedback module including at least two butterfly operation units, and the second delay feedback module including at least one butterfly operation unit. In this embodiment, the required number of cascaded stages of SDF modules is determined based on the number of sampling points, multiple target SDF modules are selected from a preset FFT accelerator circuit based on the number of cascaded stages, the target twiddle factor storage unit corresponding to the number of FFT sampling points is determined, and the circuit formed by enabling the target SDF modules and the target twiddle factor storage unit is made conductive to obtain the target accelerator circuit. When selecting SDF modules, each SDF module can be reused, providing high flexibility.

[0105] Figure 5 is a flowchart of a third embodiment of the application method for a specific accelerator provided in this application. The method includes the following steps:

[0106] Step S501: Obtain instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set;

[0107] Step S502: Analyze the instruction information to obtain the number of sampling points to be calculated;

[0108] Steps S501-502 are the same as the corresponding steps in Example 1, and will not be repeated in this example.

[0109] Step S503: Based on the source operand in the instruction information, read the data to be processed from the address corresponding to the source operand in the external memory;

[0110] The instruction information also includes source operands, which represent the storage location of the data to be processed in external memory.

[0111] In this embodiment, a second processor is also provided corresponding to the first processor used.

[0112] The data to be processed is obtained from external memory by the second processor and then sent to the first processor by the second processor.

[0113] Specifically, the first processor requests the data to be processed from the second processor based on the source operand. The second processor reads the data to be processed from the external memory based on the source operand and feeds it back to the internal memory of the first processor. The first processor then obtains the data to be processed from its internal memory.

[0114] The second processor provides data to be processed to the first processor. After receiving the data to be processed, the first processor stores it in its internal memory and obtains the data to be processed from the internal memory to input the target accelerator circuit.

[0115] Specifically, the instruction information is a 32-bit RISC-V instruction. Bits 15-19 of this instruction record the source operand, which indicates the storage address of the data to be processed in external memory.

[0116] Specifically, after the first processor parses the instruction information, it obtains the content of bits 15-19. Based on this content, it determines the starting address of the storage location in external memory. The data to be processed is then read starting from this address.

[0117] In a specific implementation, the instruction can also record where in external memory the calculation result is to be stored. Specifically, a second source operand is recorded in bits 20-24 of the instruction and serves as the starting address for storing the result. After the calculation result is obtained, it is sent to the external memory through the second processor, and the external memory stores the result starting from the storage address corresponding to this source operand.
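
The extraction of the two operand fields can be illustrated as follows. This is a sketch only: the helper names bits and decode_operands are hypothetical, and the text does not specify whether each 5-bit field holds an address directly or a register index whose value supplies the address (in the standard RISC-V R-type layout, bits 15-19 and 20-24 are the rs1 and rs2 register fields).

```python
def bits(word: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi (inclusive, bit 0 = least significant)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_operands(instr: int) -> dict[str, int]:
    """Return the two 5-bit operand fields of a 32-bit instruction word."""
    return {
        "input_operand": bits(instr, 15, 19),   # locates the data to be processed
        "result_operand": bits(instr, 20, 24),  # locates where the result is stored
    }


if __name__ == "__main__":
    # Example word: field 25-31 = 0000010, bits 20-24 = 11, bits 15-19 = 10.
    # The low 7 bits (0001011) are an assumed opcode value for illustration.
    word = (0b0000010 << 25) | (11 << 20) | (10 << 15) | 0b0001011
    print(decode_operands(word))
```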

[0118] Step S504: Select at least two delay feedback modules from the set of delay feedback modules for a preset specific accelerator according to the number of sampling points, and select a target rotation factor storage unit from the rotation factor storage module;

[0119] Step S505: Enable the at least two delay feedback modules and the target rotation factor storage unit to obtain the target accelerator circuit;

[0120] Step S506: Input the data to be processed into the target accelerator circuit to obtain the calculation result.

[0121] Steps S504-506 are the same as the corresponding steps in Example 1, and will not be described again in this example.

[0122] In summary, this embodiment provides an application method for a specific accelerator. The instruction information further includes a source operand. The method also includes reading data to be processed from the address corresponding to the source operand in the external memory, based on the source operand in the instruction information. In this embodiment, after parsing the source operand of the instruction information, data to be processed is read from the external memory based on the address corresponding to the source operand, thereby providing data to be processed for FFT accelerator computation.

[0123] Figure 6 is a flowchart of a fourth embodiment of the application method for a specific accelerator provided in this application. The method includes the following steps:

[0124] Step S601: Obtain instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set;

[0125] Step S601 is the same as the corresponding step in Example 1, and will not be described again in this example.

[0126] Step S602: Analyze the instruction information to obtain the instruction code recorded in the first field of the instruction information;

[0127] The different fields in this instruction information record different information.

[0128] The first field can have a set number of bits, such as 7 consecutive bits.

[0129] For example, the system defines 8 instructions: fft16, fft32, fft64, fft128, fft256, fft512, fft1024, and fft2048. Correspondingly, fft16 is represented by 7 consecutive bits as 0000000, fft32 by 7 consecutive bits as 0000001, fft64 by 7 consecutive bits as 0000010, fft128 by 7 consecutive bits as 0000011, fft256 by 7 consecutive bits as 0000100, fft512 by 7 consecutive bits as 0000101, fft1024 by 7 consecutive bits as 0000110, and fft2048 by 7 consecutive bits as 0000111.

[0130] Specifically, the instruction information is parsed to obtain the bit pattern in its first field, which is the instruction code, and the corresponding instruction is then determined from this bit pattern.

[0131] For example, if parsing the instruction information shows that the first field holds the bit pattern 0000010, the corresponding instruction is fft64.

[0132] It should be noted that a 7-bit field provides 128 possible encodings, of which the instructions defined above use only 8. The remaining encodings can be used to define further instructions, providing a foundation for future expansion.

[0133] Specifically, the first field of the instruction information can be bits 25-31.

[0134] It should be noted that the configuration of the accelerator can be achieved by encoding the funct7 field in the custom instruction and connecting it to the accelerator, without consuming additional time cycles to pre-configure the accelerator.

[0135] Step S603: Based on the correspondence between the instruction code and the number of sampling points, obtain the number of sampling points corresponding to the instruction information;

[0136] The correspondence between instructions and numbers of sampling points is preset in the first processor.

[0137] After the instruction code is determined, the number of FFT sampling points corresponding to that instruction code is determined.

[0138] For example, the system defines 8 instructions: fft16, fft32, fft64, fft128, fft256, fft512, fft1024, and fft2048, which correspond to FFT calculations of 16, 32, 64, 128, 256, 512, 1024, and 2048 points, respectively. If the target instruction is fft32, it can be determined that the number of FFT sampling points it corresponds to is 32.
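
The correspondence between instruction code and number of sampling points can be sketched as a lookup on bits 25-31. The table below follows the eight encodings listed above (0000000 for fft16 through 0000111 for fft2048); the names FFT_POINTS_BY_CODE and sampling_points are hypothetical.

```python
# Instruction code (bits 25-31) -> number of sampling points,
# following the eight encodings listed in the text.
FFT_POINTS_BY_CODE = {code: 16 << code for code in range(8)}


def sampling_points(instr: int) -> int:
    """Return the number of FFT sampling points encoded in a 32-bit instruction."""
    code = (instr >> 25) & 0x7F  # the 7-bit first field
    if code not in FFT_POINTS_BY_CODE:
        raise ValueError(f"instruction code {code:07b} is not a defined FFT instruction")
    return FFT_POINTS_BY_CODE[code]


if __name__ == "__main__":
    print(sampling_points(0b0000001 << 25))  # the fft32 encoding
```

Codes outside the eight defined values are rejected here; an implementation that later extends the instruction set would add entries to the table instead.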

[0139] Step S604: Analyze the instruction information to obtain the data to be processed;

[0140] Step S605: Select at least two delay feedback modules from the set of delay feedback modules for a preset specific accelerator according to the number of sampling points, and select a target rotation factor storage unit from the rotation factor storage module;

[0141] Step S606: Enable the at least two delay feedback modules and the target rotation factor storage unit to obtain the target accelerator circuit;

[0142] Step S607: Input the data to be processed into the target accelerator circuit to obtain the calculation result.

[0143] Steps S604-607 are the same as the corresponding steps in Example 1, and will not be repeated in this example.

[0144] In summary, this embodiment provides a method for applying a specific accelerator, comprising: analyzing the instruction information to obtain the instruction code recorded in the first field of the instruction information; and obtaining the number of sampling points corresponding to the instruction information based on the correspondence between the instruction code and the number of FFT sampling points. In this embodiment, the number of sampling points is determined from the instruction code recorded in the first field alone, so a single field of the instruction information is sufficient to indicate the number of sampling points to the first processor, and thereby to control the first processor to combine the delay feedback modules according to that number to obtain the target accelerator circuit.

[0145] Corresponding to the above embodiment of the application method of a specific accelerator provided in this application, this application also provides an embodiment of the specific accelerator.

[0146] Figure 7 is a structural schematic of embodiment 1 of a specific accelerator provided in this application. The specific accelerator includes the following structure: a delay feedback module set 701 and a rotation factor storage module 702.

[0147] The delay feedback module set includes at least six delay feedback modules, and the combination of the at least six delay feedback modules yields at least eight different sampling point numbers.

[0148] The rotation factor storage module includes at least eight rotation factor storage units, and the rotation factor storage module is connected to the delay feedback module set; the rotation factor storage module provides the delay feedback module with rotation factor storage units corresponding to the number of sampling points.

[0149] Specifically, a rotation factor is the complex constant by which data is multiplied in the butterfly operation of the FFT algorithm. Each rotation factor storage unit corresponds to a complex constant, and the complex constants corresponding to different rotation factor storage units are different.
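
For orientation, the complex constants in question can be computed as below. This application does not specify the contents or layout of each storage unit; the sketch assumes the standard DFT convention in which the k-th rotation (twiddle) factor of an N-point transform is exp(-2*pi*i*k/N). The function name twiddle_table is hypothetical.

```python
import cmath


def twiddle_table(n_points: int) -> list[complex]:
    """Return W_N^k = exp(-2*pi*i*k/N) for k = 0 .. N-1 (standard DFT convention)."""
    return [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(n_points)]


if __name__ == "__main__":
    # Hypothetical content of a unit such as TW16, under the assumption above.
    for k, w in enumerate(twiddle_table(16)[:4]):
        print(k, w)
```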

[0150] The delay feedback modules in this set can be combined to obtain various combinations corresponding to different numbers of sampling points.

[0151] The rotation factor storage module is provided with multiple rotation factor storage units, which can be reused by various delay feedback modules to provide rotation factor storage units for the delay feedback modules involved in the combination of various delay feedback modules.

[0152] Specifically, by combining the various delay feedback modules in the delay feedback module set, it is possible to obtain a number of sampling points that can be supported.

[0153] For example, the delay feedback module set can provide FFT calculations for eight different sampling point numbers, such as 16, 32, 64, 128, 256, 512, 1024, and 2048 points.

[0154] It should be noted that the numbers of sampling points supported by the delay feedback modules in this delay feedback module set are not limited to the example above. In specific implementations, more numbers of sampling points can be supported according to the actual situation.

[0155] For example, the rotation factor storage units stored in this rotation factor storage module include TW16, TW32, TW64, TW128, TW256, TW512, TW1024, TW2048, etc.

[0156] It should be noted that the value of the rotation factor storage unit in this rotation factor storage module is not limited to the above example. In specific implementations, more rotation factor storage units can be set according to the actual situation.

[0157] In summary, this embodiment provides a specific accelerator, comprising: a delay feedback module set and a rotation factor storage module; wherein the delay feedback module set includes at least six delay feedback modules, and the at least six delay feedback modules are combined to obtain at least eight sampling point numbers; wherein the rotation factor storage module includes at least eight rotation factor storage units, and the rotation factor storage module is connected to the delay feedback module set; the rotation factor storage module provides rotation factor storage units corresponding to the sampling point numbers of the delay feedback modules. In this embodiment, a delay feedback module set consisting of multiple SDF modules is set up. The combination of SDF modules in this SDF module set can obtain at least 8 different FFT sampling point numbers. Moreover, the FFT accelerator is equipped with a rotation factor storage module, which uniformly stores rotation factor storage units. When the SDF module set determines the number of FFT sampling points obtained by combining multiple SDF modules, the rotation factor storage module provides rotation factor storage units corresponding to the number of FFT sampling points. The cascade structure of SDF modules in the FFT accelerator is set according to the number of FFT sampling points. When processing FFT calculations with different number of points, each SDF module can be reused, which provides high flexibility.

[0158] This application provides a specific accelerator embodiment 2, which includes the following structure: a delay feedback module set and a rotation factor storage unit module; the structure and function of the rotation factor storage unit module are the same as those in the aforementioned embodiment 1, and will not be described again in this embodiment.

[0159] Figure 8 is a structural schematic of a delay feedback module set, which includes at least five first delay feedback modules 801 and one second delay feedback module 802.

[0160] Figure 8 takes five first delay feedback modules and one second delay feedback module as an example; the number of first delay feedback modules is not limited to this.

[0161] The first delay feedback module includes two butterfly operation units BF 8011, four delay units D 8012, and a multiplication unit 8013. The four delay units are combined in pairs and connected in parallel to obtain delay unit pairs corresponding to the butterfly operation units.

[0162] Among them, two butterfly operation units are arranged in sequence. One butterfly operation unit is connected to the input terminal of the first delay feedback module, and the other butterfly operation unit is connected to the multiplication unit 8013 of the first delay feedback module. The four delay units are combined in pairs to obtain two delay unit pairs. The two delay units in a delay unit pair are connected in parallel, and a delay unit pair corresponds to a butterfly operation unit.

[0163] The second delay feedback module includes a butterfly operation unit 8021 and a delay unit 8022, and the delays of the delay units in any two delay feedback modules are different.

[0164] The second delay feedback module consists of only one butterfly operation unit and one delay unit, without a multiplication unit.

[0165] Specifically, the first delay feedback module is a radix-4 SDF unit (radix-2^2), and the second delay feedback module is a radix-2 SDF unit.

[0166] Specifically, the delay units of different delay feedback modules are different.

[0167] For example, delay units can use delay parameters such as 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1, etc.

[0168] Specifically, the second delay feedback module, as the last module in the delay feedback module set, uses a delay parameter of 1 for its delay unit.

[0169] Specifically, the four delay units in a first delay feedback module use three different delay parameters that are adjacent in size, with the intermediate delay parameter used twice. The maximum delay parameter and the minimum delay parameter are each combined with the intermediate delay parameter to obtain two delay unit pairs.

[0170] Figure 9 is another structural schematic of the delay feedback module set, which includes at least five first delay feedback modules and one second delay feedback module. In this schematic, the five first delay feedback modules are 901-905, and the second delay feedback module is 906.

[0171] The at least five first delay feedback modules and the second delay feedback module are arranged sequentially.

[0172] The output of the second delay feedback module 906 and the output of the last target first delay feedback module 905 among the at least five first delay feedback modules are respectively connected to the data output of the specific accelerator.

[0173] The data input terminals of the specific accelerator are respectively connected to the input terminals of at least four remaining first delay feedback modules 901-904, excluding the target first delay feedback module;

[0174] The at least four first delay feedback modules 901-904, the target first delay feedback module 905, and the second delay feedback module 906 are connected in sequence. A multiplexer is provided between any two adjacent first delay feedback modules among the at least four, so that the input terminal of any one of the at least four first delay feedback modules can be connected either to the output terminal of the previous first delay feedback module or to the data input terminal of the specific accelerator.

[0175] Each of the first delay feedback modules 901-904 is also connected to the data input terminal of a specific accelerator through a multiplexer, so that any one of the first delay feedback modules 901-904 can be used as an input delay feedback module.

[0176] Specifically, the multiple first delay feedback modules are arranged sequentially according to the delay parameter size of the delay unit, and different delay feedback modules can be selected according to the number of sampling points of FFT to achieve the selection of different delay paths.

[0177] In Figure 9, the delay units of the delay feedback module 901 are 1024D, 512D, 512D, and 256D; the 1024D and 512D units are connected in parallel to form one delay unit pair, and the 512D and 256D units are connected in parallel to form another delay unit pair. The delay units of the delay feedback module 902 are 256D, 128D, 128D, and 64D; the 256D and 128D units are connected in parallel to form one delay unit pair, and the 128D and 64D units are connected in parallel to form another delay unit pair. The delay units of the delay feedback module 903 are 64D, 3... The delay units of the delay feedback module 904 are 16D, 8D, 8D, and 4D; the 16D and 8D units are connected in parallel to form one delay unit pair, and the 8D and 4D units are connected in parallel to form another delay unit pair. The delay units of the delay feedback module 905 are 4D, 2D, 2D, and 1D; the 4D and 2D units are connected in parallel to form one delay unit pair, and the 2D and 1D units are connected in parallel to form another delay unit pair. The delay unit of the delay feedback module 906 is 1D.

[0178] Different numbers of sampling points correspond to different combinations of delay feedback modules. In this application, the second delay feedback module and the last of the at least five first delay feedback modules can be used as the output delay feedback module, and the input delay feedback module is selected from each of the first delay feedback modules according to different numbers of sampling points.

[0179] Table 1 below shows the correspondence between the number of sampling points, delay paths, and delay cycles.

[0180] Table 1

[0181]

[0182] The required delays vary with the number of points. For example, a 64-point FFT calculation requires three stages of radix-4 SDF units: the input data enters at delay feedback module 903, bypasses delay feedback module 906 (which is a radix-2 SDF unit), and is output directly from delay feedback module 905. The six delay units involved should have delays of 32, 16, 8, 4, 2, and 1 cycles, respectively. A 128-point FFT calculation requires three stages of radix-4 SDF units and one stage of radix-2 SDF unit: the input data also enters at delay feedback module 903, but is output from delay feedback module 906. The seven delay units involved should have delays of 64, 32, 16, 8, 4, 2, and 1 cycles, respectively. Therefore, each delay position needs two parallel delay paths, so that FFT calculations with different numbers of points can all follow the same serial path.
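
The path selection described for 64 and 128 points can be generalized as in the following sketch. It extrapolates from those two examples and from the module arrangement of Figure 9, so the results for other point counts are an assumption (Table 1 is not reproduced here). The names FIRST_MODULES, SECOND_MODULE, and select_path are hypothetical.

```python
FIRST_MODULES = [901, 902, 903, 904, 905]  # radix-4 modules, largest delays first
SECOND_MODULE = 906                        # radix-2 module with delay 1


def select_path(n_points: int) -> tuple[list[int], list[int]]:
    """Return (modules on the serial path, delay of each delay unit in cycles)."""
    if n_points & (n_points - 1) or not 16 <= n_points <= 2048:
        raise ValueError("supported sizes are powers of two from 16 to 2048")
    log2_n = n_points.bit_length() - 1
    radix4_stages = log2_n // 2
    # The path always ends at 905 (or 906); the entry module moves toward
    # 901 as more radix-4 stages are needed.
    path = FIRST_MODULES[len(FIRST_MODULES) - radix4_stages:]
    if log2_n % 2:
        path.append(SECOND_MODULE)
    # One delay unit per factor of 2: N/2, N/4, ..., 1 cycles.
    delays = [n_points >> i for i in range(1, log2_n + 1)]
    return path, delays


if __name__ == "__main__":
    for n in (64, 128):
        print(n, *select_path(n))
```

For 64 points this gives modules 903-905 with delays 32, 16, 8, 4, 2, 1, and for 128 points modules 903-906 with delays 64, 32, 16, 8, 4, 2, 1, matching the two cases described above.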

[0183] In summary, this embodiment provides a specific accelerator comprising: the delay feedback module set includes at least five first delay feedback modules and one second delay feedback module; wherein, each first delay feedback module includes at least two butterfly operation units, four delay units, and a multiplication unit, the four delay units being paired in parallel to form delay unit pairs corresponding to the butterfly operation units; the second delay feedback module includes at least one butterfly operation unit and one delay unit, and the delay units in any two delay feedback modules have different delays. In this embodiment, the delay units in each first delay feedback module are two delay units connected in parallel, and the desired delay path is selected by a multiplexer, enabling the delay feedback module set to handle FFT calculations with different numbers of points, thus achieving delay feedback module reuse.

[0184] Figure 10 is a structural schematic of embodiment 3 of a specific accelerator provided in this application, including the following structure: a delay feedback module set and a rotation factor storage module;

[0185] The delay feedback module set is the same as that in the aforementioned embodiment 2, and will not be described again in this embodiment.

[0186] The at least five first delay feedback modules and the second delay feedback module are arranged sequentially, with the five first delay feedback modules being 1001-1005 and the second delay feedback module being 1006.

[0187] The rotation factor storage module 1007 includes at least eight rotation factor storage units, each of which is connected to a corresponding first delay feedback module.

[0188] Figure 10 takes eight rotation factor storage units as an example. Each rotation factor storage unit corresponds to a different complex constant, and the value in the name of each rotation factor storage unit indicates the number of FFT sampling points to which it corresponds.

[0189] If a first delay feedback module can use multiple rotation factor storage units, those rotation factor storage units are connected to the multiplication unit of the first delay feedback module through a multiplexer, so that the selected rotation factor is input into the first delay feedback module.

[0190] Specifically, the delay feedback module needs to be connected to the rotation factor storage unit corresponding to the number of points it participates in the FFT calculation.

[0191] The second delay feedback module, as the last stage of the circuit, does not have a multiplication unit and does not receive input from any rotation factor storage unit.

[0192] In Figure 10, the eight rotation factor storage units are TW16, TW32, TW64, TW128, TW256, TW512, TW1024, and TW2048, respectively. The first delay feedback module 1001 is connected to TW1024 and TW2048 through multiplexer 1008; the first delay feedback module 1002 is connected to TW256, TW512, TW1024, and TW2048 through multiplexer 1009; the first delay feedback module 1003 is connected to TW64, TW128, TW256, TW512, TW1024, and TW2048 through multiplexer 1010; the first delay feedback modules 1004-1005 are connected to TW16, TW32, TW64, TW128, TW256, TW512, TW1024, and TW2048 through their respective multiplexers 1011.

[0193] Specifically, the list of rotation factor storage units required for the FFT calculation of each point can be compiled into a rotation factor storage module. The corresponding rotation factor storage units can be sent to the multiplication units of each delay feedback module through a multiplexer, thus realizing the reuse of rotation factor storage units.

[0194] In the name TWN, the value of N represents the number of FFT sampling points of the calculation in which the delay feedback module participates. For example, the first delay feedback module 1001 can only participate in the calculation of 1024 and 2048 points, the first delay feedback module 1002 can participate in the calculation of 256, 512, 1024 and 2048 points, the first delay feedback module 1003 can participate in the calculation of 64, 128, 256, 512, 1024 and 2048 points, and the first delay feedback modules 1004-1005 can each participate in the calculation of 16, 32, 64, 128, 256, 512, 1024 and 2048 points.
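
The multiplexer connectivity listed above can be written as a table from which the selection for a given number of points follows directly. This is an illustrative sketch; the names TW_CONNECTIONS and twiddle_select are hypothetical, and the multiplexer control signals themselves are not modeled.

```python
ALL_SIZES = (16, 32, 64, 128, 256, 512, 1024, 2048)

# First delay feedback module -> point counts whose TW unit it is wired to,
# as listed for Figure 10.
TW_CONNECTIONS = {
    1001: (1024, 2048),
    1002: (256, 512, 1024, 2048),
    1003: (64, 128, 256, 512, 1024, 2048),
    1004: ALL_SIZES,
    1005: ALL_SIZES,
}


def twiddle_select(n_points: int) -> dict[int, str]:
    """Return, for each participating module, the TW unit its multiplexer selects."""
    if n_points not in ALL_SIZES:
        raise ValueError("unsupported number of sampling points")
    return {
        module: f"TW{n_points}"
        for module, sizes in TW_CONNECTIONS.items()
        if n_points in sizes
    }


if __name__ == "__main__":
    print(twiddle_select(64))
```

Modules absent from the returned mapping do not take part in a calculation of that size.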

[0195] In summary, this embodiment provides a specific accelerator, comprising: a twiddle factor storage module including at least eight twiddle factor storage units, each twiddle factor storage unit being connected to a corresponding first delay feedback module. In this embodiment, the twiddle factor storage units required for FFT calculations of the various point counts are grouped into one twiddle factor storage module. A multiplexer is used to send the corresponding twiddle factors to the multiplication unit of each delay feedback module, thus achieving reuse of the twiddle factor storage units and simplifying the twiddle factor storage in the FFT accelerator.

[0196] Corresponding to the above-described embodiment of the application method for a specific accelerator provided in this application, this application also provides a system embodiment of the application method for the specific accelerator.

[0197] Figure 11 is a structural schematic of embodiment 1 of an application system of a specific accelerator provided in this application. The system includes the following structure: a first processor 1101 and a second processor 1102.

[0198] The second processor 1102 receives instruction information, determines by analysis that the instruction information belongs to a preset custom information type, and sends the instruction information to the first processor 1101. The instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set.

[0199] Specifically, the first processor 1101 reads the instruction information to determine the number of sampling points, selects at least two delay feedback modules from a preset set of delay feedback modules for a specific accelerator based on the number of sampling points, and selects a target rotation factor storage unit from the rotation factor storage module, enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; requests data to be processed from the second processor; inputs the data to be processed fed back by the second processor into the target accelerator circuit to obtain a calculation result; and sends the calculation result to the second processor, wherein the specific accelerator is a Fast Fourier Transform accelerator.

[0200] The preset custom information type is the custom instruction type.

[0201] Specifically, the instruction information uses a custom RISC-V instruction architecture, which defines custom instructions related to FFT calculations of various point counts in order to control the corresponding FFT calculations.

[0202] Specifically, during the decoding stage, the second processor decodes the opcode of the instruction to determine whether the instruction belongs to the custom instruction group. If it belongs to the custom instruction, it sends a data processing request to the first processor. After completing the handshake with the first processor, it sends the instruction information to the first processor.

[0203] Specifically, the second processor decodes the bits corresponding to the opcode in the instruction information to obtain its content.

[0204] In practice, bits 0-6 of the instruction information record the opcode.

[0205] For a detailed explanation of the execution steps of the first processor, please refer to the explanation in the foregoing method embodiments.

[0206] It should be noted that, in a specific implementation, the second processor performs a handshake before transmitting any information to the first processor. If the first processor responds, indicating that it is idle and can process the information, the second processor sends the data information to it. If the first processor does not respond, the second processor waits until the first processor is idle before sending the data information.
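
The opcode check and handshake can be sketched as a small software model. The opcode value used below (0001011, the custom-0 opcode reserved by the RISC-V base specification) is an assumption, since this application does not state which opcode identifies the custom instruction group; the class FirstProcessor and the function dispatch are likewise hypothetical.

```python
# Assumption: the custom instruction group uses the RISC-V custom-0 opcode.
CUSTOM_OPCODES = {0b0001011}


class FirstProcessor:
    """Toy model of the accelerator-side processor."""

    def __init__(self) -> None:
        self.busy = False
        self.received: list[int] = []

    def handshake(self) -> bool:
        """Respond to a request only when idle."""
        return not self.busy

    def accept(self, instr: int) -> None:
        self.received.append(instr)


def dispatch(instr: int, accel: FirstProcessor) -> str:
    """Model the second processor's decode-stage decision for one instruction."""
    opcode = instr & 0x7F  # bits 0-6
    if opcode not in CUSTOM_OPCODES:
        return "not-custom"  # handled by the second processor itself
    if not accel.handshake():
        return "wait"        # retry once the first processor is idle
    accel.accept(instr)
    return "sent"


if __name__ == "__main__":
    accel = FirstProcessor()
    fft64 = (0b0000010 << 25) | 0b0001011
    print(dispatch(fft64, accel))
```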

[0207] In summary, this embodiment provides an application system for a specific accelerator, comprising: a first processor and a second processor; wherein, the second processor receives instruction information, analyzes it to determine that the instruction information belongs to a preset custom information type, and sends the instruction information to the first processor, the instruction information being an instruction that satisfies the custom encoding rules of a reduced instruction set; the first processor reads the instruction information to determine the number of sampling points, selects at least two delay feedback modules from a preset set of delay feedback modules of the specific accelerator based on the number of sampling points, and selects a target rotation factor storage unit from a rotation factor storage module, enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; requests data to be processed from the second processor; inputs the data to be processed fed back by the second processor into the target accelerator circuit to obtain a calculation result; and sends the calculation result to the second processor, wherein the specific accelerator is a Fast Fourier Transform accelerator. In this embodiment, the first processor and the second processor work together. The second processor determines that the instruction information belongs to a preset custom information type before transmitting the instruction information to the first processor, thereby reducing the processing burden of the first processor. Moreover, the first processor reads the instruction information to determine the corresponding number of sampling points, and then determines the target accelerator circuit in the preset specific accelerator based on the number of sampling points. The data to be processed transmitted by the second processor is input into the target accelerator circuit to obtain the calculation result, which is fed back to the second processor. The RISC-V custom instruction architecture is adopted to define custom instructions for FFT calculations of various point counts, so that the number of sampling points is set by the instruction itself.
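The selection of delay feedback modules from the number of sampling points can be sketched as follows. This is an illustrative assumption rather than the circuit of this application: it supposes that a first delay feedback module (two butterfly operation units) resolves two radix-2 stages and that a second delay feedback module (one butterfly operation unit) resolves one, which is consistent with a set of five first modules and one second module covering 16 to 2048 points. The function name `select_modules` is hypothetical.

```python
def select_modules(n_points):
    """Return (first_module_count, second_module_count) for an N-point FFT.

    Assumption: each first-type module covers two radix-2 stages and each
    second-type module covers one, so log2(N) stages are split accordingly.
    """
    is_power_of_two = n_points > 0 and (n_points & (n_points - 1)) == 0
    if not is_power_of_two or not 16 <= n_points <= 2048:
        raise ValueError("unsupported number of sampling points")
    stages = n_points.bit_length() - 1      # log2(N) for a power of two
    return stages // 2, stages % 2


# Module counts for the eight supported point counts.
table = {n: select_modules(n) for n in (16, 32, 64, 128, 256, 512, 1024, 2048)}
```

Under this assumption a 16-point FFT enables two first-type modules, and a 2048-point FFT enables all five first-type modules plus the second-type module.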

[0208] This application provides a second embodiment of an application system for a specific accelerator, the system comprising the following structure: a first processor and a second processor;

[0209] The structure and function of the first processor are explained in the aforementioned Embodiment 1, and will not be repeated in this embodiment.

[0210] Figure 12 is a schematic diagram of the structure of the first processor in Embodiment 2. The first processor includes: a communication module 1201, a decoding module 1202, a specific accelerator 1203, a first memory 1204, and a second memory 1205.

[0211] The communication module 1201 receives instruction information sent by the second processor, and the decoding module 1202 parses the instruction information to obtain the number of sampling points recorded in the first field and the first and second source register information recorded in the second field. The number of sampling points can be determined based on the instruction encoding. The first memory 1204 requests data to be processed from the second processor based on the first source register information, so that the second processor obtains the data to be processed from the storage address corresponding to the first source register information in external storage. The first memory 1204 receives and stores the data to be processed and inputs it into the specific accelerator 1203, which is an FFT accelerator. The specific accelerator 1203 processes the data to be processed to obtain the calculation result and writes the calculation result into the second memory 1205. The second memory 1205 sends the calculation result to the second processor through the communication module 1201 based on the second source register information, so that the second processor stores the calculation result at the storage address corresponding to the second source register information in external storage.

[0212] Specifically, the instruction information is divided into 6 fields: opcode, rd, funct3, rs1, rs2 and funct7.

[0213] The opcode field indicates the type of instruction and whether it belongs to the custom instruction group; the rd field indicates the destination register; the funct3 field indicates the read operation; the rs1 field indicates the starting address of the data to be processed in external memory; the rs2 field indicates the starting address of the calculation result in external memory; and the funct7 field indicates the number of FFT sampling points.

[0214] The opcode field occupies bits 0 to 6, the rd field occupies bits 7 to 11, the funct3 field occupies bits 12 to 14, the rs1 field occupies bits 15 to 19, the rs2 field occupies bits 20 to 24, and the funct7 field occupies bits 25 to 31 of the 32-bit instruction.
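The field layout can be illustrated by a short sketch that packs and unpacks a 32-bit instruction word, treating the ranges above as bit positions. The helper names `encode_fields` and `decode_fields` are hypothetical, and the opcode value 0b0001011 (the RISC-V custom-0 opcode) and the register numbers are example values only; this application does not state which opcode or registers are used.

```python
def encode_fields(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six fields into one 32-bit instruction word."""
    return (
        (opcode & 0x7F)            # bits 0-6
        | (rd & 0x1F) << 7         # bits 7-11
        | (funct3 & 0x7) << 12     # bits 12-14
        | (rs1 & 0x1F) << 15       # bits 15-19
        | (rs2 & 0x1F) << 20       # bits 20-24
        | (funct7 & 0x7F) << 25    # bits 25-31
    )


def decode_fields(word):
    """Split a 32-bit instruction word into its six fields."""
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "rs2": (word >> 20) & 0x1F,
        "funct7": (word >> 25) & 0x7F,
    }


# Example values only: custom-0 opcode, rd unused, funct3 = 011,
# source registers x10 and x11, funct7 = 0000110.
word = encode_fields(0b0001011, 0, 0b011, 10, 11, 0b0000110)
fields = decode_fields(word)
```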

[0215] Figure 13 shows the encoding information of the 8 custom instructions, including the opcode field, rd field, funct3 field, rs1 field, rs2 field, and funct7 field. The opcode field occupies bits 0 to 6, the rd field occupies bits 7 to 11, the funct3 field occupies bits 12 to 14, the rs1 field occupies bits 15 to 19, the rs2 field occupies bits 20 to 24, and the funct7 field occupies bits 25 to 31.

[0216] The 8 custom instructions are fft16, fft32, fft64, fft128, fft256, fft512, fft1024, and fft2048. Their funct7 field values are 0000000, 0000001, 0000010, 0000011, 0000100, 0000101, 0000110, and 0000111, respectively.
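Because the eight funct7 values are consecutive and the point counts double from one instruction to the next, the correspondence can be written compactly. The following is a minimal sketch; the names `FFT_POINTS` and `points_from_funct7` are hypothetical.

```python
# funct7 code -> number of FFT sampling points: code 0 is fft16, code 7 is fft2048.
FFT_POINTS = {code: 16 << code for code in range(8)}


def points_from_funct7(funct7):
    """Look up the point count for a funct7 value; reject undefined codes."""
    if funct7 not in FFT_POINTS:
        raise ValueError("funct7 value is not one of the 8 custom instructions")
    return FFT_POINTS[funct7]


n_fft16 = points_from_funct7(0b0000000)
n_fft2048 = points_from_funct7(0b0000111)
```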

[0217] It should be noted that the calculation result produced by the first processor can be large (for example, for a 2048-point FFT), so it may not be possible to write the calculation result back to the destination register rd, because the space in the register file (regfile) of the second processor may be insufficient. Therefore, in defining the instructions, this application uses rs1 and rs2 as the starting address of the data to be processed in external memory and the starting address for writing back the calculation result, respectively. In this way, when the first processor executes an instruction, it reads the data to be calculated from external memory and writes the calculation result back to external memory according to the values of rs1 and rs2. Therefore, the 8 custom instructions do not define rd.

[0218] Here, rs1 and rs2 represent the source registers. The funct3 field is 011, which means that the instruction needs to read the two source operands indexed by the bits of rs1 and rs2, without writing back the result to the destination register indicated by the bits of rd.
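The meaning of funct3 = 011 can be sketched by treating the three bits as independent flags. The bit assignment below (most significant bit for writing rd, then reading rs1, then reading rs2) is an assumption that matches the behavior described in this paragraph; the function name `decode_funct3` is hypothetical.

```python
def decode_funct3(funct3):
    """Interpret funct3 as three one-bit flags (assumed bit assignment)."""
    return {
        "write_rd": bool(funct3 & 0b100),   # write a result to rd
        "read_rs1": bool(funct3 & 0b010),   # read the operand indexed by rs1
        "read_rs2": bool(funct3 & 0b001),   # read the operand indexed by rs2
    }


flags = decode_funct3(0b011)
```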

[0219] The source register is a register that stores the value or address of the source operand in the instruction, corresponding to rs1 and rs2.

[0220] The destination register is the register that stores the result of the instruction operation, corresponding to rd.

[0221] The second processor decodes only the opcode field to obtain its content.

[0222] In practice, after decoding the funct7 field, the decoding module can determine which FFT point count is being calculated and, likewise, which twiddle factor (rotation factor) storage unit is needed. The decoding module transmits the decoded selection signal to the corresponding multiplexer so that the twiddle factor storage unit needed by the current instruction is connected to the corresponding multiplication unit.
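The multiplexer selection can be modeled in software as an index into a list of precomputed tables. This is only a sketch under stated assumptions: it supposes one storage unit per supported point count, each holding W_N^k = exp(-2*pi*i*k/N) for k = 0 to N-1; the actual contents and layout of the storage units are not specified here, and the names `make_storage_unit`, `STORAGE_UNITS` and `select_storage_unit` are hypothetical.

```python
import cmath


def make_storage_unit(n_points):
    """Precompute W_N^k for k = 0..N-1 (assumed storage layout)."""
    return [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(n_points)]


# One unit per supported point count, indexed by the funct7 code (0..7).
STORAGE_UNITS = [make_storage_unit(16 << code) for code in range(8)]


def select_storage_unit(funct7):
    """Model of the multiplexer: funct7 acts as the select signal."""
    return STORAGE_UNITS[funct7]


unit_fft16 = select_storage_unit(0b0000000)
```

A hardware design would typically exploit symmetry to store fewer coefficients; the full table is used here only to keep the sketch short.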

[0223] In summary, this embodiment provides an application system for a specific accelerator. The first processor includes a communication module, a decoding module, a specific accelerator, a first memory, and a second memory. The communication module receives instruction information sent by the second processor. The decoding module parses the instruction information to obtain the number of sampling points recorded in the first field and the first and second source register information recorded in the second field. Based on the instruction encoding, the number of sampling points can be determined. The first memory requests data to be processed from the second processor based on the first source register information, so that the second processor obtains the data to be processed from the storage address corresponding to the first source register information in external storage. The first memory receives and stores the data to be processed and inputs it into the specific accelerator. The specific accelerator processes the data to be processed to obtain a calculation result and writes the calculation result into the second memory. The second memory sends the calculation result to the second processor through the communication module based on the second source register information, so that the second processor stores the calculation result at the storage address corresponding to the second source register information in external storage. This embodiment thus describes the specific modules that implement each function of the first processor.

[0224] Figure 14 is an application scenario diagram of an application system for a specific accelerator provided in this application. In this application scenario, the second processor is the main processor 1401 and the first processor is the coprocessor 1402. The main processor specifically adopts the open-source processor core Hummingbird E203.

[0225] The main processor 1401 includes an instruction fetch module, an execution module, a memory access module, and a NICE (Nuclei Instruction Co-unit Extension) interface. The execution module has request and feedback channels with the coprocessor, and the memory access module has memory request and memory feedback channels with the coprocessor. The NICE interface is specifically designed to handle coprocessor extensions. The execution module includes a register file (regfile).

[0226] It should be noted that the NICE interface is a dedicated interface for the Hummingbird E203 to expand the coprocessor.

[0227] The coprocessor 1402 includes a NICE interface, a decoder, an FFT accelerator, a memory FIFO1, and a memory FIFO2.

[0228] The instruction fetch module of the main processor performs preliminary decoding of the instruction and determines whether the instruction is a custom instruction. If so, the execution module sends a data processing request to the coprocessor through the request channel. After the handshake, the main processor transmits the instruction information to the coprocessor.

[0229] The coprocessor first decodes the instruction, obtaining the source operands and the number of FFT points to be calculated. The coprocessor sends a data read request to the main processor's memory access module via the memory request channel. After the handshake, the memory access module transfers the data to be processed to the coprocessor via the memory feedback channel. The data to be processed is then input into memory FIFO1. After FIFO1 has received all the data, it inputs the data into the FFT accelerator for calculation, and the calculation result is input into memory FIFO2. After FIFO2 has received all the data, the coprocessor sends a write-back request to the main processor via the feedback channel. After the handshake, the coprocessor writes the calculation result from FIFO2 back to the memory access module via the memory request channel.
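The data flow of this paragraph (external memory to FIFO1, FFT accelerator, FIFO2, write-back) can be sketched in software. The sketch makes several assumptions: memory is modeled as a word-addressed list of complex samples, the handshakes are omitted, and a recursive radix-2 FFT stands in for the hardware FFT accelerator. The function names `fft` and `run_custom_instruction` are hypothetical.

```python
import cmath
from collections import deque


def fft(x):
    """Recursive radix-2 FFT, a software stand-in for the FFT accelerator."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def run_custom_instruction(memory, src_addr, dst_addr, n_points):
    """Read N samples at src_addr, transform them, write N results at dst_addr."""
    fifo1 = deque()
    for i in range(n_points):            # memory feedback channel fills FIFO1
        fifo1.append(memory[src_addr + i])
    fifo2 = deque(fft(list(fifo1)))      # accelerator output is buffered in FIFO2
    for i in range(n_points):            # write-back through the memory access module
        memory[dst_addr + i] = fifo2.popleft()


memory = [0j] * 64
memory[0] = 1 + 0j                       # unit impulse stored at the source address
run_custom_instruction(memory, 0, 32, 16)
```

With a unit impulse as input, every output bin written at the destination address equals 1, which gives a quick sanity check of the flow.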

[0230] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. The apparatus provided in the embodiments is described simply because it corresponds to the method provided in the embodiments; relevant parts can be found in the method section.

[0231] The above description of the provided embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features provided herein.

Claims

1. A method for applying a specific accelerator, characterized in that the method is applied to a first processor, the first processor including a specific accelerator, and the method includes: obtaining instruction information, wherein the instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set; obtaining the required number of sampling points and the data to be processed based on the instruction information; selecting, based on the number of sampling points, at least two delay feedback modules from a preset set of delay feedback modules of the specific accelerator, and selecting a target rotation factor storage unit from a rotation factor storage module, wherein the specific accelerator is a fast Fourier transform accelerator, the set of delay feedback modules includes at least six delay feedback modules, the rotation factor storage module includes at least eight rotation factor storage units, and any one of the at least eight rotation factor storage units can be connected to any delay feedback module; enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; and inputting the data to be processed into the target accelerator circuit to obtain a calculation result.

2. The method according to claim 1, characterized in that, The step of selecting at least two delay feedback modules from a preset set of delay feedback modules for a specific accelerator based on the number of sampling points includes: Determine the number of cascaded stages of the delay feedback module required to calculate the number of sampling points; Based on the cascade number of the delay feedback modules, at least two target delay feedback modules are selected from the set of delay feedback modules of a preset specific accelerator. The target delay feedback modules include a first delay feedback module and / or a second delay feedback module. The first delay feedback module includes at least two butterfly operation units, and the second delay feedback module includes at least one butterfly operation unit.

3. The method according to claim 1, characterized in that, The instruction information also includes the source operand, and further includes: Based on the source operand in the instruction information, the data to be processed is read from the address corresponding to the source operand in the external memory.

4. The method according to claim 1, characterized in that, The step of obtaining the required number of sampling points based on the instruction information includes: Analyze the instruction information to obtain the instruction code recorded in the first field of the instruction information; Based on the correspondence between the instruction code and the number of sampling points, the number of sampling points corresponding to the instruction information is obtained.

5. A specific accelerator, characterized in that it comprises: a delay feedback module set and a rotation factor storage module; the delay feedback module set includes at least six delay feedback modules, and the at least six delay feedback modules are combined to support at least eight sampling point numbers; the rotation factor storage module includes at least eight rotation factor storage units, and the rotation factor storage module is connected to the delay feedback module set; the rotation factor storage module provides the delay feedback modules with the rotation factor storage unit corresponding to the number of sampling points.

6. The specific accelerator according to claim 5, characterized in that, The delay feedback module set includes: at least five first delay feedback modules and one second delay feedback module; The first delay feedback module includes at least two butterfly operation units, four delay units, and a multiplication unit, while the second delay feedback module includes at least one butterfly operation unit and one delay unit. The delay units in any two delay feedback modules have different delay times.

7. The specific accelerator according to claim 6, characterized in that, The rotation factor module includes at least eight rotation factor storage units, each of which is connected to a corresponding first delay feedback module.

8. The specific accelerator according to claim 6, characterized in that, the at least five first delay feedback modules and the second delay feedback module are arranged sequentially; the output of the second delay feedback module and the output of the last, target first delay feedback module among the at least five first delay feedback modules are respectively connected to the data output of the specific accelerator; the data input terminal of the specific accelerator is connected to the input terminals of the at least four remaining first delay feedback modules, excluding the target first delay feedback module; the at least four first delay feedback modules, the target first delay feedback module, and the second delay feedback module are connected in sequence, and a multiplexer is set between any two adjacent first delay feedback modules among the at least four, so that the input terminal of any one of the at least four first delay feedback modules can be connected to the output terminal of the previous first delay feedback module or to the data input terminal of the specific accelerator.

9. An application system for a specific accelerator, characterized in that it comprises: a first processor and a second processor; the second processor receives instruction information, analyzes it to determine that the instruction information belongs to a preset custom information type, and sends the instruction information to the first processor, wherein the instruction information is an instruction that satisfies the custom encoding rules of a reduced instruction set; the first processor reads the instruction information to determine the number of sampling points, selects at least two delay feedback modules from a preset set of delay feedback modules of the specific accelerator based on the number of sampling points, and selects a target rotation factor storage unit from the rotation factor storage module, enabling the at least two delay feedback modules and the target rotation factor storage unit to obtain a target accelerator circuit; requests data to be processed from the second processor; inputs the data to be processed fed back by the second processor into the target accelerator circuit to obtain a calculation result; and sends the calculation result to the second processor, wherein the specific accelerator is a Fast Fourier Transform accelerator.

10. The system according to claim 9, characterized in that, The first processor includes: a communication module, a decoding module, a specific accelerator, a first memory, and a second memory; The communication module receives instruction information sent by the second processor, and the decoding module parses the instruction information to obtain the number of sampling points recorded in the first field and the first and second source register information recorded in the second field. The number of sampling points can be determined based on the instruction encoding. The first memory requests data to be processed from the second processor based on the first source register information, so that the second processor obtains the data to be processed from the storage address corresponding to the first source register information in external storage, receives and stores the data to be processed, inputs the data to be processed into the specific accelerator, the specific accelerator processes the data to be processed to obtain a calculation result, writes the calculation result into the second memory, and the second memory sends the calculation result to the second processor through the communication module based on the second source register information, so that the second processor stores the calculation result based on the storage address corresponding to the second source register information in external storage.