[0063] Embodiment 1
[0064] As shown in Figure 1, the present invention provides a quad-core processor system that processes data in single-program-segment, multiple-data mode; that is, all microprocessor cores strictly execute the same program segment at the same time and process multi-dimensional data in parallel. The system includes 4 microprocessor cores with a reduced instruction set architecture, wherein:
[0065] Each microprocessor core includes:
[0066] An instruction memory for storing instructions;
[0067] An in-core data memory for storing data;
[0068] A central processing unit for performing the corresponding operations according to the input instructions and data, and for updating the register file inside the central processing unit and the data memory outside it.
[0069] Preferably, the central processing unit includes:
[0070] The instruction fetch module fetches an instruction from the instruction memory in the current cycle according to the current pointer value, and calculates the pointer value for the next cycle;
[0071] The decoding module decodes the instruction from the instruction fetch module and generates all the control signals required by the arithmetic logic unit, the comparator, and the register file;
[0072] The arithmetic logic unit performs operations: it receives data from the register file and the in-core data memory, and sends write enable signals and the data to be written to the register file and the in-core data memory;
[0073] The comparator receives the output of the register file and judges from it whether a jump is taken. If the jump is taken, the jump target address is calculated by the arithmetic logic unit and sent to the instruction fetch module;
[0074] The register file receives data from the in-core data memory, the arithmetic logic unit, and the comparator, and can send data to the arithmetic logic unit and the comparator at the same time;
[0075] The pipeline control module controls the pipeline: according to the input signal from the execution module, it provides the corresponding stall signals to the instruction fetch module, the decoding module, the arithmetic logic unit, and the comparator to ensure that the pipeline runs smoothly.
[0076] Preferably, each register file includes a local register file and a shared register file, wherein,
[0077] The local register file is used for operations on data that stay inside the core; during these operations there is no interaction with data outside the core. The local microprocessor core has full read and write permissions on its local register file;
[0078] The shared register file is interconnected with the shared registers of the other microprocessor cores to realize data interaction between the microprocessor cores. The local microprocessor core has read permission on its shared register file; write permission is allocated to the local microprocessor core or to other microprocessor cores according to application needs. Specifically, the modification to the register file in this embodiment is shown in Figure 2, and the internal structure of the shared register file is shown in Figure 3. The register file of the original Opfal processor is divided into two parts: the local register file and the shared register file.
[0079] Preferably, each local register file is divided into two groups, each with one read port and one write port. The two groups receive different read address signals and return the corresponding read values, but they receive the same write address and data input signals, so that their contents stay consistent. Specifically, the local register file includes 16 registers of 32 bits each, numbered register 0 to register 15. The shared register file includes 4 registers of 32 bits each, numbered register 16 to register 19 (also called shared register 0 to shared register 3). A dedicated interconnection mode between the shared register file and the shared registers of the other cores realizes data interaction between cores. The local core has read permission on its shared register file, and write permission is allocated to the local core or to other cores according to application needs. The shared register file has two read ports and four write ports, and can accept write signals from at most four different cores.
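The two-group organization of the local register file and the register numbering described above can be illustrated with a short behavioral sketch. This is only a software model written for illustration, not the hardware design of the embodiment; the names `LocalRegisterFile` and `classify_register` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # each register is 32 bits wide


class LocalRegisterFile:
    """Behavioral model: 16 local registers kept as two mirrored groups."""

    NUM_REGS = 16

    def __init__(self):
        self.group_a = [0] * self.NUM_REGS
        self.group_b = [0] * self.NUM_REGS

    def write(self, addr, value):
        # Both groups receive the same write address and data,
        # so their contents always stay consistent.
        self.group_a[addr] = value & MASK32
        self.group_b[addr] = value & MASK32

    def read(self, addr_a, addr_b):
        # Each group has its own read port and its own read address,
        # which yields two independent reads at the same time.
        return self.group_a[addr_a], self.group_b[addr_b]


def classify_register(reg_no):
    """Map a register number to ('local', index) or ('shared', index)."""
    if 0 <= reg_no <= 15:
        return ("local", reg_no)
    if 16 <= reg_no <= 19:
        return ("shared", reg_no - 16)  # shared register 0..3
    raise ValueError("register number out of range")
```

Duplicating the storage is one simple way to obtain two read ports from single-read-port groups; the cost is that every write must go to both groups.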
[0080] Preferably, each microprocessor core also includes a configuration register, which is used to configure the connection mode of the shared register file of that microprocessor core and thereby improve the flexibility of the structure. At the same time, configuration instructions are added to the instruction set used by each microprocessor core to support the configuration.
[0081] Preferably, the data exchange paths between the 4 microprocessor cores are as shown in Figure 4. The microprocessor cores exchange data in the following two ways:
[0082] One way is for each microprocessor core to access external data memory through a multi-layer bus structure;
[0083] The other way is to exchange data between cores through the shared register file of each microprocessor core. Specifically, the shared register files establish direct data paths between the four cores, and the connection mode of each path can be flexibly defined through the configuration register, so that small amounts of data can be exchanged between the cores.
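The shared-register path can be sketched as follows. The text does not specify how the configuration register encodes the connection mode, so this sketch assumes, purely for illustration, that each shared register has one configuration entry naming the core that is currently allowed to write it. The names `Core`, `configure`, and `write_shared` are hypothetical.

```python
NUM_CORES = 4
NUM_SHARED = 4  # shared register 0..3 (register 16..19)


class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.shared = [0] * NUM_SHARED
        # Assumed encoding: config[i] is the id of the core granted
        # write permission on shared register i (default: the local core).
        self.config = [core_id] * NUM_SHARED

    def configure(self, shared_idx, writer_id):
        # Stands in for a configuration instruction.
        self.config[shared_idx] = writer_id

    def read_shared(self, shared_idx):
        # The local core always has read permission.
        return self.shared[shared_idx]


def write_shared(cores, writer_id, target_id, shared_idx, value):
    """Try to write a shared register of core target_id from core writer_id."""
    target = cores[target_id]
    if target.config[shared_idx] != writer_id:
        return False  # writer has no write permission on this register
    target.shared[shared_idx] = value & 0xFFFFFFFF
    return True


cores = [Core(i) for i in range(NUM_CORES)]
cores[1].configure(2, writer_id=0)          # core 0 may write core 1's shared register 2
ok = write_shared(cores, 0, 1, 2, 1234)     # core 0 passes a value to core 1
value_seen_by_core1 = cores[1].read_shared(2)
```

Under this assumed encoding, a value passes from core 0 to core 1 without touching the external data memory, which is the point of the direct path for small transfers.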
[0084] Preferably, the multi-layer bus is a crossbar switch placed between the microprocessor cores and the external data memories. The four microprocessor cores select external data memories through different buses. If the selected external data memories are all different, the 4 microprocessor cores transfer data simultaneously; if several cores select the same external data memory, the core that transfers first is chosen according to the preset priority rule. More generally, a multi-layer bus places a crossbar switch between master devices and slave devices. Multiple master devices select slave devices through different buses: if the selected slave devices are all different, the master devices can transfer simultaneously; if the same slave device is selected, the master device that transfers first is chosen according to the priority rule specified in the design.
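The conflict behavior of the crossbar can be shown with a one-cycle behavioral sketch. The priority rule here is a fixed ordering chosen only for illustration; the function name `crossbar_cycle` and its parameters are hypothetical.

```python
def crossbar_cycle(requests, priority=(0, 1, 2, 3)):
    """Resolve one cycle of crossbar requests.

    requests: dict mapping core id -> selected external data memory id.
    priority: assumed fixed priority order of the cores (illustrative only).
    Returns (owner, stalled): owner maps memory id -> core granted access,
    stalled lists the cores that must wait for a later cycle.
    """
    owner = {}
    stalled = []
    for core in priority:
        if core not in requests:
            continue  # this core issues no memory access in this cycle
        memory = requests[core]
        if memory in owner:
            stalled.append(core)  # memory already granted to a higher-priority core
        else:
            owner[memory] = core
    return owner, stalled


# All cores select different memories: every transfer proceeds at once.
no_conflict = crossbar_cycle({0: 0, 1: 1, 2: 2, 3: 3})

# Cores 0 and 1 both select memory 2: core 0 wins, core 1 waits.
conflict = crossbar_cycle({0: 2, 1: 2, 3: 1})
```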
[0085] In detail, as shown in Figure 1, the multi-layer bus may include an input module 11, a decoder module 12, an arbiter module 13, and a steering module 14, which together connect 4 master devices (microprocessor cores) to 4 slave devices (external data memories), wherein:
[0086] The input module 11 temporarily stores the read/write control and data signals from the microprocessor core. It takes the upper two bits of the address signal as the selection signal for the external data memory, takes the lower 12 bits of the address signal and shifts them right by 2 bits to produce a new address signal that matches the address input port of the external data memory, and outputs the write data and write enable signals at the same time;
[0087] The decoder module 12 receives the external data memory selection signal at its input, determines which external data memory the read or write operation of the microprocessor core targets, and sets the corresponding selection output to 1. In addition, it receives the read data of the 4 external data memories and, according to the decoded selection signal, selects the correct read data and sends it to the microprocessor core;
[0088] The arbiter module 13 arbitrates ownership of the bus. When multiple master modules (microprocessor cores) simultaneously request the shared bus for data communication, an arbitration algorithm allocates the bus resources and decides which requester may use them. Arbitration algorithms include polling (round-robin), fixed priority, time-division multiplexing, lottery, and random contention arbitration. In this design, to improve arbitration efficiency, polling is chosen because the algorithm is simple and relatively low in cost. The module receives three channels (0, 1, 2) of read/write data, control signals, and external data memory selection signals, arbitrates among them according to the polling rule, and finally selects one set of read/write control signals to output to the corresponding external data memory. It returns a hold signal to each of the other sources that requested arbitration to inform them that their bus request failed.
[0089] The steering module 14 selects, according to the through selection signal, either the through control and data signal group or the arbitrated control and data signal group, and outputs it to the external data memory. If the through selection signal is 1, the through signal group is output; otherwise, the arbitrated signal group is output.
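The address handling of the input module, the polling arbitration, and the steering selection described above can be sketched behaviorally as follows. The text fixes only the field sizes (upper two bits, lower 12 bits), so the 14-bit total address width used here is an assumption, as are the starting point of the polling order and all names (`split_address`, `PollingArbiter`, `steer`).

```python
ADDR_BITS = 14  # assumed width: 2 selection bits above a 12-bit offset


def split_address(addr):
    """Input module: derive the memory selection and the memory-side address."""
    select = (addr >> (ADDR_BITS - 2)) & 0b11   # upper two bits
    memory_addr = (addr & 0xFFF) >> 2           # lower 12 bits, shifted right by 2
    return select, memory_addr


class PollingArbiter:
    """Arbiter module: round-robin (polling) over three request channels."""

    def __init__(self, num_channels=3):
        self.n = num_channels
        self.last = self.n - 1  # assumed start: channel 0 is polled first

    def arbitrate(self, requests):
        """requests: list of booleans, one per channel.

        Returns (winner, hold): winner is the granted channel or None,
        hold[i] is True for each requesting channel that lost this round.
        """
        winner = None
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                winner = candidate
                break
        if winner is not None:
            self.last = winner  # polling resumes after the latest winner
        hold = [req and i != winner for i, req in enumerate(requests)]
        return winner, hold


def steer(through_select, through_group, arbitrated_group):
    """Steering module: pick the through group when the selection signal is 1."""
    return through_group if through_select == 1 else arbitrated_group
```

With two channels requesting continuously, the grant alternates between them, which is the fairness property that makes polling attractive despite its simplicity.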
[0090] Preferably, the instruction set used by each microprocessor core includes arithmetic operation instructions, logic operation instructions, branch instructions, and memory access instructions.
[0091] For other details of the first embodiment, please refer to the corresponding part of the first embodiment, which will not be repeated here.
[0092] This embodiment makes full use of the parallelism of the algorithm and improves its execution efficiency. The quad-core processor is built with a quad-core structure, and each microprocessor core takes a reduced instruction set architecture microprocessor as its prototype. The corresponding improvements include the introduction of shared registers, the addition of a configuration register and configuration instructions, the addition of a left-shift operation in the arithmetic logic unit, and the modification of the location of branch instructions. Two data exchange methods, namely the shared registers and the multi-layer bus built between the microprocessor cores and the external data memories, establish the data paths between the cores of the quad-core processor, improve its performance when processing data in parallel, and increase the efficiency of data exchange.