Multi-chip heterogeneous interconnect circuits and control methods, computing chips and accelerator boards
By separating physical channels and isolating hardware, the multi-chip heterogeneous interconnect circuit removes the communication bottleneck of traditional interconnect architectures and enables efficient parallel computing and scalability in AI computing systems.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: HANG ZHOU NANO CORE CHIP ELECTRONIC TECH CO LTD
- Filing Date: 2026-05-27
- Publication Date: 2026-06-30
AI Technical Summary
Traditional single-channel interconnect architectures suffer from communication bandwidth contention, head-of-line blocking, inconsistent startup times across computing nodes, and uneven transmission latency, resulting in low acceleration efficiency of AI computing systems.
The application adopts a multi-chip heterogeneous interconnect circuit that, through physical channel separation and hardware isolation, uses a PCIe switching network to transmit data and a broadcast bus to transmit control signals. Combined with a hardware startup trigger module and an address space aggregation module, this ensures real-time delivery and synchronization of control signals.
The design eliminates head-of-line blocking, ensures that control signals are delivered within nanoseconds to microseconds, and enables highly synchronized parallel execution across computing nodes, thereby improving the system's acceleration efficiency and scalability.
Figure CN122309440A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, specifically to a multi-chip heterogeneous interconnect circuit and control method, a computing chip and an accelerator board.
Background Technology
[0002] With the explosive growth in the number of parameters in artificial intelligence models (especially large-scale neural networks), the core challenge of AI computing systems has evolved from simply stacking computing power to the "communication wall" problem. In traditional single-channel interconnect architectures, systems typically adopt an "in-band control" mode, where control flow (instructions) and data flow (large-scale weights or tensor data) reuse the same physical channel.
[0003] In actual operation, this architecture is highly prone to communication contention. Bandwidth contention: Massive amounts of tensor data (such as GB-level parameter synchronization) occupy the vast majority of communication bandwidth.
[0004] Head-of-line blocking: Instructions are small in volume but have extremely strict real-time requirements. When they share a channel with large-scale data, control signals are often stuck behind a huge number of data packets in the transmission queue, so the instructions fail to reach the computing nodes in a timely manner.
[0005] In traditional massively parallel computing scheduling, the scheduling unit (such as the host) typically sends instructions per node. When there are many computing nodes, the instruction delivery process will generate significant cumulative delays, which will cause the computing nodes to receive the start instruction at different times, making it difficult to achieve strictly synchronous parallel execution.
[0006] In algorithms that require high coordination, such as multi-chip all-reduce, differences in startup time can cause a large amount of asymmetric waiting, which severely reduces the overall acceleration efficiency of the system.
[0007] In actual hardware board layouts, physical space constraints mean that the physical trace distance to the host differs from one computing chip to another. Uneven transmission delay: even with equal-length traces, factors such as uneven substrate dielectric and differences in the number of vias still cause the absolute arrival time of broadcast signals at each chip to deviate at the picosecond or nanosecond level. This physical-level signal phase difference further exacerbates the asymmetry of the startup timing of each chip in an ultra-large-scale cluster.
[0008] Therefore, it is necessary to improve existing multi-chip heterogeneous interconnect circuits and their control methods.
Summary of the Invention
[0009] To address the above problems, this application provides a multi-chip heterogeneous interconnect circuit, comprising: multiple computing chips for performing computing tasks; a host for sending control signals and data to the multiple computing chips; a first communication link connecting the host and the multiple computing chips via a PCIe switching unit, transmitting computing data based on a PCIe switching network; a second communication link connecting the host and the multiple computing chips, transmitting control signals in a broadcast manner; each of the multiple computing chips includes a hardware startup trigger module for triggering each of the multiple computing chips to enter a computing state, the hardware startup trigger module being configured to respond only to the control signals from the second communication link and being physically isolated from the first communication link.
[0010] This application eliminates "head-of-line blocking" at its source through physical channel separation and hardware isolation design, ensuring that control signals are delivered at extremely high speeds within nanoseconds / microseconds, guaranteeing the real-time performance of control flow, and preventing false triggering of the hardware startup trigger module.
[0011] Optionally, the computing chip includes a first communication interface module, a second communication interface module, and a computing core; the first communication interface module is used to connect to the first communication link, and the second communication interface module is used to connect to the second communication link; the computing core is connected to the first communication interface module and the second communication interface module respectively, and is used to receive the data and the control signals.
[0012] This application utilizes the PCIe switching network to ensure high bandwidth and reliability of data transmission, meeting the throughput requirements of large-scale tensor data.
[0013] Optionally, the second communication link includes a physical broadcast bus and a broadcast bus controller; the broadcast bus controller is connected to the host and is connected to the multiple computing chips through the physical broadcast bus; the physical broadcast bus is an SPI bus, an I2C bus, or a GPIO signal line; the chip select signal line or enable signal line of the broadcast bus controller is simultaneously connected to the multiple computing chips.
[0014] This application uses a simplified physical broadcast bus (such as GPIO / SPI) to skip protocol stack parsing and interrupt handling, which greatly reduces handshake overhead and instruction latency.
[0015] Optionally, the broadcast bus controller is a microcontroller, a field-programmable gate array, or a general-purpose input / output interface logic circuit integrated on the host side, independent of the host.
[0016] Optionally, it further includes: an address space aggregation module, configured to map the plurality of computing chips as a single device in the logical view of the host; wherein each computing chip corresponds to a different address offset of the single device, and the address space aggregation module is configured to route data to the corresponding computing chip according to the address offset of the data sent by the host.
[0017] This application achieves device virtualization through address space aggregation, breaks through the upper limit of operating system bus resources, and supports transparent expansion and efficient management of ultra-large-scale chip clusters.
[0018] Optionally, the address space aggregation module is disposed on an accelerator board integrating the multiple computing chips, or disposed inside one of the computing chips.
[0019] Optionally, the computing chip further includes a hardware delay alignment module; the hardware delay alignment module is used to send a start pulse with a delay according to a preset compensation value after detecting the broadcast control signal of the second communication link, so as to align the signal transmission delay between each computing chip. The preset compensation value is determined based on the physical wiring length of the computing chip on the accelerator board or the pre-measured signal transmission delay.
[0020] Optionally, the hardware delay alignment module includes a high-frequency counter for counting down after detecting the broadcast control signal, and outputting the start pulse when the count value reaches the preset compensation value.
[0021] This application introduces a hardware delay alignment mechanism to offset the phase difference caused by differences in physical traces, ensuring that all chips start in lockstep during large-scale parallel computing.
[0022] To achieve the above-mentioned objectives, this application provides a computing chip, comprising: a first communication interface for connecting to a host to form a first communication link for receiving data; and a second communication interface for connecting to a host to form a second communication link for receiving control signals. The computing chip has a hardware startup trigger module that can control the computing chip to enter a computing state in response to the control signal. The hardware startup trigger module is configured to respond only to the control signal from the second communication link and is physically isolated from the first communication interface.
[0023] The computing chip provided in this application has a first communication interface and a second communication interface, which can receive data and control signals through different communication links to solve the head-of-line blocking problem. Furthermore, the isolation setting between the hardware startup trigger module and the first communication interface avoids the false triggering of the hardware startup trigger module.
[0024] To achieve the above-mentioned objectives, this application provides an accelerator board, including a plurality of computing chips as described above, for connecting to a host computer to perform computing tasks.
[0025] This application provides an accelerator board with high synchronization accuracy and congestion resistance as a hardware carrier, which is suitable for high-density deployment in large-scale computing centers.
[0026] To achieve the aforementioned objectives, this application provides a control method for a multi-chip heterogeneous interconnect circuit, which utilizes the multi-chip heterogeneous interconnect circuit described above, including: Data preloading step: the host writes data to each of the computing chips through the first communication link; Broadcast triggering step: the host drives the broadcast control signal of the second communication link; Hardware startup step: each computing chip responds to the control signal, enters the computing state through the hardware startup trigger module, and executes computing tasks.
[0027] This application masks the computational initialization overhead by connecting preloading and broadcast triggering logic, enabling rapid synchronization and efficient parallel execution of large-scale tasks.
Attached Figure Description
[0028] Figure 1 is a schematic diagram of the structure of a multi-chip heterogeneous interconnect circuit provided by an embodiment of this application;
Figure 2 is a schematic diagram of the structure of a computing chip provided by an embodiment of this application;
Figure 3 is a schematic diagram of the structure of another computing chip provided by an embodiment of this application;
Figure 4 is a schematic diagram of another multi-chip heterogeneous interconnect circuit provided by an embodiment of this application;
Figure 5 is a schematic diagram of another multi-chip heterogeneous interconnect circuit provided by an embodiment of this application;
Figure 6 is a schematic diagram of the structure of another computing chip provided by an embodiment of this application;
Figure 7 is a schematic diagram illustrating the steps of a control method for a multi-chip heterogeneous interconnect circuit provided by an embodiment of this application;
Figure 8 is an address offset mapping diagram of the computing chips in the multi-chip heterogeneous interconnect circuit provided by an embodiment of this application.
Detailed Implementation
[0029] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0030] As shown in Figure 1 and Figure 2, this embodiment provides a multi-chip heterogeneous interconnect circuit, including: multiple computing chips 100 for performing computing tasks; a host 200 for sending control signals and data to the multiple computing chips 100; a first communication link 01 for connecting the host 200 and the multiple computing chips 100 to transmit computing data; and a second communication link 02 for connecting the host 200 and the multiple computing chips 100 to transmit control signals in a broadcast manner. The computing chip 100 includes a hardware startup trigger module 10 for triggering the computing chip 100 to enter the computing state; the hardware startup trigger module 10 is configured to respond only to control signals from the second communication link 02, and is physically isolated from the first communication link 01 to avoid being falsely triggered by the electrical signals of the first communication link 01.
[0031] The multi-chip heterogeneous interconnect circuit provided in this embodiment solves the "head-of-line blocking" problem caused by unfinished data transfers occupying a single channel, thus ensuring the real-time performance of the control flow. Specifically, it decouples the physical paths of data and control signals between the host 200 and the computing chips 100. By completely separating the "high-frequency, low-volume control flow" from the "low-frequency, massive-volume data flow" onto different physical channels, it prevents large-scale tensor data transfers from blocking control signals, as happens in the traditional single-channel architecture. Instruction transmission is also unaffected by congestion: even if the data path (first communication link 01) is fully loaded by GB-level data transmission, the host 200 can still send the start instruction to each computing chip 100 within nanoseconds to microseconds through the independent broadcast path (second communication link 02).
[0032] Furthermore, in this embodiment, the hardware startup trigger module 10 is both signal-shielded and physically isolated from the first communication link 01. This provides dual protection at the software and hardware levels and prevents the hardware startup trigger module 10 from being falsely triggered by the electrical signals of the first communication link 01.
[0033] In this embodiment, the hardware start-up trigger module 10 not only maintains independence from the first communication link 01 in terms of physical wiring, but also ensures in its logical architecture that it only responds to legitimate control signals from the second communication link 02 through multiple mechanisms such as state interlocking, feature code verification, and time window gating, thus completely eliminating false triggering at the logical level.
[0034] The hardware startup trigger module 10 is logically controlled by the global state machine inside the computing chip 100. To prevent false signals from being generated by electromagnetic coupling or address decoding abnormalities during the transmission of large amounts of data on the first communication link 01, this embodiment employs a two-stage "preparation-trigger" interlocking logic: Before receiving the broadcast signal from the second communication link, the host 200 must first write a specific "trigger authorization word" to the configuration space of the computing chip 100 through the first communication link (PCIe).
[0035] A hardware logic gate (such as an AND gate) is provided on the internal logic path of the hardware start trigger module 10. One input of this logic gate is connected to the physical interface of the second communication link 02, and the other input is connected to the "enable level" generated by the aforementioned "trigger authorization word".
[0036] The physical level transition of the second communication link 02 can only pass through the logic gate when the enable level is high (i.e., the software has confirmed that the data is ready). During non-computation periods, the logic gate is in a forced truncation state, and any stray signals from the interface cannot enter the subsequent startup circuit.
[0037] To achieve the above functions, the following circuit units are integrated inside the hardware startup trigger module 10: State latch: stores the "authorization" state from the first communication link 01.
[0038] Logic gate (e.g., an AND gate): the physical gate that ultimately outputs the valid signal.
[0039] Physical pins of the second communication interface module 30: directly receive electrical signals from the broadcast bus.
[0040] Phase 1: Software-level authorization and unlocking via the first communication link 01. Data preloading step: The host 200 sends massive amounts of computing data to the storage units 50 of each computing chip 100 through the first communication link 01.
[0041] Authorization word writing: Only when the host 200 confirms that the data has been completely transmitted and the chip is in a "ready" state will a preset "trigger authorization word" be written to the chip's specific configuration space or control register through the PCIe interface.
[0042] Generating an enable level: After the decoding logic inside the chip recognizes the authorization word, it will toggle the output of the state latch from logic low (0) to logic high (1). This stable high-level signal is connected to one input of the hardware logic gate.
[0043] Phase 2: Hardware-level broadcast start via the second communication link 02. Broadcast triggering step: The host 200 drives the broadcast bus controller 400 to send control signals to all computing chips 100 simultaneously through the physical broadcast bus (such as GPIO / SPI / I2C).
[0044] Physical signal arrival: After the broadcast signal arrives at the pin of the computing chip 100, it serves as another input signal for the logic gate.
[0045] Logic gate pass-through: At this point, the "enable level" generated in the first phase is already logic "1". By the logic of an AND gate, the gate outputs the final start pulse only when both inputs are "1", driving the computing core 40 into the computing state.
[0046] Even if electromagnetic noise is generated when the first communication link 01 transmits massive amounts of data, causing unexpected voltage fluctuations (pseudo-signals) on the pins of the second communication link 02, as long as the host 200 has not issued an "authorization word", one input of the logic gate will be a stable "0".
[0047] Stray signals are "forcibly truncated" at the logic gates and cannot enter subsequent circuits, thus ensuring the safety of the system.
[0048] Even if a logical error occurs in the first communication link 01 and some registers are mistakenly written, the computing chip 100 will still not start without the physical broadcast signal of the second communication link 02.
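The two-phase interlock described above can be illustrated with a small behavioral model. This is only a sketch, not the circuit of the application: the value of the authorization word, the class and method names, and the choice to clear the latch after one start pulse are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical value: the text does not specify the "trigger authorization word".
TRIGGER_AUTH_WORD = 0xA5A55A5A


@dataclass
class StartTrigger:
    """Toy model of the two-stage "preparation-trigger" interlock."""

    enable_latch: bool = False  # state latch holding the "enable level"

    def write_config(self, value: int) -> None:
        """Configuration-space write arriving over the first link (PCIe)."""
        # Only the exact authorization word sets the latch; any other
        # (possibly erroneous) write leaves the latch unchanged.
        if value == TRIGGER_AUTH_WORD:
            self.enable_latch = True

    def broadcast_pin(self, level: bool) -> bool:
        """Level seen on the second-link pin; returns the start pulse."""
        # AND gate: the pulse passes only if the pin is active AND the
        # enable level is high.
        start_pulse = level and self.enable_latch
        if start_pulse:
            # Assumption: one authorization permits exactly one start.
            self.enable_latch = False
        return start_pulse


chip = StartTrigger()
print(chip.broadcast_pin(True))    # stray level before authorization
chip.write_config(0x12345678)      # mistaken register write, not the word
print(chip.broadcast_pin(True))    # still blocked
chip.write_config(TRIGGER_AUTH_WORD)
print(chip.broadcast_pin(True))    # authorized and broadcast present
```

In this model neither a stray pin level alone nor a register write alone produces a start pulse, which mirrors the two failure cases discussed in the preceding paragraphs.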
[0049] Furthermore, unlike traditional level-triggered mechanisms, the hardware startup trigger module 10 in this embodiment employs pattern matching logic: The control signal transmitted by the second communication link 02 is defined as a bit stream with a specific length (e.g., 32 bits) and specific encoding characteristics (e.g., specific preamble + opcode + parity bit).
[0050] The hardware startup trigger module 10 integrates a hard-core shift register and a numerical comparator. The logic circuit of the hardware startup trigger module 10 monitors the control signals on the second communication link 02 in real time and compares them bit by bit with the code preset in the hardware circuit. Only when the control signal matches the code is it recognized as a valid control signal and can be executed by the hardware startup trigger module 10.
[0051] Furthermore, the logic circuit also includes a cyclic redundancy check (CRC) module. Even if the crosstalk generated by the high-speed signal of the first communication link 01 occasionally simulates some high and low levels, since it is almost impossible for this random interference signal to simultaneously meet the specific bit sequence pattern and CRC check requirements, the trigger module's state machine will identify it as illegal noise and automatically discard it, thereby achieving an extremely high signal-to-noise ratio at the logic layer.
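The shift-register matching and CRC check can be sketched in software as follows. The frame layout used here (16-bit preamble, 8-bit opcode, 8-bit CRC-8 in place of the parity field), the preamble and opcode values, and the CRC polynomial are all assumptions chosen for illustration; the text only states that the frame has a fixed length such as 32 bits and specific encoding characteristics.

```python
PREAMBLE = 0xB5C3      # hypothetical 16-bit preamble
OPCODE_START = 0x01    # hypothetical 8-bit "start" opcode


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def build_frame(opcode: int) -> int:
    """Pack preamble | opcode | CRC-8 into one 32-bit word."""
    header = (PREAMBLE << 8) | (opcode & 0xFF)   # upper 24 bits
    return (header << 8) | crc8(header.to_bytes(3, "big"))


class FrameMatcher:
    """32-bit shift register plus comparator, fed one bit per clock."""

    def __init__(self) -> None:
        self.shift_reg = 0

    def clock_in(self, bit: int) -> bool:
        """Shift in one bit (MSB first); True if a valid start frame is held."""
        self.shift_reg = ((self.shift_reg << 1) | (bit & 1)) & 0xFFFFFFFF
        header = self.shift_reg >> 8
        return (
            (header >> 8) == PREAMBLE
            and (header & 0xFF) == OPCODE_START
            and (self.shift_reg & 0xFF) == crc8(header.to_bytes(3, "big"))
        )


def bits_msb_first(word: int, width: int = 32):
    """Yield the bits of `word`, most significant bit first."""
    for i in range(width - 1, -1, -1):
        yield (word >> i) & 1


matcher = FrameMatcher()
hits = [matcher.clock_in(b) for b in bits_msb_first(build_frame(OPCODE_START))]
print(hits[-1])  # True only once the full frame has been shifted in
```

A frame with a single flipped bit fails either the preamble comparison or the CRC comparison, so the matcher in this sketch never reports it as valid.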
[0052] To further reduce the probability space of false triggering, the hardware start-up triggering module 10 can also introduce timing-sensitive logic filtering: When the host 200 sends out the computing task through the first communication link 01, it will synchronously start a hardware timer. This timer opens a very short "valid receiving window" (e.g., 500 microseconds) for the hardware startup trigger module 10.
[0053] The hardware startup trigger module 10 is only logically active while the receiving window is open. Once the window ends, the internal logic will automatically switch to shielded mode.
[0054] This design ensures that even during the period of most severe noise generated by high-frequency data exchange in the first communication link 01, as long as the time period is not within the preset receiving window, the trigger module is logically regarded as "disconnected", thereby effectively avoiding level fluctuation interference in unexpected time periods.
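The receiving-window gate can be modeled as a simple time comparison. The following is a minimal sketch that assumes nanosecond timestamps and uses the 500-microsecond example length from the text; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW_NS = 500_000  # 500 microseconds, the example window length in the text


@dataclass
class ReceiveWindow:
    """Time-window gate in front of the start trigger (times in ns)."""

    opened_at_ns: Optional[int] = None  # None: window has never been opened

    def open(self, now_ns: int) -> None:
        """Called when the host dispatches the computing task."""
        self.opened_at_ns = now_ns

    def accepts(self, now_ns: int) -> bool:
        """True only while the window is open; otherwise the trigger is masked."""
        if self.opened_at_ns is None:
            return False
        return 0 <= now_ns - self.opened_at_ns < WINDOW_NS


window = ReceiveWindow()
window.open(now_ns=1_000)
print(window.accepts(200_000))   # inside the window
print(window.accepts(900_000))   # after the window has closed
```

A broadcast edge would be forwarded to the start logic only when `accepts` returns true, so noise outside the window is ignored regardless of its waveform.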
[0055] After the trigger pulse is emitted, the logic circuit also includes a closed-loop verification step: When the hardware startup trigger module 10 sends a startup pulse, it sets an internal status bit (Trigger_Status). The host 200 periodically or irregularly reads this status bit through the first communication link 01. If the host 200 finds that it has not yet sent a broadcast command, but the status bit of the computing chip 100 has been set, the logic circuit will automatically trigger an alarm and execute a hardware reset procedure. This logical redundancy design ensures that even if a very low probability hardware mis-trigger occurs, the system can detect and correct the error at the software layer, guaranteeing the atomicity and accuracy of the computing task.
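On the host side, the closed-loop check amounts to comparing what the host has sent with what each chip reports. A minimal sketch, assuming a hypothetical `read_trigger_status` callback that stands in for reading the Trigger_Status bit over the first communication link:

```python
from typing import Callable, List


def find_false_triggers(
    broadcast_sent: bool,
    read_trigger_status: Callable[[int], bool],
    chip_ids: List[int],
) -> List[int]:
    """Return chips whose Trigger_Status is set although no broadcast was sent.

    `read_trigger_status` is a hypothetical stand-in for a register read
    over the first communication link.
    """
    if broadcast_sent:
        return []  # a set status bit is expected after a broadcast
    return [cid for cid in chip_ids if read_trigger_status(cid)]


# Simulated status bits: chip 1 reports a start without any host broadcast.
status = {0: False, 1: True, 2: False}
suspect = find_false_triggers(False, status.__getitem__, [0, 1, 2])
print(suspect)  # chips that would be alarmed and reset
```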
[0056] The aforementioned multiple defense mechanisms at the logical level can be used individually or in combination. Combined with physical isolation wiring, this enables the circuit to adapt to heterogeneous interconnect environments with extremely high bandwidth and extremely high power consumption fluctuations. It not only solves the "head-of-line blocking" problem of control signals in the first communication link, but also logically defines the boundary between control flow and data flow, ensuring high synchronization and safe and stable operation of multiple computing chips in complex electromagnetic environments.
[0057] Optionally, as shown in Figure 2, the computing chip 100 also includes a first communication interface module 20, a second communication interface module 30, and a computing core 40. The first communication interface module 20 is used to connect to the host 200 to form a first communication link 01. The second communication interface module 30 is used to connect to the host 200 to form a second communication link 02. The computing core 40 is connected to the first communication interface module 20 and the second communication interface module 30 respectively, and is used to receive control signals and data. The computing core 40 is connected to the hardware startup trigger module 10 and can enter the computing state under the trigger of the hardware startup trigger module 10.
[0058] Optionally, in some implementations, as shown in Figure 3, the computing chip 100 also includes a storage unit 50, connected to the computing core 40, for storing data sent by the host 200, intermediate computing data, and computing results. In some embodiments, the storage unit 50 can pre-store the data required to execute the computing task and provide it to the computing core 40 when the computing task is started.
[0059] Optionally, as shown in Figure 4, the first communication interface module 20 is a PCIe communication interface module, the second communication interface module 30 is a broadcast interface module, and the multi-chip heterogeneous interconnect circuit also includes a PCIe switching unit 300, which is connected between the host 200 and multiple computing chips 100 to form a first communication link 01; and a broadcast bus controller 400, which is connected to the host 200 and connects multiple computing chips 100 through a physical broadcast bus to form a second communication link 02.
[0060] PCIe (Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard. It is the "highway" connecting the motherboard and various hardware devices (such as graphics cards, solid-state drives, network cards, etc.) inside the computer.
[0061] Optionally, the PCIe switching unit 300 can be set up independently or integrated into an accelerator board or host 200 that integrates multiple computing chips 100.
[0062] Optionally, with continued reference to Figure 4, the physical broadcast bus is an SPI bus, an I2C bus, or a GPIO signal line; the chip select signal line or enable signal line of the broadcast bus controller 400 is simultaneously connected to multiple computing chips 100 to achieve synchronous triggering of multiple computing chips 100.
[0063] SPI (Serial Peripheral Interface) is a high-speed, full-duplex, synchronous communication protocol that typically uses four lines (clock, data in, data out, and chip select). In this solution, its chip select (CS) signal line is connected to multiple computing chips simultaneously, so that when the host pulls the level low, all chips receive the start command at the same time, within nanosecond-level error.
[0064] I2C (Inter-Integrated Circuit) is a two-wire communication protocol (serial data line SDA and serial clock line SCL). It supports multiple devices connected to the same pair of lines, and its wiring is extremely simple, making it suitable for broadcasting low-frequency, high-reliability control signals on accelerator boards with limited physical space.
[0065] GPIO (General-Purpose Input / Output) is a pin on a chip that can be configured to a high level (1) or a low level (0) by software. Without the need for any communication protocol parsing, the hardware startup trigger module 10 can be triggered directly by the level transition of the pin.
[0066] Traditional PCIe instructions require complex protocol stack parsing, register mapping, and interrupt handling. This solution, however, directly triggers instructions by level transitions on the chip select or enable signal lines, skipping protocol stack parsing and reducing latency. Since the signal propagates on the physical bus at near the speed of light, and the chip select line is physically connected to the hardware pins of all chips, multiple chips capture the signal edge almost simultaneously (with nanosecond-level error), thus achieving synchronous triggering at the physical level. In addition, it avoids the instruction arrival time difference caused by the internal arbitration mechanism of the switch in the first communication link (such as PCIe), solving the asymmetric waiting problem commonly found in multi-chip all-reduce algorithms.
[0067] Optionally, the broadcast bus controller 400 is specifically implemented as a microcontroller (MCU), field-programmable gate array (FPGA), or general-purpose input / output interface (GPIO) logic circuit independent of the host 200; the host 200 drives the broadcast bus controller 400 to generate broadcast waveforms by writing to specific registers or sending short instructions.
[0068] Optionally, as shown in Figure 5, the multi-chip heterogeneous interconnect circuit also includes: an address space aggregation module 500, configured to map multiple computing chips 100 as a single device in the logical view of the host 200; each computing chip 100 corresponds to a different address offset of the single device, and the address space aggregation module 500 can route data to the corresponding computing chip 100 according to the address offset of the data sent by the host 200.
[0069] This design breaks through the resource limitations of the operating system (OS) and improves system scalability. In standard PCIe enumeration, each individual chip typically occupies one bus number. When a board integrates dozens or even hundreds of computing chips, it is very easy to hit the bus resource limit of the BIOS or Linux kernel. In this implementation, as shown in Figure 8, multiple computing chips 100 are mapped to a single virtualized device, and the host 200 only needs to assign one device number, thereby greatly saving system resources and supporting larger-scale chip cluster expansion.
[0070] Specifically, this implementation establishes a mapping mechanism at the hardware level through the address space aggregation module 500, mapping multiple physically independent computing chips 100 into a single "virtual" device in the logical view of the host 200 (i.e., from the operating system's perspective). The host only needs to manage a contiguous base address register (BAR) space. The mapping mechanism uses the high-order bits of the address (Addr[35:32], as shown in Figure 8) to identify the address offset. The address space aggregation module 500 routes the data precisely to the corresponding computing chip (such as NPU 1, NPU 2, etc.) and its local memory according to the offset value.
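The address decoding performed by the aggregation module can be sketched as a bit-field split. This example assumes a 36-bit aggregated window in which Addr[35:32] selects the chip and Addr[31:0] is the chip-local offset; only the Addr[35:32] field comes from the text, and the zero-based chip index, the window size, and the function name are illustrative assumptions.

```python
from typing import Tuple

CHIP_SELECT_SHIFT = 32                             # Addr[35:32] selects the chip
CHIP_SELECT_MASK = 0xF                             # 4 bits: up to 16 chips here
LOCAL_OFFSET_MASK = (1 << CHIP_SELECT_SHIFT) - 1   # Addr[31:0] (assumed)
WINDOW_BITS = 36                                   # assumed aggregated window size


def route(addr: int) -> Tuple[int, int]:
    """Split a host address into (chip index, chip-local offset)."""
    if not 0 <= addr < (1 << WINDOW_BITS):
        raise ValueError("address outside the aggregated window")
    chip = (addr >> CHIP_SELECT_SHIFT) & CHIP_SELECT_MASK
    return chip, addr & LOCAL_OFFSET_MASK


chip, offset = route(0x1_0000_1000)
print(chip, hex(offset))  # 1 0x1000
```

Because consecutive chips occupy consecutive 4 GiB slices in this sketch, a host write that walks linearly through the window moves from one chip to the next without any per-device bookkeeping.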
[0071] The host 200 does not need to maintain a large list of devices; it only needs to manage a contiguous address space. The address space aggregation module 500 automatically performs hardware-level data distribution based on address offsets, which is transparent to upper-layer software. This simplifies driver development and reduces CPU management overhead.
[0072] The design also improves data transfer and direct memory access (DMA) efficiency, allowing the host to send data to multiple chips in one large DMA transfer without frequently switching target devices, thus reducing the number of handshakes in the PCIe configuration space.
[0073] Specifically, the address space aggregation module 500 can, as shown in Figure 5, be disposed on the accelerator board that integrates the multiple computing chips 100, or be located in one of the computing chips 100 and exchange signals through a high-speed communication channel between the computing chips 100.
[0074] If the address space aggregation module 500 is set on the accelerator board, it can act as a "logical bridge" to complete the routing before the data enters the computing chip cluster, which is the most efficient and does not occupy computing chip resources.
[0075] If the address space aggregation module 500 is located in one of the computing chips 100, secondary distribution is performed using the high-speed interconnection path inside the computing chip 100, reducing the complexity of board wiring.
[0076] Optionally, as shown in Figure 6, the computing chip 100 includes a hardware delay alignment module 60, which is connected to the second communication interface module 30. After detecting the broadcast control signal of the second communication link 02, the hardware delay alignment module 60 sends a start pulse after a delay set by a preset compensation value, so as to align the signal transmission delay between the computing chips 100.
[0077] Specifically, the preset compensation value is determined based on the physical wiring length of the computing chip 100 on the accelerator board or the pre-measured signal transmission delay; the hardware delay alignment module 60 includes a high-frequency counter, which is used to count down after detecting the broadcast control signal, and outputs a start pulse when the count value reaches the preset compensation value.
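The counter behavior described above can be modeled at cycle level as follows. This is a toy software sketch under stated assumptions, not a description of the actual circuit: it counts upward until the count reaches the compensation value (the text mentions both counting down and reaching the preset value, so the direction is an assumption), it fires immediately when the compensation value is zero, and the class name `DelayAlignmentCounter` and its methods are invented for the example.

```python
class DelayAlignmentCounter:
    """Toy cycle-level model of a hardware delay alignment module."""

    def __init__(self, compensation_ticks: int) -> None:
        if compensation_ticks < 0:
            raise ValueError("compensation value must be non-negative")
        self.compensation_ticks = compensation_ticks
        self.count = 0
        self.armed = False

    def on_broadcast_signal(self) -> bool:
        """Arm the counter; return True if the start pulse fires at once (zero delay)."""
        self.count = 0
        self.armed = self.compensation_ticks > 0
        return not self.armed

    def tick(self) -> bool:
        """Advance one counter clock; return True on the cycle the start pulse fires."""
        if not self.armed:
            return False
        self.count += 1
        if self.count >= self.compensation_ticks:
            self.armed = False  # one pulse per broadcast signal
            return True
        return False


counter = DelayAlignmentCounter(compensation_ticks=3)
counter.on_broadcast_signal()
pulses = [counter.tick() for _ in range(4)]  # pulse on the third tick only
```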
[0078] This design eliminates asymmetric startup caused by board-level trace differences. On actual accelerator boards, the computing chips sit at different physical distances from the host, and even with equal-length traces, the time the broadcast signal takes to reach each chip can still differ by picoseconds or nanoseconds due to factors such as uneven substrate dielectric and differences in the number of vias. By setting a different "preset compensation value" for each chip, the hardware delay alignment module 60 can cancel out the signal phase difference caused by physical location differences, ensuring that the computing cores of all computing chips receive the startup pulse at the same time.
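As a worked illustration of how per-chip compensation values could be derived from pre-measured delays, the sketch below aligns every chip to the slowest arrival. The policy (compensation equals the latest arrival minus this chip's arrival, rounded to whole counter ticks), the function name `compensation_ticks`, and the delay and clock figures are all assumptions chosen for the example; the application itself does not specify a formula.

```python
from typing import Dict


def compensation_ticks(arrival_delay_ns: Dict[str, float],
                       counter_clock_hz: float) -> Dict[str, int]:
    """Compensation, in counter ticks, that aligns every chip to the slowest arrival."""
    tick_ns = 1e9 / counter_clock_hz
    latest = max(arrival_delay_ns.values())
    # Chips that see the broadcast earlier wait longer; the last chip waits zero ticks.
    return {chip: round((latest - delay) / tick_ns)
            for chip, delay in arrival_delay_ns.items()}


# Hypothetical measured arrival delays and a hypothetical 2 GHz counter (0.5 ns per tick).
measured = {"NPU 1": 1.0, "NPU 2": 2.5, "NPU 3": 4.0}
ticks = compensation_ticks(measured, counter_clock_hz=2e9)
# Effective start time per chip: arrival delay plus compensation delay.
start_ns = {chip: measured[chip] + ticks[chip] * 0.5 for chip in measured}
```

In this example every chip starts 4.0 ns after the broadcast is driven. In general the residual skew is bounded by half a counter tick, which is why a high-frequency counter matters.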
[0079] This embodiment also provides a computing chip 100, including: a first communication interface module 20, used to connect to the host 200 to form a first communication link 01 to receive computation data; a second communication interface module 30, used to connect to the host 200 to form a second communication link 02 to receive broadcast control signals and perform computing tasks; and a hardware startup trigger module 10 that can respond to external signals to control the computing chip to enter the computing state. The hardware startup trigger module 10 is configured to respond only to electrical signals from the second communication link 02 and is in a signal shielding/isolation state with respect to the first communication link 01, or is designed at the hardware level to be physically isolated from the first communication link 01.
[0080] This embodiment provides an accelerator board, as shown in Figure 5, which is used to connect to a host to perform computing tasks and includes multiple computing chips 100.
[0081] This embodiment provides a control method for the multi-chip heterogeneous interconnect circuit. As shown in Figure 7, the control method provided in this embodiment includes: a data preloading step, in which the host writes data to multiple computing chips through the first communication link; a broadcast triggering step, in which the host drives the second communication link to broadcast control signals; and a hardware startup step, in which the multiple computing chips execute computing tasks based on the control signals.
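The three steps can be sketched as a small software simulation. This is a minimal model under stated assumptions, not driver code for any real device: the classes `ComputingChip` and `Host`, their methods, and the function `run_control_method` are hypothetical, and the Python loop in `broadcast_start` stands in for a single shared electrical signal that reaches all chips at once.

```python
from typing import Dict, List


class ComputingChip:
    """Hypothetical model of one computing chip with two isolated interfaces."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Dict[int, bytes] = {}
        self.computing = False

    def write_data(self, offset: int, payload: bytes) -> None:
        """First communication link: stores data only and cannot start computation."""
        self.memory[offset] = payload

    def on_broadcast(self) -> None:
        """Second communication link: the only path into the startup trigger."""
        self.computing = True


class Host:
    def __init__(self, chips: List[ComputingChip]) -> None:
        self.chips = chips

    def preload(self, payloads: Dict[str, bytes]) -> None:
        """Data preloading step: write each chip's data over the first link."""
        for chip in self.chips:
            chip.write_data(0, payloads[chip.name])

    def broadcast_start(self) -> None:
        """Broadcast triggering step: models one shared signal seen by every chip."""
        for chip in self.chips:
            chip.on_broadcast()


def run_control_method(host: Host, payloads: Dict[str, bytes]) -> List[str]:
    host.preload(payloads)
    # Writing data alone must never put a chip into the computing state.
    assert not any(chip.computing for chip in host.chips)
    host.broadcast_start()
    # Hardware startup step: every chip that saw the broadcast is now computing.
    return [chip.name for chip in host.chips if chip.computing]


chips = [ComputingChip(f"NPU {i}") for i in (1, 2, 3)]
started = run_control_method(Host(chips), {chip.name: b"weights" for chip in chips})
```

The point of the model is the separation of paths: `write_data` has no way to set `computing`, mirroring the requirement that the startup trigger responds only to the second communication link.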
[0082] The technical solution of the present invention has now been described in conjunction with the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to the specific embodiments described above. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions resulting from such changes or substitutions will all fall within the scope of protection of the present invention.
Claims
1. A multi-chip heterogeneous interconnect circuit, characterized in that it comprises: multiple computing chips, used to perform computing tasks; a host, used to send control signals and data to the multiple computing chips; a first communication link, which connects the host and the multiple computing chips through a PCIe switching unit and transmits the data over the PCIe switching network; and a second communication link, used to connect the host to the multiple computing chips to transmit the control signals in a broadcast manner; wherein each of the multiple computing chips includes a hardware startup trigger module for triggering the computing chip to enter a computing state, and the hardware startup trigger module is configured to respond only to the control signal from the second communication link and is physically isolated from the first communication link.
2. The multi-chip heterogeneous interconnect circuit according to claim 1, characterized in that the computing chip includes a first communication interface module, a second communication interface module, and a computing core; the first communication interface module is used to connect to the first communication link, and the second communication interface module is used to connect to the second communication link; the computing core is connected to the first communication interface module and the second communication interface module respectively, and is used to receive the data and the control signals; and the first communication interface module is a PCIe communication interface module.
3. The multi-chip heterogeneous interconnect circuit according to claim 1, characterized in that the second communication link includes a physical broadcast bus and a broadcast bus controller; the broadcast bus controller is connected to the host and is connected to the multiple computing chips through the physical broadcast bus; the physical broadcast bus is an SPI bus, an I2C bus, or a GPIO signal line; and the chip select signal line or enable signal line of the broadcast bus controller is simultaneously connected to the multiple computing chips.
4. The multi-chip heterogeneous interconnect circuit according to claim 3, characterized in that the broadcast bus controller is a microcontroller, field-programmable gate array, or general-purpose input/output interface logic circuit integrated on the host side, independent of the host.
5. The multi-chip heterogeneous interconnect circuit according to claim 1, characterized in that it further comprises: an address space aggregation module, configured to map the multiple computing chips as a single device in the logical view of the host; each computing chip corresponds to a different address offset of the single device, and the address space aggregation module is configured to route data to the corresponding computing chip according to the address offset of the data sent by the host.
6. The multi-chip heterogeneous interconnect circuit according to claim 5, characterized in that the address space aggregation module is located on the accelerator board integrating the multiple computing chips, or inside one of the computing chips.
7. The multi-chip heterogeneous interconnect circuit according to claim 1, characterized in that the computing chip also includes a hardware delay alignment module; the hardware delay alignment module is used to send a start pulse with a delay according to a preset compensation value after detecting the broadcast control signal of the second communication link, so as to align the signal transmission delay between the computing chips; and the preset compensation value is determined based on the physical wiring length of the computing chip on the accelerator board or the pre-measured signal transmission delay.
8. The multi-chip heterogeneous interconnect circuit according to claim 7, characterized in that the hardware delay alignment module includes a high-frequency counter, which is used to count down after detecting the broadcast control signal and output the start pulse when the count value reaches the preset compensation value.
9. A computing chip, characterized in that it comprises: a first communication interface module, used to connect with the host to form a first communication link in order to receive computing data; a second communication interface module, used to connect with the host to form a second communication link in order to receive control signals; and a hardware startup trigger module that can control the computing chip to enter a computing state in response to the control signal; wherein the hardware startup trigger module is configured to respond only to the control signal from the second communication link and is physically isolated from the first communication interface.
10. An accelerator board, characterized in that it includes multiple computing chips as described in claim 9.
11. A control method for a multi-chip heterogeneous interconnect circuit, using the multi-chip heterogeneous interconnect circuit as described in any one of claims 1 to 8, characterized in that it comprises: a data preloading step, in which the host writes data to each of the computing chips through the first communication link; a broadcast triggering step, in which the host drives the broadcast control signal of the second communication link; and a hardware startup step, in which each computing chip responds to the control signal, enters the computing state through the hardware startup trigger module, and executes computing tasks.