An inter-chip communication system and method

By establishing a two-dimensional mesh structure between computing chips, the problem of limited interconnect bandwidth between computing chips is solved, achieving efficient multi-chip communication and supporting system expansion and stability.

CN119759833B · Active · Publication Date: 2026-06-30 · ZHEJIANG LAB

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG LAB
Filing Date: 2024-12-12
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, the interconnection between computing chips is limited by single-line bandwidth, making it difficult to achieve large-scale interconnection, which leads to limited utilization of computing power and fails to meet the needs of large-scale computing.

Method used

An inter-chip communication system is adopted, which combines several connection nodes and switching structures to form a two-dimensional grid structure using connection modules and switching nodes to realize data encapsulation, sorting, verification and transmission, supports out-of-order reordering and forward error correction, and ensures high bandwidth, low latency and high reliability.

Benefits of technology

It enables high-speed interconnection and communication between multiple computing chips, supports smooth system scaling, reduces the complexity and cost of the switching structure, and improves bandwidth utilization and communication stability.

✦ Generated by Eureka AI based on patent content.


Abstract

This specification discloses an inter-chip communication system and method. The system includes several connection nodes and a switching structure. Each connection node is connected to the switching structure. For any connection node, the connection node includes a computing chip and a connection module. The computing chip is used to generate data to be transmitted according to a computing task and transmit the data to be transmitted to the connection module. The connection module is used to determine the target connection node of the data to be transmitted and transmit the data to be transmitted to the target connection node through the switching structure. The connection module is also used to receive data to be transmitted from other connection nodes and transmit the data to be transmitted to the computing chip. High-speed connection communication between multiple computing chips can be achieved by utilizing a unified communication link formed by the connection nodes.

Description

Technical Field

[0001] This specification relates to the field of computer technology, and in particular to an inter-chip communication system and method.

Background Technology

[0002] The development of large AI models relies on continuously improving computing power. Increasing model size and training data volume is a key way to directly improve the intelligence level and performance of large AI models. At the same time, the demand for cluster computing power is growing exponentially.

[0003] However, improving cluster computing power currently faces some significant constraints. For example, lithography mask (reticle) size limits restrict single-chip performance improvements, and in model-parallel architectures, frequent communication with a limited data volume per transfer limits computing power utilization. Therefore, inter-chip interconnection of computing chips has become an effective approach that urgently needs to be developed to improve system computing power.

[0004] High-speed inter-chip interconnection of computing chips differs from the traditional parallel computing model, which allocates computing chips to each computing node and connects the computing nodes through a network architecture for communication. Instead, it aims to connect several computing chips within a single computing node to form a computing structure with stronger computing power.

[0005] For high-speed inter-chip interconnects, domestic practitioners usually use direct interconnects between computing chips. However, direct interconnects between computing chips are limited by the bandwidth of a single line, making it difficult to achieve a large interconnect scale (the typical scale of direct interconnects between computing chips does not exceed 8 chips).

[0006] Therefore, the present invention provides an inter-chip communication system and method.

Summary of the Invention

[0007] This specification provides an inter-chip communication system and method to partially solve the aforementioned problems existing in the prior art.

[0008] The following technical solution is adopted in this specification:

[0009] This specification provides an inter-chip communication system, the system comprising:

[0010] Several connection nodes and a switching structure, wherein each connection node is connected to the switching structure;

[0011] For any given connection node, the connection node includes a computing chip and a connection module;

[0012] The computing chip is used to generate data to be transmitted according to the computing task and transmit the data to be transmitted to the connection module.

[0013] The connection module is used to determine the target connection node of the data to be transmitted, and to transmit the data to be transmitted to the target connection node through the switching structure;

[0014] The connection module is also used to receive data to be transmitted from other connection nodes and transmit the data to be transmitted to the computing chip.

[0015] Optionally, the connection module is specifically used to encapsulate the data to be transmitted into several data frames to be transmitted, and transmit the data frames to be transmitted to the target connection node through the switching structure.

[0016] Optionally, the switching structure specifically includes: a number of switching nodes equal to the number of connection nodes, each switching node being connected to form a two-dimensional mesh structure, each connection node being directly connected to a switching node, and the switching nodes directly connected to each connection node being different from one another.

[0017] Optionally, the connection module is specifically used to encapsulate the data to be transmitted into several data frames containing sorting information and transmit the data frames to the target connection node through the switching structure, where the data in the data frames, combined according to the sorting information, represents the complete data to be transmitted.

[0018] The connection module specifically includes: a data buffer, which is used to store the data to be transmitted by the switching structure;

[0019] The connection module is specifically used to determine whether the transmission of any data frame to be transmitted is out of order based on the sorting information and historical sorting information in the data frame to be transmitted when any data frame to be transmitted is received. If the transmission of the data frame to be transmitted is out of order, the data frame to be transmitted is stored in the data buffer until a delayed data frame to be transmitted that can be sequentially combined with the data frame to be transmitted is received. The data frame to be transmitted and the delayed data frame to be transmitted are then sequentially combined according to their respective sorting information and transmitted to the computing chip.

[0020] Optionally, the connection module is specifically used to determine the target connection node of the data to be transmitted, and to transmit the data to be transmitted, which includes the address information of the target connection node, to the switching node directly connected to the connection module.

[0021] The switching node is used to determine the target switching node based on the address information in the data to be transmitted when it receives data to be transmitted. If the current switching node is the target switching node, the received data to be transmitted is transmitted to the directly connected connection module. Otherwise, the target path is determined based on the target switching node and the communication load of each switching node in the switching structure, and the data to be transmitted is transmitted to the next switching node according to the target path.

[0022] Optionally, the connection module is specifically used to generate and add a corresponding checksum for each data frame to be transmitted using a cyclic redundancy check function.

[0023] The switching node is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame, and transmit the available data frame to the next switching node or the directly connected connection module.

[0024] The connection module is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame, and transmit the available data frame to the computing chip.

[0025] Optionally, the connection module is specifically used to encapsulate the data to be transmitted into various data frames to be transmitted. For any data frame to be transmitted, the data frame to be transmitted includes original content and redundant content. The original content is the content in the data to be transmitted, and the redundant content is forward error correction code generated based on the original content.

[0026] The switching node is specifically used to, when receiving a data frame to be transmitted, verify the original content of the data frame to be transmitted based on the redundant content in the data frame to be transmitted. If the verification is successful, the data frame to be transmitted is transmitted to the next switching node or the directly connected connection module. If the verification fails, the original content is restored based on the redundant content, and the restored data frame to be transmitted is transmitted to the next switching node or the directly connected connection module.

[0027] The connection module is further configured to, when receiving a data frame to be transmitted, verify the original content of the data frame to be transmitted based on the redundant content in the data frame to be transmitted, and restore the original content based on the redundant content if the verification fails.

[0028] Optionally, the connection module is specifically used to determine the amount of data that the target connection node can receive. If the receivable amount is not less than the amount of data to be transmitted, the data to be transmitted is transmitted to the target connection node through the switching structure, and the amount of data that the target connection node can receive is updated.

[0029] The connection module is also used to update the receivable data amount of the connection node to which it belongs after receiving data to be transmitted from other connection nodes and transmitting the data to be transmitted to the computing chip.

[0030] Optionally, the connection module is specifically used to: for any data frame to be transmitted, cut the data frame to be transmitted into flow control units of equal length according to a preset length threshold, and transmit each flow control unit to the target connection node through the switching structure.

[0031] Optionally, the connection module is also used to transmit a flow control unit containing clock information to each other connection node at preset time intervals through the switching structure when there is no data to be transmitted.

[0032] Optionally, the switching nodes can be connected using SerDes technology.

[0033] This specification provides a method for inter-chip communication, which is applied to an inter-chip communication system, the inter-chip communication system comprising: a plurality of connection nodes, which are interconnected with each other;

[0034] For any connection node, the method includes:

[0035] Using the computing chip in the connection node, data to be transmitted is generated according to the computing task, and the data to be transmitted is transmitted to the connection module in the connection node.

[0036] Using the connection module, the target connection node for the data to be transmitted is determined, and the data to be transmitted is transmitted to the target connection node;

[0037] When data to be transmitted arrives from other connection nodes, the connection module receives the data to be transmitted and transmits it to the computing chip.

[0038] This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for inter-chip communication.

[0039] This specification provides an apparatus including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method of inter-chip communication described above.

[0040] The above-mentioned technical solutions adopted in this specification can achieve the following beneficial effects:

[0041] The inter-chip communication system described in this specification can utilize a unified communication link formed by various connection nodes to achieve high-speed connection and communication between multiple computing chips.

Description of the Drawings

[0042] The accompanying drawings, which are included to provide a further understanding of this specification and form part of this specification, illustrate exemplary embodiments and are used to explain this specification, but do not constitute an undue limitation thereof. In the drawings:

[0043] Figure 1 is a schematic diagram of the structure of an inter-chip communication system described in this specification;

[0044] Figure 2 is a schematic diagram of the structure of an inter-chip communication system including switching nodes, as described in this specification;

[0045] Figure 3 is a flowchart illustrating a method for inter-chip communication as described in this specification;

[0046] Figure 4 is a schematic diagram of an electronic device corresponding to Figure 3, as provided in this specification.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments in this specification without creative effort are within the scope of protection of this application.

[0048] The technical solutions provided in the various embodiments of this specification are described in detail below with reference to the accompanying drawings.

[0049] High-speed inter-chip interconnects are designed to address the severe mismatch between the computing power demands of large-scale applications and the computing power supply of a single chip. The method presented in this specification proposes a "one logic crossbar" concept for high-speed inter-chip interconnects. This concept aims to shield the differences in latency, bandwidth, and reliability between inter-chip and intra-chip communication, while ensuring that the programming model of the computing chip remains unchanged. Furthermore, this concept aims to overcome the problems existing in current technologies, enabling the constructed inter-chip communication system to scale smoothly with the application.

[0050] Specifically, Figure 1 is a schematic diagram of an inter-chip communication system according to the present specification. The inter-chip communication system specifically includes several connection nodes and a switching structure. Each connection node is connected to the switching structure. For any connection node, the connection node includes a computing chip and a connection module.

[0051] Each connection node and the switching structure are physically connected via signal lines, as are the computing chip and the connection module within a connection node. The computing chip mentioned in this specification can be a commonly used computing chip such as a GPU chip or a CPU chip. The connection module can be a network adapter or a bus converter that provides routing guidance for communication between connection nodes. The switching structure can be composed of switching nodes, each of which is a physical device with routing selection and frame framing/deframing functions.

[0052] In this inter-chip communication system, based on the aforementioned "one logic crossbar" concept, the complexity and cost of implementing the switching structure are reduced. Each connection node can use its own connection module to determine its target connection node and complete communication through the shared switching structure. Therefore, connection nodes do not need to establish direct connections between each other. The bandwidth of this shared switching structure is much higher than that of direct interconnections, thereby increasing the available bandwidth of a single connection node and reducing the latency of communication between nodes. When expanding this inter-chip communication system, simply connect any other connection node with a computing chip and connection module to the shared switching structure.

[0053] The computing chip is used to generate data to be transmitted according to the computing task and transmit the data to be transmitted to the connection module.

[0054] When data is transmitted between connection nodes, the sending connection node first uses its computing chip to generate the data to be transmitted according to the computing task. At the same time, it attaches to the data the transmission information of the target connection node that is the receiver of the data. The data to be transmitted is then passed to the connection module in the sending connection node.

[0055] The recipient of the data to be transmitted can be determined based on the computational task.

[0056] The connection module is used to determine the target connection node of the data to be transmitted, and to transmit the data to be transmitted to the target connection node through the switching structure.

[0057] When the connection module receives data to be transmitted from the computing chip in the same connection node, it can determine the target connection node in the inter-chip communication system that is the receiver of the data to be transmitted based on the transmission information attached to the data to be transmitted, and transmit the data to be transmitted to the target connection node through the switching structure.

[0058] Specifically, the data to be transmitted, containing the address information of the target connection node, can be transmitted to the switching structure, so that the switching structure can transmit the data to be transmitted to the target connection node according to the address information.

[0059] In one or more embodiments of this specification, a piece of data to be transmitted may correspond to several receivers. Thus, the connection module can determine several target connection nodes of the data to be transmitted and transmit the data to the determined target connection nodes.

[0060] The connection module is also used to receive data to be transmitted from other connection nodes and transmit the data to be transmitted to the computing chip.

[0061] On the receiving side, the connection module of the target connection node can receive the data to be transmitted from other connection nodes and transmit it to the computing chip in the target connection node.

[0062] In the inter-chip communication system shown in Figure 1, communication between connection nodes is completed through the connection module built into each connection node and the switching structure connected to each connection node, which can meet the requirements of low latency, high bandwidth, and high reliability for communication between connection nodes.

[0063] In this inter-chip communication system, the complete interconnection protocol is divided into four layers: physical layer, data link layer, transaction layer, and application layer. Each connection node deploys the protocol content for all four layers, while the switching structure deploys the protocol content for the physical layer and data link layer only. During a single communication, the sender processes the data to be transmitted in the order application layer, transaction layer, data link layer, physical layer, while the receiver processes the data in the reverse order: physical layer, data link layer, transaction layer, application layer.

[0064] In one or more embodiments of this specification, the connection module is specifically used to encapsulate the data to be transmitted into several data frames to be transmitted, and to transmit the data frames to be transmitted to the target connection node through the switching structure.

[0065] A data frame is a protocol data unit at the data link layer, mainly composed of three parts: a frame header, a data portion, and a frame trailer. The frame header and trailer contain necessary control information, such as synchronization information, address information, and error control information; the data portion contains the transmitted data content. The switching structure and each connection node can support the encapsulation and decapsulation of data frames.
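The frame layout described above can be illustrated with a minimal sketch. The field names (`dest`, `seq`, `total`, `trailer`), the `Frame` class, and the 4-byte chunk size are hypothetical choices made for illustration; the specification does not define a concrete frame format.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    # Header fields (assumed layout, not taken from the specification)
    dest: int              # address information of the target connection node
    seq: int               # sorting information: position of this frame
    total: int             # number of frames that make up the complete data
    # Data portion
    data: bytes
    # Trailer, reserved here for error-control information
    trailer: bytes = b""


def encapsulate(payload: bytes, dest: int, max_data: int = 4) -> list[Frame]:
    """Split the data to be transmitted into frames of at most max_data bytes."""
    chunks = [payload[i:i + max_data] for i in range(0, len(payload), max_data)] or [b""]
    return [Frame(dest, seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def decapsulate(frames: list[Frame]) -> bytes:
    """Recombine the data portions in the order given by the sorting information."""
    return b"".join(f.data for f in sorted(frames, key=lambda f: f.seq))
```

Because each frame carries its own sequence number, `decapsulate` recovers the original data even if the list of frames is passed in a different order.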

[0066] In one or more embodiments of this specification, a plurality of switching nodes, equal in number to the number of connection nodes, are connected to form a two-dimensional mesh structure. Each connection node is directly connected to a switching node, and the switching nodes directly connected to each connection node are different from one another.

[0067] The structure of an inter-chip communication system including switching nodes is shown in Figure 2. The switching nodes are connected to one another in a two-dimensional mesh structure, and each switching node is connected to a different connection node. The switching nodes and connection nodes are directly connected through the connection modules in the connection nodes. Thus, there are multiple communication links of equal length between any two switching nodes in the figure, providing multiple routing options for information transmission between connection nodes. This improves the fault tolerance and load balancing of the inter-chip communication system, and transforms a single long-delay communication over a direct connection between connection nodes into multiple shorter-delay communications within the inter-chip communication system, which can meet the experimental requirements of multi-connection-node communication.

[0068] Furthermore, the two-dimensional mesh structure allows for easy expansion of the network by adding connection nodes without requiring large-scale modifications to the entire topology. However, to ensure that the latency of the inter-chip communication system does not affect normal communication between connection nodes, the two-dimensional mesh still has a maximum network radius limitation. That is, the expansion capability of the inter-chip communication system provided in this specification has a maximum value and does not support unlimited expansion.

[0069] Assuming the latency budget for direct connection between nodes is T, and the latency for communication between two adjacent nodes in the inter-chip communication system is t, then the maximum network radius is: Radix = T / t;

[0070] On the other hand, if the connecting nodes form a two-dimensional torus structure, then theoretically the number of connecting nodes that can be connected is: (Radix*2)*(Radix*2).

[0071] For example, the latency of a PCIe switch chip is generally no more than 350ns, while the latency of a single crossbar on a NoC is approximately 5ns. Based on the above, Radix = 70, therefore the number of connectable nodes is 19600.
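The arithmetic in this example can be checked with a short sketch. The function names are hypothetical; the formulas are the ones stated above (Radix = T / t, and (Radix*2)*(Radix*2) nodes for a two-dimensional torus).

```python
def max_radius(direct_latency_ns: float, hop_latency_ns: float) -> int:
    """Maximum network radius: latency budget T divided by per-hop latency t."""
    return int(direct_latency_ns // hop_latency_ns)


def max_torus_nodes(radius: int) -> int:
    """Theoretical node count of a 2D torus with the given radius."""
    side = 2 * radius
    return side * side


radius = max_radius(350, 5)      # PCIe switch budget vs. one NoC crossbar hop
nodes = max_torus_nodes(radius)
print(radius, nodes)             # 70 19600
```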

[0072] In one or more embodiments of this specification, the connection module is specifically used to encapsulate the data to be transmitted into several data frames containing sorting information and transmit the data frames to the target connection node through the switching structure; the data in the data frames, combined according to the sorting information, represents the complete data to be transmitted. The connection module specifically includes a data buffer, which is used to store the data frames delivered by the switching structure. When receiving any data frame, the connection module determines, based on the sorting information in that data frame and the historical sorting information, whether the data frame has arrived out of order. If it has, the connection module saves the data frame in the data buffer until a delayed data frame that can be sequentially combined with it is received, then sequentially combines the buffered data frame and the delayed data frame according to their respective sorting information and transmits them to the computing chip.

[0073] In actual data transmission, a complete set of data is divided into multiple data frames and transmitted in batches. Since the availability of the communication link may change over time, data frames sent one after another may arrive at the target connection node out of order. The method provided in this specification supports out-of-order reordering and ensures the stability of communication between connection nodes.

[0074] Furthermore, according to the above-mentioned "until a delayed data frame that can be sequentially combined with the data frame to be transmitted is received, the data frame to be transmitted and the delayed data frame to be transmitted are sequentially combined according to their respective sorting information and then transmitted to the computing chip", this out-of-order reordering process does not need to wait for the complete data to be transmitted to be received. It has the characteristics of partial reordering and partial delivery, which can further distribute the transmission pressure and ensure low latency in communication between each connection node.

[0075] The system provided in this specification can implement partial data frame reordering and partial delivery on each connection module. This function is set at the transaction layer; that is, partial reordering and partial delivery only need to be implemented on the connection node side and do not need to be implemented in the switching structure.
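Partial reordering with partial delivery can be sketched as a small reorder buffer. This is an illustration under stated assumptions, not the patented implementation: the `ReorderBuffer` class is hypothetical, the sorting information is assumed to be a sequence number starting at 0, and the "historical sorting information" is modeled as the next expected sequence number.

```python
from __future__ import annotations


class ReorderBuffer:
    def __init__(self) -> None:
        self.next_seq = 0                    # next sequence number expected in order
        self.pending: dict[int, bytes] = {}  # data buffer for frames that arrived early

    def receive(self, seq: int, data: bytes) -> list[bytes]:
        """Return the frames that can be delivered to the computing chip right now."""
        if seq < self.next_seq:
            return []                        # already delivered; ignore the duplicate
        if seq != self.next_seq:
            self.pending[seq] = data         # out of order: hold until the gap is filled
            return []
        delivered = [data]
        self.next_seq += 1
        # Release every buffered frame that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

Delivery happens as soon as a contiguous run is available, so the receiver never waits for the whole payload; for instance, receiving frame 1 and then frame 0 releases both frames together when frame 0 arrives.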

[0076] In one or more embodiments of this specification, the connection module is specifically used to determine the target connection node of the data to be transmitted, and to transmit the data to be transmitted, which includes the address information of the target connection node, to the switching node directly connected to the connection module; the switching node is used to, when receiving the data to be transmitted, determine the target switching node according to the address information in the data to be transmitted; if the current switching node is the target switching node, transmit the received data to be transmitted to the directly connected connection module; otherwise, determine the target path according to the target switching node and the communication load of each switching node in the switching structure, and transmit the data to be transmitted to the next switching node according to the target path.

[0077] Specifically, during the process of the connection module transmitting data to the target connection node, the data is transmitted through the communication links between the various switching nodes. Each switching node can determine the route with the lowest load among the various branches of the current communication link to complete the transmission of the data. In a complete data transmission process (data transmission from one connection node to another), multiple switching nodes may be involved. Any switching node involved in the data transmission can re-determine the transmission route based on the usage status of the branches of the current communication link when it receives the data, and update the next switching node for the current switching node. The next switching node is the directly connected switching node that will receive the data transmitted by the current switching node.
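The hop-by-hop route selection described above can be sketched as follows. The sketch assumes switching nodes are addressed by (x, y) mesh coordinates, that only neighbors which move closer to the target are considered, and that `load` is a per-node load table; these are all illustrative assumptions, and the specification does not prescribe a particular routing algorithm.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def next_hop(current: Coord, target: Coord, load: Dict[Coord, int]) -> Coord:
    """Pick the least-loaded neighbor among those that move closer to the target."""
    if current == target:
        return current  # target switching node: hand over to its connection module
    (x, y), (tx, ty) = current, target
    candidates: List[Coord] = []
    if tx != x:
        candidates.append((x + (1 if tx > x else -1), y))
    if ty != y:
        candidates.append((x, y + (1 if ty > y else -1)))
    return min(candidates, key=lambda node: load.get(node, 0))


def route(source: Coord, target: Coord, load: Dict[Coord, int]) -> List[Coord]:
    """Re-evaluate the next hop at every switching node along the way."""
    path = [source]
    while path[-1] != target:
        path.append(next_hop(path[-1], target, load))
    return path
```

Because the decision is repeated at each node, a change in the load table between hops changes the remaining route, which mirrors the re-determination of the next switching node described above.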

[0078] In one or more embodiments of this specification, the connection module is specifically used to: after encapsulating the data to be transmitted into several data frames containing sorting information, generate and add a corresponding checksum for each data frame to be transmitted using a cyclic redundancy check function; the switching node is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame to be transmitted, and transmit the available data frame to be transmitted to the next switching node or the directly connected connection module; the connection module is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame to be transmitted, and transmit the available data frame to be transmitted to the computing chip.

[0079] To avoid data frame corruption caused by communication link problems, cyclic redundancy check (CRC) technology can be used to detect errors in each data frame to be transmitted. When the connection module or switching node receives a data frame, it can recalculate the checksum using the same cyclic redundancy check function. If the recalculated checksum is inconsistent with the checksum carried in the data frame, the data frame is determined to be corrupted and is not forwarded any further, thus further ensuring the reliability of the inter-chip communication system.

[0080] The system provided in this specification can implement the above-mentioned cyclic redundancy check function on each connection module and each switching node. This function is set at the data link layer; that is, the cyclic redundancy check function must be implemented on both the connection node side and the switching node side.
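The checksum generation and availability check can be sketched with the CRC-32 function from Python's standard `zlib` module. The choice of CRC-32, the 4-byte big-endian trailer, and the function names are assumptions for illustration; the specification does not state which CRC polynomial or frame layout is used.

```python
import zlib


def add_checksum(frame: bytes) -> bytes:
    """Sender side: append a CRC-32 checksum to the frame."""
    return frame + zlib.crc32(frame).to_bytes(4, "big")


def is_available(framed: bytes) -> bool:
    """Receiver side: recompute the checksum and compare it with the one carried."""
    if len(framed) < 4:
        return False
    body, received = framed[:-4], framed[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == received
```

A switching node or connection module would forward a frame only when `is_available` returns `True`.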

[0081] In one or more embodiments of this specification, the connection module is specifically used to encapsulate the data to be transmitted into various data frames to be transmitted. For any data frame to be transmitted, the data frame to be transmitted includes original content and redundant content. The original content is the content in the data to be transmitted, and the redundant content is forward error correction coding generated based on the original content. The switching structure is specifically used to, when receiving a data frame to be transmitted, verify the original content in the data frame to be transmitted based on the redundant content in the data frame to be transmitted. If the verification is successful, the data frame to be transmitted is transmitted to the next switching node or a directly connected connection module. If the verification fails, the original content is restored based on the redundant content, and the restored data frame to be transmitted is transmitted to the next switching node or a directly connected connection module. The connection module is also used to, when receiving a data frame to be transmitted, verify the original content in the data frame to be transmitted based on the redundant content in the data frame to be transmitted. If the verification fails, the original content is restored based on the redundant content.

[0082] To further avoid data frame corruption caused by communication link problems, forward error correction (FEC) technology can be used to generate redundant content based on the data to be transmitted. This allows the original content to be repaired when it is corrupted, and the repaired data to be transmitted can then continue to be transmitted without the need for an additional retransmission mechanism, thus ensuring the stability and reliability of communication.

[0083] The system provided in this specification can implement the aforementioned forward error correction function on each connection module and each switching node. This function is set at the physical layer; that is, the forward error correction function must be implemented on both the connection node side and the switching node side.
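The verify-then-restore behavior of forward error correction can be sketched with a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. The specification does not name the FEC code it uses, so Hamming(7,4) and the function names below are assumptions chosen only because the code is small enough to read in full; a real link would likely use a stronger code.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as 7 bits: original content plus 3 redundant parity bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: List[int]) -> Tuple[List[int], bool]:
    """Verify a codeword; if verification fails, restore it. Returns (data, was_repaired)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1  # repair the single-bit error in place
    return [c[2], c[4], c[5], c[6]], syndrome != 0
```

A zero syndrome corresponds to "verification is successful" in the text; a non-zero syndrome corresponds to restoring the original content from the redundant content before forwarding, with no retransmission involved.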

[0084] In one or more embodiments of this specification, the connection module is specifically used to determine the amount of data that the target connection node can receive. If the receivable amount is not less than the amount of the data to be transmitted, the connection module transmits the data to be transmitted to the target connection node through the switching structure and updates the amount of data that the target connection node can receive. The connection module is also used to update the receivable data amount of the connection node to which the current connection module belongs, after receiving data to be transmitted from other connection nodes and transmitting that data to the computing chip.

[0085] When data is transmitted between connection nodes, a credit mechanism can be employed. Each connection node stores, for every other connection node, the amount of data that node can currently receive. When a connection node needs to transmit data to a target connection node, it looks up the target node's receivable data amount and transmits only if that amount is not less than the amount of the data to be transmitted. The sending node then subtracts the transmitted amount from its stored receivable amount for the target node. At the same time, the receivable amounts for the target node stored by the other connection nodes in the inter-chip communication system are updated over the communication links between the connection nodes. Correspondingly, after the connection module of the target node receives the data, the receivable amounts for the target node stored by each connection node in the inter-chip communication system are updated by adding the corresponding data amount back. This ensures that the data to be transmitted is not lost due to congestion on arrival at the target node.
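The credit bookkeeping on the sending side can be sketched as follows. The class name `CreditTable`, its methods, and the use of byte counts as the credit unit are hypothetical; the specification only states that a receivable amount is checked, decreased on sending, and increased again once the target node has taken the data.

```python
from typing import Dict


class CreditTable:
    """One node's stored view of how much data each other node can still receive."""

    def __init__(self, receivable: Dict[str, int]) -> None:
        self.receivable = dict(receivable)

    def try_send(self, target: str, size: int) -> bool:
        """Consume credit and allow the send only if the target can receive `size`."""
        if self.receivable[target] < size:
            return False  # not enough credit: hold the data instead of risking loss
        self.receivable[target] -= size
        return True

    def on_credit_return(self, target: str, size: int) -> None:
        """Called when the target reports that it has consumed `size` units of data."""
        self.receivable[target] += size
```

Because a send is refused rather than attempted when credit is insufficient, data is never in flight toward a node that cannot accept it, which is the congestion guarantee stated above.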

[0086] In one or more embodiments of this specification, the connection module is specifically used to: for any data frame to be transmitted, cut the data frame to be transmitted into flow control units of equal length according to a preset length threshold, and transmit each flow control unit to the target connection node through the switching structure.

[0087] The flow control unit is transmitted to the target connection node at a preset time interval. The connection module is also used to transmit idle data streams to other connection nodes at a preset time interval when there is no data to be transmitted.

[0088] Dividing data frames into equal-length flow control units (flits) further refines the granularity of the communication process. Parallel transmission of a single data frame across multiple flow control units can further reduce communication latency and improve bandwidth utilization. Specifically, virtual channel (VC) technology can be used to divide a communication link into several VCs, mapping each flow control unit corresponding to a data frame to its respective VC for simultaneous transmission. This also reduces the implementation cost of the inter-chip communication system.
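As a rough illustration of the segmentation step, the sketch below cuts a data frame into equal-length flow control units and spreads them over virtual channels. The 16-byte flit length, the zero padding of the last flit, the round-robin channel assignment, and both function names are assumptions; the specification only requires equal-length units cut according to a preset length threshold.

```python
from typing import List, Tuple

FLIT_BYTES = 16  # assumption: stands in for the "preset length threshold"


def split_into_flits(frame: bytes, flit_bytes: int = FLIT_BYTES) -> List[bytes]:
    """Cut a frame into equal-length flits, zero-padding the final one."""
    flits = []
    for start in range(0, len(frame), flit_bytes):
        chunk = frame[start:start + flit_bytes]
        flits.append(chunk.ljust(flit_bytes, b"\x00"))
    return flits


def assign_virtual_channels(
    flits: List[bytes], num_vcs: int
) -> List[List[Tuple[int, bytes]]]:
    """Map flits to virtual channels round-robin, keeping each flit's index for reassembly."""
    lanes: List[List[Tuple[int, bytes]]] = [[] for _ in range(num_vcs)]
    for index, flit in enumerate(flits):
        lanes[index % num_vcs].append((index, flit))
    return lanes
```

With padding in use, a receiver needs the original frame length (for example, from a frame header) to strip the padding after reassembly; that field is not shown here.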

[0089] On the other hand, the connection module is also used to transmit a flow control unit containing clock information to each other connection node at preset time intervals through the switching structure when there is no data to be transmitted.

[0090] Scrambling can be applied to the transmitted flow control units so that each connection node receiving them can correctly recover the clock.
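Scrambling helps clock recovery because it breaks up long runs of identical bits, such as an idle stream of zeros, into a signal with frequent transitions. A minimal sketch of an additive scrambler follows. The 7-bit shift register, its tap positions, the seed, and the function name `scramble` are all assumptions for illustration; the specification does not state which scrambler is used.

```python
def scramble(data: bytes, seed: int = 0x7F) -> bytes:
    """XOR data with an LFSR keystream; applying it again with the same seed undoes it."""
    state = seed & 0x7F  # 7-bit shift register; the seed must be non-zero
    out = bytearray()
    for byte in data:
        scrambled = 0
        for bit_index in range(8):
            # Feedback bit taken from register stages 7 and 4 (an assumed polynomial).
            key_bit = ((state >> 6) ^ (state >> 3)) & 1
            state = ((state << 1) | key_bit) & 0x7F
            data_bit = (byte >> (7 - bit_index)) & 1
            scrambled = (scrambled << 1) | (data_bit ^ key_bit)
        out.append(scrambled)
    return bytes(out)
```

Since the keystream depends only on the seed, sender and receiver must start from the same register state; the receiver descrambles by calling the same function on the received bytes.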

[0091] In one or more embodiments of this specification, the computing chip is specifically used to: transmit the data to be transmitted to the connection module via an AXI interface, and the connection module is specifically used to: transmit the data to be transmitted to the computing chip via an AXI interface.

[0092] The AXI protocol supports out-of-order transaction completion, multiple outstanding transactions, data interleaving, two-way handshake flow control, and other features. It is a high-performance, high-bandwidth, low-latency on-chip bus protocol. The inter-chip communication system provided in this specification borrows from system-on-chip design and sets the interface between the computing chip and the connection module within the connection node as an AXI interface.

[0093] The system provided in this specification can implement the above-mentioned AXI interface function on each connection module and each computing chip. This function is set at the application layer; that is, it must be implemented on both the connection module side and the computing chip side, but it does not need to be implemented on the switching structure side.

[0094] In one or more embodiments of this specification, the switching nodes are connected using SerDes technology.

[0095] SerDes technology can convert multiple parallel data channels into a single high-speed serial link, improving the bandwidth utilization of the inter-chip communication system provided in this specification.

[0096] This specification also provides a method for establishing this inter-chip communication system. Specifically, the complete interconnection protocol can be divided into four layers: physical layer, data link layer, transaction layer, and application layer, wherein:

[0097] The application layer implements the standard AXI interface.

[0098] The transaction layer is responsible for: data transmission and reception between the transaction layer and the application layer; efficient allocation of communication link resources; out-of-order partial reordering and partial delivery; and segmentation of data frames into equal-length units.

[0099] The data link layer is responsible for: data frame encapsulation and decapsulation; maintaining data frame sequence numbers; appending and verifying CRC values for data frames; generating and parsing control or status data streams; and maintaining the link state machine.

[0100] The physical layer is responsible for: low-latency FEC; data frame scrambling; and SerDes transmission.

[0101] When a lost data frame requires retransmission, the method provided in this specification employs an application-layer end-to-end retransmission mechanism. Compared with a link-layer point-to-point retransmission mechanism, this has three advantages. First, it avoids the resource waste of "one person sick, everyone takes medicine," reducing resource requirements on intermediate paths (such as the switching structure) and enabling all connection nodes to form an organic whole. Second, it reduces the complexity and cost of implementing the switching structure, facilitating the construction of high-capacity switching chips. Third, because the application layer is set as an AXI bus, the retransmission mechanism is integrated with the AXI error handling mechanism, achieving high efficiency at low cost.

[0102] Figure 3 is a flowchart illustrating a method for inter-chip communication according to this specification. The method is applied to an inter-chip communication system, which includes a plurality of connection nodes and a switching structure. Each connection node is connected to the switching structure. For any connection node, the method includes the following steps:

[0103] S300: Using the computing chip in the connection node, generate data to be transmitted according to the computing task, and transmit the data to be transmitted to the connection module in the connection node;

[0104] S302: Using the connection module, determine the target connection node of the data to be transmitted, and transmit the data to be transmitted to the target connection node through the switching structure;

[0105] S310: When receiving data to be transmitted from other connection nodes, the connection module is used to receive the data to be transmitted and transmit the data to the computing chip.

[0106] Optionally, in step S302 shown in Figure 3, the data to be transmitted is encapsulated into several data frames to be transmitted, and the data frames to be transmitted are transmitted to the target connection node through the switching structure.

[0107] Optionally, the switching structure specifically includes: a number of switching nodes equal to the number of connection nodes, each switching node being connected to form a two-dimensional mesh structure, each connection node being directly connected to a switching node, and the switching nodes directly connected to each connection node being different from one another.

[0108] Optionally, in step S302 shown in Figure 3, the data to be transmitted is encapsulated into several data frames to be transmitted containing sorting information, and the data frames to be transmitted are transmitted to the target connection node through the switching structure. The data in the data frames to be transmitted, combined according to the sorting information, represents the complete data to be transmitted. The connection module specifically includes a data buffer, which is used to store data frames to be transmitted that are delivered by the switching structure. In step S310 shown in Figure 3, when any data frame to be transmitted is received, it is determined whether the transmission of the data to be transmitted is out of order based on the sorting information in the data frame to be transmitted and the historical sorting information. If the transmission is out of order, the data frame to be transmitted is stored in the data buffer until a delayed data frame to be transmitted that can be sequentially combined with it is received. The data frame to be transmitted and the delayed data frame to be transmitted are then sequentially combined according to the sorting information in each data frame and transmitted to the computing chip.
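The out-of-order handling in step S310 can be sketched as a small reorder buffer. The sketch assumes that the sorting information is a consecutive integer sequence number starting at 0 and that the "historical sorting information" is simply the next sequence number expected; the class name `ReorderBuffer` and its fields are hypothetical.

```python
from typing import Dict, List


class ReorderBuffer:
    """Hold early-arriving frames until the delayed frames before them show up."""

    def __init__(self) -> None:
        self.next_seq = 0                   # next sequence number owed to the computing chip
        self.pending: Dict[int, bytes] = {}  # data buffer: sequence number -> payload

    def receive(self, seq: int, payload: bytes) -> List[bytes]:
        """Store one frame and return every payload that is now deliverable in order."""
        self.pending[seq] = payload
        deliverable = []
        while self.next_seq in self.pending:
            deliverable.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return deliverable
```

For example, if frame 1 arrives before frame 0, the first call returns an empty list and the second call returns both payloads in order.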

[0109] Optionally, in step S302 shown in Figure 3, the target connection node of the data to be transmitted is determined, and the data to be transmitted, containing the address information of the target connection node, is transmitted to the switching node directly connected to the connection module. This allows a switching node in the switching structure, when it receives the data to be transmitted, to determine the target switching node based on the address information in the data. If the current switching node is the target switching node, the received data to be transmitted is transmitted to the directly connected connection module. Otherwise, a target path is determined based on the target switching node and the communication load of each switching node in the switching structure, and the data to be transmitted is transmitted to the next switching node based on the target path.
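One possible reading of the load-aware path selection is sketched below for a two-dimensional mesh. It assumes that switching nodes are addressed by `(x, y)` coordinates, that only hops moving closer to the target are considered, and that the less loaded of those hops is chosen; the specification does not prescribe this particular policy, and the function name `next_hop` and the `load` mapping are hypothetical.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[int, int]


def next_hop(current: Coord, target: Coord, load: Dict[Coord, int]) -> Optional[Coord]:
    """Return the neighbouring switching node to forward to, or None when already at the target."""
    if current == target:
        return None  # deliver to the directly connected connection module
    cx, cy = current
    tx, ty = target
    candidates = []
    if cx != tx:
        candidates.append((cx + (1 if tx > cx else -1), cy))  # one step along X
    if cy != ty:
        candidates.append((cx, cy + (1 if ty > cy else -1)))  # one step along Y
    # Prefer the candidate with the lower reported load; unknown load counts as 0.
    return min(candidates, key=lambda node: load.get(node, 0))
```

Restricting the choice to hops that reduce the distance to the target keeps every path minimal, so the load comparison only decides between the X-direction and Y-direction neighbours.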

[0110] Optionally, in step S302 shown in Figure 3, a cyclic redundancy check function is used to generate and add a corresponding checksum to each data frame to be transmitted. This allows the switching nodes in the switching structure, upon receiving a data frame, to determine the availability of the data frame based on the checksum it carries, and then to transmit the available data frame to the next switching node or a directly connected connection module. In step S310 shown in Figure 3, when a data frame to be transmitted is received, the connection module determines the availability of the data frame based on the checksum in the data frame, and transmits the available data frame to the computing chip.

[0111] Optionally, in step S302 shown in Figure 3, the data to be transmitted is encapsulated into data frames. Each data frame includes original content and redundant content. The original content is the content of the data to be transmitted, and the redundant content is a forward error correction code generated based on the original content. This allows a switching node in the switching structure, when it receives a data frame, to verify the original content of the data frame based on the redundant content. If the verification is successful, the data frame is transmitted to the next switching node or a directly connected connection module. If the verification fails, the original content is restored based on the redundant content, and the restored data frame is transmitted to the next switching node or a directly connected connection module. In step S310 shown in Figure 3, when a data frame to be transmitted is received, the original content of the data frame is verified according to the redundant content in the data frame. If the verification fails, the original content is restored according to the redundant content.

[0112] Optionally, in step S302 shown in Figure 3, the amount of data that the target connection node can receive is determined. If the receivable amount is not less than the amount of the data to be transmitted, the data to be transmitted is transmitted to the target connection node through the switching structure, and the amount of data that the target connection node can receive is updated. In step S310, after the data to be transmitted is received from other connection nodes and transmitted to the computing chip, the receivable data amount of the connection node to which the current connection module belongs is updated.

[0113] Optionally, in step S302 shown in Figure 3, for any data frame to be transmitted, the data frame is cut into flow control units of equal length according to a preset length threshold, and each flow control unit is transmitted to the target connection node through the switching structure.

[0114] Optionally, when there is no data to be transmitted, for any connection node, the connection node can transmit a flow control unit containing clock information to each other connection node through the switching structure at a preset time interval.

[0115] Optionally, the switching nodes can be connected using SerDes technology.

[0116] This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the method for inter-chip communication provided in Figure 3 above.

[0117] This specification also provides a schematic structural diagram of an electronic device, shown in Figure 4. As shown in Figure 4, at the hardware level, the inter-chip communication device includes a processor, an internal bus, a network interface, memory, and non-volatile memory, and may also include other hardware required by the service. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the method of inter-chip communication described above with reference to Figure 3. Of course, in addition to software implementation, this specification does not exclude other implementations, such as logic devices or a combination of hardware and software. In other words, the execution subject of the processing flow is not limited to individual logic units, but can also be hardware or logic devices.

[0118] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using hardware physical modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program and "integrate" a digital system onto a PLD themselves, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.

[0119] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.

[0120] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0121] For ease of description, the above devices are described in terms of function, divided into various units. Of course, in implementing this specification, the functions of each unit can be implemented in one or more software and / or hardware components.

[0122] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0123] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0124] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0125] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0126] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0127] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0128] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0129] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0130] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0131] This specification can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This specification can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0132] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments.

[0133] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this application.

Claims

1. An inter-chip communication system, characterized in that the system includes: a plurality of connection nodes and a switching structure, wherein each connection node is connected to the switching structure; for any given connection node, the connection node includes a computing chip and a connection module; the connection module includes a data buffer; the computing chip is used to generate data to be transmitted according to the computing task and transmit the data to be transmitted to the connection module; the connection module is used to determine the target connection node of the data to be transmitted, to encapsulate the data to be transmitted into several data frames to be transmitted containing sorting information, and to transmit the data frames to be transmitted to the target connection node through the switching structure, wherein the data in the data frames to be transmitted, combined according to the sorting information, represents the complete data to be transmitted; the connection module is further configured to receive data to be transmitted from other connection nodes and transmit the data to be transmitted to the computing chip; wherein, when any data frame to be transmitted is received, if it is determined that the transmission of the data to be transmitted is out of order based on the sorting information in the data frame to be transmitted and the historical sorting information, the data frame to be transmitted is stored in the data buffer until a delayed data frame to be transmitted that can be sequentially combined with the data frame to be transmitted is received, and the data frame to be transmitted and the delayed data frame to be transmitted are sequentially combined according to their respective sorting information and transmitted to the computing chip.

2. The system of claim 1, wherein the switching structure specifically includes: a number of switching nodes equal to the number of connection nodes, the switching nodes being connected to form a two-dimensional mesh structure, each connection node being directly connected to one switching node, and the switching nodes directly connected to the respective connection nodes being different from one another.

3. The system of claim 2, wherein the connection module is specifically used to determine the target connection node of the data to be transmitted, and to transmit the data to be transmitted, which includes the address information of the target connection node, to the switching node directly connected to the connection module; the switching node is used to determine the target switching node based on the address information in the data to be transmitted when it receives data to be transmitted; if the current switching node is the target switching node, the received data to be transmitted is transmitted to the directly connected connection module; otherwise, the target path is determined based on the target switching node and the communication load of each switching node in the switching structure, and the data to be transmitted is transmitted to the next switching node according to the target path.

4. The system of claim 2, wherein the connection module is specifically used to generate and add a corresponding checksum to each data frame to be transmitted using a cyclic redundancy check function; the switching node is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame, and transmit the available data frame to the next switching node or the directly connected connection module; the connection module is specifically used to, upon receiving a data frame to be transmitted, determine the availability of the data frame to be transmitted based on the checksum in the data frame, and transmit the available data frame to the computing chip.

5. The system of claim 2, wherein, The connection module is specifically used to encapsulate the data to be transmitted into various data frames to be transmitted. For any data frame to be transmitted, the data frame to be transmitted includes original content and redundant content. The original content is the content in the data to be transmitted, and the redundant content is forward error correction code generated based on the original content. The switching node is specifically used to, when receiving a data frame to be transmitted, verify the original content of the data frame to be transmitted based on the redundant content in the data frame to be transmitted. If the verification is successful, the data frame to be transmitted is transmitted to the next switching node or the directly connected connection module. If the verification fails, the original content is restored based on the redundant content, and the restored data frame to be transmitted is transmitted to the next switching node or the directly connected connection module. The connection module is further configured to, when receiving a data frame to be transmitted, verify the original content of the data frame to be transmitted based on the redundant content in the data frame to be transmitted, and restore the original content based on the redundant content if the verification fails.

6. The system as described in claim 1, characterized in that the connection module is specifically used to determine the amount of data that the target connection node can receive. If the amount of data that the target connection node can receive is not less than the amount of data to be transmitted, the data to be transmitted is transmitted to the target connection node through the switching structure, and the amount of data that the target connection node can receive is updated. The connection module is also used to update the amount of data that the connection node to which it belongs can receive, after receiving data to be transmitted from other connection nodes and transmitting that data to the computing chip.
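The sender-side check in claim 6 resembles credit-based flow control and can be sketched as follows. The class `CreditSender`, the byte-granular accounting, and the `on_capacity_update` callback are assumptions for illustration; the claim states that the receivable amount is updated but does not say how the update reaches the sender.

```python
class CreditSender:
    """Tracks, per target node, how much data that node can still receive."""

    def __init__(self, initial_capacity: dict):
        # target node id -> number of bytes the target can currently accept
        self.capacity = dict(initial_capacity)

    def try_send(self, target, data: bytes, transmit) -> bool:
        """Send only if the target can receive at least len(data) bytes."""
        if self.capacity.get(target, 0) < len(data):
            return False  # not enough room; caller may retry later
        transmit(target, data)
        self.capacity[target] -= len(data)  # update the receivable amount
        return True

    def on_capacity_update(self, target, freed: int) -> None:
        """Assumed notification that the target has drained `freed` bytes
        to its computing chip and can accept that much more."""
        self.capacity[target] = self.capacity.get(target, 0) + freed
```

A sender holding 8 bytes of capacity for a node can send a 5-byte message, is then refused a 4-byte message, and can send it once the receiver reports that the first 5 bytes were delivered to its computing chip.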

7. The system as described in claim 1, characterized in that the connection module is specifically used to: for any data frame to be transmitted, divide the data frame into flow control units of equal length according to a preset length threshold, and transmit each flow control unit to the target connection node through the switching structure.
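The division into equal-length flow control units in claim 7 can be sketched as below. The claim requires equal length but does not say how a final short piece is handled; zero padding of the last unit is an assumption here, as are the function name `split_into_flits` and its parameters.

```python
def split_into_flits(frame: bytes, flit_len: int, pad: bytes = b"\x00") -> list:
    """Divide a frame into flow control units of exactly flit_len bytes.

    The last unit is padded with `pad` (assumed behavior) so that all
    units have equal length.
    """
    if flit_len <= 0:
        raise ValueError("flit_len must be positive")
    flits = [frame[i:i + flit_len] for i in range(0, len(frame), flit_len)]
    if flits and len(flits[-1]) < flit_len:
        flits[-1] = flits[-1].ljust(flit_len, pad)
    return flits
```

With padding, the receiver needs the true frame length from somewhere else (for example a header field) to strip the fill bytes; that detail is outside this sketch.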

8. The system as described in claim 1, characterized in that the connection module is also used to transmit, when there is no data to be transmitted, a flow control unit containing clock information to each other connection node through the switching structure at preset time intervals.

9. The system as described in claim 1, characterized in that the switching nodes are connected using SerDes technology.

10. A method for inter-chip communication, characterized in that the method is applied to an inter-chip communication system comprising a plurality of connection nodes and a switching structure, wherein each connection node is connected to the switching structure. For any connection node, the method includes: using the computing chip in the connection node, generating data to be transmitted according to a computing task and transmitting the data to be transmitted to the connection module in the connection node, the connection module including a data buffer; using the connection module, determining the target connection node of the data to be transmitted, encapsulating the data to be transmitted into several data frames containing sorting information, and transmitting the data to be transmitted to the target connection node through the switching structure, wherein the data in the data frames, combined according to the sorting information, represents the complete data to be transmitted; and, using the connection module, receiving data to be transmitted from other connection nodes and transmitting it to the computing chip. Specifically, when any data frame to be transmitted is received, if it is determined, based on the sorting information in the data frame and the historical sorting information, that the data to be transmitted has arrived out of order, the data frame is stored in the data buffer until a delayed data frame that can be sequentially combined with it is received. The stored data frame and the delayed data frame are then combined in order according to their respective sorting information and transmitted to the computing chip.
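The out-of-order handling at the end of claim 10 can be sketched as a small reorder buffer. This assumes the sorting information is a consecutive integer sequence number and that the historical sorting information is the next sequence number expected; neither detail is fixed by the claim. The class `ReorderBuffer` and the `deliver` callback (standing in for the hand-off to the computing chip) are hypothetical.

```python
class ReorderBuffer:
    """Holds early frames until the delayed frames before them arrive."""

    def __init__(self, deliver, first_seq: int = 0):
        self.deliver = deliver      # hand-off to the computing chip (assumed)
        self.next_seq = first_seq   # historical sorting information
        self.pending = {}           # data buffer: seq -> payload

    def on_frame(self, seq: int, payload: bytes) -> None:
        if seq < self.next_seq:
            return  # already delivered; treated as a duplicate in this sketch
        if seq != self.next_seq:
            # Out of order: store until the delayed frame arrives.
            self.pending[seq] = payload
            return
        # In order: deliver, then drain any buffered frames that now follow.
        self.deliver(payload)
        self.next_seq += 1
        while self.next_seq in self.pending:
            self.deliver(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

If frames 1 and 2 arrive before frame 0, both are buffered; when frame 0 arrives, all three are delivered in sequence order and the buffer is empty again.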