Artificial intelligence server cluster network based on hybrid optoelectronic interconnection and CXL-oF protocol, memory access method and device and electronic equipment
By combining hybrid optoelectronic interconnects with a hybrid physical interconnect layer based on the CXL-oF protocol, along with intelligent traffic scheduling and network management, the communication bottleneck of AI server cluster networks is solved, enabling low-latency, high-bandwidth, and high-energy-efficiency data interoperability and simplifying distributed memory programming.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING SMARTCHIP MICROELECTRONICS TECHNOLOGY CO LTD
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing AI server cluster networks suffer from bottlenecks in scalability, energy efficiency, and topology flexibility. Optical interconnects lack the capabilities of intelligent flow control and memory semantic protocols, leading to communication latency and energy consumption issues.
By employing hybrid optoelectronic interconnects and the CXL-oF protocol, a hybrid physical interconnect layer consisting of an electrical switching plane and a reconfigurable optical switching plane is constructed. Intelligent traffic scheduling is performed through an optoelectronic joint scheduler, and efficient memory access across racks and clusters is achieved by combining CXL-oF bridging devices and a network manager.
It enables low-latency, high-bandwidth, and high-energy-efficiency data interoperability in ultra-large-scale AI server cluster networks, solves communication bottlenecks and topology rigidity problems, and simplifies the complexity of distributed memory programming.
Figure CN122002166B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence server cluster network technology, specifically to an artificial intelligence server cluster network, memory access method, apparatus, and electronic device based on hybrid optoelectronic interconnection and CXL-oF protocol.
Background Technology
[0002] With the rapid development of deep learning technology, the number of parameters in deep learning models is leaping from tens of billions to trillions at an unprecedented rate. This trend poses severe challenges to computing infrastructure, causing traditional architecture design to face an unprecedented crisis. Against the backdrop of Moore's Law gradually slowing down, the performance bottleneck of large-scale AI server clusters is no longer limited to the computing power of a single node, but has rapidly shifted to the efficiency of data transfer between nodes and the latency of cross-node memory access. These have now become key factors restricting the performance improvement of AI server clusters.
[0003] Current mainstream AI server cluster architectures rely primarily on layered electrical interconnect technologies. For example, devices within a server are connected by NVLink or PCIe, while servers communicate with one another over InfiniBand or RoCE Ethernet. This architecture remains reasonably efficient in small clusters (for example, those with hundreds of accelerator cards), but when the cluster grows to thousands or even tens of thousands of accelerator nodes, it runs into hard physical obstacles and energy-efficiency problems. First, as high-speed signal rates keep rising, the physical attenuation of electrical signals at high frequencies increases exponentially. The complex equalization techniques and retimers needed to maintain signal integrity result in extremely high power consumption at the SerDes interfaces, severely limiting the energy efficiency and computing density of large-scale clusters. Second, static network topologies, typified by Fat-Tree, are badly mismatched with the highly regular collective communication patterns of AI training (such as All-Reduce). Data packets traverse multiple layers of switches unnecessarily, which increases the hop count and the unpredictable long-tail latency and leaves expensive computing resources idle while they wait for data. Finally, building a fully connected cluster of this scale with the traditional three-layer Fat-Tree architecture requires tens of thousands of high-speed copper or fiber-optic cables and thousands of switches. Such cabling not only incurs huge physical deployment costs but also makes cable management and troubleshooting extremely complex. Any cable failure, loosening, or poor contact may interrupt the entire training task or significantly degrade its performance, so system reliability drops sharply as the scale increases.
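The cabling and switch counts mentioned above can be checked with a short calculation. The sketch below assumes the classic three-tier k-ary fat-tree built from identical k-port switches; the disclosure does not state which fat-tree variant or port count it has in mind, so the function name `fat_tree_size` and the choice of k = 48 are illustrative assumptions only.

```python
def fat_tree_size(k: int) -> dict:
    """Size of a classic three-tier k-ary fat-tree built from k-port switches.

    Assumed model (not taken from the disclosure): k pods, each with k/2 edge
    and k/2 aggregation switches, plus (k/2)^2 core switches.
    """
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even port count")
    hosts = k ** 3 // 4                # each edge switch serves k/2 hosts
    edge = agg = k * (k // 2)          # k pods, k/2 switches per tier per pod
    core = (k // 2) ** 2
    links = 3 * hosts                  # host-edge, edge-agg, agg-core: k^3/4 each
    return {"hosts": hosts, "switches": edge + agg + core, "links": links}


print(fat_tree_size(48))
# {'hosts': 27648, 'switches': 2880, 'links': 82944}
```

Under this assumed model, a cluster of roughly 27,000 endpoints already needs about 2,900 switches and 83,000 cables, which is consistent with the orders of magnitude quoted in the paragraph.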
[0004] Meanwhile, the CXL (Compute Express Link) protocol has brought revolutionary memory semantics to heterogeneous computing, breaking down memory silos by maintaining hardware-level cache coherence. However, the CXL protocol is designed primarily for short-distance interconnects within a server rack and struggles to support cross-data-center transmission. More importantly, its snoop-based coherence mechanism is highly susceptible to "broadcast storms" when the number of nodes is large, and its implicit tree-like topology assumption conflicts structurally with the flat, mesh-like communication patterns required for artificial intelligence training.
[0005] Faced with the scalability challenges of electrical interconnects and the distance and topology limitations of the CXL protocol, the industry has begun to explore more fundamental physical layer solutions. Optical Circuit Switching (OCS) technology utilizes MEMS (Micro-Electro-Mechanical Systems) micromirror arrays or liquid crystal technology to directly switch optical paths in the optical domain without requiring optical-to-electrical-to-optical (OEO) conversion. It is thus considered the ultimate physical solution to the interconnect bottlenecks of the "post-Moore's Law era," offering transparent, high-bandwidth, and low-power transmission. Although many major companies have experimented with OCS technology in data centers, its application to AI server clusters still faces significant challenges, primarily in the following aspects:
[0006] 1. Reconfiguration delay: The switching time of a MEMS OCS is typically in the millisecond range, and that of a liquid-crystal OCS, although faster, is still in the microsecond range. In contrast, packet forwarding in electrical switching takes nanoseconds. For the frequently changing communication patterns of AI model training, millisecond-level network interruptions are unacceptable and may cause pipeline bubbles.
[0007] 2. Dumb-pipe characteristics: An OCS cannot parse packet headers and cannot perform flow control or retransmit lost packets. It relies on upper-layer protocols to handle all network congestion and reliability issues. Existing TCP/IP protocols perform extremely poorly in the face of the transient interruptions caused by optical path switching, while RDMA (Remote Direct Memory Access) protocols are extremely sensitive to packet loss.
[0008] 3. Lack of memory semantics: A pure OCS network cannot support CXL, a coherence protocol that requires extremely tight timing coordination and immediate responses.
[0009] In summary, the electrical interconnects in existing AI server cluster networks have bottlenecks in terms of scalability, energy efficiency, and topology flexibility, while optical interconnects, which have physical advantages, lack the ability to handle intelligent flow control and memory semantic protocols.
[0010] How to extend the memory semantic advantages of the CXL protocol to the scale of large-scale clusters, and organically integrate the intelligent and fast control of electrical switching with the high bandwidth and low power transmission of optical switching, so as to achieve low latency, high bandwidth and high energy efficiency data interoperability in ultra-large-scale artificial intelligence server cluster networks, is an urgent problem to be solved.
Summary of the Invention
[0011] To address the problems in related technologies, embodiments of this disclosure provide an artificial intelligence server cluster network, memory access method, apparatus, and electronic device based on hybrid optoelectronic interconnection and CXL-oF protocol.
[0012] In a first aspect, this disclosure provides an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol. The artificial intelligence server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected through corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane through optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine, wherein:
[0013] The CXL-oF bridging device is configured to: acquire a memory access instruction generated based on an AI model training task; determine a target server node based on the request address in the memory access instruction; if the target server node is not a local server node, generate a CXL data packet corresponding to the memory access instruction through the CXL module and write the CXL data packet into the outbound traffic queue; the optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive features of the CXL data packet in the outbound traffic queue; then, the Fabric encapsulation engine encapsulates the CXL data packet based on the request address of the memory access instruction and the routing interface and descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane; the routing interface includes: an electrical interface or an optical interface; the designated switching plane corresponding to the electrical interface is an electrical switching plane; and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane.
[0014] The network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: during the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance.
[0015] In a second aspect, this disclosure provides a memory access method in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol. The artificial intelligence server cluster network includes a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. The multiple server nodes in each server rack are interconnected through corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane through optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine. The memory access method is applied to the server node and includes:
[0016] Obtain the memory access instruction generated by the AI model training task, determine the target server node according to the request address in the memory access instruction, if the target server node is not the local server node, generate a CXL data packet corresponding to the memory access instruction through the CXL module, and write the CXL data packet into the outbound traffic queue;
[0017] The optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive characteristics of the CXL data packet in the outbound traffic queue. Then, the Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction, the routing interface of the CXL data packet, and the descriptive characteristics of the CXL data packet, generates a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes an electrical interface or an optical interface. The designated switching plane corresponding to the electrical interface is an electrical switching plane, and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane.
[0018] The network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: based on the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing, in advance, the OCS reconstruction instruction corresponding to that execution phase to the reconfigurable optical switching plane.
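The node-side steps of the method (address decoding, CXL packet generation, queueing, interface selection, encapsulation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the address map, the node names, the `CxlPacket` and `FabricFrame` fields, and the `handle_access` function are all hypothetical, and the scheduling policy is passed in as a callable because its rules are described elsewhere in the disclosure.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical global address map: (start, end_exclusive, owning node).
ADDRESS_MAP = [
    (0x0000_0000, 0x4000_0000, "node-0"),
    (0x4000_0000, 0x8000_0000, "node-1"),
]
LOCAL_NODE = "node-0"  # assumed identity of the node running this code


@dataclass
class CxlPacket:
    kind: str      # e.g. "CXL.mem", "CXL.cache", "CXL.io"
    address: int   # request address from the memory access instruction
    size: int      # payload size in bytes


@dataclass
class FabricFrame:
    dst_node: str
    interface: str  # "electrical" or "optical"
    packet: CxlPacket


outbound_queue: deque = deque()  # stand-in for the outbound traffic queue


def target_node(address: int) -> str:
    """Resolve the node that owns a request address."""
    for start, end, node in ADDRESS_MAP:
        if start <= address < end:
            return node
    raise ValueError(f"unmapped request address {address:#x}")


def handle_access(kind: str, address: int, size: int,
                  scheduler: Callable[[CxlPacket], str]) -> Optional[FabricFrame]:
    """Return the Fabric frame for a remote access, or None for a local one."""
    dst = target_node(address)
    if dst == LOCAL_NODE:
        return None  # local access: nothing leaves the node
    outbound_queue.append(CxlPacket(kind, address, size))
    pkt = outbound_queue.popleft()
    interface = scheduler(pkt)  # the scheduler picks the routing interface
    if interface not in ("electrical", "optical"):
        raise ValueError(f"unknown routing interface {interface!r}")
    return FabricFrame(dst, interface, pkt)


frame = handle_access("CXL.mem", 0x4000_1000, 1 << 20,
                      lambda pkt: "optical" if pkt.size >= 4096 else "electrical")
print(frame.dst_node, frame.interface)  # node-1 optical
```

Passing the scheduler as a parameter keeps the address-decoding and encapsulation steps independent of whichever routing policy is in force.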
[0019] In a third aspect, this disclosure provides a memory access device in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol. The artificial intelligence server cluster network includes a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. The multiple server nodes in each server rack are interconnected through corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane through optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine. The memory access device is located on the server node and includes:
[0020] The CXL packet generation module is configured to: obtain a memory access instruction generated based on an AI model training task; determine the target server node based on the request address in the memory access instruction; if the target server node is not a local server node, generate a CXL packet corresponding to the memory access instruction through the CXL module and write the CXL packet into the outbound traffic queue.
[0021] The CXL packet processing module is configured as follows: the optoelectronic joint scheduler determines the routing interface of the CXL packet based on the descriptive characteristics of the CXL packets in the outbound traffic queue; then, the Fabric encapsulation engine encapsulates the CXL packet based on the request address of the memory access instruction, the routing interface of the CXL packet, and the descriptive characteristics of the CXL packet, generates a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes: an electrical interface or an optical interface; the designated switching plane corresponding to the electrical interface is an electrical switching plane; and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane.
[0022] The network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: based on the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing, in advance, the OCS reconstruction instruction corresponding to that execution phase to the reconfigurable optical switching plane.
[0023] In a fourth aspect, this disclosure provides an electronic device including a memory and a processor; the memory is used to store computer instructions, wherein the computer instructions are executed by the processor to implement the memory access method described in the second aspect.
[0024] In a fifth aspect, this disclosure provides a computer-readable storage medium having computer instructions stored thereon, which, when executed by a processor, implement the memory access method described in the second aspect.
[0025] In a sixth aspect, this disclosure provides a computer program product including computer instructions that, when executed by a processor, implement the memory access method described in the second aspect.
[0026] The technical solution provided in this disclosure relies on deep hardware-software collaboration. Through a hybrid architecture and protocol extension, it adapts CXL memory semantics, which were designed for short distances, to the physical characteristics of optoelectronic interconnection in large-scale clusters and coordinates the two intelligently. This fundamentally solves the communication bottleneck in ultra-large-scale AI training and integrates the extreme scalability of the physical layer with the unified memory semantics of the logical layer. As a result, low-latency, high-bandwidth, and high-energy-efficiency data interoperability is achieved in ultra-large-scale artificial intelligence server cluster networks.
[0027] On the one hand, this disclosure constructs a hybrid physical interconnect layer consisting of an "electrical switching plane" and a "reconfigurable optical switching plane" in an artificial intelligence server cluster network, and performs intelligent traffic scheduling through an optoelectronic joint scheduler. Specifically, the reconfigurable optical switching plane, with its high bandwidth, low power consumption, and physical isolation, carries large volumes of data, while the electrical switching plane, with its low latency, high flexibility, and fast response, carries control packets and small volumes of data. This effectively avoids the "reconfiguration delay" and "dumb pipe" problems of pure optical switching, as well as the "power consumption wall" and "topology rigidity" problems of pure electrical switching, so that the two planes complement each other. In addition, the optoelectronic joint scheduler can make real-time decisions based on the descriptive characteristics of CXL data packets (such as type and size) and dynamically route them to the electrical or optical interface. Specifically, it routes short control packets (such as CXL.io configuration and CXL.cache probes) to the low-latency electrical plane to ensure fast response, and routes bulk data transfers and collective communication traffic to the high-bandwidth optical plane to avoid congestion. This fine-grained traffic engineering optimizes both the overall network throughput and the latency.
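The routing rule described above can be sketched as a small classifier. This is a simplified illustration, not the scheduler's actual logic: the 4096-byte cutoff, the `collective` flag, and the rule that every CXL.io and CXL.cache packet counts as short control traffic are assumptions made for the example.

```python
from enum import Enum


class Plane(Enum):
    ELECTRICAL = "electrical"
    OPTICAL = "optical"


# Packet types treated as short control traffic (simplifying assumption).
CONTROL_KINDS = {"CXL.io", "CXL.cache"}
# Assumed size cutoff for "large" transfers; the disclosure gives no value here.
BULK_THRESHOLD_BYTES = 4096


def route(kind: str, size: int, collective: bool = False) -> Plane:
    """Pick a switching plane from a packet's type and size."""
    if kind in CONTROL_KINDS:
        return Plane.ELECTRICAL      # latency-sensitive control traffic
    if collective or size >= BULK_THRESHOLD_BYTES:
        return Plane.OPTICAL         # bulk or collective traffic
    return Plane.ELECTRICAL          # small data stays on the low-latency plane


for kind, size in [("CXL.cache", 64), ("CXL.mem", 64), ("CXL.mem", 1 << 20)]:
    print(kind, size, route(kind, size).value)
```

A real scheduler would also take queue state and the current optical topology into account; the sketch covers only the type-and-size rule stated in the paragraph.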
[0028] On the other hand, this disclosure introduces a CXL-oF bridging device in the server node. Through the Fabric encapsulation engine, CXL data packets based on the local PCIe / CXL physical layer are encapsulated into "Fabric transmission frames" that can be transmitted over long distances on both optical and electrical planes. This transparently extends the cache-coherent memory access semantics of the CXL protocol from the traditional "in-chassis / rack" scope to the entire data center level. This enables processors and accelerators across racks or even across clusters to access remote memory as efficiently as accessing local memory. As a result, it provides a unified memory address space and cache-coherent model for AI server clusters, greatly simplifying the complexity of distributed memory programming.
[0029] Furthermore, this disclosure introduces a network manager to perform unified and centralized software-defined management of the electrical switching backbone and centralized optical path switching equipment through an out-of-band management network. In particular, it can control the reconstruction of the physical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instructions corresponding to the execution phase to the reconfigurable optical switching plane in advance during the execution phase of the AI model training task. This allows for the forward-looking and dynamic reconstruction of the physical optical topology into a matching logical topology (such as a ring or tree structure), thereby significantly reducing communication hops and latency at the physical level. In addition, by cooperating with the task scheduling system, optical path reconstruction can be completed in advance during the switching interval of the training task execution phase, reducing the impact of millisecond-level reconstruction latency on the computing task to almost zero. This solves the core problem of applying pure OCS to dynamic computing loads.
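The idea of reconstructing the optical topology into a ring ahead of a training phase can be illustrated with two small helpers. Both are hypothetical sketches: `ring_cross_connects` assumes the OCS is programmed as a mapping from input port to output port, and `issue_time_ms` assumes the manager knows the phase start time and the OCS reconfiguration delay; the 1 ms safety margin is an arbitrary illustrative value.

```python
def ring_cross_connects(ports: list[int]) -> dict[int, int]:
    """Cross-connect each port to the next one in the list, closing the ring."""
    if len(ports) < 2:
        raise ValueError("a ring needs at least two ports")
    if len(set(ports)) != len(ports):
        raise ValueError("ports must be unique")
    return {src: ports[(i + 1) % len(ports)] for i, src in enumerate(ports)}


def issue_time_ms(phase_start_ms: float, reconfig_ms: float,
                  margin_ms: float = 1.0) -> float:
    """Latest time to send the OCS instruction so the ring is ready at phase start."""
    return phase_start_ms - reconfig_ms - margin_ms


print(ring_cross_connects([3, 7, 9]))   # {3: 7, 7: 9, 9: 3}
print(issue_time_ms(100.0, 10.0))       # 89.0
```

If the interval between two phases is shorter than the reconfiguration delay plus the margin, the instruction cannot be hidden in that interval, and traffic would have to stay on the electrical plane until the new optical paths are up.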
[0030] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0031] Other features, objects, and advantages of this disclosure will become more apparent from the following detailed description of non-limiting embodiments, taken in conjunction with the accompanying drawings. In the drawings:
[0032] Figure 1 is a schematic structural diagram of an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol according to an embodiment of the present disclosure.
[0033] Figure 2 is a schematic diagram of the internal module structure of a server node according to an embodiment of the present disclosure.
[0034] Figure 3 is a schematic diagram of the CXL-oF protocol stack and encapsulation format according to an embodiment of the present disclosure.
[0035] Figure 4 shows a schematic diagram of the global physical address mapping table and a structure diagram of the DCD directory table according to embodiments of the present disclosure.
[0036] Figure 5 is a flowchart of a method by which the optoelectronic joint scheduler determines the routing interface of a CXL data packet according to an embodiment of the present disclosure.
[0037] Figure 6 is a module architecture diagram of a network manager according to an embodiment of the present disclosure.
[0038] Figure 7 is a flowchart of a network manager controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane according to an embodiment of the present disclosure.
[0039] Figure 8 is a flowchart of a memory access method in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol according to an embodiment of the present disclosure.
[0040] Figure 9 is a structural block diagram of a memory access device in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol according to an embodiment of the present disclosure.
[0041] Figure 10 is a structural block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0042] In the following, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings to enable those skilled in the art to readily implement them. Furthermore, for clarity, portions unrelated to the description of exemplary embodiments have been omitted from the drawings.
[0043] In this disclosure, it should be understood that terms such as “comprising” or “having” are intended to indicate the presence of features, figures, steps, behaviors, components, parts or combinations thereof disclosed in this specification, and are not intended to exclude the possibility of the presence or addition of one or more other features, figures, steps, behaviors, components, parts or combinations thereof.
[0044] It should also be noted that, unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other. This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.
[0045] In this disclosure, any operation involving the acquisition of user information or user data, or the display of user information or user data to others, is an operation authorized or confirmed by the user, or actively selected by the user.
[0046] As mentioned earlier, electrical interconnects in existing AI server cluster networks have bottlenecks in terms of scalability, energy efficiency, and topology flexibility, while optical interconnects, which have physical advantages, lack the ability to handle intelligent flow control and memory semantic protocols.
[0047] To overcome the limitations of existing ultra-large-scale artificial intelligence server cluster networks and achieve low-latency, high-bandwidth, and high-energy-efficiency data interoperability within such networks, this disclosure constructs a hybrid physical interconnect layer consisting of an "electrical switching plane" and a "reconfigurable optical switching plane" within the artificial intelligence server cluster network. An optoelectronic joint scheduler dynamically routes CXL data packets to either the electrical switching plane or the reconfigurable optical switching plane in real time based on the descriptive characteristics of the CXL data packets. Before transmission, the CXL data packets are encapsulated into Fabric transmission frames that can be transmitted over long distances on both the optical and electrical planes using a Fabric encapsulation engine. This extends the memory semantic advantages of the CXL protocol to the large-scale cluster scope and organically integrates the advantages of electrical switching and optical path switching, thereby achieving low-latency, high-bandwidth, and high-energy-efficiency data interoperability in ultra-large-scale artificial intelligence server cluster networks.
[0048] Figure 1 illustrates the structure of an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol, according to an embodiment of this disclosure. As shown in Figure 1, the artificial intelligence server cluster network includes a network manager, a hybrid physical interconnect layer, and multiple server racks (Figure 1 takes three server racks as an example). Each server rack includes an electrical switch and multiple server nodes (Figure 1 takes four server nodes as an example). The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in the server rack are interconnected through corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server rack are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server rack are connected to the centralized optical path switching device in the reconfigurable optical switching plane through optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane.
[0049] In this disclosure, the electrical switching plane and the reconfigurable optical switching plane play complementary core roles in the network architecture. The electrical switching plane is the infrastructure for control signaling, heartbeat detection, and latency-sensitive data streams. Specifically, within a rack, PCIe Gen6 switches or low-latency Ethernet switches interconnect all server nodes in the same rack. Between racks, a lightweight electrical backbone (such as 100G Ethernet) is maintained as a supplement to, and control channel for, optical path switching. The electrical switching plane acts as the "nervous system" of the system: it ensures that network connectivity is not completely interrupted during the millisecond-level "blackout period" of optical path reconfiguration, so that CXL keep-alive signals and coherence metadata can still be transmitted normally and protocol timeouts are avoided. The reconfigurable optical switching plane is the core transmission layer of the system and carries high-bandwidth, high-volume data streams. It can be implemented with a centralized, large-scale OCS device matrix (such as 1024x1024 ports). The OCS device uses an internal MEMS micromirror array and establishes a physical optical path between any input fiber and any output fiber by electrostatically driving the mirror angles. The reconfigurable optical switching plane acts as the "vascular system" of the system, providing massive data throughput. Since there are no intermediate nodes in the optical path, the transmission latency is bounded only by the propagation delay of light in the fiber (about 5 ns/m), and for large-scale communication across the data center the latency jitter is almost zero.
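The propagation figure quoted above translates into concrete path latencies as follows. The helper name `fiber_latency_us` and the example lengths are illustrative assumptions; only the 5 ns/m figure comes from the text.

```python
FIBER_DELAY_NS_PER_M = 5.0  # propagation delay quoted in the text


def fiber_latency_us(length_m: float) -> float:
    """One-way propagation delay over a fiber of the given length, in microseconds."""
    if length_m < 0:
        raise ValueError("length must be non-negative")
    return length_m * FIBER_DELAY_NS_PER_M / 1000.0


for length in (10, 100, 2000):  # example path lengths in metres (assumed)
    print(f"{length} m -> {fiber_latency_us(length)} us")
```

A 100 m path therefore adds about 0.5 µs one way, several orders of magnitude less than the millisecond-level MEMS reconfiguration time, which is why the reconfiguration delay rather than propagation dominates the design.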
[0050] Figure 2 is a block diagram of the internal module structure of a server node according to an embodiment of the present disclosure. As shown in Figure 2, the server node includes basic modules such as a processor and an AI accelerator, as well as a CXL-oF bridging device supporting the CXL-oF protocol, electrical interfaces, and optical interfaces. The CXL-oF bridging device comprises a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine. Inserted between the controller and the physical network, the CXL-oF bridging device fully emulates a standard local CXL device on the controller side. It connects directly to the processor via a PCIe physical channel. When the processor issues a CXL.mem read/write request, the CXL-oF bridging device immediately responds at the link layer as a local device would, so that the controller perceives the device as "ready." In one implementation of this disclosure, the CXL-oF bridging device is implemented using a high-performance FPGA (such as Intel Agilex or AMD Versal). The FPGA implements a CXL 3.0 IP core configured in Type 3 device mode, supporting x16 lanes at a data rate of 64 GT/s. For the optical interface, the QSFP-DD optical module is driven directly by the FPGA's high-speed SerDes. The optical signal therefore does not pass through an Ethernet MAC, but instead uses a custom lightweight physical layer protocol (e.g., a 64b/66b-encoded raw stream) to minimize latency. When the optoelectronic joint scheduler executes its scheduling logic, it uses built-in PES logic to determine congestion based on buffer depth. When the optical interface transmit queue depth exceeds a preset proportion (e.g., 80%), a backpressure signal is triggered, and subsequent large data packets are marked as "overflow" and redirected to the Ethernet interface.
[0051] Figure 3 illustrates the CXL-oF protocol stack and encapsulation format according to an embodiment of this disclosure. CXL-oF is short for "CXL-over-Fabric," and CXL is short for "Compute Express Link." The CXL-oF protocol can be understood as the CXL protocol carried over a fabric: its core purpose is to extend the CXL protocol from local device-level interconnection to network-level interconnection, decoupling and pooling hardware resources across servers and racks. "Fabric" refers to a networked interconnection architecture (such as Ethernet or InfiniBand) that, through a switching network, extends CXL devices from a single server to a shared resource pool across nodes.
[0052] To run the standard CXL protocol on a non-standard hybrid physical layer, this disclosure adds a "protocol adaptation layer" below the standard CXL protocol stack, thereby enabling the transition from a "local bus" to a "networked bus." As shown in Figure 3, the CXL-oF protocol stack includes the CXL transaction layer, the CXL link layer, and the Fabric adaptation layer. The CXL transaction layer and CXL link layer belong to the standard CXL protocol stack and are implemented by the CXL module. The Fabric adaptation layer corresponds to the "protocol adaptation layer" and is implemented by the Fabric encapsulation engine. The CXL transaction layer is the semantic core of the CXL protocol. It generates specific types of CXL transactions (such as MemRd/MemWr of CXL.mem, and request/response of CXL.cache) based on CPU instructions or device requests, and parses and executes these transactions at the peer. The CXL link layer is the reliability engine of the CXL protocol, responsible for error-free data transmission between two directly connected ports. The Fabric adaptation layer is not present in standard CXL; it receives the Flit sequence from the CXL link layer, uses it as payload, and encapsulates it into a custom Fabric transport frame. In a specific example, to ensure compatibility with existing networks, the Fabric transport frame is encapsulated using the UDP/IP protocol.
[0053] According to embodiments of this disclosure, the CXL-oF bridging device is configured to: acquire a memory access instruction generated based on an AI model training task; determine a target server node based on the request address in the memory access instruction; if the target server node is not a local server node, generate a CXL data packet corresponding to the memory access instruction through the CXL module and write the CXL data packet into an outbound traffic queue; the optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive features of the CXL data packet in the outbound traffic queue; then, the Fabric encapsulation engine encapsulates the CXL data packet based on the request address of the memory access instruction, the routing interface of the CXL data packet, and the descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface for transmission to the target server node through a corresponding designated switching plane; the routing interface includes: an electrical interface or an optical interface; the designated switching plane corresponding to the electrical interface is an electrical switching plane; and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane.
[0054] In one implementation of this disclosure, memory access instructions include Load (read) or Store (write) operation instructions generated by the GPU or CPU during AI training tasks. These instructions appear on the link in the form of CXL.mem transaction requests. The CXL-oF bridging device can monitor and intercept these instructions via hardware, extracting the read / write request addresses of CXL.mem in real time through a TLP parser included in its internal logic. Based on the extracted request addresses, it determines the device hit by the memory access instruction and then executes different processing paths depending on the hit device. Specifically, if a local device is hit, the process proceeds directly to the local controller; if a remote device is hit, a CXL transaction is triggered.
[0055] Determining the target server node based on the requested address in the memory access instruction includes:
[0056] The identifier of the target server node corresponding to the request address of the memory access instruction is obtained based on a preset global physical address mapping table; wherein, each entry in the global physical address mapping table is used to describe the correspondence between the global physical address range and the identifiers of the server nodes in the plurality of server racks, and the request address corresponds to the global physical address in the global physical address range; the target server node is determined according to the identifier of the target server node.
[0057] In this disclosure, in order to logically map the high-bandwidth memory (HBM) and CXL extended memory, which are physically distributed across different racks and different GPUs, into a contiguous global physical address space, a global physical address mapping table is maintained inside the CXL-oF bridging device. This table is issued to each server node by the management software or the network manager during system initialization. A table entry typically has the format: [address range A to B] -> ID of the node that owns the range.
[0058] Figure 4 illustrates the structure of the global physical address mapping table and the DCD directory table according to an embodiment of the present disclosure. As shown in Figure 4, the global physical address space is statically divided in advance into several large, contiguous regions, each of which is mapped to a specific server node by default. For example, 0x0000…-0x7FFF… is mapped to Node A (ID:01) by default. This means that when the CPU accesses an address within this range, the system first assumes that the data is located in the local HBM of Node A.
[0059] In practical implementation, the first step is to extract the request address from the memory access instruction. This request address corresponds to a global physical address within the global physical address range. Then, the extracted physical address is matched against the mapping table. If the address falls within the local device address range, the target node is the local node; if it falls within the address range of a memory pool such as node B, the target node is the remote node B. Furthermore, the mapping table needs to support dynamic updates to adapt to the elastic scaling of the memory pool.
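The lookup described above can be sketched as a sorted range table searched by binary search. This is only an illustration under stated assumptions: the class and function names (`MapEntry`, `GlobalAddressMap`, `resolve_target`) are hypothetical, the 16-bit address space is a toy stand-in for the real global physical address space, and ranges are assumed not to overlap.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class MapEntry(NamedTuple):
    start: int     # first global physical address of the range (inclusive)
    end: int       # last global physical address of the range (inclusive)
    node_id: int   # identifier of the server node that owns the range


class GlobalAddressMap:
    """Range table [start, end] -> node ID, kept sorted by start address."""

    def __init__(self, entries):
        self._entries = sorted(entries)
        self._starts = [e.start for e in self._entries]

    def lookup(self, addr: int) -> Optional[int]:
        """Return the owning node ID, or None if no range covers addr."""
        i = bisect_right(self._starts, addr) - 1
        if i >= 0 and addr <= self._entries[i].end:
            return self._entries[i].node_id
        return None

    def update(self, entry: MapEntry) -> None:
        """Dynamic update: replace the entry with the same start, or add it."""
        kept = [e for e in self._entries if e.start != entry.start]
        self._entries = sorted(kept + [entry])
        self._starts = [e.start for e in self._entries]


def resolve_target(table: GlobalAddressMap, addr: int, local_node_id: int):
    """Classify a request address as hitting the local node or a remote node."""
    node = table.lookup(addr)
    if node is None:
        raise ValueError(f"address {addr:#x} is not mapped")
    return ("local" if node == local_node_id else "remote", node)


# Toy 16-bit address space split between Node A (0x01) and Node B (0x02).
table = GlobalAddressMap([
    MapEntry(0x0000, 0x7FFF, 0x01),
    MapEntry(0x8000, 0xFFFF, 0x02),
])
```

Seen from Node A, `resolve_target(table, 0x1234, 0x01)` classifies the access as local, while an address in the upper half resolves to remote Node B. Calling `update` with a new entry models the elastic scaling of the memory pool.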
[0060] The descriptive features of the CXL data packet include the category and/or size of the CXL data packet. Figure 5 is a flowchart of a method by which the optoelectronic joint scheduler determines the routing interface of a CXL data packet according to an embodiment of the present disclosure. When the optoelectronic joint scheduler determines the routing interface of a CXL data packet based on the descriptive features of the CXL data packets in the outbound traffic queue, it monitors the CXL data packets in the outbound traffic queue using its own flow-feature-based classification engine and determines the routing interface based on the category and/or size of the current CXL data packet, through the following steps S510 to S580:
[0061] In step S510, it is determined whether the current CXL data packet is a first type of data packet.
[0062] If the current CXL data packet is a first type of data packet, then step S520 is executed, that is: the routing interface of the current CXL data packet is the electrical interface of the local server node.
[0063] The first type of data packet includes: CXL.io configuration packet, CXL.cache probe, or data packet with a payload less than a preset length.
[0064] If the current CXL data packet is not a first type of data packet, then proceed to step S530.
[0065] In step S530, it is determined whether the current CXL data packet is a second type of data packet and whether the corresponding target server node currently has an available active optical path connection.
[0066] The second type of data packet includes: large memory page migration data packets based on the CXL.mem protocol or data packets marked as Collective type.
[0067] If the current CXL data packet is a second type of data packet and the corresponding target server node currently has an available active optical path connection, then proceed to step S540.
[0068] In step S540, it is determined whether the optical transmission queue is in a congested state.
[0069] If not, proceed to step S550, i.e., the routing interface of the current CXL data packet is the optical interface of the local server node. If yes, proceed to step S580.
[0070] The optical transmit queue is a queue used to buffer Fabric transmission frames routed to the optical interface.
[0071] In step S560, the link status of the reconfigurable optical switching plane is monitored.
[0072] In step S570, it is determined whether the reconfigurable optical switching plane is being reconfigured. If not, step S550 is executed; if so, step S580 is executed.
[0073] In step S580, CXL data packets set to be transmitted through the reconfigurable optical switching plane or CXL data packets determined to be second type data packets are first sliced. Then, the sliced CXL data packets are encapsulated to generate corresponding Fabric transmission frames. The corresponding Fabric transmission frames are then routed to the electrical interface of the local server node to be transmitted to the target server node through the electrical switching plane.
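Steps S510 to S580 can be read as a small decision function. The sketch below is one possible reading, not the disclosed hardware logic. The names (`Kind`, `Packet`, `LinkState`, `choose_interface`) are hypothetical, the 256-byte "preset length" is an assumed value, and the 80% congestion threshold is taken from the example given for the scheduler. The text does not spell out the branch for packets that fail the S530 check, so the fallback at the end of the function is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    IO_CONFIG = "CXL.io configuration"
    CACHE_PROBE = "CXL.cache probe"
    MEM_PAGE_MIGRATION = "CXL.mem large page migration"
    COLLECTIVE = "Collective"
    OTHER = "other"


@dataclass
class Packet:
    kind: Kind
    payload_len: int   # bytes
    target_node: int


@dataclass
class LinkState:
    active_optical_paths: set     # node IDs with an active optical path
    optical_queue_depth: int      # frames waiting in the optical transmit queue
    optical_queue_capacity: int
    reconfiguring: bool           # True while the optical plane is reconfigured


SMALL_PAYLOAD = 256      # assumed "preset length" in bytes
CONGESTION_RATIO = 0.8   # example threshold (80%) from the description


def is_first_type(p: Packet) -> bool:
    return p.kind in (Kind.IO_CONFIG, Kind.CACHE_PROBE) or p.payload_len < SMALL_PAYLOAD


def is_second_type(p: Packet) -> bool:
    return p.kind in (Kind.MEM_PAGE_MIGRATION, Kind.COLLECTIVE)


def choose_interface(p: Packet, s: LinkState):
    """Return (interface, sliced) for one packet."""
    if is_first_type(p):                                    # S510 -> S520
        return ("electrical", False)
    if is_second_type(p) and p.target_node in s.active_optical_paths:  # S530
        limit = CONGESTION_RATIO * s.optical_queue_capacity
        congested = s.optical_queue_depth > limit           # S540
        if not congested and not s.reconfiguring:           # S560, S570
            return ("optical", False)                       # S550
        return ("electrical", True)                         # S580: slice first
    # Assumption: branch not specified in the text. Second-type packets
    # without a usable optical path are sliced; everything else goes whole.
    return ("electrical", is_second_type(p))
```

With an idle optical queue and an active path to the target, a Collective packet is routed to the optical interface. The same packet is sliced and sent over the electrical interface when the queue is congested or the optical plane is being reconfigured.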
[0074] As shown in Figure 3, the Fabric transmission frame includes a custom exchange frame header (F-Header), payload data, and frame check information (Fabric_Trailer). The custom exchange frame header includes, in order, a routing tag field (Routing_Tag), a routing hint field (Path_Hint), a timestamp field (Timestamp), and a traffic type field (Traffic_Class). In addition, the custom exchange frame header includes a 6-bit reserved field (Reserved) for future expansion or alignment, which is filled with zeros when the frame header is generated. The Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction, the routing interface of the CXL data packet, and the descriptive features of the CXL data packet to generate the corresponding Fabric transmission frame, as follows:
[0075] The routing tag field in the custom switching frame header is generated based on the request address of the memory access instruction and a preset global physical address mapping table. The routing tag field, Routing_Tag, is 16 bits long and indicates the physical ID and outbound port number of the target server node. The physical ID of the target server node is its identifier and is globally unique. The outbound port number is located on the local server node and its acquisition relies on a table lookup mechanism. Based on the routing strategy of the AI server cluster network pre-calculated by the network manager, a mapping table of "global physical address to target physical port" is generated and distributed and written to the on-chip high-speed storage of each server node's CXL-oF bridging device via the out-of-band management network. When a CXL data packet is generated, the optoelectronic joint scheduler extracts the destination address (which is the request address of the memory access instruction) from the CXL data packet within nanoseconds. Using this address as an index, it quickly queries the local on-chip preset routing table, and the lookup result directly returns the corresponding outbound port number.
[0076] The routing hint field in the custom exchange frame header is generated based on the routing interface of the CXL data packet and the forwarding instruction information injected by the network manager. The routing hint field, Path_Hint, is 8 bits long. When generating the routing hint field, the Fabric encapsulation engine takes as one input the path decision (i.e., the routing interface) made by the optoelectronic joint scheduler for the current CXL data packet. At the same time, the Fabric encapsulation engine queries a policy cache table injected by the network manager through the out-of-band control channel. This table contains forwarding instructions for specific target nodes or traffic types. The hardware logic in the Fabric encapsulation engine arbitrates between these two inputs: if the policy cache table contains an explicit dynamic instruction (such as "use optical path ID 5"), that instruction takes precedence; otherwise, the decision of the optoelectronic joint scheduler is followed. Finally, the arbitration result is encoded into an 8-bit value. For example, 0x01 represents forcing the electrical switching plane, 0x02 represents preferring the reconfigurable optical switching plane, and 0x03 represents a specific optical path identifier. This value is written into the routing hint field, allowing the frame deframer at the OCS receiver to distribute packets quickly without looking up a complex routing table.
[0077] The timestamp field in the custom exchange frame header is generated based on the transmission time of the Fabric transmission frame. Specifically, the timestamp field is 32 bits long, and its generation is driven by a local timer synchronized with a global precision time protocol clock source. At the precise moment when the encapsulation pipeline advances to begin assembling the frame header, a hardware trigger latches the current 32-bit timer count. This action is a hardware operation strictly aligned with the data processing pipeline, ensuring that the correspondence between the timestamp and the packet transmission time is accurate to the nanosecond level. The latched value is directly filled into the timestamp field, which will be used to calculate the network time of flight at the receiving end and provide a global sorting basis for possible out-of-order events.
[0078] The traffic type field in the custom exchange frame header is generated based on the descriptive characteristics of the CXL data packet. Specifically, the generation of the traffic type field depends on the real-time parsing of the original CXL data packet header information. The hardware logic decodes the type field of the CXL transaction layer data packet and converts it into a 3-bit service level defined by the network architecture according to the mapping rules pre-programmed in the firmware. For example, a consistency request for CXL.cache is mapped to the highest priority, a control command for CXL.io is mapped to the second highest priority, and a batch data read / write operation for CXL.mem is mapped to the standard priority. This 3-bit encoding is written into the traffic type field, which will be read in the queue scheduling of all subsequent network switching nodes to ensure that high-priority control signaling can be forwarded with low latency.
[0079] The CXL data packet is used as the payload data, which is 256 bytes long and can be a standard CXL 3.0 Flit. A 32-bit CRC checksum is used as the frame check information for detecting errors in optical transmission. The custom exchange frame header, payload data, and frame check information are concatenated in sequence into a complete bitstream to generate the Fabric transmission frame, a standardized transmission frame for the reconfigurable interconnect network, which then awaits delivery to the transmit queue of the designated physical network interface. The entire encapsulation process is completed in a deep pipeline, ensuring line-rate processing of high-speed serial data.
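The frame layout described above can be illustrated in software as follows. This is a rough sketch under several assumptions, not the pipelined hardware encapsulation. The field widths are those stated in the text (16, 8, 32, 3, and 6 bits, 65 bits in total); packing them most-significant-bit first and zero-padding the header to 9 bytes is an assumption. The numeric `TRAFFIC_CLASS` values are hypothetical, since the text fixes only their relative priority. `zlib.crc32` computed over header plus payload stands in for the unspecified 32-bit CRC, and all function names are illustrative.

```python
import zlib

# Field widths in bits, in the order given in the description.
FIELDS = (("routing_tag", 16), ("path_hint", 8), ("timestamp", 32),
          ("traffic_class", 3), ("reserved", 6))
HEADER_BITS = sum(width for _, width in FIELDS)   # 65
HEADER_BYTES = (HEADER_BITS + 7) // 8             # 9 (zero padding is assumed)
PAYLOAD_BYTES = 256

# Path_Hint codes from the description.
HINT_FORCE_ELECTRICAL, HINT_PREFER_OPTICAL, HINT_OPTICAL_PATH_ID = 0x01, 0x02, 0x03

# Hypothetical 3-bit service levels; only their ordering comes from the text.
TRAFFIC_CLASS = {"CXL.cache": 7, "CXL.io": 6, "CXL.mem": 0}


def arbitrate_path_hint(scheduler_interface, policy_hint=None):
    """An explicit policy-cache instruction wins over the scheduler decision."""
    if policy_hint is not None:
        return policy_hint
    if scheduler_interface == "optical":
        return HINT_PREFER_OPTICAL
    return HINT_FORCE_ELECTRICAL


def pack_header(**values):
    """Pack the header fields MSB-first; fields not supplied are zero."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    word <<= HEADER_BYTES * 8 - HEADER_BITS       # left-align, pad with zeros
    return word.to_bytes(HEADER_BYTES, "big")


def build_frame(routing_tag, path_hint, timestamp, protocol, payload):
    """Concatenate header, 256-byte payload, and 32-bit CRC trailer."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be one 256-byte flit")
    header = pack_header(routing_tag=routing_tag, path_hint=path_hint,
                         timestamp=timestamp & 0xFFFFFFFF,
                         traffic_class=TRAFFIC_CLASS[protocol])
    body = header + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def check_frame(frame):
    """Receiver-side check of the CRC trailer."""
    body, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == trailer


def time_of_flight(tx_timestamp, rx_timestamp):
    """Difference of two 32-bit counter values, tolerant of one wraparound."""
    return (rx_timestamp - tx_timestamp) & 0xFFFFFFFF
```

The modular subtraction in `time_of_flight` matters because a 32-bit counter wraps around. If the counter ticks in nanoseconds (an assumption), the wrap occurs roughly every 4.3 seconds.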
[0080] According to an embodiment of this disclosure, the network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance during the execution phase of the AI model training task.
[0081] Figure 6 illustrates the module architecture of a network manager according to an embodiment of the present disclosure. In this disclosure, the network manager is the core control engine of the system. It is deeply integrated with AI training frameworks (such as PyTorch or TensorFlow) through plugins or agents. It is not a traditional SDN controller, but rather a fusion of an "AI task scheduler" and a "network resource orchestrator." Its core idea is to leverage the predictability of AI training tasks to transform future communication patterns into optimal physical connections, and to achieve seamless network reconfiguration through time window switching.
[0082] As shown in Figure 6, the network manager's modular architecture is divided into three layers: the northbound interaction layer, the core intelligence layer, and the southbound interaction layer. The northbound interaction layer interacts with the AI training framework through the northbound API interface; this deep integration is the key to achieving "task awareness" and allows the network manager to acquire the core metadata of the computation graph. The core intelligence layer is implemented by the core intelligence engine, which includes a traffic prediction module and a topology calculation engine and is responsible for converting computation graph information into specific network reconfiguration commands. The southbound interaction layer controls the underlying physical devices (the reconfigurable optical switching plane and the electrical switching plane) in an out-of-band manner through the southbound API interface, ensuring the reliability and stability of control commands without affecting data plane traffic.
[0083] Figure 7 is a flowchart of the physical optical topology reconfiguration of the reconfigurable optical switching plane according to an embodiment of the present disclosure. During the execution of an AI model training task, controlling the reconfiguration of the physical optical topology of the reconfigurable optical switching plane by issuing, in advance, the OCS reconstruction instruction corresponding to each execution phase to the reconfigurable optical switching plane includes the following steps S710 to S730:
[0084] In step S710, the current AI model training task is obtained based on the AI model training framework.
[0085] In one implementation of this disclosure, when a user or scheduling system (such as Kubernetes) initiates a distributed training task, the AI model training framework initializes a Job object. This system captures events such as "task graph construction complete" or "distributed environment initialization complete" through hooks or event listeners provided by the framework. From these events, the system extracts the task's unique identifier (Job ID), the computation graph definition (GraphDef), the list of nodes participating in training (NodeList), and their roles (e.g., rank 0 is master), and encapsulates this information into an internal TrainingTask object.
[0086] In step S720, before the current AI model training task is executed, based on the computation graph information of the current AI model training task, multiple consecutive execution stages of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution stage are obtained.
[0087] Specifically, the computation graph information of the current AI model training task is obtained from the AI model training framework through the northbound API, and then the computation graph is parsed.
[0088] Specifically, step S720 is achieved through the following steps S721~S723:
[0089] In step S721, the set of communication operators corresponding to each execution stage is obtained by parsing the computation graph information of the current AI model training task; the set of communication operators includes one or more communication operators.
[0090] When analyzing a computation graph, graph analysis algorithms (such as topological sorting) can be used to traverse the graph, establish the operator execution sequence and the data / control dependencies between them, and analyze the operator sequences and dependencies in the computation graph to identify time intervals with the same or similar communication patterns. For example, a training iteration can be divided into stages such as "forward computation," "gradient synchronization," and "weight update." Within each stage, all collective communication operators (such as AllReduce and AllGather) and their involved node groups and tensor sizes are marked. This forms the basis for subsequent predictions. Finally, a stage division blueprint is output, clearly defining which consecutive stages constitute the entire task process and the set of communication operators for each stage.
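A small sketch of the stage division described above, using the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The operator table `ops` and the rule "start a new stage whenever the communication pattern changes along the execution order" are simplifying assumptions for illustration. A real computation graph would be obtained from the training framework and would need a richer similarity criterion.

```python
from graphlib import TopologicalSorter
from itertools import groupby

# Hypothetical operators:
# name -> (dependencies, collective type or None, participating nodes, tensor bytes)
ops = {
    "forward":       (set(),             None,        (),           0),
    "backward":      ({"forward"},       None,        (),           0),
    "grad_sync_0":   ({"backward"},      "AllReduce", (0, 1, 2, 3), 64 << 20),
    "grad_sync_1":   ({"grad_sync_0"},   "AllReduce", (0, 1, 2, 3), 32 << 20),
    "weight_update": ({"grad_sync_1"},   None,        (),           0),
    "param_gather":  ({"weight_update"}, "AllGather", (0, 1, 2, 3), 16 << 20),
}


def divide_into_stages(ops):
    """Order operators by dependency, then group consecutive operators that
    share the same communication pattern into one execution stage."""
    graph = {name: deps for name, (deps, *_) in ops.items()}
    order = list(TopologicalSorter(graph).static_order())
    stages = []
    for pattern, group in groupby(order, key=lambda name: ops[name][1]):
        names = list(group)
        comm_ops = [(n, ops[n][1], ops[n][2], ops[n][3])
                    for n in names if ops[n][1] is not None]
        stages.append({"operators": names, "pattern": pattern, "comm_ops": comm_ops})
    return stages


stages = divide_into_stages(ops)
```

For this toy graph the result is four consecutive stages (computation, AllReduce, computation, AllGather). Each stage carries its set of communication operators together with the node groups and tensor sizes involved.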
[0091] In step S722, for any execution stage, a pre-trained traffic prediction model is used to predict the full node pair communication traffic matrix at the end of any execution stage based on the set of communication operators for that execution stage. The matrix element (i, j) in the full node pair communication traffic matrix represents the expected amount of communication data between server node i and server node j that participated in executing the corresponding communication operator within that execution stage.
[0092] The traffic prediction model in this disclosure needs to have the ability to capture the complex mapping relationship between communication patterns and the final traffic matrix, which can be implemented based on existing graph neural networks, encoder-decoder structures (such as Seq2Seq, Transformer) and deep feedforward networks.
[0093] When using a pre-trained traffic prediction model for prediction, predictions can be made based on features such as operator type, participating nodes, and data volume.
[0094] The full node-to-node communication traffic matrix refers to a complete and quantitative description of the expected amount of communication data between all participating computing nodes in a data center during a specific AI training execution phase.
[0095] Steps S721 to S722 above are implemented by the traffic prediction module in the network manager shown in Figure 6.
[0096] In step S723, the optimal physical optical topology configuration of the reconfigurable optical switching plane during any execution phase is calculated based on the full node pair communication traffic matrix at the end of any execution phase.
[0097] Step S723 is implemented by the topology calculation engine in the network manager shown in Figure 6. The topology calculation engine runs an optimization algorithm (such as connection allocation based on integer programming) that takes the traffic matrix as input and aims to maximize effective throughput and minimize communication latency, and calculates an optimal physical optical topology configuration, that is, which input ports of the OCS should establish direct optical connections with which output ports. For example, for AllReduce the optimal topology is a ring or a torus; for All-to-All it is a specific expander graph. This ensures that node pairs with high traffic obtain direct links. The final OCS configuration scheme details the physical connection mapping required for each stage.
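To make the input and output of the topology calculation concrete, the sketch below assigns direct optical links from a traffic matrix using a greedy heuristic. This is a deliberately simplified stand-in and is not the integer-programming formulation mentioned above. It assumes bidirectional optical links, a fixed number of optical ports per node, and a hypothetical function name `greedy_optical_links`.

```python
def greedy_optical_links(traffic, ports_per_node=1):
    """Give direct optical links to the heaviest node pairs first.

    traffic[i][j] is the expected volume sent from node i to node j.
    Returns a list of (i, j) pairs with i < j.
    """
    n = len(traffic)
    pairs = sorted(
        ((traffic[i][j] + traffic[j][i], i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    free_ports = [ports_per_node] * n
    links = []
    for volume, i, j in pairs:
        # Skip pairs with no traffic or with no optical port left on either end.
        if volume > 0 and free_ports[i] > 0 and free_ports[j] > 0:
            links.append((i, j))
            free_ports[i] -= 1
            free_ports[j] -= 1
    return links


# Toy 4-node traffic matrix (arbitrary units).
traffic = [
    [0, 9, 1, 0],
    [9, 0, 0, 2],
    [1, 0, 0, 7],
    [0, 2, 7, 0],
]
```

With one optical port per node, the two heaviest pairs (0, 1) and (2, 3) receive direct links. Traffic between the remaining pairs would be carried by the electrical switching plane.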
[0098] In step S730, when the current AI model training task is executed, within the current execution phase, an OCS reconstruction instruction for the next execution phase is generated in advance according to the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution phase. Within the current execution phase or in the gap between the current execution phase and the next execution phase, the OCS reconstruction instruction for the next execution phase is sent to the reconfigurable optical switching plane, so that the centralized optical path switching device in the reconfigurable optical switching plane can reconstruct its own physical optical topology according to the OCS reconstruction instruction for the next execution phase before the next execution phase.
[0099] Specifically, within the current execution phase, the optimal OCS configuration scheme for the next execution phase is translated into a sequence of OCS reconstruction instructions executable by the device. These instructions are then sent to the OCS controller via the southbound API before the end of the current execution phase or during phase intervals. Since OCS switching takes only a few milliseconds, while an iteration step in large model training typically takes hundreds of milliseconds to seconds, this scheduling can hide the reconstruction overhead within the computation or gradient accumulation phases, achieving "zero-awareness" switching. This allows the OCS to complete the reconstruction of port cross-connections before the start of the next phase. Thus, when the AI task enters the next execution phase, the network topology is already in an optimal state, and communication traffic can be transmitted along the optimal path.
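The timing argument above reduces to simple arithmetic. The helper below is an illustrative sketch only: it computes the latest moment at which an OCS reconstruction instruction can be issued so that switching finishes before the next phase begins. The function name and the `guard_ms` safety margin are assumptions, and the sketch presumes that the optical links being rewired are not needed during the switching window.

```python
def latest_issue_time(phase_start_ms, phase_duration_ms, switch_time_ms, guard_ms=1.0):
    """Return the latest issue time (ms) that still hides the OCS switch
    inside the current phase, or None if the phase is too short."""
    next_phase_start = phase_start_ms + phase_duration_ms
    latest = next_phase_start - switch_time_ms - guard_ms
    if latest < phase_start_ms:
        return None   # reconfiguration cannot be hidden in this phase
    return latest
```

For a 200 ms phase and a 5 ms switching time, the instruction can be issued as late as 194 ms into the phase with the default 1 ms guard. A phase shorter than the switching time plus the guard returns `None`.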
[0100] The following example uses a key communication operation in a large-scale distributed AI training task to demonstrate how the network manager can achieve a high degree of coordination between the network physical topology and the computing task through precise perception, prediction, and pre-configuration, thereby significantly improving performance. This example details the system processing flow of the gradient synchronization (All-Reduce) phase for a ResNet-50 model training task performed on 1024 GPU nodes.
[0101] Phase 1: Task Awareness and Topology Pre-computation
[0102] At the start of the Nth iteration of the training task, the AI training framework (using PyTorch as an example) interacts with the network manager through a dedicated Python library hook provided in this embodiment, sending a forward-looking signal to the network manager stating: "An all-reduce operation with approximately 1GB of data is expected to be initiated in about 200 milliseconds." Upon receiving this signal, the network manager's core intelligent engine immediately initiates the topology calculation process. Based on the analysis of the all-reduce communication mode, it determines that the optimal physical network topology at this time is a ring. Subsequently, the engine calculates an efficient Hamiltonian ring path for the 1024 nodes participating in this synchronization, ensuring that each node has definite upstream and downstream neighbors in the ring. Based on this path, the network manager generates a detailed and executable "optical switching plane port connectivity mapping table."
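The port connectivity mapping table for a ring can be sketched as follows. The sketch assumes one OCS input port and one OCS output port per node, both numbered by node ID, and a fully connected OCS. Under those assumptions any ordering of the nodes yields a Hamiltonian ring, and the choice of ordering (for example by rack locality) is left open. The function names are hypothetical.

```python
def ring_port_map(nodes):
    """Map each node's OCS input port to the output port of its downstream
    neighbor, closing the ring from the last node back to the first."""
    if len(nodes) < 2:
        raise ValueError("a ring needs at least two nodes")
    if len(set(nodes)) != len(nodes):
        raise ValueError("duplicate node in ring order")
    return {nodes[i]: nodes[(i + 1) % len(nodes)] for i in range(len(nodes))}


def is_single_ring(mapping):
    """Check that following the mapping visits every node exactly once
    before returning to the starting node."""
    start = next(iter(mapping))
    seen, node = set(), start
    while node not in seen:
        seen.add(node)
        node = mapping[node]
    return node == start and len(seen) == len(mapping)


# Ring over the 1024 nodes of the example, in ascending ID order.
mapping = ring_port_map(list(range(1024)))
```

Each node's downstream neighbor is `mapping[node]`. The upstream neighbor can be found from the inverse mapping.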
[0103] Phase Two: Advance Issuance of Instructions and Physical Reconfiguration
[0104] During the window between the end of the Nth iteration computation phase (i.e., forward and backward propagation) and the actual start of gradient synchronization, the network manager converts the port mapping table generated in phase one into specific device control commands and pre-issues them to the controller of the reconfigurable optical switching plane via the southbound interface. The controller drives its MEMS micromirror array to begin rotating, performing cross-connection reconstruction of the physical optical path. This physical switching process takes approximately 5 milliseconds. During this brief reconstruction period, if some nodes that complete computation faster attempt to send gradient data prematurely, their port's path execution system will detect that the target optical link is not yet ready. It will then automatically buffer the data to be sent temporarily in its local high-bandwidth memory or coordinate by sending minimal control signaling through the backup electrical switching plane, thus avoiding data loss or invalid transmission and ensuring seamless integration of computation and communication.
[0105] Phase 3: Optical Path Readiness and High-Speed Synchronization
[0106] Once the reconfigurable optical switching plane is reconfigured and locked, the predetermined ring optical path between all nodes is physically established. Upon real-time detection of optical signal recovery, the path execution system of each node immediately triggers an action to release the gradient data temporarily stored in the buffer. Data then begins to flow at full line speed (e.g., 800 Gbps) within the established all-optical ring topology. Because data is transmitted between nodes via physically direct fiber optic links, skipping all the buffering, queuing, and processing stages of traditional electrical switches, communication latency is extremely low and highly deterministic. Ultimately, the completion time of the All-Reduce operation for the entire 1GB of data is reduced by 3 to 5 times compared to the execution time in a static multilayer electrical switching network, with minimal performance fluctuations.
[0107] Phase Four: Topology Resource Release and Flexible Reuse
[0108] After the gradient synchronization operation is completed, the network manager does not immediately dismantle the ring topology. It makes decisions based on subsequent task scheduling information obtained from the AI framework: if it predicts that another gradient synchronization of the same pattern will occur soon, it may maintain the current ring topology to eliminate the overhead of reconstructing it again; if subsequent computation phases require other communication modes (such as point-to-point communication), the network manager instructs the reconfigurable optical switching plane to dismantle the current ring connection, release optical port resources, and prepare for the reconstruction of the optimal topology in the next phase. This flexible, on-demand allocation and release of physical network resources maximizes the utilization of optical switching resources.
[0109] This embodiment fully demonstrates how the network manager can transform the communication semantics of AI applications into the optimal physical network form in real time, and hide the reconstruction latency through precise time control, ultimately achieving a leap in network performance.
[0110] The AI-driven network manager disclosed herein transforms the data center network from a passive, general-purpose data transmission pipeline into an active, dedicated high-performance computing substrate through a "perception-prediction-optimization-pre-configuration" intelligent control flow. Based on the computation-communication overlap strategy, it uses the GPU computation time in AI training (usually tens to hundreds of milliseconds) to mask the switching time of the optical switches, thus solving the inefficiency caused by frequent reconstruction of dynamic optical networks and making network topology changes imperceptible to the training task.
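The masking argument can be made concrete with a small calculation. The function below is an illustrative assumption, not part of the disclosure: it treats `lead_ms` as how long before the end of the computation phase the reconstruction command is issued, and returns the part of the switching time that is not hidden behind computation.

```python
def exposed_stall_ms(compute_ms, switch_ms, lead_ms):
    """Switching time (ms) that remains visible after overlapping with computation.

    compute_ms: duration of the computation phase
    switch_ms:  optical switching time (about 5 ms in the example above)
    lead_ms:    how early, before computation ends, the command is issued
    """
    overlap = min(lead_ms, compute_ms)  # cannot overlap more than the phase lasts
    return max(0.0, switch_ms - overlap)
```

With a 100 ms computation phase and a 5 ms switch, issuing the command at least 5 ms before computation ends leaves no exposed stall, while a 2 ms lead leaves 3 ms exposed.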
[0111] Furthermore, to ensure multi-tenant security at the physical layer, this disclosure introduces the concept of an optical virtual LAN (VLAN) in the network manager. This is an extension and innovation of the traditional network VLAN concept at the physical optical layer. It is an abstract concept implemented by a software-defined networking (SDN) controller, achieving logical isolation and policy management on the physical optical infrastructure. Its core idea is to create logically independent and mutually invisible virtual optical networks for different users, tenants, or applications on shared, reconfigurable physical optical switching equipment. In practical implementation, on the one hand, physical isolation ensures that optical paths belonging to different tenants are physically disconnected within the OCS. On the other hand, each time an optical path is established, the optical modules at the sending and receiving ends exchange encrypted optical path fingerprints to verify their identities. If the fingerprints do not match, the corresponding server node will immediately cut off traffic on that port to prevent data leakage caused by erroneous connections.
[0112] The network manager implements a multi-tenant isolation scheme for optical paths, essentially through a closed software loop covering resource marking, policy verification, dynamic authentication, and hardware execution. Its specific implementation method is as follows:
[0113] First, the network manager internally establishes and maintains a core tenant-resource mapping database. Each tenant registered in the system (e.g., Tenant_A, Tenant_B) is assigned a unique numerical identifier and a logical isolation label, namely the Optical VLAN ID. Simultaneously, all physical ports on the reconfigurable optical switching plane, as well as the server optical module ports connected to it, are logically divided and marked as belonging to a specific tenant's resource pool. For example, ports 1 to 48 are assigned to Tenant_A (VLAN 100), and ports 49 to 96 are assigned to Tenant_B (VLAN 200). Any optical path establishment request must be associated with a tenant identity.
[0114] When the network manager needs to establish an optical path for an AI training task, the process enters the policy execution phase. After the topology computing engine generates a preliminary optical path connection scheme (such as connecting port A and port B), it does not directly issue instructions. The system first queries the aforementioned mapping database and executes a key verification rule: the tenant ID of the requesting task must be exactly the same as the tenant IDs marked on port A and port B. Only when all three match completely is the optical path configuration considered valid and allowed to proceed to the next stage. If the ports belong to different tenants, the request will be immediately rejected and a security alert will be generated, thus preventing misconfigurations of cross-tenant physical connections at the software level.
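The three-way match described above can be sketched as follows. The port ranges and VLAN IDs come from the example in the preceding paragraph; the table layout, the function `authorize_path`, and the exception `CrossTenantError` are hypothetical names chosen for illustration.

```python
# Tenant-resource mapping: tenant -> Optical VLAN ID, port -> Optical VLAN ID.
TENANT_VLAN = {"Tenant_A": 100, "Tenant_B": 200}
PORT_VLAN = {}
for port in range(1, 49):
    PORT_VLAN[port] = 100   # ports 1 to 48 belong to Tenant_A
for port in range(49, 97):
    PORT_VLAN[port] = 200   # ports 49 to 96 belong to Tenant_B


class CrossTenantError(Exception):
    """Raised when a requested optical path would cross a tenant boundary."""


def authorize_path(tenant, port_a, port_b):
    """Return the VLAN tag for the path only if tenant, port A and port B all match."""
    vlan = TENANT_VLAN[tenant]
    if PORT_VLAN.get(port_a) != vlan or PORT_VLAN.get(port_b) != vlan:
        # A real manager would also raise a security alert here.
        raise CrossTenantError(f"{tenant} may not connect ports {port_a} and {port_b}")
    return vlan
```

The returned VLAN ID is the value that would accompany the reconfiguration command in the next step.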
[0115] After policy verification, the network manager generates a specific reconfiguration command for the reconfigurable optical switching plane. In addition to connection information such as the source and destination ports, this command must carry a VLAN tag authorizing the connection. The tag is sent as part of the command to the controller of the reconfigurable optical switching plane, which must be able to parse and enforce it, ensuring that its internal switching matrix establishes physical optical connections only between ports carrying the same VLAN tag, thus implementing isolation at the hardware level.
[0116] To address potential risks such as hardware failures, control channel hijacking, or configuration drift, the network manager further introduces a dynamic optical path fingerprint authentication mechanism. For each approved optical path, the manager, acting as a trusted center, dynamically generates a one-time-use, high-strength encrypted fingerprint, such as a hash value generated based on connection parameters and random numbers. Subsequently, the manager secretly distributes this fingerprint to the terminal devices at both ends of the optical path through independent and secure out-of-band management channels.
[0117] After the reconfigurable optical switching plane completes the physical connection, the network manager triggers a verification process. The terminal devices at both ends of the optical path exchange and compare their fingerprint information using a predefined security protocol, either through the newly established optical path itself or a separate in-band management channel. This is an active, end-to-end verification process. If both fingerprints match perfectly, it confirms that the current physical connection is consistent with the expected connection authorized by the network manager, and the terminal device immediately activates the port, allowing application data to begin flowing. If the fingerprints do not match, it indicates a possible connection error (e.g., the reconfigurable optical switching plane mistakenly connects a port of Tenant_A to a port of Tenant_B). The terminal device's security module will immediately take enforcement measures, physically disconnecting the port or dropping all traffic, achieving millisecond-level security circuit breaking, and reporting the security event to the network manager.
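A minimal sketch of the fingerprint mechanism is given below, assuming SHA-256 over a random nonce plus the connection parameters; the disclosure only states that the fingerprint is a hash of connection parameters and random numbers, so the hash choice, the encoding, and the function names are assumptions. A deployed scheme would more likely use a challenge-response exchange rather than sending the fingerprint itself.

```python
import hashlib
import hmac
import secrets


def make_fingerprint(src_port, dst_port, vlan_id):
    """One-time fingerprint generated by the manager for an approved optical path."""
    nonce = secrets.token_bytes(16)  # fresh randomness makes each fingerprint single-use
    params = f"{src_port}:{dst_port}:{vlan_id}".encode()
    return hashlib.sha256(nonce + params).digest()


def fingerprints_match(local_fp, peer_fp):
    """Constant-time comparison performed by each endpoint after the path comes up."""
    return hmac.compare_digest(local_fp, peer_fp)


def port_action(local_fp, peer_fp):
    """Activate the port on a match; otherwise cut traffic and report the event."""
    return "activate" if fingerprints_match(local_fp, peer_fp) else "block_and_report"
```

Both endpoints receive the same fingerprint over the out-of-band channel, so a mismatch after the exchange indicates that the physical connection is not the one the manager authorized.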
[0118] This disclosure implements an optical path multi-tenant isolation scheme on the network manager. Firstly, it ensures that data flows from different tenants are transmitted through completely independent physical paths, eliminating the risk of information leakage and side-channel attacks, and meeting the highest security requirements. Secondly, by combining software-defined policies with dynamic fingerprint verification, the scheme achieves flexible and efficient resource scheduling while ensuring security. Each tenant receives a dedicated, predictable network slice, guaranteeing deterministic communication latency and bandwidth for its critical AI tasks, freeing them from interference from sudden traffic spikes from other tenants. Finally, this mechanism transforms the traditional passive, detection-based security model into a proactive, verification-based defense model, enabling millisecond-level circuit breaking at the initial establishment of erroneous connections. This significantly improves the overall system reliability and operational automation level, laying a solid foundation for the secure and intensive operation of large-scale computing resources.
[0119] Furthermore, this disclosure also provides a fine-grained control scheme based on task-aware dynamic consistency domains (DCDs). The core idea of this scheme is to replace the enormous overhead of maintaining strong consistency across the entire network with an efficient model of "strong consistency within the domain + weak consistency between domains + directed invalidation" through an application-aware grouping strategy. The scheme is implemented through cooperation between the network manager and the server nodes. A dynamic consistency domain is not physically fixed; it is a logical set of nodes defined dynamically according to the attributes of the running application tasks. A low-latency strong consistency protocol is used within a domain, while a high-throughput loose consistency protocol is used between domains, achieving a trade-off between performance and correctness.
[0120] Specifically, according to embodiments of this disclosure, the network manager is further configured to:
[0121] First, the parallel strategy for the AI model training task is obtained. This parallel strategy refers to the method of splitting a large model and its training data across multiple computing devices according to different dimensions; different strategies result in different computational communication modes. The network manager can obtain a complete description of the parallel strategy for the current AI model training task from the AI model training framework via the northbound API. This includes: parallel dimensions, such as data parallelism, tensor parallelism, pipelined parallelism, expert parallelism, etc., and resource mapping, i.e., which physical server nodes each parallel group is specifically assigned to.
[0122] Then, based on the parallel strategy of the AI model training task, each server node in the multiple server racks is divided into multiple dynamic consistency domains (DCDs), wherein server nodes belonging to the same DCD are used to collaboratively process specific data or model partitions of the AI model training task.
[0123] Specifically, the network manager analyzes the parallel strategy and identifies communication-intensive logical units that require strong memory consistency. For example, in expert parallelism, all nodes processing the same expert can form a DCD; in pipeline parallelism, nodes within the same pipeline stage can form a DCD; in data parallelism, all nodes can belong to a large DCD, but if the scale is extremely large, it can be further divided into sub-DCDs based on topological proximity to reduce protocol traffic.
[0124] Next, the network manager assigns a unique ID to each DCD and generates a DCD mapping table to record the DCD ID to which each server node belongs. Finally, the DCD mapping table is sent to the DCD manager in the CXL-oF bridging device of each server node through the southbound API interface, so that it can determine which nodes belong to the same DCD based on the DCD mapping table.
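The construction of the DCD mapping table can be sketched as below. The input format (a dictionary from parallel-group name to node names) and the function names are assumptions made for illustration; the disclosure does not specify how the parallel strategy is encoded.

```python
def build_dcd_table(groups):
    """Map each node to a DCD ID, one DCD per communication-intensive group.

    groups: dict mapping a group name (e.g. an expert or a pipeline stage)
            to the list of node names assigned to it.
    """
    table = {}
    for dcd_id, name in enumerate(sorted(groups), start=1):
        for node in groups[name]:
            if node in table:
                # In this sketch each node belongs to exactly one DCD.
                raise ValueError(f"{node} assigned to more than one DCD")
            table[node] = dcd_id
    return table


def same_dcd(table, node_a, node_b):
    """Check performed by the DCD manager to pick the intra- or inter-domain protocol."""
    return table[node_a] == table[node_b]
```

For example, two expert-parallel groups of two nodes each yield two DCDs, and `same_dcd` tells a bridging device whether a remote access stays inside the strongly consistent domain.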
[0125] The CXL-oF bridging device also includes a DCD manager, and the CXL-oF bridging device is further configured to:
[0126] For server nodes within the same DCD, strong cache consistency of memory accesses is maintained through a hardware snooping protocol. Specifically, memory access requests that originate from the local CPU but target other nodes within the same DCD are treated by the CXL-oF bridging device as local extended memory accesses. In this case, a variant of a bus-based snooping protocol (such as MESI) is used. The CXL-oF bridging device, acting as the consistency point within the DCD, snoops or broadcasts the relevant requests (such as read/write invalidation requests) and collects responses from other nodes within the domain (such as cache line status), ensuring that all nodes see a strictly consistent memory view. This is typically implemented as a hardware state machine with extremely low latency.
[0127] Among multiple DCDs, the DCD manager maintains weak cache consistency based on the DCD directory table. As shown in Figure 4, each directory entry in the DCD directory table records the correspondence between a memory page address, the server node to which the memory page belongs, and the memory page status. The memory page status is one of: shared, exclusive, or invalid. Here, a memory page refers to the smallest granular unit of physical storage resource management (e.g., a 4 KB physical storage block). Shared means that the memory page is cached in read-only mode by nodes in multiple DCDs; exclusive means that the memory page is cached in exclusive (writable) mode by a single node in one DCD; invalid means that the memory page is not validly cached in any remote DCD. When a cross-DCD memory read/write request occurs, the bridging device first queries the local DCD directory table (or queries the host node) and determines the subsequent operation (such as forwarding the request or sending an invalidation message) based on the status, rather than broadcasting.
[0128] According to embodiments of this disclosure, the CXL-oF bridging device is further configured as follows:
[0129] When the memory access instruction is determined to be a cross-DCD memory write operation, a cross-domain consistency maintenance operation is triggered for the target memory page, that is, the memory page corresponding to the memory address accessed by the memory access instruction. The cross-domain consistency maintenance operation includes: based on the DCD directory table, sending point-to-point reverse invalidation messages only to the server nodes recorded in the DCD directory table as holding the target memory page in the shared state, instead of broadcasting across the entire network; and completing the write operation after receiving invalidation confirmation from the corresponding nodes. Reverse invalidation is an important feature introduced in CXL 3.0. In traditional protocols, the initiator of a write operation is responsible for invalidating the other copies. With reverse invalidation, the data's host or current owner can proactively send invalidation requests to nodes holding older copies, which optimizes write latency and reduces the burden on the requesting party.
[0130] In this disclosure, when the memory access instruction is a cross-DCD memory write operation instruction, point-to-point reverse invalidation messages are sent only to the sharing nodes recorded in the directory instead of being broadcast to the entire network. This avoids the overhead of traditionally broadcasting invalidation messages to all nodes in the network and reduces the number of messages from O(N) to O(k), where k is the number of nodes that actually share the memory page and is usually much smaller than the total number of nodes N.
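The directed invalidation can be sketched with a small directory model. The three page states follow the description above; the `holders` set (which nodes currently cache the page) and the names `DirEntry` and `cross_dcd_write` are assumptions added so the sketch can compute the k invalidation targets. Message sending and acknowledgement are not modelled.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageState(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"
    INVALID = "invalid"


@dataclass
class DirEntry:
    home: str                                   # node that owns the memory page
    state: PageState = PageState.INVALID
    holders: set = field(default_factory=set)   # assumed: nodes caching the page


def cross_dcd_write(directory, page_addr, writer):
    """Return the nodes that must be invalidated, then record the writer as owner.

    Only nodes listed in the directory entry are targeted, so the message count
    is k (current holders) rather than N (all nodes in the network).
    """
    entry = directory[page_addr]
    if entry.state is PageState.INVALID:
        targets = []
    else:
        targets = sorted(entry.holders - {writer})
    # A real bridge would wait for every acknowledgement before this update.
    entry.state = PageState.EXCLUSIVE
    entry.holders = {writer}
    return targets
```

With three sharers out of an arbitrarily large cluster, a write from one of them produces exactly two invalidation messages.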
[0131] According to embodiments of this disclosure, the server node is configured as follows:
[0132] In response to the detection of a cross-DCD write conflict event, a cache invalidation request is sent to the server node currently holding the target memory page in an exclusive state, as recorded in the DCD directory table, so that it can release the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining eventual consistency of the data.
[0133] Figure 8 shows a flowchart of a memory access method in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol according to an embodiment of the present disclosure. The AI server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected via corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane via optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine. The memory access method is applied to the server node. As shown in Figure 8, the method includes the following steps S810 to S820:
[0134] In step S810, a memory access instruction generated based on the AI model training task is obtained, and the target server node is determined according to the request address in the memory access instruction. If the target server node is not a local server node, a CXL data packet corresponding to the memory access instruction is generated through the CXL module, and the CXL data packet is written into the outbound traffic queue.
[0135] In step S820, the optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the description characteristics of the CXL data packet in the outbound traffic queue. Then, the Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction, the routing interface of the CXL data packet, and the description characteristics of the CXL data packet, generates a corresponding Fabric transmission frame, and routes the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes an electrical interface or an optical interface. The designated switching plane corresponding to the electrical interface is an electrical switching plane, and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane.
[0136] The network manager is used to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: based on the execution phase of the AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing, in advance, the OCS reconstruction instruction corresponding to that execution phase to the reconfigurable optical switching plane.
[0137] Determining the target server node based on the requested address in the memory access instruction includes:
[0138] The identifier of the target server node corresponding to the request address of the memory access instruction is obtained based on a preset global physical address mapping table; wherein, each entry in the global physical address mapping table is used to describe the correspondence between the global physical address range and the identifiers of the server nodes in the plurality of server racks, and the request address corresponds to the global physical address in the global physical address range.
[0139] The target server node is determined based on its identifier.
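The lookup in the global physical address mapping table can be sketched as a search over sorted, non-overlapping address ranges. The ranges, node identifiers, and function name below are hypothetical values chosen for illustration.

```python
import bisect

# Each entry: (range start, range end exclusive, server node identifier).
# Entries are sorted by start address and do not overlap.
GPA_MAP = [
    (0x0000_0000, 0x4000_0000, "rack0-node0"),
    (0x4000_0000, 0x8000_0000, "rack0-node1"),
    (0x8000_0000, 0xC000_0000, "rack1-node0"),
]
_STARTS = [start for start, _, _ in GPA_MAP]


def target_node(request_addr):
    """Return the identifier of the node whose range contains request_addr."""
    i = bisect.bisect_right(_STARTS, request_addr) - 1
    if i >= 0:
        start, end, node = GPA_MAP[i]
        if request_addr < end:
            return node
    raise LookupError(f"address {request_addr:#x} is not mapped to any node")
```

Comparing the returned identifier with the local node's identifier then decides whether a CXL data packet has to be generated and queued for outbound transmission.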
[0140] According to embodiments of this disclosure, the descriptive features of the CXL data packet include: the category and / or size of the CXL data packet; the optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive features of the CXL data packets in the outbound traffic queue, including:
[0141] The optoelectronic joint scheduler monitors CXL data packets in the outbound traffic queue using its own flow-feature-based classification engine and determines the routing interface of each CXL data packet based on its category and/or size, including: if the current CXL data packet is a first-type data packet, the routing interface is the electrical interface of the local server node, where first-type data packets include CXL.io configuration packets, CXL.cache probes, or packets with payloads shorter than a preset length; if the current CXL data packet is a second-type data packet and the corresponding target server node currently has an available active optical path connection, the scheduler determines whether the optical transmission queue is congested and, if it is not, the routing interface is the optical interface of the local server node, where second-type data packets include large memory page migration packets based on the CXL.mem protocol or packets marked as Collective.
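The classification rule can be sketched as follows. The packet categories mirror the two types named above; the 256-byte threshold, the field names, and the choice to check the first-type rule before the second-type rule and to fall back to the electrical interface in all remaining cases are assumptions of this sketch.

```python
from dataclasses import dataclass

SMALL_PAYLOAD_BYTES = 256  # assumed value for the "preset length"


@dataclass
class CxlPacket:
    kind: str                 # "io_config", "cache_probe", "mem_page_migration", "other"
    payload_len: int
    collective: bool = False  # True if the packet is marked as Collective


def choose_interface(pkt, optical_path_active, optical_queue_congested):
    """Return "electrical" or "optical" for one packet in the outbound queue."""
    first_type = (
        pkt.kind in {"io_config", "cache_probe"}
        or pkt.payload_len < SMALL_PAYLOAD_BYTES
    )
    if first_type:
        return "electrical"  # small, latency-sensitive control traffic
    second_type = pkt.kind == "mem_page_migration" or pkt.collective
    if second_type and optical_path_active and not optical_queue_congested:
        return "optical"     # bulk traffic over an available optical circuit
    return "electrical"      # assumed fallback when no usable optical path exists
```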
[0142] According to embodiments of this disclosure, the memory access method further includes:
[0143] The link status of the reconfigurable optical switching plane is monitored by the CXL-oF bridging device. When the reconfigurable optical switching plane is detected to be reconfigured or the optical transmission queue is congested, the CXL data packets set to be transmitted through the reconfigurable optical switching plane or the CXL data packets identified as second-type data packets are first sliced. Then, the sliced CXL data packets are encapsulated to generate corresponding Fabric transmission frames. The corresponding Fabric transmission frames are then routed to the electrical interface of the local server node to be transmitted to the target server node through the electrical switching plane.
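The slicing step of this fallback can be sketched as below. The slice format (index, total count, data) and the function names are assumptions; the disclosure does not specify how slices are labelled or reassembled.

```python
def slice_packet(payload, max_slice):
    """Split a large CXL payload into (index, total, data) slices for the electrical plane."""
    if max_slice <= 0:
        raise ValueError("max_slice must be positive")
    total = max(1, -(-len(payload) // max_slice))  # ceiling division, at least one slice
    return [
        (i, total, payload[i * max_slice:(i + 1) * max_slice])
        for i in range(total)
    ]


def reassemble(slices):
    """Rebuild the payload at the target node, tolerating out-of-order arrival."""
    return b"".join(data for _, _, data in sorted(slices))
```

Each slice would then be encapsulated into its own Fabric transmission frame and sent through the electrical interface.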
[0144] According to embodiments of this disclosure, the Fabric transmission frame includes: a custom exchange frame header, payload data, and frame verification information. The custom exchange frame header sequentially includes: a route label field, a route hint field, a timestamp field, and a traffic type field. The Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction and the routing interface and descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, including:
[0145] The routing label field in the custom switching frame header is generated based on the request address of the memory access instruction and the preset global physical address mapping table;
[0146] The routing hint field in the custom switching frame header is generated based on the routing interface of the CXL data packet and the forwarding indication information injected by the network manager;
[0147] The timestamp field in the custom exchange frame header is generated based on the transmission time of the Fabric transmission frame;
[0148] The traffic type field in the custom exchange frame header is generated based on the descriptive features of the CXL data packet;
[0149] The CXL data packet is used as the payload data, and the CRC checksum is used as the frame check information. The custom exchange frame header, the payload data, and the frame check information are concatenated in sequence to generate the Fabric transmission frame.
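The frame layout can be sketched with Python's `struct` and `zlib` modules. The field order follows the description above, but the field widths (32-bit route label, 8-bit route hint, 64-bit timestamp, 8-bit traffic type), the big-endian encoding, and the use of CRC-32 over header plus payload are assumptions made for illustration.

```python
import struct
import zlib

# Assumed header layout: route label, route hint, timestamp (ns), traffic type.
HEADER = struct.Struct(">IBQB")


def encapsulate(route_label, route_hint, timestamp_ns, traffic_type, cxl_packet):
    """Concatenate header, payload and CRC into one Fabric transmission frame."""
    header = HEADER.pack(route_label, route_hint, timestamp_ns, traffic_type)
    body = header + cxl_packet
    return body + struct.pack(">I", zlib.crc32(body))


def decapsulate(frame):
    """Verify the CRC and return ((label, hint, timestamp, type), payload)."""
    body, crc_bytes = frame[:-4], frame[-4:]
    (crc,) = struct.unpack(">I", crc_bytes)
    if zlib.crc32(body) != crc:
        raise ValueError("frame check failed")
    return HEADER.unpack(body[:HEADER.size]), body[HEADER.size:]
```

In this layout the header occupies 14 bytes, so a frame is 18 bytes longer than the CXL data packet it carries.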
[0150] According to embodiments of this disclosure, controlling, based on the execution phase of the AI model training task, the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing in advance the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane includes:
[0151] The current AI model training task is obtained based on the AI model training framework; before the current AI model training task is executed, multiple consecutive execution stages of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution stage are obtained based on the computation graph information of the current AI model training task.
[0152] When the current AI model training task is executed, within the current execution phase, an OCS reconstruction instruction for the next execution phase is generated in advance based on the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution phase. Within the current execution phase or in the gap between the current execution phase and the next execution phase, the OCS reconstruction instruction for the next execution phase is sent to the reconfigurable optical switching plane, so that the centralized optical path switching device in the reconfigurable optical switching plane can reconstruct its own physical optical topology based on the OCS reconstruction instruction for the next execution phase before the next execution phase.
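The look-ahead issuing of reconstruction instructions can be sketched as a plan computed over the sequence of execution phases. The phase names, topology labels, and the decision to skip a command when two consecutive phases use the same topology are assumptions of this sketch.

```python
def schedule_ocs_commands(phases):
    """Plan when each OCS reconstruction instruction is issued.

    phases: ordered list of (phase_name, topology) pairs.
    Returns a list of (issue_during, applies_to, topology) tuples: the instruction
    for a phase is issued while the preceding phase is still running.
    """
    plan = []
    for current, upcoming in zip(phases, phases[1:]):
        if upcoming[1] != current[1]:
            plan.append((current[0], upcoming[0], upcoming[1]))
        # Same topology in both phases: keep the circuits, issue nothing.
    return plan
```

For a sequence that alternates computation and ring synchronization, the plan contains one entry per topology change, each tied to the phase in which it should be sent.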
[0153] According to embodiments of this disclosure, obtaining multiple consecutive execution stages of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution stage based on the computation graph information of the current AI model training task includes:
[0154] The set of communication operators corresponding to each execution stage is obtained by parsing the computation graph information of the current AI model training task; the set of communication operators includes one or more communication operators.
[0155] For each execution phase, a pre-trained traffic prediction model is used to predict the full node-pair communication traffic matrix at the end of that execution phase based on the set of communication operators for that phase. The matrix element (i, j) in the full node-pair communication traffic matrix represents the expected amount of communication data sent from server node i to server node j, among the server nodes participating in the corresponding communication operators within that execution phase.
[0156] The optimal physical optical topology configuration of the reconfigurable optical switching plane for each execution phase is calculated based on the full node-pair communication traffic matrix at the end of that execution phase.
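A minimal sketch of the step from traffic matrix to optical topology follows. It does not reproduce the pre-trained prediction model or the optimization used in this disclosure: the matrix is filled analytically for a ring All-Reduce (each node sends 2(n-1)/n of the data to its ring successor), and circuits are chosen by a simple greedy heuristic under a per-node optical port limit. Both functions and their names are illustrative assumptions.

```python
def ring_allreduce_matrix(n, size_bytes):
    """Traffic matrix for a ring All-Reduce: node i sends only to node (i + 1) % n."""
    per_link = 2 * (n - 1) * size_bytes / n
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][(i + 1) % n] = per_link
    return matrix


def greedy_circuits(matrix, ports_per_node):
    """Pick optical circuits for the heaviest node pairs, respecting port limits."""
    n = len(matrix)
    demand = sorted(
        ((matrix[i][j] + matrix[j][i], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    free = [ports_per_node] * n
    circuits = []
    for traffic, i, j in demand:
        if traffic > 0 and free[i] > 0 and free[j] > 0:
            circuits.append((i, j))
            free[i] -= 1
            free[j] -= 1
    return sorted(circuits)
```

For four nodes with two optical ports each, the heuristic recovers the ring itself, which matches the topology used in the All-Reduce embodiment above.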
[0157] According to embodiments of this disclosure, the memory access method further includes:
[0158] The parallel strategy for the AI model training task is obtained through the network manager. Based on the parallel strategy for the AI model training task, each server node in the multiple server racks is divided into multiple dynamic consistency domains (DCDs). Server nodes belonging to the same DCD are used to collaboratively process specific data or model partitions of the AI model training task.
[0159] According to embodiments of this disclosure, the CXL-oF bridging device further includes a DCD manager, and the memory access method further includes:
[0160] The DCD manager performs the following operations: for server nodes within the same DCD, strong cache consistency of memory access within the same DCD is maintained through a hardware snooping protocol; among multiple DCDs, weak cache consistency is maintained based on the DCD directory table; the directory entries in the DCD directory table include the correspondence between memory page addresses and the server node to which the memory page belongs and the memory page status, and the memory page status includes: shared, exclusive, and invalid.
[0161] According to embodiments of this disclosure, the memory access method further includes:
[0162] When the memory access instruction is determined to be a cross-DCD memory write operation instruction, a cross-domain consistency maintenance operation is triggered for the target memory page. The target memory page refers to the memory page corresponding to the memory address to be accessed by the memory access instruction. The cross-domain consistency maintenance operation includes: based on the DCD directory table, sending point-to-point reverse invalidation messages only to server nodes recorded in the DCD directory table where the target memory page is in a shared state, instead of broadcasting to the entire network, and completing the write operation after obtaining invalidation confirmation from the corresponding node.
[0163] According to embodiments of this disclosure, the memory access method further includes:
[0164] In response to the detection of a cross-DCD write conflict event, a cache invalidation request is sent to the server node currently holding the target memory page in an exclusive state, as recorded in the DCD directory table, so that it can release the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining eventual consistency of the data.
[0165] According to the technical solution provided in this disclosure, based on deep hardware and software collaboration, through innovative hybrid architecture and protocol extension, the CXL memory semantics applicable to short distances and the optoelectronic interconnection physical characteristics of large-scale clusters are deeply adapted and intelligently collaborated. This fundamentally solves the communication bottleneck problem in ultra-large-scale AI training, and achieves a perfect integration of the extreme scalability of the physical layer and the unified memory semantics of the logical layer. Thus, low-latency, high-bandwidth, and high-energy-efficiency data interoperability is achieved in ultra-large-scale artificial intelligence server cluster networks.
[0166] Figure 9 shows a structural block diagram of a memory access device in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and the CXL-oF protocol, according to an embodiment of the present disclosure. The artificial intelligence server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected via corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane via optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine.
The memory access device 900 is located on the server node and includes a CXL data packet generation module and a CXL data packet processing module. The CXL data packet generation module is configured to: acquire a memory access instruction generated based on an AI model training task, and determine the target server node based on the request address in the memory access instruction; if the target server node is not the local server node, generate, through the CXL module, a CXL data packet corresponding to the memory access instruction and write the CXL data packet into the outbound traffic queue. The CXL data packet processing module is configured to: determine the routing interface of the CXL data packet based on the descriptive features of the CXL data packet in the outbound traffic queue; then, through the Fabric encapsulation engine, encapsulate the CXL data packet according to the request address of the memory access instruction and the routing interface and descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, and route the Fabric transmission frame to the routing interface for transmission to the target server node via the corresponding designated switching plane. The routing interface includes an electrical interface or an optical interface. The designated switching plane corresponding to the electrical interface is the electrical switching plane, and the designated switching plane corresponding to the optical interface is the reconfigurable optical switching plane. The network manager is used to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: based on the execution phase of the AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing, in advance, the OCS reconstruction instruction corresponding to that execution phase to the reconfigurable optical switching plane.
[0167] This disclosure also provides an electronic device. Figure 10 shows a structural block diagram of an electronic device according to an embodiment of the present disclosure. As shown in Figure 10, the electronic device includes a memory and a processor, wherein the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the memory access method described in any of the above method embodiments.
[0168] This disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the electronic device or computer system described in the above embodiments; or it may be a standalone computer-readable storage medium not assembled into a device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the memory access methods described in this disclosure.
[0169] This disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the memory access method described in any one of the claims of this disclosure.
[0170] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above-described features with (but not limited to) technical features having similar functions disclosed in this disclosure.
Claims
1. An artificial intelligence server cluster network based on hybrid optoelectronic interconnection and CXL-oF protocol, characterized in that, The AI server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected via corresponding electrical switches based on their own electrical interfaces. The electrical switches in each server rack are connected to the electrical backbone network in the electrical switching plane. The server nodes in each server rack are connected to the centralized optical path switching device in the reconfigurable optical switching plane via optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine.
The CXL-oF bridging device is configured to: acquire a memory access instruction generated based on an AI model training task; determine a target server node based on the request address in the memory access instruction; if the target server node is not a local server node, generate a CXL data packet corresponding to the memory access instruction through the CXL module and write the CXL data packet into the outbound traffic queue; determine the routing interface of the CXL data packet based on the descriptive features of the CXL data packet in the outbound traffic queue through the optoelectronic joint scheduler; then encapsulate the CXL data packet based on the request address of the memory access instruction, the routing interface of the CXL data packet, and the descriptive features of the CXL data packet through the Fabric encapsulation engine to generate a corresponding Fabric transmission frame; and route the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes an electrical interface or an optical interface, the designated switching plane corresponding to the electrical interface is an electrical switching plane, and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane. The network manager is configured to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: during the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance.
2. The artificial intelligence server cluster network according to claim 1, characterized in that, Determining the target server node based on the requested address in the memory access instruction includes: The identifier of the target server node corresponding to the request address of the memory access instruction is obtained based on a preset global physical address mapping table; wherein, each entry in the global physical address mapping table is used to describe the correspondence between the global physical address range and the identifiers of the server nodes in the plurality of server racks, and the request address corresponds to the global physical address in the global physical address range. The target server node is determined based on its identifier.
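The address-to-node lookup recited in claim 2 can be illustrated with a minimal Python sketch. It assumes each table entry is an inclusive address range that does not overlap any other entry; the names `MapEntry`, `GlobalAddressMap`, and `is_remote` are hypothetical and do not come from the disclosure, which does not specify the table's data structure.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class MapEntry(NamedTuple):
    """One hypothetical entry of the global physical address mapping table."""
    base: int      # first global physical address of the range (inclusive)
    limit: int     # last global physical address of the range (inclusive)
    node_id: int   # identifier of the server node that owns the range


class GlobalAddressMap:
    """Sorted, non-overlapping ranges searched with binary search (an assumption)."""

    def __init__(self, entries: List[MapEntry]) -> None:
        self._entries = sorted(entries, key=lambda e: e.base)
        self._bases = [e.base for e in self._entries]

    def lookup(self, addr: int) -> Optional[int]:
        """Return the owning node identifier, or None if the address is unmapped."""
        i = bisect_right(self._bases, addr) - 1
        if i >= 0 and addr <= self._entries[i].limit:
            return self._entries[i].node_id
        return None


def is_remote(addr_map: GlobalAddressMap, local_node_id: int, addr: int) -> bool:
    """True when the request address belongs to a node other than the local one."""
    target = addr_map.lookup(addr)
    if target is None:
        raise ValueError(f"address {addr:#x} is not in the global address map")
    return target != local_node_id
```

With two ranges owned by nodes 0 and 1, a request for an address in the second range issued on node 0 is classified as remote, which is the case in which a CXL data packet would be generated.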
3. The artificial intelligence server cluster network according to claim 1, characterized in that, The descriptive characteristics of the CXL data packet include: the type and / or size of the CXL data packet. The optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive characteristics of the CXL data packets in the outbound traffic queue, including: The optoelectronic joint scheduler monitors the CXL data packets in the outbound traffic queue using its own flow-feature-based classification engine, and determines the routing interface of each CXL data packet based on its type and / or size, including: if the current CXL data packet is a first-type data packet, the routing interface is the electrical interface of the local server node, wherein the first-type data packet includes: CXL.io configuration packets, CXL.cache probes, or packets with payloads less than a preset length; if the current CXL data packet is a second-type data packet and an active optical path connection to the corresponding target server node is currently available, the scheduler determines whether the optical transmission queue is congested and, if not, the routing interface is the optical interface of the local server node, wherein the second-type data packet includes: large memory page migration packets based on the CXL.mem protocol or packets marked as Collective.
4. The artificial intelligence server cluster network according to claim 3, characterized in that, The CXL-oF bridging device is also configured to: The link status of the reconfigurable optical switching plane is monitored. When the reconfigurable optical switching plane is detected to be reconfigured or the optical transmission queue is in a congested state, the CXL data packets set to be transmitted through the reconfigurable optical switching plane or the CXL data packets determined to be second type data packets are first sliced. Then, the sliced CXL data packets are encapsulated to generate corresponding Fabric transmission frames. The corresponding Fabric transmission frames are then routed to the electrical interface of the local server node to be transmitted to the target server node through the electrical switching plane.
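The routing decision of claims 3 and 4 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the claimed scheduler: the thresholds `SMALL_PAYLOAD_BYTES` and `SLICE_BYTES`, the `CxlPacket` fields, and the function names are all hypothetical. The claims also leave some cases open (a packet that matches both types, a second-type packet with no active optical path, a packet of neither type); here the first-type test takes precedence and the open cases default to the electrical interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Interface(Enum):
    ELECTRICAL = "electrical"
    OPTICAL = "optical"


SMALL_PAYLOAD_BYTES = 256   # hypothetical "preset length" for first-type packets
SLICE_BYTES = 4096          # hypothetical slice size for the electrical fallback


@dataclass
class CxlPacket:
    protocol: str             # "CXL.io", "CXL.cache" or "CXL.mem"
    kind: str                 # e.g. "config", "probe", "page_migration"
    payload: bytes
    collective: bool = False  # packet marked as Collective


def is_first_type(pkt: CxlPacket) -> bool:
    return ((pkt.protocol == "CXL.io" and pkt.kind == "config")
            or (pkt.protocol == "CXL.cache" and pkt.kind == "probe")
            or len(pkt.payload) < SMALL_PAYLOAD_BYTES)


def is_second_type(pkt: CxlPacket) -> bool:
    return (pkt.protocol == "CXL.mem" and pkt.kind == "page_migration") or pkt.collective


def select_interface(pkt: CxlPacket, optical_path_up: bool, optical_queue_congested: bool,
                     reconfiguring: bool = False) -> Tuple[Interface, List[bytes]]:
    """Return the chosen interface and the payload pieces to encapsulate."""
    if is_first_type(pkt):
        return Interface.ELECTRICAL, [pkt.payload]
    if is_second_type(pkt):
        if reconfiguring or optical_queue_congested:
            # Claim 4 fallback: slice the packet and use the electrical plane.
            slices = [pkt.payload[i:i + SLICE_BYTES]
                      for i in range(0, len(pkt.payload), SLICE_BYTES)]
            return Interface.ELECTRICAL, slices
        if optical_path_up:
            return Interface.OPTICAL, [pkt.payload]
    # Cases the claims leave open default to the electrical interface (assumption).
    return Interface.ELECTRICAL, [pkt.payload]
```

In this sketch a 10,000-byte page migration goes out whole on the optical interface when a path is up and the queue is clear, and is cut into three slices for the electrical interface when the optical queue is congested.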
5. The artificial intelligence server cluster network according to claim 1, characterized in that, The Fabric transmission frame includes: a custom exchange frame header, payload data, and frame verification information. The custom exchange frame header includes, in sequence: a route label field, a route hint field, a timestamp field, and a traffic type field. The Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction and the routing interface and descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, including: The routing label field in the custom switching frame header is generated based on the request address of the memory access instruction and the preset global physical address mapping table; The routing hint field in the custom switching frame header is generated based on the routing interface of the CXL data packet and the forwarding indication information injected by the network manager; The timestamp field in the custom exchange frame header is generated based on the transmission time of the Fabric transmission frame; The traffic type field in the custom exchange frame header is generated based on the descriptive features of the CXL data packet; The CXL data packet is used as the payload data, and the CRC checksum is used as the frame check information. The custom exchange frame header, the payload data, and the frame check information are concatenated in sequence to generate the Fabric transmission frame.
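The frame layout of claim 5 (header fields in the stated order, then payload, then a CRC) can be sketched in Python. The field widths, the network byte order, the use of CRC-32, and the choice to compute the CRC over header plus payload are all assumptions made for illustration; the claim fixes only the field order and the use of a CRC checksum. `build_frame` and `parse_frame` are hypothetical names.

```python
import struct
import time
import zlib
from typing import Optional, Tuple

# Assumed widths: route label u32, route hint u8, timestamp u64 (ns), traffic type u8.
HEADER = struct.Struct("!IBQB")   # network byte order, no padding: 14 bytes
CRC = struct.Struct("!I")         # CRC-32 trailer (assumed CRC variant)


def build_frame(route_label: int, route_hint: int, traffic_type: int,
                cxl_packet: bytes, timestamp_ns: Optional[int] = None) -> bytes:
    """Concatenate header, payload and CRC in sequence."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()   # transmission time of the frame
    header = HEADER.pack(route_label, route_hint, timestamp_ns, traffic_type)
    body = header + cxl_packet
    return body + CRC.pack(zlib.crc32(body))


def parse_frame(frame: bytes) -> Tuple[int, int, int, int, bytes]:
    """Verify the CRC and return (label, hint, timestamp_ns, traffic_type, payload)."""
    if len(frame) < HEADER.size + CRC.size:
        raise ValueError("frame too short")
    body = frame[:-CRC.size]
    (crc,) = CRC.unpack(frame[-CRC.size:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    route_label, route_hint, ts, traffic_type = HEADER.unpack_from(body)
    return route_label, route_hint, ts, traffic_type, body[HEADER.size:]
```

In a fuller model the route label would come from the global physical address mapping table lookup and the route hint from the selected routing interface together with the forwarding indication injected by the network manager; here both are passed in as plain integers.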
6. The artificial intelligence server cluster network according to claim 1, characterized in that, Controlling, during the execution phase of the AI model training task, the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance includes: The current AI model training task is obtained based on the AI model training framework; before the current AI model training task is executed, multiple consecutive execution phases of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution phase are obtained based on the computation graph information of the current AI model training task. When the current AI model training task is executed, within the current execution phase, an OCS reconstruction instruction for the next execution phase is generated in advance based on the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution phase. Within the current execution phase or in the gap between the current execution phase and the next execution phase, the OCS reconstruction instruction for the next execution phase is sent to the reconfigurable optical switching plane, so that the centralized optical path switching device in the reconfigurable optical switching plane can reconstruct its own physical optical topology based on the OCS reconstruction instruction for the next execution phase before the next execution phase begins.
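The look-ahead ordering in claim 6 can be sketched as a simple loop. This Python sketch makes several assumptions: phases run sequentially, the instruction for the next phase is generated before the current phase finishes and is sent in the gap between phases, and the topology for the first phase is already in place before the task starts. The function name, the callback parameters, and the instruction dictionary format are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def run_training_with_prefetch(
    phases: List[str],
    topology_for: Dict[str, dict],
    send_ocs_instruction: Callable[[dict], None],
    run_phase: Callable[[str], None],
) -> None:
    """Run phases in order, issuing each OCS instruction one phase ahead."""
    for k, phase in enumerate(phases):
        next_instruction: Optional[dict] = None
        if k + 1 < len(phases):
            nxt = phases[k + 1]
            # Generated in advance, while the current phase is still pending.
            next_instruction = {"phase": nxt, "topology": topology_for[nxt]}
        run_phase(phase)
        if next_instruction is not None:
            # Sent in the gap before the next phase, so the optical plane
            # can be reconfigured before that phase begins.
            send_ocs_instruction(next_instruction)
```

A real network manager would presumably overlap instruction delivery with the running phase rather than wait for the gap; the claim allows either timing, and the sequential form is used here only to keep the ordering easy to follow.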
7. The artificial intelligence server cluster network according to claim 6, characterized in that, The step of obtaining multiple consecutive execution phases of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution phase based on the computation graph information of the current AI model training task includes: The set of communication operators corresponding to each execution phase is obtained by parsing the computation graph information of the current AI model training task; the set of communication operators includes one or more communication operators. For any execution phase, a pre-trained traffic prediction model is used to predict the full node-pair communication traffic matrix for that execution phase based on the set of communication operators for that execution phase. The matrix element (i, j) in the full node-pair communication traffic matrix represents the expected amount of communication data sent from server node i to server node j, both of which participate in executing the corresponding communication operators, within that execution phase. The optimal physical optical topology configuration of the reconfigurable optical switching plane during that execution phase is calculated based on the full node-pair communication traffic matrix for that execution phase.
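One simple way to turn a node-pair traffic matrix into an optical circuit assignment is a greedy matching, sketched below in Python. This is a swapped-in heuristic for illustration only: the claim does not state how the optimal configuration is calculated, and a greedy choice is not guaranteed to be optimal. The sketch also assumes each node has a fixed number of optical ports (`ports_per_node`, hypothetical) and that circuits are bidirectional; the traffic prediction model itself is not modeled, and the matrix is taken as given.

```python
from typing import List, Sequence, Tuple


def greedy_optical_matching(traffic: Sequence[Sequence[float]],
                            ports_per_node: int = 1) -> List[Tuple[int, int]]:
    """Assign optical circuits to the node pairs with the highest predicted demand.

    traffic[i][j] is the expected data volume from node i to node j in one phase.
    """
    n = len(traffic)
    candidates = []
    for i in range(n):
        for j in range(i + 1, n):
            demand = traffic[i][j] + traffic[j][i]   # both directions share a circuit
            if demand > 0:
                candidates.append((demand, i, j))
    # Highest demand first; ties broken by node index for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    free_ports = [ports_per_node] * n
    circuits: List[Tuple[int, int]] = []
    for _, i, j in candidates:
        if free_ports[i] > 0 and free_ports[j] > 0:
            free_ports[i] -= 1
            free_ports[j] -= 1
            circuits.append((i, j))
    return circuits
```

Node pairs left without a circuit would carry their traffic over the electrical switching plane, consistent with the hybrid design described in the disclosure.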
8. The artificial intelligence server cluster network according to claim 1, characterized in that, The network manager is also configured to: Obtain the parallel strategy for the AI model training task, and divide each server node in the multiple server racks into multiple dynamic consistency domains (DCDs) based on the parallel strategy for the AI model training task. Server nodes belonging to the same DCD are used to collaboratively process specific data or model partitions of the AI model training task. The CXL-oF bridging device also includes a DCD manager, and the CXL-oF bridging device is further configured to: For server nodes within the same DCD, strong cache consistency for memory access within the same DCD is maintained through a hardware snooping protocol; for multiple DCDs, weak cache consistency is maintained through the DCD manager based on the DCD directory table; the directory entries in the DCD directory table include the correspondence between memory page addresses and the server node to which the memory page belongs and the memory page status, and the memory page status includes: shared, exclusive, and invalid.
9. The artificial intelligence server cluster network according to claim 8, characterized in that, The CXL-oF bridging device is also configured to: When the memory access instruction is determined to be a cross-DCD memory write operation instruction, a cross-domain consistency maintenance operation is triggered for the target memory page. The target memory page refers to the memory page corresponding to the memory address to be accessed by the memory access instruction. The cross-domain consistency maintenance operation includes: based on the DCD directory table, sending point-to-point reverse invalidation messages only to server nodes recorded in the DCD directory table where the target memory page is in a shared state, instead of broadcasting to the entire network, and completing the write operation after obtaining invalidation confirmation from the corresponding node.
10. The artificial intelligence server cluster network according to claim 9, characterized in that, The server node is configured as follows: In response to the detection of a cross-DCD write conflict event, a cache invalidation request is sent to the server node currently holding the target memory page in an exclusive state, as recorded in the DCD directory table, so that it can release the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining eventual consistency of the data.
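The directory-based handling of cross-DCD writes in claims 8 to 10 can be sketched in Python. The sketch is a simplified single-process model under stated assumptions: the directory is a dictionary keyed by memory page address, invalidation is a synchronous callback that returns an acknowledgement, and a write leaves the page exclusive to the writer. `DcdDirectory`, `record_share`, and `cross_domain_write` are hypothetical names, and intra-DCD hardware snooping is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple


class PageState(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"
    INVALID = "invalid"


@dataclass
class DirectoryEntry:
    state: PageState = PageState.INVALID
    holders: Set[int] = field(default_factory=set)   # node identifiers holding the page


class DcdDirectory:
    """Hypothetical DCD directory table keyed by memory page address."""

    def __init__(self, send_invalidate: Callable[[int, int], bool]) -> None:
        self._entries: Dict[int, DirectoryEntry] = {}
        self._send_invalidate = send_invalidate   # (node_id, page) -> acknowledged?

    def record_share(self, page: int, node_id: int) -> None:
        entry = self._entries.setdefault(page, DirectoryEntry())
        entry.state = PageState.SHARED
        entry.holders.add(node_id)

    def state(self, page: int) -> Tuple[PageState, Set[int]]:
        entry = self._entries.get(page, DirectoryEntry())
        return entry.state, set(entry.holders)

    def cross_domain_write(self, page: int, writer: int) -> List[int]:
        """Invalidate only the recorded holders, then grant the writer exclusivity."""
        entry = self._entries.setdefault(page, DirectoryEntry())
        targets = [] if entry.state is PageState.INVALID else sorted(entry.holders - {writer})
        for node_id in targets:
            # Point-to-point message to a recorded holder; no network-wide broadcast.
            if not self._send_invalidate(node_id, page):
                raise RuntimeError(f"node {node_id} did not acknowledge invalidation")
        entry.state = PageState.EXCLUSIVE
        entry.holders = {writer}
        return targets
```

The same path covers both cases in this sketch: sharers recorded for a shared page receive the invalidation described in claim 9, and a node holding the page exclusively is asked to release it as in claim 10, after which the directory is updated.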
11. A memory access method in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and CXL-oF protocol, characterized in that, The AI server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected via corresponding electrical switches based on their own electrical interfaces. The electrical switches in the server racks are connected to the electrical backbone network in the electrical switching plane. The server nodes in the server racks are connected to the centralized optical path switching device in the reconfigurable optical switching plane via optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine.
The memory access method is applied to the server node and includes: obtaining a memory access instruction generated based on an AI model training task; determining the target server node according to the request address in the memory access instruction; if the target server node is not the local server node, generating a CXL data packet corresponding to the memory access instruction through the CXL module, and writing the CXL data packet into the outbound traffic queue; determining, through the optoelectronic joint scheduler, the routing interface of the CXL data packet based on the descriptive characteristics of the CXL data packet in the outbound traffic queue; encapsulating, through the Fabric encapsulation engine, the CXL data packet according to the request address of the memory access instruction, the routing interface of the CXL data packet, and the descriptive characteristics of the CXL data packet, to generate a corresponding Fabric transmission frame; and routing the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes an electrical interface or an optical interface. The designated switching plane corresponding to the electrical interface is an electrical switching plane, and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane. The network manager is used to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: during the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance.
12. The memory access method according to claim 11, characterized in that, Determining the target server node based on the requested address in the memory access instruction includes: The identifier of the target server node corresponding to the request address of the memory access instruction is obtained based on a preset global physical address mapping table; wherein, each entry in the global physical address mapping table is used to describe the correspondence between the global physical address range and the identifiers of the server nodes in the plurality of server racks, and the request address corresponds to the global physical address in the global physical address range. The target server node is determined based on its identifier.
13. The memory access method according to claim 11, characterized in that, The descriptive characteristics of the CXL data packet include: the type and / or size of the CXL data packet. The optoelectronic joint scheduler determines the routing interface of the CXL data packet based on the descriptive characteristics of the CXL data packets in the outbound traffic queue, including: The optoelectronic joint scheduler monitors the CXL data packets in the outbound traffic queue using its own flow-feature-based classification engine, and determines the routing interface of each CXL data packet based on its type and / or size, including: if the current CXL data packet is a first-type data packet, the routing interface is the electrical interface of the local server node, wherein the first-type data packet includes: CXL.io configuration packets, CXL.cache probes, or packets with payloads less than a preset length; if the current CXL data packet is a second-type data packet and an active optical path connection to the corresponding target server node is currently available, the scheduler determines whether the optical transmission queue is congested and, if not, the routing interface is the optical interface of the local server node, wherein the second-type data packet includes: large memory page migration packets based on the CXL.mem protocol or packets marked as Collective.
14. The memory access method according to claim 13, characterized in that, The memory access method further includes: The link status of the reconfigurable optical switching plane is monitored by the CXL-oF bridging device. When the reconfigurable optical switching plane is detected to be reconfigured or the optical transmission queue is congested, the CXL data packets set to be transmitted through the reconfigurable optical switching plane or the CXL data packets identified as second-type data packets are first sliced. Then, the sliced CXL data packets are encapsulated to generate corresponding Fabric transmission frames. The corresponding Fabric transmission frames are then routed to the electrical interface of the local server node to be transmitted to the target server node through the electrical switching plane.
15. The memory access method according to claim 11, characterized in that, The Fabric transmission frame includes: a custom exchange frame header, payload data, and frame verification information. The custom exchange frame header includes, in sequence: a route label field, a route hint field, a timestamp field, and a traffic type field. The Fabric encapsulation engine encapsulates the CXL data packet according to the request address of the memory access instruction and the routing interface and descriptive features of the CXL data packet to generate a corresponding Fabric transmission frame, including: The routing label field in the custom switching frame header is generated based on the request address of the memory access instruction and the preset global physical address mapping table; The routing hint field in the custom switching frame header is generated based on the routing interface of the CXL data packet and the forwarding indication information injected by the network manager; The timestamp field in the custom exchange frame header is generated based on the transmission time of the Fabric transmission frame; The traffic type field in the custom exchange frame header is generated based on the descriptive features of the CXL data packet; The CXL data packet is used as the payload data, and the CRC checksum is used as the frame check information. The custom exchange frame header, the payload data, and the frame check information are concatenated in sequence to generate the Fabric transmission frame.
16. The memory access method according to claim 11, characterized in that, Controlling, during the execution phase of the AI model training task, the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance includes: The current AI model training task is obtained based on the AI model training framework; before the current AI model training task is executed, multiple consecutive execution phases of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution phase are obtained based on the computation graph information of the current AI model training task. When the current AI model training task is executed, within the current execution phase, an OCS reconstruction instruction for the next execution phase is generated in advance based on the optimal physical optical topology configuration corresponding to the reconfigurable optical switching plane in the next execution phase. Within the current execution phase or in the gap between the current execution phase and the next execution phase, the OCS reconstruction instruction for the next execution phase is sent to the reconfigurable optical switching plane, so that the centralized optical path switching device in the reconfigurable optical switching plane can reconstruct its own physical optical topology based on the OCS reconstruction instruction for the next execution phase before the next execution phase begins.
17. The memory access method according to claim 16, characterized in that, The step of obtaining multiple consecutive execution phases of the entire execution process of the current AI model training task and the optimal physical optical topology configuration of the reconfigurable optical switching plane in each execution phase based on the computation graph information of the current AI model training task includes: The set of communication operators corresponding to each execution phase is obtained by parsing the computation graph information of the current AI model training task; the set of communication operators includes one or more communication operators. For any execution phase, a pre-trained traffic prediction model is used to predict the full node-pair communication traffic matrix for that execution phase based on the set of communication operators for that execution phase. The matrix element (i, j) in the full node-pair communication traffic matrix represents the expected amount of communication data sent from server node i to server node j, both of which participate in executing the corresponding communication operators, within that execution phase. The optimal physical optical topology configuration of the reconfigurable optical switching plane during that execution phase is calculated based on the full node-pair communication traffic matrix for that execution phase.
18. The memory access method according to claim 11, characterized in that, The memory access method further includes: The parallel strategy for the AI model training task is obtained through the network manager. Based on the parallel strategy for the AI model training task, each server node in the multiple server racks is divided into multiple dynamic consistency domains (DCDs). Server nodes belonging to the same DCD are used to collaboratively process specific data or model partitions of the AI model training task. The CXL-oF bridging device further includes a DCD manager, and the memory access method further includes: The DCD manager performs the following operations: For server nodes within the same DCD, strong cache consistency of memory access within the same DCD is maintained through a hardware snooping protocol; between multiple DCDs, weak cache consistency is maintained through the DCD manager based on the DCD directory table; the directory entries in the DCD directory table include the correspondence between memory page addresses and the server node to which the memory page belongs and the memory page status, and the memory page status includes: shared, exclusive, and invalid.
19. The memory access method according to claim 18, characterized in that, The memory access method further includes: When the memory access instruction is determined to be a cross-DCD memory write operation instruction, a cross-domain consistency maintenance operation is triggered for the target memory page. The target memory page refers to the memory page corresponding to the memory address to be accessed by the memory access instruction. The cross-domain consistency maintenance operation includes: based on the DCD directory table, sending point-to-point reverse invalidation messages only to server nodes recorded in the DCD directory table where the target memory page is in a shared state, instead of broadcasting to the entire network, and completing the write operation after obtaining invalidation confirmation from the corresponding node.
20. The memory access method according to claim 19, characterized in that, The memory access method further includes: In response to the detection of a cross-DCD write conflict event, a cache invalidation request is sent to the server node currently holding the target memory page in an exclusive state, as recorded in the DCD directory table, so that it can release the exclusive state and update the DCD directory table, thereby resolving the write conflict and maintaining eventual consistency of the data.
21. A memory access device in an artificial intelligence server cluster network based on hybrid optoelectronic interconnection and CXL-oF protocol, characterized in that, The AI server cluster network includes: a network manager, a hybrid physical interconnect layer, and multiple server racks. Each server rack includes an electrical switch and multiple server nodes. The hybrid physical interconnect layer includes an electrical switching plane and a reconfigurable optical switching plane. The electrical switching plane includes an electrical backbone network, and the reconfigurable optical switching plane includes at least one centralized optical path switching device. Multiple server nodes in each server rack are interconnected via corresponding electrical switches based on their own electrical interfaces. The electrical switches in each server rack are connected to the electrical backbone network in the electrical switching plane. The server nodes in each server rack are connected to the centralized optical path switching device in the reconfigurable optical switching plane via optical interfaces. The network manager is connected to both the electrical backbone network in the electrical switching plane and the centralized optical path switching device in the reconfigurable optical switching plane. Each server node includes a CXL-oF bridging device supporting the CXL-oF protocol. The CXL-oF bridging device includes a CXL module, an optoelectronic joint scheduler, and a Fabric encapsulation engine.
The memory access device is located on each server node and includes: a CXL packet generation module configured to: obtain a memory access instruction generated based on an AI model training task; determine the target server node based on the request address in the memory access instruction; if the target server node is not a local server node, generate a CXL packet corresponding to the memory access instruction through the CXL module and write the CXL packet into the outbound traffic queue; and a CXL packet processing module configured to: determine the routing interface of the CXL packet based on the descriptive characteristics of the CXL packets in the outbound traffic queue through the optoelectronic joint scheduler; then encapsulate the CXL packet based on the request address of the memory access instruction, the routing interface of the CXL packet, and the descriptive characteristics of the CXL packet through the Fabric encapsulation engine to generate a corresponding Fabric transmission frame; and route the Fabric transmission frame to the routing interface for transmission to the target server node through the corresponding designated switching plane. The routing interface includes: an electrical interface or an optical interface; the designated switching plane corresponding to the electrical interface is an electrical switching plane; and the designated switching plane corresponding to the optical interface is a reconfigurable optical switching plane. The network manager is used to manage the reconfigurable optical switching plane and the electrical switching plane through an out-of-band management network, including: during the execution phase of an AI model training task, controlling the reconstruction of the physical optical topology of the reconfigurable optical switching plane by issuing the OCS reconstruction instruction corresponding to the execution phase to the reconfigurable optical switching plane in advance.
22. An electronic device, characterized in that, It includes a memory and a processor; the memory is used to store computer instructions, wherein the computer instructions are executed by the processor to implement the memory access method according to any one of claims 11 to 20.
23. A computer-readable storage medium storing computer instructions thereon, characterized in that, When the computer instructions are executed by the processor, they implement the memory access method according to any one of claims 11 to 20.
24. A computer program product comprising computer instructions, characterized in that, When the computer instructions are executed by the processor, they implement the memory access method according to any one of claims 11 to 20.