Cluster-oriented location-independent communication method, data structure, and computing system

Introducing the Logical Destination Handle (LDH) mechanism decouples communication from storage management in the computing cluster. This addresses the low communication efficiency caused by the highly dynamic data flows of new-generation large models and improves system scalability and memory management efficiency.

CN121935201B · Active · Publication Date: 2026-06-30 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2026-03-30
Publication Date: 2026-06-30

Abstract

This invention discloses a location-independent communication method, data structure, and computing system for computing clusters, addressing the performance loss and programming complexity issues caused by the strong coupling between communication initiation and data storage layout in distributed computing. The communication initiator sends communication data packets based on a logical destination handle. The consumer's location management unit receives the data packets, dynamically maps the consumer entity identified by the logical destination handle to the target storage resource, and autonomously allocates an actual storage location within it to write data. Finally, it generates a completion record containing that location to notify the consumer entity. This invention supports dynamic memory allocation, metadata pass-through, and multiple allocation strategies, eliminating the address synchronization overhead before communication, improving storage utilization and system scalability, and is suitable for highly dynamic computing scenarios such as large-model MoE and KV cache migration.

Description

Technical Field

[0001] This invention relates to the field of distributed computing and high-performance computing cluster communication technology, specifically to location-independent communication methods, data structures, and computing systems for computing clusters.

Background Technology

[0002] In recent years, the improvement of large language models (LLMs) has followed a clear scaling law, but the computational cost of traditional dense Transformer architectures grows with the parameter count faster than hardware can sustainably scale. To overcome this computational bottleneck, next-generation model architectures such as Mixture-of-Experts (MoE) and Sparse Attention have become key pathways. These architectures replace dense models with sparse structures in which each input selectively activates only a small number of parameters, thereby achieving the expressive power of a trillion-parameter model at limited computational cost.

[0003] However, the ever-increasing model size and growing requirements for long context mean that next-generation models still need multi-level, multi-GPU distributed training or inference. Examples include expert parallelism in MoE and context parallelism in long context attention. Distributed execution pushes system bottlenecks to cross-device communication; in MoE inference or training, all-to-all communication can account for nearly 50% of execution time. More importantly, sparse architectures result in highly dynamic, irregular data flows, with communication routes dynamically determined by runtime gating algorithms, leading to large traffic fluctuations and making static orchestration difficult.

[0004] The interconnection and communication of current mainstream computing clusters (such as communication libraries based on RDMA and NVLink) are generally built on the direct memory access semantics of load / store. This is essentially an "address-coupled" communication paradigm: for a producer (the initiator) to initiate a remote write, the instruction must carry the precise destination address on the consumer side, such as a physical or virtual memory address. This paradigm implicitly imposes a hard engineering premise: without a precise address, data transfer between hardware devices cannot begin.

[0005] However, the highly dynamic and irregular data flows of new-generation large-model algorithms are seriously mismatched, at the semantic level, with the "address-coupled" communication abstraction of existing hardware. At the algorithm layer, gated routing often specifies only coarse-grained producer-consumer relationships at runtime; for example, a token should be assigned to a certain expert for computation, or a KV cache needs to be computed against a query. This coarse-grained relationship does not provide the specific memory address that hardware-layer communication requires, and since the relationship is determined at runtime from the input, it cannot be predicted or pre-arranged by the compiler.

[0006] To compensate for this mismatch, existing systems have to introduce complex software mediation steps. For example, in the MoE model, before actual communication occurs, the total number of tokens sent by each source device to each target expert must be counted at runtime. Then, a global synchronization operation is performed to exchange metadata, and an accurate offset mapping table is calculated and generated based on this. Only then can data transmission be initiated. This process essentially uses the expensive cost of global synchronization to make up for the physical address information required for hardware transmission.
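By way of illustration only, the software mediation step described above can be sketched as follows. This is a minimal single-process model, not the code of any particular framework; the function name `build_offset_table` and the layout of the count matrix are assumptions made for this example. Each expert's receive buffer is assumed to be filled contiguously in source-device order, so a source's write offset is the exclusive prefix sum of the counts of lower-numbered sources. That dependency is why every source needs the counts of every other source before it can send.

```python
def build_offset_table(counts):
    """counts[src][dst] = number of tokens device `src` sends to expert `dst`.

    Returns offsets[src][dst]: the slot in expert `dst`'s receive buffer where
    `src` must start writing. Computing it needs the full count matrix, which
    in a real cluster is only available after a global metadata exchange.
    """
    num_src = len(counts)
    num_dst = len(counts[0])
    offsets = [[0] * num_dst for _ in range(num_src)]
    for dst in range(num_dst):
        running = 0  # exclusive prefix sum over source devices
        for src in range(num_src):
            offsets[src][dst] = running
            running += counts[src][dst]
    return offsets


# Three source devices, two experts (hypothetical token counts).
counts = [[2, 1],
          [0, 3],
          [4, 1]]
offsets = build_offset_table(counts)
```

In this sketch no device can compute its own row of `offsets` from its own row of `counts` alone, which is the synchronization cost the paragraph above refers to.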

[0007] This traditional communication protocol based on "address coupling" exposes unavoidable structural limitations when facing the new generation of highly dynamic computing workloads:

[0008] 1. High control flow overhead: In dynamic communication scenarios, in order to meet the address binding requirements, the two communicating parties need to frequently handshake and exchange addresses, which introduces significant delays, makes the software stack complex and redundant, and makes it difficult to achieve efficient computation-communication overlap.

[0009] 2. Rigid memory management: The initiating end needs to directly manipulate the physical address of the consumer end, which usually requires the consumer end to pre-allocate large blocks of contiguous memory. It is difficult to support flexible on-demand allocation, which easily leads to memory fragmentation.

[0010] 3. Difficulty in heterogeneous adaptation: Different computing chips (GPU, NPU, TPU, etc.) have different memory addressing mechanisms. Address binding protocols require the initiating end to have a deep understanding of the memory details of the consumer end, which hinders direct communication and cluster expansion across heterogeneous chips.

[0011] These overheads caused by "address coupling" are systemic costs. Continuing to use this paradigm will lead to more frequent synchronization and more complex coordination, diminishing the benefits of scaling. Therefore, an upgrade to the communication protocol paradigm is urgently needed: shifting from "writing to physical addresses" to "delivering to logical destinations," decoupling communication from storage management, fundamentally eliminating the need for address synchronization, and unleashing the scaling potential of large models.

Summary of the Invention

[0012] To address the shortcomings of existing technologies, the purpose of this invention is to provide a location-independent communication method, data structure, and computing system for computing clusters.

[0013] A location-independent communication method for computing clusters provided by the present invention includes the following steps:

[0014] Step S1: The communication initiator sends a communication data packet to the communication consumer based on the logical destination handle; the logical destination handle is used to identify the communication consumer entity, but does not specify the actual storage location of the data payload in the target storage resources of the consumer.

[0015] Step S2: The location management unit at the consumer end receives the communication data packet and executes a dynamic location allocation process, which includes:

[0016] The logical destination handle is parsed, and the communication consumer entity is mapped to the target storage resource managed by the communication consumer.

[0017] Within the target storage resource, an actual storage location is autonomously determined and allocated for the data payload, and the data payload is written to that location;

[0018] Step S3: After the data is stored, the location management unit generates a completion record containing the actual storage location, and notifies the communication consumer entity of the record according to a preset strategy so that it can associate the data content with the storage location.
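Steps S1 to S3 can be modeled in software as in the following non-limiting sketch. The class names, the dictionary used for handle-to-resource mapping, and the bump-pointer allocator are all assumptions chosen to keep the example short; the method itself leaves the mapping and allocation mechanisms open.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionRecord:
    ldh: int       # which consumer entity the data belongs to
    adl: int       # actual storage location chosen by the unit
    length: int


@dataclass
class TargetStorageResource:
    buffer: bytearray
    next_free: int = 0  # bump pointer used by this sketch's allocator


@dataclass
class LocationManagementUnit:
    resource_map: dict                       # LDH -> TargetStorageResource
    completions: list = field(default_factory=list)

    def receive(self, ldh: int, payload: bytes) -> CompletionRecord:
        tsr = self.resource_map[ldh]          # S2: map entity to its resource
        adl = tsr.next_free                   # S2: pick the location locally
        if adl + len(payload) > len(tsr.buffer):
            raise MemoryError("target storage resource is full")
        tsr.buffer[adl:adl + len(payload)] = payload
        tsr.next_free = adl + len(payload)
        record = CompletionRecord(ldh, adl, len(payload))  # S3: notify
        self.completions.append(record)
        return record


# S1: the initiator names only the logical destination (handle 7), never an address.
unit = LocationManagementUnit({7: TargetStorageResource(bytearray(16))})
first = unit.receive(7, b"abcd")
second = unit.receive(7, b"xyz")
```

The initiator side of this sketch never sees `adl`; the consumer entity learns it only from the completion record.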

[0019] Preferably, the encoding of the logical destination handle includes one or more fields for routing and resource mapping, including: target device index, task or process index, communication queue index, or operator instance index.

[0020] Preferably, the step of parsing the logical destination handle and mapping the communication consumer entity to the target storage resource managed by the communication consumer is implemented in at least one of the following ways:

[0021] Query the predefined data structure that stores the association between the logical destination handle and the target storage resource;

[0022] A preset conversion algorithm is executed, taking at least a portion of the logical destination handle as input, to calculate the address or index of the target storage resource;

[0023] The logical destination handle is directly converted into an access signal for the target storage resource through a hard-coded decoding circuit.
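The first two mapping approaches can be contrasted in a short sketch; the third is a hardware decoder and has no direct software analogue. The region size, the 8-bit entity field, and the addresses below are hypothetical values chosen for the example.

```python
REGION_SIZE = 4096  # assumed fixed size of each entity's storage region


def map_by_table(ldh, table):
    """Approach 1: look the handle up in a predefined association table."""
    return table[ldh]


def map_by_formula(ldh, pool_base, entity_bits=8):
    """Approach 2: compute the resource base from part of the handle."""
    entity_id = ldh & ((1 << entity_bits) - 1)  # low-order bits of the handle
    return pool_base + entity_id * REGION_SIZE


table = {0x0103: 0x20000, 0x0104: 0x48000}
base_from_table = map_by_table(0x0103, table)
base_from_formula = map_by_formula(0x0103, pool_base=0x10000)
```

A table allows arbitrary, changeable placement at the cost of a lookup, whereas a formula needs no stored state but fixes the layout in advance.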

[0024] Preferably, the method further includes a feedback process:

[0025] The communication data packet contains a feedback signal index;

[0026] After receiving or processing the communication data packet, the location management unit sends a feedback signal containing the index back to the communication initiator based on the processing result.
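A minimal sketch of the feedback process follows, assuming that the initiator keeps a table of outstanding sends keyed by the feedback signal index. The status strings `"OK"` and `"NO_SPACE"`, and all class and function names, are invented for this illustration.

```python
class Initiator:
    def __init__(self):
        self.pending = {}      # feedback index -> description of the send
        self.next_index = 0

    def send(self, ldh, payload):
        index = self.next_index
        self.next_index += 1
        self.pending[index] = (ldh, len(payload))
        return {"ldh": ldh, "ack_index": index, "payload": payload}

    def on_feedback(self, ack):
        # The index alone identifies which send this feedback refers to.
        return self.pending.pop(ack["ack_index"])


def unit_process(packet, capacity_left):
    """Consumer side: echo the index back together with the processing result."""
    ok = len(packet["payload"]) <= capacity_left
    return {"ack_index": packet["ack_index"], "status": "OK" if ok else "NO_SPACE"}


initiator = Initiator()
first = initiator.send(3, b"aaaa")
second = initiator.send(3, b"bb")
ack = unit_process(second, capacity_left=8)
completed = initiator.on_feedback(ack)
```

Because the feedback carries only an index and a result, it can arrive out of order without the initiator needing any consumer-side address.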

[0027] Preferably, the communication data packet includes a user context tag and executes a metadata pass-through mechanism:

[0028] In step S1, the communication initiator fills the user-defined metadata into the user context tag;

[0029] In step S2, the location management unit does not parse or modify the content of the user context tag;

[0030] In step S3, the location management unit includes the user context tag in the completion record and notifies the communication consumer entity.
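The pass-through behavior can be illustrated with the sketch below. The packing of a token index and a layer number into the tag is purely hypothetical; the point is that only the two application endpoints interpret the tag, while the consumer-side unit copies it unchanged.

```python
def unit_handle(packet, storage):
    """Store the payload and return a completion record.

    The `uct` value is copied through untouched: this function never
    inspects or modifies it.
    """
    adl = len(storage)              # append position chosen on the consumer side
    storage.extend(packet["payload"])
    return {"adl": adl, "length": len(packet["payload"]), "uct": packet["uct"]}


# Initiator side (S1): the application packs its own meaning into the tag.
token_id, layer = 4711, 12
packet = {"ldh": 5, "uct": (token_id << 8) | layer, "payload": b"\x01\x02\x03"}

storage = bytearray()
record = unit_handle(packet, storage)

# Consumer side (S3): only the application knows how to decode the tag.
decoded_token, decoded_layer = record["uct"] >> 8, record["uct"] & 0xFF
```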

[0031] Preferably, the communication data packet includes an opcode field to indicate the transmission type, and the transmission type includes at least one of the following:

[0032] Write-data type: indicates that the data packet carries a payload and that step S2 needs to be executed;

[0033] Signaling-only type: indicates that the data packet does not carry a valid data payload and is used to transmit logical control events or synchronization signals on the communication link;

[0034] Flush type: Indicates the data packet used to enforce consistency and visibility of the storage system.
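The three transmission types suggest a simple dispatch on the opcode, sketched below. The member names follow the identifiers used in the detailed description (WRITE_DATA, SIGNAL_ONLY, FLUSH); the numeric values, the `Unit` class, and the staged/visible split used to model flushing are assumptions of this example.

```python
from enum import IntEnum


class Opcode(IntEnum):
    WRITE_DATA = 0    # numeric values are arbitrary in this sketch
    SIGNAL_ONLY = 1
    FLUSH = 2


class Unit:
    def __init__(self):
        self.staged = []      # payloads accepted but not yet visible
        self.storage = []     # payloads visible to the consumer
        self.events = []      # control events delivered to the consumer

    def handle(self, opcode, payload=None):
        if opcode == Opcode.WRITE_DATA:
            self.staged.append(payload)          # step S2 would run here
        elif opcode == Opcode.SIGNAL_ONLY:
            self.events.append("signal")         # no storage is consumed
        elif opcode == Opcode.FLUSH:
            self.storage.extend(self.staged)     # make staged data visible
            self.staged.clear()
        else:
            raise ValueError(f"unknown opcode {opcode}")


unit = Unit()
unit.handle(Opcode.WRITE_DATA, b"a")
unit.handle(Opcode.SIGNAL_ONLY)
unit.handle(Opcode.WRITE_DATA, b"b")
unit.handle(Opcode.FLUSH)
```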

[0035] Preferably, the communication data packet includes an allocation hint field; the location management unit adopts a corresponding allocation strategy based on this field, and the allocation strategy includes at least one of the following:

[0036] Default strategy: Follows the default allocation strategy configuration;

[0037] Compact append strategy: Writes data payloads sequentially to the target storage resource;

[0038] Discrete block allocation strategy: Extracts the address of a free block or page from the data structure that organizes discrete blocks and uses it as the actual storage location;

[0039] Hash strategy: Uses hash mapping to achieve automatic bucketing or load balancing of data storage;

[0040] Hierarchical priority strategy: Allocates data of different importance to storage areas with different priorities or different latency.
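Three of the listed strategies are sketched below behind a common `allocate` interface, selected by the value of the allocation field. This is one possible software model only: the class names, the string keys, the sizes, and the use of `key % num_buckets` as a stand-in for a hash function are all assumptions, and the hierarchical priority strategy is omitted for brevity.

```python
class CompactAppend:
    """Write payloads back to back."""
    def __init__(self, capacity):
        self.capacity, self.cursor = capacity, 0

    def allocate(self, length, key=None):
        if self.cursor + length > self.capacity:
            raise MemoryError("region full")
        adl, self.cursor = self.cursor, self.cursor + length
        return adl


class DiscreteBlock:
    """Hand out fixed-size blocks from a free list."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))

    def allocate(self, length, key=None):
        if length > self.block_size or not self.free:
            raise MemoryError("no suitable free block")
        return self.free.pop(0) * self.block_size


class HashBucket:
    """Spread payloads over buckets selected from a caller-supplied key."""
    def __init__(self, num_buckets, bucket_size):
        self.bucket_size = bucket_size
        self.fill = [0] * num_buckets

    def allocate(self, length, key=0):
        bucket = key % len(self.fill)  # stand-in for a real hash function
        if self.fill[bucket] + length > self.bucket_size:
            raise MemoryError("bucket full")
        adl = bucket * self.bucket_size + self.fill[bucket]
        self.fill[bucket] += length
        return adl


STRATEGIES = {
    "append": CompactAppend(64),
    "block": DiscreteBlock(4, 16),
    "hash": HashBucket(4, 16),
}


def allocate(hint, length, key=0):
    # Unknown or missing hints fall back to the default strategy.
    return STRATEGIES.get(hint, STRATEGIES["append"]).allocate(length, key)


a0 = allocate("append", 10)
a1 = allocate("append", 6)
b0 = allocate("block", 8)
b1 = allocate("block", 8)
h0 = allocate("hash", 4, key=6)
fallback = allocate("unknown", 3)
```

Compact append wastes no space but cannot reclaim individual payloads, while block allocation trades some internal slack for the ability to free and reuse blocks independently.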

[0041] Preferably, the notification step in step S3 is implemented in the following manner:

[0042] The location management unit writes the completion record into the communication status area managed by the consumer end;

[0043] The writing method is either log queue mode or direct mapping mode;

[0044] When the direct mapping mode is used and the communication data packet contains a user context tag, the write offset is calculated as follows: extract a predefined bit field from the user context tag as an index value to calculate the offset; or calculate the corresponding offset based on the offset of the actual storage location relative to the target storage resource base address.
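The two writing modes and the two direct-mapping offset calculations can be sketched as follows. The 4-bit slot field and the 256-byte block granularity are hypothetical parameters, and the completion record is modeled as a plain dictionary.

```python
SLOT_BITS = 4            # assumed: the low 4 bits of the tag select the slot
BLOCK_SIZE = 256         # assumed allocation granularity of the storage resource


def log_queue_write(queue, record):
    """Log queue mode: append records in arrival order."""
    queue.append(record)
    return len(queue) - 1


def slot_from_uct(uct):
    """Direct mapping, option 1: a predefined bit field of the tag is the index."""
    return uct & ((1 << SLOT_BITS) - 1)


def slot_from_adl(adl, tsr_base):
    """Direct mapping, option 2: derive the index from the location's offset."""
    return (adl - tsr_base) // BLOCK_SIZE


status_area = [None] * (1 << SLOT_BITS)
record = {"adl": 0x8300, "uct": 0xA5}
status_area[slot_from_uct(record["uct"])] = record

queue = []
position = log_queue_write(queue, record)
```

In log queue mode the consumer scans records in order, whereas direct mapping lets it poll one known slot for the item it is waiting on.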

[0045] A data structure for a communication data packet provided by the present invention includes:

[0046] The data packet header contains a logical destination handle that indicates the communication consumer entity, and / or contains an opcode, an allocation hint field, a data length field, a feedback signal index field, and a user context tag;

[0047] Data payload: Contains binary data whose length is consistent with that indicated by the data length field or with the preset allocation granularity.
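One possible byte layout for such a packet is sketched below with Python's standard `struct` module. The field widths (64-bit handle, 8-bit opcode, 8-bit allocation field, 16-bit length, 32-bit feedback index, 64-bit tag), the field order, and the little-endian encoding are assumptions of this example; the data structure above does not fix any of them.

```python
import struct

# "<" selects little-endian with no padding:
# handle (Q), opcode (B), allocation field (B), length (H), feedback index (I), tag (Q)
HEADER = struct.Struct("<QBBHIQ")


def pack_packet(ldh, opcode, hint, ack_index, uct, payload):
    return HEADER.pack(ldh, opcode, hint, len(payload), ack_index, uct) + payload


def unpack_packet(raw):
    ldh, opcode, hint, length, ack_index, uct = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("payload shorter than the data length field")
    return {"ldh": ldh, "opcode": opcode, "hint": hint,
            "ack_index": ack_index, "uct": uct, "payload": payload}


raw = pack_packet(ldh=0x0001_0002_0000_0003, opcode=0, hint=1,
                  ack_index=9, uct=0xBEEF, payload=b"hello")
packet = unpack_packet(raw)
```

Note that no field of this header holds a destination address; the handle is the only destination information the initiator supplies.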

[0048] A computing system provided by the present invention includes:

[0049] The communication initiator is configured to generate and send communication data packets;

[0050] A location management unit, located at the communication consumer end, is configured to execute the dynamic location allocation process and generate a completion record;

[0051] The target storage resource, managed by the communication consumer, is used to store the written data;

[0052] The communication consumer is configured to receive the completion record and access data accordingly.

[0053] The location management unit can be implemented in the form of: dedicated hardware circuit, programmable logic device, or software / firmware logic, or any combination thereof.

[0054] Compared with the prior art, the present invention has the following beneficial effects:

[0055] 1. Significantly reduce communication latency and synchronization overhead: By introducing Logical Destination Handle (LDH), communication initiation and storage management are decoupled. The initiating end does not need to query or negotiate the specific storage address of the consumer end in advance to initiate data transmission, which fundamentally avoids global address synchronization across devices and greatly reduces end-to-end communication latency. It is especially suitable for highly dynamic and irregular data flow scenarios (such as MoE routing and KV Cache migration).

[0056] 2. Enhanced system programmability and software simplicity: Upper-layer applications only need to specify the logical destination (such as expert ID, task index, etc.), without having to worry about the underlying storage layout and heterogeneous memory details. This greatly simplifies the distributed programming model, reduces the complexity of the software stack, allows developers to focus more on business logic, and improves development efficiency and system maintainability.

[0057] 3. Dynamic memory management: The consumer-side location management unit (PMU) can dynamically adopt the optimal allocation strategy (such as compact append, discrete block allocation, tiered storage, etc.) based on real-time load, data importance, or memory fragmentation, thereby significantly reducing memory fragmentation and improving the utilization of storage resources.

[0058] 4. High versatility: The location-independent core concept proposed in this invention does not depend on a specific application model or hardware architecture, and can be widely applied to scenarios that require efficient dynamic data exchange (such as MoE routing, sparse attention computation, stream processing, and other high-performance computing tasks). It can flexibly support interconnections at all levels between various computing chips such as CPU, GPU, TPU, and NPU.

Attached Figure Description

[0059] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0060] Figure 1 is a flowchart of the communication method provided in an embodiment of the present invention (including steps S1-S3);

[0061] Figure 2 is a general system architecture model and overall workflow diagram provided by embodiments of the present invention;

[0062] Figure 3 is a comparison diagram of the core concepts of traditional communication methods and the communication method proposed in this invention;

[0063] Figure 4 is a schematic diagram of the data structure of the Logical Destination Handle (LDH) in an embodiment of the present invention;

[0064] Figure 5 is a schematic diagram of the data structure of the communication data packet (CP) in an embodiment of the present invention;

[0065] Figure 6 is a schematic diagram of the data structure of the feedback signal (ACK) in an embodiment of the present invention;

[0066] Figure 7 is a schematic diagram of the data structure of the Resource Mapping Table (RST) in an embodiment of the present invention;

[0067] Figure 8 is a schematic diagram illustrating the location allocation strategy executed by the location management unit (PMU) in an embodiment of the present invention;

[0068] Figure 9 is a schematic diagram illustrating the principle of the communication status (CS) notification mode in an embodiment of the present invention.

Detailed Implementation

[0069] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.

[0070] 1. The core terms used in this specification are defined as follows:

[0071] Communication Packet (CP): The basic unit of transmission on a link, containing a packet header and an optional data payload.

[0072] Acknowledgement (ACK): A signal sent back from the consumer to the initiator, used to implement mechanisms such as data reception confirmation, flow control, or congestion avoidance.

[0073] Consumer Entity (CE): An abstract logical concept within the communication consumer, representing the final receiver and user of communication data. This entity is a broad definition, and its instances can be a computing task, a process, a communication queue, an operator instance in a data flow processing graph, a data shard in distributed storage, or any logical entity that needs to consume the transmitted data.

[0074] Logical Destination Handle (LDH): Used by the communication initiator to reference the entity on the communication consumer side. Its core function is to enable the initiator to specify the logical ownership of data without needing to know the final physical location of the data within the consumer side, thus decoupling the communication process from the final physical layout of the data.

[0075] User Context Tag (UCT): A user-defined metadata field that may be included in the packet header. The data layer treats its content as opaque: it does not parse it and is only responsible for passing it through the link and delivering it to the consumer.

[0076] Target Storage Resource (TSR): A storage space managed by the communication consumer itself, used to store the data payload of communication data packets. Its management method is determined by the consumer; it can be statically reserved or dynamically requested.

[0077] Resource Mapping Table (RST): An exemplary data structure that stores the association between an LDH and its corresponding TSR. It is used when the consumer entity (CE) identified by a logical destination handle (LDH) is mapped to a target storage resource (TSR) by means of a lookup data structure.

[0078] Actual Data Location (ADL): The specific location or address where the data payload is ultimately written into the TSR, such as a global memory address or an index of the on-chip cache.

[0079] Completion Record (CR): Structured information generated by the PMU after the data payload has been placed in the TSR, used to provide communication completion status to the communication consumer entity. It generally contains the actual storage location of the data (ADL) and optionally includes transparently transmitted UCT metadata.

[0080] 2. System Architecture

[0081] As shown in Figure 2, the communication system defined in this invention involves three core logical abstractions. These abstract entities functionally cooperate with each other and can be mapped to any compatible hardware or software implementation:

[0082] a. Protocol Initiator: The logical entity that initiates a communication transaction and injects data; that is, the data sender. It generally includes structures that implement the functions of packaging communication data packets and invoking the underlying hardware to initiate transmission. It may also include internal or external computing and storage systems or subsystems. Examples of protocol initiators include, but are not limited to, general-purpose computing cores (CPUs), parallel computing units (GPU SMs or TPU Cores), direct memory access controllers (DMA Engines), network interface cards (NICs), or storage controllers.

[0083] b. Protocol Consumer: The logical entity that ultimately stores, accesses, and processes data; i.e., the end user of the data. It relies on a Location Management Unit (PMU) to handle the details of physical communication and generally includes internal or external computing and storage systems or subsystems for the final use of the data. Examples of computing systems or subsystems include, but are not limited to, computing cores, computing threads, and dedicated acceleration circuitry. Examples of storage systems or subsystems include, but are not limited to, various levels of cache, on-chip memory, global memory, system main memory, and extended storage.

[0084] c. Location Management Unit (PMU): A logical entity located at the communication consumer end, responsible for handling the communication details of this method. Its core responsibilities include parsing data packets, performing storage resource management including dynamic location allocation, and post-write notification. The specific implementation of the PMU is flexible, and examples include, but are not limited to: dedicated hardware circuitry (such as logic in an interconnect hub), programmable logic (such as an FPGA), or firmware / software logic running on the controller. The hardware circuitry includes, but is not limited to, independent hardware units, computing logic embedded in an interconnect hub, network interface card (NIC), memory management unit (MMU), or on-chip network router (NoC router), or near-memory computing logic embedded on the memory side, etc.

[0085] 3. Communication routing:

[0086] This invention introduces a Logical Destination Handle (LDH) as the core credential for communication routing. Its fundamental purpose is to separate the routing problem of "which logical destination the data is going to" from the storage management problem of "which physical location of which storage resource the data is stored in at that destination," so that the communication initiator can initiate communication without being aware of the consumer's memory layout, while ensuring correct routing in cross-device communication.

[0087] 3.1 Comparison of Communication Methods

[0088] The relationships between the core concepts of the communication method proposed in this invention are shown in Figure 3(b), in sharp contrast to the traditional method shown in Figure 3(a).

[0089] a. In the traditional address-based communication method, the communication initiator must directly concern itself with and specify the hard binding relationship between the data payload and its actual storage location (ADL) at the consumer end.

[0090] b. In the location-independent communication method proposed in this invention, the communication initiator only indicates the logical relationship between the data payload and the consumer entity (CE) that will use the data.

[0091] A consumer entity is an abstract logical unit within the consumer, representing the end user of the data. It's not limited to a single computational task but is a broad concept: it can be a block of threads on a GPU core, a network processing queue, an expert network instance in a MoE model, or simply a data shard to be processed. The initiating end only points to such an abstract "consumer entity" through the LDH. Which storage resources this entity is using or will use (i.e., the target storage resource TSR), and the specific location within those storage resources where the data should be stored (i.e., the actual storage location ADL), are entirely determined by the consumer's PMU at runtime. Through this mechanism, the initiating end only needs to focus on the logical routing target; the management of storage resources on the consumer is handled by the consumer itself.

[0092] At the communication consumer end, after the PMU receives the data packet, its dynamic location allocation process consists of two sub-steps:

[0093] Step 1: Map from the consumer entity (CE) to the target storage resource (TSR) it is using or will use.

[0094] Step 2: Dynamically allocate the actual storage location (ADL) of the data within the corresponding TSR.

[0095] 3.2 Information Encoding of LDH

[0096] To accurately locate the target device within the cluster network and, more precisely, the target consumer entity within that device, the LDH must carry sufficient routing information to ensure route accuracy. The LDH encoding typically includes (but is not limited to) the following two types of key information:

[0097] a. Target Device Index (Device_ID): Used to uniquely identify the physical device of the consumer in the cluster network, such as a specific CPU, GPU, NPU or TPU node. Network switching devices (such as network switches) use this index for global routing to route data packets to the target node.

[0098] b. Internal Entity Index (Entity_ID): Used to uniquely identify a communication consumer entity within the target device, such as a kernel or operator instance, a thread block or a group of thread blocks, a task or process, a communication queue, a task stream, etc. The logical meaning of this index remains consistent between the initiating and consuming ends. For example, an index may correspond to the computation of an expert on a certain device, or an index may correspond to the computation of an attention head on a certain device.

[0099] 3.3 Example of LDH encoding format

[0100] As shown in Figure 4, to adapt to different hardware architectures and software stacks, the LDH encoding format supports multiple implementation methods, including but not limited to:

[0101] a. Flat Integer Type: Uses a globally unique fixed-width integer, such as a 32-bit or 64-bit integer, which is uniformly allocated by the cluster's global resource manager, making it simple and efficient. Example: {Global_Handle_ID (64-bit)}.

[0102] b. Hierarchical Structure: This type uses bit-field partitioning to divide a fixed-width integer and embeds routing information into the handle, facilitating fast hardware parsing. This format allows network intermediary devices to directly parse a portion of the bits (e.g., the high-order bits) for routing, while the PMU parses another portion (e.g., the low-order bits) for resource location. An example is {Target_Node_ID (16-bit) | Device_ID (16-bit) | Local_Entity_ID (32-bit)}.

[0103] c. Object Pointer Type: At the initiating end, LDH is represented as a pointer to a descriptor object in local memory. This object internally caches detailed routing information of the destination node, such as QP Number information similar to that in RDMA. When initiating transmission, the transmitting hardware automatically extracts network routing information from this object and fills it into the packet header.

[0104] Through the above mechanism, LDH successfully decouples the abstract logical communication intent from the internal storage resource management of the consumer, laying the foundation for location-independent communication.
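The hierarchical format can be illustrated with the 16 / 16 / 32-bit split given in the example above. In the sketch below the helper names are hypothetical, and the assignment of the node and device fields to the switch and the entity field to the PMU is one reading of the high-order / low-order division described for that format.

```python
NODE_BITS, DEVICE_BITS, ENTITY_BITS = 16, 16, 32


def encode_ldh(node_id, device_id, entity_id):
    """Pack the three fields into one 64-bit handle, highest field first."""
    for value, bits in ((node_id, NODE_BITS), (device_id, DEVICE_BITS),
                        (entity_id, ENTITY_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
    return ((node_id << (DEVICE_BITS + ENTITY_BITS))
            | (device_id << ENTITY_BITS)
            | entity_id)


def routing_bits(ldh):
    """What a switch would read: the high-order node and device fields."""
    node = ldh >> (DEVICE_BITS + ENTITY_BITS)
    device = (ldh >> ENTITY_BITS) & ((1 << DEVICE_BITS) - 1)
    return node, device


def local_entity(ldh):
    """What the PMU would read: the low-order entity field."""
    return ldh & ((1 << ENTITY_BITS) - 1)


ldh = encode_ldh(node_id=2, device_id=5, entity_id=42)
```

Because each party masks out only its own bits, intermediate devices can route the packet without knowing how the consumer interprets the entity field.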

[0105] 4. Data packet structure

[0106] As shown in Figure 5, this method defines the communication data packet (CP) as the basic carrier of communication. Its structure is highly flexible and hardware-friendly, and can adapt to the needs of different high-performance computing scenarios. The CP mainly consists of two parts: the data packet header and the data payload.

[0107] 4.1 Data packet header

[0108] The header uses a compact design, containing one required field and several optional fields, as defined below:

[0109] a. Logical Destination Handle (LDH): This is a required field and is the core credential for CP network routing, directly containing the value of the LDH mentioned above. Its encoded information mainly covers two aspects: the target device index (Device_ID) and the internal entity index (Entity_ID). After receiving the data packet, the PMU maps the Entity_ID information to the target storage resource (TSR) that the consumer entity is currently using or will use, and obtains the basic configuration information of the storage resource (such as base address, allocation policy, etc.).

[0110] b. Opcode: This is an optional field used to indicate the transmission type and processing semantics of the current data packet. Its implementation typically uses short-width encoding, such as 2-bit or 4-bit, and supports at least one of the following types:

[0111] b-1. Write data type (WRITE_DATA): Indicates that the data packet carries a valid data payload, and the consumer PMU needs to perform dynamic location allocation and write the payload to storage.

[0112] b-2. Signaling-Only Type (SIGNAL_ONLY): Indicates that the data packet does not carry a valid data payload and is used to transmit logical control events or synchronization signals on the communication link. When the consumer PMU receives such a packet, it does not allocate a new data location (i.e., does not consume TSR resources), but instead performs predefined control tasks based on other fields in the packet header, such as UCT or specific flag bits. Specific embodiments include, but are not limited to:

[0113] Fence execution. The PMU ensures that all write operations (WRITE_DATA) arriving before this CP have been processed before processing this signaling packet, in order to guarantee the timing dependencies of operations;

[0114] Function register configuration. The PMU performs configuration operations on specific internal or external registers or storage locations based on the index or specific flags specified by the UCT, causing them to start or stop executing specific functions. Executable functions include, but are not limited to, using internal storage to count received packets, detecting the arrival or writing of specific packets, etc., to implement distributed lightweight task control;

[0115] Notification delivery. The PMU generates only a completion record (CR) and, through a notification mechanism (e.g., writing to the communication status area CS), informs the consumer of an event such as the end of a batch of communication data transmission or a heartbeat keep-alive. In this case, the ADL field in the CR can be set to an invalid value or a specific status code.

[0116] b-3. Flush Type (FLUSH): Indicates that the data packet is used to enforce consistency and visibility of the storage system, and typically does not carry a data payload. The consumer-side PMU forces all pending data held in its internal buffer units, such as the internal data path, write merge buffer, or intermediate state cache, to be written immediately to the final Actual Data Location (ADL). The PMU returns an acknowledgment or generates a completion record only after all pending data has been confirmed as committed to storage. Its main purpose is to ensure data consistency and visibility at both the initiator and consumer ends. Specific embodiments include, but are not limited to:

[0117] Ensure data visibility. In some high-performance architectures, data written by the PMU may first be temporarily stored in an intermediate buffer and has not yet reached main memory. If the consumer directly reads main memory at this time, it may read old data. FLUSH forces this temporary data to be stored in main memory, ensuring that the consumer can see the latest written data.

[0118] Persistent checkpoints. Before the system saves a fault recovery checkpoint, the initiating end sends a FLUSH to ensure that all data in transit (in-flight) or temporarily stored in the consumer's interface buffer has been completely and safely written to the persistent storage area, preventing data loss when saving the checkpoint.

[0119] c. Allocation Hint: This is an optional field that allows the initiator to communicate its preference for a location allocation strategy to the consumer PMU, but does not force the PMU to execute it. Its implementation typically uses short-width encoding, such as 2-bit or 4-bit, and supports at least one of the following strategies or a combination thereof:

[0120] c-1. Default policy: The request follows the default configuration of the PMU, or the default configuration associated with the corresponding TSR (such as recorded in the RST or preset by the hardware), and does not execute the allocation logic of specific preferences.

[0121] c-2. Compact Append Strategy: The request uses a compact append strategy, where the data payload is written continuously to the target storage resource. This strategy is suitable for streaming or variable-length data aggregation.

[0122] c-3. Scattered Block Allocation Strategy: The request adopts a scattered block allocation strategy, which extracts free block or page addresses from the data structure that organizes scattered blocks, such as a linked list queue, as the actual storage location. This strategy is suitable for fixed-length blocks to reduce memory fragmentation, such as in a key-value cache.

[0123] c-4. Hash Strategy: The request uses a hash mapping method to calculate the hash based on the content of the data packet, such as calculating the hash based on a certain field in the user context tag. This enables automatic bucketing or load balancing of data storage and is suitable for scenarios that require data to be evenly distributed.

[0124] c-5. Priority Strategy: Requests are prioritized in a hierarchical manner, allocating data of different importance to storage areas of different priorities or delays based on packet hints. For example, critical data is allocated to on-chip cache, while ordinary data is allocated to device memory.

[0125] d. Data Payload Length (Data_Size): This is an optional field used for variable-length data communication, indicating the size of the subsequent data payload in bytes. It is typically implemented as an integer field, such as a 16-bit or 32-bit integer. When the CP transmits fixed-length data, this field can be omitted, and the PMU will default to the allocation granularity configured by the PMU or the target TSR. When the CP transmits variable-length data, the initiating end fills this field, and the PMU reads this field to determine the amount of space to be allocated.

[0126] e. Feedback Signal Index (ACK_ID): This is an optional field that lets the initiating end identify which data packet a returned feedback signal refers to.

[0127] f. User Context Tag (UCT): This is an optional field used to implement the metadata pass-through mechanism. The UCT field is an opaque bit field filled by the initiator, for example with 64 bits of initiator-defined data. The PMU does not parse its semantics and only returns it unchanged during the notification phase. Its implementation is usually a fixed-length wide word, such as 64-bit or 128-bit, which can be flexibly defined according to the application scenario. Taking a 64-bit width as an example, the following shows possible encodings for two specific scenarios and for a general scenario:

[0128] f-1. MoE expert parallel scenario: UCT can be encoded as {Src_GPU_ID (32bit) | Request_ID (16bit) | Sequence_ID (16bit)}.

[0129] f-2. Attention scenario: UCT can be encoded as {Request_ID (16bit) | Sequence_ID (16bit) | Head_ID (16bit) | Block_ID (16bit)}.

[0130] f-3. General scenario: UCT can contain application layer pointers such as {Application_Pointer (64bit)}.
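The MoE encoding in f-1 can be illustrated with a minimal Python sketch. The field widths come from the text; the bit order (first-listed field in the most significant bits) and the helper names pack_moe_uct and unpack_moe_uct are assumptions made for illustration only.

```python
# Field widths follow the MoE layout in f-1; bit order is an assumption.
SRC_GPU_BITS, REQUEST_BITS, SEQUENCE_BITS = 32, 16, 16


def pack_moe_uct(src_gpu_id: int, request_id: int, sequence_id: int) -> int:
    """Pack {Src_GPU_ID | Request_ID | Sequence_ID} into one 64-bit UCT."""
    if not 0 <= src_gpu_id < (1 << SRC_GPU_BITS):
        raise ValueError("Src_GPU_ID does not fit in 32 bits")
    if not 0 <= request_id < (1 << REQUEST_BITS):
        raise ValueError("Request_ID does not fit in 16 bits")
    if not 0 <= sequence_id < (1 << SEQUENCE_BITS):
        raise ValueError("Sequence_ID does not fit in 16 bits")
    return (src_gpu_id << 32) | (request_id << 16) | sequence_id


def unpack_moe_uct(uct: int) -> tuple[int, int, int]:
    """Recover the fields on the consumer side; the PMU itself never parses them."""
    return (uct >> 32) & 0xFFFFFFFF, (uct >> 16) & 0xFFFF, uct & 0xFFFF


uct = pack_moe_uct(src_gpu_id=3, request_id=42, sequence_id=7)
```

Only the initiator and the consumer entity agree on this layout; the PMU carries the 64-bit value through unchanged.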

[0131] 4.2 Data Payload

[0132] The data payload carries the actual binary data being transmitted, such as the specific data content of structures like Tensors, Vectors, or KV Blocks. Its length is subject to the following constraints:

[0133] When processing WRITE_DATA type CPs, the data payload length must be consistent with the default allocation granularity of the LDH-mapped TSR, the default allocation granularity of the PMU, or the length indicated by the Data_Size field in the data packet. For example, if the consumer configures the resource pool granularity to 4KB Page and omits the data packet payload length field by default, then the data payload length of each CP should also be 4KB; if the data packet payload length field is set, then the data payload length of each CP should be consistent with the value of that field. For SIGNAL_ONLY or FLUSH type CPs, the data payload length is usually 0. This design helps simplify the location allocation logic and improve throughput.

[0134] Through the above structure, CP can carry rich control information and application metadata with low overhead, meeting the dual requirements of high-performance computing clusters for communication flexibility and efficiency.
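As a rough illustration of the header fields in Section 4.1 and the payload length rule in Section 4.2, the following Python sketch models a CP header and the length the PMU would expect. The class and function names, the numeric opcode values, and the choice of 0 as the default hint are assumptions; the text only names the fields and types.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Opcode(IntEnum):
    # Numeric values are illustrative; the text only names the types.
    WRITE_DATA = 0
    SIGNAL_ONLY = 1
    FLUSH = 2


@dataclass
class CPHeader:
    ldh: int                          # Logical Destination Handle
    opcode: Opcode = Opcode.WRITE_DATA
    alloc_hint: int = 0               # 0 = default policy (assumed encoding)
    data_size: Optional[int] = None   # omitted for fixed-length payloads
    ack_id: Optional[int] = None
    uct: Optional[int] = None


def expected_payload_len(header: CPHeader, alloc_granu: int) -> int:
    """Payload length the PMU expects for this CP."""
    if header.opcode is not Opcode.WRITE_DATA:
        return 0                      # SIGNAL_ONLY / FLUSH: treated here as no payload
    if header.data_size is not None:
        return header.data_size       # variable-length: Data_Size decides
    return alloc_granu                # fixed-length: fall back to the granularity


def payload_is_valid(header: CPHeader, payload: bytes, alloc_granu: int) -> bool:
    return len(payload) == expected_payload_len(header, alloc_granu)
```

With a 4KB granularity, a WRITE_DATA packet that omits Data_Size is valid only with a 4096-byte payload, while one that sets Data_Size to 100 is valid only with 100 bytes.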

[0135] 5. Feedback and Flow Control Mechanism

[0136] As shown in Figure 6, to ensure communication reliability and network stability, this method defines a feedback signal mechanism from the consumer to the initiator. This mechanism can be enabled dynamically or implemented selectively according to the service's reliability requirements or needs such as flow control and congestion management.

[0137] 5.1 Data Structure of Feedback Signal

[0138] A feedback signal is typically a very simple control packet, whose core fields include:

[0139] a. Feedback Signal Index (ACK_ID): This field explicitly informs the initiator which specific data packet the feedback signal is a response to. The value of this field corresponds one-to-one with the ACK_ID field in the original communication data packet (CP) header (see Section 4.1 for definition).

[0140] b. Status Code (Status_Code): Used to inform the initiating end of the reception status of the original CP. Its implementation usually uses short-width encoding, such as 2-bit or 4-bit, and supports at least one of the following types:

[0141] b-1. ACK_REV: Successful reception. Indicates that the consumer PMU has received the complete data packet from the link and that the packet has passed basic checks (such as a CRC check). The data is temporarily stored in the PMU's internal input buffer.

[0142] b-2. ACK_PROC: Successful processing or storage. Indicates that the consumer PMU has successfully resolved the logical destination handle, allocated the dynamic location, written the data completely to the final actual storage location (ADL), and optionally generated a completion record.

[0143] b-3. ACK_RETRY: Reception failed (e.g., verification error), please retransmit. Indicates that the consumer PMU received the data packet from the link but the packet failed the basic verification, so the initiator needs to retransmit it.

[0144] b-4. ACK_ERROR: Critical error (such as invalid LDH or authentication failure), communication terminated. Indicates that a critical error occurred during processing at the consumer PMU, requiring communication to be terminated.

[0145] c. Flow Control Credit (Credit): This is an optional field used for flow control, indicating the remaining available space of the current TSR on the consumer, such as the number of remaining pages.

[0146] d. Congestion Flag (Congestion_Flag): This is an optional field used for flow control, indicating whether the consumer or intermediate network is in a congested state.

[0147] 5.2 Functional Modes of Feedback Signals

[0148] a. Reliability ACK

[0149] After sending a communication data packet (CP), the initiating end retains a copy of it and starts a timer, and the consumer's location management unit (PMU) sends back an acknowledgment (ACK) after receiving or processing the CP. If the initiating end does not receive an ACK within the timeout period, it automatically retransmits the CP. This method can be used for critical control signaling or data transmission where packet loss is intolerable.

[0150] b. Credit-based flow control

[0151] During initialization, the initiating end sets an initial credit value, such as 100 free blocks. For each data packet sent, the initiating end deducts 1 credit and stops transmitting when the credit reaches 0. After processing data and releasing space, the consumer end returns credit through the flow control credit field in the ACK, for example, ACK: Credit +5. This mechanism effectively prevents consumer buffer overflow and achieves backpressure regulation.

[0152] c. Explicit Congestion Notification (ECN)

[0153] When the PMU detects that the internal processing queue depth exceeds the threshold, or detects that the remaining space of the TSR is critically low, it sets the congestion flag in the ACK packet. After receiving the ACK with the congestion flag, the initiating end automatically reduces the sending rate, such as by reducing the sending window or increasing the packet sending interval, to proactively alleviate network pressure.
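The credit-based flow control and congestion notification modes above can be sketched from the initiator's point of view as follows. This is a simplified Python model under stated assumptions: the Ack and Initiator classes, the in-flight window used as the rate knob, and the halving policy on congestion are all hypothetical; the text only requires that the sender stop at zero credit and reduce its rate when the congestion flag is set.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    ack_id: int
    credit: int = 0            # credits returned by the consumer
    congestion: bool = False   # Congestion_Flag


class Initiator:
    def __init__(self, initial_credit: int, window: int):
        self.credit = initial_credit
        self.window = window    # max packets in flight (hypothetical rate knob)
        self.in_flight = 0

    def can_send(self) -> bool:
        return self.credit > 0 and self.in_flight < self.window

    def on_send(self) -> None:
        if not self.can_send():
            raise RuntimeError("blocked by credit or window")
        self.credit -= 1        # one credit per data packet
        self.in_flight += 1

    def on_ack(self, ack: Ack) -> None:
        self.in_flight = max(0, self.in_flight - 1)
        self.credit += ack.credit
        if ack.congestion:
            # Halving is an assumed policy; the text only says "reduce".
            self.window = max(1, self.window // 2)
```

A sender created with 2 credits is blocked after two sends and resumes once an ACK returns credit, which is the backpressure behaviour described in b.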

[0154] 5.3 Piggybacking

[0155] To save bandwidth, if the consumer also has data to send to the initiator, the feedback signal can be embedded (piggyback) in the CP packet header of the reverse transmission, eliminating the need to send a separate control packet. This can significantly improve data transmission efficiency in bidirectional high-frequency communication scenarios such as All-to-All.

[0156] 6. Dynamic location allocation

[0157] The core function of the consumer-side location management unit (PMU) is to execute a complete dynamic location allocation process upon arrival of data packets. This process is logically divided into two closely related sub-steps:

[0158] a. Storage Resource Mapping: First, map the consumer entity identified by the Logical Destination Handle (LDH) in the packet header to the target storage resource (TSR) managed by the consumer.

[0159] b. Specific location allocation: Based on the allocation hint (Allocation_Hint) in the data packet header or the default strategy of PMU and TSR, the actual storage location (ADL) of the data inside the TSR is dynamically calculated.

[0160] The following will explain these two sub-steps accordingly.

[0161] 6.1 Storage Resource Mapping

[0162] After receiving a communication data packet (CP), the PMU extracts the LDH from it. The PMU can determine the corresponding TSR by consulting a data structure, by algorithmic calculation, or through hard-coded logic.

[0163] a. Data Structure Lookup: The PMU maintains a data structure that predefines or stores the association between relevant LDHs and their corresponding target storage resources. This data structure may also contain configuration information of the target storage resources to facilitate subsequent location allocation or notification steps. This section uses Resource Mapping Table (RST) lookup as a typical example for detailed explanation. Figure 7 shows a typical RST structure embodiment.

[0164] The Resource Mapping Table (RST) is maintained by the consumer's PMU, and its physical memory location can be within the PMU or on a specific storage resource outside the PMU. When the PMU receives a data packet carrying a specific LDH, it extracts the LDH field as an index to look up the table and retrieve information about the storage resources that the consumer entity (CE) indicated by the logical destination handle is using or will use. The storage resource is referred to as the Target Storage Resource (TSR), and its information in the RST includes at least:

[0165] a-1. Base Position (Base_Position): The starting position or address of the target storage resource (TSR) mapped by LDH in the consumer's physical memory, such as a specific memory address in HBM, DRAM, or a cache line region in SRAM.

[0166] The information about the storage resources in the RST may optionally include:

[0167] a-2. Allocation Granularity (Alloc_Granu): Defines the minimum unit size for each dynamic location allocation, such as a 4KB page or a 512B block. When the packet header does not indicate the payload length, the packet payload length is matched to this granularity.

[0168] a-3. Allocation Policy and Allocation Policy Support: Instructs the PMU on which algorithm to use to allocate the internal location of storage resources, such as compact appending or discrete block allocation. Some allocation policies may require additional specific storage resource support, such as using registers to store pointers needed for linear appending, or using caches to store a linked list of free blocks needed for discrete block allocation. See the "Dynamic Location Allocation" section below for details.

[0169] a-4. Capacity: The maximum space size allowed to be used by the target storage resource mapped by LDH, used for boundary checks.

[0170] a-5. Notification Policy: Instructs the PMU on how to record a completion record to notify the consumer entity after the write is completed. See the "Completion Notification Mechanism" section below for details.

[0171] b. Algorithm Calculation: The PMU does not query the data structure. Instead, it uses a pre-defined transformation algorithm to directly calculate the corresponding TSR information, taking at least a portion of the LDH as input. For example, if the system stipulates that each consumer entity has a fixed-size and contiguous storage area, the PMU can directly calculate Base_Position = Global_Base_Position + Entity_ID * Fixed_Size. This method eliminates table lookup overhead, has extremely low latency, and is suitable for rule-based scenarios.

[0172] c. Hard-coded logic: The PMU directly converts the LDH into access signals for the corresponding TSR through a hard-coded decoding circuit. For example, in a dedicated ASIC, some bits of the LDH may be directly connected to the chip select signal lines of the on-chip memory bank.
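The first two mapping methods can be sketched in Python as follows. The RST is modeled as a dictionary keyed by LDH, and the algorithmic variant applies the formula Base_Position = Global_Base_Position + Entity_ID * Fixed_Size from b. All concrete addresses, sizes, and the assumption that Entity_ID sits in the low 8 bits of the LDH are illustrative values, not part of the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TSRInfo:
    base_position: int
    alloc_granu: int = 4096
    capacity: int = 0


# (a) Data structure lookup: the RST is modeled as a dict keyed by LDH.
rst = {
    0x10: TSRInfo(base_position=0x1000_0000, alloc_granu=4096, capacity=1 << 20),
    0x11: TSRInfo(base_position=0x2000_0000, alloc_granu=512, capacity=1 << 16),
}


def map_by_table(ldh: int) -> TSRInfo:
    try:
        return rst[ldh]
    except KeyError:
        raise LookupError(f"invalid LDH {ldh:#x}") from None


# (b) Algorithm calculation: one fixed-size contiguous region per entity.
GLOBAL_BASE_POSITION = 0x4000_0000
FIXED_SIZE = 1 << 20


def map_by_formula(ldh: int) -> TSRInfo:
    entity_id = ldh & 0xFF   # assumed: low 8 bits of the LDH hold Entity_ID
    return TSRInfo(
        base_position=GLOBAL_BASE_POSITION + entity_id * FIXED_SIZE,
        capacity=FIXED_SIZE,
    )
```

The table variant can carry per-TSR configuration such as granularity and capacity, while the formula variant trades that flexibility for a lookup-free path.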

[0173] 6.2 Specific Location Allocation

[0174] After determining the TSR and its basic configuration, the PMU calculates the ADL based on a preset allocation strategy (from the PMU or the RST) or an allocation hint in the data packet (from the Allocation_Hint field). To support the aforementioned diverse allocation hints, the PMU can internally implement a set of allocation strategy logic, as shown in Figure 8. Specific implementation methods include at least one of the following or a combination thereof:

[0175] a. Compact Append Strategy: The PMU maintains a runtime context for the TSR that contains an atomically incremented offset pointer, such as a register named Current_Offset, typically initialized to 0. When a packet arrives, the PMU performs an atomic add on this pointer, using either the packet payload length Data_Size or the default payload length Alloc_Granu as the increment, and uses the returned pre-increment value as the current offset. For example, when the data payload length is constant, Current_Offset can record a count of payload-sized units, and the actual storage location can be calculated as: ADL = Base_Position + AtomicAdd(Current_Offset, 1) * Alloc_Granu. Alternatively, it can record the position offset, and the actual storage location can be calculated as: ADL = Base_Position + AtomicAdd(Current_Offset, Alloc_Granu). When the data payload length may vary, Current_Offset records the position offset, and the actual storage location can be calculated as: ADL = Base_Position + AtomicAdd(Current_Offset, Data_Size). Here, Base_Position is the starting storage location of the target storage resource, Data_Size is the data packet payload length, and Alloc_Granu is the default allocation granularity of the target storage resource. This strategy keeps concurrently arriving communication data contiguous in actual storage, eliminates storage gaps, and is suitable for scenarios such as streaming data and variable-length data aggregation.

[0176] b. Scatter Allocation Strategy: The PMU maintains a data structure to manage free storage resources, such as a Free List or Bitmap. This data structure pre-stores the address indices of all available physical blocks or pages within the corresponding TSR. Their physical locations can be inside the PMU or pointed to by pointers at a specific storage location outside the PMU. When a data packet arrives, the PMU retrieves a free item from this data structure. For example, if the data structure is organized as a queue, the PMU pops a free item from the queue. The formula for calculating the actual storage location can be expressed as: ADL = Pop(Free_Queue). This strategy allows data to be stored discretely in any free location in memory, without requiring contiguous physical space. It effectively eliminates memory fragmentation and achieves on-demand allocation, making it particularly suitable for the transmission and storage of fixed-length data blocks (such as KV Cache Blocks).

[0177] c. Hash Strategy: The PMU uses external or internally integrated hash calculation logic, such as CRC32, MurmurHash circuits, or general-purpose computing circuits. The PMU extracts specific information from a specific location in the data packet and calculates its hash value. For example, it extracts a specific key field from the User Context Tag (UCT) field in the packet header, calculates the hash value, and then takes this value modulo the storage capacity. The actual storage location calculation formula can be expressed as: ADL = Base_Position + (Hash(Key_Field) % Capacity) * Block_Size. This strategy enables automatic data sharding or load balancing, distributing data evenly across different storage locations without requiring complex location calculations at the initiating end.

[0178] d. Priority-based strategy: The implementation mechanism involves logically associating each TSR with multiple physical storage regions with different performance characteristics, including but not limited to high-priority regions (such as on-chip SRAM, L1 and L2 caches, and low-latency memory) and normal-priority regions (such as HBM and DDR). The PMU selects based on the priority information in the packet allocation hint. If the hint indicates high priority, the PMU allocates an address from the high-priority region, for example, ADL = Alloc(High_Priority_Pool); otherwise, it allocates from the normal-priority region, for example, ADL = Alloc(Normal_Pool). Alternatively, if the priority information in the packet allocation hint indicates normal or low priority, the PMU allocates from the normal-priority region; otherwise, the PMU allocates an address from the high-priority region. This strategy implements hierarchical storage management, ensuring that critical data (such as control signaling and frequently accessed metadata) resides in the storage tier closest to the computing core with the lowest latency.
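The four strategies above can be sketched in Python as below. This is a software model, not a hardware design: the plain integer update stands in for the PMU's atomic add, CRC32 from zlib stands in for the hash circuit, Capacity in the hash formula is interpreted as a slot count (num_slots), and all class and function names are hypothetical.

```python
import zlib
from collections import deque


class CompactAppend:
    def __init__(self, base_position: int, capacity: int):
        self.base, self.capacity, self.current_offset = base_position, capacity, 0

    def alloc(self, size: int) -> int:
        # Models ADL = Base_Position + AtomicAdd(Current_Offset, size);
        # a hardware PMU would perform the fetch-and-add atomically.
        if self.current_offset + size > self.capacity:
            raise MemoryError("TSR full")
        old = self.current_offset
        self.current_offset += size
        return self.base + old


class ScatterBlocks:
    def __init__(self, base_position: int, block_size: int, num_blocks: int):
        # Free queue pre-filled with the address of every block in the TSR.
        self.free_queue = deque(
            base_position + i * block_size for i in range(num_blocks)
        )

    def alloc(self) -> int:
        if not self.free_queue:
            raise MemoryError("no free block")
        return self.free_queue.popleft()      # ADL = Pop(Free_Queue)

    def free(self, adl: int) -> None:
        self.free_queue.append(adl)


def hash_alloc(base_position: int, key_field: int, num_slots: int, block_size: int) -> int:
    # ADL = Base_Position + (Hash(Key_Field) % Capacity) * Block_Size,
    # with Capacity taken as a number of slots (assumption).
    h = zlib.crc32(key_field.to_bytes(8, "little"))
    return base_position + (h % num_slots) * block_size


def priority_alloc(high_priority: bool, high_pool: ScatterBlocks,
                   normal_pool: ScatterBlocks) -> int:
    # ADL = Alloc(High_Priority_Pool) or Alloc(Normal_Pool).
    return (high_pool if high_priority else normal_pool).alloc()
```

For instance, two appends of 10 and 20 bytes into a TSR based at 0x1000 land at 0x1000 and 0x100A, leaving no gap between them.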

[0179] 6.3 Diversity of Actual Storage Location (ADL) Forms

[0180] This method does not limit the specific physical form of ADL. Depending on the hardware implementation, ADL may include, but is not limited to:

[0181] a. Physical memory address, pointing to the specific byte address of HBM, DDR, or extended memory.

[0182] b. Cache index, which points to the set / way index of the on-chip cache.

[0183] c. Register / Doorbell Address: Points to a specific hardware register or Doorbell area used to trigger signals rather than store large amounts of data.

[0184] Through the above mechanism, the PMU can manage complex memory location allocation details, providing efficient, flexible and location-independent communication and write services for upper-layer applications.

[0185] 7. Discovery and Notification

[0186] As shown in Figure 9, a further main problem this invention needs to solve is how the consumer learns, after data arrives, that the data is ready and where it is stored. To this end, after completing the data writing in step S2, the PMU automatically executes step S3, that is, generates a Completion Record (CR) and notifies the corresponding communication consumer entity. A typical and efficient notification method is to write the CR into a preset communication state (CS) area.

[0187] 7.1 Completion Record (CR)

[0188] A CR is a structured credential used by the PMU to deliver communication results to the consumer entity. It is generated after the data has been stored, and it is designed to ensure that the data can be correctly indexed and used by the consumer entity. As shown in Figure 9(a), its data structure typically includes, but is not limited to, the following fields:

[0189] a. Actual Storage Location (ADL): The address or index where the data payload is ultimately dynamically allocated and written, such as a 64-bit memory address. The consumer entity accesses the data based on this location.

[0190] b. User Context Tag (UCT): An optional field, metadata passed from the CP header, such as a 64-bit user tag. The consumer entity uses this tag to identify the contextual meaning of the data, such as "this is the k-th token from the n-th request on the m-th GPU." By binding ADL to UCT, CR allows the consumer entity to directly obtain the contextual semantics of the data without additional queries.

[0191] 7.2 Notification Mode

[0192] As shown in Figure 9(b), in order to adapt to different types of consumer architectures, such as CPU serial processing or GPU parallel processing, the way the CR is written to the CS region supports multiple configurable modes, and the embodiments include at least one of the following:

[0193] a. Log Queue Mode

[0194] The consumer specifies a memory region, or uses a dedicated memory unit within the PMU, as the CS region for receiving communication status logs. The PMU internally maintains a write pointer (Notification_Write_Pointer). Whenever a new CR is generated, the PMU appends it to the CS region at the offset given by Notification_Write_Pointer, and then the pointer automatically increments. This mode supports pointer wrap-around, so that the CS region forms a circular buffer. In this case, the consumer must read the CRs out of the buffer promptly; otherwise, when the buffer is full, communication data writing and CR generation will stop. This mode is suitable for scenarios where the consumer uses a single-threaded polling or interrupt-driven model and needs to process data strictly in arrival order, such as when the CPU processes data streams from multiple sources.

[0195] b. Direct Mapping Mode

[0196] The consumer designates a memory region as the CS region, recording its starting position as CS_Base. The PMU calculates the offset (CS_Offset) of the completed record according to specific rules. The PMU writes the CR to the position pointed to by CS_Base + CS_Offset. This operation is similar to directly updating a specific entry in a state table.

[0197] In this mode, the calculation of offset CS_Offset includes at least one of the following methods:

[0198] b-1. Explicit Indexing:

[0199] The PMU extracts a predefined bit field (e.g., the lower 16 bits) from the transparently transmitted UCT as an index value, and then calculates the offset based on the index value. For example, CS_Offset = Extract(UCT, Mask) * CR_Size, where CR_Size is the size of the CR in bytes. One scenario where this mode is applicable is that the consumer thread Thread_i processes data Token_i, and the sender places the token sequence number i in the lower bits of the UCT field of the data packet. The consumer-side PMU then uses this UCT field to calculate the offset by explicit indexing, so that the communication status of Token_i is written to the CS[i] position. The consumer thread Thread_i can directly check CS[i], thus achieving lock-free parallel processing.

[0200] b-2. Implicit Location Association:

[0201] The state position can be calculated from the actual data storage location or by using the PMU's built-in increment / decrement pointer CS_Ptr. An address-based example is CS_Offset = (ADL - Data_Base_Position) / Data_Block_Size * CR_Size, where Data_Block_Size is the data payload length Data_Size defined in the packet header or, when the payload length is undefined, the default allocation granularity Alloc_Granu of the target storage resource. A pointer-based example is CS_Offset = CS_Ptr * CR_Size. One scenario where this mode is applicable is that, when data is stored in the 5th block, the state is automatically written to the 5th slot of the state table, and the consumer schedules computation according to this correspondence.
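The two offset calculations of the direct mapping mode can be written out as a short Python sketch. The 16-byte CR layout (a 64-bit ADL followed by a 64-bit UCT), the little-endian packing, and the function names are assumptions for illustration; Extract(UCT, Mask) is modeled as a bitwise AND with a low-bit mask.

```python
import struct

CR_SIZE = 16  # bytes per CR: 8-byte ADL + 8-byte UCT (assumed layout)


def explicit_offset(uct: int, mask: int = 0xFFFF) -> int:
    """CS_Offset = Extract(UCT, Mask) * CR_Size."""
    return (uct & mask) * CR_SIZE


def implicit_offset(adl: int, data_base_position: int, data_block_size: int) -> int:
    """CS_Offset = (ADL - Data_Base_Position) / Data_Block_Size * CR_Size."""
    return (adl - data_base_position) // data_block_size * CR_SIZE


def write_cr(cs: bytearray, cs_offset: int, adl: int, uct: int) -> None:
    """Write one CR at CS_Base + CS_Offset; cs models the CS region."""
    struct.pack_into("<QQ", cs, cs_offset, adl, uct)


cs_region = bytearray(CR_SIZE * 8)
token_uct = (42 << 16) | 7            # token index 7 in the low 16 bits
write_cr(cs_region, explicit_offset(token_uct), adl=0xABC0, uct=token_uct)
```

Here the CR for token 7 lands in slot 7 of the CS region, so a consumer thread responsible for that token only needs to inspect its own slot.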

[0202] 7.3 Variations and Extensions of Notification Mechanisms

[0203] To adapt to the complex collaborative needs of high-performance computing clusters, the notification mechanism of this method is not limited to generating standard completion records, but also supports extended functions that can be integrated with the internal logic of the PMU. These extended functions can be triggered by opcodes (such as SIGNAL_ONLY) or specific user context tags (UCTs) to implement synchronization, statistics, and control functions beyond data transmission, including additional function configuration implementations and variant notification modes such as interrupt triggering and doorbell triggering.

[0204] a. Special function control and register operations

[0205] The PMU internally integrates or connects a set of function configuration registers or special storage locations. These registers map various lightweight control functions of the PMU. When the PMU receives a specific CP (especially of the SIGNAL_ONLY type), it performs configuration or update operations on the aforementioned specific registers or storage locations based on the index specified by the UCT or specific flag bits in the data packet. These operations can trigger the PMU to start, stop, or execute specific predefined functions without generating a regular completion record. Specific embodiments include, but are not limited to:

[0206] a-1. Distributed Counting and Synchronization. The PMU uses internal storage units as atomic counters. Whenever a specific signal packet is received, the PMU performs an accumulation operation at the corresponding position, such as Counter[ID]++. This can be used to count the number of received data packets, such as counting the amount of data received in aggregated communication, or detecting the arrival of specific key data packets. When the count value reaches a preset threshold, the PMU can automatically trigger subsequent actions, such as interrupt notification or CR writing, thereby realizing distributed lightweight task synchronization and control.

[0207] a-2. Function bit toggling. The PMU performs set, reset, interrupt, or other operations on specific registers or storage locations to enable or disable specific functional units, data stream processing channels, or mark the completion of a certain computation stage.

[0208] b. Chained triggering and event cascading

[0209] The PMU internally maintains an event trigger table. When the PMU generates a CR or completes a specific write operation, it queries the trigger table and automatically initiates the next hardware operation. For example, the PMU can be configured in interrupt-triggered mode, sending an interrupt signal to the consumer processor every time i CRs are written, or when the accumulated write volume in the CS region reaches a specific threshold. Alternatively, the PMU can be configured in doorbell-triggered mode, automatically sending a signal to the next-level unit (such as the GPU's Command Dispatcher or DMA engine) after writing a CR, waking up sleeping waiting threads.

[0210] c. Error Reporting and Status Rewrite

[0211] The area to which CRs are written, such as the preset CS area, can also serve as an error log area. When an allocation failure (such as a full TSR), a verification error, or a permission violation occurs, the PMU generates a special CR (carrying an error code instead of an ADL) and writes it to a predefined error status slot or triggers a system interrupt.
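The distributed counting of a-1, combined with a threshold-triggered follow-up action as in b, can be modeled with the following Python sketch. The SignalCounter class and the callback standing in for an interrupt or CR write are hypothetical; a real PMU would hold the counters in registers or internal storage.

```python
from collections import defaultdict
from typing import Callable


class SignalCounter:
    def __init__(self, threshold: int, on_threshold: Callable[[int], None]):
        self.threshold = threshold
        self.on_threshold = on_threshold      # stands in for interrupt / CR write
        self.counter = defaultdict(int)

    def on_signal(self, counter_id: int) -> None:
        self.counter[counter_id] += 1         # Counter[ID]++
        if self.counter[counter_id] == self.threshold:
            self.on_threshold(counter_id)     # fire once, when the threshold is reached


fired = []
pmu_counter = SignalCounter(threshold=3, on_threshold=fired.append)
for _ in range(3):
    pmu_counter.on_signal(counter_id=5)
```

After the third signal packet for counter 5, the callback fires exactly once, which is how a consumer could be told that an aggregated transfer of three packets has completed.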

[0212] 7.4 Metadata Transmission Mechanism

[0213] This invention achieves efficient integration of application semantics and underlying transmission through an optional metadata pass-through mechanism:

[0214] a. In step S1, the initiating end fills the key information of the upper layer application (such as request number, sequence number, weight parameters, etc.) into UCT.

[0215] b. In step S2, the PMU treats it as "black box" data and does not parse it.

[0216] c. In step S3, the PMU includes the UCT along with the ADL in the Completion Record (CR). When notification is made using the CS area, the CR is written to a specific location within the CS area.

[0217] Under this mechanism, when the communication consumer reads the CR, it not only obtains where the data is (ADL), but also immediately obtains the context semantics (UCT) of what the data is, without needing to recover the context through additional lookup tables or out-of-band queries. This greatly simplifies software stack design and reduces end-to-end latency.

[0218] 8. Communication process

[0219] As shown in Figure 1, combining the various mechanisms described in detail in Sections 1-7 above, the complete process of the location-independent communication method proposed in this invention is summarized as follows. This process shows how data is transformed from a logical intent into a physical placement and how the consumer is ultimately notified.

[0220] a. Step S1: Send step

[0221] The initiating end generates and sends a communication data packet (CP). At this stage, the initiating end does not need to query or know the internal storage layout of the consumer end; it only needs to determine the logical target, that is, the consumer entity, and encode it as a Logical Destination Handle (LDH). Then, depending on the specific implementation and application requirements, it optionally fills in the following packet header fields: an opcode (e.g., WRITE_DATA); an allocation hint (e.g., Append); a user context tag (UCT), such as {RequestID | SequenceID}, used to carry upper-layer business semantics; a data payload length (Data_Size), indicating the data size in the case of variable-length data payloads; and a feedback signal index (ACK_ID), used to generate the feedback signal sent from the consumer end to the initiating end. Finally, the initiating end injects the CP into the network, completing the initiating end's communication operation.

[0222] b. Step S2: Location Allocation and Writing Steps

[0223] The communication data packet (CP) arrives at the consumer via the network and is intercepted and processed by the location management unit (PMU). The PMU first parses the Logical Destination Handle (LDH) in the packet header and performs storage resource mapping. The PMU obtains the configuration of the Target Storage Resource (TSR) corresponding to the LDH (such as starting storage location, capacity, allocation granularity and strategy, notification strategy, etc.) by consulting data structures (e.g., the Resource Mapping Table, RST), executing algorithms, or utilizing hard-coded logic. Next, based on the Allocation_Hint in the packet header or the default configuration, the PMU executes a dynamic allocation algorithm to determine the actual storage location (ADL) of the data: under the default strategy, it follows the strategy configured in the RST; under the append strategy, it calculates the location using an incrementing offset pointer; under the scattered block strategy, it obtains a free block from the data structure managing storage resources; under the hash strategy, it performs a hash calculation to obtain the bucket; under the hierarchical priority strategy, it allocates the data packet to different storage resources according to its priority. Finally, the PMU writes the CP's data payload into the calculated ADL via the chip's internal bus. This process is transparent to upper-layer software and is completed automatically by the PMU logic. Depending on the specific implementation of the feedback signal, the PMU can generate the feedback signal ACK after parsing the CP packet header or after the data payload has been stored, in order to inform the communication initiator of the communication status and to implement flow control.

[0224] c. Step S3: Notification Step

[0225] After data storage is complete, the PMU generates a completion record (CR) and notifies the consumer entity. The PMU combines the ADL obtained in step S2 with the UCT from step S1, if present, which is passed through unchanged, to generate the CR. Next, the PMU executes a notification action according to a preset notification strategy (such as log queue mode or direct mapping mode). For example, it calculates the write position of the CR in a communication state (CS) area and writes the CR to that position. Optionally, the PMU triggers an interrupt or doorbell to wake up the consumer. Finally, the communication consumer reads the CS area, either by polling it or after being notified or woken up, and obtains the CR. It thereby learns both the actual storage location of the data and the data's context information, and can begin subsequent computation.
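
As a non-limiting illustration, the two notification strategies of step S3 can be sketched in Python as follows. The names `CompletionRecord` and `CommStateArea`, the slot-based CS area, and the choice of the low 16 bits of the UCT as the direct mapping index are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CompletionRecord:
    adl: int    # actual storage location chosen by the PMU in step S2
    uct: int    # user context label, passed through without interpretation


class CommStateArea:
    """CS area holding completion records in one of two modes."""

    def __init__(self, slots: int, mode: str, index_bits: int = 16):
        self.slots: List[Optional[CompletionRecord]] = [None] * slots
        self.mode = mode
        self.index_mask = (1 << index_bits) - 1
        self.tail = 0    # next write position (log queue mode)
        self.head = 0    # next read position (log queue mode)

    def notify(self, cr: CompletionRecord) -> int:
        """Write the CR and return the slot it was written to."""
        if self.mode == "log_queue":
            if self.tail - self.head == len(self.slots):
                raise BufferError("CS log queue full")
            pos = self.tail % len(self.slots)
            self.tail += 1
        elif self.mode == "direct_mapping":
            # Assumed convention: a predefined bit field of the UCT is the index.
            pos = cr.uct & self.index_mask
        else:
            raise ValueError(f"unknown notification mode: {self.mode}")
        self.slots[pos] = cr
        return pos

    def poll(self) -> List[CompletionRecord]:
        """Log queue mode: drain all records written since the last poll."""
        out = []
        while self.head < self.tail:
            out.append(self.slots[self.head % len(self.slots)])
            self.head += 1
        return out
```

Log queue mode suits a consumer that processes records in arrival order and in batches; direct mapping mode suits a consumer that already knows which slot to watch. An interrupt or doorbell is not modeled here.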

[0226] 9. Examples of specific application scenarios

[0227] To illustrate the application value of this method in high-performance computing more intuitively, two typical scenarios in Large Language Models (LLMs) are used below as examples to describe in detail how the method is used: MoE All-to-All communication and dynamic migration of the KV Cache.

[0228] 9.1 Example 1: All-to-All Token Distribution in the MoE (Mixture of Experts) Model

[0229] In MoE model training/inference, the embedding layer generates a large number of tokens, each of which needs to be routed to a specific "expert" located on a different GPU for processing. This is a typical many-to-many communication pattern, and the number of tokens received by each expert changes dynamically. With the location-independent communication method defined in this invention, the initiating end no longer incurs the overhead of global memory address synchronization or of querying the consumer's memory offset, while the data remains contiguous within the consumer's memory, greatly improving the execution efficiency of MoE.

[0230] a. Method Configuration

[0231] a-1. Consumer-side target storage resources: Each expert buffer is configured as a TSR, associated with entries in the RST through the LDH, and Handle_Expert_N points to the resource pool of the Nth expert.

[0232] a-2. Dynamic location allocation strategy: Compact append strategy, which ensures that the received tokens are stored contiguously in GPU memory and thus benefits the subsequent expert matrix multiplication.

[0233] a-3. Notification strategy: Log Queue Mode, which facilitates batch processing by expert kernels, combined with a SIGNAL_ONLY-type CP that indicates the end of communication.

[0234] b. Communication process

[0235] S1: The source GPU sends the CP. The LDH is Handle_Expert_3, meaning the data packet is sent to the 3rd expert. The Opcode is WRITE_DATA, the Allocation_Hint is Append, the ACK_ID is {Request_ID | Token_Seq_ID}, and the UCT is {Src_GPU_ID | Request_ID | Token_Seq_ID}. The UCT is used to route the result back to the original source after the expert's computation is complete. The payload is the token's embedding vector data.

[0236] S2: The target GPU's PMU receives the CP and returns an ACK. The PMU looks up the TSR in a table and, according to the append strategy, calculates the ADL by atomically incrementing the Current_Offset it maintains internally. The PMU writes the Token vector into this ADL. Due to the append strategy, tokens from different source GPUs will be closely packed with no storage gaps.

[0237] S3: The PMU generates CR {ADL, UCT} and appends it to the CS log queue. The expert computation kernel starts, reads a set of CRs in batches from the CS queue, and obtains a batch of continuous token data and its corresponding source information. The kernel directly performs calculations on this batch of data, and the expert computation results are returned to the source GPU using the pass-through information in UCT.
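
As a non-limiting illustration, the MoE flow above can be sketched in Python as follows. The UCT bit widths, the fixed token size, and the names `pack_uct`, `unpack_uct`, and `ExpertBuffer` are assumptions of this sketch. The offset update is atomic in the PMU; in this single-threaded model it is an ordinary increment.

```python
def pack_uct(src_gpu, request_id, token_seq):
    # Assumed widths: 8-bit Src_GPU_ID | 24-bit Request_ID | 32-bit Token_Seq_ID
    return (src_gpu << 56) | (request_id << 32) | token_seq


def unpack_uct(uct):
    return uct >> 56, (uct >> 32) & 0xFFFFFF, uct & 0xFFFFFFFF


class ExpertBuffer:
    """TSR of one expert: compact append allocator plus a CS log queue."""

    def __init__(self, capacity, token_bytes):
        self.data = bytearray(capacity)
        self.token_bytes = token_bytes     # every payload is assumed this size
        self.current_offset = 0
        self.cs_queue = []

    def receive(self, payload, uct):
        if len(payload) != self.token_bytes:
            raise ValueError("unexpected token size")
        if self.current_offset + len(payload) > len(self.data):
            raise MemoryError("expert buffer full")
        adl = self.current_offset                     # S2: append strategy
        self.current_offset += len(payload)
        self.data[adl:adl + len(payload)] = payload
        self.cs_queue.append((adl, uct))              # S3: CR {ADL, UCT}

    def drain(self):
        """Return the contiguous token bytes and the source of each token."""
        records, self.cs_queue = self.cs_queue, []
        if not records:
            return b"", []
        start = records[0][0]
        end = records[-1][0] + self.token_bytes
        sources = [unpack_uct(uct) for _, uct in records]
        return bytes(self.data[start:end]), sources


expert3 = ExpertBuffer(capacity=64, token_bytes=4)
expert3.receive(b"AAAA", pack_uct(0, 1, 10))   # token from source GPU 0
expert3.receive(b"BBBB", pack_uct(2, 1, 11))   # token from source GPU 2
batch, sources = expert3.drain()
# batch is one gap-free run of tokens; sources tells the kernel where
# each result has to be returned.
```

Because the CRs are drained in the order they were appended, the i-th record in `sources` describes the i-th token in `batch`.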

[0238] 9.2 Example 2: Dynamic Migration of KV Cache

[0239] In long-text inference, the KV cache (key-value cache) not only consumes a huge amount of GPU memory but also needs to be migrated dynamically between devices, for example when moving from CPU to GPU or when load balancing between different GPUs. KV caches are typically managed in units of fixed-size blocks, such as groups of 16 tokens. With the location-independent communication method defined in this invention, "zero-copy" dynamic allocation of the KV cache can be achieved without CPU intervention in memory management. At the same time, direct mapping notifications enable large-scale lock-free parallel computation on the GPU.

[0240] a. Method configuration:

[0241] a-1. Target storage resources for the consumer: Configure a TSR for storing KV Cache, with its LDH being Handle_KVCache_Pool.

[0242] a-2. Dynamic location allocation strategy: Scatter (discrete block) allocation strategy, which uses a free list to manage GPU memory pages and eliminate fragmentation.

[0243] a-3. Notification strategy: Direct Mapping mode with explicit indexing, in which the Block_ID directly indexes the state table.

[0244] b. Communication process:

[0245] S1: The source node sends a CP. The LDH is Handle_KVCache_Pool, the Opcode is WRITE_DATA, the Allocation_Hint is Scatter, the ACK_ID is {1024}, and the UCT is {... | Block_ID=1024}, explicitly indicating that this is Block 1024. The payload is 4KB of KV Block data.

[0246] S2: The target GPU's PMU receives the CP and returns an ACK. The PMU looks up the TSR in its table and, according to the scatter (discrete block) strategy, pops a physical page address Addr_P (i.e., the ADL) from the Free List it maintains. Then, the PMU writes the KV data to Addr_P.

[0247] S3: The PMU extracts Block_ID=1024 from the UCT. Using explicit indexing, the PMU calculates the offset Offset = 1024 * CR_Size and therefore writes CR {Addr_P, ...} into entry 1024 of the CS region. The thread responsible for this block in the Attention calculation kernel directly checks entry 1024 of the CS region. After finding the state update, it reads the physical page pointed to by Addr_P to participate in the attention calculation.
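
As a non-limiting illustration, the KV Cache flow above can be sketched in Python as follows. The CR layout (a ready flag followed by Addr_P), the position of Block_ID in the low 16 bits of the UCT, and the name `KVCachePool` are assumptions of this sketch; the remaining CR and UCT fields are left out.

```python
import struct

PAGE_SIZE = 4096               # 4 KB KV block, as in the example above
CR = struct.Struct("<QQ")      # assumed CR layout: ready flag (u64), Addr_P (u64)
BLOCK_ID_MASK = 0xFFFF         # assumed: Block_ID occupies the low 16 bits of UCT


class KVCachePool:
    def __init__(self, num_pages, num_blocks):
        self.memory = bytearray(num_pages * PAGE_SIZE)
        # Free List of physical page addresses; page 0 is popped first.
        self.free_list = [p * PAGE_SIZE for p in reversed(range(num_pages))]
        # CS region: one CR slot per Block_ID, all initially "not ready".
        self.cs = bytearray(num_blocks * CR.size)

    def receive(self, uct, payload):
        if len(payload) > PAGE_SIZE:
            raise ValueError("payload larger than one page")
        addr_p = self.free_list.pop()                   # S2: pop a free page
        self.memory[addr_p:addr_p + len(payload)] = payload
        block_id = uct & BLOCK_ID_MASK                  # S3: explicit index
        offset = block_id * CR.size                     # Offset = Block_ID * CR_Size
        CR.pack_into(self.cs, offset, 1, addr_p)
        return addr_p

    def lookup(self, block_id):
        """What the attention thread for block_id does: check its own slot."""
        ready, addr_p = CR.unpack_from(self.cs, block_id * CR.size)
        return addr_p if ready == 1 else None


pool = KVCachePool(num_pages=4, num_blocks=2048)
before = pool.lookup(1024)                      # slot not yet written
addr = pool.receive(uct=1024, payload=b"\x07" * PAGE_SIZE)
after = pool.lookup(1024)                       # now yields Addr_P
```

Each block has its own CS slot, so threads watching different blocks never contend for the same entry, which is what allows the lock-free parallel check described above.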

[0248] In summary, by decoupling the communication initiation mechanism from the consumer-side storage management, and combining a flexible pass-through mechanism with mapping, allocation, and notification strategies, this invention significantly improves the communication efficiency, programmability, and system resilience of computing clusters under dynamic loads.

[0249] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.

Claims

1. A location-independent communication method for a computing cluster, characterized in that it includes the following steps: Step S1: the communication initiator sends a communication data packet to the communication consumer based on a logical destination handle; the communication consumer includes a communication consumer entity, which is an abstract logical unit inside the communication consumer that represents the end user of the communication data; the logical destination handle identifies the communication consumer entity so as to indicate the logical relationship between the data and the communication consumer entity, but does not specify the actual storage location of the data payload in the target storage resource of the consumer end; Step S2: the location management unit at the consumer end receives the communication data packet and executes a dynamic location allocation process, which includes: parsing the logical destination handle and mapping the communication consumer entity to the target storage resource managed by the communication consumer; and, within the target storage resource, autonomously determining and allocating an actual storage location for the data payload and writing the data payload to that location; Step S3: after the data is stored, the location management unit generates a completion record containing the actual storage location and notifies the communication consumer entity of the record according to a preset strategy, so that the entity can associate the data content with the storage location.

2. The location-independent communication method for a computing cluster according to claim 1, wherein the encoding of the logical destination handle includes one or more fields used for routing and resource mapping, the fields including: a target device index, a task or process index, a communication queue index, or an operator instance index.

3. The location-independent communication method for a computing cluster according to claim 1, wherein the step of parsing the logical destination handle and mapping the communication consumer entity to the target storage resource managed by the communication consumer is implemented in at least one of the following ways: querying a predefined data structure that stores the association between the logical destination handle and the target storage resource; executing a preset conversion algorithm that takes a portion of the logical destination handle as input to calculate the address or index of the target storage resource; or converting the logical destination handle directly into an access signal for the target storage resource through a hard-coded decoding circuit.

4. The location-independent communication method for computing clusters according to claim 1, characterized in that the method also includes a feedback process: the communication data packet sent by the communication initiator contains a feedback signal index; after receiving or processing the communication data packet, the location management unit sends a feedback signal containing the index back to the communication initiator based on the processing result, so as to confirm the data reception or processing status to the initiator and to implement flow control or congestion avoidance mechanisms.

5. The location-independent communication method for computing clusters according to claim 1, characterized in that the communication data packet contains a user context label and a metadata pass-through mechanism is executed: in step S1, the communication initiator fills user-defined metadata into the user context label; in step S2, the location management unit does not parse or modify the content of the user context label; in step S3, the location management unit includes the user context label in the completion record and notifies the communication consumer entity.

6. The location-independent communication method for computing clusters according to claim 1, characterized in that the communication data packet contains an opcode field indicating the transmission type, which includes at least one of the following: write-data type, indicating that the data packet carries a data payload and that step S2 needs to be executed; signal-only type, indicating that the data packet does not carry a valid data payload and is used to transmit logical control events or synchronization signals on the communication link; flush type, indicating that the data packet is used to enforce consistency and visibility of the storage system.

7. The location-independent communication method for computing clusters according to claim 1, characterized in that the communication data packet contains an allocation hint field; the location management unit adopts a corresponding allocation strategy based on this field, and the allocation strategy includes at least one of the following: default strategy: follows the default allocation strategy configuration; compact append strategy: writes data payloads sequentially to the target storage resource; discrete block allocation strategy: extracts the address of a free block or page from the data structure that organizes discrete blocks and uses it as the actual storage location; hash strategy: uses hash mapping to achieve automatic bucketing or load balancing of data storage; hierarchical priority strategy: allocates data of different importance to storage areas with different priorities or different latencies.

8. The location-independent communication method for computing clusters according to claim 1, characterized in that the notification in step S3 is implemented in the following way: the location management unit writes the completion record into the communication state area managed by the consumer end; the writing method is either log queue mode or direct mapping mode; when the direct mapping mode is used and the communication data packet contains a user context label, the write offset is calculated as follows: a predefined bit field is extracted from the user context label as an index value from which the offset is calculated; or the corresponding offset is calculated based on the offset of the actual storage location relative to the base address of the target storage resource.

9. A data structure for implementing a communication data packet according to any one of claims 1 to 8, characterized in that it includes: a data packet header, which contains a logical destination handle that indicates the communication consumer entity, and / or contains an opcode, an allocation hint field, a data length field, a feedback signal index field, and a user context label; and a data payload, which contains binary data whose length is consistent with that indicated by the data length field or with the preset allocation granularity.

10. A computing system applying the method of any one of claims 1 to 8, characterized in that it includes: a communication initiator, configured to generate and send communication data packets; a location management unit, located at the communication consumer end and configured to execute the dynamic location allocation process and generate a completion record; a target storage resource, managed by the communication consumer and used to store the written data; and a communication consumer, configured to receive the completion record and access the data accordingly; wherein the location management unit can be implemented as a dedicated hardware circuit, a programmable logic device, software / firmware logic, or any combination thereof.