A memory expansion card and system
By employing an architecture with multiple processing units and memory expansion cards, and utilizing high-bandwidth interconnects and graph access engine circuitry, the problem of low memory access efficiency in graph neural networks is solved, enabling efficient processing of unstructured data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2022-07-01
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies struggle to efficiently process and store unstructured graph data, especially in graph neural network processing, where low memory access efficiency leads to system performance bottlenecks.
It adopts an architecture with multiple processing units and storage expansion cards, connects the processing units and storage expansion cards through high-bandwidth interconnect units, and uses graph access engine circuits to accelerate memory access, realize node sampling and attribute data extraction, and improve the processing efficiency of graph neural networks.
It improves memory access efficiency for graph neural network processing, reduces system performance bottlenecks, and enhances the ability to process unstructured data.
Figure CN117389921B_ABST
Description
Technical Field
[0001] This disclosure generally relates to the field of artificial intelligence technology, and in particular to a storage expansion card and system.
Background Technology
[0002] While traditional deep learning models excel at pattern recognition and data mining by capturing hidden patterns in Euclidean data (such as images, text, and videos), graph neural networks (GNNs) have extended the capabilities of machine learning to non-Euclidean domains represented by graphs, where objects have complex relationships and interdependencies. Research shows that GNNs outperform current state-of-the-art technologies for applications ranging from molecular reasoning to community detection.
[0003] Graph neural networks (GNNs) are highly effective models for modeling and processing unstructured data. Recently, GNNs have been increasingly used in applications such as recommender systems and risk control systems. However, graph data is unstructured; therefore, accessing graph data may result in random access to memory.
Summary of the Invention
[0004] According to one aspect, a system includes a first processing unit, a second processing unit, a third processing unit, a fourth processing unit, a first storage expansion card, a second storage expansion card, a third storage expansion card, and a fourth storage expansion card. Each of the first, second, third, and fourth processing units is configured to perform graph neural network (GNN) processing. Each of the first, second, third, and fourth storage expansion cards is configured to store graph data for GNN processing. Each of the first, second, third, and fourth processing units is communicatively coupled to each of the first, second, third, and fourth storage expansion cards via a first type of interconnect unit; the first processing unit is communicatively coupled to each of the third and fourth processing units via a first type of interconnect unit; the second processing unit is communicatively coupled to each of the third and fourth processing units via a first type of interconnect unit; the first processing unit is communicatively coupled to the second processing unit via two second type of interconnect units; the third processing unit is communicatively coupled to the fourth processing unit via two second type of interconnect units; the first storage expansion card is communicatively coupled to each of the second and third storage expansion cards via a third type of interconnect unit; the fourth storage expansion card is communicatively coupled to each of the second and third storage expansion cards via a third type of interconnect unit; and each of the first, second, third, and fourth storage expansion cards includes graph access engine circuitry for accelerating memory access in GNN processing.
[0005] In some embodiments, the bandwidth of the second type of interconnect unit is half the bandwidth of the first type of interconnect unit.
[0006] In some embodiments, the form factor of the first type of interconnect unit is two QSFP-DD ports, and the bandwidth of the two QSFP-DD ports is equal to or greater than 100 GB/s.
[0007] In some embodiments, the form factor of the second type of interconnect unit is four Mini-SAS ports, the bandwidth of the four Mini-SAS ports being equal to or greater than 50 GB/s; the first processing unit is communicatively coupled to the second processing unit in parallel through two second type interconnect units; and the third processing unit is communicatively coupled to the fourth processing unit in parallel through two second type interconnect units.
[0008] In some embodiments, the form factor of the third type of interconnect unit is a QSFP-DD port, and the bandwidth of the QSFP-DD port is equal to or greater than 50 GB/s.
[0009] In some embodiments, each of the plurality of storage expansion cards is also used to perform data conversion between local memory operations and packet transmissions via one or more interconnect units of a first or second type.
[0010] In some embodiments, each of the plurality of storage expansion cards includes a switch for performing data bypass on data received from one or more storage expansion cards via one or more third-type interconnect units.
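The card-to-card links listed in the summary form a ring: the first and fourth storage expansion cards each connect to the second and third cards, but not to each other. The following is a minimal sketch of how a request might be relayed across that ring, assuming (this is an interpretation, not something the text states) that the data bypass performed by a card's switch forwards traffic destined for a card that has no direct link to the sender. The names `CARD_LINKS` and `card_route` are hypothetical.

```python
from collections import deque

# Third-type links between storage expansion cards, as listed in the summary:
# card 1 <-> cards 2 and 3, and card 4 <-> cards 2 and 3.
CARD_LINKS = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4],
    4: [2, 3],
}

def card_route(src, dst):
    """Return a shortest hop-by-hop route between two cards (breadth-first search)."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in CARD_LINKS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route exists

print(card_route(1, 2))  # cards 1 and 2 share a direct link
print(card_route(1, 4))  # cards 1 and 4 need one intermediate card (assumed bypass)
```

Under this reading, any card can reach any other card in at most two hops.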
[0011] In some embodiments, the graph access engine circuitry is further configured to: extract partial structural data of a graph from one or more storage expansion cards; perform node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extract partial attribute data of the graph from one or more storage expansion cards based on the selected one or more sampling nodes; and send the extracted partial attribute data of the graph to one or more processing units.
[0012] In some embodiments, each processing unit is further configured to perform graph neural network (GNN) processing on the graph using the extracted attribute data.
[0013] In some embodiments, each storage expansion card is implemented on a field programmable gate array (FPGA).
[0014] According to another aspect, a system includes: a plurality of processing units, each processing unit being configured to perform graph neural network (GNN) processing; and a plurality of memory expansion cards, each memory expansion card being configured to store graph data for GNN processing, wherein each of the plurality of processing units is communicatively coupled to other processing units via one or more interconnecting units; the plurality of processing units are communicatively coupled to the plurality of memory expansion cards, and each of the plurality of memory expansion cards includes graph access engine circuitry for accelerating memory access during GNN processing.
[0015] In some embodiments, the plurality of processing units include a first processing unit, a second processing unit, a third processing unit, and a fourth processing unit; the plurality of storage expansion cards include a first storage expansion card, a second storage expansion card, a third storage expansion card, and a fourth storage expansion card; each of the first processing unit, the second processing unit, the third processing unit, and the fourth processing unit is communicatively coupled to each of the first storage expansion card, the second storage expansion card, the third storage expansion card, and the fourth storage expansion card via a first type of interconnection unit; the first processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via a first type of interconnection unit; the second processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via a first type of interconnection unit; the first processing unit is communicatively coupled to the second processing unit via two second type of interconnection units; and the third processing unit is communicatively coupled to the fourth processing unit via two second type of interconnection units.
[0016] In some embodiments, the form factor of the first type of interconnect unit is two QSFP-DD ports, the bandwidth of which is equal to or greater than 100 GB/s; the form factor of the second type of interconnect unit is four Mini-SAS ports, the bandwidth of which is equal to or greater than 50 GB/s; the first processing unit is communicatively coupled to the second processing unit in parallel through two second type interconnect units, and the third processing unit is communicatively coupled to the fourth processing unit in parallel through two second type interconnect units.
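The processing-unit topology and bandwidth figures in these embodiments can be checked with a short calculation. The sketch below uses the lower bounds stated in the text (100 GB/s and 50 GB/s) and assumes that two parallel second-type links add their bandwidths; that additivity is an assumption for illustration, and the names `PPU_LINKS` and `pair_bandwidth` are hypothetical.

```python
from itertools import combinations

FIRST_TYPE_GB_PER_S = 100   # stated lower bound for two QSFP-DD ports
SECOND_TYPE_GB_PER_S = 50   # stated lower bound for four Mini-SAS ports

# (unit_a, unit_b) -> bandwidths of the links joining that pair of processing units.
PPU_LINKS = {
    (1, 3): [FIRST_TYPE_GB_PER_S],
    (1, 4): [FIRST_TYPE_GB_PER_S],
    (2, 3): [FIRST_TYPE_GB_PER_S],
    (2, 4): [FIRST_TYPE_GB_PER_S],
    (1, 2): [SECOND_TYPE_GB_PER_S, SECOND_TYPE_GB_PER_S],  # two parallel links
    (3, 4): [SECOND_TYPE_GB_PER_S, SECOND_TYPE_GB_PER_S],  # two parallel links
}

def pair_bandwidth(a, b):
    """Aggregate lower-bound bandwidth between two units (0 if not directly linked)."""
    key = (min(a, b), max(a, b))
    return sum(PPU_LINKS.get(key, []))

# Lower-bound bandwidth for every pair of the four processing units.
min_bandwidths = {
    pair: pair_bandwidth(*pair) for pair in combinations(range(1, 5), 2)
}
print(min_bandwidths)
```

Under these assumptions, every pair of processing units is directly connected with the same aggregate lower bound, which is consistent with the second type of interconnect unit having half the bandwidth of the first.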
[0017] In some embodiments, each of the first processing unit, the second processing unit, the third processing unit, and the fourth processing unit includes a switch for performing data bypass on data received from one or more other processing units via one or more interconnection units of the first or second type.
[0018] In some embodiments, the graph access engine circuitry is further configured to: extract partial structural data of a graph from one or more storage expansion cards; perform node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extract partial attribute data of the graph from one or more storage expansion cards based on the selected one or more sampling nodes; and send the extracted partial attribute data of the graph to one or more processing units.
[0019] In some embodiments, each storage expansion card is implemented on a field-programmable gate array (FPGA).
[0020] According to another aspect, a storage expansion card includes: one or more memories for storing graph data for graph neural network (GNN) processing; a first type of interconnect unit for connecting the storage expansion card to a processing unit for performing GNN processing; two second type of interconnect units for connecting the storage expansion card to two other storage expansion cards; and graph access engine circuitry for: extracting partial structural data of a graph from the one or more memories or the two other storage expansion cards; performing node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extracting partial attribute data of the graph from the one or more memories or the two other storage expansion cards based on the selected one or more sampling nodes; and sending the extracted partial attribute data of the graph to the processing unit through the first type of interconnect unit, wherein the bandwidth of each of the two second type of interconnect units is half the bandwidth of the first type of interconnect unit.
[0021] In some embodiments, the processing unit is communicatively coupled to other processing units via a first type of interconnection unit or two third type interconnection units.
[0022] In some embodiments, the form factor of the first type of interconnect unit is two QSFP-DD ports with a bandwidth equal to or greater than 100 GB/s; the form factor of the second type of interconnect unit is one QSFP-DD port with a bandwidth equal to or greater than 50 GB/s; and the form factor of the third type of interconnect unit is four Mini-SAS ports with a bandwidth equal to or greater than 50 GB/s.
[0023] According to another aspect, a method is provided, comprising: extracting, via access engine circuitry in a memory expansion card, partial structural data of a graph from one or more memories in the memory expansion card or from two other memory expansion cards, wherein the memory expansion card is communicatively coupled to a processing unit for graph neural network (GNN) processing via a first type of interconnect unit, and communicatively coupled to each of the two other memory expansion cards via a second type of interconnect unit; performing, via the access engine circuitry, node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extracting, via the access engine circuitry, partial attribute data of the graph from the one or more memories or the two other memory expansion cards based on the selected one or more sampling nodes; sending, via the access engine circuitry, the extracted partial attribute data of the graph to the processing unit through the first type of interconnect unit; and performing, by the processing unit, GNN processing on the graph using the extracted partial attribute data of the graph.
[0024] In some embodiments, the processing unit is communicatively coupled to other processing units via a first type of interconnect unit or two third type interconnect units. In some embodiments, the form factor of the first type of interconnect unit is two QSFP-DD ports with a bandwidth equal to or greater than 100 GB/s; the form factor of the second type of interconnect unit is one QSFP-DD port with a bandwidth equal to or greater than 50 GB/s; and the form factor of the third type of interconnect unit is four Mini-SAS ports with a bandwidth equal to or greater than 50 GB/s.
[0025] In some embodiments, the method further includes performing data bypass on data received from each of the other two storage expansion cards via a switch in the storage expansion card.
[0026] In some embodiments, the memory expansion card is implemented on a field-programmable gate array (FPGA).
[0027] According to another aspect, a non-transitory computer-readable storage medium stores instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: extracting, via access engine circuitry in a memory expansion card, partial structural data of a graph from one or more memories in the memory expansion card or from two other memory expansion cards, wherein the memory expansion card is communicatively coupled to a processing unit for graph neural network (GNN) processing via interconnects of a first type and to each of the two other memory expansion cards via interconnects of a second type; performing, via the access engine circuitry, node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extracting, via the access engine circuitry, partial attribute data of the graph from the one or more memories or the two other memory expansion cards based on the selected one or more sampling nodes; sending, via the access engine circuitry, the extracted partial attribute data of the graph to the processing unit through the interconnects of the first type; and having the processing unit perform GNN processing on the graph using the extracted partial attribute data of the graph.
Description of the Drawings
[0028] These and other features of the systems, methods, and hardware devices of this disclosure, as well as the operational methods and functions of the related elements and component combinations of the structure, and the economics of manufacture, will become more apparent upon consideration of the following description and appended claims with reference to the accompanying drawings, which form part of this specification, wherein similar reference numerals denote corresponding portions in the drawings. However, it should be understood that the drawings are for illustration and description only and are not intended to be a definition of limitations of this disclosure.
[0029] Figure 1 is a schematic diagram of an exemplary graph according to some embodiments of the present disclosure.
[0030] Figure 2A is a schematic diagram of an exemplary system using a GNN accelerator architecture according to some embodiments of the present disclosure.
[0031] Figure 2B is a schematic diagram of an exemplary system for accelerating GNN performance according to some embodiments of the present disclosure.
[0032] Figure 2C is a schematic diagram of an exemplary GNN access engine according to some embodiments of the present disclosure.
[0033] Figure 2D is a schematic diagram of an exemplary system for a shared memory resource according to some embodiments of the present disclosure.
[0034] Figure 3 is a schematic diagram of an exemplary parallel processing unit (PPU) card according to some embodiments of the present disclosure.
[0035] Figure 4A is a schematic diagram of an exemplary system of connected PPU cards according to some embodiments of the present disclosure.
[0036] Figure 4B is a schematic diagram of another exemplary system of connected PPU cards according to some embodiments of the present disclosure.
[0037] Figure 4C is a schematic diagram of another exemplary system of connected PPU cards according to some embodiments of the present disclosure.
[0038] Figure 5 is a schematic diagram of an exemplary smart memory extension (SMX) card according to some embodiments of the present disclosure.
[0039] Figure 6A is a schematic diagram of an exemplary system for connecting a PPU card and a smart storage expansion card according to some embodiments of the present disclosure.
[0040] Figure 6B is a schematic diagram of another exemplary system for connecting a PPU card and a smart storage expansion card according to some embodiments of the present disclosure.
[0041] Figure 6C is a schematic diagram of yet another exemplary system for connecting a PPU card and a smart storage expansion card according to some embodiments of the present disclosure.
[0042] Figure 7 is a schematic diagram of an exemplary smart storage expansion card based on a graph access engine, according to some embodiments of the present disclosure.
[0043] Figure 8A is a schematic diagram of an exemplary system for connecting a PPU card and a graph access engine-based smart storage expansion card according to some embodiments of the present disclosure.
[0044] Figure 8B is a schematic diagram of another exemplary system for connecting a PPU card and a graph access engine-based smart storage expansion card according to some embodiments of the present disclosure.
[0045] Figure 9 is a schematic diagram of an exemplary memory access system for a storage expansion card according to some embodiments of the present disclosure.
[0046] Figure 10 is a schematic diagram of an exemplary memory access system for a storage expansion card based on a graph access engine, according to some embodiments of the present disclosure.
[0047] Figure 11 is a schematic diagram of an ICN-to-mem gasket module of an exemplary memory access system according to some embodiments of the present disclosure.
[0048] Figure 12 is a schematic diagram of a packet (PKT) engine module of an exemplary memory access system according to some embodiments of the present disclosure.
[0049] Figure 13 is a schematic diagram of a chip-to-chip direct memory access engine module of an exemplary memory access system according to some embodiments of the present disclosure.
[0050] Figure 14 is a schematic diagram of a graph access engine module of an exemplary memory access system according to some embodiments of the present disclosure.
[0051] Figure 15 is a schematic diagram of a fabric switch module of an exemplary memory access system according to some embodiments of the present disclosure.
[0052] Figure 16 is a flowchart illustrating an exemplary method for accelerating GNN processing using one or more storage expansion cards according to some embodiments of this disclosure.
Detailed Implementation
[0053] This disclosure is made in the context of a particular application and its requirements, and is intended to enable those skilled in the art to make and use embodiments. Various modifications to the embodiments of this disclosure will be apparent to those skilled in the art, and the general principles defined can be applied to other embodiments and applications without departing from the spirit and scope of this disclosure. Therefore, this disclosure is not limited to the embodiments shown, but is accorded the broadest scope consistent with the principles and features of this disclosure.
[0054] Data can be structured or unstructured. Structured data can be organized according to a predefined data model or pattern. Unstructured data, however, cannot be organized using a predefined data model or predefined method. For example, text files (e.g., emails, reports, etc.) may contain information without a predefined structure (e.g., single letters or words). Therefore, unstructured data may contain irregularities and ambiguities, making it difficult to understand using traditional programs or data structures. Furthermore, accessing unstructured data from computer memory can involve significant random access, making memory access cumbersome and inefficient.
[0055] Using graphs is a way to represent unstructured data. A graph is a data structure that consists of two components: nodes (or vertices) and edges. For example, a graph G can be defined as a set of nodes V and a set of edges E connecting those nodes V. Nodes in a graph can have a set of features or attributes (e.g., user profiles in a graph representing a social network). If two nodes are connected by an edge, one node is defined as an adjacent node of the other. Graphs are highly flexible data structures because they do not require predefined rules to determine how many nodes they contain or how the nodes are connected by edges. Due to the great flexibility they offer, graphs are one of the most widely used data structures for storing or representing unstructured data (such as text files). For example, graphs can store data with relational structures, such as relationship data between buyers or products in an online shopping platform.
[0056] Figure 1 is a schematic diagram of an exemplary graph according to some embodiments of the present disclosure. As shown in Figure 1, the graph includes nodes n111, n112, n113, n114, n115, and n116. Furthermore, the graph includes edges e121, e122, e123, e124, e125, e126, and e127. Each node has one or more adjacent nodes. For example, node n112 shares edge e121 with node n111, and node n113 shares edge e122 with node n111; therefore, nodes n112 and n113 are adjacent nodes to node n111.
[0057] When storing a graph in a computer's memory, nodes, edges, and attributes can be stored in various data structures. One approach to storing a graph is to separate attribute data from its corresponding nodes. For example, node identifiers can be stored in an array, with each identifier providing an address or pointer to the location of the attribute data for that node. The attribute data for all nodes can be stored together and accessed by reading the address or pointer stored in the corresponding node identifier. By separating attribute data from its corresponding nodes, the data structure provides faster traversal access to the graph.
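The storage approach described above can be sketched as follows. This is a minimal illustration, not the layout used by the disclosed hardware: the attribute values, the fixed attribute length `ATTR_LEN`, and the helper names are all hypothetical, and only the part of Figure 1 that the text spells out (edges e121 and e122 around node n111) is modeled.

```python
# Node identifiers taken from Figure 1; attribute values are made up.
node_ids = ["n111", "n112", "n113"]

# All attribute data lives together in one flat store.
attribute_store = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
ATTR_LEN = 2  # hypothetical: each node has two attribute values

# Each node identifier maps to an offset ("pointer") into attribute_store.
attr_offset = {"n111": 0, "n112": 2, "n113": 4}

# Structure is stored separately from attributes.
# e121 joins n111 and n112; e122 joins n111 and n113.
adjacency = {
    "n111": ["n112", "n113"],
    "n112": ["n111"],
    "n113": ["n111"],
}

def neighbors(node_id):
    """Traverse structure only; no attribute data is touched."""
    return adjacency[node_id]

def get_attributes(node_id):
    """Follow the node's pointer into the shared attribute store."""
    start = attr_offset[node_id]
    return attribute_store[start:start + ATTR_LEN]

print(neighbors("n111"))
print(get_attributes("n112"))
```

Because traversal reads only the compact structure data, attribute data can be fetched later and only for the nodes that are actually needed.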
[0058] Graph Neural Networks (GNNs) are a type of neural network that can directly manipulate graphs. GNNs are better suited for graph operations than traditional neural networks (e.g., convolutional neural networks) because they can better adapt to graphs of arbitrary size or complex topology. GNNs can perform inference on data described in graph format. GNNs are capable of performing node-level, edge-level, or graph-level prediction tasks.
[0059] GNN processing involves both GNN training and GNN inference, both of which involve GNN computation. Typical GNN computation for a node (or vertex) involves aggregating features (e.g., attribute data) of its neighboring nodes (direct neighbors or neighbors of each neighbor) and then computing a new activation for the node to determine its feature representation (e.g., feature vector). Therefore, GNN processing for even a small number of nodes typically requires input features from a large number of nodes. Since the nodes required for input features can easily cover a large portion of a graph, especially huge real-world graphs (e.g., with hundreds of millions of nodes and billions of edges), using all neighboring nodes for message aggregation is prohibitively costly.
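To make the aggregate-then-activate step concrete, here is a minimal sketch of one GNN layer for a single node. The text does not specify an aggregation function or activation, so mean aggregation over the node and its direct neighbors followed by a linear map and ReLU is an assumption chosen for illustration; the feature values and the identity weight matrix are made up.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def matvec(weight, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]

def relu(vector):
    return [max(0.0, v) for v in vector]

def gnn_layer(node, adjacency, features, weight):
    """Aggregate neighbor features (assumed: mean, including the node itself),
    then apply a linear map and ReLU to get the node's new representation."""
    gathered = [features[n] for n in adjacency[node]] + [features[node]]
    return relu(matvec(weight, mean_vectors(gathered)))

adjacency = {"n111": ["n112", "n113"]}
features = {
    "n111": [1.0, 2.0],
    "n112": [3.0, 0.0],
    "n113": [2.0, -8.0],
}
identity = [[1.0, 0.0], [0.0, 1.0]]

print(gnn_layer("n111", adjacency, features, identity))  # [2.0, 0.0]
```

Stacking such layers makes each node depend on neighbors of neighbors, which is why the number of required input features grows so quickly.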
[0060] To make GNNs more suitable for these real-world applications, node sampling is often employed to reduce the number of nodes involved in message / feature aggregation. For example, positive and negative sampling can be used to determine the optimization objective and outcome variance in GNN processing. For a given root node whose feature representation is being computed, positive sampling can sample nodes in the graph that are connected to the root node (directly or indirectly) via edges (e.g., nodes connected to the root node and within a preset distance from the root node); negative sampling can sample nodes that are not connected to the root node via edges (e.g., nodes beyond a preset distance from the root node). The nodes obtained through positive and negative sampling can be used to train feature representations of root nodes with different objectives.
[0061] To perform GNN calculations, the system can retrieve graph data from memory and send the graph data to one or more processors for processing. Figure 2A is a schematic diagram of an exemplary system using a GNN accelerator architecture according to some embodiments of this disclosure. As shown in Figure 2A, system 2200 includes one or more processors 2210, a GNN accelerator 2220, memory 2230, and one or more dedicated processors 2240. In some embodiments, the one or more processors 2210 include one or more central processing units (CPUs). In some embodiments, the one or more dedicated processors 2240 may include one or more CPUs, one or more graphics processing units (GPUs), one or more tensor processing units (TPUs), one or more neural network processing units (NPUs), one or more dedicated graph neural network processing units, etc. In some embodiments, memory 2230 may include synchronous dynamic random-access memory (SDRAM), such as double data rate synchronous dynamic random-access memory (DDR SDRAM).
[0062] As shown in Figure 2A, the GNN accelerator 2220 can receive instructions and information about the GNN from one or more processors 2210 and retrieve GNN-related data from memory 2230. After receiving data from memory 2230, the GNN accelerator 2220 can preprocess the data and send the preprocessed data to one or more dedicated processors 2240 for further processing.
[0063] In some embodiments, as shown in Figure 2A, the GNN accelerator 2220 includes a graph structure processor 2221, a GNN sampler 2222, a GNN attribute processor 2223, and an address mapper 2224. The graph structure processor 2221 receives instructions and information about the GNN from one or more processors 2210 and retrieves information about one or more root nodes and their edges from memory 2230. Then, the graph structure processor 2221 sends the retrieved information to the GNN sampler 2222.
[0064] The GNN sampler 2222 is used to select one or more sampling nodes for GNN processing based on edge information of one or more root nodes. In some embodiments, the GNN sampler 2222 selects one or more sampling nodes based on positive or negative sampling. For example, based on positive sampling, one or more sampling nodes are selected from nodes connected to one or more root nodes via edges (e.g., sampling nodes are adjacent to one or more root nodes). Based on negative sampling, one or more sampling nodes are selected from nodes not directly connected to one or more root nodes via edges (e.g., sampling nodes are not adjacent to or close to one or more root nodes). In some embodiments, positive sampling selects from neighboring nodes that are connected to the root node and within a preset distance of the root node. The connection can be a direct connection (one edge between the source node and the target node) or an indirect connection (multiple edges from the source node to the target node). The "preset distance" can be configured according to the implementation. For example, a preset distance of 1 means that only directly connected neighboring nodes are selected for positive sampling. A distance of infinity means that two nodes are not connected at all, whether directly or indirectly. Negative sampling selects from nodes outside the preset distance of the root node. It should be understood that algorithms other than positive sampling and negative sampling can also be used to select sampling nodes.
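Distance-based positive and negative sampling can be sketched in a few lines. The sketch below measures distance as the number of edges on the shortest path from the root (found by breadth-first search) and draws samples uniformly at random; both choices are assumptions, as are the function names and the toy graph, which is not the graph of Figure 1.

```python
from collections import deque
import random

def hop_distances(adjacency, root):
    """Breadth-first search: number of edges from root to each reachable node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def sample_nodes(adjacency, root, preset_distance, k, positive=True, seed=0):
    """Positive: sample nodes within preset_distance of root.
    Negative: sample nodes farther away; unreachable nodes count as
    infinitely distant."""
    dist = hop_distances(adjacency, root)
    if positive:
        pool = [n for n, d in dist.items() if 0 < d <= preset_distance]
    else:
        pool = [n for n in adjacency
                if dist.get(n, float("inf")) > preset_distance]
    rng = random.Random(seed)
    return rng.sample(sorted(pool), min(k, len(pool)))

# Hypothetical toy graph: a-b, a-c, b-d, and an isolated node e.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"], "e": []}

print(sorted(sample_nodes(toy, "a", preset_distance=1, k=10)))
print(sorted(sample_nodes(toy, "a", preset_distance=1, k=10, positive=False)))
```

With a preset distance of 1, positive sampling draws only from the direct neighbors of the root, while negative sampling draws from everything else, including the disconnected node.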
[0065] After selecting a sampling node, the GNN sampler 2222 sends the sampling node selection information to the GNN attribute processor 2223. Based on the sampling node information, the GNN attribute processor 2223 can retrieve the sampling node information from the memory 2230. In some embodiments, the sampling node information includes one or more features or attributes (also referred to as attribute data) for each sampling node. The GNN attribute processor 2223 can further send the retrieved sampling node information, along with information about one or more root nodes and their edges, to a dedicated processor 2240. The dedicated processor 2240 performs GNN processing based on the information received from the GNN attribute processor 2223.
[0066] In some embodiments, the graph structure processor 2221 and the GNN attribute processor 2223 use an address mapper 2224 to retrieve information from memory 2230. The address mapper is used to provide hardware address information in memory 2230 based on node and edge information. For example, a root node (e.g., node n111 of Figure 1) is identified by identifier n111 as part of the GNN input. If the graph structure processor 2221 needs to extract information about node n111 (e.g., attribute data of node n111), the graph structure processor 2221 provides the identifier n111 to the address mapper 2224, which then determines the physical address of the location in memory 2230 that stores the information about node n111 (e.g., attribute data of node n111). In some embodiments, the address mapper 2224 also determines one or more physical addresses of the locations in memory 2230 that store the edges of node n111 (e.g., edges e121 and e122 of Figure 1).
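One simple way to picture the address mapper is as a lookup from identifiers to physical addresses. The text does not describe how the mapping is computed, so the sketch below assumes fixed-size records laid out contiguously from a base address; the class, its methods, and all base addresses and strides are hypothetical.

```python
class AddressMapper:
    """Hypothetical mapper: identifier -> physical address, assuming
    fixed-size records stored contiguously from a base address."""

    def __init__(self, attr_base, attr_stride, edge_base, edge_stride):
        self.attr_base = attr_base
        self.attr_stride = attr_stride
        self.edge_base = edge_base
        self.edge_stride = edge_stride
        self.node_slot = {}   # node identifier -> record index
        self.edge_slot = {}   # edge identifier -> record index
        self.node_edges = {}  # node identifier -> its edge identifiers

    def register(self, node_id, edge_ids):
        """Assign the next free record slots to a node and its edges."""
        self.node_slot.setdefault(node_id, len(self.node_slot))
        for edge_id in edge_ids:
            self.edge_slot.setdefault(edge_id, len(self.edge_slot))
        self.node_edges[node_id] = list(edge_ids)

    def attribute_address(self, node_id):
        return self.attr_base + self.node_slot[node_id] * self.attr_stride

    def edge_addresses(self, node_id):
        return [self.edge_base + self.edge_slot[e] * self.edge_stride
                for e in self.node_edges[node_id]]

mapper = AddressMapper(attr_base=0x1000, attr_stride=64,
                       edge_base=0x8000, edge_stride=16)
mapper.register("n111", ["e121", "e122"])

print(hex(mapper.attribute_address("n111")))
print([hex(a) for a in mapper.edge_addresses("n111")])
```

A real mapper could equally use a table lookup or hashing; the point is only that callers supply identifiers and receive addresses.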
[0067] The system 2200 shown in Figure 2A can be used to accelerate memory access of GNNs in many different systems to improve GNN performance. Figure 2B is a schematic diagram of an exemplary system for improving GNN performance according to some embodiments of this disclosure. As shown in Figure 2B, the acceleration system 2300 includes a memory over fabric (MoF) 2305, an access engine 2310, a RISC-V 2330, a general matrix multiply (GEMM) execution engine 2340, and a vector processing unit (VPU) execution engine 2350. The access engine 2310 shown in Figure 2B is similar to the GNN accelerator 2220 shown in Figure 2A. The access engine 2310 can be used to access memory (e.g., the DDR shown in Figure 2B) and retrieve the data required to perform GNN calculations. For example, access engine 2310 retrieves node identifiers, edge identifiers, and attribute data corresponding to the node identifiers. The data retrieved by access engine 2310 is provided to the execution engine (e.g., GEMM execution engine 2340 or VPU execution engine 2350) or the processor used for GNN-related calculations. As shown in Figure 2B, both types of engines can accelerate the execution of specific GNN-related calculations.
[0068] Although system 2300 includes acceleration engines and a processor to accelerate GNN-related computations, the access engine 2310 may become a bottleneck for the overall performance of system 2300 because data retrieval performed by the access engine is slower than the computation performed by the execution engines. Figure 2C is a schematic diagram of an exemplary GNN access engine according to some embodiments of the present disclosure. It will be understood that the access engine 2400 shown in Figure 2C is similar to the access engine 2310 shown in Figure 2B. As shown in Figure 2C, the access engine 2400 includes the GetNeighbor module 2410, the GetSample module 2420, the GetAttribute module 2430, and the GetEncode module 2440.
[0069] In some embodiments, the GetNeighbor module 2410 is used to access or identify neighboring nodes of the input node identifier. For example, with Figure 2A Similar to the graph architecture processor 2221 shown, the GetNeighbor module 2410 receives instructions and information about the GNN and retrieves them from DDR (e.g., corresponding to...). Figure 2A The GetNeighbor module 2410 retrieves information about one or more nodes, their edges, and their neighbors (adjacent nodes) from the memory 2230. The GetNeighbor module 2410 then sends the retrieved information to the GetSample module 2420 (e.g., corresponding to...). Figure 2A The GNN sampler 2222).
[0070] In some embodiments, the GetSample module 2420 is used to receive information from the GetNeighbor module 2410 on one or more nodes, and perform node sampling on those one or more nodes for GNN processing. For example, with Figure 2A Similar to the GNN sampler 2222 shown, the GetSample module 2420 is used to select one or more sampling nodes for GNN processing based on edge information of one or more nodes. In some embodiments, the GNN sampler 2222 selects one or more sampling nodes based on positive sampling and / or negative sampling. After selecting the sampling nodes, the GetSample module 2420 sends the selection information of the sampling nodes to the GetAttribute module 2430.
[0071] In some embodiments, the GetAttribute module 2430 is used to receive information about the selected or sampled node from the GetSample module 2420, and from memory (e.g., Figure 2C The DDR or shown Figure 2AThe memory 2230 shown extracts attribute information of the sampled nodes. For example, similar to the GNN attribute processor 2223, the GetAttribute module 2430 is used to extract attribute data of the sampled nodes from the memory 2230 based on the received sampled node (e.g., the identifier of the sampled node). In some embodiments, the GetAttribute module needs to extract attribute information of the sampled nodes from a remote location. For example, the GetAttribute module needs to extract attribute information from another board. Therefore, the GetAttribute module utilizes the MoF module 2450 to extract attribute information from a remote location (e.g., another board). In some embodiments, the attribute data of the sampled nodes includes one or more features of each sampled node.
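The data flow through the GetNeighbor, GetSample, and GetAttribute modules can be sketched in software as three chained functions. This is only an illustration under stated assumptions: the graph contents, the node names other than n111, the uniform random sampling policy, and the `fanout` parameter are all hypothetical, and the GetEncode module and remote (MoF) accesses are omitted.

```python
import random

# Hypothetical in-memory graph standing in for the DDR contents.
NEIGHBORS = {"n111": ["n112", "n113", "n114"], "n112": ["n111"],
             "n113": [], "n114": []}
ATTRIBUTES = {"n111": [0.1], "n112": [0.2], "n113": [0.3], "n114": [0.4]}


def get_neighbor(node_ids):
    """GetNeighbor stage: look up the adjacent nodes of each input node."""
    return {n: NEIGHBORS.get(n, []) for n in node_ids}


def get_sample(neighbor_map, fanout, rng):
    """GetSample stage: pick up to `fanout` neighbors per node.

    Uniform sampling without replacement is an assumption; the text only
    says that sampling may be positive and/or negative.
    """
    sampled = []
    for neighbors in neighbor_map.values():
        sampled.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
    return sampled


def get_attribute(node_ids):
    """GetAttribute stage: fetch the attribute data of the sampled nodes."""
    return {n: ATTRIBUTES[n] for n in node_ids}


def access_engine(roots, fanout=2, seed=0):
    """Chain the three stages for a batch of root node identifiers."""
    rng = random.Random(seed)
    return get_attribute(get_sample(get_neighbor(roots), fanout, rng))


batch = access_engine(["n111"])
```

Here `batch` maps two sampled neighbors of n111 to their attribute vectors, which is the kind of payload that would be handed to an execution engine.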
[0072] A graphics processing unit (GPU) is a well-known device that performs the computations required to fill a frame buffer, which in turn displays images on the screen. The central processing unit (CPU) offloads the task of filling the frame buffer (a computationally intensive task) to the GPU, freeing up the CPU to perform other tasks in a timely manner.
[0073] A general-purpose graphics processing unit (GPGPU) is an extension of a GPU because GPGPUs can be programmed to perform other computationally intensive (non-graphics processing) operations. In artificial intelligence (AI) and machine learning applications (such as GNN processing), CPUs are typically paired with multiple GPGPUs (e.g., 100 GPGPUs) performing convolution-type operations in parallel. In this disclosure, GPU and GPGPU are used interchangeably to describe GPUs used for performing general-purpose computing, unless otherwise specified.
[0074] A GPGPU can have a processor and memory coupled to that processor. In many artificial intelligence and machine learning applications (such as GNN processing), the memory must be large in capacity and very fast. Therefore, in AI / machine learning settings, the memory in a GPGPU is typically implemented using a large-capacity, high-speed memory known as high-bandwidth memory (HBM).
[0075] A typical HBM consists of multiple dynamic random-access memory (DRAM) dies stacked vertically on top of each other to provide large storage capacities, such as 4GB, 24GB, and 64GB, but with a small form factor. Furthermore, each DRAM die can include two 128-bit data channels to provide high bandwidth.
[0076] Traditional HBM designs suffer from numerous problems. For example, GPGPUs have a maximum effective memory capacity, which in turn limits the operations the GPGPU can perform in a timely manner. This maximum effective capacity exists because it is increasingly difficult to stack chips vertically on top of each other, effectively limiting the number of chips that can be stacked to form an HBM and its maximum capacity. Furthermore, each chip in an HBM is typically manufactured using the largest possible reticle, which limits the maximum chip size and capacity. This problem becomes increasingly severe as the amount of data to be processed continues to grow.
[0077] Furthermore, beyond the maximum effective capacity, some or all of the other memory in the system, such as a portion of the CPU's memory, cannot be used as an extension of the GPGPU's memory (HBM) to provide additional high-speed storage capacity. This is because the GPGPU is coupled to the extension memory (e.g., the CPU's memory) via the Peripheral Component Interconnect Express (PCIe) bus. Accessing data via the PCIe bus is 100 times slower than accessing data in the HBM, which is too slow for many AI / machine learning applications. Due to physical limitations, current HBM-based designs are difficult to scale, and multi-GPU designs (e.g., scaling solutions) result in a significant waste of computational resources.
[0078] Therefore, because the GPGPU's memory (HBM) has a maximum effective capacity, and because all or part of the other memory cannot be used as an extension of the GPGPU's memory (HBM) to provide additional high-speed storage capacity, it is necessary to increase the capacity of the GPGPU's memory and to provide near-memory operation.
[0079] Many electronic technologies, such as digital computers, calculators, audio devices, video devices, and telephone systems, contribute to increased productivity and reduced costs in analyzing and communicating data and information across most sectors of business, science, education, and entertainment. Electronic components are used in many important applications (e.g., medical procedures, vehicle assistance, financial applications), activities that typically involve processing and storing large amounts of information. These applications often involve significant information processing. However, processing (e.g., storing, processing, communicating, etc.) large amounts of information can present problems and difficulties.
[0080] In many applications (e.g., the GNN processing applications illustrated in Figure 1), the ability of a system to process information quickly and accurately often depends on access to that information. Traditional systems typically struggle to store and process large amounts of information, especially in parallel processing environments. Providing too little storage capacity is often detrimental, potentially leading to complete application failure. However, the traditional approach of providing a large amount of dedicated memory with sufficient capacity on each parallel processing resource to store all the information can be extremely expensive. Furthermore, each processing resource typically has different storage access needs at different times, and much of the storage may be idle or essentially wasted. Traditional attempts to share storage resources often result in communication problems and significantly reduce the speed at which processing resources access information, leading to considerable performance limitations and degradation.
[0081] Figure 2D is a schematic diagram of an exemplary system 200 with shared memory resources according to some embodiments of the present disclosure. Typically, system 200 includes multiple servers, and each server includes multiple parallel computing units. As shown in Figure 2D, system 200 includes servers 201 and 202. Server 201 includes parallel processing units (PPUs) PPU_0a to PPU_n, a PCIe bus 211, a memory card 213, a network interface controller or card (NIC) 212, and a host CPU 214. Each parallel processing unit includes components such as processing cores and memory (not shown in Figure 2D). In some embodiments, the parallel processing unit may be a neural network processing unit (NPU) or a graphics processing unit (GPU). In some embodiments, multiple NPUs or GPUs are arranged in a parallel configuration. PCIe bus 211 may be communicatively coupled to PPU_0a to PPU_n, memory card 213, host CPU 214, and NIC 212, and NIC 212 may be communicatively coupled to network 230. Host CPU 214 may be communicatively coupled to memory 215 (e.g., RAM, DRAM, DDR4, DDR, etc.). In server 202, PCIe bus 221 may be communicatively coupled to PPU_0b to PPU_m, memory card 223, host CPU 224, and NIC 222, and NIC 222 may be communicatively coupled to network 230. In some embodiments, network 230 is Ethernet.
[0082] In some embodiments, system 200 uses, for example, a partitioned global address space (PGAS) programming model to form a unified memory address space. In many applications, a specific PPU needs to access information stored on a memory card of the system. In the example of Figure 2D, PPU_0a on server 201 may need to access information stored on memory cards 213 and 223. The path that the information takes depends on where the memory card is located in the system. For example, to write data from PPU_0a to memory card 213 on server 201, the data is sent from PPU_0a to memory card 213 via PCIe bus 211; to write data from PPU_0a on server 201 to memory card 223 on server 202, the data is sent from PPU_0a to NIC 212 via PCIe bus 211, then to NIC 222 via network 230, and then to memory card 223 via PCIe bus 221.
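The difference between a local and a remote write in system 200 can be made concrete with a small path-tracing sketch. The reference numerals follow the description of Figure 2D (server 201 with PCIe bus 211 and NIC 212, server 202 with PCIe bus 221 and NIC 222); the dictionary layout and the function name `write_path` are hypothetical.

```python
# Components of each server as described for Figure 2D; the data
# structure itself is an illustrative assumption.
SERVERS = {
    "server 201": {"pcie": "PCIe bus 211", "nic": "NIC 212",
                   "cards": {"memory card 213"}},
    "server 202": {"pcie": "PCIe bus 221", "nic": "NIC 222",
                   "cards": {"memory card 223"}},
}
NETWORK = "network 230"


def write_path(src_server, card):
    """Return the ordered interconnects a write crosses to reach `card`."""
    local = SERVERS[src_server]
    if card in local["cards"]:
        # Local card: a single traversal of the server's PCIe bus.
        return [local["pcie"]]
    for name, remote in SERVERS.items():
        if name != src_server and card in remote["cards"]:
            # Remote card: local PCIe, both NICs, the network, remote PCIe.
            return [local["pcie"], local["nic"], NETWORK,
                    remote["nic"], remote["pcie"]]
    raise KeyError(card)
```

A write from a PPU on server 201 to memory card 223 crosses five interconnect stages instead of one, which is why the PCIe-and-network path limits memory-intensive graph workloads.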
[0083] System 200 can be used in applications such as graph analytics and graph neural networks; more specifically, it can be used in applications such as online shopping engines, social networks, recommendation engines, mapping engines, fault analysis, network management, and search engines. These applications perform a large number of memory access requests (e.g., read / write requests), and therefore require the transfer (e.g., read / write) of large amounts of data for processing. Although the bandwidth and data transfer rate of the PCIe bus are quite high, they still limit such applications. In fact, the PCIe bus is often too slow, and its bandwidth is too narrow for such applications. Traditional attempts at flexible configuration and expansion of storage capacity are also limited by the slow speed and narrow bandwidth of the PCIe bus scheme.
[0084] Embodiments of this disclosure provide methods and systems for efficient memory access among PPUs. Figure 3 is a schematic diagram of an exemplary PPU card according to some embodiments of the present disclosure. The diagram shown in Figure 3 is for illustrative purposes only and may vary depending on the implementation. The PPU card 300 shown in Figure 3 can have fewer, more, or alternative components and connections. In some embodiments, the PPU card 300 is implemented on a field-programmable gate array (FPGA).
[0085] The PPU card 300 can be used to provide processing capabilities. In some embodiments, the PPU card 300 includes one or more PPUs. It is understood that the PPU card 300 can include any processing unit, not just a PPU. For example, the PPU card 300 includes one or more CPUs, one or more GPUs or GPGPUs, one or more tensor processing units (TPUs), one or more neural processing units (NPUs), or one or more dedicated graph neural network processing units. In some embodiments, the PPU card 300 is similar to the dedicated processor 2240 shown in Figure 2A. As shown in Figure 3, the PPU card 300 includes multiple connections. For example, the PPU card 300 includes three full-speed interconnect network (ICN) links (or interconnects) and two half-speed ICN links (or interconnects). In some embodiments, one or more of the ICN full-speed links are bidirectional (e.g., 100 GB/s), and one or more of the ICN half-speed links are bidirectional (e.g., 50 GB/s). In some embodiments, the form factor of an ICN full-speed link is an ICN bridge (e.g., similar to an Nvidia bridge or NVLink bridge) or a QSFP-DD port (e.g., two QSFP-DD ports per ICN full-speed link). In some embodiments, the form factor of an ICN half-speed link is a Mini-SAS connector (e.g., four Mini-SAS ports per ICN half-speed link). In some embodiments, the PPU card 300 includes a PCIe connection.
[0086] Figure 4A is a schematic diagram of an exemplary system of connected PPU cards according to some embodiments of the present disclosure. The diagram shown in Figure 4A is for illustrative purposes only and may vary depending on the implementation. The system 400 shown in Figure 4A can have fewer, more, or alternative components and connections.
[0087] As shown in Figure 4A, system 400 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a through PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a through PPU d may include three ICN full-speed links and two ICN half-speed links. As shown in Figure 4A, each of PPU a through PPU d is connected to every other PPU via an ICN full-speed link. For example, PPU a is connected to PPU b via ICN full-speed link b, to PPU c via ICN full-speed link a, and to PPU d via ICN full-speed link e. Therefore, as shown in Figure 4A, six ICN full-speed links a through f connect every pair of PPUs among PPU a through PPU d. The ICN full-speed links shown in Figure 4A can help each PPU efficiently access the resources (e.g., storage resources) of the other PPUs. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs can be improved.
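The link count for the fully connected four-PPU arrangement can be checked with a few lines of arithmetic: a full mesh of n units needs n(n-1)/2 links, and each unit terminates n-1 of them. The sketch below is illustrative only; the function name `full_mesh` is hypothetical.

```python
from itertools import combinations


def full_mesh(units):
    """Return one point-to-point link for every unordered pair of units."""
    return list(combinations(units, 2))


ppus = ["PPU a", "PPU b", "PPU c", "PPU d"]
links = full_mesh(ppus)

# Number of links that terminate at each PPU.
links_per_ppu = {p: sum(p in link for link in links) for p in ppus}
```

For four PPUs this gives six links in total and three links per PPU, which matches the three ICN full-speed links available on each PPU card.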
[0088] Figure 4B is a schematic diagram of another exemplary system of connected PPU cards according to some embodiments of the present disclosure. The diagram shown in Figure 4B is for illustrative purposes only and may vary depending on the implementation. The system 410 shown in Figure 4B can have fewer, more, or alternative components and connections.
[0089] As shown in Figure 4B, system 410 may include four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a through PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a through PPU d includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 4B, each of PPU a through PPU d is connected to each of the other PPUs via one ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via ICN half-speed links g and h (e.g., the two half-speed interconnects can be configured in parallel to provide full-speed connectivity), to PPU c via ICN full-speed link a, and to PPU d via ICN full-speed link e. Therefore, as shown in Figure 4B, four ICN full-speed links a, d, e, and f and four ICN half-speed links g, h, i, and j connect every pair of PPUs among PPU a through PPU d. The ICN full-speed and half-speed links shown in Figure 4B can help each PPU efficiently access the resources (e.g., storage resources) of the other PPUs. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs can be improved.
[0090] Figure 4C is a schematic diagram of another exemplary system of connected PPU cards according to some embodiments of the present disclosure. The diagram shown in Figure 4C is for illustrative purposes only and may vary depending on the implementation. The system 420 shown in Figure 4C can have fewer, more, or alternative components and connections.
[0091] As shown in Figure 4C, system 420 includes eight PPUs: PPU a, PPU b, PPU c, PPU d, PPU e, PPU f, PPU g, and PPU h. In some embodiments, each of PPU a through PPU h is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a through PPU h includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 4C, each of PPU a through PPU h is connected to other PPUs via an ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via ICN full-speed link b, to PPU c via ICN full-speed link a, to PPU d via ICN full-speed link e, and to PPU g via ICN half-speed links q and r. Therefore, as shown in Figure 4C, 12 ICN full-speed links a, b, c, d, e, f, k, l, m, n, o, and p and 8 ICN half-speed links g, h, i, j, q, r, s, and t connect multiple pairs of PPUs among PPU a through PPU h. As a result, each PPU in system 420 can access the resources (e.g., storage resources) of any other PPU in at most two hops. For example, although there is no direct connection between PPU a and PPU e, PPU a can access the resources (e.g., storage resources) in PPU e through ICN full-speed link b and the two ICN half-speed links g and h. The ICN full-speed and half-speed links shown in Figure 4C can help each PPU efficiently access the resources of the other PPUs. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs can be improved.
[0092] In some embodiments, each PPU includes an ICN switch that facilitates data access to resources within each PPU. For example, PPU a can use the ICN switch located in PPU b to access storage resources in PPU e via ICN full-speed link b and two ICN half-speed links g and h.
[0093] In some embodiments, the system also includes one or more storage expansion cards. Figure 5 is a schematic diagram of an exemplary smart storage expansion card according to some embodiments of the present disclosure. The diagram shown in Figure 5 is for illustrative purposes only and may vary depending on the implementation. The smart storage expansion (SMX) card 500 shown in Figure 5 can have fewer, more, or alternative components and connections. In some embodiments, the SMX card 500 is implemented on an FPGA.
[0094] The SMX card 500 shown in Figure 5 can provide additional storage capacity for processors (e.g., the PPU card 300 shown in Figure 3 and the PPUs shown in Figures 4A, 4B, and 4C) and can be communicatively coupled to those processors. In some embodiments, the SMX card 500 includes one or more memories (e.g., the memory 2230 shown in Figure 2A, the DDR shown in Figure 2B or Figure 2C, or solid-state drives (SSDs)) used to store graph data. As shown in Figure 5, the SMX card 500 can include multiple connections. For example, the SMX card 500 includes two ICN full-speed links. In some embodiments, one or more of the ICN full-speed links are bidirectional (e.g., 100 GB/s). In some embodiments, the form factor of an ICN full-speed link is a QSFP-DD port (e.g., each ICN full-speed link has two QSFP-DD ports). In some embodiments, the SMX card 500 also includes one or more breakout cables that convert between the ICN bridge and one or more QSFP-DD ports. In some embodiments, the SMX card 500 includes a PCIe connection.
[0095] Figure 6A is a schematic diagram of an exemplary system connecting PPU cards and smart storage expansion cards according to some embodiments of the present disclosure. The diagram shown in Figure 6A is for illustrative purposes only and may vary depending on the implementation. The system 600 shown in Figure 6A can have fewer, more, or alternative components and connections.
[0096] As shown in Figure 6A, system 600 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a, PPU b, PPU c, and PPU d includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 6A, each of PPU a, PPU b, PPU c, and PPU d is connected to one or more other PPUs via an ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via two ICN half-speed links g and h, and to PPU c via ICN full-speed link a. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is connected to an SMX card. For example, as shown in Figure 6A, PPU a is connected to SMX card a via two ICN full-speed links a1 and a2, PPU b is connected to SMX card b via two ICN full-speed links b1 and b2, PPU c is connected to SMX card c via two ICN full-speed links c1 and c2, and PPU d is connected to SMX card d via two ICN full-speed links d1 and d2. In some embodiments, each of SMX cards a through d shown in Figure 6A is similar to the SMX card 500 shown in Figure 5. As shown in Figure 6A, there are 10 ICN full-speed links and 4 ICN half-speed links. Therefore, each PPU in system 600 can access the resources (e.g., storage resources) of any other PPU or SMX card in at most three hops. For example, although there is no direct link connecting PPU a and SMX card d, PPU a can access the storage resources in SMX card d through the two ICN half-speed links g and h (to PPU b), ICN full-speed link d (to PPU d), and the two ICN full-speed links d1 and d2 (to SMX card d). In some embodiments, each PPU includes an ICN switch that facilitates data access to the resources. For example, PPU a can use the ICN switches located in PPU b and PPU d to access the storage resources in SMX card d through this path. The ICN full-speed and half-speed links shown in Figure 6A can help each PPU efficiently access the resources (e.g., storage resources) of other PPUs and SMX cards. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs and SMX cards can be improved.
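The three-hop bound stated for system 600 can be checked with a breadth-first search over the card-level topology. The sketch below is a minimal illustration under stated assumptions: paired links between the same two cards (e.g., g and h, or d1 and d2) are collapsed into a single edge, and the connection between PPU c and PPU d is inferred from the total of four half-speed links rather than stated explicitly in the text. The helper names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def hops_from(graph, source):
    """Breadth-first search: minimum hop count from `source` to each card."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Card-level topology of Figure 6A as read from the text.
edges_6a = [
    ("PPU a", "PPU b"),  # half-speed links g and h
    ("PPU a", "PPU c"),  # full-speed link a
    ("PPU b", "PPU d"),  # full-speed link d
    ("PPU c", "PPU d"),  # assumed: the remaining two half-speed links
] + [(f"PPU {x}", f"SMX {x}") for x in "abcd"]  # links a1/a2 through d1/d2

graph_6a = build_graph(edges_6a)
max_hops = max(max(hops_from(graph_6a, f"PPU {x}").values()) for x in "abcd")
```

Under these assumptions `max_hops` evaluates to 3, and the farthest target from PPU a is SMX card d, reached via PPU b and PPU d.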
[0097] Figure 6B is a schematic diagram of another exemplary system connecting PPU cards and smart storage expansion cards according to some embodiments of the present disclosure. The diagram shown in Figure 6B is for illustrative purposes only and may vary depending on the implementation. The system 610 shown in Figure 6B can have fewer, more, or alternative components and connections.
[0098] As shown in Figure 6B, system 610 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a, PPU b, PPU c, and PPU d includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 6B, each of PPU a, PPU b, PPU c, and PPU d is connected to each of the other PPUs via an ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via two ICN half-speed links g and h, to PPU c via ICN full-speed link a, and to PPU d via ICN full-speed link e. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is connected to an SMX card. For example, as shown in Figure 6B, PPU a is connected to SMX card a via ICN full-speed link a1, PPU b is connected to SMX card b via ICN full-speed link b1, PPU c is connected to SMX card c via ICN full-speed link c1, and PPU d is connected to SMX card d via ICN full-speed link d1. In some embodiments, each of SMX cards a through d shown in Figure 6B is similar to the SMX card 500 shown in Figure 5. As shown in Figure 6B, there are 8 ICN full-speed links and 4 ICN half-speed links. Therefore, each PPU in system 610 can access the resources (e.g., storage resources) of any other PPU or SMX card in at most two hops. For example, although there is no direct link connecting PPU a and SMX card d, PPU a can access the storage resources in SMX card d through ICN full-speed link e and ICN full-speed link d1. In some embodiments, each PPU may include an ICN switch to facilitate data access to resources. For example, PPU a uses the ICN switch located in PPU d to access the storage resources in SMX card d through ICN full-speed link e and ICN full-speed link d1. The ICN full-speed and half-speed links shown in Figure 6B can help each PPU efficiently access the resources (e.g., storage resources) of other PPUs and SMX cards. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs and SMX cards can also be improved.
[0099] Figure 6C is a schematic diagram of another exemplary system connecting PPU cards and smart storage expansion cards according to some embodiments of the present disclosure. The diagram shown in Figure 6C is for illustrative purposes only and may vary depending on the implementation. The system 620 shown in Figure 6C can have fewer, more, or alternative components and connections.
[0100] As shown in Figure 6C, system 620 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of PPU a, PPU b, PPU c, and PPU d includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 6C, each of PPU a, PPU b, PPU c, and PPU d is connected to one or more other PPUs via an ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via two ICN half-speed links g and h, and to PPU c via ICN full-speed link a. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is connected to an SMX card. For example, as shown in Figure 6C, PPU a is connected to SMX card a via one ICN full-speed link a1, PPU b is connected to SMX card b via one ICN full-speed link b1, PPU c is connected to SMX card c via one ICN full-speed link c1, and PPU d is connected to SMX card d via one ICN full-speed link d1. In some embodiments, each PPU is connected to an additional SMX card. For example, as shown in Figure 6C, PPU a is connected to SMX card c via an ICN full-speed link c3, PPU b is connected to SMX card d via an ICN full-speed link d3, PPU c is connected to SMX card a via an ICN full-speed link a3, and PPU d is connected to SMX card b via an ICN full-speed link b3. In some embodiments, each of SMX cards a through d shown in Figure 6C is similar to the SMX card 500 shown in Figure 5. As shown in Figure 6C, there are 10 ICN full-speed links and 4 ICN half-speed links. Therefore, each PPU in system 620 can access the resources (e.g., storage resources) of any other PPU or SMX card in at most two hops. For example, although there is no direct link connecting PPU a and SMX card d, PPU a can access the storage resources in SMX card d through the two ICN half-speed links g and h and one ICN full-speed link d3. In some embodiments, each PPU may include an ICN switch to facilitate data access to resources. For example, PPU a can use the ICN switch located in PPU b to access the storage resources in SMX card d through the two ICN half-speed links g and h and one ICN full-speed link d3. The ICN full-speed and half-speed links shown in Figure 6C can help each PPU efficiently access the resources (e.g., storage resources) of other PPUs and SMX cards. Therefore, storage capacity is no longer limited to that of a single PPU, and data transfer between PPUs and SMX cards can also be improved.
[0101] In some embodiments, the system further includes one or more storage expansion cards based on a graph access engine (GAE). Figure 7 is a schematic diagram of an exemplary smart storage expansion card based on a graph access engine according to some embodiments of the present disclosure. The diagram shown in Figure 7 is for illustrative purposes only and may vary depending on the implementation. The GAE SMX card 700 (i.e., a storage expansion card based on a graph access engine) shown in Figure 7 can have fewer, more, or alternative components and connections. In some embodiments, the GAE SMX card 700 is implemented on an FPGA.
[0102] The GAE SMX card 700 shown in Figure 7 can provide additional storage capacity for processors (e.g., the PPU card 300 shown in Figure 3 and the PPUs shown in Figures 4A, 4B, and 4C) and can be communicatively coupled to those processors. In some embodiments, the GAE SMX card 700 includes one or more memories (e.g., the memory 2230 shown in Figure 2A, the DDR shown in Figure 2B or Figure 2C, or solid-state drives (SSDs)) used to store graph data. As shown in Figure 7, the GAE SMX card 700 includes multiple connections. For example, the GAE SMX card 700 includes one ICN full-speed link and two memory-over-fabric (MoF) links (or interconnects). In some embodiments, the ICN full-speed link is bidirectional (e.g., 100 GB/s). In some embodiments, the form factor of the ICN full-speed link is a QSFP-DD port (e.g., each ICN full-speed link has two QSFP-DD ports). In some embodiments, one or more of the MoF links are bidirectional (e.g., 50 GB/s per link). In some embodiments, the MoF link is an FPGA-to-FPGA connection, such as FPGA-to-FPGA connection IP developed for accelerating graph applications. In some embodiments, the form factor of a MoF link is a QSFP-DD port (e.g., each MoF link has one QSFP-DD port). The MoF links can connect GAEs to one another so that, for example, the connected GAEs can communicate with each other to perform near-memory processing for graph applications. In some embodiments, the GAE SMX card 700 also includes one or more breakout cables that convert between the ICN bridge and one or more QSFP-DD ports. In some embodiments, the GAE SMX card 700 includes a PCIe connection.
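The example figures quoted for the GAE SMX card 700 can be combined into a simple link budget. The sketch below only restates the arithmetic: the rates and port counts are the example values from the text, the text does not say whether a rate is per direction or a bidirectional total, and the function name `link_budget` is hypothetical.

```python
# Example link rates quoted in the text, in GB/s.
ICN_FULL_GBPS = 100
MOF_GBPS = 50

# Example QSFP-DD ports per link, from the form factors in the text.
PORTS_PER_ICN_FULL = 2
PORTS_PER_MOF = 1


def link_budget(icn_full_links, mof_links):
    """Return (aggregate link rate in GB/s, QSFP-DD port count) for a card."""
    bandwidth = icn_full_links * ICN_FULL_GBPS + mof_links * MOF_GBPS
    ports = icn_full_links * PORTS_PER_ICN_FULL + mof_links * PORTS_PER_MOF
    return bandwidth, ports


# One ICN full-speed link and two MoF links, as described for card 700.
gae_smx_bandwidth, gae_smx_ports = link_budget(icn_full_links=1, mof_links=2)
```

With these example values the card exposes an aggregate of 200 GB/s across four QSFP-DD ports, half of it toward the PPU over the ICN link and half toward peer GAE cards over the MoF links.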
[0103] In some embodiments, the GAE SMX card 700 includes one or more of the modules shown in Figures 2A, 2B, and 2C. For example, the GAE SMX card 700 includes an access engine similar to the access engine 2310 of Figure 2B or the access engine 2400 of Figure 2C, a RISC-V similar to the RISC-V 2330 of Figure 2B, an execution engine similar to the GEMM execution engine 2340 or the VPU execution engine 2350 of Figure 2B, or a combination thereof. Therefore, the GAE SMX card 700 can perform operations that accelerate GNN memory access in a near-memory manner. In addition, the operations performed by the GAE SMX card 700 can help the PPU perform GNN operations (e.g., similar to the way the GNN module 2220 in Figure 2A helps the dedicated processor 2240 perform GNN operations).
[0104] Figure 8A is a schematic diagram of an exemplary system connecting PPU cards and graph access engine-based smart storage expansion cards according to some embodiments of the present disclosure. The diagram shown in Figure 8A is for illustrative purposes only and may vary depending on the implementation. The system 800 shown in Figure 8A can have fewer, more, or alternative components and connections.
[0105] As shown in Figure 8A, system 800 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of the four PPUs includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 8A, each PPU is connected to each of the other PPUs via one ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via two parallel ICN half-speed links g and h, to PPU c via ICN full-speed link a, and to PPU d via ICN full-speed link e. In some embodiments, each of the four PPUs can be connected to a GAE SMX card. For example, as shown in Figure 8A, PPU a is connected to GAE SMX card a via ICN full-speed link a1, PPU b is connected to GAE SMX card b via ICN full-speed link b1, PPU c is connected to GAE SMX card c via ICN full-speed link c1, and PPU d is connected to GAE SMX card d via ICN full-speed link d1. In some embodiments, each of GAE SMX cards a through d shown in Figure 8A is similar to the GAE SMX card 700 shown in Figure 7. As shown in Figure 8A, system 800 has eight ICN full-speed links and four ICN half-speed links connecting the PPU cards and the GAE SMX cards. Therefore, each PPU in system 800 can access the resources (e.g., storage resources) of any other PPU or GAE SMX card within at most two hops. For example, although there is no direct link connecting PPU a and GAE SMX card d, PPU a can access the storage resources in GAE SMX card d through ICN full-speed link e and ICN full-speed link d1. In some embodiments, each PPU may include an ICN switch to facilitate data access to the resources. For example, PPU a can use the ICN switch located in PPU d to access the storage resources in GAE SMX card d through ICN full-speed link e and ICN full-speed link d1. The ICN full-speed and half-speed links shown in Figure 8A can help each PPU efficiently access the resources (e.g., storage resources) of other PPUs and GAE SMX cards. Therefore, storage capacity is no longer limited to a single PPU, and data transfer between PPUs and GAE SMX cards can be improved.
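The two-hop bound stated above can be checked with a short sketch that models system 800 as an undirected graph and runs a breadth-first search from each PPU. The link names g, h, a, e, and a1 through d1 come from the text; the remaining PPU-to-PPU links are not named in the text and are inferred here from the per-PPU budget of three full-speed and two half-speed links, so their labels are hypothetical. A pair of half-speed links is treated as one edge.

```python
from collections import deque

# (endpoint, endpoint, link description). Entries marked "inferred" are assumptions.
LINKS_800 = [
    ("PPU a", "PPU b", "two half-speed links g and h"),
    ("PPU a", "PPU c", "full-speed link a"),
    ("PPU a", "PPU d", "full-speed link e"),
    ("PPU b", "PPU c", "full-speed link (inferred)"),
    ("PPU b", "PPU d", "full-speed link (inferred)"),
    ("PPU c", "PPU d", "two half-speed links (inferred)"),
    ("PPU a", "GAE SMX a", "full-speed link a1"),
    ("PPU b", "GAE SMX b", "full-speed link b1"),
    ("PPU c", "GAE SMX c", "full-speed link c1"),
    ("PPU d", "GAE SMX d", "full-speed link d1"),
]


def build_adjacency(links):
    """Turn the link list into an undirected adjacency map."""
    adjacency = {}
    for u, v, _ in links:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def hop_counts(adjacency, start):
    """Breadth-first search: minimum number of hops from start to every node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


ADJ_800 = build_adjacency(LINKS_800)
PPUS = [name for name in ADJ_800 if name.startswith("PPU")]
# Largest hop count from any PPU to any other PPU or GAE SMX card.
worst_case_hops = max(max(hop_counts(ADJ_800, ppu).values()) for ppu in PPUS)
print(worst_case_hops)
```

Under these assumptions the model contains eight full-speed links and four half-speed links, matching the totals given for system 800.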
[0106] Figure 8B is a schematic diagram of another exemplary system connecting PPU cards and graph access engine-based smart storage expansion cards, according to some embodiments of the present disclosure. Figure 8B is for illustrative purposes only and may vary depending on the implementation. The system 810 shown in Figure 8B can have fewer, more, or alternative components and connections.
[0107] As shown in Figure 8B, system 810 includes four PPUs: PPU a, PPU b, PPU c, and PPU d. In some embodiments, each of PPU a, PPU b, PPU c, and PPU d is similar to the PPU card 300 shown in Figure 3. For example, each of the four PPUs includes three ICN full-speed links and two ICN half-speed links. As shown in Figure 8B, each PPU is connected to each of the other PPUs via one ICN full-speed link or two ICN half-speed links. For example, PPU a is connected to PPU b via two ICN half-speed links g and h, to PPU c via ICN full-speed link a, and to PPU d via ICN full-speed link e. In some embodiments, each of the four PPUs is connected to a GAE SMX card. For example, as shown in Figure 8B, PPU a is connected to GAE SMX card a via ICN full-speed link a1, PPU b is connected to GAE SMX card b via ICN full-speed link b1, PPU c is connected to GAE SMX card c via ICN full-speed link c1, and PPU d is connected to GAE SMX card d via ICN full-speed link d1. In some embodiments, each GAE SMX card is connected to one or more other GAE SMX cards. For example, as shown in Figure 8B, GAE SMX card a is connected to GAE SMX card b via MoF link b, and to GAE SMX card c via MoF link a. In some embodiments, each of GAE SMX cards a through d shown in Figure 8B is similar to the GAE SMX card 700 shown in Figure 7. As shown in Figure 8B, system 810 has eight ICN full-speed links, four ICN half-speed links, and four MoF links. Therefore, each PPU in system 810 can access the resources (e.g., storage resources) of any other PPU or GAE SMX card within at most two hops. For example, although there is no direct link connecting PPU a and GAE SMX card d, PPU a can access the storage resources in GAE SMX card d through ICN full-speed link e and ICN full-speed link d1. In some embodiments, each PPU includes an ICN switch that facilitates data access to the resources. For example, PPU a uses the ICN switch located in PPU d to access the storage resources in GAE SMX card d through ICN full-speed link e and ICN full-speed link d1. Furthermore, each GAE SMX card can access the resources (e.g., storage resources) of any other GAE SMX card within at most two hops. For example, GAE SMX card a accesses the storage resources of GAE SMX card b via MoF link b, and accesses the storage resources of GAE SMX card d via MoF links b and c or via MoF links a and d. In some embodiments, each GAE SMX card includes a MoF switch for facilitating data access to the resources. For example, GAE SMX card a uses the MoF switch located in GAE SMX card b to access the storage resources of GAE SMX card d via MoF links b and c. The ICN full-speed and half-speed links and the MoF links shown in Figure 8B can help each PPU and GAE SMX card efficiently access the resources (e.g., storage resources) of other PPUs and GAE SMX cards. Therefore, storage capacity is no longer limited to a single PPU or a single GAE SMX card, and data transfer between PPUs and GAE SMX cards can be improved. In some embodiments, the system 810 shown in Figure 8B can facilitate access to graph data through near-memory processing.
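The card-to-card routes described above can be enumerated with a minimal sketch of the MoF ring. The text names MoF link b (cards a and b) and MoF link a (cards a and c); the endpoints of MoF links c and d are not stated directly and are inferred here from the two routes given from card a to card d. The function name `mof_routes` is hypothetical.

```python
# MoF link name -> the two GAE SMX cards it joins.
# Links c and d are inferred from the routes "b then c" and "a then d".
MOF_LINKS = {
    "a": ("GAE SMX a", "GAE SMX c"),
    "b": ("GAE SMX a", "GAE SMX b"),
    "c": ("GAE SMX b", "GAE SMX d"),
    "d": ("GAE SMX c", "GAE SMX d"),
}
CARDS = sorted({card for pair in MOF_LINKS.values() for card in pair})


def mof_routes(src, dst, max_hops=2):
    """List every loop-free route of at most max_hops MoF links, as link-name lists."""
    routes = []

    def walk(card, used_links, visited):
        if card == dst:
            routes.append(used_links)
            return
        if len(used_links) == max_hops:
            return
        for name, (u, v) in sorted(MOF_LINKS.items()):
            if card in (u, v):
                neighbor = v if card == u else u
                if neighbor not in visited:
                    walk(neighbor, used_links + [name], visited | {neighbor})

    walk(src, [], {src})
    return routes


print(mof_routes("GAE SMX a", "GAE SMX b"))
print(mof_routes("GAE SMX a", "GAE SMX d"))
```

With this ring, every pair of distinct cards has at least one route of two MoF links or fewer, which is the property the paragraph relies on.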
[0108] It should be noted that the system 800 of Figure 8A differs from the system 810 of Figure 8B. For example, for a PPU accessing a GAE SMX card that is not directly connected to that PPU, system 800 provides a different route than system 810. In system 800, PPU a accesses GAE SMX card d through another PPU (e.g., PPU d). In system 810, PPU a accesses GAE SMX card d through another GAE SMX card (e.g., GAE SMX card a).
[0109] In some embodiments, each SMX card (e.g., the SMX cards shown in Figure 5 and Figures 6A-6C) includes one or more memory control modules to facilitate memory access. Figure 9 is a schematic diagram of an exemplary memory access system for a storage expansion card according to some embodiments of the present disclosure. Figure 9 is for illustrative purposes only and may vary depending on the implementation. The memory access system 900 shown in Figure 9 can have fewer, more, or alternative components and connections. It is understood that the memory access system 900 shown in Figure 9 can be implemented in the SMX cards shown in Figure 5 and Figures 6A-6C. In some embodiments, the memory access system 900 is implemented on an FPGA.
[0110] In some embodiments, the memory access system 900 includes a plurality of random access memories (RAMs). For example, as shown in Figure 9, the memory access system 900 includes four DDRs (e.g., each DDR has a bandwidth of 12.5 GB/s). In some embodiments, each DDR is connected to a memory interface. For example, as shown in Figure 9, each DDR is connected to a memory interface generator (MIG) used to generate the memory interface of the DDR on the FPGA. In some embodiments, the DDRs are connected to an AXI bus.
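As a quick arithmetic check of the example figures above, the following sketch computes the aggregate peak bandwidth of the four DDRs. It assumes the per-DDR figures simply add, which ignores any contention on the shared AXI bus; the variable names are illustrative.

```python
DDR_COUNT = 4
DDR_BANDWIDTH_GB_PER_S = 12.5  # example per-DDR bandwidth from the text

# Assumption: the four DDRs can be driven concurrently, so peak bandwidths add.
aggregate_bandwidth = DDR_COUNT * DDR_BANDWIDTH_GB_PER_S
print(f"aggregate peak DDR bandwidth: {aggregate_bandwidth} GB/s")
```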
[0111] In some embodiments, the memory access system 900 shown in Figure 9 includes an ICN-to-mem gasket module, which translates between local memory operations (e.g., memory operations on the DDRs) and packet transfers over one or more ICN links. The ICN-to-mem gasket module can be connected to the AXI bus and can access the DDRs via the AXI bus. In some embodiments, the memory access system 900 includes two ICN full-speed links. For example, similar to the SMX cards shown in Figure 5 and Figures 6A-6C, each SMX card that includes a memory access system 900 has two ICN full-speed links, and each ICN full-speed link has two QSFP-DD ports (e.g., as shown in Figure 9). In some embodiments, the memory access system 900 includes a plurality of modules connecting the ICN-to-mem gasket module to the QSFP-DD ports, such as one or more C2C DMA engine modules, one or more PKT engine modules, one or more I/F modules, one or more PRC modules, and one or more MAC modules.
[0112] In some embodiments, the memory access system 900 shown in Figure 9 includes a PCIe connection, similar to the SMX cards shown in Figure 5 and Figures 6A-6C. In some embodiments, as shown in Figure 9, the PCIe connection and PCIe slot are connected to the AXI bus and access the DDRs via the AXI bus.
[0113] In some embodiments, each GAE SMX card (e.g., the GAE SMX cards shown in Figure 7 and Figures 8A-8B) includes one or more memory control modules to facilitate memory access. Figure 10 is a schematic diagram of an exemplary memory access system for a graph access engine-based storage expansion card, according to some embodiments of the present disclosure. Figure 10 is for illustrative purposes only and may vary depending on the implementation. The memory access system 1000 shown in Figure 10 can have fewer, more, or alternative components and connections. It is understood that the memory access system 1000 shown in Figure 10 can be implemented in the GAE SMX cards shown in Figure 7 and Figures 8A-8B. In some embodiments, the memory access system 1000 is implemented on an FPGA.
[0114] In some embodiments, the memory access system 1000 includes a plurality of random access memories (RAMs). For example, as shown in Figure 10, the memory access system 1000 includes four DDRs (e.g., each DDR has a bandwidth of 12.5 GB/s). In some embodiments, each DDR is connected to a memory interface. For example, as shown in Figure 10, each DDR is connected to a memory interface generator (MIG) used to generate the memory interface of the DDR on the FPGA. In some embodiments, the DDRs are connected to an AXI bus.
[0115] In some embodiments, the memory access system 1000 shown in Figure 10 includes a GAE module that is connected to the AXI bus and accesses the DDRs via the AXI bus. In some embodiments, the GAE module includes a graph accelerator module for performing near-memory processing for graph applications. In some embodiments, the GAE module includes an ICN-to-mem gasket module that translates between local memory operations (e.g., memory operations on the DDRs) and packet transfers over one or more ICN links. In some embodiments, the memory access system 1000 includes a single ICN full-speed link. For example, similar to the GAE SMX cards shown in Figure 7 and Figures 8A-8B, each GAE SMX card that includes a memory access system 1000 has one ICN full-speed link, and each ICN full-speed link has two QSFP-DD ports (e.g., as shown in Figure 10). In some embodiments, the memory access system 1000 includes a plurality of modules connecting the ICN-to-mem gasket module in the GAE module to the QSFP-DD ports, such as one or more C2C DMA engine modules, one or more PKT engine modules, one or more I/F modules, one or more PRC modules, and one or more MAC modules.
[0116] In some embodiments, the memory access system 1000 includes two MoF links. For example, similar to the GAE SMX cards shown in Figure 7 and Figures 8A-8B, each GAE SMX card that includes a memory access system 1000 has two MoF links, and each MoF link has one QSFP-DD port (e.g., as shown in Figure 10). In some embodiments, the memory access system 1000 includes a plurality of modules connecting the GAE module and the MoF links to the QSFP-DD ports, such as one or more MoF center modules, one or more MoF switch modules, one or more MoF edge modules, and one or more MAC modules.
[0117] In some embodiments, the memory access system 1000 shown in Figure 10 includes a PCIe connection, similar to the GAE SMX cards shown in Figure 7 and Figures 8A-8B. In some embodiments, as shown in Figure 10, the PCIe connection and PCIe slot are connected to the AXI bus and can access the DDRs via the AXI bus.
[0118] Figure 11 is a schematic diagram of an ICN-to-mem gasket module of an exemplary memory access system according to some embodiments of the present disclosure. Figure 11 is for illustrative purposes only and may vary depending on the implementation. The ICN-to-mem gasket module 1100 shown in Figure 11 can have fewer, more, or alternative components and connections. It should be understood that the ICN-to-mem gasket module 1100 shown in Figure 11 can be similar to the ICN-to-mem gasket module described with reference to Figure 9 or Figure 10. In some embodiments, the ICN-to-mem gasket module 1100 is implemented on an FPGA.
[0119] In some embodiments, the ICN-to-mem gasket module 1100 shown in Figure 11 includes an arbiter, a router, and modules such as an atomic ALU module and an atomic FIFO module for performing memory-based operations (e.g., atomic operations). For example, the arbiter arbitrates which ICN packets enter the atomic modules based on the ICN command type, and the router routes responses and data to different ICN links based on physical addresses (e.g., encoded physical addresses). In some embodiments, the ICN-to-mem gasket module 1100 includes a cache. In some embodiments, the atomic FIFO module manages the ordering of atomic operations. In some embodiments, the atomic ALU module handles the computations involved in the atomic operations.
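The division of labor among the arbiter, atomic FIFO, atomic ALU, and router can be illustrated with a minimal software sketch. Everything concrete here is an assumption made for illustration: the command names, the fetch-and-op semantics that return the old value, and the choice of address bit 28 as the boundary between the memory offset and the encoded return link. The disclosure does not specify any of these.

```python
from collections import deque
from dataclasses import dataclass

LINK_SHIFT = 28                      # assumption: upper address bits encode the return ICN link
OFFSET_MASK = (1 << LINK_SHIFT) - 1  # lower bits index the local memory


@dataclass
class IcnPacket:
    command: str   # hypothetical command types: read, write, atomic_add, atomic_max
    address: int
    value: int = 0


def atomic_alu(command, old, operand):
    """Atomic ALU: compute the new memory value for an atomic command."""
    if command == "atomic_add":
        return old + operand
    if command == "atomic_max":
        return max(old, operand)
    raise ValueError(f"unsupported atomic command: {command}")


class Gasket:
    def __init__(self, words=16):
        self.memory = [0] * words
        self.atomic_fifo = deque()

    def route(self, packet, data):
        """Router: pick the response link from the encoded physical address."""
        return (packet.address >> LINK_SHIFT, data)

    def arbitrate(self, packet):
        """Arbiter: queue atomic commands by type, serve plain reads and writes directly."""
        if packet.command.startswith("atomic_"):
            self.atomic_fifo.append(packet)
            return None
        if packet.command not in ("read", "write"):
            raise ValueError(f"unsupported command: {packet.command}")
        offset = packet.address & OFFSET_MASK
        if packet.command == "write":
            self.memory[offset] = packet.value
        return self.route(packet, self.memory[offset])

    def drain_atomics(self):
        """Atomic FIFO: apply queued atomics in arrival order, one read-modify-write each."""
        responses = []
        while self.atomic_fifo:
            packet = self.atomic_fifo.popleft()
            offset = packet.address & OFFSET_MASK
            old = self.memory[offset]
            self.memory[offset] = atomic_alu(packet.command, old, packet.value)
            responses.append(self.route(packet, old))
        return responses


gasket = Gasket()
gasket.arbitrate(IcnPacket("write", 3, 10))
gasket.arbitrate(IcnPacket("atomic_add", (1 << LINK_SHIFT) | 3, 5))
gasket.arbitrate(IcnPacket("atomic_max", (2 << LINK_SHIFT) | 3, 12))
atomic_responses = gasket.drain_atomics()
print(atomic_responses, gasket.memory[3])
```

Draining the FIFO in order is what keeps two atomics to the same address from interleaving their read and write phases in this sketch.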
[0120] Figure 12 is a schematic diagram of a packet (PKT) engine module of an exemplary memory access system according to some embodiments of the present disclosure. Figure 12 is for illustrative purposes only and may vary depending on the implementation. The PKT engine module 1200 shown in Figure 12 can have fewer, more, or alternative components and connections. It is understood that the PKT engine module 1200 shown in Figure 12 can be similar to the PKT engine module described with reference to Figure 9 or Figure 10. In some embodiments, the PKT engine module 1200 is implemented on an FPGA.
[0121] In some embodiments, the PKT engine module 1200 shown in Figure 12 handles packet editing and clock domain crossing (CDC). In some embodiments, as shown in Figure 12, the PKT engine module 1200 includes two functional domains: an ingress processing domain and an egress processing domain. In some embodiments, input streams of different categories (e.g., a kernel category, a DMA category, etc.) are arbitrated (e.g., by an arbitration module applying strict-priority or round-robin arbitration rules). The winner of the arbitration is passed to the ingress packet (ING PKT) editing module. In some embodiments, the PKT engine module 1200 includes one or more CDC FIFO modules (e.g., asynchronous FIFOs) for performing CDC data transfer. In some embodiments, the egress processing domain is designed as a mirrored (reverse) version of the ingress processing domain, with one difference being the addition of a third category of streams (e.g., a response category).
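The two arbitration rules mentioned above behave differently when one category is busier than another. The following is a minimal software sketch of both rules over per-category queues; the class name `Arbiter` and its interface are hypothetical, and the actual module is hardware whose details are not given in the text.

```python
from collections import deque


class Arbiter:
    """Pick one pending packet per grant, by strict priority or round robin."""

    def __init__(self, categories, policy="strict"):
        if policy not in ("strict", "round_robin"):
            raise ValueError(f"unknown policy: {policy}")
        self.categories = list(categories)  # list order is the priority order for "strict"
        self.queues = {c: deque() for c in self.categories}
        self.policy = policy
        self._next = 0                      # round-robin pointer

    def push(self, category, packet):
        self.queues[category].append(packet)

    def grant(self):
        """Return (category, packet) for the winner, or None if nothing is pending."""
        count = len(self.categories)
        start = 0 if self.policy == "strict" else self._next
        for i in range(count):
            index = (start + i) % count
            category = self.categories[index]
            if self.queues[category]:
                if self.policy == "round_robin":
                    self._next = (index + 1) % count
                return category, self.queues[category].popleft()
        return None


def grant_order(policy):
    """Feed the same traffic to an arbiter and record the order of winners."""
    arbiter = Arbiter(["kernel", "dma"], policy)
    arbiter.push("kernel", "k0")
    arbiter.push("kernel", "k1")
    arbiter.push("dma", "d0")
    winners = []
    while (winner := arbiter.grant()) is not None:
        winners.append(winner[1])
    return winners


print(grant_order("strict"))
print(grant_order("round_robin"))
```

Strict priority always drains the higher-priority category first, while round robin alternates between categories that have pending packets, which avoids starving the lower-priority stream.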
[0122] Figure 13 is a schematic diagram of a chip-to-chip (C2C) direct memory access (DMA) engine module of an exemplary memory access system according to some embodiments of the present disclosure. Figure 13 is for illustrative purposes only and may vary depending on the implementation. The C2C DMA engine module 1300 shown in Figure 13 can have fewer, more, or alternative components and connections. It is understood that the C2C DMA engine module 1300 shown in Figure 13 can be similar to the C2C DMA engine module described with reference to Figure 9 or Figure 10. In some embodiments, the C2C DMA engine module 1300 is implemented on an FPGA.
[0123] In some embodiments, the C2C DMA engine module 1300 is used to manage chip-to-chip connections via one or more ICN links. In some embodiments, as shown in Figure 13, the C2C DMA engine module 1300 includes three functional domains: a memory access control domain (MEM_Access_CTRL), a chip-to-chip ingress control domain (C2C_Ingress_CTRL), and a chip-to-chip egress control domain (C2C_Egress_CTRL).
[0124] In some embodiments, the memory access control domain includes two fence processing modules for performing fence operations in parallel across multiple ICN command classes. In some embodiments, the memory access control domain also includes a memory arbitration module for arbitrating between two or more ICN command classes and sending the appropriate ICN commands and data to an ICN-to-mem gasket module (e.g., the ICN-to-mem gasket module 1100 of Figure 11). In some embodiments, the memory access control domain further includes a read completion buffer manager module (RdCplBuf_Manager) and a write response buffer manager module (WrRspBuf_Manager). The read completion buffer manager module manages responses to read commands (read completions), and the write response buffer manager module manages responses to write commands (write responses).
[0125] In some embodiments, the chip-to-chip ingress control domain includes a packet stream unpacking module (Packet_Stream_Unpack), which unpacks an ICN data stream into commands and data according to, for example, a customized ICN protocol. In some embodiments, depending on the operation category (e.g., kernel or DMA), the unpacked commands and data are sent to a kernel credit control module (Kernel_Credit_Ctrl) or a DMA credit control module (DMA_Credit_Ctrl).
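The text does not describe how the credit control modules operate internally. Assuming they follow conventional credit-based flow control, in which an entry is forwarded only while the downstream stage has advertised free capacity, the per-category dispatch could be sketched as follows. The class `CreditControl`, the function `dispatch`, and the already-decoded `(category, command, data)` tuples are all hypothetical stand-ins, since the customized ICN protocol format is not given.

```python
class CreditControl:
    """Forward entries only while credits remain; hold the rest until credits return."""

    def __init__(self, credits):
        self.credits = credits
        self.pending = []    # entries waiting for a credit
        self.forwarded = []  # entries passed downstream

    def accept(self, command, data):
        self.pending.append((command, data))
        self._pump()

    def release(self, count=1):
        """Downstream returns credits after it has consumed entries."""
        self.credits += count
        self._pump()

    def _pump(self):
        while self.credits > 0 and self.pending:
            self.forwarded.append(self.pending.pop(0))
            self.credits -= 1


def dispatch(unpacked_stream, controllers):
    """Send each unpacked (category, command, data) tuple to its category's controller."""
    for category, command, data in unpacked_stream:
        if category not in controllers:
            raise ValueError(f"unknown operation category: {category}")
        controllers[category].accept(command, data)


kernel_ctrl = CreditControl(credits=1)
dma_ctrl = CreditControl(credits=2)
dispatch(
    [("kernel", "read", b""), ("dma", "write", b"x"), ("kernel", "write", b"y")],
    {"kernel": kernel_ctrl, "dma": dma_ctrl},
)
print(kernel_ctrl.forwarded, kernel_ctrl.pending, dma_ctrl.forwarded)
```

In this sketch the second kernel command stays pending until `kernel_ctrl.release()` is called, which models back-pressure from a full downstream buffer.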
[0126] In some embodiments, the chip-to-chip egress control domain includes a chip-to-chip kernel generation module (C2C_Kernel_Gen), a chip-to-chip DMA generation module (C2C_DMA_Gen), and a chip-to-chip write acknowledgment generation module (C2C_WrAck_Gen). The chip-to-chip kernel generation module generates kernel-class responses, the chip-to-chip DMA generation module generates DMA-class responses, and the chip-to-chip write acknowledgment generation module generates write acknowledgment responses for the processor (e.g., the PPUs shown in Figures 3-8B). In some embodiments, the generated responses are fed to a chip-to-chip arbitration module (C2C_Arbitration), which is included in the chip-to-chip egress control domain and performs arbitration, such as round-robin or strict-priority arbitration. In some embodiments, the winner of the arbitration is sent to a packet stream packing module (Packet_Stream_Pack), which is included in the chip-to-chip egress control domain and packs commands and data according to, for example, a customized ICN protocol.
[0127] Figure 14 is a schematic diagram of a graph access engine module of an exemplary memory access system according to some embodiments of the present disclosure. Figure 14 is for illustrative purposes only and may vary depending on the implementation. The graph access engine module 1400 shown in Figure 14 can have fewer, more, or alternative components and connections. It is understood that the graph access engine module 1400 shown in Figure 14 can be similar to the graph access engine module shown in Figure 10. In some embodiments, the graph access engine module 1400 is implemented on an FPGA.
[0128] In some embodiments, the graph access engine module extends the ICN-to-mem gasket module (e.g., the ICN-to-mem gasket module 1100 of Figure 11) to support near-memory processing for graph applications. In some embodiments, the graph access engine module includes two domains: a PPU ingress/egress domain (PPU_Ingress/Egress) and a graph access engine ingress/egress domain (GAE_Ingress/Egress).
[0129] In some embodiments, as shown in Figure 14, the PPU ingress/egress domain is similar to the ICN-to-mem gasket module 1100 shown in Figure 11. In some embodiments, the graph access engine ingress/egress domain includes one or more modules similar to the modules shown in Figure 2A, Figure 2B, or Figure 2C. For example, the graph access engine ingress/egress domain includes an access engine (e.g., similar to access engine 2310 of Figure 2B or access engine 2400 of Figure 2C), a MoF first-in-first-out (FIFO) queue, a RISC-V (e.g., similar to RISC-V 2330 of Figure 2B), and an execution engine (e.g., similar to GEMM execution engine 2340 of Figure 2B, VPU execution engine 2350 of Figure 2B, or a combination thereof). Therefore, the graph access engine module 1400 enables the GAE SMX card to assist the PPU in performing GNN operations (e.g., similar to the way the GNN module 2220 of Figure 2A assists the dedicated processor 2240 in performing GNN operations).
[0130] Figure 15 is a schematic diagram of a fabric switch module of an exemplary memory access system according to some embodiments of the present disclosure. Figure 15 is for illustrative purposes only and may vary depending on the implementation. The MoF switch module 1500 shown in Figure 15 can have fewer, more, or alternative components and connections. It is understood that the MoF switch module 1500 shown in Figure 15 can be similar to the MoF switch module shown in Figure 10. In some embodiments, the MoF switch module 1500 is implemented on an FPGA.
[0131] In some embodiments, four GAE SMX cards are connected in a ring topology (e.g., as shown in Figure 8B, four GAE SMX cards are connected via four MoF links). Each GAE SMX card therefore supports data transfer with a single-hop bypass, so that all the SMX cards are fully interconnected. Accordingly, the MoF switch module 1500 can be used to perform single-hop data bypass. For example, as shown in Figure 10, in terms of data flow, the MoF switch module is located between the MoF center module and the MoF edge module. In some embodiments, the MoF switch module includes two switch buffers for storing data packets of the current transaction. In some embodiments, the MoF switch module includes a switch manager module that, based on information encoded in the data packets (e.g., a card ID), generates control signals determining whether an incoming data packet is received locally or bypassed to another SMX card.
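The receive-or-bypass decision of the switch manager can be sketched in a few lines. The ring order below follows the MoF links described for Figure 8B (a to b, b to d, d to c, c to a); choosing the shorter direction around the ring is an assumption of this sketch, as are the function names.

```python
RING = ["a", "b", "d", "c"]  # GAE SMX cards in ring order, per the MoF links of Figure 8B


def switch_decision(local_card, packet_card_id):
    """Switch manager: keep packets addressed to this card, bypass the rest."""
    return "receive" if packet_card_id == local_card else "bypass"


def deliver(src, dst):
    """Trace a packet from src to dst; return (card, decision) for each card it enters."""
    size = len(RING)
    position, target = RING.index(src), RING.index(dst)
    # Assumption: travel in whichever ring direction is shorter.
    step = 1 if (target - position) % size <= size // 2 else -1
    trace = []
    while position != target:
        position = (position + step) % size
        trace.append((RING[position], switch_decision(RING[position], dst)))
    return trace


print(deliver("a", "b"))
print(deliver("a", "d"))
```

On a four-card ring no destination is more than two links away, so a packet is bypassed by at most one intermediate card before it is received.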
[0132] Figure 16 is a flowchart illustrating an exemplary method for accelerating GNN processing using one or more storage expansion cards according to some embodiments of this disclosure. Method 1600 can be implemented in the environment shown in Figure 8A or Figure 8B. Method 1600 can be performed by the apparatus, devices, or systems shown in Figures 8A-15. Depending on the implementation, method 1600 may include additional, fewer, or alternative steps performed in various orders or in parallel.
[0133] Step 1610 includes extracting, by a storage expansion card, partial structural data of a graph from one or more memories in the storage expansion card. In some embodiments, the extraction is performed by access engine circuitry of the storage expansion card (e.g., the access engine shown in Figure 14). In some embodiments, the storage expansion card is communicatively coupled, via a first type of interconnect unit (e.g., the ICN full-speed link shown in Figure 8A or Figure 8B), to a processing unit used for graph neural network (GNN) processing. In some embodiments, the storage expansion card is communicatively coupled to each of two other storage expansion cards via a second type of interconnect unit (e.g., the MoF link shown in Figure 8B), and the method also includes extracting partial structural data from one or both of the two other storage expansion cards. In some embodiments, the storage expansion cards are implemented on FPGAs.
[0134] Step 1620 includes performing node sampling using the extracted partial structural data of the graph to select one or more sampled nodes. In some embodiments, the node sampling is performed by the access engine circuitry. In some embodiments, the node sampling is performed in a manner similar to the GetNeighbor module 2410 of Figure 2C or the GetSample module 2420 of Figure 2C.
[0135] Step 1630 includes extracting partial attribute data of the graph from the one or more memories based on the one or more selected sampled nodes. In some embodiments, the partial attribute data of the graph is extracted by the access engine circuitry. In some embodiments, the method further includes extracting partial attribute data from one or both of the two other storage expansion cards.
[0136] Step 1640 includes sending the extracted partial attribute data of the graph to the processing unit via the first type of interconnect unit. In some embodiments, the extracted partial attribute data of the graph is sent by the access engine circuitry.
[0137] Step 1650 includes performing GNN processing on the graph using the extracted partial attribute data of the graph. In some embodiments, the GNN processing is performed by a processing unit. In some embodiments, the host includes one or more processors for performing the GNN processing. In some embodiments, the one or more processors include one or more central processing units (CPUs), one or more graphics processing units (GPUs), one or more graph neural network processing units, one or more dedicated graph neural network processing units, and so on.
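Steps 1610 through 1650 can be read as a small data pipeline. The sketch below walks one root node through that pipeline on a toy in-memory graph. The graph contents, the uniform random sampling with a fixed fanout, and the mean aggregation used to stand in for GNN processing are all illustrative assumptions, and every function name is hypothetical; the sampling modules of Figure 2C are not reproduced here.

```python
import random

# Toy contents of the card's memories: structural data (adjacency lists)
# and attribute data (one feature vector per node).
STRUCTURE = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
ATTRIBUTES = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [2.0, 0.0]}


def extract_structure(root):
    """Step 1610: fetch the part of the graph structure around the root node."""
    return {root: STRUCTURE[root]}


def sample_nodes(partial_structure, fanout, rng):
    """Step 1620: pick up to `fanout` neighbors per node, uniformly without replacement."""
    sampled = []
    for neighbors in partial_structure.values():
        sampled.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
    return sampled


def extract_attributes(nodes):
    """Step 1630: fetch attribute data only for the sampled nodes."""
    return {node: ATTRIBUTES[node] for node in nodes}


def send_to_processing_unit(payload):
    """Step 1640: stand-in for a transfer over the first type of interconnect unit."""
    return dict(payload)


def gnn_process(attributes):
    """Step 1650: mean-aggregate the received vectors (one illustrative GNN operation)."""
    vectors = list(attributes.values())
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]


def method_1600(root, fanout=2, seed=0):
    rng = random.Random(seed)
    structure = extract_structure(root)
    sampled = sample_nodes(structure, fanout, rng)
    attributes = extract_attributes(sampled)
    received = send_to_processing_unit(attributes)
    return gnn_process(received)


print(method_1600(root=0, fanout=3))
```

The point of the split is visible in `extract_attributes`: only the attributes of sampled nodes cross the interconnect, rather than the attributes of every neighbor.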
[0138] In some embodiments, the processing unit is communicatively coupled to other processing units via one interconnect unit of the first type or two interconnect units of a third type. In some embodiments, the form factor of each first-type interconnect unit is two QSFP-DD ports, with a bandwidth equal to or greater than 100 GB/s. In some embodiments, the form factor of each second-type interconnect unit is one QSFP-DD port, with a bandwidth equal to or greater than 50 GB/s. In some embodiments, the form factor of each third-type interconnect unit is four Mini-SAS ports, with a bandwidth equal to or greater than 0 GB/s. In some embodiments, the method further includes performing, by a switch in the storage expansion card (e.g., similar to the GAE SMX card shown in Figure 8B), data bypass on data received from each of the two other storage expansion cards via the second type of interconnect unit. In some embodiments, the method further includes performing, by a switch in the processing unit (e.g., similar to the PPU shown in Figure 8A), data bypass on data received from other processing units via one first-type interconnect unit or two third-type interconnect units.
[0139] Each process, method, and algorithm described in the preceding sections may be embodied in a code module executed by one or more computer systems or a computer processor containing computer hardware, and may be fully or partially automated by that code module. These processes and algorithms may be implemented, in part or in whole, in dedicated circuitry.
[0140] When the functions disclosed herein are implemented as software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Specific technical solutions (all or part) disclosed herein, or aspects contributing to the present technology, may be embodied in the form of a software product. A software product includes multiple instructions that may be stored in a storage medium to cause a computing device (which may be a personal computer, server, network device, etc.) to perform all or some steps of the methods of the embodiments of this disclosure. The storage medium includes a flash drive, a portable hard disk drive, ROM, RAM, a magnetic disk, an optical disk, another medium that can be used to store program code, or any combination thereof.
[0141] A further embodiment provides a system including a processor and a non-transitory computer-readable storage medium storing instructions executable by the processor to cause the system to perform operations corresponding to the steps in any of the methods of the above embodiments. The embodiment also provides a non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause one or more processors to perform operations corresponding to the steps in any of the methods of the above embodiments.
[0142] The embodiments disclosed herein can be implemented through a cloud platform, server, or group of servers (collectively referred to as the "Service System") that interacts with a client. The client can be a terminal device or a client registered by a user on the platform, wherein the terminal device can be a mobile terminal, a personal computer (PC), or any device that can have the platform application installed.
[0143] The various features and processes described above can be used independently of each other or combined in various ways. All possible combinations and sub-combinations are within the scope of this disclosure. Furthermore, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are not limited to any particular sequence, and the associated blocks or states can be executed in other suitable sequences. For example, described blocks or states may be executed in an order other than the one specifically disclosed, or multiple blocks or states may be combined into a single block or state. Example blocks or states may be executed serially, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed exemplary embodiments. The configuration of the exemplary systems and components described herein may differ from that described. For example, elements may be added to, removed from, or rearranged relative to the disclosed exemplary embodiments.
[0144] The various operations of the exemplary methods described herein can be performed at least partially by an algorithm. This algorithm may be contained in program code or instructions stored in memory (e.g., the aforementioned non-transitory computer-readable storage medium). Such an algorithm may include a machine learning algorithm. In some embodiments, a machine learning algorithm does not rely on the computer being explicitly programmed to perform a function, but instead learns from training data to build a predictive model that performs that function.
[0145] The various operations of the exemplary methods described herein can be performed at least partially by one or more processors, which may be temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, these processors constitute processor-implemented engines that operate to perform one or more of the operations or functions described herein.
[0146] Similarly, taking one or more specific processors as hardware examples, the methods described herein can be implemented at least partially by a processor. For example, at least some operations of the methods can be performed by one or more processors or an engine implemented by a processor. Furthermore, one or more processors can also support the performance of the related operations in a "cloud computing" environment or as "software as a service" (SaaS). For example, at least some operations can be performed by a set of computers (as an example of a machine including processors) that can be accessed via a network (e.g., the Internet) and one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
[0147] The performance of certain operations may be distributed across processors, residing not only within a single machine but also deployed across multiple machines. In some exemplary embodiments, the processor or processor-implemented engine may reside in a single geographic location (e.g., within a home environment, office environment, or server farm). In other exemplary embodiments, the processor or processor-implemented engine may be distributed across multiple geographic locations.
[0148] Throughout this specification, multiple instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are shown and described as separate operations, one or more of the individual operations can be performed simultaneously, and the operations do not need to be performed in the order shown. Structures and functions presented as separate components in the exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functions presented as a single component can be implemented as separate components. These and other changes, modifications, additions, and improvements are within the scope of this document.
[0149] Although an overview of the subject matter has been described with reference to specific exemplary embodiments, various modifications and alterations may be made to these embodiments without departing from the broader scope of embodiments of this disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term "this disclosure" merely for convenience, and, if more than one disclosure or concept is in fact disclosed, without intending to voluntarily limit the scope of this application to any single disclosure or concept.
[0150] The embodiments shown herein have been described in sufficient detail to enable those skilled in the art to practice the disclosed teachings. Other embodiments may be used and derived therefrom, allowing for structural and logical substitutions and changes without departing from the scope of this disclosure. Therefore, the detailed description is not intended to be limiting, and the scope of the various embodiments is defined only by the appended claims and all their equivalents.
[0151] Any process descriptions, elements, or blocks in the flowcharts described herein and/or depicted in the accompanying drawings should be understood as potentially representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps in the process. Alternative implementations are included within the scope of the embodiments described herein, in which elements or functions may be deleted, or executed out of order from that shown or discussed (including substantially concurrently or in reverse order), depending on the functionality involved, as would be understood by those skilled in the art.
[0152] Unless otherwise expressly stated or the context otherwise requires, the word "or" as used herein is inclusive rather than exclusive. Therefore, "A, B, or C" here means "A, B, A and B, A and C, B and C, or A, B, and C," unless otherwise expressly stated or the context otherwise requires. Furthermore, "and" is both joint and several unless otherwise expressly stated or the context otherwise requires. Therefore, "A and B" here means "A and B, jointly or severally," unless otherwise expressly stated or the context otherwise requires. Furthermore, plural instances may be provided for a resource, operation, or structure described herein as a single instance. Moreover, the boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality may be envisioned and fall within the scope of various embodiments herein. In general, structures and functions presented as separate resources in exemplary configurations may be implemented as a combined structure or resource. Similarly, structures and functions presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within the scope of embodiments of this disclosure as represented by the appended claims. Therefore, the specification and drawings should be considered illustrative rather than restrictive.
[0153] The terms “comprising” or “including” indicate the presence of the subsequently stated features but do not preclude the addition of other features. Unless otherwise expressly stated or otherwise understood from the context in which it is used, conditional language (e.g., “may”) is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, and/or steps. Therefore, such conditional language is generally not intended to imply that features, elements, and/or steps are in any way required for one or more embodiments, or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether such features, elements, and/or steps are included in or are to be performed in any particular embodiment.
Claims
1. A system, comprising: a plurality of processing units configured to perform graph neural network processing; and a plurality of storage expansion cards configured to store graph data for the graph neural network processing, wherein: each of the plurality of processing units is communicatively coupled to the other processing units through one or more interconnect units; the plurality of processing units are respectively communicatively coupled to the plurality of storage expansion cards; and each of the plurality of storage expansion cards includes a graph access engine circuit configured to accelerate memory access in the graph neural network processing by: extracting partial structural data of a graph from one or more of the plurality of storage expansion cards; performing node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extracting partial attribute data of the graph from the one or more storage expansion cards based on the selected one or more sampling nodes; and sending the extracted partial attribute data of the graph to one or more of the plurality of processing units.
2. The system according to claim 1, wherein: the plurality of processing units includes a first processing unit, a second processing unit, a third processing unit, and a fourth processing unit; the plurality of storage expansion cards includes a first storage expansion card, a second storage expansion card, a third storage expansion card, and a fourth storage expansion card; each of the first processing unit, the second processing unit, the third processing unit, and the fourth processing unit is communicatively coupled to each of the first storage expansion card, the second storage expansion card, the third storage expansion card, and the fourth storage expansion card via an interconnect unit of a first type; the first processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via an interconnect unit of the first type; the second processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via an interconnect unit of the first type; the first processing unit is communicatively coupled to the second processing unit through two interconnect units of a second type; the third processing unit is communicatively coupled to the fourth processing unit through two interconnect units of the second type; the first storage expansion card is communicatively coupled to each of the second storage expansion card and the third storage expansion card via an interconnect unit of a third type; and the fourth storage expansion card is communicatively coupled to each of the second storage expansion card and the third storage expansion card via an interconnect unit of the third type.
3. The system according to claim 2, wherein the bandwidth of the interconnect unit of the second type is half that of the interconnect unit of the first type.
4. The system according to claim 2, wherein the form factor of the interconnect unit of the first type is two QSFP-DD ports, the bandwidth of which is equal to or greater than 100 GB/s.
5. The system according to claim 2, wherein: the form factor of the interconnect unit of the second type is four Mini-SAS ports, the bandwidth of which is equal to or greater than 50 GB/s; the first processing unit is communicatively coupled to the second processing unit in parallel via two interconnect units of the second type; and the third processing unit is communicatively coupled to the fourth processing unit in parallel via two interconnect units of the second type.
6. The system according to claim 2, wherein the form factor of the interconnect unit of the third type is a QSFP-DD port, and the bandwidth of the QSFP-DD port is equal to or greater than 50 GB/s.
7. The system according to claim 2, wherein each of the plurality of storage expansion cards is further configured to perform data conversion between local memory operations and packet transmissions over one or more interconnect units of the first type or the second type.
8. The system according to claim 2, wherein each of the plurality of storage expansion cards includes a switch configured to perform data bypass on data received from one or more storage expansion cards via one or more interconnect units of the third type.
9. The system according to claim 1, wherein: the plurality of processing units includes a first processing unit, a second processing unit, a third processing unit, and a fourth processing unit; the plurality of storage expansion cards includes a first storage expansion card, a second storage expansion card, a third storage expansion card, and a fourth storage expansion card; each of the first processing unit, the second processing unit, the third processing unit, and the fourth processing unit is communicatively coupled to each of the first storage expansion card, the second storage expansion card, the third storage expansion card, and the fourth storage expansion card via an interconnect unit of a first type; the first processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via an interconnect unit of the first type; the second processing unit is communicatively coupled to each of the third processing unit and the fourth processing unit via an interconnect unit of the first type; the first processing unit is communicatively coupled to the second processing unit through two interconnect units of a second type; and the third processing unit is communicatively coupled to the fourth processing unit through two interconnect units of the second type.
10. The system according to claim 9, wherein: the form factor of the interconnect unit of the first type is two QSFP-DD ports, and the bandwidth of the two QSFP-DD ports is equal to or greater than 100 GB/s; the form factor of the interconnect unit of the second type is four Mini-SAS ports, the bandwidth of which is equal to or greater than 50 GB/s; the first processing unit is communicatively coupled to the second processing unit in parallel via two interconnect units of the second type; and the third processing unit is communicatively coupled to the fourth processing unit in parallel via two interconnect units of the second type.
11. The system according to claim 9, wherein each of the first processing unit, the second processing unit, the third processing unit, and the fourth processing unit includes a switch configured to perform data bypass on data received from one or more other processing units via one or more interconnect units of the first type or the second type.
12. The system according to claim 1, wherein each of the plurality of storage expansion cards is implemented on a field-programmable gate array.
13. The system according to claim 1, wherein each of the plurality of processing units includes one or more central processing units, one or more graphics processing units, one or more tensor processing units, one or more neural network processing units, or one or more graph neural network processing units.
14. A storage expansion card, comprising: one or more memories configured to store graph data for graph neural network processing; a first-type interconnect unit configured to connect the storage expansion card to a processing unit that performs the graph neural network processing; two second-type interconnect units configured to connect the storage expansion card to two other storage expansion cards; and a graph access engine circuit configured to: extract partial structural data of a graph from the one or more memories or the two other storage expansion cards; perform node sampling using the extracted partial structural data of the graph to select one or more sampling nodes; extract partial attribute data of the graph from the one or more memories or the two other storage expansion cards based on the selected one or more sampling nodes; and send the extracted partial attribute data of the graph to the processing unit through the first-type interconnect unit, wherein the bandwidth of each of the two second-type interconnect units is half that of the first-type interconnect unit.
15. The storage expansion card according to claim 14, wherein the processing unit is communicatively coupled to other processing units via an interconnect unit of the first type or two third-type interconnect units.
16. The storage expansion card according to claim 14, wherein the form factor of the first-type interconnect unit is two QSFP-DD ports, the bandwidth of which is equal to or greater than 100 GB/s.
17. The storage expansion card according to claim 14, wherein: the form factor of each second-type interconnect unit is a QSFP-DD port, the bandwidth of the QSFP-DD port being equal to or greater than 50 GB/s; and the form factor of a third-type interconnect unit is four Mini-SAS ports, the bandwidth of which is equal to or greater than 50 GB/s.
18. The storage expansion card according to claim 14, wherein the storage expansion card is implemented on a field-programmable gate array.
19. The storage expansion card according to claim 15, wherein the storage expansion card is further configured to perform data conversion between local memory operations and packet transmissions over one or more interconnect units of the first type or the second type.
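
For illustration only, the four operations that claims 1 and 14 recite for the graph access engine circuit (extracting partial structural data, node sampling, extracting partial attribute data, and sending the result to a processing unit) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the claimed hardware: the `StorageCard` class, the function names, the dictionary-based graph layout, and uniform random neighbor sampling with a `fanout` parameter are all hypothetical choices that the claims do not specify.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StorageCard:
    """Hypothetical software stand-in for one storage expansion card."""
    adjacency: Dict[int, List[int]]      # structural data: node id -> neighbor ids
    attributes: Dict[int, List[float]]   # attribute data: node id -> feature vector


def fetch_structure(cards: List[StorageCard], roots: List[int]) -> Dict[int, List[int]]:
    """Step 1: extract partial structural data (neighbor lists of the root nodes)."""
    partial = {}
    for node in roots:
        for card in cards:               # the data may live on any of the cards
            if node in card.adjacency:
                partial[node] = card.adjacency[node]
                break
    return partial


def sample_nodes(partial: Dict[int, List[int]], fanout: int,
                 rng: random.Random) -> List[int]:
    """Step 2: select sampling nodes (uniform sampling is an assumption here)."""
    sampled = set()
    for neighbors in partial.values():
        k = min(fanout, len(neighbors))
        sampled.update(rng.sample(neighbors, k))
    return sorted(sampled)


def fetch_attributes(cards: List[StorageCard], nodes: List[int]) -> Dict[int, List[float]]:
    """Step 3: extract attribute data only for the sampled nodes."""
    result = {}
    for node in nodes:
        for card in cards:
            if node in card.attributes:
                result[node] = card.attributes[node]
                break
    return result


def graph_access(cards: List[StorageCard], roots: List[int],
                 fanout: int, seed: int = 0) -> Dict[int, List[float]]:
    """Run steps 1-3; the return value models step 4 (data sent to a processing unit)."""
    rng = random.Random(seed)
    partial = fetch_structure(cards, roots)
    sampled = sample_nodes(partial, fanout, rng)
    return fetch_attributes(cards, sampled)
```

In this sketch only the attributes of the sampled nodes are returned to the caller, which mirrors the idea of performing sampling close to the stored graph data so that less data crosses the interconnect to the processing unit.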