Distributed processing system and method for graph convolutional network inference
The distributed processing system using FPGAs and a GPU efficiently partitions and processes sparse matrices to address inefficiencies in GCN inference, enhancing processing speed and reducing communication costs.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION
- Filing Date: 2025-11-28
- Publication Date: 2026-06-11
Description
[0001] The present disclosure relates to a distributed processing system and method for Graph Convolutional Network (GCN) inference, and more specifically, to a distributed processing system and method that can substantially reduce communication time and computation time by distributing the processing of GCN inference across a plurality of FPGAs.
[0002] Recently, Graph Convolutional Networks (GCNs) are being increasingly utilized in various fields, such as social network analysis, recommendation systems, and compound analysis. Consequently, there is a growing need for hardware acceleration technologies to efficiently process large-scale graph data.
[0003] Traditionally, a structure in which the CPU directly manages the GPU to perform large-scale computations was primarily used. However, this approach had drawbacks, including increased computational and management burdens on the CPU, degraded overall system performance, and limitations on GPU scalability.
[0004] To solve these problems, a distributed processing method using a Field Programmable Gate Array (FPGA) together with a GPU has been proposed. In conventional technology, a structure including one FPGA and one GPU was used, and a method was mainly utilized in which the FPGA analyzes instructions received from the CPU to distribute operations, and the GPU performs the corresponding operations.
[0005] However, in such an architecture, the FPGA remained at the level of assisting the GPU's computations, leading to a problem where data communication costs between the GPU and the FPGA increased significantly. Furthermore, even if the GPU performed computations independently, the results had to be transmitted back to the CPU via the FPGA, which resulted in a decrease in the overall efficiency of the system. Moreover, if the load balance between the FPGA and the GPU was not properly maintained, computations would become concentrated on a specific device, causing a bottleneck.
[0006] Therefore, there is a need for new distributed processing techniques that can handle Graph Convolutional Network (GCN) inference on large-scale graph datasets more efficiently.
[0007] The background art described above is technical information that the inventor possessed or acquired in the process of deriving the present disclosure, and it cannot be regarded as prior art disclosed to the general public before the filing of this application.
[0008] The present disclosure provides a distributed processing method for graph convolutional network (GCN) inference to solve the above-mentioned problems, a computer program stored on a recording medium, and a device (system).
[0009] The present disclosure may be implemented in various ways, including a method, a system (device), or a computer program stored on a readable storage medium.
[0010] According to one embodiment of the present disclosure, a distributed processing system for graph convolutional network (GCN) inference comprises a CPU, a plurality of field programmable gate arrays (FPGAs), and a GPU. The CPU creates a plurality of sparse matrix blocks by partitioning a sparse matrix stored in memory, and allocates the created plurality of sparse matrix blocks to each of the plurality of FPGAs. Each of the plurality of FPGAs individually performs an aggregation operation based on the sparse matrix blocks allocated from the CPU to derive an aggregation operation result. The GPU obtains the aggregation operation results derived from each of the plurality of FPGAs and can perform an update operation using the obtained aggregation operation results.
[0011] Additionally, each of the plurality of FPGAs may include an FPGA memory that stores a sparse matrix block allocated from the CPU, a plurality of processing elements that individually perform Sparse-Dense Matrix Multiplication (SpDMM) operations on the sparse matrix block stored in the FPGA memory, and a controller that controls the operation of the plurality of processing elements.
[0012] Additionally, each of the plurality of FPGAs can individually perform, through the plurality of processing elements, a sparse-dense matrix multiplication on each of the plurality of sparse rows included in the sparse matrix block to generate a plurality of output matrices, each consisting of a single row containing a feature vector of a specific node, and can combine the generated plurality of output matrices into one to derive the aggregation operation result and transmit it to the GPU.
[0013] In addition, the GPU can individually collect the aggregation operation results derived by each of the plurality of FPGAs, and when all aggregation operation results corresponding to the generated plurality of sparse matrix blocks have been collected, it can perform a General Matrix Multiplication (GEMM) operation based on a dense matrix containing the collected aggregation operation results.
[0014] In addition, the GPU can perform an update operation and an activation function operation using the aggregation operation results obtained from the plurality of FPGAs at a specific time, and then perform the update operation once again based on the aggregation operation results obtained at that same time.
[0015] According to one embodiment of the present disclosure, a distributed processing method for graph convolutional network (GCN) inference performed through a distributed processing system for graph convolutional network (GCN) inference comprising a CPU, a plurality of field programmable gate arrays (FPGAs), and a GPU may include the steps of: creating a plurality of sparse matrix blocks by dividing a sparse matrix previously stored in memory through the CPU, and allocating the created plurality of sparse matrix blocks to each of the plurality of FPGAs; individually performing an aggregation operation based on the sparse matrix blocks allocated from the CPU through each of the plurality of FPGAs to derive an aggregation operation result for each; and obtaining the aggregation operation results derived from each of the plurality of FPGAs through the GPU, and performing an update operation using the obtained aggregation operation results.
[0016] Additionally, each of the plurality of FPGAs includes a plurality of processing elements, and the step of deriving the aggregation operation result may include the step of generating a plurality of output matrices, each consisting of a single row containing a feature vector of a specific node, by individually performing a sparse-dense matrix multiplication on each of the plurality of sparse rows included in a sparse matrix block through the plurality of processing elements, and the step of combining the generated plurality of output matrices into one to derive the aggregation operation result.
[0017] Additionally, the step of performing the update operation may include individually collecting the aggregation operation results derived by each of the plurality of FPGAs, and when all aggregation operation results corresponding to the generated plurality of sparse matrix blocks have been collected, performing a General Matrix Multiplication (GEMM) operation based on a dense matrix containing the collected aggregation operation results.
[0018] Additionally, the step of performing the update operation may include performing an update operation and an activation function operation using the aggregation operation results obtained from the plurality of FPGAs at a specific time, and after the update operation and the activation function operation are completed, performing the update operation once again based on the aggregation operation results obtained at that same time.
[0019] A computer program stored on a computer-readable recording medium may be provided to execute the distributed processing method of graph convolutional network (GCN) inference described above on a computer.
[0020] According to some embodiments of the present disclosure, by distributing graph convolutional network (GCN) inference through a plurality of FPGAs and one GPU, the computational load can be efficiently distributed to maximize system resource utilization.
[0021] According to some embodiments of the present disclosure, by performing update operations and activation function operations through one GPU and then performing update operations once again, the number of communications between the FPGA and the GPU can be reduced, thereby improving the overall GCN inference processing speed.
[0022] The effects of the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by a person skilled in the art to which the present disclosure pertains (referred to as "person skilled in the art") from the description in the claims.
[0023] Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, wherein similar reference numerals indicate similar elements, but are not limited thereto.
[0024] FIG. 1 is a block diagram showing the internal configuration of a distributed processing system according to one embodiment of the present disclosure.
[0025] FIG. 2 is a block diagram showing the internal configuration of an information processing system according to one embodiment of the present disclosure.
[0026] FIG. 3 is a block diagram showing the internal configuration of an FPGA according to one embodiment of the present disclosure.
[0027] FIG. 4 is a flowchart of a distributed processing method for graph convolutional network (GCN) inference according to one embodiment of the present disclosure.
[0028] FIG. 5 is a flowchart of a method for performing aggregation operations through an FPGA according to one embodiment of the present disclosure.
[0029] FIG. 6 is a flowchart of a method for performing an update operation through a GPU according to one embodiment of the present disclosure.
[0030] FIG. 7 is a diagram illustrating, in an exemplary manner, the operation of FPGAs and GPUs over time according to one embodiment of the present disclosure.
[0031] Hereinafter, specific details for implementing the present disclosure will be described in detail with reference to the attached drawings. However, in the following description, specific descriptions regarding widely known functions or configurations will be omitted if there is a risk that the gist of the present disclosure may be unnecessarily obscured.
[0032] In the attached drawings, identical or corresponding components are assigned the same reference numerals. Additionally, in the description of the following embodiments, the description of identical or corresponding components may be omitted. However, even if a description of a component is omitted, it is not intended that such component is not included in any embodiment.
[0033] The advantages and features of the disclosed embodiments and the methods for achieving them will become clear by referring to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below but may be implemented in various different forms, and the embodiments provided are merely to make the present disclosure complete and to fully inform those skilled in the art of the scope of the invention.
[0034] The terms used in this specification will be briefly explained, and the disclosed embodiments will be described in detail. The terms used in this specification have been selected to be as generally used as possible, taking into account their functions in this disclosure; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms may be arbitrarily selected by the applicant, and in such cases, their meanings will be described in detail in the relevant description of the invention. Therefore, the terms used in this disclosure should be defined not merely by their names, but based on their meanings and the content throughout this disclosure.
[0035] In this specification, singular expressions include plural expressions unless the context clearly specifies them as singular. Additionally, plural expressions include singular expressions unless the context clearly specifies them as plural. Throughout the specification, when a part is described as including a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.
[0036] Additionally, the terms 'module' or 'part' as used in the specification refer to software or hardware components, and the 'module' or 'part' performs certain roles. However, the meaning of 'module' or 'part' is not limited to software or hardware. The 'module' or 'part' may be configured to reside in an addressable storage medium or configured to run on one or more processors. Thus, as an example, the 'module' or 'part' may include components such as software components, object-oriented software components, class components, and task components, and at least one of processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The components and the functions provided within the 'module' or 'part' may be combined into a smaller number of components and 'modules' or 'parts', or further separated into additional components and 'modules' or 'parts'.
[0037] In one embodiment of the present disclosure, a ‘module’ or ‘part’ may be implemented as a processor and memory. The term ‘processor’ should be broadly interpreted to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, etc. In some contexts, the term ‘processor’ may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), etc. The term ‘processor’ may also refer to a combination of processing devices, such as, for example, a combination of a DSP and a microprocessor, a combination of multiple microprocessors, a combination of one or more microprocessors combined with a DSP core, or any other combination of such configurations. Additionally, the term ‘memory’ should be broadly interpreted to include any electronic component capable of storing electronic information. 'Memory' may refer to various types of processor-readable media, such as Random Access Memory (RAM), Read-Only Memory (ROM), Non-Volatile Random Access Memory (NVRAM), Programmable Read-Only Memory (PROM), Erasable-Programmable Read-Only Memory (EPROM), Electrically Erasable PROM (EEPROM), Flash Memory, Magnetic or Optical Data Storage Devices, Registers, etc. If a processor can read information from memory and / or write information to memory, the memory is said to be in an electronic communication state with the processor. Memory integrated into a processor is in an electronic communication state with the processor.
[0038] In the present disclosure, the 'system' may include at least one of a server device and a cloud device, but is not limited thereto. For example, the system may be composed of one or more server devices. As another example, the system may be composed of one or more cloud devices. As yet another example, the system may be configured and operated with both a server device and a cloud device.
[0039] FIG. 1 is a block diagram showing the internal configuration of a distributed processing system (100) according to one embodiment of the present disclosure. In one embodiment, the distributed processing system (100) may include a CPU (110), memory (120), a plurality of FPGAs (130_1, 130_2, …, 130_N), and a GPU (140).
[0040] First, the CPU (110) can control the operation of components included in the distributed processing system (100) to perform distributed processing of graph convolutional network (GCN) inference. More specifically, the CPU (110) can create multiple sparse matrix blocks by dividing a sparse matrix stored in memory (120). The CPU (110) can then allocate the multiple sparse matrix blocks to the respective FPGAs (130_1 to 130_N). Additionally, the CPU (110) can control the order of operations of the multiple FPGAs (130_1, 130_2, …, 130_N) and the GPU (140), and manage data transfer so that the results of the operations can be exchanged.
[0041] Memory (120) can be connected to the CPU (110) to store a graph dataset. For example, memory (120) stores a sparse matrix and a node feature matrix and can operate as a shared data store accessible to the CPU (110), the multiple FPGAs (130_1, 130_2, …, 130_N), and the GPU (140). Here, a large dataset for graph convolutional network (GCN) inference can be stored in memory (120) first, before being partitioned and allocated through the CPU (110).
[0042] Multiple FPGAs (130_1, 130_2, …, 130_N) can individually perform aggregation operations based on a sparse matrix block allocated from the CPU (110). For example, each of the multiple FPGAs (130_1, 130_2, …, 130_N) can include multiple processing elements (PE) internally to perform sparse-dense matrix multiplication (SpDMM) on multiple rows of a sparse matrix block, thereby generating an output matrix containing feature vectors of specific nodes. Additionally, each of the multiple FPGAs (130_1, 130_2, …, 130_N) can combine multiple output matrices into one and transmit it to the GPU (140) as the result of the aggregation operation.
[0043] The GPU (140) can collect aggregation operation results from the multiple FPGAs (130_1, 130_2, …, 130_N) and perform an update operation based on the collected results. For example, the GPU (140) can perform a General Matrix Multiplication (GEMM) by multiplying the dense matrix formed from the aggregation operation results with a weight matrix. Additionally, the GPU (140) can perform an update operation and an activation function operation consecutively, and then perform an additional update operation based on the same aggregation operation results.
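As a non-limiting illustration, the division of work between the FPGAs and the GPU can be written using the commonly used GCN layer formulation. The formula and all symbols below are assumptions for illustration and do not appear in the disclosure: \hat{A} denotes the sparse (normalized) adjacency matrix, H^{(l)} the node feature matrix at layer l, W^{(l)} the weight matrix, \sigma the activation function, and \hat{A}_i the row block allocated to the i-th FPGA, assuming a row-wise partition.

```latex
\begin{aligned}
H^{(l+1)} &= \sigma\!\left(\hat{A}\, H^{(l)}\, W^{(l)}\right), \\[4pt]
\hat{A} &= \begin{bmatrix} \hat{A}_1 \\ \vdots \\ \hat{A}_N \end{bmatrix}
\;\Longrightarrow\;
\hat{A}\, H^{(l)} = \begin{bmatrix} \hat{A}_1 H^{(l)} \\ \vdots \\ \hat{A}_N H^{(l)} \end{bmatrix}.
\end{aligned}
```

Under this reading, each product \hat{A}_i H^{(l)} corresponds to the aggregation operation (SpDMM) of one FPGA, the stacked result is the dense matrix collected by the GPU, and the multiplication by W^{(l)} followed by \sigma corresponds to the update operation (GEMM) and the activation function operation.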
[0044] FIG. 2 is a block diagram showing the internal configuration of an information processing system (200) according to one embodiment of the present disclosure. The information processing system (200) may include a memory (210), a processor (220), a communication module (230), and an input / output interface (240). The information processing system (200) may be configured to communicate information and / or data through a network using the communication module (230). The information processing system (200) may be a distributed processing system for graph convolutional network (GCN) inference, or a system separately provided outside the distributed processing system to control the operation of the distributed processing system, but is not limited thereto.
[0045] The memory (210) may include any computer-readable recording medium. According to one embodiment, the memory (210) may include a non-transitory computer-readable recording medium, such as a read-only memory (ROM), a disk drive, a solid-state drive (SSD), a flash memory, etc., and may include a permanent mass storage device. As another example, a permanent mass storage device such as a ROM, an SSD, a flash memory, a disk drive, etc., may be included in the information processing system (200) as a separate permanent storage device distinct from the memory. Additionally, the memory (210) may store an operating system and at least one program code (e.g., code for process execution for an arithmetic unit).
[0046] These software components may be loaded from a computer-readable recording medium separate from the memory (210). This separate computer-readable recording medium may include a recording medium that can be directly connected to the information processing system (200), for example, a computer-readable recording medium such as a floppy drive, disk, tape, DVD / CD-ROM drive, or memory card. As another example, the software components may be loaded into the memory (210) via a communication module (230) rather than a computer-readable recording medium. For example, at least one program may be loaded into the memory (210) based on a computer program (e.g., a program for distributed processing of graph convolutional network (GCN) inference, etc.) that is installed by files provided through the communication module (230) by developers or a file distribution system that distributes installation files for applications.
[0047] The processor (220) may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input / output operations. Instructions may be provided to the processor (220) by memory (210) or a communication module (230). For example, the processor (220) may be configured to execute instructions received according to program code stored in a recording device such as memory (210).
[0048] The communication module (230) may provide a configuration or function for the user terminal and the information processing system (200) to communicate with each other via a network, and may provide a configuration or function for the information processing system (200) to communicate with an external system (e.g., a separate cloud system, server system, storage system, etc.). For example, control signals, commands, data, etc. provided under the control of the processor (220) of the information processing system (200) may be transmitted to the user terminal and / or the external system through the communication module (230) and the network, and through the communication module of the user terminal and / or the external system. The processor (220) may provide distributed processing information for graph convolutional network (GCN) inference to the user terminal (not shown).
[0049] Additionally, the input / output interface (240) of the information processing system (200) may be a means for interfacing with a device (not shown) for input or output that is connected to the information processing system (200) or that the information processing system (200) may include. In FIG. 2, the input / output interface (240) is shown as an element configured separately from the processor (220), but is not limited thereto, and the input / output interface (240) may be configured to be included in the processor (220). The information processing system (200) may include more components than those shown in FIG. 2. However, there is no need to clearly illustrate most of the conventional technical components.
[0050] FIG. 3 is a block diagram showing the internal configuration of an FPGA according to one embodiment of the present disclosure. In one embodiment, the FPGA (300) may include an FPGA memory (310), a plurality of processing elements (PE) (320_1, 320_2, ..., 320_N), and a controller (330).
[0051] The FPGA memory (310) can store sparse matrix blocks allocated from the CPU. For example, the FPGA memory (310) stores sparse matrix blocks allocated to the FPGA (300) in rows and can be accessed by a plurality of processing elements (320_1, 320_2, ..., 320_N).
[0052] Multiple processing elements (320_1, 320_2, ..., 320_N) can individually perform a sparse-dense product (SpDMM) on a row of a sparse matrix block stored in the FPGA memory (310). For example, each of the multiple processing elements (320_1, 320_2, ..., 320_N) can aggregate feature vectors of nodes corresponding to the row assigned to it to generate a row of an output matrix containing a new feature vector of a specific node.
[0053] The controller (330) can control the operation of a plurality of processing elements (320_1, 320_2, ..., 320_N). For example, the controller (330) can distribute the target of operation to be performed at a specific point in time to each of the plurality of processing elements (320_1, 320_2, ..., 320_N), combine the output matrices derived by the plurality of processing elements (320_1, 320_2, ..., 320_N) into one to generate an aggregate operation result, and transmit the aggregate operation result to the GPU.
[0054] FIG. 4 is a flowchart of a distributed processing method for graph convolutional network (GCN) inference according to one embodiment of the present disclosure.
[0055] In the method (400), first, a CPU included in a distributed processing system can create multiple sparse matrix blocks by dividing a sparse matrix to be computed for graph convolutional network (GCN) inference, and can assign the multiple sparse matrix blocks to each of the multiple FPGAs (S410). For example, the CPU can enable parallel computation on the sparse matrix through each FPGA by dividing the adjacency matrix of a large graph into rows or blocks and assigning them to each FPGA.
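As a non-limiting illustration of step S410, the following minimal sketch divides a sparse matrix into contiguous row blocks, one per FPGA. The function name `partition_rows`, the dictionary-based row layout, and the policy of balancing by row count are assumptions made for illustration; the disclosure only states that the matrix is divided into rows or blocks.

```python
from typing import Dict, List

# A sparse row is modelled as {column index: value}. This layout is an
# assumption for illustration, not a format specified by the disclosure.
SparseRow = Dict[int, float]


def partition_rows(matrix: List[SparseRow], num_fpgas: int) -> List[List[SparseRow]]:
    """Split a sparse matrix into contiguous row blocks, one per FPGA."""
    if num_fpgas <= 0:
        raise ValueError("num_fpgas must be positive")
    base, extra = divmod(len(matrix), num_fpgas)
    blocks: List[List[SparseRow]] = []
    start = 0
    for i in range(num_fpgas):
        # The first `extra` blocks take one more row, so sizes differ by at most one.
        size = base + (1 if i < extra else 0)
        blocks.append(matrix[start:start + size])
        start += size
    return blocks


# Adjacency matrix of a hypothetical 5-node graph, split across 2 FPGAs.
adjacency = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}, {3: 1.0}, {2: 1.0, 4: 1.0}]
blocks = partition_rows(adjacency, num_fpgas=2)
```

Balancing by row count is the simplest choice; balancing by the number of non-zero entries per block may spread the aggregation work more evenly when row densities vary widely.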
[0056] Subsequently, each of the multiple FPGAs included in the distributed processing system can individually perform aggregation operations based on the sparse matrix block allocated through the CPU to derive the aggregation operation result (S420). For example, each of the multiple FPGAs can derive a dense matrix corresponding to the sparse matrix block as an output matrix by performing a sparse-dense product (SpDMM) operation on the sparse matrix block.
[0057] Subsequently, the single GPU included in the distributed processing system can obtain the aggregation operation results derived from each of the multiple FPGAs and perform an update operation using them. For example, the GPU can collect the aggregation operation results derived by each of the multiple FPGAs, and can update the feature vectors of nodes by performing a dense-dense product (GEMM) operation based on a dense matrix containing the collected aggregation operation results. Additionally, the GPU can non-linearly transform the updated feature vectors by subsequently performing an activation function operation as needed.
[0058] FIG. 5 is a flowchart of a method for performing aggregation operations through an FPGA according to one embodiment of the present disclosure.
[0059] In the method (500), first, when a specific FPGA included in a distributed processing system receives a sparse matrix block from a CPU, the allocated sparse matrix block can be stored in FPGA memory (S510). For example, the specific FPGA can receive an adjacency matrix of a graph partitioned by the CPU in blocks, and can store the received adjacency matrix in FPGA memory, which is internal memory.
[0060] Subsequently, a specific FPGA can assign each of the multiple rows contained in a sparse matrix block stored in the FPGA memory to each of the multiple processing elements included in the specific FPGA (S520). For example, the specific FPGA can distribute the multiple row-unit data to each processing element, thereby enabling each processing element to perform operations independently.
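As a non-limiting illustration of step S520, the sketch below assigns the rows of a sparse matrix block to processing elements in round-robin order. The round-robin policy and the name `assign_rows_to_pes` are assumptions; the disclosure does not specify how rows are mapped to processing elements.

```python
from typing import List


def assign_rows_to_pes(num_rows: int, num_pes: int) -> List[List[int]]:
    """Return, for each PE, the block-local row indices it will process."""
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    schedule: List[List[int]] = [[] for _ in range(num_pes)]
    for row in range(num_rows):
        # Row r goes to PE (r mod num_pes), so PE loads differ by at most one row.
        schedule[row % num_pes].append(row)
    return schedule


schedule = assign_rows_to_pes(num_rows=7, num_pes=3)
# schedule == [[0, 3, 6], [1, 4], [2, 5]]
```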
[0061] Subsequently, a specific FPGA can generate multiple output matrices consisting of a single row containing a feature vector of a specific node by individually performing a sparse-dense product operation on each of the multiple sparse rows included in the sparse matrix block through multiple processing elements (S530). For example, each processing element can generate a single row of an output matrix containing a new node feature vector by aggregating the neighbor node feature vectors of the node based on the sparse row assigned to it.
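As a non-limiting illustration of step S530, the sketch below shows the work of one processing element: a single sparse row is multiplied by the dense node feature matrix, which amounts to a weighted sum of the neighbor feature vectors. The name `spdmm_row` and the dictionary-based sparse row are assumptions for illustration.

```python
from typing import Dict, List


def spdmm_row(sparse_row: Dict[int, float], features: List[List[float]]) -> List[float]:
    """Multiply one sparse row by a dense feature matrix, yielding one output row."""
    width = len(features[0])
    out = [0.0] * width
    for col, weight in sparse_row.items():
        # Only non-zero entries are visited; `col` indexes a neighbor node.
        neighbor = features[col]
        for k in range(width):
            out[k] += weight * neighbor[k]
    return out


# Three nodes with two features each; the row connects to nodes 0 and 2.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
row = {0: 1.0, 2: 0.5}
new_feature = spdmm_row(row, features)
# new_feature == [3.5, 5.0]
```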
[0062] Subsequently, a specific FPGA can combine multiple output matrices derived from multiple processing elements to generate a single dense matrix as an aggregation operation result (S540). For example, a specific FPGA can merge output matrices produced from individual processing elements to complete an aggregation result corresponding to a sparse matrix block.
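As a non-limiting illustration of step S540, the sketch below merges single-row outputs into one dense matrix ordered by row index. It assumes, purely for illustration, that processing elements may finish in any order and report results keyed by block-local row index; `combine_outputs` is a hypothetical name.

```python
from typing import Dict, List


def combine_outputs(pe_outputs: Dict[int, List[float]], num_rows: int) -> List[List[float]]:
    """Stack single-row PE outputs into a dense matrix in row order."""
    missing = [r for r in range(num_rows) if r not in pe_outputs]
    if missing:
        raise ValueError(f"rows not yet computed: {missing}")
    return [pe_outputs[r] for r in range(num_rows)]


# Outputs arrive out of order but are stacked by row index.
pe_outputs = {2: [5.0, 6.0], 0: [1.0, 2.0], 1: [3.0, 4.0]}
block_result = combine_outputs(pe_outputs, num_rows=3)
# block_result == [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```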
[0063] Subsequently, the specific FPGA can transmit the single dense matrix generated as the aggregation operation result to the GPU (S550).
[0064] FIG. 6 is a flowchart of a method for performing an update operation through a GPU according to one embodiment of the present disclosure.
[0065] In method (600), first, the GPU can obtain aggregate operation results from multiple FPGAs through communication with multiple FPGAs (S610). For example, the GPU can collect the results of sparse-dense product operations performed in each FPGA and use the results as integrated input data.
[0066] Afterward, the GPU can perform update operations and activation function operations using a dense matrix containing the acquired aggregation operation results (S620). For example, the GPU can perform a dense-dense product (GEMM) operation by multiplying the dense matrix containing the aggregation operation results with a weight matrix, and then perform an activation function operation such as ReLU.
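As a non-limiting illustration of step S620, the sketch below performs the update operation as a dense matrix product followed by ReLU. It is written in plain Python for readability rather than as GPU code, and the helper names `gemm` and `relu` as well as the sample values are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product a @ b."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(m: Matrix) -> Matrix:
    """Element-wise ReLU: negative entries become 0.0."""
    return [[max(0.0, v) for v in row] for row in m]


aggregated = [[1.0, -2.0], [0.5, 3.0]]   # dense matrix of aggregation results
weights = [[1.0, 0.0], [1.0, -1.0]]      # hypothetical layer weight matrix
updated = relu(gemm(aggregated, weights))
# updated == [[0.0, 2.0], [3.5, 0.0]]
```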
[0067] Afterward, the GPU can perform the update operation once again based on the aggregation operation results (S630). For example, the GPU can increase the efficiency of the update step by performing an additional update operation based on the same aggregation operation results.
[0068] Afterward, the GPU can provide the results of the update operation performed again to the multiple FPGAs through communication with the multiple FPGAs (S640). For example, the GPU can transmit the result matrix, which has undergone update and activation function operations, to the FPGAs to support the FPGAs in performing subsequent aggregation operations.
[0069] In conventional structures, the overall computation speed is degraded because communication occurs repeatedly: the FPGA derives aggregation operation results, transmits them to the GPU for the update operation, and the results are then transmitted back to the FPGA. In contrast, in the structure of the present disclosure, the GPU consecutively performs the update operation and the activation function operation using the aggregation operation results, and then performs the update operation once more based on the same aggregation operation results. This reduces unnecessary data transmission between the FPGA and the GPU and shortens the total communication time, thereby improving the overall processing efficiency of the system while maintaining the same accuracy.
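One possible reading of the repeated update operation, offered here as an assumption rather than as the mechanism confirmed by the disclosure, is that the GPU applies the weight matrix of the following layer before the features return to the FPGAs. Because matrix multiplication is associative, aggregating first and updating second gives the same result as updating first and aggregating second, which is consistent with the statement that accuracy is maintained. The sketch below checks this identity numerically; all names and values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


adjacency = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
hidden = [[1.0, 2.0], [0.0, 3.0], [4.0, 0.0]]   # features after update + activation
weights2 = [[1.0, -1.0], [2.0, 0.0]]            # hypothetical next-layer weights

# Order 1: aggregate on the FPGAs, then update on the GPU.
aggregate_first = matmul(matmul(adjacency, hidden), weights2)
# Order 2: update on the GPU first, then aggregate on the FPGAs.
update_first = matmul(adjacency, matmul(hidden, weights2))
```

Under this assumption, the second ordering lets the GPU finish two update operations before a single transfer back to the FPGAs. The values are small integers, so both orderings agree exactly here; with general floating-point data they may differ by rounding error.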
[0070] FIG. 7 is a diagram illustrating, in an exemplary manner, the operation of FPGAs and the operation of a GPU over time according to an embodiment of the present disclosure. As illustrated, the CPU can divide a sparse matrix stored in main memory to create a plurality of sparse matrix blocks and transmit the generated sparse matrix blocks to each of the plurality of FPGAs. For example, the CPU can enable parallel processing by dividing and transmitting data in block units of a size that each FPGA can process.
[0071] Subsequently, each of the multiple FPGAs can perform a sparse-dense product (SpDMM) operation based on a sparse matrix block received from the CPU to derive an aggregation operation result. For example, each FPGA can perform an operation corresponding to the block assigned to it and, as a result, generate an output matrix containing feature vectors of specific nodes.
[0072] The GPU can sequentially collect (Recv) the aggregation operation results from the plurality of FPGAs. Here, when the aggregation operation has been completed in all of the FPGAs and all of the corresponding aggregation operation results have been collected, the GPU can form a dense matrix containing the collected aggregation operation results.
[0073] Next, the GPU can perform the update operation by performing a dense-dense product (GEMM) operation based on the dense matrix. For example, the GPU can combine the partial operation results derived from the FPGAs to complete the aggregation result for the entire graph, and can generate new feature vectors through a GEMM operation that multiplies this aggregation result by a weight matrix, thereby carrying out the subsequent GCN inference operations.
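As a non-limiting illustration, the collection and update steps on the GPU side can be sketched in Python as follows. The sketch assumes that each partial result arrives tagged with the row offset of its block; the function names `assemble_dense` and `gemm` and the dictionary layout are hypothetical and are not taken from the disclosure.

```python
def assemble_dense(partials, num_blocks):
    """Stack partial aggregation results into one dense matrix.

    partials   : dict mapping a block's row offset to its list of output rows
    num_blocks : number of blocks that must arrive before the update may start
    """
    if len(partials) != num_blocks:
        raise ValueError("not all aggregation results have been collected")
    dense = []
    for offset in sorted(partials):   # restore the original row order
        dense.extend(partials[offset])
    return dense


def gemm(a, w):
    """Dense-dense product used here as the update operation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in a]


# Results received out of order from two FPGAs, then updated with a weight matrix.
received = {2: [[4.0, 5.0]], 0: [[1.0, 2.0], [3.0, 3.0]]}
dense = assemble_dense(received, 2)
updated = gemm(dense, [[1.0, 0.0], [1.0, 1.0]])
```

Raising an error while results are missing mirrors the condition that the GEMM operation starts only after all aggregation operation results have been collected.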
[0074] The method described above may be provided as a computer program stored on a computer-readable recording medium for execution on a computer. The medium may continuously store a program executable by a computer, or temporarily store it for execution or download. Additionally, the medium may be various recording or storage means in the form of a single or multiple hardware components combined, and may not be limited to a medium directly connected to a computer system but may exist distributed over a network. Examples of media may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Furthermore, other examples of media may include recording or storage media managed by app stores that distribute applications or sites and servers that supply or distribute various other software.
[0075] The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will understand that the various exemplary logical blocks, modules, circuits, and algorithmic steps described in connection with the disclosure herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate such interchangeability between hardware and software, various exemplary components, blocks, modules, circuits, and steps have been generally described above in terms of their functional aspects. Whether such functions are implemented in hardware or in software depends on the design requirements imposed on the specific application and the overall system. Those skilled in the art may implement the functions described in various ways for each specific application, but such implementations should not be construed as departing from the scope of the present disclosure.
[0076] In a hardware implementation, the processing units used to perform the techniques may be implemented in one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, computers, or a combination thereof.
[0077] Accordingly, the various exemplary logic blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed by any combination of general-purpose processors, DSPs, ASICs, FPGAs or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or those designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors coupled with a DSP core, or any other combination of configurations.
[0078] In firmware and / or software implementations, techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), magnetic or optical data storage devices, etc. The instructions may be executable by one or more processors, and may cause the processor(s) to perform specific aspects of the functions described in this disclosure.
[0079] Where implemented in software, the techniques may be stored on a computer-readable medium as one or more instructions or code, or transmitted through a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transmission of a computer program from one place to another. Storage media may be any available medium accessible by a computer. As a non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium accessible by a computer that can be used to transfer or store desired program code in the form of instructions or data structures. Additionally, any connection is appropriately referred to as a computer-readable medium.
[0080] For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair cable, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair cable, digital subscriber line, or wireless technologies such as infrared, radio, and microwave are included within the definition of a medium. As used herein, disk and disc include CD, laser disc, optical disc, DVD (digital versatile disc), floppy disk, and Blu-ray disc, wherein disks usually reproduce data magnetically, whereas discs reproduce data optically using a laser. Combinations of the above should also be included within the scope of computer-readable media.
[0081] The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other known form of storage medium. An exemplary storage medium may be connected to a processor so that the processor can read information from the storage medium or write information to the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may exist within an ASIC. The ASIC may exist within a user terminal. Alternatively, the processor and the storage medium may exist as separate components within the user terminal.
[0082] Although the embodiments described above have been described as utilizing aspects of the subject matter disclosed herein in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or a distributed computing environment. Furthermore, aspects of the subject matter in the present disclosure may be implemented in a plurality of processing chips or devices, and storage may be similarly effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.
[0083] Although the present disclosure has been described in relation to some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by a person skilled in the art to which the invention of the present disclosure pertains. Furthermore, such modifications and changes should be considered to fall within the scope of the claims appended to this specification.
Claims
1. A distributed processing system for graph convolutional network (GCN) inference, comprising: a CPU; a plurality of field programmable gate arrays (FPGAs); and a single GPU, wherein the CPU generates a plurality of sparse matrix blocks by partitioning a sparse matrix previously stored in memory and assigns the generated plurality of sparse matrix blocks to each of the plurality of FPGAs, wherein each of the plurality of FPGAs individually performs an aggregation operation based on the sparse matrix block allocated from the CPU to derive a respective aggregation operation result, and wherein the GPU obtains the aggregation operation results derived from each of the plurality of FPGAs and performs an update operation using the obtained aggregation operation results.
2. The distributed processing system of claim 1, wherein each of the plurality of FPGAs comprises: an FPGA memory storing the sparse matrix block allocated from the CPU; a plurality of processing elements that individually perform sparse-dense matrix multiplication (SpDMM) operations on the sparse matrix block stored in the FPGA memory; and a controller that controls the operation of the plurality of processing elements.
3. The distributed processing system of claim 2, wherein each of the plurality of FPGAs generates a plurality of output matrices, each consisting of a single row containing a feature vector of a specific node, by individually performing a sparse-dense product operation on each of a plurality of sparse rows included in the sparse matrix block through the plurality of processing elements, combines the generated plurality of output matrices into one to derive the aggregation operation result, and transmits the derived aggregation operation result to the single GPU.
4. The distributed processing system of claim 3, wherein the GPU individually collects the aggregation operation results derived by each of the plurality of FPGAs performing the aggregation operation and, when all aggregation operation results corresponding to the generated plurality of sparse matrix blocks have been collected, performs a general matrix multiplication (GEMM) operation based on a dense matrix including the collected aggregation operation results.
5. The distributed processing system of claim 1, wherein the GPU performs the update operation and an activation function operation using the aggregation operation results obtained from the plurality of FPGAs at a specific point in time, and then performs the update operation once again using the aggregation operation results obtained at the specific point in time.
6. A distributed processing method for graph convolutional network (GCN) inference, performed by a distributed processing system for GCN inference comprising a CPU, a plurality of field programmable gate arrays (FPGAs), and a single GPU, the method comprising: generating, through the CPU, a plurality of sparse matrix blocks by dividing a sparse matrix previously stored in memory, and assigning the generated plurality of sparse matrix blocks to each of the plurality of FPGAs; individually performing, through each of the plurality of FPGAs, an aggregation operation based on the sparse matrix block allocated from the CPU to derive a respective aggregation operation result; and obtaining, through the single GPU, the aggregation operation results derived from each of the plurality of FPGAs, and performing an update operation using the obtained aggregation operation results.
7. The distributed processing method of claim 6, wherein each of the plurality of FPGAs includes a plurality of processing elements, and wherein deriving the aggregation operation result comprises: generating a plurality of output matrices, each consisting of a single row containing a feature vector of a specific node, by individually performing a sparse-dense product operation on each of a plurality of sparse rows included in the sparse matrix block through the plurality of processing elements; and combining the generated plurality of output matrices into one to derive the aggregation operation result.
8. The distributed processing method of claim 7, wherein performing the update operation comprises: individually collecting the aggregation operation results derived by each of the plurality of FPGAs performing the aggregation operation; and, when all aggregation operation results corresponding to the generated plurality of sparse matrix blocks have been collected, performing a general matrix multiplication (GEMM) operation based on a dense matrix including the collected aggregation operation results.
9. The distributed processing method of claim 1, wherein performing the update operation comprises: performing the update operation and an activation function operation using the aggregation operation results obtained from the plurality of FPGAs at a specific point in time; and, after the update operation and the activation function operation are completed, performing the update operation once again using the aggregation operation results obtained at the specific point in time.
10. A computer program stored on a computer-readable recording medium for executing, on a computer, the method according to any one of claims 6 through 9.