System and method for sparse-dense matrix multiplication operation using high bandwidth memory (HBM)
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION
- Filing Date: 2025-11-17
- Publication Date: 2026-06-11
Abstract
Description
Sparse-dense matrix multiplication system and method using High Bandwidth Memory (HBM)
[0001] The present disclosure relates to a sparse-dense matrix multiplication operation system and method using high-bandwidth memory (HBM), and more specifically, to a system and method that can maximize computational performance and memory bandwidth efficiency by using high-bandwidth memory to perform the sparse-dense matrix multiplication operations that arise during graph convolutional network (GCN) inference.
[0002] Graph Convolutional Networks (GCNs) operate by aggregating information about nodes in graph data and their neighbors, and then combining the aggregated results to learn node representations. This structure is utilized in various application fields, such as social networks, recommendation systems, and compound analysis.
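The aggregation-then-combination structure described in [0002] can be illustrated with a minimal sketch. The 3-node graph, the unnormalized adjacency matrix with self-loops, the feature values, the weight values, and the helper names `matmul` and `relu` are all illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch of one GCN layer: aggregate neighbor features (A @ X),
# then combine them with a weight matrix ((A @ X) @ W).
# The graph, features, and weights below are illustrative assumptions.

def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in m]

# Adjacency with self-loops for the path graph 0 - 1 - 2.
# Real graph adjacency matrices are far larger and mostly zero (sparse).
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0]]
# Node feature matrix X (3 nodes x 2 features) and weight matrix W (2 x 2).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[1.0, -1.0],
     [0.5, 1.0]]

aggregated = matmul(A, X)               # aggregation: a sparse-dense product in practice
combined = relu(matmul(aggregated, W))  # combination followed by a nonlinearity
```

In this sketch the aggregation step is the product of a sparse matrix and a dense matrix, which is the operation the remainder of the disclosure targets.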
[0003] However, the computational process of a GCN suffers from bottlenecks when processing large-scale graph datasets. In particular, the aggregation and combination stages require access to massive amounts of data, and memory access efficiency is degraded by irregular data access patterns. This reduces overall computational performance and makes GCNs difficult to apply to real-time services.
[0004] In conventional technology, methods such as splitting sparse matrices or applying complex algorithms that reflect the characteristics of graph data have been proposed to mitigate these problems. However, these approaches have the disadvantage that the algorithms themselves are complex, requiring significant time for the computation preparation phase and limiting computation speed. Therefore, there is a continuing need for hardware and algorithmic improvements to efficiently process large-scale graph datasets.
[0005] The aforementioned background technology is one that the inventor possessed or acquired in the process of deriving the contents of the disclosure of the present application, and it cannot be considered as prior art disclosed to the general public prior to the filing of this application.
[0006] The present disclosure provides a method for performing a sparse-dense matrix multiplication operation using high-bandwidth memory (HBM) to solve the above-mentioned problems, a computer program stored on a recording medium, and a device (system).
[0007] The present disclosure may be implemented in various ways, including a method, a system (device), or a computer program stored on a readable storage medium.
[0008] According to one embodiment of the present disclosure, a sparse-dense matrix multiplication operation system using high bandwidth memory (HBM) comprises a CPU and a Field Programmable Gate Array (FPGA), wherein the CPU transmits sparse matrix data and dense matrix data stored in memory to the FPGA to perform operations for graph convolutional network (GCN) inference, and the FPGA may include a high bandwidth memory (HBM) that stores sparse matrix data and dense matrix data transmitted from the CPU, an on-chip memory (On-Chip SRAM) that receives and stores dense matrix data from the high bandwidth memory, and a processing group that performs a sparse-dense matrix multiplication (SpDMM) operation based on sparse matrix data provided from the high bandwidth memory and dense matrix data provided from the on-chip memory.
[0009] Additionally, the on-chip memory includes a first on-chip memory that stores dense matrix data provided from high-bandwidth memory and a second on-chip memory that temporarily stores the operation results derived from performing a sparse-dense matrix multiplication operation through a processing group, and the FPGA can output the operation results temporarily stored in the second on-chip memory or a predetermined number of operation results stored in the second on-chip memory by merging them for a predetermined period.
[0010] Additionally, the FPGA may further include an Input Arbiter that retrieves dense matrix data stored in high-bandwidth memory and transmits it to a first on-chip memory, and an Output Arbiter that transmits the operation result derived by performing a sparse-dense matrix multiplication operation through a processing group to a second on-chip memory.
[0011] In addition, the FPGA may further include a PCIe DMA (Peripheral Component Interconnect Express Direct Memory Access) module that acquires sparse matrix data and dense matrix data stored in the CPU's memory and transfers them to high-bandwidth memory, and transfers the operation result derived by performing a sparse-dense matrix multiplication operation through a processing group to the CPU.
[0012] Additionally, the processing group converts sparse matrix data provided from high-bandwidth memory and performs a sparse-dense matrix multiplication based on the converted sparse matrix data, wherein the converted sparse matrix data may be a single array including an indicator representing the position of a row, a non-zero value for each row, and the position of a column corresponding to the non-zero value for each row.
[0013] In addition, the high-bandwidth memory includes multiple pseudo channels, and each of the multiple pseudo channels can operate independently of one another by having a memory controller that manages data access and transmission and a switch that coordinates the data flow between the memory controller and the processing group.
[0014] Additionally, the processing group may include a plurality of unit processing groups connected to each of the plurality of pseudo-channels and performing a sparse-dense matrix multiplication operation corresponding to each of the plurality of pseudo-channels.
[0015] Additionally, the processing group includes a plurality of processing elements, and each of the plurality of processing elements may include a multiplier that performs a sparse-dense matrix multiplication operation and an accumulator that accumulates the result of the operation through the multiplier.
[0016] According to one embodiment of the present disclosure, a method for performing a sparse-dense matrix multiplication operation using a high-bandwidth memory (HBM) performed through a high-bandwidth memory (HBM) operation system including a CPU and a Field Programmable Gate Array (FPGA) may include: a step of transmitting sparse matrix data and dense matrix data stored in memory to an FPGA to perform an operation for graph convolutional network (GCN) inference through a CPU; a step of storing the sparse matrix data and dense matrix data transmitted from the CPU through the high-bandwidth memory (HBM) of the FPGA; a step of receiving and storing dense matrix data from the high-bandwidth memory through the on-chip SRAM of the FPGA; and a step of performing a sparse-dense matrix multiplication (SpDMM) operation based on the sparse matrix data provided from the high-bandwidth memory and the dense matrix data provided from the on-chip memory through a processing group of the FPGA.
[0017] A computer program stored on a computer-readable recording medium may be provided to execute the sparse-dense matrix multiplication method using the aforementioned high-bandwidth memory (HBM) on a computer.
[0018] According to some embodiments of the present disclosure, by performing sparse-dense matrix multiplication (SpDMM) in an optimized manner, it is possible to resolve memory bandwidth issues and minimize data transfer, thereby improving GCN computation performance.
[0019] According to some embodiments of the present disclosure, by adopting a row-wise product-based data flow, it is possible to efficiently process each matrix element and optimize memory usage.
[0020] According to some embodiments of the present disclosure, by utilizing graph partitioning and a High-Degree Node (HDN) cache, memory access locality can be improved and memory bottlenecks caused by repetitive data access can be reduced.
[0021] According to some embodiments of the present disclosure, by introducing a runtime execution model, it is possible to maximize matrix parallelism and mitigate DRAM access delay through a multi-row fixed data flow method.
[0022] According to some embodiments of the present disclosure, by applying a joint design of software and hardware, efficient data compression and transmission are enabled, and there is an effect of being able to flexibly respond to irregular access patterns of matrix data.
[0023] The effects of the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood by a person skilled in the art to which the present disclosure pertains (referred to as "person skilled in the art") from the description in the claims.
[0024] Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, wherein similar reference numerals indicate similar elements, but are not limited thereto.
[0025] FIG. 1 is a block diagram showing the internal configuration of a sparse-dense matrix multiplication system using high-bandwidth memory (HBM) according to one embodiment of the present disclosure.
[0026] FIG. 2 is a block diagram showing the internal configuration of an information processing system according to one embodiment of the present disclosure.
[0027] FIG. 3 is a diagram illustrating the process of converting sparse matrix data according to one embodiment of the present disclosure.
[0028] FIG. 4 is a diagram illustrating the structure of a high-bandwidth memory including a plurality of pseudo channels and a memory controller and a switch connected thereto according to one embodiment of the present disclosure.
[0029] FIG. 5 is a block diagram illustrating the configuration of a processing group including a plurality of processing elements according to one embodiment of the present disclosure.
[0030] FIG. 6 is a diagram illustrating a structure for parallel processing of matrix data through a plurality of processing groups according to one embodiment of the present disclosure.
[0031] FIG. 7 is a flowchart of a sparse-dense matrix multiplication operation method using high-bandwidth memory (HBM) according to one embodiment of the present disclosure.
[0032] Hereinafter, specific details for implementing the present disclosure will be described in detail with reference to the attached drawings. However, in the following description, specific descriptions regarding well-known functions or configurations will be omitted if there is a risk that the gist of the present disclosure may be unnecessarily obscured.
[0033] In the attached drawings, identical or corresponding components are assigned the same reference numerals. Additionally, in the description of the following embodiments, the description of identical or corresponding components may be omitted. However, even if a description of a component is omitted, it is not intended that such component is not included in any embodiment.
[0034] The advantages and features of the disclosed embodiments and the methods for achieving them will become clear by referring to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below but may be implemented in various different forms, and the embodiments provided are merely to make the present disclosure complete and to fully inform those skilled in the art of the scope of the invention.
[0035] The terms used in this specification will be briefly explained, and the disclosed embodiments will be described in detail. The terms used in this specification have been selected to be as generally used as possible, taking into account their functions in this disclosure; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms may be arbitrarily selected by the applicant, and in such cases, their meanings will be described in detail in the relevant description of the invention. Therefore, the terms used in this disclosure should be defined not merely by their names, but based on their meanings and the content throughout this disclosure.
[0036] In this specification, singular expressions include plural expressions unless the context clearly specifies them as singular. Additionally, plural expressions include singular expressions unless the context clearly specifies them as plural. Throughout the specification, when a part is described as including a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.
[0037] Additionally, the terms 'module' or 'part' as used in the specification refer to software or hardware components, and the 'module' or 'part' performs certain roles. However, the meaning of 'module' or 'part' is not limited to software or hardware. The 'module' or 'part' may be configured to reside in an addressable storage medium or configured to run on one or more processors. Thus, as an example, the 'module' or 'part' may include components such as software components, object-oriented software components, class components, and task components, and at least one of processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The components and the functions provided within the 'module' or 'part' may be combined into a smaller number of components and 'modules' or 'parts', or further separated into additional components and 'modules' or 'parts'.
[0038] In one embodiment of the present disclosure, a ‘module’ or ‘part’ may be implemented as a processor and memory. The term ‘processor’ should be broadly interpreted to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, etc. In some contexts, the term ‘processor’ may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), etc. The term ‘processor’ may also refer to a combination of processing devices, such as, for example, a combination of a DSP and a microprocessor, a combination of multiple microprocessors, a combination of one or more microprocessors combined with a DSP core, or any other combination of such configurations. Additionally, the term ‘memory’ should be broadly interpreted to include any electronic component capable of storing electronic information. 'Memory' may refer to various types of processor-readable media, such as Random Access Memory (RAM), Read-Only Memory (ROM), Non-Volatile Random Access Memory (NVRAM), Programmable Read-Only Memory (PROM), Erasable-Programmable Read-Only Memory (EPROM), Electrically Erasable PROM (EEPROM), Flash Memory, Magnetic or Optical Data Storage Devices, Registers, etc. If a processor can read information from memory and / or write information to memory, the memory is said to be in an electronic communication state with the processor. Memory integrated into a processor is in an electronic communication state with the processor.
[0039] In the present disclosure, the 'system' may include at least one of a server device and a cloud device, but is not limited thereto. For example, the system may be composed of one or more server devices. As another example, the system may be composed of one or more cloud devices. As yet another example, the system may be configured and operated with both a server device and a cloud device.
[0040] FIG. 1 is a block diagram showing the internal configuration of a sparse-dense matrix multiplication system using high-bandwidth memory (HBM) according to one embodiment of the present disclosure. In one embodiment, the sparse-dense matrix multiplication system (100) may include a CPU (110), memory (120), and FPGA (130).
[0041] The CPU (110) can control the operation of components included in the sparse-dense matrix multiplication operation system (100) for computation for Graph Convolutional Network (GCN) inference. More specifically, the CPU (110) can transmit sparse matrix data and dense matrix data stored in memory (120) to the FPGA (130). Additionally, the CPU (110) can receive the results of the sparse-dense matrix multiplication (SpDMM) operation performed through the FPGA (130) and use them for subsequent processing.
[0042] The memory (120) can be linked with the CPU (110) to store sparse matrix data and dense matrix data. For example, the memory (120) can store a sparse matrix containing node and edge information of graph data and a dense matrix containing features of each node.
[0043] The FPGA (130) can perform sparse-dense matrix multiplication (SpDMM) operations based on instructions and data transmitted from the CPU (110). The FPGA (130) may include, but is not limited to, a PCIe DMA module (131), a memory controller (132), high-bandwidth memory (133), an input arbiter (134), a first on-chip memory (135), a processing group (136), an output arbiter (137), and a second on-chip memory (138).
[0044] First, the PCIe DMA (Peripheral Component Interconnect Express Direct Memory Access) module (131) can handle data transfer between the CPU (110) and the FPGA (130). The PCIe DMA module (131) can, for example, transfer sparse matrix data and dense matrix data stored in the memory (120) of the CPU (110) to the high-bandwidth memory (133) of the FPGA (130). Additionally, the PCIe DMA module (131) can transfer the results calculated through the FPGA (130) back to the CPU (110).
[0045] Next, the memory controller (132) can manage the data flow within the FPGA (130). The memory controller (132) can control the operation of storing data transmitted from the PCIe DMA module (131) in the high-bandwidth memory (133) or reading data from the high-bandwidth memory (133) and transmitting it to the processing group (136).
[0046] The high-bandwidth memory (HBM) (133) can be placed within the FPGA (130) to store sparse matrix data and dense matrix data. For example, the high-bandwidth memory (133) can include a parallel channel structure suitable for processing large-scale graph data, thereby providing high memory bandwidth and minimizing data access latency.
[0047] The input arbiter (134) can perform the operation of transferring dense matrix data stored in the high-bandwidth memory (133) to the first on-chip memory (On-Chip SRAM) (135). Through this, the dense matrix data that is used repeatedly can be efficiently utilized within the FPGA (130).
[0048] The first on-chip memory (135) can store dense matrix data provided through the input arbiter (134). The first on-chip memory (135) can be implemented as an SRAM structure capable of high-speed access, and can improve computational efficiency by repeatedly providing dense matrix data to the processing group (136).
[0049] The processing group (136) can perform sparse-dense matrix multiplication (SpDMM) operations based on sparse matrix data and dense matrix data. For example, the processing group (136) can perform sparse-dense matrix multiplication (SpDMM) operations based on sparse matrix data provided from high-bandwidth memory (133) and dense matrix data provided from the first on-chip memory (135). The processing group (136) may include a plurality of processing elements, and each processing element may include a multiplier and an accumulator to perform matrix multiplication operations in parallel.
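As a rough illustration of the sparse-dense matrix multiplication performed by the processing group, the following sketch uses a row-wise product order: each non-zero of the sparse matrix scales one row of the dense matrix and is accumulated into the matching output row. The CSR-style arrays (`row_ptr`, `col_idx`, `values`) and the function name `spdmm_row_wise` are assumptions chosen for the example; the disclosure uses its own converted array format.

```python
def spdmm_row_wise(row_ptr, col_idx, values, dense):
    """Sparse (CSR-style, assumed) times dense, in row-wise product order."""
    n_rows = len(row_ptr) - 1
    n_feat = len(dense[0])
    out = [[0.0] * n_feat for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the non-zeros of sparse row i.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k, a = col_idx[p], values[p]
            dense_row = dense[k]          # one dense row is reused for every feature
            for j in range(n_feat):
                out[i][j] += a * dense_row[j]
    return out

# Sparse matrix [[2, 0, 1], [0, 3, 0]] in CSR-style form (illustrative values).
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [2.0, 1.0, 3.0]
dense = [[1.0, 2.0],
         [3.0, 4.0],
         [5.0, 6.0]]

result = spdmm_row_wise(row_ptr, col_idx, values, dense)
```

Because zero entries are never visited, the work scales with the number of non-zeros rather than with the full matrix size.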
[0050] The output arbiter (137) can transmit the computation result derived through the processing group (136) to the second on-chip memory (On-Chip SRAM) (138).
[0051] The second on-chip memory (138) can temporarily store the calculation results provided through the output arbiter (137). At this time, the second on-chip memory (138) can buffer the calculation results for a certain period or store data until a predetermined number of calculation results are collected, and then merge them and output them to the CPU (110).
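The count-based buffering behavior of the second on-chip memory can be sketched as follows. The class name `OutputBuffer`, the `batch_size` parameter, and the `sent` list standing in for the transfer to the CPU are hypothetical; the time-based variant ("for a certain period") is represented only by the explicit `flush` call.

```python
class OutputBuffer:
    """Hypothetical model of a result buffer that merges outputs into batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # assumed "predetermined number" of results
        self.pending = []             # results held in the buffer
        self.sent = []                # merged batches handed to the CPU side

    def push(self, row_id, row_values):
        """Store one result; emit a merged batch once enough have collected."""
        self.pending.append((row_id, row_values))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Emit whatever is buffered, e.g. when a buffering period expires."""
        if self.pending:
            self.sent.append(list(self.pending))
            self.pending.clear()

buf = OutputBuffer(batch_size=2)
buf.push(0, [7.0, 10.0])
buf.push(1, [9.0, 12.0])   # second result triggers one merged transfer
buf.push(2, [0.0, 1.0])    # stays buffered until the next trigger
batches_before_flush = len(buf.sent)
buf.flush()
```

Merging results before output trades a small amount of latency for fewer, larger transfers.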
[0052] FIG. 2 is a block diagram showing the internal configuration of an information processing system (200) according to one embodiment of the present disclosure. The information processing system (200) may include a memory (210), a processor (220), a communication module (230), and an input / output interface (240). The information processing system (200) may be configured to communicate information and / or data through a network using the communication module (230). The information processing system (200) may be a sparse-dense matrix multiplication system or a system separately provided outside the sparse-dense matrix multiplication system to control the operation of the sparse-dense matrix multiplication system, but is not limited thereto.
[0053] The memory (210) may include any computer-readable recording medium. According to one embodiment, the memory (210) may include a non-transitory computer-readable recording medium, such as a read-only memory (ROM), a disk drive, a solid-state drive (SSD), a flash memory, etc., and may include a permanent mass storage device. As another example, a permanent mass storage device such as a ROM, an SSD, a flash memory, a disk drive, etc., may be included in the information processing system (200) as a separate permanent storage device distinct from the memory. Additionally, the memory (210) may store an operating system and at least one program code (e.g., code for process execution for an arithmetic unit).
[0054] These software components may be loaded from a computer-readable recording medium separate from the memory (210). This separate computer-readable recording medium may include a recording medium that can be directly connected to the information processing system (200), for example, a computer-readable recording medium such as a floppy drive, disk, tape, DVD / CD-ROM drive, or memory card. As another example, the software components may be loaded into the memory (210) via a communication module (230) rather than a computer-readable recording medium. For example, at least one program may be loaded into the memory (210) based on a computer program (e.g., a program for distributed processing of graph convolutional network (GCN) inference, etc.) that is installed by files provided through the communication module (230) by developers or a file distribution system that distributes installation files for applications.
[0055] The processor (220) may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input / output operations. Instructions may be provided to the processor (220) by memory (210) or a communication module (230). For example, the processor (220) may be configured to execute instructions received according to program code stored in a recording device such as memory (210).
[0056] The communication module (230) may provide a configuration or function for the user terminal and the information processing system (200) to communicate with each other via a network, and may provide a configuration or function for the information processing system (200) to communicate with an external system (e.g., a separate cloud system, server system, storage system, etc.). For example, control signals, commands, data, etc. provided under the control of the processor (220) of the information processing system (200) may be transmitted to the user terminal and / or the external system through the communication module (230) and the network, and through the communication module of the user terminal and / or the external system. The processor (220) may provide distributed processing information for graph convolutional network (GCN) inference to the user terminal (not shown).
[0057] Additionally, the input / output interface (240) of the information processing system (200) may be a means for interfacing with a device (not shown) for input or output that is connected to the information processing system (200) or that the information processing system (200) may include. In FIG. 2, the input / output interface (240) is shown as an element configured separately from the processor (220), but is not limited thereto, and the input / output interface (240) may be configured to be included in the processor (220). The information processing system (200) may include more components than those shown in FIG. 2. However, there is no need to clearly illustrate most of the conventional technical components.
[0058] FIG. 3 is a diagram illustrating a process for converting sparse matrix data according to one embodiment of the present disclosure. As illustrated, the processing group (136) can convert sparse matrix data (310) stored in the high-bandwidth memory (133) to generate converted sparse matrix data (320).
[0059] First, the processing group (136) can sequentially search each row of the sparse matrix data (310). Referring to (A) of FIG. 3, the values 1, 2, and 3 exist in the 0th row and the values 4 and 5 exist in the 1st row, so the processing group (136) can extract these values as non-zero values for each row.
[0060] Subsequently, the processing group can convert the data into a single array by recording together an indicator indicating the starting position of the row of extracted non-zero values, the actual value for each row, and the column position to which each value belongs.
[0061] More specifically, when a new row begins, the processing group may record an indicator (X) indicating the starting position of the row in the array. Then, the non-zero value of each row and its corresponding column index may be added to the array in sequence. For example, the processing group may record the value "1" of the 0th row with the row indicator (X=0) and column index 0, the value "2" with column index 3, and the value "3" with column index 9.
[0062] Through this conversion process, the processing group can convert the sparse matrix data (310) into converted sparse matrix data (320) having a single array structure. As shown in (B) of FIG. 3, the converted sparse matrix data (320) is stored in contiguous memory according to Address (330), and row indicators (X), values, and column position information are listed sequentially.
[0063] This single array structure can reduce memory usage, improve data access efficiency, and simplify subsequent sparse-dense matrix multiplication operations.
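The conversion of FIG. 3 can be sketched as below. The entry encoding (a `("X", row)` marker followed by `(value, column)` pairs), the function names, the 10-column matrix width, and the column positions of the values 4 and 5 in the 1st row are assumptions made for illustration; only the 0th-row values and their column indices (0, 3, 9) are taken from the description.

```python
ROW_MARK = "X"  # stands in for the row indicator (X) of FIG. 3

def to_single_array(matrix):
    """Flatten a sparse matrix into one array of row markers and (value, column) entries."""
    stream = []
    for row_index, row in enumerate(matrix):
        nonzeros = [(value, col) for col, value in enumerate(row) if value != 0]
        if not nonzeros:
            continue  # assumption: all-zero rows emit no marker
        stream.append((ROW_MARK, row_index))  # a new row begins
        stream.extend(nonzeros)               # its values with their column indices
    return stream

def from_single_array(stream, n_rows, n_cols):
    """Rebuild the dense layout to check that the single array loses no information."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    current_row = None
    for first, second in stream:
        if first == ROW_MARK:
            current_row = second
        else:
            matrix[current_row][second] = first
    return matrix

matrix = [
    [1, 0, 0, 2, 0, 0, 0, 0, 0, 3],  # 0th row: columns 0, 3, 9 as in the description
    [0, 0, 4, 0, 0, 0, 0, 5, 0, 0],  # 1st row: column positions are assumed
]
stream = to_single_array(matrix)
```

Storing the row marker inline removes the need for a separate row-pointer array, so the stream can be read from contiguous addresses in a single pass.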
[0064] FIG. 4 is a diagram illustrating the structure of a high-bandwidth memory (133) including a plurality of pseudo channels and a memory controller (132) and a switch (430) connected thereto according to one embodiment of the present disclosure. As illustrated, the high-bandwidth memory (133) may include a plurality of unit memories (410) and a plurality of unit memory controllers (420) for accessing these unit memories (410).
[0065] The unit memory (410) is a basic storage block constituting the high-bandwidth memory (133), and each unit memory (410) can independently perform data storage and reading. These unit memories (410) can perform the role of storing and transmitting large amounts of matrix data generated from a large dataset at high speed.
[0066] The memory controller (132) may include a plurality of unit memory controllers (420), and each unit memory controller (420) may be connected to each unit memory (410) to manage data access and transmission of the corresponding unit memory. Specifically, the unit memory controller (420) may interpret data access commands requested from the processing group (136) and control the process of reading or writing data stored in the unit memory (410).
[0067] A switch (430) can be positioned between multiple unit memory controllers (420) and processing groups to coordinate data flow. For example, if a specific processing group requests access to a specific unit memory (410), the switch (430) can set a path so that the request is forwarded to the appropriate unit memory controller (420). This minimizes data conflicts or bottlenecks even when multiple processing groups access high-bandwidth memory (133) simultaneously.
[0068] Accordingly, the high-bandwidth memory (133) may include multiple pseudo channels, and each pseudo channel may independently perform data access and transmission by including a unit memory controller (420) and a switch (430). This structure maximizes parallelism and independence, thereby significantly improving the efficiency of large-scale data access for sparse-dense matrix multiplication operations.
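The relationship between the switch, the unit memory controllers, and the unit memories can be modeled with a small software sketch. The class names, the dictionary-backed storage, and the use of a channel index to select a controller are hypothetical simplifications; actual routing and arbitration in hardware are not specified here.

```python
class UnitMemoryController:
    """Hypothetical controller managing reads and writes for one unit memory."""

    def __init__(self):
        self.cells = {}  # stands in for the unit memory behind this controller

    def write(self, address, data):
        self.cells[address] = data

    def read(self, address):
        return self.cells[address]

class Switch:
    """Hypothetical switch that forwards a request to the controller of one pseudo channel."""

    def __init__(self, controllers):
        self.controllers = controllers

    def route(self, channel_id):
        return self.controllers[channel_id]

switch = Switch([UnitMemoryController() for _ in range(4)])

# Two requesters use the same address on different pseudo channels without conflict.
switch.route(0).write(0x10, "rows for group 0")
switch.route(1).write(0x10, "rows for group 1")
read_back = (switch.route(0).read(0x10), switch.route(1).read(0x10))
```

Each channel keeps its own address space in this model, which is what allows accesses on different channels to proceed independently.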
[0069] FIG. 5 is a block diagram illustrating the configuration of a processing group (136) including a plurality of processing elements (510, 520, 530) according to one embodiment of the present disclosure. The processing group (136) may include a plurality of processing elements (PE) (510, 520, 530).
[0070] Processing elements (510, 520, 530) may each include a multiplier (511, 521, 531) and an accumulator (512, 522, 532). The multiplier (511, 521, 531) may receive sparse matrix data and dense matrix data as input and perform matrix multiplication operations. For example, the multiplier (511, 521, 531) may calculate the product of the two input data and pass the result to the accumulator (512, 522, 532). The accumulator (512, 522, 532) may accumulate the operation result passed from the multiplier (511, 521, 531) to generate a final sum result for one row.
[0071] In this way, the processing elements (510, 520, 530) included in the processing group (136) can process matrix data in parallel by independently performing multiplication and accumulation operations. Therefore, the present structure can maximize the parallelism of sparse-dense matrix multiplication operations and provide high computational performance even on large datasets.
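A processing element consisting of a multiplier and an accumulator can be sketched as follows. The class and function names are hypothetical, and assigning one processing element to each output feature column is an assumption made for the example rather than a mapping stated in the disclosure.

```python
class ProcessingElement:
    """Hypothetical PE: a multiplier feeding an accumulator."""

    def __init__(self):
        self.acc = 0.0

    def multiply(self, a, b):
        return a * b

    def accumulate(self, product):
        self.acc += product

    def mac(self, a, b):
        """One multiply-accumulate step."""
        self.accumulate(self.multiply(a, b))

    def drain(self):
        """Return the accumulated sum for the finished row and reset."""
        result, self.acc = self.acc, 0.0
        return result

def process_row(pes, nonzeros, dense):
    """Compute one output row; PE j handles feature column j (assumed mapping)."""
    for value, col in nonzeros:
        for pe, feature in zip(pes, dense[col]):
            pe.mac(value, feature)
    return [pe.drain() for pe in pes]

dense = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
pes = [ProcessingElement() for _ in dense[0]]
# One sparse row with non-zeros 2.0 at column 0 and 1.0 at column 2.
row_result = process_row(pes, [(2.0, 0), (1.0, 2)], dense)
```

In hardware the inner loop over processing elements would run concurrently; the sequential loop here only models the arithmetic.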
[0072] FIG. 6 is a diagram illustrating a structure for parallel processing of matrix data through a plurality of processing groups (610, 620, 630) according to one embodiment of the present disclosure. As illustrated, the processing group may be composed of a plurality of unit processing groups (610, 620, 630), and each of the plurality of unit processing groups (610, 620, 630) is connected to each of a plurality of pseudo-channels (640, 650, 660) to perform a sparse-dense matrix multiplication operation corresponding to the pseudo-channel.
[0073] More specifically, each of the multiple unit processing groups (610, 620, 630) can be connected to a corresponding pseudo-channel through an M-bit interface. This configuration allows each unit processing group to access memory individually, without mutual interference, to read and write the data required for computation. Each pseudo-channel, together with the unit processing group assigned to it, can operate independently of the others to perform parallel computations, thereby maximizing the computational performance and processing efficiency of the entire system.
[0074] As a result, all processing groups can generate independent outputs in parallel, thereby maximizing computational parallelism. This structure allows for high processing performance even with large datasets and offers the advantage of effectively utilizing the parallelism and efficiency of FPGAs and HBMs.
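The parallel structure of FIG. 6 can be approximated in software as below. The round-robin assignment of sparse rows to pseudo-channels, the use of Python threads to stand in for unit processing groups, and all function names are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def spdmm_rows(rows, dense):
    """Work of one unit processing group: rows is a list of (row_id, [(value, col), ...])."""
    n_feat = len(dense[0])
    out = {}
    for row_id, nonzeros in rows:
        acc = [0.0] * n_feat
        for value, col in nonzeros:
            for j in range(n_feat):
                acc[j] += value * dense[col][j]
        out[row_id] = acc
    return out

def run_groups(sparse_rows, dense, n_groups):
    """Split rows across channels (round-robin, assumed) and process them in parallel."""
    channels = [sparse_rows[g::n_groups] for g in range(n_groups)]
    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        partials = list(pool.map(lambda ch: spdmm_rows(ch, dense), channels))
    merged = {}
    for part in partials:
        merged.update(part)  # output rows are disjoint, so no reduction is needed
    return [merged[row_id] for row_id, _ in sparse_rows]

sparse_rows = [
    (0, [(1.0, 0), (2.0, 2)]),
    (1, [(3.0, 1)]),
    (2, []),
    (3, [(4.0, 2)]),
]
dense = [[1.0, 1.0],
         [2.0, 0.0],
         [0.0, 3.0]]

parallel_result = run_groups(sparse_rows, dense, n_groups=2)
```

Because each group produces a disjoint set of output rows, the partial results can be combined without any cross-group synchronization.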
[0075] FIG. 7 is a flowchart of a sparse-dense matrix multiplication operation method using high-bandwidth memory (HBM) according to one embodiment of the present disclosure.
[0076] In method (700), first, the CPU of the sparse-dense matrix multiplication system can transfer sparse matrix data and dense matrix data stored in memory to the FPGA for graph convolutional network (GCN) inference (S710). For example, the CPU can directly transfer large amounts of sparse matrix data and dense matrix data to high-bandwidth memory on the FPGA side through a PCIe DMA module.
[0077] Subsequently, the FPGA can store sparse matrix data and dense matrix data transmitted from the CPU in high-bandwidth memory (S720). For example, high-bandwidth memory is composed of multiple pseudo channels, and each channel operates independently, thereby enabling parallel storage and access of data.
[0078] Subsequently, the on-chip memory of the FPGA can receive and store dense matrix data provided from high-bandwidth memory (S730). For example, the first on-chip memory stores dense matrix data so that it can be quickly referenced during subsequent computation processes.
[0079] Subsequently, the processing group of the FPGA can perform sparse-dense matrix multiplication (SpDMM) operations based on sparse matrix data provided from high-bandwidth memory and dense matrix data provided from on-chip memory (S740). For example, the processing group may include multiple processing elements, and the sparse-dense matrix multiplication operations can be performed in parallel using multipliers and accumulators provided in each processing element. This enables high-speed matrix operations and efficient graph neural network inference.
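The data flow of steps S710 to S740 can be traced with the following minimal Python sketch. It is a host-side simulation under stated assumptions: plain Python containers stand in for host memory, high-bandwidth memory, and on-chip memory; the sparse matrix is held as (row, column, value) triples; and every function name is hypothetical. It shows only where each matrix resides at each step, not how the hardware moves or computes the data.

```python
def s710_cpu_transfer(host):
    """S710: the CPU sends both matrices to the FPGA (stand-in for a PCIe DMA copy)."""
    return {"sparse": list(host["sparse"]),
            "dense": [row[:] for row in host["dense"]]}


def s720_store_hbm(payload):
    """S720: high-bandwidth memory holds both the sparse and the dense matrix."""
    return {"sparse": payload["sparse"], "dense": payload["dense"]}


def s730_load_on_chip(hbm):
    """S730: only the dense matrix is staged in on-chip memory."""
    return [row[:] for row in hbm["dense"]]


def s740_spdmm(hbm, on_chip_dense, n_rows):
    """S740: sparse non-zeros come from HBM, dense rows from on-chip memory."""
    n_cols = len(on_chip_dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, v in hbm["sparse"]:
        for j in range(n_cols):
            out[r][j] += v * on_chip_dense[c][j]
    return out


def method_700(host, n_rows):
    """Run the four steps in order and return the product matrix."""
    hbm = s720_store_hbm(s710_cpu_transfer(host))
    on_chip_dense = s730_load_on_chip(hbm)
    return s740_spdmm(hbm, on_chip_dense, n_rows)
```

In this sketch the dense matrix is read repeatedly from the on-chip copy while each sparse non-zero is read once, which reflects the division of roles between the two memories described in S730 and S740.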
[0080] The method described above may be provided as a computer program stored on a computer-readable recording medium for execution on a computer. The medium may continuously store a computer-executable program, or temporarily store it for execution or download. Additionally, the medium may be various recording or storage means in the form of a single or multiple hardware components, and may not be limited to a medium directly connected to a computer system but may exist distributed over a network. Examples of media may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Furthermore, other examples of media may include recording or storage media managed by app stores that distribute applications or sites and servers that supply or distribute various other software.
[0081] The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will understand that the various exemplary logical blocks, modules, circuits, and algorithmic steps described in connection with the disclosure herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate such interchangeability between hardware and software, various exemplary components, blocks, modules, circuits, and steps have been generally described above in terms of their functional aspects. Whether such functions are implemented in hardware or in software depends on the design requirements imposed on the specific application and the overall system. Those skilled in the art may implement the functions described in various ways for each specific application, but such implementations should not be construed as departing from the scope of the present disclosure.
[0082] In a hardware implementation, the processing units used to perform the techniques may be implemented in one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, computers, or a combination thereof.
[0083] Accordingly, the various exemplary logic blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed by general-purpose processors, DSPs, ASICs, FPGAs or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors coupled with a DSP core, or any other combination of configurations.
[0084] In firmware and/or software implementations, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, compact disc (CD), magnetic or optical data storage devices, etc. The instructions may be executable by one or more processors, and may cause the processor(s) to perform specific aspects of the functions described in this disclosure.
[0085] Where implemented in software, the techniques may be stored on a computer-readable medium as one or more instructions or code, or transmitted through a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transmission of a computer program from one place to another. Storage media may be any available medium accessible by a computer. As a non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium accessible by a computer that can be used to transfer or store desired program code in the form of instructions or data structures. Additionally, any connection is appropriately referred to as a computer-readable medium.
[0086] For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair cable, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair cable, DSL, or wireless technologies such as infrared, radio, and microwave are included within the definition of a medium. As used herein, disk and disc include CD, laser disc, optical disc, DVD (digital versatile disc), floppy disk, and Blu-ray disc, wherein disks usually reproduce data magnetically, whereas discs reproduce data optically using a laser. Combinations of the above should also be included within the scope of computer-readable media.
[0087] The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other known form of storage medium. An exemplary storage medium may be connected to a processor so that the processor can read information from the storage medium or write information to the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may exist within an ASIC. The ASIC may exist within a user terminal. Alternatively, the processor and the storage medium may exist as separate components within the user terminal.
[0088] Although the embodiments described above have been described as utilizing aspects of the subject matter disclosed herein in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or a distributed computing environment. Furthermore, aspects of the subject matter in the present disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.
[0089] Although the present disclosure has been described in relation to some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by a person skilled in the art to which the invention of the present disclosure pertains. Furthermore, such modifications and changes should be considered to fall within the scope of the claims appended to this specification.
Claims
1. A sparse-dense matrix multiplication system using high-bandwidth memory (HBM), the system comprising: a CPU; and a field programmable gate array (FPGA), wherein the CPU transfers sparse matrix data and dense matrix data stored in memory to the FPGA to perform operations for graph convolutional network (GCN) inference, and wherein the FPGA comprises: a high-bandwidth memory (HBM) that stores the sparse matrix data and the dense matrix data transmitted from the CPU; an on-chip memory (on-chip SRAM) that receives the dense matrix data from the high-bandwidth memory and stores it; and a processing group that performs a sparse-dense matrix multiplication (SpDMM) operation based on the sparse matrix data provided from the high-bandwidth memory and the dense matrix data provided from the on-chip memory.
2. The system of claim 1, wherein the on-chip memory comprises: a first on-chip memory that stores the dense matrix data provided from the high-bandwidth memory; and a second on-chip memory that temporarily stores operation results derived by performing the sparse-dense matrix multiplication operation through the processing group, and wherein the FPGA merges and outputs the operation results temporarily stored in the second on-chip memory for a predetermined period, or a predetermined number of the operation results stored in the second on-chip memory.
3. The system of claim 2, wherein the FPGA comprises: an input arbiter that retrieves the dense matrix data stored in the high-bandwidth memory and transmits it to the first on-chip memory; and an output arbiter that transmits the operation result derived by performing the sparse-dense matrix multiplication operation through the processing group to the second on-chip memory.
4. The system of claim 1, wherein the FPGA further comprises a PCIe DMA (Peripheral Component Interconnect Express Direct Memory Access) module that acquires the sparse matrix data and the dense matrix data stored in the memory of the CPU, transfers them to the high-bandwidth memory, and transfers the operation result derived by performing the sparse-dense matrix multiplication operation through the processing group to the CPU.
5. The system of claim 1, wherein the processing group transforms the sparse matrix data provided from the high-bandwidth memory and performs the sparse-dense matrix multiplication operation based on the transformed sparse matrix data, and wherein the transformed sparse matrix data is a single array including an indicator representing the position of a row, the non-zero values of each row, and the column position corresponding to each non-zero value of each row.
6. The system of claim 1, wherein the high-bandwidth memory includes a plurality of pseudo-channels, and wherein each of the plurality of pseudo-channels is provided with a memory controller that manages data access and transmission and a switch that coordinates the data flow between the memory controller and the processing group, so that the pseudo-channels operate independently of one another.
7. The system of claim 6, wherein the processing group comprises a plurality of unit processing groups, each connected to a corresponding one of the plurality of pseudo-channels and performing the sparse-dense matrix multiplication operation corresponding to that pseudo-channel.
8. The system of claim 1, wherein the processing group includes a plurality of processing elements, and wherein each of the plurality of processing elements comprises: a multiplier that performs the sparse-dense matrix multiplication operation; and an accumulator that accumulates the results of the operation performed by the multiplier.
9. A sparse-dense matrix multiplication method using high-bandwidth memory (HBM), performed by a sparse-dense matrix multiplication system that includes a CPU and a field programmable gate array (FPGA), the method comprising: transmitting, by the CPU, sparse matrix data and dense matrix data stored in memory to the FPGA to perform operations for graph convolutional network (GCN) inference; storing, by a high-bandwidth memory (HBM) of the FPGA, the sparse matrix data and the dense matrix data transmitted from the CPU; receiving and storing, by an on-chip memory (on-chip SRAM) of the FPGA, the dense matrix data from the high-bandwidth memory; and performing, by a processing group of the FPGA, a sparse-dense matrix multiplication (SpDMM) operation based on the sparse matrix data provided from the high-bandwidth memory and the dense matrix data provided from the on-chip memory.
10. A computer program stored on a computer-readable recording medium to execute the method according to claim 9 on a computer.