Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

An optimization method based on assembly instructions, applied in the fields of memory systems, complex mathematical operations, and program control design. It addresses problems such as low performance, the inability to control each part of the generated assembly instructions, and the inability to intuitively understand the meaning of assembly instructions, with the effect of avoiding conflicts.

Inactive Publication Date: 2017-05-17
INST OF COMPUTING TECH CHINESE ACAD OF SCI +1
3 Cites, 8 Cited by

AI Technical Summary

Problems solved by technology

The meaning of each assembly instruction cannot be understood intuitively. NVIDIA provides PTX pseudo-assembly code, which can generate different assembly instructions depending on the GPU chip. However, because PTX is not a native assembly language, developers cannot control each part of the generated assembly instructions, and a single PTX instruction sometimes generates multiple assembly instructions, which makes optimization inconvenient.
[0005] 2. Low performance

Examples

Embodiment Construction

[0033] Specifically, the network virtualization framework involved in the present invention is shown in Figure 1. The steps of the double-buffered matrix multiplication algorithm on the GPU are as follows:

[0034] Step 1: First, partition the original matrix into blocks according to bm (the column length of an A matrix block) and bn (the row length of a B matrix block); each block processes an output matrix C of dimension <bm, bn>;

[0035] Step 2: Create four temporary storage spaces, smA, smB, smAx and smBx, in shared memory (second-level storage on the GPU);

[0036] Step 3: Read a matrix of the size of smA from matrix A in global memory (first-level storage on the GPU) into smA, and read a matrix of the size of smB from matrix B into smB;

[0037] Step 4: In each iteration, load one column (A matrix block data) from smA into registers and one row (B matrix block data) from smB into registers, and use the FFMA (fused multiply-add) instruction to perform the matrix multiplication; while the multiplication is in progress, read the next line of sm...
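Steps 1, 3 and 4 above can be sketched in plain Python as a blocked multiplication in which each block accumulates its bm x bn piece of C from one column of the A tile and one row of the B tile at a time. This is only an illustrative sketch, not the patented assembly implementation: the function name `blocked_matmul` and the tile depth `bk` are hypothetical (the source does not name the tile depth), Python lists stand in for shared memory and registers, and an ordinary multiply followed by an add stands in for the FFMA instruction.

```python
def blocked_matmul(A, B, bm, bn, bk):
    """Compute C = A * B block by block (A is M x K, B is K x N, lists of lists).

    bm and bn follow the source; bk (tile depth along K) is a hypothetical
    parameter introduced only for this sketch.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Step 1: each (i0, j0) pair identifies one block of the output matrix C.
    for i0 in range(0, M, bm):
        for j0 in range(0, N, bn):
            rows = min(bm, M - i0)
            cols = min(bn, N - j0)
            # Accumulators for this block (stand-in for registers).
            acc = [[0.0] * cols for _ in range(rows)]
            for k0 in range(0, K, bk):
                depth = min(bk, K - k0)
                # Step 3: copy one tile of A and one tile of B
                # (stand-ins for smA and smB in shared memory).
                smA = [A[i0 + i][k0:k0 + depth] for i in range(rows)]
                smB = [B[k0 + k][j0:j0 + cols] for k in range(depth)]
                # Step 4: one column of smA and one row of smB per iteration.
                for k in range(depth):
                    a_col = [smA[i][k] for i in range(rows)]
                    b_row = smB[k]
                    for i in range(rows):
                        for j in range(cols):
                            # On the GPU this update would be a single FFMA.
                            acc[i][j] += a_col[i] * b_row[j]
            # Write the finished block back to C.
            for i in range(rows):
                C[i0 + i][j0:j0 + cols] = acc[i]
    return C
```

For example, `blocked_matmul(A, B, 2, 2, 2)` on a 2 x 3 matrix A and a 3 x 2 matrix B processes a single output block in two passes over the K dimension. The sketch omits the smAx and smBx buffers, so loads and computation do not overlap here.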

Abstract

The invention relates to a single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions. The method comprises the following steps. The original matrix is partitioned into blocks according to the column length bm of an A matrix block and the row length bn of a B matrix block, and each block processes an output matrix C of dimension <bm, bn>. Four storage spaces, smA, smB, smAx and smBx, are created in GPU secondary storage. A matrix of the size of smA is read from matrix A in GPU primary storage into smA, and a matrix of the size of smB is read from matrix B into smB. Each time, one column of A matrix block data is loaded from smA into registers and one column of B matrix block data is loaded from smB into registers; the register contents are read and a fused multiply-add instruction is applied to perform the matrix multiplication. While the multiplication is performed, a column of data of the next smA is read from GPU primary storage into smAx, and a column of data of the next smB is read into smBx. After the multiplication of smA and smB is complete, the addresses of smA and smAx are interchanged, and the addresses of smB and smBx are interchanged.
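The buffer interchange described in the abstract can be illustrated with a small Python sketch for a single output block. It is a sequential model under stated assumptions, not the patent's implementation: the helper `load_tiles`, the function `block_product_double_buffered` and the tile depth `bk` are hypothetical names, and Python cannot show the actual overlap of loads and arithmetic, so the "prefetch" simply runs before the compute step of the same iteration.

```python
def load_tiles(A, B, i0, j0, k0, bm, bn, bk):
    """Copy one tile of A and one tile of B (stand-in for a global-memory load)."""
    tile_a = [row[k0:k0 + bk] for row in A[i0:i0 + bm]]
    tile_b = [row[j0:j0 + bn] for row in B[k0:k0 + bk]]
    return tile_a, tile_b


def block_product_double_buffered(A, B, i0, j0, bm, bn, bk):
    """Compute the block of C = A * B that starts at row i0, column j0.

    bk (tile depth along K) is a hypothetical parameter for this sketch.
    """
    K = len(B)
    rows = min(bm, len(A) - i0)
    cols = min(bn, len(B[0]) - j0)
    acc = [[0.0] * cols for _ in range(rows)]

    # Fill the first pair of buffers before the main loop starts.
    smA, smB = load_tiles(A, B, i0, j0, 0, bm, bn, bk)
    for k0 in range(0, K, bk):
        nxt = k0 + bk
        # Prefetch the next tiles into the spare buffers smAx and smBx.
        # On the GPU this load would be in flight while the FFMAs execute.
        if nxt < K:
            smAx, smBx = load_tiles(A, B, i0, j0, nxt, bm, bn, bk)
        else:
            smAx, smBx = None, None

        # Multiply the tiles currently held in smA and smB.
        for k in range(len(smB)):
            for i in range(rows):
                a = smA[i][k]
                for j in range(cols):
                    acc[i][j] += a * smB[k][j]

        # Interchange the buffers: the prefetched tiles become current.
        smA, smAx = smAx, smA
        smB, smBx = smBx, smB
    return acc
```

Swapping the two references, rather than copying data between buffers, mirrors the address interchange in the abstract: the next iteration reads from the buffer that was just filled, and the old buffer becomes the target of the following prefetch.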

Description

Technical field

[0001] The invention relates to the technical fields of deep learning, high-performance computing, and GPGPU programming, and in particular to a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.

Background

[0002] A GPU (graphics processing unit) is a chip dedicated to image and video processing. Because its design simplifies logic processing and increases the number of computing units, early GPUs were used only for graphics- and image-related applications. As the chips have become more powerful, GPUs have evolved toward GPGPU (general-purpose computing on graphics processing units), meaning that their versatility has greatly improved. GPUs are now widely used in embedded systems, smart terminals, personal computers, workstations and other equipment. The Tesla series of GPUs, launched by NVIDIA, is designed specifically for numerical computation. Compared with ordinary GPU, it...

Application Information

IPC(8): G06F9/302, G06F9/30, G06F17/16, G06T1/20
CPC: G06T1/20, G06F9/30036, G06F9/3012, G06F17/16
Inventors: 谭光明, 张秀霞, 周可人, 王朝尉
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI