Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

An assembly-instruction optimization method, applied in memory systems, complex mathematical operations, program control design, and similar fields. It addresses low performance, the inability to control each part of the generated assembly instructions, and the difficulty of intuitively understanding the meaning of assembly instructions, with the effect of avoiding conflicts.

Inactive Publication Date: 2017-05-17

INST OF COMPUTING TECH CHINESE ACAD OF SCI +1

3 Cites, 8 Cited by

Problems solved by technology

The meaning of each assembly instruction cannot be understood intuitively. NVIDIA provides PTX pseudo-assembly code, which can generate different assembly instructions depending on the GPU chip. However, because PTX is not a native assembly language, the programmer cannot control each part of the generated assembly instructions, and a single PTX instruction sometimes generates multiple assembly instructions, which makes optimization inconvenient.

[0005] 2. Low performance

Embodiment Construction

[0033] Specifically, the network virtualization framework involved in the present invention is shown in Figure 1. The steps of the double-buffered matrix multiplication algorithm on the GPU are as follows:

[0034] Step 1: Partition the original matrices into blocks according to bm (the column length of an A matrix block) and bn (the row length of a B matrix block), so that each block processes a <bm, bn>-dimensional block of the output matrix C;

[0035] Step 2: Create four temporary storage spaces, smA, smB, smAx and smBx, in shared memory (secondary storage on the GPU);

[0036] Step 3: Read an smA-sized block of matrix A from global memory (first-level storage on the GPU) into smA, and read an smB-sized block of matrix B into smB;

[0037] Step 4: Each time, load one column of A matrix block data from smA into registers and one row of B matrix block data from smB into registers, and use the FFMA (fused multiply-add) instruction to perform the matrix multiplication; while multiplying, read the next line of sm...
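The register-level work in Step 4 can be pictured as a sequence of rank-1 updates. The following is a minimal Python sketch, not the patent's assembly code. It assumes the A tile held in smA has bm rows and bk columns and the B tile held in smB has bk rows and bn columns; the tile depth bk and the function name tile_outer_product are hypothetical names introduced here for illustration. Each `+=` stands in for one FFMA instruction.

```python
def tile_outer_product(sm_a, sm_b):
    """Accumulate acc = sm_a x sm_b one k-step at a time.

    sm_a: bm x bk tile of A (list of rows), a stand-in for smA.
    sm_b: bk x bn tile of B (list of rows), a stand-in for smB.
    bk is a hypothetical tile depth, not a name from the source.
    """
    bm, bk, bn = len(sm_a), len(sm_b), len(sm_b[0])
    # acc models the per-thread register accumulators for the C tile.
    acc = [[0.0] * bn for _ in range(bm)]
    for k in range(bk):
        a_col = [sm_a[i][k] for i in range(bm)]  # one column of the A tile
        b_row = sm_b[k]                          # one row of the B tile
        for i in range(bm):
            for j in range(bn):
                # On the GPU this multiply-and-add would be a single FFMA.
                acc[i][j] += a_col[i] * b_row[j]
    return acc
```

Loading a column of A and a row of B means every loaded value is reused bn or bm times respectively, which is why this ordering suits a register-blocked kernel.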

Abstract

The invention relates to a single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions. The method comprises the following steps: the original matrices are partitioned into blocks according to the column length bm of an A matrix block and the row length bn of a B matrix block, so that each block processes a <bm, bn>-dimensional block of the output matrix C; four storage spaces, smA, smB, smAx and smBx, are created in GPU secondary storage (shared memory); an smA-sized block is read from matrix A in GPU primary storage (global memory) into smA, and an smB-sized block is read from matrix B into smB; each time, one column of A matrix block data is loaded from smA into registers and one row of B matrix block data is loaded from smB into registers, the register contents are read, and the fused multiply-add instruction is applied to perform the matrix multiplication; while the multiplication is performed, a column of data of the next smA is read from GPU primary storage into smAx, and a column of data of the next smB is read into smBx; after the smA and smB matrix multiplication is complete, the smA and smAx addresses are interchanged, and the smB and smBx addresses are interchanged.
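The buffer interchange described in the abstract can be sketched sequentially in Python. This is an illustration under stated assumptions rather than the patented implementation: the tile depth bk and the helpers load_tiles and sgemm_double_buffered are hypothetical, the matrix dimensions are assumed to be multiples of the tile sizes, and the whole next tile is fetched in one call, whereas on the GPU the loads would be interleaved with the multiply-add instructions so that they overlap with computation.

```python
def load_tiles(A, B, row0, col0, k0, bm, bn, bk):
    """Copy a bm x bk tile of A and a bk x bn tile of B (models a global-memory read)."""
    tile_a = [A[row0 + i][k0:k0 + bk] for i in range(bm)]
    tile_b = [B[k0 + k][col0:col0 + bn] for k in range(bk)]
    return tile_a, tile_b


def sgemm_double_buffered(A, B, bm, bn, bk):
    """Tiled C = A x B using two buffer pairs (smA/smB and smAx/smBx)."""
    M, K, N = len(A), len(B), len(B[0])
    # Assumption: dimensions divide evenly into tiles.
    assert M % bm == 0 and N % bn == 0 and K % bk == 0
    C = [[0.0] * N for _ in range(M)]
    for row0 in range(0, M, bm):
        for col0 in range(0, N, bn):
            acc = [[0.0] * bn for _ in range(bm)]  # models register accumulators
            smA, smB = load_tiles(A, B, row0, col0, 0, bm, bn, bk)
            smAx = smBx = None
            for k0 in range(0, K, bk):
                nxt = k0 + bk
                if nxt < K:
                    # Prefetch the next tiles into the spare buffers.
                    smAx, smBx = load_tiles(A, B, row0, col0, nxt, bm, bn, bk)
                # Multiply the current tiles: column of smA times row of smB.
                for k in range(bk):
                    for i in range(bm):
                        a = smA[i][k]
                        for j in range(bn):
                            acc[i][j] += a * smB[k][j]
                if nxt < K:
                    # Interchange the buffers instead of copying data.
                    smA, smAx = smAx, smA
                    smB, smBx = smBx, smB
            for i in range(bm):
                C[row0 + i][col0:col0 + bn] = acc[i]
    return C
```

Swapping the buffer references rather than copying their contents mirrors the address interchange in the abstract: the prefetched data becomes the current tile at no extra memory traffic.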

Description

Technical field

[0001] The invention relates to the technical fields of deep learning, high-performance computing, and GPGPU programming, and in particular to a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.

Background art

[0002] A GPU (graphics processing unit) is a chip dedicated to image and video processing. Because its chip design simplifies logic processing and increases the number of computing units, early GPUs were used only for graphics- and image-related application programming. As the chips have become more powerful, the GPU has developed toward GPGPU (general-purpose computing on graphics processing units), that is, its versatility has greatly improved. GPUs are now widely used in embedded systems, smart terminals, personal computers, workstations and other equipment. The Tesla series of GPUs, launched by NVIDIA, is designed specifically for numerical computation. Compared with an ordinary GPU, it...

Application Information

IPC(8): G06F9/302, G06F9/30, G06F17/16, G06T1/20

CPC: G06T1/20, G06F9/30036, G06F9/3012, G06F17/16

Inventor: 谭光明, 张秀霞, 周可人, 王朝尉

Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI
