Single-precision matrix multiplication optimization method based on Loongson 3A

A matrix multiplication optimization method, applied in the field of electrical digital data processing, that addresses the low performance of single-precision matrix multiplication on Loongson 3A, improving operation efficiency and overcoming invalid prefetches.

Status: Inactive. Publication date: 2011-10-12
UNIV OF SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

Loongson 3A has distinctive instructions such as 128-bit memory access and parallel single-precision floating-point instructions. However, the basic linear algebra subroutine library GotoBLAS, developed by the High Performance Computing Group of the Supercomputing Center at the University of Texas at Austin, is not specifically optimized for these characteristics of Loongson 3A, so the single-precision matrix multiplication it provides performs poorly on the Loongson 3A platform.

Examples

Embodiment 1

[0016] The present invention is a single-precision matrix multiplication optimization method based on Loongson 3A. First, the two single-precision source matrices are divided into two sub-matrices, with block sizes no larger than half of the first-level cache and no larger than half of the second-level cache, respectively. Then, in the matrix multiplication core computation code, which uses Loongson 3A's 32-bit memory access instructions, single-precision floating-point multiply-add instructions, and prefetch instructions, the 128-bit memory access instructions and parallel single-precision floating-point instructions of Loongson 3A are used, and data is prefetched with a prefetch address calculation based on twice the size of the operation data set minus the size of one operation data unit.
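The cache-based blocking step can be illustrated with a minimal Python sketch. It is not the patent's kernel (which is written with Loongson 3A instructions); it only shows how a block size can be bounded by half of a cache and how a blocked multiplication walks the matrices. The cache sizes (64 KB first-level data cache, 4 MB second-level cache), the square block shape, and the names `max_square_block` and `blocked_matmul` are assumptions made for illustration and do not come from the text.

```python
import math

FLOAT_BYTES = 4  # one single-precision element


def max_square_block(cache_bytes, fraction=0.5):
    """Largest n such that an n x n single-precision block fits in
    `fraction` of the cache (square blocks are an assumption)."""
    budget_elems = int(cache_bytes * fraction) // FLOAT_BYTES
    return math.isqrt(budget_elems)


def blocked_matmul(a, b, m, k, n, bm, bk, bn):
    """C = A * B for row-major flat lists, processed block by block.
    A is m x k, B is k x n; bm, bk, bn are the block dimensions."""
    c = [0.0] * (m * n)
    for i0 in range(0, m, bm):
        for p0 in range(0, k, bk):
            for j0 in range(0, n, bn):
                # Multiply one block of A by one block of B.
                for i in range(i0, min(i0 + bm, m)):
                    for p in range(p0, min(p0 + bk, k)):
                        a_ip = a[i * k + p]
                        for j in range(j0, min(j0 + bn, n)):
                            c[i * n + j] += a_ip * b[p * n + j]
    return c


if __name__ == "__main__":
    l1_block = max_square_block(64 * 1024)        # assumed 64 KB L1 -> 90
    l2_block = max_square_block(4 * 1024 * 1024)  # assumed 4 MB L2 -> 724
    print(l1_block, l2_block)
```

In a real kernel the block dimensions would be tuned per matrix and per cache level rather than derived from a single square-block formula; the sketch only makes the "no larger than half of the cache" constraint concrete.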

[0017] In this embodiment, the two single-precision source matrices of Loongson 3A are first divided into two...

Abstract

The invention discloses a single-precision matrix multiplication optimization method based on Loongson 3A. The method comprises the following steps: dividing the two single-precision source matrices into two sub-matrices such that the block sizes are no larger than half of the first-level cache and no larger than half of the second-level cache, respectively; then, in the matrix multiplication core computation code that uses Loongson 3A's 32-bit memory access instructions, single-precision floating-point multiply-add instructions, and prefetch instructions, using the 128-bit memory access instructions and parallel single-precision floating-point instructions of Loongson 3A, and prefetching data with a prefetch address computed by subtracting the size of an operation data unit from the first address of the operation data set, so that the floating-point unit can run at close to full load. The method solves the problem of invalid prefetches for address-unaligned data, so that the execution efficiency of address-unaligned single-precision matrix multiplication approaches that of the address-aligned case. Compared with the basic linear algebra subroutine library GotoBLAS version 2-1.07, single-precision matrix multiplication optimized by this method runs more than 90 percent faster on average.
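The benefit of moving from 32-bit to 128-bit memory access can be shown with simple arithmetic: one 128-bit access carries four 32-bit single-precision values, so the same data needs a quarter of the load instructions. The sketch below only counts loads; the function name `loads_needed` is hypothetical, and it makes no claim about actual instruction timing on Loongson 3A.

```python
def loads_needed(num_floats, load_bits):
    """Number of load instructions needed to read `num_floats`
    single-precision values with loads of width `load_bits`."""
    floats_per_load = load_bits // 32          # 32 bits per single-precision value
    return -(-num_floats // floats_per_load)   # ceiling division


if __name__ == "__main__":
    row = 1024  # hypothetical number of elements in one block row
    print(loads_needed(row, 32))   # 1024 loads with 32-bit access
    print(loads_needed(row, 128))  # 256 loads with 128-bit access
```

Fewer load instructions leave more issue slots for the floating-point multiply-add work, which is consistent with the stated goal of keeping the floating-point unit near full load.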

Description

Technical field

[0001] The invention belongs to the technical field of electrical digital data processing, and in particular relates to an optimization method for single-precision matrix multiplication based on Loongson 3A.

Background

[0002] Loongson 3A is China's first quad-core central processing unit (CPU) with fully independent intellectual property rights. In the field of high-performance computing, Loongson 3A needs the support of a basic linear algebra subroutine library. A comparatively good library that can be used on Loongson 3A is GotoBLAS, developed by the High Performance Computing Group of the Supercomputing Center at the University of Texas at Austin. GotoBLAS achieves efficient single-precision matrix by using optimization te...

Application Information

IPC(8): G06F17/16
Inventors: 顾乃杰, 何颂颂, 张斌, 许耿纯
Owner: UNIV OF SCI & TECH OF CHINA