Winograd convolution implementation method based on vector instruction acceleration calculation

An implementation method using vector technology, applied in the field of operation support systems. It addresses problems such as the inability to satisfy the contiguity requirements of memory reads and writes at the same time, and achieves the effects of avoiding suboptimality and improving computing performance.

Active Publication Date: 2021-12-24
ZHEJIANG LAB

AI Technical Summary

Problems solved by technology

[0008] S2: build a memory data layout strategy that arranges the original data of the Winograd convolution and the intermediate buffer data in memory. A data layout is the arrangement of data in memory. Because the memory address space of a CPU system is linear, multi-dimensional data must be stored in memory according to the rules specified by the data layout. The data layout of the Winograd convolution intermediate buffer must take both the conversion step and the matrix multiplication step into account. Because the matrix multiplication step involves high-dimensional matrix multiplication with complex data reuse, the layout must be designed around the storage hierarchy of the CPU system, combining vector blocks, register blocks, and cache blocks to obtain better memory access performance. The data dimensions are therefore blocked so that the storage hierarchy of the CPU system is fully exploited.

The design of the intermediate buffer must resolve a conflict between the memory access patterns of the Winograd convolution steps: the memory access locality of the conversion step is inherently at odds with the computation order of the matrix multiplication step, so the contiguity requirements of the memory reads and writes of both steps cannot be met at the same time. The data layout of the intermediate buffer therefore divides each data dimension into multiple blocks to reconcile program locality with the storage hierarchy of the CPU system. Let X denote a data dimension. The blocked data in the intermediate buffer uses a vector block XsimdBlock, a register block XregBlock, and a cache block Xblock; XnbBlock denotes the number of cache blocks. Each dimension is divided according to the number of cache blocks, and cache blocks, register blocks, and vector blocks are arranged from the outer layer to the inner layer.

Because the matrix multiplication step of Winograd convolution involves complex data reuse, the cache in the storage hierarchy of the CPU system must be fully used to optimize computing performance. The cache block divides the data used in the matrix multiplication step into multiple basic blocks, and the data of each basic block resides temporarily in the cache, so that the matrix multiplication step can exploit memory access locality to reduce memory access cost and speed up the program. The block counts are obtained by successive division: dividing the data dimension X of the intermediate buffer by the vector block XsimdBlock gives the number of vector blocks; dividing the number of vector blocks by the register block XregBlock gives the number of register blocks; and dividing the number of register blocks by the cache block Xblock gives the final number of cache blocks.

Relative to the layout that would be optimal for matrix multiplication alone, the data layout of the intermediate buffer places the Winograd block dimension at a more inner position. In the matrix multiplication step, the Winograd block dimension serves as the multi-threaded parallel dimension and would perform best in an outer position. In the conversion step, however, the Winograd block dimension lies in the inner layer of the operation, so an inner position better suits the memory layout with which the conversion step writes the intermediate buffer. At the cost of some performance loss in the matrix multiplication step, the Winograd convolution thereby achieves better overall performance.
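The successive division described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the class name `Blocking`, the helper `ceil_div`, the rounding-up of partial blocks, and the example sizes (16 lanes, as for single-precision values in a 512-bit vector register) are all assumptions introduced here. The `split` method shows how a linear index along dimension X decomposes into cache block, register block, vector block, and lane coordinates, ordered from the outer layer to the inner layer.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up (assumed handling of partial blocks)."""
    return -(-a // b)


@dataclass(frozen=True)
class Blocking:
    """Hypothetical blocking parameters for one data dimension X."""
    x: int             # full extent of the dimension
    x_simd_block: int  # elements per vector block (XsimdBlock)
    x_reg_block: int   # vector blocks per register block (XregBlock)
    x_block: int       # register blocks per cache block (Xblock)

    def x_nb_block(self) -> int:
        """Number of cache blocks (XnbBlock), by successive division."""
        n_vec = ceil_div(self.x, self.x_simd_block)
        n_reg = ceil_div(n_vec, self.x_reg_block)
        return ceil_div(n_reg, self.x_block)

    def split(self, i: int) -> tuple:
        """Map a linear index to (cache block, register block, vector block, lane)."""
        vec, lane = divmod(i, self.x_simd_block)
        reg, vec_in_reg = divmod(vec, self.x_reg_block)
        cache, reg_in_cache = divmod(reg, self.x_block)
        return cache, reg_in_cache, vec_in_reg, lane


blocking = Blocking(x=256, x_simd_block=16, x_reg_block=2, x_block=4)
print(blocking.x_nb_block())   # 2
print(blocking.split(100))     # (0, 3, 0, 4)
```

In a real intermediate buffer, the coordinates of several dimensions (including the Winograd block dimension) would be interleaved in a chosen loop order; the one-dimensional case here only shows how the block sizes nest.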



Examples

Embodiment Construction

[0037] Specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

[0038] The invention proposes a method for implementing Winograd convolution on a CPU platform. In-depth research shows that existing Winograd convolution implementations on CPUs have many defects. Based on the micro-architectural characteristics of the CPU, including vector registers and vector operation units, multi-core parallelism, and multi-level caches, the present invention applies fine-grained optimizations to Winograd convolution. The method proposed by the present invention can not only be used as a technical means for the server to use the existing CPU (without purchasing expensive GPU) to accelerate deep learning computin...


Abstract

The invention discloses a Winograd convolution implementation method based on vector instruction acceleration. The method comprises the following steps: S1, constructing a register blocking strategy: in a Winograd convolution implementation on a central processing unit (CPU), when the original data is converted to the Winograd domain, vector blocking and register blocking are applied to the intermediate buffer data; S2, constructing a memory data layout strategy: the original data of the Winograd convolution and the intermediate buffer data are arranged in memory, and, relative to the layout optimal for matrix multiplication, the data layout of the intermediate buffer places the Winograd block dimension at the innermost position; and S3, constructing a cache block search: the best-performing cache block is searched within a small range determined by the CPU hardware parameters and the convolution parameters, the best-performing solution is stored together with the corresponding convolution parameters, and it is subsequently looked up directly by the convolution parameters.

Description

Technical field

[0001] The present invention relates to the field of operation support systems for deep learning applications, and in particular to a method for improving convolution algorithms through vector instructions and memory access optimization, thereby accelerating deep learning training and inference.

Background

[0002] In recent years, artificial intelligence research has become increasingly popular. As the core technology of artificial intelligence, deep learning, which relies on deep neural network models, plays an increasingly important role in academic research and practical applications. Deep learning includes two tasks, training and inference. Training iteratively processes the training data set on the deep neural network model, so that the network continuously updates its internal model parameters and gradually acquires the ability to complete target tasks (such as image classification and image segmentation); inference is to use the tra...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/30, G06F9/50, G06N3/04, G06N3/063
CPC: G06F9/30101, G06F9/3012, G06F9/5027, G06N3/063, G06N3/045
Inventor: 曾令仿, 陈晓锋, 陈志广
Owner: ZHEJIANG LAB