Method for accelerating lattice-Boltzmann by utilizing graphic processing units (GPUs)

A lattice-Boltzmann technology, applied in the fields of high-performance computing and computational fluid dynamics, which addresses the low peak floating-point computing power of CPUs, the large network transmission overhead of CPU clusters, and the long run times they entail, with the effects of reducing construction and management costs, improving processing performance, and reducing power consumption.

Inactive Publication Date: 2012-09-19
LANGCHAO ELECTRONIC INFORMATION IND CO LTD
Cites: 3 | Cited by: 29

AI-Extracted Technical Summary

Problems solved by technology

This approach consumes a lot of time, power and maintenance costs due to the low peak floating-point computing power of the CPU and the huge network transmission overhead.
Moreover, as requirements for fluid-simulation turnaround become ever shorte...

Method used

The object of the present invention is to accelerate the lattice Boltzmann method, improve its processing performance, and enable the CPU and GPU to perform collaborative computation, thereby satisfying the demands of fluid simulation and reducing the construction, management, operation and maintenance costs of the computer room. In the present invention, the initialization calculation is placed on the CPU side for execution, ...

Abstract

The invention provides a method for accelerating the lattice-Boltzmann method by utilizing graphics processing units (GPUs), involving a CPU on the host side and a GPU on the device side. The method comprises the steps that the host side specifies parameters such as the computational domain, the reference length, the freestream velocity, the density and the Reynolds number according to the physical problem, and assigns the thread layout of the kernel according to the grid; the device side calculates the equilibrium distribution functions of all lattice points in each direction from the macroscopic parameters (the density, the velocity, the Reynolds number, the viscosity coefficient, and the like), uses them as the initial field of the calculation, solves the discrete equations and processes the boundaries in parallel, and returns the result finally obtained through iteration to the host side. According to the method, the migration and collision steps of the lattice-Boltzmann method are calculated by exploiting the fast computation of the device-side GPU, and the iteration process of the lattice-Boltzmann method is accelerated through the coordinated operation of the host side and the device side.

Application Domain

Technology Topic

Iteration process · State distribution (+5)

Image

  • Method for accelerating lattice-Boltzmann by utilizing graphic processing units (GPUs)

Examples

  • Experimental program (1)

Example Embodiment

[0043] The present invention will be described in detail below with reference to the drawings in the specification:
[0044] In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be described in detail below in conjunction with the accompanying drawings and embodiments.
[0045] The purpose of the present invention is to accelerate the lattice Boltzmann method, improve its processing performance, and enable the CPU and GPU to perform collaborative computation, so as to meet the needs of fluid simulation while reducing the construction, management, operation and maintenance costs of the computer room. In the present invention, the initialization calculation is executed on the CPU side, while the time-consuming and highly parallelizable parts, solving the discrete equations and processing the boundaries, are executed in parallel on the GPU side using CUDA technology. The CPU and GPU compute collaboratively, finally realizing the accelerated lattice Boltzmann method, as shown in Figure 3. The specific steps and implementation process are as follows:
[0046] 1) According to the physical problem, the macroscopic parameters (density, velocity, viscosity coefficient, etc.) of the computational domain are specified on the host side and passed to the device side;
[0047] 2) Define the data structure and storage layout on the device, which stores the equilibrium distribution function of each grid point as well as macroscopic parameters such as the velocity and density of each grid point. For all grid points, the equilibrium distribution functions in each direction are calculated from the macroscopic parameters transmitted from the host and used as the initial field of the calculation;
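As a concrete illustration of step 2), the equilibrium distribution of the common D2Q9 lattice can be sketched as follows. This is a minimal Python sketch under the standard D2Q9 weights, not the patent's device-side code; the direction ordering is an assumption, since the text does not fix one:

```python
# D2Q9 lattice: discrete velocities and standard weights (assumed ordering)
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9]*4 + [1/36]*4

def f_eq(rho, ux, uy):
    """Equilibrium distribution per direction:
    f_i^eq = w_i * rho * (1 + 3 e.u + 4.5 (e.u)^2 - 1.5 |u|^2)."""
    usq = ux*ux + uy*uy
    out = []
    for (ex, ey), w in zip(E, W):
        eu = ex*ux + ey*uy
        out.append(w * rho * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*usq))
    return out
```

By construction, the zeroth moment of `f_eq` recovers the density and the first moment recovers the momentum, which is what makes it a consistent initial field for the iteration.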
[0048] 3) Design the migration-collision kernel. Set the number of threads per block to BLOCKSIZE (a value between 64 and 512), with the thread structure Block(BLOCKSIZE, 1), Grid((NX+BLOCKSIZE-1)/BLOCKSIZE, NY), and let each thread in the kernel calculate the migration and collision process of one grid point, as shown in Figure 4. The kernel pseudocode is as follows;
[0049] 1: k = blockIdx.y * NX + blockIdx.x * blockDim.x + threadIdx.x; // k is the row-major subscript of the grid point
[0050] 2: /* Migration step: gather (pull) reads of the distribution functions of the neighbouring grid points around the current grid point */
[0051] 3: fr = fr0[k];//0 represents the distribution function of the previous time level
[0052] 4: fe = fe0[k-1];
[0053] 5: fn = fn0[k-NX];
[0054] 6: fw = fw0[k+1];
[0055] 7: fs = fs0[k+NX];
[0056] 8: fne = fne0[k-NX-1];
[0057] 9: fnw = fnw0[k-NX+1];
[0058] 10: fsw = fsw0[k+NX+1];
[0059] 11: fse = fse0[k+NX-1];
[0060] 12: /*Collision process*/
[0061] 13: Compute the macroscopic quantities (density, velocity) from the post-migration distribution functions fr through fse;
[0062] 14: Compute the equilibrium distribution functions f1, f2, f3, f4, f5, f6, f7, f8 in each direction from the macroscopic quantities;
[0063] 15: Compute the post-collision distribution functions fr1[k], fe1[k], fn1[k], fw1[k], fs1[k], fne1[k], fnw1[k], fsw1[k], fse1[k] from f1, f2, f3, f4, f5, f6, f7, f8 and the post-migration distribution functions fr, fe, fn, fw, fs, fne, fnw, fsw, fse;
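The migration (a pull-style gather from neighbours, mirroring `fe = fe0[k-1]`, `fn = fn0[k-NX]`, etc.) and the collision can be re-expressed serially in Python. This is an illustrative sketch on a periodic lattice, not the CUDA kernel itself; the BGK single-relaxation collision operator and the parameter `omega` are assumptions, as the pseudocode does not spell out the collision formula:

```python
# D2Q9 directions (rest, e, n, w, s, ne, nw, sw, se) and weights -- assumed ordering
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9]*4 + [1/36]*4

def lbm_step(f, NX, NY, omega):
    """One migration (pull streaming) + BGK collision step on a periodic
    NX x NY lattice. f is a list of 9 flat arrays, index k = y*NX + x."""
    f1 = [[0.0] * (NX * NY) for _ in range(9)]
    for y in range(NY):
        for x in range(NX):
            k = y * NX + x
            # migration: pull each direction from its upstream neighbour,
            # matching fe = fe0[k-1], fn = fn0[k-NX], ... in the pseudocode
            fk = [f[i][((y - E[i][1]) % NY) * NX + ((x - E[i][0]) % NX)]
                  for i in range(9)]
            # macroscopic quantities from the migrated distributions
            rho = sum(fk)
            ux = sum(fi * e[0] for fi, e in zip(fk, E)) / rho
            uy = sum(fi * e[1] for fi, e in zip(fk, E)) / rho
            usq = ux * ux + uy * uy
            # collision: relax each distribution toward equilibrium (BGK)
            for i in range(9):
                eu = E[i][0] * ux + E[i][1] * uy
                feq = W[i] * rho * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*usq)
                f1[i][k] = fk[i] - omega * (fk[i] - feq)
    return f1
```

On the GPU, the two nested loops are replaced by the one-thread-per-grid-point mapping of step 3); the per-point body is identical.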
[0064] 4) Process the boundaries on the device side. Boundary processing can adopt methods such as the bounce-back method and the non-equilibrium extrapolation method. When processing the boundary, each thread is likewise designed to handle the calculation of one node;
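The bounce-back method mentioned in step 4) simply reflects each distribution at a solid node into its opposite direction, reversing the momentum at the wall. A minimal sketch, assuming the same D2Q9 direction ordering as the surrounding examples (the patent does not fix an ordering):

```python
# D2Q9 directions and, for each direction i, the index OPP[i] of its opposite
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
OPP = [0, 3, 4, 1, 2, 7, 8, 5, 6]

def bounce_back(fk):
    """Full bounce-back at a solid node: every distribution leaves along
    the opposite direction, so the momentum at the wall is reversed."""
    return [fk[OPP[i]] for i in range(9)]
```

Each boundary thread applies this per-node rule independently, which is why boundary processing parallelizes as well as the interior update.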
[0065] 5) Judge whether the iteration is completed, and output if it is completed, otherwise continue the iteration;
[0066] 6) The device side computes macroscopic parameters such as velocity, density and the stream function in parallel from the distribution functions and transmits the result to the host side; the host side outputs the result;
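The moment computation in step 6) is straightforward: density is the zeroth moment and velocity the first moment of the nine distributions at a lattice point. A per-point Python sketch (the stream function additionally requires an integration over the velocity field and is omitted here):

```python
# D2Q9 directions (assumed ordering, as in the sketches above)
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def macroscopic(fk):
    """Density and velocity of one lattice point from its 9 distributions."""
    rho = sum(fk)
    ux = sum(fi * e[0] for fi, e in zip(fk, E)) / rho
    uy = sum(fi * e[1] for fi, e in zip(fk, E)) / rho
    return rho, ux, uy
```

Because each lattice point's moments depend only on its own distributions, this step is embarrassingly parallel and maps to one thread per point, like the kernel in step 3).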
[0067] 7) Performance test
[0068] a) Test environment and test data
[0069] The test environment includes the hardware environment, the software environment, and the running software. The running software comprises the CPU version of the LBM algorithm running on the CPU and the GPU version of the LBM algorithm running on the GPU. The test case is the lid-driven square cavity flow; the input includes the grid size and some other input parameters. The specific test environment and test data parameters are shown in the following table;
[0070]
[0071] b) Performance results
[0072] In order to ensure the stability of the performance results, we ran the above job 10 times with the double data type. The average time for the CPU version of the LBM algorithm over 10 runs on a single CPU was 19763 seconds, and the average time for the GPU version over 10 runs of the same job on a single GPU was 598 seconds; the performance of the GPU version is therefore 19763/598 ≈ 33 times that of the CPU version.
[0073] It can be seen from the technical scheme of the present invention that testing identified migration, collision and boundary processing as the performance bottleneck of the LBM algorithm; the data in this part is completely independent and is therefore well suited to parallel computing with CUDA on the GPU. The parameter initialization and result output are still executed on the CPU side, and the CPU and GPU compute collaboratively. In testing, the overall performance increased by a factor of 33: a single GPU computing node is equivalent in computing performance to a cluster of 33 or more of the original CPU computing nodes. This not only meets the needs of fluid simulation but also greatly reduces power consumption as well as computer room construction, management, operation and maintenance costs; moreover, the method is simple to implement and has low development cost.
[0074] Apart from the technical features described in the specification, all others are techniques known to those skilled in the art.

