HIGH PERFORMANCE COMPUTING
ON GPU
Graphics Processing Units
A graphics processing unit, or GPU, is a specialized microprocessor that offloads and accelerates 2D and 3D graphics rendering.
The highly parallel structure of modern GPUs makes them more effective than CPUs for data-parallel workloads.
NVIDIA's Tesla architecture exposes the computational horsepower of NVIDIA's GPUs.
The GPU is specialized for compute-intensive, highly parallel computation and is designed so that more transistors are devoted to data processing rather than to data caching and flow control.
Physical Memory Layout of NVIDIA GPUs
The device has its own global memory, which all the cores (thread processors) can access. There are N multiprocessors with M cores each. The cores of a multiprocessor share an instruction unit. Each core has its own local memory (residing in DRAM) and a separate register set, and all M cores of a multiprocessor share an on-chip memory called shared memory. The host can write to global memory but not to shared memory.
TESLA C1060
The NVIDIA Tesla C1060 is based on the 10-series NVIDIA architecture and has 30 multiprocessors, each with 8 cores, a double-precision unit, and on-chip shared memory.
What is CUDA?
CUDA is a scalable parallel programming model and a software environment for parallel computing:
Minimal extensions to the familiar C/C++ environment
A heterogeneous serial-parallel programming model
Kernels and Threads
Parallel portions of an application are executed on the device as kernels. One kernel is executed at a time, and all the parallel threads execute the same kernel. Some devices of high computational power can execute more than one kernel concurrently.
More about threads
A CUDA kernel is executed by an array of threads. All threads run the same code. Each thread has an ID that it uses to compute memory addresses and make control decisions. How memory addresses and control decisions are computed from the thread ID is discussed later.
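As a minimal sketch (the kernel name and the scaling operation are illustrative, not from the slides), each thread derives a unique index from its ID and uses it both to address one array element and to decide whether to do any work:

__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
    if (idx < n)                                      // control decision from the ID
        data[idx] *= factor;                          // memory address from the ID
}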
THREAD BATCHING
A kernel launches a grid of thread blocks.
Threads within a block cooperate via shared memory.
Threads within a block can synchronize (thread cooperation).
Threads in different blocks cannot cooperate.
MEMORY ACCESS
EXECUTION MODEL
CUDA C and Compilation
CUDA C provides a simple path for users familiar with the C programming language to write programs for execution by the device. It consists of a minimal set of extensions to the C language and a runtime library.
CUDA provides the nvcc compiler, which splits CUDA code into PTX code (used at runtime, running on the GPU) and standard C code (handed to the standard C compiler at compile time, running on the CPU).
Managing memory
GPU memory can only be managed by the CPU, and the CPU has access only to global memory.
The following memory operations apply only to global memory (not to local or shared memory).
Allocate/free memory:
cudaMalloc(void **pointer, size_t nbytes) // allocate nbytes of device memory
cudaMemset(void *pointer, int value, size_t count) // set count bytes to value
cudaFree(void *pointer) // free memory allocated by cudaMalloc
HOST-DEVICE data transfer:
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
// Transfers nbytes of data from src to dst; direction specifies the source and destination memory types.
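A minimal host-side sketch stringing these calls together (the array size and variable names are illustrative):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t nbytes = 1024 * sizeof(float);
    float *d_a = NULL;
    cudaMalloc((void **)&d_a, nbytes);                     // allocate device global memory
    cudaMemset(d_a, 0, nbytes);                            // set every byte to zero
    float *h_a = (float *)malloc(nbytes);                  // ordinary host memory
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_a);                                         // free device memory
    free(h_a);
    return 0;
}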
CUDA Function Qualifiers
__global__
Function called from the host and executed on the device.
Must return void.
E.g., kernels.
__device__
Function called from the device and run on the device.
Cannot be called from host code.
__host__
Function called from the host and executed on the host (the default).
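A short sketch showing the three qualifiers together (the function names are hypothetical):

__device__ float square(float x)            // device-only helper, callable from kernels
{
    return x * x;
}

__global__ void squareAll(float *v, int n)  // kernel: called from host, runs on device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);
}

__host__ void launch(float *d_v, int n)     // ordinary host function (the default)
{
    squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
}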
Kernel Calls and Unique Thread Index
Kernels are called with the modified syntax:
kernel_name<<<dimGrid, dimBlock>>>(argument list);
Here dim3 is a vector type with x, y, z as the members.
We can initialize dim3 objects with the constructor:
For a 1D grid: dim3 dG(var_x, 1, 1) or dim3 dG(var)
For a 2D grid: dim3 dG(var_x, var_y, 1) or dim3 dG(var_x, var_y)
Similarly for blocks:
For a 1D block: dim3 dB(var_x, 1, 1) or dim3 dB(var)
For a 2D block: dim3 dB(var_x, var_y, 1) or dim3 dB(var_x, var_y)
For a 3D block: dim3 dB(var_x, var_y, var_z)
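For example, a hypothetical launch covering a width x height image with 16 x 16 blocks (assuming width and height are multiples of 16):

dim3 dG(width / 16, height / 16);     // 2-D grid of blocks
dim3 dB(16, 16);                      // 2-D block of 256 threads
kernel_name<<<dG, dB>>>(arg1, arg2);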
Thread Synchronization
Host synchronization:
void cudaThreadSynchronize();
Blocks until all preceding CUDA calls have completed.
Device synchronization:
void __syncthreads();
Synchronizes all the threads in a block.
There is no way to synchronize threads across blocks.
The programmer should be careful to avoid RAW/WAW/WAR hazards.
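A small sketch (hypothetical kernel, assuming a 256-thread block) of __syncthreads() preventing a read-after-write hazard on shared memory:

__global__ void shiftLeft(int *data)
{
    __shared__ int buf[256];                         // assumes blockDim.x == 256
    buf[threadIdx.x] = data[threadIdx.x];            // write phase
    __syncthreads();                                 // all writes now visible to the block
    data[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x]; // safe neighbour read
}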
Heterogeneous Programming and Synchronization
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernel (launches asynchronously)
inc_gpu<<<grid, block>>>(a_d, N);
// run independent CPU code while the kernel executes
run_cpu_stuff();
// copy data from device back to host (waits for the kernel to finish)
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
Error Reporting
Example:
cudaThreadSynchronize();
Kernel_Launch<<<grid, block>>>(arg_list);
cudaThreadSynchronize();
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
All CUDA calls return an error code, but some calls are asynchronous, so the program should synchronize around them to keep the checks meaningful.
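A sketch of wrapping this pattern in a helper (the function name is illustrative):

#include <stdio.h>

void checkLastError(const char *label)
{
    cudaThreadSynchronize();                 // wait for pending asynchronous work
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s: %s\n", label, cudaGetErrorString(err));
}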
Hardware Implementation
The CUDA architecture is built around a scalable array of multithreaded multiprocessors. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. This makes the framework scalable.
A multiprocessor is designed to execute hundreds of threads concurrently. To manage so many threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread). When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
PERFORMANCE OPTIMIZATION
Performance optimization revolves around three basic strategies:
Maximizing parallel execution
Optimizing memory usage
Optimizing instruction usage to achieve maximum instruction throughput
Maximizing parallel execution
Amdahl's law states that the maximum speed-up S of a program is
S = 1 / ((1 - P) + P/N)
where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized, and N is the number of processors over which the parallel portion of the code runs. The larger N is (that is, the greater the number of processors), the smaller the P/N fraction.
It can be simpler to view N as a very large number, which essentially transforms the equation into
S = 1 / (1 - P)
Now, if 3/4 of a program is parallelized, the maximum speed-up over serial code is 1 / (1 - 3/4) = 4. So our aim is to increase P by increasing the fraction of parallel code.
Optimizing memory transfers
To run kernels, data values must be transferred from the host to the device along the PCI Express (PCIe) bus. It is important to minimize data transfer between the host and the device, even if that means running kernels on the GPU that do not demonstrate any speed-up, because keeping data on the device avoids extra transfers.
Device-to-Device Transfer
CUDA provides a function for device-to-device data transfer, which can only be called from host code.
The call to cudaMemcpy() is asynchronous with respect to the host, but the next kernel won't start until the memory transfer is complete. What if there is a large amount of memory to transfer? The GPU cores will be idle.
To increase performance, we can instead assign the job of copying N bytes of data to B blocks, each running k threads in parallel. For best performance, N = k * B. (E.g., it takes about 4.5 times less time if we assign the job of copying 1 MB of data to around 1K threads.)
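A sketch of such a copy kernel (illustrative; each of the k * B threads copies one byte):

__global__ void copyKernel(const char *src, char *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per byte
    if (i < n)
        dst[i] = src[i];                            // device-to-device copy in parallel
}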
Shared Memory
Each multiprocessor has 16 KB of shared memory associated with it.
It provides thread cooperation within a block of threads:
Sharing of memory accesses
Avoiding redundant computations
Because it is on-chip, shared memory is much faster than local and global memory.
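An illustrative sketch of both uses (assuming a power-of-two block of 256 threads): each element is read from global memory only once, and the block then cooperates on a tree reduction entirely in shared memory:

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[256];                     // on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                      // one global read per thread
    __syncthreads();                                // tile complete
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s]; // shared-memory adds only
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                  // one partial sum per block
}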
Coalesced Access to Global Memory
Global memory can be viewed in terms of aligned segments of 16 and 32 words.
(Figure: coalesced access, in which all threads but one access the corresponding word in a segment, versus misaligned sequential addresses that fall within two 128-byte segments.)
Choosing thread block sizes as multiples of 16 facilitates memory accesses by half-warps that are aligned to segments. But a warp size is 32, so there should be a minimum of 32 threads.
Optimizing Instruction Usage
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. Any flow-control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge to different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp.
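A minimal sketch of such divergence (hypothetical kernel): the branch splits every warp in half, so the two paths run one after the other:

__global__ void divergent(int *v)
{
    if (threadIdx.x % 2 == 0)       // even lanes of each warp take this path
        v[threadIdx.x] *= 2;
    else                            // odd lanes wait, then execute this path
        v[threadIdx.x] += 1;
}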
Parallelizing w.r.t. Pixels
If the processing of pixels is independent. E.g., conversion from RGB to grey, or conversion from one format to another.
char *host_img_rgb = malloc(3 * height * width * sizeof(char));
char *host_img_grey = malloc(height * width * sizeof(char)); // allocating on the host
char *dev_img_rgb, *dev_img_grey; // device pointers
cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char)); // allocating on the device
// read the image into HOST memory
// copy that RGB image into device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);
// copy back to host memory
cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);
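A hypothetical body for Kernel, one thread per pixel (using unsigned char so the channel sum cannot overflow):

__global__ void Kernel(const unsigned char *rgb, unsigned char *grey)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // pixel index
    int sum = rgb[3 * i] + rgb[3 * i + 1] + rgb[3 * i + 2]; // R + G + B
    grey[i] = (unsigned char)(sum / 3);                     // simple average
}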
Visualizing the Kernel Execution
(Figure: each block reads from the RGB array and writes to the grey array, with the blocks executing in parallel.)
Every block contains 256 threads. Choosing 256 as the threads per block encourages coalesced memory access.
Improvements
char *host_img_rgb = malloc(3 * height * width * sizeof(char));
char *host_img_grey = malloc(height * width * sizeof(char)); // allocating on the host
char *dev_img_rgb, *dev_img_grey; // device pointers
cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char)); // allocating on the device
// read the image into HOST memory
// copy that RGB image into device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);
cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));
Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);
cudaMemcpy(host_img_grey, dev_img_grey, width * height * sizeof(char), cudaMemcpyDeviceToHost);
IMPROVEMENT IN ALLOCATION
Instead of two separate allocations:
cudaMalloc((void **)&dev_img_rgb, 3 * width * height * sizeof(char));
cudaMalloc((void **)&dev_img_grey, width * height * sizeof(char));
A better way is one allocation carved into two pointers:
cudaMalloc((void **)&temp_dev_point, 4 * width * height * sizeof(char));
dev_img_rgb = temp_dev_point;
dev_img_grey = temp_dev_point + (width * height);
For example, it takes about 12 times less time to allocate 6000 bytes once than to allocate 4 arrays of 1500 bytes each.
Problems in Data Transfer and Execution
cudaMemcpy(dev_img_rgb, host_img_rgb, 3 * width * height * sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<grid, block>>>(dev_img_rgb, dev_img_grey);
The kernel has to wait for the data transfer, so the cores are idle. Moreover, the host-to-device transfer is slow.
Page-Locked Memory
CUDA allows the programmer to allocate page-locked host memory.
The data transfer rate between page-locked host memory and device memory is high.
It allows asynchronous data transfer.
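A sketch of replacing the plain malloc with page-locked allocation (variable names follow the running image example):

char *host_img_rgb;
cudaMallocHost((void **)&host_img_rgb, 3 * width * height * sizeof(char)); // pinned host memory
// ... transfers from host_img_rgb are faster and may be asynchronous ...
cudaFreeHost(host_img_rgb);                                                // matching free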
Concurrency: Use of Streams
// creating streams
cudaStream_t stream[height];
for (int i = 0; i < height; ++i)
    cudaStreamCreate(&stream[i]);
// specifying the sequence of host-to-device transfers (one image row per stream)
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(dev_img_rgb + (i * 3 * width), host_img_rgb + (i * 3 * width), 3 * width * sizeof(char), cudaMemcpyHostToDevice, stream[i]);
// specifying the sequence of kernel launches
for (int i = 0; i < height; ++i)
    Kernel<<<grid, block, 0, stream[i]>>>(dev_img_rgb + i * 3 * width, dev_img_grey + i * width);
// specifying the sequence of device-to-host transfers
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(host_img_grey + (i * width), dev_img_grey + (i * width), width * sizeof(char), cudaMemcpyDeviceToHost, stream[i]);
(Figure: comparison of timelines for non-concurrent and concurrent execution, showing the Host->Device, Execution, and Device->Host stages.)
Parallelizing Nested Loops
E.g., parallelizing w.r.t. the pixels in a patch. The serial version is a nested loop of the form:
for (int i = 0; i < height; i++)
    for (int j = 0; j < width; j++)
        // process pixel (i, j)
// We launch a 2-D grid
dim3 grid(width / patch_width, height / patch_height);
// with 2-D blocks
dim3 block(patch_width, patch_height);
// launch the kernel
Kernel_name<<<grid, block>>>(arg_list);
How to find the index of the patch inside the grid?
blockIdx.y * gridDim.x + blockIdx.x
How to find the index of the pixel inside the block?
threadIdx.y * blockDim.x + threadIdx.x
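Putting the index formulas together, a hypothetical kernel can recover a pixel's global coordinates from the 2-D grid and block indices:

__global__ void patchKernel(unsigned char *img, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column of the pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row of the pixel
    int pixel = y * width + x;                      // linear index into the image
    // ... process img[pixel] ...
}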
How to Choose the Best Configuration Argument
CUDA provides an OCCUPANCY calculator as an Excel file.
Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.