Mizuiro Kumo

CUDA C Program structure

CUDA C program structure

Execution of a CUDA program

The structure of a CUDA C program reflects the coexistence of a host (CPU) and one or more devices (GPUs) in the computer. A CUDA C source file can contain a mixture of host code and device code. By default, any traditional C program is a CUDA program that contains only host code. As illustrated in the figure (CUDA_program_excution), host code and device code execute alternately, in the order shown.

CUDA thread

  1. In CUDA, the execution of each individual thread is sequential.
  2. A CUDA program initiates parallel execution by calling kernel functions, which causes the underlying runtime mechanisms to launch a grid of threads that process different parts of the data in parallel.
  3. A CUDA programmer can assume that GPU threads take very few clock cycles to generate and schedule, owing to efficient hardware support. This contrasts with traditional CPU threads, which typically take thousands of clock cycles to generate and schedule. In short, "CUDA thread overhead is much less than CPU thread overhead".

CUDA vector addition example

sequential process

// compute vector sum C_h = A_h + B_h
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	for (int i = 0; i < n; ++i) {
		C_h[i] = A_h[i] + B_h[i];
	}
}

int main() {
	// Memory allocation for arrays A, B, C
	// I/O to read A and B, N elements each to initialize them.
	...
	vecAdd(A, B, C, N);
}

"_h" for "used for host" and "_d" for "used for device"

parallel process

(Figure: CPU_GPU_interact)

void vecAdd(float* A, float* B, float* C, int n) {
	int size = n * sizeof(float);
	float *A_d, *B_d, *C_d;
	// Part 1: Allocate device memory for A, B, and C
	// Copy A and B to device memory
	...

	// Part 2: Call kernel - to launch a grid of threads
	// to perform the actual vector addition
	...

	// Part 3: Copy C from the device memory
	// Free device vectors
	...
}

The vecAdd function acts as an outsourcing agent that:

  1. ships input data to a device
  2. activates the calculation on the device
  3. collects the results from the device

Device global memory and data transfer

cudaMalloc and cudaFree

cudaMalloc(void** devPtr, size_t size) allocates memory on the device.

void* is a "pointer to anything", void** is a "pointer to 'pointer to anything'"

cudaFree(void* devPtr) frees previously allocated device memory.

example:

float * A_d;
int size = n * sizeof(float);
cudaMalloc((void**)&A_d, size);
...
cudaFree(A_d);

cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)

cudaMemcpy operations issued in the same direction (e.g. host to device) always serialize; the data will not be "copied in parallel". This is due to the characteristics of the PCIe bus: only one transfer can be in flight at a time.
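As a minimal illustration of the four arguments (assuming A_h is a host array and A_d a device array, each holding n floats):

// copy n floats from the host array A_h to the device array A_d
cudaMemcpy(A_d, A_h, n * sizeof(float), cudaMemcpyHostToDevice);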

With these CUDA APIs, we can continue with the vecAdd function:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	int size = n * sizeof(float);
	float *A_d, *B_d, *C_d;

	// Part 1: Allocate device memory for A, B, and C
	cudaMalloc((void**)&A_d, size);
	cudaMalloc((void**)&B_d, size);
	cudaMalloc((void**)&C_d, size);
	// Copy A and B to device memory
	cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
	cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

	// Part 2: Call kernel - to launch a grid of threads
	// to perform the actual vector addition
	...
	
	// Part 3: Copy C from the device memory
	cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
	// Free device vectors
	cudaFree(A_d);
	cudaFree(B_d);
	cudaFree(C_d);
}

Error Checking and Handling in CUDA

// example
cudaError_t error = cudaMalloc((void**)&A_d, size);
if (cudaSuccess != error) {
	printf("%s in %s at line %d\n", cudaGetErrorStrig(err), __FILE__, __LINE__);
	exit(EXIT_FAILURE);
}
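
Repeating this pattern after every API call gets verbose, so a common idiom is to wrap it in a macro (the name CHECK_CUDA here is just an illustrative choice; it assumes stdio.h and stdlib.h are included):

#define CHECK_CUDA(call)                                              \
	do {                                                              \
		cudaError_t err_ = (call);                                    \
		if (err_ != cudaSuccess) {                                    \
			printf("%s in %s at line %d\n",                           \
			       cudaGetErrorString(err_), __FILE__, __LINE__);     \
			exit(EXIT_FAILURE);                                       \
		}                                                             \
	} while (0)

// usage:
CHECK_CUDA(cudaMalloc((void**)&A_d, size));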

Kernel functions and threading

CUDA C programming is an instance of single-program multiple-data (SPMD) parallel programming style.

CUDA thread hierarchy

When a program's host code calls a kernel, the CUDA runtime system launches a grid of threads that are organized into a two-level hierarchy (Figure: cuda_thread_hierarchy): the grid consists of thread blocks, and each block consists of the same number of threads.

How do we distinguish threads from each other? We use two built-in variables: blockIdx and threadIdx.

CUDA-C function qualifier

By default, all functions in a CUDA program are host functions, which are simply traditional C functions that execute on the CPU.

Qualifier Keyword     Callable From       Executed On    Executed By
__host__ (default)    Host                Host           Caller host thread
__global__            Host (or device)    Device         New grid of device threads
__device__            Device              Device         Caller device thread
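
A short sketch of how these qualifiers appear in source code (the function names are made up for illustration; __host__ and __device__ can also be combined so one function is compiled for both sides):

// runs on the device, callable only from device code
__device__ float deviceSquare(float x) {
	return x * x;
}

// compiled for both host and device
__host__ __device__ float addHD(float a, float b) {
	return a + b;
}

// kernel: callable from the host, executed by a new grid of device threads
__global__ void squareKernel(float* data, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) {
		data[i] = deviceSquare(data[i]);
	}
}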

CUDA-C built-in variables
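
Besides blockIdx and threadIdx, kernels can read blockDim (threads per block) and gridDim (blocks per grid); each of these has .x, .y, and .z components. A minimal sketch of how they are typically combined (the kernel name is illustrative):

__global__ void builtinsDemo(int* out, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index in x
	int totalThreads = gridDim.x * blockDim.x;      // total threads launched in x
	if (i < n) {
		out[i] = totalThreads;
	}
}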

Compare C function and kernel function

// C version
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	for (int i = 0; i < n; ++i) {
		C_h[i] = A_h[i] + B_h[i];
	}
}
// cuda version
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
	int i = threadIdx.x + blockDim.x * blockIdx.x;
	if (i < n) {
		C[i] = A[i] + B[i];
	}
}

The kernel function does not have a loop corresponding to the one in the host version, because the loop is replaced by the grid of threads. The entire grid forms the equivalent of the loop, and each thread in the grid corresponds to one iteration of the original loop.

The if(i<n) statement allows the kernel to be called to process vectors of arbitrary lengths. For i>n, the code won't run.

Calling kernel function

When the host code calls a kernel, it sets the grid and thread block dimensions via execution configuration parameters.

/**
*@brief vecAdd host code, calls kernel function
*@param [in] A input vector A in host memory, alloc outside vecAdd
*@param [in] B input vector B in host memory, alloc outside vecAdd
*@param [out] C output vector C in host memory, alloc outside vecAdd
*@param [in] n vector length of A, B, and C
*/
void vecAdd(float* A, float* B, float* C, int n) {
	float* A_d;
	float* B_d;
	float* C_d;
	int size = n * sizeof(float);

	cudaMalloc((void**)&A_d, size);
	cudaMalloc((void**)&B_d, size);
	cudaMalloc((void**)&C_d, size);

	cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
	cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

	vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);

	cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

	cudaFree(A_d);
	cudaFree(B_d);
	cudaFree(C_d);
} 

The execution configuration parameters are enclosed within <<< and >>> and appear before the traditional C function arguments. The first configuration parameter gives the number of blocks in the grid; the second specifies the number of threads in each block.
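The configuration parameters can also be given as dim3 values, which becomes useful for 2D or 3D grids; a 1D sketch equivalent to the call above:

dim3 dimGrid(ceil(n / 256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);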

Compile CUDA-C function

(Figure: cuda_compile)
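
A CUDA C source file (conventionally with a .cu extension) is compiled with NVCC, which separates the source: host code is handed to the host's standard C/C++ compiler, while device code is compiled into PTX and then into device binary. A typical invocation (the file name is chosen for illustration) is:

nvcc vecAdd.cu -o vecAdd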