Understanding GPU Programming on NVIDIA A100 GPUs with Parallel Prefix Sums

Prefix sum (or scan) is a foundational operation in parallel computing. It computes cumulative sums over an array, enabling efficient algorithms for sorting, searching, and many other applications. This blog demonstrates various CUDA implementations of prefix sum on an NVIDIA A100 GPU, showcasing the advantages of shared memory and dynamic parallelism.

Prefix Sum with Associative Operators

The prefix sum of an array \( A \) produces an array \( S \), where:

\( S[i] = A[0] \oplus A[1] \oplus \dots \oplus A[i] \) for \( i \geq 0 \),

and \( \oplus \) is any associative operator (e.g., addition, multiplication, or bitwise AND).

The recursive definition of the prefix sum is:

\( S[i] = \begin{cases} A[0] & \text{if } i = 0, \\ S[i-1] \oplus A[i] & \text{if } i > 0. \end{cases} \)

While addition (\( + \)) is used in this implementation, the approach generalizes to any associative operation.
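
As a point of reference, here is a minimal sequential version of the inclusive scan (the function name is illustrative); the CUDA kernels later in this post should reproduce its output:

// Sequential inclusive prefix sum on the host, used only as a correctness reference.
void sequential_prefix_sum(const int *input, int *output, int n) {
    if (n <= 0) return;
    output[0] = input[0];                          // S[0] = A[0]
    for (int i = 1; i < n; ++i) {
        output[i] = output[i - 1] + input[i];      // S[i] = S[i-1] + A[i]
    }
}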

GPU Architecture Explained (Using NVIDIA A100)

Understanding how GPUs work is crucial to optimizing parallel algorithms like prefix sum. Let’s break down the key architectural concepts using the NVIDIA A100 GPU as an example:

Streaming Multiprocessors (SMs)

An NVIDIA A100 GPU contains 108 Streaming Multiprocessors (SMs). Each SM acts as a mini-processor, capable of executing many threads simultaneously. Within an SM, threads share fast on-chip resources such as registers, shared memory, and the warp schedulers that issue their instructions.

Threads and Warps

Threads are the smallest unit of execution on a GPU. In CUDA, threads are scheduled in groups of 32 called warps, and all threads in a warp execute the same instruction in lockstep (the SIMT model).

For example, if a thread block contains 1024 threads, it will be divided into \( 1024 / 32 = 32 \) warps.

Blocks and Grids

Threads are grouped into blocks, and blocks are organized into a grid. Each block executes independently, with shared memory facilitating communication among threads within a block.

For example, to process an array of 1 million elements with 1024 threads per block, we need \( \lceil 10^6 / 1024 \rceil = 977 \) blocks. These blocks are distributed across the 108 SMs of the A100.
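
In code, this launch configuration might be computed as follows (the kernel name is a placeholder):

const int n = 1000000;                 // number of elements
const int threads_per_block = 1024;
// Round up so every element is covered: (1000000 + 1023) / 1024 = 977
const int blocks = (n + threads_per_block - 1) / threads_per_block;
// prefix_sum_kernel<<<blocks, threads_per_block>>>(d_input, n);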

Memory Hierarchy

A GPU exposes several memory spaces with very different latencies: registers (private to each thread), shared memory (on-chip, visible to all threads in a block), and global memory (off-chip HBM, visible to every thread but far slower). Optimizing memory usage (e.g., minimizing global memory accesses) is therefore critical for performance.
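
The sketch below (a hypothetical kernel, assuming at most 256 threads per block) shows the same data touching each level of the hierarchy:

__global__ void memory_spaces_example(const int *global_in, int *global_out, int n) {
    __shared__ int tile[256];                       // shared memory: on-chip, per block
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int value = (idx < n) ? global_in[idx] : 0;     // global memory (HBM): high latency
    tile[threadIdx.x] = value;                      // 'value' itself lives in a register
    __syncthreads();
    if (idx < n) {
        global_out[idx] = tile[threadIdx.x];        // shared memory back to global memory
    }
}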

Parallelizing Prefix Sum

To efficiently compute prefix sums on a GPU, we leverage the parallel nature of CUDA. The key idea is to divide the computation into multiple steps, updating the array in-place with increasing strides. Here’s a step-by-step breakdown:

Step 1: Load Data into Shared Memory

Each thread copies a portion of the array from global memory into shared memory, allowing fast access during computation.

Step 2: Compute Local Prefix Sums

Using a stride-based approach, each thread computes its local prefix sum by combining values within its segment of the array:

\( A[i] = A[i] \oplus A[i - \text{stride}] \)

This is repeated for increasing strides (\( \text{stride} = 1, 2, 4, \dots \)) until the entire block is processed.
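
As a concrete illustration (values chosen arbitrarily), an 8-element block evolves as follows:

initial:    3  1  4  1  5  9  2  6
stride 1:   3  4  5  5  6 14 11  8
stride 2:   3  4  8  9 11 19 17 22
stride 4:   3  4  8  9 14 23 25 31

After \( \log_2 8 = 3 \) doubling steps, every position holds the inclusive prefix sum of the original values.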

Step 3: Handle Inter-Block Dependencies

Since blocks compute prefix sums independently, their results must be adjusted to reflect the cumulative sums of preceding blocks. This requires a second phase or dynamic parallelism.
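
Concretely, if block \( b \) produces a local inclusive scan \( L_b \) and a block total \( T_b \) (illustrative notation), the final result for an element of block \( b \) is:

\( S[i] = L_b[i] + \sum_{k < b} T_k \)

This per-block offset is exactly what the second phase computes.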

CUDA Implementations

Naive, Global, and Shared Memory Prefix Sums

These implementations are designed for intra-block prefix sums:

1. Naive Prefix Sum

This implementation computes prefix sums sequentially within a parallel kernel, making it inefficient on GPUs:

__global__ void naive_prefix_sum(int *input, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Exactly one thread does useful work per iteration, so the scan is fully
    // serialized: O(n) steps despite launching n threads. Like the other
    // intra-block kernels, it assumes a single-block launch.
    for (int i = 1; i < n; ++i) {
        if (tid == i) {
            input[tid] += input[tid - 1];
        }
        __syncthreads();  // every thread must reach the barrier (no early return)
    }
}
    

2. Logarithmic Prefix Sum Using Global Memory

This kernel uses \( O(\log n) \) steps but relies on slow global memory access:

__global__ void global_prefix_sum(int *input, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Hillis-Steele scan over global memory. __syncthreads() only synchronizes
    // threads within a block, so this kernel is correct only for a single-block launch.
    for (int step = 1; step < n; step *= 2) {
        int temp = (tid >= step && tid < n) ? input[tid - step] : 0;
        __syncthreads();
        if (tid < n) {
            input[tid] += temp;
        }
        __syncthreads();
    }
}
    

3. Logarithmic Prefix Sum Using Shared Memory

This kernel leverages shared memory for faster intra-block communication:

__global__ void shared_prefix_sum(int *input, int n) {
    __shared__ int shared_mem[1024]; // Shared memory array
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Load data into shared memory
    if (tid < n) {
        shared_mem[threadIdx.x] = input[tid];
    } else {
        shared_mem[threadIdx.x] = 0;
    }
    __syncthreads();

    for (int step = 1; step < blockDim.x; step *= 2) {
        int temp = (threadIdx.x >= step) ? shared_mem[threadIdx.x - step] : 0;
        __syncthreads();
        shared_mem[threadIdx.x] += temp;
        __syncthreads();
    }

    // Write results back to global memory
    if (tid < n) {
        input[tid] = shared_mem[threadIdx.x];
    }
}
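
To run the kernel above end to end, a minimal host-side driver might look like this; it assumes a single block of at most 1024 elements, since __syncthreads() only coordinates threads within one block:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 8;
    int h_data[n] = {3, 1, 4, 1, 5, 9, 2, 6};

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block suffices here; larger arrays need the multi-block schemes below.
    shared_prefix_sum<<<1, 1024>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h_data[i]);  // expected: 3 4 8 9 14 23 25 31
    printf("\n");

    cudaFree(d_data);
    return 0;
}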
    

Multi-Kernel and Dynamic Prefix Sum

The computation of prefix sums across multiple blocks requires handling inter-block dependencies. Two methods address this challenge:

Multi-Kernel Prefix Sum

The computation is divided into two kernels: the first computes a local prefix sum within each block and records each block's total, and the second adds the cumulative totals of all preceding blocks to every element.

This method is efficient but involves overhead, since the host must launch multiple kernels and synchronize between them.

Dynamic Prefix Sum

This approach uses dynamic parallelism to launch the adjustment kernel directly from the device, chaining both phases on the GPU without returning control to the host.

This method is faster because it eliminates the need for the host to manage inter-block adjustments, making it well suited to workloads with significant inter-block dependencies.

4. Multi-Kernel Execution

This method splits the computation into two kernels: one for local prefix sums and another for adjustments across blocks.

__global__ void compute_local_prefix_sum(int *input, int *block_sums, int n) {
    __shared__ int shared_mem[1024];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Compute local prefix sum
    if (tid < n) {
        shared_mem[threadIdx.x] = input[tid];
    } else {
        shared_mem[threadIdx.x] = 0;
    }
    __syncthreads();

    for (int step = 1; step < blockDim.x; step *= 2) {
        int temp = (threadIdx.x >= step) ? shared_mem[threadIdx.x - step] : 0;
        __syncthreads();
        shared_mem[threadIdx.x] += temp;
        __syncthreads();
    }

    if (tid < n) {
        input[tid] = shared_mem[threadIdx.x];
    }

    // Store the block sum
    if (threadIdx.x == blockDim.x - 1) {
        block_sums[blockIdx.x] = shared_mem[threadIdx.x];
    }
}

__global__ void adjust_block_sums(int *input, int *block_sums, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // block_sums holds each block's local total (it has not been scanned), so each
    // element must be offset by the sum of *all* preceding block totals, not just
    // the previous one. A scan over block_sums would be more efficient; the loop
    // keeps the example simple.
    if (blockIdx.x > 0 && tid < n) {
        int offset = 0;
        for (int i = 0; i < blockIdx.x; ++i) {
            offset += block_sums[i];
        }
        input[tid] += offset;
    }
}
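
On the host, the two phases are launched back-to-back on the same stream, so the second kernel only starts once the first has finished (buffer names are illustrative):

int blocks = (n + threads_per_block - 1) / threads_per_block;

// Phase 1: per-block local scans; d_block_sums receives one total per block.
compute_local_prefix_sum<<<blocks, threads_per_block>>>(d_input, d_block_sums, n);

// Phase 2: offset every element by the totals of all preceding blocks.
adjust_block_sums<<<blocks, threads_per_block>>>(d_input, d_block_sums, n);
cudaDeviceSynchronize();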
    

5. Dynamic Parallelism

Dynamic parallelism enables a kernel to launch another kernel for inter-block adjustments:

__global__ void dynamic_prefix_sum(int *input, int *block_sums, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ int shared_mem[1024];
    if (tid < n) {
        shared_mem[threadIdx.x] = input[tid];
    } else {
        shared_mem[threadIdx.x] = 0;
    }
    __syncthreads();

    for (int step = 1; step < blockDim.x; step *= 2) {
        int temp = (threadIdx.x >= step) ? shared_mem[threadIdx.x - step] : 0;
        __syncthreads();
        shared_mem[threadIdx.x] += temp;
        __syncthreads();
    }

    if (tid < n) {
        input[tid] = shared_mem[threadIdx.x];
    }

    if (threadIdx.x == blockDim.x - 1) {
        block_sums[blockIdx.x] = shared_mem[threadIdx.x];
    }

    // Launch the adjustment kernel from the device (dynamic parallelism).
    // Caveat: this simple trigger assumes every other block has already written
    // its local results; a production version would enforce that, e.g., with an
    // atomic counter of completed blocks.
    if (blockIdx.x == gridDim.x - 1 && threadIdx.x == 0) {
        adjust_block_sums<<<gridDim.x, blockDim.x>>>(input, block_sums, n);
    }
}
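
Kernels that launch other kernels need relocatable device code, so a file containing dynamic_prefix_sum must be compiled with something like the following (the file name is just an example):

nvcc -arch=sm_80 -rdc=true prefix_sum.cu -o prefix_sum -lcudadevrt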
    

Why is Dynamic Parallelism Faster Than Multi-Kernel Execution?

Multi-Kernel Prefix Sum

Each phase is a separate kernel launched by the host, so the pipeline pays for an extra kernel launch plus a host-side synchronization point between the local-scan and adjustment steps.

Dynamic Parallelism

The adjustment kernel is launched from the device itself, so the second phase begins without a round trip to the host.

In summary, Dynamic Parallelism outperforms Multi-Kernel Prefix Sum by reducing the dependency on the host for managing inter-block adjustments. This results in lower latency and better scalability for large-scale parallel workloads.

Performance Comparison

Performance Results

The results compare the performance of the different prefix sum methods, demonstrating the efficiency of shared memory and of dynamic parallelism for managing inter-block dependencies.

Conclusion

Optimizing prefix sum computation on GPUs requires careful consideration of memory access patterns, inter-block communication, and parallel execution. Dynamic parallelism and shared memory optimizations showcase the power of modern GPUs like the NVIDIA A100 in efficiently solving parallel problems.

References

Harris, Mark, Sengupta, Shubhabrata, & Owens, John D. Chapter 39. Parallel Prefix Sum (Scan) with CUDA. In GPU Gems 3, Part VI: GPU Computing. NVIDIA Corporation. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda.