Welcome to the blog! Here, we delve into advanced optimization techniques to maximize algorithm performance on high-performance computing systems. Explore topics ranging from SIMD vectorization to parallel algorithms tailored for cutting-edge architectures.
A deep dive into various optimization techniques for matrix multiplication, including tiling, loop unrolling, parallelization, and AVX-512 SIMD on modern CPUs.
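To make the ideas concrete, here is a minimal sketch of the tiling-plus-SIMD pattern (not the post's exact code): the loops are blocked for cache reuse, one element of A is broadcast, and the innermost loop does 16-wide fused multiply-adds over a row of B with AVX-512 intrinsics. The tile size `T`, the row-major layout, and the assumption that `N` is a multiple of the tile sizes are all illustrative choices.

```cpp
#include <immintrin.h>

// Cache-tiled SGEMM inner kernel: C += A * B, all row-major N x N.
// Assumes C is zero-initialized, N is a multiple of T, and T of 16.
void matmul_tiled_avx512(const float *A, const float *B, float *C, int N) {
    const int T = 64;  // cache tile edge length (illustrative choice)
    for (int ii = 0; ii < N; ii += T)
      for (int kk = 0; kk < N; kk += T)
        for (int jj = 0; jj < N; jj += T)
          // Within a tile: broadcast one element of A, then stream a row
          // of B and C through 16-wide fused multiply-adds.
          for (int i = ii; i < ii + T; ++i)
            for (int k = kk; k < kk + T; ++k) {
              __m512 a = _mm512_set1_ps(A[i * N + k]);
              for (int j = jj; j < jj + T; j += 16) {
                __m512 b = _mm512_loadu_ps(&B[k * N + j]);
                __m512 c = _mm512_loadu_ps(&C[i * N + j]);
                _mm512_storeu_ps(&C[i * N + j], _mm512_fmadd_ps(a, b, c));
              }
            }
}
```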
A deep dive into matrix multiplication on A100 GPUs, exploring naive, shared memory, and WMMA Tensor Core approaches.
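As a taste of the GPU post, here is the canonical shared-memory tiling kernel (a simplified sketch, not the post's tuned code): each thread block stages a `TILE x TILE` tile of A and B into shared memory, synchronizes, and accumulates partial dot products from fast on-chip storage. `TILE = 16` and `N` divisible by `TILE` are simplifying assumptions; the WMMA variant covered in the post would replace the inner loop with Tensor Core matrix-multiply-accumulate operations.

```cuda
#define TILE 16

// Shared-memory tiled matmul: each block computes one TILE x TILE tile of C.
// Simplifying assumption: N is a multiple of TILE.
__global__ void matmul_smem(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperatively stage one tile of A and one tile of B on chip.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)   // partial dot product from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // don't overwrite tiles still in use
    }
    C[row * N + col] = acc;
}
// Launch: matmul_smem<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
```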
We explore a roofline analysis of matrix multiplication. The roofline plot shows whether a program is memory-bound (limited by memory bandwidth) or compute-bound (limited by peak computational throughput).
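The model itself fits in two formulas. With total work $W$ (FLOPs) and total memory traffic $Q$ (bytes), the arithmetic intensity is $I = W/Q$, and attainable performance is capped by either the compute roof or the bandwidth roof. For example, an $N \times N$ fp32 matrix multiply performs $W = 2N^3$ FLOPs and, with ideal reuse, moves roughly $Q \approx 12N^2$ bytes (three matrices of $4N^2$ bytes each), giving $I \approx N/6$ FLOPs/byte, so it shifts from memory-bound toward compute-bound as $N$ grows.

```latex
% Roofline model: attainable throughput as a function of arithmetic intensity.
% W = total FLOPs, Q = bytes moved between DRAM and the chip.
\[
  I = \frac{W}{Q},
  \qquad
  P_{\text{attainable}} = \min\!\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr)
\]
```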
We explore the anti-diagonal (wavefront) parallelization technique for the Needleman-Wunsch algorithm: every cell on an anti-diagonal of the dynamic programming matrix depends only on the two previous diagonals, so all cells on a diagonal can be computed in parallel.
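A minimal CUDA sketch of this wavefront pattern (the linear scoring scheme and memory layout are illustrative assumptions, not the post's code): the host launches one kernel per anti-diagonal $d = i + j$, and each thread fills one cell of that diagonal.

```cuda
// One launch per anti-diagonal d = i + j: every cell on a diagonal depends
// only on the two previous diagonals, so each thread fills one cell.
// H is (n+1) x (m+1); row 0 and column 0 are pre-filled with gap penalties.
__global__ void nw_antidiagonal(int *H, const char *a, const char *b,
                                int n, int m, int d,
                                int match, int mismatch, int gap) {
    int i = 1 + blockIdx.x * blockDim.x + threadIdx.x;  // row index on this diagonal
    int j = d - i;                                      // column follows from i + j = d
    if (i > n || j < 1 || j > m) return;

    int s    = (a[i - 1] == b[j - 1]) ? match : mismatch;
    int diag = H[(i - 1) * (m + 1) + (j - 1)] + s;
    int up   = H[(i - 1) * (m + 1) + j] - gap;
    int left = H[i * (m + 1) + (j - 1)] - gap;
    H[i * (m + 1) + j] = max(diag, max(up, left));
}
```

The host side would loop `for (int d = 2; d <= n + m; ++d)` and launch enough threads to cover the longest diagonal.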
This blog explores the fundamentals of GPU programming using parallel prefix sums as a case study. It walks through several CUDA implementations, including naive, shared memory, and dynamic parallelism variants, and shows how efficient memory management and architecture-aware design pay off on NVIDIA A100 GPUs.
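For flavor, here is the shared-memory variant in its simplest form: an inclusive Hillis-Steele scan, sketched under the assumption that the input fits in a single thread block. The multi-block versions discussed in the post would add per-block totals and a second pass.

```cuda
// Inclusive Hillis-Steele scan over one block's data in shared memory.
// Single-block sketch (n <= blockDim.x).
__global__ void scan_hillis_steele(const float *in, float *out, int n) {
    extern __shared__ float tmp[];
    int tid = threadIdx.x;
    if (tid < n) tmp[tid] = in[tid];
    __syncthreads();

    // Each round adds the element `offset` positions to the left,
    // doubling the reach until every prefix is complete.
    for (int offset = 1; offset < n; offset <<= 1) {
        float val = (tid >= offset && tid < n) ? tmp[tid - offset] : 0.0f;
        __syncthreads();                  // finish all reads before any write
        if (tid >= offset && tid < n) tmp[tid] += val;
        __syncthreads();
    }
    if (tid < n) out[tid] = tmp[tid];
}
// Launch: scan_hillis_steele<<<1, n, n * sizeof(float)>>>(in, out, n);
```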
This blog introduces a heuristic-based parallelization strategy for the Needleman-Wunsch (NW) algorithm, a cornerstone of bioinformatics used for global sequence alignment. The NW algorithm is traditionally hard to parallelize because of the sequential dependencies in its dynamic programming recurrence. The proposed heuristic uses iterative dynamic programming with chunk-based computations, overlapping boundaries, and iterative updates, achieving both correctness and scalability.
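Purely as an illustration of the iterate-until-convergence idea (an assumption-laden stand-in, not the post's actual heuristic): split the rows of the DP matrix into chunks that could run in parallel from possibly stale boundary rows, then repeat the sweep until no cell changes.

```cpp
// Illustrative host-side sketch of a chunked, iterate-to-convergence NW
// scheme -- NOT the post's heuristic; chunk size and scoring are made up.
// Each sweep recomputes every row chunk from the current boundary rows;
// sweeps repeat until the score matrix stops changing.
#include <algorithm>
#include <string>
#include <vector>

static int cell(int diag, int up, int left, bool eq,
                int match, int mismatch, int gap) {
    return std::max({diag + (eq ? match : mismatch), up - gap, left - gap});
}

void nw_chunked(const std::string &a, const std::string &b,
                std::vector<std::vector<int>> &H,
                int match = 1, int mismatch = -1, int gap = 2,
                int chunk = 64) {
    int n = (int)a.size(), m = (int)b.size();
    H.assign(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) H[i][0] = -i * gap;  // exact boundary init
    for (int j = 0; j <= m; ++j) H[0][j] = -j * gap;

    bool changed = true;
    while (changed) {                          // iterate until boundaries settle
        changed = false;
        for (int r0 = 1; r0 <= n; r0 += chunk) {  // chunks: parallelizable region
            int r1 = std::min(r0 + chunk - 1, n);
            for (int i = r0; i <= r1; ++i)
                for (int j = 1; j <= m; ++j) {
                    int v = cell(H[i-1][j-1], H[i-1][j], H[i][j-1],
                                 a[i-1] == b[j-1], match, mismatch, gap);
                    if (v != H[i][j]) { H[i][j] = v; changed = true; }
                }
        }
    }
}
```

Run serially as written, the first sweep already produces exact scores; the convergence loop is what lets the chunks be computed concurrently from stale boundaries and still end at the correct fixed point.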