Blog Index

Welcome to the blog! The posts here cover optimization techniques for getting the most performance out of algorithms on high-performance computing systems, with topics ranging from SIMD vectorization on CPUs to parallel algorithms on GPUs.

1) Optimization Techniques for Matrix Multiplication on CPUs
A deep dive into various optimization techniques for matrix multiplication, including tiling, loop unrolling, parallelization, and AVX-512 SIMD on modern CPUs.
2) Optimization Techniques for Matrix Multiplication on GPUs
A deep dive into matrix multiplication on A100 GPUs, covering naive, shared-memory, and WMMA Tensor Core approaches.
3) Roofline Analysis of Matrix Multiplication
A roofline analysis of matrix multiplication. The roofline plot shows whether a program is memory-bound (limited by the data transfer rate) or compute-bound (limited by the available computational throughput).
4) Parallel Needleman-Wunsch Algorithm
We explore the anti-diagonal parallelization technique for the Needleman-Wunsch algorithm.
5) Understanding GPU Programming on NVIDIA A100 GPUs with Parallel Prefix Sums
This post covers the fundamentals of GPU programming, using parallel prefix sums as a case study. It walks through several CUDA implementations, including naive, shared-memory, and dynamic-parallelism versions, and highlights the performance benefits of efficient memory management and of making good use of the GPU architecture on NVIDIA A100 GPUs.
6) Heuristic-Based Parallel Needleman-Wunsch Algorithm
This post introduces a heuristic-based parallelization strategy for the Needleman-Wunsch (NW) algorithm, a cornerstone of bioinformatics used for global sequence alignment. The NW algorithm is hard to parallelize because of the sequential dependencies in its dynamic programming computations. The proposed heuristic uses iterative dynamic programming with chunk-based computations, overlapping boundaries, and iterative updates, achieving both correctness and scalability.
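
As a quick taste of the sequential dependencies mentioned in posts 4 and 6, here is a minimal sketch of the anti-diagonal traversal for Needleman-Wunsch scoring. It is an illustration only, not the code from either post: the function name `nw_score_antidiagonal` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example, and the inner loop runs serially here even though its iterations are independent.

```python
def nw_score_antidiagonal(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, filled one anti-diagonal at a time.

    Hypothetical helper for illustration; scoring values are assumptions.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]

    # Boundary cells: aligning a prefix against the empty string costs gaps.
    for i in range(1, m + 1):
        H[i][0] = i * gap
    for j in range(1, n + 1):
        H[0][j] = j * gap

    # Cell (i, j) lies on anti-diagonal d = i + j. It reads (i-1, j) and
    # (i, j-1) from diagonal d-1 and (i-1, j-1) from diagonal d-2, so all
    # cells that share the same d are independent of each other.
    for d in range(2, m + n + 1):
        # This inner loop is the part a parallel version would distribute
        # across threads; here it is executed serially.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
    return H[m][n]


print(nw_score_antidiagonal("GATTACA", "GATTACA"))  # 7: seven matches, no gaps
```

The outer loop over `d` has to stay sequential, which is why the available parallelism grows and then shrinks as the wavefront crosses the table.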