A collection of articles on algorithms and computational techniques in Artificial Intelligence (AI), Parallel Computing, and Computational Genomics.
This blog introduces training large language models (LLMs) from scratch, covering the fundamental steps and practices.
This blog introduces tensor parallelism for autoregressive large language model (LLM) inference. It covers the basics of distributing tensor computations across multiple GPUs to accelerate inference.
This blog explores advanced parallelism techniques in large language models, including tensor parallelism, expert parallelism, and mixture-of-experts (MoE) strategies.
This blog explains self-attention in transformers from a systems perspective, focusing on runtime and memory complexity.
This blog explains KV caches in transformers from a systems perspective, focusing on runtime and memory complexity, prefix caching, and quantization techniques.
A deep dive into various optimization techniques for matrix multiplication, including tiling, loop unrolling, parallelization, and AVX-512 SIMD on modern CPUs.
A deep dive into matrix multiplication on A100 GPUs, exploring naive, shared memory, and WMMA tensor core approaches.
We explore roofline analysis of matrix multiplication. A roofline plot shows whether a program is memory-bound (limited by data transfer rate) or compute-bound (limited by available computational throughput).
We explore the anti-diagonal parallelization technique for the Needleman-Wunsch algorithm.
This blog explores the fundamentals of GPU programming using parallel prefix sums as a case study. It demonstrates several CUDA implementations, including naive, shared memory, and dynamic parallelism versions, and highlights how efficient memory management and awareness of the GPU architecture improve performance on NVIDIA A100 GPUs.
This blog introduces a heuristic-based parallelization strategy for the Needleman-Wunsch (NW) algorithm, a cornerstone of bioinformatics used for global sequence alignment. The NW algorithm is hard to parallelize because of the sequential dependencies in its dynamic programming computations. The proposed heuristic runs dynamic programming on chunks with overlapping boundaries and updates them iteratively, achieving both correctness and scalability.