dogansagbili

Highlights

  • Pro

Organizations

@ParCoreLab


Optimized primitives for collective multi-GPU communication

C++ 4,647 1,219 Updated Apr 24, 2026

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

C++ 213 33 Updated Apr 18, 2026

AI prompts for accelerating the research workflow.

153 22 Updated Mar 10, 2026

Cuda 3 1 Updated Feb 21, 2026

Triton-based Symmetric Memory operators and examples

Python 98 14 Updated Mar 28, 2026

Uniconn is a unified, portable high-level C++ communication library that supports both point-to-point and collective operations across GPU clusters. Uniconn enables seamless switching between backe…

Cuda 3 Updated Dec 17, 2025

Modern C++ Programming Course (C++03/11/14/17/20/23/26)

HTML 14,942 1,064 Updated Apr 19, 2026

Professionally written C++ function traits library (single header-only) for retrieving info about any function (arg types, arg count, return type, etc.)

C++ 49 6 Updated Sep 3, 2025

AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

Python 184 39 Updated Apr 26, 2026

Perplexity open source garden for inference technology

Rust 400 38 Updated Dec 25, 2025

A suite of microbenchmarks developed for systems with multi-GPU per node.

Cuda 9 3 Updated Jan 22, 2026

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 505 93 Updated Apr 26, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 5,777 526 Updated Apr 26, 2026

Modular RDMA Interface

C++ 117 35 Updated Apr 26, 2026

torchcomms: a modern PyTorch communications API

C++ 358 132 Updated Apr 26, 2026

COCCL: Compression and precision co-aware collective communication library

C++ 30 3 Updated Mar 16, 2025

MiniAMR Adaptive Mesh Refinement (AMR) Mini-App

C 39 26 Updated Nov 12, 2024

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 247 33 Updated Apr 6, 2026

A comprehensive hands-on project for learning GPU programming with CUDA and HIP, covering fundamental concepts through advanced optimization techniques.

C++ 35 3 Updated Nov 20, 2025

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 516 76 Updated Apr 14, 2026

VIP cheatsheet for Stanford's CME 295 Transformers and Large Language Models

4,357 615 Updated Jul 27, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 9,554 1,202 Updated Apr 24, 2026

Perplexity GPU Kernels

C++ 568 87 Updated Nov 7, 2025

[DEPRECATED] Moved to ROCm/rocm-systems repo

C++ 145 44 Updated Apr 24, 2026

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 883 149 Updated Sep 26, 2025

RAJA Performance Portability Layer (C++)

C++ 576 111 Updated Apr 24, 2026

Lightweight C++ command line option parser

C++ 4,752 641 Updated Apr 22, 2026

Collaborative Collection of C++ Best Practices. This online resource is part of Jason Turner's collection of C++ Best Practices resources. See README.md for more information.

8,726 907 Updated Aug 6, 2024

CMake for C++ Best Practices

CMake 1,737 202 Updated Apr 15, 2026

Unleash the true power of scheduling

Python 35 3 Updated Apr 23, 2026