Highlights

  • Pro

Organizations

@ParCoreLab

Optimized primitives for collective multi-GPU communication

C++ 4,665 1,231 Updated May 1, 2026

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

C++ 213 33 Updated Apr 18, 2026

AI prompts for accelerating the research workflow.

153 24 Updated Mar 10, 2026

Cuda 3 1 Updated Feb 21, 2026

Triton-based Symmetric Memory operators and examples

Python 98 14 Updated Mar 28, 2026

Uniconn is a unified, portable high-level C++ communication library that supports both point-to-point and collective operations across GPU clusters. Uniconn enables seamless switching between backe…

Cuda 3 Updated Dec 17, 2025

Modern C++ Programming Course (C++03/11/14/17/20/23/26)

HTML 14,950 1,063 Updated Apr 19, 2026

Professionally written C++ function traits library (single header-only) for retrieving info about any function (arg types, arg count, return type, etc.)

C++ 49 6 Updated Sep 3, 2025

AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

Python 188 39 Updated Apr 30, 2026

Perplexity open source garden for inference technology

Rust 404 39 Updated Dec 25, 2025

A suite of microbenchmarks developed for systems with multi-GPU per node.

Cuda 9 3 Updated Jan 22, 2026

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 507 93 Updated Apr 30, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

Python 5,957 538 Updated Apr 30, 2026

Modular RDMA Interface

C++ 119 37 Updated May 1, 2026

torchcomms: a modern PyTorch communications API

C++ 358 137 Updated Apr 30, 2026

COCCL: Compression and precision co-aware collective communication library

C++ 30 3 Updated Mar 16, 2025

MiniAMR Adaptive Mesh Refinement (AMR) Mini-App

C 39 26 Updated Nov 12, 2024

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 249 34 Updated Apr 27, 2026

A comprehensive hands-on project for learning GPU programming with CUDA and HIP, covering fundamental concepts through advanced optimization techniques.

C++ 35 3 Updated Nov 20, 2025

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 519 76 Updated Apr 28, 2026

VIP cheatsheet for Stanford's CME 295 Transformers and Large Language Models

4,371 616 Updated Jul 27, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 9,592 1,217 Updated Apr 29, 2026

Perplexity GPU Kernels

C++ 570 86 Updated Nov 7, 2025

[DEPRECATED] Moved to ROCm/rocm-systems repo

C++ 145 44 Updated Apr 24, 2026

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 883 149 Updated Sep 26, 2025

RAJA Performance Portability Layer (C++)

C++ 577 111 Updated Apr 30, 2026

Lightweight C++ command line option parser

C++ 4,756 641 Updated Apr 29, 2026

Collaborative Collection of C++ Best Practices. This online resource is part of Jason Turner's collection of C++ Best Practices resources. See README.md for more information.

8,731 907 Updated Aug 6, 2024

CMake for C++ Best Practices

CMake 1,742 204 Updated Apr 15, 2026

Unleash the true power of scheduling

Python 35 3 Updated Apr 23, 2026