
Trending

See what the GitHub community is most excited about today.

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 6,320 stars · 857 forks

1 star today

alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Cuda · 1,082 stars · 167 forks

0 stars today

NVIDIA / cuopt

GPU accelerated decision optimization

Cuda · 808 stars · 158 forks

1 star today

karpathy / llm.c

LLM training in simple, raw C/CUDA

Cuda · 29,496 stars · 3,510 forks

26 stars today

brucefan1983 / GPUMD

Graphics Processing Units Molecular Dynamics

Cuda · 748 stars · 179 forks

2 stars today

NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

Cuda · 17,355 stars · 2,058 forks

5 stars today

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda · 9,105 stars · 1,148 forks

2 stars today

mirage-project / mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

Cuda · 2,186 stars · 192 forks

2 stars today

baidu-research / warp-ctc

Fast parallel CTC.

Cuda · 4,074 stars · 1,033 forks

0 stars today

NVIDIA / nccl-tests

NCCL Tests

Cuda · 1,484 stars · 363 forks

2 stars today

thu-ml / SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.

Cuda · 3,289 stars · 393 forks

3 stars today

Dao-AILab / causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

Cuda · 823 stars · 171 forks

1 star today

NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda · 1,826 stars · 463 forks

0 stars today
