DeepSeek-V3/R1 inference performance simulator

Jupyter Notebook 192 31 Updated Mar 27, 2025

A Quirky Assortment of CuTe Kernels

Python 901 105 Updated Apr 6, 2026

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript 120,378 13,156 Updated Apr 3, 2026

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,495 995 Updated Apr 6, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 6,312 854 Updated Mar 22, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,101 1,141 Updated Mar 31, 2026

Fully open reproduction of DeepSeek-R1

Python 25,964 2,409 Updated Apr 2, 2026

Puzzles for learning Triton, play it with minimal environment configuration!

Python 660 92 Updated Mar 17, 2026

Development repository for the Triton language and compiler

MLIR 18,851 2,736 Updated Apr 6, 2026

A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.

Python 698 62 Updated Mar 14, 2026

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ 822 62 Updated Mar 6, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 1,282 99 Updated Aug 28, 2025

You like pytorch? You like micrograd? You love tinygrad! ❤️

Python 32,204 4,030 Updated Apr 6, 2026

Grok open release

Python 51,526 8,462 Updated Aug 30, 2024

Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs

Python 1,005 61 Updated Mar 3, 2026

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python 1,048 87 Updated Sep 4, 2024

A compiler for homomorphic encryption

C++ 704 127 Updated Apr 6, 2026

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 13,301 2,257 Updated Apr 6, 2026

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ 3,690 872 Updated Apr 6, 2026

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 239 22 Updated Sep 24, 2023

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 113 15 Updated Sep 10, 2024

Fast and memory-efficient exact attention

Python 23,167 2,587 Updated Apr 6, 2026

C++ 60 20 Updated Dec 18, 2024

100 Days of RTL

SystemVerilog 408 111 Updated Aug 15, 2024

CUDA on non-NVIDIA GPUs

Rust 14,070 899 Updated Apr 6, 2026

Making large AI models cheaper, faster and more accessible

Python 41,370 4,519 Updated Mar 30, 2026

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

C++ 31 7 Updated Jun 26, 2024

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Ene…

C++ 133 29 Updated Oct 21, 2025