Yujia Zhai yzhaiustc

214 followers · 15 following

@NVIDIA
Santa Clara, California
https://yzhaiustc.github.io/

Stars

zartbot / shallowsim

DeepSeek-V3/R1 inference performance simulator

Jupyter Notebook · 192 stars · 31 forks · Updated Mar 27, 2025

Dao-AILab / quack

A Quirky Assortment of CuTe Kernels

Python · 901 stars · 105 forks · Updated Apr 6, 2026

excalidraw / excalidraw

Virtual whiteboard for sketching hand-drawn like diagrams

TypeScript · 120,378 stars · 13,156 forks · Updated Apr 3, 2026

ByteDance-Seed / Seed-Thinking-v1.5

813 stars · 18 forks · Updated Jun 9, 2025

ai-dynamo / dynamo

A Datacenter Scale Distributed Inference Serving Framework

Rust · 6,495 stars · 995 forks · Updated Apr 6, 2026

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 6,312 stars · 854 forks · Updated Mar 22, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda · 9,101 stars · 1,141 forks · Updated Mar 31, 2026

huggingface / open-r1

Fully open reproduction of DeepSeek-R1

Python · 25,964 stars · 2,409 forks · Updated Apr 2, 2026

deepseek-ai / DeepSeek-R1

91,960 stars · 11,735 forks · Updated Jun 27, 2025

SiriusNEO / Triton-Puzzles-Lite

Puzzles for learning Triton, play it with minimal environment configuration!

Python · 660 stars · 92 forks · Updated Mar 17, 2026

triton-lang / triton

Development repository for the Triton language and compiler

MLIR · 18,851 stars · 2,736 forks · Updated Apr 6, 2026

ChenLiu-1996 / CitationMap

A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.

Python · 698 stars · 62 forks · Updated Mar 14, 2026

mit-han-lab / omniserve

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ · 822 stars · 62 forks · Updated Mar 6, 2025

bytedance / flux

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 1,282 stars · 99 forks · Updated Aug 28, 2025

tinygrad / tinygrad

You like pytorch? You like micrograd? You love tinygrad! ❤️

Python · 32,204 stars · 4,030 forks · Updated Apr 6, 2026

xai-org / grok-1

Grok open release

Python · 51,526 stars · 8,462 forks · Updated Aug 30, 2024

volcengine / veScale

Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs

Python · 1,005 stars · 61 forks · Updated Mar 3, 2026

IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python · 1,048 stars · 87 forks · Updated Sep 4, 2024

google / heir

A compiler for homomorphic encryption

C++ · 704 stars · 127 forks · Updated Apr 6, 2026

NVIDIA / TensorRT-LLM

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python · 13,301 stars · 2,257 forks · Updated Apr 6, 2026

iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ · 3,690 stars · 872 forks · Updated Apr 6, 2026

AlibabaResearch / flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda · 239 stars · 22 forks · Updated Sep 24, 2023

tlc-pack / libflash_attn

Standalone Flash Attention v2 kernel without libtorch dependency

C++ · 113 stars · 15 forks · Updated Sep 10, 2024

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python · 23,167 stars · 2,587 forks · Updated Apr 6, 2026

intel / xetla

C++ · 60 stars · 20 forks · Updated Dec 18, 2024

raulbehl / 100DaysOfRTL

100 Days of RTL

SystemVerilog · 408 stars · 111 forks · Updated Aug 15, 2024

vosen / ZLUDA

CUDA on non-NVIDIA GPUs

Rust · 14,070 stars · 899 forks · Updated Apr 6, 2026

hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible

Python · 41,370 stars · 4,519 forks · Updated Mar 30, 2026

eth-cscs / spla

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

C++ · 31 stars · 7 forks · Updated Jun 26, 2024

icl-utk-edu / slate

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Ene…

C++ · 133 stars · 29 forks · Updated Oct 21, 2025
