CUDA implementation of openGJK and the EPA algorithm for high-performance collision detection on NVIDIA GPUs.
Contributors: Marcus Hedlund, Vismay Churiwala, Cindy Wei
gjk.gpu.demo.mp4
Real time 1000 Polytopes Collision Simulation, check out the 20,000 polytopes demo here
OpenGJK-GPU provides efficient GPU-accelerated implementations of the GJK (Gilbert-Johnson-Keerthi) and EPA (Expanding Polytope Algorithm) algorithms for collision detection between convex polytopes.
GJK is a fast iterative algorithm for computing the minimum distance between two convex polytopes in 3D space. It works by iteratively building a simplex within the Minkowski difference of the two polytopes, converging toward the closest point to the origin.
EPA is a complementary algorithm that computes penetration depth and witness points when two polytopes are overlapping. While GJK can detect collisions, EPA determines how deeply objects penetrate each other and where the contact points are—critical for collision response in physics engines.
Our GPU implementations leverage two levels of parallelism: processing multiple collision pairs simultaneously and parallelizing computation within each collision across a warp. This approach achieves speedups of up to 37x over CPU implementations when handling many polytope pairs.
The API provides two main functions for collision detection:
void GJK::GPU::computeDistances(const int n,
const gkPolytope* bd1,
const gkPolytope* bd2,
gkSimplex* simplices,
gkFloat* distances);Computes minimum distance between n pairs of polytopes. Returns distances in the distances array (0.0 indicates collision).
void GJK::GPU::computeCollisionInformation(const int n,
const gkPolytope* bd1,
const gkPolytope* bd2,
gkSimplex* simplices,
gkFloat* distances,
gkFloat* witness1,
gkFloat* witness2,
gkFloat* contact_normals = nullptr);Computes penetration depth, witness points, and contact normals for colliding polytopes. Takes pre-computed simplices and distances from GJK as input.
Data Format:
- Polytopes use flattened coordinate arrays:
[x0, y0, z0, x1, y1, z1, ...] - All memory allocation and GPU transfers are handled internally
Required:
- Git
- C/C++ compiler (GCC, Clang, or MSVC)
- CMake (version 3.18 or higher)
- CUDA Toolkit (version 11.0 or higher)
- NVIDIA GPU with CUDA support
Clone the repository:
git clone https://github.com/vismaychuriwala/OpenGJK-GPU.git
cd OpenGJK-GPUThen use these commands to build and run our examples:
On Windows:
cmake -E make_directory build
cmake -E chdir build cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build build --config Release
cd build\bin\ReleaseOn Linux/Mac:
cmake -E make_directory build
cmake -E chdir build cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build build
cd build/binThe repository includes two usage examples in examples/usage/:
Demonstrates basic GJK distance computation:
Windows:
GJKUsage.exeLinux/Mac:
./GJKUsageExpected output:
Distance between bodies 3.653650
Witnesses: (1.025173, 1.490318, 0.255463) and (-1.025173, -1.490318, -0.255463)
Demonstrates sequential GJK and EPA usage for collision information:
Windows:
EPAUsage.exeLinux/Mac:
./EPAUsageExpected output:
Penetration depth: 1.500000
Witness point on cube 1: (1.000000, 0.500000, 0.707107)
Witness point on cube 2: (-0.500000, 0.500000, 0.707107)
Contact normal (from cube 1 to cube 2): (1.000000, -0.000000, 0.000000)
To switch between 32-bit (float) and 64-bit (double) precision, edit GJK/common.h:
- 32-bit:
#define USE_32BITS - 64-bit:
//#define USE_32BITS
Our GPU kernels leverage CUDA warp-level parallelism to achieve significant performance improvements. Each GJK collision pair is processed by a dedicated half warp (16 threads) while EPA uses a full warp (32 threads), with threads collaborating to parallelize computationally expensive operations:
-
GJK Algorithm: Uses 16 threads per collision to parallelize expensive support function calls and sub-distance algorithms. Threads within a warp use
__shfl_sync()operations to share data and perform parallel reductions, eliminating the need for shared memory synchronization. -
EPA Algorithm: Uses all 32 threads of a warp to parallelize face normal computation and closest-face searches. Face normals and distances are computed in parallel across threads, with warp shuffles used for efficient reduction to find the global minimum.
This two-level parallelism (across collisions and within each collision) maximizes GPU utilization and minimizes memory access overhead.
| GPU GJK Implementation Block Diagram - showing warp-parallel execution structure |
The API design separates GJK and EPA into independent functions, providing flexibility for different use cases:
-
computeDistances(): Performs GJK distance computation only. Fast and efficient for collision detection when penetration depth is not needed. -
computeCollisionInformation(): Performs EPA using pre-computed GJK results. Can be called conditionally only when collisions are detected, avoiding unnecessary computation for separated objects.
This separation allows users to optimize their collision detection pipeline by running EPA only when needed, reducing computational overhead for non-colliding objects.
| GJK/EPA Interface Diagram - showing the sequential workflow |
All GPU memory operations are handled internally by the wrapper functions, providing a clean and simple API:
- Automatic allocation and deallocation of device memory for polytopes, simplices, and results
- Efficient host-to-device and device-to-host memory transfers
- Proper cleanup on function exit, preventing memory leaks
- No manual CUDA memory management required from the user
This abstraction allows users to focus on their application logic without worrying about low-level GPU memory management details.
| GJK Runtime GPU VS CPU Varying Number of Vertices | GJK Runtime GPU VS CPU Varying Number of Polytopes |
Performance benchmarks demonstrate significant GPU acceleration across varying problem sizes. When testing 1000 polytope pairs with increasing vertex complexity, the GPU achieves speedups ranging from 7x at 50 vertices to a peak of 37x at 1000 vertices per polytope, before decreasing to 16x at 5000 vertices. This indicates optimal GPU efficiency at moderate vertex counts where the workload is substantial enough to leverage parallelism without becoming memory-bound. When varying the number of polytope pairs while keeping vertex count fixed at 500, the GPU advantage scales consistently with problem size—starting close to 2x speedup at 50 polytopes and reaching 27x speedup at 50,000 polytopes. This scaling behavior demonstrates the GPU's strength in handling massively parallel workloads where each polytope pair collision check can be processed independently. Overall, GPU-accelerated GJK is particularly beneficial for large-scale collision detection scenarios typical in physics simulations and game engines, where many polytope pairs must be processed simultaneously.
gjk.gpu.20000.mp4
20,000 Polytopes Collision Simulation at 60 FPS
The repository includes a real-time physics simulation visualizer demonstrating collision detection with thousands of polytopes at 60 FPS. The simulation employs spatial grid subdivision for broad-phase culling, dynamically generating collision pairs each frame to minimize unnecessary GJK computations. Two visualization options are available: OpenGL (recommended, faster) and Raylib (legacy).
Compiler:
- Visual Studio 2022+ with C++ build tools
- CUDA Toolkit 13.0+ (default:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0) - Must use x64 Native Tools Command Prompt
Libraries:
- GLFW 3.4 - Download from https://www.glfw.org/download.html, extract to
C:\glfw-3.4.bin.WIN64 - GLM - Download from https://github.com/g-truc/glm/releases, extract to
C:\glm - GLAD - Already included in
glad/folder (generated from https://glad.dav1d.de/)
- Install Visual Studio 2022+ with C++ build tools
- Install CUDA Toolkit 13.0+ (ensure it's in the default location or update paths in build scripts)
- Download and extract GLFW 3.4:
- Download from https://www.glfw.org/download.html
- Extract to
C:\glfw-3.4.bin.WIN64
- Download and extract GLM:
- Download from https://github.com/g-truc/glm/releases
- Extract to
C:\glm
- Open x64 Native Tools Command Prompt (from Visual Studio)
cd visualization
build_opengl.bat
gjk_visualizer_opengl.exe- WASD - Move camera
- Q/E - Up/Down
- Mouse - Rotate view (hold left button)
- Scroll - Zoom
- F - Toggle wireframe
- R - Reset camera
- ESC - Exit
Configuration: Edit visualization/sim_config.h to adjust simulation parameters (object count, grid size, physics constants).
For the legacy Raylib version, see visualization/README.md for setup instructions.