GPU compute and AI kernels,
without single-vendor lock-in.
navatala_gpu is an Apache-2.0 developer-preview runtime and kernel corpus for CUDA, HIP/ROCm, OpenCL, Vulkan compute, and Metal. It targets AI/ML, dataframe, graph, sparse linear algebra, and CFD-style workloads from one consistent contract model.
What ships in this release
Runtime
One C++20 API over CUDA, HIP/ROCm, Vulkan compute, OpenCL, and Metal. It covers device enumeration, memory allocation, execution queues, events, and graph-oriented execution where the backend exposes it.
Kernel corpus
Portable kernels are emitted for multiple backends from the same contract model. Coverage is explicit and not uniform; vendor-library dispatch is being added where tuned ROCm/CUDA libraries are the right execution path.
Python bindings
pip install navatala-gpu installs the linalg, sparse, dataframe, graph, ml, and cfd modules. DLPack-oriented interop targets array frameworks such as PyTorch, CuPy, and JAX, and broadens as backend support matures.
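As a minimal sketch of checking that the wheel is present before importing anything from it, the snippet below queries the installed distribution metadata using only the Python standard library. The distribution name navatala-gpu comes from the install command above; the helper name installed_version is hypothetical, and nothing here assumes a particular import name or module layout.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "navatala-gpu") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("navatala-gpu is not installed; try: pip install navatala-gpu")
else:
    print(f"navatala-gpu {version} is installed")
```

This only confirms that the package is installed; which backends and dtypes are usable on a given machine is still a question for the coverage matrix.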
What is in the corpus
Kernel counts by domain (CUDA backend; per-backend availability lives in the published manifest):
| Domain | Kernels |
|---|---|
| ML / DataFrame / graph / vector search | 1,398 |
| Iterative linear solvers (CG, BiCGSTAB, IDR, GMRES) | 409 |
| Volume-of-Fluid finite-volume CFD | 157 |
| Sparse linear algebra | 52 |
| Distributed communication | 20 |
| Algebraic multigrid (AMG) | 16 |
| Parallel primitives | 14 |
| Neural / spectral operators | 12 |
| BLAS / dense linear algebra | 2 |
| Total | 2,080 |
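The total can be cross-checked against the per-domain rows with a few lines of Python. The figures below are copied from the table; the variable names are illustrative only.

```python
# Kernel counts by domain, copied from the table above (CUDA backend).
kernel_counts = {
    "ML / DataFrame / graph / vector search": 1398,
    "Iterative linear solvers (CG, BiCGSTAB, IDR, GMRES)": 409,
    "Volume-of-Fluid finite-volume CFD": 157,
    "Sparse linear algebra": 52,
    "Distributed communication": 20,
    "Algebraic multigrid (AMG)": 16,
    "Parallel primitives": 14,
    "Neural / spectral operators": 12,
    "BLAS / dense linear algebra": 2,
}

total = sum(kernel_counts.values())

# Print each domain with its share of the corpus, largest first.
for domain, count in sorted(kernel_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{domain:<55} {count:>5}  {count / total:6.1%}")

print(f"{'Total':<55} {total:>5}")
```

The rows sum to the stated 2,080, and the first two domains together account for most of the corpus.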
For AI and data-science work
The ML library covers clustering (k-means, DBSCAN, HDBSCAN, GMM), nearest neighbours (KNN brute-force and sparse, vector search via CAGRA / HNSW / Vamana / IVF-Flat / IVF-PQ), classification & regression (SVM, decision trees, random forests, linear/ridge/lasso, ARIMA, FIL), dimensionality reduction (PCA, t-SNE, UMAP), and explainability (SHAP).
DNN building blocks include attention, normalization, RoPE, fused-router MoE primitives, and softmax/reduction kernels. Backend support varies by dtype and platform; the published coverage matrix is the source of truth.
Five backends, honest coverage
Per-backend coverage is not uniform. CUDA and HIP/ROCm currently have the broadest generated coverage. OpenCL and Vulkan are close behind, with some limitations around atomics, dynamic shared memory, and 64-bit types. Metal coverage is significantly smaller because Apple GPUs have effectively no double-precision support: about 96% of the kernels missing on Metal are F64-named or F64-implied, while the F32 paths are present.
See the live coverage matrix at docs/BACKEND_COVERAGE.md in the repository.
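To make the coverage arithmetic concrete, here is a small sketch of how per-backend coverage and the F64 share of a backend's gaps could be computed. The manifest structure, kernel names, and backend sets below are invented for illustration; they are not the format of the published manifest or of docs/BACKEND_COVERAGE.md. The check also only catches F64-named kernels (by substring), not F64-implied ones.

```python
# Hypothetical manifest: kernel name -> backends that have it.
# Names and availability are made up; the real manifest format may differ.
manifest = {
    "axpy_f32": {"cuda", "hip", "opencl", "vulkan", "metal"},
    "axpy_f64": {"cuda", "hip", "opencl", "vulkan"},
    "spmv_csr_f32": {"cuda", "hip", "opencl", "vulkan", "metal"},
    "spmv_csr_f64": {"cuda", "hip", "opencl", "vulkan"},
    "atomic_hist_u64": {"cuda", "hip"},
}

BACKENDS = ("cuda", "hip", "opencl", "vulkan", "metal")


def coverage(manifest, backend):
    """Fraction of kernels in the manifest that are available on `backend`."""
    return sum(backend in backends for backends in manifest.values()) / len(manifest)


def missing(manifest, backend):
    """Sorted names of kernels that are not available on `backend`."""
    return sorted(name for name, backends in manifest.items() if backend not in backends)


def f64_share_of_missing(manifest, backend):
    """Fraction of the missing kernels whose name contains 'f64'."""
    gaps = missing(manifest, backend)
    if not gaps:
        return 0.0
    return sum("f64" in name for name in gaps) / len(gaps)


for backend in BACKENDS:
    print(f"{backend:<7} coverage {coverage(manifest, backend):5.0%}  "
          f"F64 share of gaps {f64_share_of_missing(manifest, backend):5.0%}")
```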
How fixes flow
Kernel files and Python facade modules are regenerated as a unit. Bug reports and reproducers route through the maintainers; the fix is applied internally and the public tree is regenerated. Hand-authored layers — runtime, examples, tests, docs, and tooling — accept normal pull requests. The repository ships a Regen-Manifest-Trailer: hook that keeps regenerated changes traceable.
See CONTRIBUTING.md for the full contribution model.
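The hook name suggests a Git commit trailer, although the text above does not spell out its behavior. Under that assumption, a traceability check might look roughly like the sketch below, which parses "Key: value" lines from the last paragraph of a commit message. The function name, the sample message, and the trailer value are all hypothetical; CONTRIBUTING.md remains the reference for what the real hook does.

```python
TRAILER_KEY = "Regen-Manifest-Trailer"


def trailers(message: str) -> dict:
    """Parse 'Key: value' lines from the last paragraph of a commit message.

    Simplified on purpose: it does not implement every rule of
    `git interpret-trailers`.
    """
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # only a subject line, so there is no trailer block
    result = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        # A trailer key is a single token with no spaces.
        if sep and key.strip() and " " not in key.strip():
            result[key.strip()] = value.strip()
    return result


# Hypothetical commit message; the trailer value is a placeholder.
example_message = (
    "Regenerate sparse kernels\n"
    "\n"
    "Public tree regenerated after an internal fix.\n"
    "\n"
    "Regen-Manifest-Trailer: example-manifest-id\n"
)

found = trailers(example_message)
print(TRAILER_KEY in found, found.get(TRAILER_KEY))
```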
Status — developer preview
navatala_gpu is a developer preview / alpha. The runtime and kernel corpus are already useful for experimentation, portability work, and internal CFD/analytics workloads, but conformance coverage, wrapper APIs, ROCm benchmarking, and selective backend tuning are still expanding.
We do not position this as a mature replacement for cuML, RAFT, AMGX, cuDNN, rocBLAS, rocSPARSE, or OpenFOAM solvers. It is a portable kernel/runtime corpus with generated fallback paths and, over time, explicit vendor-library dispatch where tuned vendor libraries are the correct backend.
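The "generated fallback plus vendor dispatch" idea can be illustrated with a toy registry: look up a tuned vendor implementation for an (operation, backend) pair and fall back to the portable generated path when none is registered. This is a conceptual sketch in plain Python, not the navatala_gpu dispatch mechanism; every name in it (PORTABLE_KERNELS, VENDOR_KERNELS, register_vendor, dispatch) is hypothetical, and the "vendor" function is a stand-in rather than a real library call.

```python
from typing import Callable, Dict, Sequence, Tuple

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def portable_dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Stand-in for a generated, portable kernel."""
    return sum(a * b for a, b in zip(x, y))


# Every operation has a portable fallback; vendor entries are optional.
PORTABLE_KERNELS: Dict[str, Kernel] = {"dot": portable_dot}
VENDOR_KERNELS: Dict[Tuple[str, str], Kernel] = {}


def register_vendor(op: str, backend: str, fn: Kernel) -> None:
    """Register a tuned implementation for one (operation, backend) pair."""
    VENDOR_KERNELS[(op, backend)] = fn


def dispatch(op: str, backend: str) -> Tuple[str, Kernel]:
    """Prefer a vendor implementation; otherwise use the generated fallback."""
    fn = VENDOR_KERNELS.get((op, backend))
    if fn is not None:
        return "vendor", fn
    return "generated", PORTABLE_KERNELS[op]


# Before registration, every backend uses the generated path.
path_before, _ = dispatch("dot", "cuda")

# Pretend a tuned vendor routine exists for CUDA (here it reuses the same math).
register_vendor("dot", "cuda", lambda x, y: portable_dot(x, y))

path_cuda, dot_cuda = dispatch("dot", "cuda")
path_metal, dot_metal = dispatch("dot", "metal")
print(path_before, path_cuda, path_metal)
print(dot_cuda([1, 2, 3], [4, 5, 6]), dot_metal([1, 2, 3], [4, 5, 6]))
```

The point of the pattern is that callers see the same operation regardless of which path runs, so vendor dispatch can be added per backend without changing the calling code.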
Start with the repo, the docs, or the wheel.
Building something specific that needs faster GPU portability? Get in touch.