GPU compute and AI kernels,
without single-vendor lock-in.

navatala_gpu is an Apache-2.0 developer-preview runtime and kernel corpus for CUDA, HIP/ROCm, OpenCL, Vulkan compute, and Metal. It targets AI/ML, dataframe, graph, sparse linear algebra, and CFD-style workloads from one consistent contract model.

pip install navatala-gpu

View on GitHub ↗ Read the Docs ↗ PyPI ↗

Apache-2.0 · Developer preview (alpha) · CUDA · HIP · Vulkan · OpenCL · Metal

What ships in this release

Runtime

One C++20 API over CUDA, HIP/ROCm, Vulkan compute, OpenCL, and Metal. It covers device enumeration, memory allocation, execution queues, events, and, where the backend exposes it, graph-oriented execution.

Kernel corpus

Portable kernels are emitted for multiple backends from the same contract model. Coverage is explicit and not uniform; vendor-library dispatch is being added where tuned ROCm/CUDA libraries are the right execution path.

Python bindings

pip install navatala-gpu installs the linalg, sparse, dataframe, graph, ml, and cfd modules. DLPack-oriented interop is designed for array frameworks such as PyTorch, CuPy, and JAX, and will broaden as backend support matures.
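Since this is a preview release, it can be worth checking which modules an installed wheel actually exposes before relying on them. The sketch below uses only the Python standard library. The top-level import name navatala_gpu is an assumption (this page states only the pip name, navatala-gpu); the submodule names are the ones listed above.

```python
import importlib.util

# Assumption: the wheel's import name matches the project name.
PACKAGE = "navatala_gpu"
# Submodule names as listed on this page.
MODULES = ["linalg", "sparse", "dataframe", "graph", "ml", "cfd"]


def probe_modules(package, modules):
    """Return {module_name: bool} telling whether package.module can be located."""
    found = {}
    for name in modules:
        try:
            spec = importlib.util.find_spec(f"{package}.{name}")
        except ImportError:
            # Raised when the parent package is not installed or fails to import.
            spec = None
        found[name] = spec is not None
    return found


if __name__ == "__main__":
    for name, ok in probe_modules(PACKAGE, MODULES).items():
        print(f"{PACKAGE}.{name}: {'available' if ok else 'missing'}")
```

The probe imports the parent package but not the submodules themselves, so it stays cheap even where a backend is not usable on the current machine.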

What is in the corpus

Kernel counts by domain (CUDA backend; per-backend availability lives in the published manifest):

Domain	Kernels
ML / DataFrame / graph / vector search	1,398
Iterative linear solvers (CG, BiCGSTAB, IDR, GMRES)	409
Volume-of-Fluid finite-volume CFD	157
Sparse linear algebra	52
Distributed communication	20
Algebraic multigrid (AMG)	16
Parallel primitives	14
Neural / spectral operators	12
BLAS / dense linear algebra	2
Total	2,080
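As a quick sanity check on the table, the counts can be summed and turned into per-domain shares. This is a small sketch in plain Python; the numbers are copied from the table above, and the helper name share is illustrative only.

```python
# Kernel counts copied from the table above (CUDA backend).
KERNELS_BY_DOMAIN = {
    "ML / DataFrame / graph / vector search": 1398,
    "Iterative linear solvers (CG, BiCGSTAB, IDR, GMRES)": 409,
    "Volume-of-Fluid finite-volume CFD": 157,
    "Sparse linear algebra": 52,
    "Distributed communication": 20,
    "Algebraic multigrid (AMG)": 16,
    "Parallel primitives": 14,
    "Neural / spectral operators": 12,
    "BLAS / dense linear algebra": 2,
}

total = sum(KERNELS_BY_DOMAIN.values())


def share(domain):
    """Percentage of the corpus in one domain, rounded to one decimal place."""
    return round(100 * KERNELS_BY_DOMAIN[domain] / total, 1)


if __name__ == "__main__":
    print("total:", total)
    for domain in KERNELS_BY_DOMAIN:
        print(f"{share(domain):5.1f}%  {domain}")
```

The first two rows together account for 1,807 of the 2,080 kernels, roughly 87% of the corpus, so the headline count is dominated by the ML/data and iterative-solver domains.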

For AI and data-science work

The ML library covers clustering (k-means, DBSCAN, HDBSCAN, GMM), nearest neighbours (KNN brute-force and sparse, vector search via CAGRA / HNSW / Vamana / IVF-Flat / IVF-PQ), classification & regression (SVM, decision trees, random forests, linear/ridge/lasso, ARIMA, FIL), dimensionality reduction (PCA, t-SNE, UMAP), and explainability (SHAP).

DNN building blocks include attention, normalization, RoPE, fused-router MoE primitives, and softmax/reduction kernels. Backend support varies by dtype and platform; the published coverage matrix is the source of truth.

Five backends, honest coverage

Per-backend coverage is not uniform. CUDA and HIP/ROCm currently have the broadest generated coverage. OpenCL and Vulkan are close behind, with some limitations around atomics, dynamic shared memory, and 64-bit types. Metal coverage is significantly smaller because Apple GPUs have effectively no double-precision support: about 96% of the kernels missing on Metal are F64-named or F64-implied, while the F32 paths are present.

See the live coverage matrix at docs/BACKEND_COVERAGE.md in the repository.
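The kind of gap analysis behind the Metal figure can be sketched as a set difference followed by a name check. Everything here is hypothetical: the availability mapping, the kernel names, and the naming convention are invented for illustration and do not reflect the real manifest format. The sketch also only detects F64-named kernels; F64-implied ones would need metadata it does not have.

```python
def missing_on(backend, reference, availability):
    """Kernels present on `reference` but absent on `backend`, sorted by name."""
    return sorted(availability[reference] - availability[backend])


def is_f64_named(kernel_name):
    """Naive check on the kernel name; assumes a hypothetical naming convention."""
    lowered = kernel_name.lower()
    return "f64" in lowered or "double" in lowered


def f64_fraction(kernels):
    """Fraction of the given kernels whose names mark them as double precision."""
    if not kernels:
        return 0.0
    return sum(is_f64_named(k) for k in kernels) / len(kernels)


# Hypothetical per-backend availability, not taken from the real manifest.
availability = {
    "cuda": {"axpy_f32", "axpy_f64", "spmv_csr_f32", "spmv_csr_f64", "scan_u32"},
    "metal": {"axpy_f32", "spmv_csr_f32"},
}

missing = missing_on("metal", "cuda", availability)

if __name__ == "__main__":
    print("missing on metal:", missing)
    print(f"F64-named share: {f64_fraction(missing):.0%}")
```

In this toy data two of the three missing kernels are F64-named; the third (scan_u32) stands in for a gap caused by something other than double precision.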

How fixes flow

Kernel files and Python facade modules are regenerated as a unit. Bug reports and reproducers route through the maintainers; the fix is applied internally and the public tree is regenerated. Hand-authored layers — runtime, examples, tests, docs, and tooling — accept normal pull requests. The repository ships a Regen-Manifest-Trailer: hook that keeps regenerated changes traceable.

See CONTRIBUTING.md for the full contribution model.

Status — developer preview

navatala_gpu is a developer preview / alpha. The runtime and kernel corpus are already useful for experimentation, portability work, and internal CFD/analytics workloads, but conformance coverage, wrapper APIs, ROCm benchmarking, and selective backend tuning are still expanding.

We do not position this as a mature replacement for cuML, RAFT, AMGX, cuDNN, rocBLAS, rocSPARSE, or OpenFOAM solvers. It is a portable kernel/runtime corpus with generated fallback paths and, over time, explicit vendor-library dispatch where tuned vendor libraries are the correct backend.
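The "generated fallback plus vendor-library dispatch" idea can be illustrated with a small lookup: prefer a vendor path when one is registered for the operation and backend, otherwise use the portable kernel. This is a minimal sketch of the general pattern, not navatala_gpu's dispatcher; every name in it (VENDOR_IMPLS, PORTABLE_IMPLS, dispatch, the dot stand-ins) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Impl = Callable[[List[float], List[float]], float]

# Hypothetical registries: vendor paths are keyed by (operation, backend),
# portable kernels by operation only, since they exist for every backend.
VENDOR_IMPLS: Dict[Tuple[str, str], Impl] = {}
PORTABLE_IMPLS: Dict[str, Impl] = {}


def portable_dot(x, y):
    """Stand-in for a generated portable kernel."""
    return sum(a * b for a, b in zip(x, y))


def vendor_dot(x, y):
    """Stand-in for a call into a tuned vendor library."""
    return sum(a * b for a, b in zip(x, y))


PORTABLE_IMPLS["dot"] = portable_dot
VENDOR_IMPLS[("dot", "cuda")] = vendor_dot


def dispatch(op, backend):
    """Return (implementation, path), preferring a vendor path when registered."""
    impl = VENDOR_IMPLS.get((op, backend))
    if impl is not None:
        return impl, "vendor"
    return PORTABLE_IMPLS[op], "portable"


if __name__ == "__main__":
    for backend in ("cuda", "metal"):
        impl, path = dispatch("dot", backend)
        print(backend, path, impl([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Keeping the portable path as the default means an operation stays runnable on every backend, while vendor entries can be added one at a time where a tuned library is the better choice.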

Start with the repo, the docs, or the wheel.

View on GitHub ↗ Read the Docs ↗

Building something specific that needs faster GPU portability? Get in touch.
