Show HN: Soperator – Run Slurm in Kubernetes

github.com

5 points by rdjjke 4 hours ago

Hi HN,

With this operator, you can run Slurm in Kubernetes and enjoy the benefits of both systems.

Both Slurm and Kubernetes can serve as workload managers for distributed model training and high-performance computing (HPC) in general.

Each system has its strengths and weaknesses, and the trade-offs between them are significant. Slurm offers advanced and effective scheduling, granular hardware control, and accounting, but it is not a general-purpose platform. Kubernetes, on the other hand, can be used for purposes other than training (e.g. inference) and provides good auto-scaling and self-healing capabilities.

It's unfortunate that there has been no easy way to combine the benefits of both solutions. And since many big tech companies use Kubernetes as their default infrastructure layer without supporting a dedicated model training system, some ML engineers don't even have a choice.

That's why we decided to merge these systems, taking a "Kubernetes-first" approach. We implemented a Kubernetes operator: a software component that runs and manages Slurm clusters, each represented as a Kubernetes resource.
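
For readers unfamiliar with the operator pattern, here is a minimal sketch of the general idea in plain Python: compare a declared desired state with what is actually running, and compute the actions that close the gap. This is not Soperator's code or API. The names `SlurmClusterSpec`, `ObservedState`, `reconcile`, and `apply_actions`, as well as the pod naming scheme, are hypothetical and exist only for illustration. A real operator would talk to the Kubernetes API rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class SlurmClusterSpec:
    """Hypothetical desired state: how many Slurm worker pods should exist."""
    name: str
    workers: int


@dataclass
class ObservedState:
    """Hypothetical observed state: names of worker pods currently running."""
    worker_pods: list = field(default_factory=list)


def reconcile(spec, observed):
    """Return the (verb, pod) actions that move the observed state toward the spec."""
    desired = [f"{spec.name}-worker-{i}" for i in range(spec.workers)]
    running = set(observed.worker_pods)
    # Create pods that should exist but do not.
    actions = [("create", pod) for pod in desired if pod not in running]
    # Delete pods that exist but are no longer wanted.
    actions += [("delete", pod) for pod in sorted(running - set(desired))]
    return actions


def apply_actions(actions, observed):
    """Apply actions to the in-memory state (a stand-in for real API calls)."""
    for verb, pod in actions:
        if verb == "create":
            observed.worker_pods.append(pod)
        else:
            observed.worker_pods.remove(pod)


if __name__ == "__main__":
    spec = SlurmClusterSpec(name="demo", workers=3)
    observed = ObservedState(worker_pods=["demo-worker-0"])
    actions = reconcile(spec, observed)
    print(actions)
    apply_actions(actions, observed)
    # Once the state matches the spec, a second pass has nothing left to do.
    print(reconcile(spec, observed))
```

The point of the pattern is that the loop is idempotent: running it again after the state has converged produces no further actions, which is what makes self-healing possible when a pod disappears.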