Orchestrate distributed training across 1000+ GPUs using PyTorch/DeepSpeed/Megatron-LM.
Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory (OOM) failures.
Automate checkpointing and failure recovery during month-long training runs.

🎯 Requirements

Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
Experience managing SLURM or Kubernetes-based GPU clusters.
Strong systems engineering background (C++, CUDA, Python).
