Own Mellum LLM training stack and architecture; optimize speed, cost, stability.
Profile hotspots with Nsight and NVTX; optimize overlap and kernels.
Design and evaluate architecture: depth/width, attention variants, MoE routing.
Implement custom ops (Triton/CUDA); PyTorch extensions; upstream when possible.
Apply memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8, NCCL tuning.
Harden runs: elastic fault tolerance, robust checkpointing, reproducibility.
