Hyphen Connect

540 jobs posted

Save Job

View company profile →

Please mention that you found this job on tryremotely.com. Thanks & good luck!

Tired of Manually Applying to Jobs?

Let JobCopilot do it for you!

Set your preferences and let your AI copilot handle the job search while you sleep.

Applies for jobs that actually match your skills
Tailors your resume and cover letter automatically
Works 24/7—so you don't have to

Activate JobCopilot

Hyphen Connect

LLM Pre-training Distributed Engineer (AI Infrastructure)

On site

Engineering

Added

24 minutes ago

Location

🇺🇸 Boston

Type

Full time

Salary

Salary not provided

Apply Now

Save Job

Related skills

python kubernetes pytorch deepspeed cuda

📋 Description

Orchestrate distributed training across 1,000+ GPUs using PyTorch/DeepSpeed/Megatron-LM.
Tune InfiniBand/RDMA networking and memory to prevent out-of-memory errors.
Automate checkpointing and failure recovery during month-long training runs.

🎯 Requirements

3D parallelism (Data, Tensor, Pipeline).
Experience managing SLURM or Kubernetes-based GPU clusters.
Strong systems engineering background (C++, CUDA, Python).

Apply on employer's website

This employer gathers applications via their own applicant tracking system.

You will be redirected to an external application form.

Share job

Meet JobCopilot: Your Personal AI Job Hunter

Automatically Apply to Engineering Jobs. Just set your preferences and Job Copilot will do the rest — finding, filtering, and applying while you focus on what matters.

Activate JobCopilot