Principal Production Engineer

Added: 4 days ago
Type: Full time

Related skills

Linux, AWS, Grafana, Prometheus, Python

📋 Description

  • Design and implement highly available, scalable infrastructure across AWS, GCP, and bare metal
  • Drive an automation-first culture by writing code (Python/Go) to remove toil
  • Implement observability with Prometheus, Grafana, OpenTelemetry; define SLIs/SLOs
  • Serve as lead Incident Commander while on call and build response playbooks
  • Conduct deep-dive post-incident analyses
  • Partner with Engineering for operability reviews

🎯 Requirements

  • 10+ years of experience in reliability, scalability, and availability for large-scale services
  • Deep programming expertise in Python, Go, or C/C++
  • Strong networking, Linux/RHEL, and distributed systems knowledge
  • Experience in 24/7 on-call incident management
  • ITIL-based problem management and operability reviews
  • Cloud (AWS/Azure/GCP) and IaC (Ansible, Terraform, Helm)
  • Chaos engineering and disaster recovery at scale
  • L7 proxies, DNS at scale, and OS networking internals

๐ŸŽ Benefits

  • Various health plans
  • Time off for vacation and sick leave
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks, and more!