Andromeda Cluster · DevOps

Senior Site Reliability Engineer, AI Infrastructure

Global Remote / San Francisco · Posted 27 days ago

Andromeda Cluster provides scaled AI infrastructure for startups, working with AI labs, data centers, and cloud providers to deliver compute for training and inference. The company aims to build a marketplace for global AI compute and is expanding across AI infrastructure, research, and engineering.

Responsibilities

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Serve as the primary technical point of contact for customers running large-scale training workloads.
  • Define SLOs and error budgets for GPU infrastructure, and own capacity planning.
  • Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink).
  • Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training performance, and hardware health.
  • Build automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
  • Lead incident response for complex failures spanning hardware, networking, orchestration, and ML frameworks.

Requirements

  • Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
  • Production experience with InfiniBand, RoCE, or NVLink fabrics in distributed training.
  • Working knowledge of large training jobs and of systems such as NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, and FSDP.
  • Expert-level Linux knowledge: kernel tuning, driver management, performance profiling.
  • Strong experience running Kubernetes with GPU workloads, including device plugins and topology-aware scheduling.
  • Strong engineering skills in Python, Go, or Bash; proficiency with Infrastructure-as-Code tools.
  • Hands-on experience building monitoring and alerting for GPU infrastructure.
  • Proven track record leading incident response for complex distributed systems.

Source

remoteok
