Andromeda Cluster

DevOps

Senior Site Reliability Engineer, AI Infrastructure

Global Remote / San Francisco

Posted 27 days ago

Andromeda Cluster provides AI infrastructure at scale for startups, working with AI labs, data centers, and cloud providers to deliver compute for training and inference. The company aims to build a marketplace for global AI compute and is growing across AI infrastructure, research, and engineering.

Responsibilities

Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
Serve as the primary technical point of contact for customers running large-scale training workloads.
Define SLOs and error budgets for GPU infrastructure, and own capacity planning.
Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink).
Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training performance, and hardware health.
Build automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Lead incident response for complex failures spanning hardware, networking, orchestration, and ML frameworks.

Requirements

Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
Production experience with InfiniBand, RoCE, or NVLink fabrics in distributed training.
Working knowledge of large-scale training jobs and the systems behind them, such as NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, and FSDP.
Expert-level Linux knowledge: kernel tuning, driver management, performance profiling.
Strong experience running Kubernetes with GPU workloads, including device plugins and topology-aware scheduling.
Strong engineering skills in Python, Go, or Bash; proficiency with Infrastructure-as-Code tools.
Hands-on experience building monitoring and alerting for GPU infrastructure.
Proven track record leading incident response for complex distributed systems.
