Site Reliability Engineer

Wave Talent · Greater London, England, United Kingdom

Senior / Staff Site Reliability Engineer | £136k–£180k + equity | Remote Europe or London

We're partnering with a fast-growing developer infrastructure startup on a senior SRE hire at a pivotal moment in their growth.

The platform runs AI agents and background workflows in production at massive scale, handling hundreds of millions of executions per month on infrastructure they run themselves. The team is ~13 people. No engineering managers. Engineers own large parts of the system and work directly with the founders.

The core challenge right now is scale. Execution volume is growing faster than the team can build, which means the next hires are walking into genuine distributed systems problems, not a greenfield rebuild or a dashboard feature.

What you'll be working on

- Owning observability across the platform: OpenTelemetry, metrics, logs, traces, and making them genuinely useful at 3am
- Designing and operating distributed systems primitives under real production load: queues, schedulers, checkpoints, backpressure
- Architecting and tuning auto-scaling infrastructure that runs untrusted customer code at high throughput
- Hardening multi-tenant sandbox isolation, secrets handling, network policy, and supply chain security
- Owning Terraform and IaC as a first principle across a cloud-native footprint
- Running the on-call practice: SLOs, runbooks, blameless postmortems, paging hygiene

What they're looking for

- Strong observability background: production experience with OpenTelemetry, Prometheus or equivalent
- Distributed systems experience: you've designed or operated systems with non-trivial failure modes
- Strong in TypeScript and/or Go: the codebase is TypeScript-heavy, with Go emerging as a second language
- Self-managed Kubernetes in production, not just managed control planes
- Performance and scaling instincts: you've chased real bottlenecks across app, database, and infra layers
- Terraform as a first principle, run at meaningful scale
- Security mindset: multi-tenant isolation, least privilege, threat modelling
- Postgres and Redis under load, AWS strongly preferred

The process

Screening call, hiring manager conversation, technical interview with roughly a 10% pass rate, then a final with the wider team. The bar is high, but if you find that motivating rather than off-putting, that's probably a good sign.