Senior DevOps Engineer

Lumicity · San Francisco Bay Area

Senior DevOps / Site Reliability Engineer

Location: San Francisco Bay Area (Hybrid)
Level: Senior
Type: Full-Time

The company

We're a Series B healthcare AI company whose revenue has grown substantially. More than 100 enterprise healthcare organizations use our platform to automate complex, compliance-critical operational workflows: the kind of work that used to require large manual teams and still carries serious downstream risk if it breaks.

We're about 100 people, well-funded, and at an inflection point: our platform is scaling fast, our engineering team is growing, and reliability is becoming mission-critical. This isn't a company that's been around long enough to accumulate decades of technical debt. You'd be building the right foundation from the start.

The role

We're hiring a Senior DevOps Engineer or Site Reliability Engineer, depending on where your experience and interests land.

Both roles sit within our engineering team, report into engineering leadership, and work closely with backend and ML engineers. The difference is in focus:

- DevOps track: Infrastructure as code, CI/CD, deployment systems, developer experience, and platform reliability.
- SRE track: Observability, incident management, SLO frameworks, and production reliability across distributed systems.

Whichever track you're on, this is a hands-on, high-ownership role.
You'll have real production responsibility and real impact on how the platform performs at scale.

What you'll work on

- Design and evolve AWS-based cloud infrastructure using Terraform
- Own and improve CI/CD pipelines (GitHub Actions) for fast, safe deployments
- Standardize deployment patterns across serverless workloads (Lambda), containerized services (ECS), and workflow orchestration systems
- Define observability standards across metrics, logs, and traces using OpenTelemetry, Datadog, Grafana, and Sentry
- Build proactive detection for reliability risks, latency regressions, and performance degradation
- Partner with backend and ML teams to debug distributed system issues, including Postgres performance
- Lead and support incident response and root cause analysis
- Automate security and compliance workflows (access controls, audit readiness, vulnerability management)
- Participate in the on-call rotation

What we're looking for

Must have:

- 7+ years in DevOps, SRE, or infrastructure engineering in a B2B SaaS environment
- Strong production AWS experience
- Deep hands-on Terraform (IaC) experience
- CI/CD pipeline ownership (GitHub Actions or equivalent)
- Experience with serverless and containerized services in production
- Postgres in production (performance, tuning, operations)
- Observability tooling (metrics, logs, traces) and the ability to turn signals into action
- Scripting fluency (Python, Bash, or similar)
- High ownership mindset: you're not waiting to be assigned an incident, you're already thinking about failure modes

Nice to have:

- Experience in healthcare, fintech, or other regulated environments
- ClickHouse or high-scale analytics systems
- OpenTelemetry and modern observability architecture
- ML infrastructure experience

Why join now

- Define reliability and infrastructure standards before they calcify
- Tight collaboration with product, backend, and ML, with no siloed infra team
- Meaningful equity in a company with strong investor backing and real market traction
- Modern cloud-native stack: AWS, Terraform, GitHub Actions, ECS, Lambda, Aurora Postgres, Datadog, OpenTelemetry

Interested or know someone who might be? Apply below or reach out directly.