Platform Engineer

Harrison Clarke · San Francisco Bay Area

Our client, an early-stage company building advanced AI systems, is seeking a senior platform engineer to take ownership of their core platform. This is not a traditional DevOps position focused purely on CI/CD; the role spans GPU orchestration, multi-cloud Kubernetes environments, real-time networking, and observability.

The company is already running production workloads across multiple clusters, regions, and hardware types, and is actively expanding into additional cloud providers. This hire will play a key role in scaling and stabilizing that infrastructure.

Key Responsibilities

- Design and manage multi-region Kubernetes clusters across cloud and GPU-focused providers using infrastructure-as-code
- Own the deployment lifecycle through GitOps practices (Helm, Kustomize, automated releases, continuous delivery)
- Manage GPU infrastructure, including scheduling efficiency, workload placement, and cold-start optimization
- Oversee networking systems such as ingress, gateways, load balancing, and cross-region connectivity
- Build and maintain observability across metrics, logs, traces, and performance profiling
- Ensure infrastructure security across identity, secrets, and encryption
- Maintain CI/CD workflows supporting a monorepo of services and deployment artifacts
- Partner closely with ML engineers to optimize model serving and GPU utilization

Candidate Profile

- Strong experience operating Kubernetes in production environments, including troubleshooting, autoscaling, and upgrades
- Proven background with infrastructure-as-code tools (e.g., Terraform, Pulumi)
- Hands-on experience running GPU workloads on Kubernetes and an understanding of resource optimization
- Familiarity with GitOps tooling such as ArgoCD or Flux, and Helm-based deployments
- Experience with in-memory data systems (e.g., Redis) and distributed architectures
- Solid understanding of observability tooling and practices
- Strong networking fundamentals, particularly in low-latency or distributed systems
- Experience working in environments with broad ownership across infrastructure

Preferred Background

- Exposure to GPU cloud providers beyond major hyperscalers
- Experience with real-time or streaming infrastructure
- Proficiency in Go or Python
- Familiarity with ML model deployment and optimization
- Experience managing infrastructure cost, particularly for GPU-heavy workloads