SRE Infrastructure Engineer
OJUS LLC · San Francisco, CA
Title: SRE Infrastructure Engineer
Location: San Francisco, CA (5 days onsite)

Job Description:
We are seeking an SRE Infrastructure Engineer with 8+ years of professional experience ensuring the reliability, scalability, and performance of Google Cloud-based services through automation, monitoring, and proactive engineering. Key responsibilities include managing infrastructure as code (Terraform), optimizing GKE/Kubernetes, responding to incidents, implementing SLIs/SLOs, and minimizing manual toil.

This role requires close collaboration with cross-functional teams, adherence to DevOps and Agile practices, and ownership of service quality and delivery.

Key Responsibilities
· GCP Infrastructure Management: Design, deploy, and maintain robust infrastructure components, including VPCs, Compute Engine, GKE (Kubernetes), and storage solutions.
· Automation & IaC: Use Terraform or Deployment Manager to manage cloud resources and build CI/CD pipelines to automate deployments. Minimize manual, repetitive tasks by developing automation scripts and custom tools that streamline deployments and operations.
· Observability & Incident Management: Develop monitoring, alerting, and logging systems (e.g., Cloud Monitoring, Prometheus, Grafana).
Act as primary on-call to troubleshoot production incidents.
· Incident Management: Serve as a first responder for system outages and conduct deep-dive root cause analysis (post-mortems) to prevent recurrence.
· CI/CD Pipeline Management: Design and support automated deployment pipelines using tools such as Jenkins, Argo CD, Artifactory, GitLab CI, or GitHub Actions, following DevSecOps practices.
· Reliability Engineering: Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) covering latency, traffic, errors, and saturation.
· Optimization & Security: Proactively optimize infrastructure for cost, performance, and security compliance.
· AI Workload Reliability: Focus specifically on AI workload health and Google Compute Engine (GCE) visibility.

Mandatory Technical Skills & Competencies
· Experience: 8+ years in SRE, DevOps, or systems engineering, specifically with Google Cloud Platform.
· Technical Skills: Deep knowledge of Linux, Kubernetes (GKE), networking (VPCs, CDNs), and containerization.
· Programming: Proficiency in scripting/programming languages such as Python, Go, or Shell.
· Methodologies: Strong understanding of GitOps, CI/CD pipelines, and SRE principles (error budgets, toil reduction).
· Troubleshooting: Strong troubleshooting skills across the full stack (network, OS, application).
· Ability to balance system stability with the need for rapid deployment.
· Observability Tools: Experience implementing monitoring and logging stacks such as Prometheus, Grafana, or the Google Cloud Operations Suite.
· Excellent collaboration skills for working with development teams on service ownership.

Soft Skills
· Strong problem-solving and analytical skills
· Clear communication with technical and non-technical stakeholders
· Ownership mindset and production-grade engineering discipline
· Ability to work independently and within cross-functional teams