Sr. Site Reliability Engineer

Optomi · Seattle, WA

Sr. Software Engineer

- 24-month contract with potential to convert
- Hybrid: 4x a week onsite in Seattle, WA

Optomi, in partnership with our premier client, is seeking a highly skilled Site Reliability Engineer to support a growing portfolio of enterprise platforms, including AI-driven initiatives, automation services, and next-generation observability and data platforms. This role will focus heavily on Kubernetes-based infrastructure, platform reliability, automation, and operational scalability across complex distributed environments.

The ideal candidate will bring deep hands-on expertise in Kubernetes, infrastructure automation, and platform engineering while also serving as a technical leader capable of influencing engineering direction, reliability standards, and operational best practices across teams.

Key Responsibilities

- Design, support, and optimize highly available cloud and containerized platform environments.
- Lead operational reliability initiatives across distributed systems and Kubernetes-based infrastructure.
- Implement and maintain monitoring, observability, telemetry, and alerting solutions for platform health and performance.
- Drive capacity planning, performance optimization, and SLA/SLO reliability objectives.
- Build and enhance infrastructure automation and Infrastructure-as-Code (IaC) frameworks.
- Develop reusable CI/CD pipeline components, automation modules, and internal platform tooling.
- Collaborate with software engineering, architecture, security, and infrastructure teams to improve scalability, resiliency, and deployment efficiency.
- Troubleshoot complex infrastructure, networking, and application performance issues across large-scale environments.
- Contribute to technical documentation, architecture standards, operational procedures, and engineering best practices.
- Evaluate and leverage AI-assisted engineering and automation tools to improve development and operational workflows.

Required Qualifications

- 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or Software Engineering.
- 5+ years of systems administration experience supporting large-scale enterprise environments.
- 5+ years of experience automating infrastructure and operational processes.
- 3+ years of experience building developer-facing platforms, tooling, or internal engineering services.
- Experience designing reusable infrastructure modules, libraries, templates, or shared services used across multiple teams.
- Strong experience operating within large enterprise organizations supporting cross-functional engineering initiatives.
- Excellent written communication skills with the ability to produce technical documentation, architecture proposals, and engineering guides.

Core Expertise

- Deep hands-on expertise with Kubernetes and containerized infrastructure.
- Strong understanding of distributed systems, cloud platforms, and infrastructure reliability engineering.
- Extensive Linux administration, troubleshooting, and performance optimization experience.
- Strong experience with Infrastructure-as-Code and automation tools such as Terraform, OpenTofu, Ansible, or similar technologies.
- Experience implementing and managing CI/CD pipelines using platforms such as GitLab CI/CD, GitHub Actions, or equivalent tools.
- Strong understanding of monitoring, observability, telemetry, and logging practices.

Additional Technical Skills

- Experience with cloud-native technologies and container platforms.
- Familiarity with Docker and container lifecycle management.
- Proficiency in scripting or programming languages such as Bash, Python, Go, JavaScript, or TypeScript.
- Solid understanding of networking fundamentals including HTTP, TLS, SSH, DNS, virtual networking, and load balancing.
- Experience integrating security and compliance scanning tools into CI/CD workflows.
- Experience deploying and managing infrastructure programmatically through APIs and SDKs.
- Ability to implement instrumentation, monitoring, and telemetry across applications and infrastructure.
- Familiarity with API design principles and developer platform integration patterns.
- Strong troubleshooting and root cause analysis skills for complex system and performance issues.