Senior SRE Engineer (MLOps) - AI

Salla · Mecca, Makkah, Saudi Arabia

Description

Salla is looking for a Senior SRE Engineer (MLOps) to join our Salla AI team. This role focuses on running our AI and ML systems as real production systems, not side experiments: owning the operational layer around models, prompts, agents, inference services, and retrieval systems. You will be responsible for enabling Agentic AI and Generative AI features to operate reliably, securely, and cost-effectively at scale within the Salla ecosystem.

This role is SRE- and platform-engineering-first, with a strong emphasis on reliability, observability, safe releases, cost, and governance, while collaborating closely with engineering, data, and AI teams to give every pod a fast, safe path to production. It exists because AI systems fail differently from normal services: a prompt change can behave like a code change, an agent calling tools needs auditability, and latency, quality, and cost can move together in uncomfortable ways.

Key Responsibilities

- Own reliability for ML and agentic AI services in production: SLOs, dashboards, alerts, runbooks, and incident follow-ups
- Build observability across the AI stack: latency, errors, traces, tool calls, cost, and user impact
- Design safe-release patterns for models, prompts, agents, tools, and configuration, including canary, rollback, feature-flag, and evaluation-gate strategies
- Provide operational support for inference APIs, queues, retrieval layers, and AI workflows running on Kubernetes/EKS
- Establish ownership, traceability, and guardrails around what agentic systems (e.g. Sidekick, the growth advisor) are allowed to do, including how they call internal tools
- Defend agent tool-calling against prompt injection and untrusted-data risks: establish and enforce data-trust boundaries so that untrusted store/merchant content cannot manipulate agent decisions, tool calls, or actions
- Drive AI cost governance: per-model and per-pod spend visibility, token-cost tracking, and anomaly alerting
- Build automation and self-service paths so product teams have a known safe path to production instead of rebuilding it each time
- Turn recurring operational pain into simple, reusable platform standards that other teams adopt
- Participate in architecture discussions, code reviews, and technical decision-making

Requirements

- 4+ years in SRE, platform engineering, DevOps, or production infrastructure, operating distributed systems in production, not only in demos
- Hands-on experience with Kubernetes and cloud-native systems in production
- Familiarity with deploying ML projects
- Strong command of CI/CD, GitOps, observability, and incident response
- Solid experience with infrastructure-as-code, secrets management, and networking
- Ability to write automation or platform tooling in Python, or a similar language
- Production judgment: knowing how to make systems measurable, debuggable, repeatable, and safe to change (you do not need to be a machine learning researcher)
- Ability to work across teams, explain trade-offs clearly, and turn operational pain into standards engineers will actually use

Nice to have:

- Experience with MLOps or ML platforms: model serving, registries, evaluation, feature/data dependencies, drift monitoring, or ML pipelines
- Familiarity with LLM applications or agentic systems: RAG, vector databases, tool calling, workflow orchestration, memory, traces, guardrails, or evaluation pipelines
- Exposure to tooling such as OpenTelemetry, Prometheus, Grafana, MLflow, KServe, Ray, LiteLLM, vLLM, LangGraph, Arize Phoenix, or LangSmith
- Experience with Kafka consumers, GPU workloads, inference optimization, model routing, or AI cost governance
- Experience working in cross-functional product teams involving AI, backend, and frontend engineers