Site Reliability Engineer (AI Production Readiness, Automated Testing, Observability)
TAT IT Technologies · Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
Urgent requirement: Site Reliability Engineer (AI production readiness, automated testing, observability, SLIs, resilience) for our banking clients in Abu Dhabi, UAE.

This hybrid role combines SRE and automated testing to ensure AI-driven cloud applications are production-ready, resilient, and compliant with banking standards.

Must-Have Requirements
- Strong expertise in Python-based testing frameworks (PyTest, Robot Framework, or similar) and experience with Azure and AWS cloud platforms.
- Hands-on experience with observability tools (Prometheus, Grafana, ELK, Datadog) and with defining and implementing SLIs/SLOs for distributed systems.
- Practical exposure to chaos engineering and load testing frameworks (Gremlin, Locust, JMeter) and familiarity with AI/ML evaluation tools for production readiness.
- Strong background in security and compliance automation within regulated industries (banking/finance).

Role Overview
We are seeking a Site Reliability Engineer (AI Production Readiness) to ensure our AI-driven cloud applications are production-ready, resilient, and compliant with banking standards. This hybrid role combines SRE practices with automated testing expertise, focusing on reliability, observability, and proactive validation of both application logic and infrastructure.

Key Responsibilities
- Automated Validation Frameworks: Design and implement Python-based automated testing frameworks to validate AI application logic, APIs, and cloud infrastructure.
- Resilience Engineering: Conduct chaos testing, load testing, and fault injection to ensure systems withstand failures and maintain service continuity.
- SLIs/SLOs Definition: Establish clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for AI workloads, ensuring measurable reliability targets.
- Observability & Monitoring: Build proactive monitoring, alerting, and logging pipelines across Azure and AWS environments to detect anomalies before they impact users.
- Security & Compliance: Implement automated compliance checks aligned with banking regulations, ensuring secure deployment pipelines and audit readiness.
- AI Evaluation Tools: Integrate AI-specific evaluation frameworks to continuously assess model performance, fairness, and reliability in production.

Skills: reliability, AI, automated testing