MLOps Engineer

CNTXT AI · Abu Dhabi Emirate, United Arab Emirates

Job role
A dedicated startup is being formed to industrialize and scale a secure, AI-enabled, multi-source decision-support software offering. The platform is a multi-sensor fusion and agentic AI solution connecting to diverse data sources (for example geospatial layers, imagery, video, and other operational signals). This role will support the delivery of a scalable product and contribute to establishing the processes, standards, and collaboration practices required for sustainable growth.

The Cloud Infrastructure Engineer is responsible for designing, deploying, and maintaining secure, scalable, and highly available cloud environments. This role focuses on building robust infrastructure on AWS (or multi-cloud environments, if applicable), automating operational processes, and ensuring the reliability and performance of cloud-based systems. The ideal candidate combines deep technical expertise with strong problem-solving skills and a passion for automation and cloud-native technologies.

Job Responsibilities
Design and operate end-to-end ML/LLM delivery pipelines, from data through training/fine-tuning, evaluation, and packaging to deployment
Build CI/CD for models and services, including automated testing, validation gates, and rollback strategies
Standardize experiment tracking, model/version lineage, and artifact management (datasets, prompts, checkpoints, embeddings)
Implement monitoring and observability: latency, cost, drift, quality signals, and safety/guardrail metrics
Optimize inference performance and cost (batching, caching, quantization, hardware choices)
Define and enforce environment and dependency management across dev/stage/prod
Work with engineering on scalable serving patterns (APIs, streaming, event-driven) and with security on access controls and secrets
Support release readiness: runbooks, incident response, SLOs/SLAs, and post-release stability tracking
Coordinate with procurement and legal where needed for tooling, cloud services, and vendor onboarding

Startup mode: hands-on, flexible, comfortable pivoting, and able to unblock teams quickly

Interfaces / stakeholders

Qualifications & Experience
Typically 5+ years in MLOps/DevOps/Data Platform roles, including production deployments of ML and/or LLM-powered systems.
Experience in fast-paced product environments preferred.

Tools (examples)
ML lifecycle: MLflow / Weights & Biases / equivalent
Serving: FastAPI, Triton (plus), Ray Serve (plus)
Orchestration: Airflow/Dagster (plus)
Observability: Prometheus/Grafana, OpenTelemetry, ELK
Cloud: AWS/Azure/GCP (or private cloud)

KPIs
Deployment frequency and lead time for model releases
Production stability: incident rate, MTTR, SLO compliance
Model quality health: drift detection coverage, evaluation gate pass rate
Inference cost and latency improvements
Reproducibility and traceability coverage (lineage completeness)

Competencies
Strong MLOps fundamentals: model lifecycle, reproducibility, evaluation, deployment, monitoring
Proficiency with containers and orchestration (Docker; Kubernetes is a plus)
CI/CD and automation (GitHub Actions/GitLab CI/Jenkins), infrastructure-as-code (Terraform is a plus)
Experience with model serving patterns (REST/gRPC) and observability tools
Comfort with cloud primitives (compute, storage, networking) and cost management practices
Clear communication and documentation; strong ownership and operational discipline