Senior ML Infrastructure Engineer

RecT Solutions · Berlin, Germany

We are looking for an engineer to take end-to-end ownership of our client's AI compute layer, spanning everything from cluster management and cost efficiency to maximising distributed training performance.

In this role, you will partner closely with research scientists, translating their requirements into high-impact infrastructure decisions that accelerate their workflows.

This role goes beyond traditional DevOps or IT support. The company cultivates a highly collaborative and open research environment. If you have a novel model architecture you've been wanting to test, you will have the autonomy and resources to build and train it with the team.

Key Responsibilities

Manage and Scale GPU Infrastructure: Oversee the company's multi-cluster GPU environments.

Maximise Training Efficiency: Boost GPU utilisation and training speed by profiling runs, resolving memory and communication bottlenecks, and debugging complex distributed systems.

Design Future Architecture: Lead capacity planning and architect multi-cluster orchestration strategies to stay ahead of rapidly growing compute demands.

Enhance Developer Productivity: Engineer the internal platform by building and maintaining CI/CD pipelines, experiment tracking, data processing workflows, and custom tools that allow researchers to iterate faster.

Optimise Compute Budgets: Take full ownership of infrastructure spend. You will relentlessly hunt down compute waste and optimise hardware efficiency per dollar.

Core Stack

PyTorch, GCP, Slurm, Docker, Triton, Weights & Biases (wandb), GitHub Actions, and UV.

What We're Looking For

5+ Years of Relevant Experience: Proven background in building and maintaining production-grade GPU infrastructure or distributed training systems at an AI lab, HPC facility, or high-growth ML startup.

Cluster Management Expertise: Deep, hands-on knowledge of Slurm and multi-tenant GPU workloads. You know how to troubleshoot complex scheduling failures and minimise costly infrastructure downtime.

First-Principles Systems Thinking: You understand hardware constraints such as memory bandwidth at a fundamental level and are at home with GPU profiling. You think in terms of hardware limits, not just software configurations.

Deep PyTorch & Python Fluency: You know PyTorch internals well enough to profile large-scale training runs and accurately pinpoint whether a bottleneck stems from compute, networking, or I/O.

Proven Impact: A track record of architectural decisions that delivered measurable improvements in training throughput or reductions in cost.

Modern AI Tooling: You actively use AI coding assistants (such as Cursor, Claude Code, or similar) to accelerate your development workflows without compromising code quality.

Bonus Points If You Have

Prior responsibility for massive compute budgets (managing tens of millions in annual GPU spend).

Background in hybrid cloud or multi-cloud HPC deployments.

Hands-on experience writing custom compute kernels (CUDA, Triton).

Experience successfully transitioning an organisation from single-cluster to multi-cluster orchestration.

Past experience developing custom ML platforms, model registries, or experiment tracking solutions.