Research Engineer (Inference)

Axiōma Search · London Area, United Kingdom

About

Serving a multimodal agent model in production is a different problem to serving a standard LLM. Context length, tool calls, and computer-use workloads create constraints that require co-designing the inference stack with the model team - not just bolting on a serving framework after the fact.

This is a VC-backed challenger lab building state-of-the-art computer-use agents. The inference team owns the full stack from engine layer (vLLM, SGLang) through to serving architecture (disaggregated inference, intelligent routing).

The team operates at the intersection of research and production - translating cutting-edge techniques directly into the systems behind live agent products.

What you'll do

- Build and operate the inference stack serving multimodal agentic models in production
- Improve latency, throughput, and cost across the serving stack
- Research and implement inference techniques tailored to agent workloads
- Co-design with the models team on training-time decisions that affect inference behaviour
- Evaluate inference frameworks and hardware platforms and feed findings back into roadmap decisions
- Stay current with advances in inference, model serving, and accelerator technology

What you'll need

- Strong software engineering fundamentals and a solid production track record
- Proficient in Python and at least one systems language - Rust, C++, or Go
- Hands-on experience with PyTorch or JAX in an industry setting
- Experience with inference frameworks: vLLM, SGLang, TensorRT-LLM
- Solid distributed systems fundamentals and experience operating production ML infrastructure
- Working knowledge of modern ML including transformers and multimodal architectures

Optional Bonus

- Research engagement: advanced degree with research output, top-tier publications (NeurIPS, ICML, MLSys, OSDI), or open-source contributions
- GPU kernel work - CUDA, Triton, or similar
- Experience with quantisation, speculative decoding, disaggregated inference, or KV-cache compression

Shortlisted candidates will be contacted within 48 hours.