Machine Learning Inference Engineer

Oscar · San Francisco County, CA

We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.

This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.

The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.

What You'll Work On

- Build and optimize high-performance inference-serving systems for multimodal and generative AI models
- Improve latency, throughput, scalability, and GPU utilization for production AI workloads
- Productionize large PyTorch-based models for real-world deployment environments
- Design and maintain model-serving microservices and distributed inference infrastructure
- Optimize inference pipelines using:
  - TensorRT
  - Triton Inference Server
  - vLLM
  - CUDA/GPU acceleration techniques
- Work on:
  - KV cache optimization
  - model pruning
  - quantization
  - distillation
  - batching strategies
  - memory optimization
  - latent-space conditioning
- Deploy and scale multimodal architectures including:
  - diffusion models
  - vision-language models (VLMs)
  - large vision pipelines
- Collaborate closely with research and product engineering teams to balance:
  - model quality
  - latency
  - infrastructure cost
  - production reliability
- Own the full inference optimization lifecycle from experimentation to production deployment

Ideal Background

- Strong experience building and optimizing AI inference systems in production
- Deep understanding of GPU architecture and performance optimization
- Hands-on expertise with:
  - Python
  - PyTorch
  - CUDA
  - TensorRT
  - Triton
  - vLLM
- Experience with multimodal AI, computer vision, or generative AI systems
- Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
- Strong understanding of model deployment tradeoffs:
  - throughput vs latency
  - memory efficiency
  - model quality vs compute cost
- Experience working with distributed inference systems and scalable serving infrastructure
- Comfortable operating in highly autonomous, fast-moving startup environments

Nice to Have

- Experience with:
  - diffusion model optimization
  - multimodal transformers
  - quantization techniques
  - FlashAttention
  - TensorRT-LLM
  - speculative decoding
  - model parallelism
  - Kubernetes-based ML infrastructure
- Contributions to open source AI infrastructure projects
- Publications, patents, or research experience in AI systems, vision, or generative modeling

Why This Opportunity

- Work on cutting-edge multimodal and generative AI systems deployed at scale
- Significant ownership and autonomy across core AI infrastructure
- Opportunity to solve complex GPU inference and scaling challenges
- High-impact engineering role with direct visibility into product performance
- Fast-moving environment with strong technical talent density
- Opportunity to contribute to novel IP and patentable systems

Benefits: 80% covered healthcare, 401(k) with 3% matching, $500 learning stipend, and a global program (work anywhere in the world for 3 months).

Oscar Associates Limited (US) is acting as an Employment Agency in relation to this vacancy.