We are looking for an exceptional MLOps Team Lead to own, build, and scale the infrastructure and automation that powers state-of-the-art Large Language Models (LLMs) and AI systems.
This is a technical leadership role that blends hands-on engineering with strategic vision. You will define MLOps best practices, build high-performance ML infrastructure, and lead a world-class team working at the intersection of AI research and production-grade ML systems.
You will work closely with LLM Algorithm Researchers, ML Engineers, and Data Scientists to enable fast, scalable, and reliable ML workflows, covering everything from distributed training to real-time inference optimization.
If you have deep technical expertise, thrive in high-scale AI environments, and want to lead the next generation of MLOps, we want to hear from you.
Role and Responsibilities
MLOps Infrastructure & Automation
Architect and maintain scalable, self-service ML pipelines, CI/CD workflows, and orchestration frameworks (Kubeflow, MLflow, Airflow).
Design high-scale distributed training environments, leveraging multi-GPU/TPU clusters and parallelization strategies.
Optimize ML workflows for speed, scalability, and cost efficiency across cloud (AWS/GCP) and on-prem environments.
Model Deployment & Real-Time Inference
Build ultra-low-latency, high-throughput inference architectures optimized for LLMs at scale.
Implement A/B testing, canary releases, and rollback mechanisms for model deployment.
Develop robust monitoring, logging, and alerting solutions for model performance, drift detection, and reliability.
Cloud & Compute Optimization
Lead the design and scaling of multi-cloud ML infrastructure using Kubernetes, Terraform, and ArgoCD.
Optimize GPU/TPU utilization, autoscaling, and resource allocation to maximize efficiency.
Build and manage feature stores, data pipelines, and large-scale storage solutions.
Leadership & Cross-Team Collaboration
Work closely with LLM researchers, ML engineers, and platform teams to align MLOps infrastructure with cutting-edge AI research and real-world deployment needs.
Define and enforce best practices for model governance, security, and compliance.
Requirements
3+ years of experience in MLOps, ML infrastructure, or AI platform engineering.
2+ years of hands-on experience in ML pipeline automation, large-scale model deployment, and infrastructure scaling.
Expertise in deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and MLOps platforms (e.g., Kubeflow, MLflow, TFX).
Proven track record of building production-grade ML systems that scale to billions of predictions daily.
Deep knowledge of Kubernetes, cloud-native architectures (AWS/GCP), and infrastructure as code (Terraform, Helm, ArgoCD).
Strong software engineering skills in Python, Bash, and Go, with a focus on writing clean, maintainable, and scalable code.
Experience with observability & monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
Strong background in security, compliance, and model governance for AI/ML systems.
This position is open to all candidates.