We are looking for a DevOps Team Lead to manage our geographically distributed team and take ownership of our Cloud Infrastructure and Platform Engineering strategy, enabling high-scale, cutting-edge GenAI products running across 40+ Kubernetes clusters on GCP and AWS.
This role combines technical leadership, team management, and hands-on engineering, requiring solid expertise in cloud-native technologies, Kubernetes at scale, and modern DevOps principles. You will collaborate closely with engineering teams to design scalable infrastructure solutions, optimize developer workflows, and ensure platform reliability and efficiency.
Role and Responsibilities
Team Leadership & Mentorship: Lead and manage a geographically distributed team, fostering growth, engagement, and professional development. Mentor engineers, conduct performance reviews, plan career growth, and encourage knowledge-sharing across R&D teams.
Cloud & Kubernetes Management: Guide the design and implementation of scalable multi-cluster Kubernetes environments across GCP & AWS.
Developer Experience & Enablement: Oversee the development of self-service tools and automation to improve efficiency for R&D teams.
Incident & Reliability Engineering: Collaborate with engineering teams to optimize cost, performance, and reliability of production infrastructure through monitoring, capacity planning, and scaling strategies.
Security & Governance: Drive best practices for RBAC, IAM, cloud security, and compliance, ensuring robust infrastructure security.
Automation & Infrastructure as Code: Promote adoption of GitOps workflows and Infrastructure as Code (Terraform, Helm, Crossplane) for improved automation and consistency.
Cross-Team Collaboration: Align cloud infrastructure goals with business needs by working closely with engineering, security, and product teams.
Requirements
7+ years of DevOps, SRE, or Platform Engineering experience.
5+ years working with public cloud platforms (AWS/GCP) at scale.
Senior-level Kubernetes expertise, including experience managing enterprise-grade, multi-cluster environments.
Experience with Infrastructure as Code (Terraform, Helm) and familiarity with GitOps principles (ArgoCD, FluxCD, etc.).
Familiarity with observability and monitoring tools (Prometheus, Grafana, Datadog, OpenTelemetry, etc.).
Proficiency in scripting and automation (Python, Go, Bash) for infrastructure management.
Knowledge of cloud networking (VPC, load balancers, service meshes) and security best practices (RBAC, IAM, security groups, network policies).
Experience with CI/CD pipelines, including optimizing them for performance, security, and developer velocity.
This position is open to all candidates.