We are seeking an experienced Senior Site Reliability Engineer (SRE) to manage the reliability, scalability, and uptime of our SaaS production environments, ensuring 99.999% uptime. You will collaborate with the SOC, DevOps, and other teams to proactively monitor and optimize the health of our systems, combining technical expertise and problem-solving to ensure exceptional service reliability and performance.
Reporting to our VP R&D, this is a full-time, hybrid position based in Herzliya.
SaaS Reliability: Ensure 99.999% uptime for SaaS environments by optimizing monitoring, incident management, and capacity planning. Work with development and product teams to ensure smooth deployments and operations.
Collaboration with SOC & DevOps: Partner with SOC to implement security best practices and automated threat response. Work with DevOps to scale CI/CD pipelines, manage releases, and implement disaster recovery plans.
Incident Response: Lead incident response for outages, conduct post-mortems, and implement corrective actions. Develop and maintain runbooks for common incidents.
Performance & Scalability: Monitor system performance, conduct load testing, and implement auto-scaling and performance optimizations. Ensure capacity planning aligns with scaling needs.
Automation & IaC: Drive Infrastructure as Code (IaC) practices using tools like Terraform and Ansible to automate and streamline infrastructure management.
Monitoring & Reporting: Maintain monitoring and alerting systems to ensure system health, meet SLAs/SLOs, and improve issue detection.
Requirements:
Must Have:
Education: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Experience: At least 5 years of hands-on experience as a Site Reliability Engineer (SRE) or in a similar role, with a focus on managing and supporting production environments for SaaS or cloud-native applications.
Proven expertise in maintaining high-availability production systems with a focus on achieving 99.999% uptime or better.
Strong experience with cloud infrastructure (AWS, GCP, Azure) and related tools/services such as EC2, S3, Lambda, RDS, and Kubernetes.
Expertise in monitoring, logging, and observability platforms (e.g., Prometheus, Grafana, Datadog, ELK stack, Splunk).
Proficient with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible.
In-depth understanding of CI/CD pipelines and tools such as Jenkins, GitLab CI, or CircleCI, including automated testing and deployment strategies.
Strong background in incident response and management, with experience in performing root cause analysis (RCA) and leading post-mortems.
Solid understanding of distributed systems, microservices architectures, and the challenges related to scaling and maintaining them in a cloud environment.
Experience with Kubernetes-based products in production.
Experience with runbook creation and automation.
Excellent problem-solving skills with the ability to think critically under pressure and mitigate issues efficiently.
Familiarity with security best practices and working alongside a SOC team to ensure the integrity and safety of production systems.
Preferred Skills and Experience:
Experience with container orchestration using Kubernetes or Docker Swarm and containerization best practices.
Familiarity with load balancing, networking, and CDN technologies to optimize traffic flow and ensure high availability.
Background in automation frameworks such as Chef, Puppet, or SaltStack.
Strong knowledge of relational and distributed databases (e.g., PostgreSQL, MySQL, Cassandra, MongoDB) and their operation in cloud environments.
Experience with site reliability metrics (e.g., SLAs, SLOs, SLIs) and creating reporting systems to track performance over time.
Familiarity with DevSecOps practices to integrate security measures into the DevOps pipeline.
Certifications such as AWS Certified Solutions Architect, Google
This position is open to all candidates.