We are seeking an experienced Senior Site Reliability Engineer (SRE) to own the reliability, scalability, and availability of our SaaS production environments, with a target of 99.999% uptime. You will collaborate with the SOC, DevOps, and other teams to proactively monitor and optimize the health of our systems, combining technical expertise and problem-solving to deliver exceptional service reliability and performance.
About us: For over two decades, we have mastered the art of real-time data. We've built platforms that power the world's most demanding systems, shaping how organizations grow their business with data-driven services. We have pioneered technologies that optimize these services, working with global organizations including American Airlines, Morgan Stanley, CSX, Goldman Sachs, Société Générale, Credit Agricole, and more. As pioneers in data-tech, we are building on our DNA in real-time operational data to deliver cutting-edge GenAI solutions that empower businesses to unlock the full potential of their structured data and transform how they interact with information.
Reporting to our VP R&D, this position is full-time, hybrid, and located in Herzeliya.
Key Responsibilities:
SaaS Reliability: Ensure 99.999% uptime for SaaS environments by optimizing monitoring, incident management, and capacity planning. Work with dev and product teams to ensure smooth deployments and operations.
Collaboration with SOC & DevOps: Partner with SOC to implement security best practices and automated threat response. Work with DevOps to scale CI/CD pipelines, manage releases, and implement disaster recovery plans.
Incident Response: Lead incident response for outages, conduct post-mortems, and implement corrective actions. Develop and maintain runbooks for common incidents.
Performance & Scalability: Monitor system performance, conduct load testing, and implement auto-scaling and performance optimizations. Ensure capacity planning aligns with scaling needs.
Automation & IaC: Drive Infrastructure as Code (IaC) practices using tools like Terraform and Ansible to automate and streamline infrastructure management.
Monitoring & Reporting: Maintain monitoring and alerting systems to ensure system health, meet SLAs/SLOs, and improve issue detection.
Requirements:
Must Have:
Education: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Experience: At least 7 years of hands-on experience as a Site Reliability Engineer (SRE) or in a similar role, with a focus on managing and supporting production environments for SaaS or cloud-native applications.
Proven expertise in maintaining high-availability production systems with a focus on achieving 99.9% uptime or better.
Strong experience with cloud infrastructure (AWS, GCP, Azure) and related tools/services such as EC2, S3, Lambda, RDS, and Kubernetes.
Expertise in monitoring, logging, and observability platforms (e.g., Prometheus, Grafana, Datadog, ELK stack, Splunk).
Proficient with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible.
In-depth understanding of CI/CD pipelines and tools like Jenkins, GitLab CI, CircleCI, or similar, including automated testing and deployment strategies.
Strong background in incident response and management, with experience in performing root cause analysis (RCA) and leading post-mortems.
Solid understanding of distributed systems, microservices architectures, and the challenges related to scaling and maintaining them in a cloud environment.
Experience with Kubernetes-based (K8s) products in production.
Experience with runbook creation and automation.
Preferred Skills and Experience:
Experience with container orchestration (Kubernetes or Docker Swarm) and with containerization best practices.
Background in automation frameworks such as Chef, Puppet, or SaltStack.
This position is open to all candidates.