We are seeking a highly skilled Machine Learning Engineer to enable a large team of deep learning engineers, data scientists and computational biologists.
We are looking for a driven candidate to expand our model observability software, lead the optimization of GPU infrastructure, and ensure efficient collaboration with data engineering. The ideal candidate will have strong Python development skills and deep expertise in cloud infrastructure, high-performance computing, and AI workload optimization. This role involves designing robust GPU management systems, automating the reporting of model performance metrics, and supporting researchers with the tools they need to train and deploy machine learning models effectively.
Location: Ramat Gan, Israel (hybrid role)
What will you do?
GPU Infrastructure Optimization:
Design and implement strategies to maximize the efficient utilization of GPU resources across the organization.
Develop tools and processes for GPU allocation, workload management, and performance monitoring in alignment with selected infrastructure tools.
Monitor and fine-tune GPU performance to ensure optimal throughput for machine learning workloads.
Model Observability:
Build and maintain a robust system for automated reporting of key model performance metrics.
Integrate with diverse data sources to create customizable dashboards for monitoring performance across datasets.
Set up anomaly detection systems and alerts to ensure timely identification of performance degradation.
Enhance the existing benchmarking suite for seamless evaluation of datasets in federated data lakes.
Collaboration and Support:
Partner with machine learning scientists, data engineers, and DevOps teams to enable researchers to efficiently train and deploy models.
Provide technical guidance and support for effectively utilizing available infrastructure and tools.
Technology Research:
Stay updated with the latest advancements in GPU technologies, ML infrastructure best practices, and model performance metrics.
Evaluate and recommend new tools, technologies, and approaches to enhance the efficiency of the ML enablement platform.
Git and Code Ownership:
Implement best practices for Git workflows, code versioning, and safe release processes.
Foster a culture of high-quality, collaborative development within the engineering team.
Required qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field
4+ years of experience as a software engineer
2+ years of experience in cloud infrastructure or developer platform teams
Proficiency in Python and Git
Hands-on experience with high-performance computing (HPC) and GPU cluster performance optimization for AI workloads
Strong knowledge of GPU technologies and deployment strategies
Familiarity with GCP compute deployment options, such as Kubernetes
Experience integrating observability tools for model performance metrics and evaluation
Preferred qualifications:
Knowledge of federated learning and multi-dataset evaluation methodologies
Experience in designing and scaling benchmarking frameworks
Strong analytical and troubleshooting skills in cloud infrastructure and GPU utilization
Desired personal traits:
You want to make an impact on humankind.
You prioritize "we" over "I".
You enjoy getting things done and striving for excellence.
You collaborate effectively with people of diverse backgrounds and cultures.
You have a growth mindset.
You are candid, authentic, and transparent.
This position is open to all candidates.