Mobileye’s ML Platform group builds and operates the core infrastructure that powers large-scale AI workloads. We manage a massive, high-performance environment consisting of both multi-cloud clusters and on-prem bare-metal nodes optimized with AI accelerators.
We are looking for a highly experienced Senior SRE / Linux Systems Engineer who thrives on managing complex, low-level infrastructure. This isn't just a cloud-configuration role: you will be responsible for the health and performance of expensive, high-density hardware. You must be an expert at troubleshooting open-source systems and "living" inside Linux environments to ensure our AI clusters run at peak efficiency.
What will your job look like?
- Build and maintain infrastructure for large-scale AI and HPC workloads across on-prem and cloud environments
- Operate and enhance our multi-cloud, multi-cluster scheduling platform
- Troubleshoot complex issues across the stack: from kernel-level tuning and drivers to networking, storage, and distributed system bottlenecks
- Ensure the reliability of critical platform services: queuing systems, time-series databases, and logging pipelines
- Develop deeply integrated automation and tooling
- Collaborate with ML engineers and IT engineers to optimize hardware utilization for data-intensive workloads
- Drive best practices in system design, observability, and infrastructure-as-code
All you need is:
- 10+ years of hands-on experience in SRE, Linux Administration, or Systems Engineering
- Expert-level Linux knowledge: Deep understanding of system internals, debugging, performance tuning, and the ability to solve failures where hardware meets software
- Kubernetes Expertise: Proven experience managing K8s at scale (both managed EKS and bare-metal deployments)
- Distributed Systems Mastery: Hands-on experience debugging and maintaining:
  - Queuing Systems: RabbitMQ or similar
  - Metrics/Observability Stacks: Prometheus, Thanos, and Grafana, or similar
  - Logging: Elasticsearch or similar
  - Relational Databases: PostgreSQL or similar
- Infrastructure-as-Code: Proficiency with Terraform, Helm, and configuration management
- Networking & Scripting: Strong fundamentals in networking and proficiency in Bash
Advantages:
- Familiarity with GPU/Accelerator scheduling and AI/ML pipelines
- Experience with multi-cloud architectures and hybrid environments
- Experience with workflow orchestration tools (e.g., Argo Workflows)
What We Offer:
- Impact: Support the engineering that advances Mobileye’s AI and global transportation safety
- Cutting-Edge Hardware: Work with high-value, AI-optimized bare-metal clusters at massive scale
- Technical Depth: A highly technical environment focused on solving deep systems engineering challenges
- Collaboration: Work alongside elite ML, software, and systems engineers