Senior SRE & Linux Infrastructure Engineer - ML Platform

Hybrid

Mobileye’s ML Platform group builds and operates the core infrastructure that powers large-scale AI workloads. We manage a massive, high-performance environment consisting of both multi-cloud clusters and on-prem bare-metal nodes optimized with AI accelerators.

We are looking for a highly experienced Senior SRE / Linux Systems Engineer who thrives on managing complex, low-level infrastructure. This isn't just a cloud-configuration role: you will be responsible for the health and performance of expensive, high-density hardware. You must be an expert at troubleshooting open-source systems and "living" inside Linux environments to ensure our AI clusters run at peak efficiency.

What will your job look like?

Build and maintain infrastructure for large‑scale AI and HPC workloads across on‑prem and cloud environments

Operate and enhance our multi‑cloud, multi‑cluster scheduling platform

Troubleshoot complex issues across the stack, from kernel-level tuning and drivers to networking, storage, and distributed system bottlenecks

Ensure the reliability of critical platform services: queuing systems, time-series databases, and logging pipelines

Develop deeply integrated automation and tooling

Collaborate with ML engineers and IT engineers to optimize hardware utilization for data-intensive workloads

Drive best practices in system design, observability, and infrastructure-as-code

All you need is:

10+ years of hands‑on experience in SRE, Linux Administration, or Systems Engineering

Expert-level Linux knowledge: Deep understanding of system internals, debugging, performance tuning, and the ability to solve failures where hardware meets software

Kubernetes Expertise: Proven experience managing K8s at scale (both managed EKS and bare-metal deployments)

Distributed Systems Mastery: Hands-on experience debugging and maintaining:

Queuing Systems: RabbitMQ or similar

Metrics/Observability Stacks: Prometheus, Thanos, Grafana, or similar

Logging: Elasticsearch or similar

Relational Databases: PostgreSQL or similar

Infrastructure-as-Code: Proficiency with Terraform, Helm, and configuration management

Networking & Scripting: Strong fundamentals in networking and proficiency in Bash

Advantages:

Familiarity with GPU/accelerator scheduling and AI/ML pipelines

Experience with multi-cloud architectures and hybrid environments

Experience with workflow orchestration tools (e.g., Argo Workflows)

What We Offer:

Impact: Support the engineering that advances Mobileye’s AI and global transportation safety

Cutting-Edge Hardware: Work with high-value, AI-optimized bare-metal clusters at massive scale

Technical Depth: A highly technical environment focused on solving deep systems engineering challenges

Collaboration: Work alongside elite ML, software, and systems engineers
