Senior DevOps Engineer – ML Platform

Mobileye • IL

Hybrid

The goal of AI Engineering's ML-Platform team is to deliver modern infrastructure and solutions that enhance Mobileye's algorithm development life cycle and shorten our delivery times. We are an independent group of excellent, experienced engineers with diverse skills in algorithms, software, and infrastructure. We strive to implement a DevOps culture that allows our engineers to collaborate easily on large-scale products. We develop cross-company products that enable the research and deployment of state-of-the-art algorithms.

What will your job look like?

Build and maintain infrastructure for large‑scale AI and HPC workloads across on‑prem and cloud environments

Operate and enhance our multi‑cloud, multi‑cluster scheduling platform

Develop automation, tooling, and platform services using Bash

Troubleshoot complex issues across the stack: compute, networking, storage, orchestration, and distributed systems

Improve reliability of critical systems

Collaborate with ML, data, and backend teams to support evolving platform needs

Drive best practices in CI/CD, infrastructure-as-code, and system design

Participate in on‑call rotations for critical infrastructure components

All you need is:

10+ years of hands‑on experience in DevOps, SRE, systems engineering, or similar roles

Linux knowledge, including debugging, performance tuning, and system internals

Proven experience working with HPC environments, large clusters, or high‑performance compute systems

Solid experience with Kubernetes (EKS or similar managed K8s services)

Knowledge of infrastructure‑as‑code tools (Terraform, Helm, etc.)

Hands‑on experience with:

PostgreSQL or similar relational databases

Elasticsearch or similar search/indexing systems

Prometheus/Thanos/Grafana or similar observability stacks

RabbitMQ or similar messaging systems

Strong proficiency in Bash, networking fundamentals, and debugging distributed systems

Experience investigating complex issues across compute, storage, networking, and orchestration layers

Advantages:

Experience with multi‑cloud architectures

Experience with workflow orchestration tools such as Argo Workflows (or similar systems like Airflow, Prefect, Flyte)

Familiarity with GPU scheduling, AI/ML pipelines, or data‑intensive workloads

Background in large‑scale distributed systems or platform engineering

Ability to write production‑quality Go (Golang) code

What We Offer:

Impactful engineering that advances Mobileye’s AI capabilities and strengthens the safety of transportation systems globally

The opportunity to work on cutting‑edge AI infrastructure at massive scale

A highly technical environment with deep engineering challenges

Collaboration with great ML, software, and systems engineers
