Member of Technical Staff - Infrastructure Engineer
Black Forest Labs • Freiburg (Germany), San Francisco (USA)
About Black Forest Labs
We're a team of world-class researchers and engineers creating the generative models that power how people make images and video—tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started.
Headquartered in Freiburg, Germany with a growing presence in San Francisco, we're scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.
Why this role?
We're looking for engineers to build and maintain the engine that powers our mission to develop visual intelligence. From maintaining and scaling clusters to building research platforms that accelerate innovation, this team works with both breadth and depth. We build the systems that make training runs lasting weeks or months possible and that orchestrate resources efficiently at scale, enabling the next breakthrough model. If you're obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement, this team would be perfect for you.
What you'll work on:
- Maintain research infrastructure, ensure its health, and optimize components to extract peak performance from the system, on both the application and infrastructure sides
- Scale infrastructure to meet growing research demands while maintaining reliability and performance
- Collaborate with research teams to deeply understand their infrastructure needs and design solutions that balance performance with cost efficiency
- Identify and resolve performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale
- Build and evolve telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets
- Participate in on-call rotations and incident response to maintain system reliability
Technical Focus:
- Python, Bash, Go
- Kubernetes
- NVIDIA GPU drivers and operators
- OTel, Prometheus
What We’re Looking For:
- Experience working with petabyte-scale video and image datasets
- Proven ability to debug performance and reliability issues across large distributed fleets
- Strong problem-solving skills and ability to work independently
- Strong communication skills and the ability to work effectively with both internal and external partners
- Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP
- Experience with SLURM
- Experience building or operating large-scale training platforms
Base Annual Salary: $180,000–$300,000 USD