Zoox is looking for an experienced Software Engineering Manager to lead our High Performance Computing Storage infrastructure team. Zoox HPC Storage provides abstraction layers for petabyte-scale data movement and management for critical, high-throughput use cases, such as ML foundation model training, synthetic data generation, and more. You will take on a breadth of end-to-end responsibilities, including distributed system design, optimization of storage-related GPU utilization bottlenecks, and cost-effective resource management.
The position comes with a high degree of independence and the opportunity to help define Zoox’s scaling strategy, both technically and organizationally. You will be responsible for hiring and maintaining the health of your team, as well as growing and coaching them to support the continued success of their careers.
In this role, you will:
- Work closely with AI teams and other software customers to holistically address pain points, find optimization opportunities, and ultimately charter systems solutions for broad categories of storage use cases
- Develop a multi-year vision and roadmap for storage at Zoox, including investment into new data movement and management paradigms to meet Zoox’s ever-growing computational and storage needs in a cost-effective manner
- Own the hiring process end-to-end, from thoughtful role definition to interview loop design to successfully hiring bar raisers
- Mentor, coach, and advocate for your direct reports
Qualifications:
- Experience managing teams of 5-10
- Demonstrated ability to prioritize development work and build cross-functional consensus across ML stakeholders
- Experience with high-performance storage systems deployed on cloud providers, such as FSx for Lustre on Amazon Web Services (AWS)
- Strong operational background with highly available systems
- Bachelor's degree in computer science (or related field)
Bonus Qualifications:
- Experience with ML-specific data formats such as Mosaic Streaming Datasets (MDS)
- Experience with end-to-end hosted ML services such as AWS SageMaker HyperPod
- Proficiency with Python, Java, or other managed languages