Director of Site Reliability Engineering (SRE)

Remote

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Director of Site Reliability Engineering (SRE) in the United States.

In this leadership role, you will define and drive the reliability, performance, and operational excellence of large-scale global production infrastructure. You will oversee distributed Site Reliability Engineering teams responsible for ensuring uptime, scalability, and efficiency across mission-critical systems supporting a global customer base. Acting as a key member of the Cloud Operations leadership team, you will shape incident management, change management, observability, and infrastructure strategy while fostering a culture of continuous improvement. You will also play a central role in aligning engineering, operations, and customer-facing teams to deliver seamless service reliability. This position combines deep technical oversight with people leadership, budget ownership, and cross-functional collaboration across a globally distributed organization. It is ideal for a leader passionate about building resilient systems and high-performing SRE teams at scale.

Accountabilities:

You will lead and manage globally distributed Site Reliability Engineering teams responsible for maintaining production infrastructure, ensuring adherence to service level objectives, and delivering 24/7 operational support. You will own incident management, change management, and operational escalation processes, ensuring rapid resolution of critical issues and continuous system stability. You will define and track operational KPIs and SLOs, using data-driven insights to improve reliability, performance, and engineering efficiency. You will oversee infrastructure demand forecasting, capacity planning, and budget management for operational tooling and observability platforms. You will collaborate closely with infrastructure engineering, customer support, and data center operations teams to ensure seamless cross-functional execution. Additionally, you will drive continuous improvement initiatives, establish operational standards, and lead strategic efforts to enhance global production systems.

Requirements:

You bring strong leadership experience in Site Reliability Engineering, cloud operations, or infrastructure management, with at least 6 years in management roles and significant experience at the director level. You have a proven track record of managing large-scale, mission-critical production environments and distributed global teams. You possess deep technical expertise in cloud infrastructure, observability, incident response, and large-scale systems operations. You are highly skilled in building and improving operational processes such as incident, change, and problem management. You demonstrate strong analytical and data-driven decision-making abilities, paired with excellent communication and cross-functional collaboration skills. You are experienced in budget ownership, vendor coordination, and strategic infrastructure planning. A background in MSP, Infrastructure-as-a-Service, or cloud-scale environments is highly preferred.

Benefits:

Competitive compensation including base salary, annual bonus, and RSU equity grants
Comprehensive health coverage including medical, dental, and vision for employees and families
401(k) retirement savings plan and employee stock purchase program
Flexible vacation policy and paid parental leave
Remote-first flexibility within the continental United States
Learning and development programs to support career growth
Wellness, childcare, and family support benefits
Work-from-home equipment support (including MacBook Pro and workspace stipend)
Inclusive, values-driven culture focused on diversity, equity, and belonging
