Head of Site Reliability Engineering
dLocal • ES
As we continue to grow globally and increase the complexity and scale of our systems, we’re strengthening our focus on Site Reliability Engineering.
We are looking for a Head of Site Reliability Engineering (SRE) to lead the SRE division and take end‑to‑end ownership of reliability across our platform.
In this role, you will:
- Define and drive the SRE strategy, vision, and roadmap for dLocal.
- Lead and grow a multi‑region SRE organization, including SRE Technical Referents and engineers at different seniority levels.
- Partner closely with Product, Engineering, and Platform leaders to ensure we can scale safely, with clear reliability guardrails and strong operational excellence.
This is a high‑impact, hands‑on leadership role, reporting to the VP of Cloud Platform, for someone who can move comfortably between strategy, architecture, and execution while coaching and empowering a senior, distributed team.
What will you do?
Strategy and leadership
- Own the global reliability strategy for dLocal’s platforms and services, aligning SRE goals with company and product objectives.
- Define and socialize SRE standards and principles (SLIs/SLOs/SLAs, error budgets, production readiness, incident management practices, capacity planning, etc.).
- Lead the SRE division: set org structure, define roles and scopes, and drive hiring, performance, and career development.
- Build a culture of high ownership, continuous improvement, and data‑driven decisions across all reliability‑related work.
Reliability, operations, and observability
- Ensure our most critical systems meet or exceed availability, latency, and performance targets.
- Oversee and continuously evolve incident management (on‑call strategy, incident response, communication, postmortems, follow‑ups, and KPIs).
- Own the strategy for observability and monitoring (metrics, logs, traces) and alerting across all environments, including tool selection, standards, and adoption.
- Drive operational excellence: reduce toil via automation, improve deployment safety, and standardize production practices across teams.
Architecture and technical direction
- Partner with Architecture, Platform, and Product Engineering leaders to define reliable, scalable architectures for our core systems and critical flows.
- Guide the adoption of best practices in automation and Infrastructure as Code (IaC) across SRE and dependent engineering teams.
- Sponsor and oversee large cross‑team reliability programs, such as major observability migrations, resilience testing frameworks, or reliability improvements for key products.
- Provide senior technical leadership on capacity planning, performance engineering, resilience, and disaster recovery.
People and cross‑functional collaboration
- Lead, mentor, and coach SRE Leaders, Technical Referents, and senior ICs, helping them grow in both technical depth and leadership.
- Collaborate closely with:
  - Product & Engineering to balance feature delivery and reliability.
  - Security, Cloud Platform, and Infrastructure to ensure secure and robust foundations.
  - Business stakeholders (e.g., Operations, Support, Commercial) to align on reliability expectations and SLAs.
- Communicate clearly about risk, trade‑offs, and priorities to both technical and non‑technical audiences, including senior leadership.
What skills do you need?
Must‑have
- Solid experience leading SRE / Production Engineering / Platform teams in high‑availability, high‑scale environments (experience in fintech, payments, or similarly critical domains is a plus).
- Proven track record of managing managers and senior ICs, and of building and scaling distributed technical teams.
- Deep hands‑on expertise in:
  - Reliability engineering: SLIs/SLOs, error budgets, capacity planning, resilience, and disaster recovery.
  - Incident management: on‑call models, incident response, postmortems, and continuous improvement of incident processes.
  - Observability and monitoring: metrics, logs, traces, alerting strategies, and the surrounding tool ecosystem.
  - Automation and IaC: strong familiarity with modern CI/CD pipelines, configuration management, and infrastructure as code.
- Ability to shape technical strategy, translate it into a clear roadmap, and ensure consistent execution across multiple teams.
- Excellent communication and influencing skills; comfortable driving alignment across Engineering, Product, and non‑technical stakeholders.
- Strong analytical and problem‑solving skills; able to operate effectively in ambiguous, fast‑changing contexts.
- Professional proficiency in English; comfortable working in a global, multi‑time‑zone, multicultural environment.
Nice to have
- Experience in payments / fintech or other regulated, mission‑critical industries.
- Hands‑on background as an SRE, Senior/Staff Engineer, or Platform Engineer before moving into leadership.
- Experience implementing or maturing:
  - Centralized observability platforms and unified alerting strategies.
  - Standardized production readiness reviews and reliability sign‑off processes.
  - Chaos engineering / resilience testing practices.