This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Site Reliability Engineer in the United States.
This is a senior, hands-on role within a small, high-leverage SRE team, responsible for ensuring the reliability, scalability, and security of a high-growth digital financial platform. The Staff SRE will architect, automate, and optimize cloud infrastructure, focusing on operational excellence and system resilience. You will collaborate closely with engineering, product, and security teams to embed reliability into every layer of the platform while mentoring fellow engineers and shaping long-term infrastructure strategy. This role provides the opportunity to directly impact platform performance, member trust, and product velocity through robust monitoring, incident prevention, and automation. You will lead initiatives across GCP environments, cloud networking, Kubernetes, and IaC, while exploring innovative automation solutions, including LLM-driven tooling, to reduce toil and improve operational efficiency. This position is ideal for a systems thinker who thrives in ambiguous, high-impact environments and wants to build resilient, scalable services for millions of users.
Accountabilities:
Lead architecture and automation across cloud infrastructure, ensuring reliability, scalability, security, and cost-effectiveness.
Define and operate SLIs, SLOs, and error budgets, translating reliability goals into measurable business outcomes.
Design and optimize multi-region, disaster recovery, and capacity planning strategies to support platform growth.
Manage and optimize cloud networking, including VPC architecture, ingress/egress, Cloud Armor, VPN, and DNS.
Drive infrastructure-as-code and GitOps practices using Terraform, Kubernetes, Helm, and ArgoCD to enable repeatable, predictable deployments.
Mentor SREs and infrastructure engineers through hands-on collaboration, design reviews, and incident retrospectives.
Partner with cross-functional teams to align platform decisions with product velocity, security, and long-term durability.
Requirements:
8+ years of experience in software, infrastructure, or site reliability engineering.
5+ years of hands-on experience operating production systems in GCP (compute, networking, storage, IAM, observability).
Deep experience with Kubernetes (GKE), Helm, containerization, Terraform (IaC), and ArgoCD.
Strong programming skills in Python, Go, or TypeScript/JavaScript for automation and internal tooling.
Proven ability to define and operate against SLIs, SLOs, and error budgets.
Strong knowledge of relational and distributed databases (e.g., MySQL, Cloud SQL, Cloud Spanner, Redis), including performance tuning and HA strategies.
Experience leading incident response, root cause analysis, and systemic remediation.
Bonus: Experience in fintech or regulated environments, CI tooling familiarity, and high-growth startup experience.
Benefits:
Competitive compensation and benefits package.
Premium Medical, Dental, and Vision Insurance plans.
401(k) savings plan with matching contributions.
Flexible PTO and generous company holidays, including Juneteenth and Winter Break.
Paid parental and caregiver leave.
Flexible hours with a virtual-first work culture and home office stipend.
Opportunities for professional growth, mentorship, and impactful work on a high-growth platform.
Company-sponsored in-person and virtual events for team connection.