Senior Site Reliability Engineer, Infrastructure
Jobgether • US
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Infrastructure in the United States.
This role focuses on building and scaling the observability foundation that powers global, high-performance cloud infrastructure environments. You will design and operate the end-to-end telemetry pipeline for physical datacenter systems and provisioning workflows, transforming raw hardware and system signals into actionable insights. The position sits at the core of infrastructure reliability, ensuring engineering and operations teams have the visibility they need to run large-scale distributed systems effectively. You will work across multiple domains, including datacenter operations, networking, provisioning, and platform engineering. The environment is highly technical, fast-growing, and mission-critical, requiring both deep systems expertise and strong cross-functional collaboration. This is a hands-on role where you will shape how infrastructure observability is built and scaled globally.
Accountabilities
- Design and build observability pipelines for datacenter and provisioning infrastructure, including telemetry ingestion from sources such as Redfish, IPMI, SNMP, and OpenTelemetry.
- Own the full observability stack, from data collection through storage, processing, visualization, and alerting using tools such as Grafana, Loki, and Mimir.
- Develop dashboards, metrics, and alerting systems that provide actionable insights for datacenter operations, networking, systems, and provisioning teams.
- Define and enforce standards for telemetry collection, observability design, and infrastructure monitoring across global environments.
- Partner with cross-functional engineering and operations teams to translate operational needs into measurable signals and reliable monitoring systems.
- Drive infrastructure-as-code practices for observability systems to ensure scalability, consistency, and maintainability.
- Continuously improve system reliability, visibility, and operational efficiency across large-scale infrastructure environments.
Requirements
- 5+ years of experience in site reliability engineering, platform engineering, or infrastructure engineering in production environments.
- Strong hands-on experience building observability systems, including metrics, logs, alerting, and monitoring pipelines.
- Familiarity with tools such as Grafana, Loki, Mimir, or similar observability platforms.
- Working knowledge of datacenter hardware telemetry protocols such as Redfish, IPMI, and/or SNMP.
- Strong Linux systems knowledge and experience operating production-grade infrastructure.
- Experience with infrastructure-as-code tools such as Terraform, Ansible, Chef, or equivalent technologies.
- Proven ability to collaborate across technical and operational teams in complex environments.
- Strong communication skills and ability to translate operational needs into engineering solutions.
Benefits
- Competitive salary within the $125,000–$135,000 range, based on experience and location
- 100% employer-paid medical, dental, and vision insurance for employees
- 401(k) retirement plan with employer matching and immediate vesting
- Annual professional development reimbursement and learning support
- Remote work support, including home office and internet stipends
- Generous PTO policy, paid holidays, and long-term sabbatical benefits
- Wellness, gym, and additional lifestyle reimbursements
- Inclusive and flexible remote-first work environment