Site Reliability Engineer

Python Remote

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States.

In this role, you will play a critical part in ensuring the reliability, scalability, and performance of modern, user-facing systems. You’ll work at the intersection of software engineering and operations, building robust infrastructure and driving automation to support high-quality service delivery. The position offers the opportunity to design resilient systems, improve operational efficiency, and proactively address risks before they impact users. You will collaborate closely with cross-functional teams to enhance system design and implement best practices in observability and incident response. This environment values continuous improvement, innovation, and data-driven decision-making. It’s an ideal role for someone who thrives in fast-paced environments and is passionate about building reliable, scalable platforms.

Accountabilities:

Ensure high availability, reliability, and scalability of production systems and services
Develop and maintain automation tools for deployments, configuration management, and operational workflows
Implement and manage monitoring and alerting systems to provide real-time visibility into system health
Respond to, troubleshoot, and resolve incidents while conducting post-mortems to prevent recurrence
Define and monitor Service Level Objectives (SLOs) and performance indicators
Perform capacity planning and resource forecasting to support system growth
Collaborate with engineering teams to identify operational risks and improve system architecture
Analyze system and application metrics to drive performance optimization initiatives

Requirements:

Minimum of 5 years of experience in IT, software engineering, or technology operations roles
At least 2 years of hands-on experience in Site Reliability Engineering, DevOps, or observability-focused roles
Strong expertise in cloud platforms such as AWS or Azure
Solid understanding of distributed systems, networking, storage, and operating systems
Experience with infrastructure as code tools (e.g., Terraform) and containerization technologies (e.g., Docker)
Proficiency with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar
Programming or scripting skills in languages such as Python, Ruby, or JavaScript
Strong problem-solving skills and the ability to work collaboratively across teams
Excellent communication skills with a proactive and detail-oriented mindset

Benefits:

Competitive salary with performance-based bonus opportunities
Comprehensive medical, dental, and vision insurance
Generous paid time off and company holidays
401(k) plan with employer matching contributions
Paid parental leave and family support programs
Flexible and collaborative work environment
Opportunities for professional growth and skill development
