This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Manager, Site Reliability Engineer - DGX Cloud in India.
This role offers the opportunity to lead a high-performing Site Reliability Engineering (SRE) team focused on cloud platform reliability, scalability, and operational excellence. You will oversee the end-to-end reliability of critical cloud services, drive automation to reduce toil, and implement robust monitoring and incident management practices. Collaborating with engineering, product, and customer-facing teams, you will influence architecture, performance, and security standards while fostering a culture of continuous improvement and innovation. The position emphasizes both technical leadership and people management, including mentorship, career development, and team growth. You will be instrumental in shaping operational strategies that support high-impact AI and cloud solutions. This is a high-visibility, hands-on leadership role requiring strategic vision and a deep technical skill set.
Accountabilities:
Lead, mentor, and inspire a team of Site Reliability Engineers, promoting collaboration, ownership, and technical excellence.
Define and enforce SRE best practices, including SLOs, SLIs, error budgets, and incident response processes.
Collaborate with engineering and product teams to design, deploy, and operate highly scalable and resilient cloud services.
Drive automation across the service lifecycle, reducing manual toil and improving operational efficiency.
Implement monitoring, logging, alerting, and tracing solutions to ensure system observability and performance.
Oversee incident management, lead post-mortems, and ensure lessons learned are incorporated into operational improvements.
Align cloud operations with security standards, compliance requirements, and enterprise best practices.
Requirements:
Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
10+ years of experience in Site Reliability Engineering, DevOps, or similar roles, including at least 5 years in leadership/management.
Proven experience operating and managing large-scale distributed cloud systems (AWS, GCP, Azure).
Expertise in Kubernetes, containerization, and microservices architecture.
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident management.
Hands-on experience with infrastructure automation tools (Terraform, Ansible, Chef, Puppet).
Proficiency in at least one programming language (Python, Go, or similar).
Deep understanding of Linux systems, networking (TCP/IP), and cloud security standards.
Experience building observability platforms using tools like Prometheus, Grafana, ELK Stack, Splunk, or Jaeger.
Exceptional leadership, communication, and problem-solving skills, with the ability to engage both technical and non-technical stakeholders.
Benefits:
Highly competitive salary and performance-based incentives.
Opportunity to lead a technically challenging, high-impact team in cloud and AI infrastructure.
Exposure to cutting-edge SRE practices and cloud platform architecture.
Professional development and mentorship opportunities for team growth.
Flexible working arrangements and a collaborative work environment.
Participation in strategic initiatives influencing platform reliability, scalability, and innovation.