Senior Site Reliability Engineer, Wikimedia Enterprise

Jobgether • SA
Go • Python • Remote

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Wikimedia Enterprise, in Saudi Arabia.

This role sits at the intersection of large-scale infrastructure engineering and mission-driven technology that powers global knowledge distribution systems. You will help design, operate, and evolve highly available, high-performance API and data infrastructure that supports large-scale reuse of Wikimedia content worldwide. The position involves deep technical ownership of reliability, scalability, and observability for critical services. You will work in a fully distributed, globally collaborative environment alongside experienced SREs, software engineers, and platform teams. The role combines hands-on engineering, incident response, and long-term reliability strategy. It also offers the opportunity to contribute to systems that directly affect how knowledge is accessed and reused across the internet. You will operate in a fast-paced, product-focused engineering culture with a strong emphasis on automation, experimentation, and continuous improvement.

Accountabilities

In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:

  • Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
  • Design and enhance observability systems including metrics, logging, and distributed tracing
  • Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
  • Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
  • Implement infrastructure-as-code and automation-first practices to reduce operational toil
  • Design and operate scalable cloud infrastructure across production environments
  • Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
  • Improve developer experience by enabling self-service infrastructure and streamlined workflows
  • Collaborate with security, software, and release engineering teams to embed reliability and security best practices
  • Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
  • Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
  • Contribute to platform engineering initiatives that standardize infrastructure across teams
  • Mentor peers and promote best practices in SRE, automation, and systems reliability

Requirements

This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
  • Proficiency in at least one programming language (Python, Go, or similar)
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure
  • Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD or similar tools)
  • Strong understanding of SRE principles including SLOs, SLIs, and error budgets
  • Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
  • Proven experience in incident response, on-call operations, and postmortem analysis
  • Ability to operate and optimize large-scale distributed systems with high availability requirements
  • Strong communication and collaboration skills in distributed, remote-first environments
  • Ability to document systems clearly and contribute to shared engineering knowledge
  • Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
  • Adaptability to fast-evolving, technology-driven environments

Benefits

  • Remote-first work model with global collaboration
  • Opportunity to work on high-impact systems supporting global knowledge platforms
  • Exposure to large-scale distributed systems and modern cloud-native architectures
  • Culture of engineering excellence, automation, and continuous improvement
  • Strong emphasis on learning, experimentation, and open collaboration
  • Competitive compensation adjusted to location and experience
  • Inclusive and diverse work environment with global team exposure
  • Opportunity to contribute to open knowledge infrastructure used worldwide