Platform Engineer - SRE
KTO • BR • Go • Python • Remote
Welcome to KTO Group, where innovation drives excitement in iGaming. Founded in 2018 by Andreas Bardun, we’re transforming online gaming with a focus on transparency and player satisfaction.
At KTO.com, we blend the thrill of sports betting with online casino entertainment, tailored to local markets and powered by our proprietary platform for a seamless, personalized experience.
KTO is a rising leader in LATAM, proudly ranked among Brazil’s top 10 iGaming brands. Join us as we set new standards in trust, innovation, and the future of iGaming.
We are looking for a Platform Engineer with a deep passion for Site Reliability Engineering (SRE). In this role, you won't just maintain systems—you will build the "paved roads" that our developers use to ship code quickly, safely, and reliably. You will design self-service infrastructure, architect robust deployment pipelines, and embed SRE best practices (SLIs, SLOs, Error Budgets) directly into our internal platform ecosystem.
What You Will Do (Impact & Responsibilities)
- Build Self-Service Infrastructure: Design and scale highly available Infrastructure as Code (IaC) modules using Terraform, empowering development teams to provision resources autonomously and securely.
- Champion Platform Reliability: Partner closely with engineering teams to define, measure, and operationalize SRE metrics (SLIs, SLOs, and Error Budgets) to balance feature velocity with system stability.
- Elevate Developer Experience (DevEx): Architect frictionless, GitOps-driven CI/CD pipelines utilizing GitHub Actions and ArgoCD, facilitating automated, secure, and progressive deployments (Blue/Green, Canary).
- Drive Advanced Observability: Architect a comprehensive, unified observability stack (Elastic Cloud, Grafana, Prometheus) covering APM, logs, and metrics. Implement event correlation to reduce alert fatigue and Mean Time to Resolution (MTTR).
- Orchestrate at Scale: Manage and optimize our containerized ecosystem utilizing Kubernetes and Helm, ensuring maximum scalability, security, and resource efficiency.
- Proactive Incident Mitigation: Act as a key responder for complex distributed system issues. Lead blameless post-mortems and implement automation to prevent recurring performance and availability bottlenecks.