KTO

Platform Engineer - SRE

KTO • BR
Go • Python • Remote

Welcome to KTO Group, where innovation drives excitement in iGaming. Founded in 2018 by Andreas Bardun, we’re transforming online gaming with a focus on transparency and player satisfaction.
At KTO.com, we blend the thrill of sports betting with online casino entertainment, tailored to local markets and powered by our proprietary platform for a seamless, personalized experience.
KTO is a rising leader in LATAM, proudly ranked among Brazil’s top 10 iGaming brands. Join us as we set new standards in trust, innovation, and the future of iGaming.

We are looking for a Platform Engineer with a deep passion for Site Reliability Engineering (SRE). In this role, you won't just maintain systems; you will build the "paved roads" that our developers use to ship code quickly, safely, and reliably. You will design self-service infrastructure, architect robust deployment pipelines, and embed SRE best practices (SLIs, SLOs, Error Budgets) directly into our internal platform ecosystem.

What You Will Do (Impact & Responsibilities)

- Build Self-Service Infrastructure: Design and scale reusable Infrastructure as Code (IaC) modules for highly available infrastructure using Terraform, empowering development teams to provision resources autonomously and securely.
- Champion Platform Reliability: Partner closely with engineering teams to define, measure, and operationalize SRE metrics (SLIs, SLOs, and Error Budgets) to balance feature velocity with system stability.
- Elevate Developer Experience (DevEx): Architect frictionless, GitOps-driven CI/CD pipelines using GitHub Actions and ArgoCD that enable automated, secure, and progressive deployments (Blue/Green, Canary).
- Drive Advanced Observability: Architect a comprehensive, unified observability stack (Elastic Cloud, Grafana, Prometheus) covering APM, logs, and metrics. Implement event correlation to reduce alert fatigue and shorten Mean Time to Resolution (MTTR).
- Orchestrate at Scale: Manage and optimize our containerized ecosystem with Kubernetes and Helm, ensuring scalability, security, and resource efficiency.
- Mitigate Incidents Proactively: Act as a key responder for complex distributed system issues. Lead blameless post-mortems and implement automation to prevent recurring performance and availability problems.

Experience & Qualifications Required

  • Platform & Cloud Expertise: Solid hands-on experience designing and operating highly available production environments in AWS (or similar major cloud providers).
  • Infrastructure & Orchestration: Deep proficiency managing containerized workloads using Kubernetes and Helm, coupled with a strong track record of writing maintainable Terraform code.
  • Automation & GitOps: Expertise in designing modern CI/CD pipelines, specifically with GitHub Actions and GitOps workflows built on ArgoCD.
  • Observability Mastery: Strong operational experience with observability and telemetry tools like Elastic Cloud, Prometheus, and Grafana.
  • Software Engineering Foundation: Proficiency in at least one modern programming language (e.g., Python, Go, or Java) to build platform tooling, combined with solid Linux/Unix and shell scripting skills.
  • Data Systems Knowledge: Operational familiarity with managing and scaling SQL and NoSQL databases in a distributed environment.

Bonus Points (Nice to Have)

  • Internal Developer Platforms (IDP): Experience building or maintaining self-service developer portals (e.g., Backstage).
  • Advanced Telemetry: Deep knowledge of distributed tracing (e.g., OpenTelemetry, Jaeger) and correlation techniques across microservices.
  • Progressive Delivery: Experience implementing advanced deployment strategies like feature flags and automated canary analysis.
  • Industry Certifications: Relevant credentials such as CKA (Certified Kubernetes Administrator), AWS Solutions Architect, or DevOps-related certifications.