Platform Engineer - Observability (Mid Level)

Go, TypeScript, Python | Hybrid

Help us use technology to make a big green dent in the universe!

Kraken powers some of the most innovative global developments in energy.

We’re a technology company focused on creating a smart, sustainable energy system. From optimising renewable generation and creating a more intelligent grid to enabling utilities to provide excellent customer experiences, our operating system for energy is transforming the industry around the world in a way that benefits everyone.

It’s a really exciting time in energy. Help us make a real impact on shaping a better, more sustainable future.

Kraken Customer

What we do: build the most AI-driven, innovative, forward-thinking platform for energy management. From optimising resources to delivering cost-effective, exceptional customer experiences through advanced Customer Information Systems (CIS), billing, meter data management, CRM, and AI-driven communications, Kraken is powering the next wave of innovation in the energy industry.

Why we do it: future energy will not look like energy as we know it today. We need to not just think about our future, but build for it. Now.

The Team

We have expanded our tentacles and are looking for someone (based in Melbourne, Australia) to join our Global Platform Engineering Reliability - Observability team.

Our Reliability group is responsible for architecting, developing, and maintaining the resilient and scalable infrastructure that powers and supports our platform.

We’re building a brand-new Observability team at Kraken and we’re looking for a Platform Engineer II to grow with it. Your core focus will be contributing to the availability, performance and scalability of products across Kraken - helping engineering teams get the visibility they need to build and operate reliable services with confidence. You’ll work alongside the team to improve monitoring and alerting across our product suite and support engineering teams as they transition to on-call over the next six months.

We operate with a high degree of autonomy and trust. That means you'll sometimes need to navigate unclear requirements, ask questions and move forward with incomplete information. You won't be handed a perfect spec for every task - but you will have a supportive team, clear goals and the space to develop your skills as the team evolves.

You’ll have the opportunity to shape how observability is done at Kraken and your contributions will have a direct, visible impact.

What you'll do:

Support and implement the monitoring and alerting strategy across Kraken’s customer business

Define and uphold observability best practices across multiple products and platforms

Partner with product teams to implement observability tooling and improve reliability across the organisation

Help product teams build best-in-class dashboards for their requirements or bespoke use cases

Work with product teams to define and implement meaningful Service Level Objectives (SLOs) and Service Level Indicators (SLIs), aligned to contractual Service Level Agreements (SLAs)

Build, tune, and continuously improve alerts and monitors using golden signals (latency, traffic, errors, saturation) as a framework - reducing noise and increasing actionable signal

Help product teams transition to on-call models by improving signals, alert quality, and operational readiness

Improve tooling and self-service capabilities for alerting and monitoring across multiple product teams

Analyse incident metrics to identify trends and improvement opportunities, communicating insights clearly back to product teams

Manage the cost and usage of our observability tooling stack in collaboration with FinOps

Contribute to broader platform reliability infrastructure improvements where needed

Help solve interesting and difficult problems - there’s a significant opportunity for disruption in the global energy market

What you'll have:

Solid hands-on experience across our core platform stack:

AWS (supporting and improving cloud infrastructure used by product teams)

Terraform (infrastructure as code; comfortable operating with Terraform day-to-day)

Kubernetes (container orchestration and deployment management; comfortable working with Kubernetes day-to-day)

Experience using industry-standard observability tooling - we use Datadog, Grafana, Prometheus and Rootly (experience with other monitoring/alerting platforms is transferable)

Strong collaboration and communication skills - able to work effectively with developers, product managers, and other stakeholders to design and deliver impactful observability “golden paths” and monitoring experiences

Exposure to Python (or a comparable language such as TypeScript, Go, or C#) - able to understand how applications behave in production to support observability and reliability improvements

Previous experience working in small, highly autonomous teams

A working style that fits how we operate:

Comfortable with ambiguity and able to create structure in unclear situations

Proactive learning mindset (experiment, iterate, and adapt as the team evolves approaches)

Strong asynchronous written communication (Slack/Notion/docs) and a habit of keeping others in the loop

Autonomy and accountability - making progress independently and owning outcomes

What will help:

Previous experience working in a data-focused or Observability team

Experience working on SaaS platforms, including working with product teams on upskilling and knowledge sharing

Experience building observability tooling to support large-scale internet-facing services

Experience instrumenting and diagnosing issues with very large relational databases

Familiarity with PostgreSQL (or similar RDBMS), particularly Amazon RDS at scale

Experience using SLOs to drive meaningful performance and reliability improvements
