Help us use technology to make a big green dent in the universe!
Kraken powers some of the most innovative global developments in energy.
We’re a technology company focused on creating a smart, sustainable energy system. From optimising renewable generation and creating a more intelligent grid to enabling utilities to provide excellent customer experiences, our operating system for energy is transforming the industry around the world in a way that benefits everyone.
It’s a really exciting time in energy. Help us make a real impact on shaping a better, more sustainable future.
Kraken Customer
What we do: build the most AI-driven, innovative, forward-thinking platform for energy management. From optimising resources to delivering cost-effective, exceptional customer experiences through advanced Customer Information Systems (CIS), billing, meter data management, CRM, and AI-driven communications, Kraken is powering the next wave of innovation in the energy industry.
Why we do it: future energy will not look like energy as we know it today. We need to not just think about our future, but build for it. Now.
The Team
We have expanded our tentacles and are looking for someone (based in Melbourne, Australia or remote within Australia) to join our Global Platform Engineering Reliability - Product Reliability team.
Our Reliability group is responsible for architecting, developing, and maintaining the resilient and scalable infrastructure that powers and supports our platform.
As a Product Reliability Engineer within the newly created ‘Product Reliability’ team, you'll be responsible for ensuring the availability, performance and scalability of the products on our platform.
Your proficiency in supporting products that serve millions of customers will ensure stability and high performance for our brands and clients.
You’ll keep up with best practices in building products for scale. Your communication skills and attention to detail will be indispensable as you pinpoint areas for enhancement, ensure optimal product performance and continuously improve our reliability and efficiency.
What you'll do:
- Teach and support product teams on best practices for reliability, implementation patterns and effective usage of our existing platforms
- Support product teams in improving the performance and availability of their systems
- Be hands-on in code and infrastructure to help product teams with reliability improvements
- Provide comprehensive feedback to the wider Platform group on improvements to be made to core infrastructure based on observations and first-hand experience in the code base
- Support the build-out of proof-of-concept requirements in product teams as needed to evolve application deployment architecture to align with business growth as well as enhance scalability and system resilience
- Collaborate with product teams to support the release of new features and services, ensuring adherence to reliability and performance standards
- Guide product teams in designing systems for resilience and graceful failure under heavy load
- Assist application teams with post-incident tasks and follow-ups, and contribute to the creation and review of post-mortem documentation
- Analyse incident metrics to identify trends and potential improvements, communicating these insights to the product teams
- Help solve interesting and difficult problems. There’s a great opportunity for disruption in the global energy market.
What you'll have:
- Great communication skills, working effectively with developers, product managers and other business stakeholders to understand, design and deliver impactful projects and reliability improvements
- Solid hands-on experience across our core platform stack:
  - AWS (supporting and improving cloud infrastructure used by product teams)
  - Terraform (infrastructure as code; comfortable operating with Terraform day-to-day)
  - Kubernetes (container orchestration and deployment management; comfortable working with Kubernetes day-to-day)
- Experience using industry-standard observability tooling - we use Datadog, Grafana, Prometheus and Rootly (experience with other monitoring/alerting platforms is transferable)
- Strong collaboration and communication skills - able to work effectively with developers, product managers, and other stakeholders to design and deliver impactful observability “golden paths” and monitoring experiences
- Exposure to Python (or a similar language such as TypeScript, Go or C#) - able to understand how applications behave in production to support observability and reliability improvements
- Previous experience working in small, highly autonomous teams
A working style that fits how we operate:
- Comfortable with ambiguity and able to create structure in unclear situations
- Proactive learning mindset (experiment, iterate, and adapt as the team evolves approaches)
- Strong asynchronous written communication (Slack/Notion/docs) and a habit of keeping others in the loop
- Autonomy and accountability - making progress independently and owning outcomes
What will help:
- Previous experience as a Site Reliability Engineer
- Experience working on SaaS platforms, including engaging product teams to ensure up-skilling and knowledge sharing across teams
- Experience managing and supporting a large-scale, internet-facing service
- Experience in responding to incidents and outages, writing technical incident reports and organising incident retrospectives
- Experience working with very large relational databases
- Experience in using service level objectives to improve application performance
- A proactive, innovative mindset