About Us:
Zensurance is redefining commercial insurance for Canadian businesses.
As a leading InsurTech, we make getting the right coverage simple, fast, and accessible through a digital-first experience. Our platform combines advanced technology with deep industry expertise to deliver tailored insurance solutions that help businesses thrive.
Zensurance has been recognized for its rapid growth and industry impact.
At Zensurance, we value ownership, collaboration, and innovation. Our team thrives on solving complex challenges, challenging the status quo, and making a real impact in an industry ready for change.
If you're looking to build something meaningful in a fast-growing, customer-focused company, we’d love to hear from you!
We are looking for a Senior Site Reliability Engineer (SRE) to join our Digital Platform!
As a Senior Site Reliability Engineer (SRE), reporting to the Team Lead, you will partner with the Engineering Department to drive the reliability, scalability, and performance of our production systems. Your focus will be on defining and implementing best practices across infrastructure security, observability, release engineering, and developer tooling to meet department-level operational requirements. You will own our Incident Management process and automate operational tasks.
In addition, you will be expected to coach and mentor less experienced professionals and to assist the Engineering Leadership Team in continuously improving craft capabilities.
This is a remote-first role within Canada. #LI-Remote
Responsibilities:
Write code and tools to automate repetitive, manual operational tasks to free up engineering time.
Participate in on-call rotations to rapidly detect, triage, and resolve system outages and emergencies.
Implement comprehensive observability (logging, tracing, metrics) and configure intelligent alerts to monitor system health in real time.
Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and manage the performance and availability of services.
Partner with development teams to ensure new services are designed for scalability, resilience, and reliability from the start.
Develop and test robust Disaster Recovery (DR) and failover procedures to ensure business continuity.
Perform other duties as assigned.
Requirements:
University degree or college diploma in a recognized technical, vocational or academic program (preferably in Engineering or Computer Science) or equivalent work experience
5+ years of experience as a Site Reliability Engineer
Proven experience with Terraform for provisioning and managing cloud infrastructure
Experience with Kubernetes
Experience with AWS as a cloud service provider
Demonstrated experience maintaining and improving an Incident Management process
Experience with a major observability platform (e.g., Prometheus, Grafana, Datadog, ELK Stack, Splunk, or New Relic)
Experience with distributed systems to ensure that services meet scalability, reliability and uptime goals by implementing strategies like redundancy, failover solutions, and monitoring
Experience with GitHub Actions as a tool for Continuous Integration/Continuous Delivery (CI/CD)
Experience with backup and recovery scenarios
Ability to communicate effectively and work in a collaborative style
A commitment to continuous improvement, continuous learning and knowledge sharing
Nice to have:
Prior experience in the insurance industry
Familiarity with Helm for deploying and managing applications on Kubernetes
AWS and Datadog certifications, including DevOps Engineer, Solutions Architect, or SysOps Administrator
Experience with TypeScript