Senior Site Reliability Engineer (Remote First)

Zensurance • CA

Remote

About Us:

Zensurance is redefining commercial insurance for Canadian businesses.

As a leading InsurTech, we make getting the right coverage simple, fast, and accessible through a digital-first experience. Our platform combines advanced technology with deep industry expertise to deliver tailored insurance solutions that help businesses thrive.

Zensurance has been recognized for its rapid growth and industry impact:

✅ Deloitte’s Technology Fast 50 (2023, 2024, 2025)

✅ Deloitte’s Technology Fast 500 (2024, 2025)

✅ Top Insurance Employers (2022)

At Zensurance, we value ownership, collaboration, and innovation. Our team thrives on solving complex challenges, challenging the status quo, and making a real impact in an industry ready for change.

If you’re looking to build something meaningful in a fast-growing, customer-focused company, we’d love to hear from you!

We are looking for a Senior Site Reliability Engineer (SRE) to join our Digital Platform!

As a Senior Site Reliability Engineer (SRE), reporting to the Team Lead, you will partner with the Engineering Department to drive the reliability, scalability, and performance of our production systems. Your focus will be on defining and implementing best practices across infrastructure security, observability, release engineering, and developer tooling to meet department-level operational requirements. You will also own our Incident Management process and automate operational tasks.

You will be expected to coach and mentor junior team members and to assist the Engineering Leadership Team in continuously improving craft capabilities.

This is a remote-first role within Canada. #LI-Remote

Responsibilities:

Write code and tools to automate repetitive, manual operational tasks to free up engineering time.

Participate in on-call rotations to rapidly detect, triage, and resolve system outages and emergencies.

Implement comprehensive observability (logging, tracing, metrics) and configure intelligent alerts to monitor system health in real time.

Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and manage the performance and availability of services.

Partner with development teams to ensure new services are designed for scalability, resilience, and reliability from the start.

Develop and test robust Disaster Recovery (DR) and failover procedures to ensure business continuity.

Perform other duties as assigned.

Requirements:

University degree or college diploma in a recognized technical, vocational or academic program (preferably in Engineering or Computer Science) or equivalent work experience

5+ years of experience as a Site Reliability Engineer

Proven experience with Terraform for provisioning and managing cloud infrastructure

Experience with Kubernetes

Experience with AWS as a cloud service provider

Demonstrated experience maintaining and improving an Incident Management process

Experience with a major observability platform (e.g., Prometheus, Grafana, Datadog, ELK Stack, Splunk, or New Relic)

Experience operating distributed systems and ensuring that services meet scalability, reliability, and uptime goals through strategies such as redundancy, failover, and monitoring

Experience with GitHub Actions as a tool for Continuous Integration/Continuous Delivery (CI/CD)

Experience with backup and recovery scenarios

Ability to communicate effectively and work in a collaborative style

A commitment to continuous improvement, continuous learning and knowledge sharing

Nice to have:

Prior experience in the insurance industry

Familiarity with Helm for deploying and managing applications on Kubernetes

AWS certifications (e.g., DevOps Engineer, Solutions Architect, or SysOps Administrator) and Datadog certifications are an asset

Experience with TypeScript
