Halter

Site Reliability Engineer - Network Team

Halter • NZ
About the role

At Halter, we’re building more than software - we’re transforming the way the world farms. Our smart collars let farmers shift, monitor, and care for their cattle via deep integrations & insights. Behind it all is the Network Team, powering one of New Zealand’s largest private IoT networks with 400,000+ connected devices and counting.

We’re looking for a Site Reliability Engineer (SRE) to help scale our systems to a million animals and beyond. You’ll apply cloud-scale NRE practices to a wildly distributed, rural IoT network across multiple countries.

Our vision is to become the OS for farming globally. This isn’t your average backend gig - this one moos 🐮. You’re not just writing code - you’re ensuring availability for hundreds of thousands of animals and farmers who rely on Halter every single day.

What you'll do

  • 📈 Build & run observability for gateways, towers, and backend/edge services (metrics, logs, tracing, alerts; strong signal / low noise).
  • 🤖 Automate ops: golden configs, zero-touch provisioning, safe canaries/rollbacks, scheduled maintenance, and self-healing where sensible.
  • 🚨 Lead incidents end-to-end (runbooks, comms, mitigation, post-mortems) and drive fixes into code, configs, and process.
  • 🚀 Harden deploys: progressive rollouts for firmware/agent/service changes across thousands of devices and multi-region backends.
  • ⚙️ Performance tuning: reduce command/telemetry latency, smooth OTA pipelines, and de-risk noisy/unreliable links with back-pressure & retries.
  • 🧭 Capacity & readiness: plan headroom for spikes and growth; chaos engineering for failover paths (cellular ↔ satellite, region failover).
  • 📘 Own runbooks & SOPs that enable field teams and on-call to respond quickly and consistently.
  • 🤝 Partner with Network/RF engineers on coverage/capacity changes, interference hunts, and carrier/satellite escalations.
  • 🧭 Mentor teammates on SRE mindset, tools, and operational excellence.

Who we're looking for

  • 🧰 SRE/large-scale ops experience (cloud + distributed systems).
  • 💻 Strong automation & scripting (Python/Go/etc.) and IaC (Terraform/Ansible/etc.).
  • 📡 Solid networking fundamentals (TCP/IP, routing, VPNs, firewalls) + RF awareness (LoRa/LTE/sat a plus).
  • 🔭 Hands-on with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
  • 🧯 Proven incident management for high-availability systems.
  • 🧩 Performance tuning for latency-sensitive, unreliable-link environments.
  • 🐧 Comfortable in Linux across cloud and edge devices.
  • 📊 Data-driven: able to turn noisy telemetry into decisions (SQL or Jupyter a plus).
  • 🧠 Pragmatic problem-solver who balances reliability, speed, and cost.
  • ➕ Bonus: experience with IoT, off-grid, or field deployments. 🏕️

Network awareness (baseline, not deep-dive) 📡

You don’t need to be a routing/RF guru - we have those. You should be comfortable with:

  • 🌐 Basic L3 troubleshooting: ping/traceroute, IP/subnetting, DNS/DHCP/NAT basics, reading simple routes.
  • 📶 Reading link health: interpreting RSSI/SNR (LoRa) or RSRP/SINR (LTE) at a high level; spotting “link looks bad vs service is bad.”
  • 🛰️ Backhaul pragmatics: understanding failover states (cellular ↔ satellite), cost/perf trade-offs, and safe config rollout patterns.
  • 🗺️ Topology literacy: knowing what a gateway/tower/backhaul path looks like and where to put probes and alerts.