Site Reliability Engineer - Network Team

Halter • NZ

About the role

At Halter, we’re building more than software - we’re transforming the way the world farms. Our smart collars let farmers shift, monitor, and care for their cattle via deep integrations & insights. Behind it all is the Network Team, powering one of New Zealand’s largest private IoT networks with 400,000+ connected devices and counting.

We’re looking for a Site Reliability Engineer (SRE) to help scale our systems to a million animals and beyond. You’ll apply cloud-scale NRE practices to a wildly distributed, rural IoT network across multiple countries.

Our vision is to become the OS for farming globally. This isn’t your average backend gig - this one moos 🐮. You’re not just writing code - you’re ensuring availability for hundreds of thousands of animals and farmers who rely on Halter every single day.

What you'll do

📈 Build & run observability for gateways, towers, and backend/edge services (metrics, logs, tracing, alerts; strong signal / low noise).

🤖 Automate ops: golden configs, zero-touch provisioning, safe canaries/rollbacks, scheduled maintenance, and self-healing where sensible.

🚨 Lead incidents end-to-end (runbooks, comms, mitigation, post-mortems) and drive fixes into code, configs, and process.

🚀 Harden deploys: progressive rollouts for firmware/agent/service changes across thousands of devices and multi-region backends.

⚙️ Performance tuning: reduce command/telemetry latency, smooth OTA pipelines, and de-risk noisy/unreliable links with back-pressure & retries.

🧭 Capacity & readiness: plan headroom for spikes and growth; chaos engineering for failover paths (cellular ↔ satellite, region failover).

📘 Own runbooks & SOPs that enable field teams and on-call to respond quickly and consistently.

🤝 Partner with Network/RF engineers on coverage/capacity changes, interference hunts, and carrier/satellite escalations.

🧭 Mentor teammates on SRE mindset, tools, and operational excellence.

Who we're looking for

🧰 SRE/large-scale ops experience (cloud + distributed systems).

💻 Strong automation & scripting (Python/Go/etc.) and IaC (Terraform/Ansible/etc.).

📡 Solid networking fundamentals (TCP/IP, routing, VPNs, firewalls) + RF awareness (LoRa/LTE/sat a plus).

🔭 Hands-on with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).

🧯 Proven incident management for high-availability systems.

🧩 Performance tuning for latency-sensitive, unreliable-link environments.

🐧 Comfortable in Linux across cloud and edge devices.

📊 Data-driven: able to turn noisy telemetry into decisions (SQL or Jupyter a plus).

🧠 Pragmatic problem-solver who balances reliability, speed, and cost.

➕ Bonus: experience with IoT, off-grid, or field deployments. 🏕️

Network awareness (baseline, not deep-dive) 📡

You don’t need to be a routing/RF guru - we have those. You should be comfortable with:

🌐 Basic L3 troubleshooting: ping/traceroute, IP/subnetting, DNS/DHCP/NAT basics, reading simple routes.

📶 Reading link health: interpreting RSSI/SNR (LoRa) or RSRP/SINR (LTE) at a high level; spotting “link looks bad vs service is bad.”

🛰️ Backhaul pragmatics: understanding failover states (cellular ↔ satellite), cost/perf trade-offs, and safe config rollout patterns.

🗺️ Topology literacy: knowing what a gateway/tower/backhaul path looks like and where to put probes and alerts.
