Senior Infrastructure Engineer - Observability - Remote from United Kingdom

Aircall • GB

Python Remote

Aircall is a unicorn AI-powered customer communications platform used by 22,000+ companies worldwide to grow revenue, resolve issues faster, and scale. We’re redefining what a customer communications platform can be by combining voice, SMS, WhatsApp, and AI into one seamless workspace.

Our momentum comes from a simple but powerful idea: help every customer-facing team work smarter, not harder. Aircall’s AI Voice Agent automates routine calls, AI Assist streamlines post-call tasks, and AI Assist Pro delivers real-time guidance that helps people do their best work. The result—companies grow revenue, deliver faster resolutions, and scale service.

We’ve built a product customers love and a business that scales fast. Aircall operates in nine global offices (Paris, New York, San Francisco, Sydney, Madrid, London, Berlin, Seattle, and Mexico City), and is backed by world-class investors. Our teams are shipping AI innovation faster than ever and expanding across new product lines and markets.

At Aircall, you’ll join a company in motion—ambitious, profitable, and product-driven—where impact is visible, decisions are fast, and growth is real.

How We Work at Aircall: We believe in customer obsession, continuous learning, and delivering extraordinary outcomes. We value open collaboration, taking ownership, and making smart, informed decisions with speed and precision. If you thrive in a fast-paced, team-driven environment where curiosity, trust, and impact matter, you'll fit right in.

We’re looking for an Observability Engineer to own and evolve Aircall’s monitoring, alerting, and observability stack. You’ll work cross-functionally with backend, frontend, and infrastructure teams to ensure our systems are transparent, measurable, and continuously improving in reliability and performance.

This role is ideal for someone passionate about observability-as-code, metric design, and helping engineering teams gain meaningful visibility into their systems.

Key Responsibilities:

Develop comprehensive observability best practices: Define and standardize guidelines for metrics, traces, and logs, ensuring consistent implementation and adoption across all engineering teams. This includes establishing naming conventions, data collection methodologies, and retention policies to ensure high-quality and actionable observability data whilst controlling cost and reducing waste.

Collaborate strategically with engineering teams: Partner closely with various engineering teams to enhance overall system reliability and performance. This involves actively participating in architectural reviews, defining clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and seamlessly integrating observability practices into continuous integration and continuous deployment (CI/CD) pipelines to promote a culture of "observability by design."

Automate monitoring setup and provisioning: Drive the automation of monitoring infrastructure through Infrastructure-as-Code (e.g., leveraging the Terraform Datadog provider) and develop intuitive self-service observability tools. This empowers engineering teams to rapidly provision and manage their monitoring resources, reducing manual overhead and accelerating time to insight.

Improve alerting hygiene and effectiveness: Continuously refine and optimize alerting mechanisms by meticulously tuning thresholds, implementing intelligent noise reduction strategies, and ensuring all alerts are directly aligned with potential business impact. The goal is to deliver timely, relevant, and actionable alerts that enable proactive incident response and minimize service disruption.

Train and empower product teams: Provide comprehensive training and ongoing support to product teams, enabling them to effectively utilize observability tools. This includes guiding them in building insightful dashboards that visualize key performance indicators and creating robust alerts that proactively detect issues within their respective services.

Evaluate and integrate advanced observability tools: Proactively research, evaluate, and integrate new and emerging observability tools and technologies as needed. This may include exploring solutions for OpenTelemetry adoption, advanced log aggregation platforms, distributed tracing systems, and other tools that enhance our overall observability capabilities and support the evolving needs of our infrastructure and applications.

Qualifications:

3-5 years of experience in observability within SRE, DevOps, or platform engineering roles.

Strong hands-on experience with Datadog (dashboards, monitors, synthetics, logs, APM, RUM).

Proficiency with Terraform or other Infrastructure-as-Code tools.

Solid understanding of Kubernetes, microservices, and cloud infrastructure (EKS, Lambda, RDS, S3, AWS networking).

Familiarity with distributed tracing and OpenTelemetry concepts.

Strong scripting skills (Python, Bash, or similar).

Experience defining and managing SLIs/SLOs and service-level observability frameworks.

Excellent collaboration and communication skills; you can work with both engineers and non-technical stakeholders.

Nice to Have:

Experience with incident management and on-call processes.

Exposure to data visualization or analytics tools beyond Datadog.

Knowledge of logging pipelines (e.g., Fluent Bit, Logstash).

Experience working in high-scale SaaS environments.

Previous experience in developer enablement or platform teams.
