DevOps/Observability Engineer

Remote

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a DevOps/Observability Engineer based in the United States.

This role sits at the core of modern cloud infrastructure reliability, focused on building and scaling a next-generation observability platform for complex, distributed systems. You will design and implement end-to-end monitoring, logging, and telemetry pipelines that provide deep visibility across large-scale cloud environments. The position requires strong expertise in cloud-native architectures, with a focus on AWS, Kubernetes, and open-source observability tooling. You will play a key role in unifying metrics, logs, and traces using technologies such as OpenTelemetry, Prometheus, Grafana, and Splunk. Operating in a fast-paced, engineering-driven environment, you will collaborate closely with platform and DevOps teams to improve system reliability, performance, and cost efficiency. This is a highly technical, hands-on role where your work directly strengthens the stability and scalability of mission-critical systems.

Accountabilities:

Design and implement end-to-end observability architectures using OpenTelemetry, Prometheus, Grafana, and related tools across cloud environments.
Build and maintain centralized observability pipelines across multi-account AWS environments, including CloudWatch, CloudTrail, and VPC Flow Logs.
Develop scalable log aggregation and routing strategies, including filtering, noise reduction, and integration with systems such as Splunk HEC.
Create advanced alerting frameworks and high-quality dashboards using Alertmanager, CloudWatch Alarms, and Grafana with PromQL.
Deploy and manage observability infrastructure using Infrastructure as Code tools such as Terraform.
Support Kubernetes and container-based observability across EKS and ECS environments.
Optimize observability systems for performance, cost efficiency, and scalability in large-scale production environments.
Collaborate with engineering teams to improve system reliability, monitoring standards, and incident response capabilities.

Requirements:

8+ years of experience in DevOps, Site Reliability Engineering, or Observability Engineering roles.
Strong hands-on experience designing unified observability pipelines using OpenTelemetry, Prometheus, and Grafana.
Deep expertise in AWS observability services including CloudWatch, CloudTrail, and cross-account telemetry strategies.
Proven ability to build and manage large-scale log aggregation systems and optimize high-volume data pipelines.
Strong experience with Kubernetes (EKS) or containerized environments (ECS) in production settings.
Advanced proficiency with Terraform or other Infrastructure as Code tools for infrastructure and observability deployments.
Experience building alerting systems, dashboards, and monitoring frameworks for distributed systems.
Strong understanding of cost optimization strategies for observability platforms (log filtering, metric reduction, storage tiering).
Excellent problem-solving, debugging, and collaboration skills in complex cloud-native environments.

Benefits:

Competitive compensation aligned with experience and market benchmarks.
Remote work flexibility within the United States.
Opportunity to work on large-scale, AI-driven, cloud-native infrastructure systems.
Exposure to enterprise clients and high-impact digital transformation projects.
Hands-on experience with leading observability and cloud technologies in production environments.
Strong learning and upskilling culture in AI, cloud, and platform engineering.
Collaborative, high-performance engineering environment focused on innovation and reliability.
Opportunity to shape next-generation observability practices at scale.
