Apollo Research

Backend Engineer (Monitoring)

Apollo Research • GB
Python
Application deadline: We accept submissions until 16 January 2026. We review applications on a rolling basis and encourage early submissions.


THE OPPORTUNITY

Join our new AGI safety monitoring team and help transform complex AI research into practical tools that reduce risks from AI. As a Backend Engineer, you'll work closely with our CEO, monitoring engineers and Evals team software engineers to build tools that make AI agent safety accessible at scale. We are building tools that monitor AI coding agents for safety and security failures.

You will join a small team with significant ability to shape both the team and its technology, and you can earn responsibility quickly. You will like this opportunity if you care about building tools that genuinely make AI agents safe, thrive in fast-paced environments, and enjoy working closely with researchers.


KEY RESPONSIBILITIES

Infrastructure & Architecture
- Design and implement scalable backend systems capable of processing and analyzing large volumes of AI agent logs in real time
- Build and maintain data processing pipelines that extract, transform, and store agent trajectory data efficiently
- Architect database schemas and data models optimized for both high-throughput writes and complex analytical queries
- Design for reliability, implementing robust error handling, retry logic, and graceful degradation strategies
- Monitor system performance and optimize bottlenecks to ensure sub-second latency for critical monitoring operations

API Development
- Develop secure, well-documented RESTful APIs that allow users to integrate our monitoring tools into their workflows
- Implement authentication, authorization, and rate limiting to protect users' data and ensure fair resource usage
- Build webhook systems and real-time notification services to alert users about critical safety events
- Design APIs that are intuitive for developers while remaining flexible enough for diverse use cases
- Design and implement integrations with Security Information and Event Management (SIEM) systems, enabling users to stream monitoring alerts and security events into their existing security operations workflows

Data Systems
- Implement efficient storage solutions for both structured data (monitoring results, metadata) and unstructured data (agent logs, code outputs)
- Build data processing systems that can handle everything from streaming real-time monitoring to batch analysis of historical data
- Design and implement caching strategies to optimize frequent queries and reduce infrastructure costs
- Create data retention and archival policies that balance users' needs with storage efficiency

Monitoring & Observability
- Build comprehensive logging, metrics, and tracing systems to ensure visibility into system health and performance
- Implement alerting systems that notify the team of infrastructure issues before they impact users
- Create dashboards and tools that help the team understand system behavior and diagnose issues quickly
- Design systems that make debugging production issues straightforward and minimize time-to-resolution

Collaboration & Quality
- Work closely with our researchers to understand their needs and translate research prototypes into production-ready systems
- Collaborate with frontend engineers to design APIs and data structures that enable excellent user experiences
- Participate in code reviews to maintain high standards for code quality, security, and performance
- Document architectural decisions, API specifications, and system behaviors to facilitate knowledge sharing
- Contribute to technical discussions about technology choices, trade-offs, and implementation approaches

JOB REQUIREMENTS

  • 4+ years of experience building production backend systems at scale
  • Strong Python proficiency with experience in frameworks like FastAPI, Flask, or Django
  • Experience designing and implementing RESTful APIs with clear documentation
  • Solid understanding of database design and optimization (SQL and/or NoSQL)
  • Experience with cloud platforms (AWS, Google Cloud, or Azure) and containerization technologies (Docker, Kubernetes)
  • Experience building data-intensive applications or processing large-scale log data
  • Strong understanding of system design principles, including scalability, reliability, and security
  • Experience with asynchronous processing, message queues, and distributed systems
  • Demonstrated ability to write clean, well-tested, maintainable code

Bonus
  • Familiarity with real-time data processing frameworks (Kafka, Redis Streams, etc.)
  • Experience with ML/AI infrastructure or building tools for AI applications
  • Previous work on developer tools, monitoring systems, or security tools
  • Experience with infrastructure-as-code (Terraform, CloudFormation, etc.)
  • Familiarity with AI safety concepts or evaluation frameworks like Inspect
  • Contributions to open-source backend infrastructure projects
  • Experience building security-centric tools
  • Experience with code analysis platforms
  • Experience with Golang

We want to emphasize that people who feel they don't fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.

REPRESENTATIVE PROJECT

  • Real-time agent monitoring infrastructure: Design and build the backend system that processes AI coding agent outputs in real time to detect safety and security issues. Start by implementing a scalable ingestion pipeline that can accept agent logs via API, then build a processing system that routes logs through various monitors based on their characteristics. Implement a storage layer that efficiently handles both recent high-frequency queries and historical analysis. Add a notification system that alerts users when monitors detect concerning behaviors, with configurable thresholds and delivery methods. Throughout the project, ensure the system maintains sub-second p95 latency for critical operations while gracefully handling traffic spikes and partial system failures.

BENEFITS

  • Salary: 100k - 180k GBP (~135k - 245k USD)
  • Flexible work hours and schedule
  • Unlimited vacation
  • Unlimited sick leave
  • Lunch, dinner, and snacks are provided for all employees on workdays
  • Paid work trips, including staff retreats, business trips, and relevant conferences
  • A yearly $1,000 (USD) professional development budget

LOGISTICS

  • Start Date: Target of 2-3 months after the first interview
  • Time Allocation: Full-time
  • Location: The office is in London, and the building is next to the London Initiative for Safe AI (LISA) offices. This is an in-person role. In rare situations, we may consider partially remote arrangements on a case-by-case basis.
  • Work Visas: We can sponsor UK visas