Senior Site Reliability Engineer (SRE) (Remote)

Remote

About Finom

Finom is a European tech startup headquartered in Amsterdam, and we’re on a journey towards revolutionizing the financial landscape for entrepreneurs worldwide. Our mission is to develop an all-in-one financial B2B solution that integrates banking functions, accounting, financial management, and invoicing into a seamless, mobile-first platform.

We recently closed a €115 million Series C equity round (around $133 million), bringing our total funding to approximately $346 million. This significant investment follows a $105 million growth funding round from General Catalyst, a long-term backer since 2021 known for supporting companies like Airbnb, HubSpot, KAYAK, and Stripe.

Finom's platform goes beyond traditional banking, offering invoicing and a growing suite of features, including AI-enabled accounting, aiming to simplify financial management for entrepreneurs. We're actively expanding our reach across key EU markets like Germany, France, the Netherlands, Italy, and Spain.

At Finom, we’re not just redefining the entrepreneurial experience; we’re empowering our employees to make a real difference. Your work matters, and your impact extends far beyond product metrics. We nurture innovation and an inspiring work environment where bold ideas thrive.

Maintaining our start-up spirit, we prioritize thorough research, swift implementation of solutions, and ensuring that every effort we make benefits our users, employees, partners, and, of course, our business.

We are looking for a Senior SRE to drive the design, implementation, and evolution of our Kubernetes-based platform in a multi-cloud environment (GCP/AWS). At Finom, SREs are not just executors of tasks; they are the architects of reliability.

This role requires strong ownership of reliability, scalability, and platform architecture for high-load, mission-critical systems operating 24/7.

What You Will Be Doing

Lead the Platform Evolution: Design and operate our Kubernetes ecosystem (GKE, multi-cluster) with a focus on high availability and zero-downtime operations.

Build "Paved Roads": Own and evolve our PaaS strategy, using GitOps (ArgoCD) and CI/CD (GitLab) to empower domain teams to deploy independently.

Architect Reliability: Define and implement our observability strategy across metrics, logs, and tracing (Prometheus, VictoriaMetrics, OpenTelemetry).

Drive Infrastructure-as-Code: Lead the automation of our infrastructure using Terraform, ensuring all resources are standardized and version-controlled.

Own the Error Budget: Partner with engineering teams to establish and manage SLOs, SLAs, and incident management frameworks.

Disaster Recovery Mastery: Design and participate in regular DR drills, implementing blue/green and active/passive strategies across regions to ensure service continuity.

Innovate Operations: Proactively apply AI-driven approaches to improve operational efficiency and automate bottleneck detection.

Who You Are

Production K8s Mastery: Strong hands-on experience managing Kubernetes (GKE preferred) in high-load, multi-cluster production environments.

Cloud Infrastructure: Deep experience with GCP (AWS is a strong plus) and Terraform for large-scale infrastructure.

GitOps Expertise: Solid experience with ArgoCD, GitLab CI, and the "Infrastructure as Code" philosophy.

Observability Expert: Deep knowledge of the Prometheus/Grafana stack and experience implementing tracing and logging at scale.

System Design: Proven ability to design highly available 24/7 systems with automated failover and rollback capabilities.

English Fluency: English at B2 level or higher for effective cross-functional communication.

Nice-to-Haves

Compliance Knowledge: Understanding of banking-grade standards like PCI DSS, GDPR, or ISO 27001.

Distributed Systems: Experience with Kafka (Confluent), RabbitMQ, or managing high-load Redis and PostgreSQL clusters.

AI for Ops: Experience using AI tools to improve alerting, anomaly detection, or engineering efficiency.

Security-Minded: Experience with Vault for secret management and credential rotation.

Our Infrastructure Landscape

Primary Cloud: GCP (~90%)

Orchestration & Deploy: GKE, ArgoCD, GitLab CI

Automation: Terraform

Data & Messaging: PostgreSQL, Kafka, Redis, RabbitMQ

Observability: Prometheus, Grafana, VictoriaMetrics, OpenTelemetry, Cloud Logging

Security: Vault
