TextNow

Site Reliability Engineer

TextNow • CA
Remote
We believe communication belongs to everyone. We exist to democratize phone service.  TextNow is evolving the way the world connects and that's because we're made up of people with curious minds who bring an optimistic, yet critical lens into the work we do.   We're the largest provider of free phone service in the nation. And we're just getting started.

Join us in our mission to break down barriers to communication and free the flow of conversation for people everywhere.

TextNow is looking for motivated Site Reliability Engineer to own infrastructure, monitoring, logging, ci/cd, reliability and everything in between!

What You'll Do

  • Ensure System Reliability: Design, build, and maintain scalable, resilient, and highly available systems to support TextNow’s infrastructure and services.
  • Automation & Infrastructure as Code: Develop and maintain automation using Terraform, Ansible, and other tools to enable efficient deployment, scaling, and operations of cloud-based systems (AWS preferred).
  • Incident Response & On-Call Support: Participate in an on-call rotation, troubleshoot issues, and drive incident resolution to minimize downtime and improve system performance. Conduct post-mortems and implement corrective actions to enhance reliability.
  • Performance Monitoring & Optimization: Implement and improve observability tools, logging, and monitoring solutions to identify and mitigate potential system issues proactively.
  • Collaboration & Cross-Team Engagement: Work closely with software engineers, DevOps, and product teams to align technical efforts with business objectives and improve system reliability from development to production.
  • Continuous Improvement: Identify areas for improvement in architecture, automation, and operational practices. Contribute to the design and implementation of new SRE best practices.
  • You'll be a great fit if you have:

  • Experienced in SRE/DevOps: You have 5+ years of experience in an operationally focused role, such as SRE, DevOps, or Infrastructure Engineering, with a deep understanding of reliability, scalability, and performance optimization.
  • Proficient with Key Technologies: Hands-on experience with AWS, GitHub, Terraform, Ansible, or similar tools to build and manage cloud infrastructure efficiently.
  • Incident Management Expert: You are comfortable handling production incidents, analyzing root causes, and implementing long-term fixes to prevent recurrence.
  • Automation & Observability Focused: Passionate about reducing toil through scripting and automation while ensuring robust observability using logging, metrics, and monitoring tools.
  • Collaborative & Impact-Driven: You enjoy working cross-functionally with engineers, product teams, and leadership to drive meaningful improvements to system reliability.