AXDRAFT

Senior Site Reliability Engineer

AXDRAFT • IN
Python Hybrid
Onit, Inc. is looking for a Senior Site Reliability Engineer to join our Core Infrastructure team. This role helps ensure the reliability of a diverse set of applications across our AWS infrastructure. To be successful, you will need to collaborate and pair with team members and bring strong technical skills and a passion for technology. We are seeking someone who is skilled in observability, excellent at troubleshooting, and a strong problem solver. You must be able to multitask in a fast-paced environment and be a self-starter who can work independently.

Responsibilities

  • Troubleshoot deployment failures and infrastructure issues across our full AWS infrastructure stack (EKS, RDS). This includes dev, test, and production environments.
  • Create and maintain monitors for uptime and performance using Datadog, CloudWatch and other monitoring tools. 
  • Find ways to reduce errors in systems and noise in monitors and alerts.
  • Work with others on user stories to improve system health. 
  • Help create and prioritize work / stories. 
  • Participate in standups with the US and India teams.
  • Help define runbooks and automation to solve production problems. 
  • Troubleshoot applications from a configuration and logging perspective. 
  • Assist with responding to and analyzing security events from security tooling.  
  • Help train others to take on SRE responsibilities. 
  • Assist with performance optimization by identifying performance bottlenecks and recommending improvements.
  • Verify systems are monitored, backed up, and following best practices ... via audits and automation 
  • Investigate how to take better advantage of the tools we use for monitoring and security.

Requirements

  • Bachelor's degree in computer science or equivalent experience is required. 
  • 5+ years of experience with the following:
      • AWS (EC2, EKS, ECS, S3, RDS, CloudWatch, CloudTrail, IAM, AWS CLI, etc.). Experience with containers and EKS is a must.
      • Linux (CentOS, Amazon Linux, Ubuntu)
      • Git source code management (GitLab, GitHub)
      • Bash shell scripting or other scripting / programming experience
      • SaaS-based web application experience
      • Relational database performance and monitoring (Postgres RDS preferred)
      • Experience with Jenkins or similar CI/CD tooling
  • A solid understanding of the components that make up production systems (memory, CPU, disk space, disk I/O, network I/O, etc.) is required.
  • Strong experience with monitoring, alerting, and log aggregation tools: Datadog, AWS CloudWatch, PagerDuty, Statuspage. 
  • Ability to read and interpret application server logs and outputs, CloudTrail logs, and other critical logging output.
  • Excellent troubleshooting skills required. 
  • Product-based company experience is preferred.

Nice to Have Skills

  • Prior application coding and debugging experience (Ruby, Python, etc.) 
  • Terraform and/or CloudFormation 
  • Experience troubleshooting application integrations 
  • Other technologies: Cloudflare, AWS GuardDuty, CrowdStrike.
  • Experience with agentic AI automation.