Zilliz

Senior Site Reliability Engineer Cloud Platform

Zilliz • US
GoPython Hybrid
Zilliz is a fast-growing startup developing the industry’s leading vector database company for enterprise-grade AI. Founded by the engineers behind Milvus, the world’s most popular open-source vector database, the company builds next-generation database technologies to help organizations quickly create AI applications. On a mission to democratize AI, Zilliz is committed to simplifying data management for AI applications and making vector databases accessible to every organization.

What you will do:

  • Work at the intersection of development and site reliability. Creating SRE tools and systems, as well as supporting existing infrastructure and platforms.
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems.
  • Develop and implement strategies for monitoring, incident management, and disaster recovery.
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention.
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness.
  • Collaborate with software engineers to enhance system reliability, scalability, and performance.
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes.
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency.
  • What we are looking for:

  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems.
  • Proficiency in scripting languages such as Python, Go, or Java.
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker.
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools.
  • Experience with infrastructure as code tools such as Terraform or Ansible.
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo.
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly.
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines.
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously.
  • Experience with Open Source Milvus Vector Database is nice to have