Everseen: A leader in vision AI solutions for the world’s leading retailers.
The Role
We are seeking a Machine Learning Platform/Backend Engineer to design, build, and maintain scalable infrastructure that empowers our data scientists and machine learning engineers to develop, train, benchmark, and monitor machine learning models efficiently. You will be instrumental in shaping our internal Machine Learning Platform and driving automation, reproducibility, and performance across the machine learning lifecycle.
What you'll do
Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure).
Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring.
Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML, or custom schedulers) to automate data processing, training, and deployment pipelines.
Develop shared services for tracking model behavior and performance, versioning data and datasets, and managing artifacts (MLflow, DVC, or custom registries).
Document architecture, policies, and operational runbooks to ensure platform maintainability and transparency.
Contribute to CI/CD pipelines for ML models, integrating automated testing, deployment, and rollback mechanisms.
Build reusable components for data ingestion and model training.
Ensure compliance with data governance, security, and audit requirements.
Collaborate with
AI/ML Engineering team
Data Engineering team
Software Development Engineers
DevOps team
Product Managers
Security & Compliance teams
Profile and Skills
Strong programming skills (Python).
Hands-on experience with Kubernetes, Docker, and cloud services.
Experience with CI/CD tools (e.g., GitLab, Jenkins).
Understanding of ML training pipelines, data lifecycle, and model serving concepts.
Excellent communication and collaboration skills.
Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML).
Additional Skills
Understanding of the ML lifecycle, model versioning, and monitoring.
Experience with ML frameworks (e.g., TensorFlow, PyTorch).
Experience with GPU orchestration (e.g., NVIDIA GPU Operator, MIG).
Experience with Infrastructure as Code (e.g., Terraform).
Knowledge of data engineering tools (e.g., Snowflake, Databricks, BigQuery, Airbyte, Kafka).
Familiarity with feature stores and model registries.
Exposure to large-scale distributed systems and performance optimization.