Overview
In this role you will design, build, and optimize large-scale ML and data infrastructure across on-prem NVIDIA DGX clusters and AWS, enabling robust training pipelines and production-ready ML solutions. You will lead complex projects within the ML Platform team, delivering scalable, maintainable software and data workflows. You’ll collaborate with data science and engineering teams to integrate ML processes while advancing platform reliability and developer experience. This is a hands-on leadership position focused on impactful, end-to-end outcomes.
What You Will Do
* Architect and develop core components of the ML platform and data infrastructure for training, inference, and large-scale data processing
* Design and implement scalable GPU cluster and distributed data pipeline solutions on-prem and in AWS
* Lead project workstreams, ensuring timely delivery and alignment with the platform roadmap
* Build and optimize data pipelines for ingestion, transformation, storage, and retrieval supporting ML workflows
* Write clean, efficient code in multiple languages for services, tooling, and automation
* Collaborate with data science and engineering teams to integrate ML and data workflows (feature stores, model registries)
* Implement CI/CD for ML/data workflows using Argo Workflows, ArgoCD, and GitHub Actions
* Maintain observability and logging (Prometheus, Grafana, CloudWatch, ELK/OpenSearch) and drive SLOs and cost-awareness
* Operate AWS services (EKS, EC2, VPC, IAM, S3, EFS, Batch) across hybrid environments, contributing to security and compliance
* Continuously improve platform reliability, performance, and developer experience; stay current on MLOps/data engineering practices
Requirements
* 8+ years of software engineering experience
* Strong proficiency in Python and at least one of Go, C++, Rust, or Bash
* Solid grasp of data structures, algorithms, concurrency, networking, and distributed systems
* Experience with Kubernetes, Helm, Argo Workflows, ArgoCD, and Docker
* Ability to pass coding assessments that demonstrate clean code
* Exposure to AWS (EKS, EC2, VPC, IAM, S3, EFS, Batch) and CI/CD automation
* Nice-to-have: hands-on experience with GPU clusters and ML frameworks (TensorFlow, PyTorch)