UNA201

AWS EKS GPU Spot Recovery Operator for ML Workloads

Room: Una | Time: 10:00

In this session I will present a Kubernetes operator for AWS EKS that detects Spot interruptions on GPU nodes, triggers ML job checkpointing, and recovers training by rescheduling the jobs on available nodes. I prepared it as an example of a repeatable workflow for maximizing GPU cost savings without losing progress on ML training jobs. The setup covered in this talk uses the AWS services EventBridge, S3/EFS, IAM, and CloudWatch. I will give a live demo of node interruption handling and seamless ML job recovery using the solution I developed.
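
To make the detect, checkpoint, reschedule flow concrete, here is a minimal Python sketch of how an interruption event might be turned into an ordered recovery plan. It is not the operator presented in the talk. The `RecoveryPlan` type, the `plan_recovery` function, and the step names are hypothetical and chosen only for illustration. The event fields (`source`, `detail-type`, `detail.instance-id`) follow the shape of the EC2 Spot Instance Interruption Warning delivered through EventBridge.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# detail-type that EC2 publishes to EventBridge ahead of a Spot reclaim
SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


@dataclass
class RecoveryPlan:
    """Hypothetical container: which instance is affected and what to do."""
    instance_id: str
    steps: List[str] = field(default_factory=list)


def plan_recovery(event: dict) -> Optional[RecoveryPlan]:
    """Return an ordered recovery plan for a Spot warning, or None otherwise."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != SPOT_WARNING:
        return None
    instance_id = event.get("detail", {}).get("instance-id")
    if not instance_id:
        return None
    # Assumed ordering: stop new pods landing on the node, save training
    # state to shared storage (e.g. S3/EFS), then reschedule elsewhere.
    steps = ["cordon-node", "checkpoint-job", "reschedule-job"]
    return RecoveryPlan(instance_id=instance_id, steps=steps)


# Illustrative event with a made-up instance ID
example_event = {
    "source": "aws.ec2",
    "detail-type": SPOT_WARNING,
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

plan = plan_recovery(example_event)
```

In a real operator each step name would map to Kubernetes API calls and a job-specific checkpoint hook. Keeping the planning step a pure function makes the ordering easy to test without a cluster.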

Natalie Serebryakova
Staff Cloud Engineer at IN-N-OUT.CLOUD
