UNA201

AWS EKS GPU Spot Recovery Operator for ML Workloads

Room: Una | Time: 10:00

In this session I will talk about a Kubernetes operator for AWS EKS that detects Spot interruptions on GPU nodes, triggers ML job checkpointing, and recovers training by rescheduling jobs on available nodes. I prepared it as an example of a repeatable workflow for maximizing GPU cost savings without losing progress on ML training jobs. The setup covered in this talk uses the AWS services EventBridge, S3/EFS, IAM, and CloudWatch. I will present a live example showing node interruption handling and seamless ML job recovery with the solution I developed.
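To make the detect, checkpoint, reschedule flow concrete, here is a minimal Python sketch of the first step only: recognizing a Spot interruption warning delivered through EventBridge and turning it into an ordered list of recovery steps. It is not the operator presented in the talk. The `source`, `detail-type`, and `instance-id` fields follow the event AWS publishes for Spot interruption warnings, while `RecoveryPlan`, `plan_recovery`, and the step names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Detail type AWS uses for the two-minute Spot interruption notice.
SPOT_WARNING_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


@dataclass
class RecoveryPlan:
    """Hypothetical container for what an operator might do next."""
    instance_id: str
    steps: List[str]


def plan_recovery(event: dict) -> Optional[RecoveryPlan]:
    """Return a recovery plan for a Spot interruption event, else None."""
    # Ignore anything that is not an EC2 Spot interruption warning.
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != SPOT_WARNING_DETAIL_TYPE:
        return None

    instance_id = event.get("detail", {}).get("instance-id")
    if not instance_id:
        return None

    # Hypothetical step names mirroring the flow in the abstract:
    # stop new pods landing on the node, checkpoint, then reschedule.
    steps = ["cordon-node", "request-checkpoint", "reschedule-job"]
    return RecoveryPlan(instance_id=instance_id, steps=steps)
```

A real operator would map the instance ID to a Kubernetes node and act through the Kubernetes API; this sketch stops at the decision so that it runs without cluster or AWS access.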

Natalie Serebryakova

Staff Cloud Engineer at IN-N-OUT.CLOUD

Contact Us

We are happy to hear from you! For information, feedback, or any other questions, please contact us at info@awscommunityadria.com

Credits

This website uses the open-source AWS Community Day Template built by AWSug.nl and is hosted on Amazon CloudFront and Amazon S3. The website uses Bootstrap and Hugo.