In this session I will talk about a Kubernetes operator for Amazon EKS that detects Spot interruptions on GPU nodes, triggers ML job checkpointing, and recovers training by rescheduling the job on available nodes. I prepared it as an example of a repeatable workflow for maximizing GPU cost savings without losing progress on ML training jobs. The setup covered in this talk uses Amazon EventBridge, S3/EFS, IAM, and CloudWatch. I will present a live example showing node interruption handling and seamless ML job recovery with the solution I developed.