One thing we at MemVerge have learned working in the HPC and batch computing arena for the past several years is that customer workloads are incredibly diverse, and optimization typically requires an entire team of specialists. Cloud-based Spot compute adds another layer of complexity to an already tricky mix...while everyone loves the idea of saving up to 90% versus on-demand by running on Spot compute, the inconvenient truth is that it isn't easy.

This post may risk upsetting the Seattle- and Mountain View-based cloud companies at the same time, so let me start by saying that the cloud-native batch tools really are a great option for many users, and we encourage you to learn about each so you can make the most informed decision. GCP launched its Batch service just last year, while AWS Batch has been around since 2016. For the surfers out there, AWS Batch is like the reliable longboard you've had for years: dependable and good for a variety of workloads. GCP Batch is like that new high-performance shortboard, designed for the nimble and adventurous HPC aficionado. If you're in an organization with deep technical resources and cloud experience, both AWS and Google Cloud's Batch services can probably be tuned to handle just about anything you throw at them. Even still...
- Long-running batch jobs that last hours to days have a high chance of being reclaimed
- Deadlines can be missed when jobs fail on Spot reclaims and have to restart, sometimes more than once
Real example of a Spot EC2 instance that only lasted ~13 minutes before being reclaimed by AWS
For short-running batch jobs that last from seconds to a few minutes, the time penalty for starting over is minimal. For longer-running jobs in data pipelines, however, such as those used by the Nextflow community, the inconvenience of restarting batch jobs from the beginning quickly escalates. This is especially true when tasks in subsequent stages cannot begin until all tasks in a prior stage are complete. Imagine kicking off a 12-hour pipeline late in the afternoon, expecting to review results by the time you are back at work the next morning. By running that workload on Spot compute, you risk both a delay in having results ready and excess cloud costs from re-running pipeline tasks that can each take multiple hours to complete. The quick simulation below shows how fast that penalty grows.
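To make that intuition concrete, here is a rough back-of-the-envelope simulation. It is a minimal sketch with made-up numbers (the reclaim rate and task lengths are assumptions, not measured Spot statistics), and it models a task that loses all progress and restarts from scratch every time its instance is reclaimed:

```python
import random

def simulate_wall_time(task_hours, reclaim_rate_per_hour, trials=10_000):
    """Monte Carlo estimate of wall-clock hours for a task that must
    restart from scratch whenever its Spot instance is reclaimed.
    Assumes reclaims arrive as a Poisson process (a simplification)."""
    total = 0.0
    for _ in range(trials):
        elapsed = 0.0
        while True:
            # Time until the next reclaim, exponentially distributed
            next_reclaim = random.expovariate(reclaim_rate_per_hour)
            if next_reclaim >= task_hours:
                elapsed += task_hours   # task finishes before a reclaim
                break
            elapsed += next_reclaim     # progress lost, start over
        total += elapsed
    return total / trials

# Hypothetical numbers: a 4-hour pipeline task, one reclaim every 10 hours on average
print(simulate_wall_time(task_hours=4, reclaim_rate_per_hour=0.1))
# A 10-minute task under the same reclaim rate barely notices
print(simulate_wall_time(task_hours=10 / 60, reclaim_rate_per_hour=0.1))
```

With those assumed numbers, the 4-hour task's expected wall time comes out noticeably longer than 4 hours, while the 10-minute task is essentially unaffected, which is exactly why restart-from-scratch hurts long pipeline stages the most.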
Things you may not have known about Spot compute on AWS and Google Cloud
1. Google Cloud provides a 30-second warning before your Spot VM is reclaimed
2. Google Cloud actually has two types of ephemeral VMs: “preemptible” and “Spot”
3. AWS provides a more generous 2-minute warning before your Spot EC2 instance is reclaimed
4. In our experience, GCP Spot VMs don’t always give you the full 30 seconds before the compute is reclaimed, which is a challenge for most checkpoint-and-restore-based approaches (a minimal sketch of polling for these reclaim notices follows this list)
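For context on what those warnings look like in practice, here is a minimal sketch of polling each cloud's instance metadata for a reclaim notice. The URLs are the publicly documented metadata endpoints, but verify them against current AWS/GCP docs before relying on them (instances that require IMDSv2 on AWS also need a session token first, which is omitted here). This is an illustration of the general pattern, not how Memory Machine Cloud does it:

```python
import time
import urllib.request

AWS_SPOT_ACTION = "http://169.254.169.254/latest/meta-data/spot/instance-action"
GCP_PREEMPTED = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

def aws_interruption_pending() -> bool:
    """True once AWS schedules this Spot instance for reclaim
    (the endpoint returns 404 until the ~2-minute notice is issued)."""
    try:
        with urllib.request.urlopen(AWS_SPOT_ACTION, timeout=1) as resp:
            return resp.status == 200
    except Exception:
        return False

def gcp_preemption_pending() -> bool:
    """True once GCP marks this Spot/preemptible VM as preempted."""
    try:
        req = urllib.request.Request(GCP_PREEMPTED, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=1) as resp:
            return resp.read().decode().strip() == "TRUE"
    except Exception:
        return False

if __name__ == "__main__":
    while True:
        if aws_interruption_pending() or gcp_preemption_pending():
            print("Reclaim notice received -- checkpoint and drain now")
            break
        time.sleep(5)  # with only a 30-second warning, even the poll interval matters
```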
So what do you do when your Spot instance gets reclaimed?
MemVerge first built its reputation by developing software that can snapshot data in-memory on on-prem servers (we call it AppCapsule), and we then applied it to the cloud. We launched first on AWS and AliCloud, then added support for GCP and Baidu Cloud, giving even cloud novices a way to automatically checkpoint any batch job, provision a new Spot instance, and recover and resume the job's running state, optimizing for both cost and wall time.
Read more about Memory Machine Cloud's many automation features here
Based on a VM policy you set, a reclaimed Spot instance can be replaced only by another Spot instance (“Spot Only”), or you can opt for “Spot First,” which lets individual batch jobs fall back to on-demand compute and finish there when Spot inventory is thin or unavailable entirely, ensuring your batch jobs complete on time, every time, with the most cost-optimal execution. The sketch below illustrates the general idea behind that fallback.
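Here is an illustrative sketch of the "Spot Only" vs. "Spot First" decision. It is not Memory Machine Cloud's actual implementation; the request_* functions are hypothetical stand-ins that simply simulate whether Spot capacity happens to be available:

```python
import random

class NoSpotCapacity(Exception):
    """Raised when no Spot inventory is available."""

def request_spot() -> str:
    # Pretend Spot inventory is thin 30% of the time (made-up number)
    if random.random() < 0.3:
        raise NoSpotCapacity()
    return "spot-instance"

def request_on_demand() -> str:
    return "on-demand-instance"

def provision(policy: str) -> str:
    if policy == "spot-only":
        while True:                      # keep trying: cheapest, but completion may slip
            try:
                return request_spot()
            except NoSpotCapacity:
                continue
    if policy == "spot-first":
        try:
            return request_spot()        # cheapest option when it's available
        except NoSpotCapacity:
            return request_on_demand()   # fall back so the job still finishes on time
    raise ValueError(f"unknown policy: {policy}")

print(provision("spot-first"))
```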
Book a demo with us using the button up top or check out some customer case studies here.
In my next installment I'll talk about right-sizing your compute at runtime to optimize each job, another area where today's cloud-native batch tools aren't always a great fit.