MMC Batch Engine for AWS Overview - Memory Machine Cloud

Goal

The MMC Engine captures the entire running state of an AWS Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. This ensures a high quality of service at the Batch level while using low-cost but unreliable Spot-based Compute Instances. If you want to deploy this solution in your AWS account, head to this link.

The MMC Batch Engine's key features include:

- Full integration into the customer Batch environment
- Automated checkpoint and restore
- No change to the customer workflow
- No change to the Job applications and Workflow Manager scripts
- Scalable across thousands of Batch Jobs and Compute Instances
- Secure data processing within the customer VPC

Architecture

The MMC Batch Engine integrates with AWS Batch on ECS. The architecture is shown below:

The AWS Batch and MMC components include:

1. In the AWS Batch Queue, under Compute Environment, each Compute Instance needs to have the MMC Engine installed. The MMC Batch Engine is responsible for checkpointing and restoring the Batch job containers, as well as for checkpoint image lifecycle management. It is shown as a solid blue block and is delivered as an installer package.
2. Each AWS Batch Queue requires dedicated storage for the MMC checkpoint images. The storage options include:
   a. An S3 bucket, with the open-source JuiceFS mapping file access to S3 object storage, shown as a shaded blue block.
   b. Alternatively, MMC checkpoint images can be stored directly in EFS, FSx for Lustre, or other shared file systems such as NFS.

User Experience

The Platform or DevOps team sets up the MMC Engine for each AWS Batch Queue:

1. Install the MMC Batch Engine on each Queue → Compute Environment → Compute Resources, either by
   - adding the MMC Batch Engine install script to the Launch Template → User data, or
   - using a custom AMI with the MMC Batch Engine installed in the Launch Template.
2. Provision storage for checkpoint images and attach it to each Compute Instance:
   - If the workload already uses shared file storage (EFS, FSx for Lustre), add another directory for checkpoint images.
   - If using an S3 bucket, install JuiceFS on each Compute Instance, following "Use JuiceFS on AWS" in the JuiceFS Document Center.
3. Under Compute Environment → Compute Resources:
   - Use Spot instances: "type": "SPOT"
   - Only use instance types from the same CPU family, for example "instanceTypes": [ "m6i.large", "m6i.2xlarge", "m6i.16xlarge" ]. The MMC Engine supports checkpoint and restore of an application across instances of the same CPU family, and ideally the same generation.
   - To ensure Batch jobs continue running when no Spot instances are available, create a secondary Compute Environment with On-Demand instances, attach it to the same Job Queue, and set it to a lower priority than the Spot environment.
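
The Compute Environment settings above can be sketched in Python as plain request fragments. This is a minimal sketch, not a complete AWS Batch request: the helper names and the environment names `mmc-spot-ce` and `mmc-ondemand-ce` are hypothetical, and a real `computeResources` block also needs fields such as subnets, security groups, and vCPU limits, which are omitted here. The sketch assumes that "same CPU family" can be checked by comparing the part of the instance type before the dot, and it relies on AWS Batch trying compute environments in ascending `order`, so the Spot environment gets the smaller number.

```python
import json


def instance_family(instance_type: str) -> str:
    """Return the family part of an EC2 instance type, e.g. 'm6i' for 'm6i.2xlarge'."""
    family, _, size = instance_type.partition(".")
    if not family or not size:
        raise ValueError(f"not an EC2 instance type: {instance_type!r}")
    return family


def spot_compute_resources(instance_types):
    """Build the Spot-related part of computeResources (other required fields omitted)."""
    families = {instance_family(t) for t in instance_types}
    if len(families) != 1:
        # Assumption: one family prefix means one CPU family and generation.
        raise ValueError(f"instance types span several families: {sorted(families)}")
    return {"type": "SPOT", "instanceTypes": list(instance_types)}


def compute_environment_order(spot_env: str, on_demand_env: str):
    """Order the queue so Spot is tried first and On-Demand is the fallback."""
    return [
        {"order": 1, "computeEnvironment": spot_env},
        {"order": 2, "computeEnvironment": on_demand_env},
    ]


if __name__ == "__main__":
    resources = spot_compute_resources(["m6i.large", "m6i.2xlarge", "m6i.16xlarge"])
    order = compute_environment_order("mmc-spot-ce", "mmc-ondemand-ce")  # hypothetical names
    print(json.dumps({"computeResources": resources, "computeEnvironmentOrder": order}, indent=2))
```

Rejecting a mixed list such as `["m6i.large", "c5.large"]` before the environment is created catches the one configuration mistake that would otherwise only surface when a restore lands on a different CPU family.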

For the end-user or workflow manager submitting jobs to the AWS Batch Queue, the Job Definition stays mostly the same, except for the following:

- Add the environment variable MMC_CHECKPOINT_MODE=true to enable the MMC Engine
- Set the retry attempts to the maximum of 10: "retryStrategy": { "attempts": 10 }
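
As an illustration of these two changes, the following Python sketch applies them to an existing Job Definition held as a dictionary. The function name `with_mmc_checkpointing` and the example job definition (name, image, command) are hypothetical. Only the `MMC_CHECKPOINT_MODE` variable and the `retryStrategy` value come from the text above, and the sketch assumes a container-type Job Definition with its variables under `containerProperties.environment`.

```python
import copy
import json


def with_mmc_checkpointing(job_definition: dict) -> dict:
    """Return a copy of a Batch job definition with the two MMC changes applied."""
    updated = copy.deepcopy(job_definition)
    container = updated.setdefault("containerProperties", {})

    # Replace any existing MMC_CHECKPOINT_MODE entry, keep all other variables.
    env = [e for e in container.get("environment", [])
           if e.get("name") != "MMC_CHECKPOINT_MODE"]
    env.append({"name": "MMC_CHECKPOINT_MODE", "value": "true"})
    container["environment"] = env

    # Raise the retry attempts to 10 while keeping other retry settings.
    updated["retryStrategy"] = {**updated.get("retryStrategy", {}), "attempts": 10}
    return updated


if __name__ == "__main__":
    # Hypothetical job definition, used only to show the before/after shape.
    original = {
        "jobDefinitionName": "example-job",
        "type": "container",
        "containerProperties": {
            "image": "example/image:latest",
            "command": ["python", "run.py"],
            "environment": [{"name": "STAGE", "value": "prod"}],
        },
    }
    print(json.dumps(with_mmc_checkpointing(original), indent=2))
```

The function works on a copy, so the original definition stays available for queues that do not use the MMC Engine.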

Spot Preemption Handling

For Compute Instances running as Spot Instances, the following diagram shows how the MMC Batch Engine protects jobs when an instance is preempted:

Compute Instance 1 has a root volume EBS 1, and two Jobs, A & B, are running on the instance.

When the Spot preemption warning arrives, the MMC Engine checkpoints both Jobs, including the processes inside each Docker container and the working files on EBS 1, and stores the checkpoint images on S3.

Batch then reschedules Job A onto another instance, Compute Instance 2. Normally, Job A would restart from the beginning, losing all previous work progress. Instead, the MMC Batch Engine restores the Job A processes and working files onto the attached EBS 2, so the previous work progress of Job A carries over to Compute Instance 2.

Likewise for Job B on Compute Instance 3.
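
The text does not say how the MMC Engine detects the preemption warning. For readers who want to see what such a warning looks like, the sketch below reads the standard EC2 Spot interruption notice from the instance metadata service (IMDSv2) and computes how much time is left before the instance is reclaimed. It is an illustration under that assumption, not a description of MMC internals. The function names are hypothetical, and `fetch_instance_action` only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 2.0):
    """Return the Spot interruption notice as JSON text, or None if none is pending."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    notice_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(notice_request, timeout=timeout) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption is scheduled
            return None
        raise


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Seconds between `now` and the interruption time stated in the notice."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


if __name__ == "__main__":
    notice = fetch_instance_action()
    if notice is None:
        print("no Spot interruption scheduled")
    else:
        remaining = seconds_until_interruption(notice, datetime.now(timezone.utc))
        print(f"interruption in {remaining:.0f} s; checkpoints must finish before then")
```

The remaining time is the budget within which checkpoints of all Jobs on the instance, such as Jobs A and B above, have to be written to the checkpoint storage.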