AppCapsule++, the 2nd generation of MMCloud snapshot technology to support bigger workloads and reduce overhead

YT Liang, Wanqiang Ma, Shuonan Zou 2023-12-20

What is AppCapsule and what problems does it solve?

AppCapsule captures a snapshot of everything needed to restore an application at a given moment in time: all of its data in memory, all of its application state, and its data on storage. With an AppCapsule, we can move the application runtime from one machine to another.

Three problems that AppCapsule can help to solve -

  1. Spot reclaim - When a spot eviction event occurs, Memory Machine Cloud creates an AppCapsule and moves it to another spot instance, where the job continues. You save cost by running jobs on spot instances while keeping all job progress, so the job does not have to start over from the beginning. This feature is also known as SpotSurfer.
  2. OOM (Out Of Memory) - Memory Machine Cloud detects when an OOM condition is about to occur, creates an AppCapsule, and moves it to another instance, so the application does not crash and the job keeps running.
  3. Right-sizing - Vertical scaling, either down to avoid over-provisioning or up to give the application more resources when it needs them, for shorter wall-time and possibly lower total cost. You can trigger it manually with the float CLI command or the OpCenter Web GUI, or let Memory Machine Cloud do it automatically (a.k.a. WaveRider).

What is AppCapsule++ and what benefits does it bring?

Three enhancements and benefits that AppCapsule++ brings to the users -

  1. Saving only the delta - Instead of saving a full snapshot each time, AppCapsule++ saves only the difference (the memory change) since the previous snapshot. As a result, the time required for the final snapshot depends on how much memory has changed rather than on total memory usage.
  2. Adaptive trigger - Instead of snapshotting at a fixed, pre-determined interval, AppCapsule++ automatically monitors the memory change and takes a snapshot when it reaches a threshold, which is calculated from the cloud service provider’s pre-emption window. You no longer need to guess a fixed interval for periodic snapshots.
  3. Asynchronous offloading - Instead of synchronous snapshotting, which pauses the application, AppCapsule++ offloads data asynchronously, which shortens the time the application is frozen. The benefit is a lower impact on job completion wall-time.
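
To make the first enhancement concrete, here is a minimal Python sketch of full versus delta snapshots. It is a toy model, not MMCloud's implementation: memory is represented as a dictionary of pages, and the names `full_snapshot`, `delta_snapshot`, and `restore` are hypothetical. It also shows why a restore needs the base snapshot plus every delta taken after it.

```python
from typing import Dict, List

Page = bytes
Memory = Dict[int, Page]  # page number -> page contents (toy model, an assumption)


def full_snapshot(memory: Memory) -> Memory:
    """Copy every page, as a full snapshot would."""
    return dict(memory)


def delta_snapshot(memory: Memory, previous: Memory) -> Memory:
    """Copy only the pages that are new or changed since `previous`.

    Freed pages are ignored to keep the sketch short.
    """
    return {n: p for n, p in memory.items() if previous.get(n) != p}


def restore(base: Memory, deltas: List[Memory]) -> Memory:
    """Rebuild memory from the base snapshot plus every delta, in order."""
    state = dict(base)
    for delta in deltas:
        state.update(delta)
    return state


memory: Memory = {n: b"\x00" for n in range(8)}
base = full_snapshot(memory)           # 8 pages saved

memory[2] = b"\x01"
d1 = delta_snapshot(memory, base)      # only page 2 saved
after_d1 = restore(base, [d1])

memory[5] = b"\x02"
d2 = delta_snapshot(memory, after_d1)  # only page 5 saved

restored = restore(base, [d1, d2])     # needs the base and both deltas
```

In this model each delta holds one page instead of eight, which is why the final snapshot is quick, but dropping `d1` would lose the change to page 2.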

Below is an overall comparison table of AppCapsule and AppCapsule++

| Comparison | AppCapsule | AppCapsule++ |
| --- | --- | --- |
| Snapshot content | Full snapshot | Delta snapshot |
| Snapshot trigger | Periodic (done by MMCloud) or manual (done by user’s script) | Adaptive |
| Configuration | Pre-determined interval for periodic snapshot | No special configuration |
| Application completion wall-time | Impact higher (synchronous) | Impact lower (asynchronous) |
| Storage overhead | Relatively lower (only need to save final snapshot to restore) | Relatively higher (need to save all delta snapshots to restore) |
| Usage | --dumpMode full | --dumpMode incremental |
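
The wall-time difference between synchronous and asynchronous snapshotting comes down to how long the application stays paused. The Python sketch below illustrates the idea under simplifying assumptions: storage is a dictionary, a slow write is simulated with `time.sleep`, and the asynchronous path stages an in-memory copy before handing it to a background thread. All names are hypothetical, and this is not a description of how MMCloud offloads data internally.

```python
import threading
import time
from typing import Dict, Tuple

WRITE_DELAY_S = 0.2  # stand-in for a slow write to snapshot storage


def write_to_storage(storage: Dict[str, bytes], key: str, data: bytes) -> None:
    time.sleep(WRITE_DELAY_S)
    storage[key] = data


def snapshot_sync(memory: bytearray, storage: Dict[str, bytes]) -> float:
    """Pause, write to storage, resume. Returns the pause length in seconds."""
    paused_at = time.perf_counter()
    write_to_storage(storage, "sync", bytes(memory))
    return time.perf_counter() - paused_at


def snapshot_async(
    memory: bytearray, storage: Dict[str, bytes]
) -> Tuple[float, threading.Thread]:
    """Pause, stage a copy in memory, resume; a worker writes the copy later."""
    paused_at = time.perf_counter()
    staged = bytes(memory)  # quick in-memory copy taken while paused
    worker = threading.Thread(
        target=write_to_storage, args=(storage, "async", staged)
    )
    worker.start()
    return time.perf_counter() - paused_at, worker


memory = bytearray(b"job state" * 1000)
storage: Dict[str, bytes] = {}

sync_pause = snapshot_sync(memory, storage)
async_pause, worker = snapshot_async(memory, storage)
worker.join()  # the snapshot is only usable once the offload has finished
```

Both paths end up storing the same bytes; the asynchronous one simply moves the slow write out of the paused period.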

Benchmark comparison

Below are the results of a benchmark run on AWS that compares AppCapsule and AppCapsule++. In every experiment the application changes 50GB of memory, while its total memory usage is 60GB, 120GB, 240GB, or 480GB. We measured how many seconds AppCapsule and AppCapsule++ each took to complete the final snapshot.

Three observations from this benchmark -

  1. For AppCapsule, as application memory usage grows, the snapshot time also increases. The reason is that AppCapsule takes a complete snapshot every time.
  2. For AppCapsule++, as application memory usage grows, the snapshot time stays more or less the same. The reason is that AppCapsule++ takes a delta snapshot, which is triggered when the amount of change in memory reaches a certain threshold.
  3. Consider spot instances on AWS, which have a 2-minute pre-emption window. AppCapsule can only support relatively small workloads (in this experiment, 60GB, where the snapshot takes 109 seconds). It cannot support bigger workloads (in this experiment, 120GB needs 250 seconds to finish a snapshot, far beyond the 2-minute window). AppCapsule++, however, takes only a delta snapshot, so its snapshot time tracks the amount of memory changed, not the application memory usage; in all four experiments it finished in 80 to 104 seconds, inside the 2-minute window.

Figure: AppCapsule vs AppCapsule++
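
The threshold mentioned in observation 2 can be pictured as "the largest delta that can still be written before the pre-emption window closes". The Python sketch below is only an illustration of that idea. The formula, the safety margin, the 0.5 GB/s throughput figure, and the function names are all assumptions chosen for the example, not MMCloud's actual calculation.

```python
def delta_threshold_gb(
    window_s: float,
    throughput_gb_per_s: float,
    safety_margin_s: float = 20.0,
) -> float:
    """Largest delta (in GB) that fits in the window, minus a safety margin.

    Hypothetical formula: usable seconds multiplied by write throughput.
    """
    usable_s = max(window_s - safety_margin_s, 0.0)
    return usable_s * throughput_gb_per_s


def should_snapshot(changed_gb: float, threshold_gb: float) -> bool:
    """Trigger a delta snapshot once the changed memory reaches the threshold."""
    return changed_gb >= threshold_gb


# AWS spot pre-emption window from the text: 2 minutes.
# The throughput value is illustrative, not a measured figure.
threshold = delta_threshold_gb(window_s=120, throughput_gb_per_s=0.5)

trigger_early = should_snapshot(changed_gb=30.0, threshold_gb=threshold)
trigger_now = should_snapshot(changed_gb=50.0, threshold_gb=threshold)
```

With a trigger of this kind, the amount of unsaved change never grows beyond what can be written inside the window, regardless of total memory usage.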

More detailed information in table format below:

| Experiment No. | Experiment Variable | AppCapsule | AppCapsule++ |
| --- | --- | --- | --- |
| 1 | Application memory usage: 60GB; application memory change: 50GB | 109s (r5.2xlarge/gp3) | 91s (r5.2xlarge/gp3) |
| 2 | Application memory usage: 120GB; application memory change: 50GB | 250s (r5.4xlarge/io1) | 102s (r5.4xlarge/io1) |
| 3 | Application memory usage: 240GB; application memory change: 50GB | 533s (r5.8xlarge/io1) | 104s (r5.8xlarge/io1) |
| 4 | Application memory usage: 480GB; application memory change: 50GB | 1548s (r5.16xlarge/io1) | 80s (r5.16xlarge/io1) |
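
As a quick check of the numbers in the table, the short Python snippet below tests each measured snapshot time against the 2-minute (120-second) AWS pre-emption window and computes how many times faster AppCapsule++ was in each experiment. The data is copied from the table; the variable names are ours.

```python
PREEMPTION_WINDOW_S = 120  # AWS spot 2-minute pre-emption window

# (application memory usage in GB, AppCapsule seconds, AppCapsule++ seconds)
results = [
    (60, 109, 91),
    (120, 250, 102),
    (240, 533, 104),
    (480, 1548, 80),
]

# Workload sizes whose final snapshot fits inside the pre-emption window.
fits_full = [gb for gb, full_s, _ in results if full_s <= PREEMPTION_WINDOW_S]
fits_delta = [gb for gb, _, delta_s in results if delta_s <= PREEMPTION_WINDOW_S]

# How many times faster the AppCapsule++ final snapshot was, per workload size.
speedup = {gb: full_s / delta_s for gb, full_s, delta_s in results}

for gb, full_s, delta_s in results:
    print(f"{gb:>3}GB  full={full_s:>4}s  delta={delta_s:>3}s  "
          f"speedup={speedup[gb]:.2f}x")
```

Only the 60GB run fits the window with AppCapsule, while all four fit with AppCapsule++, and the gap widens as memory usage grows.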

Call to action - save money and time

What are the characteristics of your running jobs and environment? AppCapsule++ is available now with the Memory Machine Cloud 2.4 Goa release. Please reach out or leave a comment below; we would love to chat and help!
