Introducing the MMCloud Flyte Plugin: Reduce cost and boost performance when running Flyte workflows

Edwin Yu 2023-09-18

Summary

The Memory Machine Cloud (MMCloud) plugin for Flyte is now available for beta testing. The plugin allows Flyte users to configure MMCloud as the compute environment for the execution of Flyte workflows. With the plugin, Flyte users can:

Reliably deploy stateful tasks on spot instances
Optimize performance and cost of long-running batch and HPC workloads
Monitor cloud resource usage in real time at the task or job level
Automate DevOps, MLOps, and CloudOps processes

MMCloud plugin for Flyte

The MMCloud plugin for Flyte is available in the Python Package Index (PyPI) as flytekitplugins-mmcloud. The plugin allows users to execute Python functions in a Flyte workflow using MMCloud as the compute platform. Support for shell tasks is on the way.

Why use the MMCloud Flyte plugin?

With MMCloud as the compute environment, users can choose whether to deploy on-demand or spot instances. On-demand instances are reliable (they run until the job completes) but expensive. Spot instances are offered at a discount to the equivalent on-demand instances but are unreliable: the cloud provider can reclaim a spot instance with little warning.

To our knowledge, no batch computing solution today (apart from MMCloud) can automatically checkpoint a running application and re-deploy it on a new compute instance. The standard guidance when using cloud-native batch managers is to retry from the beginning on another spot or on-demand instance. Starting over isn't a great approach, especially for long-running batch workloads like ML training, ETL, or scientific computing/HPC.

MMCloud's checkpoint and recovery service seamlessly migrates a Flyte user's running job from a spot instance flagged for reclamation to a new spot (or on-demand) instance. This allows users to reduce cloud costs with minimal impact on wall time or application performance.

Jobs can fail with an out-of-memory error if the VM is underprovisioned for the workload. With other batch solutions, the usual policy is either to restart from the beginning on a larger instance, or to run on a larger-than-needed instance so that enough memory is available for the most resource-intensive parts of the job. MMCloud provides a continuous optimization service for Flyte users that moves jobs to optimally sized instances based on a user-defined policy. This feature ensures that, at every stage in the execution pipeline, instances are neither under- nor over-provisioned.

Cloud-native monitoring tools provide visibility into resource utilization at the virtual machine level, but it is usually not possible to see utilization at a more granular level, such as per task or per job. MMCloud provides Flyte users with real-time resource consumption metrics at the task level, viewable graphically in a browser-based web interface. Usage reports are also available in CSV format.

How can I try the MMCloud Flyte plugin?

Deploy the MMCloud OpCenter

To run a Flyte workflow with MMCloud, you must install the MMCloud OpCenter. Get started free HERE.

Deploy the MMCloud agent

Install the float binary (obtainable from the OpCenter) on your Flyte host.

To install the plugin, run the following command:

pip install flytekitplugins-mmcloud

The agent server requires the following secrets:

mmc_address: MMCloud OpCenter address
mmc_username: MMCloud OpCenter username
mmc_password: MMCloud OpCenter password

These can be specified as environment variables for a local deployment:

_FSEC_MMC_ADDRESS
_FSEC_MMC_USERNAME
_FSEC_MMC_PASSWORD
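
Before starting a local agent, it can help to confirm that all three variables are set. The following is a minimal sketch, not part of the plugin: the helper names env_var_for and missing_secrets are hypothetical, and the naming rule (prefix _FSEC_ plus the upper-cased secret name) is inferred only from the three pairs listed above.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Secret names as listed in the post.
SECRET_NAMES = ("mmc_address", "mmc_username", "mmc_password")


def env_var_for(secret_name: str) -> str:
    """Environment variable name for a secret, following the pattern in the list above."""
    return "_FSEC_" + secret_name.upper()


def missing_secrets(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the environment variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [env_var_for(name) for name in SECRET_NAMES if not env.get(env_var_for(name))]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All MMCloud agent secrets are set.")
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise without exporting anything.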

Alternatively, for a K8s deployment using the Flyte agent Helm chart, add these secrets under agentSecret > secretData > data in values.yaml.

To start the agent server locally:

pyflyte serve --port 8000
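
Once the server is started, a quick way to see whether anything is listening on that port is a plain TCP connection attempt. This is only a rough sketch using the Python standard library: port_is_open is a hypothetical helper, the host localhost is an assumption for a local deployment, and a successful connection shows that some process accepts connections on the port, not that it is the MMCloud agent.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Port 8000 matches the --port value in the command above.
    print("agent port reachable:", port_is_open("localhost", 8000))
```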

To deploy the agent on a K8s cluster, use the Flyte agent Helm chart. To build an agent image, you must install the float binary obtained from the OpCenter. A sample Dockerfile can be found at this README.

Enable the agent-service and the MMCloud agent in the relevant configuration file. The name you give the MMCloud agent does not matter as long as <AGENT_ENDPOINT> is correct.

Run a Flyte workflow

As an example, run a workflow, modified to use the MMCloud plugin, from Flyte's Getting Started page. Code can be found at this README. Specify the task_config in the @task decorator for any tasks you want to run using MMCloud.

Resource (cpu and mem) requests and limits, container images, and environment variable specifications are supported.

Install flytekitplugins-envd and substitute your own registry where appropriate, or replace image_spec with a pre-built Flytekit image with pandas and scikit-learn installed.

Alternatively, the image can be specified later in pyflyte run, pyflyte register, or pyflyte package.

To run the workflow, enter the following command: pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
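
The --hyperparameters value is a JSON object, which is easy to mistype when quoting by hand. As a small sketch, the same command can be assembled from a Python dictionary; build_run_command is a hypothetical helper, and only the script name, workflow name, and flags shown above are used.

```python
import json
import shlex


def build_run_command(script: str, workflow_name: str, hyperparameters: dict) -> list:
    """Assemble the pyflyte run argument list, JSON-encoding the hyperparameters."""
    return [
        "pyflyte",
        "run",
        "--remote",
        script,
        workflow_name,
        "--hyperparameters",
        json.dumps(hyperparameters),
    ]


cmd = build_run_command("example.py", "training_workflow", {"C": 0.1})
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The argument list can also be passed directly to subprocess.run, which avoids shell quoting altogether.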

Flyte with MMCloud architecture

MMCloud in AWS architecture

The MMCloud object in the Flyte with MMCloud architecture contains the MMCloud OpCenter and any worker nodes instantiated by the OpCenter, as shown in the following figure.

As an example of how MMCloud migrates running jobs when a spot reclaim event occurs, the following figure shows a Jupyter server "worker node" that has been running for several weeks entirely on EC2 Spot compute. MMCloud protected the Jupyter server from several EC2 Spot reclaim events, saving the user from starting over and rebuilding their environment.

Conclusion

We're excited to announce the beta release of our Flyte plugin. Please try it out and share your feedback. Your feedback will be instrumental in making the plugin better for everyone in the Flyte community.
