Introducing the MMCloud Flyte Plugin: Reduce cost and boost performance when running Flyte workflows

Edwin Yu 2023-09-18

Summary

The Memory Machine Cloud (MMCloud) plugin for Flyte is now available for beta testing. The plugin allows Flyte users to configure MMCloud as the compute environment for the execution of Flyte workflows. With the plugin, Flyte users can:

  1. Reliably deploy stateful tasks on spot instances
  2. Optimize performance and cost of long-running batch and HPC workloads
  3. Monitor cloud resource usage in real time at the task or job level
  4. Automate DevOps, MLOps, and CloudOps processes


MMCloud plugin for Flyte

The MMCloud plugin for Flyte is available in the Python Package Index (PyPI) as flytekitplugins-mmcloud. The plugin allows users to execute Python functions in a Flyte workflow using MMCloud as the compute platform. Support for shell tasks is on the way.

Why use the MMCloud Flyte plugin?

With MMCloud as the compute environment, users can choose whether to deploy on-demand or spot instances. On-demand instances are reliable (they run until the job completes) but expensive. Spot instances are discounted relative to equivalent on-demand instances but are unreliable (the cloud provider can reclaim a spot instance with little warning).

To our knowledge, no batch computing solution today (apart from MMCloud) can automatically checkpoint a running application and redeploy it on a new compute instance. The standard guidance when using cloud-native batch managers is to retry from the beginning on another spot or on-demand instance. Starting over is a poor fit for long-running batch workloads such as ML training, ETL, or scientific computing/HPC, because every reclaim discards all the work done so far.

MMCloud's checkpoint and recovery service seamlessly migrates a Flyte user's running job from a spot instance flagged for reclamation to a new spot (or on-demand) instance. This allows users to reduce cloud costs with minimal impact on wall time or application performance.
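To make the difference concrete, here is a rough back-of-the-envelope sketch in Python. The job length, the reclaim times, and the per-migration overhead are hypothetical numbers chosen for illustration, not MMCloud measurements, and the model ignores the time needed to acquire a replacement instance.

```python
def wall_time_with_restart(work_minutes, minutes_before_each_reclaim):
    """Total runtime when every reclaimed attempt is retried from the beginning.

    Each interrupted attempt is wasted, and the final attempt runs the whole job.
    """
    for lost in minutes_before_each_reclaim:
        if not 0 <= lost < work_minutes:
            raise ValueError("an interrupted attempt must be shorter than the job")
    return sum(minutes_before_each_reclaim) + work_minutes


def wall_time_with_checkpoint(work_minutes, reclaim_count, migration_minutes):
    """Total runtime when the job resumes from a checkpoint after each reclaim.

    migration_minutes is an assumed, fixed cost per migration.
    """
    return work_minutes + reclaim_count * migration_minutes


# Hypothetical scenario: a 10-hour job reclaimed after 4 hours and again after 7 hours.
reclaims = [240, 420]
restart_total = wall_time_with_restart(600, reclaims)
resume_total = wall_time_with_checkpoint(600, len(reclaims), migration_minutes=6)
print(f"restart from scratch: {restart_total} min, resume from checkpoint: {resume_total} min")
```

Under these assumed numbers, restarting costs 1260 minutes of compute while resuming costs 612, which is why checkpointing matters most for long jobs.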

Jobs can fail due to an out-of-memory error if the VM is underprovisioned for the workload. With other batch solutions, the usual policy is to restart from the beginning on a larger instance, or to run on a larger-than-needed instance to ensure enough memory is available for the most resource-intensive parts of the job. MMCloud provides a continuous optimization service for Flyte users that moves jobs to optimally sized instances based on a user-defined policy. This feature ensures that, at every stage in the execution pipeline, instances are neither under- nor over-provisioned.

Cloud-native monitoring tools provide visibility into resource utilization at the virtual machine level, but they usually cannot show utilization at a more granular level, such as per task or per job. MMCloud provides Flyte users with real-time resource consumption metrics at the task level, viewable graphically in a browser-based web interface. Usage reports are also available in CSV format.


How can I try the MMCloud Flyte plugin?

Deploy the MMCloud OpCenter

To run a Flyte workflow with MMCloud, you must install the MMCloud OpCenter. Get started free HERE.

Deploy the MMCloud agent

Install the float binary (obtainable from the OpCenter) on your Flyte host.

To install the plugin, run the following command:

pip install flytekitplugins-mmcloud

The agent server requires the following secrets:

  1. mmc_address: MMCloud OpCenter address
  2. mmc_username: MMCloud OpCenter username
  3. mmc_password: MMCloud OpCenter password


These can be specified as environment variables for a local deployment:

  1. _FSEC_MMC_ADDRESS
  2. _FSEC_MMC_USERNAME
  3. _FSEC_MMC_PASSWORD
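Before starting a local agent, it can help to confirm that all three variables are set. The following is a minimal sketch using only the Python standard library. The helper name missing_mmcloud_secrets is hypothetical and is not part of the plugin.

```python
import os

# Environment variable names taken from the list above.
REQUIRED_VARS = ("_FSEC_MMC_ADDRESS", "_FSEC_MMC_USERNAME", "_FSEC_MMC_PASSWORD")


def missing_mmcloud_secrets(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"_FSEC_MMC_ADDRESS": "opcenter.example.com", "_FSEC_MMC_USERNAME": "admin"}
print(missing_mmcloud_secrets(example_env))
```

Calling missing_mmcloud_secrets() with no argument checks the current process environment. An empty list means all three secrets are present.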


Alternatively, for a K8s deployment using the Flyte agent Helm chart, add these secrets under agentSecret > secretData > data in values.yaml.
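As a sketch of the nesting described above, the following Python snippet builds the agentSecret > secretData > data mapping and prints it as JSON, which most YAML parsers also accept. Kubernetes Secret data fields hold base64-encoded values. Whether the chart expects the values pre-encoded is an assumption to verify against the chart's own documentation, so encoding is a switch here. The function name and the placeholder values are hypothetical.

```python
import base64
import json


def agent_secret_values(address, username, password, base64_encode=True):
    """Build the values.yaml fragment for the three MMCloud secrets.

    base64_encode=True assumes the chart copies the values into a Kubernetes
    Secret's data field unchanged (an assumption, not confirmed by this article).
    """
    def encode(value):
        if not base64_encode:
            return value
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "agentSecret": {
            "secretData": {
                "data": {
                    "mmc_address": encode(address),
                    "mmc_username": encode(username),
                    "mmc_password": encode(password),
                }
            }
        }
    }


# Placeholders are left unencoded here so the structure is easy to read.
fragment = agent_secret_values("<OPCENTER_ADDRESS>", "<USERNAME>", "<PASSWORD>", base64_encode=False)
print(json.dumps(fragment, indent=2))
```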

To start the agent server locally:

pyflyte serve --port 8000
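Once the server is running, a quick way to confirm that something is listening on the port is a plain TCP connection test. The sketch below only checks that the port accepts connections. It does not verify that the listener is actually the agent service. The function name is hypothetical, and the default host and port assume the local command above.

```python
import socket


def agent_port_open(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, agent_port_open() returns False if the agent server has not started or is bound to a different port.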

To deploy the agent on a K8s cluster, use the Flyte agent Helm chart. To build an agent image, you must install the float binary obtained from the OpCenter. A sample Dockerfile can be found at this README.

Enable the agent-service and the MMCloud agent in the relevant configuration file. The name you give the MMCloud agent does not matter as long as <AGENT_ENDPOINT> is correct.

Run a Flyte workflow

As an example, run the workflow from Flyte's Getting Started page, modified to use the MMCloud plugin. Code can be found at this README. Specify the task_config in the @task decorator for any tasks you want to run using MMCloud.

Resource (cpu and mem) requests and limits, container images, and environment variable specifications are supported.

Install flytekitplugins-envd and substitute your own registry where appropriate, or replace image_spec with a pre-built Flytekit image with pandas and scikit-learn installed.

Alternatively, the image can be specified later in pyflyte run, pyflyte register, or pyflyte package.

To run the workflow, enter the following command:

pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
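The hyperparameters argument is a JSON string, which is easy to misquote in a shell. As a hedged convenience, the sketch below assembles the same command from a Python dictionary. The helper name is hypothetical, and it only builds and prints the command rather than executing it.

```python
import json
import shlex


def build_pyflyte_run(script, workflow_name, hyperparameters):
    """Return the pyflyte run command as an argument list."""
    return [
        "pyflyte", "run", "--remote",
        script, workflow_name,
        "--hyperparameters", json.dumps(hyperparameters),
    ]


cmd = build_pyflyte_run("example.py", "training_workflow", {"C": 0.1})
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The argument list can also be passed directly to subprocess.run, which avoids shell quoting altogether.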

Flyte with MMCloud architecture


MMCloud in AWS architecture

The MMCloud object in the Flyte with MMCloud architecture contains the MMCloud OpCenter and any worker nodes instantiated by the OpCenter, as shown in the following figure.



As an example of how MMCloud migrates running jobs when a spot reclaim event occurs, the following figure shows a Jupyter server "worker node" that has been running for several weeks, entirely on EC2 spot compute. MMCloud protected the Jupyter server from several EC2 Spot reclaim events, saving the user from starting over and rebuilding their environment.

Conclusion

We're excited to announce the beta release of our Flyte plugin. Please try it out and share your feedback. Your feedback will be instrumental in making the plugin better for everyone in the Flyte community.
