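The Memory Machine Cloud (MMCloud) plugin for Flyte is now available for beta testing. The plugin allows Flyte users to configure MMCloud as the compute environment for the execution of Flyte workflows. With the plugin, Flyte users can run jobs on spot instances without restarting after a reclaim, have jobs moved automatically to right-sized instances, and view resource consumption at the task level.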
The MMCloud plugin for Flyte is available in the Python Package Index (PyPI) as flytekitplugins-mmcloud. The plugin allows users to execute Python functions in a Flyte workflow using MMCloud as the compute platform. Support for shell tasks is on the way.
With MMCloud as the compute environment, users can choose whether to deploy on-demand or spot instances. On-demand instances are reliable (they run until the job completes), but expensive. Spot instances are available at a discount to the equivalent on-demand instances, but are unreliable (the cloud provider can reclaim the spot instance with nominal warning).
To our knowledge, there is no batch computing solution today (apart from MMCloud) that can automatically checkpoint and re-deploy a running application on a new compute instance. The standard guidance when using cloud native batch managers is to retry from the beginning using another spot or on-demand instance. Starting over from the beginning isn't a great approach, especially for long-running batch workloads like ML training, ETL, or scientific computing/HPC.
MMCloud's checkpoint and recovery service seamlessly migrates a Flyte user's running job from a spot instance flagged for reclamation to a new spot (or on-demand) instance. This allows users to reduce cloud costs with minimal impact on wall time or application performance.
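To make the difference concrete, the following minimal sketch compares the two recovery strategies with simple arithmetic. The function names, the 10-hour job, the reclaim times, and the 15-minute migration pause are all hypothetical values chosen for illustration, not MMCloud measurements. The model also assumes that no work is lost when a job is checkpointed and migrated.

```python
def wall_time_with_restarts(job_hours, hours_before_each_reclaim):
    """Wall time when every reclaim forces the job to start over.

    Each entry in hours_before_each_reclaim is how long one attempt ran
    before its spot instance was reclaimed; that work is thrown away.
    """
    return sum(hours_before_each_reclaim) + job_hours


def wall_time_with_checkpoints(job_hours, reclaim_count, migration_hours):
    """Wall time when each reclaim only costs a checkpoint-and-migrate pause."""
    return job_hours + reclaim_count * migration_hours


# Hypothetical scenario: a 10-hour job whose spot instance is reclaimed twice,
# once after 6 hours and once after 8 hours of a fresh attempt.
restart_total = wall_time_with_restarts(10.0, [6.0, 8.0])
checkpoint_total = wall_time_with_checkpoints(
    10.0, reclaim_count=2, migration_hours=0.25
)

print(f"restart from scratch: {restart_total} h")      # 24.0 h
print(f"checkpoint and resume: {checkpoint_total} h")  # 10.5 h
```

Under these assumed numbers, restarting more than doubles the wall time, while checkpointing adds only the migration pauses.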
Jobs can fail due to an out-of-memory error if the VM is underprovisioned for the workload. With other batch solutions, the usual policy is to restart from the beginning on a larger instance, or to run on a larger-than-needed instance to ensure enough memory is available for the most resource-intensive parts of the job. MMCloud provides a continuous optimization service for Flyte users that moves jobs to optimally sized instances based on a user-defined policy. This feature ensures that, at every stage in the execution pipeline, instances are neither under- nor over-provisioned.
Cloud native monitoring tools provide visibility into resource utilization at the virtual machine level, but it is usually not possible to see cloud resource utilization at a more granular level, such as at a task or job level. MMCloud provides Flyte users with real-time resource consumption metrics at the task level, and these are viewable graphically via a browser-based web interface. Usage reports are also available in CSV format.
To run a Flyte workflow with MMCloud, you must install the MMCloud OpCenter (free to get started).
Install the float binary (obtainable from the OpCenter) on your Flyte host.
To install the plugin, run the following command:
pip install flytekitplugins-mmcloud
The agent server requires the following secrets:
mmc_address: MMCloud OpCenter address
mmc_username: MMCloud OpCenter username
mmc_password: MMCloud OpCenter password
These can be specified as environment variables for a local deployment:
_FSEC_MMC_ADDRESS
_FSEC_MMC_USERNAME
_FSEC_MMC_PASSWORD
Alternatively, for a K8s deployment using the Flyte agent Helm chart, add these secrets under agentSecret > secretData > data in values.yaml.
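For a local deployment, a quick pre-flight check can confirm that the three environment variables listed above are set before the agent server is started. The following is a minimal sketch using only the Python standard library; the helper name `missing_mmc_secrets` is hypothetical and is not part of the plugin.

```python
import os

# Environment variable names required for a local agent deployment.
REQUIRED_SECRETS = (
    "_FSEC_MMC_ADDRESS",
    "_FSEC_MMC_USERNAME",
    "_FSEC_MMC_PASSWORD",
)


def missing_mmc_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mmc_secrets()
    if missing:
        print("Missing secrets:", ", ".join(missing))
    else:
        print("All MMCloud secrets are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.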
To start the agent server locally:
pyflyte serve --port 8000
To deploy the agent on a K8s cluster, use the Flyte agent Helm chart. To build an agent image, you must install the float binary (obtained from the OpCenter) in the image. A sample Dockerfile can be found in the plugin README.
Enable the agent-service and the MMCloud agent in the relevant configuration file. The name you give the MMCloud agent does not matter as long as <AGENT_ENDPOINT> is correct.
As an example, run the workflow from Flyte's Getting Started page, modified to use the MMCloud plugin. The code can be found in the plugin README. Specify the task_config in the @task decorator for any task you want to run using MMCloud.
Resource (cpu and mem) requests and limits, container images, and environment variable specifications are supported.
Install flytekitplugins-envd and substitute your own registry where appropriate, or replace image_spec with a pre-built Flytekit image with pandas and scikit-learn installed. Alternatively, the image can be specified later in pyflyte run, pyflyte register, or pyflyte package.
To run the workflow, enter the following command:
pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
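The `--hyperparameters` value is a JSON object, so it must be quoted for the shell. When launching runs from a script, one way to avoid quoting mistakes is to build the command with the standard library, as in the sketch below. The helper name `build_pyflyte_run` is hypothetical; only the `pyflyte run` arguments shown above are taken from this walkthrough.

```python
import json
import shlex


def build_pyflyte_run(script, workflow_name, hyperparameters):
    """Return a shell-safe `pyflyte run --remote` command line."""
    args = [
        "pyflyte", "run", "--remote",
        script, workflow_name,
        "--hyperparameters", json.dumps(hyperparameters),
    ]
    # shlex.join quotes any argument containing shell-special characters.
    return shlex.join(args)


command = build_pyflyte_run("example.py", "training_workflow", {"C": 0.1})
print(command)
# pyflyte run --remote example.py training_workflow --hyperparameters '{"C": 0.1}'
```

The resulting string can be pasted into a terminal, or the argument list can be passed directly to `subprocess.run` without any quoting.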
MMCloud in AWS architecture
In the "Flyte with MMCloud" architecture, the MMCloud object contains the MMCloud OpCenter and any worker nodes instantiated by the OpCenter, as shown in the following figure.
As an example of how MMCloud migrates running jobs when a spot reclaim event occurs, the following figure shows a Jupyter server "worker node" that has been running for several weeks, entirely on EC2 spot compute. MMCloud protected the Jupyter server from several EC2 Spot reclaim events, saving the user from starting over and rebuilding their environment.
We're excited to announce the beta release of our Flyte plugin. Please try it out and share your feedback. Your feedback will be instrumental in making the plugin better for everyone in the Flyte community.