Abstract
This blog traces the evolution and provides an overview of the MemVerge JuiceFlow solution, which enables highly performant and cost-effective Nextflow pipeline executions on Memory Machine Cloud (MMCloud).
What is JuiceFlow?
JuiceFlow is a transient head-node setup that uses JuiceFS as its primary filesystem to execute Nextflow pipelines on MMCloud. JuiceFS, an open-source, high-performance distributed file system designed for cloud environments, brings several unique features to the table, including:
- Separation of Data and Metadata: JuiceFS stores file data in chunks within object storage platforms like Amazon S3, while metadata is stored separately in a database such as Redis. This architecture enhances performance and scalability (see the sketch after this list).
- Performance: JuiceFS is engineered to achieve millisecond-level latency and nearly unlimited throughput, depending on the object storage's scale. This makes it exceptionally suitable for high-performance computing environments like those needed for Nextflow pipelines.
- Easy Integration with MMCloud: MMCloud facilitates the deployment process by providing pre-configured Nextflow head node templates with JuiceFS setup, simplifying the user's experience.
- Comparison with S3FS: JuiceFS offers significant performance and scalability advantages over S3FS. For those interested in a detailed comparison, refer to the JuiceFS vs. S3FS documentation.
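To make the data/metadata split concrete, here is what formatting and mounting a volume looks like with the upstream `juicefs` CLI; the bucket URL, Redis endpoint, and volume name below are placeholders, and JuiceFlow performs the equivalent steps for you:

```bash
# Format a volume: file data will be chunked into S3, while metadata
# lives in the Redis database given by the META-URL argument.
juicefs format \
    --storage s3 \
    --bucket https://<my-bucket>.s3.<region>.amazonaws.com \
    --access-key "$S3_ACCESS_KEY" \
    --secret-key "$S3_SECRET_KEY" \
    redis://<redis-host>:6379/1 \
    myjfs

# Mount the volume in the background; reads and writes go to S3,
# lookups and directory listings go to Redis.
juicefs mount -d redis://<redis-host>:6379/1 /mnt/jfs
```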
Once a Nextflow run concludes, whether due to failure or completion, the node self-terminates, hence the term "transient." Furthermore, JuiceFlow can reload previous runs from your S3 bucket, provided that one metadata file exists per bucket. This feature enables you to `-resume` runs or access data created by other runs within the same JuiceFS file system, streamlining data management and pipeline execution.
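For example, assuming a prior run's metadata file is present in the bucket, re-invoking the same pipeline (the pipeline and profile below are illustrative) lets Nextflow reuse cached task results:

```bash
# With the previous run's JuiceFS filesystem reloaded from the bucket,
# Nextflow's -resume flag skips tasks whose cached results are intact.
nextflow run nf-core/rnaseq -profile test_full -resume
```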
Development of JuiceFlow
- The development of the JuiceFS + Nextflow solution has undergone many iterations, leading to the creation of JuiceFlow.
- Initially, the setup integrated the JuiceFS filesystem and Nextflow within the same instance. However, upon deployment via CLI, users were required to `ssh` into the instance to initiate their run and then manually dispose of the VM after execution.
- In the second iteration, we emphasized the "transient-ness" of the Nextflow head node while aiming for minimal user intervention. The JuiceFS and Nextflow components were split across two separate instances: JuiceFS remained persistent, whereas the Nextflow component mounted the filesystem and stayed active only until the Nextflow run concluded, underscoring its "transient" aspect. This arrangement eliminated the need for users to `ssh` into the head-node instance, allowing them to interact primarily through two CLI command submissions.
- The final configuration, JuiceFlow, merges the JuiceFS and Nextflow components back into a single node, akin to the original setup, yet inherits the "transient," hands-off character of the second iteration, all achieved through a single CLI command.
Overall JuiceFlow Setup
- The prerequisites for setup are a functional MMC OpCenter, an AWS S3 bucket, and the corresponding S3 access and secret keys.
- The user submits a single command that encapsulates the aforementioned details along with two primary scripts (see the sketch after this list):
  - `transient_JFS.sh`: Formats the work directory S3 bucket to the JuiceFS format.
  - `job-submit.sh`: Contains the Nextflow input parameters and the MMC config.
- Upon submission, a VM is initiated. This VM formats a directory within the designated S3 bucket for JuiceFS. If the bucket contains a metadata file from a previous job, the VM instead loads the corresponding JuiceFS partition, granting the job access to that directory. Within the job, a Nextflow configuration file specific to MMCloud using the `nf-float` plugin is generated, and the designated Nextflow command is executed.
- The process relies on the `nf-float` plugin, which spawns an additional instance for each process within the pipeline on MMCloud. These VMs are configured to terminate upon completion of their respective processes.
- Following the execution of the pipeline, regardless of the outcome (failure or success), JuiceFlow archives the metadata of the JuiceFS filesystem to the user's work directory S3 bucket. Should JuiceFlow be executed again with the same bucket, it will recognize the metadata file and reload the identical JuiceFS filesystem. This is particularly beneficial when the `-resume` flag is used in Nextflow executions.
- For a more detailed setup, please refer to the JuiceFlow AWS QuickStart Guide.
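As a rough sketch of the single-command submission (the exact flag set and values here are assumptions; consult the JuiceFlow AWS QuickStart Guide for the authoritative command):

```bash
# Hypothetical submission: transient_JFS.sh runs on the host to format or
# reload the JuiceFS partition, and job-submit.sh carries the nextflow run
# command plus the MMC config. Bucket URL and credentials are placeholders.
float submit \
    -n juiceflow-head \
    -i <juiceflow-image> \
    --hostInit transient_JFS.sh \
    -j job-submit.sh \
    --env BUCKET=<s3-bucket-url> \
    --env ACCESS_KEY="$S3_ACCESS_KEY" \
    --env SECRET_KEY="$S3_SECRET_KEY" \
    -c 2 -m 4
```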
Important Considerations
- JuiceFS Bucket Requirements: JuiceFS cannot be formatted on a sub-directory of a bucket; the bucket itself (its root directory) must be specified in the CLI command.
- Google Cloud Platform Access Requirements: For GCP users, the service account must have permissions to get and put objects in your `gs://` bucket.
- Handling `gcp_key.json`: This key serves as the credential for a service account with access to your `gs://` bucket. To make its contents available to the job, the file is read into an environment variable that the container can access, as sketched below.
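A minimal sketch of that pattern (the variable name `GCP_KEY` is an assumption, not necessarily the one JuiceFlow uses):

```bash
# At submit time: read the service-account key into an environment variable.
export GCP_KEY="$(cat gcp_key.json)"

# Inside the container: recreate the key file for tools that expect one.
echo "$GCP_KEY" > /tmp/gcp_key.json
export GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp_key.json
```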
Monitoring Execution Logs
Workflow Execution Log
Once the head-node job starts executing, you can monitor the Nextflow standard output in the OpCenter GUI as follows:
- Click on the head-node job ID.
- Navigate to the `Attachments` tab.
- Click "view" (the eye icon) on the `stdout.autosave` file.
Individual Job Logs
To monitor workflow execution and get a detailed view of each process in the Nextflow workflow:
- Click on the Workflows dashboard in the OpCenter.
- Click on the workflow name. This allows you to monitor the jobs running in this workflow in a consolidated view.
- To view the execution log of any particular job:
  - Click on the job ID.
  - Navigate to the `Attachments` tab.
  - Click "view" (the eye icon) on the `stdout.autosave` file.
Inspecting the Work Directory
To inspect the files in the work directory, follow these steps:
- Submit a job similar to the commands above with a script designed to "sleep forever," as shown in the script below:
```bash
#!/bin/bash
# Keep the head node alive indefinitely so the JuiceFS mount can be inspected.
while true; do
    sleep 1s
done
```
- You can now `ssh` into the instance and inspect all files in the work directory by navigating to `/mnt/jfs`. There's no need to enter the container, since JFS is mounted to the instance itself. It's also possible to create a template for easier execution, but be sure to specify the correct bucket in the ENV variables.
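For example (the user, host, and paths below are placeholders), a read-only inspection session might look like:

```bash
# SSH to the head-node instance; the JuiceFS mount lives at /mnt/jfs.
ssh <user>@<instance-ip>

# Browse the work directory and read a task's log without modifying anything.
ls /mnt/jfs/
cat /mnt/jfs/<work-dir>/<task-hash>/.command.log
```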
NOTE: This method assumes that no files in the JFS mount are being changed, as we are not updating the metadata in this setup. This is intended only for reading/copying logs. Should you alter anything in the file system, be aware that changes will NOT be saved unless the metadata is updated.
Performance Metrics
- Below are some key performance and cost metrics for JuiceFlow, using the `test_full` profiles of popular nf-core pipelines:
| Pipeline | Execution Time | Execution Cost ($) |
|---|---|---|
| nf-core/sarek | 9 hr 32 min | 6.3288 |
| nf-core/atacseq | 4 hr 22 min | 6.6143 |
| nf-core/rnaseq | 8 hr 00 min | 21.8443 |
NOTE: All pipeline jobs were run on AWS Spot instances, utilizing MMCloud's spot-protection feature. This feature automatically saves job progress upon spot instance preemption and restores it on a new spot instance, ensuring continuity.
- We previously published detailed benchmark results on the performance of JuiceFS for Nextflow workflows in a blog post that can be viewed here.
JuiceFlow Advantages
- Transient node: Once the Nextflow pipeline has finished running, the entire node is terminated, eliminating further costs for the user.
- One-command submission: JuiceFlow allows the pipeline to be executed with a single command, negating the need for the user to `ssh` into the instance and thus providing a hands-off experience.
Summary
JuiceFlow streamlines Nextflow pipeline executions on MMCloud by leveraging JuiceFS for high-performance, cost-effective computing. As the nf-core benchmarks above illustrate, pipelines run quickly and inexpensively on AWS Spot instances, with MMCloud's spot protection preserving progress across preemptions. The transient head node terminates after execution, eliminating unnecessary expenses, and the one-command setup keeps user interaction to a minimum, making JuiceFlow a strong choice for running complex computational workflows with efficient resource utilization and budget control.