
Juiceflow Performance Optimization (AWS)

Sateesh Peri
  • MMCloud's Juiceflow solution provides a high-performance method for running Nextflow pipelines on the cloud by leveraging JuiceFS to optimize cloud storage for the work directory.

Read more about Juiceflow in this blog post

  • JuiceFS is a general-purpose, distributed file system compatible with any application. In the current MMCloud release, JuiceFS can only be used with a Nextflow host deployed using the OpCenter's built-in Nextflow job template.

  • While cloud storage formatted with JuiceFS offers high performance for work directories, input data often needs to be staged in the work directory.

  • Nextflow natively supports staging or even streaming inputs from S3 to the executors of the respective process steps. By defining the input S3 location as a Channel, the Nextflow framework ensures the data is available for the task (see the sketch after this list).

  • However, with a large number of files, staging can become a bottleneck as the files must first be downloaded to the stage subdirectory in the work directory and then made available to the worker nodes for execution.

  • To avoid this staging bottleneck, input data in cloud storage can be mounted as data volumes, making the files available locally and bypassing the staging process.

  • Essentially, you register the storage in the OpCenter and mount the bucket with S3FS as a data volume on the Nextflow head node and worker nodes by passing --storage <storage-name>
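
For illustration, here is a minimal sketch of that staging pattern (the bucket path, parameter name, and workflow are assumptions for this example, not part of the Juiceflow setup):

// Declaring an S3 location as a channel lets Nextflow stage (or stream) the objects
// into each task's work directory automatically.
params.reads = 's3://my-input-bucket/fastq/*_R{1,2}.fastq.gz'

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)   // paired FASTQ files staged from S3
    reads_ch.view()
}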


Steps:

See the official Juiceflow Quick Guide for an introduction to running Nextflow pipelines on MMCloud

Register Storage

Available from float v3.0.0-69ce0c9-Imperia onwards

  • In the OpCenter left navigation bar, click on Storage
  • Click the Register Storage button
  • Select the Storage Type: NFS, Lustre, S3, OSS, or GS
  • Provide a name for the storage, the S3 URI, endpoint, access key, secret key, and mount point for the bucket, and choose an Access Mode: Read Only or Read Write
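
As an illustration only (every value below is an assumption; substitute your own bucket, region, and credentials), a registration for an S3 input bucket might look like:

Name:         input-reads
Storage Type: S3
S3 URI:       s3://my-input-bucket
Endpoint:     https://s3.us-east-1.amazonaws.com
Access Key:   <AWS access key with read access to the bucket>
Secret Key:   <matching secret key>
Mount Point:  /my-input-bucket
Access Mode:  Read Only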

Using storage arguments in JuiceFlow

  1. In your job script, add the input bucket as --storage <storage-name> in the process.extra section of mmc.config:
process {
    executor = 'float'
    errorStrategy = 'retry'
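    // The first --dataVolume mounts the JuiceFS file system served by the head node at /mnt/jfs
    // (the work directory); the second creates a 120 GB volume used as the JuiceFS cache.
    // Each --storage mounts a registered input bucket; replace the placeholders with the
    // storage names you chose when registering storage.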
    extra = '--dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://${jfs_private_ip}:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache --storage <storage-name-1> --storage <storage-name-2>'
}
  2. Modify your sample sheet to read from the storage mount point <storage-mount-point> defined when registering the storage:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829_Full,COLO829,COLO829T,tumor,dna,bam,/storage-mount-point/COLO829v003T.bam
COLO829_Full,COLO829,COLO829T,tumor,dna,bai,/storage-mount-point/COLO829v003T.bam.bai
COLO829_Full,COLO829,COLO829R,normal,dna,bam,/storage-mount-point/COLO829v003R.bam
COLO829_Full,COLO829,COLO829R,normal,dna,bai,/storage-mount-point/COLO829v003R.bam.bai
  3. For the head node, add the --storage <storage-name> flag to the float submit command:
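# The --storage names below must match the storages registered in the OpCenter (and the ones
# referenced in mmc.config), so the input buckets are mounted on the head node as well as on
# the worker nodes.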
float submit \
--hostInit transient_JFS_AWS.sh \
--hostTerminate hostTerminate_AWS.sh \
-i docker.io/memverge/juiceflow \
--vmPolicy '[onDemand=true]' \
--migratePolicy '[disable=true]' \
--dataVolume '[size=60]:/mnt/jfs_cache' \
--storage <storage-name-1> \
--storage <storage-name-2> \
--dirMap /mnt/jfs:/mnt/jfs \
-c 2 -m 4 \
-n <job-name> \
--securityGroup <security-group> \
--env BUCKET=https://<work-bucket>.s3.<region>.amazonaws.com \
-j job_submit_AWS.sh

This setup ensures efficient data handling: input data is read from input buckets mounted as S3FS data volumes, while the work directory leverages a JuiceFS-formatted bucket for high performance.


Using JuiceFS SnapLocation for more efficient checkpoint/restore

  • MMCloud can use either local EBS volumes or JuiceFS-formatted S3 buckets to store snapshot data

  • From the left navigation bar of the OpCenter, click on System Settings

  • In System Settings, click on Cloud Settings and go to the Snapshot Location field
  • Provide the S3 bucket URL in the following format, making sure to include the access key, secret key, and mode=rw:
[accesskey=<access>,secret=<secret>,mode=rw]s3://preview-opcenter-jfs-snaplocation

NOTE: The S3 bucket provided for the snapshot location above is used to store the snapshot data of all jobs from all users in the OpCenter, and the data is cleared automatically after a job finishes.