Juiceflow - AWS

Introduction

Juiceflow combines JuiceFS and Nextflow on MMCloud, offering a powerful, scalable solution for managing and executing workflows in the cloud.


JuiceFS is an open-source, high-performance distributed file system designed specifically for cloud environments. It offers unique features, such as:

Separation of Data and Metadata: JuiceFS stores files in chunks within object storage like Amazon S3, while metadata can be stored in various databases, including Redis.
Performance: Achieves millisecond-level latency and nearly unlimited throughput, depending on the object storage scale.
Easy Integration with MMCloud: MMCloud provides pre-configured Nextflow head node templates with JuiceFS set up, simplifying deployment.
Comparison with S3FS: For a detailed comparison between JuiceFS and S3FS, see JuiceFS vs. S3FS. JuiceFS typically offers better performance and scalability.

Pre-requisites

A VPC security group with an inbound rule for port 6868.
- Navigation: AWS EC2 console -> Network & Security -> Security Groups

Inbound rules should include:
- Custom TCP on port 6868, used by the Redis server in this Juiceflow setup.
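
The same inbound rule can also be added from the command line. The sketch below only prints an AWS CLI command for review instead of running it. The function name, the security group ID, and the source CIDR are illustrative assumptions, and the source range should be narrowed to your own VPC rather than opened publicly.

```shell
#!/bin/bash
# Print (do not run) an AWS CLI command that opens TCP port 6868 for the Redis server.
# print_redis_rule_cmd, the group ID, and the CIDR are placeholders, not values from this guide.
print_redis_rule_cmd() {
  local sg_id=$1 cidr=$2
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 6868 --cidr %s\n' "$sg_id" "$cidr"
}

print_redis_rule_cmd sg-0123456789abcdef0 10.0.0.0/16
```

Review the printed command, then run it with credentials that are allowed to modify the security group.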

AWS S3 buckets for the Nextflow work and output directories.
AWS S3 access and secret keys, plus MMC OpCenter credentials, stored as float secrets.

Overview of the Setup

This solution leverages three scripts:

transient_JFS_AWS.sh: Formats the work directory S3 bucket to JuiceFS format.
hostTerminate_AWS.sh: Allows the workflow to exit gracefully from the Nextflow head node.
job_submit_AWS.sh: Contains Nextflow input parameters and MMC config.

Steps

Download Scripts

Host-init script (you don't have to edit this script, but you'll need it later)

wget https://mmce-data.s3.amazonaws.com/juiceflow/v1/aws/transient_JFS_AWS.sh

Host-terminate script (you don't have to edit this script, but you'll need it later)

wget https://mmce-data.s3.amazonaws.com/juiceflow/v1/aws/hostTerminate_AWS.sh

Job-submit script (Download the template or create one locally based on the template below with your Nextflow inputs and run configurations)

wget https://mmce-data.s3.amazonaws.com/juiceflow/v1/aws/job_submit_AWS.sh
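
Before moving on, it can help to confirm that all three downloads succeeded. The following is a minimal sketch, assuming the scripts were saved under their original names. The helper name check_scripts is hypothetical and not part of the Juiceflow tooling.

```shell
#!/bin/bash
# Report any of the three Juiceflow scripts that are missing or empty in a directory.
# check_scripts is an illustrative helper; it only inspects local files.
check_scripts() {
  local dir=${1:-.} missing=0 f
  for f in transient_JFS_AWS.sh hostTerminate_AWS.sh job_submit_AWS.sh; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}

check_scripts . || echo "Download the scripts listed above before submitting."
```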

Sample `job_submit_AWS.sh` script:

#!/bin/bash

# ---- User Configuration Section ----
# These configurations must be set by the user before running the script.

# ---- Optional Configuration Section ----
# These configurations are optional and can be customized as needed.

# JFS (JuiceFS) Private IP: Retrieved from the WORKER_ADDR environment variable.
jfs_private_ip=$(echo $WORKER_ADDR)

# Work Directory: Defines the root directory for working files. Optional suffix can be added.
workDir_suffix=''
workDir='/mnt/jfs/'$workDir_suffix
mkdir -p $workDir  # Ensures the working directory exists.
cd $workDir  # Changes to the working directory.
export NXF_HOME=$workDir  # Sets the NXF_HOME environment variable to the working directory.

# ------------------------------------------
# ---- vvv DO NOT EDIT THIS SECTION vvv ----
# ------------------------------------------

function install_float {
  # Install float
  local address=$(echo "$FLOAT_ADDR" | cut -d':' -f1)
  wget https://$address/float --no-check-certificate --quiet
  chmod +x float
}

function get_secret {
  input_string=$1
  local address=$(echo "$FLOAT_ADDR" | cut -d':' -f1)
  secret_value=$(./float secret get $input_string -a $address)
  if [[ $? -eq 0 ]]; then
    # Have this secret, will use the secret value
    echo $secret_value
    return
  else
    # Don't have this secret, will still use the input string
    echo $1
  fi
}

# Set Opcenter credentials
install_float 
access_key=$(get_secret AWS_BUCKET_ACCESS_KEY)
secret_key=$(get_secret AWS_BUCKET_SECRET_KEY)
export AWS_ACCESS_KEY_ID=$access_key
export AWS_SECRET_ACCESS_KEY=$secret_key

opcenter_ip_address=$(get_secret OPCENTER_IP_ADDRESS)
opcenter_username=$(get_secret OPCENTER_USERNAME)
opcenter_password=$(get_secret OPCENTER_PASSWORD)

# ------------------------------------------
# ---- ^^^ DO NOT EDIT THIS SECTION ^^^ ----
# ------------------------------------------

# ---- Nextflow Configuration File Creation ----
# This section creates a Nextflow configuration file with various settings for the pipeline execution.

# Use cat to create or overwrite the mmc.config file with the desired Nextflow configurations.
# NOTE: S3 keys and OpCenter credentials are filled in from the float secrets retrieved above. No need to add them manually.

# Additionally, please add your STAGE MOUNT BUCKETS here
cat > mmc.config << EOF
// enable nf-float plugin.
plugins {
    id 'nf-float'
}

// Process settings: Executor, error strategy, and resource allocation specifics.
process {
    executor = 'float'
    errorStrategy = 'retry'
    extra = '--dataVolume [opts=" --cache-dir /mnt/jfs_cache "]jfs://${jfs_private_ip}:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache --vmPolicy [spotOnly=true,retryLimit=10,retryInterval=300s]'
}

// Directories for Nextflow execution.
workDir = '${workDir}'
launchDir = '${workDir}'

// OpCenter connection settings.
float {
    address = '${opcenter_ip_address}'
    username = '${opcenter_username}'
    password = '${opcenter_password}'
}

// AWS S3 Client configuration.
aws {
  client {
    maxConnections = 20
    connectionTimeout = 300000
  }
  accessKey = '${access_key}'
  secretKey = '${secret_key}'
}

EOF

# ---- Data Preparation ----
# Use this section to copy essential files from S3 to the working directory.

# For example, copy the sample sheet and params.yml from S3 to the current working directory.
# aws s3 cp s3://nextflow-input/samplesheet.csv .
# aws s3 cp s3://nextflow-input/scripts/params.yml .

# ---- Nextflow Command Setup ----
# Important: The -c option appends the mmc config file and soft overrides the nextflow configuration.

# Assembles the Nextflow command with all necessary options and parameters.
nextflow_command='nextflow run <nextflow-pipeline> \
-r <revision-number> \
-c mmc.config \
-params-file params.yml \
--input samplesheet.csv \
--outdir 's3://nextflow-output/rnaseq/' \
-resume '

# ---------------------------------------------
# ---- vvv DO NOT EDIT BELOW THIS LINE vvv ----
# ---------------------------------------------
# The following section contains functions and commands that should not be modified by the user.

# Create side script to tag head node - exits when properly tagged
cat > tag_nextflow_head.sh << EOF
#!/bin/bash

runname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-run-name | head -n 1 | grep -oP '(?<=nextflow-io-run-name:)[^ ]+')"
workflowname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-project-name | head -n 1 | grep -oP '(?<=nextflow-io-project-name:)[^ ]+')"

while true; do

  # Runname and workflowname will be populated at the same time
  # If the variables are populated and not tagged it, tag the head node
  if [ ! -z \$runname ]; then
    ./float modify -j "$(echo $FLOAT_JOB_ID)" --addCustomTag run-name:\$runname 2>/dev/null
    ./float modify -j "$(echo $FLOAT_JOB_ID)" --addCustomTag workflow-name:\$workflowname 2>/dev/null
    break
  fi

  runname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-run-name | head -n 1 | grep -oP '(?<=nextflow-io-run-name:)[^ ]+')"
  workflowname="\$(cat .nextflow.log 2>/dev/null | grep nextflow-io-project-name | head -n 1 | grep -oP '(?<=nextflow-io-project-name:)[^ ]+')"

  sleep 1s

done
EOF

# Start tagging side-script
chmod +x ./tag_nextflow_head.sh
./tag_nextflow_head.sh &

# Start Nextflow run
$nextflow_command

if [[ $? -ne 0 ]]; then
  echo $(date): "Nextflow command failed."
  exit 1
else 
  echo $(date): "Nextflow command succeeded."
  exit 0
fi

Job-Submit Script Adjustments

Modify process.extra in the mmc.config section to customize the vmPolicy for individual Nextflow processes. The default policy in the job-submit script is spotOnly; adjust as needed (e.g., onDemand, spotFirst). More options are listed by float submit -h:

--vmPolicy [spotOnly=true,retryLimit=10,retryInterval=300s]
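
When changing the policy, the JuiceFS data volume options in process.extra need to stay intact. One way to do that is to build the string from a small helper that varies only the vmPolicy. This is a sketch: build_extra is a hypothetical helper, the data volume options are copied from the sample script, and policy values other than the ones shown in this guide should be checked against float submit -h.

```shell
#!/bin/bash
# Build the process.extra string used in mmc.config, varying only the vmPolicy.
# build_extra is a hypothetical helper; the dataVolume options mirror the sample script.
build_extra() {
  local jfs_ip=$1 policy=$2
  printf '%s' "--dataVolume [opts=\" --cache-dir /mnt/jfs_cache \"]jfs://${jfs_ip}:6868/1:/mnt/jfs --dataVolume [size=120]:/mnt/jfs_cache --vmPolicy [${policy}]"
}

# Example: on-demand workers instead of the default spot-only policy.
# 10.0.0.5 stands in for the value of jfs_private_ip.
build_extra 10.0.0.5 'onDemand=true'
echo
```

Inside the job-submit script, the result could be substituted into the heredoc, for example extra = '$(build_extra "$jfs_private_ip" "onDemand=true")'.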

For handling nextflow code repository, samplesheets and params-file, you have two options: download them or create them directly in the script. Here’s how to do both:

Downloading Samplesheet and Params File

Add download commands to the script to fetch the samplesheet and params file for Nextflow, replacing the example S3 paths with your own:

# Download samplesheet
aws s3 cp s3://nextflow-input/samplesheet.csv .

# Download params file
aws s3 cp s3://nextflow-input/params.yml .

Creating Samplesheet and Params File Directly in the Script

Alternatively, users can create these files directly within the script using the cat command as shown below.

# Create samplesheet
cat > samplesheet.csv << EOF
sample,fastq_1,fastq_2,strandedness
AL_TO_rep01,s3://nextflow-input/Sample_1_L007_R1_001.fastq.gz,s3://nextflow-input/Sample_1_L007_R2_001.fastq.gz,auto
AL_TO_rep01,s3://nextflow-input/Sample_2_L008_R1_001.fastq.gz,s3://nextflow-input/Sample_2_L008_R2_001.fastq.gz,auto
AL_TO_rep01,s3://nextflow-input/Sample_3_L014_R1_001.fastq.gz,s3://nextflow-input/Sample_3_L014_R2_001.fastq.gz,auto
AL_TO_rep01,s3://nextflow-input/Sample_4_L009_R1_001.fastq.gz,s3://nextflow-input/Sample_4_L009_R2_001.fastq.gz,auto
EOF

# Create params file
cat > params.yml << EOF
multiqc_title: "rnaseq_multiqc"
fasta: "s3://nextflow-input/reference/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz"
gtf: "s3://nextflow-input/reference/Caenorhabditis_elegans.WBcel235.111.gtf.gz"
save_reference: true
remove_ribo_rna: true
skip_alignment: true
pseudo_aligner: "salmon"
EOF
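
A malformed samplesheet usually only surfaces after the head node has started, so a quick local check can save a submission. The sketch below assumes a plain comma-separated file with no quoted commas and verifies that every row has as many fields as the header. The helper name check_samplesheet is hypothetical.

```shell
#!/bin/bash
# Verify that every row of a CSV samplesheet has the same number of fields as its header.
# check_samplesheet is an illustrative helper; it does not validate the S3 paths themselves.
check_samplesheet() {
  awk -F',' '
    NR == 1 { n = NF; next }                       # remember the header width
    NF != n { print "line " NR ": expected " n " fields, found " NF; bad = 1 }
    END { exit bad + 0 }                           # non-zero exit if any row was flagged
  ' "$1"
}

# Usage inside the job-submit script, after the samplesheet has been created:
# check_samplesheet samplesheet.csv || exit 1
```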

Download your Nextflow code repository using git clone or by copying it from S3.

NOTE: If you use private git repositories, you can export your GITHUB_TOKEN before the nextflow_command.

Finally, ensure you customize your nextflow_command with specific pipeline requirements and save the changes:

nextflow_command='nextflow run <nextflow-pipeline> \
-r <revision-number> \
-c mmc.config \
-params-file params.yml \
--input samplesheet.csv \
--outdir 's3://nextflow-output/rnaseq/' \
-resume '

Remember to replace placeholders with your specific pipeline details.

Float Submit

float login -a <opcenter-ip-address> -u <user>

Enter your password at the prompt; you should see Login succeeded.

Ensure you are using the latest version of OpCenter & float CLI:

float release upgrade --sync

Make sure the following float secrets are set, with names exactly as shown:

+-----------------------+
|         NAME          |
+-----------------------+
| OPCENTER_IP_ADDRESS   |
| OPCENTER_USERNAME     |
| OPCENTER_PASSWORD     |
| AWS_BUCKET_ACCESS_KEY |
| AWS_BUCKET_SECRET_KEY |
+-----------------------+

Useful float secret commands:

# to list stored secrets
float secret ls
# to set a secret
float secret set OPCENTER_IP_ADDRESS 192.0.1.2
# to unset a secret
float secret unset OPCENTER_IP_ADDRESS
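
To confirm that none of the five required secrets is missing, the output of float secret ls can be filtered through a small check. This is a sketch that assumes the listing contains each secret name as a separate word, as in the table above. The helper name missing_secrets is hypothetical.

```shell
#!/bin/bash
# Read the output of `float secret ls` on stdin and print every required secret that is absent.
# missing_secrets is an illustrative helper, not a float subcommand.
missing_secrets() {
  local listing name
  listing=$(cat)
  for name in OPCENTER_IP_ADDRESS OPCENTER_USERNAME OPCENTER_PASSWORD \
              AWS_BUCKET_ACCESS_KEY AWS_BUCKET_SECRET_KEY; do
    # -w matches whole words only, so longer names do not count as a match
    printf '%s\n' "$listing" | grep -qw -- "$name" || echo "$name"
  done
}

# Usage:
# float secret ls | missing_secrets
```

An empty result means all five names were found in the listing.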

Float Submit Command

Replace the placeholders <work-bucket>, <region>, <security-group>, and <job-name> with your specific values and execute the float submit command:

float submit \
--hostInit transient_JFS_AWS.sh \
--hostTerminate hostTerminate_AWS.sh \
-i docker.io/memverge/juiceflow \
--vmPolicy '[onDemand=true]' \
--migratePolicy '[disable=true]' \
--dataVolume '[size=60]:/mnt/jfs_cache' \
--dirMap /mnt/jfs:/mnt/jfs \
-c 2 -m 4 \
-n <job-name> \
--securityGroup <security-group> \
--env BUCKET=https://<work-bucket>.s3.<region>.amazonaws.com \
-j job_submit_AWS.sh

Here's a brief explanation of the parameters used in the float submit command:

Parameter	Brief Description
`--hostInit transient_JFS_AWS.sh`	Shell script to run on the host before the job starts.
`--hostTerminate hostTerminate_AWS.sh`	Shell script to run on the host after the job has been cancelled.
`-i docker.io/memverge/juiceflow`	Docker image for the job's software environment.
`--vmPolicy '[onDemand=true]'`	Uses on-demand VM instance for head-node execution.
`--migratePolicy '[disable=true]'`	Disables head-node migration to different hosts/VMs.
`--dataVolume '[size=60]:/mnt/jfs_cache'`	Attaches a 60GB data volume at `/mnt/jfs_cache` in the container.
`--dirMap /mnt/jfs:/mnt/jfs`	Maps a host directory to a container directory for data sharing.
`-c 2`	Allocates 2 CPU cores to the job.
`-m 4`	Allocates 4GB of memory to the job.
`-n <job-name>`	Assigns a name to the job for identification.
`--securityGroup <security-group>`	Applies a security group to the job's VM for network rules.
`--env BUCKET=https://<work-bucket>.s3.<region>.amazonaws.com`	Sets an environment variable for the S3 bucket URL.
`-j job_submit_AWS.sh`	Specifies the job script or command to run inside the container.
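
The BUCKET value follows a fixed pattern, so it can be composed from the bucket name and region rather than typed by hand. A minimal sketch, where bucket_url is a hypothetical helper and the bucket name and region are example values:

```shell
#!/bin/bash
# Compose the value passed to --env BUCKET from a bucket name and an AWS region.
# bucket_url is an illustrative helper; my-work-bucket and us-east-1 are placeholders.
bucket_url() {
  local bucket=$1 region=$2
  printf 'https://%s.s3.%s.amazonaws.com\n' "$bucket" "$region"
}

bucket_url my-work-bucket us-east-1
```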

Float Submit Script

To simplify job submission, create a shell script named submit_nf_float_job.sh that wraps a float submit command like the one shown above:

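#!/bin/bash
# Usage: ./submit_nf_float_job.sh <job-script> <job-name> <juicefs-bucket> [previous-job-id]

JOB_SCRIPT=$1
JOB_NAME=$2
JUICEFS_BUCKET=$3
PREVIOUS_JOB_ID=$4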

if [ -z "$JOB_SCRIPT" ]; then
    echo "JOB_SCRIPT is not set"
    exit 1
fi

if [ -z "$JOB_NAME" ]; then
    echo "JOB_NAME is not set"
    exit 1
fi

if [ -z "$JUICEFS_BUCKET" ]; then
    echo "JUICEFS_BUCKET is not set"
    exit 1
fi

if [ -z "$PREVIOUS_JOB_ID" ]; then
    SHOULD_RESUME=""
else
    SHOULD_RESUME="--env PREVIOUS_JOB_ID="$PREVIOUS_JOB_ID
fi


float submit \
    --hostInit transient_JFS_AWS.sh \
    --hostTerminate hostTerminate_AWS.sh \
    -i docker.io/memverge/juiceflow:v2 \
    --vmPolicy '[onDemand=true]' \
    --migratePolicy '[disable=true]' \
    --dirMap /mnt/jfs:/mnt/jfs \
    --dataVolume '[size=60]:/mnt/jfs_cache' \
    -c 4 -m 8 \
    --securityGroup <security-group> \
    --env BUCKET=https://${JUICEFS_BUCKET}.s3.<s3-bucket-region>.amazonaws.com \
    $SHOULD_RESUME \
    -n $JOB_NAME \
    -j $JOB_SCRIPT

Submit command

./submit_nf_float_job.sh <job-submission-script.sh> <job-name> <s3-bucket-name>

Submit command to resume with previous job-id

./submit_nf_float_job.sh <job-submission-script.sh> <job-name> <s3-bucket-name> <previous-job-id>

Mounting S3 buckets as Data Volumes

You can mount the input data bucket using S3FS as a data volume on the Nextflow head node and worker nodes as follows:

--storage <storage-name>

Read more on Juiceflow Performance Optimization

Important Considerations

JuiceFS Bucket Requirements: JuiceFS requires formatting at the root level of a storage bucket. It cannot be formatted on a sub-directory within a bucket. Ensure the root directory of the bucket, or the entire bucket, is specified in the command line interface (CLI) command.
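
Because the bucket must be referenced at its root, a guard in a submit wrapper can reject a BUCKET value that carries a sub-directory path. The following is a sketch that assumes the https URL form used in the float submit command. The helper name is_bucket_root_url is hypothetical.

```shell
#!/bin/bash
# Succeed only if the BUCKET URL points at the bucket root (no sub-directory path).
# is_bucket_root_url is an illustrative helper; it checks the URL shape only, not bucket access.
is_bucket_root_url() {
  [[ $1 =~ ^https://[^/]+\.s3\.[^/]+\.amazonaws\.com/?$ ]]
}

# Usage:
# is_bucket_root_url "$BUCKET" || { echo "BUCKET must be the bucket root"; exit 1; }
```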

Cancelling a Running Workflow to Resume Later

There are many circumstances where you might need to cancel a running Nextflow workflow and resume it later using Nextflow's -resume feature, such as changes to the configuration or resource specifications.

To cancel a workflow in the Juiceflow setup, cancel the head node job. This causes the head node host to generate a <JOB_ID>.meta.json.gz file in the S3 work bucket, where JOB_ID is the job ID of the head node.
You can find the <JOB_ID>.meta.json.gz file name and the status of the metadata dump in the job.events log under Attachments of the head node job.
This <JOB_ID>.meta.json.gz file is critical for restoring the work directory in subsequent attempts via the environment variable PREVIOUS_JOB_ID.
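
Before resubmitting with PREVIOUS_JOB_ID, it is worth confirming that the metadata dump actually reached the bucket. The sketch below filters a recursive bucket listing for the expected file name. It assumes the four-column output of aws s3 ls --recursive and object keys without spaces, makes no assumption about where in the bucket the file sits, and uses a hypothetical helper name find_meta_dump.

```shell
#!/bin/bash
# Read `aws s3 ls --recursive` output on stdin and print the key of the metadata dump for a job.
# find_meta_dump is an illustrative helper; it exits non-zero when no matching key is found.
find_meta_dump() {
  awk -v name="$1.meta.json.gz" '
    {
      key = $4                                     # 4th column of the listing is the object key
      suffix = "/" name
      if (key == name || substr(key, length(key) - length(suffix) + 1) == suffix) {
        print key
        found = 1
      }
    }
    END { exit !found }
  '
}

# Usage (bucket name and job ID are placeholders):
# aws s3 ls s3://<work-bucket>/ --recursive | find_meta_dump <JOB_ID>
```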

Resuming Workflows with the Job-Submit Script

Juiceflow supports the -resume option with Nextflow. Each workflow execution generates a <JOB_ID>.meta.json.gz file in the S3 work bucket. To restore the work directory in a subsequent attempt, pass the job ID of the previous head node through the environment variable PREVIOUS_JOB_ID.

Example command

float submit \
--hostInit transient_JFS_AWS.sh \
--hostTerminate hostTerminate_AWS.sh \
-i docker.io/memverge/juiceflow \
--vmPolicy '[onDemand=true]' \
--migratePolicy '[disable=true]' \
--dataVolume '[size=60]:/mnt/jfs_cache' \
--dirMap /mnt/jfs:/mnt/jfs \
-c 2 -m 4 \
-n <job-name> \
--securityGroup <security-group> \
--env PREVIOUS_JOB_ID=<PREVIOUS_JOB_ID> \
--env BUCKET=https://<work-bucket>.s3.<region>.amazonaws.com \
-j job_submit_AWS.sh

Monitoring on OpCenter

Workflow Execution Log

To monitor workflow execution and get a detailed view of each process in the Nextflow workflow:

Click on the Workflows dashboard in the OpCenter.

Click on the workflow name to monitor the jobs running in this workflow in a consolidated view.

Once the head-node job starts Executing, you can monitor the Nextflow stdout by clicking on the job -> Attachments -> stdout.autosave.

Individual Job Logs

To view the execution log of any particular job:

Click on the Job-ID.
Navigate to the Attachments tab.
Click view (eye-icon) on the stdout.autosave file.

Create Job Templates to Launch via MMCloud GUI

Job templates make it easy to launch and customize runs that follow a similar format, without manually setting up a command every time. Creating a template requires that one job has been submitted first.

From the Jobs dashboard, select the head node job previously submitted above and click on More Actions -> Save as Template.

Provide a name and tag for the private template.

Navigate to the Job Templates dashboard and click on Private templates.

You can click on any job template, edit/change samplesheet, variables etc., and submit new jobs from the GUI.

Users can also submit jobs from templates via the CLI:

float submit --template private::<template-name>:<template-tag> \
-e BUCKET=s3://<aws-jfs-bucket>

Additionally, keep in mind the settings that must be updated with every run if they deviate from the default values in the private template. These mainly include:

S3 Bucket URL

Updating of the nextflow run command in the job script
