Introduction to AWS S3
AWS S3 is a highly scalable, durable, and secure object storage service, making it an ideal choice for managing large-scale workflows like those in Nextflow. Nextflow includes built-in support for AWS S3, allowing seamless integration of S3 buckets into pipeline scripts. Files stored in an S3 bucket can be accessed transparently in your pipeline script, just like any other file in the local file system, enabling efficient data management across cloud and on-premises environments.
Pre-requisites for Using S3 with Nextflow
Before you begin, ensure you meet the following prerequisites:
1. A VPC Security Group with an Inbound Rule for Port 22
Ensure that your virtual private cloud (VPC) is properly configured to allow SSH access to the instances running your Nextflow pipelines. This requires a security group with an inbound rule to allow connections on port 22, which is used for SSH.
Navigation: AWS EC2 console -> Network & Security -> Security Groups
2. Create a New S3 Bucket for Storing Nextflow Output and Workdir Files
To store the outputs of your Nextflow pipelines and the intermediate files created in the Nextflow workdir, you will need a dedicated S3 bucket. This S3 bucket will serve as both the working directory and the final storage location, ensuring that all files generated during the pipeline's execution are accessible for future reference or further processing.
To create a new S3 bucket, follow these steps:
- Navigate to the AWS S3 Console.
- Click on Create Bucket to begin the setup process.
- After the bucket is created, you need to create two folders within the bucket: one for the output files and one for the workdir.
- You can then select each folder by checking the box next to it, and click on Copy S3 URL to obtain the folder's URL. This URL will be required for configuration in the upcoming
Editing the Configuration File
section.
3. Prepare All Required Scripts
1. s3flow_transient_hostInit.sh
curl -O https://mmce-data.s3.amazonaws.com/s3flow/v1/s3flow_transient_hostInit.sh
2. s3flow_transient_hostTerminate.sh
curl -O https://mmce-data.s3.amazonaws.com/s3flow/v1/s3flow_transient_hostTerminate.sh
3. s3flow_transient_nextflow_submit.sh
curl -O https://mmce-data.s3.amazonaws.com/s3flow/v1/s3flow_transient_nextflow_submit.sh
It is recommended to keep these three files in the same folder. Ensure that the EC2 Instance or local machine where these files reside has float
installed.
Deployment Steps for Individual Users on MMCloud
Float Login
Ensure you are using the latest version of the float
CLI:
sudo float release sync
Login to your MMCloud OpCenter:
float login -a <opcenter-ip-address> -u <user>
After entering your password, verify that you see Login succeeded!
Float Secret
Set your AWS credentials as secrets in float
:
float secret set AWS_BUCKET_ACCESS_KEY <BUCKET_ACCESS_KEY>
float secret set AWS_BUCKET_SECRET_KEY <BUCKET_SECRET_KEY>
To verify the secrets:
float secret ls
Expected Output:
+-------------------------+
| NAME |
+-------------------------+
| AWS_BUCKET_ACCESS_KEY |
| AWS_BUCKET_SECRET_KEY |
+-------------------------+
Configue Two Scripts
1. modifying s3flow_transient_hostTerminate.sh:
# ------------------------------------------------------------------
# ---- vvv Set root dir of nextflow cache dir location here vvv ----
# ------------------------------------------------------------------
# Should match the dir in the s3flow_transient_nextflow_submit.sh script
nextflow_dir="s3://your-bucket-here/subdir"
2. modifying mmc.config in s3flow_transient_nextflow_submit.sh:
Configuration File Content:
# This S3 URI is the parent dir location of where you want to save
# your .nextflow.log and .nextflow/ folder
# Should match the location in the s3flow_transient_hostTerminate.sh script
nextflow_dir="<s3://your-bucket-here/subdir>"
cat > mmc.config << EOF
plugins {
id 'nf-float'
}
workDir = 's3://<your_s3_bucket_name/your_workDir_folder_name/'
process {
executor = 'float'
errorStrategy = 'retry'
/*
If users would like to enable float storage function, specify like this
extra = '--storage <S3_bucket_name>'
*/
/*
If extra disk space needed, specify like this
disk = '200 GB'
*/
extra ='$FLOAT_VMPOLICY_OPT'
/*
For some special tasks like Qualimap which generates very small IO request, using this -o writeback_cache can help with performance. Here's an example.
withName: "QUALIMAP_RNASEQ" {
extra ='$FLOAT_VMPOLICY_OPT'
}
*/
}
podman.registry = 'quay.io'
float {
address = '<your_opcenter_ip>'
username = '<your_user_name>'
password = '<your_password>'
}
// AWS access info if needed
aws {
client {
endpoint = 'https://s3.<your_bucket_region>.amazonaws.com'
maxConnections = 20
connectionTimeout = 300000
}
accessKey = '$(get_secret AWS_BUCKET_ACCESS_KEY)'
secretKey = '$(get_secret AWS_BUCKET_SECRET_KEY)'
}
- Note:
- Replace the value of
workDir
with the S3 URL of the workDir you copied earlier. - Remember to add
endpoint = 'https://s3.<your_bucket_region>.amazonaws.com'
underclient
. - If you are providing a bucket in
us-east-1
, update the endpoint in your config file like so:
- Replace the value of
aws {
client {
endpoint = 'https://s3.us-east-1.amazonaws.com'
}
}
3. Enable copy the sample sheet as needed and modify Nextflow Command Setup in s3flow_transient_nextflow_submit.sh:
-
Data Preparation:
In this section, you will copy essential files (such as your sample sheet or parameters) from your S3 bucket to the working directory on the transient node as needed.Instructions:
- Uncomment and modify the following lines based on the location of your input files:
aws s3 cp s3://your-s3-bucket/samplesheet.csv . aws s3 cp s3://your-s3-bucket/scripts/params.yml .
- Replace
your-s3-bucket
with the name of your actual S3 bucket. - Ensure that the paths match the files you are using for your pipeline run.
- Uncomment and modify the following lines based on the location of your input files:
-
Nextflow Command Setup:
Thenextflow_command
variable contains the command that will execute your Nextflow pipeline. You need to modify this command to fit your specific pipeline and output settings.Instructions:
- Replace the
<your_outdir_bucket>
placeholder with the name of the S3 bucket you created for the pipeline's output.nextflow_command="nextflow run nf-core/<pipeline> -profile test -c mmc.config --outdir s3://your-s3-bucket/outputDir/"
- You can also modify the
-profile
and other pipeline-specific parameters depending on the pipeline you are running.
- Replace the
Deploy Nextflow Head Node
Deploy the Nextflow head node using the juiceflow:v2 template:
float submit \
-c 2 -m 4 \
-i juiceflow:v2 \
--storage <S3_bucket_name_of_user_input_data > \
--vmPolicy '[onDemand=true]' \
--migratePolicy '[disable=true]' \
--securityGroup sg-00XXXXXXXXXXX \
-j s3flow_transient_nextflow_submit.sh \
--dirMap /transient_s3flow/nextflow:/transient_s3flow/nextflow \
--hostTerminate s3flow_transient_hostTerminate.sh \
--hostInit s3flow_transient_hostInit.sh
Note:
- Replace
<security-group>
with your specific details. - The juiceflow:v2 template comes pre-configured with S3 setup.
-c 2 -m 4
is used to specify the head node’s CPU and memory configuration. Below, you will also see an example of modifying the nextflow:jfs template using the--overwriteTemplate "*" -c 8 -m 32
command to change the head node’s CPU and memory settings.- Using --storage is not mandatory, but it can be beneficial for certain users. For more details, please refer to the FAQ section of this tutorial.
Checking Head Node Deployment Status
float list -f 'status=executing'
Example Output:
+-----------------------+--------------------------+----------------------------------+-------+-----------+----------+----------------------+------------+
| ID | NAME | WORKING HOST | USER | STATUS | DURATION | SUBMIT TIME | COST |
+-----------------------+--------------------------+----------------------------------+-------+-----------+----------+----------------------+------------+
| 1cc0j755719d8xdewyidt | juiceflow-t3.medium | 54.82.58.176 (2Core4GB/OnDemand) | admin | Executing | 41m52s | 2024-10-14T20:05:34Z | 0.0296 USD |
+-----------------------+--------------------------+----------------------------------+-------+-----------+----------+----------------------+------------+
FAQ
Q: What does the --storage
option do?
A: In the past, if --dataVolume
or --storage
wasn't used, and a user provided input files (e.g., FASTQ) using S3 URLs, Nextflow would first have to copy these files into the workDir
so that the Nextflow process could access them. This step, called staging, had a significant downside: even if only a small part of a file in the S3 bucket changed, Nextflow would still need to download the entire file again into the workDir
for each process.
With the introduction of --dataVolume
and --storage
, this staging process is no longer necessary. These options allow Nextflow to directly access files in the S3 file system, eliminating redundant file transfers. This approach uses the open-source s3fs
solution, which enables seamless interaction with S3 as if it were a file system. You can find more details here: s3fs GitHub.
Additionally, the --storage
option was introduced to simplify the use of --dataVolume
. Previously, you had to manually provide AWS credentials when using --dataVolume
, but with --storage
, that’s no longer required.