Running Omics Pipelines in the Cloud

I recently hosted a panel with AWS, Amgen, Ardigen, and Seqera on running omics workflows in the cloud.

You can watch the full video here:

Here are my key takeaways from the talk:

Running omics workflows on the cloud is about understanding what tradeoffs you’re making.

“For example, if you're running a production pipeline where you just need to keep it churning, but you don't need to have the results immediately, you can wait and deal with retries. You might also not necessarily need the latest and greatest hardware.

However, if you're in an R&D activity, the time of your developers is more expensive than the time of the machines.

So you need your developers to be able to progress and iterate extremely rapidly, because the cost you're paying the humans who are operating the machines is higher, and there is also the opportunity cost of needing to define the solution faster than maybe your competitors.

What's more important? Getting the answer fast? Getting to a product fast? Or do you just have something that's running in the background?”

When going full cloud, it’s important to understand what kinds of CPUs you need for your pipeline.

“Depending on the shape and size of the data and the kind of scientific algorithm that you're running, there are going to be different computational requirements.

In some cases you might want a beefy VM with a lot of compute, a lot of memory. Sometimes it's going to be a small operation that's cheap, but more disk bound.

So there are some very different shapes and sizes of machine that might be appropriate.”
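To make the idea of matching machine shape to task concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the machine names, sizes, and hourly prices are made up and do not correspond to any real cloud provider's catalog, and `cheapest_fit` is a hypothetical helper, not part of any workflow tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineShape:
    name: str
    vcpus: int
    memory_gb: int
    local_disk_gb: int
    hourly_usd: float  # made-up price, for illustration only


# Hypothetical catalog; real instance types and prices will differ.
CATALOG = [
    MachineShape("small-general", vcpus=2, memory_gb=8, local_disk_gb=50, hourly_usd=0.10),
    MachineShape("disk-heavy", vcpus=4, memory_gb=16, local_disk_gb=2000, hourly_usd=0.35),
    MachineShape("big-memory", vcpus=32, memory_gb=256, local_disk_gb=200, hourly_usd=2.00),
]


def cheapest_fit(vcpus: int, memory_gb: int, local_disk_gb: int) -> MachineShape:
    """Return the cheapest machine in CATALOG that meets all three requirements."""
    candidates = [
        m for m in CATALOG
        if m.vcpus >= vcpus
        and m.memory_gb >= memory_gb
        and m.local_disk_gb >= local_disk_gb
    ]
    if not candidates:
        raise ValueError("no machine shape in the catalog satisfies these requirements")
    return min(candidates, key=lambda m: m.hourly_usd)


# A compute- and memory-hungry step versus a small, disk-bound step.
print(cheapest_fit(vcpus=16, memory_gb=128, local_disk_gb=100).name)
print(cheapest_fit(vcpus=2, memory_gb=4, local_disk_gb=1000).name)
```

In practice, most workflow managers let you declare per-step resource requirements, so each step in a pipeline can land on a different shape of machine rather than one size for everything.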

Case Study: How the Broad Institute leverages spot instances to reduce costs

The Broad Institute cut costs by more than 90% going from their first implementation to their final pipeline. By leveraging tools like Spot instances, they were able to take their costs from $85 per whole genome to $5, a reduction of about 94%.

“When Broad's original genomic pipeline was migrated from on prem to cloud, the initial, completely naive implementation was about, I want to say $85 per whole genome. With some basic optimization, they very quickly brought that down to $45 per genome, which is almost half.

After three to six months of further investment (today it would take much less time because we have all these tools), we took it down to $5 a genome.

And if you think about the cost reduction there, that was huge. Nowadays you can use those kinds of approaches. You don't have to do six months of R&D because a lot of the tools exist. But you can make a really big difference going from the naive implementation to an optimized implementation.”
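The arithmetic behind those numbers is easy to check. Here is a short Python sketch using the three per-genome costs quoted above; the cohort size of 1,000 genomes is a hypothetical figure I added purely to show how the savings scale.

```python
# Per-genome costs quoted in the panel (USD).
COSTS = {
    "naive": 85.0,
    "basic optimization": 45.0,
    "fully optimized": 5.0,
}

# Hypothetical cohort size, not from the talk.
COHORT_SIZE = 1_000


def percent_reduction(before: float, after: float) -> float:
    """Percentage saved going from `before` to `after`."""
    return (before - after) / before * 100


baseline = COSTS["naive"]
for stage, cost in COSTS.items():
    saved = percent_reduction(baseline, cost)
    total = cost * COHORT_SIZE
    print(f"{stage}: ${cost:.0f} per genome, {saved:.0f}% below naive, "
          f"${total:,.0f} for {COHORT_SIZE:,} genomes")
```

Going from $85 to $45 is a reduction of about 47%, and going from $85 to $5 is about 94%.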

Ready to Learn More?

Spot instances are a great way to use spare cloud capacity at a 60 to 80% lower cost.

If you're curious about seamlessly running your research and other types of computational workloads on spot instances, feel free to book a chat with me.

More about Jing

I'm Jing, I run GTM for our memory snapshot product at MemVerge.

We protect your workloads when the cloud provider reclaims their instances before they're done running. We also automatically find you additional instances where your work can continue running to completion.

I love helping teams who may not have the resources to conduct their computationally intensive research locally or may not feel fully comfortable migrating their omics workflows to the cloud.

I post regularly about running omics workflows on spot instances and in the cloud. Check out my writings on LinkedIn.