ENCODE Uniform Data Processing Pipelines: Introduction – J. Seth Strattan

Seth Strattan:
Well, thank you all for coming this afternoon. My name is Seth, and Ben and I will teach
you today about the ENCODE uniform processing pipelines that we’ve developed at the DCC.
We had a link here on setting up the environment for you to actually set up these pipelines
yourselves. If you haven’t already done that, you might want to click there and do
it. You’ll miss my introductory remarks as you focus on creating your account, but
I guess that’s okay. Before I start, I want to tell you that this is a very interactive
session. We have the luxury of having two hours together, so — my colleagues from the
DCC are here and will be circulating around to help you. So if you get stuck at any point,
please just raise your hand and someone will help you. So the people who will be helping are my colleague
Gene Davidson, who’s waving at you from the back, and Mike Cherry [spelled phonetically],
who just introduced the session; Eurie Hong, here in the front; and you’ll meet Ben in
a minute, because he’ll be up here on the stage. And we’re also joined by two colleagues
from DNAnexus. Joe Dale is also in the back, and George Asimenos [spelled phonetically]
as well. So we have a lot of people here who can help you, so at any point, we want this
to be very much a workshop and not a lecture. So, if at any point, you’re hung up and
you want to ask me a question or Ben a question, or just have somebody come and help you, please
just raise your hand. Network issues, we’ll just have to do our best on. So, as Mike said,
if you have multiple devices on the network now, maybe shut down the ones that you’re
not using to free up a few more addresses. Okay, so with that, I will begin — with a
few leading questions, just to give us a chance to know who you are and what you’d like
to learn here. So, just with a show of hands, I’m going to ask some sort of random questions.
So how many of you have downloaded ENCODE data and intersected it with your own data?
Okay, that’s good. How many of you have already implemented at your institute a software
pipeline based on what ENCODE has done? Okay, good. How many of you believe that you could
repeat an ENCODE analysis starting from FASTQs to generate an IDR-thresholded set of
peaks? Okay, that’s a few, that’s good. How many of you want
to repeat the ENCODE analyses on your own data? All right, that’s great, because that’s
why we’re here. And just a general question, how many of you have found in the past that
you needed to access ENCODE data, but you found it difficult or you just didn’t know
where to start? Okay. Hands. That’s okay, too. So, we want to help with all of these.
All right. Thank you for that. So I’ve been writing pipeline code now for
several months, and so everything looks like a pipeline to me now. And so you’re actually
in a pipeline, and I’m going to put this workshop in the context of a pipeline through
this meeting. So yesterday, Eurie showed you the ENCODE portal, which is really a way to
access ENCODE data. And then Pauline and Emily showed you the UCSC Genome Browser and the
Ensembl Browser. Once you’ve accessed ENCODE data, how do you visualize it? And then Emily
and Jill showed you some tools for actually interpreting ENCODE data. And right now Ben
and I are going to talk about how the data, how the processed data, are actually generated
through these processing pipelines. And then in the next workshop you’ll get to explore
some advanced analysis methods from Michael and Yanli and Luca and Cameron. So that’s
sort of where we are right now, in the overall scheme of the workshop, is just to talk about
how analysis data in ENCODE are generated. So you might know that the DCC, the Data Coordinating
Center, delivers all of the ENCODE data to you through the ENCODE portal, so all of that
data actually exists on the Cloud in an Amazon bucket. And when you download files from the
portal, they actually come out of this bucket. And you can use the ENCODE portal to search
and find that data, both the primary data and also the processed data. The DCC also delivers,
through the portal, the metadata: the information about these transformations from sample to
library to primary data that helps justify your interpretation of ENCODE data in biological
terms. So these are things like antibody specificity in a ChIP-seq experiment, the properties
of the DNA libraries that are sequenced, maybe protocol documents; all of that is available
through the ENCODE portal, sort of the metadata that describes the experiments that ENCODE
has done.
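If you would rather script that search than click through the website, the portal also answers
the same kinds of queries over its REST interface. Below is a minimal sketch in Python; the
specific filters, field names, and the shape of the embedded file objects are assumptions for
the example rather than a guaranteed contract, so treat it as a starting point:

    import requests

    PORTAL = "https://www.encodeproject.org"
    HEADERS = {"Accept": "application/json"}

    # Search for released experiments; the filters below (assay title, biosample)
    # are illustrative examples of the portal's faceted search parameters.
    search = requests.get(
        f"{PORTAL}/search/",
        params={
            "type": "Experiment",
            "assay_title": "TF ChIP-seq",
            "biosample_ontology.term_name": "K562",
            "status": "released",
            "limit": 5,
            "format": "json",
        },
        headers=HEADERS,
    )
    search.raise_for_status()

    for experiment in search.json()["@graph"]:
        print(experiment["accession"], experiment.get("description", ""))
        # Fetch the full experiment record to see its files and their metadata.
        record = requests.get(PORTAL + experiment["@id"], headers=HEADERS).json()
        for f in record.get("files", []):
            if not isinstance(f, dict):
                continue  # depending on the frame, files may come back as @id strings
            # Each file carries an href that resolves to the underlying object
            # in the Amazon bucket when you download it.
            print("   ", f.get("accession"), f.get("output_type"), PORTAL + f.get("href", ""))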
So we’re pretty good at that, but one thing that has not been as well described, I think,
are the transformations from primary data to processed data. And that’s what the pipelines
do. They take the primary data from the experiments and turn it into something that you might
visualize at the UCSC Genome Browser, or that you might take into subsequent statistical
analyses of your own.
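To make that transformation concrete, here is a heavily simplified sketch of what a ChIP-seq
style pipeline does: map the FASTQs for each replicate, call peaks against a control, and
reconcile the replicates with IDR. The tools, flags, and file names are placeholders chosen
for illustration and are not the exact commands or pinned versions that the ENCODE pipelines
run:

    import subprocess

    def run(cmd):
        """Run one pipeline step as a shell command."""
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # 1. Map each replicate's FASTQ to the reference and sort the alignments.
    for rep in ("rep1", "rep2"):
        run(f"bwa mem -t 8 GRCh38.fa {rep}.fastq.gz | samtools sort -o {rep}.bam -")
        run(f"samtools index {rep}.bam")

    # 2. Call peaks per replicate against the control alignments.
    for rep in ("rep1", "rep2"):
        run(f"macs2 callpeak -t {rep}.bam -c control.bam -g hs -n {rep} --outdir peaks")

    # 3. Reconcile the replicate peak calls with IDR to get a thresholded set.
    run("idr --samples peaks/rep1_peaks.narrowPeak peaks/rep2_peaks.narrowPeak "
        "--input-file-type narrowPeak --output-file idr_thresholded_peaks.txt")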
So to try to improve the transparency of this process and the repeatability of the process,
we started this project at the DCC of deploying the defined ENCODE processing pipelines. And
for the project, we had the
following goals. And I’m going to go through these goals now because really the figure
of merit for our workshop is, at the end, we hope that you will feel that we have met
some of these goals. But this is what we’re trying to accomplish. So first is to deploy the consortium-defined
processing pipelines for four key experiment types — ChIP-seq, RNA-seq, DNase, and whole
genome bisulphite sequencing for DNA methylation measurements. At the DCC we are the ones who
actually use the pipelines to generate the standard set of peak calls or quantitations,
or methylation calls that make up sort of the core set of analysis files from ENCODE.
So, at the DCC we’re going to use those pipelines to produce those. We’re going
to capture the metadata to make clear exactly what software we used, what versions, what
parameters we ran it with, what inputs were used, and so forth. And then, of course, to
capture, accession, and share with you the output of those pipelines. And then here’s
the key — and this is why we’re here and this is why we have this workshop today — is,
we could have done all that, and it could have all been sort of a black box to you.
But we didn’t want it to be that way today. We actually want to deliver exactly the same
pipelines in a form that absolutely anyone can run either on their own data or against
ENCODE data in a scalable way. So if you have one experiment, we want it to be a tractable
installation for just one experiment or for thousands of experiments. So these are the goals that we set for ourselves
for this pipeline development project, and they can kind of be summed up by the words
here in the bottom. We want the pipelines to be replicable; we want the provenance of
the files that are generated by the pipeline and the software that’s used to be transparent;
we want them to be relatively easy to use; and we want them to be scalable, because you
may have one or 10 or 100 experiments; we have 10,000 experiments to run, or 5,000.
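As a concrete picture of what capturing that provenance can look like, the sketch below
records, for a single pipeline step, the software, version, parameters, and inputs that
produced an output file. The field names are hypothetical and are not the DCC’s actual
metadata schema:

    import json

    # Hypothetical record for one pipeline step; the field names are made up for
    # illustration, but they capture the same facts: software, version,
    # parameters, inputs, and outputs.
    step_record = {
        "step": "peak_calling",
        "software": "macs2",
        "version": "2.1.0",
        "parameters": ["callpeak", "-g", "hs", "--keep-dup", "all"],
        "inputs": ["rep1.bam", "control.bam"],
        "outputs": ["peaks/rep1_peaks.narrowPeak"],
        "genome_assembly": "GRCh38",
    }

    # Keeping this record alongside the output file means anyone can re-create
    # the file: same software, same version, same parameters, same inputs.
    with open("rep1_peaks.provenance.json", "w") as fh:
        json.dump(step_record, fh, indent=2)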
Okay, so those are the goals for this pipeline development project at the DCC. So when we thought about how we would accomplish
this, one of the questions that we had to decide was where to deploy the pipelines.
So you probably have access to a high-performance computing cluster; we do too at Stanford,
of course. Some of you might have your own. And that’s of course a possibility, to deploy
this as a set of scripts that you could perhaps run on your machines. We didn’t choose that
as our first deployment platform. Instead, we chose to deploy these pipelines on the
Cloud. And this is sort of a summary of our thinking process and how we reached the decision
of where to deploy these pipelines first. So what I’ve done here in this table is
just summarize some considerations that you get for different platform delivery choices.
So, I want to see whether it’s hard to develop the pipelines; whether it’s hard to share
them; whether it’s hard for you to run them; whether they’re elastic — this idea of
being able to run it on just one or many hundreds of experiments — whether the provenance of
exactly what was run, what inputs were used, and what software was used is clear; and how
much it costs to run them. And for us, the cost also involves the cost of development. Okay, so if we were to deploy this as a set
of scripts — maybe in a tar ball that you downloaded and you brought up on your cluster
— it would be hard for us to develop that, because development is always hard, because
we actually have to develop a lot of infrastructure software that runs steps of the pipeline
at the same time, steps that could be run in parallel. It’s somewhat difficult, actually, to share
that too because you need to install it on your cluster. You might not have the same
versions of software; I don’t know if you’ve ever had that problem of trying to install
something and you don’t have the right prerequisite software versions. It’s also hard to run,
because shell scripts sometimes don’t work [laughs]. It’s not particularly elastic, unless
you have an enormous cluster — and by “elastic” I mean I want to be able to throw 1,000 experiments
at this pipeline and have it run in the same amount of time that it takes to run one. Provenance
is somewhat more difficult because you might be running a different version of the software
than we are. And the cost is sort of obscure; it’s difficult to pin down because clusters
are oftentimes subsidized, and it’s not always clear just exactly how
much it costs to run something. So you might say, well, we could just use this
containerization technology like Docker, or something like that, so that we can just deliver a
binary image that just runs. And that improves some of these metrics, but not really
all of them. However, we thought, well, if we could just make this stuff run on the web,
so that you just came to a website, all of the software was already ready to run on the
Cloud, right? Some compute cluster that you have nothing to do with. Then it actually
is really easy to share because everybody has a web browser; that’s really all you
need. That, plus your data. It’s very easy to run for the same reason. It’s very elastic,
because in fact, as we’ll tell you later, all of the compute that stands behind these
pipelines is Amazon Web Services, which is the same compute that serves up Amazon Prime
videos when you binge-watch TV series or whatever, so it doesn’t — the compute is a non-issue.
They’ve got plenty of capacity to run these — all of these experiments. The provenance
is excellent because it’s exactly the same software that we run to produce the standard
output from ENCODE; it’s the software that you can run too. The cost is sort of different,
or maybe just appears different, because you know exactly how much it costs, because this is
on a commercial web-based platform. So we chose to deploy the pipelines on this
web-based Cloud platform first. And we know that people have their own clusters and that
you want to be able to run this software on your own cluster as well, and so there will
be subsequent deployments that will allow you to do that. And even today, all of the
software that runs these pipelines is completely open-sourced, and it can be adapted today
to run on your cluster as well. We’ve just chosen this platform called “DNAnexus”
to deploy the pipelines first. We actually run the pipelines there as well. So you’ll hear
a lot more today about DNAnexus and how the pipelines run there, but I just wanted to
spend a couple of minutes talking about why we chose to deploy the pipelines there first.
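To give you a flavor of what running a pipeline there looks like from a script, here is a
minimal sketch that drives the DNAnexus dx command-line client. The project name, workflow
ID, and input field names are placeholders invented for the example; the real applets and
their inputs are what we will walk through in the hands-on portion:

    import subprocess

    def dx(*args):
        """Thin wrapper around the DNAnexus 'dx' command-line client."""
        print("+ dx", " ".join(args))
        subprocess.run(["dx", *args], check=True)

    # Log in and select the project that holds your data (the project name,
    # workflow ID, and input field names below are placeholders).
    dx("login")
    dx("select", "my-chipseq-project")

    # Upload the primary data (FASTQs) into the project.
    dx("upload", "rep1.fastq.gz", "rep2.fastq.gz", "control.fastq.gz")

    # Launch the workflow; -i sets each named input, -y skips the confirmation.
    dx("run", "workflow-xxxx",
       "-i", "reads_rep1=rep1.fastq.gz",
       "-i", "reads_rep2=rep2.fastq.gz",
       "-i", "control=control.fastq.gz",
       "-y")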
Yes?

Male Speaker:
[unintelligible].

Seth Strattan:
So I forgot a metric; what was the one that I forgot?

Male Speaker:
Confidentiality, you know, the [spelled phonetically] —

Seth Strattan:
Ah, okay, very good. So the question is confidentiality. So you have — yeah, that’s a good point.
And it so happens that this platform is compliant for PHI, so you can actually bring data
that cannot be shared publicly into your workspace and run the software on the data, because
we never see the data. The data resides in your project, and it stays private.
You bring the software to — you bring the compute to your data. And we’ll talk more
about exactly what that means later, but yes, you’re right. If you have clinical data
or other data that can’t be made public, this environment can handle that. Can, yes.
So, that’s a good question. Okay. So I’m going to stop there, and we’re
going to start the live demo portion of the presentation. And I just want to reiterate
that this is a workshop, and we want this to be interactive, and so please raise your
hand if at any point you get lost or you have a question. And then we’ll stop and answer
your question, or one of us will try to help you. [end of transcript]
