ENCODE Uniform Data Processing Pipelines: ChIP-seq IDR Architecture – J. Seth Strattan

Seth Strattan:
So, what I have to do here is so much easier than what Ben just did, and I am almost embarrassed
to go back to slides, because the live demo is incredibly nerve-wracking. So far we've
been talking about the RNA-seq pipeline: in the beginning we told you that we have
implemented an RNA-seq processing pipeline. We have also implemented a ChIP-seq processing
pipeline for both transcription factors and for histone modifications, as well as a
whole-genome bisulfite pipeline, and the DNase pipeline is coming. I'm not going to do
a live demo of the ChIP-seq pipeline, but I did want to tell you what the architecture
of the pipeline is and make you aware that it exists, so that if you are also interested
in processing ChIP-seq data, you should be able to do that. The deployment
of the ChIP-seq pipeline is identical to the RNA-seq pipeline. It’s on DNAnexus, the same
sort of process that you would go through to add data to the beginning of the RNA-seq
processing. Exactly the same process applies for ChIP-seq as well. So, the pipeline,
like many of the pipelines that we deploy, has a mapping step, then a peak calling
step, and then a statistical framework that is applied to the replicated
peaks at the end to assess the concordance of biological replicates. All ENCODE experiments
are replicated, and so this last piece, called IDR, is something
that we run on all of the ENCODE experiments. If your experiments
are not replicated, you can't run this, but you can still call peaks. Okay. So briefly:
in the ChIP-seq pipeline, for both transcription factors and histone modifications, the
mapping step is done with BWA. Duplicates are marked and removed. The peak calling step
for transcription factors uses SPP. The peak calling step for histone modifications uses
MACS2. MACS2 is also used to generate the signal tracks for both histone modifications and
transcription factors and I am going to tell you about the difference between the peak
calls and the signal tracks in just a moment. And then, as I mentioned, there’s a piece
of software called IDR that is a statistical framework that allows assessment of concordance
of two replicates. So, something that we haven’t talked about yet, but is very important, an
important advantage to deploying the pipelines in the way that we have, is that we can generate
all sorts of quality assurance metrics. So, all ENCODE experiments have target read depths.
They have target library complexities. They have goals for data quality that have to be
reached for those experiments to be accessioned and distributed as ENCODE products.
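These library-complexity goals are usually expressed through metrics like the non-redundant fraction (NRF) and the PCR bottleneck coefficient, which come up again below. A minimal sketch of how such metrics can be computed from mapped read positions; this is an illustration, not the official ENCODE implementation:

```python
from collections import Counter

def complexity_metrics(read_positions):
    """Estimate library-complexity metrics from mapped read positions.

    read_positions: list of (chrom, start, strand) tuples, one per mapped read.
    Returns NRF (distinct positions / total reads) and PBC1
    (positions seen exactly once / distinct positions).
    """
    counts = Counter(read_positions)
    total = sum(counts.values())
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    nrf = distinct / total
    pbc1 = singletons / distinct
    return nrf, pbc1

# Toy example: six reads, one position duplicated three times.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "+"),
         ("chr1", 250, "-"), ("chr2", 42, "+"), ("chr2", 99, "-")]
nrf, pbc1 = complexity_metrics(reads)
print(nrf, pbc1)  # 4 distinct positions out of 6 reads; 3 of the 4 seen once
```

Values closer to 1 indicate a more complex (less duplicated) library, which is the direction the ENCODE targets push toward.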
And so the calculation of those quality assurance metrics is important to us, because that's
the way that we figure out whether the experiments were any good or not. But they will also be
important for you when you run your experiments through these pipelines because you can compare
your data to ENCODE data and see how it stacks up. So, I wanted to point
you to resources to learn about these QC metrics rather than step through the math, which I
probably couldn't do justice to anyway. We calculate four general categories
of quality assurance metrics for the ChIP-seq experiments and some of these also apply to
DNase experiments. So, of course, we calculate the read depth and there’s an excellent paper
that I refer you to here that talks about target read depths for ChIP-seq experiments
and what read depths you want to try to achieve for different histone modifications or different
transcription factors, depending on how often they bind to the genome. We
also calculate some estimates of the complexity of the library that you sequenced; those
are called the NRF (non-redundant fraction) and the PCR bottleneck coefficient, and they are
documented in this paper here. There's also a strand cross-correlation method, documented here,
that we calculate as part of the pipeline, and that's a measure of the quality of the ChIP.
And then I mentioned that IDR is the way that we quantify the concordance between replicates. So, rather than step through in great detail
what all of these metrics are or what they mean, I just wanted to put this in the slide
deck so that you could go back and look them up and read about them if you want to. So,
you’ve already seen this, most of you I hope. But this is just what the histone ChIP-seq
pipeline looks like running on DNAnexus. It looks just like the RNA-seq pipeline. There
is a workflow that is composed of steps. Some of those steps run concurrently, some run
by themselves, and they run through to the end, and this is the display that you see when
you run one of these pipelines to completion on the platform. So, I also wanted to just
show you: after we run these pipelines, we at the DCC, of course, accession all the output
at the ENCODE portal. And I thought I would just quickly show you what that looks like.
So, an experiment that has been run on DNAnexus and then accessioned back into the ENCODE
portal — and here I am at ENCODEProject.org. This is the experiment page for an experiment
that has been run through the pipeline. This is what Eurie showed you yesterday. You can
access the metadata that describes the experiment. But now, if we scroll down, we see the files
that are generated by the pipeline. Okay. And this is a graphical representation of
what just happened on DNAnexus. So, you see files are yellow bubbles and software steps
are blue rectangles. And you can follow the trajectory, if you want, that the raw data
takes through the pipeline by following the arrows through this graph. So, I won’t step through it, but what I just
want you to see is that on the portal, without ever going to DNAnexus at all, you can see
exactly what the relationships are between the input files, intermediate files that are
generated, and the final output. So, that’s what we’re trying to depict on this graph
here: the relationships between the files and the software steps that generated them. Again,
this is all on the ENCODE portal, so this is accessioned metadata about a processing
pipeline that was run. You can click on each one of these. I’m just going to click on this
one in the middle at random and scroll down. And you’ll get additional metadata about that
file, as well as a link to download it. So, what we’re trying to accomplish on this page
here is just to show you how the files were generated and give you direct access to them
one by one. Yes? Male Speaker:
[inaudible] Seth Strattan:
Yes. Male Speaker:
[inaudible] Seth Strattan:
Okay. So the question was what — one of these says signal over control and what does that
mean. I am going to go back to the slides. And so this is important if you care about ChIP-seq.
So, the ChIP-seq pipeline actually creates a number of outputs. It generates
peak calls, which are these blocks that you see on tracks. They have a definite start
and a definite stop and they’re generated based on the raw signal. But we also generate
these continuous tracks that you see on the browser of where the ChIP-seq signal was high
and where it was low. And all of those signal tracks are normalized to the control experiment
for the ChIP. That could be input DNA, it could be a mock IP. But the signal tracks
that you see — output from the uniform pipelines — the signal tracks that you see are normalized
to the controls. So, if you see a positive going trend in that track, you know that that
came from the experiment and not from the control. Did I answer your question? Male Speaker:
No. Seth Strattan:
Okay. No. [laughter] I answered some question. I hope someone had
that question. Male Speaker:
[inaudible] Seth Strattan:
Yes. Yes. Male Speaker:
[inaudible] Seth Strattan:
That's correct. Okay. Let me repeat your question. So, the question was actually about
how, exactly, you input the control files into the pipeline. You didn't see something
like that for RNA. The ChIP-seq pipeline takes these types of controls, and you add those
fastqs from a control experiment in exactly the way that you add input fastqs
from the experiment itself. So, the input to the pipeline is fastqs from your
experiment and also fastqs from the control. Male Speaker:
About like the way it’s — or new ways [inaudible]. Seth Strattan:
Yeah. So, typically we match controls to experiments. So, if you have two replicates you will have
also two control replicates. However, the pipeline will run if you submit the same reads
as both controls. We do a certain amount of read normalization between the two controls.
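This control handling, where reads are normalized between controls and badly depth-mismatched controls are pooled, can be sketched roughly as follows. The 1:5 ratio threshold is an illustrative assumption, not the pipeline's actual cutoff:

```python
def choose_controls(ctl1_reads, ctl2_reads, max_ratio=5.0):
    """Decide whether to use each control as-is or pool them.

    ctl1_reads / ctl2_reads: read lists for the two control replicates.
    If one control is much deeper than the other (ratio above max_ratio,
    an illustrative threshold), pool the reads and use the pooled control
    for both experiment replicates; otherwise keep the matched controls.
    """
    n1, n2 = len(ctl1_reads), len(ctl2_reads)
    ratio = max(n1, n2) / max(1, min(n1, n2))
    if ratio > max_ratio:
        pooled = ctl1_reads + ctl2_reads
        return pooled, pooled          # same pooled control for both reps
    return ctl1_reads, ctl2_reads      # depths comparable: matched controls

# A shallow control paired with a deep one gets pooled:
c1, c2 = choose_controls(["r"] * 100, ["r"] * 1000)
print(len(c1), len(c2))  # both 1100
```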
If one control is very shallow and the other is very deep, for example, we’ll pool those
and use that pooled control for both experiments. Yes? Male Speaker:
Do you use any way of aligning the ChIPs — aligning the peaks — so that we
can say that this is actually the same peak from different experiments? Seth Strattan:
That’s a good question. So, from two replicates or from two different experiments? Female Speaker:
For experiments. Seth Strattan:
Right. From different experiments, no, we do not. And this actually brings up
an important design criterion for all of our pipelines. Our pipelines really are
designed to take one experiment, usually a replicated experiment, and produce a uniform
output from that one experiment that is then comparable across
many experiments. But that comparison across experiments, that's for you to do. Our pipelines
really are designed to take the primary experiment data and process it into some sort of output
that can be consumed by any analysis algorithm that you might want to apply to compare experiments.
So, most of our pipelines work within a replicated experiment. Male Speaker:
Okay, because that then — I mean, there are, you know, kind of these two pipelines — Seth Strattan:
Yes. Male Speaker:
— because you cannot compare two samples of two defined — Female Speaker:
Here, here, microphone. Male Speaker:
Two different time points of two different samples. That’s a major, you know, hard to
— yeah. Seth Strattan:
It is hard to do. Yeah, no. It’s — but that’s why we didn’t do it [laughs]. Male Speaker:
Yeah. Seth Strattan:
But really, our role as a data coordinating center is to give uniform output from
each experiment that can then be used for subsequent analysis. So, we would consider
that a subsequent analysis, if you want. I'd be happy to take any other questions. Male Speaker:
Oh, maybe to follow up on that. All of these things being generated at different center
of — might also [inaudible]. Seth Strattan:
Sometimes it takes a second for — there it is. Male Speaker:
There we go. Seth Strattan:
You’re good now. Male Speaker:
So, all of this data is being generated at different centers with possibly different
instruments, different flow cells, lanes, and all that. Seth Strattan:
Yes. Male Speaker:
To sort of follow up on the question that we just asked, how do you normalize across
all those things? And it sounds like maybe you don’t; that’s a downstream thing. But
can you give us any idea how we would do that, because those effects can be kind of significant? Seth Strattan:
Okay. That's a good question. So, this is one of the reasons why we take primary reads
and not, for example, mapped reads. We could build our pipelines to take BAM files,
for example. But you might not map your reads in exactly the same way as we would have
mapped them, and that difference can propagate through to the end,
and when you do your PCA, PC1 is the lab, right? Which is not what you want. And
I think that's what you're concerned about. What we have found is that within the
consortium there are working groups that set standards for how experiments are performed,
and those are documented on the portal. Eurie
pointed that out yesterday. And what we have found is that if those guidelines are followed
(for ChIP, for example, that the antibodies have been characterized to the same levels
and that the ChIP is performed in the same way), then even for data from multiple locations
run at different times, if you put the fastqs all into the same pipeline, the results
are comparable. However, what isn't necessarily comparable is read depth, or the libraries
themselves, and that's what I was talking about in the QC metrics that we calculate.
Those are definitely not uniform. They’re not uniform within a lab, neither
are they uniform across labs. So, that's one of the reasons why we generate all those QC
metrics: they should all fall within target ranges in order for you to then be able to compare
the data at the end. So, I didn't really give you a checklist that you might go
down to ensure that an experiment that you want to compare to ENCODE is comparable. But
you definitely want to calculate the same QC metrics through the
pipeline and compare those to other experiments that have been done within ENCODE, and if
they're very different, then it's unlikely that the results will be comparable. Thank
you for that question. Okay. Yes, go ahead? Male Speaker:
[inaudible] The step between, like [inaudible]. So, maybe I missed it, but could you explain
the step from the BAMs to the pseudo replicates and the pooling? Seth Strattan:
Right, right. Okay. Male Speaker:
Thank you. Seth Strattan:
I'd be happy to. So, I'm going to give you the answer for histone ChIP first, because
it's simpler; the answer for TF (transcription factor)
ChIP will be slightly different. So, for histone ChIP, what happens here is we call peaks for
each actual replicate. Let's say an experiment has two replicates. We call peaks on replicate
one, and we call peaks on replicate two. We take the reads from both of those replicates and
we pool them, and then we call peaks on the pool. All right? So now I've called peaks
three times: on each of my true replicates, and on all
the reads pooled together. That's different from concatenating the peak lists, right? It's
actually an independent peak calling on the pooled reads. Then we back up. We take that set of
pooled reads and we split it in half, and we call those pseudo replicates. The reads are chosen
at random, without replacement; we simply split the pooled reads in half. And then we
call peaks on each of those pseudo replicates. All right. So, I've called peaks on five read
sets: true replicate one, true replicate two, the pool, pseudo replicate
one of the pool, and pseudo replicate two of the pool. Five sets of reads we've
called peaks on. In the end, when we report the replicated peaks
(which you'll see when you bring up the experiment page: there will
be rep1 peaks, rep2 peaks, and then there will be replicated peaks), the replicated
peaks are those which appear in both true replicates. That's good, right? You've
replicated your peak; it's in both places. Or, if a peak doesn't, it has, as
I said, a last chance to get into this set: if it appears in both of the pseudo replicates
of the pool, then that also qualifies as a replicated peak. So, that's what's happening
here when we pool replicates and we subsample into pseudo replicates. That's what's going
on here, and we call peaks on all of those. That's why this list here is long. All of this is in order to generate the
subsampled pools from which we decide whether peaks are in fact replicated. That's
for histone. For TF ChIP, where we actually run a full IDR protocol, there are additional
pseudo replicates that are generated: pseudo replicates of the true replicates are
also generated and fed into the IDR framework, and those are not accessioned on the portal.
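The pooling and pseudo-replication scheme described here can be sketched as follows. `call_peaks` is a stand-in for the real peak caller (MACS2 for histone marks), and the exact-match membership test is a simplification of the pipeline's actual peak-overlap criteria:

```python
import random

def pseudo_replicates(reads, seed=0):
    """Split pooled reads in half at random, without replacement."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def replicated_peaks(rep1_reads, rep2_reads, call_peaks):
    """Histone-style replicated-peak logic, as described in the talk.

    Peaks are called on five read sets: each true replicate, the pool,
    and two pseudo replicates of the pool. A pooled peak counts as
    'replicated' if it appears in both true replicates, or (last chance)
    in both pseudo replicates of the pool. Membership here is an
    exact-match simplification of the real overlap test.
    """
    pooled = rep1_reads + rep2_reads
    psr1, psr2 = pseudo_replicates(pooled)
    peaks = {name: call_peaks(reads) for name, reads in [
        ("rep1", rep1_reads), ("rep2", rep2_reads), ("pool", pooled),
        ("pool_psr1", psr1), ("pool_psr2", psr2)]}
    return [p for p in peaks["pool"]
            if (p in peaks["rep1"] and p in peaks["rep2"])
            or (p in peaks["pool_psr1"] and p in peaks["pool_psr2"])]

# Toy caller: a "peak" is any position with two or more reads.
toy_caller = lambda reads: {r for r in reads if reads.count(r) >= 2}
print(replicated_peaks(["A", "A", "B", "B"], ["A", "A", "C", "C"], toy_caller))
```

In the toy run, only "A" is called in both true replicates; "B" and "C" each have too few reads to survive the pseudo-replicate split, so only "A" comes out replicated.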
So, you’ll never see those files. They actually just exist within the pipeline. But that contributes
to the IDR-thresholded peaks that you get in a TF ChIP experiment. All right. So it's
subsampling and pseudo replication within the true replicates that are then run through
this framework in order to have an unbiased, quantitative way of determining whether a
peak came from both replicates. I hope that was helpful. Thank you. Yes? Female Speaker:
So, for the replicated peaks, the coordinates of the calls are based upon the pool of replicates? Seth Strattan:
Yes, yes. The question was: for the replicated peaks, are the coordinates based on the pool
or based on the true replicates? It is a great question, and yes, they are from
the pool. Female Speaker:
All right. Seth Strattan:
Okay. I am going to — this will not take long. I’ve shown you the graph of file relationships
and the only other thing that I wanted to show you is that each of these
files is also available through the graph here, in the way that I showed you, by clicking
on an individual file, but also in a list of files down here at the bottom of the experiment
page on the portal. So, what I wanted to make clear just through these slides is that we
spent a lot of time talking about this platform where we actually run the experiments on the
cloud, but all of the results of those runs are distributed through the portal. So, they
could have been generated anywhere, I suppose, but we in fact do use these pipelines that
we are sharing with you, so — but the results are accessioned and distributed through the
portal. So, I think I’ll stop there and now that Ben has had a chance to catch his breath,
we’ll see if we can’t visualize the results of your pipeline. [end of transcript]
