ENCODE Uniform Data Processing Pipelines: Demo and Hands-on Tutorial – Ben Hitz


Ben Hitz:
Hi, everyone. I’m Ben. I’m going to do a live demo for you. So, a couple of things just
to start off. You should — in order to do this live demo, as we said repeatedly, you
have to have an account set up to do this. I’m going to run it through as if I am you.
I am going to log into a brand new account that’s never been used before and do it that
way. The steps of a demo are outlined in the handouts which you can download. So, if you
ever get lost you can refer back to that. I am going to first kind of zip through the
example that we are going to do, which is going to be a little tiny analysis of a half
a chromosome of RNA-seq data, which we are going to run to completion through the entire
ENCODE pipeline and it’s going to take about 50 minutes to run all the way through. So,
that’s why I’m going to sort of whip through how to set up a job and start it. And then
we’ll move back a little bit, explain a little bit more about how the interface works, and
if anyone gets stuck while they’re trying to run it, please raise your hand and someone
will come and help you. So, here we are logged in. So, this is a log
in panel. All right. So, when you log in, you start. You’ll see it’s my free trial.
I have nothing in here. This is sort of the project panel, so you organize all your things
and projects however you want. They’re like a folder effectively. So, to start here we’re
going to the ENCODE Uniform Processing Pipelines project. So, what we’re going to do is wait
for this to load, and we’re going to copy the long RNA-seq and the reference files
folders to a new project in your account. So, this little icon here is to create a new
project because you don’t have any projects. So, just call it demo or something. Work.
It’s a little slow, right? So, you actually have to click copy into this folder. Does
that work? Here we go. Done. Okay. So, now note that I’m still in the ENCODE Uniform
Processing Pipeline folder. So, I’m going to click this back arrow here. I have a new
project. Click that. Female Speaker:
[inaudible] to me. Oh, my best. Ben Hitz:
Oh. [laughter] Male Speaker:
Don’t let [inaudible] see you. Ben Hitz:
All right. Female Speaker:
All right. Okay, so — Male Speaker:
Okay. Ben Hitz:
The whole set up to run takes about three minutes, so I can go first. I’m in my demo
project in my — this long RNA-seq folder. We use the word long to substitute for, like,
getting transcripts as opposed to other micro RNAs, right? So, inside this folder, there’s
a couple of folders here and a couple of little boxes, which I’ll go into. Again, I’m going
to explain most of this in detail once we get some stuff running. This little symbol’s
a workflow. A workflow is what DNAnexus calls — what we might call a pipeline. A workflow
is a bunch of things stuck together. So, if you click on that and open it up, you’ll get
this image that you were shown before, hopefully. It comes there. And specifically, the one
replicate paired end is our example that we’re going to use. So, there’s four in there. Some
of them are from multiple replicates. Some of them are for single ended, right? So, the
way the workflow is organized is that this is a step. This is a step, this is a step,
this is a step, this is a step. These are the inputs on the left, these are the outputs
on the right, and the outputs are connected to the inputs. So? And again, I — we’re going
to come back and do this again if people have trouble. In order for this to run, to click
the run button, we need to turn all these little guys in the middle, which we’re calling
applets, green. So, let’s just start. I — this should be
set up. So, this is a paired-end experiment. So we need two fastqs, one for each pair.
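[Editor's sketch, not part of the talk: a paired-end run needs one fastq per mate, and the two boxes must not receive the same file. The Python below is a minimal check under the assumption that mates are marked by an `_1` / `_2` token in the file name, as the demo files appear to be. The file names and both function names are hypothetical.]

```python
import re

def mate_number(filename):
    """Return 1 or 2 if the name carries an _1 / _2 mate token, else None."""
    match = re.search(r"_([12])(?=[-_.])", filename)
    return int(match.group(1)) if match else None

def check_pair(reads1, reads2):
    """True when the first file is mate 1 and the second file is mate 2."""
    return mate_number(reads1) == 1 and mate_number(reads2) == 2

# Hypothetical names, patterned on the chromosome 21 demo subset.
good = check_pair("sample_1-chr21hemi.fastq.gz", "sample_2-chr21hemi.fastq.gz")
bad = check_pair("sample_1-chr21hemi.fastq.gz", "sample_1-chr21hemi.fastq.gz")
print(good, bad)
```

[A check like this only catches the easy mistake of selecting mate 1 twice. It says nothing about whether the reads inside the files actually pair up.]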
So, I’m just going to go back to the beginning for people who are still having — haven’t
created a project with the files in it. Okay? So, the first thing I’m going to do is click
on the featured projects, Encode Uniform Processing Pipelines. So, this is our sort of master
copy of the pipelines and the reference files. So, I’m going to select the long RNA-seq and
the reference files by clicking the check boxes. I’m going to go over to the left side
and click the copy button. So that’s going to open an interface in order to copy to another
project, which for our purposes is your personal project that you are
maintaining. So, we’re just going to copy the software and the data onto there. So,
this little, funny looking suitcase plus link here is create a new project, which, if you
haven’t created a project yet, you will need to create one project. So, if you click on
that suitcase link, you will have to enter a name for that project. So I am just going
to type, in this case, encode demo into the box, hit enter. Right? It’s turning a little
bit. So now it should say something like “no data available” here. On the right hand
of this it says “copy into this folder.” So you click on that. It should copy the data
that I selected in the first place. Might have to click it twice. Okay. So it’s spinning
along. There it goes. So, you get a little 100 percent. It’s done. People still working
on that? It’s fine if you are. [inaudible commentary] [laughter] Okay. So now this leaves me at this project,
which is a project that you guys can’t really edit because it’s our DCC Project. And this
bird beak over in the corner that points to the left will take you up a level or back
to your personal account and the list of all the projects that you have.
So, because I did this twice, I have two. And some of you may have two, too. If you
— they’re just little files on the cloud, so feel free to delete them if you don’t need
them later. So, you can see, actually, both of these have some data in them, three to
four gigs of data. So, I’m just going to click on this one to enter that project. Wait to
give it a second. Spin the little DNA helix. So, is anyone who is trying to work this
demo not at this page yet? One, two. That’s pretty good.
So, from this, your project that you’ve created, screen, there will be a little folder. It
looks like a little folder guy there. One of them is long RNA-seq. That’s where the
workflows reside. A workflow is DNAnexus’ way of explaining
which processing steps to run in what order. So, I’m going to make this a little wider
so you guys can see it, but you should be able to see it on your screen. So, we have four here. One is for
single ended RNA-seq data and one is for paired end, one pair of them. And then there are
two separate ones depending on whether or not you have one input, replicate, for your
experiment or two. We’re just going to do the one replicate one. So, if you click on
that, it should open up a window like this. Is anyone who was — is this where we are
now? We’re good? Okay. So, just to repeat here, our — the way these are organized is
that the outputs of the previous step in a pipeline are connected to the inputs of another
step. So, for example, this STAR_NO_BAM, right? When I highlight it, it highlights it there.
That means that this BAM file, which is one of the alignments that we’re going
to be creating, gets passed on to this applet here, which is the quantification applet.
The applets in the middle here represent software that’s going to be run. So, in order to get
this started — and I think we have to pretty much go through this once and do it. So, then,
we will finish — is you start clicking on these items here and filling in the boxes. So, if you click on that, it should show you
a little menu, which shows you the valid gzipped fastq files that you could
select for there. And it’s going to take a second to find them, but we’re going to pick
the one that is called like hemi. If you — maybe yours is faster than mine. There’s — there
we go. So, there’s two here that are described as hemi, chromosome 21 hemi. So this is this
middle of chromosome 21, which is just to be a small dataset that we could finish in
the workflow time. So, the one — and this is the ENCODE file that these are extracted from.
So, we are — take the one with the one hemi version. Click on that. It puts it there.
This also inputs it to the next step here, which we’ll go over. So, this — oops. Don’t
right click. We’ll get that menu. So, now we do the same thing with the other pair of
the reads. Same menu, just make sure you pick the one with the two. Does everyone get how
to put files into the system? Anyone confused? So, notice how they are both different, one
and two. Now, we have to actually — when you are doing the mapping, you have to actually
pick the correct genome. So, these are reads that I happen to know are from a mixed human
sample. So, I need to — and then we have the genomes pre-indexed for the various pieces
of software. So, this is the STAR alignment. So, we’re going to click on that and it’s
going to give us a bunch of choices of indexes to choose from. So, there’s a mouse one, there’s
several human ones. The way that STAR works is that it takes not just a genome, but also
a transcriptome, and we use spike-ins for some quantifications, so that’s what all of
this spinach is. But if you just pick hg19_male_v19_ERCC_STARindex.tgz, click on that, it fills that in. So, we’re going to go to the next step now.
So, scroll down. Here, this is another alignment program, which is — you may be more familiar
with this tophat. The reads are automatically set up to be the same reads that you put in
in the first place. So, you see how when you highlight them, they sort of light each other
up. And we’re going to get the tophat index. Now, it’s important that you pick the same
index, right? Like, I don’t want to have one aligned to mouse and one aligned to female.
And you should know which version of the genome, in this case, hg19, that you’re mapping to,
because when you do a visualization, you don’t want to have it drawn on the wrong genome,
right? So, that’s that one. Now, the next step is the quantifications. So, this is a
program called RSEM from Colin Dewey’s lab. So that uses a slightly different flavor of
the index. And if you’ll notice when it comes up, that there is no female version of the
genome, and that has to do with the fact that we want to write out zeroes in our quantification
file for stuff on the Y chromosome. So, it doesn’t really matter if we use the male or
not. Click that link. That’s there. The last input
is that these two steps here are going to create — take our alignment BAM files and
make wiggle files out of them, bigWig files that we’re going to visualize. And this requires
a chromosome-length file. So, let’s just take this one here. It’s the right one. So, now
this one is runnable. It’s green. This one is runnable. It’s green. This one is not runnable.
What that means is, once you’ve put in the input, it says, “configure parameters.” Params, I guess. So,
if you click on this black box in the middle, this option isn’t checked. So, it needs to
be checked that the library is paired end. So, we’ll save that. Now, it’s green. Female Speaker:
Like, changing out the one? Ben Hitz:
Okay. Oh, here? What you click? Sorry, could you — Female Speaker:
The chromosome, like what you use? Ben Hitz:
Oh, male_hg19_chrome.sizes. I’m sorry. I didn’t read it out, did I? Male Speaker:
What about this one? Ben Hitz:
This one? So, these other two have just a little text you need to put in, that you can
just put in here. This is a required field. It’s just to save some files or the same.
So, if you just type in “demo” in this box here. Save it. And the top one, the STAR alignment
is the same thing. So, look. I’m all runnable and I can run my
analysis. So — Male Speaker:
Can you repeat the process? Ben Hitz:
Yup. Also, this whole process should be on the handout, if you want to refer
back to it, which you can download from the agenda page. But I will be happy to go
through it again. Male Speaker:
Yes, yes. Ben Hitz:
All right. So, I’ll just close it. Okay. So starting from the pipeline, let’s see if this
is — it should be blank. I don’t think it saved my stuff. Right. Female Speaker:
Can you say what’s happening in each step? Like, I understand what you’re — Male Speaker:
Yeah, what — Ben Hitz:
Okay. So, what we’re doing is — sorry, for the — Male Speaker:
Yeah, what is the school for the mathematics? Ben Hitz:
What is this? Sorry? Male Speaker:
School for the mathematics? Is what we’re trying to do in — Ben Hitz:
Oh, okay. Yeah, so I mean — no, that’s a totally fair question. I just want to say
our plan is that because it takes a while to run that we would try to get people started
and then we would describe the RNA-seq pipeline. So, I wonder if we should — so, what you’re
going to do is put in the inputs to the various pieces of software, which are in black. Male Speaker:
I think it’s just — say a little bit more about — Ben Hitz:
About the RNA-seq pipeline? Male Speaker:
Can you speak — well, when you’re saying something or you’re doing a reading — I mean,
I’m having a [inaudible] what you’re saying — you’re saying. Ben Hitz:
Okay. Male Speaker:
It’s so, just — Ben Hitz:
Okay. That’s — okay. So I’ll try to — Male Speaker:
Too much work for something that the beginning. [inaudible] Ben Hitz:
Oh, oh, okay. Male Speaker:
[inaudible] Ben Hitz:
Sure, sure, I guess I can do that. Male Speaker:
Thank you very much. Ben Hitz:
Let me just — okay. So, people who are at this step, I want to go back for some of those who
haven’t done the original thing. So, I’m going to go way back to the beginning. But you should
be able to like look at the — go download the handout from the ENCODE2015.org website
and that has all the — that can help you follow the details while I help the people
who are even behind you. Okay? Female Speaker:
Great. Ben Hitz:
Sorry for that. But that’s just how we do it. So, you should be at a page, something
like this that says projects, right? You should have nothing here. So, what you do is go to
the ENCODE Uniform Processing Pipelines. Female Speaker:
[inaudible] Ben Hitz:
What? Female Speaker:
You didn’t specify output. Ben Hitz:
No. That’s why I didn’t tell them to run it yet. Female Speaker:
Okay. Keep going then. Ben Hitz:
Oh, well, then that will be fine. Female Speaker:
Okay. We’ll figure it out. Ben Hitz:
Okay. So those of you who clicked run, that’s perfectly fine. There is one little subset,
which is why I didn’t say actually run it — which is that you can specify where the
output files go, like in another folder. So, the people who are running it, you’ll just
get a pile of files in a folder, but it will work fine. It won’t make any difference. Female Speaker:
And it will ask for all the — Ben Hitz:
It’s just that if — when you run two or three of them, then they start piling up with all
these crazy file names. All right. So for these people who just got internet — so you
got a Processing pipelines here. This is the master copy of a thing. Right? You copy the
long RNA-seq and the reference files. So, you get this copy in the left hand corner. So,
you’re not going to have these two. So, you need to create a new project, which is this
crazy thing. See this? Female Speaker:
Okay. Now, that’s three. Ben Hitz:
Just name it whatever you want. Female Speaker:
Set your — Ben Hitz:
I just made a demo. Type in demo2 in the box there. Female Speaker:
So, this is like how we do our documents and things. We like — you know, it’s demo — Ben Hitz:
So, click copy. Done? Okay. So, now you’re still where you were when you started the
copy. Female Speaker:
So, here it is. That’s also on — Ben Hitz:
So, you have to get to the new project you created. So you have to go back on this
arrow here. Female Speaker:
Example, which would be, and then rightly follows the — yeah — Ben Hitz:
You should see your project you created there, right? So, you go there? Female Speaker:
Simple, that worksheet that you just referenced? It has like three jobs in every — Ben Hitz:
And then it’s — you’re going to go — in here, there’s a folder called long RNA-seq. Male Speaker:
Yes. Ben Hitz:
Click on that. Then if you scroll down a bit, there’s these guys, which is the universal
symbol for workflow. Male Speaker:
I have a question, [inaudible]. Ben Hitz:
Sorry? Male Speaker:
If you didn’t have files [inaudible] — Female Speaker:
There’s a section there. Ben Hitz:
How do you get files? So, I don’t think we’re going to have time, but what you can do is,
if you go to like — if you want to ENCODE file — yeah — Male Speaker:
[inaudible] Ben Hitz:
You can — basically, there’s an add data button if you ever want to put fastqs in your
system and add data will let you copy a file from a URL or a file from your computer. Please
don’t upload any files in this room now. I have no idea what it will do to the internet.
But that’s how you get files in. It’s pretty easy if you just like poke around a little
bit. All right. So, oops. This is the wrong one. Don’t click on that one. See? This is
the two replicate one. Don’t do that. I just missed it. So this is the one — here. I’ll
make it wider again. One replicate paired end for you guys and you should get to where
we are. We’re inputting stuff. Male Speaker:
Yeah. Ben Hitz:
Okay? Male Speaker:
Okay. And then we press focus? Ben Hitz:
Press — just got it? Male Speaker:
Yes. Ben Hitz:
Okay. I’m going to go through the inputs again. Male Speaker:
Is this the original documented? Ben Hitz:
That’s a — Male Speaker:
Yeah. The original. Ben Hitz:
I — we — you’re welcome to try to design — you guys, I mean, it’s not my system, so
— Male Speaker:
Oh, thank you. Ben Hitz:
Interesting that it’s not. Yeah, I mean, some people like using the command line and just
write your scripts, so you can do that too if you want, but not in the demo. All right.
So you get the read one. So this is a list of possible fastqs you could input. I probably
should have deleted some of these, but so take the one that says — or look at the two
that say chromosome 21 hemi. So, for pair one just pick one. Right? I’ll do it again.
This one, ENCFF646CCF_1-CHR21hemi. Okay? And now the other box we’re going to do the same
thing, but get pair number two. And now we need an index genome for STAR to align to.
So, we click on that box. We don’t want a mouse one from MM10. So we want a human. We’re
going to use the male and we’re going to use the version that is ERCC, which is our — just
a code name for the spike-ins, which — there. So, I think I’m going to try to do this slightly
perpendicularly for people who are following along. So, we’re just going to finish this
top step here. So, we’re just going to click configure params, and this star here means
that this is a required parameter and that’s why it’s orange and not green. So, you can
put in any sort of text string in here as an identifier. So, we used this for a while
for like ENC, BS111, but it doesn’t matter. You can put in any string in there. So demo
is fine. Save that. That step is done, right? So, this alignment step, just because I
don’t think we’re going to have time to go over the whole pipeline very much, runs
an alignment of your pair of fastqs against the genome that you selected. It creates
essentially some QC numbers that you may see at the end if we get there.
It creates a regular alignment file to the whole genome. It creates an alignment file
to the transcriptome, which is included in this genome index. It’s GENCODE, version 19,
and it has a log file, which has other numbers in it, right? So, we’re just going to repeat
that partially for the other four steps. This is the alignment with tophat, which is just
creating an alignment BAM file. The inputs are already set for you, so we just need to
set the index. And again, we’re going to take hg19_male_v19_ERCC_tophatIndex, put that
in there. And we’re going to click configure params, we’re going to put in demo into the
identifier for biosample library box. Now, one is green. Anyone having trouble turning
these two guys green? Great. Female Speaker:
These need to always have the same name. Ben Hitz:
Yeah. So the next step is the RSEM quantification, which takes the annotation BAM. So, let’s
see if I can show you this here because I want to do it anyway, right? So, see how this
output on the right is marked to that input on the left? Output to the right is marked
to that input on the left, so that file is going to go. When this file is created, it’s
going to go directly into this step, so we don’t have to put that input at all. It’s
already been done. That’s part of the pipeline, is that it’s all plumbed together. We’re going
to take the index for RSEM. Okay. So here we just need the hg19_male_v19_ERCC_RSEMIndex.
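[Editor's sketch, not part of the talk: the three indexes chosen so far all have to come from the same reference build. The Python below is one way to check that from the names alone, assuming the naming pattern heard in the demo (assembly, sex, annotation version, spike-in set, then the tool). The function names and the mouse index name are hypothetical, and file extensions are omitted.]

```python
def index_build(name):
    """Return the (assembly, sex, annotation, spike-ins) prefix of an index name."""
    return tuple(name.split("_")[:4])

def same_build(index_names):
    """True when every selected index shares one reference build prefix."""
    return len({index_build(name) for name in index_names}) == 1

# Index names as read out in the demo.
chosen = [
    "hg19_male_v19_ERCC_STARindex",
    "hg19_male_v19_ERCC_tophatIndex",
    "hg19_male_v19_ERCC_RSEMIndex",
]

# Hypothetical mouse index name, used only to show a mismatch.
mixed = chosen[:2] + ["mm10_male_M4_ERCC_RSEMIndex"]

print(same_build(chosen), same_build(mixed))
```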
All right? That’s good. Still not green yet. Configure the parameters. Now this one we need to tell RSEM that the
library is pair ended, because it does a different calculation. We use this piece of software
for both the paired end and the single ended, so it’s got to know which one you’re giving
it. The outputs of that step are what we are just calling results files. They’re
just tab-delimited files with genes as rows and, like, FPKM, TPM, and other numbers there.
The last two steps are going to take the output of the BAM files, one from STAR and from tophat.
They’re going to convert it to bigWig files and the bigWig conversion script needs a chromosome
name length file, so we click on that. It’s going to open another window. We get to pick
again our genome, which is going to be male, hg19 chrome sizes. I should point out that
you can arrange these in such a way that they’re coordinated, but I wanted people to go through
the steps of actually doing it. All right. So our things are all green now. So is everyone
there at this point? All right. You still need help? Nope. Okay. We’re good. So, set
the output folder. This is just — this is not a critical step, but it will keep your
files managed if you put the output in the right there. Oh, Mike needs help. So, in — if
you go to this right side, there’s a little file, like a Windows Explorer thing. So, you
click on that arrow down in examples. There’s input and output. So, you just click output,
for example, or you can give it any folder it selects. There’s nothing in it. That’s why it says no data available. It’s
an empty folder. That’s what you want. So, now I’ve got a folder in there and I
can click run. Trumpets. Okay. So, when you run it, it takes you to the
monitor tab. Before, we were working sort of in the manage tab, and the monitor
tab will show you all of the jobs that you have running at any given time, as well as
all the jobs that were finished, all the jobs that may have crashed, all the jobs that you
turned off. It’s a — can go as far as you want. So, this is the master job. If you notice
this plus sign here, it will show you all the sub jobs this job is going to
create hopefully in a second. Right? So, the — these little lines here correspond to those
steps in the workflow that we started, so there’s two alignment steps, a quantitation
step. It’s a little bit hard to see. And two BAM to bigWigs stranded steps, which is a
correct one. So, it requires that the quantitative — okay. Let’s start with these. The BAM to
bigWigs — I’m sorry. I’m a little thrown off. The BAM to bigWig steps require the BAM
to be created; otherwise it can’t create a bigWig, right?
So, these will run automatically when their respective alignment steps are completed,
which will take on the order of 25-30 minutes with this little example. If you run a bigger
example, it can take longer. Similarly, to quantitate the genes and the
transcripts, we require the output of STAR, which is what we’re going to quantitate. The
ENCODE Consortium decided that we only needed one set of quantifications, but we wanted
to do two sets of alignments, both to compare to previous experiments that could be done
and also because tophat is sort of the industry standard RNA-seq alignment program. But we found
in our hands that STAR actually performed a lot better, so we wanted to use that as
well. Okay. It is now — right, 2:30. So, I think people have gotten the hang of raising
their hand if they need help, but I’m going to check here anyway to see where people are.
If, because of — especially with the time constraint, what will happen here is that if
this — if one of these jobs actually, like dies or something — like for example, and
this can easily happen. If you gave it the wrong input, right? Like, if you accidentally
— it was hard because it was — but if you put the tophat — if you really go out of
your way you can do this. You can put the tophat index into the STAR BAM, which you
really shouldn’t do, but it will run for a minute or two and then throw an error and
it will turn red. And you can actually, from this interface — right? We’ll look here.
So, I clicked on the job that’s — one of the jobs that’s actually running. It’s coming
up a little slow here, but I can actually view the log file that it’s creating.
So you can actually effectively peek into the virtual machine that’s running this job
and it shows you all the output of what’s going on. I’m not going to do that because
no one wants to see a bunch of text stream. But if you wanted to do that, there. I think
— Male Speaker:
Wait a second. Ben Hitz:
Yeah. Male Speaker:
The reach of the [inaudible]. Ben Hitz:
The RSEM — the STAR alignments, which — it’s a good — great question. I can show you also,
like — I think — I was trying to demonstrate it, but it’s a little tricky to demonstrate
here. Actually, I think — you have a question. Male Speaker:
Yeah, yeah, but I’ll wait. Ben Hitz:
Okay. Well, okay. You can go back here, but — so, I’m skipping some of Seth’s because
he’s busy. Skip that. So here. Here’s sort of a view, like an on one page view of what
you guys just ran. So, I mean, if there were good reasons to do this first, but I did want
to try to get it started. So the code that’s available, that — so each of these, those
applets or those steps here, are just a little shell script that runs a program that you
can download like tophat or STAR. Sorry. Or STAR or RSEM. Or here, there’s a step that
runs — actually, these are sort of the steps here — maps — this maps with STAR, this
maps with tophat, this maps with STAR. This converts the BAMs to bigWigs. There’s a whole
forest of bigWigs that get produced. So, for a stranded dataset, it will give you the plus
strand and the minus strand. It also gives you all the reads, including the non-uniquely
mapping reads and there’s a file that’s just the uniquely mapping reads. So, that’s four
per alignment program. The STAR BAM outputs are also shoved off through this quantification
here. So, at ENCODE we have a rule that we always run everything twice. We always run
everything twice. So, what I’ve drawn here is this is that — so, for every experiment
we have some form of replication, either a biological replicate or a technical replicate.
Ideally, we have a biological replicate. And as we talked about, we have a
multiple-replicate pipeline or workflow you could use. So, that effectively
for the same experiment — sorry. I’m losing my mouse pointer here. Right? We can take the quantifications that
are output from — there’s one more step in that double pipeline, the double replicate
pipeline, which takes the RSEM quantification file based on genes from one replicate and
the one from the other one and runs a — some QC calculations based on it. It doesn’t actually
— we are currently working on the IDR calculation for RNA-seq, but we don’t have it implemented
yet. But we do, do something from Raffa Israhe [spelled phonetically], which is like a — it
measures the dispersion of the log twos of the seq ratios. I’ll tell you more about that.
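[Editor's sketch, not part of the talk: the speaker does not name the exact statistic. The Python below assumes one common way to measure the dispersion of log2 ratios between two replicates, the median absolute deviation (MAD), computed over genes with nonzero expression in both. The function name and the toy values are hypothetical.]

```python
import math
from statistics import median

def mad_of_log_ratios(rep1, rep2):
    """MAD of log2(rep1 / rep2) over genes expressed in both replicates."""
    ratios = [math.log2(a / b) for a, b in zip(rep1, rep2) if a > 0 and b > 0]
    center = median(ratios)
    return median([abs(r - center) for r in ratios])

# Toy per-gene expression values for two replicates.
rep1 = [10.0, 20.0, 0.0, 8.0]
rep2 = [10.0, 10.0, 5.0, 16.0]

print(mad_of_log_ratios(rep1, rep2))
print(mad_of_log_ratios(rep1[:2], rep1[:2]))
```

[Two perfectly concordant replicates give a value of zero, and the value grows as the replicates disagree more. The gene that is zero in one replicate is skipped here, which is a simplification.]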
So, right? So, we get — from each replicate we get two BAM files. We get — if it’s paired
we get four bigWig files, times two is eight. And then we also get two quantification files
for which the gene quantifications are much more reliable. The transcriptome ones are
more for reference, but it’s hard to actually judge isoforms from these at present.
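[Editor's sketch, not part of the talk: the bigWig count follows from three two-way choices. The Python below enumerates them for one replicate of a stranded dataset: two aligners, two strands, and two read sets (all reads or uniquely mapping reads only). The file name pattern is hypothetical.]

```python
from itertools import product

aligners = ["STAR", "TopHat"]
strands = ["plus", "minus"]
read_sets = ["all", "unique"]

# One signal file per combination of aligner, strand, and read set.
bigwigs = [f"{a}_{s}_{r}.bw" for a, s, r in product(aligners, strands, read_sets)]

per_aligner = len(strands) * len(read_sets)
print(per_aligner, len(bigwigs))
```

[That is four bigWigs per alignment program and eight per replicate, alongside the two BAM files and the two quantification files mentioned in the talk.]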
So, you had a question there? Male Speaker:
[inaudible] That switch, never mind. I’ll shout. So, the — if your job fails, is there
a way to like go back and fix the file and start it up without exactly going through
the first couple of steps again? Ben Hitz:
What a great question. There is indeed such a way. I was waiting for someone to have a
failed job so I could show them. Male Speaker:
Oh, all right. Ben Hitz:
Does anyone have a — has anyone’s job failed? [laughter] Yours did? Okay. So then this is for you. [laughter] So, go to monitor. I can’t really do it, because
my job hasn’t failed, but I’ll sort of show you here. If you go to monitor and
let it spin for a little bit. So, when you click on the plus symbol here, it will
show you all the jobs. So, the whole thing will fail if any one of its sub jobs fails.
But you probably want to check to see which sub job failed and look at that. But for the
whole thing, what you can do is you can — oh. I don’t have an example, do I? I can maybe
switch to one, but what it will show you that on a failed job, when you click on this — well,
actually, it might let me do it anyway. I can’t remember. There should be a button that
will show up here. This shows up — which basically just says rerun. So, rerun with
the same input. Now, obviously if you rerun with the exact same input, then it will probably
give you the same error. Not necessarily, sometimes — Male Speaker:
No way. Yes. Ben Hitz:
— but you can — it will rerun with what you loaded and then you can look through it
and change what you need to change. Male Speaker:
Right. So, I guess I’m assuming that the output files are cached somehow for a while, so that
like you can see that, “Oh, I’ve already got these input files. I need to start at step
number three.” It’s — Ben Hitz:
Right, right. It — they’re actually — the files that are specified as outputs — Male Speaker:
Yeah. Ben Hitz:
— so that they’re on the right hand side of that screen, those are not cached. They’re
permanently saved — Male Speaker:
Okay. Ben Hitz:
— until you delete them. And the DNAnexus system keeps track
of what job was used to create them and what parameters were used to create them. And so, if, for
example, this can happen, certainly, when developing your own pipelines in the system.
Like, let’s say my alignment step works really well, but I have some bug in my quantification
script. Right? So, I run my alignment pipeline. I run my whole pipeline. The alignment works
perfectly and the quantification fails, right? So, now, it’s not because I had bad input.
It’s because there’s actually a bug in my applet code, which is — we developed all
the applet code at — well, with help of the ENCODE data analysis center, right? And so
we had to sort of effectively — not really compile, but ship that code to the DNAnexus
platform. And so, if there was a bug in that code (there were many, and we have
fixed them all; there are no more bugs), what you can do is remake that applet and then
rerun that step, depending on the same previous input. So, that’s one of the advantages, and
I think that type of technology is very difficult to implement, which is one of
the reasons why we use this system. It’s not that it’s impossible. It’s hard.
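[Editor's sketch, not part of the talk: the idea described here is that outputs of finished steps are kept, so a rerun after fixing a broken step does not repeat the steps that already succeeded. The Python below is a toy model of that behavior only. It does not use or imitate the DNAnexus API, and every name in it is hypothetical.]

```python
saved_outputs = {}  # step name -> output, kept until explicitly deleted
calls = []          # records every time a step actually executes

def run_step(name, func, *inputs):
    """Run a step unless an earlier run already saved its output."""
    if name not in saved_outputs:
        calls.append(name)
        saved_outputs[name] = func(*inputs)
    return saved_outputs[name]

def align(fastq):
    return f"{fastq}.bam"

def quantify_buggy(bam):
    raise RuntimeError("bug in the quantification step")

def quantify_fixed(bam):
    return f"{bam}.genes.results"

def pipeline(quantify):
    bam = run_step("align", align, "reads")
    return run_step("quantify", quantify, bam)

# First run: alignment succeeds, quantification fails.
try:
    pipeline(quantify_buggy)
except RuntimeError:
    pass

# Rerun with the fixed step: the saved alignment output is reused.
result = pipeline(quantify_fixed)
print(result, calls)
```

[In this model the alignment runs once and the quantification runs twice, once failing and once succeeding, which is the saving the speaker describes.]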
Is a — you have another question? Male Speaker:
What are you [inaudible]? Ben Hitz:
Okay, this — sure. Male Speaker:
Is there a way to, I guess, quote, speed jobs up, like, how do you use more cores
or more [inaudible] or something like that? Ben Hitz:
Yes, it depends on how you’ve implemented the applets, right? So, for our purposes we
are — we do want to keep them as efficient as possible, but — Female Speaker:
Anyway, I just finished these, see? Ben Hitz:
But we gain a lot of throughput, like basically running 1,000 — Female Speaker:
Yeah. Ben Hitz:
— like literally 1,000. We can submit 1,000 experiments at once and if the whole thing
takes five hours, they will be done in five hours because we have a — effectively an
infinite supply of cores at Amazon. Male Speaker:
Okay. Ben Hitz:
But you can also change the number of threads. You can optimize your applet so that it speeds
things up. Male Speaker:
Okay. And last question. At some point can you show us how to like upload fastq files
and download other kinds of files? Ben Hitz:
Yes. Male Speaker:
Just — Ben Hitz:
That’s the second time. So, I’ll try to do that around — I don’t know when we should.
We’ll definitely try to squeeze that in or find me if I don’t get it. Male Speaker:
That’s okay. Ben Hitz:
So, another question in the back. Male Speaker:
Yes. So the question is that is your tophat and STAR parallel or serial? So, one of them
up there with the online reads from one step sticking to another. That’s my first question.
And the second question is the STAR index is very sensitive to the length of your reads.
So, if you have shorter reads, how you’re handling, you know — is the STAR index just getting generated
again or is it just you have a standard STAR index based on 100 bases and that specifically
you’re using for everything? Ben Hitz:
So, the first question, I think is pretty simple, if I got it right, is that — what
is the dependency of the STAR and the tophat alignment steps? So, they’re completely independent.
They are the no — they use the same input and they have completely different outputs
and they don’t interact. Male Speaker:
Okay. So what do you do with the tophat output then? Because in the flow chart, you
are saying that you are using RSEM only with the STAR — Ben Hitz:
Right. Male Speaker:
— and so you’re not merging them together. Ben Hitz:
No. Male Speaker:
So, what are you doing with the tophat output? Ben Hitz:
Well, the idea is just to have a comparison for previous data that was done. I mean, what
it has to do with the fact is that the consortium or the RNA working group as a whole really
liked how the STAR program performed and to be fair, it was written by Alex Dobin
in Tom Gingeras’ lab, but other groups want — who use tophat
mostly, wanted to compare the results directly at the BAM level. But they weren’t worried
about producing the quantifications for it. So, your second question is a technical
STAR detail about the length of the reads, right? Male Speaker:
Yes. Ben Hitz:
So, I don’t think we take that into account, but since Alex helped us write the pipelines.
Also, almost all of the current RNA-seq data we’re running through it is pretty much the
same length, because it’s done by the same people. But that’s a good point. We might
have to remake the indexes for that. Male Speaker:
Yeah, for the users, you know. Ben Hitz:
Yeah. Male Speaker:
Because for you, it’s the same length. For the users, it’s — Ben Hitz:
So, like, you think it’s different at 36 and 100 or? Male Speaker:
Yes. Ben Hitz:
Okay. Male Speaker:
Yes, anything above 100, it should be okay. Ben Hitz:
Yeah. Male Speaker:
But anything below 100, if you’re 50 or 60 bases, you know, around that range, the STAR
index are really — Ben Hitz:
Thanks. That’s really useful to know. I did not know that. Female Speaker:
So there is a question — there’s a question that’s been asked by a couple of people, so
I wanted to bring it up for everyone — Ben Hitz:
Sure. Female Speaker:
— because you may have the same question. The question basically is do we have a pipeline
for paired and un-stranded libraries. And as a defined ENCODE uniform processing pipeline,
we do not, because all of the ENCODE RNA-seq data that’s been generated by the consortium
is either single and un-stranded or paired and stranded. And so that’s why the pipelines
that you see in the ENCODE Uniform Processing folder or pipeline folder are those. However,
you can take a look at the parameters for the single and un-stranded. Take a look at
those. It is essentially the same software components. You can look at the parameters,
you can modify the apps, the applets. Ben Hitz:
We actually did have a few test examples that were those, the other two, so paired, un-stranded,
and unpaired, stranded. I’m sorry. But we didn’t actually — it was already a — four
pipelines was already too many. But you can actually — I don’t even think you need to
do any coding. I think you can just re-arrange the steps in those pipelines and make a new
workflow and it would work if you needed to do that. If someone has a specific use case,
I’ll show you how to do that. We — I guess we could publish them, too, and if we can
organize it in such a way. Female Speaker:
Yeah, but I think that the point that Ben addressed was also the one that I wanted to
bring up is part of the beauty of the DNAnexus platform is that you can mix and match applets
that are public to make the workflow that, you know, you want to run your data through. Ben Hitz:
Right. So, approximately 40 minutes ago we were supposed to have Seth talk about ChIP-seq.
That’s okay. That’s actually — we have this built in. So, I think, you want to go back
and do that or you want to skip ahead? Let’s see how long my job has been running. Let’s
do that — because I think we can analyze it. So, it’s been running 10 minutes. So,
I believe that the earliest we can visualize it is about 30 minutes, which is probably
going to finish us up. So, there’s a couple of options. One, I can go to — like, I have
an example of this already completed in a different account. So, I can show you how
to take the bigWig file and draw it to — draw it on the UCSC Genome Browser, which is
kind of cool. Or we can have Seth come up. We can talk more about the pipelines in general
without — in sort of a less workshoppy way, but — Male Speaker:
[inaudible] Ben Hitz:
Okay. Sounds good to me. [end of transcript]
