Increasing Bowtie's Genetic Alignment Performance

The impetus for this study was fueled by a compelling talk I saw given by Dr. Howard Jacob, Director of the Human and Molecular Genetics Center at the Medical College of Wisconsin.  The basis of his message was this: if we want to be able to sequence an entire human genome with a 1-day turnaround, we are going to need better tools at every step of the sequencing pipeline-- from the sequencers themselves [which seem to be doing pretty well shattering Moore's law in terms of yearly performance increases] to the software tools that operate on the raw data.  

My background is in computer hardware/software design and performance, and I thought it would be an interesting profiling experiment to take a popular, open source sequencing tool, Bowtie, and essentially attempt to squeeze as much performance out of it as possible using our Dell cluster.  It turns out that bowtie is natively multi-threaded, but it's generally a bad idea to run more threads than there are cores on your machine.  So, you can think of it as being limited to the number of cores on your system in terms of performance.  But I thought I could do better, so I began to research the options for parallelizing Bowtie beyond a single node.

I first came across a tool called pMap, developed by the folks at OSU, and designed to break up a genomics workload, copy it to N nodes on a cluster, and then have each node do it's work before aggregating it all back together.  In theory, this sounded great-- but I don't think it was designed for larger datasets, or for environments where the cluster has access to shared memory.  Essentially, it ends up reading the data twice: once when breaking up the data set files and copying it locally, and then again during the actual sequencing.  In my experiments, the first portion-- pmap_dist-- took over an hour and a half just to break the files up and copy them to each node before it even started up bowtie.  So my search continued. 

I began sifting through all the bowtie options, and came across two very intriguing command line flags:

-s/--skip <int>: skip (i.e. do not align) the first <int> reads or pairs in the input

-u/--qupto <int>:  Only align the first <int> reads or read pairs from teh input
(after the -s/--skip reads or pairs have been skipped)

I couldn't find any examples of the tool being used this way-- which isn't to say it hasn't been done before-- but it looks as though bowtie inadertantly provides a means for parallelization beyond a single node: by running multiple copies of bowtie on different nodes, it should be possible to pass each job a different subset of the total read file.

As with any good performance experiment, you first need a solid dataset-- and in this case, I was looking for something both homo sapien [ie: not from e-coli, etc] and pretty big.  (Showing performance improvements on a large, human dataset is not only germane but shows proof of concept for what one might expect aligning human DNA samples.) After perusing NCBI's massive online database, I settled on a female DNA sample gathered from an "B-lymphocyte" cell:

 

http://www.ncbi.nlm.nih.gov/sra/ERX069505

Accession: ERX069505

Experiment design: Paired-end sequencing (2x100 base) of human library with 300 bp inserts

Study summary: Whole human genome sequencing of an CEPH/UTAH female individual (HapMap: NA12878) using the Illumina HiSeq 2000 and paired 100 base reads

Total: 4 runs, 854.1M spots, 172.5G bases, 103.9Gb

 

It was broken up into 4 seperate runs, each between 25 GB and 27 GB, stored in SRA format.  Bowtie can't process raw SRA data, so I had to convert them to paired-end fastq file format. I downloaded the SRA toolkit and used the tool 'fastq-dump' to obtain paired-end fastq output files for each of the four runs.

The rest of the experiment will focus on the first of the four runs, ERR091571.  fastq-dump yielded two fastq files, each weighing in at ~65GB:

-rw-rw-r-- 1 nlindberg nlindberg  65G Sep 24 14:58 ERR091571_1.fastq
-rw-rw-r-- 1 nlindberg nlindberg  65G Sep 24 14:58 ERR091571_2.fastq

For a reference index, I downloaded the entire Homo sapien NCBI build37.2 index from Illumina, which conveniently includes a 2.7 GB Bowtie Index ready for use:

http://tophat.cbcb.umd.edu/igenomes.html

Baseline

To get baseline performance, I ran bowtie using 8 threads (-p 8) on a single node using the the two files above.  Upping chunksize to (--chunkmbs) 2048 and read length to (-X) 1000 eliminated any errors I was getting regardind search length. 

Command:

bowtie -t -S --chunkmbs 2048 -X 1000 -p 8 genome -1 file_1.fastq -2 file_2.fastq map.sam

Output:

Time loading reference: 00:00:14
Time loading forward index: 00:00:31
Time loading mirror index: 00:00:31
Seeded quality full-index search: 04:07:56
# reads processed: 211437919
# reads with at least one reported alignment: 181400791 (85.79%)
# reads that failed to align: 30037128 (14.21%)
Reported 181400791 paired-end alignments to 1 output stream(s)
Time searching: 04:07:14
Overall time: 04:07:14
 
So-- 4:07 (4 hours 7 minutes) was the time to beat.  I will also note that I also ran this using the -y/--tryhard option of Bowtie, and that run in just under 17 hours, but only ended up matching another ~4%.  So for purposes of the study, I decided to stick with the normal options.  (Let it be said that my next experiment will be how fast I can speed things up using the -y/--tryhard option as I think there might be massive performance gains to be had there.)
 

Methodology and Results

Our cluster consists of 29 Dell servers, each with dual quad-core processors for a totalf of 8 cores per node and serviced by another Dell server running a shared NFS export, connected to a bank of disks.  Since I knew the number of spots (reads) in the file, using a little bash/torque magic I was able to automate the submission of independent bowtie jobs using job arrays.  Each job in the array ran 8 threads, on seperate nodes simultaenously-- all working on different chunks of the same fastq read files by manipulating the number of skipped reads versus the number of reads to do.  

For example: the file contains approximately 211M reads.  So in the case of running a two node job, the first node is given the command line:

-s 0 --qupto 105500000

and the second job is given

-s 105500000 --qupto 105500000

Here are the results for the one node (8 threads), two node (16 threads) and four node (32 threads) runs:

 

 

So it looks like the speedup is pretty linear.  What the above chart shows is how long each node instance took to complete when chunking the workload up between two and four nodes (16t and 32t).  However, what immediately struck me as funny was how, for both multi-node instance, each subsequent node took a little bit longer to complete it's portion of work than the node before it.  I initially chaulked this up to I/O limits on the NFS server, thinking that we were hitting the NFS iops limit.  But the increase was so consistent that I decided to dig deeper. As it stood, it was taking 0:52 minutes for the first node to complete, but a full 1:36 for the final node to complete, yielding a total running time of 1:36.  That's almost double the time!

The bowtie source revealed the answer: the "--skip" function merely skips processing the reads, but it still goes through and reads/parses every single read up until it's reached it's skip limit-- which is extremely time consuming.  The reason each subsequent thread was taking so much longer was because in a file with 211M read spots, each read consisting of 4 lines, there are 844M odd lines to read in the entire file.  That is non trivial.

So I did what any self respecting computer scientist would do: I modified the source code.  I'll spare you the gory details, but here is essentially what I did: 

  • read the first 4 lines of each file in to calculate the relative size of a read record (each read is 4 lines in fastq format) and multiply that by the number of reads you want to skip
  • taking into consideration that each read is labeled with a read number in the fastq file, and the size (in ASCII) of a read's number increases by 1 byte every 10^x iterations (999 = 3 bytes vs 1000 = 4 bytes), add an additional offset
  • fseeko() to that byte position (minus about 4 read record's worth of bytes as a fudge factor) and then seek forward until you find the record # you should start at

Using the handy "fseeko" command to instantly position the file pointer to any location in the file, I was able to cut down this extra seek time done skipping reads to nearly instant. In the case of our 32t workload, the "node 3" portion working on the last 1/4 chunk of the reads, was taking 1:36 to complete, 50 minutes of which was done performing the "skip" function.  I was able to take that skip from 50 minutes down to 2 seconds.  Not bad.

Testing this modification revelead what I had hoped:

 

 

As you can see, all jobs "chunks" now complete in the same amount of time.  I submitted a few more runs of increasing node/thread count, and at 5 nodes (40 threads) I got that time down to about 43 minutes, but was already starting to see some signs of I/O degradation.  At 6 nodes (48 threads) it thrashed the NFS server so bad that execution time blew up. So it looks like for our current cluster, the sweet spot is right around 5 nodes and 40 threads.

Conclusion

43 minutes down from 247 minutes is a pretty massive speedup of about 83%!  And that is only one part of the potential alignment->sequencing->analysis pipeline.  So,given the somewhat similar nature of alot of these tools, are there other modifications that could be made?  Most definitely.  There are different spots I can already identify along the line that could be custom fit to get massive speedups on any single slice of the pipe by adding a few key pieces of information at the beginning to help out later stages.  One thing I've found as a result of this deep dive into genetic sequencing tools is that there isn't a huge sense of "statefullness" in terms of performance between any of the mainstream tools often used together, beyond the standardized fastq/SAM formatting.  

For example, the SRA tool that converts raw SRA data into fastq could be modified to provide an "index" file of sorts that lists byte offsets every X reads, that could be fed directly into any of the sequencing/aligning tools for super fast file positioning.  This would incur a nominal size overhead.  I also notice the option to output in .tar.gz compressed fastq format, which could greatly speedup the I/O portion of transfering these massive datasets over the network.  Perhaps bowtie could be modified to read directly from a compressed pipe [I know this is possible from a C/C++/python perspective, just not sure if the overhead would outweigh the decreased transfer size-- but it would be interesting to try.]  I hope to continue this study on other tools and would welcome input from bioninformatics professionals as well as geneticists as to what they might like to see.

I think running bowtie in "-y" so that it tries it's absolute hardest to find matches might provide an even larger improvement, given that IO will not be as large of an immediate problem since there will be more time spent "crunching" the data and less time spent reading/writing.  If my initial results are any indication, it will spend about 4 times longer crunching the data.

I apologize if I've used the tools in any way contrary to their intended use, or in a non realistic fashion.  Here at the Milwaukee Institute, we are constantly looking at ways to improve the performance of the tools we offer, and part of that is extending a hand to the genetics and bioinformatics communitiy by way of code sharing.

I've included the modified source code.  It can be downloaded here.  The new option is "--byteoffset" and must be specified with the "--skip" option.  It bases the skip byte offset on the size of the initial read record.  Also of note, this has only been tested with NCBI converted FASTQ files that contain the read number in the following format (where this is read record 4):

@ERR091571.4 HSQ1009_86:5:1101:1475:2060 length=100
NGAGGGACAAACTTTCAGACCTTCTTTGCAGTGTTCTGGAATCCTACAGAGTGACAAACATTCAGACAATCGTAACAGTGTTCTGGAATCCTGTGTGAAG
+ERR091571.4 HSQ1009_86:5:1101:1475:2060 length=100
 
Our next cluster spec (192 nodes, 3072 cores) which we hope to deploy by Q1 2013, currently has plans to include a DDN SFA12k GridScalar storage appliance running GPFS for low latency, ridiculously quick parallel file storage.  Without the current single-server NFS file bottleneck, I'm guessing this thing would absolutely scream.  
 
Please contact me if you have any questions. 
Average (0 آرا)


No comments yet. Be the first.