GenScale has managed to assemble the raspberry genome on . . . the eponymous Raspberry Pi, a mini-computer with only 512MB of RAM.
The goal was essentially to demonstrate the dramatically low memory footprint of GATB.
Read the article.
Some technical details
We are very grateful to Joshua Udall from Brigham Young University for providing us sequencing data from the raspberry genome. In total, we used 175 million DNA Illumina paired-end reads. The reads were 101 bp long. We discarded quality information (FASTQ->FASTA) and compressed the reads using gzip (total file size of 7 GB). A SDCard of 64 GB was used to store the reads and all other temporary files for assembly.
This page describes our experiments in terms that should be familiar to bioinformaticians. To those unfamiliar with genome assembly, the following is a short technical introduction. A k-mer is k consecutive DNA nucleotides. Given many k-mers, a de Bruijn graph is a directed graph where each k-mer is a node, and there is an edge whenever two k-mers exactly overlap over (k-1) nucleotides, e.g. AGT->GTC (k=3). de Bruijn graphs are widely used to assemble genomes from sequencing data, as one can show that, under some hypotheses, the genome of an organism can be fully reconstructed by finding an Eulerian path in the de Bruijn graph.
We ran the software Minia version 1.5418 to perform assembly using a de Bruijn graph approach. Minia uses a cascade of Bloom filters as the primary data structure to store k-mers, very efficiently. It also stores a sparse set k-mers explicitly to remember which nodes were visited in the graph.
The value of k is critical in assembly. We used Kmergenie to determine that the optimal k for this dataset was 75. However, we were unable to set k=75 for the Raspberry PI assembly, as the assembler ran out of memory. Consider that storing a nucleotide requires at least 2 bits, thus a k-mer requires 2k bits. As our implementation uses multiple of 64 bits, storing each 75-mer would require 192 bits (24 bytes). Minia had to store roughly 8 million k-mers explicitly for this assembly, i.e. 183 MB of memory with k=75. This was in addition to the Bloom filter, which used around 280 MB of memory. The total exceeded the amount of memory available on the Raspberry PI (512 MB minus the memory used by the operating system, Linux). We therefore used k=31 in our assembly, which is the largest odd k so that k-mers can be stored in 8 bytes.
The most time-consuming step was the first step, k-mer counting, using the low-memory and low disk space DSK algorithm.
|K-mer counting time||3500 mins|
|Assembly time||2000 mins|
|Total running time||3.8 days|
|Peak memory usage||426 MB|
|Assembly stats (k=31)|
|Number of contigs||384 k|
|Longest contig||19 Kbp|
|Total contigs size||150 Mbp|
This is an excerpt of the actual output of the software running on the Raspberry PI:
$ time ../minia-1.5418/minia list_reads 31 5 300000000 list_reads --dont-count [..] Total 278.53 MB for 272027204 solid kmers ==> 8.59 bits / solid kmer [..] Allocated memory for marking: 8080976 branching kmers x (8+2 )B actual implementation: (sizeof(kmer_type) = 8 B) + (sizeof(set_value_t) = 2 B) per entry: 2.38 bits / solid kmer L [..] Total nt assembled 149174375 nbContig 384112 Max contig len 19224 (debug: max len left 14485, max len right 17630) -------------------Assembly time Wallclock 64591.6 s -------------------Total time Wallclock 118318 s real 1971m57.731s
Minia usually creates high-accuracy assemblies (with little to no large-scale misassemblies and some mismatches). However, as no reference genome was known for this organism, we were unable to assess the accuracy of this assembly.
Comparison with another assembler
We provide here a comparison with another leading assembler, SOAPdenovo2, using 8 threads. We ran GATB-Pipeline, an automated assembly script combining Minia + SSPACE. The benchmark marchine is a 32-core cluster node with 512 GB of RAM. The k value was set to 75 for both tools. The command line and config file for for SOAPdenovo2 were:
$ time memused ./SOAPdenovo-127mer all -s soapdenovo2.config -K 75 -o soapdenovo2_assembly -p 8 max_rd_len=200 [LIB] avg_ins=266 reverse_seq=0 asm_flags=3 rank=1 map_len=32 q1=reads_1.fastq.gz q2=reads_2.fastq.gz
The command line for GATB-Pipeline was:
$ time ./gatb -p reads_1.fastq.gz reads_2.fastq.gz -k 75 -t 2 -g 400000000
The raspberry genome appears to be highly heterozygous, hence both Minia and SOAPdenovo struggled to produce long contigs. Keep in mind that this experiment was made informally, for demonstrating feasibility on a Raspberry Pi; a different assembly strategy would have been recommended if the end goal was to produce the best possible assembly for the purpose of biological analysis.
|SOAPdenovo2 (k=75)||GATB-Pipeline (k=75)|
|Peak memory usage||8,684 MB||881 MB|
|Wall-clock time (CPU time)||11 hours (22 h)||29 hours (29 h)|
|Number of scaffolds||1.5 M||0.5 M|
|Longest scaffold||95 Kbp||81 Kbp|
|Total scaffolds size||394 Mbp||304 Mbp|
|Scaffold NG50 (contig NG50), G = 400 Mbp||702 bp (332 bp)||759 bp (758 bp)|