Bloocoo is a k-mer spectrum-based read error corrector. In a first pass, all k-mers are counted; the k-mers more abundant than a given threshold, called “solid k-mers”, are kept.
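The counting pass can be pictured with a small in-memory sketch. It is only an illustration, not Bloocoo's implementation (which relies on DSK and works on disk): the function names are hypothetical, "more abundant than the threshold" is assumed to mean a count greater than or equal to it, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (naive, all in memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_abundance):
    """Keep the k-mers seen at least `min_abundance` times.

    Assumption: the threshold is inclusive (count >= min_abundance).
    """
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
    # The k-mer ACGA appears once (it carries the odd last base of read 2),
    # so it falls below the threshold and is not solid.
    print(sorted(solid_kmers(reads, k=4, min_abundance=3)))
```

K-mers that contain a sequencing error tend to be rare, which is why a simple abundance cut-off separates them from genuine genomic k-mers when coverage is high enough.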
Correction is then performed by scanning the k-mers of each read. For example, a single isolated error produces a run of k consecutive non-solid k-mers, which makes its exact location easy to detect. The error is corrected by trying the three other possible nucleotides at the error site and checking whether the corresponding k-mers are in the set of solid k-mers.
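The isolated-error case described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Bloocoo's actual code: the function name is hypothetical, the solid k-mers are held in a plain Python set, and a substitution is accepted only when exactly one alternative nucleotide makes all k covering k-mers solid.

```python
def correct_isolated_errors(read, k, solid):
    """Fix substitutions that show up as a run of exactly k non-solid k-mers."""
    n_kmers = len(read) - k + 1
    flags = [read[i:i + k] in solid for i in range(n_kmers)]
    corrected = list(read)
    i = 0
    while i < n_kmers:
        if flags[i]:
            i += 1
            continue
        # Find the end j of the run of non-solid k-mers starting at i.
        j = i
        while j < n_kmers and not flags[j]:
            j += 1
        if j - i == k:
            # The only base shared by all k k-mers of the run.
            pos = i + k - 1
            # Window of 2k-1 bases spanning every k-mer that covers pos.
            window = read[i:j + k - 1]
            fixes = []
            for base in "ACGT":
                if base == read[pos]:
                    continue
                trial = window[:k - 1] + base + window[k:]
                if all(trial[t:t + k] in solid for t in range(k)):
                    fixes.append(base)
            # Assumption: correct only when the fix is unambiguous.
            if len(fixes) == 1:
                corrected[pos] = fixes[0]
        i = j
    return "".join(corrected)


if __name__ == "__main__":
    reference = "ACGTTGCAATCGGATC"
    k = 4
    solid = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    noisy = "ACGTTGCTATCGGATC"  # A -> T substitution at position 7
    print(correct_isolated_errors(noisy, k, solid))
```

Runs that are shorter or longer than k are left untouched here; they correspond to errors near read ends or to several close errors, which need a different strategy.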
When several errors occur close together, the pattern is more complex; such errors are corrected with a voting algorithm similar to the one in the Musket software (http://musket.sourceforge.net/).
What makes Bloocoo different is the k-mer counting stage and the way solid k-mers are stored in memory. k-mer counting is performed with the DSK algorithm included in the GATB library, which runs in constant memory. Solid k-mers are stored in a Bloom filter, which is fast and memory-efficient: only 11 bits of memory are used per solid k-mer. As a result, correcting a whole human genome sequencing read set needs only 4 GB of memory.
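To make the storage idea concrete, here is a minimal Bloom filter sketch sized at 11 bits per item. It is not the GATB implementation: the class name, the choice of 7 hash functions and the SHA-256-based double hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for strings, sized in bits per expected item."""

    def __init__(self, n_items, bits_per_item=11, n_hashes=7):
        # Assumption: 7 hashes, close to the usual optimum for 11 bits/item.
        self.n_bits = max(1, n_items * bits_per_item)
        self.n_hashes = n_hashes
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from two 64-bit values (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


if __name__ == "__main__":
    kmers = ["ACGT", "CGTT", "GTTG", "TTGC"]
    bf = BloomFilter(n_items=len(kmers))
    for kmer in kmers:
        bf.add(kmer)
    print(all(kmer in bf for kmer in kmers), len(bf.bits), "bytes")
```

As a rough back-of-the-envelope check of the 4 GB figure, assuming on the order of 3 billion solid k-mers for a human genome: 3 × 10^9 × 11 bits is about 4.1 × 10^9 bytes. The price of this compactness is a small rate of false positives, that is, non-solid k-mers occasionally reported as solid.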
A typical command line is:
bloocoo -file reads.fasta -kmer-size 27 -abundance 4
There are 3 mandatory arguments:
-file : the read file name, can be fasta, fastq, gzipped or not
-kmer-size : the k-mer size (typically ~31)
-abundance : the abundance threshold defining solid k-mers (typically between 3 and 6)
Additional useful options:
-nb-cores : number of threads used
-high-recall : corrects more errors but can also introduce more mistakes
-slow : slower mode with more passes, but better correction
-high-precision : corrects safely; corrects fewer errors but introduces fewer mistakes
-ion : (experimental) mode for correcting indels present in Ion Torrent reads
Reads from human chromosome 1 were simulated with a 1% uniform error rate and 70x coverage. This table shows the time and memory taken by Bloocoo and Musket, as well as their recall and precision rates.
Bloocoo's correction quality is roughly the same as Musket's, while it uses 16x less memory and 3.8x less time.
Both programs were run with 8 threads on an Intel Xeon E5-2640.
| Recall | 90.92 % | 90.28 % |
| Precision | 97.86 % | 96.93 % |
Times are the ‘real’ (wall-clock) times reported by the ‘time’ command.
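For readers unfamiliar with the two quality measures, the sketch below shows how recall and precision are usually computed for an error corrector. The definitions are an assumption, since the text does not spell them out: a true positive is taken to be an erroneous base that was fixed correctly, a false positive a wrong modification introduced by the corrector, and a false negative an error left uncorrected. The function name is hypothetical.

```python
def recall_precision(true_pos, false_pos, false_neg):
    """Return (recall, precision) from correction counts.

    Assumed definitions:
      recall    = TP / (TP + FN): share of real errors that were fixed
      precision = TP / (TP + FP): share of modifications that were right
    """
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    recall, precision = recall_precision(true_pos=90, false_pos=10, false_neg=10)
    print(f"recall={recall:.2%} precision={precision:.2%}")
```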