Leon is a fasta/fastq read compressor that uses a probabilistic de Bruijn graph and does not need any reference genome. It keeps the original read order intact and compresses quality scores in lossy mode by default (or in lossless mode with the -lossless option).
Method
Unlike other existing methods, Leon does not rely only on standard text compression techniques; it is largely inspired by an NGS data analysis technique, namely de novo assembly.
The method does not require any reference genome; instead, a reference is built de novo from the set of reads as a probabilistic de Bruijn graph. Leon uses the disk-streaming k-mer counting algorithm from the GATB library and inserts the solid k-mers into a Bloom filter. Each read is then encoded as a path in this graph, storing only an anchoring k-mer and a list of bifurcations that indicate which path to follow wherever several are possible. Leon does not reorder reads, so read pairing information is preserved, unlike with many other existing methods.
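The path encoding described above can be illustrated with a minimal Python sketch. It is not Leon's implementation: the function names (solid_kmers, encode, decode) are hypothetical, an exact Python set stands in for the Bloom filter (so there are no false positives), k-mers are counted in memory rather than streamed from disk, and the anchor is simply the first k-mer of the read. Sequencing errors are not handled; the sketch raises an exception instead.

```python
from collections import Counter

BASES = "ACGT"

def solid_kmers(reads, k, min_abundance):
    """Count every k-mer and keep those seen at least min_abundance times."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

def successors(prefix, graph):
    """Bases that extend the last k-1 decoded bases into a k-mer of the graph."""
    return [b for b in BASES if prefix + b in graph]

def encode(read, k, graph):
    """Encode a read as (anchor k-mer, read length, bifurcation list)."""
    bifurcations = []
    for i in range(k, len(read)):
        nxt = successors(read[i - k + 1:i], graph)
        if len(nxt) != 1:
            # Zero or several ways to continue: record the base actually taken.
            bifurcations.append(read[i])
        elif nxt[0] != read[i]:
            # The only path disagrees with the read (e.g. a sequencing error).
            raise ValueError("read leaves the graph; not handled in this sketch")
    return read[:k], len(read), bifurcations

def decode(anchor, length, bifurcations, k, graph):
    """Rebuild the read by walking the graph from the anchor."""
    read = anchor
    choices = iter(bifurcations)
    while len(read) < length:
        nxt = successors(read[len(read) - k + 1:], graph)
        # Follow the unique path when there is one, else consume a stored choice.
        read += nxt[0] if len(nxt) == 1 else next(choices)
    return read

reads = ["ACGTTGCA", "ACGTAGCA"]
graph = solid_kmers(reads, k=3, min_abundance=1)
for r in reads:
    code = encode(r, 3, graph)
    assert decode(*code, 3, graph) == r
```

In this toy example each 8-base read reduces to a 3-base anchor, its length, and a single stored base at the one position where the two reads diverge; every other base is implied by the graph.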
Usage
A typical command line is:
leon -file read.fasta -c
to compress the fasta file read.fasta into the compressed file read.leon
leon -file read.fasta.leon -d
to decompress the compressed file into the fasta file read_d.fasta
There are 2 mandatory arguments:
-file : the read file name; can be fasta or fastq, gzipped or not, or a .leon file for decompression
-c : to compress, or -d : to decompress
Additional useful options:
-nb-cores : number of threads used
-kmer-size : the k-mer size (default=31)
-abundance : the minimal abundance threshold defining solid k-mers (default=automatic)
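For scripted runs, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Leon. It assumes that the leon executable is on the PATH and that each option takes its value as the next space-separated argument (for example -nb-cores 4), which the text above does not state explicitly.

```python
def leon_command(path, mode, nb_cores=None, kmer_size=None,
                 abundance=None, lossless=False):
    """Build the argument list for a leon run; mode is "c" or "d"."""
    if mode not in ("c", "d"):
        raise ValueError("mode must be 'c' (compress) or 'd' (decompress)")
    cmd = ["leon", "-file", path, "-" + mode]
    # Optional flags are only added when a value is given, so leon keeps
    # its own defaults (k-mer size 31, automatic abundance threshold).
    for flag, value in (("-nb-cores", nb_cores),
                        ("-kmer-size", kmer_size),
                        ("-abundance", abundance)):
        if value is not None:
            cmd += [flag, str(value)]
    if lossless:
        cmd.append("-lossless")
    return cmd

print(" ".join(leon_command("read.fasta", "c", nb_cores=4)))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) from the Python standard library.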
Results
Below are the results of Leon and other compression software on two Illumina datasets. On these datasets, Leon-compressed files are 4 to 5 times smaller than the gzip ones, and Leon's compression ratios are higher than those of all other state-of-the-art compressors.
|              | E. coli, 116x, 875 MB, SRR959239 | | | | C. elegans, 70x, 11 GB, SRR065390 | | | |
| Software     | Ratio | C.Time | D.Time | Memory (MB) | Ratio | C.Time | D.Time | Memory (MB) |
| Leon v 0.1.2 | 22.7  | 44s    | 31s    | 700         | 16.4  | 16m3s  | 10m50s | 994         |
| Leon v 0.2.1 | 22.7  | 24s    | 19s    | 700         | 16.4  | 6m40s  | 5m36s  | 1889        |
| Fastqz       | 18.9  | 2m59s  | 3m34s  | 1343        | 11.1  | 37m45s | 43m47s | 1527        |
| Fqzcomp      | 16.9  | 58s    | 1m3s   | 4155        | 11.5  | 15m30s | 19m25s | 4208        |
| Scalce       | 14    | 1m13s  | 33s    | 1844        | 12.2  | 16m21s | 7m4s   | 5820        |
| Quip         | 8.9   | 4m43s  | 5m2s   | 1830        | 7.5   | 19m17s | 10m28s | 778         |
| gzip         | 4.4   | 2m21s  | 7s     | 1           | 4.3   | 36m6s  | 1m35s  | 1           |
Ratio is expressed as the original fasta file size divided by the compressed file size. C.Time and D.Time denote the compression and decompression times, respectively.
As a concrete example of space saving, a 441 GB human fasta file weighs only 39 GB after Leon compression (whole genome sequencing with 102x coverage, SRR345593/SRR345594).
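Applying the ratio definition above to this human dataset gives a quick sanity check. The snippet is only illustrative arithmetic on the two sizes quoted in the text; the function name compression_ratio is made up for this example.

```python
def compression_ratio(original_size, compressed_size):
    """Ratio as defined above: original size divided by compressed size."""
    return original_size / compressed_size

original_gb, compressed_gb = 441, 39
ratio = compression_ratio(original_gb, compressed_gb)
saved = 1 - compressed_gb / original_gb  # fraction of disk space freed

print(f"ratio: {ratio:.1f}, space saved: {saved:.0%}")
```

This works out to a ratio of about 11.3, meaning roughly 91% of the original disk space is freed.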
Coming soon…
Soon, the Leon format will be integrated into the GATB library, and all GATB-based software will be able to use it as a native input format. This means that files will not need to be decompressed on disk before use, and that the information about k-mer counts and the de Bruijn graph will not need to be re-computed by the tools, saving substantial running time.