Leon: Read compression

Leon is a fasta/fastq read compressor based on a probabilistic de Bruijn graph, without the need of any reference genome. It keeps the original read order intact, and compresses quality scores in lossy mode by default (or in lossless mode with the option -lossless).

Method

Contrary to other existing methods, Leon is not only based on standard text compression techniques: it is largely inspired by NGS data analysis algorithms, namely de novo assembly.

The method does not require any reference genome; instead, a reference is built de novo from the set of reads in the form of a probabilistic de Bruijn graph. It uses the disk-streaming k-mer counting algorithm of the GATB library and inserts the solid k-mers into a Bloom filter. Each read is then encoded as a path in this graph, storing only an anchoring k-mer and a list of bifurcations indicating which path to follow in the graph when several are possible. Leon does not reorder reads, so read pairing information is not lost, contrary to many other existing methods.
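To make the encoding idea concrete, here is a small illustrative sketch in Python (not Leon's actual implementation, which is written in C++ on top of GATB). The toy reads, the k-mer size of 5, the plain set standing in for the Bloom filter, and the right-only extension from the anchor are all simplifying assumptions made for illustration:

from collections import Counter

K = 5  # k-mer size for this toy example (Leon's default is 31)

def kmers(seq, k=K):
    """Yield all k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_solid_kmers(reads, min_abundance=2, k=K):
    """Count k-mers over the whole read set and keep the solid ones.
    Leon does this with GATB's disk-streaming counter and stores the result
    in a Bloom filter; a Counter and a plain set stand in for both here."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_abundance}

def encode_read(read, solid, k=K):
    """Encode one read as (anchor position, anchor k-mer, bifurcation list).
    From the anchor onwards, each base is recovered by probing the graph for
    solid extensions; a base is stored explicitly only when the extension is
    ambiguous or absent (a bifurcation or a sequencing error)."""
    # Assumes the read contains at least one solid k-mer; a real encoder
    # must fall back to storing the read verbatim when no anchor is found.
    anchor_pos = next(i for i, km in enumerate(kmers(read, k)) if km in solid)
    anchor = read[anchor_pos:anchor_pos + k]
    bifurcations = []
    kmer = anchor
    for pos in range(anchor_pos + k, len(read)):
        true_base = read[pos]
        candidates = [b for b in "ACGT" if (kmer[1:] + b) in solid]
        if candidates != [true_base]:
            bifurcations.append((pos, true_base))
        kmer = kmer[1:] + true_base
    return anchor_pos, anchor, bifurcations

if __name__ == "__main__":
    reads = ["ACGTACGGTAC", "CGTACGGTACA", "ACGTACGGTACA"]
    solid = build_solid_kmers(reads)
    for read in reads:
        print(read, "->", encode_read(read, solid))

For the first toy read, this prints the anchor ACGTA at position 0 and a single bifurcation at position 6, where both GTACG and GTACA are solid; every other base is recovered by following the unique solid extension. A real encoder would also have to handle the bases before the anchor and reads containing no solid k-mer, which this sketch leaves out.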

Usage

A typical command line is:

leon -file read.fasta -c

to compress the fasta file read.fasta into the compressed file read.fasta.leon, and

leon -file read.fasta.leon -d

to decompress the compressed file read.fasta.leon into the fasta file read_d.fasta.

There are 2 mandatory arguments:

-file : the read file name; it can be a fasta or fastq file, gzipped or not, or a .leon file for decompression
-c : to compress, or
-d : to decompress

Additional useful options:

-nb-cores : number of threads used
-kmer-size : the k-mer size (default=31)
-abundance : the minimal abundance threshold defining solid k-mers (default=automatic)
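
For example, the following command (an illustrative combination of the options above, not taken verbatim from the Leon documentation) compresses a gzipped fastq file on 8 threads while keeping quality scores lossless:

leon -file reads.fastq.gz -c -lossless -nb-cores 8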

Results

Below are the results of Leon and other compression software on two Illumina datasets. On these datasets, Leon-compressed files are 4 to 5 times smaller than the gzipped ones, and its compression ratios are higher than those of all other state-of-the-art compressors tested.

                E. coli, 116x, 875 MB (SRR959239)          C. elegans, 70x, 11 GB (SRR065390)
                Ratio   C.Time   D.Time   Memory (MB)      Ratio   C.Time   D.Time   Memory (MB)
Leon v 0.1.2    22.7    44s      31s      700               16.4    16m3s    10m50s   994
Leon v 0.2.1    22.7    24s      19s      700               16.4    6m40s    5m36s    1889
Fastqz          18.9    2m59s    3m34s    1343              11.1    37m45s   43m47s   1527
Fqzcomp         16.9    58s      1m3s     4155              11.5    15m30s   19m25s   4208
Scalce          14      1m13s    33s      1844              12.2    16m21s   7m4s     5820
Quip            8.9     4m43s    5m2s     1830              7.5     19m17s   10m28s   778
gzip            4.4     2m21s    7s       1                 4.3     36m6s    1m35s    1

Ratio is expressed as the original fasta file size divided by the compressed file size. C.Time and D.Time denote the compression and decompression times, respectively.

As a concrete example of space saving, a 441 GB human fasta file weighs only 39 GB after Leon compression! (whole genome sequencing with 102x coverage, SRR345593/SRR345594)

Coming soon…

Soon, the Leon format will be integrated into the GATB library, and all software based on GATB will be able to use it as a native input format. This means that it will not be necessary to decompress the files on disk before use, and that the information about k-mer counts and the de Bruijn graph will not need to be re-computed by the tools, saving substantial running time.
