Leon: Read compression

Leon is a fasta/fastq read compressor that uses a probabilistic de Bruijn graph and does not need any reference genome. It keeps the original read order intact and compresses quality scores in lossy mode by default (or in lossless mode if asked with the option -lossless).

Method

Unlike other existing methods, Leon does not rely only on standard text compression techniques; it is largely inspired by algorithms from NGS data analysis, namely de novo assembly.

The method does not require any reference genome; instead, a reference is built de novo from the set of reads as a probabilistic de Bruijn graph. Leon uses the disk-streaming k-mer counting algorithm contained in the GATB library and inserts the solid k-mers (those that reach the minimal abundance threshold) into a Bloom filter. Each read is then encoded as a path in this graph, storing only an anchoring k-mer and a list of bifurcations indicating which path to follow in the graph when several are possible. Leon does not reorder reads, so read pairing information is not lost, unlike with many other existing methods.
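The path encoding described above can be illustrated with a minimal Python sketch. It is not Leon's implementation, only a toy under stated assumptions: an exact Python set stands in for the Bloom filter, the anchor is simply the first k-mer of the read, the read length is stored so that decoding knows where to stop, and reverse complements and sequencing errors are ignored. The function names (build_graph, encode_read, decode_read) are hypothetical.

```python
ALPHABET = "ACGT"


def build_graph(reads, k):
    """Collect every k-mer of the reads in a set (stand-in for the Bloom filter)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(kmer, graph):
    """Nucleotides that extend `kmer` to another k-mer present in the graph."""
    return [nt for nt in ALPHABET if kmer[1:] + nt in graph]


def encode_read(read, graph, k):
    """Return (anchor k-mer, read length, bifurcation list) for an error-free read."""
    anchor = read[:k]
    if anchor not in graph:
        raise ValueError("anchor k-mer is not in the graph")
    bifurcations = []
    current = anchor
    for nt in read[k:]:
        if current[1:] + nt not in graph:
            raise ValueError("read leaves the graph (errors are not handled here)")
        # Store the nucleotide only where the graph offers several continuations.
        if len(successors(current, graph)) > 1:
            bifurcations.append(nt)
        current = current[1:] + nt
    return anchor, len(read), bifurcations


def decode_read(anchor, length, bifurcations, graph):
    """Rebuild the read by walking the graph from the anchor."""
    read = anchor
    current = anchor
    choices = iter(bifurcations)
    while len(read) < length:
        succ = successors(current, graph)
        # Follow the unique continuation, or consume the next stored choice.
        nt = succ[0] if len(succ) == 1 else next(choices)
        read += nt
        current = current[1:] + nt
    return read


reads = ["ACGTACGGA", "GTACGTTC"]
graph = build_graph(reads, k=4)
encoded = encode_read(reads[0], graph, k=4)
print(encoded)                       # ('ACGT', 9, ['A', 'G'])
print(decode_read(*encoded, graph))  # ACGTACGGA
```

In this toy example, a 9-nucleotide read is reduced to a 4-mer anchor plus two stored nucleotides, because the graph has a unique continuation everywhere else.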

Usage

A typical command line is:

leon -file read.fasta -c

to compress the fasta file read.fasta into the compressed file read.leon

leon -file read.leon -d

to decompress the compressed file read.leon into the fasta file read_d.fasta

There are 2 mandatory arguments, the input file and the mode:

-file : the read file name; either a fasta or fastq file (gzipped or not) for compression, or a .leon file for decompression
-c : to compress, or
-d : to decompress

Additional useful options:

-nb-cores : number of threads used
-kmer-size : the k-mer size (default=31)
-abundance : the minimal abundance threshold defining solid k-mers (default=automatic)
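For scripted use, the arguments above can be assembled programmatically. The sketch below is a hypothetical Python helper (leon_command is not part of Leon) that only builds the argument list from the options documented here. It assumes that the executable is called leon and is on the PATH, and that each option value follows its flag as a separate argument.

```python
import shlex


def leon_command(path, mode, nb_cores=None, kmer_size=None,
                 abundance=None, lossless=False):
    """Build a Leon command line as a list of arguments (hypothetical helper)."""
    if mode not in ("compress", "decompress"):
        raise ValueError("mode must be 'compress' or 'decompress'")
    cmd = ["leon", "-file", path, "-c" if mode == "compress" else "-d"]
    # Optional settings are added only when requested, so Leon's defaults apply.
    if nb_cores is not None:
        cmd += ["-nb-cores", str(nb_cores)]
    if kmer_size is not None:
        cmd += ["-kmer-size", str(kmer_size)]
    if abundance is not None:
        cmd += ["-abundance", str(abundance)]
    if lossless:
        cmd.append("-lossless")
    return cmd


cmd = leon_command("read.fasta", "compress", nb_cores=4, kmer_size=31)
print(shlex.join(cmd))  # leon -file read.fasta -c -nb-cores 4 -kmer-size 31
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Leon is installed.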

Results

Below are the results of Leon and other compression software on two Illumina datasets. On these datasets, Leon compressed files are roughly 4 to 5 times smaller than the gzip ones, and its compression ratios are higher than those of all the other state-of-the-art compressors tested.

	E. coli (116x, 875 MB, SRR959239)				C. elegans (70x, 11 GB, SRR065390)
Software	Ratio	C.Time	D.Time	Memory (MB)	Ratio	C.Time	D.Time	Memory (MB)
Leon v 0.1.2	22.7	44s	31s	700	16.4	16m3s	10m50s	994
Leon v 0.2.1	22.7	24s	19s	700	16.4	6m40s	5m36s	1889
Fastqz	18.9	2m59s	3m34s	1343	11.1	37m45s	43m47s	1527
Fqzcomp	16.9	58s	1m3s	4155	11.5	15m30s	19m25s	4208
Scalce	14	1m13s	33s	1844	12.2	16m21s	7m4s	5820
Quip	8.9	4m43s	5m2s	1830	7.5	19m17s	10m28s	778
gzip	4.4	2m21s	7s	1	4.3	36m6s	1m35s	1

Ratio is expressed as the original fasta file size divided by the compressed file size. C.Time and D.Time denote the compression and decompression times, respectively.

As a concrete example of space saving, a 441 GB human fasta file weighs only 39 GB after Leon compression (whole genome sequencing with 102x coverage, SRR345593/SRR345594).
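The figures above can be checked with a few lines of arithmetic. The Python sketch below applies the ratio definition given under the table to the human example and compares Leon with gzip using the ratios from the table; the function name compression_ratio is only illustrative.

```python
def compression_ratio(original_size, compressed_size):
    """Original file size divided by compressed file size (same unit for both)."""
    return original_size / compressed_size


# Human dataset: 441 GB before compression, 39 GB after.
human_ratio = compression_ratio(441, 39)
space_saved = 1 - 39 / 441
print(f"ratio {human_ratio:.1f}, space saved {space_saved:.1%}")
# ratio 11.3, space saved 91.2%

# Leon versus gzip, using the ratios reported in the table.
print(f"E. coli: {22.7 / 4.4:.1f}x smaller than gzip")     # 5.2x
print(f"C. elegans: {16.4 / 4.3:.1f}x smaller than gzip")  # 3.8x
```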

Coming soon…

Soon, the Leon format will be integrated in the GATB library, and all software based on GATB will be able to use it as a native input format. This means that it will not be necessary to decompress the files on disk before use, and that the information about k-mer counts and the de Bruijn graph will not need to be re-computed by the tools, saving substantial running time.
