Genome Pair Rapid Dotter (gepard)

Gepard (German for "cheetah"; a backronym for "GEnome PAir - Rapid Dotter") calculates dotplots even for large sequences such as chromosomes or bacterial genomes. Reference: Krumsiek J, Arnold R, Rattei T. Gepard: A rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 2007; 23(8): 1026-8. PMID: 17309896

Use cases

Local comparison of two nucleotide or amino acid sequences from user-specified files. Batch dotplot functionality is provided by command line access to Gepard.

Features

Rapid calculation of dotplots (<2 min for an E. coli self-plot on a standard computer)
Preconfigured parameters => simply specify two sequences and create the dotplot (3 clicks)
Easy-to-use interface (mouse zooming, context-sensitive help)
Image exports (multiple formats)
Should work on any common OS due to Java software architecture
Persistent storage of suffix arrays (avoids recalculation)
Gepard guarantees the privacy of all input data, does not store user data remotely and does not contain any form of malware.

System requirements

Gepard requires the Java Runtime Environment Version 5.0 or later (http://www.java.com/download/). It has been tested on the following operating systems:

Microsoft Windows XP and later
Linux/Un*x systems
MacOS 10.x

Download

Please download the jar file if you want to run Gepard on your computer (Java needs to be installed). On MacOS, navigate to your Downloads folder and open the jar file via right-click. Confirm the execution of this program.

Source code

The source code is available in our GitHub repository gepard.

Tutorial

An offline version of the tutorial is included in the download package and in the source code.

Method

Gepard uses suffix arrays for rapid heuristic dotplot calculation. For large dotplots, it searches for exact word matches of a given length (10 by default) from one sequence in the suffix array of the other sequence. Because an arbitrary word can be located in a suffix array in O(log n) time, this method reduces the complexity of the dotplot calculation from O(m*n) to O(m * log n), where n is the length of the longer and m the length of the shorter sequence. For small dotplots, the classical window-based dotplot calculation is used.
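The word-lookup idea can be illustrated with a minimal Java sketch. This is not Gepard's actual code: the class and method names are hypothetical, the suffix array is built by naive sorting for brevity, and the word length is set to 4 only so that the toy sequences produce a match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not taken from Gepard.
class WordMatchDotplot {

    // Builds a suffix array by naively sorting all suffix start positions.
    // Simple but slow; chosen here only to keep the example short.
    static Integer[] buildSuffixArray(String text) {
        Integer[] sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        Arrays.sort(sa, (x, y) -> text.substring(x).compareTo(text.substring(y)));
        return sa;
    }

    // Compares the first word.length() characters of the suffix at pos with word.
    static int comparePrefix(String text, int pos, String word) {
        int end = Math.min(text.length(), pos + word.length());
        return text.substring(pos, end).compareTo(word);
    }

    // Binary search for the first suffix that starts with word,
    // then collects all neighbouring suffixes that also start with it.
    static List<Integer> findWord(String text, Integer[] sa, String word) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(text, sa[mid], word) < 0) lo = mid + 1;
            else hi = mid;
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < sa.length && comparePrefix(text, sa[i], word) == 0; i++) {
            hits.add(sa[i]);
        }
        return hits;
    }

    // Looks up every word of length k from a in the suffix array of b.
    // Each hit (i, j) is one dot: a[i..i+k) equals b[j..j+k).
    static List<int[]> dots(String a, String b, int k) {
        Integer[] sa = buildSuffixArray(b);
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i + k <= a.length(); i++) {
            for (int j : findWord(b, sa, a.substring(i, i + k))) {
                result.add(new int[] {i, j});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] d : dots("ACGTACGT", "TACGA", 4)) {
            System.out.println(d[0] + "\t" + d[1]);
        }
    }
}
```

Each lookup costs a logarithmic number of word comparisons instead of a scan over the whole second sequence, which is where the reduction from O(m*n) to O(m * log n) comes from.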

Memory issues and VMATCH support

The program uses the "Skew" algorithm to calculate the suffix arrays. This algorithm is very memory-intensive, so Gepard may require a large amount of available memory. Unfortunately, on all operating systems the Java VM has to be given its maximum amount of memory at startup, which is why there are different startup scripts for different machines. The following table shows the approximate maximum sequence size (assuming a self-plot) for each memory setting. This includes both suffix array and dot matrix calculation.

Memory setting	Maximum sequence size (self-plot)
256MB	~10 million base pairs
512MB	~20 million base pairs
1024MB	~40 million base pairs
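The figures in the table scale roughly linearly, at about 10 million base pairs per 256MB. The following Java sketch (a hypothetical helper, not part of Gepard) applies that ratio to the maximum heap of the running JVM. On standard JVMs the maximum heap is set at startup with the -Xmx option; applying the ratio to values outside the table is an assumption, not a documented guarantee.

```java
// Illustrative sketch only; the class and method names are hypothetical.
class HeapEstimate {

    // Linear rule of thumb read off the table: ~10 million base pairs per 256MB.
    // Applying it to heap sizes outside the table is an assumption.
    static long estimateMaxBasePairs(long heapMb) {
        return heapMb * 10_000_000L / 256L;
    }

    public static void main(String[] args) {
        // maxMemory() reports the upper heap limit the JVM was started with.
        long heapMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
        System.out.println("Maximum heap: " + heapMb + "MB");
        System.out.println("Approximate maximum self-plot size: "
                + estimateMaxBasePairs(heapMb) + " base pairs");
    }
}
```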

Gepard supports the program "mkvtree" from the Vmatch package, which can calculate persistent suffix arrays in very short time and with very little memory usage. Gepard will attempt to use this external binary automatically if it can be located in the program's directory or via the PATH environment variable. If you are using Vmatch with Gepard, you may run the low-memory version of Gepard, as the mkvtree binary runs outside the Java VM.
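To check in advance whether an mkvtree binary would be discoverable, a lookup along the following lines can be used. This is a sketch under stated assumptions, not Gepard's actual detection logic: the class and method names are hypothetical, the current working directory stands in for the program directory, and on Windows the file name would presumably need an ".exe" suffix.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative sketch only; not Gepard's actual lookup code.
class BinaryLocator {

    // Searches each directory in a PATH-style string for an executable file.
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) return Optional.empty();
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) continue;
            try {
                Path candidate = Paths.get(dir, name);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries instead of failing.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the working directory stands in for the program directory.
        String searchPath = System.getProperty("user.dir");
        String envPath = System.getenv("PATH");
        if (envPath != null) searchPath += File.pathSeparator + envPath;

        Optional<Path> hit = findOnPath("mkvtree", searchPath);
        System.out.println(hit.map(p -> "Found mkvtree at " + p)
                .orElse("mkvtree not found"));
    }
}
```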

Older versions

Looking for the previous version, Gepard 1.30? Here it is: gepard-1.30.zip