Gepard tutorial

This is a short tutorial which briefly describes all major features of the Gepard program. It applies to Gepard version 1.17 or later.

Note: The remote functionality has been removed in the final Gepard release.

Contents

1. Creating dotplots
2. Filtering and emphasizing dot matrix information
3. Tweaking dotplot parameters
4. Navigating through the dotplot and showing alignments
5. Suffix array files
6. Alignment
7. Command line mode


1. Creating dotplots

  1. Local dotplots
    • To create a local dotplot simply select two FASTA format sequence files using the "Select file" buttons located in the upper left region of the Gepard window.
    • Click "Create dotplot". The program will automatically determine the input sequence types (DNA or Protein) and assign a corresponding scoring matrix.
    • For persistent storage of the calculated suffix arrays on your hard disk, enable this option in the "Misc" tab in advanced mode. This avoids recalculating the suffix arrays.
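
This tutorial does not say how Gepard tells DNA from protein input. One common heuristic, shown here purely as an illustration, is to treat a sequence as DNA when nearly all of its residues are nucleotide letters. The function name and the 0.9 threshold are assumptions and are not taken from Gepard.

```python
def guess_sequence_type(sequence: str, threshold: float = 0.9) -> str:
    """Return "DNA" or "Protein" based on the fraction of nucleotide letters.

    Hypothetical heuristic for illustration; not Gepard's actual method.
    """
    nucleotide_letters = set("ACGTUN")
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    fraction = sum(c in nucleotide_letters for c in residues) / len(residues)
    return "DNA" if fraction >= threshold else "Protein"


print(guess_sequence_type("ACGTTGCAAGCTNNACGT"))
print(guess_sequence_type("MKVLAAGIVGLLLAQPSE"))
```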

  2. Remote dotplots
    • To have a dotplot of PEDANT contigs calculated on our server, first select the "Remote" tab at the upper left edge of the window.
    • Click "connect" to ensure connectivity to the web service and download the latest sequence lists.
    • Select two sequences to be compared OR select "[Use uploaded sequence]" from one of the organism lists to upload a local sequence for comparison.
    • Use the "Functions" tab in the advanced mode options panel to have genes of certain functions colored in the plot.
      The three base colors blend: red and green become yellow, red and blue become purple, and green and blue become cyan. If all categories overlap in one gene, it will be colored white (or grey if the funcat color strength is reduced).
    • Click "Create dotplot". The request will be submitted to the server; current progress information will be shown in a dialog window.
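
The color blending described above behaves like additive RGB mixing. The following minimal sketch illustrates that idea. The function name and the `strength` parameter (0.0 to 1.0) are assumptions for illustration, not Gepard's internal API.

```python
def blend_category_colors(red: bool, green: bool, blue: bool,
                          strength: float = 1.0) -> tuple:
    """Additively mix the three base colors into an (r, g, b) tuple (0..255).

    Each flag says whether a gene belongs to the category shown in that
    base color. `strength` is a hypothetical stand-in for the color weight.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    level = int(255 * strength)
    return (level if red else 0, level if green else 0, level if blue else 0)


print(blend_category_colors(True, True, False))      # red + green
print(blend_category_colors(True, True, True, 0.5))  # all three, reduced strength
```

With all three flags set, the channels are equal, which gives white at full strength and a grey tone at reduced strength.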

2. Filtering and emphasizing dot matrix information

Switch to advanced mode by clicking the "Advanced mode" button.
  • Select the "Display" tab in the advanced options panel.
  • Use the scrollbars to alter the visualization of the dotmatrix data in the plot.
    • Lower color limit and Upper color limit indicate the lowest and highest dotmatrix scores that will be displayed in the plot. Increasing these values will reduce noise and emphasize significant regions.
    • Greyscale start sets the actual greyscale range of the dots. Move this scrollbar to the right and each visible dot in the plot will be black.
    • Funcat color weight is only available in remote dotplot mode. It adjusts the intensity of the functional category coloring.
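
The exact mapping from dotmatrix scores to grey values is not documented here. The sketch below shows one plausible reading of the three sliders, under these assumptions: scores below the lower limit are hidden, scores at or above the upper limit are drawn black, and "greyscale start" (0.0 to 1.0) raises the minimum darkness of visible dots. All names are hypothetical.

```python
def score_to_grey(score: float, lower: float, upper: float,
                  greyscale_start: float = 0.0):
    """Map a dotmatrix score to a grey level (0 = black, 255 = white).

    Returns None for scores below the lower limit (dot not drawn).
    Assumed mapping for illustration only; not taken from Gepard.
    """
    if score < lower:
        return None
    if upper <= lower:
        intensity = 1.0
    else:
        # Linear ramp between the two limits, clamped at the upper limit.
        intensity = min(1.0, (score - lower) / (upper - lower))
    # A higher greyscale start darkens every visible dot.
    intensity = greyscale_start + (1.0 - greyscale_start) * intensity
    return round(255 * (1.0 - intensity))


print(score_to_grey(10, lower=20, upper=80))
print(score_to_grey(50, lower=20, upper=80))
print(score_to_grey(50, lower=20, upper=80, greyscale_start=1.0))
```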

3. Tweaking dotplot parameters

Switch to advanced mode by clicking the "Advanced mode" button and select the "Plot" tab in the advanced options panel.
  • Coordinates - use these values to manually define the in-sequence coordinates of both sequences
  • Zoom - by default the program will automatically zoom the dotplot to fit your window size. You can deactivate auto-zoom and enter a zooming factor manually. Note that this will affect the dotplot calculation time.
  • Small plots - this option creates dotplots of half size in auto-zoom mode. This is intended to reduce transmission times in remote mode.
  • Parameters - these parameters control the heuristics Gepard uses to find matching subsequences. Only disable auto params mode if you really need to tweak these parameters.
    • Word length - minimum word length for identical subsequences which create a hit in the dotplot
    • Window size - if word length is 0, "normal" dotplot mode is activated, in which all characters of both sequences are compared against each other.
      In this mode, the window size specifies the window over which an average dot value is calculated.
      This mode should only be used if the created dotplot is not larger than around 10000 by 10000 characters.
  • Substitution matrices - in "auto matrix" mode the program will automatically use BLOSUM62 for amino acid sequences and a standard match/mismatch matrix for nucleotide sequences. Deactivate auto-matrix mode to manually select a scoring matrix included with Gepard or choose a custom matrix.
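
To make the "normal" dotplot mode (word length 0) more concrete, here is a minimal sketch of a window-averaged dot matrix. It assumes a simple match/mismatch scoring of 1 and 0 and averages along the diagonal. The function name and scoring values are illustrative and do not reproduce Gepard's implementation.

```python
def windowed_dotplot(seq_a: str, seq_b: str, window: int = 3,
                     match: float = 1.0, mismatch: float = 0.0) -> list:
    """Return a matrix where plot[j][i] is the average score of the
    `window` character pairs starting at seq_a[i] and seq_b[j]."""
    if window < 1:
        raise ValueError("window must be at least 1")
    cols = max(len(seq_a) - window + 1, 0)
    rows = max(len(seq_b) - window + 1, 0)
    plot = []
    for j in range(rows):
        row = []
        for i in range(cols):
            total = sum(match if seq_a[i + k] == seq_b[j + k] else mismatch
                        for k in range(window))
            row.append(total / window)
        plot.append(row)
    return plot


for row in windowed_dotplot("ACGTAC", "ACGTAC", window=2):
    print(row)
```

Every cell of the matrix is computed, so the cost grows with the product of both sequence lengths. This is in line with the size recommendation given for this mode.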

4. Navigating through the dotplot and showing alignments

  • Zooming with buttons - Use the buttons below the sequence selection panels to zoom in & out and to zoom out to the full dotplot perspective.
  • Zooming with the mouse - Press your primary mouse button and drag the mouse to select an area of interest. Then click "update dotplot" to zoom into this area.
  • Clicking - in the "Misc" tab you can choose between two click actions:
    1. Showing alignments. Left-clicking will simply show the alignment at the position you clicked. Right-clicking will activate Gepard's sticky-click feature to directly move the alignment to the best diagonal hit in a range of 5 pixels around the clicking point.
    2. Looking up genes (remote mode only). Look up genes from the PEDANT databases at the specific position on the horizontal sequence (primary mouse button) or the vertical sequence (secondary mouse button).
  • Press and hold CTRL in remote mode to display gene names directly in the plot. Click the mouse while holding CTRL to copy the current gene information into the clipboard.
  • To show a reverse complementary alignment use the corresponding option in the "Display" tab in advanced mode.
  • Use the arrow keys to move the dotplot crosshair and change the current alignment. Use the keyboard keys W,A,S,D for faster navigation. When sticking to a diagonal you can also use G and H (slow movement) or J and K (fast navigation) to slide along the current diagonal, forward or reverse.
  • Image export - The display tab also contains a button for image export. This will save the current dotplot view to an image file.
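
The sticky-click behaviour can be pictured as a small neighbourhood search around the clicked pixel. The sketch below is a simplification under stated assumptions: the "best diagonal hit" is approximated by the highest-scoring cell within the given radius, and the plot is a plain list of rows. The function name and data layout are hypothetical.

```python
def sticky_click(plot: list, x: int, y: int, radius: int = 5) -> tuple:
    """Return the (x, y) position of the highest score within `radius`
    pixels of the click, or the click itself if nothing scores higher."""
    best_x, best_y = x, y
    for yy in range(max(0, y - radius), min(len(plot), y + radius + 1)):
        for xx in range(max(0, x - radius), min(len(plot[yy]), x + radius + 1)):
            if plot[yy][xx] > plot[best_y][best_x]:
                best_x, best_y = xx, yy
    return best_x, best_y


grid = [[0] * 12 for _ in range(12)]
grid[4][7] = 9      # a strong hit close to the click
grid[11][11] = 50   # a stronger hit, but outside the 5-pixel range
print(sticky_click(grid, 5, 5))
```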

5. Suffix array files

  • Stored suffix array files are automatically read by Gepard for their corresponding sequences. The program searches in the

    .gepard/

    folder in the user's home directory for a file with the filename format

    [sequencefilename]_[sequencelength].sa

  • OR in the same directory as the sequence file for a file called

    [sequencefilename].sa

    For example if you are using a sequence file called "contig5.fa" the program will try to read "contig5.fa.sa" from the same directory as well as the corresponding file including the sequence length from the ".gepard/" directory.
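
The lookup rules above can be summarised in a few lines. This sketch only builds the two candidate paths. It assumes that [sequencefilename] includes the file extension (as in the "contig5.fa.sa" example) and that the ".gepard/" location is listed first; the actual search order is not stated in this tutorial. The function name is hypothetical.

```python
from pathlib import Path


def candidate_suffix_array_files(sequence_file: str, sequence_length: int,
                                 home: str = None) -> list:
    """Return the two locations where a stored suffix array may be found."""
    sequence_path = Path(sequence_file)
    home_dir = Path(home) if home is not None else Path.home()
    name = sequence_path.name
    return [
        # <home>/.gepard/[sequencefilename]_[sequencelength].sa
        home_dir / ".gepard" / f"{name}_{sequence_length}.sa",
        # [sequencefilename].sa next to the sequence file
        sequence_path.parent / f"{name}.sa",
    ]


for path in candidate_suffix_array_files("data/contig5.fa", 120000):
    print(path, "(found)" if path.exists() else "(not found)")
```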


  • You can also manually create suffix array files:

    1. If the Vmatch package is installed on your system you can use the following command in the Gepard main directory to let the tool 'mkvtree' create the suffix array file:

      java -cp lib/gepard.jar org.krumsiek.gepard.common.GenSAFileVmatch <sequencefile> <outfile>

    2. If you cannot use Vmatch you may use Gepard's integrated suffix array creation method:

      java -Xmx512m -cp lib/gepard.jar org.krumsiek.gepard.common.GenSAFile <sequencefile> <outfile>

      The option "-Xmx512m" means that the program may use 512 megabytes of memory.
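
For readers unfamiliar with suffix arrays, the following sketch shows what such a structure contains and how it supports fast lookup of identical words. It uses a naive construction that is only suitable for short strings. It does not write Gepard's .sa file format, and it is not the algorithm used by GenSAFile or mkvtree.

```python
def build_suffix_array(text: str) -> list:
    """Return the start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_word(text: str, suffix_array: list, word: str) -> list:
    """Return all start positions of `word`, using binary search."""
    n = len(word)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= word
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > word
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


genome = "GATTACAGATTACA"
sa = build_suffix_array(genome)
print(find_word(genome, sa, "TTA"))
```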

6. Alignment

Since version 2.0 Gepard contains alignment functionality via two different algorithms: Smith-Waterman and Needleman-Wunsch.
  • To access this functionality in the GUI version, open the "Misc" tab and select either "use Smith-Waterman algorithm" or "use Needleman-Wunsch algorithm".
  • You can then, by selecting the "local dotplot click action", either display the alignment below the graph or export it as a multi-fasta file.
  • In the GUI version of Gepard the alignment is always calculated over a window of 1000. To use a custom window size, use the command line client.
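
The difference between the two algorithms is that Smith-Waterman finds the best local alignment, while Needleman-Wunsch aligns both sequences globally from end to end. The sketch below computes only the alignment score for either variant. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions for illustration; Gepard's own implementation and its substitution matrices are not reproduced here.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best local (Smith-Waterman) or global
    (Needleman-Wunsch) alignment of a and b with linear gap costs."""
    # First row: zeros for local, accumulated gap costs for global.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("TTTACGTTTT", "GGACGGG", local=True))
print(alignment_score("TTTACGTTTT", "GGACGGG", local=False))
```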

7. Command line mode

Since version 1.20 Gepard contains a command line dotplot mode (for offline plotting only).
  • Use the command java -cp Gepard-1.40.jar org.gepard.client.cmdline.CommandLine to start the Gepard command line tool. If needed, adjust the path to the jar file according to your installation.
  • Start without any further arguments to get a detailed list of all arguments and their function.
  • Important note: The command line tool always needs a substitution matrix file, even when running in suffix array mode (word > 0, window = 0).

Examples

Here are some examples of command line dotplot calls which create plots of Escherichia coli versus Shigella flexneri. At the time this text was last edited, the genome files could be retrieved from the NCBI FTP:

E.coli K12: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.fna
S.flexneri 2a: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Shigella_flexneri_2a/NC_004337.fna

The following examples use the Linux startup script gepardcmd.sh. Windows users just replace this command by gepardcmd.bat. It is assumed that you start the script from the Gepard main directory.
  • Create a dotplot using the EDNA (standard DNA) matrix and write the result to plot.png:

    gepardcmd.sh -seq NC_000913.fna NC_004337.fna -matrix matrices/edna.mat -outfile plot.png

  • Same plot as above but with tweaked display parameters:

    gepardcmd.sh -seq NC_000913.fna NC_004337.fna -matrix matrices/edna.mat -outfile plot.png -lower 50

  • Create a larger plot and use partial sequences (coordinates in E.coli specified as absolute values, coordinates in S.flexneri specified as relative coordinates):

    ./gepardcmd.sh -seq NC_000913.fna NC_004337.fna -matrix matrices/edna.mat -outfile plot.png -lower 50 -maxwidth 1500 -maxheight 1500 -from1 1760000 -to1 1850000 -from2 32% -to2 34%
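
    The difference between absolute and relative coordinates can be sketched as follows. This is only an illustration of the idea: the function name is hypothetical, and the rounding of percentage values is an assumption, since the tutorial does not state how Gepard converts them.

```python
def resolve_coordinate(value: str, sequence_length: int) -> int:
    """Convert a coordinate argument such as "1760000" or "32%"
    into an absolute position within the sequence."""
    if value.endswith("%"):
        percent = float(value[:-1])
        if not 0.0 <= percent <= 100.0:
            raise ValueError("percentage must be between 0 and 100")
        return round(percent / 100.0 * sequence_length)
    return int(value)


# Hypothetical sequence length, used only for this example.
length = 1_000_000
print(resolve_coordinate("1760000", length))
print(resolve_coordinate("32%", length), resolve_coordinate("34%", length))
```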

  • Precalculate suffix array and use precalculated file in dotplot:

    java -Xmx512m -cp lib/gepard.jar org.krumsiek.gepard.common.GenSAFile NC_000913.fna NC_000913.fna.sa
    gepardcmd.sh -seq NC_000913.fna NC_004337.fna -matrix matrices/edna.mat -outfile plot.png -safile NC_000913.fna.sa


  • Gepard: http://mips.gsf.de/services/analysis/gepard

    Last change: Julius Krämer - Aug 8 2022