ConDens


  1. Installation/Program Structure
  2. ConDens Predictor
  3. ConDens Browser
  4. Structure of Alignment Input
  5. Structure of Data Output
  6. Regular Expressions in ConDens
  7. Modifying Default Program Settings
The ConDens Predictor (or simply "ConDens") is a tool that implements the ConDens algorithm (see paper), which predicts functional conservation of short linear motifs. The program can be run with a graphical user interface using the command java -jar ConDens.jar (or by double-clicking ConDens.jar in Windows). It can also be run in command-line mode by typing java -jar ConDens.jar [insert input file] in a shell. See the instructions on running the program from the command line below.

There are 5 types of inputs that can be entered into the program:

  1. Protein input


    Figure 1: Schematic of the ConDens program's user interface.

    This is a list of proteins to be tested by the ConDens algorithm. If the program is downloaded with datasets, preset proteomes can be chosen from a combo box (Figure 1XXX). To analyze a proteome or list of proteins that is not among the presets, the user can instead enter the file path of a custom protein list. This file is expected to be a simple one-column file (with header) in which each row contains the name of one protein.

    Example of a Custom Protein Input File

    Gene
    CDH1
    CDH6
    ORC2
    ORC6
    ...

  2. Multiple Sequence Alignment Input

    This is a set of multiple sequence alignments that the ConDens algorithm uses to provide evolutionary information for the protein input. If the program is downloaded with datasets, preset alignment sets can be chosen from a combo box (Figure 1XXX). To run an analysis on a custom alignment set, the user must provide an alignment mapping file (see Structure of Alignment Input).

  3. Motifs

    This is a set of motifs to be tested by the ConDens algorithm. Motifs can be entered manually into the table of motifs in Figure 1XXX or chosen directly from the motif library in Figure 1XXX (select the motifs of interest and click the arrow button at the bottom, or press Ctrl + Shift + C). Each motif on this list must have a non-redundant name and an appropriate regular expression (not just any regular expression is accepted; see Regular Expressions in ConDens).

  4. Output file path

    This is the directory where data output is stored. It is generally best to keep data generated using different proteomes and alignments in separate folders. The structure of the output files is discussed in Structure of Data Output.

  5. Validation Data Input

    This is an optional input that allows users to provide annotations on specific coordinates on a protein (i.e. whether or not a motif is a known target). The input file is expected to be a 3-column tab-delimited file (with header) where the first column is the name of the protein, the second column is the target residue position, and the third column is the annotation label.

    Example of a Validation File

    Gene  Position  Label
    CDH1  335  positive
    CDC6  56  positive
    CDC6  75  unknown
    CDC6  189  negative
    ...

    There are 3 accepted labels: "positive" (known target), "negative" (known non-target), and "unknown" (no information; the default label). Anything else is ignored. When ConDens generates protein-level data output, it assigns a label to each protein using the following rules:
    1. If any coordinate on the protein has a label of "positive" in the validation file, the protein is given a label of "positive".
    2. Otherwise, if all motifs on the protein map to coordinates with a "negative" label, the protein is given a label of "negative".
    3. In all other cases, the protein is given a label of "unknown".

    No preset validation data are provided, so users must supply their own if they need it.
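The validation file format and the three protein-level labeling rules can be sketched in Python as follows. This is a minimal illustration, not ConDens's own code. The function names read_validation and protein_label are hypothetical. The sketch assumes that labels are matched case-sensitively, that rows with an unrecognized label are skipped, and that a protein with no motif matches stays "unknown".

```python
import csv
import io

ACCEPTED_LABELS = {"positive", "negative", "unknown"}


def read_validation(handle):
    """Parse a 3-column tab-delimited validation file (with header).

    Returns {protein: {position: label}}. Rows with fewer than three
    columns or with a label outside ACCEPTED_LABELS are skipped
    (an assumption about what "ignored" means).
    """
    annotations = {}
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue
        protein, position, label = row[0], row[1], row[2].strip()
        if label not in ACCEPTED_LABELS:
            continue
        annotations.setdefault(protein, {})[int(position)] = label
    return annotations


def protein_label(motif_positions, site_labels):
    """Apply the three protein-level rules.

    motif_positions: target residue positions of motifs found on the protein.
    site_labels: {position: label} for this protein from the validation file.
    """
    # Rule 1: any "positive" coordinate makes the protein "positive".
    if "positive" in site_labels.values():
        return "positive"
    # Rule 2: every motif maps to a "negative" coordinate.
    # Assumption: a protein without any motif is not called "negative".
    if motif_positions and all(
        site_labels.get(pos, "unknown") == "negative" for pos in motif_positions
    ):
        return "negative"
    # Rule 3: everything else.
    return "unknown"


example = (
    "Gene\tPosition\tLabel\n"
    "CDH1\t335\tpositive\n"
    "CDC6\t56\tpositive\n"
    "CDC6\t75\tunknown\n"
    "CDC6\t189\tnegative\n"
)
annotations = read_validation(io.StringIO(example))
```

With the example file above, CDC6 is labeled "positive" because position 56 is positive, regardless of the negative annotation at position 189.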

Running in Command-Line

As mentioned above, the program can be run with the shell command java -jar ConDens.jar [insert input file]. In this case the input file is an XML file that specifies the inputs described above and has the following structure:

    Example of a Command-Line Input XML File

    <!-- Root node -->
    <settings>

    <!-- Protein input. "option" denotes the index of the available preset and "path" denotes the file path to the custom protein input file. Normally only one of the two attributes needs to be defined; if both are defined, "option" takes precedence, i.e. the program assumes the user wants the preset proteome at the specified index. -->
    <proteins option="0" path="" />

    <!-- Alignment input. Same idea as above. -->
    <msa option="0" path="" />

    <!-- Validation input. Same idea as above, except that "option" is not used because there are no preset validation data. If validation data are not used, the user can set "path" to "", leave it undefined, or omit the entire element. -->
    <validation path="" />

    <!-- Path to destination output folder. Same idea as validation input, but "path" must be defined. -->
    <output path="output" />

    <!-- A list of motifs with consensus sequences as regular expressions. -->
    <consensuses>

    <!-- A motif. "name" denotes the name of the motif and "regex" denotes its regular expression -->
    <consensus name="Cdk" regex="(?<r>[ST])P" />
    <consensus name="Mec1" regex="(?<r>[ST])Q" />
    <consensus name="Prk1" regex="[LIVM]XXXX(?<r>T)G" />
    <consensus name="Ipl1" regex="[RK]X(?<r>[ST])[LIV]" />
    <consensus name="PKA" regex="R[RK]X(?<r>S)" />
    <consensus name="CKII" regex="(?<r>[ST])[DE]X[DE]" />
    <consensus name="Ime2" regex="RPX(?<r>[ST])" />

    </consensuses>

    </settings>
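An input file with this structure can also be generated programmatically. The following Python sketch builds the same element tree with the standard library's xml.etree.ElementTree. The function name build_settings and its parameters are hypothetical and not part of ConDens. Note that a standard XML serializer writes the "<" and ">" of a named group such as (?<r>...) as &lt; and &gt; inside attribute values; whether ConDens expects the escaped or the literal form is not stated here, so treat that as an assumption to verify.

```python
import xml.etree.ElementTree as ET


def build_settings(output_path, motifs, proteins_option="0", msa_option="0",
                   validation_path=None):
    """Build a <settings> tree from (name, regex) motif pairs.

    Uses preset indices for the protein and alignment inputs and omits
    the <validation> element when no validation file is given.
    """
    names = [name for name, _ in motifs]
    if len(names) != len(set(names)):
        # Motif names must be non-redundant.
        raise ValueError("duplicate motif name")

    settings = ET.Element("settings")
    ET.SubElement(settings, "proteins", option=proteins_option, path="")
    ET.SubElement(settings, "msa", option=msa_option, path="")
    if validation_path:
        ET.SubElement(settings, "validation", path=validation_path)
    ET.SubElement(settings, "output", path=output_path)

    consensuses = ET.SubElement(settings, "consensuses")
    for name, regex in motifs:
        ET.SubElement(consensuses, "consensus", name=name, regex=regex)
    return settings


motifs = [("Cdk", "(?<r>[ST])P"), ("Mec1", "(?<r>[ST])Q")]
settings = build_settings("output", motifs)
xml_text = ET.tostring(settings, encoding="unicode")
```

Writing xml_text to a file (for example a hypothetical input.xml) gives a file that can then be passed to java -jar ConDens.jar as the input file argument.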
