Yeast phylo HMM results



This webpage supports the following two papers. Please cite the relevant paper if our results are useful for your studies.

The main page is for this paper:

A. N. Nguyen Ba, B. J. Yeh, D. van Dyk, A. R. Davidson, B. J. Andrews, E. L. Weiss, A. M. Moses, Proteome-wide discovery of evolutionary conserved sequences in disordered regions. Sci. Signal. (2012) 5, rs1.

Its PubMed entry can be found here. ** New data from YGOB will soon be incorporated. **

The data and software for:

A. N. Nguyen Ba, B. Strome, J. J. Hua, J. Desmond, I. Gagnon-Arsenault, E. L. Weiss, C. R. Landry, A. M. Moses, Detecting functional divergence after gene duplication through evolutionary changes in posttranslational regulatory sequences. PLoS Comput. Biol. (2014)
can also be found on the data page.

Click here to go back to the result page.

Click here to download the software, the data, and more.


Website related

What genes have my motif?

Use the regular expression format to search for matches to a particular pattern (for example, [ST]P is the pattern for a proline-directed phosphorylation site). We do not support PSSMs yet.
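
As a minimal sketch of what such a search does, the following Python snippet scans protein sequences for a regular expression and reports 1-based coordinates. The gene names and sequences are invented placeholders, and `find_motif` is a hypothetical helper, not part of the website or of the phylo-HMM software.

```python
import re

def find_motif(pattern, sequences):
    """Return {gene: [(start, end, match), ...]} using 1-based inclusive coordinates."""
    regex = re.compile(pattern)
    hits = {}
    for gene, seq in sequences.items():
        # finditer reports non-overlapping matches, scanning left to right
        found = [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(seq)]
        if found:
            hits[gene] = found
    return hits

# Toy sequences, invented for illustration only.
toy = {
    "geneA": "MKSPLLTPAA",
    "geneB": "MKAAGGLLVV",
}
print(find_motif(r"[ST]P", toy))
# {'geneA': [(3, 4, 'SP'), (7, 8, 'TP')]}
```

Genes without a match (here `geneB`) are simply left out of the result.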


Why am I only seeing 40 motifs?

For display reasons, we show only 40 motifs at a time. The whole dataset has been released.


Why are there some motifs in gaps?

The phylo-HMM makes no distinction as to whether a motif is present in a particular species. For convenience, the 'start' and 'stop' displayed are the coordinates in S. cerevisiae; however, the region may be a gap within the actual alignment.
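
One way to check whether a region falls in a gap for a given species is to count gap characters over the corresponding alignment columns. The sketch below assumes that you have the alignment as a dictionary of equal-length rows and that you already know the alignment columns of the region; the function name, the toy alignment, and the use of '-' as the gap character are assumptions made for illustration.

```python
def gap_fraction_by_species(alignment, col_start, col_end):
    """Fraction of gap characters per species over alignment columns
    col_start..col_end (1-based, inclusive)."""
    if col_start < 1 or col_end < col_start:
        raise ValueError("expected 1 <= col_start <= col_end")
    fractions = {}
    for species, row in alignment.items():
        if col_end > len(row):
            raise ValueError(f"column range exceeds row length for {species}")
        segment = row[col_start - 1:col_end]
        fractions[species] = segment.count("-") / len(segment)
    return fractions

# Toy alignment, invented for illustration only.
toy_alignment = {
    "S_cerevisiae": "MK--SPLL",
    "S_paradoxus":  "MKAASPLL",
    "C_glabrata":   "MKAA--LL",
}
print(gap_fraction_by_species(toy_alignment, 5, 6))
# {'S_cerevisiae': 0.0, 'S_paradoxus': 0.0, 'C_glabrata': 1.0}
```

A fraction of 1.0 means the region is entirely a gap in that species.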


How do I visualize the filtered regions?

Clicking the gene ID should allow you to visualize the protein sequences, the predicted regions (in red), and the filtered regions (in grey).


How can I see the clusters?

All the cluster information can be obtained from the data page. We also hope to include this information on the website in a user-friendly manner.


Where can I download the whole dataset?

The whole dataset can be obtained from the data page.


Which version of the phylo-HMM was used for the data on this website?

The data and cluster analysis available on this website were obtained using version 1.0 of the phylo-HMM. We have found that version 2.0 does not yield significant differences on this dataset. Version 2.0 is strongly recommended for other datasets.


Technical questions

What is the difference between version 1.0 and version 2.0 of the phylo-HMM software?

The second version of the phylo-HMM includes several changes that dramatically improve the speed of the software. It also implements gamma-distributed rates of evolution, which we found to greatly improve the prediction of large conserved regions.


Is it possible to see the posterior traces?

There are no plans to make these available, as the posterior traces for each protein were not kept.


Can I use the phylo-HMM on other datasets?

A working standalone version of the phylo-HMM has been tested on Ensembl human and insect data, as well as on budding yeasts with Candida. This software has been released on the data page. We will shortly release its predictions on other datasets.


How was the phylo-HMM tested?

The phylo-HMM was tested as-is on simulated data, and on biological data.

Furthermore, its results were also tested independently with PAML. When all the predicted sites were concatenated and run through PAML, it found a tree length of 1.30272. When run on the flanking regions of each site, the tree length was 6.04174.
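
To put the two tree lengths side by side, the short calculation below takes their ratio. It assumes that both PAML runs used the same species and tree topology, so that the tree lengths are directly comparable; this page does not state that explicitly.

```python
# Tree lengths reported above (from PAML).
tree_length_predicted = 1.30272  # concatenated predicted sites
tree_length_flanking = 6.04174   # flanking regions of each site

ratio = tree_length_flanking / tree_length_predicted
print(f"Flanking regions: about {ratio:.1f}x the tree length of predicted sites")
# Flanking regions: about 4.6x the tree length of predicted sites
```

Under that assumption, the predicted sites accumulate substitutions roughly 4.6 times more slowly than the sequence around them.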


Why is there no web service for the phylo-HMM?

Prediction using the phylo-HMM is computationally intensive. It is currently not possible for us to provide a web service that can filter user-submitted multiple sequence alignments and run predictions on them. For individual sequence alignments, it is best for users to run the prediction of these short conserved sequences themselves.

Biological questions

Why did the phylo-HMM predict a site within a protein that does not look to be conserved?

The phylo-HMM simply calculates whether or not the region has a lower substitution rate than the background. A very fast background rate of evolution may yield unexpected results. In our benchmarks, the background rate of evolution was the average rate of evolution. If your background rate of evolution is much higher, the phylo-HMM may not return the results you expect.

It should be noted that the phylo-HMM yields an error rate of 1 false positive in ~9000 unstructured amino acids under its optimized background rate of evolution and under 'perfect' circumstances. Mistakes are seen to occur in regions with multiple gaps or alignment artifacts. I hope to improve the gap model in a future version.
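
As a rough guide to what this error rate means for a single protein, the sketch below scales the figure of 1 false positive per ~9000 unstructured amino acids to a region of a given length. It assumes that false positives occur uniformly along disordered sequence, which is a simplification, and it applies only under the optimized background rate described above. The function name and the example length are illustrative.

```python
RESIDUES_PER_FALSE_POSITIVE = 9000  # approximate figure quoted above

def expected_false_positives(n_disordered_residues):
    """Expected number of false positive sites, assuming a uniform error rate."""
    return n_disordered_residues / RESIDUES_PER_FALSE_POSITIVE

# A hypothetical protein with 450 disordered residues.
print(expected_false_positives(450))
# 0.05
```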


Why did the phylo-HMM not predict a site within a protein when that site is conserved?

These rather frustrating cases usually arise for one of the following reasons:

I hope to fix the first two cases in the next releases by having a more complex gap model and a more flexible window size. Missed predictions when multiple motifs are present have been seen to occur, and no benchmarks on these issues were performed. We assume a uniform distribution of motifs across the protein; however, this is not realistic. I hope to address this issue in a following release.


What is the function of a predicted site within my protein?

The phylo-HMM makes no prediction on the function of a conserved sequence. We have shown both that functional sequences are usually predicted, and that predictions are usually functional. However, in many cases, associating a motif with a function requires supplementary biological information.


What is the pattern (or regular expression) of my predicted site?

In our paper, we used the conserved sequences to perform clustering by sequence identity. However, this may not be the best approach for individual cases, as unbiased clustering may group many overlapping conserved sequences. We recommend copy-pasting the predicted region (expand the predicted motif first) into WebLogo and using the conservation as a means to identify a putative pattern. Then, selectively removing constraints on these columns should help find other closely resembling sequences.
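
The idea of deriving a pattern and then relaxing individual columns can be sketched in a few lines of Python. This is a simple programmatic stand-in for inspecting a WebLogo by eye, not the clustering method used in our paper. It assumes ungapped instances of equal length; the function name and the example sequences are invented for illustration.

```python
import re

def pattern_from_instances(instances, relax_columns=()):
    """Build a regular expression from equal-length, ungapped motif instances.

    Each column becomes the residue (or bracketed set of residues) seen there.
    Columns listed in relax_columns (0-based) become '.', i.e. any residue.
    """
    lengths = {len(s) for s in instances}
    if len(lengths) != 1:
        raise ValueError("instances must be non-empty and of equal length")
    parts = []
    for i, column in enumerate(zip(*instances)):
        residues = sorted(set(column))
        if i in relax_columns:
            parts.append(".")
        elif len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)

# Invented instances of a proline-directed site, for illustration only.
instances = ["SPKK", "TPRK", "SPAK"]
strict = pattern_from_instances(instances)
relaxed = pattern_from_instances(instances, relax_columns={2})
print(strict)   # [ST]P[AKR]K
print(relaxed)  # [ST]P.K
print(bool(re.fullmatch(relaxed, "TPEK")))  # True
```

The relaxed pattern accepts `TPEK`, which the strict pattern rejects because E was never observed in the third column. The resulting expression can be used directly in the motif search on this website.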