"Nothing in Biology makes sense except in the light of evolution". In agreement with Theodosius Dobzhansky's famous quote, a focal point of research in our lab is the study of protein and gene evolution.
In general we are interested in all aspects of molecular evolution, but our activity is concentrated on protein domains and the evolution of the protein repertoire. We are actively working on horizontal gene transfer and on the consistency of the Darwinian Tree of Life with the observed phyletic distribution of protein-domain architectures.
A fundamental resource for studying the molecular evolution of organisms using genome sequence data is a reference species tree of life. We have produced a fully-resolved sequenced/species tree of life (sTOL) for all completely sequenced genomes. The sTOL is maintained by an automated pipeline and updated on a daily basis to keep pace with the rapidly increasing number of cellular organisms being sequenced. The sTOL is built with a likelihood-based weight calibration algorithm that integrates established taxonomic information with molecular character data (in the form of SCOP structural superfamilies, domain families, supra-domains and full-length domain architectures).
We have also reconstructed the protein repertoire of all ancestral eukaryotes based on parsimony using the sTOL species tree, which forms the basis of several studies such as the evolution of calcium signalling and the molecular basis for the evolution of cell types (which ties in with cellular expression below).
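As an illustration of parsimony-based ancestral reconstruction, the sketch below applies Fitch's small-parsimony algorithm to the presence (1) or absence (0) of a single domain on a toy tree. The tree, the taxon names and the choice of Fitch parsimony are assumptions made for the example; the text above does not say which parsimony variant is used with the sTOL.

```python
from typing import Dict, Tuple

# Hypothetical toy tree: internal node -> (left child, right child).
TREE: Dict[str, Tuple[str, str]] = {
    "root": ("anc1", "anc2"),
    "anc1": ("human", "mouse"),
    "anc2": ("yeast", "plant"),
}


def fitch(tree, leaf_states, root="root"):
    """Return (state assigned to every node, minimum number of state changes)."""
    candidate_sets = {}
    changes = 0

    def bottom_up(node):
        # Post-order pass: intersect the children's sets, or take the union
        # and count one change when the intersection is empty.
        nonlocal changes
        if node not in tree:
            candidate_sets[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        bottom_up(left)
        bottom_up(right)
        shared = candidate_sets[left] & candidate_sets[right]
        if shared:
            candidate_sets[node] = shared
        else:
            candidate_sets[node] = candidate_sets[left] | candidate_sets[right]
            changes += 1

    states = {}

    def top_down(node, parent_state):
        # Pre-order pass: keep the parent's state when allowed; otherwise
        # break ties deterministically in favour of absence (0).
        options = candidate_sets[node]
        states[node] = parent_state if parent_state in options else min(options)
        for child in tree.get(node, ()):
            top_down(child, states[node])

    bottom_up(root)
    top_down(root, None)
    return states, changes


# Hypothetical observation: the domain is missing only in yeast.
presence = {"human": 1, "mouse": 1, "yeast": 0, "plant": 1}
states, changes = fitch(TREE, presence)
```

Here the most parsimonious explanation is a single loss on the yeast lineage, so the domain is inferred to be present in every ancestor. Repeating the reconstruction for every domain or architecture would yield an inferred repertoire for each ancestral node.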
Protein Structure and Disorder
Much of our work, even when carried out at the sequence level (below), is informed by an appreciation of the power of 3D structure. The capability of proteins to perform a vast array of functions inside living cells depends critically on their structure. Following the paradigm of Sequence -> Structure -> Function, the study of protein structure in our lab takes advantage of our SUPERFAMILY library of HMMs to detect SCOP structural domains in genomes, and to predict their functions using a domain-centric Gene Ontology resource (dcGO).
The domain-centric GO (dcGO) database provides associations between ontology terms and protein domains/supra-domains. The ontologies include Gene Ontology, enzymes, pathways, phenotypes in major model organisms (such as mouse, worm, yeast, fly, zebrafish, Xenopus and Arabidopsis), human phenotypes, diseases and drugs.
As well as structured domains, the coverage of disordered regions predicted by a group of predictors from the D2P2 resource is analysed in genomes and presented in a complementary fashion alongside the structured domains. Intrinsically unstructured or disordered proteins (IDPs) are proteins that exist in a naturally unfolded state and lack stable tertiary structure in vivo. IDPs exist as highly flexible polypeptide chains behaving as an ensemble of conformational states. These proteins and protein regions have evolved the specific property of disorder and have different amino acid propensities compared with compact, well-folded domains.
The introduction of high-throughput sequencing techniques into mainstream research in Molecular Biology has led to an explosion in the number of available amino acid, RNA and DNA sequences. The comparative study of these sequences is central to understanding not only the higher levels of organization of proteins and nucleic acids, but also their involvement in evolution.
The group has a strong track record of developing and applying methods for protein sequence analysis, particularly involving hidden Markov models, e.g. for genome analysis, profile-profile comparison and family sub-classification. With the advent of next-generation sequencing methods, we increasingly need to work with nucleotide sequences, which requires new algorithms, methods and tools that use recent advances in distributed and high-performance computing to cope with the increasingly rapid generation of data.
In particular we are working on assembly-free genome and meta-genome analysis of proteins from DNA and RNA sequences. Upstream we are also working on techniques for the processing of raw reads from next-generation techniques (including RADseq for population genetics, RNAseq and other types of data). The increasing amount of data available for individuals within a species has increased the importance of the analysis of mutation/variation data. We have developed and continue to work on predicting the functional consequences of genetic variation, using hidden Markov models representing the alignment of homologous sequences and conserved protein domains. This work includes understanding the molecular mechanisms of human disease and cancer, as well as phenotype prediction.
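The idea of scoring a variant against an alignment of homologous sequences can be sketched with a single alignment column. The example below is a deliberate simplification and not the group's method: it treats one column as one match state of a profile HMM, estimates emission probabilities with a pseudocount, and reports the log-odds of the mutant residue against the wild-type residue. The column contents, the pseudocount value and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probs(column, pseudocount=1.0):
    """Estimate per-residue emission probabilities for one alignment column."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)  # skips gaps
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def variant_log_odds(column, wild_type, mutant, pseudocount=1.0):
    """log2 of P(mutant) / P(wild type); negative means less expected."""
    probs = emission_probs(column, pseudocount)
    return math.log2(probs[mutant] / probs[wild_type])


# Hypothetical column: residues observed at one position across homologues.
column = "LLLLLLIV"
score_conservative = variant_log_odds(column, "L", "I")  # seen in homologues
score_radical = variant_log_odds(column, "L", "D")       # never observed
```

A substitution to a residue never seen at a conserved position receives a more negative score than one that homologues already tolerate. A full profile HMM would additionally model insertions, deletions and sequence weighting, none of which are shown here.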
The phenotype of living cells is characterized by tremendous versatility and plasticity, which unfolds at various spatial and temporal scales. At the root of this remarkable property of life lies a complex regulatory network that modulates the flow of information between the gene complement of the cell and the wide variety of proteins these genes encode.
One avenue of research is understanding how gene expression profiles underpin cellular identity. It is known that by manipulating the regulatory network of a cell it is possible to change its identity. This was first demonstrated in an experiment by Gurdon, in which the nucleus of a fully differentiated cell was transferred into an enucleated oocyte. As a result, the fully differentiated cell returned to a stem-cell-like state. Some years later Yamanaka identified a set of transcription factors that could be introduced into a fully differentiated cell to produce the same transition to a stem-cell-like state. These two discoveries earned the two men the Nobel Prize in Physiology or Medicine in 2012 and kick-started the field of cellular reprogramming. Our research is now focused on identifying sets of transcription factors that can be introduced into a chosen cell type to induce a targeted change in cellular state. This work combines high-throughput digital gene expression data with regulatory interaction data in order to identify likely candidates using a novel algorithm called Mogrify.
Another line of research in our lab focuses on the development of statistical methods for modelling high-dimensional, high-volume digital gene expression data from various regions of the primate brain. These data are generated by a class of next-generation sequencing techniques, such as CAGE and RNA-seq, which collectively provide a revolutionary tool for studying gene expression. We investigate the applicability of parametric and non-parametric hierarchical Bayesian approaches and associated Markov chain Monte Carlo algorithms for modelling these data and for making inferences on differential gene expression.
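To make the Markov chain Monte Carlo idea concrete, the following is a minimal sketch of a random-walk Metropolis sampler for differential expression of a single gene. It is far simpler than the hierarchical models described above and is not the lab's method: counts in condition A are assumed Poisson with mean exp(log_mu), counts in condition B Poisson with mean exp(log_mu + delta), with weak normal priors on both parameters. The counts, priors, step size and function names are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(log_mu, delta, counts_a, counts_b):
    """Unnormalised log posterior (constant log(y!) terms are dropped)."""
    log_lik_a = sum(y * log_mu - math.exp(log_mu) for y in counts_a)
    log_lik_b = sum(
        y * (log_mu + delta) - math.exp(log_mu + delta) for y in counts_b
    )
    # Assumed priors: log_mu ~ N(0, 10^2), delta ~ N(0, 5^2).
    log_prior = -0.5 * (log_mu / 10.0) ** 2 - 0.5 * (delta / 5.0) ** 2
    return log_lik_a + log_lik_b + log_prior


def sample_delta(counts_a, counts_b, n_iter=20000, burn_in=5000,
                 step=0.15, seed=0):
    """Return posterior draws of delta, the log fold change of B over A."""
    rng = random.Random(seed)
    log_mu = math.log(sum(counts_a) / len(counts_a) + 0.5)  # rough start
    delta = 0.0
    current = log_posterior(log_mu, delta, counts_a, counts_b)
    draws = []
    for i in range(n_iter):
        # Symmetric Gaussian proposal on both parameters.
        prop_mu = log_mu + rng.gauss(0.0, step)
        prop_delta = delta + rng.gauss(0.0, step)
        proposed = log_posterior(prop_mu, prop_delta, counts_a, counts_b)
        # Metropolis acceptance rule.
        if rng.random() < math.exp(min(0.0, proposed - current)):
            log_mu, delta, current = prop_mu, prop_delta, proposed
        if i >= burn_in:
            draws.append(delta)
    return draws


# Hypothetical read counts for one gene in two brain regions.
counts_a = [10, 12, 9, 11]
counts_b = [20, 22, 19, 21]
draws = sample_delta(counts_a, counts_b)
posterior_mean = sum(draws) / len(draws)
prob_up = sum(d > 0 for d in draws) / len(draws)
```

The posterior mean of delta estimates the log fold change, and prob_up approximates the posterior probability that the gene is more highly expressed in condition B. Real count data are typically overdispersed, which is one reason hierarchical models that share information across genes are preferred in practice.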