Generate SCOP Domain Assignments using the SUPERFAMILY Models
This page describes how to produce SCOP protein domain assignments
using the SUPERFAMILY hidden Markov models (HMMs) and associated scripts.
Introduction
The process involves scoring a set of FASTA-formatted sequences
against the models using the provided scripts. The result is a set of domain
assignments at the SCOP superfamily and family levels.
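This page is divided into three main sections: setting up the models and scripts, using the scripts to produce domain assignments, and the domain assignment output formats.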
Setting up the models and scripts is a multi-step process.
There may be issues for some combinations of machines and operating systems.
If you have read this document and the relevant sections of the HMMER3
documentation and are still having a problem, please contact us by email at
superfamily@mrc-lmb.cam.ac.uk or through the feedback form.
Alternatively, we can produce domain assignments for your
sequences. All we require is a set of protein sequences in FASTA format.
1: Set up models and scripts
The scripts are written in Perl. Any recent version of Perl
should work. Around 500 MB of hard disk space will be required. We assume you are
using a Linux/Unix environment.
1.1
Register for a SUPERFAMILY license (free for academic and commercial use).
Download the SUPERFAMILY models and scripts:
wget --http-user USERNAME --http-password PASSWORD -r -np -nd -e robots=off \
-R 'index.html*' 'http://supfam.org/SUPERFAMILY/downloads/license/supfam-local-1.75/'
Please use the username and password you receive after registering for a license.
If wget is unavailable on your system, the required files can be downloaded individually. You will require all files in the models and scripts directories, as well as sequences/pdbj95d.gz.
1.2
The hmmscan program from the HMMER3 software package is recommended [12364612]
for scoring sequences against the SUPERFAMILY models.
Download HMMER3 and follow the installation instructions that come with it.
The scripts for running the models require the hmmscan program to be in a directory listed in your PATH environment variable.
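Before going further, it may help to confirm that the shell can actually find hmmscan. The following is a minimal sketch, not part of the SUPERFAMILY distribution: `check_tool` is a hypothetical helper name, and the install path in the comment is a placeholder rather than a real default.

```shell
# Report whether a required program can be found on PATH.
# Prints the resolved location on success; returns non-zero otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 is not on PATH" >&2
        return 1
    fi
}

# If HMMER3 was installed in a non-standard location, add its directory
# to PATH first (placeholder path, adjust to your own installation):
#   export PATH="/path/to/hmmer/bin:$PATH"
```

Running `check_tool hmmscan` (and, for step 1.4, `check_tool hmmpress`) should print the location of each program if the HMMER3 installation is visible to the shell.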
1.3
Download the SCOP 1.75 dir.des.scop.txt and dir.cla.scop.txt files:
wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.75
wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.75
mv dir.des.scop.txt_1.75 dir.des.scop.txt
mv dir.cla.scop.txt_1.75 dir.cla.scop.txt
These files are required for the family-level classification [16877569].
1.4
Set up the infrastructure required by the scripts:
gunzip pdbj95d.gz
gunzip model.tab.gz
gunzip hmmlib_1.75.gz
mv hmmlib_1.75 hmmlib
gunzip self_hits.tab.gz
mkdir scratch
chmod u+x *.pl
hmmpress hmmlib
N.B. you must run hmmpress on the hmmlib file before it can be used with HMMER3.
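One way to confirm that hmmpress completed is to look for the four binary index files it writes alongside the library (with the suffixes .h3f, .h3i, .h3m and .h3p). The sketch below assumes the library is named hmmlib as in the commands above; `check_pressed` is a hypothetical helper name.

```shell
# Check that the four index files written by hmmpress exist, and are
# non-empty, next to the given HMM library file.
check_pressed() {
    lib="$1"
    rc=0
    for ext in h3f h3i h3m h3p; do
        if [ ! -s "$lib.$ext" ]; then
            echo "missing or empty: $lib.$ext" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_pressed hmmlib` in the directory where hmmpress was run should return success without printing anything; any file it reports suggests hmmpress needs to be rerun.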
2: Use scripts to produce domain assignments
Run superfamily.pl to produce the domain assignments:
# Simple
./superfamily.pl human.fa
N.B. The scripts must either all be in the working directory (with '.' in your PATH) or all be in a directory that is listed in your PATH.
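If the scripts live somewhere other than the working directory, their directory can be added to PATH for the current shell session. This is a minimal sketch; `add_to_path` is a hypothetical helper and the directory in the usage note is a placeholder.

```shell
# Prepend a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                            # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}
```

For example, `add_to_path /path/to/supfam-scripts` followed by `superfamily.pl human.fa` should then work from any directory, assuming the scripts were made executable in step 1.4.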
3: Domain assignment output formats
Output is a tab-delimited file of domains, one domain per line.
There can be more than one domain per sequence, and there may be sequences for which there
is no domain assignment.
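To find the sequences that received no assignment, the IDs in the input FASTA file can be compared with the first column of the assignment file. The sketch below assumes that the first column holds the sequence ID (as in the 'ass' column list on this page) and that this ID is the first word after '>' on each FASTA header line; `unassigned_ids` and the file names in the usage note are hypothetical.

```shell
# Print IDs that appear in a FASTA file but not in column 1 of a
# tab-delimited assignment file.
unassigned_ids() {
    fasta="$1"
    ass="$2"
    awk -F'\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # assignment rows
        /^>/ {
            id = substr($0, 2)                       # drop the leading ">"
            sub(/[ \t].*/, "", id)                   # keep only the first word
            if (!(id in seen)) print id
        }
    ' "$ass" "$fasta"
}
```

A call such as `unassigned_ids human.fa assignments.tab` would list one unassigned ID per line, where assignments.tab stands for wherever the assignment output was saved.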
The columns of the computer-readable 'ass' file output from ass3.pl (the default) are:
Sequence ID
SUPERFAMILY model ID
Match region
E-value score
Model match start position
Alignment to model
Family E-value
SCOP Family ID
SCOP domain ID of closest structure (px value)
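Because the file is tab-delimited with one domain per line, it can be post-processed with standard tools. The following sketch assumes the column order listed above (sequence ID in column 1, superfamily E-value in column 4); the function names, the file name and the cutoff in the usage note are illustrative choices, not values recommended by SUPERFAMILY.

```shell
# Keep rows whose superfamily E-value (column 4) is at or below a cutoff.
filter_by_evalue() {
    awk -F'\t' -v max="$2" '($4 + 0) <= (max + 0)' "$1"
}

# Count assigned domains per sequence ID (column 1), sorted by ID.
domains_per_sequence() {
    awk -F'\t' '{ n[$1]++ } END { for (id in n) print id "\t" n[id] }' "$1" | sort
}
```

For instance, `filter_by_evalue assignments.tab 1e-4` would print only the more confident assignments, and `domains_per_sequence assignments.tab` would give a two-column summary of sequence ID and domain count.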
The columns of the HTML output are:
Sequence ID
Match region
E-value score
SCOP superfamily
Family E-value
SCOP family E-value
Closest structure
Alignment
If you have further questions, suggestions or comments, please contact
us using the feedback form or by email at
superfamily@mrc-lmb.cam.ac.uk.