Function Annotation of SCOP Domain Superfamilies
Christine Vogel1,2
1MRC
Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH,
England
2Institute for Cellular and Molecular
Biology, University of Texas at Austin, 2500 Speedway, MBB 3.210,
Austin, TX 78712, USA
*Correspondence to: cvogel at mail utexas
edu
This document describes function annotation of
domain superfamilies. Domains are structural, functional and
evolutionary units that form proteins. Domains of common ancestry
are grouped into superfamilies. The domains and domain
superfamilies are defined and described in the Structural
Classification of Proteins database, SCOP [1,2]. This function
annotation of domain superfamilies has been published
before [3,4], and we kindly ask you to cite us if you use it. The
annotation procedure, as used in those papers, is described
below. Recent work [5] updated the function scheme, revised the
annotation of eukaryotic superfamilies, and extended it to all
SCOP classes a to g.
UPDATE
Christine Vogel has extended
her functional annotation of superfamilies to SCOP 1.73. The 1.73
annotation is available in the
scop.annotation.1.73.txt
file. The functional
annotation scheme has not changed. The remainder
of this document refers to the SCOP 1.69 annotation.
FUNCTION SCHEME
The exact definition of the 'function' of a
protein or domain is still a matter of debate and can vary
depending on the exact context. In our work, we annotated domain
superfamilies with respect to their usual role in a protein, in a
particular pathway or in the cell/organism. Thus, our
understanding of 'function' is a mixture of the definitions of
'biological process' and 'molecular function' used
in the Gene Ontology [6] annotation.
We prepared a scheme of 50 detailed
function categories which map to 7 more general function
categories, similar to the scheme used in COGs [7]. The mapping
between the detailed and more general function categories is
described in Table 1 and in the
scop.larger.categories
file. The general categories of function are:
i) Information: storage,
maintenance of the genetic code; DNA replication/repair; general
transcription/translation
ii) Regulation: regulation
of gene expression and protein activity; information processing
in response to environmental input; signal transduction; general
regulatory or receptor activity
iii) Metabolism: anabolic
and catabolic processes; cell maintenance/homeostasis; secondary
metabolism
iv) Intra-cellular
processes: cell motility/division; cell death;
intra-cellular transport; secretion
v) Extra-cellular processes:
inter-, extra-cellular processes, e.g. cell adhesion; organismal
processes, e.g. blood clotting, immune system
vi) General: general and
multiple functions; interactions with proteins/ions/lipids/small
molecules
vii) Other/Unknown: unknown
function, viral proteins/toxins
We are aware that the members of some
superfamilies, particularly the large ones, may have a variety of
functions. For example, immunoglobulin domains are involved in
cell adhesion, muscle structure, the extra-cellular matrix and
the immune system. The function categories here aim to describe
the dominant and most wide-spread function for each superfamily,
as far as it is known today.
ANNOTATION SCHEME
We annotated each domain superfamily of the
SCOP classes a to g manually using the function
scheme described above. The annotation was based on information
from SCOP [2], InterPro [8,9], Pfam [10],
SwissProt [11] and literature.
As a control, we used the automated annotation
of GO process, function and location to Pfam domains in InterPro
[8]. Pfam domains were mapped onto SCOP domain superfamilies
based on sequence similarity. This provided annotation for 647,
667 and 343 domain superfamilies, respectively. The manual domain
annotation was largely consistent with the Gene-Ontology
annotation [6] for Pfam [12] domains and their mappings to the
domains described in SUPERFAMILY [13]. The
annotation for large superfamilies, i.e. those that occur in more
than ~25 proteins in at least one of the commonly used,
completely sequenced eukaryotes, was checked several times by
different researchers [5]. We also consulted co-workers on their
knowledge about the function of well-known superfamilies. In
particular, we thank Matthew Bashton [14], Cyrus Chothia and
Madan Mohan Babu for their valuable input.
Based on our experience in working with this
annotation, we estimate the error rate to be <10% for large
superfamilies, and <20% for all superfamilies. If you use the
function annotation, please do not hesitate to contact us if you
notice erroneous or inappropriate annotation.
The domain function annotation is available in
the scop.annotation.1.69.txt
file.
Distribution of domain functions
Figure 1 shows the distribution of
functions in terms of domain superfamilies in SCOP. Domain
superfamilies of metabolism, e.g. enzymes, are the most
abundant category. Close to half of all
superfamilies (448) have metabolism-related functions, while each
of the other categories comprises less than 15% of the domain
superfamilies. In human, one third of the superfamilies are
metabolic (339/950), mapping to one sixth of all domains
(3212/19225) [13]. Some 10% of the superfamilies (122) have
unknown functions.
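The human proportions quoted above can be checked with simple arithmetic. The short sketch below uses only the counts given in the text; the variable names are ours and are purely illustrative.

```python
# Counts for human as quoted in the text (based on SUPERFAMILY [13]).
metabolic_superfamilies = 339
all_superfamilies = 950
metabolic_domains = 3212
all_domains = 19225

# Fraction of superfamilies, and of individual domains, that are metabolic.
superfamily_fraction = metabolic_superfamilies / all_superfamilies
domain_fraction = metabolic_domains / all_domains

print(f"metabolic superfamilies: {superfamily_fraction:.1%}")  # 35.7%, about one third
print(f"metabolic domains: {domain_fraction:.1%}")             # 16.7%, about one sixth
```

The gap between the two fractions reflects that metabolic superfamilies in human have, on average, fewer domain copies than superfamilies of the other categories.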

Figure 1. The distribution of domain
functions. The distribution of functions of domain
superfamilies in classes a to g of SCOP version
1.69 [2].
Table 1. Mapping between
detailed and more general function categories.
The table lists 50 detailed function
categories which map to 7 more general function categories. The
one- or two-letter code is used in the annotation file. m/tr -
metabolism and transport.
General function | Detailed function | Code
Metabolism | Energy | C
Metabolism | Photosynthesis | CB
General | Small molecule binding | HA
General | Ion binding | HB
General | Lipid/membrane binding | HC
General | Ligand binding | HE
General | General | R
General | Protein interaction | RD
General | Structural protein | ST
Information | Chromatin structure | B
Information | Translation | J
Information | Transcription | K
Information | DNA replication/repair | L
Information | RNA processing | LB
Information | Nuclear structure | Y
Metabolism | E- transfer | CA
Metabolism | Amino acids m/tr | E
Metabolism | Nitrogen m/tr | EA
Metabolism | Nucleotide m/tr | F
Metabolism | Carbohydrate m/tr | G
Metabolism | Polysaccharide m/tr | GA
Metabolism | Storage | GB
Metabolism | Coenzyme m/tr | H
Metabolism | Lipid m/tr | I
Metabolism | Cell envelope m/tr | M
Metabolism | Secondary metabolism | Q
Metabolism | Redox | RA
Metabolism | Transferases | RB
Metabolism | Other enzymes | RC
Other | Unknown function | S
Other | Viral proteins | SA
Extra-cellular processes | Cell adhesion | MA
Extra-cellular processes | Immune response | RE
Extra-cellular processes | Blood clotting | RG
Extra-cellular processes | Toxins/defense | SB
Intra-cellular processes | Cell cycle, Apoptosis | D
Intra-cellular processes | Phospholipid m/tr | IA
Intra-cellular processes | Cell motility | N
Intra-cellular processes | Trafficking/secretion | NA
Intra-cellular processes | Protein modification | O
Intra-cellular processes | Proteases | OA
Intra-cellular processes | Ion m/tr | P
Intra-cellular processes | Transport | RF
Regulation | RNA binding, m/tr | A
Regulation | DNA-binding | LA
Regulation | Kinases/phosphatases | OB
Regulation | Signal transduction | T
Regulation | Other regulatory function | TA
Regulation | Receptor activity | HD
N_A | not annotated | NONA
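The mapping in Table 1 can be written as a simple lookup, which is convenient when summarising annotated superfamilies by general category. The following is a minimal sketch: the codes and category names are transcribed from Table 1, whereas the names GENERAL_TO_CODES, CODE_TO_GENERAL and tally_general are our own illustrative choices and are not part of the annotation files. How the codes are read from the annotation file is left open here, since the file layout is not described in this document.

```python
from collections import Counter

# General category -> detailed function codes, transcribed from Table 1.
GENERAL_TO_CODES = {
    "Metabolism": ["C", "CB", "CA", "E", "EA", "F", "G", "GA", "GB",
                   "H", "I", "M", "Q", "RA", "RB", "RC"],
    "General": ["HA", "HB", "HC", "HE", "R", "RD", "ST"],
    "Information": ["B", "J", "K", "L", "LB", "Y"],
    "Other": ["S", "SA"],
    "Extra-cellular processes": ["MA", "RE", "RG", "SB"],
    "Intra-cellular processes": ["D", "IA", "N", "NA", "O", "OA", "P", "RF"],
    "Regulation": ["A", "LA", "OB", "T", "TA", "HD"],
    "N_A": ["NONA"],
}

# Inverted lookup: detailed code -> general category.
CODE_TO_GENERAL = {
    code: general
    for general, codes in GENERAL_TO_CODES.items()
    for code in codes
}


def tally_general(codes):
    """Count general categories for an iterable of detailed codes.

    Raises KeyError for a code that does not appear in Table 1.
    """
    return Counter(CODE_TO_GENERAL[code] for code in codes)


# Example with three codes taken from Table 1 (not real annotation data).
print(tally_general(["C", "CA", "T"]))
```

Keeping the table as a category-to-codes dictionary and deriving the inverse makes it easy to compare the transcription against Table 1 row by row.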
References
1. Murzin AG, Brenner SE, Hubbard T,
Chothia C (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J Mol
Biol 247: 536-540.
2. Andreeva A, Howorth D, Brenner SE,
Hubbard TJ, Chothia C, et al. (2004) SCOP database in 2004:
refinements integrate structure and sequence family data. Nucleic
Acids Res 32: D226-229.
3. Vogel C, Berzuini C, Bashton M, Gough
J, Teichmann SA (2004) Supra-domains - evolutionary units larger
than single protein domains. J Mol Biol 336: 809-823.
4. Vogel C, Teichmann SA, Pereira-Leal JB
(2005) The relationship between domain duplication and
recombination. J Mol Biol 346: 355-365.
5. Vogel C, Chothia C (2006) Protein family expansions and
biological complexity. PLoS Comput Biol 2: e48.
6. Harris MA, Clark J, Ireland A, Lomax J,
Ashburner M, et al. (2004) The Gene Ontology (GO) database and
informatics resource. Nucleic Acids Res 32: D258-261.
7. Tatusov RL, Fedorova ND, Jackson JD,
Jacobs AR, Kiryutin B, et al. (2003) The COG database: an updated
version includes eukaryotes. BMC Bioinformatics 4: 41.
8. Mulder NJ, Apweiler R, Attwood TK,
Bairoch A, Barrell D, et al. (2003) The InterPro Database, 2003
brings increased coverage and new features. Nucleic Acids Res 31:
315-318.
9. Mulder NJ, Apweiler R, Attwood TK,
Bairoch A, Bateman A, et al. (2005) InterPro, progress and status
in 2005. Nucleic Acids Res 33: D201-205.
10. Finn RD, Mistry J, Schuster-Bockler B,
Griffiths-Jones S, Hollich V, et al. (2006) Pfam: clans, web
tools and services. Nucleic Acids Res 34: D247-251.
11. Boeckmann B, Blatter MC, Famiglietti
L, Hinz U, Lane L, et al. (2005) Protein variety and functional
diversity: Swiss-Prot annotation in its biological context. C R
Biol 328: 882-899.
12. Bateman A, Coin L, Durbin R, Finn RD,
Hollich V, et al. (2004) The Pfam protein families database.
Nucleic Acids Res 32: D138-141.
13. Madera M, Vogel C, Kummerfeld SK,
Chothia C, Gough J (2004) The SUPERFAMILY database in 2004:
additions and improvements. Nucleic Acids Res 32: D235-239.
14. Bashton M (2004) Functional Analysis
of Domain Combinations [PhD]. Cambridge, UK: University of
Cambridge.