SUPERFAMILY 1.75 HMM library and genome assignments server

SUPERFAMILY 2 can be accessed from supfam.org. Please contact us if you experience any problems.

Domain-centric Enzyme Commission (EC) Annotations and Structural Domain Enzyme Commission Ontology

This document explains the details behind EC annotations of structural domains classified in the Structural Classification of Proteins (SCOP) database (Andreeva, et al., 2008). IntEnz (Integrated relational Enzyme database; Fleischmann, et al., 2004) is a resource focused on enzyme nomenclature, a system for naming enzymes (protein catalysts), and provides cross-references to UniProt entries. Combining these annotations with the genome-wide domain assignments for proteins in the SUPERFAMILY database (Gough, 2006), we use statistical inference to detect associations between EC terms and structural domains. We reason that if an EC term tends to annotate proteins containing a domain, then that term should also carry an enzymatic signal for the domain. On this basis, domain-centric EC annotations can be inferred from protein-level (UniProt) EC annotations. Moreover, we have initialized a trimmed-down version of EC that is the most informative for annotating domains. This resource represents an ongoing effort to develop a Structural Domain Enzyme Commission Ontology (SDEO). Together with a reference species tree of (sequenced) life, this resource can be useful in practice for examining how the sets of domains annotated by any chosen EC term are distributed over the course of species evolution.


The pipeline for inferring EC annotations of SCOP domains

The motivation is as follows: if an EC term tends to annotate proteins containing a domain, then that term should also carry an enzymatic signal for the domain. Such a signal can be inferred for a domain if the number of EC-annotated proteins containing the domain is significantly higher than would be expected by chance. Figure 1 summarizes the procedure for generating domain-centric EC annotations from individual protein-level annotations.

Figure 1. Flowchart for inferring domain-centric EC annotations using the IntEnz database and the domain assignments in the SUPERFAMILY database.

    Data Source: Protein/UniProt-level EC annotations are taken from IntEnz. Unlike Domain2GO, associations between domains and ECs are supported by all proteins (i.e., the Uniprot2EC mapping matrix), because the annotatable UniProt entries contain too few single-domain proteins for statistical testing.

    Statistical Analysis: For the Uniprot2EC mapping matrix, two types of enrichment analysis are performed to infer the overall and relative associations between a domain and an EC term (Figure 2). Statistical inference of a possible association between an EC term (say t) and a domain (say d) is performed not only in terms of our analyzable gene space, but also in the context of those genes annotated to the direct parents of that EC term. These dual constraints ensure that only the most informative EC terms are retained. Because many hypotheses are tested simultaneously, the statistical significance of domain-EC term associations is assessed using the false discovery rate (FDR) method (Benjamini and Hochberg, 1995). The resulting FDR is used to decide which domain-EC term associations are significant.

    Domain2EC: High-quality domain-EC associations are identified using a stringent FDR cutoff (<0.001). Since SCOP classifies evolutionarily related domains at the superfamily level and the family level, we have generated domain-centric EC annotations for each of these two domain levels.

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, yielding overall over-representation in terms of the whole set of annotations (left panel) and relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, the statistical significance of domain-EC term associations is then assessed by the FDR method, accounting for multiple hypothesis tests (right panel).
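
The two enrichment tests and the FDR step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the counts passed in are assumed to be (proteins with both d and t, background size, proteins annotated with t, proteins containing d), and the larger of the two P-values is taken to follow the "maximal P-values" wording of Figure 2.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n items are drawn from N, of which K carry the EC term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def association_pvalue(overall, relative):
    """Each argument is a (k, N, K, n) tuple; keep the larger P-value.

    overall:  background is all annotatable proteins (assumption).
    relative: background is proteins annotated to the direct parents of t.
    """
    return max(hypergeom_sf(*overall), hypergeom_sf(*relative))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (FDR) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A domain-EC pair would then be kept when its adjusted value falls below the cutoff used here (FDR < 0.001).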


Initializing structural domain enzyme commission ontology

Based on the high-quality Domain2EC annotations, we have also initialized a trimmed-down version of EC that is the most informative for annotating structural domains (Figure 3).

Figure 3. Flowchart for creating the Structural Domain Enzyme Commission Ontology (SDEO) based on information-theoretic analysis of Domain2EC annotation profiles.

    First, we apply information theory to define the information content (IC) of an EC term: the negative log10-transformation of the frequency of observing domains annotated to that term. For any domain, the EC terms annotated to that domain constitute a domain-EC annotation profile in the directed acyclic graph (DAG) of EC terms, including direct annotations as well as inherited annotations. Because of the dependencies among EC terms (the so-called true-path rule), a domain/protein directly annotated to a specific EC term (a direct annotation) should also be annotated, by inheritance, to its parental terms (inherited annotations). The EC annotations generated above can be considered direct annotations. The complete EC annotations (direct and inherited) are used to calculate the IC for all EC terms. Of note, EC terms with similar IC can represent a partition of the DAG in terms of Domain2EC.

    Second, given a predefined IC (say 1) as a seed and its corresponding range (say, [0.75 1.25]), the proposed algorithm starts with all EC terms unmarked, and iteratively identifies the unmarked EC terms closest to the predefined IC until all EC terms are marked (Figure 4). To make sure that one and only one EC term is identified per path, the following constraints must be met: if multiple EC terms with identical IC are identified in the same path, the parental terms are filtered out; once an EC term is identified, all terms in the paths on which that term is located are marked and thereby excluded from further search.

    Last, the outputs are those identified EC terms with IC falling in the range. We run the algorithm using each of four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SDEO, corresponding respectively to four levels of EC terms (least informative, moderately informative, informative, highly informative).
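
The IC definition in the first step can be illustrated with a short sketch. The domain names, the toy hierarchy and the function names below are made up for illustration, and the frequency is assumed to be the fraction of annotated domains that carry the term after true-path propagation; the original text does not spell out the denominator.

```python
from math import log10

def propagate(direct, parents):
    """True-path rule: extend each domain's direct terms with all ancestors."""
    complete = {}
    for domain, terms in direct.items():
        seen = set()
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term not in seen:
                seen.add(term)
                stack.extend(parents.get(term, ()))
        complete[domain] = seen
    return complete

def information_content(complete):
    """IC(term) = -log10(domains annotated to term / all annotated domains)."""
    n_domains = len(complete)
    counts = {}
    for terms in complete.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -log10(c / n_domains) for term, c in counts.items()}

# Hypothetical toy data: child -> list of direct parents, and direct annotations.
parents = {"1.1.1": ["1.1"], "1.1": ["1"], "1": ["root"],
           "2.7": ["2"], "2": ["root"]}
direct = {"sf_a": {"1.1.1"}, "sf_b": {"1.1"}, "sf_c": {"2.7"}, "sf_d": {"1.1.1"}}

complete = propagate(direct, parents)
ic = information_content(complete)
```

In this toy example the root annotates every domain and so has IC 0, while more specific terms annotate fewer domains and receive higher IC.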

Figure 4. Illustration of the algorithm for iteratively creating the Structural Domain Enzyme Commission Ontology (SDEO). I) Initially, all EC terms are unmarked (open circles). II) Identify those unmarked EC terms (filled in pink) with IC closest to a predefined IC (e.g., 1). III) Filter out parental EC terms from the EC terms identified in Step II. IV) Mark the identified EC terms as well as all of their ancestors and descendants. V-VI) Repeat Steps II-IV to iteratively identify unmarked EC terms until all EC terms are marked. VII) Output only those identified EC terms with IC falling in the range (e.g., [0.75 1.25]) as SDEO.
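
Steps I-VII can be sketched in a few lines. This is one reading of the description rather than the actual implementation: the function names, the toy hierarchy and the IC values are hypothetical, and "closest" is taken to mean the smallest absolute difference from the seed IC.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following child -> parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def select_terms(ic, parents, seed, low, high):
    anc = {t: ancestors(t, parents) for t in ic}
    desc = {t: set() for t in ic}
    for t, ups in anc.items():
        for a in ups:
            if a in desc:
                desc[a].add(t)
    unmarked = set(ic)                      # Step I: nothing is marked yet
    chosen = []
    while unmarked:                         # Steps V-VI: repeat until all marked
        best = min(abs(ic[t] - seed) for t in unmarked)
        hits = {t for t in unmarked if abs(ic[t] - seed) == best}   # Step II
        hits = {t for t in hits if not (desc[t] & hits)}            # Step III
        for t in hits:                      # Step IV: mark term and its paths
            chosen.append(t)
            unmarked -= {t} | anc[t] | desc[t]
    return sorted(t for t in chosen if low <= ic[t] <= high)        # Step VII

# Hypothetical toy hierarchy (child -> direct parents) and IC values.
parents = {"1": ["root"], "1.1": ["1"], "1.1.1": ["1.1"],
           "2": ["root"], "2.7": ["2"]}
ic = {"root": 0.0, "1": 0.4, "1.1": 0.9, "1.1.1": 1.6, "2": 0.3, "2.7": 0.5}

moderately_informative = select_terms(ic, parents, seed=1, low=0.75, high=1.25)
least_informative = select_terms(ic, parents, seed=0.5, low=0.25, high=0.75)
```

With seed 1, the term 2.7 is still identified on its path (it is the closest unmarked term there) but is dropped in the final step because its IC lies outside the range.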


Data Availability

In addition to two hierarchies (SCOP-Hierarchy and EC-Hierarchy) for browsing, we also provide the Domain2EC mapping results in two parsable formats (i.e., plain-text files and MySQL tables). Although we also offer Domain2EC at the SCOP fold and class levels, these should be treated with special care because they are of no real use in terms of evolutionary relevance.

Domain2EC mapping results

  • High-coverage domain-centric EC annotations are available in the Domain2EC.txt file.

  • EC terms regarded as SDEO (four levels: least informative, moderately informative, informative, and highly informative) can be found in the SDEO.txt file. We highly recommend using these EC terms and the domains they annotate from Domain2EC.txt. Unlike the whole EC hierarchy, these EC terms at different granularities are representative and comprehensive in terms of their relevance to domains (not proteins). Keep in mind that SDEO is defined separately for each of the four SCOP domain types (i.e., FA, SF, CF, and CL).

Domain2EC MySQL tables

    We use the four tables below (Domain2EC.sql.gz) to store the information described above (i.e., the Domain2EC mapping results):

    EC_info: containing info about EC terms.
        > DESC EC_info;
        +-------------+----------------------------------+------+-----+---------+-------+
        | Field       | Type                             | Null | Key | Default | Extra |
        +-------------+----------------------------------+------+-----+---------+-------+
        | ec          | varchar(15)                      | NO   | PRI | NULL    |       |
        | namespace   | enum('root','enzyme_commission') | NO   |     | NULL    |       |
        | description | varchar(255)                     | NO   |     | NULL    |       |
        | distance    | tinyint(3) unsigned              | NO   |     | NULL    |       |
        +-------------+----------------------------------+------+-----+---------+-------+
        
    • The ec column is the EC id, see IntEnz - Classification rules. It is browsable via EC-Hierarchy.
    • The namespace column is enzyme_commission, otherwise root.
    • The description column shows the full name of EC terms.
    • The distance column shows the distance of EC terms to the root.

    EC_hie: containing info about EC hierarchy.
        > DESC EC_hie;
        +----------+---------------------+------+-----+---------+-------+
        | Field    | Type                | Null | Key | Default | Extra |
        +----------+---------------------+------+-----+---------+-------+
        | parent   | varchar(15)         | NO   | PRI | NULL    |       |
        | child    | varchar(15)         | NO   | PRI | NULL    |       |
        | distance | tinyint(3) unsigned | NO   | PRI | NULL    |       |
        +----------+---------------------+------+-----+---------+-------+
        
    • The parent column is the EC id of the parent term.
    • The child column is the EC id of the child term.
    • The distance column shows the distance from the parental EC id to the child EC id: 1 for direct parent-child relationships, larger values indicating the existence of a path between them (reachable but indirect).
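
As a sketch of how the distance column can be used, the example below builds a tiny in-memory copy of EC_hie and queries it. SQLite stands in for MySQL here so that the example runs without a server, and the rows are invented for illustration; the SELECT statements themselves are plain SQL and are expected, though not verified here, to carry over to the MySQL tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EC_hie (parent TEXT, child TEXT, distance INTEGER, "
    "PRIMARY KEY (parent, child, distance))"
)
# Hypothetical rows: direct links have distance 1, indirect ones are larger.
conn.executemany(
    "INSERT INTO EC_hie VALUES (?, ?, ?)",
    [("1", "1.1", 1), ("1.1", "1.1.1", 1), ("1", "1.1.1", 2), ("1", "1.2", 1)],
)

# Direct children of a term: restrict to distance = 1.
direct_children = [
    row[0]
    for row in conn.execute(
        "SELECT child FROM EC_hie WHERE parent = ? AND distance = 1 "
        "ORDER BY child",
        ("1",),
    )
]

# All ancestors of a term, nearest first: any distance, looked up by child.
ancestors_of_term = conn.execute(
    "SELECT parent, distance FROM EC_hie WHERE child = ? ORDER BY distance",
    ("1.1.1",),
).fetchall()
```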

    EC_mapping: containing info about Domain2EC annotations.
        > DESC EC_mapping;
        +----------------+---------------------------+------+-----+---------+-------+
        | Field          | Type                      | Null | Key | Default | Extra |
        +----------------+---------------------------+------+-----+---------+-------+
        | id             | mediumint(8) unsigned     | NO   | PRI | NULL    |       |
        | level          | enum('cl','cf','sf','fa') | NO   |     | NULL    |       |
        | ec             | varchar(15)               | NO   | PRI | NULL    |       |
        | all_score      | double                    | NO   |     | 1       |       |
        | inherited_from | text                      | YES  |     | NULL    |       |
        +----------------+---------------------------+------+-----+---------+-------+
        
    • The id is the SCOP unique identifier, sunid. It is browsable via SCOP-Hierarchy.
    • The level in the SCOP hierarchy. Can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The ec column is the EC id.
    • The all_score column is the FDR supported by all UniProts (including multidomain UniProts).
    • The inherited_from column marks the status of predicted Domain2EC annotations. 1) If it is marked with 'directed' (i.e., 'all_score'<0.001), Domain2EC is significantly supported by all UniProts (including multidomain UniProts). 2) If it is a comma-separated list of EC ids (numeric part; the column 'all_score' is not less than 0.001), Domain2EC is inherited from descendant EC terms (significantly associated) by applying the true-path rule in the DAG. 3) Empty otherwise. Hence, the list of Domain2EC annotations can be obtained by selecting rows where the column 'inherited_from' is NOT EMPTY.
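
The selection described in the last bullet might look like the sketch below. It uses an in-memory SQLite table as a stand-in for MySQL, with made-up sunids and scores. Whether an "empty" inherited_from is stored as NULL or as an empty string is not stated above, so the filter covers both as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EC_mapping (id INTEGER, level TEXT, ec TEXT, "
    "all_score REAL DEFAULT 1, inherited_from TEXT, PRIMARY KEY (id, ec))"
)
# Hypothetical rows covering the three cases of inherited_from.
conn.executemany(
    "INSERT INTO EC_mapping VALUES (?, ?, ?, ?, ?)",
    [
        (100001, "sf", "1.1.1", 1e-5, "directed"),  # significant on its own
        (100001, "sf", "1.1", 0.2, "1.1.1"),        # inherited from a descendant
        (100001, "sf", "2.7", 0.5, None),           # not a Domain2EC annotation
        (100002, "fa", "1.1.1", 1e-4, "directed"),
    ],
)

# Assumption: "empty" may be NULL or ''.
NON_EMPTY = "inherited_from IS NOT NULL AND inherited_from <> ''"

all_sf = conn.execute(
    "SELECT id, ec, inherited_from FROM EC_mapping "
    f"WHERE level = 'sf' AND {NON_EMPTY} ORDER BY id, ec"
).fetchall()

# Split the result into direct annotations and inherited ones.
direct_sf = [(sunid, ec) for sunid, ec, src in all_sf if src == "directed"]
inherited_sf = {
    (sunid, ec): src.split(",") for sunid, ec, src in all_sf if src != "directed"
}
```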

    EC_ic: containing info about SDEO.
        > DESC EC_ic;
        +---------+---------------------------+------+-----+---------+-------+
        | Field   | Type                      | Null | Key | Default | Extra |
        +---------+---------------------------+------+-----+---------+-------+
        | level   | enum('cl','cf','sf','fa') | NO   | PRI | NULL    |       |
        | ec      | varchar(15)               | NO   | PRI | NULL    |       |
        | ic      | double                    | YES  |     | NULL    |       |
        | include | tinyint(2)                | YES  | MUL | NULL    |       |
        +---------+---------------------------+------+-----+---------+-------+
        
    • The level in the SCOP hierarchy. Can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The ec column is the EC id.
    • The ic column shows the information content of the EC term.
    • The include column indicates whether or not the EC term belongs to the SDEO. If the column is set to '0' then it is not a member of SDEO. Otherwise, '1' for least informative (i.e., the most general), '2' for moderately informative, '3' for informative, '4' for highly informative (i.e., the most specific).
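
One way to combine EC_ic with EC_mapping is to list, for each SDEO term at a given SCOP level, the domains it annotates. The sketch below again uses in-memory SQLite in place of MySQL, and all rows are invented. The join on both ec and level, and the NULL-or-empty-string test for inherited_from, are assumptions about how the tables are meant to be used together.

```python
import sqlite3

LABELS = {1: "least informative", 2: "moderately informative",
          3: "informative", 4: "highly informative"}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EC_ic (level TEXT, ec TEXT, ic REAL, include INTEGER,
                    PRIMARY KEY (level, ec));
CREATE TABLE EC_mapping (id INTEGER, level TEXT, ec TEXT, all_score REAL,
                         inherited_from TEXT, PRIMARY KEY (id, ec));
""")
# Hypothetical rows for illustration only.
conn.executemany("INSERT INTO EC_ic VALUES (?, ?, ?, ?)", [
    ("sf", "1", 0.4, 1), ("sf", "1.1", 0.9, 2),
    ("sf", "1.1.1", 1.3, 0), ("fa", "1.1", 1.1, 2),
])
conn.executemany("INSERT INTO EC_mapping VALUES (?, ?, ?, ?, ?)", [
    (100001, "sf", "1.1", 1e-5, "directed"),
    (100002, "sf", "1.1", 0.3, "1.1.1"),
    (100003, "sf", "1.1", 0.8, None),
    (100001, "sf", "1.1.1", 1e-6, "directed"),
    (100002, "sf", "1.1.1", 1e-4, "directed"),
])

rows = conn.execute("""
    SELECT i.ec, i.include, m.id
    FROM EC_ic AS i
    JOIN EC_mapping AS m ON m.ec = i.ec AND m.level = i.level
    WHERE i.level = 'sf' AND i.include > 0
      AND m.inherited_from IS NOT NULL AND m.inherited_from <> ''
    ORDER BY i.include, i.ec, m.id
""").fetchall()

# Group the annotated domains under each SDEO term with its level label.
sdeo_domains = {}
for ec, include, sunid in rows:
    sdeo_domains.setdefault(ec, (LABELS[include], []))[1].append(sunid)
```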


References

    Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
    Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300.
    Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K.B., Bairoch, A., Schomburg, D., Tipton, K.F. and Apweiler, R. (2004) IntEnz, the integrated relational enzyme database, Nucleic Acids Res, 32, D434-437.
    Gough, J. (2006) Genomic scale sub-family assignment of protein domains, Nucleic Acids Res, 34, 3625-3633.