Name               Last modified       Size  Description
Parent Directory                       -
current/           30-Aug-2011 09:49   -
nar2010/           22-Jun-2010 13:49   -
refprotdom1.0/     22-Jun-2010 13:49   -
refprotdom1.1/     22-Jun-2010 13:47   -
refprotdom1.2/     30-Aug-2011 09:49   -
The RefProtDom database has been undergoing curation. As a result, the files used for our Nucleic Acids Research paper (Gonzalez and Pearson (2010), Nucleic Acids Research, doi:10.1093/nar/gkp1219) have been updated.
The original files are available in the nar2010 (refprotdom1.0) directory.
The same sequence files with more current domain boundaries are available in the current (refprotdom1.2) directory.
Six types of files are provided:
library_all_domains_rdm.fa.gz - Random-shuffles of each of the full-length Uniprot proteins in library_all_domains.fa.gz.
library_long_domains.fa.gz - A subset of the library_all_domains.fa.gz library from which proteins with homologous domains less than 75% of the Pfam model length are excluded.
library_long_domains_rdm.fa.gz - Random-shuffles of each of the full-length Uniprot proteins in library_long_domains.fa.gz
family_members.annot.gz - Lists the domains in each sequence in the library_*_domains.fa.gz files. Format:
>[source]|[accession]|[sequence_name]
[superfamily]<tab>[domain_start]<tab>[domain_end]<tab>[e-value]<tab>[mode]<tab>[long_domain]
For example:
>up|P53627|ABFA_STRLI
PF06964 293 494 1.3e-104 pf21ls 1
>pfam21|P19801|ABP1_HUMAN
PF01179 296 715 2.3e-35 ua_pws 1
CL47 39 125 3.3e-29 pf21ls 1
CL47 141 241 1.3e-24 pf21ls 1
source
    "up" if the sequence matches the current (12/2009) Uniprot version of the sequence
    "pfam21" if the exact sequence is no longer in Uniprot, in which case the Pfam v.21 sequence is used
superfamily
    The Pfam accession (PF#####, when the family is the sole representative of its superfamily/clan) or the clan number (CL[clan_id], when the superfamily has several families that have been coalesced into one homologous group)
domain_start
    Sequence coordinate where the domain starts
domain_end
    Sequence coordinate where the domain ends
e-value
    The score of the comparison between the sequence fragment from domain_start to domain_end and the HMM model of the given family, or the e-value generated by the supplemental annotation methods described in (Gonzalez, M.W. and W.R. Pearson, 2010b)
mode
    The type of Pfam HMM model used to identify the given domain, or the supplemental annotation method described in (Gonzalez, M.W. and W.R. Pearson, 2010b)
    "pf21ls" domains match the entire footprint of the Pfam domain model
    "pf21fs" domains are usually fragments that only partially match the Pfam domain model
    "ext" domains were previously annotated as partial homologies; we extended their coordinates
    "ua_str" domains are previously missed homologs found by Gonzalez and Pearson using structural evidence
    "ua_rec" domains are previously missed homologs found by Gonzalez and Pearson using reciprocal PSI-BLAST searches
    "ua_pws" domains are previously missed homologs found by Gonzalez and Pearson using pair-wise searches
long_domain
    "0" if the sequence contains domains whose lengths are <75% of the Pfam model length. long_domain=0 sequences are found only in library_all_domains.fa.gz
    "1" if the sequence contains only domains whose lengths are >=75% of the Pfam model length. long_domain=1 sequences are found in both library_long_domains.fa.gz and library_all_domains.fa.gz
non_redundant
    Useful for calculating family size
    "0" flags a redundant domain that overlaps another domain with a longer sequence homology annotation
    "1" flags the non-redundant domain, the one with the longer sequence homology annotation
family_query.summary - Lists the size and names of the queries for each of the chosen families.
pfam_to_clan.txt - Lists the pfam family to clan superfamily correspondence. Note: The annotations on this database are at the superfamily level, which we recommend for homology evaluation. See the FAQ.txt and (Gonzalez and Pearson, NAR, 2010) for more details of why coalescing superfamilies is the preferred choice when evaluating homology.
refprotdom_domain_bound_ext.txt - Lists the domains that were annotated as partial homologies in Pfam v.21 and whose coordinates we extended. Current Uniprot accessions and sequence ids are provided, as well as the corresponding Pfam v.24 coordinates.
refprotdom_unannot_homol.txt - Lists missed/unannotated homologs in Pfam v.21 that we uncovered with reverse PSI-BLAST searches, pair-wise searches or through SCOP/CATH structural evidence.
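The family_members.annot.gz layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes each sequence has a ">" header line followed by one whitespace-separated domain record per line (a record that shares the header line is also tolerated), and the names Domain, parse_annot and read_annot are hypothetical helpers, not part of the RefProtDom distribution. Columns after [mode] (long_domain, plus non_redundant when present) are kept as a tuple of integers.

```python
import gzip
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Domain(NamedTuple):
    superfamily: str         # PF##### or CL[clan_id]
    start: int               # domain_start
    end: int                 # domain_end
    evalue: float
    mode: str                # e.g. pf21ls, pf21fs, ext, ua_str, ua_rec, ua_pws
    flags: Tuple[int, ...]   # long_domain, then non_redundant if the file has it


def parse_annot(lines: Iterable[str]) -> Dict[str, List[Domain]]:
    """Map each sequence id (e.g. 'up|P53627|ABFA_STRLI') to its domain list."""
    records: Dict[str, List[Domain]] = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            current = fields[0][1:]
            records.setdefault(current, [])
            fields = fields[1:]   # assumption: a record may share the header line
            if not fields:
                continue
        if current is None:
            raise ValueError("domain record found before the first header")
        superfamily, start, end, evalue, mode, *flags = fields
        records[current].append(
            Domain(superfamily, int(start), int(end), float(evalue), mode,
                   tuple(int(f) for f in flags))
        )
    return records


def read_annot(path: str) -> Dict[str, List[Domain]]:
    """Read a gzip-compressed annotation file such as family_members.annot.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_annot(handle)
```

For example, read_annot("family_members.annot.gz")["pfam21|P19801|ABP1_HUMAN"] would return the list of domains annotated on that sequence.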
queries.tgz is a gzip-ed tar file that produces the following directories:
In queries/by_difficulty/, there are two classes of query sequence files, each of which contains 50 domain sequences, in 10 different random-sequence embeddings.
In addition, there are hard_non_embedded.fa and sampled_non_embedded.fa files.
"hard" domains are domains that find the smallest number of related sequences after a BLASTP search. "sampled" domains were chosen at random from 640 domains selected because of their length (>200 residues in the Pfam model) and phylogenetic diversity (homologs in 2 of the 3 kingdoms of life: e.g. homologs in archaea and eukarya, or in archaea and bacteria, etc).
"queries/by_tree_location/" also contains two classes of query sequence files, each with 50 domain sequences, in 10 different random-sequence embeddings. Here, the classes are "des", for queries from relatively deserted parts of the domain phylogenetic tree, and "pop", for queries from a populated region.
All queries are available as bare domains (non-embedded/ne) or flanked by artificial proteins (embedded/e#). The embeddings were created by randomly shuffling the domain as described by Gonzalez and Pearson. For each domain, 10 different embedding replicates are provided. Unless otherwise specified, the results in (Gonzalez and Pearson, NAR, 2010) are based on embedding #5.
QUERY FILE NAMES
A query is a sequence domain from a family that falls under one of the 4 types of queries described above (i.e. hard, sampled, populated/pop, deserted/des). The four types of query families are available as bare domains (non_embedded) or embedded in 10 different shuffles, using the naming format [type]_[embedding].[e#].fa. For example, "hard_embedded.5.fa" contains 50 embedded queries from hard families, and the embedding is the 5th shuffle of the domain (there are 9 alternate embeddings for the same domain).
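As an illustration of the naming scheme, the sketch below lists the file names expected for one query type and splits a file name back into its parts. It assumes that the "pop" and "des" files follow the same pattern as the "hard" and "sampled" names quoted in this README, and that the embedding replicates are numbered 1 to 10; both helper functions are hypothetical.

```python
import re
from typing import List, Optional, Tuple

QUERY_TYPES = ("hard", "sampled", "pop", "des")
N_EMBEDDINGS = 10   # assumption: replicates are numbered 1..10

_NAME_RE = re.compile(
    r"^(?P<type>[a-z]+)_(?:non_embedded|embedded\.(?P<rep>\d+))\.fa$"
)


def query_file_names(qtype: str) -> List[str]:
    """Return the bare-domain file name plus the 10 embedded file names."""
    if qtype not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {qtype}")
    names = [f"{qtype}_non_embedded.fa"]
    names += [f"{qtype}_embedded.{i}.fa" for i in range(1, N_EMBEDDINGS + 1)]
    return names


def parse_query_file_name(name: str) -> Tuple[str, Optional[int]]:
    """Split a file name into (type, embedding number); None means non-embedded."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a query file name: {name}")
    rep = match.group("rep")
    return match.group("type"), (int(rep) if rep is not None else None)
```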
QUERY FILE FORMAT
Query files are in FASTA format, with the description line providing information about the location of the domain, and its origin. Each query file contains 50 queries of the form:
>qPF00589_e5 e_d_start:96 e_d_end:286 from:up|Q1YWW7|Q1YWW7_PHOPR(194-384); pfam:PF00589; model_len:205; all_homol:1445; long_homol:963; descr:Phage integrase family
KTKKSAKQSDL.... [sequence] ....
The format of the description line is:
>[query_accession] e_d_start:# e_d_end:# from:[sequence_id]([domain_start]-[domain_end]); pfam:[pfam_superfamily]; model_len:[#]; all_homol:[#]; long_homol:[#]; descr:[description]
query_accession Accession number for each query, in the format q[pfam_superfamily]_[ne|e#]. For example, qPF00589_e5 is a domain from PF00589 that has been embedded in the 5th shuffle replicate.
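The description-line format above lends itself to a regular expression. The following is a sketch for the embedded form shown in the example; whether non-embedded queries carry the same e_d_start/e_d_end fields is not stated here, so that case is not handled. QueryHeader, parse_query_header and split_accession are hypothetical names.

```python
import re
from typing import NamedTuple, Tuple


class QueryHeader(NamedTuple):
    accession: str      # e.g. qPF00589_e5
    e_d_start: int      # domain start within the embedded query
    e_d_end: int        # domain end within the embedded query
    source_id: str      # e.g. up|Q1YWW7|Q1YWW7_PHOPR
    domain_start: int   # domain start in the source sequence
    domain_end: int     # domain end in the source sequence
    pfam: str
    model_len: int
    all_homol: int
    long_homol: int
    descr: str


_HEADER_RE = re.compile(
    r"^>(?P<accession>\S+)\s+"
    r"e_d_start:(?P<e_d_start>\d+)\s+e_d_end:(?P<e_d_end>\d+)\s+"
    r"from:(?P<source_id>[^()\s]+)\((?P<domain_start>\d+)-(?P<domain_end>\d+)\);\s*"
    r"pfam:(?P<pfam>[^;]+);\s*model_len:(?P<model_len>\d+);\s*"
    r"all_homol:(?P<all_homol>\d+);\s*long_homol:(?P<long_homol>\d+);\s*"
    r"descr:(?P<descr>.*)$"
)
_INT_FIELDS = ("e_d_start", "e_d_end", "domain_start", "domain_end",
               "model_len", "all_homol", "long_homol")
_ACCESSION_RE = re.compile(r"^q(?P<superfamily>.+)_(?P<embedding>ne|e\d+)$")


def parse_query_header(line: str) -> QueryHeader:
    """Parse one FASTA description line from an embedded query file."""
    match = _HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized query description line")
    fields = match.groupdict()
    for key in _INT_FIELDS:
        fields[key] = int(fields[key])
    return QueryHeader(**fields)


def split_accession(accession: str) -> Tuple[str, str]:
    """Split q[pfam_superfamily]_[ne|e#] into (superfamily, embedding tag)."""
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError("unrecognized query accession")
    return match.group("superfamily"), match.group("embedding")
```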
trees.tgz is a gzip-ed tar file that produces the following directories:
"trees/all_domains_in_family/" contains trees of all domain members of each superfamily
"trees/long_domains_in_family/" contains trees of the long-domain members of each superfamily
All *.tree files in the "trees/" folder are newick formatted, neighbor-joining trees of a set of members for each superfamily (i.e. trees that feature all domain members in a superfamily or trees of only the long-domain members).
The *.afa files contain the multiple sequence alignments used to generate the trees.
All superfamily trees were generated using Quicktree (v. 1.1), and the multiple sequence alignments used to generate them were created using the HMMER (v. 2.3.2) package.
To determine whether the alignments are true positives (TPs) or false positives (FPs), all you need to know is the library sequence's id (e.g. up|Q1YWW7|Q1YWW7_PHOPR) and the Pfam superfamily to which the query belongs (e.g. qPF00589_e5's superfamily is PF00589).
Find the library sequence in the "family_members.annot.gz" file (alternatively, this information may be stored in MySQL tables) and compare the domain boundaries there to the alignment coordinates of the similarity-searching algorithm you are testing. For instance, assume you are testing the PF00589 superfamily (using qPF00589_e5, a query from hard_embedded.5.fa) and suppose your algorithm finds a putative homolog on the "XERC_BACSU" sequence from residues 10-80. Looking at the "family_members.annot" file, you would classify this alignment as a false positive (FP) because the alignment maps 100% to the unrelated PF02899 domain. You may decide to use a specific overlap percentage to classify the alignments. In (Gonzalez MW, Pearson WR: Homologous Over-extension: A Challenge for Iterative Similarity Searches. Nucleic Acids Res 2010, [Jan. 10 Epub ahead of print] doi:10.1093/nar/gkp1219:1-13), we require 50% alignment overlap with the homologous region (i.e. at least 50% of the alignment must fall between residues 114 and 291, the PF00589 domain of XERC_BACSU) for the alignment to be counted as a true positive (TP).
>up|P39776|XERC_BACSU
PF02899 8 91 1.6e-26 ls 1 0
PF00589 114 291 1.6e-65 ls 1 1
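The 50% overlap rule can be sketched as follows. This is an illustrative reading of the rule, not the code used for the paper: coordinates are treated as inclusive, the overlap is measured as a fraction of the alignment length, and when a sequence has several domains from the query's superfamily the best single-domain overlap is used (an assumption). The function names and the (superfamily, start, end) tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def overlap_fraction(aln_start: int, aln_end: int,
                     dom_start: int, dom_end: int) -> float:
    """Fraction of the alignment (inclusive coordinates) that lies in the domain."""
    shared = min(aln_end, dom_end) - max(aln_start, dom_start) + 1
    return max(shared, 0) / (aln_end - aln_start + 1)


def classify(aln_start: int, aln_end: int, query_superfamily: str,
             domains: Iterable[Tuple[str, int, int]],
             min_overlap: float = 0.5) -> str:
    """Return 'TP' if enough of the alignment covers a homologous domain, else 'FP'.

    `domains` holds (superfamily, domain_start, domain_end) tuples for the
    library sequence, as listed in family_members.annot.gz.
    """
    best = max(
        (overlap_fraction(aln_start, aln_end, start, end)
         for superfamily, start, end in domains
         if superfamily == query_superfamily),
        default=0.0,   # no domain from the query's superfamily on this sequence
    )
    return "TP" if best >= min_overlap else "FP"


# The XERC_BACSU annotation from this README.
xerc_bacsu = [("PF02899", 8, 91), ("PF00589", 114, 291)]
print(classify(10, 80, "PF00589", xerc_bacsu))    # FP: lies entirely in PF02899
print(classify(100, 200, "PF00589", xerc_bacsu))  # TP: 87 of 101 residues in 114-291
```

A library sequence that is absent from the annotation file (for example, a shuffled sequence) can be passed with an empty domain list, which yields "FP".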
For more information, contact Bill Pearson (firstname.lastname@example.org) or Mileidy Gonzalez (email@example.com)