PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost.
Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity.
Several issues in creating benchmarking datasets have been discussed earlier (Aniba, 2010).
Results: CSBLAST and PHMMER had the highest overall accuracy.
Pfam architectures were considered at the Clan level by replacing Pfam domain IDs with Clan IDs where defined. Architectures are listed as consecutive domain identifiers separated by an underscore (_).
All domain architectures and sequences were acquired from the source databases (version 28.0 of Pfam, version 1.75 of SCOP/SUPERFAMILY, version 3.5.0 of Gene3D), retrieving all domain matches via v53.0 of the InterPro database (Mitchell, 2015), restricting the analysis to sequences present in Swiss-Prot (UniProtKB/Swiss-Prot, downloaded on August 24, 2015).
To account for incompleteness of present domain annotations, any sequence was discarded for which at least fifty consecutive residues were not assigned to a protein domain, as has been done in previous studies (Forslund
Figure: Diagram illustrating how multi-domain homologous and non-homologous protein pairs were selected from the three databases Pfam (with clans), SUPERFAMILY and Gene3D.
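The architecture construction and the annotation-gap filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the `(domain_id, start, end)` tuple layout, the 1-based inclusive coordinates, the function names and the identifiers `D1`, `D2`, `C1` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a domain match: (domain_id, start, end),
# using 1-based inclusive residue coordinates.
DomainMatch = Tuple[str, int, int]


def architecture(matches: List[DomainMatch], clan_of: Dict[str, str]) -> str:
    """Join domain IDs in sequence order with '_', using the clan ID where one is defined."""
    ordered = sorted(matches, key=lambda m: m[1])
    return "_".join(clan_of.get(dom, dom) for dom, _, _ in ordered)


def longest_unassigned_run(matches: List[DomainMatch], seq_len: int) -> int:
    """Length of the longest stretch of residues not covered by any domain match."""
    covered_until = 0  # highest residue position covered so far
    longest = 0
    for _, start, end in sorted(matches, key=lambda m: m[1]):
        # Gap between the previous coverage and this match (negative if they overlap).
        longest = max(longest, start - covered_until - 1)
        covered_until = max(covered_until, end)
    # Unassigned tail after the last covered residue.
    return max(longest, seq_len - covered_until)


def keep_sequence(matches: List[DomainMatch], seq_len: int, max_gap: int = 50) -> bool:
    """Discard a sequence if at least `max_gap` consecutive residues are unassigned."""
    return longest_unassigned_run(matches, seq_len) < max_gap


# Illustrative data only: D1 belongs to a (made-up) clan C1, D2 has no clan.
example = [("D1", 1, 100), ("D2", 120, 200)]
print(architecture(example, {"D1": "C1"}))
print(keep_sequence(example, seq_len=230), keep_sequence(example, seq_len=250))
```

With the illustrative data, the 19-residue linker and the 30-residue tail both fall under the threshold for a 230-residue sequence, whereas a 250-residue sequence has a 50-residue unassigned tail and would be discarded.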
Consequently, we expanded on our previous benchmark approach to construct an updated evaluation dataset, then tested the latest versions of the ‘next-generation’ homology search tools for precision, accuracy and speed.
Additionally, where previously only Pfam was used, we applied our benchmarking method to all three major domain family databases: Pfam, Gene3D and SUPERFAMILY (extending SCOP, Fox, 2014).
To remedy this, we previously (Forslund and Sonnhammer, 2009) described an approach for generating ‘gold standard’ test cases for homology inference by selecting pairs of multi-domain proteins where either all corresponding domains match at the superfamily/clan level (positive gold standard) or where none of them do (negative gold standard). Using this approach, we compared different low-complexity filter settings for the NCBI-BLAST homology search tool, and found that compositional adjustment of score matrices allowed minimization of false positives, though sometimes at the price of truncated alignments.
FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization.
Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small.
This was done with the intent that the similarity of benchmark results derived from different databases would provide a test of the extent to which, beyond differences in scope or coverage, these resources, built from different types of data and using different curation protocols, reflect the same underlying evolutionary entities seen through different definition schemes, a question which has been raised in some recent studies (Csaba, 2009).
As previously described (Forslund and Sonnhammer, 2009), pairs of multi-domain proteins are seen as homologous for the purpose of the benchmark if their domains, in consecutive order, belong to the same family or clan (in the case of Pfam) or the same superfamily (in the case of Gene3D or SUPERFAMILY).
New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA.
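As a rough illustration of the gold-standard pair definition given above, the following Python sketch labels a pair of multi-domain proteins from their ordered domain assignments. It is a simplified reading rather than the authors' implementation: the function name `classify_pair` is hypothetical, "none of the domains match" is interpreted here as the two proteins sharing no family, clan or superfamily at all, and pairs with partial overlap are assumed to be left out of the benchmark.

```python
from typing import List, Optional


def classify_pair(arch_a: List[str], arch_b: List[str]) -> Optional[str]:
    """Label a protein pair from its ordered clan/superfamily IDs.

    Returns "homologous", "non-homologous", or None when the pair is
    not usable as a gold-standard case under the assumptions stated above.
    """
    # Only multi-domain proteins are considered (assumed: at least two domains each).
    if len(arch_a) < 2 or len(arch_b) < 2:
        return None
    # Positive gold standard: all domains, in consecutive order, match.
    if arch_a == arch_b:
        return "homologous"
    # Negative gold standard (assumed reading): no domain family is shared.
    if not set(arch_a) & set(arch_b):
        return "non-homologous"
    # Partial overlap: ambiguous, so excluded here.
    return None


# Illustrative identifiers only.
print(classify_pair(["C1", "D2"], ["C1", "D2"]))
print(classify_pair(["C1", "D2"], ["D3", "D4"]))
print(classify_pair(["C1", "D2"], ["C1", "D4"]))
```

Excluding partially overlapping pairs keeps both gold standards unambiguous, at the cost of covering fewer protein pairs.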