Use of profile hidden Markov models in viral discovery: current insights

Table 2 Probability values with pseudocounts for positions 46–49 from the multiple sequence alignment depicted in Figure 1

Figure 2 Diagram representing a profile hidden Markov model (profile HMM).

Notes: Match states are represented as red rectangles, deletion (silent) states as green circles, and insertion states as blue diamonds. The red numerical values next to the arrows indicate transition probabilities. The equalities inside the states give amino acid probabilities, generally called emission probabilities, which here were computed without pseudocounts. Match states use emission probabilities computed from the original alignment; insertion states use a uniform background probability of 1/20 for each amino acid. The transition probabilities highlighted with red circles are the ones described in the text. The other transition probabilities were set arbitrarily to keep the figure uniform and easy to read.
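The distinction between emission probabilities with and without pseudocounts (Table 2 versus Figure 2) can be illustrated with a minimal sketch. The function name, the example column, and the add-one (Laplace) pseudocount are assumptions made for illustration; the pseudocount scheme actually used for Table 2 may differ.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probabilities(column, pseudocount=0.0):
    """Estimate match-state emission probabilities from one alignment column.

    Gap characters are ignored. With pseudocount=0 the result is the plain
    observed frequency; with pseudocount=1 it is add-one (Laplace) smoothing,
    which is an assumption here, not necessarily the scheme behind Table 2.
    """
    counts = Counter(res for res in column.upper() if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    if total == 0:
        raise ValueError("column has no residues and no pseudocount")
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

# Insertion states in Figure 2 use a uniform background of 1/20 per residue.
INSERT_EMISSIONS = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

column = "VVVIV-L"  # hypothetical column, not taken from Figure 1
raw = emission_probabilities(column)                       # no pseudocounts
smoothed = emission_probabilities(column, pseudocount=1.0)  # add-one smoothing
```

Without pseudocounts, any amino acid absent from the column receives probability zero, so a query carrying that residue at this position would be assigned zero probability by the match state. Adding pseudocounts gives every residue a small nonzero probability.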

Table 3 Web resources of viral profile HMM databases and tools

Figure 3 Distribution of orthologous groups from vFam (ref. 35) (A) and pVOGs (ref. 36) (B) according to the viral families.

Notes: The number of profile HMMs/orthologous groups was counted for each viral family from the annotation provided in the database files. Profile HMMs in the original databases are derived from viruses of either a single family or multiple families.
Abbreviations: profile HMMs, profile hidden Markov models; pVOGs, Prokaryotic Virus Orthologous Groups; vFam, viral profile HMM database.

Table 4 Publicly available targeted assembly tools that use profile HMM seeds


Figure S1 Distribution of the number of proteins per orthologous group for vFam (ref. 1) (A) and pVOGs (ref. 2) (B).

Notes: Data were obtained from the annotation files provided by the database authors, and histograms were built with a bin size of 10. For readability, pVOGs data are shown only up to 1,000 proteins per orthologous group (only six groups contain more than that, up to a maximum of 8,131 proteins in the largest group).
Abbreviations: pVOGs, Prokaryotic Virus Orthologous Groups; vFam, viral profile HMM database.
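The binning described in the notes to Figure S1 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the example group sizes, and the choice of bin edges (each bin covering sizes k*10 to k*10 + 9) are hypothetical, since the caption gives only the bin size and the display cutoff.

```python
from collections import Counter

def bin_counts(sizes, bin_width=10, max_size=None):
    """Count orthologous groups per bin of protein counts.

    Each bin is keyed by its lower edge and covers
    lower_edge .. lower_edge + bin_width - 1 (an assumed convention).
    Groups larger than max_size are left out, mirroring the display cutoff.
    """
    hist = Counter()
    for n in sizes:
        if max_size is not None and n > max_size:
            continue
        hist[(n // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))

# Hypothetical numbers of proteins per orthologous group, not real data.
sizes = [2, 5, 9, 10, 14, 37, 1200]
hist = bin_counts(sizes, bin_width=10, max_size=1000)
```

In this sketch the group with 1,200 proteins is excluded by the cutoff, in the same way that the six largest pVOGs groups are omitted from the plotted range.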

Larkin MA, Blackshields G, Brown NP, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–2948.

Gibson T, Higgins D, Thompson J [homepage on the Internet]. General help for CLUSTAL X (2.0). Available from: http://www.clustal.org/download/clustalx_help.html. Accessed May 22, 2017.

Llorens C, Futami R, Covelli L, et al. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Res. 2011;39(Database issue):D70–D74.

Foley B, Leitner T, Apetrei C, et al, editors. HIV Sequence Compendium 2013. New Mexico: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory; 2013.

Skewes-Cox P, Sharpton TJ, Pollard KS, DeRisi JL. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS One. 2014;9(8):e105067.


Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2016;44(D1):D286–D293.

Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res. 2017;45(Database issue):D491–D498.

Kristensen DM, Cai X, Mushegian A. Evolutionarily conserved orthologous families in phages are relatively rare in their prokaryotic hosts. J Bacteriol. 2011;193(8):1806–1814.

Kristensen DM, Waller AS, Yamada T, Bork P, Mushegian AR, Koonin EV. Orthologous gene clusters and taxon signature genes for viruses of prokaryotes. J Bacteriol. 2013;195(5):941–950.
