---
title: "Overview of PhyloProfileData"
author: "Hannah Mülbaier & Vinh Tran"
date: "`r Sys.Date()`"
package: PhyloProfileData
output:
    BiocStyle::html_document:
        toc: true
vignette: >
    %\VignetteIndexEntry{PhyloProfileData}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

# Introduction
The PhyloProfileData package contains two experimental datasets to illustrate 
running and analysing phylogenetic profiles with PhyloProfile pakage 
(Tran et al. 2018).

```{r}
library(ExperimentHub)
eh = ExperimentHub()
myData <- query(eh, "PhyloProfileData")
myData
```

# Phylogenetic profiles of AMPK-TOR pathway

The phylogenetic profiles of 147 human proteins in the AMPK-TOR pathway across 
83 species in the three domains of life were taken from the study of Roustan et 
al. 2016. 

This data set includes 3 files: 

- A phylogenetic profile input contains the human protein IDs, the taxaonomy IDs
of the 83 search species and the corresponding orthologous protein IDs in those 
species, together with two additional values, the forward and backward feature
architecture similarity (FAS) scores. **FAS approach** is an enhancement of 
**FACT** (Koestler et al. 2010) which give an idea how similar two proteins are
in term of functional equivalence. These compared protein features can be the 
functional PFAM (Finn et al. 2014) or SMART (Letunic et al. 2012) domains, the
transmembran domains, secondary structures or low complexity regions of the 
protein. FAS scores have a range between 0 and 1, where 1 is for a protein pair
that have identical architectures, and 0 in cases that two proteins are 
completely different in their architectures.

```{r, warning = FALSE, message = FALSE}
ampkTorPhyloProfile <- myData[["EH2544"]]
head(ampkTorPhyloProfile)
```

- A multiple fasta object contains the FASTA sequences for all the proteins 
present in the data set.

```{r, warning = FALSE, message = FALSE}
ampkTorFasta <- myData[["EH2545"]]
head(ampkTorFasta)
```

- A data frame contains the domain annotation for the proteins present in 
the phylogenetic profiles. The protein domain annotations were done using 
different annotation tools and databases, including *PFAM*, *SMART*, *CAST* 
(Promponas et al. 2000), *COILS2* (Lupas et al. 1991), *SEG* (Wootton and 
Federhen 1996), *SignalP* (Armenteros et al. 2019), and *TMHMM* (Krogh et al. 
2001). The annotation types together with their domain name and the 
corresponding start and end positions are stored in this domain data frame.

```{r, warning = FALSE, message = FALSE}
ampkTorDomain <- myData[["EH2546"]]
head(ampkTorDomain)
```

# Phylogenetic profiles of BUSCO Arthropoda proteins

One fundamental step in establishing the phylogenetic profiles is searching 
orthologs for the query proteins in different taxa of interest. 
**HaMStR-oneseq**, an extended version of **HaMStR** (Ebersberger et al. 2009), 
has been shown to be an promising approach for sensitively predicting orthologs 
even in the distantly related taxa from the query species, which is required for
the phylogenetic profiling of a broad range of taxa through all domains of the 
species tree of life. One main parameter for HaMStR-oneseq is the core ortholog 
group, the starting point for the orthology search. In order to set up a 
reliable core ortholog set that can be used for further phylogenetic profiling 
studies, we made use of the well-known **BUSCO** datasets (Simão et al. 2015). 
Here we represent the phylogenetic profiles of 1011 ortholog groups across 88 
species, which was calculated from the BUSCO arthropoda dataset downloaded from 
https://busco.ezlab.org/datasets/arthropoda_odb9.tar.gz in Jan. 2018. The 88 
species include 10 arthropoda species (*Ladona fulva*, *Agrilus planipennis*, 
*Polypedilum vanderplanki*, *Daphnia magna*, *Harpegnathos saltator*, 
*Zootermopsis nevadensis*, *Halyomorpha halys*, *Heliconius melpomene*, 
*Stegodyphus mimosarum*, *Drosophila willistoni*) downloaded from orthoDB 
version 10 (https://www.orthodb.org) and 78 species of the Quest for Ortholog
dataset (Altenhoff et al. 2016).

This dataset includes 3 files:

- A phylogenetic profile input contains 1011 BUSCO ortholog group IDs, the 
taxaonomy IDs of the 88 searched species and the corresponding orthologous 
protein IDs in those species, together with two additional values, the forward 
and backward FAS scores which were described above in the description of the 
AMPK-TOR pathway dataset.

```{r, warning = FALSE, message = FALSE}
arthropodaPhyloProfile <- myData[["EH2547"]]
head(arthropodaPhyloProfile)
```

- A multiple fasta object contains the FASTA sequences for all the proteins 
present in the data set.

```{r, warning = FALSE, message = FALSE}
arthropodaFasta <- myData[["EH2548"]]
head(arthropodaFasta)
```

- A data frame contains the domain annotation for the proteins present in 
the phylogenetic profiles. The protein domain annotations were done using 
different annotation tools and databases, including *PFAM*, *SMART*, *CAST*, 
*COILS2*, *SEG*, *SignalP*, and *TMHMM*. The annotation types together with 
their domain name and the corresponding start and end positions are stored in 
this domain data frame.

```{r, warning = FALSE, message = FALSE}
arthropodaDomain <- myData[["EH2549"]]
head(arthropodaDomain)
```

# SessionInfo()

Here is the output of `sessionInfo()` on the system on which this document was
compiles:

```{r sessionInfo, echo = FALSE}
sessionInfo(package = "PhyloProfileData")
```

# References
1. Armenteros, JJA. et al. (2019) SignalP 5.0 improves signal peptide 
predictions using deep neural networks. Nature Biotechnology, 37, 420–423.
2. Altenhoff, AM. et al. (2016) Standardized benchmarking in the quest for 
orthologs. Nature Methods, 13, 425–430.
3. Ebersberger, I. et al. (2009) HaMStR: profile hidden markov model based 
search for orthologs in ESTs. BMC Evol Biol., 9, 157
4. Finn, RD. (2014) Pfam: The protein families database. Nucleic Acids Res., 
42, D222-30
5. Koestler, T. et al. (2010) FACT: functional annotation transfer between 
proteins with similar feature architectures. BMC Bioinformatics, 11, 417.
6. Kriventseva, EK. et al.(2018) OrthoDB v10: sampling the diversity of animal, 
plant, fungal, protist, bacterial and viral genomes for evolutionary and 
functional annotations of orthologs. Nucleic Acids Res., 47(D1), D807-D811.
7. Krogh, A. et al. (2001) Predicting transmembrane protein topology with a 
hidden Markov model: application to complete genomes. J Mol Biol., 305(3), 
567-80.
8. Letunic, I. et al. (2012) SMART 7: recent updates to the protein domain 
annotation resource. Nucleic Acids Res., 40, D302-5.
9. Lupas, A. et al. (1991) Predicting Coiled Coils from Protein Sequences. 
Science, 252, 1162-1164.
10. Promponas, VJ. et al. (2000) CAST: an iterative algorithm for the 
complexity analysis of sequence tracts. Bioinformatics, 16(10), 915–922.
11. Roustan, V. et al. (2016) An evolutionary perspective of AMPK–TOR signaling 
in the three domains of life. Journal of Experimental Botany, 67(13), 3897–3907.
12. Simão, F. et al. (2015) BUSCO: assessing genome assembly and annotation 
completeness with single-copy orthologs. Bioinformatics, 31(19), 3210-2.
13. Tran, NV. et al. (2018) PhyloProfile: dynamic visualization and exploration 
of multi-layered phylogenetic profiles. Bioinformatics, 34(17), 3041–3043.
14. Wootton, J. and  Federhen, S. (1996) Analysis of compositionally biased 
regions in sequence databases. Methods in Enzymol., 266, 554-571.