---
title: "Introduction to MorphSim"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to MorphSim}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

MorphSim is an R package for simulating discrete morphological character data
along phylogenetic trees. It is designed to integrate with existing R packages
such as TreeSim and FossilSim, enabling simulation of datasets that include
both extant and extinct taxa. This vignette walks through the main features of
the package.

## Getting started

To simulate morphological data you first need a phylogenetic tree. MorphSim
accepts either a tree with branch lengths in evolutionary distance or a
time-scaled tree. Here we simulate a birth-death tree using TreeSim.

```{r tree}
library(MorphSim)
library(TreeSim)
library(FossilSim)

set.seed(1234)
tree <- TreeSim::sim.bd.taxa(n = 10, numbsim = 1,
                              lambda = 0.1, mu = 0.05)[[1]]
```

## Basic simulation

The core function is `sim.morpho`. At minimum you need a tree, the number of
character states (`k`), the total number of traits (`trait.num`), and if using
a time tree, the branch rates (`br.rates`).

```{r basic}
morpho_data <- sim.morpho(
  time.tree = tree,
  k = 2,
  trait.num = 10,
  br.rates = 0.1
)
morpho_data
```

## The morpho object

All simulation output is stored in a `morpho` object. This is a list with the
following components:

**`sequences`** — a list with up to three elements:

- `tips`: character states at the tips of the tree. Stored as a named list
  where each element is a vector of trait values for one taxon.
- `nodes`: character states at internal nodes (when `ancestral = TRUE`).
- `SA`: character states for sampled ancestors (when a fossil object is
  provided). Named using the specimen number and the branch number along
  which the fossil was sampled.

**`trees`** — a list containing:

- `EvolTree`: the tree with branch lengths in evolutionary distance.
- `TimeTree`: the time-scaled tree (if provided).
- `BrRates`: the branch rates used for simulation.

**`model`** — a list describing the model used:

- `Specified`: a string summarising the model per partition, including the
  number of traits and character states.
- `RateVar`: the discrete rate categories drawn from the specified
  distribution (if ACRV was used).
- `RateVarTrait`: which rate category was assigned to each trait. Values
  are listed from lowest rate (1) to highest.

**`transition_history`** — a list of data frames, one per trait. Each data
frame records every state transition that occurred during simulation,
including the branch number (`edge`), the new state (`state`), and the
position along the branch where the transition occurred (`hmin`).

**`root.states`** — a vector of root states, one per trait.

**`fossil`** — the FossilSim fossil object (if provided). The naming scheme
matches that of FossilSim.

You can check whether an object is a morpho object using `is.morpho()`.

```{r morpho_check}
is.morpho(morpho_data)
```

### Accessing data

MorphSim provides helper functions for extracting data from the morpho object.

```{r access}
# Tip data as a matrix
get.matrix(morpho_data, seq = "tips")

# Transition history for a specific trait (including root state)
get.transitions(morpho_data, trait = 1)
```

## Clock models

A single value for `br.rates` applies a strict clock. A vector of rates
(one per branch) applies a relaxed clock. You can use the SimClock package
to generate branch rates under different clock models.

```{r clock}
# Strict clock
strict_data <- sim.morpho(time.tree = tree, k = 2, trait.num = 10,
                           br.rates = 0.1)

# Relaxed clock (random rates per branch for illustration)
relaxed_rates <- runif(nrow(tree$edge), min = 0.01, max = 0.2)
relaxed_data <- sim.morpho(time.tree = tree, k = 2, trait.num = 10,
                            br.rates = relaxed_rates)
```

## Model extensions

### MkV model

Setting `variable = TRUE` ensures all simulated characters are variable
across taxa, matching the MkV model of Lewis (2001).

```{r mkv}
mkv_data <- sim.morpho(time.tree = tree, k = 2, trait.num = 10,
                        br.rates = 0.1, variable = TRUE)
```

### Among-character rate variation (ACRV)

Not all morphological characters evolve at the same rate. MorphSim can
model this using among-character rate variation (ACRV), where each trait is
assigned a rate drawn from a discretized distribution. Two distributions
are available.

**Gamma distribution** — controlled by two parameters: `alpha.gamma`, the
shape parameter of the gamma distribution (smaller values produce greater
rate heterogeneity, with more characters evolving very slowly or very
quickly; larger values concentrate rates closer to the mean), and
`ACRV.ncats`, the number of discrete rate categories used to approximate
the continuous distribution (commonly set to 4, following standard practice
in phylogenetic inference).
```{r acrv_gamma}
gamma_data <- sim.morpho(
  time.tree = tree, k = 2, trait.num = 10, br.rates = 0.1,
  ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4
)

# The discrete rate categories
gamma_data$model$RateVar

# Which category was assigned to each trait
gamma_data$model$RateVarTrait
```

**Lognormal distribution** — controlled by `meanlog` and `sdlog`, the mean
and standard deviation on the log scale, and `ACRV.ncats` as above. The
lognormal can produce heavier tails than the gamma, meaning a greater
spread between the slowest and fastest evolving characters.
```{r acrv_lgn}
lgn_data <- sim.morpho(
  time.tree = tree, k = 2, trait.num = 10, br.rates = 0.1,
  ACRV = "lgn", meanlog = 0, sdlog = 1, ACRV.ncats = 4
)

# The discrete rate categories
lgn_data$model$RateVar
```

**User-defined rates** — you can also provide your own rate categories
directly using `ACRV = "user"` and passing a vector of rates to
`define.ACRV.rates`.
```{r acrv_user}
user_data <- sim.morpho(
  time.tree = tree, k = 2, trait.num = 10, br.rates = 0.1,
  ACRV = "user", define.ACRV.rates = c(0.5, 1, 1.5, 2),
  ACRV.ncats = 4
)
```

### MkV + Gamma

These can be combined to simulate under an MkV + Gamma model, consistent
with models implemented in RevBayes and BEAST.

```{r mkv_gamma}
mkvg_data <- sim.morpho(
  time.tree = tree, k = 3, trait.num = 10, br.rates = 0.1,
  variable = TRUE, ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4
)
```

### Custom Q matrices

You can provide your own transition rate matrix. For example, an ordered
model where transitions can only occur between adjacent states (0 ↔ 1 ↔ 2):

```{r custom_q}
ord_Q <- matrix(c(
  -0.5, 0.5, 0.0,
   0.3333333, -0.6666667, 0.3333333,
   0.0, 0.5, -0.5
), nrow = 3, byrow = TRUE)

ordered_data <- sim.morpho(
  time.tree = tree, k = 3, trait.num = 10, br.rates = 0.1,
  define.Q = ord_Q
)
```

Note that when using a custom Q matrix, all traits in that simulation share
the same matrix. If you want to simulate different partitions under different
Q matrices, for example, some ordered and some unordered characters, you
can simulate them separately and combine the results. See the section on
combining morpho objects below.

### Specifying the root state

By default the root state is sampled uniformly. You can fix it using
`root.state`:

```{r root_state}
fixed_root <- sim.morpho(
  time.tree = tree, k = 3, trait.num = 10,
  br.rates = 0.1, root.state = 0
)
```

### Ensure full state space

Setting `full.states = TRUE` ensures that all character states specified by `k`
are present in at least one tip for each trait. This is useful when you want to guarantee
that the full range of states specified in the Q matrix is observed at the
tips. Without this, it is possible for a character with `k = 3` to only
show states 0 and 1 at the tips if state 2 was never reached or was lost
before the present.
```{r strict}
strict_data <- sim.morpho(
  time.tree = tree, k = 3, trait.num = 10,
  br.rates = 0.1, full.states = TRUE
)
```

### Partitions

Different groups of characters can be simulated with different numbers of
states. The `partition` argument specifies how many traits belong to each
partition, and `k` gives the number of states per partition.

```{r partitions}
part_data <- sim.morpho(
  time.tree = tree,
  k = c(2, 3, 4),
  trait.num = 20,
  partition = c(10, 5, 5),
  br.rates = 0.1
)

# Model specification per partition
part_data$model$Specified
```

The total number of traits defined by trait.num and partition must match or the function will return an error.

### Combining morpho objects

It is not possible to simulate multiple partitions under different models in
a single call to `sim.morpho`. For example, you might want some characters
to evolve under an ordered model with a custom Q matrix and others under a
standard Mk model with gamma rate variation. To do this, simulate each
partition separately and combine them using `combine.morpho`.

```{r combine}
# Partition 1: ordered characters with a custom Q matrix
ord_Q <- matrix(c(
  -0.5, 0.5, 0.0,
   0.3333333, -0.6666667, 0.3333333,
   0.0, 0.5, -0.5
), nrow = 3, byrow = TRUE)

partition_1 <- sim.morpho(
  time.tree = tree, k = 3, trait.num = 10,
  br.rates = 0.1, define.Q = ord_Q
)

# Partition 2: unordered binary characters with MkV + Gamma
partition_2 <- sim.morpho(
  time.tree = tree, k = 2, trait.num = 10,
  br.rates = 0.1, variable = TRUE,
  ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4
)

# Combine into a single morpho object
combined <- combine.morpho(partition_1, partition_2)

# The combined object has 20 traits total
length(combined$sequences$tips[[1]])

# Model specifications from both partitions are preserved
combined$model$Specified
```

Both morpho objects must share the same tree, branch lengths, and fossil
object (if present). The combined object merges the sequences, transition
histories, root states, and model information from both inputs.

## FossilSim integration

MorphSim integrates with FossilSim to simulate character data for sampled
ancestors, fossils that represent direct ancestors of other lineages.
Because MorphSim tracks where transitions occur along branches, it generates
trait values appropriate to the time and lineage at which each fossil was
sampled.

```{r fossils}
# Simulate fossil and extant sampling
f <- FossilSim::sim.fossils.poisson(rate = 0.1, tree = tree,
                                     root.edge = FALSE)
f2 <- FossilSim::sim.extant.samples(fossils = f, tree = tree, rho = 0.5)

# Simulate morphological data with fossils
fossil_data <- sim.morpho(
  time.tree = tree, k = c(2,3), trait.num = 10, partition = c(5,5),
  br.rates = 0.1, fossil = f2
)

# Sampled ancestor data
names(fossil_data$sequences$SA)
```

## Missing data

Fossil morphological data are often incomplete. `sim.missing.data` allows
you to introduce missing characters under different scenarios, replacing
character values with `"?"`.

### Random missing data

Characters are removed uniformly at random across the entire matrix. A
single probability value controls the proportion of data that is removed.

```{r missing_random}
missing_random <- sim.missing.data(
  data = fossil_data, method = "random",
  seq = "tips", probability = 0.3
)
```

### Missing data by partition

Different partitions can have different probabilities of missing data. This
reflects how different anatomical regions may have different preservation
potential, for example, robust skeletal elements are more likely to be
preserved than soft tissue features. The probability vector must match the
number of partitions.

```{r missing_partition}
part_fossil <- sim.morpho(
  time.tree = tree, k = c(2, 3), trait.num = 10,
  partition = c(5, 5), br.rates = 0.1, fossil = f2
)

# Hard tissues (partition 1): 10% missing
# Soft tissues (partition 2): 70% missing
missing_part <- sim.missing.data(
  data = part_fossil, method = "partition",
  seq = "tips", probability = c(0.1, 0.7)
)
```

### Missing data by rate category

When data is simulated with ACRV, characters can be removed according to
the rate category they were simulated under. The probability vector must
match the number of rate categories. This can be used to explore what
happens when faster or slower evolving characters are preferentially lost.

```{r missing_rate}
gamma_fossil <- sim.morpho(
  time.tree = tree, k = 2, trait.num = 20, br.rates = 0.1,
  ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4, fossil = f2
)

# Slow characters preserved, fast characters increasingly lost
missing_rate <- sim.missing.data(
  data = gamma_fossil, method = "rate",
  seq = "tips", probability = c(0, 0.2, 0.5, 0.8)
)
```

### Missing data by specific traits

Remove characters from specific traits. This is useful for exploring the
impact of losing particular characters.

```{r missing_trait}
missing_trait <- sim.missing.data(
  data = fossil_data, method = "trait",
  seq = "tips", probability = 1, traits = c(1, 2, 3)
)
```

### Missing data by specific taxa

Remove characters from named taxa. This can be used to simulate scenarios
where certain specimens are poorly preserved.

```{r missing_taxa}
# Remove all data from the first two taxa
taxa_to_remove <- names(fossil_data$sequences$tips)[1:2]
missing_taxa <- sim.missing.data(
  data = fossil_data, method = "taxa",
  seq = "tips", probability = 1, taxa = taxa_to_remove
)
```

### Missing data from extinct taxa only

Remove characters only from extinct taxa while leaving extant taxa complete.
This reflects the common scenario where fossil specimens have more missing
data than living species.

```{r missing_extinct, eval = FALSE}
missing_extinct <- sim.missing.data(
  data = fossil_data, method = "extinct",
  seq = "tips", probability = 0.5
)
```

### Missing data for specific traits and taxa

Remove characters from specific traits and specific taxa.
This reflects scenarios where certain anatomical regions are less likely to
be preserved in particular lineages.
```{r missing_trait_taxa}
missing_trait_taxa <- sim.missing.data(
  data = fossil_data, method = "trait_taxa",
  seq = "tips", probability = 0.8,
  traits = c(1, 2, 3),
  taxa = taxa_to_remove
)
```

## Exploring simulated data

### Summary statistics

`stats.morpho` computes the consistency index, retention index, identifies
convergent traits, and summarises the tree structure. The consistency index (CI; Kluge and Farris, 1969) measures the minimum number of character changes required on a tree relative to the observed number of changes, with values closer to 1 indicating less homoplasy. The retention index (RI; Farris, 1989) quantifies how well synapomorphies are retained on the tree, with higher values indicating stronger phylogenetic signal. Both metrics range from 0 to 1.

```{r stats}
stats <- stats.morpho(fossil_data)
stats$Statistics
stats$Tree
```

### Convergence detection

`get.convergent` identifies traits where the same state arose independently
more than once. Called without a trait number, it returns a summary across
all traits. Called with a specific trait, it shows which tips inherited their
state from which evolutionary origin. Tips sharing the same origin inherited
their state from a single transition event. Tips with the same state but
different origins represent convergent evolution.

```{r convergent}
# Which traits are convergent?
get.convergent(fossil_data)

# Detail for a specific trait
get.convergent(fossil_data, trait = 1)
```

### Transition histories

`get.transitions` returns the full transition history for a trait, including
the root state (shown as edge 0).

```{r transitions}
get.transitions(fossil_data, trait = 1)
```

## Plotting

### Trait history on the tree

`plot.morpho` displays the evolutionary history of a trait along the tree.
Transition boxes show where state changes occurred along each branch. We
start with the simplest version, just the trait history on the distance
tree, with all additional features turned off.

```{r plot_basic, fig.height = 6}
plot(fossil_data, trait = 6, timetree = FALSE,
     show.fossil = FALSE, reconstructed = FALSE,
     show.convergent = FALSE)
```

### Adding fossils

When `show.fossil = TRUE`, fossils are plotted along the branches of the
tree at the time they were sampled. Extant taxa are shown as green circles
at the tips and fossil samples as black diamonds along the branches. This
requires a time tree, so we also set `timetree = TRUE`.

```{r plot_fossils, fig.height = 6}
plot(fossil_data, trait = 6, timetree = TRUE,
     show.fossil = TRUE, reconstructed = FALSE,
     show.convergent = FALSE)
```

### Showing the reconstructed tree

When `reconstructed = TRUE`, the plot distinguishes between the full
(true) tree and the reconstructed tree, the tree containing only the
sampled lineages. Branches that are part of the reconstructed tree are
drawn in black, while unsampled lineages are shown in grey. This helps
visualise how much of the true evolutionary history is captured by the
available samples.

```{r plot_recon, fig.height = 6}
plot(fossil_data, trait = 6, timetree = TRUE,
     show.fossil = TRUE, reconstructed = TRUE,
     show.convergent = FALSE)
```

### Highlighting convergent evolution

When `show.convergent = TRUE`, transition boxes for any state that arose
independently more than once are highlighted in blue. This includes
transitions that were subsequently reversed on descendant branches, giving
a complete picture of where convergence occurred during the simulation.

```{r plot_convergent, fig.height = 6}
plot(fossil_data, trait = 6, timetree = TRUE,
     show.fossil = TRUE, reconstructed = TRUE,
     show.convergent = TRUE)
```

By default, `plot.morpho` shows all available features, the time tree,
fossils, reconstructed tree, and convergence highlighting. If any of these
are not available in the morpho object (e.g., no fossil data was provided),
the plot falls back gracefully and informs the user. So the plot above is
equivalent to the default:

```{r plot_default, eval = FALSE}
plot(fossil_data, trait = 7)
```

### Character matrix

`plotMorphoGrid` displays the full character matrix at the tips.

```{r grid, fig.height = 5}
plotMorphoGrid(fossil_data, seq = "tips")
```

## Exporting data

`write.morpho` provides a single interface for exporting trees, character
matrices, and fossil ages. The `type` argument specifies what to export,
and `reconstructed` controls whether the full or reconstructed version is
written.

```{r export, eval = FALSE}
# Full tree
write.morpho(fossil_data, file = "tree.tre", type = "tree")

# Reconstructed tree
write.morpho(fossil_data, file = "recon_tree.tre", type = "tree",
             reconstructed = TRUE)

# Character matrix
write.morpho(fossil_data, file = "matrix.nex", type = "matrix")

# Reconstructed matrix
write.morpho(fossil_data, file = "recon_matrix.nex", type = "matrix",
             reconstructed = TRUE)

# Fossil ages (compatible with RevBayes)
write.morpho(fossil_data, file = "ages.tsv", type = "ages")

# With age uncertainty
write.morpho(fossil_data, file = "ages.tsv", type = "ages",
             uncertainty = 2)
```

Standard ape export functions also work since the morpho object stores
trees as `phylo` objects:

```{r export_ape, eval = FALSE}
ape::write.tree(fossil_data$trees$EvolTree, file = "full_tree.tre")
ape::write.nexus.data(fossil_data$sequences$tips, file = "full_matrix.nex",
                      format = "standard")
```

## Using with ggtree

The morpho object can be converted to a `treedata` object for use with
ggtree. This requires the treeio, ggtree, tibble, and ggplot2 packages.

```{r ggtree, eval = FALSE}
library(treeio)
library(ggtree)
library(ggplot2)
library(tibble)

# Convert tip data to a data frame
tip_df <- as.data.frame(t(as.data.frame(fossil_data$sequences$tips)))
colnames(tip_df) <- paste0("trait_", seq_len(ncol(tip_df)))
tip_df$node <- match(rownames(tip_df),
                     fossil_data$trees$EvolTree$tip.label)

# Create treedata object
td <- treedata(phylo = fossil_data$trees$EvolTree,
               data = as_tibble(tip_df))

# Plot tree with coloured tip points for one trait
ggtree(td) +
  geom_tiplab(offset = 0.01) +
  geom_tippoint(aes(color = factor(trait_1)), size = 3) +
  scale_color_brewer(palette = "Set2", name = "State") +
  theme(legend.position = "right")

# Plot tree with heatmap of all traits
heat_df <- tip_df[, grep("trait_", colnames(tip_df))]
heat_df[] <- lapply(heat_df, factor)

gheatmap(ggtree(td) + geom_tiplab(offset = 0.05),
         heat_df, offset = 0.02, width = 0.5,
         colnames_angle = 90, hjust = 1) +
  scale_fill_brewer(palette = "Set2", name = "State")
```
