---
title: "Retinal Organoids counts data: Gene and Isoform Level Exploration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Retinal Organoids counts data: Gene and Isoform Level Exploration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr-setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Required packages

```{r load-packages}
library(HumanRetinaLRSData)
library(SummarizedExperiment)
library(ggplot2)
library(ggrepel)
```

## Overview

`HumanRetinaLRSData` is a companion data package for our long-read
sequencing study characterising isoform diversity, splicing dynamics,
and allele-specific expression in developing human retinal organoids.

In this study we used nanopore-based long-read RNA sequencing to profile
human stem cell-derived retinal organoids across multiple developmental
stages, as well as purified retinal ganglion cells. These datasets enable
exploration of gene- and isoform-level expression patterns, detection of
differential isoform usage, identification of neuron-specific splicing
programs, and quantification of allelic imbalance.

All raw and processed data generated in the study have been deposited in
an Open Science Framework (OSF)
[repository](https://osf.io/z2yvs/overview). The `HumanRetinaLRSData`
package provides convenient access to these files directly from the OSF
node using the `osfr` package.

For ease of use, `HumanRetinaLRSData` is designed to:

- query and retrieve study datasets from the OSF project,
- cache downloaded archives via `BiocFileCache`, and
- load gene- and isoform-level counts into R for downstream
  visualisation and analysis.

## Load data

```{r load-data}
se_gene     <- ROGeneLevelData()
se_isoform  <- ROIsoformLevelData()
```

## Visualise gene-level expression

We perform PCA on log2-transformed CPM values and colour samples by
developmental stage.

```{r plot-gene-pca, fig.width=8, fig.height=6}
stage_colors <- c(
  "stage 1" = "orange",
  "stage 2" = "seagreen",
  "stage 3" = "purple"
)

expr_gene <- assay(se_gene, "cpm")
pca_gene  <- prcomp(t(log2(expr_gene + 1)), scale. = TRUE)
var_gene  <- round(
  pca_gene$sdev^2 / sum(pca_gene$sdev^2) * 100, 1
)

pca_df_gene <- data.frame(
  PC1    = pca_gene$x[, 1],
  PC2    = pca_gene$x[, 2],
  sample = colnames(expr_gene),
  stage  = colData(se_gene)[["stage"]]
)

ggplot(pca_df_gene, aes(x = PC1, y = PC2, color = stage)) +
  geom_point(size = 3) +
  geom_label_repel(
    aes(label = sample),
    size = 2.5, show.legend = FALSE
  ) +
  scale_color_manual(values = stage_colors, name = "Stage") +
  labs(
    title = "PCA of Gene-Level Expression",
    x     = paste0("PC1 (", var_gene[1], "%)"),
    y     = paste0("PC2 (", var_gene[2], "%)")
  ) +
  theme_bw(base_size = 12)
```

## Visualise isoform-level expression

```{r plot-isoform-pca, fig.width=8, fig.height=6}
expr_iso <- assay(se_isoform, "cpm")
pca_iso  <- prcomp(t(log2(expr_iso + 1)), scale. = TRUE)
var_iso  <- round(
  pca_iso$sdev^2 / sum(pca_iso$sdev^2) * 100, 1
)

pca_df_iso <- data.frame(
  PC1    = pca_iso$x[, 1],
  PC2    = pca_iso$x[, 2],
  sample = colnames(expr_iso),
  stage  = colData(se_isoform)[["stage"]]
)

ggplot(pca_df_iso, aes(x = PC1, y = PC2, color = stage)) +
  geom_point(size = 3) +
  geom_label_repel(
    aes(label = sample),
    size = 2.5, show.legend = FALSE
  ) +
  scale_color_manual(values = stage_colors, name = "Stage") +
  labs(
    title = "PCA of Isoform-Level Expression",
    x     = paste0("PC1 (", var_iso[1], "%)"),
    y     = paste0("PC2 (", var_iso[2], "%)")
  ) +
  theme_bw(base_size = 12)
```

## Interpretation

The PCA plots reveal:

- **Separation by stage**: Samples cluster by developmental stage,
  indicating distinct expression profiles.
- **Variance explained**: PC1 and PC2 capture the major sources of
  variation in the dataset.
- **Gene vs. isoform**: While samples cluster similarly by developmental
  stage at both levels, the proportion of variance explained by each PC
  differs. This reflects that isoform-level analysis may capture
  additional complexity from alternative splicing and transcript
  variation beyond what is seen at the gene level.

## Session information

```{r session-info}
sessionInfo()
```
