How to use perplexity to select an optimal VBEM prior for RNA transcript quantification with Salmon

Spencer Skylar Chan

Introduction

Suppose you performed RNA sequencing on some cells to extract the RNA fragments inside them. You now have a sample of RNA fragments (reads), and you want to know how many RNAs came from each transcript of interest. Counting RNAs is useful for downstream analyses such as differential expression, which measures the up- and down-regulation of the genes that produced the RNAs. Knowing which genes are more or less expressed is useful for studying diseases such as cancer. The standard approach is RNA transcript abundance estimation, which can be summarized as follows:

Input: RNA reads and a transcriptome.

Output: the expected number of fragments mapping to each transcript; equivalently, the expected number of RNAs generated per transcript in the transcriptome.

One RNA transcript quantification tool is Salmon, which is maintained by Professor Rob Patro at the University of Maryland. Salmon can perform transcript quantification using Variational Bayes Expectation Maximization (VBEM). When using VBEM, Salmon has the flag --vbprior, which is the per-nucleotide VBEM prior. According to the Salmon documentation:

The default prior used in the VB optimization is a per-nucleotide prior of 1e-5 reads per-nucleotide. This means that a transcript of length 100000 will have a prior count of 1 fragment, while a transcript of length 50000 will have a prior count of 0.5 fragments, etc.
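In other words, the prior count for a transcript is simply its length times the per-nucleotide prior. A tiny illustrative Python snippet (not part of Salmon):

def prior_count(transcript_length, vb_prior=1e-5):
    # Per-transcript prior count = length in nucleotides * per-nucleotide prior.
    return transcript_length * vb_prior

print(prior_count(100_000))  # 1.0 fragment
print(prior_count(50_000))   # 0.5 fragments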

This prior is essentially the background expression expected in the sample. A smaller prior results in a sparser estimate; a larger prior results in more non-zero estimates. It is important to select a VBEM prior that produces the best estimate for a sample, because increasing the quality of the estimate increases the quality of any downstream analysis and applications. This makes the prior a hyperparameter: a parameter that is not estimated from the data but that affects the model's predictions. The general strategy for hyperparameter selection is to run a model repeatedly on the same input data with varying hyperparameters, then select the hyperparameter whose output has the lowest measure of error. This is essentially model selection in machine learning, so in a sense, VBEM prior selection is a learning problem.
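As a sketch, the selection loop is just the following, where run_model and evaluate_error are hypothetical placeholders for the quantifier and the error metric:

def select_hyperparameter(candidates, run_model, evaluate_error):
    # Try each candidate value, score the model's output, keep the best.
    best_value, best_error = None, float("inf")
    for value in candidates:
        estimate = run_model(value)        # e.g. quantify with this prior
        error = evaluate_error(estimate)   # e.g. some error metric
        if error < best_error:
            best_value, best_error = value, error
    return best_value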

We could make it a supervised learning problem and evaluate estimates directly against a known transcript expression distribution (the ground truth). But we usually don't have the ground truth. With simulated data, we can compare the transcript abundance estimate to the ground truth from which we simulated the data, using a metric like Spearman correlation. But simulated reads are not as interesting as real reads extracted from cells. With real reads, we usually don't know the ground truth; finding it is the whole point! We could run qPCR with fluorescent primers for the transcripts of interest and measure the fluorescence intensity (more highly expressed genes will have higher intensity). But qPCR is not the most reliable way to count RNAs, so most people send reads for sequencing instead of running qPCR on them.

Since we can't rely on the ground truth for evaluating transcript abundance estimates, let's evaluate estimates without the ground truth! In other words, let's make VBEM prior selection an unsupervised learning problem. We described in a paper how to measure the accuracy of an estimate using a "quantify-then-validate" approach. The evaluation metric we use is an adaptation of perplexity from natural language processing to transcriptomics. In this context, perplexity is the inverse geometric mean per-read likelihood of a held-out test set. To compute perplexity, we split our reads into a training set and a test set. We quantify an abundance estimate on the training set, holding out the test set. Then we validate the estimate by computing its perplexity with respect to the held-out reads. Figure 2 of our paper shows the quantify-then-validate approach with an additional smoothing step, which we'll discuss later.
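In code, the definition looks like this minimal sketch (the actual binary computes likelihoods over equivalence classes rather than individual reads, as discussed later):

import numpy as np

def perplexity(read_probs):
    # Inverse geometric mean of the per-read likelihoods:
    # exp(-(1/N) * sum_i log p(read_i))
    log_p = np.log(np.asarray(read_probs, dtype=float))
    return float(np.exp(-np.mean(log_p)))

# Four held-out reads with these likelihoods under the estimate:
print(perplexity([0.01, 0.02, 0.005, 0.01]))  # 100.0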

The lower the perplexity, the better the abundance estimate describes the held-out reads. Perplexity is therefore a proxy for the goodness of an abundance estimate; you can think of it as the error metric for abundance estimates. We show in the paper that in experimental data, the estimates with the lowest perplexity correlate best with qPCR measurements. In simulated data, perplexity achieves a local minimum in the neighborhood of the default VBEM prior value and corresponds well with other measures of accuracy based on the ground truth and with differential expression analysis.

The moral of the story is that hyperparameter selection for transcript quantification algorithms is now possible. In the case of Salmon and the VBEM prior, the problem becomes:

Input: RNA reads and a transcriptome.

Output: the best VBEM prior, i.e. the one that generates the abundance estimate with the smallest perplexity.

In this tutorial I will demonstrate hyperparameter selection of the VBEM prior with perplexity and k-fold cross-validation, in the absence of ground truth.

Setup

Open a shell and clone my perplexity repository, then check out the tutorial branch.

git clone https://github.com/schance995/perplexity.git
cd perplexity
git checkout tutorial

This repository is a fork of the COMBINE lab's perplexity repository, and it also contains some modified workflows from the perplexity-paper repository.

The source code for the perplexity program is included. It's written in Rust, which was chosen for its speed, memory safety, and excellent compiler and documentation. Install Rust if you don't have it already.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs -o rustup-installer.sh
# check that the script looks ok
less ./rustup-installer.sh
# run the installer
sh ./rustup-installer.sh
# install the stable Rust toolchain and set it as the default
rustup default stable

Installing Rust makes the cargo package manager available. Use cargo to compile the perplexity binary; it will download the needed dependencies (crates) for you.

cargo build --release

Next we'll install some Python and R packages using the Conda package manager. If you don't already have Conda, you can install Miniconda, which provides Conda without the extra packages we don't need for this tutorial.

Once you've installed Conda, install Mamba, which resolves dependencies much faster than Conda. You can also install Mamba standalone if you like, but for now we'll install Mamba through the Conda repositories.

conda install mamba -n base -c conda-forge

Mamba has a limitation where it cannot create new environments from an environment file. As a workaround, create a new environment, then update it with env.yml:

conda env create -n tutorial
mamba env update -n tutorial --file env.yml

Activate the environment to make the packages available in your shell.

conda activate tutorial

Now we are ready to configure the Snakemake pipeline.

Configuration of the Snakemake pipeline

The code to perform VBEM prior selection is a Snakemake pipeline. Snakemake is like GNU Make, but for reproducible data science: it generates all output files from the specified input files and runs the intermediate steps in parallel and in the correct order (dependency resolution).
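If you haven't used Snakemake before, here is a toy rule (illustrative, not from this repository) showing the idea: each rule declares its inputs, outputs, and command, and Snakemake schedules rules by matching the files they produce and consume.

rule sort_file:
    # Snakemake runs this rule whenever sorted.txt is requested
    # and is missing or older than raw.txt.
    input: "raw.txt"
    output: "sorted.txt"
    shell: "sort {input} > {output}"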

Before running the pipeline, edit snakefiles/config.yml, the configuration file that declares the inputs to the analysis. The most important lines are the ones declaring the transcriptome, the directory with the reads, and which samples to use.

# results will go in {out-dir}/{exp-name}
out-dir: output
exp-name: tutorial

# expects paired end reads with format:
#    {reads-dir}/{sample_name}_1.fasta
#    {reads-dir}/{sample_name}_2.fasta
read_file_fmt: fasta
# replace with your reads directory
reads-dir: /fs/cbcb-lab/rob/students/jason/shared/SRR1265495/
# replace with your sample names to quantify (can be more than 1)
sample-names:
  - sample_01
  # - sample_02

# Path to reference transcriptome for salmon (replace with your transcriptome)
txome-path: /fs/cbcb-lab/rob/students/jason/shared/annotations/hg19_clean.fa

# where to find folds, recommended to put this in {current working directory}/output/tutorial/folds
# for example my working directory is /fs/cbcb-lab/rob/students/skylar/perplexity
folds-dir: /fs/cbcb-lab/rob/students/skylar/perplexity/output/tutorial/folds

# Path to perplexity binary, and kfolds.sh script
perplexity-bin: target/release/perplexity
kfold-script: scripts/kfolds.sh

# Number of folds
k: 5

# VBEM priors to test with
# The default value is 1e-5
prior-weights:
  - 1e-6
  - 1e-5
  - 1e-4
  - 1e-3
  - 1e-2
  - 1e-1
  - 1e0
  - 1e1
  - 1e2
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7 
  - 8
  - 9

Run the pipeline

Generate k folds

The first step of the pipeline is to create k folds for cross-validation. For this tutorial we'll use k=5. Specify the Snakefile to run with --snakefile and the config file you wrote earlier with --configfile. It is a good idea to use the -n flag first to do a dry run in case there are any errors.

# check the files that this snakefile will produce
snakemake --snakefile snakefiles/kfold.snk --configfile <config file> -n

The output should list the rules that will be run and the files that will be generated by this Snakefile. This Snakefile prepares train and test sets suitable for k-fold cross-validation using a helper script [1]. It will multiply the disk space used by your reads by k, which should not be a problem on a cluster but might be on a small hard drive.
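Conceptually, the split looks like the following Python sketch; the actual work is done on the FASTA files by scripts/kfolds.sh.

import numpy as np

def kfold_indices(n_reads, k=5, seed=0):
    # Shuffle the read indices, cut them into k folds, and yield
    # (train, test) index sets with each fold held out once.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_reads), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]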

When you are ready to run the snakefile, remove -n and add -j <n cores> to the command, where <n cores> is the maximum number of jobs you want Snakemake to run at any given time. If you specify -j without a number, all available cores will be used. This is fine if you have sufficient memory and CPU available [2], but if you run many large samples, the Linux out-of-memory killer might terminate Salmon and other memory-hungry programs. For example, to run the pipeline with at most 8 jobs at a time:

# run the pipeline for real
snakemake --snakefile snakefiles/kfold.snk --configfile <config file> -j 8

This will take a few minutes per sample, depending on the size of your data.

Transcript quantification, cross-validation, and perplexity calculation

After the kfolds snakefile finishes, let's go through the same routine for the perplexity snakefile.

# check the files that this snakefile will produce
snakemake --snakefile snakefiles/perplexity.snk --configfile <config file> -n

There are many steps in this Snakefile. First, the transcriptome is indexed. Then Salmon computes transcript abundance estimates for the training reads of each fold and each VBEM prior. Salmon also counts the equivalence classes for the test reads of each fold [3]. Finally, the perplexity binary computes the perplexity of each fold's training estimate against its test equivalence classes, for every prior.

# run the pipeline for real
snakemake --snakefile snakefiles/perplexity.snk --configfile <config file> -j <n jobs>

This will take longer than creating the k-folds, so you might want to grab a cup of coffee (or tea).

Inspect the perplexity results

For the purposes of this tutorial I've included the perplexity results in the example-output directory of the repository. Let's take a look at one of the perplexity output files, for example example-output/tutorial/perplexity/sample_01/vbprior=1e-5/beta=1e-8/1/perplexity.yml:

---
smoothed_perplexity: 1500.9738357618098
n_ecs: 188424
n_reads: 3035878
n_possible_ecs: 185426
n_possible_reads: 3032506
n_impossible_ecs: 2998
n_impossible_reads: 3372
n_discarded_ecs: 665
n_discarded_reads: 733
class_ec_freq:
...

This abundance estimate was generated with a VBEM prior of 1e-5 (the default), and the perplexity is about 1501. There's also extra information about the kinds of reads and equivalence classes encountered in this sample.

You might wonder why it says smoothed perplexity. The reason is impossible reads: reads in the test set that have zero probability of being generated from the training-set estimate, which makes the perplexity undefined. In Figure 1 of our paper, we show that impossible reads occur rather frequently. The biggest problem with impossible reads is that plain perplexity calls abundance estimates with any number of impossible reads equally bad, when in reality the fewer impossible reads the better. The solution is smoothing: apply a small prior probability to each read so that no read has zero probability. We discuss this in more detail in sections 3.1 and 3.2 of the perplexity paper. Although the paper used Laplacian smoothing, the perplexity binary has since been updated to use the Linear Good-Turing estimator instead.
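For intuition, here is a sketch of the additive (Laplacian-style) smoothing idea from the paper; it is not the Good-Turing estimator the binary uses now, and vocab_size is an illustrative stand-in for the number of possible reads.

import numpy as np

def smoothed_perplexity(read_probs, beta=1e-8, vocab_size=10**6):
    # Mix each read's probability with a small uniform prior beta so
    # that impossible reads (p = 0) no longer make the metric undefined.
    p = np.asarray(read_probs, dtype=float)
    p_smooth = (p + beta) / (1.0 + beta * vocab_size)
    return float(np.exp(-np.mean(np.log(p_smooth))))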

Visualize the perplexities and VBEM priors for each fold

To perform hyperparameter selection with perplexity, we'll need to visualize the perplexities for each sample and VBEM prior. Let's use some Python data science to achieve this. Open the Jupyter notebook located at notebooks/tutorial.ipynb in the repository. First, import some useful packages into the Python session (we installed these earlier with Conda/Mamba).

Next, load the config file we used for the Snakemake pipelines into our session. We'll extract the information we need to find all of the perplexity outputs.

Now define and run a function to plot the log10 VBEM prior against the perplexity for each fold.
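Here is a minimal sketch of what the notebook does. It assumes the output layout shown above (including the beta=1e-8 path component) and that directories are named after the config values; the actual code lives in notebooks/tutorial.ipynb.

import yaml
import numpy as np
import matplotlib.pyplot as plt

# Load the same config used by the Snakemake pipelines.
with open("snakefiles/config.yml") as f:
    config = yaml.safe_load(f)

out_dir, exp_name = config["out-dir"], config["exp-name"]
priors = [str(p) for p in config["prior-weights"]]
log_priors = [np.log10(float(p)) for p in priors]
k = config["k"]

def plot_perplexities(sample):
    # One gray line per fold: smoothed perplexity across the VBEM priors.
    per_fold = []
    for fold in range(1, k + 1):
        ppls = []
        for prior in priors:
            path = (f"{out_dir}/{exp_name}/perplexity/{sample}/"
                    f"vbprior={prior}/beta=1e-8/{fold}/perplexity.yml")
            with open(path) as f:
                ppls.append(yaml.safe_load(f)["smoothed_perplexity"])
        per_fold.append(ppls)
        plt.plot(log_priors, ppls, color="gray", alpha=0.6)
    # Red line: perplexity averaged across folds at each prior.
    plt.plot(log_priors, np.mean(per_fold, axis=0), color="red")
    plt.xlabel("log10 VBEM prior")
    plt.ylabel("smoothed perplexity")
    plt.title(sample)
    plt.show()
    return per_fold

per_fold = plot_perplexities(config["sample-names"][0])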

The x-axis is the log10 VBEM prior, and the y-axis is the perplexity. Each gray line traces one fold's perplexity across the VBEM priors. The red line is the perplexity averaged across folds for each VBEM prior.

Select the VBEM prior with the smallest perplexity

Since perplexity is a proxy for the error of an estimate, we should select the VBEM prior that produces the smallest perplexity. The minimum perplexity lies in the dip between 0 and 1 on the log10 axis, and the corresponding prior is 3. Therefore we should run Salmon with --vbprior=3 for this sample; the resulting estimate should be the best one to use for downstream analysis such as differential expression. Note that the optimal prior, as suggested by the smallest perplexity, will depend on your sample, which is why the config file can take multiple samples.
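Continuing the plotting sketch above, the selection itself is a couple of lines:

# Choose the prior whose mean perplexity across folds is smallest.
mean_ppl = np.mean(per_fold, axis=0)
best_prior = priors[int(np.argmin(mean_ppl))]
print(f"best VBEM prior: {best_prior}")  # here: 3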

Perplexity can also be used to investigate other RNA transcript quantification questions.

How does read depth impact the optimal VBEM prior?

Increasing the sample size is a surefire way to improve the accuracy and confidence of an estimate. In RNA sequencing, the number of reads is known as the read depth. Most experiments require 5-200 million reads per sample, according to Illumina. Increasing the read depth also increases the likelihood of capturing rare transcripts. With this in mind, I hypothesize that the optimal VBEM prior will decrease as the read depth increases. Intuitively, when there are more reads, we are more certain to find rare transcripts, so we can assign them a lower prior.

To test this hypothesis, I'll use holdout evaluation on samples of 10 to 40 million reads drawn from the same distribution (or organism), with a test set of 20 million reads. To keep the comparison fair, I'll use the same test set for every sample. One caveat is that different reads will be called impossible for each sample, but this should not obscure the general trend.

If you just want to see the results, you can skip to the plots made from the perplexity outputs in example-output/readdepth. Otherwise, keep reading.

Simulate reads

Reads will be simulated with Polyester, with a distribution based on an abundance estimate. Modify config-simul.yml to declare your inputs:

# results will go in output/{exp-name}
exp_name: simul_reads
# reads to base the simulation off of
reads_dir: /fs/cbcb-lab/rob/students/jason/shared/COPD-raw-data/output
# transcriptome used to generate the abundance estimate
tx_path: /fs/cbcb-lab/rob/students/jason/shared/annotations/hg19_clean.fa
# name of samples to use
sample_names:
  - SRR1265600

Polyester simulates a fixed number of reads that depends on the size of the estimate; in this instance each simulated sample is 14.7 million reads. Simulation will take about half an hour.

snakemake --snakefile snakefiles/simul_data.snk --configfile snakefiles/config-simul.yml -j <n cores>

To combine the reads into test and train sets, run the shuffle-reads.sh script in the directory where the reads were generated. This will take a few hours to complete. Make sure you have enough disk space, too, because this step creates several uncompressed copies of the FASTA files.

cd ../output/simul_reads/simulated_reads/SRR1265600
sh {repo directory}/scripts/shuffle-reads.sh

Quantify and validate

Finally, edit snakefiles/config-readdepth.yml and specify the samples we simulated.

read_file_fmt: fasta
reads-dir: /fs/cbcb-lab/rob/students/skylar/perplexity/output/simul_reads/simulated_reads/SRR1265600
sample-names:
  - 10_million_reads
  - 20_million_reads
  - 30_million_reads
  - 40_million_reads
  # include more reads for later analysis
  - 50_million_reads
  - 60_million_reads
  - 70_million_reads
  - 80_million_reads

Then run the Snakefile for read depth, which performs quantification and holdout validation of each sample's quants against the test reads. Make sure to use a small -j. This can take several hours, maybe even a day or two, so feel free to work on something else and come back.

snakemake --snakefile snakefiles/readdepth.snk --configfile snakefiles/config-readdepth.yml -j <n cores>

Plot perplexity against log10 VBEM prior for samples of varying read depth

Welcome back. Let's plot the perplexities against the log10 VBEM prior again. Since we did holdout validation rather than k-fold cross-validation, there is only one line to plot per sample.

Observe that the smallest perplexity decreases with increasing read depth. However, the optimal VBEM prior doesn't appear to be correlated with read depth. It's possible that the optimal prior is independent of read depth, but we'd need more experiments to support this.

Conclusion

In this tutorial we demonstrated VBEM prior selection with RNA perplexity using k-fold cross validation and holdout validation. We also walked through some key highlights of the process:

  1. Perplexity is a proxy for RNA transcript abundance accuracy and can be used to select the best VBEM prior for a sample.

  2. Different samples can have different perplexities and priors.

  3. Increasing read depth lowers perplexity, and it is not yet clear whether it affects the best prior.

While perplexity is not a one-size-fits-all (samples) approach, it can help us produce more accurate RNA transcript abundance estimates. This will improve the results of downstream analyses and the biological insights that follow.

Footnotes

1. Currently the kfold script does not work for some read sets. A Rust rewrite for correctness and improved performance is planned.

2. If you run Snakemake with all cores while other users want to run jobs at the same time, they will get annoyed with you.

3. Salmon collapses reads into equivalence classes during quantification to reduce time and memory usage. Perplexity also uses equivalence classes because computing the approximate likelihood over each equivalence class is faster than computing the real likelihood over each possible alignment for all fragments.