In Detail: General RNA-Seq analysis pipeline (Salmon)

Description

This document describes an alignment-free workflow for processing Illumina RNA-Seq reads obtained with standard mRNA or total-RNA (ribosomal/globin depletion) library prep kits. The input sequencing data is assumed to be already demultiplexed. Analysis steps include adapter trimming and read quality filtering (BBDuk), expression quantification (Salmon, Tximport), and QC steps (Samtools, QoRTs QC, STAR). Reference transcript sequences (cDNA) for mouse, rat, or human (Ensembl versions 100 and 92) are extended with references for common spike-in standards (ERCC and SIRV). You can examine the “expression” of internal spike-in controls in the same way as the expression of any other gene. The presence of spike-ins in the reference does not compromise the analysis, even if spike-in controls have not been used. As an additional quality control step, a sample of ten million reads (drawn with Seqtk) is mapped to the rRNA and globin sequences of the selected species to determine the overall proportion of such reads in the sample. Results are reported in the summary table of the MultiQC report.

NOTE: Strandedness detection (cDNA index file) always uses Ensembl version 100. This does not affect how quantification is performed, even if you use Ensembl version 92 for annotation.

Pipeline Details - Tools and Parameters

Unless stated otherwise, all parameter values are set to tool defaults. The non-default parameters of the major analysis tools used in the pipeline are listed below.

Read trimming using BBDuk (BBMap 37.90, Bushnell, 2018)

  • Standard list of adapters from Illumina (available within Genialis)

  • minlength (30)

  • k (23)

  • hammingdistance (1)

  • ktrim (r)

  • mink (11)

  • qtrim (r)

  • trimq (28)

For paired-end sequencing libraries, the following options are also used:

  • tpe

  • tbo
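
The parameters above can be assembled into a BBDuk call as in the following sketch. It only builds the argument list and does not run the tool. The function name and all file paths are hypothetical placeholders, and the exact call used inside the pipeline may differ.

```python
from typing import List, Optional


def bbduk_command(
    adapters: str,
    in1: str,
    out1: str,
    in2: Optional[str] = None,
    out2: Optional[str] = None,
) -> List[str]:
    """Build a bbduk.sh argument list from the trimming parameters listed above.

    All paths are placeholders supplied by the caller (assumption).
    """
    cmd = [
        "bbduk.sh",
        f"in={in1}",
        f"out={out1}",
        f"ref={adapters}",      # adapter FASTA
        "ktrim=r",              # trim adapters from the right (3') end
        "k=23",
        "mink=11",
        "hammingdistance=1",
        "minlength=30",
        "qtrim=r",              # quality-trim the right end
        "trimq=28",
    ]
    if in2 is not None:
        if out2 is None:
            raise ValueError("out2 is required when in2 is given")
        # Paired-end libraries additionally use tpe and tbo.
        cmd += [f"in2={in2}", f"out2={out2}", "tpe", "tbo"]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a system where BBMap is installed.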

Quantification using Salmon (v1.2.1, Patro et al., 2017)

The Salmon index for each species was prepared from cDNA sequences extracted from a reference sequence FASTA (primary assembly) and a matching GTF file downloaded from the Ensembl FTP site. The cDNA sequences were extracted using the gffread tool from the Cufflinks package. The resulting cDNA files were extended with ERCC/SIRV spike-in cDNA sequences and the full reference genome sequence. A full-decoy Salmon index was then generated from the prepared inputs and a decoy.txt file containing the chromosome names for each species.
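
As a rough illustration of the index preparation described above, the sketch below collects decoy (chromosome) names from genome FASTA headers and builds argument lists for gffread and salmon index. The file names cdna.fa, gentrome.fa, and decoys.txt and both function names are assumptions made for this example, not the names used in the pipeline.

```python
from typing import Iterable, List


def decoy_names(fasta_lines: Iterable[str]) -> List[str]:
    """Return the sequence name (first token of each '>' header) of a genome FASTA."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def index_commands(genome_fa: str, gtf: str, index_dir: str) -> List[List[str]]:
    """Argument lists for cDNA extraction and full-decoy index construction.

    Between the two calls, cdna.fa, the spike-in FASTA, and the genome FASTA
    are assumed to be concatenated (in that order) into gentrome.fa, and the
    names from decoy_names() written one per line to decoys.txt.
    """
    return [
        # Extract spliced transcript (cDNA) sequences from genome + annotation.
        ["gffread", "-w", "cdna.fa", "-g", genome_fa, gtf],
        # Build the index with the genome sequences marked as decoys.
        ["salmon", "index", "-t", "gentrome.fa", "-d", "decoys.txt", "-i", index_dir],
    ]
```

Placing the genome sequences after the transcripts in the combined FASTA follows the convention in the Salmon documentation for decoy-aware indices.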

Salmon quant is run in mapping-based mode with the following parameters:

  • libType <https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype> (A)

  • validateMappings <https://salmon.readthedocs.io/en/latest/salmon.html#validatemappings> (True)

  • seqBias <https://salmon.readthedocs.io/en/latest/salmon.html#seqbias> (True)

  • gcBias <https://salmon.readthedocs.io/en/latest/salmon.html#gcbias> (True, paired-end data only)

  • rangeFactorizationBins <https://salmon.readthedocs.io/en/latest/salmon.html#rangefactorizationbins> (4)
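
A minimal sketch of how these parameters map onto a salmon quant call is shown below. It only builds the argument list. The function name and the paths are placeholders, and enabling gcBias only for paired-end input mirrors the list above.

```python
from typing import List, Optional


def salmon_quant_command(
    index_dir: str,
    out_dir: str,
    reads1: str,
    reads2: Optional[str] = None,
) -> List[str]:
    """Build a salmon quant argument list with the parameters listed above."""
    cmd = [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",                         # libType: detect automatically
        "--validateMappings",
        "--seqBias",
        "--rangeFactorizationBins", "4",
        "-o", out_dir,
    ]
    if reads2 is None:
        cmd += ["-r", reads1]              # single-end reads
    else:
        # Paired-end reads; gcBias is enabled for paired-end data only.
        cmd += ["-1", reads1, "-2", reads2, "--gcBias"]
    return cmd
```

For example, salmon_quant_command("salmon_index", "quant_out", "r1.fq.gz", "r2.fq.gz") yields a paired-end call that includes --gcBias, while omitting the second read file yields a single-end call without it.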

Salmon reports transcript-level abundance estimates, which are summarized to gene-level TPM values using the R package Tximport (Soneson et al., 2015).
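
To illustrate what gene-level summarization means here, the following simplified sketch sums transcript TPM values per gene using a transcript-to-gene mapping. It is not Tximport itself, which is an R package and also handles counts and effective lengths. The function name and the identifiers in the usage note are invented for the example.

```python
from collections import defaultdict
from typing import Dict


def gene_level_tpm(
    transcript_tpm: Dict[str, float],
    tx2gene: Dict[str, str],
) -> Dict[str, float]:
    """Sum transcript-level TPM values into gene-level TPM values.

    Transcripts without an entry in tx2gene are skipped in this sketch.
    """
    totals: Dict[str, float] = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue
        totals[gene] += tpm
    return dict(totals)
```

With hypothetical transcripts tx1 and tx2 both mapped to geneA, their TPM values are added together and reported once under geneA.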

Click the Salmon Quant object to see the inputs that were used in the process and the outputs that were generated:

  1. Normalized expression

  2. Expression (json)

  3. Expression type

  4. Read counts

  5. Expressions

  6. Expressions (json)

  7. Salmon quant file

  8. Transcript-level expressions

  9. Salmon output

  10. Transcript to gene mapping

  11. Strandedness code

  12. Strandedness report file

  13. Gene ID source

  14. Species

  15. Build

  16. Feature type

Alignment of subsampled reads using STAR (2.7.0f, Dobin et al., 2013)

  • All parameters are left at their defaults

Seqtk (1.2-r94)
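
The subsampling step and the resulting rRNA/globin proportion can be sketched as follows. The seed value, function names, and file path are assumptions for illustration, and the read counts in the usage note are invented rather than pipeline results.

```python
from typing import List


def seqtk_sample_command(reads: str, n_reads: int = 10_000_000, seed: int = 11) -> List[str]:
    """Build a seqtk sample argument list; seqtk writes the sampled reads to stdout.

    The seed value is an arbitrary assumption. For paired-end data the same
    seed must be used for both files to keep mates in sync.
    """
    return ["seqtk", "sample", "-s", str(seed), reads, str(n_reads)]


def mapped_fraction(mapped_reads: int, sampled_reads: int) -> float:
    """Proportion of sampled reads that mapped to the rRNA or globin sequences."""
    if sampled_reads <= 0:
        raise ValueError("sampled_reads must be positive")
    return mapped_reads / sampled_reads
```

For instance, if 250,000 of the ten million sampled reads mapped to rRNA sequences, mapped_fraction(250_000, 10_000_000) would give 0.025, that is, 2.5% of the sample.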

References

  • Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. https://doi.org/10.1093/bioinformatics/bts635

  • Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., Kingsford, C., 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14(4), 417. https://www.nature.com/articles/nmeth.4197

  • Soneson, C., Love, M.I., Robinson, M.D., 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4. https://doi.org/10.12688/f1000research.7563.1