In Detail: General RNA-Seq analysis pipeline (Salmon)
Description
This document describes an alignment-free workflow for processing Illumina RNA-Seq reads obtained with standard mRNA or total-RNA (ribosomal/globin depletion) library prep kits. The input sequencing data is assumed to be demultiplexed already. Analysis steps include adapter trimming and read quality filtering (BBDuk), expression quantification (Salmon, Tximport), and QC steps (Samtools, QoRTs QC, STAR).
Reference transcript sequences (cDNA) for mouse, rat, or human (Ensembl versions 100 and 92) are extended with references for common spike-in standards (ERCC and SIRV). You can examine the expression of internal spike-in controls in the same way as the expression of any other gene. The presence of spike-ins in the reference does not compromise the analysis, even if spike-in controls were not used.
As an additional quality control step, a sample of ten million reads (drawn with the Seqtk tool) is mapped to the rRNA and globin sequences of the selected species to determine the overall proportion of such reads in the sample. Results are reported in the summary table of the MultiQC report.
NOTE: Strandedness detection (cDNA index file) uses Ensembl version 100. This does not affect how quantification is performed, even if you use Ensembl version 92 for annotation.
Pipeline Details - Tools and Parameters
Unless stated otherwise, all parameter values are set to tool defaults. The major analysis tools used in the pipeline and their parameter settings are listed below.
Read trimming using BBDuk (BBMap 37.90, Bushnell, 2018)
Standard list of adapters from Illumina (available within Genialis)
minlength (30)
k (23)
hammingdistance (1)
ktrim (r)
mink (11)
qtrim (r)
trimq (28)
For paired-end sequencing libraries, the following options are also used:
tpe
tbo
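The BBDuk settings listed above can be assembled into a command line roughly as follows. This is a minimal sketch, not the pipeline's actual invocation: the helper name bbduk_command and all file names are hypothetical, and the option spellings are taken from the list above (BBDuk also accepts hdist as a shorter form of hammingdistance).

```python
from typing import List, Optional


def bbduk_command(
    reads: str,
    trimmed: str,
    adapters: str,
    mate_reads: Optional[str] = None,
    mate_trimmed: Optional[str] = None,
) -> List[str]:
    """Build a bbduk.sh argument list (hypothetical helper, illustration only)."""
    cmd = [
        "bbduk.sh",
        f"in={reads}",
        f"out={trimmed}",
        f"ref={adapters}",      # adapter FASTA; the path is an assumption
        "minlength=30",         # discard reads shorter than 30 bp after trimming
        "k=23",                 # k-mer length used for adapter matching
        "hammingdistance=1",    # allow one mismatch per k-mer
        "ktrim=r",              # trim matched adapters towards the 3' end
        "mink=11",              # allow shorter k-mers at read ends
        "qtrim=r",              # quality-trim the 3' end
        "trimq=28",             # quality threshold for trimming
    ]
    if mate_reads is not None:
        if mate_trimmed is None:
            raise ValueError("mate_trimmed is required for paired-end input")
        # Paired-end only: second-mate files plus the tpe and tbo options.
        cmd += [f"in2={mate_reads}", f"out2={mate_trimmed}", "tpe", "tbo"]
    return cmd


# Example with hypothetical file names; prints the command without running it.
print(" ".join(bbduk_command("R1.fastq.gz", "R1.trim.fastq.gz", "adapters.fa",
                             "R2.fastq.gz", "R2.trim.fastq.gz")))
```

Returning a list rather than a string lets the arguments be passed to subprocess.run without shell quoting concerns.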
Quantification using Salmon (v1.2.1, Patro et al., 2017)
The Salmon index for each species is prepared from cDNA sequences extracted from a reference sequence FASTA (primary assembly) and a matching GTF file downloaded from the Ensembl FTP site. The cDNA sequences are extracted with the gffread tool from the Cufflinks package. The resulting cDNA files are extended with ERCC/SIRV spike-in cDNA sequences and the full reference genome sequence. A full-decoy Salmon index is then generated from these inputs and a decoy.txt file containing the chromosome names for each species.
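The index preparation can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure used to build the Genialis indices: the function names and file names are hypothetical, and the decoy names are taken to be the first whitespace-delimited token of each genome FASTA header. The gffread (-w, -g) and salmon index (-t, -d, -i) options are the standard ones for these tools.

```python
from typing import Iterable, List


def decoy_names(genome_fasta_lines: Iterable[str]) -> List[str]:
    """Collect sequence names (e.g. chromosome names) from FASTA header lines."""
    names = []
    for line in genome_fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])  # name is the token before any whitespace
    return names


def gffread_command(genome_fasta: str, gtf: str, cdna_out: str) -> List[str]:
    """Extract spliced transcript (cDNA) sequences from a genome and a GTF."""
    return ["gffread", "-w", cdna_out, "-g", genome_fasta, gtf]


def salmon_index_command(gentrome: str, decoys: str, index_dir: str) -> List[str]:
    """Index a 'gentrome': cDNA and spike-ins followed by the genome (decoys)."""
    return ["salmon", "index", "-t", gentrome, "-d", decoys, "-i", index_dir]


# Example with a tiny in-memory FASTA; real input would be the genome file.
example = [">1 dna:chromosome", "ACGT", ">MT dna:chromosome", "ACGT"]
print("\n".join(decoy_names(example)))
```

In a full-decoy setup the genome sequences must come after the transcript sequences in the concatenated FASTA, and the names written to decoy.txt must match the genome headers exactly.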
Salmon quant is run in mapping-based mode with the following parameters:
libType <https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype> (A)
validateMappings <https://salmon.readthedocs.io/en/latest/salmon.html#validatemappings> (True)
seqBias <https://salmon.readthedocs.io/en/latest/salmon.html#seqbias> (True)
gcBias <https://salmon.readthedocs.io/en/latest/salmon.html#gcbias> (True, paired-end data only)
rangeFactorizationBins <https://salmon.readthedocs.io/en/latest/salmon.html#rangefactorizationbins> (4)
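For orientation, the parameters above correspond to a salmon quant call along the following lines. This is a hedged sketch: the helper name salmon_quant_command, the index directory, and the file names are hypothetical, and the exact call made by the pipeline may differ.

```python
from typing import List, Optional


def salmon_quant_command(
    index_dir: str,
    out_dir: str,
    reads: str,
    mates: Optional[str] = None,
) -> List[str]:
    """Build a salmon quant argument list (hypothetical helper, illustration only)."""
    cmd = ["salmon", "quant", "-i", index_dir, "-l", "A"]  # A = detect library type
    if mates is None:
        cmd += ["-r", reads]                # single-end reads
    else:
        cmd += ["-1", reads, "-2", mates]   # paired-end reads
    cmd += ["--validateMappings", "--seqBias"]
    if mates is not None:
        cmd.append("--gcBias")              # enabled for paired-end data only
    cmd += ["--rangeFactorizationBins", "4", "-o", out_dir]
    return cmd


# Example with hypothetical paths; prints the command without running it.
print(" ".join(salmon_quant_command("salmon_index", "quant_out",
                                    "R1.trim.fastq.gz", "R2.trim.fastq.gz")))
```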
Salmon reports transcript-level abundance estimates, which are summarized to gene-level TPM values using the R package Tximport (Soneson et al., 2015).
Clicking on the Salmon Quant object shows the inputs that were used in the process as well as the outputs that were generated:
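The idea behind gene-level summarization of TPM values can be illustrated with a few lines of Python. This is not Tximport itself (which is an R package and also handles counts and transcript lengths); it is a simplified sketch that only sums transcript TPMs per gene, with hypothetical function and identifier names, and it skips transcripts that are absent from the transcript-to-gene mapping.

```python
from collections import defaultdict
from typing import Dict


def summarize_tpm_to_gene(
    transcript_tpm: Dict[str, float],
    tx2gene: Dict[str, str],
) -> Dict[str, float]:
    """Sum transcript-level TPM values into gene-level TPM values."""
    gene_tpm: Dict[str, float] = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue  # transcript missing from the mapping: skipped in this sketch
        gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Example with made-up identifiers and values.
tpm = {"tx1": 1.5, "tx2": 2.5, "tx3": 10.0}
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(summarize_tpm_to_gene(tpm, mapping))
```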
Normalized expression
Expression (json)
Expression type
Read counts
Expressions
Expressions (json)
Salmon quant file
Transcript-level expressions
Salmon output
Transcript to gene mapping
Strandedness code
Strandedness report file
Gene ID source
Species
Build
Feature type
Alignment of subsampled reads using STAR (2.7.0f, Dobin et al., 2013)
All parameters are set to defaults
Read subsampling using Seqtk (1.2-r94)
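The subsampling QC step described in the Description can be sketched as follows. The seed value, helper names, and file names are assumptions made for illustration; the source states only that ten million reads are sampled with Seqtk and mapped to rRNA and globin sequences. Note that seqtk sample writes the sampled reads to standard output, and that paired-end files need the same seed so that mates stay in sync.

```python
from typing import List


def seqtk_sample_command(fastq: str, n_reads: int = 10_000_000, seed: int = 11) -> List[str]:
    """Build a seqtk sample argument list; the seed value is an arbitrary assumption."""
    return ["seqtk", "sample", f"-s{seed}", fastq, str(n_reads)]


def mapped_fraction(mapped_reads: int, sampled_reads: int) -> float:
    """Proportion of sampled reads that mapped to a reference (e.g. rRNA or globin)."""
    if sampled_reads <= 0:
        raise ValueError("sampled_reads must be positive")
    return mapped_reads / sampled_reads


# Example with a hypothetical file name and made-up mapping counts.
print(" ".join(seqtk_sample_command("R1.trim.fastq.gz")))
print(f"{mapped_fraction(250_000, 10_000_000):.1%}")
```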
References
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. https://doi.org/10.1093/bioinformatics/bts635
Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., Kingsford, C., 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417. https://www.nature.com/articles/nmeth.4197
Soneson, C., Love, M.I., Robinson, M.D., 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4. https://doi.org/10.12688/f1000research.7563.1