In Detail: General RNA-Seq analysis pipeline (featureCounts)¶
Description¶
This workflow processes Illumina RNA-Seq sequencing reads by cleaning up reads, aligning them to a reference genome and quantifying gene expression. Analysis steps thus include trimming (BBDuk), mapping (STAR), expression quantification (featureCounts) and normalization (rnanorm). Reference genomes for mouse, rat, or human (Ensembl, version 109, 100 and 92) are optionally extended with references for common spike-in standards (ERCC and SIRV). You can examine “expressions” of internal spike-in controls in the same way as interrogating expression of any other gene, but you can rest assured that the presence of spike-ins in the reference does not interfere with the analysis. As an additional quality control step, a sample of a million reads (Seqtk tool) is mapped to rRNA and globin sequences of the selected species to determine the overall proportion of these kinds of reads in the sample. Results are reported in the summary table of the MultiQC report.
Pipeline Details - Tools and Parameters¶
The main pipeline steps, tool versions and parameters used are listed below. Pipeline details for both single-end (SE) and paired-end (PE) library preparation types are shared to account for diverse input data options.
Reads trimming using BBDuk (BBMap 37.90, Bushnell, 2018)
bbduk.sh (SE) in=input_reads.fastq.gz ref=input_references.fasta.gz out=output_reads.fastq stats=output_statistics.txt statscolumns=5 k=23 rcomp=t maskmiddle=t minkmerhits=1 minkmerfraction=0.0 mincovfraction=0.0 hammingdistance=1 qhdist=0 editdistance=0 hammingdistance2=0 qhdist2=0 editdistance2=0 forbidn=f findbestmatch=t ktrim=r maskfullycovered=f mink=11 qtrim=r trimq=28 trimpolya=0 minlength=30 minlengthfraction=0.0 minavgquality=0 maqb=0 minbasequality=0 minconsecutivebases=0 trimpad=0 minoverlap=14 mininsert=40 forcetrimleft=0 forcetrimright=0 forcetrimright2=0 forcetrimmod=0 restrictleft=0 restrictright=0 mingc=0.0 maxgc=1.0 maxns=-1 tossjunk=f chastityfilter=f barcodefilter=f xmin=-1 ymin=-1 xmax=-1 ymax=-1 entropy=-1.0 entropywindow=50 entropyk=5 minbasefrequency=0 entropymask=f threads=4
bbduk.sh (PE) in=input_fw_reads.fastq.gz in2=input_rv_reads.fastq.gz ref=input_references.fasta.gz out=output_fw_reads.fastq out2=output_rv_reads.fastq stats=output_statistics.txt statscolumns=5 k=23 rcomp=t maskmiddle=t minkmerhits=1 minkmerfraction=0.0 mincovfraction=0.0 hammingdistance=1 qhdist=0 editdistance=0 hammingdistance2=0 qhdist2=0 editdistance2=0 forbidn=f removeifeitherbad=t findbestmatch=t ecco=f ktrim=r maskfullycovered=f mink=11 qtrim=r trimq=28 trimpolya=0 minlength=30 minlengthfraction=0.0 minavgquality=0 maqb=0 minbasequality=0 minconsecutivebases=0 trimpad=0 trimbyoverlap=t minoverlap=14 mininsert=40 trimpairsevenly=t forcetrimleft=0 forcetrimright=0 forcetrimright2=0 forcetrimmod=0 restrictleft=0 restrictright=0 mingc=0.0 maxgc=1.0 maxns=-1 tossjunk=f chastityfilter=f barcodefilter=f xmin=-1 ymin=-1 xmax=-1 ymax=-1 entropy=-1.0 entropywindow=50 entropyk=5 minbasefrequency=0 entropymask=f threads=4
Read alignment using STAR (v2.7.10b, Dobin et al. 2013)
Reference genome: Ensembl, version 109:
STAR --runThreadN 1 --runMode genomeGenerate --genomeSAsparseD 5 --genomeDir ./star_index/ --genomeFastaFiles genome.fasta --sjdbGTFfile annotation.gtf --sjdbOverhang 100 --sjdbGTFfeatureExon exon
STAR (SE) --runThreadN 4 --genomeDir /star_index --readFilesIn mate1.fastq.gz --outReadsUnmapped Fastx --limitIObufferSize 30000000 50000000 --limitOutSAMoneReadBytes 100000 --limitOutSJoneRead 1000 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outFilterType Normal --outSAMtype BAM Unsorted --readFilesCommand zcat --alignMatesGapMax Local --outSAMattributes Standard
STAR (PE) --runThreadN 4 --genomeDir /star_index --readFilesIn mate1.fastq mate2.fastq --outReadsUnmapped Fastx --limitIObufferSize 30000000 50000000 --limitOutSAMoneReadBytes 100000 --limitOutSJoneRead 1000 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outFilterType Normal --outSAMtype BAM Unsorted --readFilesCommand zcat --alignMatesGapMax Local --outSAMattributes Standard
Sorting of BAM files using Samtools (v1.14, Danecek et. al., 2021)
Samtools sort Aligned.out.bam -o sorted.bam -@ 4
Quantification using featureCounts (1.6.3, Liao et al. 2014)
featureCounts (SE) -a annotation.gtf -o featureCounts_rc.txt -F GTF -t exon -g gene_id --minOverlap 1 --fracOverlap 0.0
--fracOverlapFeature 0.0 --readExtension5 0 --readExtension3 0 -Q 0 -s <strandeness_code> -T 4 -maxMOp 10 sorted.bam
featureCounts (PE) -a annotation.gtf -o featureCounts_rc.txt -F GTF -t exon -g gene_id --minOverlap 1 --fracOverlap 0.0
--fracOverlapFeature 0.0 --readExtension5 0 --readExtension3 0 -Q 0 -s <strandeness_code> -p -T 4 -maxMOp 10 sorted.bam
Normalization of expression values using rnanorm (1.3.1)
Seqtk (1.2-r94)
Subsampling to 1M reads, using a fixed random seed
seqtk (SE)sample -s 11 input_reads.fastq.gz 1000000 > output_reads.fastq
Seqtk (PE)sample -s 11 input_mate1_reads.fastq.gz 1000000 > mate1_out.fastq
seqtk (PE)sample -s 11 input_mate2_reads.fastq.gz 1000000 > mate2_out.fastq
References¶
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. https://doi.org/10.1093/bioinformatics/bts635
Bushnell, B. (2018) https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/ (accessed on 2018-06-20). Joint Genome Institute
Liao Y, Smyth GK and Shi W. (2014) featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30. DOI: 10.1093/bioinformatics/btt656.
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li; Twelve years of SAMtools and BCFtools; GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008rnanorm, Genialis, Inc. https://github.com/genialis/rnaseq-normalization/