Visualize: Sample Comparison
Once you have selected a group of samples from the Search for Data page, you can begin your visual data exploration by using a variety of plots in the four tabs of the Visualizations page.
The first tab is called Sample Comparison. Here, you can validate the experiment by checking how distinct your samples are from each other, and how well the reality matches the experimental design.
The Sample Comparison tab contains two cards:
Sample Hierarchical Clustering
Scatter Plot
Sample Hierarchical Clustering card
The Sample Hierarchical Clustering card reveals the overall distance, or dissimilarity, between samples. Samples or clades of samples that group more closely together, i.e. with less distance between them, may be inferred to have more similar transcriptome phenotypes. One would expect replicates to group closely within clades. Similar treatments, genotypes, or experimental conditions might also yield more closely ordered samples.
At the top of the card you will find several controls to fine-tune the plot. You can select among three distance functions (Euclidean, Pearson’s correlation, and Spearman’s correlation) and three linkage types (Average, Complete, or Single). The clustering algorithm takes normalized gene expression values (TPM) as input.
Further, you may toggle between a dendrogram based on the entire transcriptome or a plot derived only from the genes included in the Gene Basket. To find out more on how to populate the Gene Basket, see the Gene Basket article.
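The distance and linkage options described above can be approximated outside the platform. The following is a minimal sketch using SciPy, not the platform's own implementation: the helper name cluster_samples, the sample names, and the TPM values are made up for illustration, and any preprocessing the platform applies before clustering is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import rankdata


def cluster_samples(tpm, distance="euclidean", method="average"):
    """Return a SciPy linkage matrix for the rows (samples) of a TPM matrix."""
    tpm = np.asarray(tpm, dtype=float)
    if distance == "euclidean":
        dist = pdist(tpm, metric="euclidean")
    elif distance == "pearson":
        # SciPy's "correlation" metric is 1 - Pearson r.
        dist = pdist(tpm, metric="correlation")
    elif distance == "spearman":
        # Spearman correlation equals Pearson correlation of per-sample ranks.
        ranks = np.apply_along_axis(rankdata, 1, tpm)
        dist = pdist(ranks, metric="correlation")
    else:
        raise ValueError(f"unknown distance: {distance}")
    # Guard against tiny negative values caused by floating-point rounding.
    dist = np.clip(dist, 0.0, None)
    return linkage(dist, method=method)


# Made-up TPM values: rows are samples, columns are genes.
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm = [
    [10, 20, 30, 5, 1, 50],
    [11, 19, 32, 6, 1, 48],
    [50, 5, 2, 40, 30, 10],
    [48, 6, 3, 42, 28, 11],
]

Z = cluster_samples(tpm, distance="spearman", method="complete")
groups = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clades
print(dict(zip(samples, groups)))
```

In this toy data the two control replicates and the two treatment replicates fall into separate clades, which is the pattern one would hope to see for a well-behaved experiment.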
Scatter Plot card
The Scatter Plot card in Genialis Expressions allows users to visualize and explore Principal Component Analysis (PCA) results, gene expression comparisons, and comparative scatter plots. It offers flexible customization options, with enhanced control over axis selection, coloring, shaping, and labeling.
Scatter plots provide a method for visualizing complex datasets in two dimensions, allowing users to identify patterns and relationships between samples. When used for PCA, scatter plots help reduce high-dimensional data into principal components (PCs) ranked by variance, with PC1 representing the most significant variation (percent of explained variance is shown in brackets on the axis label). Similarly, comparisons between the expression of different genes with customizable sample annotations can offer important insights.
The algorithm for PCA is the implementation from the Python scikit-learn package, and takes normalized gene expression values (TPM) as input. While the user should exercise caution in assigning biological causality to the projection of samples in each PC, one may find it instructive to hypothesize about the impact of differences in treatment, genotype, or experimental conditions. Use the “Color by” menu to color your samples by metadata and annotations to easily visualize whether the samples are grouped as expected and gain other insightful visual clues.
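As a rough illustration of the PCA step, the sketch below runs scikit-learn's PCA on a small samples-by-genes TPM matrix and builds axis labels with the percent of explained variance in brackets. The TPM values are made up, and the source does not state whether the platform log-transforms or scales the values first, so none of that is applied here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up TPM values: rows are samples, columns are genes.
tpm = np.array(
    [
        [10, 20, 30, 5, 1, 50],
        [11, 19, 32, 6, 1, 48],
        [50, 5, 2, 40, 30, 10],
        [48, 6, 3, 42, 28, 11],
    ],
    dtype=float,
)

pca = PCA(n_components=2)
coords = pca.fit_transform(tpm)  # one (PC1, PC2) coordinate pair per sample

# Axis labels with the percent of explained variance in brackets.
axis_labels = [
    f"PC{i + 1} ({ratio * 100:.1f}%)"
    for i, ratio in enumerate(pca.explained_variance_ratio_)
]
print(axis_labels)
print(coords)
```

Samples that land close together along PC1 in the printed coordinates are the ones that would appear grouped on the plot.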
Note that the scatter plot is shown only if at least 2 samples are selected. If the Include all genes option is on, PCA is computed as long as there are at least 2 samples, and up to 5 PCs can be shown. If the Include all genes option is off, PCA is computed only if there are at least 2 samples and 2 genes, and the number of PCs is limited by both. The number of PCs shown in the dropdown is always the smallest of n_samples, n_genes, and 5. At the top of the card, you can toggle between a PCA plot based on the entire transcriptome and a plot derived only from the genes included in the Gene Basket.
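The rule for how many PCs are offered can be written down compactly. This is only a sketch of the rule as stated in this article; the function name and the MAX_PCS constant are hypothetical and do not refer to anything in the platform.

```python
MAX_PCS = 5  # upper limit on PCs offered in the axis dropdowns


def available_pcs(n_samples, n_genes, include_all_genes):
    """Number of PCs offered, following the rule described in the text."""
    if n_samples < 2:
        return 0  # the scatter plot is not shown at all
    if not include_all_genes and n_genes < 2:
        return 0  # PCA on the Gene Basket needs at least 2 genes
    return min(n_samples, n_genes, MAX_PCS)


print(available_pcs(2, 20000, include_all_genes=True))   # 2
print(available_pcs(10, 20000, include_all_genes=True))  # 5
print(available_pcs(10, 3, include_all_genes=False))     # 3
print(available_pcs(10, 1, include_all_genes=False))     # 0
```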
If you find the visualization too crowded, you may also turn off the sample labels. Hovering over any point will also highlight that sample label.
Additionally, the scatter plot updates depending on whether the Omit low expression genes from PCA option (toggle button) is used, which filters out genes with low expression values.
Additional supporting information is given when downloading the results, namely gene loadings in the factor table (weights for top 20 genes per PC) and the scree table (proportion of variance explained by each PC).
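To give a sense of what the factor and scree tables contain, the sketch below derives both from a fitted scikit-learn PCA object. It rests on assumptions: the data are random, the gene names are placeholders, and "top 20" is taken to mean the 20 largest absolute weights per PC, which the source does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
gene_names = [f"gene_{i}" for i in range(50)]          # placeholder gene names
tpm = rng.gamma(shape=2.0, scale=50.0, size=(6, 50))   # 6 samples x 50 genes

pca = PCA(n_components=5)
pca.fit(tpm)

TOP_N = 20  # assumption: "top" means largest absolute weight

# Factor table: for each PC, the TOP_N genes with the largest absolute loading.
factors = {}
for pc_index, weights in enumerate(pca.components_):
    top = np.argsort(np.abs(weights))[::-1][:TOP_N]
    factors[f"PC{pc_index + 1}"] = [(gene_names[i], float(weights[i])) for i in top]

# Scree table: proportion of variance explained by each PC.
scree = {
    f"PC{i + 1}": float(ratio)
    for i, ratio in enumerate(pca.explained_variance_ratio_)
}

print(factors["PC1"][:3])
print(scree)
```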
Scatter Plot Help Guide
By default, the scatter plot initializes with:
X Axis: PC1
Y Axis: PC2
Color by, Shape by, and Label by: Set to ‘None’
Include all genes and Omit low expression genes controls: Enabled and set to TRUE
Interface Controls (Always Enabled)
X Axis: Dropdown menu listing up to 5 PCA components and genes.
Y Axis: Dropdown menu listing up to 5 PCA components, genes, and biomarker scores (if supported).
Color by: Dropdown for selecting discrete or continuous annotation fields, including sample relations.
Shape by: Dropdown for selecting discrete annotation fields to define point shapes.
Label by: Dropdown for selecting sample names or annotation fields to label the data points.
Secondary Controls (Enabled Only for PCA Plots)
Include all genes in PCA (Checkbox): Toggles between using only the genes in the Gene Basket and using all genes.
Omit low expression genes from PCA (Toggle button): Filters out genes with low expression values.
Additional Options
Selecting a Gene for an Axis:
Users can select a gene from the gene basket or search by typing at least three letters.
If a gene is selected that is not in the basket, it will be added automatically.
Advanced scatter customization options:
Selecting any PC1-PC5 component for either axis
Coloring points by sample type
Shaping points by treatment group
Labeling points by sample name
Errors
If there is insufficient data for rendering the scatter plot, an error message will be displayed:
“Not enough data for scatter plot”
For PCA, ensure at least two genes are selected or toggle ‘Include all genes’.
For gene expression scatter plots, select genes from the basket or search for new genes.
For biomarker scatter plots, select biomarker scores (if supported).
Samples and removing outliers
The Sample Hierarchical Clustering and Scatter Plot cards automatically include all samples in the Sample Basket and update automatically if a new set of samples is added to the basket. If you find outliers that you want to exclude from your analyses, simply remove those samples from the Sample Basket.
Download sample comparison results
To download your sample comparison results, click the ‘Export’ icon button in the top bar, at the top right of your screen. A modal window will appear where you can enter an optional prefix for the exported file and start the download by clicking the ‘EXPORT’ button.
This will download a zipped report file that contains:
- Scatter plot folder
  - A suggested caption text (.txt)
  - Raster and vector plot image (.png and .svg)
  - Raster and vector legend (.png and .svg)
  - Factors table (.tsv*). This table is not downloaded if the scatter plot is derived from only two specific genes in the Gene Basket.
  - Scatter plot table (.tsv*)
  - Scree table (.tsv*). This table is not downloaded if the scatter plot is derived from only two specific genes in the Gene Basket.
- Sample Hierarchical Clustering folder
  - A suggested caption text (.txt)
  - Raster and vector plot image (.png and .svg)
  - Rasterized table (.tsv*)
  - Clustering table folder containing linkage (.tsv*) and order (.tsv*)
- Gene Basket folder
  - Information about the highlighted and selected genes in the Gene Basket (.tsv*)
- Sample Basket folder
  - A list of selected samples in the Sample Basket (.tsv*)
Note that the Gene Basket folder will only appear if the Gene Basket is populated.
*TSV (tab-separated-value) format files are plain-text tabular files that can be imported into any spreadsheet program.
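Besides spreadsheet programs, the exported TSV files can be read with a few lines of Python. The sketch below uses only the standard library and an inline example string; the column names (sample, PC1, PC2) and values are hypothetical and may not match the actual export, so check the header row of your own file.

```python
import csv
import io

# Hypothetical stand-in for the contents of an exported scatter plot table.
example_tsv = "sample\tPC1\tPC2\nctrl_1\t-1.5\t0.2\ntreat_1\t1.4\t-0.3\n"

# For a real file, replace io.StringIO(...) with open(path, newline="").
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["sample"], float(row["PC1"]), float(row["PC2"]))
```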