When we run several assays on the same patient, an integrated view of the results helps us infer dependencies between them. This is often complicated by the fact that data from different assays arrive in different formats. In this first exercise we have genomic alteration data from Glioblastoma tumour samples (from the TCGA project) and would like to integrate them so that we can browse them as a multi-dimensional heatmap in Gitools. Using the data we have, we will try to rebuild the data matrix of the TCGA Glioblastoma heatmap that can be found among the Gitools prepared datasets. Last but not least, we will learn how to interact with Gitools interactive heatmaps: Sort, Filter, Zoom.
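To make the integration goal concrete, here is a minimal Python sketch, independent of Gitools, that reads two per-assay matrices stored with different separators and merges them into one long table with one value column per data layer. The function names, the sample ids `S1`/`S2`, the toy values and the output layout (column id, row id, then one value per layer) are assumptions made for illustration, not the format Gitools prescribes.

```python
import csv
import io


def read_matrix(text, delimiter="\t"):
    """Parse a gene-by-sample matrix: first row holds sample ids, first column gene ids."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    samples = rows[0][1:]
    values = {}
    for row in rows[1:]:
        gene = row[0]
        for sample, cell in zip(samples, row[1:]):
            if cell != "":  # skip empty cells instead of storing them
                values[(sample, gene)] = cell
    return values


def merge_layers(layers, missing=""):
    """Combine {layer_name: {(sample, gene): value}} into one long table.

    Each output row is [sample, gene, value_layer1, value_layer2, ...].
    `missing` is the placeholder for absent values (an assumption of this sketch).
    """
    names = list(layers)
    keys = sorted(set().union(*(layers[n].keys() for n in names)))
    table = [["column", "row"] + names]
    for sample, gene in keys:
        table.append([sample, gene] + [layers[n].get((sample, gene), missing) for n in names])
    return table


if __name__ == "__main__":
    # Toy inputs: expression is tab-separated, CNA is comma-separated.
    expression = "gene\tS1\tS2\nEGFR\t1.5\t\nTP53\t-0.3\t0.2\n"
    cna = "gene,S1,S2\nEGFR,2,0\n"
    merged = merge_layers({
        "expression": read_matrix(expression, "\t"),
        "cna": read_matrix(cna, ","),
    })
    for line in merged:
        print("\t".join(line))
```

Keying every value by (sample, gene) is what lets layers from different files line up even when one assay lacks some samples or genes.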
Gitools can be downloaded from www.gitools.org. Unzip the file, enter the gitools/bin folder and run gitools if you are a Mac or Linux user or gitools-2GB.bat if you are a Windows user. Further details can be found at Installation.
The files in that folder contain data for the same sample cohort. But do they have the same format?
What separates the different data fields in the files? Comma, Tab, Semicolon, other character?
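Besides opening the files in a text editor, the separator can be checked programmatically. The following is a small sketch, assuming the candidates are comma, tab and semicolon; `guess_delimiter` is a hypothetical helper, not part of Gitools.

```python
def guess_delimiter(text, candidates=(",", "\t", ";")):
    """Return the candidate that occurs equally often (and at least once) on every line.

    Only the first 20 non-empty lines are inspected. Returns None when no
    candidate is used consistently.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()][:20]
    best, best_count = None, 0
    for delim in candidates:
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1:  # same number of separators on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


if __name__ == "__main__":
    sample = "gene\tS1\tS2\nEGFR\t1.5\t0.7\n"
    print(repr(guess_delimiter(sample)))
```

A consistent per-line count is a simple heuristic; quoted fields containing the separator would defeat it.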
How many data layers does the TCGA GBM heatmap have? What data is represented?
click the preview image in order to load the matrix within Gitools.
left bottom and the color-scale editing menu at Edit->Layers-> [...]. Change the value layers by clicking on their names at the left, in the details box.
Explore the template heatmap as much as needed in order to recreate the color scales.
Save the result to disk in .heatmap.zip format.
By now you have successfully loaded the data, integrated it into a heatmap and saved it to disk for further use. But we also need to understand the data and the tool in order to interpret and interact with the data. Below you will find some concepts on how to interact with heatmaps and some steps to follow.
Learning to interact with Gitools heatmaps
Very helpful shortcuts at: Help > Shortcuts
dragged and dropped. Selections influence sorting and other operations, thus in order to unselect all items you may click in the white area at the top right or with the keyboard: u+a (use r or c instead of “a” for Rows or Columns respectively). To select ranges use Shift
columns hold down CTRL+c and for rows only CTRL+r. The row and column header areas can be resized with the same gesture; make sure the mouse is hovering over the area you’d like to resize.
lines marking the lead. At no time can more than one cell be in focus. All the data associated with the cell is displayed in the details panel at the left: all data layers and values, plus the information associated with the row and column of the cell.
Follow the next steps and save a bookmark with an appropriate name after each.
Visibility 1: Make sure that you see the expression values, then make sure that you see all the columns by right-clicking on the columns and selecting Show all
Visibility 2: Apparently not all samples have expression values. Select some samples with no values, right click and select Hide selected
Sorting 1: Right click on one of the row names and select: Sort ascending by id
Sorting 2: Make sure you have no rows and columns selected, right click on a column id and select Sort rows by values
Sorting 3: Select the EGFR row, right click on EGFR and select Sort columns by selected values
Edit > Columns > Filter by values. As criteria select Expression abs > 0 and click OK.
criteria CNA Status, Count (Non-Zero), Descending and then sort the columns according to the CDKN2A CNA status. Now switch between CNA and expression and save the last bookmark. Also, compare the values in CDKN2A.
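The two sort operations used in these steps (rows by their number of non-zero values, columns by the values of one selected row) can be sketched in Python as follows. The matrix layout, the function names and the toy CNA coding are assumptions for illustration; the real values are the ones shown in the CNA colour scale.

```python
def count_nonzero(values):
    """Count values that are neither missing (None) nor zero."""
    return sum(1 for v in values if v not in (None, 0))


def sort_rows_by_nonzero_count(matrix):
    """Return row ids ordered by their non-zero count, descending.

    matrix: {row_id: {column_id: value}}
    """
    return sorted(matrix, key=lambda r: count_nonzero(matrix[r].values()), reverse=True)


def sort_columns_by_row(matrix, row_id, columns):
    """Return column ids ordered ascending by the values of one row, missing values last."""
    row = matrix[row_id]
    return sorted(columns, key=lambda c: (row.get(c) is None, row.get(c) or 0))


if __name__ == "__main__":
    # Toy CNA layer; sample ids and the coding of the events are hypothetical.
    cna = {
        "EGFR": {"S1": 2, "S2": 0, "S3": 2},
        "CDKN2A": {"S1": -2, "S2": -2, "S3": -2},
        "TP53": {"S1": 0, "S2": 0, "S3": None},
    }
    print(sort_rows_by_nonzero_count(cna))
    print(sort_columns_by_row(cna, "EGFR", ["S1", "S2", "S3"]))
```

Python's sort is stable, so columns with equal values keep their previous relative order, which is why the order of successive sorts matters.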
Data interpretation depends mostly on additional information about the data that we are viewing. Gitools therefore allows us to add annotations for columns and rows. Clinical annotation is crucial for interpreting the sample data, whereas additional genomic information helps us tell the different genes apart.
for the columns of the matrix we have used in practical 1.
Open it with a text editor to see the content and format of the file.
In Gitools, right click in the columns and select Add column header. Select Values as color.
as a header and click Next
Proceed to the cluster page. How many clusters are there? Do all samples have an annotation? Click Finish.
Sort columns by annotation: Right click on the new color annotation and select “sort ascending by disease”. Add a bookmark.
Change the column width until there is enough space for all the cluster names to be displayed.
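The two questions asked on the cluster page (how many clusters there are, and whether every sample is annotated) can also be answered directly from the annotation file. This is a sketch under assumptions: the field names `sample` and `cluster`, the tab separator and the sample ids are hypothetical and need to be adapted to the actual file.

```python
import csv
import io
from collections import Counter


def read_annotations(text, id_field, label_field, delimiter="\t"):
    """Return {sample_id: label}, skipping samples whose label is empty."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[id_field]: row[label_field] for row in reader if row[label_field]}


def summarize(sample_ids, annotations):
    """Count samples per label and list the samples that have no annotation."""
    counts = Counter(annotations[s] for s in sample_ids if s in annotations)
    missing = [s for s in sample_ids if s not in annotations]
    return counts, missing


if __name__ == "__main__":
    text = "sample\tcluster\nS1\tA\nS2\tB\nS3\tA\nS4\t\n"
    annotations = read_annotations(text, "sample", "cluster")
    counts, missing = summarize(["S1", "S2", "S3", "S4", "S5"], annotations)
    print(dict(counts), missing)
```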
Gene annotation from Biomart
The ids that we have in the matrix are HGNC gene symbols. That is all the information we have for the genes. In these steps we will import annotations for each gene id so that we can better distinguish the different genes.
At the filters step, select to only download gene annotations for Chromosomes 1-22,X,Y and MT. You can specify the filter at REGION > Chromosomes
Leave the option “Skip rows with empty values” checked in the next window and click Next.
Name the output file gene-annotations.tsv and save it to the data_annotation folder.
Open the file that you have just imported with a text editor to see the content and format of the file.
Sort the rows according to the chromosome. How many more genes are from the same chromosome as CDKN2A? Is there a kinase?
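Sorting by chromosome and looking up the genes that share a chromosome with a given gene can be sketched in Python as below. The helper names are hypothetical, and the small gene-to-chromosome mapping is only an illustrative subset written by hand; in the exercise this information comes from the downloaded gene-annotations.tsv.

```python
def chromosome_key(chrom):
    """Sort key giving the natural order 1..22, X, Y, MT; unknown names go last."""
    order = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
    return order.index(chrom) if chrom in order else len(order)


def same_chromosome(annotation, gene):
    """Return the other genes annotated to the same chromosome as `gene`."""
    chrom = annotation[gene]
    return sorted(g for g, c in annotation.items() if c == chrom and g != gene)


if __name__ == "__main__":
    # Hand-written subset for illustration only.
    annotation = {
        "CDKN2A": "9", "CDKN2B": "9", "MTAP": "9",
        "EGFR": "7", "TP53": "17", "PTEN": "10",
    }
    print(sorted(annotation, key=lambda g: chromosome_key(annotation[g])))
    print(same_chromosome(annotation, "CDKN2A"))
```

A dedicated sort key is needed because plain text sorting would place chromosome 10 before chromosome 2.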
“Cancer cells often exhibit a change in number of copies of certain genomic regions when compared to normal cells (Copy Number Alterations: CNAs). Some of these CNAs may have a direct influence on the expression of genes in the affected region. The change in the number of copies of a gene may be both positive, when additional copies are gained (and the genes thus amplified) or negative, when one or more alleles of the gene are lost. The influence of CNAs on the expression of these amplified or lost genes depends on whether it occurs hetero- or homozygously and also on other regulatory factors which may override the effect of the alteration. Therefore, an essential step to verify the importance of the amplification or deletion of a given gene in the tumorigenic process is to verify if its expression tends to respond to its genomic alterations.” (Excerpt from our blog)
If we have gene alteration and expression data for all the samples, we can explore the alterations within the sample set and then switch the visualized data to expression to check whether the genomic alteration has a visible effect on expression. We want to know which genes’ expression seems to be most influenced by alteration events in the TP53 pathway. In other words, we want to know the cis-effect of the alterations in the pathway. We can explore the cis-effect by eye, as shown in the image on the right of this section, and we can run a group comparison analysis, which compares the expression data of two sample groups for each gene and decides whether there is a significant difference between them.
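As a rough illustration of what such a group comparison computes for one gene, the sketch below splits the expression values into altered and unaltered samples and applies a rank-sum (Mann-Whitney U) test with a normal approximation. The choice of test is an assumption of this sketch; the test actually used is the one described in the wizard. The function names and the event value `2` are hypothetical, and no tie or continuity correction is applied.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(altered, unaltered):
    """Return (U, left-tail p, right-tail p) for the altered group.

    A small right-tail p suggests higher expression in altered samples,
    a small left-tail p suggests lower expression.
    """
    n1, n2 = len(altered), len(unaltered)
    ranks = average_ranks(list(altered) + list(unaltered))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    right = 0.5 * math.erfc(z / math.sqrt(2))
    left = 0.5 * math.erfc(-z / math.sqrt(2))
    return u, left, right


def split_by_alteration(expression, cna, event):
    """Split one gene's expression values by CNA status; a missing CNA value counts as 0."""
    altered, unaltered = [], []
    for sample, value in expression.items():
        if value is None:
            continue  # samples without expression cannot be compared
        (altered if cna.get(sample, 0) == event else unaltered).append(value)
    return altered, unaltered


if __name__ == "__main__":
    expression = {"S1": 2.0, "S2": 0.1, "S3": None, "S4": 1.8, "S5": -0.2, "S6": 2.4, "S7": 0.3}
    cna = {"S1": 2, "S4": 2, "S6": 2}  # hypothetical coding of the event
    altered, unaltered = split_by_alteration(expression, cna, event=2)
    print(rank_sum_test(altered, unaltered))
```

With only a handful of altered samples the normal approximation is crude, which is one reason to relate the significance of a result to the number of events observed.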
If not open already, download and open the TCGA-GBM heatmap from Gitools datasets. Select to show all rows (genes). How many genes does the data set contain?
Switch to showing expression data and filter out the samples with no expression data.
Switch to CNA status: we want to know how many CNA events we observe per gene. Right click on a gene name and select “Add header”
After adding the aggregated values as a row header, right click on it and select Sort all rows des. by Count. If the rows are sorted in ascending order the first time, repeat the same step and they will be sorted in descending order.
Find a gene that has 10 CNA events, right click on the count and select:
It’s important to know what values represent Gain and Loss. If necessary look at the CNA color scale and write down the values. Which value is used for Loss? Which value is used for Gain?
Select to take values from expression.
Unselect Copy heatmap headers.
Select to group by Value: We want to group columns according to their CNA value.
Why do we have to choose this one?
be considered as 0. Why do we have to choose this option?
Click next and read the test description (make a screenshot as it may help to interpret the results).
to see the group comparison analysis results.
see the group names at the top
What do the colors mean? Compare the p-value-log-sum with right-tail and left-tail significance.
significance of the result to the number of events?
Sort the rows by p-value-log-sum (Absolute sum, Descending)
left-p-value? A significant right-p-value? Which are likely to be affected by Gain or Loss? How many altered and unaltered samples have been observed for each of the genes?
The Sample Level Enrichment Analysis (SLEA) allows us to collapse the expression level status from a group of genes (as for example pathways) into a single row in the heatmap.
We will perform the SLEA with the Glioblastoma median-centered expression data. In addition, we need a module file: the file that describes which gene groups (or modules) we want to analyze. We have already prepared a file containing some KEGG pathway modules in two-column mapping (TCM) format (kegg.pathways.tcm).
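A two-column mapping file can be read into a module-to-genes dictionary with a few lines of Python. This sketch assumes tab-separated lines with the gene in the first column and the module in the second; the column order is an assumption, so check kegg.pathways.tcm in a text editor and flip `item_first` if needed. The three example lines are written by hand for illustration.

```python
from collections import defaultdict


def parse_tcm(text, item_first=True):
    """Parse two-column mapping text into {module_id: set of item ids}.

    item_first=True means column 1 is the item (gene) and column 2 the module.
    Blank lines, lines starting with '#', and lines with fewer than two
    fields are skipped.
    """
    modules = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        first, second = fields[0].strip(), fields[1].strip()
        item, module = (first, second) if item_first else (second, first)
        modules[module].add(item)
    return dict(modules)


if __name__ == "__main__":
    text = "TP53\thsa04115\nCDKN2A\thsa04115\nEGFR\thsa04012\n"
    for module, genes in parse_tcm(text).items():
        print(module, sorted(genes))
```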
Windows users may want to start Gitools with the .bat file that allocates 2GB of RAM or more to speed up the calculations.
First let us prepare the data:
To perform SLEA in Gitools we will run an enrichment analysis with continuous values, since we want to measure the degree of enrichment rather than a boolean enriched/not-enriched call. A step-by-step wizard will show up and guide us through the analysis setup. For our SLEA we will need to do the following steps:
Once you have the results, open them by going to “Heatmap” under “Results” in the new screen that appeared in Gitools. Remember that you can save the analysis and afterwards always open it again.
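The idea behind a per-sample enrichment score can be sketched as a z-score: for one sample, compare the mean expression of the genes in a module with the mean of all genes in that sample, scaled by the expected spread of a mean over that many genes. This is a simplified analytic version written for illustration; how Gitools estimates the background distribution may differ, and `slea_zscore` is a hypothetical name.

```python
import math
import statistics


def slea_zscore(sample_expression, module_genes):
    """Z-score of a module's mean expression within one sample.

    sample_expression: {gene: value} for a single sample.
    module_genes: iterable of gene ids belonging to the module.
    Returns None when the score cannot be computed.
    """
    background = list(sample_expression.values())
    observed = [sample_expression[g] for g in module_genes if g in sample_expression]
    if not observed or len(background) < 2:
        return None
    sigma = statistics.pstdev(background)
    if sigma == 0:
        return None
    mu = statistics.mean(background)
    # Standard error of a mean over len(observed) genes drawn from the background.
    return (statistics.mean(observed) - mu) / (sigma / math.sqrt(len(observed)))


if __name__ == "__main__":
    sample = {"A": 2.0, "B": 2.0, "C": -2.0, "D": -2.0}  # toy median-centered values
    print(slea_zscore(sample, {"A", "B"}))
    print(slea_zscore(sample, {"C", "D"}))
```

Computing this score for every sample and every module yields a module-by-sample matrix, which is what the resulting heatmap displays.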
Correlation
The SLEA result from step 4 already reveals that different samples have different expression levels for some pathways. The clusters from the iCluster analysis already reveal that there are groups of samples that seem similar to one another. If we are not happy with the grouping, we have other options for comparing the samples: we can cluster similar samples together, or we can perform a correlation analysis, which gives similarity measures between samples.
Correlations
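As a rough illustration of what a sample-by-sample correlation result contains, the sketch below computes Pearson correlations between sample profiles. Pearson is an assumption of this sketch (other correlation measures exist), and the function names and toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles, ignoring positions with None."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(profiles):
    """Return {sample_a: {sample_b: r}} for every pair of samples.

    profiles: {sample_id: list of values in the same row order}
    """
    ids = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in ids} for a in ids}


if __name__ == "__main__":
    profiles = {"S1": [1.0, 2.0, 3.0], "S2": [2.0, 4.0, 6.0], "S3": [3.0, 2.0, 1.0]}
    for sample, row in correlation_matrix(profiles).items():
        print(sample, row)
```

The result is a square, symmetric matrix with the samples on both axes, so its size grows quadratically with the number of samples.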
You can choose to do a clustering with either the SLEA-result or the correlation of the samples. As a start we can do a hierarchical clustering with the SLEA-result samples.
With the correlation result we could do the same, but it is computationally very expensive. Therefore we choose to do a K-means clustering.
- When doing a K-means clustering, we need to choose how many clusters we want to have. Check the hierarchical clustering at level 5 or 6 and decide a good number of clusters.
- Select Analysis > Clustering
- Select the K-means clustering for z-scores and columns.
- In the clustering settings, change the number of clusters to the number you decided on and click Finish.
- Similar samples should be clustered together, which you can see along the diagonal where similarity triangles accumulate.
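For readers who want to see what the K-means step in the list above does, here is a minimal sketch in plain Python. It uses squared Euclidean distance and takes the first k samples (in sorted id order) as starting centroids so that the result is reproducible; both choices are assumptions of this sketch and not necessarily what Gitools does. The sample ids and values are toy data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Assign each sample to one of k clusters.

    points: {sample_id: list of values}; returns {sample_id: cluster index}.
    """
    ids = sorted(points)
    # Deterministic start: the first k samples act as initial centroids.
    centroids = [list(points[i]) for i in ids[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each sample goes to its nearest centroid.
        new_assignment = {
            i: min(range(k), key=lambda c: squared_distance(points[i], centroids[c]))
            for i in ids
        }
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in ids if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == "__main__":
    profiles = {
        "S1": [0.0, 0.1], "S2": [5.0, 5.1],
        "S3": [0.2, 0.0], "S4": [5.2, 4.9],
    }
    print(kmeans(profiles, k=2))
```

Because the number of clusters is fixed in advance, the outcome depends on the chosen k, which is why the hierarchical clustering is inspected first.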