An introduction to scRNA-seq data analysis.
00:04 - Before diving into this slide deck, we recommend that you have a look at the following.
00:09 - How are samples compared? How are cells captured? How does bulk RNA-seq differ from scRNA-seq? Why is clustering important? To understand the pitfalls in scRNA-seq sequencing and amplification, and how they are overcome.
00:25 - Know the types of variation in an analysis and how to control for them.
00:29 - Grasp what dimension reduction is, and how it might be performed.
00:33 - Be familiarised with the main types of clustering techniques and when to use them.
00:38 - Greetings everybody and welcome to the Galaxy single cell RNA-seq analysis workshop.
00:44 - Here we will walk you through some of the basics and concepts when dealing with single cell data.
00:50 - Let’s start with what the differences are between Bulk RNA-seq and single cell RNA seq data.
00:56 - With Bulk RNA-seq we compare two tissues by looking at the average expression of each gene detected across each of the tissues.
01:04 - Due to the number of RNA molecules being considered, the sequencing depth and the strength of the analysis are reasonably high.
01:11 - The differential expression is then measured as the relative expression of a given gene between one tissue and another.
01:19 - With single cell RNA-seq analysis, the stage shifts away from measuring the average expression of a tissue.
01:25 - And towards measuring the specific gene expression of individual cells within those tissues.
01:31 - Here we are no longer comparing tissue against tissue, but cell against cell.
01:36 - Each cell is assigned a gene profile which describes the relative abundance of genes detected within it.
01:42 - Many cells share the same gene profile, where a gene profile ideally describes a cell type.
01:48 - Sometimes we need to compare single-cell datasets across tissues, and we see that many cells across tissues share the same cell type.
01:56 - For example, look at the purple and green gene profiles which are shared across both tissues.
02:03 - New technologies mean new methods and techniques to harness the new features that come with them.
02:08 - Single-cell RNA-seq data requires different means of library preparation, sequencing, quality control and analysis.
02:18 - For example, how are cells captured and sequenced? In bulk RNA-seq analysis, the process involves taking a sample, removing unwanted molecules and sequencing everything else.
02:31 - For single cell analysis, the process is much the same, except that each sample is a cell.
02:37 - And must therefore be sequenced separately from other cells.
02:41 - Once the cells are isolated, a unique barcode is added to each cell, and the cells are then sequenced.
02:47 - The level of resolution in single-cell is at the cell level, and each cell is unique.
02:52 - Therefore, the concept of biological replicates is not quite the same as that in bulk RNA-seq.
03:00 - Cell isolation can be performed in different ways.
03:04 - One method is manual pipetting, where wet lab scientists suction up individual cells using a long thin tube.
03:12 - They can do this hundreds of times to isolate hundreds of cells, but it is error-prone, and often multiple cells are isolated together.
03:21 - Another method is flow cytometry, which reduces the human-error component of this stage.
03:28 - Flow cytometry floats cells in a shallow liquid bath and streams them along a narrow channel, just narrow enough for one cell to pass through.
03:36 - Cells can be screened by a variety of properties this way, such as by their light scatter properties, and from fluorescent cell labelling.
03:44 - Cells can be tagged and isolated in this manner.
03:48 - Optical scatter properties can be used to probe size and consistency of the cell, where cells with a smaller size than the laser wavelength yield lower intensities and more inconsistent scatter patterns.
04:00 - There are two main types of optical scatter: Forward scatter, and Side scatter.
04:06 - Forward scatter is aligned with the main laser and measures the diameter of the cell, which is ideal for distinguishing different cells by their size profiles.
04:15 - For example, monocytes are typically larger than lymphocytes, as seen on the X-axis of the example image.
04:23 - Side scatter is perpendicular to the main laser, and measures the granularity of the cell, ideal for distinguishing cells with less defined internal structures, such as the granulocytes on the Y-axis of the example image.
04:37 - Cells can also be gated and characterised by their cell surface markers via FACS.
04:42 - By plotting different surface marker intensities against one another, cells can be separated, gated, and labelled based on these fluorescent properties.
04:53 - Once isolated, cells can be barcoded. Barcodes are unique sequences that are added to each RNA molecule.
05:00 - They are not unique to the molecule, but unique to the cell such that any two RNA molecules will be tagged by the same cell barcode, should they exist in the same cell.
05:10 - RNA molecules from different cells will have different cell barcodes.
05:15 - Once the RNA molecules have been tagged by cell barcodes, they can be amplified, either separately or pooled together, where the amplified products share the same cell barcodes as their original counterparts.
05:28 - PCR amplifies the gene products to make them more easily detectable during sequencing.
05:34 - When there is a lot of gene product to amplify, as is the case for bulk RNA-seq, PCR works quite well in amplifying all products in a reasonably well-represented manner.
05:45 - However, in the case of single-cell products, the amount to amplify is very small, and many unique reads might be missed during this phase whereas others may be over amplified, as shown in the blue and red transcripts in the example.
05:59 - To guard against this type of amplification bias, we can add a random element to the barcoding.
06:05 - These random barcodes, known as UMIs (unique molecular identifiers), tag transcripts such that any two transcripts of the same gene are likely to have different random barcodes.
06:16 - Let us consider the example to the left: we have 2 red transcripts and 2 blue transcripts inside the cell, which after amplification equate to 6 red transcripts and 3 blue transcripts.
06:27 - If we were to compare the differential gene expression between the red and blue transcripts, just by looking at the amplified reads, we would come to the false conclusion that the red transcripts are expressed twice as highly as the blue.
06:40 - However if we group the reads by their UMIs, and then count only the number of unique UMIs per transcript, de-duplicating the reads which share the same transcript and UMI, we arrive at 2 red reads and 2 blue reads which better represents the true number of transcripts.
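The de-duplication step can be sketched in a few lines of Python. The read tuples, the cell barcode `CELL1` and the three-letter UMIs are made-up values chosen to mirror the 6 red and 3 blue amplified reads in the narration; they are not taken from the slide.

```python
from collections import defaultdict

# Hypothetical amplified reads as (cell_barcode, gene, umi) tuples.
# Before amplification there were 2 red and 2 blue molecules.
reads = [
    ("CELL1", "red", "AAC"), ("CELL1", "red", "AAC"), ("CELL1", "red", "AAC"),
    ("CELL1", "red", "AAC"), ("CELL1", "red", "GTT"), ("CELL1", "red", "GTT"),
    ("CELL1", "blue", "CGA"), ("CELL1", "blue", "CGA"), ("CELL1", "blue", "TTG"),
]

def count_reads(reads):
    """Naive count: every amplified read adds one to its (cell, gene)."""
    counts = defaultdict(int)
    for cell, gene, _umi in reads:
        counts[(cell, gene)] += 1
    return dict(counts)

def count_umis(reads):
    """De-duplicated count: number of distinct UMIs per (cell, gene)."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

raw_counts = count_reads(reads)
umi_counts = count_umis(reads)
print(raw_counts)  # red looks twice as abundant as blue
print(umi_counts)  # both genes collapse back to 2 molecules
```

Grouping by cell barcode as well as gene matters here, because the same UMI may legitimately occur in different cells.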
06:57 - UMIs are relatively random, but not truly random.
07:01 - Notice that the pink UMI appears twice: once in the blue transcript and once in the pink transcript.
07:08 - This happens because there are often more transcripts than available UMIs: the number of transcripts depends on the cell, and the number of possible UMIs depends on the length of the barcode.
07:18 - Consider a set of barcodes of length 5 with an edit distance of 1 between adjacent barcodes, and another set with an edit distance of 2.
07:27 - The former is not robust against common sequencing errors of 1 base pair, but the latter only allows for half the number of barcodes.
07:35 - This trade-off between the number of available barcodes and guarding against sequencing errors is instrumental in the design of cell barcodes and UMIs.
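As a rough illustration of this trade-off, the sketch below measures the minimum pairwise distance of two small barcode sets. It uses the Hamming distance (substitutions only) as a simplifying assumption in place of a full edit distance, and both barcode sets are invented for the example.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# Hypothetical sets of length-5 barcodes.
dense = ["AAAAA", "AAAAC", "AAAAG", "AAAAT"]   # neighbours differ by 1 base
spaced = ["AAAAA", "AAACC", "AAGGA", "ATTAA"]  # neighbours differ by >= 2 bases

# A single sequencing error turns AAAAA into AAAAC.
misread = "AAAAC"
print(min_pairwise_distance(dense), misread in dense)    # error goes unnoticed
print(min_pairwise_distance(spaced), misread in spaced)  # error is detectable
```

In the dense set the misread barcode is itself a valid barcode, so the read would be silently assigned to the wrong cell; in the spaced set it matches nothing and can be flagged.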
07:44 - In the context of amplification, UMIs do not need to be unique, they just need to be random enough to deduplicate transcripts in order to give a more accurate estimate of the number of transcripts within a cell.
07:57 - So let’s just recap what we’ve learned: First, cell barcodes are added to each RNA molecule in each cell.
08:06 - Then we add random UMIs to all transcripts, which further tag the molecules.
08:10 - These can then be used to deduplicate the transcripts after amplification.
08:15 - After amplification we need to perform some quality control.
08:19 - One way to do this is to set thresholds on the limits of detectability for genes and for cells.
08:25 - Consider an analysis governed only by 3 genes (G1, G2 and G3), and 5 cells (A, B, C, D and E).
08:34 - The first row of the top table defines the library size, which is the total number of messenger RNAs across all genes in each cell.
08:42 - The subsequent rows are the thresholds of gene detectability, displaying how many genes are detected in each cell with counts greater than each threshold, from 0 to 4.
08:52 - We see that even a threshold of greater than 3 transcripts detected in a given cell still keeps 3 cells in the analysis: B, C, and E. In the lower table, the opposite is represented, with the total number of transcripts across all cells for each gene.
09:08 - By setting thresholds of detectability, we can see how many cells are described by the gene for that threshold.
09:15 - In both cases, we can see that if we set the thresholds too low, then we risk keeping low quality genes or cells, but if we set the thresholds of detectability too high, then we risk losing too many.
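The two tables can be mimicked with a toy count matrix. The counts below are hypothetical and are not the values on the slide; only the gene names G1 to G3 and cell names A to E follow the narration.

```python
cells = ["A", "B", "C", "D", "E"]
# Hypothetical counts: one row per gene, one column per cell.
counts = {
    "G1": [0, 5, 4, 1, 6],
    "G2": [1, 4, 0, 0, 5],
    "G3": [0, 6, 7, 2, 0],
}

def library_sizes(counts, cells):
    """Total transcripts per cell, summed over all genes."""
    return {c: sum(row[i] for row in counts.values()) for i, c in enumerate(cells)}

def genes_detected(counts, cells, threshold):
    """Per cell, the number of genes with a count above the threshold."""
    return {
        c: sum(row[i] > threshold for row in counts.values())
        for i, c in enumerate(cells)
    }

def cells_detected(counts, threshold):
    """Per gene, the number of cells with a count above the threshold."""
    return {g: sum(v > threshold for v in row) for g, row in counts.items()}

print(library_sizes(counts, cells))
for t in range(5):
    print(t, genes_detected(counts, cells, t), cells_detected(counts, t))

# Cells that still have at least one gene detected above a threshold of 3.
kept = [c for c, n in genes_detected(counts, cells, 3).items() if n > 0]
print(kept)
```

Printing the counts for every threshold makes it easy to see where cells or genes start to disappear, which is the point at which a threshold becomes too strict.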
09:28 - Filtering can be a luxury however, as many single-cell RNA-seq datasets typically have low sequencing depth compared to bulk RNA-seq.
09:37 - During the process of normalisation, samples are scaled against one another to make them more comparable.
09:43 - This is normally performed by using median values. For example, in DESeq normalisation, the geometric mean of each gene is taken across all cells, each count is divided by the geometric mean of its gene, and the median of these ratios within a cell is used as the size factor by which that cell is scaled.
10:00 - If median gene expression is high, then this normalisation method works quite well.
10:06 - But if the median gene expression is zero, as is often the case with single-cell data, then we have the problem of dividing by zero.
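The zero problem can be made concrete with a small sketch of median-of-ratios size factors, the scheme usually associated with DESeq. This is an illustration under our own assumptions, not the DESeq code itself: the function names and the toy matrices are hypothetical, and genes containing a zero are simply skipped.

```python
import math
from statistics import median

def geometric_mean(values):
    """Geometric mean of positive counts; 0.0 if any count is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: one row per gene, one column per cell.
    """
    n_cells = len(counts[0])
    usable = []
    for row in counts:
        gm = geometric_mean(row)
        if gm > 0:  # a gene with any zero cannot form ratios
            usable.append((row, gm))
    if not usable:
        raise ValueError("every gene has a zero in some cell")
    return [median(row[j] / gm for row, gm in usable) for j in range(n_cells)]

# Hypothetical dense matrix: the second cell was sequenced twice as deeply.
dense = [[10, 20], [30, 60], [5, 10]]
factors = size_factors(dense)
print(factors)

# Hypothetical sparse matrix: no gene is free of zeros, so the method fails.
sparse = [[0, 3], [2, 0]]
try:
    size_factors(sparse)
except ValueError as err:
    print("cannot normalise:", err)
```

With the dense matrix the second factor is twice the first, as expected. With the sparse matrix no gene survives, which is the situation that pooling-based methods were designed to work around.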
10:14 - There are methods to get around these zero counts.
10:18 - One such method is the SCRAN method which works by creating overlapping pools of cells such that any individual cell is characterized by cells of similar library sizes.
10:28 - The method involves splitting all cells into an odd and even group by their library size, and arranging them onto a ring structure where neighbouring cells on the ring have similar sizes.
10:40 - Overlapping pools of fixed sizes are defined, resulting in each cell being defined by multiple pools.
10:46 - A linear model for that cell can then be built by the pools it occurs within, and normalisation factors for all cells can be determined this way.
10:55 - By this method, the issue of low sequence coverage is worked around by turning cells with low library sizes into useful components of a size factor that can be applied to similar cells.
11:07 - Such novel normalization methods were commonplace a few years ago, but as sequencing technologies have improved, the issue of many zero counts in a matrix has become less important, and normalisation size factors can be derived using bulk RNA-seq methods once again.
11:24 - Other factors that we need to take into account during a single cell RNA analysis are the unwanted factors that can confound the analysis.
11:32 - Ideally, we wish the gene profiles that separate different types of cells to be driven by biological variance.
11:39 - There is however confounding variation from both technical and biological sources that is not useful to the analysis but does contribute to the variance.
11:49 - Confounding biological variance appears in two forms: Transcriptional bursting, and Cell cycle variation.
11:56 - Transcriptional bursting is a phenomenon that occurs in cells in which transcription occurs in discrete states of active and inactive, where the interval between these states is hard to model.
12:06 - In bulk RNA-seq, this phenomenon is unnoticeable as the effects are averaged out over many cells. But in single cell, two cells of the same type may exhibit different gene profiles simply because one cell was actively transcribing and the other was not.
12:22 - This is not something we can control for in the analysis, but it is something we should be aware of when understanding why cell clusters can be noisy.
12:31 - Cell cycle variation on the other hand is a much better understood process, where the amount of RNA in one cell can be approximately double that of another cell of the same type, because one is in the M phase and the other is in the early G1 phase of the cell cycle.
12:46 - There are genes which are known to covary with the cell cycle, and so by regressing the effect of these genes, we can control against the cell cycle.
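One simple way to picture this regression is an ordinary least-squares fit of a gene's expression against a per-cell cell cycle score, keeping only the residuals. The score here is a hypothetical summary of known cell cycle genes, and the single-covariate fit is a deliberately minimal assumption; real pipelines may use more elaborate models.

```python
def regress_out(expression, covariate):
    """Return the residuals of expression after a linear fit on covariate."""
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:  # covariate is constant, so only centre the data
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]

# Hypothetical values for four cells.
cycle_score = [0.0, 1.0, 2.0, 3.0]
driven_gene = [1.0, 3.0, 5.0, 7.0]     # tracks the cell cycle exactly
unrelated_gene = [2.0, 5.0, 1.0, 4.0]  # mostly independent of it

print(regress_out(driven_gene, cycle_score))     # variation is removed
print(regress_out(unrelated_gene, cycle_score))  # variation is largely kept
```

A gene that only follows the cell cycle is flattened to zero, whereas a gene with other sources of variation keeps most of its signal.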
12:55 - Confounding technical variance appears in three forms: Amplification bias, Dropout events, and Library size variation.
13:03 - Amplification bias can be mitigated by UMIs as demonstrated before.
13:08 - Dropout events give rise to the prevalent zeroes in the count matrices, and their effect can be reduced by using clever normalisation techniques such as the pooling method shown previously, as well as by using better sequencing methods.
13:23 - Library size variation arises for a variety of different reasons, but is the main source of variation within an analysis.
13:30 - As in bulk RNA-seq, this is reduced with good normalisation methods.
13:36 - Once we have removed unwanted confounders from the analysis we have the issue of quantifying the relationships between cells.
13:43 - From a data analysis standpoint, we treat each cell as an observation, and each gene as a variable.
13:50 - For large genomes this means extremely high dimensional datasets. Cells exist as points in this extremely sparsely populated high dimensional space, making it difficult to see the natural groupings.
14:03 - The high dimensional space can be reduced a lot by simply filtering out genes that do not appear to be differentially expressed across all cells.
14:11 - To find the relationships between these cells however, we need to define the distances between cells.
14:18 - A distance matrix does just this, defining the distance between any two cells by a single score.
14:24 - Here we use the Euclidean distance on a 3 dimensional dataset of 3 genes (G1, G2 and G3), and 3 cells (R, P and V).
14:34 - The distance between any two cells can be calculated as the square root of the sum of the squared differences in gene values.
14:41 - Note how the distance matrix is symmetrical about the diagonal, confirming that, for example, the distance from cell R to V is the same as the distance from V to R, as expected.
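A distance matrix of this kind is short to compute. The expression values for cells R, P and V below are invented for illustration and are not the numbers on the slide; the distance used is the Euclidean distance, that is, the square root of the summed squared differences.

```python
import math

# Hypothetical expression of genes (G1, G2, G3) in three cells.
cells = {
    "R": (1.0, 2.0, 2.0),
    "P": (4.0, 6.0, 2.0),
    "V": (1.0, 2.0, 5.0),
}

def euclidean(a, b):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(cells):
    """Nested dict so that matrix[a][b] is the distance from cell a to cell b."""
    names = list(cells)
    return {a: {b: euclidean(cells[a], cells[b]) for b in names} for a in names}

dist = distance_matrix(cells)
for name, row in dist.items():
    print(name, [round(row[other], 2) for other in cells])
```

The diagonal is zero and the matrix is symmetric, so only half of it needs to be stored in practice.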
14:52 - Once a distance matrix is generated, we can perform K-nearest neighbours upon the distance matrix where directed edges are generated between cells.
15:01 - For each row of the distance matrix, the K cells with the smallest distance values are selected; these are the nearest neighbours of that row’s cell among the column cells.
15:12 - If the edges are mutually shared between neighbouring cells, then this is called a shared nearest neighbour approach.
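The neighbour selection can be sketched as follows, assuming a precomputed distance matrix stored as a nested dictionary. The four cells and their distances are hypothetical. Keeping only reciprocated edges follows the wording of the narration; some tools instead weight edges by how many neighbours two cells have in common.

```python
# Hypothetical symmetric distance matrix for four cells.
dist = {
    "R": {"R": 0, "P": 5, "V": 3, "W": 9},
    "P": {"R": 5, "P": 0, "V": 6, "W": 2},
    "V": {"R": 3, "P": 6, "V": 0, "W": 8},
    "W": {"R": 9, "P": 2, "V": 8, "W": 0},
}

def k_nearest(dist, k):
    """Directed kNN graph: each cell points to its k closest other cells."""
    return {
        cell: sorted((o for o in row if o != cell), key=lambda o: row[o])[:k]
        for cell, row in dist.items()
    }

def mutual_edges(knn):
    """Undirected edges that are present in both directions of the kNN graph."""
    return {
        frozenset((a, b))
        for a, neighbours in knn.items()
        for b in neighbours
        if a in knn[b]
    }

knn = k_nearest(dist, k=2)
print(knn)
print(mutual_edges(knn))
```

With K set to 2, the edge from V to P is dropped because P does not list V among its own two nearest neighbours.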
15:19 - We can represent this 3 dimensional space easily as 3 independent axes with points that denote the cells.
15:26 - Extrapolating this relatively low dimensional example to a real dataset with thousands of dimensions is beyond what a human can visualise.
15:35 - Dimensional reduction is a type of technique that takes a high dimensional dataset and produces a low dimensional representation, usually 2 dimensional, that tries to preserve the distances between the data points.
15:47 - Here the relative differences between cells are maintained in both the high and low dimensional representations.
15:53 - There are many different kinds of dimension reduction techniques, each with their own strengths and weaknesses dependent on the type and the dimensionality of the data.
16:03 - Once the number of variables of the dataset has been sufficiently reduced via filtering and dimensional reduction, clustering can be performed more easily.
16:13 - Here in this 2D projection, each circle is a cell, and the unique colours depict the clusters they have been assigned to.
16:20 - The physical distances between the groups of coloured cells tell us how good the clustering is for this projection.
16:27 - By inspecting the top differentially expressed genes in each cluster against all other clusters, clues to the type of cell that the cluster describes can be found.
16:36 - Cell types are often characterized by the expression of specific marker genes, and the presence of these genes is a strong indicator of type.
16:44 - Marker gene discovery can then be used to annotate the clusters.
16:49 - We can also further derive the relationships between these clusters by computing lineage trees based on the amount of noise in each cluster, with the expectation that stem cells have noisy expression profiles yielding broader clusters, and mature cells have very clear expression profiles yielding tighter clusters.
17:08 - The types of clustering you are likely to encounter in an analysis depend on the input datasets, where cells taken from late stage samples are less likely to be bunched together and are more likely to yield large visible gaps, known as hard clusters, that clearly define different types.
17:25 - Earlier stage datasets are more likely to yield softer clusters, where neighbouring clusters share soft boundaries as clusters intermingle slightly with one another.
17:35 - Soft clustering is to be expected, since although clustering is a statistical method for discretely partitioning data, the underlying cell biology of the data is a continuous process, where cells transition from one well-defined state to another through intermediate stages which are represented in-between two cluster centres.
17:55 - Because of the continuous nature of these single-cell datasets and the extremely high dimensionality of the data, discrete partitioning is often a poor model for partitioning the data.
18:05 - If we instead assume that cell clusters are related to one another via transitional cells which would naturally lie in-between clusters, then manifold learning techniques are better suited.
18:16 - These techniques derive an expression landscape that can not only be used to relate clusters to one another, but also can be used to infer lineage and hierarchy.
18:25 - To actually perform the clustering there are three commonly-used methods: K-means, hierarchical and community clustering.
18:33 - K-means and K-medians follow the same method: the number of clusters is defined beforehand, and the cluster positions are initialised at random.
18:41 - Each position is then updated by the contribution of the cells that are closer to it than to any other position.
18:47 - This process occurs multiple times until the positions no longer significantly change or until a set number of iterations has been reached.
18:56 - The final assignment of each cell then becomes the cluster assignment.
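The loop described above can be written out as a minimal K-means sketch. To keep the example reproducible, the starting positions are passed in explicitly instead of being drawn at random, which is our own simplification; the eight two-dimensional points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid and update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous position
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two hypothetical groups of cells in a 2D projection.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

# Both starting positions lie inside the first group on purpose.
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
print(centroids)
```

The outcome depends on the starting positions: other starts on the same data, for example `(0.0, 1.0)` and `(1.0, 0.0)`, can settle on a poor split, which is why K-means is usually run from several random initialisations.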
19:01 - Hierarchical clustering is more flexible and does not need an initial parameter to define the number of resulting clusters.
19:08 - Here the two closest points in a distance matrix are joined into a single group, distances are recalculated, and the two closest points are once again joined.
19:17 - This process repeats until all data has been consumed into one.
19:22 - By tracing the process backwards, a hierarchy can be established that is represented by a dendrogram.
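The merging procedure can be sketched as below. The narration does not say how the distance between two groups is recalculated, so this example assumes single linkage (the distance between the closest pair of members); the four points are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Agglomerative clustering; returns the merges in the order they occur.

    points: dict mapping a name to its coordinates.
    Each merge is (members_of_first, members_of_second, distance).
    """
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (5.0, 0.0), "d": (5.0, 3.0)}
merges = single_linkage(points)
for first, second, height in merges:
    print(first, "+", second, "at distance", height)
```

Read from last to first, the list of merges is the dendrogram: cutting it at a chosen height yields the clusters, so the number of clusters does not have to be fixed in advance.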
19:28 - Louvain clustering is a widely used type of community clustering for single cell data.
19:34 - Here each cell is assigned a neighbourhood of its own and the number of internal and external links between neighbourhoods are counted.
19:41 - For each iteration, a random cell is selected and brought within the neighbourhood of another cell, and the internal and external links are once again counted.
19:50 - If the new configuration has reduced the number of external links in favour of more internal links, then the configuration is kept.
19:59 - If the new configuration has instead increased the number of external links, then the configuration is rejected and another cell is picked and tested. By performing this multiple times, a community structure of cells is built to whichever degree of specificity the user desires.
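The move-and-test loop can be sketched on a tiny graph. This is only a simplified version of the first, local-moving phase of Louvain, written under our own assumptions: the score being improved is the modularity, which compares internal links against what the node degrees would predict, and the six-node graph (two triangles joined by one link) is hypothetical.

```python
from fractions import Fraction

def modularity(edges, community):
    """Modularity of a partition; community maps each node to a label."""
    m = len(edges)
    internal, degree = {}, {}
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        Fraction(internal.get(c, 0), m) - Fraction(degree[c], 2 * m) ** 2
        for c in degree
    )

def local_moving(edges, nodes):
    """Greedily move nodes into neighbouring communities while the score rises."""
    community = {n: n for n in nodes}  # every cell starts on its own
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            original = community[node]
            best_label, best_q = original, modularity(edges, community)
            for other in sorted(neighbours[node]):
                community[node] = community[other]  # trial move
                q = modularity(edges, community)
                if q > best_q:
                    best_label, best_q = community[other], q
            community[node] = best_label  # keep the best move, or revert
            if best_label != original:
                improved = True
    return community

# Hypothetical graph: two triangles joined by a single link (2, 3).
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = local_moving(edges, nodes)
print(community)
print(modularity(edges, community))
```

Exact fractions are used so that ties between candidate moves are not decided by floating-point noise. The full Louvain method would next collapse each community into a single node and repeat, which is what gives different levels of granularity.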
20:18 - Single cell analysis is non-trivial, and each stage, from the filtering to the normalisation to the dimension reduction and the clustering can drastically affect the outcome of the analysis.
20:29 - Due to the variability in the analysis, one should not panic when faced with uncertainty.
20:34 - The goal is to play around with the data until it begins to reflect the biology.
20:39 - This can take many many tries to achieve, and it may never be perfect, but the idea is to try as many different ways as possible to see what robust conclusions you can come to.
20:50 - In this regard, the vast UseGalaxy resources can be put to good use by testing out the many different paths of the analysis, and the Galaxy Training Network provides tutorials and hands-on trainings to assist you in this regard.
21:03 - Please explore them to better develop your understanding.
21:07 - scRNA-seq requires much pre-processing before analysis can be performed.
21:12 - Groups of similarly-profiled cells are compared against other groups.
21:16 - Detectability issues require careful consideration at all stages.
21:20 - Clustering is an integral part of an analysis.
21:24 - Thank you for watching!