Conflict in multi-gene datasets: why it happens and what to do about it

May 26, 2021 05:56 · 7323 words · 35 minute read

Hello, and welcome to this Australian BioCommons webinar.

00:05 - My name is Melissa Burke and I’m the training and communications officer with Australian BioCommons, and I’ll also be your host for today.

00:13 - In this series of webinars we aim to share useful information about the latest digital techniques, data and tools for the life science community.

00:21 - Each month we hear from our local and international peers who present a bioinformatics topic that we hope will support Australians to deliver their best environmental, agricultural, and medical research.

00:33 - You can keep up to date with the latest Australian BioCommons news and events via the channels listed on your screen.

00:42 - Before we begin, we would like to take a moment to acknowledge the traditional owners and their custodianship of the lands on which we meet today.

00:53 - In my case, this is the Turrbal and Jagera people.

00:56 - We pay our respects to their ancestors and their descendants who continue cultural and spiritual connections to Country.

01:03 - We recognise their valuable contributions to Australian and global society.

01:07 - Today we’re thrilled to welcome Dr Alexander Schmidt-Lebuhn to speak to us about conflict in multi-gene datasets.

01:18 - Alexander is a Research Scientist at the Centre for Australian National Biodiversity Research in the CSIRO.

01:24 - He is currently lead of the Phylogenomics Bioinformatics Working Group for the Australian Angiosperm Tree of Life initiative and leads a Future Science Platform Environomics project on high-throughput sequence capture.

01:40 - Alexander’s research interests include the systematics and evolution of flowering plants, in particular of Asteraceae (daisy family), biogeography, user-friendly species identification tools including through the application of computer vision, and polyploidy.

01:57 - He uses DNA sequence data to resolve phylogenetic relationships and understand the evolution of native Australian plants.

02:05 - Welcome Alexander and I will now hand over to you to start your presentation.

02:11 - Thank you very much. Hello, I assume everybody can now see the slides and hear me otherwise please warn me.

02:17 - Thank you very much for this kind introduction.

02:20 - I’m very grateful for the opportunity to speak about this and I’m very grateful for everybody’s interest in this topic.

02:28 - What do I want to talk about today? First I want to give an overview of what this is about and why we are interested in it. Then I’m going to give a quick overview, probably not brand new to everybody, of the nature of the data we’re going to use in the phylogenomics pillar of Genomics for Australian Plants versus traditional Sanger loci. And then the three topics that I really want to discuss are deep coalescence, paralogy and reticulation: three different processes or patterns that can lead to incongruence in our data sets.

03:05 - What is the background here? Genomics for Australian Plants is an initiative that is co-funded by Bioplatforms Australia and a broad variety of state herbaria and other research organizations.

03:20 - We have three aims in Genomics for Australian Plants: first, to develop genomics resources.

03:25 - Second, to increase the understanding of the evolution and the conservation of the Australian flora.

03:32 - And third, to increase our capability as a botanical community to make use of technology resources and modern analytical methods.

03:41 - There are three pillars or activity areas in Genomics for Australian Plants. The first is the assembly of reference genomes for the Australian flora.

03:51 - The others are phylogenomics, also known as the Australian Angiosperm Tree of Life, and conservation genomics, and it is the phylogenomics area that this webinar falls into.

04:03 - This webinar is part of broader efforts at increasing our capability and helping each other.

04:10 - There is going to be another webinar, more on that later, and a series of three workshops associated with the Australasian Systematic Botany Society meeting that is taking place in July. Please visit the conference website for additional information if you’re interested.

04:27 - So again, this is about phylogenomics and the premise here is that either in GAP or in other projects we are now dealing with large numbers of low copy nuclear genes.

04:40 - In the case of the Australian Angiosperm Tree of Life, these are produced with sequence capture, also known as target enrichment.

04:48 - And the premise is also that we have sampling at a coverage of roughly one per genus, one per species, or one per subspecies or variety, so the kind of problem that is really phylogenetic, at about species level and deeper, about understanding the evolution of a genus, a tribe or a family.

05:13 - And it is not about species delimitation or population genetics, or other related problems.

05:21 - So the data we’re using come from target capture, from sequence enrichment, however you want to call it. How does that compare to the data we’ve traditionally been dealing with? As I have already indicated, we’re primarily using the Angiosperms353 kit. As the name implies, we’re going to have hundreds of different markers, whereas in the Sanger-based studies that I also started with we generally had only between one and five markers per study.

05:54 - In addition, we get considerably more raw read data, even per marker and per sample.

06:00 - So, in the Sanger age, we had a forward read and a reverse read, and they would add up to a few megabytes of trace files that we would get out of the sequencer.

06:10 - So that is before we build contigs. Our enrichment data are now starting to come in.

06:17 - With those, we have at least hundreds of megabytes, or potentially a few gigabytes, of raw data that we have to analyze per sample.

06:26 - So that’s considerably more challenging to then arrive at our alignments.

06:31 - In addition, the Sanger data were nearly always drawn from high-copy regions of the genome, primarily the nuclear ribosomal and plastid regions.

06:47 - We could generally concatenate them into two phylogenies, the nuclear and the plastid one, and then we often found that they had at least slight incongruence between each other.

06:58 - In this case we have hundreds of nuclear low-copy genes that are pretty much inherited and recombined independently.

07:06 - So we have a lot more incongruence, and the processes that cause this incongruence are of course precisely what this talk is about.

07:13 - And finally, this is not central to our concerns today, but just as a little side remark: often the Sanger markers that we all started with would either be non-coding regions, perhaps spacers or introns, or they would in some cases be coding, but then they would be tRNA regions, for example in the chloroplast, so they are coding but not protein-coding. So it often made intuitive sense to partition our data sets by sequence region. In GAP, for example, we are instead using something like the Angiosperms353 kit for our enrichment.

07:49 - All those genes that are being targeted with that kit are protein-coding, so we may want to consider partitioning our data set by codon position, because of course the third position is evolving much faster than the other two. I also want to give a very quick overview of where in our overall analysis pipeline, which is of course very simplified here, we are sitting with today’s presentation. We start, as mentioned, with a lot of raw reads from next-gen sequencing, and they get assembled against the reference sequence.

08:23 - And then for each gene and each species we hope to get at least one contig, but if there are different variants of the gene in a given species then potentially more than one; more on that obviously later.

08:35 - And then across all our samples, or all our species, we build gene alignments for each gene.

08:41 - And then we have three main ways we can do the phylogenetic analysis.

08:46 - Either the so-called shortcut methods, or concatenated analysis, or a full multi-gene coalescent analysis in Bayesian software such as BEAST.

08:58 - Now crucially, what this webinar is focusing on is this lower part: we assume that we have got the gene alignments across all our samples.

09:09 - And we want to figure out, if we’ve got conflicts in our data, how we get to a species phylogeny.

09:15 - We assume on the other hand, that this first part of the pipeline here has been done well and without any fundamental problems so we assume that the assembly has worked so that we don’t get any kind of errors in our contigs, nothing has been assembled together that shouldn’t have been assembled like that.

09:34 - And we assume that for example in the HybPiper assembly pipeline we have actually discovered all the potential variants of a gene that I’m going to talk about later with a paralog finder script.

09:48 - If we don’t do any of this right, of course we’re going to mislead the downstream analyses, even if we do them correctly in principle, because they’re being fed bad inputs.

09:59 - So the three processes that I want to talk about, again, are first, incomplete lineage sorting and deep coalescence; then second, gene duplication and loss, paralogy and orthology.

10:13 - And then third, reticulation: things like hybridization, introgression and chloroplast capture.

10:18 - Now, none of these processes is newly discovered.

10:22 - There’s a very very good review article, very early from 1997 that I can recommend as a very nicely written introduction.

10:30 - It is just that with the kind of data we have today this is becoming an ever more burning issue than it would have been 10 or 15 years ago.

10:39 - In each case, I want to consider the following.

10:43 - First, what actually is presumed to happen, what is the biological process that is causing incongruence.

10:50 - Then, what would it look like in our data, how would it present in our data, here assuming, in the first instance, an ideal case.

10:59 - So we would assume that only the process currently under consideration is causing our incongruence, not interfered with by any of the others.

11:09 - And we have a perfect data set with everything that we need to immediately see what is going on.

11:14 - That is, needless to say, not necessarily realistic in real life.

11:18 - And then third, without going into too many of the details just a quick overview of how we then get from our incongruent alignments and gene trees to the species trees so just what are the approaches that are being used in principle, and a few software options.

11:34 - And I should add we wanted to make the slides of this talk available and the last two slides of this presentation here are a variety of references to review papers and methods and software announcements with links that I hope will be useful to anybody who’s interested in this area.

11:54 - Starting with the meat of the webinar: first, deep coalescence and how it is caused.

12:00 - What we see here at the bottom is a tiny made-up species phylogeny of three species, the time axis going from the left to the right.

12:12 - And they have been produced by two speciation events, and in each case inside the species we see these little groups of two connected dots. Each of these pairs of dots is meant to be an individual in those species lineages, and the color of the two dots indicates what alleles they have for a given gene, one inherited from the mother and one inherited from the father, and then hopefully they pass them on to some of their descendants.

12:40 - And the key point of this graph here is to illustrate what might happen with random selection of alleles at speciation events and then the evolution afterwards so in this case we have an extremely colorful ancestor that would have started with perhaps an unrealistically large number of different alleles.

13:01 - Consider each speciation event, for example the one indicated by the first arrow.

13:05 - There, the two daughter lineages randomly grab some kind of subsample of the allele diversity that is available in the ancestor.

13:13 - And so we may have a situation, as indicated with the second arrow, where one species has got alleles that are paraphyletic to the alleles in the sister species or another closely related species.

13:31 - If we envision a gene tree, an allele tree, it will show one of these species as paraphyletic.

13:34 - And then, as we see, over time we lose allele diversity in individual lineages. Now this simulation presumably assumes that no new alleles could be generated by mutation.

13:47 - But even then, there’s limited space in each species lineage, and through a process of genetic drift, even without any selection going on, purely at random some of the alleles will get rarer and then some might have the bad luck to disappear entirely.

14:04 - If you wait long enough in this species lineage, the alleles in that lineage will become monophyletic again.
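
The loss of allele diversity through drift alone can be tried out with a few lines of Python. This is only a minimal sketch of a neutral Wright-Fisher process, assuming a constant number of gene copies, no mutation and no selection, as in the simulation described on the slide; the function name and parameters are made up for illustration.

```python
import random


def drift_until_fixation(n_copies, n_alleles, seed=0, max_generations=100_000):
    """Neutral Wright-Fisher drift without mutation or selection.

    Every generation, each gene copy picks a random parent copy from the
    previous generation. Returns (generations_elapsed, surviving_alleles)
    once a single allele is left, or after max_generations.
    """
    rng = random.Random(seed)
    # Start with the alleles spread as evenly as possible over the gene copies.
    population = [i % n_alleles for i in range(n_copies)]
    for generation in range(max_generations):
        if len(set(population)) == 1:
            return generation, set(population)
        # Random sampling with replacement is the only force acting here.
        population = [rng.choice(population) for _ in range(n_copies)]
    return max_generations, set(population)


if __name__ == "__main__":
    for size in (20, 200):
        generations, survivors = drift_until_fixation(size, n_alleles=10, seed=1)
        print(f"{size} gene copies: {len(survivors)} allele(s) left "
              f"after {generations} generations")
```

Running it with different seeds and population sizes gives a feel for how the waiting time until only one allele remains tends to grow with the amount of "space" in the lineage.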

14:11 - But the situation here, this intermediate situation with the lineage sorting not having been completed and the alleles being paraphyletic to those of the sister species, is the one that we’re really concerned about, and that is what’s causing the trouble.

14:29 - First, a side note, however, because there seems to be a lot of confusion sometimes.

14:34 - It is important to note that incomplete lineage sorting does not have any implications for taxonomy, for classification. It does not mean that the species is badly circumscribed, and that has two reasons.

14:47 - First of all, "the species is non-monophyletic" or "the species is monophyletic or paraphyletic" or something like that is not even a meaningful sentence if we are talking about non-clonal species, about anything sexually reproducing, precisely for the reason that this graph indicated: inside the species we’ve got a network structure. All the words ending in -phyly and -phyletic only apply if you’ve got a tree relationship between the units you’re looking at, for example independently evolving species lineages that only very rarely exchange genes, but they don’t apply within a species.

15:23 - And second, it is also a category error, because as taxonomists we don’t actually classify the alleles into species; we classify specimens into species and then species into higher-order taxa.

15:35 - We are using the gene copies as evidence, but not directly; it’s not as if we’re saying that this allele belongs to that species.

15:44 - It is just swimming around in the species lineages, in a way.

15:49 - So, we had incomplete lineage sorting and lineage sorting. Again, the idea is that you look from the past to the future, and it is the process of alleles slowly becoming monophyletic inside a species lineage. Now if you reverse your perspective, you get the other important term that is used everywhere.

16:10 - If we look back from the present into the past, because the present alleles are all the evidence we have to infer what happened in the past, we are talking about coalescence. The idea here is that the extant alleles merge back into their respective ancestors, and the species lineages merge back into ancestral species lineages, as we look back in time.

16:33 - And a very important mathematical model here is then the coalescent model, or coalescent theory, that a lot of the phylogenetic analyses that we can use to deal with this situation are based upon.

16:46 - And the key problem that we have is not incomplete lineage sorting per se but what it causes. Again, incomplete lineage sorting disappears over time through lineage sorting, through genetic drift.

16:57 - But the problem is what happens if it doesn’t disappear quickly enough.

17:03 - If there’s a lot of space for alleles, that is, if effective population size is large, a species can maintain a lot of the ancestral diversity for quite some time.

17:19 - If the time between speciation events is then relatively short, there is no time for lineage sorting to take place before the next speciation event.

17:24 - And then we have a high likelihood of ending up with what is called deep coalescence: the alleles, the gene copies, coalesce into an ancestor that is deeper in time than the species divergence itself.
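
The interplay of population size and time between speciation events can be put into numbers for the simplest case. The sketch below uses the standard three-species result from coalescent theory, in which a gene tree disagrees with the species tree with probability (2/3)·exp(−T), where T is the length of the internal branch in coalescent units. It assumes diploid organisms, a constant effective population size, one sampled gene copy per species and no gene flow; the function and parameter names are hypothetical.

```python
import math


def discordance_probability(generations_between_splits, effective_population_size):
    """Probability that a three-species gene tree conflicts with the species
    tree through deep coalescence alone (diploid, constant population size)."""
    # Internal branch length in coalescent units: t / (2 * Ne).
    branch_length = generations_between_splits / (2.0 * effective_population_size)
    # With probability exp(-T) the two lineages fail to coalesce on the
    # internal branch; in the ancestor all three resolutions are then equally
    # likely, and two of those three are discordant.
    return (2.0 / 3.0) * math.exp(-branch_length)


if __name__ == "__main__":
    for ne in (10_000, 1_000_000):
        p = discordance_probability(100_000, ne)
        print(f"Ne = {ne}: P(discordant gene tree) = {p:.3f}")
```

The value approaches two thirds when the interval between speciation events is short relative to the effective population size, and approaches zero when the interval is long.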

17:41 - And if we then take the gene tree at face value, this might mislead our phylogenetic analysis. It is important here to note that this is in contrast to incomplete lineage sorting, in contrast to this paraphyly of alleles relative to those of other species.

18:00 - This incongruence stays. This is forever.

18:03 - Because the moment your lineages have sorted out their alleles, they’re stuck with them. They might still diversify, of course, into new little gene clades.

18:14 - But look at the relationships at the bottom of that, for example in this lower graph.

18:20 - Things were sorted quite randomly.

18:23 - So now we’re getting the wrong impression from that one gene tree.

18:26 - That doesn’t go away except if an entire species dies out.

18:30 - That is the only solution to this, and that is also why this is less of a problem the further we go into really deep phylogenetic questions.

18:39 - Because in the end many lineages go extinct, and it’s rare that we actually have an old radiation, a really quick radiation, that has resulted in extant lineages that have all survived 300 million years or something like that.

18:53 - But it’s a very common problem with younger groups.

18:56 - So, this took quite a lot of introduction. On the other hand, this is a really easy question: how does deep coalescence then present in our gene trees? Well, quite simply, we have gene tree incongruence, but if that is the only issue that’s affecting our data, then at least the alleles from each species are still relatively closely related to each other.

19:21 - So we do have a problem of how we reconcile these two gene trees, but they are relatively close nonetheless, and the major groups will always come up more or less the same.

19:37 - How do we infer the species tree? In all the cases that I’m talking about there is a really simple solution, and the first one here is that we simply ignore the problem and concatenate our data, like we would concatenate several different chloroplast regions, for example, that are all co-evolving.

19:54 - We then just treat all our genes as if they were evolving as one.

19:59 - And that actually works well enough, again, especially with many deeper problems.

20:05 - So, that is an option that we might at least want to try out.
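
As an illustration of what concatenation means in practice, here is a minimal Python sketch that joins per-gene alignments into one supermatrix. It assumes each alignment is already available as a dictionary of sample name to aligned sequence; the function name, the data layout and the use of gap characters for samples lacking a gene are choices made for this example only.

```python
def concatenate_alignments(gene_alignments, missing_char="-"):
    """Join per-gene alignments (gene -> {sample: aligned sequence}) into one
    supermatrix, padding samples that lack a gene with missing_char.

    Returns (supermatrix, partitions), where partitions lists
    (gene, start, end) with 1-based inclusive coordinates.
    """
    samples = sorted({s for aln in gene_alignments.values() for s in aln})
    pieces = {s: [] for s in samples}
    partitions = []
    position = 0
    for gene in sorted(gene_alignments):
        alignment = gene_alignments[gene]
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for s in samples:
            # A sample without this gene contributes only missing data.
            pieces[s].append(alignment.get(s, missing_char * length))
        partitions.append((gene, position + 1, position + length))
        position += length
    supermatrix = {s: "".join(parts) for s, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    genes = {
        "gene1": {"sp_A": "ACGT", "sp_B": "AC-T"},
        "gene2": {"sp_A": "GG", "sp_C": "GA"},
    }
    matrix, parts = concatenate_alignments(genes)
    for sample, sequence in matrix.items():
        print(sample, sequence)
    print(parts)
```

The partition coordinates are returned so that the genes can still be told apart afterwards, for example if one wants to assign separate substitution models.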

20:09 - The gold standard, however, which would also be most useful for evolutionary studies, is to do a full multi-gene Bayesian species tree analysis, as implemented for example in the very popular software BEAST, as mentioned earlier, with its add-on StarBEAST.

20:27 - The advantage of this approach is that it estimates the species tree and all of the gene trees at the same time, and so it can take all that evidence into consideration and check it against each other to arrive at the best solution.

20:40 - And perhaps a slightly more practical advantage is also that it always assumes some kind of molecular clock, so we always get a rooted tree out of it, we get a posterior distribution of trees out of it because it’s a Bayesian analysis, and we can do a lot of very sophisticated analyses downstream.

20:58 - However, it has two important downsides that are very relevant to what we’re trying to do in the Australian Angiosperm Tree of Life, for example. It is very computationally intensive and very slow, so it’s not realistic with hundreds of terminals, with hundreds of species.

21:14 - And it also doesn’t deal well with missing data, if we have a very patchy matrix. For example, we may have 300 genes out of 353 for each individual sample, but unfortunately not the same 300 for each sample, so data are missing in every column.

21:31 - Then the analysis really will not work very well.
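
How patchy a matrix is can be checked before committing to a slow analysis. The following is a small sketch, assuming we know for each gene which samples yielded a sequence; the names are hypothetical and the toy data simply mimic the situation where every sample has most genes, but not the same ones.

```python
def occupancy(samples_per_gene, all_samples):
    """samples_per_gene maps gene -> set of samples with a sequence.

    Returns (fraction of genes recovered per sample,
             list of genes that are present in every sample).
    """
    n_genes = len(samples_per_gene)
    per_sample = {
        s: sum(s in present for present in samples_per_gene.values()) / n_genes
        for s in all_samples
    }
    complete_genes = [
        gene for gene, present in samples_per_gene.items()
        if set(all_samples) <= set(present)
    ]
    return per_sample, complete_genes


if __name__ == "__main__":
    samples = ["sp_A", "sp_B", "sp_C"]
    recovered = {
        "gene1": {"sp_A", "sp_B"},
        "gene2": {"sp_B", "sp_C"},
        "gene3": {"sp_A", "sp_C"},
    }
    per_sample, complete = occupancy(recovered, samples)
    print(per_sample)  # every sample has two of the three genes
    print(complete)    # but no gene is present in all samples
```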

21:34 - So, the alternative then are the so-called shortcut methods.

21:40 - In this case we infer the gene trees first, all individually.

21:45 - And then we use the topology of all the gene trees to infer the most likely species tree under the assumption that coalescence, deep coalescence, explains the conflicts in our data.

22:02 - A particularly popular option at the moment is the software ASTRAL but there’s a wide variety of other software that one can try.

22:11 - It has pretty much the inverse advantages and disadvantages of StarBEAST: it can deal very well with incomplete gene trees, with lots of missing data, and it is extremely fast.

22:23 - On the other hand, we are not estimating species tree and gene trees at the same time; the gene tree topologies are fixed. That matters if we got those wrong, for example if there are poorly supported nodes along the way which could just as well have gone the other way around.

22:40 - Then this analysis might be misled by that kind of uncertainty.

22:44 - And it also seems to be less reliable the deeper you go in time and it’s preferable for the more shallow phylogenetic problems.
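
The principle behind summarizing gene tree topologies can be shown with a toy example for three species. This is explicitly not how ASTRAL works internally; it is only a sketch of the idea that, with three species and deep coalescence as the only source of conflict, the most frequent rooted gene tree topology is expected to match the species tree. Gene trees are written as nested Python tuples, which is an assumption of this example.

```python
from collections import Counter


def triplet_topology(tree):
    """Return the pair of sister species in a rooted three-taxon tree given
    as a nested tuple, e.g. (("A", "B"), "C") -> frozenset({"A", "B"})."""
    for child in tree:
        if isinstance(child, tuple):
            return frozenset(child)
    raise ValueError("tree is unresolved")


def majority_triplet(gene_trees):
    """Count the three possible topologies and return the most frequent
    sister pair together with all counts."""
    counts = Counter(triplet_topology(tree) for tree in gene_trees)
    sisters, _ = counts.most_common(1)[0]
    return sisters, counts


if __name__ == "__main__":
    gene_trees = (
        [(("A", "B"), "C")] * 6
        + [(("A", "C"), "B")] * 2
        + [("A", ("B", "C"))] * 2
    )
    sisters, counts = majority_triplet(gene_trees)
    print("inferred sister species:", sorted(sisters))
    print({tuple(sorted(pair)): n for pair, n in counts.items()})
```

Because the input topologies are taken as fixed, a wrongly resolved gene tree is counted with the same weight as a well-supported one, which is the weakness mentioned above.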

22:52 - Then consider downstream phylogenetic analyses, downstream evolutionary analyses.

22:58 - The branch lengths inferred by these kinds of shortcut methods are generally meaningless.

23:03 - So we would have to then, at the very least, do some kind of penalized likelihood time calibration if we wanted to do a biogeographic analysis with it, for example.

23:14 - But that was relatively straightforward; that problem has long been understood and there are lots of software options. A bit more complicated already is gene duplication and loss.

23:26 - So what happens in this case, our second problem? As the name says, we have an ancestor in which, potentially, just a single gene has been duplicated into, in this case, a red and a blue copy, and both copies may then potentially be inherited by all the descendant species. If we have an entire genome duplication event, polyploidy, then we would have an enormous number of genes showing exactly the same pattern.

23:53 - Over time, the gene copies might then specialize, which is of course one of the key drivers of innovation in evolution, or if they are superfluous to requirements at some point, the genes can also be lost again.

24:10 - If we then look at the various gene variants that have been produced by these gene duplication events across different species lineages, there are two terms that we use: orthologs and paralogs.

24:23 - So if across all the samples that we’ve got in our analysis we are always lucky enough to grab the red copy, for example, and compare them all against each other, then we’re comparing orthologs; all the red lineages of genes are an ortholog group.

24:37 - If, however, we are unlucky enough to unwittingly grab red and blue copies, a mix of them from different species lineages, and try to do an analysis on them, then we are comparing paralogs.

24:51 - And the problem with that is that the phylogeny of the gene family, of these different ortholog groups, then interferes with the species phylogeny that we’re actually interested in as phylogeneticists and systematists, at any rate. Of course, if we’re interested in the evolution of this gene family then it’s a different story, but we don’t want to mix them up.

25:14 - So just as an example of what that looks like in one of my data sets in this case the daisy bushes.

25:20 - This is not even an extreme example, but you can immediately see that there are some variants of this gene that stand out because they’re missing this one amino acid, and you can see that this same paralog is present in four of my species, and then variants that have got that amino acid are also present in all the same four species.

25:42 - And you can see similar patterns then with some of the nucleotide differences, with some of the SNPs in these genes.

25:49 - So, this is just an example of really visually, immediately seeing that something is a bit odd here.

25:58 - How does that then present in our gene trees?

26:02 - Ideally, and again this is an assumption, we have got all the sequences that we can get, we have got absolutely no gene losses, and everything is beautifully clear.

26:13 - Then, at the point in our species tree where a gene or genome duplication event happened, we should find the same species twice on the gene tree, with more or less the same relationships, in two parallel clades.

26:28 - So for n gene duplication events we should have n species duplications on the gene tree, in a sense, sometimes more, sometimes less.

26:38 - Now again, that is an unrealistic assumption.

26:41 - In reality, we sometimes simply fail to amplify or capture an allele or a gene copy.

26:48 - In some cases, obviously, a gene was superfluous to requirements and was lost.

26:53 - And so in the lower tree here you see what I assume for the upper ortholog group.

26:59 - The red copy has been lost, or rather, the red species has lost the copy for ortholog group one, I should say, and then five different species have lost the copy from ortholog group two. But there’s still quite a bit of overlap between those two groups in terms of the species that have them, so we can see what is going on just in the alignment.

27:19 - How do we then deal with that situation? Just as with simply ignoring deep coalescence, there’s also a way of simply ignoring this problem: we just throw away all the genes in which we find paralogy and analyze only the rest.

27:37 - Clearly, that is a valid option if, let’s say, I’ve got 300 genes and 30 of them have paralogs. It’s going to be a bit more painful if I’ve got 300 genes and 200 of them have paralogs, because I’m throwing away the majority of my data.

27:51 - So we may want to use some more sophisticated approaches.
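
The simple option of throwing away every gene with paralogs, and seeing how much data that costs, might look like the sketch below. It assumes that the assembly step has already reported, for each gene, which sample each recovered sequence belongs to; the data layout and all names are hypothetical and not taken from any particular pipeline.

```python
from collections import Counter


def genes_with_paralogs(gene_to_sequences):
    """gene_to_sequences maps gene -> list of (sample, sequence_id).

    A gene is flagged when any sample contributes more than one sequence.
    Returns gene -> sorted list of the samples with multiple copies.
    """
    flagged = {}
    for gene, sequences in gene_to_sequences.items():
        copies = Counter(sample for sample, _ in sequences)
        duplicated = sorted(s for s, n in copies.items() if n > 1)
        if duplicated:
            flagged[gene] = duplicated
    return flagged


def drop_paralogous_genes(gene_to_sequences):
    """Keep only genes in which every sample has at most one sequence."""
    flagged = genes_with_paralogs(gene_to_sequences)
    kept = {g: seqs for g, seqs in gene_to_sequences.items() if g not in flagged}
    return kept, flagged


if __name__ == "__main__":
    data = {
        "gene1": [("sp_A", "c1"), ("sp_B", "c1")],
        "gene2": [("sp_A", "c1"), ("sp_A", "c2"), ("sp_B", "c1")],
    }
    kept, flagged = drop_paralogous_genes(data)
    print(f"kept {len(kept)} of {len(data)} genes; flagged: {flagged}")
```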

27:58 - There are methods out there that bypass any bioinformatics solution: they use the gene trees that include all the paralogs that we have assembled directly for phylogenetic analysis, and these are again shortcut methods.

28:13 - A really old method that’s been around for quite some time is a parsimony method called Minimize Gene Duplications and Losses, which does exactly what it says on the tin. It is implemented for example in the software iGTP but also in the fairly well known package Mesquite, and fairly recently a likelihood alternative using the same logic has been published under the name GeneRax.

28:38 - However, mostly people deal with paralogy using bioinformatics approaches.

28:44 - And the one that Chris Jackson has implemented in the analysis pipeline for use in the Australian Angiosperm Tree of Life project is the Yang and Smith pipeline that was first published in 2014.

29:00 - And the idea here is to script, automatically, exactly the same approach that we would intuitively take if we were to manually figure out where the ortholog groups are. As we just discussed, we’ve got these two clades that show duplicated species, and then we assume there must have been a gene or genome duplication event between them.

29:23 - And so we would kind of take the scissors to them and take those two ortholog groups apart.

29:30 - The failure mode for this approach is if we have got so many losses or failures to amplify that we actually don’t have these overlaps anymore, as in the lower case here.

29:41 - The first ortholog group is not present in five species, and the lower ortholog group two is not present in the other five species, and because they are exactly complementary in their losses, we will not be able to tell that there was actually a gene or genome duplication event in this case.

29:57 - However, this is ever less likely to be an issue, the larger your study group or the more samples you’ve got in your analysis.

30:07 - And just very quickly to illustrate with a practical example what the outcomes are of some of the scripts that are available so the Yang and Smith pipeline actually has four options.

30:17 - One of them is simply to kick out all the genes with a paralog, but in terms of the more sophisticated approaches there is this script called monophyletic outgroups (MO).

30:28 - The idea is that you do have an outgroup, and you move successively up from the root through your tree, you check if there are duplications between the sister clades at each node, and then you cut out and throw away the smaller one, rationalizing that at that moment there must have been a duplication event, and you keep whatever gives you more information for your phylogenetic analysis. So it is relatively simple and logical and straightforward.
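
The logic of walking up from the root and discarding the smaller of two overlapping sister clades can be sketched as follows. This is not the Yang and Smith script itself, only an illustration of the idea as described here: it assumes a rooted, strictly bifurcating gene tree written as nested Python tuples, with tip labels of the made-up form "species|copy", and it leaves out the checks on the outgroup that a real implementation would need.

```python
def species_of(leaf):
    """Extract the species name from a 'species|copy' tip label."""
    return leaf.split("|")[0]


def leaves(tree):
    """List all tip labels below a node."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def prune_duplications(tree):
    """Move from the root towards the tips. Wherever the two sister clades
    share at least one species, treat the node as a duplication and keep
    only the clade with more tips."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    left_species = {species_of(leaf) for leaf in leaves(left)}
    right_species = {species_of(leaf) for leaf in leaves(right)}
    if left_species & right_species:
        larger = left if len(leaves(left)) >= len(leaves(right)) else right
        return prune_duplications(larger)
    return (prune_duplications(left), prune_duplications(right))


if __name__ == "__main__":
    gene_tree = (
        "Out|1",
        ((("A|1", "B|1"), ("C|1", "D|1")), (("A|2", "B|2"), "C|2")),
    )
    print(prune_duplications(gene_tree))
```

The same sketch also shows the failure mode mentioned earlier: a tree such as (("A|1", "B|1"), ("C|2", "D|2")), where the losses in the two ortholog groups are complementary, passes through unchanged because no species overlap is left to reveal the duplication.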

30:56 - Then, at the extreme end of the spectrum, there is another approach there that is called maximum inclusion (MI).

31:04 - And this quite simply iteratively takes apart the unrooted phylogeny into pieces that do not show overlap in taxa, starting with the biggest clan it can find.

31:17 - This is a very permissive approach that may end up with quite a large number of very small ortholog groups that might not have a lot of information.

31:25 - And this is actually not being recommended by Yang and Smith.

31:29 - So we’ve got different options with different logic that we can examine.

31:35 - And finally, just for clarification: paralogy is not necessarily always an issue just because we have got, for example, genome duplication.

31:43 - You may be worried about polyploidy, but in some groups that I have studied all the polyploidy happens in the terminals, just within individual species.

31:53 - That is not really an issue, because what we are worried about here is ancestral duplication that will actually interfere with the species phylogeny. If all the action happens in the terminals, then of course that doesn’t really have a phylogenetic import. In a sense, what you’re then seeing in those extra copies that have appeared is pretty much indistinguishable from simply having different alleles in your samples, which are similar enough to be taken care of by standard approaches.

32:26 - Finally, then, coming to the third complex of problems: reticulation of any kind.

32:34 - So in this case, I want to quickly talk separately about three scenarios.

32:40 - The first would be hybrid speciation, in particular allopolyploid speciation.

32:45 - The second is much lower-level reticulation: introgression, backcrossing, admixture; there are lots of words that are used by different subfields of evolutionary biology.

32:55 - And then finally, a special case of the former, chloroplast capture.

33:04 - Allopolyploid speciation is fairly widely known, and in fact there is an example that all of us will have already encountered in pastilles and in chewing gum, so it’s very common: the humble spearmint is actually an allotetraploid, hybridogenic species that has been derived from two other European species of true mints.

33:26 - So what happens is you have a cross between two species the hybrid might be fairly sterile or at least subfertile.

33:33 - And then, if that hybrid manages to duplicate its genome generally through some kind of meiosis errors fertility is restored and we have a new species lineage that might then even diversify into an entire clade if it is lucky.

33:51 - And that already kind of indicates then how it would present the data ideally if we have a situation where an allopolyploid was created by species in two different clades, then we would find for every single gene tree two different copies in that hybrid or hybrogenic species one derived from the maternal clade and the other one from the paternal clade so gene after gene would kind of find a situation a bit like illustrated here where they’re always sitting close together with ancestors.

34:27 - Of course that is again an idealistic assumption. In reality we sometimes do not amplify all the copies, or some genes may have been lost.

34:36 - And of course all the other problems interfere with it so we may see gene tree incongruence making it slightly harder to understand what is going on.
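As a rough illustration of the idealized pattern just described, here is a minimal Python sketch. It assumes that, for each gene tree, someone has already recorded which clade each copy from the putative hybrid groups with; the function name `tally_parental_signal` and the input layout are hypothetical and not taken from any pipeline mentioned in the talk.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def tally_parental_signal(copy_placements: Dict[str, List[str]]) -> Counter:
    """Count, across genes, which pairs (or sets) of clades the hybrid's copies fall into.

    copy_placements maps a gene name to the clade labels in which the copies
    of the putative hybrid were placed in that gene tree (hypothetical input).
    Genes with copies in only one clade (lost or unamplified copies) are skipped.
    """
    pair_counts: Counter = Counter()
    for clades in copy_placements.values():
        distinct: FrozenSet[str] = frozenset(clades)
        if len(distinct) >= 2:
            pair_counts[distinct] += 1
    return pair_counts


# Made-up example: most genes place one copy in clade A and one in clade B.
placements = {
    "gene1": ["A", "B"],
    "gene2": ["A", "B"],
    "gene3": ["A"],        # one copy lost or not amplified
    "gene4": ["B", "A"],
    "gene5": ["A", "C"],   # a conflicting gene tree
}
counts = tally_parental_signal(placements)
```

If one pair of clades dominates the tally, that is at least consistent with those two clades having contributed the parents; the conflicting genes are where the other sources of incongruence would need to be considered.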

34:43 - How do we then, in a situation like this, get from our alignments or from our gene trees to the species tree? Again there is a brutal and simplistic option, which is that we simply throw out all the known hybrids and hybridogenic lineages, because we are using analyses that assume a tree-like structure of the data in the first place, and then we maybe re-insert them afterwards.

35:07 - A more formalized way of doing that automatically and analytically is the HybPhaser pipeline, developed by Lars Nauheimer, which is actually the topic of the separate talk that I mentioned earlier and of one of the workshops at the ASBS conference.

35:25 - And in this case, the idea is to separate out the reads that belong to both of the ancestral lineages, and then analyze those contributions separately and then, just as I intuitively described earlier, figure out what ancestral clades have contributed to your hybridogenic species.

35:47 - And that talk is going to take place on the 10th of June, same time as this one, and please visit the BioCommons website if you’re interested in signing up for it. A bit less straightforward, perhaps, is if we have got less gene flow, and especially a result that is less than 50:50 from the ancestral lineage.

36:12 - So again, there are a lot of words for this: is it introgression, is it hybrid backcrossing, is it admixture?

36:20 - And in that case, we may only see a few genes affected or we may see only part of a species affected.

36:27 - And of course that immediately raises the problem: well, how do we distinguish that from deep coalescence then? There are two considerations here. One is whether we can have some kind of test that tells us which of the two might explain what is going on, and the other one is whether we can have some kind of phylogenetic network analysis to figure out the relationships.

36:51 - In terms of tests, one option that has been mentioned in this context is the traditional “ABBA BABA” test.

37:00 - The idea here is that we reduce our problem down to a simple case of three species and an outgroup.

37:08 - And in that case, we should find a relationship where lots of alleles are shared by the two most closely related species.

37:19 - And then the outgroup and the third species share the other allele.

37:23 - Now if we then have considerably more allele distributions that diverge from this than we would expect from a purely stochastic deep coalescence and incomplete lineage sorting model, then we would say, okay, that is strong evidence now that we have got gene flow between those species going on.

37:43 - The problem with this, for our present purposes, is that it uses allele frequencies, so ideally you have a data set with lots of single nucleotide polymorphisms and multiple individuals per species for good results, but it is something that some of us might be interested in exploring.
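To make the logic of the test concrete, here is a minimal Python sketch of the count-based form of the statistic usually called Patterson's D. It assumes one sequence per taxon rather than allele frequencies, and the function name `patterson_d` and the input layout are illustrative assumptions. Under incomplete lineage sorting alone the two discordant site patterns (ABBA and BABA) are expected to be equally common, so D should be close to zero; an excess of one pattern points towards gene flow.

```python
from typing import Iterable, Tuple


def patterson_d(sites: Iterable[Tuple[str, str, str, str]]) -> float:
    """Compute D = (ABBA - BABA) / (ABBA + BABA).

    Each site is a tuple of alleles (P1, P2, P3, outgroup), where P1 and P2
    are the two most closely related species. The outgroup allele is treated
    as ancestral ("A"), the other allele as derived ("B").
    """
    abba = 0
    baba = 0
    for p1, p2, p3, out in sites:
        if len({p1, p2, p3, out}) != 2:
            continue  # use biallelic sites only
        if p3 == out:
            continue  # P3 must carry the derived allele (skips BBAA etc.)
        if p1 == out and p2 == p3:
            abba += 1  # P2 shares the derived allele with P3
        elif p2 == out and p1 == p3:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (abba - baba) / (abba + baba)


# Made-up data: 50 concordant sites, 30 ABBA sites, 10 BABA sites.
example_sites = (
    [("G", "G", "A", "A")] * 50
    + [("A", "G", "G", "A")] * 30
    + [("G", "A", "G", "A")] * 10
)
d_value = patterson_d(example_sites)  # (30 - 10) / (30 + 10) = 0.5
```

A real analysis would also need a significance test, commonly a block jackknife across the genome, which is not shown here.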

38:02 - A test more suited perhaps to the kind of data I’m using for the Australian Angiosperm Tree of Life was developed in 2009 by Joly et al.

38:12 - In this case, a simulation test is conducted where they are trying to compare the estimated age of the gene tree coalescences versus the age of the species tree coalescence.

38:24 - And in the case illustrated in figure B from their paper, where the gene tree coalescence is considerably deeper than the species tree coalescence, then you can say, okay, this must have been hybridization. Unfortunately, as is also indicated in figures C and D of the paper,

38:46 - there are several other scenarios where hybridization did take place, or introgression did take place, but the test will fail to detect it, precisely because of the unrelated problem of incomplete lineage sorting and deep coalescence, so it’s an asymmetric test in a way.

39:05 - There are phylogenetic network analyses. A particularly well known one is conducted by the software BPP, and then more recently,

39:15 - another software called SNaQ has been published.

39:17 - So the idea in this case is that different models are compared, with simply divergent structures, with simple tree structures, and then ones where you add an event here or there of lateral transfer, of introgression between species.

39:35 - And the analysis tries to explain the data under these assumptions.

39:40 - The problem is that it is generally fairly computationally intensive to compare all those different models, all those different scenarios.

39:47 - It’s relatively slow and is limited to a few species; maybe five or six is what you mostly see in the literature.

39:57 - And there are then also some other somewhat simpler phylogenetic network methods that can maybe accommodate a few more species.

40:07 - Finally, a separate comment on what might be considered a special case of introgression, but one that is particularly important in phylogenetics: organelles are known to jump between species more easily than nuclear genes.

40:23 - And in our case, of course, as botanists we will be concerned about chloroplasts, which is then called chloroplast capture by other species lineages.

40:32 - And for example, people who are working on Eucalyptus will be very familiar with that situation.

40:36 - So how does that present in the data then? Well, we would expect, if chloroplast capture is the explanation for the incongruence we’re seeing, that nuclear data more or less consistently supports one topology and chloroplast data another. So for example, if we have a ribosomal tree

40:53 - and we have our capture tree with 300 different genes, and we fairly consistently get one relationship, but the chloroplast annoyingly gives us another,

41:02 - then we might want to consider, well, maybe this is due to chloroplast capture, such as in this example here. Now, I have spoken largely about ideal scenarios.
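The reasoning about chloroplast capture can be sketched as a simple tally. The following Python sketch assumes each nuclear gene tree has already been reduced to a single label, the sister group it recovers for a focal species; the function name `flag_chloroplast_conflict` and the 0.8 consistency threshold are arbitrary illustrative assumptions, not an established criterion.

```python
from collections import Counter
from typing import List, Tuple


def flag_chloroplast_conflict(
    nuclear_sisters: List[str],
    chloroplast_sister: str,
    threshold: float = 0.8,  # assumed cut-off for "fairly consistent" nuclear support
) -> Tuple[str, float, bool]:
    """Return the majority nuclear relationship, its share of gene trees,
    and whether the pattern is consistent with chloroplast capture.

    The flag is set only if the nuclear genes agree with each other (share of
    the majority relationship >= threshold) AND the chloroplast tree disagrees.
    """
    if not nuclear_sisters:
        raise ValueError("need at least one nuclear gene tree")
    majority, count = Counter(nuclear_sisters).most_common(1)[0]
    share = count / len(nuclear_sisters)
    suspicious = share >= threshold and chloroplast_sister != majority
    return majority, share, suspicious


# Made-up example: 270 of 300 nuclear genes place the focal species with
# species "B", while the chloroplast tree places it with species "C".
nuclear = ["B"] * 270 + ["C"] * 30
result = flag_chloroplast_conflict(nuclear, "C")
```

If the nuclear genes themselves were split more evenly, the flag would stay off, because that pattern looks more like deep coalescence or introgression affecting the nuclear genome than like a single organelle moving between lineages.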

41:16 - The problem is, of course, that in reality we generally don’t have a priori knowledge of what is going on, and the data may be a bit complicated.

41:26 - So it’s important to understand that a lot of what we do is based on our assumptions of what is plausible. So we need to understand the biology of our species, we may need to understand the geography, we may need to understand the age of the lineages, and then we take all of this as circumstantial information that we apply to the incongruences that we see. And one thing that we need to keep in mind is that both being able to hybridize with each other and the deep coalescence problem should peter out the more distant the lineages are. So it’s much less likely to get deep coalescence between things that are 20 million years apart than between things that are 1.5 million years apart, and it’s much less likely to hybridize a eucalypt with a daisy than it is to hybridize different eucalypts.
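The way deep coalescence peters out can be made concrete for the simplest case of three species. Under the standard multispecies coalescent, the probability that a gene tree disagrees with the species tree is (2/3)·exp(−t), where t is the length of the internal branch (the time between the two speciation events) measured in units of 2N generations for a diploid population of effective size N. The Python sketch below applies that formula; the function name and the example numbers are illustrative assumptions.

```python
import math


def discordance_probability(branch_generations: float, effective_pop_size: float) -> float:
    """Probability that a three-species gene tree conflicts with the species tree.

    branch_generations: generations between the two successive speciation events.
    effective_pop_size: diploid effective population size N of the ancestral lineage.
    Uses P = (2/3) * exp(-t) with t = branch_generations / (2 * N).
    """
    if branch_generations < 0 or effective_pop_size <= 0:
        raise ValueError("branch length must be >= 0 and population size > 0")
    t = branch_generations / (2.0 * effective_pop_size)
    return (2.0 / 3.0) * math.exp(-t)


# Made-up numbers: the same population size, a short versus a long branch.
short_branch = discordance_probability(10_000, 50_000)     # t = 0.1
long_branch = discordance_probability(1_000_000, 50_000)   # t = 10
```

With a short internal branch most gene trees can be expected to conflict, whereas with a long one the conflict from deep coalescence becomes negligible, which is why very deep incongruence points to other explanations.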

42:16 - So if we have got really, really deep incongruence, it is either an error we made in the assembly or it is then more likely to be paralogy, but distinguishing the other two is probably the trickiest issue.

42:31 - So in summary, we have discussed deep coalescence. It is caused purely randomly, especially if there are large population sizes and rapid successive speciation, and we have got a variety of phylogenetic methods to solve that problem.

42:47 - We have got paralogy, that is gene or genome duplication.

42:54 - And in that case we have got both some shortcut methods and bioinformatics solutions for the identification of the various ortholog groups at our disposal.

43:03 - And we have got reticulation in the form of either hybrid speciation or more limited introgression, and in that case there are tests available to figure out what is going on, and phylogenetic methods at least for cases with few species.

43:17 - And again, the ideal cases are easy to recognize, but in reality

43:23 - we will often find slightly more incomplete data sets.

43:26 - And the key problem is, all of them can potentially happen in the same phylogeny.

43:31 - As I mentioned earlier, there’s a variety of resources, attached to this talk, that you will see when the PDF becomes available.

43:43 - But for the moment I would like to thank you all for your attention and I’m looking forward to the discussion that we can have.

43:50 - Thank you. Thank you very much Alexander.

43:54 - We do now have time for questions. If you have a question for Alexander please write that in the Q&A panel in your Zoom dashboard.

44:02 - And we can answer those for you. And while you’re thinking about questions I’ll pop the links to the next webinar and to the workshops, back up on the screen again.

44:13 - So as Alexander mentioned the next webinar will be on the 10th of June, and you can find more information about that on The Australian BioCommons website.

44:21 - And there will be a series of workshops at the ASBS conference in July.

44:27 - The information about those workshops is on the conference website and you can see those links there.

44:33 - As mentioned, these slides will be made available alongside the recording and we will link them from that description on YouTube so you’ll be able to find them there.

44:52 - Australian BioCommons is enabled by funding from NCRIS via BioPlatforms Australia.

45:00 - Thank you again and we hope to see you again in another webinar.

45:04 - Bye for now!