Sp23 Dr. Robert Edgar - Computational biology, bioinformatics

Transcript

00:00:00.000 Okay, let me take a look at the audience.

00:00:25.000 I think we have most of them so I should probably get started.

00:00:54.000 Okay, sounds good. Thanks Dr. Gannon.

00:00:58.000 And I think Jared’s already started recording.

00:01:04.000 Welcome everyone.

00:01:06.000 I’m so happy to see you all here at the IOB faculty seminar. Today I have the honor to introduce Dr. Robert Edgar. Dr. Edgar has a Bachelor of Science in physics and a PhD in particle physics from University College London.

00:01:26.000 He served as CTO of IDBS APS in Copenhagen from 1982 to 1988, where he designed and implemented a groundbreaking database server and application development environment.

00:01:44.000 In 1988, Dr. Edgar founded Parity Software in San Francisco, where he served as CEO, chief architect, and lead developer. His software development tools for computer telephony applications won numerous industry awards, and he grew the business to 30 employees

00:02:06.000 and 10 million per year in sales before selling it to Intel in 1999.

00:02:13.000 Dr. Edgar has since focused his attention on computational biology and bioinformatics, and is best known for developing the widely used MUSCLE and USEARCH programs, which have been cited in thousands of published papers. His contributions to the field

00:02:31.000 are significant and were recognized by a 2019 study reporting him as one of the top 0.01% of living scientists for the impact of his work. Today Dr.

00:02:58.000 Thank you. Thank you very much.

00:03:00.000 Pleasure to be here in Athens I think it’s my first time out.

00:03:05.000 So today I’m going to talk about sequence alignment, and a little bit about the latest version of MUSCLE, but I want to draw some larger lessons about how people generally use bioinformatics tools.

00:03:26.000 How do I find like that. Okay, so I understand we have a variety of backgrounds here so some people may be more or less familiar with this kind of thing so I’m going to assume you basically know the idea of a sequence alignment.

00:03:48.000 And here I’m going to focus on virus polymerases, and you can see here we have an alignment in Jalview, and you can see by eyeball that this looks very reasonable.

00:04:01.000 I mean, you don’t need MUSCLE or any alignment program; you can do it manually, it’s pretty straightforward.

00:04:11.000 And I’m focusing on virus polymerases because right now, with the pandemic,

00:04:18.000 it’s very relevant. It’s relevant for virus discovery, and I’m not going to talk about this today, but I just wanted to motivate why I’m particularly interested in virus polymerases.

00:04:32.000 So this was a really remarkable project that I was involved in, and am still involved in. It was started by an unemployed Canadian postdoc at the beginning of the pandemic with the goal of trying to understand where COVID-19 came from, and it attracted sort

00:04:51.000 of an all-star team of volunteer scientists from all over the world who put their time into this.

00:04:59.000 And there’s a happy ending to the story, because Dr. Babaian, who was then unemployed, is now an assistant professor in Toronto, and the headline of our paper last year was that we increased the number of known RNA virus species by about an order of magnitude.

00:05:20.000 Before we did this, there were about 10,000 known species, and now we know 130,000, and so this got me interested in how we can align and classify all these new guys.

00:05:37.000 And COVID, it is an RNA virus and you know this is sort of the background for why I’m using this example.

00:05:47.000 So how do you actually make a sequence alignment? Well, under the hood, it’s important to understand that these algorithms are really using a highly simplified model of evolution. I mean, oversimplified really, because proteins are really complicated.

00:06:08.000 So these are just some very, I guess, crude averages over how proteins behave, because that’s really the only way we know how to do it as a practical matter.

00:06:21.000 And then the model of how proteins evolve is that you have these substitution scores, which say how likely it is for one residue to be replaced by another.

00:06:33.000 And you have this very simple model of insertions and deletions, which penalizes adding gaps to the alignment.
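
A minimal sketch of the kind of scoring model described here, assuming a toy match/mismatch score in place of a real substitution matrix and two gap penalties (open and extend). The function name and all numeric values are illustrative assumptions, not parameters taken from MUSCLE or any other program.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length, where '-' marks a gap.

    Toy model: a flat match/mismatch score stands in for a substitution
    matrix, and each run of gaps pays gap_open once, then gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev_gap = None  # which row ('a' or 'b') had a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # an all-gap column carries no information
        if a == "-" or b == "-":
            which = "a" if a == "-" else "b"
            total += gap_extend if which == prev_gap else gap_open
            prev_gap = which
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total
```

An aligner searches for the arrangement of gaps that maximizes a score of this kind, so everything it "knows" about evolution is contained in these few numbers.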

00:06:42.000 It’s just really not that simple. I mean, Moses did not come down from a mountain and tell proteins to obey these laws, so it’s just something to bear in mind that what’s really going on here is something much more complicated, and you should always be skeptical

00:06:58.000 that the alignment is really correct or informative. I mean, sometimes it’s obvious, like that first example I showed, but you should always be kind of skeptical.

00:07:09.000 So this is that same alignment, which is a bunch of coronaviruses, and it’s two regions from the same alignment.

00:07:18.000 So the one I showed at the top, yeah, that’s convincing; everyone would do it manually, all programs are going to agree.

00:07:25.000 And if you look at the other region there, which I’ve shown at the bottom, it’s like, well, really, how do we know? Does the computer know?

00:07:34.000 And the answer is no, the computer doesn’t know any more than we do, because it’s using these very simplified models and it does the best it can. But if you give this to two different programs, it’s going to give you different results.

00:07:50.000 Well, how would you choose?

00:07:54.000 So, this kind of raises.

00:07:57.000 Sorry, excuse me.

00:08:07.000 So, this raises the question: well, when is an alignment correct, or when is it even useful?

00:08:17.000 And once you get into those more challenging regions, like the bottom one I just showed, the answer is not necessarily that well defined. There are two main ways of thinking about it.

00:08:27.000 One is that the letters you see in a given column should all come from the same ancestral letter, so that’s technically called residue homology.

00:08:38.000 The other way you can say it is, well, if you align the structures, the things that are in the same place in the structure should be in the same column.

00:08:47.000 But unless the structures really line up very well, that’s kind of ambiguous, so you start to have the same kind of ambiguity as you have with the letters. So even structure doesn’t really provide you a definitive standard, and there’s sort of a gray area:

00:09:02.000 the more the structures diverge, the less clear it is what you’re really trying to do.

00:09:09.000 So, this alignment, it could be just flat out wrong, depending what you want to do.

00:09:16.000 In particular, there are some kinds of inferences, like phylogenetic trees, where it really is very clear that you need homologous residues in the same column, but then in some other applications it might not be so clear; the alignment might still be useful

00:09:32.000 even if there’s some ambiguity in it. And worst case, this could be really throwing you completely off track: if you try to estimate a phylogenetic tree and you give it a region like this, you may just be getting complete nonsense.

00:09:49.000 And another thing to keep in mind is that if you represent an alignment like this in a row and column matrix, you’re making an implicit assumption that all of the mutations that have happened between the sequences are substitutions

00:10:11.000 or maybe some short insertions and deletions, but there are many other types of mutation that can happen that you simply can’t represent in a matrix like that. Well, there can be an inversion, which is something you think about more in DNA, maybe.

00:10:29.000 I mean, what’s the most common source of a short insertion? That’s a short tandem duplication: the most common way that a new letter gets inserted into a protein is you get a little

00:10:50.000 slippage when the gene is duplicated. So these are actually a very common source of mutation. Then you’ve got translocations, and you’ve got this rather technical thing called homoplasy. And when you’ve got distantly related proteins, and you have these more variable

00:11:07.000 regions in the loops of the protein, you get insertions, deletions, substitutions, and eventually there’s just no residue homology left. And so the way you should really represent that in a multiple alignment is to have a ton of gaps, but alignment

00:11:26.000 algorithms will never represent it that way.

00:11:31.000 So I’m going to focus particularly on trees, because it’s a very common application of an alignment, and it kind of illustrates the issues I want to talk about.

00:11:42.000 So alignment in itself is not usually the end goal. So the question you should be asking is not necessarily whether the alignment in itself is good or bad. The question is, is it good enough to answer the biological question that you’re trying to answer?

00:12:00.000 And here I’m going to take a concrete example, which is kind of relevant right now: if we’re interested in the evolutionary history of COVID, we want to make phylogenetic trees for the coronavirus family.

00:12:15.000 And the coronaviruses divide quite clearly into four different groups, which are called genera.

00:12:25.000 There’s just four, that’s just kind of the convention in taxonomy. And so I went around the literature and I pulled some trees.

00:12:36.000 So A, B, G, D here is alpha, beta, gamma, delta. And you can see that out of six trees I found in the literature, none of them agree with each other, but they all have very high confidence levels.

00:12:53.000 So if you know about phylogenetic trees, the way that you estimate confidence is something called a bootstrap value, and it goes from zero to 100 or from zero to one depending on what unit you use.

00:13:07.000 You can see here, all these trees have high confidence, but they disagree with each other. So, at most, one of those trees can be right, so something is going wrong here.

00:13:19.000 So we need to go back and check our assumptions here, because the conventional assumptions about how to do this are simply not working.

00:13:29.000 And the answer here is that bootstrapping assumes the alignment is correct.

00:13:37.000 So typically, you make a tree, you see the high bootstraps, you kind of think okay everything’s working okay.

00:13:44.000 But bootstrapping is not a test of whether your alignment is good enough to make the tree. It assumes that the alignment is correct and then asks, is there enough information in the alignment to make the tree?

00:13:56.000 But if there are systematic errors in the alignment, you have a problem.

00:14:01.000 So, what can we do about this.

00:14:06.000 Well, you know, if you’re a good biologist you know you’re supposed to do replicates of your experiment: you try the same experiment, repeat it, and start measuring errors and so on.

00:14:18.000 And you can see we’ve got a model of how we might do this in the literature right here. Well, these guys,

00:14:24.000 why did they get different trees? Well, they chose a different alignment method and different tree-building software, so maybe DeGroote used MUSCLE and RAxML, maybe Zhu used MAFFT and QuickTree.

00:14:40.000 And they came up with different answers, so maybe what we should do is just

00:14:46.000 try several different ways ourselves, and see if they give consistent answers.

00:14:54.000 Now, people essentially never do that. And so, why not? I think the answer is that people generally have some idea about what’s the best method, because there’s a benchmark test out there, or because their friend told them that’s how they do it,

00:15:13.000 and they think the friend is more expert than they are. So there’s this general perception that there’s one way that’s the best, or shown to be the best, and I’m just going to work with it.

00:15:25.000 And there’s

00:15:27.000 sort of a concern that if you use some other method that’s not quite as good as the best method, then well you discount the fact that that method disagrees, and you just trust the best one.

00:15:40.000 So, I don’t know, whatever the psychological or sociological reason, this is almost never done, but this example shows that maybe people should be doing this.

00:15:50.000 So, this gives me the motivation for the essentially new ideas that are in MUSCLE 5, which is, well, how can we do this?

00:16:01.000 So if you remember back to the opening slides when we, when a computer makes an alignment.

00:16:08.000 It’s based on this very, very simple model where we have a substitution matrix, and typically a couple of lines.

00:16:17.000 Excuse me.

00:16:23.000 So if you remember back to the opening slides when we, when a computer makes an alignment. It’s based on this very, very simple model where we have a substitution matrix and typically a couple of gap penalties.

00:16:39.000 And how do you set those parameters? Well, you measure them on some benchmark, and then you set the default values for those parameters based on whatever comes out best on the benchmark.

00:16:55.000 But if you think about it, they’re kind of round numbers, averaged over a benchmark, so should it really matter whether you use 5 or 5.1, or 2 or 1.9, for your gap penalties? And the answer is that it really shouldn’t matter, and if it does, then you should start to be suspicious about whether your results are any good.

00:17:19.000 So the idea is to do what I call perturbing parameters, so we introduce some small random variations, for example into the gap penalties.

00:17:30.000 And we ask, does that change the alignment, and does that change the downstream analysis that you’re doing, such as trees?

00:17:39.000 And how do you set the scale for these perturbations? Well, how did we come up with the parameters in the first place? We tuned them on some benchmark test.

00:17:52.000 And the standard should be that we make the perturbations as large as possible, because we want replicates that are different.

00:18:00.000 But we don’t want to degrade the accuracy on our benchmark, because once you start to degrade accuracy, people are not going to use your method.

00:18:10.000 So you want to be in that sweet spot where you maximize the variation without paying a price in accuracy.
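
The idea of perturbing parameters can be sketched with a toy pairwise aligner. This is an illustrative assumption throughout: it uses a plain Needleman-Wunsch aligner with a single linear gap penalty rather than the HMM that MUSCLE 5 uses, and the function names, default scores, and the 25% perturbation scale are all made up for the example.

```python
import random

def global_align(a, b, match=2.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))

def perturbed_replicates(a, b, n_replicates=5, scale=0.25, seed=1):
    """Align a and b repeatedly, jittering the penalties by up to +/- scale."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        gap = -2.0 * (1.0 + rng.uniform(-scale, scale))
        mismatch = -1.0 * (1.0 + rng.uniform(-scale, scale))
        replicates.append(global_align(a, b, mismatch=mismatch, gap=gap))
    return replicates
```

Counting the distinct results, for example with `len(set(perturbed_replicates(seq1, seq2)))`, gives a rough sense of how sensitive a particular pair of sequences is to parameter choices that should not really matter.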

00:18:18.000 Okay, so I promised a gentle introduction, so we’re going to have a couple of slides here which are not gentle at all, but we’ll quickly go back to a more gentle presentation. So how do I actually do that?

00:18:34.000 Okay, so this is all kind of mathematical. It’s based on a thing called a hidden Markov model.

00:18:42.000 And this was introduced by a group from Stanford, way back in 2005, around about the same time that I was doing the first version of MUSCLE, and from a mathematical point of view it’s a very elegant framework where everything is probabilities.

00:19:02.000 But really, it’s nothing fancier from an evolutionary point of view. It’s basically just this very simplified model where you have a substitution matrix and gap penalties. Here you actually have four instead of two, but you know, it’s not that different.

00:19:21.000 And so this gives me a nice little way where I can introduce random perturbations in a principled way and keep everything under control, and not just do arbitrary things.

00:19:39.000 And

00:19:44.000 this is how MUSCLE 5 works. Okay, so that’s everything I basically just said.

00:19:51.000 So now the idea is instead of just running one alignment.

00:19:56.000 We run a whole set of replicates that’s what I call the ensemble, so we just keep changing the parameters, a little bit, making a different alignment.

00:20:08.000 And if we see that the alignment is the same every time, this gives us a lot more confidence that the alignment is correct because it’s robust against these changes in the model which really shouldn’t matter.

00:20:20.000 And you can focus on individual columns, maybe some columns are consistently reproduced and some are less so you can actually assign a confidence level to each column in the alignment.
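
One way to make the per-column confidence concrete is sketched below. It assumes each replicate is a list of gapped strings with the sequences in the same order, and it treats a column as reproduced only when exactly the same residues are aligned together. The function names are hypothetical, and this is a simplified reading of the idea rather than the MUSCLE 5 implementation.

```python
def column_signatures(alignment):
    """Describe each column by the residue index it holds in every sequence.

    A gap is recorded as None, so two columns from different alignments have
    equal signatures only if they align exactly the same residues.
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    signatures = []
    for column in zip(*alignment):
        signature = []
        for k, ch in enumerate(column):
            if ch == "-":
                signature.append(None)
            else:
                signature.append(counters[k])
                counters[k] += 1
        signatures.append(tuple(signature))
    return signatures

def column_confidence(reference, replicates):
    """Fraction of replicates that reproduce each column of the reference."""
    replicate_sets = [set(column_signatures(r)) for r in replicates]
    return [sum(sig in s for s in replicate_sets) / len(replicate_sets)
            for sig in column_signatures(reference)]
```

For example, with `reference = ["AC-GT", "ACTGT"]` and replicates `[["AC-GT", "ACTGT"], ["ACG-T", "ACTGT"]]`, the first two columns and the last column are reproduced in both replicates, while the two middle columns are reproduced in only one.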

00:20:34.000 And then even if the alignment varies, that doesn’t necessarily mean that’s bad. It may still be good enough for a given purpose, so what you should do is, well, you continue the analysis and you estimate your tree.

00:20:48.000 And not only can you see if the tree is consistent, but you can see if the bootstrap values are trustworthy.

00:20:57.000 Because if the trees come out different but with high bootstraps, then you should not trust the bootstraps. You should believe what the ensemble of trees is telling you, which is that some parts of the tree are not reproducible, so it gives you a different way of approaching this whole pipeline.

00:21:18.000 So now I need to sort of digress into another sort of technical issue which is.

00:21:25.000 Well, the hidden Markov model or BLAST or whatever, that’s how you align two sequences.

00:21:31.000 But how do you build a multiple alignment? And the answer is, essentially every popular method does the same thing: there’s MUSCLE, ProbCons, MAFFT.

00:21:42.000 They all assemble the final alignment using a strategy called progressive alignment, and the way that this works is that at every step you do a pairwise alignment. When you start at the very beginning, at the bottom of the tree, you have individual

00:22:02.000 sequences. And as you work your way up the tree,

00:22:07.000 you have.

00:22:09.000 So let’s say at the first node, now you have a pair of sequences aligned. Well, you keep them aligned to each other, and you align them to something else, so at each stage you keep the alignments intact and you do a pairwise alignment of two alignments.

00:22:25.000 And this process can be regarded as following a binary tree, and that tree is called the guide tree.

00:22:35.000 Sometimes it’s explicitly constructed at the beginning, sometimes you dynamically figure out which pair you’re going to join at each iteration, but the bottom line is there’s always a guide tree of one kind or another involved.
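
The order of joins implied by a guide tree can be sketched as a post-order traversal. This assumes the guide tree is given as nested two-element tuples with sequence names at the leaves; the representation and function name are illustrative, and the pairwise alignment step itself is left out.

```python
def join_order(guide_tree):
    """List the joins a progressive aligner would perform, leaves first.

    Each join is a pair (left_group, right_group) of tuples of sequence
    names; the two groups are aligned to each other as fixed blocks.
    """
    joins = []

    def visit(node):
        if isinstance(node, str):  # a leaf is a single sequence
            return [node]
        left, right = node
        left_names = visit(left)
        right_names = visit(right)
        joins.append((tuple(left_names), tuple(right_names)))
        return left_names + right_names

    visit(guide_tree)
    return joins
```

For the tree `(("A", "B"), ("C", ("D", "E")))` this yields four joins: A with B, D with E, C with the D/E block, and finally the A/B block with the C/D/E block. Any error made in an early join is frozen into every later one.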

00:22:52.000 And this is a problem for phylogenetics because well if you have a more challenging alignment.

00:23:01.000 Every time you make one of these joins the quality goes down, you have fewer and fewer correct columns or good enough columns.

00:23:14.000 So it means that the systematic errors, the pattern of good columns, well-conserved columns, and errors in the final alignment, reflect the guide tree that you used to build it.

00:23:27.000 So if you then take that alignment and you give it to RAxML or QuickTree or whatever,

00:23:33.000 those systematic errors can get reflected in the maximum likelihood tree.

00:23:39.000 So how does bootstrapping actually work? Bootstrapping takes random samples from the columns in your alignment and asks whether each subsample reproduces the same tree or not.

00:23:55.000 If you have systematic errors, then you can have the same patterns of this sort of underlying guide tree reproduced in different columns, and this is the mechanism where systematic errors in the alignment can produce systematic errors in the tree, and can give you spuriously high bootstrap values.
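
A minimal sketch of column bootstrapping, under stated assumptions: real bootstrap support is computed per branch by a tree-building program, whereas here `build_tree` is any caller-supplied function and support is simply the fraction of resamples that return the same result. `closest_pair` is a toy stand-in for a tree builder, included only so the example runs.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in alignment]

def bootstrap_support(alignment, build_tree, n_boot=100, seed=0):
    """Fraction of resampled alignments giving the same result as the original."""
    rng = random.Random(seed)
    reference = build_tree(alignment)
    hits = sum(build_tree(bootstrap_alignment(alignment, rng)) == reference
               for _ in range(n_boot))
    return hits / n_boot

def closest_pair(alignment):
    """Toy 'tree builder': indices of the two rows with the fewest mismatches."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            distance = sum(x != y for x, y in zip(alignment[i], alignment[j]))
            if best is None or distance < best[0]:
                best = (distance, (i, j))
    return best[1]
```

Note that every resample is drawn from the same alignment. If many columns share the same misalignment, the resamples share it too, which is how support can come out high for a grouping that is an artifact.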

00:24:16.000 So, this is what I just said.

00:24:23.000 So MUSCLE 5 does something else. It doesn’t just perturb the HMM parameters. It also makes variations in this guide tree.

00:24:37.000 And you have to do this in a rather carefully designed way.

00:24:45.000 Because you have conflicting goals. One of the reasons you have a guide tree is that you get the most accurate alignment when you align the most closely related sequences. If you just take two groups at random and align them, they might be more

00:25:04.000 distantly related, and you’re introducing errors more quickly.

00:25:08.000 So at every step you want to join something close to the most closely related groups in order to get the highest accuracy.

00:25:19.000 But you also want to vary this tree in a meaningful way so that it has a chance to expose the systematic errors. So the way that MUSCLE 5 does this is that it preserves the guide tree close to the leaves, but as you get close to the final alignment,

00:25:38.000 so the last two or three joins, it switches around the order. So this is the notation that I use, and there are four variants like this for the final assembly stage of the multiple alignment.

00:26:00.000 So now the way that replicates are generated is a combination of a perturbation of the HMM parameters and this joining order of the tree.
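
The combination of the two sources of variation can be sketched as a simple plan of replicates. This is an assumption-laden illustration: it enumerates only the three distinct ways of joining three top-level groups with two pairwise joins, and it does not attempt to reproduce how MUSCLE 5 defines the four variants mentioned in the talk. The function names and the use of integer seeds are hypothetical.

```python
from itertools import product

def last_join_variants(a, b, c):
    """Three ways to assemble three top-level groups with two pairwise joins."""
    return [((a, b), c), ((a, c), b), ((b, c), a)]

def replicate_plan(guide_tree_variants, seeds):
    """Pair every guide-tree variant with every parameter-perturbation seed."""
    return [{"guide_tree": tree, "seed": seed}
            for tree, seed in product(guide_tree_variants, seeds)]
```

With three join orders and four seeds, `replicate_plan(last_join_variants("A", "B", "C"), range(4))` describes twelve alignments to build. The subtrees A, B, and C themselves are left untouched, which matches the stated goal of preserving the guide tree close to the leaves.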

00:26:15.000 So now to some benchmark results. So, the gold standard in protein multiple alignment is this benchmark set called BAliBASE, and it’s worth noting that the best alignment methods, and here I’m showing,

00:26:37.000 you know, there are a few other competitors, but those are definitely among the state of the art. But you’ll notice that on the left is the y-axis, and that’s the fraction of columns that are correctly aligned on this set.

00:26:51.000 And we’re only in sort of the 50 to 60% range. So, this tells you that there is a lot of uncertainty and even the best methods don’t reproduce structural alignments on this benchmark.

00:27:08.000 But there are two lessons here I want to draw out on this slide. One is, well, MUSCLE 5 is slightly better than the competition here. I mean, I don’t think it’s better in any meaningful way, to be honest.

00:27:23.000 But it addresses this psychological problem: you want to feel you’re using the best method, and you shouldn’t have to discount the other ways of doing it because they might be worse.

00:27:34.000 And the other thing is that it doesn’t make any practical difference whether you use the defaults or whether you perturb the HMM parameters and the guide trees. So, any one of these ways of building an alignment is equally good as far as we know.

00:27:53.000 It performs equally well on the benchmark. Of course, some of them do better and worse on particular sets. But when you’re starting with new data, you have no particular reason to prefer any of these variants.

00:28:04.000 And this means that you can generate your ensemble of alignments and they’re equally trustworthy. So, now you can say, well, if they give different results, I have a problem. Or if they’re giving consistent results, then, you know, I have good reason to feel more confident in them.

00:28:23.000 I haven’t really talked about nucleotide alignments, but it’s a similar story.

00:28:35.000 So, now I’m going back to sort of a concrete application of this whole approach. And so, I need to give a little background on RNA virus taxonomy.

00:28:53.000 And it’s undergone some radical changes recently.

00:28:58.000 So, before 2018, so maybe I should quickly review taxonomy for people who have not heard of this. It’s a human classification scheme. And it’s based on a tree-like

00:29:13.000 structure, which is supposed to follow phylogeny. And then it has ranks, which are species, genus, family, order, class, phylum, which are human-level groups, but they’re supposed to capture something meaningful about the organisms that you’re classifying.

00:29:37.000 And they’re supposed to follow the phylogenetic tree so that these groups are, you know, they’re evolutionary relatives, not just things that you think look alike.

00:29:48.000 And before 2018, RNA viruses were not classified above order. There was no class or phylum rank.

00:29:59.000 And the reason for that is that, well, how do you build trees for these viruses? Well, you use polymerase sequences, but these sequences evolve very quickly. And it becomes very, very difficult to make alignments and trees for the most distantly related groups.

00:30:18.000 And then in 2018, a paper came along, which was highly influential, and it’s Wolf 2018. And they went to a lot of effort to build a sort of global multiple alignment of all RNA viruses from their polymerases and build a tree from it.

00:30:41.000 And this is sort of a figure from the paper, and I’m sort of capturing, I’m calling out here the bootstrap values on the deepest branches of this tree, which look very high.

00:31:00.000 And so that was very convincing to these guys and also to the sort of official virus taxonomists. And what they did was, you can see we’ve got those five colored branches. Those were adopted as phylum rank.

00:31:19.000 So on the basis of this paper, RNA virus taxonomy was expanded to include phylum and class ranks. And when I saw this paper, it bothered me for about a year.

00:31:33.000 So this is kind of the history behind how MUSCLE 5 came about. Because I’ve played quite a bit with these polymerases, and I know how diverged they are.

00:31:47.000 And I just looked at that and I said, I do not believe that tree. I just don’t know, you just cannot align these things well enough to get this kind of reliability in a tree.

00:32:00.000 How do you prove this is wrong? So, of course, now you’ve heard the rest of the talk, you have an idea about how I went to do that. But really, I just went around muttering, I don’t believe it, this can’t be right.

00:32:15.000 And this was sort of what I ultimately came up with as a response to my skepticism, if you like.

00:32:22.000 So, yeah, so just to emphasize how difficult it is, the average distance between highly diverged viruses is five substitutions per site.

00:32:39.000 So if you know anything about sort of protein alignment, protein evolution, so once you get something like 50% amino acid identity, which is 0.5 substitutions per site, it’s already starting to get a little bit tricky.

00:33:01.000 You’ve got some good regions, you’ve got some bad regions. Then if I get down anywhere close to one substitution per site, on average, I’ve got a different letter at every position and I’m well down into what’s called the twilight zone.

00:33:17.000 And we’re starting to get alignments that look like the WTF, the things that I was showing earlier. But we have gone five times deeper than that.
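
As a rough back-of-envelope illustration of why five substitutions per site is so extreme, the sketch below uses an equal-rates model over 20 amino acids, which is an assumption introduced here and not something from the talk. The speaker's figures are loose rules of thumb, so the numbers will not match them exactly; the point is only the shape of the curve.

```python
import math

def expected_identity(subs_per_site, n_states=20):
    """Expected fraction of identical sites under an equal-rates model.

    Every residue is assumed equally likely to change into any other, so
    identity decays from 1 toward the chance level of 1 / n_states.
    """
    floor = 1.0 / n_states
    decay = math.exp(-subs_per_site * n_states / (n_states - 1))
    return floor + (1.0 - floor) * decay
```

Under this toy model the expected identity at five substitutions per site is within about half a percentage point of the 5% you would get from unrelated sequences, which is one way to see why the conventional view was that the information is simply lost.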

00:33:29.000 And so the conventional ability, excuse me, the conventional wisdom before 2018 was, well, the information’s just lost.

00:33:39.000 So one of the things I did sort of in my skepticism was to really dig deep into that alignment. And I’m not going to go into the technical details here at all.

00:33:52.000 I’m just going to say that, well, there are a few very well conserved catalytic residues in the RNA virus polymerase, and I was able to show that many of those catalytic residues are not correctly aligned within this alignment.

00:34:11.000 So this means that of something like 400 columns, none of them are aligned correctly. Because if you don’t get the catalytic residues right, where you have the best conservation, you certainly haven’t got the rest of it right.

00:34:27.000 So this sort of bolstered my opinion that something was going wrong in it.

00:34:33.000 So after having come up with MUSCLE 5 and put it together, I could now apply it to the Wolf 2018 alignment and tree.

00:34:49.000 So I needed to do two things. First of all, I had to show that my alignment was at least as good as theirs, because they went to all this effort to do the manual adjustment and whatever. And these guys have a very high reputation as the experts on RNA viruses.

00:35:06.000 So I make absolutely no claim that my alignment is good in any sense, but I can show that I get the catalytic residues right more often than they do. So I think there’s a reasonable case that mine is at least as good as theirs.

00:35:24.000 But then when I generate replicates, these groups just get shuffled.

00:35:31.000 So my claim is that when you do the analysis using the MUSCLE 5 ensemble, you can see that the phyla are not reproduced.

00:35:48.000 And this means that the high bootstraps you get in the Wolf 2018 tree must be artifacts of systematic errors in their alignment. And of course, like everybody else, they use progressive alignment to put things together.

00:36:02.000 So I think those groups really reflect the big blocks that they assembled in the final stages of putting together their alignment.

00:36:12.000 So when I do the confidences my way, they’re very low. So my methodology says I shouldn’t believe my tree, and I believe that it shouldn’t be believed.

00:36:30.000 So this is, of course, my minority opinion, and I’m currently trying to convince people that RNA virus taxonomy has kind of been driven off the rails, and it’s not meaningful at all.

00:36:47.000 And I think that’s actually positively harmful. To get into that would be a whole other talk, but I’m just trying to use this as a case study to show how this approach can be applied to something in practice.

00:37:05.000 And so we actually got through this quicker than I thought. That’s the MUSCLE 5 paper. And thank you very much for the invitation.

00:37:17.000 Thank you.

00:37:23.000 Now I think we’ll have the Q&A session.

00:37:29.000 I think it’s really amazing, because for me, I just use whatever multiple sequence alignment program, and then if it gives me a high bootstrap value, I believe it and call it a day.

00:37:41.000 And I really appreciate that this talk led us into the details and how to be skeptical. And I think Kelvin and Luther have a question in the chat. Do you want to unmute yourself or maybe turn on the camera to speak for yourself?

00:37:57.000 I’m sorry, I didn’t quite catch that, but I do see a great question on the chat. It’s a great question because now I wish I had done a couple of slides on this.

00:38:08.000 And actually, this is what we’re doing in the Serratus project right now. So you remember I gave a very quick slide on this project we did during the pandemic to discover new RNA viruses.

00:38:25.000 And now we have this incredible resource with AlphaFold where we can discover much more highly diverged viruses by doing this folding.

00:38:40.000 So, AlphaFold is very good at enabling us to find very highly diverged homologs. With structural alignment, and I talked very briefly about this, when you have distantly related structures, you can eyeball them in PyMOL, and you can see, okay, very

00:39:04.000 clearly, you have a squiggle here and a squiggle here, and the squiggles and the squoggles line up between the different structures. But you can’t say, okay, this cysteine residue exactly matches that aspartate residue in the other structure.

00:39:21.000 It’s not that precise. So you might have a helix here and a helix here. So you can say, yeah, these secondary structures are the same. But now the homology is at that level. You can say secondary structure, yeah. But once you get to the level of an individual residue, it’s not clear.

00:39:39.000 And it’s probably not meaningful because there’s been enough insertion and deletion that it’s not really clear that there’s a one-to-one evolutionary correspondence between the residues. What’s being conserved here is the secondary structures. So what does that mean? It means you can recognize the polymerases and you can do that with a very high degree of confidence from the structures.

00:40:03.000 But what you can’t do is build trees because you can’t get evolutionary distances between structures. You can only get evolutionary distances from sequences. And even that’s questionable. So what do you actually do to build those kinds of trees? You do maximum likelihood. And what’s maximum likelihood?

00:40:28.000 This is Moses coming down with some tablets that have a very, very simplified model of evolution written on them, saying, OK, well, there’s this transition probability, that transversion probability. And there’s all kinds of fancy math. But what that math boils down to is an incredibly oversimplified model of evolution.

00:40:50.000 And, well, that’s the best we can do. And it surely works very well in reasonable cases. But here we’re at five or six substitutions per site. So we are pushing sort of a model which no doubt works very well over short distances to extremely large distances. And that surely just doesn’t work. And with structures we don’t have an evolutionary distance.
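
The point about five or six substitutions per site can be illustrated with a small worked example. The sketch below assumes the simplest possible equal-rates substitution model (a Jukes-Cantor-style model generalized to 20 amino acid letters), which is not necessarily the model used for the trees discussed in the talk; the function names are hypothetical. Under that assumption, the expected fraction of differing sites is p = (1 - exp(-k d)) / k with k = n / (n - 1), which flattens out as the distance d grows.

```python
import math


def expected_p_distance(d: float, n_states: int = 20) -> float:
    """Expected fraction of differing sites after d substitutions per site,
    assuming an equal-rates model with n_states letters (assumption)."""
    k = n_states / (n_states - 1)
    return (1.0 - math.exp(-k * d)) / k


def estimated_distance(p: float, n_states: int = 20) -> float:
    """Invert the formula above: estimate d from an observed p-distance.
    Only defined while p is below the saturation limit (n_states - 1) / n_states."""
    k = n_states / (n_states - 1)
    return -math.log(1.0 - k * p) / k


if __name__ == "__main__":
    # Short distances change the observed difference a lot; long ones barely do.
    for d in (0.1, 1.0, 5.0, 6.0):
        print(f"d = {d:>4}: expected p-distance = {expected_p_distance(d):.4f}")
```

Under this toy model, the expected observed difference at 5 and at 6 substitutions per site differs by well under one percentage point, so a small amount of noise in the observed difference translates into a large error in the estimated distance.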

00:41:16.000 So you can say, well, this maybe looks more similar to that. But it doesn’t tell you whether it’s an evolutionary neighbor or not. So there’s this classic issue which people can sometimes forget. So let’s say you have a BLAST top hit. Is it to a polymerase, a virus polymerase, or to a group II intron, or a CRISPR-Cas protein? These are all in the same superfamily. They’re all palm domain proteins.

00:41:46.000 And you can look at the top hit and say, OK, the top hit is a virus. It’s probably a virus, which is true. But it’s not necessarily a virus. It could be a highly diverged group II intron. It’s just that the way BLAST ranks them doesn’t necessarily reflect what’s closest in the tree. When you’re looking at structures, you have the same problem. You can say, well, this structure looks more like that structure.

00:42:10.000 And you can say, OK, it has a DALI Z-score or a TM-score or whatever, which is closer. But that doesn’t necessarily mean it’s the closest evolutionary relative. So AlphaFold is adding a lot of capability here. But it doesn’t resolve the deep evolutionary history of these viruses.

00:42:40.000 So the question is, does MUSCLE work equally well for nucleotides as for amino acid sequences? So the question here is, well, what’s your standard of accuracy? And so I don’t think you can directly compare them.

00:42:56.000 As you saw on BAliBASE, the best methods only get 50% to 60% of columns correct. But they’re incredibly challenging alignments, where probably most of the columns are not even meaningful anyway.

00:43:13.000 So it really depends. What’s your benchmark? What’s your standard of accuracy? I mean, it works well. And it also is slightly better than the competition when you do a benchmark. Does that matter in practice? I would say no. I would say the important thing is that, by generating the ensemble, MUSCLE will tell you whether it’s good enough for your problem or not.

00:43:42.000 And this is a gaping hole in bioinformatics tools in general: you might have a confidence or a probability, but you’re trusting that all the assumptions in the algorithm are good enough that it can be trusted to come up with its own confidence.

00:44:00.000 And sometimes that’s not the case because, as you saw with bootstrap values, sometimes you get high bootstraps for wrong trees. So here MUSCLE is helping you answer the question, is the method good enough to answer the biological question I’m trying to answer?

00:44:27.000 Could I comment on methods that estimate phylogeny and alignment simultaneously? So that would require a very long answer to go in detail. But I would say generally, so conceptually, it’s the right thing to do.

00:44:47.000 In practice, these methods are so computationally expensive that they’re too slow to be usable. And the accuracy is quite bad, because it’s just too hard to implement.

00:45:02.000 And when you do that kind of thing, you’re still using these grotesquely oversimplified models. So I want you to keep in mind, Moses is coming down with these laws, and proteins don’t respect these laws whatsoever.

00:45:18.000 Proteins live in a very, very complicated world with selection pressure and biochemistry and everything going on. They don’t think about BLOSUM scores. So if you have a method that’s trying to estimate phylogeny and alignment at the same time, it’s starting from the wrong place.

00:45:35.000 It’s starting from this highly simplified model of evolution. And however much sort of added bells and whistles you put into that model, it’s still extremely oversimplified. So you’re kind of pushing in a direction which is, it’s like the physicist with the spherical cow.

00:45:58.000 Okay, so we’re adding two spherical horns to the spherical cow, but we’re still a long way from the cow.

00:46:11.000 Awesome. And Zarif, please ask your question.

00:46:16.000 I was just wondering, and by no means, my research is on phylogeny and evolution, but I’m just wondering when you are doing these trees and alignments, so can you take information from some other closely related proteins when you are doing the phylogeny

00:46:39.000 and take information from trees constructed from other proteins to make your phylogeny or alignment better?

00:46:51.000 So the short answer is no. And there are two reasons for that. One is the polymerase is the only gene that you find in all RNA viruses. So even within a family like coronavirus, there’s quite a lot of variation in the gene content.

00:47:10.000 If you go deep, there is nothing except the polymerase. So that’s one problem. The other problem is that the other genes typically evolve much more quickly than the polymerase.

00:47:24.000 So they’re much harder to align, and it’s just not adding anything useful. I mean, even between two different species, you get about 90% polymerase identity. So that’s enough distance that you can clearly separate species.

00:47:46.000 The other genes really aren’t adding anything helpful, and they’re evolving much more quickly, so they’re harder to align. So in any case I can think of, you’re better off throwing the other genes away and just focusing on the polymerase.

00:48:01.000 It’s only if you’re looking at a very, very fine grained detail, like let’s say Omicron variant of COVID versus some other variant, then you’re looking at the other genes. But as soon as you go any deeper than that, you’re better off focusing just on the polymerase.
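
The figure of about 90% polymerase identity refers to the fraction of matching residues between two aligned sequences. As a minimal sketch, assuming identity is counted over columns where neither sequence has a gap (other definitions use a different denominator and give somewhat different numbers), it could be computed as follows. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two rows taken from the same alignment.

    Assumption: columns where either row has a gap are skipped, so the
    denominator is the number of residue-to-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy fragments, not real polymerase sequences.
    print(percent_identity("MKV-LSGDDL", "MKVALSGEDL"))
```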

00:48:23.000 Nice. Can you check Olivia’s question in the chat?

00:48:28.000 What do I think the ideal method of RNA virus classification is?

00:48:36.000 So I just addressed the issue of whether it helps use the rest of the genome.

00:48:43.000 And I think that, well, so yes. So here the issue is, with your traditional taxonomy, you culture a microbe and you grow it in some cells and you take photographs of it, you see what kinds of cells it infects.

00:49:05.000 So it’s a very labor-intensive focus on a single virus. So when a project like Serratus finds 100,000 viruses, we’re not doing that kind of analysis. Basically we’re giving you a polymerase sequence. And we don’t know very much else about it, typically.

00:49:21.000 But kind of by definition, that’s what we’ve got to work with and we want to classify those sequences. And this is generally a problem in all of microbial biology right now is that we’re uncovering this vast world of viruses and bacteria, fungi, which are only known from metagenomic sequences.

00:49:45.000 And you can say, well, how do we do taxonomy with those and maybe just don’t do taxonomy, maybe do computational classification and you call it something else. I think this was the mistake that the virus people made, to be honest, because with bacteria and fungi, we have that problem, but people don’t try to make official taxonomy.

00:50:07.000 But the virus guys elevated these debatable polymerase trees and made official taxonomy out of them. So I think they should have made a distinction between the sort of big-picture computational things and the fine-grained study in the lab. And this is really confusing to people that are not intimately familiar with

00:50:33.000 how virus taxonomy is done.

00:50:36.000 So, is there scope to improve the strategy for guide tree estimation and merging of subalignments?

00:50:47.000 I think this is so

00:50:51.000 I’m sure there is. I mean, obviously, if I had ideas, I would have done them. But

00:50:58.000 I want to sort of seize on the sort of a conceptual issue in this question, which is always the search for the best method.

00:51:07.000 And of course, this is a very good search. And I’m a very competitive guy. I want to have the best method when I make one. But

00:51:15.000 one thing I’ve tried to draw attention to in this talk is there’s a risk with just focusing on the best method if you’re not sure how good it is or how good it is for your specific question.

00:51:29.000 So it’s important, I think, to keep in mind that the best method is still a radical oversimplification of biology, whatever you do, however good it is, however much progress we make on multiple alignment over the next 10 years.

00:51:46.000 This will still be the case because the only way to really do it is just like simulate the entire planet for 3 billion years and see what evolution actually does. We can’t do that. We make these very, very simplified computational models.

00:52:01.000 And we should try to make them as good as possible. We should also try to have them interrogate themselves as to how good they are.

00:52:09.000 Okay, another question.

00:52:15.000 So after generating the alignment, yes, it’s possible to identify problematic regions.

00:52:24.000 And yes, you can refine those regions.

00:52:30.000 But some regions are simply not meaningful because there’s no homology there or no sequence similarity there. So you may be better off identifying those regions. And MUSCLE 5 does this. It assigns a column confidence score to every column.

00:52:48.000 And, of course, it depends on why you made the alignment and what inference you make from it, so it’s always a bit risky to give general rules. But as a general rule, maybe you should just ignore those columns, because it’s not a given that every column has to mean something. You may remember I said there are quite a few types of mutation that the alignment cannot represent.

00:53:14.000 So, if you go too far and try to force the alignment into some kind of shape, it may just not be meaningful because the mutations that happened, don’t fit.
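
The idea of a column confidence score derived from an ensemble of alignments can be sketched as follows. This is a simplified illustration, not MUSCLE's actual implementation: it assumes confidence is the fraction of ensemble alignments that contain exactly the same column, that every alignment lists the same sequences in the same row order, and that all rows in one alignment have equal length. The function names and toy data are hypothetical.

```python
from typing import List, Optional, Tuple

# A column is identified by which residue (0-based position in the ungapped
# sequence) each row contributes, or None where the row has a gap.
Column = Tuple[Optional[int], ...]


def columns(alignment: List[str], gap: str = "-") -> List[Column]:
    """Convert an alignment (list of gapped rows) into column signatures."""
    next_residue = [0] * len(alignment)
    result: List[Column] = []
    for col in range(len(alignment[0])):
        signature: List[Optional[int]] = []
        for row, seq in enumerate(alignment):
            if seq[col] == gap:
                signature.append(None)
            else:
                signature.append(next_residue[row])
                next_residue[row] += 1
        result.append(tuple(signature))
    return result


def column_confidence(reference: List[str], ensemble: List[List[str]]) -> List[float]:
    """For each reference column, the fraction of ensemble alignments
    that reproduce that exact column."""
    ensemble_columns = [set(columns(a)) for a in ensemble]
    return [
        sum(col in cols for cols in ensemble_columns) / len(ensemble_columns)
        for col in columns(reference)
    ]


if __name__ == "__main__":
    reference = ["ACGT", "ACGT"]
    ensemble = [["ACGT", "ACGT"], ["ACG-T", "AC-GT"]]
    print(column_confidence(reference, ensemble))
```

Columns with low scores under such a measure are candidates for being ignored in downstream analysis, in line with the general rule suggested above.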

00:53:32.000 Awesome. I think, let the chance question be the last question, it will be a nice ending.

00:53:41.000 So, okay, let me read the question for identifying what follows just genes.

00:53:52.000 Okay, no, is my answer to that question. That was an easy one.

00:54:01.000 Would you like to elaborate a little bit.

00:54:10.000 Okay, I think that’s about the end of our seminar and thank you so much, Robert, and everybody if you don’t mind you can unmute yourself and give Dr. Robert a round of applause.

00:54:22.000 Okay, thank you very much.

00:54:24.000 Thank you. Please have a nice day. Bye bye everyone.

00:54:40.000 Thank you.
