Sp23 - Dr. Casey Greene - Making Serendipity Routine
Transcript
00:00:00.000 leading to his first paper, which is affiliated with the Department of Genetics here at UGA.
00:00:15.100 The Greene Lab develops machine learning methods to integrate disparate large-scale datasets,
00:00:21.300 develops deep learning methods for extracting context from these datasets, and then brings
00:00:26.020 these capabilities to molecular biologists through open and transparent science.
00:00:32.140 Of particular note to folks interested in large language models, like ChatGPT, Dr.
00:00:38.580 Greene and others recently posted a preprint on how AI might be able to help us write and
00:00:44.260 revise our academic manuscripts.
00:00:47.420 Please join me in welcoming Dr. Casey Greene to UGA.
00:00:54.420 Okay, thank you, yeah, it’s good to be back.
00:01:04.700 I actually spent a summer working in this building in the Department of Genetics.
00:01:09.380 Everything around this building looks different now, but somehow the building still looks
00:01:11.860 the same.
00:01:14.500 I’m excited to get a chance to share some of what we’ve been doing in our group and
00:01:19.980 then to share some of what’s going on at the University of Colorado and sort of give you
00:01:23.140 an idea of what the ecosystem is like where we are.
00:01:27.980 I think it’s always important to think about kind of what our role as informaticists in
00:01:34.180 science is, like what do we bring to the ecosystem, and I like to think of what we contribute
00:01:40.140 as essentially serendipity, right?
00:01:42.900 If we do our jobs well, people will see something in their data that they hadn’t seen before,
00:01:46.580 and they will make a different decision based on that.
00:01:49.020 So I like to, you know, think about how we can make more kind of serendipitous moments.
00:01:55.540 I’ll start with a brief vignette of a project that we started quite a few years ago now
00:02:02.940 trying to understand rare diseases.
00:02:07.860 And in particular, what systemic factors might drive
00:02:13.500 a rare disease.
00:02:16.340 The world when we started this project was one where looking across multiple data sets
00:02:22.260 remained somewhat challenging.
00:02:24.460 It’s time consuming, you have to deal with batch and technical artifacts.
00:02:28.980 And so our question when we started, a postdoc joined the lab and the idea was, well, you
00:02:33.380 know, here’s the gap that we’re facing.
00:02:35.180 If you want to analyze multiple data sets at the same time, to identify systemic factors,
00:02:40.060 that means you’re looking at different tissues, probably looking at different cohorts, there’s
00:02:44.180 potentially different disease contexts, might be different controls.
00:02:48.060 All of this makes your life a lot harder; you can’t just kind of make an assumption
00:02:50.900 that you’re going to do three different t-tests and be done with it.
00:02:55.100 And so what this postdoc wanted to do was say, okay, can I find these commonalities
00:02:59.980 without it taking an inordinately large amount of time.
00:03:03.180 And she was very interested in not taking an inordinately large amount of time because of
00:03:07.780 the strategy that she used in her PhD, before she joined the group: she had developed
00:03:14.300 an approach using a modular framework to analyze these data sets, where you essentially
00:03:17.740 take different data sets, decompose them into modules, and then try to map a module in one
00:03:23.220 data set to a module in another data set, to a module in another data set.
00:03:27.100 This is possible, if you’re an expert in the disease, and you’re an expert in all the tissues
00:03:31.340 that sort of are affected, you can do this, you can say, okay, this is this pathway response
00:03:36.340 in this tissue, and this is this other pathway response in this other tissue, but it’s really
00:03:39.740 time consuming.
00:03:40.740 And the complexity grows essentially, with at least the square of the number of modules
00:03:47.460 that you want to look at.
00:03:48.460 So you sort of restrict yourself to a modest number of modules if you want to do this in
00:03:51.380 any practical amount of time.
00:03:53.580 There are potentially ways to automate this, you know, using over representation analysis
00:03:57.060 or other strategies to try to, you know, make life easier on the mapping stage, so you don’t
00:04:00.940 spend a lot of time looking at stuff that’s unlikely to be fruitful.
00:04:04.220 But either way, it’s challenging, it takes a bunch of time and it requires a lot of expertise.
00:04:10.100 So what Dr. Jaclyn Taroni did was say, well, what I really want is a module library,
00:04:15.980 like how could this process or cell type be represented in any data set that I look at,
00:04:21.220 I’d like to be able to pull that off the shelf, and then take that module library to different
00:04:26.100 data sets, and then look at each of those data sets in terms of those modules and just
00:04:30.420 look directly across those modules.
00:04:31.580 I don’t have to now sort of do the module connection after the fact, I can do it upfront.
00:04:37.500 This would be great, wouldn’t it be great if.
00:04:41.460 And so her hypothesis was that, you know, these modules don’t just exist in one data
00:04:46.220 set, they exist across human biology.
00:04:48.180 So her hypothesis was that she would be able to learn these reusable modules
00:04:54.540 by taking many, many, many different data sets, decomposing those different data sets
00:04:58.700 into modules, and then sort of learning which modules were necessary to reconstruct the
00:05:04.260 original data.
00:05:06.180 And then once she did that, she could do that on a generic collection of data, and then
00:05:10.180 hopefully be able to use that on her data sets of interest.
00:05:12.860 And in this case, we’re interested in studying a disease called ANCA-associated vasculitis,
00:05:17.580 which is rare enough that if you look at large collections of public RNA-seq data, you don’t
00:05:22.060 find it.
00:05:23.680 So it’s quite a rare disease.
00:05:28.500 So the idea here was, if she took a whole bunch of generic samples, essentially random
00:05:31.700 human data that she downloaded from the internet, transcriptomic data, she decomposed that,
00:05:35.900 she could decompose that into patterns, and then she could take those patterns and we
00:05:39.620 could apply those to the rare disease data sets of interest, and potentially do sort
00:05:42.980 of standard statistics with that.
00:05:45.940 Just to give you an idea of the data set that we started with, so this is a data set called
00:05:52.500 recount2, and I think there’s now a recount3.
00:05:56.100 This is produced by Jeff Leek’s group at Hopkins.
00:05:59.220 And if you wanted 70,000 RNA-seq samples, you can just get 70,000 uniformly processed
00:06:04.660 RNA-seq samples.
00:06:05.660 What I like to tell Jackie is that her project had to be successful, because if you think
00:06:11.140 about the resources we were giving her to start it, and you benchmark those
00:06:14.260 samples at about $1,000 each to generate, you know, I like to tell her, look, we gave
00:06:18.780 you $70 million to start your project, you have to do something with $70 million, right?
00:06:23.460 So, this was the data that we used, and then we tried quite a few different methods to
00:06:29.300 extract patterns, including some that we developed in our group, but we ended up coming to this
00:06:34.300 method called PLIER, which is from Maria Chikina’s group at Pitt.
00:06:39.260 And PLIER is the Pathway-Level Information ExtractoR, and it does a couple of things that
00:06:44.340 are really nice.
00:06:45.340 So, it’s essentially doing this matrix decomposition, but as you do the matrix decomposition, there’s
00:06:50.540 a couple sort of regularization factors and some penalties in it, and essentially, it
00:06:55.580 has some sparsity properties that we really like, so the idea is that you want to be able
00:06:59.420 to explain a dataset with relatively few latent variables.
00:07:02.740 Also, you want your latent variables to have only a modest number of genes in them.
00:07:07.300 And finally, if those latent variables can align with a pathway, you’d really prefer
00:07:12.740 that the latent variable align with a pathway, because anytime you’re doing this decomposition,
00:07:16.220 you’re essentially doing some arbitrary rotation, right?
00:07:19.180 PCA is essentially learning a rotation of your data, a
00:07:25.380 reduced-dimension rotation of your data.
00:07:28.380 The rotation in PCA is essentially arbitrary, I mean, you’ve chosen that you want to maximize
00:07:33.340 the variance on the first axis, but you could have chosen anything else, like ICA, you’re
00:07:37.740 just like, I don’t care, just make it small.
00:07:41.020 So in this case, what it’s doing is it’s essentially saying, if a pathway can line up with an axis,
00:07:46.100 let’s do that.
00:07:47.260 And so that’s the intuition for what it’s doing; the actual
00:07:52.620 regularization is a little bit different than that.
00:07:56.700 And what that gives you is a really nice level of interpretability.
00:07:59.340 So instead of saying, okay, this cell
00:08:03.700 type is these three different axes of variability added in some fraction, usually that cell
00:08:08.980 type is going to come out as a single axis in your data.
00:08:11.060 So it makes it much easier to think through and reason about the solutions.
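To make that intuition a bit more concrete, here is a schematic of the kind of objective such a pathway-aligned decomposition optimizes. This is a sketch based on the description above, not the exact PLIER formulation; see the PLIER paper for the precise penalties and constraints.

```latex
\min_{Z,\; B,\; U \ge 0}\;
  \lVert Y - Z B \rVert_F^2
  + \lambda_1 \lVert Z - C U \rVert_F^2
  + \lambda_2 \lVert B \rVert_F^2
  + \lambda_3 \lVert U \rVert_1
```

Here Y is the genes-by-samples expression matrix, Z the genes-by-k latent variable loadings, B the k-by-samples latent variable values, C a genes-by-pathways prior membership matrix, and U maps pathways to latent variables. The sparsity-style penalties are what push toward relatively few latent variables, each with a modest number of genes, aligned with a pathway whenever possible.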
00:08:18.100 And so what we essentially did is say, okay, well, we have this enormous collection of
00:08:22.260 generic human data from the internet, recount2, and we have PLIER, and we’d like that to
00:08:26.940 be a machine learning model that we could use in many different biological contexts.
00:08:30.380 So we named it multi-dataset PLIER, but because in bioinformatics, everything has to have
00:08:34.900 a name, we just shortened that to MultiPLIER.
00:08:38.780 And so this is kind of the MultiPLIER idea.
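As a rough illustration of the transfer step being described, learning gene loadings once on the big compendium and then expressing any new dataset in terms of those loadings, here is a minimal numpy sketch. The real MultiPLIER workflow uses PLIER’s own machinery in R; the function and variable names below are just for illustration.

```python
import numpy as np

def project_onto_latent_space(Z, Y_new, ridge=1e-3):
    """Express a new genes-x-samples matrix Y_new in terms of latent
    variables whose gene loadings Z (genes x k) were learned on the
    training compendium. Solves a ridge-regularized least squares for
    B_new (k x samples) such that Z @ B_new approximates Y_new."""
    k = Z.shape[1]
    gram = Z.T @ Z + ridge * np.eye(k)
    return np.linalg.solve(gram, Z.T @ Y_new)

# Usage sketch: Y_rare is a rare-disease dataset restricted to the same
# genes (rows), on a scale comparable to the training data.
# B_rare = project_onto_latent_space(Z_recount2, Y_rare)
# Each row of B_rare is a latent variable that is directly comparable
# across any dataset projected the same way.
```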
00:08:41.380 I’m just going to give you a couple highlight results from the paper that I think help to
00:08:47.540 kind of give an idea of why you might want to do this.
00:08:49.820 The paper is pretty exhaustive.
00:08:51.500 It has a significant deep dive into sort of exactly what’s happening in this stuff, but
00:08:55.340 I’ll just give you kind of the high points.
00:08:59.580 Essentially what we wanted to understand is, does this model learn something we didn’t
00:09:03.660 already know?
00:09:05.740 And is it better than if you just took data from the disease of interest?
00:09:10.180 We couldn’t get enough data from the disease of interest, which is ANCA-associated vasculitis.
00:09:13.700 So we did that analysis in the paper, and we just didn’t learn many axes of variability
00:09:19.540 because there’s too little data.
00:09:21.020 So what we wanted to do is say, well, let’s pick something where we could actually learn
00:09:24.460 something.
00:09:25.460 Let’s sort of give this idea a chance.
00:09:27.220 And then we said, well, let’s imagine we’re studying a different autoimmune disease, lupus.
00:09:31.220 So that’s what’s here, where it says SLE, this box plot.
00:09:35.660 What we’ve done is we’ve collected all the whole blood data that we could get from individuals
00:09:39.940 with lupus that was publicly available to create one collection of data.
00:09:45.260 Then what we’ve done with recount2, which is the generic human data from the internet,
00:09:48.780 is we’ve taken recount2 and we’ve subsampled it to be the same size as the lupus set.
00:09:53.820 So that’s the box plot.
00:09:54.940 The box plot is recount2 subsampled.
00:09:57.420 And then where you see the diamond, that’s what happens if you just take the complete
00:10:00.460 collection of data from recount2.
00:10:02.460 Okay, so let’s label our X.
00:10:04.300 Oh, so yeah, so these data sets, if you think about the science behind this, this experiment
00:10:08.220 has these two data sets of the same size, but different composition.
00:10:12.300 These two are the same composition of the data, but they’re quite different sizes.
00:10:16.140 I think 70 times larger for the diamond than the other.
00:10:20.780 And so then we can ask, okay, so how many patterns are we learning from our data?
00:10:23.580 So this method uses a latent variable decomposition, so we’re counting the number of latent variables, which are
00:10:27.660 essentially patterns.
00:10:29.340 You can see that if you want to, you can learn more patterns, and there’s
00:10:32.860 a heuristic in PLIER for sort of selecting the optimal number of latent variables, and
00:10:37.140 we used that heuristic here; it seems like a pretty reasonable heuristic.
00:10:40.260 It uses cross-validation to essentially ask how frequently you’re rediscovering the same
00:10:45.140 latent variables.
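My paraphrase of that kind of heuristic, sketched below: fit the decomposition on random halves of the samples and ask how often a latent variable found in one half has a strongly correlated counterpart in the other half. This is just the rediscovery idea described in the talk, not PLIER’s built-in procedure, and the function names are hypothetical.

```python
import numpy as np

def rediscovery_rate(decompose, Y, k, n_repeats=5, match_threshold=0.8, seed=0):
    """decompose(Y_subset, k) is assumed to return a genes-x-k loading matrix.
    Returns the average fraction of latent variables fit on one random half
    of the samples that have a well-correlated match in the other half."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_repeats):
        cols = rng.permutation(Y.shape[1])
        half = Y.shape[1] // 2
        Z_a = decompose(Y[:, cols[:half]], k)
        Z_b = decompose(Y[:, cols[half:]], k)
        # correlate every loading vector from one half against every one
        # from the other half, and count strong matches
        corr = np.corrcoef(Z_a.T, Z_b.T)[:k, k:]
        rates.append(np.mean(np.abs(corr).max(axis=1) > match_threshold))
    return float(np.mean(rates))

# Sweep k and keep the largest value whose rediscovery rate stays high.
```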
00:10:47.860 So if you do that, you find that you can learn more latent variables or recurrent patterns
00:10:52.900 in the data if you have less heterogeneity in your data given a fixed sample size.
00:10:56.900 I don’t think this is going to shock anyone, right?
00:11:00.420 If I give you a limited amount of data, you would like that data to be as consistent as
00:11:03.900 possible except for the thing you’d like to vary, right?
00:11:06.620 That’s usually a good place to live.
00:11:10.060 So that’s what we see here, right?
00:11:11.340 You get more latent variables out of the lupus data than the subsampled recount2 data, but
00:11:15.740 if I told you, well, you can have really messy data, but you can have a lot of it, now you
00:11:19.460 can learn a lot more patterns.
00:11:21.080 So you end up with many fold more patterns that you can learn, the kind of recurrent
00:11:25.500 patterns across the data set if you have more data.
00:11:27.620 So you’d rather have less heterogeneity, but sometimes having more samples can overcome
00:11:32.780 the heterogeneity issue.
00:11:37.300 So that’s the first thing we ask.
00:11:38.300 So you get kind of more total patterns.
00:11:40.060 The next thing we can ask is, okay, well, we don’t know what complete collection of
00:11:47.440 processes are transcriptionally co-regulated, right?
00:11:50.060 This is not something we know a priori.
00:11:51.460 We can get a collection of processes if we go to Gene Ontology or KEGG or other databases,
00:11:55.940 but some of those may not be transcriptionally co-regulated.
00:11:58.060 However, if we’re seeing a process that’s coming out as transcriptionally co-regulated,
00:12:02.620 that’s probably a positive hit.
00:12:04.860 And so that’s kind of the assumption we’re making here.
00:12:06.700 So this is looking at both the SLE and the ReCount2 data again.
00:12:10.820 This axis is what fraction of the pathways that we know about are coming back as sort
00:12:16.580 of aligned with one or more axes in the data.
00:12:20.580 And so you can see this actually wasn’t driven by composition of the data, this was driven
00:12:24.660 by sample size.
00:12:25.660 So if you put in more data, you can learn kind of more transcriptional co-regulation.
00:12:30.580 This obscures a little bit of what’s going on here because the processes are a little
00:12:34.060 bit different.
00:12:36.460 If you look at what happens in the recount2 data, as you get to the
00:12:41.740 large sample size, you end up learning more granular processes.
00:12:45.020 So it seems like it’s probably that you’re covering the same things that are transcriptionally
00:12:48.780 co-regulated, you just get a higher level of resolution.
00:12:52.740 And then, so you kind of learn more of the stuff we know we should know about.
00:12:56.780 So at this point, anything I’m showing you, you could also do with GSEA or any of these
00:13:00.580 other methods, gene set enrichment analysis, or those types of methods.
00:13:03.820 But we can also ask, can we learn anything we didn’t already know going in?
00:13:07.620 So this is asking what fraction of the latent variables that are coming back are not associated
00:13:14.620 with a pathway.
00:13:17.100 And what you can see is, when you have the sort of more modest sample sizes, about half
00:13:23.140 of the latent variables that come back were associated with a pathway, and the other
00:13:25.700 half are potentially novel.
00:13:27.580 Could be novel biology, it could be a technical artifact.
00:13:31.460 What you see when you get the large collection of recount2 data is, you know, that fraction
00:13:35.420 drops to about 20%.
00:13:36.420 So 80% of the latent variables that are coming back didn’t exist in the databases that we
00:13:40.300 had available to us going into it.
00:13:42.100 So that’s a really nice thing to have in your back pocket if you want to say, well, look,
00:13:46.780 I want to explore the biology of what’s going on in this disease, but I don’t want to limit
00:13:50.580 myself to what people have curated into a database.
00:13:53.020 So this is kind of a data-driven way to figure out what those modules could be.
00:13:58.380 And you might also say, well, there’s probably just an enormous amount of technical artifacts.
00:14:01.700 You just told me you gathered a whole bunch of random human data from the internet.
00:14:05.360 One of the positive things that we saw that was kind of suggestive that this is not driven
00:14:09.140 exclusively by technical artifacts is that this proportion actually bumps up a little bit when
00:14:13.900 you look at the recount2 data, as opposed to the SLE data.
00:14:17.140 If it were entirely driven by technical artifacts, you’d actually expect the recount2 data to have fewer latent
00:14:21.460 variables that were associated with a known process.
00:14:25.420 So this was also encouraging.
00:14:28.300 So this gives us the idea that we kind of learn more unknown unknowns.
00:14:31.980 And then there’s quite a bit more of a deep dive in the paper, like looking at sort of
00:14:36.380 one of the things that we see is instead of having a cell cycle latent variable, you end
00:14:39.540 up with different phases of the cell cycle partitioned into latent variables.
00:14:43.660 But the sort of takeaway was that if you do these machine learning analyses while reusing
00:14:49.640 data from other contexts, you can get this level of detail that you couldn’t get just
00:14:54.040 analyzing your data alone.
00:14:55.840 So starting with a whole bunch of other data, learning the pathways and processes there,
00:14:59.520 and then applying it to your data gives you this higher level of resolution.
00:15:02.800 There is a bit of an implicit assumption here, which is if the process that you were looking
00:15:07.040 at is truly unique and only occurs in your setting and no other settings, you can’t find
00:15:12.240 it because it’s not going to be present in the variability from other people’s data.
00:15:16.920 I think this is probably rare.
00:15:19.560 I don’t think it’s terribly common that there are processes that are so exclusive that they’re
00:15:23.400 only used in one and only one biological context and nowhere else.
00:15:27.120 But if you believe that to be the case, you should know you will not find it with this
00:15:29.800 method.
00:15:32.800 And so then just kind of recapitulating.
00:15:34.480 So in the past, when Jackie joined the group, she had this modular framework approach, which
00:15:39.680 is actually really nice and has been used in scleroderma and other contexts
00:15:46.280 to connect pathways across tissues and studies.
00:15:50.240 But the MultiPLIER approach she developed has some nice advantages.
00:15:53.480 So she takes this generic human data, recount2, she can train a model, then transfer
00:15:57.520 it to the datasets of interest, and then just look across those datasets with standard statistics.
00:16:02.320 So this is an example of one of the things we can do with this.
00:16:06.080 So this was actually the thing we wanted to do when we started the study.
00:16:10.040 So these are three different datasets from individuals with ANCA-associated vasculitis.
00:16:15.200 One of the challenges here is that all of these datasets are microarray-based.
00:16:19.840 All of our training data is RNA-seq.
00:16:22.560 A different student in the lab developed a technique for this: if you’re interested in taking
00:16:27.240 sort of machine learning methods and applying them to gene expression data across these
00:16:33.920 contexts, for many methods, there are reasonable ways to do that transformation that are not
00:16:41.040 completely horrendous, which is the best advertisement I can give for a method.
00:16:47.280 But there are quite a few different methods.
00:16:48.280 And actually, quantile normalization is not bad in this context.
00:16:51.760 The zeros kind of give you a bit of trouble, but it’s not horrendous.
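As an illustration of what that transformation can look like, here is a minimal quantile-normalization sketch that maps microarray values onto the distribution of an RNA-seq reference. This is the generic idea only, not the specific method developed in the lab, and the names are mine.

```python
import numpy as np

def quantile_normalize_to_reference(target, reference):
    """Map each sample (column) of `target` onto the empirical distribution
    of `reference` values via a rank-to-quantile lookup.

    target:    genes x samples array (e.g., microarray intensities)
    reference: 1-D array defining the target distribution (e.g., pooled
               RNA-seq expression values for the same genes)
    """
    ref_sorted = np.sort(reference)
    out = np.empty_like(target, dtype=float)
    n_ref = len(ref_sorted)
    n_genes = target.shape[0]
    for j in range(target.shape[1]):
        ranks = target[:, j].argsort().argsort()      # 0 .. n_genes-1
        quantiles = ranks / max(n_genes - 1, 1)        # map ranks to [0, 1]
        idx = np.round(quantiles * (n_ref - 1)).astype(int)
        out[:, j] = ref_sorted[idx]
    return out
```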
00:16:55.400 And so this is what we’re doing here.
00:16:57.680 So we’re actually asking, can the MultiPLIER model actually apply to array datasets, even
00:17:03.200 though it’s trained exclusively on RNA-seq data?
00:17:05.000 We would have done this in RNA-seq data, but it turned out there wasn’t RNA-seq data for
00:17:08.920 this that was available yet.
00:17:10.840 So now we’ve gotten to the point where the datasets for this disease are large enough
00:17:13.040 that they actually do exist in RNA-seq as well.
00:17:16.440 So what’s going on here?
00:17:18.080 So we’ve got three different datasets, airway epithelial cells, renal glomeruli, and PBMCs.
00:17:23.160 They’re collected in three different studies, so there’s no matched people.
00:17:26.800 And also the conditions are brutally different.
00:17:28.960 So what’s going to happen is, from the left side to the right side of each of these plots,
00:17:32.640 we’re going to go from the most severe form of the disease to sort of
00:17:36.360 the least severe form of the disease or healthy controls.
00:17:39.360 So this dataset for the airway epithelial cells has the vasculitis data,
00:17:45.320 but then it’s also got things like rhinitis and healthy controls.
00:17:49.280 And so we’re basically saying, okay, what is associated with severity across these three
00:17:55.640 different cohorts?
00:17:58.160 And so one of the latent variables that comes up as severity-associated is this M0 macrophage
00:18:03.040 signature.
00:18:04.360 And you can see the same thing where in each group you look in, the least severe form of
00:18:09.640 the disease is on the right, the more severe form of the disease is on the left.
00:18:12.280 So you can see this latent variable is severity-associated.
00:18:17.440 So our guess was that M0 macrophages could be involved here.
00:18:20.040 Due to the bizarre sort of path of academic publishing,
00:18:24.280 well, before this actually was published, but after the preprint came out, we had some
00:18:29.000 follow-up work.
00:18:31.640 And I have to break all the chronology of science.
00:18:34.280 Our follow-up work came out first demonstrating that it looked like there was a change in
00:18:38.400 macrophage metabolism in the disease that could be sort of influencing severity in a
00:18:43.240 systemic way.
00:18:44.880 You can also use this type of analysis to say what’s particular to a tissue, right?
00:18:48.400 So you could say what latent variables are associated with severity in this tissue, but
00:18:52.560 not other tissues.
00:18:53.720 So it gives you the ability to start doing those analyses in a way that it’s pretty darn
00:18:56.920 difficult to do with just the modular framework alone.
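One simple way to do that kind of per-tissue severity analysis is sketched below as a rank correlation of each latent variable against an ordinal severity score within each cohort. The actual tests in the paper differ; this is just the shape of the analysis, with illustrative names.

```python
from scipy.stats import spearmanr

def severity_associations(B, severity, lv_names):
    """B: latent-variable-by-sample matrix for one cohort (e.g., projected
    latent variable values). severity: ordinal score per sample, e.g.,
    control=0, limited disease=1, severe disease=2.
    Returns {latent variable name: (rho, p-value)}."""
    results = {}
    for name, values in zip(lv_names, B):
        rho, p = spearmanr(values, severity)
        results[name] = (rho, p)
    return results

# Run this per cohort (airway epithelium, glomeruli, PBMCs) and compare
# which latent variables track severity in every tissue versus only one.
```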
00:19:03.760 And then I have an almost five-year-old, she turns five in three weeks, and she’s been
00:19:08.560 watching Zootopia.
00:19:10.680 And there’s a line in a song in Zootopia where I was listening, I’m like, oh my gosh, this
00:19:15.040 is science.
00:19:18.760 So the line is, I’ll keep on making those new mistakes.
00:19:21.680 I’ll keep on making them every day, those new mistakes.
00:19:25.120 And so we’re really big on this in the lab, right?
00:19:27.840 But what I tell people is, it’s not going to work, just make it not work differently
00:19:33.160 each time.
00:19:34.160 If it’s not working the same way each time, that’s not good, but if it’s not working for
00:19:37.240 different reasons, that’s perfect.
00:19:40.080 And so we do this in our own work.
00:19:41.700 So this is the first part of the GitHub README that’s associated with this paper.
00:19:47.660 So if you want to know sort of, these are all notebooks, if you want to follow along
00:19:51.480 with the work that we did for this multiplier paper.
00:19:54.360 The first part of this is kind of our proof of concept exploration to just understand
00:19:57.720 how the method worked.
00:19:58.720 Then you get to the stuff that’s in the paper, then you get to the stuff that’s in the supplement,
00:20:03.280 then you get to the stuff that’s neither in the paper nor the supplement, because it turned
00:20:06.080 out the paper was too long.
00:20:08.080 And so this gives you a way to see like, okay, here’s all the stuff we did.
00:20:12.480 So there was one experiment that we did where we wanted to say, can you predict outcome
00:20:15.640 in clinical trials from these latent variables?
00:20:18.220 And so we got this Rituximab data set from the NIH that was testing this.
00:20:23.000 It turned out that the data set structure was, let’s say, suboptimal, in that some of
00:20:27.800 it was paired end and some of it was not paired end sequencing.
00:20:31.320 And this was confounded with the endpoint.
00:20:33.880 So it turned out to be extremely difficult to analyze, and we couldn’t really learn anything
00:20:37.800 from it.
00:20:38.800 But if you’re interested in using that idea for your own work (probably not that data set,
00:20:43.160 but the idea),
00:20:44.160 we’ve got a notebook here that’s like, okay, here was our attempt to build
00:20:48.040 a model to predict response.
00:20:49.240 So you can start from that.
00:20:50.480 So if you’re interested in this, we try to do this for each of our papers.
00:20:55.840 So this is available.
00:20:56.840 The GitHub is here.
00:20:58.400 If you search for Taroni and MultiPLIER, you’ll probably find it.
00:21:03.080 But I thought this was a nice example of kind of how we’ve taken a project from inception
00:21:07.960 through execution through kind of deliverables.
00:21:11.000 This method, we’ve seen some other uses now.
00:21:13.640 So someone used the same thing to study neurofibromatosis.
00:21:17.600 That came out relatively recently.
00:21:19.120 I can’t remember, there’s a few other sort of rare disease analyses that people have
00:21:23.120 started using this for.
00:21:24.120 But we really like seeing that, right?
00:21:25.120 Because it demonstrates uptake, and, I mean, rare disease transcriptomics is
00:21:31.120 a relatively small community.
00:21:33.800 So it’s nice to see this stuff beginning to catch on.
00:21:37.880 I would also say, you know, we started with about $70 million worth of data.
00:21:44.240 If you happen to have an internet connection, you can have
00:21:48.280 about $4 billion worth of data at your fingertips.
00:21:51.600 So if you’re interested, there’s a few more resources that have come online.
00:21:55.800 I think ARCHS4 now has something like 650,000 samples.
00:22:00.040 So that’s, you know, if you want to estimate it, about $650 million of preprocessed data
00:22:04.120 at your fingertips.
00:22:05.960 In a previous position, we built something called refine.bio that’s about a million
00:22:08.920 samples.
00:22:11.240 So these types of resources are available, which is great, because then you don’t have
00:22:14.120 to go back and rebuild this.
00:22:15.680 You don’t have to do all the software engineering to reprocess the data in a uniform way.
00:22:18.960 You just kind of start from the processed data.
00:22:21.560 And I think this opens up a lot of avenues of exploration.
00:22:26.600 I like to, you know, one of the things that I say about what our lab works on is machine
00:22:31.880 learning, public data, and the transcriptome.
00:22:33.680 Pick two of three, and we’re probably interested.
00:22:37.040 Vannevar Bush essentially wrote and designed the way that we fund science in this country.
00:22:42.840 So this idea that most science is going to happen outside of government research labs,
00:22:46.280 it’s mostly going to happen at universities, it’s mostly going to be grant funded.
00:22:49.800 He wrote this letter to FDR that says, “The pioneer spirit is still vigorous within this
00:22:54.080 nation. Science offers a largely unexplored hinterland for the pioneer who has the tools
00:22:58.080 for his task.”
00:22:59.080 Well, I would say, with open data I think the opportunities here are remarkable:
00:23:05.840 you can take these data sets off the shelf and learn how something works
00:23:11.520 at a scale that’s very difficult to do from the data generated in only one lab.
00:23:15.760 And once you do that, you can then test it.
00:23:17.360 And I really think of using other people’s data as sort of the starting point to generate
00:23:20.920 hypotheses that you then go test.
00:23:24.080 There’s an enormous amount of unexplored opportunity here.
00:23:28.080 We also think sometimes about other data types instead of just gene expression.
00:23:31.640 So this is work from David Nicholson, who was a PhD student in the lab who just graduated
00:23:35.840 last year, who was like, well, let’s do that.
00:23:38.120 I just want to understand what’s on bioRxiv anyway.
00:23:41.960 So at this point, I probably don’t have to introduce it, bioRxiv is a preprint server.
00:23:46.760 And so this gives us the ability to also study the peer review process in some ways.
00:23:50.560 So we can see what gets posted to bioRxiv, and then we can look at the sort of what the
00:23:54.040 final paper looks like.
00:23:56.160 We started this project just around the time that bioRxiv released an XML repository of
00:24:03.240 their complete collection of data.
00:24:04.920 So if you’re interested in not just having a complete collection of transcriptomic data,
00:24:08.120 you can also go get a complete collection of XML preprints, which I think is really
00:24:11.640 exciting and a lot of fun.
00:24:15.240 You learn some things if you start looking at just the metadata associated with this.
00:24:18.800 So one of the simple questions that David asked was just, well, if there are preprints
00:24:25.160 with multiple versions, are people sort of adjusting their preprint in response to peer
00:24:29.160 review?
00:24:30.280 So if someone submits their paper, they get comments back, do they generally repost it?
00:24:34.040 We can’t directly answer that question, we don’t have access to the journal system.
00:24:36.920 But we could say, well, if that were happening, probably what would happen is that
00:24:41.520 with each additional version, you’d see an extension in the time to publish.
00:24:45.120 Sure enough, you see that.
00:24:46.120 And actually, the coefficient on the X here is about 50, which is in days.
00:24:53.080 So it suggests that adding a version means sort of 50 days longer in the publication
00:24:57.000 process, which is kind of aligned with what you’d expect to see if people are putting
00:25:00.560 up papers and revising them in response to peer review.
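For intuition, the fit being described is just an ordinary regression of time to publication on the number of preprint versions. The numbers below are made up purely for illustration; the roughly 50-days-per-version slope quoted in the talk comes from the real bioRxiv data.

```python
import numpy as np

# Hypothetical example data: number of preprint versions and days from
# first posting to journal publication.
n_versions = np.array([1, 1, 2, 2, 3, 3, 4, 5])
days_to_publish = np.array([140, 170, 200, 230, 260, 300, 330, 400])

slope, intercept = np.polyfit(n_versions, days_to_publish, deg=1)
print(f"~{slope:.0f} extra days in the publication process per added version")
```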
00:25:06.520 Another thing we can ask, so this is getting into the text itself, is:
00:25:13.480 do more changes in text between the preprint and the published version
00:25:18.320 come with a longer time to publish?
00:25:20.640 And the answer to that is kind of yes-ish.
00:25:23.400 If a published version changes more from the preprint,
00:25:28.560 it does take a bit longer to publish.
00:25:31.000 But it’s not an incredibly substantial change.
00:25:33.840 And actually, the other thing that we did, so as we were doing this project, there was
00:25:36.560 another group that did a completely different project, where they took a set of COVID papers
00:25:42.420 that were published first as preprints and followed them through.
00:25:45.480 Their scientific question was different.
00:25:46.800 They wanted to say, for COVID-related papers, does the message of the paper change as it
00:25:52.080 goes through the publication process?
00:25:53.080 And they found that only in one case did that happen out of the 300-odd papers that they
00:25:56.480 examined.
00:25:57.480 But what that gave us was an annotated list of COVID papers.
00:26:00.100 So we could then take that and ask if it had the same relationship, and it actually didn’t
00:26:03.280 have the same relationship.
00:26:04.800 So for the subset of the literature in early 2020, COVID papers were being published quickly
00:26:10.460 regardless of how much text changed between the preprint and published version.
00:26:14.400 So this was kind of an interesting way to explore how publishing was happening.
00:26:18.280 So for those of you who have had the opportunity to have papers go through peer review, can
00:26:24.080 you guess what the most common linguistic change is, if we just look at word-level linguistic
00:26:28.800 change during the publishing process?
00:26:39.200 Has anyone ever had to add supplementary or additional data?
00:26:43.000 It’s not the most common.
00:26:44.480 The most common is actually, oh, no, it is the most common, yeah.
00:26:48.280 So additional and file.
00:26:49.280 So on the right here is what’s enriched in the published literature, and the left is
00:26:53.320 what’s enriched in preprints.
00:26:55.280 So file and additional and supplementary are all pretty high at the top.
00:26:58.800 So when people are changing their papers, we can infer that probably they’re often changing
00:27:04.280 the, you know, they’re adding stuff to the supplement, but maybe they’re not adding that
00:27:08.040 much to the main paper.
00:27:09.640 The other stuff that’s in there is kind of interesting, like fig versus figure,
00:27:12.560 because journals have different styles, and things like the plus-minus symbol and the em dash.
00:27:16.640 So you can see the artifacts of typesetting, but this gives a way to kind of understand
00:27:22.960 what’s on each side.
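A rough sketch of how you can get this kind of word-level comparison: a smoothed log-odds ratio of word frequencies between the preprint corpus and the published corpus. This is in the spirit of, but simpler than, the analysis in the paper.

```python
import math
from collections import Counter

def log_odds_enrichment(preprint_tokens, published_tokens, alpha=0.5):
    """Smoothed log-odds of each word: positive scores are enriched in the
    published versions, negative scores in the preprints."""
    pre, pub = Counter(preprint_tokens), Counter(published_tokens)
    n_pre, n_pub = sum(pre.values()), sum(pub.values())
    vocab = set(pre) | set(pub)
    scores = {}
    for w in vocab:
        p_pub = (pub[w] + alpha) / (n_pub + alpha * len(vocab))
        p_pre = (pre[w] + alpha) / (n_pre + alpha * len(vocab))
        scores[w] = math.log(p_pub / p_pre)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```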
00:27:23.960 And this, I should say, is an analysis using only preprint-published
00:27:28.360 pairs.
00:27:29.360 If you do the same thing with all of bioRxiv and all of PubMed, you essentially find field
00:27:32.760 differences.
00:27:33.760 So some fields use bioRxiv more than others.
00:27:35.400 So this is more carefully controlled than that.
00:27:39.040 The other thing that we’ve done, just if you happen to have a preprint yourself, we have
00:27:42.960 this web server that does a linguistic comparison between a selected preprint and all of PubMed
00:27:51.720 and says, okay, well, here’s journals that publish linguistically similar papers.
00:27:57.320 Here’s papers that are linguistically similar to yours, and this has a secret feature.
00:28:01.640 So we designed it to try to get people to upload
00:28:04.400 their preprint, but then people are like, well, I have a preprint, but it’s on arXiv
00:28:06.920 and you don’t support arXiv.
00:28:08.880 So the secret feature is that you can also drag a PDF over the search box if you want
00:28:11.720 to, but we don’t generally advertise that because the goal was to get people to post
00:28:18.880 preprints so they could use the service, and we don’t support arXiv.
00:28:21.560 So if you have an arXiv preprint, we’ll allow you to drag it over the search
00:28:25.760 box, but no other PDFs.
00:28:30.520 So this came out last year, and there’s, again, a GitHub associated with it if you want to
00:28:34.800 see all the kind of exploration that we did on the way.
00:28:38.480 David had a follow-up paper that I’m really excited about, where he looked
00:28:44.480 at what words changed their meaning over time in the last 20 years of scientific
00:28:51.200 publishing.
00:28:52.200 So there’s an application associated with that as well, that I should have put the link
00:28:55.520 to on here but didn’t, that we call Word Lapse.
00:28:57.800 So if you go to our lab and look for Word Lapse, you’ll find it, and it’s really interesting.
00:29:02.640 So we see things like hallmarks of new technologies, like, you know, CRISPR has a linguistic shift.
00:29:07.520 We also see a lot of pandemic-associated words have linguistic shifts.
00:29:13.080 So if you’re interested in understanding how our language changes, that’s also something
00:29:16.760 that David did.
00:29:19.040 Okay, and then I know this is a less medically-related audience than most of the places that I speak,
00:29:28.360 but one of the things that I thought I wanted to share was sort of how some of this basic
00:29:34.000 science or sort of the techniques that we develop in this basic science can contribute
00:29:37.120 to changes in how healthcare gets delivered.
00:29:40.040 And so this is also something that we think about, right?
00:29:43.480 Remember our business is serendipity.
00:29:44.960 Yes, sometimes that’s in research, right?
00:29:47.080 Whether that’s sort of me telling you how papers change, so that you can think about
00:29:50.560 how you would change your paper in response to peer review, just add more additional files.
00:29:55.160 But sometimes that’s, you know, in care, in clinical care, right?
00:29:59.200 So that someone, you know, you can imagine a patient comes in, there might be reasons
00:30:02.720 that that patient might need to receive a different treatment.
00:30:04.720 Could we provide that kind of information at the point of care?
00:30:07.920 So this is a big focus at our med school and our health system that’s associated with our
00:30:15.000 med school, University of Colorado Health, has an entire program in clinical intelligence.
00:30:21.440 This is sort of the idea that I like to highlight as sort of serendipity is like the right moment
00:30:27.000 at the right time to make the right decision.
00:30:30.440 In most health systems, there’s something called pharmacogenomic genetic
00:30:35.600 testing.
00:30:36.600 So the idea here is that people have different variants in their genome.
00:30:39.580 Some of those variants can affect how you metabolize drugs, how you respond to different
00:30:42.640 drugs.
00:30:44.280 It’s not terribly common that people get tested for pharmacogenomic variants. But say you
00:30:49.600 are going to a hospital and, you know, you need to have a stent inserted;
00:30:54.800 one of the common treatments is Plavix. Well, there’s an interaction between
00:30:59.080 Plavix and a set of genetic variants where you metabolize the
00:31:02.760 drug differently, and it doesn’t work for you, which means you’re not getting the benefit
00:31:05.680 of reducing your heart attack risk, or your clot risk.
00:31:09.800 But most people don’t get this testing, because if a physician orders the testing,
00:31:14.560 they get a 70-page PDF back; then they have to take that 70-page PDF and go to a table
00:31:18.840 like this, read everything on the table related to the drug they’re about to prescribe,
00:31:22.560 and then understand if it applies, right?
00:31:25.360 That is not a common thing for a physician to do.
00:31:28.160 Providers don’t get reimbursed for that type of work.
00:31:31.480 What’s happening at the University of Colorado and UC Health is we’ve got clinical decision
00:31:35.320 support built into the electronic health record around this stuff.
00:31:38.280 So this is the same thing, except instead of a 70 page PDF, plus having to look at this
00:31:42.640 table, if a provider were to go in and order Plavix for an individual who’s not going to
00:31:48.640 benefit from it, it pops up an alert that says, look, we recommend you remove this because
00:31:52.160 it’s not going to work, and read about why it’s not going to work if you want to, but
00:31:56.000 we recommend you apply one of these alternatives that will work for this patient.
00:31:59.200 And so this is serendipity, but not just in research, in clinical care.
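Conceptually, the alert is a structured pharmacogenomic result plus a rule attached to the ordering workflow. Below is a toy sketch of that kind of rule; it is not UCHealth’s implementation, and the specific gene-drug pairing is just the widely discussed clopidogrel and CYP2C19 example, included for illustration only.

```python
# Map (drug, phenotype) pairs to suggested alternatives. Illustrative only;
# real systems drive this from curated guidelines and structured lab results.
ALTERNATIVES = {
    ("clopidogrel", "CYP2C19 poor metabolizer"): ["prasugrel", "ticagrelor"],
}

def check_order(drug, patient_phenotypes):
    """Return an alert message if the ordered drug interacts with a
    pharmacogenomic result already in the patient's record."""
    for phenotype in patient_phenotypes:
        alternatives = ALTERNATIVES.get((drug.lower(), phenotype))
        if alternatives:
            return (f"{drug} may not work for a patient who is a {phenotype}; "
                    f"consider: {', '.join(alternatives)}")
    return None

print(check_order("Clopidogrel", ["CYP2C19 poor metabolizer"]))
```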
00:32:04.520 And this is, if you’re interested in this kind of story, this is another one.
00:32:08.400 This was an individual, different condition, where there was a question about drug efficacy.
00:32:12.920 This is a story from UC Health.
00:32:16.120 And in this case, the provider keyed in an order, and an alert popped up that says, oh,
00:32:20.480 this person’s going to need a different dose.
00:32:23.280 And that was helpful to the provider to make that decision.
00:32:26.640 One of the things I’ve had the privilege of doing over the last couple of years is focusing
00:32:31.400 on this program.
00:32:32.400 So a lot of faculty in our department work in this program, and about a year and a half
00:32:37.000 ago, I guess two years ago, the previous director left, and so I ended up as the interim director.
00:32:42.480 So I’ve gotten to know this program pretty well.
00:32:44.920 So this is our Colorado Center for Personalized Medicine.
00:32:48.080 We have a biobank study that this is all tied to.
00:32:50.800 So if someone comes in, they can consent to have their sample collected for the biobank.
00:32:55.080 We have a robust return of results pipeline built on that.
00:32:57.920 So our biobank is growing pretty rapidly.
00:33:00.600 The rate at which we’re adding samples is picking up a lot.
00:33:02.720 But then the other thing we ask is, how is this making a difference in care?
00:33:05.800 So essentially, how many of these alerts are actually firing?
00:33:08.560 So over the last year, we’ve had about 1,000 patients who’ve had an alert fire at some
00:33:14.320 point in clinical care.
00:33:16.360 That’s a tenfold increase over our entire previous history.
00:33:20.200 And that’s been powered because we’ve recently focused on getting these results back into
00:33:26.280 the EHR in a structured way.
00:33:27.560 So we’ve seen almost 100-fold growth, actually more than 100-fold growth year over year.
00:33:31.800 We’ll probably have 210,000 results in the electronic health record at sort of the two-year
00:33:35.800 mark.
00:33:37.080 And what this means is that if you’re interested in studying this type of process in terms
00:33:42.080 of care delivery, if you’re interested in studying how physicians respond, if you’re
00:33:45.600 interested in looking for new cases where there are these sort of drug-gene interactions,
00:33:50.560 we have the ingredients to do that at Colorado in a way that no one else that I’m aware of
00:33:54.240 does.
00:33:55.720 And so this program continues to grow.
00:33:58.200 I’ll just give you one.
00:33:59.240 This is actually a real story that happened over the last few months.
00:34:03.320 So this is a stock photo.
00:34:04.320 I cannot show you a picture of the patient, but a patient came in to a community oncology
00:34:09.280 clinic.
00:34:10.480 And this works across the entire UC health system.
00:34:12.160 So it’s not just an academic hospital.
00:34:13.680 This is a major health system that serves the Mountain West.
00:34:17.360 So this patient came into a community oncology clinic.
00:34:20.280 They were prescribed a drug that based on their genetic variants would have created
00:34:27.480 a significant risk of life-threatening complications.
00:34:33.520 Our team noticed this, sent a message to the provider, and then the patient alert actually
00:34:41.400 fired and recommended a reduced dosage of the drug.
00:34:45.000 The oncologist actually did proactively reduce the dose.
00:34:48.000 So the person started at a different dose than would traditionally be used.
00:34:53.680 Even at that dose, they didn’t tolerate it very well.
00:34:55.600 So they had to further reduce the dose.
00:34:57.840 In these types of cases, you can imagine what happens if you start at
00:35:01.200 the traditional dose: for individuals with this particular
00:35:05.920 variant, these drugs at that dose can be lethal.
00:35:07.720 And so this is a case where, you know, yes, I told you there’s 1,000 alerts, but each
00:35:11.680 of those 1,000 alerts is some story like this, right?
00:35:14.360 And so it’s nice to see this actually being used to deliver care at scale.
00:35:19.640 And so we’re doing this, and this is all informatics, right?
00:35:23.040 You can get all of this serendipity, and none of what I’ve shown here has machine learning
00:35:27.160 built into it yet, but it’s going to.
00:35:30.480 And as we think about that, I think it’s really important not just to sort of think
00:35:33.920 from the machine learning point of view, but to really think about practical clinical care
00:35:37.360 pathways.
00:35:38.360 So this is a piece from Siddhartha Mukherjee that sort of, if you’re interested in AI and
00:35:44.160 medicine, I realize this is dated now, but it’s still worth reading.
00:35:49.040 And it’s also weird that five years is old, but it’s still worth reading.
00:35:54.320 It has a quote from Geoffrey Hinton, sort of says they should stop training radiologists
00:35:57.800 right now.
00:36:00.440 And why would someone say this, right?
00:36:03.280 Well, so Geoff Hinton’s looking at the literature, right?
00:36:05.480 So they’re just trying to collect some literature from around the same time.
00:36:08.760 So this is sort of saying, look, deep learning is going to completely transform healthcare.
00:36:11.680 It’s going to change how we care as we know it.
00:36:14.840 Another sort of similar example, more examples, everything you read in the literature, deep
00:36:20.000 learning.
00:36:21.000 Like, I mean, now we’re all into large language models, but at the time these image models
00:36:23.760 were going to completely transform healthcare.
00:36:25.600 Well, you might ask, is there anything they’re not good at?
00:36:29.080 Like they’re good at everything, right?
00:36:31.440 Chihuahuas and blueberry muffins, not terribly good here.
00:36:37.000 This one’s kind of wild.
00:36:42.520 So I perceive the thing on the left as a panda.
00:36:44.840 It looks like a picture of a panda.
00:36:48.160 I don’t really perceive this as much of anything.
00:36:51.640 We’re going to add, you know, this plus seven thousandths of this.
00:36:56.520 What do you think the output is going to look like?
00:37:02.120 A sloth.
00:37:03.120 A sloth?
00:37:04.120 Any Gibbons?
00:37:05.120 Anyone for Gibbon?
00:37:06.120 A bear.
00:37:07.120 A bear?
00:37:08.120 Okay.
00:37:09.120 So this is what it looks like.
00:37:10.120 It’s a Gibbon.
00:37:11.120 Clearly a Gibbon.
00:37:12.120 No question.
00:37:13.120 That’s a Gibbon.
00:37:14.120 A monkey.
00:37:15.120 Okay.
00:37:16.120 So it doesn’t look like a Gibbon to me, but our neural network is exceedingly convinced.
00:37:26.080 And the reason this works, right, is because the decision boundaries in these neural networks
00:37:29.200 are sort of nonlinear, and you can end up pretty close to a decision boundary without
00:37:32.360 really knowing it.
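The panda-plus-tiny-perturbation example comes from the fast gradient sign method, where you nudge the image in exactly the direction that most increases the classifier’s loss. A minimal PyTorch sketch of that construction is below, assuming some generic image classifier; it is not the code behind the specific figure shown.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, label, eps=0.007):
    """Return an adversarially perturbed copy of image batch x.
    The perturbation is eps * sign(gradient of loss w.r.t. the input):
    tiny per pixel, but aimed straight across the decision boundary."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```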
00:37:34.200 This is another example, which I think is just a lot of fun, so I have to throw it in.
00:37:38.160 So this is someone who was like, well, can I just make adversarial, like, sticker?
00:37:43.040 Like, can I have a sticker that I can put on something and have a deep neural network
00:37:46.960 perceive it as something, even if it’s not?
00:37:49.360 So this is the toaster sticker.
00:37:52.600 And so you can put the toaster sticker on a table next to a banana.
00:37:56.960 I perceive this to be a toaster sticker on a table next to a banana,
00:38:01.480 and I perceive this to be a banana on a table.
00:38:05.320 Neural network classifier, banana on a table.
00:38:07.240 It’s good.
00:38:08.240 Stick the toaster sticker next to it?
00:38:09.960 Absolutely a toaster.
00:38:10.960 You can imagine doing the same things with stop signs, you can imagine doing the same
00:38:15.520 things with other fun technologies in the world of self-driving cars or deep neural
00:38:21.120 networks or vision.
00:38:22.120 Okay, so let’s go back to the automated radiologist finding pneumonia.
00:38:26.200 This was one of the examples I showed you before.
00:38:29.200 John Zech is a guy who writes blog posts that are really good and then converts
00:38:34.720 them into papers.
00:38:36.600 So this was a blog post from John Zech, which then became a paper that sort of said, okay,
00:38:41.480 why is the system actually working?
00:38:43.120 Like what’s happening here?
00:38:45.200 So he went to the images and tried to understand, you know, what part of the
00:38:51.080 image is contributing to the prediction of pneumonia?
00:38:53.920 Well, in this case it’s positive, and it probably should find pneumonia.
00:38:58.480 There’s high density in the lung here.
00:39:01.760 You can say, well, okay.
00:39:02.760 Oh, and I should say a positive number is suggestive of pneumonia, a negative number
00:39:09.240 is not.
00:39:10.240 Okay.
00:39:11.240 So what are the positive numbers?
00:39:15.360 The most positive numbers are at the bottom, yeah, where there’s this interesting stripe, the
00:39:22.720 stripe that’s kind of a unique characteristic of the scanner, the one that you might see
00:39:26.800 if you were in a department where the scanner was placed proximal to sort of where pneumonia
00:39:31.440 diagnoses were usually occurring.
00:39:33.840 Yeah.
00:39:34.840 So, so, so that’s interesting.
00:39:38.240 Here’s another one.
00:39:40.800 I should say these probabilities are low, but the previous one was in the 99th percentile
00:39:44.320 for pneumonia.
00:39:45.320 This one’s in the 95th percentile.
00:39:47.320 Yeah.
00:39:48.320 Portable?
00:39:49.320 Portable.
00:39:50.320 Someone’s not well enough to go to the scanner.
00:39:52.600 Not a good sign for their health, could indicate pneumonia.
00:39:55.860 Probably not the part of the image you want to be looking at.
00:39:59.320 And so, you know, every once in a while, I
00:40:05.200 read the comment section of blog posts on the internet, which I do not recommend, but
00:40:08.960 there was one on big data.
00:40:09.960 There was this blog post on big data that was like, why statistics
00:40:14.320 don’t matter in the era of big data, or something like that.
00:40:16.800 And I’m like, okay, I have some disagreements with this.
00:40:21.040 Then I go to the comment section, I find this one, which I really agree with.
00:40:25.720 On big data, data collection biases are always larger than statistical uncertainty.
00:40:29.680 And I think this is why you can have these models that seem to perform robustly
00:40:33.560 across a lot of these comparisons and still struggle.
00:40:36.200 And then I read who made the post, and it’s this guy named Daniel Himmelstein, which probably
00:40:39.240 doesn’t mean anything to you, but he happened to be a postdoc in my lab.
00:40:42.160 And I’m like, dude, you put this in the comment section of the internet? People should actually
00:40:46.440 read it.
00:40:47.440 So now I put it on slides, so at least someone can see it.
00:40:51.120 Okay, so how do we design systems that work?
00:40:54.600 So you know, for AI and medicine, regardless of what we’re using, I think we need to have
00:40:59.080 some principles that we think about.
00:41:01.520 We just did something that sort of tries to go on this path.
00:41:05.400 So this is a little bit different.
00:41:06.960 So this is a piece from Nature, but it talks about a preprint that we put up earlier this
00:41:11.800 year, where we were trying to use GPT-based models, so in this case, GPT-3, to revise
00:41:17.600 academic manuscripts.
00:41:19.280 One of the things that we see all the time is, well, now it’s really common, right?
00:41:23.440 People are developing services that can use GPT-3 or ChatGPT, or GPT-4 through ChatGPT
00:41:30.240 to revise your manuscript.
00:41:31.240 Well, what are the issues with those?
00:41:33.560 So our experience putting this together and running a bunch of test manuscripts through
00:41:36.360 it is that, yes, it is good at clarifying language.
00:41:40.280 There’s a lot of things it can help you with.
00:41:42.800 It actually caught an error in an equation, which I was pretty darn impressed with.
00:41:47.480 On the other hand, it also makes stuff up.
00:41:50.640 And so a bit of a challenge if your idea is that you’re just going to use this.
00:41:54.480 And I think, you know, I use this as an example because it’s really trivial.
00:41:59.240 You can try it out yourself and see how it works, and you can get these examples yourself.
00:42:03.360 The same things are going to be true when we use this in medical context, so we need
00:42:06.840 to think about them.
00:42:07.840 I think thinking about them and trying them and experimenting in a low-risk environment
00:42:11.440 before we move to a high-risk environment is usually a good idea.
00:42:14.440 So what are the principles that we’ve kind of come up with as we think about this?
00:42:17.440 So first, we really do aim for kind of an augmentation, not replacement.
00:42:22.440 And what that means is, you know, when we apply this to manuscripts and we try to get
00:42:27.760 it to improve it, you know, we actually applied it to the manuscript about the tool.
00:42:32.520 And when we did that, it made up this thing that we had done that we had fine-tuned the
00:42:36.200 model on manuscripts of a similar type.
00:42:38.240 Yeah, you should absolutely fine-tune the model on manuscripts of a similar type.
00:42:41.920 Makes a ton of sense.
00:42:43.200 We didn’t do it.
00:42:44.200 You probably shouldn’t report that you did it in the manuscript.
00:42:48.360 So we’re really thinking like, you know, okay, this is not like you’re going to plug it in,
00:42:52.280 you’re going to be done.
00:42:53.280 It’s really, you need to design it around kind of an augmentation capability.
00:42:56.760 You’ve got to carefully consider your use cases.
00:43:00.000 You know, if it’s easy to take the output but hard to compare it to what went in,
00:43:03.280 that’s probably not good, right?
00:43:04.800 Because you’re creating the opportunity for a mistake that you don’t need to create.
00:43:09.240 We really like to start with these kind of simple solutions and approaches and layer
00:43:11.960 complexity only as needed.
00:43:13.320 So in this case, you know, we start with some pretty simple prompts and then have the ability
00:43:20.080 to add complexity.
00:43:21.280 But usually we just kind of try to keep it relatively basic and simple.
00:43:25.200 The workflow is simple.
00:43:26.640 You can proof of concept it out really quickly.
00:43:28.360 And the most important thing is preserving attribution.
00:43:30.440 Like where did the content come from?
00:43:31.680 If you’re thinking about this in a clinical setting, you know, what was provided and when?
00:43:35.760 And you know, did it come from an AI or a human first?
00:43:38.000 Because that’s really going to matter as you’re thinking about evaluating these workflows.
00:43:41.520 In academic writing, you’re going to want to keep track of whether
00:43:45.200 something came from an AI-based system or whether you wrote it.
00:43:48.960 More and more journals are starting to require this.
00:43:51.600 It’s going to be important.
00:43:53.520 But I just think like these are key principles that I would recommend keeping in mind.
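To make the augmentation-plus-attribution idea concrete, here is a minimal sketch of the kind of workflow being described: revise one paragraph at a time with a constrained prompt, and keep the original text and the AI suggestion side by side so a person decides what to accept. `call_llm` is a placeholder for whatever model client you use, not a real API.

```python
# Sketch only: the structure matters more than the specific model client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

PROMPT = ("Revise the following manuscript paragraph for clarity and grammar. "
          "Do not add facts, citations, or results:\n\n{text}")

def revise_manuscript(paragraphs):
    revisions = []
    for i, original in enumerate(paragraphs):
        suggestion = call_llm(PROMPT.format(text=original))
        revisions.append({
            "paragraph": i,
            "original": original,         # human-written text, kept verbatim
            "ai_suggestion": suggestion,  # attributed to the model
            "accepted": False,            # a person decides what to keep
        })
    return revisions
```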
00:43:59.160 And then finally, I just want to give you an idea of what the environment is like
00:44:01.680 at Colorado.
00:44:02.680 Because some of you may one day look for future jobs and I figure you should know something
00:44:05.840 about us.
00:44:06.840 We’re not at Boulder.
00:44:07.840 We’re at the Anschutz Medical Campus, which is kind of between the Denver
00:44:13.040 airport and Denver itself.
00:44:16.360 We are a major academic medical center.
00:44:19.240 And like I said, we’re not at Boulder, which is the thing that people most frequently get
00:44:24.480 confused about.
00:44:25.780 On July 1st of last year, we actually launched a new department of biomedical informatics.
00:44:30.980 And you know, we’re trying to hire and put together a faculty that are focused on this
00:44:34.020 idea of kind of making serendipity routine, like how do you surface the right information
00:44:37.940 at the right time.
00:44:40.220 We’re now at 31 faculty, with a new person starting in May who will get
00:44:45.260 us to 32. We have about $65 million in extramural research on which faculty in the
00:44:50.340 department are PIs, and there’s a lot of additional collaborative
00:44:55.340 funding that’s not included in this.
00:44:57.620 We have expertise across the spectrum: from precision medicine, through physiological
00:45:03.380 modeling, to folks who think about human-computer interaction, because if you want
00:45:06.060 to deploy this stuff in the clinic, you should really think about how humans are going to
00:45:08.740 interact with it, through to electrical engineering, medical imaging, and AI.
00:45:13.580 So there’s a lot of faculty working in this area.
00:45:16.940 This is just some stuff I collected from early last year
00:45:21.100 on when our faculty were mentioned in the press, so there’s
00:45:24.660 coverage in MIT Technology Review, Nature, Le Monde, and I can’t remember the name of the
00:45:32.540 German public radio station.
00:45:35.500 So you know, you have a bunch of internationally renowned experts who are at Anschutz; we just don’t
00:45:38.700 happen to be at Boulder.
00:45:39.700 You know, to me, it’s like the difference between Georgia Tech and Georgia: I feel like
00:45:44.740 they’re different institutions, and we should probably occasionally recognize that.
00:45:49.660 The other thing that we focused on when we were creating the department: if you
00:45:53.460 read the sociology literature from the 80s, which I suspect all of you do on a regular
00:45:57.060 basis, there’s an article from that literature that I actually think you should read,
00:46:02.020 called “The Mundanity of Excellence.”
00:46:04.740 So this is someone who essentially studied swimmers at many different levels and asked
00:46:08.260 what differentiates swimmers at one level from another level.
00:46:12.700 There’s a few different principles that come out, but one of the key ones is that excellence
00:46:16.260 requires qualitative differentiation.
00:46:18.620 And what I mean by that is, you know, you’re not going to move up a level in swimming competitions
00:46:22.180 by swimming two extra laps.
00:46:24.020 That’s not how it works.
00:46:25.020 What you’re going to do is you’re going to focus on your form, you’re going to, you know,
00:46:28.020 you’re going to approach the sport differently, you’re going to focus on getting rest the
00:46:30.920 night before the meet, like that’s the stuff that people do at higher levels that they
00:46:34.340 don’t do at lower levels.
00:46:36.660 And as we thought about a department, we had to ask: okay, what’s our
00:46:39.380 qualitative differentiator?
00:46:40.380 How are we not just another biomedical informatics department that just happens to
00:46:44.220 have more money or something, right?
00:46:45.580 That’s not a real differentiator.
00:46:48.460 And so what we thought about was creating promotion and tenure guidelines
00:46:53.580 that are focused on real-world impact.
00:46:55.720 So in our departmental idea of impact, there is a bullet
00:47:00.700 point that includes publication.
00:47:02.220 It’s possible, you can do it, we can care about it.
00:47:05.540 But there’s also technology development that gets deployed locally, nationally, or internationally,
00:47:10.780 software that ships, and changes in policy because of your work.
00:47:14.100 All of that is included in and counts as impact.
00:47:17.100 Now, probably not all of you will look for tenure track faculty positions in our department.
00:47:21.340 But if you were to come train or send folks to train with us, I think it’s important to
00:47:26.660 know that that filters down, right?
00:47:28.340 If that’s how faculty are evaluated, that filters all the way down.
00:47:30.940 So there’s an emphasis on real-world impact that I think can be our kind
00:47:34.340 of qualitative differentiator.
00:47:35.340 And I’ll just say, we have a really good training environment.
00:47:39.620 So we have strong connections with UCHealth.
00:47:41.100 I told you earlier about one of the UCHealth programs that we work closely with
00:47:44.660 them on.
00:47:45.660 And Children’s Hospital Colorado, which is a nationally renowned pediatric
00:47:51.220 hospital, particularly for pediatric cancer.
00:47:52.220 I know the pediatric cancer people who work in that space.
00:47:55.940 So we have those tight connections.
00:47:56.940 If you’re interested in seeing your work translate to care, or if you’re interested in
00:48:01.140 using genetics to guide care, I think we have one of the best programs in the country
00:48:03.620 through CCPM.
00:48:05.340 We have a diverse and internationally recognized faculty.
00:48:08.740 One of the things that’s sometimes a little bit surprising to folks: our tenure
00:48:12.680 track faculty in DBMI are actually majority women, which I think is uncommon in
00:48:19.620 our field.
00:48:21.380 And then if you like the climate, well, it’s a little bit humid here.
00:48:24.900 We don’t have that level of humidity, but we do have more hours of sun per year than
00:48:28.340 Miami or San Diego.
00:48:29.340 So if you’re interested in the environment around you, we’ve got that.
00:48:31.860 And then we’ve got abundant outdoor activities.
00:48:34.780 This is one example of the programs that we have.
00:48:36.580 So this is our computational science PhD program.
00:48:39.020 There’s also a postdoc training grant associated with the same thing.
00:48:44.740 So if you’re interested in this type of thing, feel free to look us up.
00:48:47.780 You can always drop me an email and I can try to connect you with folks too.
00:48:51.580 And then with that, I just want to thank the people who make this possible.
00:48:55.860 So the members of the lab: we really have a robust culture in the lab of
00:49:00.140 sharing the work that’s happening and thinking through each other’s projects
00:49:04.700 in ways that are really helpful.
00:49:06.420 We also do code review.
00:49:07.420 So people really pitch in together. Thanks also to the Department of Biomedical Informatics and
00:49:12.060 my leadership team, then the folks in CCPM, since I shared some of that work, and
00:49:15.660 then the folks who give us money. I’d be happy to take whatever questions you have.
00:49:25.860 For the radiology XAI explainability work, did they end up using Grad-CAM, like how
00:49:35.340 they visualized the convolutional layers of the network?
00:49:41.380 Yeah.
00:49:42.380 I don’t remember what strategy they used.
00:49:45.580 I think it’s a saliency map, but I’m not 100% sure.
00:49:55.580 So John posted that first as a blog post, and there’s now a PLOS Medicine
00:50:00.300 paper, I think.
00:50:04.300 And I put this together when it was the blog post and not the PLOS Medicine paper, which
00:50:07.020 is what I’d look at now.
00:50:11.020 So with the strange explainability maps, did that happen even with augmentations
00:50:19.540 that would rotate or zoom in and zoom out?
00:50:22.100 Did it still happen with those, or was the model trained without them?
00:50:26.300 Because I was thinking a zoom-in augmentation might solve that bottom-band
00:50:32.540 problem.
00:50:34.540 So I was wondering if that still shows up.
00:50:36.780 I’m guessing, so this is Andrew Ng’s stuff.
00:50:39.180 This was the one that he was criticizing.
00:50:40.980 I think this was just chest x-rays, and I don’t remember them doing at
00:50:48.820 least a patch-based augmentation or anything like that.
00:50:52.020 Yeah.
00:50:53.020 And I would say now some of the techniques that are more sophisticated are likely to
00:50:57.100 control for some of that.
00:50:58.100 I mean, the other thing you could do is some adversarial training
00:51:01.540 around the location of the scanner.
00:51:04.500 The challenge with that is you need to know to do it, and to know to do it, you have to
00:51:07.940 have someone who’s an expert probe your data.
00:51:10.220 And I think sometimes when we come to things from a computer science perspective, we
00:51:16.300 get really excited that something is working, especially if it’s
00:51:22.340 working as well as a human, and maybe we get a little bit ahead of ourselves and aren’t
00:51:31.780 skeptical enough about our own results.
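As a rough sketch of what adversarial training against the scanner or site label could look like, one common pattern is a gradient-reversal head that penalizes the shared encoder for carrying site information. This is an illustration under my own assumptions, not the approach used in the paper being discussed.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SiteInvariantClassifier(nn.Module):
    """Disease head trained normally; site head trained through gradient reversal,
    so the encoder is pushed to discard scanner/site-identifying features.
    Assumes `encoder` maps an image batch to (batch, feat_dim) features."""
    def __init__(self, encoder: nn.Module, feat_dim: int, n_sites: int, lambd: float = 1.0):
        super().__init__()
        self.encoder = encoder
        self.disease_head = nn.Linear(feat_dim, 1)     # e.g., pneumonia vs. not
        self.site_head = nn.Linear(feat_dim, n_sites)  # which hospital/scanner
        self.lambd = lambd

    def forward(self, x):
        feats = self.encoder(x)
        disease_logit = self.disease_head(feats)
        site_logits = self.site_head(GradReverse.apply(feats, self.lambd))
        return disease_logit, site_logits

# Training step (sketch): minimize disease loss plus the adversarial site loss.
# loss = bce(disease_logit, y_disease) + ce(site_logits, y_site)
# loss.backward()  # gradient reversal makes the encoder worse at predicting site
```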
00:51:36.660 So with the pharmacogenomics, the alert system that you pointed out.
00:51:41.020 So how often do you get sort of false alerts, right?
00:51:44.100 So because sometimes you can get an alert that a physician may not be really interested
00:51:52.580 in or think is valid, right?
00:51:54.940 Yeah.
00:51:55.940 So we designed this pretty carefully.
00:51:59.940 We could bump up our alert numbers by firing the alert whenever someone gets a relevant
00:52:05.580 prescription.
00:52:08.380 That’d be great for our metrics.
00:52:09.620 On the other hand, it’s not terribly useful for care, and people would learn to ignore it.
00:52:13.660 So we’re pretty focused.
00:52:15.060 Most of the alerts are non-interruptive.
00:52:17.260 The idea is an 80-20 rule:
00:52:19.500 only 20% of the alerts should be interruptive, and 80% should be non-interruptive.
00:52:24.140 And because this isn’t based on a predictive model, it’s pretty straightforward to make
00:52:29.660 sure it fires largely at times when it’s relevant,
00:52:32.220 by restricting the clinics in which it can fire and that sort of thing.
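To illustrate the flavor of that rule-based design, here is a minimal sketch with made-up drug, gene, and clinic values; none of this is the actual UCHealth alert logic or real clinical content.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup: drug -> (relevant gene, risk genotypes). Illustrative only.
PGX_RULES = {
    "clopidogrel": ("CYP2C19", {"*2/*2", "*2/*3"}),
}

ELIGIBLE_CLINICS = {"cardiology", "internal_medicine"}  # assumed restriction, for illustration

@dataclass
class Alert:
    message: str
    interruptive: bool  # interruptive alerts pop up; non-interruptive ones sit in the chart

def pgx_alert(drug: str, genotype_by_gene: dict, clinic: str) -> Optional[Alert]:
    """Fire only when the prescription, the genotype, and the clinic all make it relevant."""
    if clinic not in ELIGIBLE_CLINICS:
        return None
    rule = PGX_RULES.get(drug)
    if rule is None:
        return None
    gene, risk_genotypes = rule
    genotype = genotype_by_gene.get(gene)
    if genotype is None:
        return None  # no test result on file; nothing actionable to say
    if genotype in risk_genotypes:
        return Alert(
            message=f"{gene} {genotype}: consider an alternative to {drug}.",
            interruptive=True,   # the minority of alerts that should stop the workflow
        )
    return Alert(
        message=f"{gene} result on file; {drug} dosing appears standard.",
        interruptive=False,      # most alerts: informational, non-interruptive
    )
```

Because the rule only fires when all of the conditions line up, the interruptive fraction stays small, which is the 80-20 balance described above.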
00:52:36.180 However, one of the really nice things about UCHealth, and here I’ll go back to my advertising
00:52:39.780 pitch for why you should come work at Colorado if this is something
00:52:42.500 you’re interested in, is that UCHealth thought about this in advance.
00:52:46.660 So years ago, they built a virtual health center.
00:52:49.620 And if you want to look it up, you can look up the UCHealth virtual health center.
00:52:53.500 The person who, to my understanding, put this together and was the visionary behind
00:52:58.060 it is Rich Zane, who’s the chief innovation officer for the hospital.
00:53:03.060 And what they did there is they have nurses and clinicians who work offsite, but who are
00:53:07.780 available to look at these types of systems before they flow to people who are onsite
00:53:12.900 providing care.
00:53:14.260 And here’s where this became really useful: is anyone aware of the Epic sepsis model
00:53:19.260 thing that blew up maybe a year or two ago?
00:53:22.820 No.
00:53:23.820 So Epic is one of the major providers of electronic health record systems.
00:53:26.860 They have a sepsis model, and that sepsis model is pretty noisy.
00:53:31.300 It alerts probably more frequently than it should, and it misses cases it shouldn’t
00:53:36.380 miss.
00:53:37.420 So there was a team at CU before my time, led by Tal Bennett, that had evaluated this
00:53:43.020 model and found that it had some predictive quality, but maybe it wasn’t ready.
00:53:50.180 I want to be careful what I assert here.
00:53:52.460 It had some predictive quality, but deployed in practice at scale could have created a
00:53:57.940 lot of unnecessary burden on providers.
00:54:00.620 Well, what they did, because they have the virtual health center, is they’re able to
00:54:03.620 deploy that model plus others in the virtual health center, have it alert nurses and clinicians
00:54:11.060 there, and then have them look at it carefully in the virtual health center and only send
00:54:16.260 the notice over to the folks who are working at the bedside if it’s actually going to be
00:54:20.100 useful.
00:54:21.100 So a reason that could be good to work at Colorado, if you’re interested in kind of
00:54:23.620 predictive analytics and deploying this stuff in practice, is you can have a model that’s
00:54:27.460 not perfect, right?
00:54:28.620 It doesn’t have to be good enough to hand to a provider at the bedside.
00:54:33.820 Because of the virtual health center, you can really proof of concept it out there,
00:54:37.460 improve it, understand how you can improve its predictive quality, and then deploy it
00:54:40.060 when it’s ready.
00:54:41.060 But you can still get the benefit in the meantime.
00:54:42.660 So I guess I’d say, yeah, I think our noise level is pretty low on these alerts, but we
00:54:49.220 do have a system in place for noisier stuff if people want to deploy it.
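One way to picture that routing, as a toy sketch under assumptions of my own (this is not UCHealth’s actual system or any real model): every model alert goes to the virtual health center first, and only alerts a human reviewer confirms as useful get forwarded to the bedside team.

```python
from dataclasses import dataclass

@dataclass
class ModelAlert:
    patient_id: str
    model_name: str   # e.g., a sepsis risk model
    risk_score: float

def route_alert(alert: ModelAlert, virtual_center_review) -> str:
    """Send every model alert to the virtual health center first; only alerts the
    remote nurses and clinicians confirm as actionable reach the bedside team."""
    review = virtual_center_review(alert)   # human review offsite
    if review.get("actionable"):
        return f"forward to bedside team for patient {alert.patient_id}"
    return "log only; no bedside notification"

# Example: a noisy model can still run safely, because a human filter sits in between.
example = ModelAlert(patient_id="123", model_name="sepsis_v1", risk_score=0.62)
print(route_alert(example, lambda a: {"actionable": a.risk_score > 0.8}))
```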
00:54:57.900 It’s fun to be back in Athens.
00:54:58.900 Time to say Go Dawgs.
00:54:59.900 I don’t know what else I should say, but I’m just excited to be back.
00:55:04.220 I was really tickled when I got the invite; it was wonderful.